
PTXAS v13.0 — Reverse Engineering Reference

Purpose: reimplementation-grade documentation of NVIDIA's PTX-to-SASS assembler, recovered entirely from static analysis of the stripped x86-64 binary.

PTX (Parallel Thread Execution) is NVIDIA's virtual ISA for GPU compute. SASS (Shader Assembly) is the native machine code executed by GPU hardware. PTXAS is the binary that transforms PTX into SASS. At 37.7 MB stripped, it is a fully proprietary compiler with no LLVM code, no EDG frontend, and no third-party optimizer components. Every pass, every data structure, and every encoding table was built in-house by NVIDIA. This wiki documents its internal architecture using IDA Pro 8.x and Hex-Rays decompilation.

Version note: All addresses and binary offsets in this wiki apply to ptxas v13.0.88 (CUDA Toolkit 13.0). Other versions will have different addresses.

| Field | Value |
|---|---|
| Binary | ptxas v13.0.88, 37,741,528 bytes, x86-64, stripped |
| Build | cuda_13.0.r13.0/compiler.36424714_0 (Aug 20 2025) |
| Decompilation | 40,185 functions, IDA Pro 8.x + Hex-Rays |
| Strings | 30,632 extracted |
| Call graph | 548,693 edges |
| Version string | Cuda compilation tools, release 13.0, V13.0.88 (sub_612DE0) |
| LLVM code | None — fully proprietary compiler |
| Default target | sm_75 (Turing) |
| Supported SMs | sm_75 through sm_121f (Turing through DGX Spark) |
| Internal codenames | OCG (Optimizing Code Generator), Mercury (SASS encoder) |

Glossary

| Term | Meaning |
|---|---|
| Ori IR | PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym. |
| Mercury | The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words. Named in NVIDIA source paths and error strings. |
| OCG | Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization+codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings. |
| Fatpoint | The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers. The allocator works by computing these sets and mapping them to physical registers. |
| Opex | Operand expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields. Converts virtual register references, immediates, and address modes into the bit patterns Mercury expects. |
| Capmerc | Capsule Mercury — an ELF section (.nv.capmerc) that embeds a secondary Mercury-encoded representation of the kernel alongside the primary .text section. Used for debug metadata and binary patching support. |
| ELFW | PTXAS's custom ELF writer (sub_1C9F280, 97 KB). Not a standard library — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions. |
| EIATTR | Extended Info Attributes — per-kernel metadata encoded in .nv.info sections. Each attribute is a tag-length-value record carrying register counts, barrier usage, shared memory sizes, CRS stack depth, and other kernel properties consumed by the CUDA runtime and driver. |
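The EIATTR stream can be illustrated with a small tag-length-value walker. The record layout assumed below (u8 format code, u8 attribute tag, u16 value-or-size, optional 4-byte-aligned payload) and the two `EIFMT_*` codes are illustrative placeholders, not values recovered from this binary:

```python
import struct

# Hypothetical format codes for illustration only.
EIFMT_HVAL = 0x03   # 16-bit immediate value stored in the header word
EIFMT_SVAL = 0x04   # sized payload follows the header word

def parse_eiattr(blob):
    """Walk a .nv.info-style TLV stream and return (attr, value) pairs.

    Sketch only: the exact on-disk record layout is an assumption here.
    """
    records, off = [], 0
    while off + 4 <= len(blob):
        fmt, attr, val = struct.unpack_from('<BBH', blob, off)
        off += 4
        if fmt == EIFMT_SVAL:              # val is a byte count
            payload = blob[off:off + val]
            off += (val + 3) & ~3          # keep records 4-byte aligned
            records.append((attr, payload))
        else:                              # val is the value itself
            records.append((attr, val))
    return records
```

A consumer such as the CUDA driver would then dispatch on the attribute tag to recover register counts, shared memory sizes, and similar per-kernel properties.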

Three Subsystems

PTXAS is not a monolithic assembler. It decomposes into three largely independent subsystems with distinct coding conventions, data structures, and lineages:

1. PTX Frontend (~3 MB, 0x400000--0x5AA000) — A Flex-generated DFA scanner (sub_720F00, 64 KB, ~552 rules) feeds tokens into a Bison-generated LALR(1) parser (sub_4CE6B0, 48 KB). The parser is driven from sub_446240 (the real main, 11 KB), which orchestrates the full pipeline: parse, DAGgen, OCG, ELF, DebugInfo. The frontend also contains 1,141 instruction descriptors registered via sub_46E000 (93 KB) that define accepted type combinations for every PTX opcode, 608 CUDA runtime intrinsics registered in sub_5D1660 (46 KB), and a suite of per-instruction semantic validators (0x460000--0x4D5000) that check architecture requirements, type compatibility, and operand constraints before lowering. See PTX Parser and Entry Point & CLI.

2. Ori Optimizer (~8 MB, 0x5AA000--0xC52000) — A proprietary 159-phase optimization pipeline managed by the PhaseManager (sub_C62720). The phase factory at sub_C60D30 is a 159-case switch that allocates polymorphic phase objects from a vtable table at off_22BD5C8. Each phase has virtual methods for execute(), isNoOp(), and getName(). Major subsystems include: a fatpoint-based register allocator (sub_957160 core, sub_95DC10 driver, sub_926A30 interference graph builder), a 3-phase instruction scheduler (sub_688DD0 with ReduceReg/DynBatch modes and 9 register pressure counters), copy propagation, strength reduction, predication (if-conversion), rematerialization, and GMMA/WGMMA pipelining. The pipeline reads its default phase ordering from a 159-entry table at 0x22BEEA0. See Optimization Pipeline and Phase Manager.

3. SASS Backend (~14 MB, 0xC52000--0x1CE3000) — The Mercury encoder generates native SASS binary code. Instruction encoding is handled by ~4,000 per-variant handler functions (683 + 678 = 1,361 in the SM100 Blackwell encoding tables alone at 0xED1000--0x107B000, with additional tables for other SM generations). Each handler follows a rigid template: set opcode ID, load a 128-bit encoding format descriptor via SIMD, initialize a 10-slot register class map, register operand descriptors via sub_7BD3C0/sub_7BD650/sub_7BE090, finalize with sub_7BD260, then extract bitfields from the packed instruction word. The backend also contains 3 peephole optimizers (the PeepholeOptimizer class at 0x7A5D10 with Init, RunOnFunction, RunOnBB, RunPatterns, SpecialPatterns, ComplexPatterns, and SchedulingAwarePatterns methods), a capsule Mercury ELF embedder for debug metadata (sub_1CB53A0, section .nv.capmerc), and a custom ELF emitter (sub_1C9F280, 97 KB) that builds the final CUBIN output. See SASS Code Generation, Mercury Encoder, and Peephole Optimization.
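The fatpoint notion used by the register allocator reduces, for a straight-line block, to a backward liveness scan: each program point is annotated with the set of simultaneously live virtual registers, and the largest such set is the register pressure. A minimal sketch of the idea (illustrative only; the real allocator works over whole CFGs with its own data structures):

```python
def fatpoints(block):
    """Compute the live-in virtual-register set before each instruction
    of a straight-line block -- the 'fatpoints'. Each instruction is a
    (defs, uses) pair of register-name sets. Assumes nothing is live out."""
    live = set()
    points = []
    for defs, uses in reversed(block):
        live = (live - set(defs)) | set(uses)   # kill defs, gen uses
        points.append(frozenset(live))
    points.reverse()
    return points

# a = ...; b = f(a); c = g(a, b); use(c)
block = [
    ({'a'}, set()),
    ({'b'}, {'a'}),
    ({'c'}, {'a', 'b'}),
    (set(), {'c'}),
]
pressure = max(map(len, fatpoints(block)))  # size of the fattest point
```

Mapping each fatpoint's members to physical registers such that no two simultaneously live values share a register is then the coloring problem the allocator solves.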

Additionally, the binary embeds a custom pool allocator (sub_424070, 3,809 callers), MurmurHash3-based hash maps (sub_426150 insert / sub_426D60 lookup), a thread pool with pthread-based parallel compilation support, and a GNU Make jobserver client for integration with build systems.
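The constants recovered at the hash routine (0xcc9e2d51, 0x1b873593) identify the standard MurmurHash3 x86 32-bit variant. For comparison, a reference implementation of that public algorithm follows; whether ptxas deviates from it in seeding or tail handling has not been verified:

```python
def murmurhash3_x86_32(data, seed=0):
    """Standard MurmurHash3, x86 32-bit variant. The two multiply
    constants match those recovered in the binary."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF

    def rotl(x, r):
        return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

    nblocks = len(data) // 4
    for i in range(nblocks):                      # body: 4-byte blocks
        k = int.from_bytes(data[4 * i:4 * i + 4], 'little')
        k = (rotl((k * c1) & 0xFFFFFFFF, 15) * c2) & 0xFFFFFFFF
        h ^= k
        h = (rotl(h, 13) * 5 + 0xE6546B64) & 0xFFFFFFFF

    k = 0                                         # tail: 0-3 bytes
    tail = data[4 * nblocks:]
    if len(tail) >= 3: k ^= tail[2] << 16
    if len(tail) >= 2: k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (rotl((k * c1) & 0xFFFFFFFF, 15) * c2) & 0xFFFFFFFF
        h ^= k

    h ^= len(data)                                # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    return h ^ (h >> 16)
```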

Compilation Pipeline

Both standalone and library-mode invocations converge on the same pipeline, visible in the timing strings emitted by sub_446240:

PTX text (.ptx file or string)
  |
  +-- Flex Scanner (sub_720F00, 64KB)
  |     552-rule DFA, off_203C020 transition table
  |     Tokens: 340+ terminal symbols for Bison grammar
  |
  +-- Bison LALR(1) Parser (sub_4CE6B0, 48KB)
  |     Semantic validators: 0x460000-0x4D5000
  |     1,141 instruction descriptors via sub_46E000
  |
  +-- Ori IR Construction (DAGgen phase)
  |     Internal representation: basic blocks + instruction DAG
  |     608 CUDA runtime intrinsics (sub_5D1660)
  |
  +-- 159-Phase Optimization Pipeline (PhaseManager, sub_C62720)
  |     Phase factory: sub_C60D30 (159-case switch)
  |     Fatpoint register allocator (sub_957160)
  |     3-phase instruction scheduler (sub_688DD0)
  |     Copy propagation, CSE, strength reduction, predication,
  |     rematerialization, GMMA pipelining, late legalization
  |
  +-- Mercury SASS Encoder
  |     Instruction encoding: ~4000 per-variant handlers
  |     3 peephole optimizers (PeepholeOptimizer at 0x7A5D10)
  |     WAR hazard resolution (sub_6FC240)
  |     Operand expansion (Opex pipeline)
  |
  +-- ELF/CUBIN Output (sub_1C9F280, 97KB)
        Sections: .text, .nv.constant0, .nv.info, .symtab
        Capsule Mercury: .nv.capmerc (debug metadata)
        DWARF: .debug_line, .debug_info, .debug_frame

The driver at sub_446240 reports per-stage timing: Parse-time, CompileUnitSetup-time, DAGgen-time, OCG-time, ELF-time, DebugInfo-time, plus PeakMemoryUsage in KB. For multi-entry PTX files, each compile unit is processed independently with the header "\nCompile-unit with entry %s".

Dual Compilation Modes

PTXAS operates in two modes selected at invocation:

| | Standalone CLI | Library Mode |
|---|---|---|
| Invocation | ptxas [options] file.ptx | Called from nvcc/nvlink as a subprocess |
| Entry | main at 0x409460 | sub_9F63D0 (library/ftrace entry) |
| Real driver | sub_446240 (11 KB) | Same pipeline, alternate setup |
| Input | PTX file on disk | PTX string via --input-as-string |
| Output | .cubin / .o file | Binary blob returned to caller |
| Usage string | "Usage : %s [options] <ptx file>,...\n" | N/A |

The main function (0x409460, 84 bytes) is a thin wrapper: it stores argv[0], sets stdout/stderr to unbuffered via setvbuf, and delegates to sub_446240. The --input-as-string flag enables accepting PTX source directly as a CLI argument rather than reading from a file.

Configuration

PTXAS exposes three layers of configuration:

CLI Options (~100 flags) — Registered in sub_432A00 and parsed by sub_434320. Key options include --gpu-name (target SM), --maxrregcount (register limit), --opt-level (0--4), --verbose, --warn-on-spills, --warn-on-local-memory-usage, --fast-compile, --fdevice-time-trace (Chrome trace JSON output), --compile-as-tools-patch (sanitizer mode), and --extensible-whole-program. Help is printed by sub_403588 which calls sub_1C97640 to enumerate all registered options.

Internal Knobs (1,294 ROT13-encoded entries) — A separate configuration system implemented in generic_knobs_impl.h (source path recovered: /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h). The knob table is populated by two massive static constructors: ctor_005 at 0x40D860 (80 KB, ~2,000 general OCG knobs) and ctor_007 at 0x421290 (8 KB, 98 Mercury scheduler knobs). All knob names are ROT13-obfuscated in the binary. Examples after decoding: MercuryUseActiveThreadCollectiveInsts, MercuryTrackMultiReadsWarLatency, MercuryPresumeXblockWaitBeneficial, ScavInlineExpansion, ScavDisableSpilling. Knobs are read from environment variables and knob files via ReadKnobsFile (sub_79D070) which parses [knobs]-header INI files. Lookup is performed by GetKnobIndex (sub_79B240) with inline ROT13 decoding and case-insensitive comparison. See Knobs System.
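The ROT13-decode-plus-case-insensitive-compare lookup is easy to reproduce. The sketch below uses real decoded knob names quoted above but a hypothetical three-entry table; the actual GetKnobIndex operates over the full 1,294-entry registry:

```python
import codecs

def rot13(s):
    """ROT13 is its own inverse, so one transform both encodes and decodes."""
    return codecs.decode(s, 'rot_13')

# Decoded names from above, re-obfuscated the way they sit in .rodata.
_DECODED = [
    'MercuryUseActiveThreadCollectiveInsts',
    'MercuryTrackMultiReadsWarLatency',
    'ScavDisableSpilling',
]
OBFUSCATED = [rot13(n) for n in _DECODED]

def get_knob_index(query):
    """Sketch of the described lookup: decode each stored name inline
    and compare case-insensitively; return -1 if the knob is unknown."""
    q = query.lower()
    for idx, enc in enumerate(OBFUSCATED):
        if rot13(enc).lower() == q:
            return idx
    return -1
```

Storing the table obfuscated and decoding on every comparison costs little here, and it keeps the plaintext knob names out of a naive `strings` dump — consistent with why none of the 1,294 names appear in the string extraction undecoded.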

SM Profile Tables — Per-architecture capability maps initialized by sub_607DB0 (14 KB) which creates 7 hash maps indexing sm_XX / compute_XX strings to handler functions. Profile objects are constructed by sub_6765E0 (54 KB) with architecture-to-family mappings (sm_75 -> Turing, sm_80/86/87/88 -> Ampere, sm_89 -> Ada Lovelace, sm_90/90a -> Hopper, sm_100/100a/100f -> Blackwell, sm_103/103a/103f -> Blackwell Ultra, sm_110/110a/110f -> Jetson Thor, sm_120/120a/120f -> RTX 50xx, sm_121/121a/121f -> DGX Spark). See SM Architecture Map.

Reading This Wiki

The wiki is organized around the compilation pipeline. Every page is written at reimplementation-grade depth for an audience of senior C++ developers with GPU compiler experience.

Section Index

Overview

Compilation Pipeline

Ori IR — Internal Representation

Optimization Passes

Register Allocation

Instruction Scheduling

SASS Code Generation

GPU Architecture Targets

CUDA Intrinsics

ELF/Cubin Output

Configuration

Infrastructure

Reference

Reading Path 1: End-to-End Pipeline Understanding

Goal: understand how PTX text becomes SASS binary, what each stage does, and how control flows between subsystems.

  1. Pipeline Overview — The complete flow diagram. Establishes all stages and their address ranges.
  2. Entry Point & CLI — How ptxas is invoked, the ~100 CLI flags, and the sub_446240 driver function.
  3. PTX Parser — The Flex scanner and Bison parser. How PTX text becomes an internal parse tree.
  4. PTX-to-Ori Lowering — How the parse tree is lowered to Ori IR (basic blocks + instruction DAG).
  5. Optimization Pipeline — The 159-phase PhaseManager. Phase factory, ordering, timing infrastructure.
  6. SASS Code Generation — Mercury encoder, instruction selection, operand expansion, peephole.
  7. ELF/Cubin Output — Custom ELF emitter, section layout, DWARF debug info, capsule Mercury.

Reading Path 2: Reimplementing a Specific Pass

Goal: reproduce the exact behavior of one optimization phase deeply enough to write a compatible replacement.

  1. Pass Inventory & Ordering — Locate the phase in the 159-entry table. Note its index, vtable address, and pipeline position.
  2. The phase's dedicated page (e.g., Copy Propagation & CSE, Predication). Every dedicated page contains the function address, decompiled algorithm, data flow, and controlling knobs.
  3. Knobs System — Find which ROT13 knobs control the phase's behavior (enable/disable toggles, thresholds).
  4. Ori IR Overview — Understand the IR data structures the phase operates on.
  5. Register Model — The R/UR/P/UP register classes and their constraints.
  6. Function Map — Cross-reference internal function addresses with the master function map.

Reading Path 3: Debugging Correctness

Goal: diagnose a miscompilation, crash, or incorrect SASS output by tracing the problem to a specific phase.

  1. DUMPIR & NamedPhases — How to dump IR at specific pipeline points. Use DUMPIR to observe the IR before and after each phase.
  2. Optimization Levels — Compare phase pipelines at different O-levels. If a bug appears at -O2 but not -O1, the diff identifies suspect phases.
  3. Pipeline Overview — The pipeline is linear: Parse -> DAGgen -> OCG (159 phases) -> Mercury -> ELF. The stage where output first goes wrong narrows the search.
  4. Knobs System — Check whether the suspect phase has enable/disable knobs. Toggle them to confirm or rule out the phase.
  5. Instruction Scheduling and Scoreboards & Dependency Barriers — If the generated SASS hangs or produces wrong results under specific warp configurations, the scheduler or barrier insertion may be at fault.

Reading Path 4: Tuning Performance

Goal: understand what ptxas does at each optimization level and what knobs control aggressiveness.

  1. Optimization Levels — The O-level to phase mapping, including --fast-compile tiers.
  2. Knobs System — The 1,294 ROT13-encoded internal tuning parameters. The primary mechanism for fine-grained control.
  3. Register Allocation — The fatpoint allocator directly determines register count, which determines maximum occupancy.
  4. Instruction Scheduling — The scheduler's ReduceReg and DynBatch modes, WAR hazard resolution, and interaction with register pressure.
  5. Peephole Optimization — The 3 peephole dispatchers that perform late SASS-level rewrites.
  6. SM Architecture Map — Per-SM feature gates that influence code generation decisions.

Function Map

Binary: ptxas v13.0.88, 37.7 MB stripped ELF, ~40,000 functions
Documented: 2,063 unique functions across 70 wiki pages
This page: Top ~100 most cross-referenced functions, plus routing tables
Complete listings: Each wiki page has its own Function Map section with full details

This page is the central lookup index for identified functions in ptxas. It lists the functions that appear most frequently across the wiki (cross-cutting infrastructure and major entry points), and provides routing tables to find any function by address range or subsystem.

Confidence levels: CERTAIN = named in symbols or strings. HIGH = strong evidence from strings and call patterns (>90%). MEDIUM = structural analysis with partial string evidence (70-90%).


Core Infrastructure

These functions appear in 10+ wiki pages -- they are the universal building blocks called by nearly every subsystem.

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x424070 | pool_alloc(pool, size) | 19 | 3,809 | Custom slab allocator, 8-byte aligned |
| 0x4248B0 | pool_free(ptr) | 8 | 1,215 | Coalescing free, boundary tags |
| 0x4280C0 | get_thread_local_context | 10 | 3,928 | Most-called function in ptxas; 280-byte TLS struct |
| 0x42BDB0 | fatal_OOM_handler | 8 | 3,825 | Called on every allocation failure |
| 0x426150 | hashmap_put(map, key, value) | 11 | 2,800 | Open-addressing + chaining, auto-resize |
| 0x426D60 | hashmap_get(map, key) | 11 | 422 | Returns value or 0 |
| 0x425CA0 | hashmap_create(hash_fn, cmp_fn, cap) | 7 | 127 | Integer/pointer/custom hash modes |
| 0x427630 | murmurhash3_x86_32(str) | 5 | 73 | Constants: 0xcc9e2d51, 0x1b873593 |
| 0x42D850 | hashset_insert(set, key) | 4 | 282 | Hash set variant |
| 0x42FBA0 | diagnostic_emit(desc, loc, fmt...) | 7 | 2,350 | Central error/warning reporter |
| 0x42F590 | fatal_internal_error(desc, ...) | 8 | 3,825 | Assertion handler |
| 0x4279D0 | starts_with(str, prefix) | 4 | 185 | Returns suffix pointer or 0 |
| 0x42CA60 | list_push_front(node, head_ptr) | 4 | 298 | Pool-allocated linked list |
| 0xBDBA60 | bitvector_allocate | 8 | many | (bits+31)>>5 word count |
| 0xBDCDE0 | bitvector_or_assign (SSE2) | 5 | many | _mm_or_si128 on 128-bit chunks |

Details: Memory Pools, Hash & Bitvector, Threading
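The bitvector helpers in the table follow a conventional packed layout: (bits+31)>>5 32-bit words, with OR-assign done word-at-a-time (vectorized four words per iteration via `_mm_or_si128` in the binary). A scalar sketch, with Python lists standing in for the pool-allocated word arrays:

```python
def bitvector_allocate(nbits):
    """Allocate (nbits + 31) >> 5 words, mirroring the recovered formula."""
    return [0] * ((nbits + 31) >> 5)

def bv_set(bv, i):
    bv[i >> 5] |= 1 << (i & 31)       # word index = i/32, bit = i%32

def bv_test(bv, i):
    return (bv[i >> 5] >> (i & 31)) & 1

def bitvector_or_assign(dst, src):
    """dst |= src, word at a time -- the scalar form of the SSE2 loop."""
    for w in range(len(dst)):
        dst[w] |= src[w]
```

Liveness analysis and the interference graph builder both consume this representation, which is why the OR-assign is hot enough to warrant the SSE2 path.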


Compilation Driver & CLI

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x409460 | main | 5 | 1 | Delegates to 0x446240 |
| 0x446240 | real_main (top-level driver) | 13 | 1 | Orchestrates entire pipeline |
| 0x4428E0 | ptx_input_setup | 6 | 1 | Version/target validation |
| 0x43CC70 | per_entry_compile_unit | 5 | 1 | Processes each entry through pipeline |
| 0x43F400 | function_abi_config | 4 | 1 | Parameter regs, return addr, scratch |
| 0x43A400 | compilation_target_config | 7 | 1 | SM-specific defaults |
| 0x43B660 | register_constraint_calculator | 5 | 1 | Balances .maxnreg, occupancy |
| 0x432A00 | option_registration | 9 | 1 | CLI option definitions |
| 0x434320 | option_parser | 9 | 1 | Validates combinations, applies state |

Details: Pipeline Entry, Pipeline Overview, CLI Options


PTX Front End

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x46E000 | instruction_table_builder | 9 | 1 | 93 KB, 1168 callees, one per PTX opcode |
| 0x451730 | parser_setup (special register init) | 9 | 1 | %ntid, %laneid, %clock, etc. |
| 0x4CE6B0 | bison_parser (directive/decl) | 7 | 1 | .local_maxnreg, .alias, .pragma |
| 0x720F00 | flex_lexer (ptxlex / yylex) | 8 | 2 | ~550 Flex rules, DFA scanner |
| 0x4B2F20 | ptx_validator_general | 4 | 1 | Validates texture, surface, cvt, call |
| 0x4C5FB0 | ptx_validator_mma_wmma_tcgen05 | 4 | 1 | MMA, WMMA, tensor core validation |
| 0x71F630 | preprocessor_dispatch | 4 | 1 | .MACRO, .ELSE, .INCLUDE |
| 0x489050 | ptx_to_ori_converter | 5 | 1 | PTX AST to ORI IR translation |

Details: PTX Parser, PTX Directives, PTX to ORI


Static Initialization

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x4094C0 | ctor_001 -- thread infra init | 4 | 0 | pthread_key_create, mutex |
| 0x4095D0 | ctor_003 -- PTX opcode name table | 6 | 0 | ~900 ROT13-encoded PTX mnemonics |
| 0x40D860 | ctor_005 -- tuning knob registry | 6 | 0 | 80 KB, 2000+ ROT13 knob names |
| 0x421290 | ctor_007 -- scheduler knob registry | 4 | 0 | 98 ROT13 scheduler knobs |

Details: Pipeline Entry, Binary Layout


Phase Manager & Optimization Framework

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0xC60D30 | phase_factory (159-case switch) | 12 | 1 | Allocates phase objects |
| 0xC62720 | PhaseManager_ctor | 10 | 2 | 159-entry phase table |
| 0xC64F70 | phase_dispatch_loop | 5 | 2 | Executes phases, reports timing |
| 0xC64310 | per_phase_timing_reporter | 5 | 1 | "[Total N KB] [Freeable N KB]" |
| 0xC641D0 | phase_name_to_index_lookup | 5 | 3 | Binary search, case-insensitive |
| 0x7DDB50 | phase_run_dispatch | 14 | many | Vtable-based phase execution |
| 0x9F4040 | NamedPhases_parse_and_build | 6 | 1 | "shuffle", "OriCopyProp", etc. |
| 0x798B60 | NamedPhases_parser | 4 | 2 | PTXAS_DISABLE env var parsing |
| 0x799250 | IsPassDisabled | 5 | 4 | Checks knob index 185 |
| 0xA36360 | pass_sequence_builder | 6 | 1 | Constructs NvOptRecipe pass list |

Details: Phase Manager, Pass Inventory, Optimizer Pipeline
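The factory-plus-ordering-table design implied by these functions can be sketched in a few lines. Everything below is illustrative: the phase names, IDs, and classes are stand-ins for the 159 C++ phase objects allocated by the real factory switch, whose virtual interface (execute/isNoOp/getName) the sketch mirrors:

```python
class Phase:
    """Stand-in for the polymorphic phase base class."""
    def execute(self, ir):
        raise NotImplementedError
    def is_noop(self, ir):
        return False
    def get_name(self):
        return type(self).__name__

class CopyPropagation(Phase):
    def execute(self, ir):
        return ir            # placeholder body

class DeadCodeElim(Phase):
    def execute(self, ir):
        return ir            # placeholder body

# Analogue of the 159-case factory switch: phase ID -> class.
_PHASE_CLASSES = {0: CopyPropagation, 1: DeadCodeElim}   # ... up to 158

def phase_factory(phase_id):
    cls = _PHASE_CLASSES.get(phase_id)
    if cls is None:
        raise ValueError("unknown phase id %d" % phase_id)
    return cls()

def run_pipeline(ir, ordering):
    """Walk a fixed ordering table, constructing and running each phase --
    the role played by the PhaseManager's dispatch loop."""
    for pid in ordering:
        phase = phase_factory(pid)
        if not phase.is_noop(ir):
            ir = phase.execute(ir)
    return ir
```

Keeping the ordering as a plain data table (rather than hard-coded calls) is what lets named-phase filtering and per-O-level pipelines reuse the same factory.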


ORI IR & Instruction Access

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x9253C0 | instruction_operand_get | 11 | many | Operand accessor on ORI instructions |
| 0x7E6090 | instruction_modifier_set | 10 | many | IR modification helper |
| 0x781F80 | instruction_iterator | 12 | many | Doubly-linked list traversal |
| 0x7DF3A0 | instruction_property_query | 5 | many | Instruction flag/attribute checker |
| 0x91BF30 | register_type_query | 8 | many | Register class/type inspection |
| 0x9314F0 | register_class_id_query | 7 | 1,547 | Most-called non-trivial regalloc fn |
| 0x931920 | register_class_compat_checker | 6 | 328 | Pair register class handling |
| 0x934630 | register_id_packer | 9 | 856 | Packs reg#/class/type into 32-bit |
| 0xB28E00 | ir_node_type_query | 5 | many | Node kind discrimination |
| 0xB28E90 | ir_node_field_accessor | 6 | many | Generic field getter |
| 0xA50650 | CodeObject_EmitRecords | 1 | 8 | 74 KB, ORI record serializer (56 section types) |
| 0xA53840 | EmitRecords_wrapper | 1 | 1 | Thin wrapper, adds type-44 header |

Details: Instructions, Registers, Data Structures, CFG


Intrinsic Infrastructure

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x5D1660 | intrinsic_table_register (608 entries) | 7 | 1 | Master name-to-ID table |
| 0x5D4190 | intrinsic_dispatch_builder | 13 | 1 | PTX opcode -> codegen handler mapping |
| 0x5FF700 | intrinsic_prototype_emitter | 5 | 1 | 354 KB -- largest function in binary |
| 0x5C7A50 | wmma_mma_codegen | 4 | 1 | 173 KB, all shapes/types/layouts |
| 0x5C10A0 | mma_codegen (mma.sync) | 4 | 1 | 120 KB, m8n8k4 through m16n8k256 |
| 0x5BBC30 | tcgen05_mma_codegen (Blackwell) | 5 | 1 | 90 KB, 5th-gen tensor core |
| 0x70FA00 | ocg_intrinsic_handler | 8 | 1 | OCG-level intrinsic routing |
| 0x6A97B0 | intrinsic_lowering_main | 4 | 1 | 26 KB, switch-based lowering |
| 0x6C9EB0 | ocg_builtin_name_lookup | 5 | 1 | Blackwell+ OCG name table |

Details: Intrinsics Index, Math Intrinsics, Tensor Intrinsics, Sync & Warp


Register Allocator

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x9721C0 | regalloc_entry ("REGALLOC GUIDANCE") | 6 | 1 | Top-level allocator entry |
| 0x957160 | fatpoint_allocator_core | 7 | 1 | Core fatpoint graph coloring |
| 0x96D940 | spill_guidance_engine | 5 | 1 | Determines spill strategy |
| 0x971A90 | full_alloc_with_spill_retry | 4 | 1 | "NOSPILL REGALLOC" path |
| 0x9714E0 | regalloc_failure_reporter | 6 | 1 | "Register allocation failed..." |
| 0x926A30 | interference_graph_builder | 9 | 7 | 22 KB, SSE bitvectors |
| 0x92C240 | liveness_bitvector_ops | 5 | 87 | Set/clear/query with aliasing |
| 0x917A60 | opcode_to_regclass_mapping | 4 | 221 | Massive switch |
| 0x910840 | ConvertMemoryToRegisterOrUniform | 5 | 1 | Pass driver |

Details: RegAlloc Overview, RegAlloc Algorithm, Spilling, ABI


Instruction Scheduling

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x8D0640 | ScheduleInstructions (top-level) | 7 | 1 | String: "ScheduleInstructions" |
| 0x688DD0 | scheduler_engine (main BB loop) | 5 | 1 | ReduceReg / DynBatch selection |
| 0x8C9320 | scheduling_priority_function | 4 | 0 | ~300 locals, core heuristic |
| 0x68B9C0 | dependency_graph_builder | 4 | 1 | RAW/WAR/WAW hazard analysis |
| 0x6820B0 | build_ready_list | 5 | 1 | Zero-dependency instructions |
| 0x8CD6E0 | reverse_scheduling_driver | 4 | 1 | Reverse post-order iteration |
| 0x8CEE80 | register_budget_with_occupancy | 4 | 1 | Pressure coeff default 0.045 |
| 0x8E4400 | hw_profile_table_init | 6 | 3 | Encoding/latency property tables |
| 0xA9CDE0 | scheduling_metadata_builder | 6 | 1 | Per-instruction sched metadata |
| 0xA9CF90 | scheduling_metadata_accessor | 5 | many | Sched metadata field queries |
| 0xAED3C0 | scheduling_optimization_mega_pass | 4 | 0 | 137 KB, ~560 locals, largest vtable pass |

Details: Scheduling Overview, Scheduling Algorithm, Latency Model, Scoreboards


Codegen & ISel

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x169B190 | isel_pattern_dispatch (master) | 5 | 1 | 280 KB, 65,999 insns -- largest function |
| 0x143C440 | sm120_peephole_dispatch | 4 | 1 | SM120 (RTX 50), 373-case switch |
| 0x198BCD0 | sm100_peephole_dispatch | 4 | 1 | SM100 (Blackwell), 1336 callees |
| 0x83EF00 | main_peephole_pass | 6 | 0 | 29 KB, 392 callees |
| 0x6D9690 | master_instruction_encoder | 7 | 1 | 94 KB, opcode switch |
| 0x6E4110 | sass_codegen_main | 4 | 1 | EmitSASSForFunction, FNV-1a BB hash |
| 0x6F52F0 | SASS_pipeline_run_stages | 5 | 1 | Mercury SASS compilation pipeline |
| 0x9ED2D0 | MercConverter_entry | 6 | 1 | ORI to Mercury IR conversion |
| 0x9F1A90 | MercConverter_builder | 6 | 1 | Mercury instruction construction |

Details: ISel, Encoding, Peephole, Mercury, Templates


Bitfield Encoding

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x7B9B80 | bitfield_insert(insn, off, wid, val) | 9 | 18,347 | Most-called by caller count |
| 0x7BC030 | encode_register_operand | 4 | 6,147 | 1-bit + 4-bit type + 10-bit reg |
| 0x7B9D60 | encode_reuse_flags_predicate | 4 | 2,408 | 1-bit reuse + 5-bit predicate |
| 0x7BC5C0 | encode_immediate_const_operand | 4 | 1,449 | Const buffer index or immediate |
| 0x7BCF00 | encode_predicate_register | 4 | 1,657 | PT=14, 2-bit type + 3-bit condition |
| 0x10B6180 | 1_bit_boolean_encoder | 3 | 8,091 | .S/.U, .STRONG, etc. |
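The recovered signature bitfield_insert(insn, off, wid, val) implies plain mask-and-OR arithmetic on the 128-bit instruction word. A sketch using Python's arbitrary-precision integers to model the word; the field positions in the usage example are invented for illustration, not recovered encodings:

```python
MASK128 = (1 << 128) - 1

def bitfield_insert(insn, off, wid, val):
    """Insert the low `wid` bits of `val` at bit offset `off` of a
    128-bit instruction word, clearing the field first."""
    mask = ((1 << wid) - 1) << off
    return (insn & ~mask & MASK128) | ((val << off) & mask)

def bitfield_extract(insn, off, wid):
    """Read a `wid`-bit field back out of the packed word."""
    return (insn >> off) & ((1 << wid) - 1)

# Illustrative only: placing a 10-bit register number and a 5-bit
# predicate field at made-up offsets.
word = 0
word = bitfield_insert(word, 24, 10, 0x2A)   # destination register R42
word = bitfield_insert(word, 12, 5, 7)       # predicate P7
```

With ~18,000 call sites, each per-variant Mercury handler is essentially a sequence of such inserts driven by its 128-bit format descriptor.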

Details: Encoding, SASS Printing


ELF / CUBIN Output

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x612DE0 | section_attr_builder | 11 | 1 | 76 KB, ELF section/attribute config |
| 0x1C9F280 | master_elf_emitter | 9 | 1 | Complete CUBIN assembly |
| 0x1CB53A0 | elf_world_init | 7 | 1 | 672-byte ELFW context |
| 0x1CB68D0 | symbol_table_builder | 5 | 1 | .symtab from internal symbols |
| 0x1CABD60 | master_section_allocator | 5 | 1 | Shared/const/local memory |
| 0x1CB3570 | add_function_section | 5 | 44 | Creates .text.FUNCNAME + .rela |
| 0x1CD48C0 | relocation_processor | 5 | 1 | Relocation section emission |
| 0x1C9B110 | mercury_capsule_builder | 4 | 1 | Creates embedded .nv.merc ELF |

Details: ELF Emitter, Sections, Relocations, Debug Info, Capsule Mercury


Knobs System

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x79B240 | GetKnobIndex | 6 | 2 | ROT13 name lookup, case-insensitive |
| 0x79D070 | ReadKnobsFile | 5 | 1 | Parses [knobs] section from file |
| 0x79F540 | ParseKnobValue | 4 | 1 | 12-type switch: bool/int/float/string/... |
| 0x79D990 | ProcessKnobs (top-level) | 4 | 1 | File + pragma + numbered config |
| 0xA0F020 | knob_conditional_evaluator | 5 | many | [WHEN condition] handler |

Details: Knobs, Opt Levels


Target-Specific Code

| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
| 0x6765E0 | target_profile_selector | 7 | 1 | SM-dependent profile dispatch |
| 0x607DB0 | target_feature_query | 7 | many | SM feature capability checks |
| 0x896D50 | sass_mnemonic_table_init (ROT13) | 4 | 1 | ~400+ SASS instruction names |
| 0x89FBA0 | instruction_latency_init | 4 | 3 | Encoding/latency property tables |

Details: Targets Index, Turing-Ampere, Ada-Hopper, Blackwell, tcgen05


Subsystem Routing Table

To find a specific function, locate it by address range or subsystem topic in this table. Each page contains a detailed Function Map section with complete listings.

By Subsystem Topic

| Subsystem | Primary Pages | Functions |
|---|---|---|
| Memory allocator, pools | memory-pools.md | 30 |
| Hash maps, bitvectors, sets | hash-bitvector.md | 51 |
| Threading, TLS, jobserver | threading.md | 41 |
| CLI parsing, option handling | cli-options.md | 17 |
| Tuning knobs (2000+ knobs) | knobs.md | 56 |
| Optimization levels | opt-levels.md | 14 |
| DumpIR debug output | dumpir.md | 14 |
| Compilation pipeline | overview.md, entry.md | 56+25 |
| PTX lexer & parser | ptx-parser.md | 75 |
| PTX directives | ptx-directives.md | 41 |
| PTX-to-ORI translation | ptx-to-ori.md | 41 |
| Optimizer pipeline | optimizer.md | 28 |
| ORI instruction IR | instructions.md | 80 |
| CFG construction | cfg.md | 18 |
| Register representation | registers.md | 40 |
| IR data structures | data-structures.md | 74 |
| Phase manager (159 phases) | phase-manager.md | 26 |
| Copy propagation, CSE, GVN | copy-prop-cse.md | 65 |
| General optimization passes | general-optimize.md | 71 |
| Loop optimization (unroll, LICM, SWP) | loop-passes.md | 92 |
| Branch/switch optimization | branch-switch.md | 24 |
| Strength reduction | strength-reduction.md | 25 |
| Predication | predication.md | 28 |
| Rematerialization | rematerialization.md | 55 |
| Liveness analysis | liveness.md | 42 |
| Sync barriers | sync-barriers.md | 66 |
| Late legalization | late-legalization.md | 59 |
| Hot/cold splitting | hot-cold.md | 10 |
| GMMA pipelining | gmma-pipeline.md | 47 |
| Uniform registers | uniform-regs.md | 22 |
| Register allocator core | algorithm.md | 50 |
| Spilling | spilling.md | 54 |
| ABI handling | abi.md | 87 |
| Scheduling overview | overview.md | 112 |
| Scheduling algorithm | algorithm.md | 121 |
| Latency model & HW profiles | latency-model.md | 78 |
| Scoreboards & barriers | scoreboards.md | 56 |
| ISel pattern matching | isel.md | 182 |
| SASS encoding | encoding.md | 92 |
| Peephole optimization | peephole.md | 67 |
| Mercury IR conversion | mercury.md | 79 |
| SASS templates | templates.md | 46 |
| SASS printing / renderer | sass-printing.md | 96 |
| Capsule Mercury | capmerc.md | 20 |
| Intrinsic infrastructure | index.md | 159 |
| Math intrinsics | math.md | 42 |
| Tensor core intrinsics | tensor.md | 45 |
| Sync & warp intrinsics | sync-warp.md | 65 |
| SM targets & features | index.md | 70 |
| ELF emitter | elf-emitter.md | 29 |
| ELF sections | sections.md | 33 |
| Debug info (DWARF) | debug-info.md | 33 |
| Relocations | relocations.md | 19 |

By Address Range

Functions in the binary are clustered by subsystem. This table maps address ranges to the pages that document them.

| Address Range | Primary Subsystem | Key Pages |
|---|---|---|
| 0x400000-0x424000 | Entry, static init, main | entry.md, binary-layout.md |
| 0x424000-0x42E000 | Memory pools, hash maps, lists | memory-pools.md, hash-bitvector.md |
| 0x42E000-0x446000 | Diagnostics, CLI parsing | cli-options.md, entry.md |
| 0x446000-0x452000 | Compilation driver | overview.md, entry.md |
| 0x452000-0x4D5000 | PTX parser & validator | ptx-parser.md, ptx-directives.md |
| 0x4D5000-0x5AA000 | PTX-to-ORI, early IR | ptx-to-ori.md, instructions.md |
| 0x5AA000-0x612000 | Intrinsic infrastructure | index.md, math.md, tensor.md |
| 0x612000-0x67F000 | Section builder, target config | sections.md, index.md |
| 0x67F000-0x6E4000 | Scheduling engine, OCG lowering, encoding | overview.md, encoding.md |
| 0x6E4000-0x754000 | SASS codegen, SASS pipeline | mercury.md, overview.md |
| 0x754000-0x7C0000 | Liveness, knobs, bitfield encoding | liveness.md, knobs.md, encoding.md |
| 0x7C0000-0x8FE000 | Peephole, SASS mnemonics, scheduling upper | peephole.md, algorithm.md |
| 0x8FE000-0x9D3000 | Register allocator | overview.md, algorithm.md, abi.md |
| 0x9D3000-0xAA8000 | Post-regalloc, named phases, remat | rematerialization.md, phase-manager.md |
| 0xAA8000-0xC52000 | Mega-passes, sync barriers, dataflow | sync-barriers.md, general-optimize.md |
| 0xC52000-0xD27000 | Phase manager, phase factory | phase-manager.md, optimizer.md |
| 0xD27000-0x10B7000 | 592 SASS encoder bodies | encoding.md, isel.md |
| 0x10B7000-0x1225000 | Field encoders, ISel helpers | encoding.md, isel.md |
| 0x1225000-0x13CF000 | Bitvector, ISel coordinators | hash-bitvector.md, isel.md |
| 0x13CF000-0x17F8000 | SM-specific ISel, pattern matchers, templates | isel.md, templates.md |
| 0x17F8000-0x1C21000 | SASS printing, peephole mega-dispatchers | sass-printing.md, peephole.md |
| 0x1C21000-0x1CE3000 | ELF emitter, capsule mercury, relocations | elf-emitter.md, capmerc.md |

Statistics

Top 10 Most-Called Functions

| Rank | Address | Identity | Callers |
|---|---|---|---|
| 1 | 0x7B9B80 | bitfield_insert | 18,347 |
| 2 | 0x10B6180 | 1-bit boolean encoder | 8,091 |
| 3 | 0x7BC030 | encode_register_operand | 6,147 |
| 4 | 0x4280C0 | get_thread_local_context | 3,928 |
| 5 | 0x42BDB0 | fatal_OOM_handler | 3,825 |
| 6 | 0x424070 | pool_alloc | 3,809 |
| 7 | 0x426150 | hashmap_put | 2,800 |
| 8 | 0x7B9D30 | clear_const_buffer_slots | 2,408 |
| 9 | 0x7B9D60 | encode_reuse_flags_predicate | 2,408 |
| 10 | 0x42FBA0 | diagnostic_emit | 2,350 |

Top 5 Largest Functions

| Rank | Address | Identity | Size |
|---|---|---|---|
| 1 | 0x5FF700 | intrinsic_prototype_emitter | 354 KB |
| 2 | 0x169B190 | isel_pattern_dispatch | 280 KB |
| 3 | 0x198BCD0 | sm100_peephole_dispatch | 233 KB |
| 4 | 0x143C440 | sm120_peephole_dispatch | 233 KB |
| 5 | 0x5C7A50 | wmma_mma_codegen | 173 KB |

Top 10 Most Cross-Referenced (by wiki page count)

| Rank | Address | Identity | Pages |
|---|---|---|---|
| 1 | 0x424070 | pool_alloc | 19 |
| 2 | 0x7DDB50 | phase_run_dispatch | 14 |
| 3 | 0x446240 | real_main | 13 |
| 3 | 0x5D4190 | intrinsic_dispatch_builder | 13 |
| 5 | 0x781F80 | instruction_iterator | 12 |
| 5 | 0xC60D30 | phase_factory | 12 |
| 7 | 0x9253C0 | instruction_operand_get | 11 |
| 7 | 0x612DE0 | section_attr_builder | 11 |
| 7 | 0x426150 | hashmap_put | 11 |
| 7 | 0x426D60 | hashmap_get | 11 |

Documentation Coverage

| Metric | Count |
|---|---|
| Total unique functions documented | 2,063 |
| Wiki pages with function maps | 70 |
| Functions in 5+ pages (high cross-reference) | 89 |
| Functions in 1 page only (subsystem-internal) | 1,324 |
| Confidence CERTAIN | ~40 |
| Confidence HIGH | ~1,400 |
| Confidence MEDIUM | ~620 |

Binary Layout

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

PTXAS v13.0.88 is a 37,741,528-byte stripped x86-64 ELF executable. Its .text section spans 26.2 MB (0x403520--0x1CE2DE2) containing 40,185 functions. This page maps every byte of the binary to the subsystem that owns it, derived from all 40 sweep reports covering the complete address range.

ELF Section Map

| Section | Address | Size | Notes |
|---|---|---|---|
| .plt | 0x402C00 | 2,336 B (146 stubs) | Procedure linkage table for libc/libpthread imports |
| .text | 0x403520 | 26,212,546 B (26.2 MB) | All executable code -- 40,185 functions |
| .rodata | 0x1CE2E00 | 7,508,368 B (7.5 MB) | Read-only data: encoding tables, strings, DFA tables |
| .eh_frame_hdr | 0x240BF90 | 358,460 B (350 KB) | Exception handling frame index |
| .eh_frame | 0x2664A60 | 3,751,640 B (3.7 MB) | Unwinding data for 40K functions |
| .gcc_except_table | 0x29F8938 | 940 B | C++ exception filter tables |
| .ctors | 0x29F8CE8 | 104 B (12 entries) | Static constructor table |
| .data.rel.ro | 0x29F8D60 | 4,256 B | Vtable pointers, resolved at load time |
| .got.plt | 0x29FA000 | 1,184 B (148 entries) | Global offset table for PLT |
| .data | 0x29FA4A0 | 14,032 B (13.7 KB) | Initialized globals: function pointers, defaults |
| .bss | 0x29FDB80 | 85,864 B (83.9 KB) | Zero-init globals: knob tables, TLS keys, mutexes |

Total file composition:

| Component | Size | Percentage |
|---|---|---|
| .text | 26.2 MB | 69.4% |
| .rodata | 7.5 MB | 19.9% |
| .eh_frame + .eh_frame_hdr | 4.0 MB | 10.7% |
| .data + .bss + other | 0.1 MB | 0.3% |

Program Headers

| Segment | VirtAddr | MemSiz | Flags | Contents |
|---|---|---|---|---|
| LOAD 0 | 0x400000 | 32.4 MB | R E | .text + .rodata + headers + .eh_frame_hdr |
| LOAD 1 | 0x2664A60 | 3.7 MB | RW | .eh_frame + .data + .bss + .got |
| GNU_RELRO | 0x2664A60 | 3.6 MB | R | Read-only after relocation (.eh_frame through .data.rel.ro) |
| GNU_EH_FRAME | 0x240BF90 | 350 KB | R | Exception handling index |
| GNU_STACK | 0x0 | 0 | RW | Non-executable stack |

Entry point: 0x42333C (ELF e_entry), which is inside .text (the CRT startup stub _start). The actual main is at 0x409460.

Three Subsystems

The .text section decomposes into three subsystems with distinct coding styles, data structures, and origins:

  .text linear address map (26.2 MB)
  0x403520                 0x67F000        0xC52000                          0x1CE2DE2
  |--- PTX Frontend 2.9 MB ---|-- Ori Optimizer 5.8 MB --|---- SASS Backend 17.6 MB ----|
  |          11%               |          22%             |              67%              |
  |  parsers, validators,      | passes, regalloc,        | encoding handlers, ISel,      |
  |  intrinsics, formatters    | scheduling, CFG analysis  | peephole, codecs, ABI, ELF    |
| Subsystem | Address Range | Size | Functions | Share | Avg Fn Size | Largest Function |
|---|---|---|---|---|---|---|
| PTX Frontend | 0x403520--0x67F000 | 2.9 MB | ~2,592 | 11% | ~1,170 B | sub_46E000 (93 KB, opcode table builder) |
| Ori Optimizer | 0x67F000--0xC52000 | 5.8 MB | ~11,001 | 22% | ~550 B | sub_926A30 (155 KB decomp, interference graph) |
| SASS Backend | 0xC52000--0x1CE2DE2 | 17.6 MB | ~26,592 | 67% | ~690 B | sub_169B190 (280 KB, master ISel dispatch) |

The backend dominates the binary because SASS instruction encoding is template-generated code: each of the ~4,000 encoding handler functions is a standalone vtable entry, never called directly. The optimizer has the highest function density (many small pass helpers), while the frontend has the largest average function size (complex validators and parsers).

Complete .text Address Map

The table below maps every address range in the .text section to its subsystem, function count, and key entry points. Data is aggregated from the 30 sweep partitions (p1.01 through p1.30).

PTX Frontend (0x403520--0x67F000, 2.9 MB)

Note on the 0x400000--0x403520 gap. The LOAD segment begins at 0x400000, but the first 13.6 KB before .text contains the ELF header (64 B at 0x400000), program headers (7 entries, 392 B), .interp (28 B, path to ld-linux-x86-64.so.2), .hash / .gnu.hash (symbol hash tables), .dynsym / .dynstr (dynamic symbol table, 146 entries), .gnu.version / .gnu.version_r (symbol versioning), .rela.plt (PLT relocations, 146 entries), and the .plt stub table (2,336 B, 146 stubs at 0x402C00--0x403520). These are standard ELF infrastructure, not ptxas application code. The first ptxas function begins at 0x403520.

| Address Range | Size | Functions | Subsystem | Key Functions |
|---|---|---|---|---|
| 0x403520--0x430000 | 178 KB | ~300 | Runtime infrastructure: pool allocator, hash maps, TLS, diagnostics, error reporting, string utilities | sub_424070 (pool alloc, 3809 callers), sub_4280C0 (TLS context, 3928 callers), sub_426150 (hash insert, 2800 callers), sub_42FBA0 (diagnostic emitter, 2350 callers), sub_427630 (MurmurHash3) |
| 0x430000--0x460000 | 200 KB | ~120 | CLI parsing and compilation driver: option registration, argument parser, target configuration, register/resource constraints, Chrome trace JSON parser | sub_446240 (real main, 11 KB), sub_432A00 (option registration, 6 KB), sub_434320 (option parser, 10 KB), sub_43B660 (register constraint calc), sub_439880 (trace JSON parser) |
| 0x460000--0x4D5000 | 470 KB | ~350 | PTX instruction validators: per-opcode semantic checkers for MMA, WMMA, load/store, cvt, atomics, barriers, tensormap, async copy | sub_4B2F20 (general validator, 52 KB), sub_4CE6B0 (Bison parser, 48 KB), sub_4C5FB0 (operand validator, 28 KB), sub_4C2FD0 (WMMA/MMA validator, 12 KB), sub_4A73C0 (tensormap validator, 11 KB) |
| 0x4D5000--0x5AA000 | 872 KB | 581 | PTX instruction text generation: 580 per-opcode formatters that convert internal IR to PTX assembly text, plus a built-in function declaration emitter | sub_5D4190 (formatter dispatch, 13 KB), sub_5FF700 (builtin decl emitter, 34 KB), ~580 formatter functions (avg 1.5 KB each) |
| 0x5AA000--0x67F000 | 874 KB | 628 | Intrinsic infrastructure: 608 CUDA intrinsic handlers, MMA/WMMA/tcgen05 tensor core codegen, SM profile tables (sm_75 through sm_121), special register init, ELF/DWARF finalization, memory space management | sub_5D1660 (608 intrinsics, 46 KB), sub_607DB0 (SM profile hash maps, 14 KB), sub_6765E0 (arch capability constructor, 54 KB), sub_612DE0 (version string) |

Ori Optimizer (0x67F000--0xC52000, 5.8 MB)

| Address Range | Size | Functions | Subsystem | Key Functions |
|---|---|---|---|---|
| 0x67F000--0x754000 | 869 KB | ~500 | Mercury SASS backend core: scheduling engine (ReduceReg/DynBatch, 9 reg pressure counters), WAR hazard management, Opex (operand expansion) pipeline, OCG intrinsic lowering, instruction encoding core, Flex DFA scanner, ELF section helpers | sub_688DD0 (scheduler engine, 20 KB), sub_6D9690 (encoding switch, 94 KB), sub_6FC240 (WAR/scoreboard), sub_720F00 (Flex scanner, 64 KB, 552 rules) |
| 0x754000--0x829000 | 872 KB | 1,545 | Knobs infrastructure (1,294 entries) and peephole optimizer class: knob lookup/read/file parsing, PeepholeOptimizer with 7 virtual methods (Init, RunOnFunction, RunOnBB, RunPatterns, SpecialPatterns, ComplexPatterns, SchedulingAwarePatterns), pipeline orchestrator, Mercury operand registration helpers | sub_79B240 (GetKnobIndex), sub_79D070 (ReadKnobsFile), sub_7A5D10 (PeepholeOptimizer), sub_7BD3C0/sub_7BD650/sub_7BE090 (operand registrars), sub_7BD260 (encoding finalize) |
| 0x829000--0x8FE000 | 872 KB | 1,069 | Debug line tables, scheduler core, and HW profiles: ScheduleInstructions pipeline (context setup, priority computation, reverse scheduling, register budget with occupancy optimization), ROT13 SASS mnemonic table, architecture-specific latency/throughput profiles, constant bank naming, peephole/legalization passes, cutlass-aware scheduling heuristics | sub_8BF000--0x8D1600 (ScheduleInstructions), sub_896D50 (ROT13 SASS mnemonics), sub_8F0D00 (HW latency profiles), sub_8F4820 (cutlass heuristics) |
| 0x8FE000--0x9D3000 | 872 KB | 1,090 | Register allocator: fatpoint algorithm core, interference graph builder (155 KB decompiled -- largest non-dispatch function), spill/refill mechanism, live range analysis, retry with reduced register count, memory-to-register promotion, ConvertMemoryToRegisterOrUniform pass | sub_926A30 (interference graph, 155 KB decomp), sub_957160 (fatpoint core), sub_95DC10 (regalloc driver), sub_9714E0 (failure handler + retry), sub_910840 (ConvertMemoryToRegister) |
| 0x9D3000--0xAA8000 | 860 KB | 1,218 | Post-RA pipeline phases: NamedPhases registry (OriPerformLiveDead, OriCopyProp, shuffle, swap1--swap6), DAG/dependency analysis, IR statistics printer (instruction count, reg count, estimated latency, spill bytes, occupancy, throughput), hot/cold split, mbarrier intrinsics, regalloc verification, uninitialized register detection | sub_9F4040 (NamedPhases registry), sub_A3A7E0 (IR stats printer), sub_A0B5E0 (uninitialized reg detector), sub_A9EDB0 (mbarrier/scheduling, 85 KB decomp) |
| 0xAA8000--0xB7D000 | 862 KB | 4,493 | GMMA/WGMMA pipeline optimizer, ISel, and instruction emission: GMMA register allocation, warpgroup sync injection, instruction emission helpers (SASS encoder dispatch), post-scheduling IR statistics, operand legalization, 1,269 tiny vtable dispatchers (~160 bytes each), live range analysis, scheduler-integrated mega-pass | sub_AED3C0 (mega scheduling/ISel pass, 137 KB decomp), sub_AF7DF0/sub_AF7200 (register decode helpers), ~1,269 vtable dispatchers |
| 0xB7D000--0xC52000 | 870 KB | 1,086 | CFG analysis, bitvectors, and IR manipulation: ~390 instruction operand pattern matchers, bitvector dataflow framework (alloc, OR, AND, XOR, clear, iterate), CFG analysis (edge printing, reverse post-order, DOT graph dump), scoreboard and instruction classification, sync analysis | sub_BDC000 (bitvector infra), sub_BDE8B0 (CFG/RPO/DOT), sub_BE2E40 (scoreboard classification), ~390 operand pattern matchers |

SASS Backend (0xC52000--0x1CE2DE2, 17.6 MB)

| Address Range | Size | Functions | Subsystem | Key Functions |
|---|---|---|---|---|
| 0xC52000--0xD27000 | 853 KB | 1,053 | PhaseManager (159 phases): phase factory (159-case switch), phase vtable table at off_22BD5C8, default phase ordering table at 0x22BEEA0, 530 encoding table initialization bodies, instruction handler vtable bodies | sub_C60D30 (phase factory), sub_C62720 (PhaseManager constructor), sub_C60D20 (default table pointer), ~530 phase table body functions |
| 0xD27000--0xDFC000 | 853 KB | 592 | SASS encoder table (SM100 Blackwell, set 1): 592 uniform template-generated encoding handlers, each packing operands into a 1,280-bit instruction word at a1+544. Covers 60 opcode classes across 16 format groups. All vtable-dispatched (zero direct callers). | 592 per-variant handlers (avg 1,473 B), sub_7B9B80 (bitfield insert helper) |
| 0xDFC000--0xED1000 | 877 KB | 591 | SASS encoder/decoder (SM100 Blackwell, set 2): 494 encoders translating IR to packed SASS bitfields, plus 97 decoders for the reverse direction (disassembly/validation). All vtable-dispatched. | 494 encoders (0xDFC--0xEB2), 97 decoders (0xEB3--0xED0), sub_E0F370 (largest, 11 KB) |
| 0xED1000--0xFA6000 | 860 KB | 683 | SM100 SASS encoders (set 3): 683 per-variant encoding handlers for 59 SASS opcodes. Each sets opcode ID, loads 128-bit format descriptor via SSE, initializes 10-slot register class map, registers operands, finalizes, extracts bitfields. | 683 template-generated handlers, 128-bit xmmword format descriptors |
| 0xFA6000--0x107B000 | 851 KB | 678 | SM100 SASS encoders (set 4): 587 primary encoders (opcodes 16--372, predicate/comparison/memory/tensor/control flow), plus 91 alternate-form encoders for dual-width or SM-variant instruction encodings. Combined with sets 1--3: 2,544 SM100 encoding handlers total. Six mega dispatch tables. | 587 primary + 91 alternate-form encoders, 6 dispatch tables |
| 0x107B000--0x1150000 | 853 KB | 3,396 | SM100 codec completion: 641 final encoding handlers, 78 object lifecycle and scheduling support functions (FNV-1a hash, instruction construction), 2,095 bitfield accessor functions (machine-generated read/write primitives for the packed encoding format). Seven core extractors handle 1-bit, 2-bit, and multi-bit fields across 192-bit words. | sub_10AFF80 (instruction constructor, 11 KB, 32 params), 2,095 bitfield accessors, 7 core extractors |
| 0x1150000--0x1225000 | 852 KB | 733 | SASS codec (decoders + encoders): both directions of the instruction codec for an older SM target (likely sm_89 Ada Lovelace or sm_90 Hopper). Decoders read 128-bit words and extract fields; encoders pack fields back. Three mega-decoders (29--33 KB each) and two mega-dispatchers (78--104 KB, too large for Hex-Rays). | 3 mega decoders (29--33 KB), 2 mega dispatchers (78--104 KB), 728 of 733 vtable-dispatched |
| 0x1225000--0x12FA000 | 860 KB | 1,552 | Register-pressure scheduling + ISel + encoders: register-pressure-aware instruction scheduling (0x1225--0x1240), instruction selection and emission pipeline (0x1240--0x1254), 982 SASS binary encoders packing operand fields into 128-bit words (0x1254--0x12FA). All encoders vtable-dispatched. | Scheduling at 0x1225--0x1240, ISel at 0x1240--0x1254, 982 encoding handlers |
| 0x12FA000--0x13CF000 | 845 KB | 1,282 | Operand legalization and peephole: 522 per-instruction bit-field encoders (366 KB), 186 peephole pattern matchers (81 KB), 11 operand legalization/materialization functions (40 KB), 38 operand encoding emitters (31 KB), 8 live-range analysis functions (14 KB). | sub_137B790 (operand legalization, 8.5 KB), 186 peephole matchers, 522 encoders |
| 0x13CF000--0x14A4000 | 844 KB | 1,219 | SM120 (RTX 50-series) peephole pipeline: 1,087 instruction pattern matchers (429 KB), one 233 KB master opcode dispatch switch (sub_143C440, 373-case primary switch), 123 instruction encoders (180 KB). Pattern matchers validate opcode, modifiers, and operand types; dispatch rewrites opcode byte and operand mapping. | sub_143C440 (233 KB dispatch, 373-case switch), 1,087 pattern matchers, 123 encoders |
| 0x14A4000--0x1579000 | 852 KB | 606 | Blackwell ISA encode/decode: 332 encoder functions (0x14A4--0x1520) packing SASS bitstreams, 1 dispatcher (vtable router at 0x15209F0), 273 decoder functions (0x1520--0x1578) unpacking bitstreams and validating fields. Encoder state struct is 600+ bytes with 128-bit format descriptor at +8, operand arrays at +24--+143. | 332 encoders, 273 decoders, 1 dispatcher |
| 0x1579000--0x164E000 | 852 KB | 1,324 | SASS encoding + peephole matchers: Zone A has 367 instruction encoders, Zone B has 78 utility/transition functions, Zone C has 469 peephole pattern matchers. All pattern matchers are called from a single 280 KB mega-dispatcher (sub_169B190). | 367 encoders, 469 peephole matchers, 78 utilities |
| 0x164E000--0x1723000 | 873 KB | 899 | ISel pattern matching core: 762 PTX opcode pattern matchers (Zone A), the master dispatch function sub_169B190 at 280 KB / 66K instructions (Zone B -- the single largest function in the binary), 100 encoding table entries, and 36 multi-instruction template expanders. The dispatch tries every matcher, selects the highest-scoring match, and records which SASS expansion template to use. | sub_169B190 (280 KB, 66K insns, 15,870 callees), 762 matchers, 36 template expanders |
| 0x1723000--0x17F8000 | 852 KB | 631 | ISA description database: ~555 SASS instruction format descriptor classes (one per opcode variant), ~316 bitfield layout initializers, ~239 opcode handler vtable entries. Also contains instruction sequence generators (multi-instruction expansions for complex PTX operations), register allocation helpers, and Newton-Raphson approximation templates. 91.8% of functions have zero static callers (vtable-dispatched). | ~555 format descriptor classes, ~316 bitfield initializers, ~239 vtable entries |
| 0x17F8000--0x18CD000 | 852 KB | 1,460 | SASS instruction printer + peephole: Subsystem A (0x17F8--0x181F) implements SASS disassembly rendering via virtual method overrides on a builder/visitor with a 4,080+ byte vtable. Subsystem B (0x1820--0x18CC) is a 231 KB peephole dispatch function (sub_18A2CA0, 54K instructions, 1,330 unique callees). | sub_18189C0 (SASS printer, 45 KB), sub_181B370 (SASS printer, 28 KB), sub_18A2CA0 (231 KB peephole dispatch) |
| 0x18CD000--0x19A2000 | 877 KB | 1,598 | Scheduling + peephole dispatchers: Zone A (275 KB) is the instruction scheduling core (list scheduler, dependency graph, ready queue, register pressure tracking). Zone B (130 KB) contains 318 opcode property/classification tables. Zones C+D (460 KB) contain 888 peephole pattern matchers called from sub_198BCD0 (239 KB, 1,336 unique callees). | sub_198BCD0 (239 KB peephole dispatch), 392 scheduling functions, 318 opcode property tables, 888 pattern matchers |
| 0x19A2000--0x1A77000 | 880 KB | 1,393 | GPU ABI/calling convention + SM89/90 encoders: Zone A (250 KB, 276 functions) implements the NVIDIA GPU calling convention -- parameter register allocation, return address placement, scratch/preserved classification, convergent boundary enforcement, coroutine SUSPEND semantics, uniform register support, per-SM ABI lowering (sm_35 through sm_100+). Zone B (480 KB) has ~1,117 supplementary SASS encoding vtable handlers. | sub_19D1AF0 (master ABI setup, 5.6 KB), 276 ABI functions, ~1,117 encoding handlers |
| 0x1A77000--0x1B4C000 | 829 KB | 1,518 | SASS emission backend (4 SM families): Zone A has 1,083 bit-field packing encoders spanning sm_50 through sm_100+. Zone B has 339 instruction lowering/expansion functions (two SM families: sm_8x and sm_9x/10x). Zone C has 84 Ampere/Ada/Hopper-era encoders. Zone D has 92 Blackwell-era encoders. | sub_1B6B250 (register-class-to-HW mapping, 254 callers), 1,083 emitters, 339 lowering functions |
| 0x1B4C000--0x1C21000 | 876 KB | 1,974 | SASS emission + format descriptors: register-class encoding tables (Zone A), per-SM instruction bit-field encoders (Zone B), instruction emission orchestrators (Zone C), multi-operand dispatch emitters (Zone D), mirrored SM-variant emitters (Zone E), instruction format descriptors (Zone F, 0x1C05--0x1C21). | 487 functions exceed 2 KB decompiled |
| 0x1C21000--0x1CE2DE2 | 776 KB | 1,628 | Library layer: custom ELF emitter (CUBIN output), capsule Mercury ELF (.nv.capmerc debug metadata), section layout and memory allocation (shared/constant/local/global), relocation resolution (branch targets, UFT/UDT, YIELD-to-NOP), call graph analysis (recursion detection, dead function elimination), DWARF debug generation (.debug_info/.debug_line/.debug_frame), option parsing library, thread pool (pthread-based), JSON builder, GNU Make jobserver client, C++ name demangler (Itanium ABI), ELF file writer | sub_1C9F280 (ELF emitter, 97 KB decomp), sub_1CABD60 (section allocator, 67 KB), sub_1CC9800 (EIATTR builder, 90 KB), sub_1CDC780 (demangler, 93 KB), sub_1CB53A0 (ELF world init), sub_1CD48C0 (relocation resolver, 22 KB), sub_1CBB920 (recursion detector), sub_1CB18B0 (thread pool), sub_1CD13A0 (file writer, 11 KB) |

.rodata Contents (7.5 MB)

The .rodata section at 0x1CE2E00--0x240BF8F is 19.9% of the binary by size. Its dominant consumers:

| Content | Estimated Size | Notes |
|---|---|---|
| SASS encoding format descriptors | ~3.5 MB | 128-bit xmmword constants loaded via SSE by ~4,000 encoding handlers |
| Flex DFA transition tables | ~600 KB | off_203C020, the 552-rule PTX scanner's state machine |
| Bison parser tables | ~400 KB | LALR(1) action/goto tables for the PTX grammar |
| Error/diagnostic format strings | ~300 KB | 30,632 strings extracted from the binary |
| Phase ordering + vtable tables | ~100 KB | Default 159-entry phase table at 0x22BEEA0, vtable table at off_22BD5C8 |
| ROT13-encoded string tables | ~200 KB | PTX opcode names (~900 entries), knob names (~2,000 entries) |
| Architecture capability tables | ~150 KB | Per-SM feature maps (sm_75 through sm_121), HW latency profiles |
| DWARF name tables | ~50 KB | DW_FORM_*, DW_AT_*, DW_OP_* string tables |
| Hash constants + misc | ~2.2 MB | MurmurHash3 mixing constants, lookup tables, padding |

.bss Contents (84 KB)

| Content | Notes |
|---|---|
| ROT13 PTX opcode name table | Populated by ctor_003 (0x4095D0, 17 KB) at startup |
| General OCG knob table | Populated by ctor_005 (0x40D860, 80 KB) -- ~2,000 entries |
| Mercury scheduler knob table | Populated by ctor_007 (0x421290, 8 KB) -- 98 entries |
| Thread-local storage keys | pthread_key_t for per-thread context (280-byte struct) |
| Global pool allocator mutex | pthread_mutex_t at pool struct offset 7128 |
| Diagnostic suppression bitmaps | Per-warning-ID suppression flags |
| SM architecture profile objects | Constructed on demand per sub_6765E0 |
| Global error/warning counters | Incremented by sub_42FBA0 |
| Make jobserver state | Atomic state machine (0=init, 5=no MAKEFLAGS, 6=no auth, 7=failed) |

.data Contents (14 KB)

| Content | Notes |
|---|---|
| Function pointer tables | Exit wrapper (off_29FA4B0), error handler dispatch |
| Default option values | Populated by sub_432A00 (option registration) |
| Static string table pointers | Version strings, format strings |
| Diagnostic output tables | Severity prefix strings: "error ", "warning ", "info ", "fatal " |

Static Constructors

The .ctors section holds 12 entries executed before main. The four largest are:

| Constructor | Address | Binary Size | Purpose |
|---|---|---|---|
| ctor_001 | 0x4094C0 | 204 B | Thread infrastructure: pthread_key_create, mutex init, thread priority range |
| ctor_003 | 0x4095D0 | 17,007 B | PTX opcode name table: ~900 ROT13-encoded opcode mnemonics |
| ctor_005 | 0x40D860 | 80,397 B | General OCG knob table: ~2,000 ROT13-encoded knob names + hex defaults |
| ctor_007 | 0x421290 | 7,921 B | Mercury scheduler knob table: 98 ROT13-encoded scheduler knobs |

The remaining 8 constructors handle memory allocator pool initialization, hash map infrastructure setup, diagnostic system initialization, and architecture vtable factory registration (sub_1CCD900).

Mega-Functions (>50 KB binary)

| Function | Binary Size | Decompiled | Role | Callees |
|---|---|---|---|---|
| sub_169B190 | 280 KB | N/A | Master ISel pattern dispatch (66K instructions) | 15,870 |
| sub_198BCD0 | 239 KB | N/A | Peephole dispatch, SM variant 2 | 1,336 |
| sub_143C440 | 233 KB | N/A | SM120 peephole dispatch (373-case switch) | ~1,100 |
| sub_18A2CA0 | 231 KB | N/A | Peephole dispatch, SM variant 1 | 1,330 |
| sub_6D9690 | 94 KB | N/A | Instruction encoding switch | ~500 |
| sub_46E000 | 93 KB | N/A | PTX opcode-to-handler table builder | 1,168 |
| sub_40D860 | 80 KB | N/A | ctor_005: general knob registration | ~2,000 |
| sub_720F00 | 64 KB | N/A | Flex DFA scanner (552 rules) | ~50 |

These eight functions together account for roughly 1.3 MB of code (about 5% of .text) but only 0.02% of the function count.

Most-Called Functions

| Function | Callers | Identity |
|---|---|---|
| sub_4280C0 | 3,928 | Thread-local context accessor (pthread_getspecific) |
| sub_42BDB0 | 3,825 | Fatal OOM handler (called from every allocation site) |
| sub_424070 | 3,809 | Pool memory allocator (alloc) |
| sub_426150 | 2,800 | Hash map insert/update |
| sub_42FBA0 | 2,350 | Central diagnostic message emitter |
| sub_4248B0 | 1,215 | Pool memory deallocator (free) |
| sub_42CA60 | 298 | Linked list prepend |
| sub_42D850 | 282 | Hash set insert |
| sub_1B6B250 | 254 | Register-class-to-hardware-number lookup (SASS emission) |
| sub_4279D0 | 185 | String prefix match (starts_with) |

The top five functions are all in the runtime infrastructure region (0x403520--0x430000). Together they represent the core allocation, error handling, and data structure layer that the rest of the binary depends on.

Binary Composition by Purpose

Estimated from function classification across 30 sweep reports (p1.01--p1.30). Each function was assigned to a single purpose category based on its dominant behavior; functions straddling categories (e.g., a scheduling pass that also emits SASS) are attributed to the category consuming the larger share of their code.

| Purpose | Estimated Size | Share of .text |
|---|---|---|
| SASS instruction encoding/decoding | ~12 MB | 46% |
| Optimization passes + scheduling | ~5 MB | 19% |
| Peephole pattern matching + dispatch | ~3 MB | 12% |
| Frontend: parsing + validation | ~2 MB | 8% |
| ISel pattern matching + templates | ~1.5 MB | 6% |
| Infrastructure: allocator, hash, ELF, debug | ~1.5 MB | 6% |
| GPU ABI + calling convention | ~0.7 MB | 3% |

The single largest consumer of code space is SASS instruction encoding. Each SM architecture generation requires its own set of per-opcode encoding/decoding handler functions. With support for SM75 through SM121 (six major generations), this yields approximately 4,000 encoding handlers, each a standalone function averaging 1,400 bytes.

Methodology

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page documents how the reverse engineering of ptxas v13.0.88 was performed. It serves as a transparency record so readers can assess the confidence of any claim in this wiki, and as a practical guide for anyone who wants to reproduce or extend the analysis.

Scope and Scale

PTXAS is a 37.7 MB stripped x86-64 ELF binary with no debug symbols, no DWARF information, and no export table beyond 146 libc/libpthread PLT stubs. Unlike NVIDIA's cicc (which is an LLVM fork), ptxas contains no LLVM code, no EDG frontend, and no third-party optimizer components. Every pass, data structure, and encoding table is proprietary NVIDIA code. This makes the analysis harder than for LLVM-derived binaries -- there is no upstream source to compare against.

| Metric | Value |
|---|---|
| Binary size | 37,741,528 bytes |
| Build string | cuda_13.0.r13.0/compiler.36424714_0 |
| Total functions detected | 40,185 |
| Functions decompiled | 39,881 (99.2%) |
| Strings extracted | 30,632 |
| Call graph edges | 548,693 |
| Cross-references | 7,427,044 |
| IDA comments recovered | 66,598 |
| IDA auto-names recovered | 16,019 |
| Control flow graphs exported | 80,078 |
| PLT imports | 146 (libc, libpthread, libm, libgcc) |
| Functions with 0 static callers | 15,907 (39.6%) -- vtable-dispatched |
| Functions < 100 bytes | 11,532 (28.7%) |
| Functions > 10 KB | 86 (0.2%) |
| Named functions (not sub_*) | 319 (0.8%) |
| Internal codenames | OCG (Optimizing Code Generator), Mercury (SASS encoder), Ori (IR) |

The 304 functions that Hex-Rays could not decompile are predominantly PLT stubs, computed-jump trampolines in the Flex DFA scanner, and the four mega-dispatch functions exceeding 200 KB (too large for Hex-Rays to handle within default limits). None are in critical analysis paths -- the dispatch functions are understood from their callee lists and the PLT stubs from their import names.

Why PTXAS Is Harder Than LLVM-Based Binaries

Reverse engineering cicc (NVIDIA's LLVM-based CUDA compiler) benefits from extensive prior art: LLVM's open-source codebase provides structural templates, pass names are registered in predictable patterns, and cl::opt strings directly name their global variables. PTXAS offers none of these advantages:

  • No upstream source. Every function had to be identified from first principles -- string evidence, callgraph position, structural fingerprinting, or decompiled algorithm analysis. There is no reference implementation to compare against.
  • ROT13 obfuscation. Internal names for tuning knobs and PTX opcode mnemonics are ROT13-encoded in the binary, requiring decoding before they become useful anchors.
  • Vtable-heavy architecture. 39.6% of functions have zero static callers because they are dispatched through vtable pointers or function pointer tables. The call graph alone cannot reach them.
  • Template-generated code. The SASS backend contains approximately 4,000 encoding handler functions generated from templates, each structurally near-identical. These dominate the function count but carry almost no unique identifying features.
  • No pass registration infrastructure. LLVM passes register themselves via PassInfo objects with name strings. PTXAS phases are allocated by a factory switch (sub_C60D30) and their names are only visible through the NamedPhases registry and AdvancedPhase* timing strings -- far fewer anchors than LLVM's registration system.

Toolchain

All analysis was performed with IDA Pro 8.x and the Hex-Rays x86-64 decompiler. The entire effort is static analysis of the binary at rest -- no dynamic analysis (debugging, tracing, instrumentation) was used for function identification. Runtime tools (ptxas --stat, DUMPIR knob, --keep) were used only for validation and cross-referencing.

| Tool | Purpose |
|---|---|
| IDA Pro 8.x | Disassembly, auto-analysis, cross-referencing, vtable reconstruction |
| Hex-Rays decompiler | Pseudocode generation for 39,881 recovered functions |
| IDA Python scripting | Complete database extraction: all 8 JSON artifact exports |
| Custom Python script | analyze_ptxas.py: batch string, function, graph, xref, and decompilation export |
| ptxas CLI | --stat, --verbose, --compiler-stats, --fdevice-time-trace for runtime validation |
| ptxas DUMPIR knob | -knob DUMPIR=<phase> to dump IR at specific pipeline points |
| ROT13 decoder | Standard codecs.decode(s, "rot_13") for 2,000+ obfuscated knob/opcode names |

IDA Pro Setup and Initial Analysis

Loading the Binary

PTXAS is a dynamically-linked ELF with 146 PLT imports but no symbol table beyond those imports. IDA auto-analysis settings:

  1. Processor: Meta PC (x86-64)
  2. Analysis options: default. IDA correctly identifies the Flex DFA scanner tables, Bison parser tables, and the .ctors/.dtors sections.
  3. Auto-analysis time: approximately 8-10 minutes on a modern machine for the 37.7 MB binary.
  4. Compiler detection: IDA identifies GCC as the compiler. The binary uses the Itanium C++ ABI (confirmed by the embedded C++ name demangler at sub_1CDC780, 93 KB).

Post-Auto-Analysis Steps

After auto-analysis completes:

  1. Run string extraction. IDA's auto-analysis finds 30,632 strings. All are exported via the analyze_ptxas.py IDA Python script.
  2. Force function creation. Some address ranges, particularly the template-generated encoding handlers, are not automatically recognized as functions. IDA's "Create function" (P key) was applied selectively in the 0xD27000--0x1579000 range where encoding handler stubs are tightly packed.
  3. Batch decompile. The IDA Python script iterates all 40,185 detected functions and calls ida_hexrays.decompile() on each, saving per-function .c files. 39,881 succeeded; 304 failed (PLT stubs, computed-jump trampolines, and 4 mega-functions exceeding decompiler limits).
  4. Export control flow graphs. For each function, the script extracts the FlowChart (basic blocks, edges, per-instruction disassembly) as JSON. 80,078 graph files were produced.

Type Recovery

PTXAS uses no C++ RTTI (no typeid, no dynamic_cast -- the binary has no .data.rel.ro RTTI structures). Type recovery relies on:

  • Vtable layout analysis. Each vtable is a contiguous array of function pointers in .data.rel.ro (4,256 bytes total). The vtable at off_22BD5C8 contains 159 entries, one per optimization phase. Each entry points to the phase's constructor function.
  • Structure offset patterns. The pool allocator struct has free-list bins at offset +2128 and a mutex at +7128. The thread-local context is a 280-byte struct accessed via pthread_getspecific. These offsets were recovered from the decompiled code of sub_424070 (pool alloc, 3,809 callers) and sub_4280C0 (TLS accessor, 3,928 callers).
  • Parameter/return type propagation. Once a function's signature is established (e.g., pool_alloc(pool*, size_t) -> void*), Hex-Rays propagates types to all 3,809 call sites, improving decompilation quality throughout the binary.

String-Driven Analysis

Strings are the single most productive source of function identification in ptxas. Of the 30,632 strings extracted, several categories are particularly valuable.

ROT13-Encoded Knob Names (2,000+ entries)

PTXAS uses ROT13 encoding as a light obfuscation layer on internal configuration names. Two massive static constructors populate these tables at startup:

  • ctor_005 at 0x40D860 (80 KB) registers approximately 2,000 general OCG tuning knobs
  • ctor_007 at 0x421290 (8 KB) registers 98 Mercury scheduler knobs

Each entry pairs a ROT13-encoded name with a hex-encoded default value. Decoding examples:

| ROT13 in binary | Decoded name |
|---|---|
| ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf | MercuryUseActiveThreadCollectiveInsts |
| ZrephelGenpxZhygvErnqfJneYngrapl | MercuryTrackMultiReadsWarLatency |
| ZrephelCerfhzrKoybpxJnvgOrarsvpvny | MercuryPresumeXblockWaitBeneficial |
| ZrephelZretrCebybthrOybpxf | MercuryMergePrologueBlocks |
| ZrephelTraFnffHPbqr | MercuryGenSassUCode |
| FpniVayvarRkcnafvba | ScavInlineExpansion |
| FpniQvfnoyrFcvyyvat | ScavDisableSpilling |

The knob names directly reveal subsystem organization. Names prefixed with Mercury* belong to the SASS encoder. Names prefixed with Scav* belong to the register allocator's scavenger. Names like XBlockWait* and WarDeploy* belong to the instruction scheduler. The knob lookup function GetKnobIndex at sub_79B240 performs inline ROT13 decoding and case-insensitive comparison, which was itself identified by tracing the xrefs from the ROT13-encoded strings.
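The decoding step is trivial to reproduce with Python's standard codecs module (the same transform listed in the toolchain table). A minimal sketch, using knob names from the table above:

```python
import codecs

# ROT13-decode knob names recovered from the ctor_005/ctor_007 tables.
# The "rot_13" transform only rotates ASCII letters, so any digits or
# punctuation embedded in a name pass through unchanged.
def decode_knob(name: str) -> str:
    return codecs.decode(name, "rot_13")

assert decode_knob("ZrephelZretrCebybthrOybpxf") == "MercuryMergePrologueBlocks"
assert decode_knob("FpniVayvarRkcnafvba") == "ScavInlineExpansion"
# ROT13 is an involution: applying it twice yields the original string.
assert decode_knob(decode_knob("GetKnobIndex")) == "GetKnobIndex"
```

Running every ROT13-looking string in .rodata through this one-liner is how the ~2,000 knob names and ~900 opcode mnemonics become usable analysis anchors.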

ROT13-Encoded PTX Opcode Names (~900 entries)

A third static constructor, ctor_003 at 0x4095D0 (17 KB), populates a table of ~900 ROT13-encoded PTX opcode mnemonics. Decoding examples:

| ROT13 | Decoded |
|---|---|
| NPDOHYX | ACQBULK |
| OFLAP | BSYNC |
| SZN | FMA |
| FRGC | SETP |
| ERGHEA | RETURN |
| RKVG | EXIT |

These strings are used by the PTX parser to match instruction mnemonics. Each xref from one of these strings leads to a parser action or instruction validator function.

Timing and Phase Name Strings

The compilation driver at sub_446240 emits per-stage timing via format strings:

Parse-time            : %.3f ms (%.2f%%)
CompileUnitSetup-time : %.3f ms (%.2f%%)
DAGgen-time           : %.3f ms (%.2f%%)
OCG-time              : %.3f ms (%.2f%%)
ELF-time              : %.3f ms (%.2f%%)
DebugInfo-time        : %.3f ms (%.2f%%)
PeakMemoryUsage = %.3lf KB

Tracing the xrefs from these format strings identifies the code that brackets each pipeline stage, revealing the stage boundaries within sub_446240.

The NamedPhases registry (string at 0x21B64C8, xrefs to sub_9F4040) and the AdvancedPhase* timing strings provide phase-level anchors within the 159-phase optimization pipeline:

  • AdvancedPhaseBeforeConvUnSup, AdvancedPhaseAfterConvUnSup
  • AdvancedPhaseEarlyEnforceArgs, AdvancedPhaseLateConvUnSup
  • AdvancedPhasePreSched, AdvancedPhaseAllocReg, AdvancedPhasePostSched
  • AdvancedPhaseOriPhaseEncoding, AdvancedPhasePostFixUp
  • GeneralOptimizeEarly, GeneralOptimize, GeneralOptimizeMid, GeneralOptimizeMid2
  • GeneralOptimizeLate, GeneralOptimizeLate2
  • OriPerformLiveDead, OriPerformLiveDeadFirst through OriPerformLiveDeadFourth

Each AdvancedPhase* string xrefs to exactly one call site, which is a boundary marker in the phase pipeline. These 15 markers divide the 159-phase pipeline into named segments whose boundaries were used to identify the phases between each pair of markers.

Error and Diagnostic Strings

The central diagnostic emitter sub_42FBA0 (2,350 callers) prints error messages whose text reveals the calling function's purpose. Examples:

  • "Please use -knob DUMPIR=AllocateRegisters for debugging" -- identifies the register allocator failure path at sub_9714E0
  • "SM does not support LDCU" -- identifies SM capability checking in the instruction legalizer
  • "Invalid knob identifier", "Invalid knob specified (%s)" -- identifies the knob parsing infrastructure around sub_79D070
  • "fseek() error knobsfile %s", "[knobs]" -- identifies ReadKnobsFile at sub_79D070

Source File Path

One recovered source path provides a structural anchor:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h

This string (at 0x202D4D8, 66 xrefs) is referenced from assertion checks throughout the knobs infrastructure, confirming that the knob system is a shared utility component (generic_knobs_impl.h) used across NVIDIA's compiler drivers.

Build and Version Strings

Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

The version string at sub_612DE0 identifies both the exact build and the version-reporting function. The "Usage :" string at 0x1CE3666 identifies the usage printer. The "\nCompile-unit with entry %s" string identifies the per-kernel compilation loop within the driver.

Vtable-Driven Discovery

The Phase Vtable Table

The most productive vtable discovery was the phase vtable table at off_22BD5C8 in .rodata. This is an array of 159 pointers, each pointing to a vtable for one optimization phase class. The phase factory function at sub_C60D30 is a 159-case switch statement that allocates a 16-byte phase object and assigns the corresponding vtable from this table:

// Simplified from decompiled sub_C60D30
switch (phase_index) {
    case 0:  obj->vtable = off_22BD5C8[0];  break;
    case 1:  obj->vtable = off_22BD5C8[1];  break;
    ...
    case 158: obj->vtable = off_22BD5C8[158]; break;
}
return obj;

Each vtable contains pointers to the phase's virtual methods: slot 0 is execute() (the phase body), slot 1 is isNoOp() (returns whether the phase should be skipped), and slot 2 is getName() (returns the phase name string).

By following each of the 159 vtable entries to their execute() slot, every optimization phase's main function was identified. The getName() slot provided the phase name for phases that implement it. For phases that return a constant empty string, the name was inferred from the NamedPhases registry or from the AdvancedPhase* timing strings that bracket the phase in the pipeline.
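
The pointer-table walk can be illustrated with a small decoding sketch. The packed bytes below are synthetic stand-ins, not data dumped from the real off_22BD5C8:

```python
# Sketch: decode a phase-vtable pointer table of the shape described
# above (consecutive little-endian u64 pointers in .rodata). The three
# fake vtable addresses are illustrative, not real binary data.
import struct

def read_pointer_table(blob, count):
    """Decode `count` consecutive u64 pointers from a .rodata dump."""
    return list(struct.unpack(f"<{count}Q", blob[: count * 8]))

fake = struct.pack("<3Q", 0x22C0000, 0x22C0040, 0x22C0080)
vtables = read_pointer_table(fake, 3)

# Slot layout per the text: slot 0 = execute(), 1 = isNoOp(), 2 = getName().
EXECUTE_SLOT = 0
execute_ptr_addr = vtables[0] + 8 * EXECUTE_SLOT
```

On the real binary the same unpack is run with count = 159 at the file offset of off_22BD5C8, then each vtable's slot-0 pointer is resolved to a function start.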

Encoding Handler Vtables

The SASS backend uses vtable dispatch for instruction encoding. Each SASS opcode variant has its own encoding handler function, registered in dispatch tables rather than called directly. This explains why 15,907 functions (39.6%) have zero static callers -- they are reached exclusively through indirect calls via function pointer tables.

The encoding handler vtables were identified by their structural uniformity: every handler in the 0xD27000--0x1579000 range follows an identical template:

  1. Set opcode ID via bitfield insert into the instruction word at a1+544
  2. Load a 128-bit format descriptor from .rodata via SSE (movaps xmm0, xmmword_XXXXXX)
  3. Initialize a 10-slot register class map
  4. Register operand descriptors via sub_7BD3C0 / sub_7BD650 / sub_7BE090
  5. Finalize encoding via sub_7BD260
  6. Extract bitfields from the packed instruction word

The uniformity of this template allowed batch identification: once the template was recognized in a few handlers, the remaining ~4,000 were identified by structural matching alone.

Peephole Optimizer Vtable

The PeepholeOptimizer class at 0x7A5D10 has a reconstructed vtable with 7 virtual methods:

| Slot | Method | Purpose |
|---|---|---|
| 0 | Init | Initialize peephole state for a compilation unit |
| 1 | RunOnFunction | Entry point for per-function peephole optimization |
| 2 | RunOnBB | Per-basic-block dispatch |
| 3 | RunPatterns | Standard pattern matching pass |
| 4 | SpecialPatterns | Architecture-specific pattern pass |
| 5 | ComplexPatterns | Multi-instruction pattern pass |
| 6 | SchedulingAwarePatterns | Schedule-preserving pattern pass |

The three peephole dispatch mega-functions (sub_143C440 at 233 KB, sub_18A2CA0 at 231 KB, sub_198BCD0 at 239 KB) each serve a different SM generation family and call 1,100--1,336 pattern matcher functions. These dispatchers were identified by their enormous callee counts and their position in the pipeline after instruction encoding.

Callgraph Analysis

The 548,693-edge call graph, exported from IDA, reveals the binary's module structure and function relationships. Several callgraph properties were systematically exploited.

Hub Function Identification

Functions with extreme callee or caller counts serve as structural anchors:

Top callees (hub functions -- "fan-out" nodes):

| Address | Name | Size | Callees | Role |
|---|---|---|---|---|
| sub_169B190 | ISel master dispatch | 280 KB | 15,870 | The single largest function in the binary. Dispatches to all ISel pattern matchers. |
| sub_143C440 | SM120 peephole dispatch | 233 KB | 13,425 | SM120 (RTX 50-series) peephole optimization |
| sub_198BCD0 | Peephole dispatch (variant 2) | 239 KB | 13,391 | Peephole optimization for another SM family |
| sub_18A2CA0 | Peephole dispatch (variant 1) | 231 KB | 12,974 | Peephole optimization for another SM family |
| sub_BA9D00 | Bitvector/CFG analysis | 204 KB | 11,335 | Dataflow framework core |

Top callers (utility functions -- "fan-in" nodes):

| Address | Name | Size | Callers | Role |
|---|---|---|---|---|
| sub_B28F30 | (unknown leaf) | 12 B | 31,399 | Tiny utility, likely a type tag or opcode check |
| sub_10AE5C0 | (unknown leaf) | 60 B | 30,768 | Small encoding helper |
| .sprintf | libc sprintf | 6 B | 20,398 | String formatting (PLT stub) |
| sub_7B9B80 | Bitfield insert | 216 B | 18,347 | Inserts bits into the 1280-bit instruction word |
| sub_424070 | Pool allocator | 2,098 B | 3,809 | Custom memory allocator |
| sub_4280C0 | TLS context accessor | 597 B | 3,928 | Thread-local storage via pthread_getspecific |
| sub_42FBA0 | Diagnostic emitter | 2,388 B | 2,350 | Central error/warning reporter |

The fan-out nodes identify the mega-dispatch functions: ISel, peephole, and dataflow. The fan-in nodes identify the shared infrastructure layer: memory allocation, encoding primitives, string formatting, and error reporting.
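
A minimal sketch of how these counts fall out of the edge list, using toy edges in the {from, to} shape of ptxas_callgraph.json; on the real 548,693-edge export the same two Counters surface the mega-dispatchers and the shared utilities:

```python
# Sketch: fan-in / fan-out from a callgraph edge list. The edges below
# are a toy sample in the documented {from, to} record shape.
from collections import Counter

edges = [
    {"from": "sub_169B190", "to": "sub_B28F30"},
    {"from": "sub_169B190", "to": "sub_10AE5C0"},
    {"from": "sub_143C440", "to": "sub_B28F30"},
]

fan_out = Counter(e["from"] for e in edges)   # call sites per caller
fan_in  = Counter(e["to"] for e in edges)     # call sites per callee

top_hub  = fan_out.most_common(1)[0]          # fan-out node (dispatcher)
top_util = fan_in.most_common(1)[0]           # fan-in node (utility)
```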

Module Boundary Detection

The call graph reveals clear module boundaries. Functions in the 0x400000--0x67F000 range (PTX frontend) rarely call functions in 0xC52000--0x1CE3000 (SASS backend) directly, and vice versa. The optimizer region (0x67F000--0xC52000) bridges the two, calling into both the frontend (for IR construction) and the backend (for encoding).

The call graph was used to validate the three-subsystem decomposition:

| Call direction | Edge count | Interpretation |
|---|---|---|
| Frontend -> Frontend | ~8,000 | Internal frontend cohesion |
| Frontend -> Optimizer | ~1,200 | IR construction handoff |
| Optimizer -> Optimizer | ~15,000 | Phase-to-phase internal calls |
| Optimizer -> Backend | ~3,500 | Scheduling, encoding setup |
| Backend -> Backend | ~18,000 | Encoding handler internal calls |
| Backend -> Frontend | ~500 | Shared infrastructure (allocator, hash) |

Propagation from Known Functions

Once a high-confidence function is identified, its callees and callers gain contextual identity. The most productive propagation chains:

  1. sub_446240 (real main, CERTAIN) -> calls stage entry points for Parse, DAGgen, OCG, ELF, DebugInfo. Each stage's entry point was identified by following the timing format string pattern.

  2. sub_C62720 (PhaseManager constructor) -> allocates 159 phase objects via sub_C60D30 (factory). The factory's 159 case targets are the phase constructors. Each constructor installs a vtable whose slot 0 points to the phase's execute() method.

  3. sub_79B240 (GetKnobIndex) -> called from every function that reads a tuning knob. The first argument to GetKnobIndex is the ROT13-encoded knob name, so every call site reveals which knob a function checks.

  4. sub_42FBA0 (diagnostic emitter) -> the format string argument at each of the 2,350 call sites reveals the error context. A call with "Cannot take address of texture/surface variable (%s)" identifies a PTX semantic checker.

Pattern Recognition

16-Byte Phase Objects

All 159 optimization phases share a uniform object layout:

Offset 0: vtable pointer (8 bytes) -- points to phase-specific vtable
Offset 8: phase data pointer or inline data (8 bytes)

The phase factory (sub_C60D30) allocates each phase as a 16-byte object from the pool allocator, sets the vtable pointer from the vtable table at off_22BD5C8, and returns the object. The PhaseManager stores these 159 objects in its internal array and iterates them to execute the pipeline.

Pool Allocator Usage Pattern

The custom pool allocator (sub_424070, 3,809 callers) is the dominant allocation mechanism. Its usage pattern is recognizable throughout the binary:

ptr = sub_424070(pool, size);   // Allocate
if (!ptr) sub_42BDB0();         // Fatal OOM -- never returns
// ... use ptr ...
sub_4248B0(ptr);                // Free (1,215 callers)

The OOM handler sub_42BDB0 (14 bytes, 3,825 callers) is a tiny wrapper that calls sub_42F590 (fatal internal error). Because every allocation site checks for failure and calls the same handler, the allocator usage pattern is a reliable structural marker. Finding sub_42BDB0 in a function's callee list confirms that function performs heap allocation.
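
The callee-list check can be sketched directly from the functions artifact. The two records below are made-up examples of the ptxas_functions.json shape, not real entries:

```python
# Sketch: flag allocating functions via the OOM-handler marker. Because
# every allocation site calls sub_42BDB0 on failure, its presence in a
# function's callee list marks that function as allocating.
functions = [
    {"name": "sub_9714E0", "callees": ["sub_424070", "sub_42BDB0"]},
    {"name": "sub_7B9B80", "callees": []},
]

def allocates(fn):
    return "sub_42BDB0" in fn["callees"]

allocators = [f["name"] for f in functions if allocates(f)]
```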

SASS Encoding Handler Template

Every encoding handler in the backend follows a rigid 6-step template (described in the vtable section above). The key identification markers:

  • Calls to sub_7B9B80 (bitfield insert, 18,347 callers)
  • SSE movaps loading a 128-bit constant from .rodata
  • Calls to sub_7BD3C0, sub_7BD650, or sub_7BE090 (operand registrars)
  • Final call to sub_7BD260 (encoding finalize)

Any function matching this pattern is a SASS encoding handler. This template recognition identified approximately 4,000 handlers spanning 6 SM architecture generations.
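
A minimal version of the template matcher over the call markers, assuming `text` is one per-function decompiled .c file from the export (the sample line is fabricated for illustration; the real classifier also checks for the SSE movaps load):

```python
# Sketch: structural template match for SASS encoding handlers using the
# marker callees listed above. `sample` is a fabricated stand-in for a
# decompiled per-function .c file.
REGISTRARS = ("sub_7BD3C0", "sub_7BD650", "sub_7BE090")

def is_encoding_handler(text):
    return ("sub_7B9B80" in text                   # bitfield insert
            and "sub_7BD260" in text               # encoding finalize
            and any(r in text for r in REGISTRARS))

sample = "sub_7B9B80(a1, 0, 12, 335); sub_7BD3C0(a1, v5); sub_7BD260(a1);"
```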

Hash Map Infrastructure Pattern

The MurmurHash3-based hash map infrastructure (sub_426150 insert, sub_426D60 lookup, sub_427630 MurmurHash3) appears throughout the binary with a consistent usage pattern:

map = sub_425CA0(hash_fn, cmp_fn, initial_capacity);  // Create
sub_426150(map, key, value);                           // Insert (2,800 callers)
result = sub_426D60(map, key);                         // Lookup (422 callers)
sub_425D20(map);                                       // Destroy

The MurmurHash3 constants (0xcc9e2d51, 0x1b873593) in sub_427630 confirmed the hash algorithm. The hash map supports three modes (custom function pointers, pointer hash, integer hash) selected by flags at struct offset 84.
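
The recovered constants can be checked against a from-scratch reimplementation of the public-domain MurmurHash3 x86_32 algorithm. This is the canonical reference algorithm, not code lifted from ptxas; it only demonstrates that 0xcc9e2d51 / 0x1b873593 are the standard body constants:

```python
# Reference MurmurHash3 x86_32 (public-domain algorithm), shown to tie
# the constants found in sub_427630 to the canonical algorithm.
def murmur3_32(data, seed=0):
    c1, c2 = 0xCC9E2D51, 0x1B873593          # constants seen in .text
    h = seed & 0xFFFFFFFF
    n = len(data)
    for i in range(0, n - n % 4, 4):          # 4-byte body blocks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail, k = data[n - n % 4:], 0             # 0-3 trailing bytes
    for b in reversed(tail):
        k = (k << 8) | b
    if tail:
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        h ^= (k * c2) & 0xFFFFFFFF
    h ^= n                                    # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    return (h ^ (h >> 16)) & 0xFFFFFFFF
```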

Data Artifacts

The complete IDA database was exported via analyze_ptxas.py into 8 JSON artifacts. These artifacts are the foundation for all subsequent analysis.

| Artifact | File | Size | Entries | Schema |
|---|---|---|---|---|
| Functions | ptxas_functions.json | 92 MB | 40,185 | {addr, end, name, size, insn_count, is_library, is_thunk, callers[], callees[]} |
| Strings | ptxas_strings.json | 4.8 MB | 30,632 | {addr, value, type, xrefs[{from, func, type}]} |
| Call graph | ptxas_callgraph.json | 64 MB | 548,693 | {from, from_addr, to, to_addr} -- one edge per call site |
| Cross-references | ptxas_xrefs.json | 978 MB | 7,427,044 | Complete xref database (code, data, string references) |
| Comments | ptxas_comments.json | 5.9 MB | 66,598 | {addr, type, text} -- IDA auto-comments and analyst annotations |
| Names | ptxas_names.json | 972 KB | 16,019 | {addr, name} -- IDA auto-generated and analyst-assigned names |
| Imports | ptxas_imports.json | 17 KB | 146 | {module, name, addr, ordinal} -- PLT import stubs |
| Segments | ptxas_segments.json | 3 KB | 24 | {name, start, end, size, type, perm} -- ELF segment map |

Total artifact storage: 1.14 GB (dominated by the 978 MB xref database).

What Each Artifact Reveals

Functions (ptxas_functions.json): The master index. Every function's address, size, instruction count, caller list, and callee list. The caller/callee lists are the basis for callgraph analysis. The is_thunk flag identifies PLT stubs (exclude from analysis). The is_library flag identifies functions IDA tagged as library code (CRT startup, jemalloc-like allocator internals).

Strings (ptxas_strings.json): The primary identification tool. Each string's xref list shows which functions reference it. Searching for "AdvancedPhase" returns 15 strings, each xref pointing to a pipeline boundary in the PhaseManager. Searching for strings starting with "Z" (ROT13 "M" for "Mercury") returns the Mercury subsystem's knob names. The 2,035 hex-encoded default value strings ("0k..." / "0x...") are paired 1:1 with knob name strings in the constructors.

Call graph (ptxas_callgraph.json): The structural backbone. Each edge records a direct call from one function to another. Indirect calls (vtable dispatch, function pointer callbacks) are not captured, which is the primary limitation -- the 15,907 zero-caller functions are almost all vtable-dispatched. The call graph is used for module boundary detection, propagation from known functions, and entry/exit point analysis.

Cross-references (ptxas_xrefs.json): The most comprehensive artifact. Contains all code-to-code, code-to-data, and data-to-data references detected by IDA. At 7.4 million entries, it is too large to load into memory on machines with less than 16 GB RAM. Used for deep analysis of specific functions: finding all references to a particular .rodata constant, tracing data flow through global variables, and identifying vtable consumers.

Comments (ptxas_comments.json): IDA's auto-generated comments (e.g., "File format: \\x7FELF") plus analyst-added annotations. The auto-comments on function prologues identify calling conventions and stack frame layouts. Analyst comments record identification rationale for reviewed functions.

Names (ptxas_names.json): IDA's auto-generated names for data and code addresses. Of 16,019 entries, approximately 9,670 are auto-generated string reference names (aLib64LdLinuxX8, aGnu, etc.) and ~6,349 are analyst-assigned or IDA-recovered names (PLT stubs, constructors, etc.). These names appear in the callgraph edges as from/to identifiers.

Imports (ptxas_imports.json): The 146 PLT imports. Key imports include pthread_* (13 functions), malloc/free/realloc, _setjmp/longjmp (used by the error recovery system), select/fcntl (used by the GNU Make jobserver client), and clock (used by the timing infrastructure).

Segments (ptxas_segments.json): The 24 ELF segments/sections. Used to establish the address space layout and map code/data boundaries. The .ctors section (104 bytes, 12 entries) is particularly important -- it lists the static constructors that initialize the ROT13 tables and the knob registry.

The 30-Region Sweep Approach

The primary analysis was conducted as a systematic address-range sweep of the entire .text section, divided into 30 contiguous regions. Each region was analyzed independently in a single session, producing a raw sweep report. The 40 report files (including sub-region splits) total 34,880 lines of working notes.

Region Partitioning

The .text section (0x403520--0x1CE2DE2, 26.2 MB) was divided into approximately 870 KB regions. The partitioning was not arbitrary -- region boundaries were chosen to align with subsystem boundaries where possible, so that each sweep report covers a coherent functional area.

| Report | Address Range | Size | Functions | Subsystem |
|---|---|---|---|---|
| p1.01 | 0x400000--0x4D5000 | 853 KB | 1,383 | Runtime infra + CLI + PTX validators |
| p1.02 | 0x4D5000--0x5AA000 | 853 KB | 581 | PTX text generation (580 formatters) |
| p1.03 | 0x5AA000--0x67F000 | 853 KB | 628 | Intrinsics + SM profiles |
| p1.04 | 0x67F000--0x754000 | 469 KB | ~500 | Mercury core + scheduling engine |
| p1.05 | 0x754000--0x829000 | 853 KB | 1,545 | Knobs + peephole optimizer class |
| p1.06 | 0x829000--0x8FE000 | 853 KB | 1,069 | Debug tables + scheduler + HW profiles |
| p1.07 | 0x8FE000--0x9D3000 | 853 KB | 1,090 | Register allocator (fatpoint) |
| p1.08 | 0x9D3000--0xAA8000 | 853 KB | 1,218 | Post-RA pipeline + NamedPhases |
| p1.09 | 0xAA8000--0xB7D000 | 853 KB | 4,493 | GMMA/WGMMA + ISel + emission |
| p1.10 | 0xB7D000--0xC52000 | 853 KB | 1,086 | CFG analysis + bitvectors |
| p1.11 | 0xC52000--0xD27000 | 853 KB | 1,053 | PhaseManager + phase factory |
| p1.12 | 0xD27000--0xDFC000 | 853 KB | 592 | SM100 SASS encoders (set 1) |
| p1.13 | 0xDFC000--0xED1000 | 853 KB | 591 | SM100 SASS encoders (set 2) + decoders |
| p1.14 | 0xED1000--0xFA6000 | 853 KB | 683 | SM100 SASS encoders (set 3) |
| p1.15 | 0xFA6000--0x107B000 | 853 KB | 678 | SM100 SASS encoders (set 4) |
| p1.16 | 0x107B000--0x1150000 | 853 KB | 3,396 | SM100 codec + 2,095 bitfield accessors |
| p1.17 | 0x1150000--0x1225000 | 853 KB | 733 | SM89/90 codec (decoders + encoders) |
| p1.18 | 0x1225000--0x12FA000 | 853 KB | 1,552 | Reg-pressure scheduling + ISel + encoders |
| p1.19 | 0x12FA000--0x13CF000 | 853 KB | 1,282 | Operand legalization + peephole |
| p1.20 | 0x13CF000--0x14A4000 | 853 KB | 1,219 | SM120 peephole pipeline |
| p1.21 | 0x14A4000--0x1579000 | 853 KB | 606 | Blackwell ISA encode/decode |
| p1.22 | 0x1579000--0x164E000 | 853 KB | 1,324 | Encoding + peephole matchers |
| p1.23 | 0x164E000--0x1723000 | 853 KB | 899 | ISel pattern matching core |
| p1.24 | 0x1723000--0x17F8000 | 853 KB | 631 | ISA description database |
| p1.25 | 0x17F8000--0x18CD000 | 853 KB | 1,460 | SASS printer + peephole dispatch |
| p1.26 | 0x18CD000--0x19A2000 | 853 KB | 1,598 | Scheduling + peephole dispatchers |
| p1.27 | 0x19A2000--0x1A77000 | 853 KB | 1,393 | GPU ABI + SM89/90 encoders |
| p1.28 | 0x1A77000--0x1B4C000 | 853 KB | 1,518 | SASS emission backend |
| p1.29 | 0x1B4C000--0x1C21000 | 853 KB | 1,974 | SASS emission + format descriptors |
| p1.30 | 0x1C21000--0x1CE3000 | 780 KB | 1,628 | ELF emitter + infra library layer |

Several regions were further split into sub-reports (p1.04a/b, p1.05a/b, p1.06a/b, p1.07a/b, p1.08a/b) when the initial analysis revealed that a region contained multiple distinct subsystems requiring separate treatment.

Sweep Report Structure

Each sweep report follows a consistent format:

================================================================================
P1.XX SWEEP: Functions in address range 0xAAAA000 - 0xBBBB000
================================================================================
Range: 0xAAAA000 - 0xBBBB000
Files found: NNN decompiled .c files (of which ~MMM are > 1KB)
Total decompiled size: X,XXX,XXX bytes
Functions in range (from DB): NNN
Named functions: NNN (or 0 if all are sub_XXXXXX)
Functions with identified callers: NNN

CONTEXT: [1-paragraph summary of the region's purpose]

================================================================================
SECTION 1: [Subsystem name]
================================================================================

### 0xAAAAAA -- sub_AAAAAA (NNNN bytes / NNN lines)
**Identity**: [Function identification]
**Confidence**: [CERTAIN / HIGH / MEDIUM]
**Evidence**:
  - [String evidence]
  - [Structural evidence]
  - [Callgraph evidence]
**Key code**:
  [Relevant decompiled excerpts]
**Note**: [Additional observations]

Each function entry records the address, size, decompiled line count, proposed identity, confidence level, evidence citations, and key code excerpts. The reports are raw working notes -- they contain false starts, corrections, and evolving hypotheses that were resolved as more context became available.

Analysis Ordering

The sweep was not performed in address order. The analysis followed an information-maximizing sequence:

  1. p1.01 (infrastructure + CLI) first -- establishes the allocator, hash map, TLS, and diagnostic patterns that appear throughout the binary.
  2. p1.11 (PhaseManager) second -- identifies all 159 phases and their vtable entries, providing the skeleton of the optimization pipeline.
  3. p1.07 (register allocator) and p1.06 (scheduler) third -- these are the highest-complexity subsystems with the richest string evidence.
  4. p1.12--p1.15 (SASS encoders) in batch -- once the encoding template was recognized, all encoder regions were swept rapidly with template matching.
  5. p1.30 (library layer) late -- identifies shared infrastructure (ELF emitter, demangler, thread pool) referenced by earlier regions.
  6. Remaining regions filled in by decreasing information density.

Cross-Referencing with PTXAS CLI

Several ptxas command-line features and internal mechanisms provide runtime validation of static analysis findings.

--stat and --verbose

Running ptxas --stat input.ptx prints per-kernel resource usage (register count, shared memory, stack frame size). This output is generated by sub_A3A7E0 (the IR statistics printer), which was identified from the format strings:

ptxas info    : Used %d registers, %d bytes smem, %d bytes cmem[0]

Comparing the --stat output against the decompiled statistics printer confirms the register counting and resource tracking logic.

--compiler-stats

Enables the timing output (Parse-time, DAGgen-time, OCG-time, etc.) from sub_446240. This confirms the pipeline stage ordering and the stage boundary functions identified by string xrefs.

--fdevice-time-trace

Generates Chrome trace JSON output showing per-phase timing. The trace parser at sub_439880 and the ftracePhaseAfter string at 0x1CE383F confirm the per-phase instrumentation infrastructure. The trace output lists phase names that can be cross-referenced against the 159-entry phase table.

DUMPIR Knob

The internal DUMPIR knob (accessed via -knob DUMPIR=<phase_name>) dumps the Ori IR at specified pipeline points. The string "Please use -knob DUMPIR=AllocateRegisters for debugging" at 0x21EFBD0 confirms this mechanism. The NamedPhases registry at sub_9F4040 maps phase names to pipeline positions. Available DUMPIR points include:

  • OriPerformLiveDead, OriPerformLiveDeadFirst through OriPerformLiveDeadFourth
  • AllocateRegisters (the register allocation phase)
  • swap1 through swap6 (swap elimination phases)
  • shuffle (instruction scheduling)

The DUMPIR output format reveals the IR structure: basic block headers, instruction opcodes, register names (R0--R255, UR0--UR63, P0--P7, UP0--UP7), and operand encodings. This runtime output was used to validate the IR format reconstructed from static analysis.

--keep Flag

The --keep flag preserves intermediate files. While ptxas does not emit intermediate text files in the same way as nvcc, the --keep behavior in the overall CUDA compilation pipeline (nvcc -> cicc -> ptxas) allows inspecting the PTX input that reaches ptxas, confirming the PTX grammar and instruction format expectations.

Confidence Levels

Every function identification in this wiki carries one of three confidence levels:

| Level | Meaning | Basis |
|---|---|---|
| CERTAIN | Identity is certain | Direct string evidence naming the function, or the function is a PLT import with a known name |
| HIGH | Strong identification (>90%) | Multiple corroborating indicators: string xrefs, callgraph position, structural fingerprint, decompiled algorithm match |
| MEDIUM | Probable identification (70--90%) | Single indicator (vtable position, size fingerprint, callgraph context) or inferred from surrounding identified functions |

The distribution across the ~200 key identified functions in the Function Map:

  • CERTAIN: ~30 functions (PLT imports, main, functions with unique identifying strings)
  • HIGH: ~130 functions (string evidence + structural confirmation)
  • MEDIUM: ~40 functions (inferred from callgraph context or structural similarity)

The remaining ~39,985 functions are either unidentified (template-generated encoding handlers, small utility stubs) or identified at subsystem level only (e.g., "this is an SM100 SASS encoding handler" without knowing which specific opcode it encodes).

Reproducing the Analysis

To reproduce this analysis from scratch:

  1. Obtain the binary. Install CUDA Toolkit 13.0. The binary is at <cuda>/bin/ptxas. Verify: ptxas --version should report V13.0.88 and the binary should be 37,741,528 bytes. Build string: cuda_13.0.r13.0/compiler.36424714_0.

  2. Run IDA auto-analysis. Open ptxas in IDA Pro 8.x with default x86-64 settings. Allow auto-analysis to complete (8-10 minutes). Accept GCC as the detected compiler.

  3. Run the extraction script. Load analyze_ptxas.py in IDA's Python console. The script exports all 8 JSON artifacts plus per-function decompiled C files, disassembly files, and control flow graph JSON files. Expected runtime: 4-8 hours for the full export (the xref export dominates).

  4. Decode ROT13 strings. Apply codecs.decode(s, "rot_13") to all strings in the knob constructors (ctor_003, ctor_005, ctor_007). This decodes ~3,000 obfuscated names into readable English identifiers.
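
The decode in this step is plain ROT13. For example, the knob name MercuryUseActiveThreadCollectiveInsts round-trips as follows; the encoded string shown is simply the ROT13 image of the name, computed here for illustration rather than copied from a constructor:

```python
# ROT13 deobfuscation as used in step 4. Letters rotate by 13; digits
# and punctuation pass through unchanged.
import codecs

def deobfuscate(name):
    return codecs.decode(name, "rot_13")

plain = deobfuscate("ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf")
```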

  5. Identify anchor functions. Start with the highest-confidence identifications:

    • main at 0x409460 (named in symbol table)
    • sub_446240 (real main -- called from main, contains timing format strings)
    • sub_C60D30 (phase factory -- 159-case switch)
    • sub_C62720 (PhaseManager constructor -- references phase vtable table)
    • sub_79B240 (GetKnobIndex -- inline ROT13 decoding)
    • sub_42FBA0 (diagnostic emitter -- 2,350 callers, severity dispatch)
  6. Sweep the address space. Work through the .text section in regions of ~870 KB. For each region:

    • Count functions and decompiled file sizes
    • Identify string anchors (search for region-specific strings)
    • Classify functions by structural template (encoding handler, phase body, utility, etc.)
    • Propagate identities from known callers/callees
    • Record findings in the sweep report format
  7. Cross-reference with runtime. Compile a simple CUDA kernel and run ptxas --stat --verbose --compiler-stats to observe runtime behavior. Use -knob DUMPIR=<phase> to dump IR at specific pipeline points. Compare the dumped IR format against the IR structure reconstructed from decompiled code.

Dependencies

The extraction script (analyze_ptxas.py) requires IDA Pro 8.x with Hex-Rays decompiler and Python 3.x. No external Python packages are needed -- only the IDA Python API (idautils, idc, idaapi, ida_bytes, ida_funcs, ida_segment, ida_nalt, ida_gdl, ida_hexrays).

Post-export analysis requires only the Python 3.8+ standard library (json, codecs, collections).

Debug Infrastructure: bugspec.txt

ptxas contains an internal fault-injection framework that deliberately corrupts the Mercury IR to test compiler verification passes. The mechanism is entirely file-driven: if a file named ./bugspec.txt exists in the current working directory when ptxas runs, the function sub_A83AC0 reads it and injects controlled mutations into the post-register-allocation instruction stream. No CLI flag activates this; the file's presence alone is sufficient. If the file is absent, a diagnostic (Cannot open file with bug specification) is printed to stdout and compilation proceeds normally.

File Format

The file contains a single line of six integers:

COUNT0,COUNT1,COUNT2,COUNT3 COUNT4 COUNT5

The first four are comma-separated; then a space; then two space-separated values. Each integer specifies the number of faults to inject for that bug category. Zero or negative disables the category.

| Field | Variable | Category | Target |
|---|---|---|---|
| COUNT0 | v78 | Register bugs | General (R) and uniform (UR) register operands |
| COUNT1 | v79 | Predicate bugs | Predicated instruction operands |
| COUNT2 | v80 | Offset/spill bugs | Memory offsets in spill/refill instructions |
| COUNT3 | v81 | Remat bugs | Rematerialized value operands |
| COUNT4 | v82 | R2P/P2R bugs | Register-to-predicate conversion instructions |
| COUNT5 | v83 | Bit-spill bugs | Bit-level spill storage operands |

Example: 3,2,1,0 0 1 injects 3 register bugs, 2 predicate bugs, 1 offset bug, and 1 bit-spill bug.
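
A parser for this one-line format can be sketched directly from the description above (a hypothetical helper, not recovered code):

```python
# Hypothetical parser for the single-line bugspec.txt format: four
# comma-separated counts, a space, then two space-separated counts.
def parse_bugspec(line):
    head, count4, count5 = line.split()
    counts = [int(x) for x in head.split(",")]
    if len(counts) != 4:
        raise ValueError("expected COUNT0,COUNT1,COUNT2,COUNT3")
    return counts + [int(count4), int(count5)]

# Zero or negative disables a category, matching the documented behavior.
def active_categories(counts):
    return [i for i, c in enumerate(counts) if c > 0]
```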

Bug Kind String Table

Each injected fault record carries a kind code (1--10) mapped to a string table at 0x21F0500:

| Kind | String | Meaning |
|---|---|---|
| 1 | r-ur register | General or uniform register replaced with wrong register |
| 2 | p-up register | Predicate or uniform predicate register corrupted |
| 3 | any reg | Any register class operand corrupted |
| 4 | offset | Memory offset shifted by +16 bytes |
| 5 | regular bug | Generic operand value replacement |
| 6 | predicated bug | Predicate source operand corrupted |
| 7 | remat bug | Rematerialization value corrupted |
| 8 | spill-regill bug | Spill or refill path value corrupted |
| 9 | r2p-p2r bug | Register-predicate conversion operand corrupted |
| 10 | bit-spill bug | Bit-level spill storage operand corrupted |

Injection Algorithm

The injection proceeds in four phases:

1. Candidate collection. The function walks the Mercury IR instruction linked list (from context[0]+272). For each instruction, it checks which bug categories are active and whether the instruction qualifies:

  • Register bugs (field0): Scans operands for type-tag 1 (register) with register class 6 (general) or 3 (predicate), excluding opcodes 41--44. Eligible instructions are collected into a candidate list.
  • Predicate bugs (field1): Checks flag byte at instruction+73 for bit 0x10 (predicated). Eligible instructions are collected separately.
  • Offset/spill bugs (field2): Calls sub_A56DE0 / sub_A56CE0 against the register allocator state (context[133]) to identify spill/refill instructions.
  • Remat bugs (field3): Queries the rematerialization hash table (context+21 via sub_A54200) for instructions with remat entries.
  • R2P/P2R bugs (field4): Checks instruction opcode (offset +72) for values 268, 155, 267, 173 (the R2P and P2R conversion opcodes, with bit-masked variants).
  • Bit-spill bugs (field5): Checks operand count > 2, flag bit 0x10 at offset +28, and calls sub_A53DB0 / sub_A53C40 / sub_A56880 for bit-spill eligibility.

2. Random selection. Seeds the RNG with time(0) via srand(). For each active category, sub_A83490 randomly selects N instruction indices from the candidate list, where N is the count from bugspec.txt. The selector uses FNV-1a hashing on instruction addresses for collision avoidance, re-rolling duplicates.

3. Mutation application. For register and predicate categories, sub_A5EC40 iterates over selected instructions and calls sub_A5E9E0, which finds the last register operand, allocates a new register of the same class via sub_91BF30, and replaces the operand value. For offset bugs, the mutation adds +16 to the signed 24-bit offset field directly: *operand = (sign_extend_24(*operand) + 16) & 0xFFFFFF | (*operand & 0xFF000000).
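
The offset mutation restated in Python, to make the 24-bit sign extension explicit (a sketch of the decompiled expression, not recovered source):

```python
# The step-3 offset mutation: bump the signed 24-bit offset field by +16
# while preserving the top byte of the operand word.
def sign_extend_24(v):
    v &= 0xFFFFFF
    return v - 0x1000000 if v & 0x800000 else v

def inject_offset_bug(operand):
    bumped = (sign_extend_24(operand) + 16) & 0xFFFFFF
    return bumped | (operand & 0xFF000000)
```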

4. Reporting. Prints to stdout:

Num forced bugs N
Created a bug at index I : kind K inst # ID [OFF] in operand OP correct val V replaced with W

Fault Record Structure (40 bytes)

| Offset | Size | Field |
|---|---|---|
| +0 | 4 | Kind (1--10) |
| +8 | 8 | Pointer to Mercury instruction node |
| +16 | 4 | Operand index within instruction |
| +20 | 4 | Original operand value |
| +24 | 4 | Replacement operand value |
| +28 | 4 | Selection index (position in candidate list) |
| +32 | 4 | Instruction ID (from instruction+16) |

Records are stored in a dynamic array at context[135].

Function Map

| Address | Function | Role | Confidence |
|---|---|---|---|
| 0xA83AC0 | sub_A83AC0 | bugspec.txt reader and injection coordinator | CERTAIN (string: ./bugspec.txt) |
| 0xA83490 | sub_A83490 | Random index selector with FNV-1a dedup | HIGH |
| 0xA5E9E0 | sub_A5E9E0 | Register operand mutation (allocates new register) | HIGH |
| 0xA5EC40 | sub_A5EC40 | Batch mutation applicator (iterates selected instructions) | HIGH |
| 0xA832D0 | sub_A832D0 | Hash table resize for dedup tracking | MEDIUM |

Significance

This is NVIDIA's internal compiler testing infrastructure for stochastic fault injection. It targets specific vulnerability surfaces in the register allocator and post-allocation pipeline: wrong-register assignments, address calculation errors, predicate propagation failures, rematerialization correctness, spill code integrity, and register-predicate conversion accuracy. The time(0)-seeded RNG produces different fault patterns on each run for the same bugspec.txt, enabling randomized stress testing of verification passes.

Embedded C++ Name Demangler

PTXAS statically embeds an Itanium ABI C++ name demangler rather than linking libc++abi or libstdc++. The demangler is a self-contained 41-function cluster spanning 0x1CD8B00--0x1CE1E60 in .text, with a single external entry point. The core recursive-descent parser at sub_1CDC780 (93 KB decompiled, 3,442 lines) handles the full Itanium mangling grammar: nested names, template arguments, substitutions, function types, and special names.

API and Integration

The public-facing function is sub_1CE23F0, whose signature matches __cxa_demangle exactly: it takes a mangled name string, an optional output buffer with length pointer, and a status pointer; it returns a malloc-allocated demangled string or NULL with a status code (-1 = memory failure, -3 = invalid arguments). The only caller of this function is the embedded terminate handler at sub_1CD7850, which prints the standard "terminate called after throwing an instance of '...'" diagnostic to stderr, demangling the exception type name before display.

Why Embedded

PTXAS imports only libc, libpthread, libm, and libgcc_s (146 PLT stubs total). It has no dependency on any C++ runtime library. The only C++ ABI symbol in the PLT is __cxa_atexit (at 0x401989), used to register the terminate handler. By embedding the demangler and terminate handler directly, NVIDIA avoids a runtime dependency on libstdc++ or libc++abi, which would otherwise be required solely for exception type name display in fatal error messages. This is consistent with the binary's overall strategy of minimizing external dependencies.

Function Map

| Address | Function | Size | Role | Confidence |
|---|---|---|---|---|
| sub_1CDC780 | Demangler core (recursive-descent parser) | 93 KB | Parses Itanium-mangled names via large switch dispatch | HIGH (size, structure, callgraph isolation) |
| sub_1CE0600 | Recursive dispatch wrapper | 580 B | Re-enters the parser for nested name components (76 call sites from core) | HIGH (mutual recursion with sub_1CDC780) |
| sub_1CE23F0 | __cxa_demangle-compatible API | 340 B | Public entry: mangled string in, demangled string out, malloc-allocated | CERTAIN (API shape, status codes, free/memcpy/strlen callees) |
| sub_1CE1E60 | Parse entry point | ~200 B | Initializes parse state and invokes the core | HIGH (bridge between API and parser) |
| sub_1CD7850 | Terminate handler (__cxa_terminate) | 280 B | Prints "terminate called after throwing..." to stderr | CERTAIN (string: "terminate called after throwing an instance of '") |

Version Update Procedure

All addresses, function counts, and structural offsets in this wiki are specific to ptxas v13.0.88 (build cuda_13.0.r13.0/compiler.36424714_0, 37,741,528 bytes). When a new CUDA toolkit ships a different ptxas binary, the wiki must be updated. This section documents the procedure.

Version-Stable vs Version-Fragile Findings

Not everything changes between versions. Understanding what is stable dramatically reduces update effort.

Version-stable (survives across minor and most major releases unchanged):

| Category | Examples | Why stable |
|---|---|---|
| Algorithm logic | Copy propagation worklist walk, fatpoint pressure computation, MurmurHash3 constants | Algorithms are rarely rewritten between releases |
| Data structure layouts | Pool allocator bins at +2128, Mercury instruction node at 112 bytes, 16-byte phase objects | Struct layouts change only when fields are added or reordered |
| Knob names | MercuryUseActiveThreadCollectiveInsts, ScavInlineExpansion, all 2,000+ ROT13 names | Knob names are API-like -- changing them breaks internal test harnesses |
| ROT13 encoding | The ROT13 obfuscation layer itself, decoded by codecs.decode(s, "rot_13") | Obfuscation scheme has been consistent across observed versions |
| Phase count and ordering | 159 phases in the OCG pipeline, ordered by the PhaseManager vtable table | Phase count may grow but existing phases retain their relative order |
| Pipeline stage names | Parse-time, DAGgen-time, OCG-time, ELF-time, DebugInfo-time | Stage names are embedded in format strings unlikely to change |
| Subsystem names | OCG, Mercury, Ori, Scav | Internal codenames are stable across releases |
| Encoding handler template | 6-step pattern: opcode ID, movaps format descriptor, register class map, operand registration, finalize, bitfield extract | Template structure is generated from a stable code generator |
| Error message text | "SM does not support LDCU", "Invalid knob identifier" | Diagnostic strings are rarely reworded |
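The ROT13 row above can be exercised directly with Python's standard codecs module. A minimal sketch; the encoded string below is the computed ROT13 form of a knob name documented elsewhere in this wiki, not a value recovered from the binary:

```python
import codecs

def decode_knob(encoded: str) -> str:
    """ROT13-decode an obfuscated knob name. ROT13 is its own inverse,
    so the same call re-encodes a decoded name."""
    return codecs.decode(encoded, "rot_13")

# Computed encoding of "MercuryPresumeXblockWaitBeneficial", for illustration.
print(decode_knob("ZrephelCerfhzrKoybpxJnvgOrarsvpvny"))
```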

Version-fragile (changes with every recompilation):

| Category | Examples | Why fragile |
|---|---|---|
| Function addresses | Every sub_XXXXXX reference, vtable addresses like off_22BD5C8 | ASLR-style shifts from any code or data size change |
| Address ranges | Sweep boundaries 0x400000--0x4D5000, subsystem regions | Functions move when preceding code grows or shrinks |
| Function sizes | sub_446240 at 12,345 bytes | Inlining decisions change, optimizer improvements add/remove code |
| Caller/callee counts | sub_424070 at 3,809 callers | New call sites added, old ones removed |
| Struct offsets | context[133], context+1584 | New fields inserted into context structs |
| .rodata addresses | String locations like 0x202D4D8, encoding table addresses | Data layout shifts with code changes |
| Call graph edge counts | 548,693 edges | New functions and call sites |
| Total function count | 40,185 | New SM targets add encoding handlers |

Identifying Function Address Changes

When loading a new ptxas version into IDA:

  1. Extract the same 8 JSON artifacts using analyze_ptxas.py (or equivalent). The critical artifacts for diffing are ptxas_functions.json (address, size, callee list) and ptxas_strings.json (string content, xref locations).

  2. Match functions by invariant properties. Functions cannot be matched by address alone. Use these matching criteria in priority order:

    • String anchors. Functions containing unique string references (e.g., the function referencing "Please use -knob DUMPIR=AllocateRegisters") can be matched across versions by searching for the same string in the new binary. This is the highest-confidence matching method.
    • Size + callee signature. For functions without string anchors, match by (approximate size, sorted callee list). A function of ~2,100 bytes calling the pool allocator, OOM handler, and hash map insert is almost certainly the same function even if its address shifted by megabytes.
    • Callgraph position. Functions can be identified by their caller/callee topology: the phase factory is the function called from the PhaseManager constructor with 159+ case targets. The diagnostic emitter is the function with 2,000+ callers that calls vfprintf.
    • Vtable slot position. Phase execute() methods are at vtable slot 0. If the vtable table address changes but still contains 159 entries, the slot positions identify each phase.
    • Template fingerprinting. Encoding handlers matching the 6-step template (bitfield insert via the highest-caller utility, movaps from .rodata, operand registrars, finalize call) are encoding handlers in any version.
  3. Diff the function lists. Produce a mapping {old_addr -> new_addr} for all matched functions. Functions present in the new binary but absent in the old are new (likely new SM target support). Functions absent in the new binary are removed (dropped legacy SM support) or merged.
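The matching criteria above can be condensed into a small matching pass. This is an illustrative sketch, not part of analyze_ptxas.py; the artifact field names ("size", "callees"), the size bucketing, and the only-accept-unambiguous rule are assumptions:

```python
def match_functions(old_funcs, new_funcs, old_strings, new_strings):
    """Match functions across two ptxas versions by invariant properties.

    old_funcs/new_funcs: {addr: {"size": int, "callees": [...]}}
    old_strings/new_strings: {string_content: referencing_function_addr}
    Returns {old_addr: new_addr} for matched functions.
    """
    mapping = {}
    # 1. String anchors: a unique string referenced in both versions pins
    #    its referencing function with highest confidence.
    for s, old_addr in old_strings.items():
        if s in new_strings and old_addr in old_funcs:
            mapping[old_addr] = new_strings[s]
    # 2. Size + callee signature for the rest. Size is bucketed because
    #    inlining perturbs it; real matching would also compare mapped
    #    callee identities, not just counts.
    sig = lambda f: (f["size"] // 256, len(f["callees"]))
    new_by_sig = {}
    for addr, f in new_funcs.items():
        new_by_sig.setdefault(sig(f), []).append(addr)
    for addr, f in old_funcs.items():
        if addr in mapping:
            continue
        candidates = new_by_sig.get(sig(f), [])
        if len(candidates) == 1:  # only accept unambiguous matches
            mapping[addr] = candidates[0]
    return mapping
```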

Updating Sweep Reports

The 30-region sweep reports in ptxas/raw/ are version-locked historical records -- they document the analysis of v13.0.88 and should not be overwritten. For a new version:

  1. Re-run the sweep with new address ranges derived from the new binary's function list. The region partitioning should follow the same subsystem-aligned strategy: infrastructure first, then PhaseManager, then high-complexity subsystems, then batch encoding handlers.

  2. Name new reports with a version suffix: p2.01-sweep-v13.1-0xNNN-0xMMM.txt (or whatever scheme distinguishes the version).

  3. Cross-reference against old reports. For each region, note which functions moved, which are new, and which disappeared. The old sweep reports provide the expected function identities; the new sweep validates whether those identities still hold at the new addresses.

Pages Most Sensitive to Version Changes

These wiki pages require immediate updates when the binary changes:

| Page | Sensitivity | What changes |
|---|---|---|
| function-map.md | Critical | Every address in every table row. The entire page is address-indexed. |
| binary-layout.md | Critical | Section addresses, subsystem boundaries, address-range diagram. |
| VERSIONS.md | Critical | Binary size, build string, function count, version number. |
| pipeline/overview.md | High | Phase factory address, PhaseManager constructor address, vtable table address. |
| scheduling/algorithm.md | High | Scheduler function addresses, priority function addresses. |
| regalloc/algorithm.md | High | Allocator function addresses, fatpoint computation address. |
| codegen/encoding.md | High | Encoding handler address ranges, format descriptor addresses. |
| config/knobs.md | Medium | Knob constructor addresses (content of knob names is stable). |
| ir/instructions.md | Medium | Opcode numbers may shift if new instructions are added. |
| targets/index.md | Medium | New SM targets may appear, changing validation table sizes. |
| methodology.md | Low | The methodology itself is version-stable; only the "Scope and Scale" table needs updating. |

The update follows a five-step sequence. Steps 1-2 are mechanical; steps 3-5 require analyst judgment.

Step 1: Extract new IDA artifacts.

Load the new ptxas binary into IDA Pro 8.x. Run analyze_ptxas.py to produce the 8 JSON artifacts and per-function decompiled .c files. Store them in a version-specific directory (e.g., ptxas/ida-v13.1/ or alongside the existing artifacts with clear version labeling).

Step 2: Diff against the old artifacts.

Write or use a diff script that:

  • Compares ptxas_functions.json (old vs new) by matching on string anchors, size+callee signature, and callgraph position.
  • Produces a {old_addr -> new_addr} mapping for matched functions.
  • Lists unmatched functions in both directions (new functions, removed functions).
  • Compares ptxas_strings.json to detect new strings, removed strings, and strings whose xref functions changed.
  • Reports total function count delta, binary size delta, and new section addresses.

Step 3: Update address-sensitive pages.

Using the address mapping from Step 2:

  • Update every sub_XXXXXX reference in function-map.md, binary-layout.md, and all pages listed in the sensitivity table above.
  • Update the "Scope and Scale" table in methodology.md with new function counts, string counts, binary size, and build string.
  • Update VERSIONS.md with the new binary metadata.
  • For pages with address ranges (sweep boundaries, subsystem regions), recompute the ranges from the new function list.

Step 4: Verify key struct layouts.

Struct offset changes are the most dangerous kind of version drift because they silently invalidate decompiled code analysis. For each documented struct:

  • Re-decompile the struct's primary accessor function (e.g., sub_424070 for the pool allocator, sub_4280C0 for the TLS context).
  • Compare field offsets against the documented layout.
  • If offsets shifted, update the struct documentation and propagate the change to all pages that reference those offsets.

Priority structs to verify: pool allocator (free-list bins at +2128, mutex at +7128), TLS context (280 bytes), Mercury instruction node (112 bytes), scheduler context (~1000 bytes), allocator state (1590+ bytes), phase objects (16 bytes).

Step 5: Validate phase pipeline.

  • Re-extract the phase vtable table (find the new address of the 159-entry pointer array in .data.rel.ro).
  • Verify all 159 phases are present and in the expected order.
  • Check for new phases (count > 159) or removed phases (count < 159).
  • Re-run ptxas --fdevice-time-trace on a test kernel and cross-reference the phase names in the trace output against the wiki's phase list.
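Re-extracting the pointer array reduces to reading count little-endian 64-bit words once the array's virtual address has been translated to a file offset via the section headers. A sketch of that read (the offset translation itself is assumed done separately):

```python
import struct

def read_pointer_array(blob: bytes, file_off: int, count: int):
    """Read a little-endian array of 64-bit pointers -- e.g., the
    159-entry phase vtable table in .data.rel.ro -- from raw file bytes.
    A run of null entries usually means the offset translation is wrong."""
    return list(struct.unpack_from(f"<{count}Q", blob, file_off))
```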

Raw Data Locations

All raw analysis artifacts for the current version (v13.0.88) live in the repository under ptxas/:

| Directory | Contents |
|---|---|
| ptxas/raw/ | 40 sweep reports (p1.01--p1.30 plus sub-region splits), per-task investigation reports (P0_*, P1_*, P2_*, etc.) |
| ptxas/decompiled/ | Per-function Hex-Rays decompiled C files (sub_XXXXXX.c, named functions like ctor_003_0x4095d0.c) |
| ptxas/disasm/ | Per-function disassembly files |
| ptxas/graphs/ | Per-function control flow graph JSON files (80,078 files) |
| ptxas/ (root) | The 8 JSON artifacts (ptxas_functions.json, ptxas_strings.json, ptxas_callgraph.json, ptxas_xrefs.json, ptxas_comments.json, ptxas_names.json, ptxas_imports.json, ptxas_segments.json), the IDA database (ptxas.i64), the extraction script (analyze_ptxas.py), and the binary itself (ptxas) |
| ptxas/wiki/src/ | The wiki source pages (this document and all others) |

When updating to a new version, preserve the existing artifacts for v13.0.88 (rename or move to a versioned subdirectory) and store new artifacts alongside them. The sweep reports in ptxas/raw/ are historical records and should never be overwritten.

Limitations and Known Gaps

  • No dynamic validation of optimization correctness. All findings are from static analysis. The identified phase algorithms have not been tested against runtime inputs to verify they produce correct output for all corner cases.

  • 39.6% of functions are vtable-dispatched. Functions with zero static callers can only be reached by finding the vtable or function pointer table that references them. Some vtables in deep .rodata may have been missed, leaving some functions orphaned.

  • No upstream reference for any code. Unlike cicc (LLVM fork) or nvcc (EDG frontend), ptxas has no open-source analog. Every identification is from first principles. This limits confidence for functions where string evidence is absent and structural analysis is the only basis.

  • Template-generated code is indistinguishable. The ~4,000 SASS encoding handlers are generated from internal templates. Without the template source, mapping individual handlers to specific opcodes requires tracing the dispatch table entries, which has only been done for select handlers.

  • Mega-functions are partially opaque. The four functions exceeding 200 KB (sub_169B190 at 280 KB, sub_143C440 at 233 KB, sub_198BCD0 at 239 KB, sub_18A2CA0 at 231 KB) could not be decompiled by Hex-Rays. Their behavior is understood from their callee lists (13,000--15,870 callees each) and their position in the pipeline, but the internal dispatch logic is known only at the disassembly level.

  • ROT13 decoding is necessary but not sufficient. Decoding the 2,000+ knob names reveals the existence of tuning parameters but not their semantics. A knob named MercuryPresumeXblockWaitBeneficial can be decoded from ROT13, but understanding what "xblock wait beneficial" means requires analyzing the code paths that read the knob.

  • Version-specific addresses. All addresses in this wiki apply to ptxas v13.0.88 (build cuda_13.0.r13.0/compiler.36424714_0). Other CUDA toolkit versions will have different addresses, different function counts, and potentially different phase orderings. However, the analysis methodology (string-driven, vtable-driven, callgraph propagation) applies to any version.

  • Indirect calls are undercounted. The 548,693-edge call graph captures only direct call instructions resolved by IDA. Virtual calls through vtable pointers, function pointer callbacks, and computed jumps are not fully captured. The true call graph is significantly denser than what is recorded.

Corrections Log

This section documents every factual error discovered and corrected during the wiki improvement pass. Each entry records the error, the correction, affected pages, and the agent task that performed the fix. The full detail for each correction is in ptxas/raw/P5_11_corrections_log_report.txt.

Summary

| Metric | Count |
|---|---|
| Distinct factual errors corrected | 22 |
| Wiki pages with at least one fix | 30+ |
| Agent tasks that discovered errors | 15 |
| Agent tasks that propagated fixes | 5 |

Corrections by Severity

Systematic errors (affected 5+ pages each)

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 01 | Opcode numbering: wiki assumed two numbering systems; "Selected Opcode Values" table had wrong SASS mnemonic labels (e.g., 93=CALL, 95=EXIT, 97=MOV, 130=BAR) | One numbering system: the ROT13 name table index IS the instruction opcode. Correct labels: 93=OUT_FINAL, 95=STS, 97=STG, 130=HSET2 | 15 pages (ir/instructions, ir/cfg, passes/predication, passes/sync-barriers, passes/liveness, passes/general-optimize, passes/rematerialization, passes/copy-prop-cse, passes/strength-reduction, regalloc/abi, regalloc/spilling, intrinsics/sync-warp, codegen/isel, scheduling/latency-model, scheduling/algorithm) | P0-01, P4-02, P5-01 |
| 02 | Register class 6 = UB (Uniform Barrier); classes 2-6 all wrong | Class 6 = Tensor/Accumulator (MMA/WGMMA). Correct table: 2=R(alt), 3=UR, 4=UR(ext), 5=P/UP, 6=Tensor/Acc. Barrier regs use reg_type 9, outside the 7-class system | 7 pages (ir/registers, regalloc/overview, regalloc/algorithm, regalloc/spilling, passes/gmma-pipeline, intrinsics/tensor, ir/overview) | P0-02 |
| 03 | context+1584 had 5 conflicting names: code_object, sched_ctx, arch_backend, optimizer_state, function manager | Single object: SM-specific architecture backend ("sm_backend"), constructed per-compilation-unit in sub_662920 via SM version switch | 3 pages corrected (ir/data-structures, ir/overview, passes/copy-prop-cse); 14 pages acceptable as-is | P0-03 |

Identity misattributions

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 06 | sub_83EF00 (29KB) listed as "Top-level unrolling driver" | sub_83EF00 is MainPeepholeOptimizer (opcode switch on 2, 134, 133, 214, 213, 210). Actual unrolling driver: sub_1390B30 via Phase 22 entry sub_1392E30 | passes/loop-passes.md | P1-04, P5-03 |
| 07 | sub_926A30 (22KB) listed as "Main pipelining engine (modulo scheduling)" | sub_926A30 is the operand-level latency annotator and interference weight builder, called by sub_92C0D0 per-instruction | passes/loop-passes.md | P1-06 |
| 08 | sub_7E7380 described as "full structural equivalence" (opcode, type, all operands, register class comparison) | sub_7E7380 is 30 lines / 150 bytes: narrow predicate-operand compatibility check (predicate bit parity + last operand 24-bit ID + penultimate 8-byte encoding). Full structural comparison done by the 21 callers | passes/copy-prop-cse.md, passes/general-optimize.md | P1-07, P5-06 |

Inverted semantics

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 05 | isNoOp()=1 "means it executes unconditionally" | isNoOp()=1 means the dispatch loop SKIPS execute(). Code: if (!phase->isNoOp()) { phase->execute(ctx); } | passes/rematerialization.md | P0-05 |
| 09 | Hot-cold priority: "1 = cold, 0 = hot" | 1 = hot = higher priority, 0 = cold = lower priority. sub_A9CDE0 (hot detector) returns true -> bit 5 set -> higher priority | passes/hot-cold.md | P1-09, P5-06 |
| 10 | "Fatpoint" implied to be maximum-pressure point | Fatpoint scans for the MINIMUM-cost slot. The name refers to the exhaustive (fat) scan evaluating all slots, not to picking the maximum | (verified correct across all pages -- 0 fixes needed) | P1-10, P5-06 |

Wrong numeric values

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 04 | context+1552 = "Legalization stage counter" with 3 values (3, 7, 12) | Pipeline progress counter with 22 values (0-21) spanning all pipeline categories | 4 pages (ir/data-structures, passes/late-legalization, passes/rematerialization, passes/copy-prop-cse) | P0-04 |
| 12 | 5 SASS opcode mnemonic typos: PSMTEST, LGDEPBAR, LGSTS, UBLKPC, UTMAREDG | CSMTEST, LDGDEPBAR, LDGSTS, UBLKCP, UTMREDG | reference/sass-opcodes.md | P2-11 |
| 14 | WGMMA case 9 = 0x1D5D (7517), case 10 = 0x1D5E (7518) | Case 9 = 0x1D5E (7518), case 10 = 0x1D60 (7520). Codes 0x1D5D/0x1D5F are advisory (non-serialization) warnings | passes/gmma-pipeline.md | P3-25 |
| 15 | ABI minimum: gen 5 (sm_60-sm_89) = 16 regs, gen 9+ = 24 regs | gen 3-4 (sm_35-sm_53) = 16, gen 5-9 (sm_60-sm_100) = 24. Binary: (generation - 5) < 5 ? 24 : 16 | regalloc/abi.md | P3-26 |
| 17 | Unrolling rejection table at 0x21D1980 with 36-byte structures | Rejection string pointer array at 0x21D1EA0 with simple integer indices 7-24. The 0x21D1980 table is for peephole operand range lookups | passes/loop-passes.md | P1-04 |

Phantom data and scope errors

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 11 | "Approximately 80 additional entries bulk-copied from unk_21C0E00" at SASS opcode indices 322-401, "totaling roughly 402 named opcodes" | Table has exactly 322 entries. The 1288-byte block at unk_21C0E00 is a 322-element identity map {0,1,...,321} copied to a different data structure (encoding category map at obj+0x2478) | reference/sass-opcodes.md | P2-11 |
| 13 | "139 explicitly named phases and 20 architecture-specific unnamed phases" | All 159 phases have names in the static table at off_22BD0C0. The original 139-phase inventory missed 20 phases (e.g., OriCopyProp, Vectorization, MercConverter, AllocateRegisters) | pipeline/overview.md, passes/index.md | P2-14, P4-03 |
| 16 | Warning 7018 (0x1B6A) attributed to SUSPEND/preserved scratch diagnostic | Code 0x1B6A does not exist in the binary. The actual code is 7011 (0x1B63) | regalloc/abi.md | P3-26 |
| 18 | Unrolling rejection codes listed as 0x80000001-0x80000018 | Those hex values appear in diagnostic message STRINGS, not as internal codes. Internal codes are simple integers 7-24 | passes/loop-passes.md | P1-04 |

Minor corrections

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 19 | sub_80B700/sub_80BC80 listed as unrolling functions | Both are peephole optimizer functions (called through sub_83EF00), not unrolling | passes/loop-passes.md | P1-04 |
| 22 | general-optimize.md called sub_7E7380 "instruction_equivalent" / "structural instruction equivalence" in 6 locations | Renamed to "predicate_operand_compatible" / "predicate-operand compatibility check" | passes/general-optimize.md | P5-06 |

Error Categories

| Category | Count | Examples |
|---|---|---|
| Identity misattribution | 5 | Wrong function-to-role mappings, wrong names for context fields |
| Wrong numeric values | 5 | Wrong opcode labels, wrong hex codes, wrong thresholds, wrong addresses |
| Inverted semantics | 3 | isNoOp skip-vs-execute, hot-cold bit polarity, fatpoint min-vs-max |
| Conflicting definitions | 3 | Register class contradictions across pages |
| Phantom data | 2 | Nonexistent SASS entries 322-401, nonexistent warning 7018 |
| Scope mischaracterization | 2 | context+1552 scope too narrow, phase naming scope too narrow |
| Encoding confusion | 2 | Hex-in-message-string vs internal code, wrong address for lookup table |

Lessons Learned

  1. Behavioral inference is unreliable for opcode identity. Observing that an opcode appears in branch contexts does not make it BRA. Always check the authoritative ROT13 name table.

  2. Cross-page consistency checks catch conflicting speculations. Five pages independently naming the same field (context+1584) is a strong signal that at least four are wrong.

  3. Counts from partial analysis are systematically low. The "3 values" for context+1552 and "139 named phases" both resulted from stopping the search too early. Exhaustive binary sweeps consistently reveal more entries.

  4. Function size is not a reliable identity signal. sub_83EF00 (29KB) was large enough to seem like a major driver, but size alone does not distinguish a peephole optimizer from a loop unroller.

  5. ROT13 decoding + binary cross-validation is the gold standard. Every correction that replaced speculative labels with ROT13-decoded names has held up under subsequent audits.

Version Tracking

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page documents the exact ptxas binary under analysis and the version-related metadata recovered from the stripped ELF.

Binary Under Analysis

| Field | Value |
|---|---|
| Tool | ptxas (PTX optimizing assembler) |
| Version | 13.0.88 |
| Build tag | cuda_13.0.r13.0/compiler.36424714_0 |
| Build date | Wed Aug 20 01:55:12 PM PDT 2025 |
| Source path | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h |
| ELF size | 37,741,528 bytes (37.7 MB) |
| Architecture | x86-64 (AMD64) |
| Linking | Dynamically linked, stripped |
| Functions | ~40,000 (estimated from IDA/Ghidra DB) |

Embedded Version Strings

sub_432A00 (0x432A00, CLI option registration) contains the self-identification strings that ptxas prints for --version / --list-version:

| String | Role |
|---|---|
| "Ptx optimizing assembler" | Product name |
| "NVIDIA (R)" | Vendor |
| Copyright 2005-2025 | Date range |
| "ptxocg.0.0" | OCG backend version tag |

The "ptxocg.0.0" tag also appears in sub_43A400 (compilation setup) and at address 0x1CE74AB in the .rodata section, identifying the backend optimizer component embedded inside ptxas.

Default Target Architecture

sub_6784B0 returns sm_75 (Turing) as the default compilation target when no --gpu-name flag is supplied. This is consistent with the CUDA 13.0 toolkit defaulting to a Turing-class GPU.

The full set of architecture strings referenced in the front-end validators (addresses 0x460000-0x4D5000) includes:

sm_20  sm_30  sm_35  sm_50  sm_60  sm_75  sm_80  sm_86  sm_89  sm_90

with sm_%d format-string patterns covering all supported SM codes.

Output ELF Format

Cubins emitted by ptxas are standard ELF files with:

| Field | Value |
|---|---|
| e_machine | EM_CUDA (0xBE = 190) |
| ELF class | ELFCLASS32 or ELFCLASS64 (per target) |
| Custom section type | SHT_CUDA_INFO = 0x70000064 |
| Magic (code object header) | 0x16375564E ("dUWc" + version nibble) |
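The e_machine check above needs no ELF library, since the field sits at a fixed offset in the standard ELF header. A minimal sketch (standard ELF layout, not ptxas-specific):

```python
import struct

EM_CUDA = 190  # 0xBE, per the table above

def is_cubin(path: str) -> bool:
    """Check a file's ELF magic and e_machine field for EM_CUDA.

    Standard ELF layout: magic "\\x7fELF" at offset 0; e_machine is the
    little-endian u16 at offset 18 (same for ELFCLASS32 and ELFCLASS64).
    """
    with open(path, "rb") as f:
        hdr = f.read(20)
    if len(hdr) < 20 or hdr[:4] != b"\x7fELF":
        return False
    return struct.unpack_from("<H", hdr, 18)[0] == EM_CUDA
```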

The SM-version-to-code-object mapping lives in the ELF emitter at sub_1C9F280. Example encodings recovered from sub_A3D000 range:

| field[93] | Target | Version encoding |
|---|---|---|
| 12288 | sm_30 | 0x70007 |
| 20481 | sm_50 | 0xC000C |

Build System Metadata

The source path leaked through __FILE__ macros in the knobs infrastructure reveals the NVIDIA internal build tree layout:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/
  drivers/common/utils/generic/impl/generic_knobs_impl.h

Key observations:

  • /dvs/p4/ -- Perforce depot root on the DVS (Driver Verification System) build farm.
  • sw/rel/gpgpu/toolkit/r13.0/ -- Release branch for CUDA toolkit 13.0.
  • compiler/drivers/common/ -- Shared compiler driver code (used by both ptxas and cicc).
  • generic_knobs_impl.h -- The knob system implementation header; the __FILE__ macro at lines 395-1090 of this file is embedded in ptxas error metadata.

Evidence Index

| Claim | Source |
|---|---|
| Version 13.0.88, 37.7 MB | Headers of all 30 sweep reports (e.g. p1.23, p1.28) |
| sub_432A00 strings | p1.01 lines 514-521 |
| sub_6784B0 default sm_75 | User-provided; corroborated by sm_75 prevalence across all validators |
| Source path | p1.05 lines 14-16, p1.04a line 628 |
| ptxocg.0.0 | p1.01 line 553, p1.05 line 1256 |
| ELF emitter / EM_CUDA | p1.30 lines 46-69 |
| SM version encoding table | p1.08b lines 217-237 |

Key Functions

| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_432A00 | -- | CLI option registration; contains --version / --list-version self-identification strings ("Ptx optimizing assembler", "NVIDIA (R)", "ptxocg.0.0") | 0.92 |
| sub_43A400 | -- | Compilation setup; references the "ptxocg.0.0" backend version tag | 0.85 |
| sub_6784B0 | -- | Default target architecture selector; returns sm_75 (Turing) when no --gpu-name flag is supplied | 0.90 |
| sub_1C9F280 | -- | ELF emitter; SM-version-to-code-object mapping for cubin output | 0.85 |
| sub_A3D000 | -- | SM version encoding table; example encodings (12288 = sm_30, 20481 = sm_50) | 0.80 |

Compilation Pipeline Overview

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page maps the complete end-to-end flow of a PTX assembly through ptxas v13.0.88, from the initial CLI invocation to the final ELF/cubin binary output. Each stage is a self-contained subsystem with its own address range, data structures, and failure modes. The links below lead to dedicated pages with reimplementation-grade detail for every stage.

Pipeline Diagram

nvcc / cicc
  |  (PTX text file or --input-as-string)
  v
+================================================================+
| ptxas v13.0.88 (37.7 MB, ~40,000 functions)                   |
|                                                                |
|  1. Entry & CLI Parsing ----------> [entry.md]                 |
|     |  main -> sub_446240 -> sub_434320                        |
|     |  target arch, opt level, --maxrregcount, knobs           |
|     v                                                          |
|  2. PTX Lexer + Parser -----------> [ptx-parser.md]            |
|     |  sub_451730: Flex scanner, Bison grammar                 |
|     |  ROT13-decoded opcode table (900+ mnemonics)             |
|     |  30+ per-instruction semantic validators                 |
|     v                                                          |
|  3. PTX Directive Handling --------> [ptx-directives.md]       |
|     |  .version, .target, .entry, .func, .reg, .shared         |
|     |  register constraints, ABI configuration                 |
|     v                                                          |
|  4. PTX-to-Ori Lowering ----------> [ptx-to-ori.md]           |
|     |  PTX AST -> Ori IR (basic blocks, virtual registers)     |
|     |  address space annotation, special register mapping      |
|     v                                                          |
|  5. 159-Phase Optimization -------> [optimizer.md]             |
|     |  PhaseManager: sub_C62720 (constructor),                 |
|     |                sub_C64F70 (executor)                     |
|     |  10 stages, 17 AdvancedPhase hooks,                     |
|     |  8-phase Mercury encoding sub-pipeline                   |
|     |  per-kernel via sub_7FBB70 -> sub_7FB6C0                 |
|     v                                                          |
|  6. Register Allocation ----------> [../regalloc/overview.md]  |
|     |  Fatpoint algorithm, phase 101 (AdvancedPhaseAllocReg)   |
|     |  spill/fill insertion, ABI register reservations          |
|     v                                                          |
|  7. Instruction Scheduling -------> [../scheduling/overview.md]|
|     |  3-phase: pre-schedule (97), post-schedule (106),        |
|     |           post-fixup (111)                               |
|     |  scoreboard generation, dependency barriers              |
|     v                                                          |
|  8. SASS Encoding ----------------> [../codegen/encoding.md]   |
|     |  530 instruction encoding handlers (vtable dispatch)     |
|     |  Mercury format: phases 113-122                          |
|     |  Capsule Mercury (default on sm_100+)                    |
|     v                                                          |
|  9. ELF/Cubin Output -------------> [output.md]               |
|     |  sub_612DE0 (finalizer) -> sub_1C9F280 (ELF emitter)    |
|     |  section layout, symbol table, relocations               |
|     |  DWARF debug info, EIATTR attributes                     |
|     v                                                          |
|  OUTPUT: .cubin / .o (ELF)                                    |
+================================================================+

Side paths:
  * Capsule Mercury (--cap-merc) -----> [../codegen/capmerc.md]
  * Debug info (all stages) ----------> [../output/debug-info.md]
  * SASS text (--verbose) ------------> [../codegen/sass-printing.md]

Narrative Walk-Through: One Kernel, Start to Finish

A concrete trace of a single-kernel PTX module compiled for sm_100 at -O2:

1. PTX text arrives (~2--200 KB). Either read from a .ptx file or received in-memory via --input-as-string from nvcc. The driver sub_446240 establishes a setjmp recovery point, parses CLI options into the 1,352-byte options block, and allocates the "Top level ptxas memory pool".

2. Lexer + Parser (sub_451730). A Flex-generated scanner tokenizes the PTX text into a token stream. Tokens flow into a Bison-generated LALR parser that builds an AST. The opcode dispatch table (sub_46E000, 93 KB, 1,168 callees) routes each instruction mnemonic through ROT13 decoding, type resolution, and 30+ per-instruction semantic validators. For a 5 KB PTX kernel, the parser typically produces ~200--500 AST nodes with ~50 virtual register declarations. The "PTX parsing state" pool holds all AST memory.

3. Directive processing and CompileUnitSetup. .version/.target directives configure the SM profile via sub_6765E0 (54 KB profile constructor). .entry/.func directives establish the kernel boundary. .reg/.shared/.const directives declare resources. sub_43B660 computes the physical register budget from .maxnreg, --maxrregcount, and .maxntid constraints. The 1,936-byte profile object is now populated with codegen factory value (36864 for sm_100), scheduling parameters, and capability flags.

4. PTX-to-Ori lowering (DAGgen). sub_6273E0 (44 KB) converts each AST instruction into an Ori IR node: a basic block with virtual registers, control flow edges, and memory space annotations. Special registers (%ntid, %laneid, %smid) map to internal IDs. Address computation uses a 6-bit operand type encoding. A 500-instruction PTX kernel typically produces ~600--1,200 Ori instructions (expansion from pseudo-ops, address calculations, and predicate materialization). The "Permanent OCG memory pool" is created here to hold all IR state.

5. 159-phase OCG pipeline (sub_C62720 constructs, sub_C64F70 executes). Each phase is a 16-byte polymorphic object with execute(), isNoOp(), and getName() vtable methods. The PhaseManager iterates the phase table at 0x22BEEA0, skipping any phase whose isNoOp() returns true. At -O2, roughly 80--100 of the 159 phases are active. Typical expansion factors: the initial 1,000 Ori instructions may grow to 1,200--1,500 after unrolling and intrinsic expansion, then shrink to 800--1,000 after CSE/DCE, then re-expand to 1,500--2,500 after register allocation spill/fill insertion. The PhaseManager logs "Before <phase>" / "After <phase>" strings (visible in the sub_C64F70 decompile) for DUMPIR.
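The phase-object protocol described above can be sketched in miniature. Method names mirror the recovered vtable (execute, isNoOp, getName); everything else is illustrative:

```python
class Phase:
    """Stand-in for the 16-byte polymorphic phase object."""
    def __init__(self, name: str, active: bool):
        self.name, self.active = name, active

    def getName(self) -> str:
        return self.name

    def isNoOp(self) -> bool:
        # isNoOp() == True means the dispatch loop SKIPS this phase.
        return not self.active

    def execute(self, ctx) -> None:
        pass  # real phases transform the Ori IR in ctx

def run_pipeline(phases, ctx):
    """Mirrors the recovered dispatch: if (!phase->isNoOp()) execute,
    with 'Before <phase>' / 'After <phase>' logging for DUMPIR."""
    trace = []
    for phase in phases:
        if not phase.isNoOp():
            trace.append(f"Before {phase.getName()}")
            phase.execute(ctx)
            trace.append(f"After {phase.getName()}")
    return trace
```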

6. Register allocation (phase 101, sub_971A90). The Fatpoint algorithm attempts NOSPILL allocation first. If pressure exceeds the register budget, the spill guidance engine (sub_96D940, 84 KB) computes spill candidates across 7 register classes, and the retry loop makes up to N attempts (knob 638/639) with progressively more aggressive spilling. Physical register assignments are committed; spill/fill instructions are inserted into the Ori IR.

7. Instruction scheduling (phases 97, 106, 111). Three scheduling passes assign dependency barriers and reorder instructions for pipeline throughput. The scoreboard generator tracks 6 dependency barriers per warp. For a 1,500-instruction kernel, scheduling typically produces a ~2,000--3,000-entry instruction stream after barrier insertion and NOP padding.

8. SASS encoding (phases 113--122). Each Ori instruction is lowered to a 128-bit SASS binary instruction via the 530-handler vtable dispatch. The 1,280-bit (160-byte) encoding workspace at instruction+544 is filled by sub_7B9B80 (bitfield insert, 18,347 callers). A 2,000-instruction kernel produces ~32 KB of raw SASS binary. On sm_100+, Capsule Mercury (capmerc) is the default format, embedding PTX source alongside the SASS.
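A bitfield-insert primitive in the spirit of sub_7B9B80 can be sketched as follows. The function name, argument order, and bit-addressing convention are assumptions; only the idea of writing an arbitrary-width field at an arbitrary bit offset into a byte workspace (160 bytes at instruction+544 in the real binary) is taken from the analysis:

```c
#include <assert.h>
#include <stdint.h>

/* Write `width` bits of `value` at bit offset `pos` into a
 * little-endian byte workspace, clearing any bits already set. */
static void bf_insert(uint8_t *ws, unsigned pos, unsigned width, uint64_t value) {
    for (unsigned i = 0; i < width; i++, pos++) {
        unsigned byte = pos >> 3, bit = pos & 7;
        if ((value >> i) & 1)
            ws[byte] |= (uint8_t)(1u << bit);
        else
            ws[byte] &= (uint8_t)~(1u << bit);
    }
}
```

With 18,347 call sites, essentially every field of every SASS encoding table funnels through a primitive of this shape.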

9. ELF/cubin emission (sub_612DE0, 47 KB). The finalizer assembles the cubin: .text.FUNCNAME (SASS binary), .nv.info.FUNCNAME (EIATTR attributes), .nv.shared.FUNCNAME (shared memory layout), .nv.constant0.FUNCNAME (constant bank), plus global sections (.shstrtab, .strtab, .symtab). Section layout (sub_1CABD60, 67 KB) assigns addresses; the master ELF emitter (sub_1C9F280, 97 KB) writes headers, section tables, and program headers. A single-kernel cubin for a medium-complexity kernel is typically 40--120 KB.

Approximate data sizes at each stage (medium kernel, sm_100, -O2):

| Stage | Input | Output | Peak Memory |
|---|---|---|---|
| PTX text | -- | 5--50 KB text | 100 KB (file buffer + parser state) |
| AST | Token stream | 200--500 nodes (~40--100 KB) | 200 KB |
| Ori IR (initial) | AST | 600--1,200 instructions (~100--250 KB) | 500 KB |
| Ori IR (post-OCG) | 1,200 instr | 1,500--2,500 instr (~300--600 KB) | 2--8 MB (peak during regalloc) |
| SASS binary | Scheduled IR | 32--128 KB | 1 MB |
| Cubin (ELF) | SASS + metadata | 40--120 KB | 2 MB |

Timed Phases

The compilation driver sub_446240 measures six timed phases per compile unit and reports them when --compiler-stats is enabled. The format strings are embedded directly in the binary:

| Phase | Format String | Subsystem |
|---|---|---|
| Parse-time | "Parse-time : %.3f ms (%.2f%%)\n" | PTX lexer + Bison parser + semantic validation |
| CompileUnitSetup-time | "CompileUnitSetup-time : %.3f ms (%.2f%%)\n" | Target configuration, ABI setup, register constraints |
| DAGgen-time | "DAGgen-time : %.3f ms (%.2f%%)\n" | PTX-to-Ori lowering, CFG construction, initial DAG formation |
| OCG-time | "OCG-time : %.3f ms (%.2f%%)\n" | Optimized Code Generation: all 159 optimization phases, register allocation, instruction scheduling, SASS encoding |
| ELF-time | "ELF-time : %.3f ms (%.2f%%)\n" | ELF construction, section layout, symbol table, relocations, EIATTR, file write |
| DebugInfo-time | "DebugInfo-time : %.3f ms (%.2f%%)\n" | DWARF .debug_info/.debug_line/.debug_frame generation, LEB128 encoding |

Additional aggregate stats:

CompileTime = %f ms (100%)
PeakMemoryUsage = %.3lf KB

The per-unit header prints "\nCompile-unit with entry %s" before each kernel's phase breakdown.

Per-Kernel Parallelism

ptxas supports two compilation modes for multi-kernel PTX modules:

Single-Threaded Mode (Default)

The compilation driver sub_446240 iterates over compile units sequentially. For each kernel entry:

  1. sub_43CC70 -- per-entry compilation unit processor, skips __cuda_dummy_entry__
  2. sub_7FBB70 -- per-kernel entry point, prints "\nFunction name: " + kernel name
  3. sub_7FB6C0 -- pipeline orchestrator: builds phases via sub_C62720, executes via sub_C64F70
  4. Cleanup: destroys 17 analysis data structures (live ranges, register maps, scheduling state)

Each kernel runs through the entire 159-phase pipeline independently. Cross-kernel state is limited to shared memory layout and the global symbol table.

Thread Pool Mode (--split-compile)

When --allow-expensive-optimizations or --split-compile is active, ptxas uses a pthread-based thread pool for per-kernel parallelism:

  • Pool constructor (sub_1CB18B0): allocates a 184-byte pool struct (0xB8), spawns N detached worker threads via pthread_create, initializes mutex at +24, two condition variables at +64 and +112
  • Task submit (sub_1CB1A50): allocates a 24-byte task node {func_ptr, arg, next}, enqueues via linked list, broadcasts on cond_work
  • Jobserver integration (sub_1CC7300): reads MAKEFLAGS environment variable, parses --jobserver-auth= for either fifo: named pipes or pipe-based file descriptors, throttles thread count to respect GNU Make's -j slot limit
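The recovered task-node shape {func_ptr, arg, next} and the submit path can be sketched as below. Struct and function names are invented; the 24-byte node layout, the mutex-protected linked list, and the condition-variable broadcast are the recovered details (the real list discipline may differ from the LIFO push shown here):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* 24-byte task node, matching the recovered layout. */
typedef struct Task {
    void (*func)(void *);  /* +0  */
    void *arg;             /* +8  */
    struct Task *next;     /* +16 */
} Task;

typedef struct {
    pthread_mutex_t lock;       /* pool mutex (at +24 in the real struct) */
    pthread_cond_t  cond_work;  /* workers wait here                      */
    Task *head;                 /* pending-task list                      */
} Pool;

/* Submit path in the style of sub_1CB1A50: allocate a node, enqueue
 * under the lock, wake all workers. */
static void pool_submit(Pool *p, void (*func)(void *), void *arg) {
    Task *t = malloc(sizeof *t);
    t->func = func;
    t->arg  = arg;
    pthread_mutex_lock(&p->lock);
    t->next = p->head;          /* LIFO push; ordering is an assumption */
    p->head = t;
    pthread_cond_broadcast(&p->cond_work);
    pthread_mutex_unlock(&p->lock);
}
```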

The thread pool is used throughout the OCG and ELF phases (stages 5-9 in the diagram). Each worker thread receives its own thread-local context (sub_4280C0, 280-byte TLS struct with per-thread error flags, diagnostic suppression state, Werror flag, and synchronization primitives).
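The jobserver handshake mentioned above can be illustrated with a small MAKEFLAGS scanner. This is a sketch of GNU Make's documented protocol, not the recovered parsing code of sub_1CC7300; the function name and return convention are invented:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scan MAKEFLAGS for --jobserver-auth= and classify the two forms:
 * returns 0 (none), 1 (fifo:PATH, path copied out), or 2 (fd pair). */
static int parse_jobserver(const char *makeflags, char *path, size_t pathsz,
                           int *rfd, int *wfd) {
    const char *p = strstr(makeflags, "--jobserver-auth=");
    if (!p)
        return 0;
    p += strlen("--jobserver-auth=");
    if (!strncmp(p, "fifo:", 5)) {        /* named-pipe form */
        size_t n = strcspn(p + 5, " ");
        if (n >= pathsz)
            n = pathsz - 1;
        memcpy(path, p + 5, n);
        path[n] = 0;
        return 1;
    }
    /* legacy pipe form: two file descriptors "R,W" */
    return sscanf(p, "%d,%d", rfd, wfd) == 2 ? 2 : 0;
}
```

Whichever form is found, the pool constructor can then cap its worker count to the number of jobserver slots it can actually acquire.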

Thread-Local Context Layout

struct ThreadLocalContext {  // 280 bytes (0x118), per-thread via pthread_getspecific
    uint64_t error_flags;          // +0:   error/warning state flags
    uint64_t has_error;            // +8:   nonzero if error occurred
    // +16..+48: internal fields (jmp_buf pointer, pool pointer, counters)
    uint8_t  diag_suppress;        // +49:  diagnostic suppression flag
    uint8_t  werror_flag;          // +50:  --Werror promotion flag
    // +51..+127: reserved / internal state
    pthread_cond_t  cond;          // +128: condition variable (48 bytes)
    pthread_mutex_t mutex;         // +176: per-thread mutex (40 bytes)
    sem_t           sem;           // +216: semaphore (32 bytes)
    // +248..+279: linked-list pointers (global thread list at +256/+264)
};

Accessed by sub_4280C0 (3,928 callers -- the single most-called function in the binary). On first call in a new thread, allocates and initializes via malloc(0x118) + memset + pthread_cond_init + pthread_mutex_init + sem_init. The decompiled code confirms the 280-byte size: v5 = malloc(0x118u), followed by memset(v5, 0, 0x118u), pthread_cond_init(v5 + 128), pthread_mutex_init(v5 + 176), sem_init(v5 + 216). After initialization, the struct is inserted into a global doubly-linked list (offsets +256 and +264 hold prev/next pointers, protected by a global mutex). The pthread_setspecific(key, v5) call stores the pointer for subsequent pthread_getspecific retrieval.
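The lazy-initialization pattern of sub_4280C0 can be sketched as a standard pthread_getspecific accessor. The key creation, the malloc(0x118)+memset sequence, and the setspecific call are the recovered behavior; the pthread_once wrapper, function names, and the omission of the cond/mutex/sem initializers are simplifications:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_key_t  g_tls_key;
static pthread_once_t g_tls_once = PTHREAD_ONCE_INIT;

static void tls_make_key(void) { pthread_key_create(&g_tls_key, free); }

/* First call in a thread allocates and registers the 280-byte context;
 * later calls return the cached pointer. */
static void *get_tls_context(void) {
    pthread_once(&g_tls_once, tls_make_key);
    void *ctx = pthread_getspecific(g_tls_key);
    if (!ctx) {
        ctx = malloc(0x118);             /* 280 bytes */
        memset(ctx, 0, 0x118);
        /* real code also does: pthread_cond_init(ctx+128),
           pthread_mutex_init(ctx+176), sem_init(ctx+216), then links
           the struct into the global thread list at +256/+264 */
        pthread_setspecific(g_tls_key, ctx);
    }
    return ctx;
}
```

The heavy call count (3,928 sites) makes sense with this shape: every diagnostic, error check, and Werror test pays one pthread_getspecific to reach per-thread state.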

Key Function Call Chain

The top-level control flow from program entry to ELF output:

main (0x409460, 84 bytes)
  |  setvbuf(stdout/stderr, unbuffered)
  v
sub_446240 (0x446240, 11KB) ---- "Top-level compilation driver"
  |
  |-- sub_434320 (0x434320, 10KB) -- Parse CLI options, validate flags
  |     reads: --gpu-name, --maxrregcount, --opt-level, --verbose,
  |            --compiler-stats, --split-compile, --fast-compile
  |
  |-- [allocate "Top level ptxas memory pool"]
  |-- [allocate "Command option parser" pool]
  |
  |-- sub_445EB0 (setup) ----------- Target configuration, texturing mode
  |     sub_43A400 --------------- SM-specific defaults ("ptxocg.0.0")
  |     sub_43B660 --------------- Register/resource constraint calculation
  |
  |-- sub_451730 (0x451730, 14KB) -- Parser initialization
  |     |  "PTX parsing state" pool allocation
  |     |  Builtin symbol table: %ntid, %laneid, %smid, %clock64, ...
  |     |  sub_46E000 (93KB) ---- Opcode-to-handler dispatch table (1168 callees)
  |     v
  |   [Flex lexer + Bison parser: PTX text -> AST]
  |
  |-- for each compile unit:
  |     sub_4428E0 (0x4428E0, 14KB) -- PTX input validation
  |     |  .version/.target checks, ABI mode selection
  |     |  --extensible-whole-program, --compile-only handling
  |     |
  |     sub_43CC70 (5.4KB) --------- Per-entry unit processor
  |     |  skip __cuda_dummy_entry__
  |     |  generate .sass and .ucode sections
  |     |
  |     sub_7FBB70 (198 bytes) ----- Per-kernel entry point
  |       |
  |       sub_7FB6C0 (1.2KB) ------- Pipeline orchestrator
  |         |  check knob 298 (NamedPhases mode)
  |         |  if NamedPhases: delegate to sub_9F63D0
  |         |  else:
  |         |    sub_C62720 -- PhaseManager constructor (159 phases)
  |         |    sub_C60D20 -- get default phase table (at 0x22BEEA0)
  |         |    sub_C64F70 -- execute all phases
  |         |  cleanup: destroy 17 analysis data structures
  |         v
  |       [159-phase pipeline: optimization -> regalloc -> scheduling -> encoding]
  |
  |-- sub_612DE0 (0x612DE0, 47KB) -- Kernel finalizer / ELF builder
  |     |  "Finalizer fastpath optimization"
  |     |  version: "Cuda compilation tools, release 13.0, V13.0.88"
  |     |  build:   "Build cuda_13.0.r13.0/compiler.36424714_0"
  |     |
  |     sub_1CB53A0 (13KB) ------- ELF world initializer (672-byte object)
  |     |  "elfw memory space", .shstrtab, .strtab, .symtab
  |     |
  |     sub_1CB3570 (10KB) ------- Add .text.FUNCNAME sections (44 callers)
  |     sub_1CB68D0 (49KB) ------- Symbol table builder
  |     sub_1CABD60 (67KB) ------- Section layout & memory allocation
  |     sub_1CD48C0 (22KB) ------- Relocation resolver
  |     sub_1C9B110 (23KB) ------- Mercury capsule builder (capmerc)
  |     sub_1C9F280 (97KB) ------- Master ELF emitter (largest in range)
  |     sub_1CD13A0 (11KB) ------- Final file writer
  |
  v
[report CompileTime, PeakMemoryUsage, per-phase breakdown]

Memory Pools

ptxas uses a custom hierarchical pool allocator (sub_424070 / sub_4248B0, the most-called allocation functions with 3,809 and 1,215 callers respectively) instead of the system malloc/free. Three named pools are created during the top-level driver:

| Pool Name | Created By | Lifetime | Purpose |
|---|---|---|---|
| "Top level ptxas memory pool" | sub_446240 | Entire compilation | Global allocations, cross-kernel data structures |
| "Command option parser" | sub_446240 | Entire compilation | CLI option storage, flag validation state |
| "Permanent OCG memory pool" | OCG initialization | Per-kernel | Optimization phase state, instruction IR, register maps |

Additional per-subsystem pools exist:

  • "PTX parsing state" -- created by sub_451730, holds the lexer/parser symbol tables and AST nodes
  • "elfw memory space" -- created by sub_1CB53A0, holds the ELF world object (672 bytes) and section data

Pool Allocator Internals

The allocator at sub_424070 implements a dual-path design:

  • Small allocations (up to 4,999 bytes / 0x1387): 8-byte-aligned, size-class binned free lists at pool struct offset +2128. Pop from free list head on alloc, push back on free.
  • Large allocations (above 4,999 bytes): boundary-tag allocator with coalescing of adjacent free blocks.
  • Thread safety: pthread_mutex_lock/unlock around all pool operations, mutex at pool struct offset +7128.
  • OOM handling: calls sub_42BDB0 (3,825 callers) which triggers longjmp-based fatal abort via sub_42F590.
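The small-allocation fast path can be sketched as size-class free lists. Only the 4,999-byte threshold, the 8-byte alignment, and the pop-on-alloc/push-on-free discipline are recovered; the bin formula and all names below are illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_MAX 4999u   /* 0x1387, the recovered small/large threshold */

typedef struct FreeNode { struct FreeNode *next; } FreeNode;

/* Illustrative bin selection: 8-byte alignment, one list per 8-byte
 * size step (the real binning math is not recovered). */
static size_t small_bin(size_t size) {
    return ((size + 7u) & ~(size_t)7u) / 8u;
}

static void *bin_pop(FreeNode **bins, size_t size) {
    size_t b = small_bin(size);
    FreeNode *n = bins[b];
    if (n)
        bins[b] = n->next;   /* pop free-list head on alloc */
    return n;
}

static void bin_push(FreeNode **bins, size_t size, void *p) {
    size_t b = small_bin(size);
    FreeNode *n = p;
    n->next = bins[b];       /* push back on free */
    bins[b] = n;
}
```

Allocations above SMALL_MAX would instead take the boundary-tag path with coalescing, which this sketch omits.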

Pipeline Stage Breakdown

Terminology note. The 6 stages below (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo) correspond to the 6 timed phases measured by --compiler-stats. They cover the entire program lifecycle. The OCG stage (Stage 4 here) is itself subdivided into 10 internal stages in the Pass Inventory, numbered OCG-Stage 1--10. To avoid confusion, cross-references use "timed phase" for the 6 whole-program stages and "OCG stage" for the 10 optimizer sub-stages.

Stage 1: Parse (Parse-time)

The Flex-generated scanner and Bison-generated parser consume PTX text and produce an internal AST. The opcode dispatch table at sub_46E000 (93KB, 1,168 callees) registers type-checking rules for every PTX instruction. Thirty separate validator functions (in 0x460000-0x4D5000) enforce SM architecture requirements, PTX version constraints, operand types, and state space compatibility. See PTX Parser.

Stage 2: CompileUnitSetup (CompileUnitSetup-time)

Target configuration via sub_43A400: sets SM-specific defaults (texturing mode, cache policies, def-load-cache, force-load-cache), applies --fast-compile shortcuts, configures ABI (parameter registers, return address register, scratch registers). Register constraints computed by sub_43B660 from .maxnreg, --maxrregcount, .minnctapersm, and .maxntid directives. See Entry Point & CLI.

Stage 3: DAGgen (DAGgen-time)

Lowers the validated PTX AST into the Ori intermediate representation: basic blocks with a control flow graph, virtual registers, and memory space annotations. Special PTX registers (%ntid, %laneid, %smid, %ctaid, etc.) are mapped to internal identifiers. Operand processing at sub_6273E0 (44KB) handles address computation with a 6-bit operand type encoding. See PTX-to-Ori Lowering.

Stage 4: OCG (OCG-time)

The core of ptxas: the 159-phase Optimized Code Generation pipeline. This single timed phase encompasses:

  • Early optimization (phases 13-36): general optimization, branch/switch, loop simplification, strength reduction, unrolling, pipelining, barrier removal
  • Mid-level optimization (phases 37-58): GVN/CSE, reassociation, commoning, late expansion, speculative hoisting
  • Late optimization (phases 59-95): loop fusion, predication, GMMA propagation, legalization
  • Register allocation (phase 101): Fatpoint algorithm
  • Instruction scheduling (phases 97, 106, 111): pre-schedule, post-schedule, post-fixup
  • Mercury encoding (phases 113-122): SASS binary format generation

The PhaseManager (sub_C62720) instantiates phases via a 159-case factory switch (sub_C60D30), each phase a 16-byte polymorphic object with a vtable providing execute(), isNoOp(), and getName() methods. See Optimization Pipeline.

Stage 5: ELF (ELF-time)

The finalizer sub_612DE0 (47KB) assembles the NVIDIA ELF/cubin from the compiled SASS. Section layout (sub_1CABD60, 67KB) assigns addresses to shared memory, constant banks (with OCG deduplication), local memory, and reserved shared memory (.nv.reservedSmem.begin/cap/offset0). The master ELF emitter sub_1C9F280 (97KB) constructs headers, section tables, and program headers. Three binary output modes exist:

  1. mercury -- traditional SASS binary format
  2. capmerc -- Capsule Mercury (default on sm_100+), embeds PTX source in .nv.merc.* sections
  3. sass -- direct SASS output

See ELF/Cubin Output.

Stage 6: DebugInfo (DebugInfo-time)

DWARF debug information generation: .debug_info, .debug_line, .debug_frame sections. The LEB128 encoder at sub_45A870 handles all variable-length integer encoding. Source location tracking uses the location map (hash map at sub_426150/sub_426D60) with file offset caching every 10 lines for fast random access. Labels follow the pattern .L__$locationLabel$__%d. See Debug Information.
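The variable-length encoding handled by sub_45A870 is standard DWARF ULEB128, which can be stated compactly (this is the DWARF-specified scheme, not a decompile of the ptxas routine):

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned LEB128: emit 7 bits per byte, low bits first, setting the
 * high bit of every byte except the last. Returns bytes written. */
static unsigned uleb128_encode(uint64_t v, uint8_t *out) {
    unsigned n = 0;
    do {
        uint8_t byte = v & 0x7F;
        v >>= 7;
        if (v)
            byte |= 0x80;    /* more bytes follow */
        out[n++] = byte;
    } while (v);
    return n;
}
```

Signed LEB128 (used for line-table deltas) differs only in its termination test, which checks the sign bit of the remaining value.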

Error Paths and Recovery

ptxas uses setjmp/longjmp as its sole error recovery mechanism -- there are no C++ exceptions (the binary is compiled as C). Three nested recovery points exist, each catching progressively more localized failures.

Recovery Point Hierarchy

sub_446240 (top-level driver)
  setjmp(jmp_buf_global)         // Level 1: catches any fatal anywhere
    |
    sub_43A400 (per-kernel worker)
      setjmp(jmp_buf_kernel)     // Level 2: catches per-kernel fatals
        |
        sub_432500 (finalization bridge)
          setjmp(jmp_buf_local)  // Level 3: catches OCG pipeline fatals
            |
            [159-phase pipeline, regalloc, encoding, ELF]

Level 1 (global). Established by sub_446240 at function entry. If any code path anywhere in ptxas calls sub_42FBA0 with severity >= 6 (fatal), execution longjmps back here. The handler cleans up global resources and returns a non-zero exit code. This is the last-resort handler.

Level 2 (per-kernel). Established by sub_43A400 before the OCG pipeline runs. On longjmp, the handler destroys the partially-compiled kernel's state, clears the error flags in the TLS context, and continues to the next kernel. This allows multi-kernel compilations to survive a single kernel's failure.

Level 3 (finalization). Established by sub_432500, which saves and replaces the TLS jmp_buf pointer for nested recovery. On longjmp: restores the previous jmp_buf, sets error_flags = 1, releases output buffers, and calls report_internal_error(). Execution returns false to the Level 2 handler.

Parse Error Recovery

Parse errors in sub_451730 (the Flex/Bison parser) invoke sub_42FBA0 with the message "syntax error":

  • Severity 4--5 (non-fatal error): The error is printed with file:line location, and the parser attempts to continue via Bison's error recovery rules. Multiple non-fatal parse errors can accumulate. After parsing completes, if the error count > 0, the compilation is aborted before entering the OCG pipeline.
  • Severity 6 (fatal): Triggers longjmp to the Level 1 handler immediately. The parser state pool is leaked (accepted trade-off since the process is about to exit).

Bison error recovery operates through the error token in the grammar. When the parser encounters a token that matches no production, it discards tokens until it finds one that allows the error production to reduce, then resumes parsing. This provides reasonable error recovery for common mistakes (missing semicolons, misspelled opcodes) but can cascade badly for structural errors (mismatched braces, corrupt PTX).

Phase Failure in PhaseManager

The phase executor sub_C64F70 runs each phase by calling its vtable execute() method. There is no explicit per-phase error check -- phases that detect internal errors call the diagnostic emitter sub_42FBA0 directly. The error handling cascade:

  1. Non-fatal phase error (severity 3--5): The error is printed and the error flag is set in the TLS context. The PhaseManager continues executing subsequent phases. This allows multiple diagnostics to be collected in a single run.
  2. Fatal phase error (severity 6): Triggers longjmp to Level 2 or Level 3. The current kernel's compilation is aborted. The PhaseManager's loop is unwound non-locally -- no cleanup of intermediate phase state occurs. Resources are reclaimed when the per-kernel memory pool is destroyed.
  3. OOM during phase execution: Any allocation failure calls sub_42BDB0 (3,825 callers), which forwards to sub_42F590 with a severity-6 descriptor at unk_29FA530. This always triggers longjmp.

The PhaseManager logs phase transitions using "Before <phase>" and "After <phase>" string construction (visible in sub_C64F70). When DUMPIR is set to a phase name, the IR is dumped to a file after that phase completes. This enables bisection of phase failures: --knob DUMPIR=<phase_name> isolates which phase corrupted the IR.

Register Allocation Failure and Retry

The register allocator has its own retry mechanism that operates within the normal pipeline (not via longjmp). The retry driver sub_971A90 (355 lines) wraps the Fatpoint allocator in a two-phase strategy:

Phase 1 -- NOSPILL. Attempt allocation without spilling. If the allocator fits within the register budget, proceed directly to finalization.

Phase 2 -- SPILL retry loop. If NOSPILL fails:

  1. The spill guidance engine sub_96D940 (84 KB) computes per-register-class spill candidates
  2. The allocator retries with progressively more aggressive spilling, up to N attempts (controlled by knobs 638/639)
  3. Each attempt prints: "-CLASS NOSPILL REGALLOC: attemp %d, used %d, target %d" (note: the typo "attemp" is in the binary)
  4. The best result across all attempts is tracked by sub_93D070
  5. The finalization function sub_9714E0 (290 lines) commits the best result or emits a fatal error

On allocation failure (all retry attempts exhausted):

Register allocation failed with register count of '%d'.
Compile the program with a higher register target

This error is emitted by sub_9714E0 through two paths: with source location (via sub_895530, including function name and PTX line number) or without source location (via sub_7EEFA0, generic). After this error, sub_9714E0 returns with HIBYTE(status) set, causing the retry driver to clear all register assignments to -1 and propagate the failure.

A dedicated DUMPIR hook exists: "Please use -knob DUMPIR=AllocateRegisters for debugging" -- this string (found at sub_9714E0's error path) directs users to dump the IR state before the allocator runs.
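The two-phase strategy can be modeled as a small retry loop. The pressure/budget arithmetic below is purely illustrative; only the NOSPILL-first structure, the capped retry count, and the fail-after-exhaustion outcome mirror the recovered logic:

```c
#include <assert.h>

/* Returns 0 if NOSPILL succeeds, the successful attempt number for a
 * SPILL retry, or -1 if all attempts are exhausted ("Register
 * allocation failed"). `max_attempts` stands in for knobs 638/639. */
static int alloc_registers(int pressure, int budget, int max_attempts,
                           int spill_per_attempt) {
    if (pressure <= budget)
        return 0;                         /* Phase 1: NOSPILL fits */
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        pressure -= spill_per_attempt;    /* progressively more spilling */
        if (pressure <= budget)
            return attempt;               /* Phase 2: retry succeeded */
    }
    return -1;                            /* emit fatal, clear assignments */
}
```

In the real allocator, "more aggressive spilling" means the guidance engine widens its candidate set per register class, and the best intermediate result is kept by sub_93D070 rather than recomputed.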

Fatal Error Handler Chain

The complete chain from any error site to process termination:

[any function, 2,350 call sites]
  sub_42FBA0(descriptor, location, ...)   // central diagnostic emitter
    |  checks descriptor[0] for severity
    |  severity 0: silently ignored
    |  severity 1-2: prints "info    " message
    |  severity 3: prints "warning " (or "error   " if TLS[50] Werror flag set)
    |  severity 4-5: prints "error   " / "error*  " (non-fatal)
    |  severity 6: prints "fatal   " then:
    v
  longjmp(tls->jmp_buf, 1)
    |  unwinds to nearest setjmp recovery point
    v
  [Level 3] sub_432500 -> restore jmp_buf, set error_flags, return false
  [Level 2] sub_43A400 -> cleanup kernel state, continue to next kernel
  [Level 1] sub_446240 -> cleanup global state, exit(non-zero)

Resource leak note. Because longjmp bypasses normal stack unwinding, all heap allocations made between the setjmp and the fatal error are leaked unless tracked in a pool. This is why ptxas uses pool allocators -- the per-kernel pool can be destroyed wholesale at the Level 2 recovery point, reclaiming all leaked memory without tracking individual allocations.
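The per-kernel (Level 2) pattern reduces to a classic setjmp/longjmp skeleton. Everything here is illustrative except the control flow itself: establish the recovery point, let any fatal path longjmp back, destroy the kernel's pool in the handler, and continue:

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf kernel_recovery;   /* the real buffer lives in TLS */

/* Stand-in for a severity-6 diagnostic anywhere in the pipeline. */
static void fatal_error(void) { longjmp(kernel_recovery, 1); }

/* Returns 0 on success, -1 if the kernel's compilation was aborted. */
static int compile_kernel(int should_fail) {
    if (setjmp(kernel_recovery) != 0) {
        /* handler: destroy the per-kernel pool wholesale, then let the
           driver move on to the next kernel */
        return -1;
    }
    if (should_fail)
        fatal_error();            /* deep inside the 159-phase pipeline */
    return 0;
}
```

The pool destruction in the handler is what makes this safe despite the skipped unwinding: every allocation between setjmp and the fatal error lives in the pool being destroyed.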

Architecture Dispatch

An architecture vtable factory at sub_1CCEEE0 (17KB, 244 callees) constructs a 632-byte vtable object (79 function pointers) based on the target SM version. The version dispatch ranges:

| Range | Architecture | Generation | Status in v13.0.88 |
|---|---|---|---|
| sm_30-39 | Kepler | 1st gen | Validation only -- accepted by bsearch in unk_1D16220, but no codegen factory, no capability dispatch handlers, and no SASS encoders ship for these targets. Compilation fails after parsing. |
| sm_50-59 | Maxwell | 2nd gen | Validation only -- same as Kepler. Present in the base validation table for backward-compatible PTX version/target checking, but no backend support. |
| sm_60-69 | Pascal | 3rd gen | Validation only -- same as above. The codegen factory value 24576 (6 << 12) is referenced in comparison thresholds but no Pascal-specific encoder tables exist. |
| sm_70-73 | Volta | 4th gen | Validation only -- sm_70, sm_72, sm_73 are in the base table but have no active capability dispatch handlers in sub_607DB0. |
| sm_75 | Turing | 4th gen | Active -- lowest SM with full codegen support (factory 24577). |
| sm_80-89 | Ampere / Ada | 5th gen | Active -- factory 28673. |
| sm_90 | Hopper | 6th gen | Active -- factory 32768. |
| sm_100-110 | Blackwell | 7th gen | Active -- factory 36864. |
| sm_120-121 | Consumer / DGX Spark | 7th gen (desktop) | Active -- factory 36864 (shared with Blackwell datacenter). |

The distinction between "validation only" and "active" is critical: the base validation table at unk_1D16220 contains 32 entries including all legacy SMs back to sm_20, allowing ptxas to parse PTX files that declare .target sm_30 without immediately rejecting them. However, the capability dispatch initializer sub_607DB0 only registers handler functions for sm_75 through sm_121 (13 base targets). Attempting to compile code for an unregistered SM produces a fatal error during codegen factory lookup -- the architecture vtable factory sub_1CCEEE0 cannot construct a backend object for these targets.

The legacy codegen factory values (12288 for sm_30, 16385/20481 for sm_50, 24576 for sm_60) survive as comparison constants in feature-gating checks throughout the backend (e.g., if (factory_value > 28673) gates sm_90+ features), but the code paths they would activate no longer exist.

Each vtable entry is a function pointer to an SM-specific implementation of a codegen or emission primitive. This is the central dispatch mechanism for all architecture-dependent behavior in the backend.

Obfuscation: ROT13 Encoding

All internal identifiers in ptxas's static initializers are ROT13-encoded:

  • Opcode table (ctor_003 at 0x4095D0, 17KB): 900+ PTX opcode mnemonics. Example: NPDOHYX decodes to ACQBULK, SZN decodes to FMA, RKVG decodes to EXIT.
  • General knob table (ctor_005 at 0x40D860, 80KB): 2,000+ Mercury/OCG tuning knob names with hex default values. Example: ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf decodes to MercuryUseActiveThreadCollectiveInsts.
  • Scheduler knob table (ctor_007 at 0x421290, 8KB): 98 scheduler-specific knob names. Example: XBlockWaitOut, ScavInlineExpansion.

The ROT13 decoding is performed inline during lookup (in sub_79B240, GetKnobIndex) using character-range detection: bytes in A-M get +13, bytes in N-Z get -13, with case-insensitive comparison via tolower().
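The in-place decode reduces to a few lines of C. This sketch matches the character-range scheme described above (A-M/a-m shift +13, N-Z/n-z shift -13, everything else untouched), though the recovered routine folds it into the comparison loop rather than decoding a whole string first:

```c
#include <assert.h>
#include <string.h>

/* In-place ROT13: self-inverse, so the same routine encodes and
 * decodes. Digits and punctuation pass through unchanged. */
static void rot13(char *s) {
    for (; *s; s++) {
        char c = *s | 0x20;              /* lowercase copy for the range test */
        if (c >= 'a' && c <= 'm')
            *s += 13;
        else if (c >= 'n' && c <= 'z')
            *s -= 13;
    }
}
```

Because ROT13 is its own inverse, running this over the string tables recovers every opcode mnemonic and knob name in one pass.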

Cross-References

Function Map

| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
| 0x409460 | 84 B | -- | main (entry point) | CERTAIN |
| 0x446240 | 11 KB | 1 | Top-level compilation driver | HIGH |
| 0x434320 | 10 KB | 1 | CLI option parser + validator | HIGH |
| 0x445EB0 | -- | 1 | Target configuration setup | HIGH |
| 0x43A400 | 4.7 KB | 1 | SM-specific default configuration | HIGH |
| 0x43B660 | 3.8 KB | 1 | Register/resource constraint calculator | HIGH |
| 0x451730 | 14 KB | 1 | Parser init + special register setup | HIGH |
| 0x46E000 | 93 KB | 1 | Opcode dispatch table builder (1,168 callees) | HIGH |
| 0x4428E0 | 14 KB | 1 | PTX input validation + preprocessing | HIGH |
| 0x43CC70 | 5.4 KB | 1 | Per-entry compilation unit processor | HIGH |
| 0x7FBB70 | 198 B | vtable | Per-kernel entry point | CERTAIN |
| 0x7FB6C0 | 1.2 KB | 1 | Pipeline orchestrator | CERTAIN |
| 0xC62720 | 4.7 KB | 1 | PhaseManager constructor | VERY HIGH |
| 0xC60D30 | 3.6 KB | 1 | Phase factory (159-case switch) | VERY HIGH |
| 0xC64F70 | -- | 1 | Phase executor | HIGH |
| 0x9F63D0 | 342 B | 1 | NamedPhases executor | VERY HIGH |
| 0x612DE0 | 47 KB | 1 | Kernel finalizer / ELF builder | HIGH |
| 0x1C9F280 | 97 KB | 1 | Master ELF emitter | HIGH |
| 0x1CB53A0 | 13 KB | 1 | ELF world initializer (672-byte object) | HIGH |
| 0x1CABD60 | 67 KB | 1 | Section layout & memory allocator | HIGH |
| 0x1CD13A0 | 11 KB | 2 | Final ELF file writer | HIGH |
| 0x1CB18B0 | ~200 B | 1 | Thread pool constructor | HIGH |
| 0x1CB1A50 | ~200 B | N | Thread pool task submit | HIGH |
| 0x1CC7300 | 8 KB | 1 | GNU Make jobserver client | HIGH |
| 0x1CCEEE0 | 17 KB | 3 | Architecture vtable factory | HIGH |
| 0x424070 | 2.1 KB | 3,809 | Pool allocator: alloc(pool, size) | HIGH |
| 0x4248B0 | 923 B | 1,215 | Pool allocator: free(ptr) | HIGH |
| 0x4280C0 | 597 B | 3,928 | Thread-local context accessor | HIGH |
| 0x426150 | 2.5 KB | 2,800 | Hash map: put(map, key, value) | HIGH |
| 0x42FBA0 | 2.4 KB | 2,350 | Diagnostic message emitter | HIGH |

Entry Point & CLI

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas binary has a deceptively simple entry point. The exported main at 0x409460 is an 84-byte wrapper that sets up unbuffered I/O and immediately tail-calls sub_446240 -- the real compilation driver. This driver is a monolithic 11 KB function that allocates a 1,352-byte master options block on the stack, establishes setjmp-based error recovery, parses all command-line options through a generic framework, reads PTX input, and then loops over compile units running the full Parse -> CompileUnitSetup -> DAGgen -> OCG -> ELF -> DebugInfo pipeline for each. The entire error-handling strategy is non-local: any of the 2,350 call sites to the central diagnostic emitter sub_42FBA0 can trigger a longjmp back to the driver's recovery point on fatal errors.

The same binary doubles as an in-process library. When nvcc loads ptxas as a shared object rather than spawning a subprocess, three extra arguments to the driver carry an output buffer pointer, an extra option count, and an extra options array. Callback function pointers at fixed offsets in the options block allow the host process to receive diagnostics and progress notifications without going through stderr.

| Field | Value |
|---|---|
| main() | 0x409460 (84 bytes) -- setvbuf + tail-call to sub_446240 |
| Real main | sub_446240 (11,064 bytes, ~900 lines) |
| Options block | 1,352 bytes on stack |
| Error recovery | setjmp / longjmp (no C++ exceptions) |
| Option registration | sub_432A00 (6,427 bytes, ~100 options via sub_1C97210) |
| Option parser | sub_434320 (10,289 bytes, ~800 lines) |
| Diagnostic emitter | sub_42FBA0 (2,388 bytes, 2,350 callers, 7 severity levels) |
| TLS context | sub_4280C0 (597 bytes, 3,928 callers, 280-byte per-thread struct) |
| Pipeline phases | Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo |
| Library mode | sub_446240(argc, argv, output_buf, extra_opt_count, extra_opts) |

Architecture

main (0x409460, 84B)
  │
  ├─ nullsub_1(*argv)          // store program name (no-op)
  ├─ setvbuf(stdout, _IONBF)
  ├─ setvbuf(stderr, _IONBF)
  │
  └─ sub_446240(argc, argv, 0, 0, 0)   // REAL MAIN
       │
       ├─ setjmp(jmp_buf)               // fatal error recovery point
       │
       ├─ sub_434320(opts_block, ...)    // OPTION PARSER
       │    └─ sub_432A00(...)           // register ~100 options via sub_1C97210
       │
       ├─ sub_4428E0(...)                // PTX INPUT SETUP
       │    ├─ validate .version / .target
       │    ├─ handle --input-as-string
       │    └─ generate __cuda_dummy_entry__ if --compile-only
       │
       ├─ sub_43A400(...)                // TARGET CONFIGURATION
       │    └─ set cache defaults, texmode, arch-specific flags
       │
       ├─ FOR EACH compile unit:
       │    ├─ sub_451730(...)           // parser/lexer init + special regs
       │    ├─ sub_43B660(...)           // register constraint calculator
       │    ├─ sub_43F400(...)           // function/ABI setup
       │    └─ sub_43CC70(...)           // per-entry: DAGgen → OCG → ELF → DebugInfo
       │
       ├─ timing / memory stats output (--compiler-stats)
       │
       └─ cleanup + return exit code

Pre-main Static Constructors

Before main executes, four static constructors run as part of the ELF .init_array. Three of them populate ROT13-obfuscated lookup tables that are foundational to the rest of the binary. This obfuscation is deliberate -- it prevents trivial string searching for internal opcode names and tuning knob identifiers in the stripped binary.

ctor_001 -- Thread Infrastructure (0x4094C0, 204 bytes)

Initializes the POSIX threading foundation used throughout ptxas:

pthread_key_create(&key, destr_function);       // TLS key for sub_4280C0
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&mutex, &attr);
dword_29FE0F4 = sched_get_priority_max(SCHED_RR);
dword_29FE0F0 = sched_get_priority_min(SCHED_RR);
__cxa_atexit(cleanup_func, ...);                // registered destruction

The TLS key created here is the one used by sub_4280C0 (3,928 callers), making it the single most important piece of global state in the binary.

ctor_003 -- PTX Opcode Name Table (0x4095D0, 17,007 bytes)

Populates a table at 0x29FE300+ with approximately 900 ROT13-encoded PTX opcode mnemonic strings. Each entry is a (string_ptr, length) pair. The ROT13 encoding maps A-Z to N-Z,A-M and a-z to n-z,a-m, leaving digits and punctuation unchanged.

| Encoded | Decoded | Instruction |
|---|---|---|
| NPDOHYX | ACQBULK | Bulk acquire |
| NPDFUZVAVG | ACQSHMINIT | Shared memory acquire init |
| OFLAP | BSYNC | Barrier sync |
| PPGY.P | CCTL.C | Cache control |
| SZN | FMA | Fused multiply-add |
| FRGC | SETP | Set predicate |
| ERGHEA | RETURN | Return |
| RKVG | EXIT | Thread exit |

These decoded names are the canonical PTX opcode mnemonics used during parsing and validation. The table is consumed by the PTX lexer initialization at sub_451730 and the opcode-to-handler dispatch table at sub_46E000 (93 KB, the largest function in the front-end range).

ctor_005 -- Mercury Tuning Knob Registry (0x40D860, 80,397 bytes)

The single largest function in the front-end address range. Registers 2,000+ ROT13-encoded internal tuning knob names, each paired with a hexadecimal default value string. These are the "Mercury" (OCG) backend tuning parameters that control every aspect of code generation, scheduling, and register allocation.

| Encoded Name | Decoded Name | Default |
|---|---|---|
| ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf | MercuryUseActiveThreadCollectiveInsts | 0x3e40 |
| ZrephelGenpxZhygvErnqfJneYngrapl | MercuryTrackMultiReadsWarLatency | |
| ZrephelCerfhzrKoybpxJnvgOrarsvpvny | MercuryPresumeXblockWaitBeneficial | |
| ZrephelZreteCebybthrOybpxf | MercuryMergePrologueBlocks | |
| ZrephelTraFnffHPbqr | MercuryGenSassUCode | |

The knob system is documented in detail in the Knobs System page. The ROT13 encoding applies identically to all knob name strings in all four constructors.

ctor_007 -- Scheduler Knob Registry (0x421290, 7,921 bytes)

A smaller companion to ctor_005 that registers 98 scheduler-specific knobs. These control the instruction scheduler (Mercury/OCG) behavior at a finer granularity than the general knobs:

Decoded examples: XBlockWaitOut, XBlockWaitInOut, XBlockWaitInOnTarget, WarDeploySyncsFlush_SW4397903, WaitToForceCTASwitch, VoltageWar_SW4981360PredicateOffDummies, TrackMultiReadsWarLatency, ScavInlineExpansion, ScavDisableSpilling.

Knob names containing _SW followed by a number (e.g., _SW4397903) indicate workarounds for specific hardware bugs identified by NVIDIA's internal bug tracking system.

Real Main -- sub_446240

The exported main() tail-calls sub_446240 with three zero arguments appended. This function is the complete compilation orchestrator: it owns the options block, the error recovery, the compilation loop, and the statistics output.

| Field | Value |
|---|---|
| Address | 0x446240 |
| Size | 11,064 bytes |
| Stack frame | 1,352+ bytes (master options block + locals) |
| Callers | 1 (main) |
| Error recovery | setjmp at function entry |

Signature and Library Mode

int sub_446240(int argc, char **argv,
               void *output_buf,        // a3: cubin output buffer (NULL for standalone)
               int   extra_opt_count,   // a4: count of extra options from nvcc
               char **extra_opts);      // a5: array of extra option strings

When main calls this, a3/a4/a5 are all zero -- standalone mode. When nvcc loads ptxas as a shared library and calls the entry point directly, these arguments carry non-null values:

  • a3 (output_buf): Pointer to a memory buffer where the compiled cubin is written. Eliminates the need for temporary files and filesystem round-trips, which matters for large CUDA compilations where nvcc may invoke ptxas hundreds of times.
  • a4 (extra_opt_count): Number of additional option strings injected by nvcc beyond what appears on the command line.
  • a5 (extra_opts): Array of those extra option strings.

Additionally, callback function pointers at offsets 37--39 of the 1,352-byte options block (byte offsets ~296, ~304, ~312) allow the host process to receive progress notifications and diagnostic messages in-process rather than through stderr.

Error Recovery with setjmp/longjmp

The first significant action in sub_446240 is establishing a setjmp recovery point:

if (setjmp(jmp_buf) != 0) {
    // Fatal error occurred somewhere in the pipeline.
    // Clean up and return non-zero exit code.
    goto cleanup;
}

This is the only error recovery mechanism in ptxas -- there are no C++ exceptions (the binary is compiled as C, not C++). Any function anywhere in the call tree that encounters an unrecoverable error calls sub_42FBA0 with severity >= 6, which internally calls longjmp(jmp_buf, 1) to unwind directly back to this point. The approach is simple but has a critical implication: all resources allocated between the setjmp and the fatal error are leaked unless explicitly tracked and cleaned up at the recovery site.

The 1,352-Byte Options Block

The master options block lives on the stack and accumulates all compilation state during option parsing. It is passed by pointer to virtually every subsystem. Key fields (approximate offsets based on access patterns):

| Offset Range | Purpose |
|---|---|
| 0--63 | Input/output file paths, PTX version, target SM |
| 64--127 | Optimization level, debug flags, cache modes |
| 128--255 | Register limits, occupancy constraints |
| 256--295 | Warning/error control flags |
| 296--319 | Library-mode callback function pointers (pointer slots 37--39) |
| 320--1351 | Per-pass configuration, knob overrides, feature flags |

Compilation Loop

After option parsing and PTX input setup, the driver enters a loop over compile units. Each unit corresponds to one entry function (or device function in --compile-only mode). The per-entry processing is handled by sub_43CC70, which prints a separator:

printf("\n# ============== entry %s ==============\n", entry_name);

and then sequences: DAGgen (PTX-to-Ori lowering), OCG (optimization and code generation), ELF (binary emission), and DebugInfo (DWARF generation). The special entry __cuda_dummy_entry__ is silently skipped.

Timing and Memory Statistics

When --compiler-stats is active, sub_446240 prints per-phase timing and peak memory after all compile units complete:

CompileTime = 42.3 ms (100%)
Parse-time            : 12.1 ms (28.61%)
CompileUnitSetup-time :  1.4 ms ( 3.31%)
DAGgen-time           :  8.7 ms (20.57%)
OCG-time              : 15.2 ms (35.93%)
ELF-time              :  3.8 ms ( 8.98%)
DebugInfo-time        :  1.1 ms ( 2.60%)
PeakMemoryUsage = 2048.000 KB

When --compiler-stats-file is specified, the same data is written in JSON format using the shared JSON builder (sub_1CBA950). When --fdevice-time-trace is active, sub_439880 parses Chrome DevTools trace format JSON and merges ptxas timing events into the trace.

Option Parser -- sub_434320 and sub_432A00

Option parsing is split into two phases: registration and processing.

Option Registration -- sub_432A00

This 6,427-byte function calls sub_1C97210 approximately 100 times, once per recognized option. Each call provides the option's long name, short name, value type, help text, and default value to the generic option framework (implemented in the 0x1C96xxx--0x1C97xxx range, shared with other NVIDIA tools).

| Option | Short | Type | Help Text |
|---|---|---|---|
| --arch | -arch | string | "Specify the 'sm_' name of the target architecture" |
| --output-file | -o | string | "Specify name and location of the output file" |
| --opt-level | -O | int | "Specify optimization level" |
| --maxrregcount | | int | "Specify the maximum number of registers" |
| --register-usage-level | | enum(0..10) | Register usage reporting level |
| --verbose | -v | bool | Verbose output |
| --version | -V | | Print version and exit |
| --compile-only | | bool | Compile without linking |
| --compile-functions | | string | "Entry function name" |
| --input-as-string | | string | "PTX string" (compile from memory) |
| --fast-compile | | bool | Reduce compile time at cost of code quality |
| --suppress-stack-size-warning | | bool | Suppress stack size warnings |
| --warn-on-local-memory-usage | | bool | Warn when local memory is used |
| --warn-on-spills | | bool | Warn on register spills |
| --warn-on-double-precision-use | | bool | Warn on FP64 usage |
| --compiler-stats | | bool | Print compilation timing |
| --compiler-stats-file | | string | "/path/to/file" (JSON output) |
| --fdevice-time-trace | | string | Chrome trace JSON output |
| --def-load-cache | | enum | Default load cache operation |
| --force-load-cache | | enum | Force load cache operation |
| --position-independent-code | | bool | Generate PIC |
| --compile-as-tools-patch | | bool | CUDA sanitizer/tools patch mode |
| --extensible-whole-program | | bool | Whole-program compilation |
| --cloning | | enum(yes/no) | Inline cloning control |
| --ptxlen | | | PTX length statistics |
| --list-version / --version-ls | | | List supported PTX versions |
| --disable-smem-reservation | | bool | Disable shared memory reservation |
| --generate-relocatable-object | -c | bool | Generate relocatable object |

Option Processing -- sub_434320

The 10,289-byte parser iterates over argv (and any extra options from library mode), matches each argument against registered options via the framework, and populates fields in the 1,352-byte options block. Special handling exists for:

  • --version: Prints the identification string "Ptx optimizing assembler" followed by the version (e.g., "Cuda compilation tools, release 13.0, V13.0.88") and exits.
  • --help: Delegates to sub_403588, which prints "Usage : %s [options] <ptx file>,...\n" followed by all registered options, then exits.
  • --fast-compile: Validated against conflicting optimization options.
  • -cloning=yes/-cloning=no: Inline cloning control parsed as an equality option.

Generic Option Framework

The option parsing library lives in the 0x1C96000--0x1C97FFF range and is shared with other NVIDIA tools (nvlink, fatbinary, etc.):

| Address | Identity | Role |
|---|---|---|
| sub_1C960C0 | Option parser constructor | Creates the option parser state |
| sub_1C96680 | Argv processor | Matches argv entries against registered options |
| sub_1C97210 | Option registrar | Registers one option with name, type, help |
| sub_1C97640 | Help printer | Iterates all registered options, prints help text |

Diagnostic System -- sub_42FBA0

The central diagnostic emitter is the most important error-reporting function in ptxas. With 2,350 call sites, it handles every warning, error, and fatal message in the entire binary.

Signature

void sub_42FBA0(
    int *descriptor,    // a1: points to severity level at *a1
    void *location,     // a2: source location context
    ...                 // variadic: printf-style format args
);

Severity Levels

| Level | Prefix | Tag | Behavior |
|---|---|---|---|
| 0 | (none) | | Suppressed -- message is silently discarded |
| 1 | "info " | @I@ | Informational |
| 2 | "info " | @I@ | Informational (alternate) |
| 3 | "warning " / "error " | @W@ / @E@ | Warning, promoted to error if TLS[50] set |
| 4 | "error* " | @O@ | Non-fatal error with special marker |
| 5 | "error " | @E@ | Non-fatal error |
| 6 | "fatal " | @E@ | Fatal -- triggers longjmp(jmp_buf, 1) |

The machine-readable tags (@E@, @W@, @O@, @I@) allow nvcc and other tools to parse ptxas output programmatically, extracting severity without parsing the human-readable text.

Warning-to-Error Promotion

Severity level 3 has context-dependent behavior controlled by two flags in the thread-local storage:

v5 = *a1;   // severity
if (v5 == 3) {
    if (sub_4280C0()[49])   // TLS offset 49: suppression flag
        return;             // silently discard
    if (sub_4280C0()[50])   // TLS offset 50: Werror flag
        prefix = "error   ";
    else
        prefix = "warning ";
}

This implements the --Werror equivalent: when the Werror flag is active in the TLS context, all warnings become errors.

Output Format

<filename>, line <N>; <severity>: <message>

When source is available, the diagnostic emitter reads the PTX input file, seeks to line N, and prints the source line prefixed with "# ". To avoid O(n) seeking through large files on every diagnostic, it maintains a hash map (sub_426150/sub_426D60) that caches file byte offsets every 10 lines for fast random access to arbitrary line numbers.

Fatal Error Handler -- sub_42BDB0

A 14-byte wrapper called from 3,825 sites (nearly every allocation in ptxas). It fires whenever the pool allocator sub_424070 returns NULL:

void sub_42BDB0(...) {
    return sub_42F590(&unk_29FA530, ...);   // internal error descriptor
}

The descriptor at unk_29FA530 has severity 6 (fatal), so this always triggers longjmp back to the driver's recovery point.

Thread-Local Storage -- sub_4280C0

The most-called function in the entire binary (3,928 callers). Returns a pointer to a 280-byte per-thread context struct, allocating and initializing it on first access.

void *sub_4280C0(void) {
    void *ctx = pthread_getspecific(key);
    if (ctx) return ctx;

    ctx = malloc(0x118);        // 280 bytes
    memset(ctx, 0, 0x118);
    pthread_cond_init(ctx + 128, NULL);
    pthread_mutex_init(ctx + 176, NULL);
    sem_init(ctx + 216, 0, 0);
    pthread_setspecific(key, ctx);
    return ctx;
}

TLS Context Layout (280 bytes)

| Offset | Size | Type | Purpose |
|---|---|---|---|
| 0 | 8 | int/flags | Error/warning state flags |
| 8 | 8 | int | has_error flag |
| 49 | 1 | byte | Diagnostic suppression flag |
| 50 | 1 | byte | Werror promotion flag |
| 128 | 48 | pthread_cond_t | Condition variable |
| 176 | 40 | pthread_mutex_t | Per-thread mutex |
| 216 | 32 | sem_t | Semaphore for synchronization |

The TLS key is created by ctor_001 before main runs, and a destructor function registered via pthread_key_create frees the 280-byte struct when a thread terminates. This per-thread context enables concurrent compilation of multiple compile units (when the thread pool is active), with each thread maintaining independent error state and diagnostic suppression flags.

PTX Input Setup -- sub_4428E0

After options are parsed, this 13,774-byte function reads and preprocesses the PTX input:

  1. Version and target validation. Checks .version and .target directives in the input. Emits synthetic headers ("\t.version %s\n", "\t.target %s\n") when needed.

  2. Compile-only mode. When --compile-only is active and no real entries exist, generates a dummy entry: "\t.entry %s { ret; }\n" with name __cuda_dummy_entry__.

  3. Input-as-string mode. When --input-as-string is active, PTX is read from memory (passed as a string argument) rather than from a file. This is used by the library-mode interface.

  4. Whole-program mode. --extensible-whole-program enables inter-function optimization across all entries in the compilation unit.

  5. Cache and debug configuration. Applies --def-load-cache, --def-store-cache, --force-load-cache, --force-store-cache, and suppress-debug-info settings.

  6. Tools-patch mode. --compile-as-tools-patch activates the CUDA sanitizer compilation path, checking for __cuda_sanitizer symbols.

Key diagnostic strings from this function:

  • "'--fast-compile'"
  • "calls without ABI"
  • "compilation without ABI"
  • "device-debug or lineinfo"
  • "unified Functions"

Target Configuration -- sub_43A400

A 4,696-byte function that configures target-specific defaults after option parsing completes. It reads the SM architecture number from the options block and sets:

  • Texturing mode: texmode_unified vs raw texture mode.
  • Cache defaults: Based on architecture capabilities.
  • Feature flags: Hardware-specific workaround flags (e.g., --sw4575628).
  • Indirect function support: "Indirect Functions or Extern Functions" validation.

The function references "NVIDIA" and "ptxocg.0.0" (the internal name for the OCG optimization pass), suggesting it also initializes the pass pipeline configuration for the target architecture.

Register Constraint Calculator -- sub_43B660

A 3,843-byte function that resolves potentially conflicting register limit specifications into a single register budget per function. Register constraints come from four sources with different priorities:

| Source | Directive/Option | Priority |
|---|---|---|
| PTX directive | .maxnreg N | Per-function, highest priority |
| CLI option | --maxrregcount N | Global, overridden by .maxnreg |
| PTX directive | .minnctapersm N | Occupancy target, derived limit |
| PTX directive | .maxntid Nx,Ny,Nz | Thread block size, derived limit |

The occupancy-derived limit is computed from .minnctapersm and .maxntid: given a minimum number of CTAs per SM and a maximum thread count per CTA, the function calculates the maximum register count that allows the requested occupancy level, accounting for per-SM register file size.

Diagnostic strings indicate the resolution process:

  • "computed using thread count" -- derived from .maxntid
  • "of .maxnreg" -- explicit per-function limit
  • "of maxrregcount option" -- CLI override
  • "global register limit specified" -- global cap applied

Per-Entry Compilation -- sub_43CC70

A 5,425-byte function that processes each entry function through the complete backend pipeline. For each entry:

  1. Skips __cuda_dummy_entry__ (generated by compile-only mode).
  2. Prints the entry separator: "\n# ============== entry %s ==============\n".
  3. Runs DAGgen (PTX-to-Ori lowering).
  4. Runs OCG (the 159-phase optimization pipeline + SASS code generation).
  5. Generates .sass and .ucode ELF sections.
  6. Generates DWARF debug information if requested.

The function also handles reg-fatpoint configuration (the register allocation algorithm, documented in the Fatpoint Algorithm page).

Function/ABI Setup -- sub_43F400

A 9,078-byte function that configures the calling convention for each function before compilation. This includes:

| Resource | Diagnostic String |
|---|---|
| Parameter passing registers | "number of registers used for parameter passing" |
| First parameter register | "first parameter register" |
| Return address register | "return address register" |
| Scratch data registers | "scratch data registers" |
| Scratch control barriers | "scratch control barriers" |
| Call prototype | "callprotoype" (sic -- misspelled in binary) |
| Call target | "calltarget" |

The function handles both entry functions (kernels launched from the host) and device functions (callable from other device code), with different ABI requirements for each. Entry functions use a simplified ABI where parameters come from constant memory, while device functions use register-based parameter passing.

The --compile-as-tools-patch and --sw200428197 flags activate a special ABI variant for CUDA sanitizer instrumentation, which inserts additional scratch registers for sanitizer state.

Function Map

| Address | Size | Callers | Identity |
|---|---|---|---|
| 0x409460 | 84 B | | main (entry point thunk) |
| 0x4094C0 | 204 B | | ctor_001 (thread infrastructure init) |
| 0x4095D0 | 17 KB | | ctor_003 (ROT13 opcode table, ~900 entries) |
| 0x40D860 | 80 KB | | ctor_005 (ROT13 knob registry, 2000+ entries) |
| 0x421290 | 8 KB | | ctor_007 (scheduler knob registry, 98 entries) |
| 0x403588 | 75 B | 1 | Usage printer (--help) |
| 0x4280C0 | 597 B | 3,928 | TLS context accessor (280-byte struct) |
| 0x42BDB0 | 14 B | 3,825 | OOM fatal error handler |
| 0x42FBA0 | 2.4 KB | 2,350 | Central diagnostic emitter |
| 0x42F590 | | 1 | Internal fatal error handler |
| 0x430570 | | 2 | Program name getter |
| 0x432A00 | 6.4 KB | 1 | Option registration (~100 options) |
| 0x434320 | 10 KB | 1 | Option parser and validator |
| 0x439880 | 2.9 KB | 1 | Chrome trace JSON parser |
| 0x43A400 | 4.7 KB | 1 | Target configuration |
| 0x43B660 | 3.8 KB | 1 | Register constraint calculator |
| 0x43CC70 | 5.4 KB | 1 | Per-entry compilation processor |
| 0x43F400 | 9 KB | 1 | Function/ABI setup |
| 0x4428E0 | 13.8 KB | 1 | PTX input setup and preprocessing |
| 0x446240 | 11 KB | 1 | Compilation driver (real main) |
| 0x451730 | 14 KB | 1 | Parser/lexer init + special registers |
| 0x46E000 | 93 KB | 1 | Opcode-to-handler dispatch table builder |
| 0x1C960C0 | | | Option parser constructor |
| 0x1C96680 | | | Argv processor |
| 0x1C97210 | | ~100 | Option registrar (per-option) |
| 0x1C97640 | | 1 | Help text printer |
| 0x1CBA950 | | | JSON context constructor |
| 0x1CBAC20 | 2.9 KB | 3 | JSON recursive descent parser |

Cross-References

PTX Parser (Flex + Bison)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas front-end parses PTX assembly text into internal IR using a classic two-stage architecture: a Flex-generated DFA scanner (lexer) and a Bison-generated LALR(1) shift-reduce parser. Unlike most compiler front-ends, the parser does not construct an AST. Instead, Bison reduction actions directly build IR nodes, populate the instruction table, and emit validation calls -- the parse tree is consumed inline and never materialized as a data structure. A separate macro preprocessor handles .MACRO, .ELSE/.ELIF/.ENDIF, and .INCLUDE directives at the character level before tokens reach the Flex DFA. The instruction table builder (sub_46E000, 93 KB) registers all PTX opcodes with their legal type combinations during parser initialization, and an instruction lookup subsystem classifies operands into 12 categories at parse time.

| Item | Value |
|---|---|
| Flex scanner | sub_720F00 (15.8 KB, 64 KB with inlined helpers) |
| DFA table | off_203C020 (transition/accept array) |
| Scanner rules | ~552 Flex rules, 162 token types (codes 258--422) |
| Scanner prefix | ptx (all Flex symbols: ptxlex, ptxensure_buffer_stack, etc.) |
| Bison parser | sub_4CE6B0 (48 KB, spans 0x4CE6B0--0x4DA337) |
| Grammar size | ~512 productions, 443 reduction cases |
| LALR tables | word_1D146A0 (yydefact), word_1D121A0 (yycheck), word_1D13360 (yypact), word_1D150C0 (yypgoto), byte_1D15960 (yyr2) |
| Instruction table builder | sub_46E000 (93 KB, 1,141 calls to sub_46BED0) |
| Instruction lookup | sub_46C690 (entry), sub_46C6E0 (6.4 KB descriptor matcher) |
| Macro preprocessor | sub_71F630 (14 KB dispatcher), sub_71E2B0 (32 KB conditional handler) |
| Parser state object | 1,128 bytes (+ 2,528-byte lexer state via pointer at +1096) |
| Error handler | sub_42FBA0 (2,350 callers, central diagnostics) |
| Parser init | sub_451730 (14 KB, symbol table + special registers + opcode table) |

Architecture

PTX source text
     │
     ▼
┌─────────────────────────────────────────────────────────┐
│  MACRO PREPROCESSOR (character-level, 0x71B000-0x720000)│
│  sub_71F630  dispatch: .MACRO / .ELSE / .INCLUDE        │
│  sub_71E2B0  conditional: .ELSE / .ELIF / .ENDIF (32KB) │
│  sub_71DCA0  macro definition handler                   │
│  sub_71C310  .INCLUDE file handler                      │
└────────────────────┬────────────────────────────────────┘
                     │ preprocessed character stream
                     ▼
┌─────────────────────────────────────────────────────────┐
│  FLEX DFA SCANNER  sub_720F00 (15.8KB, 552 rules)       │
│  off_203C020       DFA transition table                  │
│  Token codes:      258-422 (162 types)                   │
│  Helper:           sub_720410 (yy_get_next_buffer)       │
│                    sub_720630 (yy_get_previous_state)     │
│                    sub_720BA0 (yy_scan_string)            │
└────────────────────┬────────────────────────────────────┘
                     │ token stream (code + attribute)
                     ▼
┌─────────────────────────────────────────────────────────┐
│  BISON LALR(1) PARSER  sub_4CE6B0 (48KB, 512 prods)     │
│  5 LALR tables at 0x1D12xxx-0x1D15xxx                    │
│  443 reduction actions → direct IR construction           │
│  NO AST: reductions emit IR nodes inline                  │
└────────────────────┬────────────────────────────────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
  INSTRUCTION TABLE         SEMANTIC VALIDATORS
  sub_46E000 (93KB)         sub_4B2F20 (52KB, general)
  sub_46BED0 (per-opcode)   sub_4C5FB0 (28KB, operands)
  sub_46C690 (lookup)       sub_4C2FD0 (12KB, WMMA/MMA)
  sub_46C6E0 (6.4KB match)  sub_4ABFD0 (11KB, async copy)
                            sub_4A73C0 (10KB, tensormap)
                            + 20 more validators

Flex DFA Scanner -- sub_720F00

The scanner is a standard Flex-generated DFA with the ptx prefix (all exported symbols use ptx instead of yy: ptxlex, ptxensure_buffer_stack, ptx_create_buffer, etc.). At 15.8 KB of core logic (64 KB including inlined buffer management), it is the largest single function in the lexer region. The DFA transition table lives at off_203C020 and is indexed by *(DWORD*)(state + 76) (the current start condition). The main loop structure follows the textbook Flex pattern:

// DFA transition core (reconstructed from sub_720F00)
while (1) {
    v10 = (DWORD*)(table_base + 8 * state);   // table[state]
    if (current_char == *v10) {                 // character match
        state = table_base + 8 * v10[1];       // goto next state
        action = *(unsigned int*)(state - 4);   // accept action (or 0)
    }
    if (action != 0) break;                     // matched a rule
}
// Giant switch on action number (0..~550)
switch (action) { ... }

The scanner returns integer token codes to the Bison parser. The value 550 is YY_NULL (end-of-input sentinel). Token attributes are communicated through the lexer state object, which the parser state carries as a pointer at offset +1096. The scanner receives this pointer as its a3 argument and dereferences it (e.g., *(_QWORD *)(a3 + 1096)) to reach the 2,528-byte lexer state.

Token Categories

The 552 Flex rules map PTX lexemes to 162 distinct token types. Bison terminal codes range from 258 to 422. The scanner switch cases reveal the following category structure:

| Switch cases | Token code | Category | Examples / attributes |
|---|---|---|---|
| 2 | 364 | Semicolons / newlines | Statement terminator |
| 5--7 | 340, 341, 344 | Keywords | PTX keywords |
| 63--65 | 302 | Register names | Attribute: -1, chr-48, chr-38 (register numbering) |
| 74--91 | 320 | Data types | Values 1--18: .b8 through .f64 (18 type qualifiers) |
| 92--94 | 322 | Comparison types | Values 9, 7, 11 |
| 95--99 | 323 | Rounding modes | Values 24--29: .rn, .rz, .rm, .rp, etc. |
| 1 | (internal) | #include | Strips whitespace, copies filename |
| 3 | (dispatch) | Preprocessor directive | Calls sub_71F630 |
| 4 | 339 | #pragma | Strips whitespace |

Line and column tracking uses fields at *(state+48) (line number) and *(state+52) (column), incremented on each newline character.

Buffer Management

The scanner uses the standard Flex buffer stack for nested input sources (includes, macros, inline strings). Key buffer management functions:

| Address | Size | Identity | Purpose |
|---|---|---|---|
| sub_720190 | 2.0 KB | ptxensure_buffer_stack | Grows buffer stack via realloc |
| sub_7202E0 | 1.3 KB | ptx_create_buffer | Creates YY_BUFFER_STATE from FILE* |
| sub_720410 | 3.3 KB | yy_get_next_buffer | Refills character buffer, handles EOF |
| sub_720630 | 9.7 KB | yy_get_previous_state | Restores DFA state, SIMD-optimized memmove |
| sub_720BA0 | 4.3 KB | ptx_scan_string | Scans inline string into buffer |
| sub_724CC0 | 4.9 KB | ptx_scan_bytes | Macro expansion buffer allocation |
| sub_725070 | 2.7 KB | ptx_scan_buffer | Buffer creation with error recovery |

Notable: sub_720630 contains SSE2-optimized memmove using __m128i aligned 16-byte copies for buffer compaction -- a Flex optimization for large input buffers. The ptx_scan_bytes function (sub_724CC0) is called from the Bison parser actions (3 call sites in sub_4CCF30) to handle inline macro expansion during parsing.

Error strings in the buffer system:

  • "out of dynamic memory in ptxensure_buffer_stack()"
  • "out of dynamic memory in ptx_create_buffer()"
  • "out of dynamic memory in yy_get_next_buffer()"
  • "out of dynamic memory in ptx_scan_bytes()"
  • "bad buffer in ptx_scan_bytes()"
  • "out of dynamic memory in ptx_scan_buffer()"
  • "fatal flex scanner internal error--no action found"
  • "fatal flex scanner internal error--end of buffer missed"
  • "unexpected EOF while scanning"

Macro Preprocessor

Before tokens reach the Flex DFA, a character-level macro preprocessor handles .MACRO/.ENDM, .ELSE/.ELIF/.ENDIF, and .INCLUDE directives. The preprocessor lives at 0x71B000--0x720000 (~20 KB) and operates on raw character streams, not tokens. This design is identical to C's preprocessor running before the lexer.

Preprocessor Dispatch -- sub_71F630

The top-level dispatcher (14 KB) is called from the Flex scanner's case 3 (directive detection). It examines the directive name and routes to the appropriate handler:

| Directive | Handler | Size | Description |
|---|---|---|---|
| .MACRO | sub_71DCA0 | 8.4 KB | Macro definition: records body text, handles nesting |
| .ELSE / .ELIF | sub_71E2B0 | 32 KB | Conditional code: skips blocks, handles nested conditionals |
| .ENDIF | sub_71E2B0 | (shared) | End of conditional block |
| .INCLUDE | sub_71C310 | 8.3 KB | File inclusion: pushes new input source onto lexer stack |

The dispatcher uses strstr for substring matching on directive names and returns token codes (e.g., 364 for end-of-directive).

Conditional Handler -- sub_71E2B0

At 32 KB, this is the largest preprocessor function. It handles .ELSE, .ELIF, and .ENDIF by scanning ahead through the input character stream, counting nesting levels, and skipping entire blocks of PTX text when conditions are false. It calls sub_4287D0 (the token reader) to evaluate conditional expressions and sub_428C40 (string compare) for keyword matching. Two nearly-duplicate code blocks handle .ELSE and .ELIF paths with identical scanning logic but different branch conditions.

Macro Definition -- sub_71DCA0

Handles .MACRO directives by recording the macro body text. The function is recursive to support nested .MACRO definitions. It delegates to sub_71D710 (macro body scanner, 7.5 KB) and sub_71D1B0 (macro argument scanner, 6.8 KB). The argument scanner uses strlen + strncmp for keyword matching against a delimiter string parameter.

Include Handler -- sub_71C310

Processes .INCLUDE by pushing a new file onto the lexer's input stack. The function is recursive (calls itself 4 times) for nested includes. It manages the include-stack pointers at offsets +2128, +2136, +2160, and +2168 of the lexer state object (the 2,528-byte struct pointed to by parser+1096), and uses the "pushback character" register at offset +2441 of the same lexer state. String reference: "ptxset_lineno called with no buffer".

Error Handling

Macro errors are reported through sub_71BF60 (fatal macro abort) which calls sub_71BF30 to print "out of dynamic memory..." messages, and sub_71C140 (format error) which calls sub_42CA60 (error output). Nesting depth is checked by sub_724CC0 which prints "macro nesting too deep!" on overflow.

Bison LALR(1) Parser -- sub_4CE6B0

The parser is a standard Bison-generated LALR(1) shift-reduce parser spanning 48 KB (addresses 0x4CE6B0--0x4DA337). It contains ~512 grammar productions with 443 reduction cases. The function calls ptxlex (sub_720F00) to obtain tokens and uses five LALR tables for state transitions:

| Table | Address | Bison name | Purpose |
|---|---|---|---|
| word_1D146A0 | 0x1D146A0 | yydefact | Default reduction rule for each state |
| word_1D121A0 | 0x1D121A0 | yycheck | Valid lookahead verification |
| word_1D13360 | 0x1D13360 | yypact | Parser action table (shift/reduce) |
| word_1D150C0 | 0x1D150C0 | yypgoto | Goto table for nonterminals |
| byte_1D15960 | 0x1D15960 | yyr2 | Right-hand-side length for each rule |

Direct IR Construction (No AST)

The critical architectural decision: Bison reduction actions directly construct IR nodes rather than building an intermediate AST. When a grammar rule is reduced, the semantic action immediately:

  1. Allocates IR nodes via the pool allocator (sub_424070)
  2. Populates instruction fields from token attributes
  3. Calls instruction validators for semantic checking
  4. Links nodes into the instruction stream
  5. Registers symbols in the symbol table (via sub_426150, the hash map)

This means the parser is a single-pass translator from PTX text to IR. The trade-off is clear: no AST means no multi-pass source-level analysis, but it eliminates an entire allocation and traversal phase. For an assembler (as opposed to a high-level language compiler), this is the right choice -- PTX is already a linearized instruction stream with no complex scoping or overload resolution that would benefit from an AST.

Reduction Actions -- Semantic Processing

The 443 reduction cases in the parser body handle PTX constructs from simple register declarations to complex matrix instruction specifications. Diagnostic strings found in the parser tail (0x4D5000--0x4DA337) reveal the kinds of semantic checks performed during reduction:

Directive validation:

  • "Defining labels in .section"
  • "dwarf data" -- DWARF section processing
  • "reqntid" / ".reqntid directive" -- required thread count
  • ".minnctapersm directive" -- min CTAs per SM
  • ".maxnctapersm" / ".maxnctapersm directive" -- max CTAs per SM (deprecated)
  • ".maxntid and .reqntid cannot both be specified"
  • ".maxnctapersm directive deprecated..."
  • ".minnctapersm is ignored..."

Type and operand validation:

  • "Vector Type not specified properly"
  • ".f16x2 packed data-type" -- half-precision packed type
  • "matrix shape" -- matrix instruction dimensions
  • ".scale_vectorsize" -- vector scaling modifier
  • "too many layout specifiers"

Resource limits:

  • "Kernel parameter size larger than 4352 bytes"

Architecture gating:

  • "sm_50", "sm_20", "sm_53" -- target architecture checks via sub_485520(ctx, sm_number)
  • PTX version checks via sub_485570(ctx, major, minor)

Expression handling:

  • "%s+%llu" / "%s-%s" -- label arithmetic in address expressions
  • "Negative numbers in dwarf section" -- DWARF data validation

Symbol resolution:

  • "unrecognized symbol" -- lexer/symbol table failure
  • "syntax error" -- generic parse error
  • ".extern" -- external declarations
  • ".noreturn directive" -- function attributes
  • "texmode_unified" / "texmode_raw" -- texture mode selection
  • "cache eviction priority" / ".level::eviction_priority" -- cache policy

Error Recovery

Parse errors trigger sub_42FBA0 with "syntax error" as the message. The central diagnostic emitter (sub_42FBA0, 2,388 bytes, 2,350 callers) handles all severity levels:

| Severity | Prefix | Tag | Behavior |
|---|---|---|---|
| 0 | (suppressed) | -- | Silently ignored |
| 1--2 | "info " | @I@ | Informational message |
| 3 | "warning " or "error " | @W@ or @E@ | Context-dependent; promoted to error by --Werror |
| 4 | "error* " | @E@ | Non-fatal error |
| 5 | "error " | @E@ | Error |
| 6+ | "fatal " | (none) | Calls longjmp to abort compilation |

The diagnostic system reads the source file to display context lines (prefixed with "# "), caching file offsets every 10 lines in a hash map for fast random-access seeking.

Parser Initialization -- sub_451730

Parser initialization (14 KB) builds the lexer's symbol table with all built-in PTX names before parsing begins. This function is called from the compilation driver (sub_446240) and performs three major tasks:

1. Special Register Registration

All PTX special registers are pre-registered in the symbol table with their internal identifiers:

| Category | Registers |
|---|---|
| Thread/block ID | %ntid, %laneid, %warpid, %nwarpid, %smid, %nsmid, %ctaid, %nctaid, %gridid |
| Clocks | %clock, %clock_hi, %clock64 |
| Performance counters | %pm0--%pm7, %pm0_64--%pm7_64 |
| Lane masks | %lanemask_eq, %lanemask_le, %lanemask_lt, %lanemask_ge, %lanemask_gt |
| Environment | %envreg0--%envreg31 |
| Timers | %globaltimer_lo, %globaltimer_hi |
| Shared memory | %total_smem_size, %dynamic_smem_size |
| Texture types | .texref, .samplerref, .surfref |
| Predefined macros | GPU_ARCH, PTX_MAJOR_VERSION, PTX_MINOR_VERSION |

2. Opcode Table Construction

Calls sub_46E000 -- the 93 KB instruction table builder -- to register all PTX opcodes with their legal type combinations. See the dedicated section below.

3. Context State Initialization

Allocates and initializes two objects: the parser state (1,128 bytes, sub_424070(pool, 1128)) and the lexer state (2,528 bytes, sub_424070(pool, 2528)). The parser state stores a pointer to the lexer state at offset +1096. The string "PTX parsing state" identifies the parser state allocation in memory dumps. The string "<builtin>" serves as the filename for built-in declarations. Both objects are zeroed via memset before field initialization.

Instruction Table Builder -- sub_46E000

This is the largest single function in the front-end region at 93 KB. It is not a normal function body but a massive initialization sequence that calls sub_46BED0 exactly 1,141 times -- once per legal PTX instruction variant. Each call registers an opcode name together with its accepted type combinations using compact encoding strings.

Operand Encoding Strings

Each instruction variant is registered with a string that encodes its operand signature. The encoding uses single-character codes for operand categories:

Code  Meaning
F     Float operand (.f16, .f32, .f64)
H     Half-precision (.f16, .f16x2)
I     Integer operand (.s8--.s64, .u8--.u64)
B     Bitwise operand (.b8--.b128)
N     Immediate / numeric literal
P     Predicate operand

String references found in the function include composite type signatures:

  • "F32F32" -- binary float32 operation
  • "F16F16F16F16" -- quad half-precision
  • "I32I8I8I32" -- integer MMA (int32 accumulator, int8 operands)
  • "F64F64F64F64" -- quad float64 (double-precision MMA)
  • "_mma.warpgroup" -- warp-group MMA marker
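The registration-then-lookup scheme (sub_46BED0 registering variants, sub_46C690 retrieving them) can be sketched as a multimap from opcode name to signature strings. This is a simplified illustration: real signatures carry widths ("F32F32") rather than bare category letters, and the hand-picked registrations below are invented examples, not recovered table entries.

```python
# Minimal sketch of the opcode table: each registration call adds one
# (name, operand-signature) pair; lookup walks the variants for a name
# until one matches the operand classes seen in the source.
from collections import defaultdict

opcode_table = defaultdict(list)   # opcode name -> list of signature strings

def register(name: str, signature: str) -> None:
    """One of the 1,141 registration calls (cf. sub_46BED0)."""
    opcode_table[name].append(signature)

def match(name: str, operand_classes: str):
    """Descriptor matching (cf. sub_46C690 / sub_46C6E0), much simplified."""
    for sig in opcode_table.get(name, []):
        if sig == operand_classes:
            return sig
    return None

# Illustrative registrations using the single-letter category codes:
register("add", "FFF")    # e.g. add.f32  d, a, b
register("add", "III")    # e.g. add.s32  d, a, b
register("setp", "PFF")   # e.g. setp.lt.f32  p, a, b

print(match("add", "III"))   # → III
print(match("add", "FFP"))   # → None (no such variant registered)
```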

Hash Tables

The instruction table builder populates two hash tables at offsets +2472 and +2480 within the lexer state object (the 2,528-byte struct passed as the first argument to sub_46E000). These hash tables provide O(1) lookup from opcode name to the registered type combination list.

Registration Function -- sub_46BED0

Called 1,141 times from sub_46E000. Each call takes an opcode name string and an operand encoding string, creates a descriptor node, and inserts it into the hash table. The descriptor captures the opcode, its legal operand types, and the semantic validation function to call during parsing.

Instruction Lookup -- sub_46C690 and sub_46C6E0

At parse time, when the parser reduces an instruction production, it calls sub_46C690 to look up the instruction name in the hash table built by sub_46E000. The lookup returns a descriptor list, and sub_46C6E0 (6.4 KB, the descriptor matcher) walks the list to find the variant matching the actual operands present in the source.

Operand Classification -- 12 Categories

The descriptor matcher (sub_46C6E0) classifies each operand into one of 12 categories based on its syntactic form, then matches the category sequence against the registered encoding strings. The 12 categories cover:

  1. General-purpose register (R)
  2. Predicate register (P)
  3. Uniform register (UR)
  4. Uniform predicate (UP)
  5. Integer immediate
  6. Float immediate
  7. Address expression (register + offset)
  8. Label / symbol reference
  9. Special register
  10. Vector operand
  11. Texture / surface / sampler reference
  12. Bitfield / compound modifier

The classification examines token attributes set by the lexer -- register type bits at (field >> 28) & 7, immediate flag (0x1000000), uniform flag (0x6000000), and operand descriptor fields at instruction offset 84+.
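The recovered bit layout can be decoded as below. Only the shift and the two mask constants come from the decompilation; the field names and the helper itself are illustrative.

```python
# Decoding the token-attribute word set by the lexer: register type in
# bits 28-30, immediate flag at 0x1000000, uniform flags at 0x6000000.
REG_TYPE_SHIFT = 28
IMM_FLAG  = 0x1000000
UNI_FLAGS = 0x6000000

def classify(attr: int) -> dict:
    return {
        "reg_type":   (attr >> REG_TYPE_SHIFT) & 7,
        "is_imm":     bool(attr & IMM_FLAG),
        "is_uniform": bool(attr & UNI_FLAGS),
    }

attr = (3 << 28) | 0x1000000          # register type 3 with the immediate bit set
print(classify(attr))
# → {'reg_type': 3, 'is_imm': True, 'is_uniform': False}
```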

Parser State Object (1,128 bytes)

The parser passes a state object through all phases. This 1,128-byte structure (sub_424070(pool, 1128)) carries compilation context and pointers to sub-systems. It is indexed as _QWORD* (8-byte slots), so QWORD index [N] = byte offset N*8. The highest accessed byte is +1120 (index [140]), fitting exactly within the 1,128-byte allocation.

Offset  Size  Field                Description
+0      8     pool_context         Pool allocator handle (from sub_4258D0)
+8      8     compilation_unit     Pointer to compilation unit (parameter a2)
+16     8     macro_symbol_table   Hash table for macros (sub_425CA0, 64 buckets)
+24     8     module_ptr           Pointer to module object (parameter a3)
+32     8     container_a          Sorted set container (8,192 buckets)
+56     8     scope_chain[0]       Scope chain entry (sub_44F7C0), used for symbol resolution
+64     8     scope_chain[1]       Second scope chain entry
+72     8     scope_chain[2]       Third scope chain entry
+80     8     type_map             Type descriptor hash map (sub_42D150, 8 buckets)
+96     8     symbol_tables[0..5]  Six hash tables for symbol lookup (at +96, +104, +112, +120, +128, +136)
+152    8     current_function     Pointer to current function being parsed
+160    4     ptx_major_version    PTX ISA major version (set by Bison reduction)
+164    4     ptx_minor_version    PTX ISA minor version
+168    4     sm_version_check     SM target version for feature gating
+177    1     flag_a               Initialization flag
+192    2     word_96              Zero-initialized word at WORD index 96
+196    4     address_size         32 or 64 (address width)
+208    8     hash_ref_a           Hash table reference (64-bucket)
+236    1     default_flag         Initialized to 1
+264    16    list_a               Linked list (head at +264, tail ptr at +272 points to head)
+280    8     sorted_set_b         Sorted set (8,192 buckets)
+288    8     sorted_set_c         Sorted set (1,024 buckets)
+296    16    sorted_maps[0..1]    Two sorted maps (sub_42A300)
+320    8     hash_e               Hash table (1,024 buckets)
+328    16    list_b               Linked list (head/tail pair)
+344    16    list_c               Linked list (head/tail pair)
+360    256   offset_table[16]     SSE-initialized offset table (16 entries of 16 bytes each, computed from base address + constants at xmmword_1CFDA00--1CFDA70)
+616    16    list_d               Linked list (head/tail pair)
+632    16    list_e               Linked list (head/tail pair); low bits of first word used as address_space_flags
+648    8     local_symbol_table   Per-scope local symbol table pointer
+824    8     symbol_lookup_ref    Hash table for symbol name lookup
+832    1     dwarf_section_flag   Nonzero when inside .section DWARF data
+834    1     directive_flag_a     Checked as pair with +835
+836    1     directive_flag_b     Set to 1 by multiple Bison reductions
+840    8     builtin_filename     Interned string "<builtin>"
+848    8     empty_string         Interned empty string ""
+856    4     sm_arch_number       SM architecture number (parameter a6, e.g. 90 for sm_90)
+860    1     feature_a            Feature flags set during parsing
+861    1     feature_b
+862    1     feature_c
+864    1     feature_d
+865    1     feature_e            ORed with 1 by Bison reductions
+869    1     flag_h               Initialized to 0
+960    4     sm_target_code       SM target code used in sub_454E70 checks
+968    8     insn_stream_a        Instruction stream pointer A (set in Bison)
+976    8     insn_stream_b        Instruction stream pointer B
+984    8     insn_stream_c        Instruction stream pointer C
+1000   1     insn_state_flag      Instruction state flag (= 0)
+1008   8     string_pool          String pool pointer
+1016   8     context_ref          Compilation context reference (parameter a4)
+1048   4     dword_262            Zero-initialized
+1053   1     parsing_active       Toggled 1/0 during active parsing
+1080   16    list_f               Linked list (head/tail pair)
+1096   8     lexer_state_ptr      Pointer to 2,528-byte lexer state object (see below)
+1104   16    list_g               Linked list (head/tail pair)
+1120   1     param_flag           From parameter a10

Lexer State Object (2,528 bytes)

The lexer state is a separate heap-allocated object (sub_424070(pool, 2528)) pointed to by parser_state+1096. It is the primary state carrier for the Flex DFA scanner and the instruction table subsystem. All functions that need scanner state (the Bison parser, the Flex scanner, the include handler, and the instruction table builder) access this object through the pointer at +1096.

Offset  Size  Field                Description
+48     4     line_number          Current source line (incremented on newline)
+52     4     column_number        Current source column
+64     8     buffer_limit         Pointer to end of current scan buffer
+76     4     start_condition      Flex DFA start condition (*(state+76), indexes off_203C020)
+152    1     flag_a               Scanner state flag
+156    8     sentinel_a           Initialized to -1 (0xFFFFFFFFFFFFFFFF)
+164    8     sentinel_b           Initialized to -1
+172    4     address_size_proxy   Written by Bison via sub_4563E0; -1 on init
+180    8     zero_pair            Zero-initialized
+188    8     sentinel_c           Initialized to 0xFFFFFFFF00000000
+196    8     sentinel_d           Initialized to -1
+204    4     sentinel_e           DWORD[51], initialized to -1
+208    2     word_104             WORD[104], zero-initialized
+540    1     flag_b               Scanner flag
+541    1     include_active       Checked by Flex (lexer+541) and Bison to gate .INCLUDE behavior
+784    8     current_filename     Pointer to current filename string (set during include handling)
+1984   128   version_array[32]    DWORD array of version fields; written by sub_70FDD0(lexer, index, value) as *(lexer + 4*index + 1984) = value
+2104   4     ptx_major_ver        version_array[30] = PTX major version (initialized to 9)
+2108   4     ptx_minor_ver        version_array[31] = PTX minor version (initialized to 0)
+2128   8     include_stack_a      Include nesting pointer 1 (linked list for file stack)
+2136   8     include_stack_b      Include nesting pointer 2
+2160   8     include_stack_head   Head of include stack (walked by sub_71C310)
+2168   8     include_stack_file   Include stack filename pointer
+2441   1     pushback_char        Character pushed back into input stream by scanner
+2464   2     word_1232            Zero-initialized
+2466   1     flag_c               Flag
+2472   8     opcode_hash_a        Opcode lookup hash table (populated by sub_46E000)
+2480   8     opcode_hash_b        Second opcode lookup hash table (populated by sub_46E000)
+2488   8     context_sub_ref      Compilation context sub-reference (parameter a9); accessed by Bison for sub_457CB0/sub_70A5B0 calls
+2496   1     flag_d               Flag
+2504   24    tail_fields          Three zero-initialized QWORD slots (indices [313],[314],[315])

Version checks use sub_485520(ctx, sm_number) (SM architecture >= N) and sub_485570(ctx, major, minor) (PTX version >= major.minor). For example, the address-space attribute setter (sub_4035D3) checks sm_90 and PTX 7.8:

if (!sub_485520(ctx, 90))
    sub_42FBA0(&err, loc, "sm_90", ...);   // Error: requires sm_90
if (!sub_485570(ctx, 7, 8))
    sub_42FBA0(&err, loc, "7.8", ...);     // Error: requires PTX 7.8
*(byte*)(v15 + 632) = (old & 0xFC) | (a2 & 3);   // Set address space bits

Semantic Validators

The parser's reduction actions dispatch to specialized validator functions for each instruction category. These functions live in 0x460000--0x4D5000 and check SM architecture requirements, type compatibility, operand constraints, and instruction-specific invariants.

Address           Size     Identity                        Coverage
sub_4B2F20        52.6 KB  General instruction validator   Textures, surfaces, loads, stores, cvt, calls
sub_4CE6B0 (tail) 48 KB    Directive/declaration validator .local_maxnreg, .alias, .unified, .pragma, .noreturn
sub_4C5FB0        28.5 KB  Operand validator               State spaces, rounding, barriers, cache levels
sub_4C2FD0        12.2 KB  WMMA/MMA validator              Matrix dimensions, FP8 types, layout specifiers
sub_49BBA0        11.4 KB  MMA scale/block validator       .scale_vec_size, .block_scale, sparse GMMA
sub_4ABFD0        11.1 KB  Async copy validator            cp.async, bulk copy, cvt.tf32.f32.rna
sub_4A73C0        10.9 KB  Tensormap validator             .tile, field ranges, .tensormap::generic
sub_4BFED0        10.3 KB  WMMA shape/type validator       .m%dn%dk%d shapes, .aligned modifier
sub_4AF9F0        5.8 KB   CVT validator                   cvt.f16x2.f32, type combinations, rounding
sub_4AEB60        3.7 KB   LDSM validator                  _ldsm.s8.s4/_ldsm.u8.u4 format conversion
sub_4B1630        4.6 KB   Function address validator      cudaDeviceSynchronize, kernel/device addresses
sub_498AF0        3.9 KB   MMA layout validator            Row/col layout, floating-point type constraints
sub_497C00        3.0 KB   Prototype validator             .FORCE_INLINE, .noreturn, .unique, register counts
sub_496690        3.6 KB   Scope/barrier validator         Scope modifiers, barrier constraints
sub_494210        2.3 KB   Sparse GMMA validator           Sparse GMMA with specific types
sub_492C80        4.0 KB   Cache eviction validator        L2 eviction priority, .v8.b32/.v4.b64
sub_49A5A0        3.5 KB   Special register validator      %laneid, %clock64, %lanemask_*, arch gating
sub_4A0CD0        4.9 KB   Variable declaration validator  .texref, .managed, .reserved, .common
sub_4A02A0        2.6 KB   Initializer validator           generic() operator, function addresses
sub_4036D9        437 B    Parameter list validator        Count, types, alignment, state space

Validators follow a uniform pattern: they receive the parser context and instruction data, check constraints against the current SM architecture and PTX version, and call sub_42FBA0 with descriptive error messages when violations are found. The general validator (sub_4B2F20, 52.6 KB) is the second-largest function in the front-end and covers the broadest range of PTX instructions.

ROT13 Opcode Name Obfuscation

PTX opcode names stored in the binary are ROT13-encoded as an obfuscation measure. The static constructor ctor_003 at 0x4095D0 (17 KB, ~1,700 lines) decodes and populates the opcode name table at 0x29FE300 during program startup. Each entry is a (string_ptr, length) pair. Decoded examples:

ROT13    Decoded  PTX instruction
NPDOHYX  ACQBULK  acqbulk
OFLAP    BSYNC    bsync
PPGY.P   CCTL.C   cctl.c
SZN      FMA      fma
FRGC     SETP     setp
ERGHEA   RETURN   return
RKVG     EXIT     exit

The table covers the entire PTX ISA vocabulary -- hundreds of opcodes. A separate ROT13 table in ctor_005 (0x40D860, 80 KB) encodes 2,000+ internal Mercury/OCG tuning knob names (see Knobs System).
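Because ROT13 is its own inverse, a single routine both encodes and decodes the table. The examples below reproduce the recovered pairs from the table above:

```python
# ROT13 decoding of obfuscated opcode names; non-letters ('.', digits)
# pass through unchanged, as in the cctl.c entry.
import codecs

def rot13(s: str) -> str:
    return codecs.encode(s, "rot_13")

assert rot13("NPDOHYX") == "ACQBULK"   # acqbulk
assert rot13("SZN") == "FMA"           # fma
assert rot13("FRGC") == "SETP"         # setp
assert rot13("PPGY.P") == "CCTL.C"     # cctl.c -- '.' is untouched
print(rot13("RKVG"))                   # → EXIT
```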

Compilation Pipeline Integration

The parser is invoked from the top-level compilation driver sub_446240 (11 KB), which orchestrates the full pipeline:

Parse  →  CompileUnitSetup  →  DAGgen  →  OCG  →  ELF  →  DebugInfo

The driver reports timing for each phase:

  • "Parse-time : %.3f ms (%.2f%%)"
  • "CompileUnitSetup-time : %.3f ms (%.2f%%)"
  • "DAGgen-time : %.3f ms (%.2f%%)"
  • "OCG-time : %.3f ms (%.2f%%)"
  • "ELF-time : %.3f ms (%.2f%%)"
  • "DebugInfo-time : %.3f ms (%.2f%%)"

The parse phase encompasses the Flex scanner, macro preprocessor, Bison parser, instruction table lookup, and all semantic validation. Since the parser directly builds IR, the output of the parse phase is a populated instruction stream ready for the DAG generation phase.

PTX Text Generation (Reverse Direction)

The inverse of parsing -- converting IR back to PTX text -- lives in 0x4DA340--0x5A8E40 (580 formatter functions). Each handles one PTX opcode. A dispatcher at sub_5D4190 (12.9 KB) routes by opcode name using 81 direct string comparisons plus a 473-entry hash switch. Every formatter follows an identical allocation pattern:

pool = sub_4280C0(ctx)[3];              // Get allocator pool
buf = sub_424070(pool, 50000);          // 50KB temp buffer
// ... sprintf() operands into buf ...
len = strlen(buf);
result = sub_424070(pool, len + 1);     // Exact-size allocation
strcpy(result, buf);
sub_4248B0(buf);                        // Free temp buffer
return result;

A monolithic format string table (~1.8 MB), reached through the formatters' a2 parameter, contains pre-assembled PTX text templates with %s/%llu/%d placeholders. This trades memory for speed: instead of building instruction text character by character, ptxas fills operand names into fixed templates at runtime.

Function Map

Address     Size     Identity                                                                  Confidence
sub_720F00  15.8 KB  ptxlex -- Flex DFA scanner main                                           98%
sub_4CE6B0  48 KB    ptxparse -- Bison LALR(1) parser                                          HIGH
sub_46E000  93 KB    Instruction table builder (1,141 opcode registrations)                    HIGH
sub_46BED0  --       Per-opcode registration function (called 1,141x)                          HIGH
sub_46C690  --       Instruction lookup entry                                                  HIGH
sub_46C6E0  6.4 KB   Descriptor matcher (12-category operand classifier)                       HIGH
sub_451730  14 KB    Parser initialization (allocs 1,128B parser state + 2,528B lexer state)   HIGH
sub_70FDD0  14 B     Lexer version array writer: *(a1 + 4*a2 + 1984) = a3                      HIGH
sub_71F630  14 KB    Preprocessor directive dispatcher                                         93%
sub_71E2B0  32 KB    Conditional handler (.ELSE/.ELIF/.ENDIF)                                  92%
sub_71DCA0  8.4 KB   Macro definition handler (.MACRO)                                         90%
sub_71C910  13 KB    Directive scanner                                                         91%
sub_71C310  8.3 KB   Include handler (.INCLUDE)                                                90%
sub_71D1B0  6.8 KB   Macro argument scanner                                                    89%
sub_71D710  7.5 KB   Macro body scanner                                                        89%
sub_71BA10  2.3 KB   Macro character peek                                                      88%
sub_71BB80  2.6 KB   Macro buffer reader                                                       88%
sub_71BE20  1.1 KB   Macro expansion entry                                                     85%
sub_71BF60  1.8 KB   Macro fatal abort                                                         90%
sub_71C140  2.5 KB   Macro format error                                                        88%
sub_720190  2.0 KB   ptxensure_buffer_stack                                                    95%
sub_7202E0  1.3 KB   ptx_create_buffer                                                         96%
sub_720410  3.3 KB   yy_get_next_buffer                                                        95%
sub_720630  9.7 KB   yy_get_previous_state (SSE2 optimized)                                    94%
sub_720BA0  4.3 KB   ptx_scan_string                                                           93%
sub_724CC0  4.9 KB   ptx_scan_bytes / macro nesting check                                      91%
sub_725070  2.7 KB   ptx_scan_buffer                                                           93%
sub_42FBA0  2.4 KB   Central diagnostic emitter (2,350 callers)                                HIGH
sub_4280C0  597 B    Thread-local context accessor (3,928 callers)                             HIGH
sub_424070  2.1 KB   Pool allocator (3,809 callers)                                            HIGH
sub_4248B0  923 B    Pool deallocator (1,215 callers)                                          HIGH
sub_42BDB0  14 B     Fatal OOM handler (3,825 callers)                                         HIGH
sub_446240  11 KB    Top-level compilation driver                                              HIGH
sub_4095D0  17 KB    ROT13 opcode name table initializer                                       HIGH
sub_5D4190  12.9 KB  PTX text format dispatcher                                                HIGH
sub_4B2F20  52.6 KB  General instruction validator                                             HIGH
sub_4C5FB0  28.5 KB  Instruction operand validator                                             HIGH
sub_4C2FD0  12.2 KB  WMMA/MMA validator                                                        HIGH
sub_485520  --       SM architecture check (sm >= N)                                           HIGH
sub_485570  --       PTX version check (version >= M.N)                                        HIGH

Cross-References

PTX Directive Handling

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

PTX directives -- .version, .target, .entry, .func, .global, .shared, .local, .const, .reg, .param, .weak, .common, .extern, .visible, .alias, .pragma -- are parsed and semantically validated by the Bison reduction actions embedded in the 48 KB parser function sub_4CE6B0. Unlike instructions, which pass through opcode table lookup (sub_46E000) and per-instruction semantic validators, directives are handled entirely within the Bison reduction switch. Each grammar production's action block reads values from the parser value stack, validates them against the current PTX version and target architecture, and writes the results into the 1,128-byte parser state object or its child compile-unit state (CU_state). No intermediate AST is constructed; directives take effect immediately during parsing.

The state object maintains 18 linked lists (9 head/tail pairs at offsets 368--512) that track symbols per state space, a string-keyed hash map (offset 208) for target feature flags, and a scope chain (offset 984) rooted at offset 968 for nested function declarations. Two version-gating functions -- sub_489050 (PTX ISA version) and sub_489390 (SM architecture) -- guard every directive that was introduced after the baseline ISA.

Bison parser         sub_4CE6B0 (48,263 bytes, 631 case labels)
Version validator    sub_44A100 (bsearch over 44 valid PTX version IDs at xmmword_1CFD940)
PTX version gate     sub_489050 -- sub_454E70 + sub_455A80(major, minor, state)
SM arch gate         sub_489390 -- checks state+168 >= required_sm
Target handler       sub_4B1080 (per-target, texmode logic)
Function handler     sub_497C00 (entry/func declarations, ABI)
Variable handler     sub_4A0CD0 (state-space declarations, type validation)
Parameter allocator  sub_44F6E0 (48-byte parameter nodes)
Scope manager        sub_44B9C0 (scope hash map at state+1008)
State-space lists    18 linked lists at state+368--state+512
Target feature map   Hash map at state+208 (string keys, presence values)

Architecture

PTX source text
     |
     v
+-------------------------------------------------------------------+
| BISON LALR(1) PARSER  sub_4CE6B0                                  |
| 631 reduction cases, each a direct action                         |
|                                                                   |
|   DIRECTIVE       CASES           HANDLER                         |
|   .version        35              sscanf + sub_44A100             |
|   .target         5, 38           sub_4B1080 (per-target)         |
|   .address_size   10              inline validation                |
|   .entry          82, 86-88       sub_497C00                      |
|   .func           97, 100-105     sub_497C00                      |
|   .global/shared  57-68           sub_4A0CD0                      |
|     /local/const                                                  |
|   .reg/.param     110-112         inline + sub_48BE80             |
|   .weak           55              sub_489050(3,1)                 |
|   .common         56              sub_489050(5,0)                 |
|   .extern         79              sets CU+81, linkage=3           |
|   .visible        80              sets CU+81, linkage=2           |
|   .alias          41              sub_4036D9 (param match)        |
|   .pragma         42              prefix-match dispatch chain     |
|                                                                   |
+-------------------+-----------------------------------------------+
                    |
          +---------+---------+
          v                   v
  PARSER STATE OBJECT     CU_STATE (compile-unit)
  ~1200 bytes             pointed to by state+1096
  state+144: version      CU+0:   linkage code
  state+152: target       CU+24:  state-space ID
  state+160: ptx_major    CU+48:  func metadata buf
  state+164: ptx_minor    CU+80:  return type
  state+168: sm_id        CU+81:  declaration linkage
  state+196: addr_size    CU+88:  current function
  state+208: feature map  CU+156: noinline pragma
  state+368: 18 ll heads  CU+172: reg-usage pragma
  state+968: scope root   CU+784: arch capability
  state+984: scope chain  CU+2448: target string
  state+1008: scope map   CU+2456: version string

.version X.Y -- Case 35

The .version directive establishes the PTX ISA version for the compilation unit. The parser extracts the major and minor version integers from the grammar, validates the combined version against a sorted table of 44 known versions, and stores both the numeric and string forms.

// Reconstructed from case 35 of sub_4CE6B0
int major = sub_449950();       // extract major from parser state
int minor = sub_449960();       // extract minor from parser state
sscanf(token, "%d.%d", &major, &minor);

// Allocate formatted version string
char* ver_str = pool_alloc(pool, 5);
sprintf(ver_str, "%d.%d", major, minor);

// Validate: bsearch over 44 valid version IDs
int combined = major * 10 + minor;
if (!sub_44A100(combined))
    fatal_error("Unsupported PTX version %s", ver_str);

pool_free(ver_str);

// Store in parser state
state->version_string = token;          // state+144
state->ptx_major = major;               // state+160
state->ptx_minor = minor;               // state+164
CU_state->version_string = token;       // CU+2456

Version Validation -- sub_44A100

// sub_44A100: validate PTX version against known versions
bool sub_44A100(int version_id) {
    int key = version_id;
    return bsearch(&key,
                   xmmword_1CFD940,   // sorted table base
                   0x2C,              // 44 entries
                   4,                 // sizeof(int)
                   compar) != NULL;   // simple integer compare
}

The 44-entry table at xmmword_1CFD940 contains the combined version IDs (major*10 + minor) for every PTX ISA version recognized by ptxas v13.0. This covers PTX 1.0 through 8.7+.
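The combined-ID scheme and the binary search can be sketched with bisect. The full contents of the 44-entry table are not reproduced here; the subset below is illustrative only.

```python
# Version check sketch: encode major.minor as major*10 + minor and
# binary-search a sorted table of accepted IDs (cf. sub_44A100).
from bisect import bisect_left

# Illustrative subset of the 44 accepted version IDs:
VALID_VERSIONS = [10, 20, 23, 30, 31, 50, 60, 70, 78, 80, 81, 83, 87]

def version_ok(major: int, minor: int) -> bool:
    key = major * 10 + minor
    i = bisect_left(VALID_VERSIONS, key)
    return i < len(VALID_VERSIONS) and VALID_VERSIONS[i] == key

print(version_ok(8, 7))   # → True
print(version_ok(8, 9))   # → False (not in this illustrative subset)
```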

.target sm_XX -- Cases 5 and 38

The .target directive accepts a comma-separated list of targets: SM architecture identifiers (sm_XX, compute_XX) and feature modifiers (texmode_unified, texmode_independent, texmode_raw, map_f64_to_f32, debug).

Case 38 -- Target List Iteration

// Reconstructed from case 38
for (node = list_begin(*v5); !list_end(node); node = list_next(node)) {
    char* target_str = list_value(node);
    sub_4B1080(target_str, location, state);
}

Per-Target Handler -- sub_4B1080

The function branches on whether the target string contains "sm_" or "compute_".

SM/compute targets:

// SM target path in sub_4B1080
state->target_string = target_str;       // state+152
CU->target_string = target_str;          // CU+2448
state->arch_variant = sub_1CBEFD0(target_str);  // state+177

int sm_id;
sscanf(target_str + prefix_len, "%d", &sm_id);
state->target_id = sm_id;               // state+168
if (sm_id > state->max_target)
    state->max_target = sm_id;           // state+204

// Validate against one of three target tables:
//   compute_ targets:  unk_1D16160 (6 entries, 12 bytes each)
//   sm_ sub-variant:   unk_1D161C0 (7 entries, 12 bytes each)
//   standard sm_:      unk_1D16220 (32 entries, 12 bytes each)
// Each entry: { sm_id, required_ptx_major, required_ptx_minor }
entry = bsearch(&sm_id, table, count, 12, sub_484B70);
if (entry) {
    if (!sub_455A80(entry->ptx_major, entry->ptx_minor, state))
        state->version_mismatch_flag |= 1;   // state+178
}

Feature modifiers:

Modifier             PTX Requirement          Action                                            CU State
map_f64_to_f32       Deprecated for sm > 12   Stored in feature map; CU+152 |= 1                Feature flag
texmode_unified      --                       Stored in feature map; default if none specified  Default
texmode_independent  PTX >= 1.5               Stored in feature map; CU+2464 = 1                Tex mode
texmode_raw          Requires state+220 flag  Stored in feature map; CU+2465 = 1                Tex mode
debug                PTX >= 3.0               CU+2466 = 1; state+1033 = 1; state+834 = 1        Debug on

Texmode values are mutually exclusive. Each setter checks the feature hash map at state+208 for conflicting entries before inserting:

// texmode_unified path in sub_4B1080
if (map_get(state->feature_map, "texmode_independent"))
    error("conflicting texmode: %s", target_str);
if (map_get(state->feature_map, "texmode_raw"))
    error("conflicting texmode: %s", target_str);
map_put(state->feature_map, "texmode_unified", 1);

Case 5 -- Automatic Texmode Inference

When the .target directive omits an explicit texmode, case 5 infers one based on CLI flags:

if (arch_supports_texmode(CU->arch_capability)) {
    if (!map_has(feature_map, "texmode_independent") &&
        !map_has(feature_map, "texmode_raw")) {
        if (state->cli_texmode_independent)
            sub_4B1080("texmode_independent", loc, state);
        else if (state->cli_texmode_raw)
            sub_4B1080("texmode_raw", loc, state);
        else
            sub_4B1080("texmode_unified", loc, state);
    }
}
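The inference priority recovered from case 5 can be condensed into a small decision function. This is a Python paraphrase; the flag names are illustrative stand-ins for the CLI state fields.

```python
# Texmode inference order: an explicit texmode already in the feature
# map wins; otherwise the CLI flags select independent, then raw, then
# the unified default (cf. case 5 of sub_4CE6B0).
def infer_texmode(feature_map: set, cli_independent: bool, cli_raw: bool) -> str:
    if "texmode_independent" in feature_map or "texmode_raw" in feature_map:
        return "explicit"                  # .target already chose one
    if cli_independent:
        return "texmode_independent"
    if cli_raw:
        return "texmode_raw"
    return "texmode_unified"

print(infer_texmode(set(), False, False))   # → texmode_unified
print(infer_texmode(set(), False, True))    # → texmode_raw
```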

.address_size 32|64 -- Case 10

// Reconstructed from case 10
sub_489050(state, 2, 3, ".address_size directive", location);  // PTX >= 2.3

int value = stack_value;
if (((value - 32) & ~0x20) != 0)       // allows exactly 32 and 64
    error("Invalid address size: %d", value);

state->address_size = value;             // state+196

The bit trick (v - 32) & ~0x20 passes for exactly two values:

  • v=32: (0) & 0xFFFFFFDF = 0
  • v=64: (32) & 0xFFFFFFDF = 0

Any other value produces a nonzero result and triggers an error.
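The claim is easy to verify exhaustively, using 32-bit masking to mirror the binary's arithmetic:

```python
# Confirm that (v - 32) & ~0x20 is zero for exactly v = 32 and v = 64
# when evaluated in 32-bit arithmetic.
def addr_size_ok(v: int) -> bool:
    return ((v - 32) & ~0x20 & 0xFFFFFFFF) == 0

accepted = [v for v in range(256) if addr_size_ok(v)]
print(accepted)   # → [32, 64]
```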

.entry / .func Declarations -- Cases 76+, 82, 88, 97, 103

Function and entry declarations span multiple Bison productions because the grammar decomposes them into prototype, parameter list, linkage qualifier, and body productions. The central handler sub_497C00 processes both entry functions and device functions.

sub_497C00 -- Function Declaration Handler

// Reconstructed signature
int64 sub_497C00(
    state,          // parser state
    int decl_type,  // 1=visible, 2=forward, 3=extern, 4=static, 5=definition
    name,           // function name token
    return_params,  // return parameter list (NULL for entries)
    params,         // input parameter list
    bool is_entry,  // 1 for .entry, 0 for .func
    bool is_func,   // CU+80 qualifier for .func
    scratch_regs,   // scratch register list
    int retaddr,    // return address allocno (-1 if none)
    bool noreturn,  // .noreturn attribute
    bool unique,    // .unique attribute
    bool force_inline, // .FORCE_INLINE attribute
    location        // source location token
);

Processing steps:

  1. Scope creation: sub_44B9C0(state) creates a new scope context. The scope hash map at state+1008 maps scope IDs (starting at 61) to 40-byte scope descriptors.

  2. Parameter node allocation: sub_44F6E0(state, scope, name, 0, 0, location) allocates a 48-byte parameter descriptor: {type_info, name, scope, alignment, init_data, location}.

  3. Symbol lookup: sub_4504D0(state+968, name, 1, state) searches the current scope chain for an existing declaration.

  4. Forward declaration resolution: If a matching forward declaration exists, the handler validates compatibility:

    • Declaration type consistency (except 2->1 and 4->1 promotions)
    • Parameter list type/alignment/state-space matching via sub_484DA0
    • Return parameter matching via sub_484DA0
    • Scratch register count and types
    • Return address register, first parameter register
    • .noreturn and .unique attribute consistency
    • Unified identifier matching
  5. New function creation: If no prior declaration:

    • Registers in state+968 (regular scope) or state+976 (extern scope)
    • Calls sub_44FDC0 to record ABI metadata
    • For Blackwell GB10B architecture (sub_70FA00(CU, 33)): allocates __nv_reservedSMEM_gb10b_war_var in shared memory as a hardware workaround

Case 82 -- Entry Function

// Case 82: .entry declaration
if (CU->output_param_context)
    error("Parameter to entry function");

result = sub_497C00(state, decl_type, name,
                    NULL,    // no return params for entries
                    params,
                    1,       // is_entry = true
                    0,       // is_func = false
                    NULL,    // no scratch regs
                    -1,      // no retaddr
                    0, 0, 0, // no .noreturn/.unique/.force_inline
                    location);

Case 88 -- Entry Function Body Completion

After the function body is parsed, case 88 performs the final validation pass:

  1. Performance directive validation:

    • .maxntid and .reqntid are mutually exclusive
    • .maxnctapersm/.minnctapersm require either .maxntid or .reqntid
    • .reqntid + .reqnctapercluster require .blocksareclusters
    • .reqnctapercluster and .maxclusterrank are mutually exclusive
  2. Kernel parameter size limits (computed via sub_42CBF0 + sub_484ED0):

    PTX Version    Max Kernel Param Size
    < 1.5          256 bytes
    >= 1.5, < 8.1  4,352 bytes
    >= 8.1         32,764 bytes

    Parameters exceeding 4,352 bytes also require SM >= 70 and PTX >= 8.1.

  3. Debug labels: Generates __$startLabel$__<name> and __$endLabel$__<name> for DWARF debug info.

  4. Debug hash: If debug mode is enabled (state+856 != 0), computes CRC32(name) % 0xFFFF + base as a debug identifier, stored at offset +176 of the object at func+80.
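The hash computation can be sketched as follows, assuming the CRC32 variant is the standard zlib polynomial; the exact polynomial and the base constant are not recovered, so BASE below is a placeholder.

```python
# Debug-identifier sketch for case 88: CRC32 of the function name,
# folded to 16 bits, offset by a base constant.
import zlib

BASE = 0x1000   # illustrative stand-in for the recovered base value

def debug_id(func_name: str) -> int:
    return zlib.crc32(func_name.encode()) % 0xFFFF + BASE

h = debug_id("my_kernel")
print(hex(h))
```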

Case 97 -- Device Function

// Case 97: .func declaration
result = sub_497C00(state, decl_type, name,
                    return_params, params,
                    0,                   // is_entry = false
                    CU->return_qualifier, // CU+80
                    scratch_regs, retaddr,
                    noreturn, unique, force_inline,
                    location);

State-Space Declarations -- .global, .shared, .local, .const

State-space directives set the "current state space" field (CU+24) and then delegate to sub_4A0CD0 for variable declaration processing or sub_4A2020 for declaration-without-initializer processing.

State-Space Code Assignment

Case                    Action           State Space
57                      *CU = 1          (extern/unresolved)
59                      *CU = 3          .shared
61                      *CU = 2          .global
63                      *CU = 4          .local
65                      *CU = 5          .const
67                      *CU = 0          .reg
58, 60, 62, 64, 66, 68  sub_4A2020(...)  Process declaration in current space

The odd-numbered cases set the state-space code; the immediately following even-numbered cases trigger the actual declaration processing.
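The odd/even pairing can be modeled as a two-step latch: the odd case stores the state-space code, the even case consumes it. The dictionary and helper below are an illustrative paraphrase of the reduction actions, not recovered code.

```python
# Odd cases latch the state-space code (cf. *CU = N); the following
# even cases process the pending declaration against that code.
SPACE_CODES = {57: 1, 59: 3, 61: 2, 63: 4, 65: 5, 67: 0}

def reduce(case: int, cu: dict, declared: list) -> None:
    if case in SPACE_CODES:
        cu["space"] = SPACE_CODES[case]    # odd case: set current space
    else:
        declared.append(cu["space"])       # even case: declare in that space

cu, declared = {}, []
for case in (61, 62, 59, 60):              # a .global decl, then a .shared decl
    reduce(case, cu, declared)
print(declared)   # → [2, 3]
```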

Variable Validator -- sub_4A0CD0

This 4,937-byte function validates variable declarations across all state spaces. Key checks:

  1. Type validation: Resolves .texref via sub_450D00. For types 9 (.surfref) and 10 (.texref), enforces .tex deprecation after PTX 1.5 and .surfref scope restrictions.

  2. .b128 type: Requires PTX >= 8.3 (sub_455A80(8, 3)) and SM >= 70 (sub_489390(state, 70)).

  3. State-space restrictions:

    • .managed valid only with .global (space 5)
    • .reserved valid only with .shared (space 8)
    • .reserved shared alignment must be <= 64
    • .common valid only with .const
    • .param at file scope requires .const space
    • .local const disallowed at file scope
  4. Texmode interaction:

    • .surfref types require texmode_independent in the feature map
    • .tex/.texref types incompatible with texmode_raw
  5. Initializer handling: If an initializer is present, calls sub_4A02A0 to validate constant expressions (no function pointers, no entry functions as values, no opaque type initializers).

State-Space Linked Lists -- 18 Lists at state+368

The parser maintains 18 linked list heads (9 head/tail pairs) at state offsets 368--512 to track declared symbols per state space:

Offset    Pair   State Space
368/376   0      .global
384/392   1      .shared
400/408   2      .local
416/424   3      .const
432/440   4      .param
448/456   5      .tex
464/472   6      .surf
480/488   7      .sampler
496/504   8      reserved / other

Initialization (case 3 -- section begin): Iterates j from 0 to 144 in steps of 8, allocating an 88-byte sentinel node (type=6) for each list. Each node's +48 field links to per-section tracking data at state+656 + j.

Scope teardown (case 76 -- new compilation unit): Destroys old symbol tables via sub_425D20, clears the target feature map, and merges scope-level lists into the parent scope by concatenating linked list chains for offsets 16, 48, 112, 128, 144, and 184 of the scope node.

.reg / .param -- Register and Parameter Declarations

Within function bodies, .reg and .param declarations create typed register/parameter entries. Three grammar productions handle the variants:

Declaration Node Layout (56 bytes)

Offset   Type    Field
0        ptr     Type list pointer
8        ptr     Name pointer
16       int32   State-space code
20       byte    Is array
21       byte    Is vector
24       int32   Alignment
28       byte    Extra flags
40       int32   Count / range start
44       int32   Range end (0xFFFFFFFF = no upper bound)
48       ptr     Auxiliary data

Case 110 -- Single declaration: Reads type info from CU_state (offsets +16, +24, +28, +29, +32, +36), allocates the 56-byte node, sets count from the parsed integer, and calls sub_48BE80(state) to validate.

Case 111 -- Range declaration: Same as 110 but sets both start and end bounds. The sentinel value 0xFFFFFFFF at offset 44 distinguishes range from single declarations.

Case 112 -- Vector declaration: Handles vector type qualifiers (.v2, .v4).
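The 56-byte node layout above can be expressed as a speculative C struct whose natural padding reproduces the recovered offsets. Field names are invented; only the offsets and sizes come from the analysis:

```c
#include <stdint.h>
#include <stddef.h>

/* Speculative layout for the 56-byte declaration node. Natural alignment
 * plus two explicit padding runs reproduces the recovered offsets. */
typedef struct DeclNode {
    void    *type_list;     /* +0   type list pointer */
    void    *name;          /* +8   name pointer */
    int32_t  space;         /* +16  state-space code */
    uint8_t  is_array;      /* +20 */
    uint8_t  is_vector;     /* +21 */
    uint8_t  _pad0[2];
    int32_t  alignment;     /* +24 */
    uint8_t  extra_flags;   /* +28 */
    uint8_t  _pad1[11];
    int32_t  count;         /* +40  count / range start */
    int32_t  range_end;     /* +44  0xFFFFFFFF = no upper bound */
    void    *aux;           /* +48  auxiliary data */
} DeclNode;                 /* sizeof == 56 */
```

The 11-byte gap between +29 and +40 is unexplained by the recovered accesses; it may hold fields no directive case touches.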

Visibility / Linkage Directives

.weak -- Case 55

sub_489050(state, 3, 1, ".weak directive", location);  // PTX >= 3.1

.common -- Case 56

sub_489050(state, 5, 0, ".common directive", location);  // PTX >= 5.0

Linkage Qualifiers -- Cases 78--81

These set CU+81 (declaration linkage type) within function prototype production contexts:

Case   Linkage   PTX Directive
78     1         .visible (default/internal)
79     3         .extern
80     2         .visible
81     4         .weak

.alias -- Case 41

Symbol aliasing requires PTX >= 6.3 and SM >= 30:

// Reconstructed from case 41
sub_489050(state, 6, 3, ".alias", location);   // PTX >= 6.3
sub_489390(state, 0x1E, ".alias", location);   // SM >= 30

sym1 = sub_4504D0(state->scope_chain, name1, 1, state);
sym2 = sub_4504D0(state->scope_chain, name2, 1, state);

if (!sym1) error("undefined symbol: %s", name1);
if (!sym2) error("undefined symbol: %s", name2);

Validation:

  • Both symbols must be function type (node type == 5)
  • sym1 must not already have a body defined (sym1->80->88 == 0)
  • Neither can be entry functions
  • No self-aliasing (names must differ)
  • Parameter lists must match (calls sub_4036D9 twice: once for return params, once for input params)
  • .noreturn attribute must be consistent across both symbols
  • Cannot alias to .extern or declaration-qualified functions

On success: sym1->80->64 = sym2 (sets the alias-target pointer).

.pragma -- Case 42

The .pragma directive requires PTX >= 2.0 and dispatches through a prefix-matching chain. Each pragma string is compared against known prefixes via sub_4279D0 (starts-with test):

// Reconstructed dispatch structure from case 42
for (node = list_begin(pragma_list); !list_end(node); node = list_next(node)) {
    char* pragma_str = list_value(node);
    sub_489050(state, 2, 0, ".pragma directive", location);  // PTX >= 2.0

    char* arch_str = sub_457CB0(CU->arch_descriptor, index);
    if (starts_with(arch_str, pragma_str)) {
        // matched known pragma
        dispatch_to_handler(pragma_str, state);
    }
}

Pragma Dispatch Chain

Priority   Prefix Index           Pragma           Handler                                 Storage
1          sub_457CB0(arch, 1)    "noinline"       sub_456A50 + sub_48D8F0                 CU+156, CU+192
2          sub_457CB0(arch, 3)    inline-related   Sets CU+160 = 1                         CU+160
3          sub_457CB0(arch, 16)   register-usage   sub_4563E0 + sub_48C370                 CU+172
4          sub_457CB0(arch, 5)    min threads      sub_4563E0 + sub_48C6F0                 CU+164 or CU+168
5          sub_457CB0(arch, 9)    max constraint   sub_4567E0 + sub_403D2F                 CU+176
6          sub_457CB0(arch, 10)   min constraint   sub_4567E0 + sub_403D2F                 CU+184
7          sub_457CB0(arch, 18)   deprecated       Warning via dword_29FA6C0               --
8          sub_457CC0(arch, 1)    deprecated       Warning via dword_29FA6C0               --
9--11      sub_457C60/CA0/C70     unsupported      Warning via dword_29FA7F0               --
12         sub_457D30/D50         unsupported      Warning via dword_29FA7F0               --
13         sub_457CB0(arch, 22)   function-level   Appends to func or module pragma list   func->80->80 or state+272

Unmatched pragmas trigger an error via dword_29FA6C0.

Feature Version Gating

Two functions guard every directive against minimum PTX ISA version and SM architecture requirements. They are called hundreds of times throughout the Bison reduction actions.

sub_489050 -- PTX ISA Version Check

// sub_489050(state, required_major, required_minor, directive_name, location)
char sub_489050(state, int major, int minor, char* name, location) {
    if (sub_454E70(state->version_check_disabled))  // state+960
        return 1;  // checks disabled
    if (state->lenient_mode)  // state+832
        return 1;  // lenient mode
    char ok = sub_455A80(major, minor, state);
    if (!ok) {
        char buf[152];
        sprintf(buf, "%d.%d", major, minor);
        sub_42FBA0(error_desc, location, name, buf);
    }
    return ok;
}

sub_489390 -- SM Architecture Check

// sub_489390(state, required_sm, directive_name, location)
char sub_489390(state, uint required_sm, char* name, location) {
    if (sub_454E70(state->version_check_disabled))  // state+960
        return 1;
    char ok = 1;
    if (!state->target_string || state->target_id > required_sm) {
        // state+152 == NULL or state+168 > required_sm
        ok = 0;
        char buf[48];
        sprintf(buf, "sm_%d", required_sm);
        sub_42FBA0(error_desc, location, name, buf);
    }
    return ok;
}

Version Requirements by Directive

Directive               PTX ISA   SM Architecture
.address_size           >= 2.3    --
.weak                   >= 3.1    --
.common                 >= 5.0    --
.alias                  >= 6.3    >= 30
.branchtargets          >= 6.0    >= 30
.calltargets            >= 2.1    >= 20
.callprototype          >= 2.1    >= 20
.pragma                 >= 2.0    --
texmode_independent     >= 1.5    --
debug target            >= 3.0    --
kernel param list       >= 1.4    --
opaque types            >= 1.5    --
.b128 type              >= 8.3    >= 70
kernel params > 4352B   >= 8.1    >= 70

Parser State Object Layout

The parser state object (v1127 / a1 in sub_4CE6B0) is approximately 1,200 bytes. Key offsets for directive handling:

Offset     Type      Field
72         ptr       Module-level output buffer
88         ptr       Current function link
144        char*     .version string (e.g., "8.5")
152        char*     .target string (e.g., "sm_90")
160        int32     PTX major version
164        int32     PTX minor version
168        int32     SM architecture ID
177        byte      Architecture sub-variant flag
178        byte      Version mismatch flag
196        int32     .address_size (32 or 64)
204        int32     Maximum SM ID encountered
208        ptr       Target feature hash map
219        byte      CLI texmode_independent flag
220        byte      CLI texmode_raw flag
272        ptr       Module pragma list head
368--512   ptr[18]   State-space linked list heads
656--800   bytes     Per-section tracking data (144 bytes)
832        byte      Lenient mode flag
834        word      Debug mode flags
856        int32     Debug hash base
960        int32     Version check disable flag
968        ptr       Scope root (top-level symbol table)
976        ptr       Extern function scope
984        ptr       Current scope chain pointer
1000       byte      Function body active flag
1008       ptr       Scope hash map
1033       byte      Debug info enabled
1096       ptr       CU_state pointer

Function Map

Address      Size       Identity                                    Callers
sub_44A100   39 B       PTX version bsearch validator               case 35
sub_44B9C0   171 B      Scope context creator                       case 82, 97 via sub_497C00
sub_44F6E0   135 B      Parameter node allocator (48 B nodes)       sub_497C00
sub_489050   115 B      PTX ISA version gate                        ~30 directive cases
sub_489390   85 B       SM architecture version gate                ~15 directive cases
sub_497C00   2,992 B    Function/entry declaration handler          cases 82, 97
sub_4A0CD0   4,937 B    Variable/symbol declaration validator       cases 58--68
sub_4A02A0   2,607 B    Initializer/constant expression validator   sub_4A0CD0
sub_4B1080   ~700 B     Per-target handler (SM + texmode)           cases 5, 38
sub_4036D9   437 B      Parameter list compatibility check          case 41 (.alias)
sub_4CE6B0   48,263 B   Bison parser (all directive cases)          compilation driver

PTX-to-Ori Lowering

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The PTX-to-Ori lowering is the transition from parsed PTX assembly into the Ori internal representation -- the SASS-level, virtual-register IR that all subsequent optimization operates on. Unlike a traditional compiler where the parser builds an AST and a separate lowering pass consumes it, ptxas has no materialized AST: the Bison parser's reduction actions directly construct Ori IR nodes, basic blocks, and CFG edges inline. What the --compiler-stats timer calls "DAGgen-time" measures this inline construction phase. The result is a raw Ori IR that still uses PTX-derived opcodes and has unresolved architecture-dependent constructs. Fourteen "bridge phases" (pipeline indices 0--13) then transform this raw IR into the optimizer-ready form where every instruction carries its final SASS opcode, the CFG is fully annotated, and architecture-incompatible operations have been legalized.

The key architectural consequence of this design: there is no separate "lowering" function that you can point at and say "this converts PTX to Ori." The conversion is distributed across (1) the Bison parser's 443 reduction actions, (2) a 44 KB operand processing function, (3) the MercConverter instruction legalization pass, and (4) six additional bridge phases that handle FP16 promotion, control flow canonicalization, macro fusion, and recipe application.

DAGgen timer                 "DAGgen-time : %.3f ms (%.2f%%)\n" (inline Bison -> Ori construction)
Bison parser                 sub_4CE6B0 (48 KB, 512 productions, 443 reductions, no AST)
Operand processing           sub_6273E0 (44 KB, 6-bit operand type switch)
MercConverter                sub_9F1A90 (35 KB, opcode-dispatched visitor)
MercConverter orchestrator   sub_9F3340 (7 KB)
Opcode dispatch              sub_9ED2D0 (25 KB, master switch on *(instr+72) & 0xCF)
Post-conversion lowering     sub_9EF5E0 (27 KB, string "CONVERTING")
Bridge phases                Phases 0--13 (14 phases, first group in the 159-phase pipeline)
Diagnostic dump              Phase 9: ReportInitialRepresentation (sub_A3A7E0 stats emitter)
Intrinsic descriptors        sub_9EE390 (20 KB, "IntrinsicDescrFile=%s")

Architecture

PTX source text
     |
     v
[Flex scanner]  sub_720F00 (15.8KB, 552 rules)
     |  token stream
     v
[Bison parser]  sub_4CE6B0 (48KB, 512 productions)
     |  NO AST -- reduction actions build IR directly:
     |    - allocate instruction nodes from pool
     |    - set opcode field (instruction +72)
     |    - build operand array (instruction +84)
     |    - link into doubly-linked list per basic block
     |    - create basic block entries (40B each)
     |    - populate CFG hash maps (Code Object +648, +680)
     |
     v                                             "DAGgen-time"
[Operand processing]  sub_6273E0 (44KB)            boundary
     |  6-bit type switch (v12 & 0x3F)             ----------
     |  address computation, state space annotation
     v
+----------------------------------------------------------+
|  RAW ORI IR (PTX-derived opcodes, virtual registers)     |
|  Instructions: PTX-level names (add.f32, ld.global, etc) |
|  Registers: virtual R-file, typed descriptors             |
|  CFG: basic blocks + edge hash maps (partially formed)    |
+----------------------------------------------------------+
     |
     |  Phase 0: OriCheckInitialProgram (validate)
     |  Phase 1: ApplyNvOptRecipes      (configure opt levels)
     |  Phase 2: PromoteFP16            (FP16 -> FP32 where needed)
     |  Phase 3: AnalyzeControlFlow     (finalize CFG + RPO + backedges)
     |  Phase 4: AdvancedPhaseBeforeConvUnSup (arch hook, no-op default)
     |  Phase 5: ConvertUnsupportedOps  (MercConverter: PTX ops -> SASS ops)
     |  Phase 6: SetControlFlowOpLastInBB (CFG structural fixup)
     |  Phase 7: AdvancedPhaseAfterConvUnSup (arch hook, no-op default)
     |  Phase 8: OriCreateMacroInsts    (fuse instruction sequences)
     |  Phase 9: ReportInitialRepresentation (diagnostic dump)
     |  Phase 10: EarlyOriSimpleLiveDead (dead code elimination)
     |  Phase 11: ReplaceUniformsWithImm (fold known constants)
     |  Phase 12: OriSanitize           (validate post-bridge IR)
     |  Phase 13: GeneralOptimizeEarly  (bundled copy-prop + const-fold)
     v                                          "OCG-time" begins
+----------------------------------------------------------+
|  OPTIMIZER-READY ORI IR                                  |
|  Instructions: SASS opcodes (FADD, IMAD, LDG, STG, ...) |
|  Registers: virtual R/UR/P/UP files                       |
|  CFG: complete with RPO, backedge map, loop headers       |
+----------------------------------------------------------+
     |
     v
[Phase 14+: main optimization pipeline]

Inline IR Construction (Bison -> Ori)

The Bison parser at sub_4CE6B0 has 512 grammar productions with 443 reduction-action cases. Each reduction action constructs IR directly -- no intermediate AST is ever materialized. The instruction table builder (sub_46E000, 93 KB, 1,141 per-opcode registration calls to sub_46BED0) runs during parser initialization and registers the legal type combinations for every PTX instruction. The instruction lookup subsystem (sub_46C690 entry, sub_46C6E0 matcher at 6.4 KB) classifies operands into 12 categories at parse time.

When the parser encounters a PTX instruction like add.f32 %r1, %r2, %r3, it:

  1. Looks up add.f32 in the opcode table to get the internal opcode index and validate the type qualifier .f32
  2. Allocates an Ori instruction node from the memory pool
  3. Writes the opcode into the instruction field at offset +72
  4. Processes each operand through sub_6273E0 to build the packed operand array at offset +84
  5. Links the instruction into the current basic block's doubly-linked list
  6. If the instruction is a branch/jump/return, creates a CFG edge in the successor hash map at Code Object +648

Special PTX registers (%ntid, %laneid, %smid, %ctaid, %clock64, etc.) are mapped to internal identifiers during parser initialization at sub_451730. The mapping table is built from the ROT13-encoded opcode table populated by ctor_003 at 0x4095D0.

Operand Processing -- sub_6273E0

The 44 KB operand processing function handles all PTX operand forms. It switches on a 6-bit type encoding extracted as v12 & 0x3F:

Operand kind       Description                 PTX syntax             Processing
Register           Direct register reference   %r1, %rd1, %f1         Look up register descriptor via *(ctx+88) + 8*regId
Register pair      64-bit register pair        %rd1 (on 32-bit ALU)   Allocate paired descriptors, link hi/lo
Immediate          Integer constant            42, 0xFF               Pack into operand field
Float immediate    Floating-point constant     0F3F800000             Encode IEEE 754 bits
Address            Base + offset               [%rd1+16]              Compute effective address, annotate state space
Constant bank      Constant memory ref         c[2][0x100]            Bank index + offset encoding
Label              Branch target               $L__BB0_1              Resolve to basic block index
Special register   Built-in register           %ntid.x, %laneid       Map to internal ID from sub_451730 table

String evidence in sub_6273E0:

  • ".nv.reservedSmem.offset0" -- reserved shared memory region handling
  • "COARSEOFFSET" -- coarse-grained offset computation for large address spaces
  • "__$endLabel$__%s" -- label generation for structured control flow expansion

The function bridges PTX's explicitly-typed operand model (where .u32, .f32, .b64 qualifiers are part of the syntax) to Ori's implicitly-typed model where the operand type is determined by the SASS opcode.

Bridge Phases (0--13)

Phase 0: OriCheckInitialProgram -- Validation

Validates the raw Ori IR produced by the Bison parser for structural correctness: all basic blocks have valid entry/exit points, instruction operand counts match opcode requirements, register references are within bounds, and CFG edges are consistent. This is a pure validation pass that produces no IR transformations. It catches malformed IR early, before any optimization pass can amplify a structural error into a hard-to-diagnose miscompile.

Phase 1: ApplyNvOptRecipes -- Optimization Level Configuration

Applies NvOptRecipe transformations controlled by option 391. When enabled, the PhaseManager's constructor (sub_C62720) allocates a 440-byte NvOptRecipe sub-manager at PhaseManager+56. This sub-manager configures per-phase behavior based on the NvOpt level (0--5), controlling which later phases are active and their aggressiveness:

NvOpt level   Behavior
0             Minimal optimization (fast-compile path, many phases isNoOp())
1--2          Standard optimization
3--4          Aggressive optimization (loop unrolling, speculative hoisting enabled)
5             Maximum optimization (may significantly increase compile time)

The string "Invalid nvopt level : %d." in sub_C173E0 confirms the valid range. The recipe data lives at NvOptRecipe+312 with per-phase records at stride 584 bytes. The sub-manager maintains its own sorted array (+376) and hash table (+400..+416) for fast recipe lookup by phase index.

NvOptRecipe Sub-Manager (440 bytes, at PhaseManager+56)
  +0      compilation_unit
  +8      phase_manager back-reference
  +16     ref_counted_list_1
  +312    recipe_data
  +336    allocator
  +344    timing_records (stride = 584 per entry)
  +376    sorted_array (for binary search by phase index)
  +400    hash_bucket_count
  +408    hash_buckets
  +432    shared_list_ptr (ref-counted)

Phase 2: PromoteFP16 -- Half-Precision Type Promotion

Promotes half-precision (FP16) operations where hardware support is insufficient or promotion yields better throughput. The promotion strategy is architecture-dependent:

  • Pre-sm_53: no native FP16 ALUs. All FP16 arithmetic is expanded to FP32 with narrowing conversions at stores.
  • sm_53+: native FP16 support. Only operations that require expensive multi-instruction sequences in FP16 (certain transcendentals, complex comparisons) are promoted.
  • sm_89+ (Ada, Blackwell): wide FP16 tensor paths. Promotion is minimal; most FP16 stays native.

The phase walks the instruction linked list, inspects each instruction's type encoding at offset +72, and rewrites FP16 operations to FP32 equivalents by replacing the opcode and inserting conversion instructions (F2F in SASS terminology) at use/def boundaries.

Phase 3: AnalyzeControlFlow -- CFG Finalization

Builds and finalizes the control flow graph data structures that the optimizer requires:

  • Successor edges: populates the FNV-1a hash table at Code Object +648
  • Backedge map: computes backedges and stores them at Code Object +680
  • RPO array: builds the reverse post-order traversal at Code Object +720
  • Loop identification: marks loop headers and backedge targets for later loop optimization passes (phases 18, 22, 24, 59)

The Bison parser constructs basic blocks and edges incrementally as it processes PTX instructions, but the CFG is not guaranteed to be fully consistent until this phase runs. For example, forward branch targets may reference blocks that were not yet created at parse time. This phase resolves all pending edges and ensures the CFG is complete.
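The successor-edge table at Code Object +648 is described above as FNV-1a hashed. The exact parameters ptxas uses were not recovered; a sketch using the canonical 64-bit FNV-1a constants looks like this:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Standard 64-bit FNV-1a: xor each byte into the hash, then multiply by
 * the FNV prime. The offset basis and prime here are the canonical public
 * constants; ptxas's actual parameters are an assumption. */
static uint64_t fnv1a64(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                        /* xor first ... */
        h *= 0x100000001b3ULL;            /* ... then multiply by prime */
    }
    return h;
}
```

A block-pointer key would typically be hashed as its 8 raw bytes and reduced modulo the bucket count.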

Phases 4 and 7: Architecture Hook Points

Phases 4 (AdvancedPhaseBeforeConvUnSup) and 7 (AdvancedPhaseAfterConvUnSup) are no-op-by-default hook points that bracket ConvertUnsupportedOps. Architecture backends override their vtables to inject target-specific processing:

  • Phase 4 (before): prepare target-specific state, mark instructions that need special handling on this architecture
  • Phase 7 (after): clean up after legalization, fix architecture-specific edge cases introduced by the generic lowering

These hooks are part of the 16 AdvancedPhase injection points distributed throughout the 159-phase pipeline. The architecture vtable factory at sub_1CCEEE0 (17 KB, 244 callees) selects which overrides are active based on the sm_version.

Phase 5: ConvertUnsupportedOps -- Instruction Legalization

The most substantial bridge phase. Lowers PTX operations that have no direct SASS equivalent for the target architecture. This phase runs the MercConverter engine (see next section) and handles:

  • 64-bit integer arithmetic on architectures with 32-bit ALUs: splits add.s64, mul.lo.s64 into hi/lo 32-bit instruction pairs using carry chains
  • Complex addressing modes: decomposes multi-component addresses into separate arithmetic instructions
  • PTX-specific operations: converts PTX instructions that have no 1:1 SASS mapping (e.g., bfe, bfi, prmt variants not supported on all targets)
  • Architecture availability: gates instructions by SM version (an instruction added in sm_80 is lowered to a multi-instruction sequence on sm_70)
  • Texture/surface operations: legalizes texture sampling and surface access patterns (sub_9E8B20, 17 KB)
  • Memory operations: legalizes load/store patterns, address register handling (sub_9D76D0/sub_9D80E0, 17--18 KB each)

After ConvertUnsupportedOps completes, every instruction in the IR has a valid SASS opcode for the target architecture.

The late phase 132 (UpdateAfterConvertUnsupportedOps) runs cleanup for edge cases introduced by this phase that are only detectable after optimization.

Phase 6: SetControlFlowOpLastInBB -- CFG Structural Fixup

Enforces a critical structural invariant: control flow operations must be the last instruction in their basic block. If a branch, jump, return, or exit instruction is followed by other instructions in the same block (which can happen during lowering when a PTX instruction expands to a sequence ending in a branch), this phase splits the block at the control flow point.

The invariant is required by the scheduler (which assumes only the last instruction in a block can transfer control) and the register allocator (which computes live-out sets at block boundaries). The phase rewrites the instruction linked list and allocates new 40-byte basic block entries as needed.
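The split operation can be sketched on a toy doubly-linked instruction list. All names here are hypothetical; the real pass also patches CFG edges and the 40-byte block entries:

```c
#include <stddef.h>
#include <assert.h>

/* Toy sketch of SetControlFlowOpLastInBB: if a control-flow op has a
 * successor inside the same block, move the trailing instructions into a
 * freshly allocated block so the CF op ends its block. */
typedef struct Inst {
    struct Inst *prev, *next;
    int is_control_flow;
} Inst;

typedef struct Block {
    Inst *first, *last;
    struct Block *next_block;   /* fall-through chain */
} Block;

/* Returns 1 if a split happened. */
static int split_after_control_flow(Block *bb, Block *fresh) {
    for (Inst *i = bb->first; i; i = i->next) {
        if (i->is_control_flow && i->next) {
            fresh->first = i->next;       /* tail becomes the new block */
            fresh->last  = bb->last;
            fresh->first->prev = NULL;
            bb->last = i;                 /* CF op is now last in bb */
            i->next = NULL;
            fresh->next_block = bb->next_block;
            bb->next_block = fresh;
            return 1;
        }
    }
    return 0;
}
```

Running the pass to a fixed point handles blocks containing several control-flow ops, since each split leaves any remaining offenders in the new tail block.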

Phase 8: OriCreateMacroInsts -- Macro Fusion

Identifies and fuses instruction sequences into macro instructions for hardware efficiency. The phase scans the instruction linked list for patterns that the GPU hardware can execute as a single macro-op:

  • Compare + branch: fused into a conditional branch macro instruction
  • Multiply + add: fused into FMA where not already (different from PTX fma -- this catches mul followed by add on the same operands)
  • Address computation + memory access: fused sequences for coalesced access patterns

The fused macro instructions carry composite semantics in a single IR node. They are expanded back into individual SASS instructions much later at phase 118 (MercExpandInstructions), after scheduling has determined the optimal placement. This late expansion allows the optimizer to treat the fused sequence as atomic, preventing passes from inserting unrelated instructions between the components.

Phase 9: ReportInitialRepresentation -- Diagnostic Dump

Dumps the Ori IR state for debugging, active when DUMPIR or --ftrace diagnostics are enabled. The stats emitter at sub_A3A7E0 prints a per-function profile:

# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [est latency = 87] [LSpillB=0]
# [Occupancy = 0.750000]
# [issue thru=0.888889] [fp thru=0.000000]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]
# [SharedMem Alloc thru=0.000000]
# [instHint=0] [instPairs=0]

This snapshot provides the pre-optimization baseline. Comparing it against ReportBeforeScheduling (phase 96) and ReportFinalMemoryUsage (phase 126) shows the optimizer's impact on instruction count, register pressure, and estimated latency.

Phases 10--13: Early Cleanup

Phase   Name                     Purpose
10      EarlyOriSimpleLiveDead   First dead code elimination pass. Removes instructions whose results are unused. Uses the SIMD-accelerated bitvector library (sub_BDBA60..sub_BDE150) for liveness computation.
11      ReplaceUniformsWithImm   Folds known-constant uniform register loads into immediate operands. Important for kernel launch parameters passed through constant memory.
12      OriSanitize              Second structural validation after all bridge transformations. Catches errors introduced by phases 1--11 before the main optimizer begins.
13      GeneralOptimizeEarly     First compound optimization pass: copy propagation + constant folding + algebraic simplification in a single fixed-point iteration. Cleans up redundancies introduced by the bridge phases.

The MercConverter Engine

The MercConverter (sub_9F1A90, 35 KB) is the instruction conversion engine at the heart of ConvertUnsupportedOps. Despite its name referencing "Mercury" (NVIDIA's SASS encoding format), it operates purely at the IR level -- converting instruction semantics, not binary encodings.

Call Chain

sub_9F3340 (orchestrator, 7KB)
  |
  +-- sub_9F1A90 (MercConverter main pass, 35KB)
  |     |
  |     +-- sub_9ED2D0 (opcode dispatch, 25KB)
  |     |     |
  |     |     |  Large switch on (*(instr+72)) with byte-1 mask:
  |     |     |    BYTE1(opcode) &= 0xCF  -- strips modifier bits 4-5
  |     |     |
  |     |     +-- case 1:  sub_9DA5C0 (2KB)   -- opcode class 1
  |     |     +-- case 6:  sub_9DA100 (9KB)   -- arithmetic operations
  |     |     +-- case 8:  sub_9D2440         -- specific class
  |     |     +-- case 10,11,149,151,152,290,291:
  |     |     |            sub_9D80E0 (17KB)  -- memory load/store
  |     |     +-- default: vfunc[0](a1, a2)   -- vtable dispatch
  |     |
  |     +-- sub_934630 (instruction creation utility, called N times)
  |
  +-- sub_9EF5E0 (post-conversion lowering, 27KB)
        |  string "CONVERTING"
        +-- sub_9EC160, sub_7C11F0, sub_7BFC30 (intrinsic expansion)

Per-Category Handlers

Handler      Size    Category                           Key behavior
sub_9D76D0   18 KB   Memory legalization (load/store)   Register type dispatch: 6=GPR, 7=predicate, 3=address. Uses sub_9D4380 (instruction builder) and sub_9CD420 (predication).
sub_9D80E0   17 KB   Memory legalization (variant)      Same opcode set as sub_9D76D0, alternate code path for different operand patterns.
sub_9EC340   23 KB   Multi-operand legalization         Operand type test: (v >> 28) & 7 == 1 means register. Register class query via sub_7BE7B0. Creates new instructions via sub_7DEAD0.
sub_9E6600   25 KB   Instruction expansion              Splits instructions into multiple SASS equivalents (e.g., 64-bit ops on 32-bit ALU). Uses sub_9D4380 ~10 times.
sub_9E8B20   17 KB   Texture/surface lowering           Register type 6 = GPR. Manipulates bitmask at register descriptor offset +48.
sub_9DA100   9 KB    Arithmetic operations              Handles opcode case 6 -- standard ALU instruction legalization.
sub_9DE890   17 KB   Control flow legalization          Branch/call instruction patterns. Calls sub_9D4380 (builder) 5 times.
sub_9DDEE0   14 KB   Address computation                Address arithmetic lowering, complex addressing mode decomposition.

Intrinsic Descriptor Loading

sub_9EE390 (20 KB) loads architecture-specific instruction descriptions from a file ("IntrinsicDescrFile=%s"). This allows the MercConverter to query which intrinsic operations are natively supported on the target SM and which require multi-instruction expansion. The descriptor file is architecture-versioned and loaded once during the first compilation of a kernel targeting that architecture.

The PTX-to-SASS Opcode Transition

The fundamental semantic transformation during lowering: PTX uses high-level, explicitly-typed opcodes; Ori uses SASS-level opcodes where the type is encoded in the mnemonic. All SASS opcode strings in the binary are ROT13-encoded.

PTX source (typed virtual ISA)          Ori IR (SASS machine-level)
---------------------------------       ---------------------------------
add.f32  %r1, %r2, %r3           -->   FADD  R1, R2, R3
add.s32  %r4, %r5, %r6           -->   IADD3 R4, R5, R6, RZ
mul.f64  %d1, %d2, %d3           -->   DMUL  D1, D2, D3
mad.lo.s32 %r7, %r8, %r9, %r10  -->   IMAD  R7, R8, R9, R10
ld.global.f32 %r11, [%rd1]       -->   LDG   R11, [R1]
st.shared.f32 [%rd2], %r12       -->   STS   [R2], R12
bra  $L__BB0_1                   -->   BRA   bix1
@%p0 bra $L__BB0_2               -->   @P0 BRA bix2
exit                              -->   EXIT
bar.sync 0                        -->   BAR

ROT13 encoding in the binary:

SNQQ  = FADD       VZNQ  = IMAD       SSZN  = FFMA
VNQQ3 = IADD3      QZHY  = DMUL       YQT   = LDG
FGT   = STG        OEN   = BRA        RKVG  = EXIT
ERG   = RET        ONE   = BAR        FGF   = STS

Key semantic differences at the transition:

  1. Type moves into the opcode: PTX add.f32 becomes FADD (the "F" encodes float); PTX add.s32 becomes IADD3 (the "I" encodes integer). The type qualifier disappears from the instruction syntax.

  2. Register namespace unification: PTX's typed virtual registers (%r for int, %f for float, %rd for 64-bit, %p for predicate) merge into Ori's four register files (R, UR, P, UP) with type tracked in the register descriptor at offset +64.

  3. Operand count changes: SASS IADD3 takes 3 source operands where PTX add takes 2 -- the third source defaults to RZ (the hardware zero register). This is handled by the expansion in sub_9E6600.

  4. Multi-instruction expansion: Complex PTX operations expand to multiple SASS instructions. A PTX div.f32 may become a Newton-Raphson sequence of RCP + FMUL + correction iterations.

  5. Predication mapping: PTX @%p0 instruction maps to an Ori predicate operand in the P register file, attached to the instruction node's predicate slot.
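Item 4's Newton-Raphson expansion of div.f32 can be sketched numerically. This is not the emitted SASS sequence: the hardware RCP table is replaced here by a crude linear seed (valid for positive inputs), but the refinement structure, seed + iterations whose error squares each step + final multiply, is the same shape:

```c
#include <math.h>
#include <assert.h>

/* Crude reciprocal seed standing in for the hardware RCP approximation:
 * split b = m * 2^e with m in [0.5, 1) and apply a linear fit to 1/m.
 * Positive b only; constants 48/17 and 32/17 are the classic NR seed. */
static float approx_rcp(float b) {
    int e;
    float m = frexpf(b, &e);
    float r = 2.823529f - 1.882353f * m;
    return ldexpf(r, -e);
}

/* div.f32-style sequence: reciprocal seed, Newton-Raphson refinement
 * (r <- r * (2 - b*r)), then the final multiply q = a * (1/b). */
static float nr_div(float a, float b) {
    float r = approx_rcp(b);
    for (int i = 0; i < 3; i++)
        r = r * (2.0f - b * r);   /* error roughly squares per step */
    return a * r;
}
```

With a ~6% seed error, three iterations drive the reciprocal to float precision, which is why the expanded sequence is only a handful of instructions.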

Error Detection During Lowering

The bridge phases include two error detection mechanisms:

Internal compiler error assertion (sub_9EB990, 1.4 KB): three references to "Internal compiler error.". Called when a bridge phase encounters an impossible IR state (e.g., an opcode value outside the known range in the MercConverter dispatch switch). Triggers longjmp-based fatal abort via sub_42F590 back to the driver's error recovery point in sub_446240.

Uninitialized register detector (sub_A0B5E0, 7 KB): "Found %d potentially uninitialized register(s) in function %s". Walks the instruction list per block, checks register descriptor flags at offset +48 (bit 5 = "defined"). Reports registers that appear as sources without any prior definition. This detector fires after the bridge phases to catch conversion errors that leave registers undefined.
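The detector's core logic is a forward scan maintaining a per-register "defined" flag (bit 5 in the real descriptor). A toy sketch with invented instruction and flag layouts:

```c
#include <assert.h>

/* Toy sketch of the uninitialized-register scan: walk instructions in
 * program order, set the defined flag (bit 5, as in the real descriptor)
 * on each destination, and count sources read before any definition. */
enum { TOY_NREGS = 16, FLAG_DEFINED = 1 << 5 };

typedef struct {
    int dst;        /* destination register, -1 if none */
    int src[2];     /* source registers, -1 if unused   */
} ScanInst;

static int count_uninitialized(const ScanInst *code, int n) {
    unsigned flags[TOY_NREGS] = {0};
    int reported = 0;
    for (int i = 0; i < n; i++) {
        for (int s = 0; s < 2; s++) {
            int r = code[i].src[s];
            if (r >= 0 && !(flags[r] & FLAG_DEFINED))
                reported++;     /* source read with no prior def */
        }
        if (code[i].dst >= 0)
            flags[code[i].dst] |= FLAG_DEFINED;
    }
    return reported;
}
```

The real detector runs per basic block after the bridge phases, so a legitimate def on another CFG path would need cross-block liveness to avoid a false positive, which is presumably why the diagnostic says "potentially uninitialized".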

Key Data Structures

Instruction Node

Instruction (variable size, linked list node)
  +0     prev_ptr           // doubly-linked list: previous instruction
  +8     next_ptr           // doubly-linked list: next instruction
  +16    child_ptr          // child/expanded instruction chain
  +32    control_word_ptr   // set later during scheduling (initially NULL)
  +72    opcode             // byte 0: primary opcode
                            // byte 1 bits 4-5: modifier (masked with 0xCF)
  +80    operand_count      // number of operands
  +84    operand_array      // packed operand descriptors

Operand Encoding

Each operand is a packed 32-bit value:
  Bits 28-30: operand kind ((value >> 28) & 7)
    1 = register operand
    5 = predicate register
    (other values for immediate, constant bank, label, etc.)

  Lower bits: operand-kind-specific payload (register ID, immediate value, etc.)
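The kind/payload split above can be exercised with a small pack/unpack pair. The 3-bit kind field at bits 28-30 is from the analysis; treating all 28 low bits as payload is an assumption:

```c
#include <stdint.h>
#include <assert.h>

/* Packed 32-bit operand word: kind in bits 28-30, payload below.
 * Kind values 1 (register) and 5 (predicate) are from the analysis. */
enum { OPK_REGISTER = 1, OPK_PREDICATE = 5 };

static uint32_t pack_operand(uint32_t kind, uint32_t payload) {
    return ((kind & 7u) << 28) | (payload & 0x0FFFFFFFu);
}
static uint32_t operand_kind(uint32_t v)    { return (v >> 28) & 7u; }
static uint32_t operand_payload(uint32_t v) { return v & 0x0FFFFFFFu; }
```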

Register Descriptor

Register descriptor (accessed via *(ctx+88) + 8*regId)
  +12    register number (int)
  +48    flags (bit 5 = "defined", other bits for liveness state)
  +64    type (3=address, 6=GPR, 7=predicate)

Timing Boundary

The lowering spans two --compiler-stats timer phases:

Timer         Covers
DAGgen-time   Bison parser reduction actions -> Ori instruction nodes, operand processing (sub_6273E0), basic block / CFG construction
OCG-time      Phases 0--13 (bridge), then phases 14--158 (optimization + codegen)

The boundary between "lowering" and "optimization" is therefore between phase 13 (GeneralOptimizeEarly, the last bridge phase) and phase 14 (DoSwitchOptFirst, the first pure optimization). After phase 13, the IR is in its final SASS-opcode form with validated structure, ready for the main optimization pipeline.

Cross-References

  • PTX Parser -- Flex scanner + Bison LALR(1) parser (the source of raw Ori IR)
  • Ori IR -- IR design: Code Object, basic blocks, instruction format, register files
  • Optimization Pipeline -- 159-phase pipeline (phases 0--13 are the bridge)
  • Phase Manager -- PhaseManager object, phase factory, dispatch loop
  • Optimization Levels -- NvOpt levels 0--5 and their effect on recipes
  • SASS Opcodes -- target SASS instruction set after lowering

Function Map

Address   Size    Callers  Identity                                            Confidence
-------   ----    -------  --------                                            ----------
0x451730  14 KB   1        Parser init, special register setup                 HIGH
0x46E000  93 KB   1        Opcode table builder (1,141 per-opcode calls)       HIGH
0x4CE6B0  48 KB   1        Bison LALR(1) parser (512 productions)              HIGH
0x6273E0  44 KB   N        Operand processing (6-bit type switch)              MEDIUM
0x9D4380  7 KB    ~10      Instruction builder / inserter into linked list     HIGH
0x9D76D0  18 KB   1        Memory instruction legalization (load/store)        HIGH
0x9D80E0  17 KB   1        Memory instruction legalization (variant)           HIGH
0x9DA100  9 KB    1        Arithmetic operation handler (case 6)               HIGH
0x9DE890  17 KB   1        Control flow legalization (branch/call)             MEDIUM
0x9DDEE0  14 KB   1        Address computation legalization                    MEDIUM
0x9E6600  25 KB   1        Instruction expansion (64-bit split, etc.)          HIGH
0x9E8B20  17 KB   1        Texture/surface lowering                            MEDIUM
0x9EB990  1.4 KB  3        Internal compiler error assertion                   HIGH
0x9EC340  23 KB   1        Multi-operand instruction legalization              MEDIUM
0x9ED2D0  25 KB   1        Opcode dispatch (master switch, & 0xCF mask)        HIGH
0x9EE390  20 KB   1        Intrinsic descriptor file loader                    MEDIUM
0x9EF5E0  27 KB   1        Post-MercConverter lowering ("CONVERTING")          HIGH
0x9F1A90  35 KB   1        MercConverter main instruction conversion pass      HIGH
0x9F3340  7 KB    1        MercConverter orchestrator ("After MercConverter")  HIGH
0xA0B5E0  7 KB    N        Uninitialized register detector                     HIGH
0xA3A7E0  6 KB    N        Scheduling statistics printer (phase 9 output)      VERY HIGH

Optimization Pipeline (159 Phases)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas optimizer is a fixed-order pipeline of 159 compilation phases that transform Ori IR from its initial post-lowering form into scheduled, register-allocated SASS machine code. Unlike LLVM's PassManager -- which uses dependency-driven scheduling and analysis preservation -- ptxas runs every phase unconditionally in a predetermined order, relying on per-phase isNoOp() checks to skip inapplicable transformations. This design trades flexibility for predictability: the phase ordering is identical across all compilations, and architecture-specific behavior is injected through 16 "AdvancedPhase" hook points whose vtables are overridden per target.

Each phase is a polymorphic C++ object exactly 16 bytes in size, allocated from a memory pool by a 159-case factory switch. The PhaseManager constructs all 159 phase objects up front during initialization, stores them in a flat array, and iterates the array in a simple dispatch loop. Per-phase timing and memory consumption are optionally tracked for --stat=phase-wise output.

Key Facts

Field                          Value
-----                          -----
Total phases                   159 (indices 0--158)
Named phases (static table)    139 (indices 0--138)
Dynamic phases (vtable names)  20 (indices 139--158)
AdvancedPhase hook points      16
Mercury sub-pipeline phases    8 (phases 113--114, 117--122)
Phase object size              16 bytes: {vtable_ptr, allocator_ptr}
Factory switch                 sub_C60D30 (3554 bytes, 159 cases)
PhaseManager constructor       sub_C62720 (4734 bytes)
Dispatch loop                  sub_C64F70 (1455 bytes)
Phase name table               off_22BD0C0 (159 entries, 1272 bytes)
Default ordering table         unk_22BEEA0 (159-entry index array)
Vtable range                   off_22BD5C8..off_22BEE78 (40-byte stride)
NamedPhases option ID          298
Pipeline orchestrator          sub_7FB6C0

Phase Object Layout

Every phase is a 16-byte polymorphic object created by the factory:

struct Phase {
    void** vtable;       // +0: pointer to phase-specific vtable in .data.rel.ro
    void*  allocator;    // +8: memory pool used for allocation
};

The vtable provides three virtual methods common to all phases:

Offset  Signature                             Purpose
------  ---------                             -------
+0      execute(Phase*, CompilationContext*)  Run the phase on the IR
+8      isNoOp(Phase*) -> bool                Return true to skip execution
+16     getName(Phase*) -> int                Return index into the phase name table

Additional vtable slots (+24 pool alloc, +32 pool free) are present but belong to the allocator interface, not the phase protocol.

Dispatch Loop

The dispatch loop at sub_C64F70 drives execution:

// sub_C64F70 -- simplified
void dispatch(PhaseManager* pm, int* phase_indices, int count) {
    MemorySnapshot baseline = take_snapshot();

    for (int i = 0; i < count; i++) {
        int idx = phase_indices[i];
        Phase* phase = pm->phase_list[idx];

        const char* name = pm->name_table[phase->getName()];

        if (!phase->isNoOp()) {
            MemorySnapshot before = take_snapshot();
            phase->execute(pm->compilation_unit);

            if (pm->timing_enabled) {
                report_phase_stats(pm, name, &before);
            }
        }
    }

    if (pm->timing_enabled) {
        report_summary(pm, "All Phases Summary", &baseline);
        report_pool_consumption(pm);
    }
}

Timing output format (to stderr when --stat=phase-wise):

  <phase_name>  ::  [Total 42 KB ]   [Freeable 8 KB ]   [Freeable Leaked 0 KB ] (0%)

Complete Phase Table

Group 1 -- Initial Setup (phases 0--13)

Program validation, recipe application, FP16 promotion, control flow analysis, macro instruction creation.

#    Phase Name                     Category
-    ----------                     --------
0    OriCheckInitial                Program validation
1    ApplyNvOptRecipes              Recipe application
2    PromoteFP16                    Type promotion
3    AnalyzeControlFlow             CFG analysis
4    AdvancedPhaseBeforeConvUnSup   Hook (no-op default)
5    ConvertUnsupportedOps          Legalization
6    SetControlFlowOpLastInBB       CFG fixup
7    AdvancedPhaseAfterConvUnSup    Hook (no-op default)
8    OriCreateMacroInsts            Macro expansion
9    ReportInitialRepresentation    Diagnostics
10   EarlyOriSimpleLiveDead         Early DCE
11   ReplaceUniformsWithImm         Immediate folding
12   OriSanitize                    IR validation
13   GeneralOptimizeEarly           Bundled early opts

Phase 0 validates the initial Ori IR for structural correctness. Phase 1 applies NvOptRecipe transformations (controlled by option 391, which allocates a 440-byte sub-manager at PhaseManager+56). Phase 2 promotes FP16 operations where profitable. Phases 4 and 7 are architecture hooks that bracket ConvertUnsupportedOps -- backends override them to inject target-specific pre/post-legalization logic.

Group 2 -- Early Optimization (phases 14--32)

Branch optimization, loop canonicalization, strength reduction, software pipelining, SSA formation.

#    Phase Name                      Category
-    ----------                      --------
14   DoSwitchOptFirst                Switch optimization
15   OriBranchOpt                    Branch optimization
16   OriPerformLiveDeadFirst         Liveness / DCE
17   OptimizeBindlessHeaderLoads     Texture header opt
18   OriLoopSimplification           Loop canonicalization
19   OriSplitLiveRanges              Live range splitting
20   PerformPGO                      Profile-guided opt
21   OriStrengthReduce               Strength reduction
22   OriLoopUnrolling                Loop unrolling
23   GenerateMovPhi                  SSA phi insertion
24   OriPipelining                   Software pipelining
25   StageAndFence                   Memory fence insertion
26   OriRemoveRedundantBarriers      Barrier elimination
27   AnalyzeUniformsForSpeculation   Uniform analysis
28   SinkRemat                       Sink + rematerialization
29   GeneralOptimize                 Bundled mid opts
30   DoSwitchOptSecond               Switch optimization (2nd)
31   OriLinearReplacement            Linear scan replacement
32   CompactLocalMemory              Local memory compaction

The GeneralOptimize* phases (13, 29, 37, 46, 58, 65) are compound passes that bundle multiple small optimizations (copy propagation, constant folding, algebraic simplification) into a single fixed-point iteration. They appear at multiple pipeline positions to re-clean the IR after major transformations. Liveness/DCE also runs repeatedly (OriPerformLiveDead at phases 16, 33, 61, 84) to remove dead code exposed by intervening passes.

Group 3 -- Mid-Level Optimization (phases 33--52)

GVN-CSE, reassociation, shader constant extraction, CTA expansion, argument enforcement.

#    Phase Name                      Category
-    ----------                      --------
33   OriPerformLiveDeadSecond        Liveness / DCE (2nd)
34   ExtractShaderConstsFirst        Shader constant extraction
35   OriHoistInvariantsEarly         LICM (early)
36   EmitPSI                         PSI emission
37   GeneralOptimizeMid              Bundled mid opts
38   OptimizeNestedCondBranches      Nested branch opt
39   ConvertVTGReadWrite             VTG read/write conversion
40   DoVirtualCTAExpansion           Virtual CTA expansion
41   MarkAdditionalColdBlocks        Cold block marking
42   ExpandMbarrier                  Mbarrier expansion
43   ForwardProgress                 Forward progress guarantee
44   OptimizeUniformAtomic           Uniform atomic opt
45   MidExpansion                    Mid-level legalization
46   GeneralOptimizeMid2             Bundled mid opts (2nd)
47   AdvancedPhaseEarlyEnforceArgs   Hook (no-op default)
48   EnforceArgumentRestrictions     ABI enforcement
49   GvnCse                          GVN + CSE
50   OriReassociateAndCommon         Reassociation + commoning
51   ExtractShaderConstsFinal        Shader constants (final)
52   OriReplaceEquivMultiDefMov      Redundant move elimination

Shader constant extraction (phases 34, 51) identifies uniform values that can be loaded from constant memory rather than recomputed per-thread. GvnCse (phase 49) combines global value numbering with common subexpression elimination in a single pass. MidExpansion (phase 45) performs target-dependent lowering of operations that must be expanded before register allocation but after the high-level optimizations have run.

Group 4 -- Late Optimization (phases 53--77)

Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.

#    Phase Name                      Category
-    ----------                      --------
53   OriPropagateVaryingFirst        Varying propagation
54   OriDoRematEarly                 Early rematerialization
55   LateExpansion                   Late legalization
56   SpeculativeHoistComInsts        Speculative hoisting
57   RemoveASTToDefaultValues        AST cleanup
58   GeneralOptimizeLate             Bundled late opts
59   OriLoopFusion                   Loop fusion
60   DoVTGMultiViewExpansion         Multi-view expansion
61   OriPerformLiveDeadThird         Liveness / DCE (3rd)
62   OriRemoveRedundantMultiDefMov   Dead move elimination
63   OriDoPredication                If-conversion
64   LateOriCommoning                Late commoning
65   GeneralOptimizeLate2            Bundled late opts (2nd)
66   OriHoistInvariantsLate          LICM (late)
67   DoKillMovement                  Kill movement
68   DoTexMovement                   Texture movement
69   OriDoRemat                      Rematerialization
70   OriPropagateVaryingSecond       Varying propagation (2nd)
71   OptimizeSyncInstructions        Sync optimization
72   LateExpandSyncInstructions      Late sync expansion
73   ConvertAllMovPhiToMov           Phi destruction
74   ConvertToUniformReg             Uniform reg conversion
75   LateArchOptimizeFirst           Arch-specific late opt
76   UpdateAfterOptimize             IR update pass
77   AdvancedPhaseLateConvUnSup      Hook (no-op default)

Predication (phase 63) converts short conditional branches into predicated instruction sequences, eliminating branch divergence. Rematerialization runs twice (phases 54 and 69) -- the early pass targets values that are cheap to recompute, while the late pass handles cases exposed by predication and loop fusion. Phase 73 (ConvertAllMovPhiToMov) destroys SSA form by converting phi nodes into move instructions, preparing the IR for register allocation. Phase 74 converts qualifying values to uniform registers (UR), reducing general register pressure.
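As an illustration of the if-conversion performed by phase 63 (hand-written pseudo-SASS, not compiler output), a short divergent branch becomes a predicated instruction stream:

```
// before: divergent branch over one instruction
      ISETP.LT.AND P0, PT, R1, R2, PT
  @P0 BRA skip
      FADD R3, R3, R4
skip: ...

// after: predicated form -- no branch, both paths share one stream
      ISETP.LT.AND P0, PT, R1, R2, PT
 @!P0 FADD R3, R3, R4
```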

Group 5 -- Legalization (phases 78--96)

Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attribute setting, final inspection.

#    Phase Name                      Category
-    ----------                      --------
78   LateExpansionUnsupportedOps     Late unsupported ops
79   OriHoistInvariantsLate2         LICM (late 2nd)
80   ExpandJmxComputation            JMX expansion
81   LateArchOptimizeSecond          Arch-specific late opt (2nd)
82   AdvancedPhaseBackPropVReg       Hook (no-op default)
83   OriBackCopyPropagate            Backward copy propagation
84   OriPerformLiveDeadFourth        Liveness / DCE (4th)
85   OriPropagateGmma                GMMA propagation
86   InsertPseudoUseDefForConvUR     UR pseudo use/def
87   FixupGmmaSequence               GMMA sequence fixup
88   OriHoistInvariantsLate3         LICM (late 3rd)
89   AdvancedPhaseSetRegAttr         Hook (no-op default)
90   OriSetRegisterAttr              Register attribute setting
91   OriCalcDependantTex             Texture dependency calc
92   AdvancedPhaseAfterSetRegAttr    Hook (no-op default)
93   LateExpansionUnsupportedOps2    Late unsupported ops (2nd)
94   FinalInspectionPass             Final IR validation
95   SetAfterLegalization            Post-legalization marker
96   ReportBeforeScheduling          Diagnostics

The GMMA phases (85, 87) handle WGMMA (warp-group matrix multiply-accumulate) instruction sequences, which require specific register arrangements and ordering constraints. OriSetRegisterAttr (phase 90) annotates registers with scheduling attributes (latency class, bank assignment) consumed by the downstream scheduler. FinalInspectionPass (phase 94) is a validation gate that catches illegal IR patterns before the irreversible scheduling/RA phases.

Group 6 -- Pre-Scheduling and Register Allocation (phases 97--103)

Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.

#    Phase Name                      Category
-    ----------                      --------
97   AdvancedPhasePreSched           Hook (no-op default)
98   BackPropagateVEC2D              Vector back-propagation
99   OriDoSyncronization             Synchronization insertion
100  ApplyPostSyncronizationWars     Post-sync WAR fixup
101  AdvancedPhaseAllocReg           Hook (no-op default)
102  ReportAfterRegisterAllocation   Diagnostics
103  Get64bRegComponents             64-bit register splitting

Phase 99 inserts the synchronization instructions (BAR, DEPBAR, MEMBAR) required by the GPU memory model. Phase 100 fixes write-after-read hazards exposed by sync insertion. Register allocation is driven through the hook at phase 101 -- the actual allocator is architecture-specific and invoked from the AdvancedPhase override. Phase 103 splits 64-bit register pairs into their 32-bit components for architectures that require it.

Group 7 -- Post-RA and Post-Scheduling (phases 104--116)

Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboards.

#    Phase Name                      Category
-    ----------                      --------
104  AdvancedPhasePostExpansion      Hook (no-op default)
105  ApplyPostRegAllocWars           Post-RA WAR fixup
106  AdvancedPhasePostSched          Hook (no-op default)
107  OriRemoveNopCode                NOP removal
108  OptimizeHotColdInLoop           Hot/cold in loops
109  OptimizeHotColdFlow             Hot/cold flow opt
110  PostSchedule                    Post-scheduling
111  AdvancedPhasePostFixUp          Hook (no-op default)
112  PlaceBlocksInSourceOrder        Block layout
113  PostFixForMercTargets           Mercury target fixup
114  FixUpTexDepBarAndSync           Texture barrier fixup
115  AdvancedScoreboardsAndOpexes    Scoreboard generation
116  ProcessO0WaitsAndSBs            O0 wait/scoreboard

Hot/cold partitioning (phases 108--109) separates frequently executed blocks from cold paths, improving instruction cache locality. PlaceBlocksInSourceOrder (phase 112) determines the final layout of basic blocks in the emitted binary. The scoreboard sub-system has two paths: at -O1 and above, AdvancedScoreboardsAndOpexes (phase 115) performs full dependency analysis to compute the 23-bit control word per instruction (4-bit stall count, 1-bit yield, 3-bit write barrier, 6-bit read barrier mask, 6-bit wait barrier mask, plus reuse flags). At -O0, phase 115 is a no-op and ProcessO0WaitsAndSBs (phase 116) inserts conservative waits.

Group 8 -- Mercury Backend (phases 117--122)

SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.

#    Phase Name                      Category
-    ----------                      --------
117  MercEncodeAndDecode             Mercury encode/decode
118  MercExpandInstructions          Instruction expansion
119  MercGenerateWARs1               WAR generation (1st pass)
120  MercGenerateOpex                Opex generation
121  MercGenerateWARs2               WAR generation (2nd pass)
122  MercGenerateSassUCode           SASS microcode generation

"Mercury" is NVIDIA's internal name for the SASS encoding framework. Phase 117 converts Ori instructions into Mercury's intermediate encoding, then decodes them back to verify round-trip correctness. Phase 118 expands pseudo-instructions into their final SASS sequences. WAR generation runs in two passes (119, 121) because expansion in phase 118 can introduce new write-after-read hazards. Phase 120 generates "opex" (operation extension) annotations. Phase 122 produces the final SASS microcode bytes. The MercConverter infrastructure (sub_9F1A90, 35KB) drives the instruction-level legalization using a visitor pattern dispatched through a large opcode switch (sub_9ED2D0, 25KB).

Group 9 -- Post-Mercury (phases 123--131)

Register map, diagnostics, debug output.

#    Phase Name                      Category
-    ----------                      --------
123  ComputeVCallRegUse              Virtual call reg use
124  CalcRegisterMap                 Register map computation
125  UpdateAfterPostRegAlloc         Post-RA update
126  ReportFinalMemoryUsage          Diagnostics
127  AdvancedPhaseOriPhaseEncoding   Hook (no-op default)
128  UpdateAfterFormatCodeList       Code list formatting
129  DumpNVuCodeText                 SASS text dump
130  DumpNVuCodeHex                  SASS hex dump
131  DebuggerBreak                   Debugger breakpoint

CalcRegisterMap (phase 124) computes the final physical-to-logical register mapping emitted as EIATTR metadata in the output ELF. DumpNVuCodeText and DumpNVuCodeHex (phases 129--130) produce the human-readable SASS text and raw hex dumps used by cuobjdump and debugging workflows. DebuggerBreak (phase 131) is a development-only hook that triggers a breakpoint when a specific phase is reached.

Group 10 -- Finalization (phases 132--158)

Late merge operations, late unsupported-op expansion, high-pressure live range splitting, architecture-specific fixups.

#         Phase Name                               Category
-         ----------                               --------
132       UpdateAfterConvertUnsupportedOps         Post-conversion update
133       MergeEquivalentConditionalFlow           Conditional flow merge
134       AdvancedPhaseAfterMidExpansion           Hook (no-op default)
135       AdvancedPhaseLateExpandSyncInstructions  Hook (no-op default)
136       LateMergeEquivalentConditionalFlow       Late conditional merge
137       LateExpansionUnsupportedOpsMid           Late unsupported mid
138       OriSplitHighPressureLiveRanges           High-pressure splitting
139--158  (architecture-specific)                  Arch-specific fixups

Phases 132--138 handle late-breaking transformations that must run after the Mercury backend but before finalization. OriSplitHighPressureLiveRanges (phase 138) is a last-resort live range splitter that fires when register pressure exceeds hardware limits after the main allocation pass.

Phases 139--158 are 20 additional slots whose names are not in the static name table but are returned by their vtable getString() methods. These are architecture-specific phases registered in the factory switch (vtable addresses off_22BEB08..off_22BEE78) that target particular SM generations or compilation modes. They provide extensibility for new architectures without modifying the fixed 139-phase base table.

Optimization Level Gating

AdvancedPhase Hook Points

Sixteen phases serve as conditional extension points. Their isNoOp() method returns true by default, causing the dispatch loop to skip them. Architecture backends and optimization-level configurations override the vtable to activate these hooks:

Phase  Name                                     Gate Location
-----  ----                                     -------------
4      AdvancedPhaseBeforeConvUnSup             Before unsupported-op conversion
7      AdvancedPhaseAfterConvUnSup              After unsupported-op conversion
47     AdvancedPhaseEarlyEnforceArgs            Before argument enforcement
77     AdvancedPhaseLateConvUnSup               Late unsupported-op boundary
82     AdvancedPhaseBackPropVReg                Before backward copy prop
89     AdvancedPhaseSetRegAttr                  Before register attr setting
92     AdvancedPhaseAfterSetRegAttr             After register attr setting
97     AdvancedPhasePreSched                    Before scheduling
101    AdvancedPhaseAllocReg                    Register allocation driver
104    AdvancedPhasePostExpansion               After post-RA expansion
106    AdvancedPhasePostSched                   After post-scheduling
111    AdvancedPhasePostFixUp                   After post-fixup
115    AdvancedScoreboardsAndOpexes             Full scoreboard analysis
127    AdvancedPhaseOriPhaseEncoding            Phase encoding hook
134    AdvancedPhaseAfterMidExpansion           After mid-expansion
135    AdvancedPhaseLateExpandSyncInstructions  Late sync expansion

The pattern is consistent: AdvancedPhase hooks bracket major pipeline stages, allowing backends to insert target-specific transformations without altering the fixed phase ordering. Phase 101 (AdvancedPhaseAllocReg) is notable because register allocation itself is entirely driven through this hook -- the base pipeline has no hardcoded allocator.

O0 vs O1+ Behavior

At -O0, the pipeline skips most optimization phases via their individual isNoOp() checks. The critical difference is in scoreboard generation:

  • -O1 and above: Phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using sub_A36360 (52KB control word encoder) and sub_A23CF0 (54KB DAG list scheduler heuristic). Phase 116 is a no-op.
  • -O0: Phase 115 is a no-op. Phase 116 (ProcessO0WaitsAndSBs) inserts conservative stall counts and wait barriers -- every instruction gets the maximum stall, and barriers are placed at every potential hazard point. This produces correct but slow code.

Individual phases also check the optimization level internally via the compilation context. The scheduling infrastructure (sub_8D0640) reads the opt-level via sub_7DDB50 and selects between forward-pass scheduling (opt-level <= 2, register-pressure-reducing) and reverse-pass scheduling (opt-level > 2, latency-hiding).

NamedPhases Override (Option 298)

The NamedPhases mechanism allows complete replacement of the default 159-phase pipeline with a user-specified phase sequence, primarily used for debugging and performance investigation.

Activation

The pipeline orchestrator (sub_7FB6C0) checks option ID 298 via a vtable call at compilation context offset +72. When set, the orchestrator bypasses the default pipeline and delegates to sub_9F63D0 (NamedPhases entry point):

// sub_7FB6C0 -- simplified
void orchestrate(CompilationUnit* cu) {
    if (cu->config->getOption(298)) {
        // NamedPhases mode -- user-specified phase sequence
        NamedPhases_run(cu);              // sub_9F63D0
    } else {
        // Default mode -- fixed 159-phase pipeline
        PhaseManager* pm = PhaseManager_new(cu);  // sub_C62720
        int* ordering = get_default_ordering();    // sub_C60D20
        dispatch(pm, ordering, 159);               // sub_C64F70
        PhaseManager_destroy(pm);                  // sub_C61B20
    }
    // ... cleanup 17 data structures, refcounted objects ...
}

Configuration String Format

Option 298 is set via a knob string (environment variable or command-line). The string is stored at compilation context offset 21464 with a type indicator at offset 21456. The parser (sub_798B60, NamedPhases::ParsePhaseList) tokenizes the comma-delimited string:

"phase_name1,phase_name2=param,shuffle,swap1,..."

Maximum 256 entries. The parser populates three parallel arrays:

  • Phase name strings
  • Parameter value strings (parsed via strtol)
  • Full name=value pairs

Phase List Builder

The core builder (sub_9F4040, 49KB) processes the parsed configuration:

  1. Allocates a 0x2728-byte stack frame with 256-entry string tables
  2. Initializes a 158-entry phase descriptor table (zeroed 0x400 bytes)
  3. Resolves phase names to indices via sub_C641D0 (case-insensitive binary search)
  4. Recognized manipulation keywords:
    • shuffle -- randomize the phase ordering
    • swap1..swap6 -- swap specific phase pairs (for A/B testing)
    • OriPerformLiveDead -- override liveness pass placement
    • OriCopyProp -- override copy propagation placement
  5. Constructs the final phase index sequence and dispatches via sub_C64F70

Pass-Disable Integration

Individual passes can be disabled without reordering the pipeline. The check function sub_799250 (IsPassDisabled, 68 bytes) performs a case-insensitive substring match against the PTXAS_DISABLED_PASSES string at context offset 13328:

// sub_799250 -- simplified
bool is_pass_disabled(Context* ctx, const char* pass_name) {
    if (ctx->pass_disable_flag == 0) return false;  // offset 13320
    if (ctx->pass_disable_flag == 5) {
        return strcasestr(ctx->pass_disable_string, pass_name);  // offset 13328
    }
    return false;
}

This check is called from 16+ sites across the codebase, guarding passes like LoopMakeSingleEntry and SinkCodeIntoBlock. A more thorough variant (sub_7992A0, IsPassDisabledFull) uses FNV-1a hashing for function-specific override tables.

PhaseManager Data Structures

PhaseManager Object (~112 bytes)

Offset  Type       Field
------  ----       -----
+0      int64      compilation_unit pointer
+8      int64*     allocator
+16     void*      sorted_name_table (for binary search)
+24     int32      sorted_name_count
+28     int32      sorted_name_capacity
+32     int64*     allocator_copy
+40     void*      phase_list (array of 16-byte Phase entries)
+48     int32      phase_list_count
+52     int32      phase_list_capacity
+56     int64      nvopt_recipe_ptr (NvOptRecipe sub-manager, or NULL)
+64     int64      (reserved)
+72     bool       timing_enabled (from options[17928])
+76     int32      (flags)
+80     bool       flag_byte
+88     int64*     timing_allocator
+96     void*      phase_name_raw_table
+104    int32      phase_name_raw_count
+108    int32      phase_name_raw_capacity

Timing Record (32 bytes)

Offset  Type       Field
------  ----       -----
+0      int32      phase_index (-1 = sentinel)
+8      int64      phase_name_or_magic (0x2030007 = sentinel)
+16     int64      timing_value
+24     int32      memory_flags

NvOptRecipe Sub-Manager (440 bytes, at PhaseManager+56)

Created when option 391 is set. Contains timing records with 584-byte stride, a hash table for recipe lookup, sorted arrays, and ref-counted shared lists. The sub-manager inherits the phase chain from the previous execution context, enabling recipe-based pipeline modification across compilation units.

Function Map

Address     Size    Identity
-------     ----    --------
sub_C60D20  16 B    Default phase table pointer
sub_C60D30  3554 B  Phase factory (159-case switch)
sub_C60BD0  334 B   Multi-function phase invoker
sub_C61B20  1753 B  PhaseManager destructor
sub_C62200  888 B   Pool consumption reporter
sub_C62580  253 B   Timing record array resizer (1.5x growth)
sub_C62640  223 B   Phase list resizer (1.5x growth)
sub_C62720  4734 B  PhaseManager constructor
sub_C639A0  1535 B  Case-insensitive quicksort (median-of-3)
sub_C63FA0  556 B   Phase name table sort/rebuild
sub_C641D0  305 B   Phase name-to-index binary search
sub_C64310  3168 B  Per-phase timing reporter
sub_C64F70  1455 B  Phase dispatch loop
sub_7FB6C0  1193 B  Pipeline orchestrator (option 298 gate)
sub_798B60  1776 B  NamedPhases::ParsePhaseList
sub_799250  68 B    IsPassDisabled (substring check)
sub_7992A0  894 B   IsPassDisabledFull (FNV-1a hash)
sub_9F4040  9093 B  NamedPhases::parseAndBuild
sub_9F63D0  342 B   NamedPhases::run
sub_9F1A90  6310 B  MercConverter main pass
sub_9F3340  ~7 KB   MercConverter orchestrator
sub_9ED2D0  ~25 KB  MercConverter opcode dispatch

Diagnostic Strings

String                                        Location         Trigger
------                                        --------         -------
"All Phases Summary"                          sub_C64F70       End of dispatch loop (timing enabled)
"[Pool Consumption = "                        sub_C62200       End of dispatch loop (timing enabled)
" :: "                                        sub_C64310       Per-phase timing line
"[Total ", "[Freeable ", "[Freeable Leaked "  sub_C64310       Memory delta columns
"Before ", "After "                           sub_C64F70       Phase execution markers
"NamedPhases"                                 sub_9F4040       NamedPhases config parsing
"shuffle", "swap1".."swap6"                   sub_9F4040       NamedPhases manipulation keywords
"After MercConverter"                         near sub_9F3340  Post-MercConverter diagnostic
"CONVERTING"                                  sub_9EF5E0       During MercConverter lowering
"Internal compiler error."                    sub_9EB990       ICE assertion (3 sites)

Cross-References

SASS Code Generation

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page covers the top-level compilation orchestration layer of ptxas: the code that sits between the PTX front-end (parsing, directive handling) and the 159-phase optimization pipeline. It is responsible for validating the parsed PTX, selecting a compilation strategy, computing register constraints, dispatching per-kernel compilation (either sequentially or via a thread pool), and collecting per-kernel outputs for finalization. The orchestrator is the single largest function in the front-end region at 2,141 decompiled lines.

Key Facts

Core orchestrator        sub_4428E0 (13,774 bytes, 2,141 decompiled lines)
Per-kernel worker        sub_43A400 (4,696 bytes, 647 lines)
Per-kernel DAGgen+OCG    sub_64BAF0 (~30 KB, 1,006 decompiled lines)
Per-entry output         sub_43CC70 (5,425 bytes, 1,077 decompiled lines)
Thread pool worker       sub_436DF0 (485 bytes, 59 decompiled lines)
Thread pool constructor  sub_1CB18B0 (184-byte pool struct, calls pthread_create)
Finalization             sub_432500 (461 bytes, 47 decompiled lines)
Regalloc finalize        sub_4370F0 (522 bytes, 64 decompiled lines)
Compilation strategies   4 (normal, compile-only, debug, non-ABI)
Error recovery           setjmp/longjmp (non-local, no C++ exceptions)

Architecture

sub_446240 (top-level driver)
  |
  v
sub_4428E0 (core orchestrator, 2141 lines)
  |
  |-- Option validation: .version/.target, --compile-only, --compile-as-tools-patch
  |-- Cache config: def-load-cache, force-load-cache, def-store-cache, force-store-cache
  |-- Strategy selection: 4 function-pointer pairs (see below)
  |-- Register constraints: sub_43B660 per kernel (via strategy function)
  |-- Compile-unit table: 48-byte per-CU entry at a1+336
  |-- Timing array: 112-byte per-kernel entry at a1+256
  |
  +-- IF single-threaded (thread_count == 0):
  |     |
  |     FOR EACH compile unit:
  |       |
  |       +-- sub_43A400 (per-kernel setup, 647 lines)
  |       |     |-- Target-specific defaults ("ptxocg.0.0", cache, texmode)
  |       |     |-- ABI configuration, fast-compile shortcuts
  |       |     +-- Error recovery via setjmp
  |       |
  |       +-- sub_432500 (finalization wrapper, 47 lines)
  |       |     |-- setjmp error guard
  |       |     +-- vtable call at a1+96: invokes the actual OCG pipeline
  |       |
  |       +-- sub_4370F0 (timing finalization, 64 lines)
  |       |     +-- Accumulates per-kernel timing into 112-byte records
  |       |
  |       +-- sub_43CC70 (per-entry output, 1077 lines)
  |             |-- Skip __cuda_dummy_entry__
  |             |-- Generate .sass and .ucode sections
  |             +-- Emit "# ============== entry %s ==============" header
  |
  +-- IF multi-threaded (thread_count > 0):
        |
        |-- sub_1CB18B0(thread_count) --> create thread pool
        |
        FOR EACH compile unit:
          |
          +-- sub_43A400 (per-kernel setup, same as above)
          +-- Snapshot 15 config vectors (v158[3]..v158[17])
          +-- Copy hash maps for thread isolation
          +-- sub_1CB1A50(pool, sub_436DF0, task_struct) --> enqueue
          |
          v
        sub_436DF0 (thread pool worker, 59 lines)
          |-- sub_430590("ptxas", ...) -- set thread-local program name
          |-- Jobserver slot check (sub_1CC6EC0)
          |-- sub_432500(...) -- finalize via vtable call (DAGgen+OCG+SASS)
          |-- Timing: wall-clock and phase timers into per-CU record
          |-- sub_693630 (release compiled output to downstream)
          +-- sub_4248B0 (free task struct)
        |
        sub_1CB1AE0(pool)  --> wait for all tasks
        sub_1CB1970(pool)  --> destroy pool
        sub_4370F0(a1, -1) --> finalize aggregate timing

The Core Orchestrator: sub_4428E0

This is a 2,141-line monolith that drives the entire compilation after the PTX has been parsed. Its responsibilities, in execution order:

1. Cache and Option Validation

The first 200+ lines read four cache-mode knobs from the options store at a1+904:

def_load_cache   = get_knob(a1->options, "def-load-cache");
force_load_cache = get_knob(a1->options, "force-load-cache");
def_store_cache  = get_knob(a1->options, "def-store-cache");
force_store_cache = get_knob(a1->options, "force-store-cache");

It then validates a long series of option combinations:

  • --compile-as-tools-patch (a1+727) incompatibility checks against shared memory, textures, surfaces, samplers, constants
  • --assyscall (a1+627) resource allocation checks
  • --compile-only (a1+726) vs unified functions
  • Non-ABI mode (a2+218, a2+235): disables --fast-compile, --extensible-whole-program
  • --position-independent-code vs --extensible-whole-program mutual exclusion
  • Architecture version checks: .target SM version vs --gpu-name SM version
  • -noFwdPrg forward-progress flag against SM version
  • --legacy-bar-warp-wide-behavior against SM >= 70

2. Strategy Selection

Four compilation strategies are expressed as pairs of function pointers (v314, v293), selected at lines 756-779 of the decompilation. Each strategy pair consists of a register-constraint calculator and a compile-unit builder:

Strategy      Condition                                                  Register Calculator (v314)  CU Builder (v293)
--------      ---------                                                  --------------------------  -----------------
Compile-only  --compile-only OR --assyscall OR --compile-as-tools-patch  sub_43C6F0 (225 lines)      sub_4383B0 (177 lines)
Debug         --compile-as-tools-patch AND NOT debug mode                sub_43CAE0 (91 lines)       sub_4378E0 (250 lines)
Non-ABI       --extensible-whole-program                                 sub_43C570 (77 lines)       sub_438B50 (375 lines)
Normal        default                                                    sub_43CA80 (24 lines)       sub_438B50 (375 lines)

The selection logic:

if (compile_only || assyscall || tools_patch) {
    calc_regs = sub_43C6F0;
    build_cus = sub_4383B0;
} else if (tools_patch_mode) {
    calc_regs = debug_mode ? sub_43C6F0 : sub_43CAE0;
    build_cus = debug_mode ? sub_4383B0 : sub_4378E0;
} else {
    calc_regs = extensible_whole_program ? sub_43C570 : sub_43CA80;
    build_cus = sub_438B50;
}

3. Register Constraint Calculation

Each strategy's register calculator iterates the kernel list and calls sub_43B660 to compute per-kernel register limits. The result is stored in a hash map at a1+1192:

// sub_43CA80 (normal strategy, 24 lines) -- simplest form
for (node = kernel_list; node; node = node->next) {
    entry = node->data;
    name  = entry->func_ptr;    // at +16
    limit = sub_43B660(a1, name, a1->opt_level, entry->thread_count);
    map_put(a1->reg_limit_map, name, limit);  // a1+1192
}

sub_43B660 (3,843 bytes) balances register pressure against occupancy by considering .maxnreg, --maxrregcount, .minnctapersm, and .maxntid. String evidence: ".minnctapersm and .maxntid", "threads per SM", "computed using thread count", "of .maxnreg", "of maxrregcount option", "global register limit specified".
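The recovered strings suggest the shape of this computation. Below is a minimal sketch of the occupancy balancing, assuming a 65,536-register file per SM and the 255-register per-thread hardware cap; the actual heuristics inside sub_43B660 are not fully recovered, and all names here are illustrative:

```c
#include <assert.h>

/* Hypothetical sketch of the register/occupancy balancing in sub_43B660.
 * Assumes: 65,536 registers per SM, a 255-register per-thread cap, and
 * an 8-register allocation granularity.  Not a recovered implementation. */
static int reg_limit(int threads_per_cta, int min_ctas_per_sm, int maxnreg)
{
    int per_thread = 65536 / (threads_per_cta * min_ctas_per_sm);
    if (per_thread > 255)
        per_thread = 255;            /* per-thread hardware cap */
    per_thread &= ~7;                /* round down to 8-register granule */
    if (maxnreg > 0 && maxnreg < per_thread)
        per_thread = maxnreg;        /* explicit .maxnreg / --maxrregcount wins */
    return per_thread;
}
```

This mirrors the string evidence ("computed using thread count", ".minnctapersm and .maxntid") without claiming the binary's exact rounding or cap values.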

4. Compile-Unit Table Construction

The CU builder (v293) constructs a linked list of 72-byte compile-unit descriptors. Each descriptor contains:

struct CompileUnitDescriptor {  // 72 bytes
    int32   index;          // +0:  CU ordinal
    void*   dep_list;       // +8:  dependency tracking set
    void*   entry_ptr;      // +16: pointer to entry function symbol
    bool    is_entry;       // +25: true if .entry, false if .func
    int32   regalloc[2];    // +28: register allocation mode pair
    bool    flags[4];       // +36: has_shared_mem, has_surfaces, has_textures, has_samplers
    int16   cap_flags;      // +40: capability flags from func_attr+240
    int32   min_regs;       // +44: minimum register count (from profile check)
    // +48..55: additional attribute OR bits from func_attr+208..236
    // +56..63: reserved
    void*   smem_info;      // +48: 24-byte sub-struct for shared memory
};

The builder allocates this via sub_424070(pool, 72), populates it from the function attribute struct at entry_ptr+80, and enqueues it into the output list via sub_42CA60.

5. Per-Kernel Dispatch

After building the CU list, the orchestrator enters one of two dispatch modes based on the thread count at a1+668:

Single-Threaded Path (thread_count == 0)

The loop at decompilation lines 1607-1686 iterates each CU sequentially:

for (node = cu_list; node; node = node->next) {
    cu_desc = node->data;
    // Record start time in 112-byte timing record
    timing_record[cu_desc->index].start = get_time();
    
    // Allocate 360-byte work buffer
    work = pool_alloc(pool, 360);
    memset(work, 0, 360);
    
    // Per-kernel setup: target config, cache defaults, ABI
    sub_43A400(a1, parser_state, cu_desc, elf_builder, work);
    
    // Finalization: runs the actual DAGgen + OCG pipeline
    sub_432500(a1, cu_desc + 16, work[0], work[1]);
    
    // Timing finalization for this kernel
    sub_4370F0(a1, cu_desc->index);
    
    // Per-entry output: .sass/.ucode sections
    sub_43CC70(a1, parser_state, cu_desc, work);
    
    pool_free(work);
}

Multi-Threaded Path (thread_count > 0)

The thread pool path (decompilation lines 1709-1929) uses the pthread-based thread pool:

// Phase 1: prepare all tasks
pool_obj = sub_1CB18B0(thread_count);  // create pool with N threads

for (node = cu_list; node; node = node->next) {
    cu_desc = node->data;
    
    // Allocate 360-byte work buffer (same as single-threaded)
    work = pool_alloc(pool, 360);
    
    // Extra per-thread state: 3 hash maps for thread isolation
    work[288] = hashmap_new(8);   // per-thread reg constraints
    work[296] = hashmap_new(8);   // per-thread symbol copies
    work[304] = hashmap_new(8);   // per-thread attribute copies
    
    sub_43A400(a1, parser_state, cu_desc, elf_builder, work);
    
    // Snapshot 15 config vectors from global state (a1+1072..a1+1296)
    // into work[48]..work[288] for thread-safe access
    for (i = 0; i < 15; i++)
        work[48 + 16*i] = load_128bit(a1 + 1072 + 16*i);
    
    // Copy hash maps from shared state into per-thread copies
    // (reg constraints, symbol tables, attribute maps)
    
    // Enqueue: sub_1CB1A50(pool, sub_436DF0, task_struct)
    task = malloc(48);
    task->pool_data = work;
    task->timing_base = ...;
    sub_1CB1A50(pool_obj, sub_436DF0, task);
}

// Phase 2: wait for all tasks
sub_1CB1AE0(pool_obj);   // wait-for-all
sub_1CB1970(pool_obj);   // destroy pool

// Phase 3: aggregate timing
sub_4370F0(a1, -1);      // -1 = aggregate all CUs

Jobserver Integration

When --split-compile is active with GNU Make, the thread pool integrates with Make's jobserver protocol. The worker function sub_436DF0 checks sub_1CC6EC0() (jobserver slot acquire) before starting compilation and calls sub_1CC7040() (jobserver slot release) after completion. This prevents ptxas from exceeding the -j slot limit during parallel builds.
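The jobserver protocol itself is simple: a worker acquires a slot by reading one token byte from the jobserver pipe and releases it by writing the byte back. A hedged sketch of the pattern the wrappers sub_1CC6EC0/sub_1CC7040 appear to implement (function names and fd handling here are illustrative):

```c
#include <assert.h>
#include <unistd.h>

/* Illustrative GNU Make jobserver client.  The real fd pair comes from
 * MAKEFLAGS (--jobserver-auth=R,W); these helpers are a sketch, not the
 * recovered sub_1CC6EC0 / sub_1CC7040 implementations. */
static char acquire_slot(int read_fd)
{
    char token;
    if (read(read_fd, &token, 1) != 1)   /* blocks until a slot is free */
        return 0;
    return token;
}

static void release_slot(int write_fd, char token)
{
    (void)write(write_fd, &token, 1);    /* return the slot token */
}
```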

Per-Kernel Worker: sub_43A400

This 647-line function sets up target-specific defaults for each kernel before the OCG pipeline runs. Key responsibilities:

  1. Timing instrumentation -- records start timestamps, wall-clock time
  2. Target configuration -- reads "ptxocg.0.0" defaults, sets cache mode, texturing mode, "specified texturing mode" string evidence
  3. Fast-compile shortcuts -- when --fast-compile is active, reduces optimization effort
  4. ABI setup -- configures parameter passing, return address register, scratch registers
  5. Error recovery -- establishes setjmp point for fatal errors during kernel compilation

The function allocates a _jmp_buf on the stack for error recovery. If any phase in the downstream pipeline calls the fatal diagnostic path (sub_42F590 with severity >= 6), execution longjmps back to sub_43A400's recovery handler, which cleans up the partially-compiled kernel and continues to the next.

String evidence: "def-load-cache", "force-load-cache", "--sw4575628", "NVIDIA", "ptxocg.0.0", "specified texturing mode", "Indirect Functions or Extern Functions".

Finalization: sub_432500

This 47-line wrapper function is the bridge between the orchestrator and the actual DAGgen+OCG pipeline. It:

  1. Retrieves the thread-local context via sub_4280C0
  2. Saves and replaces the jmp_buf pointer in the TLS struct (for nested error recovery)
  3. Saves the current error/warning flags
  4. Clears the error flags to create a clean compilation context
  5. Calls through a vtable pointer at a1+96 to invoke the actual compilation:
// sub_432500 -- simplified
bool finalize(Context* ctx, CUDesc* cu, void* sass_out, void* ucode_out) {
    char* tls = get_tls();
    jmp_buf* old_jmp = tls->jmp_buf;
    tls->jmp_buf = &local_jmp;
    char saved_err = tls->error_flags;
    char saved_warn = tls->warning_flags;
    tls->error_flags = 0;
    tls->warning_flags = 0;
    
    if (setjmp(local_jmp)) {
        // Error path: restore state, cleanup, report ICE
        tls->jmp_buf = old_jmp;
        tls->error_flags = 1;  tls->warning_flags = 1;
        release_output(ucode_out);
        release_output(cu->output);
        report_internal_error();
        return false;
    }
    
    // Normal path: invoke the pipeline
    bool ok = ctx->vtable->compile(ctx->state, sass_out, ctx + 384);
    if (!ok) report_internal_error();
    
    // Merge error flags
    tls->jmp_buf = old_jmp;
    tls->error_flags = saved_err ? true : (tls->error_flags != 0);
    tls->warning_flags = saved_warn ? true : (tls->warning_flags != 0);
    return ok;
}

The vtable call at a1+96 is the entry point into sub_64BAF0 (the 1,006-line function that runs DAGgen, the 159-phase OCG pipeline, and Mercury SASS encoding for a single kernel).

Timing Finalization: sub_4370F0

This 64-line function accumulates per-kernel timing results into the master timing array at a1+256. Each entry in this array is a 112-byte record:

struct KernelTimingRecord {  // 112 bytes, at a1->timing_array + 112*index
    char*   kernel_name;     // +0
    float   ocg_time;        // +20
    float   total_time;      // +36
    float   cumulative;      // +40
    double  wall_clock;      // +72
    // ... other timing fields
};

When called with index == -1 (aggregate mode after multi-threaded compilation), it sums all per-kernel records into the global timing counters at a1+176 (total parse time), a1+184 (total OCG time), and a1+208 (peak wall-clock).
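The aggregate pass can be sketched as a plain sum over the per-kernel records. The struct below is a simplified stand-in for the partially recovered 112-byte layout:

```c
#include <assert.h>

/* Simplified stand-in for KernelTimingRecord; the real record is 112
 * bytes with additional fields at unrecovered offsets. */
typedef struct {
    const char *kernel_name;   /* +0  */
    float ocg_time;            /* +20 */
    double wall_clock;         /* +72 */
} KernelTimingRecord;

/* Sketch of aggregate mode (index == -1): sum every per-kernel record
 * into a global counter, as sub_4370F0 does for a1+184. */
static double aggregate_ocg_time(const KernelTimingRecord *recs, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += recs[i].ocg_time;
    return total;
}
```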

Per-Entry Output: sub_43CC70

This 1,077-line function produces the final per-kernel output artifacts. Key behaviors:

  1. Skip dummy entries -- checks for "__cuda_dummy_entry__" and returns immediately
  2. Section generation -- creates .sass and .ucode ELF sections for each kernel
  3. Entry banner -- emits "\n# ============== entry %s ==============\n" to the SASS text output
  4. Register map -- calls "reg-fatpoint" to annotate the register allocation
  5. Verbose SASS output -- when --verbose is active, formats and writes human-readable SASS text
  6. Multiple output paths -- supports mercury, capmerc, and direct SASS output modes

Thread Pool Worker: sub_436DF0

The worker function dispatched to each thread pool thread is compact (59 lines) but carefully structured for thread safety:

void thread_worker(Context* a1, TaskStruct* task) {
    set_thread_program_name("ptxas", task);
    
    // Acquire jobserver slot if applicable
    if (a1->jobserver_enabled && !jobserver_acquire())
        report_fatal_error();  // Cannot get build slot
    
    float time_before = get_pool_time(a1->timer);
    double wall_before = get_wall_time();
    
    // Run the actual compilation
    sub_432500(a1->state, task + 64, task->sass_output, task->ucode_output);
    
    float time_after = get_pool_time(a1->timer);
    double wall_after = get_wall_time();
    
    // Record timing in per-CU record
    int cu_index = *(int*)task->cu_desc;
    TimingRecord* rec = &a1->timing_array[cu_index];
    rec->ocg_time = time_after - time_before;
    rec->cumulative += (time_after - time_before);
    if (wall_after - wall_before > 0)
        rec->wall_clock = wall_after - wall_before;
    
    // Peak wall-clock tracking (under lock)
    if (a1->compiler_stats && a1->per_kernel_stats) {
        lock_timing(6);
        double peak = a1->peak_wall_clock;
        if (get_wall_time() - a1->start_time > peak)
            a1->peak_wall_clock = get_wall_time() - a1->start_time;
        unlock_timing(6);
    }
    
    // Emit compiled output downstream
    if (a1->dump_sass)
        dump_sass(task->ucode_output);
    release_output(task->ucode_output);
    
    // Transfer kernel name to output
    task->output->name = **(task->cu_desc->entry_ptr + 88);
    
    // Release jobserver slot
    if (a1->jobserver_enabled && !jobserver_release())
        report_fatal_error();
    
    pool_free(task);
}

The timing lock at index 6 (sub_607D70(6) / sub_607D90(6)) serializes access to the peak wall-clock counter across threads. This is the only shared mutable state in the multi-threaded path -- all other per-kernel state is isolated in the 360-byte work buffer and per-thread hash map copies.
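The pattern is a standard mutex-protected monotone max. A minimal sketch, with the binary's lock index 6 replaced by a plain mutex for illustration:

```c
#include <pthread.h>
#include <assert.h>

/* Sketch of the peak wall-clock critical section.  The binary uses an
 * indexed lock (sub_607D70(6) / sub_607D90(6)); a single mutex stands in
 * for it here. */
static pthread_mutex_t timing_lock = PTHREAD_MUTEX_INITIALIZER;
static double peak_wall_clock;

static void update_peak(double elapsed)
{
    pthread_mutex_lock(&timing_lock);
    if (elapsed > peak_wall_clock)
        peak_wall_clock = elapsed;   /* monotone max under lock */
    pthread_mutex_unlock(&timing_lock);
}
```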

Data Flow Summary

PTX text
  |
  v (parsed by sub_451730 into AST at parser_state+88)
  |
sub_4428E0: strategy_calc(kernel_list)  --> reg_limit_map (a1+1192)
sub_4428E0: strategy_build(kernel_list) --> cu_descriptor_list (72-byte nodes)
  |
  v (for each CU descriptor)
  |
sub_43A400: target_config(cu_desc) --> 360-byte work buffer
  |
sub_432500: vtable->compile()
  |  invokes sub_64BAF0 (DAGgen + 159-phase OCG + Mercury)
  |    |
  |    +-- Ori IR construction (DAGgen phase)
  |    +-- 159 phases via PhaseManager (sub_C62720 / sub_C64F70)
  |    +-- Mercury SASS encoding (phases 113-122)
  |    |
  v    v
work[0] = .sass output    work[1] = .ucode output
  |
sub_4370F0: record timing
  |
sub_43CC70: emit .sass/.ucode sections, entry banner, verbose output
  |
  v
ELF builder (sub_612DE0)

Cross-References

Function Map

| Address | Size | Lines | Identity | Confidence |
|---|---|---|---|---|
| sub_4428E0 | 13,774 B | 2,141 | Core compilation orchestrator | HIGH |
| sub_43CA80 | 192 B | 24 | Normal strategy: register calculator | HIGH |
| sub_438B50 | 2,419 B | 375 | Normal/non-ABI strategy: CU builder | HIGH |
| sub_43C6F0 | 1,600 B | 225 | Compile-only strategy: register calculator | HIGH |
| sub_4383B0 | 1,320 B | 177 | Compile-only/debug strategy: CU builder | HIGH |
| sub_43CAE0 | 648 B | 91 | Debug strategy: register calculator | HIGH |
| sub_4378E0 | 2,010 B | 250 | Debug strategy: CU builder | HIGH |
| sub_43C570 | 577 B | 77 | Non-ABI strategy: register calculator | HIGH |
| sub_43A400 | 4,696 B | 647 | Per-kernel worker (target config + setup) | HIGH |
| sub_64BAF0 | ~30 KB | 1,006 | DAGgen + OCG + SASS (per-kernel pipeline) | MEDIUM |
| sub_43CC70 | 5,425 B | 1,077 | Per-entry output (.sass/.ucode sections) | HIGH |
| sub_436DF0 | 485 B | 59 | Thread pool worker function | HIGH |
| sub_432500 | 461 B | 47 | Finalization wrapper (setjmp + vtable call) | HIGH |
| sub_4370F0 | 522 B | 64 | Timing finalization (per-kernel + aggregate) | HIGH |
| sub_43B660 | 3,843 B | ~300 | Register/resource constraint calculator | HIGH |
| sub_1CB18B0 | ~200 B | 33 | Thread pool constructor (184-byte struct) | HIGH |
| sub_1CB1A50 | ~200 B | 21 | Thread pool task submit | HIGH |
| sub_1CB1AE0 | -- | -- | Thread pool wait-for-all | HIGH |
| sub_1CB1970 | -- | -- | Thread pool destructor | HIGH |
| sub_1CC7300 | 2,027 B | -- | GNU Make jobserver client | HIGH |

Diagnostic Strings

| String | Location | Purpose |
|---|---|---|
| "def-load-cache" | sub_4428E0 | Cache mode knob read |
| "force-load-cache" | sub_4428E0 | Cache mode knob read |
| "def-store-cache" | sub_4428E0 | Cache mode knob read |
| "force-store-cache" | sub_4428E0 | Cache mode knob read |
| "--compile-only" | sub_4428E0 | Option validation |
| "--compile-as-tools-patch" | sub_4428E0 | Option validation |
| "--extensible-whole-program" | sub_4428E0 | Option validation |
| "calls without ABI" | sub_4428E0 | Non-ABI mode diagnostic |
| "compilation without ABI" | sub_4428E0 | Non-ABI mode diagnostic |
| "unified Functions" | sub_4428E0 | Compile-only restriction |
| "suppress-debug-info" | sub_4428E0 | Debug info suppression warning |
| "position-independent-code" | sub_4428E0 | PIC mode configuration |
| "__cuda_dummy_entry__" | sub_43CC70 | Dummy entry skip check |
| "reg-fatpoint" | sub_43CC70 | Register map annotation |
| ".sass", ".ucode" | sub_43CC70 | Output section names |
| "# ============== entry %s ==" | sub_43CC70 | Per-entry SASS banner |
| "ptxocg.0.0" | sub_43A400 | Target config identifier |
| "specified texturing mode" | sub_43A400 | Texturing mode diagnostic |
| ".local_maxnreg" | sub_438B50 | Per-function register limit |
| "device functions" | sub_438B50 | Compile-only device function handling |
| "--compile-only option" | sub_438B50 | Compile-only restriction |
| "ptxas" | sub_436DF0 | Thread-local program name |

ELF/Cubin Output

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

After all per-kernel SASS encoding completes, ptxas enters the ELF output phase -- the final stage of the compilation pipeline. This phase transforms the accumulated per-kernel SASS bytes, relocation metadata, constant bank data, shared memory layouts, and debug information into a complete NVIDIA CUBIN file. The CUBIN is a standard ELF container with NVIDIA-proprietary extensions: machine type EM_CUDA (0xBE), non-standard ELF class bytes, CUDA-specific section types, and a rich per-entry metadata system called EIATTR. The output pipeline is a custom implementation with no libelf dependency -- ptxas constructs every byte of the ELF from scratch, including headers, section tables, symbol tables, string tables, relocations, and program headers.

The output phase handles three binary kinds: SASS (raw resolved SASS, legacy default), Mercury (SM 75--99 default), and Capsule Mercury (SM 100+ default, supporting deferred finalization). All three produce a valid CUBIN ELF; the difference is whether the .text sections contain final SASS bytes or Mercury-encoded streams that a downstream finalizer resolves at link or load time.

| Item | Detail |
|---|---|
| Cubin entry point | sub_612DE0 (47 KB, called from sub_446240) |
| ELFW constructor | sub_1CB53A0 (3,480 bytes, 672-byte central object) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 call sites) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes, ~1,700 decompiled lines) |
| Master ELF emitter | sub_1C9F280 (15,263 bytes, 97 KB decompiled -- largest function in output range) |
| Section layout calculator | sub_1C9DC60 (5,663 bytes) |
| Master section allocator | sub_1CABD60 (11,856 bytes, 67 KB decompiled -- shared/constant/local addresses) |
| nvinfo/EIATTR builder | sub_1CC9800 (14,764 bytes, 90 KB decompiled) |
| Master relocation resolver | sub_1CD48C0 (4,184 bytes, 22 KB decompiled) |
| File serializer | sub_1CD13A0 (2,541 bytes, writes final bytes to disk) |
| ELF machine type | EM_CUDA = 0xBE (190) |
| CUDA section type | SHT_CUDA_INFO = 0x70000064 |
| ELF timing | "ELF-time : %.3f ms (%.2f%%)" in --compiler-stats output |
| Peak ELF memory | "PeakELFMemoryUsage : %.3lf KB" |

Pipeline Overview

The ELF output pipeline runs as a single-threaded sequence after all per-kernel OCG passes have completed (multi-kernel compilation may be parallel, but ELF emission is serialized). The flow is orchestrated by sub_612DE0, which reads compilation flags, constructs the ELFW central object, then drives 11 phases to produce the final .cubin or .o file.

sub_446240 (compilation driver -- "real main")
  |  all per-kernel OCG passes complete
  v
sub_612DE0 (cubin entry, 47KB)
  |  reads: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
  |  establishes setjmp/longjmp error recovery
  |  writes "Cuda compilation tools, release 13.0, V13.0.88"
  |         "Build cuda_13.0.r13.0/compiler.36424714_0"
  |
  |-- Phase 1: ELFW construction
  |     sub_1CB53A0 -- create 672-byte ELFW object, 7 standard sections
  |
  |-- Phase 2: Per-kernel section creation
  |     sub_1CB42D0 x N -- .text.<func>, .rela.text.<func> (one per kernel)
  |     sub_1CB3570 x 44 -- .nv.constant0.<func>, .nv.shared.<func>, etc.
  |
  |-- Phase 3: Call graph analysis
  |     sub_1CBB920 -- recursion detection (DFS)
  |     sub_1CBC090 -- dead function elimination
  |     sub_1CBE1B0 -- .nv.callgraph section builder
  |
  |-- Phase 4: Symbol fixup
  |     sub_1CB2CA0 -- renumber symbols after dead code elimination
  |     sub_1C99BB0 -- remap .symtab_shndx extended indices
  |
  |-- Phase 5: Memory allocation
  |     sub_1CABD60 -- assign addresses: shared, constant, local memory
  |     sub_1CA92F0 -- shared memory interference graph
  |     sub_1CA6890 -- constant bank deduplication
  |
  |-- Phase 6: nvinfo/EIATTR generation
  |     sub_1CC9800 -- build .nv.info.<func> sections (EIATTR attributes)
  |     sub_1CC8950 -- propagate barrier/register counts across call graph
  |
  |-- Phase 7: Symbol table construction
  |     sub_1CB68D0 -- build .symtab, handle SHN_XINDEX overflow
  |
  |-- Phase 8: Section layout
  |     sub_1C9DC60 -- compute file offsets with alignment padding
  |
  |-- Phase 9: Relocation resolution
  |     sub_1CD48C0 -- resolve all R_CUDA_* relocations
  |     sub_1CD5920 -- write .nv.resolvedrela sections
  |
  |-- Phase 10: Capsule Mercury embedding (SM 100+ only)
  |     sub_1C9B110 -- create .nv.merc.* section namespace
  |     sub_1CA2E40 -- clone memory-space sections into merc namespace
  |     sub_1C9C300 -- build 328-byte capsule descriptors per function
  |
  |-- Phase 11: Final assembly & write
  |     sub_1C9F280 -- master ELF emitter (assemble complete CUBIN)
  |     sub_1CD13A0 -- file serializer (write to disk)
  |
  v
OUTPUT: .cubin / .o file

Custom ELF Emitter

ptxas builds the entire ELF output without libelf. The custom implementation spans approximately 20 functions in the 0x1C99--0x1CD6 address range (~300 KB of binary code). At the center is the ELFW ("ELF world") object -- a 672-byte structure that owns all sections, symbols, and string tables for a single compilation unit.

ELFW Object Layout

The ELFW constructor sub_1CB53A0 allocates 672 bytes from the pool allocator sub_424070, creates a dedicated "elfw memory space" pool (4,096-byte initial allocation), writes the ELF header, and initializes the required null section plus 7 standard sections:

| Index | Section | Type | Purpose |
|---|---|---|---|
| 0 | (null) | SHT_NULL | Required ELF null section |
| 1 | .shstrtab | SHT_STRTAB | Section name string table |
| 2 | .strtab | SHT_STRTAB | Symbol name string table |
| 3 | .symtab | SHT_SYMTAB | Symbol table |
| 4 | .symtab_shndx | SHT_SYMTAB_SHNDX | Extended section indices (for >65,280 sections) |
| 5 | .note.nv.tkinfo | SHT_NOTE | NVIDIA toolkit info (version, build ID, CLI args) |
| 6 | .note.nv.cuinfo | SHT_NOTE | NVIDIA CUDA info (SM version, features) |
| 7 | .nv.uft.entry | SHT_PROGBITS | Unified Function Table entries |

ELF Header

The ELF header uses the standard layout with NVIDIA-specific overrides:

| Offset | Size | Field | CUDA Value |
|---|---|---|---|
| 0x00 | 4 | e_ident[EI_MAG0..3] | 0x7F 'E' 'L' 'F' (magic 0x464C457F) |
| 0x04 | 1 | e_ident[EI_CLASS] | 0x33 ('3', 32-bit) or 0x41 ('A', 64-bit) |
| 0x05 | 1 | e_ident[EI_DATA] | 0x01 (little-endian) |
| 0x06 | 1 | e_ident[EI_VERSION] | 0x01 (EV_CURRENT) |
| 0x07 | 1 | e_ident[EI_OSABI] | CUDA ABI version |
| 0x12 | 2 | e_machine | 0x00BE (EM_CUDA = 190) |
| 0x24 | 4 | e_flags | SM version bits [7:0] + CUDA control flags |

Standard ELF uses ELFCLASS32 = 1 and ELFCLASS64 = 2. CUDA cubins use non-standard values '3' (0x33) and 'A' (0x41), which the CUDA driver recognizes as the cubin signature. Any tool using standard libelf will reject these as invalid, which is one reason ptxas uses a custom emitter.

The e_flags field packs the SM architecture version in the low byte (e.g., 100 for sm_100) along with flags for relocatable vs. executable mode and address size. The mask 0x7FFFBFFF (clears bits 14 and 31) is applied during finalization to strip internal control flags that must not appear in the output cubin.
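Under those assumptions, the finalization step can be sketched as follows; only the low SM byte and the 0x7FFFBFFF mask are recovered, so the rest of the bit layout is left opaque:

```c
#include <assert.h>
#include <stdint.h>

/* Mask applied during finalization: clears internal control bits 14 and 31. */
#define EF_INTERNAL_MASK 0x7FFFBFFFu

/* Illustrative e_flags finalization: pack the SM version into the low
 * byte, then strip internal-only flags.  Bit positions above the low
 * byte are not fully recovered. */
static uint32_t finalize_e_flags(uint32_t flags, uint32_t sm_version)
{
    flags = (flags & ~0xFFu) | (sm_version & 0xFFu);  /* SM in bits [7:0] */
    return flags & EF_INTERNAL_MASK;                  /* drop bits 14, 31 */
}
```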

Section Generation

Each kernel/function produces a set of ELF sections. For a program with N entry functions and M device functions, the CUBIN contains on the order of 4*(N+M) sections minimum. The section creator sub_1CB3570 (44 call sites) handles the generic case; sub_1CB42D0 is specialized for .text.<funcname> code sections.

Per-Kernel Sections

For each kernel entry my_kernel, the output pipeline generates:

| Section | Type | Content |
|---|---|---|
| .text.my_kernel | SHT_PROGBITS | SASS instruction bytes (SHF_ALLOC \| SHF_EXECINSTR) |
| .rela.text.my_kernel | SHT_RELA | Relocation entries for the code section |
| .nv.info.my_kernel | SHT_CUDA_INFO (0x70000064) | EIATTR metadata (register count, barriers, stack sizes, etc.) |
| .nv.constant0.my_kernel | SHT_PROGBITS | Constant bank 0 data (kernel parameters + literal constants) |

Additional per-kernel sections are generated as needed:

| Section | Condition |
|---|---|
| .nv.shared.my_kernel | Kernel uses shared memory |
| .nv.local.my_kernel | Kernel uses local (spill) memory |
| .nv.global.init | Program uses initialized global variables |
| .nv.callgraph | Relocatable object mode (-c) |
| .nv.prototype | Prototype information for cross-module linking |

Global Sections

| Section | Purpose |
|---|---|
| .nv.info | Global EIATTR attributes (not per-kernel) |
| .nv.constant0 | Merged constant bank (whole-program mode) |
| .nv.reservedSmem | Reserved shared memory for runtime (tmem allocation, mbarrier parity) |
| .nv.metadata | Module-level metadata |
| .nv.compat | Forward-compatibility attributes |
| .note.nv.cuver | CUDA version note |
| .nv.uft | Unified Function Table (indirect call support) |
| .nv.udt | Unified Data Table |
| .nv.uft.entry | UFT entry point table |
| .nv.udt.entry | UDT entry point table |
| .nv.rel.action | Relocation action table |
| .nv.resolvedrela | Resolved relocations (post-linking) |
| .nv.host | Host-side interop data |

Constant Banks

CUDA supports up to 18 numbered constant banks (0--17) plus 6 named banks:

| Bank | Name | Purpose |
|---|---|---|
| 0 | .nv.constant0 | Kernel parameters + compiler constants (per-entry) |
| 1--17 | .nv.constant1 -- .nv.constant17 | User-declared __constant__ variables |
| -- | .nv.constant.entry_params | Entry point parameter block |
| -- | .nv.constant.entry_image_header_indices | Texture/surface header index table |
| -- | .nv.constant.driver | Driver-injected constants |
| -- | .nv.constant.optimizer | Optimizer-generated constants (OCG) |
| -- | .nv.constant.user | User-specified constants |
| -- | .nv.constant.pic | Position-independent code constants |
| -- | .nv.constant.tools_data | Tools/debugger-injected data |

Section Ordering

During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF:

| Bucket | Contents |
|---|---|
| 0 (highest) | ELF header pseudo-section, .shstrtab |
| 1 | .strtab, .symtab, .symtab_shndx |
| 2 | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | .text.<funcname> (code sections) |
| 4 | .nv.constant0.*, .nv.shared.*, .nv.local.* (data sections) |
| 5 | .rela.*, .rel.* (relocation sections) |
| 6 | .nv.info.* (EIATTR metadata sections) |
| 7 (lowest) | .debug_*, .nv.merc.* (debug and Mercury metadata) |
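The bucket sort reduces to an ordinary comparator over (bucket, creation order); the within-bucket tie-breaker shown here is an assumption, not a recovered detail:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sort key for section ordering.  Bucket assignment follows
 * the priority table above; ordering within a bucket by creation ordinal
 * is an assumption. */
typedef struct {
    int bucket;    /* 0 (highest priority) .. 7 (lowest) */
    int ordinal;   /* creation order, assumed tie-breaker */
} section_key_t;

static int section_order_cmp(const void *a, const void *b)
{
    const section_key_t *sa = a, *sb = b;
    if (sa->bucket != sb->bucket)
        return sa->bucket - sb->bucket;   /* lower bucket comes first */
    return sa->ordinal - sb->ordinal;     /* stable within a bucket   */
}
```

Sorting an array of such keys with `qsort` yields the section order used during layout.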

Section file offsets are assigned by sub_1C9DC60, walking the sorted list with alignment padding:

uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
    section_t* sec = sorted_sections[i];
    if (sec->sh_addralign > 1)
        offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
    sec->sh_offset = offset;
    offset += sec->sh_size;
}

Two section types receive special treatment during layout: .nv.constant0 (address assigned by the OCG constant bank allocator, not the layout calculator) and .nv.reservedSmem (address assigned by the shared memory master allocator sub_1CABD60).

EIATTR Metadata

Each kernel's .nv.info.<funcname> section contains a sequence of EIATTR (Entry Information Attribute) records. These encode per-kernel metadata that the CUDA driver reads at launch time to configure the hardware correctly. The EIATTR builder is sub_1CC9800 (14,764 binary bytes, 90 KB decompiled, 51 callees) -- one of the largest functions in the output pipeline.

EIATTR Encoding

Each EIATTR record is a TLV (type-length-value) structure:

+--------+--------+------------------+
| type   | length | value            |
| 2 bytes| 2 bytes| length bytes     |
+--------+--------+------------------+

The type field is a 16-bit EIATTR code. The length field specifies the payload size. The value is type-dependent (scalar, array, or structured).
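A minimal encoder for this TLV layout, assuming a little-endian host (the numeric EIATTR code used in the test below is illustrative, not a recovered constant):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emit one EIATTR TLV record into `out` and return the bytes written.
 * Field widths follow the diagram above: 2-byte type, 2-byte length,
 * `len` payload bytes.  Assumes a little-endian host. */
static size_t emit_eiattr(uint8_t *out, uint16_t type,
                          const void *value, uint16_t len)
{
    memcpy(out, &type, 2);        /* 16-bit EIATTR code      */
    memcpy(out + 2, &len, 2);     /* 16-bit payload length   */
    memcpy(out + 4, value, len);  /* type-dependent payload  */
    return 4 + (size_t)len;
}
```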

EIATTR Catalog

ptxas v13.0.88 defines 98 EIATTR codes. The critical ones that every cubin emitter must produce:

| EIATTR | Purpose | Encoding |
|---|---|---|
| EIATTR_REGCOUNT | Register count for this kernel | 4-byte LE integer |
| EIATTR_NUM_BARRIERS | Hardware barrier count (0--16) | 4-byte LE integer |
| EIATTR_FRAME_SIZE | Per-thread stack frame size in bytes | 4-byte LE integer |
| EIATTR_MIN_STACK_SIZE | Minimum call stack size | 4-byte LE integer |
| EIATTR_MAX_STACK_SIZE | Maximum call stack size (recursive) | 4-byte LE integer |
| EIATTR_CRS_STACK_SIZE | Call/Return/Sync stack size | 4-byte LE integer |
| EIATTR_EXIT_INSTR_OFFSETS | Byte offsets of EXIT instructions in .text | Array of 4-byte offsets |
| EIATTR_S2RCTAID_INSTR_OFFSETS | Byte offsets of S2R SR_CTAID.* instructions | Array of 4-byte offsets |
| EIATTR_CTAIDZ_USED | Kernel reads CTA ID Z dimension | Flag (0-byte payload) |
| EIATTR_REQNTID | Required thread block dimensions | 3x 4-byte integers (X, Y, Z) |
| EIATTR_MAX_THREADS | Maximum threads per block | 4-byte LE integer |
| EIATTR_PARAM_CBANK | Constant bank for kernel parameters | 4-byte bank index + offset |
| EIATTR_CBANK_PARAM_SIZE | Size of parameter constant bank region | 4-byte LE integer |
| EIATTR_KPARAM_INFO | Per-parameter ordinal/offset/size/alignment | Structured (V1) |
| EIATTR_KPARAM_INFO_V2 | Per-parameter info, extended format | Structured (V2) |
| EIATTR_MAXREG_COUNT | Maximum register count directive | 4-byte LE integer |
| EIATTR_EXTERNS | List of external symbol references | Array of symbol indices |

Additional EIATTR codes for textures/surfaces, barriers, cooperative groups, tensor cores, and hardware workarounds:

| EIATTR | Purpose |
|---|---|
| EIATTR_IMAGE_SLOT | Texture/surface image binding slot |
| EIATTR_SAMPLER_INIT | Sampler initialization data |
| EIATTR_TEXID_SAMPID_MAP | Texture-to-sampler mapping |
| EIATTR_BINDLESS_IMAGE_OFFSETS | Bindless texture/surface offset table |
| EIATTR_SYNC_STACK | Synchronization stack requirements |
| EIATTR_COOP_GROUP_MASK_REGIDS | Cooperative group register IDs |
| EIATTR_NUM_MBARRIERS | Number of mbarrier objects used |
| EIATTR_MBARRIER_INSTR_OFFSETS | mbarrier instruction locations |
| EIATTR_WMMA_USED | Kernel uses WMMA (Tensor Core) instructions |
| EIATTR_TCGEN05_1CTA_USED | Kernel uses 5th-gen Tensor Core (1-CTA mode) |
| EIATTR_TCGEN05_2CTA_USED | Kernel uses 5th-gen Tensor Core (2-CTA mode) |
| EIATTR_SPARSE_MMA_MASK | Structured sparsity mask for MMA |
| EIATTR_CTA_PER_CLUSTER | CTAs per cluster (SM 90+) |
| EIATTR_EXPLICIT_CLUSTER | Kernel requires explicit cluster launch |
| EIATTR_MAX_CLUSTER_RANK | Maximum cluster rank |
| EIATTR_REG_RECONFIG | Register reconfiguration (setmaxreg) |
| EIATTR_SAM_REGION_STACK_SIZE | SAM (Shared Address Mode) region stack |
| EIATTR_RESERVED_SMEM_USED | Kernel uses reserved shared memory |
| EIATTR_RESERVED_SMEM_0_SIZE | Size of reserved shared memory region 0 |
| EIATTR_SW1850030_WAR | Hardware workaround (bug SW-1850030) |
| EIATTR_SW2393858_WAR | Hardware workaround (bug SW-2393858) |
| EIATTR_SW2861232_WAR | Hardware workaround (bug SW-2861232) |
| EIATTR_CUDA_API_VERSION | Required CUDA API version |
| EIATTR_MERCURY_ISA_VERSION | Mercury ISA version for capmerc binaries |
| EIATTR_MERCURY_FINALIZER_OPTIONS | Finalizer tuning for capmerc |
| EIATTR_SYSCALL_OFFSETS | Byte offsets of syscall instructions |
| EIATTR_INSTR_REG_MAP | Instruction-to-register allocation map (debug) |
| EIATTR_STATISTICS | Per-kernel compilation statistics |
| EIATTR_PERF_STATISTICS | Performance statistics |

Barrier and Register Count Propagation

The EIATTR builder runs a cross-function propagation pass via sub_1CC8950 (2,634 bytes). When a device function uses barriers or a high register count, these requirements must propagate upward to the entry kernel that calls it:

"regcount %d for %s propagated to entry %s"
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
    from section flags of %s to nvinfo for entry symbol %s"
"Propagating higher barcount %d to the section flags
    of %s of entry symbol %s"

This ensures that the CUDA driver allocates sufficient barriers and registers for the entry kernel, accounting for all callees.
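The propagation amounts to taking the max over each entry's callee closure. A sketch, assuming an acyclic call graph (recursion is detected separately by sub_1CBB920) and an illustrative adjacency-matrix representation:

```c
#include <assert.h>

#define MAX_FUNCS 8

/* Sketch of the upward propagation in sub_1CC8950: an entry's barrier or
 * register requirement is the max over itself and all reachable callees.
 * calls[i][j] != 0 means function i calls function j.  Assumes the call
 * graph is acyclic; the representation is illustrative. */
static int propagate(int calls[MAX_FUNCS][MAX_FUNCS],
                     int own[MAX_FUNCS], int f, int n)
{
    int need = own[f];
    for (int j = 0; j < n; j++)
        if (calls[f][j]) {
            int c = propagate(calls, own, j, n);
            if (c > need)
                need = c;   /* "Propagating higher barcount ..." */
        }
    return need;
}
```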

Relocation Processing

The relocation system handles symbol resolution for branch targets, constant bank references, function descriptors, texture/surface bindings, and address computations. The master relocation resolver is sub_1CD48C0 (4,184 binary bytes, 22 KB decompiled). ptxas defines 117 CUDA-specific relocation types (R_CUDA_NONE through R_CUDA_NONE_LAST).

Relocation Categories

| Category | Types | Purpose |
|---|---|---|
| Absolute address | R_CUDA_32, R_CUDA_64, R_CUDA_ABS* | Global memory addresses |
| Global address | R_CUDA_G32, R_CUDA_G64, R_CUDA_G8_* | Global-space addresses |
| PC-relative | R_CUDA_PCREL_IMM24_26, R_CUDA_PCREL_IMM24_23 | Branch/call targets |
| Constant field | R_CUDA_CONST_FIELD19_*, R_CUDA_CONST_FIELD21_*, R_CUDA_CONST_FIELD22_* | Constant bank references |
| Function descriptor | R_CUDA_FUNC_DESC* | Indirect function call targets |
| Texture/surface | R_CUDA_TEX_HEADER_INDEX, R_CUDA_SAMP_HEADER_INDEX, R_CUDA_SURF_* | Texture/surface binding |
| Bindless | R_CUDA_BINDLESSOFF*, R_CUDA_TEX_BINDLESSOFF* | Bindless texture/surface offsets |
| Sub-byte patching | R_CUDA_8_0 through R_CUDA_8_56 | Individual byte within a 64-bit instruction |
| Unified address | R_CUDA_UNIFIED, R_CUDA_UNIFIED_32, R_CUDA_UNIFIED_8_* | Unified address space |
| Instruction-level | R_CUDA_INSTRUCTION64, R_CUDA_INSTRUCTION128 | Whole-instruction replacement |
| Yield/NOP | R_CUDA_YIELD_OPCODE9_0, R_CUDA_YIELD_CLEAR_PRED4_87 | YIELD-to-NOP patching |
| Cleanup | R_CUDA_UNUSED_CLEAR32, R_CUDA_UNUSED_CLEAR64 | Zero out unused instruction fields |

Resolution Logic

The resolver performs these operations for each relocation entry:

  1. Alias resolution -- redirect relocations from alias symbols to their targets ("change alias reloc %s to %s")
  2. Dead function filtering -- skip relocations on eliminated functions ("ignore reloc on dead func %s")
  3. UFT/UDT pseudo-relocation -- handle __UFT_OFFSET, __UFT_CANONICAL, __UDT_OFFSET, __UDT_CANONICAL synthetic symbols
  4. PC-relative validation -- ensure branch targets are in the same section ("PC relative branch address should be in the same section")
  5. YIELD-to-NOP conversion -- convert YIELD instructions to NOP when forward progress requirements prevent yielding
  6. Unified reloc replacement -- convert type 103 (unified) to type 1 (absolute) for final resolution
  7. Address computation -- compute final patched value from symbol address + addend

Output relocation sections (.nv.resolvedrela) are written by sub_1CD5920.
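Step 7 for a simple absolute 32-bit relocation (R_CUDA_32-style) can be sketched as an in-place patch of the section bytes; helper and parameter names here are illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the final address computation for an R_CUDA_32-style
 * relocation: patched value = symbol address + addend, written at
 * r_offset within the target section. */
static void apply_abs32(uint8_t *section, uint64_t r_offset,
                        uint64_t sym_addr, int64_t addend)
{
    uint32_t value = (uint32_t)(sym_addr + addend);
    memcpy(section + r_offset, &value, 4);   /* patch in place */
}
```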

Debug Information

When --device-debug or --generate-line-info is active, ptxas generates DWARF debug sections. The debug subsystem spans 0x1CBF--0x1CC9 and includes parsers, emitters, and dumpers for the standard DWARF sections plus NVIDIA-specific debug extensions.

Debug Sections

Section                  Content
.debug_info              DWARF DIE tree (compilation units, types, variables)
.debug_abbrev            DWARF abbreviation table
.debug_line              Source-to-address line number mapping
.debug_frame             Call frame information for unwinding
.debug_loc               Location lists for variables
.debug_str               DWARF string table
.debug_ranges            Address ranges
.debug_aranges           Address range lookup table
.debug_pubnames          Public name index
.debug_pubtypes          Public type index
.nv_debug_ptx_txt        Embedded PTX source text
.nv_debug_line_sass      SASS-level line number mapping
.nv_debug_info_reg_sass  Register allocation debug info
.nv_debug_info_reg_type  Register type information

Key debug infrastructure functions:

Function     Purpose
sub_1CBF820  DWARF form name table (DW_FORM_addr, DW_FORM_data4, etc.)
sub_1CBF9B0  DWARF attribute name table (DW_AT_producer, DW_AT_comp_dir, etc.)
sub_1CC0850  .debug_abbrev parser/emitter (18 KB decompiled)
sub_1CC4A40  .debug_info DIE tree walker (28 KB decompiled)
sub_1CC34E0  DWARF location expression decoder (DW_OP_* operations)
sub_1CC24C0  .debug_info emission pass (18 KB decompiled)
sub_1CC5EB0  Compilation unit header parser
sub_1C9D1F0  Debug section classifier/mapper (16 KB decompiled)

The --suppress-debug-info option (sub_432A00) disables debug section generation even when debug flags are present.

Capsule Mercury Output Path

For SM 100+ targets (Blackwell, Jetson Thor, consumer RTX 50-series), the default output mode is Capsule Mercury. In this mode, the CUBIN ELF contains two layers of content: standard CUBIN sections and a parallel set of .nv.merc.* sections carrying Mercury-encoded instruction streams plus all metadata needed for deferred finalization.

Mercury-Specific Sections

Section                          Purpose
.nv.merc.debug_abbrev            Cloned DWARF abbreviation table
.nv.merc.debug_info              Cloned DWARF info
.nv.merc.debug_line              Cloned DWARF line table
.nv.merc.debug_frame             Cloned DWARF frame info
.nv.merc.debug_loc               Cloned DWARF locations
.nv.merc.debug_str               Cloned DWARF string table
.nv.merc.debug_ranges            Cloned DWARF ranges
.nv.merc.debug_aranges           Cloned DWARF address ranges
.nv.merc.debug_pubnames          Cloned DWARF public names
.nv.merc.debug_pubtypes          Cloned DWARF public types
.nv.merc.debug_macinfo           Cloned DWARF macro info
.nv.merc.nv_debug_ptx_txt        Embedded PTX source text
.nv.merc.nv_debug_line_sass      SASS-level line mapping
.nv.merc.nv_debug_info_reg_sass  Register allocation debug info
.nv.merc.nv_debug_info_reg_type  Register type debug info
.nv.merc.symtab_shndx            Extended section index table (merc copy)
.nv.merc.nv.shared.reserved      Shared memory reservation metadata
.nv.merc.rela<secname>           Per-section relocation tables

Capsule Mercury Construction

The capmerc path is integrated into the master ELF emitter. The sequence:

  1. sub_1C9B110 (23 KB decompiled) creates the .nv.merc.* section namespace
  2. sub_1CA2E40 (18 KB decompiled) clones constant/global/shared/local sections into the merc namespace, creating .nv.merc.rela relocation sections
  3. sub_1C9C300 (24 KB decompiled) processes .nv.capmerc<funcname> sections. Constructs a 328-byte capsule descriptor per function containing: Mercury-encoded instruction stream, relocation metadata, KNOBS configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage)
  4. sub_1CA3A90 (45 KB decompiled) merges sections that have both merc and non-merc copies
  5. sub_1C99BB0 (25 KB decompiled) remaps section indices after merc section insertion

Off-Target Finalization

The cubin entry sub_612DE0 implements a "fastpath optimization" for off-target finalization:

"[Finalizer] fastpath optimization applied for off-target %u -> %u finalization"

When a capmerc binary compiled for SM X is finalized for SM Y (within the same family), the fastpath patches the Mercury instruction stream directly without full recompilation. The compatibility checker sub_60F290 determines whether fastpath is safe based on instruction set compatibility, register file layout, and memory model.

Self-Check

The --self-check option performs roundtrip verification: generate capmerc, reconstitute SASS from the capsule, and compare section-by-section. The verifier uses a Flex SASS lexer (sub_720F00, 64 KB) and a comparator (sub_729540, 35 KB). Error string: "Failure of '%s' section in self-check for capsule mercury".

Multi-Kernel Output

A typical CUDA program compiles multiple kernels and device functions into a single CUBIN. The output pipeline handles this through per-function section isolation, combined with cross-function analysis for call graph construction, barrier propagation, and dead code elimination.

Per-Function Section Layout

Each entry function and each device function gets its own .text section (the -ffunction-sections pattern). This enables:

  • Function-level dead code elimination -- sub_1CBC090 removes .text, .rela.text, .nv.info, and .nv.constant0 sections for unreachable functions
  • Linker granularity -- nvlink can select individual functions from relocatable objects
  • Driver loading -- the CUDA runtime can load individual kernels by name

Call Graph Construction

The call graph builder (sub_1CBE1B0) emits a .nv.callgraph section that encodes inter-function call edges. This section is present only in relocatable object mode (-c). The recursion detector (sub_1CBB920) performs a DFS traversal with manual 9-level unrolling, emitting "recursion at function %d" for each cycle found.

Dead functions are eliminated by sub_1CBC090:

"dead function %d(%s)"
"removed un-used section %s (%d)"   (x8 -- once per section type)
"function %d(%s) has address taken but no call to it"

Memory Allocation Across Kernels

The master section allocator sub_1CABD60 (11,856 binary bytes, 67 KB decompiled, 69 callees) assigns addresses to all memory-space sections across all kernels. It runs a multi-pass algorithm:

  1. Global shared allocation -- shared variables visible to multiple kernels
  2. Per-entry shared memory -- shared variables private to each kernel
  3. Extern shared handling -- dynamically-sized shared memory (extern __shared__)
  4. Reserved shared memory -- runtime reservations (.nv.reservedSmem.begin, .nv.reservedSmem.cap, .nv.reservedSmem.offset0, .nv.reservedSmem.offset1)
  5. Local memory -- per-thread spill storage
  6. Constant bank merging -- merges constant bank data across kernels, with deduplication (sub_1CA6890: "found duplicate value 0x%x, alias %s to %s")

The shared memory allocator sub_1CA92F0 (2,804 bytes) builds an interference graph for shared objects and performs group allocation for non-overlapping variables.

SHN_XINDEX Overflow

Large CUDA programs can exceed the ELF 65,280-section limit (SHN_LORESERVE = 0xFF00). Each kernel generates at minimum 4 sections (.text, .rela.text, .nv.info, .nv.constant0), so a program with 16,000+ kernels triggers the overflow mechanism:

  1. e_shnum = 0 in the ELF header (signals overflow)
  2. Section header [0].sh_size = real section count
  3. e_shstrndx = SHN_XINDEX (0xFFFF)
  4. Section header [0].sh_link = real .shstrtab index
  5. Symbol st_shndx = SHN_XINDEX when real index >= 0xFF00
  6. .symtab_shndx entries hold the actual section indices

This is standard ELF overflow handling, and it is production-critical -- sub_1CB68D0 checks for it with "overflow number of sections %d".

Key Functions

Address      Size      Decompiled  Purpose
sub_612DE0   ~12 KB    47 KB       Cubin generation entry point
sub_1C9F280  15,263 B  97 KB       Master ELF emitter
sub_1CC9800  14,764 B  90 KB       nvinfo/EIATTR section builder
sub_1CABD60  11,856 B  67 KB       Master section allocator (shared/const/local)
sub_1CB68D0  9,578 B   49 KB       Symbol table builder (.symtab)
sub_1CA3A90  6,289 B   45 KB       Section merger (merc + non-merc)
sub_1C9DC60  5,663 B   29 KB       Section layout calculator
sub_1C99BB0  4,900 B   25 KB       Section index remap (.symtab_shndx)
sub_1C9C300  3,816 B   24 KB       Capsule Mercury section processor
sub_1C9B110  4,585 B   23 KB       Mercury capsule builder
sub_1CD48C0  4,184 B   22 KB       Master relocation resolver
sub_1CBC090  2,870 B   20 KB       Dead function eliminator
sub_1CA2E40  3,152 B   18 KB       Mercury section cloner
sub_1CA92F0  2,804 B   16 KB       Shared memory interference graph
sub_1C9D1F0  2,667 B   16 KB       Debug section classifier
sub_1CA6890  2,286 B   15 KB       Constant bank deduplication
sub_1CC8950  2,634 B   15 KB       Barrier/register count propagator
sub_1CBB920  ~2,000 B  14 KB       Recursion detector (DFS)
sub_1CB91C0  2,668 B   13 KB       ELF structure dumper (debug)
sub_1CB53A0  3,480 B   13 KB       ELFW constructor
sub_1CAB300  2,157 B   12 KB       Bindless texture/surface handler
sub_1CD5920  1,985 B   11 KB       Relocation writer (.nv.resolvedrela)
sub_1CD13A0  2,541 B   11 KB       File serializer (final disk write)
sub_1CBE1B0  1,992 B   10 KB       .nv.callgraph section builder
sub_1CD22E0  1,979 B   10 KB       UFT manager
sub_1CB3570  1,963 B   10 KB       Section creator (44 call sites)
sub_1C98C60  1,755 B   9 KB        Mercury debug section classifier
sub_1CB2CA0  2,038 B   8 KB        Symbol fixup (post-deletion)
sub_1CC7FB0  --        --          .nv.info section name formatter
sub_1CB9FF0  --        --          Section count accessor
sub_1CB9C40  --        --          Get section by index

The Ori Internal Representation

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Ori -- short for "Original IR" -- is ptxas's sole intermediate representation. It is a fully proprietary, SASS-level IR with virtual registers, its own CFG infrastructure, and a partial-SSA discipline. Ori has no relationship to LLVM IR: there is no LLVM Value hierarchy, no LLVM-style use-def chains, no SSA dominance-frontier construction. Every IR-level optimization pass in ptxas (prefixed Ori in the NamedPhases table: OriCopyProp, OriSanitize, OriBranchOpt, OriLoopSimplification, OriStrengthReduce, OriDoPredication, etc.) operates on this representation.

The key design decision that distinguishes Ori from PTX: Ori uses SASS opcode names, not PTX mnemonics. After the MercConverter pass (sub_9F1A90, 35KB) runs, every instruction carries the name of the hardware SASS instruction it will become -- IMAD, FFMA, LDG, STG, BAR, BRA, EXIT, etc. -- just with virtual (not physical) register operands. This means the optimizer already knows exactly which hardware functional unit each instruction will execute on, enabling accurate latency modeling and scheduling from the earliest optimization phases.

Key Facts

Property             Value
Name                 Ori ("Original IR")
Heritage             Fully proprietary (not LLVM-based)
Level                SASS machine-level with virtual registers
SSA form             Partial -- constructed by phase 23, destroyed by phase 73
Code Object size     ~1136 bytes per function (C++ object)
Code Object vtable   0x21EE238
Register files       4: R (GPR), UR (uniform), P (predicate), UP (uniform predicate)
Operand kinds        10 distinct types
CFG representation   FNV-1a hash maps for successor/backedge edges
Opcode encoding      ROT13 of real SASS mnemonic names
BB entry size        40 bytes per basic block, contiguous array
Instruction linkage  Doubly-linked list within each basic block

Architecture Overview

  PTX source
      |
      v
  [Flex/Bison parser]          -- see pipeline/ptx-parser.md
      |
      v
  [PTX-to-Ori lowering]        -- see pipeline/ptx-to-ori.md
      |
      v
  +-------------------------------------------+
  |            Ori IR                          |
  |                                            |
  |  Code Object (per-function container)      |
  |    +-- Basic Block array (40B entries)     |
  |    |     +-- Instruction linked list       |
  |    |           +-- Packed operand array     |
  |    +-- CFG (FNV-1a hash map edges)         |
  |    +-- RPO array                           |
  |    +-- Register file arrays                |
  |    +-- Backedge map                        |
  +-------------------------------------------+
      |
      | 159 optimization phases (phases 0-158)
      |   phase 23: GenerateMovPhi (enter partial SSA)
      |   phase 73: ConvertAllMovPhiToMov (exit partial SSA)
      |
      v
  [Instruction selection]      -- see codegen/isel.md
      |
      v
  [Register allocation]        -- see regalloc/overview.md
      |
      v
  [Instruction scheduling]     -- see scheduling/overview.md
      |
      v
  [SASS binary encoding]       -- see codegen/encoding.md

The Code Object

Every function under compilation is represented by a single Code Object -- a ~1136-byte C++ structure that owns all IR data for that function. The Code Object vtable is at 0x21EE238. Its constructor is at sub_A3B080.

Field Map

Offset  Type  Field           Description
+24     u32   sm_version      SM target (encoded: 12288=sm30, 20481=sm50, 36865=sm90)
+72     ptr   code_buf        Output code object buffer
+88     ptr   reg_file        Register descriptor array. *(ctx+88)+8*regId -> descriptor
+152    ptr   sym_table       Symbol/constant lookup array
+272    ptr   instr_head      Instruction linked-list head
+296    ptr   bb_array        Basic block array pointer (40B per entry)
+304    u32   bb_index        Basic block array count/current index
+312    ptr   options         OptionsManager* for knob queries
+648    ptr   succ_map        CFG successor edge hash table
+680    ptr   backedge_map    CFG backedge hash table
+720    ptr   rpo_array       Reverse post-order array (int*)
+768    ptr   const_sections  Constant memory section array
+776    ptr   smem_sections   Shared memory section array
+976    ptr   block_info      Block info array (40 bytes per entry, contiguous)
+984    i32   num_blocks      Number of basic blocks
+1584   ptr   sm_backend      SM-specific architecture backend object (see data-structures.md)
+1664   ptr   knob_container  Knob container pointer (for -knob queries)
+1928   ptr   codegen_ctx     Code object / code generation context

Register and Instruction Counts (SM Backend Object)

The register counts and instruction counts live in the SM backend object at *(code_obj+1584), accessed via DWORD-indexed fields (not Code Object byte offsets). Earlier versions of this page incorrectly listed these as Code Object offsets +99, +102, +159, +335, +341 -- those are DWORD indices, making the actual byte offsets 396, 408, 636, 1340, and 1364 respectively within the SM backend.

DWORD Index  Byte Offset  Type  Field       Description
[99]         +396         u32   ur_count    Uniform register (UR) count
[102]        +408         u32   r_alloc     R-register count (allocated)
[159]        +636         u32   r_reserved  R-register count (reserved)
[335]        +1340        u32   instr_hi    Instruction count (upper bound)
[341]        +1364        u32   instr_lo    Instruction count (lower bound)

Register count formula (from sub_A4B8F0, where v5 = *(_DWORD **)(ctx + 1584)):

total_R_regs      = v5[159] + v5[102]   // reserved + allocated
instruction_count = v5[335] - v5[341]   // upper - lower

The stats emitter at sub_A3A7E0 prints a detailed per-function profile:

# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [est latency = 87] [LSpillB=0]
# [Occupancy = 0.750000]
# [issue thru=0.888889] [fp thru=0.000000]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]

Basic Blocks

Basic blocks are stored as 40-byte entries in a contiguous array at Code Object +976. The block count is at +984.

Block Entry Layout (40 bytes)

Offset  Type  Field
+0      ptr   Head instruction pointer (first instruction in BB)
+8      ptr   Instruction list link / tail
+28     i32   bix -- block index (unique ID for CFG operations)
+32     u64   Flags / padding

Blocks are additionally accessible via a sub-block array at Code Object +368, indexed as *(ctx+368) + 8*blockIndex.

The debug dumper (sub_BE21D0) emits Graphviz DOT output for the CFG:

digraph f {
  node [fontname="Courier" ...]
  bix0 -> bix1
  bix0 -> bix3
  bix1 -> bix2
  bix2 -> bix1    // backedge (loop)
  bix2 -> bix3
}

Control Flow Graph

The CFG uses FNV-1a hash maps to represent edges. Two separate hash tables exist at Code Object offsets +648 (successor edges) and +680 (backedge info).

FNV-1a Hashing

All CFG hash lookups use the same parameters, confirmed across 50+ call sites:

Parameter     Value
Initial hash  0x811C9DC5
Prime         16777619 (0x01000193)
Input         4-byte block index, hashed byte-by-byte

Hash Map Structure

Each hash map uses chained hashing with 24-byte bucket entries:

Bucket (24 bytes):
  +0   node* head      // first node in chain
  +8   node* tail      // last node in chain
  +16  i32   count     // entries in this bucket

Full Node (64 bytes):
  +0   node* next      // chain link
  +8   i32   key       // block index
  +16  ptr   values    // successor/predecessor block list
  +32  sub-hash data   // embedded sub-table for multi-edge blocks
  +56  u32   hash      // cached FNV-1a hash

Simple Node (16 bytes):
  +0   node* next
  +8   i32   key
  +12  u32   hash

Growth policy: rehash when total_elements > num_unique_keys (load factor exceeds 1.0). Capacity doubles on each rehash.

Key CFG Functions

Address     Size   Function                  Notes
sub_BDE150  9KB    CFG::computeRPO           Explicit DFS stack, assigns RPO numbers into +720 array
sub_BDE8B0  2KB    CFG::printEdges           FNV-1a lookup, prints "bix%d -> bix%d\n"
sub_BDEA50  4KB    CFG::dumpRPOAndBackedges  RPO + backedge debug dump
sub_BE0690  54KB   CFG::buildAndAnalyze      Main CFG constructor: predecessors, successors, RPO, loop detection
sub_BE21D0  1.4KB  CFG::dumpDOT              Graphviz DOT format output
sub_BE2330  4KB    CFG::computeDominators    Post-build dominator/loop analysis with bitvector ops

The RPO dump (sub_BDEA50) produces output like:

Showing RPO state for each basic block:
  bix0 -> RPONum: 0
  bix1 -> RPONum: 1
  bix2 -> RPONum: 3
  bix3 -> RPONum: 2
RPO traversal order: bix0, bix1, bix3, bix2
Showing backedge info:
  bix2 -> backedge's successor BB: 1

Instructions

Instructions are C++ objects with a large vtable, linked into per-basic-block doubly-linked lists. Each instruction carries a unique integer ID, an opcode, and a packed operand array.

Instruction Layout

Offset  Type    Field          Description
+8      varies  reg_class      Register class / encoding fields
+16     i32     id             Unique instruction ID
+28     u32     opcode         SASS opcode (lower 12 bits = base, bits 11-12 = modifier)
+36     u32     flags          Flags (bits 19-21 = subtype)
+48     u8      special_flags  Volatile/special (bit 5 = volatile)
+72     u32     opcode_info    Opcode info (duplicate/extended field, confirmed 50+ sites)
+73     u8      instr_flags    Per-instruction flag byte
+80     u32     operand_count  Number of operands
+84     u32[]   operands       Packed operand array (8 bytes per operand)
+160    ptr     enc_buf        Encoding buffer pointer (post-selection)
+184    u32     enc_mode       Encoding mode
+200    u64     imm_value      Immediate value

Packed Operand Encoding

Each operand occupies 8 bytes in the operand array starting at instruction offset +84:

 31   30  29  28   27                 20   19                        0
+---+------------+-----------------------+--------------------------+
| S |    type    |  modifier bits (8)    |     index (20 bits)      |
+---+------------+-----------------------+--------------------------+

  bit 31:     sign/negate flag
  bits 28-30: operand type
  bits 20-27: modifier bits (bit 24: extended flag)
  bits 0-19:  register/symbol index

type field (bits 28-30):
  1 = register operand      -> index into *(ctx+88) register file
  5 = symbol/const operand  -> index into *(ctx+152) symbol table

Operand Word 1 (Upper 4 Bytes)

Each 8-byte operand slot has two DWORDs. Word 0 (documented above) carries type/modifier/index. Word 1 carries extended flags:

Word 1 (at instr + 84 + 8*i + 4):

 31  30  29  28  27  26  25  24  23                             0
+---+---+---+---+---+---+---+---+-------------------------------+
|     reserved / mod flags      |CB |      auxiliary data        |
+---+---+---+---+---+---+---+---+-------------------------------+
                             ^
                             bit 24: const-bank flag (CB)

Bits 25-31 (mask 0xFE000000): extended modifier flags
  When any bit is set, the operand has special semantics.
  Peephole matchers bail out early if (word1 & 0xFE000000) != 0.
  Bit 25 (0x2000000): operand reuse / negation extension
  Bit 26 (0x4000000): absolute-value modifier (|x|)

Bit 24 (mask 0x1000000): const-bank flag
  When set, indicates the source references a constant bank (c[N][offset]).
  The scheduler uses this to distinguish FADD (standard) from FADD (const-bank)
  for latency modeling (see scheduling/latency-model.md).

Bits 0-23: auxiliary data
  For symbol/const operands (type 5): constant bank number
  For predicate guards (type 6): predicate sense (true/false)
  For register operands (type 1): typically zero

Evidence: sub_40848E checks (word1 & 0xFE000000) != 0 across all operands; sub_405769 tests both 0x1000000 and 0x6000000 combinations; sub_404AD0 verifies (word1 & 0xFE000000) == 0 before allowing peephole transforms. Confirmed in 30+ decompiled functions (confidence 0.92).

Extraction Pattern

Extraction pattern (appears in 50+ functions):

uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type    = (operand >> 28) & 7;
int index   = operand & 0xFFFFF;
int mods    = (operand >> 20) & 0xFF;

uint32_t word1 = *(uint32_t*)(instr + 84 + 8 * i + 4);
bool has_const_bank = (word1 & 0x1000000) != 0;
bool has_ext_mods   = (word1 & 0xFE000000) != 0;

Opcode Constants

Selected confirmed opcodes (from multiple independent functions):

Value  Instruction    Notes
47     NOP / barrier
72     CALL / JMP     Function call or jump
91     ATOM           Atomic memory operation
92     RED            Reduction operation
95     STS            Store to shared memory (ROT13: FGF). Note: EXIT = opcode 77 (RKVG), RET = opcode 72 (ERG)
155    LD variant     Load instruction
173    ST variant     Store instruction
183    LD.E           Extended load (& 0xFFFFCFFF mask removes modifier bits)
267    ST variant     Store (& 0xFFFFCFFF)
268    LD variant     Load (& 0xFFFFCFFF)
288    ST.E           Extended store

The 0xFFFFCFFF mask (clear bits 12-13) strips modifier/suboperation bits from the opcode, yielding the base instruction class. This pattern appears in InstructionClassifier, MBarrierDetector, and OperandLowering code.

ROT13 Opcode Names

All SASS opcode mnemonic strings stored in the binary are ROT13-encoded. The master table is initialized in sub_BE7390 (InstructionInfo constructor) at offset 4184 of the InstructionInfo object, with 16-byte {name, length} entries. This is lightweight obfuscation -- not a security measure.

Selected decoded names (~200+ total, covering the full sm_70+ SASS ISA):

ROT13     Real      Category
VZNQ      IMAD      Integer multiply-add
VNQQ3     IADD3     3-input integer add
SSZN      FFMA      FP fused multiply-add
SNQQ      FADD      FP add
SZHY      FMUL      FP multiply
ZBI       MOV       Move
FRY       SEL       Select
YBC3      LOP3      3-input logic
VFRGC     ISETP     Integer set-predicate
SFRGC     FSETP     FP set-predicate
YRN       LEA       Load effective address
FUS       SHF       Shift / funnel shift
ZHSH      MUFU      Multi-function unit (SFU)
YQT       LDG       Load global
FGT       STG       Store global
YQP       LDC       Load constant
YQY       LDL       Load local
YQF       LDS       Load shared
NGBZ      ATOM      Atomic
ONE       BAR       Barrier
OEN       BRA       Branch
PNYY      CALL      Call
ERG       RET       Return
RKVG      EXIT      Exit
GRK       TEX       Texture
ZRZONE    MEMBAR    Memory barrier
JNECFLAP  WARPSYNC  Warp synchronize
C2E       P2R       Predicate to register
E2C       R2P       Register to predicate
ABC       NOP       No-op
OFFL      BSSY      Branch sync stack push
OFLAP     BSYNC     Branch sync
QRCONE    DEPBAR    Dependency barrier

Register Files

Ori maintains four distinct register files, mirroring the SASS hardware register model.

Register File Summary

File  Width   Range        Special            ABI type  SM backend index
R     32-bit  R0 -- R255   RZ (read-zero)     2         [102] (alloc), [159] (reserved)
UR    32-bit  UR0 -- UR63  URZ (read-zero)    3         [99]
P     1-bit   P0 -- P6     PT (always-true)   5         (tracked separately)
UP    1-bit   UP0 -- UP6   UPT (always-true)  --        (tracked separately)

R registers are the main 32-bit general-purpose registers. 64-bit values occupy consecutive pairs (e.g., R4:R5). The total R-register count for a function is field[159] + field[102] (reserved + allocated). Maximum is 255 usable registers (R0-R254); R255 is the hardware zero register RZ.

UR registers (sm_75+) are uniform registers shared across the warp. Every thread sees the same value. UR0-UR63 on supported architectures. The count lives at SM backend DWORD index [99].

P registers are 1-bit predicate registers used for conditional execution. P0-P6 are usable; PT is the hardwired always-true predicate (writes are discarded).

UP registers are the uniform variant of predicates, shared across the warp like UR.

Register Descriptor

Each register is described by a descriptor in the register file array, accessed as *(ctx+88) + 8*regId:

Offset  Type  Field
+8      u32   Size / live range info
+12     u32   Register number
+16     u32   Register class (enum)
+20     u32   Physical register name (assigned after regalloc)
+24     ptr   Definition info (0 = undefined / uninitialized)
+36     u32   Flags (bits 19-21 = subtype)
+48     u8    Volatile/special flags (bit 5 = volatile marker)
+64     u32   Register file type enum
+68     u32   Physical register number (post-allocation)

Register file type values at descriptor +64:

Value  Meaning
2      General-purpose (R)
3      Uniform (UR)
5      Predicate (P)
6      General register (alternate classification)
7      Predicate (alternate classification)
10     Extended register pair (64-bit)
11     Extended register quad (128-bit)

The register class name table at off_21D2400 maps reg_type enum values to string names. The stat collector (sub_A60B60, 24KB) enumerates ~25 register sub-classes including R, P, B, UR, UP, UB, Tensor/Acc, SRZ, PT, RZ, and others. The allocator processes classes 0--6 (matching reg_type values 0--6); barrier registers (reg_type 9) are handled separately.

Partial SSA

Ori does not maintain full SSA form at all times. Instead, it uses a bounded "partial SSA" window managed by two phases in the 159-phase optimization pipeline.

Phase 23: GenerateMovPhi

Constructs phi-like MovPhi pseudo-instructions at CFG merge points. Inserted after loop unrolling (phase 22) and before pipelining (phase 24). This establishes partial SSA form -- not through LLVM-style dominance-frontier phi insertion, but through explicit MovPhi nodes that represent value merging at control-flow join points.

Phase 73: ConvertAllMovPhiToMov

Destroys SSA form by lowering every MovPhi into a plain MOV instruction. Runs after sync instruction expansion (phase 72) and before uniform register conversion (phase 74). This is SSA destruction without the need for interference-graph-based coalescing -- the MovPhi nodes simply become copies.

The SSA Window

The partial-SSA window spans phases 23 through 73, covering the bulk of the optimization pipeline:

Phase 23  GenerateMovPhi         <-- SSA construction
Phase 24  OriPipelining
Phase 25  StageAndFence
Phase 26  OriRemoveRedundantBarriers
Phase 29  GeneralOptimize
Phase 37  GeneralOptimizeMid
Phase 46  GeneralOptimizeMid2
Phase 49  GvnCse
Phase 50  OriReassociateAndCommon
Phase 54  OriDoRematEarly
Phase 58  GeneralOptimizeLate
Phase 63  OriDoPredication
Phase 65  GeneralOptimizeLate2
Phase 69  OriDoRemat
Phase 70  OriPropagateVaryingSecond
Phase 71  OptimizeSyncInstructions
Phase 72  LateExpandSyncInstructions
Phase 73  ConvertAllMovPhiToMov  <-- SSA destruction

All optimizations between these two phases can rely on the single-definition property of MovPhi nodes for reaching-definition analysis.

MovPhi Instruction Format

A MovPhi is not a distinct opcode -- it reuses the MOV opcode (19) with a distinguishing flag in the instruction's auxiliary fields. Phase 73 (ConvertAllMovPhiToMov) converts MovPhi to plain MOV by clearing this flag, without changing the opcode value.

MovPhi operand layout:
  +72  opcode         = 19 (MOV)
  +76  opcode_aux     = flag distinguishing MovPhi from plain MOV
  +80  operand_count  = 2*N + 1  (variable, one destination + N source-predecessor pairs)

  operand[0]:           destination register (the merged value)
  operand[1], [2]:      {source_reg, predecessor_bix} for predecessor 0
  operand[3], [4]:      {source_reg, predecessor_bix} for predecessor 1
  ...
  operand[2*N-1], [2*N]: {source_reg, predecessor_bix} for predecessor N-1

This is the operational equivalent of an SSA phi node. For a CFG merge with two predecessors:

;; PTX-level CFG:            ;; Ori MovPhi:
;;   bix1 defines R7         ;;
;;   bix2 defines R9         ;;   MovPhi R3, R7, bix1, R9, bix2
;;   bix3 merges             ;;
;;   uses R3                 ;;   "if from bix1, R3 = R7; if from bix2, R3 = R9"

Phase 23 (GenerateMovPhi) inserts these at merge points where a register has different reaching definitions from different predecessors. Phase 73 destructor linearizes them: it inserts a MOV R3, R7 at the end of bix1 and a MOV R3, R9 at the end of bix2, then deletes the MovPhi.

Operand Kinds

The IR supports 10 distinct operand kinds, identified through the register allocator verifier (sub_A55D80) and the instruction selection pattern matcher infrastructure.

#   Kind           Description
1   R/UR register  General-purpose or uniform register operand
2   P/UP register  Predicate or uniform-predicate register operand
3   Any register   Wildcard -- matches any register class
4   Offset         Memory offset for address computation
5   Regular        Standard immediate or constant value
6   Predicated     Guard predicate controlling conditional execution
7   Remat          Rematerialization marker (value can be recomputed instead of spilled)
8   Spill-refill   Spill/refill pair marker for register allocator
9   R2P / P2R      Register-to-predicate or predicate-to-register conversion pair
10  Bit-spill      Single-bit spill (predicate register spill to GPR)

The regalloc verifier (sub_A55D80, confidence 0.95) classifies 10 problem categories that map to these operand kinds:

  1. Missing spill match for refill
  2. Refill reads uninitialized memory
  3. P2R-R2P pattern match failure
  4. Bit-spill-refill pattern match failure
  5. Previously defined operand now uninitialized
  6. Extra post-regalloc definitions (mixed-size check)
  7. Rematerialization problem
  8. P2R-R2P base destroyed
  9. Bit-spill-refill base destroyed
  10. Definitions disappeared without new ones added

The pattern matcher infrastructure at 0xB7D000--0xBA9D00 (~390 functions) uses a separate classification for instruction selection:

Function    Predicate
sub_B28E10  isRegOperand
sub_B28E20  isPredOperand
sub_B28E40  isImmOperand
sub_B28E80  isConstOperand
sub_B28E90  isUReg
sub_B28E00  getRegClass (1023 = wildcard, 1 = GPR)

Ori vs. PTX

PTX is a virtual ISA -- a stable interface between the compiler frontend and the architecture-specific backend. Ori is the architecture-specific backend representation that replaces PTX opcodes with actual SASS instructions early in compilation.

Aspect              PTX                                            Ori
Opcode set          Virtual mnemonics (add, mul, ld, st)           SASS hardware opcodes (IMAD, FFMA, LDG, STG)
Register model      Unlimited virtual registers, typed             4 hardware register files (R, UR, P, UP) with virtual numbering
SSA form            Not applicable (PTX is a linear ISA)           Partial SSA between phases 23 and 73
CFG representation  Implicit (labels + branches)                   Explicit hash-map-based CFG with RPO, backedges, dominators
Target dependence   Architecture-independent (forward-compatible)  Architecture-specific (per-SM instruction selection)
Conversion point    Input to ptxas                                 After MercConverter (sub_9F1A90)

The MercConverter pass is the boundary: it transforms PTX-derived intermediate opcodes into SM-specific SASS opcodes by dispatching through a large opcode switch (sub_9ED2D0, 25KB). After MercConverter, the string "After MercConverter" appears in diagnostic output, and the IR is fully in SASS-opcode form. Each instruction then carries enough information for the scheduler to compute accurate latencies, throughputs, and functional-unit assignments.

Worked Example: add.f32 to FADD

This traces a single PTX instruction through the Ori representation, showing exactly how the opcode, operands, and register references are encoded in memory.

PTX Input

add.f32 %f3, %f1, %f2

After MercConverter (sub_9F1A90), this becomes the Ori instruction:

FADD R3, R1, R2

The type qualifier .f32 disappears -- the "F" in FADD encodes the float type. Register names %f1, %f2, %f3 become virtual register IDs R1, R2, R3 in the R (GPR) register file.

Instruction Object in Memory

FADD is opcode 12 in the ROT13 name table (ROT13: SNQQ, at InstructionInfo+4184+16*12). The 296-byte instruction object:

Offset  Value              Field
------  -----------------  ---------------------
+0      prev_ptr           Linked-list prev
+8      next_ptr           Linked-list next
+16     <id>               Unique instruction ID
+72     0x0000000C         opcode = 12 (FADD)
+80     0x00000003         operand_count = 3
+84     0x10000003         operand[0] word0: dst R3
+88     0x00000000         operand[0] word1: no ext flags
+92     0x10000001         operand[1] word0: src R1
+96     0x00000000         operand[1] word1: no ext flags
+100    0x10000002         operand[2] word0: src R2
+104    0x00000000         operand[2] word1: no ext flags

Operand Decoding

Take operand[0] word0 = 0x10000003:

  0x10000003 in binary:
    bit 31     = 0       (no sign/negate)
    bits 28-30 = 001     (type = 1 = register operand)
    bits 20-27 = 00000000 (no modifiers)
    bits 0-19  = 00003   (register index = 3)

The register index resolves through the register descriptor array:

reg_desc = *(ptr*)(*(ptr*)(code_obj + 88) + 8 * 3);
// reg_desc + 64: reg_file_type = 2 (R / GPR file)
// reg_desc + 12: register number = 3

If the source operand were a constant-bank reference (e.g., FADD R3, R1, c[0][0x10]), operand[2] would have type=5 (symbol/constant) in word0 and the const-bank flag (0x1000000) set in word1. The scheduler distinguishes these two FADD variants for latency modeling: standard FADD gets throughput class 0x3D, while const-bank FADD gets 0x78.
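The word-0 bit layout above can be captured in a small decoder. This is a hedged sketch: the struct and function names are descriptive inventions, not recovered symbols, and only the fields documented above are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of an Ori operand word 0 (field names are descriptive,
 * not recovered symbol names). */
typedef struct {
    bool     neg;    /* bit 31: sign/negate flag                         */
    uint32_t type;   /* bits 28-30: 1 = register, 5 = symbol/constant    */
    uint32_t mods;   /* bits 20-27: modifier bits                        */
    uint32_t index;  /* bits 0-19: register/symbol index                 */
} OperandW0;

static OperandW0 decode_operand_w0(uint32_t w)
{
    OperandW0 op;
    op.neg   = (w >> 31) & 1;
    op.type  = (w >> 28) & 7;
    op.mods  = (w >> 20) & 0xFF;
    op.index = w & 0xFFFFF;
    return op;
}

/* Const-bank flag in operand word 1, per the FADD variant noted above. */
static bool is_const_bank(uint32_t w1) { return (w1 & 0x1000000) != 0; }
```

Decoding 0x10000003 with this sketch yields type 1 (register), index 3, no modifiers -- matching the bit breakdown shown above.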

Memory Space Classification

Memory operands carry a space type enum, resolved by sub_91C840 which maps the PTX-level space identifier to an internal category number. The full input enumeration (from complete decompilation of sub_91C840, confidence 0.98):

Input   PTX Space                 Internal Category  Notes
-----   ------------------------  -----------------  --------------------------------------
0       (none)                    --                 Unmapped, no memory space
1       Register / generic        16                 Register file address
2       Code / function           12                 Function address
3       (gap)                     --                 Unmapped
4       .shared                   1                  Shared memory
5       .const                    3                  Constant memory
6       .global                   11                 Global memory
7       .local                    2                  Local memory
8       (gap)                     --                 Unmapped
9       .local (variant)          2                  Same as 7, alternate encoding
10-11   (gap)                     --                 Unmapped
12      .param                    4                  Parameter memory
13      Generic (unqualified)     0                  Generic address space
14      .tex                      8                  Texture memory
15      .surf                     17                 Surface memory
16      Spill space               7                  Register spill/fill scratch
17      (gap)                     --                 Unmapped
18      (instruction-dependent)   varies             Sub-classifies by opcode at a2[1]
19      .uniform                  15                 Uniform (sm_75+)
20      .global (extended)        6                  Global, extended variant
21      .const (extended)         5                  Constant, extended store-to-global path
22      .const (extended, alt)    5                  Constant, alternate extended
23      .surf / tensor (ext)      18                 Surface/tensor extended (sm_90+)

Case 18 (0x12) uses a sub-switch on the opcode value at a2[1] to further classify: opcodes 7, 43, 45, 53 map to category 6 (global-like); opcode 111 and opcodes in the 183--199 range map to category 5 (constant-like); opcodes 54 and 189 map to category 9 (special).
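The table above can be restated as a switch, mirroring the recovered structure of sub_91C840. This is a reduced sketch, not the real decompilation: it covers only the unambiguous rows, returns -1 for unmapped inputs, and omits case 18's opcode sub-switch.

```c
#include <stdint.h>

/* Sketch of sub_91C840's space mapping, reduced to the unambiguous rows
 * of the table above. -1 stands in for "unmapped"; input 18 (which needs
 * the opcode sub-switch) is deliberately left out. */
static int classify_space(int input)
{
    switch (input) {
    case 1:  return 16;          /* register / generic           */
    case 2:  return 12;          /* code / function              */
    case 4:  return 1;           /* .shared                      */
    case 5:  return 3;           /* .const                       */
    case 6:  return 11;          /* .global                      */
    case 7:  case 9: return 2;   /* .local (two encodings)       */
    case 12: return 4;           /* .param                       */
    case 13: return 0;           /* generic (unqualified)        */
    case 14: return 8;           /* .tex                         */
    case 15: return 17;          /* .surf                        */
    case 16: return 7;           /* spill scratch                */
    case 19: return 15;          /* .uniform (sm_75+)            */
    case 20: return 6;           /* .global, extended            */
    case 21: case 22: return 5;  /* .const, extended             */
    case 23: return 18;          /* .surf / tensor, extended     */
    default: return -1;          /* 0, 3, 8, 10-11, 17 unmapped; 18 needs opcode */
    }
}
```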

The hot/cold classifier pair (sub_A9CDE0 / sub_A9CF90) consumes the internal category to partition instructions for scheduling. Hot memory operations (global loads/stores, certain atomics -- category 11) have long latencies and benefit from aggressive scheduling; cold operations (constant loads -- category 3) have shorter latencies and are treated more conservatively.

Key Functions

Address      Size   Conf  Role
-----------  -----  ----  ----------------------------------------------------------------
sub_A3B080   --     0.90  Code Object constructor; allocates ~1136-byte per-function IR container (vtable at 0x21EE238)
sub_A4B8F0   --     0.90  Register count formula: total_R = v5[159] + v5[102], instr_count = v5[335] - v5[341]
sub_A3A7E0   --     0.90  Stats emitter; prints per-function profile (instruction count, register count, occupancy, latency)
sub_BE21D0   1.4KB  0.92  CFG::dumpDOT; emits Graphviz DOT output for the control flow graph
sub_BDE150   9KB    0.92  CFG::computeRPO; explicit DFS stack, assigns reverse post-order numbers into Code Object +720 array
sub_BDE8B0   2KB    0.92  CFG::printEdges; FNV-1a lookup, prints "bix%d -> bix%d\n"
sub_BDEA50   4KB    0.92  CFG::dumpRPOAndBackedges; RPO traversal order + backedge debug dump
sub_BE0690   54KB   0.92  CFG::buildAndAnalyze; main CFG constructor -- predecessors, successors, RPO, loop detection
sub_BE2330   4KB    0.92  CFG::computeDominators; post-build dominator and loop analysis with bitvector operations
sub_BE7390   --     0.90  InstructionInfo constructor; initializes 322-entry ROT13 opcode name table at object offset +4184
sub_9F1A90   35KB   0.92  MercConverter pass; transforms PTX-derived opcodes into SM-specific SASS opcodes
sub_9ED2D0   25KB   0.90  Opcode switch inside MercConverter; dispatches per-opcode legalization
sub_91C840   --     0.98  Memory space classifier; maps PTX-level space identifiers (0-23) to internal category numbers
sub_A9CDE0   --     0.85  Hot/cold memory classifier (hot path); partitions instructions by memory category for scheduling
sub_A9CF90   --     0.85  Hot/cold memory classifier (cold path); complement of sub_A9CDE0
sub_A60B60   24KB   0.85  Register stat collector; enumerates ~25 register sub-classes (R, P, B, UR, UP, UB, Tensor/Acc, etc.)
sub_A55D80   --     0.95  Register allocator verifier; classifies 10 operand-kind problem categories for regalloc validation
sub_40848E   --     0.85  Operand extended-flag checker; tests (word1 & 0xFE000000) != 0 across all operands
sub_405769   --     0.85  Operand flag tester; tests 0x1000000 and 0x6000000 combinations in operand word 1
sub_404AD0   --     0.85  Peephole guard; verifies (word1 & 0xFE000000) == 0 before allowing peephole transforms
sub_B28E10   --     0.90  isRegOperand; ISel pattern matcher operand predicate
sub_B28E20   --     0.90  isPredOperand; ISel pattern matcher operand predicate
sub_B28E40   --     0.90  isImmOperand; ISel pattern matcher operand predicate
sub_B28E80   --     0.90  isConstOperand; ISel pattern matcher operand predicate
sub_B28E90   --     0.90  isUReg; ISel pattern matcher operand predicate
sub_B28E00   --     0.90  getRegClass; returns register class (1023 = wildcard, 1 = GPR)

Instructions & Opcodes

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page documents the Ori IR instruction representation: in-memory layout, opcode encoding, operand model, instruction flags, creation/iteration APIs, the master descriptor table, and opcode categories. All offsets are from ptxas v13.0.88 (37.7 MB stripped x86-64 ELF).

Instruction Object Layout

Every Ori instruction is a 296-byte C++ object allocated from the Code Object's arena. Instructions are linked into per-basic-block doubly-linked lists via pointers at offsets +0 and +8. The allocator at sub_7DD010 allocates exactly 296 bytes per instruction and zeroes the object before populating it.

Memory Layout (296 bytes)

Offset  Size  Type      Field              Description
------  ----  --------  -----------------  -----------------------------------------------------------
+0      8     ptr       prev               Previous instruction in BB linked list (nullptr for head)
+8      8     ptr       next               Next instruction in BB linked list (nullptr for tail)
+16     4     i32       id                 Unique instruction ID (monotonically increasing within function)
+20     4     i32       ref_count          Reference/use count (incremented by sub_7E6090)
+24     4     i32       bb_index           Basic block index (bix) this instruction belongs to
+28     4     u32       reserved_28        Reserved / padding
+32     4     u32       control_word       Scheduling control word (stall cycles, yield, etc.)
+36     4     u32       flags_36           Instruction flags (bits 19-21 = subtype, see below)
+40     8     ptr       sched_slot         Scheduling state pointer
+48     8     u64       flag_bits          Extended flag bits (bit 5 = volatile, bit 27 = reuse)
+56     8     ptr       def_instr          Defining instruction (for SSA def-use chains)
+64     8     ptr       reserved_64        Reserved / register class info
+72     4     u32       opcode             Full opcode word (lower 12 bits = base opcode, bits 12-13 = modifier)
+76     4     u32       opcode_aux         Auxiliary opcode data (sub-operation, comparison predicate)
+80     4     u32       operand_count      Total number of operands (destinations + sources)
+84     var   u32[N*2]  operands[]         Packed operand array (8 bytes per operand slot)
+88     4     u32       operands[0].extra  High word of first operand slot
+100    1     u8        type_flags         Data type / modifier flags (bits 0-2 = data type code)
+104    4     u32       reserved_104       Reserved
+112    8     ptr       use_chain          Use chain linked list head (for CSE)
+120    8     ptr       reserved_120       Reserved
+136    4     i32       reserved_136       Reserved
+160    8     ptr       enc_buf            Encoding buffer pointer (populated during code generation)
+168    8     ptr       reserved_168       Reserved
+184    4     u32       enc_mode           Encoding mode selector
+200    8     u64       imm_value          Immediate value (for instructions with constant operands)
+208    16    xmm       sched_params       Scheduling parameters (loaded via _mm_load_si128)
+240    4     u32       reserved_240       Reserved
+244    1     u8        reserved_244       Reserved
+248    8     i64       sentinel_248       Initialized to -1 (0xFFFFFFFFFFFFFFFF)
+256    8     i64       sentinel_256       Initialized to 0xFFFFFFFF
+264    8     i64       bb_ref             Basic block reference / block index storage
+272    8     i64       reserved_272       Reserved
+280    16    u128      reserved_280       Zeroed on creation

Linked-List Pointers

Instructions form a doubly-linked list within each basic block. The Code Object stores the global list head at offset +272 and tail at offset +280:

Code Object +272  -->  head instruction (prev = nullptr)
                            |
                            v  (+8 = next)
                       instruction 2
                            |
                            v
                       instruction 3
                            |
                            v  ...
Code Object +280  -->  tail instruction (next = nullptr)

The linked-list traversal pattern appears in hundreds of functions throughout ptxas:

// Forward iteration over all instructions
for (instr = *(ptr*)(code_obj + 272); instr != nullptr; instr = *(ptr*)(instr + 8)) {
    uint32_t opcode = *(uint32_t*)(instr + 72);
    uint32_t num_ops = *(uint32_t*)(instr + 80);
    // process instruction...
}

Opcode Encoding

The opcode field at offset +72 is a 32-bit word with a structured layout.

Opcode Word Format

 31              16  15  14  13  12  11            0
+------------------+---+---+---+---+---------------+
|    upper flags   |   |   | M | M |  base opcode  |
+------------------+---+---+---+---+---------------+
                            ^   ^
                            |   bit 12: modifier bit 0
                            bit 13: modifier bit 1

M = modifier bits (stripped by the 0xFFFFCFFF mask)
base opcode = 12-bit instruction class identifier (0-4095)

The mask 0xFFFFCFFF (clear bits 12-13) is used throughout InstructionClassifier, MBarrierDetector, OperandLowering, and many other subsystems to extract the base instruction class, stripping sub-operation modifier bits:

uint32_t raw_opcode = *(uint32_t*)(instr + 72);
uint32_t base_opcode = raw_opcode & 0xFFFFCFFF;

Additionally, bit 11 is sometimes used in operand count calculations:

// Effective operand count adjustment (appears in 50+ functions)
int adj = (*(uint32_t*)(instr + 72) >> 11) & 2;  // 0 or 2
int dst_count = *(uint32_t*)(instr + 80) - adj;

Canonical Opcode Reference

The opcode value stored at instruction+72 is itself the index into the ROT13 name table at InstructionInfo+4184. There is a single numbering system -- the ROT13 table index IS the runtime opcode. This was verified by tracing sub_BEBAC0 (getName), which computes InstructionInfo + 4184 + 16 * opcode with no remapping.

The following table lists frequently-referenced opcodes from decompiled code, with their canonical SASS mnemonic names from the ROT13 table. Each opcode appears in 10+ decompiled functions reading *(instr+72).

Base Opcode  SASS Mnemonic  Category                      Reference Count
-----------  -------------  ----------------------------  ------------------------
0            ERRBAR         Error barrier (internal)      Sentinel in scheduler
1            IMAD           Integer multiply-add          100+ functions
7            ISETP          Integer set-predicate         sub_7E0030 switch
18           FSETP          FP set-predicate              sub_7E0030 switch
19           MOV            Move                          80+ functions
23           PLOP3          Predicate 3-input logic       sub_7E0030 case 23
25           NOP            No-op                         Scheduling, peephole
52           AL2P_INDEXED   BB boundary pseudo-opcode     sub_6820B0, 100+
54           BMOV_B         Barrier move (B)              sub_7E6090 case 54
61           BAR            Barrier synchronization       Sync passes
67           BRA            Branch                        sub_74ED70, CFG builders
71           CALL           Function call                 sub_7B81D0, ABI, spill
72           RET            Return                        sub_74ED70 (with 67)
77           EXIT           Exit thread                   sub_7E4150, CFG sinks
93           OUT_FINAL      Tessellation output (final)   sub_734AD0, 25+
94           LDS            Load shared                   sub_7E0650 case 94
95           STS            Store shared                  sub_7E0030, 40+
96           LDG            Load global                   Memory analysis
97           STG            Store global                  sub_6820B0, 30+
102          ATOM           Atomic                        Encoding switch
104          RED            Reduction                     Encoding switch
111          MEMBAR         Memory barrier                Sync passes
119          SHFL           Warp shuffle                  sub_7E0030 case 119
122          DFMA           Double FP fused mul-add       sub_7E0030 case 122
130          HSET2          Half-precision set (packed)   20+ functions
135          INTRINSIC      Compiler intrinsic (pseudo)   ISel, lowering
137          SM73_FIRST     SM gen boundary (real instr)  Strength reduction
183          sm_82+ opcode  Extended mem operation        & 0xFFFFCFFF mask

Important caveats:

  1. Opcode 52 (AL2P_INDEXED in name table) is universally used as a basic block delimiter in 100+ decompiled functions. The SASS mnemonic name may be vestigial; no decompiled code uses it for attribute-to-patch operations.

  2. SM boundary markers (136=SM70_LAST, 137=SM73_FIRST, etc.) have marker names in the ROT13 table but are valid runtime opcodes. Instructions with these opcode values exist in the IR and are processed by optimization passes (e.g., strength reduction operates on opcode 137).

  3. Earlier versions of this page had a "Selected Opcode Values" table that assigned incorrect SASS mnemonics based on behavioral inference rather than the ROT13 name table. Those labels (93=BRA/CALL, 95=EXIT, 97=CALL/label, 130=MOV) were wrong. The correct labels are: 93=OUT_FINAL, 95=STS, 97=STG, 130=HSET2. Branch/call/exit are at 67=BRA, 71=CALL, 77=EXIT.

Opcode Ranges by SM Generation

The ROT13 opcode name table in sub_BE7390 (InstructionInfo constructor) includes explicit SM generation boundary markers:

Marker Opcode  Decoded Name  Meaning
-------------  ------------  -------------------------------------
136            SM70_LAST     Last sm_70 (Volta) opcode
137            SM73_FIRST    First sm_73 (Volta+) opcode
171            SM73_LAST     Last sm_73 opcode
172            SM82_FIRST    First sm_82 (Ampere) opcode
193            SM82_LAST     Last sm_82 opcode
194            SM86_FIRST    First sm_86 (Ampere+) opcode
199            SM86_LAST     Last sm_86 opcode
200            SM89_FIRST    First sm_89 (Ada) opcode
205            SM89_LAST     Last sm_89 opcode
206            SM90_FIRST    First sm_90 (Hopper) opcode
252            SM90_LAST     Last sm_90 opcode
253            SM100_FIRST   First sm_100 (Blackwell) opcode
280            SM100_LAST    Last sm_100 opcode
281            SM104_FIRST   First sm_104 (Blackwell Ultra) opcode
320            SM104_LAST    Last sm_104 opcode
321            LAST          Sentinel (end of table)

This gives a clear partitioning: opcodes 0-136 are the base sm_70+ ISA, 137-171 extend to sm_73, and so on up through sm_104. Each SM generation only adds opcodes; no base opcodes are removed.
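Because the ISA is strictly additive, the boundary markers induce a simple lookup from opcode index to the first SM generation that supports it. This is a minimal sketch built from the marker table above; the function name is an invention, not a recovered symbol.

```c
/* Minimal sketch mapping an opcode index to the first SM generation whose
 * ISA contains it, using the boundary markers recovered from sub_BE7390.
 * Returns the SM number (70, 73, 82, ...), or -1 past the table end. */
static int first_sm_for_opcode(int opcode)
{
    if (opcode <= 136) return 70;   /* base ISA, through SM70_LAST   */
    if (opcode <= 171) return 73;   /* SM73_FIRST .. SM73_LAST       */
    if (opcode <= 193) return 82;   /* SM82 range                    */
    if (opcode <= 199) return 86;   /* SM86 range                    */
    if (opcode <= 205) return 89;   /* SM89 range                    */
    if (opcode <= 252) return 90;   /* SM90 range                    */
    if (opcode <= 280) return 100;  /* SM100 range                   */
    if (opcode <= 320) return 104;  /* SM104 range                   */
    return -1;                      /* 321 = LAST sentinel           */
}
```

For example, FADD (opcode 12) falls in the base sm_70+ ISA, while opcode 137 (SM73_FIRST) requires sm_73.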

Operand Model

Packed Operand Encoding

Each operand occupies 8 bytes (two 32-bit words) in the operand array starting at instruction offset +84. The first word carries the type, modifier bits, and index. The second word carries additional data (extended flags, immediate bits, etc.).

Word 0 (at instr + 84 + 8*i):

 31  30  29  28  27  26  25  24  23  22  21  20  19                  0
+---+---+---+---+---+---+---+---+---+---+---+---+---------------------+
| S |  type(3) |       modifier (8 bits)        |    index (20 bits)   |
+---+---+---+---+---+---+---+---+---+---+---+---+---------------------+
  ^   ^                                           ^
  |   bits 28-30: operand type                    bits 0-19: register/symbol index
  bit 31: sign/negative flag (S)

Word 1 (at instr + 88 + 8*i):

 31                                                                  0
+--------------------------------------------------------------------+
|               extended data / immediate bits / flags                |
+--------------------------------------------------------------------+

Operand Type Field (bits 28-30)

Value  Type                Index Meaning
-----  ------------------  ----------------------------------------------------------
0      Unused / padding    --
1      Register            Index into *(code_obj+88) + 8*index register descriptor array
2      Predicate register  Index into predicate register file
3      Uniform register    UR file index
4      Address/offset      Memory offset value
5      Symbol/constant     Index into *(code_obj+152) symbol table
6      Predicate guard     Guard predicate controlling conditional execution
7      Immediate           Encoded immediate value

Operand Extraction Pattern

This exact extraction pattern appears in 50+ functions across scheduling, regalloc, encoding, and optimization passes:

uint32_t operand_word = *(uint32_t*)(instr + 84 + 8 * i);

int  type   = (operand_word >> 28) & 7;     // bits 28-30
int  index  = operand_word & 0xFFFFF;        // bits 0-19 (also seen as 0xFFFFFF)
int  mods   = (operand_word >> 20) & 0xFF;   // bits 20-27
bool is_neg = (operand_word >> 31) & 1;      // bit 31

// Register operand check (most common pattern)
if (type == 1) {
    reg_descriptor = *(ptr*)(*(ptr*)(code_obj + 88) + 8 * index);
    reg_file_type  = *(uint32_t*)(reg_descriptor + 64);
    reg_number     = *(uint32_t*)(reg_descriptor + 12);
}

Some functions use a 24-bit index mask (& 0xFFFFFF) instead of 20-bit, packing additional modifier bits into the upper nibble of the index field.

Operand Classification Predicates

Small predicate functions at 0xB28E00-0xB28E90 provide the instruction selection interface for operand queries:

Address     Function        Logic
----------  --------------  ------------------------------------------------
sub_B28E00  getRegClass     Returns register class; 1023 = wildcard, 1 = GPR
sub_B28E10  isRegOperand    ((word >> 28) & 7) == 1
sub_B28E20  isPredOperand   ((word >> 28) & 7) == 2
sub_B28E40  isImmOperand    ((word >> 28) & 7) == 7
sub_B28E80  isConstOperand  ((word >> 28) & 7) == 5
sub_B28E90  isUReg          ((word >> 28) & 7) == 3

Destination vs. Source Operand Split

Destinations come first in the operand array, followed by sources. The boundary is computed from the operand_count field and the modifier bits in the opcode:

uint32_t total_ops = *(uint32_t*)(instr + 80);
int adj = (*(uint32_t*)(instr + 72) >> 11) & 2;  // 0 or 2
int first_src_index = total_ops - adj;             // or total_ops + ~adj + 1
// Destinations: operands[0 .. first_src_index-1]
// Sources:      operands[first_src_index .. total_ops-1]

For most instructions, adj = 0 and the split point equals operand_count. Instructions with bit 11 set in the opcode word shift the split by 2, indicating 2 extra destination operands (e.g., predicated compare-and-swap operations that write both a result register and a predicate).
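The split computation above can be restated as a runnable helper. This mirrors the recovered expression verbatim (note that `(word >> 11) & 2` actually tests bit 12 of the raw opcode word, since the shifted bit 11 lands in bit position 0); the function name is an invention.

```c
#include <stdint.h>

/* Runnable restatement of the destination/source split: given the raw
 * opcode word at +72 and the operand count at +80, return the index of
 * the first source operand, per the recovered formula total_ops - adj. */
static uint32_t first_src_index(uint32_t opcode_word, uint32_t operand_count)
{
    uint32_t adj = (opcode_word >> 11) & 2;  /* 0, or 2 when bit 12 of the word is set */
    return operand_count - adj;              /* same as operand_count + ~adj + 1 */
}
```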

Predicate Guard Operand

The last operand (at index operand_count - 1) can be a predicate guard (type 6) controlling conditional execution. The guard predicate check in sub_7E0E80:

bool has_pred_guard(instr) {
    int adj      = (*(uint32_t*)(instr + 72) >> 11) & 2;
    int last_idx = *(uint32_t*)(instr + 80) - adj - 1;   // + ~adj is the same as - adj - 1
    uint32_t last_op = *(uint32_t*)(instr + 84 + 8 * last_idx);
    return ((last_op & 0xF) - 2u) < 7u;  // unsigned wrap: true only for type bits 2..8 (low nibble)
}

Instruction Flags and Modifiers

Opcode Modifier Bits (offset +72, bits 12-13)

Bits 12-13 of the opcode word encode sub-operation modifiers. The 0xFFFFCFFF mask strips them to yield the base opcode. Common uses:

Modifier  Meaning
--------  --------------------------------
0         Default operation
1         .HI or alternate form
2         .WIDE or extended form
3         Reserved / architecture-specific

Extended Flag Bits (offset +48)

The 64-bit flag word at offset +48 accumulates flags throughout the compilation pipeline:

Bit  Hex Mask     Flag             Set By
---  -----------  ---------------  ----------------------------
6    0x40         Live-out         sub_7E6090 (def-use builder)
16   0x10000      Has single def   sub_7E6090
25   0x2000000    Has prior use    sub_7E6090
27   0x8000000    Same-block def   sub_7E6090
33   0x200000000  Source-only ref  sub_7E6090

Control Word (offset +32)

The control word encodes scheduling metadata added by the instruction scheduler. It is initialized to zero and populated during scheduling (phases ~150+):

  • Stall cycles (how many cycles to wait before issuing the next instruction)
  • Yield hint (whether the warp scheduler should yield after this instruction)
  • Dependency barrier assignments
  • Reuse flags (register reuse hints for the hardware register file cache)

The stall cycle field is checked during scoreboard computation at sub_A08910. The control word format is the same as the SASS encoding control field.

Data Type Flags (offset +100)

The byte at offset +100 encodes the instruction's data type in its low 3 bits:

uint8_t type_code = *(uint8_t*)(instr + 100) & 7;

These correspond to SASS data type suffixes (.F32, .F64, .U32, .S32, .F16, .B32, etc.). The exact encoding is architecture-specific and queried through the InstructionInfo descriptor table.

ROT13 Opcode Name Table

All SASS opcode mnemonic strings in the binary are ROT13-encoded. This is lightweight obfuscation, not a security measure. The InstructionInfo constructor at sub_BE7390 populates a name table at object offset +4184 with 16-byte {char* name, uint64_t length} entries.
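Since the encoding is plain ROT13 (an involution, so the same routine encodes and decodes), recovering mnemonics is trivial. A minimal sketch; digits and underscores pass through unchanged, matching entries like PF2E_32 -> CS2R_32.

```c
#include <string.h>

/* Decode (or encode -- ROT13 is its own inverse) a name-table entry.
 * `out` must hold at least strlen(in) + 1 bytes. */
static void rot13(const char *in, char *out)
{
    for (; *in; ++in, ++out) {
        if (*in >= 'A' && *in <= 'Z')
            *out = 'A' + (*in - 'A' + 13) % 26;
        else if (*in >= 'a' && *in <= 'z')
            *out = 'a' + (*in - 'a' + 13) % 26;
        else
            *out = *in;   /* digits, '_', etc. */
    }
    *out = '\0';
}
```

For example, rot13("SNQQ") yields "FADD" and rot13("YNFG") yields "LAST", matching the table entries below.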

Table Structure

InstructionInfo object:
  +0       vtable pointer (off_233ADC0)
  +8       parent pointer
  ...
  +4184    opcode_names[0].name_ptr    -> "REEONE"   (ROT13 of ERRBAR)
  +4192    opcode_names[0].length      -> 6
  +4200    opcode_names[1].name_ptr    -> "VZNQ"     (ROT13 of IMAD)
  +4208    opcode_names[1].length      -> 4
  ...
  +9320    opcode_names[321].name_ptr  -> "YNFG"     (ROT13 of LAST)
  +9328    opcode_names[321].length    -> 4
  +9336    encoding_category_map[0..321]  (322 x int32, from unk_22B2320)
  +10624   (end of encoding category map)

Total: 322 named opcodes (indices 0-321). The 0x508 bytes at +9336 are not additional name entries -- they are a 322-element int32 array mapping each opcode index to an encoding category number (see Encoding Category Map below).

Full Decoded Opcode Table (Base ISA, sm_70+)

Idx  ROT13          SASS           Category
---  -------------  -------------  -------------------------------------
0    REEONE         ERRBAR         Error barrier (internal)
1    VZNQ           IMAD           Integer multiply-add
2    VZNQ_JVQR      IMAD_WIDE      Integer multiply-add wide
3    VNQQ3          IADD3          3-input integer add
4    OZFX           BMSK           Bit mask
5    FTKG           SGXT           Sign extend
6    YBC3           LOP3           3-input logic
7    VFRGC          ISETP          Integer set-predicate
8    VNOF           IABS           Integer absolute value
9    YRN            LEA            Load effective address
10   FUS            SHF            Funnel shift
11   SSZN           FFMA           FP fused multiply-add
12   SNQQ           FADD           FP add
13   SZHY           FMUL           FP multiply
14   SZAZK          FMNMX          FP min/max
15   SFJMNQQ        FSWZADD        FP swizzle add
16   SFRG           FSET           FP set
17   SFRY           FSEL           FP select
18   SFRGC          FSETP          FP set-predicate
19   ZBI            MOV            Move
20   FRY            SEL            Select
21   C2E            P2R            Predicate to register
22   E2C            R2P            Register to predicate
23   CYBC3          PLOP3          Predicate 3-input logic
24   CEZG           PRMT           Byte permute
25   ABC            NOP            No-op
26   IBGR           VOTE           Warp vote
27   PF2E_32        CS2R_32        Control/status to register (32-bit)
28   PF2E_64        CS2R_64        Control/status to register (64-bit)
29   CZGEVT         PMTRIG         Performance monitor trigger
30   CFZGRFG        PSMTEST        PSM test
31   INOFQVSS       VABSDIFF       Vector absolute difference
32   INOFQVSS4      VABSDIFF4      Vector absolute difference (4-way)
33   VQC            IDP            Integer dot product
34   VQR            IDE            Integer dot expand
35   V2V            I2I            Integer to integer conversion
36   V2VC           I2IP           Integer to integer (packed)
37   VZAZK          IMNMX          Integer min/max
38   CBCP           POPC           Population count
39   SYB            FLO            Find leading one
40   SPUX           FCHK           FP check (NaN/Inf)
41   VCN            IPA            Interpolate attribute
42   ZHSH           MUFU           Multi-function unit (SFU)
43   S2S            F2F            Float to float conversion
44   S2S_K          F2F_X          Float to float (extended)
45   S2V            F2I            Float to integer
46   S2V_K          F2I_X          Float to integer (extended)
47   V2S            I2F            Integer to float
48   V2S_K          I2F_X          Integer to float (extended)
49   SEAQ           FRND           FP round
50   SEAQ_K         FRND_X         FP round (extended)
51   NY2C           AL2P           Attribute to patch
52   NY2C_VAQRKRQ   AL2P_INDEXED   Attribute to patch (indexed)
53   OERI           BREV           Bit reverse
54   OZBI_O         BMOV_B         Barrier move (B)
55   OZBI_E         BMOV_R         Barrier move (R)
56   OZBI           BMOV           Barrier move
57   F2E            S2R            Special register to register
58   O2E            B2R            Barrier to register
59   E2O            R2B            Register to barrier
60   YRCP           LEPC           Load effective PC
61   ONE            BAR            Barrier synchronization
62   ONE_VAQRKRQ    BAR_INDEXED    Barrier (indexed)
63   FRGPGNVQ       SETCTAID       Set CTA ID
64   FRGYZRZONFR    SETLMEMBASE    Set local memory base
65   TRGYZRZONFR    GETLMEMBASE    Get local memory base
66   QRCONE         DEPBAR         Dependency barrier
67   OEN            BRA            Branch
68   OEK            BRX            Branch indirect
69   WZC            JMP            Jump
70   WZK            JMX            Jump indirect
71   PNYY           CALL           Function call
72   ERG            RET            Return
73   OFFL           BSSY           Branch sync stack push
74   OERNX          BREAK          Break
75   OCG            BPT            Breakpoint trap
76   XVYY           KILL           Kill thread
77   RKVG           EXIT           Exit
78   EGG            RTT            Return to trap handler
79   OFLAP          BSYNC          Branch sync
80   ZNGPU          MATCH          Warp match
81   ANABFYRRC      NANOSLEEP      Nanosleep
82   ANABGENC       NANOTRAP       Nano trap
83   GRK            TEX            Texture fetch
84   GYQ            TLD            Texture load
85   GYQ4           TLD4           Texture load 4
86   GZZY           TMML           Texture mip-map level
87   GKQ            TXD            Texture fetch with derivatives
88   GKD            TXQ            Texture query
89   YQP            LDC            Load constant
90   NYQ            ALD            Attribute load
91   NFG            AST            Attribute store
92   BHG            OUT            Tessellation output
93   BHG_SVANY      OUT_FINAL      Tessellation output (final)
94   YQF            LDS            Load shared
95   FGF            STS            Store shared
96   YQT            LDG            Load global
97   FGT            STG            Store global
98   YQY            LDL            Load local
99   FGY            STL            Store local
100  YQ             LD             Load (generic)
101  FG             ST             Store (generic)
102  NGBZ           ATOM           Atomic
103  NGBZT          ATOMG          Atomic global
104  ERQ            RED            Reduction
105  NGBZF          ATOMS          Atomic shared
106  DFCP           QSPC           Query space
107  PPGY_AB_FO     CCTL_NO_SB     Cache control (no scoreboard)
108  PPGY           CCTL           Cache control
109  PPGYY          CCTLL          Cache control (L2)
110  PPGYG          CCTLT          Cache control (texture)
111  ZRZONE         MEMBAR         Memory barrier
112  FHYQ           SULD           Surface load
113  FHFG           SUST           Surface store
114  FHNGBZ         SUATOM         Surface atomic
115  FHERQ          SURED          Surface reduction
116  CVKYQ          PIXLD          Pixel load
117  VFOREQ         ISBERD         Indexed set binding for redirect
118  VFORJE         ISBEWR         Indexed set binding for write
119  FUSY           SHFL           Warp shuffle
120  JNECFLAP       WARPSYNC       Warp synchronize
121  ZVRYQ          MYELD          Yield (internal)
122  QSZN           DFMA           Double FP fused multiply-add
123  QNQQ           DADD           Double FP add
124  QZHY           DMUL           Double FP multiply
125  QFRGC          DSETP          Double FP set-predicate
126  UNQQ2          HADD2          Half-precision add (packed)
127  UNQQ2_S32      HADD2_F32      Half-precision add (F32 accum)
128  USZN2          HFMA2          Half FP fused multiply-add (packed)
129  UZHY2          HMUL2          Half-precision multiply (packed)
130  UFRG2          HSET2          Half-precision set (packed)
131  UFRGC2         HSETP2         Half-precision set-predicate (packed)
132  UZZN_16        HMMA_16        Half MMA (16-wide)
133  UZZN_32        HMMA_32        Half MMA (32-wide)
134  VZZN           IMMA           Integer MMA
135  VAGEVAFVP      INTRINSIC      Compiler intrinsic (pseudo)

Opcode Categories

The 322 named opcodes group into these functional categories:

Integer ALU (18 opcodes): IMAD, IMAD_WIDE, IADD3, IADD, IMNMX, IABS, BMSK, SGXT, LOP3, ISETP, LEA, SHF, POPC, FLO, BREV, IDP, IDE, PRMT

FP32 ALU (9 opcodes): FFMA, FADD, FMUL, FMNMX, FSWZADD, FSET, FSEL, FSETP, FCHK

FP64 ALU (4 opcodes): DFMA, DADD, DMUL, DSETP

FP16 Packed (6 opcodes): HADD2, HADD2_F32, HFMA2, HMUL2, HSET2, HSETP2

Conversion (12 opcodes): F2F, F2I, I2F, I2I, F2FP, F2IP, I2FP, I2IP, FRND, and their _X extended variants

Data Movement (6 opcodes): MOV, UMOV, MOVM, SEL, USEL, PRMT

Special Function (1 opcode): MUFU (sin, cos, rsqrt, rcp, etc.)

Predicate (4 opcodes): PLOP3, P2R, R2P, VOTE

Memory -- Global (4 opcodes): LDG, STG, LD, ST

Memory -- Shared (4 opcodes): LDS, STS, LDSM, STSM

Memory -- Local (2 opcodes): LDL, STL

Memory -- Constant (2 opcodes): LDC, LDCU

Atomic/Reduction (6 opcodes): ATOM, ATOMG, ATOMS, RED, REDUX, REDAS

Texture (6 opcodes): TEX, TLD, TLD4, TMML, TXD, TXQ

Surface (4 opcodes): SULD, SUST, SUATOM, SURED

Control Flow (12 opcodes): BRA, BRX, JMP, JMX, CALL, RET, EXIT, BREAK, BSSY, BSYNC, KILL, BPT

Synchronization (6 opcodes): BAR, BAR_INDEXED, DEPBAR, MEMBAR, WARPSYNC, NANOSLEEP

Tensor Core / MMA (25+ opcodes): HMMA_*, IMMA_*, BMMA_*, DMMA, GMMA, QMMA_*, OMMA_*, and their sparse (_SP_) variants

Uniform Register (30+ opcodes): All U-prefixed variants (UIMAD, UIADD3, UMOV, USEL, ULOP3, ULEPC, etc.) that operate on uniform registers shared across the warp

Blackwell sm_100+ (28 opcodes): ACQBLK, CGABAR_*, CREATEPOLICY, ELECT, ENDCOLLECTIVE, FENCE_G/S/T, LDTM, STTM, MEMSET, ACQSHMINIT, UTCBAR_*, UTCMMA_*, UTCSHIFT_*, UTCCP_*, TCATOMSWS, TCLDSWS, TCSTSWS, VIRTCOUNT, UGETNEXTWORKID, FADD2, FFMA2, FMUL2, FMNMX3, CREDUX, QFMA4, QADD4, QMUL4, WARPGROUP

Instruction Descriptor Table

The InstructionInfo class at sub_BE7390 (inheriting from the base class at sub_738E20) provides a per-opcode descriptor table consulted by every pass in the compiler. The derived constructor calls the base class constructor sub_738E20, then populates the ROT13 name table, allocates the per-opcode descriptor block, and queries SM-specific configuration knobs. The resulting object is ~11,240 bytes inline plus a 10,288-byte dynamically allocated descriptor block.

Construction Sequence

sub_BE7390(this, parent_context) executes in this order:

  1. Base class init (sub_738E20): sets vtable, stores parent pointer, allocates the opcode-to-descriptor mapping array (512 bytes, 64 QWORD slots), zeroes all four descriptor data areas (+744..+3624), queries SM version and stores at +3728, allocates per-opcode property array (4 * sm_opcode_count bytes at +4112), allocates a reference-counted descriptor block (24 bytes at +4136), queries knobs 812/867/822/493 for configuration. Sets +4132 = 8 and +4176 = 0 (init incomplete).
  2. Override vtable: +0 = off_233ADC0 (derived vtable).
  3. Populate ROT13 name table: 322 inline entries (indices 0-321) at offsets +4184..+9328, each 16 bytes ({char* name_ptr, u64 length}).
  4. Bulk-copy encoding category map: qmemcpy(+9336, unk_22B2320, 0x508) -- 322-entry int32 array (1288 bytes) mapping opcode index to encoding category number. The source table varies by arch constructor (see below).
  5. Initialize post-table fields: zero offsets +10624..+10680.
  6. Store sentinels: +11200 = -2, +11224 = 0xFFFFFFFF.
  7. Set constants: +4048 = 2, +4056 = 10, +3733 = 1.
  8. Descriptor defaults (sub_1370BD0): populates scheduling templates and operand defaults at +192..+704.
  9. Override property mode: +4132 = 7 (overwriting base class's 8).
  10. Allocate descriptor block: 10,288 bytes via the MemoryManager, partitioned into 3 sections.
  11. Query SM-specific config: reads parent->+1664->+72->+55080 and stores result at +10648.

InstructionInfo Object Layout

The complete byte-level field map, derived from sub_BE7390 (derived constructor), sub_738E20 (base constructor), and sub_1370BD0 (descriptor defaults init).

Region 1: Vtable, Parent, and Core Identity (+0 to +91)

Offset  Size  Type  Field           Description
------  ----  ----  --------------  --------------------------------------------------------
+0      8     ptr   vtable          off_233ADC0 (derived); base chain: off_21DB6E8 / off_21B4790
+8      8     ptr   parent_ctx      Parent compilation context pointer
+44     8     u64   operand_counts  Packed pair 0x100000001: lo=1 dst, hi=1 src (base default)

Region 2: Scheduling Defaults and Flags (+92 to +159)

Offset  Size  Type   Field           Description
------  ----  -----  --------------  --------------------------------------------------------
+92     16    xmm    sched_defaults  Scheduling parameter defaults (loaded from xmmword_2029FE0)
+108    4     i32    desc_idx_a      Descriptor index sentinel = 0
+112    4     i32    desc_idx_b      Descriptor index sentinel = -1 (0xFFFFFFFF)
+116    1     u8     flag_116        = 0
+117    1     u8     flag_117        = 0
+118    1     u8     flag_118        = 1
+120    3     u8[3]  flags_120       All = 0
+136    4     i32    sentinel_136    = -1 (0xFFFFFFFF)
+148    8     u64    reserved_148    = 0

Region 3: Opcode-to-Descriptor Mapping (+160 to +191)

Offset  Size  Type  Field              Description
------  ----  ----  -----------------  -----------------------------------------------------
+160    8     ptr   mapping_allocator  MemoryManager used for mapping array
+168    8     ptr   mapping_array      Dynamically allocated QWORD array (initial: 512 bytes, 64 entries)
+176    4     i32   mapping_count      Current entry count (initially 63)
+180    4     i32   mapping_capacity   Current capacity (initially 64)
+184    8     u64   packed_flags       = 0x4000000000 (bit 38: descriptor config flag)

Region 4: Descriptor Defaults (+192 to +704, set by sub_1370BD0)

Offset  Size  Type  Field                Description
------  ----  ----  -------------------  ------------------------------------------
+192    8     u64   default_operand_cfg  Packed 0x200000002: lo=2, hi=2
+200    4     u32   default_dst_count    = 4
+208    4     u32   default_modifier     = 2
+216    16    xmm   sched_template_a     Scheduling template (from xmmword_233B1E0)
+240    4     u32   default_operand_w    = 4
+448    8     u64   section_marker_448   = 1
+456    4     u32   section_id_456       = 2
+464    4     u32   section_id_464       = 3
+472    16    xmm   sched_template_b     Scheduling template (from xmmword_233B1F0)
+496    4     u32   default_value_496    = 5

Gaps within +204..+447 and +500..+695 are zero-initialized by sub_1370BD0.

Region 5: Primary Descriptor Data (+744 to +2155)

Offset        Size  Type  Field            Description
------------  ----  ----  ---------------  -----------------------------------------------
+744          8     u64   desc_data_start  Primary area header = 0
+752..+2155   1404  u8[]  desc_data        Zero-initialized per-opcode descriptor records

Region 6: Secondary Descriptor Area (+2156 to +2211)

Offset        Size  Type  Field             Description
------------  ----  ----  ----------------  ----------------
+2156         8     u64   secondary_header  = 0
+2164..+2211  48    u8[]  secondary_data    Zero-initialized

Region 7: Tertiary Descriptor Area (+2212 to +3623)

Offset        Size  Type  Field               Description
------------  ----  ----  ------------------  --------------------------------
+2212         8     u64   tertiary_header     = 0
+2220..+3623  1404  u8[]  tertiary_data       Zero-initialized
+2372         4     u32   desc_record_type_a  = 4 (set by derived constructor)
+2400         4     u32   desc_record_type_b  = 4 (set by derived constructor)

Region 8: Quaternary Descriptor Area and Target Config (+3624 to +3735)

Offset        Size  Type     Field              Description
------------  ----  -------  -----------------  ---------------------------------------------
+3624         8     u64      quaternary_header  = 0
+3640..+3664  32    u64[4]   quat_ptrs          All = 0
+3672         1     u8       is_sm75_plus       = 1 if SM ID >= 16389, else 0
+3673         1     u8       target_flag_bit6   Bit 6 of *(target+1080)
+3674         1     u8       target_flag_bit7   Bit 7 of *(target+1080)
+3675..+3682  8     u8[8]    zero_pad           All = 0
+3684         32    u128[2]  zero_pad_3684      = 0
+3716..+3717  2     u8[2]    flags_3716         = 0
+3720         4     u32      value_3720         = 0
+3724         1     u8       flag_3724          = 1
+3725         1     u8       flag_3725          = 0
+3728         4     u32      sm_opcode_count    SM version / total opcode count from arch query
+3732         1     u8       knob_812_flag      Knob 812 derived flag
+3733         1     u8       derived_flag       = 1 (set by derived constructor; base leaves at 0)

Region 9: Scheduling Configuration (+4016 to +4111)

Offset        Size  Type    Field            Description
------------  ----  ------  ---------------  -----------------------------------------------
+4016         16    u128    sched_config_a   = 0
+4032         8     u64     sched_config_b   = 0
+4040         16    xmm     sched_constants  Loaded from xmmword_21B4EE0
+4048         4     u32     constant_2       = 2 (derived overrides base default 0)
+4056         4     u32     constant_10      = 10 (derived overrides base default 0x7FFFFFFF)
+4060..+4064  8     u32[2]  zero_pad         = 0
+4072         8     u64     sched_ptr        = 0
+4080         8     u64     sched_ext        = 0
+4088         1     u8      flag_4088        = 0
+4089         1     u8      knob_867_flag    = 1 if knob absent; = (knob_value == 1) otherwise
+4090         1     u8      flag_4090        = 0
+4092         4     u32     knob_822_value   Default 7; overridden by knob 822
+4096         4     u32     knob_493_value   Default 5; overridden by knob 493

Region 10: Per-Opcode Property Array (+4112 to +4183)

Offset        Size  Type    Field               Description
------------  ----  ------  ------------------  ------------------------------------------------
+4112         8     ptr     property_array      Allocated: 4 * sm_opcode_count bytes; 4 bytes per opcode
+4120         4     u32     property_count      = 4 * !hasExtendedPredicates (0 or 4)
+4124         4     u32     property_aux        = 0
+4128         1     u8      property_init_flag  = 1
+4132         4     u32     property_mode       Base sets 8, derived overwrites to 7
+4136         8     ptr     ref_counted_block   24-byte block: [refcount=2, data=0, allocator_ptr]
+4144..+4160  24    u64[3]  rc_aux              All = 0
+4176         1     u8      init_complete       = 0 initially; set to 1 after full initialization

Region 11: ROT13 Opcode Name Table (+4184 to +10623)

| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +4184 | 5152 | struct[322] | opcode_names[0..321] | 322 inline entries, each 16 bytes: {char* name, u64 len} |
| +9336 | 1288 | int32[322] | encoding_category_map[0..321] | Per-opcode encoding category; bulk-copied from arch-specific static table (see below) |

Total: 322 named opcodes. Index N name is at offset 4184 + 16*N. The getName accessor at sub_BEBAC0 computes this + 4184 + 16 * opcode directly. Encoding category for opcode N is at +9336 + 4*N.
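
A minimal sketch of this accessor arithmetic (the offsets 4184/9336 and the {char*, u64} entry layout come from the recovered binary; the struct and function names here are illustrative, not recovered symbols):

```c
#include <stdint.h>
#include <string.h>

/* Inline name-table entry: {pointer, length}, 16 bytes. */
typedef struct { const char *name; uint64_t len; } NameEntry;

/* Mirrors the getName accessor (sub_BEBAC0): base + 4184 + 16 * opcode. */
static const NameEntry *get_name(const unsigned char *instr_info, int opcode) {
    return (const NameEntry *)(instr_info + 4184 + 16 * opcode);
}

/* Encoding category for opcode N: int32 at base + 9336 + 4 * N. */
static int32_t get_encoding_category(const unsigned char *instr_info, int opcode) {
    int32_t cat;
    memcpy(&cat, instr_info + 9336 + 4 * opcode, sizeof cat);
    return cat;
}
```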

Encoding Category Map

The 1288-byte block at +9336 is a 322-element int32 array that maps each opcode index to an encoding category number. The SASS mnemonic lookup function (sub_1377C60) uses this to resolve a (mnemonic, arch) pair to a binary encoding format descriptor.

Arch-specific source tables:

| Constructor | Source Table | Content |
|---|---|---|
| sub_7A5D10 (base) | unk_21C0E00 | Identity map: map[i] = i for all i in 0..321 |
| sub_7C5410 | unk_21C3600 | Arch-remapped: some entries differ from identity |
| sub_BE7390 | unk_22B2320 | Arch-remapped: some entries differ from identity |

The base constructor uses a pure identity map where opcode N maps to encoding category N. Arch-specific constructors override selected entries so the same mnemonic at different opcode indices can map to different encoding formats. For example, DMMA at opcode index 180 maps to encoding category 434 on one arch, while DMMA at opcode index 215 maps to encoding category 515 on another.
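
The override pattern can be sketched as follows. The DMMA index/category pairs (180 -> 434, 215 -> 515) are the ones quoted above; the initializer names are placeholders for the sub_7A5D10 / sub_7C5410-style constructors:

```c
/* Identity map plus per-arch overrides, as described in the text. */
#define NUM_OPCODES 322

static int category_map[NUM_OPCODES];

static void init_base_map(void) {
    for (int i = 0; i < NUM_OPCODES; i++)
        category_map[i] = i;        /* identity: opcode N -> category N */
}

static void init_arch_a_map(void) {
    init_base_map();
    category_map[180] = 434;        /* DMMA at index 180 on this arch */
}

static void init_arch_b_map(void) {
    init_base_map();
    category_map[215] = 515;        /* DMMA at index 215 on another arch */
}
```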

Reader: sub_1377C60 (SASS mnemonic lookup)

// After matching mnemonic string v11 to opcode index v18 via ROT13 comparison:
v84 = *(_DWORD *)(a1 + 4 * v18 + 9336);  // encoding_category_map[v18]
// v84 is then FNV-1a hashed together with arch discriminator v16,
// and looked up in the hash table at *(a1 + 10672) to find the
// encoding format descriptor for this (category, arch) pair.

The hash table at +10672 stores entries of the form {encoding_category, arch_code, format_value}, keyed by FNV-1a of (encoding_category, arch_discriminator). This is the central mechanism that maps a SASS mnemonic string plus target architecture to the correct binary encoding format.
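
The exact key construction has not been recovered. As a sketch of the general scheme, a standard 64-bit FNV-1a over the two 32-bit key fields would look like this (the constants are the canonical FNV-1a parameters, not values read from the binary):

```c
#include <stdint.h>
#include <stddef.h>

/* Byte-wise FNV-1a accumulation: XOR each byte, then multiply by the prime. */
static uint64_t fnv1a_bytes(uint64_t h, const void *p, size_t n) {
    const unsigned char *b = p;
    for (size_t i = 0; i < n; i++) {
        h ^= b[i];
        h *= 1099511628211ULL;              /* 64-bit FNV prime */
    }
    return h;
}

/* Illustrative key hash for an (encoding_category, arch) pair. */
uint64_t hash_category_arch(uint32_t category, uint32_t arch) {
    uint64_t h = 14695981039346656037ULL;   /* 64-bit FNV offset basis */
    h = fnv1a_bytes(h, &category, sizeof category);
    h = fnv1a_bytes(h, &arch, sizeof arch);
    return h;
}
```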

Region 12: Descriptor Block Control (+10624 to +10687)

| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +10624 | 8 | u64 | block_ctrl_a | = 0 |
| +10632 | 8 | u64 | block_ctrl_b | = 0 |
| +10648 | 4 | u32 | arch_config | SM-specific config from target+55080/55088 |
| +10656 | 8 | ptr | descriptor_block | Pointer to allocated 10,288-byte per-opcode descriptor block |
| +10664 | 8 | ptr | block_allocator | MemoryManager that allocated the descriptor block |
| +10672 | 8 | ptr | encoding_lookup_table | Hash table for (encoding_category, arch) -> format descriptor lookup; read by sub_1377C60 |
| +10680 | 8 | u64 | block_aux_b | = 0 |

Region 13: Sentinels and Architecture Handler (+11200 to +11240)

| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +11200 | 4 | i32 | sentinel | = -2 (0xFFFFFFFE) |
| +11208 | 8 | ptr | arch_handler | = parent_ctx->+16 (MemoryManager) |
| +11216 | 8 | u64 | zero_11216 | = 0 |
| +11224 | 8 | u64 | sentinel_11224 | = 0xFFFFFFFF |
| +11232 | 1 | u8 | flag_11232 | = 0 |
| +11236 | 4 | u32 | zero_11236 | = 0 |

Per-Opcode Descriptor Block (10,288 bytes)

Allocated by the derived constructor and stored at +10656. The block is 10288 / 8 = 1286 QWORD entries, partitioned into three sections:

+--------------------+  block + 0
| Section 0 header   |  QWORD[0] = 0
+--------------------+  block + 8
| Section 0 payload  |  QWORD[1..640]  = all zero (memset)
| (640 slots)        |  Per-opcode descriptors for opcodes 0..639
+--------------------+  block + 5128
| Section 1 header   |  QWORD[641] = 0
+--------------------+  block + 5136
| Section 1 payload  |  QWORD[642..1283]  (NOT explicitly zeroed)
| (642 slots)        |  Modifier-variant descriptors (opcode | 0x1000, etc.)
+--------------------+  block + 10272
| Section 2 (16B)    |  QWORD[1284] = parent_ctx  (back-pointer)
|                    |  QWORD[1285] = instr_info   (self back-pointer)
+--------------------+  block + 10288

Section 0 (5,128 bytes): 641 QWORD slots. Only the payload (slots 1..640, 5,120 bytes) is explicitly zeroed. Each slot corresponds to a base opcode index. With 322 named opcodes, ~318 slots remain spare.

Section 1 (5,144 bytes): 643 QWORD slots. The header is zeroed but the payload is NOT explicitly zeroed -- it relies on the arena allocator's default behavior or lazy initialization during opcode registration. Likely stores modifier-variant descriptors (e.g., entries for opcode | 0x1000 when bits 12-13 carry sub-operation modifiers).

Section 2 (16 bytes): Two back-pointers for navigating from the descriptor block back to its owning objects (parent compilation context and the InstructionInfo instance).

Architecture-Specific Sub-Tables (sub_896D50, 26,888 bytes)

The architecture-specific extended property object is NOT stored inside InstructionInfo. It is lazily allocated by sub_7A4650, which gates on target+372 == 0x8000 (sm_80 / Ampere targets). The allocation is 26,888 bytes, constructed by sub_896D50(block, parent_context).

sub_896D50 Object Layout

| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +0 | 8 | ptr | vtable | off_21DADF8 |
| +8 | 8 | ptr | parent_ctx | From construction parameter |
| +40 | 8 | ptr | allocator_base | MemoryManager from parent->+16 |

Property Array A (at sub-object +56):

| Sub-offset | Type | Description |
|---|---|---|
| +56 | ptr | Array pointer: 64 bytes per entry, 772 entries (49,408 bytes allocated) |
| +64 | i32 | Count = 771 |
| +68 | i32 | Capacity = 772 |

Each 64-byte entry: bytes [0..11] initialized to 0xFF (pipeline-unassigned sentinel), bytes [12..63] zeroed. Stores latency, throughput, port mask, and register class requirements per opcode.
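
The record initialization can be sketched as follows (the 0xFF-sentinel-first-12-bytes layout comes from the decompilation; the struct name and raw-byte view are illustrative, since the real field split is not recovered):

```c
#include <string.h>

/* One 64-byte Property Array A record, viewed as raw bytes. */
typedef struct { unsigned char raw[64]; } PropRecA;

void init_prop_record(PropRecA *r) {
    memset(r->raw, 0xFF, 12);         /* pipeline-unassigned sentinel */
    memset(r->raw + 12, 0, 64 - 12);  /* remaining fields start zeroed */
}
```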

Property Array B (at sub-object +80):

| Sub-offset | Type | Description |
|---|---|---|
| +80 | ptr | Array pointer: 36 bytes per entry, 772 entries (27,792 bytes allocated) |
| +88 | i32 | Count = 771 |
| +92 | i32 | Capacity = 772 |

Each 36-byte entry: all zeroed. Stores encoding class, format identifiers, operand encoding rules.

Property Array C (at sub-object +176):

| Sub-offset | Type | Description |
|---|---|---|
| +176 | ptr | Array pointer: 16 bytes per entry, 35 entries (560 bytes allocated) |
| +184 | i32 | Count = 34 |
| +188 | i32 | Capacity = 35 |

Each 16-byte entry: zeroed. Stores functional unit properties for major FU categories.

Property Array D (at sub-object +200):

| Sub-offset | Type | Description |
|---|---|---|
| +200 | ptr | Array pointer: 16 bytes per entry, 35 entries (560 bytes allocated) |
| +208 | i32 | Count = 34 |

Parallel table for alternate functional unit configurations.

Dimension Table (at sub-object +472):

| Sub-offset | Type | Description |
|---|---|---|
| +472 | ptr | 168-byte block: [count=40, entries[0..39]], 4 bytes per entry, zero-initialized |

Alphabetical SASS Name Table (at sub-object +11360):

Starting at offset +11360, sub_896D50 populates an alphabetically sorted ROT13 name table using the same {char*, u64} format. Unlike the InstructionInfo name table (indexed by opcode), this table is sorted by decoded mnemonic name and includes modifier variants:

  • OZZN.168128 (BMMA.168128)
  • PPGY.P.YQP.VINYY (CCTL.C.LDC.IVALL)
  • VZNQ.JVQR.ERNQ.NO (IMAD.WIDE.READ.AB)
  • VZZN.FC.{168128.*|16864.*8.*8} (IMMA.SP.{...} -- regex patterns for variant matching)

This table is used for SASS assembly parsing and opcode-to-encoding resolution, where a single base opcode may map to multiple encoding variants distinguished by modifier suffixes.
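
The encoded names decode with plain ROT13 over ASCII letters; digits and punctuation (".", "{", "|", "*") pass through unchanged. A minimal decoder:

```c
#include <string.h>

/* ROT13 decode for the opcode name tables: rotate A-Z and a-z by 13;
   leave every other byte untouched. */
static void rot13(char *s) {
    for (; *s; s++) {
        if (*s >= 'A' && *s <= 'Z')
            *s = 'A' + (*s - 'A' + 13) % 26;
        else if (*s >= 'a' && *s <= 'z')
            *s = 'a' + (*s - 'a' + 13) % 26;
    }
}
```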

Knob-derived fields:

| Sub-offset | Type | Source |
|---|---|---|
| +108 | i32 | Knob 803 value (instruction scheduling latency override) |
| +468 | u8 | = 0 |
| +469 | u8 | = 1 |
| +470 | u8 | = 1 |

Accessor Stubs

40+ tiny vtable accessor stubs at 0x859F80-0x85A5F0 and 0x868500-0x869700 provide virtual dispatch access to per-opcode properties. Typical pattern:

int getLatency(ArchSpecificInfo* self, int opcode) {
    // read a 4-byte field out of the 64-byte per-opcode property record
    return *(int*)(self->property_array_a + 64 * opcode + latency_offset);
}

PTX Text-Generation Operand Accessor API

The PTX text generation subsystem (instruction pretty-printer, dispatcher at sub_5D4190) converts Ori IR instructions into PTX assembly text. The ~580 formatter functions at 0x4DA340-0x5A9FFF query a PTX instruction context object through a stable API of 48 small accessor helpers concentrated at 0x707000-0x710FFF.

PTX Instruction Context Object

The accessor functions do NOT operate on the 296-byte Ori IR instruction directly. They take a PTX instruction context object (~2500+ bytes) that contains pre-decoded fields for text generation. The raw Ori instruction is accessible at *(context + 1096). Each formatter receives this context as argument a1 and a pool allocator table as argument a2.

Partial field map of the PTX instruction context (offsets used by accessors):

| Offset | Size | Type | Field | Accessed By / Description |
|---|---|---|---|---|
| +544 | 8 | ptr | predicate_ptr | has_predicate, get_opcode_string |
| +564 | 4 | u32 | saturation_code | get_saturation_mode (== 12 means saturate) |
| +596 | 4 | u32 | field_operand_count | get_field_a..get_field_d |
| +600 | 1 | u8 | flag_byte_a | Bit 0: precision, bit 6: addressing, bit 7: addr_mode |
| +604 | 1 | u8 | rounding_mode | Bits 0-2: rounding mode code (3 bits) |
| +605 | 1 | u8 | scale_byte | Bits 4-7: scale code (4 bits, 16 entries) |
| +609 | 1 | u8 | base_addr_byte | Bits 2-3: base address mode (2 bits, 4 entries) |
| +611 | 1 | u8 | param_flags | Bits 4-5: parameter variant selector |
| +615 | 1 | u8 | ftz_byte | Bits 6-7: FTZ flag code (2 bits, 4 entries) |
| +620 | 1 | u8 | variant_index | Variant string lookup index (8 bits, 256 entries) |
| +627 | 1 | u8 | flag_byte_b | Bits 0-1: extended_op, 2-3: flag_b, 4-5: modifier/variant |
| +640 | 4 | i32 | precision_code | Index into precision string table |
| +648 | var | ptr[] | operand_names | Per-operand name string pointer array (8B per slot) |
| +800 | 4 | u32 | operand_count | Number of operands for comparison/count accessors |
| +816 | var | ptr[] | reg_operands | Register operand pointer array (8B per slot) |
| +944 | var | u32[] | operand_types | Per-operand type code array (4B per slot) |
| +1024 | var | ptr[] | src_part0 | Source part 0 pointer array (8B per slot) |
| +1264 | var | ptr[] | src_part1 | Source part 1 pointer array (8B per slot) |
| +1504 | var | ptr[] | data_types_0 | Data type array, part 0 (8B per slot) |
| +1744 | var | ptr[] | data_types_1 | Data type array, part 1 (8B per slot) |
| +1984 | var | u32[] | target_sm | Target SM version array (4B per slot) |
| +2120 | 8 | ptr | opcode_name | Opcode mnemonic string pointer |
| +2488 | 8 | ptr | string_intern | String interning table for modifier deduplication |

Accessor Catalog

Tier 1: Core Accessors (>200 callers)

Used by nearly every formatter function. These are the fundamental building blocks of PTX text generation.

| Address | Name | Size | Callers | Signature | Logic |
|---|---|---|---|---|---|
| sub_710860 | getDataType | 39B | 2953 | (ctx, idx, part) -> u8 | part ? **(ctx+1744+8*idx) & 0x3F : **(ctx+1504+8*idx) & 0x3F |
| sub_70B910 | getSrcPart0 | 12B | 1656 | (ctx, idx) -> ptr | *(ctx + 8*idx + 1024) |
| sub_70B8E0 | getRegOperand | 12B | 1449 | (ctx, idx) -> ptr | *(ctx + 8*idx + 816) |
| sub_70B920 | getSrcPart1 | 12B | 1296 | (ctx, idx) -> ptr | *(ctx + 8*idx + 1264) |
| sub_70B700 | hasPredicate | 14B | 946 | (ctx) -> bool | *(ctx + 544) != 0 |
| sub_70B780 | getPredicateName | 151B | 514 | (ctx, pool) -> str | Allocates "@" + opcode_name; inserts "!" if negated |
| sub_70CA60 | getOperandType | 11B | 480 | (ctx, idx) -> u32 | *(ctx + 4*idx + 944) |
| sub_70B710 | getOpcodeString | 111B | 348 | (ctx, pool) -> str | Allocates "@" + *(ctx+2120) from arena pool |
| sub_70FA00 | getTargetSM | 10B | 286 | (ctx, idx) -> u32 | *(ctx + 4*idx + 1984) |

Tier 2: Modifier and Property Accessors (10-200 callers)

Used by instruction-class families (memory ops, float ops, texture ops, etc.).

| Address | Name | Size | Callers | Signature | Logic |
|---|---|---|---|---|---|
| sub_70CA70 | getTypeSuffix | 427B | 191 | (ctx, pool) -> str | Iterates *(ctx+796) type codes; looks up in off_2032300[] with interning |
| sub_70CD20 | getOperandOffset | 122B | 158 | (ctx, idx) -> str | off_2032300[*(ctx+4*idx+944)]; resolves via string interning for codes <= 0x39 |
| sub_707CE0 | getAddressOperand | 22B | 93 | (ctx) -> str | off_2033DE0[*(ctx+600) >> 7] |
| sub_70B930 | getOperandCount | 7B | 68 | (ctx) -> u32 | *(ctx + 800) |
| sub_70B4C0 | getBaseAddress | 22B | 46 | (ctx) -> str | off_2032700[(*(ctx+609) >> 2) & 3] |
| sub_709A10 | getVariantString | 73B | 46 | (ctx) -> str | off_2033060[*(ctx+620)] resolved via string interning |
| sub_70B6E0 | hasPredicate_v2 | 14B | 42 | (ctx) -> bool | *(ctx + 544) != 0 (identical body to hasPredicate) |
| sub_709760 | getComparisonOp | 127B | 21 | (ctx, pool) -> str | Iterates *(ctx+800) operand names from +648 array with " , " separator |
| sub_709FE0 | getRoundingMode | 11B | 17 | (ctx) -> u8 | *(ctx + 604) & 7 |
| sub_70A500 | getSaturationMode | 13B | 15 | (ctx) -> bool | *(ctx + 564) == 12 |
| sub_709910 | getVariantCount | 14B | 13 | (ctx) -> u8 | (*(ctx+627) >> 4) & 3 |
| sub_708E40 | getExtendedOperand | 29B | 10 | (ctx, idx) -> str | off_2033720[(*(ctx+627) >> (idx==1 ? 0 : 2)) & 3] |

Tier 3: Instruction-Class-Specific Accessors (<10 callers)

Used by specific instruction families (MMA/tensor, texture, guardrail formatters).

| Address | Name | Size | Callers | Signature | Purpose |
|---|---|---|---|---|---|
| sub_70FA10 | checkTargetSM | 66B | 7 | (ctx, idx, str) -> bool | sscanf(str, "sm_%d") then compare to *(ctx+1984+4*idx) |
| sub_70C890 | getOperandDetail | ~300B | varies | (ctx, pool, maxlen, type) -> str | Complex: hex parse, fallback to sub_707380, type-dispatch |
| sub_70A810 | getScaleString | 22B | varies | (ctx) -> str | off_2032BA0[(*(ctx+605) >> 4) & 0xF] |
| sub_70B3F0 | getFtzFlag | 22B | varies | (ctx) -> str | off_20327C0[(*(ctx+615) >> 6) & 3] |
| sub_707530 | getPrecisionString | 12B | varies | (ctx) -> str | off_2033FA0[*(ctx+640)] |
| sub_707C60 | getAddressingMode | 12B | varies | (ctx) -> bool | (*(ctx+600) & 0x40) != 0 |
| sub_707C80 | getScopeString | 22B | varies | (ctx) -> str | off_2033E00[(*(ctx+600) & 0x40) != 0] |
| sub_7075E0 | getLayoutString | 22B | varies | (ctx) -> str | off_2033EE0[*(ctx+600) & 1] -- WMMA/TCGEN05 |
| sub_707BE0 | getShapeString | 22B | varies | (ctx) -> str | off_2033E30[(*(ctx+600) & 4) != 0] -- WMMA/TCGEN05 |
| sub_7075C0 | getInstrFlagA | 7B | varies | (ctx) -> u8 | *(ctx+600) & 1 -- WMMA/rsqrt |
| sub_707BC0 | getInstrFlagB | varies | varies | (ctx) -> varies | Secondary flag accessor -- WMMA/rsqrt |
| sub_70D3B0 | getFieldA | 91B | 2 | (ctx) -> str | Returns ".transA" if operand count matches MMA shape |
| sub_70D410 | getFieldB | 99B | 2 | (ctx) -> str | Returns ".transB" (symmetric with getFieldA) |
| sub_70D480 | getFieldC | 91B | 2 | (ctx) -> str | MMA field C modifier string |
| sub_70D4E0 | getFieldD | 91B | 2 | (ctx) -> str | MMA field D modifier string |
| sub_70D360 | getModifier | 76B | 1 | (ctx, pool) -> str | Reads operand at index 3 or 5 depending on byte 627 |
| sub_70D2F0 | getImmediate | 107B | 1 | (ctx, pool) -> str | Reads operand at +672, conditionally appends second value |
| sub_70FCB0 | getParamA | varies | varies | (ctx) -> u64 | Dispatch on (*(ctx+611) & 0x30): selects guardrail constant |
| sub_70FCF0 | getParamB | varies | varies | (ctx) -> u64 | Similar dispatch on different bit field |
| sub_70E670 | getParamC | varies | varies | (ctx) -> u64 | Third parameter accessor |

Static String Tables

The accessor functions perform table-driven lookups using static string pointer arrays in .rodata. Each table is indexed by a small bit-field extracted from the context object:

| Table Address | Entries | Indexed By | Content |
|---|---|---|---|
| off_2032300 | >57 | Operand type code | Type suffix strings (.f32, .u16, .b64, etc.) |
| off_2032700 | 4 | (ctx+609 >> 2) & 3 | Base address mode strings |
| off_20327C0 | 4 | (ctx+615 >> 6) & 3 | FTZ flag strings (empty, .ftz, etc.) |
| off_2032BA0 | 16 | (ctx+605 >> 4) & 0xF | Scale modifier strings |
| off_2033060 | 256 | ctx+620 | Variant name strings |
| off_2033720 | 4 | (ctx+627 >> N) & 3 | Extended operand strings |
| off_2033DE0 | 2 | ctx+600 >> 7 | Address operand strings |
| off_2033E00 | 2 | (ctx+600 & 0x40) != 0 | Scope strings (.cta, .gpu, etc.) |
| off_2033E30 | 2 | (ctx+600 & 4) != 0 | Shape strings -- WMMA/TCGEN05 |
| off_2033EE0 | 2 | ctx+600 & 1 | Layout strings -- WMMA/TCGEN05 |
| off_2033FA0 | int-indexed | ctx+640 | Precision strings for texture ops |

Architectural Notes

  1. String interning: String-returning accessors for type codes <= 0x39 go through a string interning table at *(ctx+2488). The pattern is: look up a candidate string from the static table, then pass it through sub_426D60 (hash lookup) or sub_7072A0 (insert-and-return). This deduplicates PTX modifier strings across the entire text generation pass.

  2. Pool allocation: Accessors that construct new strings (prefixing "@", joining with separators) receive a pool allocator parameter. They allocate from the formatter's 50KB temp buffer via sub_4280C0 (get pool) -> sub_424070 (alloc from pool) -> sub_42BDB0 (abort on failure).

  3. Duplicate functions: sub_70B700 (hasPredicate, 946 callers) and sub_70B6E0 (hasPredicate_v2, 42 callers) have bytewise-identical bodies. Both return *(a1+544) != 0. These are likely methods in different classes (base and derived, or two sibling classes) that were not merged by the linker because they have distinct mangled names.

  4. MMA/tensor accessors: getFieldA through getFieldD, getLayoutString, and getShapeString are used exclusively by WMMA, HMMA, and TCGEN05 instruction formatters. They decode matrix operation modifiers (.transA, .transB, .row, .col) from compressed bit fields.
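
The interning pattern in note 1 can be sketched as a hash table that returns one canonical pointer per distinct string. This is a minimal stand-in: the real implementation goes through sub_426D60/sub_7072A0 with arena storage, while the djb2 hash, fixed table size, and malloc here are illustrative:

```c
#include <stdlib.h>
#include <string.h>

#define INTERN_SLOTS 256   /* sketch only; must stay below capacity */

static const char *slots[INTERN_SLOTS];

/* Return the canonical pointer for s, inserting a copy on first sight. */
const char *intern(const char *s) {
    unsigned long h = 5381;                     /* djb2, stand-in hash */
    for (const char *p = s; *p; p++)
        h = h * 33 + (unsigned char)*p;
    for (unsigned long i = h % INTERN_SLOTS; ; i = (i + 1) % INTERN_SLOTS) {
        if (!slots[i]) {                        /* empty slot: insert */
            char *copy = malloc(strlen(s) + 1);
            strcpy(copy, s);
            slots[i] = copy;
            return slots[i];
        }
        if (strcmp(slots[i], s) == 0)           /* hit: deduplicated */
            return slots[i];
    }
}
```

Every formatter that produces ".ftz" then hands out the same pointer, which is what makes pointer-equality comparisons on modifier strings valid downstream.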

Instruction Creation

Allocation: sub_7DD010

The primary instruction allocator at sub_7DD010 (called from pass code that needs to create new instructions):

  1. Allocates 296 bytes from the Code Object's arena allocator (vtable+16, size 296)
  2. Zeroes the entire 296-byte object
  3. Initializes sentinel fields: offset +248 = -1, +256 = 0xFFFFFFFF, +264 and +272 = 0xFFFFFFFF00000000
  4. Loads scheduling parameter defaults from xmmword_2027620 into offset +208
  5. Appends the new instruction to the Code Object's instruction index array at +368 (resizable, 1.5x growth policy)
  6. Assigns a unique instruction index: *(instr + 264) = index
  7. Invalidates cached analysis (RPO at +792)

The instruction is created unlinked -- it is not yet in any basic block's linked list.
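
The index-array append in step 5 can be sketched as follows. Only the 1.5x growth policy and the returned unique index come from the decompilation; the struct, field names, and initial capacity are illustrative:

```c
#include <stdlib.h>

typedef struct {
    void **items;       /* resizable array of instruction pointers */
    int count, capacity;
} InstrIndex;

/* Append instr; grow by 1.5x when full. Returns the unique index. */
int index_append(InstrIndex *ix, void *instr) {
    if (ix->count == ix->capacity) {
        int ncap = ix->capacity ? ix->capacity + ix->capacity / 2 : 8;
        void **n = realloc(ix->items, ncap * sizeof *n);
        if (!n) return -1;
        ix->items = n;
        ix->capacity = ncap;
    }
    ix->items[ix->count] = instr;
    return ix->count++;
}
```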

Linking: sub_925510 (Insert Before)

sub_925510 inserts instruction a2 before instruction a3 in the doubly-linked list of Code Object a1:

void InsertBefore(CodeObject* ctx, Instr* instr, Instr* before) {
    // 1. Check if instruction removal impacts scheduling state
    if (IsScheduleRelevant(instr, ctx))
        UpdateScheduleState(ctx, instr);

    // 2. Notify observers
    NotifyObservers(ctx->observer_chain + 1952, instr);

    // 3. Unlink from current position
    if (instr->prev) {
        instr->prev->next = instr->next;
        if (instr->next)
            instr->next->prev = instr->prev;
        else
            ctx->tail = instr->prev;   // was tail
    } else {
        ctx->head = instr->next;        // was head
        if (instr->next)                // guard: instr may be the sole element
            instr->next->prev = nullptr;
    }

    // 4. Insert before target
    instr->next = before;
    instr->bb_index = before->bb_index;
    instr->prev = before->prev;
    if (before->prev)
        before->prev->next = instr;
    if (before == ctx->head)
        ctx->head = instr;
    before->prev = instr;

    // 5. Post-insert bookkeeping
    PostInsertUpdate(ctx, instr);
}

Removal: sub_9253C0

sub_9253C0 (634 callers) removes an instruction from its linked list:

  1. Checks if the instruction affects scheduling state (same check as insert)
  2. Notifies the observer chain at Code Object +1952
  3. Unlinks from the doubly-linked list (updating head/tail pointers at +272/+280)
  4. Optionally updates the instruction map at Code Object +1136 (if a3 flag is set)
  5. Handles debug info cleanup if the debug flag at byte +1421 bit 5 is set

Instruction Removal Check: sub_7E0030

Before removing an instruction (sub_7E0030, called from both sub_9253C0 and sub_925510), the compiler checks whether the removal is legal. This function examines:

  • Whether the instruction is an STS (store shared, base opcode 95) with specific operand count and data type patterns (operand_count - adj == 5 with data type codes 1, 2, or 4 prevent removal)
  • Whether a target-specific scheduler hook (vtable offset 2128 on the SM backend at compilation context +1584) vetoes the removal
  • Whether the instruction is a PLOP3 (predicate logic, opcode 23) writing to a special register (register file type 9 at descriptor +64)
  • Whether the dead-code check (sub_7DF3A0) clears the instruction, excluding opcodes 93 (OUT_FINAL), 124 (DMUL), and 248 (SM90+ opcode) which have required side effects
  • Whether the opcode class has a "must keep" flag in the per-opcode property array at Code Object +776 (byte[4*opcode + 2] & 4)
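
The last check can be sketched directly from the recovered bit test (byte[4*opcode + 2] & 4); the function name is illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* "Must keep" test on the per-opcode property array (4 bytes per opcode):
   bit 2 of the third byte marks opcodes with required side effects. */
bool must_keep(const uint8_t *property_array, int opcode) {
    return (property_array[4 * opcode + 2] & 4) != 0;
}
```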

Instruction Iteration

Forward Walk

The standard forward walk over a basic block's instructions:

// code_obj->head is at +272, tail at +280
instr_ptr instr = *(ptr*)(code_obj + 272);
while (instr) {
    // process instruction
    instr = *(ptr*)(instr + 8);  // next
}

Reverse Walk

instr_ptr instr = *(ptr*)(code_obj + 280);  // tail
while (instr) {
    // process instruction
    instr = *(ptr*)(instr + 0);  // prev
}

Block-Scoped Iteration

When iterating within a specific basic block (as scheduling, regalloc, and peephole passes do), the walk starts at the block's head instruction pointer (block_entry +0) and continues until it reaches either the list tail or the next block boundary. The boundary marker is opcode 52 -- named AL2P_INDEXED in the ROT13 table but universally used as a BB delimiter pseudo-opcode:

// Block info at code_obj+976, 40 bytes per block
ptr block_head = *(ptr*)(*(ptr*)(code_obj + 976) + 40 * block_index);
for (instr = block_head; instr != nullptr; instr = *(ptr*)(instr + 8)) {
    uint32_t op = *(uint32_t*)(instr + 72) & 0xFFFFCFFF;
    if (op == 52)  // BB boundary
        break;
    // process instruction
}

Def-Use Chain Iterator: sub_7E6090

The complex def-use chain builder sub_7E6090 (650 lines decompiled) is the core instruction analysis function. Called from sub_8E3A80 and numerous optimization passes, it:

  1. Walks all instructions in program order
  2. For each register operand (type == 1 via (word >> 28) & 7), updates the register descriptor's def/use counts at offsets +20 and +24
  3. Builds use chains via linked list nodes allocated from the arena (16-byte nodes with {next, instruction_ptr})
  4. Sets flag bits in register descriptors (+48) for live-out, same-block-def, has-prior-use, and source-only-ref
  5. Tracks the single-definition instruction at register descriptor +56
  6. Handles CSE matching: compares operand arrays of instructions with matching opcode, operand count, and auxiliary data to detect redundant computations
  7. Takes parameter a5 as a bitmask of register file types to process (bit per register class)
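
The use-chain construction in step 3 can be sketched as follows. The 16-byte {next, instruction} node shape and the descriptor counter offsets (+20/+24) come from the decompilation; struct and field names are illustrative, and the real nodes come from the arena rather than malloc:

```c
#include <stdlib.h>

typedef struct UseNode {
    struct UseNode *next;   /* node +0 */
    void *instr;            /* node +8 */
} UseNode;

typedef struct {
    UseNode *uses;          /* head of the use chain */
    int def_count;          /* descriptor +20 in the binary */
    int use_count;          /* descriptor +24 in the binary */
} RegDesc;

/* Push a new use of this register onto the head of its chain. */
void record_use(RegDesc *rd, void *instr) {
    UseNode *n = malloc(sizeof *n);   /* binary: 16-byte arena allocation */
    n->next = rd->uses;
    n->instr = instr;
    rd->uses = n;
    rd->use_count++;
}
```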

Instruction Lowering Handler -- sub_65D640 (48 KB)

The central PTX-to-Ori instruction lowering handler lives at sub_65D640. It is installed at vtable offset +32 in the ISel Phase 1 dispatch table (sub_660CE0) and called through the vtable for every PTX instruction during lowering.

Signature: int64 sub_65D640(context*, bb_ref, ptx_node*, ori_instr*)

The function reads the PTX opcode from *(*(ptx_node+32)+8) and dispatches through a ~60-case switch. An entry gate (sub_44AC80) diverts certain opcode types to an alternate handler (sub_656600). The function calls sub_A2FD90 (operand setter) 59 times to populate Ori operands on the resulting instructions.

Opcode Case Map

| Case(s) | PTX family | Handler | Description |
|---|---|---|---|
| 5 | prmt (byte permute) | inline | Decodes 8-bit per-byte channel mask, sets 2 operands |
| 6 | prmt (extended) | inline | Two-operand permute with address computation via sub_6294E0 |
| 10 | mov (special) | inline | Clears immediate flag for float type 109 |
| 12 | (delegated) | sub_659F90 | -- |
| 13 | multi-operand expansion | inline | Expands via sub_62E840, resolves type 87 (address) and 97 (register) operands |
| 17, 18, 24 | mov/cvt variants | sub_652FA0 | -- |
| 19, 20, 23 | surface ops | inline | ~200 lines: multi-register data, sub_6273E0 operand classification, up to 4 data regs + address |
| 34, 35 | load/store | inline | Optional address resolution gated on (ptx_node+61 & 0xC) |
| 45, 238 | conversion | inline | Rewrites operand type to 20 (integer), binds address via sub_6294E0 |
| 68, 71 | register indirect rewrite | inline | Checks operand size == 8, rewrites descriptor to type 110 |
| 81 | instruction expansion | inline | Creates IADD3 (opcode 38) with constant 0, reg class 12 |
| 82 | instruction expansion | inline | Rewrites to opcode 162 with IADD3 operand |
| 84 | load expansion | inline | Creates IADD3 with offset, flags 0x2000 |
| 85 | operand reorder | inline | 3-operand shuffle |
| 87 | reg class adjustment | inline | Table lookup at dword_2026C60, swaps operands 1/2, sets opcode 150 |
| 88 | matrix config | inline | MMA dimension table at dword_2026C48, sets fields 179/180 |
| 104 | 4-wide load | inline | Creates 4-operand instruction, address binding via sub_6294E0 |
| 110 | (delegated) | sub_652610 | -- |
| 123 | generic addressing | inline | Converts flat-to-specific addresses; SM-version-dependent multi-instruction sequences |
| 124, 125 | cvta / isspacep | inline | Address space conversion; creates CVTA opcode 538/539 on SM > 0x1A |
| 130 | instruction fusion | inline | Fuses instruction if operand count is not 3 or 4 |
| 165 | (delegated) | sub_65BF40 | -- |
| 175--178 | texture addr_mode | inline | Resolves .addr_mode_0/1/2 attributes from texture descriptor |
| 179 | atomic address mode | inline | Classifies atomic op type, creates SEL + ATOM sequence |
| 180 | (delegated) | sub_65CE90 | -- |
| 181, 182 | (delegated) | sub_64FF20 | -- |
| 183 | conditional atomic | inline | State space 0x20: rewrites to opcode 71 with mask 0xFF01010101 |
| 184--190 | surface/texture lowering | inline | Handles SULD/SUST/SURED (opcodes 449-456); SM-dependent operand resolution |
| 197, 198 | call site lowering | inline | Same-module vs cross-module call dispatch |
| 201--204, 208--211 | wide load/store | inline | .v2/.v4 multi-element operations with IADD3 offset computation |
| 206, 207, 212, 213 | 3-op wide load/store | inline | 3-operand variants of wide memory operations |
| 221, 222 | TMA operations | inline | Sets field 197 with value 365/366 |

Addressing Mode Types

ptxas handles four distinct addressing mode categories during instruction lowering, all resolved by sub_65D640:

1. Texture Addressing Modes (per-dimension)

Cases 175--178 resolve .addr_mode_0, .addr_mode_1, .addr_mode_2 attributes from texture descriptors. These are the PTX txq query targets.

The function walks the texture descriptor's attribute linked list at *(descriptor+16)+24, comparing each attribute name string:

// Pseudocode for cases 175-178:
addr_mode_0 = addr_mode_1 = addr_mode_2 = 0;
found = false;
for (node = attr_list_head; node != NULL; node = *node) {
    name = *(node[1] + 16);    // attribute name string
    value = *(*(node[1] + 24) + 16);  // integer value
    if (strcmp(name, "addr_mode_0") == 0)  { addr_mode_0 = value; found = true; }
    else if (strcmp(name, "addr_mode_1") == 0)  { addr_mode_1 = value; found = true; }
    else if (strcmp(name, "addr_mode_2") == 0)  { addr_mode_2 = value; found = true; }
}

For 2D textures (state space byte & 0xB0 == 0x20), the function checks addr_mode_0 == addr_mode_1. For 3D textures (0x30), it checks all three equal. If modes are uniform (all equal), the instruction gets a single addressing mode flag (field 91 = 1 for clamp_to_border). If modes differ, it delegates to sub_64FC90 for a multi-instruction lowering that handles per-dimension mode selection.
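
The uniformity rule can be stated compactly. This is a sketch; the dimension-count parameter is illustrative, since the binary actually tests state-space byte patterns (0x20 for 2D, 0x30 for 3D) rather than an integer dimension:

```c
#include <stdbool.h>

/* 2D textures: modes 0 and 1 must agree; 3D: all three must agree. */
bool modes_uniform(int dims, int m0, int m1, int m2) {
    if (dims == 2) return m0 == m1;
    if (dims == 3) return m0 == m1 && m1 == m2;
    return true;   /* 1D: a single mode is trivially uniform */
}
```

When this returns true, a single addressing-mode flag suffices; otherwise lowering falls through to the per-dimension path in sub_64FC90.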

2. Generic-to-Specific Address Conversion (case 123)

Converts flat/generic pointers to specific memory space pointers. The address space ID from *(ptx_node+40) selects the conversion strategy:

| Space ID | Memory space | Strategy |
|---|---|---|
| 4 | shared | sub_654A90 (direct conversion) |
| 5 | combined | OR of global + shared + local conversions |
| 6 | local | sub_64F7A0 with register pair 101/102 |
| 7 | generic (flat) | SM-dependent: sub_654FB0 (SM <= 0x1A) or SHR/AND extraction + SEL mux (SM > 0x10) |
| 8 | global | sub_64F7A0 with register pair 98/99 |
For generic space on older architectures (SM <= 0x1A with feature flag via sub_61AF90), a simpler single-instruction path is used. On newer architectures, a multi-instruction sequence extracts the space tag from the upper address bits.
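
The SHR/AND + SEL sequence can be sketched as follows. The tag position, mask width, and space codes below are placeholders for illustration only; the actual bit layout was not recovered from the binary:

```c
#include <stdint.h>

/* Placeholder tag layout: NOT recovered from the binary. */
enum { TAG_SHIFT = 45, TAG_MASK = 0x7 };

/* SHR + AND: extract the space tag from the upper address bits. */
uint64_t space_tag(uint64_t generic_addr) {
    return (generic_addr >> TAG_SHIFT) & TAG_MASK;
}

/* SEL: pick the windowed offset when the tag matches the wanted space,
   otherwise pass the address through unchanged. */
uint64_t to_specific(uint64_t generic_addr, uint64_t wanted_tag) {
    uint64_t offset = generic_addr & ((1ULL << TAG_SHIFT) - 1);
    return space_tag(generic_addr) == wanted_tag ? offset : generic_addr;
}
```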

3. Address Space Conversion (cases 124--125, cvta/isspacep)

The cvta (Convert Address) and isspacep (Is Space Predicate) instructions convert between generic and specific address spaces. For global space (type 8) on SM > 0x1A, the handler creates CVTA with opcode 538 (isspacep) or 539 (cvta) and sets register class 7 with width 4 or 16 bytes.

4. Memory Addressing Modes (implicit)

Memory addressing modes for load/store/atomic instructions are not enumerated as named constants. Instead, they emerge from the operand construction patterns in cases 19--23, 34--35, 81--84, 104, 201--213:

| Pattern | PTX syntax | Ori representation |
|---|---|---|
| Register indirect | [%rd1] | Operand type 87 from sub_629E40 |
| Register + offset | [%rd1+16] | Register operand + immediate via sub_6273E0 |
| Constant bank | c[2][0x100] | Constant operand via sub_620320 (type 12) |
| Immediate address | .local space | Constant value via sub_620320 |
| Base + index | [%rd1], %r2 | Two-operand form |

ISel Phase 1 Dispatch Vtable

sub_660CE0 constructs a 17-slot vtable at context offset +3784 for the ISel Phase 1 instruction handlers:

| Offset | Handler | Size | Role |
|---|---|---|---|
| +0 | sub_650840 | -- | Primary handler |
| +8 | sub_64EEB0 | -- | Operand handler |
| +16 | sub_64F270 | -- | Type handler |
| +24 | sub_6575D0 | 49 KB | Register-class-to-opcode dispatch |
| +32 | sub_65D640 | 48 KB | Instruction lowering (this function) |
| +40 | sub_64EDD0 | -- | Auxiliary handler |
| +128 | sub_64EEC0 | -- | Lowering helper |

Key Function Reference

| Address | Size | Function | Description |
|---|---|---|---|
| sub_7DD010 | 1.3KB | Instruction::create | Allocate and initialize 296-byte instruction |
| sub_7E0030 | 3.6KB | Instruction::canRemove | Check if instruction removal is legal |
| sub_7E0650 | 0.7KB | Instruction::hasPredGuard | Check if instruction has predicate guard |
| sub_7E0E80 | 0.1KB | Instruction::lastOpIsPred | Quick predicate-guard check on last operand |
| sub_7E6090 | 10KB | DefUseChain::build | Build def-use chains for all instructions |
| sub_7DDCA0 | 0.2KB | Observer::notify | Walk observer chain and notify |
| sub_9253C0 | 0.5KB | Instruction::remove | Remove instruction from linked list (634 callers) |
| sub_925510 | 0.5KB | Instruction::insertBefore | Insert instruction before another (13 callers) |
| sub_917A60 | 6.8KB | InstrInfo::getRegClass | Opcode-to-register-class mapping (221 callers) |
| sub_91A0F0 | 5.6KB | InstrInfo::resolveRegClass | Resolve operand register class with constraints |
| sub_9314F0 | 0.4KB | RegClass::query | Register class query (1,547 callers) |
| sub_738E20 | 10KB | InstrDescTable::init | Base instruction descriptor table constructor |
| sub_BE7390 | 16KB | InstructionInfo::init | InstructionInfo constructor (ROT13 table + descriptors) |
| sub_896D50 | 21KB | InstrMnemTable::init | Architecture-specific mnemonic table initializer |
| sub_65D640 | 48KB | InstrLowering::handle | PTX-to-Ori instruction lowering handler (60+ opcode cases, addressing mode resolution) |
| sub_660CE0 | 0.3KB | InstrLowering::initVtable | Constructs ISel Phase 1 dispatch vtable (17 slots) |
| sub_6575D0 | 49KB | RegClassOpcodeDispatch::handle | Register-class-to-opcode dispatch (vtable +24 sibling) |
| sub_6D9690 | 94KB | Instruction::encode | Master SASS instruction encoder |
| sub_B28E00 | varies | isReg/isPred/isImm | Operand type predicates (isel infrastructure) |
| sub_5D4190 | 12.9KB | PTXFormatter::dispatch | PTX text generation dispatcher (580 formatters) |
| sub_710860 | 39B | PTXCtx::getDataType | Data type accessor (2,953 callers) |
| sub_70B8E0 | 12B | PTXCtx::getRegOperand | Register operand accessor (1,449 callers) |
| sub_70B910 | 12B | PTXCtx::getSrcPart0 | Source part 0 accessor (1,656 callers) |
| sub_70B700 | 14B | PTXCtx::hasPredicate | Predicate presence check (946 callers) |
| sub_70CA60 | 11B | PTXCtx::getOperandType | Operand type code accessor (480 callers) |
| sub_70B710 | 111B | PTXCtx::getOpcodeString | Opcode string with "@" prefix (348 callers) |
| sub_70FA00 | 10B | PTXCtx::getTargetSM | Target SM version accessor (286 callers) |

Basic Blocks & Control Flow Graph

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas maintains a custom CFG infrastructure built entirely from scratch -- no LLVM BasicBlock, no LLVM MachineBasicBlock, no LLVM dominator framework. Basic blocks are stored in contiguous arrays, edges are stored in FNV-1a hash maps, and RPO / backedge / loop information is computed by dedicated functions in the 0xBDE000--0xBE2400 address range.

Key Facts

| Property | Value |
|---|---|
| BasicBlock object size | 136 bytes (allocated by sub_62BB00) |
| Block info entry (scheduling) | 40 bytes per entry, contiguous array |
| Block naming | bix%d (block index, 0-based integer) |
| Edge representation | FNV-1a hash map (key = block index, value = successor list) |
| RPO storage | int[] array, indexed by RPO position |
| Backedge storage | Separate FNV-1a hash map |
| CFG construction phase | Phase 3: AnalyzeControlFlow |
| Block layout phase | Phase 112: PlaceBlocksInSourceOrder |
| BB merge suppression | --dont-merge-basicblocks / -no-bb-merge CLI flag |

Two-Level Block Representation

ptxas uses two distinct but linked representations for basic blocks. The first is owned by the Code Object (used by all optimization passes); the second is owned by the scheduling/CFG analysis context (used by scheduling and post-regalloc passes).

Code Object Block Array

The Code Object stores an array of pointers to full BasicBlock objects:

| Code Object Offset | Type | Field | Description |
|---|---|---|---|
| +296 | ptr | bb_array | Array of BasicBlock* pointers (8 bytes each) |
| +304 | i32 | bb_count | Number of basic blocks |

Access pattern (from sub_78B430):

int bb_count = *(int*)(ctx + 304);
for (int i = 0; i < bb_count; i++) {
    BasicBlock* bb = *(BasicBlock**)(*(ctx + 296) + 8 * i);
    int rpo = *(int*)(bb + 144);
    // ...
}

Scheduling Block Info Array

The scheduling context maintains a parallel 40-byte-per-entry array:

| Scheduling Context Offset | Type | Field | Description |
|---|---|---|---|
| +976 | ptr | block_info | Contiguous array of 40-byte entries |
| +984 | i32 | num_blocks | Max block index (0-based; actual count = num_blocks + 1) |

Block Info Entry Layout (40 bytes)

| Offset | Type | Field | Description |
|---|---|---|---|
| +0 | ptr | bb_ptr | Pointer to the full BasicBlock object |
| +8 | ptr | insn_head | Pointer to the instruction list head (or sentinel) |
| +16 | u64 | reserved | Reserved / padding |
| +24 | u32 | flags | Block flags |
| +28 | i32 | bix | Block index (unique ID used in all CFG operations) |
| +32 | u64 | aux | Auxiliary data (varies by pass) |

The DOT dumper at sub_BE21D0 iterates this array with a 40-byte stride:

for (int i = 0; i <= num_blocks; i++) {
    entry = *(sched_ctx + 976) + 40 * i;
    int bix = *(int*)(entry + 28);
    int label = *(int*)(*(ptr*)(entry + 0) + 152);
    printf("bix%d(L%x)", bix, label);
}

BasicBlock Object (136 bytes)

Allocated by sub_62BB00 during the parsing/lowering phase. The parser references the string "bb-controlflow" when constructing these objects. After allocation, the 136-byte block is zeroed via memset, then individual fields are populated.

BasicBlock Field Map

| Offset | Type | Field | Description |
|---|---|---|---|
| +0 | ptr | vtable | Virtual function table pointer (or type tag) |
| +8 | ptr | insn_list | Instruction doubly-linked list head/sentinel |
| +16 | ptr | insn_tail | Instruction list tail (for O(1) append) |
| +24 | u32 | insn_count | Number of instructions in the block |
| +28 | u32 | flags_a | Block attribute flags (see below) |
| +104 | ptr | bb_next | Linked-list link to next BasicBlock in function |
| +108 | u8 | opcode_flags | Terminator opcode classification bits |
| +128 | ptr | succ_list | Linked list of successor block references |
| +136 | ptr | pred_list | Linked list of predecessor block references |
| +144 | i32 | rpo_number | Reverse post-order number (set by RPO computation) |
| +152 | i32 | label_id | Label / source line identifier (displayed as L%x in DOT) |

The insn_list at +8 is the head of a doubly-linked list. Each instruction node has a next pointer at offset +8 of the instruction object. The sentinel/end is detected by comparing the current node pointer against the tail stored in the BasicBlock or against a per-block sentinel address.

Successor/Predecessor Lists

Both succ_list (+128) and pred_list (+136) are singly-linked lists of small nodes. Each node contains:

| Offset | Type | Field |
|---|---|---|
| +0 | ptr | next pointer (NULL = end of list) |
| +8 | i32 | Block index of the referenced block |

Iteration pattern (from sub_78B430 -- LoopStructurePass):

// Walk predecessor list
PredNode* pred = *(PredNode**)(bb + 136);
while (pred) {
    BasicBlock* pred_bb = *(BasicBlock**)(*(ctx + 296) + 8 * pred->bix);
    int pred_rpo = *(int*)(pred_bb + 144);
    // ...
    pred = pred->next;
}

// Walk successor list
SuccNode* succ = *(SuccNode**)(bb + 128);
while (succ) {
    BasicBlock* succ_bb = *(BasicBlock**)(*(ctx + 296) + 8 * succ->bix);
    // ...
    succ = succ->next;
}

CFG Edge Hash Maps

In addition to the per-block predecessor/successor linked lists, the scheduling context maintains two global FNV-1a hash maps for fast edge queries. These are the primary edge representation used by RPO computation, backedge detection, and the scheduling pass.

Successor Edge Map (Code Object +648)

Maps block index to a set of successor block indices. Used by CFG::computeRPO (sub_BDE150), CFG::printEdges (sub_BDE8B0), and CFG::buildAndAnalyze (sub_BE0690).

Backedge Map (Code Object +680)

Maps block index to the set of backedge targets. A backedge exists when block bix_src has a successor bix_dst where RPO(bix_dst) <= RPO(bix_src) -- i.e., the successor was visited before the source in the DFS traversal, indicating a loop.

FNV-1a Hash Parameters

All CFG hash lookups use identical parameters, confirmed across 50+ call sites:

| Parameter | Value |
|---|---|
| Initial hash | 0x811C9DC5 |
| FNV prime | 16777619 (0x01000193) |
| Key size | 4 bytes (block index) |
| Hash method | Byte-by-byte XOR-fold |

The hash computation for a 32-bit block index bix:

uint32_t hash = 0x811C9DC5;
hash = 16777619 * (hash ^ (bix & 0xFF));
hash = 16777619 * (hash ^ ((bix >> 8) & 0xFF));
hash = 16777619 * (hash ^ ((bix >> 16) & 0xFF));
hash = 16777619 * (hash ^ ((bix >> 24) & 0xFF));
uint32_t bucket = hash & (num_buckets - 1);

Hash Map Structure

HashMap:
  +0   ptr   first_free_node    // Free list for node recycling
  +8   ptr   node_arena         // Pool allocator for new nodes
  +16  ptr   bucket_array       // Array of 24-byte bucket headers
  +24  u64   num_buckets        // Power of two, initial = 8
  +32  i32   total_elements     // Total entries across all buckets
  +36  i32   num_unique_keys    // Distinct keys inserted

Bucket (24 bytes):
  +0   ptr   head               // First node in collision chain
  +8   ptr   tail               // Last node in collision chain
  +16  i32   count              // Number of nodes in this bucket

Full Node (64 bytes, for edge maps):
  +0   ptr   next               // Chain link within bucket
  +8   i32   key                // Block index (bix)
  +12  i32   value_info         // Edge count or flags
  +16  ptr   value_array        // Pointer to sub-array of successor indices
  +24  i32   value_count        // Number of successors in sub-array
  +32  ptr   sub_hash_data      // Embedded sub-hash for multi-edge blocks
  +40  u64   sub_hash_size      // Sub-hash capacity
  +56  u32   cached_hash        // Cached FNV-1a hash of key

Simple Node (16 bytes, for backedge set membership):
  +0   ptr   next               // Chain link within bucket
  +8   i32   key                // Block index
  +12  u32   cached_hash        // Cached hash

Growth policy: rehash when total_elements > num_unique_keys (load factor > 1.0). New capacity = 4 * old_bucket_count. Hash map insert/find is implemented at sub_BDED20 (full nodes, 64 bytes) and sub_BDF480 (simple nodes, 16 bytes).

CFG Construction: Phase 3 (AnalyzeControlFlow)

AnalyzeControlFlow is phase 3 in the 159-phase optimizer pipeline. It runs immediately after the parser builds the initial instruction list and before any optimization. This phase:

  1. Populates the successor edge hash table at Code Object +648 by scanning the last instruction of each basic block. Branch instructions (opcode 67 = BRA, opcode 77 = EXIT; the code also checks opcodes 93 and 95 which are OUT_FINAL and STS respectively in the ROT13 name table but serve as internal control-flow markers in this context) provide the target block indices.
  2. Computes the backedge map at Code Object +680 by identifying edges whose target occupies an earlier or equal position in the DFS traversal (i.e., has a lower or equal RPO number).
  3. Builds the reverse post-order (RPO) array at Code Object +720 via iterative DFS.
  4. Identifies loop headers and backedges for later loop optimization passes.

The phase is critical because the Bison parser constructs basic blocks and instruction lists incrementally. AnalyzeControlFlow ensures the CFG is fully consistent and annotated before optimization begins.

Phase 6: SetControlFlowOpLastInBB

Phase 6 enforces a structural invariant: control flow operations must be the last instruction in their basic block. If a branch, jump, return, or exit instruction is followed by other instructions in the same block (which can happen during lowering passes), this phase splits the block at the control-flow instruction. New basic block entries are allocated and the instruction linked list is rewritten.

This invariant is required by all downstream passes -- the scheduler and register allocator assume that only the final instruction in a block can be a control-flow transfer.

Reverse Post-Order (RPO) Computation

RPO is computed by sub_BDE150 (CFG::computeRPO), a 9KB function that implements iterative DFS using an explicit stack.

RPO Storage

| Code Object Offset | Type | Field |
|---|---|---|
| +720 | ptr | rpo_array -- int*, indexed by RPO position |
| +728 | i32 | rpo_size -- number of entries used |
| +732 | i32 | rpo_capacity -- allocated capacity |

The array is resized with the standard ptxas growth policy: new_capacity = old + (old + 1) / 2, with a minimum of num_blocks + 1. Growth is implemented in sub_BDFB10.

Algorithm

The RPO computation uses a standard iterative DFS with post-order numbering:

function computeRPO(cfg, entry_block):
    stack = [entry_block]               // Explicit stack at offset +88..+100
    visited = new BitArray(num_blocks)  // At offset +16..+40
    numbered = new BitArray(num_blocks) // At offset +40
    counter = num_blocks                // Decremented as blocks complete

    while stack is not empty:
        bix = stack.top()
        if visited[bix]:
            stack.pop()
            if not numbered[bix]:          // a block may be pushed more than once
                numbered[bix] = true
                rpo_number[bix] = counter  // *(cfg+64)[bix] = counter
                rpo_array[counter] = bix   // *(*(cfg+720))[counter] = bix
                counter--
            continue

        visited[bix] = true
        for each successor s of bix (via hash map lookup):
            if not visited[s]:
                stack.push(s)

    return counter  // -1 if all blocks are reachable

The key assignment line from the decompilation:

*(_DWORD *)(*(_QWORD *)(a1 + 64) + 4 * v16) = *a3;           // rpo_number[bix] = counter
*(_DWORD *)(*(_QWORD *)(*(_QWORD *)a1 + 720) + 4 * (*a3)--) = v16;  // rpo_array[counter--] = bix

After completion, rpo_array[0] is the entry block, and rpo_array[num_blocks] is the last block in reverse post-order (typically the EXIT block).

RPO Debug Dump

sub_BDEA50 (CFG::dumpRPOAndBackedges) prints the RPO state:

Showing RPO state for each basic block:
    bix0 -> RPONum: 0
    bix1 -> RPONum: 1
    bix2 -> RPONum: 3
    bix3 -> RPONum: 2
RPO traversal order: [0, 1, 3, 2]
Showing backedge info:
    bix2 -> backedge's successor BB: 1

This output is gated by option flag #24 at offset +1728 relative to the options manager.

Backedge Detection and Loop Identification

Backedges are identified during CFG::buildAndAnalyze (sub_BE0690, 54KB). A backedge from block src to block dst exists when dst has already been visited in the DFS traversal (i.e., dst has an RPO number less than or equal to that of src). Backedges are stored in the hash map at Code Object +680.

Natural Loop Detection

The LoopStructurePass (sub_78B430) combines RPO numbering with backedge analysis to identify natural loops:

  1. Calls sub_781F80 (BasicBlockAnalysis) to compute RPO numbers and dominance.
  2. Iterates the bb_array at Code Object +296.
  3. For each block, checks if rpo_number (+144) is non-zero and equals the value at +152 (loop exit RPO marker). Combined with a branch opcode check ((opcode & 0xFFFFFFFD) == 0x5D, which matches opcodes 93 (0x5D) and 95 (0x5F), the internal control-flow markers), this identifies loop header blocks.
  4. Walks the predecessor list to find the backedge source -- the predecessor with the largest RPO number that is still less than the header's RPO.
  5. Walks the successor list to find the loop latch -- the successor with the smallest RPO number greater than the loop preheader's RPO.

The RPO range [header_rpo, exit_rpo] defines the set of blocks belonging to the loop body. A block with header_rpo <= block_rpo <= exit_rpo is inside the loop.

LoopMakeSingleEntry Transformation

If a natural loop has multiple entry points, sub_78B430 transforms it into a single-entry loop. This is gated by:

  • The LoopMakeSingleEntry pass-disable check (via sub_799250)
  • Knob 487 (queried via the knob vtable at +152)

Two code paths handle different branch types:

  • Opcode 93 (OUT_FINAL in the ROT13 name table; used here as a control-flow boundary marker): Calls sub_9253C0 to rewrite the branch target
  • Conditional branches: Calls sub_748BF0 to insert a new preheader block and redirect edges

After transformation, sub_931920 is called to split blocks and update the instruction list.

Dominance

Dominance is computed by sub_BE2330 (4KB) and/or within sub_781F80 (12KB, BasicBlockAnalysis). The implementation uses bitvector operations -- each block has a bitvector of dominators, and the fixpoint iteration proceeds in RPO order.

The bitvector layout used by the dominator computation:

| Offset | Type | Field |
|---|---|---|
| +0 | ptr | data -- pointer to uint32_t[] words |
| +8 | i32 | word_count |
| +12 | i32 | capacity |
| +16 | i32 | bit_count |

Evidence for an iterative dataflow approach (rather than Lengauer-Tarjan) comes from the function sizes and patterns: sub_781F80 at 12KB and sub_BE2330 at 4KB are both small enough that they likely implement the simple iterative algorithm:

dom[entry] = {entry}
for all other blocks b: dom[b] = all_blocks

repeat until no changes:
    for each block b in RPO order (skip entry):
        dom[b] = {b} union (intersection of dom[p] for all predecessors p)

This is adequate for the small CFGs typical of GPU kernels (rarely exceeding a few hundred blocks). The O(n^2) worst case is not a concern at GPU kernel scale.

Block Layout: Phase 112 (PlaceBlocksInSourceOrder)

Phase 112 (PlaceBlocksInSourceOrder) runs in the post-scheduling stage of the pipeline, after register allocation and before Mercury encoding. It reorders the basic block array to restore source-order layout.

The implementation at sub_A92C50 (3.5KB binary, ~19KB decompiled) manipulates linked list structures and uses hash table lookups to reorder blocks. The goal is to minimize branch distances in the final SASS output -- placing fall-through successors immediately after their predecessors.

Hot/Cold Block Layout

Two companion phases handle hot/cold partitioning:

| Phase | Name | Purpose |
|---|---|---|
| 108 | OptimizeHotColdInLoop | Moves cold blocks out of loop bodies |
| 109 | OptimizeHotColdFlow | Global hot/cold block separation |

Cold blocks (e.g., error handlers, unlikely branches, assert paths) are moved to the end of the function's block sequence. The MarkAdditionalColdBlocks pass marks blocks as cold based on heuristics. This separation improves instruction cache utilization on the GPU's SM instruction fetch unit.

BB Merge Suppression

The --dont-merge-basicblocks (alias -no-bb-merge) CLI flag prevents the optimizer from merging consecutive basic blocks. It exists for debuggable code: once blocks are merged, the debugger can no longer set breakpoints at the original source line boundaries. The flag is documented in the binary as:

"Normally, ptxas attempts to merge consecutive basic blocks as part of its optization process. However, for debuggable code this is very confusing. This option prevents basic block merging, at a slight perfomance cost."

(Note: "optization" and "perfomance" are typos in the original binary string.)

Entry and Exit Blocks

Block index 0 (bix0) is always the function entry block. It is the first element in the bb_array and the root of the RPO traversal. The entry block has no predecessors (its predecessor list at +136 is NULL).

The exit block is the block containing the EXIT instruction (opcode 77 = EXIT in the ROT13 name table). For functions with multiple exit points, each EXIT-containing block is a CFG sink. The RPO computation assigns these the highest RPO numbers. The SetControlFlowOpLastInBB phase (phase 6) ensures each EXIT is the final instruction in its block.

The CFG::buildAndAnalyze function (sub_BE0690) checks the terminator opcode at instruction offset +28. Opcodes 4 and 7 (internal control-flow opcodes) receive special treatment during edge construction:

| Opcode | Type | Edge behavior |
|---|---|---|
| 4 | Unconditional branch | Single successor edge to target block |
| 7 | Conditional branch | Two successor edges (taken + fall-through) |
| 93 | OUT_FINAL | ROT13 name is OUT_FINAL; used as a control-flow boundary marker in CFG construction |
| 95 | STS | ROT13 name is STS; used as a control-flow terminator marker in CFG construction |

CFG Update Protocol

Passes that modify the CFG (block splitting, merging, edge redirection) must maintain consistency across several data structures:

  1. Block array -- both the Code Object bb_array (+296) and the scheduling block_info (+976) must be updated.
  2. Predecessor/successor linked lists -- the per-block lists at +128 and +136 must reflect the new edges.
  3. Edge hash maps -- the successor map (+648) and backedge map (+680) must be invalidated or updated.
  4. RPO array -- the RPO order at +720 must be recomputed after structural changes.
  5. Block count -- both bb_count (+304) and num_blocks (+984) must be incremented.

The general pattern observed in sub_931920 (block splitter called from sub_78B430):

function splitBlock(ctx, bb, split_point):
    new_bb = allocateBasicBlock()
    
    // Move instructions after split_point to new_bb
    new_bb->insn_list = split_point->next
    bb->insn_list_tail = split_point
    split_point->next = sentinel
    
    // Transfer successors from bb to new_bb
    new_bb->succ_list = bb->succ_list
    bb->succ_list = new_node(new_bb->bix)
    
    // Update predecessor lists of old successors
    for each succ in new_bb->succ_list:
        replace bb in succ->pred_list with new_bb
    
    // new_bb's only predecessor is bb
    new_bb->pred_list = new_node(bb->bix)
    
    // Invalidate and recompute RPO
    ctx->bb_count++
    recomputeRPO(ctx)

The AnalyzeControlFlow phase (phase 3) is explicitly re-run or incrementally updated after phases that modify the CFG structure. The phase pipeline contains multiple OriPerformLiveDead and GeneralOptimize passes that may rebuild portions of the CFG.

Key CFG Functions

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_62BB00 | 16.5KB | BasicBlock::allocate -- allocates 136-byte block, initializes fields | HIGH |
| sub_781F80 | 12KB | BasicBlockAnalysis -- RPO, loop detection, dominance | MEDIUM |
| sub_78B430 | 1.2KB | LoopStructurePass -- single-entry loop transformation | HIGH |
| sub_BDE150 | 9KB | CFG::computeRPO -- iterative DFS with explicit stack | HIGH |
| sub_BDE6C0 | 3KB | HashMap::erase -- remove node from edge hash map | MEDIUM |
| sub_BDE8B0 | 2KB | CFG::printEdges -- prints "bix%d -> bix%d\n" | HIGH |
| sub_BDEA50 | 4KB | CFG::dumpRPOAndBackedges -- RPO + backedge debug dump | HIGH |
| sub_BDED20 | 12KB | HashMap::insertOrFind -- full 64-byte node insert | HIGH |
| sub_BDF480 | 10KB | HashMap::insertOrFind_simple -- 16-byte node insert | HIGH |
| sub_BDFB10 | 24KB | CFG::buildBlockMap -- block array init, RPO resize | MEDIUM |
| sub_BE0690 | 54KB | CFG::buildAndAnalyze -- master CFG builder | HIGH |
| sub_BE21D0 | 1.4KB | CFG::dumpDOT -- Graphviz DOT format output | HIGH |
| sub_BE2330 | 4KB | CFG::computeDominators -- bitvector-based dominance | MEDIUM |
| sub_A92C50 | 3.5KB | PlaceBlocksInSourceOrder -- block reordering (phase 112) | MEDIUM |

CFG Visualization

The CFG::dumpDOT function (sub_BE21D0) generates Graphviz DOT output when option flag #20 is enabled (offset +1440 from the options manager). The output format:

digraph f {
    node [fontname="Courier",fontsize=10,shape=Mrecord];
    "bix0"
    [label="bix0(L0)"]
    bix0 -> bix1
    bix0 -> bix3
    "bix1"
    [label="bix1(L10)"]
    bix1 -> bix2
    "bix2"
    [label="bix2(L20)"]
    bix2 -> bix1
    bix2 -> bix3
    "bix3"
    [label="bix3(L30)"]
}

Where L%x is the label identifier at BasicBlock +152. This can be converted to a visual graph with dot -Tpng.

If option flag #24 is also enabled (offset +1728), the RPO and backedge dump from sub_BDEA50 is appended.

Register Model (R / UR / P / UP)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas models four hardware register files plus two auxiliary barrier register files. Every Ori instruction references registers from one or more of these files. During the optimization phases (0--158), registers carry virtual numbers; the fat-point register allocator (phase 159+) maps them to physical hardware slots. This page documents the register files, the virtual/physical register descriptor, the 7 allocator register classes, wide register conventions, special registers, the operand encoding format, pressure tracking, and SM-specific limits.

Four Register Files

| File | Mnemonic | Width | Usable range | Zero/True | ABI type | Introduced |
|---|---|---|---|---|---|---|
| R | General-purpose | 32 bits | R0 -- R254 | RZ (R255) | 2 | sm_30 |
| UR | Uniform | 32 bits | UR0 -- UR62 | URZ (UR63) | 3 | sm_75 |
| P | Predicate | 1 bit | P0 -- P6 | PT (P7) | 5 | sm_30 |
| UP | Uniform predicate | 1 bit | UP0 -- UP6 | UPT (UP7) | -- | sm_75 |

R registers are per-thread 32-bit general-purpose registers. They hold integers, floating-point values, and addresses. 64-bit values occupy consecutive even/odd pairs (R4:R5); 128-bit values occupy aligned quads (R0:R1:R2:R3). The total R-register count for a function is field[159] + field[102] (reserved + allocated), stored in the Code Object at offsets +159 and +102. Maximum usable: 254 (R0--R254). R255 is the hardware zero register RZ -- reads return 0, writes are discarded.

UR registers (uniform general-purpose) are warp-uniform: every thread in a warp sees the same value. Available on sm_75 and later. Range: UR0--UR62 usable, UR63 is the uniform zero register URZ. The UR count is at Code Object +99. Attempting to use UR on pre-sm_75 targets triggers the diagnostic "Uniform registers were disallowed, but the compiler required (%d) uniform registers for correct code generation.".

P registers are 1-bit predicates used for conditional execution (@P0 FADD ...) and branch conditions. P0--P6 are usable; P7 is the hardwired always-true predicate PT. Writes to PT are discarded. The assembler uses PT as the default predicate for unconditional instructions. In the allocator, predicate registers support half-width packing: two virtual predicates can be packed into one physical predicate slot, with the hi/lo distinction stored in bit 23 (0x800000) of the virtual register flags.

UP registers are the uniform predicate variant. UP0--UP6 are usable; UP7 is UPT (always-true). Available on sm_75+.

Seven Allocator Register Classes

The fat-point allocator processes 7 register classes, indexed by the reg_type field at vreg+64. Class 0 is the cross-class constraint propagation channel and is skipped in the main allocation loop. Classes 1--6 are allocated independently, in order. The allocator distribution loop in sub_9721C0 (lines 520--549) reads *(int*)(vreg+64) and uses it directly as the class bucket index, guarded by reg_type <= 6:

| Class ID | Name | Width | HW limit | Description |
|---|---|---|---|---|
| 0 | (unified) | -- | -- | Cross-class constraint propagation (skipped) |
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (used for RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers for MMA/WGMMA operations |

The class ID is the reg_type value stored at vreg+64. The allocator class at vreg+12 is a separate field used for instruction-level classification, not for the per-class allocation passes. The allocator's per-class linked lists at alloc[3*reg_type + 138] are populated directly from vreg+64.

Per-class state is initialized via the target descriptor vtable call vtable[896](alloc_state, class_id), which populates per-class register file descriptors at alloc[114..156] (four 8-byte entries per class).

Barrier Registers

Barrier registers (B and UB) are a distinct register file used by the BAR, DEPBAR, BSSY, and BSYNC instructions for warp-level and CTA-level synchronization. B0--B15 are the non-uniform barrier registers; UB0--UB15 are the uniform variant. Barrier registers have reg_type = 9, which is above the <= 6 cutoff for the main allocator class buckets. They are handled by a separate allocation mechanism outside the 7-class system.

Tensor/Accumulator Registers (Class 6)

Class 6 registers are created during intrinsic lowering of tensor core operations (MMA, WGMMA, HMMA, DMMA). Over 30 intrinsic lowering functions in the 0x6B--0x6D address range call sub_91BF30(ptr, ctx, 6) to create these registers. The GMMA pipeline pass (sub_ADA740, sub_69E590) identifies accumulator operands by checking *(vreg+64) == 6. The accumulator counting function at sub_78C6B0 uses the pair-mode bits at vreg+48 (bits 20--21) to determine whether a type-6 register consumes 1 or 2 physical R slots.

Virtual Register Descriptor

Every virtual register in a function is represented by a 160-byte descriptor allocated from the per-function arena. The register file array is at Code Object +88, indexed as *(ctx+88) + 8*regId. The descriptor is created by sub_91BF30 (register creation function).

Descriptor Layout

| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 8 | ptr | next | Linked list pointer (allocation worklist) |
| +8 | 4 | i32 | id | Unique register ID within function |
| +12 | 4 | i32 | class_index | Allocator register class (0--6) |
| +20 | 1 | u8 | flags_byte | Bit 0x20 = live |
| +24 | 4 | i32 | bb_index | Basic block of definition |
| +28 | 4 | i32 | epoch | Epoch counter for liveness tracking |
| +32 | 8 | ptr | alias_next | Next aliased register (coalescing chain) |
| +36 | 8 | ptr | alias_parent | Coalesced parent pointer |
| +40 | 4 | f32 | spill_cost | Accumulated spill cost |
| +48 | 8 | u64 | flags | Multi-purpose flag word (see below) |
| +56 | 8 | ptr | def_instr | Defining instruction pointer |
| +64 | 4 | i32 | reg_type | Register file type enum |
| +68 | 4 | i32 | physical_reg | Physical register number (-1 = unassigned) |
| +72 | 1 | u8 | size | 0 = scalar, nonzero = encoded width |
| +76 | 4 | f32 | secondary_cost | Secondary spill cost |
| +80 | 4 | i32 | spill_flag | 0 = not spilled, 1 = spilled |
| +97 | 2 | u16 | reserved | |
| +104 | 8 | ptr | use_chain | Use chain head (instruction pointer) |
| +112 | 8 | ptr | def_chain | Definition chain |
| +120 | 8 | ptr | regfile_next | Next in register file linked list |
| +128 | 8 | ptr | linked_next | Next in linked-register chain |
| +136 | 8 | ptr | reserved2 | |
| +144 | 8 | ptr | constraint_list | Constraint list head for allocator |
| +152 | 8 | ptr | reserved3 | |

Initial values set by the constructor (sub_91BF30):

vreg->next           = NULL;            // +0
vreg->id             = ctx->reg_count + 1;  // +8, auto-incrementing
vreg->class_index    = 0;               // +12
vreg->flags_byte     = 0;               // +20
vreg->alias_parent   = (ptr)-1;         // +20..27 (qword write)
vreg->physical_reg   = -1;              // +68 (unassigned)
vreg->reg_type       = a3;              // +64 (passed as argument)
vreg->size           = 0;               // +72
vreg->spill_flag     = 0;               // +80
vreg->use_chain      = NULL;            // +104
vreg->def_chain      = NULL;            // +112
vreg->constraint_list = NULL;           // +144

For predicate types (a3 == 2 or a3 == 3), the flags word at +48 is initialized to 0x1000 (4096). For all other types, it is initialized to 0x1018 (4120). If the type is 7 (alternate predicate classification), the physical register is initialized to 0 instead of -1.

Flag Bits at +48

| Bit | Mask | Meaning |
|---|---|---|
| 9 | 0x200 | Pre-assigned / fixed register |
| 10 | 0x400 | Coalesced source |
| 11 | 0x800 | Coalesced target |
| 12 | 0x1000 | Base flag (set for all types) |
| 14 | 0x4000 | Spill marker (already spilled) |
| 18 | 0x40000 | Needs-spill (allocator sets when over budget) |
| 20--21 | (pair mode) | 0 = single, 1 = lo-half of pair, 3 = double-width |
| 22 | 0x400000 | Constrained to architecture limit |
| 23 | 0x800000 | Hi-half of pair (predicate half-width packing) |
| 27 | 0x8000000 | Special handling flag |

Register File Type Enum (at +64)

This enum determines the register file a VR belongs to. It is used by the register class name table at off_21D2400 to map type values to printable strings ("R", "UR", "P", etc.) for diagnostic output such as "Referencing undefined register: %s%d".

| Value | File | Alloc class | Description |
|---|---|---|---|
| 1 | R | 1 | General-purpose register (32-bit) |
| 2 | R (alt) | 2 | GPR variant (RZ sentinel in sub_7D82E0, stat collector alternate) |
| 3 | UR | 3 | Uniform register (32-bit) |
| 4 | UR (ext) | 4 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 5 | Predicate register (1-bit); covers both P and UP |
| 6 | Tensor/Acc | 6 | Tensor/accumulator register for MMA/WGMMA operations |
| 7 | P (alt) | -- | Predicate variant (physical = 0 at init); above allocator cutoff |
| 8 | -- | -- | Extended type (created by sub_83EF00); above allocator cutoff |
| 9 | B / UB | -- | Barrier register; above allocator cutoff, separate allocation |
| 10 | R2 | -- | Extended register pair (64-bit, two consecutive R regs) |
| 11 | R4 | -- | Extended register quad (128-bit, four consecutive R regs) |

Values 0--6 are within the allocator's class system (the distribution loop in sub_9721C0 guards with reg_type <= 6). Values 7+ are handled by separate mechanisms. The off_21D2400 name table is indexed by reg_type and provides display strings for diagnostic output.

The stat collector at sub_A60B60 (24 KB) enumerates approximately 25 register sub-classes including R, P, B, UR, UP, UB, Tensor/Acc, SRZ, PT, RZ, and others by iterating vtable getter functions per register class.

Wide Registers

NVIDIA GPUs have only 32-bit physical registers. Wider values are composed from consecutive registers.

64-Bit Pairs (R2)

A 64-bit value occupies two consecutive registers where the base register has an even index: R0:R1, R2:R3, R4:R5, and so on. The low 32 bits reside in the even register; the high 32 bits in the odd register. In the Ori IR, a 64-bit pair is represented by a single virtual register with:

  • vreg+64 (type) = 10 (extended pair)
  • vreg+48 bits 20--21 (pair mode) = 3 (double-width)

The allocator selects even-numbered physical slots by scanning with stride 2 instead of 1. The register consumption function (sub_939CE0) computes slot + (1 << (pair_mode == 3)) - 1, consuming two physical slots.

128-Bit Quads (R4)

A 128-bit value occupies four consecutive registers aligned to a 4-register boundary: R0:R1:R2:R3, R4:R5:R6:R7, etc. Used by texture instructions, wide loads/stores, and tensor core operations. In the Ori IR:

  • vreg+64 (type) = 11 (extended quad)
  • Allocator scans with stride 4

Alignment Constraints

| Width | Base alignment | Stride | Example |
|---|---|---|---|
| 32-bit (scalar) | Any | 1 | R7 |
| 64-bit (pair) | Even | 2 | R4:R5 |
| 128-bit (quad) | 4-aligned | 4 | R8:R9:R10:R11 |

The texture instruction decoder (sub_1170920) validates even-register alignment via a dedicated helper (sub_1170680) that checks if a register index falls within the set {34, 36, 38, ..., 78} and returns 0 if misaligned.

The SASS instruction encoder for register pairs (sub_112CDA0, 8.9 KB) maps 40 register pair combinations (0/1, 2/3, ..., 78/79) to packed 5-bit encoding values at 0x2000000 (33,554,432) intervals.

Special Registers

Zero and True Registers

| Register | File | Index | Internal sentinel | Behavior |
|---|---|---|---|---|
| RZ | R | 255 | 1023 | Reads return 0; writes discarded |
| URZ | UR | 63 | 1023 | Uniform zero; reads return 0 |
| PT | P | 7 | 31 | Always-true predicate; writes discarded |
| UPT | UP | 7 | 31 | Uniform always-true |

The internal sentinel value 1023 (0x3FF) represents "don't care" or "zero register" throughout the Ori IR and allocator. During SASS encoding, hardware register index 255 is mapped to sentinel 1023 for R/UR files, and hardware index 7 is mapped to sentinel 31 for P/UP files. These sentinels are checked in encoders to substitute the default register value:

// Decoder: extract register operand (sub_9B3C20)
if (reg_idx == 255)
    internal_idx = 1023;   // RZ sentinel

// Decoder: extract predicate operand (sub_9B3D60)
if (pred_idx == 7)
    internal_idx = 31;     // PT sentinel

// Encoder: emit register field
if (reg == 1023)
    use *(a1+8) as default;  // encode physical RZ

Architectural Predicate Indices

The allocator skips architectural predicate registers by index number:

| Index | Register | Treatment |
|---|---|---|
| 39 | (special) | Skipped during allocation (skip predicate sub_9446D0) |
| 41 | PT | Skipped -- hardwired true predicate |
| 42 | P0 | Skipped -- architectural predicate |
| 43 | P1 | Skipped -- architectural predicate |
| 44 | P2 | Skipped -- architectural predicate |

The skip check in sub_9446D0 returns true (skip) for register indices 41--44 and 39, regardless of register class. For other registers, it checks whether the instruction is a CSSA phi (opcode 195 with barrier type 9) or whether the register is in the exclusion set hash table at alloc+360.

Special System Registers (S2R / CS2R)

Thread identity and hardware state are accessed through the S2R (Special Register to Register) and CS2R (Control/Status Register to Register) instructions. These read read-only hardware registers into R-file registers.

Common system register values (from PTX parser initialization at sub_451730):

| PTX name | Hardware | Description |
|---|---|---|
| %tid / %ntid | SR_TID_X/Y/Z | Thread ID within CTA |
| %ctaid / %nctaid | SR_CTAID_X/Y/Z | CTA ID within grid |
| %laneid | SR_LANEID | Lane index within warp (0--31) |
| %warpid / %nwarpid | SR_WARPID | Warp index within CTA |
| %smid / %nsmid | SR_SMID | SM index |
| %gridid | SR_GRIDID | Grid identifier |
| %clock / %clock_hi / %clock64 | SR_CLOCK / SR_CLOCK_HI | Cycle counter |
| %lanemask_eq/lt/le/gt/ge | SR_LANEMASK_* | Lane bitmask variants |

The S2R register index must be between 0 and 255 inclusive, enforced by the string "S2R register must be between 0 and 255 inclusive". Special system register ranges are tracked at Code Object offsets +1712 (start) and +1716 (count).

Operand Encoding in Ori Instructions

Each instruction operand is encoded as a 32-bit packed value in the operand array starting at instruction offset +84. The operand at index i is at *(instr + 84 + 8*i).

Packed Operand Format (Ori IR)

 31   30        28   27             20   19                              0
+----+------------+-------------------+---------------------------------+
|sign|  type (3)  |    modifier (8)   |            index (20)           |
+----+------------+-------------------+---------------------------------+
 bit 31:     sign/direction flag
 bits 30-28: operand type
 bits 27-20: modifiers (bit 24 = pair extension flag)
 bits 19-0:  register/symbol index

Extraction pattern (50+ call sites):

uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type    = (operand >> 28) & 7;     // bits 28-30
int index   = operand & 0xFFFFF;       // bits 0-19
int mods    = (operand >> 20) & 0xFF;  // bits 20-27
bool is_neg = (operand >> 31) & 1;     // bit 31
| Type value | Meaning |
|---|---|
| 1 | Register operand (index into register file at *(ctx+88) + 8*index) |
| 5 | Symbol/constant operand (index into symbol table at *(ctx+152)) |
| 6 | Special operand (barrier, system register) |

For register operands (type 1), the index is masked as operand & 0xFFFFFF (24 bits) to extract the full register ID. Indices 41--44 are architectural predicates that are never allocated.

SASS Instruction Register Encoding

During final SASS encoding, the register operand encoder (sub_7BC030, 814 bytes, 6147 callers) packs register operands into the 128-bit instruction word:

Encoded register field (16 bits at variable bit offset):
  bit 0:      presence flag (1 = register present)
  bits 1-4:   register file type (4 bits, 12 values)
  bits 5-14:  register number (10 bits)

The 4-bit register file type field in the SASS encoding maps the internal operand type tag to hardware encoding:

| Operand type tag | Encoded value | Register file |
|---|---|---|
| 1 | 0 | R (32-bit) |
| 2 | 1 | R pair (64-bit) |
| 3 | 2 | UR (uniform 32-bit) |
| 4 | 3 | UR pair (uniform 64-bit) |
| 5 | 4 | P (predicate) |
| 6 | 5 | (reserved) |
| 7 | 6 | (reserved) |
| 8 | 7 | B (barrier) |
| 16 | 8 | (extended) |
| 32 | 9 | (extended) |
| 64 | 10 | (extended pair) |
| 128 | 11 | (extended quad) |

The predicate operand encoder (sub_7BCF00, 856 bytes, 1657 callers) uses a different format: 2-bit predicate type, 3-bit predicate condition, and 8-bit value. It checks for PT (operand byte[0] == 14) and handles the always-true case.

Register-Class-to-Hardware Encoding

The function sub_1B6B250 (2965 bytes, 254 callers) implements the mapping from the compiler's abstract (register_class, sub_index) pair to hardware register numbers:

hardware_reg = register_class * 32 + sub_index

For example: class 0, index 1 returns 1; class 1, index 1 returns 33; class 2, index 1 returns 65. The guard wrapper sub_1B73060 (483 callers) returns 0 for the no-register case (class=0, index=0).

The register field writer (sub_1B72F60, 483 callers) packs the encoded register number into the 128-bit instruction word with the encoding split across two bitfields:

*(v2 + 12) |= (encoded_reg << 9) & 0x3E00;       // bits [13:9]
*(v2 + 12) |= (encoded_reg << 21) & 0x1C000000;   // bits [28:26]

Register Pressure Tracking

Scheduling Phase Pressure Counters

The scheduler maintains 10 per-block register pressure counters at offsets +4 through +40 of the per-BB scheduling record (72 bytes per basic block). At BB entry, these are copied into the scheduler context at context offsets +48 through +87. The counters track live register counts for each register class:

| BB record offset | Context offset (idx) | Register class |
|---|---|---|
| +4 | +48 (idx 12) | R (general-purpose) |
| +8 | +52 (idx 13) | P (predicate) |
| +12 | +56 (idx 14) | UR (uniform) |
| +16 | +60 (idx 15) | UP (uniform predicate) |
| +20 | +64 (idx 16) | B (barrier) |
| +24 | +68 (idx 17) | (arch-specific class 0) |
| +28 | +72 (idx 18) | (arch-specific class 1) |
| +32 | +76 (idx 19) | (arch-specific class 2) |
| +36 | +80 (idx 20) | (arch-specific class 3) |
| +40 | +84 (idx 21) | (arch-specific class 4 / control total) |

The spill cost analyzer (sub_682490, 14 KB) allocates two stack arrays (v94[511] and v95[538]) as per-register-class pressure delta arrays. For each instruction, it computes pressure increments and decrements based on the instruction's register operand definitions and uses.

The register pressure coefficient is controlled by knob 740 (double, default 0.045). The pressure curve function uses a piecewise linear model with parameters (4, 2, 6) via sub_8CE520.

Liveness Bitvectors

The Code Object maintains register liveness as bitvectors:

| Offset | Bitvector | Description |
|---|---|---|
| +832 | Main register liveness | One bit per virtual register; tracks which registers are live at the current program point |
| +856 | Uniform register liveness | Separate bitvector for UR/UP registers |

These bitvectors are allocated via sub_BDBAD0 (bitvector allocation, with size = register count + 1 bits) and manipulated via the SSE2-optimized bitvector primitives at sub_BDBA60 / sub_BDC180 / sub_BDCDE0 / sub_BDC300.

For each basic block during dependency graph construction (sub_A0D800, 39 KB), the per-block liveness is computed by iterating instructions and checking operand types ((v >> 28) & 7 == 1 for register operands), then updating the bitvector at +832 with set/clear operations.

Allocator Pressure Arrays

The fat-point allocator (sub_957160) uses two 512-DWORD (2048-byte) arrays per allocation round:

| Array | Role |
|---|---|
| Primary (v12[512]) | Per-physical-register interference count |
| Secondary (v225[512]) | Tie-breaking cost metric |

Both are zeroed with SSE2 vectorized _mm_store_si128 loops at the start of each round. For each VR being allocated, the pressure builder (sub_957020) walks the VR's constraint list and increments the corresponding physical register slots. The threshold (knob 684, default 50) filters out congested slots.

ABI Register Reservations

Reserved Registers

Registers R0--R3 are unconditionally reserved by the ABI across all SM generations. The diagnostic "Registers 0-3 are reserved by ABI and cannot be used for %s" fires if they are targeted by parameter assignment or user directives.

Minimum Register Counts by SM Generation

| SM generation | Value | SM targets | Minimum registers |
|---|---|---|---|
| 3 | (sm_target+372) >> 12 == 3 | sm_35, sm_37 | (no minimum) |
| 4 | == 4 | sm_50 -- sm_53 | 16 |
| 5 | == 5 | sm_60 -- sm_89 | 16 |
| 9 | == 9 | sm_90, sm_90a | 24 |
| >9 | > 9 | sm_100+ | 24 |

Violating the minimum emits warning 7016: "regcount %d specified below abi_minimum of %d".

Per-Class Hardware Limits

| Class | Limit | Notes |
|---|---|---|
| R | 255 | R0--R254 usable; controlled by --maxrregcount and --register-usage-level (0--10) |
| UR | 63 | UR0--UR62 usable; sm_75+ only |
| P | 7 | P0--P6 usable |
| UP | 7 | UP0--UP6 usable; sm_75+ only |
| B | 16 | B0--B15 |
| UB | 16 | UB0--UB15 |

The --maxrregcount CLI option sets a per-function hard ceiling for R registers. The --register-usage-level option (0--10, default 5) modulates the register allocation target: level 0 means no restriction, level 10 means minimize register usage as aggressively as possible. The per-class budget at alloc + 32*class + 884 reflects the interaction between the CLI limit and the optimization level.

The --device-function-maxrregcount option overrides the kernel-level limit for device functions when compiling with -c.

Dynamic Register Allocation (setmaxnreg)

sm_90+ (Hopper and later) supports dynamic register allocation through the setmaxnreg.inc and setmaxnreg.dec instructions, which dynamically increase or decrease the per-thread register count at runtime. ptxas tracks these as internal states setmaxreg.try_alloc, setmaxreg.alloc, and setmaxreg.dealloc. Multiple diagnostics guard correct usage:

  • "setmaxnreg.dec has register count (%d) which is larger than the largest temporal register count in the program (%d)"
  • "setmaxreg.dealloc/release has register count (%d) less than launch min target (%d) allowed"
  • "Potential Performance Loss: 'setmaxnreg' ignored to maintain minimum register requirements."

Pair Modes and Coalescing

The pair mode at vreg+48 bits 20--21 controls how the allocator handles wide registers:

| Pair mode | Value | Behavior |
|---|---|---|
| Single | 0 | Occupies one physical register slot |
| Lo-half | 1 | Low half of a register pair |
| Double-width | 3 | Occupies two consecutive physical slots |

The allocator computes register consumption via sub_939CE0:

consumption = slot + (1 << (pair_mode == 3)) - 1;
// single:  slot + 0  = slot (1 slot)
// double:  slot + 1  = slot+1 (2 slots)

The coalescing pass (sub_9B1200, 800 lines) eliminates copy instructions by merging the source and destination VRs into the same physical register. The alias chain at vreg+36 (coalesced parent) is followed during assignment (sub_94FDD0) to propagate the physical register through all aliased VRs:

alias = vreg->alias_parent;     // vreg+36
while (alias != NULL) {
    alias->physical_reg = slot;  // alias+68
    alias = alias->alias_parent; // alias+36
}

Register Name Table

The register class name table at off_21D2400 is a pointer array indexed by the register file type enum (from vreg+64). Each entry points to a string: "R", "UR", "P", "UP", "B", "UB", etc. This table is used by diagnostic functions:

  • sub_A4B9F0 (StatsEmitter::emitUndefinedRegWarning): "Referencing undefined register: %s%d" where %s is off_21D2400[*(vreg+64)] and %d is *(vreg+68) (physical register number).
  • sub_A60B60 (RegisterStatCollector::collectStats, 24 KB): Enumerates ~25 register sub-classes by iterating vtable getters, one per register class. The enumerated classes include R, P, B, UR, UP, UB, SRZ, PT, RZ, and others.
  • "Fatpoint count for entry %s for regclass %s : %d": Prints per-function per-class allocation statistics.

Key Functions

| Address | Size | Function | Description |
|---|---|---|---|
| sub_91BF30 | 99 lines | createVirtualRegister | Allocates 160-byte VR descriptor, initializes fields, appends to register file array |
| sub_9446D0 | 29 lines | shouldSkipRegister | Returns true for indices 41--44, 39 (architectural specials); checks CSSA phi and exclusion set |
| sub_A4B8F0 | 248B | emitInstrRegStats | Emits "instr/R-regs: %d instructions, %d R-regs" |
| sub_A4B9F0 | 774B | emitUndefinedRegWarning | Walks operands backward, formats "Referencing undefined register: %s%d" |
| sub_A60B60 | 4560B | collectRegisterStats | Enumerates ~25 register sub-classes via vtable getters |
| sub_7BC030 | 814B | encodeRegOperand | Packs register into SASS instruction: 1-bit presence + 4-bit type + 10-bit number |
| sub_7BCF00 | 856B | encodePredOperand | Packs predicate into SASS: 2-bit type + 3-bit condition + 8-bit value |
| sub_9B3C20 | -- | decodeRegOperand | Decoder helper: extracts register, maps 255 to 1023 (RZ) |
| sub_9B3D60 | -- | decodePredOperand | Decoder helper: extracts predicate, maps 7 to 31 (PT) |
| sub_1B6B250 | 2965B | regClassToHardware | Maps (class, sub_index) to hardware number: class * 32 + sub_index |
| sub_1B73060 | 19B | regClassToHardwareGuard | Guard wrapper: returns 0 for no-register case |
| sub_1B72F60 | 32B | writeRegField | Packs encoded register into instruction word bits [13:9] and [28:26] |
| sub_112CDA0 | 8.9KB | encodeRegisterPair | Maps 40 register pair combinations to 5-bit packed encoding values |
| sub_939CE0 | 23 lines | computeConsumption | Pair-aware register slot consumption counter |
| sub_94FDD0 | 155 lines | assignRegister | Commits physical register assignment, propagates through alias chain |
| sub_A0D800 | 39KB | buildDependencyGraph | Per-block dependency graph with register-to-instruction mapping |
| sub_A06A60 | 15KB | scheduleWithPressure | Per-block scheduling loop tracking live register set bitvector |
| sub_682490 | 14KB | computeRegPressureDeltas | Per-instruction register pressure delta computation |
| sub_B28E00 | -- | getRegClass | Returns register class (1023 = wildcard, 1 = GPR) |
| sub_B28E10 | -- | isRegOperand | Predicate: is this a register operand? |
| sub_B28E20 | -- | isPredOperand | Predicate: is this a predicate operand? |
| sub_B28E90 | -- | isUReg | Predicate: is this a uniform register? |

Opcode Register Class Table

Every Ori opcode carries an implicit register class contract: which register files its operands may reference, what data widths are valid, and which addressing modes apply. The function sub_6575D0 (49 KB, buildEncodingDescriptor) is the central dispatch that translates each instruction's opcode into a packed encoding descriptor consumed by the SASS encoder.

Function Signature

// sub_6575D0 -- buildEncodingDescriptor
// a1 = compiler context
// a2 = Ori instruction node pointer
// a3 = output: 4-DWORD packed encoding descriptor
char buildEncodingDescriptor(Context *a1, Instruction *a2, uint32_t *a3);

Architecture

The function is a two-level dispatch:

  1. Outer switch on the Ori opcode at *(instr->info + 8) -- 168 unique case values spanning opcodes 3 (IADD3) through 0xF5 (PIXLD).

  2. Inner encoding per opcode (or group): assigns an encoding category ID to a3[0], then calls the bitfield packers to fill a3[1..2] with register class attributes.

Two helper functions pack fields into the descriptor:

| Function | Role | Call count | Field ID range |
|---|---|---|---|
| sub_917A60 (packRegClassField) | Bitfield encoder -- field IDs 91--340 map to specific bit positions in a3[1] and a3[2] | 112 | 91--340 |
| sub_A2FF00 (packOperandField) | Alternate encoder for operand-level slots (data type, memory space) | 28 | 3--71 |

Encoding Category Assignment

The encoding category at a3[0] selects which SASS instruction format template the downstream per-SM encoder uses. Key mappings (opcode index to encoding category):

| Opcode(s) | SASS mnemonic | Category | Register class summary |
|---|---|---|---|
| 3 | IADD3 | 489 | R dest, R/UR sources, P carry |
| 4 | BMSK | 106 | R only |
| 5--6 | SGXT / LOP3 | 490--491 | R dest, R/UR sources |
| 7 | ISETP | 59 | P dest, R/UR sources + memory ordering fields |
| 8 | IABS | 60 | R dest, R source + memory ordering fields |
| 0x0E--0x10 | FSET/FSEL/FSETP | 510 | R/P dest, FP operation variant |
| 0x11/0x12/0x18 | FSETP/MOV/PRMT | 517 | FP comparison, combine, data width (IDs 288--299) |
| 0x15--0x16 | P2R/R2P | 524--525 | P-to-R or R-to-P conversion |
| 0x19 | VOTE | 526 | R dest, optional memory class |
| 0x1A | CS2R variant | 527 | UR source width (494--496), data type from a2+92 |
| 0x1B | CS2R_32 | 497 | Source width (494/495/496), predicate flag (ID 270) |
| 0x1E | IPA | 494 | Interpolation mode (440--442), flat/smooth (443/444) |
| 0x1F | MUFU | 501 | Subfunction (445--447), precision (450--459) |
| 0x20 | SHF | 502 | Direction (461--463), source class (464--466), clamp, data type |
| 0x21 | SHFL | 503 | Mode (470/471), operand classes (472--482) |
| 0x22--0x23 | I2I/I2IP | 55/56 | Integer conversion type (23 entries in dword_2026B20) |
| 0x28--0x2A | IPA/MUFU ext | 512 | Extended encoding variants (428--430) |
| 0x2B--0x2C | F2F/F2F_X | 513 | Conversion direction (432/433), saturation (434/435) |
| 0x2D | FRND | 516 | Rounding variant (526), mode (528/529) |
| 0x51--0x53 | AL2P, AL2P_IDX | 437--438 | Bindless flag (ID 148), predicate (ID 147) |
| 0x54--0x56 | BMOV_B/BMOV_R/BMOV | 423--424 | B-register class |
| 0x64--0x67 | SETLMEMBASE/ATOM | 156/463 | Atom-vs-red (ID 178), data width (ID 181) |
| 0x68 | BRX | 468 | Target (ID 190), call convention (IDs 191--192) |
| 0x6A/0x6C/0x6D | JMP/JMX/CALL | 469 | Control flow target class (ID 176) |
| 0x77--0x79 | BSSY/BREAK/BSYNC | 528--530 | Sync mode (ID 324), variant (ID 325) |
| 0x82 | NANOTRAP | 487 | Trap operation class (ID 257), has-source (ID 256) |
| 0x9E--0x9F | Hopper+ instrs | 535--536 | Hopper class A/B (IDs 337--338) |
| 0xAF--0xB2 | LD/ST variants | 431--446 | Full modifier set: uniform (91), pair (92--102) |
| 0xB8--0xBE | LDG/STG/LDL/STL | 449--456 | Cache policy (131), float mode (134), width (131) |
| 0xC1 | Conditional | 10/13 | Branch type (ID 167), divergent (ID 168) |
| 0xC8 | PRMT | 24 | Permute selector (ID 65/66) |
| 0xC9--0xD3 | Texture/surface | 61/455 | Texture data type (IDs 17/18), surface (IDs 19--22) |
| 0xD6--0xD7 | DMMA/CVTA | 515 | Direction (304), predicate (305), data type (306) |
| 0xDA--0xDB | SUATOM | 521/533 | Data width (326--331), sync mode (328) |
| 0xDC | SURED | 534 | Data width (331), type (335--336), sync (333) |
| 0xE0 | WGMMA | 500 | Data type (198), enable (199), barrier (201) |
| 0xF5 | PIXLD | 532 | Mode from dword_2026AA0 (ID 323) |

Extended Opcode Path (Memory/Atomic Sub-dispatch)

When the opcode falls in the 0xF6--0x10C range (memory/atomic extended instructions), a separate sub-dispatch applies. The function sub_44AC80 gates entry; sub_44AC60 and sub_44AC70 select among three encoding categories:

| Category | Gate function | Meaning |
|---|---|---|
| 441 | default | Base memory operation |
| 442 | sub_44AC60 true | Predicated memory variant |
| 443 | sub_44AC70 true | Extended memory variant |

Within each category, the sub-opcode selects register class fields:

| Sub-opcode | Register class (field 115) | Data width (field 113) |
|---|---|---|
| 0xF6/0xFF/0x106 | 69 (class A) | 60 (standard) |
| 0xF7/0x100/0x107 | 71 (class B) | 60 (standard) |
| 0xF8/0x102/0x109 | 0 (default) | 63 (wide) |
| 0xF9/0x103/0x10A | 0 (default) | 61 (narrow) |
| 0xFA/0x104/0x10B | 0 (default) | 62 (medium) |
| 0xFB | 0 (default) | 65 (type A) |
| 0xFC | 0 (default) | 66 (type B) |
| 0xFD | 0 (default) | 68 (type C) |
| 0xFE/0x105/0x10C | 0 (from table) | 64 (from dword_2026C30) |
| 0x101/0x108 | 72 (class C) | 60 (standard) |

Packed Descriptor Layout

The output descriptor a3 is nominally a 4-DWORD (16-byte) structure, though the recovered code also writes a fifth slot (a3[4]) for some opcodes:

| DWORD | Content |
|---|---|
| a3[0] | Encoding category ID (0--542) -- selects SASS format template |
| a3[1] | Packed bitfield: memory space (bits 0--3), address type (bits 4--7) |
| a3[2] | Packed bitfield: register class attributes (data width, type, modifiers) |
| a3[3] | Auxiliary flags (bit 1 = texture scope, bit 29 = special) |
| a3[4] | Operand count override (set to 12 for KILL/extended mem ops) |

Register Class Field Groups

The 112 calls to packRegClassField (sub_917A60) use field IDs organized into functional groups. Each field ID maps to a specific bit range in the output descriptor via a mask-and-OR encoding:

// Example: field 113 (data width) -- bits 7-9 of a3[2]
case 113:
    val = dword_21DEB20[a3_value - 61];  // 8-entry lookup
    a3[2] = (val << 7) | (a3[2] & 0xFFFFF87F);
    break;

// Example: field 91 (uniform flag) -- bit 16 of a3[2]
case 91:
    a3[2] = ((value == 1) << 16) | (a3[2] & 0xFFFEFFFF);
    break;
| Field group | IDs | Bits written | Purpose |
|---|---|---|---|
| Core class | 91--102 | a3[2] bits 5--22 | Uniform, pair, predicate, data type, saturate, negate, abs, complement |
| Data width | 113--117 | a3[2] bits 0--9 | Width code, uniform-mem, source regclass, type specifier, write-back |
| Load/store | 118--134 | a3[1] + a3[2] | Memory space, address type, cache policy, atomic op, scope, float mode |
| Texture/surface | 135--165 | a3[2] bits 1--31 | Texture type, dimension, LOD mode, ordering, acquire, scope hint |
| Control flow | 167--202 | a3[2] bits 1--6 | Branch type, divergent, WGMMA data type/enable/barrier |
| FP/conversion | 230--264 | a3[2] various | FP operation, comparison, combine, interpolation, MUFU, SHF, SHFL |
| Extended | 269--299 | a3[2] various | CS2R, FSETP, rounding, data type wide, destination regclass |
| Hopper/Blackwell | 304--340 | a3[2] various | DMMA, WGMMA, TMA hints, surface sync, Hopper-specific classes |

Sub-handler Functions

Complex opcode families delegate register class encoding to dedicated sub-functions:

| Function | Opcodes handled | Purpose |
|---|---|---|
| sub_650390 | TEX, TLD, texture family | Texture register class (sampler, coordinate, LOD) |
| sub_650220 | LDG, STG, LD, ST, ATOM, RED | Memory instruction register class |
| sub_651330 | FMUL (opcode 0x0D) | FP multiply register class |
| sub_650920 | LEA, special (0x09, 0x72, 0x74, 0x7A, 0x80, 0x81) | LEA / special instruction |
| sub_650A90 | I2I, F2F, conversions (0x24--0x27, 0xE2--0xEB) | Type conversion register class |
| sub_652190 | Branch/call (0x13, 0x14, 0x17) | Branch/call register class |
| sub_653B90 | Misc (0x0C) | Miscellaneous instruction |
| sub_650C80 | Memory barrier modifiers | Applied when (a2+56) & 0x4F0 is nonzero |
| sub_651A90 | Texture modifiers (0x83) | Applied before texture encoding |
| sub_62D5D0 | Memory space computation | Computes memory space tag from operand types |

Lookup Tables

The function references 28 static lookup tables that map instruction attribute values to register class encoding values. The principal recovered tables:

| Table | Size | Used by field(s) | Content |
|---|---|---|---|
| dword_21DEB80 | 5 | 94 | Data type encoding |
| dword_21DEB50 | 3 | 107, 115, 145, 157, 165 | 3-value encoding (reused across 5 fields) |
| dword_21DEB20 | 8 | 113 | Data width code |
| dword_21DEB00 | 7 | 116, 126, 131, 170 | Type encoding (reused across 4 fields) |
| dword_21DEAE0 | 5 | 119/123, 136, 143, 159 | Variant table (reused across 4 fields) |
| dword_21DEAA0 | 13 | 120 | Memory space code |
| dword_21DEA60 | 10 | 121, 135/151 | Address/texture type |
| dword_21DEA20 | 15 | 124/125 | Reduction type |
| dword_21DE9F0 | 6 | 129/130, 150 | Scope code |
| dword_2026C30 | 6 | 116 (ext path) | Sub-opcode to data type |
| dword_2026C80 | 20 | 165 (surface) | Surface operation codes |
| dword_2026E20 | 17 | 286 | Data type (wide) |
| dword_2026AC0 | 16 | 198 | WGMMA data type |
| dword_2026B20 | 23 | I2I conversion | Integer conversion type |

Data Structure Layouts

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page documents the key internal data structures in ptxas v13.0.88: the compilation context ("god object"), the Ori Code Object, symbol tables, constant/shared memory descriptors, the pool allocator's object model, and the generic container types (hash maps, linked lists, growable arrays) that underpin nearly every subsystem.

All offsets are byte offsets from the structure base unless otherwise noted. Types are inferred from decompiled access patterns. Field names are reverse-engineered -- the binary is stripped.

Compilation Context (the "God Object")

The compilation context is the central state object passed to every phase in the pipeline. It is not the Code Object (which is per-function); it is the per-compilation-unit container that owns the Code Object, the knob system, the output stream, the function list, and all per-pass configuration. The sub_7FBB70 (PerKernelEntry) function receives this as a1, and every phase's execute() receives it as the second argument.

The context is a polymorphic C++ object with a vtable at offset +0. It is allocated by the compilation driver and persists for the lifetime of a single compilation unit. Key observations:

  • The vtable at +0 provides 263+ virtual methods (vtable spans to offset 2104+)
  • The object is at least 1928 bytes based on the highest confirmed field access (+1928 = codegen_ctx)
  • The knob/options system is accessed through an indirection at +1664 (pointer to knob container object)
  • The output stream lives at +1440

Compilation Context Field Map

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +0 | vtable* | vtable | *(_QWORD *)a1 in every virtual dispatch |
| +8 | ptr | parent / driver_ctx | Back-pointer; sub_A3A7E0 reads v2 = *(a1+8) then v2[198] for Code Object |
| +80 | u32 | last_exit_code | sub_663C30: *(a1+80) = v2[462] |
| +96 | u32 | compile_unit_index | sub_663C30: *(a1+96) = 1 on first call |
| +139 | u8 | multi_function_flag | sub_663C30: if (!*(a1+139)) |
| +144 | ptr | name_table (via vtable+144) | sub_7FBB70: *(*a1 + 144) -> name lookup vtable |
| +296 | ptr | current_function | sub_7FBB70: *(*(a1+296) + 164) = function index |
| +368 | ptr | function_name_array | sub_7FBB70: *(a1+368 + 8*func_id) -> name object |
| +1144 | ptr | function_list_head | sub_663C30: linked list of function descriptors |
| +1160 | ptr | entry_list_head | sub_663C30: linked list of kernel entry descriptors |
| +1376 | u32 | scheduling_mode_flags | Bit 0x08 = forward, bit 0x10 = bidirectional |
| +1412 | i8 | compilation_flags_byte | sub_A3B080: *(char*)(a2+1412) < 0 |
| +1416 | u8 | output_detail_flags | sub_7FBB70: *(a1+1416) \|= 0x80; bits 4-5 control latency reporting mode |
| +1418 | u8 | codegen_mode_flags | sub_A3B080: *(a2+1418) & 4 |
| +1428 | i32 | function_index | sub_7FBB70: *(a1+1428) < 0 means first invocation |
| +1440 | stream* | output_stream | sub_7FBB70: sub_7FE930(a1+1440, "\nFunction name: ") |
| +1560 | ptr | timing_records | Growable array of 32-byte timing entries |
| +1576 | u32 | timing_count | sub_C62720: cu->timing_count++ |
| +1552 | i32 | pipeline_progress | Pipeline progress counter (0--21), monotonically increases; see known values |
| +1584 | ptr | sm_backend | SM-specific architecture backend object (polymorphic, 1712--1992B depending on SM target); provides vtable dispatch for legalization, optimization, scheduling, and codegen; see note below |
| +1664 | ptr | knob_container | sub_7FB6C0, sub_A3B080: options/knob dispatch object |
| +1864 | ptr | bb_structure | sub_7FB6C0: destroyed via sub_77F880 |
| +1872 | ptr | per_func_data | sub_7FB6C0: destroyed via sub_7937D0 |
| +1880 | ptr | function_context | sub_7FB6C0: 17 analysis-result pairs at qword offsets |
| +1928 | ptr | codegen_ctx | Confirmed in overview.md Code Object table |

SM Backend Object at +1584

The pointer at context+0x630 (decimal 1584) is the single most confusing field in the compilation context, because it serves multiple roles through a single polymorphic C++ object. Different wiki pages historically called it different names depending on which role they observed:

  • Legalization pages see it dispatching MidExpansion, LateExpansionUnsupportedOps, etc., and call it "SM backend" or "arch_backend"
  • Scheduling pages see it providing hardware latency profiles at *(sm_backend+372) and call it "scheduler context" or "hw_profile"
  • Optimization pages see it dispatching GvnCse (vtable[23]) and OriReassociateAndCommon (vtable[44]) and call it "optimizer state" or "function manager"
  • Codegen/template pages see it holding register file capacity at +372 and hardware capability flags at +1037

It is one object. The canonical name is sm_backend. It is constructed per-compilation-unit in sub_662920 with a switch on SM version bits (v3 >> 12). Each SM generation gets a different-sized allocation and a different vtable:

| SM Case | Size | Base Constructor | Vtable | SM Generations |
|---|---|---|---|---|
| 3 | 1712B | sub_A99A30 | off_2029DD0 | sm_30 (Kepler) |
| 4 | 1712B | sub_A99A30 | off_21B4A50 | sm_50 (Maxwell) |
| 5 | 1888B | sub_A99A30 | off_22B2A58 | sm_60 (Pascal) |
| 6 | 1912B | sub_A99A30 | off_21D82B0 | sm_70 (Volta) |
| 7 | 1928B | sub_ACDE20 | off_21B2D30 | sm_80 (Ampere) |
| 8 | 1992B | sub_662220 | off_21C0C68 | sm_89 (Ada) |
| 9 | 1992B | sub_662220 | off_21D6860 | sm_90+ (Hopper/Blackwell) |

Key sub-fields on the SM backend:

  • +372 (i32): codegen factory value / encoded SM architecture version (e.g., 28673 = sm_80)
  • +1037 (u8): hardware capability flags (bit 0 = has high-precision FP64 MUFU seeds)
  • Vtable slots provide architecture-specific dispatch for 50+ operations

Pipeline Progress Counter at +1552

The field at context+1552 is a monotonically increasing int32 that tracks how far the compilation has progressed through the 159-phase pipeline. It is not a legalization-only counter -- it is incremented by phases across all categories (legalization, optimization, scheduling, regalloc). Each increment is performed by a small thunk function whose sole body is *(ctx + 1552) = N.

Known values and their associated phases:

| Value | Thunk Address | Phase / Context |
|---|---|---|
| 0 | (init) | sub_7F7DC0 -- compilation context constructor |
| 1 | sub_C5F620 | Early pipeline (before ConvertUnsupportedOps) |
| 2 | sub_C5F5A0 | After ConvertUnsupportedOps (phase 5) |
| 3 | sub_C5EF80 | After MidExpansion (phase 45) |
| 4 | sub_C5EF30 | After OriDoRematEarly (phase 54) -- signals remat mode active |
| 5 | sub_1233D70 | Mid-pipeline scheduling/ISel context |
| 7 | sub_6612E0 / sub_C60AA0 | After LateExpansion (phase 55) |
| 8 | sub_849C60 | Post-optimization context |
| 9 | sub_C5EB80 | After OriBackCopyPropagate (phase 83) |
| 10 | sub_88E9D0 | Late optimization |
| 11 | sub_C5EA80 | After SetAfterLegalization (phase 95) region |
| 12 | sub_C5E980 | Post-legalization |
| 13 | sub_13B5C80 | ISel/scheduling |
| 14 | sub_C5E830 | Post-scheduling |
| 15 | sub_C5E7C0 | Register allocation phase |
| 16 | sub_C5E6E0 | Post-regalloc |
| 17 | sub_C5E5A0 | Mercury/codegen |
| 18 | sub_C5E4D0 | Post-Mercury |
| 19 | sub_C5E440 | Late codegen |
| 20 | sub_C5E390 | Post-RA cleanup |
| 21 | sub_C5E0B0 | Final pipeline stage |

Readers of downstream passes use *(ctx+1552) > N to gate behavior that should only run after a certain pipeline point. For example, the rematerialization cross-block pass checks *(ctx+1552) > 4 to enable its second-pass mode.

Knob Container Access Pattern

The knob container at +1664 is accessed through a two-level virtual dispatch pattern that appears at 100+ call sites:

// Fast path: known vtable -> direct array read
_QWORD *v2 = *(_QWORD **)(ctx + 1664);
bool (*query)(__int64, int) = *(bool (**)(...))(*v2 + 72);
if (query == sub_6614A0)
    result = *(u8*)(v2[9] + knob_index * 72 + offset) != 0;
else
    result = query((int64)v2, knob_index);  // slow path

The fast path reads directly from the knob value array at v2[9] (offset +72 of the knob state object), where each knob value occupies 72 bytes. The slow path invokes the virtual method for derived knob containers.

Function Context (at +1880)

When a function is under compilation, +1880 points to a large context object containing 17 pairs of analysis-result data structures. Each pair consists of a sorted container and a hash map, holding results such as live ranges, register maps, and scheduling data. The cleanup code in sub_7FB6C0 destroys pairs at qword offsets [102, 97, 92, 87, 82, 77, 72, 67, 62, 57, 52, 47, 42, 36, 31, 26, 21] from the context base, then handles reference-counted objects at offsets [10] and [2].

Ori Code Object (~1136 bytes)

The Code Object is the per-function container for all IR data. One instance exists for each function under compilation. Constructor is at sub_A3B080, vtable at 0x21EE238.

Constructor Analysis

The constructor (sub_A3B080) takes two arguments: a1 (the Code Object to initialize) and a2 (the compilation context). It:

  1. Sets +8 = a2 (back-pointer to compilation context)
  2. Sets +0 = &unk_21EE238 (vtable)
  3. Zeroes approximately 250 distinct fields across the 1136-byte range
  4. Loads two SSE constants from xmmword_2027600 and xmmword_21EFAE0 into offsets +96 and +112 (likely default register file descriptors or encoding parameters)
  5. Reads a2+1412 and a2+1418 to set mode flags at +1101 and +1008
  6. Accesses the knob container at a2+1664 to query knob 367 for initial configuration
  7. Sets +1008 = 0x300000050 (default) or 0x400000080 (if a2+1418 & 4)

Code Object Field Map

| Offset | Type | Field | Evidence / Notes |
|---|---|---|---|
| +0 | vtable* | vtable | 0x21EE238, 263+ virtual methods |
| +8 | ptr | compilation_ctx | Back-pointer to owning compilation context |
| +16 | u128 | (zeroed) | SSE zero-store in constructor |
| +24 | u32 | sm_version | Encoded SM target (12288=sm30, 20481=sm50, 36865=sm90) |
| +32 | u128 | (zeroed) | SSE zero-store |
| +48 | u128 | (zeroed) | SSE zero-store |
| +64 | u32 | init_flags | Zeroed in constructor |
| +72 | ptr | code_buf | Output code buffer |
| +80 | u128 | (zeroed) | |
| +88 | ptr | reg_file | Register descriptor array: *(ctx+88) + 8*regId |
| +96 | u128 | reg_defaults_1 | Loaded from xmmword_2027600 |
| +99 | u32 | ur_count | Uniform register (UR) count |
| +102 | u32 | r_alloc | R-register allocated count |
| +112 | u128 | reg_defaults_2 | Loaded from xmmword_21EFAE0 |
| +128--175 | u128[3] | (zeroed) | SSE zero-stores |
| +152 | ptr | sym_table | Symbol/constant lookup array |
| +159 | u32 | r_reserved | R-register reserved count |
| +176 | ptr | (zeroed) | |
| +184 | u32 | (zeroed) | |
| +192 | ptr | (zeroed) | |
| +200 | u128 | (zeroed) | |
| +216 | u128 | (zeroed) | |
| +232 | u32 | (zeroed) | |
| +236 | u32 | (zeroed) | |
| +240 | ptr | (zeroed) | |
| +248 | u128 | (zeroed) | |
| +264 | u128 | (zeroed) | |
| +272 | ptr | instr_head | Instruction linked-list head |
| +280 | u32 | (zeroed) | |
| +288 | ptr | (zeroed) | |
| +296 | ptr | bb_array | Basic block array pointer (40 bytes per entry) |
| +304 | u32 | bb_index | Current basic block count |
| +312 | ptr | options | OptionsManager* for knob queries |
| +320--359 | u128[3] | (zeroed) | |
| +335 | u32 | instr_hi | Instruction count upper bound |
| +336 | u32 | tex_inst_count | Texture instruction count (stats emitter) |
| +338 | u32 | fp16_vect_inst | FP16 vectorized instruction count |
| +340 | u32 | inst_pairs | Instruction pair count |
| +341 | u32 | instr_lo | Instruction count lower bound |
| +342 | u32 | tepid_inst | Tepid instruction count |
| +360 | ptr | (zeroed) | |
| +368 | u32 | sub_block_flags | |
| +372 | u32 | instr_total | Total instruction count (triggers chunked scheduling at > 0x3FFF) |
| +376 | u32 | (zeroed) | |
| +384--416 | ptr[5] | (zeroed) | |
| +424 | u32 | (zeroed) | |
| +432 | ptr | (zeroed) | |
| +440 | u32 | (zeroed) | |
| +448 | ptr | (zeroed) | |
| +464 | ptr | (zeroed) | |
| +472 | u8 | (zeroed) | |
| +473 | u8 | (zeroed) | |
| +536 | u32 | (zeroed) | |
| +540 | u32 | (zeroed) | |
| +648 | ptr | succ_map | CFG successor edge hash table |
| +680 | ptr | backedge_map | CFG backedge hash table |
| +720 | ptr | rpo_array | Reverse post-order array (int*) |
| +728 | ptr | bitmask_array | Grow-on-demand bitmask array for scheduling |
| +740 | u32 | bitmask_capacity | Capacity of bitmask array |
| +752 | ptr | (zeroed) | |
| +760 | u32 | (zeroed) | |
| +764 | u32 | (zeroed) | |
| +768 | ptr | const_sections | Constant memory section array |
| +772 | u8 | (zeroed) | |
| +776 | ptr | smem_sections | Shared memory section array |
| +976 | ptr | block_info | Block info array (40 bytes per entry, contiguous) |
| +984 | i32 | num_blocks | Number of basic blocks |
| +996 | u32 | annotation_offset | Current offset into annotation buffer (sub_A4B8F0) |
| +1000 | ptr | annotation_buffer | Annotation data buffer (sub_A4B8F0) |
| +1008 | u64 | encoding_params | Default 0x300000050 or 0x400000080 |
| +1016 | ptr | (zeroed) | |
| +1024 | u32 | (zeroed) | |
| +1032 | ptr | (zeroed) | |
| +1040 | ptr | (zeroed) | |
| +1064 | ptr | (zeroed) | |
| +1080 | u128 | (zeroed) | |
| +1096 | u32 | (zeroed) | |
| +1100 | u8 | (zeroed) | |
| +1101 | u8 | optimization_mode | Set from knob 367 and compilation_ctx+1412 |
| +1102 | u8 | (zeroed) | |
| +1104 | ptr | (zeroed) | |
| +1120 | u128 | (zeroed) | |

Register Count Formula

From the stats emitter at sub_A3A7E0 and the register count function at sub_A4B8F0 (which both use vtable+2104 dispatch with sub_859FC0 as the fast path):

total_R_regs      = code_obj[159] + code_obj[102]   // reserved + allocated
instruction_count = code_obj[335] - code_obj[341]   // upper - lower

Stats Emitter Field Map

The stats emitter (sub_A3A7E0) reaches a per-function stats record through the SM backend: v3 = (*(compilation_ctx+8))[198], i.e., qword index 198 (byte offset +1584) of the SM backend object reachable from the outer compilation context. The emitter then reads per-function stats fields within that record using 4-byte DWORD indexing, which reveals these additional fields:

| DWORD Index | Byte Offset | Field | Stat String |
|---|---|---|---|
| 8 | +32 | est_latency | `[est latency = %d]` |
| 10 | +40 | worst_case_lat | `[worstcaseLat=%f]` |
| 11 | +44 | avg_case_lat | `[avgcaseLat=%f]` |
| 12 | +48 | spill_bytes | `[LSpillB=%d]` |
| 13 | +52 | refill_bytes | `[LRefillB=%d]` |
| 14 | +56 | s_refill_bytes | `[SRefillB=%d]` |
| 15 | +60 | s_spill_bytes | `[SSpillB=%d]` |
| 16 | +64 | low_lmem_spill | `[LowLmemSpillSize=%d]` |
| 17 | +68 | frame_lmem_spill | `[FrameLmemSpillSize=%d]` |
| 18 | +72 | non_spill_bytes | `[LNonSpillB=%d]` |
| 19 | +76 | non_refill_bytes | `[LNonRefillB=%d]` |
| 20 | +80 | non_spill_size | `[NonSpillSize=%d]` |
| 26 | +104 | occupancy (float) | `[Occupancy = %f]` |
| 27 | +108 | div_branches | `[est numDivergentBranches=%d]` |
| 28 | +112 | attr_mem_usage | `[attributeMemUsage=%d]` |
| 29 | +116 | program_size | `[programSize=%d]` |
| 42 | +168 | precise_inst | `[Precise inst=%d]` |
| 44 | +176 | udp_inst | `[UDP inst=%d]` |
| 45 | +180 | vec_to_ur | `[numVecToURConverts inst=%d]` |
| 49 | +196 | max_live_suspend | `[maxNumLiveValuesAtSuspend=%d]` |
| 87 | +348 | partial_unroll | `[partially unrolled loops=%d]` |
| 88 | +352 | non_unrolled | `[non-unrolled loops=%d]` |
| 89 | +356 | cb_bound_tex | `[CB-Bound Tex=%d]` |
| 90 | +360 | partial_bound_tex | `[Partially Bound Tex=%d]` |
| 91 | +364 | bindless_tex | `[Bindless Tex=%d]` |
| 92 | +368 | ur_bound_tex | `[UR-Bound Tex=%d]` |
| 93 | +372 | sm_version_check | > 24575 triggers UR reporting |
| 99 | +396 | ur_count_stats | `[urregs=%d]` |
| 102 | +408 | r_alloc | R-register allocated count |
| 159 | +636 | r_reserved | R-register reserved count |
| 303 | +1212 | est_fp | `[est fp=%d]` |
| 306 | +1224 | est_half | `[est half=%d]` |
| 307 | +1228 | est_transcendental | `[est trancedental=%d]` |
| 308 | +1232 | est_ipa | `[est ipa=%d]` |
| 310 | +1240 | est_shared | `[est shared=%d]` |
| 311 | +1244 | est_control_flow | `[est controlFlow=%d]` |
| 315 | +1260 | est_load_store | `[est loadStore=%d]` |
| 316 | +1264 | est_tex | `[est tex=%d]` |
| 334 | +1336 | inst_pairs | `[instPairs=%d]` |
| 335 | +1340 | instr_hi | Instruction count upper bound |
| 336 | +1344 | tex_inst_count | `[texInst=%d]` |
| 337 | +1348 | fp16_inst | `[FP16 inst=%d]` |
| 338 | +1352 | fp16_vect_inst | `[FP16 VectInst=%d]` |
| 339 | +1356 | inst_hint | `[instHint=%d]` |
| 340 | +1360 | inst_pairs_2 | checked for non-zero to print instHint line |
| 341 | +1364 | instr_lo | Instruction count lower bound |
| 342 | +1368 | tepid_inst | `[tepid=%d]` |

Note: The stats emitter accesses the Code Object through a float pointer (v3), so DWORD indices map to byte offsets as index * 4 for both integer and float fields. Float fields at indices 9, 26, 50, 54, 57, 58, 59, 61, 62, 65, 84, 85, 86 hold throughput and occupancy metrics. A linked list at qword index 55 (byte +440) holds additional string annotations.

Basic Block Entry (40 bytes)

Basic blocks are stored in a contiguous array at Code Object +976, with count at +984.

BasicBlock (40 bytes)
  +0    ptr      instr_head     // first instruction in this BB
  +8    ptr      instr_tail     // last instruction (or list link)
  +16   ptr      (reserved)
  +24   u32      (reserved)
  +28   i32      bix            // block index (unique ID for CFG ops)
  +32   u64      flags          // scheduling/analysis flags

The scheduling pass (sub_8D0640) initializes per-block scheduling state by iterating the block list and zeroing qword offsets [7], [13], [19], and setting [21] = -1 on each block.

Instruction Layout

Instructions are polymorphic C++ objects linked into per-BB doubly-linked lists. The instruction format is detailed in Instructions; this section covers only the structural linkage.

Each instruction carries a unique integer ID at +16, an opcode at +72 (the peephole optimizer masks with & 0xCF on byte 1 to strip modifier bits), and a packed operand array starting at +84. The operand count is at +80. Operands are 8 bytes each.

Packed Operand Format

 31  30  29  28  27       24  23  22  21  20  19                  0
+---+---+---+---+-----------+---+---+---+---+---------------------+
|     type      |  modifier bits (8 bits)    |  index (20 bits)    |
+---+---+---+---+-----------+---+---+---+---+---------------------+

Extraction (50+ confirmed sites):
  uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
  int type    = (operand >> 28) & 7;     // bits 28-30
  int index   = operand & 0xFFFFF;       // bits 0-19
  int mods    = (operand >> 20) & 0xFF;  // bits 20-27
| Type Value | Meaning | Resolution |
|---|---|---|
| 1 | Register operand | Index into `*(code_obj+88)` register file |
| 5 | Symbol/constant operand | Index into `*(code_obj+152)` symbol table |

The operand classifier functions at 0xB28E00--0xB28E90 provide predicate checks:

| Function | Predicate |
|---|---|
| sub_B28E00 | getRegClass (1023 = wildcard, 1 = GPR) |
| sub_B28E10 | isRegOperand |
| sub_B28E20 | isPredOperand |
| sub_B28E40 | isImmOperand |
| sub_B28E80 | isConstOperand |
| sub_B28E90 | isUReg |

Symbol Table

The symbol table is accessed through Code Object +152. Based on the symbol table builder at sub_621480 (21KB, references a1+30016 for the symbol table base), symbols are stored in a hash-map-backed structure where each symbol has a name and associated properties (address, type, section binding).

Internal Symbol Names

The following internal symbol names appear in decompiled code, indicating the kinds of entities tracked:

| Symbol | Purpose |
|---|---|
| `__ocg_const` | OCG-generated constant data |
| `__shared_scratch` | Shared memory scratch space |
| `__funcAddrTab_g` | Global indirect function call table |
| `__funcAddrTab_c` | Constant indirect function call table |
| `_global_ptr_%s` | Global pointer for named variable |
| `$funcID$name` | Function-local relocation symbol |
| `__cuda_dummy_entry__` | Dummy entry generated by --compile-only |
| `__cuda_sanitizer` | CUDA sanitizer instrumentation symbol |

Symbol Resolution Flow

Symbol resolution (sub_625800, 27KB) traverses the symbol table to resolve references during the PTX-to-Ori lowering and subsequent optimization phases. The format %s[%d] (from sub_6200A0) is used for array-subscripted symbol references, and __$endLabel$__%s markers delimit function boundaries.

Constant Buffer Layout

Constant memory is organized into banks (c[0], c[1], ...) corresponding to the CUDA .nv.constant0, .nv.constant2, etc. ELF sections. The constant section array at Code Object +768 tracks all constant banks for the current function.

Constant Bank Handling

The constant bank handler at sub_6BC560 (4.9KB) manages references to constant memory using the c[%d] (integer bank) and c[%s] (named bank, sw-compiler-bank) notation. It enforces:

  • A maximum constant register count (error: "Constant register limit exceeded; more than %d constant registers")
  • LDC (Load Constant) requires a constant or immediate bank number

ELF Constant Symbols

The ELF symbol emitter (sub_7FD6C0) creates symbols for constant bank metadata:

| Symbol Name | Purpose |
|---|---|
| `.nv.ptx.const0.size` | Size of constant bank 0 (kernel parameters) |

The constant emission function (sub_7D14C0, 5.6KB) iterates the constant section array and copies bank data into the output ELF sections.

Shared Memory Layout

Shared memory (.nv.shared) allocations are tracked through the shared memory section array at Code Object +776. Reserved shared memory regions are managed by sub_6294E0 (12.1KB) and sub_629E40 (6.1KB).

Reserved Shared Memory Symbols

The ELF emitter recognizes these special symbols for shared memory layout:

| Symbol Name | Purpose |
|---|---|
| `.nv.reservedSmem.begin` | Start of reserved shared memory region |
| `.nv.reservedSmem.cap` | Capacity of reserved shared memory |
| `.nv.reservedSmem.end` | End of reserved shared memory region |
| `.nv.reservedSmem.offset0` | First reserved offset within shared memory |
| `.nv.reservedSmem.offset1` | Second reserved offset within shared memory |

The --disable-smem-reservation CLI option disables the reservation mechanism. Shared memory intrinsic lowering (sub_6C4DA0, 15KB) validates that shared memory operations use types {b32, b64}.

Descriptor Size Symbols

Additional ELF symbols track texture/surface descriptor sizes in shared memory:

| Symbol Name | Purpose |
|---|---|
| `.nv.unified.texrefDescSize` | Unified texture reference descriptor size |
| `.nv.independent.texrefDescSize` | Independent texture reference descriptor size |
| `.nv.independent.samplerrefDescSize` | Independent sampler reference descriptor size |
| `.nv.surfrefDescSize` | Surface reference descriptor size |

Pool Allocator

The pool allocator (sub_424070, 3,809 callers) is the single most heavily used allocation function. Every dynamic data structure in ptxas is allocated through pools.

Pool Object Layout

| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | ptr | large_block_list | Singly-linked list of large (>4999 byte) blocks |
| +32 | u32 | min_slab_size | Minimum slab allocation size |
| +44 | u32 | slab_count | Number of slabs allocated |
| +48 | ptr | large_free_list | Free list for large blocks (boundary-tag managed) |
| +56 | u32 | fragmentation_count | Fragmentation counter (decremented on split) |
| +60 | u32 | max_order | Maximum power-of-2 order for large blocks |
| +64... | | (large block free lists) | `a1 + 32*(order+2)` = per-order free list head |
| +2112 | ptr | tracking_map | Hash map for allocation metadata tracking |
| +2128 | ptr[N] | small_free_lists | Size-binned free lists: `*(pool + 8*(size>>3) + 2128)` = head |
| +7128 | mutex* | pool_mutex | pthread_mutex_t* for thread safety |

Allocation Paths

Small path (size <= 4999 bytes = 0x1387):

  1. Round size up to 8-byte alignment: aligned = (size + 7) & ~7
  2. Minimum 16 bytes
  3. Compute bin: bin = pool + 8 * (aligned >> 3) + 2128
  4. If bin has a free block: pop from free list, decrement slab available bytes
  5. If bin is empty: allocate a new slab from the parent (size = aligned * ceil(min_slab_size / aligned)), carve into free-list nodes

Large path (size > 4999 bytes):

  1. Add 32 bytes for boundary tags
  2. Search power-of-2 order free lists starting from log2(size+32)
  3. If found: split block if remainder > 39 bytes, return payload
  4. If not found: call sub_423B60 to grow the pool, allocate new slab from parent

Boundary Tag Format (Large Blocks)

Large blocks use boundary tags for coalescing on free:

Block Header (32 bytes):
  +0    i64      sentinel      // -1 = allocated, else -> next free
  +8    ptr      prev_free     // previous in free list (or 0)
  +16   u64      tag_offset    // 32 (header size)
  +24   u64      payload_size  // user-requested allocation size

Block Footer (32 bytes at end):
  +0    i64      sentinel
  +8    ptr      prev_free
  +16   u64      footer_tag    // 32
  +24   u64      block_size    // total size including headers

Slab Descriptor (56 bytes)

Each slab is tracked by a 56-byte descriptor:

| Offset | Type | Field |
|---|---|---|
| +0 | ptr | chain_link |
| +8 | u64 | total_size |
| +16 | u64 | available_size |
| +24 | ptr | owning_pool |
| +32 | ptr | memory_base |
| +40 | u8 | is_small_slab |
| +44 | u32 | slab_id |
| +48 | u32 | bin_size |

Hierarchical Pools

Pools are hierarchical. When sub_424070 is called with a1 = NULL, it falls back to a global allocator (sub_427A10) that uses malloc directly. Non-null a1 values are pool objects that allocate from their own slabs, which are themselves allocated from a parent pool (the TLS context at offset +24 holds the per-thread pool pointer). The top-level pool is named "Top level ptxas memory pool" and is created in the compilation driver.

Hash Map

The hash map (sub_426150 insert / sub_426D60 lookup, 2,800+ and 422+ callers respectively) is the primary associative container in ptxas.

Hash Map Object Layout (~112 bytes)

| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | fptr | hash_func | Custom hash function pointer |
| +8 | fptr | compare_func | Custom compare function pointer |
| +16 | fptr | hash_func_2 | Secondary hash (or NULL) |
| +24 | fptr | compare_func_2 | Secondary compare (or NULL) |
| +32 | u32 | has_custom_compare | Flag |
| +40 | u64 | bucket_mask | capacity - 1 for power-of-2 masking |
| +48 | u64 | entry_count | Number of stored entries |
| +64 | u64 | load_factor_threshold | Resize when entry_count exceeds this |
| +72 | u32 | first_free_slot | Tracking for bitmap-based slot allocation |
| +76 | u32 | entries_capacity | Capacity of entries array |
| +80 | u32 | bitmap_capacity | Capacity of used-bits bitmap |
| +84 | u32 | flags | Hash mode in bits 4-7 |
| +88 | ptr | entries | Array of 16-byte {key, value} pairs |
| +96 | ptr | used_bitmap | Bitmap tracking occupied slots |
| +104 | ptr | buckets | Array of pointers to chained index lists |

Hash Modes

The hash mode is encoded in bits 4-7 of the flags field at offset +84:

| Mode | Flag Bits | Hash Function | Use Case |
|---|---|---|---|
| 0 | 0x00 | Custom (+0 function pointer) | User-defined hash/compare |
| 1 | 0x10 | Pointer hash: `(key>>11) ^ (key>>8) ^ (key>>5)` | Pointer-keyed maps |
| 2 | 0x20 | Identity: key used directly | Integer-keyed maps |

Mode selection happens automatically in the constructor (sub_425CA0): if the hash/compare pair matches (sub_427750, sub_427760), mode 2 is set; if (sub_4277F0, sub_427810), mode 1.

Lookup Algorithm

// Mode 1 (pointer hash) example:
uint64_t hash = (key >> 11) ^ (key >> 8) ^ (key >> 5);
uint64_t bucket_idx = hash & map->bucket_mask;
int32_t* chain = map->buckets[bucket_idx];
while (*++chain != -1) {
    entry_t* e = map->entries + 16 * (*chain);
    if (key == e->key)
        return e->value;  // found
}
return 0;  // not found

Growth policy: the map doubles capacity and rehashes when entry_count > load_factor_threshold.

String-Keyed Maps

String-keyed maps use MurmurHash3 (sub_427630, 73 callers) as the hash function. The implementation uses the standard MurmurHash3_x86_32 constants:

| Constant | Value | Standard Name |
|---|---|---|
| c1 | 0xCC9E2D51 (-862048943) | MurmurHash3 c1 |
| c2 | 0x1B873593 (461845907) | MurmurHash3 c2 |
| fmix1 | 0x85EBCA6B (-2048144789) | MurmurHash3 fmix |
| fmix2 | 0xC2B2AE35 (-1028477387) | MurmurHash3 fmix |

CFG Hash Map (FNV-1a)

The control flow graph uses a separate hash map implementation based on FNV-1a hashing, distinct from the general-purpose hash map above. Two instances exist per Code Object at offsets +648 (successor edges) and +680 (backedge info).

| Parameter | Value |
|---|---|
| Initial hash | 0x811C9DC5 (-2128831035) |
| Prime | 16777619 (0x01000193) |
| Input | 4-byte block index, hashed byte-by-byte |
Bucket entry: 24 bytes {head, tail, count}. Node: 64 bytes with chain link, key, values, sub-hash data, and cached hash. See CFG for the full CFG hash map specification.

Linked List

The linked list (sub_42CA60 prepend, 298 callers; sub_42CC30 length, 48 callers) is a singly-linked list of 16-byte nodes:

ListNode (16 bytes, pool-allocated)
  +0    ptr      next        // pointer to next node (NULL = end)
  +8    ptr      data        // pointer to payload object

Prepend allocates a 16-byte node from the pool, sets node->data = payload, and links it at the list head. This is used for function lists, relocation lists, annotation chains, and many intermediate pass-local collections.

Growable Array (Pool Vector)

Growable arrays appear throughout the PhaseManager and elsewhere. The layout is a triple of {data_ptr, count, capacity}:

PoolVector (24 bytes inline, or embedded in parent struct)
  +0    ptr      data         // pointer to element array
  +8    i32      count        // current element count
  +12   i32      capacity     // allocated capacity

Growth strategy (confirmed in the PhaseManager timing records): new_capacity = max(old + old/2 + 1, requested) (1.5x growth factor). Elements are typically 8 bytes (pointers) or 16 bytes (pointer pairs). Reallocation uses sub_424C50 (pool realloc, 27 callers).

The PhaseManager uses this pattern for the phase list (16-byte {phase_ptr, pool_ptr} pairs), the name table (8-byte string pointers), and the timing records (32-byte entries).

Knob Value Array

Knob values are stored in a contiguous array of 72-byte slots, accessed at knob_state[9] + 72 * knob_index (where knob_state[9] is offset +72 of the knob state object).

Knob Value Slot (72 bytes)

| Offset | Type | Field |
|---|---|---|
| +0 | u8 | Type tag (0=unset, 1=bool, 2=int, ..., 12=opcode list) |
| +8 | i64 | Integer value / pointer to string / linked list head |
| +16 | i64 | Secondary value (range max, list count, etc.) |
| +24 | i64 | Tertiary value |
| +64 | ptr | Allocator reference |

Supported types:

| Type | Tag | Storage |
|---|---|---|
| Boolean | 1 | Flag at +0 |
| Integer | 2 | Value at +8 |
| Integer+extra | 3 | Value at +8, extra at +12 |
| Integer range | 4 | Min at +8, max at +16 |
| Integer list | 5 | Growable array of ints |
| Float | 6 | float at +8 |
| Double | 7 | double at +8 |
| String | 8/11 | Pointer at +8 |
| When-string | 9 | Linked list of 24-byte condition+value nodes |
| Value-pair list | 10 | Opcode:integer pairs via vtable |
| Opcode list | 12 | Opcode names resolved through vtable |

Knob Descriptor (64 bytes)

Knob descriptors are stored in a table at knob_state+16, with count at knob_state+24:

| Offset | Type | Field |
|---|---|---|
| +0 | ptr | Primary name (ROT13-encoded) |
| +8 | u64 | Primary name length |
| +16 | u32 | Type tag |
| +24... | | (reserved) |
| +40 | ptr | Alias name (ROT13-encoded) |
| +48 | u64 | Alias name length |

Stream Object

The output stream used for diagnostics and stats reporting (e.g., at compilation context +1440) is a C++ iostream-like object with operator overloads. Field layout (from sub_7FE5D0 and sub_7FECA0):

| Offset | Type | Field |
|---|---|---|
| +0 | vtable* | vtable (dispatch for actual I/O) |
| +8 | u32 | width |
| +12 | u32 | precision |
| +16 | u64 | char_count |
| +24 | ptr | format_buffer |
| +56 | u32 | flags (bit 0=hex, bit 1=oct, bit 2=left-align, bit 3=uppercase, bits 7-8=sign) |

ORI Record Serializer (sub_A50650)

The ORI Record Serializer (sub_A50650, 74 KB, 2,728 decompiled lines) is the central function that takes a Code Object's in-memory state and flattens it into a linear output buffer organized as a table of typed section records. It is the serialization backbone for both the DUMPIR diagnostic subsystem and the compilation output path. Despite the _ORI_ string it contains, it is not an optimization pass -- it is infrastructure.

- **Address:** 0xA50650
- **Size:** ~74 KB
- **Identity:** CodeObject::EmitRecords
- **Confidence:** 0.90
- **Called from:** sub_A53840 (wrapper), sub_AACBF0 / sub_AAD2A0 (DUMPIR diagnostic path)
- **Calls:** sub_A4BC60 (register serializer, new format), sub_A4D3F0 (legacy format), sub_A4B8F0 (register count annotation), sub_A47330 + sub_A474F0 (multi-section finalization), sub_1730890 / sub_17308C0 / sub_17309A0 (scheduling serializers), sub_1730FE0 (register file map)

Parameters

a1 is a serialization state object ("OriRecordContext") that carries the section table, compilation context back-pointer, and per-subsection index/size pairs. a2 is the output buffer write cursor, advanced as data is emitted.

Key fields on a1:

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +8 | ptr | compilation_ctx | Dereferenced to reach sm_backend at +1584 |
| +24 | i32 | header_section_idx | `v5 + 32 * (*(a1+24) + 1)` |
| +72 | ptr | section_table | Array of 32-byte section entries |
| +180 | u32 | instr_counter_1 | Reset to 0 at entry |
| +472 | u8 | has_debug_info | Gates debug section emission |
| +916 | i32 | multi_section_count | > 0 triggers link-record emission and tail call to sub_A47330 |
| +1102 | u8 | multi_section_enabled | Master flag for multi-section mode |
| +1120 | ptr | scheduling_ctx | Scheduling context for barrier/scope serialization |

Section Record Format

Each section occupies a 32-byte entry in the table at *(a1+72) + 32 * section_index:

Offset  Type   Field
+0      u16    type_tag           section type identifier
+4      u32    data_size          byte size of data payload
+8      ptr    data_ptr           pointer to data in output buffer
+16     u32    element_count      number of elements (or auxiliary metadata)
+20     u32    aux_field          additional per-type context
+24     u32    aux_field_2        secondary per-type context

Data payloads are 16-byte aligned: cursor += (size + 15) & ~0xF.

Section Type Tag Catalog

The serializer emits up to 56 unique section types across three tag ranges.

Base types (0x01--0x58):

| Tag | Hex | Content | Evidence |
|---|---|---|---|
| 1 | 0x01 | Instruction stream (register-allocated code body) | Emitted via sub_A4BC60 or sub_A4D3F0 |
| 3 | 0x03 | Virtual-dispatch section (vtable+48 on state obj) | Conditional on `*(a1+64) > 0` |
| 16 | 0x10 | Source operand bank (v7[199] entries at v7+97) | `*(entry+48) = v7[199]` |
| 17 | 0x11 | Destination operand bank (bit-packed from v7+203) | Conditional on `!v7[1414]` |
| 19 | 0x13 | Annotation stream | `*(a1+232)` counter |
| 34 | 0x22 | Original-definition name table (_ORI_ prefixed) | `strcpy(v50, "_ORI_")` at line 1762 |
| 35 | 0x23 | Instruction info snapshot (340 bytes from v7+4) | qmemcpy of 340 bytes |
| 46 | 0x2E | Texture/surface binding table | v7[248] entries, 16 bytes each |
| 50 | 0x32 | Live range interval table (spill map) | From compilation context +984 |
| 51 | 0x33 | Register file occupancy table | `*(ctx+1424) & 4` |
| 53 | 0x35 | Source operand type bitmap (4-bit per operand) | v7[131] operands, 20-byte stride |
| 54 | 0x36 | Destination operand type bitmap | v7[134] operands, 20-byte stride |
| 55 | 0x37 | Scheduling barrier data | via sub_1730890 |
| 56 | 0x38 | Register file mapping | via sub_1730FE0 |
| 58 | 0x3A | Scheduling dependency graph | via sub_17309A0 |
| 59 | 0x3B | Multi-section link record | Conditional on `*(a1+1102)` |
| 64 | 0x40 | External reference (from ctx+2120) | Pointer stored, no data copy |
| 68 | 0x44 | Performance counter section | `*(a1+932)` counter |
| 70 | 0x46 | Spill/fill metadata | v7[408] |
| 71 | 0x47 | Call graph edge table | From v7+61, linked list traversal |
| 73 | 0x49 | Codegen context snapshot | From ctx+932 register allocation state |
| 80 | 0x50 | Hash table section | v7+207/208, hash bucket traversal |
| 81 | 0x51 | Extended call info | From v7+84 |
| 83 | 0x53 | Convergence scope data | via sub_17308C0 |
| 85 | 0x55 | Register geometry record (banks, warps, lanes) | From ctx+1600, writes bank/warp/lane counts |
| 88 | 0x58 | Extended scheduling annotations | Conditional on `*(a1+1088) > 0` |

Extended types (0x1208--0x1221): Emitted only when *(char*)(ctx+1412) < 0, which enables the full post-register-allocation diagnostic mode. These 16 types carry per-register-class live range and operand definition data:

| Tag | Hex | Content |
|---|---|---|
| 4616 | 0x1208 | Extended operand class 0 |
| 4617--4623 | 0x1209--0x120F | Extended operand classes 1--7 |
| 4624 | 0x1210 | Block-level operand summary |
| 4625 | 0x1211 | Live-in vector (12 bytes/element, count at `*(a1+668)`) |
| 4626 | 0x1212 | Live-out vector (12 bytes/element) |
| 4627 | 0x1213 | Extended operand class 8 |
| 4628--4629 | 0x1214--0x1215 | Extended operand classes 9--10 |
| 4630 | 0x1216 | Memory space descriptor (SM arch > 0x4FFF) |
| 4631 | 0x1217 | Extended scheduling flag (SM arch > 0x4FFF) |
| 4632 | 0x1218 | Instruction hash (ctx+1386 bit 3) |
| 4633 | 0x1219 | Annotation metadata |
| 4640 | 0x1220 | Extended section metadata |
| 4641 | 0x1221 | Optimization level record (from knob system, knob 988) |

The _ORI_ Name Prefix

The _ORI_ string is not a pass name. At line 1762 the serializer iterates the linked list at v7+55 (the original-definition chain maintained for rematerialization debugging) and for each entry creates a string "_ORI_<original_name>":

// Line 1748-1770 (simplified)
for (def = v7->original_defs; def; def = def->next) {
    entry = &section_table[16 * (state->instr_offset + idx)];
    entry->type_tag = 34;      // original-definition name
    entry->data_ptr = cursor;
    strcpy(cursor, "_ORI_");
    strcpy(cursor + 5, def->name);
    cursor += align16(strlen(def->name) + 21);
}

These names are consumed by the register allocation verifier (sub_A55D80) when it compares pre- and post-allocation reaching definitions. A mismatch triggers the "REMATERIALIZATION PROBLEM" diagnostic (string at 0xa55dd8), which lists original definitions under their _ORI_ names alongside the post-allocation state.

Wrapper: sub_A53840

sub_A53840 (48 lines) is a thin wrapper that:

  1. Emits a type-44 header record if *(ctx+1600)[1193] is set (scheduling metadata header)
  2. Calls sub_A50650 with the output buffer
  3. Optionally emits a type-62 trailer record if *(ctx+1600)[48] is set

This wrapper is the typical entry point reached through vtable dispatch.

Function Map

| Address | Size | Callers | Identity |
|---|---|---|---|
| sub_A3B080 | ~700 B | multiple | Code Object constructor |
| sub_A3A7E0 | ~700 B | 1 | Stats emitter (per-function profile) |
| sub_A4B8F0 | ~250 B | 1 | Register count / annotation writer |
| sub_A50650 | ~74 KB | 8 | ORI Record Serializer (CodeObject::EmitRecords) |
| sub_A53840 | ~400 B | 1 | EmitRecords wrapper (adds type-44 header) |
| sub_424070 | 2,098 B | 3,809 | Pool allocator (alloc) |
| sub_4248B0 | 923 B | 1,215 | Pool deallocator (free) |
| sub_424C50 | 488 B | 27 | Pool reallocator (realloc) |
| sub_426150 | ~1.2 KB | 2,800 | Hash map insert |
| sub_426D60 | 345 B | 422 | Hash map lookup |
| sub_426EC0 | 349 B | 29 | Hash map contains |
| sub_425CA0 | 114 B | 127 | Hash map constructor |
| sub_425D20 | 121 B | 63 | Hash map destructor |
| sub_42CA60 | 81 B | 298 | Linked list prepend |
| sub_42CC30 | 34 B | 48 | Linked list length |
| sub_427630 | 273 B | 73 | MurmurHash3 string hash |
| sub_621480 | 21 KB | low | Symbol table builder |
| sub_625800 | 27 KB | low | Symbol resolution |
| sub_6BC560 | 4.9 KB | low | Constant bank handler |
| sub_6294E0 | 12.1 KB | low | Reserved shared memory management |
| sub_6C4DA0 | 15 KB | low | Shared memory intrinsic lowering |
| sub_7FD6C0 | ~800 B | 3 | ELF symbol emitter |
| sub_7FB6C0 | ~800 B | 1 | Pipeline orchestrator (context cleanup) |
| sub_7FBB70 | ~100 B | 1 | Per-kernel entry point |
| sub_663C30 | ~300 B | 1 | Compilation loop body |
| sub_662920 | varies | 1 | Global initialization (calls KnobsInit) |

Pass Inventory & Ordering

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas compilation pipeline consists of exactly 159 phases, executed in a fixed order determined by a static index table at 0x22BEEA0. Every compilation traverses the same sequence -- phase skipping is handled per-phase via isNoOp() virtual method overrides, not by reordering the table. This page is the definitive inventory of all 159 phases: their index, name, category, one-line description, and cross-references to detailed documentation where available.

All 159 phases have names in the static name table at off_22BD0C0 (159 entries, indexed 0--158). The factory switch at sub_C60D30 allocates each phase as a 16-byte polymorphic object with a 5-slot vtable: execute() at +0, getIndex() at +8 (returns the factory/table index), and isNoOp() at +16 (returns 0 for active phases, 1 for phases skipped by default). Slots +24 and +32 are NULL.

- **Total phases:** 159 (indices 0--158)
- **Named (static table):** 159 (all have entries in off_22BD0C0)
- **Late-pipeline phases:** 20 (indices 139--158, added after the original 0--138 design)
- **Gate passes (AdvancedPhase):** 17 conditional hooks
- **Update passes:** 9 data-structure refresh passes (6 in main table + 3 in static name table, not yet positioned)
- **Report passes:** 10 diagnostic/dump passes (9 in main table + 1 in static name table, not yet positioned)
- **GeneralOptimize instances:** 6 compound optimization bundles
- **Liveness/DCE instances:** 5 (including EarlyOriSimpleLiveDead)
- **LICM instances:** 4
- **Pipeline infrastructure:** Phase Manager, Optimization Pipeline

Phase Categories

Each phase is tagged with one of 10 categories. These are not present in the binary -- they are an analytical classification applied during reverse engineering.

| Tag | Meaning | Count |
|---|---|---|
| Validation | Checks IR structural correctness, catches illegal patterns | 3 |
| Lowering | Converts unsupported ops, expands macros, legalizes IR | 14 |
| Optimization | Transforms IR to improve performance (DCE, CSE, LICM, etc.) | 68 |
| Analysis | Computes information consumed by later passes (liveness, CFG) | 6 |
| Reporting | Dumps IR, statistics, or memory usage for debugging | 9 |
| Scheduling | Instruction scheduling, sync insertion, WAR fixup | 8 |
| RegAlloc | Register allocation and related fixups | 6 |
| Encoding | Mercury SASS encoding, expansion, microcode generation | 9 |
| Cleanup | Post-transformation updates, NOP removal, block layout | 13 |
| Gate | Conditional hooks (AdvancedPhase*) -- no-op by default | 17 |

Phases 139--158 are late-pipeline phases covering Mercury encoding, scoreboards, register map computation, diagnostics, and a terminal NOP. They have the same vtable infrastructure as phases 0--138 and are fully named in the static table.

Numbering Discrepancy

Warning: The phase numbers 0--138 on this page use a compressed numbering scheme established before the full 159-entry name table was discovered (P2-14). The true static name table at off_22BD0C0 contains 159 entries indexed 0--158, and 16 of the 20 newly-discovered names occupy indices within the 0--138 range. In the true table, these 16 entries sit at their listed indices, and all subsequent phases shift up. The wiki's compressed numbering diverges from the true binary indices starting around phase 8.

Phases 139--158 are correctly numbered (they match the true static table indices). A full renumbering of phases 0--138 to match the true binary indices is deferred as a separate task because it would affect cross-references across 40+ wiki pages.

The 16 omitted name table entries (with their true static table indices) are:

| True Index | Name | Category | Relationship to Wiki |
|---|---|---|---|
| 22 | OriCopyProp | Optimization | Sub-pass within all 6 GeneralOptimize bundles; also injected into Mercury pipeline |
| 32 | OptimizeNaNOrZero | Optimization | Standalone NaN/zero folding pass; not documented under current wiki numbering |
| 37 | ConvertMemoryToRegisterOrUniform | Optimization | Sub-pass of GeneralOptimizeMid; gated by knob 487; sub_910840 |
| 41 | Vectorization | Optimization | Load/store vectorization; gated by DisableReadVectorization/DisableWriteVectorization knobs |
| 57 | OriCommoning | Optimization | Commoning sub-pass; related to LateOriCommoning (wiki phase 64) |
| 69 | OriSimpleLiveDead | Optimization | Liveness/DCE sub-pass; related to EarlyOriSimpleLiveDead (wiki phase 10) |
| 73 | LateVectorization | Optimization | Late vectorization (2nd instance, after optimization exposes new opportunities) |
| 77 | SinkCodeIntoBlock | Optimization | Code sinking; sub_78DB70; DisablePhases=SinkCodeIntoBlock gate |
| 103 | LateEnforceArgumentRestrictions | Lowering | Late counterpart to EnforceArgumentRestrictions (wiki phase 48) |
| 114 | ScheduleInstructions | Scheduling | Worker for AdvancedPhasePreSched; sub_8D0640 (22 KB) |
| 115 | UpdateAfterScheduleInstructions | Cleanup | IR metadata refresh after scheduling completes |
| 118 | UpdateAfterOriDoSyncronization | Cleanup | IR metadata refresh after sync insertion (wiki phase 99) |
| 120 | ReportBeforeRegisterAllocation | Reporting | DUMPIR target; diagnostic dump before register allocation |
| 122 | AllocateRegisters | RegAlloc | Worker for AdvancedPhaseAllocReg; canonical allocator entry |
| 124 | UpdateAfterOriAllocateRegisters | Cleanup | IR metadata refresh after register allocation |
| 127 | PostExpansion | Lowering | Worker for AdvancedPhasePostExpansion; post-RA expansion |

All 16 are valid DUMPIR targets (resolvable through sub_C641D0 binary search over the phase name table). Several are also valid DisablePhases targets.

Gate Passes (AdvancedPhase)

Seventeen phases are conditional extension points whose isNoOp() returns true in the default vtable. They exist as insertion points for architecture backends and optimization-level overrides. When a specific SM target or -O level requires additional processing at a given pipeline position, the backend overrides the phase's vtable to provide a real execute() implementation.

Gate passes bracket major pipeline transitions. For example, phases 4 and 7 bracket ConvertUnsupportedOps (phase 5), allowing a backend to inject pre- and post-legalization logic without modifying the fixed phase table. Phase 101 (AdvancedPhaseAllocReg) is the most critical gate -- the entire register allocation subsystem is driven through this hook; the base pipeline contains no hardcoded allocator.

The naming convention is consistent: AdvancedPhase prefix followed by the pipeline position or action name. One exception is AdvancedScoreboardsAndOpexes (phase 115), which uses Advanced without Phase.

Gate Pass Worker Correspondence

Several gate passes dispatch to named worker functions when activated by a backend. The worker names appear in the static name table and are valid DUMPIR/NamedPhases targets:

| Gate Pass (Wiki #) | Worker Function (True Table Index) | Evidence |
|---|---|---|
| AdvancedPhasePreSched (97) | ScheduleInstructions [114] | sub_8D0640, string "ScheduleInstructions" |
| AdvancedPhaseAllocReg (101) | AllocateRegisters [122] | String "Please use -knob DUMPIR=AllocateRegisters" at sub_9714E0 |
| AdvancedPhasePostExpansion (104) | PostExpansion [127] | Post-RA expansion dispatch |
| AdvancedPhasePostFixUp (111) | PostFixUp [140] | Target vtable+0x148 dispatch |

See Optimization Levels for per-gate activation rules.

Update Passes

Nine phases refresh data structures invalidated by preceding transformations. Six are documented at specific wiki phase numbers; three additional update phases exist in the static name table but are not yet mapped to wiki phase numbers (see Numbering Discrepancy above):

| Phase | Name | Refreshes |
|---|---|---|
| 76 | UpdateAfterOptimize | Rebuilds IR metadata after the late optimization group |
| 125 | UpdateAfterPostRegAlloc | Rebuilds IR metadata after register allocation and post-RA fixups |
| 128 | UpdateAfterFormatCodeList | Rebuilds the code list after Mercury encoding reformats instructions |
| 132 | UpdateAfterConvertUnsupportedOps | Rebuilds IR metadata after late unsupported-op expansion |
| 150 | UpdateAfterPostRegAlloc | Late-pipeline duplicate: rebuilds IR metadata after post-RA processing (no-op by default) |
| 154 | UpdateAfterFormatCodeList | Late-pipeline duplicate: rebuilds IR data structures after FormatCodeList (no-op by default) |
| (true 115) | UpdateAfterScheduleInstructions | Refreshes IR after scheduling completes (omitted from compressed numbering) |
| (true 118) | UpdateAfterOriDoSyncronization | Refreshes IR after sync insertion (omitted from compressed numbering) |
| (true 124) | UpdateAfterOriAllocateRegisters | Refreshes IR after register allocation (omitted from compressed numbering) |

These are lightweight passes that call into the IR's internal consistency maintenance routines. They do not transform the IR -- they only update auxiliary data structures (liveness bitmaps, instruction lists, block layout caches) so that downstream passes see a coherent view. Phases 150 and 154 are late-pipeline duplicates whose isNoOp() returns 1 by default; they only activate when a backend requires a second update cycle. The three *(true N)* entries are in the static name table at the indicated indices but are not yet assigned wiki phase numbers.

Report Passes

Ten phases produce diagnostic output. They are no-ops unless specific debug options are enabled (e.g., --stat=phase-wise, DUMPIR, --keep):

| Phase | Name | Output |
|---|---|---|
| 9 | ReportInitialRepresentation | Dumps the Ori IR immediately after initial lowering |
| 96 | ReportBeforeScheduling | Dumps the IR as it enters the scheduling/RA stage |
| 102 | ReportAfterRegisterAllocation | Dumps the IR after register allocation completes |
| (true 120) | ReportBeforeRegisterAllocation | Dumps IR before register allocation; omitted from compressed numbering (name at 0x22BD068) |
| 126 | ReportFinalMemoryUsage | Prints memory pool consumption summary |
| 129 | DumpNVuCodeText | SASS text disassembly (cuobjdump-style) |
| 130 | DumpNVuCodeHex | Raw SASS hex dump |
| 151 | ReportFinalMemoryUsage | Late-pipeline duplicate: memory pool summary (no-op by default, isNoOp=1) |
| 155 | DumpNVuCodeText | Late-pipeline duplicate: SASS text disassembly; guarded by ctx+0x598 and ctx+0x740 |
| 156 | DumpNVuCodeHex | Late-pipeline duplicate: raw SASS hex dump; same guard as phase 155 |

Phase 131 (DebuggerBreak) is a development-only hook that triggers a breakpoint -- it is not a report pass per se, but serves a similar diagnostic purpose. Phase 157 is its late-pipeline counterpart (empty body in release builds).

GeneralOptimize Bundles

The GeneralOptimize* passes are compound optimization bundles that run multiple small transformations (copy propagation, constant folding, algebraic simplification, dead code elimination) in a fixed-point iteration until no further changes occur. They appear at 6 positions throughout the pipeline to re-clean the IR after major transformations:

| Phase | Name | Position |
|---|---|---|
| 13 | GeneralOptimizeEarly | After initial setup, before loop passes |
| 29 | GeneralOptimize | After early loop/branch optimizations |
| 37 | GeneralOptimizeMid | After mid-level transformations |
| 46 | GeneralOptimizeMid2 | After VTA/CTA/mbarrier expansion |
| 58 | GeneralOptimizeLate | After late expansion |
| 65 | GeneralOptimizeLate2 | After predication and late commoning |

See GeneralOptimize Bundles for the sub-pass decomposition.
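The fixed-point driver pattern described above can be sketched as follows. This is a toy model, not recovered code: the sub-pass names, the stand-in IR, and the bundle driver are all hypothetical, illustrating only the "iterate until no sweep changes anything" structure.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

struct IR { std::vector<int> insts; };   // toy stand-in for the Ori IR

// Toy sub-passes; each returns true if it changed the IR.
bool fold(IR& ir) {                      // "constant folding": rewrite 2 -> 1
    bool changed = false;
    for (int& v : ir.insts) if (v == 2) { v = 1; changed = true; }
    return changed;
}
bool dce(IR& ir) {                       // "dead code elimination": drop zeros
    size_t before = ir.insts.size();
    ir.insts.erase(std::remove(ir.insts.begin(), ir.insts.end(), 0),
                   ir.insts.end());
    return ir.insts.size() != before;
}

// Bundle driver: run every sub-pass each sweep, and repeat sweeps
// until one full sweep makes no change (the fixed point).
int run_bundle(IR& ir) {
    const std::vector<std::function<bool(IR&)>> subpasses = {fold, dce};
    int sweeps = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& p : subpasses) changed |= p(ir);
        ++sweeps;                        // final sweep only confirms stability
    }
    return sweeps;
}
```

The last sweep always runs to completion without changes; that is what certifies the fixed point before the bundle returns control to the pipeline.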


O-Level Gating

Twenty-two phases have confirmed optimization-level gates. The O-Level column in the table below annotates every phase where the activation threshold has been verified from decompiled isNoOp() methods or execute-function guards. Phases without an O-Level annotation run at all optimization levels (O0--O5). Threshold notation: > N means the phase requires opt_level > N; == 0 means the phase is active only at O0.

See Optimization Levels for the complete per-phase activation table, the O-level accessor (sub_7DDB50), and the NvOpt recipe system.


Complete 159-Phase Table

Stage 1 -- Initial Setup (Phases 0--13)

Program validation, recipe application, FP16 promotion, control flow analysis, unsupported-op conversion, macro creation, initial diagnostics.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 0 | OriCheckInitialProgram | Validation | | Validates structural correctness of the initial Ori IR after PTX lowering | |
| 1 | ApplyNvOptRecipes | Optimization | | Applies NvOptRecipe transformations (option 391, 440-byte sub-manager) | |
| 2 | PromoteFP16 | Lowering | | Promotes FP16 operations to FP32 where hardware lacks native support | |
| 3 | AnalyzeControlFlow | Analysis | | Builds the CFG: identifies loops, dominators, back edges | |
| 4 | AdvancedPhaseBeforeConvUnSup | Gate | | Hook before unsupported-op conversion; no-op by default | |
| 5 | ConvertUnsupportedOps | Lowering | | Replaces operations not natively supported on the target SM with equivalent sequences | Late Legalization |
| 6 | SetControlFlowOpLastInBB | Cleanup | | Ensures control flow instructions are the final instruction in each basic block | |
| 7 | AdvancedPhaseAfterConvUnSup | Gate | | Hook after unsupported-op conversion; no-op by default | |
| 8 | OriCreateMacroInsts | Lowering | | Expands PTX-level macro instructions into Ori instruction sequences | |
| 9 | ReportInitialRepresentation | Reporting | | Dumps the Ori IR for debugging (no-op unless DUMPIR enabled) | |
| 10 | EarlyOriSimpleLiveDead | Optimization | | Quick early dead code elimination pass | Liveness |
| 11 | ReplaceUniformsWithImm | Optimization | | Replaces uniform register reads with immediate constants where value is known | Uniform Regs |
| 12 | OriSanitize | Validation | | Validates IR consistency after initial setup transformations | |
| 13 | GeneralOptimizeEarly | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (early) | GeneralOptimize |

Stage 2 -- Early Optimization (Phases 14--32)

Branch/switch optimization, loop canonicalization, strength reduction, software pipelining, SSA phi insertion, barrier optimization.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 14 | DoSwitchOptFirst | Optimization | > 0 | Optimizes switch statements: jump table generation, case clustering (1st pass) | Branch & Switch |
| 15 | OriBranchOpt | Optimization | > 0 | Branch folding, unreachable block elimination, conditional branch simplification | Branch & Switch |
| 16 | OriPerformLiveDeadFirst | Analysis | | Full liveness analysis + dead code elimination (1st of 4 major instances) | Liveness |
| 17 | OptimizeBindlessHeaderLoads | Optimization | | Hoists and deduplicates bindless texture header loads | |
| 18 | OriLoopSimplification | Optimization | 4--5 | Canonicalizes loops: single entry, single back-edge, preheader insertion; aggressive loop peeling at O4+ | Loop Passes |
| 19 | OriSplitLiveRanges | Optimization | | Splits live ranges at loop boundaries to reduce register pressure | Liveness |
| 20 | PerformPGO | Optimization | | Applies profile-guided optimization data (block weights, branch probabilities) | |
| 21 | OriStrengthReduce | Optimization | | Replaces expensive operations (multiply, divide) with cheaper equivalents (shift, add) | Strength Reduction |
| 22 | OriLoopUnrolling | Optimization | > 1 | Unrolls loops based on trip count and register pressure heuristics | Loop Passes |
| 23 | GenerateMovPhi | Lowering | | Inserts SSA phi nodes as MOV.PHI pseudo-instructions | |
| 24 | OriPipelining | Optimization | > 1 | Software pipelining: overlaps loop iterations to hide latency | Loop Passes |
| 25 | StageAndFence | Lowering | | Inserts memory fence and staging instructions for coherence | Sync & Barriers |
| 26 | OriRemoveRedundantBarriers | Optimization | > 1 | Eliminates barrier instructions proven redundant by data-flow analysis | Sync & Barriers |
| 27 | AnalyzeUniformsForSpeculation | Analysis | | Identifies uniform values safe for speculative execution | Uniform Regs |
| 28 | SinkRemat | Optimization | > 1 / > 4 | Sinks instructions closer to uses and marks remat candidates; O2+: basic; O5: full cutlass | Rematerialization |
| 29 | GeneralOptimize | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid-early) | GeneralOptimize |
| 30 | DoSwitchOptSecond | Optimization | > 0 | Second switch optimization pass after loop/branch transformations | Branch & Switch |
| 31 | OriLinearReplacement | Optimization | | Replaces branch-heavy patterns with linear (branchless) sequences | |
| 32 | CompactLocalMemory | Optimization | | Compacts local memory allocations by eliminating dead slots and reordering | |

Stage 3 -- Mid-Level Optimization (Phases 33--52)

GVN-CSE, reassociation, shader constant extraction, CTA/VTG expansion, argument enforcement.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 33 | OriPerformLiveDeadSecond | Analysis | | Full liveness analysis + DCE (2nd instance, post-early-optimization cleanup) | Liveness |
| 34 | ExtractShaderConstsFirst | Optimization | | Identifies uniform values loadable from constant memory instead of per-thread computation (1st pass) | |
| 35 | OriHoistInvariantsEarly | Optimization | | Loop-invariant code motion: hoists invariant computations out of loops (early) | Loop Passes |
| 36 | EmitPSI | Lowering | | Emits PSI (Pixel Shader Input) interpolation setup for graphics shaders | |
| 37 | GeneralOptimizeMid | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid) | GeneralOptimize |
| 38 | OptimizeNestedCondBranches | Optimization | > 0 | Simplifies nested conditional branches into flatter control flow | Branch & Switch |
| 39 | ConvertVTGReadWrite | Lowering | | Converts vertex/tessellation/geometry shader read/write operations | |
| 40 | DoVirtualCTAExpansion | Lowering | | Expands virtual CTA operations into physical CTA primitives | |
| 41 | MarkAdditionalColdBlocks | Analysis | | Marks basic blocks as cold based on heuristics and profile data | Hot/Cold |
| 42 | ExpandMbarrier | Lowering | | Expands MBARRIER pseudo-instructions into native barrier sequences | Sync & Barriers |
| 43 | ForwardProgress | Lowering | | Inserts instructions guaranteeing forward progress (prevents infinite stalls) | |
| 44 | OptimizeUniformAtomic | Optimization | | Converts thread-uniform atomic operations into warp-level reductions | |
| 45 | MidExpansion | Lowering | | Target-dependent mid-level expansion of operations before register allocation | Late Legalization |
| 46 | GeneralOptimizeMid2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid 2nd) | GeneralOptimize |
| 47 | AdvancedPhaseEarlyEnforceArgs | Gate | | Hook before argument enforcement; no-op by default | |
| 48 | EnforceArgumentRestrictions | Lowering | | Enforces ABI restrictions on function arguments (register classes, alignment) | |
| 49 | GvnCse | Optimization | > 1 | Global value numbering combined with common subexpression elimination | Copy Prop & CSE |
| 50 | OriReassociateAndCommon | Optimization | | Reassociates expressions for better commoning opportunities, then eliminates commons | Copy Prop & CSE |
| 51 | ExtractShaderConstsFinal | Optimization | | Final shader constant extraction pass (after GVN may expose new constants) | |
| 52 | OriReplaceEquivMultiDefMov | Optimization | | Eliminates redundant multi-definition move instructions with equivalent sources | |

Stage 4 -- Late Optimization (Phases 53--77)

Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 53 | OriPropagateVaryingFirst | Optimization | | Propagates varying (non-uniform) annotations to identify divergent values (1st pass) | |
| 54 | OriDoRematEarly | Optimization | > 1 | Early rematerialization: recomputes cheap values near uses to reduce register pressure | Rematerialization |
| 55 | LateExpansion | Lowering | | Expands operations that must be lowered after high-level optimizations | Late Legalization |
| 56 | SpeculativeHoistComInsts | Optimization | | Speculatively hoists common instructions above branches | |
| 57 | RemoveASTToDefaultValues | Cleanup | | Removes AST (address space type) annotations that have been lowered to defaults | |
| 58 | GeneralOptimizeLate | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late) | GeneralOptimize |
| 59 | OriLoopFusion | Optimization | | Fuses adjacent loops with compatible bounds and no inter-loop dependencies | Loop Passes |
| 60 | DoVTGMultiViewExpansion | Lowering | | Expands multi-view operations for vertex/tessellation/geometry shaders | |
| 61 | OriPerformLiveDeadThird | Analysis | | Full liveness analysis + DCE (3rd instance, post-late-optimization) | Liveness |
| 62 | OriRemoveRedundantMultiDefMov | Optimization | | Removes dead multi-definition move instructions | |
| 63 | OriDoPredication | Optimization | > 1 | If-conversion: converts short conditional branches into predicated instructions | Predication |
| 64 | LateOriCommoning | Optimization | | Late commoning pass: eliminates common subexpressions exposed by predication | Copy Prop & CSE |
| 65 | GeneralOptimizeLate2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late 2nd) | GeneralOptimize |
| 66 | OriHoistInvariantsLate | Optimization | | LICM: hoists loop-invariant code (late, after predication may expose new invariants) | Loop Passes |
| 67 | DoKillMovement | Optimization | | Moves kill annotations closer to last use to improve register pressure | |
| 68 | DoTexMovement | Optimization | | Moves texture fetch instructions to minimize latency exposure | |
| 69 | OriDoRemat | Optimization | > 1 | Late rematerialization: recomputes values exposed by predication and fusion | Rematerialization |
| 70 | OriPropagateVaryingSecond | Optimization | | Propagates varying annotations (2nd pass, after predication changes control flow) | |
| 71 | OptimizeSyncInstructions | Optimization | > 1 | Eliminates and simplifies synchronization instructions | Sync & Barriers |
| 72 | LateExpandSyncInstructions | Lowering | > 2 | Expands sync pseudo-instructions into final hardware sequences | Sync & Barriers |
| 73 | ConvertAllMovPhiToMov | Lowering | | Destroys SSA form: converts MOV.PHI instructions into plain MOV | |
| 74 | ConvertToUniformReg | Optimization | | Converts qualifying values from general registers (R) to uniform registers (UR) | Uniform Regs |
| 75 | LateArchOptimizeFirst | Optimization | | Architecture-specific late optimizations (1st pass) | |
| 76 | UpdateAfterOptimize | Cleanup | | Rebuilds IR metadata invalidated by the late optimization group | |
| 77 | AdvancedPhaseLateConvUnSup | Gate | | Hook at the late unsupported-op boundary; no-op by default | |

Stage 5 -- Legalization (Phases 78--96)

Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attributes, final validation.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 78 | LateExpansionUnsupportedOps | Lowering | | Expands remaining unsupported operations after all optimizations | Late Legalization |
| 79 | OriHoistInvariantsLate2 | Optimization | | LICM (late 2nd pass) after unsupported-op expansion | Loop Passes |
| 80 | ExpandJmxComputation | Lowering | | Expands JMX (jump with index computation) pseudo-instructions | |
| 81 | LateArchOptimizeSecond | Optimization | | Architecture-specific late optimizations (2nd pass) | |
| 82 | AdvancedPhaseBackPropVReg | Gate | | Hook before backward copy propagation; no-op by default | |
| 83 | OriBackCopyPropagate | Optimization | | Backward copy propagation: propagates values backward through move chains | Copy Prop & CSE |
| 84 | OriPerformLiveDeadFourth | Analysis | | Full liveness analysis + DCE (4th instance, pre-legalization cleanup) | Liveness |
| 85 | OriPropagateGmma | Optimization | | Propagates WGMMA accumulator values through the IR | GMMA Pipeline |
| 86 | InsertPseudoUseDefForConvUR | Lowering | | Inserts pseudo use/def instructions for uniform register conversion bookkeeping | Uniform Regs |
| 87 | FixupGmmaSequence | Lowering | | Fixes WGMMA instruction sequences for hardware ordering constraints | GMMA Pipeline |
| 88 | OriHoistInvariantsLate3 | Optimization | | LICM (late 3rd pass) after GMMA fixup | Loop Passes |
| 89 | AdvancedPhaseSetRegAttr | Gate | | Hook before register attribute setting; no-op by default | |
| 90 | OriSetRegisterAttr | Analysis | | Annotates registers with scheduling attributes (latency class, bank assignment) | Scheduling |
| 91 | OriCalcDependantTex | Analysis | | Computes texture instruction dependencies for scheduling | |
| 92 | AdvancedPhaseAfterSetRegAttr | Gate | | Hook after register attribute setting; no-op by default | |
| 93 | LateExpansionUnsupportedOps2 | Lowering | | Second late unsupported-op expansion (catches ops exposed by GMMA/attr passes) | Late Legalization |
| 94 | FinalInspectionPass | Validation | | Final IR validation gate: catches illegal patterns before irreversible scheduling/RA | |
| 95 | SetAfterLegalization | Cleanup | > 1 | Sets post-legalization flag on the compilation context | |
| 96 | ReportBeforeScheduling | Reporting | | Dumps IR before scheduling (no-op unless diagnostic options enabled) | |

Stage 6 -- Scheduling & Register Allocation (Phases 97--103)

Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 97 | AdvancedPhasePreSched | Gate | | Hook before scheduling; when active, dispatches to ScheduleInstructions (sub_8D0640, true table index 114) | Scheduling |
| 98 | BackPropagateVEC2D | Optimization | | Backward-propagates 2D vector register assignments | |
| 99 | OriDoSyncronization | Scheduling | > 1 | Inserts synchronization instructions (BAR, DEPBAR, MEMBAR) per GPU memory model | Sync & Barriers |
| 100 | ApplyPostSyncronizationWars | Scheduling | > 1 | Fixes write-after-read hazards exposed by sync insertion | Sync & Barriers |
| 101 | AdvancedPhaseAllocReg | Gate | | Register allocation driver hook; when active, dispatches to AllocateRegisters (true table index 122); DUMPIR=AllocateRegisters targets this | RegAlloc Architecture |
| 102 | ReportAfterRegisterAllocation | Reporting | | Dumps IR after register allocation (no-op unless diagnostic options enabled) | |
| 103 | Get64bRegComponents | RegAlloc | | Splits 64-bit register pairs into 32-bit components for architectures that require it | RegAlloc Architecture |

Stage 7 -- Post-RA & Post-Scheduling (Phases 104--116)

Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboard generation.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 104 | AdvancedPhasePostExpansion | Gate | | Hook after post-RA expansion; when active, dispatches to PostExpansion (true table index 127) | |
| 105 | ApplyPostRegAllocWars | RegAlloc | | Fixes write-after-read hazards exposed by register allocation | |
| 106 | AdvancedPhasePostSched | Gate | | Hook after post-scheduling; no-op by default | |
| 107 | OriRemoveNopCode | Cleanup | | Removes NOP instructions and dead code inserted as placeholders | |
| 108 | OptimizeHotColdInLoop | Optimization | | Separates hot and cold paths within loops for cache locality | Hot/Cold |
| 109 | OptimizeHotColdFlow | Optimization | | Separates hot and cold paths at the function level | Hot/Cold |
| 110 | PostSchedule | Scheduling | > 0 | Post-scheduling pass: finalizes instruction ordering | Scheduling |
| 111 | AdvancedPhasePostFixUp | Gate | | Hook after post-fixup; when active, dispatches to PostFixUp (phase 140, target vtable+0x148) | |
| 112 | PlaceBlocksInSourceOrder | Cleanup | | Determines final basic block layout in the emitted binary | |
| 113 | PostFixForMercTargets | Encoding | | Fixes up instructions for Mercury encoding requirements | Mercury |
| 114 | FixUpTexDepBarAndSync | Scheduling | | Fixes texture dependency barriers and sync instructions post-scheduling | Scoreboards |
| 115 | AdvancedScoreboardsAndOpexes | Gate | > 0 | Full scoreboard generation: computes 23-bit control word per instruction (-O1+); no-op at -O0 | Scoreboards |
| 116 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0: maximum stalls, barriers at every hazard | Scoreboards |

Scoreboard generation has two mutually exclusive paths. At -O1 and above, phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using sub_A36360 (52 KB) and sub_A23CF0 (54 KB DAG list scheduler), while phase 116 is a no-op. At -O0, phase 115 is a no-op and phase 116 inserts conservative stall counts.
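This either/or selection can be modeled as a pair of complementary isNoOp() predicates. The sketch below is a hypothetical simplification, assuming the gate is purely an opt-level comparison; the recovered checks also consult knob and context state.

```cpp
#include <cassert>

// Hypothetical model of the two mutually exclusive scoreboard paths:
// phase 115 (AdvancedScoreboardsAndOpexes) is a no-op at -O0,
// phase 116 (ProcessO0WaitsAndSBs) is a no-op at -O1 and above.
bool phase115_isNoOp(int opt_level) { return opt_level == 0; }
bool phase116_isNoOp(int opt_level) { return opt_level != 0; }

// Invariant: exactly one scoreboard phase executes at any opt level.
bool exactly_one_runs(int opt_level) {
    return phase115_isNoOp(opt_level) != phase116_isNoOp(opt_level);
}
```

Because both phases sit unconditionally in the 159-entry pipeline, the mutual exclusion lives entirely in the isNoOp() predicates, not in the dispatch order.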

Stage 8 -- Mercury Backend (Phases 117--122)

SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 117 | MercEncodeAndDecode | Encoding | | Converts Ori instructions to Mercury encoding, then round-trip decodes for verification | Mercury |
| 118 | MercExpandInstructions | Encoding | | Expands pseudo-instructions into final SASS instruction sequences | Mercury |
| 119 | MercGenerateWARs1 | Encoding | | Generates write-after-read hazard annotations (1st pass, pre-expansion) | Mercury |
| 120 | MercGenerateOpex | Encoding | | Generates "opex" (operation extension) annotations for each instruction | Mercury |
| 121 | MercGenerateWARs2 | Encoding | | Generates WAR annotations (2nd pass, covers hazards introduced by expansion) | Mercury |
| 122 | MercGenerateSassUCode | Encoding | | Produces the final SASS microcode bytes (the actual binary encoding) | Mercury |

"Mercury" is NVIDIA's internal name for the SASS encoding framework. WAR generation runs in two passes (119, 121) because instruction expansion in phase 118 can introduce new write-after-read hazards. The MercConverter infrastructure (sub_9F1A90, 35 KB) drives instruction-level legalization via a visitor pattern dispatched through sub_9ED2D0 (25 KB opcode switch).

Stage 9 -- Post-Mercury (Phases 123--131)

Register map computation, diagnostics, debug output.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 123 | ComputeVCallRegUse | RegAlloc | | Computes register usage for virtual call sites | |
| 124 | CalcRegisterMap | RegAlloc | | Computes the final physical-to-logical register mapping emitted as EIATTR metadata | RegAlloc Architecture |
| 125 | UpdateAfterPostRegAlloc | Cleanup | | Rebuilds IR metadata after post-RA processing | |
| 126 | ReportFinalMemoryUsage | Reporting | | Prints memory pool consumption summary to stderr | |
| 127 | AdvancedPhaseOriPhaseEncoding | Gate | | Phase encoding hook; no-op by default | |
| 128 | UpdateAfterFormatCodeList | Cleanup | | Rebuilds the code list after Mercury encoding reformats instructions | |
| 129 | DumpNVuCodeText | Reporting | | Dumps human-readable SASS text disassembly | |
| 130 | DumpNVuCodeHex | Reporting | | Dumps raw SASS binary as hex | |
| 131 | DebuggerBreak | Cleanup | | Development hook: triggers a debugger breakpoint at this pipeline position | |

Stage 10 -- Late Cleanup & Late Pipeline (Phases 132--158)

Late merge operations, late unsupported-op expansion, high-pressure live range splitting, Mercury encoding pipeline, register map computation, diagnostics, and debug hooks.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Cleanup | | Rebuilds IR metadata after late unsupported-op conversion | |
| 133 | MergeEquivalentConditionalFlow | Optimization | | Merges basic blocks with equivalent conditional flow (tail merging) | |
| 134 | AdvancedPhaseAfterMidExpansion | Gate | | Hook after mid-level expansion; no-op by default | |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Gate | | Hook for late sync instruction expansion; no-op by default | |
| 136 | LateMergeEquivalentConditionalFlow | Optimization | | Second conditional flow merge pass (catches cases exposed by late transforms) | |
| 137 | LateExpansionUnsupportedOpsMid | Lowering | | Mid-late unsupported-op expansion (between the two merge passes) | Late Legalization |
| 138 | OriSplitHighPressureLiveRanges | RegAlloc | | Last-resort live range splitter when register pressure exceeds hardware limits | RegAlloc Architecture |
| 139 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0; inserts maximum wait counts at every hazard | Scoreboards |
| 140 | PostFixUp | Cleanup | | Target-specific post-fixup dispatch (calls target vtable+0x148) | |
| 141 | MercConverter | Encoding | | Initial Mercury conversion: translates Ori instructions to Mercury format (sub_9F3760) | Mercury |
| 142 | MercEncodeAndDecode | Encoding | | Encode/decode round-trip verification of SASS binary encoding (sub_18F21F0) | Mercury |
| 143 | MercExpandInstructions | Encoding | | Expands Mercury pseudo-instructions into final SASS sequences; gated by ctx+0x570 bit 5 | Mercury |
| 144 | MercGenerateWARs1 | Encoding | | WAR hazard annotation (1st pass, pre-expansion); gated by ctx+0x570 sign bit | Mercury |
| 145 | MercGenerateOpex | Encoding | | Generates operation extension annotations per instruction; gated by ctx+0x570 bit 6 | Mercury |
| 146 | MercGenerateWARs2 | Encoding | | WAR hazard annotation (2nd pass, covers hazards from expansion in phase 143) | Mercury |
| 147 | MercGenerateSassUCode | Encoding | | Final SASS microcode emission: produces the binary bytes for the ELF; gated by ctx+0x571 bit 0 | Mercury |
| 148 | ComputeVCallRegUse | RegAlloc | | Computes register usage for virtual call sites (EIATTR metadata for indirect calls) | |
| 149 | CalcRegisterMap | RegAlloc | | Computes the final physical-to-logical register mapping; gated by ctx+0x590 bit 1 | RegAlloc Architecture |
| 150 | UpdateAfterPostRegAlloc | Cleanup | | Rebuilds IR metadata after post-RA processing (no-op by default, isNoOp=1) | |
| 151 | ReportFinalMemoryUsage | Reporting | | Prints memory pool consumption summary (no-op by default, isNoOp=1) | |
| 152 | AdvancedPhaseOriPhaseEncoding | Gate | | Phase encoding gate; when active, sets ctx+0x610 (pipeline_progress) = 0x15 (21) to mark encoding boundary | |
| 153 | FormatCodeList | Encoding | | Formats the instruction list for ELF output; dispatches through ctx+0x648 vtable+0x10 | Mercury |
| 154 | UpdateAfterFormatCodeList | Cleanup | | Rebuilds IR data structures after FormatCodeList reformats instructions (no-op by default, isNoOp=1) | |
| 155 | DumpNVuCodeText | Reporting | | Dumps human-readable SASS text disassembly; guarded by ctx+0x598 > 0 and ctx+0x740 non-null | |
| 156 | DumpNVuCodeHex | Reporting | | Dumps raw SASS binary as hex; same guard as phase 155 | |
| 157 | DebuggerBreak | Cleanup | | Development hook: convenient breakpoint location for pipeline debugging (empty body in release) | |
| 158 | NOP | Cleanup | | Terminal no-op sentinel; final phase in the 159-phase pipeline | |

Phases 139--158 are 20 late-pipeline phases whose vtable pointers range from off_22BEB80 to off_22BEE78 (40-byte stride). All 20 have names in the static table at off_22BD0C0 (159 entries, not 139). The vtable slot at +16 is isNoOp() (returns 0 for active phases, 1 for phases skipped by default); name resolution goes through the static table indexed by getIndex() at +8.

The Mercury phases (141--147) are gated by flag bits at ctx+0x570/ctx+0x571, allowing backends to selectively enable/disable encoding passes. WAR generation runs in two passes (144, 146) bracketing instruction expansion (143) because expansion can introduce new write-after-read hazards.


Pipeline Ordering Notes

Stage numbering. The 10 stages on this page (Stage 1--10) subdivide the 159-phase OCG pipeline. They are distinct from the 6 timed phases in Pipeline Overview (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo), which cover the entire program lifecycle. All 10 stages here fall within the single OCG timed phase.

Identity ordering. The default ordering table at 0x22BEEA0 is an identity mapping: exec[N] = factory[N] for all 159 phases. The factory index IS the execution order. The original wiki analysis that placed phases 132--138 as "out-of-order slots" was based on a compressed 139-phase model that excluded 20 phases (see note below). In the true 159-phase table, phases execute in strict index order 0--158.

Repeated passes. Several transformations run at multiple pipeline positions because intervening passes expose new opportunities:

| Pass Family | Instances | Phases |
|---|---|---|
| GeneralOptimize* | 6 | 13, 29, 37, 46, 58, 65 |
| OriPerformLiveDead* | 4 | 16, 33, 61, 84 |
| OriHoistInvariants* | 4 | 35, 66, 79, 88 |
| LateExpansionUnsupportedOps* | 3 | 78, 93, 137 |
| ExtractShaderConsts* | 2 | 34, 51 |
| OriPropagateVarying* | 2 | 53, 70 |
| OriDoRemat* | 2 | 54, 69 |
| DoSwitchOpt* | 2 | 14, 30 |
| LateArchOptimize* | 2 | 75, 81 |
| MergeEquivalentConditionalFlow | 2 | 133, 136 |
| MercGenerateWARs* | 2 | 144, 146 |
| UpdateAfterPostRegAlloc | 2 | 125, 150 |
| UpdateAfterFormatCodeList | 2 | 128, 154 |
| ReportFinalMemoryUsage | 2 | 126, 151 |
| DumpNVuCodeText | 2 | 129, 155 |
| DumpNVuCodeHex | 2 | 130, 156 |
| ComputeVCallRegUse | 2 | 123, 148 |
| CalcRegisterMap | 2 | 124, 149 |
| DebuggerBreak | 2 | 131, 157 |
| Vectorization/LateVectorization | 2 | (true 41, 73) -- omitted from compressed numbering |
| EnforceArgumentRestrictions/Late... | 2 | 48 (wiki), (true 103) -- late variant omitted |

Cross-References

Key Functions

| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_C60D30 | -- | Phase factory switch; allocates each of the 159 phases as a 16-byte polymorphic object with a 5-slot vtable (execute, getIndex, isNoOp, NULL, NULL) | 0.92 |
| sub_7DDB50 | 232 B | Opt-level accessor; runtime gate called by 20+ pass execute functions to check opt-level threshold | 0.95 |
| sub_A36360 | 52 KB | Master scoreboard control word generator; per-opcode dispatch for phase 115 (AdvancedScoreboardsAndOpexes) | 0.90 |
| sub_A23CF0 | 54 KB | DAG list scheduler heuristic; barrier assignment for phase 115 scoreboard generation | 0.90 |
| sub_9F1A90 | 35 KB | MercConverter infrastructure; drives instruction-level legalization for Mercury phases 117--122 via visitor pattern | 0.92 |
| sub_9ED2D0 | 25 KB | Opcode switch inside MercConverter; dispatches per-opcode legalization/conversion | 0.90 |
| sub_9F3760 | -- | Phase 141 (MercConverter) execute function; initial Mercury conversion of Ori instructions | 0.85 |
| sub_18F21F0 | -- | Phase 142 (MercEncodeAndDecode) execute function; encode/decode round-trip verification | 0.85 |

Phase Manager Infrastructure

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The PhaseManager is the central orchestration layer in ptxas. It owns the entire 159-phase optimization and code generation pipeline, constructs each phase as a polymorphic object via an abstract factory, and drives execution through a virtual dispatch loop. Every compilation unit passes through the same PhaseManager sequence: construct all 159 phase objects, iterate the phase index array calling execute() on each, optionally collect per-phase timing and memory statistics, then tear down. The PhaseManager also hosts an optional NvOptRecipe sub-manager (440 bytes) for architecture-specific "advanced phase" hooks that inject additional processing at 16 defined points in the pipeline.

The design is a textbook Strategy + Abstract Factory pattern: a 159-case switch statement maps phase indices to vtable pointers, each vtable provides execute(), isNoOp(), and getName() virtual methods, and the dispatch loop iterates a flat index array that defines execution order. This makes the pipeline fully data-driven -- reordering, disabling, or injecting phases requires only modifying the index array, not the dispatch logic.
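The data-driven dispatch described above can be reduced to a short sketch. The class and function names here are hypothetical illustrations of the recovered pattern; the real loop (sub_C64F70) additionally collects per-phase timing and memory statistics.

```cpp
#include <cassert>
#include <vector>

// Minimal model of the Strategy + Abstract Factory dispatch:
// a flat index array defines execution order, and each phase is
// consulted through its vtable (isNoOp, then execute).
struct Phase {
    virtual ~Phase() = default;
    virtual bool isNoOp() const { return false; }
    virtual void execute(std::vector<int>& log, int self) { log.push_back(self); }
};
struct SkippedPhase : Phase {
    bool isNoOp() const override { return true; }   // skipped by the loop
};

void run_pipeline(const std::vector<Phase*>& phases,
                  const std::vector<int>& order,    // identity by default: exec[N] = N
                  std::vector<int>& log) {
    for (int idx : order)
        if (!phases[idx]->isNoOp())
            phases[idx]->execute(log, idx);
}
```

Reordering or disabling phases means editing only the `order` array (or an isNoOp() override); the loop body never changes, which matches the wiki's observation that the pipeline is fully data-driven.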

| Item | Value |
|---|---|
| Core range | 0xC60000--0xC66000 (13 functions, ~17.5 KB) |
| Constructor | sub_C62720 (4,734 bytes) |
| Destructor | sub_C61B20 (1,753 bytes) |
| Phase factory | sub_C60D30 (3,554 bytes, 159-case switch) |
| Dispatch loop | sub_C64F70 (1,455 bytes) |
| Name lookup | sub_C641D0 (305 bytes, case-insensitive binary search) |
| Timing reporter | sub_C64310 (3,168 bytes) |
| Pool reporter | sub_C62200 (888 bytes) |
| Total phases | 159 (139 explicitly named + 20 arch-specific) |
| AdvancedPhase hooks | 16 no-op-by-default insertion points |
| Default phase table | Static array at 0x22BEEA0 (returned by sub_C60D20) |
| Phase name table | Static array at off_22BD0C0 (159 string pointers) |
| Vtable range | off_22BD5C8 (phase 0) through off_22BEE78 (phase 158) |
| Callers | sub_7FB6C0 (main compilation driver), sub_9F63D0 (library/ftrace entry) |
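The name lookup (sub_C641D0) is described above as a case-insensitive binary search over sorted {name_ptr, index} pairs. A sketch under that assumption (the entry layout and comparison details are illustrative, not recovered byte-for-byte):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

struct NameEntry { std::string name; int index; };   // {name_ptr, index} pair

// Case-insensitive three-way comparison.
int icmp(const std::string& a, const std::string& b) {
    size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i) {
        int ca = std::tolower(static_cast<unsigned char>(a[i]));
        int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return ca - cb;
    }
    return static_cast<int>(a.size()) - static_cast<int>(b.size());
}

// Sort once, then binary-search; returns the phase index or -1.
int lookup(std::vector<NameEntry> table, const std::string& name) {
    std::sort(table.begin(), table.end(),
              [](const NameEntry& x, const NameEntry& y) {
                  return icmp(x.name, y.name) < 0;
              });
    int lo = 0, hi = static_cast<int>(table.size()) - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = icmp(name, table[mid].name);
        if (c == 0) return table[mid].index;
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}
```

Case-insensitivity matters here because the sorted table backs user-facing knobs such as DUMPIR and NamedPhases, where users type phase names by hand.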

PhaseManager Object Layout

The PhaseManager is a plain C++ object (no vtable of its own) allocated by the compilation driver. Minimum size is 112 bytes, though the full extent depends on whether timing and NvOptRecipe are enabled.

PhaseManager (112+ bytes)
  +0    int64     compilation_unit      // back-pointer to owning compilation unit
  +8    int64*    allocator             // pool allocator (from compilation_unit->field_16)
  +16   void*     sorted_name_table     // sorted {name_ptr, index} pairs for binary search
  +24   int       sorted_name_count
  +28   int       sorted_name_capacity
  +32   int64*    allocator2            // copy of allocator (for phase list ops)
  +40   void*     phase_list            // array of 16-byte {phase_ptr, pool_ptr} pairs
  +48   int       phase_list_count      // always 159 after construction
  +52   int       phase_list_capacity
  +56   int64     nvopt_recipe_ptr      // NvOptRecipe sub-manager, or NULL
  +64   int64     (reserved)
  +72   bool      timing_enabled        // set from compilation_unit->config->options[17928]
  +76   int       (flags/padding)
  +80   bool      flag_byte             // initialized to 1, reset after first timing report
  +88   int64*    timing_allocator
  +96   void*     phase_name_raw_table  // 159 name string pointers, copied from off_22BD0C0
  +104  int       phase_name_raw_count
  +108  int       phase_name_raw_capacity

The two allocator fields (+8 and +32) both point to the same pool allocator extracted from the compilation unit, but are used in different contexts: +8 for name table operations, +32 for phase list operations.

Phase Object Model

Each phase is a 16-byte polymorphic object:

Phase (16 bytes)
  +0    vtable*   // points to one of 159 vtable instances
  +8    void*     // pool pointer (memory pool for phase-local allocations)

The vtable provides the interface contract:

| Vtable offset | Method | Signature |
|---|---|---|
| +0 | execute | void execute(phase*, compilation_context*) |
| +8 | isNoOp | bool isNoOp(phase*) -- returns true to skip execution |
| +16 | getName | int getName(phase*) -- returns index into name table |
| +24 | alloc | void* alloc(pool*, size_t) -- pool allocator |
| +32 | free | void free(pool*, void*) -- pool deallocator |

The vtable addresses span off_22BD5C8 (phase 0) through off_22BEE78 (phase 158), with a stride of 0x28 (40 bytes) between consecutive entries. All vtables reside in .data.rel.ro.
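The fixed 0x28 stride means any phase's vtable address is computable from phase 0's address. A quick check of the arithmetic against the recovered endpoints:

```cpp
#include <cassert>
#include <cstdint>

// Vtables are laid out contiguously in .data.rel.ro with a 40-byte
// (0x28) stride: five 8-byte slots per vtable.
constexpr uint64_t kVtableBase   = 0x22BD5C8;  // phase 0 (off_22BD5C8)
constexpr uint64_t kVtableStride = 0x28;       // 5 slots * 8 bytes

constexpr uint64_t vtable_addr(int phase_index) {
    return kVtableBase + kVtableStride * static_cast<uint64_t>(phase_index);
}

// 0x22BD5C8 + 158 * 0x28 = 0x22BD5C8 + 0x18B0 = 0x22BEE78
static_assert(vtable_addr(158) == 0x22BEE78, "matches off_22BEE78 (phase 158)");
```

The same formula reproduces the factory's other cases, e.g. off_22BD5F0 for phase 1 and off_22BD618 for phase 2, which is consistent with the factory switch simply selecting entries from this contiguous array.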

Phase Factory -- sub_C60D30

The factory is a 159-case switch statement that serves as the sole point of phase instantiation. For each case:

  1. Extracts the pool allocator from context->field_16
  2. Allocates 16 bytes via pool_alloc (vtable offset +24)
  3. Writes the case-specific vtable pointer at offset +0
  4. Returns a {phase_ptr, pool_ptr} pair

The default case returns {NULL, NULL}, which the caller treats as an invalid phase index.

// Pseudocode for sub_C60D30
pair<phase*, pool*> PhaseFactory(int phase_index, context* ctx) {
    pool* p = ctx->allocator;
    phase* obj = p->alloc(16);
    switch (phase_index) {
        case 0:   obj->vtable = off_22BD5C8; break;  // OriCheckInitialProgram
        case 1:   obj->vtable = off_22BD5F0; break;  // ApplyNvOptRecipes
        case 2:   obj->vtable = off_22BD618; break;  // PromoteFP16
        // ... 156 more cases ...
        case 158: obj->vtable = off_22BEE78; break;  // sentinel/NOP
        default:  return {NULL, NULL};
    }
    return {obj, p};
}

Called exclusively by the constructor (sub_C62720).

Construction Sequence -- sub_C62720

The constructor performs 11 steps, building all internal data structures and instantiating every phase:

// Pseudocode for sub_C62720
bool PhaseManager::construct(compilation_unit* cu) {
    this->cu          = cu;
    this->allocator   = cu->field_16;      // extract pool allocator
    this->allocator2  = cu->field_16;

    // 1. Check timing flag
    this->timing_enabled = cu->config->options[17928];

    // 2. Allocate and copy phase name table (1272 = 159 * 8 bytes)
    this->phase_name_raw_table = alloc(1272);
    memcpy(this->phase_name_raw_table, off_22BD0C0, 1272);
    this->phase_name_raw_count    = 159;
    this->phase_name_raw_capacity = 159;

    // 3. Initialize timing records
    resize_timing(/*capacity=*/159);                    // sub_C62580
    cu->timing_count++;                                 // at cu+1576
    append_timing({index=-1, name=0x2030007, time=0, flags=0});  // sentinel

    // 4. Create all 159 phase objects
    resize_phase_list(/*capacity=*/159);                // sub_C62640
    for (int i = 0; i < 159; i++) {
        auto [phase, pool] = PhaseFactory(i, cu);       // sub_C60D30
        phase_list[i] = {phase, pool};
    }

    // 5. Optionally create NvOptRecipe sub-manager
    if (cu->config->getOption(391)) {
        auto* recipe = alloc(440);
        // initialize hash table, ref-counted lists, timing arrays (8 entries)
        // inherit phase chain from previous execution context
        this->nvopt_recipe_ptr = recipe;
    }
    this->flag_byte = 1;
    return true;
}

Key constants:

  • 159 -- total phase count, used as loop bound and array capacities
  • 1272 -- 159 * 8, phase name pointer table size in bytes
  • 440 -- NvOptRecipe sub-manager object size
  • 0x2030007 (33,739,079) -- timing sentinel magic value
  • Option 17928 -- enables per-phase timing/memory reporting
  • Option 391 -- enables NvOptRecipe sub-manager

Destruction Sequence -- sub_C61B20

Teardown mirrors construction in reverse order, with careful handling of the NvOptRecipe's reference-counted shared state:

// Pseudocode for sub_C61B20
void PhaseManager::destroy() {
    // 1. Free raw name table
    timing_allocator->free(phase_name_raw_table);

    // 2. Tear down NvOptRecipe if present
    if (nvopt_recipe_ptr) {
        auto* r = nvopt_recipe_ptr;
        // decrement shared_list ref-count at +432
        if (--r->shared_list_refcount == 0)
            free_list_nodes(r->shared_list);
        free(r->hash_buckets);        // +408
        free(r->sorted_array);        // +376
        free(r->timing_records);      // +344, stride=584 per entry
        free(r->node_pool);           // +16
        free(r);
    }

    // 3. Destroy each phase via virtual destructor (vtable+32)
    for (int i = 0; i < phase_list_count; i++) {
        auto [phase, pool] = phase_list[i];
        pool->free(phase);            // invokes vtable+32
    }

    // 4. Free base arrays
    allocator2->free(phase_list);
    allocator->free(sorted_name_table);
}

The ref-count decrement-and-destroy pattern on shared_list at +432 follows C++ shared_ptr semantics: the NvOptRecipe may share state across multiple compilation units in library mode.

Phase Dispatch Loop -- sub_C64F70

The dispatch loop is the runtime engine. It takes a slice of the phase index array and executes each phase in order:

// Pseudocode for sub_C64F70
bool PhaseManager::dispatch(int* phase_indices, int count) {
    memory_snapshot_t base_snap;
    take_snapshot(&base_snap);                          // sub_8DADE0

    for (int i = 0; i < count; i++) {
        int idx = phase_indices[i];
        phase* p = this->phase_list[idx].phase;

        // Resolve phase name
        int name_idx = p->getName();                    // vtable+16
        const char* name = this->phase_name_raw_table[name_idx];

        // Record timing entry
        append_timing({idx, name, opt_level, flags, metrics});

        // Take pre-execution snapshot
        memory_snapshot_t pre_snap;
        take_snapshot(&pre_snap);

        // Execute (unless no-op)
        if (!p->isNoOp()) {                             // vtable+8
            p->execute(this->cu);                       // vtable+0
            // Construct diagnostic: "Before <name>" or "After <name>"
        }

        // Report per-phase stats
        if (this->timing_enabled) {
            report_phase_stats(name, &pre_snap, false); // sub_C64310
            this->flag_byte = 0;
        }
    }

    // Summary after all phases
    if (this->timing_enabled) {
        report_phase_stats("All Phases Summary", &base_snap, true);
        report_pool_consumption();                      // sub_C62200
    }
    return true;
}

The "Before " / "After " diagnostic prefixes use a compact encoding trick: the string "Before " is stored as the 64-bit little-endian integer 0x2065726F666542, so the compiler can materialize the whole prefix with a single 8-byte mov instead of a memcpy.
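The packed constant can be verified by storing it to memory and reading it back as a string. This sketch assumes a little-endian host (as on the x86-64 binary under analysis); `decode_prefix` is an illustrative helper, not code from ptxas:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Unpack a 64-bit immediate into the character string it encodes.
// On a little-endian machine the low byte lands first, so
// 0x2065726F666542 stores as 42 65 66 6F 72 65 20 00 = "Before \0".
std::string decode_prefix(uint64_t packed) {
    char buf[9] = {0};
    std::memcpy(buf, &packed, 8);   // single 8-byte store, like the mov
    return std::string(buf);
}
```

Note the high byte of the constant is zero, which doubles as the NUL terminator for the 7-character prefix.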

Phase Name Lookup -- sub_C641D0

External callers (e.g., --ftrace-phase-after option processing in sub_9F4040) resolve phase names to indices through a case-insensitive binary search:

// Pseudocode for sub_C641D0
int PhaseManager::lookup_phase(const char* name) {
    ensure_sorted();                                    // sub_C63FA0

    int lo = 0, hi = sorted_name_count - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcasecmp(sorted_name_table[mid].name, name);
        if (cmp == 0) return sorted_name_table[mid].index;
        if (cmp < 0)  lo = mid + 1;
        else           hi = mid - 1;
    }
    return 158;  // sentinel: last phase (NOP)
}

The sorted name table is rebuilt on demand by sub_C63FA0 when the raw count differs from the sorted count. Sorting uses an iterative quicksort (sub_C639A0) with median-of-three pivot selection and three-way partitioning. The sort stack is pre-allocated to 33 entries, comfortably above the roughly ceil(log2(160)) = 8 depth bound that recursing into the smaller partition first guarantees for 159 names.
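The lookup itself can be modeled faithfully from the pseudocode above. This sketch uses POSIX strcasecmp (as the binary does) and a tiny illustrative subset of the name table rather than the real 159-entry table:

```cpp
#include <strings.h>   // strcasecmp (POSIX, as used by sub_C641D0)
#include <vector>

struct NameEntry { const char* name; int index; };

// Case-insensitive binary search over a table sorted by name; a miss
// returns 158, the index of the trailing NOP sentinel phase.
int lookup_phase(const std::vector<NameEntry>& sorted, const char* name) {
    int lo = 0, hi = static_cast<int>(sorted.size()) - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcasecmp(sorted[mid].name, name);
        if (cmp == 0) return sorted[mid].index;
        if (cmp < 0)  lo = mid + 1;
        else          hi = mid - 1;
    }
    return 158;  // sentinel: last phase (NOP)
}
```

The case-insensitive comparison is what lets users spell --ftrace-phase-after names in any capitalization.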

Per-Phase Timing and Memory Reporting

When timing is enabled (option 17928), the dispatch loop calls sub_C64310 after each phase to print memory statistics:

<indent><phase_name>  ::  [Total 1234 KB]  [Freeable 567 KB]  [Freeable Leaked 12 KB] (2%)

The reporter computes three memory deltas from snapshot pairs:

  Metric           Helper      Meaning
  Total            sub_8DAE20  Total memory allocated since snapshot
  Freeable         sub_8DAE30  Memory eligible for release
  Freeable Leaked  sub_8DAE40  Freeable memory not actually released

Size formatting thresholds:

  • 0--1023: raw bytes (suffix B)
  • 1024--10,485,760: kilobytes with 3 decimal places (suffix KB)
  • above 10 MB: megabytes with 3 decimal places (suffix MB)
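The thresholds above can be sketched as a small formatter. This is a reconstruction, not the recovered routine: the inclusive/exclusive treatment of the 10 MB boundary and the exact spacing are assumptions.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Size formatting per the recovered thresholds: raw bytes below 1 KB,
// kilobytes (3 decimals) up to 10 MB, megabytes (3 decimals) above.
std::string format_size(uint64_t bytes) {
    char buf[64];
    if (bytes < 1024)
        std::snprintf(buf, sizeof buf, "%llu B", (unsigned long long)bytes);
    else if (bytes <= 10ULL * 1024 * 1024)   // 10,485,760 boundary (assumed inclusive)
        std::snprintf(buf, sizeof buf, "%.3f KB", bytes / 1024.0);
    else
        std::snprintf(buf, sizeof buf, "%.3f MB", bytes / (1024.0 * 1024.0));
    return buf;
}
```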

After all phases complete, the loop prints an "All Phases Summary" line using the same reporter, then calls sub_C62200 to print the pool consumption total:

[Pool Consumption = 45.678 MB]

Timing Record Format

Each timing entry is 32 bytes:

Timing Record (32 bytes)
  +0    int       phase_index       // -1 for sentinel
  +8    int64     phase_name        // string pointer, or 0x2030007 for sentinel
  +16   int64     timing_value      // elapsed time
  +24   int       memory_flags      // opt level / additional metrics

Records are stored in a growable array at compilation_unit+1560. Growth uses a 1.5x strategy: new_capacity = max(old + old/2 + 1, requested).
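The growth formula is simple enough to state directly; `grow_capacity` below is an illustrative name for the policy, not a recovered symbol:

```cpp
#include <cstddef>

// 1.5x growth with a floor of the requested capacity:
// new_capacity = max(old + old/2 + 1, requested).
// The "+1" guarantees progress even from capacity 0.
size_t grow_capacity(size_t old_cap, size_t requested) {
    size_t grown = old_cap + old_cap / 2 + 1;
    return grown > requested ? grown : requested;
}
```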

NvOptRecipe Sub-Manager (440 bytes)

When option 391 is enabled, the constructor creates a 440-byte NvOptRecipe sub-manager at PhaseManager+56. This object provides the runtime for "AdvancedPhase" hooks -- the 16 phases that are no-ops by default but can be activated for architecture-specific or optimization-level-specific processing. The NvOpt level (0--5) controls per-phase aggressiveness independently of the -O CLI level: -O gates which phases run at all, while the NvOpt level controls how aggressively active phases behave.

Object Layout

NvOptRecipe (440 bytes)
  +0    int64     compilation_unit           // back-pointer to owning CU
  +8    int64     phase_manager_backref      // back-pointer to PhaseManager
  +16   void*     node_pool                  // 24-byte ref-counted list node
  +24   int64     secondary_bucket_count     // secondary hash (migration buffer)
  +32   void*     secondary_bucket_array     // secondary hash bucket array
  +40   int64     secondary_total_entries    // secondary hash entry count
  +48   (264 B)   [opaque internal region]   // +48..+311 undecoded
  +312  int64     recipe_data                // from option 391 value (ext. pointer)
  +320  int64     (reserved)                 // zeroed in constructor
  +328  (8 B)     [alignment gap]
  +336  int64     allocator                  // cu->field_16->field_16
  +344  void*     timing_records             // stride = 584 bytes per entry
  +352  int32     timing_count               // init -1 (empty sentinel)
  +356  int32     timing_flags               // init 0
  +360  int32     timing_extra               // init 0
  +364  (4 B)     (padding)
  +368  int64*    timing_allocator            // cu->field_16->field_16 copy
  +376  void*     sorted_array               // 4-byte entries, init capacity = 8
  +384  int32     sorted_count               // init 7 (pre-filled)
  +388  int32     sorted_capacity            // init 8
  +392  void*     ref_counted_list_2         // 24-byte ref-counted list node
  +400  int32     hash_bucket_count          // primary hash table bucket count
  +404  (4 B)     (padding)
  +408  void*     hash_buckets               // primary hash, 24-byte stride/bucket
  +416  int64     hash_size                  // total entries across all buckets
  +424  (8 B)     (padding)
  +432  void*     shared_list_ptr            // ref-counted, shared across CUs

Sub-Structures

Ref-Counted List Node (24 bytes) -- used at +16, +392, +432:

RefCountedListNode (24 bytes)
  +0    int64     refcount        // manual shared_ptr: decrement-and-destroy
  +8    void*     next            // singly-linked list chain
  +16   void*     allocator       // for self-deallocation when refcount → 0

When the refcount reaches zero, the destructor walks the next chain freeing each node, then frees the head node itself through the allocator at +16.
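The decrement-and-destroy teardown can be sketched as follows. Field names are descriptive labels for the recovered offsets (+0 refcount, +8 next, +16 allocator), not symbols from the binary, and the deallocation callback stands in for the pool allocator hook:

```cpp
// 24-byte ref-counted list node, manual shared_ptr semantics.
struct RefNode {
    long     refcount;            // +0: shared owner count
    RefNode* next;                // +8: singly-linked chain
    void   (*dealloc)(void*);     // +16: self-deallocation hook
};

int g_freed = 0;                          // test instrumentation only
void counting_free(void* p) { (void)p; ++g_freed; }

// Drop one reference; on the last release, walk the chain freeing every
// node (head included) through its allocator. Returns true if destroyed.
bool release(RefNode* head) {
    if (--head->refcount != 0) return false;
    for (RefNode* n = head; n != nullptr; ) {
        RefNode* next = n->next;   // read link before freeing the node
        n->dealloc(n);
        n = next;
    }
    return true;
}
```

This mirrors the destructor's behavior at +432: intermediate releases only decrement; the final release frees the whole chain.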

Hash Bucket Entry (24 bytes) -- array at +408:

HashBucketEntry (24 bytes)
  +0    void*     chain_head      // first element in bucket chain
  +8    void*     chain_sentinel  // end-of-chain sentinel
  +16   int32     chain_count     // number of elements in this bucket

Timing Record (584 bytes) -- array at +344:

TimingRecord (584 bytes)
  +0    (40 B)    header
  +40   void*     sub_allocator   // allocator for sub-data at +48
  +48   void*     sub_data        // freed during cleanup
  +56   int32     sub_count       // set to -1 when cleaned
  +60   int32     cleanup_flag    // if >= 0: sub_data exists, free it
  +64   (520 B)   timing/metric data

Records are iterated backward during cleanup (base + 584 * (count + 1) - 584 down to base). The sentinel value -1 at offset +56 marks an entry as already cleaned up.
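The backward walk with the -1 sentinel can be sketched like this. The struct uses descriptive stand-ins for offsets +48/+56/+60, and nulling the pointer stands in for freeing the sub-allocation:

```cpp
#include <vector>

// Stand-in for the 584-byte timing record's cleanup-relevant fields.
struct TimingRec {
    void* sub_data;      // +48: freed during cleanup
    int   sub_count;     // +56: -1 once cleaned
    int   cleanup_flag;  // +60: >= 0 means sub_data exists
};

// Visit records last-to-first; skip already-cleaned entries, free
// sub-data where the flag says it exists, then mark the entry cleaned.
// Returns how many sub-allocations were released.
int cleanup_backward(std::vector<TimingRec>& records) {
    int freed = 0;
    for (int i = static_cast<int>(records.size()) - 1; i >= 0; --i) {
        TimingRec& r = records[i];
        if (r.sub_count == -1) continue;   // sentinel: already cleaned
        if (r.cleanup_flag >= 0) {
            r.sub_data = nullptr;          // stands in for the actual free
            ++freed;
        }
        r.sub_count = -1;                  // mark cleaned
    }
    return freed;
}
```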

Construction Sequence

The constructor (sub_C62720, lines 356--850 in decompilation) performs these steps:

  1. Check option 391 -- fast path: *(config_obj[9] + 28152) != 0; slow path: virtual call with argument 391. If disabled, skip entirely.

  2. Read option 391 value -- the value is the recipe_data pointer. Fast path checks type tag 5 (int64) at config offset 28152, reads the 64-bit value at offset 28160. This is an externally-provided pointer, not computed locally.

  3. Allocate 440 bytes from the pool allocator at compilation_unit->field_16.

  4. Initialize core fields -- back-pointers at +0/+8, node_pool at +16 (24-byte ref-counted node, refcount=1), zero +24/+32/+40, store recipe_data at +312.

  5. Initialize timing -- zero +344, set +352 to -1 (empty sentinel), zero +360, copy allocator to +336 and +368.

  6. Allocate sorted_array -- initial capacity 8 entries (32 bytes), pre-fill 7 entries, set +384 = 7, +388 = 8.

  7. Allocate ref_counted_list_2 at +392 (24-byte node, refcount=1), zero +400/+408/+416.

  8. Allocate shared_list at +432 (24-byte node, refcount=1).

  9. Inherit from previous recipe -- if PhaseManager+56 already holds an NvOptRecipe from a prior compilation unit:

    • Decrement old shared_list refcount; free if zero
    • Migrate hash bucket chains from old recipe to new ref_counted_list_2
    • Walk old timing records backward (stride 584), freeing sub-allocations
    • Drain old secondary hash table, release old node_pool
    • Free old NvOptRecipe object
  10. Install -- set PhaseManager+56 = new recipe, PhaseManager+64 = allocator.

Destruction Sequence

The destructor (sub_C61B20) tears down the recipe in reverse:

  1. Decrement shared_list_ptr (+432) refcount; free linked nodes if zero
  2. Walk hash buckets (+408, stride 24, count from +416): for each chain element, clean sub-entries (timing at offsets +56/+60/+64/+76), decrement per-entry refcounts at element [9], append to ref_counted_list_2; zero bucket; reset +400 to 0
  3. Clean up ref_counted_list_2 (+392); free if refcount zero
  4. Free sorted_array (+376) if sorted_count (+388) >= 0
  5. Walk timing_records (+344) backward, stride 584, freeing sub-allocations; reset +352 to -1
  6. Drain secondary hash (+24/+32/+40), move chains to node_pool
  7. Release node_pool (+16); free if refcount zero
  8. Free the 440-byte object via PhaseManager+64 allocator

NvOpt Level Validation

The recipe application function sub_C173E0 validates the NvOpt level at each recipe record:

// At sub_C173E0 + 0x2FD9 (line 1431)
int nvopt_level = *(int*)(recipe_record + 344);
if (nvopt_level > 5) {
    emit_warning(cu + 1232, 8000, "Invalid nvopt level : %d.", nvopt_level);
    // warning 8000 (0x1F40) -- non-fatal, compilation continues
}

Valid levels are 0--5. The level is consumed as a bitmask 1 << nvopt_level, passed to a vtable call that dispatches on a recipe configuration byte at target descriptor offset 35280 (8-case switch: cases 0--5, 7). This byte controls which recipe application mode is used for the target architecture.

Shared State in Library Mode

The shared_list at +432 enables recipe state persistence across compilation units in library mode (multiple .ptx files compiled by one ptxas invocation):

  • Each new NvOptRecipe sets its shared_list refcount to 1
  • During inheritance (step 9), hash bucket contents are migrated from the old recipe to the new one, accumulating per-kernel recipe decisions
  • When a PhaseManager is destroyed, the recipe decrements the shared_list refcount; only the last reference frees the nodes
  • This allows the NvOptRecipe to cache per-kernel optimization decisions across compilation passes

Key Constants

  Value   Meaning
  440     NvOptRecipe object size (bytes)
  584     Per-entry timing record stride (bytes)
  24      Hash bucket entry size / ref-counted list node size
  8       Initial sorted_array capacity
  7       Initial sorted_count (pre-filled entries)
  391     Option ID (enables NvOptRecipe; value = recipe data pointer)
  28152   Option 391 type-tag offset in config storage
  28160   Option 391 value offset (8 bytes after type tag)
  0x1F40  Warning code 8000: "Invalid nvopt level"
  5       Maximum valid NvOpt level
  35280   Recipe config byte offset in target descriptor

Multi-Function Dispatch -- sub_C60BD0

When a compilation unit contains more than one function, sub_C60BD0 redirects to a per-function dispatch path:

// Pseudocode for sub_C60BD0
void PhaseManager::invoke_multi(compilation_unit* cu) {
    int func_count = get_function_count(cu);            // sub_7DDB50
    if (func_count > 1) {
        auto list1 = create_refcounted_list();
        auto list2 = create_refcounted_list();
        this->phase_chain = current_chain;              // +88
        per_function_dispatch(cu, list1, list2);        // sub_790A40
        release(list1);
        release(list2);
    }
}

Complete Phase Table

Stage numbering. The 7 groups below are a coarse summary of the 159-phase OCG pipeline. The authoritative fine-grained grouping is the 10-stage scheme in the Pass Inventory (OCG-Stage 1--10). The 7-group table here collapses several of those stages for brevity; phase boundaries differ slightly. When citing a stage by number, prefer the Pass Inventory's 10-stage numbering.

Group 1: Initial Setup (phases 0--12)

  Index  Phase Name                    Purpose
  0      OriCheckInitialProgram        Validate initial Ori IR
  1      ApplyNvOptRecipes             Apply NvOptRecipe transformations
  2      PromoteFP16                   Promote FP16 operations where beneficial
  3      AnalyzeControlFlow            Build/analyze control flow graph
  4      AdvancedPhaseBeforeConvUnSup  Hook -- before unsupported op conversion
  5      ConvertUnsupportedOps         Lower unsupported operations to supported sequences
  6      SetControlFlowOpLastInBB      Mark control flow ops as last in basic block
  7      AdvancedPhaseAfterConvUnSup   Hook -- after unsupported op conversion
  8      OriCreateMacroInsts           Create macro instruction patterns
  9      ReportInitialRepresentation   Diagnostic dump of initial IR
  10     EarlyOriSimpleLiveDead        Early dead code elimination
  11     ReplaceUniformsWithImm        Replace uniform register loads with immediates
  12     OriSanitize                   IR consistency checks

Group 2: Early Optimization (phases 13--36)

  Index  Phase Name                     Purpose
  13     GeneralOptimizeEarly           First GeneralOptimize pass (peephole + simplify)
  14     DoSwitchOptFirst               Switch statement optimization, first pass
  15     OriBranchOpt                   Branch simplification and folding
  16     OriPerformLiveDeadFirst        Liveness analysis, first pass
  17     OptimizeBindlessHeaderLoads    Optimize bindless texture header loads
  18     OriLoopSimplification          Canonicalize loop structure
  19     OriSplitLiveRanges             Split long live ranges to reduce pressure
  20     PerformPGO                     Apply profile-guided optimizations
  21     OriStrengthReduce              Strength reduction on induction variables
  22     OriLoopUnrolling               Loop unrolling
  23     GenerateMovPhi                 Convert phi nodes to MOV-phi representation
  24     OriPipelining                  Software pipelining of loops
  25     StageAndFence                  Memory staging and fence insertion
  26     OriRemoveRedundantBarriers     Remove unnecessary barrier instructions
  27     AnalyzeUniformsForSpeculation  Identify uniform values for speculative execution
  28     SinkRemat                      Sink rematerializable instructions
  29     GeneralOptimize                Main GeneralOptimize pass
  30     DoSwitchOptSecond              Switch optimization, second pass
  31     OriLinearReplacement           Replace complex patterns with linear sequences
  32     CompactLocalMemory             Compact local memory layout
  33     OriPerformLiveDeadSecond       Liveness analysis, second pass
  34     ExtractShaderConstsFirst       Extract shader constants, first pass
  35     OriHoistInvariantsEarly        Early loop-invariant hoisting
  36     EmitPSI                        Emit program state information

Group 3: Mid-Level Optimization (phases 37--58)

  Index  Phase Name                     Purpose
  37     GeneralOptimizeMid             Mid-pipeline GeneralOptimize
  38     OptimizeNestedCondBranches     Simplify nested conditional branches
  39     ConvertVTGReadWrite            Convert vertex/tessellation/geometry read/write ops
  40     DoVirtualCTAExpansion          Expand virtual CTA operations
  41     MarkAdditionalColdBlocks       Mark additional basic blocks as cold
  42     ExpandMbarrier                 Expand mbarrier intrinsics
  43     ForwardProgress                Ensure forward progress guarantees
  44     OptimizeUniformAtomic          Optimize uniform atomic operations
  45     MidExpansion                   Mid-pipeline lowering and expansion
  46     GeneralOptimizeMid2            Second mid-pipeline GeneralOptimize
  47     AdvancedPhaseEarlyEnforceArgs  Hook -- before argument restrictions
  48     EnforceArgumentRestrictions    Enforce ABI argument constraints
  49     GvnCse                         Global value numbering and common subexpression elimination
  50     OriReassociateAndCommon        Reassociation and commoning
  51     ExtractShaderConstsFinal       Extract shader constants, final pass
  52     OriReplaceEquivMultiDefMov     Replace equivalent multi-def MOVs
  53     OriPropagateVaryingFirst       Varying propagation, first pass
  54     OriDoRematEarly                Early rematerialization
  55     LateExpansion                  Late lowering of complex operations
  56     SpeculativeHoistComInsts       Speculatively hoist common instructions
  57     RemoveASTToDefaultValues       Remove AST nodes set to default values
  58     GeneralOptimizeLate            Late GeneralOptimize

Group 4: Late Optimization (phases 59--95)

  Index  Phase Name                     Purpose
  59     OriLoopFusion                  Fuse compatible loops
  60     DoVTGMultiViewExpansion        Expand multi-view VTG operations
  61     OriPerformLiveDeadThird        Liveness analysis, third pass
  62     OriRemoveRedundantMultiDefMov  Remove redundant multi-def MOVs
  63     OriDoPredication               If-conversion / predication
  64     LateOriCommoning               Late value commoning
  65     GeneralOptimizeLate2           Second late GeneralOptimize
  66     OriHoistInvariantsLate         Late invariant hoisting
  67     DoKillMovement                 Move kill instructions for better scheduling
  68     DoTexMovement                  Move texture instructions for latency hiding
  69     OriDoRemat                     Main rematerialization pass
  70     OriPropagateVaryingSecond      Varying propagation, second pass
  71     OptimizeSyncInstructions       Optimize synchronization instructions
  72     LateExpandSyncInstructions     Expand sync instructions to HW sequences
  73     ConvertAllMovPhiToMov          Convert all MOV-phi to plain MOV
  74     ConvertToUniformReg            Promote values to uniform registers
  75     LateArchOptimizeFirst          Architecture-specific late optimization, first pass
  76     UpdateAfterOptimize            Post-optimization bookkeeping
  77     AdvancedPhaseLateConvUnSup     Hook -- before late unsupported op expansion
  78     LateExpansionUnsupportedOps    Late lowering of unsupported operations
  79     OriHoistInvariantsLate2        Second late invariant hoisting
  80     ExpandJmxComputation           Expand JMX (join/merge) computations
  81     LateArchOptimizeSecond         Architecture-specific late optimization, second pass
  82     AdvancedPhaseBackPropVReg      Hook -- before back-copy propagation
  83     OriBackCopyPropagate           Backward copy propagation
  84     OriPerformLiveDeadFourth       Liveness analysis, fourth pass
  85     OriPropagateGmma               GMMA/WGMMA propagation
  86     InsertPseudoUseDefForConvUR    Insert pseudo use/def for uniform reg conversion
  87     FixupGmmaSequence              Fix up GMMA instruction sequences
  88     OriHoistInvariantsLate3        Third late invariant hoisting
  89     AdvancedPhaseSetRegAttr        Hook -- before register attribute setting
  90     OriSetRegisterAttr             Set register attributes (types, constraints)
  91     OriCalcDependantTex            Calculate dependent texture operations
  92     AdvancedPhaseAfterSetRegAttr   Hook -- after register attribute setting
  93     LateExpansionUnsupportedOps2   Second late unsupported op expansion
  94     FinalInspectionPass            Final IR validity checks
  95     SetAfterLegalization           Mark legalization complete

Group 5: Scheduling and Register Allocation (phases 96--105)

  Index  Phase Name                     Purpose
  96     ReportBeforeScheduling         Diagnostic dump before scheduling
  97     AdvancedPhasePreSched          Hook -- before scheduling
  98     BackPropagateVEC2D             Back-propagate 2D vector instructions
  99     OriDoSyncronization            Insert synchronization instructions
  100    ApplyPostSyncronizationWars    Apply post-synchronization write-after-read fixes
  101    AdvancedPhaseAllocReg          Hook -- register allocation
  102    ReportAfterRegisterAllocation  Diagnostic dump after regalloc
  103    Get64bRegComponents            Extract 64-bit register components
  104    AdvancedPhasePostExpansion     Hook -- after post-expansion
  105    ApplyPostRegAllocWars          Apply post-regalloc write-after-read fixes

Group 6: Post-Schedule and Code Generation (phases 106--131)

  Index  Phase Name                     Purpose
  106    AdvancedPhasePostSched         Hook -- after scheduling
  107    OriRemoveNopCode               Remove NOP instructions
  108    OptimizeHotColdInLoop          Hot/cold partitioning within loops
  109    OptimizeHotColdFlow            Hot/cold partitioning across flow
  110    PostSchedule                   Post-scheduling fixups
  111    AdvancedPhasePostFixUp         Hook -- after post-schedule fixup
  112    PlaceBlocksInSourceOrder       Reorder blocks to match source order
  113    PostFixForMercTargets          Mercury target-specific fixups
  114    FixUpTexDepBarAndSync          Fix texture dependency barriers and sync
  115    AdvancedScoreboardsAndOpexes   Hook -- before scoreboard generation
  116    ProcessO0WaitsAndSBs           Process O0-level waits and scoreboards
  117    MercEncodeAndDecode            Mercury encode to SASS and decode-verify
  118    MercExpandInstructions         Expand macro instructions to SASS
  119    MercGenerateWARs1              Generate write-after-read hazard stalls, pass 1
  120    MercGenerateOpex               Generate operand exchange stalls
  121    MercGenerateWARs2              Generate write-after-read hazard stalls, pass 2
  122    MercGenerateSassUCode          Emit final SASS microcode
  123    ComputeVCallRegUse             Compute virtual call register usage
  124    CalcRegisterMap                Calculate final register map
  125    UpdateAfterPostRegAlloc        Post-regalloc bookkeeping
  126    ReportFinalMemoryUsage         Report final memory consumption
  127    AdvancedPhaseOriPhaseEncoding  Hook -- before final encoding
  128    UpdateAfterFormatCodeList      Update after code list formatting
  129    DumpNVuCodeText                Dump NV microcode as text (debug)
  130    DumpNVuCodeHex                 Dump NV microcode as hex (debug)
  131    DebuggerBreak                  Debugger breakpoint (debug)

Group 7: Late Cleanup (phases 132--158)

  Index     Phase Name                               Purpose
  132       UpdateAfterConvertUnsupportedOps         Bookkeeping after late conversion
  133       MergeEquivalentConditionalFlow           Merge equivalent conditional branches
  134       AdvancedPhaseAfterMidExpansion           Hook -- after mid-expansion
  135       AdvancedPhaseLateExpandSyncInstructions  Hook -- after late sync expansion
  136       LateMergeEquivalentConditionalFlow       Late merge of equivalent conditionals
  137       LateExpansionUnsupportedOpsMid           Mid-point late unsupported op expansion
  138       OriSplitHighPressureLiveRanges           Split live ranges under high register pressure
  139--158  (architecture-specific)                  20 additional phases whose names come from each phase's getName() method

Phases 139--158 are not in the static name table at off_22BD0C0. Their names are returned by each phase's getName() virtual method. These are conditionally-enabled phases for specific architecture targets (SM variants) or optimization levels.

AdvancedPhase Hook Points

The 16 AdvancedPhase entries are insertion points for architecture-specific or optimization-level-specific processing. All return isNoOp() == true by default. When activated (typically by NvOptRecipe configuration for a specific SM target), they execute additional transformations at precisely defined points in the pipeline:

  Index  Hook Name                                Insertion Context
  4      AdvancedPhaseBeforeConvUnSup             Before ConvertUnsupportedOps
  7      AdvancedPhaseAfterConvUnSup              After ConvertUnsupportedOps
  47     AdvancedPhaseEarlyEnforceArgs            Before EnforceArgumentRestrictions
  77     AdvancedPhaseLateConvUnSup               Before LateExpansionUnsupportedOps
  82     AdvancedPhaseBackPropVReg                Before OriBackCopyPropagate
  89     AdvancedPhaseSetRegAttr                  Before OriSetRegisterAttr
  92     AdvancedPhaseAfterSetRegAttr             After OriSetRegisterAttr
  97     AdvancedPhasePreSched                    Before scheduling pipeline
  101    AdvancedPhaseAllocReg                    Register allocation entry point
  104    AdvancedPhasePostExpansion               After post-regalloc expansion
  106    AdvancedPhasePostSched                   After scheduling
  111    AdvancedPhasePostFixUp                   After post-schedule fixup
  115    AdvancedScoreboardsAndOpexes             Before scoreboard/opex generation
  127    AdvancedPhaseOriPhaseEncoding            Before final instruction encoding
  134    AdvancedPhaseAfterMidExpansion           After mid-level expansion
  135    AdvancedPhaseLateExpandSyncInstructions  After late sync instruction expansion

Mercury Encoding Sub-Pipeline

Phases 113--122 form a self-contained sub-pipeline that transforms the optimized, register-allocated Ori IR into final SASS machine code via the Mercury encoding format:

PostFixForMercTargets (113)
  → FixUpTexDepBarAndSync (114)
    → [AdvancedScoreboardsAndOpexes hook (115)]
      → ProcessO0WaitsAndSBs (116)
        → MercEncodeAndDecode (117)      ← encode to SASS + decode for verification
          → MercExpandInstructions (118) ← expand remaining macros
            → MercGenerateWARs1 (119)    ← first WAR hazard pass
              → MercGenerateOpex (120)   ← operand exchange stalls
                → MercGenerateWARs2 (121)← second WAR hazard pass
                  → MercGenerateSassUCode (122) ← final microcode emission

"Mercury" is NVIDIA's internal name for the SASS encoding format on recent GPU architectures (Blackwell-era SM 100/103/110/120).

Diagnostic Strings

  Address    String                   Emitted By  Context
  0x22BC3B3  "[Pool Consumption = "   sub_C62200  After all phases summary
  0x22BC416  "All Phases Summary"     sub_C64F70  End of dispatch loop
  (inline)   " :: "                   sub_C64310  Phase timing line separator
  (inline)   "[Total "                sub_C64310  Total memory delta
  (inline)   "[Freeable "             sub_C64310  Freeable memory delta
  (inline)   "[Freeable Leaked "      sub_C64310  Leaked memory delta
  (inline)   "Before " / "After "     sub_C64F70  Phase execution diagnostic

Function Map

  Address     Size   Function                         Confidence
  sub_C60D20  16     Default phase table pointer      HIGH
  sub_C60D30  3,554  Phase factory (159-case switch)  VERY HIGH
  sub_C60BD0  334    Multi-function phase invoker     MEDIUM-HIGH
  sub_C61B20  1,753  PhaseManager destructor          VERY HIGH
  sub_C62200  888    Pool consumption reporter        VERY HIGH
  sub_C62580  253    Timing record array resizer      HIGH
  sub_C62640  223    Phase list resizer               HIGH
  sub_C62720  4,734  PhaseManager constructor         VERY HIGH
  sub_C639A0  1,535  Case-insensitive quicksort       HIGH
  sub_C63FA0  556    Phase name table sort/rebuild    HIGH
  sub_C641D0  305    Phase name-to-index lookup       VERY HIGH
  sub_C64310  3,168  Per-phase timing reporter        VERY HIGH
  sub_C64F70  1,455  Phase dispatch loop              VERY HIGH

Cross-References

GeneralOptimize Bundles

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The GeneralOptimize* passes are compound optimization bundles that run multiple sub-transformations in sequence on each basic block, repeating until no further changes occur (fixed-point iteration). They serve as the primary IR cleanup mechanism throughout the pipeline: after any major transformation introduces new dead code, redundant copies, or foldable constants, a GeneralOptimize pass re-normalizes the IR before the next major phase.

Six instances exist at strategic positions in the 159-phase pipeline. Despite sharing the "GeneralOptimize" name prefix, the six instances decompose into three distinct implementation families -- a lightweight block-iteration variant, a heavyweight bitvector-tracked orchestrator, and an indirect vtable dispatch stub. Each family shares a common architectural pattern (per-block iteration with convergence check) but invokes different sub-pass combinations and has different gate conditions.

  Instances: 6 (phases 13, 29, 37, 46, 58, 65)
  Pattern: Per-block iteration with convergence check
  Sub-passes: Copy propagation, constant folding, structural equivalence elimination, dead code elimination, predicate simplification, register promotion (Phase 37)
  Convergence: Boolean change flag per iteration; stops when no sub-pass reports a change
  Iteration cap: Knob-controlled (option 464); breaks loop if knob returns false
  Single-function fast path: Phases 13 and 65 have direct tail-call paths bypassing the multi-function dispatch
  Multi-function gate: All variants check sub_7DDB50(ctx) > 1 before entering the main loop
  Code range: Execute functions at 0xC5F940--0xC60870; sub-pass bodies at 0x7917F0--0x910840

Instance Map

Phase 13 -- GeneralOptimizeEarly
  Vtable: off_22BD7D0    execute(): 0xC5F940
  Sub-pass body: sub_7917F0 (multi-func) / 0x1C64BF0 (single-func)
  Gates: bit 2 of ctx+1382 must be set

Phase 29 -- GeneralOptimize
  Vtable: off_22BDA50    execute(): 0xC5FC50
  Sub-pass body: sub_908EB0
  Gates: option 487 enabled; option 231 not set; option 461 pass

Phase 37 -- GeneralOptimizeMid
  Vtable: off_22BDB90    execute(): 0xC5FD70
  Sub-pass body: sub_910840
  Gates: sub_8F3EA0 pre-check; option 487; "ConvertMemoryToRegisterOrUniform" name-gate

Phase 46 -- GeneralOptimizeMid2
  Vtable: off_22BDCF8    execute(): 0xC60840
  Sub-pass body: indirect via [*(ctx+1584)]->vtable[0x1C0]
  Gates: vtable dispatch; skips if target == sub_7D6DD0 (no-op sentinel)

Phase 58 -- GeneralOptimizeLate
  Vtable: off_22BDED8    execute(): 0xC5FF20
  Sub-pass body: sub_8F7080
  Gates: function count > 2; bits 4-5 of ctx+1396 != 0x20; option 31 checked

Phase 65 -- GeneralOptimizeLate2
  Vtable: off_22BDFF0    execute(): 0xC60550
  Sub-pass body: indirect via [*(ctx+1584)]->vtable[392]
  Gates: function count > 1; indirect dispatch through compilation unit vtable

Architecture: Three Structural Variants

Variant A: Block-Iteration with Explicit Fixed-Point Loop (Phases 13, 29)

The Early and standard GeneralOptimize passes iterate over basic blocks with an explicit convergence loop. Phase 13 (GeneralOptimizeEarly) at sub_7917F0 is the simplest and best-documented:

// sub_7917F0 -- GeneralOptimizeEarly (multi-function path)
void GeneralOptimizeEarly(int64_t ctx) {
    if (!(*(uint8_t*)(ctx + 1382) & 4))   return;   // gate: optimization flag

    // Option 214 check -- uses vtable fast-path comparison:
    //   if vtable[72] == sub_6614A0, reads *(config + 15408) directly
    //   otherwise calls the virtual getOption(214)
    if (getOption(ctx, 214))               return;   // gate: skip if set

    // Option 487 check -- uses vtable[152] fast-path:
    //   if vtable[152] == sub_67EB60, calls sub_7468B0(config, 487)
    //   otherwise calls the virtual isOptionSet(487, 1)
    if (!getOption_v2(ctx, 487))           return;   // gate: general opt enable

    if (*(int64_t*)(*(int64_t*)ctx + 1056)) return;  // gate: already processed

    sub_785E20(ctx, 0);                    // reset per-block change tracking
    sub_781F80(ctx, 1);                    // initialize instruction flags
    sub_7E6090(ctx, 0, 0, 0, 0);          // prepare operand use/def chains
    sub_7E6AD0(ctx, 0, ...);              // build def-use/use-def links

    // Iterate over basic blocks (block_count at ctx+520)
    bool any_changed = false;
    int bb_count = *(int32_t*)(ctx + 520);
    for (int i = 1; i <= bb_count; i++) {
        // block_order at ctx+512, block_table at ctx+296
        int bb_idx = *(int32_t*)(*(int64_t*)(ctx + 512) + 4*i);
        BasicBlock* bb = *(BasicBlock**)(*(int64_t*)(ctx + 296) + 8*bb_idx);

        // Fixed-point loop on this block
        int64_t state[...];   // stack-allocated state at rbp-0x88
        while (true) {
            bool changed = sub_753600(&state, bb);   // run sub-passes
            if (!changed)  break;
            any_changed = true;

            // Iteration cap: knob 464
            if (!getOption_v2(ctx, 464))  break;

            sub_753B50(&state);            // apply instruction rewrites
        }
    }

    if (any_changed)
        sub_785E20(ctx, 0);                // re-normalize if anything changed
}

The inner function sub_753600 runs on a single basic block and returns a boolean indicating whether any transformation fired. When it returns true, sub_753B50 applies the accumulated changes (instruction replacement, operand rewriting, def-use chain updates), and the loop re-runs sub_753600 on the same block to check if the new IR enables further simplifications.

The convergence check for option 464 acts as an emergency brake: if the knob returns false, the loop breaks even if changes were detected. This prevents pathological cases where mutual transformations oscillate indefinitely.
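The convergence discipline can be re-expressed as a small runnable sketch. Everything here is invented for illustration (the toy dedup sub-pass, the `Block` type, the function names); only the control flow -- run sub-passes, check the knob, apply rewrites, repeat until nothing changes -- mirrors the recovered loop:

```python
# Hypothetical model of the per-block fixed-point loop in sub_7917F0.
# run_subpasses / apply_rewrites / knob_allows_iteration stand in for
# sub_753600, sub_753B50, and the option-464 query respectively.

def optimize_block(block, knob_allows_iteration, run_subpasses, apply_rewrites):
    """Re-run the sub-passes until nothing changes or the knob vetoes."""
    iterations = 0
    while True:
        changed = run_subpasses(block)      # returns True if a match fired
        if not changed:
            break                           # converged: fixed point reached
        if not knob_allows_iteration():     # option-464 emergency brake
            break
        apply_rewrites(block)               # commit the accumulated changes
        iterations += 1
    return iterations

class Block(list):
    _pending = None

# Toy sub-pass: find one adjacent duplicate "instruction" pair per call.
def dedup_pass(block):
    for i in range(len(block) - 1):
        if block[i] == block[i + 1]:
            block._pending = i
            return True
    return False

def apply(block):
    del block[block._pending]

b = Block(["mov", "mov", "mov", "add"])
n = optimize_block(b, lambda: True, dedup_pass, apply)
print(list(b), n)   # ['mov', 'add'] 2
```

Note that the knob is consulted after a change is detected but before it is applied, so a vetoing knob abandons the pending rewrite exactly as the decompiled loop does.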

Phase 29 (sub_C5FC50) follows the same pattern but delegates to sub_908EB0, which implements a more complex instruction walk with additional opcode dispatch (opcodes 97 [STG in ROT13; used here as a definition anchor], 18 [FSETP], 124 [conditional select]) and predicate-aware propagation.

Variant B: Full-Program Sub-Pass Orchestration (Phases 37, 58)

The Mid and Late variants operate at a higher level: they construct a multi-field context structure, initialize bitvector tracking infrastructure, and call a heavyweight sub-pass orchestrator.

Phase 37 -- GeneralOptimizeMid (sub_910840)

  1. Calls sub_8F3EA0 -- a pre-condition check (returns false to skip the entire pass)
  2. Checks option 487 (general optimization enable) via the same vtable fast-path pattern
  3. Calls sub_799250 with the string "ConvertMemoryToRegisterOrUniform" (at 0x21DD228) -- a named phase gate that allows the pass to be selectively disabled via --no-phase
  4. Constructs a 0x408-byte context object on the stack with vtable pointer off_21DBEF8 at offset 0. The layout is:
    GeneralOptimizeMid Context (0x408 bytes)
      +0x000  vtable_ptr     = off_21DBEF8
      +0x008  allocator      = *(ctx + 16)
      +0x010  (zero-init)    ...
      +0x018  (zero-init)    ...
      +0x020  (zero-init)    ...
      +0x030  int count      = 0
      +0x040  sub_context    -- initialized by sub_905B50 (bitvectors, register tracking)
      ...
    
  5. Calls sub_905B50 -- a 500+ line setup function that creates bitvector arrays for tracking register definitions, use-def chains, and per-block change flags. Allocates three pairs of {bitvector, metadata, capacity} structures for tracking definition reach, register liveness, and fold eligibility
  6. Calls sub_90FBA0 -- the main optimization loop that iterates over all blocks, running sub-passes per instruction

After sub_90FBA0 returns, the function destroys three RAII-style bitvector containers at offsets +0x200, +0x228, and +0x1E0 by invoking their vtable destructors via *(vtable + 32).
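The teardown pattern can be modeled abstractly. This is an illustrative sketch only -- the container class and slot layout are assumed, with the destructor slot placed at byte offset 32 (slot 4 with 8-byte vtable entries) to match the `*(vtable + 32)` call in the decompilation:

```python
# Hedged model of calling a destructor through a vtable slot, as the
# phase-37 epilogue does for its three bitvector containers.

SLOT_SIZE = 8
DTOR_OFFSET = 32                      # *(vtable + 32) in the decompilation

class BitvectorContainer:
    def __init__(self, name, vtable):
        self.name = name
        self.vtable = vtable          # list of callables, one per slot

    def destroy(self):
        # Equivalent of (*(void(**)(void*))(*(void**)obj + 32))(obj)
        self.vtable[DTOR_OFFSET // SLOT_SIZE](self)

destroyed = []
vtbl = [None] * 8
vtbl[DTOR_OFFSET // SLOT_SIZE] = lambda obj: destroyed.append(obj.name)

# The three RAII containers live at context offsets +0x200, +0x228, +0x1E0
for off in (0x200, 0x228, 0x1E0):
    BitvectorContainer(hex(off), vtbl).destroy()

print(destroyed)   # ['0x200', '0x228', '0x1e0']
```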

Phase 58 -- GeneralOptimizeLate (sub_8F7080)

  1. Checks function count > 2 via sub_7DDB50 (stricter than other variants that check > 1)
  2. Checks optimization level bits at ctx+1396: the condition (flags & 0x30) != 0x20 ensures the pass is skipped at certain reduced optimization levels
  3. Checks option 31 via the vtable fast-path; when option 31 reports as "extended" (value at config+2232 is 1 with non-zero extra word at config+2240), an additional sub_7DC0E0 check determines a secondary control flag v7
  4. Constructs a 0x168-byte context on the stack with 7 sub-pass tracking groups. Each group occupies 56 bytes (three __int128 values + a boolean changed-flag + a counter):
    GeneralOptimizeLate Context (0x168 bytes)
      +0x000  ctx_ptr     = ctx (the compilation context)
      +0x008  flag_a      -- initialized from (ctx+1396 & 4)
      +0x009  flag_b      -- initialized from (ctx+1396 & 8)
      +0x00C  counter_0   = 0   |
      +0x010  changed_0   = 0   | Sub-pass group 0 (56 bytes)
      +0x018  ...              |
      +0x048  counter_1   = 0   | Sub-pass group 1
      ...
      +0x12C  counter_6   = 0   | Sub-pass group 6
      +0x130  changed_6   = 0   |
      +0x138  ...              |
    
  5. Calls sub_8F6FA0 -- the block iterator

The block iterator sub_8F6FA0 initializes per-context flags from ctx+1396:

  • Bit 2 (& 4): stored at context+9, controls whether opcode-7 instructions are processed
  • Bit 3 (& 8): stored at context+8, controls whether opcode-6 (MOV variant) instructions are processed

It then calls sub_7E6090 to rebuild use-def chains and walks the block list calling sub_8F6530 per block.
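The bit tests on the ctx+1396 word are simple enough to state precisely. The following sketch (function names assumed, masks taken from the text) shows both the per-context flag derivation in sub_8F6FA0 and the phase-58 optimization-level gate:

```python
# Illustrative restatement of the ctx+1396 bit semantics described above.

def derive_flags(opt_word: int) -> dict:
    return {
        "process_opcode7": bool(opt_word & 4),   # bit 2 -> stored at context+9
        "process_opcode6": bool(opt_word & 8),   # bit 3 -> stored at context+8
    }

def phase58_level_gate(opt_word: int) -> bool:
    # Phase 58 runs only when (flags & 0x30) != 0x20
    return (opt_word & 0x30) != 0x20

print(derive_flags(0b1100))        # both opcode classes enabled
print(phase58_level_gate(0x20))    # False: pass skipped at this level
```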

Variant C: Indirect Vtable Dispatch (Phases 46, 65)

The Mid2 and Late2 variants use indirect vtable dispatch to call their sub-pass bodies, making the exact implementation architecture-dependent:

Phase 46 (GeneralOptimizeMid2) at 0xC60840:

mov  rdi, [rsi+0x630]      ; load sm_backend (compilation_context+1584)
mov  rax, [rdi]             ; load vtable
mov  rax, [rax+0x1C0]      ; load vtable slot 56 (offset 0x1C0 = 448)
cmp  rax, 0x7D6DD0          ; compare against no-op sentinel
jne  call_it                ; if not sentinel, call it
ret                          ; otherwise, return (phase is no-op)
call_it:
jmp  rax                    ; tail-call the vtable method

Phase 65 (GeneralOptimizeLate2) at sub_C60550:

// sub_C60550 -- GeneralOptimizeLate2 execute
int64_t GeneralOptimizeLate2(int64_t phase, int64_t ctx) {
    int64_t result = sub_7DDB50(ctx);       // get function count
    if ((int)result > 1) {
        int64_t comp_unit = *(int64_t*)(ctx + 1584);
        return (*(int64_t(**)(int64_t, int64_t))(*(int64_t*)comp_unit + 392))(comp_unit, ctx);
    }
    return result;
}

This indirection means the actual optimization behavior for phases 46 and 65 is determined by the compilation unit's vtable, which varies by target architecture and optimization level. The no-op sentinel sub_7D6DD0 (for phase 46) indicates that some architectures skip this pass entirely.
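The sentinel-compare dispatch can be sketched in a few lines. The model below is hypothetical (Python dicts standing in for C++ objects); the slot indices match the decompilation (0x1C0/8 = slot 56 for phase 46, 392/8 = slot 49 for phase 65):

```python
# Hedged model of the Variant C dispatch: tail-call a vtable slot unless
# it holds the no-op sentinel.

def noop_sentinel(comp_unit, ctx):       # stands in for sub_7D6DD0
    return None

def run_phase46(comp_unit, ctx):
    fn = comp_unit["vtable"][0x1C0 // 8]
    if fn is noop_sentinel:
        return None                      # phase is a no-op on this target
    return fn(comp_unit, ctx)            # tail-call the real body

vtable = [noop_sentinel] * 64
unit_a = {"vtable": vtable}
print(run_phase46(unit_a, ctx=None))     # None: sentinel, pass skipped

vtable_b = list(vtable)
vtable_b[56] = lambda cu, ctx: "optimized"
unit_b = {"vtable": vtable_b}
print(run_phase46(unit_b, ctx=None))     # optimized
```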

Sub-Pass Decomposition

The sub-passes that run inside a GeneralOptimize iteration are not named individually in the binary -- they are inline code within the per-block processing functions. Based on the decompiled logic, the following sub-transformations are identifiable:

Copy Propagation Algorithm

String evidence: "OriCopyProp" at 0x21E6CE1 appears in the phase name table at index 22, confirming that copy propagation is a recognized sub-pass within the system.

Two distinct copy propagation algorithms exist across the GeneralOptimize variants:

Algorithm A: Chain-Matching Copy Propagation (Phase 13 -- sub_753600)

Phase 13's copy propagation operates by matching structurally equivalent instruction pairs connected through single-use def-use chains. The 253-line function sub_753600 uses a state structure (8 int64_t fields, allocated on the stack at rbp-0x88 in sub_7917F0) that accumulates matched chain endpoints:

sub_753600 State Layout (8 qwords)
  state[0] = ctx           -- Code Object pointer (set by caller)
  state[1] = match_start   -- first matched instruction in chain
  state[2] = match_end     -- last matched instruction in chain
  state[3] = def_entry_a   -- first definition chain entry (from sub_753520)
  state[4] = reg_entry     -- register/BB entry for replacement target
  state[5] = def_entry_b   -- extended chain entry (second level)
  state[6] = reg_entry_b   -- extended register/BB entry

The algorithm proceeds in eight steps:

// sub_753600 -- Phase 13 copy propagation (decompiled pseudocode)
function copy_prop_early(state, basic_block):
    ctx = state[0]
    first_instr = *(basic_block[1])              // head of instruction list

    // Step 1: Entry gate -- only process blocks starting with control-flow terminator
    if first_instr.opcode != 95: return false    // opcode 95 = STS in ROT13; used as terminator class
    if first_instr.operand_count != 5: return false
    format = first_instr[25] & 7
    if format != 3 and format != 4: return false // must be imm or reg source

    // Step 2: Single-use chain check
    use_link = basic_block[17]                   // use-def chain link
    if use_link == NULL: return false
    if *use_link == NULL: return false
    if **use_link != NULL: return false           // must be SINGLE consumer

    // Step 3: Follow to defining instruction via opcode-97 anchor
    next_instr = *(basic_block[1] + 8)           // linked list next
    if next_instr.opcode != 97: return false     // must be def anchor
    reg_entry = *(ctx+296)[ next_instr.bb_index ] // BB/def lookup

    // Step 4: Walk def-use chain to find structural match
    chain_a = follow_chain_filtered(state, reg_entry)  // sub_753520
    if chain_a == NULL: return false
    state[3] = chain_a

    // Step 5: Walk reverse chain from chain_a
    chain_b = follow_reverse_chain(state, chain_a)     // sub_753570
    if chain_b == NULL: return false
    state[1] = chain_b
    state[2] = chain_b

    // Step 6: Predicate-operand compatibility check
    endpoint_instr = *(chain_b[1])
    if endpoint_instr.opcode != 95: return false
    if !predicate_operand_compatible(first_instr, endpoint_instr): return false
                                                       // sub_7E7380

    // Step 7: Operand-level matching
    if operand formats differ (format-4 parity mismatch): return false
    if reg_indices match AND metadata matches AND modifiers match:
        goto apply   // direct match

    // Step 7b: Deep sub-DAG equivalence (for non-trivial patterns)
    if both sources are register type (bits 28-30 == 1)
       and both have use_count <= 1
       and both defining instructions have opcode 119
       and no aliasing hazards (sub_748570)
       and sub_1245740(ctx, def_a, def_b, 2):   // depth-2 DAG compare
        goto apply

    return false

apply:
    // Step 8: Record replacement target
    state[4] = register_entry_for_replacement
    // Optionally follow one more chain level for state[5]/state[6]
    return true   // caller invokes sub_753B50 to rewrite

The chain walker sub_753480 (43 lines) is the core of this algorithm. It follows single-use, single-def chains within a basic block:

// sub_753480 -- def-use chain walker (at 0x753480)
function follow_chain(ctx, entry, &skip_flag):
    skip_flag = false
    if entry == NULL: return NULL
    current = entry
    loop:
        if check_multi_condition_skip(current):   // sub_7E5120
            skip_flag = true                      // chain crossed a skip point

        if current[16] == NULL: break             // no next-use link
        if *current[16] != NULL: break            // MULTI-USE: stop

        if current[17] == NULL: break             // no def link
        if *current[17] != NULL: break            // MULTI-DEF: stop

        def_bb_idx = *(current[17] + 8)
        instr_bb_idx = *(current[1] + 8).bb_index  // at +24
        if def_bb_idx != instr_bb_idx: break      // CROSS-BB: stop

        next_instr = *(current[1] + 8)
        if next_instr.opcode == 97:               // def anchor
            current = *(ctx+296)[ def_bb_idx ]    // follow chain
            continue
        else:
            return NULL                           // chain broken

    return current                                // last valid entry

Key properties of this walker:

  • Only follows single-use chains (current[16] must have exactly one consumer)
  • Only follows single-def chains (current[17] must have exactly one producer)
  • Only follows intra-block chains (definition and use must share the same BB index)
  • Only traverses through opcode 97 (definition anchor) instructions
  • The check_multi_condition_skip (sub_7E5120, 18 lines) tests four conditions: vtable dispatch at ctx+1784, block ordering bounds at ctx+1776, instruction flags at +283 bit 0, and knob 91

The helper sub_753520 wraps sub_753480 with an additional opcode-93 gate: the chain endpoint's instruction must have opcode 93 (OUT_FINAL in ROT13; used as an internal chain-link marker) and the use-chain at entry[16] must be empty. sub_753570 performs the reverse direction check, verifying that following the chain backward from a given entry reaches the expected starting point with matching register indices.
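The walker's stopping rules translate into a compact sketch. The data model below (dict entries with `uses`, `defs`, `bb`, `opcode` fields) is invented; only the rules themselves -- stop on multi-use, multi-def, or cross-block links, and traverse only through opcode-97 anchors -- come from the decompilation:

```python
# Loose model of sub_753480's chain-walking discipline.

def follow_chain(entry):
    """Walk single-use/single-def links within one basic block."""
    current = entry
    while True:
        if len(current["uses"]) != 1:
            break                        # multi-use consumer: stop here
        if len(current["defs"]) != 1:
            break                        # multi-def producer: stop here
        nxt = current["defs"][0]
        if nxt["bb"] != current["bb"]:
            break                        # crosses a basic block: stop
        if nxt["opcode"] != 97:
            return None                  # non-anchor link: chain broken
        current = nxt                    # follow the definition anchor
    return current                       # last valid entry

# A two-link chain entirely inside bb 0, anchored by opcode-97 entries;
# the walk stops at the tail because it has two consumers.
tail = {"uses": [1, 2], "defs": [], "bb": 0, "opcode": 97}
head = {"uses": [1], "defs": [tail], "bb": 0, "opcode": 97}
print(follow_chain(head) is tail)   # True
```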

Algorithm B: Forward Walk with Flag Marking (Phase 29 -- sub_908EB0)

Phase 29's copy propagation walks the instruction linked list sequentially from *(ctx+272) (instruction list head) and marks eligible operands with flag bits for later consumption. The 217-line function sub_908EB0 maintains three key state variables:

| Variable | Type | Purpose |
|---|---|---|
| v10 | bool | "previous instruction was a recognized copy" -- gates liveness fallback |
| v11 | int64_t | Current definition tracking entry (BB array pointer, from opcode 97) |
| v21 | char | Architecture-allows-predicate-marking flag (from vtable at **(ctx+1584)+1312) |

// sub_908EB0 -- Phase 29 forward copy propagation (decompiled pseudocode)
function copy_prop_forward(ctx):
    // Gate checks: option 487, option 231, option 461, function count,
    // architecture check via sub_7DC0E0, vtable dispatch at +1312
    v21 = check_arch_predicate_marking(ctx)

    sub_781F80(ctx, 1)         // initialize per-instruction flags
    v10 = initial_gate_flag    // from option 487 check
    v11 = 0                    // no current definition context

    for instr in instruction_list(ctx+272):
        opcode = instr.opcode & ~0x3000          // mask bits 12-13

        switch opcode:
        case 97:   // DEFINITION ANCHOR
            v10 = initial_gate_flag              // reset copy-tracking
            v11 = *(ctx+296)[ instr.operand[0] & 0xFFFFFF ]
            // Updates definition context -- subsequent opcodes 18/124
            // reference v11 for their propagation decisions
            continue

        case 18:   // SET-PREDICATE (FSETP/ISETP)
            if sub_8F2E50(ctx, instr):           // eligible?
                v10 = false                       // suppress liveness check
                if v21:                           // arch supports pred marking?
                    dst_idx = count + ~((opcode>>11) & 2)
                    instr.operand[dst_idx] |= 0x400   // mark: propagated-under-predicate
            continue

        case 124:  // CONDITIONAL SELECT
            if !sub_8F2E50(ctx, instr): continue
            dst_idx = count + ~((opcode>>11) & 2)
            dst = instr.operand[dst_idx]

            if (ctx+1379) & 7 == 0:              // simple mode
                dst |= 0x100                      // mark: propagated
                continue

            if (dst & 0xF) == 1:                  // integer constant type
                if !sub_8F29C0(ctx): continue     // arch check
                // fall through to direct marking
            else:
                if !sub_8F29C0(ctx) or (ctx+1379 & 0x1B) != 0:
                    // Two-pass predicate simplifier
                    sub_908A60(ctx, v11, instr, 1, &hit, &partial)  // forward
                    if hit: goto mark_propagated
                    if !partial:
                        sub_908A60(ctx, v11, instr, 0, &hit, &partial) // backward
                        if hit: goto mark_propagated
                        if !partial: continue     // no match at all
                // Direct propagation: convert operand type
                dst = (dst & 0xFFFFFDF0) | 0x201  // clear type, set reg+deferred
                continue

            // Liveness-gated propagation check for extended chains
            if !v10 or v21:
                mark_propagated:
                instr.operand[dst_idx] |= 0x100   // mark: propagated
            else:
                // Follow definition chain from v11 for additional candidates
                follow_and_check_chain(ctx, v11, instr)
            continue

        default:
            if !v10:                               // no prior copy recognized
                status = sub_7DF3A0(instr, ctx)   // liveness check
                v10 = (*status & 0xC) != 0        // live uses exist?
            continue

Target Opcodes in Copy Propagation Context

| Opcode | IR Meaning | Role in Copy Prop | Evidence |
|---|---|---|---|
| 97 | Definition anchor / label marker (STG in the ROT13 name table; used here as a definition anchor, not an actual store-global instruction) | Updates the current definition tracking context (v11). Its operand instr+84 & 0xFFFFFF is an index into the BB array at ctx+296, retrieving the BasicBlock descriptor for the definition point. All subsequent propagation decisions for opcodes 18 and 124 reference this context. | sub_908EB0 lines 74--78: v11 = *(*(a1+296) + 8 * (*(v9+84) & 0xFFFFFF)) |
| 18 | FSETP/ISETP (set predicate) | A predicate-setting comparison instruction. Copy propagation treats it as a "predicated copy" target: when source operands have type 2 or 3 (predicate/uniform register) and pass sub_91D150 register constraint checks, the destination predicate can be folded into consumers. Marked with 0x400 when the architecture supports it. | sub_908EB0 lines 84--96, sub_8F2E50 lines 19--61 |
| 124 | Conditional select (phi-like) | A two-source selection instruction controlled by a predicate. Copy propagation attempts to simplify it to a direct assignment when one source is a constant or when structural analysis shows the predicate is trivially true/false. Marked with 0x100 or type-converted via (operand & 0xFFFFFDF0) \| 0x201. | |

Flag Bit Semantics

The propagation marks operands with three flag bits on the destination operand word at instr + 84 + 8*dst_idx:

| Bit | Mask | Name | Set When | Effect |
|---|---|---|---|---|
| 8 | 0x100 | Propagated | Conditional select (opcode 124) is eligible for propagation AND the architecture/mode checks pass | Downstream apply-changes passes replace all uses of this destination with its source operand. Checked as a guard in sub_8F2E50: if (dst & 0x100) return false prevents double-propagation. |
| 9 | 0x200 | Deferred cleanup | | Combined with type-field rewriting: (operand & 0xFFFFFDF0) \| 0x201 |
| 10 | 0x400 | Propagated under predicate | Set-predicate instruction (opcode 18) is eligible AND the architecture flag v21 is true (vtable dispatch at **(ctx+1584)+1312 returned non-zero) | Marks a conditional propagation: the destination predicate can be folded into consumers, but only if the guarding predicate is maintained. Distinguished from 0x100 because the propagation is predicate-dependent rather than unconditional. |
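The combined type-field rewrite deserves a worked example, since the mask does two jobs at once. The sketch below applies the recovered expression to an arbitrary operand word (the input value 0x1A34 is invented):

```python
# The three marking bits, plus the deferred-cleanup rewrite from the
# decompilation: clear the type nibble and bit 9, then set type=1
# (register) and the deferred flag in one OR.

PROPAGATED       = 0x100   # bit 8: unconditional propagation
DEFERRED_CLEANUP = 0x200   # bit 9: paired with the type-field rewrite
PRED_PROPAGATED  = 0x400   # bit 10: propagation valid only under predicate

def rewrite_for_deferred(operand: int) -> int:
    # 0xFFFFFDF0 clears bits 0-3 (type nibble) and bit 9;
    # 0x201 sets type=1 and the deferred-cleanup flag.
    return (operand & 0xFFFFFDF0) | 0x201

op = 0x1A34
new = rewrite_for_deferred(op)
print(hex(new))   # 0x1a31
```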

Eligibility Checker: sub_8F2E50

The 64-line function sub_8F2E50 is the central gatekeeper for both opcodes 18 and 124. Decompiled logic:

// sub_8F2E50 -- copy/fold eligibility (from decompiled code at 0x8F2E50)
function is_eligible(ctx, instr):
    opcode = instr[18] with BYTE1 &= 0xCF       // mask bits 12-13

    if opcode == 18:                              // set-predicate
        dst = instr[2 * (count + ~((opcode>>11)&2)) + 21]
        type_nibble = (dst >> 2) & 0xF
        if type_nibble == 10: return false        // type 10: never foldable
        if type_nibble == 0 and !(dst & 0x400):   // no type bits, not yet marked
            // Architecture-gated source operand check
            vtable_fn = **(ctx+1584) + 1320
            if vtable_fn == sub_7D7240:           // sentinel: direct check
                if (instr[23] >> 28) & 7 not in {2, 3}: return false
            else:
                if vtable_fn() returns true: goto opcode_124_check
            // Register constraint check on both source operands
            if sub_91D150(ctx, instr[23] & 0xFFFFFF): goto opcode_124_check
            if sub_91D150(ctx, instr[25] & 0xFFFFFF): goto opcode_124_check
            return true
        return false

    opcode_124_check:
    if opcode == 124:                             // conditional select
        dst = instr[2 * (count + ~((opcode>>11)&2)) + 21]
        if dst & 0x100: return false              // already propagated
        if dst & 0x70: return false               // has modifier bits
        type = dst & 0xF
        sm_version = *(*(ctx+1584) + 372)
        if (type == 1 or type == 2)               // integer or float
           and (sm_version <= 20479 or !(dst & 0x1C00)):  // SM gate
            return true

    return false

The SM version threshold 20479 (0x4FFF) divides generation-4-and-below architectures (Kepler/Maxwell, where constant propagation through conditional selects is unconditionally safe) from generation-5+ architectures (Pascal onward) that require the constraint bits at dst & 0x1C00 to be zero. See SM Version Encoding and the 20479 Boundary for the encoding formula.

Architecture Predicate Query: sub_8F29C0

The 9-line function sub_8F29C0 at 0x8F29C0 determines whether the compilation unit's target architecture supports predicate-aware copy propagation:

// sub_8F29C0 -- architecture predicate query (decompiled verbatim)
bool check_arch_predicate(int64_t ctx) {
    int64_t comp_unit = *(int64_t*)(ctx + 1584);
    return sub_7DC0E0(comp_unit)          // primary arch capability
        || sub_7DC050(comp_unit)          // secondary arch capability
        || sub_7DC030(comp_unit);         // tertiary arch capability
}

This same query is used inside sub_908A60 (the two-pass predicate simplifier) to initialize the default "safe to transform" flag before instruction-level analysis refines the answer.

Two-Pass Predicate Simplifier: sub_908A60

When simple eligibility checks pass for opcode 124 but additional predicate analysis is needed (specifically: when sub_8F29C0 returns false OR ctx+1379 & 0x1B has bits set), the two-pass predicate simplifier sub_908A60 at 0x908A60 is invoked. It takes a direction argument (1 = forward, 0 = backward) and scans the instruction stream in the specified direction looking for matching definitions:

  • Forward pass (a4=1): Starts from the current definition context v11, walks forward through the block's instruction list. For each instruction, dispatches on opcode: 97 updates tracking context, 124/18 checks eligibility via sub_8F2E50, others check liveness. Uses a hash-set membership test (sub_767240) to avoid visiting the same instruction twice.
  • Backward pass (a4=0): Starts from the definition chain at v11+136, walks backward through linked definitions with the same opcode dispatch logic.

The function outputs two flags: out_a (full match found -- propagation is safe) and out_b (partial match found -- further analysis may help). Phase 29 invokes forward first; if forward finds neither a full nor partial match, it invokes backward. This handles PHI-like merge patterns where the definition chain has both forward paths (normal control flow) and backward paths (loop back-edges).
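Abstractly, the orchestration is a forward-then-backward search with a partial-match cutoff. The sketch below is hypothetical (list scans replace the real linked-list walks and hash-set bookkeeping); it preserves only the decision structure: backward runs solely when forward reports neither a full nor a partial match:

```python
# Hedged model of the sub_908A60 invocation pattern in phase 29.

def two_pass_search(items, classify):
    """classify(x) returns "full", "partial", or "none"."""
    def scan(seq):
        partial = False
        for x in seq:
            r = classify(x)
            if r == "full":
                return True, partial
            if r == "partial":
                partial = True
        return False, partial

    hit, partial = scan(items)                # forward pass (a4=1)
    if not hit and not partial:
        hit, partial = scan(reversed(items))  # backward pass (a4=0)
    return hit

defs = [3, 5, 8]
print(two_pass_search(defs, lambda x: "full" if x == 8 else "none"))  # True
```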

Comparison of Algorithm A vs Algorithm B

| Aspect | Phase 13 (sub_753600) | Phase 29 (sub_908EB0) |
|---|---|---|
| Pattern | Chain matching (pair structural equivalence) | Forward walk with flag marking |
| Opcodes handled | 95 (entry gate), 93 (chain gate), 97 (anchor), 119 (deep eq) | 97 (anchor), 18 (pred copy), 124 (cond select) |
| Chain depth | Multi-level (follows through opcode 97 anchors) | Single-level (immediate operand check) |
| Result mechanism | Direct instruction rewriting via sub_753B50 | Flag marking (0x100/0x200/0x400), consumed later |
| Convergence | Fixed-point loop in sub_7917F0 (option 464 cap) | Single pass, flags consumed by subsequent iterations |
| Complexity | 253 lines + 5 helper functions | 217 lines + 4 helper functions |
| Scope | Intra-block, single-use chains only | Intra-block, all instructions in sequence |

Constant Folding Patterns

Constant folding in GeneralOptimize is a two-level mechanism. At the Ori IR level (phases 29 and 37), the fold-eligibility check sub_8F2E50 at 0x8F2E50 decides which operands can be marked as constant-propagation-eligible. Separately, at the SASS level, the peephole pass sub_1249B50 performs instruction-combining folds on ALU operations whose sources are both MOV-from-immediate. The Ori-level fold does not evaluate arithmetic at compile time -- it marks operands with flag bits that downstream passes consume to replace registers with immediates.

The Eligibility Check: sub_8F2E50

The central gatekeeper, called by sub_908EB0 (phase 29) and sub_908A60 (predicate simplifier). Returns boolean: 1 = foldable, 0 = not foldable. Two dispatch paths based on the masked opcode at instr[18] & ~0x3000:

// sub_8F2E50 -- Fold eligibility check (complete, annotated)
bool is_fold_eligible(int64_t ctx, uint32_t* instr) {
    uint32_t raw = instr[18];
    uint32_t opcode = raw;
    BYTE1(opcode) &= 0xCF;    // clear bits 12-13 (predication variant)

    // --- Path A: opcode 18 (predicated copy) ---
    if (opcode == 18) {
        int dest_idx = instr[20] + ~((raw >> 11) & 2);   // last-operand index
        int dest = instr[2 * dest_idx + 21];
        int type_nibble = (dest >> 2) & 0xF;

        if (type_nibble == 10) return false;   // operand type 10: never foldable

        // Require both type nibble == 0 AND no predicate-propagated flag (0x400)
        if (type_nibble != 0 || (dest & 0x400))
            return false;

        // Vtable dispatch at comp_unit->vtable + 1320, where
        // comp_unit = *(ctx + 1584):
        //   sentinel sub_7D7240 -> check source operand type bits directly
        //   otherwise -> call virtual method
        comp_unit = *(int64_t**)(ctx + 1584);
        fn = *(comp_unit->vtable + 1320);
        if (fn == sub_7D7240) {
            src_type = (instr[23] >> 28) & 7;
            if (src_type - 2 > 1) return false;   // only types 2,3 eligible
        } else {
            if (fn() != 0) goto check_opcode_124;
            src_type = (instr[23] >> 28) & 7;
            if (src_type - 2 > 1) return false;
        }
        // Verify register constraints via sub_91D150 for both sources
        if (sub_91D150(ctx, instr[23] & 0xFFFFFF))
            goto check_opcode_124;
        src2_type = (instr[25] >> 28) & 7;
        if (src2_type - 2 <= 1 && !sub_91D150(ctx, instr[25] & 0xFFFFFF))
            return true;
        // Fall through to opcode 124 check on constraint failure
    }

check_opcode_124:
    // --- Path B: opcode 124 (conditional select / phi-like move) ---
    if (opcode == 124) {
        int dest_idx = instr[20] + ~((raw >> 11) & 2);
        int dest = instr[2 * dest_idx + 21];
        if (dest & 0x100) return false;     // already propagated
        if (dest & 0x70)  return false;     // has modifier bits (neg/abs/sat)

        int type = dest & 0xF;
        int sm_ver = *(int32_t*)(*(int64_t*)(ctx + 1584) + 372);

        if ((type == 1 || type == 2) &&
            (sm_ver <= 20479 || (dest & 0x1C00) == 0))
            return true;
    }
    return false;
}

The function sub_91D150 is a trivial lookup into a per-register constraint array: return *(uint32_t*)(*(ctx+440) + 4 * reg_index). A return value of 0 means the register has no fold-blocking constraint.

Fold Eligibility Table

| ORI Opcode | Operation | Foldable? | Conditions | Evidence |
|---|---|---|---|---|
| 18 | Predicated copy | Yes | Source operand types must be 2 (predicate) or 3 (uniform); operand type nibble must be 0; no 0x400 flag; both source registers pass sub_91D150 constraint check | sub_8F2E50 lines 17--61 |
| 124 | Conditional select | Yes | Dest type 1 (integer) or 2 (float); no modifier bits (& 0x70 == 0); not already propagated (& 0x100 == 0); SM-version-dependent constraint check | sub_8F2E50 lines 42--51 |
| 97 | Register-to-register move | Propagated, not folded | Dest register replaced by source in all uses (copy propagation); no type/SM checks | sub_908EB0 lines 75--79 |
| 98 | Local load (LDL) | Cost-exempt fold target | In phase 37 only; target symbol looked up from constant bank; foldable if symbol is in constant bank | sub_90FBA0 lines 261--270 |
| 130 | HSET2 (packed FP16x2 compare) | Cost-exempt | Phase 37 bitmask: opcode in {130,133,134,137} bypasses fold cost penalty | sub_90FBA0 bitmask 0x99 |
| 133 | (SM-range-dependent ALU) | Cost-exempt | Same bitmask as 130 | sub_90FBA0 |
| 134 | (SM-range-dependent ALU) | Cost-exempt | Same bitmask as 130 | sub_90FBA0 |
| 137 | (SM-range-dependent ALU) | Cost-exempt | Same bitmask as 130 | sub_90FBA0 |
| 272 | Extended instruction | Cost-exempt | Phase 37: excluded from cost penalty alongside 273 | sub_90FBA0 line 226 |
| 273 | Extended instruction | Cost-exempt | Same as 272 | sub_90FBA0 |
| Others | General ALU/memory | Not directly foldable | GeneralOptimize does not attempt to fold; deferred to peephole passes | sub_8F2E50 returns 0 |

"Cost-exempt" means that when phase 37 (GeneralOptimizeMid) encounters these opcodes, it does not apply the standard cost penalty to the fold decision. Other opcodes accumulate a cost of 1.0 per fold, while these accumulate only v104 (default 0.25, adjustable via knob 474). Exception: even cost-exempt opcodes pay full cost if sub_91E860 reports a use count exceeding 4, preventing over-folding of widely-referenced constants.

SM Version 20479 Threshold

The value at *(*(ctx+1584) + 372) encodes the target SM architecture. The threshold 20479 (0x4FFF) appears in three locations in the binary:

  1. sub_8F2E50 line 47: fold eligibility for opcode 124
  2. sub_908A60 line 107: predicate simplifier's opcode-124 handling
  3. Transitively through sub_8F2E50 calls from sub_908A60

The threshold divides two immediate-encoding regimes:

| SM range | Encoding | Fold rule | Rationale |
|---|---|---|---|
| <= 20479 | Legacy encoding | Integer (type 1) and float (type 2) constants in conditional selects fold unconditionally | Legacy architectures use fixed-width immediate slots with no sign/width constraints |
| > 20479 | Extended encoding | Same types fold only if (dest & 0x1C00) == 0 -- constraint bits at operand positions 10--12 must all be zero | Extended architectures introduced variable-width immediate encoding with sign-extension rules; bits 10--12 encode width/signedness constraints that prevent certain constants from being represented as immediates |

The encoded value at comp_unit+372 uses the formula (generation << 12) | variant. Known values: 12288 = sm_30 (gen 3), 16385 = sm_50 (gen 4), 20481 = sm_50a (gen 5), 24576 = sm_60 (gen 6), 28672 = sm_70 (gen 7), 32768 = sm_90 (gen 8), 36864 = sm_100 (gen 9). The threshold 20479 = (5 << 12) - 1 = 0x4FFF falls exactly at the generation 4/5 boundary: every generation-4 value (Kepler/Maxwell) is at or below 20479, while the first generation-5 value (20481) exceeds it. This matches the introduction of extended immediate encoding formats, which first appear in the gen-5 range and carry forward into Pascal (sm_60, gen 6) and later architectures.
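The encoding formula and boundary check are easy to verify directly (helper names here are ours, not recovered from the binary):

```python
# The (generation << 12) | variant encoding and the 0x4FFF boundary.

def encode_sm(generation: int, variant: int) -> int:
    return (generation << 12) | variant

THRESHOLD = (5 << 12) - 1            # 20479 = 0x4FFF

known = {
    "sm_30":  encode_sm(3, 0),       # 12288
    "sm_50":  encode_sm(4, 1),       # 16385, last value inside the boundary
    "sm_50a": encode_sm(5, 1),       # 20481, first value past the boundary
    "sm_60":  encode_sm(6, 0),       # 24576
}
for name, val in known.items():
    regime = "legacy" if val <= THRESHOLD else "extended"
    print(f"{name}: {val} ({regime})")
```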

How Fold Results Are Consumed

When sub_8F2E50 returns true in sub_908EB0 (phase 29), the action depends on the opcode:

Opcode 18 (predicated copy): If the v21 flag is set (vtable dispatch at comp_unit+1312 returned non-zero, confirming the target supports this transformation), the destination operand is marked with 0x400 (propagated-under-predicate). No immediate evaluation occurs.

Opcode 124 (conditional select): A tier flag at ctx+1379 controls aggressiveness:

// In sub_908EB0, after sub_8F2E50 returns true for opcode 124:
int tier = *(uint8_t*)(ctx + 1379) & 7;
if (tier == 0) {
    // AGGRESSIVE: mark dest byte-1 |= 1 (fold-committed, fast path)
    dest_operand[1] |= 1;
} else {
    // CONSERVATIVE: type-dispatched analysis required
    if ((dest & 0xF) == 1) {              // integer immediate
        if (sub_8F29C0(ctx))              // predicate analysis passes
            dest = (dest & 0xFFFFFDF0) | 0x201;  // clear type, set propagated+eligible
    } else {                              // float or other
        if (!sub_8F29C0(ctx) || (*(uint8_t*)(ctx + 1379) & 0x1B) != 0) {
            // Two-pass predicate simplifier (forward, then backward)
            sub_908A60(ctx, reg, instr, 1, &out_a, &out_b);  // forward
            if (!out_a && !out_b[0])
                sub_908A60(ctx, reg, instr, 0, &out_a, &out_b);  // backward
        }
        dest = (dest & 0xFFFFFDF0) | 0x201;  // set propagated+eligible
    }
}

The tier value (*(ctx+1379) & 7) distinguishes:

  • 0 = aggressive fold (unconditional fast path, no predicate analysis)
  • 1--7 = conservative fold (requires sub_8F29C0 predicate analysis and potentially sub_908A60 two-pass simplification)

The actual constant value is not computed during GeneralOptimize. The fold marks operands with flag bits (0x100, 0x200, 0x400, byte-1 |= 1) that downstream passes consume: the apply-changes function sub_753B50 rewrites instruction lists, and the peephole/codegen passes emit the actual immediates.

The limit-fold-fp Knob

| Property | Value |
|---|---|
| String | "limit-fold-fp" at 0x1CE3D23 |
| Help text | "Enable/disable constant folding of float operations." at 0x1CE63B0 |
| Type | Boolean |
| Default | "false" (FP folding is NOT limited -- folding is enabled) |
| Config offset | config + 340 (registered at sub_434320, line 268) |
| Category | Optimization control (registration category 4) |
| Visibility | Internal (not exposed on the public CLI) |

Despite the name, limit-fold-fp follows the convention that limit-X = true means restrict/disable X. When set to true:

  1. The config+340 byte propagates into per-function context flags at ctx+1379 during compilation context setup
  2. The ctx+1379 & 7 tier value becomes non-zero, forcing all type-2 (float) operands through the conservative fold path
  3. Conservative fold requires predicate analysis via sub_8F29C0 and potentially the two-pass sub_908A60 simplifier, which rejects folds where predicate conditions are ambiguous
  4. This prevents FP constants from being folded when the fold could alter precision semantics -- for example, folding an FMA source operand might lose the fused multiply-add precision guarantee that the original instruction provided

The predicate analysis helper sub_8F29C0 (11 lines) performs three sequential checks on the compilation unit at ctx+1584: sub_7DC0E0, sub_7DC050, and sub_7DC030. If any returns true, the predicate allows safe propagation. These check architecture capability flags for predicated constant operations.

Phase 37 Fold Cost Model

sub_90FBA0 (the main loop for GeneralOptimizeMid) integrates fold decisions into a cost-weighted convergence model rather than a simple boolean. Key elements:

Opcode classification (lines 226--228 of the decompiled output): a bitmask 0x99 applied to the range 130--137 classifies opcodes as cost-exempt. The expression ~(0x99 >> (uint8_t)(opcode + 126)) & v15 clears v15 (the cost flag) for opcodes whose corresponding bit in 0x99 is set; the 8-bit truncation of opcode + 126 maps opcode 130 to bit 0 and opcode 137 to bit 7. Combined with the range check for 272--273:

| Cost-exempt opcodes | Bitmask bit | Interpretation |
|---|---|---|
| 130 | bit 0 of 0x99 | FP16x2 compare (HSET2 family) |
| 133 | bit 3 of 0x99 | SM-range-dependent ALU |
| 134 | bit 4 of 0x99 | SM-range-dependent ALU |
| 137 | bit 7 of 0x99 | SM-range-dependent ALU |
| 272, 273 | direct check | Extended load/store variants |

Cost computation: For cost-exempt opcodes, fold cost = v104 * v35 where v104 defaults to 0.25 (overridable via knob 474) and v35 is 0.0 if the instruction is dead (checked via sub_7DF3A0), 1.0 otherwise. For non-exempt opcodes, fold cost = 1.0 * v35 (full weight).

Use-count gate: Even cost-exempt opcodes pay full cost if sub_91E860 (use-count estimator) reports more than 4 uses, preventing over-folding of widely-referenced constants.

Convergence: Accumulated costs at context+26 (weighted) and context+27 (unweighted) are doubles. The loop continues until the cost delta falls below the threshold (default 0.25 from knob 474; overridable by knob 135 when *(config+9720) is set).

Register Constraint Validation: sub_8F3FE0

Phase 37 uses sub_8F3FE0 to validate that folding an instruction's operands respects register-class constraints. The function:

  1. Queries comp_unit->vtable[904] for the per-element operand size of the instruction
  2. Queries comp_unit->vtable[936] (if not sentinel sub_7D7040) for per-instruction fold metadata
  3. Iterates over all source operands:
    • Requires operand type bits (>> 28) & 7 to be 2 or 3 (predicate or uniform register)
    • Calls sub_91D150 to look up the register constraint for each source operand
    • Compares against a previously cached constraint at context + 7 (8-byte stride per fold group)
    • Returns 0 (fold invalid) on any constraint mismatch
  4. Loop count is determined by the destination operand format field (& 7)
  5. Returns 1 only if all source operands have consistent register constraints

Constant Folding and Propagation Marking Architecture

The term "constant folding" in the context of GeneralOptimize is misleading. The pass does not evaluate arithmetic at compile time (e.g., replacing 3 + 5 with 8). Instead, it performs constant propagation eligibility marking -- identifying operands that hold constant or propagatable values and setting flag bits so downstream passes can exploit this information. Actual arithmetic evaluation occurs elsewhere in the pipeline.

Three Levels of Constant Handling in ptxas

Constant handling spans three distinct pipeline stages, each with different scope and mechanism:

| Level | Stage | Functions | What It Does | What It Does NOT Do |
|---|---|---|---|---|
| 1 -- ORI-IR Propagation Marking | GeneralOptimize (phases 13/29/37/46/58/65) | sub_908EB0 (body), sub_8F2E50 (gate), sub_908A60 (deep analysis) | Marks operands with flag bits (0x100/0x200/0x400) indicating they are eligible for constant propagation | Evaluate arithmetic; rewrite instructions; emit immediates |
| 2 -- SASS Peephole Combining | Post-ISel peephole (phases 83+) | sub_83EF00 (156 KB mega-peephole), sub_1249B50 (integer ALU fold), sub_1249940 (MOV-pair matcher) | Combines MOV-from-immediate + ALU instruction pairs into single instructions with folded constants | Operate on ORI IR; handle non-MOV sources |
| 3 -- Frontend Expression Evaluation | PTX parser/validator (address range 0x460000--0x4D5000) | Multiple validator functions (string evidence: "Constant expression has division by zero", "Constant overflow") | Evaluates PTX-level constant expressions during parsing; reports errors for invalid expressions | Operate on internal IR; run during optimization |

The limit-fold-fp knob controls Level 1 only -- specifically whether float-typed operands take the fast path or must go through predicate analysis before being marked.

SM Version Encoding and the 20479 Boundary

The SM version at comp_unit->profile[+372] is not a direct sm_XX number. It uses a packed encoding:

encoded_sm = (generation << 12) | variant

Concrete values from the binary:

| Encoded | Hex | Generation | Variant | Architecture |
|---|---|---|---|---|
| 12288 | 0x3000 | 3 | 0 | sm_30 (Kepler) |
| 16385 | 0x4001 | 4 | 1 | sm_50 (Maxwell) |
| 20481 | 0x5001 | 5 | 1 | sm_50a (Maxwell alt / gen-5 base) |
| 24576 | 0x6000 | 6 | 0 | sm_60 (Pascal) |
| 28672 | 0x7000 | 7 | 0 | sm_70 (Volta) |
| 28673 | 0x7001 | 7 | 1 | sm_80 (Ampere) |
| 32768 | 0x8000 | 8 | 0 | sm_90 (Hopper) |
| 36864 | 0x9000 | 9 | 0 | sm_100 (Blackwell) |

The threshold 20479 = (5 << 12) - 1 = 0x4FFF. This is the largest value that fits in generation 4. Every generation-5+ encoded value exceeds it.

The fold-eligibility impact:

  • SM <= 20479 (generation 4 and below -- Kepler, Maxwell): Integer and float immediates in conditional-select instructions (opcode 124) fold unconditionally. The hardware uses fixed-width immediate slots with no sign/width constraints at operand bit positions 10--12.

  • SM > 20479 (generation 5+ -- Pascal and all newer): The operand's constraint bits at positions 10--12 (mask 0x1C00) must all be zero for folding to proceed. These bits encode hardware constraints introduced with extended immediate formats:

    • Bit 10: immediate width constraint (narrow vs wide encoding)
    • Bit 11: sign-extension requirement
    • Bit 12: bank-relative vs absolute encoding

The threshold appears in 6 locations across the binary, confirming it is a fundamental architectural boundary rather than an ad-hoc check: sub_8F2E50 (fold eligibility), sub_406C5E (peephole), sub_406018 (peephole operand matcher), sub_751940 (instruction walker), sub_78DB70 (phase pre-check), sub_848790 (register bank coalescer).

Architecture Class Predicate: sub_8F29C0 Internals

The 9-line function sub_8F29C0 queries three architecture capability checks in sequence. If any returns true, the conservative fold path (which requires additional predicate analysis) is the correct approach for the target:

bool arch_needs_conservative_fold(int64_t ctx) {
    int64_t cu = *(int64_t*)(ctx + 1584);
    if (sub_7DC0E0(cu)) return true;   // isDualIssue
    if (sub_7DC050(cu)) return true;   // isNvlinkArch
    return sub_7DC030(cu);             // isGraceArch
}

Each sub-function reads the architecture class field at comp_unit->profile[+12]:

| Function | Check | Class ID | Architecture Family |
|---|---|---|---|
| sub_7DC0E0 | profile[+12] == 4 | 4 | Dual-issue (Maxwell sm_50) |
| sub_7DC050 | profile[+12] == 11 OR profile[+1418] & 1 | 11 | NVLink-capable (Volta+) |
| sub_7DC030 | profile[+12] == 10 OR profile[+1417] >> 7 | 10 | Grace (ARM-based) |

When sub_8F29C0 returns true: folding a constant into a conditional select requires predicate analysis first, because these architectures have immediate encoding differences between conditional and unconditional instruction forms, or because predicate evaluation may have observable side effects.

When sub_8F29C0 returns false (simpler architectures): the fold attempt still proceeds but falls through to the more expensive two-pass predicate simplifier (sub_908A60) as a fallback rather than using the direct marking path.

Two-Pass Predicate Simplifier: sub_908A60 Internals

When the eligibility check passes for opcode 124 but the conservative path is required (either sub_8F29C0 returns false, or the tier flags at ctx+1379 & 0x1B have bits set), sub_908A60 performs a bidirectional scan of the instruction stream to validate that the fold is safe.

Signature: sub_908A60(ctx_array, basic_block_id, instr, direction, &out_hit, &out_partial)

| Parameter | Type | Meaning |
|---|---|---|
| a1 | int64_t* | Context as QWORD array (a1[37] = block array, a1[198] = comp_unit) |
| a2 | int | Basic block index (from the definition anchor) |
| a3 | int64_t | Current instruction pointer |
| a4 | int | Direction: 1 = forward scan, 0 = backward scan |
| a5 | int* | Output: 1 if a complete safe-fold chain was found |
| a6 | int* | Output: 1 if architecture supports aggressive mode |

Algorithm:

  1. Allocates a 24-byte tracking structure via comp_unit->vtable[24]
  2. Queries architecture mode via sub_7DC0E0/sub_7DC050/sub_7DC030
  3. Walks instructions in the specified direction within the basic block:
    • Opcode 97 (STG in ROT13; used as definition anchor/label marker): follows the label chain to the next definition
    • Opcode 52 (NOP/delimiter): stops the scan (block boundary)
    • Opcode 124 or 18: recursively calls sub_8F2E50 on the chained instruction to verify fold safety through the chain
  4. Sets output flags based on whether a complete safe-fold chain was found

Invocation pattern in sub_908EB0:

// Forward pass first
sub_908A60(ctx, bb_id, instr, 1, &hit, &partial);
if (hit) goto mark_propagated;

// If forward found nothing useful, try backward
if (!partial) {
    sub_908A60(ctx, bb_id, instr, 0, &hit, &partial);
    if (hit) goto mark_propagated;
    if (!partial) continue;   // neither direction found a match
}

The two-pass strategy (forward then backward) handles PHI-like merge patterns at loop boundaries. Forward catches definitions along normal control flow; backward catches definitions from loop back-edges. The partial flag prevents unnecessary backward scans when the forward pass already determined the chain is definitively unfoldable.

Algebraic Simplification and Structural Equivalence

The algebraic simplifier in GeneralOptimize is not a traditional constant-identity pattern matcher. It does not check operand values against constants (0, 1, -1) to recognize identities like x+0 or x*1. Instead, it is a structural equivalence-based pattern recognizer that detects when two instructions in a def-use chain compute identical values, enabling one to be eliminated. Traditional algebraic identity patterns (x+0->x, x*1->x, x&0->0, x-x->0, etc.) are handled by the separate MainPeepholeOptimizer -- see the comparison table below.

The simplifier lives in sub_753600 (Phase 13, GeneralOptimizeEarly) and is approximately 253 lines of decompiled code. It operates on chains of instructions linked through def-use relationships.

Entry Guard

The function only triggers on instructions matching a narrow pattern:

// sub_753600 entry guard
if (instr[18] == 95           // opcode 95 (STS in ROT13; used as terminator class)
    && instr[20] == 5         // exactly 5 operands
    && (instr[25] & 7) - 3 <= 1)  // operand format 3 (register) or 4 (immediate)
{
    // proceed to chain walk
}

The restriction to opcode 95 means this simplifier targets conditional exit/return sequences where a guard predicate or condition is computed redundantly. The 5-operand constraint ensures the instruction has the expected layout: result, predicate, and three source operands.

Chain-Walking Algorithm

After the entry guard passes, sub_753600 executes a 9-step algorithm:

Step 1 -- Def-chain traversal. Reads the use-list pointer at instr[17] (offset 136). Checks that the use-list head exists, points to a single definition (head's first element is null), and that the next instruction in the chain has opcode 97 (STG in ROT13; used as definition anchor/label).

Step 2 -- Register resolution. Follows the register index through the register table at ctx+296 to resolve the first chain link to a concrete register entry. Both chain paths (via instr[17]+8 field, "use-list index", and via the register table) must point to the same entry.

Step 3 -- First pair detection via sub_753520. This helper calls sub_753480 to walk the single-def chain forward, looking for an instruction with opcode 93 (OUT_FINAL in ROT13; used as a chain-link marker). At each step, sub_753480 checks:

  • sub_7E5120 -- is the current entry eligible for chain-following? (checks constant bank membership, block region flags, and opcode 91 via sub_7A1A90)
  • The use-list pointer at entry[16] has a null head (single use)
  • The use-list pointer at entry[17] has a null head (single def)
  • The register index at entry[17]+8 matches the next instruction's register at entry[1]+8 -> +24

Step 4 -- Second pair detection via sub_753570. Starting from the first pair's result, follows the chain one more step looking for a second opcode-93 instruction that references back to the same register as the first pair's target.

Step 5 -- Predicate-operand compatibility check via sub_7E7380:

// sub_7E7380 -- predicate-operand compatibility check (narrow, not full structural)
bool predicate_operand_compatible(Instr* a, Instr* b) {
    bool a_has_pred = (a->opcode & 0x1000) != 0;  // bit 12: predicated
    bool b_has_pred = (b->opcode & 0x1000) != 0;
    if (a_has_pred != b_has_pred)
        return false;
    if (a_has_pred && b_has_pred) {
        // Compare last operand (predicate register): 24-bit register index
        int a_idx = a->operands[a->operand_count - 1] & 0xFFFFFF;
        int b_idx = b->operands[b->operand_count - 1] & 0xFFFFFF;
        if (a_idx != b_idx)  return false;
        // Compare preceding operand pair (full 64-bit equality)
        return a->operands[a->operand_count - 2] == b->operands[b->operand_count - 2];
    }
    return true;  // both unpredicated: predicate-compatible at this level
}

This confirms the two instructions have matching predication structure -- same predicate register, same predicate condition encoding.

Step 6 -- Operand format classification. Computes the effective operand position as operand_count - ((opcode >> 11) & 2) and checks whether it equals 5. When it does, reads the format code at instr[25] & 7. Format 3 means register operand, format 4 means immediate. Both instructions must have the same format classification (both register or both immediate).

Step 7 -- Register index equality. Compares the 24-bit register index: (instr_a[v23+21] & 0xFFFFFF) == (instr_b[v24+21] & 0xFFFFFF). When equal and the full operand descriptors at instr[23] and instr[24] also match, the instructions provably compute the same value. The function jumps to the success path.

Step 8 -- Modifier verification via sub_747F40 and sub_747F80:

// sub_747F40 -- negation flag extraction
int get_negation(Instr* instr) {
    int eff = instr->operand_count - ((instr->opcode >> 11) & 2);
    if (eff == 5 && (instr->data[25] & 7) - 3 < 2)
        return (instr->data[25] >> 3) & 1;   // bit 3 of format byte
    return 0;
}

// sub_747F80 -- absolute-value flag extraction
int get_abs(Instr* instr) {
    int eff = instr->operand_count - ((instr->opcode >> 11) & 2);
    if (eff == 5 && (instr->data[25] & 7) - 3 < 2)
        return (instr->data[25] >> 4) & 1;   // bit 4 of format byte
    return 0;
}

Both instructions must have identical negation and absolute-value flags. If neg(a) != neg(b) or abs(a) != abs(b), the pattern is rejected. This prevents incorrectly identifying neg(x) as equivalent to x.

Step 9 -- Deep sub-DAG equivalence. When register indices differ but operand type bits (bits 28-30) equal 1 (register type), the simplifier follows the definition chain to the defining instruction and attempts structural matching at depth:

// Deep equivalence path (sub_753600, lines 149-189)
if (((operand_a >> 28) & 7) == 1) {           // register-type operand
    RegEntry* reg_a = reg_table[operand_a & 0xFFFFFF];
    if (reg_a->use_count_field == 5) {         // field at +64
        Instr* def_a = reg_a->defining_instr;  // field at +56
        // ...same for operand_b...
        if (def_a->opcode == 119 && def_b->opcode == 119) {  // both SHFL
            int res_a = def_a->operands[2 * def_a->operand_count + 19];
            int res_b = def_b->operands[2 * def_b->operand_count + 19];
            if ((res_a & 1) == 0 && (res_b & 1) == 0        // bit 0 clear
                && ((res_a | res_b) & 8) == 0                // bit 3 clear
                && !sub_748570(def_a, ctx)                   // no alias hazard
                && !sub_748570(def_b, ctx)                   // no alias hazard
                && def_a->data[25] == def_b->data[25]        // format match
                && def_a->data[26] == def_b->data[26]        // descriptor match
                && sub_1245740(ctx, def_a, def_b, 2))        // depth-2 DAG eq
            {
                // Match found -- proceed to chain extension
            }
        }
    }
}

The depth limit of 2 (fourth argument to sub_1245740) prevents exponential blowup in the equivalence check while still catching common patterns like f(g(x)) == f(g(y)) when x == y.

Chain Extension and Accumulation

After finding one matching pair, the function extends the search down the chain. It calls sub_753520 and sub_753570 on subsequent entries, accumulating the full matching sequence in the state array at a1[1] through a1[6]. The state layout is:

State array (passed as a1, 7 qword slots):
  a1[0] = ctx (compilation context)
  a1[1] = first matched instruction (start of sequence)
  a1[2] = second matched instruction (end of first pair)
  a1[3] = third matched instruction (from sub_753520)
  a1[4] = fourth instruction (next link)
  a1[5] = fifth instruction (from secondary sub_753520)
  a1[6] = sixth instruction (final chain link)

The function returns true (changed) when the full chain is successfully matched. The caller (sub_7917F0) then invokes sub_753B50 to rewrite the matched sequence.

What This Actually Eliminates

The pattern this simplifier catches is: a sequence of conditional exit instructions where the guard predicates, condition codes, and source operands are structurally equivalent. In practice, this arises from lowering transformations that produce redundant conditional exit/return pairs -- for example, when a function has multiple return paths that were not merged during earlier optimization, or when predicated code duplication creates exit sequences with identical conditions.

The rewrite performed by sub_753B50 replaces the redundant chain with a single exit/return sequence, updating the block's instruction list, register-to-instruction mappings, and def-use chains.

Algebraic Pattern Location Map

The following table clarifies which optimization pass handles each category of algebraic simplification:

| Pattern Category | Pass | Location | Evidence |
|---|---|---|---|
| Structural equivalence (identical computation chains) | GeneralOptimize Phase 13 | sub_753600 | CERTAIN -- decompiled |
| Modifier canonicalization (neg/abs flag matching) | GeneralOptimize Phase 13 | sub_747F40, sub_747F80 | CERTAIN -- decompiled |
| Sub-DAG equivalence (depth-limited tree comparison) | GeneralOptimize Phase 13 | sub_1245740 | CERTAIN -- decompiled |
| Copy propagation (reg-reg, predicated, conditional) | GeneralOptimize Phase 29 | sub_908EB0 | CERTAIN -- decompiled |
| Predicate simplification (constant predicates) | GeneralOptimize Phase 29 | sub_908A60 | CERTAIN -- decompiled |
| Register promotion (memory-to-register conversion) | GeneralOptimize Phase 37 | sub_90EF70 | CERTAIN -- decompiled |
| Identity: x+0->x, x*1->x, x&(-1)->x, x\|0->x, x^0->x | MainPeepholeOptimizer | sub_169B190 et al. | HIGH -- 3,185 pattern matchers |
| Annihilator: x*0->0, x&0->0 | MainPeepholeOptimizer | sub_169B190 et al. | HIGH -- 3,185 pattern matchers |
| Inverse: x-x->0, x^x->0, !!x->x | MainPeepholeOptimizer | sub_169B190 et al. | HIGH -- 3,185 pattern matchers |
| Strength reduction: x*2->x<<1, x/1->x | StrengthReduction (phase 26) | documented separately | CERTAIN -- separate pass |
| Predicate identity: p&true->p, p\|false->p | MainPeepholeOptimizer + Phase 29 | combined | MEDIUM |

The MainPeepholeOptimizer operates on the full SASS opcode set via three 233-280 KB dispatch functions with 373-case primary switches. Its pattern tables encode the constant-identity rules (IADD3 with zero source becomes MOV, IMAD with unit multiplier becomes shift/add, LOP3 with identity LUT becomes passthrough, etc.) as prioritized rewrite rules. See Peephole Optimization for full details.

Helper Functions: sub_753E30, sub_753F70, and sub_753DB0

Three additional helpers extend the Phase 13 algebraic simplifier beyond the main sub_753600 path:

sub_753E30 (67 lines) -- secondary chain matcher that handles the case where the first instruction in the chain has a source register index (instr[25] & 0xFFFFFF) that differs from the current block's register at *(a2 + 24). It follows a more complex chain topology involving three register entries (at state slots a1[7], a1[8], a1[9]) and validates that the secondary chain loops back to the primary entry. This catches equivalences across register renaming boundaries.

sub_753F70 (49 lines) -- vtable-dispatched transformation that performs the actual rewrite for chains detected by sub_753E30. It calls through comp_unit->vtable[656] (with sentinel check against sub_744F30). When the vtable method returns true, it constructs opcode-93 replacement instructions via sub_92E1B0 and splices the old chain out via sub_91E310. This is the surgical rewrite counterpart to sub_753B50's rewrite for the main path.

sub_753DB0 (33 lines) -- chain tail finder that walks from a given register entry forward through the def-chain, following opcode-97 links via the register table. Returns the last reachable entry in the chain (the "tail") or the entry one step before a broken link. Used by the extended chain detection logic to determine where the equivalence region ends.

Dead Code Elimination

DCE within GeneralOptimize is lightweight compared to the standalone OriPerformLiveDead passes (phases 16, 33, 61, 84). It operates locally within basic blocks using the sub_7DF3A0 function:

// sub_7DF3A0 -- instruction liveness check
//   Returns pointer to status word
//   Bits 2-3 (mask 0xC): has live uses
//   Bit 0 (mask 0x1): marked dead
int8_t* check_liveness(int64_t instr, int64_t* ctx) {
    // ... examines use-def chains ...
    return status_ptr;   // caller checks (*result & 0xC) != 0
}

In sub_908EB0, the DCE check appears as the fallback for unrecognized opcodes:

if (!v10) {   // v10 = "previous instruction was a recognized copy"
    int8_t* status = sub_7DF3A0(instr, ctx);
    v10 = (*status & 0xC) != 0;   // live uses exist?
}

When (*status & 0xC) == 0, the instruction has no live consumers and is effectively dead. In Variant A, dead instructions are not immediately deleted -- they are marked for removal by the convergence loop cleanup phase (sub_753B50), which rewires the instruction list to skip dead nodes and updates the block's def-use chains via sub_931920, sub_932E80, sub_749090, and sub_9253C0.

In Variant B (phase 58), sub_8F6530 uses the same sub_7DF3A0 liveness check but integrates the result into its 7-counter change tracking structure, incrementing the appropriate sub-pass counter when a dead instruction is found.

Predicate Simplification

A distinct sub-pass handles predicate register operations. The code in sub_908EB0 at the opcode-18 and opcode-124 branches processes predicated moves and conditional selects:

  • Opcode 18 (predicated move): if the predicate is known-true (from prior constant folding), simplifies to unconditional move. If the v21 flag is set (indicating the vtable dispatch at comp_unit+1312 returned non-zero, i.e. the target supports this transformation), marks the destination operand with 0x400
  • Opcode 124 (conditional select): if both source operands are identical (detected via def-chain comparison), simplifies to an unconditional copy; if the predicate is constant, selects the appropriate source. The two-pass approach via sub_908A60 handles phi-like patterns where direction matters:
    • Pass 1: sub_908A60(ctx, reg_entry, instr, 1, &out_a, &out_b) -- forward direction
    • Pass 2 (if pass 1 found no simplification but detected a partial match): sub_908A60(ctx, reg_entry, instr, 0, &out_a, &out_b) -- backward direction

The helper sub_8F29C0 at 0x8F29C0 performs predicate-specific analysis, determining whether the predicate condition allows safe propagation given the current instruction context.

The Per-Block Sub-Pass Runner: sub_8F6530 (Variant B Detail)

The 550-line function sub_8F6530 is the core of Variant B (phase 58). It processes a single basic block using a 6-slot circular buffer of instruction pairs, tracked at 56-byte intervals:

sub_8F6530 Context (passed as a1)
  +0x000  ctx_ptr                 -- compilation context
  +0x008  flag_ctrl_flow_4        -- from ctx+1396 bit 2 (opcode-7 enable)
  +0x009  flag_ctrl_flow_8        -- from ctx+1396 bit 3 (opcode-6 enable)
  +0x00C  slot_index              -- current slot (modulo 6)
  +0x010  slot_0_changed          -- boolean: did this slot's pair fire?
  +0x014  slot_0_count            -- how many pairs stored in this slot

  Slot layout (each 56 bytes = 7 int64_t):
    +0x00  count/used flag
    +0x04  changed flag
    +0x08  instr_ptr_a            -- first instruction of the pair
    +0x10  instr_ptr_b            -- second instruction of the pair
    +0x18  (reserved)
    ...

  6 slots at offsets: +0x10, +0x48, +0x80, +0xB8, +0xF0, +0x128

The slot index increments with (*(a1+3) + 1) % 6 after each pair is processed. When a new instruction pair is encountered that doesn't match any existing slot, the oldest slot is evicted (slot index advances). Each slot can hold up to 2 instruction pointers.

The function walks the instruction list looking for specific opcode patterns:

  1. Opcodes 139 and 110 (MOV variants with different addressing modes): these are the primary targets. The function checks operand field at instr+76 for value 6 (register operand) or 7 (immediate operand), with the flag_ctrl_flow_4 and flag_ctrl_flow_8 gates controlling which variants are processed
  2. For register operands (type field bits 28-30 == 1), it verifies:
    • Use count == 1 (*(reginfo+24) == 1)
    • No aliasing flags (*(reginfo+50) & 1 == 0)
    • Register class not in range 2-8 (*(reginfo+20) - 2 > 6)
  3. For instructions with opcode 139 and no modifier bits ((*(instr+88) & 0x603FFFF) == 0), the function attempts to find the instruction in the circular buffer and either promotes it (if found) or inserts it as a new entry
  4. Option 605 (getOption(ctx, 605)) at 0x8F6530+0x1A0: when enabled, restricts the matching to only instructions already present in the buffer, preventing new insertions. This is an architecture-gated optimization

Fixed-Point Convergence

Per-Block Iteration Model

All GeneralOptimize variants use a per-block convergence model: they iterate over basic blocks in linear order (following the block ordering table at ctx+512), and for each block, run the sub-passes repeatedly until convergence. This differs from the global worklist model used by other optimizers (GVN-CSE at phase 49 uses a global worklist).

for each block B in reverse postorder:
    repeat:
        changed = run_sub_passes(B)
    until !changed OR !getOption(464)

The block ordering table is an array of int32_t indices at *(ctx+512), with the count at *(ctx+520). Block iteration starts at index 1 (not 0) and proceeds through bb_count inclusive. Each index is used to look up the actual basic block pointer via *(*(ctx+296) + 8 * block_order[i]).

Change Detection Mechanism

Changes are detected through different protocols depending on the variant:

  • Variant A (sub_753600): returns a boolean. The return value is the logical OR of all sub-pass fire events. The state machine in sub_7917F0 stores the result in v15 (mapped to register bp) and accumulates across iterations via v4 = v15
  • Variant B, phase 58 (sub_8F6530): maintains 7 independent counters at 56-byte intervals in the context structure. Counters are at *(a1 + 5), *(a1 + 19), *(a1 + 33), *(a1 + 47), *(a1 + 61), *(a1 + 75). The corresponding boolean changed-flags are at *(a1 + 16), *(a1 + 72), *(a1 + 128), *(a1 + 184), *(a1 + 240), *(a1 + 296). All are zero-initialized at entry. The caller checks if any counter is non-zero to determine convergence
  • Variant B, phase 37 (sub_90FBA0): uses a different approach -- tracks a floating-point "cost" accumulator at context+25/26/27 (three double values representing total cost, weighted cost, and instruction count). Convergence is determined when the cost delta falls below a threshold (initialized to 0.25, adjustable via knob 474 at 0x90FBA0+0x50). Knob 135 at 0x90FBA0+0x20 controls an initial threshold override when enabled (checked via *(config+9720))

Iteration Limits

The fixed-point loop is guarded by option 464 in Variant A. In sub_7917F0:

while (true) {
    bool changed = sub_753600(&state, bb);
    if (!changed) break;

    // Option 464 check -- same vtable fast-path pattern:
    //   vtable[152] == sub_67EB60  =>  sub_7468B0(config, 464)
    //   otherwise                  =>  vtable[152](config, 464, 1)
    if (!getOption_v2(ctx, 464)) break;

    sub_753B50(&state);   // apply rewrites before re-scanning
}

The option 464 check is called after each successful iteration (when changed == true). If the option returns false, the loop terminates even though more changes could be made. The exact semantics of option 464 depend on the knob's implementation -- it could be a simple counter that decrements, a boolean that gets cleared after N iterations, or a cost-based threshold. The default behavior (when option 464 always returns true) allows unbounded iteration until convergence.

Variant B (phases 37 and 58) does not use option 464 for iteration control. Phase 37 uses the cost-based threshold described above. Phase 58 makes a single pass over the block list via sub_8F6FA0, which does not loop -- each block is visited exactly once, with the 6-slot circular buffer providing limited lookback within the walk.

In practice, most basic blocks converge in 1--3 iterations. A block that generates new optimization opportunities typically does so because copy propagation exposes a constant, which enables constant folding, which creates a dead instruction. The second iteration catches any cascading effects, and the third confirms convergence. Blocks requiring more than 3 iterations are rare and typically involve chains of dependent copies or nested predicate simplifications.

The Apply-Changes Function: sub_753B50

After sub_753600 reports changes, sub_753B50 applies the accumulated transformations. This is a compact 70-line function that performs instruction-list surgery:

  1. Creates a replacement instruction via sub_931920(ctx, state->instr_pair, *(*(state->instr_pair+8)+8), -1) -- the -1 argument (0xFFFFFFFF) signals "allocate new"
  2. Updates the block's instruction head at *(ctx+232) with the new instruction's head pointer
  3. Clears the block's instruction count at *(ctx+264) = 0
  4. Calls sub_932E80 to relink the instruction into the block's doubly-linked list
  5. Propagates flags: if the original instruction had flag bit 3 of *(instr+280) set (indicating a control-flow-sensitive instruction), the replacement inherits it via new_instr[70] |= 8
  6. Walks the state's instruction chain (from state[1] through state[2]), creating replacements for each and calling sub_749090 to update register-to-instruction mappings
  7. Final cleanup: calls sub_9253C0 to remove the dead instructions from their blocks, sub_749290 to update the register numbering, and sub_91E310 to splice the old instruction range out of the linked list

Differences Between Early/Mid/Late Variants

1. Gate Conditions (Who Runs)

Phase | Gate Logic
13 (Early) | Requires ctx->flags_1382 & 4; skips if option 214 is set; requires option 487; skips if *(*(ctx)+1056) is non-null
29 | Requires option 487; skips if option 231 (dump mode) is set; requires *(config+33192) check or option 461 pass; skips if function count == 1
37 (Mid) | Requires sub_8F3EA0 pre-check; option 487; can be disabled via --no-phase ConvertMemoryToRegisterOrUniform; skips if function count == 1
46 (Mid2) | Indirect dispatch; skips if vtable slot [0x1C0] points to no-op sentinel sub_7D6DD0
58 (Late) | Requires function count > 2 (not just > 1); checks optimization level bits (ctx+1396 & 0x30) != 0x20; checks option 31 with extended-value semantics
65 (Late2) | Requires function count > 1; indirect dispatch through compilation unit vtable slot at offset 392

2. Sub-Pass Selection (What Runs)

Phase | Sub-Passes Included
13 (Early) | Structural equivalence detection via sub_753600 (def-use chain walking, instruction pair matching, modifier verification, depth-2 sub-DAG comparison via sub_1245740), instruction rewrite via sub_753B50. No instruction-level constant folding. Lightweight -- designed for quick cleanup after initial lowering.
29 | Copy prop with full opcode dispatch (97, 18, 124), predicate-aware propagation via sub_8F2E50/sub_8F29C0, two-pass predicate simplification via sub_908A60, liveness-gated DCE via sub_7DF3A0. Flag marking with 0x100/0x200/0x400 bits.
37 (Mid) | Full sub-pass suite plus ConvertMemoryToRegisterOrUniform (memory-to-register promotion). Bitvector-based change tracking. Cost-driven convergence with configurable threshold (default 0.25, knob 474). Most comprehensive instance.
46 (Mid2) | Architecture-dependent (vtable dispatch). May include additional target-specific simplifications.
58 (Late) | 6-slot circular buffer pattern matching over MOV/copy instructions (opcodes 139, 110). Register use-count and aliasing checks. Option-605-gated restriction mode. Per-block single-pass (no iteration).
65 (Late2) | Architecture-dependent (vtable dispatch). Final cleanup before register allocation.

3. Infrastructure Weight (How It Runs)

Phase | Context Size | Tracking | Complexity
13 (Early) | Minimal (0x88 bytes on stack) | Boolean changed flag | Low (78 lines in sub_7917F0)
29 | Stack frame (~0x60 bytes) | Boolean + instruction flag bits | Medium (218 lines in sub_908EB0)
37 (Mid) | 0x408-byte stack context + heap bitvectors | Cost-based convergence (3 doubles) + bitvector arrays | High (500+ lines in setup + 400+ in loop)
46 (Mid2) | Vtable-dependent | Vtable-dependent | Variable
58 (Late) | 0x168-byte stack context | 7 counters at 56-byte stride + 6-slot circular buffer | Medium-high (550 lines in sub_8F6530)
65 (Late2) | Vtable-dependent | Vtable-dependent | Variable

Initialization Infrastructure

Two large helper functions set up the state required before the sub-passes can run:

sub_785E20 -- Change Tracking Reset

Called at the start of phase 13 and after the convergence loop completes (if any changes were made). Resets per-block change flags and instruction state. Takes (ctx, 0) -- the second argument selects the reset mode.

sub_781F80 -- Instruction Flag Initialization

A large function (~1800 lines) that walks every instruction in every basic block, setting per-instruction optimization flags. Called with argument 1 to enable full initialization. These flags control which instructions are eligible for the sub-passes: instructions marked with certain flag patterns are skipped by copy prop, others are skipped by the algebraic simplifier.

sub_7E6090 -- Use-Def Chain Builder

Builds operand use-def chains for copy propagation. Called with (ctx, 0, 0, 0, 0) at the start of phases 13 and 58. The zero arguments indicate "build from scratch" rather than incremental update.

sub_7E6AD0 -- Def-Use Link Builder

Builds bidirectional def-use/use-def links. Called only by phase 13 (Variant A). Variant B phases use their own bitvector-based tracking instead.

sub_905B50 -- Bitvector Infrastructure (Phase 37 Only)

A 500+ line setup function specific to GeneralOptimizeMid. Allocates and initializes three major bitvector structures for tracking:

  1. Register definition reach (which definitions reach each block entry)
  2. Per-register liveness within basic blocks
  3. Fold eligibility tracking (which operands have known-constant sources)

These bitvectors are destroyed by RAII-style cleanup after sub_90FBA0 returns, using vtable destructors at offsets +32 in the bitvector vtables.

Pipeline Positioning

The six instances are positioned to clean up after specific groups of transformations:

Phase 0-12:  Initial setup, FP16 promotion, unsupported op conversion
  --> Phase 13: GeneralOptimizeEarly  (clean up after lowering artifacts)

Phase 14-28: Branch opt, loop passes, strength reduction, pipelining
  --> Phase 29: GeneralOptimize       (clean up after loop transformations)

Phase 30-36: Switch opt, linear replacement, LICM
  --> Phase 37: GeneralOptimizeMid    (heavy cleanup + mem-to-reg promotion)

Phase 38-45: Nested branch opt, CTA expansion, mbarrier, mid expansion
  --> Phase 46: GeneralOptimizeMid2   (clean up after mid-level expansion)

Phase 47-57: GVN-CSE, reassociation, remat, late expansion, speculative hoist
  --> Phase 58: GeneralOptimizeLate   (clean up after late expansion)

Phase 59-64: Loop fusion, predication, late commoning
  --> Phase 65: GeneralOptimizeLate2  (final cleanup before register work)

After phase 65, the pipeline transitions to register-attribute setting (phase 90), synchronization (phase 99), and register allocation (phase 101). No GeneralOptimize instance runs after register allocation -- the post-RA pipeline uses different peephole mechanisms.

Knobs and Options

Option | Decoded Name | Type | Code Default | Used By | Description
31 | AllowReassociateCSE | OKT_INT | unset | Phase 58 | Architecture-dependent fold eligibility gate; extended-value semantics via config+2232/+2240
135 | ConvertMemoryToRegIndexedSizeLimit | OKT_INT | unset (fallback: 0.25 from knob 474) | Phase 37 | Threshold override for cost-based convergence when *(config+9720) is set; controls indexed-access size limit for memory-to-register conversion
214 | DisableMergeEquivalentConditionalFlow | OKT_NONE | false | Phase 13 only | When present, skips GeneralOptimizeEarly entirely (if (getOption(ctx, 214)) return;)
231 | DisableRedundantBarrierRemoval | OKT_NONE | false | Phase 29 only | Dump mode -- when present, skips GeneralOptimize to preserve IR state for debugging
461 | MembarFlowControl | OKT_INT | unset | Phase 29 | Secondary gate; controls whether memory barrier flow analysis runs during standard GeneralOptimize; passed through sub_661470
464 | MergeEquivalentConditionalFlowBudget | OKT_BDGT | unset (= unbounded) | Phase 13 (Variant A) | Iteration cap -- budget knob that breaks the fixed-point loop when exhausted; prevents oscillating transformations
474 | MovWeightForConvertMemToReg | OKT_DBL | 0.25 | Phase 37 (sub_90FBA0) | Cost convergence threshold and per-fold cost weight for cost-exempt opcodes (v104 in cost computation)
487 | (not yet decoded) | -- | enabled | Phases 13, 29, 37 | General optimization enable -- master switch for all GeneralOptimize passes
499 | OptBudget | OKT_BDGT | enabled (pass-through) | sub_7DDB50 (opt-level accessor) | Master guard for opt-level accessor; when disabled, caps all opt-level-gated behavior at O1
605 | ReassociateCSEWindow | OKT_NONE | false | Phase 58 (sub_8F6530) | When present, restricts 6-slot circular buffer matching to existing entries only (no new entries added during walk)
limit-fold-fp | -- | bool | "false" (config+340) | Phase 37 | When true, forces conservative fold path via ctx+1379 tier flags; prevents FP folds that could alter precision semantics

The "ConvertMemoryToRegisterOrUniform" named-phase gate at 0x21DD228 allows phase 37 to be disabled via the --no-phase command-line option.

Function Map

Address | Name | Role | Confidence
0xC5F940 | Phase 13 execute | Tail-calls 0x1C64BF0 (single-func) or sub_7917F0 (multi-func) | CERTAIN
0xC5FC50 | Phase 29 execute | Checks count > 1, calls sub_908EB0 | CERTAIN
0xC5FD70 | Phase 37 execute | Checks count > 1, calls sub_910840 | CERTAIN
0xC60840 | Phase 46 execute | Indirect vtable dispatch through comp_unit->vtable[0x1C0] | CERTAIN
0xC5FF20 | Phase 58 execute | Checks count > 1, calls sub_8F7080 | CERTAIN
0xC60550 | Phase 65 execute | Checks count > 1, indirect dispatch through comp_unit->vtable[392] | CERTAIN
0x7917F0 | GeneralOptimizeEarly body | Multi-function path: iterates blocks, fixed-point loop with sub_753600 | HIGH
0x908EB0 | GeneralOptimize body | Per-block copy prop + predicate simplification with flag marking | HIGH
0x910840 | GeneralOptimizeMid body | Full suite with mem-to-reg; delegates to sub_905B50 + sub_90FBA0 | HIGH
0x8F7080 | GeneralOptimizeLate body | Bitvector-tracked 7-counter pass; calls sub_8F6FA0 | HIGH
0x753600 | Per-block sub-pass runner (Early) | Structural equivalence detection on def-use chains; returns boolean changed | HIGH
0x753B50 | Per-block apply changes (Early) | Instruction rewriting: sub_931920, sub_932E80, sub_749090, sub_9253C0 | HIGH
0x753480 | Chain walker (Early) | Walks single-def chain forward, checking sub_7E5120 eligibility | HIGH
0x753520 | Pair detector (Early) | Finds opcode-93 instruction in chain via sub_753480 | HIGH
0x753570 | Secondary pair detector (Early) | Finds second opcode-93 link referencing back to primary | HIGH
0x753DB0 | Chain tail finder (Early) | Walks opcode-97 links to find end of chain | MEDIUM
0x753E30 | Secondary chain matcher (Early) | Handles register renaming boundaries; stores a1[7..9] | MEDIUM
0x753F70 | Vtable rewrite dispatcher (Early) | Calls comp_unit->vtable[656]; constructs opcode-93 replacements | HIGH
0x7E5120 | Chain eligibility predicate | Checks constant bank, block region, opcode 91 | HIGH
0x8F6530 | Per-block sub-pass runner (Late) | 6-slot circular buffer; 7-counter change tracking; 550-line function | HIGH
0x8F6FA0 | Block iterator (Late) | Walks block list calling sub_8F6530 per block; single pass, no iteration | HIGH
0x905B50 | Setup/init (Mid) | ~500 lines; creates bitvector infrastructure; 3 tracked structures | HIGH
0x90FBA0 | Main loop (Mid) | Cost-based instruction-level iteration with register bank analysis | HIGH
0x90EF70 | Register promotion (Mid) | Memory-to-register conversion; threshold-based (default 0.93, knob 136) | HIGH
0x903A10 | Register bank helper (Mid) | Per-instruction register bank assignment for LD/ST materialization | MEDIUM
0x8F3FE0 | Register constraint fold validator (Mid) | Validates all source operand types are 2/3 and sub_91D150 constraints match cached values; queries vtable[904] for element size and vtable[936] for fold metadata | HIGH
0x8F2E50 | Fold eligibility check | Two-path dispatch: opcode 18 checks source types 2/3 + sub_91D150 constraints; opcode 124 checks dest type 1/2 + SM <= 20479 threshold + constraint bits & 0x1C00 | HIGH
0x8F29C0 | Architecture predicate query | 9 lines; returns sub_7DC0E0(cu) || sub_7DC050(cu) || sub_7DC030(cu) on ctx+1584 | HIGH
0x908A60 | Two-pass predicate simplify | Called with direction flag (1 = forward, 0 = backward) | HIGH
0x785E20 | Change tracking reset | Resets per-block change flags | MEDIUM
0x781F80 | Instruction flag init | Initializes per-instruction optimization flags (~1800 lines) | MEDIUM
0x7E6090 | Use-def chain builder | Builds operand use-def chains; called with (ctx, 0, 0, 0, 0) | HIGH
0x7E6AD0 | Def-use link builder | Builds def-use/use-def bidirectional links | HIGH
0x7DF3A0 | Liveness check | Returns status pointer; bits 2-3 (& 0xC) indicate live uses | HIGH
0x7E7380 | Predicate-operand compatibility | Narrow check: predicate modifier parity + last-operand 24-bit ID + penultimate 8-byte encoding (not full structural comparison) | HIGH
0x747F40 | Negation flag extractor | Extracts negation modifier from operand encoding | HIGH
0x747F80 | Absolute-value flag extractor | Extracts abs modifier from operand encoding | HIGH
0x748570 | Alias hazard check | Returns true if operand has aliasing hazard | MEDIUM
0x1245740 | Sub-DAG equivalence | Compares two instruction sub-DAGs for structural equivalence (arg 2 = depth) | HIGH
0x91D150 | Register constraint lookup | Trivial: return *(*(ctx+440) + 4*reg_index); 0 = no fold-blocking constraint | CERTAIN
0x91E860 | Use-count estimator | Returns estimated use count for cost-based decisions (used by phase 37) | MEDIUM
0xA9BD30 | Register-class remapper | Maps opcode indices in set {1,2,3,7,11,15,20,24} via vtable[632]; writes value|0x60000000 (constant class marker) | HIGH
0x1249B50 | SASS-level integer ALU fold | Combines IMAD_WIDE/IADD3/SGXT/CCTLT (opcodes 2,3,5,110) with MOV source pairs via sub_1249940 and sub_1245740 | HIGH
0x1249940 | MOV-pair fold combiner | Matches two MOV-from-immediate (opcode 139) instructions feeding an ALU op; validates structural equivalence at depth 1 and 2 | HIGH
0x7E19E0 | Operand info extractor | Builds 52-byte operand descriptor for opcodes 2,3,5,6,7; classifies source types and constant bank membership | MEDIUM
0x7DC0E0 | Architecture capability check A | Checks compilation unit capability flag; used by sub_8F29C0 for predicate fold safety | MEDIUM
0x7DC050 | Architecture capability check B | Secondary capability check for sub_8F29C0 | MEDIUM
0x7DC030 | Architecture capability check C | Tertiary capability check for sub_8F29C0 | MEDIUM

Cross-References

  • Pass Inventory -- full 159-phase table with GeneralOptimize instances highlighted
  • Phase Manager -- dispatch loop, vtable protocol, factory switch at sub_C60D30
  • Optimization Pipeline -- overall pipeline stages
  • Copy Propagation & CSE -- standalone copy propagation passes (phases 49, 50, 64, 83)
  • Liveness Analysis -- standalone OriPerformLiveDead passes (heavier DCE)
  • Peephole Optimization -- MainPeepholeOptimizer; handles constant-identity patterns (x+0, x*1, x&0, etc.)
  • Strength Reduction -- standalone strength reduction pass (phase 26)
  • Knobs System -- MergeEquivalentConditionalFlowBudget (464, iteration cap), option 487 (general opt enable), OptBudget (499, opt-level guard), AllowReassociateCSE (31), MovWeightForConvertMemToReg (474, cost threshold), limit-fold-fp
  • Optimization Levels -- knob 499 (OptBudget) as opt-level accessor guard

Branch & Switch Optimization

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Four phases in the ptxas pipeline transform branch and switch-statement control flow in the Ori IR. Two phases optimize switch statements (phases 14 and 30), one performs general branch simplification (phase 15), and one flattens nested conditional branches (phase 38). Together they reduce branch count, eliminate unreachable code, and prepare the CFG for downstream passes like predication (phase 63), liveness analysis (phase 16), and loop canonicalization (phase 18).

These phases operate on the Ori IR before register allocation and scheduling. At this pipeline stage, branch instructions use the Ori OEN opcode (SASS BRA), conditional execution is controlled by predicate registers (P0--P6, PT), and the CFG is a hash-map-based structure with FNV-1a-keyed successor/predecessor edges.

DoSwitchOptFirst | Phase 14 -- vtable at off_22BD7F8
OriBranchOpt | Phase 15 -- vtable at off_22BD820
DoSwitchOptSecond | Phase 30 -- vtable at off_22BDA78
OptimizeNestedCondBranches | Phase 38 -- vtable at off_22BDBB8
Phase factory | sub_C60D30 cases 14, 15, 30, 38
Phase object size | 16 bytes (standard {vtable_ptr, allocator_ptr})
IR level | Ori -- SASS opcodes with virtual registers
Key opcodes | OEN (BRA), OFFL (BSSY), OFLAP (BSYNC)
CFG infrastructure | FNV-1a hash maps at Code Object +648 (successors), +680 (backedges)
Related passes | 31 OriLinearReplacement, 63 OriDoPredication, 80 ExpandJmxComputation, 133/136 MergeEquivalentConditionalFlow
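The CFG edge maps are keyed with FNV-1a. For anyone re-deriving the map layout, the sketch below is the textbook 32-bit FNV-1a routine; the exact key type ptxas feeds it (block pointer, block id, or something else) is not established here.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV-1a: xor each byte in, then multiply by the FNV
   prime. This is the generic algorithm the edge-map keying is based on,
   not code recovered from the binary. */
uint32_t fnv1a_32(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t hash = 2166136261u;          /* FNV offset basis (0x811C9DC5) */
    for (size_t i = 0; i < len; i++) {
        hash ^= p[i];
        hash *= 16777619u;                /* FNV prime (0x01000193) */
    }
    return hash;
}
```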

Pipeline Placement

Phase   3  AnalyzeControlFlow              ── builds CFG (predecessors, successors, RPO, dominators)
Phase   6  SetControlFlowOpLastInBB        ── ensures branches are last in each block
Phase  13  GeneralOptimizeEarly            ── const fold + copy prop (feeds branch info)
Phase  14  DoSwitchOptFirst                ── SWITCH OPTIMIZATION (1st pass)
Phase  15  OriBranchOpt                    ── BRANCH SIMPLIFICATION
Phase  16  OriPerformLiveDeadFirst         ── DCE cleanup of dead branches
    ...
Phase  29  GeneralOptimize                 ── const fold after loop transforms
Phase  30  DoSwitchOptSecond               ── SWITCH OPTIMIZATION (2nd pass)
Phase  31  OriLinearReplacement            ── branchless replacement
    ...
Phase  37  GeneralOptimizeMid              ── const fold + copy prop (feeds nested cond info)
Phase  38  OptimizeNestedCondBranches      ── NESTED CONDITIONAL FLATTENING
    ...
Phase  63  OriDoPredication                ── if-conversion (converts short branches to predicates)
    ...
Phase  80  ExpandJmxComputation            ── expands jump-table index computations
    ...
Phase 133  MergeEquivalentConditionalFlow  ── tail merging
Phase 136  LateMergeEquivalentConditionalFlow

Why Two DoSwitchOpt Passes?

The first pass (phase 14) runs immediately after the initial GeneralOptimizeEarly compound pass. At this point, constant folding and copy propagation have resolved many switch selector values, enabling the optimizer to determine case density and choose a lowering strategy.

The second pass (phase 30) runs after loop unrolling (phase 22), strength reduction (phase 21), SSA phi insertion (phase 23), and software pipelining (phase 24). These transformations can expose new switch patterns -- particularly after loop unrolling duplicates switch bodies, creating opportunities for case clustering that were not visible before.

Despite their names, the two passes use different dispatch paths. Phase 14 dispatches through the SM backend's vtable at offset +136 (*(*(ctx+1584)+136)), making it a polymorphic, architecture-specific switch optimization. Phase 30 calls the generic switch optimization core (sub_77CF40 via sub_791F00). This means phase 14 runs whatever switch optimization the current SM target provides, while phase 30 always runs the generic algorithm. The two passes share pipeline position semantics (first pass vs. second pass) but not necessarily the same code.

DoSwitchOpt -- Switch Statement Optimization (Phases 14, 30)

Overview

DoSwitchOpt transforms high-level switch statements from their initial representation as cascading conditional branches into one of three lowered forms, selected based on case density and count. The input is a chain of ISETP (integer set-predicate) + BRA (conditional branch) instruction pairs that compare the switch selector against successive case constants. The output is one of:

  1. Jump table -- a contiguous array of branch targets indexed by the selector value
  2. Binary search tree -- a balanced tree of comparisons that narrows the target in O(log n)
  3. Cascading if-else chain -- the original form, retained when the switch is small or sparse

Input: Switch Pattern Recognition

The pass scans each basic block for a characteristic pattern:

// Input: cascading if-else for switch(x)
BB0:
    ISETP.EQ P0, x, #case_0      // compare selector against constant
    @P0 BRA target_0              // conditional branch to case body
    ISETP.EQ P0, x, #case_1
    @P0 BRA target_1
    ISETP.EQ P0, x, #case_2
    @P0 BRA target_2
    ...
    BRA default_target            // fallthrough to default case

The recognizer collects:

  • The selector register (the common first operand of all ISETP instructions)
  • The set of case constants (immediate operands of each ISETP)
  • The branch targets (one per case, plus the default target)
  • The case count N

Decision: Strategy Selection

The strategy is selected by evaluating case density and count:

function select_switch_strategy(cases[], N, min_val, max_val):
    range = max_val - min_val + 1
    density = N / range                    // fraction of range covered by cases

    if N <= SMALL_SWITCH_THRESHOLD:        // observed: ~4 cases
        return CASCADING_IF_ELSE           // keep original form

    if density >= JUMP_TABLE_DENSITY:      // observed: ~0.4 (40%)
        if range <= MAX_JUMP_TABLE_SIZE:   // observed: ~1024 entries
            return JUMP_TABLE

    return BINARY_SEARCH_TREE

The thresholds are not configurable via the knob system. They are hardcoded constants in the phase execute function.

Jump table is preferred when case values are dense -- the selector maps directly to a table index after a subtraction and a bounds check. This produces the fastest code but consumes table storage proportional to the value range (not the case count).

Binary search tree is the default for large sparse switches. The pass sorts case constants and generates a balanced BST of ISETP + BRA pairs. Each comparison eliminates half the remaining candidates, reaching the target in O(log N) branches.

Cascading if-else is retained for small switches (typically 4 or fewer cases) where the overhead of a jump table or BST setup exceeds the cost of linear comparison.
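A directly runnable version of the selection pseudocode above, using the approximate observed thresholds (the constants are estimates recovered by observation, as noted, not exact literals from the binary):

```c
enum Strategy { CASCADING_IF_ELSE, JUMP_TABLE, BINARY_SEARCH_TREE };

/* Approximate observed thresholds -- treat as estimates. */
enum { SMALL_SWITCH_THRESHOLD = 4, MAX_JUMP_TABLE_SIZE = 1024 };
#define JUMP_TABLE_DENSITY 0.4

enum Strategy select_switch_strategy(int n, long min_val, long max_val) {
    long range = max_val - min_val + 1;
    double density = (double)n / (double)range;

    if (n <= SMALL_SWITCH_THRESHOLD)
        return CASCADING_IF_ELSE;               /* keep linear comparison */
    if (density >= JUMP_TABLE_DENSITY && range <= MAX_JUMP_TABLE_SIZE)
        return JUMP_TABLE;                      /* dense and bounded range */
    return BINARY_SEARCH_TREE;                  /* large and sparse */
}
```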

Output: Jump Table Lowering

For jump-table-eligible switches, the pass produces:

// Output: jump table lowering
BB_switch:
    IADD3 Rtmp, Rselector, -min_val, RZ    // normalize to 0-based index
    ISETP.GE.U32 P0, Rtmp, #range          // bounds check (unsigned)
    @P0 BRA default_target                  // out-of-range -> default
    // The jump table index computation is left as a pseudo-instruction
    // that phase 80 (ExpandJmxComputation) expands later into:
    //   LEA Raddr, Rtmp, #table_base, 2    // Raddr = table_base + index * 4
    //   BRX Raddr, #table_base             // indexed branch

The actual BRX (branch indexed) instruction is a SASS-level indirect branch through a table embedded in the .text section. Each table entry is a 4-byte relative offset. Phase 80 (ExpandJmxComputation) runs much later (after legalization) to expand the index computation pseudo-instruction into the final LEA + BRX sequence.
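The IADD3 + ISETP.GE.U32 pair implements the classic single-comparison range check: after subtracting min_val, any selector below the range wraps around to a large unsigned value and fails the same bound as values above it. A C model of that check:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the lowering above: normalize to a 0-based index, then one
   unsigned comparison rejects both x < min_val and x > max_val.
   The subtraction is done in unsigned arithmetic so underflow wraps
   to a huge value instead of invoking signed overflow. */
bool switch_out_of_range(int32_t x, int32_t min_val, uint32_t range) {
    return ((uint32_t)x - (uint32_t)min_val) >= range;  /* ISETP.GE.U32 */
}
```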

Output: Binary Search Tree Lowering

For BST-eligible switches:

function emit_bst(cases[], lo, hi, selector, default_target):
    if lo > hi:
        emit: BRA default_target
        return

    mid = (lo + hi) / 2

    if lo == hi:
        emit: ISETP.EQ P0, selector, #cases[mid].value
        emit: @P0 BRA cases[mid].target
        emit: BRA default_target
        return

    emit: ISETP.LT P0, selector, #cases[mid].value
    emit: @P0 BRA left_subtree_label

    // Right subtree (selector >= cases[mid].value)
    emit: ISETP.EQ P0, selector, #cases[mid].value
    emit: @P0 BRA cases[mid].target
    emit_bst(cases, mid+1, hi, selector, default_target)

    // Left subtree (selector < cases[mid].value)
    left_subtree_label:
    emit_bst(cases, lo, mid-1, selector, default_target)

This produces a balanced tree with depth ceil(log2(N+1)). Each internal node performs at most two comparisons (less-than and equality), though the pass may optimize nodes with consecutive case values to use range checks.
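Tracing one selector through the emitted tree shape makes the comparison count concrete. The sketch below is a hypothetical execution model of the emit_bst pseudocode above (it simulates the lookup path rather than emitting instructions):

```c
/* Follows one selector through the BST shape emitted above, counting the
   ISETP comparisons executed on that path. Returns the matched case index
   or -1 for the default target. Hypothetical model, not recovered code. */
int bst_lookup(const int *cases, int lo, int hi, int sel, int *cmps) {
    if (lo > hi)
        return -1;                         /* BRA default_target */
    int mid = (lo + hi) / 2;
    if (lo == hi) {
        (*cmps)++;                         /* leaf ISETP.EQ */
        return sel == cases[mid] ? mid : -1;
    }
    (*cmps)++;                             /* internal ISETP.LT */
    if (sel < cases[mid])
        return bst_lookup(cases, lo, mid - 1, sel, cmps);
    (*cmps)++;                             /* internal ISETP.EQ */
    if (sel == cases[mid])
        return mid;
    return bst_lookup(cases, mid + 1, hi, sel, cmps);
}
```

For 7 sorted cases, a hit at the root costs 2 comparisons and a miss that descends to a leaf costs one or two per level, consistent with the ceil(log2(N+1)) depth bound.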

GPU-Specific: SIMT Divergence Impact

Switch optimization interacts directly with SIMT execution. On a GPU, when threads in a warp take different switch cases, the warp diverges and each case executes serially. The optimizer considers this:

  • Jump tables produce a single divergence point at the BRX instruction. All threads that pick the same case reconverge naturally. The hardware BSSY/BSYNC (branch sync stack push/pop) mechanism ensures reconvergence after the switch.
  • BST lowering produces O(log N) potential divergence points. Threads that agree on the BST path stay converged; threads that disagree at each BST node split into independently masked sub-warps.
  • Cascading if-else produces N potential divergence points. Each comparison can split the warp.

For GPU code, jump tables are strongly preferred when density permits, because they minimize the number of divergence points to exactly one (the BRX), regardless of case count.

OriBranchOpt -- Branch Simplification (Phase 15)

Overview

OriBranchOpt performs four categories of CFG-level simplification on the Ori IR. It iterates over all basic blocks, applying the following transformations repeatedly until no further changes occur:

  1. Unconditional branch folding -- eliminates BRA instructions that jump to the immediately following block
  2. Unreachable block elimination -- removes basic blocks with no predecessors (except the entry block)
  3. Conditional branch simplification -- simplifies conditional branches where the condition is provably constant or the true/false targets are identical
  4. Branch chain threading -- redirects branches that target blocks consisting of a single unconditional BRA, directly to the final destination

Transformation 1: Unconditional Branch Folding

When a basic block ends with an unconditional BRA to the block that immediately follows in layout order, the branch is redundant and is deleted:

// Before:                        // After:
BB_A:                             BB_A:
    ...                               ...
    BRA BB_B                          // fallthrough
BB_B:                             BB_B:
    ...                               ...

This is the most common transformation. It arises frequently after switch optimization introduces new blocks and after loop unrolling creates copies of loop bodies that end with unconditional jumps back to the next iteration.

Transformation 2: Unreachable Block Elimination

When other branch simplifications redirect branches away from a block, that block can lose all of its predecessors and become unreachable. The pass deletes such blocks:

function eliminate_unreachable(func):
    for each block B in func (excluding entry):
        if predecessor_count(B) == 0:
            // Remove B from successor lists of all blocks
            // Delete all instructions in B
            // Remove B from the block list
            // Update CFG hash maps

The CFG hash maps at Code Object offsets +648 (successors) and +680 (backedges) must be updated atomically with block deletion to maintain consistency for downstream passes.

Transformation 3: Conditional Branch Simplification

Two sub-cases:

Constant condition. If copy propagation or constant folding (in the preceding GeneralOptimizeEarly, phase 13) has determined that a predicate register always holds a known value at the branch point, the conditional branch is replaced:

// Before: condition always true      // After:
BB:                                   BB:
    ISETP.EQ PT, R0, R0              //   (deleted -- tautology)
    @PT BRA target                        BRA target   // unconditional
    BRA fallthrough                   //   (deleted)

Equivalent targets. If both the taken and not-taken paths of a conditional branch go to the same block, the condition test is dead and the branch becomes unconditional:

// Before: both targets identical     // After:
BB:                                   BB:
    @P0 BRA target                        BRA target   // unconditional
    BRA target                        //   (deleted)

Transformation 4: Branch Chain Threading

When a branch targets a block whose only content is another unconditional branch, the pass redirects the original branch directly to the final target:

// Before:                            // After:
BB_A:                                 BB_A:
    @P0 BRA BB_B                          @P0 BRA BB_C   // threaded
BB_B:                                 // BB_B may become unreachable
    BRA BB_C                          BB_C:
BB_C:                                     ...
    ...

The pass applies threading iteratively, following chains of single-branch blocks until a non-trivial block is reached. A depth limit prevents infinite loops on pathological CFGs with cycles of empty blocks (which should not exist in well-formed IR but are guarded against defensively).
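The chain-following logic with its defensive depth cap can be sketched as follows. This is a hypothetical model: the succ[] encoding and the specific limit of 8 are invented for illustration, not recovered from the binary.

```c
/* Hypothetical model of branch chain threading. succ[b] holds the sole
   target of block b when b consists only of an unconditional BRA, or -1
   when b is non-trivial. The depth cap mirrors the defensive guard
   against cycles of empty blocks. */
enum { MAX_THREAD_DEPTH = 8 };             /* assumed limit, not recovered */

int thread_target(const int *succ, int b) {
    for (int depth = 0; depth < MAX_THREAD_DEPTH; depth++) {
        if (succ[b] < 0)
            return b;                      /* reached a non-trivial block */
        b = succ[b];
    }
    return b;                              /* cap hit: give up threading */
}
```

On a well-formed CFG the loop stops at the first non-trivial block; on a pathological cycle of empty blocks it terminates at the cap instead of spinning forever.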

Fixed-Point Iteration

The four transformations are applied in a worklist-driven loop. Each transformation can enable others:

  • Threading can make intermediate blocks unreachable (enables transformation 2)
  • Unreachable block elimination can make remaining branches target the immediately following block (enables transformation 1)
  • Folding can expose equivalent-target conditionals (enables transformation 3)

The pass terminates when a full iteration over all blocks produces no changes.

OptimizeNestedCondBranches -- Nested Conditional Flattening (Phase 38)

Overview

Phase 38 targets a specific control flow pattern: nested conditional branches that test related predicates. This pattern commonly arises from C/C++ code with compound conditions (if (a && b), if (a || b)) and from switch-case fall-through after DoSwitchOpt lowering.

The pass runs after GeneralOptimizeMid (phase 37), which provides fresh constant folding and copy propagation results. It runs before OriDoPredication (phase 63), feeding it simpler CFG patterns that are easier to convert to predicated code.

Pattern: Nested If-Then

// Before: nested conditional
BB_outer:
    @P0 BRA BB_inner
    BRA BB_merge
BB_inner:
    @P1 BRA BB_body
    BRA BB_merge
BB_body:
    ... body instructions ...
    BRA BB_merge
BB_merge:
    ...

// After: flattened with combined predicate
BB_entry:
    LOP3 Ptmp, P0, P1, 0xC0          // Ptmp = P0 AND P1
    @Ptmp BRA BB_body
    BRA BB_merge
BB_body:
    ... body instructions ...
    BRA BB_merge
BB_merge:
    ...

The LOP3 (3-input logic) instruction with truth table 0xC0 computes AND. This combines two branch tests into one, eliminating a basic block and a divergence point.
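The truth-table semantics are easy to check in software: for each bit position, the three input bits form a 3-bit index into the 8-bit immediate. The emulator below is a generic model of that encoding, not ptxas code.

```c
#include <stdint.h>

/* Bitwise LOP3 emulation: each result bit is bit ((a<<2)|(b<<1)|c) of the
   8-bit truth table, taken per bit position. Immediate 0xC0 selects
   a AND b and 0xFC selects a OR b (both independent of c). */
uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (((a >> i) & 1u) << 2) |
                       (((b >> i) & 1u) << 1) |
                       ((c >> i) & 1u);
        r |= ((uint32_t)(lut >> idx) & 1u) << i;
    }
    return r;
}
```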

Pattern: Nested If-Or

// Before: short-circuit OR
BB_test1:
    @P0 BRA BB_body                   // first condition true -> body
    BRA BB_test2
BB_test2:
    @P1 BRA BB_body                   // second condition true -> body
    BRA BB_merge                      // both false -> merge
BB_body:
    ...

// After: flattened with OR predicate
BB_entry:
    LOP3 Ptmp, P0, P1, 0xFC          // Ptmp = P0 OR P1
    @Ptmp BRA BB_body
    BRA BB_merge
BB_body:
    ...

Safety Constraints

The pass applies these transformations only when:

  1. No side effects between the nested branches -- the intermediate block must contain only the branch instruction (and optionally predicate-setting ISETP/FSETP instructions)
  2. No live-out values from the intermediate block other than the predicate -- if the intermediate block defines registers used after the merge, the transformation would change semantics
  3. Both branches target the same merge point -- the not-taken path of both the outer and inner branches must reach the same merge block
  4. The predicates are independent -- P0 and P1 must not be related by a def-use chain within the nested pattern (otherwise folding changes the evaluation order)

Relationship to Predication

Phase 38 is a stepping stone toward phase 63 (OriDoPredication). By reducing nested branches to single-level branches, it creates more opportunities for if-conversion -- the predication pass can then convert the single remaining branch into a fully predicated (branchless) instruction sequence.

The transformation pipeline for an if (a && b) { x = y; } pattern is:

Phase 38: nested {if(a) { if(b) { ... }}}  -->  if(a AND b) { ... }
Phase 63: if(a AND b) { x = y; }           -->  @(a AND b) MOV x, y

Without phase 38, the predication pass would see a multi-level branch diamond that exceeds its nesting-depth threshold, and both branches would remain in the output.

GPU-Specific Considerations

SIMT Divergence and Reconvergence

On NVIDIA GPUs, branch optimization has a direct impact on warp execution efficiency. Every conditional branch is a potential divergence point where threads in a 32-thread warp may take different paths. Divergence serializes execution: the warp must execute both paths, masking inactive threads.

The BSSY (branch sync stack push) / BSYNC (branch sync) mechanism on modern NVIDIA architectures (sm_75+) manages reconvergence:

BSSY B0, reconvergence_point     // push reconvergence point onto sync stack
@P0 BRA taken_path               // diverge
    ... not-taken path ...
    BSYNC B0                     // threads arriving here wait
taken_path:
    ... taken path ...
    BSYNC B0                     // all threads reconverge here
reconvergence_point:
    ...                          // continue with full warp

Branch optimization directly reduces the number of BSSY/BSYNC pairs needed:

  • Branch folding (phase 15) eliminates unconditional branches that do not cause divergence but still consume BSSY/BSYNC bookkeeping
  • Nested conditional flattening (phase 38) reduces two nested BSSY/BSYNC regions to one, cutting sync-stack depth by one level
  • Jump table lowering (phases 14/30) collapses N divergence points into one BRX instruction

Reconvergence Stack Depth

The hardware branch sync stack has finite depth (varies by architecture, typically 16--32 entries on sm_75+). Deeply nested branches can overflow the stack, causing hardware serialization or requiring the compiler to restructure control flow. Branch optimization reduces sync-stack pressure by flattening nesting.
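
As a rough illustration (a toy model, not recovered from the binary), sync-stack pressure can be tracked by counting BSSY pushes against their matching reconvergence events; flattening one level of nesting lowers the peak occupancy by one:

```python
def sync_stack_peak(ops):
    """Toy model of branch-sync stack occupancy. BSSY pushes an entry;
    RECONVERGE marks the point where all threads have reached that
    entry's target, retiring it. Returns (peak_depth, final_depth)."""
    depth = peak = 0
    for op in ops:
        if op == "BSSY":
            depth += 1
            peak = max(peak, depth)
        elif op == "RECONVERGE":
            depth -= 1
    return peak, depth

# nested if(a){ if(b){...} }: two entries live at once
nested = ["BSSY", "BSSY", "RECONVERGE", "RECONVERGE"]
# after phase 38 flattens it to if(a AND b){...}: one entry
flattened = ["BSSY", "RECONVERGE"]
```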

Uniform Branches

When all threads in a warp evaluate a branch condition identically (uniform branch), no divergence occurs. The optimizer detects uniform branches via the AnalyzeUniformsForSpeculation pass (phase 27) and the OriPropagateVarying passes (phases 53, 70). Uniform branches are cheaper because:

  • No BSSY/BSYNC is needed (the warp stays converged)
  • On sm_75+, uniform branches can use the UBRA (uniform branch) encoding, which has lower latency

Branch optimization interacts with uniformity analysis: simplifications that eliminate branches also eliminate divergence-point metadata, and conversely, branches proven uniform may not need optimization because their execution cost is already minimal.

Switch Tables and Warp Divergence

A switch with K active cases in a 32-thread warp incurs at most K serialized case executions (one per unique case value across threads). Jump table lowering does not change this thread-level divergence cost, but it does change the instruction-level cost:

Strategy                      Instructions (worst case)   Divergence points   Sync-stack entries
────────────────────────────  ──────────────────────────  ──────────────────  ──────────────────
Cascading if-else (N cases)   2N (ISETP + BRA per case)   N                   N
BST (N cases)                 2 * ceil(log2(N))           ceil(log2(N))       ceil(log2(N))
Jump table                    3 (IADD3 + ISETP + BRX)     1                   1

The jump table is strongly preferred for GPU execution because it minimizes sync-stack entries to exactly 1, regardless of case count.
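
The table's entries follow directly from the per-strategy instruction shapes; a small helper (a sketch restating the worst-case formulas above, not recovered code) reproduces them:

```python
import math

def lowering_cost(n_cases: int, strategy: str):
    """Return (instructions, divergence_points, sync_stack_entries)
    for the three switch-lowering strategies, worst case."""
    if strategy == "cascade":           # ISETP + BRA per case
        return 2 * n_cases, n_cases, n_cases
    if strategy == "bst":               # one compare+branch per tree level
        depth = math.ceil(math.log2(n_cases))
        return 2 * depth, depth, depth
    if strategy == "jump_table":        # IADD3 + ISETP + BRX
        return 3, 1, 1
    raise ValueError(strategy)
```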

Implementation Details

Phase Vtable Structure

All four phases follow the standard 16-byte phase object model. Each vtable has three methods: +0 execute, +8 getPhaseNumber, +16 isNoOp.

Phase                          Factory case  Vtable address  execute body      isNoOp
─────────────────────────────  ────────────  ──────────────  ────────────────  ─────────────
14 DoSwitchOptFirst            case 14       off_22BD7F8     sub_C5F720 (42B)  returns false
15 OriBranchOpt                case 15       off_22BD820     sub_C5F950 (34B)  returns false
30 DoSwitchOptSecond           case 30       off_22BDA78     sub_C5FC80 (34B)  returns false
38 OptimizeNestedCondBranches  case 38       off_22BDBB8     sub_C5FA70 (34B)  returns false

All four isNoOp methods return false unconditionally -- gating is performed inside the execute body, not via isNoOp. Each execute body calls sub_7DDB50 (156B), which reads the optimization level from compilation_context+2104 and checks knob 499. The guard is opt_level > 1, so these phases execute at -O2 and above. At -O0 and -O1, sub_7DDB50 returns 1 and the execute body returns without action.
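
The combined guard reduces to a one-line predicate. This sketch models only the observable gating described above (knob and field semantics simplified):

```python
def branch_phase_runs(opt_level: int, knob_499_allows: bool = True) -> bool:
    """Model of the sub_7DDB50 gate shared by phases 14/15/30/38:
    the execute body proceeds only at -O2 and above."""
    return knob_499_allows and opt_level > 1
```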

Execute Body Details

Phase 14 -- sub_C5F720 (42 bytes). After the sub_7DDB50 gate, dispatches through the SM backend object's vtable: (*(*(ctx+1584) + 136))(*(ctx+1584)). Offset +136 is vtable slot 17 on the SM backend. This is a polymorphic call -- each SM target (sm_50, sm_75, sm_89, sm_100, etc.) provides its own switch optimization implementation. The SM backend object at compilation_context+1584 is documented in data-structures.md.

Phase 15 -- sub_C5F950 (34 bytes). After the gate, calls sub_7917F0 (529B) directly -- no polymorphic dispatch. sub_7917F0 is the branch simplification core:

  1. Checks context+1382 bit 2 (CFG validity flag) -- returns immediately if clear
  2. Checks knob 214 via the knob state dispatcher -- if set, skips the pass (OriBranchOpt disable switch)
  3. Checks knob 487 (general optimization enablement)
  4. Calls sub_785E20 (266B) to rebuild the CFG
  5. Calls sub_781F80 (8335B) for block preparation infrastructure
  6. Calls sub_7E6090 (2614B) to scan branch patterns and sub_7E6AD0 (33B) for chain setup
  7. Iterates over basic blocks in RPO order (block list at *(ctx+296), RPO indices at *(ctx+512)). For each block, calls sub_753600 (1351B) for the transformation, with a convergence loop gated by knob 464
  8. After convergence, calls sub_785E20 again to finalize the CFG
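
The convergence loop in steps 7-8 can be modeled with a minimal fixed-point branch-threading sketch (hypothetical representation, not recovered code: succ maps each block to its successor list, trivial marks blocks containing only an unconditional BRA, and pathological cycles of trivial blocks are assumed absent):

```python
def thread_branches(succ, trivial, entry="entry"):
    """Redirect edges through branch-only blocks until nothing changes,
    then drop trivial blocks that lost all their predecessors."""
    changed = True
    while changed:
        changed = False
        for targets in succ.values():
            for i, t in enumerate(targets):
                if t in trivial and len(succ[t]) == 1 and succ[t][0] != t:
                    targets[i] = succ[t][0]   # thread through t
                    changed = True
    reachable = {b for ts in succ.values() for b in ts} | {entry}
    return {b: ts for b, ts in succ.items() if b in reachable}
```

A chain entry -> A -> B -> C with A and B trivial collapses to entry -> C, mirroring how phase 15 deletes empty forwarding blocks.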

Phase 30 -- sub_C5FC80 (34 bytes). After the gate, calls sub_791F00(ctx, 1). The second argument 1 indicates this is the second switch optimization pass. sub_791F00 (587B) performs lazy initialization of a 152-byte SwitchOptContext cached at code_object+1288:

SwitchOptContext (152 bytes, allocated at code_object+1288):
    +0   back-pointer to code object
    +8   allocator reference (from code_object+16)
    +16  case collection array (capacity = block_count + 2)
    +56  secondary collection array
    +96  code_object reference copy
    +104 initialized sentinel (0xFFFFFFFF)
    +112 tertiary collection array

After setup, sub_791F00 calls sub_77CF40 (4698B, 987 instructions) -- the main switch optimization algorithm containing pattern matching, strategy selection (jump table vs. BST vs. cascading if-else), and code emission.

Phase 38 -- sub_C5FA70 (34 bytes). After the gate, calls sub_A0F020 (2375B, 563 instructions) directly. sub_A0F020 implements the nested conditional optimizer as a fixed-point loop. It allocates a 16-byte work context at code_object+1648 (lazy init), then iterates: scan blocks for nested branch patterns, combine predicates with LOP3, remove eliminated blocks, repeat until stable. The function also accesses code object fields +832 (block hash map) and +856 (edge data) for CFG manipulation.

Knob Gating Summary

Knob                    Index  Effect                                Checked by
──────────────────────  ─────  ────────────────────────────────────  ─────────────────────
ConvertUnsupportedOps   499    Master opt-level gate (all 4 phases)  sub_7DDB50
OriBranchOpt disable    214    Disables branch simplification        sub_7917F0 (phase 15)
General optimization    487    Enables/disables optimizer passes     sub_7917F0 (phase 15)
Convergence loop        464    Gates the fixed-point iteration       sub_7917F0 (phase 15)

Interaction with ExpandJmxComputation (Phase 80)

Phase 80 is the delayed lowering phase for jump table index computations created by DoSwitchOpt. The separation exists because:

  1. Jump table index computation requires knowing the final table address, which is not available until after legalization
  2. Intervening optimization passes (GVN-CSE, strength reduction) may simplify the index computation before it is expanded
  3. Register allocation needs to see the index computation as a single pseudo-instruction for accurate pressure estimation

The pseudo-instruction left by DoSwitchOpt is expanded by phase 80 into the final LEA + BRX sequence after all high-level optimizations are complete.

Interaction with OriLinearReplacement (Phase 31)

Phase 31 runs immediately after DoSwitchOptSecond (phase 30). It targets branch-heavy patterns that survived switch optimization and attempts to replace them with branchless (linear) computation sequences using SEL (select) and predicated MOV instructions. This is a complement to predication (phase 63) -- it operates earlier in the pipeline on simpler patterns, while predication handles more complex diamond-shaped control flow later.

Interaction with MergeEquivalentConditionalFlow (Phases 133, 136)

Two late-pipeline passes perform tail merging -- identifying basic blocks with identical instruction sequences that branch to the same targets, and merging them into a single block. This catches redundancy left over after branch optimization, particularly in switch case bodies that perform similar operations on different case values.

Algorithmic Summary

Pass                           Algorithm                    Complexity    CFG Changes
─────────────────────────────  ───────────────────────────  ────────────  ──────────────────────
DoSwitchOpt (14, 30)           Pattern match + decision     O(N log N)    Rewrites blocks, adds
                               tree for strategy selection   per switch    jump table pseudo-ops

OriBranchOpt (15)              Worklist-driven CFG          O(B + E)      Deletes blocks, removes
                               simplification (fixed-point)  per iter      edges, threads branches

OptimizeNestedCondBranches     Pattern match on nested      O(B)          Merges blocks, replaces
(38)                           branch diamonds                             branches with LOP3+BRA

Where N = number of switch cases, B = number of basic blocks, E = number of CFG edges.

Function Map

All addresses from ptxas v13.0.88. Vtable entries resolved by reading the ELF .rodata section at file offset VA - 0x400000. Confidence: HIGH for vtable functions (direct binary read), HIGH for core algorithms (single-caller chains from vtable execute bodies).

Phase Vtable Functions

Address     Size  Phase  Vtable slot  Role
──────────  ────  ─────  ───────────  ──────────────────────────────────────────────
sub_C5F720  42B   14     +0           execute -- dispatches to SM backend vtable[17]
sub_C5F4A0  6B    14     +8           getPhaseNumber -- returns 14
sub_C5F4B0  3B    14     +16          isNoOp -- returns false
sub_C5F950  34B   15     +0           execute -- calls sub_7917F0
sub_C5F480  6B    15     +8           getPhaseNumber -- returns 15
sub_C5F490  3B    15     +16          isNoOp -- returns false
sub_C5FC80  34B   30     +0           execute -- calls sub_791F00(ctx, 1)
sub_C5F2A0  6B    30     +8           getPhaseNumber -- returns 30
sub_C5F2B0  3B    30     +16          isNoOp -- returns false
sub_C5FA70  34B   38     +0           execute -- calls sub_A0F020
sub_C5F1A0  6B    38     +8           getPhaseNumber -- returns 38
sub_C5F1B0  3B    38     +16          isNoOp -- returns false

Core Algorithm Functions

Address     Size   Callers  Description
──────────  ─────  ───────  ─────────────────────────────────────────────────────────────
sub_77CF40  4698B  1        DoSwitchOpt core -- pattern match, strategy select, code emit
sub_7917F0  529B   2        OriBranchOpt core -- worklist CFG simplification
sub_A0F020  2375B  11       OptimizeNestedCondBranches core -- predicate combining
sub_791F00  587B   3        DoSwitchOpt setup -- SwitchOptContext init, calls sub_77CF40

Infrastructure Functions

Address     Size   Callers  Description
──────────  ─────  ───────  ───────────────────────────────────────────────────────
sub_7DDB50  156B   180      Optimization level gate (knob 499 + opt-level check)
sub_781F80  8335B  131      Block preparation infrastructure (major shared function)
sub_785E20  266B   34       CFG rebuild after transformation
sub_7E6090  2614B  80       Branch pattern scanner
sub_7E6AD0  33B    10       Branch chain setup
sub_753600  1351B  1        Block-level branch transform (phase 15 inner loop)
sub_753B50  598B   1        Block transform continuation

Factory and Vtable Data

Symbol       Address    Description
───────────  ─────────  ─────────────────────────────────────────────────────────────────
sub_C60D30   0xC60D30   Phase factory -- 159-case switch, allocates 16-byte phase objects
off_22BD5C8  0x22BD5C8  Vtable base -- 40-byte stride, index = phase number
off_22BD7F8  0x22BD7F8  Phase 14 vtable (base + 14 * 0x28)
off_22BD820  0x22BD820  Phase 15 vtable (base + 15 * 0x28)
off_22BDA78  0x22BDA78  Phase 30 vtable (base + 30 * 0x28)
off_22BDBB8  0x22BDBB8  Phase 38 vtable (base + 38 * 0x28)
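
The addresses in these tables are internally consistent; a quick arithmetic check (a sketch reproducing the stated base, stride, and the VA-to-file-offset rule from the Function Map section):

```python
VTABLE_BASE = 0x22BD5C8   # off_22BD5C8
STRIDE = 0x28             # 40 bytes per phase vtable
IMAGE_BASE = 0x400000     # ELF load VA; file offset = VA - IMAGE_BASE

def vtable_va(phase: int) -> int:
    """Virtual address of a phase's vtable."""
    return VTABLE_BASE + phase * STRIDE

def file_offset(va: int) -> int:
    """File offset corresponding to a virtual address."""
    return va - IMAGE_BASE
```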

Cross-References

Loop Passes

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Eight phases in the ptxas pipeline transform loops in the Ori IR: one canonicalizer (phase 18), one unroller (phase 22), one software pipeliner (phase 24), four LICM instances (phases 35, 66, 79, 88), and one fusion pass (phase 59). Together they account for the largest category of repeated-pass instances in the pipeline -- the LICM family alone runs four times because intervening transformations (predication, legalization, GMMA fixup) continuously expose new invariants.

ptxas is not built on LLVM. Its loop infrastructure is a custom, non-SSA representation operating directly on the Ori IR's basic-block graph. Loop detection is performed by AnalyzeControlFlow (phase 3), which identifies back-edges, computes dominators, and annotates each basic block with a loop nesting depth stored at block offset +144. This nesting depth is the primary loop identity used by all eight passes.

OriLoopSimplification    Phase 18 -- vtable at off_22BD898
OriLoopUnrolling         Phase 22 -- vtable at off_22BD938
OriPipelining            Phase 24 -- vtable at off_22BD988
OriHoistInvariantsEarly  Phase 35 -- vtable at off_22BDB40
OriLoopFusion            Phase 59 -- vtable at off_22BDF00
OriHoistInvariantsLate   Phase 66 -- vtable at off_22BE018
OriHoistInvariantsLate2  Phase 79 -- vtable at off_22BE220
OriHoistInvariantsLate3  Phase 88 -- vtable at off_22BE388
Phase factory            sub_C60D30 cases 18, 22, 24, 35, 59, 66, 79, 88
Phase object size        16 bytes (standard {vtable_ptr, allocator_ptr})
IR level                 Ori -- SASS opcodes with virtual registers, pre-RA
Loop detection           AnalyzeControlFlow (phase 3) -- back-edges, dominators, nesting depth
Related passes           3 AnalyzeControlFlow, 19 OriSplitLiveRanges, 21 OriStrengthReduce, 108 OptimizeHotColdInLoop

Pipeline Placement

Phase   3  AnalyzeControlFlow              ── builds CFG, identifies loops, computes dominators
Phase  13  GeneralOptimizeEarly            ── const fold + copy prop (feeds loop analysis)
Phase  15  OriBranchOpt                    ── branch simplification (may change loop shape)
Phase  16  OriPerformLiveDeadFirst         ── DCE removes dead loop bodies
Phase  18  OriLoopSimplification           ── CANONICALIZATION: single entry, preheader insertion
Phase  19  OriSplitLiveRanges              ── splits live ranges at loop boundaries
Phase  21  OriStrengthReduce               ── induction variable strength reduction
Phase  22  OriLoopUnrolling                ── UNROLLING: full/partial based on trip count
Phase  23  GenerateMovPhi                  ── SSA phi insertion (after unrolling changes CFG)
Phase  24  OriPipelining                   ── SOFTWARE PIPELINING: overlaps iterations
    ...
Phase  35  OriHoistInvariantsEarly         ── LICM #1: after GVN, before mid-expansion
    ...
Phase  59  OriLoopFusion                   ── FUSION: merges adjacent compatible loops
    ...
Phase  66  OriHoistInvariantsLate          ── LICM #2: after predication
    ...
Phase  79  OriHoistInvariantsLate2         ── LICM #3: after late unsupported-op expansion
    ...
Phase  88  OriHoistInvariantsLate3         ── LICM #4: after GMMA fixup
    ...
Phase 108  OptimizeHotColdInLoop           ── separates hot/cold paths within loops (post-RA)

Ordering Rationale

The eight loop passes are deliberately spread across the pipeline rather than clustered together. Each occupies a specific position dictated by what has been lowered or optimized upstream:

  1. Phase 18 (simplification) must run before strength reduction (21) and unrolling (22) because both require canonical loop forms.
  2. Phase 22 (unrolling) runs after strength reduction so that induction variable simplifications are already applied, avoiding redundant computation in unrolled copies.
  3. Phase 24 (pipelining) runs after unrolling because pipelining targets loops that were not fully unrolled.
  4. Phase 35 (early LICM) runs after GeneralOptimize at phase 29, which performs partial CSE, giving it common subexpressions to hoist.
  5. Phase 59 (fusion) runs after late expansion (phase 55) because expansion can split a single operation into a loop pair that fusion can reunite.
  6. Phases 66, 79, 88 (late LICM instances) each follow a major transformation that can create new loop-invariant code: predication (63), unsupported-op expansion (78), and GMMA fixup (87), respectively.

Loop Representation in Ori IR

ptxas does not use a dedicated loop descriptor data structure (no LoopInfo object like LLVM's). Instead, loop membership is implicit in the CFG through annotations computed by AnalyzeControlFlow (phase 3):

BB Field          Offset  Type          Meaning
────────────────  ──────  ────────────  ──────────────────────────────────────────────
loop_depth        +144    int           Loop nesting depth (0 = not in loop)
loop_depth_equal  +152    int           Copy of loop_depth, used for sibling detection
predecessor_list  +128    linked_list*  List of predecessor block indices
successor_list    +136    linked_list*  List of successor block indices

A loop header is a block whose loop_depth equals its own back-edge source's depth. Back-edge information is stored in the Code Object's back-edge hash map at offset +680. Diagnostic output from sub_BDEA50 prints this information as bix%d -> backedge's successor BB: %d.

The block iteration order is controlled by a reverse-post-order (RPO) array stored at Code Object offset +512. All loop passes iterate over this array, ensuring they visit headers before inner blocks. The array length is at Code Object offset +520.
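
A minimal sketch of how back-edges and an RPO ordering fall out of the successor lists (hypothetical helper, not recovered code; the real AnalyzeControlFlow also computes dominators and nesting depths):

```python
def rpo_and_back_edges(succ, entry):
    """DFS the CFG: an edge to a block still on the DFS stack is a
    back-edge (its target is a loop header). The finishing order
    reversed gives RPO, so headers are visited before loop bodies."""
    UNSEEN, ON_STACK, DONE = 0, 1, 2
    state, finish, back_edges = {}, [], []

    def dfs(b):
        state[b] = ON_STACK
        for s in succ.get(b, []):
            if state.get(s, UNSEEN) == ON_STACK:
                back_edges.append((b, s))
            elif state.get(s, UNSEEN) == UNSEEN:
                dfs(s)
        state[b] = DONE
        finish.append(b)

    dfs(entry)
    return finish[::-1], back_edges
```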


Phase 18 -- OriLoopSimplification

Purpose

Canonicalizes loop structure to simplify downstream analysis. Ensures each natural loop has a single entry edge, inserts dedicated preheader blocks where needed, and normalizes back-edge shapes. This is a prerequisite for strength reduction, unrolling, and pipelining, all of which assume canonical loop form.

Entry Point

sub_C5FB00 (34 bytes)          ── vtable execute(), calls sub_7DDB50
  └─ sub_78B430 (1,172 bytes)  ── LoopMakeSingleEntry core
       ├─ sub_7753F0            ── pre-pass: loop peeling setup
       ├─ sub_789BE0            ── canonicalize back-edges
       ├─ sub_781F80            ── rebuild instruction list
       └─ sub_9253C0            ── split edges / insert preheader

Algorithm

function LoopSimplification(code_object):
    if code_object.flags[1368] & 1 == 0:          // optimization disabled
        return

    // Phase 1: optional loop peeling for O4/O5 or flagged functions
    if opt_level in {4, 5} or flags[1382] & 4 != 0:
        peeled = PeelOuterEdges(code_object, 0)         // sub_7753F0
        canonicalized = CanonicalizeBackEdges(code_object, peeled)  // sub_789BE0
    else:
        canonicalized = CanonicalizeBackEdges(code_object, 0)

    if code_object.flags[1368] & 1 == 0:          // re-check after canon
        return

    // Phase 2: single-entry enforcement
    if not QueryKnob("LoopMakeSingleEntry", knob_487):  // OCG knob 487
        return

    RebuildInstructionList(code_object, 1)               // sub_781F80
    for each block in RPO order:
        if block.loop_depth > 0 and block is loop header:
            // find the deepest-nesting back-edge target
            // if multiple entries exist, split into single-entry form
            // insert preheader block between external predecessors and header
            InsertPreheaderIfNeeded(code_object, block)  // sub_9253C0
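
The single-entry step can be sketched as follows (hypothetical model operating on predecessor sets; block names and the preds mapping are illustrative, not recovered structures):

```python
def insert_preheader(preds, header, latch_blocks):
    """If a loop header has more than one non-back-edge predecessor,
    route them all through one new preheader block. Mutates `preds`
    (block -> set of predecessors); returns the preheader or None."""
    external = preds[header] - set(latch_blocks)
    if len(external) <= 1:
        return None                      # already single-entry
    pre = header + ".preheader"
    preds[pre] = external                # externals now enter the preheader
    preds[header] = set(latch_blocks) | {pre}
    return pre
```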

GPU-Specific Considerations

The simplification pass checks the optimization level at offset +896 of the code object. Levels 4 and 5 (-O4, -O5) enable aggressive loop peeling via sub_7753F0 before canonicalization. At the default -O2, peeling is suppressed to avoid code size growth that could cause instruction cache thrashing.

The LoopMakeSingleEntry knob (OCG knob 487) is the master enable. When disabled, only back-edge canonicalization runs -- preheader insertion is skipped. This knob is checked via the standard OCG knob query at offset +152 of the allocator vtable.

The pass also inspects the convergence flag at offset +1380 (bit 7). When set, it indicates a convergent execution context (e.g., warp-synchronous code), and certain edge-splitting transformations are suppressed to avoid disrupting convergence guarantees.

Knob Name                       Default   Description
──────────────────────────────  ────────  ─────────────────────────────────────────────────────
LoopInversion                   enabled   Enable loop inversion (do-while to while conversion)
LoopInversionBudget             unset     Maximum instruction count for loop inversion
LoopPeelInversion               disabled  Enable loop peeling combined with inversion
EnableSingleThreadPeelingLoops  unset     Enable peeling for single-thread execution paths
GenPeelingLoopsForSyncs         unset     Generate peeling loops around sync instructions
AssertIfPeelingLoopForTexSurf   unset     Assert (debug) if peeling a loop for texture/surface ops

Phase 22 -- OriLoopUnrolling

Purpose

Performs full unrolling of loops with known small trip counts and partial unrolling of larger loops to amortize loop overhead and expose instruction-level parallelism. This is one of the most impactful optimization passes for GPU code, where loops over texture coordinates, reduction accumulators, and matrix tiles dominate execution time.

Function Map

Correction (P1-04): The W023 report incorrectly listed sub_83EF00 as the unrolling driver. That function is the MainPeepholeOptimizer (confirmed by p1.06a sweep). The actual unrolling call chain starts at sub_1392E30.

Function     Size         Role                                                                  Confidence
───────────  ───────────  ────────────────────────────────────────────────────────────────────  ──────────
sub_1392E30  25 lines     Phase 22 execute entry: guards, calls initializer + driver + cleanup  HIGH
sub_1389AF0  593 lines    Unrolling context initializer: reads all knobs from OCG profile       HIGH
sub_1390B30  1,598 lines  Main unrolling driver: per-loop decision, factor selection, dispatch  HIGH
sub_138A6E0  774 lines    Post-unroll cleanup: frees working structures                         HIGH
sub_7E5120   19 lines     Nounroll/skip check: pragma flag, convergence, knob 91                HIGH
sub_7F5D20   99 lines     Rejection recording: indexes string table at 0x21D1EA0                HIGH
sub_138E3E0  125 lines    Loop body scanner: three-pass analysis (header, forward, backward)    HIGH
sub_13858C0  42 lines     Loop back-edge locator                                                HIGH
sub_1385E90  ~200 lines   Trip count bound extractor (init, limit, stride from IV)              MEDIUM
sub_1383620  1,157 lines  Full unroll profitability evaluator (foldable constants, addresses)   MEDIUM
sub_1387C30  ~400 lines   Partial unroll body replicator                                        MEDIUM
sub_13880F0  ~200 lines   Post-unroll CFG fixup                                                 MEDIUM
sub_1385950  ~300 lines   Induction variable analysis                                           MEDIUM
sub_138E9C0  ~400 lines   IV stride/direction verification                                      MEDIUM
sub_1385CC0  ~200 lines   IV constant detection                                                 MEDIUM
sub_13829F0  ~200 lines   Profitability: foldable constant load counting                        MEDIUM
sub_A3A7E0   1,236 lines  Post-unroll statistics (DUMPIR output)                                HIGH

Unrolling Decision Algorithm

The unrolling decision is a multi-stage pipeline implemented in sub_1390B30. The function iterates over loops in reverse RPO order (innermost first, matching the RPO array at code_object+512) and applies a series of eligibility checks, trip count analysis, factor selection, and profitability evaluation before committing to the unroll.

Entry Guard (sub_1392E30)

function OriLoopUnrolling_Execute(code_object):
    if code_object.flags[1368] & 1 == 0:           // optimization disabled
        return
    if code_object.flags[1397] & 0xC0 == 0x40:     // global nounroll override
        return
    if DUMPIR_skip("LoopUnrolling"):                // sub_799250
        return
    if CountBlocks(code_object) <= 2:               // sub_7DDB50
        return
    if not QueryKnob(487, true):                    // master loop pass guard
        return

    ctx = InitializeContext(code_object)             // sub_1389AF0
    RunUnrolling(ctx)                                // sub_1390B30
    Cleanup(ctx)                                     // sub_138A6E0

Context Initialization and Knob Defaults (sub_1389AF0)

The initializer reads unrolling parameters from the OCG profile object. Each knob carries a selector flag: 0 = use the hardcoded default, 1 = use an integer override, 2 = use a float override, 3 = use a double override. The defaults recovered from the binary:

Context Field    Profile Offset  Default  Knob Name (inferred)
───────────────  ──────────────  ───────  ───────────────────────────────
ctx+168 (int32)  +31320          140      UnrollBudget
ctx+172 (float)  +31032          0.25     UnrollFlexableFullLimit
ctx+176 (int32)  +30960          4        UnrollUnknownCount
ctx+180 (int32)  +30816          4        UnrollSmallLoopLimit
ctx+184 (dbl)    +64656          0.4      LoopUnrollLargePartOfShaderPct
ctx+192 (float)  +31392          20.0     UnrollInstLimit
ctx+196 (int32)  +64872          50       UnrollPregThreshold
ctx+200 (int32)  +31248          2        UnrollExtraInstPerPercentSaving
ctx+204 (int32)  +31176          200      UnrollFullInstLimit
ctx+208 (int32)  +64296          46       LoopUnrollNumExtraInstBase

Boolean and integer knobs read via vtable dispatch:

Knob ID  Profile Offset  Default  Knob Name
───────  ──────────────  ───────  ──────────────────────────────────────────────
437      +31464          true     LoopUnroll (master enable)
894      +64368          true     LoopUnrollNonInnermost
897      +64584          true     UnrollMultiBlockLoops
902      +64944          true     UnrollVariableBounds
896      +64512          0        LoopUnrollFactor (INT override; 0 = heuristic)
895      +64440          0        EpilogueLoopUnrollCount
900      +64800          0        LoopUnrollNumInstTex
903      +65016          false    DisablePartialUnrollOverflowCheck

String knob: knob 427 (profile+30744) returns the LoopUnrollFactor per-block override string, with the format "-N-" to skip block N, "+N+" to force-unroll block N, "-" to skip all, "+" to force all.
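
The substring convention can be captured in a few lines (illustrative helper, not recovered code; the binary performs the equivalent strstr checks):

```python
def unroll_override(override: str, block_id: int):
    """Interpret the knob-427 per-block string: '-' skips every block,
    '+' forces every block, '-N-' skips block N, '+N+' forces block N.
    Returns 'skip', 'force', or None (no override)."""
    if override == "-" or f"-{block_id}-" in override:
        return "skip"
    if override == "+" or f"+{block_id}+" in override:
        return "force"
    return None
```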

Nounroll Pragma Check (sub_7E5120)

Returns true (suppress unrolling) when any of these conditions hold:

  1. Convergence constraint: The back-edge analysis context at code_object+1784 is active, and the loop header's entry in the back-edge table (code_object+1776+16) is valid and within the convergence limit. This suppresses unrolling of warp-synchronous loops.
  2. PTX nounroll pragma: Byte 292 of the block descriptor at (code_object+368 + 8*block_idx) has bit 1 set. This bit is set during PTX-to-Ori lowering when the nounroll pragma string (at 0x1CFE126) is parsed.
  3. Instruction-level marker: Byte 283 of the loop header instruction has bit 0 set.
  4. Per-block knob: OCG knob 91 is set for this block (queried via sub_7A1A90).

Main Decision Flowchart (sub_1390B30)

function RunUnrolling(ctx):
    code_object = ctx.code_object

    // Phase 1: Read master enable and per-block override string
    master_enable = QueryKnob(437)                   // LoopUnroll
    override_string = QueryKnobString(427)           // "-N-" / "+N+" format
    RecomputeRegisterPressure(code_object)            // sub_7E6090
    RebuildInstructionList(code_object)               // sub_781F80

    // Phase 2: Pre-scan -- count inlinable calls and non-unrollable instructions
    for each instruction in code_object.instruction_list:
        if opcode == 97 (BRX):
            if callee.entry_block == callee.exit_block:
                inlinable_calls++
                if trip_count > 1:
                    multi_exit |= AnalyzeMultiExit(ctx, callee)

    // Phase 3: Iterate loops in reverse RPO (innermost first)
    rpo_count = code_object.rpo_count                // offset +520
    for idx = rpo_count-1 downto 0:
        block = code_object.blocks[code_object.rpo[idx]]

        // ── Step A: nounroll annotation propagation ──
        if block.nounroll_annotation:                // byte +246
            propagate nounroll to all blocks at >= same nesting depth

        // ── Step B: eligibility filter ──
        if block.loop_depth == 0:          continue  // not a loop
        if block.loop_depth != block.loop_depth_equal: continue
        if block.nounroll and not ctx.force_all:     continue

        // ── Step C: structure analysis ──
        latch = LocateBackEdge(ctx, block)           // sub_13858C0
        if not latch:                    continue
        exit_inst = latch.last_instruction
        if exit_inst.opcode != 95:                   // not conditional branch
            Reject(block, 13); continue              // indirect jump

        // ── Step D: nounroll / convergence check ──
        if CheckNounroll(block, code_object):        // sub_7E5120
            Reject(block, 11); continue

        // ── Step E: execution frequency analysis ──
        freq_header = code_object.freq_table[header_reg]
        freq_latch  = code_object.freq_table[latch_reg]
        is_hot = (freq_latch > 999) and (freq_header > 0)
                 and (freq_latch / freq_header > 3)

        // ── Step F: body analysis ──
        body_info = ScanLoopBody(ctx, block, latch)  // sub_138E3E0
        // body_info contains: tex_count, body_size, foldable_ldc_count,
        //                     has_cross_edges, mem_count
        if body_info.has_cross_edges:    continue

        // ── Step G: budget computation ──
        budget_scale = QueryKnobDouble(898, 0.5)     // default 0.5
        scaled_body = (int)(budget_scale * body_size)
        remaining = total_budget - body_size - scaled_body - ...

        // ── Step H: per-block override check ──
        if override_string:
            needle = "-{block_id}-"
            if override_string == "-" or strstr(override_string, needle):
                continue                             // skip this block
            needle = "+{block_id}+"
            if override_string == "+" or strstr(override_string, needle):
                force_unroll = true

        // ── Step I: pragma force-unroll ──
        if flags[1397] & 0xC0 == 0x80:              // PTX pragma force
            force_unroll = true

        // ── Step J: non-innermost filter ──
        if not ctx.allow_non_innermost and not force_unroll:
            if 10 * body_info.tex_count < remaining:
                Reject(block, 7); continue

        // ── Step K: factor selection ──
        if force_unroll:
            factor = 1 << ctx.force_factor           // power-of-2 override
        else if known_trip_count:
            factor = trip_count
            // Budget-constrain: while factor * body_cost > UnrollBudget:
            //     factor--
            if factor > 4 and trip_count == 1:
                factor &= ~3                         // round to mult-of-4
            if factor <= 1:
                Reject(block, 12); continue
        else:
            if body_size <= 49 and body_info.tex_count > 0:
                factor = 2                           // conservative default
            else:
                factor = max(1, UnrollBudget / body_cost)

        // ── Step L: knob override ──
        if QueryKnob(429):                           // LoopUnrollFactor INT
            factor = GetKnobInt(429)

        // ── Step M: IV analysis ──
        iv_info = AnalyzeIV(ctx, latch)              // sub_1385950
        if not iv_info:             Reject(block, 14); continue
        if not ValidateIV(ctx, iv_info):             // sub_1387870
                                    Reject(block, 14); continue
        bound = ExtractBound(ctx, iv_info)           // sub_1385E90
        if not bound or bound.opcode != 2:
                                    Reject(block, 16); continue
        if bound.def_block.predecessor_count != 1:
                                    Reject(block, 17); continue
        if bound.init_reg == bound.limit_reg:
                                    Reject(block, 18); continue
        stride_ok = VerifyStride(ctx, block, latch, iv_info, bound)
        if stride_ok & 2:          Reject(block, 17); continue
        if stride_ok & 1:          Reject(block, 18); continue

        // ── Step N: detect constant trip count ──
        const_iv = DetectConstantIV(ctx, iv_info)    // sub_1385CC0

        // ── Step O: profitability for full unroll ──
        if factor == trip_count and single_block_body:
            if CheckFoldableProfitability(ctx, block, iv_info, factor):
                ReplicateFullUnroll(ctx, block, factor) // sub_1383620
                stats.unrolled_count++
                continue

        // ── Step P: partial unroll execution ──
        if factor >= 2:
            remainder = trip_count % factor
            iterations_per_copy = (trip_count - remainder) / factor
            block.iterations_per_copy = iterations_per_copy
            if remainder > 0:
                for r = 0 to remainder-1:
                    DuplicateBody(ctx, block)         // sub_932E40
            ReplicatePartialUnroll(ctx, block, latch,
                factor, remainder)                    // sub_1387C30
            stats.unrolled_count++
        else:
            Reject(block, 24)                         // budget exceeded

    // Phase 4: Post-unroll fixup
    stats.non_unrolled = total_loops - stats.unrolled - stats.failed
    if any_unrolled:
        RebuildBackEdges(code_object)                 // sub_7846F0
        RerunLiveness(code_object)                    // sub_A0F020
        RerunControlFlow(code_object)                 // sub_752E40
        MarkModified(code_object)                     // sub_7B52B0
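
The remainder arithmetic in Step P can be sanity-checked with a small sketch (the function name is illustrative, not recovered from the binary):

```python
def split_trip_count(trip_count: int, factor: int):
    """Mirror Step P: peel `remainder` iterations, then run the unrolled
    kernel, whose body holds `factor` copies, `iterations_per_copy` times."""
    remainder = trip_count % factor
    iterations_per_copy = (trip_count - remainder) // factor
    # The peeled bodies plus the unrolled kernel cover every original iteration.
    assert remainder + iterations_per_copy * factor == trip_count
    return remainder, iterations_per_copy

# trip_count=10, factor=4: 2 peeled bodies + 2 trips through the 4-wide kernel
print(split_trip_count(10, 4))  # (2, 2)
```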

Unroll Rejection Table

When a loop cannot be unrolled, sub_7F5D20 records the reason by indexing a string pointer array at 0x21D1EA0. The diagnostic strings contain hex codes like "0x80000001 - Not unrolled: Irregular loop" -- these hex values are part of the printed message text, not the internal array index. The W023 report originally described a 36-byte structure table at 0x21D1980; that table belongs to the operand range lookup in the peephole optimizer (sub_7E39B0), not the unrolling pass. The actual internal rejection codes are simple integers indexing the string array:

| Code | Category | Reason |
|------|----------|--------|
| 7 | Performance | Body too large relative to texture savings (10 * tex_count < remaining_budget) |
| 11 | Pragma/knob | PTX nounroll pragma, convergence constraint, or per-block knob 91 |
| 12 | Budget | Partial unroll factor reduced to 1 (no factor >= 2 fits within UnrollBudget) |
| 13 | Ineligible | Loop exit contains BRX (indirect jump, opcode 95 with special flags) |
| 14 | Unsupported IV | Induction variable analysis failed (sub_1385950 or sub_1387870) |
| 15 | Unsupported IV | IV register class is not integer (class 1) or pointer (class 2/3) |
| 16 | Trip count | Trip count bound extraction failed (sub_1385E90) |
| 17 | Irregular | IV definition block has multiple predecessors, or stride/direction verification failed |
| 18 | Trip count | IV initial value register equals IV limit register (degenerate zero-trip loop) |
| 19 | Unsupported IV | IV stride sign inconsistent between loop header and induction increment |
| 24 | Budget | Catch-all: budget exceeded after all factor reduction attempts |

The diagnostic output is gated by flags[1421] & 0x20 (DUMPIR verbose mode). When enabled, the rejection string is recorded in a hash map keyed by the loop header instruction node, using FNV-1a hashing of the node's block index.
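
For reference, FNV-1a over a 32-bit block index looks like the following sketch. The 64-bit offset basis and prime are the conventional FNV-1a constants; the exact hash width and byte order used by the binary were not confirmed and are assumptions here.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325   # standard 64-bit FNV-1a constants
FNV_PRIME = 0x100000001B3

def fnv1a_index(block_index: int) -> int:
    """FNV-1a over the 4 little-endian bytes of a block index (assumed layout)."""
    h = FNV_OFFSET_BASIS
    for byte in block_index.to_bytes(4, "little"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF   # keep 64 bits
    return h

# Distinct block indices produce distinct map keys
print(fnv1a_index(0) != fnv1a_index(1))  # True
```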

Heuristic Thresholds (Knobs)

The unrolling decision is controlled by a rich set of OCG knobs. All knob names are stored ROT13-encoded in the binary:

| Knob Name | Type | Default | Description |
|-----------|------|---------|-------------|
| LoopUnroll | BOOL | true | Master enable for loop unrolling |
| LoopUnrollFactor | INT | 0 | Override unroll factor (0 = heuristic) |
| UnrollBudget | INT | 140 | Maximum total instruction count after unrolling |
| UnrollInstLimit | FLOAT | 20.0 | Maximum instructions in a single unrolled loop body |
| UnrollFullInstLimit | INT | 200 | Maximum body size for full unrolling |
| UnrollFlexableFullLimit | FLOAT | 0.25 | Flexible full-unroll limit (adjusted by loop characteristics) |
| UnrollSmallLoopLimit | INT | 4 | Body size threshold below which loops are always fully unrolled |
| UnrollPregThreshold | INT | 50 | Maximum predicate register pressure for unrolling |
| UnrollMultiBlockLoops | BOOL | true | Allow unrolling of multi-basic-block loop bodies |
| UnrollVariableBounds | BOOL | true | Allow unrolling when trip count is not compile-time constant |
| UnrollUnknownCount | INT | 4 | Default trip count assumption when count is unknown |
| UnrollUnknownInstLimit | INT | 0 | Maximum body size for unrolling with unknown trip count |
| UnrollExtraInstPerPercentSaving | INT | 2 | Instructions allowed per percent of cycle saving |
| UnrollTex3DPercentSavedThreshold | INT | 0 | Minimum savings percent for 3D texture loops |
| UnrollProfiledColdInstsScale | INT | 0 | Scale factor for instruction count in profiled-cold blocks |
| LoopUnrollExtraFoldableLdcWeight | INT | 0 | Extra weight for foldable constant loads in unroll benefit |
| LoopUnrollFoldableAddrWeight | INT | 0 | Weight for foldable address computations |
| LoopUnrollLargePartOfShaderPct | DOUBLE | 0.4 | Percentage threshold: loop is "large part of shader" |
| LoopUnrollNumExtraInstBase | INT | 46 | Base extra instruction allowance per unroll iteration |
| LoopUnrollNumInstSmallLoop | INT | 0 | Instruction count defining "small loop" |
| LoopUnrollNumInstTex | INT | 0 | Texture instruction count bonus for unrolling |
| LoopUnrollSingleLoopSavedPctFactor | INT | 0 | Savings factor for single-loop shaders |
| LoopUnrollNonInnermost | BOOL | true | Allow unrolling of non-innermost loops |
| LoopUnrollUnknownMultiBlock | BOOL | false | Allow multi-block unroll with unknown bounds |
| EpilogueLoopUnrollCount | INT | 0 | Unroll count for epilogue (remainder) loops |
| DisablePartialUnrollOverflowCheck | BOOL | false | Skip overflow check on partial unroll count |

GPU-Specific Unrolling Concerns

Register pressure. GPU threads share a fixed register file per SM. Unrolling increases live ranges, potentially reducing occupancy (the number of concurrent warps). The unroller queries register pressure estimates and compares against UnrollPregThreshold before committing.

Instruction cache. GPU instruction caches are small (typically 128KB L1i per SM). Aggressive unrolling of large loop bodies can cause i-cache thrashing. The UnrollBudget knob caps the total instruction growth.

Texture instruction scheduling. Texture fetches have high latency (hundreds of cycles). Unrolling loops containing texture operations is especially profitable because it exposes independent fetches that the scheduler can overlap. The LoopUnrollNumInstTex and UnrollTex3DPercentSavedThreshold knobs give extra weight to texture-heavy loops.

PTX nounroll pragma. The PTX string nounroll at 0x1CFE126 is parsed during PTX-to-Ori lowering and sets bit 1 of byte 292 in the block descriptor at (code_object+368 + 8*block_idx). The check is performed by sub_7E5120, which also tests three additional suppression conditions: the convergence constraint (back-edge table at code_object+1776), an instruction-level marker (byte 283 bit 0), and per-block knob 91. Any single condition is sufficient to suppress unrolling for that loop (rejection code 11).

Convergence constraint. When the back-edge analysis context at code_object+1784 is active (indicating warp-synchronous code), the unroller checks whether the loop header falls within the convergence region. If it does, unrolling is suppressed to avoid breaking warp-level synchronization guarantees. This is particularly important for cooperative groups and ballot-based algorithms.

DUMPIR Statistics

When diagnostics are enabled, the pass outputs:

# [partially unrolled loops=N] [non-unrolled loops=M]

This line appears in eight SM-variant statistics printers (sub_ABBA50 through sub_ABEB50), each a 1,771-byte clone specializing output format for a specific SM generation.


Phase 24 -- OriPipelining

Purpose

Performs modulo software pipelining on loops that were not fully unrolled. The pass overlaps successive loop iterations by interleaving instructions from different iterations within a single loop body, hiding functional unit and memory latency. This is the single most complex loop transformation in ptxas.

Two-Layer Pipelining Architecture

ptxas implements software pipelining in two cooperating layers:

  1. Phase 24 (OriPipelining, pre-RA): Annotates instruction operands with pipeline latency classes, computes the minimum initiation interval (MII), performs the modulo scheduling loop transformation (iteration overlap, prolog/epilog generation). Operates on the Ori IR before register allocation.

  2. Post-RA SoftwarePipeline (sub_8B9390, 23KB): A scheduling algorithm variant within the post-RA instruction scheduler (address range 0x893000--0x8FE000) that performs instruction-level scheduling of already-pipelined loop bodies using physical registers. One of approximately 12 scheduling variants alongside DualIssueScheduler, TensorScheduler, LoopScheduler, PrefetchScheduler, etc.

The two layers cooperate: Phase 24 transforms the loop structure (instruction replication, prolog/epilog construction) before register allocation. The post-RA SoftwarePipeline variant handles the cycle-accurate instruction placement of already-pipelined loops.

Function Map

| Function | Size | Role | Confidence |
|----------|------|------|------------|
| sub_926A30 | 22,116 bytes | Per-instruction operand latency annotator and encoding rewriter | HIGH |
| sub_91A0F0 | 5,550 bytes | Opcode-to-latency-class classifier (~350 opcodes, 13 distinct classes) | HIGH |
| sub_9203A0 | 4,881 bytes | Pipeline stage cost calculator (ResMII computation, FP cost accumulation) | MEDIUM |
| sub_921820 | 1,592 bytes | Prolog/epilog code generator | MEDIUM |
| sub_9202D0 | 207 bytes | Two-operand pipeline feasibility check (returns 60=reject, 130=accept) | HIGH |
| sub_91E610 | 399 bytes | Register-class-based latency lookup (class 4→26, class 5/2→20) | HIGH |
| sub_91E900 | 470 bytes | Pipe-assignment-based stall cycle calculator (32/64 cycle caps) | HIGH |
| sub_92C0D0 | 358 bytes | Per-instruction annotation wrapper (calls sub_926A30, checks opcode changes) | HIGH |
| sub_92C240 | 8,033 bytes | Extended GEMM-loop pipeliner (SM90+ TMA pipeline depth management) | MEDIUM |
| sub_8B9390 | 22,841 bytes | Post-RA software pipelining scheduling variant (in scheduler subsystem) | MEDIUM |

Correction (P1-06): The original function map listed sub_926A30 as the "main pipelining engine (modulo scheduling)." Decompilation reveals it is the per-instruction operand latency annotator -- it iterates over each operand of an instruction, calls sub_91A0F0 to classify the operand's latency class, and rewrites the operand encoding with the latency annotation. The modulo scheduling loop transformation is distributed across the remaining functions, with sub_9203A0 computing stage costs and sub_921820 generating prolog/epilog code.

Software Pipelining Algorithm

Phase 1: Operand Latency Annotation

For each instruction in the loop body, sub_92C0D0 calls sub_926A30 to annotate operands:

function AnnotateOperandLatencies(code_object, instruction):
    opcode = instruction.word & 0xFFFFCFFF      // strip modifier bits (bits 12-13)
    secondary_opcode = instruction.secondary_opcode
    operand_array = instruction.operands         // offset +84
    operand_count = instruction.operand_count    // offset +80

    for i in 0..operand_count-1:
        operand_type = (operand_array[i].word >> 28) & 7
        if operand_type in {2, 3}:               // register or register pair
            // Adjust count for predicated instructions (bit 12)
            adjusted_count = operand_count - 2 * (((opcode >> 11) & 2) != 0)
            if i < adjusted_count:
                latency_class = ClassifyLatency(opcode, secondary_opcode,
                                                operand_array, adjusted_count, i)
                if latency_class != default:
                    RewriteOperandEncoding(operand_array[i], code_object, latency_class)

        // For register operands: call full rewriter sub_922210
        // For non-register operands: call sub_9267C0

Phase 2: Pipeline Feasibility Filtering

Each instruction is checked by sub_9202D0:

function CheckPipelineFeasibility(code_object, instruction):
    // Reject instructions with special operand flags
    if (operand_array[1] & 0x603FFFF) != 0 or (operand_array[3] & 0xF8000000) != 0:
        if optimization_level > 1:
            return REJECT                        // return code 60

    // Reject if pipe assignment class <= 3 (control/barrier pipe)
    pipe_class = PipeAssignment(code_object, primary_opcode)   // vtable+904
    if pipe_class <= 3:
        return REJECT

    // Reject if operand 0 and operand 1 have different latency classes
    lat0 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 0)
    lat1 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 1)
    if lat0 != lat1:
        return REJECT                            // asymmetric latencies

    // Reject if extended operands have blocking flags
    if operand_count > 2 and ((operand_array[4] & 0xF) != 0 or ((operand_array[4] >> 4) & 1) != 0):
        return REJECT

    // Accept: trim to 2-operand form
    result_operands = &operand_array[2]
    result_count = 2
    return ACCEPT                                // return code 130

Phase 3: MII Computation

The minimum initiation interval is computed as:

MII = max(RecMII, ResMII)

RecMII (recurrence-constrained): The longest data dependence cycle in the DDG divided by the iteration distance it spans. For a cycle of total latency L spanning D iterations: RecMII = ceil(L / D).

ResMII (resource-constrained): Computed by sub_9203A0 using floating-point cost accumulation. The function classifies each instruction's pipe class using a 7-entry pipe class table at code_object+16 and accumulates per-pipe instruction counts:

function ComputeResMII(loop_body, pipe_table):
    pipe_counts[0..6] = {0}
    for each instruction in loop_body:
        lat0 = ClassifyLatency(instruction, operand=0)
        lat1 = ClassifyLatency(instruction, operand=1)
        pipe = MapLatencyToPipe(lat0, pipe_table)    // 7-entry lookup
        pipe_counts[pipe] += cost(instruction)       // FP cost weights

    ResMII = max(pipe_counts[i] / pipe_width[i] for i in 0..6)

The pipe class boundaries stored at code_object+16 define 7 functional unit classes. Each class has a capacity (number of execution slots per cycle). ResMII is the maximum ratio of instruction demand to capacity across all pipe classes.
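
Putting the two bounds together, a minimal MII computation looks like the sketch below. The pipe widths and the single recurrence are hypothetical inputs, and integer ceilings are used for illustration where the binary accumulates floating-point costs.

```python
import math

def rec_mii(cycles):
    """cycles: list of (total latency L, iteration distance D) per DDG cycle."""
    return max(math.ceil(l / d) for l, d in cycles) if cycles else 0

def res_mii(pipe_counts, pipe_widths):
    """Maximum demand/capacity ratio across the pipe classes."""
    return max(math.ceil(c / w) for c, w in zip(pipe_counts, pipe_widths))

# Hypothetical loop: one recurrence of latency 12 spanning 3 iterations;
# 8 ALU ops on a 2-wide pipe and 3 memory ops on a 1-wide pipe.
mii = max(rec_mii([(12, 3)]), res_mii([8, 3], [2, 1]))
print(mii)  # 4
```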

Phase 4: Modulo Schedule Construction

function ModuloSchedule(loop_body, MII):
    II = MII
    while II <= MAX_II:
        MRT = new ModuloReservationTable(II)     // II rows x pipe_classes columns
        success = true

        for each instruction in priority order:
            earliest = max(data_dependency_constraints)
            latest = earliest + II - 1
            placed = false

            for slot in earliest..latest:
                row = slot mod II
                pipe = instruction.pipe_class
                if MRT[row][pipe] has capacity:
                    MRT[row][pipe] -= 1
                    instruction.scheduled_time = slot
                    instruction.stage = slot / II
                    placed = true
                    break

            if not placed:
                success = false
                break

        if success:
            return (II, schedule)
        II += 1

    return FAILURE                               // could not pipeline
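
A runnable toy version of the reservation-table search above (one slot per pipe class per row, hypothetical inputs) shows why II grows until every instruction finds a free row:

```python
def modulo_schedule(instrs, mii, max_ii=16):
    """Toy modulo scheduler: instrs is a list of (earliest_cycle, pipe_class),
    with one reservation slot per pipe class per MRT row."""
    for ii in range(mii, max_ii + 1):
        mrt = set()                                   # occupied (row, pipe) slots
        schedule = {}
        feasible = True
        for idx, (earliest, pipe) in enumerate(instrs):
            for slot in range(earliest, earliest + ii):
                key = (slot % ii, pipe)
                if key not in mrt:
                    mrt.add(key)
                    schedule[idx] = (slot, slot // ii)   # (time, stage)
                    break
            else:                                     # no free row at this II
                feasible = False
                break
        if feasible:
            return ii, schedule
    return None                                       # could not pipeline

# Three ops competing for the same 1-wide pipe at cycle 0 force II up to 3.
ii, schedule = modulo_schedule([(0, 0), (0, 0), (0, 0)], mii=1)
print(ii)  # 3
```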

Phase 5: Prolog/Epilog Generation

Once a valid schedule is found at initiation interval II with S pipeline stages, sub_921820 generates:

function GeneratePrologEpilog(loop, II, num_stages):
    // Prolog: S-1 partial iterations
    for stage in 0..num_stages-2:
        emit instructions assigned to stages 0..stage
        // Each prolog iteration adds one more stage

    // Kernel: steady-state loop body
    emit all instructions from all stages
    // Trip count adjusted: new_trip = original_trip - (num_stages - 1)

    // Epilog: S-1 drain iterations
    for stage in num_stages-2..0:
        emit instructions assigned to stages stage+1..num_stages-1
        // Each epilog iteration removes one stage

Instruction Latency Classifier (sub_91A0F0)

The classifier is a 5.5KB, 1372-line switch statement mapping approximately 350 Ori opcodes to 13 distinct latency class values. It takes five parameters: (opcode, secondary_opcode, operand_array, operand_count, operand_index) and returns a class ID -- not a cycle count. The scheduler maps class IDs to actual cycle counts via the hardware profile.

Latency Class Table

| Class | Typical opcodes | Meaning |
|-------|-----------------|---------|
| 1 | Past-end operands, invalid indices | Skip / not used |
| 6 | Simple ALU, bitwise, short integer | Short-pipe latency (~80 opcodes) |
| 7 | Paired register operations | Medium-short (~5 opcodes) |
| 8 | Special cases (via lookup table dword_21E1340) | Medium |
| 9 | Type conversions (via lookup table) | Medium |
| 10 | Integer multiply, shifts, IMAD | Medium-long (~40 opcodes) |
| 11 | Address computations, LEA variants | Medium-long (~15 opcodes) |
| 12 | Memory operations, FP32, barriers | Standard long (~100 opcodes) |
| 14 | Wide memory, atomics, FP64 stores | Extended long (~20 opcodes) |
| 16 | FP64 special variants | Extended long (~3 opcodes) |
| 20 | Texture fetches, uniform loads | Very long (~30 opcodes) |
| 26 | Global memory loads, uncached access | Maximum latency (~25 opcodes) |
| 31 | Scoreboard/barrier-related operands | Special handling (~5 opcodes) |

Opcode Family Handling

| Opcode range | Category | Latency behavior |
|--------------|----------|------------------|
| 0x03--0x24 | Integer ALU | Mostly passthrough default; 0x23 always returns 10 |
| 0x3C, 0x3E, 0x4E, 0x4F | Memory (load/store) | Returns field from operand_array[4] bits for operands 0--1 |
| 0x46, 0xF3--0x106 | Texture | Returns 6 normally; 10 for MIO-dependent with extended flag check |
| 0x49, 0x4A, 0x51, 0x143, 0x15E | Atomic/reduce | Always returns 12 |
| 0x55--0x6F | Floating-point | Complex per-operand logic; 0x55 uses lookup table dword_21E1340 |
| 0x5B, 0x5C, 0x137 | Barriers/sync | Returns 12 for operand 1, else default |
| 0xB7, 0x120 | WGMMA setup | Per-operand latency (10--20) based on accumulator flags |
| 0x135 | HMMA/IMMA | Calls sub_7E39B0/sub_7E3A70/sub_7E3BA0/sub_7E3C30 for matrix latency |
| 0x13D, 0x13E | Extended FP | Accumulator-flag-dependent returns (10 or 12) |

Stall Cycle Calculator (sub_91E900)

sub_91E900 computes the stall penalty for an instruction by mapping latency classes through the pipe assignment function (vtable+904):

function ComputeStallCycles(code_object, instruction):
    lat0 = ClassifyLatency(instruction, operand=0)
    pipe0 = PipeAssignment(code_object, lat0)         // vtable+904

    if pipe0 == 8:                                     // long-latency pipe
        stall = StallTable[instruction.index]          // code_object+440
        return min(stall, 64)                          // cap at 64 cycles

    lat1 = ClassifyLatency(instruction, operand=1)
    pipe1 = PipeAssignment(code_object, lat1)

    if pipe1 == 8:
        stall = StallTable[instruction.index]
        return min(stall, 64)

    // Neither operand on long pipe
    stall = StallTable[instruction.index]
    return min(stall, 32)                              // cap at 32 cycles

The pipe assignment value 8 corresponds to the long-latency functional unit (memory/texture). Instructions on this pipe get a 64-cycle cap; all others are capped at 32 cycles.

GEMM Pipelining (sub_92C240)

The GemmPipeliner* family of knobs controls a specialized pipelining mode for GEMM (matrix multiply) loops:

| Knob Name | Type | Default | Description |
|-----------|------|---------|-------------|
| GemmPipelinerEnabled | BOOL | false | Master enable for GEMM-specific pipelining |
| GemmPipelinerPipelineDepthEnforceDeltaFull | INT | 0 | Pipeline depth adjustment for full enforcement |
| GemmPipelinerPipelineDepthEnforceDeltaPartial | INT | 0 | Pipeline depth adjustment for partial enforcement |
| GemmPipelinerDependenciesPopbl | BOOL | false | Dependency resolution policy between DMA and compute stages |
| GemmPipelinerScoreboardHashPopbl | BOOL | false | Scoreboard hash policy for GEMM barrier tracking |
| GemmPipelinerUseRegisterCalculation | INT | 0 | Use register-based calculation for pipeline depth vs. fixed |

The extended pipelining in sub_92C240 (8KB) handles GEMM-like patterns where the loop body contains WGMMA/IMMA instructions. From decompilation:

  1. Activation: The GEMM pipeliner activates when code_object+48 (GEMM mode flag) is set and the pipeline context at code_object+56 has a valid stage range.
  2. Stage iteration: Iterates from context+84 (start stage) to context+88 (end stage), with 96-byte descriptors per stage at context+136.
  3. Pipeline depth management: Uses sub_8A4DA0 to validate stage depth and sub_6E6650 for dynamic array resizing when pipeline depth exceeds the current allocation. Writes stage bitmasks (1 << stage_index) into the stage descriptor arrays.
  4. Hardware model: On SM90+ (Hopper), TMA supports up to 8 outstanding asynchronous copy operations. The GEMM pipeliner matches this hardware depth, staging DMA (memory) and compute (math) operations to fill the pipeline.

The DUMPIR diagnostic output includes For Dma Loop and For Math Loop sections from sub_7A4500, confirming the pipeliner explicitly distinguishes between DMA and compute loop stages.

Other Pipelining Knobs

| Knob Name | Type | Default | Description |
|-----------|------|---------|-------------|
| OkToPipelineNoUnroll | INT | 0 (disabled) | Allow pipelining even when unrolling was also suppressed |
| PipelineHoistCondLimit | INT | unset | Maximum condition complexity for hoisting in pipelined loops |
| PipelineHoistRRegPressureLimit | INT | unset | R-register pressure limit for hoisting inside pipelined body |
| PipelineHoistPRegPressureLimit | INT | unset | P-register pressure limit for hoisting inside pipelined body |
| PipelineMIOVQToInstRatio | DBL | unset | MIOVQ-to-instruction ratio threshold for pipeline profitability |
| PipelineMultiOutputTex | INT | 0 (disabled) | Enable pipelining of loops with multi-output texture instructions |
| PipelineSpecUsesInHeadOnly | INT | 0 (disabled) | Restrict speculative uses to loop header only |

GPU-Specific Pipeline Concerns

Warp divergence. Pipelined loops assume all threads in a warp execute the same number of iterations. If the trip count is warp-divergent, the prolog/epilog handling must account for early-exit threads. The pass checks the varying analysis (phases 53, 70) to determine divergence.

Barrier placement. Pipelined loops containing BAR.SYNC or MEMBAR instructions are checked by sub_9202D0 -- if the pipe assignment class for a barrier instruction is <= 3, the instruction is rejected from pipelining. The latency classifier (sub_91A0F0) assigns class 12 to barrier operands (opcodes 0x5B, 0x5C, 0x137), but the feasibility check rejects based on pipe class, not latency class.

Memory pipeline depth. The sub_92C240 extended pipeliner for GEMM-like loops manages the hardware memory pipeline on SM90+. It explicitly tracks DMA pipeline depth using 96-byte per-stage descriptors, resizing arrays dynamically when depth exceeds allocation. The stage descriptor at context+136 + 96*stage holds bitmask membership, latency counters, and dependency links.

Pipe class model. The 7-entry pipe class table at code_object+16 partitions the functional units into classes. The post-RA software pipelining variant (sub_8B9390) uses the same table to determine which functional unit class each instruction uses, ensuring resource conflict detection is consistent between the two pipelining layers.


Phases 35, 66, 79, 88 -- OriHoistInvariants (LICM)

Purpose

Hoists computations that produce the same result on every loop iteration out of the loop body and into the preheader. This reduces the dynamic instruction count proportionally to the trip count. The four instances are not redundant -- each targets invariants created by different intervening transformations.

Function Map

All four instances share the same core implementation:

| Function | Size | Role | Confidence |
|----------|------|------|------------|
| sub_C5FE00 | 34 bytes | Phase 35 execute wrapper | CERTAIN |
| sub_C5FE30 | 34 bytes | Phase 66 execute wrapper | CERTAIN |
| sub_C5FE60 | 34 bytes | Phase 79 execute wrapper | CERTAIN |
| sub_C5FE90 | 34 bytes | Phase 88 execute wrapper | CERTAIN |
| sub_7DDB50 | 156 bytes | Optimization guard: checks knob 499, block count > 2 | HIGH |
| sub_8FFDE0 | 573 bytes | HoistInvariants orchestrator: iterates blocks, queries knob 381, dispatches inner worker | HIGH |
| sub_8FF780 | 1,622 bytes | LICM inner worker: identifies and moves invariant instructions | HIGH |
| sub_8FEAC0 | 2,053 bytes | Invariance marking: forward/backward operand scan per block | HIGH |
| sub_8F76E0 | 90 bytes | Per-instruction invariance test: checks output register def-block | HIGH |
| sub_8F7770 | 810 bytes | Hoisting safety check: operand class + latency analysis | HIGH |
| sub_8F8CB0 | 658 bytes | Profitability check: budget-weighted score vs latency penalty | HIGH |
| sub_8F7DD0 | 374 bytes | Transitive invariance propagation through def-use chains | HIGH |
| sub_8F7AE0 | 558 bytes | Instruction mover: unlinks from loop, inserts at preheader | HIGH |
| sub_8FF2D0 | 1,186 bytes | Budget computation + invariant marking + hoist dispatch | HIGH |
| sub_8F8BC0 | 257 bytes | Instruction counting: header/body weight via isNoOp | HIGH |
| sub_74D720 | 353 bytes | Loop boundary analysis: barrier/jump/predecessor checks | HIGH |
| sub_74F500 | -- | Preheader location finder | MEDIUM |
| sub_7DF3A0 | 88 bytes | Opcode flags table lookup (side-effect classification) | HIGH |
| sub_7E0540 | 156 bytes | Observable side-effect checker (memory, call, barrier) | HIGH |

Execute Flow

sub_C5FExxx(phase_obj)                         // 34-byte vtable dispatch
  └─ sub_8FFDE0(code_object, pass_id)          // orchestrator
       ├─ sub_7DDB50(code_object)              // guard: returns block count, checks knob 499
       ├─ sub_799250(allocator, "HoistInvariants", &skip)  // DUMPIR check
       └─ sub_8FF780(context)                  // per-loop LICM core
            ├─ sub_781F80                       // rebuild instruction list
            ├─ sub_7E6090                       // recompute register pressure
            ├─ sub_773140                       // recompute loop depths
            ├─ sub_74D720                       // analyze loop boundaries
            ├─ sub_74F500                       // find preheader
            ├─ sub_7A1A90 / sub_7A1B80         // query knob 381 per block
            └─ sub_8F8BC0                       // move instruction to preheader

Why Four Instances?

| Phase | Pass ID (a2) | Pipeline Position | What Creates New Invariants |
|-------|--------------|-------------------|-----------------------------|
| 35 (Early) | 0 | After GeneralOptimize (29), ExtractShaderConsts (34) | CSE eliminates redundant expressions, exposing loop-invariant results; shader constant extraction hoists uniform loads |
| 66 (Late) | 1 | After predication (63), GeneralOptimizeLate2 (65) | Predication converts conditional branches to predicated instructions; if the condition is loop-invariant, the entire predicated instruction becomes invariant |
| 79 (Late2) | 2 | After LateExpansionUnsupportedOps (78) | Late expansion splits compound operations into sequences; address computations and constant sub-expressions in expanded sequences are often invariant |
| 88 (Late3) | 3 | After FixupGmmaSequence (87) | GMMA fixup reorders/inserts instructions for wgmma hardware constraints; descriptor loads and accumulator setup become visible as invariants |

Pass ID Controls Aggressiveness

The pass_id parameter (parameter a2 of sub_8FFDE0) affects which loops are processed and how aggressively hoisting is performed. From the decompiled logic at sub_8FFDE0:

// sub_8FFDE0 lines 58-89 (simplified)
v7 = sub_7A1B80(allocator, 381, block);   // query knob 381 for this block
if (v7 == 1) {                             // knob says "inner loops only"
    if (pass_id == 1) goto hoist_block;    // Late pass: proceed
    goto skip_block;                       // Early pass: skip
}
if (v7 == 3) {                             // knob says "never"
    if (pass_id <= 1) goto handle_conservative;
    goto skip_block;
}
if (v7 == 0) {                             // knob says "always"
    if (pass_id == 0) goto hoist_aggressively;
    goto skip_block;
}
  • pass_id = 0 (Early): Hoists aggressively and calls sub_A112C0(code_object, 1) to re-run sub-analyses afterward. This is the most aggressive pass.
  • pass_id = 1 (Late): Includes inner-loop-only blocks, but skips the re-analysis call.
  • pass_id >= 2 (Late2, Late3): Most conservative -- only hoists from blocks where knob 381 returns 0 (always-hoist).
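
The branches above reduce to a small decision table. The sketch below is a direct restatement of the simplified decompilation, not additional recovered behavior:

```python
def hoist_decision(knob381: int, pass_id: int) -> str:
    """Map (per-block knob 381 result, pass_id) to the action taken by
    sub_8FFDE0, per the simplified branches above."""
    if knob381 == 1:                                       # inner loops only
        return "hoist" if pass_id == 1 else "skip"
    if knob381 == 3:                                       # never
        return "conservative" if pass_id <= 1 else "skip"
    if knob381 == 0:                                       # always
        return "hoist_aggressively" if pass_id == 0 else "skip"
    return "default"

print(hoist_decision(0, 0))  # hoist_aggressively
print(hoist_decision(1, 1))  # hoist
print(hoist_decision(3, 2))  # skip
```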

Per-Block Knob 381 Policy

The LICM pass queries OCG knob 381 (sub_7A1A90 / sub_7A1B80) per basic block to determine the hoisting policy:

| Knob 381 Result | Meaning |
|-----------------|---------|
| 0 | Always hoist from this block |
| 1 | Hoist from inner loops only |
| 3 | Never hoist from this block |

This per-block granularity allows the knob system to selectively disable hoisting in specific loop nests (e.g., those known to be register-pressure-critical).

Guard Function (sub_7DDB50)

Before the LICM core runs, sub_7DDB50 (156 bytes) gates execution. The guard itself performs the first two checks below; the third is applied by the orchestrator to the guard's return value:

  1. Knob 499 enabled. Queries the allocator vtable at +72 for OCG knob 499 (the master LICM switch). If disabled, returns 1 which causes the orchestrator to bail (since 1 <= 2).
  2. Rate limiter. When knob 499 is enabled, the guard checks a pair of counters at allocator[9]+35936 (max invocations) and allocator[9]+35940 (current count). If the current count has reached the maximum, returns 1. Otherwise increments the counter and returns the actual basic block count from code_object+2104. This bounds the number of LICM invocations for compile-time control in functions with many loops.
  3. Block count > 2. The orchestrator (sub_8FFDE0) checks the return value: if <= 2, no hoisting is attempted. Single-block functions have no loops; two-block functions have at most a trivial loop not worth processing.
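
The guard-plus-orchestrator interaction can be sketched as follows. The counter state is passed explicitly here for clarity; in the binary it lives at allocator[9]+35936 and allocator[9]+35940:

```python
def licm_guard(knob499_on, count, max_count, block_count):
    """Sketch of sub_7DDB50: returns (value, new_count); any value <= 2
    makes the orchestrator skip hoisting."""
    if not knob499_on or count >= max_count:
        return 1, count                       # bail: disabled or rate-limited
    return block_count, count + 1             # proceed; bump invocation counter

def orchestrator_runs(guard_value):
    return guard_value > 2                    # sub_8FFDE0's block-count check

value, _ = licm_guard(True, 0, 8, 10)
print(orchestrator_runs(value))  # True
```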

LICM Invariant Detection Algorithm

The invariance detection pipeline runs inside sub_8FF2D0 (1,186 bytes), which is called from sub_8FF780 once per loop nest level. It executes five stages in sequence: budget computation, forward invariance marking, backward non-invariance marking, transitive propagation, and profitability gating.

Stage 1: Budget Computation (sub_8FF2D0)

function ComputeHoistBudget(context, block, is_simple, num_preds, hoist_mode, is_inner):
    // Base budget from knob 483 (HoistBudget)
    if QueryKnob(483):
        budget = QueryKnobValue(483)                 // 0 = unlimited
    else:
        budget = 10                                  // default

    // CBO budget from knob 482
    if QueryKnob(482):
        cbo_budget = QueryKnobValue(482)
    else:
        cbo_budget = (pass_id == 0) ? 22 : 100

    // Adjust by loop type and depth
    if pass_id > 0 and is_simple:
        budget = (hoist_mode < 2) ? cbo_budget : 300
    else if pass_id == 0 and is_simple:
        budget = (hoist_mode < 2) ? cbo_budget : 200

    // Conservative multiplier for Late3
    if pass_id == 3:
        budget *= 100                                // generous once decided to hoist

    // Split budget among back-edge blocks
    if hoist_mode == 3:                              // processing back-edge block
        budget /= num_preds

    // Inner-loop divisor from knob 380
    if is_inner:
        if QueryKnob(380):
            budget /= QueryKnobValue(380)
        else:
            budget /= 10

    return budget
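
A worked instance of the budget computation, using the defaults stated in the pseudocode above (the condensed helper signature is illustrative):

```python
def hoist_budget(pass_id, is_simple, hoist_mode, is_inner,
                 num_preds=1, knob483=None, knob482=None, knob380=None):
    """Condensed form of the budget stages above, with the stated defaults."""
    budget = knob483 if knob483 is not None else 10
    cbo = knob482 if knob482 is not None else (22 if pass_id == 0 else 100)
    if is_simple:
        cap = 200 if pass_id == 0 else 300
        budget = cbo if hoist_mode < 2 else cap
    if pass_id == 3:
        budget *= 100                         # Late3: generous once committed
    if hoist_mode == 3:
        budget //= num_preds                  # split among back-edge blocks
    if is_inner:
        budget //= knob380 if knob380 else 10
    return budget

# Late3, simple loop, normal hoist mode: cbo default 100, scaled by 100
print(hoist_budget(pass_id=3, is_simple=True, hoist_mode=0, is_inner=False))  # 10000
```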

Stage 2: Forward Invariance Marking (sub_8FEAC0, a3=1)

The forward pass iterates every instruction in the basic block and marks each register operand's invariance status based on where it was defined.

function MarkInvariants_Forward(context, block_index):
    block = blocks[block_index]
    header_depth = context.header_depth
    loop_depth_range = [context.header_depth, context.max_depth]

    // Two code paths based on knob 934 (UseNewLoopInvariantRoutineForHoisting)
    if QueryKnob(934):
        // Advanced path: set-based computation via sub_768BF0 + sub_8F7280
        return MarkInvariants_SetBased(context, block_index)

    // Default path: single-pass scan
    for each instruction in block (linked list: block+0 .. sentinel at block+8):
        has_side_effect = isNoOp(instruction)            // vtable+1824
        opcode = instruction+72 (masked: BYTE1 &= 0xCF)
        num_operands = instruction+80

        // Special case: opcode 195 + first dst is reg class 9 (predicate)
        is_predicate_def = (opcode == 195 and dst_reg.class == 9)

        is_invariant = true
        for each operand from LAST to FIRST:             // reverse scan
            operand = instruction + 84 + 8*i
            type = (operand >> 28) & 7

            if type != 1:                                // not a register
                continue                                 // immediates are always invariant

            if IsFixedRegister(operand, code_object):    // sub_7DEB90
                continue                                 // e.g., RZ, PT — always available

            if pass_id == 3:                             // Late3 extra check
                if IsSpecialRegClass(operand, code_object):  // sub_7DA2F0
                    // Exception: IADD3 (opcode 130, flag 0x1000) carry-out
                    if not (opcode == 130 and flag_0x1000 and is_penultimate_operand):
                        continue

            reg = RegisterDescriptor(code_object, operand & 0xFFFFFF)

            if reg.def_block (reg+76) == block_index:
                // Defined in THIS block — not invariant for this loop
                is_invariant = false
            else if context.is_multi_depth:
                def_instr = reg.def_instruction (reg+56)
                if def_instr is null or reg has pinned bit:
                    handle_predicate_invariance()
                else:
                    def_block = blocks[def_instr.block_index]
                    def_depth = def_block.loop_depth (offset +144)
                    if def_depth < header_depth or def_depth > max_depth:
                        reg.use_count (reg+80) = 0       // mark as loop-external
                    else:
                        is_invariant = false
                        reg.def_block (reg+76) = block_index
            else:
                reg.use_count (reg+80) = 0               // simple loop: mark external

        // Side-effect check for the entire instruction
        flags = LookupOpcodeFlags(instruction, code_object)  // sub_7DF3A0
        if (flags & 2) != 0:                             // has memory/control side effect
            is_invariant = false

        if MemoryOverlapsLoopLiveSet(instruction):       // sub_74F5E0
            is_invariant = false

        if is_multi_depth and HasObservableSideEffects(instruction):  // sub_7E0540
            is_invariant = false

        // Mark destination operands
        for each dst_operand (bit 31 set = definition):
            if type == 1 and not pinned:
                if is_invariant:
                    reg.def_block = block_index           // mark for hoisting
                else:
                    reg.use_count += 1                    // count loop-internal uses

The key insight is that invariance is determined by definition site: if every source register was defined outside the loop (or in a block already processed), the instruction is invariant. Immediates and constants are trivially invariant. The check is not purely structural -- it uses the reg+76 field which gets updated as hoisting proceeds, allowing transitive invariance discovery.
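The transitive mechanism can be modeled as a fixed-point iteration over definition sites (an illustrative Python sketch; the dict-based instruction shape and names are ours, not the binary's):

```python
# Minimal model of definition-site invariance discovery (illustrative,
# not the binary's actual data structures). An instruction is invariant
# when every register source is defined outside the loop; hoisting it
# moves its own definition "outside", enabling further hoists.
def find_invariants(loop_insns, defined_outside):
    outside = set(defined_outside)     # regs defined before the loop
    invariant = []
    changed = True
    while changed:                     # iterate to a fixed point
        changed = False
        for insn in loop_insns:
            if insn["dst"] in outside:
                continue               # already discovered
            if all(src in outside for src in insn["srcs"]):
                invariant.append(insn["dst"])
                outside.add(insn["dst"])   # transitive discovery
                changed = True
    return invariant

loop = [
    {"dst": "r3", "srcs": ["r1", "r2"]},   # r1, r2 defined pre-loop
    {"dst": "r4", "srcs": ["r3"]},         # invariant transitively via r3
    {"dst": "r5", "srcs": ["r4", "r9"]},   # r9 is loop-variant
]
print(find_invariants(loop, {"r1", "r2"}))  # ['r3', 'r4']
```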

Stage 3: Backward Non-Invariance Marking (sub_8FEAC0, a3=0)

The backward pass uses the same function with a3=0. Instead of marking definitions as external, it marks operands whose definitions are inside the loop as non-invariant by setting reg.def_block = block_index. This clears any false positives from the forward pass where a register appeared invariant but its defining instruction depends on a loop-variant value.

For destination operands, the backward pass increments reg.use_count for all non-pinned register definitions, building the use-count information needed by the profitability check.

Stage 4: Transitive Invariance Propagation (sub_8F7DD0)

After the two marking passes, sub_8F7DD0 propagates invariance transitively through the instruction chain. This handles the case where instruction A is invariant and defines register R, and instruction B uses R and is otherwise invariant -- the forward pass may have marked B as non-invariant because R's definition was in the loop, but A (the definer) is itself invariant.

function PropagateInvariance(context, block_index):
    block = blocks[block_index]
    side_effect_mask = 0

    for each instruction in block:
        aliases_memory = CheckMemoryAlias(code_object, instruction)  // sub_74F5E0

        for each operand (type == 1, register):
            reg = RegisterDescriptor(operand)

            if operand is definition (bit 31 set):
                if isNoOp(instruction):
                    if IsInvariant(instruction, block_index):      // sub_8F76E0
                        side_effect_mask |= reg.flags & 0x3
                    else:
                        reg.flags |= aliases_memory ? 1 : 0
                else:
                    reg.flags |= (has_side_effect ? 1 : 0) | 2
            else:  // use
                if has_side_effect:
                    reg.def_block = block_index            // taint defining register
                else:
                    reg.use_count += 1

    return side_effect_mask

Stage 5: Profitability Check (sub_8F8CB0)

The final gate before hoisting. Computes a cost-benefit ratio and rejects hoisting if the ratio is unfavorable.

function IsProfitable(context, block_index, budget, is_hoist_safe):
    header_weight = context.header_insn_count            // from sub_8F8BC0
    body_weight = context.body_insn_count

    // Scoring weights depend on pass aggressiveness and safety
    if is_hoist_safe:
        noOp_weight = (pass_id == 0) ? 60 : 150
        real_weight = 5
    else:
        noOp_weight = (pass_id == 0) ? 12 : 30
        real_weight = 1

    score = 0
    latency_penalty = 0
    instruction_count = 0

    for each instruction in block:
        instruction_count += 1
        if IsInvariant(instruction, block_index):        // sub_8F76E0
            if isNoOp(instruction):
                score += noOp_weight
            else:
                score += 1
                for each dst_operand with scoreboard flag:
                    score += real_weight
                    latency = GetLatencyClass(instruction)  // sub_91E860
                    latency_penalty += (latency > 4) ? 2 : 1
        else:
            for each high-latency dst_operand:
                latency_penalty += (latency > 4) ? 2 : 1

    // Final decision: weighted score vs latency cost
    if pass_id == 0:                                     // aggressive
        denominator = real_weight * instruction_count
    else:
        denominator = body_weight / 3 + header_weight

    return denominator != 0 and (score * budget) / (real_weight * denominator) >= latency_penalty

The profitability check encodes a fundamental GPU tradeoff: hoisting reduces dynamic instruction count (proportional to trip count) but extends live ranges (increasing register pressure and reducing occupancy). The budget parameter, which varies by 100x between pass_id 0 and 3, controls how aggressively this tradeoff is resolved. Pass_id 0 (Early) uses the smallest denominator, making it easiest to exceed the threshold.
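Under the recovered weights, the stage-5 decision reduces to a small arithmetic check. A hedged Python sketch (score is assumed precomputed by the scoring loop above; parameter names are descriptive, not recovered):

```python
# Sketch of the stage-5 decision from the pseudocode above. The weight
# constants (real_weight 5 vs 1) come from the recovered code; the
# function shape and names are illustrative.
def is_profitable(pass_id, is_hoist_safe, score, latency_penalty,
                  instruction_count, header_weight, body_weight, budget):
    real_weight = 5 if is_hoist_safe else 1
    if pass_id == 0:                       # aggressive early pass
        denominator = real_weight * instruction_count
    else:
        denominator = body_weight // 3 + header_weight
    if denominator == 0:
        return False
    return (score * budget) // (real_weight * denominator) >= latency_penalty

# Pass 0's small denominator clears the bar more easily than pass 3's:
print(is_profitable(0, True, 100, 4, 10, 0, 0, budget=22))    # True
print(is_profitable(3, True, 100, 4, 10, 90, 300, budget=22)) # False
```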

Per-Instruction Invariance Test (sub_8F76E0)

The leaf-level invariance test used by stages 4 and 5 is a simple definition-site check:

function IsInvariant(instruction, current_block_index):
    num_operands = instruction.operand_count             // inst+80
    if num_operands == 0:
        return false

    // Find the last "interesting" operand (skip immediates/constants)
    // Immediates have type bits in the 0x70000000 range
    last_operand = scan backwards from operand[num_operands-1]
                   while (operand XOR 0x70000000) & 0x70000000 == 0

    // Check: is this a register definition outside the current block?
    if last_operand is negative (bit 31 = definition)
       and type_bits == 1 (register)
       and not pinned (byte+7 bit 0 == 0):
        reg = RegisterDescriptor(last_operand & 0xFFFFFF)
        return reg.def_block (reg+76) != current_block_index

    return false

This is the most-called function in the LICM pipeline. It checks whether an instruction's primary output register was defined outside the current block -- if so, the instruction is considered invariant (already hoisted or defined in a dominating block).
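The bit fields implied by this test (bit 31 = definition, bits 28-30 = type tag, bits 0-23 = register index) can be decoded with a few shifts; a small Python sketch (field names are ours, not NVIDIA's):

```python
# Decode a 32-bit Ori operand word using the masks seen throughout the
# LICM code: bit 31 = definition flag, bits 28-30 = type tag
# (1 = register), bits 0-23 = register index.
def decode_operand(word):
    return {
        "is_def":  bool(word & 0x80000000),
        "type":    (word >> 28) & 7,
        "reg_idx": word & 0xFFFFFF,
    }

op = 0x80000000 | (1 << 28) | 0x42   # register definition of reg 0x42
print(decode_operand(op))   # {'is_def': True, 'type': 1, 'reg_idx': 66}
```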

Side-Effect Blocking Rules

An instruction is blocked from hoisting if any of the following conditions hold, regardless of operand invariance:

Check | Function | Condition
Memory store | sub_7DF3A0 | Flags byte bits 2-3 set and bit 5 clear
Memory barrier | sub_74D720 | Opcode 159 (BAR.SYNC), 32 (MEMBAR), or 271 (barrier variant)
Indirect jump | sub_74D720 | Opcode 236 (BRX)
Volatile/atomic access | sub_7DFA80 | Called from sub_7E0540; detects volatile or atomic memory
Function call | vtable+1456 | isBarrier() returns true
Texture side effect | sub_7DF3A0 | Flags byte bit 6 set with operand modifier flag
Address-space effect | sub_7E0540 | Opcodes 85/109 (memory ops) with (flags+20 & 2) != 0

The boundary analysis (sub_74D720) also produces a 5-byte result array that gates the entire loop:

Byte | Meaning | Effect
0 | Has external predecessor (outside loop depth range) | Skip loop (not a natural loop)
1 | Non-header block with different nesting | Marks as complex multi-depth loop
2 | Contains barrier instruction | Skip loop entirely
3 | Contains indirect jump | Skip loop entirely
4 | Multi-depth safety flag | AND-ed with sub_7E5120 per inner block

Instruction Counting (sub_8F8BC0)

Before the profitability check, sub_8F8BC0 counts instructions in the loop header and body separately. It walks the instruction linked list for each block in the loop and classifies each instruction using isNoOp (vtable+1824):

  • No-op instruction (scheduling placeholder, predicate set, etc.): weight 1
  • Real instruction (ALU, memory, branch, etc.): weight 30

The header count is stored at context+64 and the body count at context+68. The profitability formula uses these to normalize the hoisting score: a loop with a heavy header relative to the body benefits less from hoisting.
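A minimal sketch of this weighting, using the recovered constants 1 and 30 (the classifier itself is illustrative):

```python
# Sketch of the header/body weighting in sub_8F8BC0: no-ops weigh 1,
# real instructions weigh 30 (recovered constants; the dict-based
# instruction shape here is ours).
NOOP_WEIGHT, REAL_WEIGHT = 1, 30

def count_weight(insns):
    return sum(NOOP_WEIGHT if i["noop"] else REAL_WEIGHT for i in insns)

header = [{"noop": True}, {"noop": False}]        # 1 + 30
body   = [{"noop": False}] * 4                    # 4 * 30
print(count_weight(header), count_weight(body))   # 31 120
```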

Instruction Movement (sub_8F7AE0)

After all checks pass, sub_8F7AE0 physically moves each invariant instruction from the loop body to the preheader:

  1. Invariance re-check. Calls sub_8F76E0 one final time per instruction. Instructions whose invariance status changed during the marking passes are skipped.
  2. Knob 484 gate. Queries the allocator for knob 484; if disabled, no movement occurs. This provides a fine-grained override separate from the loop-level knob 381.
  3. Preheader creation. On the first hoisted instruction, creates or locates the preheader block:
    • If the loop has an existing preheader block (context+16 non-null): clones it via sub_931920, copies convergence flags from the original preheader's offset+282 bit 3, and links it into the CFG via sub_8F7610.
    • If no preheader exists: creates a new block via sub_92E1F0 and links it.
  4. Unlink and reinsert. For each invariant instruction:
    • sub_9253C0(code_object, instruction, 1): unlinks the instruction from the current block.
    • sub_91E290(code_object, instruction): inserts at the preheader insertion point.
    • Updates the Ori instruction's control word at instruction+32 (not the SchedNode): sets bit 1 at byte offset +13 to mark the instruction as hoisted (prevents the scheduler from reordering it back into the loop).
  5. Destination register tracking. For each output operand, if the defining instruction at reg+56 differs from the current instruction, sets context+44 (hoisted_cbo flag). For pass_id == 2, additionally sets reg+48 bit 26 if the register class is in {2, 3, 4} (GPR classes) and the preheader has the convergence flag.
  6. Special IADD3 handling. For pass_id == 3, instructions with opcode 130 (IADD3), flag 0x1000, and a negative byte at +90 (carry chain) receive special treatment via sub_9232B0 which adjusts the carry-out register linkage before movement.

Multi-Depth Loop Handling

For loops with nesting depth > 1 (inner loops within the hoisting target), sub_8FF780 performs multiple rounds of sub_8FF2D0 calls:

  1. Header block. First call processes the loop header with hoist_mode = 0.
  2. Intermediate blocks. For each depth level between header_depth+1 and max_depth, checks if the block's parent depth (block+148) matches the header depth. If the block is a back-edge predecessor of the loop header, uses hoist_mode = 3. Otherwise, checks a dominance bitmap at block[25] + 4*(depth >> 5): if bit (1 << depth) is set, uses hoist_mode = 1 (dominated); otherwise hoist_mode = 2 (non-dominated).
  3. Back-edge block. Final call with hoist_mode = 3 and the deepest back-edge block index, ensuring the budget is split among back-edge predecessors.

Multi-depth permission is gated by knob 220 (queried at allocator[9]+15840 for the fast path) and the DisableNestedHoist knob. When hoisting from an inner loop to the header of an outer loop, an additional constraint applies:

allow_nested = allow_nested_hoist AND is_simple_loop
               AND body_insn_count > 1
               AND num_predecessors == 1
               AND body_insn_count < header_insn_count * max_iterations

This prevents hoisting from inner loops where the cost (extended live range across multiple loop levels) exceeds the benefit (reduced inner-loop dynamic count).
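The gate transcribes to a single predicate (parameter names are descriptive; the condition itself is the recovered one):

```python
# Direct transcription of the recovered nested-hoist gate.
def allow_nested_hoist(allow_nested_knob, is_simple_loop, body_insn_count,
                       num_predecessors, header_insn_count, max_iterations):
    return (allow_nested_knob and is_simple_loop
            and body_insn_count > 1
            and num_predecessors == 1
            and body_insn_count < header_insn_count * max_iterations)

# A small inner body below the header*iterations threshold qualifies:
print(allow_nested_hoist(True, True, 8, 1, 5, 2))   # True  (8 < 10)
print(allow_nested_hoist(True, True, 12, 1, 5, 2))  # False (12 >= 10)
```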

LICM Outer Loop (sub_8FF780)

The complete outer driver that iterates over all loop nests:

function HoistInvariantsCore(context):
    code_object = context.code_object
    pass_id = context.pass_id

    // Read iteration limit from allocator+34632
    config_byte = allocator[34632]
    max_iterations = (config_byte == 0) ? 2
                   : (config_byte == 1) ? allocator[34640]
                   : 0                                   // unlimited

    allow_nested_hoist = (allocator[20016] != 0)

    // Prepare IR
    RebuildInstructionList(code_object, 1)               // sub_781F80
    RecomputeRegisterPressure(code_object, 1, 0, 0, 0)  // sub_7E6090
    RecomputeLoopDepths(code_object, 0)                  // sub_773140

    if code_object.flags[176] & 2 and pass_id > 1:
        RecomputeLoopNesting(code_object)                // sub_789280

    // Clear prior invariance markers
    for each block in instruction list:
        block.marker (offset +76) = 0xFFFFFFFF

    // Iterate from innermost loop outward (last RPO entry first)
    current = blocks[rpo[block_count]]

    while current is valid:
        if current has no predecessors or no first instruction:
            advance; continue

        // Count predecessors at >= current loop depth
        header_depth = current.loop_depth                // offset +144
        for each predecessor:
            if pred.loop_depth >= header_depth:
                num_at_depth++; track deepest back-edge index

        if num_at_depth == 0:                            // not a loop header
            advance; continue

        // Simple vs multi-depth
        if max_depth == header_depth:
            is_simple = true
        else:
            info = AnalyzeBoundaries(code_object, header_depth, max_depth)
            if has_external_pred or has_barrier or has_indirect_jump:
                advance; continue
            if !MultiDepthAllowed(knob_220):
                advance; continue
            context.is_multi_depth = true

        // Find preheader and query knob 381
        context.insert_pt = FindPreheader(code_object, current, ...)
        if !ShouldHoist(QueryKnob381(381, current), pass_id, opt_level):
            advance; continue

        // Count instruction weights
        CountInstructions(context)                       // sub_8F8BC0

        // === CORE HOISTING PIPELINE (per loop) ===
        sub_8FF2D0(context, header_block, ...)           // header block

        if context.is_multi_depth:
            for depth in (header_depth+1 .. max_depth-1):
                sub_8FF2D0(context, block_at_depth, ..., hoist_mode, ...)
            sub_8FF2D0(context, back_edge_block, ..., 3, ...)  // back-edge

        // Post-hoist cleanup
        if context.changed and current.num_back_edge_successors > 1:
            RebuildInstructionList(code_object, 0)

        advance to next loop

Hoisting Knobs

Knob Name | Type | Default | Description
HoistBudget | FLOAT | 10 | Maximum number of instructions to hoist per loop (0 = unlimited)
HoistLoopInvBudget | FLOAT | 22 (early) / 100 (late) | Budget specifically for loop-invariant hoisting; pass_id 0 uses 22, pass_id > 0 uses 100
HoistConservativeScale | INT | 10 (divisor) | Inner-loop budget divisor; budget /= scale when hoisting from inner loops
HoistLate | INT | per-block policy | Per-block hoisting policy (0=always, 1=inner only, 3=never)
HoistCBOMode | INT | 0 | Constant-buffer-object hoisting mode
HoistCBOLoad | INT | unset | Enable hoisting of CBO load instructions
HoistCBOFromLoopWithColdNest | INT | 1 (enabled) | Hoist CBO loads even from loops with cold nesting
HoistCBOHighCostSBInstRatioThreshold | INT | unset | Scoreboard cost threshold for CBO hoisting
HoistCBOLoadIDOMTravseLimit | INT | 4 | IDOM traversal limit for CBO load hoisting
HoistCBORRegPressureLimitApplyRate | INT | 80 | R-register pressure limit application rate (percentage)
HoistTexToInstRatioHigh | DBL | 0.045 | High texture-to-instruction ratio threshold for aggressive hoisting
HoistTexToInstRatioLow | DBL | 0.03 | Low texture-to-instruction ratio threshold for conservative hoisting
DisableNestedHoist | BOOL | false | Disable hoisting from nested loops (false = nested hoisting allowed)
NestedHoistInnerThreshold | INT | 22 / 100 | Inner loop instruction threshold for nested hoisting (same value as HoistLoopInvBudget)
NestedHoistOuterThreshold | INT | 10 | Outer loop instruction threshold for nested hoisting (same value as HoistBudget)
UseNewLoopInvariantRoutineForHoisting | BOOL | false | Use updated set-based invariance check routine (legacy single-pass is default)
MaxMidHeaderSizeRateForAggressiveHoist | INT | 2 | Maximum LICM iteration count (limits repeated hoisting passes)
EnableHoistLowLatencyInstMidBlock | BOOL | false | Hoist low-latency instructions from mid-block positions
MovWeightForSinkingHoisting | DBL | 0.25 | Weight for MOV instructions in sink/hoist decisions

GPU-Specific LICM Concerns

Constant buffer loads. GPU shaders frequently load from constant buffers (LDC). These loads are loop-invariant by definition (the buffer is read-only during kernel execution). The HoistCBO* knobs control a specialized path that aggressively hoists these loads, trading register pressure for reduced memory traffic.

Register pressure vs. occupancy. Every hoisted instruction extends its live range from the preheader through the entire loop. On GPUs, this directly reduces occupancy. The four LICM passes use increasingly conservative heuristics (controlled by pass_id) to avoid excessive register growth in later pipeline stages where register allocation is imminent.

Texture instruction hoisting. Texture fetches (TEX, TLD, TLD4) are high-latency and loop-invariant when their coordinates are loop-invariant. The HoistTexToInstRatio* knobs provide thresholds for deciding when to hoist texture instructions -- a tradeoff between reducing loop body latency and increasing preheader register pressure.
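As a rough illustration, the two thresholds could partition the policy space like this (the threshold values are the knob defaults; the three-way decision structure is our assumption, not recovered control flow):

```python
# Hypothetical use of the two HoistTexToInstRatio* knob defaults to pick
# a texture-hoisting policy. The decision structure is our assumption;
# only the 0.045 / 0.03 values come from the knob table.
HIGH = 0.045   # HoistTexToInstRatioHigh
LOW  = 0.03    # HoistTexToInstRatioLow

def tex_hoist_policy(tex_count, inst_count):
    ratio = tex_count / inst_count
    if ratio >= HIGH:
        return "aggressive"
    if ratio >= LOW:
        return "conservative"
    return "off"

print(tex_hoist_policy(5, 100))   # aggressive   (0.05 >= 0.045)
print(tex_hoist_policy(4, 100))   # conservative (0.03 <= 0.04 < 0.045)
print(tex_hoist_policy(1, 100))   # off
```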


Phase 59 -- OriLoopFusion

Purpose

Fuses adjacent loops with compatible bounds and no inter-loop data dependencies into a single loop. This reduces loop overhead (branch, induction variable update) and creates opportunities for the scheduler to overlap instructions from the formerly separate loop bodies.

Knobs

Knob Name | Type | Default | Description
PerformLoopFusion | INT | 0 (disabled) | Master enable for loop fusion; must be explicitly set to a nonzero value
PerformLoopFusionBudget | FLOAT | unset | Maximum instruction count in fused body

Fusion Criteria

Two adjacent loops L1 followed by L2 are candidates for fusion when:

  1. Same trip count. Both loops iterate the same number of times (same induction variable bounds and stride, or equivalent after normalization).
  2. No violated inter-loop dependencies. No flow dependence (write in L1, read in L2) that crosses iteration boundaries differently after fusion. Since both loops are sequential pre-fusion, this reduces to: L2 must not read a value written by L1 at a different iteration index.
  3. Compatible loop structure. Both must be single-basic-block bodies (or the fused body must remain within the PerformLoopFusionBudget instruction limit).
  4. No intervening barriers. No BAR.SYNC, MEMBAR, or fence instructions between the two loop bodies.
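When all four criteria hold, fusing is semantics-preserving. A minimal model in Python (plain arrays standing in for SASS loops):

```python
# Minimal model of legal loop fusion: same trip count and no
# cross-iteration dependence between the two bodies, so interleaving
# the iterations is safe.
def separate(a, b, n):
    c = [0] * n
    for i in range(n):            # loop L1
        c[i] = a[i] + b[i]
    d = [0] * n
    for i in range(n):            # loop L2: reads only c[i] from L1
        d[i] = c[i] * 2
    return d

def fused(a, b, n):
    d = [0] * n
    for i in range(n):            # one branch, one IV update per iteration
        c_i = a[i] + b[i]         # L1 body
        d[i] = c_i * 2            # L2 body reads the same iteration's value
    return d

a, b = [1, 2, 3], [10, 20, 30]
print(fused(a, b, 3))             # [22, 44, 66], same as separate(a, b, 3)
```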

Pipeline Position Rationale

Phase 59 runs after GeneralOptimizeLate (phase 58) and before predication (phase 63). This position is chosen because:

  • Late expansion (phase 55) may have split a single operation into a pair of loops (e.g., an atomic-reduce pattern becomes a compare loop followed by an exchange loop).
  • After fusion, the merged loop body gives predication (phase 63) a larger basic block to work with, improving if-conversion opportunities.
  • The subsequent LICM (phase 66) can hoist invariants from the fused loop that were not hoistable from either original loop individually (because they appeared in the "between-loops" region).

Loop Infrastructure Functions

Several utility functions are shared across the loop passes:

Function | Address | Size | Purpose
sub_781F80 | 0x781F80 | -- | Rebuild instruction linked list after CFG modification
sub_789280 | 0x789280 | -- | Recompute loop nesting depths (called when flags[176] & 2 set)
sub_773140 | 0x773140 | -- | Recompute register pressure estimates
sub_7E6090 | 0x7E6090 | 2,614 | Create complex multi-operand instruction (used in unroll body duplication)
sub_7753F0 | 0x7753F0 | -- | Loop peeling setup (splits first/last iterations)
sub_789BE0 | 0x789BE0 | -- | Back-edge canonicalization
sub_74D720 | 0x74D720 | -- | Loop boundary analysis (determines header, latch, exit)
sub_74F500 | 0x74F500 | -- | Find preheader block for a given loop
sub_9253C0 | 0x9253C0 | -- | Edge splitting / preheader block insertion
sub_7A1A90 | 0x7A1A90 | -- | OCG knob query (boolean)
sub_7A1B80 | 0x7A1B80 | -- | OCG knob query (multi-valued)
sub_799250 | 0x799250 | -- | Named-phase DUMPIR check (string match against phase name)
sub_A112C0 | 0xA112C0 | -- | Trigger sub-analysis re-run (liveness, CFG refresh)
sub_BDEA50 | 0xBDEA50 | -- | Back-edge information printer (bix%d -> backedge's successor BB: %d)

Phase | Name | Relationship
3 | AnalyzeControlFlow | Builds the CFG, identifies loops, computes dominators -- prerequisite for all loop passes
19 | OriSplitLiveRanges | Splits live ranges at loop boundaries to reduce register pressure post-simplification
20 | PerformPGO | Applies profile data that informs unrolling and pipelining heuristics
21 | OriStrengthReduce | Reduces induction variable strength before unrolling
23 | GenerateMovPhi | Inserts SSA phi nodes after unrolling changes the CFG
25 | StageAndFence | Inserts memory fences needed by pipelined loops
56 | SpeculativeHoistComInsts | Speculatively hoists common instructions above branches (related to LICM)
108 | OptimizeHotColdInLoop | Post-RA hot/cold partitioning within loop bodies
138 | OriSplitHighPressureLiveRanges | Last-resort splitter when unrolling or LICM caused excessive register pressure

Cross-References

Strength Reduction

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Phase 21 (OriStrengthReduce) replaces expensive arithmetic operations with cheaper equivalents in the Ori IR. It runs early in the optimization pipeline -- after loop simplification (phase 18) and live range splitting (phase 19), but before loop unrolling (phase 22) and software pipelining (phase 24). This placement is deliberate: strength reduction benefits from canonicalized loop structure and benefits subsequent loop transformations by simplifying induction variable expressions.

Strength reduction in ptxas is not a single monolithic pass. It is distributed across three layers, each operating at a different abstraction level:

  1. Phase 21 (OriStrengthReduce) -- Ori-level induction variable strength reduction on the use-def graph
  2. Peephole patterns -- SASS-level algebraic simplifications in the MainPeepholeOptimizer (sub_83EF00)
  3. Division lowering templates -- Newton-Raphson integer division sequences emitted during instruction selection
Phase index | 21
Phase name | OriStrengthReduce
Category | Optimization
Pipeline position | Stage 2 (Early Optimization), between PGO (phase 20) and loop unrolling (phase 22)
Vtable address | off_22BD910
execute() | sub_C5FB30 (wrapper) -> sub_752E40 (core logic, 359 lines decompiled, ~1.2 KB binary)
isNoOp() | sub_C5F3D0 -- returns 0 (always runs)
getName() | sub_C5F3C0 -- returns 21
Gate knob | 487 (general optimization enablement)
Key helpers | sub_745A80 (replacement register creator), sub_91BF30 (virtual register allocator), sub_A13890 (use-def chain iterator), sub_9253C0 (instruction deleter)
Peephole SHR+SHL->BFE | sub_81DB30 (matcher: sub_81D7E0)
Peephole BFE+ADD | sub_81DDD0 (matcher: sub_81DBC0)
Division templates | sub_1724A20 (32-bit, 28 KB), sub_1728930 (64-bit unsigned, 16.5 KB), sub_1727AC0 (64-bit signed, 13.7 KB)

Phase 21: Induction Variable Strength Reduction

Architecture

The execute wrapper (sub_C5FB30) gates on multi-function compilation (function count > 1 via sub_7DDB50) and delegates to sub_752E40 with parameters (context, 0, 0, 0).

sub_752E40 is the core. It performs a single-pass walk over the instruction list, focusing on one intermediate opcode -- opcode 137 (SM73_FIRST), read from the opcode field at instruction offset +72 and masked with 0xFFFFCFFF to strip modifier bits. The name SM73_FIRST (recovered via ROT13 decoding) is a generation-boundary marker, but the Ori IR reuses this opcode slot at runtime for IMAD-like multiply-accumulate instructions in their pre-lowered form. The actual SASS IMAD is opcode 1.

Algorithm

The pass executes in two phases within a single call:

Phase 1 -- Trivial multiply elimination. The first loop walks the instruction list (*(context+272) is the list head). For each instruction with masked opcode == 137 (SM73_FIRST; IMAD-like):

  1. Check if the destination register (operand at +84) has no uses (*(def+56) == NULL) AND the source chain is empty (*src_chain == NULL). If both hold, delete the instruction via sub_9253C0 -- it is dead.
  2. Otherwise, for each source operand (iterating from operand count - 1 down to 0):
    • Check operand type: must be register ((operand >> 28) & 7 == 1)
    • Look up the register definition in the SSA value table (*(context+88) + 8 * (operand & 0xFFFFFF))
    • Check the definition has no special flags (*(def+48) & 0x400000022 == 0)
    • Check the register type is not 9 (predicate register)
    • Check the source operand's use chain is empty (single-use) and the def has no other users
    • If all conditions hold, call sub_91BF30 to allocate a replacement register with the same type, then patch the operand in place

Phase 2 -- Use-def chain traversal. The second loop walks the instruction list again. For each instruction with operands that have been marked (flag 0x100 at instruction[6], set during initialization):

  1. For each source operand with a use chain:
    • Compute the replacement register via sub_745A80(context, def, a4), which:
      • Allocates a new virtual register via sub_91BF30 with the same type as the original
      • Copies the data type field (+16) and relevant flags (0x40, 0x10, 0x8 bits of the flags word at +48)
      • Returns the new register ID
    • If the operand was not yet marked (flag 0x100 bit not set), initialize it and mark as "needs strength reduction"
    • Traverse the use chain as a worklist: for each user of the replaced register, check if its uses also need updating, growing the worklist dynamically (doubling allocation via pool allocator)
  2. Track how many source operands were rewritten (v72 counter)
  3. After processing all operands of an instruction: if the instruction is still opcode 137 (SM73_FIRST; IMAD-like) and certain conditions hold (destination matches source pattern, specific operand bit patterns), either delete it or convert it to opcode 130 / 0x82 (HSET2 in ROT13; used as an internal MOV-like marker -- actual SASS MOV is opcode 19)

The worklist traversal is the key algorithmic insight: when a multiply's result feeds into another multiply, the chain of strength reductions propagates transitively through the def-use graph.
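That propagation can be modeled as a breadth-first worklist over the def-use graph (illustrative Python; the real pass grows a pool-allocated array by doubling rather than using a deque):

```python
from collections import deque

# Illustrative model of the phase-2 worklist: rewriting one register
# renames every transitive user, and each renamed user's own users are
# appended to the worklist, so the rewrite propagates down the chain.
def propagate(replacements, users):
    # replacements: {old_reg: new_reg}; users: {reg: [regs defined from it]}
    worklist = deque(replacements)
    while worklist:
        reg = worklist.popleft()
        for user in users.get(reg, []):
            if user not in replacements:
                replacements[user] = user + "'"   # mark as rewritten
                worklist.append(user)             # visit its users too
    return replacements

# r1 feeds r2, r2 feeds r3 -- rewriting r1 propagates to both:
result = propagate({"r1": "r1'"}, {"r1": ["r2"], "r2": ["r3"]})
print(sorted(result))  # ['r1', 'r2', 'r3']
```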

Data Flow

sub_C5FB30 (execute wrapper)
  |
  +-- sub_7DDB50: check function count > 1
  |
  +-- sub_752E40 (core logic)
        |
        +-- sub_7468B0 / vtable+152: check knob 487 (optimization enabled)
        |
        +-- Phase 1: Walk instruction list (*(ctx+272))
        |     +-- For opcode 137 (`SM73_FIRST`; IMAD-like) instructions:
        |     |     +-- sub_9253C0: delete dead instructions
        |     |     +-- sub_91BF30: allocate replacement registers
        |     |
        |     +-- Clear flag 0x100 on all basic blocks (*(ctx+104) chain)
        |     +-- Set flag 0x40 at ctx+1385
        |
        +-- sub_A13890: initialize use-def traversal context
        |     +-- Creates context object with vtable off_21DBEF8
        |     +-- Sets up iterator with vtable off_21B4FD0
        |
        +-- Phase 2: Walk instruction list again
        |     +-- For each source operand with use chain:
        |     |     +-- sub_745A80: create replacement register
        |     |     +-- Worklist propagation through use chain
        |     |
        |     +-- Convert trivial IMAD to MOV (opcode 130 / `0x82`, `HSET2`; MOV-like)
        |     +-- sub_9253C0: delete fully reduced instructions
        |
        +-- sub_7B52B0: optional post-reduction scheduling pass
        |     (called if any replacements were made, flag v76)
        |
        +-- sub_8E3A20: destroy use-def context

Instruction Representation

The pass operates on the Ori IR instruction format. Relevant fields:

Offset | Field | Usage in this pass
+8 | next pointer | Instruction list traversal
+64 | source operand chain | Array of {use_chain_ptr, ...} per operand
+72 | opcode (DWORD) | Bits 0-11 = base opcode, bits 12-13 = modifier (masked with 0xCF)
+80 | operand count | Number of source operands
+84 | operand[0] | First source operand descriptor (bits 28-30 = type tag, bits 0-23 = register ID)
+92 | operand[1] | Second source operand
+100 | operand[2] | Third source operand (for IMAD: accumulator)

Operand type tags (bits 28-30):

  • 1 = register operand (index into SSA value table at *(context+88))
  • 2, 3 = immediate operand
  • 7 = special/predicate

Register definition structure (from SSA value table):

Offset | Field | Usage
+8 | register ID | Unique identifier
+16 | data type | Copied to replacement register
+20 | use count | Checked for single-use optimization
+28 | replacement ID | Set by sub_745A80 to point to strength-reduced version
+48 | flags (QWORD) | Bit 0x100 = "marked for strength reduction", bit 0x40 = volatile, bits 0x10/0x8 = scheduling hints
+56 | defining instruction | Pointer to the instruction that defines this register
+64 | register class | Type code (2/3/4 = integer GPR, 7 = special, 9 = predicate)

Peephole Strength Reduction Patterns

The MainPeepholeOptimizer (sub_83EF00, 29 KB, case 2 of the opcode switch) applies algebraic strength reduction patterns at the SASS instruction level. These run later in the pipeline than phase 21 and operate on concrete SASS opcodes rather than the pre-lowered intermediate form.

Pattern: SHR + SHL -> BFE (Bit-Field Extract)

Matcher: sub_81D7E0 (166 lines decompiled)
Emitter: sub_81DB30
Target opcodes: 290, 151, or 2 (various ALU forms) with operand size 11 or 12

Recognition:

  1. The instruction must have two register source operands (type tag 1), no modifier bits, no special flags
  2. Source operand 0's definition must be opcode 213 (SHL) or 214 (SHR)
  3. Source operand 1's definition must be the complementary shift (SHR if 0 was SHL, or vice versa)
  4. Both shift definitions must have immediate shift amounts (type tag 2 or 3)
  5. The shift amounts must sum to 32 (i.e., SHL(x, n) paired with SHR(x, 32-n))
  6. Both definitions must dominate the current instruction (sub_1245740 dominance check)
  7. Loop depth heuristic: if the shift definitions are in a shallower loop than the current instruction (checked via block RPO depth at *(block+156)), the transformation may be suppressed to avoid increasing register pressure

Transformation:

Before:  t1 = SHL(x, n)          ; opcode 213
         t2 = SHR(x, 32-n)       ; opcode 214
         r  = ALU(t1, t2)        ; opcode 290/151/2

After:   r  = BFE(x, ...)        ; opcode 210 (bit-field extract)

The emitter calls sub_9314F0 to create a BFE instruction (opcode 210) with the appropriate operands, then sub_9253C0 to delete the original ALU instruction.
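The reason a single instruction suffices: when the two shift amounts sum to 32, combining SHL(x, n) with SHR(x, 32-n) is exactly a 32-bit rotate, which one bit-field-extract or funnel-shift can express. A quick check in Python:

```python
# When the shift amounts sum to 32, OR-ing (or adding -- the halves do
# not overlap, so the results are equal) SHL(x, n) and SHR(x, 32-n) is
# a 32-bit rotate, so the pair collapses into a single extract/shift.
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def shl_shr_pair(x, n):
    t1 = (x << n) & MASK32         # opcode 213 (SHL)
    t2 = (x & MASK32) >> (32 - n)  # opcode 214 (SHR)
    return (t1 | t2) & MASK32      # the ALU combine being matched

x = 0xDEADBEEF
assert all(shl_shr_pair(x, n) == rotl32(x, n) for n in range(1, 32))
print(hex(shl_shr_pair(x, 8)))     # 0xadbeefde
```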

Pattern: BFE Folding into ADD

Matcher: sub_81DBC0 (83 lines decompiled)
Emitter: sub_81DDD0
Target opcode: 2 (IADD) with operand size 11 or 12

Recognition:

  1. One source operand is defined by opcode 210 (BFE)
  2. The BFE has no modifier bits, no special flags on the last operand
  3. The BFE's immediate operand (shift amount) is 1-31
  4. The BFE has a single use (use count <= 1)
  5. Dominance check passes

Transformation:

Before:  t = BFE(x, amount)      ; opcode 210
         r = IADD(t, y)          ; opcode 2

After:   r = LOP3/SHF(x, y, ...) ; opcode 102, combining shift+add

The emitter creates opcode 102 (a combined shift-and-add operation) with encoded shift amount (8 * amount | 0x60000002).

Integer Division Lowering

Integer division and modulo by non-constant values are lowered to multi-instruction sequences during instruction selection. This is not part of phase 21 but is the most visible strength reduction in ptxas output. The sequences use the classic Barrett reduction / Newton-Raphson reciprocal algorithm.

32-bit Division -- sub_1724A20

Size: 28,138 bytes decompiled (the largest function in the 0x1723000-0x17F8000 ISA description range)
Called from: sub_1727130 (template driver)
Instruction count: ~35 SASS instructions emitted

Algorithm (unsigned 32-bit a / b):

  1. Convert divisor to float: I2F(b) (opcode 0xD5)
  2. Compute approximate reciprocal via MUFU.RCP (opcode 0x3C)
  3. Convert back to integer: F2I(1/b) (opcode 0xD6)
  4. Refine via multiply-add: IMAD(q, b, ...) (opcode 0x6E)
  5. Correction step with conditional branch: ISETP + BRA (opcodes 0xC9, 0x5F)
  6. Final adjustment via IADD (opcode 0x02)

Key constants allocated via sub_91D160:

  • 23 (float exponent bias for mantissa extraction)
  • 255 (exponent mask)
  • 127 (IEEE 754 single-precision bias)
  • 254 (double-bias for overflow guard)
  • 1, -1 (correction increments)

The temporary register pool uses indices 90-126 from a parameter array (a7[]), providing 37 dedicated scratch registers for the sequence.
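The six steps map onto the following host-side sketch. Python doubles stand in for the MUFU.RCP approximation, and the correction loop plays the role of the ISETP/BRA/IADD fix-up; names are descriptive, not from the binary, and b must be non-zero:

```python
def udiv32(a, b):
    # Steps 1-3: I2F -> MUFU.RCP -> F2I, approximate quotient via reciprocal
    q = int(a * (1.0 / float(b)))
    # Step 4: IMAD-style remainder computation
    r = a - q * b
    # Steps 5-6: correction branches, final IADD adjustment (+1 / -1 constants)
    while r < 0:
        q, r = q - 1, r + b
    while r >= b:
        q, r = q + 1, r - b
    return q
```

Because the double-precision reciprocal is far more accurate than MUFU.RCP, the correction loop here rarely iterates more than once; the hardware sequence needs the same fix-up for the same reason.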

64-bit Division

Two variants handle 64-bit operands:

  • sub_1728930 (16,545 bytes): unsigned 64-bit division. Emits longer sequences with double-width multiply and carry propagation.
  • sub_1727AC0 (13,776 bytes): signed 64-bit division. Parallel structure with sign-handling logic.

Both are called from sub_1729B50.

Division by Constant

Division by compile-time constant is handled separately during constant folding (in the GeneralOptimize bundle passes). The classic magic-number multiplication technique (Granlund-Montgomery) converts x / C into MULHI(x, magic) >> shift, avoiding the Newton-Raphson sequence entirely. This produces 2-3 instructions instead of ~35.
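A minimal sketch of the round-up variant of the magic-number technique follows. The production algorithm also handles the case where the magic constant overflows 32 bits (requiring an extra add after the MULHI); that refinement is omitted here, and Python's arbitrary-precision integers hide the MULHI split:

```python
def magic_u32(d):
    # Granlund-Montgomery round-up magic for unsigned 32-bit division by d
    assert d >= 1
    s = (d - 1).bit_length()            # smallest s with 2^s >= d
    m = ((1 << (32 + s)) // d) + 1      # ceil(2^(32+s) / d) for non-power-of-2 d
    return m, s

def div_by_const(x, d):
    m, s = magic_u32(d)
    return (x * m) >> (32 + s)          # MULHI(x, m) followed by >> s on hardware
```

The correctness condition is that m*d exceeds 2^(32+s) by at most 2^s, which the choice s = ceil(log2(d)) guarantees for every 32-bit x.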

SASS Cost Model

The profitability of strength reduction on NVIDIA GPUs differs from CPUs in several important ways:

Integer multiply is cheap. Modern NVIDIA GPUs (sm_70+) have dedicated integer multiply-add (IMAD) functional units. IMAD has the same throughput as IADD on most architectures -- both are single-cycle operations on the integer ALU. This means the classical "replace multiply with shift+add" transformation is often not profitable on GPU. ptxas does not aggressively replace multiplies with shift chains the way CPU compilers do.

Integer division is expensive. There is no hardware integer divider. Division must be lowered to the ~35-instruction Newton-Raphson sequence described above. This is why division-by-constant is a high-priority optimization -- replacing 35 instructions with 2-3 is a massive win.

Shift operations. SHL and SHR are single-cycle on the integer ALU, same throughput as IADD and IMAD. However, they use a different functional unit slot on some architectures, which can matter for scheduling.

BFE (bit-field extract) is a dedicated single-cycle instruction. Recognizing SHR+SHL pairs and folding them to BFE saves an instruction and a register, which is the primary motivation for the peephole patterns.

Register pressure dominates. On GPUs, the primary cost metric is not instruction count but register pressure, because register count directly determines occupancy (the number of concurrent warps). The strength reduction pass checks loop depth before transformations and suppresses replacements that would increase register pressure in inner loops (the RPO depth comparison in sub_81D7E0).

This explains why phase 21's core logic is relatively compact (~1.2 KB binary) compared to CPU compilers' strength reduction passes: the GPU cost model makes fewer algebraic replacements profitable, so the pass focuses narrowly on use-def chain simplification and trivial multiply elimination rather than elaborate pattern tables.
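The occupancy arithmetic behind this cost model is simple to illustrate. The figures below (64 K registers per SM, 2048 resident threads) are sm_70-class values assumed for illustration, not recovered from the binary:

```python
def regfile_thread_limit(regs_per_thread, regfile=65536, hw_threads=2048):
    # resident threads allowed by the register file, capped by the scheduler limit
    return min(hw_threads, regfile // regs_per_thread)

# One extra live value per thread can halve occupancy once past the knee:
assert regfile_thread_limit(32) == 2048   # ALU-bound kernel: full occupancy
assert regfile_thread_limit(64) == 1024   # register-heavy kernel: half occupancy
```

This is why a transformation that saves an instruction but lengthens a live range can be a net loss on GPU: the scheduler hides instruction latency with warps, and warps are bought with registers.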

Pipeline Context

Phase 21 runs after:

  • Phase 18 (OriLoopSimplification) -- loops are canonicalized with single entry, single back-edge, and preheaders
  • Phase 19 (OriSplitLiveRanges) -- live ranges are split at loop boundaries
  • Phase 20 (PerformPGO) -- profile data is applied (block weights inform the cost model)

Phase 21 runs before:

  • Phase 22 (OriLoopUnrolling) -- simplified induction variables enable better unroll decisions
  • Phase 24 (OriPipelining) -- strength-reduced loops are more amenable to software pipelining
  • Phase 29 (GeneralOptimize) -- compound pass cleans up any dead code left by strength reduction

The GeneralOptimize bundles (phases 13, 29, 37, 46, 58, 65) also perform algebraic simplification that overlaps with strength reduction -- specifically constant folding of multiply-by-power-of-2 to shifts. Phase 21 handles the more complex cases that require use-def chain analysis, while GeneralOptimize handles local, single-instruction rewrites.

Function Map

| Address | Size | Function | Role |
|---|---|---|---|
| sub_C5FB30 | 9 bytes | OriStrengthReduce::execute | Vtable entry, gates on function count |
| sub_C5F3C0 | 16 bytes | OriStrengthReduce::getName | Returns phase index 21 |
| sub_C5F3D0 | 16 bytes | OriStrengthReduce::isNoOp | Returns 0 (never skipped) |
| sub_752E40 | ~1.2 KB | Core strength reduction | Use-def chain walk, replacement |
| sub_745A80 | 168 bytes | Replacement register creator | Allocates new register with copied type/flags |
| sub_91BF30 | ~400 bytes | Virtual register allocator | Creates 160-byte register descriptor |
| sub_9253C0 | 325 bytes | Instruction deleter | Unlinks and removes instruction (634 callers) |
| sub_A13890 | ~2 KB | Use-def context initializer | Sets up chain traversal structures |
| sub_81D7E0 | ~660 bytes | SHR+SHL->BFE matcher | Peephole pattern recognizer |
| sub_81DB30 | ~112 bytes | SHR+SHL->BFE emitter | Emits BFE (opcode 210) |
| sub_81DBC0 | ~330 bytes | BFE+ADD matcher | Peephole pattern recognizer |
| sub_81DDD0 | ~100 bytes | BFE+ADD emitter | Emits combined shift-add (opcode 102) |
| sub_1724A20 | 28,138 bytes | 32-bit div/mod template | Newton-Raphson integer division |
| sub_1728930 | 16,545 bytes | 64-bit unsigned div template | Double-width Newton-Raphson |
| sub_1727AC0 | 13,776 bytes | 64-bit signed div template | Signed variant |
| sub_1727130 | ~2 KB | Division template driver | Allocates temps, dispatches to templates |

Cross-References

Copy Propagation & CSE

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Copy propagation and common subexpression elimination in ptxas are spread across four dedicated pipeline phases (49, 50, 64, 83) plus a forward copy propagation sub-pass (OriCopyProp) embedded inside every GeneralOptimize bundle. Together these passes form the value-redundancy elimination subsystem: they detect computations that produce values already available elsewhere in the program, then eliminate the redundant instructions or replace them with cheaper copies.

The four dedicated phases run at specific pipeline positions chosen to exploit opportunities created by preceding transformations. GvnCse (phase 49) runs after mid-level expansion and argument enforcement when the IR is maximally normalized. OriReassociateAndCommon (phase 50) immediately follows GvnCse to catch near-misses through algebraic normalization. LateOriCommoning (phase 64) runs after predication (phase 63) converts branches into predicated instructions, exposing new redundancies. OriBackCopyPropagate (phase 83) runs late in the pipeline to shorten MOV chains before register allocation.

Phases covered: 49 (GvnCse), 50 (OriReassociateAndCommon), 64 (LateOriCommoning), 83 (OriBackCopyPropagate)
Forward copy prop: OriCopyProp sub-pass inside each GeneralOptimize bundle (phases 13, 29, 37, 46, 58, 65)
Related knobs: 22 knobs controlling budgets, modes, and enable/disable flags
Pipeline position: mid-optimization (49-50), post-predication (64), pre-regalloc legalization (83)
Prerequisite passes: AnalyzeControlFlow (3), GeneralOptimizeMid2 (46), EnforceArgumentRestrictions (48)
Downstream consumers: ExtractShaderConstsFinal (51), OriDoPredication (63), register allocation (101)

Phase Summary Table

| Phase | Name | Vtable | execute | getName | isNoOp | Default |
|---|---|---|---|---|---|---|
| 49 | GvnCse | off_22BDD70 | 0xC5F000 (thunk) | 0xC5F010 (ret 49) | 0xC5F020 (ret 0) | Enabled |
| 50 | OriReassociateAndCommon | off_22BDD98 | sub_C604D0 | 0xC5EFE0 (ret 50) | 0xC5EFF0 (ret 0) | Enabled |
| 64 | LateOriCommoning | off_22BDFC8 | sub_C60020 | 0xC5EDF0 (ret 64) | 0xC5EE00 (ret 0) | Enabled |
| 83 | OriBackCopyPropagate | off_22BE2C0 | sub_C5EB80 | 0xC5EB90 (ret 83) | 0xC5EBA0 (ret 1) | Disabled |

Phase name strings (from static name table at off_22BD0C0, verified in ptxas_strings.json):

| Phase | String Address | Name Table Ref |
|---|---|---|
| 49 | 0x22BC80C | 0x22BD280 |
| 50 | 0x22BC813 | 0x22BD290 |
| 64 | 0x22BC949 | 0x22BD310 |
| 83 | 0x22BCAE5 | 0x22BD3C8 |

All four vtables are laid out at uniform 0x28-byte (40-byte) spacing in .data.rel.ro, matching the 5-pointer-per-vtable pattern used by all 159 phases. The factory switch at sub_C60D30 allocates each phase as a 16-byte object and installs the corresponding vtable pointer.

Phase 83 is disabled by default (isNoOp returns 1). It is activated through the AdvancedPhaseBackPropVReg gate (phase 82), which architecture-specific backends override to enable backward copy propagation for their target.


Phase 49 -- GvnCse (Global Value Numbering + CSE)

Overview

GvnCse combines global value numbering (GVN) with common subexpression elimination (CSE) in a single pass. GVN assigns a canonical "value number" to every expression in the program such that two expressions with the same value number are guaranteed to compute the same result. CSE then uses these value numbers to detect and eliminate redundant computations.

The pass is gated by the EnableGvnCse knob (address 0x21BDA50). When disabled, the pass is skipped entirely.

Dispatch Mechanism

The execute function at 0xC5F000 is a 16-byte thunk:

mov  rdi, [rsi+0x630]     ; rdi = compilation_context->sm_backend
mov  rax, [rdi]            ; rax = sm_backend->vtable
jmp  [rax+0xB8]            ; tail-call vtable[23] -- the actual GVN-CSE implementation

The real implementation lives in the compilation context's SM backend object (at context+0x630 / +1584), dispatched through its vtable at offset 0xB8 (slot 23). This indirection means the GVN-CSE algorithm can be overridden by architecture-specific backends that provide a different SM backend vtable. (This object was previously called "optimizer_state" on this page, but it is the same polymorphic SM backend used for legalization, scheduling, and all other architecture-dependent dispatch -- see data-structures.md.)

Algorithm (Reconstructed)

The ptxas GVN-CSE operates on the Ori IR basic block list with dominator-tree-guided traversal:

procedure GvnCse(function F):
    build dominator tree DT for F
    initialize value_table: hash_map<expression_key, value_number>
    vn_counter = 0

    for each block B in RPO(DT):
        for each instruction I in B:
            key = canonicalize(I.opcode, I.type, [lookup_vn(op) for op in I.operands])
            if key in value_table:
                existing_vn = value_table[key]
                replace all uses of I.dest with representative(existing_vn)
                mark I as dead
            else:
                value_table[key] = ++vn_counter
                set_representative(vn_counter, I.dest)

    run dead code elimination to remove marked instructions

Key design decisions visible from the binary:

  1. Hash-based value table. The value numbering table uses FNV-1a hashing (seed 0x811C9DC5, prime 16777619 / 0x01000193), the same hash primitive used throughout ptxas for instruction fingerprinting, code caching, and scheduling table lookups. The hash function incorporates the opcode, type, and recursively resolved value numbers of all operands. Hash table entries are 24 bytes each: [next_ptr (8B), key (8B), value/metadata (8B)] with chained collision resolution.

  2. Dominator-tree scoping. Values defined in block B are only visible to blocks dominated by B. When the walk exits a dominator subtree, value table entries scoped to that subtree are removed. This prevents CSE from moving computations to positions where they would not dominate all uses. Dominance is checked via sub_1245740, which performs a single-bit test against a per-block dominator bitvector: the dominator set at block descriptor offset +176 is indexed by the dominator block's ID from offset +144. The check is O(1).

  3. Commutativity normalization. For commutative operations (ADD, MUL, AND, OR, XOR, MIN, MAX), operands are sorted by value number before hashing. This ensures a + b and b + a get the same value number without requiring a separate reassociation pass.

  4. Address space awareness. Memory operations in different address spaces (shared, global, local, constant) are never considered equivalent even if they have identical operands. The address space qualifier is encoded in the instruction opcode or modifier bits (not the operand), so the opcode comparison in the structural equivalence check inherently preserves this distinction.

  5. Predicate handling. Predicated instructions (@P0 IADD R1, R2, R3) hash the predicate register's value number as an additional operand. Two identical computations under different predicates are distinct values.

  6. Predicate-operand compatibility (sub_7E7380). After opcode and type matching in the caller, sub_7E7380 performs a focused predicate-operand compatibility check (30 lines, 150 bytes). The function tests: (a) predicate modifier parity -- instr+73 bit 4 versus instr+72 bit 12 (0x1000); if one instruction has a predicate modifier and the other does not, they are incompatible; (b) last operand 24-bit value ID -- (instr + 84 + 8*(operand_count-1)) & 0xFFFFFF must match; (c) second-to-last operand 8-byte encoding -- the two dwords immediately before the last operand slot must be identical. The broader structural comparison (opcodes masked with & 0xFFFFCFFF, data types at +76, operand counts at +80, full per-operand encoding, register class at +64) is performed by each of the 21 callers of sub_7E7380, not by the function itself. Instructions with volatile flags (bit 0x20 at register descriptor offset +48) and barrier-type registers (type 9) are excluded from CSE by the callers' pre-checks.
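Combining design decisions 1, 3, and 5, the value-table key construction can be sketched as follows. The hash constants are the recovered FNV-1a parameters; the byte order and word widths fed to the hash are assumptions:

```python
FNV_SEED  = 0x811C9DC5
FNV_PRIME = 0x01000193
COMMUTATIVE = {"ADD", "MUL", "AND", "OR", "XOR", "MIN", "MAX"}

def fnv1a(words):
    # byte-wise FNV-1a over 32-bit little-endian words (byte order assumed)
    h = FNV_SEED
    for w in words:
        for b in w.to_bytes(4, "little"):
            h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFF
    return h

def value_key(opcode, mnemonic, dtype, operand_vns):
    # operands of commutative ops are sorted by value number before hashing,
    # so a+b and b+a receive the same key (design decision 3 above);
    # a predicate's value number would simply be appended as one more operand
    if mnemonic in COMMUTATIVE:
        operand_vns = sorted(operand_vns)
    return fnv1a([opcode, dtype, *operand_vns])
```

Two expressions collide in the table exactly when their canonicalized keys match, at which point the structural equivalence check (opcode mask, types, per-operand encoding) confirms the match.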

GVN Algorithm Details (Binary Trace)

The GVN-CSE body was located by reading SM backend vtable slot 23 (offset +0xB8) from all seven SM backend vtables in the ptxas binary. The actual function pointer varies by SM generation:

| SM Backend | Vtable | Slot 23 Function | Behavior |
|---|---|---|---|
| SM30 (Kepler) | off_2029DD0 | sub_661250 | Returns 0 -- NO-OP |
| SM50 (Maxwell) | off_21B4A50 | sub_661250 | Returns 0 -- NO-OP |
| SM60 (Pascal) | off_22B2A58 | sub_BEE590 | Real GVN-CSE |
| SM70 (Volta) | off_21D82B0 | sub_BEE590 | Real GVN-CSE |
| SM80 (Ampere) | off_21B2D30 | sub_661250 | Returns 0 -- NO-OP |
| SM89 (Ada) | off_21C0C68 | sub_661250 | Returns 0 -- NO-OP |
| SM90+ (Hopper) | off_21D6860 | sub_BEE590 | Real GVN-CSE |

GVN-CSE (phase 49) is a no-op on Kepler, Maxwell, Ampere, and Ada. It only executes on Pascal, Volta, and Hopper/Blackwell. SM80/SM89 backends rely on LateOriCommoning (phase 64) and the GeneralOptimize sub-passes for CSE coverage instead. This is a deliberate per-generation decision embedded in each SM backend's vtable.

Call Chain

GvnCse::execute (0xC5F000)
  -> sm_backend->vtable[23]  (indirect dispatch)
     -> sub_BEE590           (GVN entry, SM60/70/90)
        -> sub_781F80(ctx, 0) (rebuild def chains, mode=full)
        -> sub_BEE370         (mode dispatcher)
           -- queries knob 402 via knob_container->vtable[9] --
           mode 0: disabled, return
           mode 1: sub_BEA450  (simple GVN)
           mode 2: sub_BEAD00  (standard dominator-guided GVN)
           mode 3-6: sub_BED7E0 (full GVN with extended block scope)

Mode Dispatcher (sub_BEE370)

The mode is determined by knob 402 (EnableGvnCseMode), queried through two vtable calls on the knob container at context+1664:

  1. Boolean query -- knob_container->vtable[9](402) (offset +72): checks if the knob is set at all. The dispatcher has a fast-path optimization: when vtable[9] is sub_6614A0 (the standard implementation), it reads directly from knob_container+72+28944 instead of dispatching through the vtable.
  2. Integer query -- knob_container->vtable[15](402) (offset +120): reads the mode value as an integer. Similarly fast-pathed when vtable[15] is sub_661470.

If both queries return truthy, the integer value selects the GVN variant:

| Mode | Function | Description |
|---|---|---|
| 0 | (none) | Pass disabled, return immediately |
| 1 | sub_BEA450 | Simple single-block GVN (111 lines, ~2KB) |
| 2 | sub_BEAD00 | Standard dominator-guided GVN (157 lines, ~2.5KB) |
| 3 | sub_BED7E0 | Full GVN (when sm_backend+1106 bit 6 AND context+1416 bit 0) |
| 4 | sub_BED7E0 | Full GVN (remapped to mode 2 if bit 6 is clear) |
| 5-6 | sub_BED7E0 | Full GVN with extended block scope |
| >6 | (none) | Return immediately (no operation) |

Additional flags modulate the mode selection:

  • SM backend flag at sm_backend+1106 bit 6 (0x40): when set, enables modes 5-6 (enhanced scope). When clear and mode is 4, the dispatcher remaps to mode 2.
  • Context flag at context+1416 bit 0: when set (and bit 6 is set), selects mode 3 over mode 5-6.
  • SM version threshold sm_backend+372 <= 0x7FFF (32767): gates the EBB pre-pass sub_BED430 via knob 210.

Before the standard GVN (sub_BEAD00), the mode dispatcher may invoke sub_BED430 -- an extended basic block (EBB) pre-pass that identifies and marks multi-block CSE opportunities within single-entry regions. The EBB pre-pass is called unless: (a) SM version > 0x7FFF, AND (b) knob 210 is set or context+1368 bit 0 is clear.
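Setting the EBB pre-pass aside, the mode-to-variant selection reduces to the following sketch (function names are the recovered addresses; the flag handling for modes 3/5-6 is simplified to the bit-6 remap described above):

```python
def select_gvn_variant(mode, backend_bit6):
    # sub_BEE370 dispatch, as reconstructed from the mode table
    if mode == 0 or mode > 6:
        return None                     # disabled or out of range: no-op
    if mode == 1:
        return "sub_BEA450"             # simple single-block GVN
    if mode == 4 and not backend_bit6:
        mode = 2                        # remap: enhanced scope unavailable
    if mode == 2:
        return "sub_BEAD00"             # standard dominator-guided GVN
    return "sub_BED7E0"                 # full GVN, modes 3-6
```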

Simple GVN (sub_BEA450, Mode 1)

Mode 1 provides the lightest GVN variant -- single-scope CSE without cross-dominator lookup. Reconstructed pseudocode:

procedure SimpleGvn(gvn_state S):
    context = S.context
    first_reg = operand_24bit(first_instr(context+272))
    value_record = context.reg_table[first_reg]       // context+296
    if not value_record: return

    for each value_record in linked order:
        if knob_query(257, value_record): break        // per-instruction gate

        first_instr = value_record.head                // value_record[0]
        sentinel = value_record.sentinel               // value_record[1]
        eligible = false

        for each instr from first_instr to sentinel:
            if not eligible:
                eligible = check_eligibility(instr)    // sub_BEA1E0
                if eligible: advance and check sentinel
            if instr.opcode_masked == 145:             // barrier
                if sm_backend->vtable[371](instr):     // safe to CSE
                    mark eligible
                else: break scope

        if eligible:
            // Directly generate MOV replacement -- no dominator check
            context+232 = value_record.head
            context+264 = value_record.head->field_20
            sub_9314F0(context, 0x124, 1, 0, 0)       // insert MOV 292

        advance to next block via opcode 97 (block header) -> field +24

This variant does not examine the immediate-dominator chain at instruction+148. It only replaces redundancies that are visible within the current value record's instruction list (effectively single-block scope).

Standard GVN (sub_BEAD00, Mode 2)

Mode 2 extends the simple GVN with cross-dominator CSE. After finding an eligible instruction and reaching the end of a block, it follows the immediate-dominator chain:

procedure StandardGvn(gvn_state S, char cross_block_flag):
    // ... (same entry and block walk as SimpleGvn) ...

    // After eligibility walk reaches sentinel:
    idom = instr.field_148                             // immediate dominator index
    if idom != 0:
        dom_record = context.reg_table[context.idom_map[idom]]  // context+296[context+512[4*idom]]
        if dom_record and (not cross_block_flag or dom_record.opcode != 1):
            if not dominance_check(S, value_record):   // sub_BEA3B0
                leader = dom_record.head
                if leader.next.opcode != 292:          // not already a MOV
                    context+232 = leader
                    context+264 = leader.field_20
                    sub_9314F0(context, 0x124, 1, 0, 0)   // insert MOV

    // Fallback: if idom chain is empty, try block-level CSE
    block_desc = context.block_table[instr.field_164]  // context+368
    if block_desc+280 bit 0 is clear:
        leader = reg_table[operand_24bit(block_desc.first_instr)]
        if leader.next.opcode != 292:
            generate MOV replacement

The cross_block_flag parameter (passed from the mode dispatcher) controls whether the standard GVN allows replacement when the dominator has opcode == 1 (a block-header sentinel). When set, it skips such cases to avoid unsafe cross-block hoisting.

Dominance Check with Cache (sub_BEA3B0)

The dominance check is guarded by context+1377 bit 5 (0x20). When this flag is clear, the function returns 0 immediately (no dominance, meaning "safe to CSE" -- the caller inverts the result).

When the flag is set, the function implements a single-entry global cache to accelerate repeated dominator queries:

procedure DominanceCheck(gvn_state S, value_record vr):
    if not (context+1377 & 0x20): return 0           // no extended scope

    idom = vr.field_148
    if idom == 0: return 1                             // no dominator -> can't CSE

    dom_record = reg_table[idom_map[idom]]
    if dom_record == NULL: return 1

    // Check global cache (single-entry, TLS-safe through static storage)
    if dom_record == cached_key:                       // qword_2A12A08
        return cached_result ^ 1                       // byte_2A129FE[0] ^ 1

    // Cache miss: compute dominator ordering via sub_74D720
    if idom >= 0 and vr.field_152 >= 0:
        cached_key = dom_record
        sub_74D720(context, idom, vr.field_152, &cached_result)
        return cached_result ^ 1
    else:
        return 1                                       // negative index -> can't CSE

The cache stores a single (key, result) pair in global statics qword_2A12A08 and byte_2A129FE. This is effective because the GVN walk processes instructions within a block sequentially, and many consecutive instructions share the same dominator. The cache hit rate is high for blocks dominated by a single predecessor.
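The cache is nothing more than one (key, result) pair in static storage; a minimal model of its behavior (the lambda stands in for the sub_74D720 ordering query):

```python
class DomCache:
    # single-entry cache mirroring the qword_2A12A08 / byte_2A129FE pair
    def __init__(self, compute):
        self.compute = compute          # expensive dominator-ordering query
        self.key = None
        self.result = None
        self.misses = 0

    def query(self, key):
        if key != self.key:             # miss: recompute, overwrite the one slot
            self.key = key
            self.result = self.compute(key)
            self.misses += 1
        return self.result
```

Runs of queries against the same dominator record (the common case inside one block) cost one computation; alternating between two records defeats the cache entirely, which is the accepted trade-off of a single-entry design.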

EBB Pre-Pass (sub_BED430)

The Extended Basic Block (EBB) pre-pass runs before mode 2 GVN when the SM version and knob conditions are met. It identifies cross-block CSE opportunities within single-entry CFG regions.

procedure EbbPrePass(gvn_state S):
    // Phase 1: Clear previous markings
    for each block B in linked order:
        B.field_264 = 0                               // clear EBB marking

    // Phase 2: Find first CSE-eligible instruction
    for each instr in instruction list:
        if check_eligibility(instr) and instr.opcode != 211:
            break  // found seed
    if not found: return

    // Phase 3: Build dominator tree and compute block ordering
    sub_7846D0(context)                                // dominator tree + RPO
    sub_A12EA0(context, walker_context, visitor)       // dominator tree walk
    sub_775010(context)                                // predecessor setup
    sub_773140(context, 0)                             // successor setup
    sub_770E60(context, 0)                             // block ordering

    // Phase 4: Mark CSE candidates on every instruction
    for each instr in instruction list:
        if check_eligibility(instr) and instr.opcode != 211:
            instr.field_48 = 1                         // mark as CSE candidate
        else:
            instr.field_48 = 0

    // Phase 5: Propagate eligibility through operand chains
    sub_BED0A0(walker_state)                           // fixed-point propagation

    // Phase 6: Evaluate cross-block candidates
    for each value_record in RPO order:
        if knob_query(257, vr): continue               // per-instruction gate
        idom = vr.field_148
        if idom != 0:
            dom_record = resolve_idom(context, idom)
            if dom_record and dom_record.field_264 == 0:
                dom_record.field_264 = sub_BEA000(walker, dom_record, 0) ? 2 : 1

The EBB propagation engine (sub_BED0A0) is a fixed-point iteration that propagates CSE eligibility backward through operand use-chains. For each instruction with field_48 bit 0 set, it follows the operand-to-instruction back-references at context+88 to mark defining instructions as eligible too. The iteration continues until no more changes occur. This ensures that an entire expression tree is marked eligible when any of its consumers is eligible.
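Phase 5's fixed-point propagation is a standard backward worklist over use-def edges. A sketch under the stated model (the real pass walks the operand back-references at context+88 and a field_48 flag rather than a Python set):

```python
def propagate_eligibility(defs, seeds):
    # defs: instruction -> instructions defining its operands (use-def edges)
    # seeds: instructions initially marked eligible (field_48 == 1)
    eligible = set(seeds)
    worklist = list(seeds)
    while worklist:                      # iterate to a fixed point
        instr = worklist.pop()
        for d in defs.get(instr, ()):
            if d not in eligible:        # mark the defining instruction too
                eligible.add(d)
                worklist.append(d)
    return eligible
```

Marking an instruction pulls in its whole defining expression tree, which is exactly the property the text describes: eligibility of any consumer makes the entire tree eligible.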

Full GVN Body (sub_BED7E0, 689 lines, ~18KB binary)

This is the most complete GVN variant (modes 3-6). Reconstructed pseudocode:

procedure FullGvnCse(gvn_state S):
    context = S.context
    mode_flags = S.mode & ~0x02            // strip bit 1
    extended_scope = (mode_flags - 5)      // >1 enables cross-block scope

    // Phase 1: Initialization
    block_count = context.block_count      // at +376
    visited[] = allocate_zeroed(block_count + 1)   // one byte per block
    build_dominator_tree(context)                   // sub_7846D0
    scope_tree = new_scoped_tree()                  // sub_661750
    rpo = context.rpo_ordering                      // at +792

    // Phase 2: RPO block walk
    for i = 0 to rpo.count - 1:
        block_idx = rpo.indices[i]
        block = context.block_table[block_idx]      // context+368
        if block.head == NULL or block.flags & SKIP:
            continue

        first_instr = lookup_first_instruction(block)
        dominator_candidate = NULL
        has_redundant = false

        for each instr in block:
            // Per-instruction knob gate (knob 257)
            if knob_query(257, instr):
                break to next block boundary

            eligible = check_eligibility(instr)     // sub_BEA1E0

            if eligible:
                visited[block_idx] = not block.visited_flag

            elif opcode_masked(instr) in {32, 159}: // branch, return
                propagate visited flag from predicate operand

            elif opcode_masked(instr) == 145:       // barrier/sync
                safe = sm_backend->vtable[371](instr)
                if safe: mark_as_candidate

            // Check dominator for existing equivalent
            idom_ref = instr.idom_ref               // at +148
            if idom_ref != 0:
                dom_block = resolve_idom(context, idom_ref)
                if dom_block dominates current position:
                    leader = dom_block.first_instr
                    if leader.opcode != 292:        // not already a MOV
                        replace_with_mov(context, leader, 0x124)

        record_block_in_scope_tree(scope_tree, block_idx)

    // Phase 3: Post-processing dominated blocks
    for each (node, bit_pos) in scope_tree.bit_iterate():
        block_idx = bit_pos | (node.data << 6)
        block_record = reg_table[block_idx]
        cse_dominated_block(S, block_record)        // sub_BEA5F0

    // Phase 4: Cleanup
    flush_deferred_instructions(scope_tree)
    destroy_scoped_tree(scope_tree)

Key observations from the binary:

  1. Block walk order is RPO. The outer loop reads context+792 -- a struct containing {int count; int indices[]} -- and iterates in that order. The RPO array is pre-computed by sub_7846D0 which also builds the dominator tree.

  2. The value table is a register-indexed array, not a hash map. Values are stored in context+296 (an array of pointers indexed by the 24-bit register/value identifier from the operand encoding at instruction+84). This gives O(1) lookup by register ID. The dominator tree is used for scoping, not a stack-based hash table.

  3. Dominator scoping uses a balanced binary tree with bitset nodes. Each tree node stores a 64-bit bitset of block indices, traversed with tzcnt for efficient iteration. Block index is recovered as bit_position | (node_data << 6), supporting up to 64 * depth blocks.

  4. Replacement is MOV insertion. When a redundant instruction is found, the pass calls sub_9314F0(context, 0x124, 1, 0, 0) to generate a replacement MOV instruction (opcode 0x124 = 292 decimal). The original computation is recorded at context+232 (source) and context+264 (metadata) before the MOV is generated.

  5. Barrier instructions (opcode 145) have a dedicated safety check via sm_backend->vtable[371] (offset +2968), which is an architecture-specific predicate that determines whether an instruction can be CSE'd across a barrier boundary.

Instruction Eligibility (sub_BEA1E0)

| Opcode (masked) | Category | Condition |
|---|---|---|
| 16 | Register copy / PHI | Always, unless last operand bit 1 set |
| 183 | Memory load/compute | Bit 5 of last operand, or sub_91BC40 safety check |
| 119 | GPU special | SM flag +1106 bit 6 required; operand bit 1 |
| 186 | GPU special | SM flag +1106 bit 6 required; operand bit 0 |
| 211 | GPU special | SM flag +1106 bit 6 required; operand bit 2 |
| 283 | GPU special | SM flag +1106 bit 6 required; operand bit 3 |
| 122 | Conditional | Type 2-3: always; type 7-8: bit 7 set |
| 310 | Specialized | (flags & 0xF) == 2 and (flags & 0x30) != 0x30 |
| 145 | Barrier/sync | Separate sm_backend->vtable[371] check |
| all others | -- | Not eligible |

Opcodes 119, 186, 211, 283 are only CSE-eligible when sm_backend+1106 bit 6 (0x40) is set. This bit appears to be an architecture-specific capability flag enabling extended CSE for certain GPU-specific instruction classes.

Per-Dominated-Block CSE (sub_BEA5F0)

After the RPO walk populates the scope tree, sub_BEA5F0 processes each dominated block:

  1. SM version gate: if sm_backend+372 <= 28671 (SM70 or earlier), enables a special operand canonicalization path for commutative operations
  2. Instruction walk: iterates via block+128 child pointer chain
  3. Dominator ordering: compares instruction+144 (dominator number) to test dominance
  4. Commutative canonicalization (opcode 95): calls sm_backend->vtable[79] (offset +632) to sort operands by value number. Rewrites operand encoding with flags 0x60000000 and 0x40000000 to mark canonicalized operands
  5. Replacement: calls sub_931920 to insert copy instructions when a dominating equivalent is found

Scope Tree Bit-Iteration Detail

The scope tree post-processing (lines 498-664 of sub_BED7E0) uses a binary tree where each node contains a 4-word (32-byte) bitset region starting at node+32. The iteration:

  1. Start at the leftmost node: follow node->left until NULL
  2. Scan the 4-word bitset region (node+32 through node+64), finding each set bit via tzcnt (x86 trailing-zero-count)
  3. Recover the block index: bit_position | (((word_offset_in_node | (node.field_24 << 2)) << 6))
  4. After processing a bit, mask it out: word &= ~(0xFFFFFFFFFFFFFFFF >> (64 - (bit_pos + 1)))
  5. When current word is exhausted, advance to next word in the 4-word region
  6. When all 4 words are exhausted, follow parent/right-child links to traverse the tree in order

Each block index recovered from the tree triggers a call to sub_BEA5F0 for per-dominated-block CSE. The tree structure allows the scope walk to skip large ranges of blocks that have no CSE candidates, making it efficient for sparse CFGs.
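The bit scan is equivalent to the following sketch, with tzcnt emulated by Python bit tricks and the clear-lowest-bit step standing in for the mask update in step 4 (field names as recovered; the parent/right traversal of step 6 is omitted):

```python
def iter_block_indices(words, field_24):
    # scan one node's 4-word (256-bit) bitset, recovering block indices as
    # bit_position | ((word_offset | field_24 << 2) << 6)
    for word_offset, w in enumerate(words):
        while w:
            bit = (w & -w).bit_length() - 1        # tzcnt: lowest set bit
            yield bit | ((word_offset | field_24 << 2) << 6)
            w &= w - 1                             # clear the processed bit
```

Each yielded index would trigger one sub_BEA5F0 call; nodes whose bitsets are empty are skipped in constant time, which is where the sparsity win comes from.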

GVN Function Map

| Address | Name | Size | Role |
|---|---|---|---|
| sub_BEE590 | GvnEntry | ~200B | Entry point (vtable slot 23, SM60/70/90) |
| sub_BEE370 | ModeDispatcher | ~550B | Selects GVN variant via knob 402 |
| sub_BED7E0 | FullGvn | ~18KB | Full GVN body (modes 3-6, RPO + scope tree) |
| sub_BED430 | EbbPrePass | ~2KB | Extended basic block pre-pass |
| sub_BED0A0 | EbbPropagate | ~3KB | EBB eligibility propagation (fixed-point) |
| sub_BEC880 | EbbInit | -- | EBB state initialization |
| sub_BEAD00 | StandardGvn | ~2.5KB | Standard dominator-guided GVN (mode 2) |
| sub_BEA5F0 | PerBlockCse | ~9KB | Per-dominated-block CSE + commutative canon. |
| sub_BEA450 | SimpleGvn | ~2KB | Simple single-block GVN (mode 1) |
| sub_BEA3B0 | DomCheckCached | ~300B | Dominance check with global cache |
| sub_BEA1E0 | EligibilityCheck | ~500B | Instruction eligibility (7 opcode classes) |
| sub_BEA000 | EbbCandidateCheck | ~700B | EBB candidate dominator-chain walk |
| sub_7E7380 | PredicateCompat | ~150B | Predicate-operand compatibility check |
| sub_661250 | NoOp | ~6B | No-op stub (SM30/50/80/89) |
| sub_7846D0 | BuildDomTree | -- | Dominator tree + RPO ordering builder |
| sub_661750 | ScopeTreeInit | -- | Scoped value tree init/destroy |
| sub_9314F0 | InsertMov | -- | Instruction insertion (generates MOV 292) |
| sub_934630 | InsertMulti | -- | Instruction insertion (multi-operand variant) |
| sub_931920 | InsertNode | -- | Instruction node insertion into linked list |
| sub_9253C0 | DeleteInstr | -- | Instruction deletion |
| sub_6B4520 | RecordBlock | -- | Block recording for dominator scoping |
| sub_74D720 | DomOrdering | -- | Dominator ordering comparison |
| sub_69DD70 | TreeExtract | -- | Tree node extraction (deferred processing) |
| sub_7A1A90 | KnobQuery | -- | Knob query (per-instruction enablement) |
| sub_91BC40 | MemSafetyCheck | -- | Memory operation safety check |
| sub_A12EA0 | DomTreeWalk | -- | Dominator tree walker (EBB discovery) |

GPU-Specific CSE Constraints

GPU CSE must respect constraints that do not arise in CPU compilers:

  • Divergence. A uniform subexpression (same value across all threads in a warp) can be safely hoisted. A divergent subexpression may have different values per thread and must only be CSE'd within the same control-flow path. The GvnCse pass runs after AnalyzeUniformsForSpeculation (phase 27), which provides divergence annotations.

  • Barrier sensitivity. A computation that reads shared memory before a BAR.SYNC cannot be commoned with an identical computation after the barrier, because intervening threads may have written different values. Memory operations with barrier dependencies are assigned unique value numbers. The actual barrier check is performed by sm_backend->vtable[371] (offset +2968), an architecture-specific predicate.

  • Register pressure. Aggressive CSE can increase register pressure by extending the live range of the representative value. The EnableGvnCse knob allows the pass to be disabled when register pressure is the binding constraint.

  • Per-SM enablement. GVN-CSE is only active on SM60, SM70, and SM90+. SM80/SM89 rely on LateOriCommoning (phase 64) and GeneralOptimize sub-passes instead. This per-generation selection is embedded in the SM backend vtable at slot 23.


Phase 50 -- OriReassociateAndCommon

Overview

Reassociation normalizes the algebraic structure of expressions to expose commoning opportunities that GvnCse missed. GvnCse cannot detect that (a + b) + c and (a + c) + b compute the same value unless the expressions are first reassociated into a canonical form. This pass performs that reassociation and then runs a second commoning pass over the normalized IR.

Dispatch Mechanism

// sub_C604D0 -- OriReassociateAndCommon::execute
int64 execute(phase* self, compilation_context* ctx) {
    int func_count = get_function_count(ctx);   // sub_7DDB50
    if (func_count > 1)
        return ctx->field_1584->vtable[44](ctx->field_1584, ctx);
    return func_count;
}

For multi-function compilation units, the pass dispatches through the compilation context's SM backend (field +1584 / 0x630), calling vtable slot 44 (offset 0x160). This enables per-function reassociation with function-level isolation of value numbering state.

Algorithm (Reconstructed)

Reassociation works on associative and commutative operators:

procedure ReassociateAndCommon(function F):
    for each basic block B in RPO:
        for each instruction I in B:
            if I.opcode is associative+commutative (ADD, MUL, AND, OR, XOR):
                flatten expression tree rooted at I into a list of leaves
                sort leaves by canonical order (constants last, then by register number)
                rebuild balanced binary tree from sorted leaves
            if I.opcode is SUB:
                rewrite (a - b) as (a + (-b)) for uniformity

    // Second pass: hash-based commoning over the reassociated IR
    run local CSE over each basic block

Why Reassociation Matters

The reassociation and commoning phases are tightly coupled because reassociation's primary goal is to enable commoning:

BB0:  R5 = (R2 + R3) + R4 ; GvnCse sees: VN(ADD, VN(ADD,vn(R2),vn(R3)), vn(R4))
BB1:  R6 = (R2 + R4) + R3 ; GvnCse sees: VN(ADD, VN(ADD,vn(R2),vn(R4)), vn(R3))
      -- These are NOT the same VN because the inner ADDs differ.

After reassociation, both flatten to {R2, R3, R4} sorted canonically, then rebuild as (R2 + R3) + R4. Now they share the same value number and the second is eliminated.
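The flatten/sort/rebuild canonicalization can be sketched in a few lines of Python. The leaf ordering (constants last, then by name) and the left-leaning rebuilt tree are assumptions consistent with the description above, not recovered encodings:

```python
# Illustrative sketch of reassociation canonicalization; not decompiled code.

def flatten(expr, op):
    """Collect leaves of a same-operator subtree: ('add', l, r) -> [leaves]."""
    if isinstance(expr, tuple) and expr[0] == op:
        return flatten(expr[1], op) + flatten(expr[2], op)
    return [expr]

def rebuild(leaves, op):
    """Rebuild a left-leaning tree from canonically sorted leaves."""
    tree = leaves[0]
    for leaf in leaves[1:]:
        tree = (op, tree, leaf)
    return tree

def canonicalize(expr, op='add'):
    # Constants (ints) sort after registers; registers sort by name.
    leaves = sorted(flatten(expr, op),
                    key=lambda leaf: (isinstance(leaf, int), str(leaf)))
    return rebuild(leaves, op)
```

Both example expressions above reduce to the same canonical tree under this scheme, which is exactly what lets the subsequent local CSE pass assign them one value number.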

Controlling Knobs

| Knob | Address | Purpose |
|---|---|---|
| AllowReassociateCSE | 0x21C0180 | Master enable/disable |
| ReassociateCSEBudget | 0x21BA810 | Max instructions processed per function |
| ReassociateCSEWindow | 0x21BA7D0 | Sliding window size for local CSE after reassociation |
| ReassociateCSESkip | 0x21BA7F0 | Skip first N instructions (debugging) |
| ReassociateLargeImmInUIADD64 | 0x21BA7A0 | Large immediates in 64-bit unsigned ADD |
| DistributeAndReassociateMulBudget | 0x21BDDC0 | Budget for a*b + a*c -> a*(b+c) |

Phase 64 -- LateOriCommoning

Overview

LateOriCommoning is a late CSE pass that runs immediately after predication (phase 63, OriDoPredication). If-conversion transforms conditional branches into predicated instructions, which can expose new redundancies: two computations that were previously in mutually exclusive branches become adjacent predicated instructions that may compute the same value.

Dispatch Mechanism

// sub_C60020 -- LateOriCommoning::execute
char execute(phase* self, compilation_context* ctx) {
    int func_count = get_function_count(ctx);    // sub_7DDB50
    if (func_count > 1)
        return sub_9059B0(ctx);                  // late commoning implementation
    return func_count;
}

Implementation -- sub_9059B0

sub_9059B0 is the entry point for late commoning. It:

  1. Checks knob 487 (ForceLateCommoning at 0x21BD2F0) to determine whether the pass is enabled
  2. Verifies the function's optimization state has commoning enabled: the byte at context->field_1664->field_72 + 60696 must be 1, and the dword at offset +60704 must be nonzero
  3. Allocates a ref-counted working set via the pool allocator
  4. Calls sub_9055F0 -- the core commoning walker

Core Commoning Walker -- sub_9055F0

sub_9055F0 (203 lines decompiled) is the central commoning algorithm for late CSE. Its structure, reconstructed from the decompilation:

procedure LateCommoning(function_state S):
    if not knob_enabled(487):  return
    if S.flags & 0x02:  return                 // already processed
    if (S.flags | S.flags2) & 0x08:  return    // conflicting mode

    rebuild_def_chains(S, mode=1)              // sub_781F80
    rebuild_use_chains(S)                      // sub_763070
    compute_hash_values(S, 0, 0, 0, 0)        // sub_7E6090

    block_count = S.field_520 + 1
    allocate bit_array[block_count]

    // Reset hash/VN slots on all instructions
    for each instruction I in S.instruction_list:
        I.field_88 = 0xFFFFFFFF00000000        // upper 32 bits = -1, lower = 0

    // Main commoning loop over code list
    for each instruction I in S.code_list:
        // Phase 1: Remap operands through equivalence table
        for each operand (reverse order):
            if operand is register ref (type 0x10000000):
                resolve to canonical representative

        // Phase 2: Try commoning based on opcode class
        if I.opcode == 72 (MOV):
            propagate_equivalence(I)            // sub_8F2CD0
        elif is_pure(I):                        // sub_7DF3A0
            opcode_class = I.opcode & 0xCF00
            if opcode_class == 0x0061 (SEL):    // conditional select
                reset_tracking()
            elif opcode_class == 0x0034 (PHI):
                record_phi_equivalence(S, I)
            else:
                if not try_common(S, I):        // sub_901A90
                    hash = compute_hash(S, I)   // sub_74ED70
                    record_hash_for_future_matching(hash)

The three infrastructure functions called at the beginning are shared with the GeneralOptimize sub-passes:

  • sub_781F80 -- rebuilds reaching definition chains (also used by GeneralOptimizeEarly)
  • sub_763070 -- rebuilds use-def chains
  • sub_7E6090 -- pre-computes instruction hash values

Commoning Check -- sub_901A90

sub_901A90 (387 lines) is the instruction-level CSE checker. It:

  1. Examines the instruction's opcode, type, and operand value numbers
  2. Looks up the instruction's hash in the per-block equivalence table
  3. If a match is found, verifies that the matched instruction dominates the current position via sub_1245740 (O(1) bitvector bit test: (1 << def_dom_id) & dom_set[def_dom_id >> 5])
  4. If domination holds, replaces the current instruction's destination with the matched instruction's destination
  5. Returns true if commoning succeeded, false otherwise

A related commoning pattern was confirmed from sub_90A340 (1670 bytes, 21 callees), which performs commoning on opcode 130 (HSET2 in the ROT13 name table; used as an internal marker for MOV-like instructions -- actual SASS MOV is opcode 19) instructions. From the decompilation, the operand comparison loop:

// Operand-by-operand equivalence check within commoning body
for (i = operand_count - 1; i >= 0; i--) {
    if (candidate.operands[2*i + 21] != existing.operands[2*i + 21])
        break;  // operand value mismatch
    if (candidate.operands[2*i + 22] != existing.operands[2*i + 22])
        break;  // operand modifier mismatch
}
// If all operands match AND opcodes match AND operand counts match:
//   verify dominance, then replace

The reverse iteration order (from last operand to first) is an optimization: destination operands at lower indices are more likely to differ, so checking source operands first (higher indices) allows early exit.

Instruction Hashing -- sub_74ED70

sub_74ED70 (304 lines) computes a hash value for an instruction, incorporating:

  • Opcode and type qualifiers
  • Value numbers of all source operands (recursively resolved through MOV chains)
  • Address space for memory operations
  • Predicate register (if predicated)
  • Immediate values (folded into the hash)

The hash is stored at instruction field +88 (the upper 32 bits that were reset to 0xFFFFFFFF during initialization). The function calls sub_7DF3A0 (purity check), sub_7E0030 and sub_7E2530 (operand accessors), and sub_748440 (hash combining).
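A hedged sketch of what such an instruction hash combines, following the field list above. The mixing function here is an invented stand-in for sub_748440 (the real combiner is not recovered); what matters is that the hash is order-sensitive over operands and incorporates the non-operand fields:

```python
# Illustrative instruction hash; the mixer is an assumption, not sub_748440.

def combine(h, value):
    """Order-sensitive hash mixing (stand-in for the real combiner)."""
    return ((h * 31) ^ (value & 0xFFFFFFFF)) & 0xFFFFFFFF

def instr_hash(opcode, type_bits, operand_vns, addr_space=0, pred_reg=0):
    h = combine(opcode, type_bits)
    for vn in operand_vns:            # source operand value numbers, in order
        h = combine(h, vn)
    h = combine(h, addr_space)        # keeps memory spaces distinct
    return combine(h, pred_reg)       # predicated forms hash differently
```

Two instructions hash equal only if opcode, type, operand value numbers (in order), address space, and predicate all agree, which is the precondition the commoning check then verifies structurally.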

Controlling Knobs

| Knob | Address | Purpose |
|---|---|---|
| ForceLateCommoning | 0x21BD2F0 | Force-enable late commoning |
| DisableMoveCommoning | 0x21BE2C0 | Disable MOV-based equivalence propagation within the commoning walker |

Phase 83 -- OriBackCopyPropagate

Overview

Backward copy propagation propagates values backward through MOV chains, eliminating intermediate copies. Unlike forward copy propagation (which replaces uses of a copy's destination with the copy's source), backward copy propagation replaces the definition of a copy's source with the copy's destination, allowing the copy instruction itself to be deleted.

Phase 83 uses a split-phase design with phase 82 (AdvancedPhaseBackPropVReg). The actual backward copy propagation algorithm lives in architecture-specific SM backend overrides of phase 82. Phase 83 is a pipeline progress marker that advances the pipeline counter context+1552 to 9 after backward copy propagation completes, signaling to downstream operand encoding functions that they may apply relaxed register constraints.

This phase is disabled by default (isNoOp returns 1). It is activated only when an architecture backend overrides phase 82 to provide its own backward propagation implementation.

Dispatch Mechanism

The execute function is a 7-byte stub that advances the pipeline progress counter:

// sub_C5EB80 -- OriBackCopyPropagate::execute
void execute(phase* self, compilation_context* ctx) {
    ctx->field_1552 = 9;   // advance pipeline progress counter to backward-copy-prop stage
}

Phase 83 does not contain the backward copy propagation algorithm. The actual algorithm is provided by the architecture-specific SM backend that overrides phase 82 (AdvancedPhaseBackPropVReg). The split-phase design works as follows:

| Phase | Role | Default behavior | When arch-activated |
|---|---|---|---|
| 82 (AdvancedPhaseBackPropVReg) | Gate + algorithm provider | No-op (hook, isNoOp = 1) | Arch backend installs backward copy propagation body |
| 83 (OriBackCopyPropagate) | Pipeline progress marker | No-op (isNoOp = 1) | Sets context+1552 = 9, enabling downstream constraint relaxation |

The factory switch at sub_C60D30 installs vtable off_22BE298 for phase 82 and off_22BE2C0 for phase 83. Both vtables are 40-byte (5-pointer) structures at consecutive addresses in .data.rel.ro.

Gate Mechanism (Phase 82)

Phase 82 (AdvancedPhaseBackPropVReg) is one of 16 AdvancedPhase hook points in the pipeline. By default its isNoOp returns true, meaning the phase is skipped entirely. When an architecture backend needs backward copy propagation, it:

  1. Overrides phase 82's vtable to install the actual backward propagation algorithm as the execute function
  2. Overrides phase 82's isNoOp to return 0 (enabled)
  3. Configures phase 83's isNoOp to return 0, enabling the pipeline counter advancement

The BackCopyPropBudget knob (index 808, address 0x21BFDF0) limits the number of backward propagations performed. This knob is read by sub_8C0270 (scheduler initialization) at the point where the scheduler allocates its per-function work structure. When knob 808 is not set by the user, the budget falls back to a default stored in the scheduler state object at offset +92.

Algorithm (Reconstructed)

The backward copy propagation algorithm is reconstructed from the phase name, the infrastructure it shares with forward copy propagation (sub_781F80, sub_763070), the BackCopyPropBudget knob, and the pipeline position constraints. The actual algorithm body resides in architecture-specific SM backend code, not in the generic binary.

procedure BackCopyPropagate(function F):
    budget = knob(808)     // BackCopyPropBudget
    count = 0

    // Phase 1: rebuild def-use chains (shared infrastructure)
    rebuild_def_chains(F)  // sub_781F80
    rebuild_use_chains(F)  // sub_763070

    // Phase 2: walk blocks in RPO, instructions in reverse
    for each basic block B in reverse postorder:
        for each instruction I in B (last to first):
            if count >= budget:
                return

            if I is not MOV (opcode & 0xCF00 != MOV class):
                continue

            // I is: Rd = MOV Rs
            def_of_Rs = reaching_def(Rs)

            // Guard 1: Rs must have exactly one use (this MOV)
            if use_count(Rs) != 1:
                continue

            // Guard 2: def(Rs).dest can be renamed to Rd without conflict
            if not can_rename(def_of_Rs.dest, Rd):
                continue

            // Guard 3: no intervening definition of Rd between def(Rs) and I
            if has_intervening_def(Rd, def_of_Rs, I):
                continue

            // Perform backward propagation: rename definition
            rename def_of_Rs.dest from Rs to Rd
            delete I  // MOV is now redundant
            count++

The backward walk direction is essential for cascading chain collapse:

Before:    R1 = expr;    R2 = R1;    R3 = R2
                                      ^^^^^^ processed first (backward)
Step 1:    R1 = expr;    R3 = R1;    (deleted R3=R2, renamed R2→R3 in "R2=R1")
                         ^^^^^^ processed next
Step 2:    R3 = expr;                (deleted R3=R1, renamed R1→R3 in "R1=expr")

Result: entire 3-instruction chain collapses to single "R3 = expr"

If the walk were forward, only R2 = R1 would be processed first (renaming R1 = expr to R2 = expr), but then R3 = R2 would need a second pass to collapse further. The backward direction achieves full chain collapse in a single pass.
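The cascading collapse can be demonstrated with a small straight-line sketch of the reconstructed algorithm. This is an assumption-laden illustration (SSA-like single-definition registers, no intervening-def or budget guards, a `(dest, op, srcs)` tuple IR), not the architecture backend's implementation:

```python
# Single-pass backward MOV-chain collapse over straight-line code.
# Guards 2 and 3 and the budget check from the pseudocode are omitted.

def back_copy_prop(instrs):
    """instrs: list of (dest, op, srcs) in program order, single-def regs."""
    use_count = {}
    for _, _, srcs in instrs:
        for s in srcs:
            use_count[s] = use_count.get(s, 0) + 1

    rename = {}                              # src reg -> final dest register
    kept = []
    for dest, op, srcs in reversed(instrs):  # walk last to first
        dest = rename.pop(dest, dest)        # apply a pending rename at the def
        if op == 'mov' and use_count.get(srcs[0], 0) == 1:
            rename[srcs[0]] = dest           # Guard 1 holds: delete this MOV
            continue
        kept.append((dest, op, srcs))
    kept.reverse()
    return kept
```

Because the rename recorded at a deleted MOV is applied when the walk later reaches the defining instruction, and that instruction may itself be a MOV, the whole chain collapses in one backward sweep, exactly as in the three-instruction trace above.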

Why Phase 83 Runs So Late

Phase 83 is positioned at pipeline slot 83 out of 158, immediately before the register attribute computation sequence (phases 84--95). This late position serves three purposes:

  1. Catches late-created copies. Phases 66--81 include late optimizations (LICM, texture movement, rematerialization, late arch-specific peepholes) that frequently insert new MOV instructions. Backward copy propagation after these passes cleans up the residual chains that forward propagation (which last ran in phase 65) cannot see.

  2. Reduces register pressure for allocation. Every eliminated MOV is one fewer live range the register allocator (phase 101) must handle. By running just before the liveness/DCE pass (phase 84, OriPerformLiveDeadFourth), backward copy propagation minimizes the input to register allocation.

  3. Safe renaming window. After phase 83, the pipeline enters the register attribute and legalization sequence. Renaming destinations before this point avoids conflicts with the fixed register assignments that legalization may impose.

Why Disabled by Default

Phase 83 is disabled by default (isNoOp returns 1) for several reasons:

  1. Backward renaming is inherently riskier than forward propagation. Forward copy propagation modifies uses (safe because the original definition still exists). Backward copy propagation modifies definitions -- changing which register an instruction writes to. A bug here can silently corrupt values used by other instructions.

  2. Architecture-specific register constraints. The legality of renaming a destination depends on target-specific constraints: fixed-function registers (thread ID, special purpose), register bank conflicts, paired/grouped register requirements for 64-bit operations, and uniform register constraints on newer architectures (Volta+). Only the architecture backend knows which renames are safe.

  3. Diminishing returns. Forward copy propagation (OriCopyProp) runs six times during the GeneralOptimize bundles (phases 13, 29, 37, 46, 58, 65) and handles the majority of copy elimination. Backward propagation catches only residual chains that forward propagation structurally cannot eliminate.

  4. Gate requirement. Architecture backends that enable backward copy propagation via phase 82 may also need to pre-process the IR (e.g., marking registers that must not be renamed, or inserting constraints that protect fixed-function registers).

Downstream Effects: Pipeline Counter and Encoding Relaxation

When phase 83 sets context+1552 to 9, two operand encoding pattern functions (sub_9BF350 and sub_9BFAF0) change behavior. These functions gate on two conditions:

// Gate check in sub_9BF350 and sub_9BFAF0
if ((context->field_1398 & 0x04) != 0 && context->field_1552 > 9) {
    // Apply register constraint relaxation
    // Check if operand register class == 3 (address register) or reg_id == 41
    // Assign special operand mask 0xFFFFFA (16777210) instead of 0xFFFFFF
}

The flag at context+1398 bit 2 is an architecture capability flag. Note that the comparison is strict: setting the counter to 9 in phase 83 arms the gate but does not by itself satisfy field_1552 > 9; the relaxation takes effect once a later phase advances the counter further. When both conditions are met, the encoding functions relax operand constraints for address registers (class 3) and special register 41, allowing these to participate in operand patterns that they would otherwise be excluded from.

The pipeline counter value 9 is part of a progression: phase 95 (SetAfterLegalization, sub_C5E440) later advances the counter to 19, enabling a further tier of relaxation in the scheduler initialization (sub_8C0270).

Forward vs. Backward Copy Propagation

The two propagation directions are complementary and handle different structural patterns:

| Property | Forward (OriCopyProp) | Backward (OriBackCopyPropagate) |
|---|---|---|
| Direction | Replaces uses of copy destination with copy source | Replaces definitions to eliminate copies |
| Example | R2=R1; ADD R3,R2,R4 -> ADD R3,R1,R4 | R1=expr; R2=R1 -> R2=expr |
| Runs | 6 times (phases 13,29,37,46,58,65) | Once (phase 83) |
| Default | Always enabled | Disabled (arch-gated) |
| Risk | Low (original def unchanged) | Higher (modifies defs) |
| Catches | Most copies from expansion and lowering | Residual chains from late passes (66--81) |

Controlling Knobs

| Knob | Address | Purpose |
|---|---|---|
| BackCopyPropBudget | 0x21BFDF0 | Maximum backward propagations per function (knob index 808) |

Forward Copy Propagation -- OriCopyProp

Overview

Forward copy propagation is not a standalone pipeline phase but a sub-pass within each of the six GeneralOptimize bundles (phases 13, 29, 37, 46, 58, 65). It is identified by the name string OriCopyProp at address 0x21E6CE1 and can be individually targeted via the --named-phases mechanism.

The OriCopyProp name appears in the NamedPhases parser (sub_9F4040 at offset +1648), where it is looked up via sub_C641D0 (case-insensitive binary search over the phase name table). When the user specifies --named-phases OriCopyProp, the system resolves this to the appropriate sub-pass within GeneralOptimize.

Target Opcodes and Flag Bits

Three Ori opcodes are candidates for forward copy propagation:

| Opcode | Meaning | Propagation Rule |
|---|---|---|
| 97 | Definition anchor (STG in ROT13; used internally as a register-to-register MOV/definition marker -- actual SASS MOV is opcode 19) | Replace uses of destination with source |
| 18 | Predicated copy | Propagate only under matching predicate guard |
| 124 | Conditional select (CSEL) | Propagate when select condition is provably constant |

Opcode matching uses a mask: (instr.opcode & 0xCF00) == target, stripping modifier bits in the upper nibble of the opcode field at instruction offset +72.

Three flag bits on instruction field [6] (byte offset 24) track propagation state:

| Bit | Hex | Meaning |
|---|---|---|
| 8 | 0x100 | Copy has been propagated |
| 9 | 0x200 | Deferred cleanup (instruction may still be needed) |
| 10 | 0x400 | Under predicate guard (requires predicate-aware handling) |

Eligibility Check (sub_8F2E50)

The eligibility function checks whether a copy can be safely propagated, with an SM-version-dependent constraint:

function isEligibleForPropagation(instr, ctx):
    sm_version = *(ctx + 372)
    if sm_version <= 20479:        // Volta (sm_70) and earlier
        return true                // unconditionally safe
    else:                          // Turing+ (sm_75+)
        return (instr.operand_type & 0x1C00) == 0   // constraint bits must be clear

The SM version threshold 20479 corresponds to the boundary between Volta (sm_70) and Turing (sm_75). Turing introduced new operand constraint bits that restrict when copies can be folded.

Algorithm

Forward copy propagation replaces uses of a copy's destination with the copy's source:

procedure OriCopyProp(function F):
    for each basic block B in RPO:
        for each instruction I in B:
            if I is MOV Rd, Rs:
                for each use U of Rd that I dominates:
                    if Rs is still live at U:
                        replace Rd with Rs in U
                if Rd has no remaining uses:
                    mark I as dead

Within the GeneralOptimize loop, copy propagation interacts with constant folding and algebraic simplification: a copy propagation may expose a constant operand, enabling constant folding in the next iteration, which may create a dead instruction for DCE. This is why GeneralOptimize runs as a fixed-point loop. In Variant A (phases 13, 29), the fixed-point iteration is capped by knob 464. In Variant B (phases 37, 58), convergence uses a cost-based threshold of 0.25 (knob 474). Two-pass predicate simplification via sub_908A60 runs within the copy propagation loop to handle predicate-conditional copies.
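The forward direction can be sketched over straight-line code in a few lines. As with the other sketches on this page, the `(dest, op, srcs)` tuple IR and names are illustrative assumptions; dead-MOV deletion is left to DCE, matching the pipeline's division of labor:

```python
# Straight-line sketch of forward copy propagation: uses of a copy's
# destination are rewritten to the copy's source.

def forward_copy_prop(instrs):
    """instrs: list of (dest, op, srcs) in program order (one block, SSA-like)."""
    copy_of = {}                                     # copy dest -> canonical src
    out = []
    for dest, op, srcs in instrs:
        srcs = tuple(copy_of.get(s, s) for s in srcs)   # replace uses
        if op == 'mov':
            copy_of[dest] = srcs[0]                  # record; DCE deletes later
        out.append((dest, op, srcs))
    return out
```

Note that resolving each source through `copy_of` before recording a new copy makes chains of MOVs collapse to the original source in a single forward pass.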

Controlling Knobs

| Knob | Address | Purpose |
|---|---|---|
| CopyPropBudget | 0x21BECD0 | Maximum instructions processed per invocation |
| CopyPropGlobalBudget | 0x21BEC70 | Budget for cross-block (global) copy propagation |
| CopyPropForceGlobal | 0x21BEC90 | Force global copy propagation |
| CopyPropAddr | 0x21BECE8 | Propagate through address computations |
| CopyPropConstantBank | 0x21BECB0 | Propagate constant bank references |
| CopyPropUseReachingDefs | 0x21BEBD0 | Use reaching definitions for more aggressive propagation |
| CopyPropPreAllocReg | 0x21BEBF0 | Enable for pre-allocated (fixed) registers |
| CopyPropNoWriteNonRR | 0x21BEC10 | Disable into non-register-register contexts |
| CopyPropNonRegMultiDef | 0x21BEC30 | Handle non-register multi-definition copies |
| CopyPropNoMmaCb | 0x21BEC50 | Disable into MMA constant bank operands |
| LateCopyPropComplPred | 0x21BC680 | Late copy propagation for complementary predicates |

The CopyPropUseReachingDefs knob is particularly significant: when enabled, the pass uses reaching definitions analysis (built by sub_781F80) instead of simple dominator checks, allowing more aggressive propagation at the cost of additional analysis time.


Complete Knob Reference

All 24 knobs controlling copy propagation and CSE:

| Knob | ROT13 | Address | Controls |
|---|---|---|---|
| EnableGvnCse | RanoyrTiaPfr | 0x21BDA50 | Master enable for phase 49 |
| EnableGvnCseMode | -- | knob 402 | GVN mode selector (0=off, 1=simple, 2=standard, 3-6=full) |
| EnableGvnCsePerInstr | -- | knob 257 | Per-instruction GVN enablement gate |
| AllowReassociateCSE | NyybjErnffbpvngrPFR | 0x21C0180 | Master enable for reassociation CSE |
| ReassociateCSEBudget | ErnffbpvngrPFROhqtrg | 0x21BA810 | Instruction budget |
| ReassociateCSEWindow | ErnffbpvngrPFRJvaqbj | 0x21BA7D0 | Sliding window size |
| ReassociateCSESkip | ErnffbpvngrPFRFxvc | 0x21BA7F0 | Skip first N |
| ReassociateLargeImmInUIADD64 | ErnffbpvngrYnetrVzzVaHVNQQ64 | 0x21BA7A0 | 64-bit ADD imm |
| DistributeAndReassociateMulBudget | QvfgevohgrNaqErnffbpvngrZhyOhqtrg | 0x21BDDC0 | Distributive law |
| ForceLateCommoning | SbeprYngrPbzzbavat | 0x21BD2F0 | Force phase 64 |
| DisableMoveCommoning | QvfnoyrZbirPbzzbavat | 0x21BE2C0 | Disable MOV commoning |
| BackCopyPropBudget | OnpxPbclCebcOhqtrg | 0x21BFDF0 | Phase 83 budget |
| CopyPropBudget | PbclCebcOhqtrg | 0x21BECD0 | Per-invocation budget |
| CopyPropGlobalBudget | PbclCebcTybonyOhqtrg | 0x21BEC70 | Cross-block budget |
| CopyPropForceGlobal | PbclCebcSbeprTybony | 0x21BEC90 | Force global |
| CopyPropAddr | PbclCebcNqqe | 0x21BECE8 | Address prop |
| CopyPropConstantBank | PbclCebcPbafgnagOnax | 0x21BECB0 | Constant bank |
| CopyPropUseReachingDefs | PbclCebcHfrErnpuvatQrsf | 0x21BEBD0 | Reaching defs |
| CopyPropPreAllocReg | PbclCebcCerNyybpErt | 0x21BEBF0 | Fixed registers |
| CopyPropNoWriteNonRR | PbclCebcAbJevgrAbaEE | 0x21BEC10 | Non-RR disable |
| CopyPropNonRegMultiDef | PbclCebcAbaErtZhygvQrs | 0x21BEC30 | Multi-def |
| CopyPropNoMmaCb | PbclCebcAbZznPo | 0x21BEC50 | MMA disable |
| LateCopyPropComplPred | YngrPbclCebcPbzcyCerq | 0x21BC680 | Compl pred |
| SpeculativeHoistCommonInsts | FcrphyngvirUbvfgPbzzbaVafgf | 0x21B81B0 | Spec hoist (phase 56) |

Interaction Between Passes

The copy propagation and CSE passes interact with each other and with the rest of the pipeline in a specific sequence designed to maximize redundancy elimination:

Phase 46: GeneralOptimizeMid2
  |-- OriCopyProp (forward copy propagation)
  |-- constant folding, algebraic simplification, DCE

Phase 48: EnforceArgumentRestrictions
  |-- may insert MOVs for ABI compliance -> new copy prop opportunities

Phase 49: GvnCse
  |-- global value numbering + CSE
  |-- eliminates redundant computations across basic blocks

Phase 50: OriReassociateAndCommon
  |-- normalizes expression trees for better commoning
  |-- local CSE over reassociated IR
  |-- catches cases GvnCse missed due to non-canonical form

Phase 51: ExtractShaderConstsFinal
  |-- may replace computations with constant loads -> dead code

Phase 58: GeneralOptimizeLate
  |-- OriCopyProp again (cleans up after expansion passes)

Phase 63: OriDoPredication
  |-- converts branches to predicated instructions
  |-- previously mutually-exclusive code becomes linear

Phase 64: LateOriCommoning
  |-- CSE on newly-linearized predicated code
  |-- eliminates redundancies exposed by if-conversion

Phase 65: GeneralOptimizeLate2
  |-- OriCopyProp + DCE (final cleanup)

Phase 82: AdvancedPhaseBackPropVReg (gate, arch-specific)
Phase 83: OriBackCopyPropagate
  |-- backward MOV chain elimination (disabled by default)
  |-- reduces copy count before register allocation

Key Function Map

| Address | Size | Name | Purpose |
|---|---|---|---|
| 0xC5F000 | 16 B | GvnCse::execute | Thunk to sm_backend (context+0x630)->vtable[23] |
| 0xC5F010 | 6 B | GvnCse::getName | Returns 49 |
| 0xC5F020 | 6 B | GvnCse::isNoOp | Returns 0 (enabled) |
| sub_BEE590 | ~200 B | GvnCse body (SM60/70/90) | Entry point: rebuilds def chains, dispatches to mode |
| sub_BEE370 | ~550 B | GvnCse mode dispatcher | Queries knob 402, selects mode 0-6 |
| sub_BED7E0 | ~18 KB | FullGvnCse (modes 3-6) | RPO block walk + dominator-scoped CSE, 689 lines |
| sub_BEAD00 | ~2.5 KB | StandardGvnCse (mode 2) | Dominator-guided GVN for SM < 32K threshold |
| sub_BEA5F0 | ~9 KB | PerDominatedBlockCse | Per-block CSE within dominator subtree, commutative canon |
| sub_BEA450 | ~2 KB | SimpleGvn (mode 1) | Basic GVN variant |
| sub_BEA1E0 | ~500 B | GvnCse eligibility check | Opcode-based CSE eligibility (16,122,145,183,186,...) |
| sub_BED430 | ~2 KB | EBB pre-pass | Extended basic block identification (gated by knob 210) |
| sub_661250 | 6 B | GvnCse no-op stub | Returns 0 (SM30/50/80/89 vtable slot 23) |
| sub_7846D0 | -- | Build dominator tree | Also computes RPO ordering at context+792 |
| sub_661750 | -- | Scoped value tree | Init/destroy balanced BST for dominator scoping |
| 0xC604D0 | 42 B | OriReassociate::execute | Dispatches to sm_backend (context+1584)->vtable[44] |
| 0xC5EFE0 | 6 B | OriReassociate::getName | Returns 50 |
| 0xC5EFF0 | 6 B | OriReassociate::isNoOp | Returns 0 (enabled) |
| 0xC60020 | 48 B | LateOriCommoning::execute | Calls sub_9059B0 |
| 0xC5EDF0 | 6 B | LateOriCommoning::getName | Returns 64 |
| 0xC5EE00 | 6 B | LateOriCommoning::isNoOp | Returns 0 (enabled) |
| 0xC5EB80 | 7 B | BackCopyProp::execute | Sets context+1552 = 9 (pipeline progress marker) |
| 0xC5EB90 | 6 B | BackCopyProp::getName | Returns 83 |
| 0xC5EBA0 | 6 B | BackCopyProp::isNoOp | Returns 1 (disabled) |
| 0xC5EBB0 | 6 B | AdvancedPhaseBackPropVReg::getName | Returns 82 |
| 0xC5EBC0 | 6 B | AdvancedPhaseBackPropVReg::isNoOp | Returns 0 (overridden to 1 at runtime by default vtable) |
| sub_9BF350 | 8.6 KB | Encoding pattern (post-phase-83) | Checks context+1552 > 9 for register constraint relaxation |
| sub_9BFAF0 | 9.0 KB | Encoding pattern (post-phase-83) | Checks context+1552 > 9 for register constraint relaxation |
| sub_8C0270 | 14 KB | Scheduler vtable init | Reads knob 808 (BackCopyPropBudget), checks +1552 == 19 |
| sub_9059B0 | ~320 B | LateOriCommoning impl | Knob check + ref-counted working set + core walker |
| sub_9055F0 | ~800 B | LateCommoning core | Iterates code list, remaps operands, calls commoning check |
| sub_901A90 | ~1.5 KB | Commoning check | Hash lookup + dominance verify + replacement |
| sub_74ED70 | ~1.2 KB | Instruction hash | Opcode + type + operand VNs + address space -> hash |
| sub_781F80 | -- | Rebuild def chains | Reaching definitions for commoning |
| sub_763070 | -- | Rebuild use chains | Use-def chains |
| sub_7E6090 | -- | Compute hash values | Pre-computes per-instruction hashes |
| sub_7DDB50 | ~140 B | get_function_count | Returns func count from compilation context |
| sub_7DF3A0 | ~80 B | is_pure_instruction | Side-effect-free check (bits 2-3 of status word) |
| sub_748440 | -- | Hash combine | Mixes operand hashes into instruction hash |
| sub_8F2CD0 | -- | Propagate equivalence | MOV-based value equivalence propagation |
| sub_8FCE70 | ~150 B | Ref-count release | Releases ref-counted working set objects |
| sub_1245740 | -- | Dominance check | O(1) bitvector bit test for CSE safety |
| sub_6B9180 | -- | Set membership test | Commoning set contains check |
| sub_9253C0 | -- | Instruction deletion | Removes dead/redundant instructions |
| sub_90A340 | 1.7 KB | Commoning body | Commoning pass instance (21 callees, confirms operand comparison pattern) |
| sub_908A60 | -- | Predicate simplifier | Two-pass (forward+backward) predicate simplification in copy prop |
| sub_8F2E50 | -- | Copy/fold eligibility | SM-version-dependent eligibility check (threshold 20479) |
| sub_7BA510 | 5.2 KB | HashCompute | Program/instruction sequence hash (FNV/Jenkins variant) |
| sub_7BB260 | 3.5 KB | HashAccumulate | Incremental hash accumulation |
| sub_8DCF20 | 23 KB | FNV-1a hash table | 8-byte key hash table with chained collision (24-byte entries) |
| sub_8DF1C0 | 16 KB | FNV-1a hash table | 32-bit key hash table, two-level structure |
| sub_9B1200 | 7.7 KB | Code-caching hash | Jenkins-style instruction fingerprint for RA cache |

Hash Infrastructure

The GVN/CSE passes share hash infrastructure with other subsystems (scheduling, code caching, register allocation). All FNV-1a implementations in ptxas use the same constants:

| Constant | Value | Purpose |
|---|---|---|
| FNV offset basis | 0x811C9DC5 | Initial hash state |
| FNV prime | 16777619 (0x01000193) | Multiplication factor per byte |
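For concreteness, the textbook 32-bit FNV-1a loop using these constants (a standard reference implementation, not a transcription of any particular ptxas routine):

```python
# Textbook 32-bit FNV-1a with the offset basis and prime listed above.

def fnv1a_32(data: bytes) -> int:
    h = 0x811C9DC5                           # offset basis: initial state
    for byte in data:
        h ^= byte                            # XOR the byte in first...
        h = (h * 0x01000193) & 0xFFFFFFFF    # ...then multiply by the prime
    return h
```

The XOR-then-multiply ordering is what distinguishes FNV-1a from plain FNV-1, and is the variant the recovered constants imply.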

Hash-related functions identified in the binary:

| Address | Size | Function | Used By |
|---|---|---|---|
| sub_7BA510 | 5.2 KB | HashCompute -- program/instruction sequence hash | Shader hash matching (SH= knob) |
| sub_7BB260 | 3.5 KB | HashAccumulate -- incremental hash accumulation | Instruction-at-a-time hashing |
| sub_8DCF20 | 23 KB | FNV-1a hash table (8-byte keys, chained collision) | Instruction deduplication in scheduling |
| sub_8DF1C0 | 16 KB | FNV-1a hash table (32-bit keys, two-level) | Opcode pattern classification |
| sub_9B1200 | 7.7 KB | Jenkins-style instruction hash for code caching | Register allocator cache hit detection |
| sub_74ED70 | ~1.2 KB | Per-instruction hash for commoning | LateOriCommoning (phase 64) |
| sub_748440 | -- | Hash combine helper | Mixes operand hashes into instruction hash |

The code-caching hash at sub_9B1200 uses a different algorithm from FNV-1a:

hash = (1025 * (value + hash)) ^ ((1025 * (value + hash)) >> 6)

It processes instruction opcodes (offset +72), operand counts (+80), operand encodings (+76), register properties (+64), and variable pair mode (bits 20-21 of the descriptor at offset +48).
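The accumulation step is a direct transcription of the recovered formula; wrapping it in a loop over the listed instruction fields is an assumption about how the fingerprint is built up:

```python
# Recovered mixing step of the sub_9B1200 code-caching hash; the driver loop
# over fields is an illustrative assumption.

def cache_hash_step(h, value):
    t = (1025 * (value + h)) & 0xFFFFFFFF
    return (t ^ (t >> 6)) & 0xFFFFFFFF       # fold high bits back into low bits

def cache_hash(fields):
    """fields: opcode, operand count, encodings, etc., mixed one at a time."""
    h = 0
    for v in fields:
        h = cache_hash_step(h, v)
    return h
```

The multiply-by-1025 (a shift-add: `x + (x << 10)`) followed by a 6-bit shifted XOR is the Jenkins one-at-a-time family of mixers, consistent with the "Jenkins-style" characterization in the function tables.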


Cross-References

Predication (If-Conversion)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

OriDoPredication (phase 63) is the if-conversion pass in ptxas. It transforms short conditional branch regions into predicated straight-line code, eliminating branches by guarding individual instructions with predicate registers. On NVIDIA GPUs, where all threads in a warp execute in lockstep, eliminating divergent branches avoids the performance penalty of serialized path execution under the SIMT model.

Phase index | 63
Phase name | OriDoPredication
Category | Optimization
Entry point | sub_1381DA0 (1,517 bytes)
Core driver | sub_1381CD0 (206 bytes)
Main loop | sub_1381010 (3,249 bytes)
Total code | ~17 KB across 19 functions in 0x137D8B0--0x13829F0
SSA window | Yes -- runs at phase 63, within the partial-SSA window (phases 23--73)
Pipeline position | After OriRemoveRedundantMultiDefMov (62), before LateOriCommoning (64)
Gating | Disabled when bit 5 of context+1376 flags is set; can be disabled via PTXAS_DISABLED_PASSES containing "Predication"
Knob controls | Knob 487 (enable/limit gate), knob 577 (per-region enable), knob 579 (texture-bearing region gate), knob 582 (block-level cold-region query), knob 260 (extra-latency penalty check)

GPU Motivation

The SIMT execution model makes predication qualitatively different from its role on scalar CPUs.

On a scalar CPU, a correctly-predicted branch is essentially free -- the branch predictor eliminates the control flow cost. If-conversion on CPUs is a niche optimization applied only when branches are highly unpredictable.

On a GPU, a divergent conditional branch forces the warp to serialize: the hardware executes the taken path with some threads masked off, then executes the not-taken path with the complementary mask. Both paths execute regardless, and the warp reconverges at the post-dominator. The cost is the sum of both paths, not the maximum.

Predication eliminates this divergence penalty entirely. Both paths still execute, but without the overhead of stack-based reconvergence (BSSY/BSYNC pairs on sm_70+), without the branch instruction itself, and with the ability for the scheduler to interleave the predicated instructions with other independent work. For short regions (a few instructions per side), predication is strictly superior to branching.

Branching (divergent):               Predicated:

  ISETP.NE P0, R4, R5               ISETP.NE P0, R4, R5
  BSSY B0, target                    @P0  IADD3 R6, R6, 1, RZ
  @P0 BRA taken_path                 @!P0 IADD3 R7, R7, 1, RZ
  // not-taken:                      // continues straight-line
  IADD3 R7, R7, 1, RZ
  BRA rejoin
  // taken:
  IADD3 R6, R6, 1, RZ
  // rejoin:
  BSYNC B0

The branching version issues seven instructions (the shared ISETP, the BSSY/BSYNC convergence bookkeeping, two branches, and the two adds) and forces warp serialization. The predicated version issues three and executes without divergence.
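
The instruction-count asymmetry can be captured in a toy issue-count model. This is illustrative only: `Region`, the constant overheads, and the omission of scheduling effects are all simplifications of the example above, not anything recovered from the binary:

```c
/* Toy issue-count model for a divergent diamond, showing why short
   regions favor predication. Constants mirror the example above. */
typedef struct {
    int then_len;   /* instructions on the taken side        */
    int else_len;   /* instructions on the fall-through side */
} Region;

/* Divergent branch: under SIMT both paths issue serially, plus the
   shared compare (ISETP), two branches, and a BSSY/BSYNC pair for
   stack-based reconvergence on sm_70+. */
static int cost_branching(Region r)
{
    return 1 /* ISETP */ + 2 /* BRA x2 */ + 2 /* BSSY+BSYNC */
         + r.then_len + r.else_len;
}

/* Predicated: both paths still issue, guarded, with no control
   overhead beyond the shared compare. */
static int cost_predicated(Region r)
{
    return 1 /* ISETP */ + r.then_len + r.else_len;
}
```

For the one-instruction-per-side example above this gives 7 vs. 3; the gap is a constant, so it dominates exactly when the region is short, which is why the profitability heuristic below is built around size thresholds.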

Algorithm Overview

The pass operates in three layers:

  1. Entry and gating (sub_1381DA0): checks the "Predication" disable flag and knob 487, initializes working state, calls the driver.
  2. Iterative driver (sub_1381CD0): initializes via the SM backend's vtable dispatch at sm_backend+1296, then calls the main loop up to 3 times (controlled by a knob at options offset 41768) with different aggressiveness settings.
  3. Main RPO loop (sub_1381010): walks the RPO block order, identifies candidate branch regions, evaluates profitability, and applies the transformation.

Entry Point -- sub_1381DA0

sub_1381DA0(compilation_unit):
    if context+1376 bit 5 set:
        return                       // phase disabled by flag

    knob_state = *(context+1664)     // OCG knob container
    mode = *(*(knob_state+72) + 16416)

    if mode == 0:
        limit = (context+1419 bit 4) != 0
    elif mode == 1:
        limit = *(*(knob_state+72) + 16424)
    else:
        // mode >= 2: skip limit check

    IsPassDisabled(knob_state, "Predication", &disabled)
    if disabled or limit:
        return

    // Check knob 487 iteration limit
    CheckKnob487(knob_state)

    // Set up working state (allocate two pool objects)
    context+1385 |= 1               // mark predication active
    sub_1381CD0(state)               // call driver
    context+1385 &= ~1              // clear predication flag

    // Cleanup: release pool objects and tree structures

The context+1385 byte has bit 0 set during predication execution, which signals downstream code (such as sub_137EE50) that the pass is active.

Iterative Driver -- sub_1381CD0

sub_1381CD0(state):
    // Initialize via SM backend
    sm_backend = *(context+1584)
    init_fn = vtable(sm_backend)+1296
    if init_fn == sub_7D82C0:       // fast path: zero-init
        clear state fields
    else:
        init_fn(sm_backend, state)   // backend-specific init

    bb_count = *(context+520)
    if bb_count <= 1: return 0       // nothing to if-convert

    // Determine iteration count from knob at options+41760
    iterations = 0
    if *(options+41760) == 1:
        iterations = *(options+41768)

    // First pass: always run
    state[14].byte[8] = 0           // not second-pass mode
    changed = sub_1381010(state)

    // Optional second/third pass with relaxed thresholds
    while changed and iterations > 0:
        state[14].byte[8] = (iterations == 1)
        changed = sub_1381010(state)
        if iterations <= 2: break

The iteration mechanism allows the pass to make a second (and potentially third) traversal with progressively relaxed profitability thresholds. The flag at state[14].byte[8] signals the final iteration, which changes some size-limit comparisons in the profitability heuristic.

Main Loop -- sub_1381010

The main loop walks basic blocks in RPO order (via the block index array at context+512), identifies candidate branch regions, and decides whether to if-convert each one.

sub_1381010(state):
    // Rebuild liveness and CFG
    sub_781F80(context, 1)          // rebuild liveness
    if context+1370 bit 4 set:
        sub_A10160(context, 1)      // rebuild analysis
    sub_7E6090(context, 0,0,0,0)    // refresh CFG
    // Clear block-76 fields
    for each block in chain:
        block+76 = 0
    sub_791F00(context, 0)          // clear RPO numbering

    changed = false
    for rpo_idx = 2 .. bb_count:
        bb = bb_array[rpo_order[rpo_idx]]

        if bb is same as previous region tail:
            // Continuation of prior diamond -- reuse state
            restore saved state
        else:
            // Fresh candidate: analyze new region
            init candidate state
            if not isTriangleDiamondCandidate(bb):
                skip
            if not analyzeRegion(state, candidate):
                skip

        // Region identified -- extract branch info
        header = bb
        true_target = successor of header's terminator
        branch_pred = extractBranchPredicate(header)
        false_target = fallthrough

        // Try to if-convert both sides
        if evaluateProfitability(true_side, false_side):
            applyTransformation(...)
            changed = true

    if changed:
        context+1370 &= ~4          // invalidate CFG
        sub_785E20(context, 0)       // rebuild
    return changed

CFG Pattern Recognition

The pass recognizes three CFG shapes for if-conversion:

Triangle Pattern

One arm of the branch is empty (falls through directly to the merge point).

         [header]
        /    \
       /      \
   [then]      |
       \      /
        \    /
       [merge]

Requirements:

  • header ends with a conditional branch (opcode 93; OUT_FINAL in the ROT13 name table, but checked here as a control-flow terminator marker)
  • then block has a single predecessor (the header)
  • then block's sole successor is the merge block
  • merge has exactly two predecessors: header and then
  • No backedges into the region

Diamond Pattern

Both arms contain instructions.

         [header]
        /    \
       /      \
   [then]  [else]
       \      /
        \    /
       [merge]

Requirements (same as triangle, plus):

  • The else block has a single predecessor (the header)
  • The else block's sole successor is the same merge block
  • merge has exactly two (or three, for extended diamonds) predecessors

Extended Diamond Pattern

The pass can also handle diamonds where one or both arms chain through a successor block before merging. The sub_137FE10 function implements this extended analysis, walking forward through fall-through blocks until it reaches a merge point or encounters a block that fails the candidate check.

         [header]
        /    \
       /      \
   [then]  [else]
      |       |
   [then2] [else2]   (optional chain blocks)
       \      /
        \    /
       [merge]

Region Analysis -- sub_137E3A0

This function (sub_137E3A0, 367 bytes) validates that a basic block is part of a valid if-conversion candidate. It checks:

  1. Predecessor count: The merge block must have exactly header_predecessor_count + 1 predecessors.
  2. Terminator type: The header's terminator must match opcode 95 after masking bits 12-13 (STS in the ROT13 name table; used here as a control-flow terminator class marker, not an actual store-shared instruction).
  3. Branch predicate: The branch guard must be a non-negated register operand (type field (>>28)&7 == 1), from the predicate register file (register file type checked against the state's expected file types 2 or 3, corresponding to R or UR).
  4. No backedges: The predecessor list must not contain a self-edge.
  5. Merge block successor check: Validates that the merge block's sole successor leads to the expected continuation block.
// Pseudocode for sub_137E3A0
bool isTriangleDiamondCandidate(state, bb):
    pred_count = bb->predecessor_count    // at bb+144
    if pred_count == 0: return false
    preds = bb->predecessor_list          // at bb+128
    if preds == NULL: return false
    if preds->next != NULL: return false  // must be single-entry

    header = bb_array[preds->block_index]
    if header->predecessor_count + 1 != pred_count:
        return false

    terminator = header->first_instr
    opcode = terminator->opcode & 0xFFFFCFFF   // mask bits 12-13
    if opcode != 95: return false               // opcode 95 = STS in ROT13 table; used as control-flow terminator class

    // Extract branch predicate from last operand
    last_op_idx = terminator->num_operands - ((opcode >> 11) & 2) - 2
    pred_operand = terminator->operands[last_op_idx]
    if operand_type(pred_operand) != 1: return false   // must be register
    if pred_operand is negated: return false

    reg_file = get_register_file(pred_operand)
    if reg_file != state->expected_file: return false

    // Check successor list for backedges
    for each successor of header:
        if successor == bb: continue
        if other_successor exists: return false  // at most one other
    return true

Region Scanning -- sub_137D990

This function (1,270 bytes) walks all instructions in a candidate block, counting them and checking each for predicability. It builds a cost model:

Per-Instruction Checks

For each instruction in the candidate block:

  1. Already-predicated check (opcode bit 12 = 0x1000): Instructions that already carry a predicate guard are flagged via state+48 for special handling.

  2. MOV counting (opcode 130): Instructions with opcode 130 (HSET2 in the ROT13 name table; the code treats this value as an internal marker for MOV-like operations) that match specific operand patterns increment a separate MOV counter at state+4, used to adjust profitability thresholds. The actual SASS MOV instruction is opcode 19.

  3. Predicable instruction check (sub_137D8B0): Each instruction is tested via the SM backend's canPredicate vtable method at sm_backend+1424. Instructions that cannot be predicated (atomics, certain memory operations, barriers) cause the scan to fail.

  4. Primary memory load classification: For load instructions (opcode 125 after masking), the memory space is queried via sub_91C840. The internal category number is tested against bitmask 0x90E ((1 << category) & 0x90E), which selects the five primary data memory spaces: .shared (1), .local (2), .const (3), .tex (8), .global (11). When a load targets one of these spaces, the has_primary_memory_load flag is set at candidate+12, which affects profitability thresholds in the heuristic. See the Memory Space Classification for Predication section for the full bitmask decode.

  5. Extra-latency check: Instructions matching opcodes in the set {22, 23, 41, 42, 55, 57, 352, 297} (long-latency operations including texture, surface, and certain memory ops) have their latency contribution tallied at state+16 via the SM backend's getExtraLatency method at sm_backend+1392.

  6. Predicate-register conflict: If any destination operand writes to the same predicate register that the branch uses as its guard, the region cannot be if-converted (the predicate would be clobbered before all instructions are guarded).

  7. Instruction count limit: The non-MOV instruction count at state+8 is compared against a threshold from the state object. If exceeded and the block is not marked as "must-predicate" (state+20), the scan returns failure.

// Pseudocode for sub_137D990
bool analyzeRegion(state, candidate):
    bb = candidate->basic_block
    if bb->flags & 2: return false         // block excluded

    first_instr = bb->first_instruction
    // Check if first instruction is speculative-safe
    if isSpeculativelyUnsafe(first_instr, context):
        candidate->has_unsafe = first_instr

    // Extract branch predicate register index
    header = bb_array[bb->predecessor->block_index]
    terminator = header->first_instruction
    branch_pred_idx = extractPredicateIndex(terminator)

    // Walk all instructions in the block
    for instr = first_instr; instr != bb->tail; instr = instr->next:
        // Track already-predicated flag
        candidate->has_predicated |= (instr->opcode & 0x1000) != 0

        // Count MOVs
        if isMOV(instr) and matchesMOVPattern(instr):
            candidate->mov_count++

        // Check speculation safety for uniform operands
        if state->has_uniform_speculation:
            check uniform register SSA chain

        // Check predicability via backend
        if not canPredicateInstruction(state, instr, header):
            fail with "too many instructions"

        // Primary memory load classification (0x90E bitmask)
        if isLoadOp(instr):
            space = getMemorySpace(instr)
            if space is in {shared, local, const, tex, global}:
                candidate->has_primary_memory_load = true

        // Extra latency accounting
        if isLongLatencyOp(instr):
            candidate->extra_latency += getExtraLatency(instr)

        // Count non-trivial instructions
        if not isMOVPHI(instr):         // opcode 263 = MOV.PHI
            candidate->instr_count++
            if not candidate->must_predicate:
                if candidate->instr_count > state->threshold:
                    return false

        // Check for predicate-register clobber
        for each destination operand:
            if dest is register and dest index == branch_pred_idx:
                return false

    candidate->complete = true
    return true

Profitability Heuristic -- sub_1380BF0

The profitability decision (sub_1380BF0, 1,055 bytes) is the most complex part of the pass. It considers multiple factors to decide whether converting a branch region to predicated code is profitable.

Decision Flow

sub_1380BF0(state, true_side, false_side, is_reverse, result):
    result = false

    // 1. Texture-bearing region check
    if true_side->has_predicated:
        if not CheckKnob579(knob_state):
            return false

    // 2. Must-predicate override
    if true_side->must_predicate:
        return true

    // 3. CONV.ALLOC check
    if state->has_conv_alloc:
        if not (bb->flags & 8) or not state->flag_byte76:
            return false

    // 4. Branch-predicate matching
    //    Check if the branch condition matches a known pattern
    //    (SEL instruction producing the predicate)
    header_terminator = state->header->first_instruction
    pred_operand = extractLastPredicate(header_terminator)
    if predicateMatchesSELPattern(pred_operand):
        return true

    // 5. False-side memory load check
    if false_side->has_primary_memory_load:
        return sub_137F800(...)        // speculation safety analysis

    // 6. Extra-latency penalty
    if CheckKnob260(knob_state):
        if true_side->extra_latency > 0 and false_side->extra_latency > 0:
            return false               // both sides have long-latency ops

    // 7. Size-based thresholds (main heuristic)
    instr_count = true_side->instr_count

    if true_side->has_primary_memory_load:
        // Memory loads route to extended diamond analysis
        return sub_137FE10(...)        // extended diamond analysis

    mov_count = true_side->mov_count
    if mov_count <= state->mov_threshold:
        if state->flag_byte76:
            // Uniform-speculation-aware thresholds
            if true_side->has_predicated:
                return state->uniform_tex_limit >= instr_count
            else:
                return state->uniform_limit >= instr_count
        else:
            if true_side->has_predicated:
                return state->tex_limit >= instr_count
            else:
                return state->base_limit >= instr_count
                       and (true_extra <= 2 or false_extra <= 2)

    // 8. Fallback: combined size check
    combined = true_side->instr_count + false_side->instr_count
    if state->combined_limit < instr_count and combined > state->threshold:
        return false

    // 9. False-side memory loads boost profitability
    if false_side->has_primary_memory_load:
        return true                    // scheduling overlap benefit
    return sub_1380810(...)            // fall-through block analysis

Threshold Fields

The state object contains multiple instruction-count thresholds, initialized by the scheduler backend during sub_1381CD0:

State offset (as int32 index) | Field | Typical role
[8] | base_limit | Maximum instructions for simple (non-textured, non-uniform) regions
[9] | tex_limit | Maximum instructions for textured regions (without uniform speculation)
[10] | uniform_limit | Maximum instructions with uniform-speculation enabled
[11] | uniform_tex_limit | Maximum for textured + uniform-speculation regions
[12] | threshold | Hard ceiling on non-MOV instruction count
[13] | combined_limit | Maximum for combined (both-sides) instruction count
[14] | fallthrough_limit | Threshold for fall-through block extension
[15] | extended_limit | Threshold for extended diamond regions
[16] | mov_threshold | MOV count below which standard limits apply
[17] | mov_limit | MOV-specific threshold

These values are architecture-specific -- the scheduler backend's vtable method at offset 1296 initializes them based on the SM target and optimization level.
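
Treated as a struct overlay on the int32 array, the threshold block might look like this. The field names are this wiki's labels, not recovered symbols, and the `_reserved` slots stand in for unrelated working state:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical overlay of the predication state's threshold block.
   Slot indices [8]..[17] mirror the table above; names are
   descriptive labels assigned by this wiki. */
typedef struct {
    int32_t _reserved[8];      /* slots [0]..[7]: other working state */
    int32_t base_limit;        /* [8]  simple regions                 */
    int32_t tex_limit;         /* [9]  textured regions               */
    int32_t uniform_limit;     /* [10] uniform-speculation regions    */
    int32_t uniform_tex_limit; /* [11] textured + uniform-speculation */
    int32_t threshold;         /* [12] hard non-MOV instruction cap   */
    int32_t combined_limit;    /* [13] both-sides combined cap        */
    int32_t fallthrough_limit; /* [14] fall-through extension         */
    int32_t extended_limit;    /* [15] extended diamonds              */
    int32_t mov_threshold;     /* [16] MOV count gate                 */
    int32_t mov_limit;         /* [17] MOV-specific cap               */
} PredThresholds;
```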

Instruction Predication -- sub_9324E0

Once a region passes the profitability check, each instruction in the region is predicated. The predication is performed by sub_9324E0 (280 bytes), which transforms each instruction by adding a predicate guard operand.

Transformation Rules

For a non-branch instruction with opcode op:

  1. Copy the operand array, appending the guard predicate as the new last operand and the predicate register as the penultimate operand.
  2. Set bit 12 of the opcode (op | 0x1000) to mark the instruction as predicated.
  3. Special case for opcode 188: remapped to 190.
  4. Special case for opcode 93 (OUT_FINAL in the ROT13 name table; used here as a branch marker): replaced with opcode 95 (STS in the ROT13 name table; used here as a conditional-select construct), not simply predicated.
  5. Emit the new instruction via sub_92C240, which creates the replacement in the code list.
  6. Transfer debug info: *new_instr+32 = *old_instr+32 (debug location).
  7. Delete the original instruction via sub_9253C0.
// Predicate guard encoding in operand word:
//   guard_pred = predicate_reg_index | 0x60000000
//   (guard-word type tag 3 in the top bits: 0x6000_0000 == 3 << 29;
//    register index in low 24 bits)
//
// Example: @P2 IADD3 R0, R1, R2, RZ
//   Original IADD3 operands: [R0_def, R1, R2, RZ]
//   Predicated operands:     [R0_def, R1, R2, RZ, guard_word, P2 | 0x60000000]
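
A minimal sketch of the encode/decode helpers this layout implies; the macro and function names are hypothetical, and the 24-bit index width follows the comment above:

```c
#include <assert.h>
#include <stdint.h>

/* Guard-word encoding from the comment above: 0x60000000 tags the
   operand word as a predicate guard, with the predicate register
   index in the low 24 bits. Names are this wiki's, not symbols. */
#define GUARD_TYPE_TAG   0x60000000u
#define GUARD_INDEX_MASK 0x00FFFFFFu   /* register index bits */

static uint32_t encode_guard(uint32_t pred_reg_index)
{
    assert((pred_reg_index & ~GUARD_INDEX_MASK) == 0);
    return GUARD_TYPE_TAG | pred_reg_index;
}

static uint32_t guard_reg_index(uint32_t word)
{
    return word & GUARD_INDEX_MASK;
}
```

With this, the `@P2` example above becomes `encode_guard(2)`, yielding the operand word `0x60000002`.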

Already-Predicated Instructions

Instructions that already have a predicate guard (bit 12 set in original opcode) are handled by sub_9321B0, which must compose the existing predicate with the new guard using a predicate-AND or predicate-SEL operation rather than simply replacing the guard.

Post-Transformation -- sub_137DE90

After predicating all instructions in a region, sub_137DE90 (1,286 bytes) performs cleanup:

  1. Bitvector maintenance: For each register operand in the predicated instructions, checks whether the register is live in the dominator's bitvector (at context+832). If the register is newly defined under the predicate, marks it in the bitvector via sub_BDBB80. This ensures later liveness analysis accounts for the conditionally-defined values.

  2. Per-instruction predication: Walks the block's instruction list and calls sub_9324E0 on each instruction, passing the predicate register index and the guard operand word.

  3. Predicate register tracking: If any register was newly exposed to the bitvector, and the guard predicate is a non-negated register operand, marks the predicate register's descriptor at +76 with bit 0 set, and increments a counter at state+200.

  4. Cleanup: Resets the per-block tracking arrays (stored at state[27]/state[56..57]) which track which registers were bitvector-updated during this region.

Speculative Execution Safety -- sub_137EE50

After the main if-conversion, sub_137EE50 (969 bytes) performs a secondary scan to identify instructions that were speculatively moved above their original control-flow guard. This function:

  1. Checks the global predication flag at context+1412 and the per-function flag at context+1392 bit 0. If the function already has speculated instructions from a prior pass, returns immediately.

  2. Scans the true-side block for load instructions to global or surface memory (opcodes 183 and 288 after masking). For each such load, queries the memory space via sub_91C840 and checks whether space type 18 (unmapped/invalid) could be accessed.

  3. Records speculatively unsafe instructions in a tracking hash set (at state+240), used by later passes to insert appropriate guard instructions or to avoid further speculation.

  4. Scans the false-side block with the same logic.

The post-predication speculation safety check targets exclusively category 18 (.surf/tensor extended, sm_90+). This is the only memory space that sub_137EE50 treats as requiring speculative-unsafe tracking; global loads and texture loads are considered acceptable for speculative execution in the predication cost model.

Memory Space Classification for Predication

The bitmask 0x90E appears in five functions within the predication pass (sub_137D990, sub_137F560, sub_137F220, sub_137FB60, sub_1380810). All five use the identical test pattern:

category = sub_91C840(operand);          // classify memory space
if (category <= 0xB && ((1LL << category) & 0x90E) != 0)
    // load targets a primary data memory space

Bitmask Decode

0x90E = binary 1001 0000 1110 -- bits {1, 2, 3, 8, 11} are set.

Bit | Category | PTX Space | In 0x90E? | Role in predication
0 | 0 | Generic (unqualified) | No | Unresolved address space -- cannot be classified, excluded
1 | 1 | .shared | Yes | CTA-scope scratchpad; always mapped for executing CTA; 20--30 cycle latency
2 | 2 | .local | Yes | Thread-private stack/frame; always mapped; backed by L1/L2
3 | 3 | .const | Yes | Constant bank (c[bank][offset]); loaded by driver before launch; always mapped
4 | 4 | .param | No | Kernel parameter memory; typically constant-folded or register-promoted by earlier passes
5 | 5 | .const (extended) | No | Extended constant path (PTX inputs 21, 22); different scheduling model
6 | 6 | .global (extended) | No | Extended global variant (PTX input 20); different scheduling model
7 | 7 | Spill space | No | Compiler-generated register spill/fill; handled separately by regalloc
8 | 8 | .tex | Yes | Texture memory; high latency (200+ cycles); texture cache always valid when bound
9 | 9 | Special (opcode-dep.) | No | Ambiguous classification from case-18 sub-switch in sub_91C840
10 | -- | (unused) | No | No memory space maps to category 10

Categories 12--18 (code/function, uniform, register file, surface, surface/tensor extended) all exceed the <= 0xB range check and are excluded from the bitmask test automatically.
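
The shared test reduces to a one-line predicate; a sketch (the function name is this wiki's, not a recovered symbol):

```c
#include <stdbool.h>
#include <stdint.h>

/* The primary-data-memory test shared by the five predication
   functions: category must be <= 0xB and have its bit set in 0x90E.
   Bits {1, 2, 3, 8, 11} = .shared, .local, .const, .tex, .global. */
#define PRIMARY_DATA_SPACES 0x90Eu

static bool is_primary_data_space(uint32_t category)
{
    return category <= 0xBu &&
           ((1u << category) & PRIMARY_DATA_SPACES) != 0;
}
```

The `<= 0xB` range check is what makes the high categories (12--18) fall out automatically, without needing bits above 11 in the mask.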

What the Bitmask Selects

The five selected categories -- shared, local, const, texture, global -- are the primary data memory spaces: the ones that involve real data movement through the GPU memory hierarchy and carry meaningful scheduling latency. These are the loads a scheduler can profitably overlap with predicated computation.

The excluded categories are either:

  • Unresolvable (generic -- could be anything)
  • Non-load in practice (param -- folded away, code -- function pointers)
  • Compiler-internal (spill, special -- the compiler already knows how to handle these)
  • Out of range (register file, uniform, surface, surface/tensor -- categories > 11)

How the Bitmask Affects Profitability

The bitmask test does NOT directly determine speculation safety. It sets a has_primary_memory_load flag at candidate offset +12, which the profitability heuristic (sub_1380BF0) uses in three ways:

  1. True-side memory loads (a2+12 set): The profitability check routes to the extended diamond analysis (sub_137FE10) instead of the standard size-threshold path. This allows larger regions to be if-converted when they contain meaningful loads.

  2. False-side memory loads -- speculation guard (a3+12 set): If the false side has memory loads AND the SM backend's speculation policy (vtable at sm_backend+1200) allows it, the detailed speculation analysis (sub_137F800) is invoked. If that analysis flags the loads as risky, predication is rejected.

  3. False-side memory loads -- profitability boost (a3+12 set, passes safety): If the false side has memory loads and passes safety checks, the profitability heuristic returns true directly (line 166 of sub_1380BF0). The reasoning: if the false-side code contains real memory loads, converting the branch to predicated straight-line code lets the scheduler overlap those loads with other work.

Speculation Safety (Separate Mechanism)

The actual speculation safety tracking is handled by sub_137EE50 (post-predication scan), which uses a different criterion from the 0x90E bitmask:

  • Scans both sides for opcodes 183 (LDG) and 288 (STG) after masking
  • For each, queries sub_91C840 and checks if category == 18 (.surf/tensor extended)
  • Only category 18 loads are tracked as "speculatively unsafe" in the hash set at state+240
  • The context+1392 bit 0 flag persists and is checked by OriHoistInvariantsLate (phase 66)

This means global loads (category 11) that are speculatively predicated are not tracked as unsafe. In the ptxas cost model, global memory loads under a predicate guard are considered acceptable: the hardware will issue the load speculatively, and if the predicate is false, the result is simply discarded. On architectures with memory access traps (e.g., page faults on unmapped addresses), the hardware masks the fault for lanes where the predicate is false. Surface/tensor extended operations (category 18), however, may have side effects that cannot be masked, so they receive the unsafe designation.

Fall-Through Block Analysis -- sub_1380810

When the standard profitability check is inconclusive, sub_1380810 (980 bytes) analyzes the fall-through continuation of the merge block. The idea: even if the region itself is borderline, if the code immediately after the merge point contains long-latency operations (loads, texture fetches), the predicated version may be better because the scheduler can overlap the predicated instructions with those long-latency operations.

The function walks instructions in the merge block's successor(s), using the same 0x90E bitmask test to identify primary-data-memory loads. Non-load instructions are checked via the SM backend's vtable at sm_backend+1824. The function counts:

  • Primary-memory-space loads (via the 0x90E mask)
  • Other long-latency operations (via the backend vtable check)
  • Total instruction count

If the fall-through region contains enough long-latency work (compared to state->fallthrough_limit and state->extended_limit), the function returns true, indicating that predication is profitable despite the region being above the standard size threshold.

Extended Diamond Analysis -- sub_137FE10

For complex diamonds where one side has primary-memory loads that affect profitability thresholds, sub_137FE10 (2,550 bytes) performs a more thorough analysis. It can "look through" the diamond to the merge block and even one block beyond, checking whether the instruction mix in the continuation makes predication worthwhile. It invokes sub_137F560 (which also uses the 0x90E bitmask) to scan continuation blocks for scheduling-relevant loads.

The function also handles the case where the merge block falls through to another conditional branch that itself is a predication candidate -- effectively analyzing a chain of adjacent diamonds.

Interaction with Later Passes

The predication pass is positioned to maximize the benefit of subsequent passes:

Phase | Name | Interaction
64 | LateOriCommoning | Predication may create duplicate computations on both sides of the original branch. Commoning eliminates these by recognizing that @P0 IADD3 R0, R1, R2, RZ and @!P0 IADD3 R0, R1, R2, RZ with the same inputs can be merged into an unconditional instruction.
65 | GeneralOptimizeLate2 | The copy propagation and constant folding sub-passes clean up the predicated code: dead predicate definitions, redundant MOVs introduced by the PHI destruction at merge points, and constant-foldable predicates.
66 | OriHoistInvariantsLate | Predication can convert loop-varying branches into predicated straight-line code. LICM then hoists any newly-exposed loop-invariant computations.
69 | OriDoRemat | Predicated instructions that define values used far from their definition are candidates for rematerialization, reducing register pressure.
70 | OriPropagateVaryingSecond | After predication changes the control flow, varying annotations must be recomputed. The second varying-propagation pass updates which values are uniform vs. divergent.

The context+1392 bit 0 flag set by sub_137EE50 persists through these passes and is checked by OriHoistInvariantsLate to avoid hoisting speculatively-unsafe instructions out of their guarded context.

Key Functions

Address | Size | Function | Role
sub_1381DA0 | 1,517 B | OriDoPredication::execute | Phase entry point; gating, setup, cleanup
sub_1381CD0 | 206 B | runPredicationDriver | Iterative driver; calls main loop up to 3 times
sub_1381010 | 3,249 B | predicationMainLoop | RPO walk, region identification, transformation dispatch
sub_137E3A0 | 367 B | isTriangleDiamondCandidate | CFG pattern validation
sub_137D990 | 1,270 B | analyzeRegion | Per-block instruction scan, cost modeling
sub_137D8B0 | 209 B | canPredicateInstruction | Single-instruction predicability check
sub_1380BF0 | 1,055 B | evaluateProfitability | Multi-factor profitability decision
sub_137FE10 | 2,550 B | analyzeExtendedDiamond | Extended diamond and chain analysis
sub_137F800 | 864 B | analyzeSpeculationSafety | Speculation safety for side-effect loads
sub_1380810 | 980 B | analyzeFallThrough | Fall-through block continuation analysis
sub_137EE50 | 969 B | markSpeculativeInstructions | Post-transformation speculative-load tracking
sub_137DE90 | 1,286 B | applyPredication | Instruction rewriting and bitvector update
sub_137FB60 | 687 B | classifyInstruction | Per-instruction classification during walk
sub_137F560 | 665 B | scanBlockForUnsafe | Block scan for speculative safety
sub_137F220 | 828 B | classifyInstructionExtended | Classification with bitvector tracking
sub_137E510 | 2,360 B | moveInstructionsToHash | Instruction movement during transformation
sub_9324E0 | 280 B | predicateInstruction | Adds predicate guard to single instruction
sub_9321B0 | ~800 B | predicateAlreadyGuarded | Handles already-predicated instructions
sub_92C240 | (shared) | createInstruction | Instruction builder (shared utility)

SASS Predicate Model

NVIDIA SASS provides 7 usable predicate registers (P0--P6) plus the hardwired always-true register PT. Every instruction in the SASS ISA can optionally carry a predicate guard:

@P0  IADD3 R0, R1, R2, RZ    // executes only if P0 is true
@!P2 FMUL  R3, R4, R5         // executes only if P2 is false
     FADD  R6, R7, R8          // unconditional (implicit @PT)

Predicate conditions are set by comparison instructions:

ISETP.GT.AND P0, PT, R1, R2, PT   // P0 = (R1 > R2) AND PT
FSETP.LT.AND P1, P2, R3, R4, PT   // P1 = (R3 < R4), P2 = !(R3 < R4)

Uniform predicates (UP0--UP6, UPT) are the warp-uniform variant available on sm_75+. When all threads in a warp have the same predicate value, using UP instead of P avoids consuming a per-thread predicate register and enables the hardware to skip the entire instruction rather than masking per-thread.

In the Ori IR, predicate operands are encoded with type field 5 (bits 28-30 of the packed operand word). The guard predicate is appended as a pair of extra operands: the guard control word (type 3, 0x60000000 | reg_index) followed by the predicate register operand itself.
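A minimal sketch of this packing, assuming the bit layout recovered above (the helper names are ours, not recovered symbols):

```python
# Sketch of the recovered Ori IR operand encoding. Constants come from
# the decompilation notes above; helper names are our own invention.

TYPE_SHIFT = 28               # operand type occupies bits 28-30
PRED_TYPE = 5                 # type field 5 = predicate operand
GUARD_CTRL_BASE = 0x60000000  # guard control word template

def operand_type(word: int) -> int:
    """Extract the 3-bit type field (bits 28-30) of a packed operand word."""
    return (word >> TYPE_SHIFT) & 0x7

def guard_pair(pred_reg_index: int) -> tuple:
    """Build the two extra operands appended for a guard predicate:
    the control word (0x60000000 | reg_index) followed by the
    predicate register operand itself (type 5)."""
    ctrl = GUARD_CTRL_BASE | pred_reg_index
    pred = (PRED_TYPE << TYPE_SHIFT) | pred_reg_index
    return ctrl, pred
```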

Opcode Reference

Key opcodes referenced by the predication pass (after BYTE1 &= 0xCF masking to clear bits 12-13):

| Value | Mnemonic | Role in predication |
|---|---|---|
| 93 | OUT_FINAL | ROT13 name is OUT_FINAL; used here as a conditional branch marker -- the instruction being eliminated. Actual SASS BRA is opcode 67. |
| 95 | STS | ROT13 name is STS; used here as the branch terminator class marker and conditional-select replacement target. Actual SASS EXIT is opcode 77. |
| 97 | STG | ROT13 name is STG; used here as a block boundary sentinel for scan termination. Actual SASS CALL is opcode 71. |
| 125 | LD (variant) | Load -- checked for speculative safety |
| 130 | HSET2 | ROT13 name is HSET2; used here as an internal marker for MOV-like instructions counted separately for profitability. Actual SASS MOV is opcode 19. |
| 183 | LDG | Global load -- speculative-unsafe |
| 188 | (variant) | Remapped to 190 when predicated |
| 263 | MOV.PHI | SSA phi -- not counted in instruction totals |
| 286 | CONV.ALLOC | Convergence allocation marker -- special handling in profitability check |
| 288 | STG | Global store -- speculative-unsafe |
| 352, 297 | (long-latency) | Texture/surface ops -- extra latency penalty |

Cross-References

Rematerialization

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Rematerialization is the compiler technique of recomputing a value near its use instead of keeping the original definition live across a long range. In ptxas, rematerialization is implemented through three cooperating pipeline phases and tightly integrated with the register allocator's spill-vs-remat decision logic. On GPUs, where register pressure directly determines occupancy and therefore throughput, aggressive rematerialization is one of the most performance-critical optimizations in the entire pipeline.

| | |
|---|---|
| Phase 28 | SinkRemat -- sinks instructions closer to uses, marks remat candidates |
| Phase 54 | OriDoRematEarly -- sets remat mode flag (ctx+1552 = 4) |
| Phase 69 | OriDoRemat -- late rematerialization after predication and fusion |
| Address range (phase 28) | Execute: sub_C5FC20, core: sub_913A30 -> sub_A0F020 |
| Address range (phase 69) | Execute: sub_C5F910, core: sub_A112C0 -> sub_A11060 -> sub_A107B0 |
| Minimum opt level | Phase 28: requires level > 4 (knob 487); Phase 69: requires level > 1 |
| Operand kind 7 | "Remat" marker in the Ori IR operand classification |
| Vreg flags (offset +80) | 0x80000001 = remat candidate; 0x80000007 = remat with predication; 0x80000008 = remat committed |
| Regalloc integration | sub_93AC90 (remat check), sub_99A9D0/sub_99AA50 (range remat cost) |
| DUMPIR name | SinkRemat, OriDoRematEarly, OriDoRemat |

Why Rematerialization Matters on GPUs

On NVIDIA GPUs, register count per thread inversely determines the number of concurrent warps (occupancy). Each additional register consumed by a kernel reduces the number of warps that can be resident on an SM. Since GPU performance depends on hiding memory latency through massive parallelism, even a single extra register can measurably degrade throughput.

Rematerialization trades instruction count for register pressure reduction. Instead of keeping a computed value alive in a register from its definition to its last use, the compiler recomputes it where needed. This is profitable when:

  1. The original instruction is cheap (single-cycle ALU: IADD, IMAD, MOV, SEL, LOP3, SHF)
  2. All source operands are still available at the use point (not overwritten)
  3. The live range of the result is long enough to actually cause register pressure
  4. The instruction has no side effects (no memory writes, no barrier interactions)

On GPUs, the cost-benefit tradeoff is skewed much further toward remat than on CPUs. A single spill/refill pair (STL + LDL) costs 20--100 cycles of local memory latency, while a rematerialized IADD costs 1 cycle. More importantly, the spill itself consumes a register for the address computation, potentially cascading into more spills.
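A toy comparison using the latency estimates above (the numbers are the rough figures from this section, not ptxas's actual cost model):

```python
def prefer_remat(remat_cycles: int, num_remat_copies: int,
                 spill_refill_cycles: int = 20) -> bool:
    """Toy spill-vs-remat decision using the estimates in the text:
    recomputing the value at every use must cost less than one
    STL + LDL round trip through local memory (20-100 cycles; the
    optimistic lower bound is used here). The cascade cost of the
    spill's own address register is ignored."""
    return remat_cycles * num_remat_copies < spill_refill_cycles
```

Even recomputing a 1-cycle IADD at several separate use sites remains far cheaper than a single local-memory round trip.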

Pipeline Position

Phase 23   GenerateMovPhi          SSA phi nodes -> MOV instructions
Phase 24   OriPipelining           Software pipelining
Phase 25   StageAndFence           Memory fence insertion
Phase 26   OriRemoveRedundantBarriers
Phase 27   AnalyzeUniformsForSpeculation
Phase 28   SinkRemat               *** Sink + remat candidate marking ***
Phase 29   GeneralOptimize         Bundled mid-level optimizations
  ...
Phase 53   OriPropagateVaryingFirst
Phase 54   OriDoRematEarly         *** Sets remat mode flag ***
Phase 55   LateExpansion
  ...
Phase 63   OriDoPredication        If-conversion (creates new opportunities)
  ...
Phase 66   OriHoistInvariantsLate
Phase 67   DoKillMovement
Phase 68   DoTexMovement
Phase 69   OriDoRemat              *** Late rematerialization ***
Phase 70   OriPropagateVaryingSecond

The three-phase design is deliberate:

  • Phase 28 (early): Runs after SSA construction and pipelining but before the main optimization passes. Sinks instructions closer to their uses and identifies candidates. This is the most complex of the three phases.
  • Phase 54 (mode setter): A trivial phase that writes 4 to ctx+1552 (the pipeline progress counter), signaling to downstream passes that rematerialization mode is active. Its isNoOp() returns 1 in the default vtable, meaning the dispatch loop skips its execute() by default. The phase is only active when an architecture backend overrides the vtable to return 0, at which point the single-store execute body runs.
  • Phase 69 (late): Runs after predication (phase 63) and loop fusion (phase 59), which restructure control flow and create new rematerialization opportunities that did not exist at phase 28 time. Also runs after OriHoistInvariantsLate (phase 66), which may have extended live ranges by hoisting invariants.

Phase 28: SinkRemat

Entry and Guard Logic

The execute function (sub_C5FC20) applies two layers of gating:

function SinkRemat_execute(phase, ctx):
    opt_level = getOptLevel(ctx)           // sub_7DDB50
    if opt_level <= 1:
        return
    return sub_913A30(ctx)                 // actual implementation

sub_913A30 (131 lines) performs additional checks before invoking the core:

  1. Optimization level >= 5: Required for the full sink+remat pass
  2. Knob 487: Must be enabled (queried via vtable+152 dispatch on ctx+1664)
  3. Cutlass detection (sub_8F47E0): Checks if the function name contains "cutlass" via strstr(). Cutlass kernels receive special treatment
  4. Flag check (ctx+1368 bit 0): Must be set (compilation is in SSA window)
  5. Feature flags (ctx+1376): Must have bit 26 set (0x4000000) but NOT bit 53 (0x20000000000000) simultaneously

When the cutlass flag (ctx+1381 bit 6) is set, the pass enters an iterative mode:

function sub_913A30(ctx):
    if opt_level <= 4:
        return
    if not knob_enabled(487):
        return
    is_cutlass = function_name_contains("cutlass")
    if not (flag_byte(ctx+1368) & 1):
        return
    if not is_cutlass and not (flag_byte(ctx+1381) & 0x40):
        return

    // Feature flag gating
    features = *(ctx+1376) & 0x20000004000000
    if features != 0x4000000:
        return

    // Cutlass iterative mode
    if flag_byte(ctx+1381) & 0x40:
        max_iters = 5                      // default
        if hw_config->field_62064:         // architecture-specific override
            max_iters = getKnob(862)       // configurable iteration limit
            if max_iters <= 0: goto sinkRemat_core
        for iter in 0..max_iters:
            sub_8F5220(&state, ctx)        // initialize iteration state
            changed = sub_911030(&state, iter)  // core sink+remat
            if not changed or sub_8F59C0(&state):  // convergence check
                break
            sub_8F5AD0(&state)             // update state for next iter
            sub_909A20(&state)             // propagate changes
            // clean up 4 bitvectors + 2 hash tables
        return

    // Non-cutlass path: single invocation
    sinkRemat_core:
    if is_cutlass:
        // Instruction count limit check
        if *(ctx+1584)->field_372 > 0x7FFF:
            // Warn via vtable dispatch (diagnostic knob 356, severity 2)
        sub_A0F020(ctx)                    // CORE: sink + remat driver
        vtable_callback()                  // post-processing hook
        sub_781F80(ctx, 1)                 // rebuild liveness
        sub_8F4820(ctx, &worklist)         // build remat worklist
        // Process worklist in reverse order
        for item in worklist (descending):
            sub_8F4F90(ctx, &item)         // apply remat decisions

Core Sink+Remat Driver: sub_A0F020

sub_A0F020 (494 lines) is the main workhorse of phase 28. It operates on the entire function body, processing basic blocks in reverse postorder through the dominator tree.

The algorithm has two main stages:

Stage 1: Per-block sinking analysis (via sub_A06A60 calling sub_A08250)

For each basic block in reverse postorder:

  1. Walk the instruction list backward
  2. For each instruction, check if it has a single use in a dominated block
  3. If so, sink the instruction to the use block (moves the instruction node in the linked list)
  4. Track whether any changes were made for convergence

Stage 2: Cross-block rematerialization (via sub_A06A60 calling sub_A07DA0)

For each basic block in reverse postorder:

  1. Walk the instruction list
  2. For each rematerialization-eligible instruction, check if the cost model approves duplication
  3. If profitable, clone the instruction at the use site and mark the original's result register with the remat flag

The pass alternates between sinking and rematerialization in a fixed-point loop, repeating until no more changes occur. The two worklist callbacks (sub_A08250 for sinking, sub_A07DA0 for remat) operate on a per-block basis through a generic block visitor (sub_A06A60).

The block visitor manages per-block liveness bitvectors:

  • block+16: live-in bitvector
  • block+40: live-out bitvector
  • block+64: kill set
  • block+112: live-through set (computed as intersection of live-in and live-out)
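In Python-set terms (a sketch only; ptxas uses packed 32-bit-word bitvectors, not sets), the live-through set is derived from the other two:

```python
class BlockLiveness:
    """Sketch of the per-block sets managed by the block visitor.
    The offsets in the comments refer to the recovered block layout."""

    def __init__(self, live_in, live_out, kill):
        self.live_in = set(live_in)      # block+16
        self.live_out = set(live_out)    # block+40
        self.kill = set(kill)            # block+64
        # block+112: computed as the intersection of live-in and live-out
        self.live_through = self.live_in & self.live_out
```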

Phase 69: OriDoRemat (Late Rematerialization)

Entry Logic

The execute function (sub_C5F910) checks the optimization level and delegates:

function OriDoRemat_execute(phase, ctx):
    opt_level = getOptLevel(ctx)           // sub_7DDB50
    if opt_level <= 1:
        return
    return sub_A112C0(ctx, -1)             // -1 = process all blocks

Remat State Initialization: sub_A0C310

sub_A112C0 first initializes a rematerialization state object (88+ bytes) via sub_A0C310:

struct RematState {
    +0    compilation_ctx*  ctx
    +8    RefCountedObj*    block_worklist     // refcounted allocator object
    +16   int              hash_capacity       // hash table for seen blocks
    +20   int              hash_count
    +24   HashNode**       hash_buckets
    +32   int              hash_bucket_count
    +72   int              iteration_count     // convergence counter
    +80   RefCountedObj*   instruction_pool    // for cloned instructions
    +88   int              pass_id             // -1 for late remat
    +92   byte             requires_tex_check  // hw_config->field_17064 == 1
};

The pass ID -1 distinguishes OriDoRemat from OriDoRematEarly. When the hardware configuration at hw_config+17064 is 1 and hw_config+17072 != 0, the requires_tex_check flag is set, enabling additional texture-instruction awareness.

Iterative Remat Loop: sub_A112C0 + sub_A11060

The late remat pass runs in a convergence loop:

function sub_A112C0(ctx, pass_id):
    init_remat_state(&state, ctx, pass_id)

    // Iterative convergence loop
    while sub_A11060(&state) and getOptLevel(ctx) != 1
          and sub_785E20(ctx, 0):           // instruction budget check
        continue

    // Cleanup: drain worklist, release refcounted objects
    ...

Per-Iteration Worker: sub_A11060

Each iteration of sub_A11060 (155 lines) processes the entire instruction list:

function sub_A11060(state):
    ctx = state->ctx
    sub_7E6090(ctx, 0, 1, 0, 0)           // rebuild use-def chains
    // Reset all basic block depth markers to 0x80000000 (unvisited)
    for bb in ctx->block_list:
        bb->field_76 = 0x80000000

    // Drain hash table back into instruction pool
    drain_hash_table(state)

    first_pass = !state->requires_tex_check
    changed = false

    // Walk instructions in program order
    instr = ctx->first_instruction         // ctx+280
    while instr:
        if first_pass:
            first_pass = false
            while instr:
                opcode = instr->opcode & 0xFFFFCFFF
                if opcode == 97:           // STG in ROT13; used as definition anchor/label marker
                    changed |= sub_A10DF0(state, instr)
                next = instr->next
                sub_A107B0(state, instr, &sink_flag, &changed_flag,
                          &remat_flag, true)
                instr = next
        else:
            // Non-first-pass: skip MOV processing
            while instr:
                next = instr->next
                sub_A107B0(state, instr, &sink_flag, &changed_flag,
                          &remat_flag, true)
                instr = next

        if not changed_flag:
            goto check_second_pass
        // Decrement iteration counter, check convergence
        if --state->iteration_count == 0:
            return sink_flag

    check_second_pass:
    if remat_flag and *(ctx+1552) > 4:
        // Second pass: walk block list for cross-block opportunities
        for bb in ctx->block_list:
            if (bb->field_20 & 1) == 0 or bb->size <= 0
               or (bb->field_20 & 6) == 6:
                continue                   // skip empty/dead/cold blocks
        instr = ctx->first_block_instruction
        while instr:
            instr = sub_A0C540(state, instr, &changed, ...)
        if changed:
            // Reset depth markers and loop
            continue

    --state->iteration_count
    return sink_flag

Per-Instruction Remat Worker: sub_A107B0

sub_A107B0 (316 lines) is the core per-instruction decision function called from both phases 28 and 69. It determines whether a specific instruction should be sunk, rematerialized, or left alone.

function sub_A107B0(state, instr, sink_flag_out, changed_out, remat_flag_out,
                     allow_remat):
    // Quick rejection: check if instruction is sinkable
    result = sub_A105F0(state, instr, sink_flag_out, changed_out)
    if result:
        return result                      // already sunk, done

    num_operands = instr->operand_count    // at instr+80
    if num_operands <= 0:
        return 0

    // Walk destination operands
    for i in 0..num_operands:
        operand = instr->operands[i]       // at instr+84 + 8*i
        operand_type = (operand >> 28) & 7

        if operand_type == 7:              // barrier register
            // Track barrier liveness
            continue
        if operand_type != 1:              // not a GPR destination
            continue

        // GPR destination operand
        if operand < 0:                    // bit 31 set = definition
            vreg = lookup_vreg(ctx, operand & 0xFFFFFF)
            vreg->flags |= 0x80000001     // mark as remat candidate
            if has_predication_flag and last_operand_is_0x20:
                vreg->flags |= 0x80000007 // enhanced remat with predication
            if sub_A0C410(state, vreg, instr, allow_remat):
                // Remat is profitable: clear depth flag, update block assignment
                vreg->field_76 = ~instr->block_id
            else:
                // Not profitable: process as regular definition
                // Check for multi-use definitions
                ...
        else:
            // Source operand: track liveness contribution
            ...

    return result

Sinkability Check: sub_A105F0

sub_A105F0 (77 lines) determines if an instruction can be sunk to a single-use block. It enforces strict criteria:

  1. Opcode filter: Only opcode 0x5F (95; STS in the ROT13 name table, used here as a constant/immediate load variant marker) with state->byte_92 clear
  2. Single-use check via sub_A07940: The instruction must have exactly one use
  3. Dominator check: The use must be in a block dominated by the definition block
  4. MOV chain check: If the instruction feeds opcode 93 (OUT_FINAL in ROT13; used here as a MOV-like chain link), verifies through an FNV-1a hash table that the definition matches the expected pattern
  5. Cost check via sub_A0C4A0: Verifies that sinking reduces pressure (returns the pressure delta)

When sinking succeeds, the instruction is physically moved in the linked list via sub_92E1B0 (insert at new position) and sub_9253C0 (remove from old position).

Rematerialization Eligibility Criteria

The eligibility check spans multiple functions. An instruction is rematerializable if it passes ALL of these filters:

Opcode Whitelist

From sub_911030 and sub_A11060, the eligible opcode set (after masking opcode & 0xFFFFCFFF) is:

| Opcode | Identity | Category |
|---|---|---|
| 22 | IADD/IADD3 | Integer add (1 cycle) |
| 50 | SHF | Funnel shift (1 cycle) |
| 77 | IMAD | Integer multiply-add (1 cycle on modern SM) |
| 83 | ISETP | Integer set-predicate (1 cycle) |
| 93 | OUT_FINAL in ROT13; used as MOV-like marker | Register move (0--1 cycles, often eliminated). Actual SASS MOV is opcode 19. |
| 95 | STS in ROT13; used as constant-load marker | Constant materialization |
| 297 | LOP3 | 3-input logic (1 cycle) |
| 352 | SEL | Conditional select (1 cycle) |

The eligibility bitmask is encoded as 0x2080000010000001 >> (opcode - 22) for opcodes in range [22, 83], with explicit checks for opcodes 297 and 352. This is a compile-time-constant bitmask covering single-cycle ALU instructions.
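The mask can be expanded mechanically; this small check (ours, not recovered code) confirms that the set bits of 0x2080000010000001 select exactly the four in-range opcodes from the table -- 22, 50, 77, and 83 -- leaving the rest to the explicit comparisons:

```python
ELIGIBLE_MASK = 0x2080000010000001   # constant from sub_911030 / sub_A11060

def mask_eligible(opcode: int) -> bool:
    """Bitmask eligibility test for opcodes in [22, 83]; opcodes outside
    this range (e.g. 297, 352) are handled by explicit checks instead."""
    if not 22 <= opcode <= 83:
        return False
    return bool((ELIGIBLE_MASK >> (opcode - 22)) & 1)

covered = [op for op in range(22, 84) if mask_eligible(op)]
# covered == [22, 50, 77, 83]: IADD, SHF, IMAD, ISETP
```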

Operand Source Liveness

sub_90C010 (70 lines) checks that all source operands are still available (live) at the proposed remat point:

function check_sources_available(state, instr, operand_idx, cost_out):
    operand = &instr->operands[operand_idx]

    // Immediate operand: always available
    if sub_7DEB90(operand, state->ctx):
        return 1

    // Must be a GPR (type 1) and not a definition (bit 31 clear)
    type = (operand->value >> 28) & 7
    if type != 1 or (operand->value_high & 1):
        return 0

    // Check if the source vreg has a single reaching definition
    vreg = lookup_vreg(ctx, operand->value & 0xFFFFFF)
    single_def = vreg->field_56
    if single_def:
        return sub_90B790(state, single_def, cost_out, false)

    // Multiple definitions: walk the def-use chain
    min_cost = UINT_MAX
    for def in vreg->def_chain:         // at instr->field_64 + 8*operand_idx
        cost = sub_90B790(state, def->instruction, cost_out, false)
        if cost == 0:
            return 0                    // any unavailable source -> reject
        // For rematerializable defs, add depth cost
        if def is rematerializable:
            cost += (def->block_depth <= instr->block_depth) ? 1 : 0
        min_cost = min(min_cost, cost)
    return min_cost

Cost Model: sub_90B790

sub_90B790 (large function, ~350 lines) implements the core cost/benefit analysis. It returns a non-negative integer cost where:

  • 0 = not profitable, do not rematerialize
  • 1+ = profitable, higher values indicate cheaper remat

The function considers:

  1. Opcode-specific register consumption: Different opcodes produce different register-type results. sub_7E36C0, sub_7E40E0, sub_7E3790, sub_7E3800, sub_7E3640 extract per-operand register class (R/P/UR/UP) and width
  2. Live range length: Longer live ranges benefit more from remat
  3. Use count: Multiple uses may require multiple remat copies -- still profitable if the live range is long enough
  4. Block depth: Instructions in deeper loop nests get higher remat cost thresholds since the duplicated instruction executes more frequently
  5. Predication state: Predicated instructions have additional constraints on remat safety
  6. Pre-existing flags: If vreg+80 already has 0x80000001 set, the register is already a remat candidate

Cross-Block Rematerialization: sub_A0C540

sub_A0C540 (228 lines) handles rematerialization across basic block boundaries, invoked in the second pass of sub_A11060. It processes definitions that are used in different blocks:

function cross_block_remat(state, instr, changed_out):
    // Walk operands in reverse order (destinations first)
    for i in (instr->operand_count - 1) downto 0:
        operand = instr->operands[i]
        if (operand >> 28) & 7 != 1:      // not a GPR
            continue
        if (operand_high & 1):             // skip source operands
            continue

        vreg = lookup_vreg(ctx, operand & 0xFFFFFF)
        if (vreg->flags_48 & 0x22) != 0:  // skip special vregs
            continue
        if vreg->reg_index in [41..44]:    // skip architectural predicates
            continue

        flags80 = vreg->field_80
        if not (flags80 & 1):             // not a remat candidate
            continue
        if vreg->use_count <= 0:
            continue
        if (flags80 & 2) and (flags80 & 4):  // already fully processed
            continue

        // Compute instruction-level remat cost
        cost = sub_91E860(ctx, instr, i)

        if operand < 0:                    // definition
            if cost <= 3:
                vreg->field_80 |= 0x80000008  // commit remat
                continue
            // Remat profitable: insert remat copy
            adjust_pressure(state, instr, -1)  // sub_A0C4A0
            duplicate_at_use(ctx, instr)       // vtable dispatch +1280
            adjust_pressure(state, instr, +1)
            vreg->field_80 |= 0x80000008
            vreg->flags_48 &= ~0x300000        // clear live-through bits
            // Rebuild interference for affected ranges
            adjust_pressure(state, instr, -1)
            sub_92C0D0(ctx, instr, 0, ...)     // clone instruction at use
            adjust_pressure(state, instr, +1)
            *changed_out = 1

Interaction with Register Allocator

The rematerialization flags set during phases 28 and 69 are consumed by the fat-point register allocator in several ways:

Remat Detection During Assignment: sub_93AC90

During per-instruction register assignment (sub_9680F0, 3722 lines), the allocator calls sub_93AC90 (29 lines) to check if a virtual register is a rematerialization candidate:

function check_remat_opportunity(alloc, vreg_index, reg_class):
    if alloc->vreg_count == 0:
        BUG()
    entry = hash_lookup(alloc->remat_table, vreg_index)
    cost = entry->field_144[reg_class]
    if cost < entry->threshold:
        return true
    return (cost == entry->threshold) and (reg_class == entry->field_12)

Range Remat Cost: sub_99AA50

The live-range infrastructure at 0x994000--0x9A1000 includes remat-aware cost functions. sub_99AA50 (51 lines) inserts a rematerialization cost node into a range's cost linked list, enabling the allocator to compare spill cost against remat cost when choosing between spilling and rematerializing a value.

Spill-vs-Remat Decision

The allocator's main iteration driver (sub_9AEF60, 1415 lines) uses remat information to guide the spill-vs-remat tradeoff:

  1. During interference analysis, remat candidates get lower interference weights (they can be killed and recreated)
  2. When a spill is triggered, the allocator first checks if the value is rematerializable. If so, it inserts a remat copy instead of a spill/refill pair
  3. Remat linked lists are maintained at alloc+161..+175 in the per-class allocator state

Verification: sub_A55D80

The post-allocation verifier (sub_A55D80, referenced by "REMATERIALIZATION PROBLEM..." string) validates that rematerialization was applied correctly. Error case 7 in the verifier specifically checks that:

  • The rematerialized instruction produces the same value as the original
  • The reaching definitions before and after allocation match (modulo known-safe remat transformations)
  • No rematerialized instruction references a register that was invalidated by the allocation

Operand Kind 7: Remat Markers

The Ori IR operand classification includes a dedicated "Remat" kind (value 7) that marks operands participating in rematerialization. This marker is orthogonal to the vreg+80 flags -- it exists in the instruction's operand descriptors and tells downstream passes that this particular use was created by rematerialization rather than by the original program.

The 10 operand kinds in the Ori IR:

| Kind | Name | Description |
|---|---|---|
| 0 | R register | General-purpose register |
| 1 | Offset | Memory offset |
| 2 | P/UP register | Predicate register |
| 3 | Any register | Wildcard |
| 4 | Regular | Immediate or constant |
| 5 | Predicated | Guard predicate |
| 6 | -- | (reserved) |
| 7 | Remat | Rematerialization marker |
| 8 | Spill-refill | Spill/refill pair |
| 9 | R2P/P2R | Register-to-predicate conversion |

Vreg Flags at Offset +80

The virtual register's field at offset +80 encodes rematerialization state through a bitmask:

| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | Remat candidate -- this value CAN be recomputed |
| 1 | 0x2 | Remat source processed -- cross-block analysis done |
| 2 | 0x4 | Remat committed -- the allocator should prefer remat over spill |
| 31 | 0x80000000 | Depth marker / unvisited sentinel |

Common flag combinations:

  • 0x80000001: Candidate identified by sub_A107B0, pending cost analysis
  • 0x80000007: Candidate with predication awareness (stronger guarantee for predicated code paths)
  • 0x80000008: Remat committed by cross-block analysis (sub_A0C540), allocator should use remat
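A small decoder for the documented bits (a convenience sketch; the constant names are ours):

```python
# Bit meanings from the recovered vreg+80 flag table; names are ours.
REMAT_CANDIDATE = 0x1          # bit 0: value CAN be recomputed
REMAT_PROCESSED = 0x2          # bit 1: cross-block analysis done
REMAT_COMMITTED = 0x4          # bit 2: prefer remat over spill
DEPTH_SENTINEL = 0x80000000    # bit 31: depth marker / unvisited

def describe_vreg_flags(flags: int) -> list:
    """Name the documented bits set in a vreg+80 flags word."""
    names = []
    if flags & REMAT_CANDIDATE:
        names.append("candidate")
    if flags & REMAT_PROCESSED:
        names.append("processed")
    if flags & REMAT_COMMITTED:
        names.append("committed")
    if flags & DEPTH_SENTINEL:
        names.append("unvisited-sentinel")
    return names
```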

Knobs and Configuration

| Knob ID | Role | Default | Notes |
|---|---|---|---|
| 487 | Gate for SinkRemat pass | (enabled) | Must be true for phase 28 to execute |
| 862 | Cutlass iteration limit | 5 | Max iterations in cutlass-specific iterative mode |
| 356 | Instruction count diagnostic | -- | Severity-2 warning when instruction count exceeds 32767 |

The optimization level gating:

  • Level <= 1 (-O0/-O1): All three remat phases are disabled
  • Level <= 4: Phase 28 runs the non-cutlass path only
  • Level >= 5 (-O3+): Full sink+remat with cutlass iteration support

Function Map

Phase 28 (SinkRemat)

| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5FC20 | sub_C5FC20 | 12 | Phase execute dispatcher |
| 0xC5F2E0 | sub_C5F2E0 | 7 | getName() -> returns 28 |
| 0xC5F2F0 | sub_C5F2F0 | 7 | isNoOp() -> returns 0 (always runs) |
| 0x913A30 | sub_913A30 | 131 | SinkRemat entry with knob/feature gating |
| 0xA0F020 | sub_A0F020 | 494 | Core sink+remat driver (block visitor loop) |
| 0x911030 | sub_911030 | 2408 | Per-block promotion/sinking engine |
| 0x90C010 | sub_90C010 | 70 | Source operand liveness check for remat |
| 0x90B790 | sub_90B790 | ~350 | Cost model: remat profitability analysis |
| 0x8F47E0 | sub_8F47E0 | 12 | Cutlass detection (strstr("cutlass")) |
| 0x8F4820 | sub_8F4820 | -- | Build remat worklist |
| 0x8F4F90 | sub_8F4F90 | -- | Apply remat decisions from worklist |

Phase 54 (OriDoRematEarly)

| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5EF30 | sub_C5EF30 | 7 | Phase execute: writes ctx+1552 = 4 |
| 0xC5EF40 | sub_C5EF40 | 7 | getName() -> returns 54 |
| 0xC5EF50 | sub_C5EF50 | 7 | isNoOp() -> returns 1 |

Phase 54 is a degenerate phase. Its execute body is a single store: *(ctx + 1552) = 4. Its isNoOp() returns 1, so the dispatch loop skips execute() by default -- the phase does nothing unless an architecture backend overrides the vtable to activate it. When active, the value 4 written to ctx+1552 advances the pipeline progress counter, which sub_A11060 checks (if *(ctx+1552) > 4 triggers the cross-block second pass).

Phase 69 (OriDoRemat)

| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5F910 | sub_C5F910 | 24 | Phase execute dispatcher |
| 0xC5ED50 | sub_C5ED50 | 7 | getName() -> returns 69 |
| 0xC5ED60 | sub_C5ED60 | 7 | isNoOp() -> returns 0 (always runs) |
| 0xA112C0 | sub_A112C0 | 245 | Late remat entry + cleanup |
| 0xA0C310 | sub_A0C310 | 45 | RematState initialization |
| 0xA11060 | sub_A11060 | 155 | Per-iteration remat worker |
| 0xA107B0 | sub_A107B0 | 316 | Per-instruction remat decision |
| 0xA105F0 | sub_A105F0 | 77 | Sinkability check (opcode 0x5F) |
| 0xA10DF0 | sub_A10DF0 | 138 | MOV chain analysis (FNV-1a hash table) |
| 0xA0C540 | sub_A0C540 | 228 | Cross-block rematerialization |
| 0xA0C4A0 | sub_A0C4A0 | -- | Pressure adjustment (+1 or -1) |
| 0xA0C410 | sub_A0C410 | -- | Remat profitability check for a vreg |

Register Allocator Integration

| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0x93AC90 | sub_93AC90 | 29 | Remat opportunity check during assignment |
| 0x99A9D0 | sub_99A9D0 | 38 | Range rematerialization cost cleanup |
| 0x99AA50 | sub_99AA50 | 51 | Range rematerialization cost insertion |
| 0x9AEF60 | sub_9AEF60 | 1415 | Main allocation driver with remat support |
| 0xA55D80 | sub_A55D80 | ~800 | Post-allocation remat verification |

Sinking vs. Rematerialization

The SinkRemat pass (phase 28) and the late OriDoRemat pass (phase 69) both move instructions closer to their uses, but through fundamentally different mechanisms:

Sinking moves the original instruction. The definition is physically relocated from its original position to a dominated block closer to the use. This does not increase instruction count but may change the schedule. Sinking is legal only when:

  • The instruction has exactly one use
  • The use is in a block dominated by the current definition block
  • Moving the instruction does not cross any barrier or synchronization point
  • All source operands remain available at the new position

Rematerialization duplicates the instruction. The original definition remains in place (or is deleted if dead), and a fresh copy is inserted near each use. This increases instruction count but can dramatically reduce register pressure. Remat is legal for any instruction in the opcode whitelist, subject to:

  • All source operands available at the use point
  • The cost model approves the duplication
  • The instruction has no side effects

The sub_A105F0 sinkability check runs first in sub_A107B0. Only if sinking fails does the function proceed to the rematerialization path. This prioritizes the cheaper transformation (sinking = zero instruction overhead) before falling back to the more expensive one (remat = duplicated instructions).

Architectural Notes

The three-phase structure with an interleaved flag-setter (phase 54) suggests the rematerialization infrastructure evolved over multiple ptxas generations. Phase 54's isNoOp() = 1 default means its execute() is skipped unless an architecture backend activates it by overriding the vtable. This indicates the phase was likely once a full pass that was later simplified to a flag write, with its analysis logic migrated into phase 69.

The CUTLASS-specific iterative mode in phase 28 (sub_913A30) reveals that NVIDIA's matrix-multiply library is important enough to warrant dedicated compiler heuristics. The strstr("cutlass") check is a name-based pattern match on the function name, not a property of the IR itself. This coupling between compiler optimization and library naming conventions is a pragmatic choice for a production compiler targeting known workloads.

The FNV-1a hash (constant 0x811C9DC5, prime 16777619) appears in both the rematerialization infrastructure (sub_A10DF0 for MOV chain tracking) and the register allocator (sub_926A30 for interference). This shared hash implementation is one of ptxas's standard infrastructure components (see Hash Tables & Bitvectors).
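The recovered constants match the standard 32-bit FNV-1a parameters, which is easy to verify against published test vectors:

```python
FNV_OFFSET_BASIS = 0x811C9DC5   # constant seen in sub_A10DF0 and sub_926A30
FNV_PRIME = 16777619

def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a, using the constants recovered from ptxas."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFF
    return h

# Published test vectors: fnv1a_32(b"") == 0x811C9DC5,
# fnv1a_32(b"a") == 0xE40C292C
```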

Liveness Analysis & Dead Code Elimination

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Liveness analysis is the most frequently repeated computation in the ptxas pipeline. Six dedicated phases perform liveness analysis combined with dead code elimination (DCE), and at least four additional subsystems recompute liveness on demand. The core algorithm is a standard backward dataflow analysis over the CFG, but the implementation is notable for its SSE2-accelerated bitvector library, per-register-file liveness tracking, and the orWithAndNotIfChanged fused transfer function that implements the entire dataflow step in a single SIMD pass.

Dedicated liveness phases: 6 (phases 10, 16, 19, 33, 61, 84)
Core bitvector library:    0xBDBA60--0xBDE150 (15+ functions, SSE2)
BitVector object size:     20 bytes header + dynamic word array
Word size:                 32-bit (uint32_t) -- indexed by >> 5 and & 0x1F
Transfer function:         out = gen | (in & ~kill) (orWithAndNotIfChanged)
Fixed-point detection:     orIfChanged / andIfChanged return bool
Liveness storage:          Code Object +832 (main), +856 (uniform)
NamedPhases override:      "OriPerformLiveDead" controls all 4 instances
Related phase:             138 OriSplitHighPressureLiveRanges (late cleanup)

Pipeline Placement

The six liveness-related phases are distributed across the entire optimization pipeline. Each runs after a group of transformations that may have introduced dead code or invalidated previous liveness information:

Phase  10  EarlyOriSimpleLiveDead         ── Initial Setup
Phase  16  OriPerformLiveDeadFirst         ── Early Optimization
Phase  19  OriSplitLiveRanges             ── Early Optimization
Phase  33  OriPerformLiveDeadSecond        ── Mid-Level Optimization
Phase  61  OriPerformLiveDeadThird         ── Late Optimization
Phase  84  OriPerformLiveDeadFourth        ── Legalization

Phase 138  OriSplitHighPressureLiveRanges  ── Late Cleanup (related)

The four OriPerformLiveDead instances are identical passes invoked at different pipeline positions. They share the same vtable execute function and differ only in when they run. The NamedPhases system addresses all four through the single name "OriPerformLiveDead".

Why Four Instances?

Each instance cleans up dead code introduced by the preceding optimization group:

Phase | Runs After | Cleans Up
16 (First) | Branch optimization, switch optimization | Dead branches, unreachable code from CFG simplification
33 (Second) | GeneralOptimize, loop unrolling, pipelining, strength reduction | Dead loop induction variables, redundant computations from unrolling
61 (Third) | GeneralOptimizeLate, loop fusion, VTG expansion, late expansion | Dead code from loop fusion, expanded macro instructions
84 (Fourth) | Backward copy propagation, late arch optimization | Dead copies, redundant moves from backward propagation

Without these intermediate liveness passes, dead code would accumulate through the pipeline, inflating register pressure and increasing compile time for downstream passes.

Dataflow Algorithm

Classical Backward Liveness

ptxas implements textbook backward dataflow analysis. For each basic block B, the analysis computes:

LiveIn(B)  = gen(B) | (LiveOut(B) - kill(B))
LiveOut(B) = Union over all successors S of LiveIn(S)

Where:

  • gen(B): registers used in B before any definition in B (upward-exposed uses)
  • kill(B): registers defined in B (regardless of whether they are also used)
  • LiveIn(B): registers live at the entry of B
  • LiveOut(B): registers live at the exit of B

Iterative Fixed-Point Solver

The analysis iterates in reverse post-order (RPO) until no LiveIn/LiveOut set changes:

function compute_liveness(func):
    compute_RPO(func)                              // sub_BDE150
    
    // Initialize gen/kill sets per block
    for each block B in func:
        gen(B)  = {}
        kill(B) = {}
        for each instruction I in B (reverse order):
            for each source operand r of I:
                if r not in kill(B):
                    gen(B) |= {r}
            for each destination operand d of I:
                kill(B) |= {d}
    
    // Initialize LiveOut to empty for all blocks
    for each block B:
        LiveOut(B) = {}
    
    // Iterate until fixed point
    changed = true
    while changed:
        changed = false
        for each block B in reverse RPO:
            // LiveOut = union of successors' LiveIn
            for each successor S of B:
                changed |= LiveOut(B).orIfChanged(LiveIn(S))
            
            // LiveIn = gen | (LiveOut - kill)
            //        implemented as: orWithAndNotIfChanged
            changed |= LiveIn(B).orWithAndNotIfChanged(
                            gen(B), LiveOut(B), kill(B))

The key optimization is the fused orWithAndNotIfChanged operation (sub_BDD560), which computes dst |= gen | (in & ~kill) and returns whether any bit changed -- all in a single SSE2 pass over the bitvector words. This avoids materializing intermediate bitvectors for (LiveOut - kill).
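
A scalar, word-at-a-time sketch of this fused transfer step (the real sub_BDD560 processes 128-bit SSE2 lanes; the function name and signature here mirror the recovered behavior, not the exact decompiled interface):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the fused transfer function: dst |= gen | (in & ~kill),
 * returning whether any bit actually changed. One 32-bit word per
 * iteration; the real routine does four words per SSE2 iteration. */
bool orWithAndNotIfChanged(uint32_t *dst, const uint32_t *gen,
                           const uint32_t *in, const uint32_t *kill,
                           int word_count) {
    bool changed = false;
    for (int i = 0; i < word_count; i++) {
        uint32_t new_bits = gen[i] | (in[i] & ~kill[i]);
        if (new_bits & ~dst[i]) {     /* any bit not already set in dst? */
            dst[i] |= new_bits;
            changed = true;
        }
    }
    return changed;
}
```

Because the return value feeds the fixed-point loop directly, a block whose sets are already stable costs only the read-side scan, with no writes.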

Convergence

The analysis converges because:

  1. All sets are initialized to empty (bottom of the lattice).
  2. Each iteration can only add bits (the transfer function is monotone).
  3. The lattice has finite height (bounded by the total number of virtual registers).
  4. RPO traversal order minimizes the number of iterations -- typically 2--3 passes for acyclic code, proportional to loop nesting depth for loops.

BitVector Implementation

The bitvector library at 0xBDBA60--0xBDE150 is the most performance-critical infrastructure in ptxas dataflow analysis. All operations are SSE2-accelerated with manual alignment handling.

Layout

struct BitVector {       // 20 bytes total
    uint32_t* data;      // +0:  pointer to word array (heap-allocated)
    int32_t   word_count; // +8:  number of 32-bit words in use
    int32_t   capacity;  // +12: allocated words (>= word_count)
    int32_t   bit_count; // +16: number of valid bits
};

Word count is computed from bit count: word_count = (bit_count + 31) >> 5. Memory is allocated via the pool allocator (vtable dispatch at allocator +24 for alloc, +32 for free). Reallocation occurs only when the new word count exceeds the current capacity.
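
The index math used throughout the library, sketched as standalone helpers (the function names are illustrative; ptxas implements these as the sub_BDB* routines listed below):

```c
#include <stdint.h>

/* Bit i lives in word i >> 5 at position i & 0x1F; a vector of n bits
 * needs (n + 31) >> 5 words. */
int word_count_for(int bit_count) { return (bit_count + 31) >> 5; }

void set_bit(uint32_t *data, int i)  { data[i >> 5] |= 1u << (i & 0x1F); }

int test_bit(const uint32_t *data, int i) {
    return (data[i >> 5] >> (i & 0x1F)) & 1;
}
```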

Core Operations

Address | Operation (signature) | Notes
sub_BDBA60 | allocate(bv*, alloc*, num_bits) | Grow-only; no shrink
sub_BDBFB0 | setBit(bv*, bit_index) | data[i>>5] |= 1 << (i&31)
sub_BDC0E0 | clearBit(bv*, bit_index) | data[i>>5] &= ~(1 << (i&31))
sub_BDC200 | testBit(bv*, bit_index) -> bool | (data[i>>5] >> (i&31)) & 1
sub_BDCDE0 | operator|=(dst*, src*) | SSE2 _mm_or_si128 loop
sub_BDCF40 | orIfChanged(dst*, src*) -> bool | Scans (~dst & src) != 0 first
sub_BDC5F0 | operator&=(dst*, src*) | SSE2 _mm_and_si128; zeroes tail
sub_BDC790 | andIfChanged(dst*, src*) -> bool | Scans (~src & dst) != 0 first
sub_BDDAA0 | operator^=(dst*, src*) | SSE2 _mm_xor_si128
sub_BDC3F0 | assignAND(dst*, a*, b*) | dst = a & b
sub_BDD300 | orWithAndNot(dst*, gen*, in*, kill*) | dst |= gen | (in & ~kill)
sub_BDD560 | orWithAndNotIfChanged(dst*, gen*, in*, kill*) -> bool | Core transfer function
sub_BDBD60 | extractBits(out[], start, end) | Cross-word boundary handling
sub_BDD8C0 | popcount(bv*) -> int | Count set bits
sub_BDDC00 | clear(bv*) | memset(data, 0, ...)
sub_BDCA60 | operator=(dst*, src*) | Copy with possible realloc
sub_BDCC20 | isSubsetOf(a*, b*) -> bool | Tests (a & ~b) == 0

SSE2 Loop Structure

All bulk operations follow the same pattern:

// Alignment prologue: process scalar words until dst is 16-byte aligned
int align_count = (-(uintptr_t)dst_ptr >> 2) & 3;
if (align_count > word_count) align_count = word_count;
for (int i = 0; i < align_count; i++)
    dst_ptr[i] |= src_ptr[i];

// SSE2 main loop: process 4 words (128 bits) per iteration
int sse_count = (word_count - align_count) >> 2;
for (int i = 0; i < sse_count; i++) {
    __m128i d = _mm_load_si128((__m128i*)&dst_ptr[align_count + 4*i]);
    __m128i s = _mm_loadu_si128((const __m128i*)&src_ptr[align_count + 4*i]);
    _mm_store_si128((__m128i*)&dst_ptr[align_count + 4*i], _mm_or_si128(d, s));
}

// Scalar epilogue: remaining 0-3 words
for (int j = align_count + 4*sse_count; j < word_count; j++)
    dst_ptr[j] |= src_ptr[j];

The orWithAndNot transfer function fuses three operations into one SSE2 expression:

__m128i result = _mm_or_si128(
    _mm_or_si128(gen_vec, dst_vec),
    _mm_andnot_si128(kill_vec, in_vec)   // in & ~kill
);

The IfChanged variants first scan for any bit that would change (~dst & new_bits), then apply the operation only from the first differing word forward. This early-exit optimization avoids unnecessary writes when the analysis has already converged for most blocks.

Per-Register-File Liveness

GPU register allocation manages multiple independent register files. ptxas tracks liveness separately for each:

Register File | Bit Range | Storage
R (GPR, 32-bit) | Bits 0..254 | Code Object +832 (main bitvector)
UR (uniform GPR) | Bits 0..63 | Code Object +856 (uniform bitvector)
P (predicate, 1-bit) | Separate tracking | Operand type (v >> 28) & 7 == 5
UP (uniform predicate) | Separate tracking | Flag at Code Object +1378 bit 4
B (barrier) | Indices 20, 21 | Special-cased in dependency graph

The main liveness bitvector at Code Object +832 covers R registers. The uniform register bitvector at +856 is conditionally allocated: it exists only when the flag at Code Object +1378 bit 4 is set (indicating the function uses uniform registers). The scheduling pass (sub_A06A60) allocates both bitvectors via sub_BDBAD0 and processes them in parallel.

Predicate registers are handled at the operand level during scheduling: the operand type check ((operand >> 28) & 7) == 5 identifies predicate operands, which are tracked in a separate per-block set rather than the main bitvector.
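
The predicate check reduces to a field extraction on the packed operand word. The helper name below is illustrative; only the ((operand >> 28) & 7) == 5 encoding is recovered, and the remaining operand bits are not decoded here:

```c
#include <stdint.h>

/* Operand type lives in bits 28..30 of the packed operand word;
 * type 5 marks a predicate operand. */
int is_predicate_operand(uint32_t operand) {
    return ((operand >> 28) & 7) == 5;
}
```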

Barrier registers (IDs 20, 21 for sm >= 4.0) receive special treatment in the dependency graph builder (sub_A0D800): they generate ordering dependencies rather than data dependencies, since barriers enforce execution ordering constraints independent of register values.

Phase 10: EarlyOriSimpleLiveDead

The earliest liveness pass, running immediately after initial IR construction (after ReportInitialRepresentation at phase 9). This is a simplified liveness + DCE pass that removes obviously dead instructions from the freshly-lowered IR.

Pipeline context: At this point, the IR has just been lowered from PTX. Many PTX instructions expand to multiple Ori instructions, some of which produce values that are immediately dead (e.g., condition codes that are never tested, intermediate values from multi-instruction expansions). EarlyOriSimpleLiveDead removes this low-hanging dead code before the main optimization pipeline begins, reducing the working set for all subsequent passes.

Implementation evidence: The sweep at p1.10 (W010) confirms this pass uses the bitvector infrastructure at sub_BDBA60--sub_BDE150 for liveness computation. The "simple" in the name may indicate a local-only (per-BB) analysis that avoids the cost of full global iterative dataflow -- sufficient for removing obviously dead definitions that have no uses within the same block.

Phases 16, 33, 61, 84: OriPerformLiveDead

The four instances of the full liveness + DCE pass. These perform global iterative dataflow analysis followed by dead instruction removal.

Algorithm

function OriPerformLiveDead(func):
    // 1. Rebuild basic block metadata
    rebuild_basic_blocks(func, mode)        // sub_781F80
    
    // 2. Compute global liveness (iterative fixed-point)
    compute_global_liveness(func)           // iterative solver
    
    // 3. Dead code elimination
    for each block B in func:
        for each instruction I in B:
            if all destinations of I are dead (not in LiveOut):
                if I has no side effects:
                    remove(I)
    
    // 4. Update IR metadata
    //    (instruction counts, block sizes, etc.)

Side-Effect Preservation

Not all instructions with dead destinations can be removed. The DCE must preserve:

  • Memory stores (STG, STS, STL, ATOM, etc.) -- observable side effects
  • Barrier instructions (BAR, MEMBAR) -- synchronization semantics
  • Control flow (BRA, EXIT, RET, CALL) -- program structure
  • Texture operations with side effects
  • Instructions with volatile flags

The opcode mask & 0xCFFF (seen in sub_A06A60) strips modifier bits to obtain the base opcode for side-effect classification. Several opcodes receive special handling:

  • 93 (OUT_FINAL in the ROT13 name table) -- used as a call-like marker; actual CALL is opcode 71
  • 94 (LDS) / 95 (STS) -- used as block boundary markers
  • 97 (STG) -- used as a branch-like marker; actual BRA is opcode 67
  • 52 (AL2P_INDEXED) -- used as a NOP/boundary marker

DCE Integration

The OriPerformLiveDead pass combines liveness computation with DCE in a single pass rather than running them as separate analyses. After computing LiveOut sets for each block, the pass walks each block backward: for each instruction, it checks whether every destination register is absent from the current live set. If so and the instruction has no side effects, it is unlinked from the instruction list. Source operands of removed instructions are themselves removed from the live set, potentially enabling cascading removal of further dead instructions within the same backward walk.
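
The backward walk for a single block can be sketched as follows. The instruction layout (dst/src fields, side_effect flag) is hypothetical, and the live set is a single 32-bit mask for brevity; the real pass uses the full BitVector and unlinks instructions from the IR list:

```c
#include <stdint.h>

/* Hypothetical instruction record for the sketch. */
typedef struct {
    int dst;            /* destination register, -1 if none */
    int src[2];         /* source registers, -1 if unused */
    int side_effect;    /* 1 = must be preserved (store, barrier, ...) */
    int removed;        /* output: set by the walk */
} Instr;

/* Backward walk: an instruction whose destination is dead and which has
 * no side effects is removed; because its sources are never added to the
 * live set, earlier instructions feeding only it become dead too. */
void dce_block(Instr *instrs, int n, uint32_t live_out) {
    uint32_t live = live_out;
    for (int i = n - 1; i >= 0; i--) {
        Instr *I = &instrs[i];
        int dst_live = I->dst >= 0 && (live & (1u << I->dst));
        if (!dst_live && !I->side_effect && I->dst >= 0) {
            I->removed = 1;                     /* dead: skip gen step */
            continue;
        }
        if (I->dst >= 0) live &= ~(1u << I->dst);   /* kill */
        for (int k = 0; k < 2; k++)                 /* gen */
            if (I->src[k] >= 0) live |= 1u << I->src[k];
    }
}
```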

Phase 19: OriSplitLiveRanges

This phase splits live ranges at loop boundaries and across phi/copy chains to reduce register pressure. It runs after OriPerformLiveDeadFirst (phase 16) and OriLoopSimplification (phase 18), when the loop structure is canonical.

String reference: "OriSplitLiveRanges" at 0x22BC5C0.

Core implementation: sub_BEF110 (108KB, 3,414 decompiled lines). Called via sub_A1D3A0 (vtable execute) -> sub_BF33D0 (knob-gated entry, reads register budget from ctx+1624 and knob 456).

Motivation

On GPUs, register pressure directly determines occupancy (the number of concurrent warps). A value defined before a loop and used only after the loop occupies a register for the entire loop body, even though it is not accessed within the loop. Splitting the live range at the loop boundary -- by inserting a copy before the loop and a copy after -- can free the register for use inside the loop, reducing peak pressure and enabling higher occupancy.
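
An illustrative before/after, in the style of the pseudocode elsewhere on this page (register names are hypothetical):

```
// Before splitting: v0 occupies a register across the whole loop
v0 = ...                 // defined before the loop
loop:
    ...                  // v0 never referenced in the body,
    bra loop             // yet it pins a register here
use(v0)                  // only use is after the loop

// After splitting: copies bound the loop, so the allocator can place
// v0_split in a different register (or spill it) across the body
v0 = ...
v0_split = v0            // copy inserted before the loop
loop:
    ...
    bra loop
v0' = v0_split           // copy inserted after the loop
use(v0')
```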

Algorithm (Decompiled from sub_BEF110)

The function operates in five distinct phases:

Phase 1: Pre-analysis -- Rebuilds basic blocks (sub_781F80), allocates three bitvector fields per virtual register (kill at VR+96, gen at VR+24, live-through at VR+176), then runs the standard iterative liveness solver (sub_775010 + sub_773140). Walks the register table checking interference chains: for each VR with a chain at VR+136, tests whether the chain target's kill set is a subset of the VR's kill set (sub_BDC390 = isSubsetOf). Non-subset cases receive the +264 bit 1 flag, marking them as interference candidates.

Phase 2: Work structure allocation -- Allocates a scratch array s[] (one entry per split candidate), a hash table for interference tracking (power-of-2 buckets sized via _BitScanReverse64), and an array of 64-byte per-block split records:

struct PerBlockSplitRecord {    // 64 bytes, indexed by block ID
    void*    list_head;         // +0:  interference linked list
    void*    first_in_block;    // +8:  first entry pointer
    void*    sentinel;          // +16: self-pointer
    void*    reserved;          // +24
    void*    last_in_block;     // +32: last entry pointer
    void*    tail;              // +40: tail pointer
    int32_t  count;             // +48: entry count
    int32_t  pad;               // +52
    void*    allocator_ref;     // +56: refcounted allocator
};

Phase 3: Main splitting loop -- Iterates the ordered register array at ctx+792 in reverse order (highest VR ID first). For each VR, walks the def-use chain via ctx+296 (register table), classifying instructions by opcode:

Opcode (masked) | Meaning | Split Action
167 (0xA7) | Phi-like | Walk up phi chain, split at each level via sub_931920
158 (0x9E) | Copy-like | Similar chain walk with copy-specific handling
188 (0xBC) | Multi-operand special | Check operand types, dispatch to sub_BE3720 for multi-source split
27 (0x1B) | Register move | Standard split point; emit via sub_9314F0 with 4 operands
269 (0x10D) | Copy | Lightweight split; emit via sub_9314F0 with 2 operands

For each split: allocates a new VR via sub_931920, copies the three bitvector fields (sub_BDBA60 allocates, sub_BDC1B0 copies dst |= src), validates the register class via sub_9314F0 (called 11 times total across different split patterns), and updates the interference hash via sub_BEEC80.

The inline interference check in the hot path:

// Fast single-bit test: is vr_class_id live in the kill set?
if (((1 << vr_class_id) & kill_set[vr_class_id >> 5]) != 0)
    // VRs interfere -- cannot share a physical register

Phase 4: Interference hash processing -- Builds a global interference hash table using FNV-1a (0x811C9DC5 offset basis, 16777619 prime). Walks per-block split records, for each entry scans the kill bitvector (sub_BDDC00 clears from position, scanning forward) to find concurrently live VRs. Tests interference via sub_BEE7F0 and emits split instructions via sub_934630 (opcode 46). The hash table resizes when load factor exceeds 50%.

Phase 5: Cleanup -- Marks phi/copy chains with the +245 rewrite flag (triggering opcode mutation from 188 to 93 or 95), frees hash tables and per-block records, clears ctx+1370 bit 2 to signal liveness invalidation.

function OriSplitLiveRanges(func):
    // Phase 1: Pre-analysis
    rebuild_basic_blocks(func, 0)           // sub_781F80
    alloc_kill_bitvectors(func)             // sub_BEAFD0: VR+96
    alloc_gen_bitvectors(func)              // sub_BEB110: VR+24
    compute_liveness(func)                  // sub_775010
    propagate_per_block(func, 0)            // sub_773140
    mark_interference_candidates(func)      // inline: walk chains, test subsets

    // Phase 2: Work structure allocation
    allocate_work_structures(split_candidate_count)

    // Phase 3: Main splitting loop
    for each VR in ordered_array[ctx+792] (reverse):
        walk def-use chain via ctx+296:
            classify instruction by opcode
            if splittable:
                new_vr = allocate_vr(func, vr, def_instr)    // sub_931920
                copy_bitvectors(new_vr, vr)                   // sub_BDBA60 + sub_BDC1B0
                validate_reg_class(new_vr, opcode, operands)  // sub_9314F0
                update_interference_hash(new_vr)               // sub_BEEC80

    // Phase 4: Interference hash processing
    for each entry in interference_hash:
        for each concurrent_vr in kill_bitvector:
            if interferes(entry, concurrent_vr):              // sub_BEE7F0
                emit_split_instruction(entry, concurrent_vr)  // sub_934630

    // Phase 5: Cleanup
    mark_rewrite_flags()                    // byte +245
    free_work_structures()
    ctx[+1370] &= ~4                       // invalidate liveness

Three Bitvector Fields per Virtual Register

The splitting pass maintains three independent bitvectors per VR, all using the standard 32-bit-word BitVector from 0xBDBA60--0xBDE150:

VR Offset | Name | Content | Allocated by
+96 | Kill set | Registers defined by this VR's instructions | sub_BEAFD0
+24 | Gen set | Registers used before definition in this VR's range | sub_BEB110
+176 | Live-through set | Registers live through the range without kill or gen | Derived

These per-VR bitvectors differ from the per-block liveness bitvectors used by OriPerformLiveDead. The per-block sets track global liveness; the per-VR sets track interference within a single virtual register's live range, enabling the split decision: if two VRs have overlapping kill sets (tested via the fast inline (1 << id) & word[id >> 5] check), they interfere and splitting one of them at the boundary reduces the overlap.
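
Scalar sketches of the two bitvector predicates the pass relies on -- the sub_BDCC20-style subset test used in the phase 1 interference-candidate scan, and the kill-set overlap check used for the split decision. Equal word counts for both operands are assumed:

```c
#include <stdint.h>
#include <stdbool.h>

/* Subset test: (a & ~b) == 0 in every word. */
bool is_subset_of(const uint32_t *a, const uint32_t *b, int words) {
    for (int i = 0; i < words; i++)
        if (a[i] & ~b[i]) return false;
    return true;
}

/* Overlap test: (a & b) != 0 in any word means the VRs interfere. */
bool kill_sets_overlap(const uint32_t *a, const uint32_t *b, int words) {
    for (int i = 0; i < words; i++)
        if (a[i] & b[i]) return true;
    return false;
}
```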

Helper Functions

Address | Identity | Role
sub_BEAFD0 | AllocKillBitvectors | Allocate VR+96 kill sets; propagate via interference chain VR+136
sub_BEB110 | AllocGenBitvectors | Allocate VR+24 gen sets; scan phi/copy defs (opcodes 158, 167)
sub_BE3390 | ComputeSplitCount(interference) | Count split points for interference-chain case
sub_BE3590 | ComputeSplitCount(clean) | Count split points for non-interfering case
sub_BE3720 | ComputeSplitCount(multiSrc) | Count split points for multi-source operand case
sub_BEE7F0 | TestInterference | Test bitvector interference between two VRs
sub_BEEC80 | UpdateHashWithSplit | Update per-split hash table (192-byte entries, 8 buckets)

Relationship to Phase 138

Phase 138 (OriSplitHighPressureLiveRanges) performs a similar transformation but much later in the pipeline (late cleanup stage), targeting live ranges that still cause excessive pressure after all optimization and legalization passes have run. Phase 19 is the early, conservative version; phase 138 is the late, aggressive fallback.

Liveness Consumers

The liveness information computed by these phases is consumed throughout the pipeline:

Register Allocator

The fat-point register allocator (sub_9721C0) is the primary consumer. Its entry point explicitly rebuilds liveness before allocation:

sub_781F80(ctx, 1);     // rebuild basic blocks
sub_A10160(ctx, 1);     // recompute liveness

The allocator uses liveness information for:

  • Interference computation: Two virtual registers interfere if their live ranges overlap. The interference graph builder (sub_926A30, 155KB decompiled) uses bitvector intersection to detect overlaps.
  • Spill cost estimation: sub_94E620 computes spill costs weighted by liveness range length and instruction properties.
  • Spill placement: sub_9449B0 (liveness range calculator, 1800 bytes) iterates instructions in reverse block order using bitvector operations to determine optimal spill/reload insertion points.

Instruction Scheduler

The scheduling subsystem maintains its own liveness tracking at Code Object +832:

  • Pre-scheduling: sub_8DBAF0 (16KB, LivenessAnalysis) computes register liveness for the scheduling priority function.
  • Per-BB liveness: sub_8DB5F0 (8.4KB, LivenessCompute) computes per-basic-block liveness sets.
  • Initialization: sub_8DB070 (8.2KB, LivenessInit) sets up the liveness data structures.
  • Iterative solver: sub_8DE7A0 (12KB) runs the iterative fixed-point computation for scheduling-specific dataflow.

The scheduler uses liveness to:

  • Estimate register pressure at each scheduling point
  • Identify last-use operands for dead-register marking (sub_A08250 checks (1 << reg_num) & *(live_set + 4*(reg_num >> 5)))
  • Compute instruction priority based on register pressure impact

DAG Construction

The dependency graph builder (sub_A0F970, sub_A0D800) uses liveness to:

  • Determine which registers are live at block boundaries
  • Identify anti-dependencies (WAR) that constrain scheduling
  • Track callee-clobbered registers at call sites (opcode 93; OUT_FINAL in ROT13, used as call-like marker -- actual CALL is opcode 71)

Multi-Set Register Manager

sub_A7BC80 (36KB) manages multiple parallel liveness bitvectors for different register classes (R, P, B, UR, UP) during post-allocation scheduling. It allocates and deallocates bitvectors in coordinated groups, updating each set based on instruction defs/uses.

Uninitialized Register Detector

sub_A0B5E0 uses liveness information to detect potentially uninitialized registers. After scheduling, it walks each block's entry live set: for each live register, it checks the 0x20 flag at register descriptor offset 48. If the flag is clear, the register is reported as potentially uninitialized via warning strings "Found %d potentially uninitialized register(s) in function %s" (warning 0x1E14).

Data Flow Infrastructure for Scheduling

The scheduling subsystem has its own dataflow infrastructure (separate from the optimizer's OriPerformLiveDead):

Address | Size | Identity
sub_8DB070 | 8.2KB | LivenessInit -- allocate and initialize per-BB liveness structures
sub_8DB5F0 | 8.4KB | LivenessCompute -- compute liveness per basic block
sub_8DBAF0 | 16KB | LivenessAnalysis -- full liveness analysis (red-black tree interval structure)
sub_8DC3F0 | 3.0KB | ComputeDataFlowState -- scheduling-specific dataflow
sub_8DC620 | 3.3KB | UpdateDataFlowOnSchedule -- update flow after scheduling decisions
sub_8DC880 | 10KB | PropagateDataFlow -- propagate dataflow information
sub_8DCF20 | 23KB | BuildDFGForScheduling -- build scheduling data flow graph
sub_8DE7A0 | 12KB | IterativeDataFlow -- iterative fixed-point solver
sub_8DEF90 | 2.0KB | FinalizeDataFlow -- finalize dataflow results

The sub_8DBAF0 function implements a red-black tree (evidenced by tree rotations and color flags at node offset +40 in the decompiled code), used to store liveness intervals as an ordered set. This enables efficient range queries: "which registers are live at program point P?" is answered by a tree search in O(log n) rather than a linear scan.
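
The interval query can be sketched with a sorted array in place of the red-black tree -- the binary search gives the same O(log n) locate step, after which candidates are scanned (the tree version avoids the scan through its ordering; interval layout here is illustrative):

```c
/* Liveness interval [start, end] for one register; sorted by start. */
typedef struct { int start, end, reg; } Interval;

/* Collect registers live at point p; returns the count. */
int live_at(const Interval *iv, int n, int p, int *out_regs) {
    /* binary search: first interval with start > p */
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (iv[mid].start <= p) lo = mid + 1; else hi = mid;
    }
    int count = 0;
    for (int i = 0; i < lo; i++)       /* candidates starting at or before p */
        if (iv[i].end >= p)
            out_regs[count++] = iv[i].reg;
    return count;
}
```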

sub_781F80: Basic Block Rebuild

This function appears ubiquitously as a prerequisite to liveness computation. It is called with a mode parameter:

  • sub_781F80(func, 0): Reset/rebuild basic block metadata for reverse scheduling mode
  • sub_781F80(func, 1): Full rebuild for forward analysis (used before register allocation)

Over 50 call sites reference this function across the optimizer, register allocator, and scheduler. It refreshes the basic block linked lists, instruction counts, and block boundary markers that the liveness analysis depends on.

Key Function Table

Address | Size | Identity | Confidence
sub_BDBA60 | ~120B | BitVector::allocate | HIGH (0.90)
sub_BDBFB0 | ~120B | BitVector::setBit | HIGH (0.90)
sub_BDC0E0 | ~120B | BitVector::clearBit | HIGH (0.90)
sub_BDC200 | ~140B | BitVector::testBit | HIGH (0.90)
sub_BDCDE0 | ~400B | BitVector::operator|= (OR) | HIGH (0.95)
sub_BDCF40 | ~564B | BitVector::orIfChanged | HIGH (0.95)
sub_BDC5F0 | ~484B | BitVector::operator&= (AND) | HIGH (0.95)
sub_BDC790 | ~800B | BitVector::andIfChanged | HIGH (0.95)
sub_BDDAA0 | ~400B | BitVector::operator^= (XOR) | HIGH (0.95)
sub_BDC3F0 | ~520B | BitVector::assignAND | HIGH (0.90)
sub_BDD300 | ~488B | BitVector::orWithAndNot | HIGH (0.92)
sub_BDD560 | ~648B | BitVector::orWithAndNotIfChanged | HIGH (0.92)
sub_BDBD60 | ~368B | BitVector::extractBits | HIGH (0.88)
sub_BDD8C0 | ~320B | BitVector::popcount | MEDIUM (0.80)
sub_BDDC00 | ~140B | BitVector::clear | HIGH (0.90)
sub_BDCA60 | ~280B | BitVector::operator= (copy) | MEDIUM (0.85)
sub_BDCC20 | ~320B | BitVector::isSubsetOf | MEDIUM (0.85)
sub_BDE150 | 9KB | CFG::computeRPO | HIGH (0.90)
sub_781F80 | varies | Basic block rebuild | HIGH (0.85)
sub_A10160 | ~2KB | Liveness computation entry | MEDIUM (0.75)
sub_A0BA40 | 15KB | Block-level liveness iteration | HIGH (0.85)
sub_A06A60 | 15KB | Per-block register set tracking | HIGH (0.95)
sub_A0D800 | 39KB | Dependency graph construction | HIGH (0.95)
sub_A0F970 | 10KB | DAG construction entry | HIGH (0.95)
sub_92C240 | 8KB | Liveness bitvector operations (regalloc) | HIGH (87 callers)
sub_9449B0 | 1.8KB | Liveness range calculator (spill codegen) | HIGH
sub_8DBAF0 | 16KB | LivenessAnalysis (scheduling) | HIGH (0.85)
sub_8DB5F0 | 8.4KB | LivenessCompute (per-BB scheduling) | HIGH (0.85)
sub_8DB070 | 8.2KB | LivenessInit (scheduling) | HIGH (0.85)
sub_8DE7A0 | 12KB | IterativeDataFlow (scheduling solver) | HIGH (0.80)
sub_A0B5E0 | varies | Uninitialized register detector | HIGH (0.97)
sub_A7BC80 | 36KB | RegisterSetManager (multi-file liveness) | MEDIUM (0.65)
sub_BEF110 | 108KB | OriSplitLiveRanges core (Phase 19) | HIGH (0.90)
sub_BF33D0 | ~1KB | OriSplitLiveRanges knob-gated entry (reads knob 456) | HIGH (0.90)
sub_A1D3A0 | ~0.2KB | OriSplitLiveRanges vtable execute | HIGH (0.90)
sub_BEAFD0 | ~2KB | AllocKillBitvectors (VR+96 per-VR kill sets) | HIGH (0.85)
sub_BEB110 | ~3KB | AllocGenBitvectors (VR+24 per-VR gen sets) | HIGH (0.85)
sub_BE3390 | varies | ComputeSplitCount(interference) | MEDIUM (0.80)
sub_BE3590 | varies | ComputeSplitCount(clean) | MEDIUM (0.80)
sub_BE3720 | varies | ComputeSplitCount(multiSrc) | MEDIUM (0.80)
sub_BEE7F0 | varies | TestInterference (BV interference test) | MEDIUM (0.80)
sub_BEEC80 | ~1KB | UpdateHashWithSplit (per-split hash update) | MEDIUM (0.80)
sub_BEB9C0 | varies | Hash table init/destroy (secondary) | MEDIUM (0.75)
sub_BEBA40 | varies | Hash table init/destroy (primary) | MEDIUM (0.75)

Key Constants

Value | Meaning
+832 | Code Object offset: main register liveness bitvector (R registers)
+856 | Code Object offset: uniform register liveness bitvector (UR registers)
+840 | Code Object offset: max live register count
+848 | Code Object offset: liveness info pointer
+720 | Code Object offset: RPO order array
+984 | Code Object offset: number of basic blocks
+1378 bit 4 | Flag: function uses uniform registers (enables +856 bitvector)
0xCFFF | Opcode mask: strips modifier bits for side-effect classification
+792 | Context offset: reverse-ordered register array (for live range splitting)
+1370 bit 2 | Flag: liveness invalid (cleared by sub_BEF110 on exit)
+1624 | Context offset: register budget (double, read by sub_BF33D0)
VR+24 | Virtual register offset: gen bitvector (allocated by sub_BEB110)
VR+96 | Virtual register offset: kill bitvector (allocated by sub_BEAFD0)
VR+136 | Virtual register offset: interference chain (linked list of aliased VRs)
VR+144 | Virtual register offset: register class ID (int32)
VR+176 | Virtual register offset: live-through bitvector
VR+245 | Virtual register byte flag: needs-opcode-rewrite (set by Phase 19 cleanup)
VR+264 | Virtual register flags: bit 0 = has-interference-chain, bit 1 = non-subset, bit 2 = was-split
VR+280 | Virtual register flags: bit 2 = needs-split, bit 4 = propagated, bit 12 = predicate-qualified
0x811C9DC5 | FNV-1a offset basis (used in Phase 19 interference hash)
16777619 | FNV-1a prime (0x01000193)
0x22BC5C0 | String address: "OriSplitLiveRanges"
0x22BCFE8 | String address: "OriSplitHighPressureLiveRanges"

Cross-References

Synchronization & Barriers

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas synchronization pipeline manages the insertion, optimization, and expansion of all GPU synchronization and barrier instructions. Eight phases span the full compilation pipeline, from early memory-ordering fence insertion through post-scheduling dependency barrier fixup. These phases collectively translate the PTX memory model into the hardware synchronization primitives required by each SM architecture: thread block barriers (BAR), memory barriers (MEMBAR), dependency barriers (DEPBAR), warp-level synchronization (WARPSYNC/BSYNC/BSSY), and asynchronous barriers (MBARRIER).

Phases:             25, 26, 42, 71, 72, 99, 100, 114
Categories:         Lowering (25, 42, 72), Optimization (26, 71), Scheduling (99, 100, 114)
Pipeline span:      Phase 25 (early optimization) through phase 114 (post-scheduling)
Key opcodes:        BAR (opcode 61), MEMBAR (opcode 111), DEPBAR, BSYNC, BSSY, WARPSYNC, MBARRIER.*. Note: the code uses opcode 130 (HSET2 in the ROT13 name table) as an internal marker for barrier/sync instructions in the Ori IR.
Architecture gates: Phases 100, 114 dispatch through architecture vtable; phase 42 dispatches through backend vtable at ctx+1584 offset 0x168
Related EIATTR:     EIATTR_SYNC_STACK, EIATTR_NUM_BARRIERS, EIATTR_NUM_MBARRIERS, EIATTR_MBARRIER_INSTR_OFFSETS, EIATTR_GEN_ERRBAR_AT_EXIT, EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS
CLI options:        --assume-extern-functions-do-not-sync, --no-membermask-overlap, --print-potentially-overlapping-membermasks
Knobs:              DisableErrbarAfterMembar, knob 487 (iteration gate), knob 358 (sync mode), knob 472 (barrier liveness)

GPU Synchronization Model

NVIDIA GPUs provide four distinct synchronization mechanisms, each operating at a different scope and addressing different hazards.

Thread Block Barriers (BAR)

Thread block barriers synchronize all threads within a cooperative thread array (CTA). The hardware provides 16 named barriers (indices 0--15), each tracking participation counts. PTX exposes these as:

  • bar.sync N -- block until all threads in the CTA arrive at barrier N
  • bar.red.{and,or,popc} N -- barrier with warp-level reduction
  • bar.arrive N -- signal arrival without blocking
  • barrier.cta.{sync,arrive,red} -- PTX 8.0+ cluster-aware variants

In SASS, these map to the BAR instruction family (opcode 61 in the ROT13 name table). The Ori IR uses opcode 130 (HSET2 in the ROT13 name table) as an internal barrier/sync marker. The EIATTR_NUM_BARRIERS metadata records the maximum barrier index used, which the hardware uses to partition the convergence barrier file.

PTX:     bar.sync 0;
SASS:    BAR.SYNC 0x0;
         // stalls warp until all CTASize threads arrive at barrier 0

Memory Barriers (MEMBAR)

Memory barriers enforce ordering of memory operations across different visibility scopes:

  • membar.cta -- visible to threads in the same CTA
  • membar.gpu -- visible to threads on the same GPU device
  • membar.sys -- visible to all agents (including host CPU and peer GPUs)

Additionally, fence.proxy instructions enforce ordering between different memory proxy domains (generic, texture, surface, constant).

The EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS records the byte offsets of membar.sys instructions for driver-level workaround injection.

Dependency Barriers (DEPBAR / Scoreboards)

Dependency barriers are the micro-architectural mechanism for tracking instruction-level data hazards. Each warp has 6 scoreboard entries (barriers 0--5) that track completion of long-latency operations. SASS instructions encode a per-instruction scheduling control field containing:

  • Stall count (4 bits): cycles to wait before issuing the next instruction
  • Yield flag (1 bit): hint to give up the scheduling quantum
  • Write barrier (3 bits): scoreboard index to set on result writeback
  • Read barrier (3 bits): scoreboard index to set once the source operands have been read (WAR protection)
  • Wait barrier mask (6 bits): scoreboard entries that must clear before the instruction issues
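A bit-packing helper makes the field layout concrete. This is a hedged sketch: the field names and widths follow a Volta-style control-field description, but the bit offsets here are illustrative, not the actual SASS encoding.

```python
# One plausible layout of the per-instruction scheduling control field.
# Offsets are illustrative only -- the real SASS bit positions differ
# per architecture.
FIELDS = [               # (name, width in bits)
    ("stall",      4),
    ("yield_flag", 1),
    ("write_bar",  3),
    ("read_bar",   3),
    ("wait_mask",  6),
]

def pack_control(**vals):
    word, shift = 0, 0
    for name, width in FIELDS:
        v = vals.get(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        word |= v << shift
        shift += width
    return word

def unpack_control(word):
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out

ctrl = pack_control(stall=5, yield_flag=1, write_bar=2, wait_mask=0b000101)
assert unpack_control(ctrl)["stall"] == 5
assert unpack_control(ctrl)["wait_mask"] == 0b000101
```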

DEPBAR is the explicit dependency barrier instruction that waits for a specific set of scoreboard entries. Scoreboards are assigned by phase 115 (AdvancedScoreboardsAndOpexes) and phase 116 (ProcessO0WaitsAndSBs); the sync passes described here prepare the IR for scoreboard generation but do not assign scoreboards directly.

Warp-Level Synchronization

Warp-level sync instructions operate within a single warp (32 threads):

  • WARPSYNC mask -- synchronizes threads identified by the lane mask (sm70+)
  • BSSY B, target -- pushes a synchronization barrier for convergence
  • BSYNC B -- pops and waits at the convergence barrier

The BSSY/BSYNC mechanism replaces the pre-Volta implicit reconvergence stack. The compiler must insert these pairs explicitly at divergence/reconvergence points. EIATTR_SYNC_STACK records metadata about the convergence barrier stack depth.

Asynchronous Barriers (MBARRIER)

Introduced in sm90 (Hopper), MBARRIER provides hardware-accelerated asynchronous barriers in shared memory. These support non-blocking arrival, expected transaction count tracking, and parity-based phase completion -- critical for async copy (cp.async.bulk) and TMA (Tensor Memory Accelerator) operations.

MBARRIER operations in PTX:

PTX instruction              Purpose
mbarrier.init                Initialize barrier object in shared memory
mbarrier.arrive              Signal arrival (non-blocking)
mbarrier.arrive_drop         Arrive and decrement expected count
mbarrier.arrive.expect_tx    Arrive with expected transaction byte count
mbarrier.test_wait           Test if barrier phase is complete
mbarrier.try_wait            Wait with timeout
mbarrier.try_wait.parity     Phase-parity-based wait
mbarrier.pending_count       Query remaining arrivals
mbarrier.inval               Invalidate barrier
mbarrier.complete_tx         Mark transaction bytes as complete

The EIATTR_NUM_MBARRIERS and EIATTR_MBARRIER_INSTR_OFFSETS metadata inform the runtime about barrier allocation and instruction locations for driver patching.
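The arrival-count/transaction-count/phase-parity interplay can be sketched as a small state machine. This toy model (method names mirror the PTX ops above; the internal layout is purely illustrative) shows why try_wait.parity completes only after both all arrivals and all expected transaction bytes have landed.

```python
# Toy model of an sm90 mbarrier object: expected arrivals, pending
# transaction bytes, and a phase parity bit that flips on completion.
class MBarrier:
    def __init__(self, count):           # mbarrier.init
        self.expected = count
        self.arrivals = 0
        self.tx_pending = 0
        self.parity = 0                  # flips each time a phase completes

    def arrive(self, expect_tx=0):       # mbarrier.arrive[.expect_tx]
        self.arrivals += 1
        self.tx_pending += expect_tx
        self._maybe_complete()

    def complete_tx(self, nbytes):       # mbarrier.complete_tx
        self.tx_pending -= nbytes
        self._maybe_complete()

    def try_wait_parity(self, parity):   # mbarrier.try_wait.parity
        # The queried phase is done once our parity no longer matches it.
        return self.parity != parity

    def _maybe_complete(self):
        if self.arrivals >= self.expected and self.tx_pending <= 0:
            self.arrivals = 0
            self.parity ^= 1

mbar = MBarrier(2)
mbar.arrive(expect_tx=256)           # producer announces a 256-byte TMA copy
mbar.arrive()
assert not mbar.try_wait_parity(0)   # copy bytes have not landed yet
mbar.complete_tx(256)                # hardware marks the bytes complete
assert mbar.try_wait_parity(0)       # phase 0 finished, parity flipped
```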


Phase 25 -- StageAndFence

Phase name        StageAndFence
Category          Lowering
Execute wrapper   sub_C5FBC0 (34 bytes)
Implementation    sub_1392E30 (166 bytes)
Core logic        sub_1390B30 (8,956 bytes, 97 callees)
Setup             sub_1389AF0 (3,049 bytes)
Teardown          sub_138A6E0 (3,408 bytes)
Gating            Requires opt_level > 1 AND context+1368 bit 0 AND context+1397 bits[6:7] != 0x40; additionally guarded by the "LoopUnrolling" disable check and knob 487
Total code        ~16 KB across 0x1389AF0--0x1393340

Purpose

StageAndFence inserts memory fence and staging instructions to enforce coherence ordering after loop unrolling. When loop unrolling replicates memory operations, the replicated loads and stores may violate the memory model if they cross a synchronization boundary that was inside the original loop body. This pass re-establishes correctness by inserting fence operations at the boundaries of unrolled iterations.

Execution Flow

sub_1392E30(compilation_unit):
    // Guard: must have loops and bit flags set
    if !(context+1368 bit 0) or (context+1397 & 0xC0) == 0x40:
        return

    // Check if "LoopUnrolling" pass is disabled
    IsPassDisabled(knob_state, "LoopUnrolling", &disabled)
    if disabled: return
    if opt_level <= 2: return

    // Check knob 487
    if !CheckKnob(knob_state, 487, 1): return

    // Core execution
    sub_1389AF0(state, compilation_unit)   // allocate working structures
    sub_1390B30(state)                     // main fence insertion pass
    sub_138A6E0(state)                     // cleanup

Main Pass -- sub_1390B30

The main pass (8,956 bytes) is the largest function in this phase group. It:

  1. Iterates over the basic block list via the instruction chain (context+272)
  2. Identifies memory operations that cross unrolled loop iteration boundaries
  3. Computes fence requirements based on the memory model and target architecture
  4. Calls sub_A0F020 (the scheduling entry point) to build dependency information and determine where fences are needed
  5. Inserts fence.proxy or MEMBAR pseudo-instructions at identified locations
  6. Updates the instruction list metadata via sub_781F80 (basic block refresh)

The function takes floating-point parameters (double a2, double a3, __m128d a4), suggesting it incorporates latency and throughput heuristics when deciding fence placement -- preferring to merge adjacent fences or delay fences to overlap with independent computation.
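The fence-merging preference described above can be sketched in a few lines. This is an illustrative toy, not the recovered algorithm: adjacent fences with no intervening memory operation collapse into one fence at the wider scope. All instruction and scope names here are hypothetical.

```python
# Toy sketch of fence merging: two back-to-back membar instructions are
# replaced by a single fence at the stronger (wider) scope.
SCOPE_RANK = {"cta": 0, "gpu": 1, "sys": 2}

def merge_adjacent_fences(instrs):
    out = []
    for op, arg in instrs:
        if op == "membar" and out and out[-1][0] == "membar":
            prev_scope = out[-1][1]
            # Replace the pair with one fence at the wider scope.
            out[-1] = ("membar", max(prev_scope, arg, key=SCOPE_RANK.get))
        else:
            out.append((op, arg))
    return out

prog = [("st", "a"), ("membar", "cta"), ("membar", "sys"), ("ld", "b")]
assert merge_adjacent_fences(prog) == [("st", "a"), ("membar", "sys"), ("ld", "b")]
```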


Phase 26 -- OriRemoveRedundantBarriers

Phase name               OriRemoveRedundantBarriers
Category                 Optimization
Execute wrapper          sub_C60BD0 (334 bytes)
Implementation           sub_790A40 (2,288 bytes, 33 callees)
Helper (post-RA sched)   sub_790020 (1,200 bytes)
Helper (pre-RA opt)      sub_7904D0 (1,381 bytes)
Helper (barrier opt)     sub_7923A0 (2,344 bytes, 30 callees)
Helper (barrier pass)    sub_792CD0 (1,360 bytes, 25 callees)
Gating                   Multi-function dispatch: only runs when sub_7DDB50(ctx) > 1 (i.e., the compilation unit contains more than one function)
Total code               ~10 KB across 0x790020--0x793220

Purpose

OriRemoveRedundantBarriers performs dataflow-driven elimination of provably redundant barrier instructions. When the compiler can prove that all threads in a warp (or CTA) must have already passed through a dominating synchronization point, subsequent barriers to the same scope are redundant and can be removed. This reduces the synchronization overhead without changing program semantics.

Execution Flow

The execute wrapper sub_C60BD0 is a multi-function dispatch pattern: when a compilation unit contains multiple functions, it creates two reference-counted list objects, stores the current phase chain pointer, and calls sub_790A40 for cross-function barrier analysis. For single-function units, it returns directly.

sub_C60BD0(phase, compilation_unit):
    func_count = sub_7DDB50(compilation_unit)
    if func_count <= 1: return

    // Create two ref-counted analysis lists
    list1 = pool_alloc(24)
    list1->refcount = 1
    list2 = pool_alloc(24)
    list2->refcount = 1

    // Store current phase chain
    saved_chain = compilation_unit->field_88

    // Run multi-function barrier analysis
    sub_790A40(&compilation_unit)

    // Release ref-counted lists
    release(list1)
    release(list2)

Main Analysis -- sub_790A40

The main analysis function (2,288 bytes) operates through several stages:

  1. Mode selection: Queries knob 358 (sync mode) through the knob container at ctx+1664. Four modes exist:

    • Mode 0: no barrier removal (return immediately via sub_756F10)
    • Mode 1: conservative removal (calls sub_790020)
    • Mode 2: aggressive removal (calls sub_790020 with flag)
    • Mode >= 3: full multi-function analysis
  2. Graph construction (sub_7E6090): Builds an instruction-level dependency graph with 32-bit flags. Called with (ctx, 0, 0, 0, 0).

  3. Liveness refresh (sub_781F80): Refreshes the basic block liveness information with mode parameter 1 (compute barrier liveness).

  4. Dependency tracking (sub_A10160): Sets up dependency tracking data structures.

  5. Block iteration (sub_769300, sub_752AB0): Builds block-level analysis structures for the function.

  6. Redundancy analysis: For each barrier instruction (opcode 130; HSET2 in the ROT13 name table, but used as the internal Ori IR marker for barrier/sync instructions -- actual SASS BAR is opcode 61, MEMBAR is opcode 111), checks whether the barrier's destination register is live in any successor block. If the barrier result is dead (no thread could observe it before the next dominating barrier), the barrier is eliminated.

  7. Block-level merging (sub_75EAE0, sub_75E2F0): Merges barriers at block boundaries where adjacent blocks have compatible barrier scopes.

The algorithm checks barriers by walking the instruction chain and testing opcode 130 (HSET2 in the ROT13 name table; used as the internal Ori IR opcode for barrier/sync instructions -- not the actual HSET2 half-precision set instruction). For each barrier, it extracts the destination operand (field+84), resolves the register through the register table at context+88, and tests whether the register's use-count (reg+24) indicates the barrier result is consumed.
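The dead-result walk described above can be reduced to a few lines. This is a minimal sketch, not the recovered code: the dictionary fields are hypothetical stand-ins for the recovered offsets (the destination operand at field+84, the use count at reg+24).

```python
# Sketch of the redundancy walk: barrier instructions (the internal
# opcode-130 marker) whose destination register has no remaining uses
# are eliminated.
BARRIER_MARKER = 130

def remove_dead_barriers(instrs, reg_table):
    kept = []
    for ins in instrs:
        if ins["opcode"] == BARRIER_MARKER:
            reg = reg_table[ins["dest"]]   # field+84 -> register table lookup
            if reg["use_count"] <= 0:      # reg+24: result never consumed
                continue                   # provably redundant -> drop it
        kept.append(ins)
    return kept

regs = {0: {"use_count": 2}, 1: {"use_count": 0}}
prog = [
    {"opcode": 130, "dest": 0},   # live barrier -- kept
    {"opcode": 130, "dest": 1},   # dead barrier -- removed
    {"opcode": 61,  "dest": 0},   # unrelated instruction -- kept
]
assert len(remove_dead_barriers(prog, regs)) == 2
```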


Phase 42 -- ExpandMbarrier

Phase name        ExpandMbarrier
Category          Lowering
Execute wrapper   0xC5F110 (6 bytes)
Implementation    Architecture dispatch via *(*(ctx+0x630))->vtable[0x168/8]
isNoOp            Always false (0xC5F130 returns 0)
Opt-level check   None -- runs at all optimization levels

Purpose

ExpandMbarrier expands MBARRIER pseudo-instructions into native barrier instruction sequences. This is critically important for sm90+ (Hopper and later) architectures that use asynchronous barriers for TMA operations, cp.async.bulk, and warpgroup-level synchronization.

Dispatch Mechanism

Unlike most phases that tail-call a fixed function after an optimization level check, ExpandMbarrier performs a direct vtable dispatch:

mov    rdi, [rsi+0x630]     ; rdi = ctx->arch_backend (offset 1584)
mov    rax, [rdi]            ; rax = arch_backend->vtable
jmp    [rax+0x168]           ; call vtable[45] -- ExpandMbarrier impl

The architecture backend at ctx+1584 provides the actual expansion logic. This design allows each SM generation to define its own mbarrier expansion rules:

  • Pre-sm90: MBARRIER pseudo-ops do not exist; the phase is effectively a no-op.
  • sm90 (Hopper): Expands MBARRIER pseudo-ops into hardware mbarrier instruction sequences using the mbarrier object in shared memory. Handles mbarrier.init, mbarrier.arrive, mbarrier.arrive.expect_tx, mbarrier.try_wait.parity, and mbarrier.inval.
  • sm100+ (Blackwell): Extended mbarrier semantics for tcgen05.fence, cluster-level barriers, and async pipeline operations.

MBARRIER Expansion Patterns

A typical async copy pattern in the Ori IR and its expansion:

Before expansion (pseudo-ops):
    MBARRIER_INIT  %mbar, count
    MBARRIER_ARRIVE_EXPECT_TX  %mbar, bytes
    CP.ASYNC.BULK.TENSOR  dst, src, %mbar
    MBARRIER_TRY_WAIT_PARITY  %mbar, parity, pred

After expansion (native):
    MBARRIER.INIT  [smem_addr], count
    MBARRIER.ARRIVE.EXPECT_TX  [smem_addr], bytes
    CP.ASYNC.BULK.TENSOR  [dst], [src], [smem_addr]
    MBARRIER.TRY_WAIT.PARITY  pred, [smem_addr], parity

The expansion resolves shared memory addresses for the mbarrier objects, handles the naming of __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier and __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity reserved shared memory regions, and inserts any required fence.proxy operations for proxy domain coherence.


Phase 71 -- OptimizeSyncInstructions

Phase name          OptimizeSyncInstructions
Category            Optimization
Execute wrapper     sub_C60080 (34 bytes)
Implementation      sub_90A340 (1,670 bytes, 21 callees)
Sync predicate      sub_18F6930 (185 bytes) -- determines if sync optimization should run
Gating              Requires opt_level > 2; additionally checks knob 487, architecture flags at context+1368, and the sub_18F6930 predicate
Pipeline position   After OriPropagateVaryingSecond (70), before LateExpandSyncInstructions (72)

Purpose

OptimizeSyncInstructions performs redundancy elimination and simplification of synchronization instructions within the partial-SSA window. It identifies and removes sync instructions that are provably unnecessary based on the data flow and the GPU memory model, and simplifies complex sync patterns into cheaper equivalents.

Gating Logic

The pass has elaborate gating controlled by sub_18F6930, which evaluates:

sub_18F6930(ctx, mode):
    // Check architecture-specific sync flags
    flags = *(ctx+1398)
    if (flags & 0x18) != 0:
        return (flags & 0x18) == 8   // specific arch config

    // Check whether SM requires explicit sync
    if !(*(ctx+1412) bit 7) or *(ctx+1584)->field_372 <= 28673:
        return true

    // Functions with <= 4 registers always need sync
    if *(ctx+1704) <= 4:
        return true

    // Mode-specific knob checks at offsets 51120/51192
    ...

The value 28673 corresponds to sm70/sm72/sm73/sm75 architecture IDs. The predicate returns true (optimize) for architectures that have explicit synchronization requirements (Volta and later), and false for older architectures where synchronization is implicit.

Main Algorithm -- sub_90A340

sub_90A340(ctx):
    if opt_level <= 2: return
    if !CheckKnob(ctx+1664, 487, 1): return

    // Determine sync optimization mode
    has_uniform_regs = (ctx+1412 bit 7) && !(ctx+1368 bit 4)
    arch_data = *(*(ctx+1664)+72)
    sync_mode = *(arch_data + 15480)
    if sync_mode == 1: mode = *(arch_data + 15488)

    // Main path: combined sync + barrier optimization
    if (ctx+1368 flags 0x20000001 all set) && (ctx+1377 bit 6) && !mode:
        need_expand = sub_18F6930(ctx, 0)
        sub_781F80(ctx, 1)               // refresh liveness

        if !need_expand && !has_uniform_regs:
            sub_7E6090(ctx, 0, 0, 0, 32) // build dep graph, 32-bit mode
            goto optimize
    else:
        need_expand = sub_18F6930(ctx, 0)
        if !has_uniform_regs && !need_expand: return
        sub_781F80(ctx, 1)

    // Barrier liveness computation
    sub_775010(ctx)
    sub_7E6090(ctx, 0, 0, 0, 32)

    // Walk instruction list, find opcode 130 (HSET2 in ROT13; internal barrier/sync marker)
    for instr = ctx->first_instr; instr; instr = instr->next:
        if instr->opcode != 130: continue

        // Extract operand, check register type
        operand = instr->field_84
        if operand_type(operand) != 1: continue

        reg = register_table[operand & 0xFFFFFF]
        if !check_liveness(reg): continue

        // For uniform-register-aware path:
        if has_uniform_regs:
            if (instr->field_91 & 1): continue  // skip if flagged
            if reg->file != 6: continue          // must be barrier reg
            if reg->use_count <= 1: continue
            // Check all uses via use-def chain...
            try_merge_barriers(ctx, instr)

        // Standard redundancy elimination
        try_eliminate_redundant_sync(ctx, instr)

    cleanup_lists()

The pass iterates the flat instruction list (not per-block), checking every instruction with opcode 130 (HSET2 in the ROT13 name table; used as the internal Ori IR opcode for barrier/synchronization instructions). For each barrier, it examines the operand to determine:

  1. Whether the barrier result register is consumed by any subsequent instruction
  2. Whether the barrier can be merged with an adjacent barrier of the same scope
  3. Whether the barrier guards a memory region that is provably thread-local

The sub_1245740 call performs the actual redundancy proof by checking dominance relationships between barrier pairs.
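The dominance relation underpinning that proof is simple to state: a barrier can only be proven redundant by a barrier that lies on every path from entry. A minimal sketch of the dominance query (the CFG representation via immediate-dominator links is hypothetical, not the recovered data structure):

```python
# Walk the immediate-dominator chain upward from b; a dominates b iff
# a appears on that chain.
def dominates(idom, a, b):
    """True if block a dominates block b (idom maps block -> immediate dominator)."""
    while b is not None:
        if b == a:
            return True
        b = idom[b]
    return False

# Linear chain entry -> mid -> exit: entry dominates everything below it.
idom = {"entry": None, "mid": "entry", "exit": "mid"}
assert dominates(idom, "entry", "exit")
assert not dominates(idom, "exit", "mid")
```

A barrier in "exit" would thus be a removal candidate only if a same-scope barrier sits in "entry" or "mid" with no intervening observable memory operation.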


Phase 72 -- LateExpandSyncInstructions

Phase name          LateExpandSyncInstructions
Category            Lowering
Execute wrapper     sub_C600B0 (34 bytes)
Implementation      sub_1381DA0 (1,517 bytes, 3 callees)
Core driver         sub_1381CD0 (206 bytes)
Gating              Requires opt_level > 1; checks context+1376 bit 5, the "Predication" disable flag, and knob 487 with an iteration counter
Error diagnostic    "ExpandSyncInstLate option is not supported on this architecture." (via sub_7EF030)
Pipeline position   After OptimizeSyncInstructions (71), before ConvertAllMovPhiToMov (73)
Gate pass           Phase 135 (AdvancedPhaseLateExpandSyncInstructions) provides an additional architecture hook

Purpose

LateExpandSyncInstructions performs the final expansion of synchronization pseudo-instructions into their target-specific SASS instruction sequences. This runs late in the pipeline (phase 72, within the partial-SSA window) so that earlier optimization passes can work with high-level sync pseudo-ops rather than architecture-specific instruction sequences.

Execution Flow

The entry function sub_1381DA0 shares structural similarity with the Predication pass entry because both operate within the same address range (0x1381000--0x1382000) and share infrastructure for walking the instruction list within the partial-SSA window.

sub_1381DA0(ctx):
    if context+1376 bit 5: return      // disabled by phase flag

    // Read expansion mode from knob container
    knob_state = *(ctx+1664)
    mode = *(*(knob_state+72) + 16416)

    if mode == 0:
        limit = (ctx+1419 bit 4) != 0
    elif mode == 1:
        limit = *(*(knob_state+72) + 16424)

    IsPassDisabled(knob_state, "Predication", &disabled)
    if disabled or limit: return

    // Knob 487 iteration gating with counter
    if !CheckKnob487WithCounter(knob_state): return

    // Set up working state
    context+1385 |= 1    // mark expansion active

    // Call core driver
    sub_1381CD0(state)

    context+1385 &= ~1   // clear expansion flag
    cleanup_pools()

Expansion Rules

The pass transforms sync pseudo-instructions according to the target SM:

Pseudo-instruction     sm70+ expansion               sm90+ expansion
SYNC.WARP mask         WARPSYNC mask                 WARPSYNC mask
SYNC.BLOCK             BAR.SYNC 0                    BAR.SYNC 0
SYNC.CONVERGE target   BSSY B, target ... BSYNC B    BSSY B, target ... BSYNC B
MBARRIER.WAIT pseudo   (not expanded here)           MBARRIER.TRY_WAIT.PARITY loop
ERRBAR                 BAR.SYNC 15 (error barrier)   Conditional on DisableErrbarAfterMembar

The ERRBAR (error barrier) is a compiler-inserted synchronization point placed after membar.sys instructions to ensure memory ordering is observable before proceeding. The DisableErrbarAfterMembar knob (accessible via the CLI option string at 0x1D04BC0) controls whether these error barriers are emitted. When set to 1, the compiler omits the error barrier, trading safety for performance.


Phase 99 -- OriDoSyncronization

Phase name              OriDoSyncronization
Category                Scheduling
Execute wrapper         sub_C5FAD0 (34 bytes)
Implementation          sub_A0F020 (2,375 bytes, 32 callees) -- DAG scheduler entry
Dependency builder      sub_A0D800 (dependency DAG construction)
Per-block processor     sub_A06A60 (3,045 bytes, 53 callees)
Uninit reg check        sub_A0B5E0
Gating                  Requires opt_level > 1
Pipeline position       After BackPropagateVEC2D (98), before ApplyPostSyncronizationWars (100)
Callers of sub_A0F020   11 sites: sub_913A30, sub_9AEF60 (x2), sub_C5FA40/sub_C5FA70/sub_C5FAA0/sub_C5FAD0 (4 arch wrappers), sub_1390B30 (x2), sub_1395850 (x2)

Purpose

OriDoSyncronization is the post-optimization synchronization insertion pass. It runs after all IR-level optimizations are complete and before register allocation, using the scheduling infrastructure to analyze data dependencies and insert the synchronization instructions (BAR, DEPBAR, MEMBAR) required by the GPU memory model for correctness.

Note the intentional misspelling "Syncronization" (missing 'h') -- this is present in the binary's string table and preserved here for fidelity.

Architecture

OriDoSyncronization reuses the DAG scheduler's infrastructure (sub_A0F020) rather than implementing its own analysis. The same function serves as the scheduling entry point in multiple contexts:

  • Phase 99 (OriDoSyncronization): inserts sync instructions based on dependency analysis
  • Phase 25 (StageAndFence): inserts fences via sub_1390B30
  • Multiple architecture-specific scheduling wrappers: sub_C5FA40, sub_C5FA70, sub_C5FAA0

Execution Flow

sub_A0F020(ctx):
    while true:
        if *(ctx+1648) == 0: break

        // Initialize dependency context
        dep_ctx = pool_alloc(16)
        dep_ctx->refcount = 2
        dep_ctx->parent = ctx->pool

        // Build dependency DAG
        sub_A0D800(ctx, dep_ctx)

        // Process blocks in reverse order
        for each basic_block in reverse(block_list):
            if block->opcode == 8: continue  // skip NOP/exit blocks
            sub_A06A60(ctx, callback, block, flags...)

        // Check for uninitialized register usage
        sub_A0B5E0(ctx, dep_ctx)

        // Diagnostic output if enabled
        sub_7F44D0(ctx)

        // Break or retry based on scheduling result
        ...

Per-Block Synchronization -- sub_A06A60

The per-block processor (3,045 bytes, 53 callees) is the core of sync insertion. For each basic block:

  1. Allocates temporary liveness bitsets via sub_BDBA60 (bitvector alloc)
  2. Copies block-entry live set from ctx+832 via sub_BDC300
  3. Walks instructions forward, examining each opcode (masked by 0xCFFF):
    • Opcode 93 (OUT_FINAL in ROT13; used here as a call-like control-flow marker -- actual CALL is opcode 71): copies callee-save register set, handles arguments
    • Opcode 95 (STS in ROT13; used here as a barrier/terminator marker -- actual BAR is opcode 61): AND-merges successor block live sets
    • Opcode 97 (STG in ROT13; used here as a branch/control marker -- actual BRA is opcode 67): tests if live set changed since block entry
  4. Inserts sync instructions where data dependencies cross synchronization boundaries
  5. Updates uniform register liveness at ctx+856 when ctx+1378 bit 3 is set

The function uses extensive bitvector operations (13 different bitvector functions from the sub_BDB*/sub_BDC* infrastructure) to track register liveness through synchronization points.
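The AND-merge of successor live sets at a sync point can be sketched with integers standing in for the sub_BDB*/sub_BDC* bitvectors (the representation is illustrative; the recovered code uses dedicated bitvector structures):

```python
# At a barrier-like terminator, successor live-in sets are AND-merged so
# that only registers live on *every* outgoing path remain live across
# the synchronization point. Python ints model the bitvectors.
def and_merge_successors(succ_live_in):
    merged = ~0                       # all-ones identity for AND
    for bits in succ_live_in:
        merged &= bits
    return merged if succ_live_in else 0

# r0 (bit 0) is live on both successor paths; r3 (bit 3) on only one,
# so r3 drops out of the merged set.
assert and_merge_successors([0b1001, 0b0001]) == 0b0001
```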


Phase 100 -- ApplyPostSyncronizationWars

Phase name          ApplyPostSyncronizationWars
Category            Scheduling
Execute wrapper     sub_C607A0 (51 bytes)
Implementation      Architecture dispatch via *(*(ctx+0x630))->vtable[0x110/8]
Nullsub guard       Skips if vtable entry equals nullsub_170 (0x7D6C80)
Gating              Requires opt_level > 1
Pipeline position   After OriDoSyncronization (99), before AdvancedPhaseAllocReg (101)

Purpose

ApplyPostSyncronizationWars fixes write-after-read (WAR) hazards that are introduced or exposed by the synchronization insertion in phase 99. When OriDoSyncronization inserts new barrier or memory fence instructions, these insertions can create new register hazards (the barrier instruction may read a register that a subsequent instruction writes). This pass scans for and resolves those hazards.

Dispatch Mechanism

; sub_C607A0
mov    rbx, rsi                ; save ctx
call   sub_7DDB50              ; get opt_level
cmp    eax, 1
jle    return                  ; skip if opt_level <= 1

mov    rdi, [rbx+0x630]       ; rdi = ctx->arch_backend
mov    rax, [rdi]              ; rax = arch_backend->vtable
mov    rax, [rax+0x110]       ; vtable[34] = ApplyPostSyncWars impl
cmp    rax, 0x7D6C80          ; compare with nullsub_170
jne    call_impl               ; if not nullsub, call it
return:
    ret
call_impl:
    jmp    rax                 ; tail-call architecture implementation

The nullsub_170 check (at 0x7D6C80) is the no-op sentinel: if the architecture backend does not override this vtable entry, the phase is silently skipped. This allows architectures that do not have post-sync WAR hazards to avoid unnecessary work.
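The sentinel-compare dispatch can be sketched in higher-level terms. This toy model (all names hypothetical) shows the two skip conditions from the assembly above: the opt-level gate and the pointer comparison against the shared no-op, which avoids even making the call.

```python
# Sketch of the nullsub-sentinel pattern: a backend vtable slot is
# compared against the shared no-op before being tail-called.
def nullsub(ctx):                    # stands in for nullsub_170 at 0x7D6C80
    pass

class Backend:
    def __init__(self, post_sync_wars=nullsub):
        self.post_sync_wars = post_sync_wars

def run_apply_post_sync_wars(ctx, backend, opt_level):
    if opt_level <= 1:
        return "skipped: opt_level"
    if backend.post_sync_wars is nullsub:   # pointer compare, not a call
        return "skipped: nullsub"
    return backend.post_sync_wars(ctx)      # tail-call the arch override

assert run_apply_post_sync_wars({}, Backend(), 3) == "skipped: nullsub"
assert run_apply_post_sync_wars({}, Backend(lambda c: "ran"), 3) == "ran"
assert run_apply_post_sync_wars({}, Backend(lambda c: "ran"), 1) == "skipped: opt_level"
```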


Phase 114 -- FixUpTexDepBarAndSync

Phase name          FixUpTexDepBarAndSync
Category            Scheduling
Execute wrapper     sub_C60600 (51 bytes)
Implementation      Architecture dispatch via *(*(*(ctx+0x630)+0x10))->vtable[0x70/8]
Nullsub guard       Skips if vtable entry equals nullsub_43 (0x680170)
Gating              Requires opt_level > 1
Pipeline position   After PostFixForMercTargets (113), before AdvancedScoreboardsAndOpexes (115)

Purpose

FixUpTexDepBarAndSync performs a post-scheduling fixup of texture dependency barriers and synchronization instructions. After the main scheduling passes (phases 97--110) have reordered instructions, and before the Mercury encoder (phases 117--122) finalizes SASS encoding, texture fetch instructions may carry dependency barriers that are incorrect due to instruction movement. This phase corrects those barriers.

Dispatch Mechanism

The dispatch is doubly-indirect, going through two vtable levels:

; sub_C60600
mov    rbx, rsi
call   sub_7DDB50              ; get opt_level
cmp    eax, 1
jle    return

mov    rax, [rbx+0x630]       ; arch_backend
mov    rdi, [rax+0x10]        ; secondary object at arch_backend+16
mov    rax, [rdi]              ; secondary vtable
mov    rax, [rax+0x70]        ; vtable[14] = FixUpTexDepBar impl
cmp    rax, 0x680170           ; compare with nullsub_43
jne    call_impl
return:
    ret
call_impl:
    jmp    rax                 ; tail-call implementation

The double indirection (arch_backend -> arch_backend+16 -> vtable+0x70) indicates that the texture dependency barrier fixup lives in a secondary object owned by the architecture backend -- likely the scheduling/scoreboard subsystem object.

Texture Dependency Barriers

Texture fetches are long-latency operations (hundreds of cycles). The hardware uses dependency barriers (scoreboards) to track their completion. When the scheduler moves a texture fetch away from its original position, the dependency barrier bookkeeping can become suboptimal or incorrect before AdvancedScoreboardsAndOpexes (phase 115) assigns final scoreboards. Running immediately before phase 115, this fixup pass:

  1. Scans for texture fetch instructions (opcode 0x17 / class 0x37/0x38 in the scheduling tables)
  2. Checks that the assigned write-barrier index correctly covers the instruction's result register
  3. Verifies that consumer instructions have the corresponding read-barrier bit set in their wait mask
  4. Adjusts stall counts and yield flags if the texture result is consumed sooner than the original schedule assumed
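Checks 2 and 3 above amount to a producer/consumer consistency scan. This hedged sketch (the instruction records are hypothetical simplifications of the recovered structures) verifies that every consumer of a texture result waits on the scoreboard the fetch sets on writeback:

```python
# For each texture fetch, record which scoreboard its writeback will set;
# any later instruction reading that result must have the corresponding
# bit in its wait mask, or a hazard exists.
def verify_tex_depbars(instrs):
    pending = {}                       # dest reg -> write-barrier index
    for ins in instrs:
        for src in ins.get("srcs", []):
            if src in pending:
                sb = pending.pop(src)
                if not (ins["wait_mask"] >> sb) & 1:
                    return False       # consumer does not wait -> hazard
        if ins.get("is_tex"):
            pending[ins["dest"]] = ins["write_bar"]
    return True

ok = [
    {"is_tex": True, "dest": "r4", "write_bar": 2, "wait_mask": 0},
    {"srcs": ["r4"], "wait_mask": 0b100},   # waits on scoreboard 2 -- fine
]
bad = [
    {"is_tex": True, "dest": "r4", "write_bar": 2, "wait_mask": 0},
    {"srcs": ["r4"], "wait_mask": 0b001},   # waits on the wrong scoreboard
]
assert verify_tex_depbars(ok) and not verify_tex_depbars(bad)
```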

Memory Order Intrinsic Lowering

Before the eight sync phases operate on the Ori IR, the OCG intrinsic lowering pipeline translates PTX memory-ordering intrinsics into Ori IR instruction sequences. Three sibling functions in the OCG body dispatcher (sub_6D8B20) handle the three families of memory-ordering intrinsics. All three share an identical subop-array parsing protocol and the same scope/memory-order/deprecation validation logic.

Dispatcher and Function Family

The OCG body dispatcher at sub_6D8B20 (432 lines) reads the intrinsic ID from *(state+10688) and dispatches to per-family lowering functions via a 28-case switch statement. The three memory-ordering handlers are:

Case   Function     Size                Family               PTX instructions
9      sub_6C0D90   19 KB (812 lines)   Atomic/reduction     atom.add, atom.cas, atom.exch, red.add
0xA    sub_6C1CF0   16 KB (633 lines)   Mbarrier             mbarrier.arrive, mbarrier.test_wait, mbarrier.try_wait, counted/bytemask variants
0x16   sub_6C4DA0   15 KB (647 lines)   Fence / load-store   fence.sc, ld.acquire, st.release with scope/domain

Subop Array Protocol

Each intrinsic descriptor carries a subop array at state+10704 (an int[]) with the count at state+10712. The subop values encode orthogonal PTX qualifiers (scope, memory order, type, domain) into a flat integer sequence that the lowering functions parse in positional order.

Reconstructed subop value map (shared by all three functions):

Subop       Meaning                                          IR effect
0           Scope qualifier (.sys/.gpu/.cta)                 Sets scope_level = 4
1           Counted mode (mbarrier arrival count)            Adds extra type-14 parameter
2           Shared domain (_shared)                          scope = 5
3           Memory order acquire                             Sets order = 5
4           Memory order release                             Sets order = 6
5           MMIO flag (.mmio)                                Sets flag bit 8
6           Vector width 2x                                  scope_width = 2
7           Vector width 4x                                  scope_width = 4
8           Type u32                                         IR type 12
9           Type s32                                         IR type 11
0xA         Type u64                                         IR type 10
0xB--0x12   Reduction ops (add/min/max/inc/dec/and/or/xor)   Op index 0--7
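The positional parsing can be sketched directly from the value map above. The state mutations mirror the "IR effect" column; the state dictionary itself and the function name are hypothetical stand-ins for the recovered lowering state.

```python
# Hedged sketch of the subop parsing shared by the three lowering
# functions, following the reconstructed value map.
def parse_subops(subops):
    st = {"scope": None, "order": None, "flags": 0, "width": 1,
          "ir_type": None, "counted": False}
    for s in subops:
        if s == 0:   st["scope"] = 4           # .sys/.gpu/.cta qualifier
        elif s == 1: st["counted"] = True      # counted mode (extra param)
        elif s == 2: st["scope"] = 5           # _shared domain
        elif s == 3: st["order"] = 5           # acquire
        elif s == 4: st["order"] = 6           # release
        elif s == 5: st["flags"] |= 1 << 8     # .mmio
        elif s == 6: st["width"] = 2
        elif s == 7: st["width"] = 4
        elif s == 8: st["ir_type"] = 12        # u32
        elif s == 9: st["ir_type"] = 11        # s32
        elif s == 0xA: st["ir_type"] = 10      # u64
        elif 0xB <= s <= 0x12:
            st["red_op"] = s - 0xB             # add/min/.../xor -> index 0-7
    return st

st = parse_subops([0, 3, 8])                   # scoped acquire on u32
assert (st["scope"], st["order"], st["ir_type"]) == (4, 5, 12)
```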

Scope and Memory Order Validation

All three functions enforce the PTX 8.0 scoped memory model rules through a three-way decision tree. The logic (taken from sub_6C0D90 and sub_6C4DA0 where the strings appear verbatim; sub_6C1CF0 enforces equivalent constraints via positional subop checks) is:

if scope_qualifier_present:
    if memory_order NOT present:
        ERROR 7308: "Required scope with memory order semantics"
elif memory_order_present:
    WARNING 7308 (via sub_7F7C10): "Deprecated scope without memory order semantics"
    // Deprecation warning — may be promoted to error in future PTX versions.
    // If location info available (ctx+104), emits follow-up via sub_8955D0.

if mmio_flag AND NOT global_domain:
    ERROR 7308: "Domain param \"_global\" required for mmio semantics"

The warning path uses sub_7F7C10 (the deprecation-warning emitter at context+1176), which returns a boolean indicating whether the warning was promoted to an error. This implements NVIDIA's staged deprecation of unscoped memory operations: PTX code using old-style membar.cta without explicit .acquire/.release qualifiers triggers the deprecation path, while new-style fence.sc.cta.acquire requires the full scope + order combination.

Mbarrier Intrinsic Lowering -- sub_6C1CF0

The mbarrier handler (16KB, case 0xA) lowers mbarrier.* PTX intrinsics into Ori IR instruction sequences. It handles:

  1. Scope/domain parsing: First subop must be 2 (shared) or 3 (global). If the first subop is > 1, it is treated as the domain selector directly; otherwise the function enters the two-position scope path where the second subop supplies the domain.

  2. Counted mode (subop 1): Enables arrival-count tracking. When active, the parameter list includes an extra type-14 (integer) parameter for the expected arrival count. Bytemask mode (subop 6) is incompatible with counted mode -- error 7300: "byte mask not allowed with counted".

  3. Bytemask mode (subop 6): Requires global destination (subop[1] == 3) and shared source (subop[2] == 2). Sets flag bit 17 (0x20000). Error messages: "global dst should be specified with bytemask" and "shared src should be specified with bytemask".

  4. Sequenced mode (subop 5): Explicitly unsupported. Error 7300: "sequenced : Not yet supported".

  5. MMIO flag (subop 4 when value == 4 in the optional-subop loop): Sets bit 3 in the flag word. Only valid with global domain (scope 2); enforced by the same "_global required for mmio" rule.

Parameter Processing

Parameters are stored at state+10728 as 12-byte records {value[4], flags[4], type[4]}. The function iterates over the parameters (2 or 3, depending on counted mode; the loop count is the decompiler variable v100):

  • Each parameter type must be 10 (predicate register) or 12 (scope domain). Other types trigger error 7302 using the type name table at off_229E8C0.
  • For scope-domain parameters, the top 3 bits of the value word ((value >> 28) & 7) select the resolution mode:
    • Mode 5: Named barrier resolution via sub_91BF30, then sub_934630(opcode 130) to create a barrier pseudo-op in the Ori IR.
    • Mode 1 (no bit 24): Direct register reference (fast path, no resolution needed).
    • Other modes: Full register resolution via sub_91D150 + sub_7DEFA0.

Output Instruction Sequence

The function generates three Ori IR instructions:

Step   Builder      Opcode   Purpose
1      sub_934630   214      Mbarrier scope-domain setup; template mask 0x90FFFFFF
2      sub_934630   273      Memory ordering constraint / fence
3      sub_92C240   299      Mbarrier operation with full flags (arrive/wait/test)

The flag word passed to opcode 299 encodes: flags | 0x60000000, where flags accumulates mmio (bit 3), bytemask (bit 17), and other qualifiers from the subop parsing.
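The flag-word composition follows directly from the bits named above (mmio at bit 3, bytemask at bit 17, OR'd with the fixed 0x60000000 template). A minimal arithmetic check:

```python
# Flag-word composition for the opcode-299 mbarrier operation, as
# described in the text: accumulated qualifier bits OR'd with 0x60000000.
MMIO_BIT     = 1 << 3        # set by the MMIO subop
BYTEMASK_BIT = 1 << 17       # 0x20000, set by bytemask mode

def mbarrier_flag_word(mmio=False, bytemask=False):
    flags = 0
    if mmio:
        flags |= MMIO_BIT
    if bytemask:
        flags |= BYTEMASK_BIT
    return flags | 0x60000000

assert mbarrier_flag_word() == 0x60000000
assert mbarrier_flag_word(bytemask=True) == 0x60020000
assert mbarrier_flag_word(mmio=True) == 0x60000008
```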

Error Codes

Code   Message template                                                 Severity
7300   "Unexpected intrinsic name (%s)"                                 Semantic restriction (hard error)
7301   "Unexpected intrinsic param number (%d)"                         Parameter count mismatch
7302   "Unexpected intrinsic type (%s) in param (%d)"                   Wrong parameter type
7303   "Unexpected intrinsic type (%s) instead of (%s) in param (%d)"   Type mismatch with expected
7306   "Unexpected intrinsic subop in position (%d)"                    Positional subop error
7307   "Unexpected intrinsic subop (%s) in position (%d)"               Named subop error
7308   "Instrinsic - \"%s\""                                            Scope/order/domain validation

Two diagnostic functions handle these errors: sub_895530 emits directly when source location is available (ctx+48); sub_7EEFA0 builds a deferred diagnostic record.


Function Map

| Address | Size | Identity | Phase | Confidence |
|---|---|---|---|---|
| sub_C5FBC0 | 34 | StageAndFence execute wrapper | 25 | CERTAIN |
| sub_1392E30 | 166 | StageAndFence entry | 25 | HIGH |
| sub_1389AF0 | 3,049 | StageAndFence setup | 25 | HIGH |
| sub_1390B30 | 8,956 | StageAndFence core (fence insertion) | 25 | HIGH |
| sub_138A6E0 | 3,408 | StageAndFence teardown | 25 | HIGH |
| sub_C60BD0 | 334 | OriRemoveRedundantBarriers execute wrapper | 26 | CERTAIN |
| sub_790A40 | 2,288 | OriRemoveRedundantBarriers main | 26 | HIGH |
| sub_790020 | 1,200 | Post-RA scheduling helper | 26 | MEDIUM |
| sub_7904D0 | 1,381 | Pre-RA optimization helper | 26 | MEDIUM |
| sub_7923A0 | 2,344 | Barrier placement optimization | 26 | MEDIUM |
| sub_792CD0 | 1,360 | Top-level barrier pass | 26 | MEDIUM |
| 0xC5F110 | 6 | ExpandMbarrier execute (vtable dispatch) | 42 | CERTAIN |
| sub_C60080 | 34 | OptimizeSyncInstructions execute wrapper | 71 | CERTAIN |
| sub_90A340 | 1,670 | OptimizeSyncInstructions main | 71 | HIGH |
| sub_18F6930 | 185 | Sync optimization predicate | 71 | HIGH |
| sub_C600B0 | 34 | LateExpandSyncInstructions execute wrapper | 72 | CERTAIN |
| sub_1381DA0 | 1,517 | LateExpandSyncInstructions entry | 72 | HIGH |
| sub_1381CD0 | 206 | LateExpandSyncInstructions core driver | 72 | HIGH |
| sub_C5FAD0 | 34 | OriDoSyncronization execute wrapper | 99 | CERTAIN |
| sub_A0F020 | 2,375 | DAG scheduler entry (sync insertion) | 99 | HIGH |
| sub_A0D800 | -- | Dependency DAG builder | 99 | MEDIUM |
| sub_A06A60 | 3,045 | Per-block sync processor | 99 | HIGH |
| sub_A0B5E0 | -- | Uninitialized register check | 99 | MEDIUM |
| sub_C607A0 | 51 | ApplyPostSyncronizationWars execute wrapper | 100 | CERTAIN |
| sub_C60600 | 51 | FixUpTexDepBarAndSync execute wrapper | 114 | CERTAIN |
| sub_A9C550 | 2,178 | Barrier instruction lowering | -- | HIGH |
| sub_80F400 | 1,779 | Sync instruction SASS lowering | -- | HIGH |
| sub_AA3BB0 | 2,726 | MBARRIER encoding | -- | HIGH |
| sub_AA33C0 | -- | MBARRIER mnemonic builder | -- | MEDIUM |
| sub_775010 | 18 | Barrier liveness computation entry | -- | MEDIUM |
| sub_6D8B20 | 432 lines | OCG intrinsic body dispatcher (28-case switch) | -- | HIGH |
| sub_6C0D90 | 812 lines | Atomic/reduction intrinsic lowering (scope+order) | -- | HIGH |
| sub_6C1CF0 | 633 lines | Mbarrier intrinsic lowering (arrive/wait/test) | -- | HIGH |
| sub_6C4DA0 | 647 lines | Fence/load-store intrinsic lowering (scope+domain) | -- | HIGH |

Pipeline Position and Data Flow

The eight sync phases are distributed across the pipeline to operate at the appropriate abstraction level:

Phase 25  StageAndFence               ─── Early: after loop unrolling (24)
Phase 26  OriRemoveRedundantBarriers   ─── Early: before GeneralOptimize (29)
    ... (mid-level optimization) ...
Phase 42  ExpandMbarrier               ─── Mid: after CTA expansion (40)
    ... (late optimization) ...
Phase 71  OptimizeSyncInstructions     ─── Late: after varying propagation (70)
Phase 72  LateExpandSyncInstructions   ─── Late: before SSA destruction (73)
    ... (legalization, scheduling setup) ...
Phase 99  OriDoSyncronization          ─── Post-opt: sync insertion pass
Phase 100 ApplyPostSyncronizationWars  ─── Post-opt: WAR fixup
    ... (register allocation, scheduling) ...
Phase 114 FixUpTexDepBarAndSync        ─── Post-sched: texture dep fixup

Data dependencies between phases:

  • Phase 25 -> 26: StageAndFence inserts fences; OriRemoveRedundantBarriers may then eliminate redundant ones.
  • Phase 42 -> 71: ExpandMbarrier materializes mbarrier ops; OptimizeSyncInstructions may simplify the resulting sequences.
  • Phase 71 -> 72: OptimizeSyncInstructions reduces sync count; LateExpandSyncInstructions expands remaining pseudo-ops to SASS.
  • Phase 99 -> 100: OriDoSyncronization inserts sync instructions; ApplyPostSyncronizationWars fixes hazards introduced by the insertion.
  • Phase 114 -> 115: FixUpTexDepBarAndSync prepares texture barriers for AdvancedScoreboardsAndOpexes.

Architecture-Specific Behavior

The sync passes have significant architecture-dependent behavior controlled through the architecture backend vtable at ctx+1584:

| SM generation | Key behavior |
|---|---|
| sm70--sm75 (Volta/Turing) | Explicit BSSY/BSYNC convergence; WARPSYNC required; --no-membermask-overlap warning active; EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS emitted for membar.sys WAR |
| sm80--sm89 (Ampere/Ada) | cp.async commit/wait groups; ERRBAR after membar.sys; barrier number range checked [0..15] |
| sm90--sm90a (Hopper) | Full MBARRIER support; TMA async pipeline barriers; EIATTR_NUM_MBARRIERS and EIATTR_MBARRIER_INSTR_OFFSETS emitted; wgmma.fence / tcgen05.fence sync fences for tensor operations |
| sm100+ (Blackwell) | Extended cluster barriers (barrier.cluster.arrive/wait); fence.proxy with proxy domain annotations; sync_restrict::shared::{cta,cluster} scope qualifiers; async bulk multicast |

The sub_18F6930 predicate (185 bytes) encodes the architecture-specific decision logic. The magic value 28673 (0x7001) at *(ctx+1584)+372 corresponds to an architecture version threshold that enables explicit synchronization optimization for Volta-class and later architectures.

Command-line options:

| Option | Effect |
|---|---|
| --assume-extern-functions-do-not-sync | Tells the compiler that external function calls do not execute synchronization instructions, enabling more aggressive barrier elimination |
| --no-membermask-overlap | Asserts that no sync instruction is executed with different but overlapping thread masks (sm70--sm75 only); enables additional optimizations |
| --print-potentially-overlapping-membermasks | Diagnostic: prints locations of sync instructions where the compiler must assume overlapping masks |

Knobs:

| Knob | Effect |
|---|---|
| DisableErrbarAfterMembar | When set to 1, suppresses error barrier (BAR.SYNC 15) insertion after membar.sys instructions |
| Knob 358 | Sync optimization mode selector (0=disabled, 1=conservative, 2=aggressive, 3+=full analysis) |
| Knob 472 | Barrier liveness tracking enable |
| Knob 487 | Iteration gate (shared with multiple passes); controls maximum number of iterations |

Hot/Cold Partitioning

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas implements hot/cold partitioning across three dedicated phases that mark cold blocks, reorganize loop internals, and restructure whole-function control flow to improve instruction cache utilization and warp scheduling efficiency. The system operates at two distinct granularities: instruction-level classification (used by the scheduler's priority function) and block-level classification (used by code layout and predication). Both are static heuristics -- no hardware performance counters are read at runtime -- though profile-guided data from phase 20 (PerformPGO) can influence block weights when available.

| Attribute | Value |
|---|---|
| Phases | 41 (MarkAdditionalColdBlocks), 108 (OptimizeHotColdInLoop), 109 (OptimizeHotColdFlow) |
| Category | Analysis (41), Optimization (108, 109) |
| Pipeline positions | Phase 41: mid-optimization (after DoVirtualCTAExpansion); Phases 108--109: post-scheduling (after OriRemoveNopCode) |
| Vtable addresses | off_22BDC30 (41), off_22BE6A8 (108), off_22BE6D0 (109) |
| Instruction classifiers | sub_A9CDE0 (isHotMemoryOp, 380B), sub_A9CF90 (isColdMemoryOp, 367B) |
| Block layout consumer | Phase 112: PlaceBlocksInSourceOrder (sub_A92C50) |
| Related knob | Knob 582 (block-level cold-region query, consumed by predication at phase 63) |
| PGO feeder | Phase 20: PerformPGO (block weights, branch probabilities) |

GPU Motivation

Hot/cold partitioning on a GPU serves fundamentally different purposes than on a CPU.

On a CPU, the primary goal is to keep the hot path in L1 icache lines and push cold code to distant addresses that never evict hot cache lines. The branch predictor handles the control flow; the optimization is purely about cache geometry.

On a GPU, three factors make hot/cold partitioning more impactful:

  1. Instruction cache pressure. GPU SMs have small instruction caches (typically 32--128 KB shared across all warps on the SM). With dozens of warps in flight, each executing the same kernel, icache misses stall the entire SM. Moving cold code (error paths, rare branches) away from hot loops reduces the working set that must remain cached.

  2. Warp scheduling. The warp scheduler selects ready warps from a pool. If cold-path instructions are interleaved with hot-path instructions in the binary layout, warps executing the cold path occupy instruction fetch bandwidth that could serve warps on the hot path. Physical separation means the fetch unit can service hot warps without cache line conflicts from cold code.

  3. Convergence overhead. On sm_70+ architectures, divergent branches require BSSY/BSYNC convergence barriers. Cold blocks that are reached by divergent branches incur barrier setup costs even when the cold path is rarely taken. The predication pass (phase 63) uses knob 582 to query whether a block is in a cold region, allowing it to avoid if-converting cold regions where the divergence penalty is acceptable.

Architecture Overview

The three phases form a pipeline with increasing scope:

Phase 41: MarkAdditionalColdBlocks     (mid-optimization, Ori IR)
    |
    |  Sets cold-block flags on basic blocks based on static heuristics
    |  and PGO data. These flags are read by subsequent optimization
    |  passes (predication, scheduling, code layout).
    |
    v
Phase 108: OptimizeHotColdInLoop       (post-scheduling, SASS-level)
    |
    |  Within each loop body, separates hot and cold paths. Moves cold
    |  blocks to the end of the loop region so that the hot path forms
    |  a contiguous instruction sequence.
    |
    v
Phase 109: OptimizeHotColdFlow         (post-scheduling, SASS-level)
    |
    |  At function scope, restructures control flow to place cold blocks
    |  after all hot blocks. Adjusts branch targets to maintain correctness.
    |
    v
Phase 112: PlaceBlocksInSourceOrder    (final block layout)
    |
    |  Determines the physical ordering of all basic blocks in the
    |  emitted binary, consuming the hot/cold annotations set above.

The key architectural decision is that phase 41 runs at the Ori IR level (before scheduling and register allocation), while phases 108--109 run post-scheduling on the nearly-final SASS representation. This two-stage design is necessary because:

  • Cold-block annotations must be available early for predication decisions (phase 63) and scheduling priority (the 8-bit priority encoder).
  • Block reordering can only happen after scheduling has assigned stall counts and dependency barriers, since moving blocks changes instruction fetch distances and potentially invalidates scoreboard computations.

Phase 41: MarkAdditionalColdBlocks

Phase 41 is an analysis pass that annotates basic blocks with cold flags. The name "Additional" implies that some initial cold marking occurs earlier (likely during AnalyzeControlFlow at phase 3 or PerformPGO at phase 20), and this pass extends those annotations using additional heuristics available after mid-level optimization.

Pipeline Context

Phase 41 runs after DoVirtualCTAExpansion (40) and before ExpandMbarrier (42). At this point in the pipeline:

  • The CFG is fully built (phase 3) and loop structure is known (phase 18).
  • PGO data has been applied (phase 20) if available.
  • Branch optimization (phase 15) has simplified the control flow.
  • The IR is still in Ori form -- no register allocation or scheduling has occurred.

Cold-Block Heuristics

The cold-block classification uses both static and profile-guided signals. Based on analysis of consumers of the cold-block flag:

Static heuristics (always available):

| Signal | Classification | Rationale |
|---|---|---|
| Error handling / trap terminator | Cold | Error paths are rarely executed in correct programs |
| EXIT with non-zero error code | Cold | Abnormal termination paths |
| Deeply nested conditional with uniform condition | Cold | Threads rarely diverge on uniform values |
| Block dominated by a back-edge but not in the loop body | Cold | Loop exit paths taken only once |
| Very low instruction count + unconditional branch to return | Cold | Cleanup epilogues |

Profile-guided signals (when PGO data is available via phase 20):

| Signal | Classification | Rationale |
|---|---|---|
| Execution count below threshold (relative to function entry) | Cold | Directly measured low frequency |
| Branch probability < 5% on the edge leading to the block | Cold | Rarely-taken branch target |
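The static and PGO signals can be combined as in the sketch below. This is an illustration of the decision structure only: the field names, the 1% execution-count threshold, and the dict representation are assumptions, not recovered constants (except the 5% branch-probability figure stated above).

```python
def is_cold(block, entry_count=None):
    """Classify a basic block as cold using static, then PGO, signals."""
    # Static heuristics (always available)
    if block.get("is_trap_path") or block.get("exit_code", 0) != 0:
        return True
    # Profile-guided signals (only when phase 20 supplied counts)
    if entry_count:
        if block.get("exec_count", entry_count) / entry_count < 0.01:
            return True
        if block.get("incoming_branch_prob", 1.0) < 0.05:
            return True
    return False

print(is_cold({"is_trap_path": True}))                               # -> True
print(is_cold({"exec_count": 1, "exit_code": 0}, entry_count=1000))  # -> True
```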

Cold Flag Storage

The cold annotation is stored in the BasicBlock flags field at offset +28 of the 136-byte BasicBlock object. The predication pass queries this via knob 582 (block-level cold-region query), and the scheduling priority function reads it when computing the 8-bit packed priority at bit position 5.

Consumers of Cold Annotations

| Consumer | Phase | Usage |
|---|---|---|
| OriDoPredication | 63 | Knob 582: skips if-conversion of cold regions (divergence penalty acceptable in cold code) |
| Scheduling priority | 97--101 | Bit 5 of 8-bit priority: hot instructions get higher scheduling priority (1 = hot, 0 = cold) |
| OptimizeHotColdInLoop | 108 | Reads cold flags to identify which loop blocks to move |
| OptimizeHotColdFlow | 109 | Reads cold flags for whole-function layout |
| PlaceBlocksInSourceOrder | 112 | Final block ordering uses cold annotations |

Instruction-Level Hot/Cold Classification

Independent of the block-level cold marking, ptxas classifies individual memory instructions as "hot" or "cold" for scheduling purposes. This classification is performed by a pair of small complementary functions.

sub_A9CDE0 -- isHotMemoryOp (380 bytes)

Classifies an instruction as a hot memory operation. Hot instructions access memory spaces with high latency where early scheduling is beneficial.

isHotMemoryOp(scheduler, context, instruction):
    opcode = instruction->opcode & 0xFFFFCFFF    // mask modifier bits
    if opcode == 183 or opcode == 288:            // LD.E / ST.E (global load/store)
        operand = resolve_last_source(instruction)
        memspace = getMemorySpace(operand)
        if memspace == 6:                         // global memory
            return true
        if memspace == 4:                         // shared memory
            return ((operand->modifier >> 19) & 7) == 1  // specific variant
        return false
    if opcode in {91, 92}:                        // ATOM / RED
        modifier = instruction->operand[last]
        return ((modifier ^ 6) & 6) == 0 and (modifier & 1)  // specific addressing mode
    return false

sub_A9CF90 -- isColdMemoryOp (367 bytes)

The exact dual of isHotMemoryOp. Classifies an instruction as a cold memory operation.

isColdMemoryOp(scheduler, context, instruction):
    opcode = instruction->opcode & 0xFFFFCFFF
    if opcode == 183 or opcode == 288:            // LD.E / ST.E
        operand = resolve_last_source(instruction)
        memspace = getMemorySpace(operand)
        if memspace == 5:                         // constant memory (vs 6 for hot)
            return true
        if memspace == 4:                         // shared memory
            return ((operand->modifier >> 19) & 7) == 2  // complement variant (vs 1 for hot)
        return false
    if opcode in {91, 92}:                        // ATOM / RED
        modifier = instruction->operand[last]
        return ((modifier ^ 6) & 6) == 0 and (modifier & 1) == 0  // complement of hot check
    return false

Memory Space Classification

The memory space type is resolved by sub_91C840 from register file metadata at context+152:

| Space Code | Memory Type | Hot/Cold | Scheduling Implication |
|---|---|---|---|
| 4 | Shared memory | Depends on variant | Low latency (~20 cycles), variant-dependent |
| 5 | Constant memory | Cold | Cached, low latency (~4 cycles via constant cache) |
| 6 | Global memory | Hot | High latency (~200--800 cycles), benefits from early issue |

The shared memory case splits on a 3-bit subfield at operand bits 19--21: variant 1 is hot (bank-conflicted or special access pattern), variant 2 is cold (standard access).

For atomic operations (opcodes 91/92 = ATOM/RED), the hot/cold split is on the addressing mode: specific atomics targeting global memory in reduction mode are hot; others are cold.
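The bit tests from the two classifiers above can be made runnable directly. The constants below are the ones in the recovered pseudocode; the helper names are ours:

```python
# Opcode and memory-space constants from the recovered pseudocode.
OPC_MASK = 0xFFFFCFFF          # strips modifier bits from the opcode word
LD_E, ST_E = 183, 288          # global load/store
ATOM, RED = 91, 92             # atomic / reduction
SHARED, CONST, GLOBAL = 4, 5, 6

def classify_atomic(modifier):
    """Hot/cold split for ATOM/RED on the addressing-mode bits."""
    if ((modifier ^ 6) & 6) != 0:
        return None                      # neither hot nor cold
    return "hot" if (modifier & 1) else "cold"

def classify_shared(operand_modifier):
    """Hot/cold split on the 3-bit variant subfield at bits 19--21."""
    variant = (operand_modifier >> 19) & 7
    return {1: "hot", 2: "cold"}.get(variant)

print(classify_atomic(0b111))    # -> hot
print(classify_atomic(0b110))    # -> cold
print(classify_shared(2 << 19))  # -> cold
```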

Scheduling Priority Integration

The instruction-level hot/cold classification feeds directly into the scheduler's 8-bit priority encoding (documented in Scheduling Algorithm):

Bit 7: yield-related
Bit 6: yield
Bit 5: hot/cold (1 = hot = higher priority, 0 = cold = lower priority)
Bit 4: register pressure overflow
Bit 3: same-BB preference
Bit 2: stall-free
Bit 1: critical path
Bit 0: tiebreaker

Hot memory instructions (global loads, global atomics) get higher scheduling priority because their long latencies benefit from being issued early -- the scheduler can then fill the latency window with independent instructions. Cold memory instructions (constant loads) have short latencies and do not benefit from early issue, so they receive lower priority.
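A sketch of the 8-bit priority word follows the bit layout listed above; the field semantics are paraphrased and the packing helper is ours, not the recovered function:

```python
def pack_priority(yield_rel=0, yld=0, hot=0, pressure=0,
                  same_bb=0, stall_free=0, crit_path=0, tie=0):
    """Pack the scheduler's priority bits (bit 7 down to bit 0)."""
    return ((yield_rel << 7) | (yld << 6) | (hot << 5) | (pressure << 4)
            | (same_bb << 3) | (stall_free << 2) | (crit_path << 1) | tie)

# A hot global load on the critical path outranks a cold constant load.
hot_load = pack_priority(hot=1, crit_path=1)
cold_load = pack_priority(hot=0, crit_path=1)
print(hot_load > cold_load)  # -> True
```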

Phase 108: OptimizeHotColdInLoop

Phase 108 operates at the post-scheduling level, after register allocation and NOP removal have completed. It optimizes the layout of basic blocks within loop bodies.

Pipeline Context

Phase 107: OriRemoveNopCode      (NOP removal)
Phase 108: OptimizeHotColdInLoop  (loop-internal reordering)
Phase 109: OptimizeHotColdFlow    (function-wide reordering)
Phase 110: PostSchedule           (post-scheduling fixup)
Phase 112: PlaceBlocksInSourceOrder (final layout)

At this point, instructions have been scheduled and stall counts assigned. The optimization must preserve scheduling correctness while improving spatial locality.

Algorithm

The pass iterates over each loop in the function (loop structure computed at phase 18, maintained through the pipeline):

  1. Identify loop blocks. Using the loop header RPO number and exit RPO number, enumerate all blocks in the loop body.

  2. Classify blocks. Each block in the loop is classified as hot or cold based on the cold-block flags set by phase 41 (and potentially refined by phases between 41 and 108).

  3. Partition. Hot blocks remain at the top of the loop body; cold blocks are moved to the bottom (higher addresses within the loop region).

  4. Adjust branches. Branch targets are updated to reflect the new block positions. Cold blocks that were fall-through targets of hot blocks receive explicit branch instructions (since they are no longer adjacent).

The effect is that the hot loop body forms a contiguous instruction sequence that fits in fewer icache lines:

Before:                          After:
  loop_header (hot)                loop_header (hot)
  hot_block_1                      hot_block_1
  cold_error_check                 hot_block_2
  hot_block_2                      BRA loop_header
  BRA loop_header                  cold_error_check  (moved)
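Step 3 of the algorithm is essentially a stable partition, as in this minimal sketch. Branch-target patching (step 4) and the scoreboard constraints are omitted, and the block representation is invented for illustration:

```python
def partition_loop_body(header, body):
    """Hot blocks keep their order at the top; cold blocks sink to the bottom."""
    hot = [b for b in body if not b["cold"]]
    cold = [b for b in body if b["cold"]]
    return [header] + hot + cold      # the header is never moved

body = [{"name": "hot_block_1", "cold": False},
        {"name": "cold_error_check", "cold": True},
        {"name": "hot_block_2", "cold": False}]
layout = partition_loop_body({"name": "loop_header", "cold": False}, body)
print([b["name"] for b in layout])
# -> ['loop_header', 'hot_block_1', 'hot_block_2', 'cold_error_check']
```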

Constraints

  • The loop header block cannot be moved (it must be the entry point).
  • Blocks with back-edges to the loop header must maintain their branch reachability.
  • The transformation must not change the set of scoreboard/dependency barrier states visible at each instruction (since scheduling has already completed).

Phase 109: OptimizeHotColdFlow

Phase 109 extends the hot/cold separation to the entire function, operating on blocks that are not inside loops (or that span multiple loops).

Algorithm

The whole-function pass:

  1. Scan all blocks in RPO order. Classify each non-loop block as hot or cold.

  2. Partition the function into a hot region (placed first in the binary) and a cold region (placed last). Loop bodies are treated as atomic units -- the internal ordering was already optimized by phase 108.

  3. Insert or adjust branches at hot-to-cold and cold-to-hot transitions. Cold-to-hot transitions require a branch back to the hot region.

  4. Update block ordering metadata consumed by phase 112 (PlaceBlocksInSourceOrder).

The combined effect of phases 108 and 109 is a two-level layout:

Function layout after phases 108+109:
  [hot loop bodies, internally sorted by phase 108]
  [hot non-loop blocks]
  [cold blocks from all regions]

Tepid Scheduling

Between the extremes of hot and cold, ptxas recognizes a "tepid" scheduling mode that balances math and memory instruction interleaving. The tepid infrastructure lives at 0x7A4350--0x7A5000 and computes ratios:

| Metric | Formula | Purpose |
|---|---|---|
| MathToDmaWaitRatio | field[756] / a5 | Ratio of math cycles to memory wait cycles |
| MathToDmaTepidRatio | field[752] / a6 | Ratio of math cycles to memory tepid cycles |
| MathToEpilogueWaitRatio | field[756] / (a5 / epilogue_count) | Per-epilogue math-to-wait ratio |
| MathToEpilogueTepidRatio | a6 / epilogue_count | Per-epilogue tepid ratio |

These ratios are computed by sub_7A4350 (TepidSchedulingCompute) and reported by sub_7A46E0 (TepidSchedulingReport) when verbosity > 0. Epilogue blocks are identified by sub_754510 (IsEpilogueBlock), with the epilogue instruction count controlled by knob 294.

The tepid mode affects how aggressively the scheduler interleaves memory and math instructions -- hot regions use aggressive overlap, cold regions use conservative scheduling, and tepid regions use an intermediate policy.
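The ratio formulas from the metric table can be sketched as below. The parameter names are our reading of the decompiled fields (field[756] as math-to-wait cycles, field[752] as math-to-tepid cycles, a5/a6 as DMA wait/tepid counts) and are assumptions, not recovered names:

```python
def tepid_metrics(math_wait_cycles, math_tepid_cycles,
                  dma_wait, dma_tepid, epilogue_count):
    """Compute the four tepid scheduling ratios from the metric table."""
    return {
        "MathToDmaWaitRatio": math_wait_cycles / dma_wait,
        "MathToDmaTepidRatio": math_tepid_cycles / dma_tepid,
        "MathToEpilogueWaitRatio": math_wait_cycles / (dma_wait / epilogue_count),
        "MathToEpilogueTepidRatio": dma_tepid / epilogue_count,
    }

m = tepid_metrics(1200, 800, 400, 200, 4)
print(m["MathToDmaWaitRatio"])       # -> 3.0
print(m["MathToEpilogueWaitRatio"])  # -> 12.0
```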

Interaction with Other Passes

Predication (Phase 63)

The predication pass queries knob 582 to determine whether a branch region lies in a cold block. If the region is cold, predication may be skipped because:

  • The cold path is rarely executed, so the branch divergence penalty is amortized.
  • Predication would execute both paths unconditionally, wasting functional units on cold-path instructions.
  • Keeping the branch allows the cold path to be physically separated by phases 108--109.

PlaceBlocksInSourceOrder (Phase 112)

Phase 112 is the final block layout pass. It consumes the hot/cold annotations and the reordering decisions made by phases 108--109 to determine the physical position of every basic block in the emitted binary. The function sub_A92C50 implements this with a complex block-sorting algorithm that uses FNV-1a hash maps and an explicit work stack.

Key fields consumed from the Code Object:

| Offset | Field | Usage |
|---|---|---|
| +232 | Current block pointer | Block being placed |
| +264 | Block type/mode | Controls placement strategy |
| +296 | BB array | Block pointers for lookup |
| +648 | Successor edge map | Determines fall-through targets |
| +720 | RPO array | Provides initial ordering |

PerformPGO (Phase 20)

When profile data is available (from prior compilation runs with --generate-line-info and feedback), phase 20 applies execution counts and branch probabilities to the IR. These weights directly influence cold-block identification at phase 41 -- blocks with execution counts below a threshold relative to the function entry are marked cold regardless of static heuristics.

Key Functions

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_A9CDE0 | 380B | isHotMemoryOp -- classifies instruction as hot memory access | HIGH (0.90) |
| sub_A9CF90 | 367B | isColdMemoryOp -- classifies instruction as cold memory access | HIGH (0.90) |
| sub_91C840 | ~200B | getMemorySpace -- resolves memory space type from operand metadata | MEDIUM |
| sub_A92C50 | ~5KB | PlaceBlocksInSourceOrder -- final block layout algorithm | HIGH |
| sub_7A46E0 | ~1.1KB | TepidSchedulingReport -- reports tepid scheduling ratios | HIGH |
| sub_7A4350 | ~500B | TepidSchedulingCompute -- computes tepid scheduling metrics | MEDIUM |
| sub_754510 | ~200B | IsEpilogueBlock -- identifies epilogue blocks | MEDIUM |

Vtable Layout

| Phase | Index | Vtable Address | Name String Address |
|---|---|---|---|
| MarkAdditionalColdBlocks | 41 | off_22BDC30 | 0x22BC763 |
| OptimizeHotColdInLoop | 108 | off_22BE6A8 | 0x22BCD1D |
| OptimizeHotColdFlow | 109 | off_22BE6D0 | 0x22BCD33 |

All three vtables follow the standard 5-entry layout:

| Vtable Offset | Entry |
|---|---|
| +0 | execute(phase*, compilation_context*) |
| +8 | isNoOp(phase*) -> bool |
| +16 | getName(phase*) -> int |
| +24 | alloc(pool*, size) |
| +32 | free(pool*, ptr) |

Confidence Assessment

| Claim | Confidence | Evidence |
|---|---|---|
| Phase names and indices (41, 108, 109) | VERY HIGH | Static name table at off_22BD0C0, factory switch at sub_C60D30 |
| Vtable addresses | VERY HIGH | Computed from base off_22BD5C8 + index * 40 |
| isHotMemoryOp / isColdMemoryOp identity | HIGH | Dual function structure, memory space checks, opcode patterns |
| Memory space codes (4=shared, 5=constant, 6=global) | HIGH | Confirmed across multiple consumers |
| Scheduling priority bit 5 = hot/cold | HIGH | Decompiled priority function at sub_8C9320 |
| Phase 41 runs before scheduling | VERY HIGH | Factory index and pipeline ordering table |
| Phases 108--109 run post-scheduling | VERY HIGH | Pipeline ordering table, position after OriRemoveNopCode |
| Knob 582 cold-region query in predication | HIGH | Decompiled predication pass at sub_1381010 |
| Block layout consumer at phase 112 | HIGH | sub_A92C50 identified via string xref to PlaceBlocksInSourceOrder |
| Cold-block flag in BB+28 | MEDIUM | Inferred from consumer patterns; exact bit position unconfirmed |
| Tepid scheduling ratios | HIGH | String evidence from decompiled sub_7A46E0 |
| PGO influence on cold marking | MEDIUM | Inferred from pipeline ordering (PGO at 20, cold marking at 41) |


GMMA/WGMMA Pipeline


The GMMA pipeline handles warpgroup matrix multiply-accumulate (WGMMA) instructions introduced with SM 90 (Hopper). Two dedicated compiler phases -- OriPropagateGmma (phase 85) and FixupGmmaSequence (phase 87) -- transform the IR to satisfy the hardware's strict pipelining requirements for asynchronous tensor-core operations. These are the only passes in ptxas whose sole purpose is WGMMA instruction handling.

WGMMA operates at warpgroup granularity (4 warps executing in lockstep). The hardware requires a specific sequencing protocol: wgmma.fence to open a pipeline stage, a sequence of wgmma.mma_async operations that share accumulator registers, wgmma.commit_group to close the stage, and wgmma.wait_group to synchronize on completion. Between the fence and wait, strict constraints govern which registers can be touched by non-WGMMA instructions. Violating these constraints forces the compiler to serialize the WGMMA pipeline, destroying throughput.

| Attribute | Value |
|---|---|
| Pipeline phases | 85 (OriPropagateGmma), 87 (FixupGmmaSequence) |
| Target architectures | SM 90+ (Hopper, Blackwell) |
| Phase 85 entry | sub_AE5030 (2,967 bytes) -- outer driver, SM gate check |
| Phase 85 core | sub_ADAD60 (2,170 bytes) -- accumulator propagation per instruction |
| Phase 87 entry | sub_AE4F70 (182 bytes) -- sequencing orchestrator |
| Phase 87 core | sub_ADEB40 (7,077 bytes) -- sequence fixup, warpgroup inject |
| Serialization warnings | sub_ACE480 (1,908 bytes) -- 10 distinct warning codes |
| Pipeline validation | sub_AE3D40 (2,511 bytes) -- sequence structural check |
| Accumulator collect | sub_ADA740 (146 bytes) -- gathers accumulator register set |
| Live range propagation | sub_ADBD30 (3,364 bytes) -- per-basic-block propagation |
| Phase name strings | 0x22BCB13 (OriPropagateGmma), 0x22BCB40 (FixupGmmaSequence) |

Hardware Background

Warpgroup Execution Model

A warpgroup consists of 4 consecutive warps (128 threads). WGMMA instructions execute cooperatively across all 4 warps, with each warp contributing a slice of the matrix operation. The hardware tensor core pipeline is decoupled from the main pipeline: wgmma.mma_async dispatches work to the tensor core and returns immediately, while the accumulator registers remain in-flight until a wgmma.wait_group completes.

The PTX-level instructions that constitute a WGMMA pipeline stage:

| PTX Instruction | Ori Opcode | Role |
|---|---|---|
| wgmma.fence | (via handler sub_4DA380) | Opens a pipeline stage; prevents reordering across the fence |
| wgmma.mma_async | 309 | Dispatches an asynchronous matrix multiply-accumulate |
| wgmma.commit_group | (via handler sub_4DA4B0) | Closes the current pipeline stage |
| wgmma.wait_group | (via handler sub_4DA5E0) | Waits for N committed groups to complete |
| _warpgroup.arrive | 323 | Compiler-inserted warpgroup synchronization (arrive) |
| _warpgroup.wait | 271 (masked & 0xFFFFCFFF) | Compiler-inserted warpgroup synchronization (wait) |
| _warpgroup.commit_batch | -- | Compiler-inserted commit batch |

The _warpgroup.* instructions (prefixed with underscore) are compiler-internal pseudo-operations inserted by ptxas, not directly written by the programmer. They map to SASS WARPGROUP.ARRIVE, WARPGROUP.WAIT, and WARPGROUP.DEPBAR instructions.

Accumulator Register Constraints

WGMMA accumulator registers are the output (D) operands of wgmma.mma_async. While a pipeline stage is open (between fence and wait), strict rules apply:

  1. No non-WGMMA definitions of accumulator registers. Another instruction cannot write to a register that a WGMMA in the current stage uses as an accumulator.
  2. No non-WGMMA reads of accumulator registers. Another instruction cannot read from an accumulator register between the producing WGMMA and the completing wait.
  3. No non-WGMMA definitions of WGMMA input registers. The A and B matrix input registers (including descriptor registers) must not be redefined by non-WGMMA instructions within the stage.

Violation of any constraint forces serialization -- the compiler collapses the pipeline to issue one WGMMA at a time with individual fence/commit/wait per operation.

Sparse GMMA

The binary contains support for sparse GMMA variants (structured sparsity). The string "Sparse GMMA with " at 0x1D0B430 appears in sub_494210 (2,276 bytes), which handles sparse matrix metadata validation. Sparse WGMMA uses an additional metadata operand encoding the 2:4 or other sparsity pattern.

Phase 85: OriPropagateGmma

Purpose

Phase 85 propagates WGMMA accumulator register liveness information through the IR. For each wgmma.mma_async instruction (Ori opcode 309), it identifies the accumulator register set and builds a compact encoding that downstream passes use to track which registers are "in-flight" at each program point. This information is consumed by phase 87 to determine where warpgroup.arrive and warpgroup.wait instructions must be injected.

SM Gate

The outer driver sub_AE5030 checks the target architecture before proceeding. At offset +1381 of the compilation context, a flag indicates whether the target supports WGMMA. The check at the function entry:

if (*(char*)(context + 1381) >= 0)  // bit 7 clear = no WGMMA support
    return;

An additional mode check reads from the target descriptor at offset 26208 (within a 72-byte sub-structure at the descriptor's offset 72):

  • Value 0: no WGMMA support -- skip entirely
  • Value 1 with sub-field at 26216 nonzero: use the simple single-function path (sub_ADCA60)
  • Otherwise: use the full pipeline analysis path

Accumulator Register Encoding

The core function sub_ADAD60 processes each wgmma.mma_async instruction and encodes its accumulator register set into a packed 32-bit word. The encoding uses the FNV-1a hash (prime 16777619, offset basis 0x811C9DC5) for register-set lookup in a hash table:

hash = 16777619 * (HIBYTE(reg_id) ^
       (16777619 * (BYTE2(reg_id) ^
       (16777619 * (BYTE1(reg_id) ^
       (16777619 * ((uint8_t)reg_id ^ 0x811C9DC5)))))));

Accumulator entries are stored with a type tag in the high nibble:

  • 0x90000000 | (encoded_accum & 0xFFFFFF) -- source accumulator register set
  • 0x10000000 | (encoded_accum & 0xFFFFFF) -- destination accumulator register set
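A runnable reconstruction of the recovered FNV-1a expression and the high-nibble tag encoding follows. The byte order mirrors the nested expression above (low byte mixed first); the function names are ours:

```python
FNV_PRIME = 16777619
FNV_BASIS = 0x811C9DC5

def fnv1a_u32(reg_id: int) -> int:
    """FNV-1a over the four bytes of reg_id, low byte first, as in the text."""
    h = FNV_BASIS
    for shift in (0, 8, 16, 24):          # (uint8_t), BYTE1, BYTE2, HIBYTE
        h = ((h ^ ((reg_id >> shift) & 0xFF)) * FNV_PRIME) & 0xFFFFFFFF
    return h

def tag_accumulator(encoded: int, is_dest: bool) -> int:
    """Apply the high-nibble type tag to a 24-bit encoded accumulator set."""
    tag = 0x10000000 if is_dest else 0x90000000
    return tag | (encoded & 0xFFFFFF)

print(hex(tag_accumulator(0x123456, is_dest=True)))  # -> 0x10123456
```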

Live Range Limit Check

After accumulator propagation, the pass checks whether the number of active GMMA live ranges exceeds the hardware limit. The limit is stored at offset 56 of the pass object (field *(DWORD*)(a1 + 56) = maxActiveGmmaLiveRanges). If exceeded, a diagnostic is emitted:

"GMMA sequence has too many active live ranges (%d), reduce it to bring it under (%d)"

This diagnostic uses warning code 0x1CEF (7407). The limit is architecture-dependent and reflects the number of accumulator register banks available to the tensor core pipeline.
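The limit check reduces to a simple comparison, sketched below. The diagnostic text is the string quoted above; the limit value in the example is an arbitrary placeholder, since the real limit is architecture-dependent:

```python
WARN_GMMA_LIVE_RANGES = 0x1CEF  # decimal 7407

def check_live_ranges(active, max_active):
    """Return (code, message) if the active GMMA live ranges exceed the limit."""
    if active > max_active:
        return (WARN_GMMA_LIVE_RANGES,
                "GMMA sequence has too many active live ranges (%d), "
                "reduce it to bring it under (%d)" % (active, max_active))
    return None

code, msg = check_live_ranges(12, 8)
print(code)  # -> 7407
```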

Call Chain

sub_AE5030  (2,967B -- SM gate, iteration over basic blocks)
  └─ sub_ADCA60  (3,643B -- per-function pipeline analysis)
       └─ sub_ADBD30  (3,364B -- per-block accumulator propagation)
            └─ sub_ADAD60  (2,170B -- per-instruction accumulator encoding)
                 ├─ sub_AD4500  -- hash table lookup for register set
                 ├─ sub_AD4940  -- hash table insert/update
                 ├─ sub_AD6280  -- register set cache insert
                 ├─ sub_AD8E50  -- instruction iterator setup
                 ├─ sub_AD0C50  -- begin accumulator iteration
                 ├─ sub_AD3EA0  -- advance accumulator iterator
                 ├─ sub_AD1FA0  -- advance to next accumulator slot
                 ├─ sub_75A670  -- grow dynamic array (accumulator list)
                 └─ sub_895530  -- emit diagnostic warning

Accumulator Collection Helper

sub_ADA740 (146 bytes) collects the set of registers that are accumulators for a given instruction. It iterates over an instruction's operands, checking:

  • Operand type tag (operand >> 28) & 7 == 1 (register operand)
  • Not an immediate-flagged operand ((byte_flag & 1) == 0)
  • reg_type == 6 at vreg+64 (tensor/accumulator register class)

Matching registers are added to a bitvector-like set via sub_768AB0.

Phase 87: FixupGmmaSequence

Purpose

Phase 87 is the critical legalization pass. It analyzes WGMMA instruction sequences, verifies that the hardware pipeline constraints are satisfied, and inserts warpgroup.arrive / warpgroup.wait instructions where registers used by non-WGMMA instructions conflict with in-flight WGMMA accumulators. If the pipeline cannot be formed correctly, it triggers serialization and emits performance warnings.

Orchestrator: sub_AE4F70

The 182-byte wrapper orchestrates the complete fixup sequence:

sub_AE4F70 (FixupGmmaSequence orchestrator)
  │
  ├─ [1] sub_ADEB40  -- primary sequence fixup (inject arrive/wait)
  ├─ [2] sub_ADA7E0  -- verify pipeline consistency
  ├─ [3] sub_AE3D40  -- structural validation of sequences
  ├─ [4] sub_AD8F90  -- secondary validation pass
  ├─ [5] sub_AE4710  -- finalize sequence metadata
  ├─ [6] sub_AE17C0  -- late pipeline consistency check
  │
  └─ On failure at any step:
       ├─ Set serialization flag: *(BYTE*)(context + 1920) = 1
       ├─ sub_ACE480  -- emit serialization warning
       └─ sub_AE47B0  -- serialize the WGMMA pipeline (fallback)

The return value encodes the failure reason in the low 32 bits and a function identifier in the high 32 bits, which sub_ACE480 uses to select the appropriate warning message.

Primary Fixup: sub_ADEB40

This 7,077-byte function is the heart of the GMMA pipeline. Its logic:

1. Initialization. Allocates two dynamic arrays (v224/v225 for warpgroup.wait insertion points, i/v228 for warpgroup.arrive insertion points) and initializes them with sentinel values (0xFFFFFFFF).

2. First pass -- identify WGMMA sequences. Iterates over all instructions in the function's code list. For each instruction with opcode 309 (wgmma.mma_async):

  • Collects the instruction's accumulator register set via sub_ACC0A0 / sub_AD50B0 iterator pattern
  • Checks whether each of the instruction's operands (positions 1--4) has already been marked with arrival/wait flags
  • For unmarked operands, calls sub_ADA740 to collect accumulator registers and add them to the tracking set

The pass checks operand flag bits at instruction + 84 + 8*operand_index + 4:

  • Bit 0 (& 1): operand has been processed for arrive
  • Bit 1 (& 2): operand has been processed for wait
  • Bit 2 (& 4): operand requires a warpgroup.arrive/wait boundary
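The bit tests reduce to plain arithmetic on the 8-byte operand slot. A minimal sketch, assuming only the offsets recovered above (the accessor names are mine):

```c
#include <stdint.h>
#include <stdbool.h>

/* Operand flag word at instruction + 84 + 8*operand_index + 4,
   i.e. the second dword of an 8-byte operand slot starting at +84. */
static uint32_t *operand_flags(uint8_t *insn, int operand_index) {
    return (uint32_t *)(insn + 84 + 8 * operand_index + 4);
}

static bool arrive_processed(uint32_t f) { return (f & 1u) != 0; }  /* bit 0 */
static bool wait_processed(uint32_t f)   { return (f & 2u) != 0; }  /* bit 1 */
static bool needs_boundary(uint32_t f)   { return (f & 4u) != 0; }  /* bit 2 */
```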

3. Second pass -- walk pipeline stages. For each WGMMA sequence identified in the compilation context's sequence table (context->field_99), the pass walks forward through basic blocks:

  • Tracks the current pipeline stage state (v206: 0=initial, 1=arrived, 2=committed)
  • When encountering a wgmma.mma_async (opcode 309), records it as part of the current stage
  • When encountering a _warpgroup.commit_batch (opcode 323), marks the stage boundary and sets bit 2 on the last accumulator operand
  • When encountering an arrive (opcode 271 masked) or wait (opcode 32 masked), updates the pipeline state
  • When encountering a function call (opcode 236), forces a pipeline break

For non-WGMMA instructions within a stage, checks whether their register operands conflict with the active accumulator set by querying the bitvector (the balanced binary tree at v238). If a conflict is found, the instruction needs a warpgroup.arrive or warpgroup.wait to be injected before it.

4. Injection. Creates new instructions:

  • sub_ACBE60 creates warpgroup.arrive pseudo-instructions
  • sub_ACBF80 creates warpgroup.wait pseudo-instructions

These are added to the arrival/wait lists and later inserted into the code.

5. Commit pass. After analysis, iterates over the collected injection points:

  • For each warpgroup.arrive insertion, checks whether the injection needs a diagnostic via sub_ACBCA0 (knob-gated)
  • Emits advisory warning 0x1D5F (7519): "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'"
  • For each warpgroup.wait insertion, emits advisory warning 0x1D5D (7517): "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'"

6. Finalization. Calls sub_ADD8A0 (1,349 bytes) to rebuild the WGMMA sequence metadata after injection.

Pipeline Stage State Machine

The fixup pass maintains a state machine as it walks through instructions within a WGMMA sequence:

          ┌──────────────┐
          │  state = 0   │  (initial / outside pipeline)
          │  no active   │
          │  stage       │
          └──────┬───────┘
                 │  encounter wgmma.mma_async
                 ▼
          ┌──────────────┐
          │  state = 1   │  (in pipeline stage, arrived)
          │  tracking    │
          │  accumulators│
          └──────┬───────┘
                 │  encounter commit_batch
                 ▼
          ┌──────────────┐
          │  state = 2   │  (committed, waiting)
          │  accumulators│
          │  in-flight   │
          └──────┬───────┘
                 │  encounter wait or stage end
                 ▼
          ┌──────────────┐
          │  state = 0   │  (back to initial)
          └──────────────┘

  At any state, encountering a function call (opcode 236)
  or a conflicting register use forces:
    → inject warpgroup.arrive/wait
    → potentially serialize the pipeline
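The transitions can be expressed as a small step function over opcode classes. A simplified sketch mirroring v206 in sub_ADEB40; the real pass also tracks accumulator sets, injection points, and conflict promotion, all omitted here:

```c
/* Pipeline stage states, mirroring v206 in sub_ADEB40. */
enum stage { INITIAL = 0, ARRIVED = 1, COMMITTED = 2 };

enum event { EV_MMA, EV_COMMIT, EV_WAIT, EV_CALL, EV_OTHER };

/* Returns the next state; *serialize is set when a function call
   forces a pipeline break. Conflict/injection handling omitted. */
static enum stage step(enum stage s, enum event ev, int *serialize) {
    switch (ev) {
    case EV_MMA:    return ARRIVED;                         /* opcode 309 */
    case EV_COMMIT: return (s == ARRIVED) ? COMMITTED : s;  /* opcode 323 */
    case EV_WAIT:   return INITIAL;                         /* masked opcode 32 */
    case EV_CALL:   *serialize = 1; return INITIAL;         /* opcode 236 */
    default:        return s;
    }
}
```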

Register Conflict Detection

Register type 6 (vreg+64 == 6) is the tensor/accumulator register class. The conflict check compares operand register IDs against the active accumulator bitvector using a balanced binary search tree (v238 / v148 in the decompilation). The tree is keyed by register_id >> 8 (register bank) with a 64-bit bitmap per node tracking individual registers within the bank:

bit_index = register_id & 0x3F;
bank_offset = (register_id >> 6) & 3;  // 0..3 for 4 64-bit words per node
is_conflict = (node->bitmap[bank_offset + 4] >> bit_index) & 1;
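A self-contained model of the per-node check (the node layout here is a guess that matches the decompiled indexing; the real structure is a balanced tree keyed by register_id >> 8):

```c
#include <stdint.h>

/* One tree node covers a 256-register bank: 4 x 64-bit bitmap words.
   The decompilation indexes them as node->bitmap[bank_offset + 4],
   the +4 skipping what appear to be tree-link/key words. */
struct bank_node {
    uint64_t header[4];   /* tree links / key in the real structure */
    uint64_t bitmap[4];   /* one bit per register within the bank */
};

static void mark_accum(struct bank_node *n, uint32_t reg_id) {
    n->bitmap[(reg_id >> 6) & 3] |= 1ull << (reg_id & 0x3F);
}

static int is_conflict(const struct bank_node *n, uint32_t reg_id) {
    return (int)((n->bitmap[(reg_id >> 6) & 3] >> (reg_id & 0x3F)) & 1);
}
```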

Serialization Warnings

When the pipeline cannot be formed correctly, sub_ACE480 (1,908 bytes) emits one of 10 distinct performance warnings. The function receives a packed 64-bit error code: the low 4 bits select the warning case (1--10) and the high 32 bits identify the function that triggered the failure. The function name is resolved via a vtable callback: context->field_0->vtable[18]->method_1(context->field_0->vtable[18], function_id).
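A hedged sketch of the packing convention (helper names are mine; the binary manipulates these values inline):

```c
#include <stdint.h>

/* Low 4 bits: warning case 1..10; high 32 bits: id of the function
   that triggered the failure (later resolved to a name through the
   vtable callback described above). 0 = success. */
static uint64_t pack_failure(uint32_t warn_case, uint32_t func_id) {
    return ((uint64_t)func_id << 32) | (warn_case & 0xFu);
}

static uint32_t failure_case(uint64_t code) { return (uint32_t)(code & 0xFu); }
static uint32_t failure_func(uint64_t code) { return (uint32_t)(code >> 32); }
```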

Warning Emission Mechanism

Each warning is gated by a per-function flag at context->field_208 + 72 + 26280:

  • Byte == 1 with DWORD at +26288 nonzero: Emit via sub_895530 (direct diagnostic with source location). Falls back to sub_7EEFA0 (format-to-buffer, no location) if the source location callback at context->vtable + 48 is null.
  • Byte != 1 (default): Emit via sub_7FA2C0 (warning-once gate, keyed on hex code at context + 154). If the gate passes (first occurrence for this function), emits via sub_895670 (diagnostic through context->vtable + 128 callback). This prevents the same warning from being emitted multiple times for the same function.

All warnings use the prefix "Potential Performance Loss: wgmma.mma_async instructions are serialized due to ...".

Serialization Warning Table

  • Case 1 -- 0x1D55 (7509), sub_ADEB40: "...the presence of Extern calls in the function '%s'"
  • Case 2 -- 0x1D56 (7510), sub_ADEB40: "...wgmma pipeline crossing function boundary at a function call in the function '%s'"
  • Case 3 -- 0x1D57 (7511), sub_ADA7E0 / orchestrator fallback: "...insufficient register resources for the wgmma pipeline in the function '%s'"
  • Case 4 -- 0x1D58 (7512), orchestrator resource check: "...insufficient register resources for the function '%s'"
  • Case 5 -- 0x1D59 (7513), sub_ADEB40 / sub_AE17C0: "...non wgmma instructions defining input registers of a wgmma between start and end of the pipeline stage in the function '%s'"
  • Case 6 -- 0x1D5A (7514), sub_AE17C0: "...non wgmma instructions reading accumulator registers of a wgmma between start and end of the pipeline stage in the function '%s'"
  • Case 7 -- 0x1D5B (7515), sub_ADEB40 / sub_AE17C0: "...non wgmma instructions defining accumulator registers of a wgmma between start and end of the pipeline stage in the function '%s'"
  • Case 8 -- 0x1D5C (7516), sub_AE3D40 structural check: "...ill formed pipeline stage in the function '%s'"
  • Case 9 -- 0x1D5E (7518), sub_ADEB40 finalization: "...program dependence on compiler-inserted WG.DP in divergent path in the function '%s'"
  • Case 10 -- 0x1D60 (7520), sub_ADEB40 finalization: "...program dependence on compiler-inserted WG.AR in divergent path in the function '%s'"

Note: The hex codes are not contiguous. Codes 0x1D5D (7517) and 0x1D5F (7519) are advisory injection warnings, not serialization warnings (see below).

Advisory Injection Warnings

During successful (non-serialized) pipeline fixup, sub_ADEB40 emits advisory warnings when it injects warpgroup synchronization instructions. These are gated by knob check at sub_ACBCA0 and the per-instruction flag at bb_info + 282 bit 3:

  • 0x1D5D (7517): "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'"
  • 0x1D5F (7519): "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'"

These are informational: they indicate the compiler successfully handled a register conflict by inserting synchronization, without falling back to serialization.

Detailed Trigger Conditions

Case 1 (0x1D55): Extern calls prevent pipelining

Trigger. During the instruction walk in sub_ADEB40, a call instruction (Ori opcode 236) is encountered within a WGMMA pipeline stage, or an operand references a basic block with no instructions (opaque/extern function target). The compiler cannot verify that the callee preserves the accumulator register state.

Detection code. In sub_ADEB40: when opcode == 236 (function call), or when a callee basic block's instruction pointer is null (*(_QWORD*)v114 == 0), v206 is set to 1.

Code pattern that causes it:

wgmma.fence;
extern_function_call();  // <-- triggers case 1
wgmma.mma_async ...;
wgmma.commit_group;
wgmma.wait_group;

Fix. Mark the callee as __forceinline__ so the compiler can see its register usage. Move non-inlineable function calls outside the fence--wait region. Restructure the kernel so that no opaque calls occur between wgmma.fence and wgmma.wait_group.

Case 2 (0x1D56): Pipeline crosses function call boundary

Trigger. The bitvector conflict check finds a non-WGMMA instruction's register operand colliding with the active accumulator bitvector, at a point where the pipeline already has active state from a preceding call-boundary violation. Specifically, the register is looked up in the balanced binary tree (node->bitmap[bank_offset + 4] >> bit_index) and if the conflict bit is set while v206 was already zero, it is promoted to case 2.

Detection code. In sub_ADEB40 lines 418--426: after the accumulator bitvector lookup returns a match, v206 is set to 2 (the first conflict after a call boundary was detected).

Code pattern that causes it:

// Function A:
wgmma.fence;
wgmma.mma_async ...;
call function_B();  // pipeline spans across this call
wgmma.commit_group; // in function_B or after return
wgmma.wait_group;

Fix. Keep the entire fence--mma--commit--wait sequence within a single function. Do not split WGMMA pipeline stages across function boundaries.

Case 3 (0x1D57): Insufficient register resources for pipeline

Trigger. Three distinct paths produce this code:

  1. sub_ADA7E0 returns 3 when its internal call to sub_AD5120() fails (line 233). This function attempts to propagate accumulator tracking through the FNV-1a hash table, and failure means the pipeline's register sets cannot be simultaneously tracked.
  2. sub_AE3D40 (structural validation) returns with low byte 0, meaning sub_ACE3D0() rejected the pipeline structure. The orchestrator uses case 3 as the generic fallback (v20 = 3 at line 66 of sub_AE4F70).
  3. sub_AD8F90 (secondary validation) returns with low byte 0 similarly.

Code pattern that causes it:

// Too many concurrent accumulators
wgmma.fence;
wgmma.mma_async D0, ...;  // accum set 0
wgmma.mma_async D1, ...;  // accum set 1
wgmma.mma_async D2, ...;  // accum set 2
// ... many more with distinct accumulators
wgmma.commit_group;
wgmma.wait_group;

Fix. Reduce the number of concurrent WGMMA operations with distinct accumulator register sets. Split large tile computations into smaller stages with intervening waits. Reduce accumulator tile dimensions.

Case 4 (0x1D58): Insufficient register resources for function

Trigger. The function's overall register pressure (including non-WGMMA code) is too high. The WGMMA pipeline requires dedicated accumulator register banks, and if the function's total register demand exceeds what is available after reserving the pipeline's needs, serialization is triggered.

Code pattern that causes it:

__global__ void kernel(...) {
    float local_array[256];     // high register pressure
    complex_computation(local_array);
    wgmma.fence;
    wgmma.mma_async ...;       // needs accumulator regs too
    wgmma.commit_group;
    wgmma.wait_group;
}

Fix. Reduce register usage in the kernel: use shared memory for large arrays, reduce live variable counts, split the kernel into smaller functions. Compile with -maxrregcount to force spilling of non-critical values.

Case 5 (0x1D59): Non-WGMMA defines input registers

Trigger. Two paths:

  1. In sub_ADEB40 (lines 960--990): for each non-WGMMA instruction within a pipeline stage, operand position 4 (WGMMA input operands) is checked. If a non-WGMMA instruction writes to a register that a WGMMA uses as matrix A or B input, and the write is in the same basic block (v84+24 == v36[6]) and after the WGMMA (v84+52 > v36[13]), the conflict is flagged.
  2. In sub_AE17C0 (lines 384--386): sub_AE0D20() validates the pipeline's input register sets against arrive/wait annotations. Failure at either the arrive set (offset +69) or wait set (offset +74) returns code 5.

Code pattern that causes it:

wgmma.fence;
// desc_a = make_descriptor(smem_ptr);
wgmma.mma_async D, desc_a, desc_b;
desc_a = make_descriptor(smem_ptr + offset);  // <-- redefines input
wgmma.mma_async D, desc_a, desc_b;            // uses redefined input
wgmma.commit_group;
wgmma.wait_group;

Fix. Compute all WGMMA input values (descriptors, pointers) before wgmma.fence. Use separate register variables for distinct input values within a single pipeline stage. If different tiles need different descriptors, pre-compute them all before entering the pipeline.

Case 6 (0x1D5A): Non-WGMMA reads accumulators

Trigger. Detected only by sub_AE17C0 (late consistency check), at two points:

  1. Lines 707--741: for each WGMMA instruction, operand 0 (accumulator) is examined via sub_AD4BE0/sub_ACBB60. If the accumulator data set is non-empty (!sub_ACC3A0), a non-WGMMA instruction reads from an in-flight accumulator register.
  2. Lines 870--885: same check in a per-basic-block iteration context.

Code pattern that causes it:

wgmma.fence;
wgmma.mma_async D, A, B;
float val = D[0];              // <-- reads accumulator before wait
wgmma.commit_group;
wgmma.wait_group;

Fix. Move all reads of accumulator registers after wgmma.wait_group. The accumulator values are undefined until the wait completes. If the compiler cannot automatically insert a warpgroup.wait at the read point (e.g., divergent control flow), serialization occurs.

Case 7 (0x1D5B): Non-WGMMA defines accumulators

Trigger. Three paths:

  1. In sub_ADEB40 (lines 994--1028): for each non-WGMMA instruction, operand position 3 is checked. If the operand is a register (not immediate, tag != 0x70000000), and it belongs to the same basic block and pipeline stage, and the defining instruction's opcode (after masking) is not 309 (wgmma.mma_async), the conflict is flagged.
  2. In sub_AE17C0 (lines 684--703): sub_AD4CC0 checks WGMMA accumulator operands against the conflict set. If a match is found and the set is non-empty, code 7 is returned.
  3. In sub_AE17C0 (lines 1296--1302): a catch-all at the end of the late validation walk.

Code pattern that causes it:

wgmma.fence;
D[0] = 0.0f;                   // <-- writes to accumulator
wgmma.mma_async D, A, B;       // D is accumulator
wgmma.commit_group;
wgmma.wait_group;

Fix. Initialize accumulators before wgmma.fence, or use the WGMMA .useC mode to let the hardware handle accumulator initialization. Never write to accumulator registers from non-WGMMA instructions inside a pipeline stage.

Case 8 (0x1D5C): Ill-formed pipeline stage

Trigger. sub_AE3D40 (structural validation) detects that the fence/mma/commit/wait structure is malformed. The function walks the WGMMA sequence and checks structural properties via sub_ACE3D0. When the structure check fails (line 447), an error with low byte 0 is returned. The orchestrator maps structural failures to code 3 as fallback, but code 8 is emitted when sub_ADEB40 detects the stage state machine in an inconsistent state.

Code pattern that causes it:

wgmma.fence;
if (condition) {
    wgmma.mma_async D, A, B;
    wgmma.commit_group;        // commit only on one path
}
wgmma.wait_group;              // wait on all paths -- mismatch

Fix. Ensure each wgmma.fence is matched by exactly one wgmma.commit_group and one wgmma.wait_group on every control flow path. Keep pipeline stages in straight-line code. Do not use goto, early return, or conditional branches between fence and wait.

Case 9 (0x1D5E): WG.DP in divergent path

Trigger. During the finalization pass in sub_ADEB40 (lines 1308--1370), the compiler iterates over warpgroup.wait injection points. For each injection, it checks the basic block's convergence flag at bb_info + 282 bit 3. If bit 3 is NOT set (block is divergent) and v206 was previously zero, v206 is set to 9 with the function ID from the basic block at offset +200.

WG.DP = WARPGROUP.DEPBAR (dependency barrier), the SASS-level instruction that implements warpgroup.wait.

Code pattern that causes it:

wgmma.fence;
wgmma.mma_async D, A, B;
wgmma.commit_group;
if (threadIdx.x < 64) {        // warp-divergent condition
    use(D[0]);                  // compiler needs WG.DP here, but path is divergent
}
wgmma.wait_group;

Fix. Ensure WGMMA pipeline stages execute in uniform (non-divergent) control flow. Move conditional logic outside the fence--wait region. Use predication instead of branching for minor variations within a stage.

Case 10 (0x1D60): WG.AR in divergent path

Trigger. During the finalization pass in sub_ADEB40 (lines 1242--1306), the compiler iterates over warpgroup.arrive injection points. When the compiler needs to inject a warpgroup.arrive (to start a new pipeline stage after a conflict) but the injection point is in a divergent basic block, v206 is set to 10. This occurs at line 1302 when a knob-gated diagnostic check at sub_ACBCA0 indicates the injection is not suppressed but the block divergence prevents safe insertion.

WG.AR = WARPGROUP.ARRIVE (arrival barrier), the SASS-level instruction that synchronizes warpgroup warps before entering a pipeline stage.

Code pattern that causes it:

if (threadIdx.x < 64) {        // divergent
    wgmma.fence;               // <-- compiler needs WG.AR, but divergent
    wgmma.mma_async D, A, B;
    wgmma.commit_group;
    wgmma.wait_group;
}

Fix. Same as case 9. Keep pipeline stage entry points (fences) and exit points (waits) in uniform control flow. All warps in the warpgroup must execute the same WGMMA pipeline structure.

Orchestrator Error Code Flow

The orchestrator sub_AE4F70 calls validation functions in sequence. Each returns a packed 64-bit value with the error code in the low bits and a function identifier in the high 32 bits:

sub_AE4F70
  │
  ├─ sub_ADEB40 (primary fixup)
  │    returns: 1, 2, 5, 7, 9, 10 in low 4 bits
  │    (0 = success)
  │
  ├─ sub_ADA7E0 (pipeline consistency)
  │    returns: 3 if FNV-1a accumulator tracking fails
  │    (0 = success)
  │
  ├─ sub_AE3D40 (structural validation)
  │    returns: low byte 1 = pass, low byte 0 = fail
  │    (orchestrator maps fail to case 3)
  │
  ├─ sub_AD8F90 (secondary validation)
  │    returns: low byte 1 = pass, low byte 0 = fail
  │    (orchestrator maps fail to case 3)
  │
  ├─ sub_AE4710 (finalize metadata) -- only on success
  │
  └─ sub_AE17C0 (late consistency)
       returns: 5, 6, 7 in low bits
       (0 = success)

Any nonzero result triggers the serialization path: *(BYTE*)(context->field_0->field_1584 + 1920) = 1, followed by sub_ACE480 (warning emission) and sub_AE47B0 (pipeline collapse).

The serialization fallback function sub_AE47B0 replaces the pipelined WGMMA sequence with individual fence/mma/commit/wait groups per operation, which is functionally correct but eliminates all overlap between tensor core operations.

Interaction with Register Allocation

The GMMA pipeline runs at phases 85/87, before register allocation (phase 101). This is by design -- the pass operates on virtual registers and needs to:

  1. Track accumulator live ranges before physical register assignment constrains placement
  2. Insert warpgroup.arrive/wait with freedom to position them optimally
  3. Propagate accumulator liveness to inform the register allocator about the extended live ranges that WGMMA creates

The live range limit check (warning code 0x1CEF) directly impacts register allocation: if too many WGMMA accumulators are simultaneously live, the register allocator will not have enough physical registers, and the pipeline must be serialized.

Phase 86 (InsertPseudoUseDefForConvUR) runs between the two GMMA phases. It inserts pseudo use/def instructions for uniform register conversion, which must account for the accumulator regions identified by phase 85.

Phase 88 (OriHoistInvariantsLate3) runs immediately after phase 87, exploiting the now-explicit pipeline boundaries as LICM barriers.

PTX Instruction Handlers

The PTX-to-Ori lowering registers four WGMMA-related handlers in sub_5D4190:

PTX mnemonic         Handler      Size
wgmma.mma_async      sub_50AC70   1,282 bytes
wgmma.fence          sub_4DA380   295 bytes
wgmma.commit_group   sub_4DA4B0   295 bytes
wgmma.wait_group     sub_4DA5E0   311 bytes

The wgmma.mma_async handler is the largest, handling the complex operand encoding (matrix dimensions, data types, layout, scale factors, descriptor format). The fence/commit/wait handlers are thin wrappers producing single Ori instructions.

The internal warpgroup synchronization instructions (_warpgroup.arrive, _warpgroup.wait, _warpgroup.commit_batch) are registered separately as _mma.warpgroup-prefixed handlers at 0x466000--0x467900 (approximately 36 small ~96-byte handler functions covering the various warpgroup synchronization variants).

SASS Output

The Ori WGMMA instructions are encoded to the following SASS opcodes by the Mercury encoder:

Ori instruction           SASS opcode             Description
wgmma.mma_async           WGMMA.MMA_ASYNC         Asynchronous warpgroup matrix multiply
wgmma.fence               WGMMA.FENCE             Pipeline fence
wgmma.commit_group        WGMMA.COMMIT_GROUP      Commit current group
wgmma.wait_group N        WGMMA.WAIT_GROUP N      Wait for N groups
_warpgroup.arrive         WARPSYNC / BAR.ARRIVE   Warpgroup arrival barrier
_warpgroup.wait           WARPSYNC / BAR.WAIT     Warpgroup wait barrier
_warpgroup.commit_batch   DEPBAR variant          Warpgroup dependency barrier

The Mercury encoder at sub_62E890 (118 KB) handles the SASS-level encoding of warpgroup operations, referenced by strings "warpgroup-arrive", "warpgroup-wait", and "warpgroup-commit_batch" used as internal Mercury instruction tags.

Key Constants

Constant                 Value        Meaning
WGMMA opcode             309          Ori opcode for wgmma.mma_async
Arrive opcode (masked)   271          opcode & 0xFFFFCFFF for _warpgroup.arrive/wait
Commit opcode            323          Ori opcode for _warpgroup.commit_batch
Call opcode              236          Forces pipeline break
Accum reg_type           6            vreg+64 value for tensor/accumulator regs
Accum src tag            0x90000000   High nibble tag for source accumulator encoding
Accum dst tag            0x10000000   High nibble tag for destination accumulator encoding
FNV-1a prime             16777619     Hash function prime for register set lookup
FNV-1a offset            0x811C9DC5   Hash function offset basis
Live range warning       0x1CEF       Warning code for excessive live ranges
Serialization base       0x1D55       First serialization warning code (extern calls)
Serialization end        0x1D60       Last serialization warning code (WG.AR divergent)
Advisory wait inject     0x1D5D       Advisory: warpgroup.wait injected
Advisory arrive inject   0x1D5F       Advisory: warpgroup.arrive injected

Key Function Table

Address    Size      Name / Role
0xAE5030   2,967     Phase 85 outer driver (SM gate, BB iteration)
0xADCA60   3,643     Phase 85 per-function pipeline analysis
0xADBD30   3,364     Phase 85 per-block accumulator propagation
0xADAD60   2,170     Phase 85 per-instruction accumulator encoding
0xADA740   146       Accumulator register collector
0xAE4F70   182       Phase 87 orchestrator
0xADEB40   7,077     Phase 87 primary sequence fixup
0xADB5E0   1,867     Phase 87 sequence metadata builder
0xADD8A0   1,349     Phase 87 post-injection metadata rebuild
0xAE3D40   2,511     Sequence structural validation
0xAD8F90   2,924     Secondary validation pass
0xAE17C0   7,538     Late pipeline consistency check
0xAE47B0   1,975     Serialization fallback (collapse pipeline)
0xACE480   1,908     Serialization warning emitter (10 codes)
0xACBE60   279       Create warpgroup.arrive instruction
0xACBF80   279       Create warpgroup.wait instruction
0xACBCA0   191       Knob-gated injection diagnostic check
0x50AC70   1,282     PTX handler: wgmma.mma_async
0x4DA380   295       PTX handler: wgmma.fence
0x4DA4B0   295       PTX handler: wgmma.commit_group
0x4DA5E0   311       PTX handler: wgmma.wait_group
0x494210   2,276     Sparse GMMA validation
0x62E890   118,150   Mercury encoder for warpgroup SASS ops

Cross-References

Uniform Register Optimization

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Four passes in the ptxas pipeline collectively manage the conversion of general-purpose register (R) values to uniform registers (UR) on sm_75+ targets. The UR register file is a dedicated 63-entry, 32-bit register bank shared across all threads in a warp: every thread reads the same value from a given UR. By routing warp-uniform computations through the UR file, ptxas reduces R-register pressure (the dominant occupancy limiter), enables the UR-specific ALU datapath, and avoids broadcasting the same value 32 times across the register file.

Phases              11, 27, 74, 86
Phase names         ReplaceUniformsWithImm, AnalyzeUniformsForSpeculation, ConvertToUniformReg, InsertPseudoUseDefForConvUR
Target              sm_75+ (Turing and later) -- no-op on earlier architectures
Register file       UR: UR0--UR62 usable, UR63 = URZ (zero register); UP: UP0--UP6, UP7 = UPT
Hardware limit      63 uniform GPRs, 7 uniform predicates per thread
Code Object fields  +99 = UR count; +856 = UR liveness bitvector
Context flags       +1368 bit 1 = has-uniform; +1376 bit 4 = UR tracking enabled; +1378 bit 3 = has-UR-regs
Key knobs           487 (general optimization gate), 628 (pre-allocation UR promotion), 687 (uniform register mode)
Related passes      OriPropagateVaryingFirst (53), OriPropagateVaryingSecond (70), OptimizeUniformAtomic (44), ConvertMemoryToRegisterOrUniform (sub_910840)

Background: Uniform vs. Divergent Values

A value is uniform (warp-uniform) if every active thread in the warp holds the same value for that register at a given program point. A value is divergent if different threads may hold different values.

Sources of uniformity:

  • Kernel parameters. All threads receive the same parameter values. Parameters loaded from constant memory (LDC) with a uniform address are uniform by construction.
  • Constant memory loads. LDC with a uniform base address produces a uniform result.
  • S2R of warp-uniform special registers. Registers like SR_CTAID_X/Y/Z, SR_GRIDID, and SR_SMID are uniform across the warp. SR_TID_X/Y/Z and SR_LANEID are divergent.
  • Arithmetic on uniform inputs. If all source operands are uniform, the result of any pure ALU operation is uniform.
  • Convergent control flow. A value defined before a divergent branch and used after reconvergence is still uniform if the definition was uniform.

Sources of divergence:

  • Thread identity registers. SR_TID_X/Y/Z, SR_LANEID vary per thread.
  • Memory loads from thread-dependent addresses. LDG [R_addr] where R_addr is divergent produces a divergent result.
  • Phi merges across divergent branches. A MOV.PHI that merges values from two sides of a divergent branch is divergent even if each incoming value was individually uniform.

ptxas tracks uniformity through two complementary mechanisms: forward "varying" propagation (OriPropagateVarying, phases 53 and 70) marks registers as divergent, while the uniform analysis passes (this page) identify which remaining values are safe to move to the UR file.
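The forward "varying" direction can be illustrated with a toy propagation over a straight-line instruction list. This is illustrative only -- ptxas runs the analysis on the Ori DAG, with phi and control-flow handling that this sketch omits:

```c
#include <stdbool.h>

#define MAX_VREGS 64

/* An instruction defines `dst` from up to two sources (-1 = unused). */
struct inst { int dst, src0, src1; };

/* Forward varying propagation: a def is divergent iff any source is.
   `divergent` must be seeded before the walk (e.g. the vreg holding
   SR_TID_X = true, kernel parameters / SR_CTAID_X = false). */
static void propagate_varying(const struct inst *code, int n,
                              bool divergent[MAX_VREGS]) {
    for (int i = 0; i < n; i++) {
        bool d = false;
        if (code[i].src0 >= 0) d |= divergent[code[i].src0];
        if (code[i].src1 >= 0) d |= divergent[code[i].src1];
        divergent[code[i].dst] = d;
    }
}
```

Anything still marked uniform after the walk is a candidate for UR promotion in phase 74, subject to the speculation and live-range constraints discussed in this section.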

UR Hardware ISA

sm_75+ architectures provide a dedicated set of uniform-only SASS instructions that operate on UR/UP registers. These execute on the uniform datapath, which processes one value per warp instead of 32:

SASS mnemonic   ROT13 in binary   Operation
UIADD3          HVNQQ3            Uniform 3-input integer add
UIMAD           HVZNQ             Uniform integer multiply-add
ULOP3           HYBC3             Uniform 3-input logic
UISETP          HVFRGC            Uniform integer set-predicate
USGXT           HFTKG             Uniform sign-extend
UPRMT           HCEZG             Uniform byte permute
UPOPC           HCBCP             Uniform population count
UBREV           HOERI             Uniform bit reverse
UP2UR           HC2HE             Uniform predicate to uniform register
UPLOP3          HCYBC3            Uniform predicate LOP3
VOTEU           IBGRH             Uniform vote

Blackwell (sm_100+) extends the uniform ISA with:

  • UFADD, UFFMA, UFSEL, UFSETP -- uniform floating-point operations
  • UVIADDR -- uniform virtual address computation
  • UCLEA, UCVTA, ULEPC -- uniform address operations
  • UTMAPC, UTMALDG, UTMAPF, UTMAREDG -- uniform TMA (tensor memory accelerator) operations
  • UBLKPC, UBLKRED, UBLKPF -- uniform block operations

The R2UR instruction transfers a value from the R file to the UR file; UR2R does the reverse. These are the bridge instructions that ConvertToUniformReg inserts at file boundaries.

The SASS encoder at sub_7BC360 (126 callers) handles UR register encoding using the register-variant-B format, distinct from the main register encoder sub_7BC030. The decoder sub_7BD7D0 (4 callers) extracts UR operands with type=4 (uniform register). In the Mercury encoding layer, Major 0x0E (6 variants, sub_10C0550) encodes the uniform ALU instructions (UIADD3, ULOP3, etc.).

Phase 11: ReplaceUniformsWithImm

Phase index         11
Pipeline position   Stage 1 (Initial Setup), after EarlyOriSimpleLiveDead (10), before OriSanitize (12)
Category            Optimization

Purpose

Replaces uniform register reads with immediate constants when the value is known at compile time. This is the earliest uniform-related optimization in the pipeline, running before any loop or branch optimization.

Motivation

Kernel launch parameters are passed through constant memory. After PTX-to-Ori lowering, a kernel parameter access looks like:

LDC  R3, c[0x0][0x160]     // load parameter from constant bank
IMAD R4, R3, R5, RZ        // use the parameter

If the compiler can prove that the constant bank address contains a known immediate (e.g., from .param directives with known offsets), the LDC is dead and the use can be folded:

IMAD R4, 42, R5, RZ        // parameter replaced with immediate 42

This eliminates constant memory traffic and reduces register pressure by one register.

When It Fires

The pass is most effective for:

  • Kernel parameters with known constant offsets
  • Shared memory size constants
  • Grid/block dimension constants when known at compile time
  • Constant expressions that survive PTX-to-Ori lowering as LDC loads

The pass is gated by knob 487 (general optimization enablement).

Phase 27: AnalyzeUniformsForSpeculation

Phase index27
Pipeline positionStage 2 (Early Optimization), after OriRemoveRedundantBarriers (26), before SinkRemat (28)
CategoryAnalysis

Purpose

Identifies uniform values that are safe for speculative execution. This analysis feeds subsequent passes that may hoist or speculatively execute instructions -- most immediately SinkRemat (phase 28) and SpeculativeHoistComInsts (phase 56).

Speculative Uniformity

A value is "speculatively uniform" if it would be uniform under all possible execution paths, not just the currently taken path. This is a stronger property than simple uniformity: a value that is uniform within one branch arm might not be speculatively safe to hoist above the branch if the other arm would produce a different value or a side effect.

The analysis must be conservative:

  • Memory loads cannot be speculated unless the address is provably valid on all paths (no faults).
  • Atomic operations are never speculative candidates.
  • Values defined under divergent control flow require careful handling -- the analysis must determine whether the definition dominates all paths that could reach the speculation point.
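
These conservative rules amount to a per-instruction filter. A minimal sketch, with invented field names (`is_load`, `may_fault`, `divergent_def`, `dominates_spec_point`) standing in for the recovered Ori IR attributes:

```python
UNSAFE_OPS = {"ATOM", "ATOMS", "RED"}   # atomics are never speculation candidates

def is_speculation_candidate(inst):
    if inst["op"] in UNSAFE_OPS:
        return False
    if inst.get("is_load") and inst.get("may_fault", True):
        # Loads qualify only when the address is provably valid on all paths.
        return False
    if inst.get("divergent_def") and not inst.get("dominates_spec_point"):
        # Defs under divergent control flow must dominate the speculation point.
        return False
    return True
```

Note the conservative defaults: a load is assumed to fault unless proven otherwise, matching the "must be conservative" requirement.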

Pipeline Position Rationale

Phase 27 runs after:

  • Loop unrolling (22), which may duplicate uniform definitions
  • SSA phi insertion (23), which provides single-definition reaching information
  • Software pipelining (24), which may interleave loop iterations
  • Barrier removal (26), which may relax synchronization constraints

And before:

  • SinkRemat (28), which uses the analysis to decide what can be sunk/recomputed
  • GeneralOptimize (29), which benefits from knowing which values are uniform

Phase 74: ConvertToUniformReg

Phase index:        74
Pipeline position:  Stage 4 (Late Optimization), after ConvertAllMovPhiToMov (73), before LateArchOptimizeFirst (75)
Category:           Optimization
String reference:   "ConvertToUniformReg" at 0x22BCA12
Related function:   sub_911030 (10,741 bytes, 56 callees)

Purpose

The main UR promotion pass. Converts qualifying R-register values to UR registers, replacing per-thread general-purpose register storage with warp-uniform storage. This is the highest-impact uniform register optimization in the pipeline.

Pipeline Position Rationale

Phase 74 runs immediately after SSA destruction (ConvertAllMovPhiToMov, phase 73). This is deliberate:

  • After SSA destruction: phi nodes have been converted to plain MOVs, giving the pass a clear view of all definitions and uses without phi-node complications.
  • After varying propagation (phases 53 and 70): the divergence annotations are complete -- the pass knows which values are proven uniform.
  • After predication (phase 63): if-conversion has already eliminated short branches, which may have exposed new uniform values.
  • Before register allocation: UR conversion reduces R-register demand before the fat-point allocator runs (phase 101), directly improving occupancy.
  • Before scheduling: the scheduler (phases 97+) can exploit UR-specific latency characteristics.

Conversion Criteria

A value qualifies for R-to-UR conversion when all of the following hold:

  1. Uniformity: the value is proven warp-uniform -- all threads compute the same result. This is established by the varying propagation passes and the phase 27 analysis.

  2. UR-expressible operation: the defining instruction has a uniform-datapath equivalent. Not all SASS instructions have UR variants. Operations like IMAD, IADD3, LOP3, ISETP, MOV, SEL, PRMT, SGXT, POPC, and BREV have UR counterparts. Complex operations like FFMA, LDG, STG, texture instructions, and atomics do not (until sm_100 added some uniform FP).

  3. UR pressure budget: the conversion must not exceed the 63-register UR hardware limit. The pass tracks live UR count and aborts conversion for a value if it would push the UR pressure beyond the limit.

  4. All uses accept UR sources: every consumer of the value must be able to read from the UR file. Some instructions have encoding restrictions that prohibit UR operands in certain source positions.

  5. No cross-warp dependencies: the value must not participate in cross-warp communication patterns (e.g., shuffle instructions that explicitly exchange values between lanes).
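
Taken together, the five criteria form a single qualification predicate. A hedged sketch, with invented field names (the real pass reads the file type at vreg+64, the varying bit at vreg+49, and the opcode tables):

```python
UR_LIMIT = 63   # UR0..UR62

def qualifies_for_ur(value, live_ur_count):
    return (value["uniform"]                                  # 1. proven warp-uniform
            and value["has_ur_form"]                          # 2. defining op has a UR variant
            and live_ur_count < UR_LIMIT                      # 3. UR pressure budget
            and all(u["accepts_ur"] for u in value["users"])  # 4. every use can read UR
            and not value["cross_warp"])                      # 5. no lane-exchange patterns
```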

Algorithm

The pass operates in two main phases:

Phase A -- Candidate identification. Walks the instruction list and marks each definition as a UR candidate based on the criteria above. For each candidate, it checks:

  • The vreg+64 register file type is R (type 1 or 2, not already UR type 3)
  • The varying propagation flag on the register indicates uniformity (bit 2 of vreg+49 clear)
  • The defining opcode has a UR-equivalent instruction form
  • All consumers of this register accept UR sources

Phase B -- Conversion. For each approved candidate:

  1. Changes the register's file type from R (type 1) to UR (type 3) at vreg+64
  2. Updates the register's allocator class from class 1 (R) to class 4 (UR) at vreg+12
  3. Rewrites the defining instruction to use the UR-variant opcode (e.g., IMAD becomes UIMAD)
  4. Inserts R2UR bridge instructions where a converted UR value flows into an instruction that requires an R-file source
  5. Inserts UR2R bridge instructions where an R-file value needs to flow into a converted UR instruction
  6. Updates the UR count at Code Object +99
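
The rewrite-plus-bridging step can be illustrated on a toy instruction list. Everything here is a simplification: prefixing the opcode with `U` stands in for the real opcode-table lookup, and the bridge is shown as a plain MOV copying the UR value back into the R file for a consumer that cannot encode UR sources.

```python
def convert_to_ur(insts, target):
    """Rewrite target's def to a U-variant opcode and bridge non-UR uses."""
    ur = target.replace("R", "UR", 1)   # toy renaming, e.g. R7 -> UR7
    out = []
    for inst in insts:
        if inst["dst"] == target:
            # Switch the defining instruction to the UR-variant opcode.
            out.append({"op": "U" + inst["op"], "dst": ur, "srcs": inst["srcs"]})
        elif target in inst["srcs"]:
            if inst.get("accepts_ur", True):
                out.append(dict(inst, srcs=[ur if s == target else s
                                            for s in inst["srcs"]]))
            else:
                # Bridge: copy the UR value back into the R file for a
                # consumer whose encoding cannot take a UR source.
                out.append({"op": "MOV", "dst": target, "srcs": [ur]})
                out.append(inst)
        else:
            out.append(inst)
    return out
```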

UR Pressure Management

The UR file has only 63 usable registers (UR0--UR62), compared to 254 for the R file. The pass must be conservative about how many values it converts:

  • Greedy allocation with pressure cap: candidates are evaluated in program order (RPO). Each conversion increments a pressure counter. If the counter reaches the hardware limit, remaining candidates are skipped.
  • Priority by benefit: conversions that save the most R-register pressure (long live ranges with many uses) are preferred.
  • Retry mechanism: the scheduling infrastructure at sub_A0D800 supports a "retry without uniform regs" fallback (controlled by flag v63). If scheduling with UR-converted code fails to meet latency targets, the scheduler can request a re-run without UR conversion.
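
The pressure-capped selection can be sketched by combining the two policies above, benefit priority with a hard cap. The candidate dictionaries are hypothetical:

```python
def pick_ur_conversions(candidates, ur_limit=63, live_ur=0):
    """Benefit-first greedy selection with a hard UR pressure cap."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: c["benefit"], reverse=True):
        if live_ur >= ur_limit:
            break                      # cap reached: remaining candidates skipped
        chosen.append(cand["name"])
        live_ur += 1                   # each conversion consumes one UR
    return chosen
```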

Interaction with Register Allocation

The UR conversion reduces R-register demand but introduces UR-register demand. The fat-point allocator (phase 101) handles R and UR as separate register classes (class 1 and class 4 respectively), with independent allocation passes. The trade-off:

|                  | R file | UR file |
|------------------|--------|---------|
| Capacity         | 254 usable | 63 usable (UR0--UR62) |
| Pressure impact  | Reduced by conversion | Increased by conversion |
| Occupancy impact | Positive (fewer R regs = higher occupancy) | Neutral (UR count does not affect warp occupancy on most SMs) |
| Spill cost       | Spilled to local memory | Spilled to R file, then to local memory |

The allocator state at alloc+440 tracks the uniform register promotion flag (controlled by knob 628 and context flag +1414). When this flag is set, the pre-allocation pass (sub_94A020) enables UR-aware allocation.

Phase 86: InsertPseudoUseDefForConvUR

Phase index:        86
Pipeline position:  Stage 5 (Legalization), after OriPropagateGmma (85), before FixupGmmaSequence (87)
Category:           Lowering

Purpose

Inserts pseudo use/def instructions to maintain correct liveness information for UR-converted registers. After ConvertToUniformReg (phase 74) converts values from R to UR, subsequent optimization and legalization passes may invalidate the liveness information. This pass inserts lightweight pseudo-instructions that prevent later passes from incorrectly eliminating UR definitions or extending UR live ranges beyond their intended scope.

Why Pseudo Instructions Are Needed

The UR conversion in phase 74 changes register file assignments, but does not update all downstream data structures. Between phase 74 and register allocation (phase 101), several passes run:

74  ConvertToUniformReg         <-- UR conversion happens here
75  LateArchOptimizeFirst
76  UpdateAfterOptimize
77  AdvancedPhaseLateConvUnSup
78  LateExpansionUnsupportedOps
79  OriHoistInvariantsLate2
80  ExpandJmxComputation
81  LateArchOptimizeSecond
82  AdvancedPhaseBackPropVReg
83  OriBackCopyPropagate
84  OriPerformLiveDeadFourth    <-- DCE could kill "unused" UR defs
85  OriPropagateGmma
86  InsertPseudoUseDefForConvUR <-- pseudo use/def insertion
87  FixupGmmaSequence
    ...
101 AdvancedPhaseAllocReg       <-- register allocation

The critical problem: OriPerformLiveDeadFourth (phase 84) runs liveness analysis and dead code elimination. If a UR-converted value appears dead (no R-file use remaining because the uses were also converted), DCE would remove it. The pseudo use/def instructions inserted by phase 86 create artificial uses that keep UR definitions alive through DCE.
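
The keep-alive effect is easy to demonstrate with a toy liveness-based DCE. `PSEUDO_USE` below is a hypothetical stand-in for the pseudo-instructions this phase inserts:

```python
def dce(insts):
    """Toy liveness DCE: drop defs with no remaining uses."""
    used = {s for inst in insts for s in inst.get("srcs", [])}
    return [inst for inst in insts
            if inst.get("dst") is None          # pseudo-use has no destination
            or inst["dst"] in used
            or inst.get("has_side_effects")]
```

A UR definition whose consumers were all converted away looks dead and is deleted; adding a pseudo-use to the instruction list keeps it in the def-use graph and thus alive through DCE.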

Pseudo Instruction Properties

The pseudo use/def instructions:

  • Have no hardware encoding -- they are removed before SASS emission
  • Carry register operand references that maintain the def-use chain
  • Are transparent to instruction scheduling (zero latency, no functional unit)
  • Are removed during post-RA cleanup or Mercury encoding

Convergent Boundary Interaction

The pass also interacts with the convergent boundary enforcement mechanism. The string "Missing proper convergent boundary around func call annotated with allowConvAlloc" (from sub_19D13F0) indicates that UR-converted values crossing function call boundaries require convergent allocation markers. The allowConvAlloc annotation on function calls triggers convergent boundary checking, and "Multiple functions calls within the allowConvAlloc convergent boundary" (sub_19C6400) warns when a convergent region contains more than one call.

The CONV.ALLOC pseudo-instruction (opcode 286 / 0x11E) is inserted by sub_19D7A70 to mark convergent allocation boundaries. This prevents the register allocator from assigning the same physical UR to values that are live across a convergent boundary where the UR might be redefined.

Varying Propagation (Supporting Analysis)

The OriPropagateVarying passes (phases 53 and 70) propagate divergence information forward through the IR. They are not part of the four-pass uniform register group, but provide the critical input data.

Phase 53 (OriPropagateVaryingFirst) runs before early rematerialization (54) and late expansion (55). It marks each register as either "uniform" or "varying" (divergent) by propagating divergence from known-divergent sources (thread ID registers, divergent memory loads) through the def-use chain. The propagation is a forward dataflow analysis: if any source operand of an instruction is varying, the destination is varying.

Phase 70 (OriPropagateVaryingSecond) repeats the analysis after predication (phase 63) and rematerialization (phase 69) may have changed the divergence landscape.

The varying flag is stored in the virtual register descriptor (bit 2 of vreg+49). During ConvertToUniformReg, only registers marked as non-varying are candidates for UR promotion.
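
The forward propagation rule is simple enough to sketch directly. This toy version iterates a flat instruction list to a fixpoint; the real pass works over the def-use chains of the Ori IR:

```python
def propagate_varying(insts, seeds):
    """Forward dataflow: a dst is varying if any src is varying."""
    varying = set(seeds)     # thread-ID registers, divergent loads, ...
    changed = True
    while changed:           # iterate to a fixpoint (handles loops in real CFGs)
        changed = False
        for inst in insts:
            if inst["dst"] not in varying and any(s in varying for s in inst["srcs"]):
                varying.add(inst["dst"])
                changed = True
    return varying
```

Everything not reached from a divergent seed remains uniform and is therefore a UR promotion candidate.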

Uniform Atomic Optimization (Phase 44)

OptimizeUniformAtomic (phase 44) is a mid-pipeline optimization that converts thread-uniform atomic operations into warp-level reductions. When all threads in a warp perform the same atomic operation on the same address with the same value, the hardware can coalesce them into a single atomic. This pass detects such patterns and rewrites them using REDUX (reduction) or ATOM.UNIFORM instruction forms.

Code Object Uniform Register Tracking

The Code Object maintains several fields related to UR state:

| Offset | Field | Description |
|---|---|---|
| +99 | ur_count | Number of uniform registers allocated for this function |
| +832 | Main liveness bitvector | One bit per virtual register (R + UR combined) |
| +856 | UR liveness bitvector | Separate bitvector for UR/UP registers only |
| +1368 bit 1 | has-uniform flag | Set when the function uses any UR registers |
| +1376 bit 4 | UR tracking enabled | Controls whether scheduling tracks UR pressure |
| +1378 bit 3 | has-UR-regs flag | Secondary flag confirming UR register usage |

The scheduling dependency builder at sub_A0D800 (39 KB) tracks UR pressure separately. When +1376 bit 4 is set, the control word computation at sub_A09850 doubles the register count for uniform operands (v15 = type==3 ? 2 : 1) and writes a 9-bit register count to the control word bits [0:8].
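
The recovered doubling logic can be restated as a few lines of arithmetic. The operand dictionaries are invented; type 3 marks a uniform operand as in the decompiled code:

```python
def pack_reg_count(operands):
    """Uniform operands (type 3) count double; result goes in bits [0:8]."""
    count = sum(2 if op["type"] == 3 else 1 for op in operands)   # v15 logic
    assert count < (1 << 9), "must fit the 9-bit control-word field"
    return count & 0x1FF
```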

The scheduling statistics printer (sub_A3A7E0) reports texture binding mode as "UR-bound" when textures are accessed via uniform-register-based descriptors:

# [inst=142] [texInst=0] [tepid=0] [rregs=24]

Disallowed Uniform Register Diagnostic

The function sub_A465F0 (CodeObject::buildCodeObjectHeader, 2.6 KB binary) checks whether UR registers were used despite being disallowed. The diagnostic:

"Uniform registers were disallowed, but the compiler required (%d) uniform
 registers for correct code generation."

This fires on pre-sm_75 targets where the UR file does not exist, or when a CLI option explicitly disables UR usage. Knob 687 controls the uniform register mode.

SM Architecture Availability

| SM range | UR support | UR ALU instructions | Uniform FP |
|---|---|---|---|
| sm_30 -- sm_72 | None | None | None |
| sm_75 -- sm_89 | UR0--UR62, UP0--UP6 | UIADD3, UIMAD, ULOP3, UISETP, UMOV, UPRMT, USGXT, UPOPC, UBREV | None |
| sm_90 -- sm_90a | UR0--UR62, UP0--UP6 | Full integer uniform ALU | None (LDCU requires -forcetext -sso) |
| sm_100+ | UR0--UR62, UP0--UP6 | Full integer + FP uniform ALU | UFADD, UFFMA, UFSEL, UFSETP, UVIADDR |

The LDCU (Load Constant Uniform) instruction is gated by architecture capability. The validation at sub_B28400 (345 bytes) checks:

"SM does not support LDCU. On SM90 -knob EmitLDCU is only supported when
 options '-forcetext' and '-sso out.sass' are provided."

This check queries vtable+1336 for the LDCU capability.

ConvertMemoryToRegisterOrUniform

The function sub_910840 (ConvertMemoryToRegisterOrUniform, gated by knob 487) is a pre-allocation optimization that promotes stack-resident variables to registers, with the option of promoting to UR when the variable is proven uniform. It is not one of the four numbered phases but works closely with them.

Entry:               sub_910840 (2,100 bytes)
Core:                sub_911030 (10,741 bytes, 56 callees)
Liveness builder:    sub_905B50 (5,407 bytes)
Promotion transform: sub_90FBA0 (~4,000 bytes)
Gate knob:           487
String:              "ConvertMemoryToRegisterOrUniform" at 0x910897

The entry function checks knob 487 for enablement (via vtable+152 dispatch), builds def-use chains via sub_905B50, then calls sub_90FBA0 for the actual promotion.

The sub_911030 core function (10.7 KB) handles the "OrUniform" part -- it iterates through the variable list, checks variable properties (address space, type), and decides whether to promote to R or UR. The decision process involves:

  1. Checking the register's vreg+49 flags byte (bit 2 = uniform marker from sub_907870)
  2. Evaluating whether the variable's address space permits UR promotion
  3. Confirming that the defining and using instructions have UR-compatible forms
  4. Verifying UR pressure headroom

The per-register-class property accessors at sub_900C50--sub_9013F0 (6 nearly identical 391-byte functions, 2 callers each) provide the class-indexed lookups for the promotion decision.

Key Functions

| Address | Size | Function | Description |
|---|---|---|---|
| sub_910840 | 2.1 KB | ConvertMemoryToRegisterOrUniform | Promotes stack variables to R or UR registers (knob 487 gated) |
| sub_911030 | 10.7 KB | Core UR promotion logic | Iterates variables, decides R vs UR promotion based on uniformity |
| sub_905B50 | 5.4 KB | Liveness builder for promotion | Builds def-use chains for promotion analysis |
| sub_90FBA0 | ~4 KB | Promotion transform | Applies the actual memory-to-register transformation |
| sub_8FEAC0 | 2.1 KB | Per-BB pressure analyzer | Walks instruction list, decodes operand types, updates pressure via vtable+1824; called from sub_910840 |
| sub_A465F0 | 2.6 KB | CodeObject::buildCodeObjectHeader | Writes UR count into code object, checks disallowed-UR diagnostic |
| sub_B28E90 | small | isURegPredicate | Is operand a uniform register? |
| sub_19D13F0 | 4.3 KB | Convergent boundary checker | Validates allowConvAlloc boundaries around function calls |
| sub_19C6400 | 330 B | Per-instruction convergent classifier | Callback: warns on opcode 159 within convergent boundary |
| sub_19D7A70 | 3.3 KB | CONV.ALLOC marker insertion | Inserts opcode 0x11E pseudo-instructions at convergent boundaries |
| sub_A0D800 | 39 KB | Scheduling dependency builder | Builds per-block dependency graph; tracks UR pressure via +856 bitvector |
| sub_A09850 | ~2 KB | Control word computation | Doubles count for uniform operands: type==3 ? 2 : 1 |
| sub_B28400 | 345 B | LDCU validator | Checks SM support for Load Constant Uniform |
| sub_7BC360 | ~1 KB | UR register encoder | Encodes UR operands in SASS instruction words (126 callers) |
| sub_7BD7D0 | ~1 KB | UR register decoder | Decodes UR operands from SASS instruction words (type=4) |
| sub_94A020 | ~3.5 KB | Pre-allocation setup | Sets alloc+440 UR promotion flag from knob 628 + context flag +1414 |
| sub_900C50 | 391 B | Register class property accessor | Per-class property lookup (one of 6 identical functions for GP, predicate, UR, etc.) |

Late Expansion & Legalization

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas pipeline contains six legalization passes spread across the 159-phase sequence. Their collective job is to replace Ori IR operations that the target SM cannot execute natively with equivalent sequences of legal instructions. "Unsupported ops" means exactly this: operations that exist in the PTX ISA or internal Ori representation but have no single-instruction mapping on the compilation target. The replacement may be an inline expansion (a sequence of simpler instructions), a call to a libdevice helper function, or an SM-specific intrinsic sequence.

The six passes run at deliberately different pipeline positions because each intervening group of optimization passes can expose new unsupported operations or create new legalization opportunities.

Passes covered:      6 (phases 5, 45, 55, 78, 93, 137)
Category:            Lowering
Backend dispatch:    Architecture-specific via two backend objects at context+0x630 and context+0x640
Libdevice functions: 608 helper functions registered at sub_5D1660 (9,728-byte table from unk_1D4D940)
Legalization flag:   SetAfterLegalization (phase 95) marks the point past which no unsupported ops should remain
Update pass:         UpdateAfterConvertUnsupportedOps (phase 132, factory 8) rebuilds IR metadata after late expansion
Knob gates:          Knob 499 (ConvertUnsupportedOps, LateExpansionUnsupportedOps), knob 487 (LateExpansion, SetAfterLegalization, LateExpansionUnsupportedOps), knob 214 / 464 (LateExpansionUnsupportedOps inner loop)

Why Six Passes

A monolithic legalize-everything pass early in the pipeline would cripple optimization. Many optimizations (CSE, LICM, strength reduction, predication) work on high-level operation semantics. If div.rn.f64 were expanded into a 30-instruction Newton-Raphson sequence at phase 5, loop-invariant code motion at phase 35 would see 30 independent instructions instead of one hoistable division. Conversely, some unsupported operations only appear after optimization passes transform the IR: predication (phase 63) can create new predicated ops that need legalization, GMMA fixup (phase 87) can introduce new WGMMA-related sequences, and conditional flow merging (phases 133/136) can expose operations that were previously dead.

The six passes form a progressive legalization strategy:

| Phase | Name | Pipeline position | Purpose |
|---|---|---|---|
| 5 | ConvertUnsupportedOps | Before optimization (stage 1) | Early legalization of obviously unsupported ops; preserves optimization opportunities for everything else |
| 45 | MidExpansion | After early/mid optimization (stage 3) | Target-dependent expansion after loop unrolling, strength reduction, and GVN have run |
| 55 | LateExpansion | After high-level optimizations (stage 4) | Expansion of ops that optimization passes should see in unexpanded form |
| 78 | LateExpansionUnsupportedOps | After all optimization (stage 5) | Catches remaining unsupported ops after predication, rematerialization, and uniform conversion |
| 93 | LateExpansionUnsupportedOps2 | After GMMA/attr passes (stage 5) | Second catch -- handles ops exposed by GMMA propagation, GMMA fixup, and register attribute setting |
| 137 | LateExpansionUnsupportedOpsMid | After late merge (stage 10) | Final catch between the two conditional flow merge passes |

Architecture Backend Dispatch

None of the six passes contain legalization logic directly. Each is a thin dispatcher that forwards to a virtual method on one of two architecture backend objects stored in the compilation context. The backend objects are constructed per-SM-target and provide the actual SM-specific legalization implementations.

Two backend objects:

| Context offset | Used by | Role |
|---|---|---|
| context+0x640 | ConvertUnsupportedOps, LateExpansion | Outer backend -- wraps an inner object at +0x10, provides two-level dispatch |
| context+0x630 | MidExpansion, LateExpansionUnsupportedOps, LateExpansionUnsupportedOps2, LateExpansionUnsupportedOpsMid, SetAfterLegalization | SM backend -- single-level dispatch through vtable |

The two-level dispatch through context+0x640 allows the outer backend to override the entire legalization strategy (by replacing vtable slot 0), while the inner object provides the SM-specific implementation when the outer backend delegates. This separation exists because ConvertUnsupportedOps and LateExpansion may need to coordinate with higher-level compilation modes (e.g., library compilation, OptiX IR) that wrap the SM backend.
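
The two-level dispatch is easier to see in a high-level analogue. The class and method names below are illustrative only; the binary does this through raw vtable-slot comparison against sub_661280:

```python
class SmBackend:
    """Stands in for the inner object at backend+0x10."""
    def legalize(self, ir):
        return f"sm-legalized({ir})"

class OuterBackend:
    """Stands in for the object at context+0x640."""
    def __init__(self, inner, override=None):
        self.inner = inner          # the wrapped SM backend (+0x10)
        self.override = override    # a replaced vtable slot 0, if any

    def convert_unsupported_ops(self, ir):
        if self.override is not None:       # slot differs from the default impl
            return self.override(ir)
        return self.inner.legalize(ir)      # unwrap and delegate

plain = OuterBackend(SmBackend())
custom = OuterBackend(SmBackend(), override=lambda ir: f"custom({ir})")
```

A library-mode or OptiX-mode compilation corresponds to constructing the outer backend with an override; the ordinary path falls through to the SM-specific implementation.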

Backend Vtable Slots

The SM backend at context+0x630 dispatches legalization through these vtable offsets:

| Vtable offset | Decimal | Called by |
|---|---|---|
| +0xB0 | 176 | MidExpansion |
| +0xD8 | 216 | LateExpansionUnsupportedOps2 |
| +0x108 | 264 | SetAfterLegalization |
| +0x178 | 376 | LateExpansionUnsupportedOps |
| +0x180 | 384 | LateExpansionUnsupportedOpsMid |

The outer backend at context+0x640 dispatches:

| Vtable offset | Decimal | Called by |
|---|---|---|
| +0x00 | 0 | ConvertUnsupportedOps (type check -- compared against sub_661280) |
| +0x78 | 120 | ConvertUnsupportedOps (delegated to inner object) |
| +0x58 | 88 | LateExpansion (type check -- compared against sub_6612E0) |
| inner +0xE0 | 224 | LateExpansion (delegated to inner object) |

Pass Details

Phase 5 -- ConvertUnsupportedOps

Factory index:  5
Vtable:         off_22BD690
execute():      sub_C60A20  (thunk -> context+0x640 dispatch)
isNoOp():       sub_C5F610  (returns 0 -- always runs)
Flag side-effect: sets context+1378 bit 0 (isConvertUnsupportedDone)
Knob gate:      499 (checked via sub_7DDB50)
Pipeline:       Bracketed by AdvancedPhaseBeforeConvUnSup (4) and AdvancedPhaseAfterConvUnSup (7)

This is the earliest legalization pass, running at phase 5 before any optimization. It converts operations that are clearly illegal on the target SM into equivalent sequences. The pass always runs (isNoOp = false) and is unconditional -- every compilation executes it.

Dispatch mechanism. The execute function (sub_C60A20) reads the backend at context+0x640, checks whether vtable slot 0 is the default implementation (sub_661280), and either calls the overridden method directly or unwraps to the inner object at backend+0x10 and calls vtable offset +0x78 (120). This two-level indirection allows library-mode and OptiX-mode compilation to inject custom legalization logic.

Flag effect. After execution, the pass sets bit 0 of context+1378, signaling to downstream passes that early legalization has completed. Passes like OriCreateMacroInsts (phase 8) check this flag to know whether certain patterns have already been lowered.

What gets legalized early: Operations that cannot survive optimization in their original form. Examples include operations that reference address spaces not supported on the target, certain modifier combinations that have no encoding, and PTX instructions that are syntactically valid but architecturally illegal (e.g., atom.add.f64 on targets without native FP64 atomics).

Phase 45 -- MidExpansion

Factory index:  51
Vtable:         off_22BDDC0
execute():      sub_C5EFB0  (thunk -> context+0x630 vtable+0xB0)
isNoOp():       sub_C5EFD0  (returns 0 -- always runs)
Field side-effect: sets context+1552 = 3
Pipeline:       After ExpandMbarrier (42), ForwardProgress (43), OptimizeUniformAtomic (44)
                Before GeneralOptimizeMid2 (46)

MidExpansion runs after the CTA/mbarrier/barrier expansion passes and before the second mid-level GeneralOptimize bundle. It handles target-dependent expansions that must occur after barrier-related lowering but before the mid-level optimization cleanup.

Dispatch. Dispatches directly through the SM backend vtable at offset +0xB0 (176). No two-level indirection -- the SM backend provides the implementation directly.

Side effect. Sets context+1552 to 3. This field is the pipeline progress counter (not exclusively a legalization counter -- see Context Fields below) and is read by subsequent passes to determine which pipeline stages have completed. The value 3 indicates "mid-expansion complete."

Phase 55 -- LateExpansion

Factory index:  63
Vtable:         off_22BDFA0
execute():      sub_C60AA0  (thunk -> context+0x640 dispatch)
isNoOp():       sub_C5EE20  (returns 0 -- always runs)
Field side-effect: sets context+1552 = 7 (via inner dispatch)
Pipeline:       After OriDoRematEarly (54), before SpeculativeHoistComInsts (56)
                Followed by GeneralOptimizeLate (58)

LateExpansion is the primary post-optimization legalization pass. It runs after all high-level optimizations (loop unrolling, strength reduction, GVN-CSE, reassociation, predication setup) have completed, expanding operations that were deliberately kept in high-level form for those passes.

Dispatch. Uses the outer backend at context+0x640. Checks vtable slot +0x58 (88) against the default (sub_6612E0). If overridden, calls the override. Otherwise, calls the inner object's vtable at +0xE0 (224) and then sets context+1552 = 7, advancing the pipeline progress counter.

What gets expanded here: This is the pass where most math library calls are introduced. Operations like div.rn.f64, sqrt.rn.f32, rcp.rd.f64 that were kept as single Ori instructions through optimization are now replaced with Newton-Raphson sequences or calls to the 608-function libdevice library. The SM20 library functions (division, square root, reciprocal, bit-field extract/insert) and SM70 functions (WMMA matrix operations, barrier reductions) are the primary candidates.

Optimization interaction. GeneralOptimizeLate (phase 58) runs immediately after, cleaning up the expanded sequences with copy propagation, constant folding, and dead code elimination. This is why expansion happens here rather than later -- the expanded code benefits from one more optimization round.

Phase 78 -- LateExpansionUnsupportedOps

Factory index:  90
Vtable:         off_22BE3D8
execute():      sub_C5EA50  (thunk -> context+0x630 vtable+0x178)
isNoOp():       sub_C5EA70  (returns 0 -- always runs)
Knob gate:      499 (via sub_7DDB50), plus flag check: context+1414 bit 2
Pipeline:       After AdvancedPhaseLateConvUnSup (77), before OriHoistInvariantsLate2 (79)

The first of three "late unsupported ops" catches. It runs after all optimizations have completed (phases 13-76) and catches operations that optimization passes themselves introduced or exposed.

Gating. This pass has the most complex gating of the six. In addition to the standard knob 499 check (via sub_7DDB50), it also checks bit 2 of context+1414. If the bit is clear, the pass is skipped even though isNoOp returns false. This allows the backend to dynamically disable the pass when no unsupported ops were detected during earlier compilation phases.

Implementation. When active, calls sub_7917F0 which:

  1. Checks context+1382 bit 2 (another prerequisite flag)
  2. Checks knob 214 (via the capability dispatch at context+1664)
  3. If the function table at context+0 + 1056 is not yet initialized, calls the expansion setup functions (sub_785E20, sub_781F80, sub_7E6090, sub_7E6AD0)
  4. Iterates over basic blocks, applying per-instruction legalization with convergence check (knob 464 gates the inner loop)

This iterative structure -- expand, check if more work needed, repeat -- handles cascading expansions where expanding one operation exposes another unsupported operation.
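
The expand-until-converged structure can be sketched as a fixpoint loop. The opcode names and expansion table below are toy examples, not real SASS or Ori opcodes:

```python
EXPANSIONS = {                       # toy opcodes, not the real per-SM tables
    "DIV64": ["MUL64", "SUB64"],     # an expansion can itself need expanding
    "MUL64": ["IMAD", "IMAD"],
    "SUB64": ["IADD3"],
}

def legalize_block(ops):
    changed = True
    while changed:                   # convergence check, cf. the knob 464 inner loop
        changed = False
        out = []
        for op in ops:
            if op in EXPANSIONS:
                out.extend(EXPANSIONS[op])
                changed = True
            else:
                out.append(op)
        ops = out
    return ops
```

Here "DIV64" needs two rounds: its first expansion produces "MUL64" and "SUB64", which are themselves unsupported and get expanded on the next sweep.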

Phase 93 -- LateExpansionUnsupportedOps2

Factory index:  109
Vtable:         off_22BE6D0
execute():      sub_C5E790  (thunk -> context+0x630 vtable+0xD8)
isNoOp():       sub_C5E7B0  (returns 0 -- always runs)
Pipeline:       After AdvancedPhaseAfterSetRegAttr (92), before FinalInspectionPass (94)

The second late catch, positioned after the GMMA/WGMMA passes (85-87), register attribute setting (90), and texture dependency analysis (91). These intervening passes can introduce new operations that need legalization:

  • GMMA propagation (phase 85) may introduce WGMMA accumulator movement operations
  • GMMA sequence fixup (phase 87) may insert hardware ordering instructions
  • Register attribute setting (phase 90) may expose operations that become illegal once register classes are assigned

Dispatch. Uses the SM backend vtable at offset +0xD8 (216). The dispatch is architecture-dependent: the execute function reads vtable slot 12 (backend[12]), compares against a default implementation (sub_661310), and either calls the override or falls through to a two-step sequence that calls methods at offsets 280 and 3088 on an inner object.

Phase 137 -- LateExpansionUnsupportedOpsMid

Factory index:  93
Vtable:         off_22BE450
execute():      sub_C607E0  (thunk -> context+0x630 vtable+0x180)
isNoOp():       sub_C5EA00  (returns 0 -- always runs)
Default check:  compares vtable+0x180 against sub_7D6D50 -- if default, entire pass is no-op
Pipeline:       After LateMergeEquivalentConditionalFlow (136), before OriSplitHighPressureLiveRanges (138)

The final legalization catch, positioned between the two conditional flow merge passes (133, 136) and the last-resort live range splitter (138). The merge passes can combine basic blocks in ways that create new instruction sequences containing unsupported operations.

Conditional execution. Unlike the other five passes, this one has a soft no-op mechanism: the execute function reads vtable slot +0x180 (384) and compares the function pointer against the default implementation (sub_7D6D50). If the backend has not overridden this slot, the pass returns immediately without doing any work. This means the pass is truly active only on SM targets that define a LateExpansionUnsupportedOpsMid handler -- typically newer architectures (Hopper/Blackwell) that have more complex merge and expansion interactions.

Supporting Passes

Phase 95 -- SetAfterLegalization

Factory index:  111
Vtable:         off_22BE720
execute():      sub_C5F8A0
isNoOp():       sub_C5E9C0  (returns 0 -- always runs)
Pipeline:       After FinalInspectionPass (94), before ReportBeforeScheduling (96)

Not a legalization pass per se. It marks the compilation context as post-legalization by calling the SM backend's vtable at offset +0x108 (264). This sets the legalization_complete flag that downstream passes (scheduling, register allocation, encoding) check to assert that no unsupported operations remain. The pass is gated by optimization level: sub_7DDB50 returns the current optimization level, and the dispatch only fires at -O2 and above.

Phase 132 -- UpdateAfterConvertUnsupportedOps

Factory index:  8
Vtable:         off_22BD708
execute():      sub_C5F570  (rep ret -- NOP)
isNoOp():       sub_C5F590  (returns 1 -- skipped by default)
Pipeline:       First pass in Stage 10

A placeholder update pass that rebuilds IR metadata after late unsupported-op conversion. Its execute() is a NOP (rep ret) and isNoOp() returns 1 (true), so it is skipped by default. Architecture backends can override the vtable to activate it when late expansion produces structural changes requiring metadata rebuild.

Libdevice Function Library

The legalization passes replace unsupported operations with calls to a library of 608 predefined helper functions. These are not external libraries -- they are PTX function bodies embedded in the ptxas binary itself, compiled and linked into the output at need.

The function table is initialized by sub_5D1660, which copies a 9,728-byte pre-built table from unk_1D4D940 and registers 608 function names in a hash map for lookup.

Library Function Categories

| SM prefix | Count | Operations |
|---|---|---|
| __cuda_sm20_ | 70 | Division (f32/f64, all rounding modes), reciprocal (f32/f64, all rounding modes), square root (f32/f64), double-precision reciprocal sqrt, bit-field extract/insert 64-bit, integer division/remainder (s16/s64/u16/u64) |
| __cuda_sm3x_ | 4 | FP32 division with FTZ variants (Kepler-specific paths) |
| __cuda_sm62_ | 2 | DP2A, DP4A dot-product accumulate (pre-Volta emulation) |
| __cuda_sm70_ | 397 | Barrier operations (arrive/red/wait with 0-15 barrier IDs and count variants), WMMA matrix operations (204 variants for different shapes/types), warp shuffle sync, warp vote sync, match sync |
| __cuda_sm80_ | 3 | Cache policy creation (fractional, range encode) |
| __cuda_sm1xx_ | 18 | Bulk copy (unicast/multicast), async bulk tensor copy (1D-5D tile/im2col, unicast/multicast) |
| __cuda_sm10x_ | 16 | TCGen05 guardrail traps (bounds check, alignment, allocation), tcgen05 MMA operations, mask creation |
| __cuda_scalar_video_emulation_ | 7 | Video instruction emulation (operand extract, sign extend, saturate, merge) |
| __cuda_reduxsync_ | 18 | Redux-sync reductions (and/or/xor for b32, add/max/min for s32/u32/f32 with NaN/abs variants) |
| __cuda_sanitizer_ | 6 | Memory sanitizer checks (malloc/free/generic/global/local/shared/metadata) |
| Other | ~67 | Miscellaneous: dummy entries, user-function stubs, device synchronize |

SM-Dependent Legalization Examples

The core design principle: what is "unsupported" depends entirely on the target SM. An operation legal on one architecture may require library expansion on another.

Integer division/remainder. PTX div.s64 and rem.u64 have no single SASS instruction on any SM. They are always expanded to multi-instruction sequences via __cuda_sm20_div_s64, __cuda_sm20_rem_u64, etc. These are "sm20" functions because the expansion has been the same since Fermi.

FP32 division with rounding. div.rn.f32 on Turing (sm_75) uses a hardware-assisted Newton-Raphson (MUFU.RCP + refinement). On Kepler (sm_3x, no longer shipped but the code path remains), different refinement sequences are needed, using __cuda_sm3x_div_rn_ftz_f32 and its slowpath variant.

Barrier operations. On Volta+ (sm_70), barrier.arrive with a specific barrier ID and thread count is a single SASS instruction (BAR.ARV). On pre-Volta targets, these must be emulated with the 397 __cuda_sm70_barrier_* library functions that implement the semantic equivalent using older synchronization primitives.

WMMA/Tensor Core. Warp-level matrix multiply-accumulate (wmma.*) on sm_70 has dedicated hardware instructions (HMMA). The 204 __cuda_sm70_wmma_* variants cover the combinatorial explosion of shapes (m16n16k16, m8n32k16, m32n8k16), types (f16, bf16, tf32, s8, u8, s4, u4, b1), layouts (row/col), and accumulator types.

DP2A/DP4A. The integer dot-product-accumulate instructions have native hardware support starting at sm_61. On sm_62 (Xavier), they use __cuda_sm62_dp2a and __cuda_sm62_dp4a emulation routines.

Bulk tensor copy (Blackwell). The cp.async.bulk.tensor family on sm_100+ (Blackwell) supports 1D through 5D tile and im2col access patterns, with unicast and multicast variants. These 18 __cuda_sm1xx_cp_async_bulk_tensor_* functions provide the expansion for targets where hardware support is partial or absent.

TCGen05 guardrails (Blackwell). The 5th-generation tensor core operations (sm_100+) include runtime guardrail traps -- bounds checking, alignment validation, allocation granularity checks -- implemented as __cuda_sm10x_tcgen05_guardrail_trap_* functions inserted during legalization.

Context Fields

The legalization passes interact with several fields on the compilation context:

| Offset | Type | Description |
|---|---|---|
| +0x630 | void* | SM backend object (main legalization dispatch target) |
| +0x640 | void* | Outer backend object (wraps SM backend, used by ConvertUnsupportedOps and LateExpansion) |
| +1378 | byte | Bit 0: ConvertUnsupportedOps has run |
| +1382 | byte | Bit 2: prerequisite flag for LateExpansionUnsupportedOps |
| +1414 | byte | Bit 2: enable flag for LateExpansionUnsupportedOps |
| +1552 | int32 | Pipeline progress counter -- written by multiple passes across legalization, optimization, and post-RA stages (see value table below) |
| +1664 | void* | Capability dispatch object (knob/option queries) |

The pipeline progress counter at context+1552 provides a monotonically increasing value that downstream passes can check to determine which pipeline stages have completed. Despite being documented previously as a "legalization stage counter," it is written by passes outside the legalization family (rematerialization, backward copy propagation, architecture-specific peephole, post-RA finalization):

| Value | Writer | Phase | Function |
|---|---|---|---|
| 0 | Context constructor | -- | sub_7F7DC0 |
| 3 | MidExpansion | 45 | sub_C5EF80 |
| 4 | OriDoRematEarly | 54 | sub_C5EF30 |
| 7 | LateExpansion | 55 | sub_6612E0 |
| 8 | Peephole/ISel refinement (arch-specific) | varies | sub_849C60 |
| 9 | OriBackCopyPropagate | 83 | sub_C5EB80 |
| 10 | PostRAFinalizer (arch-specific) | varies | sub_88E9D0 |
| 12 | SetAfterLegalization | 95 | sub_C5E980 |

Downstream passes compare against these thresholds: sub_A11060 checks > 4 to enable cross-block rematerialization; sub_752CF0 checks <= 3; sub_766520 checks <= 11; sub_781F80 checks <= 12; sub_78B8D0 checks > 18.

Pipeline Position Summary

Phase 0-4:   Initial setup, FP16 promotion, CFG analysis
Phase 5:     ConvertUnsupportedOps          <-- LEGALIZATION #1
Phase 6-44:  Optimization passes (branch, loop, strength reduction, GVN, barrier expansion)
Phase 45:    MidExpansion                    <-- LEGALIZATION #2
Phase 46-54: Mid/late optimization (GVN-CSE, reassociation, predication setup, remat)
Phase 55:    LateExpansion                   <-- LEGALIZATION #3
Phase 56-77: Late optimization (predication, commoning, LICM, remat, sync, phi destruction, uniform)
Phase 78:    LateExpansionUnsupportedOps     <-- LEGALIZATION #4
Phase 79-92: Post-opt (LICM, arch opt, back copy prop, GMMA, reg attrs)
Phase 93:    LateExpansionUnsupportedOps2    <-- LEGALIZATION #5
Phase 94:    FinalInspectionPass
Phase 95:    SetAfterLegalization (marks legalization complete)
Phase 96-136: Scheduling, RA, Mercury, post-RA, late merge
Phase 137:   LateExpansionUnsupportedOpsMid  <-- LEGALIZATION #6
Phase 138:   OriSplitHighPressureLiveRanges

Key Functions

| Address | Size | Role |
|---|---|---|
| sub_C60A20 | ~40B | ConvertUnsupportedOps execute dispatcher |
| sub_C5EFB0 | ~16B | MidExpansion execute dispatcher |
| sub_C60AA0 | ~50B | LateExpansion execute dispatcher |
| sub_C5EA50 | ~16B | LateExpansionUnsupportedOps execute dispatcher |
| sub_C607E0 | ~30B | LateExpansionUnsupportedOpsMid execute dispatcher |
| sub_C5E790 | ~16B | LateExpansionUnsupportedOps2 execute dispatcher |
| sub_C5F8A0 | ~30B | SetAfterLegalization execute |
| sub_7DDB50 | 232B | Optimization level gate (knob 499 check) |
| sub_7917F0 | ~400B | LateExpansionUnsupportedOps core implementation |
| sub_9059B0 | ~500B | LateExpansion core implementation (with expansion loop) |
| sub_5D1660 | ~8KB | Libdevice function table initializer (608 entries) |
| sub_785E20 | -- | Expansion setup (function table initialization) |
| sub_781F80 | -- | Expansion setup (mode configuration) |
| sub_7E6090 | -- | Instruction expansion driver |
| sub_7E6AD0 | -- | Instruction expansion driver (secondary) |
| sub_753600 | -- | Per-instruction legalization check |
| sub_753B50 | -- | Retry/convergence loop for iterative expansion |
| sub_13AF3D0 | 26,795B | Operand legalization dispatcher -- 164-case switch on opcode, called from sub_A29220 |
| sub_13A6280 | 1,289B | General operand materializer -- ensures operand is in legal register (called 83x) |
| sub_13A6AE0 | ~250B | Special-class operand materializer -- handles condition code and predicate classes |
| sub_13A7410 | ~50B | Try-inline-then-materialize wrapper -- checks sub_822750 before falling back |
| sub_13A6F90 | ~40B | Arch-immediate materializer -- like sub_13A7410 without pre-check |
| sub_13A45E0 | -- | Predicate operand materializer |
| sub_13A75D0 | -- | Uniform register conversion (class 6 to class 3) |
| sub_A29220 | -- | Pass driver that calls sub_13AF3D0 per instruction |
| sub_13ADB90 | 3,353B | Extended operand legalization variant (arch-specific override, vtable-dispatched) |

Operand Legalization Dispatcher

The SASS encoding backend cannot encode arbitrary operand forms. Before an instruction reaches the per-instruction encoder, every operand must be in a form the hardware encoding supports: a register in the correct class, an immediate that fits the bit-field width, or an absent-operand sentinel. The operand legalization dispatcher (sub_13AF3D0, 26,795 bytes) enforces these constraints. It is called once per instruction from the pass driver sub_A29220 and runs after ISel but before the SASS encoders.

Dispatcher Structure

The function reads the instruction opcode from field +72, masks off the predication flags (bits 12-13, mask & 0xCFFF), and enters a switch with 164 case labels covering Ori IR opcodes 0 through 352. Each case implements the legalization recipe for one opcode or a group of opcodes with identical operand layouts.
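The pre-switch opcode handling can be sketched directly from the recovered mask. This is a minimal illustration (function names are ours, not NVIDIA's) of what `& 0xCFFF` does: it clears bits 12-13 so that predicated and non-predicated forms of the same opcode hit the same switch case.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the dispatcher's opcode handling: the raw opcode field
 * (instruction offset +72) carries predication flags in bits 12-13.
 * Mask 0xCFFF keeps bits [11:0] and [15:14] and clears bits 12-13. */
static uint32_t switch_opcode(uint32_t raw_opcode_field)
{
    return raw_opcode_field & 0xCFFFu;
}

static int is_predicated(uint32_t raw_opcode_field)
{
    return (raw_opcode_field >> 12) & 1u;  /* bit 12 = predicate guard present */
}
```

A predicated opcode such as 0x1089 (MOV-family 0x89 with bit 12 set) therefore selects the same case label as the plain 0x89 form.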

Before the switch, a pre-pass handles predicated instructions. If bit 12 of the opcode is set (indicating a predicate guard is present), the function first checks backend vtable slot +3232 for a custom handler. If none exists or it declines, sub_13A6AE0 is called on the predicate guard operand (at position operand_count - 2) to ensure it is in a legal register.

The switch routes to five categories of legalization logic:

Direct operand materialization. The majority of cases call sub_13A6280 on each operand that might need conversion. Example for a 3-source FMA (case 6):

sub_13A6280(context, instruction, 3, insert_point, ...)  // src0
sub_13A7410(backend, instruction, 4, 1, insert_point, ...) // src1 (try inline first)
sub_13A6280(context, instruction, 5, insert_point, ...)  // src2
// then check optional predicate operands 6,7 via sentinel test

Variable-length operand scanning. Case 16 (store) scans up to 15 operand slots, testing each against the 0x70000000 sentinel to find where active operands end before legalizing each one.

Architecture-specific delegation. Cases 70, 243, 245-247, 254-255, 257-259, 262 delegate entirely to vtable+2816. Cases 280-281 delegate to vtable+2328 with adjusted operand counts. These are SM-specific instructions (tensor core, WGMMA, bulk copy) where operand constraints vary by architecture.

Opcode rewriting. Case 137 (MOV) rewrites the opcode field itself: to 0x82 (130) for conditional MOV, or to 0x109 (265) for MOV-from-special-register when the source is in register class 4.

Passthrough. Cases 22, 24, 34, 38, 44, 45, 59, 73, 74, 77, 83, 106, 135, 161, 180, 182, 192, 194, 198, 209, 213-215, 221, 297, 352 and the default case require no operand legalization and exit immediately.

The 0x70000000 Null-Operand Sentinel

Each operand occupies an 8-byte slot in the instruction. The lower 4 bytes encode the operand value and type:

| Bits | Field | Values |
|---|---|---|
| [30:28] | Type | 1=register, 2=signed immediate, 3=unsigned immediate, 5=predicate, 7=null |
| [23:0] | Payload | Register index or immediate value |
| [31] | Negate | 1=operand is negated |
| +7 (byte) | Flags | Bit 0: uniform/constant bank reference |
The sentinel value 0x70000000 encodes type 7 ("null") with zero payload and no negation. It marks operand slots that are architecturally absent -- optional predicate guards not specified, trailing source operands of variable-width instructions, or unused operand positions in instructions with fewer sources than the maximum slot count.
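The bit layout above can be captured with a few accessors. This is a sketch using our own macro names for the recovered fields; the constants are the ones documented in the table.

```c
#include <assert.h>
#include <stdint.h>

/* Accessors for the 32-bit operand word recovered above. Type lives in
 * bits [30:28], payload in [23:0], negate in bit 31. 0x70000000 is
 * type 7 with zero payload: the null-operand sentinel. */
#define OP_TYPE(word)     (((word) >> 28) & 0x7u)
#define OP_PAYLOAD(word)  ((word) & 0xFFFFFFu)
#define OP_NEGATE(word)   ((word) >> 31)
#define OP_NULL_SENTINEL  0x70000000u

/* Build a register-reference operand (type 1), the form the
 * materializers write back after creating a MOV. */
static uint32_t make_register_operand(uint32_t reg_index)
{
    return 0x10000000u | (reg_index & 0xFFFFFFu);
}
```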

The dispatcher tests for the sentinel with:

if ( ((*((_DWORD *)instr + offset) ^ 0x70000000) & 0x70000000) != 0 )
    // operand is PRESENT -- legalize it

The XOR produces zero in bits [30:28] only when they are exactly 0b111 (type 7). The AND isolates those bits. If the result is zero, the operand is null and legalization is skipped. If non-zero, the operand is present and must be processed.

The function contains 59 references to 0x70000000. The heaviest user is case 16 (store), which chains 14 successive sentinel tests (at instruction offsets +84 through +196) to determine the store's vector width -- effectively implementing for each slot: if sentinel, stop; else legalize.
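The case-16 scan pattern reduces to a short loop. This sketch (our naming, operating on an in-memory array rather than raw instruction offsets) mirrors the decompiled sentinel test exactly: `((word ^ 0x70000000) & 0x70000000) != 0` means the operand is present.

```c
#include <assert.h>
#include <stdint.h>

#define OP_NULL_SENTINEL 0x70000000u

/* Walk up to max_slots operand words and stop at the first null
 * sentinel -- the store-width determination logic of case 16. */
static int count_active_operands(const uint32_t *slots, int max_slots)
{
    int n = 0;
    while (n < max_slots &&
           (((slots[n] ^ OP_NULL_SENTINEL) & OP_NULL_SENTINEL) != 0))
        n++;                                  /* operand present: would legalize here */
    return n;
}
```

The XOR/AND form is equivalent to `OP_TYPE(word) != 7`, but matches the compiled code, which tests all three type bits in one instruction.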

Operand Materialization Helpers

The dispatcher calls six helper functions depending on the operand class:

| Function | Calls | Role |
|---|---|---|
| sub_13A6280 | 83 | General materializer. The core function. Checks if the operand can remain as-is (register in a legal class, or inline immediate that fits). If not, creates a MOV instruction via sub_92E800 to load the value into a fresh register, inserts it before the current instruction, and replaces the operand slot with a register reference (0x10000000 \| reg_index). Short-circuits immediately for uniform registers (class 6). Uses sub_7DBC80 to test inline-immediate feasibility and sub_91D150/sub_91D160 for constant pool operations. |
| sub_13A7410 | 15 | Try-inline-then-materialize. Checks sub_822750 first ("can this immediate be encoded inline for this arch?"). If yes, keeps the immediate. If no, tries sub_822990/sub_8229D0 for extended encoding paths. Falls back to sub_13A6280 only if all inline attempts fail. |
| sub_13A6AE0 | 15 | Special-class materializer. Handles operands in non-standard register classes. For class 5 (predicate): returns immediately. For class 2 (condition code): creates a MOV with opcode 0x108. For immediates: calls sub_91D150 for constant pool lookup and replaces the operand. Used on predicate guard operands and instructions with condition-code sources. |
| sub_13A6F90 | 7 | Arch-immediate materializer. Like sub_13A7410 but skips the sub_822750 pre-check. Used for operands where inline encoding is known to be architecture-dependent (texture coordinates, barrier IDs). |
| sub_13A45E0 | 5 | Predicate materializer. Handles materialization of optional predicate operand slots, called exclusively after a sentinel test confirms the operand is present. |
| sub_13A75D0 | 1 | Uniform register conversion. Called once (case 6, FMA) to handle uniform register class 6 operands that need conversion to general-purpose class 3. |

Materialization Flow (sub_13A6280 Detail)

The general materializer at sub_13A6280 (1,289 bytes) implements this decision tree for a single operand:

  1. Uniform register early exit. If the operand is a register (type 1) in class 6 (uniform), return immediately -- uniform registers are always legal in the encoding.

  2. Inline immediate check. If the operand is an immediate (type 2/3), call sub_7DBC80 to test whether the value fits in the instruction's immediate field. If it fits and passes the floating-point validity check (vtable+1504) and architecture encoding check (vtable+3248), keep the immediate as-is.

  3. Register reclassification. If the operand is a register in class 3 (general-purpose), query the architecture via vtable+1240 and vtable+904 to determine if the register should be reclassified to uniform class 6 (for data types with width <= 3 register slots).

  4. Data-type conversion. For boolean (sub_7D66E0) or floating-point (sub_7D6780) operand types, call vtable+904 to map the data type to the appropriate register class.

  5. Materialization. Call sub_92E800 to create a MOV instruction (opcode 0x82 = 130) that loads the constant/immediate into a new register. Insert it at the insertion point. Replace the operand slot: lower word becomes 0x10000000 | new_reg_index (type 1 = register), upper word is cleared to & 0xFEC00000.

  6. Insertion point update. If the insertion point a4 currently points to the instruction being legalized, advance it to the newly inserted MOV so subsequent materializations are ordered correctly.
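The slot rewrite in step 5 can be shown as a small helper. This is a sketch under our own naming, treating the 8-byte slot as a 64-bit word (low half = operand word, high half = flags word); the two constants are the ones recovered from the decompilation.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the step-5 slot rewrite: the lower word becomes a type-1
 * register reference to the freshly materialized register, and the
 * upper word is masked with 0xFEC00000, clearing the now-stale bits
 * while preserving the flag bits that survive materialization. */
static uint64_t rewrite_operand_slot(uint64_t slot, uint32_t new_reg_index)
{
    uint32_t lo = 0x10000000u | (new_reg_index & 0xFFFFFFu);
    uint32_t hi = (uint32_t)(slot >> 32) & 0xFEC00000u;
    return ((uint64_t)hi << 32) | lo;
}
```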

Opcode Groups and Legalization Recipes

| Opcodes | Instruction Class | Operands Legalized | Notes |
|---|---|---|---|
| 2-7 | Arithmetic (ADD/MUL/FMA) | dst, src0, src1 [, src2] | FMA (6) has optional predicate slots checked via sentinel |
| 8 | LD (load) | Variable based on addressing mode | Operand count read from +80 |
| 10-11, 151-152, 290-291 | Compare/select | src0, src1 | Standard 2-source legalization |
| 16 | ST (store) | 1-15 data operands | Sentinel-scanned variable width |
| 32 | ATOM (atomic) | dst, addr, data | Specialized register conversion |
| 36 | TEX (texture) | coords + handle | Texture handle materialization |
| 42, 53, 55 | Shift/logic | src0 + try-inline src1 | sub_13A6280 + sub_13A7410 |
| 51 | PRMT (permute) | src0, control, src1 | sub_13A6F90 for arch-dependent control operand |
| 61 | Branch-conditional | Nested switch on modifier bits | 6 sub-cases for different branch forms |
| 70, 243-262 | Tensor/WGMMA/bulk | Delegated to vtable+2816 | Architecture-specific |
| 82, 166, 196 | FP convert | src + try-inline | sub_13A6280 + sub_13A7410 + optional sub_13A6F90 |
| 88-89 | ATOMS/ATOMG | Loop over sources | Per-source legalization with count |
| 110-121 | Wide arithmetic | src0, src1, src2 | 3 consecutive sub_13A6280 calls |
| 137 | MOV | Opcode rewrite | Rewrites to 0x82 or 0x109 based on register class |
| 230-232 | LD/ST extended | src + inline + arch | sub_13A6280 + sub_13A7410 + sub_13A6F90 |
| 270-289 | Control flow / misc | Variable | Several sub-groups with different patterns |
| 280-281 | Multi-source | Delegated to vtable+2328 | Operand count adjusted by -4 |

Architecture Override Points

The dispatcher provides three escape hatches for architecture-specific behavior:

| Vtable Offset | Decimal | Opcodes | Purpose |
|---|---|---|---|
| +2816 | 0xB00 | 70, 243, 245-247, 254-255, 257-259, 262 | Full delegation for SM-specific instructions |
| +2328 | 0x918 | 280-281 (+ other cases) | Multi-source instructions with adjusted operand counts |
| +3232 | 0xCA0 | Pre-switch (predicated instructions) | Custom predicate guard handling |

The vtable+2816 handler receives (backend, instruction, insert_point, pass_context, mode_flag) and is expected to perform complete operand legalization for the instruction. The vtable+2328 handler receives an adjusted operand count (total - 4), suggesting these instructions have 4 fixed operands plus a variable source list.

Relationship to Legalization Passes

The operand legalization dispatcher operates at a different abstraction level than the six legalization passes described above. The legalization passes (phases 5-137) operate on the Ori IR, replacing unsupported operations with sequences of supported ones. The operand legalization dispatcher operates on individual operands within already-legal instructions, ensuring each operand is in a form the SASS encoder can bit-pack into machine code.

The dispatcher runs as part of the SASS encoding pipeline (called from sub_A29220), well after all six Ori-level legalization passes have completed. It is invoked per-instruction during the encoding walk, not as a standalone pass.

Ori legalization passes (phases 5-137)
  Replace unsupported OPERATIONS with legal sequences
         |
         v
SASS operand legalization (sub_13AF3D0, during encoding)
  Ensure each OPERAND of a legal instruction is encodable
         |
         v
SASS per-instruction encoders (522 functions)
  Pack operands into binary instruction word

Cross-References

Allocator Architecture

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas register allocator is a fat-point greedy allocator, not a graph-coloring allocator. There is no interference graph, no Chaitin-Briggs simplify-select-spill loop, and no graph coloring in the main allocation path. Instead, the allocator maintains per-physical-register pressure histograms (512-DWORD arrays) and greedily assigns each virtual register to the physical slot with the lowest interference count. This design trades theoretical optimality for speed on the very large register files of NVIDIA GPUs (up to 255 GPRs per thread).

A secondary live-range-based infrastructure (~80 functions at 0x994000--0x9A1000) supports coalescing, splitting, and pre-coloring but feeds results into the fat-point allocator rather than replacing it.

| Role | Function |
|---|---|
| Entry point | sub_9721C0 (1086 lines) |
| Per-class driver | sub_971A90 (355 lines) -- NOSPILL then SPILL retry |
| Core allocator | sub_957160 (1658 lines) -- fat-point coloring engine |
| Assignment | sub_94FDD0 (155 lines) -- write physical reg, propagate aliases |
| Spill guidance | sub_96D940 (2983 lines) -- per-class priority queues |
| Spill codegen | sub_94F150 (561 lines) -- emit spill/reload instructions |
| Pre-coloring | sub_991790 (2677 lines) -- full-function pre-assignment |
| Address range | 0x8FE000 -- 0x9D3000 (~860 KB, ~950 functions) |
| Knobs | 87 OCG knobs (RegAlloc* / RegTgt* / RegUsageLevel, indices 613--699) |

Pipeline Position

The register allocator runs in the late pipeline, after all optimization passes and instruction scheduling preparation, but before final SASS encoding:

... optimization passes ...
  Late Legalization / Expansion
  AdvancedPhaseAllocReg gate         <-- pipeline entry guard
  HoistInvariants                    <-- sub_8FFDE0 (optional)
  ConvertMemoryToRegisterOrUniform   <-- sub_910840
  Pre-coloring                       <-- sub_991790
  Instruction lowering               <-- sub_98F430 / sub_98B160
  Register allocation entry          <-- sub_9721C0
    Per-class allocation x 7         <-- sub_971A90 for classes 1..6
      Core fat-point allocator       <-- sub_957160
  Post-allocation fixup
  Instruction scheduling
  SASS encoding

Register Classes

The allocator processes 7 register classes. Class 0 (unified) is skipped in the normal per-class loop; it is used for cross-class constraint propagation. Classes 1--6 are allocated independently in order:

| ID | Name | Width | HW Limit | Description |
|---|---|---|---|---|
| 0 | -- | -- | -- | Unified / cross-class (skipped in main loop) |
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose registers (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (extended uniform) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers (MMA/WGMMA) |

Barrier registers (B, UB) have reg_type = 9, which is above the <= 6 allocator cutoff, so they are handled by a separate mechanism.

Special registers that are always skipped during allocation:

  • Indices 41--44: PT, P0--P3 (architectural predicates)
  • Index 39: special register

The class ID is the reg_type value at vreg+64. The allocator distribution loop in sub_9721C0 reads this field directly and uses it as the bucket index.

Pair modes (vreg+48, bits 20--21): 0 = single, 1 = lo-half of pair, 3 = double-width (consumes two physical slots).

Entry Point: sub_9721C0

The top-level register allocation driver (1086 lines). Called once per function after the AdvancedPhaseAllocReg pipeline gate.

function regalloc_entry(alloc_state, compilation_ctx):
    // 1. Rebuild liveness
    rebuild_basic_blocks(compilation_ctx, 1)          // sub_781F80
    compute_liveness(compilation_ctx, 1)              // sub_A10160

    // 2. Initialize per-class register file state (classes 1..6)
    for class_id in 1..6:
        vtable[896](alloc_state, class_id)            // init register file state

    // 3. Sort instructions by priority
    sort_instructions_by_priority(alloc_state)        // sub_9375C0

    // 4. Distribute vregs into per-class linked lists
    for each vreg in function:
        class = vreg.register_class
        append(class_lists[class], vreg)

    debug("\nREGALLOC GUIDANCE:\n")

    // 5. Allocate each class independently
    for class_id in 1..6:
        alloc_with_spill_retry(                       // sub_971A90
            alloc_state, compilation_ctx, class_id)

    // 6. Post-allocation fixup
    fix_load_opcode_187(alloc_state)
    fix_call_saved_registers(alloc_state)

    // 7. Handle OptixIR mode (ctx+896 == 4 or 5)
    if is_optix_ir(compilation_ctx):
        record_register_counts(compilation_ctx)

The entry point calls sub_789280 when a pre-allocation fixup bit (flag bit 2) is set, handles live-through-call register counting at lines 343--352, and sets up rematerialization lists at alloc_state[161..175].

Per-Class Driver: sub_971A90

The outer retry loop (355 lines) that wraps the core allocator with a two-phase strategy:

Phase 1 -- NOSPILL: Attempt allocation without allowing spills. Debug string: "-CLASS NOSPILL REGALLOC: attemp " (note the typo -- present in the binary).

Phase 2 -- SPILL: If NOSPILL fails, invoke spill guidance (sub_96D940) and retry with spilling enabled.

function alloc_with_spill_retry(alloc_state, ctx, class_id):
    no_retarget = query_knob(638)                     // RegAllocNoRetargetPrefs (bool)
    num_trials  = query_knob(639)                     // RegAllocNumNonSpillTrials (int)

    // Phase 1: NOSPILL
    pre_allocation_pass(alloc_state)                  // sub_94A020
    secondary_driver(alloc_state, ctx)                // sub_95DC10
    result = fatpoint_allocate(alloc_state, ctx, NOSPILL)  // sub_957160
    record_best_result(alloc_state, result)            // sub_93D070

    if result == SUCCESS:
        return

    // Phase 2: SPILL retry loop
    for attempt in 1..num_trials:
        guidance = compute_spill_guidance(ctx, attempt)    // sub_96D940
        result = fatpoint_allocate(alloc_state, ctx, SPILL)
        record_best_result(alloc_state, result)

        if result == SUCCESS:
            break

    if result == FAILURE:
        final_fallback(alloc_state)                   // sub_936FD0

    post_allocation_finalize(alloc_state)             // sub_9714E0

For SMEM spilling (modes 3/6 when ctx+896 == 5), the driver activates sub_939BD0 (spill setup) followed by sub_94F150 (spill codegen) before entering the retry loop.

Core Fat-Point Allocator: sub_957160

The central allocation function (1658 lines). This is where physical registers are actually chosen.

Data Structures

Two 2056-byte arrays (512 DWORDs + 2-DWORD sentinel each):

| Array | Role |
|---|---|
| Primary (v12) | Per-physical-register interference count |
| Secondary (v225) | Per-physical-register secondary cost (tie-breaking) |

Both arrays are zeroed with SSE2 vectorized loops at the start of each allocation round.

Algorithm

function fatpoint_allocate(alloc_state, ctx, mode):
    maxRegs = alloc_state.hw_limit + 7               // from alloc+756
    if mode == CSSA_PAIRED (6):  maxRegs *= 2
    if mode == CSSA (3):         maxRegs *= 4

    primary[512]   = {0}                              // SSE2 memset
    secondary[512] = {0}

    threshold = query_knob(684)                       // RegAllocThresholdForDiscardConflicts, default 50

    for each vreg in alloc_state.register_list:       // linked list at +744
        // Populate interference bitmaps for this vreg
        build_interference_bitmaps(vreg, primary, secondary)   // sub_957020

        // Scan for minimum-pressure physical register
        best_slot = -1
        best_cost = MAX_INT
        for slot in 0..maxRegs:
            if primary[slot] > threshold:
                continue                              // too congested
            cost = primary[slot]
            if cost < best_cost:
                best_cost = cost
                best_slot = slot
            elif cost == best_cost:
                // tie-break on secondary bitmap
                if secondary[slot] < secondary[best_slot]:
                    best_slot = slot

        if best_slot == -1:
            emit_error("Register allocation failed with register count of '%d'")
            return FAILURE

        // Assign physical register
        assign_register(alloc_state, ctx, mode,       // sub_94FDD0
                        vreg, best_slot)

    return alloc_state.register_count + 1

The interference threshold (RegAllocThresholdForDiscardConflicts, knob 684, default 50) is the key heuristic parameter. Slots with interference above this value are discarded (skipped entirely), forcing the allocator toward less-contested register slots even if they are not globally minimal.
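The slot-selection kernel of the pseudocode above reduces to a few lines. This is a minimal sketch under our own naming (arrays stand in for the two 512-DWORD histograms), not a reconstruction of the actual loop body:

```c
#include <assert.h>

/* Greedy fat-point slot scan: choose the slot with the lowest primary
 * interference count, skipping slots above the discard threshold
 * (RegAllocThresholdForDiscardConflicts, knob 684, default 50) and
 * tie-breaking on the secondary cost array. Returns -1 on failure. */
static int pick_slot(const int *primary, const int *secondary,
                     int max_regs, int threshold)
{
    int best = -1;
    for (int s = 0; s < max_regs; s++) {
        if (primary[s] > threshold)
            continue;                         /* too congested: discard */
        if (best < 0 ||
            primary[s] < primary[best] ||
            (primary[s] == primary[best] && secondary[s] < secondary[best]))
            best = s;
    }
    return best;
}
```

Note how the threshold changes the outcome: a slot with the globally minimal count is still rejected if that count exceeds the knob value, which is what pushes allocation toward less-contested registers.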

Register Assignment: sub_94FDD0

The assignment function (155 lines) writes the physical register and propagates through alias chains:

function assign_register(alloc, ctx, mode, vreg, regclass_info, slot, cost):
    max_regs = regclass_info.max_regs                 // at +16

    if slot >= max_regs and not vreg.is_spilled():    // flag 0x4000
        vreg.set_needs_spill()                        // flag 0x40000
        return

    if vreg.needs_spill():                            // flag 0x40000
        setup_spill_allocator(alloc)                  // sub_939BD0
        generate_spill_code(alloc, vreg)              // sub_94F150
        return

    // Non-spill path: commit assignment
    consumption = compute_consumption(vreg)            // sub_939CE0
    update_peak_usage(alloc, consumption)
    vreg.physical_register = slot

    // Check for pre-allocated candidate
    apply_preallocated_candidate(alloc, vreg)         // sub_950100

    // Propagate through alias chain
    alias = vreg.alias_parent                         // vreg+36
    while alias != NULL:
        alias.physical_register = slot
        alias = alias.alias_parent

Register consumption computation (sub_939CE0, 23 lines) accounts for paired registers: it returns assignment + (1 << (pair_mode == 3)) - 1, effectively consuming two slots for double-width registers.
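The recovered formula can be checked directly. A worked sketch (our function name, formula verbatim from the decompilation):

```c
#include <assert.h>

/* sub_939CE0's consumption formula: a double-width register
 * (pair_mode == 3) consumes one slot beyond its assigned index,
 * since 1 << (pair_mode == 3) evaluates to 2 for doubles and 1
 * for everything else. */
static int register_consumption(int assignment, int pair_mode)
{
    return assignment + (1 << (pair_mode == 3)) - 1;
}
```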

Constraint System

The fat-point interference builder (sub_926A30, 4005 lines) processes 15+ constraint types extracted from instruction operand descriptors. Each operand encodes: bits 28--30 = operand type, bits 0--23 = register index.

| Type | Name | Description |
|---|---|---|
| 0 | Point interference | Single-instruction conflict at a specific program point |
| 1 | Register operand | Standard read/write interference |
| 2 | Immediate operand | No register interference generated |
| 3 | Paired register | Double-width; bit 23 distinguishes hi/lo half |
| 4 | Exclude-one | Specific physical register excluded from assignment |
| 5 | Exclude-all-but | Only one physical register permitted |
| 6 | Below-point | Interference active below the current program point |
| 7 | Range | Interference over an interval of program points |
| 8 | Phi-related | CSSA phi instruction (opcode 195) constraint |
| 9 | Barrier | Barrier register class constraint |
| 10--15 | Extended | Additional constraint variants |

The builder uses FNV-1a hashing (seed 0x811C9DC5, prime 16777619) for hash-table lookups into the pre-allocation candidate table. It contains SSE2-vectorized inner loops for bulk interference weight accumulation and dispatches through 7+ vtable entries for OCG knob queries.
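The recovered constants match standard 32-bit FNV-1a exactly, so the lookup hash can be reproduced verbatim (only the function name here is ours):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV-1a with the constants found in the interference
 * builder: offset basis 0x811C9DC5, prime 16777619 (0x01000193). */
static uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t h = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];          /* XOR the byte in first (the "1a" order) */
        h *= 16777619u;     /* then multiply by the FNV prime */
    }
    return h;
}
```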

Spilling Overview

Spilling triggers when the fat-point allocator cannot find a physical register within the budget. The subsystem has three components:

Spill guidance (sub_96D940, 2983 lines): Computes which registers to spill and in what order. Builds a 7-element guidance array (one per register class), each backed by an 11112-byte working structure containing 128-element bitmask arrays. Constructs priority queues of spill candidates using bitvector-based live range analysis. The function contains 7 near-identical code blocks (one per class), likely unrolled from a template.

Spill codegen (sub_94F150, 561 lines): Emits actual spill/reload instructions. Allocates a per-register spill info array (12 bytes per entry, initialized to {0, -1, -1}). Default spill cost is 15.0, reduced to 3.0 for certain architecture modes. Handles loop nesting via block frequency callbacks (vtable offset +8) and provides special handling for uniform registers (bit 0x200 in flags).

Spill memory targets:

| Target | Description |
|---|---|
| LMEM (local memory) | Default spill destination. Per-thread private memory. |
| SMEM (shared memory) | Alternative spill destination. Faster but shared across CTA. Assertion: "Smem spilling should not be enabled when functions use abi." |

Spill setup (sub_939BD0, 65 lines) selects configuration based on RegAllocEstimatedLoopIterations (knob 623) and the cost threshold at alloc+776:

| Condition | Bucket size | Alignment | Max size |
|---|---|---|---|
| Cost threshold == 0 | 8 | 4 | 1 MB |
| Cost threshold != 0 | 16 | 16 | 1 MB |

See Spilling for the full spill subsystem analysis.

Pre-Allocation and Mem-to-Reg

Two important pre-passes run before the main allocator:

ConvertMemoryToRegisterOrUniform

Entry: sub_910840 (327 lines). Promotes stack variables to registers or uniform registers. Gated by sub_8F3EA0 (eligibility check) and NumOptPhasesBudget (knob 487, budget type).

sub_910840 (entry, string: "ConvertMemoryToRegisterOrUniform")
  sub_905B50 (1046 lines)  build promotion candidates
  sub_911030 (2408 lines)  detailed analysis engine (def-use chains, dominance)
  sub_90FBA0 (653 lines)   execute promotion, insert phi nodes
  sub_914B40 (1737 lines)  post-promotion rewrite / phi-resolution

Pre-Allocation Pass

Entry: sub_94A020 (331 lines). Assigns physical registers to high-priority operands before the main allocator runs. Gated by RegAllocMacForce (knob 628, bool), RegAllocMacVregAllocOrder (knob 629, int), and RegAllocCoalescing (knob 618, bool).

For allocation modes 3, 5, or 6: iterates basic blocks calling sub_9499E0 (per-block scanner) and sub_93ECB0 (per-operand pre-assigner). Priority levels from RegAllocPrefMacOperands (knob 646): 1 = read operands, 2 = write operands, 3 = both.

Uses an opcode eligibility bitmask table (shift-based membership test on opcode - 22) to filter which instructions are candidates for pre-assignment.
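
The membership test itself is cheap: one shift and one mask against a constant. A minimal sketch follows -- the mask value and eligible opcodes are invented for illustration; only the `opcode - 22` bias comes from the decompilation:

```python
# Hypothetical sketch of a shift-based opcode eligibility test. The mask
# constant is invented; only the `opcode - 22` bias was recovered.
ELIGIBLE_MASK = 0b1010110  # invented: marks opcodes 23, 24, 26, 28 as eligible

def is_preassign_candidate(opcode: int) -> bool:
    idx = opcode - 22            # bias by the lowest eligible opcode number
    if idx < 0 or idx >= 64:     # outside the 64-bit table window
        return False
    return bool((ELIGIBLE_MASK >> idx) & 1)
```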

Live Range Infrastructure

An interval-based live range system at 0x994000--0x9A1000 (~80 functions) supports auxiliary operations. This is not the main allocator but feeds results into it:

| Subsystem | Range | Count | Key Functions |
|---|---|---|---|
| Live range primitives | 0x994000--0x996000 | ~25 | Constructor, interval queries, weight, color get/set |
| Interference graph | 0x996000--0x99A000 | ~18 | Node/edge construction, adjacency, degree, coloring |
| Range operations | 0x99C000--0x9A1000 | ~35 | Merge, split, interference add/remove, copy detection |
| Register coalescing | sub_9B1200 | 1 | Copy elimination pass (800 lines) |
| Live range splitting | sub_9AEF60 | 1 | Interference graph update (900 lines, self-recursive) |
| Range merge engine | sub_9AD220 | 1 | Coalescing with cost heuristics (700 lines) |
| Range construction | sub_9A5170 | 1 | Build ranges from def-use chains (750 lines) |

Allocator State Object Layout

Full reconstruction from the constructor sub_947150 (1088 lines), cross-referenced with the core allocator, per-class driver, entry point, and spill subsystem. The object is at least 1748 bytes (last initialized field at +1744). The constructor is called once per function before the allocation pipeline runs.

Header and Compilation Context (+0 -- +24)

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +0 | 8 | ptr | &off_21E1648 | Vtable pointer (strategy dispatch, 40+ virtual methods) |
| +8 | 8 | ptr | arg | Compilation context (parent object) |
| +16 | 8 | ptr | off_21DBEF8 | Secondary vtable (allocation sub-strategy) |
| +24 | 8 | ptr | ctx->func | Function object pointer (from ctx+16) |

Pre-Allocation Candidate Tables (+32 -- +443)

Arena-allocated hash tables for pre-assigned registers. Each table is a 3-QWORD header {base, size, capacity} plus an arena node (24 bytes, allocated from the function memory pool with an incrementing class tag).

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +32 | 8 | ptr | 0 | Pre-alloc candidate list A head |
| +40 | 8 | ptr | 0 | Pre-alloc candidate list B head |
| +48 | 4 | DWORD | 0 | Pre-alloc candidate count A |
| +56 -- +208 | 160 | -- | 0 | Per-class registration slots (6 x {ptr, ptr, DWORD} = 24B each) |
| +216 | 8 | ptr | 0 | Registration slots tail |
| +224 | 8 | ptr | alloc(24) | Exclusion set arena node (class tag = 1) |
| +232 | 8 | ptr | alloc(24) | Pre-alloc hash table A arena node (class tag = 2) |
| +240 | 8 | ptr | 0 | Pre-alloc hash table A: base pointer |
| +248 | 8 | ptr | 0 | Pre-alloc hash table A: count |
| +256 | 8 | ptr | 0 | Pre-alloc hash table A: capacity |
| +272 | 8 | ptr | alloc(24) | Pre-alloc hash table B arena node |
| +280 | 24 | -- | 0 | Pre-alloc hash table B: {base, count, capacity} |
| +312 | 8 | ptr | alloc(24) | Pre-alloc hash table C arena node |
| +320 | 24 | -- | 0 | Pre-alloc hash table C: {base, count, capacity} |
| +352 | 8 | ptr | alloc(24) | Exclusion set hash table arena node (class tag = 3) |
| +360 | 8 | ptr | 0 | Exclusion set: base pointer |
| +368 | 8 | ptr | 0 | Exclusion set: count |
| +376 | 8 | ptr | 0 | Exclusion set: capacity |
| +384 | 4 | DWORD | 0 | Exclusion set: element count |
| +392 | 8 | ptr | =+352 | Exclusion alias A (points to same node) |
| +400 | 24 | -- | 0 | Exclusion secondary: {base, count, capacity} |
| +424 | 4 | DWORD | 0 | Exclusion secondary: element count |
| +432 | 8 | ptr | =+352 | Exclusion alias B |
| +440 | 1 | BYTE | 0 | MAC force pre-alloc flag (RegAllocMacForce, knob 628) |
| +441 | 1 | BYTE | 0 | Coalescing enable flag (RegAllocCoalescing, knob 618) |
| +442 | 1 | BYTE | 0 | MAC vreg alloc order (RegAllocMacVregAllocOrder, knob 629) |
| +443 | 1 | BYTE | 0 | Per-class mode flag (set by vtable+296 callback) |

Per-Class Bitvector Sets (+448 -- +695)

An array of 6 bitvector set entries (one per allocatable register class, classes 1--6). Each entry is 40 bytes: a linked-list header {head, data, tail, count} (32 bytes) plus an arena node pointer (8 bytes). The arena nodes carry incrementing class tags (4, 6, 8, 10, 12, 14). The constructor loop starts at +456 and increments by 40 until +656.

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +448 | 8 | QWORD | 0 -> 6 | Bitvector set count (incremented in init loop) |
| +456 | 240 | array | -- | 6 x BitvectorSet (40B each): classes 1--6 |
| +696 | 24 | -- | 0 | Remat candidate list: {base, data, tail} |
| +720 | 4 | DWORD | 0 | Remat candidate list: count |
| +728 | 8 | ptr | alloc(24) | Remat candidate arena node (class tag = 2) |

Core Allocation State (+736 -- +872)

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +736 | 8 | ptr | 0 | Register linked list: secondary head |
| +744 | 8 | ptr | 0 | Register linked list head (main walk list for sub_957160) |
| +752 | 1 | BYTE | 0 | Register list initialized flag |
| +756 | 4 | DWORD | -1 | Hardware register limit (max physical regs, per-class) |
| +760 | 4 | DWORD | -1 | Secondary HW limit |
| +764 | 4 | DWORD | -1 | Pre-alloc constraint count |
| +776 | 8 | double | -1.0 | Spill cost threshold |
| +788 | 4 | DWORD | -1 | Best allocation result (reset to 0 per allocation round) |
| +792 | 1 | BYTE | 0 | Allocation-in-progress flag |
| +800 | 1 | BYTE | 0 | Retry-active flag |
| +808 | 4 | DWORD | (dynamic) | Live range interference state |
| +816 | 8 | ptr | (dynamic) | Live range secondary structure (4-byte DWORD array at +816) |
| +824 | 1 | BYTE | 0 | Pre-coloring done flag |
| +832 | 8 | ptr | 0 -> dyn | Per-function spill info array pointer |
| +840 | 8 | ptr | 0 -> dyn | Per-function spill info arena node |
| +848 | 8 | ptr | 0 | Spill info secondary |
| +856 | 8 | ptr | 0 | Spill info tertiary |
| +864 | 1 | BYTE | 0 | Bank conflict awareness flag |
| +865 | 1 | BYTE | 0 | Spill-already-triggered flag |
| +872 | 8 | ptr | 0 | Debug / trace output state |

Per-Class Register File Descriptors (+880 -- +1103)

An array of 7 register class descriptors (one per class 0--6), each 32 bytes. Indexed as alloc + 880 + 32 * class_id. The per-class driver (sub_971A90) accesses max_regs as a1[32 * class_id + 884] and base_offset as a1[32 * class_id + 880].

RegClassDesc (32 bytes):

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +0 | 4 | DWORD | 0 | Base register offset (first physical reg in class) |
| +4 | 4 | DWORD | -1 | Max regs / HW limit (set by vtable[896] init callback) |
| +8 | 4 | DWORD | 0 | Current allocation count |
| +12 | 1 | BYTE | 0 | Class active flag |
| +13 | 1 | BYTE | 0 | Class overflow flag |
| +14 | 1 | BYTE | 0 | Class spill flag |
| +15 | 1 | -- | -- | Padding |
| +16 | 4 | DWORD | 148 | Phase ID begin (148 = unset sentinel) |
| +20 | 4 | DWORD | 148 | Phase ID end (148 = unset sentinel) |
| +24 | 8 | QWORD | -1 | Class auxiliary link |

Concrete addresses:

| Class | Offset Range | Description |
|---|---|---|
| 0 (unified) | +880 -- +911 | Cross-class (skipped in main loop) |
| 1 (R) | +912 -- +943 | GPR 32-bit |
| 2 (R alt) | +944 -- +975 | GPR variant |
| 3 (UR) | +976 -- +1007 | Uniform GPR |
| 4 (UR ext) | +1008 -- +1039 | Uniform GPR variant |
| 5 (P/UP) | +1040 -- +1071 | Predicate registers |
| 6 (Tensor) | +1072 -- +1103 | Tensor / accumulator |

Extended Class Metadata (+1096 -- +1127)

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1096 | 8 | QWORD | -1 | Class 6 extended auxiliary link |
| +1104 | 8 | ptr | 0 | Extended class info: pointer A |
| +1112 | 8 | ptr | 0 | Extended class info: pointer B |
| +1120 | 4 | DWORD | 0 | Extended class info: count |

Per-Class Rematerialization Lists (+1128 -- +1271)

Six rematerialization candidate lists (one per allocatable class), each 24 bytes {ptr base, ptr data, DWORD count}. Initialized to zero. Populated before the allocation loop in sub_9721C0 for classes that support rematerialization.

| Class | Offset Range |
|---|---|
| 1 | +1128 -- +1147 |
| 2 | +1152 -- +1175 |
| 3 | +1176 -- +1199 |
| 4 | +1200 -- +1219 |
| 5 | +1224 -- +1243 |
| 6 | +1248 -- +1267 |

Coalescing / Live Range Lists (+1272 -- +1432)

Self-referential circular linked lists used for register coalescing and live range splitting. Each list has a sentinel structure where prev and next point into the list body.

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1272 | 8 | ptr | arg2 | Back-pointer to compilation context |
| +1280 | 8 | ptr | 0 | Coalesce list A: sentinel head |
| +1288 | 8 | ptr | self+1296 | Coalesce list A: prev (self-referential) |
| +1296 | 8 | ptr | self+1280 | Coalesce list A: next (circular) |
| +1304 | 8 | ptr | 0 | Coalesce list A: data |
| +1312 | 4 | DWORD | (checked) | Coalesce list A: count (bit 0 = non-empty flag) |
| +1320 | 8 | ptr | self+1296 | Coalesce list A: end marker |
| +1328 | 4 | DWORD | 2 | Coalesce list A: type tag |
| +1336 | 8 | ptr | alloc(24) | Coalesce list A: arena node |
| +1344 | 8 | ptr | 0 | Coalesce list B: sentinel head |
| +1352 | 8 | ptr | self+1360 | Coalesce list B: prev |
| +1360 | 8 | ptr | self+1344 | Coalesce list B: next |
| +1368 | 8 | ptr | 0 | Coalesce list B: data (bit 2 checked as ABI flag) |
| +1376 | 8 | ptr | self+1344 | Coalesce list B: tail |
| +1384 | 8 | ptr | self+1360 | Coalesce list B: end marker |
| +1392 | 4 | DWORD | 2 | Coalesce list B: type tag |
| +1400 | 8 | ptr | alloc(24) | Coalesce list B: arena node |
| +1408 | 8 | ptr | alloc(24) | Interference graph arena node (bit 1 = call-saved mode) |
| +1416 | 8 | ptr | 0 | Interference graph: base |
| +1424 | 8 | ptr | 0 | Interference graph: data (bit 7 checked in sub_97EC60) |
| +1432 | 8 | ptr | 0 | Interference graph: capacity |

Debug / Rematerialization Infrastructure (+1440 -- +1496)

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1440 | 8 | -- | (tree) | Remat exclusion set (tree root, queried via sub_99C5B0) |
| +1448 | 1 | BYTE | 0 | Remat exclusion: active flag (checked in sub_962840, sub_94E620) |
| +1452 | 4 | DWORD | 0 | Remat exclusion: instruction threshold |
| +1464 | 16 | OWORD | 0 | Remat exclusion: data block B |
| +1472 | 8 | ptr | 0 | Remat candidate: linked list (freed in sub_99D190) |
| +1480 | 16 | -- | 0 | Remat candidate list (iterated by sub_94BDF0) |
| +1488 | 4 | DWORD | 0 | Remat candidate: count (checked in sub_99C690) |
| +1496 | 8 | ptr | 0 | Remat candidate: root pointer |

Spill / Retry Control Block (+1504 -- +1594)

The core state for the NOSPILL / SPILL retry loop. Zeroed at allocation start, populated by the per-class driver (sub_971A90), read/written by the fat-point allocator (sub_957160).

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1504 | 4 | DWORD | 0 | Allocation mode (0=normal, 3=CSSA, 5=SMEM, 6=paired) |
| +1508 | 4 | DWORD | 0 | Spill attempt counter |
| +1512 | 4 | DWORD | 0 -> 44 | Spill instruction count (knob 635, default 44) |
| +1516 | 4 | DWORD | -1 | Budget lower bound |
| +1520 | 4 | DWORD | -1 | Budget lower bound secondary (part of 128-bit at +1516) |
| +1524 | 4 | DWORD | -1 | Register budget (from per-class desc max_regs) |
| +1528 | 4 | DWORD | (dynamic) | Peak register usage (copied from +1532 per round) |
| +1532 | 16 | __m128i | (global) | Strategy parameters (loaded from xmmword_21E17F0) |
| +1540 | 4 | DWORD | 0 | Secondary budget limit (knob 633) |
| +1544 | 4 | DWORD | 0 | Tertiary budget limit (knob 632) |
| +1548 | 4 | float | 4.0 | Spill cost multiplier (knob 680) |
| +1552 | 4 | DWORD | -1 | Rollback sentinel |
| +1556 | 4 | DWORD | -1 | Max regs aligned: (budget + 4) & ~3 |
| +1560 | 4 | DWORD | -1 | Best result sentinel |
| +1564 | 4 | DWORD | 0 | Current max assignment (zeroed per allocation round) |
| +1568 | 8 | double | 0.0 | Total spill cost accumulator (zeroed per round) |
| +1576 | 4 | DWORD | 0 | Spill event counter (zeroed per round) |
| +1580 | 4 | DWORD | (dynamic) | Effective budget: max(budget, SMEM_min) |
| +1584 | 4 | DWORD | (dynamic) | Adjusted budget (from vtable+256 callback) |

Mode Flags (+1588 -- +1594)

Knob-derived boolean flags controlling allocation strategy. When the function has more than one basic block (sub_7DDB50 > 1), flags +1588, +1589, +1590 are all forced to 1.

| Offset | Size | Type | Init | Knob | Field |
|---|---|---|---|---|---|
| +1588 | 1 | BYTE | 0 | 682 | Epoch-aware allocation mode |
| +1589 | 1 | BYTE | 0 | 683 | Paired-register allocation mode |
| +1590 | 1 | BYTE | 0 | 619 | SMEM spill enable |
| +1591 | 1 | BYTE | 0 | 627 | Bank-aware allocation |
| +1592 | 1 | BYTE | 0 | -- | Spill status / has-spilled flag |
| +1593 | 1 | BYTE | 1 | 636 | Precolor reuse (default enabled) |
| +1594 | 1 | BYTE | 1 | 649 | ABI compatibility (default enabled; cleared for small kernels) |

Budget Pressure Model (+1600 -- +1744)

Occupancy-aware register budget interpolation. Computes a dynamic register budget based on thread occupancy, using knob-derived coefficients and a linear interpolation model. The slope at +1736 is (coeffB - coeffC) / (maxOccupancy - minOccupancy), enabling the allocator to trade register count for occupancy.

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1600 | 8 | ptr | ctx[2]->+208 | Function object pair pointer |
| +1608 | 8 | ptr | 0 | Budget model: auxiliary pointer |
| +1616 | 8 | QWORD | 0xFFFFFFFF | Budget model: occupancy upper bound |
| +1624 | 4 | DWORD | 119 / knob | Max threads per block (default 119) |
| +1628 | 4 | DWORD | 160 / knob | Pressure threshold (default 160) |
| +1632 | 8 | double | 0.2 | Interpolation coefficient A (knob-overridable) |
| +1640 | 8 | double | 1.0 | Interpolation coefficient B (knob-overridable) |
| +1648 | 8 | double | 0.3 | Interpolation coefficient C (knob-overridable) |
| +1656 | 8 | double | (computed) | Total threads as double |
| +1664 | 8 | double | = coeff A | Interpolation point [0] |
| +1672 | 8 | double | (computed) | Interpolation point [1]: max_threads as double |
| +1680 | 8 | double | = coeff A | Interpolation point [2] |
| +1688 | 8 | double | (computed) | Interpolation point [3]: threshold as double |
| +1696 | 8 | double | = coeff A | Interpolation point [4] |
| +1704 | 8 | double | (computed) | Interpolation point [5]: 255 minus vtable result |
| +1712 | 8 | double | = coeff B | Interpolation point [6] |
| +1720 | 8 | double | (computed) | Linear model: x_min (thread count) |
| +1728 | 8 | double | = coeff C | Linear model: y_min |
| +1736 | 8 | double | (computed) | Linear model: slope |
| +1744 | 8 | ptr | 0 | Budget model: tail sentinel |
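
As a rough sketch of how such a linear model evaluates -- the function names and the clamping behavior are assumptions; only the endpoint coefficients and the slope formula were recovered:

```python
def budget_slope(coeff_b, coeff_c, max_occupancy, min_occupancy):
    # The value stored at +1736: rise over run between the two endpoints.
    return (coeff_b - coeff_c) / (max_occupancy - min_occupancy)

def interpolated_budget(threads, coeff_b, coeff_c, max_occ, min_occ):
    # Clamp outside the model's range, interpolate linearly inside it.
    if threads <= min_occ:
        return coeff_c
    if threads >= max_occ:
        return coeff_b
    slope = budget_slope(coeff_b, coeff_c, max_occ, min_occ)
    return coeff_c + (threads - min_occ) * slope
```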

Virtual Register Object Layout

| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Next pointer (linked list) |
| +12 | 4 | Register class index |
| +20 | 1 | Flags byte (bit 0x20 = live) |
| +36 | 8 | Alias chain (coalesced parent) |
| +40 | 4 | Spill cost (float, accumulated) |
| +48 | 8 | Flags qword (see below) |
| +64 | 4 | Register type (1=GPR, 3=pred, 9=barrier) |
| +68 | 4 | Physical assignment (-1 = unassigned) |
| +72 | 1 | Size byte (0 = scalar) |
| +76 | 4 | Secondary spill cost (float) |
| +80 | 4 | Spill flag (0 = not spilled, 1 = spilled) |
| +104 | 8 | Use chain head |
| +112 | 8 | Def chain |
| +128 | 8 | Next in linked-register chain |
| +144 | 8 | Constraint list |

Flag bits at +48:

| Bit | Mask | Meaning |
|---|---|---|
| 9 | 0x200 | Pre-assigned / fixed register |
| 10 | 0x400 | Coalesced source |
| 11 | 0x800 | Coalesced target |
| 14 | 0x4000 | Spill marker |
| 18 | 0x40000 | Needs-spill flag |
| 20--21 | -- | Pair mode (0=single, 1=lo-half, 3=double-width) |
| 22 | 0x400000 | Constrained to architecture limit |
| 23 | 0x800000 | Hi-half of pair |
| 27 | 0x8000000 | Special handling flag |

Key Knobs

87 OCG knobs (indices 613--699) control register allocation heuristics. The complete catalog with sub-category grouping is in Knobs System -- Register Allocation Knobs. The most important ones:

| Knob | Name | Type | Role |
|---|---|---|---|
| 381 | (not yet decoded) | -- | HoistInvariants policy: 0=always, 1=inner loops, 3=never |
| 487 | NumOptPhasesBudget | BDGT | Budget counter that gates ConvertMemoryToRegisterOrUniform |
| 618 | RegAllocCoalescing | bool | Enables register coalescing in the allocator |
| 623 | RegAllocEstimatedLoopIterations | STR | Loop iteration estimate hint for spill cost weighting |
| 628 | RegAllocMacForce | bool | Forces MAC-level pre-allocation path |
| 629 | RegAllocMacVregAllocOrder | INT | Vreg processing order during MAC allocation |
| 638 | RegAllocNoRetargetPrefs | bool | Disables retarget-preference optimization |
| 639 | RegAllocNumNonSpillTrials | INT | Non-spill allocation trials before allowing spills |
| 646 | RegAllocPrefMacOperands | INT | MAC operand preference level (1=read, 2=write, 3=both) |
| 684 | RegAllocThresholdForDiscardConflicts | INT | Interference discard threshold. Default 50 |
| 934 | UseNewLoopInvariantRoutineForHoisting | bool | Selects new LICM routine for HoistInvariants pre-pass |

Function Map

| Address | Lines | Role |
|---|---|---|
| sub_8FFDE0 | 119 | HoistInvariants entry |
| sub_905B50 | 1046 | Mem-to-reg candidate builder |
| sub_910840 | 327 | ConvertMemoryToRegisterOrUniform entry |
| sub_911030 | 2408 | Mem-to-reg analysis engine |
| sub_914B40 | 1737 | Post-promotion rewrite |
| sub_926A30 | 4005 | Fat-point interference builder |
| sub_947150 | 1088 | Allocator state constructor (initializes 1748-byte object) |
| sub_939BD0 | 65 | Spill allocator setup |
| sub_939CE0 | 23 | Register consumption counter |
| sub_93D070 | 155 | Best result recorder |
| sub_93ECB0 | 194 | Pre-assign registers |
| sub_93FBE0 | 940 | Spill slot assignment |
| sub_94A020 | 331 | Pre-allocation pass |
| sub_94E620 | 617 | Spill cost accumulator |
| sub_94F150 | 561 | Spill code generation |
| sub_94FDD0 | 155 | Register assignment + alias propagation |
| sub_950100 | 205 | Pre-allocated candidate applier |
| sub_957160 | 1658 | Core fat-point allocator |
| sub_9539C0 | 1873 | Shared-memory spill allocator |
| sub_95A350 | 1390 | Cost / benefit evaluator |
| sub_95BC90 | 1250 | Allocation retry / refinement |
| sub_95DC10 | 2738 | Multi-class ABI-aware driver |
| sub_9680F0 | 3722 | Per-instruction assignment core loop |
| sub_96D940 | 2983 | Spill guidance (7-class priority queues) |
| sub_971A90 | 355 | NOSPILL / SPILL retry driver |
| sub_9721C0 | 1086 | Register allocation entry point |
| sub_991790 | 2677 | Pre-coloring pass |
| sub_9A5170 | 750 | Live range construction |
| sub_9AD220 | 700 | Live range merge / coalescing engine |
| sub_9AEF60 | 900 | Live range splitting |
| sub_9B1200 | 800 | Register coalescing / copy elimination |

Fat-Point Allocation Algorithm

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas register allocator uses a fat-point greedy algorithm. For each virtual register, it scans a per-physical-register pressure array, picks the slot with the lowest interference count, and commits the assignment. There is no graph coloring, no simplify-select-spill loop, and no Chaitin-style worklist -- just two 512-DWORD pressure histograms and a linear scan. This page documents the algorithm in full detail: pressure array construction, constraint evaluation, register selection, assignment propagation, the retry loop, and the supporting knobs.

| Component | Function(s) |
|---|---|
| Core allocator | sub_957160 (1658 lines) -- fat-point coloring engine |
| Occupancy bitvector | sub_957020 -- resizes bitvector; sub_94C9E0 -- marks slot ranges |
| Interference builder | sub_926A30 (4005 lines) -- constraint solver |
| Assignment | sub_94FDD0 (155 lines) -- write physical reg, propagate aliases |
| Pre-allocation | sub_94A020 (331 lines) -- pre-assign high-priority operands |
| Retry driver | sub_971A90 (355 lines) -- NOSPILL then SPILL retry loop |
| Best result recorder | sub_93D070 (155 lines) -- compare and keep best attempt |
| Entry point | sub_9721C0 (1086 lines) -- per-function allocation driver |

Pressure Array Construction

The core allocator (sub_957160) allocates two stack-local arrays at the start of each allocation round. Each array is 2056 bytes: 512 DWORDs (2048 bytes) of pressure data plus a 2-DWORD sentinel.

| Array | Variable | Role |
|---|---|---|
| Primary | v12 | Per-physical-register interference count. Lower is better. |
| Secondary | v225 | Per-physical-register secondary cost. Breaks ties when primary values are equal. |

Both arrays are zeroed using SSE2 vectorized _mm_store_si128 loops aligned to 16-byte boundaries. The zeroing loop processes 128 bits (4 DWORDs) per iteration, covering the 512 DWORDs in 128 iterations.

For each virtual register in the allocation worklist (linked list at alloc+744), the allocator zeroes the pressure arrays and then walks the VR's constraint list (vreg+144). For each constraint, it increments the appropriate pressure array entries at the physical register slots that conflict with the current virtual register. The result is a histogram: primary[slot] holds the total interference weight for physical register slot, accumulated over all constraints of all previously-assigned virtual registers that conflict with the current one. The full per-VR algorithm is documented in the Pressure Computation Algorithm section below.

The secondary array accumulates a separate cost metric used for tie-breaking. It captures weaker interference signals -- preferences and soft constraints that do not represent hard conflicts but indicate suboptimal placement.
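
In miniature, the histogram-then-select structure looks like this -- a toy model, not a transcription of sub_957160 (constraint walking, occupancy filtering, and pairing are all omitted):

```python
def pick_slot(num_slots, hard_conflicts, soft_costs):
    # hard_conflicts / soft_costs: lists of (slot, weight) pairs.
    primary = [0] * num_slots        # hard interference histogram
    secondary = [0] * num_slots      # tie-breaking cost histogram
    for slot, weight in hard_conflicts:
        primary[slot] += weight
    for slot, weight in soft_costs:
        secondary[slot] += weight
    # Lowest (primary, secondary) wins; remaining ties go to the lowest slot.
    return min(range(num_slots), key=lambda s: (primary[s], secondary[s], s))
```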

Budget Computation

Before the pressure scan begins, the allocator computes the maximum physical register count for the current class:

v231 = hardware_limit + 7                   // alloc+756, with headroom
if allocation_mode == 6 (CSSA paired):
    v231 *= 4                               // quad range for paired allocation
elif allocation_mode == 3 (CSSA):
    v231 *= 2                               // doubled range
alloc.budget = v231                         // stored at alloc+60

The hardware limit comes from the target descriptor and reflects the physical register file size for the current class (e.g. 255 for GPRs, 7 for predicates). The +7 headroom allows the allocator to explore slightly beyond the architectural limit before triggering a hard failure -- this is clamped during assignment by the register budget check in sub_94FDD0.

The register budget at alloc+1524 interacts with --maxrregcount and --register-usage-level (values 0--10). The CLI-specified maximum register count is stored in the compilation context and propagated to the allocator as the hard ceiling. The register-usage-level option modulates the target: level 0 means no restriction, level 10 means minimize register usage as aggressively as possible. The per-class register budget stored at alloc+32*class+884 reflects this interaction.

Occupancy Bitvector

After computing the budget, the allocator initializes an occupancy bitvector (sub_957020 + sub_94C9E0) that tracks which physical register slots are already assigned. The bitvector is sized to ceil(budget / 64) 64-bit words. For each VR being allocated, sub_94C9E0 sets bits covering the VR's footprint in the bitvector using a word-level OR with computed masks:

function mark_occupancy(bitvec, range, alignment):
    lo_word = alignment >> 6
    hi_word = min(alignment + 64, range) >> 6
    hi_mask = ~(0xFFFFFFFFFFFFFFFF >> (64 - (range & 63)))
    lo_mask = 0xFFFFFFFFFFFFFFFF >> (~alignment & 63)

    for word_idx in lo_word .. hi_word:
        mask = 0xFFFFFFFFFFFFFFFF
        if word_idx == hi_word:  mask &= hi_mask
        if word_idx == lo_word:  mask &= lo_mask
        bitvec[word_idx] |= mask

During the fatpoint scan, a set bit means "slot occupied -- skip it." This prevents the allocator from considering slots already committed to other VRs in the current round.
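
A simplified word-level marker illustrates the technique -- deliberately not a transcription of sub_94C9E0's recovered mask arithmetic; this version just sets bits [lo, hi) across an array of 64-bit words with one OR per word:

```python
FULL = (1 << 64) - 1  # one 64-bit word of all-ones

def mark_occupancy(bitvec, lo, hi):
    # Set bits [lo, hi) across 64-bit words, masking the partial end words.
    for word_idx in range(lo >> 6, (hi + 63) >> 6):
        word_lo = word_idx << 6
        mask = FULL
        if lo > word_lo:                      # partial word at the low end
            mask &= (FULL << (lo - word_lo)) & FULL
        if hi < word_lo + 64:                 # partial word at the high end
            mask &= (1 << (hi - word_lo)) - 1
        bitvec[word_idx] |= mask

def is_occupied(bitvec, slot):
    return bool((bitvec[slot >> 6] >> (slot & 63)) & 1)
```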

Pressure Computation Algorithm

The per-VR pressure computation is the core of the fat-point allocator. For each unassigned virtual register, the allocator builds a fresh pressure histogram, selects the minimum-cost physical register slot, and commits the assignment. The algorithm has seven steps, all executed inside the main loop of sub_957160 (lines 493--1590 of the decompiled output).

Step 1: VR Geometry

For each VR, the allocator computes the physical register footprint via sub_7DAFD0:

function aligned_width(vreg):
    stride = 1 << vreg.alignment                // vreg+72, uint8
    size   = vreg.width                          // vreg+74, uint16
    return (-stride) & (stride + size - 1)       // = ceil(size / stride) * stride

| stride | size | result | Meaning |
|---|---|---|---|
| 1 | 1 | 1 | Single register |
| 1 | 2 | 2 | Unaligned pair |
| 2 | 2 | 2 | Aligned pair |
| 2 | 3 | 4 | Aligned quad (rounded up) |
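
The mask expression is the standard round-up-to-multiple trick and can be checked directly against the recovered formula:

```python
def aligned_width(size, log2_stride):
    # Round size up to a multiple of stride = 1 << log2_stride, exactly as
    # the recovered expression in sub_7DAFD0 computes it.
    stride = 1 << log2_stride
    return (-stride) & (stride + size - 1)
```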

The pair mode is extracted from (vreg+48 >> 20) & 3:

| pair_mode | step_size | Behavior |
|---|---|---|
| 0 | stride | Normal single-width scan |
| 1 | stride | Paired mode -- is_paired = 1, scan ceiling doubled |
| 3 | 2 * stride | Double-width -- aligned width doubled, step by 2x stride |

Step 2: Scan Range

The scan range defines which physical register slots are candidates:

function compute_scan_range(alloc, vreg):
    max_slot = alloc.budget                          // alloc+1524
    if vreg.flags & 0x400000:                        // per-class ceiling override
        class_limit = alloc.class_limits[alloc.current_class]
        if max_slot > class_limit:
            max_slot = class_limit

    ceiling = ((max_slot + 1) << is_paired) - 1
    start   = vtable[320](alloc, vreg, bitvec)       // class-specific start offset
    alignment = alloc.scan_alignment << is_paired     // alloc+1556
    scan_width = alloc.slot_count + 4                 // alloc+1584 + 4

    return (start, ceiling, scan_width, alignment)

The +4 on scan_width provides padding beyond the register file limit. For pair modes, the ceiling is shifted left: double for is_paired, quad for pair_mode 3 with the 0x40 flag at ctx+1369.

Step 3: Zero Pressure Arrays

Before accumulating interference for this VR, both arrays are zeroed over scan_width DWORDs:

function zero_pressure(primary[], secondary[], scan_width):
    if scan_width > 14 and arrays_dont_overlap:
        // SSE2 vectorized path: zero 4 DWORDs per iteration
        for i in 0 .. scan_width/4:
            _mm_store_si128(&primary[4*i],  zero_128)
            _mm_store_si128(&secondary[4*i], zero_128)
        // scalar cleanup for remainder
    else:
        // scalar path
        for i in 0 .. scan_width:
            primary[i]   = 0
            secondary[i] = 0

The SSE2 path has a non-overlap guard (secondary >= primary + 16 || primary >= secondary + 16) to ensure the vectorized stores do not alias. The scalar path is used for narrow scan ranges (width <= 14).

Step 4: Constraint Walk

The allocator iterates the constraint list at vreg+144. For VRs with alias chains (coalesced registers via vreg+32), the walk processes constraints for the entire chain, accumulating pressure from all aliases into the same arrays. Each constraint node is a 24-byte structure:

| Offset | Type | Field |
|---|---|---|
| +0 | pointer | Next constraint (linked list) |
| +8 | int32 | Constraint type (0--15) |
| +12 | int32 | Target VR index or physical register |
| +16 | int32 | Weight (interference cost) |
| +20 | uint8 | Soft flag (skip in later iterations) |

The constraint type dispatches to different accumulation patterns:

function accumulate_pressure(primary[], secondary[], constraint_list, scan_width,
                             base_offset, pair_mode, half_width_mode, iteration):
    soft_count = 0
    for node in constraint_list:
        // --- Soft constraint relaxation (iteration > 0) ---
        if node.soft_flag and iteration > 0:
            soft_count++
            skip_threshold = iteration * knob_weight    // OCG knob at +46232
            if soft_count <= skip_threshold:
                continue                                // relax this constraint
            if bank_aware and soft_count > relaxation_ceiling:
                continue

        type   = node.type
        target = node.target
        weight = node.weight

        switch type:
            case 0:  // Point interference
                phys = lookup_vreg(target).physical_reg
                if phys < 0: continue                   // target unassigned
                offset = phys - base_offset
                if offset < 0 or offset >= scan_width: continue
                if half_width_mode:
                    offset = 2 * offset + hi_half_bit(target)
                primary[offset] += weight

            case 1:  // Exclude-one
                if half_width_mode:
                    offset = 2 * offset + hi_half_bit(target)
                for slot in 0 .. scan_width:
                    if slot != offset:
                        primary[slot] += weight

            case 2:  // Exclude-all-but (target is the only allowed slot)
                for slot in 0 .. scan_width:
                    if slot != target:
                        primary[slot] += weight

            case 3:  // Below-point (same as exclude-all-but, downward liveness)
                for slot in 0 .. scan_width:
                    if target > slot:
                        primary[slot] += weight

            case 5:  // Paired-low (even slots only)
                primary[offset] += weight
                // For pair_mode 3: also primary[offset+1] += weight

            case 6:  // Paired-high (odd slots only)
                primary[offset + 1] += weight

            case 7:  // Aligned-pair (both halves)
                primary[offset] += weight
                primary[offset + 1] += weight

            case 8:  // Phi-related (parity-strided accumulation)
                parity = compute_phi_parity(target, vreg)
                for slot in parity .. scan_width step 2:
                    primary[slot] += weight

            case 11: // Paired-even-parity
                for slot in 0 .. scan_width:
                    if slot != offset:
                        primary[slot] += weight

            case 12: // Paired-odd-parity
                inverse = compute_odd_inverse(offset, pair_mode)
                for slot in 0 .. scan_width:
                    if slot != inverse:
                        primary[slot] += weight

            case 13: // Paired-parity-group (even-only exclusion)
                if offset & 1: continue
                for slot in 0 .. scan_width:
                    if slot != offset + 1:
                        primary[slot] += weight

            case 14: // Paired-parity-extended (odd-only exclusion)
                if !(offset & 1): continue
                for slot in 0 .. scan_width:
                    if slot != offset - 1:
                        primary[slot] += weight

            case 15: // Range (SECONDARY array)
                range_end = min(offset, scan_width)
                // SSE2 vectorized: broadcast weight, add to secondary[0..range_end]
                for slot in 0 .. range_end:                 // vectorized
                    secondary[slot] += weight
                // Tail: slots beyond (range_end + pair_width)
                for slot in range_end .. scan_width:
                    if slot >= offset + pair_width:
                        secondary[slot] += weight

            default: // Types 4, 9, 10 and custom extensions
                vtable[240](alloc, primary, 514, node, scan_width, offset, pair_flag)

Type 15 (range) is the only constraint type that writes to the secondary array. All others write to primary. This is the architectural decision that makes secondary a pure tie-breaker: it captures long-range preference signals while primary captures hard interference.

SSE2 Vectorization in Constraint Walk

Three inner loops use SSE2 intrinsics:

  1. Type 0 (point) with large width: _mm_add_epi32 adds the broadcast weight to 4 primary slots per iteration. Alignment pre-loop handles the first 1--3 slots to reach 16-byte alignment.

  2. Type 15 (range) secondary accumulation: _mm_shuffle_epi32(_mm_cvtsi32_si128(weight), 0) broadcasts the weight to all 4 lanes. The vectorized loop processes 4 secondary slots per iteration with _mm_add_epi32(_mm_load_si128(...), broadcast).

  3. Type 8 (phi) stride-2 accumulation: Uses _mm_shuffle_ps with mask 136 (0b10001000) to extract every-other element, then _mm_add_epi32 to accumulate. This implements stride-2 addition across the primary array.
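
In scalar terms, the two distinctive vectorized patterns compute the following (reference semantics only -- the SSE2 versions add four lanes per iteration and handle 16-byte alignment separately):

```python
def accumulate_range(secondary, range_end, weight):
    # Type 15: broadcast-add `weight` to secondary[0 .. range_end)
    for slot in range(range_end):
        secondary[slot] += weight
    return secondary

def accumulate_phi(primary, parity, weight):
    # Type 8: stride-2 add to every other primary slot, starting at `parity`
    for slot in range(parity, len(primary), 2):
        primary[slot] += weight
    return primary
```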

Step 5: Iteration-Dependent Constraint Relaxation

On retry iterations (iteration > 0), the allocator progressively relaxes soft constraints to reduce pressure:

function should_skip_soft_constraint(soft_count, iteration, knob_weight,
                                     knob_ceiling, bank_aware):
    threshold = iteration * knob_weight           // more skipped each retry
    if soft_count <= threshold:
        return true                               // skip (relax)
    if !bank_aware:
        return true
    if soft_count > (total_soft - iteration * knob_ceiling):
        return true                               // beyond ceiling
    return false

The relaxation formula means: on iteration N, the first N * knob_weight soft constraints are ignored. The knob_ceiling parameter (OCG knob at offset +46304) controls how aggressively the tail is also relaxed. This trades bank-conflict quality for register pressure reduction, allowing the allocator to find assignments that fit within the budget on later retries.
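
The number of relaxed constraints per retry then reduces to a one-liner (a restatement of the formula above; the explicit cap at the total soft-constraint count is an assumption):

```python
def relaxed_count(iteration, knob_weight, total_soft):
    # On retry iteration N, the first N * knob_weight soft constraints are
    # skipped; it can never exceed the soft constraints actually present.
    return min(iteration * knob_weight, total_soft)
```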

Step 6: Fatpoint Selection (Minimum Scan)

After pressure accumulation, the allocator scans for the physical register slot with the lowest cost:

function select_fatpoint(primary[], secondary[], start, ceiling, budget,
                         step_size, shift, occupancy_bv, threshold,
                         first_pass, bank_mode, bank_mask, prev_assignment):
    best_slot     = start
    best_primary  = 0
    best_secondary = 0

    // --- Pre-scan threshold check (first pass only) ---
    if first_pass:
        for slot in start .. ceiling step step_size:
            if occupancy_bv[slot]: continue        // already occupied
            if primary[slot >> shift] > threshold:  // knob 684, default 50
                first_pass = false                  // congestion detected
                break

    // --- Main scan ---
    slot = start
    while slot < budget and slot < alloc.slot_count << is_paired:
        // Occupancy filter
        if slot < bv_size and occupancy_bv[slot]:
            slot += step_size; continue

        p = primary[slot >> shift]
        s = secondary[slot >> shift]

        if prev_assignment >= 0:                    // not first VR
            if first_pass:
                if s >= best_secondary:
                    slot += step_size; continue     // secondary-only comparison
            else:
                if p > best_primary:
                    slot += step_size; continue
                if p == best_primary and s >= best_secondary:
                    slot += step_size; continue

        // Bank conflict filter
        if bank_mode and ((slot ^ prev_assignment) & bank_mask) == 0:
            slot += step_size; continue             // same bank → skip

        // Ceiling check
        if slot > ceiling:
            best_slot = slot; break                 // accept (over ceiling)

        // Accept this slot
        if slot < scan_width_shifted:
            best_secondary = secondary[slot >> shift]
            best_primary   = primary[slot >> shift]
            if best_secondary == 0 and (best_primary == 0 or first_pass):
                best_slot = slot; break             // zero-cost → immediate accept
            best_slot = slot
            slot += step_size; continue

        best_slot = slot
        best_primary = 0
        break

    return best_slot

Key design decisions in the fatpoint scan:

Two-mode comparison. On the first pass (first_pass = true, iteration 0), the scan uses secondary cost as the sole criterion, ignoring primary. This makes the first attempt pure-affinity-driven: it places VRs at their preferred locations based on copy/phi hints in the secondary array. On subsequent passes, primary cost dominates and secondary breaks ties.

Immediate zero-cost accept. When a slot has both secondary == 0 and primary == 0 (or just secondary == 0 on the first pass), the scan terminates immediately. This means the first zero-interference slot wins -- no further searching. Combined with the priority ordering of VRs, this produces a fast, greedy assignment.

Bank-conflict avoidance. The bank mask (-8 for pair mode 1, -4 otherwise) partitions the register file into banks. The filter ((slot ^ prev_assignment) & mask) == 0 ensures consecutive assignments land in different banks, reducing bank conflicts in the SASS execution units.
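The bank filter can be modeled in a few lines. This is an illustrative sketch of the recovered predicate, not the binary's code; the function name is hypothetical:

```python
def same_bank(slot, prev_assignment, pair_mode):
    """Model of the bank-conflict filter: mask -8 groups slots into
    banks of 8 (pair mode 1), mask -4 into banks of 4 otherwise.
    True means the candidate shares a bank with the previous
    assignment and is skipped by the scan."""
    bank_mask = -8 if pair_mode == 1 else -4
    return ((slot ^ prev_assignment) & bank_mask) == 0

# Slots 0..7 share a bank under mask -8; slot 8 starts the next bank.
assert same_bank(3, 5, pair_mode=1)        # both in bank [0..7] -> skipped
assert not same_bank(3, 9, pair_mode=1)    # different banks -> allowed
```

Because the mask is a negative power of two, the XOR-and-mask test compares only the high bits of the two slot numbers, i.e. their bank indices.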

Occupancy bitvector filtering. The bitvector provides O(1) per-slot filtering of already-assigned registers. Bits are set by sub_94C9E0 for each committed assignment, preventing the scan from considering occupied slots.

Step 7: Commit and Advance

The selected slot is committed via sub_94FDD0:

alloc.cumulative_pressure += best_primary       // alloc+788
sub_94FDD0(alloc, ctx, iteration, vreg, &local_state, best_slot, best_primary)
vreg = vreg.next                                 // vreg+128, advance worklist

The cumulative pressure counter at alloc+788 tracks the total interference weight across all VR assignments in this attempt. The retry driver uses this to compare attempts.

End-of-Round Result

After all VRs are processed, the allocator computes the result (lines 1594--1641):

peak_usage = alloc+1580                          // max physical register used
class_slot = ctx+1584 + 4 * mode + 384
*class_slot = peak_usage

if peak_usage > 0x989677 + 6:                    // sanity threshold (~10M)
    emit_error("Register allocation failed with register count of '%d'."
               " Compile the program with a higher register target",
               alloc+1524 + 1)

return peak_usage + 1                            // number of registers used

The return value feeds into the retry driver's comparison: target >= result means success (the allocation fits within the register budget).

Constraint Types

The fat-point interference builder (sub_926A30, 4005 lines) processes constraints attached to each virtual register. Constraints are extracted from instruction operand descriptors encoded as 32-bit values: bits 28--30 encode the operand type, bits 0--23 encode the register index, bit 24 is the pair extension bit, and bit 31 is a sign/direction flag.

The builder recognizes 15 constraint types. Each constraint type adds interference weight to specific physical register slots in the pressure arrays:

| Type | Name | Pressure effect |
|------|------|-----------------|
| 0 | Point interference | Adds weight to specific physical register slots that are live at the same program point as this VR. The most common constraint -- represents a simple "these two VRs cannot share a physical register because both are live at instruction I." |
| 1 | Exclude-one | Adds weight to exactly one physical register slot, excluding it from consideration. Used when a specific physical register is reserved (e.g. for ABI constraints or hardware requirements). |
| 2 | Exclude-all-but | Adds weight to all slots except one. Forces the VR into a single permitted physical register. Used for fixed-register operands (e.g. R0 for return values). |
| 3 | Below-point | Adds interference weight for registers live below (after) the current program point. Captures downward-exposed liveness -- the VR must avoid physical registers that are used by later instructions. |
| 4 | (reserved) | Not observed in common paths. |
| 5 | Paired-low | Constrains the VR to an even-numbered physical register. Used for the low half of a 64-bit register pair. The pressure builder increments only even-indexed slots. |
| 6 | Paired-high | Constrains the VR to an odd-numbered physical register (the slot immediately after its paired-low partner). Increments only odd-indexed slots. |
| 7 | Aligned-pair | Constrains a pair of VRs to consecutive even/odd physical registers simultaneously. Combines the effects of types 5 and 6. |
| 8 | Phi-related | Marks interference from CSSA phi instructions (opcode 195). Phi constraints are softer -- they add lower weight because the phi can potentially be eliminated by the coalescing pass. |
| 9 | (reserved) | Not observed in common paths. |
| 10 | (reserved) | Not observed in common paths. |
| 11 | Paired-even-parity | Constrains the VR to a physical register whose index has even parity with respect to a bank partition. Used for bank-conflict avoidance on architectures where register bank is determined by reg_index % N. |
| 12 | Paired-odd-parity | Constrains to odd parity within the bank partition. |
| 13 | Paired-parity-group | Constrains a group of VRs to compatible parity assignments across a bank. |
| 14 | Paired-parity-extended | Extended variant of parity constraints for wider register groups (quads). |
| 15 | Range | Adds interference over an interval of program points rather than a single point. Represents a VR whose live range spans multiple instructions and conflicts with another VR whose live range overlaps. The weight is proportional to the overlap length. |

The builder uses FNV-1a hashing (seed 0x811C9DC5, prime 16777619) for hash-table lookups into the pre-allocation candidate table. It contains SSE2-vectorized inner loops (_mm_add_epi64) for bulk interference weight accumulation when processing large constraint lists. The builder dispatches through 7+ vtable entries for OCG knob queries that modulate constraint weights.
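The hash parameters recovered from the binary (seed 0x811C9DC5, prime 16777619) are the standard 32-bit FNV-1a constants. A minimal reference implementation, for checking recovered hash values against known keys:

```python
FNV_OFFSET_BASIS = 0x811C9DC5   # seed recovered from the binary
FNV_PRIME = 16777619            # 0x01000193

def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the
    prime, truncating to 32 bits. Matches the constants above."""
    h = FNV_OFFSET_BASIS
    for b in data:
        h ^= b
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h

# Hashing no bytes returns the offset basis unchanged.
assert fnv1a_32(b"") == FNV_OFFSET_BASIS
```

How ptxas serializes the candidate key (VR ID, class, etc.) into bytes before hashing is not recovered here; this only pins down the hash core.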

Constraint List Structure

Each virtual register carries a constraint list at vreg+144. The list is a linked chain of constraint nodes, each containing:

  • Constraint type (one of the 15 types above)
  • Target VR or physical register index
  • Weight (integer, typically 1 for hard constraints, lower for soft)
  • Program point or interval (for types 0, 3, 15)
  • Pair/alignment specification (for types 5--7, 11--14)

The interference builder iterates this list for every VR being assigned, accumulating weights into the pressure arrays. The total cost of assignment to slot S is the sum of all constraint weights that map to S.

Register Selection

After the pressure arrays are populated for a given VR, the allocator scans physical register candidates and selects the one with minimum cost:

function select_register(primary[], secondary[], maxRegs, threshold, pair_mode):
    best_slot = -1
    best_primary_cost = MAX_INT
    best_secondary_cost = MAX_INT

    stride = 1
    if pair_mode != 0:
        stride = 2 << shift              // 2 for pairs, 4 for quads

    for slot in range(0, maxRegs, stride):
        if primary[slot] > threshold:     // knob 684, default 50
            continue                      // skip congested slots

        p = primary[slot]
        s = secondary[slot]

        if p < best_primary_cost:
            best_slot = slot
            best_primary_cost = p
            best_secondary_cost = s
        elif p == best_primary_cost and s < best_secondary_cost:
            best_slot = slot
            best_secondary_cost = s

    return best_slot                      // -1 if nothing found

Key design decisions in the selection loop:

Threshold filtering. The interference threshold (OCG knob 684, default 50) acts as a congestion cutoff. Any physical register slot with total interference weight above this value is immediately skipped. This prevents the allocator from assigning a VR to a slot that would cause excessive register pressure, even if that slot happens to be the global minimum. The threshold trades a small increase in the number of spills for a significant improvement in allocation quality -- high-interference slots tend to require cascading reassignments.

Alignment stride. For paired registers (pair mode 1 or 3 in vreg+48 bits 20--21), the scan steps by 2 instead of 1, ensuring the VR lands on an even-numbered slot. For quad-width registers, the stride is 4. The shift amount comes from the register class descriptor and varies by allocation mode.

Two-level tie-breaking. When two candidates have equal primary cost, the secondary array breaks the tie. This provides a smooth gradient for the allocator to follow when the primary interference picture is flat. The secondary array typically captures weaker signals like register preference hints, pre-allocation suggestions, and copy-related affinities.

No backtracking. The selection is final once made. There is no local search, no Kempe-chain swapping, and no reassignment of previously-colored VRs. If the selection leads to a spill later, the retry loop (see below) handles it by rerunning the entire allocation with updated spill guidance.
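The selection loop above is simple enough to model executably. This sketch reproduces the threshold filter and two-level tie-breaking (field names and the flat-array calling convention are illustrative):

```python
def select_register(primary, secondary, max_regs, threshold=50, stride=1):
    """Executable model of the selection loop: skip slots whose primary
    cost exceeds the congestion threshold (knob 684, default 50), then
    take the (primary, secondary) lexicographic minimum.
    Returns -1 if no slot passes the filter."""
    best_slot = -1
    best_p = best_s = float("inf")
    for slot in range(0, max_regs, stride):
        p, s = primary[slot], secondary[slot]
        if p > threshold:                      # congestion cutoff
            continue
        if p < best_p or (p == best_p and s < best_s):
            best_slot, best_p, best_s = slot, p, s
    return best_slot

# Slot 1 has low secondary cost but exceeds the threshold, so the scan
# decides between slots 0 and 2 on the secondary tie-break.
primary   = [5, 60, 5, 9]
secondary = [7,  0, 3, 1]
assert select_register(primary, secondary, max_regs=4) == 2
```

Note the no-backtracking property: a single forward pass, no revisiting of earlier choices.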

Assignment: sub_94FDD0

Once a physical register slot is selected, sub_94FDD0 (155 lines) commits the assignment. This function handles four cases:

Case 1: Normal Assignment

The physical register number is written to vreg+68. The register consumption counter (sub_939CE0, 23 lines) computes how many physical slots this VR occupies:

consumption = slot + (1 << (pair_mode == 3)) - 1

For single registers, this is just slot. For double-width pairs (pair_mode 3), it is slot + 1, consuming two consecutive physical registers. The peak usage trackers at alloc+1528 and alloc+1564 are updated if consumption exceeds the current maximum.
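The consumption formula can be checked directly. A one-line model of sub_939CE0 (illustrative naming):

```python
def consumption(slot, pair_mode):
    """Model of sub_939CE0: highest physical slot index consumed by a VR
    assigned at `slot`. pair_mode == 3 denotes a double-width pair that
    occupies slot and slot+1; any other mode occupies slot alone."""
    return slot + (1 << (pair_mode == 3)) - 1

assert consumption(10, pair_mode=0) == 10   # single register: just R10
assert consumption(10, pair_mode=3) == 11   # pair: R10 and R11
```

The boolean comparison folds the two cases into one expression: `1 << 0 - 1 = 0` added for singles, `1 << 1 - 1 = 1` added for pairs.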

Case 2: Predicate Half-Width

For predicate registers (class 2, type 3), the allocator performs a half-width division. The physical slot is divided by 2, and the odd/even bit is stored at vreg+48 bit 23 (the 0x800000 flag):

physical_reg = slot / 2
if slot is odd:
    vreg.flags |= 0x800000    // hi-half of pair
else:
    vreg.flags &= ~0x800000   // lo-half of pair

This maps two virtual predicate registers to one physical predicate register, since NVIDIA's predicate register file supports sub-register addressing (each physical predicate holds two 1-bit values).
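A small model of the half-width division (the flag constant is from vreg+48 bit 23; the function name is hypothetical):

```python
HI_HALF = 0x800000  # vreg+48 bit 23

def assign_predicate(slot, flags=0):
    """Model of the predicate half-width mapping: two virtual predicate
    slots map onto one physical predicate; the odd/even bit of the
    virtual slot selects the hi/lo half via flag bit 23."""
    physical = slot // 2
    if slot & 1:
        flags |= HI_HALF      # odd slot -> hi half of the pair
    else:
        flags &= ~HI_HALF     # even slot -> lo half of the pair
    return physical, flags

assert assign_predicate(4) == (2, 0)          # even -> lo half of P2
assert assign_predicate(5) == (2, HI_HALF)    # odd  -> hi half of P2
```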

Case 3: Over-Budget / Spill Trigger

If slot >= regclass_info.max_regs and the VR is not already marked as spilled (flag 0x4000 at vreg+48), the allocator sets the needs-spill flag:

vreg.flags |= 0x40000         // needs-spill flag (bit 18)

When the needs-spill flag is later detected, the allocator calls:

  1. sub_939BD0 -- spill allocator setup (selects bucket size, alignment, max based on knob 623 and cost threshold at alloc+776)
  2. sub_94F150 -- spill code generation (561 lines, emits spill/reload instructions)

The spill cost is accumulated:

alloc.total_spill_cost += vreg.spill_cost     // double at alloc+1568
alloc.secondary_cost   += vreg.secondary_cost  // float at alloc+1576

Case 4: Alias Chain Propagation

After writing the physical register, the function follows the alias chain at vreg+36 (coalesced parent pointer). Every VR in the chain receives the same physical assignment:

alias = vreg.alias_parent                    // vreg+36
while alias != NULL:
    alias.physical_register = slot           // alias+68
    alias = alias.alias_parent               // alias+36

This propagation ensures that coalesced registers (merged by the coalescing pass at sub_9B1200) share a single physical register without requiring the allocator to re-derive the relationship.
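The chain walk is a plain linked-list traversal. A self-contained model (the class is a stand-in for the VR record; field names mirror the recovered offsets):

```python
class VReg:
    """Minimal stand-in for the VR record: physical_register models
    vreg+68, alias_parent models the coalesced-parent pointer at vreg+36."""
    def __init__(self, alias_parent=None):
        self.physical_register = -1
        self.alias_parent = alias_parent

def commit(vreg, slot):
    """Write the physical register, then propagate the same assignment
    up the alias chain so all coalesced VRs agree."""
    vreg.physical_register = slot
    alias = vreg.alias_parent
    while alias is not None:
        alias.physical_register = slot
        alias = alias.alias_parent

root = VReg()
mid  = VReg(alias_parent=root)
leaf = VReg(alias_parent=mid)
commit(leaf, 7)
assert (root.physical_register, mid.physical_register) == (7, 7)
```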

Pre-Allocated Candidate Check

Before committing a normal assignment, sub_94FDD0 calls sub_950100 (205 lines) to check if the VR has a pre-allocated candidate in the hash table at alloc+248. If a candidate exists (FNV-1a keyed lookup), the pre-assigned physical register is used instead of the one selected by the pressure scan. For paired registers, the pre-assigned slot is doubled (type 1 -> slot * 2) to account for pair stride.

Pre-Allocation Pass: sub_94A020

Before the core allocator runs, the pre-allocation pass (331 lines) optionally assigns physical registers to high-priority operands. This pass is gated by three knobs:

| Knob | Role |
|------|------|
| 628 | Enable pre-allocation pass |
| 629 | Enable coalescing-aware pre-allocation |
| 618 | Enable uniform register pre-allocation |

When enabled and the allocation mode is 3, 5, or 6, the pass:

  1. Clears the pre-allocation candidate hash tables at alloc+240..336 (six tables covering candidates, results, and overflow).
  2. Iterates basic blocks calling sub_9499E0 (per-block scanner, 304 lines) to identify pre-assignment opportunities.
  3. For each eligible instruction, calls sub_93ECB0 (194 lines) to pre-assign operands.

sub_93ECB0 iterates instruction operands in reverse order (last to first). It filters: operands must be type 1 (register), index not 41--44 (architectural predicates) or 39 (special). A switch on the masked opcode determines how many operands qualify: opcode 22 dispatches to sub_7E40E0, opcode 50 uses a lookup table, opcodes 77/83/110--112/279/289/297/352 each have dedicated handlers. The function calls sub_93E9D0 with a priority level determined by OCG knob 646:

| Priority | Meaning |
|----------|---------|
| 1 | Pre-assign read operands only |
| 2 | Pre-assign write operands only |
| 3 | Pre-assign both read and write operands |

sub_93E9D0 (125 lines) creates a spill candidate node via sub_93E290 (allocates 192-byte structures from the arena freelist at alloc+232), marks the live range via sub_93DBD0 (356 lines), and recursively processes dependent operands via sub_93EC50.

Retry Loop: sub_971A90

The per-class allocation driver (355 lines) wraps the core allocator in a two-phase retry loop.

Phase 1: NOSPILL

The first attempt runs the core allocator without spill permission. The debug log emits:

"-CLASS NOSPILL REGALLOC: attemp N, used M, target T"

(Note: "attemp" is a typo present in the binary.)

The call sequence for each NOSPILL attempt:

sub_93FBE0(alloc, ctx, iteration)       // reset state for attempt
if iteration == 0:
    sub_956130(alloc, class)            // build interference masks (first attempt only)
result = sub_957160(alloc, ctx, iteration)  // core fat-point allocator
sub_93D070(&best, class, iteration,         // record best result
           result, pressure, alloc, cost)

The NOSPILL loop runs up to v102 attempts. Retry mode selection (from sub_971A90 lines 199--240):

| Condition | v102 (max attempts) | Behavior |
|-----------|---------------------|----------|
| Knob 638 enabled + special mode | 0 | No allocation at all |
| Knob 638 enabled, knob 639 set | knob 639 value | Custom iteration count |
| Knob 638 enabled, knob 639 unset | 1 | Single attempt |
| Knob 638 disabled, pressure low | 2 | Standard 2-attempt retry |
| Knob 638 disabled, pressure high | 0 | Skip to spill |

Exit conditions within the NOSPILL loop:

  • target >= adjusted_result: allocation fits within budget (success)
  • target >= result: no improvement possible between iterations (give up)

After each attempt, the best-result recorder (sub_93D070) compares the current attempt against the best seen so far using a multi-criterion ranking: register count first, then cost (double at best+56), then spill count, then class width. It uses 128 / register_count as an inverse density metric.

Phase 2: SPILL

If all NOSPILL attempts fail, the driver invokes spill guidance:

guidance = sub_96D940(ctx, guidance_array, attempt_no)   // 2983 lines

The spill guidance function builds priority queues of spill candidates for each of the 7 register classes. Each guidance entry is an 11112-byte working structure containing 128-element bitmask arrays. The function contains 7 near-identical code blocks (one per class), likely unrolled from a C++ template.

After spill guidance, a final allocation attempt runs via sub_9714E0 (finalize/spill). If this also fails, sub_936FD0 (fallback allocation) makes a last-ditch effort. If that fails too, register assignments are cleared to -1 and the allocator reports:

"Register allocation failed with register count of '%d'.
 Compile the program with a higher register target"

SMEM Spill Activation

For allocation modes 3 or 6 when the compilation target is device type 5, shared-memory spilling is activated before the retry loop:

if (class == 3 || class == 6) and device_type == 5:
    if num_variables > 0:
        sub_939BD0(alloc)                  // spill allocator setup
        sub_94F150(alloc, ctx, 1)          // spill codegen to SMEM
    alloc.spill_triggered = 1              // flag at alloc+865

This path generates spill/reload instructions targeting shared memory instead of local memory, which is faster but limited in size and shared across the CTA.

Per-Class Iteration

The top-level entry point (sub_9721C0, 1086 lines) drives allocation for all register classes sequentially:

for class_id in 1..6:
    if class_list[class_id] is empty:
        continue
    alloc.current_class = class_id          // alloc+376
    while sub_971A90(alloc, ctx, class_id) != 0:
        sub_8E3A80(alloc+2)                 // arena cleanup between attempts

Classes 1--6 are initialized via the target descriptor vtable at offset +896. The vtable call vtable[896](alloc_state, class_id) populates per-class register file descriptors at alloc[114..156] (four 8-byte entries per class). The class IDs correspond to reg_type values (1 = R, 2 = R alt, 3 = UR, 4 = UR ext, 5 = P/UP, 6 = Tensor/Acc).

| Class ID | Name | Width | HW Limit | Description |
|----------|------|-------|----------|-------------|
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose registers (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers for MMA/WGMMA operations |

Class 0 (unified/cross-class) is skipped in the main loop. It is used for cross-class constraint propagation during the interference building phase. Classes 3 (UR) and 6 (Tensor/Acc) have early-out conditions: if alloc+348 == 2 (class 3) or alloc+332 == 2 (class 6), allocation is skipped because no VRs of that class exist. Barrier registers (B, UB) have reg_type = 9, which is above the <= 6 allocator cutoff and are handled by a separate mechanism outside the 7-class system.

Before the per-class loop, virtual registers are distributed into class-specific linked lists (lines 520--549 of sub_9721C0):

for each vreg in function_vreg_list:       // from ctx+104
    if vreg.id in {41, 42, 43, 44}:        // skip architectural predicates
        continue
    class = vreg.register_class             // vreg+12
    if class >= 1 and class <= 6 and vreg.type != 0:
        insert(class_lists[class], vreg)

The VR list is sorted by priority (sub_9375C0) before distribution. Priority ordering ensures that VRs with more constraints and higher spill costs are allocated first, giving them first pick of the register file.

Fast Register Allocation: Knob 638

Knob 638 (register pressure analysis enable / fast allocation mode) controls a single-pass no-retry allocation path. When enabled with the special mode flag set, the allocator sets v102 = 0, meaning the NOSPILL retry loop body never executes. Allocation proceeds directly to spill handling without iterating.

When knob 638 is enabled without the special mode flag:

  • The iteration count is set to 1 (or the value of knob 639 if set)
  • This creates a limited-retry mode where the allocator makes at most knob_639 attempts
  • Each attempt still uses the full fat-point algorithm but with no fallback to the multi-attempt guidance-driven loop

This mode is intended for fast compilation (--fast-compile) where compilation time matters more than register allocation quality. The allocator accepts the first viable assignment rather than searching for an optimal one.

Interference Builder: sub_926A30

The interference builder (4005 lines) is the largest single function in the allocator. It constructs the constraint lists that feed the pressure arrays. For each basic block and each instruction within it, the builder:

  1. Iterates instruction operands. Each operand is a 32-bit descriptor:
    • Bits 27--25: operand type (1 = register, 6 = special, 7 = immediate)
    • Bits 23--0: register/variable ID
    • Bit 31: sign/direction flag
    • Bit 24: pair extension bit
  2. For register operands (type 1), extracts the VR ID and looks up the VR object.
  3. Determines the constraint type based on the operand's role (def, use, or both), the instruction's properties, and the VR's pair mode.
  4. Creates a constraint node and appends it to the VR's constraint list.
  5. For paired registers (type 3 in the operand descriptor), generates two constraints: one for the low half and one for the high half (distinguished by bit 23).
  6. Uses SSE2 vectorized loops for bulk weight accumulation when processing large basic blocks with many live registers.

The builder queries multiple OCG knobs via vtable dispatches at offsets +72, +120, +152, +224, +256, +272, and +320. These knobs modulate constraint weights and enable/disable specific constraint categories (e.g. bank-conflict-aware constraints are gated by knob 641).

Special register IDs 41--44 (PT, P0--P3) and 39 are always skipped. The skip predicate (sub_9446D0, 29 lines) additionally checks for CSSA phi instructions (opcode 195 with type 9 = barrier) and performs hash-table lookups in the exclusion set at alloc+360.

Best Result Recorder: sub_93D070

The best-result recorder (155 lines) compares the current allocation result against the best seen across all retry attempts. It maintains state at offsets best[10..20]:

best[10] = register_count                   // best count so far
best[13] = 128 / register_count             // inverse density metric
best[16] = max_pressure                     // peak live registers
best[17] = spill_score
*(double*)(best + 56) = cost                // floating-point cost metric
best[18] = arch_peak_1                      // from architecture state +408
best[20] = arch_peak_2                      // from architecture state +400

Comparison uses lexicographic ordering:

  1. Lower register count wins
  2. On tie: lower cost (double) wins
  3. On tie: lower spill count wins
  4. On tie: lower class width wins

When the current attempt improves over the best, the recorder allocates a per-register assignment array and copies the full VR-to-physical-register mapping for later restoration.
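The lexicographic comparison maps directly onto tuple ordering. An illustrative model (field names are stand-ins for the recovered offsets, not the binary's names):

```python
def ranking_key(attempt):
    """Lexicographic ranking used to compare retry attempts: register
    count first, then floating-point cost, then spill count, then class
    width -- all lower-is-better, so min() over these tuples picks the
    best attempt."""
    return (attempt["regs"], attempt["cost"],
            attempt["spills"], attempt["width"])

attempts = [
    {"regs": 40, "cost": 12.5, "spills": 3, "width": 1},
    {"regs": 38, "cost": 99.0, "spills": 9, "width": 1},  # fewer regs wins outright
    {"regs": 40, "cost": 12.5, "spills": 2, "width": 1},
]
best = min(attempts, key=ranking_key)
assert best["regs"] == 38
```

The second criterion only matters when register counts tie, mirroring the recorder's ordering.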

Per-Instruction Assignment: sub_9680F0

The per-instruction assignment core loop (3722 lines, the largest function in part 2 of the allocator) handles the actual instruction-by-instruction walk during allocation:

  1. Iterates instructions via linked list (v87 = *(_QWORD *)v87)
  2. For each instruction, calls sub_961A60 to attempt register assignment
  3. Tracks register pressure via v86 counter and 256-bit bitvectors at alloc+1342..1350
  4. Manages three bitvector masks per instruction: assigned, must-not-spill, and used
  5. Detects rematerialization opportunities (flag v570) and calls sub_93AC90
  6. Detects bank conflicts via sub_9364B0 and resolves them
  7. Handles special opcodes: 187 / IMMA_16832 (VZZN_16832, behavioral: LOAD), 97 / STG (FGT, behavioral: STORE), 52 / BB boundary (behavioral: BRANCH), 236 / UBLKPF (HOYXCS, behavioral: CALL)
  8. Tracks first-spill-candidate (alloc+1354) and fallback-spill-candidate (alloc+1355)
  9. On allocation failure for an instruction, calls sub_96CE90 which recursively invokes sub_9680F0 with different flags for the spill fallback path

Post-Allocation Verification

After register allocation completes, a verification pass called "memcheck" (NVIDIA's internal name, unrelated to Valgrind) compares reaching definitions before and after allocation. Every instruction's operands are checked: the set of definitions that could reach each use must be preserved, or any change must be explained by a known-safe transformation (spill/refill, rematerialization, predicate packing). Unexplained changes indicate an allocator bug.

The verification runs inside the post-regalloc scheduling pass (sub_A8B680), after all register assignments are finalized and spill/reload instructions have been inserted.

Verification Call Flow

sub_A8B680 (PostAllocPass::run)
  +-- sub_A5B1C0   build pre/post def-use chains (48KB, all instructions)
  +-- sub_A76030   MemcheckPass::run -- entry point
        |
        for each instruction in function:
        |   +-- sub_A56790  fast per-instruction check (returns bool)
        |   |     true  -> skip (defs match)
        |   |     false -> deep verify
        |   |
        |   +-- sub_A54140  look up pre-regalloc def set
        |   +-- sub_A54140  look up post-regalloc def set
        |   +-- sub_A75CC0  deep single-instruction verification
        |         +-- sub_A56400  build Additions list (new defs)
        |         +-- sub_A56400  build Removals list (lost defs)
        |         +-- sub_A55D80  diagnostic reporter (10 error codes)
        |               +-- sub_A55D20  print uninitialized-def detail
        |
        printf("TOTAL MISMATCH %d   MISMATCH ON OLD %d\n", ...)

Fast Check vs Deep Verify

The fast check (sub_A56790) performs a lightweight comparison per instruction. It returns true when pre-regalloc and post-regalloc reaching definitions match exactly. Only on failure does the verifier invoke the expensive deep path (sub_A75CC0), which:

  1. Builds two difference lists -- "Additions" (defs present after but not before) and "Removals" (defs present before but not after).
  2. Classifies each difference as either BENIGN (explainable) or POTENTIAL PROBLEM by pattern-matching against known-safe transformations: spill-store/refill-load pairs, P2R/R2P predicate packing, bit-spill patterns, and rematerialized instructions.
  3. For each unexplained difference, creates an error record with a category code (1--10), pointers to the offending instructions, and operand type flags.
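Structurally, the diff step is two set differences followed by a per-entry classification. A sketch under the assumption that defs are identified by instruction number, as in the reporter output below (function and label names are illustrative):

```python
def classify_diffs(pre_defs, post_defs, explainable):
    """Sketch of the deep-verify diff: Additions are defs present only
    after allocation, Removals only before; each entry is BENIGN if a
    known-safe transformation (spill/refill, rematerialization,
    predicate packing) explains it, otherwise a POTENTIAL PROBLEM."""
    additions = post_defs - pre_defs
    removals  = pre_defs - post_defs
    def label(d):
        return (d, "BENIGN" if d in explainable else "POTENTIAL PROBLEM")
    return ([label(d) for d in sorted(additions)],
            [label(d) for d in sorted(removals)])

# Instruction [55] is a refill (explainable); [38] just disappeared.
adds, rems = classify_diffs({38, 42}, {42, 55}, explainable={55})
assert adds == [(55, "BENIGN")]
assert rems == [(38, "POTENTIAL PROBLEM")]
```

This mirrors the Additions/Removals sections of the reporter output format documented below.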

Error Categories

The reporter (sub_A55D80) dispatches on the error code at record+24:

| Code | Name | Message | Trigger |
|------|------|---------|---------|
| 1 | Spill-refill mismatch | Failed to find matching spill for refilling load that is involved in this operand computation | A post-regalloc refill load has no corresponding spill store. The verifier walks the spill-refill chain and cannot find the matching pair. |
| 2 | Refill reads uninitialized | This operand was fully defined before register allocation, however refill that is involved in this operand computation reads potentially uninitialized memory | The refill reads from a stack slot that was never written -- the spill store was optimized away or placed on a non-executing path. |
| 3 | P2R-R2P pattern failure | Failed to establish match for P2R-R2P pattern involved in this operand computation | Predicate-to-register / register-to-predicate instruction pairs used to spill predicate registers through GPRs have a broken chain -- the matching counterpart is missing. |
| 4 | Bit-spill-refill failure | Failed to establish match for bit-spill-refill pattern involved in this operand computation | The bit-packing variant of predicate spilling (multiple predicates packed into GPR bits) failed pattern matching. Same root cause as code 3 but for the packed representation. |
| 5 | Uninitialized value introduced | Before register allocation this operand was fully defined, now an uninitialized value can reach it | Pre-regalloc: all paths to this use had a definition. Post-regalloc: at least one path has no definition. The register holds a stale value from a prior computation. Typically caused by a spill placed on the wrong path or a definition eliminated during allocation. |
| 6 | Extra defs introduced | After reg-alloc there are some extra def(s) that participate in this operand computation. They were not used this way before the allocation. | The post-regalloc definition set is a strict superset of the pre-regalloc set. New definitions were introduced through register reuse or aliasing. When the extra def involves a short/byte type in a wider register, the reporter prints: Does this def potentially clobber upper bits of a register that is supposed to hold unsigned short or unsigned byte? and suggests -knob IgnorePotentialMixedSizeProblems. |
| 7 | Rematerialization mismatch | Rematerialization problem: Old instruction: [%d] New instruction: [%d] | A rematerialized instruction does not match the original. The new instruction created to recompute a value differs from the original in an unexpected way. |
| 8 | P2R-R2P base destroyed | Some instruction(s) are destroying the base of P2R-R2P pattern involved in this operand computation | The GPR holding packed predicate bits is overwritten between the P2R store and R2P restore by another instruction that defs the same physical register. |
| 9 | Bit-spill base destroyed | Some instruction(s) are destroying the base of bit-spill-refill pattern involved in this operand computation | Same as code 8 but for the bit-packing spill variant. The base register holding packed predicate bits is overwritten between store and restore. |
| 10 | Definitions disappeared | During register allocation we did not add any new definitions for this operand and yet some original old ones disappeared | The post-regalloc definition set is a strict subset of the pre-regalloc set. Definitions were removed without replacement by spill/refill or rematerialization. |

Reporter Output Format

When DUMPIR=AllocateRegisters is enabled (knob ID 266), the reporter (sub_A55D80) prints a structured diagnostic per mismatch:

=== ... (110 '=' banner) ===
REMATERIALIZATION PROBLEM. New Instruction [N] Old Instruction [M]   // only if instruction changed
INSTRUCTION: [N]
=== ... ===

Operand # K
Producers for operand K of instruction [N] before register allocation:
  [42] def opr # 0 for src opr # 2
  [38] def opr # 1 for src opr # 2

Producers for operand # K of instruction [N] after register allocation:
  [42] def opr # 0 for src opr # 2
  [55] def opr # 0 for src opr # 2          // <-- new, from refill

Additions
  [55] def opr # 0 src opr # 2  BENIGN (explainable)

Removals
  [38] def opr # 1 src opr # 2  POTENTIAL PROBLEM

<error-category-specific message from the table above>

If DUMPIR=AllocateRegisters is not enabled and mismatches exist, the verifier prints a one-shot suggestion:

Please use -knob DUMPIR=AllocateRegisters for debugging

The one-shot flag at verifier+1234 ensures this message appears at most once per allocation attempt.

Mismatch Counters

The verifier tracks two counters reported at the end of the pass:

| Offset | Counter | Meaning |
|--------|---------|---------|
| verifier+1236 | Total mismatches | Instructions where post-regalloc defs differ from pre-regalloc defs in an unexplained way. |
| verifier+1240 | Old mismatches | Subset of total mismatches that represent pre-existing issues -- the pre-regalloc def chain was already empty (no reaching definitions before allocation either). These are not regressions from the current allocation attempt. |

Knob Interactions

| Knob | Effect |
|------|--------|
| DUMPIR=AllocateRegisters (ID 266) | Enables verbose per-mismatch diagnostic output. Without this, only the summary line and suggestion are printed. |
| IgnorePotentialMixedSizeProblems | Suppresses the mixed-size aliasing warning in error code 6 (extra defs involving short/byte types in wider registers). |
| memcheck flag at function+1384 | Gates whether verification runs at all. When zero, sub_A76030 is not called. |

Verification Function Map

| Address | Size | Role | Confidence |
|---------|------|------|------------|
| sub_A54140 | -- | Def-use chain lookup (hash table query into pre/post maps) | HIGH |
| sub_A55D20 | ~100B | Print uninitialized-def warning helper | HIGH |
| sub_A55D80 | 1454B | Diagnostic reporter -- 10 error categories, structured output | HIGH |
| sub_A56400 | -- | Build additions/removals lists for deep comparison | MEDIUM |
| sub_A56560 | 698B | Verify single operand's reaching definitions | HIGH |
| sub_A56790 | ~250B | Per-instruction fast check (returns bool pass/fail) | HIGH |
| sub_A5B1C0 | 8802B | Full-function def-use chain builder (pre and post regalloc) | HIGH |
| sub_A60B60 | 4560B | Pre/post chain comparison engine | HIGH |
| sub_A62480 | -- | Reset scratch arrays between operand checks | MEDIUM |
| sub_A75220 | 2640B | Compare reaching definitions (builds diff lists) | HIGH |
| sub_A75CC0 | 866B | Deep single-instruction verifier (classifies diffs) | HIGH |
| sub_A76030 | 770B | MemcheckPass::run -- verification entry point | HIGH |

Occupancy-Aware Budget Model

The allocator maintains a 144-byte budget pressure model at alloc+1600--alloc+1744 that adjusts the effective register budget based on thread occupancy. The model is initialized by sub_947150 (the allocator constructor) and consumed by the spill guidance function sub_96D940. The goal: kernels that need high occupancy get tighter register budgets (more aggressive spilling), while kernels that can tolerate low occupancy get relaxed budgets (more registers, fewer spills).

Coefficient Initialization

Three knob-overridable coefficients control the interpolation:

FieldOffsetKnobKnob NameTypeDefaultMeaning
coeffA+1632664RegAllocSpillBitLowRegScaleDBL0.2Scale at low register counts (piecewise default value)
coeffB+1640661RegAllocSpillBitHighRegScaleDBL1.0Scale at high register counts (linear model y_max)
coeffC+1648665RegAllocSpillBitMediumRegScaleDBL0.3Scale at medium register counts (linear model y_min)

Two integer knobs set the tier boundaries:

FieldOffsetKnobKnob NameTypeDefaultMeaning
maxThreads+1624663RegAllocSpillBitLowRegCountHeurINT119Low register count tier boundary
pressureThresh+1628660RegAllocSpillBitHighRegCountHeurINT160High register count tier boundary

All five knobs use the standard OCG type-check pattern: byte at knobArray + 72*index encodes 0 (use default), 1 (use INT value at +8), or 3 (use DBL value at +8).
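The tagged-union read can be sketched as follows. This is a model of the recovered pattern, not decompiled code: the 72-byte record stride and the tag values 0/1/3 come from the description above, while the little-endian layout and the exact placement of the payload at +8 are assumptions of the sketch.

```python
import struct

KNOB_STRIDE = 72  # each knob record is 72 bytes in the knob array

def read_scale_knob(knob_array: bytes, index: int, default: float) -> float:
    """Model of the OCG type-check pattern: the byte at knobArray + 72*index
    selects the source -- 0 = compiled-in default, 1 = INT override at +8,
    3 = DBL override at +8 (endianness and payload offset are assumptions)."""
    base = KNOB_STRIDE * index
    tag = knob_array[base]
    if tag == 0:
        return default
    if tag == 1:
        return float(struct.unpack_from("<q", knob_array, base + 8)[0])
    if tag == 3:
        return struct.unpack_from("<d", knob_array, base + 8)[0]
    raise ValueError(f"unexpected knob type tag {tag}")

# Toy 3-knob array: knob 0 unset, knob 1 INT=160, knob 2 DBL=0.45
arr = bytearray(KNOB_STRIDE * 3)
arr[KNOB_STRIDE * 1] = 1
struct.pack_into("<q", arr, KNOB_STRIDE * 1 + 8, 160)
arr[KNOB_STRIDE * 2] = 3
struct.pack_into("<d", arr, KNOB_STRIDE * 2 + 8, 0.45)

print(read_scale_knob(bytes(arr), 0, 0.2))   # 0.2 -- default used
print(read_scale_knob(bytes(arr), 1, 0.0))   # 160.0 -- INT override
print(read_scale_knob(bytes(arr), 2, 0.0))   # 0.45 -- DBL override
```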

Piecewise Interpolation Table

After reading the knobs, sub_947150 queries the target descriptor for the hardware maximum thread count (vtable slot 90, offset +720). This value is clamped to maxThreads - 1 if a knob override is active. The result becomes totalThreads -- the kernel's maximum achievable occupancy.

An optional override through the function-object vtable at context+1584 (vtable slot 118, offset +944) can adjust the architectural register limit. When the override is active, sub_947150 calls override_fn(totalThreads, param, 255.0) and sets the adjusted limit to 255 - result. When inactive, the limit stays at 255.

The piecewise array stores 7 (value, x-coordinate) pairs that define a step function mapping register counts to scale factors:

interpTable[0] = coeffA    interpTable[1] = maxThreads
interpTable[2] = coeffA    interpTable[3] = pressureThresh
interpTable[4] = coeffA    interpTable[5] = adjustedLimit     // 255 or (255 - override)
interpTable[6] = coeffB

This means: for register counts up to maxThreads (default 119), the budget scale is coeffA (0.2); from maxThreads to pressureThresh (160), still coeffA; from pressureThresh to the adjusted limit (255), still coeffA; and beyond that boundary, coeffB (1.0). In practice the piecewise table establishes a two-tier system: a low scale for most of the register range, jumping to the high scale only at the top.
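The step-function lookup implied by the table can be sketched like this. It is a model of the (value, x-coordinate) pair layout described above, assuming a simple left-to-right scan with inclusive boundaries; the binary's actual lookup code was not recovered.

```python
def budget_scale(reg_count, table):
    """Walk the (value, boundary) pairs in interpTable order and return the
    first tier value whose boundary covers reg_count; past the last boundary,
    return the final value (interpTable[6])."""
    # table = [v0, b0, v1, b1, v2, b2, v_final]
    for i in range(0, len(table) - 1, 2):
        if reg_count <= table[i + 1]:
            return table[i]
    return table[-1]

coeffA, coeffB = 0.2, 1.0
maxThreads, pressureThresh, adjustedLimit = 119, 160, 255
interp = [coeffA, maxThreads, coeffA, pressureThresh, coeffA, adjustedLimit, coeffB]

print(budget_scale(100, interp))  # 0.2 -- tier 0
print(budget_scale(200, interp))  # 0.2 -- still below the adjusted limit
print(budget_scale(300, interp))  # 1.0 -- beyond the final boundary
```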

Linear Interpolation Model

A separate linear model provides continuous interpolation for the spill guidance decision. Two more vtable queries establish the domain:

x_min = target->getMaxOccupancy()    // vtable slot 90 on target descriptor via context+1584
x_max = target->getMinOccupancy()    // vtable slot 96 (offset +768) on function-object via context+1584
y_min = coeffC                       // 0.3
y_max = coeffB                       // 1.0
slope = (coeffB - coeffC) / (x_max - x_min)

The slope is stored at alloc+1736. Since x_max (minimum occupancy, meaning fewest concurrent threads = most registers allowed) is typically greater than x_min (maximum occupancy, meaning most concurrent threads = fewest registers), the slope is positive: as the function moves toward allowing more registers (fewer threads), the budget fraction increases.

Spill Guidance Consumption

The spill guidance function sub_96D940 (line 1520 in the decompiled output) uses the linear model to compute a dynamic spill threshold:

budget_fraction = (current_reg_count - x_min) * slope + y_min
spill_threshold = budget_fraction * (class_budget - class_floor + 1)

Where:

  • current_reg_count: the current register allocation count for this class (from alloc+884 indexed by class)
  • class_budget and class_floor: per-class allocation bounds at alloc + 32*class + 884 and alloc + 32*class + 880
  • For paired registers, current_reg_count is halved: (count + 1) >> 1

The comparison at line 1527:

if (spill_count + unspilled_need_spill + current_reg_count) > spill_threshold:
    trigger_spill(sub_948B80)

If the total pressure (live registers needing spill + those already marked for spill + current allocation count) exceeds the occupancy-adjusted threshold, the allocator triggers a spill. The sub_948B80 call adds the current VR to the spill candidate queue.

Worked Example

For a Blackwell SM100 kernel with default knobs:

ParameterValueSource
coeffA0.2Knob 664 default
coeffB1.0Knob 661 default
coeffC0.3Knob 665 default
maxOccupancy (x_min)240SM100 target vtable slot 90
minOccupancy (x_max)480SM100 target vtable slot 96
slope(1.0 - 0.3) / (480 - 240) = 0.00292Computed

If the current GPR class has a budget of 128 and floor of 0 (range = 129), and the function currently uses 300 registers:

budget_fraction = (300 - 240) * 0.00292 + 0.3 = 0.475
spill_threshold = 0.475 * 129 = 61.3

If more than 61 VRs are pending spill or already allocated, the allocator triggers a spill rather than attempting to fit another register. With fewer registers in play (say 250), the fraction drops to 0.329 and the threshold tightens to 42 -- more aggressive spilling at higher occupancy targets.
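The worked example's arithmetic can be checked directly. This sketch only reproduces the linear-model formula from the Spill Guidance Consumption section under the parameter values tabulated above; it is not the binary's code.

```python
def spill_threshold(current_regs, x_min, slope, y_min, class_budget, class_floor):
    """Occupancy-adjusted spill threshold, as derived above:
    scale the class's budget range by the linear budget fraction."""
    budget_fraction = (current_regs - x_min) * slope + y_min
    return budget_fraction * (class_budget - class_floor + 1)

# SM100 defaults from the table: coeffB=1.0, coeffC=0.3, occupancy 240..480
slope = (1.0 - 0.3) / (480 - 240)

print(int(spill_threshold(300, 240, slope, 0.3, 128, 0)))  # 61
print(int(spill_threshold(250, 240, slope, 0.3, 128, 0)))  # 42
```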

Budget Model Field Summary

OffsetSizeTypeInitField
+16008ptrctx[2]->+208Function object pair pointer
+16088ptr0Auxiliary pointer (unused at init)
+16168QWORD0xFFFFFFFFOccupancy upper bound
+16244DWORD119 / knob 663maxThreads (low reg count tier boundary)
+16284DWORD160 / knob 660pressureThresh (high reg count tier boundary)
+16328double0.2 / knob 664coeffA (low-register scale)
+16408double1.0 / knob 661coeffB (high-register scale / y_max)
+16488double0.3 / knob 665coeffC (medium-register scale / y_min)
+16568double(computed)totalThreads as double
+16648double= coeffAinterpTable[0]: value for tier 0
+16728double= maxThreadsinterpTable[1]: x-boundary for tier 0
+16808double= coeffAinterpTable[2]: value for tier 1
+16888double= pressureThreshinterpTable[3]: x-boundary for tier 1
+16968double= coeffAinterpTable[4]: value for tier 2
+17048double= adjustedLimitinterpTable[5]: x-boundary for tier 2
+17128double= coeffBinterpTable[6]: value for tier 3 (final)
+17208double(computed)x_min: max occupancy thread count
+17288double= coeffCy_min (linear model intercept at x_min)
+17368double(computed)slope: (coeffB - coeffC) / (minOcc - maxOcc)
+17448ptr0Tail sentinel

Post-Init: sub_939BD0

Immediately after building the interpolation tables, sub_947150 calls sub_939BD0, which configures the spill guidance lookup strategy at alloc+1784. This function queries knob 623 (RegAllocEstimatedLoopIterations) through the function-object vtable:

  • If knob 623 is set: the spill guidance uses the knob's value to estimate loop iteration counts, passed to the lookup strategy via vtable slot 3 (offset +24).
  • If knob 623 is unset: the lookup strategy is initialized with default parameters. When the budget model's auxiliary weight at alloc+776 is zero, the strategy uses (8, 4, 0x100000); otherwise (16, 16, 0x100000).

Function Map

AddressLinesRoleConfidence
sub_926A304005Fat-point interference builder / constraint solverHIGH
sub_93D070155Best result recorder (multi-criterion comparison)HIGH
sub_93E290397Spill candidate node creator (192-byte arena alloc)HIGH
sub_93E9D0125Pre-assign individual operandHIGH
sub_93ECB0194Pre-assign registers (per-instruction dispatcher)HIGH
sub_93FBE0940Per-iteration allocation state resetHIGH
sub_939BD065Spill guidance strategy initializer (knob 623 query)HIGH
sub_939CE023Register consumption counter (pair-aware)HIGH
sub_9446D029Register skip predicate (special regs, exclusion set)HIGH
sub_947150~700Allocator constructor (budget model + interpolation init)HIGH
sub_94A020331Pre-allocation pass (knobs 628/629/618)HIGH
sub_94FDD0155Register assignment + alias propagationHIGH
sub_950100205Pre-allocated candidate applier (FNV-1a lookup)HIGH
sub_956130873Register class interference mask builder (SSE2)HIGH
sub_95702024Occupancy bitvector resizer (arena-backed realloc + memset)HIGH
sub_94C9E059Occupancy bitmask range setter (word-level OR with masks)HIGH
sub_7DAFD07VR aligned width computation (ceil(size/stride)*stride)CERTAIN
sub_9571601658Core fat-point allocator (coloring engine)HIGH
sub_9680F03722Per-instruction assignment core loopHIGH
sub_96D9402983Spill guidance (7-class priority queues)HIGH
sub_971A90355NOSPILL / SPILL retry driverHIGH
sub_9714E0--Post-allocation finalizationMEDIUM
sub_9721C01086Register allocation entry pointHIGH
sub_936FD0--Final fallback allocationMEDIUM
sub_9375C0--VR priority sortMEDIUM

Spill Mechanism

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

When the fat-point register allocator cannot fit all simultaneously-live virtual registers into the hardware register budget, it spills excess values to memory and reloads them on demand. The spill subsystem is the second-largest component of the register allocator by code volume, spanning roughly 25 functions and 12,000+ lines of decompiled code. It implements a cost-driven, retry-based spill strategy with two memory targets (local memory and shared memory), a per-class priority queue guidance engine, and a multi-attempt allocation loop that progressively relaxes constraints until allocation succeeds or fails fatally.

Spill triggersub_94FDD0 (155 lines) -- sets flag 0x40000 when assignment exceeds budget
Spill guidancesub_96D940 (2983 lines) -- builds 7 priority queues of spill candidates
Spill codegensub_94F150 (561 lines) -- inserts spill stores and refill loads
LMEM setupsub_939BD0 (65 lines) -- local memory slot allocator configuration
SMEM allocatorsub_9539C0 (1873 lines) -- shared memory spill alternative
Retry driversub_971A90 (355 lines) -- NOSPILL then SPILL retry loop
Finalizationsub_9714E0 (290 lines) -- commit spills, emit errors on failure
SASS codegensub_9850F0 (520 lines) -- generate LDL/STL instruction sequences
Key knobs623 (spill mode), 638/639 (retry limits), 684 (interference threshold)

Spill Trigger

The spill trigger fires inside the per-virtual-register assignment function sub_94FDD0 (155 lines). When the fat-point allocator (sub_957160) selects a physical slot for a virtual register, it calls sub_94FDD0 to commit the assignment. If the chosen slot index equals or exceeds the per-class register budget, the function does not commit -- instead it marks the virtual register for spilling.

function assign_register(alloc, ctx, mode, vreg, regclass_info, slot, cost):
    max_regs = regclass_info.max_regs           // at regclass_info+16

    // Budget check
    if slot >= max_regs and not vreg.has_flag(0x4000):  // not already spilled
        vreg.flags |= 0x40000                    // set "needs-spill" bit
        return                                   // do NOT commit assignment

    // Spill path: flag was set on a previous call
    if vreg.has_flag(0x40000):                   // needs-spill
        setup_spill_allocator(alloc)             // sub_939BD0
        generate_spill_code(alloc, ctx, 1)       // sub_94F150
        return

    // Non-spill path: commit the assignment
    consumption = compute_consumption(vreg)       // sub_939CE0
    update_peak(alloc+1564, consumption)
    update_peak(alloc+1528, consumption)
    vreg.physical_register = slot                 // vreg+68

    // Accumulate spill cost even for successful assignments
    *(double*)(alloc+1568) += *(float*)(vreg+40)  // store weight
    *(float*)(alloc+1576)  += load_weight          // load weight

    apply_preallocated_candidate(alloc, vreg)     // sub_950100

    // Propagate through alias chain
    alias = vreg.alias_parent                     // vreg+36
    while alias != NULL:
        alias.physical_register = slot
        alias = alias.alias_parent

The two flag bits at vreg+48 encode spill state:

BitMaskMeaning
140x4000Already spilled -- prevents the same vreg from being spilled again
180x40000Needs spill -- triggers spill codegen on the next sub_94FDD0 call

Register consumption (sub_939CE0, 23 lines) accounts for paired registers. For double-width registers (pair mode 3 at vreg+48 bits 20--21), it returns assignment + 1, consuming two physical slots.
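The flag state machine and the pair-aware consumption count can be modeled together. The bit positions come from the table and text above; the function bodies are sketches of the recovered behavior, not decompiled code.

```python
FLAG_SPILLED     = 0x4000    # bit 14: already spilled
FLAG_NEEDS_SPILL = 0x40000   # bit 18: spill codegen pending

def mark_for_spill(flags):
    # Sketch of the sub_94FDD0 state machine: a vreg that already spilled
    # (bit 14) is never re-marked; otherwise set the needs-spill bit.
    return flags if flags & FLAG_SPILLED else flags | FLAG_NEEDS_SPILL

def consumption(assignment, flags):
    # Model of sub_939CE0: pair mode sits in bits 20-21 of the flag word;
    # mode 3 (double-width) consumes one extra physical slot.
    pair_mode = (flags >> 20) & 0x3
    return assignment + 1 if pair_mode == 3 else assignment

print(hex(mark_for_spill(0)))              # 0x40000 -- marked for spill
print(hex(mark_for_spill(FLAG_SPILLED)))   # 0x4000  -- left untouched
print(consumption(4, 0))                   # 4 -- single register
print(consumption(4, 0x3 << 20))           # 5 -- paired register
```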

Spill Retry Loop

The outer allocation driver sub_971A90 (355 lines) wraps the core allocator in a two-phase strategy: first attempt allocation without spilling, then retry with progressively more aggressive spill guidance.

function alloc_with_spill_retry(alloc, ctx, class_id):
    // Phase 1: NOSPILL
    pre_allocation_pass(alloc)                        // sub_94A020
    per_class_driver(alloc, ctx, class_id)            // sub_95DC10
    result = fatpoint_allocate(alloc, ctx, attempt=0) // sub_957160
    record_best_result(&best, class_id, 0, result)    // sub_93D070

    if alloc.target >= adjusted_result:
        goto finalize                                 // success

    // Phase 2: SPILL retry
    max_attempts = query_knob(638)     // default varies
    if knob_639_set:
        max_attempts = query_knob(639) // override

    for attempt in 0..max_attempts:
        reset_alloc_state(alloc, ctx, attempt)        // sub_93FBE0
        if attempt == 0:
            build_interference_masks(alloc, class_id) // sub_956130
        result = fatpoint_allocate(alloc, ctx, attempt)

        debug("-CLASS NOSPILL REGALLOC: attemp %d, used %d, target %d",
              attempt, result, alloc.target)

        record_best_result(&best, class_id, attempt, result)
        if alloc.target >= adjusted_result:
            break

    if all_attempts_failed and no_spill_recorded:
        result = final_fallback(&best, result)        // sub_936FD0

    // Finalize
    status = finalize_allocation(alloc, result, class_id, &best)  // sub_9714E0
    if HIBYTE(status):
        clear_all_assignments_to_minus_one()          // allocation failed
    else:
        commit_results()

The debug string "-CLASS NOSPILL REGALLOC: attemp " (note the typo -- present in the binary) is printed for every attempt.

For SMEM spilling (modes 3/6 when ctx+896 == 5), the driver activates spill setup before entering the retry loop:

if (class_id == 3 or class_id == 6) and device_type == 5:
    if vreg_count > 0:
        setup_spill_allocator(alloc)      // sub_939BD0
        generate_spill_code(alloc, ctx, 1) // sub_94F150
    alloc.spill_done_flag = 1             // alloc+865

Spill Guidance Engine

sub_96D940 (2983 lines, 84 KB decompiled) computes which registers should be spilled and in what order. It is the largest spill-related function and one of the largest in the entire allocator.

Structure

The function contains 7 near-identical code blocks, one per register class (R, P, B, UR, UP, UB, tensor/accumulator). Each block is approximately 400 lines of bitvector iteration and set intersection. This repetition strongly suggests C++ template instantiation or macro expansion in the original source.

Spill Guidance Structure (11,112 bytes)

The guidance engine allocates a single 11,112-byte working structure from the arena (vtable +24). The structure is organized into five regions.

Region 0 -- Header and core pointers (bytes 0--271)

Byte offsetQWORD idxTypeInitField
0[0]ptrctxBack-pointer to allocation context
24[3]ptralloc+16Pointer into allocator state object
32[4]QWORD0Run counter / iteration state
40[5]QWORD0Class processing counter
48[6]QWORD0Spill mode flags
96[12]ptrarenaArena allocator pointer (from ctx+16)
104[13]ptrarenaQueue header base / candidate list base
112[14]DWORD+DWORD-1, 0Max element index sentinel, entry count
136[17]ptrarenaThird arena reference
144[18]QWORD0Queue 0 entry count
152[19]QWORD-1Queue 0 sentinel
160[20]ptrarenaFourth arena reference
168[21]QWORD0Queue 0b entry count
176[22]QWORD-1Queue 0b sentinel
184[23]ptrctxBack-pointer to context
192[24]ptrnodeCandidate node list head (24-byte arena node)
200[25]ptrnodeCandidate node list tail
208[26]QWORD0Node count
216[27]QWORD0Node capacity
240[30]ptrnodeSentinel node (same as initial node at [24])
248[31]QWORD0Free list head
256[32]QWORD0Free list count

Region 1 -- Bitmask arrays (bytes 272--1327)

Two 508-byte bitmask arrays (127 DWORDs each), separated by single-byte sentinels:

Byte rangeContent
284Sentinel byte (set to 0x80 after zeroing)
288--795Bitmask array 0: 127 DWORDs for live range set intersection
808Sentinel byte (set to 0x80 after zeroing)
812--1319Bitmask array 1: 127 DWORDs for second class pair

Each bitmask array is zeroed via an SSE2 vectorized loop (16 bytes per iteration, 0x1F iterations). The 0x80 sentinel byte at the start of each array marks initialization completion.

Region 2 -- Priority queue table blocks (bytes 1328--2063)

Five embedded priority queue tables, each containing an entry count (QWORD) followed by an array of 6 queue entries (24 bytes each):

QWORD idxByte offsetContent
[166]1328Queue block 1 entry count (incremented by 7 per pass)
[167]--[184]1336--1479Queue block 1: 6 entries x 24 bytes
[188]1504Queue block 2 entry count
[189]--[206]1512--1655Queue block 2: 6 entries x 24 bytes
[210]1680Queue block 3 entry count
[211]--[228]1688--1831Queue block 3: 6 entries x 24 bytes
[232]1856Queue block 4 entry count
[233]--[250]1864--2007Queue block 4: 6 entries x 24 bytes
[256]2048Queue block 5 (overflow) count

Each 24-byte queue entry has this layout:

Entry offsetTypeInitField
+0ptrarenaBitvector storage pointer
+8QWORD0Bitvector data pointer
+16DWORD-1Max element index
+20DWORD0Current element count

Queue entries are built by sub_8BE190 and sorted by sub_7553C0. Candidates are inserted via sub_9370A0 (with tie-breaking) and removed via sub_9365A0 (bit-clear in bitvector).

Region 3 -- Candidate node management (bytes ~2064--10591)

The largest region (~8,528 bytes). Contains working storage for spill candidate evaluation across all 7 register classes. This region is zeroed during initialization and populated during the instruction walk phase by sub_93BF50 (candidate evaluation), sub_936610 (candidate insertion with cost), sub_9680F0 (cost propagation), and sub_93A1F0 (interference counting). The exact internal sub-layout varies by register class and virtual register count.

Region 4 -- Linked list, accumulators, and tail (bytes 10592--11111)

Byte offsetQWORD idxTypeInitField
10592[1324]QWORD0Linked list head (spill candidate chain)
10600[1325]ptr&self[1326]Forward pointer (circular doubly-linked)
10608[1326]ptr&self[1324]Backward pointer
10616[1327]QWORD0List count
10624[1328]ptr&self[1324]Secondary forward pointer
10632[1329]ptr&self[1326]Secondary backward pointer
10640DWORDint2Node type tag
10648[1331]ptrnodePrimary candidate node (24B, type=2)
10656[1332]ptrnodeSecondary candidate node (24B, type=2)
10696[1337]ptrnodeSecondary tail pointer
10704[1338]ptr0Instruction walk context (knob 622 gate)
10712[1339]QWORD0Walk state
10720[1340]QWORD0Walk counter
10728BYTEbyte0Walk active flag
10736[1342]QWORD0Spill cost accumulator 0
10744[1343]QWORD0Spill cost accumulator 1
10752--10824[1344]--[1353]QWORD0Additional cost/range counters (10 slots)
10840[1355]QWORD0Interference counter
10872DWORDint0Class mask
10888[1361]QWORD0Result register count
10896[1362]QWORD0Result cost metric
10904[1363]QWORD0Result spill count
10912[1364]QWORD0Result class width
10920[1365]QWORD0Best-attempt index
10960[1370]QWORD0Phase indicator
10968[1371]QWORD0Phase state
10976[1372]QWORD0Output flag
11008[1376]QWORD0SMEM spill tracking
11016[1377]QWORD0SMEM spill state
11048[1381]QWORD0Output queue pointer
11056[1382]QWORD0Output queue size
11072[1384]ptr0Callee-save tracking (freed by sub_96CFA0)
11080[1385]ptr0Callee-save arena ref (freed by sub_96CFA0)
11089BYTEbyte1Guidance enabled flag
11096[1387]ptr0Final candidate object (freed by sub_96CFA0)
11104[1388]ptr0Final candidate arena ref (freed by sub_96CFA0)

The linked list at [1324]--[1329] is initialized as a circular doubly-linked list with self-referential pointers, following the standard intrusive list pattern. The cleanup function sub_96CFA0 (694 lines) deallocates the candidate node objects at offsets 11072, 11080, 11096, and 11104.
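The self-referential initialization can be sketched as a standard intrusive list head. Field and class names here are illustrative; only the circular doubly-linked shape (an empty list points both links at itself) comes from the recovered layout.

```python
class ListHead:
    """Circular doubly-linked list head with self-referential init,
    mirroring the intrusive pattern at [1324]--[1329]."""
    def __init__(self):
        self.next = self   # empty list: both links point back at the head
        self.prev = self
        self.count = 0

    def push_back(self, node):
        # Splice the node in just before the head (i.e., at the tail)
        node.prev, node.next = self.prev, self
        self.prev.next = node
        self.prev = node
        self.count += 1

    def empty(self):
        return self.next is self

class CandidateNode:
    def __init__(self, vreg):
        self.vreg = vreg
        self.next = self.prev = None

head = ListHead()
print(head.empty())               # True -- self-referential when empty
head.push_back(CandidateNode("vr7"))
print(head.empty(), head.count)   # False 1
```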

Candidate node object (24 bytes)

Each candidate node is a 24-byte arena-allocated object used in the doubly-linked list and as priority queue sentinels:

OffsetTypeField
+0QWORDType tag: 2 = priority queue node, 3 = initial sentinel
+8QWORDCount / payload
+16ptrArena back-reference

Stack-local queue header array

In addition to the 11,112-byte structure, the function maintains a stack-local 7-element queue header array (v514, 168 bytes on stack). Each entry is 24 bytes (3 QWORDs) with the same layout as the embedded queue entries above. The 7 entries map to the 7 register classes:

IndexClass
0R (general-purpose registers)
1P (predicate registers)
2B (barrier registers)
3UR (uniform registers)
4UP (uniform predicates)
5UB (uniform barriers)
6Acc (tensor/accumulator registers)

After bitvector iteration, each stack-local queue header is built by sub_8BE190 and sorted by sub_7553C0.

Algorithm

function compute_spill_guidance(ctx, guidance_array, attempt):
    for class_id in 0..6:
        entry = &guidance_array[class_id]

        // 1. Initialize working bitmask arrays
        zero_fill(entry.bitmasks, 128 elements)

        // 2. Iterate live range bitvectors
        for each live_range in class[class_id]:
            // 3. Compute set intersection with other live ranges
            intersect(entry.bitmasks, live_range.bitvector)

        // 4. Build priority queue of spill candidates
        build_priority_queue(entry)               // sub_8BE190

        // 5. Sort by spill cost (ascending -- cheapest to spill first)
        sort_priority_queue(entry)                // sub_7553C0

The guidance output is consumed by the retry loop: after each failed allocation attempt, the allocator consults the guidance to decide which virtual registers to allow to spill on the next attempt.

Spill Cost Model

The allocator uses a multi-level cost model to evaluate which registers are cheapest to spill.

Per-virtual-register weights

FieldTypeMeaning
vreg+40floatPrimary spill cost (accumulated from usage frequency)
vreg+76floatSecondary spill cost (alternate weighting)
vreg+80intSpill flag: 0 = not spilled, 1 = spilled

Allocator-level accumulators

FieldTypeMeaning
alloc+1568doubleTotal spill-store cost
alloc+1576floatTotal spill-load cost

Default cost weight

The base spill cost weight is 15.0 for normal register classes, reduced to 3.0 for register classes under high pressure. The selection is made by a per-class flag at alloc + 32 * class_id + 893:

float spill_weight = 15.0f;
if (*(uint8_t*)(alloc + 32 * regclass + 893))
    spill_weight = 3.0f;    // high-pressure class: lower cost to encourage spilling

Block frequency weighting

Spill cost is multiplied by the enclosing loop's nesting depth, obtained via a block frequency callback at vtable offset +8. Inner-loop spills receive higher penalties, discouraging the allocator from spilling values that are live across loop back-edges.
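A minimal sketch of the weighting, combining the 15.0/3.0 base weights with per-use loop depth. The linear depth scaling is an assumption of this sketch: the real callback at vtable offset +8 returns a block frequency whose exact relationship to nesting depth was not recovered.

```python
BASE_WEIGHT = 15.0          # normal register class
HIGH_PRESSURE_WEIGHT = 3.0  # per-class flag at alloc + 32*class + 893 set

def weighted_spill_cost(weight, use_depths):
    # Each use contributes the base weight scaled by its loop nesting depth,
    # so values live across inner-loop back-edges become expensive to spill.
    return sum(weight * depth for depth in use_depths)

# A value used once at top level and twice inside a doubly-nested loop
print(weighted_spill_cost(BASE_WEIGHT, [1, 2, 2]))           # 75.0
print(weighted_spill_cost(HIGH_PRESSURE_WEIGHT, [1, 2, 2]))  # 15.0
```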

Best-result comparison

sub_93D070 (155 lines) records the best allocation result across retry attempts. Comparison uses tie-breaking priority:

  1. Register count (lower is better)
  2. Cost metric (double at result+56)
  3. Spill count
  4. Register class width

An inverse density metric 128 / register_count is used for secondary comparison. The per-variable assignment array is saved to a backup when a new best is found.
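The tie-breaking order can be expressed as a lexicographic key. The priority sequence comes from the list above; the field names and dictionary encoding are illustrative, and the secondary inverse-density comparison is omitted from this sketch.

```python
def better(a, b):
    # Lexicographic comparison in the priority order recovered from
    # sub_93D070: register count, cost metric, spill count, class width.
    # Lower wins on every key; ties fall through to the next key.
    key = lambda r: (r["regs"], r["cost"], r["spills"], r["width"])
    return a if key(a) <= key(b) else b

r1 = {"regs": 96, "cost": 12.5, "spills": 4, "width": 1}
r2 = {"regs": 96, "cost": 12.5, "spills": 2, "width": 1}
print(better(r1, r2)["spills"])   # 2 -- fewer spills breaks the tie
```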

Spill cost infrastructure

A suite of functions at 0x998000--0x99E000 implements the cost computation:

AddressRole
sub_9997D0Spill cost initialization
sub_9998A0Spill cost computation
sub_999950Spill cost comparison
sub_999AA0Spill benefit estimation
sub_999D10Spill cost aggregation
sub_999F00Spill cost finalization
sub_99A0B0Range spill cost query
sub_9A8270Live range spill cost (14 KB)

Spill Code Generation

sub_94F150 (561 lines) inserts actual spill-store and refill-load instructions into the IR instruction stream.

Per-register spill info

The function allocates a tracking array: 12 bytes per virtual register, initialized to {0, -1, -1}:

OffsetSizeField
+04Spill state (0 = none)
+44Spill slot index (-1 = unassigned)
+84Refill slot index (-1 = unassigned)

Execution flow

function generate_spill_code(alloc, ctx, mode):
    // 1. Allocate per-block tracking
    tracking = arena_alloc(12 * (numBlocks + 1))

    // 2. Set up instruction iteration
    setup_instruction_walk(ctx, walk=1, dir=0)       // sub_7E6090

    // 3. Multi-block liveness (if numBlocks > 1 and mode == 1)
    compute_cross_block_liveness(alloc, ctx)          // sub_9449B0
    //   Uses bitvectors: sub_BDBA60, sub_BDC180, sub_BDCDE0

    // 4. Clear per-instruction spill flags
    for each instr:
        instr.flags &= ~0xE00

    // 5. Walk instruction list
    spill_weight = 15.0
    if high_pressure_class:
        spill_weight = 3.0

    for each instr in instruction_list:
        for each operand:
            vreg = lookup_vreg(operand)
            if vreg.was_previously_spilled:
                // Track for potential refill
                update_refill_tracking(vreg, instr)
            // Accumulate spill cost weighted by block frequency
            vreg.spill_cost += spill_weight * block_frequency(instr.block)

        // Handle call instructions (opcode 97; STG in ROT13, used as CALL marker -- actual CALL is opcode 71)
        if instr.opcode == 97:
            handle_callee_save(alloc, instr)

        // Track "use after last def" for enhanced cost
        update_use_after_def(vreg, instr)

    // 6. Uniform register special handling (flag 0x200)
    if vreg.flags & 0x200:
        apply_uniform_spill_rules(vreg)

Epoch tracking

sub_936CF0 (81 lines) checks basic block boundaries for epoch increments. It returns true if the block's successor is a CALL instruction (opcode 52) with a target that has opcode 236 (special call), or if allocator flags at alloc+1588/alloc+1589 are set. This mechanism tracks liveness across call boundaries, ensuring that spilled values are correctly reloaded after calls that may clobber registers.

LMEM Spilling

Local memory (LMEM) is the default spill destination. Each GPU thread has private local memory backed by DRAM and cached in L2.

Slot allocation

sub_939BD0 (65 lines) configures the spill slot allocator. It queries OCG knob 623 for a custom spill mode; if the knob is disabled, it selects between two default configurations based on the cost threshold at alloc+776:

ConditionBucket sizeAlignmentMax pool
Cost threshold == 0.08 bytes4 bytes1 MB
Cost threshold != 0.016 bytes16 bytes1 MB

The 8-byte/4-byte configuration handles standard 32-bit register spills. The 16-byte/16-byte configuration handles double-precision or 64-bit values that require stricter alignment.

if (*(double*)(alloc + 776) == 0.0)
    return init_spill_pool(mem_alloc, 8, 4, 0x100000);    // 8B buckets, 4B aligned
else
    return init_spill_pool(mem_alloc, 16, 16, 0x100000);   // 16B buckets, 16B aligned

When knob 623 is enabled, the knob value at offset +224 supplies a custom spill limit, passed to the spill allocator init function via vtable +24.

SASS instruction sequences

sub_9850F0 (520 lines) generates the actual SASS load/store instruction sequences for spill traffic. Architecture-specific registers drive the address computation:

FieldSourceRole
target_info[400]Architecture vtableSpill base register
target_info[401]Architecture vtableSpill slot stride
target_info[402]Architecture vtableSpill offset register

Spill store sequence:

IADD  addr, spill_base, offset       // compute slot address
IMAD  addr, addr, stride, 0          // scale by slot stride
ST    [addr], value                   // store to local memory

Refill load sequence:

IADD  addr, spill_base, offset       // compute slot address
IMAD  addr, addr, stride, 0          // scale by slot stride
LD    value, [addr]                   // load from local memory

The generated instructions use these SASS opcodes:

OpcodeSASS mnemonicRole in spill sequence
0xC9IADDAdd offset to base register
0x11BIMADMultiply-add for address
0xC3MOVMove value
0x82 (130)HSET2 in ROT13; used as LD/LDL-likeLoad from local memory (refill)
0xB7ST / STLStore to local memory (spill)
0x14ISETPSet predicate (conditional spill)
0x8BSHLShift for alignment
0x110LOP3Logical operation for masking
0x5FBRABranch (conditional spill)
0x120STSStore to shared memory (SMEM path)

At the hardware level, local memory spills become LDL/STL instructions. The SETLMEMBASE (opcode 147) and GETLMEMBASE (opcode 148) instructions manage the local memory base pointer for the thread.
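Taken at face value, the IADD + IMAD pair above computes a stride-scaled slot address. This sketch just restates that arithmetic; whether the stride really multiplies the base-plus-offset sum (rather than only the offset) is an assumption read off the listed sequence.

```python
def spill_slot_address(spill_base: int, offset: int, stride: int) -> int:
    """Effective address implied by the two-instruction sequence:
    IADD adds the slot offset to the base, IMAD scales by the stride."""
    addr = spill_base + offset        # IADD  addr, spill_base, offset
    return addr * stride              # IMAD  addr, addr, stride, 0

# 4-byte slots from a (hypothetical) base index of 0: slot 5 -> byte 20
print(spill_slot_address(0, 5, 4))    # 20
```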

SMEM Spilling

Shared memory (SMEM) spilling is an alternative to local memory spilling. Shared memory is on-chip SRAM, offering significantly lower latency than LMEM (which goes through L2/DRAM). However, shared memory is a finite resource shared across all threads in a CTA, so SMEM spilling is viable only under the narrow activation conditions listed below.

Entry point

sub_9539C0 (1873 lines, 54 KB decompiled) implements the SMEM spill allocator.

Hard restriction

"Smem spilling should not be enabled when functions use abi."

This assertion (checked at two call sites within sub_9539C0) prevents SMEM spilling for ABI-conformant functions. Functions using the GPU calling convention require a stable stack frame in local memory; shared memory spill slots would conflict with the calling convention's stack layout requirements.

Activation conditions

SMEM spilling activates when all of the following hold:

  1. Register class is 3 (UR) or 6 (Tensor/Acc)
  2. Device type is 5 (ctx+896 == 5)
  3. The class has virtual registers to allocate (vreg_count > 0)
  4. The function does not use ABI calling conventions

Algorithm

function smem_spill_allocate(alloc, ctx):
    // 1. Assert no ABI conflict
    assert(not function_uses_abi())

    // 2. Allocate per-block tracking (24-byte slots)
    tracking = arena_alloc(24 * numBlocks)

    // 3. Set up SSE-width bitmaps for shared memory tracking
    init_smem_bitmaps(alloc)

    // 4. Walk instruction list, identify spill candidates
    for each vreg marked for spill:
        // 5. Allocate shared memory slot
        slot = allocate_smem_slot(alloc)
        vreg.smem_slot = slot

        // 6. Generate shared memory load/store
        insert_sts_instruction(slot, vreg)   // STS (store to smem)
        insert_lds_instruction(slot, vreg)   // LDS (load from smem)

    // 7. Update shared memory allocation bitmap
    update_smem_allocation(alloc)

SMEM symbols

The SMEM spill infrastructure uses two internal symbols for allocation tracking:

  • __nv_reservedSMEM_allocation_phase (address 0x1CFCE80)
  • __nv_reservedSMEM_allocation_mask (address 0x1CFCEA8)

The CLI flag --disable-smem-reservation can disable shared memory reservation entirely.

Spill Statistics

The allocator collects detailed spill metrics into a per-function statistics object. These are used for compilation reporting, performance analysis, and the --warn-on-spills diagnostic.

Statistics fields

The statistics object stores spill counters at fixed DWORD offsets:

Offset (DWORDs)NameDescription
[12]LSpillBLocal memory spill bytes
[13]LRefillBLocal memory refill bytes
[14]SRefillBShared memory refill bytes
[15]SSpillBShared memory spill bytes
[16]LowLmemSpillSizeLow-bound local memory spill size
[17]FrameLmemSpillSizeFrame-level local memory spill size
[18]LNonSpillBNon-spill local memory bytes
[19]LNonRefillBNon-refill local memory bytes
[20]NonSpillSizeTotal non-spill memory size

Format strings

The statistics printing subsystem (sub_A3A7E0) emits two lines for spill metrics:

# [est latency = %d] [LSpillB=%d] [LRefillB=%d], [SSpillB=%d],
  [SRefillB=%d], [LowLmemSpillSize=%d] [FrameLmemSpillSize=%d]
# [LNonSpillB=%d] [LNonRefillB=%d], [NonSpillSize=%d]

The function properties string (used in verbose output):

Function properties for %s
    %d bytes stack frame, %d bytes spill stores, %d bytes spill loads

Spill warning

When --warn-on-spills is active, the following warning is emitted for any function with spills:

Registers are spilled to local memory in function '%s',
    %d bytes spill stores, %d bytes spill loads

The flag is registered in sub_432A00 / sub_434320 and stored at compilation_ctx + 473.

Allocation Failure

When all spill retry attempts are exhausted and the allocator still cannot fit within the register budget, a fatal error is emitted:

Register allocation failed with register count of '%d'.
Compile the program with a higher register target

This error originates from sub_9714E0 (allocation finalization, 290 lines), which is the last function called in the retry pipeline. Two emission paths exist:

| Path | Function | Context |
|---|---|---|
| With source location | sub_895530 | Includes function name and PTX line number |
| Without source location | sub_7EEFA0 | Generic error when location unavailable |

The alternate allocator path sub_964130 (1794 lines) also references this error at six separate points, covering different failure/retry scenarios. A dedicated failure reporter sub_966490 (474 lines) handles error diagnostic formatting.

Spill-Refill Pattern Optimization

The Ori IR includes dedicated instruction type markers for spill/refill patterns, enabling post-allocation optimization of spill traffic:

| Type ID | Name | Description |
|---|---|---|
| 8 | Spill-refill | Spill/refill pair marker |
| 10 | Bit-spill | Single-bit spill (predicate register spill to GPR) |

The SpillRefill pass attempts to match and optimize these patterns. Error strings reveal three failure modes:

  1. "Failed to find matching spill for refilling load that is involved in this operand computation" -- the refill load has no corresponding spill store
  2. "Failed to establish match for bit-spill-refill pattern involved in this operand computation" -- the bit-spill pattern does not match expected form
  3. "Some instruction(s) are destroying the base of bit-spill-refill pattern involved in this operand computation" -- instructions between spill and refill clobber the base address register

Debug strings include " spill-regill bug " and " bit-spill bug " (both with typos present in the binary).

Function Map

| Address | Lines | Role | Confidence |
|---|---|---|---|
| sub_939BD0 | 65 | Spill allocator setup (knob 623 dispatch) | HIGH |
| sub_93C0B0 | 582 | Spill range optimizer | HIGH |
| sub_93D070 | 155 | Best allocation result recorder | HIGH |
| sub_93E290 | 397 | Spill candidate node creator (192-byte arena nodes) | HIGH |
| sub_93F130 | 544 | Spill code inserter | HIGH |
| sub_93FBE0 | 940 | Per-iteration state reset / slot assignment | HIGH |
| sub_940EF0 | 764 | Alternate spill slot assignment path | MEDIUM |
| sub_944740 | 138 | Interference count at program point | HIGH |
| sub_9449B0 | 418 | Cross-block liveness range calculator | HIGH |
| sub_94B200 | 642 | Spill weight accumulator | HIGH |
| sub_94E620 | 617 | Spill cost accumulator / liveness annotator | HIGH |
| sub_94F150 | 561 | Spill code generation (main entry) | HIGH |
| sub_94FDD0 | 155 | Register assignment + spill trigger | HIGH |
| sub_9539C0 | 1,873 | SMEM (shared memory) spill allocator | HIGH |
| sub_9714E0 | 290 | Allocation finalization / error emission | MEDIUM |
| sub_96D940 | 2,983 | Spill guidance engine (7 class-parallel queues) | HIGH |
| sub_971A90 | 355 | NOSPILL / SPILL retry driver | HIGH |
| sub_9850F0 | 520 | SASS-level spill instruction generator | HIGH |
| sub_9997D0 | -- | Spill cost initialization | MEDIUM |
| sub_9998A0 | -- | Spill cost computation | MEDIUM |
| sub_999950 | -- | Spill cost comparison | MEDIUM |
| sub_999AA0 | -- | Spill benefit estimation | MEDIUM |
| sub_9A8270 | -- | Live range spill cost computation (14 KB) | MEDIUM |

GPU ABI & Calling Convention

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas ABI engine implements the NVIDIA GPU calling convention for device-side function calls. It manages parameter register allocation, return address placement, scratch/preserved register classification, and per-function ABI lowering across the full range of SM architectures (sm_35 through sm_100+). The engine runs as a multi-pass pipeline invoked per-kernel from the per-function compilation driver (sub_98F430), positioned between optimization passes and the register allocator. It spans approximately 250 KB (276 functions) at address range 0x19C6230--0x1A00FFF.

  • Master ABI setup: sub_19D1AF0 (5608 bytes) -- orchestrates full per-function ABI pipeline
  • Per-pass lowering: sub_19DC4B0 (6459 bytes) -- 3-pass instruction transform driver
  • Opcode-level dispatch: sub_19CFC30 -- routes 11 opcodes to ABI handlers
  • Parameter allocator: sub_19CA730 (2277 bytes) -- 2048-bit free-list bitmap allocator
  • Return address validator: sub_19CDFF0 (7.5 KB) -- 12 diagnostic strings, warnings 7001--7009
  • Return address setup: sub_19D1720 (4.8 KB) -- validates and assigns return address registers
  • Register transfer lowering: sub_19CC1A0 (3873 bytes) -- generates MOV/STS/LDS/PRMT sequences
  • gb10b WAR: sub_19D9E00 + sub_19DA2A0 -- __nv_reservedSMEM_gb10b_war_var
  • Convergent checker: sub_19D13F0 (4.3 KB) -- allowConvAlloc boundary validation
  • Address range: 0x19C6230--0x1A00FFF (~250 KB, 276 functions)

Reserved Registers

Registers R0--R3 are reserved by the ABI and cannot be used for general allocation. The allocator enforces this with the diagnostic "Registers 0-3 are reserved by ABI and cannot be used for %s". These four registers serve fixed ABI roles (stack pointer, thread parameters, etc.) and are excluded from both parameter passing and general register allocation.

The reservation is unconditional across all SM generations. Any .maxreg directive or ABI specification that attempts to assign these registers to parameter or return roles triggers a diagnostic.

SM Generation Dispatch

The ABI engine determines the target SM generation by reading a field from the SM target descriptor:

generation = *(int*)(sm_target + 372) >> 12

| Generation value | SM targets | Key ABI differences |
|---|---|---|
| 3 | sm_35, sm_37 | Kepler ABI: no uniform registers, no convergent boundaries |
| 4 | sm_50, sm_52, sm_53 | Maxwell ABI: 16-register minimum, label fixups, coroutine insertion |
| 5 | sm_60--sm_89 | Pascal through Ada ABI: 24-register minimum, cooperative launch support |
| 9 | sm_90, sm_90a | Hopper ABI: 24-register minimum, uniform return address support |
| >9 | sm_100+ | Blackwell ABI: no minimum enforced (skips check), extended register reservation |

The minimum register count varies by generation. For generations 3--4 (sm_35 through sm_53), the ABI requires at least 16 registers per function. For generations 5--9 (sm_60 through sm_90a), the minimum is 24. Generations below 3 and above 9 skip the minimum check entirely. Violating these minimums emits warning 7016: "regcount %d specified below abi_minimum of %d". The abi_minimum value is computed with an unsigned range check, (unsigned)(generation - 5) < 5 ? 24 : 16, so generations 5--9 select 24 and everything else falls back to 16.

Master ABI Setup: sub_19D1AF0

The top-level ABI entry point (5608 bytes), called once per function by the per-function compilation driver sub_98F430. It orchestrates the full ABI pipeline in 10 steps:

function abi_master_setup(func, sm_target, abi_spec):
    // 1. Validate register count vs. ABI minimums
    generation = *(sm_target + 372) >> 12
    if generation in 3..4:  min_regs = 16    // sm_35-sm_53
    if generation in 5..9:  min_regs = 24    // sm_60-sm_90a
    if func.maxreg < min_regs:
        warn(7016, "regcount %d specified below abi_minimum of %d",
             func.maxreg, min_regs)

    // 2. Validate register reservation range
    if available_regs < requested_reservation:
        warn(7017, "register available %d for reservation is less "
             "than the requested number of registers %d",
             available_regs, requested_reservation)

    // 3. Validate coroutine SUSPEND semantics
    for each register in func.preserved_set:
        if register.is_scratch_at_suspend:
            warn(7011, "Register (%s%d)is defined as scratch on "
                 "SUSPEND but preserved for coroutine function",
                 register.class_name, register.index)

    // 4. Iterate callee list, mark ABI-callable entries
    for each callee in func.callees:
        callee.abi_flags |= ABI_CALLABLE
        propagate_abi_attributes(func, callee)

    // 5. Propagate register limits to callees
    abi_propagate_limits(func)               // sub_19CE590

    // 6. Check return-address / parameter overlap
    abi_overlap_precheck(func)               // sub_19CA3C0

    // 7. Allocate parameter registers
    abi_alloc_params(func)                   // sub_19CA730

    // 8. Validate return address assignment
    abi_return_addr_setup(func)              // sub_19D1720

    // 9. Detailed return address validation
    abi_return_addr_validate(func)           // sub_19CDFF0

    // 10. Adjust register file limits via vtable
    vtable[736](func, sm_target)

Parameter Passing

Parameters are passed in consecutive R registers starting from a configurable base register. The ABI tracks "number of registers used for parameter passing" and "first parameter register" as per-function properties. The parameter register range begins after the reserved registers (R0--R3) and the return address register.

Parameter Register Allocator: sub_19CA730

The core parameter allocator (2277 bytes, 98% confidence). It uses the 2048-bit free-list bitmap (v103[] plus trailing stack locals, 255 bytes in total) to track available register slots.

function abi_alloc_params(func):
    // Initialize 2048-bit free-list (256 bytes)
    bitmap[255] = {0xFF...}                      // all slots free

    // Mark reserved registers as occupied
    clear_bits(bitmap, 0, 3)                     // R0-R3 always reserved

    // Mark already-allocated registers
    popcount = register_usage_popcount(func)     // sub_19C99B0

    // Allocate PARAMETER registers
    for each param in func.parameters:
        align = param_alignment(param.type_width) // 4/8/16 bytes
        slot = find_contiguous_free(bitmap, param.reg_count, align)
        if slot == -1:
            error("Function %s size requires more registers(%d) "
                  "than available(%d)", func.name, needed, available)
            return FAILURE
        assign_register(slot, param)             // sub_7FA420
        mark_allocated(bitmap, slot, param.reg_count)  // sub_BDBB80

    // Allocate RETURN registers (same algorithm, separate class)
    for each ret in func.return_values:
        align = param_alignment(ret.type_width)   // recomputed per value
        slot = find_contiguous_free(bitmap, ret.reg_count, align)
        assign_register(slot, ret)
        mark_allocated(bitmap, slot, ret.reg_count)

The allocator processes parameters and return values as separate classes, each requiring contiguous register ranges with natural alignment. For 8-byte parameters, the base register must be even-aligned. For 16-byte parameters, the base must be 4-register-aligned.

The population count helper (sub_19C99B0, 2568 bytes) uses the __popcountdi2 intrinsic to count live registers in the function's usage bitmap, determining how many slots remain available.
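The core allocation step reduces to a first-fit scan over the free-list bitmap. The sketch below is a simplified model (one list entry per register slot, not the binary's byte-level representation) of the find/mark pair used in the pseudocode above.

```python
def find_contiguous_free(bitmap: list[int], count: int, align: int) -> int:
    """First-fit search for `count` contiguous free slots (entry == 1)
    whose base slot is a multiple of `align` (1, 2, or 4 registers)."""
    for base in range(0, len(bitmap) - count + 1, align):
        if all(bitmap[base:base + count]):
            return base
    return -1  # allocation failure: no contiguous aligned run

def mark_allocated(bitmap: list[int], base: int, count: int) -> None:
    # 0 = occupied, 1 = free
    for i in range(base, base + count):
        bitmap[i] = 0
```

For example, after reserving R0--R3, an 8-byte parameter (2 registers, even-aligned base) lands at slot 4, matching the alignment rules described above.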

Return Address Register

The return address occupies a dedicated register (or register pair) whose location is validated against parameter ranges. The diagnostic "Parameter registers from R%d to R%d overlap with return address register R%d to R%d" fires when parameter and return address ranges collide.

Return Address Modes

The return address validator (sub_19CDFF0, 7.5 KB, 99% confidence) handles four modes, selected by the v7 field in the ABI specification:

| Mode | v7 | Behavior |
|---|---|---|
| Fixed | 1 | Return address at register 4 + 2 = R6. Fixed by architecture. |
| Regular | 2 | General-purpose register, validated < max_reg. |
| Uniform | 3 | Uniform register (UR) for return address. Requires SM support (sm_75+). |
| Computed | 5 | Derived from parameter layout. Auto-aligned to even register number. |

Return Address Validator: sub_19CDFF0

The most thoroughly instrumented function in the ABI engine (7 distinct warning codes across two mode-specific paths). It performs these validations in sequence:

| Code | Condition | Message |
|---|---|---|
| 7001 | return_addr & 1 != 0 | "ABI return address %d is unaligned" |
| 7002 | return_addr >= max_reg | "Return Address (%d) should be less than %d" |
| 7003 | stack_ptr in [return_addr, return_addr+1] | "Return address (%d) should not overlap with the stack pointer (%d)" |
| 7004 | Return addr bit set in parameter bitmap | "Return Address %d overlaps with parameters in range %d - %d" |
| 7005 | param_end + align > max_reg (auto-placement) | "With specified parameters, return address is %d registers and exceeds specified max reg (%d)" |
| 7008 | return_addr < lower_bound or return_addr > upper_bound | "Return address (%d) should be between %d and %d" |
| 7009 | Mode 3 and !(func+1408 byte & 0x02) | "SM does not support uniform registers for return address" |

The checks are mode-dependent. Mode 2 (regular GPR) enters the 7002/7001/7003/7004 path. Modes 3 and 5 (uniform/computed) enter the 7009/7008/7001 path. Mode 1 and mode 5 share the auto-placement path where 7005 fires. Warning 7001 (unaligned) appears in both paths because 64-bit return address pairs always require even alignment.
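The mode-dependent split can be summarized in a sketch. The warning codes and conditions are the recovered ones; the function shape and exact in-binary check order are illustrative, and the auto-placement path (7005/7008) is omitted for brevity.

```python
def validate_return_addr(mode: int, ret_addr: int, max_reg: int,
                         stack_ptr: int, param_regs: set[int],
                         sm_supports_ur: bool) -> list[int]:
    """Return the warning codes that would fire for one callee spec.
    param_regs: register numbers already assigned to parameters."""
    warns = []
    if mode == 2:                              # regular GPR path
        if ret_addr >= max_reg:
            warns.append(7002)                 # exceeds register file
        if ret_addr & 1:
            warns.append(7001)                 # 64-bit pair needs even base
        if stack_ptr in (ret_addr, ret_addr + 1):
            warns.append(7003)                 # overlaps stack pointer
        if ret_addr in param_regs:
            warns.append(7004)                 # overlaps parameter range
    elif mode in (3, 5):                       # uniform / computed path
        if mode == 3 and not sm_supports_ur:
            warns.append(7009)                 # no UR support on target
        if ret_addr & 1:
            warns.append(7001)
    return warns
```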

Return Address Setup: sub_19D1720

The setup function (4.8 KB, 95% confidence) runs before the validator. It propagates ABI flag 0x04 to the function state (byte 1389), validates that the return address register (register 1) is not classified as scratch when it must be preserved (warning 7012: "%d register should not be classified as scratch"), sizes the preserved register set to 255 entries via sub_BDBAD0, and computes the effective register range as return_size + param_size for comparison against the maximum available. The 7012 check fires when *(abi_spec+88) & 0x01 and *(abi_spec+48) & 0x02 are both set, always with argument 1 (the return address register).

The function also enforces the mutual exclusion rule (warning 7006): "ABI allows either specifying return address or return address before params". This fires when mode is 1 (fixed, "return address before params") but an explicit return address register is also assigned (return_addr != -1). You pick one strategy, not both.

Scratch Data Registers

Registers not reserved by the ABI and not used for parameters or return values may be classified as scratch (callee-clobbered). The ABI engine tracks scratch classification per register and validates it against coroutine semantics. At SUSPEND points in coroutine functions, a register marked as scratch must not also appear in the preserved set. Violation triggers warning 7011.

The scratch/preserved classification feeds into the register allocator's spill decisions. Registers marked as scratch across a call boundary must be saved by the caller; preserved registers must be saved by the callee.

Per-Pass Instruction Lowering: sub_19DC4B0

The instruction-level ABI transform driver (6459 bytes, 95% confidence). Called from both sub_98F430 and sub_A9DDD0. It makes three passes over the instruction stream, each performing different transformations:

Pass 1 -- Convergent Boundary Fixup

  • Fixes convergent boundary annotations (allowConvAlloc).
  • Handles SHFL.NI (shuffle, no-index) fixups for intra-warp communication.
  • Propagates the .uniform bit on CAL (call) instructions.

Pass 2 -- Instruction Lowering

Lowers high-level Ori opcodes into ABI-conforming SASS sequences:

| Ori opcode | Mnemonic | Transform |
|---|---|---|
| 109 | CALL | Parameter register setup, save/restore insertion |
| 16 | ST | Shared memory store lowering |
| 77 | LD | Shared memory load lowering |
| 185 | ATOMG | Atomic operation lowering |
| 183 | (special) | Mode 2/3 reclassification |

Pass 3 -- Architecture-Specific Fixups

Conditioned on SM generation:

sm_50 (generation == 4): Label fixups, coroutine code insertion, shared memory WAR insertion, convergent boundary checks.

sm_60+ (generation == 5): Additional register reservation for ABI conformance, cooperative launch handling, extended register file support.

All architectures: Per-block instruction scanning for opcode 195 (MOV) and opcode 205 reclassification. Register reservation range setup via sub_7358F0 / sub_7AC150.

Opcode-Level ABI Dispatch: sub_19CFC30

A dispatcher called twice from sub_98F430 that routes individual opcodes to specialized ABI handlers:

| Ori opcode | Handler | Transform |
|---|---|---|
| 9 | sub_19CF9A0 | PRMT (permute) lowering |
| 54 | (inline) | Function parameter preallocation |
| 72 | sub_19CDED0 + sub_19CB590 + sub_19CB7E0 | SMEM reservation + pre/post call register save/restore |
| 98 | sub_19CBAC0 | Shared load (LD.S) ABI lowering |
| 159 | sub_19CD0D0 | Barrier instruction lowering |
| 164 | sub_19CC1A0 | Register load (transfer lowering) |
| 168 | sub_19CC1A0 | Register store (transfer lowering) |
| 183 | sub_19CBE00 | Special instruction fixup |
| 226 | sub_19CD950 | Predicate lowering |
| 236 | sub_19CD510 | Conversion instruction lowering |
| 335 | sub_19CDED0 | SMEM reservation instruction handler |

Register Transfer Lowering: sub_19CC1A0

The register-to-register transfer lowering function (3873 bytes, 95% confidence). It converts abstract register load/store operations (opcodes 164 and 168) into concrete SASS instruction sequences. The lowering path depends on the ABI function properties:

Direct copy path (byte 12 == 0): Register-to-register MOV instructions.

| Data width | Generated sequence |
|---|---|
| 4 bytes (32-bit) | Single MOV-like (opcode 130 / 0x82, HSET2 in ROT13; actual SASS MOV is opcode 19) |
| 8 bytes (64-bit) | STS + LDS pair (opcodes 0x86/0x85) through shared memory |
| Permute | PRMT (opcode 0x120) for byte-lane rearrangement |

Shared memory indirect path (byte 13 == 1): All transfers go through shared memory via STS/LDS pairs, using a reserved shared memory region as a scratch buffer. This path is used when direct register-to-register transfer is not possible (e.g., cross-warp parameter passing on older architectures or when the register file is partitioned).

The function also generates opcode 0xB7 (special) for shared-memory-based transfers that require additional synchronization. It calls sub_92E800 (instruction builder) for each generated SASS instruction.

Convergent Boundary Enforcement

Two functions enforce convergent allocation boundaries for function calls annotated with allowConvAlloc:

Convergent boundary checker (sub_19D13F0, 4.3 KB): Walks the basic block list, builds a bitmask of convergent register definitions, and validates that every allowConvAlloc-annotated call has a proper convergent boundary. Emits "Missing proper convergent boundary around func call annotated with allowConvAlloc" when the boundary is absent.

CONV.ALLOC insertion (sub_19D7A70, 3313 bytes): Scans the instruction list for convergent boundary violations. When a register def flows to a convergent use through a non-convergent path, inserts a CONV.ALLOC placeholder instruction (opcode 0x11E = 286). Uses a 64-bit-word bitmask array to track which register slots are live across convergent boundaries.

The single-call checker (sub_19C6400) warns when a convergent region contains more than one call: "Multiple functions calls within the allowConvAlloc convergent boundary".

Coroutine Support

Functions with coroutine support (flag 0x01 at function byte +1369) receive special ABI handling. Registers that are live across SUSPEND points must be saved to and restored from the coroutine frame.

Coroutine SUSPEND handler (sub_19D5F10, 1568 bytes): Scans the instruction stream for suspend points. For each register defined before and used after a SUSPEND, inserts save/restore pairs to/from the coroutine frame.

Coroutine frame builder (sub_19D4B80, 1925 bytes): Constructs the frame layout for coroutine-style functions, allocating slots for each register that must survive a SUSPEND.

The ABI engine validates that the scratch/preserved classification is consistent with coroutine semantics. Warning 7011 fires when a register marked as scratch at a SUSPEND point is also required to be preserved for the coroutine function. Warning 7012 fires when the return address register itself is misclassified as scratch.

gb10b Hardware WAR

Two functions implement a shared-memory-based workaround for a hardware bug on the gb10b variant (SM 75, Turing). Both reference the reserved symbol __nv_reservedSMEM_gb10b_war_var.

Entry block variant (sub_19D9E00): Generates a complex instruction sequence using additional temp registers (opcodes ADD, MOV, BAR) for the function entry block.

Body variant (sub_19DA2A0, 95% confidence): Generates a 7-instruction SASS sequence:

1. MOV.C  temp_reg, <constant>           // opcode 195, class 3
2. LD.S   temp_reg, [__nv_reservedSMEM_gb10b_war_var]  // opcode 98
3. AND    temp_reg, temp_reg, 4           // opcode 214
4. SETP   P, temp_reg, 0x4000            // opcode 272
5. STS    [__nv_reservedSMEM_gb10b_war_var], temp_reg   // opcode 277
6. @P BRA target                          // opcode 18, predicated
7. MOV    result, 0                       // zero-initialization

The reserved SMEM checker (sub_19DDEF0, 1687 bytes) iterates instructions looking for opcode 335 (SMEM reservation). When found and the function is not allowed to use reserved shared memory, it emits warning 7801: "Function '%s' uses reserved shared memory when not allowed.".

ABI Register Limit Propagation

The limit propagator (sub_19CE590) handles inter-procedural ABI attribute forwarding. For SM generations 4 and 5 (sm_50, sm_60 families), it iterates the call graph and copies the max-register limit from caller to callee (field +264 to +268) unless the callee has an explicit ABI specification. This ensures that callees do not exceed the register budget established by their callers.
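A simplified model of that propagation, with the call graph and ABI state flattened into dictionaries (the real pass walks linked callee entries and copies field +264 to +268):

```python
def propagate_limits(call_graph: dict[str, list[str]],
                     max_reg: dict[str, int],
                     has_abi_spec: dict[str, bool]) -> None:
    """Copy each caller's max-register limit onto callees that lack
    an explicit ABI specification (sketch of sub_19CE590)."""
    for caller, callees in call_graph.items():
        for callee in callees:
            if not has_abi_spec.get(callee, False):
                # callee inherits the caller's register budget
                max_reg[callee] = max_reg[caller]
```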

Call Instruction ABI Lowering: sub_19D41E0

The call lowering function (2247 bytes, 85% confidence) processes each call instruction (opcode 97; STG in the ROT13 name table, but used here as an internal CALL-like marker -- actual SASS CALL is opcode 71) in the function. For each call site it:

  1. Sets up parameter passing registers according to the callee's ABI specification.
  2. Inserts pre-call register save sequences for caller-saved registers.
  3. Modifies the call target to use ABI-conforming register assignments.
  4. Inserts post-call register restore sequences.

Register File Types

The ABI handles three register file types, each with distinct allocation rules:

| Type | v7 value | File | Range | SM requirement |
|---|---|---|---|---|
| GPR | 2 | General-purpose | R0--R255 | All architectures |
| Uniform | 3 | Uniform GPR | UR0--UR63 | sm_75+ |
| Predicate | 5 | Predicate | P0--P7 | All architectures |

Uniform registers (type 3) are only available on sm_75 and later. Attempting to use a uniform register for the return address on an older SM triggers warning 7009.

Pipeline Integration

The ABI engine sits between the optimization passes and the register allocator in the ptxas pipeline:

... optimization passes ...
  Late Legalization / Expansion
  ABI Master Setup              <-- sub_19D1AF0 (per-function)
  ABI Pass 1 (convergent)       <-- sub_19DC4B0 (a2=1)
  ABI Pass 2 (lowering)         <-- sub_19DC4B0 (a2=2)
  ABI Opcode Dispatch           <-- sub_19CFC30 (2x)
  ABI Pass 3 (arch-specific)    <-- sub_19DC4B0 (a2=3)
  Register Allocation           <-- sub_9721C0
  Instruction Scheduling
  SASS Encoding

The ABI engine produces new SASS instructions via sub_934630 / sub_9314F0 (instruction builder/inserter) and uses sub_91BF30 (temp register allocation) for scratch registers needed during lowering. During final emission, the encoding functions in Zone B (0x1A01000--0x1A76F30) convert the ABI-lowered instructions into binary SASS words.

ABI State Object Layout

The ABI engine operates on three nested data structures: the ABI engine context (the this pointer passed as a1 to all ABI functions), the per-callee ABI specification (one per callee in the call graph), and parameter/return descriptor entries (one per parameter or return value). All offsets are byte offsets from the structure base.

ABI Engine Context

The top-level per-function ABI state, passed as a1 to sub_19D1AF0, sub_19CA730, sub_19CDFF0, and sub_19D1720. Total size is at least 4672 bytes.

| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 8 | ptr | vtable | Dispatch table; method at +144 dispatches per-callee validation, +152 selects register reservation strategy |
| +8 | 8 | ptr | func_ctx | Pointer to per-function compilation context (1716+ bytes); accessed everywhere as *(_QWORD *)(a1+8) |
| +16 | 1 | byte | abi_mode_flags | Master ABI mode selector; 0 = no ABI lowering, nonzero = full pipeline |
| +64 | 4 | int | max_param_offset | Highest parameter register offset seen during callee iteration |
| +76 | 4 | int | preserved_param_start | Start register for preserved parameter range |
| +80 | 4 | int | preserved_param_align | Alignment requirement for preserved parameter range |
| +88 | 8 | ptr | current_callee_entry | Pointer to the callee entry node being processed in the current iteration |
| +97 | 1 | byte | skip_popcount | When set, skips the register usage population count (sub_19C99B0) |
| +98 | 1 | byte | has_return_addr_spec | Set to 1 when any callee has a return address ABI specification |
| +4428 | 4 | int | cached_reg_R3 | Cached physical register ID for R3 (from sub_7FA420(regfile, 6, 3)) |
| +4432 | 4 | int | cached_reg_R2 | Cached physical register ID for R2 (from sub_7FA420(regfile, 6, 2)) |
| +4449 | 1 | byte | first_callee_seen | Set after the first callee with an ABI spec is processed; controls whether per-class reservation bitmaps are populated or inherited |
| +4456 | 16+ | bitvec | param_alloc_bitmap | Bitvector tracking which physical registers have been assigned to parameters; manipulated via sub_BDBB80 (set bit), sub_BDDCB0 (find highest), sub_BDDD40 (popcount) |
| +4472 | 4 | int | param_alloc_count | Number of registers allocated for parameter passing |
| +4480 | 16+ | bitvec | retval_alloc_bitmap | Bitvector tracking which physical registers have been assigned to return values |
| +4496 | 4 | int | retval_alloc_count | Number of registers allocated for return values |
| +4528 | 144 | bitvec[6] | per_class_reservation | Per-register-class ABI reservation bitmaps; 6 entries (classes 1--6), 24 bytes each; the loop in sub_19D1AF0 iterates v148 from 1 to 6, incrementing the pointer by 3 qwords per iteration |

The param_alloc_bitmap and retval_alloc_bitmap are used after parameter/return allocation to compute the effective register file occupancy. The master setup reads the highest set bit in each (sub_BDDCB0) to determine func_ctx+361 (total register demand) and compares against func_ctx+367 (register file limit).

Per-Callee ABI Specification

Pointed to by *(callee_entry + 64). One instance per callee in the call graph. Describes how parameters are passed, return values are received, and the return address is placed. Accessed as v3/v12/v14 (cast to _DWORD *) in the decompiled code, so integer-indexed fields are at 4-byte stride.

| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 4 | int | param_count | Number of parameter descriptor entries |
| +4 | 4 | int | return_count | Number of return value descriptor entries |
| +8 | 8 | ptr | param_descriptors | Pointer to array of 32-byte parameter descriptor entries |
| +16 | 8 | ptr | return_descriptors | Pointer to array of 32-byte return value descriptor entries |
| +24 | 4 | int | return_addr_register | Explicit return address register number; -1 = unassigned |
| +28 | 4 | int | return_addr_mode | Return address placement strategy (see table below) |
| +32 | 4 | int | first_param_register | First register available for parameter passing; -1 = use default |
| +36 | 4 | int | available_reg_count | Number of registers available; -1 = target default, -2 = computed from target descriptor |
| +40 | 1 | byte | ret_addr_before_params | If set, return address is placed before the parameter range |
| +44 | 4 | int | preserved_reg_type | Preserved register specification type; 1 triggers per-register scratch bitmap construction |
| +48 | 8 | uint64 | scratch_gpr_bitmask | Bit 1 (& 2) = scratch classification active for GPR return address register |
| +57 | 1 | byte | has_abi_spec | Master enable: 0 = callee has no ABI specification, 1 = specification is active |
| +58 | 1 | byte | allocation_complete | Set to 1 after parameter/return allocation finishes successfully |
| +64 | 8 | ptr | abi_detail_ptr | Pointer to extended ABI detail sub-object (preserved bitmasks, scratch classification) |
| +80 | 8 | uint64 | preserved_pred_bitmask | Per-predicate-register preserved bitmask; bit N = predicate register N is preserved |
| +88 | 4 | uint32 | preserved_class_flags | Bit 0 (& 1) = GPR preserved set active; bit 1 (& 2) = scratch classification active |
| +96 | 1 | byte | return_addr_validated | Set to 1 after sub_19CDFF0 completes validation for this callee |

Return address mode values (field +28):

| Value | Mode | Behavior |
|---|---|---|
| 1 | Fixed | Return address at first_param_register + 2 (e.g., R6 when base is R4) |
| 2 | Regular | General-purpose register, validated < max_reg |
| 3 | Uniform | Uniform register (UR), requires SM75+ (func_ctx+1408 & 0x02) |
| 5 | Computed | Derived from parameter layout, auto-aligned to even register boundary |

Parameter/Return Descriptor Entry (32 bytes)

Each parameter or return value is described by a 32-byte entry. The allocator iterates the parameter array with stride 32 (v34 += 32 per parameter) and the return array identically (v43 += 32 per return value).

| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 4 | int | element_count | Number of elements (e.g., 4 for a float4) |
| +4 | 4 | int | element_size | Size per element in bytes (e.g., 4 for float) |
| +8 | 4 | int | alignment_hint | Alignment in bytes, clamped to [4, 16]; 8 = even-aligned, 16 = quad-aligned |
| +12 | 1 | byte | is_register_allocated | 0 = stack-passed (fallback), 1 = register-allocated |
| +16 | 4 | int | assigned_register_id | Physical register ID assigned by the allocator (from sub_7FA420) |

The total byte size is element_count * element_size. The register count is ceil(total_bytes / 4), computed as (total + 3) >> 2. The alignment mask applied to register slot selection is -(alignment_hint >> 2), producing a bitmask that enforces natural alignment: 8-byte parameters require even-aligned base registers, 16-byte parameters require 4-register-aligned bases.
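The descriptor arithmetic above is small enough to reproduce directly. The formulas are the recovered ones; the helper names are illustrative. Note that Python's arbitrary-precision negatives make the mask trick behave exactly like the two's-complement original.

```python
def reg_count(element_count: int, element_size: int) -> int:
    # ceil(total_bytes / 4), computed as (total + 3) >> 2
    return (element_count * element_size + 3) >> 2

def align_mask(alignment_hint: int) -> int:
    # -(alignment_hint >> 2): 4 -> -1 (any base), 8 -> -2 (even base),
    # 16 -> -4 (4-register-aligned base)
    return -(alignment_hint >> 2)

def aligned_base(slot: int, alignment_hint: int) -> int:
    # applying the mask rounds a candidate slot down to a legal base
    return slot & align_mask(alignment_hint)
```

For example, a float4 parameter (element_count=4, element_size=4) occupies 4 registers and, with alignment_hint=16, must start at a multiple-of-4 register.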

2048-Bit Free-List Bitmap (Stack Local)

The parameter allocator (sub_19CA730) constructs a 2048-bit free-list bitmap as a stack-local variable (not stored in the engine context). It is declared as v103[31] (248 bytes of QWORD array) plus v104 (4 bytes), v105 (2 bytes), and v106 (1 byte), totaling 255 bytes.

Initialization:
  memset(v103, 0xFF, 248)     // 248 bytes all-ones
  v104 = 0xFFFFFFFF           // 4 bytes
  v105 = 0xFFFF               // 2 bytes
  v106 = 0xFF                 // 1 byte
  Result: 2040 bits all-ones (255 bytes)

A bit value of 1 means the register slot is free; 0 means occupied. The bitmap is indexed relative to first_param_register, not absolute R0. When a contiguous run of free slots is found for a parameter, the allocator zeroes the corresponding bytes using a size-optimized zeroing sequence (special-cased for lengths < 4, == 4, and >= 8 bytes). After allocation, the assigned registers are also recorded in the persistent bitvectors at +4456 (parameters) and +4480 (return values) via sub_BDBB80.

The bitmap supports up to 2040 register slots, far exceeding the 255-register GPR limit. This over-provisioning accommodates the allocator's use for both parameter and return value allocation in a single bitmap, and provides headroom for potential multi-class allocation in future architectures.
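The stack-local layout can be reproduced byte for byte. Only the component sizes and the 1-means-free convention come from the decompilation; the helper names are illustrative.

```python
def init_free_list() -> bytearray:
    # v103[31] (248 bytes) + v104 (4) + v105 (2) + v106 (1) = 255 bytes
    bitmap = bytearray(b"\xff" * 248)                # memset(v103, 0xFF, 248)
    bitmap += (0xFFFFFFFF).to_bytes(4, "little")     # v104
    bitmap += (0xFFFF).to_bytes(2, "little")         # v105
    bitmap += bytes([0xFF])                          # v106
    return bitmap                                    # 2040 bits, all free

def slot_is_free(bitmap: bytearray, slot: int) -> bool:
    # bit value 1 = register slot free; slots are relative to
    # first_param_register, not absolute R0
    return bool(bitmap[slot >> 3] & (1 << (slot & 7)))
```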

Target Descriptor Fields Referenced by ABI Engine

The ABI engine accesses the target descriptor (at func_ctx+1584) through these offsets during ABI setup:

| Offset | Type | Purpose |
|---|---|---|
| +372 | int | SM generation index (value >> 12; 3=Kepler, 4=Maxwell, 5=Pascal+, 9=Hopper, >9=Blackwell) |
| +452 | int | SM version number; > 4 gates 64-bit return address pair semantics |
| +616 | int | Available register count ceiling for the target |
| +636 | int | Register count subtraction base (for computed available_reg_count) |
| +896 | vfunc | Register range query; called with (target, func_ctx, &query, 6), returns low/high range pair at query+24 |
| +2096 | vfunc | Register class capacity query; called with (target, reg_class) |
| +3000 | vfunc | Validator callback; nullsub_464 = no-op (validation skipped) |
The vtable call at +896 takes a 32-byte query structure initialized to {hi=-1 lo=0, 0, 0, 0, 0, 148, 148, -1}. The result at query +24 (as two 32-bit halves) returns the reserved register range boundaries. This is used by warnings 7014 (reserved range overlaps parameters) and 7017 (insufficient registers for reservation).

ABI Validation Diagnostics

The ABI engine emits 15 distinct warning codes (7001--7017) from six functions. Two codes are unused in this binary version (7007, 7018). All codes share the contiguous hex ID range 0x1B59--0x1B69 and are emitted through two parallel paths: sub_7EEFA0 (standalone diagnostic buffer) and sub_895530 (context-attached diagnostic using the compilation context at *(func+48)).

Complete Warning Catalog

| Code | Hex | Emitter | Message | Trigger |
|---|---|---|---|---|
| 7001 | 0x1B59 | sub_19CDFF0 | "ABI return address %d is unaligned" | return_addr & 1 != 0 (odd register for 64-bit pair) |
| 7002 | 0x1B5A | sub_19CDFF0 | "Return Address (%d) should be less than %d" | return_addr >= max_reg (exceeds register file) |
| 7003 | 0x1B5B | sub_19CDFF0 | "Return address (%d) should not overlap with the stack pointer (%d)" | Stack pointer falls within [return_addr, return_addr+1] |
| 7004 | 0x1B5C | sub_19CDFF0 | "Return Address %d overlaps with parameters in range %d - %d" | Return addr bit set in parameter allocation bitmap |
| 7005 | 0x1B5D | sub_19CDFF0 | "With specified parameters, return address is %d registers and exceeds specified max reg (%d)" | Auto-placed return addr pushed beyond register file limit |
| 7006 | 0x1B5E | sub_19D1720 | "ABI allows either specifying return address or return address before params" | Mode 1 (fixed) with explicit return_addr != -1 |
| 7007 | 0x1B5F | -- | -- | Unused/reserved in this binary version |
| 7008 | 0x1B60 | sub_19CDFF0 | "Return address (%d) should be between %d and %d" | Return addr outside valid range from target vtable query |
| 7009 | 0x1B61 | sub_19CDFF0 | "SM does not support uniform registers for return address" | Mode 3 (uniform) on target without UR support (!(func+1408 & 0x02)) |
| 7010 | 0x1B62 | sub_13B6DF0 | "Relative 32-bit return address requires a caller-save 64-bit scratch register pair" | 32-bit relative call without available scratch pair |
| 7011 | 0x1B63 | sub_19D1AF0 | "Register (%s%d)is defined as scratch on SUSPEND but preserved for coroutine function" | Register in preserved set is scratch in SUSPEND bitmap |
| 7012 | 0x1B64 | sub_19D1720, sub_19D1AF0 | "%d register should not be classified as scratch" | Preserved ABI register (return addr) misclassified as scratch |
| 7013 | 0x1B65 | sub_19CA730 | "%d register used to return value cannot be classified as preserved" | Return-value register appears in preserved bitmap |
| 7014 | 0x1B66 | sub_19CA730 | "Reserved register range %d - %d overlaps with parameters in range %d - %d" | Explicit reserved range collides with parameter range |
| 7015 | 0x1B67 | sub_19C69D0 | "Reserved register range %d - %d overlaps with retAddr %d" | Reserved range collides with return address register |
| 7016 | 0x1B68 | sub_19D1AF0 | "regcount %d specified below abi_minimum of %d" | func.maxreg below generation minimum (16 or 24) |
| 7017 | 0x1B69 | sub_19D1AF0 | "register available %d for reservation is less than the requested number of registers %d " | Available regs after reservation base < requested count |

Diagnostic Emission Architecture

The ABI engine uses three diagnostic emitters:

sub_7EEFA0 (standalone path): Takes a stack buffer, the decimal warning code, and a printf-format string. Used as the fallback when no compilation context is available (when *(*(func)+48) == NULL). This is the path that produces warnings visible in non-context mode (e.g., standalone ptxas invocations).

sub_895530 (context-attached path): Takes the function object, the output context, flags (always 0), the hex warning code, and the format string. Used when the compilation context exists. This is the primary path during normal nvcc-driven compilation.

sub_7F7C10 (conditional emitter): Returns a bool indicating whether the diagnostic was accepted (not suppressed by the diagnostic context at func+1176). Used exclusively for warning 7011 (SUSPEND). When it returns true, the caller additionally invokes sub_8955D0 to attach the diagnostic to the compilation context.

Validation Order

The ABI master setup (sub_19D1AF0) invokes validators in this order:

1. regcount vs. abi_minimum       -> 7016
2. register reservation overflow  -> 7017
3. return address setup           -> 7006, 7012  (sub_19D1720)
4. parameter allocation           -> 7013, 7014  (sub_19CA730)
5. reserved range vs. retAddr     -> 7015         (sub_19C69D0)
6. return address validation      -> 7001-7005, 7008, 7009  (sub_19CDFF0)
7. coroutine SUSPEND validation   -> 7011, 7012

Unreferenced ABI Strings

Three ABI-related strings exist in ptxas_strings.json with no cross-references in the decompiled binary. They may be dead code, referenced via indirect dispatch, or used only in debug builds:

  • "Caller and callee expected to have different return address register but '%s' and '%s' both use R%d as return address register"
  • "Function '%s' specifies register R%d as scratch register which is used as return address register"
  • "Mismatch in return address abi when '%s' calls '%s'"

Function Map

| Address | Size | Confidence | Role |
|---------|------|------------|------|
| sub_19C6400 | ~200 | 90% | Convergent boundary single-call checker |
| sub_19C69D0 | ~600 | 90% | Reserved register overlap checker |
| sub_19C7350 | ~900 | 80% | Register bitmap manipulation helper |
| sub_19C7890 | ~600 | 80% | Register range validator |
| sub_19C7B20 | ~600 | 80% | Register alignment checker |
| sub_19C7D60 | ~700 | 80% | Register pair allocator helper |
| sub_19C8040 | ~700 | 80% | Register contiguous-range finder |
| sub_19C84A0 | 1927 | 85% | Multi-function register dispatcher |
| sub_19C8D30 | ~600 | 80% | Register usage merger |
| sub_19C9010 | ~700 | 85% | Per-function register limit setter |
| sub_19C92F0 | ~1050 | 85% | Register bitmap AND/OR combiner |
| sub_19C99B0 | 2568 | 90% | Register usage population counter |
| sub_19CA3C0 | ~300 | 95% | Return address overlap pre-check |
| sub_19CA730 | 2277 | 98% | Parameter register allocator |
| sub_19CB020 | ~200 | 85% | Shared-mem base address calculator |
| sub_19CB230 | ~200 | 85% | Shared-mem offset calculator |
| sub_19CB590 | ~350 | 80% | Post-call register restore |
| sub_19CB7E0 | ~350 | 80% | Pre-call register save |
| sub_19CBAC0 | ~600 | 85% | Shared load (LD.S) ABI lowering |
| sub_19CBE00 | ~600 | 85% | Special instruction ABI fixup |
| sub_19CC1A0 | 3873 | 95% | Register transfer lowering (STS/LDS) |
| sub_19CD0D0 | ~1050 | 85% | Barrier instruction ABI lowering |
| sub_19CD510 | ~900 | 85% | Conversion instruction ABI lowering |
| sub_19CD950 | ~700 | 85% | Predicate lowering |
| sub_19CDDB0 | ~200 | 80% | Reserved SMEM helper |
| sub_19CDED0 | ~200 | 85% | SMEM reservation instruction handler |
| sub_19CDFF0 | ~7500 | 99% | Return address validator |
| sub_19CE590 | ~300 | 90% | Register limit propagator |
| sub_19CE6D0 | ~300 | 85% | ABI flag propagator |
| sub_19CEEF0 | ~200 | 80% | ABI attribute copier |
| sub_19CF030 | ~200 | 80% | Function entry ABI setup |
| sub_19CF140 | ~700 | 85% | Register-save sequence builder |
| sub_19CF530 | ~350 | 80% | Parameter setup helper |
| sub_19CF9A0 | ~600 | 85% | PRMT instruction ABI lowering |
| sub_19CFC30 | ~500 | 95% | Opcode-based ABI dispatch |
| sub_19D01E0 | ~1200 | 85% | Multi-callee ABI propagation |
| sub_19D0680 | ~300 | 80% | Iterator initialization |
| sub_19D0A80 | ~200 | 80% | Iterator filter setup |
| sub_19D0AF0 | ~100 | 95% | Iterator filter check |
| sub_19D0BC0 | ~40 | 95% | Iterator advance (next instruction) |
| sub_19D0C10 | ~40 | 95% | Iterator advance (next matching) |
| sub_19D0C70 | ~40 | 95% | Iterator advance (skip non-matching) |
| sub_19D0CE0 | ~40 | 95% | Iterator advance (reverse) |
| sub_19D0EE0 | ~40 | 95% | Iterator reset |
| sub_19D1030 | ~200 | 80% | Iterator state query |
| sub_19D13F0 | ~4300 | 90% | Convergent boundary checker |
| sub_19D1720 | ~4800 | 95% | ABI return address setup |
| sub_19D1AF0 | 5608 | 98% | Master ABI setup |
| sub_19D32C0 | 1902 | 85% | Per-block register reservation builder |
| sub_19D41E0 | 2247 | 85% | CALL instruction ABI lowering |
| sub_19D4B80 | 1925 | 85% | Coroutine frame builder |
| sub_19D5850 | ~900 | 80% | Shared-mem instruction lowering |
| sub_19D5F10 | 1568 | 85% | Coroutine SUSPEND handler |
| sub_19D67B0 | ~800 | 80% | Function exit ABI lowering |
| sub_19D7160 | ~600 | 85% | Sub-pass: scan for ABI-relevant ops |
| sub_19D7470 | 1526 | 80% | Register classification propagator |
| sub_19D7A70 | 3313 | 85% | CONV.ALLOC insertion (dead instruction insertion) |
| sub_19D8CE0 | ~1100 | 80% | Register save/restore pair generator |
| sub_19D9290 | ~1000 | 80% | Register live range computation |
| sub_19D9710 | ~1000 | 80% | Register conflict detector |
| sub_19D9E00 | ~700 | 95% | gb10b WAR code generator (entry) |
| sub_19DA2A0 | ~500 | 95% | gb10b WAR code generator (body) |
| sub_19DA8F0 | 1580 | 80% | SSA-form instruction rebuilder |
| sub_19DAF20 | ~1300 | 80% | Multi-dest instruction splitter |
| sub_19DB440 | ~700 | 80% | Additional register reservation pass |
| sub_19DC070 | ~900 | 85% | Sub-pass dispatcher |
| sub_19DC4B0 | 6459 | 95% | Per-pass instruction lowering |
| sub_19DDEF0 | 1687 | 95% | Reserved SMEM checker |
| sub_19DE8F0 | 1842 | 80% | Register renaming for ABI conformance |
| sub_19DF170 | 1928 | 80% | Instruction list rewriter |

Instruction Scheduler Overview

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas instruction scheduler is a priority list scheduler with a 3-phase architecture. A single top-level orchestrator (sub_8D0640, ScheduleInstructions) drives three passes through one unified scheduling engine (sub_688DD0), each configured by a mode parameter that selects a different optimization objective: register pressure reduction, ILP/latency hiding, or dynamic batch optimization for tensor warpgroup operations. The scheduler runs twice in the ptxas pipeline -- once before register allocation on virtual registers (pre-scheduling) and once after physical register assignment (post-scheduling).

The scheduler consumes a dependency DAG built over the instruction list and produces a final instruction ordering together with SASS control words encoding stall counts, yield hints, barrier assignments, and scoreboard dependencies. The entire subsystem spans roughly 436 KB of code (0x893000--0x8FE000) with an additional 250 KB of supporting infrastructure in the 0x67F000--0x6A0000 range.

| Component | Details |
|-----------|---------|
| Orchestrator | sub_8D0640 (22 KB) -- ScheduleInstructions |
| Unified engine | sub_688DD0 (20 KB) -- mode-parameterized scheduling loop |
| Priority function | sub_8C9320 (47 KB) -- multi-criteria heuristic |
| Ready list builder | sub_6820B0 (1.5 KB) -- zero-predecessor scan |
| Dependency graph | sub_8CF880 (28 KB) + sub_8D9930 (19 KB) |
| Register budget | sub_8CEE80 (8.7 KB) -- occupancy-aware computation |
| HW latency profiles | sub_8E7300--sub_8E9DC0 -- per-SM tables |
| Opcode table | sub_896D50 (90 KB) -- ROT13-encoded SASS mnemonics |
| Scheduling arena | sub_8E3970 / sub_8E3A80 -- bump allocator |
| Key knobs | 76 Sched* knobs; see Configuration |
| Enable gate | "ScheduleInstructions" named option at (a1+8)+1664 |

3-Phase Pipeline

The orchestrator sub_8D0640 executes the following sequence. All three scheduling phases invoke the same unified engine sub_688DD0 -- the only difference is the mode byte passed as the second argument.

function ScheduleInstructions(sched):
    // 1. Build dependency graph
    BuildDependencyGraph(sched, func)         // sub_8CF880
    vtable[29](sched)                         // InitScheduleData
    PreScheduleSetup(sched, opt_level > 2)    // sub_8CBAD0

    // 2. Gate check
    if not KnobGetBool("ScheduleInstructions"):
        return

    // 3. Set mode flags from knobs 419 (LivenessCountRegComp), 420 (LivenessUseHiLo)
    sched.flags |= (knob_419 << 3) | (knob_420 << 4)

    // 4. Optionally create register pressure tracker
    if sched.flags & 0x10:
        sched.scoreboard = alloc(952)         // sub_69A1A0
        sched.tracker    = alloc(208)         // sub_6B8F70
        if sched.flags & 0x100:
            sched.warp_analysis = alloc(856)  // sub_6BB7C0

    // 5. Reset per-instruction SchedNode fields between passes
    //    (iterates func+104 metadata chain, NOT instruction list)
    for sched_node in func.sched_node_list:       // linked via func+104
        sched_node.depChainHead  = 0              // QWORD +56
        sched_node.extendedState = 0              // QWORD +104
        sched_node.schedulingCost  = 0            // DWORD +76
        sched_node.schedulingClass = -1           // DWORD +84, sentinel

    // 6. Phase 1 — ReduceReg
    if KnobGetBool("ScheduleInstructionsReduceReg"):
        ScheduleEngine(sched, mode=0x39, ...)   // sub_688DD0

    // 7. Phase 2 — Reverse scheduling (ILP / latency)
    ReverseSchedule(sched)                      // sub_8CD6E0
    ComputeRegisterBudget(sched)                // sub_8CEE80

    // 8. Phase 3 — DynBatch
    if KnobGetBool("ScheduleInstructionsDynBatch"):
        AllocDynBatchData(sched)                // sub_8BF890
        ScheduleEngine(sched, mode=0x41, ...)   // sub_688DD0

    // 9. Cleanup
    FreeBitvector(sched.bv)                     // sub_BDC050
    ArenaFreeAll(sched.arena)                   // sub_8E3A80

Phase 1 -- ReduceReg (mode 1, callback 0x39)

Goal: minimize register pressure so the register allocator has headroom. This phase reorders instructions to reduce the maximum number of simultaneously-live virtual registers.

  • Enabled by the named option "ScheduleInstructionsReduceReg" (default: on at -O3).
  • Register targets set from knobs 776 (SchedReduceIncLimit) and 778 (SchedReduceIncLimitHigh) (defaults approximately 250 and 300).
  • The mode byte 0x39 selects the register-pressure-minimizing priority weights inside the unified engine.
  • The engine's inner dispatch reads *(DWORD*)(scheduler+60) == 1 to enter the ReduceReg path.

Phase 2 -- ILP / Latency Hiding (mode 0, callback 0x49)

Goal: maximize instruction-level parallelism and hide memory latencies by interleaving independent operations.

  • Always runs (no separate enable gate).
  • Uses reverse post-order BB iteration via sub_8CD6E0: iterates basic blocks from last to first after resetting liveness with sub_781F80(func, 0).
  • Computes a register budget capped at min(archLimit, 0.95 * maxRegs) via sub_8CEE80.
  • The mode byte 0x49 selects latency-oriented priority weights.
  • After this phase, sub_8CF5D0 evaluates dual-issue eligibility and produces a dual-issue benefit score stored at scheduler+328.

Phase 3 -- DynBatch (mode 2, callback 0x41)

Goal: batch-aware scheduling for GMMA/WGMMA warpgroup tensor operations. Groups tensor instructions into batches that can execute as warpgroup-cooperative operations with minimal pipeline stalls.

  • Enabled by the named option "ScheduleInstructionsDynBatch" and only activates when the function has varying instruction counts across BBs.
  • Controlled by knob 742 (SchedCrossBlock, cross-block scheduling mode).
  • Reads stall/batch depth limits from knobs 805 (SchedTexBatchTargetSelectRegisterTarget), 741 (SchedCountLoadsPerTex), 761 (SchedMaxRLiveOKslack), 762 (SchedMaxRLiveOKslackColdBlocks).
  • Allocates a 184-byte DynBatch context (sub_8BF890) with an 8 * numBBs sub-array for per-BB batch tracking.
  • Context initialization (sub_8C1BA0): sets batch window to 0xFFFFFFFF (sentinel), copies register liveness from func+832.
  • The mode byte 0x41 selects batch-aware priority weights.

DynBatch Context Object (184 bytes)

sub_8BF890 allocates a 184-byte DynBatch context from the scheduling arena at sched+840 and stores the pointer at sched+272. The object is a flat structure containing a function context reference, a 20-slot working array, and a pointer to a variable-length per-BB sub-array.

| Offset | Size | Type | Init | Name | Purpose |
|--------|------|------|------|------|---------|
| +0 | 8 | ptr | funcCtx | funcContext | Pointer to CompilationContext (copied from sched+8) |
| +8 | 160 | QWORD[20] | 0 | batchWorkArray | Fixed-size working array for batch state tracking; likely holds instruction pointers or batch boundary markers during scheduling |
| +168 | 8 | ptr | alloc'd | perBBArray | Per-BB batch tracking sub-array; 8 * numBBs bytes, zero-initialized. Each 8-byte entry holds a batch start/end instruction pointer for one basic block |
| +176 | 4 | DWORD | 0 | flags | Status/control flags |
| +180 | 4 | -- | -- | (padding) | Pad to 184-byte allocation |

The per-BB sub-array size is derived from *(sched+392) (maxBBSizeForAlloc), with an overflow check capping the multiplication at 0xFFFFFFFFFFFFFFF (2^60 - 1) entries.
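The size computation with its overflow guard can be sketched as follows (the helper name and exception choice are illustrative; only the 2^60 - 1 cap is recovered):

```python
# Sketch of the per-BB sub-array size computation in sub_8BF890.
ENTRY_CAP = 0xFFFFFFFFFFFFFFF  # 2**60 - 1 entries, the recovered cap

def per_bb_array_bytes(num_bbs: int) -> int:
    # Guard the 8 * numBBs multiplication against 64-bit overflow.
    if num_bbs > ENTRY_CAP:
        raise OverflowError("per-BB entry count exceeds 2**60 - 1")
    return 8 * num_bbs  # one 8-byte tracking slot per basic block

assert per_bb_array_bytes(128) == 1024
```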

DynBatch Working State (in scheduler context)

The bulk of the DynBatch working state lives directly in the scheduler context, initialized by sub_8C1BA0 (InitDynBatchState). These fields are used by the priority function during Phase 3 scheduling.

| Offset | Size | Type | Init | Name | Purpose |
|--------|------|------|------|------|---------|
| +464 | 4 | int32 | 0 | batchSlotCount | Number of instructions accumulated in the current batch |
| +468 | 4 | int32 | -- | prevBatchSize | Size of previously-completed batch |
| +476 | 4 | int32 | adj | adjustedBatchTarget | Adjusted batch depth target; capped to min(maxStallCycles, batchTargetCount), halved when 2 * maxStall > target |
| +480 | 4 | int32 | -- | lastBatchEndPos | Scheduling position of the last instruction in the current batch |
| +488 | 8 | QWORD | 0xFFFFFFFF | batchWindow | Batch window start BB offset; sentinel 0xFFFFFFFF means "no batch active" |
| +492 | 4 | int32 | 0 | regDelta | Register pressure delta accumulator across batch boundaries |
| +496 | 4 | int32 | 0 | maxRegInBatch | Maximum register pressure observed within current batch |
| +500 | 4 | int32 | from +72 | regBaseCount | Base register count; copied from sched+72, reset on batch boundary |
| +504 | 8 | QWORD | 0 | maxRegSpan | Maximum register span (pressure peak minus baseline) across all batches |
| +508 | 4 | int32 | 0 | regBaseline | Register count baseline for delta computation |
| +512 | 4 | int32 | 0 | minOverflowCost | Minimum overflow cost; updated when batch exceeds register budget |
| +516 | 4 | int32 | -1 | batchDepthLimit | Per-batch maximum depth; -1 = unlimited (overwritten from BB analysis) |
| +520 | 1 | byte | 0 | batchOverflow | Set to 1 when batch exceeds register budget + base count |
| +521 | 1 | byte | 0 | batchAbort | Set to 1 when opcode 96 (WGMMA commit) detected with sched+524 flag |
| +536 | var | ptr[] | -- | batchSlots | Array of instruction pointers in the current batch; sched+536 + 8*i for slot i |

The batch target adjustment algorithm in sub_8C1BA0:

adjustedTarget = maxStallCycles           // from sched+404
if maxStallCycles > batchTargetCount:
    adjustedTarget = batchTargetCount     // cap to target
else if batchTargetCount > maxStallCycles and batchMode == 0:
    if batchTargetCount >= 2 * maxStallCycles:
        adjustedTarget = batchTargetCount / ceil(batchTargetCount / maxStallCycles)
    else:
        adjustedTarget = batchTargetCount / 2
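The adjustment above, rendered as executable Python with worked values (integer division and math.ceil are assumptions about the decompiled arithmetic; the helper name is ours):

```python
import math

def adjusted_batch_target(max_stall: int, target: int, batch_mode: int) -> int:
    # Start from the smaller of the stall limit and the batch target.
    adjusted = min(max_stall, target)
    # When the target exceeds the stall limit in mode 0, split the
    # target into roughly stall-limit-sized chunks.
    if batch_mode == 0 and target > max_stall:
        if target >= 2 * max_stall:
            adjusted = target // math.ceil(target / max_stall)
        else:
            adjusted = target // 2
    return adjusted

# target 20 with stall limit 8 -> ceil(20/8) = 3 chunks -> 20 // 3 = 6
assert adjusted_batch_target(8, 20, 0) == 6
# target 10 with stall limit 8 (< 2*8) -> halved -> 5
assert adjusted_batch_target(8, 10, 0) == 5
```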

When a batch boundary is detected (instruction's BB start offset exceeds the batch window), sub_8C1BA0 evaluates the batch: it computes the register pressure delta, checks whether the batch overflows the combined register budget (regBaseCount + regDelta + maxRegSpan), and either accepts the batch or trims it by walking backward through the batchSlots array to find a smaller valid batch.

Unified Scheduling Engine

sub_688DD0 (20 KB) is the single engine that all three phases invoke. Its behavior is parameterized by:

  1. Mode byte (argument a2): 0x39 = ReduceReg, 0x49 = ILP/Latency, 0x41 = DynBatch.
  2. Rebuild flag (argument a4): when true, reconstructs the dependency DAG via sub_6833F0.
  3. Vtable dispatch: uses *(a1+40) and *(a1+48) for polymorphic pre/post scheduling hooks.

function ScheduleEngine(sched, mode, arg3, rebuild):
    if rebuild:
        InitScheduleRegion(sched)               // sub_6833F0
        // allocates 72-byte per-BB records, queries knobs 595 (PreserveSchedOrderSame), 743 (SchedCrossBlockInstsToSpeculate), 747 (SchedCrossBlockTexToSpeculate)

    for each bb in sched.basic_blocks:
        // 10 register pressure counters from per-BB record +4..+40 into context +48..+87
        InitResourceTracking(sched, bb)          // sub_A091C0
        ReadyList = BuildReadyList(sched)        // sub_6820B0

        while ReadyList is not empty:
            best = SelectBestInstruction(sched)  // via priority vtable
            ScheduleInstruction(sched, best)     // sub_682200
            UpdateResourceState(sched, best)     // sub_A09530
            UpdateWARTracking(sched, best)       // sub_A09D40
            RelinkInstruction(best)              // sub_925510

            // Update dependency counts, add newly-ready instructions
            for each successor of best:
                successor.dep_count -= 1
                if successor.dep_count == 0:
                    ReadyList.insert(successor)

The engine manages 10 register pressure counters at scheduler context offsets 48--87 (copied from the per-block record offsets +4--+40 at BB entry). These correspond to the GPU register classes: R (general), P (predicate), UR (uniform), UP (uniform predicate), B (barrier), and 5 architecture-specific classes. Counter [0] (R class) uses a separate update path; counters [1]--[9] are decremented from a per-opcode resource cost table during the scheduling loop.

Ready List Construction

sub_6820B0 (1.5 KB) builds the initial ready list by scanning the instruction linked list for nodes with zero unsatisfied dependencies.

function BuildReadyList(sched):
    for instr in sched.instruction_list:
        if instr.opcode == 52:         // NOP/BB boundary
            continue                    // follow through to real instruction
        if instr.dep_count == 0:
            instr.next_ready = sched.ready_head
            sched.ready_head = instr
            vtable_call(sched, 104, instr)  // ready-list insertion callback
            instr.latency_counter = 0

The ready list is maintained as a sorted linked list (via pointer at instruction offset +16). The priority function determines sort order.
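A sketch of priority-ordered insertion into such a singly linked ready list (node layout and field names are illustrative, not recovered offsets):

```python
class Node:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority   # packed priority value
        self.next_ready = None     # analogue of the link pointer at +16

def ready_insert(head, node):
    # Keep the list sorted so the best candidate is always at the head;
    # one integer comparison per step decides the position.
    if head is None or node.priority > head.priority:
        node.next_ready = head
        return node
    cur = head
    while cur.next_ready and cur.next_ready.priority >= node.priority:
        cur = cur.next_ready
    node.next_ready = cur.next_ready
    cur.next_ready = node
    return head

def to_list(head):
    out = []
    while head:
        out.append(head.name)
        head = head.next_ready
    return out

head = None
for name, prio in [("FFMA", 5), ("LDG", 9), ("IADD", 7)]:
    head = ready_insert(head, Node(name, prio))
assert to_list(head) == ["LDG", "IADD", "FFMA"]
```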

Priority Function

sub_8C9320 (47 KB decompiled, ~1300 lines) is the heart of instruction selection. It computes a scheduling priority score as an 8-bit packed encoding combining multiple heuristic factors. The function uses approximately 200 local variables and a 0x330-byte stack frame.

Priority Factors

| Factor | Source | Weight adjustment |
|--------|--------|-------------------|
| Register pressure | Current live count vs budget at sched+432 | Primary factor in ReduceReg mode |
| Instruction latency | sub_693BC0 latency query | Primary factor in ILP mode |
| Critical path position | DAG depth from sched+464, sched+380 | Favors critical-path instructions |
| FU contention | 10-element resource vector via sub_8C7290 | Avoids saturating a single pipe |
| Hot/cold memory | sub_A9CDE0 (hot=global) / sub_A9CF90 (cold=const) | Prioritizes latency-sensitive ops |
| Anti-dependency | WAR hazard cost | Breaks ties with anti-dep distance |
| Barrier dependencies | Barrier flag at instr+376 | Defers barrier-blocked instructions |
| Priority queue depth | Knob 770 (default 4) | Limits lookahead window |

Priority Encoding

The priority value is packed into an integer with 8-bit fields. Each field is computed from the corresponding factor and shifted into position. The packed encoding allows the ready list to maintain sort order with a single integer comparison, avoiding multi-key sorting overhead.
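The packing idea can be illustrated as follows; the field order and uniform 8-bit widths below are an assumption for demonstration, not the recovered field assignment inside sub_8C9320:

```python
def pack_priority(pressure: int, latency: int, depth: int, tie: int) -> int:
    # Each factor is clamped to 8 bits and shifted into its own byte;
    # higher-order bytes dominate, so one integer compare ranks nodes.
    packed = 0
    for field in (pressure, latency, depth, tie):
        packed = (packed << 8) | max(0, min(field, 255))
    return packed

a = pack_priority(10, 200, 3, 0)
b = pack_priority(10, 150, 9, 5)
# Equal pressure bytes, so the latency byte decides: a outranks b.
assert a > b
```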

Key subroutines called during priority computation:

| Address | Purpose |
|---------|---------|
| sub_8C67A0 | Compute resource cost for instruction and update BB resource table |
| sub_8C7120 | Barrier tracking update |
| sub_8C7290 | Copy 10-element resource vector from per-BB slot (SSE-optimized) |
| sub_8C7720 | Instruction reordering within BB (red-black tree operations) |
| sub_693BC0 | Memory space classification / latency query |
| sub_6818D0 | Register count to hardware-aligned unit conversion |

Resource Tracking

The scheduler tracks 10 functional unit resource counters per basic block. Each counter corresponds to a hardware execution pipe.

Resource Vector Layout

Each per-BB resource slot occupies 84 bytes (21 DWORDs) stored at *(scheduler+672) + 84 * slot_index:

| Offset (within slot) | Size | Content |
|----------------------|------|---------|
| 0--36 | 10 x int32 | Current resource usage per FU |
| 40--76 | 10 x int32 | Resource pressure delta |
| 80 | int32 | BB-entered flag and auxiliary bits |
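For reference, a slot can be decoded with a short struct sketch (assuming little-endian int32, as on x86-64; the accessor is ours, not a recovered function):

```python
import struct

SLOT_SIZE = 84  # 21 DWORDs per basic-block resource slot

def decode_slot(buf: bytes, slot_index: int):
    # Mirrors the layout at *(scheduler+672) + 84 * slot_index.
    base = SLOT_SIZE * slot_index
    usage = struct.unpack_from("<10i", buf, base)        # +0..+36
    delta = struct.unpack_from("<10i", buf, base + 40)   # +40..+76
    flags = struct.unpack_from("<i", buf, base + 80)[0]  # +80
    return usage, delta, flags

buf = bytearray(SLOT_SIZE * 2)
struct.pack_into("<i", buf, SLOT_SIZE + 80, 1)  # mark slot 1 as entered
assert decode_slot(bytes(buf), 1)[2] == 1
```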

The 10 functional unit pipes (inferred from resource model queries):

| Index | Pipe | Typical instructions |
|-------|------|----------------------|
| 0 | Integer ALU | IADD, IMAD, ISETP, LOP, SHF |
| 1 | FP32 | FADD, FFMA, FMUL, FSETP |
| 2 | FP64 | DADD, DFMA, DMUL |
| 3 | Tensor core | HMMA, IMMA, BMMA, BGMMA |
| 4 | Load/store | LD, ST, LDG, STG, LDS, STS |
| 5 | Texture | TEX, TLD, TXQ |
| 6 | Branch/control | BRA, JMP, EXIT, RET, BAR |
| 7 | Shared memory | ATOMS, REDS, LDS, STS |
| 8 | Special function | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) |
| 9 | Uniform/predicate | UPLOP, UISETP, uniform operations |

sub_8C67A0 computes per-instruction resource costs by calling the resource model (sub_A08A00) three times:

  • Mode 1: the instruction's own execution cost
  • Mode 2: operand release costs for last-use operands
  • Mode 3: combined instruction + BB-level impact

SSE intrinsics (_mm_add_epi32) are used for vector accumulation.

Register Budget

sub_8CEE80 (8.7 KB) computes the occupancy-aware register budget that the scheduler respects during instruction ordering.

function ComputeRegisterBudget(sched):
    hw = sched.func.sm_backend          // at func+1584 (provides hw latency profiles)
    maxRegs = hw[154]                   // architecture register limit

    coeff = KnobGetDouble(740)          // default 0.045
    if KnobGetBool(763):                // budget disabled
        budget = hw[157]                // use fixed count from profile
    else:
        physRegs = VirtToPhys(sched, maxRegs)   // sub_A99FE0
        budget = physRegs - (physRegs >> 6)     // 98.4% utilization

    // For sm_50: apply special dual-issue budget
    if arch_id == 5:
        budget = DualIssueBudget(budget)

    pressureCurve = ComputePressureCurve(sched, budget - 2)  // sub_8CE520
    // Piecewise linear model with parameters (4, 2, 6)

    sched.regBudget         = budget     // offset +432
    sched.committedTarget   = ...        // offset +324
    sched.minRegs           = ...        // offset +316
    sched.pressureSlack     = ...        // offset +320

The register pressure curve (sub_8CE520) uses a piecewise linear model parameterized by (4, 2, 6) or a custom string-encoded function from knob 750 (SchedEstimatedLoopIterations).
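The default-budget arithmetic from the pseudocode above (physRegs - (physRegs >> 6)) works out as follows; the helper name is ours:

```python
def default_budget(phys_regs: int) -> int:
    # Subtracting a right-shift by 6 keeps 63/64 ~ 98.4% of the
    # physical register file as the scheduling budget.
    return phys_regs - (phys_regs >> 6)

assert default_budget(256) == 252   # 256 - 4
assert default_budget(255) == 252   # 255 - 3 (the shift truncates)
```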

Dependency Graph

The dependency DAG is built in two stages:

Stage 1: Pre-scheduling scan (sub_8CF880, 28 KB)

Iterates basic blocks in reverse order. For each BB:

  • Checks knobs 314 (FenceInterference) / 313 (FenceCode) for per-instruction scheduling fence conditions
  • Walks the instruction linked list, identifying NOP/control instructions
  • Builds dependency edges via sub_8D9930
  • Manages memory arenas with SSE-optimized copies for instruction metadata arrays

Stage 2: Edge construction (sub_8D9930, 19 KB)

For each pair of instructions in a BB, checks for:

  • RAW (true) dependencies: read-after-write on the same register
  • WAR (anti) dependencies: write-after-read
  • WAW (output) dependencies: write-after-write
  • Memory dependencies: through shared/global memory (conservative ordering)
  • Barrier dependencies: through barrier/sync instructions

Uses operand analysis from sub_894290 (27 KB) which processes 16-bit operand descriptors encoding register class, bank, and dependency type.
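The register-level classification in Stage 2 follows the textbook scheme, which can be sketched as below (a simplification: the real pass also handles predicates, memory ordering, and barriers):

```python
def classify_deps(earlier, later):
    # earlier/later: (reads, writes) register-set pairs, in program order.
    e_reads, e_writes = earlier
    l_reads, l_writes = later
    deps = set()
    if e_writes & l_reads:
        deps.add("RAW")   # true dependency: read-after-write
    if e_reads & l_writes:
        deps.add("WAR")   # anti dependency: write-after-read
    if e_writes & l_writes:
        deps.add("WAW")   # output dependency: write-after-write
    return deps

# R1 = R0 + R2  followed by  R0 = R1 * R3
assert classify_deps(({0, 2}, {1}), ({1, 3}, {0})) == {"RAW", "WAR"}
```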

Supplementary dependency builders

| Address | Size | Purpose |
|---------|------|---------|
| sub_68A690 | 31 KB | BuildDependencies -- def-use chain construction |
| sub_6A97B0 | 26 KB | AddDependencyEdges -- register-level edges |
| sub_6A2D30 | 11 KB | ChainDependencies -- memory ordering constraints |
| sub_6A78F0 | 23 KB | ProcessOperands -- operand dependency extraction |

Pre-Scheduling Setup

sub_8CBAD0 (2.9 KB) performs BB scanning and resource allocation before the scheduling passes begin.

Key behaviors:

  • Counts instructions per basic block. If any BB exceeds 4095 instructions, it inserts a scheduling barrier (sub_931920) to split the block.
  • Tracks maximum BB size at scheduler+388.
  • Detects opcode 246 (texture operations) and sets scheduler+384 = 1.
  • Allocates per-slot arrays:
    • scheduler+672: 84-byte scheduling slots (resource tracking)
    • scheduler+280: 48-byte analysis slots (if opt_level > 2)
    • scheduler+248, scheduler+256: register pressure bitvectors sized to (numRegs+1) or (2*numRegs+2) if knob 420 (LivenessUseHiLo, dual-register tracking) is active

Pre-Scheduling vs Post-Scheduling

The scheduler runs at two distinct points in the ptxas pipeline:

| Aspect | Pre-scheduling | Post-scheduling |
|--------|----------------|-----------------|
| Timing | Before physical register allocation | After physical register allocation |
| Register model | Virtual registers | Physical registers |
| Primary goal | Reduce register pressure, order for regalloc | Hide latencies, minimize stalls |
| Phases active | All 3 (ReduceReg, ILP, DynBatch) | Refinement pass |
| Budget source | Occupancy model estimate | Actual allocation result |
| Entry | sub_8D0640 | sub_7F5D50 / sub_A97600 (42 KB) |

Post-scheduling uses the actual physical register assignments for precise dependency distances and can make final decisions about stall insertion and scoreboard barrier placement.

Scheduling Variants

The region 0x89C550--0x8BE320 contains 17+ specialized scheduling strategies, each implementing a different approach or targeting a different code pattern:

| Address | Size | Strategy | Notes |
|---------|------|----------|-------|
| sub_8B9390 | 23 KB | Software pipelining | Loop body overlapping |
| sub_8B77C0 | 15 KB | Dual-issue scheduling | Pair co-issuable instructions |
| sub_8BDC40 | 7.9 KB | Dual-issue pairing | Instruction pair selection |
| sub_8B8900 | 12 KB | Tensor scheduling | HMMA/BMMA grouping |
| sub_8BAAE0 | 15 KB | Loop-aware scheduling | Trip count + register awareness |
| sub_8B6D60 | 12 KB | Pressure-optimized | Minimize live range overlap |
| sub_8B5400 | 14 KB | Latency-optimized | Maximize memory latency hiding |
| sub_8B1190 | 16 KB | Backtracking scheduler | Undo and retry on conflict |
| sub_8B2D90 | 18 KB | Global schedule optimization | Cross-BB considerations |
| sub_8B4590 | 13 KB | Permutation search | Try schedule permutations |
| sub_8A9D80 | 21 KB | Depth-first scheduling | DFS-based instruction ordering |
| sub_8AB750 | 9.8 KB | Critical path computation | DAG analysis for priorities |
| sub_8BB9C0 | 8.2 KB | Prefetch scheduling | Memory prefetch insertion |
| sub_8BC0B0 | 6.1 KB | Barrier coalescence | Merge adjacent barriers |
| sub_8BC990 | 7.6 KB | Scoreboard optimization | Minimize scoreboard usage |
| sub_8BCFA0 | 6.8 KB | Warp schedule optimization | Warp-level yield tuning |
| sub_8BE320 | 25 KB | Complex scheduling pass | Multi-strategy combined pass |

These variants are selected based on code characteristics (loop structure, tensor operations, function size) and optimization level.

Hardware Latency Profiles

Per-architecture latency and throughput tables are constructed by a family of functions at 0x8E7300--0x8E9DC0. Each table specifies pipeline latencies (integer, FP32, FP64, tensor, memory), scoreboard wait counts, barrier stall cycles, and dual-issue pair compatibility for the target GPU.

| Address | Architecture | Size |
|---------|--------------|------|
| sub_8E7300 | sm_70 (Volta) | 3.3 KB |
| sub_8E7540 | sm_72 | 2.9 KB |
| sub_8E7720 | sm_75 (Turing) | 3.5 KB |
| sub_8E7940 | sm_80 base | 2.9 KB |
| sub_8E7B40 | sm_80 (Ampere) | 3.3 KB |
| sub_8E7D80 | sm_86 | 4.4 KB |
| sub_8E8070 | sm_87 | 3.5 KB |
| sub_8E8280 | sm_89 (Ada Lovelace) | 3.1 KB |
| sub_8E8480 | sm_90 (Hopper) | 5.2 KB |
| sub_8E8780 | sm_90a | 4.6 KB |
| sub_8E8A90 | sm_100 (Blackwell DC) | 3.0 KB |
| sub_8E8DB0 | sm_103 (Blackwell Ultra) | 1.7 KB |
| sub_8E9000 | sm_120 (RTX 50xx) | 2.9 KB |
| sub_8E92E0 | sm_120 extended | 5.5 KB |
| sub_8E97B0 | Universal fallback | 8.8 KB |

The warp-level hardware profile (sub_8E4400) maps architecture IDs to dispatch parameters:

| Architecture range | Warps | Dispatch slots | Era |
|--------------------|-------|----------------|-----|
| <= 20479 | 4 | 96 | sm_50 (Maxwell) |
| <= 24575 | 6 | 176 | sm_60 (Pascal) |
| <= 28672 | 7 | 192 | sm_70 (Volta) |
| <= 32767 | 7 | 208 | sm_75 (Turing) |
| <= 36863 | 8 | 224 | sm_80 (Ampere) |
| > 36863 | 16 | 240 | sm_90+ (Hopper, Blackwell) |

Sub-architecture variants (stored at profile offset +26) are assigned by specific SM version codes: 8193, 20481, 24576, 28674--28677, 32768, 36864--36869.

See Latency Model for per-opcode latency tables and functional unit mapping.

Scheduling Knobs

The scheduler reads approximately 76 knobs. The most significant ones (names decoded from ROT13 in the binary):

| Knob ID | Name | Type | Default | Purpose |
|---------|------|------|---------|---------|
| 313 | FenceCode | when-list | -- | Skip scheduling for specific opcodes (per-instruction WHEN condition) |
| 314 | FenceInterference | when-list | -- | Mark interference fences for specific opcodes |
| 419 | LivenessCountRegComp | int32 | -- | Forward scheduling mode flag (bit 3 in sched+1376) |
| 420 | LivenessUseHiLo | int32 | -- | Dual-register hi/lo tracking (bit 4 in sched+1376) |
| 487 | -- | bool | true | Master scheduling/peephole enable |
| 510 | OptimizeUniformAtomicMode | int32 | -- | BB pre-optimization mode for uniform atomics |
| 595 | PreserveSchedOrderSame | when-list | -- | Preserve scheduling order (per-instruction WHEN condition) |
| 740 | SchedBumpScaleAugmentFactor | double | 0.045 | Register pressure bump scale augmentation coefficient |
| 741 | SchedCountLoadsPerTex | int32 | 3 | Load count per texture operation (stall threshold) |
| 742 | SchedCrossBlock | int32 | -- | Cross-block scheduling mode |
| 743 | SchedCrossBlockInstsToSpeculate | int32 | -- | Cross-block instruction speculation count |
| 747 | SchedCrossBlockTexToSpeculate | int32 | -- | Cross-block texture speculation count |
| 750 | SchedEstimatedLoopIterations | string | -- | Estimated loop iteration count override |
| 760 | SchedMaxRLiveCarefulSlack | int32 | -- | Reserved register headroom (careful slack for live registers) |
| 761 | SchedMaxRLiveOKslack | int32 | -- | Acceptable live-register slack (batch depth on non-sm_50) |
| 762 | SchedMaxRLiveOKslackColdBlocks | int32 | -- | Extra register slack for cold basic blocks |
| 763 | SchedMaxRTarget | int32 | -- | Maximum register target; 0 disables register budget |
| 769 | SchedPrefFurthestDep | when-list | -- | Per-BB scheduling query: prefer furthest dependency |
| 770 | SchedReadAvailTarget | int32 | 4 | Priority queue depth (read-availability lookahead window) |
| 776 | SchedReduceIncLimit | int32 | ~250 | Forward pass primary register increment limit |
| 778 | SchedReduceIncLimitHigh | int32 | ~300 | Forward pass secondary (high) register increment limit |
| 805 | SchedTexBatchTargetSelectRegisterTarget | int32 | -- | Texture batch register target stall limit (capped at 16) |
| 806 | SchedTexBatchTargetSelectSchedulerTarget | int32 | -- | Texture batch scheduler target stall limit (capped at 16) |

Knob names are stored ROT13-encoded in the binary (see Knobs System for the obfuscation scheme). The when-list type marks knobs that support per-instruction or per-BB conditional overrides via WHEN= syntax.
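Decoding a stored name is a one-liner with Python's standard ROT13 codec (shown as a convenience for readers working through the string table; digits and punctuation pass through unchanged, which matters for names like SchedMaxRLiveOKslack):

```python
import codecs

def decode_knob_name(stored: str) -> str:
    # ROT13 rotates only ASCII letters; everything else is left alone.
    return codecs.decode(stored, "rot13")

assert decode_knob_name("FpurqPebffOybpx") == "SchedCrossBlock"
```

ROT13 is its own inverse, so the same call re-encodes a plaintext name.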

The full scheduling context configuration is performed by sub_A95DC0 (35 KB), which reads dozens of knob values and populates the scheduling context structure.

Data Flow Analysis

The scheduler includes a dedicated data flow analysis subsystem (0x8DBAF0--0x8DF1C0) that computes register liveness and propagates def-use information across BB boundaries:

| Address | Size | Purpose |
|---|---|---|
| sub_8DB070 | 8.2 KB | Initialize liveness data structures |
| sub_8DB5F0 | 8.4 KB | Compute per-BB liveness |
| sub_8DBAF0 | 16 KB | Full liveness analysis |
| sub_8DC3F0 | 3.0 KB | Compute data flow state |
| sub_8DC620 | 3.3 KB | Update data flow on schedule |
| sub_8DC880 | 10 KB | Propagate data flow information |
| sub_8DCF20 | 23 KB | Build data flow graph for scheduling |
| sub_8DE7A0 | 12 KB | Iterative data flow solver (fixed-point) |
| sub_8DEF90 | 2.0 KB | Finalize data flow |

The iterative solver runs until convergence, updating per-BB liveness sets. This information feeds into the priority function's register pressure estimates and into the register budget computation.
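The fixed-point iteration can be sketched as classic backward liveness: live_in = use ∪ (live_out − def), with live_out as the union of successor live_in sets, repeated until nothing changes. The following minimal sketch uses an invented 3-block CFG and register bit assignments for illustration; it is not recovered code.

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch of the backward-liveness fixed point the solver computes.
 * The CFG (bb0 -> bb1, bb1 -> bb1/bb2) and the use/def bit patterns are
 * invented; each bit of the uint64_t stands for one register. */
#define NBB 3

static int succ[NBB][2] = { {1, -1}, {1, 2}, {-1, -1} };  /* bb1 loops on itself */
static uint64_t use_[NBB] = { 0x1, 0x2, 0x4 };  /* upward-exposed reads  */
static uint64_t def_[NBB] = { 0x2, 0x4, 0x0 };  /* definitions           */
static uint64_t live_in[NBB], live_out[NBB];

static void solve_liveness(void) {
    int changed = 1;
    while (changed) {                       /* iterate to a fixed point */
        changed = 0;
        for (int b = NBB - 1; b >= 0; b--) {
            uint64_t out = 0;
            for (int s = 0; s < 2; s++)     /* union of successor live_in */
                if (succ[b][s] >= 0)
                    out |= live_in[succ[b][s]];
            uint64_t in = use_[b] | (out & ~def_[b]);
            if (in != live_in[b] || out != live_out[b])
                changed = 1;
            live_in[b] = in;
            live_out[b] = out;
        }
    }
}
```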

Scheduling Output

After instruction ordering is determined, the scheduling output pipeline (0x8F1EB0--0x8FDD60, ~57 KB) converts the abstract schedule into SASS control words:

function EmitScheduleForBB(sched, bb):
    for each instruction in scheduled order:
        stall   = ComputeStallCycles(sched, instr)   // distance to consumer
        yield   = ComputeYieldHint(sched, instr)      // warp scheduling hint
        barrier = AssignBarrier(sched, instr)          // 6 barriers available
        sb_deps = ComputeScoreboardDeps(sched, instr)  // read/write dependencies

        control_word = EncodeControlWord(stall, yield, barrier, sb_deps)
        EmitControlWord(instr, control_word)

Key encoding functions:

| Address | Purpose |
|---|---|
| sub_8F1EB0 | Main schedule encoding entry |
| sub_8F3130 | Encode stall count field |
| sub_8F31F0 | Encode barrier field |
| sub_8F3650 | Encode yield hint field |
| sub_8F3860 | Encode scoreboard dependency field |
| sub_8F4140 | Encode complete control word |
| sub_8F6530 | Output complete schedule for function |

Seven verification functions at 0x8F7610--0x8F8CB0 validate the generated schedule: stall counts, barrier assignments, dependency chains, scoreboard correctness, control word format, yield hints, and overall schedule integrity.

See Scoreboards for the scoreboard and dependency barrier encoding format.
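The EncodeControlWord step above amounts to packing the per-instruction fields into fixed bit positions. The sketch below is illustrative only: the field names, widths, and positions are assumptions, not the recovered SASS control-word layout (which is architecture-specific; see Scoreboards).

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch of control-word packing.  All field widths and bit
 * positions here are invented for illustration. */
typedef struct {
    unsigned stall;     /* cycles to stall before the next issue       */
    unsigned yield;     /* warp-yield hint                             */
    unsigned wr_bar;    /* write scoreboard index, 0..5 (7 = none)     */
    unsigned rd_bar;    /* read scoreboard index, 0..5 (7 = none)      */
    unsigned wait_mask; /* 6-bit mask of scoreboards to wait on        */
} Ctrl;

static uint32_t encode_ctrl(Ctrl c) {
    return  (c.stall     & 0xF)
          | ((c.yield    & 0x1)  << 4)
          | ((c.wr_bar   & 0x7)  << 5)
          | ((c.rd_bar   & 0x7)  << 8)
          | ((c.wait_mask & 0x3F) << 11);
}
```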

Memory Management

The scheduler uses two allocator strategies:

  1. Arena allocator (sub_8E3970): bump allocator with 10 KB block granularity, 8-byte alignment. Allocations within a scheduling pass use the arena for fast allocation. sub_8E3A80 frees all blocks at once at pass completion.

  2. Free-list allocator (sub_8DA6D0): free-list with block coalescing for persistent scheduling data. Maintains multiple free lists for different size classes. Blocks larger than 0x1FF bytes go to a separate large-block list. Adjacent free blocks are merged on deallocation.
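The arena strategy can be sketched as a bump allocator over a chain of fixed-size blocks, with a single bulk release at pass end. Everything below (names, struct layout) is illustrative; only the 10 KB granularity, 8-byte alignment, and free-all-at-once behavior come from the recovered functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hedged sketch of the arena allocator (sub_8E3970 / sub_8E3A80 analogue):
 * bump-allocate from 10 KB blocks, 8-byte alignment, free all at once. */
#define BLOCK_SIZE 10240

typedef struct Block {
    struct Block *next;
    size_t used;
    char data[BLOCK_SIZE];
} Block;

typedef struct { Block *head; } Arena;

static void *arena_alloc(Arena *a, size_t n) {
    n = (n + 7) & ~(size_t)7;               /* round up to 8-byte alignment */
    if (!a->head || a->head->used + n > BLOCK_SIZE) {
        Block *b = malloc(sizeof *b);       /* start a fresh block */
        b->next = a->head;
        b->used = 0;
        a->head = b;
    }
    void *p = a->head->data + a->head->used;
    a->head->used += n;
    return p;
}

static void arena_free_all(Arena *a) {      /* release every block at once */
    for (Block *b = a->head; b; ) {
        Block *n = b->next;
        free(b);
        b = n;
    }
    a->head = NULL;
}
```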

Per-Instruction Scheduling Metadata (SchedNode)

Each instruction has a pointer at instr+40 (sched_slot) to a separate heap-allocated scheduling metadata block called a SchedNode. The metadata offsets documented throughout the scheduling pages (e.g., metadata+24, metadata+32, metadata+108) are relative to this SchedNode, not to the 296-byte Ori instruction object itself. The SchedNode block is at least 112 bytes; all nodes are linked into a singly-linked list at func+104 (Code Object offset +104), separate from the instruction linked list at func+272.

SchedNode Layout

| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +0 | 8 | ptr | -- | nextInList | Singly-linked next pointer for the func+104 metadata chain |
| +8 | 4 | i32 | 0 | depCount | Unsatisfied dependency count; decremented as predecessors are scheduled; instruction is ready when this reaches 0 |
| +12 | 4 | -- | -- | (pad) | Alignment padding |
| +16 | 8 | ptr | -- | nextReady | Ready list singly-linked next pointer; threaded by sub_6820B0 (BuildReadyList) |
| +24 | 4 | i32 | seq | bbSlot | 1-based position within the BB (assigned sequentially by sub_8D9930); used for program-order tiebreaking in priority decisions |
| +28 | 4 | i32 | 0 | latencyCounter | Remaining latency cycles until the instruction's result is available; reset to 0 when placed on the ready list; updated by sub_A09530 (UpdateStallCycles) |
| +32 | 4 | i32 | -- | earliestCycle | Earliest available cycle -- the latest completion time among all producer instructions; stall-free when earliestCycle >= scheduler+480 (current cycle) |
| +36 | 4 | -- | -- | (reserved) | Alignment padding or internal use |
| +40 | 4 | i32 | 0 | latestDeadline | Latest deadline cycle for scheduling; secondary tiebreaker in the candidate comparison cascade |
| +44 | 4 | i32 | -- | barrierGroupIndex | Barrier group assignment; identifies which of the 6 hardware barriers this instruction participates in |
| +48 | 4 | i32 | -- | schedulingFenceCode | Scheduling fence code from knob 313 (FenceCode) / 314 (FenceInterference) checks; controls per-instruction scheduling boundaries |
| +56 | 8 | i64 | 0 | depChainHead | Dependency chain data; reset to 0 between scheduling passes |
| +76 | 4 | i32 | 0 | schedulingCost | Per-instruction scheduling cost; accumulated during priority evaluation; reset between passes |
| +84 | 4 | i32 | -1 | schedulingClass | Scheduling class index assigned by the latency model (sub_89FBA0); indexes into per-architecture latency tables; -1 = unclassified (sentinel) |
| +88 | 4 | i32 | -- | maxPredecessorCycle | Highest cycle value among predecessor instructions; used in the priority pre-scan to compute max_pred_cycle |
| +92 | 4 | i32 | -- | maxDependencyCycle | Highest cycle value along the dependency chain; used to compute max_dep_cycle for critical-path analysis |
| +104 | 8 | i64 | 0 | extendedState | Extended scheduling state; reset to 0 between scheduling passes |
| +108 | 1 | byte | -- | flags | Primary flag byte: bit 0 = barrier-target, bit 1 = has-dependency-set, bit 2 = fence-early (knob 314), bit 3 = fence-late (knob 313), bit 4 = has-register-operand |
| +111 | 1 | byte | -- | extendedFlags | Extended flags: bit 7 = uses expensive register file (triggers barrier tracking update in sub_8C7120) |

Relationship to the Instruction Object

 Ori Instruction (296 bytes)              SchedNode (>= 112 bytes)
 +--------------------------+             +---------------------------+
 | +0:  prev (BB list)      |   instr+40  | +0:  nextInList           |
 | +8:  next (BB list)      |---sched_slot-->                         |
 | +16: id                  |             | +8:  depCount             |
 | +72: opcode              |             | +16: nextReady            |
 | +80: operand_count       |             | +24: bbSlot               |
 | +84: operands[]          |             | +28: latencyCounter       |
 |                          |             | +32: earliestCycle        |
 |                          |             | +40: latestDeadline       |
 |                          |             | +88: maxPredecessorCycle  |
 |                          |             | +92: maxDependencyCycle   |
 |                          |             | +108: flags               |
 +--------------------------+             +---------------------------+

Lifecycle

  1. Allocation: InitScheduleData (vtable[29], called from sub_8D0640) allocates one SchedNode per instruction from the scheduling arena and stores the pointer at instr+40. Nodes are linked into the func+104 chain.

  2. Initialization: sub_8D9930 (EdgeBuilder) initializes depCount, bbSlot, latencyCounter, latestDeadline, and flags while building dependency edges. Between scheduling phases, the orchestrator resets pass-specific fields: +56 = 0, +104 = 0, DWORD+76 = 0, DWORD+84 = -1.

  3. Population: The dependency graph builder populates depCount from edge analysis. Critical-path computation fills earliestCycle, maxPredecessorCycle, and maxDependencyCycle.

  4. Use: sub_6820B0 (BuildReadyList) checks depCount == 0 and threads ready instructions via nextReady. sub_8C9320 (PriorityFunction) reads all fields to compute the 8-bit scheduling priority.

  5. Cleanup: sub_8E3A80 (ArenaFreeAll) reclaims all SchedNode blocks when the scheduling pass completes.

Sentinel Values

  • bbSlot = -1: unscheduled (set during inter-pass reset at DWORD+84)
  • latencyCounter = 99999 (0x1869F): infinity (used as min_barrier_latency initial value in the priority pre-scan)
  • earliestCycle bit 31 set (>= 0x80000000): not-yet-available (tested in sub_8C9320 pre-scan via < 0x80000000 comparison)

Large Function Handling

Functions exceeding 16383 instructions (*(a1+372) > 0x3FFF) trigger chunk-based scheduling via sub_A9DDD0 (11.5 KB). The function is split into chunks that are scheduled independently and then merged. This avoids quadratic blowup in the dependency DAG construction for very large kernels.

Per-Block Scheduling Record (72 bytes)

sub_6833F0 (InitScheduleRegion, 10 KB) allocates an array of (numBBs + 1) records at 72 bytes each, stored at scheduler+184. Each record tracks the register pressure snapshot, region context pointers, and scheduling characteristic flags for a single basic block. The scheduling engine loads a BB's pressure state from this record at region entry and saves it back when moving to the next BB.

Field Map

| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +0 | 4 | i32 | 0 | crossBlockId | Non-zero when the BB is active/scheduled; set to the predecessor BB index during cross-block merging. Tested as a boolean gate by 8+ functions before processing a BB. |
| +4 | 4 | i32 | 0 | pressure[0] | Register pressure snapshot -- R (general-purpose 32-bit registers) |
| +8 | 4 | i32 | 0 | pressure[1] | Register pressure snapshot -- P (predicate registers) |
| +12 | 4 | i32 | 0 | pressure[2] | Register pressure snapshot -- UR (uniform registers) |
| +16 | 4 | i32 | 0 | pressure[3] | Register pressure snapshot -- UP (uniform predicate registers) |
| +20 | 4 | i32 | 0 | pressure[4] | Register pressure snapshot -- B (barrier registers) |
| +24 | 4 | i32 | 0 | pressure[5] | Register pressure snapshot -- arch-specific class 0 |
| +28 | 4 | i32 | 0 | pressure[6] | Register pressure snapshot -- arch-specific class 1 |
| +32 | 4 | i32 | 0 | pressure[7] | Register pressure snapshot -- arch-specific class 2 |
| +36 | 4 | i32 | 0 | pressure[8] | Register pressure snapshot -- arch-specific class 3 |
| +40 | 4 | i32 | 0 | pressure[9] | Register pressure snapshot -- arch-specific class 4 / control total |
| +44 | 4 | -- | -- | (padding) | Not initialized, not accessed |
| +48 | 8 | ptr | 0 | regionContext | Pointer to 136-byte per-region scheduling state allocated by sub_682F10. Contains region boundaries, mode flags, and instruction range metadata. |
| +56 | 8 | ptr | 0 | regionContext2 | Second region context pointer, written via successor-BB index mapping. Dereferenced by sub_681C00 to check barrier presence (bit 4 of pointed-to byte). |
| +64 | 1 | byte | & 0x80 | flags | Per-BB characteristic flags (see below). Low 7 bits cleared on init; bit 7 preserved. |
| +65 | 7 | -- | -- | (padding) | Padding to 72-byte stride |

Pressure Counter Transfer

At the start of each BB's scheduling pass, sub_A091C0 (InitResourceTracking) copies the 10 DWORDs at record offsets +4 through +40 into the scheduler context at context offsets +48 through +87. The scheduling engine then updates the context counters as instructions are scheduled. When cross-block scheduling produces a new pressure snapshot, the engine writes it back with SSE bulk stores:

*(OWORD*)(record + 4)  = pressure[0..3]    // 16 bytes via _mm_store_si128
*(OWORD*)(record + 20) = pressure[4..7]    // 16 bytes via _mm_store_si128
*(QWORD*)(record + 36) = pressure[8..9]    // 8 bytes

During the main scheduling loop, the engine decrements pressure[1] through pressure[9] (9 counters) from a 40-byte per-opcode resource cost table. pressure[0] (R class) is handled via a separate path.
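The decrement step above can be sketched as a loop over the 10-entry cost row, skipping class 0. The table contents below are invented for illustration; only the "decrement [1]..[9], handle [0] separately" behavior comes from the recovered loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch of the per-instruction pressure update: counters [1]..[9]
 * are decremented from a per-opcode cost row (10 DWORDs of the 40-byte
 * table); counter [0] (the R class) takes a separate path. */
enum { NUM_CLASSES = 10 };

static void apply_cost(int32_t pressure[NUM_CLASSES],
                       const int32_t cost[NUM_CLASSES]) {
    for (int c = 1; c < NUM_CLASSES; c++)   /* skip class 0 (R) */
        pressure[c] -= cost[c];
}
```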

Flags Byte (+64)

| Bit | Name | Set by | Meaning |
|---|---|---|---|
| 0 | crossBlockBoundary | sub_688DD0 (ScheduleEngine) | BB is a cross-block scheduling boundary |
| 1 | regionActive | sub_688DD0 (ScheduleEngine) | BB belongs to an active scheduling region |
| 2 | hasCall | sub_6833F0 for opcode 96 | BB contains a CALL instruction |
| 3 | hasBranch | sub_6833F0 for opcodes 188, 190 | BB contains a branch instruction |
| 4 | hasBarrierInstr | sub_6833F0 via sub_7DF3A0 test (bit 6) | BB contains a barrier-flagged instruction |
| 5 | hasLongLatencyOp | sub_6833F0 for memory/texture/tensor opcodes; also vtable[183] arch check | BB contains a long-latency operation (memory, texture, or tensor) |
| 6 | crossBlockTarget | sub_6833F0 cross-block merge | BB is the target of a cross-block scheduling region |
| 7 | (preserved) | Not cleared during init | Carries data from a prior pipeline stage; purpose unknown |

The opcodes that set bit 5 (hasLongLatencyOp): 18 (with knob 62 gate), 23, 26, 32, 57, 81, 101, 124 (with knob 461 gate), 178, 188, 190, 197, 236, 248, 271, 315. Additionally, any instruction where vtable[183] returns true (architecture-specific long-latency classification) sets bit 5.

Cross-Block Scheduling Setup

After per-BB initialization, sub_6833F0 walks the CFG to identify cross-block scheduling opportunities, gated by knob 744 (SchedCrossBlockReorder). For each predecessor-successor pair within the speculative distance threshold (knobs 743 SchedCrossBlockInstsToSpeculate and 747 SchedCrossBlockTexToSpeculate):

  1. Sets record[pred].crossBlockId = succ_bb_index (marks predecessor active).
  2. Clears bit 6 of record[pred].flags (predecessor is not a cross-block target).
  3. Sets bit 6 of record[succ].flags (successor is a cross-block target).
  4. Calls sub_682F10 to allocate the 136-byte region scheduling context and store pointers at record[pred]+48 and record[succ]+56.

+0                  +4                                +44   +48             +56              +64     +65       +72
| crossBlockId (4B) | pressure[0..9] (40B = 10 x i32) | pad | regionCtx (8B) | regionCtx2 (8B) | flags | pad (7B) |

Scheduler Context Object Layout

The scheduling context object (sched / a1) is the central state structure passed as the first argument to every function in the scheduling subsystem. It is populated by sub_A95DC0 (SchedulingContext::configure, 35 KB) which reads dozens of knob values and architecture parameters. The object spans approximately 1600 bytes, from a vtable pointer at offset 0 through architecture-specific SSE vectors at offset +1584.

Core Fields (offsets 0--176)

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +0 | 8 | void* | vtable | Polymorphic dispatch; pre/post scheduling hooks at *(a1+40), *(a1+48) |
| +8 | 8 | ptr | funcContext | Pointer to CompilationContext; all func/arch queries go through this |
| +16 | 8 | ptr | allocator | Memory allocator interface (vtable-dispatched alloc/free) |
| +40 | 8 | ptr | preHookVtable | Pre-scheduling callback (mode-specific polymorphic hook) |
| +48 | 40 | int32[10] | regPressureCounters | Per-register-class live counts (copied from per-BB record +4..+40): R, P, UR, UP, B, and 5 arch-specific. The engine decrements counters [1]..[9] in the scheduling loop; counter [0] (R class) uses a separate path. |
| +60 | 4 | int32 | mode | Scheduling mode: 0 = ILP/Latency, 1 = ReduceReg, 2 = DynBatch |
| +88 | 4 | int32 | maxBBDepth | Maximum dependency depth across all basic blocks |
| +92 | 4 | int32 | maxBBDepthNonTensor | Maximum depth excluding tensor instructions |
| +176 | 1 | byte | scheduleActive | 1 during ReduceReg and DynBatch phases, 0 during ILP/Latency |
| +178 | 1 | byte | reduceRegMode | When set, tightens register budget by ~12.5% + 3 |

Phase Control (offsets 240--312)

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +240 | 4 | int32 | currentPhase | Phase ID: 0 = budget computation, 1 = ReduceReg, 2 = ILP |
| +248 | 8 | ptr | regBitvector1 | Register pressure bitvector (numRegs + 1 words) |
| +256 | 8 | ptr | regBitvector2 | Second bitvector for dual-register tracking (knob 420, LivenessUseHiLo) |
| +280 | 8 | ptr | analysisSlots | 48-byte per-BB analysis slots (allocated when opt_level > 2) |
| +292 | 1 | byte | regTargetValid | Whether register targets from knobs 776/778 (SchedReduceIncLimit/SchedReduceIncLimitHigh) are valid |
| +296 | 4 | int32 | regTargetPrimary | Forward-pass primary register target (knob 776 SchedReduceIncLimit, in HW register units) |
| +300 | 4 | int32 | regTargetSecondary | Forward-pass secondary register target (knob 778 SchedReduceIncLimitHigh, in HW register units) |
| +311 | 1 | byte | cfgFlag1 | Priority queue depth configuration flag |
| +312 | 4 | int32 | cfgParam1 | Configuration parameter (default 10) |

Register Budget (offsets 316--432)

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +316 | 4 | int32 | minRegs | Minimum register count from architecture register limit |
| +320 | 4 | int32 | pressureSlack | Register pressure headroom (initialized to 0) |
| +324 | 4 | int32 | committedTarget | Committed register target (set to regBudget after budget computation) |
| +328 | 4 | int32 | dualIssueBenefit | Dual-issue benefit score from sub_8CF5D0 (sm_50 only) |
| +380 | 4 | int32 | latencyCutoff | Barrier-target latency cutoff; controls critical-path bit activation |
| +384 | 1 | byte | hasTextureOps | Set to 1 when opcode 246 (texture operation) found in any BB |
| +388 | 4 | int32 | maxBBSize | Maximum basic block size in instructions (capped at 4095) |
| +392 | 4 | int32 | maxBBSizeForAlloc | Copy of maxBBSize used for resource slot allocation sizing |

Stall / Batch Parameters (offsets 404--420)

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +404 | 4 | int32 | maxStallCycles | Max stall cycles; from knob 805/806 (SchedTexBatchTargetSelect{Register,Scheduler}Target), capped at 16 |
| +408 | 4 | int32 | stallThreshold | Stall threshold; knob 741 (SchedCountLoadsPerTex), default 3 |
| +412 | 4 | int32 | batchDepth | Batch depth; knob 761 (SchedMaxRLiveOKslack), default 3 (6 or 12 for sm_50 with dual-issue) |
| +416 | 4 | int32 | extraRegReserve | Extra register reservation; knob 762 (SchedMaxRLiveOKslackColdBlocks), default -1 (disabled) |
| +420 | 4 | int32 | spillModeCountdown | Spill-mode countdown; when > 0, forces aggressive scheduling with critical-path bit always set |

Register Budget and Pressure Tracking (offsets 432--485)

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +432 | 4 | int32 | regBudget | Target register count (occupancy-aware, from sub_8CEE80) |
| +440 | 8 | ptr | livenessBV.data | Register liveness bitvector data (via sub_BDBA60); sized to numRegs+1 or 2*numRegs+2 if dual-reg |
| +448 | 8 | ptr | livenessBV.alloc | Bitvector allocator reference |
| +456 | 4 | int32 | livenessBV.size | Bitvector size in 64-bit words |
| +464 | 4 | int32 | depthThreshold | Number of barrier-target instructions required to activate critical-path bit |
| +480 | 4 | int32 | currentCycle | Current scheduling cycle; used for stall-free evaluation |
| +484 | 1 | byte | phaseActive | Phase activity flag: 1 = ReduceReg active, 0 = ILP/budget |
| +485 | 1 | byte | schedDirty | Reset to 0 at orchestrator start |

Hot-Cold and Yield State (offsets 523--532)

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +523 | 1 | byte | hotColdEnable | Hot-cold memory tracking enable; result of sub_8CF5D0 (dual-issue check) |
| +524 | 1 | byte | yieldState | Current yield state; propagated to CONTROL instructions via priority bit 6 |
| +532 | 4 | int32 | hotColdBudget | Hot-cold budget counter; decremented per cold instruction; tracking deactivates at zero |

Architecture Parameters (offsets 604--616)

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +604 | 4 | int32 | archParam1 | Architecture-dependent parameter (6 for sm_60 era) |
| +616 | 4 | int32 | archParam2 | Architecture-dependent limit (63 for sm_50 era, 255 for sm_60+) |

Resource Tracking and Dependency Data (offsets 672--744)

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +672 | 8 | ptr | resourceSlots | Per-BB resource cost table; 84 bytes per slot (21 DWORDs: 10 FU usage + 10 FU delta + 1 flag) |
| +680 | 8 | ptr | depData | Dependency tracking data (zeroed at orchestrator start) |
| +720 | 8 | ptr | arenaAllocRef | Arena allocator reference for bitvector buffer resizing |
| +728 | 8 | ptr | bvBuffer | Growable bitvector buffer pointer (1.5x growth factor on realloc) |
| +736 | 4 | int32 | bvCapacity | Bitvector capacity in words (-1 = uninitialized sentinel) |
| +740 | 4 | int32 | bvAllocated | Bitvector allocated word count |
| +744 | 8 | ptr | funcContextRef2 | Second reference to function context for bitvector sizing |

Liveness Bitvector (offset 832)

The scheduler tracks register liveness via a bitvector at offset +832 (referenced only in the scheduling algorithm). Each bit represents one register; pressure is computed as popcount(live_bv). This field is part of the larger scheduling state managed by the engine and priority function.
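The pressure-from-bitvector computation is a straightforward popcount over the live set. A minimal sketch (one bit per register; the word count and helper name are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch: register pressure as popcount(live_bv).  Uses the
 * GCC/Clang builtin; the function name is invented for illustration. */
static int pressure_of(const uint64_t *bv, int words) {
    int n = 0;
    for (int i = 0; i < words; i++)
        n += __builtin_popcountll(bv[i]);   /* count live bits per word */
    return n;
}
```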

Arena Allocator (offset 840+)

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +840 | ~120 | ArenaAllocator | arena | Embedded bump allocator; freed via sub_8E3A80(sched+840) at each pass end; 10 KB block granularity, 8-byte alignment |

Configuration Bitfields (offsets 1032--1098)

The region from +1032 through +1098 (~67 bytes) is a dense bitfield array set by sub_A95DC0 (SchedulingContext::configure). Individual bits control fine-grained scheduling features, gated by architecture version, optimization level, and knob queries. Key fields:

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +1032 | 1 | byte | featureFlags0 | Pipeline feature enables (OR'd with 0x4F) |
| +1052 | 4 | int32 | cfgMaxDepth | Knob 449 value (default 5); scheduling depth limit |
| +1064 | 1 | byte | cfgSmFlags | Bit 0: SM-specific flag (knob 931 or arch > 16386) |
| +1072 | 8 | double | pressureCoeff | Knob 366 value (default 0.25); register pressure coefficient |
| +1080 | 1 | byte | cfgBitmask | Bits: [7] always set, [6] knob 868, [5] hot-cold, [4] knob 410, [3] knob 868 alt |
| +1084 | 4 | int32 | cfgThreshold | Knob 876 value (default 50) |
| +1088 | 1 | byte | cfgBitmask2 | Bit 3: knob 752 related |
| +1089 | 1 | byte | cfgBitmask3 | Bit 7: arch == 16387 or arch == 0x4000 |
| +1096 | 1 | byte | cfgBitmask4 | Bit 7: external flag from target descriptor +788 |
| +1097 | 1 | byte | cfgBitmask5 | Bits: [7] target+1844, [4] arch <= 16386, [3] sm_50 dual-issue, [1,0] target+788 |
| +1098 | 1 | byte | cfgBitmask6 | Bit 0: knob 462 (scheduling heuristic), Bit 5: arch == 16386 |

Architecture-Specific Defaults (offsets 1408--1584)

Set early in sub_A95DC0 based on *(a1+372) >> 12 (architecture class). Three code paths populate these fields for sm_50 era (class < 3), sm_60--sm_89 era (class == 4), and sm_90+ era (class >= 5):

| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +1408 | 1 | byte | archMode0 | Architecture scheduling mode flag |
| +1411 | 1 | byte | archMode1 | Scheduling sub-mode |
| +1412 | 1 | byte | archMode2 | Scheduling sub-mode |
| +1413 | 1 | byte | archMode3 | Scheduling sub-mode |
| +1414 | 1 | byte | archMode4 | Architecture mode flag |
| +1415 | 1 | byte | archMode5 | Architecture mode flag; bit 2 checked during batch depth selection |
| +1416 | 1 | byte | archMode6 | Architecture mode flag |
| +1440 | 16 | __m128i | archVector | SSE-loaded scheduling parameters (4 x int32) |
| +1452 | 4 | int32 | archWarpSize | Warp/thread configuration: 64 or 128 |
| +1456 | 4 | int32 | archDispatchSize | Dispatch slot parameter: 16, 32, or 64 |
| +1460 | 4 | int32 | archMaxThreads | Max threads per SM: 512 or 1024 |
| +1464 | 4 | int32 | archParam5 | Architecture parameter: 4 (sm_60+ only) |
| +1472 | 4 | int32 | archBlockSize | Block size parameter: 32 |
| +1480 | 8 | int64 | archSpecData | Architecture-specific encoded scheduling data |
| +1584 | 16 | __m128i | archProfile | SSE-loaded architecture profile vector |

Memory Layout Diagram

SchedulerContext (~1600 bytes)
+--------+--------+--------+--------+--------+--------+--------+--------+
|+0  vtable       |+8  funcContext   |+16 allocator    |+24 (padding)    |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+32 (padding)    |+40 preHookVtable |+48  regPressureCounters[0..9]     |
+--------+--------+--------+--------+--------+--------+--------+--------+
|  ...counters... |+60 mode |+64..84 |+88  maxBBDepth  |+92 maxBBDpthNT |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+96..175 (internal state)           |+176 active|+178 rrMode|           |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+240 phase       |+248 regBV1       |+256 regBV2      |                 |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+280 analysisSlots         |+292 valid|+296 tgtPri|+300 tgtSec|         |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+316 minR|+320 slack|+324 commit|+328 dualIss|  ...  |+380 latCut|     |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+384 tex |+388 maxBB|+392 alloc |    |+404 stall|+408 thresh|+412 batch|
+---------+-------+---------+--------+---------+-------+--------+--------+
|+416 xtraReg|+420 spillCnt|    |+432 budget|+440..456 livenessBV       |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+464 depth  |    |+480 cycle|+484 act|+485 dirty|    |+523 hcE|+524 yld|
+---------+-------+---------+--------+---------+-------+--------+--------+
|+532 hcBudget|  |+604 archP1|  |+616 archP2|                           |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+672 resourceSlots         |+680 depData       |    ...bitvector mgr... |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+720 arenaRef    |+728 bvBuf        |+736 cap  |+740 alloc|+744 funcRef|
+---------+-------+---------+--------+---------+-------+--------+--------+
|                      ...gap / internal state...                        |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+832 liveness bitvector ref |                                           |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+840    ArenaAllocator (embedded sub-object, ~120 bytes)                |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+960..1031 (internal/padding)                                           |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1032..1098  configuration bitfield array (~67 bytes)                   |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1099..1407 (internal state, ~308 bytes)                                |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1408..1416  architecture mode flags (9 bytes)                          |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1440 archVector (16B) |+1452..1484 arch params |+1584 archProfile (16B)|
+---------+-------+---------+--------+---------+-------+--------+--------+

Function Map

| Address | Size | Identity |
|---|---|---|
| sub_6820B0 | 1.5 KB | BuildReadyList -- zero-dep instruction scan |
| sub_682200 | -- | UnlinkFromReadyList -- remove and update deps |
| sub_682490 | 14 KB | RegisterPressureAnalyzer -- per-class deltas |
| sub_6833F0 | 10 KB | InitScheduleRegion -- per-BB setup and knob query |
| sub_685A10 | 11 KB | InstructionBarrierCheck -- opcode analysis |
| sub_687FE0 | 12 KB | ScheduleBlock -- per-BB scheduling entry |
| sub_688DD0 | 20 KB | ScheduleEngine -- unified 3-mode engine |
| sub_68A690 | 31 KB | BuildDependencies -- def-use chain DAG |
| sub_68B9C0 | 46 KB | DependencyGraphBuilder -- full DAG construction |
| sub_692200 | 18 KB | SchedulingHeuristic -- priority with FP scoring |
| sub_695530 | 15 KB | ComputeLatencies -- instruction latency computation |
| sub_69B7D0 | 17 KB | TopologicalSort -- valid execution ordering |
| sub_69F170 | 12 KB | CriticalPathAnalysis -- DAG critical path |
| sub_893100 | 17 KB | ClassifyInstruction -- opcode/operand analysis |
| sub_894290 | 27 KB | BuildOperandDependencies -- operand-level edges |
| sub_896D50 | 90 KB | InitOpcodeTable -- ROT13 SASS mnemonic table |
| sub_89FBA0 | 85 KB | SetOpcodeLatencies -- per-opcode latency table |
| sub_8BF890 | 929 B | AllocDynBatchData -- DynBatch context allocation |
| sub_8C1BA0 | 6.3 KB | InitDynBatchState -- batch initialization |
| sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost |
| sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy |
| sub_8C7720 | 20 KB | ReorderInstructions -- red-black tree reordering |
| sub_8C9320 | 47 KB | ComputePriority -- multi-criteria heuristic |
| sub_8CBAD0 | 2.9 KB | PreScheduleSetup -- BB scan, 4095-instr limit |
| sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check |
| sub_8CD160 | 9.3 KB | ScheduleBasicBlock -- per-BB ordering loop |
| sub_8CD6E0 | 1.3 KB | ReverseSchedule -- reverse post-order BBs |
| sub_8CE520 | 12 KB | RegisterBudgetCurve -- piecewise linear model |
| sub_8CEE80 | 8.7 KB | ComputeRegisterBudget -- occupancy-aware |
| sub_8CF5D0 | 3.5 KB | CheckDualIssueEligibility |
| sub_8CF880 | 28 KB | BuildDependencyGraph -- pre-scheduling DAG |
| sub_8D0640 | 22 KB | ScheduleInstructions -- top-level orchestrator |
| sub_8D9930 | 19 KB | BuildDependencyEdges -- RAW/WAR/WAW edges |
| sub_8E3970 | ~53 B | ArenaAlloc -- bump allocator |
| sub_8E3A80 | ~22 ln | ArenaFreeAll -- release all blocks |
| sub_8E4400 | 3.3 KB | InitHWProfile_Warp -- warp dispatch params |
| sub_8E5CA0 | 20 KB | MasterHWProfileBuilder -- latency/throughput |
| sub_8F1EB0 | 15 KB | EncodeScheduleWords -- SASS control word output |
| sub_8F6530 | 13 KB | OutputCompleteSchedule -- final output assembly |
| sub_A95DC0 | 35 KB | SchedulingContext::configure -- knob loading |
| sub_A97600 | 42 KB | PostSchedulePass::runOnFunction |
| sub_A9DDD0 | 11.5 KB | HandleLargeFunction -- chunk-based scheduling |

Cross-References

  • Scheduling Algorithm -- priority list scheduling internals, ready list management, backtracking
  • Latency Model -- per-opcode latency tables, functional unit mapping, architecture profiles
  • Scoreboards & Barriers -- scoreboard encoding, dependency barrier assignment, stall/yield format
  • Register Allocation -- register allocator that the scheduler interacts with
  • Phase Manager -- how ScheduleInstructions fits in the 159-phase pipeline
  • Knobs -- the 76 scheduling knobs and the knob query infrastructure
  • GMMA Pipeline -- GMMA/WGMMA operations targeted by DynBatch

Priority List Scheduling Algorithm

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The scheduling engine implements a classical priority list scheduling algorithm extended with GPU-specific heuristics for register pressure management, functional unit contention avoidance, yield hint generation, and barrier-aware instruction ordering. A single unified engine (sub_688DD0, 20 KB) serves all three scheduling phases -- ReduceReg, ILP/Latency, and DynBatch -- differentiated only by a mode byte that selects different priority weight configurations. The algorithm iterates basic blocks, builds a ready list of zero-dependency instructions, selects the highest-priority candidate via an 8-bit packed heuristic, emits it into the final schedule, updates the dependency DAG, and repeats until all instructions in the block are placed.

Unified engine: sub_688DD0 (20 KB) -- mode-parameterized core loop
Priority function: sub_8C9320 (47 KB, ~1300 lines) -- 8-bit packed heuristic
Ready list builder: sub_6820B0 (1.5 KB) -- zero-predecessor scan
Dependency pre-scan: sub_8CF880 (28 KB) -- reverse BB iteration
Edge builder: sub_8D9930 (19 KB) -- RAW/WAR/WAW/memory/barrier edges
Instruction mover: sub_925510 (341 bytes) -- doubly-linked list relink
Resource tracker: sub_A09530 (365 bytes) -- per-instruction stall update
Stall/barrier encoder: sub_8D7760 (41 KB) -- control word generation
Alternative loop: sub_68B9C0 (46 KB) -- combined DAG + scheduling
BB size limit: 4095 instructions (split via sub_931920)
Large function limit: 16383 instructions (chunk-based via sub_A9DDD0)

Core Algorithm

The unified scheduling engine executes the following sequence for each basic block. All three phases (ReduceReg mode 0x39, ILP mode 0x49, DynBatch mode 0x41) follow this identical structure; only the priority weight selection differs.

function ScheduleEngine(sched, mode, arg3, rebuild):
    if rebuild:
        InitScheduleRegion(sched)                 // sub_6833F0
        // Allocates 72-byte per-BB records
        // Queries knobs 595, 743, 747
        // Calls sub_7E5120 for instruction characterization

    for each bb in sched.basic_blocks:            // 72-byte stride
        InitResourceTracking(sched, bb)           // sub_A091C0
        // Zeroes 40-byte resource records (one per register class + 1)

        BuildReadyList(sched)                     // sub_6820B0

        while ready_list is not empty:
            best = SelectHighestPriority(sched)   // via priority vtable
            UnlinkFromReadyList(sched, best)      // sub_682200
            MoveInstruction(sched, best, ref)     // sub_925510
            UpdateStallCycles(sched, best)         // sub_A09530
            UpdateWARTracking(sched, best)         // sub_A09D40

            for each successor of best:
                successor.dep_count -= 1
                if successor.dep_count == 0:
                    ready_list.insert(successor)
                    // Sorted insertion using priority value

        PostBBCleanup(sched, bb)                  // sub_BDC200 / sub_BDCDE0

The outer loop iterates basic blocks via an array of 72-byte records (v112 = 72 * bb_index). The inner loop is a standard worklist algorithm: remove the highest-priority ready instruction, schedule it, and propagate readiness to its successors in the dependency DAG by decrementing their predecessor counts.
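The worklist core can be sketched concretely. The 4-node DAG and priority values below are invented for illustration; only the select-schedule-propagate structure mirrors the recovered loop.

```c
#include <assert.h>

/* Hedged sketch of priority list scheduling on an invented 4-node DAG:
 * node 0 precedes 1 and 2; both precede 3.  Bigger priority = earlier. */
#define N 4

static int prio[N]      = { 10, 30, 40, 5 };
static int dep_count[N] = { 0, 1, 1, 2 };               /* unsatisfied preds */
static int succ[N][2]   = { {1, 2}, {3, -1}, {3, -1}, {-1, -1} };
static int order[N];

static void schedule(void) {
    int placed[N] = {0};
    for (int n = 0; n < N; n++) {
        int best = -1;
        for (int i = 0; i < N; i++)          /* pick highest-priority ready */
            if (!placed[i] && dep_count[i] == 0 &&
                (best < 0 || prio[i] > prio[best]))
                best = i;
        placed[best] = 1;
        order[n] = best;
        for (int s = 0; s < 2; s++) {        /* propagate readiness */
            int t = succ[best][s];
            if (t >= 0)
                dep_count[t]--;
        }
    }
}
```

With these priorities, node 2 (priority 40) is scheduled before node 1 even though both become ready at the same time.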

Mode Selection

The mode value stored at *(DWORD*)(scheduler+60) controls which priority weight set the engine uses:

| Mode | Value | Callback | Objective |
|---|---|---|---|
| ReduceReg | 1 | 0x39 | Minimize register pressure. Prioritizes instructions that release registers (last-use operands). |
| ILP/Latency | 0 | 0x49 | Maximize instruction-level parallelism. Prioritizes critical-path and long-latency instructions. |
| DynBatch | 2 | 0x41 | Batch-aware tensor scheduling. Groups GMMA/WGMMA operations for warpgroup cooperation. |

The engine uses vtable dispatch at *(a1+40) and *(a1+48) for polymorphic pre/post scheduling hooks. This allows each mode to inject custom behavior at scheduling boundaries without modifying the core loop.

Ready List Construction

sub_6820B0 (1.5 KB) scans the instruction linked list and collects every instruction with zero unsatisfied dependencies into a sorted ready list.

function BuildReadyList(sched):
    for instr in sched.instruction_list:          // linked list at sched[20]
        if instr.opcode == 52:                    // NOP / BB boundary marker
            continue                              // skip marker, keep walking the list

        metadata = *(QWORD*)(instr + 40)          // SchedNode pointer
        if *(DWORD*)(metadata + 8) == 0:          // depCount == 0
            *(QWORD*)(metadata + 16) = sched[5]   // link to current head
            sched[5] = instr                       // new head
            vtable_callback(sched, 104, instr)     // insertion hook
            *(DWORD*)(metadata + 28) = 0           // reset latency counter

The ready list is a singly-linked list threaded through SchedNode offset +16 (the nextReady field). Sort order is maintained at insertion time by the priority function -- each new instruction is inserted at its correct position so that the head of the list is always the highest-priority candidate. All metadata+N offsets throughout the scheduling pages refer to fields within the SchedNode block pointed to by instr+40 (sched_slot), not offsets from the instruction object itself. See the SchedNode layout for the complete field map.

Opcode 52 instructions are phantom BB boundary markers. The builder skips them but follows their linked-list successors to reach real instructions beyond the boundary.

The vtable+104 callback provides a polymorphic insertion hook. Different scheduling strategies can override this to implement custom ready-list policies (e.g., the dual-issue scheduler uses it to pair co-issuable instructions).

Priority Function

sub_8C9320 (47 KB decompiled, ~1300 lines) is the heart of instruction selection. It computes a scheduling priority as a packed 8-bit integer in which each bit encodes a different heuristic criterion. Because priority comparison reduces to a single integer comparison, the ready list maintains sort order without multi-key sorting overhead.

8-Bit Priority Encoding

Bit      Name                 Meaning                                                                         Notes
7 (MSB)  yield-related        Instruction is near a yield boundary                                            Higher priority ensures yield hints align with scheduling boundaries
6        yield flag           Instruction triggers or participates in a yield sequence                        Controls warp scheduler round-robin interaction
5        hot-cold             Memory access temperature (1 = hot, 0 = cold)                                   Hot = global/texture/surface loads with long latencies; cold = constant/shared
4        hot-cold / pressure  Packed byte holds hot-cold flag; pressure overflow acts as comparison override  See Priority Function Internals for the dual mechanism
3        same-BB preference   Instruction belongs to the currently-scheduled BB                               Discourages cross-BB instruction motion
2        stall-free           Scheduling this instruction introduces zero stall cycles                        All producer latencies have completed
1        latency-bound        Instruction is on the DAG critical path                                         Prioritizes latency-sensitive dependency chains
0 (LSB)  tiebreaker           Additional ordering factor                                                      Encodes instruction position, operand count, or FU preference

Higher numeric value = higher priority. Bit 7 is the most significant criterion: yield-boundary instructions always schedule before non-yield instructions when both are ready.
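Because the packed byte places the most significant criterion in the MSB, one unsigned comparison is equivalent to a lexicographic comparison over all eight criteria. A minimal model of the packing (field names follow the table above; this is not the binary's exact packing code, which applies additional masks -- see Packed Byte Assembly):

```c
#include <stdint.h>

/* Illustrative packing of the 8 criteria into one byte, MSB = most
 * significant criterion. All inputs are expected to be 0 or 1. */
static uint8_t pack_priority(int yield_rel, int yield_flag, int hot,
                             int hot_or_pressure, int same_bb,
                             int stall_free, int critical, int tiebreak) {
    return (uint8_t)((yield_rel       << 7) | (yield_flag << 6) |
                     (hot             << 5) | (hot_or_pressure << 4) |
                     (same_bb         << 3) | (stall_free << 2) |
                     (critical        << 1) |  tiebreak);
}
```

With this model, a yield-related instruction (0x80) always outranks one that is merely hot, stall-free, and critical (0x26), matching the rule that bit 7 dominates.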

Hot-Cold Classification

The hot-cold flag (bit 5) classifies memory operations by expected latency:

  • Hot (bit 5 = 1, higher priority): global memory loads (LDG), texture fetches (TEX, TLD), surface operations. These have high latency (hundreds of cycles) and benefit most from early scheduling to overlap with computation. Detected by sub_A9CDE0.
  • Cold (bit 5 = 0, lower priority): constant memory loads (LDC), shared memory operations (LDS, STS). These have low latency and do not need early scheduling. Detected by sub_A9CF90, which also suppresses the pressure-overflow and critical-path extension signals.

Classification uses sub_A9CDE0 (hot detection) and sub_A9CF90 (cold detection). Memory space type is determined by sub_693BC0, which returns space codes: 3 = shared, 16 = global, 2 = local, 11 = surface, 7 = constant, 1 = generic. Hot-cold tracking is gated by scheduler+523 and scheduler+532; when the hot-cold budget (scheduler+532) reaches zero, the feature deactivates for the remainder of the BB.
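The recovered space codes and the hot/cold split can be summarized as a small classifier. The enum names are descriptive guesses; the numeric values are the codes returned by sub_693BC0 as documented above, and the policy mirrors the documented behavior of sub_A9CDE0/sub_A9CF90 (texture fetches are classified by opcode rather than space code, so they are omitted here):

```c
/* Space codes recovered from sub_693BC0; enum names are descriptive
 * guesses, the numeric values are as documented above. */
enum MemSpace {
    SPACE_GENERIC  = 1,
    SPACE_LOCAL    = 2,
    SPACE_SHARED   = 3,
    SPACE_CONSTANT = 7,
    SPACE_SURFACE  = 11,
    SPACE_GLOBAL   = 16
};

/* Hot = long-latency off-chip traffic; cold = short-latency on-chip.
 * A sketch of the policy, not the decompiled predicate logic. */
static int is_hot_space(enum MemSpace s) {
    return s == SPACE_GLOBAL || s == SPACE_SURFACE;
}
static int is_cold_space(enum MemSpace s) {
    return s == SPACE_SHARED || s == SPACE_CONSTANT;
}
```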

Pressure Overflow

Bit 4 in the packed byte holds the hot-cold flag (see above). The pressure overflow signal is a separate Boolean computed by checking all four register classes (GPR, predicate, address, UGP) against their respective limits. When any class exceeds its budget, the pressure overflow flag activates and acts as a comparison override: the candidate wins regardless of the packed priority byte, forcing the scheduler to select the instruction that relieves register pressure. This is the primary mechanism by which the ReduceReg phase achieves its objective: the mode sets a tight register budget via scheduler+178, causing pressure overflow to activate frequently and driving the scheduler toward pressure-reducing orderings. See the Priority Function Internals section for the exact per-class threshold checks.

Priority Evaluation Sequence

The priority function evaluates criteria in this order for each candidate instruction:

  1. sub_8C7290: extract 4-class register deltas, same-BB flag, and per-BB resource vector (SSE-optimized)
  2. Compute yield saturation: check write-barrier counters for predicate, GPR, and UGP register classes against their ceilings (7, 7, and target_desc+624 respectively)
  3. sub_8C67A0: compute per-instruction resource cost if BB slot not yet committed
  4. sub_8C7120: update barrier tracking state (if metadata+111 bit 7 set)
  5. Evaluate register pressure: compute per-class overflow against budget (scheduler+432) and per-class limits; derive pressure-overflow Boolean
  6. Evaluate stall-free: compare earliest cycle (metadata+32) vs current cycle (scheduler+480)
  7. Evaluate critical path: compare barrier-target count vs depth threshold (scheduler+464)
  8. Evaluate yield bits: opcode 39 (yield-related) and opcode 96 (yield flag from scheduler+524)
  9. Pack 8 bits into priority byte
  10. Evaluate hot/cold: sub_A9CDE0 / sub_A9CF90 (only when scheduler+523 active)
  11. Multi-stage comparison against running best: resource vectors, then XOR-based bit scan, then secondary tiebreakers

The function scans the full ready list in a single pass (not limited by knob 770 for the scan itself). Knob 770 (priority queue depth, default 4) controls the depth threshold mechanism for critical-path activation, not the number of candidates evaluated.

Key Internal Variables

Variable       Source                                                  Content
budget_hw      sub_6818D0(scheduler, scheduler[432] - scheduler[412])  Register budget in HW register units
reduced_hw     sub_6818D0(scheduler, budget - budget/16)               Tighter budget for critical-path threshold (or knob 760 override)
queue_depth    knob 770                                                Depth threshold parameter (default 4); controls critical-path activation
per_bb_flag    knob 769                                                Per-BB scheduling flag; when set, resets yield state between BBs
scheduler+420  state                                                   Spill-mode countdown; when > 0, forces aggressive scheduling with bit 1 = 1
scheduler+464  state                                                   Depth threshold -- number of barrier targets that must be ready before critical-path activates
scheduler+480  state                                                   Current scheduling cycle; used for stall-free evaluation
scheduler+523  state                                                   Hot-cold tracking enable flag; gated by knob
scheduler+524  state                                                   Current yield state; propagated to CONTROL instructions via bit 6
scheduler+532  state                                                   Hot-cold budget counter; decremented per cold instruction, disables tracking at zero
scheduler+672  allocation                                              Per-BB resource cost table (84 bytes per slot)

Support Subroutines

Address     Size    Purpose
sub_8C67A0  3.7 KB  Compute per-instruction resource cost. Calls sub_A08A00 (resource model) three times: mode 1 = instruction's own cost, mode 2 = operand release cost, mode 3 = combined BB-level impact. Uses SSE _mm_add_epi32 for vector accumulation.
sub_8C7290  5.1 KB  Copy 10-element int32 resource vector from per-BB table at scheduler+672. SSE _mm_loadu_si128 bulk copy. Special case: opcode 97 (STG in ROT13; used as control/boundary marker) returns base scheduler state with zeroed deltas.
sub_8C7720  20 KB   Red-black tree operations for instruction reordering within BB. Maintains a balanced BST of scheduling candidates for O(log N) insertion, removal, and priority update.
sub_8C7120  --      Barrier tracking state update.
sub_693BC0  --      Memory space classification and latency query.
sub_6818D0  --      Register count to hardware-aligned unit conversion.

Priority Function Internals

The full logic of sub_8C9320 divides into three phases: (1) pre-scan the ready list to collect aggregate BB statistics, (2) iterate the ready list a second time evaluating each candidate and maintaining a running best, and (3) update scheduler state and return the winner. The function signature is (scheduler, &second_best) -> best_instruction.

Phase 1: Pre-Scan Statistics

Before priority evaluation begins, the function iterates the entire ready list (linked via metadata+16) and accumulates per-BB statistics that feed into the per-instruction priority decisions:

Variable              Init   Accumulation                                       Meaning
shared_mem_count      0      ++ when opcode 183 and sub_693BC0 returns space 3  Count of shared-memory operations in ready list
neg_reg_deficit       0      += delta when register delta < 0                   Total register pressure reduction from ready instructions
max_dep_cycle         -1     max(current, metadata+92)                          Highest dependency cycle among all ready instructions
max_pred_cycle        0      max(current, metadata+88)                          Highest predecessor cycle among all ready instructions
barrier_count         0      ++ when metadata+108 & 1                           Count of barrier-target instructions in ready list
dep_flag_count        0      ++ when metadata+108 & 2                           Count of instructions with dependency-set flag
pos_pressure_sum      0      += delta when register delta > 0                   Total register pressure increase from ready instructions
filtered_pressure     0      += delta when within depth threshold               Pressure increase from depth-eligible instructions
max_barrier_slot      -1     max(current, metadata+24) for barrier targets      Latest BB slot among barrier-target instructions
min_barrier_latency   99999  min(current, metadata+28) for barrier targets      Shortest latency counter among barrier-target instructions
max_nonbarrier_cycle  -1     max(current, metadata+32) for non-barrier          Latest earliest-available-cycle for non-barrier instructions
any_stall_free        0      |= (metadata+32 >= 0)                              Whether any instruction can issue without stalling
total_ready           0      ++ for every instruction                           Total instructions in ready list
preferred_instr       NULL   non-barrier instr with max metadata+24             The program-order-latest non-barrier instruction

The pre-scan also maintains a depth-threshold table: an array of up to 32 barrier-target instruction pointers sorted by their latency counter (metadata+28). This table is scanned to compute scheduler+464 (depth threshold) and scheduler+380 (latency cutoff), which control when the critical-path bit activates.
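A bounded table of this shape, kept sorted by latency counter, implies insertion-sort with drop-at-capacity semantics. A sketch with assumed names (the binary's exact eviction policy is not recovered; here an entry sorting past slot 31 is simply dropped):

```c
#define DEPTH_TABLE_MAX 32

/* Bounded sorted table keyed by latency counter (SchedNode+28),
 * ascending. Struct and field names are assumptions for illustration. */
typedef struct { int latency; void *instr; } DepthEntry;

/* Insert one barrier-target entry; returns the new entry count. */
int depth_table_insert(DepthEntry *tab, int count, int latency, void *instr) {
    int pos = count;
    while (pos > 0 && tab[pos - 1].latency > latency) pos--;
    if (pos >= DEPTH_TABLE_MAX) return count;    /* worse than all 32 kept */
    if (count < DEPTH_TABLE_MAX) count++;
    for (int i = count - 1; i > pos; i--)        /* shift tail up; when the */
        tab[i] = tab[i - 1];                     /* table is full this drops */
    tab[pos].latency = latency;                  /* the largest latency     */
    tab[pos].instr = instr;
    return count;
}
```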

Phase 2: Register Budget Prologue

Before the main loop, the function computes two register budgets from scheduler+432 (target register count):

budget_base = scheduler[432] - scheduler[412]     // target minus committed

if ReduceReg_mode (scheduler+178):               // ReduceReg tightens budget
    if scheduler[416] < 0:
        budget_base -= (scheduler[432] / 8) + 3   // reduce by ~12.5% + 3
    else:
        budget_base -= scheduler[416]              // explicit reduction

budget_hw    = RegToHWUnits(scheduler, budget_base)     // sub_6818D0
reduced_hw   = RegToHWUnits(scheduler, budget_base - budget_base/16)
                                                         // ~6.25% tighter

if knob_760_active:
    reduced_hw = RegToHWUnits(scheduler, budget_base - knob_760_value)

queue_depth  = 4                                  // default
if knob_770_active:
    queue_depth = knob_770_value                  // override

budget_hw sets the threshold for bit 4 (pressure overflow). reduced_hw provides a tighter threshold used in the critical-path assessment. queue_depth (knob 770) is the depth-threshold parameter for critical-path activation; as noted above, it does not limit how many candidates the ready-list scan evaluates.
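The budget arithmetic is easy to sanity-check with concrete numbers. The sketch below mirrors the prologue minus the HW-unit conversion through sub_6818D0; the values in the usage note (target = 64, committed = 8) are illustrative, not recovered constants:

```c
/* Register-budget prologue arithmetic, integer division as decompiled.
 * Field and function names are illustrative. */
typedef struct { int base, reduced; } Budgets;

static Budgets compute_budgets(int target, int committed, int reduce_reg_mode) {
    Budgets b;
    b.base = target - committed;          /* scheduler[432] - scheduler[412] */
    if (reduce_reg_mode)
        b.base -= target / 8 + 3;         /* scheduler[416] < 0 path: ~12.5% + 3 */
    b.reduced = b.base - b.base / 16;     /* ~6.25% tighter critical-path budget */
    return b;
}
```

compute_budgets(64, 8, 0) gives base = 56 and reduced = 53 (56 - 56/16); with ReduceReg active, base first drops by 64/8 + 3 = 11 to 45, and reduced becomes 43.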

Phase 3: Per-Bit Computation

For each instruction in the ready list, sub_8C7290 extracts its per-register-class deltas (4 classes: GPR, predicate, address, UGP) and the same-BB flag. Then each priority bit is computed:

Bit 7 -- Yield-related. Determined by opcode. Only opcode 39 (YIELD instruction variant) can set this bit. The condition checks the last operand's low 2 bits:

if opcode_masked == 39:
    operand_index = operand_count - 1 - ((opcode >> 11) & 2)
    yield_related = (instr[84 + 8*operand_index] & 3) == 0
else:
    yield_related = 0

When set, the instruction is a yield boundary marker and receives absolute highest priority regardless of all other heuristics.

Bit 6 -- Yield flag. Set only for opcode 96 (CONTROL instruction):

if opcode_masked == 96:
    yield_flag = scheduler[524]       // current yield state
else:
    yield_flag = 0

// Post-adjustment: suppress when hot/pressure bits dominate
if (bit5_set || bit4_set):
    yield_flag = 0
    if metadata[32] < scheduler[480]:    // behind schedule
        yield_flag = scheduler[396] ? original_yield : 0

The yield flag propagates the scheduler's warp yield state only through CONTROL instructions, ensuring yield hints align with scheduling barriers.

Bit 5 -- Hot-cold classification. Requires hot-cold tracking to be active (scheduler+523 set, gated by scheduler+532 > 0):

if hot_cold_active:
    is_hot = sub_A9CDE0(target_desc, context, instruction)
else:
    is_hot = 0

// Cold detection suppresses priority
if sub_A9CF90(target_desc, context, instruction):    // is_cold?
    pressure_overflow = 0                             // suppress bit 4
    critical_extension = 0                            // suppress lookahead

sub_A9CDE0 returns true for global memory loads (LDG), texture fetches (TEX, TLD), and surface operations -- instructions with latencies in the hundreds of cycles. sub_A9CF90 returns true for constant loads (LDC), shared memory operations (LDS/STS) -- low-latency operations. Hot instructions (bit 5 = 1) get higher priority to schedule early and overlap their long latencies with computation. Cold instructions (bit 5 = 0) are deprioritized.

Bit 4 -- Pressure overflow. This bit does NOT appear directly in the initial packing as a single variable. Instead, the pressure overflow signal (v81 in decompiled source) feeds into the candidate comparison logic as an override. The mechanism:

// For barrier-target instructions:
budget_in_units = RegToHWUnits(scheduler, scheduler[432])
headroom        = RegToHWUnits(scheduler, 8)
if budget_in_units > headroom + scheduler[72]:   // plenty of headroom
    pressure_overflow = 0
elif latency_counter > min_barrier_latency + 9:  // far from ready
    pressure_overflow = 0
else:
    // Check all 4 register classes against their limits:
    overflow = false
    overflow |= (scheduler[72] + gpr_delta > budget_hw)
    overflow |= (scheduler[68] + pred_delta > 7)
    overflow |= (scheduler[56] + addr_delta > 7)
    overflow |= (scheduler[60] + ugp_delta >= target_desc[624])
    pressure_overflow = overflow

When pressure_overflow = 1, the candidate wins the comparison regardless of other bits -- it is the scheduler's mechanism for emergency register pressure relief. Within the packed byte itself, bit 4 is occupied by the hot-cold flag; the pressure overflow signal operates at a higher level and can force the candidate to win even when its packed priority byte is lower.

Bit 3 -- Same-BB preference. Output parameter from sub_8C7290:

same_bb = sub_8C7290.output_param_5     // boolean from resource copy

Set when the instruction belongs to the currently-scheduled basic block. Instructions imported from other BBs by global scheduling get same_bb = 0, reducing their priority relative to local instructions.

Bit 2 -- Stall-free. Computed from the earliest-available-cycle field:

if countdown_active (scheduler[420] != 0):
    if metadata[32] < scheduler[480] AND instr != preferred_instr:
        stall_free = 0
        if pressure_plus_reg_sum > 0:
            goto full_evaluation    // positive pressure = needs analysis
    else:
        stall_free = 1
else:
    // Normal mode: stall-free when producers have completed
    if metadata[32] >= scheduler[480]:
        stall_free = 1
    elif instr == preferred_instr:
        stall_free = 1
    else:
        stall_free = 0

metadata+32 is the instruction's earliest available cycle -- the latest completion time among all its producer instructions. scheduler+480 is the current scheduling cycle. When earliest >= current, all producers have retired and the instruction can issue with zero pipeline stalls.

Bit 1 -- Critical-path / latency-bound. Complex multi-path computation:

if countdown_active (scheduler[420] != 0):
    // Spill mode: almost always critical
    if !(barrier_bits_set_in_priority):
        if slot_limit_exceeded:
            critical = 1
        else:
            critical = !(pressure_sum <= 0 && max_reg_class == 0)
    else:
        critical = 0
else:
    // Normal mode: depth threshold comparison
    if barrier_count >= scheduler[464]:
        critical = 1      // enough barriers ready -> critical path active
    else:
        critical = 0

In spill mode (active when scheduler+420 > 0), the critical-path bit is set for nearly all instructions to maximize scheduling throughput. In normal mode, it activates when the number of barrier-target instructions in the ready list meets or exceeds the depth threshold computed during the pre-scan, indicating that the scheduler is processing a latency-critical dependency chain.

Bit 0 -- Tiebreaker (barrier-target). Read directly from instruction metadata:

tiebreaker = metadata[108] & 1      // barrier-target flag

Barrier-target instructions (those waiting on a hardware barrier) get bit 0 = 1. Since this is the lowest-priority bit, it only affects ordering when all higher bits are identical. Scheduling barrier targets promptly allows the barrier resource to be retired sooner, freeing scoreboard entries for other instructions.

Packed Byte Assembly

The 8 bits are packed into a single byte using shift-and-mask arithmetic:

priority = (yield_related    << 7)         // bit 7
         | (yield_flag       << 6) & 0x7F  // bit 6
         | (hot_cold         << 5) & 0x3F  // bit 5  (initially yield copy)
         | (hot_flag         << 4) & 0x3F  // bit 4
         | (same_bb          << 3) & 0x0F  // bit 3
         | (stall_free       << 2) & 0x0F  // bit 2
         | (critical_path    << 1) & 0x03  // bit 1
         | (tiebreaker       << 0) & 0x03  // bit 0

The & 0xNN masks ensure each bit occupies exactly one position. In the initial packing, bit 5 and bit 6 both derive from the yield variable; the hot-cold flag (sub_A9CDE0 result) overwrites bit 5 in subsequent repackings that occur during the spill-mode and comparison paths.

Candidate Comparison

The comparison between the current candidate and the running best is NOT a simple integer comparison of the packed bytes. The function performs a multi-stage refinement:

  1. Resource vector comparison: If knob-gated architecture checks pass (SM index > 5 at context+1704), a 4-tuple lexicographic comparison of per-register-class resource vectors occurs first. The four classes are compared in order: GPR delta, predicate delta, address delta, UGP delta. The first class that differs determines the winner.

  2. Priority byte XOR scan: When resource vectors are equal, the function XORs the current and best packed bytes and checks differing bits in this order:

    • Bit 4 (0x10) -- pressure: winner has bit 4 set (higher pressure need)
    • Bit 6 (0x40) -- yield: winner has bit 6 set (yield participation)
    • Bit 1 (0x02) -- critical: winner has bit 1 set
    • Bit 2 (0x04) -- stall-free: winner has bit 2 set
    • Bit 5 (0x20) -- hot-cold: winner has bit 5 set (hot memory op)
  3. Secondary tiebreakers (when all checked bits match):

    • Barrier group index (v213 vs v253)
    • Latency counter comparison (v223 vs v248)
    • Bit 7 yield-related (only when shared-memory count > 0)
    • Contention score (a derived value incorporating register overflow penalty: contention + 2 * RegToHWUnits(pressure_delta) - pressure_sum_sign)
    • Slot manager cycles (scheduling cost estimate from sub_682490)
    • Earliest available cycle (metadata+32)
    • Dependency cycle (metadata+92)
    • Latest deadline (metadata+40)
    • Register delta magnitude
  4. Positional fallback: When all heuristic comparisons are tied, the instruction with the higher BB slot (metadata+24) wins, preserving original program order.

The multi-stage comparison explains why the packed byte uses non-obvious bit ordering. Bits 4, 6, 1, 2, 5 are checked before bit 7 in the refinement path, even though bit 7 is the MSB. The packed byte enables fast ready-list insertion sort (integer comparison), while the full comparison function provides nuanced selection for the actual scheduling decision.
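The stage-2 refinement can be modeled as an XOR-and-scan over the five checked bits in the recovered order. A sketch (the real function interleaves this with the resource-vector stage before it and the secondary tiebreakers after it):

```c
#include <stdint.h>

/* XOR the two packed bytes and award the win to whichever candidate has
 * the first differing bit set, checking bits in the recovered order
 * 4, 6, 1, 2, 5. Returns +1 if 'cand' wins, -1 if 'best' wins, 0 if all
 * five bits match (fall through to secondary tiebreakers). */
static int xor_bit_scan(uint8_t cand, uint8_t best) {
    static const uint8_t order[] = { 0x10, 0x40, 0x02, 0x04, 0x20 };
    uint8_t diff = cand ^ best;
    for (unsigned i = 0; i < sizeof order; i++)
        if (diff & order[i])
            return (cand & order[i]) ? +1 : -1;
    return 0;
}
```

For example, a candidate carrying only the hot bit (0x20) loses to a running best carrying only the pressure bit (0x10), because bit 4 is examined first.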

Scheduler State Updates

After selecting the best candidate, the function updates scheduler state:

// Spill mode countdown
if winner is barrier-target:
    scheduler[420] = computed_countdown - 1
    scheduler[396] -= 1                       // spill sequence counter
    if metadata[32] >= 0:
        scheduler[400] -= 1                   // stall-free counter
        if stall_free_count==0 AND remaining>0 AND countdown>1:
            scheduler[420] = 0                // force exit spill mode
            scheduler[464] = -1               // reset depth threshold
else:
    // Non-barrier winner in countdown mode
    if !(barrier_bits in priority) AND slot_cost within budget:
        // do nothing, continue countdown
    else:
        scheduler[420] = 0                    // exit spill mode
        scheduler[464] = -1                   // reset depth threshold

// Slot manager update (when winner has positive scheduling cost)
if best_cost > 0 AND slotManager[76] > 0:
    if slotManager[140]:
        slotManager[28] += slotManager[44]    // advance base
        slotManager[76] = 0                   // reset count
        slotManager[80] = NULL                // reset anchor
    best.metadata[28] = sub_682490(...)       // recompute latency

// Hot-cold counter update
if hot_cold_active AND winner is cold (sub_A9CF90 returns true):
    scheduler[532] -= 1                       // decrement hot-cold budget
elif hot_flag was set for winner:
    scheduler[523] = 0                        // disable hot-cold tracking

The function returns the best instruction pointer and writes the second-best to *a2 for lookahead scheduling.

Dependency DAG Construction

The dependency graph is built in two stages before the scheduling loop begins. The DAG is a directed acyclic graph where nodes are instructions and edges represent ordering constraints with associated latency values.

Stage 1: Pre-Scan (sub_8CF880, 28 KB)

Iterates basic blocks in reverse order (bb[N-1] to bb[0]) using the BB ordering array at func+512.

For each BB:

  1. Check knobs 314/313 for per-BB scheduling skip flags
  2. Walk the instruction linked list, identifying NOP/control instructions
  3. Set bb->next pointers and configure BB scheduling state
  4. Delegate to Stage 2 (sub_8D9930) for edge construction
  5. Manage memory arenas with SSE-optimized copies for metadata arrays

Contains approximately 14 nested loops for edge construction. The reverse iteration order ensures that when the scheduler processes a BB, all of its successors have already been characterized.

Stage 2: Edge Construction (sub_8D9930, 19 KB)

For each pair of instructions within a BB, checks for five dependency types:

Type     Abbreviation  Condition                                   Edge Latency
True     RAW           Read-after-write on same register           Producer's pipeline latency
Anti     WAR           Write-after-read on same register           0 (ordering constraint only)
Output   WAW           Write-after-write on same register          1 (minimum separation)
Memory   --            Store before load to same memory space      Conservative; full ordering
Barrier  --            Instruction depends on barrier/sync result  Barrier completion latency
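The three register dependency types reduce to a def/use membership test on the earlier instruction A and the later instruction B. A sketch using bitset def/use masks (a representation assumed for illustration; the binary walks operand descriptors instead):

```c
/* Classify the register dependency from earlier instruction A to later
 * instruction B on one register r, given def/use bitsets. */
enum DepType { DEP_NONE, DEP_RAW, DEP_WAR, DEP_WAW };

static enum DepType classify_dep(unsigned a_def, unsigned a_use,
                                 unsigned b_def, unsigned b_use,
                                 unsigned r) {
    unsigned bit = 1u << r;
    if ((a_def & bit) && (b_use & bit)) return DEP_RAW;  /* true dep   */
    if ((a_def & bit) && (b_def & bit)) return DEP_WAW;  /* output dep */
    if ((a_use & bit) && (b_def & bit)) return DEP_WAR;  /* anti dep   */
    return DEP_NONE;
}
```

RAW is tested first, so a producer that both reads and writes r yields the true dependency, whose edge carries the producer's pipeline latency per the table above.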

Operand analysis is performed by sub_894290 (27 KB), which processes 16-bit operand descriptors encoding:

Bits    Field
12--15  Register class
8--11   Bank number
0--7    Dependency type
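Decoding the 16-bit descriptor is plain shift-and-mask; a sketch of the field extraction as tabled above (struct name and layout are illustrative):

```c
#include <stdint.h>

/* Decoded view of a 16-bit operand descriptor. */
typedef struct {
    unsigned reg_class;  /* bits 12-15 */
    unsigned bank;       /* bits 8-11  */
    unsigned dep_type;   /* bits 0-7   */
} OperandDesc;

static OperandDesc decode_operand(uint16_t d) {
    OperandDesc o;
    o.reg_class = (d >> 12) & 0xF;
    o.bank      = (d >> 8)  & 0xF;
    o.dep_type  =  d        & 0xFF;
    return o;
}
```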

Memory dependencies are conservative: all stores are ordered before subsequent loads to the same memory space. The scheduler does not perform alias analysis -- it relies on the memory space classification from sub_693BC0 to determine whether two operations might conflict.

Supplementary Dependency Builders

These functions handle specific aspects of dependency construction in the 0x6A0000--0x6B0000 range:

Address     Size   Purpose
sub_68A690  31 KB  BuildDependencies -- walks instruction lists and creates producer-consumer dependency edges from def-use chains
sub_6A97B0  26 KB  AddDependencyEdges -- register-level data dependency edges
sub_6A2D30  11 KB  ChainDependencies -- memory ordering constraints (ordering edges between memory operations even without explicit data deps)
sub_6A78F0  23 KB  ProcessOperands -- iterates operand arrays at instruction +84, extracts register file pressure and dependency distance information

Instruction Emission

sub_925510 (341 bytes, 57 lines) is the universal instruction relocation primitive. It moves an instruction to a new position in the doubly-linked instruction list.

function MoveInstruction(block, instr, insert_before):
    // 1. Unlink from current position, remembering old neighbors
    old_prev = instr.prev
    old_next = instr.next
    old_prev.next = old_next
    old_next.prev = old_prev

    // 2. Update block boundaries using the pre-move neighbors
    if instr was block.head (block+272):
        block.head = old_next
    if instr was block.tail (block+280):
        block.tail = old_prev

    // 3. Insert before reference instruction
    instr.next = insert_before
    instr.prev = insert_before.prev
    insert_before.prev.next = instr
    insert_before.prev = instr

    // 4. Notify subsystems
    UpdateDependencyGraph(block, instr)     // sub_7EEC10
    UpdateBlockTimestamp(block)             // sub_7DDCA0

This function has 13 callers across the codebase. It serves as the shared instruction movement primitive for the scheduler, register allocator, and peephole optimizer.

Resource Tracking

The scheduler maintains 10 functional unit resource counters per basic block, tracking pipeline utilization to avoid saturating any single execution unit.

Resource Vector Layout

Each per-BB resource slot occupies 84 bytes (21 DWORDs) stored at *(scheduler+672) + 84 * slot_index:

Offset  Size        Content
0--36   10 x int32  Current resource usage per functional unit
40--76  10 x int32  Resource pressure delta (change since last step)
80      int32       BB-entered flag and auxiliary bits

Functional Unit Pipes

Index  Pipe               Typical Instructions
0      Integer ALU        IADD, IMAD, ISETP, LOP, SHF
1      FP32               FADD, FFMA, FMUL, FSETP
2      FP64               DADD, DFMA, DMUL
3      Tensor core        HMMA, IMMA, BMMA, BGMMA
4      Load/store         LD, ST, LDG, STG, LDS, STS
5      Texture            TEX, TLD, TXQ
6      Branch/control     BRA, JMP, EXIT, RET, BAR
7      Shared memory      ATOMS, REDS (overlaps with pipe 4 for LDS/STS)
8      Special function   MUFU (RCP, RSQ, SIN, COS, EX2, LG2)
9      Uniform/predicate  UPLOP, UISETP, uniform operations

Resource Tracking Helpers

Address     Size       Purpose
sub_A091C0  --         Initialize per-BB resource arrays to zero
sub_A09530  365 bytes  Update stall cycle counters after scheduling an instruction. Decrements pending latency counters for all tracked resources.
sub_A09D40  --         Update WAR (anti-dependency) resource tracking for register operands
sub_A08A00  --         Resource model query (called in 3 modes by sub_8C67A0)

The resource model sub_A08A00 is called three times per instruction by sub_8C67A0:

  • Mode 1: instruction's own execution cost (FU assignment + pipeline latency)
  • Mode 2: operand release costs (freed resources when an operand reaches last-use)
  • Mode 3: combined instruction + BB-level impact (aggregate pressure)

SSE intrinsics (_mm_add_epi32, _mm_loadu_si128) are used throughout for vectorized resource accumulation and copying.
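The accumulation pattern can be reproduced directly with those intrinsics: two unaligned 4-lane adds cover lanes 0-7 of a 10-element resource vector, leaving two scalar lanes. Only the intrinsic choice is recovered; the tail handling here is an assumption.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_add_epi32 */
#include <stdint.h>

/* Accumulate a 10-element int32 resource delta into a usage vector with
 * the same intrinsics the decompiled helpers use. Sketch, not the
 * binary's code. */
static void accumulate_resources(int32_t usage[10], const int32_t delta[10]) {
    for (int i = 0; i < 8; i += 4) {
        __m128i u = _mm_loadu_si128((const __m128i *)(usage + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(delta + i));
        _mm_storeu_si128((__m128i *)(usage + i), _mm_add_epi32(u, d));
    }
    usage[8] += delta[8];   /* scalar tail for lanes 8-9 */
    usage[9] += delta[9];
}
```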

Register Pressure Tracking

The scheduler tracks register liveness via a bitvector at scheduler+832. Each bit represents one register; the pressure is the popcount of the live set.

function UpdateRegisterPressure(sched, instr):
    for each operand in instr.operands:
        if operand.is_def:
            set_bit(sched.live_bv, operand.reg)       // DEF: mark live
        if operand.is_last_use:
            clear_bit(sched.live_bv, operand.reg)     // LAST-USE: mark dead
    sched.current_pressure = popcount(sched.live_bv)

The bitvector is sized to (numRegs + 1) words, or (2 * numRegs + 2) when knob 420 (dual-register tracking) is active. Dual-register tracking separately tracks register pairs for instructions that consume or produce 64-bit values.
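A minimal model of the live-set mechanics (fixed illustrative sizing and descriptive names; the real structure lives at scheduler+832 with the word-count rule above):

```c
#include <stdint.h>

/* Live-register bitvector with popcount-derived pressure. 8 x 64-bit
 * words (up to 512 registers) is an illustrative size. */
typedef struct {
    uint64_t live[8];
    int pressure;
} LiveSet;

static void set_live(LiveSet *ls, unsigned reg) {
    ls->live[reg >> 6] |= 1ull << (reg & 63);     /* DEF: mark live */
}
static void clear_live(LiveSet *ls, unsigned reg) {
    ls->live[reg >> 6] &= ~(1ull << (reg & 63));  /* LAST-USE: mark dead */
}

/* Recompute pressure as the popcount of the live set. */
static int recount(LiveSet *ls) {
    int p = 0;
    for (int w = 0; w < 8; w++) {
        uint64_t v = ls->live[w];
        while (v) { v &= v - 1; p++; }   /* Kernighan popcount */
    }
    return ls->pressure = p;
}
```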

Pressure state fields:

Offset         Content
scheduler+432  Target register count (from budget computation)
scheduler+324  Committed register target
scheduler+316  Minimum register count
scheduler+320  Register pressure slack (headroom)

When current_pressure exceeds the scheduler+432 target, the priority function raises the pressure-overflow signal (the bit-4 override described under Priority Function Internals), biasing the scheduler toward instructions that release registers.

Per-Instruction Scheduling Metadata (SchedNode)

Each instruction has a pointer at instr+40 to a heap-allocated SchedNode block. The offsets below are relative to the SchedNode base, not the 296-byte Ori instruction. See the SchedNode layout for the authoritative field map.

Offset  Type   Content
+8      int32  dep_count -- unsatisfied predecessor count (0 = ready)
+16     QWORD  next_ready -- linked-list pointer in ready list
+24     int32  bbSlot -- 1-based BB position (-1 = unscheduled)
+28     int32  latency_counter -- current stall counter
+32     int32  earliestCycle -- earliest available cycle
+40     int32  latestDeadline -- latest deadline cycle
+44     int32  Barrier group index
+88     int32  maxPredecessorCycle
+92     int32  maxDependencyCycle
+108    byte   Flags: bit 0 = barrier target, bit 1 = has dependency, bit 2 = early schedulable, bit 3 = late schedulable, bit 4 = has register operand
+111    byte   Flags: bit 7 = uses expensive register file

The scheduling loop also reads Ori instruction fields directly (not via the SchedNode): instr+72 (opcode), instr+80 (operand count), instr+84 (operand descriptors).

Sentinel values: bbSlot -1 (unscheduled), latency 0x1869F (99999 = infinity).

The dep_count field at +8 is the key scheduling control: it counts unsatisfied predecessors in the dependency DAG. When a predecessor is scheduled, the engine decrements every successor's dep_count. When dep_count reaches zero, the instruction becomes ready and is inserted into the ready list.

Stall and Barrier Insertion

After the scheduling loop determines instruction order, sub_8D7760 (41 KB) converts the abstract schedule into SASS control words.

For each instruction in the scheduled order:

Field               Computation                                                              Range
Stall count         Distance in cycles to the nearest dependent consumer                     0--15 (capped by knob 805, max 16)
Yield hint          Warp scheduling hint -- should the HW scheduler switch to another warp?  0 or 1
Barrier assignment  Which of the 6 available barriers this instruction writes/waits on       0--5, or none
Scoreboard deps     Read/write dependency tracking for the hardware scoreboard               Bitmask

The function contains architecture-variant switches for different barrier models (sm_70 vs sm_80 vs sm_90+). It manages a 32-entry barrier table for tracking active barrier assignments.

See Scoreboards & Barriers for the control word encoding format.

Alternative Scheduling Loop

sub_68B9C0 (46 KB) is a monolithic function that combines dependency graph construction with the scheduling loop. It serves as an alternative entry point for scheduling passes that need to build the DAG inline rather than using the pre-built graph from Stage 1.

Internal structure:

  1. Initialize scheduling state (sub_685700)
  2. Initialize ready-list management (sub_687080)
  3. Check resource conflicts (sub_687410)
  4. Inner loop (while(2) infinite loop with break conditions):
    • Check if ready list is empty -- break if so
    • Check opcode 97 (STG in ROT13; used as scheduling barrier/control marker) -- special handling
    • Select best instruction from ready list
    • Schedule it: assign cycle, update resources, process edges
    • For each successor: decrement dep_count, add to ready list if zero
    • Check boundary condition (v236) -- break if done
  5. Track first-pass initialization via v215

This function accesses the Ori instruction's opcode at instr+72, plus SchedNode fields (via instr+40 pointer): +24 (bbSlot), +144 (scheduling slot), +164 (resource class), and +236 (latency).

Specialized Scheduling Strategies

The region 0x89C550--0x8BE320 contains 17+ specialized scheduling strategies. These are selected based on code characteristics (loop structure, tensor operations, function size) and optimization level. Each strategy implements a variation of the core list scheduling algorithm with different heuristics or search strategies.

| Address | Size | Strategy | Description |
| --- | --- | --- | --- |
| sub_8B1190 | 16 KB | Backtracking | Undo and retry on scheduling conflicts. Rolls back the last N steps and tries alternative orderings. Bounded depth prevents exponential blowup. |
| sub_8B2D90 | 18 KB | Global optimization | Cross-BB scheduling. Moves instructions across BB boundaries when safe (no side effects, dominance preserved). |
| sub_8B4590 | 13 KB | Permutation search | Exhaustive permutation of instruction orderings for small BBs. Falls back to heuristic for larger blocks. |
| sub_8B5400 | 14 KB | Latency-optimized | Maximizes memory latency hiding by aggressively interleaving independent operations. |
| sub_8B6D60 | 12 KB | Pressure-optimized | Minimizes live range overlap by scheduling defs as close to their uses as possible. |
| sub_8B77C0 | 15 KB | Dual-issue | Pairs co-issuable instructions for dual-issue architectures (sm_50/Maxwell). Uses sub_A9CDE0 and sub_A9CF90 for compatibility checks. |
| sub_8B8900 | 12 KB | Tensor scheduling | HMMA/BMMA/BGMMA grouping for warpgroup tensor operations. |
| sub_8B9390 | 23 KB | Software pipelining | Loop body overlapping -- interleaves iterations to fill pipeline bubbles. |
| sub_8BAAE0 | 15 KB | Loop-aware | Trip count + register awareness for loop bodies. |
| sub_8BB9C0 | 8.2 KB | Prefetch scheduling | Inserts and schedules memory prefetch instructions. |
| sub_8BC0B0 | 6.1 KB | Barrier coalescence | Merges adjacent barrier instructions to reduce overhead. |
| sub_8BC990 | 7.6 KB | Scoreboard optimization | Minimizes scoreboard entries by reusing barrier registers. |
| sub_8BCFA0 | 6.8 KB | Warp schedule optimization | Warp-level yield tuning for multi-warp scheduling. |
| sub_8BDC40 | 7.9 KB | Dual-issue pairing | Instruction pair selection for dual-issue slots. |
| sub_8BE320 | 25 KB | Complex combined pass | Multi-strategy combined pass for complex code patterns. |
| sub_8A9D80 | 21 KB | Depth-first | DFS-based instruction ordering for deep dependency chains. |
| sub_8AB750 | 9.8 KB | Critical path | DAG analysis for priority weight computation. |

Backtracking Scheduler

The backtracking strategy (sub_8B1190) is notable because it breaks from the greedy nature of standard list scheduling. When a scheduling decision leads to excessive stalls or resource conflicts, it can undo the last N steps (where N is bounded by a configurable depth), re-insert the affected instructions into the ready list, and try a different selection. This provides limited but effective lookahead without the full cost of optimal scheduling.

Dual-Issue Scheduling

For sm_50 (Maxwell), sub_8B77C0 pairs instructions that can execute simultaneously on different functional units. Eligibility is checked by sub_8CF5D0 (3.5 KB), which verifies architecture support and computes a dual-issue benefit score at scheduler+328. Pairing compatibility uses sub_A9CDE0 (is instruction dual-issuable?) and sub_A9CF90 (can this pair with the next instruction?).

Size Limits and Chunking

Two mechanisms prevent the scheduling algorithm from hitting quadratic complexity on large inputs:

BB Size Limit

sub_8CBAD0 scans all basic blocks during pre-scheduling setup. Any BB exceeding 4095 instructions is split by inserting scheduling barriers (sub_931920). This caps the per-BB scheduling problem size, ensuring the O(n^2) dependency graph construction remains tractable. The maximum BB size is tracked at scheduler+388.

Large Function Chunking

Functions exceeding 16383 instructions (*(a1+372) > 0x3FFF) trigger chunk-based scheduling via sub_A9DDD0 (11.5 KB). The function is partitioned into chunks that are scheduled independently, then the results are merged. This avoids the full O(n^2) DAG construction for very large kernels. The chunk boundary selection respects BB boundaries and dependency chains to minimize cross-chunk constraint violations.
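A simplified view of the chunk partitioning (a sketch only: the real boundary selection in sub_A9DDD0 also weighs dependency chains, which is omitted here; `chunk_blocks` and its parameters are illustrative names):

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_LIMIT 0x3FFF  /* 16383-instruction threshold from *(a1+372) */

/* Greedy chunking: pack whole basic blocks into chunks whose instruction
 * totals stay under the limit, so BB boundaries are always respected.
 * bb_sizes[i] is the instruction count of BB i; chunk_of[i] receives the
 * chunk index assigned to BB i. Returns the number of chunks. */
static size_t chunk_blocks(const int *bb_sizes, size_t n_bbs, int *chunk_of) {
    size_t chunk = 0;
    int fill = 0;
    for (size_t i = 0; i < n_bbs; i++) {
        if (fill > 0 && fill + bb_sizes[i] > CHUNK_LIMIT) {
            chunk++;       /* start a new chunk at this BB boundary */
            fill = 0;
        }
        chunk_of[i] = (int)chunk;
        fill += bb_sizes[i];
    }
    return n_bbs ? chunk + 1 : 0;
}
```

Because each chunk is scheduled independently, the O(n^2) DAG construction cost is paid per chunk rather than once over the whole function.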

Function Map

| Address | Size | Identity | Confidence |
| --- | --- | --- | --- |
| sub_6820B0 | 1.5 KB | BuildReadyList -- zero-dep instruction scan | HIGH |
| sub_682200 | -- | UnlinkFromReadyList -- remove and update deps | HIGH |
| sub_682490 | 14 KB | RegisterPressureAnalyzer -- per-class deltas | HIGH |
| sub_6833F0 | 10 KB | InitScheduleRegion -- per-BB setup, knob query | HIGH |
| sub_685700 | -- | InitSchedulingState -- loop initialization | MEDIUM |
| sub_685A10 | 11 KB | InstructionBarrierCheck -- opcode analysis | HIGH |
| sub_687080 | -- | ReadyListManagementHelper | MEDIUM |
| sub_687410 | -- | ResourceConflictCheck | MEDIUM |
| sub_687FE0 | 12 KB | ScheduleBlock -- per-BB scheduling entry | HIGH |
| sub_688DD0 | 20 KB | ScheduleEngine -- unified 3-mode core loop | HIGH |
| sub_68A690 | 31 KB | BuildDependencies -- def-use chain DAG | HIGH |
| sub_68B9C0 | 46 KB | MainSchedulingLoop -- combined DAG + scheduling | HIGH |
| sub_692200 | 18 KB | SchedulingHeuristic -- priority with FP scoring | HIGH |
| sub_695530 | 15 KB | ComputeLatencies -- instruction latency computation | HIGH |
| sub_69B7D0 | 17 KB | TopologicalSort -- valid execution ordering | HIGH |
| sub_69F170 | 12 KB | CriticalPathAnalysis -- DAG critical path | HIGH |
| sub_893100 | 17 KB | ClassifyInstruction -- opcode/operand analysis | HIGH |
| sub_894290 | 27 KB | BuildOperandDependencies -- operand-level edges | HIGH |
| sub_89C550 | 14 KB | InnerScheduleLoop -- inner scheduling iteration | HIGH |
| sub_89EFC0 | 16 KB | ReadyListManager -- BST management | HIGH |
| sub_8A9D80 | 21 KB | DepthFirstSchedule | MEDIUM |
| sub_8AB750 | 9.8 KB | CriticalPathCompute | MEDIUM |
| sub_8B1190 | 16 KB | ScheduleWithBacktrack | MEDIUM |
| sub_8B2D90 | 18 KB | GlobalScheduleOpt -- cross-BB scheduling | MEDIUM |
| sub_8B4590 | 13 KB | PermuteSchedule -- permutation search | MEDIUM |
| sub_8B5400 | 14 KB | ScheduleForLatency | MEDIUM |
| sub_8B6D60 | 12 KB | ScheduleForPressure | MEDIUM |
| sub_8B77C0 | 15 KB | DualIssueScheduler | MEDIUM |
| sub_8B8900 | 12 KB | TensorScheduler | MEDIUM |
| sub_8B9390 | 23 KB | SoftwarePipeline | MEDIUM |
| sub_8BAAE0 | 15 KB | LoopScheduler | MEDIUM |
| sub_8BB9C0 | 8.2 KB | PrefetchScheduler | MEDIUM |
| sub_8BC0B0 | 6.1 KB | BarrierCoalescence | MEDIUM |
| sub_8BC990 | 7.6 KB | ScoreboardOpt | MEDIUM |
| sub_8BCFA0 | 6.8 KB | WarpScheduleOpt | MEDIUM |
| sub_8BDC40 | 7.9 KB | DualIssuePairing | MEDIUM |
| sub_8BE320 | 25 KB | ComplexSchedulePass | MEDIUM |
| sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost | HIGH |
| sub_8C7120 | -- | BarrierTrackingUpdate | MEDIUM |
| sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy | HIGH |
| sub_8C7720 | 20 KB | ReorderInstructions -- red-black tree | HIGH |
| sub_8C9320 | 47 KB | ComputePriority -- 8-bit packed heuristic | HIGH |
| sub_8CBAD0 | 2.9 KB | PreScheduleSetup -- BB scan, 4095-instr limit | HIGH |
| sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check | HIGH |
| sub_8CD160 | 9.3 KB | ScheduleBasicBlock -- per-BB ordering loop | HIGH |
| sub_8CF880 | 28 KB | BuildDependencyGraph -- pre-scan stage 1 | HIGH |
| sub_8D0640 | 22 KB | ScheduleInstructions -- top-level orchestrator | HIGH |
| sub_8D1730 | 19 KB | ExecuteSchedulePass | HIGH |
| sub_8D2510 | 3.6 KB | UpdateDependencies -- post-schedule dep update | HIGH |
| sub_8D3150 | 2.0 KB | CheckResourceConflict | MEDIUM |
| sub_8D32D0 | 14 KB | ScheduleInstruction -- schedule single instruction | HIGH |
| sub_8D3D60 | 1.4 KB | InsertStall | HIGH |
| sub_8D3E20 | 2.1 KB | ComputeStallCycles | HIGH |
| sub_8D4000 | 3.0 KB | InsertBarrier | HIGH |
| sub_8D5E00 | 38 KB | MainSchedulingLoop -- workhorse | HIGH |
| sub_8D7760 | 41 KB | StallAndBarrierInsertion -- control word generation | HIGH |
| sub_8D9930 | 19 KB | BuildDependencyEdges -- RAW/WAR/WAW/memory/barrier | HIGH |
| sub_925510 | 341 bytes | MoveInstruction -- doubly-linked list relink | HIGH |
| sub_A08A00 | -- | ResourceModel -- FU cost query (3 modes) | HIGH |
| sub_A091C0 | -- | InitResourceTracking | MEDIUM |
| sub_A09530 | 365 bytes | UpdateStallCycles -- per-instruction latency update | HIGH |
| sub_A09D40 | -- | UpdateWARTracking -- anti-dependency tracking | MEDIUM |
| sub_A9DDD0 | 11.5 KB | HandleLargeFunction -- chunk-based scheduling | MEDIUM |

Per-SM Scheduling Backends

Everything documented above describes the main scheduler (Backend A), which covers approximately 436 KB at 0x680000--0x8FE000. ptxas contains two additional complete scheduling implementations activated for newer SM architectures. The three backends coexist in the binary; SM-version-gated dispatch selects which combination runs.

Architecture Dispatch

The function sub_7DDB50 reads an SM architecture index from context+2104 and returns it as an integer. Four dispatch stubs in the 0xC5FE00--0xC61000 range use this value to select the scheduling backend:

| Dispatch Stub | Condition | Backend Selected | Pipeline Stage |
| --- | --- | --- | --- |
| sub_C5FEF0 | SmVersion > 1 | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
| sub_C60910 | SmVersion > 1 && (context+1392 & 1) | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
| sub_C5FFC0 | SmVersion > 1 | Backend C (RBT List), mode 1 | Pre-scheduling |
| sub_C5FFF0 | SmVersion > 1 | Backend C (RBT List), mode 0 | Post-scheduling |

When SmVersion <= 1 (sm_50 through sm_75 -- Maxwell through Turing), control falls through to the main Backend A documented in the preceding sections. When SmVersion >= 2 (sm_80+ -- Ampere, Ada Lovelace, Hopper, Blackwell), Backends B and C replace Backend A entirely.

sub_C60910 has a secondary activation path: if *(options + 23544) == 1 && *(options + 23552) != 0, Backend B activates regardless of SM version, providing a knob override for testing the codec scheduler on older architectures.

Backends B and C are complementary, not competing. Backend C handles pre-scheduling and post-scheduling (the same pipeline stages as Backend A's 3-phase ReduceReg/ILP/DynBatch), while Backend B handles a separate codec/ISel scheduling step that has no equivalent in the legacy path.

Backend B -- SM89/90 Codec Scheduler (0x1225000)

Backend B is a forward-then-backward scheduling pass with continuous floating-point priority weighting. It replaces Backend A's discrete 8-bit packed heuristic with a configurable pressure/ILP tradeoff expressed as doubles.

| Component | Function |
| --- | --- |
| Entry | sub_1233D70 (1,527 B, 321 lines) -- pass phase 5 |
| Forward scheduler | sub_122AD60 (17.5 KB, 4,118 lines) -- largest function in range |
| Backward scheduler | sub_122F650 (18.2 KB, 3,917 lines) |
| Preparation | sub_123E0D0 -- instruction characterization |
| Post-fixup | sub_A112C0 -- scheduling result finalization |
| Priority structure | BST with FNV-1a hash tracking |
| Code region | 0x1225000--0x1240000 (132 functions, 111 KB) |

Float Weighting System

The entry point sub_1233D70 initializes two pairs of floating-point weights from the options object at *(context+1664) + 72:

Pair 1 -- Pressure/ILP tradeoff (options offsets 7200/7208):

| Weight | Default | Meaning |
| --- | --- | --- |
| pressure_weight | 1.8 | Contribution of register pressure to scheduling priority. Positive = favors orderings that reduce live register count. |
| ilp_weight | -0.8 | Contribution of instruction-level parallelism. Negative = penalizes moves that reduce available parallelism. |

The two weights sum to 1.0 and form a weighted combination on a unit scale. The default 1.8/-0.8 split heavily favors register pressure reduction, accepting moderate ILP degradation -- appropriate for register-hungry Ada Lovelace and Hopper kernels.

Pair 2 -- Secondary scoring axis (options offsets 7560/7568):

| Weight | Default | Meaning |
| --- | --- | --- |
| forward_weight | 3.2 | Forward-looking scheduling contribution |
| backward_penalty | -2.2 | Backward-looking penalty factor |

Both pairs are overridable. When the configuration byte at the respective offset equals 3, the weight is read from the adjacent double field and the complement is computed as 1.0 - weight:

if (*(BYTE*)(options + 7200) == 3):
    pressure_weight = *(double*)(options + 7208)
    ilp_weight = 1.0 - pressure_weight

After loading, both weight pairs are normalized by dividing by the register range (float)(max_regs - min_regs), producing per-register slopes:

range = (float)(max_regs) - (float)(min_regs)
pressure_slope = ilp_weight / range
secondary_slope = backward_penalty / range

This normalization ensures the scheduling heuristic scales consistently regardless of the target architecture's register file size.
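The weight-load and normalization logic above can be condensed into one self-contained sketch (the `WeightKnob` struct is an illustrative stand-in for the raw options-object offsets 7200/7208; defaults and the mode-byte check follow the recovered logic):

```c
#include <assert.h>
#include <math.h>

/* Stand-in for the byte at options+7200 and the double at options+7208. */
typedef struct {
    unsigned char mode;     /* == 3 selects the override */
    double override_value;
} WeightKnob;

typedef struct { double pressure, ilp; } WeightPair;

static WeightPair load_pressure_ilp(const WeightKnob *k) {
    WeightPair w = { 1.8, -0.8 };       /* defaults; they sum to 1.0 */
    if (k->mode == 3) {
        w.pressure = k->override_value;
        w.ilp = 1.0 - w.pressure;       /* complement preserves the unit sum */
    }
    return w;
}

/* Normalize a weight by the register range to get a per-register slope. */
static double per_reg_slope(double weight, int max_regs, int min_regs) {
    return weight / (double)(max_regs - min_regs);
}
```

With a 240-register range, the default ILP weight of -0.8 becomes a slope of -0.8/240 per register, so the same knob values produce comparable priority deltas across register file sizes.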

Forward Pass (sub_122AD60)

The forward scheduler implements list scheduling with a BST priority queue, iterating basic blocks front-to-back. It uses FNV-1a hash tables (seed 0x811C9DC5, multiplier 16777619) for tracking scheduled instruction mappings. Instruction properties are queried via sub_7DF3A0. The function manages a ref-counted working set with proper cleanup at function exit. At 4,118 decompiled lines, it is the largest function in the 0x1225000 scheduling range.
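The hash constants named above are the standard 32-bit FNV-1a parameters. A minimal reference implementation (the real code hashes instruction identifiers rather than byte strings; this byte-string form is shown for clarity):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: offset basis 0x811C9DC5, prime 16777619 (0x01000193) --
 * the same constants recovered from the forward scheduler. */
static uint32_t fnv1a(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];          /* xor byte first (the "1a" ordering) ... */
        h *= 16777619u;     /* ... then multiply by the FNV prime */
    }
    return h;
}
```

Note the xor-then-multiply ordering: that is what distinguishes FNV-1a from plain FNV-1 and gives it better avalanche behavior on short keys.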

Backward Pass (sub_122F650)

The backward scheduler receives the floating-point weights as direct parameters and processes basic blocks in reverse order. It calls into the barrier/scoreboard system (sub_BDC080, sub_BDBA60, sub_BDC0A0) and performs register liveness analysis via sub_A0EDE0. The function uses BST operations with left/right/parent pointer traversal and explicit rebalancing, then performs iterative tree cleanup at exit.

Backend C -- RBT List Scheduler (0x18CD000)

Backend C is a complete reimplementation of the list scheduling algorithm using a red-black tree priority queue, double-precision scoring, and an evaluate-then-commit model with hash-table solution caching. It replaces Backend A for all sm_80+ targets.

| Component | Function |
| --- | --- |
| Orchestrator | sub_1908D90 -- pre/post mode dispatch |
| Driver | sub_1906090 -- per-block scheduling loop |
| Core scheduler | sub_1902B70 (19 KB) -- RBT-based list scheduling |
| Solution evaluator | sub_1904B70 (26 KB) -- constraint check + commit |
| Constraint validator | sub_19043F0 (10 KB) -- feasibility testing |
| Pressure cost model | sub_18F3CB0 (16 KB) -- SIMD register pressure |
| Recursive cost propagation | sub_18FFD70 (23 KB) -- call-graph-aware scoring |
| Dependency update | sub_1902100 (15 KB) -- post-scheduling DAG update |
| RBT insert | sub_18FD370 -- balanced insertion with 3-key comparison |
| RBT extract | sub_18FCDA0 -- pop highest-priority node |
| RBT reset | sub_18F7EC0 -- tree cleanup |
| Score computation | sub_18FDAF0 -- double-precision weighted score |
| Hash table | sub_1906510 (14 KB) -- FNV-1a instruction ID lookup |
| Code region | 0x18CD000--0x190FFFF (392 functions, 275 KB) |

Red-Black Tree Priority Queue

The critical difference from Backend A is the priority queue data structure. Backend A uses a sorted singly-linked list (O(N) insertion per instruction). Backend C uses a red-black tree that maintains balance through rotation operations in sub_18FD170 (called at the end of every insertion).

Each RBT node is 40 bytes allocated from a pool, with node+24 pointing to the instruction's scheduling entry. The tree ordering uses a three-key comparison in sub_18FD370:

  1. Priority integer at scheduling_entry + 384 (descending -- higher priority nodes are left children)
  2. Latency double at scheduling_entry + 368 (descending -- higher latency scheduled first among equal-priority instructions)
  3. Instruction ID at *(scheduling_entry + 16) + 12 (ascending -- deterministic tiebreaker)

This three-key comparison provides O(log N) insertion and extraction, a significant improvement for basic blocks with hundreds of instructions where Backend A's O(N) sorted insertion becomes a bottleneck.
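The recovered ordering can be expressed as a plain comparator (field names are illustrative stand-ins for the scheduling-entry offsets +384, +368, and the ID behind +16; the actual tree code inlines this logic into the insertion walk):

```c
#include <assert.h>

/* Mirror of the three comparison keys recovered from sub_18FD370. */
typedef struct {
    int    priority;  /* +384: descending */
    double latency;   /* +368: descending */
    int    instr_id;  /* *(entry+16) + 12: ascending tiebreaker */
} SchedKey;

/* Returns <0 if a orders before b, i.e. a would be extracted first. */
static int sched_key_cmp(const SchedKey *a, const SchedKey *b) {
    if (a->priority != b->priority)
        return b->priority - a->priority;          /* higher priority first */
    if (a->latency != b->latency)
        return a->latency > b->latency ? -1 : 1;   /* higher latency first */
    return a->instr_id - b->instr_id;              /* lower ID first */
}
```

The ascending instruction-ID tiebreaker is what makes the schedule deterministic: two runs over the same input always extract equal-priority, equal-latency instructions in the same order.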

Core Scheduling Loop (sub_1902B70)

function RBTListSchedule(context, block, dep_info, bound, constraint):
    InitRegisterPressure(context, block)           // sub_18F8580
    InitRBTree(tree)                               // sub_18F7EC0

    for each instruction in block.instruction_list:
        node = AllocPoolNode(40 bytes)
        node.scheduling_entry = instruction
        RBTreeInsert(tree, node)                   // sub_18FD370

    ResizeScheduleArray(block, tree.count)          // sub_18F9CC0

    while tree is not empty:
        best_node = RBTreeExtractMax(tree)          // sub_18FCDA0
        ReturnNodeToPool(best_node)

        instruction = best_node.scheduling_entry
        valid = vtable_check(context, block, instruction)
        *(instruction + 365) = valid

        UpdateDependencies(context, instruction, tree)  // sub_1902100

        if not valid:
            InsertRejection(block + 112, instruction_id)
            continue

        // Record scheduling decision
        position = ++(block + 360)
        entry = *(block + 352) + 24 * position
        entry[0] = instruction_id
        entry[1] = instruction + 2         // scheduling state
        entry[2] = instruction             // back-pointer

        // Compute and accumulate scores
        latency = LookupLatency(context, instruction)  // sub_18F5460
        *(block + 96) += (priority - 2) * latency
        *(block + 88) += latency * *(instruction + 376)

        // Process successors, check conflicts (binary search on
        // sorted 12-byte conflict array with 0xAAAAAAAAAAAAAAAB
        // division-by-3 trick for index computation)
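The 0xAAAAAAAAAAAAAAAB constant in the conflict-array lookup is the standard reciprocal-multiplication encoding of unsigned division by 3 (the array entries are 12 bytes, so index = byte_offset / 12, which reduces to a shift plus a division by 3). A compiler-independent sketch (uses the GCC/Clang `__int128` extension):

```c
#include <assert.h>
#include <stdint.h>

/* n / 3 via the magic constant 0xAAAAAAAAAAAAAAAB = ceil(2^65 / 3):
 * take the high half of the 128-bit product and shift right by 1.
 * This is exact for all 64-bit unsigned n -- it is how compiled code
 * replaces a division by a small constant with multiply-and-shift. */
static uint64_t div3(uint64_t n) {
    unsigned __int128 prod = (unsigned __int128)n * 0xAAAAAAAAAAAAAAABull;
    return (uint64_t)(prod >> 65);
}
```

Recognizing such magic constants in decompiled output is a quick way to recover the original stride arithmetic: 0xAAAAAAAAAAAAAAAB almost always means "divide by 3" (or by a multiple of 3 after a shift).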

Evaluate-Then-Commit Model

Backend A uses a greedy approach: each scheduling decision is final. Backend C introduces a two-phase model where sub_1904B70 evaluates a proposed schedule against constraints before committing it:

  1. Build a candidate schedule (408-byte evaluation nodes with def/use/pred-deps/succ-deps lists)
  2. Validate via sub_19043F0 (checks scheduling mode at +64 must be 5 or 6)
  3. Run architecture-specific check via vtable at *context + 16
  4. Verify register pressure via sub_19016E0
  5. Compute score via sub_18FDAF0 (returns double)
  6. If score exceeds threshold at context + 360, insert into solution hash table at block + 304

This allows Backend C to explore multiple scheduling alternatives and commit only the best-scoring solution, a capability Backend A's greedy model lacks.

Recursive Cost Propagation (sub_18FFD70)

Backend C uniquely supports cross-function scheduling awareness through recursive cost propagation. sub_18FFD70 walks the call graph:

  1. For a given schedule entry, iterate predecessor blocks (linked list at +12/+13)
  2. Look up each predecessor in the scheduling map via sub_18F4E70
  3. If predecessor is live (byte at +365), recursively process it
  4. After recursion, scan instruction operand lists (offsets 80, 84), identifying register operands by the type tag (operand >> 28) & 7 == 1
  5. Clear register usage bits at reg_entry + 28
  6. Update double-precision scores at offsets 88 and 96

This propagation allows scheduling decisions in callee functions to influence caller scheduling priorities -- a form of interprocedural scheduling absent from both Backend A and Backend B.

Backend Comparison

| Feature | Backend A (Main) | Backend B (SM89/90 Codec) | Backend C (RBT List) |
| --- | --- | --- | --- |
| Address range | 0x680000--0x8FE000 | 0x1225000--0x1240000 | 0x18CD000--0x190FFFF |
| Code size | ~436 KB | ~111 KB | ~275 KB |
| SM gate | SmVersion <= 1 | SmVersion >= 2 | SmVersion >= 2 |
| Pipeline stage | Pre + post scheduling | Codec/ISel scheduling | Pre + post scheduling |
| Priority encoding | 8-bit packed integer | Float-weighted BST | RB-tree (int + double + ID) |
| Priority function size | 47 KB monolithic | Distributed across weights | 3-key comparison |
| Ready list structure | Sorted singly-linked list | Binary search tree | Red-black tree |
| Insertion complexity | O(N) per instruction | O(log N) | O(log N) |
| Scheduling passes | 3 (ReduceReg / ILP / DynBatch) | 2 (Forward / Backward) | 2 (Pre / Post) |
| Pressure tracking | Bitvector + popcount | Float slope per register | SIMD bitmap + cost model |
| Weight configuration | Knobs 769--805 (integer) | Options 7200/7560 (double) | Vtable dispatch |
| Score type | Integer (packed bits) | Double (weighted sum) | Double (accumulated) |
| Solution search | Greedy (single pass) | Forward + backward | Evaluate + commit |
| Cross-function awareness | None | None | Recursive cost propagation |
| Hash infrastructure | None | FNV-1a | FNV-1a |
| Backtracking | Optional (sub_8B1190) | None | Rejection set + retry |

Backend B + C Function Map

| Address | Size | Identity | Confidence |
| --- | --- | --- | --- |
| sub_1233D70 | 1.5 KB | SM89/90 CodecScheduleEntry -- pass phase 5, float weight init | HIGH |
| sub_122AD60 | 17.5 KB | ForwardCodecScheduler -- BST list scheduling, FNV-1a hash tracking | HIGH |
| sub_122F650 | 18.2 KB | BackwardCodecScheduler -- reverse pass, barrier/scoreboard integration | HIGH |
| sub_123ADD0 | 5.8 KB | CodecDependencyGraphBuilder -- dispatched via vtable | MEDIUM |
| sub_12371D0 | 3.8 KB | CodecInstructionClassifier -- convergence-based property testing | MEDIUM |
| sub_123E0D0 | -- | CodecSchedulePreparation -- instruction characterization | MEDIUM |
| sub_A112C0 | -- | CodecSchedulePostFixup -- result finalization | MEDIUM |
| sub_1908D90 | -- | RBTScheduleOrchestrator -- pre/post mode dispatch | HIGH |
| sub_1906090 | -- | RBTScheduleDriver -- per-block loop, 368-byte block stride | HIGH |
| sub_1902B70 | 19 KB | RBTCoreListScheduler -- RB-tree priority queue loop | HIGH |
| sub_1904B70 | 26 KB | RBTSolutionEvaluator -- constraint check, score threshold, hash commit | HIGH |
| sub_19043F0 | 10 KB | RBTConstraintValidator -- mode 5/6 feasibility | HIGH |
| sub_19038E0 | 15 KB | RBTInitialEvaluation -- per-block constraint bootstrapping | MEDIUM |
| sub_18F3CB0 | 16 KB | RBTPressureCostModel -- SIMD register pressure computation | HIGH |
| sub_18FFD70 | 23 KB | RBTRecursiveCostPropagation -- call-graph-aware scoring | HIGH |
| sub_1902100 | 15 KB | RBTDependencyUpdate -- post-scheduling DAG maintenance | HIGH |
| sub_18FD370 | -- | RBTreeInsert -- 3-key balanced insertion + fix-up | HIGH |
| sub_18FCDA0 | -- | RBTreeExtractMax -- pop highest-priority node | HIGH |
| sub_18F7EC0 | -- | RBTreeReset -- tree cleanup | HIGH |
| sub_18F8580 | -- | RBTRegisterPressureInit -- pressure state initialization | MEDIUM |
| sub_18F5460 | -- | RBTLatencyLookup -- vtable-dispatched latency query | MEDIUM |
| sub_18FDAF0 | -- | RBTScoreComputation -- double-precision weighted score | HIGH |
| sub_1906510 | 14 KB | RBTHashLookup -- FNV-1a instruction ID hash table | HIGH |
| sub_18FB850 | -- | RBTHashResize -- power-of-2 growth, 0.5 load factor | HIGH |
| sub_1901200 | -- | RBTScorePropagationDriver -- calls sub_18FFD70 | MEDIUM |
| sub_19081F0 | 17 KB | RBTBlockDependencyGraphBuild -- per-block DAG construction | HIGH |
| sub_19072F0 | 14 KB | RBTInterBlockScheduling -- cross-BB register dependency | MEDIUM |
| sub_18FEE60 | -- | RBTScheduleStateCreate -- 528-byte state construction | MEDIUM |
| sub_18FE320 | -- | RBTScheduleDataPrepare -- pre-scheduling data setup | MEDIUM |
| sub_18F94C0 | -- | RBTCleanup -- state teardown | MEDIUM |
| sub_C5FFC0 | -- | DispatchPreSchedule -- SM gate -> Backend C (mode 1) | CERTAIN |
| sub_C5FFF0 | -- | DispatchPostSchedule -- SM gate -> Backend C (mode 0) | CERTAIN |
| sub_C5FEF0 | -- | DispatchCodecSchedule -- SM gate -> Backend B | CERTAIN |
| sub_C60910 | -- | DispatchConditionalCodecSchedule -- SM gate + knob override | CERTAIN |
| sub_7DDB50 | -- | GetSmVersionIndex -- reads context+2104 | HIGH |

Scheduling Guidance Output

After scheduling completes, ptxas can emit statistics comments into the SASS output and DUMPIR stream. Three emitter functions produce scheduling guidance in different contexts, all reading from a shared ~1400-byte statistics object. sub_A46CE0 controls the "SCHEDULING GUIDANCE:" header that wraps per-block scheduling output. sub_A3A7E0 emits per-function statistics as # [field=value] comment lines during DUMPIR. Eight post-regalloc clones at sub_ABBA50--sub_ABEB50 emit a variant with hardware pipe names.

Verbosity Controls

Two independent verbosity mechanisms gate the output:

Scheduling guidance level at *(DWORD*)(vtable + 992):

| Level | Behavior |
| --- | --- |
| 0 | No scheduling guidance output |
| 1+ | "SCHEDULING GUIDANCE:" header emitted; per-block scheduling dispatched |
| 2+ | Pre-formatting hook called via vtable+816 before header emission |
| 4+ | "LOOP STATIC METRICS : " sub-header appended |

DUMPIR detail bits at context+1416:

| Bit | Mask | Behavior |
| --- | --- | --- |
| 3 | 0x08 | Enable detailed statistics (FP16 vectorization, functional unit breakdown, throughput estimates) |
| 4 | 0x10 | Show worst-case latency: # [worstcaseLat=%f] |
| 5 | 0x20 | Show average-case latency: # [avgcaseLat=%f] |

Bits 4 and 5 are mutually exclusive -- only one latency variant is emitted.

Emitter Functions

| Address | Size | Identity | Confidence | Context |
| --- | --- | --- | --- | --- |
| sub_A3A7E0 | 1,236 B | Statistics::emitFunctionStats | CERTAIN | Pre-regalloc DUMPIR statistics. 20+ format strings at 0x21EBF76--0x21EC3B0. Uses abstract FU names (fp, half, shared, controlFlow, loadStore). |
| sub_A46CE0 | 1,793 B | SchedulingGuidance::buildAndEmit | HIGH | Scheduling guidance header + BB classification. Walks BB array at context+296, dispatches schedulable blocks to vtable+336. |
| sub_A4B8F0 | 248 B | StatsEmitter::emitInstrRegStats | HIGH | Binary-embedded metadata. Writes record type 3 (string) into SASS code object at *(a1+1000) + *(a1+996). |
| sub_ABBA50--sub_ABEB50 | 8 x 1,771 B | PostSchedStats::emit (SM-variant) | CERTAIN | Post-regalloc statistics. 8 clones at 0x700 spacing. Format strings at 0x21FA008--0x21FA400. Uses hardware pipe names (adu, alu, cbu, fma, lsu). |

Pre-Regalloc Output Format (sub_A3A7E0)

Emitted during DUMPIR. All lines prefixed with # . Lines marked [COND] are gated by the stated condition.

# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [urregs=8]                                              [COND: SM > 0x5FFF]
# [_lat2inst=0.0]
# [FP16 inst=0] [FP16 VectInst=0] [Percentage Vectorized=0.00]  [COND: +1416 bit 3]
# [est latency = 87] [LSpillB=0] [LRefillB=0], [SSpillB=0], [SRefillB=0], [LowLmemSpillSize=0] [FrameLmemSpillSize=0]
# [LNonSpillB=0] [LNonRefillB=0], [NonSpillSize=0]
# [Occupancy = 0.750000], [est numDivergentBranches=2] [attributeMemUsage=0], [programSize=1024]
# [est fp=12] [est half=0], [est trancedental=0], [est ipa=0], [est shared=0], [est controlFlow=8], [est loadStore=24]
# [est tex=0] [est pairs=4]
# [issue thru=0.888889] [fp thru=0.111111] [half thru=0.000000], [trancedental thru=0.000000], [ipa thru=0.000000]
# [shared thru=0.000000] [controlFlow thru=0.062500] [texLoadStore thru=0.187500], [reg thru=0.000000], [warp thru=0.000000]
# [SharedMem Alloc thru=0.125000]                         [COND: value != 0.0]
# [partially unrolled loops=0] [non-unrolled loops=1]
# [CB-Bound Tex=0] [UR-Bound Tex=0] [Bindless Tex=0] [Partially Bound Tex=0]
# [UDP inst=0] [numVecToURConverts inst=0]
# [maxNumLiveValuesAtSuspend=0]
# [Precise inst=0]
# [worstcaseLat=87.000000]                                [COND: +1416 bits 4-5 == 0x10]
# [avgcaseLat=52.500000]                                  [COND: +1416 bits 4-5 == 0x20]
# [instHint=142] [instPairs=4]                            [COND: instPairs != 0]
# <custom annotation>                                     [COND: linked list at stats[55] != NULL]

Key format details: pre-regalloc uses commas between some bracket groups ([SSpillB=%d], [SRefillB=%d],) and abstract functional unit names (fp, half, trancedental, shared, controlFlow, loadStore, texLoadStore). The typo "trancedental" (missing "s") is present in the binary.

Post-Regalloc Output Format (sub_ABBA50 clones)

Emitted after scheduling by SM-variant clones dispatched via vtable. Same # prefix. Differs from the pre-regalloc format in three ways:

  1. No commas between bracket groups
  2. SpillSize replaces LowLmemSpillSize + FrameLmemSpillSize
  3. Hardware pipe names replace abstract unit names; MMA variant breakdown added

The unique lines (lines shared with pre-regalloc use the same structure minus commas):

# [est latency = %d] [LSpillB=%d] [LRefillB=%d] [SSpillB=%d] [SRefillB=%d] [SpillSize=%d]
# [LNonSpillB=%d] [LNonRefillB=%d] [NonSpillSize=%d]
# [Occupancy = %f] [est numDivergentBranches=%d] [attributeMemUsage=%d] [programSize=%d]
# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d] [est fma=%d] [est half=%d]
# [est trancedental=%d] [est ipa=%d] [est lsu=%d] [est redux=%d]
# [est schedDisp=%d] [est tex=%d] [est ttu=%d] [est udp=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d] [est hmma16816=%d] [est hmma16816f16=%d]
# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [issue thru=%f] [adu thru=%f] [alu thru=%f] [cbu thru=%f] [fma2x thru=%f] [fma thru=%f]
# [trancedental thru=%f] [ipa thru=%f] [lsu thru=%f] [redux thru=%f]
# [schedDisp thru=%f] [tex thru=%f] [ttu thru=%f] [udp thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f] [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [reg thru=%f] [warp thru=%f]

Hardware Pipe Name Mapping

The post-regalloc format maps abstract functional unit names to hardware execution pipe identifiers:

| Post-Regalloc Pipe | Pre-Regalloc Equivalent | Description |
| --- | --- | --- |
| adu | -- | Address Divergence Unit (address computation) |
| alu | fp | Arithmetic Logic Unit (integer + FP32 combined) |
| cbu | controlFlow | Control/Branch Unit (branch, exit, barrier) |
| fma2x | -- | Double-precision FMA (separate pipe on sm_80+) |
| fma | fp | Fused Multiply-Add (FP32) |
| half | half | FP16 operations |
| lsu | loadStore + shared | Load/Store Unit (unified) |
| redux | -- | Reduction Unit (warp-level reductions) |
| schedDisp | -- | Scheduler Dispatch (internal overhead) |
| tex | tex | Texture Unit |
| ttu | -- | Tensor Texture Unit (Ada Lovelace+) |
| udp | -- | Uniform Data Path operations |

Binary-Embedded Statistics Record (sub_A4B8F0)

Separate from the DUMPIR comment output, sub_A4B8F0 writes a compact binary record into the SASS code object during emission:

Format string: "instr/R-regs: %d instructions, %d R-regs"
  instructions = stats[335] - stats[341]     (total minus removed)
  R-regs       = stats[159] + stats[102]     (extra + base allocation)

Record layout in output buffer:
  +0  DWORD  type = 3                        (string record type)
  +4  DWORD  string_length
  +8  char[] string_content                  (formatted text)

The companion function sub_A4B9F0 writes record type 2 for undefined register warnings: "Referencing undefined register: %s%d".
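The type-3 record layout shown above is simple enough to express directly (a sketch only: buffer management and the `*(a1+1000) + *(a1+996)` cursor arithmetic of the real emitter are omitted, and `write_string_record` is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write one string record into `buf` using the recovered layout:
 * +0 DWORD type (3 = string), +4 DWORD length, +8 raw characters.
 * Returns the number of bytes written. */
static size_t write_string_record(uint8_t *buf, const char *text) {
    uint32_t type = 3;                  /* string record type */
    uint32_t len = (uint32_t)strlen(text);
    memcpy(buf + 0, &type, 4);
    memcpy(buf + 4, &len, 4);
    memcpy(buf + 8, text, len);         /* not NUL-terminated in the record */
    return 8 + len;
}
```

Note the length is explicit and the content is not NUL-terminated, so records can be packed back-to-back and walked by advancing 8 + length bytes.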

Scheduling Guidance Header (sub_A46CE0)

sub_A46CE0 emits the scheduling guidance wrapper, then walks the BB array to classify and dispatch blocks for scheduling. The header is emitted into the output stream via sub_7FE930 (string builder) at context + 1440.

BB classification algorithm:

For each BB in context+296 (index 0 through context+304):

  1. Schedulable: sub_7544D0(context, bb) returns true AND sub_754510(context, bb) returns false. Dispatched immediately to scheduling via vtable+336.

  2. Type-8 (deferred): *(bb+16) == 8. Added to a dynamically grown src array for second-pass processing.

  3. Loop back-edge: When *(bb+148) != 0 and *(bb+128) != NULL, the function walks the predecessor linked list at bb+128. For each predecessor, it checks whether the predecessor's iteration order (bb+144) exceeds the current block's, and whether the predecessor's terminal instruction is a branch (opcode 0x5D after masking with 0xFFFFCFFD) with a matching program counter at (instr+84) & 0xFFFFFF. If a back-edge is detected, scheduling dispatch includes the back-edge source instruction as a hint parameter.

After the first pass, deferred type-8 blocks are processed in a second loop with the same back-edge detection logic.

Statistics Object Field Map

Both emitter families read from the same ~1400-byte statistics object. The object is accessed as a float* array; integer fields use the same DWORD index but read as int32.

Index  Type   Field                   Description
-----  ----   -----                   -----------
    8  int32  est_latency             Estimated schedule length in cycles
    9  float  FP16_vectorization_pct  Fraction of FP16 instructions vectorized
   10  int32  worstcase_latency       Worst-case latency (cast to float for output)
   11  int32  avgcase_latency         Average-case latency (cast to float for output)
   12  int32  LSpillB                 Long spill byte count
   13  int32  LRefillB                Long refill byte count
   14  int32  SRefillB                Short refill byte count
   15  int32  SSpillB                 Short spill byte count
   16  int32  LowLmemSpillSize        Local-memory low spill allocation
   17  int32  FrameLmemSpillSize      Frame local-memory spill allocation
   18  int32  LNonSpillB              Long non-spill byte count
   19  int32  LNonRefillB             Long non-refill byte count
   20  int32  NonSpillSize            Non-spill allocation size
   26  float  Occupancy               Occupancy ratio (0.0--1.0)
   27  int32  numDivergentBranches    Estimated divergent branch count
   28  int32  attributeMemUsage       Attribute memory usage in bytes
   29  int32  programSize             Program binary size in bytes
   42  int32  preciseInst             Count of precise (non-approximate) instructions
   44  int32  UDPinst                 Uniform data-path instruction count
   45  int32  vecToURConverts         Vector-to-uniform-register conversion count
   49  int32  maxLiveAtSuspend        Max live register values at suspend points
   50  float  issue_thru              Overall issue throughput (fraction of peak)
   54  float  fp_thru                 FP32 pipe throughput
   57  float  half_thru               FP16 pipe throughput
   58  float  transcendental_thru     Transcendental function pipe throughput
   59  float  ipa_thru                Interpolation pipe throughput
   61  float  shared_thru             Shared memory pipe throughput
   62  float  controlFlow_thru        Control flow pipe throughput
   65  float  texLoadStore_thru       Texture and load/store pipe throughput
   84  float  reg_thru                Register throughput
   85  float  warp_thru               Warp throughput
   86  float  sharedMemAlloc_thru     Shared memory allocation throughput
   87  int32  partiallyUnrolledLoops  Partially unrolled loop count
   88  int32  nonUnrolledLoops        Non-unrolled loop count
   89  int32  CBBoundTex              Constant-bank-bound texture count
   90  int32  PartiallyBoundTex       Partially bound texture count
   91  int32  BindlessTex             Bindless texture count
   92  int32  URBoundTex              Uniform-register-bound texture count
   93  int32  SM_architecture_enum    SM version discriminator (>0x5FFF enables UR stats)
   99  int32  uniform_reg_count       Uniform register count
  102  int32  R_reg_base              Base R-register allocation
  159  int32  R_reg_extra             Extra R-register allocation
  303  int32  est_fp                  Estimated FP32 instruction count
  306  int32  est_half                Estimated FP16 instruction count
  307  int32  est_transcendental      Estimated transcendental instruction count
  308  int32  est_ipa                 Estimated IPA instruction count
  310  int32  est_shared              Estimated shared memory operation count
  311  int32  est_controlFlow         Estimated control flow operation count
  315  int32  est_loadStore           Estimated load/store instruction count
  316  int32  est_tex                 Estimated texture instruction count
  334  int32  est_pairs               Estimated co-issuable instruction pairs
  335  int32  total_inst              Total instruction count (before removal)
  336  int32  texInst                 Texture instruction count
  337  int32  FP16inst                FP16 instruction count
  338  int32  FP16VectInst            FP16 vectorized instruction count
  339  int32  instHint                Instruction hint value
  340  int32  instPairs               Instruction pair count (also output gate)
  341  int32  removed_inst            Removed instruction count
  342  int32  tepid_inst              TEPID (texture-pending) instruction count
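The float*/int32 aliasing described above can be modeled directly: every field occupies a 4-byte slot, and only the read type differs per index. A minimal sketch — the blob contents and field values here are invented for illustration:

```python
import struct

def read_stat(blob: bytes, index: int, as_float: bool):
    # Fields live in 4-byte slots; float fields (e.g. index 26,
    # Occupancy) and int32 fields (e.g. index 8, est_latency) share
    # the same DWORD indexing and differ only in interpretation.
    raw = blob[4 * index : 4 * index + 4]
    return struct.unpack('<f' if as_float else '<i', raw)[0]

# Hypothetical ~1400-byte statistics blob with two fields populated.
stats = bytearray(1400)
stats[4 * 8 : 4 * 8 + 4] = struct.pack('<i', 57)      # est_latency (index 8)
stats[4 * 26 : 4 * 26 + 4] = struct.pack('<f', 0.75)  # Occupancy (index 26)
```

Reading index 8 as int32 then yields 57, and index 26 as float yields 0.75.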

Cross-References

  • Scheduler Overview -- 3-phase architecture, register budget, scheduling knobs
  • Latency Model -- per-opcode latency tables, functional unit mapping, architecture profiles
  • Scoreboards & Barriers -- scoreboard encoding, dependency barrier assignment, stall/yield format
  • Register Allocation -- the allocator that the scheduler interacts with
  • Phase Manager -- how ScheduleInstructions fits in the 159-phase pipeline
  • Knobs -- the 76 scheduling knobs and the knob query infrastructure
  • GMMA Pipeline -- GMMA/WGMMA operations targeted by DynBatch
  • DUMPIR Configuration -- DUMPIR levels that trigger statistics output
  • Spilling -- spill metrics (LSpillB, SSpillB) referenced in guidance output

Latency Model & Functional Units

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas instruction scheduler uses a static hardware performance model to estimate instruction latencies, functional unit occupancy, and pipeline conflicts. The model is architecture-parameterized: a family of 15+ profile-builder functions at 0x8E7300--0x8E9DC0 construct per-SM latency/throughput tables consumed by the scheduling engine. A separate 85 KB function (sub_89FBA0, SetOpcodeLatencies) assigns per-opcode scheduling classes that index into these tables. The combination produces a cost model that drives stall-count computation, priority scoring, and dual-issue pairing decisions.

Component              Function
---------              --------
Per-opcode classifier  sub_89FBA0 (85 KB) -- assigns scheduling class per Ori opcode
HW profile builder     sub_8E5CA0 (20 KB) -- assembles scheduling control word tables
Warp profile           sub_8E4400 (3.3 KB) -- maps SM ID to warp/dispatch parameters
SM-specific tables     sub_8E7300--sub_8E97B0 -- 15 architecture-specific builders
Latency query          sub_693BC0 (22 lines) -- memory space classification
Long-latency check     sub_8CCF80 (2.3 KB) -- returns true if latency > 19
Resource model         sub_A08A00 (345 lines) -- per-instruction FU cost computation
Register query         sub_A08910 (39 lines) -- operand register count/cost
Stall update           sub_A09530 (91 lines) -- per-instruction stall cycle accumulation
FU class mapper        sub_8F0CD0 -- maps (opcode, unit_name) to scheduling class
FU unit query          sub_704D30 (14 KB) -- maps SASS opcodes to functional unit IDs
Cutlass detector       sub_8F47E0 -- detects cutlass kernels for tuned scheduling
Pipe class assigner    sub_13710B0 (7.1 KB) -- SASS-level execution pipe assignment

Architecture of the Latency Model

The model has three layers:

Layer 1: Per-Opcode Classification
  sub_89FBA0 reads each instruction's Ori opcode (field at instr+72,
  masked with 0xFFFFCFFF) and assigns:
    - Scheduling class ID (stored at descriptor+4, range 1..772+)
    - 9-bit latency index (low 9 bits of descriptor+196)
    - Execution pipe mask (bits 15..19 of descriptor+196..200)
    - Throughput class (bits in descriptor+198..199)

Layer 2: Architecture-Specific HW Tables
  sub_8E7300..sub_8E97B0 build per-SM latency/throughput tables as
  96-byte records in a growable array. Each record maps a scheduling
  class to its pipeline latency, scoreboard wait count, barrier stall
  cycles, and dual-issue compatibility flags.

Layer 3: Runtime Query
  The scheduling engine queries the model via:
    - sub_A08A00 for per-instruction resource costs (3 modes)
    - sub_A08910 for register operand latency
    - sub_693BC0 for memory space classification
    - sub_8CCF80 for long-latency detection (threshold: 19 cycles)

Scheduling Class Assignment (sub_89FBA0)

sub_89FBA0 (85 KB, 2938 lines decompiled) is the largest function in the scheduling subsystem. It assigns each instruction a scheduling class -- an integer that indexes into the per-architecture latency tables. The function operates as a massive switch on *(instr+72) & 0xFFFFCFFF (the Ori opcode with modifier bits masked out).

Scheduling Descriptor Layout

Each instruction carries a scheduling descriptor at offsets 196--200 within the 296-byte Ori instruction object (not the SchedNode). The descriptor is a packed bit-field:

Descriptor at a3+196 (DWORD, 32 bits):
  [8:0]   9-bit latency index -- indexes into HW latency table
  [14:9]  reserved
  [19:15] 5-bit execution pipe mask -- identifies functional unit
          0x08000 = pipe A (ALU)
          0x10000 = pipe B (FP/tensor)
          0x18000 = pipe C (memory/texture)
          0xF8000 = all pipes (default sentinel)

Descriptor at a3+198 (WORD, 16 bits):
  [3:0]   pipe sub-class within the execution pipe
          0x10 = sub-class 1 (control flow)
          0x20 = sub-class 2 (integer ALU)
          0x30 = sub-class 3 (FP32)
          0x40 = sub-class 4 (FP64 / wide ops)
  [8:4]   throughput class (5 bits)
          0x1F0 = maximum throughput (sentinel)

Descriptor at a3+199 (BYTE, high bits):
  [5:1]   additional pipe flags
          0x3E = all flags set (default)
          Specific values: 0x04 (ALU), 0x08 (SFU), 0x0A (FP64), 0x0C (tensor)

Descriptor at a3+200 (WORD, 16 bits):
  [4:0]   read barrier mask (5 bits, 0x1F)
  [9:5]   write barrier mask (5 bits, 0x3E0)
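The bit-field split of the +196 DWORD can be sketched directly from the layout above (the dictionary keys are descriptive names, not recovered symbols):

```python
def decode_descriptor_196(dword: int) -> dict:
    """Split the 32-bit scheduling descriptor at instr+196."""
    return {
        'latency_index': dword & 0x1FF,  # bits [8:0]: HW latency table index
        'pipe_mask': dword & 0xF8000,    # bits [19:15]: execution pipe mask
    }

# Hypothetical descriptor: latency index 0xF1 on pipe B (0x10000).
fields = decode_descriptor_196(0x10000 | 0xF1)
```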

Opcode-to-Class Mapping

The switch statement maps Ori opcodes to scheduling class IDs. These IDs are stored at *(v8+4) where v8 is a pointer to the instruction's extended scheduling record. Representative mappings:

Ori opcode               Class     Execution pipe                Description
----------               -----     --------------                -----------
1                        130       sub-class 1 (0x10)            Control flow (BRA, JMP)
2--7 (wide)              683       sub-class 4 (0x40), pipe 0xA  Wide FP64 operations
2--7 (narrow, type 19)   52        sub-class 2 (0x20)            Integer ALU (narrow)
2--7 (narrow, other)     72        sub-class 2 (0x20)            Integer ALU (standard)
3, 5 (medium)            140       sub-class 3 (0x30)            FP32 operations
4 (medium)               131       sub-class 2 (0x20)            Integer MAD
6 (wide)                 140       sub-class 4 (0x40), pipe 0xA  FP64 pair operations
8 (flag bit set)         3         default                       Predicate operations (true)
8 (flag clear)           2         default                       Predicate operations (false)
0xA, 0xB, 0x6C, 0x95     200       sub-class 2 (0x20)            Integer compare/logic
0xA (extended)           551       default                       Extended integer (wide encoding)
0xA (extended, Mercury)  694/700   default                       Mercury-era extended integer
0xE                      5         default                       Conversion operations
0x10 (atomic)            575       default                       Atomic with flag
0x10 (global)            varies    sub-class 4 (0x40)            Global memory load/store
0x141                    745       latency 0xF1                  WGMMA (warpgroup MMA)
0x142 (variant 3)        744       latency 0xF0                  WGMMA variant
0x143                    765--767  latency 0xFB                  BGMMA/QMMA tensor variants
0x144                    600       latency 0xE6                  Tensor fence
0x145, 0x146             759       sub-class 4, pipe 0xC         Tensor core (HMMA/BMMA)
0x147, 0x148 (wide)      761       latency 0xFA                  Double-precision tensor (wide)
0x147, 0x148 (narrow)    757       latency 0xF6                  Double-precision tensor (narrow)
0x149                    604       latency 0xE7                  Tensor synchronization
0x13E                    749       latency 0xF4                  Bulk copy (ACQBULK)
0x13F                    748       latency 0xF3                  Bulk release (RELBULK)
0x13D (variant)          747/750   latency 0xF2/0xF5             Collective operations

The scheduling class IDs span a wide range (2--772+). Classes below 256 correspond to legacy instruction categories; higher classes (551, 575, 600, 683, 694, 700, 744--767) represent newer instruction types added for Hopper and Blackwell architectures.

Latency Index Encoding

The low 9 bits of the descriptor at a3+196 encode a latency index that maps directly into the per-architecture HW table. The index is formed by combining the descriptor's low byte with a pipe mask:

latency_index = *(WORD*)(a3+196) & 0x1FF

Observed latency index values and their instruction classes:

Index (hex)  Index (dec)  Instruction class
-----------  -----------  -----------------
0xE6         230          Tensor fence / sync
0xE7         231          Tensor synchronization
0xF0         240          WGMMA variant
0xF1         241          WGMMA primary
0xF2         242          Collective op (variant A)
0xF3         243          Bulk release
0xF4         244          Bulk copy
0xF5         245          Collective op (variant B)
0xF6         246          DP tensor (narrow)
0xF8         248          Tensor core (HMMA/BMMA)
0xFA         250          DP tensor (wide)
0xFB         251          BGMMA/QMMA

The highest index values (0xE6--0xFB) correspond to tensor and collective operations -- the most complex instructions with the longest and most architecture-variable latencies.

Functional Unit Categories

The scheduler tracks 10 functional unit resource counters per basic block. Each counter corresponds to a hardware execution pipe on the SM.

10-Element Resource Vector

Resource tracking uses an 84-byte per-BB slot at *(scheduler+672) + 84 * slot_index:

Index  Pipe name               Typical SASS instructions                       Throughput (IPC)
-----  ---------               -------------------------                       ----------------
0      Integer ALU (ALU)       IADD3, IMAD, ISETP, LOP3, SHF, IABS, POPC       1 (full rate)
1      FP32 (FMA)              FADD, FFMA, FMUL, FSETP, FMNMX, FCHK            1 (full rate)
2      FP64 (DFMA)             DADD, DFMA, DMUL, DSETP, DMNMX                  1/2 to 1/64 (SM-dependent)
3      Tensor core (MMA)       HMMA, IMMA, BMMA, BGMMA, WGMMA, QMMA            varies
4      Load/store (LSU)        LDG, STG, LDL, STL, LDS, STS, LDGSTS            1 (full rate)
5      Texture (TEX)           TEX, TLD, TXQ, TLD4, TEXS                       1/2 to 1/4
6      Control flow (BRA)      BRA, JMP, EXIT, RET, CALL, BRK, CONT            1
7      Shared memory (SMEM)    ATOMS, REDS, LDS, STS (atomic/reduce variants)  1
8      Special function (SFU)  MUFU (RCP, RSQ, SIN, COS, EX2, LG2)             1/4
9      Uniform datapath (UDP)  UPLOP3, UISETP, UIMAD, uniform operations       1

The resource vector layout within each 84-byte slot:

Offset  Size       Content
 0..39  10 x int32  Current resource usage per FU (pipe 0..9)
40..79  10 x int32  Resource pressure delta (change from scheduling)
80..83  1 x int32   BB-entered flag and auxiliary state bits
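The slot addressing reduces to simple arithmetic over the documented 84-byte layout. A sketch (function name is descriptive, not a recovered symbol):

```python
SLOT_SIZE = 84  # 10 usage DWORDs + 10 delta DWORDs + 1 flag DWORD

def counter_offsets(slot_index: int, pipe: int):
    """Byte offsets of one pipe's usage and delta counters inside the
    per-BB resource array at *(scheduler+672)."""
    base = SLOT_SIZE * slot_index
    usage = base + 4 * pipe       # within offsets 0..39 of the slot
    delta = base + 40 + 4 * pipe  # within offsets 40..79 of the slot
    return usage, delta
```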

Functional Unit Class Mapping (sub_8F0CD0)

A secondary mapper at sub_8F0CD0 translates (opcode, unit-name-string) pairs to numeric scheduling class IDs for the stall/barrier encoding stage:

Opcode   Unit string  Class ID  Meaning
------   -----------  --------  -------
40       "LSU_T"      15        Texture load/store unit
40       "XU64"       35        Extended unit (64-bit ops)
39       "DMMA"       118       Double-precision matrix multiply
53       "DMMA"       118       DMMA (alternate opcode)
default  --           35        Fallback to extended unit

The "LSU_T" and "XU64" string tags appear in the Mercury-era post-scheduling pipeline where the SASS encoder needs to distinguish sub-pipes within the load/store and extended-precision units.

Functional Unit Query (sub_704D30)

sub_704D30 (14 KB) maps SASS opcode character codes to functional unit IDs for the Mercury encoder's latency model. The mapping uses single-character opcode identifiers:

Char code  Decimal  FU ID  Unit
---------  -------  -----  ----
'D'        68       40     FP64 unit
'E'        69       44     Extended unit
'F'        70       48     FP32 unit
'J'        74       52     Integer unit
'K'        75       56     Conversion unit
'L'        76       60     Load/store unit
'N'        78       32     Tensor unit
'S'        83       36     Special function unit

The function dispatches on *(config+372) >> 12 (the SM architecture selector) to handle architecture-specific unit mapping variations (e.g., Kepler vs Volta).

Per-Architecture HW Latency Tables

Table Construction Pipeline

The HW latency tables are built during scheduler initialization by a chain of constructors:

sub_8E4400(profile, sm_id, sched_mode)     // Warp-level parameters
  |
  v
sub_8E5CA0(profile, table_ptr, table_size) // Assemble output array
  |
  +-- sub_8E6760()  // Group boundary markers
  +-- sub_8E6950()  // Barrier entries
  +-- sub_8E6B40()  // Standard scheduling entries
  +-- sub_8E6F20()  // Wait dependency entries
  +-- sub_8E7110()  // Scoreboard entries
  |
  v
sub_8E7300..sub_8E97B0(profile, ...)       // SM-specific table population
  |
  v
sub_8E3AD0(output, count, entries, ...)    // Copy into final profile

Each SM-specific function populates entries in the 96-byte-per-record output array. Records encode latency, throughput, pipe assignment, and barrier compatibility for each scheduling class.

96-Byte Schedule Record Format

Each record in the HW table occupies 96 bytes (6 x 16-byte XMM slots). Records are stored in a growable array at *(context+56) with count at *(context+64) and capacity at *(context+68). The array grows by 1.5x when full. Records are copied using three _mm_loadu_si128 operations (offsets 0, 16, 32) plus manual field-by-field copy for offsets 48--95; the string at +48 is reference-cloned via sub_714160 when the string-backed flag is set.

Offset  Size   Field               Content
------  ----   -----               -------
 0..1   WORD   type_code           Record type (see type table below)
 2..3   WORD   (padding)           Zero
 4..7   DWORD  aux_size            Type-dependent:
                                     root (type 1): table_size
                                     barrier ('M'): 128 (fixed)
                                     wait/scoreboard ('5'/'6'): 36
                                     sched entry (23): 0
 8..15  8B     (reserved)          Zero
16..19  DWORD  cost_product        Scheduling cost (latency x throughput product)
                                     - Standard entry (23): a2 * a3
                                     - Category header ('!'): entry_count from config+528
                                     - Wait/scoreboard: 280 (fixed sentinel)
                                     - SM-specific (','): 4 * class_count
20..21  WORD   base_latency        Base latency in cycles (standard entries only)
22..23  WORD   dual_issue_flags    Dual-issue compatibility mask (standard entries only)
24..31  8B     (reserved)          Zero
32..39  QWORD  data_ptr            Pointer to type-specific data block:
                                     - Root: parent profile object
                                     - Wait/scoreboard: dependency tracking table
                                     - Barrier: barrier data array
                                     - Category headers: 0
40..47  QWORD  data_size           Byte count of data block at data_ptr:
                                     - Root: table_size; barrier: 128
                                     - Wait/scoreboard: 36; headers: 0
48      BYTE   inline_flag         0 = data_ptr/data_size carry raw data
                                   1 = this record uses the inline string buffer
49..63  15B    inline_str_buf      Inline NUL-terminated string (max 15 chars)
64..71  QWORD  parent_ptr          Back-pointer: SM-specific entries point to table
                                   root; category headers point to profile object
72..79  8B     (reserved)          Zero
80..87  QWORD  string_buf_ptr      Pointer to growable string buffer (32-byte header:
                                   data_ptr, size, capacity, allocator) for variable-
                                   length sub-records; self-references +48 when inline
88      BYTE   string_backed_flag  1 = record owns allocated string data at +80
                                   0 = no allocated string (uses inline or none)
89..95  7B     (padding)           Zero
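The growable-array behavior described above (count at +64, capacity at +68, 1.5x growth when full) can be modeled as a small container. This is a sketch of the layout semantics only; the initial capacity and byte-level copy are illustrative assumptions:

```python
class RecordArray:
    """Models the *(context+56) record array: fixed-size records,
    count/capacity bookkeeping, 1.5x growth when full."""

    def __init__(self, record_size: int = 96, capacity: int = 4):
        self.record_size = record_size
        self.capacity = capacity  # initial capacity is a guess
        self.count = 0
        self.data = bytearray(record_size * capacity)

    def append(self, record: bytes) -> None:
        assert len(record) == self.record_size
        if self.count == self.capacity:
            # Grow by 1.5x, keeping existing records in place.
            self.capacity += self.capacity // 2
            grown = bytearray(self.record_size * self.capacity)
            grown[: len(self.data)] = self.data
            self.data = grown
        off = self.count * self.record_size
        self.data[off : off + self.record_size] = record
        self.count += 1
```

Starting from capacity 4, appending seven 96-byte records grows the array twice: 4 -> 6 -> 9.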

Record Type Codes

Records are polymorphic -- the type code at offset +0 selects the interpretation of fields +16..+31, +32..+47, and the sub-record format stored in the growable buffer at +80.

Type  ASCII  Creator            Role
----  -----  -------            ----
1     --     sub_8E5CA0         Root container (wraps entire HW table)
23    --     sub_8E6B40         Standard scheduling entry (latency + throughput + dual-issue)
33    '!'    sub_8E5740         Category header (begins a named section with string list)
44    ','    sub_8E8480 et al.  SM-specific table entry (per-architecture class data)
45    '-'    sub_8E5CA0         Barrier section header (links 128-byte barrier table)
49    '1'    sub_8E5530         Dimension entries (contains 12-byte sub-records)
53    '5'    sub_8E7110         Scoreboard entry (dependency tracking, data_size=36)
54    '6'    sub_8E6F20         Wait dependency entry (dependency table, data_size=36)
57    '9'    sub_8E5740         Category footer (closes the section opened by type 33)
59    ';'    sub_8E5310         Variant section (contains 20-byte sub-records)
60    '<'    sub_8E6760         Group boundary marker (separates scheduling groups)
69    'E'    sub_8E6950         Barrier entry (a2 = stall count in cost_product field)
77    'M'    sub_8E6D40         Barrier/sync data entry (data_ptr = barrier array, 128B)
87    'W'    sub_8E4F20         Supplementary weight entry (variable-length string data)

Sub-Record Formats in the Growable Buffer (+80)

Records with string_backed_flag=1 carry variable-length sub-records in the growable buffer. The buffer header at *(record+80) is a 32-byte object: {data_ptr, size (DWORD), capacity (DWORD), allocator_ptr}.

Type 59 (';') -- Variant sub-records (20 bytes each):

Created by sub_8E5310 iterating the variant list at config+536:

Sub-record layout (20 bytes):
  +0   DWORD   source_data       Variant source identifier
  +4   WORD    flags             Variant flags
  +6   WORD    zero              Reserved
  +8   DWORD   throughput_value  Throughput for this variant
  +12  DWORD   aux_value         Auxiliary parameter
  +16  DWORD   zero              Reserved

The main record additionally stores: +16 = start_index (from config+544), +20 = record_index, +24 = back_ref to previous category.

Type 49 ('1') -- Dimension sub-records (12 bytes each):

Created by sub_8E5530 traversing the BST at config+592:

Sub-record layout (12 bytes):
  +0   WORD    node_flags        BST node flags (from node+38)
  +2   WORD    zero              Reserved
  +4   DWORD   node_value        BST node value (from node+32)
  +8   DWORD   node_child        BST node child pointer (low 32 bits of node+24)

Type 44 (',') -- SM-specific class descriptor (16 bytes + packed bitmasks):

Created by sub_8E8480 and other SM-specific builders, followed by a call to sub_8E3AD0 which appends packed bitmask DWORDs:

Initial 16-byte descriptor:
  +0   DWORD   class_flags = 2   Fixed flag value
  +4   WORD    zero              Reserved
  +8   QWORD   mask              Latency mask (0xFFFFFFFF00000000)

Followed by bitmask DWORDs (4 bytes each, one per 8 scheduling classes):
  Each DWORD encodes 4 bits per entry (4 entries x 4 properties):
    bit 4*i+0:  entry[i].field_0 != 1
    bit 4*i+1:  entry[i].field_4 != 1
    bit 4*i+2:  entry[i].field_8 != 1
    bit 4*i+3:  entry[i].field_12 != 1
  Source entries are 20 bytes apart in the input array.
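The packing rule above can be sketched as a small helper. Assumptions: entries are passed as 4-tuples of the four compared fields (field_0, field_4, field_8, field_12), and the 20-byte input stride is abstracted away:

```python
def pack_class_bitmask(entries):
    """Pack up to 4 table entries into one DWORD using the documented
    bit formula: bit 4*i+k is set when entry i's k-th field != 1."""
    word = 0
    for i, fields in enumerate(entries):
        for k, value in enumerate(fields):
            if value != 1:
                word |= 1 << (4 * i + k)
    return word
```

An entry whose four fields are all 1 contributes no bits; entry 0 with field_0 != 1 sets bit 0, and entry 1 with field_12 != 1 sets bit 7.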

Assembly Sequence

sub_8E5CA0 orchestrates the complete table by emitting records in this order:

  1. Barrier header (type '-', conditional on config+336): links the 128-byte barrier data table at config+272.
  2. Root container (type 1): data_ptr = profile_object, data_size = table_size.
  3. Category header + footer (types '!' / '9'): emitted by sub_8E5740, which enumerates named sections from config+520..528.
  4. Variant section (type ';'): emitted by sub_8E5310 if config+544 != 0.
  5. Supplementary weights (type 'W'): emitted by sub_8E4F20 if config+640 != -1.
  6. Dimension entries (type '1'): emitted by sub_8E5530 if config+608 > 0.

After all records are appended, the function computes the total serialized size (with 16-byte alignment padding per data block), allocates the output buffer, and writes a 32-byte header per record into the linear output at context+104.

Architecture Dispatch Table

Address     SM              Architecture     Table size  Notes
-------     --              ------------     ----------  -----
sub_8E7300  sm_70           Volta            3.3 KB      First Turing-era table format
sub_8E7540  sm_72           Xavier           2.9 KB      Automotive Volta variant
sub_8E7720  sm_75           Turing           3.5 KB      Added TensorFloat-32
sub_8E7940  sm_80 (base)    Ampere base      2.9 KB      Shared base for sm_80/86/87
sub_8E7B40  sm_80           Ampere           3.3 KB      Full Ampere with async copy
sub_8E7D80  sm_86           GA10x            4.4 KB      Consumer Ampere
sub_8E8070  sm_87           Orin             3.5 KB      Automotive Ampere
sub_8E8280  sm_89           Ada Lovelace     3.1 KB      Added FP8 tensor ops
sub_8E8480  sm_90           Hopper           5.2 KB      DPX, WGMMA, TMA
sub_8E8780  sm_90a          Hopper accel.    4.6 KB      WGMMA async extensions
sub_8E8A90  sm_100          Blackwell DC     3.0 KB      5th-gen tensor, TCGEN05
sub_8E8CB0  sm_100 (short)  Blackwell DC     949 B       Supplementary table
sub_8E8DB0  sm_103          Blackwell Ultra  1.7 KB      GB300 extensions
sub_8E8F60  sm_103 (short)  Blackwell Ultra  618 B       Supplementary table
sub_8E9000  sm_120          RTX 50xx         2.9 KB      Consumer Blackwell
sub_8E92E0  sm_120 (ext)    RTX 50xx         5.5 KB      Extended consumer table
sub_8E97B0  universal       Fallback         8.8 KB      Default for unknown SM

sm_90 (Hopper) has the second-largest combined table (5.2 + 4.6 KB including sm_90a) reflecting the complexity of WGMMA, DPX, and TMA scheduling. sm_120 extended (5.5 KB) is the single largest individual table, accommodating the consumer Blackwell feature set.

The "short" supplementary tables (sub_8E8CB0 for sm_100, sub_8E8F60 for sm_103) add entries for architecture-specific instructions not covered by the base table -- typically new tensor core variants and collective operations.

Warp-Level Hardware Profile (sub_8E4400)

sub_8E4400 maps the SM architecture ID (a2) to warp-level dispatch parameters stored in a 36-byte structure:

Architecture-to-Warp Mapping

SM ID range   Warps per SM  Dispatch slots  Architecture era
-----------   ------------  --------------  ----------------
<= 20479      4             96              sm_50 (Maxwell)
20480--24575  6             176             sm_60 (Pascal)
24576--28672  7             192             sm_70 (Volta)
28673--32767  7             208             sm_75 (Turing)
32768--36863  8             224             sm_80 (Ampere)
> 36863       16            240             sm_90+ (Hopper, Blackwell)

The packed DWORD at offset +18 encodes (warps, sub-warp-count) as a 32-bit value. For example, 983055 (0x000F000F) = 15 warps in the low half and 15 in the high half, while 1048592 (0x00100010) = 16 warps for sm_90+.
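The split of that packed DWORD is a plain 16/16 bit extraction, which can be verified against both quoted values:

```python
def decode_warp_dword(value: int):
    """Split the packed DWORD at profile offset +18 into its low and
    high 16-bit warp counts."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF

# 983055 = 0x000F000F -> (15, 15); 1048592 = 0x00100010 -> (16, 16)
```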

Sub-Architecture Variants

Specific SM version IDs map to sub-architecture variant codes stored at offset +26:

SM ID  Hex     Variant  Architecture
-----  ---     -------  ------------
8193   0x2001  2        sm_50 (Maxwell Titan X)
20481  0x5001  2        sm_60 variant
24576  0x6000  0        sm_70 (Volta base)
28674  0x7002  2        sm_75 variant A
28675  0x7003  3        sm_75 variant B
28676  0x7004  4        sm_75 variant C
28677  0x7005  5        sm_75 variant D
32768  0x8000  0        sm_80 (Ampere base)
36864  0x9000  0        sm_90 (Hopper base)
36867  0x9003  3        sm_90 variant A
36868  0x9004  4        sm_90 variant B (sm_90a)
36869  0x9005  5        sm_90 variant C

Pipeline Width (offset +24)

The scheduling mode parameter (a3) selects the pipeline width stored at offset +24. This value controls how many instructions the scheduler models as issuing per cycle:

Mode     Value at +24  Meaning
----     ------------  -------
1, 8, 9  1             Single-issue
3        4             Quad-issue (tensor)
4        5             Penta-issue
5        6             Hexa-issue
6        7             Hepta-issue
7        8             Octa-issue
10       9             Nona-issue
11       10            Deca-issue
default  2             Dual-issue

These values model the effective issue width for different scheduling contexts. The tensor core modes (4--11) reflect warpgroup-level cooperative execution where multiple warp slots issue tensor instructions simultaneously.

Memory Space Classification (sub_693BC0)

sub_693BC0 (22 lines) classifies the memory space of load/store instructions. It extracts the last source operand from the instruction, looks up the register descriptor, and calls sub_91C840 to determine the memory space type. The function returns an integer code:

Return value  Memory space                     Typical latency range
------------  ------------                     ---------------------
1             Generic (resolved at runtime)    20--200+ cycles
2             Local memory (per-thread stack)  20--200 cycles
3             Shared memory                    20--30 cycles
4             Constant memory (cached)         4--8 cycles
7             Constant bank (indexed)          4--8 cycles
11            Surface memory                   200--500 cycles
16            Global memory (DRAM)             200--500 cycles

The scheduler uses these values in the priority function (sub_8C9320) to distinguish "hot" (long-latency) memory operations from "cold" (short-latency) ones. sub_A9CDE0 classifies hot (global/texture) memory; sub_A9CF90 classifies cold (constant/shared) memory.

Long-Latency Detection (sub_8CCF80)

sub_8CCF80 checks if an instruction qualifies as "long-latency" for scheduling priority purposes. The function:

  1. Verifies the target architecture supports dual-issue via sub_7DC0E0.
  2. For opcode 183 (LD/ST variant): checks memory space via sub_693BC0. Memory spaces 4, 16, 2, 11, 3, 1, and 7 all qualify for long-latency classification.
  3. For opcode 130 (HSET2 in the ROT13 name table; used as a generic internal marker): queries via vtable+640 whether the instruction is recognized as long-latency.
  4. Queries the scheduling oracle (sub_8BF3A0) for the instruction's estimated latency.
  5. Returns true if the estimated latency exceeds 19 cycles.

The threshold of 19 cycles is the boundary between "short-latency" instructions (ALU, FP32, shared memory) and "long-latency" instructions (global memory, texture, tensor core) that benefit from latency hiding through instruction reordering.
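The final predicate is a strict comparison against that threshold, so a 19-cycle instruction still counts as short-latency. A one-line sketch:

```python
LONG_LATENCY_THRESHOLD = 19  # cycles, the boundary used by sub_8CCF80

def is_long_latency(estimated_cycles: int) -> bool:
    # Strictly greater than 19: exactly 19 cycles is still "short".
    return estimated_cycles > LONG_LATENCY_THRESHOLD
```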

Resource Cost Model (sub_A08A00)

sub_A08A00 (345 lines) computes per-instruction resource costs for the 10-element functional unit vector. It operates in three modes selected by parameter a6:

Mode 0/1: Instruction Cost Initialization

Resets the instruction's resource tracking state:

  • a1[0] = 0 (accumulated cost)
  • a1[1045] = 0 (accumulated delta)
  • a1[2071] = 0 (accumulated pressure)
  • Byte at offset 8280 = 0 (flags)

Then it computes per-operand resource contributions by iterating the source operands (count at a3+80, starting at a3+84); the per-operand walk is described under Per-Operand Cost Computation below.

Mode 2: Differential Cost

Computes the differential cost (new minus old):

v55 = a1[0]       // previous instruction cost
v56 = a1[1045]    // previous delta cost

Then runs the same operand iteration as mode 1 and subtracts the previous values.

Mode 3: Pressure Accumulation

Adds the instruction's previously computed pressure a1[2071] into the running total at *(a5+24).

Per-Operand Cost Computation

For each source operand, the function:

  1. Checks operand type: ((operand >> 28) & 7) == 1 means register operand.
  2. Skips operands with values 41--44 (special sentinel registers).
  3. Looks up the register descriptor via *(a1+88) + 8 * (operand & 0xFFFFFF).
  4. Checks if register class *(descriptor+64) is <= 6 (physical register file).
  5. Calls sub_A08910 to get the register's latency and count:
    • Returns the starting register index
    • Outputs count (*a4) and cost-per-register (*a5)
  6. Iterates over the register range, accumulating costs for registers not in the "already-consumed" bitmask at *(a1+832).

The cost accumulation uses a 9-bit field in the instruction's scheduling word at offset +12, masked as & 0x1FF.

Register Latency Query (sub_A08910)

sub_A08910 (39 lines) returns the register index and cost for a single operand:

function GetRegisterLatency(context, reg_desc, operand, out_count, out_cost):
    pipeline_bits = (reg_desc.field_48 >> 20) & 3
    count = 1
    cost = (pipeline_bits == 3) ? 2 : 1
    *out_count = count
    *out_cost = cost

    if context.flags & 0x10:    // dual-register tracking mode
        return 2 * reg_desc.field_12     // doubled register index
    else:
        if context.flags & 0x08 and pipeline_bits != 1 and reg_desc.class == 6:
            *out_cost = 2 * cost          // double cost for wide registers
        return reg_desc.field_12          // register index

The pipeline bits extracted from (reg_desc+48) >> 20 encode the register's pipeline affinity:

  • Bits == 1: standard pipeline register
  • Bits == 3: double-width register (costs 2 instead of 1)
  • Other values: architecture-specific pipeline assignment

When dual-register tracking is active (context flag 0x10, controlled by knob 420), register indices are doubled to provide separate tracking for even/odd register halves.

Latency Hiding Statistics

The post-scheduling analysis pass (sub_73B360, MacLoopSchedulingAnalytics, 28.7 KB) computes and reports latency hiding effectiveness for four categories of long-latency operations:

Category             String identifier         Stat function  Typical latency
--------             -----------------         -------------  ---------------
Shared memory loads  "LDS latency hiding"      sub_73A1D0     20--30 cycles
Global memory loads  "LDG latency hiding"      sub_73A7F0     200--500 cycles
Extended 64-bit ops  "Xu64 latency hiding"     sub_73ADF0     15--30 cycles
Anti-dependencies    "Antidep latency hiding"  (inline)       varies

Each category reports: Num (count of operations), Min (minimum hidden cycles), Max (maximum hidden cycles), Avg (average hidden cycles). The pass also tracks MAC instruction utilization ("MacInsts", "MacReuses", "TepidMacUtil") and resource busy time ("LsuResBusy", "Time", "TepidTime").

This analysis runs after scheduling is complete and drives feedback for the Mac Loop scheduler, which handles fused multiply-accumulate loop bodies. Knob 443 gates the MAC instruction classification.

Dual-Issue Rules

Dual-issue scheduling is controlled by sub_8CF5D0 (CheckDualIssueEligibility, 3.5 KB) and implemented by sub_8B77C0 (DualIssueScheduler, 15 KB) with pairing logic in sub_8BDC40 (7.9 KB).

Eligibility Check

sub_8CF5D0 returns 0 (no dual-issue) if:

  • The target architecture does not support dual-issue (sub_7DC0E0 returns false).
  • Function flag bit 2 at func+1368 is set (incompatible function).

When eligible, the function iterates basic blocks checking instruction pairs:

  • sub_A9CDE0(instr): returns true if instruction is dual-issuable (hot = global/texture).
  • sub_A9CF90(instr): returns true if instruction can pair with the next (cold = constant/shared).

The dual-issue benefit score is stored at scheduler+328 and used by the priority function to bias toward instruction pairs that can co-issue.

Dual-Issue Constraints

Dual-issue pairs must satisfy:

  1. Pipe compatibility: the two instructions must target different functional units (e.g., ALU + FP32, or ALU + load/store). Same-pipe pairs cannot dual-issue.
  2. Register conflict: the pair must not have RAW dependencies on the same register within the same cycle.
  3. Barrier compatibility: neither instruction may be waiting on a scoreboard barrier.
  4. Architecture support: dual-issue is primarily an sm_50 (Maxwell) feature. Newer architectures (sm_70+) use wider warp schedulers instead.

For sm_50, a special register budget function adjusts the register allocation target to account for the reduced register pressure from dual-issue execution.

Stall Count Computation

The stall count determines how many cycles an instruction must wait before it can issue. Stalls are computed by sub_8D3E20 (2.1 KB) and encoded by sub_8F3130 (1.0 KB).

Stall Encoding in Control Words

Each SASS instruction carries a stall count in its control word:

  • Maximum stall: 16 cycles (capped by knobs 805 and 806).
  • Minimum stall: 1 cycle (no zero-stall encoding exists).
  • Default stall when no dependency: determined by the HW profile's pipeline depth.

The stall/barrier encoding pipeline (sub_8D7760, 41 KB) computes stalls by walking the dependency DAG backward from each instruction:

function ComputeStallCycles(sched, instr):
    max_wait = 0
    for each predecessor of instr:
        distance = instr.cycle - pred.cycle
        latency = LookupLatency(sched, pred, instr)
        wait = latency - distance
        max_wait = max(max_wait, wait)
    return min(max_wait, MaxStallFromKnob(sched))
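The pseudocode above can be rendered runnable. One assumption beyond the pseudocode: the result is also clamped to a minimum of 1, since the control-word format has no zero-stall encoding (see Stall Encoding above); where exactly that clamp happens in the binary is not pinned down here.

```python
MAX_STALL = 16  # cap imposed via knobs 805/806

def compute_stall_cycles(instr_cycle: int, preds) -> int:
    """Runnable rendering of ComputeStallCycles. `preds` is a list of
    (pred_cycle, latency) pairs, with latency standing in for
    LookupLatency(sched, pred, instr)."""
    max_wait = 0
    for pred_cycle, latency in preds:
        distance = instr_cycle - pred_cycle
        max_wait = max(max_wait, latency - distance)
    # Clamp to the encodable 1..16 range.
    return max(1, min(max_wait, MAX_STALL))
```

For example, a predecessor issued 6 cycles earlier with a 9-cycle latency forces a 3-cycle stall, while a 200-cycle tensor latency saturates at the 16-cycle cap.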

The encoding function sub_8F4140 packs the complete control word:

Field            Encoder     Bits  Range
-----            -------     ----  -----
Stall count      sub_8F3130  4     1--16 cycles
Yield hint       sub_8F3650  1     0/1
Read barrier     sub_8F31F0  6     0--5 (barrier ID)
Write barrier    sub_8F31F0  6     0--5 (barrier ID)
Scoreboard wait  sub_8F3860  6     barrier wait mask
Reuse flags      (separate)  4     register reuse hints

Sentinel Values

The scheduling system uses several sentinel values:

Value            Meaning
-----            -------
-1 (0xFFFFFFFF)  Unscheduled instruction position
0x1869F (99999)  Infinite latency sentinel
0xFFFFFFFF       Batch window sentinel (DynBatch)

Resource Cost Accumulation

sub_8C67A0 (ComputeResourceCost, 3.7 KB) drives the per-instruction resource accounting. It calls the resource model sub_A08A00 three times per instruction:

function ComputeResourceCost(sched, instr):
    slot = GetResourceSlot(sched, instr)
    slot.bb_entered |= 1

    // Phase 1: Instruction's own execution cost
    sub_A08A00(sched, instr, instr_data, output, slot, mode=1)
    // Accumulate: slot[0..9] += output[0..9]  (SSE _mm_add_epi32)

    // Phase 2: Operand release costs (for last-use operands)
    sub_A08A00(sched, instr, instr_data, output, slot, mode=2)
    // Accumulate delta: slot[10..19] += output[0..9]

    // Phase 3: Combined instruction + BB-level impact
    sub_A08A00(sched, instr, instr_data, output, slot, mode=3)
    // Accumulate pressure into slot[20]

The SSE-optimized accumulation uses _mm_add_epi32 to add 4 resource counters at a time, processing the full 10-element vector in 3 SSE iterations (4 + 4 + 2).
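A plain-Python restatement of that accumulation pattern, with list slices standing in for the SIMD registers (the 21-slot layout and helper name are ours, inferred from the three-phase description above):

```python
# Sketch of the chunked accumulation in sub_8C67A0: slot[base..base+9]
# += output[0..9], four lanes at a time (4 + 4 + 2), mirroring the
# _mm_add_epi32 loop described in the text.

def accumulate(slot, output, base):
    """Add the 10-element resource vector into the slot at `base`."""
    for off in (0, 4, 8):                 # three SSE-width chunks
        width = min(4, 10 - off)          # last chunk covers 2 lanes
        for lane in range(width):
            slot[base + off + lane] += output[off + lane]
    return slot
```

Phase 1 would call this with `base=0`, phase 2 with `base=10`; phase 3 folds its result into the single pressure counter at index 20.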

Cutlass-Specific Scheduling

sub_8F47E0 detects NVIDIA cutlass GEMM kernels by calling strstr(function_name, "cutlass"). When detected, the scheduler activates hand-tuned scheduling parameters for matrix multiplication inner loops. This includes:

  • Modified stall counts for the HMMA/WGMMA instruction sequences.
  • Adjusted register pressure targets.
  • Specific barrier placement patterns for double-buffered shared memory.

This reflects NVIDIA's investment in hand-tuning their cutlass library's scheduling behavior within ptxas itself.

Execution Pipe Assignment (sub_13710B0)

sub_13710B0 (7.1 KB, 1,088 lines decompiled) is the SASS-backend execution pipe class assigner. It runs in the SASS encoding pipeline (address range 0x1370--0x139F) after instruction selection, register allocation, and the main scheduling pass are complete. Where sub_89FBA0 assigns IR-level scheduling class IDs (2--772+) consumed by the priority and stall-computation passes, sub_13710B0 writes SASS-level pipe class IDs (0x00--0x141) that control control-word encoding: stall counts, barrier assignments, and dual-issue pairing in the final binary.

Descriptor Initialization

Before dispatching on the opcode, the function initializes the scheduling descriptor at a3+196..202 to the "all-pipes" default:

*(DWORD*)(a3+196) |= 0xF8000     // pipe mask = all (bits 15..19)
*(BYTE*)(a3+200)  |= 0x1F        // read barrier mask = all
*(WORD*)(a3+198)   = HIWORD | 0x1F0  // throughput class = max
*(WORD*)(a3+200)  |= 0x3E0       // write barrier mask = all
*(BYTE*)(a3+199)   = ... | 0x3E  // pipe flags = all set

Then it switches on *(a2+72) & 0xFFFFCFFF (the Ori opcode with modifier bits masked), writing a 9-bit pipe class into the low bits of *(WORD*)(a3+196) and optionally overriding the pipe mask, sub-class, and pipe flags.
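
A decoder for those descriptor fields, following the bit positions recovered above, makes the layout concrete. The field names and the raw-value extraction are our interpretation of the decompiled stores, not NVIDIA's names:

```python
# Hypothetical decoder for the a3+196 scheduling descriptor: 9-bit pipe
# class in the low bits, pipe mask in bits 15--19 of the same DWORD,
# sub-class in bits 4--7 of the WORD at +198, affinity flags in bits
# 1--5 of the BYTE at +199.

def decode_descriptor(dword_196, word_198, byte_199):
    return {
        "pipe_class": dword_196 & 0x1FF,    # 0x00--0x141 per the tables
        "pipe_mask": dword_196 & 0xF8000,   # 0xF8000 = all pipes (default)
        "sub_class": word_198 & 0xF0,       # 0x10..0x40 per the table below
        "pipe_flags": byte_199 & 0x3E,      # sub-unit affinity, 0x3E = all
    }
```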

Pipe Mask Encoding

Bits 15--19 of *(DWORD*)(a3+196) select the execution pipe:

| Value | Pipe | Functional units | Resource vector indices |
|---|---|---|---|
| 0x08000 | Pipe A | ALU, integer, FP64, conversion | 0 (ALU), 2 (DFMA) |
| 0x10000 | Pipe B | FP32, tensor, SFU, half-precision | 1 (FMA), 3 (MMA), 8 (SFU) |
| 0x18000 | Pipe C | Memory, texture, wide FP64 | 4 (LSU), 5 (TEX) |
| 0xF8000 | All | Default sentinel (no constraint) | -- |

Sub-Class Encoding

Bits 4--7 of *(WORD*)(a3+198) encode the sub-class within the pipe:

| Value | Sub-class | Instruction category |
|---|---|---|
| 0x10 | Control flow | Branch, predicate, miscellaneous |
| 0x20 | Integer ALU | Conversion, barrier, integer ops |
| 0x30 | FP32 / SFU | Single-precision, half-precision |
| 0x40 | FP64 / Tensor | Double-precision wide, tensor core |

Pipe Flags Encoding

Bits 1--5 of *(BYTE*)(a3+199) encode sub-unit affinity:

| Value | Meaning |
|---|---|
| 0x02 | Narrow ALU sub-unit |
| 0x04 | ALU (integer / conversion) |
| 0x06 | Load/store or wide ALU |
| 0x08 | SFU / half-precision pipe |
| 0x0A | FP64 wide (double-precision) |
| 0x0C | Tensor core pipe |
| 0x3E | All flags set (default) |

Opcode-to-Pipe-Class Mapping

The complete switch covers 80+ Ori opcodes. Representative mappings:

| Ori opcode | Pipe class | Pipe | Sub-class | SASS instruction | Decision logic |
|---|---|---|---|---|---|
| 1 | 0x08 | -- | 0x10 | IMAD | Always |
| 2--7 (wide) | 0x03 | B (0x10000) | 0x30 | IMAD_WIDE, IADD3, etc. | sub_7D6780 = true |
| 2--7 (wide, v6=6) | 0x03 | C (0x18000) | 0x40 | LOP3 (wide, FP64) | Opcode 6, wide |
| 2--7 (narrow) | 0x0C | A (0x08000) | -- | IMAD, IADD3, etc. | Narrow, type != 19 |
| 2--7 (narrow, t=19) | 0x7B | -- | -- | IMAD (BF16/FP8 type) | Type 19 path |
| 8 (flag clear) | 0x33 | -- | -- | IABS (no guard) | Operand flag bit 0 |
| 8 (flag set) | 0x34 | -- | -- | IABS (guarded) | Operand flag bit 0 |
| 0x10 (flagged) | 0x68 | -- | -- | ATOM (flagged) | Operand bit 2 |
| 0x10 (mem=3) | 0x67 | -- | -- | ATOM (shared) | sub_7DFFC0 = 3 |
| 0x10 (mem=4) | 0x69 | -- | -- | ATOM (constant) | sub_7DFFC0 = 4 |
| 0x10 (other) | 0x66 | -- | -- | ATOM (global) | Default |
| 0x12 (no 0x400) | 0x3D | -- | -- | FADD (standard) | Operand bit 10 clear |
| 0x12 (0x400 set) | 0x78 | -- | -- | FADD (const-bank) | Operand bit 10 set |
| 0x17 (op1 reg6) | 0x37 | -- | -- | S2R (tensor reg, op1) | *(desc+64) = 6 |
| 0x17 (op2 reg6) | 0x36 | -- | -- | S2R (tensor reg, op2) | *(desc+64) = 6 |
| 0x17 (other) | 0x38 | -- | -- | S2R (standard) | Neither operand reg6 |
| 0x18 | 0x04 | A (0x08000) | 0x20 | FSETP | Always |
| 0x24 (wide) | 0x14 | B (0x10000) | 0x30 | PRMT (FP width) | sub_7D6780 = true |
| 0x24 (narrow) | 0x11 | B (0x10000) | 0x30 | PRMT (integer) | sub_7D6780 = false |
| 0x33 | 0x21 | A (0x08000) | 0x20 | IDP | Always; flags 0x06 |
| 0x3C (mem ops) | 0x2B--0x32 | -- | -- | STG variants | 6-way split on flags |
| 0x3E (mem ops) | 0x2D--0x2E | -- | -- | LDL variants | Flag / no-flag split |
| 0x42 | 0x5D | -- | -- | MUFU (SFU) | Always |
| 0x4D | 0x84--0x85 | B (0x10000) | 0x40 | WGMMA-class | Extended tensor fields |
| 0x4E (mem ops) | 0x2F--0x30 | -- | -- | LD (generic) | Flag / no-flag split |
| 0x66 | 0x09 | B (0x10000) | 0x30 | DEPBAR | Always; flags 0x08 |
| 0x82 / 130 (ext) | 0x17 | -- | -- | NANOTRAP (extended); HSET2 in ROT13 | sub_A9AB10 = true |
| 0x82 / 130 (ctrl) | 0x13 | all (0xF8000) | 0x10 | NANOTRAP (control); HSET2 in ROT13 | vtable+640 |
| 0xC9--0xCA (wide) | 0x07 | A (0x08000) | -- | DFMA, DADD (wide) | sub_7D6780 = true |
| 0xD1 | 0x05 | A (0x08000) | 0x20 | DFMA | Always |
| 0xD2 | 0x0A | A (0x08000) | 0x30 | DFMA variant | Sub-class 0x30, flag 0x04 |
| 0xF0 | 0x0F | A (0x08000) | -- | F2F | Flags 0x04 |
| 0x10E | 0x7E | B (0x10000) | -- | HMMA_16 | Flags 0x08 |
| 0x117 | 0x80 | B (0x10000) | 0x40 | HMMA_32 | Tensor pipe; flags 0x0C |
| 0x11A | 0x81 | B (0x10000) | 0x40 | IMMA | Tensor pipe |
| default | 0x88 | -- | -- | (unrecognized) | Sentinel |

Decision Axes

The function dispatches on three axes beyond the opcode:

  1. Data type width: sub_7D6780(*(a2+76)) returns true for wide types (FP64). Wide types route to pipe A or C with sub-class 0x30 or 0x40; narrow types route to pipe A with sub-class 0x20.

  2. Memory access classification: sub_7DFFC0(a2, code_obj) returns a memory space code (3 = shared, 4 = constant). Used for ATOM (case 0x10) to split into 4 pipe classes by memory space.

  3. Operand register class: *(descriptor+64) from the register descriptor. Class 6 (tensor/accumulator register file) triggers distinct pipe classes for S2R (case 0x17) and DFMA/DADD variants.

Additionally, two architectural gates control tensor instruction classes:

  • *(a1+25) flag and sub_1370F40 gate tensor-extended pipe classes. When disabled, tensor instructions fall through to class 0x141 (a sentinel).
  • vtable+3184 on the code object checks a feature gate for CALL instruction classification.

Memory Instruction Pipe Variants

Load/store instructions (cases 0x3C, 0x3E, 0x4E) receive a 6-way pipe class split based on two properties:

| Property | Test method |
|---|---|
| Same-source vs different-source | sub_91E7A0(a2, 0) vs sub_91E7A0(a2, 1) |
| Has flag operand | sub_91E860(code_obj, a2, i) returns 8 |

| Variant | STG (0x3C) | LDL (0x3E) | LD (0x4E) |
|---|---|---|---|
| Same-src, no flag | 0x31 | (n/a) | (n/a) |
| Same-src, flagged | 0x32 | (n/a) | (n/a) |
| Diff-src, no flag | 0x2B | 0x2D | 0x2F |
| Diff-src, flagged | 0x2C | 0x2E | 0x30 |

This fine-grained split allows the SASS encoder to select different stall counts and barrier patterns depending on whether the load/store has a predicate guard and whether the source address register is shared with another operand.
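
The class assignment implied by the variant table can be sketched as a small lookup. The base-class constants come straight from the table; the two boolean inputs model the `sub_91E7A0` / `sub_91E860` predicates, and the helper name is ours:

```python
# Hedged reconstruction of the 6-way load/store pipe-class split.
# Diff-src classes are consecutive (no-flag, flagged); only STG (0x3C)
# has the additional same-source pair 0x31/0x32 per the table above.

BASE = {0x3C: 0x2B, 0x3E: 0x2D, 0x4E: 0x2F}   # diff-src, no-flag class

def memop_pipe_class(opcode, same_source, has_flag):
    if same_source:
        if opcode != 0x3C:
            raise ValueError("same-source split observed only for STG (0x3C)")
        return 0x31 + (1 if has_flag else 0)
    return BASE[opcode] + (1 if has_flag else 0)
```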

Type-19 Special Path

When sub_7D6780 returns false (not wide) and *(a2+76) == 19, several instruction groups receive distinct pipe classes in the 0x7A--0x7D range:

| Ori opcode group | Standard class | Type-19 class | Likely type |
|---|---|---|---|
| 2--7 (narrow) | 0x0C | 0x7B | BF16 / FP8 |
| 0x6E--0x72 (narrow) | 0x0B | 0x7A | BF16 / FP8 |
| 0x8B--0x8C (narrow) | 0x0D | 0x7C | BF16 / FP8 |
| 0xC9--0xCA | 0x10/0x12 | 0x7D | BF16 / FP8 |

Type 19 likely corresponds to BF16 or FP8, which require different pipeline routing than standard FP16/FP32/FP64 types on Hopper and Blackwell architectures.

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_693BC0 | 22 lines | MemorySpaceClassify -- return memory space code | HIGH |
| sub_695530 | 606 lines | ComputeLatencies -- per-BB latency computation | HIGH |
| sub_704D30 | 14 KB | GetFunctionalUnit -- SASS opcode to FU mapping | HIGH |
| sub_73A1D0 | ~6 KB | LDSLatencyStats -- shared memory latency stats | HIGH |
| sub_73A7F0 | ~6 KB | LDGLatencyStats -- global memory latency stats | HIGH |
| sub_73ADF0 | 6.5 KB | XU64LatencyStats -- extended unit latency stats | HIGH |
| sub_73B360 | 28.7 KB | MacLoopSchedulingAnalytics -- latency hiding report | HIGH |
| sub_799860 | 2.9 KB | ClassifyInstructionLatency | HIGH |
| sub_89FBA0 | 85 KB | SetOpcodeLatencies -- per-opcode scheduling class | HIGH |
| sub_8B5400 | 14 KB | ScheduleForLatency -- latency-optimized scheduling | MEDIUM |
| sub_8B77C0 | 15 KB | DualIssueScheduler -- dual-issue scheduling engine | MEDIUM |
| sub_8BDC40 | 7.9 KB | DualIssuePairing -- instruction pair selection | MEDIUM |
| sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost | HIGH |
| sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy | HIGH |
| sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check | HIGH |
| sub_8CF5D0 | 3.5 KB | CheckDualIssueEligibility | HIGH |
| sub_8D3E20 | 2.1 KB | ComputeStallCycles -- required stall count | HIGH |
| sub_8D7760 | 41 KB | StallAndBarrierInsertion -- encode stalls/barriers | HIGH |
| sub_8E3AD0 | -- | CopyProfileEntries -- finalize HW table | MEDIUM |
| sub_8E4400 | 3.3 KB | InitHWProfile_Warp -- warp dispatch params | HIGH |
| sub_8E4920 | 6.9 KB | BuildScoreboardEntries -- scoreboard BST | HIGH |
| sub_8E4D80 | 15 lines | StringRefCleanup -- decref string in record copy | HIGH |
| sub_8E4F20 | ~1.5 KB | EmitWeightEntry -- supplementary weight record (type 'W') | HIGH |
| sub_8E5310 | ~1.5 KB | EmitVariantSection -- variant sub-records (type ';') | HIGH |
| sub_8E5530 | ~1.5 KB | EmitDimensionEntries -- dimension sub-records (type '1') | HIGH |
| sub_8E5CA0 | 20 KB | EmitScheduleOutput -- scheduling control words | HIGH |
| sub_8E6760 | 2.9 KB | EmitGroupBoundary -- group boundary marker | HIGH |
| sub_8E6B40 | 2.9 KB | EmitSchedEntry -- standard scheduling entry | HIGH |
| sub_8E6D40 | 2.9 KB | EmitBarrierEntry -- barrier/sync entry | HIGH |
| sub_8E6F20 | 2.9 KB | EmitWaitEntry -- wait dependency entry | HIGH |
| sub_8E7110 | 2.9 KB | EmitScoreboardEntry -- scoreboard entry | HIGH |
| sub_8E7300 | 3.3 KB | HWTable_sm70 -- Volta latency table | CERTAIN |
| sub_8E7540 | 2.9 KB | HWTable_sm72 -- Xavier latency table | CERTAIN |
| sub_8E7720 | 3.5 KB | HWTable_sm75 -- Turing latency table | CERTAIN |
| sub_8E7940 | 2.9 KB | HWTable_sm80_base -- Ampere base table | CERTAIN |
| sub_8E7B40 | 3.3 KB | HWTable_sm80 -- Ampere full table | CERTAIN |
| sub_8E7D80 | 4.4 KB | HWTable_sm86 -- GA10x table | CERTAIN |
| sub_8E8070 | 3.5 KB | HWTable_sm87 -- Orin table | CERTAIN |
| sub_8E8280 | 3.1 KB | HWTable_sm89 -- Ada Lovelace table | CERTAIN |
| sub_8E8480 | 5.2 KB | HWTable_sm90 -- Hopper table | CERTAIN |
| sub_8E8780 | 4.6 KB | HWTable_sm90a -- Hopper accelerated table | CERTAIN |
| sub_8E8A90 | 3.0 KB | HWTable_sm100 -- Blackwell DC table | CERTAIN |
| sub_8E8CB0 | 949 B | HWTable_sm100_short -- Blackwell supplementary | CERTAIN |
| sub_8E8DB0 | 1.7 KB | HWTable_sm103 -- Blackwell Ultra table | CERTAIN |
| sub_8E8F60 | 618 B | HWTable_sm103_short -- BU supplementary | CERTAIN |
| sub_8E9000 | 2.9 KB | HWTable_sm120 -- RTX 50xx table | CERTAIN |
| sub_8E92E0 | 5.5 KB | HWTable_sm120_ext -- RTX 50xx extended | CERTAIN |
| sub_8E97B0 | 8.8 KB | HWTable_universal -- fallback table | CERTAIN |
| sub_8E9DC0 | 4.8 KB | EmitLatencyEntry -- HW table entry helper | HIGH |
| sub_8EFA10 | 18 KB | EmitScheduleReport -- statistics output | HIGH |
| sub_8F0CD0 | 24 B | MapFUClassID -- (opcode, name) to class | HIGH |
| sub_8F1EB0 | 15 KB | EncodeScheduleWords -- SASS control word output | HIGH |
| sub_8F3130 | 1.0 KB | EncodeStallField | HIGH |
| sub_8F31F0 | 6.1 KB | EncodeBarrierField | HIGH |
| sub_8F3650 | 2.7 KB | EncodeYieldField | HIGH |
| sub_8F3860 | 3.0 KB | EncodeScoreboardField | HIGH |
| sub_8F4140 | 5.6 KB | EncodeFullControlWord | HIGH |
| sub_8F47E0 | ~50 B | DetectCutlass -- strstr for "cutlass" | CERTAIN |
| sub_A08910 | 39 lines | GetRegisterLatency -- operand cost query | HIGH |
| sub_A08A00 | 345 lines | ResourceModel -- 3-mode FU cost computation | HIGH |
| sub_A09530 | 91 lines | UpdateStallCycles -- per-instruction stall update | HIGH |
| sub_A9CDE0 | -- | IsHotMemory -- global/texture classification | HIGH |
| sub_A9CF90 | -- | IsColdMemory -- constant/shared classification | HIGH |
| sub_13710B0 | 7.1 KB | AssignPipeClass -- SASS-level pipe assignment | HIGH |
| sub_1370F40 | ~500 B | CheckTensorFeature -- gates tensor pipe classes | HIGH |
| sub_7D6780 | ~100 B | IsWideType -- true for FP64/wide types | HIGH |
| sub_7DFFC0 | ~200 B | ClassifyMemAccess -- 3=shared, 4=constant | HIGH |
| sub_7E3640 | ~100 B | GetCustomPipe -- 5-bit pipe sub-class | MEDIUM |
| sub_91E7A0 | ~100 B | GetSrcEncoding -- source operand encoding query | MEDIUM |
| sub_91E860 | ~100 B | GetOperandType -- operand type code | MEDIUM |
| sub_A9AB10 | ~100 B | NeedsExtEncoding -- extended encoding check | MEDIUM |

Cross-References

Scoreboards & Dependency Barriers

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The NVIDIA GPU hardware uses a software-managed scoreboard system for instruction-level hazard tracking. Unlike CPUs that detect dependencies in hardware, NVIDIA's warp schedulers rely on per-instruction metadata -- encoded in a control word -- to determine when an instruction's operands are available, when a warp should yield, and which dependency barriers to set or wait on. ptxas generates this metadata in three pipeline phases (114--116) that together produce the final scoreboard annotations embedded in the SASS binary.

| Item | Value |
|---|---|
| Phase 114 | FixUpTexDepBarAndSync -- texture dependency barrier fixup |
| Phase 115 | AdvancedScoreboardsAndOpexes -- full scoreboard generation (-O1+) |
| Phase 116 | ProcessO0WaitsAndSBs -- conservative scoreboard insertion (-O0) |
| Control word generator | sub_A36360 (52 KB) -- per-instruction control word encoder |
| Scheduling heuristic | sub_A23CF0 (54 KB) -- DAG list scheduler with dependency analysis |
| Instruction dispatcher | sub_85C890 -- opcode-based fast-path / slow-path router |
| Mercury opex pass | sub_6FFDC0 (66 KB) -- MercGenerateOpex, phase 120 |
| HW barrier limit | 6 dependency barriers per warp (hardware constraint) |

Control Word Format

Every SASS instruction carries scheduling metadata in a control word. On sm_70+ architectures, the control word is packed into a dedicated scheduling control instruction that precedes each group of 3 real instructions. The control word encodes stall counts, yield hints, dependency barrier set/wait operations, and source operand reuse flags.

Ori IR Control Word (Internal Representation)

Within ptxas, the control word is stored in the Ori IR instruction node at offsets +196 through +200. sub_A36360 generates the fields, and per-field encoder functions write individual bit ranges.

The internal representation uses wider fields than the final SASS encoding to allow the encoder to track additional state during scoreboard computation:

| Field | Bits | Range | Description |
|---|---|---|---|
| Stall count | 4 | 0--15 | Minimum cycles to wait before issuing this instruction |
| Yield flag | 1 | 0--1 | Hint to warp scheduler: yield execution to another warp |
| Write barrier index | 3 | 0--5 | Which barrier register this instruction's result writes to |
| Read barrier mask | 6 | 0--63 | Bitmask of barriers this instruction must wait for (reads) |
| Wait barrier mask | 6 | 0--63 | Bitmask of barriers this instruction clears upon completion |
| Reuse flags | 6 | 0--63 | Per-source-operand register reuse cache hints |

Total: 26 bits of scheduling metadata per instruction in the internal representation.

SASS Control Word (Binary Encoding)

In the final SASS binary, the control word is packed into 23 bits per instruction slot within a 128-bit scheduling control instruction. Three instruction slots share one control instruction, yielding a 4:3 instruction-to-encoding ratio.

128-bit scheduling control instruction:
  ┌─────────┬─────────┬─────────┬──────────────────┐
  │ Slot 0  │ Slot 1  │ Slot 2  │ Reserved / flags │
  │ 23 bits │ 23 bits │ 23 bits │    59 bits       │
  └─────────┴─────────┴─────────┴──────────────────┘

Per-slot 23-bit layout (sm_70+):
  bits [3:0]    Stall count (4 bits, values 0--15)
  bit  [4]      Yield flag (1 bit)
  bits [7:5]    Write barrier index (3 bits, values 0--5; 7 = none)
  bits [13:8]   Read barrier mask (6 bits, one-hot per barrier)
  bits [19:14]  Wait barrier mask (6 bits, one-hot per barrier)
  bits [22:20]  Reserved / extended flags (3 bits)

The reuse flags (6 bits per instruction) are encoded separately in the instruction word itself at architecture-defined bit positions, not in the scheduling control instruction.

Bit-Field Diagram

  22  21  20  19  18  17  16  15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0
 ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
 │  rsvd  │         wait mask         │        read mask          │  wr_bar   │ Y │  stall    │
 └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
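
Pack/unpack helpers for this 23-bit slot layout make the field boundaries testable. These follow the sm_70+ diagram above exactly; the function names are ours, and the real encoder is sub_A36360 with its per-field helpers:

```python
# Illustrative pack/unpack of one 23-bit control-word slot
# (sm_70+ layout: stall[3:0], yield[4], wr_bar[7:5], read[13:8],
# wait[19:14], reserved[22:20]).

def pack_slot(stall, yield_flag, wr_bar, read_mask, wait_mask, rsvd=0):
    assert 0 <= stall <= 15 and 0 <= wr_bar <= 7
    return (stall
            | (yield_flag & 1) << 4
            | wr_bar << 5
            | (read_mask & 0x3F) << 8
            | (wait_mask & 0x3F) << 14
            | (rsvd & 0x7) << 20)

def unpack_slot(word):
    return {
        "stall": word & 0xF,
        "yield": (word >> 4) & 1,
        "wr_bar": (word >> 5) & 0x7,       # 7 = no barrier
        "read_mask": (word >> 8) & 0x3F,
        "wait_mask": (word >> 14) & 0x3F,
    }
```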

Hardware Dependency Barrier Model

NVIDIA GPUs (sm_70+) provide 6 dependency barrier registers per warp, numbered 0 through 5. These are finite hardware resources managed entirely by software (ptxas). The barrier mechanism works as follows:

  1. Set barrier (write barrier): When an instruction with long latency (e.g., global memory load, texture fetch) is issued, the compiler assigns it a barrier index from the pool of 6. The hardware marks that barrier as "pending" and associates it with the instruction's completion.

  2. Wait on barrier (read barrier / wait mask): When a subsequent instruction needs the result of the long-latency operation, the compiler sets the corresponding bit in the wait mask. The warp scheduler stalls the instruction until the barrier clears.

  3. Barrier release: When the long-latency operation completes, the hardware automatically clears the associated barrier register, allowing waiting instructions to proceed.

The key constraint is the hardware limit of 6 simultaneous barriers. If a basic block has more than 6 outstanding long-latency operations, the compiler must either:

  • Reuse a barrier (wait for an earlier operation to complete before reassigning its barrier)
  • Insert explicit stall cycles to serialize operations
  • Use the DEPBAR instruction to manage barrier state programmatically

Stall Count vs. Dependency Barriers

The stall count and dependency barriers serve complementary purposes:

| Mechanism | Latency Range | Use Case |
|---|---|---|
| Stall count (4 bits) | 0--15 cycles | Short-latency operations: ALU (4--6 cycles), shared memory (20--30 cycles when stall is sufficient) |
| Dependency barriers | Arbitrary | Long-latency operations: global memory (200--800 cycles), texture (200--400 cycles), where stall count is insufficient |

For operations with latency <= 15 cycles, the stall count alone suffices. For longer latencies, a dependency barrier must be used because 4 bits cannot encode delays beyond 15 cycles. The yield flag provides an additional hint: when set, it tells the warp scheduler that this warp is about to stall and should be descheduled in favor of another ready warp.
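
That decision can be stated as a tiny selector. The 15-cycle cap comes from the 4-bit field; the yield threshold is an assumption borrowed from the encoder notes later in this chapter ("typically 4+ cycles"), not a recovered constant:

```python
# Sketch of the stall-vs-barrier choice: latencies that fit in the
# 4-bit stall field are absorbed by stalling; longer ones need one of
# the 6 dependency barriers. YIELD_THRESHOLD is an assumed value.

STALL_MAX = 15
YIELD_THRESHOLD = 4

def hazard_strategy(producer_latency):
    """Return how a consumer should wait on a producer of this latency."""
    if producer_latency <= STALL_MAX:
        return {"mechanism": "stall",
                "stall": producer_latency,
                "yield": producer_latency > YIELD_THRESHOLD}
    return {"mechanism": "barrier", "stall": 0, "yield": True}
```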

Phase 114: FixUpTexDepBarAndSync

Phase 114 runs a texture-specific fixup pass on dependency barriers and synchronization instructions. It operates through a double-indirect vtable dispatch: the architecture backend object contains a secondary scheduling/scoreboard subsystem object (at offset +16), which provides the actual implementation through its vtable at offset +0x70.

Purpose

Texture operations have complex dependency patterns that the general scoreboard pass may not handle optimally:

  • Texture fetches have variable latency depending on cache hit rates
  • Texture coordinates and sampler state create additional dependencies
  • The texture pipeline has its own internal buffering that interacts with scoreboard barriers

Phase 114 patches up the dependency barrier assignments made by the scheduler to account for these texture-specific requirements. It may:

  • Reassign barrier indices for texture operations to avoid conflicts with non-texture barriers
  • Insert additional DEPBAR instructions where texture dependencies require explicit barrier management
  • Adjust synchronization instructions that interact with texture pipeline state

Implementation

The phase is architecture-specific. In the default vtable, it maps to a nullsub (nullsub_43 at 0x680170), indicating a no-op for architectures that handle texture dependencies entirely in the general scoreboard pass. Architecture backends that need texture-specific fixup override this vtable entry with their implementation.

Dispatch path:
  PhaseManager::execute(phase=114)
    → arch_backend->vtable[phase_114]()
      → secondary_object = *(arch_backend + 16)
        → (*(secondary_object->vtable + 0x70))(secondary_object, func)

Phase 115: AdvancedScoreboardsAndOpexes

Phase 115 is the main scoreboard generation pass. It is an AdvancedPhase hook -- a no-op in the default vtable, activated only when the architecture backend overrides it. At -O1 and above, this phase runs the full dependency analysis and scoreboard assignment. At -O0, it is skipped entirely (phase 116 handles the conservative path instead).

Architecture Dispatch

The phase entry point dispatches through the architecture backend vtable to sub_85C890, which acts as an opcode-aware router: depending on instruction type, it either handles the instruction via a fast path (direct barrier assignment for known patterns) or falls through to the full DAG list scheduler sub_A23CF0.

Fast Path (sub_85C890)

sub_85C890 classifies instructions by their masked opcode (opcode & 0xFFFFCFFF) and routes them:

Handled by fast path (direct barrier assignment without full DAG analysis):

  • Opcodes 60, 62, 78, 79: Texture/surface operations -- processed via sub_A22B40 (write barrier assignment) after checking architecture capability at vtable+1928
  • Opcode 4 with operand types (7, 6): Specific ALU patterns with predicate operands -- dual operand processing via sub_A220A0
  • Opcode 111 with operand types (7, 7, 6): Triple-source patterns -- processed via triple sub_A220A0 calls
  • Opcodes 120, 121: GMMA/tensor operations -- processed via sub_A220A0 + sub_A22B40 with variable operand counts
  • Opcodes 126--128: Complex operations with architecture-specific operand counts (2--4 source operands)
  • Opcodes 195, 270, 280, 281: Memory operations with specific addressing modes
  • Opcodes 350, 351: Extended operations with operand subtype 11--12

Fall-through to slow path (full DAG scheduler):

  • All other opcodes
  • Fast-path opcodes that fail capability checks (vtable+1928 returns false)
  • Instructions with the 0x1000 flag set (bit 12 of opcode word) -- handled via sub_A227F0 first, then fall through

The fast-path check at vtable+1928 tests (*(_BYTE *)(a1 + 1090) & 4) != 0, which corresponds to an architecture feature flag controlling whether the backend supports direct scoreboard assignment for specific instruction classes.

Slow Path: DAG List Scheduler (sub_A23CF0, 54 KB)

When the fast path cannot handle an instruction, sub_A23CF0 performs full dependency-driven scoreboard assignment. This function takes 14 parameters including floating-point latency weights and throughput factors.

The scheduler:

  1. Classifies instruction dependencies: Iterates operands, extracting register IDs from the operand descriptors at instr+84. For each operand, looks up the register object via *(*(func+88) + 8 * (operand_desc & 0xFFFFFF)) and checks the register type at offset +64.

  2. Walks the def-use chain: For each source operand, traces back to the defining instruction. Determines the dependency distance (in instructions and cycles) from each producer to the current consumer.

  3. Assigns barrier or stall: Based on the dependency distance and the producer's latency:

    • If the producer's latency fits within the stall count range (0--15), assigns a stall count
    • If the latency exceeds 15 cycles, allocates a dependency barrier from the pool of 6
    • If all 6 barriers are in use, finds the oldest barrier, inserts a wait for it, and recycles it
  4. Handles instruction-specific patterns: Contains a large switch on opcode for architecture-specific scheduling decisions. Opcodes handled specially include:

    • Opcodes 2, 3, 20, 21, 24, 28, 60, 61, 62, 67, 78, 79, 98, 100, 110, 120, 126, 139, 141, 162, 164, 201, 209, 210, 213, 214, 272, 273, 311: Direct operand processing with known dependency patterns
    • Opcodes 5, 6, 7, 10, 11, 36, 63, 80, 106, 108, 112, 114: Memory/load operations with variable-latency handling
  5. Produces control word fields: After analysis, the function sets the barrier assignment, stall count, and wait mask for the instruction.

Key Support Functions

| Address | Size | Purpose |
|---|---|---|
| sub_A220A0 | 9 KB | Instruction attribute / property query -- fills a scheduling descriptor for a specific operand |
| sub_A22B40 | -- | Write barrier assignment for a specific operand -- determines which barrier index to assign |
| sub_A22BC0 | -- | Read barrier dependency -- sets up wait mask for operand |
| sub_A22CE0 | -- | Instruction classification -- determines if instruction needs scoreboard processing |
| sub_A231E0 | -- | Scheduling score computation -- determines if full DAG analysis is needed |
| sub_A227F0 | -- | Pre-processing for flagged instructions (bit 12 set in opcode) |
| sub_A22D00 | -- | Dependency distance computation |

Phase 116: ProcessO0WaitsAndSBs

Phase 116 implements the conservative scoreboard insertion path for -O0 (no optimization) builds. At -O0, phase 115 is a no-op, and phase 116 takes over with a simple, safe strategy.

Conservative Strategy

The -O0 path does not perform dependency analysis. Instead, it applies maximum-safety defaults:

function ProcessO0WaitsAndSBs(func):
    for each bb in func.basic_blocks:
        for each instr in bb.instructions:
            // Set maximum stall count (15 cycles)
            instr.stall_count = 15

            // Wait on all active barriers before every instruction
            instr.wait_mask = 0x3F    // all 6 barriers

            // No barrier assignment (no long-latency tracking)
            instr.write_barrier = 7   // 7 = none

            // No read barriers
            instr.read_mask = 0

            // Yield after every instruction
            instr.yield = 1

This produces correct but extremely slow code: every instruction waits the maximum time and clears all barriers, eliminating any possibility of instruction-level parallelism. The primary use case is debugging, where correctness matters more than performance.

At -O1 and above, phase 115 runs the full analysis, and phase 116's isNoOp() returns true, skipping execution entirely.
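
An executable restatement of those -O0 defaults, with plain dicts standing in for Ori IR instruction nodes:

```python
# The -O0 conservative path as a data transformation: maximum stall,
# wait on all six barriers, no barrier of its own, yield after every
# instruction. Field names mirror the pseudocode above.

O0_DEFAULTS = {
    "stall_count": 15,   # maximum 4-bit stall
    "wait_mask": 0x3F,   # all 6 barriers
    "write_barrier": 7,  # 7 = none
    "read_mask": 0,
    "yield": 1,
}

def process_o0_waits_and_sbs(instructions):
    for instr in instructions:
        instr.update(O0_DEFAULTS)
    return instructions
```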

Control Word Generation Pipeline (sub_A36360)

sub_A36360 (52 KB) is the master control word generator, called via vtable for each instruction in the scheduled order. It orchestrates six per-field encoder functions to produce the complete control word.

Dispatch Architecture

The function takes the scheduling context (a1), the instruction node (a2), and several SIMD/float parameters encoding latency weights and architecture-specific constants. It begins by:

  1. Loading the function context from *(a1+8) and the SM backend from *(func+1584) (the sm_backend field; provides hardware latency profiles)
  2. Calling sub_7E1750 to classify the instruction
  3. Extracting the opcode from *(a2+72) with the standard mask (BYTE1 &= 0xCF)
  4. Switching on the masked opcode to determine the encoding strategy

Per-Opcode Dispatch

The master switch at the entry of sub_A36360 routes instructions by opcode class:

| Opcode Class | Handler | Description |
|---|---|---|
| 2, 3, 5, 7 | Inline (LABEL_18 path) | Standard ALU/memory with full barrier analysis. Checks operand subtype 9--10 and architecture feature at *(sm_backend+1037) & 0x20. Calls sub_A32C70 for operand analysis, then sub_A31040 for field encoding. |
| 6 | sub_A34B70 | Wait barrier mask encoding for specific memory operations |
| 10, 149, 151, 290 | Inline (large block) | Extended operations with special barrier handling. Calls sub_A32A20 for multi-operand setup, then processes register-type checks at offset +64 (type==5 triggers additional barrier logic). |
| All others | Per-field encoder chain | Default path through the six encoder functions |

Per-Field Encoder Chain

For the default path, sub_A36360 calls these encoders in sequence:

function GenerateControlWord(ctx, instr):
    // 1. Initialize operand analysis
    sub_7E19E0(&operand_info, ctx.func, instr)
    barrier_type = sub_7E53D0(instr.operand_subtype)

    // 2. Analyze source/dest operand dependencies
    sub_A32C70(&ctx, instr, src_idx, dst_idx,
               &dep_info, &barrier_info)

    // 3. Encode all control word fields
    sub_A31040(&ctx, &dep_info, &barrier_info,
               &src_desc, &dst_desc, &flags,
               barrier_type, ...)

    // 4. Finalize: set register space = 7 (done)
    *(ctx.func + 240) = 7

    // 5. Emit the control word
    sub_9253C0(ctx.func, instr, 1)

Encoder Function Details

| Address | Size | Function | Field Encoded |
|---|---|---|---|
| sub_A333A0 | 3 KB | EncodeStallAndYield | 4-bit stall count + 1-bit yield flag. Called twice from sub_A36360. Computes the minimum stall cycles from the dependency distance to the nearest consumer. Sets yield=1 when stall > threshold (architecture-dependent, typically 4+ cycles). |
| sub_A33660 | 7 KB | EncodeReadBarrierMask | 6-bit read barrier mask. Determines which barrier registers this instruction must wait for before reading its source operands. Calls sub_935720 to query register-barrier associations. |
| sub_A342E0 | 9 KB | EncodeWriteBarrierIndex | 3-bit write barrier index. Allocates a barrier from the pool of 6 for this instruction's result. Calls sub_934630 to find a free barrier; if none available, forces a wait on the oldest active barrier via sub_9253C0. |
| sub_A34B70 | 10 KB | EncodeWaitBarrierMask | 6-bit wait barrier mask. Determines which barriers are cleared when this instruction completes. |
| sub_A356A0 | 12 KB | EncodeScoreboardFields | Combined scoreboard field encoder. Orchestrates read/write barrier assignment with dependency distance tracking via sub_A318F0 and conflict detection via sub_A31390. |
| sub_A31F80 | 7 KB | ComputeReuseFlags | 6-bit reuse flags. Determines which source register values should be cached in the operand reuse buffer. Calls sub_7DB310 for register bank analysis and sub_91BF30 for reuse eligibility checking. |

Supporting Analysis Functions

| Address | Size | Purpose |
|---|---|---|
| sub_A318F0 | 4 KB | Barrier dependency distance computation -- measures the instruction distance between a barrier set and its corresponding wait |
| sub_A31390 | 4 KB | Barrier set intersection / conflict detection -- checks whether two instructions' barrier usage conflicts |
| sub_A32C70 | -- | Source/destination operand dependency analysis -- identifies which operands create dependencies |
| sub_A31040 | -- | Master field encoding dispatcher -- coordinates all six per-field encoders |

Dependency Barrier Allocation Algorithm

The barrier allocator manages the 6 hardware barrier registers as a resource pool. The algorithm must satisfy three constraints:

  1. No two simultaneously-live long-latency operations share a barrier index
  2. Every consumer instruction waits on the correct barrier before reading its operand
  3. Barrier reuse is maximized to avoid unnecessary stalls

Allocation State Machine

State per barrier register (6 entries):
  barrier[i].status     ∈ {FREE, PENDING, COMPLETED}
  barrier[i].producer   = instruction pointer (or NULL)
  barrier[i].set_cycle  = cycle when barrier was assigned
  barrier[i].consumers  = list of waiting instructions

State transitions:
  FREE → PENDING:     Barrier allocated to a long-latency producer
  PENDING → COMPLETED: Hardware signals completion (implicit)
  COMPLETED → FREE:    All consumers have executed their wait
  PENDING → FREE:      Forced recycle (all barriers in use, oldest evicted)

Allocation Pseudocode

function AllocateBarrier(ctx, producer_instr):
    // 1. Try to find a free barrier
    for i in 0..5:
        if barrier[i].status == FREE:
            barrier[i].status = PENDING
            barrier[i].producer = producer_instr
            barrier[i].set_cycle = current_cycle
            return i

    // 2. No free barrier: recycle the oldest
    oldest = argmin(barrier[i].set_cycle for i in 0..5)

    // 3. Force all consumers of oldest to wait NOW
    InsertWaitForBarrier(ctx, oldest)

    // 4. Recycle
    barrier[oldest].status = PENDING
    barrier[oldest].producer = producer_instr
    barrier[oldest].set_cycle = current_cycle
    return oldest

function AssignWaitMask(ctx, consumer_instr):
    wait_mask = 0
    for each source_operand in consumer_instr:
        producer = FindProducer(source_operand)
        if producer.barrier_index != NONE:
            if producer.latency > stall_count_range:   // latency too long to cover with a stall count alone
                wait_mask |= (1 << producer.barrier_index)
    consumer_instr.wait_mask = wait_mask
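The allocator and its oldest-first eviction can be sketched as a runnable model. This follows the pseudocode above; the class names, the `forced_waits` log, and the string producer tags are mine, not symbols recovered from the binary.

```python
FREE, PENDING, COMPLETED = 0, 1, 2

class Barrier:
    def __init__(self):
        self.status = FREE
        self.producer = None
        self.set_cycle = 0

class BarrierPool:
    """Model of the 6-entry hardware barrier pool with oldest-first eviction."""
    def __init__(self, count=6):
        self.barriers = [Barrier() for _ in range(count)]
        self.forced_waits = []            # barriers recycled via forced wait

    def allocate(self, producer, cycle):
        # 1. Prefer a free barrier.
        for i, b in enumerate(self.barriers):
            if b.status == FREE:
                b.status, b.producer, b.set_cycle = PENDING, producer, cycle
                return i
        # 2. All busy: evict the one with the earliest set_cycle.
        oldest = min(range(len(self.barriers)),
                     key=lambda i: self.barriers[i].set_cycle)
        # 3. Consumers of the evicted barrier must wait now (InsertWaitForBarrier).
        self.forced_waits.append(oldest)
        # 4. Recycle.
        b = self.barriers[oldest]
        b.status, b.producer, b.set_cycle = PENDING, producer, cycle
        return oldest

pool = BarrierPool()
ids = [pool.allocate(f"op{i}", cycle=i) for i in range(7)]
# The seventh allocation recycles barrier 0, the oldest.
```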

Barrier Reuse Heuristics

The allocator uses several heuristics to maximize barrier reuse:

  1. Oldest-first eviction: When all 6 barriers are occupied, the oldest (earliest set_cycle) is evicted. This maximizes the chance that the evicted operation has already completed.

  2. Type affinity: Texture operations preferentially reuse barriers previously assigned to other texture operations, because texture latencies tend to be similar and the texture pipeline may batch completions.

  3. Distance-based freeing: A barrier is marked free without an explicit wait if the instruction distance from the producer exceeds the architecture's maximum latency for that operation class. The sub_A318F0 function computes this distance.

  4. Conflict avoidance: sub_A31390 checks whether a proposed barrier assignment would conflict with an existing barrier that has not yet been waited on. If a conflict is detected, the allocator tries a different barrier index.

Scoreboard Tracking State

The scoreboard tracking state is maintained in the scheduling context object. Key fields:

Offset     Type   Content
ctx+232    QWORD  Current instruction pointer
ctx+240    DWORD  Current register space ID (7 = done)
ctx+244    QWORD  Current operand descriptor pair
ctx+248    DWORD  Current write barrier index
ctx+252    DWORD  Current barrier assignment type (1 or 2)
ctx+264    DWORD  Current instruction sequence number
ctx+1040   BYTE   Architecture feature flags (bit 5 = texture scoreboard, bit 4 = extended barriers)
ctx+1090   BYTE   Capability flags (bit 2 = fast-path scoreboard, bit 4 = extended operand tracking)

The *(ctx+1040) & 0x20 flag controls whether the architecture supports texture-specific scoreboard handling. The *(ctx+1090) & 4 flag enables the fast-path scoreboard assignment for known instruction patterns.
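Written out, the two feature tests are byte loads followed by single-bit masks. The accessor names below are invented; the offsets and masks are the ones quoted above.

```python
def has_texture_scoreboard(ctx):
    # *(ctx+1040) & 0x20 -- architecture supports texture-specific scoreboards
    return bool(ctx[1040] & 0x20)

def fast_path_scoreboard(ctx):
    # *(ctx+1090) & 0x04 -- fast-path scoreboard assignment enabled
    return bool(ctx[1090] & 0x04)

ctx = bytearray(2048)          # stand-in for the scheduling context object
ctx[1040] = 0x20
print(has_texture_scoreboard(ctx), fast_path_scoreboard(ctx))   # True False
```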

Scoreboard Object Layout (952 bytes)

The scoreboard object is allocated by sub_8D0640 (ScheduleInstructions) when the architecture feature flag *(func+1385) & 4 is set. The 952-byte allocation goes through the function context's vtable-dispatched allocator at *(func+16), and the constructor sub_69A1A0 initializes it. The pointer is stored at func+1864.

The object has three regions: 35 reference-counted counter slots, a linked-list/tree node for active barrier tracking, and 14 barrier tracking records.

Region 1: Counter Slots (offsets +0 to +279)

35 QWORD pointer slots, each pointing to an externally-allocated 24-byte counter node. Each counter node has the layout:

Counter node (24 bytes):
  +0   QWORD   refcount (initialized to 1)
  +8   QWORD   value (initialized to 0)
  +16  QWORD   allocator back-reference

The 35 slots are organized as barrier state / stall counter pairs for each register class, plus additional scoreboard tracking counters:

Offset       Slot    Purpose
+0           0       R (general-purpose register) barrier state
+8           1       R stall counter
+16          2       P (predicate register) barrier state
+24          3       P stall counter
+32          4       UR (uniform register) barrier state
+40          5       UR stall counter
+48          6       UP (uniform predicate) barrier state
+56          7       UP stall counter
+64          8       B (barrier register) barrier state
+72          9       B stall counter
+80          10      Arch-specific class 5 barrier state
+88          11      Arch-specific class 5 stall counter
+96          12      Arch-specific class 6 barrier state
+104         13      Arch-specific class 6 stall counter
+112         14      Arch-specific class 7 barrier state
+120         15      Arch-specific class 7 stall counter
+128         16      Arch-specific class 8 barrier state
+136         17      Arch-specific class 8 stall counter
+144--+272   18--34  Additional scoreboard tracking counters (17 slots)

Total: 35 slots x 8 bytes = 280 bytes.
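The slot table is pure arithmetic: slot N lives at offset N*8, and the first 18 slots form (barrier state, stall counter) pairs, one pair per register class. A quick check of that layout (helper names are mine):

```python
REGISTER_CLASSES = ["R", "P", "UR", "UP", "B",
                    "class5", "class6", "class7", "class8"]

def slot_offset(slot):
    return slot * 8                        # 35 QWORD slots at +0 .. +272

def class_slots(class_index):
    # (barrier-state slot, stall-counter slot) for one register class
    return 2 * class_index, 2 * class_index + 1

assert slot_offset(34) == 272              # start of the last slot
assert 35 * 8 == 280                       # total region size
assert class_slots(0) == (0, 1)            # R pair
assert class_slots(4) == (8, 9)            # B pair
```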

Region 2: Linked List / Tree Node (offsets +280 to +391)

This region contains an intrusive data structure (linked list or red-black tree node) used for tracking active barrier assignments. It cross-references counter slots from Region 1.

Offset  Size  Type   Init         Purpose
+280    8     ptr    from a2+16   Allocator reference (arena/memory pool)
+288    8     QWORD  0            List sentinel / null node
+296    8     ptr    &self+304    List head pointer
+304    8     ptr    &self+288    Forward link (points to sentinel)
+312    8     QWORD  0            Node data
+320    8     ptr    &self+288    Backward link (points to sentinel)
+328    8     ptr    &self+304    Secondary forward link
+336    4     DWORD  2            Node type / RB-tree color (2 = initial)
+344    8     ptr    slot 1 ref   Cross-reference to counter slot 1 (R stall counter); refcount incremented
+352    8     QWORD  0            Pending producer instruction pointer
+360    8     QWORD  0            Set cycle timestamp
+368    8     QWORD  0            Consumer list head
+376    4     DWORD  0            Active flag / barrier index
+384    8     ptr    slot 19 ref  Cross-reference to counter slot 19

Total: 112 bytes.

Region 3: Barrier Tracking Records (offsets +392 to +951)

14 identical 40-byte records, each tracking one dependency barrier register. The first 6 records correspond to the 6 hardware dependency barriers per warp. Records 6--12 are extended/spare slots for overflow or future barrier model expansion (sm_100+). Record 13 uses a different initialization path (sub_6996C0 instead of sub_69A120), suggesting it serves as a sentinel or special-purpose record.

Per-record layout (40 bytes):

Offset (within record)  Size  Type   Init         Purpose
+0                      8     QWORD  0            Barrier status: FREE (0), PENDING, COMPLETED
+8                      8     QWORD  0            Producer instruction pointer (or NULL when free)
+16                     8     QWORD  0            Set cycle / consumer tracking state
+24                     4     DWORD  0            Barrier flags / consumer count
+28                     4     --     --           (padding)
+32                     8     ptr    slot 19 ref  Cross-reference to counter slot 19 (allocator back-pointer)

Record index to offset mapping:

Record  Offset  Hardware Barrier
0       +392    Dependency barrier 0
1       +432    Dependency barrier 1
2       +472    Dependency barrier 2
3       +512    Dependency barrier 3
4       +552    Dependency barrier 4
5       +592    Dependency barrier 5
6       +632    Extended / spare 0
7       +672    Extended / spare 1
8       +712    Extended / spare 2
9       +752    Extended / spare 3
10      +792    Extended / spare 4
11      +832    Extended / spare 5
12      +872    Extended / spare 6
13      +912    Sentinel record (different init via sub_6996C0)

Tail pointer:

Offset  Size  Type  Purpose
+944    8     ptr   Counter reference for sentinel record (from slot 25)

Total: 14 records x 40 bytes = 560 bytes, spanning offsets +392 through +951 and completing the 952-byte object. The tail pointer at +944 is record 13's counter-reference field (+912 + 32 = +944) rather than a separate trailing allocation; consistent with the sentinel's distinct initializer, it references slot 25 instead of slot 19.
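The index-to-offset mapping can be checked mechanically. Constants come from the tables above; the function name is mine.

```python
# 14 records of 40 bytes starting at +392 inside the 952-byte object.
RECORD_BASE, RECORD_SIZE, NUM_RECORDS = 392, 40, 14

def record_offset(index):
    assert 0 <= index < NUM_RECORDS
    return RECORD_BASE + index * RECORD_SIZE

offsets = [record_offset(i) for i in range(NUM_RECORDS)]
assert offsets[0] == 392 and offsets[5] == 592      # six hardware barriers
assert record_offset(13) == 912                     # sentinel record
assert record_offset(13) + 32 == 944                # its slot-ref field = "tail"
assert RECORD_BASE + NUM_RECORDS * RECORD_SIZE == 952   # object end
```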

Memory Layout Diagram

ScoreboardObject (952 bytes)
+--------+--------+--------+--------+--------+--------+--------+--------+
|+0 slot0 (R bar) |+8 slot1 (R stl) |+16 slot2 (P bar)|+24 slot3 (P stl)|
+--------+--------+--------+--------+--------+--------+--------+--------+
|+32 slot4 (UR)   |+40 slot5 (UR)   |+48 slot6 (UP)   |+56 slot7 (UP)   |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+64 slot8 (B)    |+72 slot9 (B)    |+80..+272 slots 10--34             |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+280 allocRef    |+288 sentinel    |+296 listHead    |+304 fwdLink     |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+312 nodeData    |+320 bwdLink     |+328 secFwd      |+336 type |      |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+344 slotRef     |+352 producer    |+360 setCycle    |+368 consumers   |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+376 flags|      |+384 slotRef19  |                                    |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+392  barrierRecord[0]  (40B)      |+432  barrierRecord[1]  (40B)      |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+472  barrierRecord[2]             |+512  barrierRecord[3]             |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+552  barrierRecord[4]             |+592  barrierRecord[5]             |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+632..+912  barrierRecords[6..13]  (8 extended / spare records)        |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+944 tailPtr     |+952 END                                             |
+--------+--------+--------+--------+--------+--------+--------+--------+

Design Notes

The counter nodes use reference counting (initial refcount = 1, incremented when cross-referenced from Region 2 or Region 3). This allows counter state to be shared across multiple tracking contexts -- for example, when the pre-scheduling and post-scheduling passes need to track the same barrier state.

The 14 barrier records provide 6 slots for the hardware barrier registers plus 8 extended slots. Current architectures use exactly 6 dependency barriers per warp, but the extended slots provide headroom for the expanded barrier model hinted at in sm_100+ configurations (see *(ctx+1040) & 0x10 extended barriers flag).

Stall Count Computation

The stall count is the minimum number of cycles the warp scheduler must wait before issuing the instruction. It is computed from the dependency distance to the instruction's producers.

Algorithm

function ComputeStallCount(ctx, instr):
    max_stall = 0
    for each source_operand in instr:
        producer = FindProducer(source_operand)
        if producer == NULL:
            continue
        distance = instr.cycle - producer.cycle
        latency = GetLatency(producer.opcode, ctx.sm_backend)
        required_wait = latency - distance
        if required_wait > 0:
            max_stall = max(max_stall, required_wait)

    // Clamp to 4-bit range
    stall = min(max_stall, 15)

    // Apply architecture minimum (knob 741, default 3)
    stall = max(stall, ctx.min_stall)

    // Cap at architecture maximum (knob 805/806, max 16)
    stall = min(stall, ctx.max_stall)

    return stall

The stall count computation uses the SM backend at *(func+1584) (sm_backend) to look up per-opcode latencies from the architecture's hardware latency tables; see Latency Model.
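The clamp ordering matters: the raw requirement is first clamped to the 4-bit range, then floored by knob 741, then capped again. A runnable sketch, with a made-up latency table standing in for the per-architecture tables:

```python
def compute_stall(instr_cycle, producers, latency_of, min_stall=3, max_stall=15):
    """producers: list of {"opcode": str, "cycle": int} source-operand producers."""
    max_req = 0
    for p in producers:
        distance = instr_cycle - p["cycle"]          # cycles already elapsed
        required = latency_of[p["opcode"]] - distance
        max_req = max(max_req, required)
    stall = min(max_req, 15)        # clamp to 4-bit range
    stall = max(stall, min_stall)   # architecture minimum (knob 741, default 3)
    return min(stall, max_stall)    # architecture cap (knob 805/806)

lat = {"FFMA": 4, "IMAD": 5}        # illustrative latencies, not recovered values
# Producer issued 2 cycles earlier with 5-cycle latency -> 3 more stall cycles.
print(compute_stall(10, [{"opcode": "IMAD", "cycle": 8}], lat))   # -> 3
```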

Yield Flag

The yield flag is computed by sub_A333A0 alongside the stall count. The decision:

function ComputeYield(ctx, instr, stall):
    if stall >= yield_threshold:     // arch-dependent, typically 4
        return 1                      // long stall: yield to another warp
    if instr is branch/exit:
        return 1                      // control flow: always yield
    if instr.is_last_in_bb and next_bb.has_barrier:
        return 1                      // barrier boundary: yield
    return 0

The yield threshold is read from the SM backend's latency table and varies by architecture. On sm_80+ it is typically 4 cycles.
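The same decision written as a runnable sketch (the threshold default of 4 follows the text; the parameter names are mine):

```python
def compute_yield(stall, is_branch_or_exit=False, last_in_bb=False,
                  next_bb_has_barrier=False, yield_threshold=4):
    if stall >= yield_threshold:            # long stall: switch to another warp
        return 1
    if is_branch_or_exit:                   # control flow: always yield
        return 1
    if last_in_bb and next_bb_has_barrier:  # barrier boundary: yield
        return 1
    return 0

print(compute_yield(5), compute_yield(2),
      compute_yield(2, is_branch_or_exit=True))   # 1 0 1
```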

Mercury Opex Path (Phase 120)

In addition to the Ori IR scoreboard passes (phases 114--116), the Mercury backend has its own scoreboard generation pass: MercGenerateOpex (phase 120, sub_703480 / sub_6FFDC0).

This pass runs after Mercury encode/decode (phase 117) and instruction expansion (phase 118), operating on the Mercury intermediate representation rather than Ori IR. It generates:

  • DEPBAR instructions for explicit barrier management
  • Scoreboard wait annotations
  • Stall count adjustments for expanded instruction sequences
  • Synchronization barriers for cross-warp dependencies

The Mercury opex pass and the Ori scoreboard passes serve different purposes:

  • Phases 114--116 generate scoreboard metadata at the Ori IR level, before Mercury encoding
  • Phase 120 generates additional scoreboard metadata for instructions introduced during Mercury expansion (pseudo-instruction expansion, WAR hazard insertion)
  • The WAR pass must run twice (phases 119 and 121) because opex introduces new instructions that create additional write-after-read hazards

Scheduling Output Encoding

After the control word fields are computed, the scheduling output pipeline (0x8F1EB0--0x8FDD60, ~57 KB) encodes them into the final SASS binary format.

Encoding Pipeline

Address     Size    Function               Purpose
sub_8F1EB0  15 KB   EncodeScheduleWords    Main scheduling output encoder -- iterates all instructions and produces control words
sub_8F3130  1.0 KB  EncodeStallField       Packs 4-bit stall count into control word
sub_8F31F0  6.1 KB  EncodeBarrierField     Packs barrier set/wait fields with architecture-specific layout
sub_8F3650  2.7 KB  EncodeYieldField       Packs yield flag
sub_8F3860  3.0 KB  EncodeScoreboardField  Packs scoreboard dependencies
sub_8F3AB0  5.0 KB  EncodeDependencyField  Packs inter-instruction dependency metadata
sub_8F3DE0  1.3 KB  EncodeControlField     Packs control flags
sub_8F3EA0  2.1 KB  ValidateEncoding       Checks encoded control word for consistency
sub_8F3FE0  1.7 KB  EncodeWaitField        Packs wait mask
sub_8F4140  5.6 KB  EncodeFullControlWord  Combines all fields into the final 23-bit encoding

Emission

sub_8F4510 (EmitControlWordForInstr) writes the packed control word into the output buffer. sub_8F4820 (EmitControlBlock) constructs the complete 128-bit scheduling control instruction from three consecutive instruction slots.

Scoreboard Entry Construction

sub_8E4920 (6.9 KB) constructs a balanced BST (red-black tree) of scoreboard entries during schedule output. Each entry contains:

  • Instruction pointer
  • 16-bit scoreboard register ID
  • 16-bit dependency type

The tree is used by the verification pass to check that barrier assignments are consistent across the instruction stream.

Verification

Seven verification functions (0x8F7610--0x8F8CB0) validate the generated schedule:

  1. Stall count bounds (0--15)
  2. Barrier index validity (0--5 or 7=none)
  3. Wait mask consistency (only wait on barriers that have been set)
  4. Scoreboard dependency completeness (every long-latency producer has a barrier)
  5. Control word format correctness
  6. Yield hint plausibility
  7. Overall schedule integrity (no live-range violations)
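Check 3 (wait mask consistency) reduces to a single forward scan over the instruction stream. A sketch, with a hypothetical stream format of (barrier index set by this instruction or None, 6-bit wait mask):

```python
def check_wait_consistency(stream):
    """A wait mask may only reference barriers some earlier instruction set."""
    set_so_far = 0
    for set_bar, wait_mask in stream:
        if wait_mask & ~set_so_far:
            return False                    # waiting on a never-set barrier
        if set_bar is not None:
            set_so_far |= 1 << set_bar
    return True

ok  = [(0, 0b000000), (None, 0b000001)]     # set barrier 0, then wait on it
bad = [(None, 0b000010)]                    # wait on barrier 1, never set
print(check_wait_consistency(ok), check_wait_consistency(bad))   # True False
```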

Alternative Control Word Path (sub_8D7760)

sub_8D7760 (41 KB) is the stall/barrier insertion function used by the pre-scheduling passes. Unlike sub_A36360 which generates control words for the final Ori IR, sub_8D7760 operates during the scheduling algorithm itself, computing stall and barrier assignments as instructions are placed.

This function:

  • Manages a 32-entry barrier tracking table
  • Contains architecture-variant switches for different barrier models (sm_70 vs sm_80 vs sm_90+)
  • Computes stall cycles as the distance in cycles to the nearest dependent consumer
  • Assigns barriers from the pool of 6 using an oldest-first eviction policy
  • Handles architecture-specific barrier count variations

The two control word generators (sub_A36360 for final emission, sub_8D7760 for scheduling) share the same barrier allocation algorithm but operate at different pipeline stages. sub_8D7760 produces preliminary assignments that sub_A36360 may refine during the final scoreboard pass.

Architecture-Specific Control Word Configuration

sub_A2BD90 (23 KB) configures architecture-dependent scheduling parameters by querying feature flags through the architecture vtable at *(ctx+72). Configuration includes:

  • Stall count thresholds and caps
  • Barrier count (6 for all current architectures, but the infrastructure supports variation)
  • Reuse buffer policy
  • Yield threshold
  • Texture scoreboard behavior
  • Extended barrier modes for sm_100+

The function queries multiple feature IDs through vtable dispatch, building an architecture profile that sub_A36360 and its encoder chain use for all per-instruction decisions.

Per-Instruction Control Word (Internal Structure)

Within the scheduling context, the control word is maintained at instruction offsets +196 through +200:

Field            Offset                     Bits   Description
Stall count      *(instr+200) bits [0:4]    5      Internal stall count (wider than SASS encoding to allow values up to 31 during optimization)
Extended stall   *(instr+200) bits [5:9]    5      Second stall field for dual-issue scheduling
Barrier flags    *(instr+200) bits [10:14]  5      Barrier control flags
Control bits     *(instr+48) bits [13:17]   5      Barrier format in Mercury encoding
Scoreboard flag  *(instr+32) byte 13 bit 2  1      Instruction has scoreboard information
Encoding format  *(instr+56)                DWORD  4 = barrier format in Mercury
Stall bits       *(instr+168)               BYTE   Final stall value for encoding

The sub_A2D340 (32 KB) function writes these fields through a large opcode switch, handling opcodes 50 (atomics), 73 (BAR), 74 (ST), 77 (LDS/STS), 78 (HMMA), and others with instruction-specific field layouts.

Function Map

Address     Size    Identity
sub_85C890  1.5 KB  ScoreboardDispatcher -- opcode-based fast/slow path router
sub_A220A0  9 KB    InstructionPropertyQuery -- scheduling descriptor filler
sub_A22B40  --      WriteBarrierAssign -- barrier index assignment for operand
sub_A22BC0  --      ReadBarrierAssign -- wait mask assignment for operand
sub_A22CE0  --      InstructionClassify -- scoreboard processing classification
sub_A22D00  --      DependencyDistance -- compute instruction distance
sub_A227F0  --      FlaggedInstrPreprocess -- bit-12-set instruction handling
sub_A231E0  --      SchedulingScore -- full-DAG-analysis necessity check
sub_A23CF0  54 KB   DAGListScheduler -- full dependency-driven scoreboard
sub_A265B0  10 KB   BarrierDependencyTracker -- barrier assignment tracking
sub_A29220  12 KB   InstructionEmissionFilter -- instruction emission gating
sub_A2BD90  23 KB   ArchControlWordConfig -- architecture-specific parameter loader
sub_A2D340  32 KB   InstructionControlWordEncoder -- per-opcode field writer
sub_A31040  --      MasterFieldEncoder -- coordinates per-field encoders
sub_A31390  4 KB    BarrierConflictDetect -- barrier set intersection check
sub_A318F0  4 KB    BarrierDistanceCompute -- dependency distance to barrier
sub_A31F80  7 KB    ComputeReuseFlags -- operand reuse buffer hints
sub_A32C70  --      OperandDependencyAnalysis -- source/dest dep extraction
sub_A333A0  3 KB    EncodeStallAndYield -- 4-bit stall + 1-bit yield
sub_A33660  7 KB    EncodeReadBarrierMask -- 6-bit read barrier mask
sub_A342E0  9 KB    EncodeWriteBarrierIndex -- 3-bit write barrier index
sub_A34B70  10 KB   EncodeWaitBarrierMask -- 6-bit wait barrier mask
sub_A356A0  12 KB   EncodeScoreboardFields -- combined scoreboard encoder
sub_A36360  52 KB   GenerateControlWord -- master control word generator
sub_8D7760  41 KB   StallAndBarrierInsertion -- pre-scheduling control words
sub_8E4920  6.9 KB  BuildScoreboardEntries -- scoreboard BST construction
sub_8E5CA0  20 KB   EmitScheduleOutput -- control word output encoder
sub_8F1EB0  15 KB   EncodeScheduleWords -- SASS control word output
sub_8F4140  5.6 KB  EncodeFullControlWord -- 23-bit packing
sub_A95DC0  36 KB   SASSControlWordEncoder -- architecture-dispatched encoder
sub_6FFDC0  66 KB   MercOpexBody -- Mercury opex scoreboard generation
sub_703480  1.4 KB  RunOpexPass -- MercGenerateOpex entry

Cross-References

  • Scheduler Overview -- 3-phase scheduler architecture, scheduling output pipeline
  • Scheduling Algorithm -- priority list scheduling, dependency DAG construction
  • Latency Model -- per-opcode latency tables used by stall count computation
  • Mercury Encoder -- Mercury pipeline including MercGenerateOpex (phase 120)
  • SASS Encoding -- instruction encoding format including control word bit layout
  • Phase Manager -- how phases 114--116 fit in the 159-phase pipeline
  • Sync & Barriers -- software synchronization barriers (distinct from dependency barriers)
  • Knobs -- scheduling knobs 741 (stall threshold), 805/806 (stall caps)

Code Generation Overview

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The SASS code generation subsystem converts optimized Ori IR into executable GPU machine code. It is the largest subsystem in ptxas by every metric: approximately 12,000 functions, 9 MB of binary code, and nine functions so large that Hex-Rays cannot decompile them. The pipeline spans phases 112--158 of the 159-phase PhaseManager and comprises seven interlinked subsystems -- instruction selection, SASS binary encoding, peephole optimization, the Mercury encoding pipeline, Newton-Raphson math templates, SASS text generation, and ELF output packaging. Every subsystem dispatches through per-SM-family tables, so the same high-level flow produces correct output for targets from Kepler (sm_30) through Blackwell Ultra (sm_121).

Pipeline phases             112--158 (code generation spans the final third of the pipeline)
Total functions             ~12,000 (ISel, encoding, peephole, Mercury, formatters, ELF)
Total binary size           ~9 MB of machine code
Non-decompilable functions  9 (3 peephole + 6 encoding megadispatchers)
Core primitive              sub_7B9B80 -- bitfield insert (216 bytes, 18,347 callers)
Architecture selector       *(int*)(config+372) >> 12 -- SM generation ID
Largest function            sub_169B190 -- generic peephole dispatcher (280 KB)
Output modes                mercury (SM 75--99), capmerc (SM 100+), sass (explicit)
CLI option                  --binary-kind mercury,capmerc,sass

Pipeline

 Optimized Ori IR (register-allocated, scheduled)
      |
      v
 ┌─────────────────────────────────────────────────────────────┐
 │ SASS CODE GENERATION                                        │
 │                                                             │
 │  1. Instruction Selection (ISel) ───────> [isel.md]         │
 │     │  DAG pattern matching: ~750 matchers                  │
 │     │  Mega-selector: sub_C0EB10 (185 KB)                   │
 │     │  4 arch-variant dispatch tables                       │
 │     v                                                       │
 │  2. SASS Binary Encoding ───────────────> [encoding.md]     │
 │     │  ~4,000 template-generated handlers                   │
 │     │  6 megadispatchers (750 KB total)                     │
 │     │  sub_7B9B80 bitfield packer (18,347 callers)          │
 │     v                                                       │
 │  3. Peephole Optimization ──────────────> [peephole.md]     │
 │     │  3 mega-dispatchers: 280+233+233 KB = 746 KB          │
 │     │  ~3,185 pattern matchers                              │
 │     v                                                       │
 │  4. Mercury Pipeline (phases 117-122) ──> [mercury.md]      │
 │     │  Encode/Decode → Expand → WAR → Opex → WAR → SASS    │
 │     │  sub_6D9690 master encoder (94 KB)                    │
 │     v                                                       │
 │  5. Newton-Raphson Templates ───────────> [templates.md]    │
 │     │  DDIV/DRCP/DSQRT/DRSQRT software sequences           │
 │     │  36 functions, up to 298 virtual registers each       │
 │     v                                                       │
 │  6. SASS Text Generation (phase 129) ──> [sass-printing.md] │
 │     │  580 formatter functions + 12.9 KB dispatcher         │
 │     v                                                       │
 │  7. ELF/Cubin Output ──────────────────> [../output/…]      │
 │        sub_612DE0 finalizer → sub_1C9F280 ELF emitter       │
 └─────────────────────────────────────────────────────────────┘
      |
      v
 .cubin / .o (NVIDIA custom ELF)

Phase-to-Subsystem Map

The code generation pipeline occupies phases 112--158. This table maps each phase to its subsystem and documents the six-stage Mercury core that is the dominant path for SM 75+ targets.

Phase     Name                                     Subsystem                 Detail page
112       PlaceBlocksInSourceOrder                 Block layout              cfg.md
113       PostFixForMercTargets                    Mercury pre-fixup         mercury.md
114       FixUpTexDepBarAndSync                    Scoreboard / sync         scoreboards.md
115       AdvancedScoreboardsAndOpexes             Scoreboard hook           scoreboards.md
116       ProcessO0WaitsAndSBs                     Scoreboard (-O0)          scoreboards.md
117       MercEncodeAndDecode                      Mercury core              mercury.md
118       MercExpandInstructions                   Mercury core              mercury.md
119       MercGenerateWARs1                        Mercury core              mercury.md
120       MercGenerateOpex                         Mercury core              mercury.md
121       MercGenerateWARs2                        Mercury core              mercury.md
122       MercGenerateSassUCode                    Mercury core              mercury.md
123       ComputeVCallRegUse                       Post-Mercury bookkeeping  --
124       CalcRegisterMap                          Post-Mercury bookkeeping  --
125       UpdateAfterPostRegAlloc                  Post-Mercury bookkeeping  --
126       ReportFinalMemoryUsage                   Reporting                 dumpir.md
127       AdvancedPhaseOriPhaseEncoding            Encoding hook             --
128       UpdateAfterFormatCodeList                Post-Mercury bookkeeping  --
129       DumpNVuCodeText                          SASS text output          sass-printing.md
130       DumpNVuCodeHex                           SASS hex output           dumpir.md
131       DebuggerBreak                            Debug                     --
132       UpdateAfterConvertUnsupportedOps         Late cleanup              --
133       MergeEquivalentConditionalFlow           Late cleanup              --
134       AdvancedPhaseAfterMidExpansion           Late cleanup hook         --
135       AdvancedPhaseLateExpandSyncInstructions  Late cleanup hook         --
136       LateMergeEquivalentConditionalFlow       Late cleanup              --
137       LateExpansionUnsupportedOpsMid           Late lowering             --
138       OriSplitHighPressureLiveRanges           Late regalloc fixup       --
139--158  (architecture-specific)                  Arch backends             phase-manager.md

Subsystem grouping summary:

Subsystem                 Phases    Key property
Block layout              112       Restores source-order block placement
Scoreboard / sync         113--116  Pre-Mercury texture and dependency bar fixups
Mercury core              117--122  Six-stage encode-expand-WAR-opex-WAR-emit pipeline
Post-Mercury bookkeeping  123--128  Register maps, data structure refresh
SASS output + debug       129--131  Text/hex dumps and debugger hook
Late cleanup              132--138  Conditional merging, late lowering, live-range splits
Arch-specific             139--158  20 backend-overridable phases (no-op by default)

Scale

Subsystem                  Functions  Binary size  Key entry point
ISel pattern matchers      ~750       ~1.3 MB      sub_B285D0 (ISel driver, 9 KB)
ISel mega-selector         1          185 KB       sub_C0EB10
SASS encoding handlers     ~4,000     ~2.5 MB      sub_7B9B80 (bitfield packer)
Encoding megadispatchers   6          ~750 KB      sub_10C0B20 (setField, 180 KB)
Peephole mega-dispatchers  3          ~746 KB      sub_169B190 (generic, 280 KB)
Peephole pattern matchers  ~3,185     ~1.5 MB      (individual matchers)
Mercury pipeline           ~50        ~400 KB      sub_6F52F0 (orchestrator, 23 KB)
Mercury encode tables      530        ~500 KB      format initializers at 0xC66000
Encoding vtable methods    ~2,735     ~450 KB      tiny dispatchers at 0xAF0000
Newton-Raphson templates   36         ~180 KB      sub_170E260 (DDIV coordinator)
SASS text formatters       580        ~850 KB      sub_5D4190 (dispatcher, 12.9 KB)
ELF emitter                ~60        ~300 KB      sub_1C9F280 (master, 97 KB)
Total                      ~12,000    ~9 MB

Nine functions exceed the decompilation threshold: the three peephole mega-dispatchers (280 + 233 + 233 KB) and the six encoding megadispatchers (180 + 197 + 187 + 142 + 68 + 65 KB). All analysis of these functions derives from disassembly, call graphs, and the smaller functions they invoke.

Instruction Selection

ISel converts abstract Ori IR operations into concrete SASS instruction forms using SelectionDAG-style pattern matching. Unlike upstream LLVM's TableGen-driven ISel, ptxas uses handwritten C++ matchers compiled into ~750 functions invoked from the ISel driver via per-opcode dispatch tables. The ISel driver (sub_B285D0, 9 KB, 66 callees) selects architecture-variant builders based on the SM version. The mega-selector (sub_C0EB10, 185 KB) handles the full IR-to-SASS mapping through a giant switch over instruction opcodes. Four nearly identical dispatch functions (15,049 bytes each) at sub_B128E0--sub_B12920 provide architecture-variant opcode routing, all jumping to shared handler code at 0x1C39xxx.

See Instruction Selection for the full DAG matcher protocol, helper function table, architecture dispatch tables, and operand variant selectors.

SASS Binary Encoding

The encoding subsystem translates ISel output into packed binary SASS machine code. Each instruction is encoded into a 1280-bit (160-byte, 20-QWORD) buffer via the universal bitfield packer sub_7B9B80. The full architecture is documented in SASS Instruction Encoding; the key facts for the overview:

  • ~4,000 encoding handler functions -- each follows an identical 10-phase template, differing only in constants and modifier helpers
  • 6 megadispatchers (750 KB total) route field-level queries by instruction category: setField (180 KB), getFieldOffset (197 KB), hasField (187 KB), setFieldDefault (142 KB), getOperandFieldOffset (68 KB), setOperandField (65 KB)
  • 2,095 bitfield accessor functions at 0x10B0000--0x10BF2C0 (1,661 under 200 bytes)
  • 530 encoding table initializers at 0xC66000--0xD27000, each populating one instruction format row
  • 3-level opcode hierarchy: major (9 bits), minor (8 bits), sub-opcode (7 bits)
  • Instruction widths: 64-bit (format code 1), 128-bit (format code 2), 256-bit (format code 8)
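A minimal model of what a universal bitfield packer like sub_7B9B80 must do: insert an n-bit value at an arbitrary bit offset inside the 1280-bit (20-QWORD) encode buffer. Only the buffer size and the 9/8/7-bit opcode split come from the text; the helper name and the field values below are illustrative.

```python
def set_field(buf, bit_offset, width, value):
    """Insert `width` bits of `value` at `bit_offset` in a list of QWORDs."""
    assert value < (1 << width)
    for i in range(width):
        qword, bit = divmod(bit_offset + i, 64)
        if (value >> i) & 1:
            buf[qword] |= 1 << bit
        else:
            buf[qword] &= ~(1 << bit)

buf = [0] * 20                       # 1280-bit encode buffer (20 QWORDs)
set_field(buf, 0, 9, 0x1AB)          # 9-bit major opcode (value is made up)
set_field(buf, 9, 8, 0x42)           # 8-bit minor opcode
set_field(buf, 62, 4, 0b1011)        # a field straddling a QWORD boundary
```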

Peephole Optimization

Three monolithic dispatch functions implement brute-force pattern-match-and-rewrite. The full architecture is documented in Peephole Optimization. The key positioning facts:

Dispatcher   Size    Matchers  Entry trampoline  Runs when
sub_169B190  280 KB  762       sub_B12930        Pre-scheduling (all SM)
sub_143C440  233 KB  1,087     sub_B12940        Pre-scheduling (SM 120 only)
sub_198BCD0  233 KB  1,336     sub_B12960        Post-scheduling (all SM)

All three use identical architecture: a 373-case primary switch on the 16-bit opcode at instruction+0x0C, per-case pattern matcher invocations with priority tracking, and a secondary switch for rewrite actions. The SM 120 dispatcher (sub_143C440) is architecture-gated and runs only when compiling for consumer RTX 50-series or enterprise Pro GPUs.

Mercury Pipeline

Mercury is NVIDIA's intermediate encoding layer between the optimizer's Ori IR and native SASS machine code. It occupies phases 113--122 and forms a six-stage sub-pipeline. Three output modes are controlled by --binary-kind: mercury (SM 75--99), capmerc (SM 100+, with embedded PTX source and relocation metadata), and sass (explicit direct SASS output). The master encoder sub_6D9690 (94 KB) is the largest backend function, with the orchestrator sub_6F52F0 (23 KB, 18 parameters) driving the full stage sequence.

See Mercury Encoder Pipeline for the six-stage architecture, key function table, and output mode details. See Capsule Mercury & Finalization for the SM 100+ variant.

Newton-Raphson Templates

Double-precision operations lacking dedicated hardware (DDIV, DRCP, DSQRT, DRSQRT) are lowered into multi-instruction SASS sequences implementing Newton-Raphson iterative refinement. The template system at 0x1700000--0x1722D60 comprises 36 functions organized in a two-level hierarchy: a top-level handler per operation delegates to a coordinator that allocates up to 298 virtual registers and chains 5--7 sub-expander functions. The register-count dispatcher sub_1704070 selects between full inline, partial inline, and template-based expansion paths based on register file pressure (thresholds: 20,479 / 16,383).
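The refinement scheme itself is the classic Newton-Raphson reciprocal iteration, x_{n+1} = x_n * (2 - a * x_n), which doubles the number of correct digits per step. A scalar sketch of the math these templates expand DRCP into; the seed and iteration count here are illustrative, not the recovered SASS sequence.

```python
def nr_reciprocal(a, seed, iterations=4):
    """Refine an initial estimate of 1/a by Newton-Raphson iteration."""
    x = seed
    for _ in range(iterations):
        x = x * (2.0 - a * x)     # error roughly squares each step
    return x

approx = nr_reciprocal(3.0, seed=0.3)
print(approx)                     # converges toward 1/3
```

In hardware, the seed comes from a low-precision approximation instruction (MUFU.RCP-style) and each refinement step maps onto fused multiply-adds, which is why the expanded sequences consume so many virtual registers.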

See Newton-Raphson Templates for the complete template hierarchy, register-count dispatch logic, and sub-expander details.

SASS Text Generation

Phase 129 (DumpNVuCodeText) converts the internal instruction stream into human-readable SASS assembly text for --verbose output and --out-sass dumps. The dispatcher sub_5D4190 (12.9 KB) routes 81 named opcodes via direct string comparison and 473 via hash-based switch to 580 template-generated formatter functions at 0x4DA340--0x5A8E40 (~850 KB). All formatters use a monolithic 1.8 MB format string table -- an unusual design that trades memory for formatting speed.

See SASS Text Generation for the full formatter architecture and opcode routing details.

ELF/Cubin Output

The final stage packages the encoded SASS binary into NVIDIA's custom ELF format (.cubin/.o). The kernel finalizer sub_612DE0 (47 KB) feeds the master ELF emitter sub_1C9F280 (97 KB), which delegates to symbol table emission (sub_713710), relocation generation (sub_7163C0), string table construction (sub_7122C0), and section layout finalization (sub_716DC0).

See ELF/Cubin Output for section catalog, relocation format, and EIATTR attribute encoding.

Intrinsic Lowering

The OCG (Optimizing Code Generator) intrinsic system at 0x6C0000--0x6D0000 handles PTX builtin operations for SM 100+ targets. The master intrinsic table initializer sub_6C9EB0 (13 KB) populates a 10,664-byte dispatch table keyed by the prefix "__nv_ptx_builtin_ocg_", covering operations from basic add/load/store through SM 100 tensor core (tcgen05) and bulk async copy:

Handler     Size   Operations
sub_6C0D90  19 KB  Atomic reduce (atom.add/min/max/cas -- 54 validation strings)
sub_6C3470  20 KB  cp.async.bulk (bulk async copy)
sub_6C1CF0  16 KB  mbarrier (arrive, wait, test, counted variants)
sub_6C4DA0  15 KB  Load/store with scope, memory order, domain validation
sub_6D4350  30 KB  MMA intrinsics (HMMA, IMMA, DMMA variants)
sub_6D7AF0  19 KB  TCGen05 MMA (SM 100, 5th generation tensor core)

Intrinsic parameter validators at sub_6BDB60--sub_6BF910 enforce type, sub-operation, and memory domain constraints. NVIDIA consistently misspells "intrinsic" as "instrinsic" in all validation error strings.

Post-Scheduling Statistics

Eight SM-variant statistics printers at sub_ABBA50--sub_ABEB50 (7,603 bytes each, spaced 0x700 apart) generate "# [...] " comments with comprehensive post-codegen metrics: instruction counts, register usage, spill/refill bytes, estimated latency and occupancy, per-functional-unit instruction estimates, MMA counts, and throughput figures. The per-unit instruction counter sub_ABF590 (17 KB) uses SSE2 operations for batch updates.

Operand Legalization

Post-register-allocation operand legalization rewrites instructions that cannot be directly encoded in SASS:

| Address | Size | Purpose |
|---|---|---|
| sub_AB3C30 | 32 KB | Post-RA instruction legalization (opcodes 288, 167, 185, 241, 299, 300, 317) |
| sub_AB2D50 | 18 KB | Per-class operand legalization (opcode 307 = ternary/FMA-like) |
| sub_ACF4D0 | 14 KB | Constraint solver -- splits instructions when direct encoding fails |
| sub_AB8940 | 19 KB | Register move coalescing / copy elimination |
| sub_AC2750 | 36 KB | Operand-to-encoding converter (36-byte operand records) |

When legalization requires instruction splitting, sub_ACF4D0 creates new instructions via sub_934630 (instruction constructor). The constraint solver tries alternative encodings before resorting to splits.

WGMMA Pipeline (SM 90+)

The WGMMA (Warp Group Matrix Multiply-Accumulate) pipeline optimizer at 0xACE000--0xAE6000 manages asynchronous tensor core execution for Hopper and later. It automatically inserts warpgroup.arrive and warpgroup.wait fences to ensure correct register handoff. The warning emitter (sub_ACE480) issues "Potential Performance Loss" advisories (codes 7509--7511) when pipelining fails due to extern calls, insufficient registers, or ill-formed pipeline stages.

See WGMMA Pipeline Optimizer for the full call tree, register pressure estimator, and serialization warning details.

Per-SM Architecture Dispatch

Every code generation subsystem dispatches through architecture-specific tables. The SM generation is determined by *(int*)(config+372) >> 12:

| config+372 >> 12 | Generation | SM versions |
|---|---|---|
| 3 | Kepler | sm_30--sm_37 |
| 5 | Maxwell | sm_50--sm_53 |
| 6 | Pascal | sm_60--sm_62 |
| 7 | Volta / Turing | sm_70--sm_75 |
| 8 | Ampere | sm_80--sm_89 |
| 9 | Hopper | sm_90--sm_90a |
| 10+ | Blackwell | sm_100--sm_121 |
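
The generation extraction can be sketched as below. The exact layout of the word at config+372 is not recovered; the only confirmed fact is the ">> 12" idiom, so the bit packing shown here (generation in bits 15:12 and above) is an assumption for illustration.

```c
#include <assert.h>

/* Sketch, assuming the SM target word packs the architecture generation
 * in the bits at and above position 12, as the observed ">> 12" implies.
 * The example input values (0x9000 etc.) are hypothetical encodings. */
static int sm_generation(int sm_word) {
    return sm_word >> 12;
}
```

Under this assumed layout, a word such as 0x9000 would select the Hopper (generation 9) dispatch tables.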

Architecture-specific dispatch points across the codegen pipeline:

| Subsystem | Dispatch mechanism | Evidence |
|---|---|---|
| ISel | 4 arch-variant dispatch tables at sub_B128E0--sub_B12910 | All JUMPOUT to shared code at 0x1C39xxx |
| Encoding | vtable at *(context+416) with ~200 virtual methods | Per-opcode encoding, latency, hazard rules |
| Peephole | 3 mega-dispatchers with per-SM case logic | SM 120 dispatcher (sub_143C440) is arch-gated |
| Mercury | sub_6E8EB0 sets arch-specific flags in opcode descriptor table | SM 80: bits 1, 8; SM 84: bits 16, 64 |
| Statistics | 8 SM-variant printer clones at sub_ABBA50--sub_ABEB50 | 7,603 bytes each, 0x700 spacing |
| NR templates | Register-count-based dispatch at sub_1704070 | Thresholds: 20479 / 16383 |

Function Map (Top 10)

| Address | Size | Identity |
|---|---|---|
| sub_169B190 | 280 KB | Generic peephole dispatcher (all SM, 762 matchers) |
| sub_10D5E60 | 197 KB | Encoding getFieldOffset megadispatcher (961 callers) |
| sub_10E32E0 | 187 KB | Encoding hasField megadispatcher (72 callers) |
| sub_C0EB10 | 185 KB | Main instruction selector (500+ locals, giant switch) |
| sub_10C0B20 | 180 KB | Encoding setField megadispatcher (3,109 callers) |
| sub_10CCD80 | 142 KB | Encoding setFieldDefault megadispatcher (4 callers) |
| sub_1C9F280 | 97 KB | Master ELF emitter |
| sub_6D9690 | 94 KB | Mercury master encoder (instruction type switch) |
| sub_6FFDC0 | 66 KB | Mercury opex body (scoreboard generation) |
| sub_6E8EB0 | 64 KB | BasicBlock::Initialize (encoder state, opcode descriptors) |

See function-map.md for the complete table (~30 entries with all codegen functions).

Cross-References

Instruction Selection

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Instruction selection in ptxas is a two-phase process that converts PTX virtual ISA operations into concrete SASS machine opcodes. Unlike LLVM, which uses a single SelectionDAG or GlobalISel framework, ptxas distributes instruction selection across two distinct pipeline stages separated by the entire optimization pipeline: Phase 1 converts PTX opcodes to Ori IR opcodes during initial lowering (phase 5, ConvertUnsupportedOps), and Phase 2 converts Ori IR to final SASS binary forms during code generation (phases 112--122, ISel driver + Mercury encoder). The two phases serve fundamentally different purposes: Phase 1 legalizes the IR so the optimizer can reason about it, while Phase 2 selects the optimal machine encoding for the target architecture after register allocation and scheduling are complete.

| Component | Identity |
|---|---|
| Phase 1 location | Phase 5: ConvertUnsupportedOps (PTX opcode to Ori opcode) |
| Phase 2 location | Phases 112+: ISel driver + Mercury encoder (Ori to SASS binary) |
| MercConverter dispatch | sub_9ED2D0 (25 KB, master switch on *(instr+72) & 0xCF mask) |
| ISel driver | sub_B285D0 (9 KB, 66 callees, vtable entry) |
| ISel mega-selector | sub_C0EB10 (185 KB, 500+ locals, giant switch) |
| DAG pattern matchers | ~801 functions at 0xB28F60--0xB7D000 (~1.3 MB) |
| Arch dispatch tables | 4 copies at sub_B128E0--sub_B12920 (15,049 bytes each) |
| Mercury master encoder | sub_6D9690 (94 KB, instruction type switch) |
| MercExpand | sub_C3CC60 (26 KB, pseudo-instruction expansion) |
| SM120 pattern coordinator | sub_13AF3D0 (137 KB, 130-case switch, opcodes 2--352) |
| Opcode variant selectors | sub_B0BE00 (19 KB, class 194), sub_B0AA70 (5 KB, class 306) |

Architecture

PTX source text
     |
     v
[Bison parser]  sub_4CE6B0 (48KB)
     |  Reduction actions build raw Ori nodes with PTX-derived opcodes
     v
+------------------------------------------------------------------+
| RAW ORI IR (PTX opcodes: add.f32, ld.global, mad.lo.s32, ...)    |
+------------------------------------------------------------------+
     |
     |  PHASE 1: PTX-to-Ori Opcode Legalization (phase 5)
     |
     |  sub_9F3340 (orchestrator, 7KB)
     |    -> sub_9F1A90 (MercConverter main, 35KB)
     |         -> sub_9ED2D0 (opcode dispatch, 25KB)
     |              Switch on (*(instr+72)) with BYTE1 & 0xCF mask
     |              ~120 case values -> ~60 handler functions
     |              + vtable dispatch for architecture-extensible ops
     |         -> sub_934630 (instruction creation, called N times)
     |    -> sub_9EF5E0 (post-conversion lowering, 27KB)
     |
     v
+------------------------------------------------------------------+
| OPTIMIZER-READY ORI IR (SASS opcodes: FADD, IMAD, LDG, STG, ...) |
| Every instruction has a valid SASS opcode for the target SM.     |
+------------------------------------------------------------------+
     |
     |  [Phases 14-111: Full optimization pipeline]
     |  Register allocation, scheduling, peephole, etc.
     |
     v
+------------------------------------------------------------------+
| OPTIMIZED ORI IR (register-allocated, scheduled)                  |
+------------------------------------------------------------------+
     |
     |  PHASE 2: Ori-to-SASS Selection & Encoding (phases 112+)
     |
     |  sub_B285D0 (ISel driver, 9KB)
     |    -> sub_C0EB10 (mega-selector, 185KB, default backend)
     |    -> sub_13AF3D0 (pattern coordinator, 137KB, SM120 backend)
     |    -> sub_B1FA20 / sub_B20E00 (builder variants)
     |    -> sub_B28F60..sub_B74C60 (~801 DAG pattern matchers)
     |    -> sub_B128E0..sub_B12920 (4 arch dispatch tables)
     |
     |  sub_6D9690 (Mercury master encoder, 94KB)
     |    -> Switch on instruction type (*(instr+8))
     |    -> sub_C00BF0 (opcode lookup)
     |    -> sub_91D160 (register encoding)
     |    -> sub_7B9B80 (bitfield insert, 18,347 callers)
     |
     |  sub_C3CC60 (MercExpand, 26KB)
     |    -> sub_C37A10 (expand instruction, 16KB)
     |    -> sub_C39B40 (expand memory, 10KB)
     |    -> sub_C3BCD0 (expand control flow, 19KB)
     |
     v
+------------------------------------------------------------------+
| SASS binary (packed machine code in 64/128/256-bit words)         |
+------------------------------------------------------------------+

Phase 1: PTX-to-Ori Opcode Conversion

Phase 1 runs as ConvertUnsupportedOps (pipeline phase 5), the most substantial bridge phase. Its job is to replace every PTX-derived opcode in the raw Ori IR with a valid SASS-level opcode for the target SM. After this phase completes, the optimizer sees only SASS-level instruction semantics.

The conversion is not a simple table lookup. Many PTX operations have no 1:1 SASS equivalent and must be expanded into multi-instruction sequences. The expansion depends on the target architecture, the operand types, and the available hardware functional units.

MercConverter Dispatch -- sub_9ED2D0 (25 KB)

The central dispatch function of Phase 1. Despite the sweep's initial identification as PhaseRunner::executePhaseSequence, the decompiled code reveals a classic opcode switch: it reads *(instr+72), masks byte 1 with 0xCF (stripping modifier bits 4--5), and dispatches to per-category handler functions. The switch covers approximately 120 distinct case values (opcode indices 1--352) routing to roughly 60 handler functions plus vtable-dispatched methods for architecture-extensible operations.

// sub_9ED2D0 -- simplified dispatch logic
void MercConverter_Dispatch(context, instruction) {
    // Pre-dispatch: check predication eligibility
    bool can_predicate = sub_7E18A0(instruction, *(context+8));
    if (can_predicate)
        can_predicate = vtable[205](*(*(context+8)+1584), instruction);
    *(context+40) = can_predicate;

    // Read opcode, mask out modifier bits
    int opcode = *(DWORD*)(instruction + 72);
    BYTE1(opcode) &= 0xCF;

    // Special case: opcode 130 (internal MOV-like marker) with GPR operand -> clear predication
    if (opcode == 130) {
        int operand = *(DWORD*)(instruction + 84);
        if (((operand >> 28) & 7) == 1 && reg_type(operand) == 6)
            *(context+40) = 0;
    }

    // Main dispatch
    switch (opcode) {
    case 1:   sub_9DA5C0(context, instruction);  break;  // opcode class 1
    case 6:   sub_9DA100(context, instruction);  break;  // arithmetic
    case 8:   sub_9D2440(context, instruction);  break;  // specific class
    case 10: case 11: case 149: case 151: case 152: case 290: case 291:
              sub_9D80E0(context, instruction);  break;  // memory load/store
    case 16:  sub_9E8B20(context, instruction);  break;  // texture/surface
    case 61: case 63: case 80:
              sub_9E6600(context, instruction);  break;  // instruction expansion
    case 108: sub_9D76D0(context, instruction);  break;  // memory legalization
    // ... ~100 more cases ...
    default:  emit_noop(context, 0xFFFF);        break;  // unknown -> passthrough
    }

    // Post-dispatch: apply predication and operand adjustments
    vtable[107](context, instruction);
}
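
The `BYTE1(opcode) &= 0xCF` idiom above clears bits 4--5 of the second byte, i.e. bits 12--13 of the 32-bit opcode word, before the switch. A minimal sketch of the same operation on the whole word (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Clear the two modifier bits the dispatcher strips before switching:
 * BYTE1(op) &= 0xCF is equivalent to masking bits 12-13 of the word. */
static uint32_t strip_modifiers(uint32_t opcode_word) {
    return opcode_word & ~0x3000u;
}
```

This is why opcode variants that differ only in those modifier bits share a single case label in sub_9ED2D0.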

MercConverter Opcode Dispatch Table

The complete switch covers opcodes 1--352. Cases route to three dispatch mechanisms: direct function calls (for common PTX categories), vtable-indirect calls (for architecture-extensible operations), and the emit_noop fallback for unrecognized opcodes. Below is the reconstructed routing table from the decompiled sub_9ED2D0.

Direct handler dispatch:

| Opcode(s) | Handler | Size | Category |
|---|---|---|---|
| 1 | sub_9DA5C0 | 2 KB | Opcode class 1 (basic ALU) |
| 6 | sub_9DA100 | 9 KB | Arithmetic operations |
| 8 | sub_9D2440 | -- | Specific class |
| 10, 11, 149, 151, 152, 290, 291 | sub_9D80E0 | 17 KB | Memory load/store |
| 15, 85 | sub_9EC340 | 23 KB | Multi-operand legalization |
| 16 | sub_9E8B20 | 17 KB | Texture/surface lowering |
| 17 | sub_9E7FB0 | -- | Surface operations |
| 22 | sub_9D6DB0 | -- | Specific lowering |
| 23 | sub_9E58F0 | -- | Specific lowering |
| 24 | sub_9D9F60 | -- | Specific lowering |
| 26 | sub_9E54C0 | -- | Specific lowering |
| 27 | sub_9E4BB0 | -- | Specific lowering |
| 28 | sub_9D9E70 | -- | Specific lowering |
| 32, 271 | sub_9E2440 | -- | Bitfield operations |
| 34 | sub_9E55E0 | -- | Specific lowering |
| 38, 59, 106, 180, 182, 192, 194, 215, 221, 242 | sub_9DA6B0 | -- | Generic ALU group |
| 41, 284 | sub_9D1DA0 | -- | Specific lowering |
| 42, 53, 55, 66 | sub_9D54B0 | -- | Grouped operations |
| 47 | sub_9E74E0 | -- | Conditional (arch flag check) |
| 51 | sub_9E2F60 | -- | Specific lowering |
| 52, 54, 72, 97 | sub_9D09C0 | -- | Group with v8=1 (deletion flag) |
| 57, 101 | sub_9D6170 | -- | Paired operations |
| 60, 62, 78, 79 | sub_9E5EE0 | -- | Comparison group |
| 61, 63, 80 | sub_9E6600 | 25 KB | Instruction expansion (64-bit split) |
| 67 | sub_9D9C30 | -- | Specific lowering |
| 70 | sub_9E3490 | -- | Specific lowering |
| 75 | sub_9E0C10 | -- | Specific lowering |
| 77 | sub_9E4DF0 | -- | Specific lowering |
| 83 | sub_9D6AB0 | -- | Specific lowering |
| 88, 89 | sub_9D5990 | -- | Paired operations |
| 90 | sub_9D2820 | -- | Specific lowering |
| 91 | sub_9E7600 | -- | Specific lowering |
| 92 | sub_9E7890 | -- | Specific lowering |
| 93, 95 | sub_9E1D40 | -- | Comparison variants |
| 94 | sub_9E1DF0 | -- | Specific lowering |
| 96 | sub_9D41C0 | -- | Specific lowering |
| 98 | sub_9D3230 | -- | Specific lowering |
| 100 | sub_9D70E0 | -- | Specific lowering |
| 102 | sub_9D9750 | -- | Specific lowering |
| 103, 104 | sub_9E31D0 | -- | Paired operations |
| 108 | sub_9D76D0 | 18 KB | Memory instruction legalization |
| 124 | sub_9E18B0 | -- | Specific lowering |
| 135 | sub_9D6560 | -- | Specific lowering |
| 139, 140, 141, 143 | sub_9D4C10 | -- | Related operations group |
| 145 | sub_9D3020 | -- | Specific lowering |
| 155, 268 | sub_9E5260 | -- | Paired operations |
| 156 | sub_9D94B0 | -- | Specific lowering |
| 158, 167 | sub_9E4A00 | -- | Paired operations |
| 161 | sub_9D21D0 | -- | Specific lowering |
| 162 | sub_9D9660 | -- | Specific lowering |
| 166 | sub_9E2100 | -- | Specific lowering |
| 170 | sub_9E2DF0 | -- | Specific lowering |
| 173, 267 | sub_9EB5C0 | -- | Paired operations |
| 174 | sub_9D9300 | -- | Specific lowering |
| 184 | sub_9D2E70 | -- | Specific lowering |
| 185 | sub_9E32F0 | -- | Specific lowering |
| 188, 190 | sub_9E2970 | -- | Paired operations |
| 195 | sub_9D2AB0 | -- | Specific lowering |
| 196 | sub_9D9080 | -- | Specific lowering |
| 198 | sub_9D66F0 | -- | Specific lowering |
| 201, 202, 204, 285 | sub_9EAC30 | -- | Async/bulk group |
| 203 | sub_9D8E90 | -- | Specific lowering |
| 205 | sub_9E1260 | -- | Specific lowering |
| 209 | sub_9E5740 | -- | Specific lowering |
| 210, 213, 214 | sub_9D8B30 | -- | Grouped operations |
| 240 | sub_9D6280 | -- | Specific lowering |
| 241 | sub_9E2CC0 | -- | Specific lowering |
| 247 | sub_9D0F70 | -- | Specific lowering |
| 248 | sub_9D0DF0 | -- | Specific lowering |
| 262 | sub_9E7440 | -- | Specific lowering |
| 264 | sub_9D73F0 | -- | Specific lowering |
| 276 | sub_9D5EC0 | -- | Specific lowering |
| 292 | sub_9D0E90 | -- | Specific lowering |

Vtable-indirect dispatch (for architecture-extensible operations):

| Opcode(s) | Vtable offset | Category (inferred) |
|---|---|---|
| 2, 3, 4, 5, 7 | vtable[0] (+0) | Generic fallback |
| 14, 39, 40, 105, 125, 299, 300, 321 | vtable[7] (+56) | Group A operations |
| 18 | vtable[3] (+24) | Specific class |
| 31 | vtable[4] (+32) | Specific class |
| 35 | vtable[6] (+48) | Specific class |
| 36 | vtable[21] (+168) | Specific class |
| 43 | vtable[9] (+72) | Specific class |
| 50 | vtable[12] (+96) | Specific class |
| 65 | vtable[22] (+176) | Specific class |
| 73 | vtable[15] (+120) | Specific class |
| 74 | vtable[16] (+128) | Specific class |
| 81 | vtable[24] (+192) | Specific class |
| 110, 111, 112, 114 | vtable[25] (+200) | Warp shuffle group |
| 118 | vtable[10] (+80) | Specific class |
| 119 | vtable[28] (+224) | Specific class |
| 120, 121, 126, 127, 128, 280, 281 | vtable[27] (+216) | Barrier/sync group |
| 122, 123, 310, 311, 312 | vtable[26] (+208) | Related group |
| 130 (HSET2), 169 | vtable[29] (+232) | Move/convert group (130 is MOV-like internally; actual SASS MOV = 19) |
| 157 | vtable[84] (+672) | Specific class |
| 176, 177 | vtable[34] (+272) | Paired operations |
| 183, 288 | vtable[36] (+288) | Paired operations |
| 186 | vtable[35] (+280) | Specific class |
| 211 | vtable[39] (+312) | Specific class |
| 220 | vtable[40] (+320) | Specific class |
| 223, 238 | vtable[41] (+328) | Paired operations |
| 228 | vtable[42] (+336) | Specific class |
| 243 | vtable[43] (+344) | Specific class |
| 245--253, 257 | vtable[67--77] (+536--+624) | SM 100+ operations |
| 265, 266 | vtable[93] (+744) | Paired operations |
| 270 | vtable[77] (+616) | Specific class |
| 277 | vtable[65] or vtable[11] (+520/+88) | Operand-type dependent |
| 279--351 | various high vtable offsets | SM 100+ / Blackwell operations |

The vtable mechanism allows architecture backends to override conversion behavior without modifying the core dispatch. The vtable factory at sub_1CCEEE0 (17 KB, 244 callees) selects which overrides are active based on the SM version.
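The override pattern can be sketched as below. All names (ConvertFn, default_convert, sm100_convert, build_vtable) are hypothetical stand-ins; only the shape -- fill every slot with a default, then patch SM-specific slots, as sub_1CCEEE0 does -- follows the recovered design.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the architecture-vtable override pattern. Names are
 * illustrative, not recovered symbols. */
typedef int (*ConvertFn)(int opcode);

static int default_convert(int opcode) { return opcode; }         /* core behavior */
static int sm100_convert(int opcode)   { return opcode + 1000; }  /* backend override */

#define VTBL_SLOTS 90
static ConvertFn vtbl[VTBL_SLOTS];

/* Analogue of the vtable factory: populate defaults, then patch
 * SM-specific slots based on the target generation. */
static void build_vtable(int sm_generation) {
    for (size_t i = 0; i < VTBL_SLOTS; i++)
        vtbl[i] = default_convert;
    if (sm_generation >= 10)       /* Blackwell: override a high-numbered slot */
        vtbl[67] = sm100_convert;
}
```

The core dispatcher never changes: it always calls through the slot, and the factory decides which body is behind it.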

Per-Category Handlers

The larger handlers implement non-trivial conversion logic:

| Handler | Size | Category | Key behavior |
|---|---|---|---|
| sub_9E6600 | 25 KB | Instruction expansion | Splits 64-bit ops on 32-bit ALU into hi/lo pairs with carry chains. Calls sub_9D4380 (instruction builder) ~10 times per expansion. |
| sub_9EC340 | 23 KB | Multi-operand legalization | Operand type test: (v >> 28) & 7 == 1 means register. Register class query via sub_7BE7B0. Creates new instructions via sub_7DEAD0. |
| sub_9D76D0 | 18 KB | Memory legalization (load/store) | Register type dispatch: 6=GPR, 7=predicate, 3=address. Uses sub_9D4380 (instruction builder) and sub_9CD420 (predication). |
| sub_9D80E0 | 17 KB | Memory legalization (variant) | Same opcode set as sub_9D76D0, alternate code path for different operand patterns. |
| sub_9E8B20 | 17 KB | Texture/surface lowering | Register type 6 = GPR. Manipulates bitmask at register descriptor offset +48. |
| sub_9DA100 | 9 KB | Arithmetic operations | Handles opcode case 6 -- standard ALU instruction legalization. |
| sub_9DA6B0 | -- | Generic ALU group | Covers 10 opcode values (38, 59, 106, 180, 182, 192, 194, 215, 221, 242). |

1:1 vs 1:N Expansion

Most PTX operations map 1:1 to a single SASS opcode. When they do not, the handlers in sub_9E6600 and related functions create multi-instruction sequences:

PTX                                    Ori IR (after Phase 1)
-----------------------------------    -----------------------------------
add.f32  %r1, %r2, %r3          -->   FADD  R1, R2, R3                [1:1]
add.s32  %r4, %r5, %r6          -->   IADD3 R4, R5, R6, RZ           [1:1, operand added]
mul.lo.s64 %rd1, %rd2, %rd3     -->   IMAD.LO  R1, R2, R6, RZ       [1:N split]
                                       IMAD.HI  R0, R2, R6, RZ
                                       IMAD      R0, R3, R6, R0
                                       IMAD      R0, R2, R7, R0
div.f32  %r7, %r8, %r9          -->   MUFU.RCP  R10, R9              [1:N, Newton-Raphson]
                                       FMUL      R7, R8, R10
                                       (+ correction iterations)
bar.sync 0                       -->   BAR                            [1:1]

The expansion creates new instruction nodes via sub_934630 and links them into the doubly-linked instruction list. The original PTX-level instruction is replaced by the expanded sequence.
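The splice into the doubly-linked instruction list can be sketched as below. The Instr struct and function names are hypothetical; only the mechanism -- link the expanded sequence in place of the original node -- follows the text.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: replace one instruction node with an expanded sequence in a
 * doubly-linked instruction list, as Phase 1 does after building new
 * nodes via the instruction constructor. */
typedef struct Instr {
    int opcode;
    struct Instr *prev, *next;
} Instr;

/* Splice [first..last] into the list in place of old. */
static Instr *replace_with_sequence(Instr *old, Instr *first, Instr *last) {
    first->prev = old->prev;
    last->next  = old->next;
    if (old->prev) old->prev->next = first;
    if (old->next) old->next->prev = last;
    old->prev = old->next = NULL;   /* detach the original PTX-level node */
    return first;
}

/* Demo: a <-> b <-> c, with b replaced by the two-node expansion x <-> y. */
static int demo_splice(void) {
    Instr a = {1, NULL, NULL}, b = {2, NULL, NULL}, c = {3, NULL, NULL};
    Instr x = {4, NULL, NULL}, y = {5, NULL, NULL};
    a.next = &b; b.prev = &a; b.next = &c; c.prev = &b;
    x.next = &y; y.prev = &x;
    replace_with_sequence(&b, &x, &y);
    return a.next == &x && x.prev == &a && y.next == &c && c.prev == &y;
}
```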

Type-Dependent Opcode Selection

PTX's explicitly-typed opcodes (where the type is a qualifier like .f32, .s64) map to different SASS mnemonics based on the type:

| PTX type | SASS prefix | Example PTX | Example SASS |
|---|---|---|---|
| .f16 / .f16x2 | H | add.f16 | HADD2 |
| .f32 | F | add.f32 | FADD |
| .f64 | D | add.f64 | DADD |
| .s32 / .u32 | I | add.s32 | IADD3 |
| .s64 / .u64 | I (split) | add.s64 | IADD3 + IADD3.X (carry chain) |
| .pred | P | setp.eq.f32 | FSETP |

The type qualifier disappears from the instruction syntax during conversion. It becomes encoded in the SASS mnemonic itself (the F in FADD, the I in IADD3) and in the register class of the operands.
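A minimal sketch of the add-instruction column of the mapping above; the function name and the combined "IADD3+IADD3.X" string for the split 64-bit case are illustrative, not recovered conversion code.

```c
#include <assert.h>
#include <string.h>

/* Sketch of type-qualifier-to-mnemonic selection for PTX add.
 * The real converter encodes this in the opcode switch, not a string map. */
static const char *sass_add_mnemonic(const char *ptx_type) {
    if (!strcmp(ptx_type, ".f16"))  return "HADD2";
    if (!strcmp(ptx_type, ".f32"))  return "FADD";
    if (!strcmp(ptx_type, ".f64"))  return "DADD";
    if (!strcmp(ptx_type, ".s32") || !strcmp(ptx_type, ".u32"))
        return "IADD3";
    /* 64-bit integer adds split into an IADD3 + IADD3.X carry chain */
    if (!strcmp(ptx_type, ".s64") || !strcmp(ptx_type, ".u64"))
        return "IADD3+IADD3.X";
    return "?";
}
```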

SM-Dependent Legalization

The MercConverter gates operations by SM version through the architecture vtable. An instruction available natively on one SM may require a multi-instruction lowering sequence on another:

  • 64-bit integer arithmetic on SM 50--75 (no native 64-bit ALU): splits into 32-bit hi/lo pairs
  • FP16 operations on pre-SM 53 targets: promoted to FP32 (handled by Phase 2 PromoteFP16)
  • bfe/bfi variants: some bit-field extract/insert modes not supported on all targets
  • Tensor core intrinsics: SM 70 has HMMA v1, SM 75 has HMMA v2, SM 80+ has HMMA v3/DMMA, SM 100 has TCGen05

The architecture vtable factory at sub_1CCEEE0 populates the vtable with SM-specific method overrides. The vtable has approximately 90 method slots (up to offset +720), with the highest-numbered slots (offset 624+) serving SM 100+ Blackwell operations.

Phase 2: Ori-to-SASS Selection & Encoding

Phase 2 runs during code generation (phases 112+) after the optimizer, register allocator, and scheduler have completed. It operates on fully optimized, register-allocated Ori IR and produces final SASS machine code. Phase 2 has three major components: the ISel driver with DAG pattern matching, the Mercury master encoder, and MercExpand pseudo-instruction expansion.

ISel Driver -- sub_B285D0 (9 KB)

The top-level ISel coordinator is a vtable entry point with 66 callees. It selects the appropriate instruction builder variant based on the target architecture:

// Simplified ISel driver
void ISel_LowerInstruction(context, instruction) {
    int sm = *(context + 184);          // SM version
    int opcode = instruction[18] & 0xFFFFCFFF;

    // Select architecture-variant builder
    if (sm == 14)
        Builder_VariantA(context, instruction);    // sub_B1FA20 (13 KB)
    else
        Builder_VariantB(context, instruction);    // sub_B20E00 (11 KB)

    // Apply post-ISel modifiers
    ApplyModifiers(context, instruction);           // sub_B1D670 (13 KB)
    SetProperties(context, instruction);            // sub_B241A0 (7 KB)
}

The two builder variants (sub_B1FA20 and sub_B20E00) are structurally near-identical, with 50 callees each. Both call sub_7E3EF0 (operand index helper) 6 times (3 source + 3 destination operands) and use sub_A3B930 (operand register class resolver). The key difference is the validation function: variant A uses sub_C49440, variant B uses sub_C49400, reflecting different encoding constraints for different SM families.

ISel Mega-Selector -- sub_C0EB10 (185 KB)

The single largest function in the Phase 2 ISel range: 185 KB decompiled, 6,016 lines, 719+ local variables. It performs the final Ori-IR-to-SASS opcode and operand encoding for 169 distinct instruction types (SASS opcode indices 7--221). While the ~801 DAG pattern matchers handle template-based ISel through a priority contest, the mega-selector handles complex instructions that require procedural, multi-step encoding logic -- instructions where the operand marshalling depends on runtime state (calling conventions, symbol resolution, address space aliasing).

Dual-Switch SM-Generation Dispatch

The function contains two copies of the same 169-case switch statement, separated by a vtable-based opcode translation mechanism. This dual-switch structure is the SM-generation dispatch:

// sub_C0EB10 -- simplified dispatch skeleton
void MegaSelector(context *a1, instruction *a2, isel_ctx *a3) {
    int64_t *vtable = *(a3->backend);
    int opcode = *(int *)(a2 + 8);           // SASS opcode type

    // Pre-dispatch: capability check via vtable[12]
    auto cap_check = vtable[12];              // offset +96
    if (cap_check != sub_BFEAA0)              // default stub?
        if (cap_check(a3, a2))
            ctx->flags[256] = 1;              // set encoding flag

    // Read opcode translator from vtable[2]
    auto translator = vtable[2];              // offset +16

    if (translator != sub_BFEBF0) {
        // PATH A: SM-specific translation
        int encoding_index = translator(a3, opcode);
        int isel_opcode = *(ctx + 8);         // post-translation opcode
        switch (isel_opcode) {                // PRIMARY SWITCH (169 cases)
            case 7: case 34: case 35: case 36:
                emit_simple(encoding_index, ...);
                break;
            case 8: case 38: case 46: ...
                /* already encoded */ break;
            // ... 169 cases total ...
            default: goto high_opcode_path;
        }
    } else {
        // PATH B: static table lookup (default backend)
        int encoding_index = 355;             // sentinel for extended opcodes
        if (opcode <= 0xDD)
            encoding_index = word_22B4B60[opcode];
        switch (opcode) {                     // FALLBACK SWITCH (same 169 cases)
            case 7: ...: goto handler_7;      // jumps into Path A handlers
            // ... identical case set ...
            default: return;
        }
    }

high_opcode_path:
    if (opcode > 0x199) return;
    // Try vtable[3] extension dispatch for SM 100+ / Blackwell
    auto extension = vtable[3];               // offset +24
    if (extension != sub_BFEA30)
        extension(a3, a2);                    // arch-extension handler
}

The dual-switch pattern is a code-generation artifact: the compiler emitted two copies because the vtable path and static-table path produce different values for the encoding index but need identical case routing. This roughly doubles the function's size but avoids a conditional merge point at every case entry.

Three Vtable Dispatch Points

| Vtable slot | Offset | Default stub | Purpose |
|---|---|---|---|
| vtable[2] | +16 | sub_BFEBF0 | Opcode-to-encoding-index translator. SM-specific override remaps opcodes to different encoding slots. Fallback: word_22B4B60[] static table. |
| vtable[12] | +96 | sub_BFEAA0 | Pre-dispatch capability check. Returns boolean that sets ctx[256] encoding flag. |
| vtable[3] | +24 | sub_BFEA30 | Extension opcode handler for opcodes outside the 169-case set (barrier/sync 61--63/221, opcodes > 0x199, SM 100+ extensions). |

The word_22B4B60 static table is a uint16[] array indexed by SASS opcode (0--0xDD = 221). Each entry is a SASS encoding slot index. Opcodes > 221 receive the sentinel value 355. This provides the default encoding mapping; SM-specific vtable overrides can remap any opcode to a different encoding index, enabling per-architecture instruction variants without modifying the mega-selector logic.
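The default lookup shape can be sketched as below. The table contents here are dummies; only the shape -- a uint16 array direct-indexed for opcodes 0..0xDD, with sentinel 355 for everything above -- follows the recovered word_22B4B60 behavior.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DIRECT_OPCODE 0xDD   /* 221 */
#define EXTENDED_SENTINEL 355    /* extended / out-of-table opcodes */

/* Stand-in for the word_22B4B60 static table (real contents unknown). */
static uint16_t encoding_table[MAX_DIRECT_OPCODE + 1];

/* Default (non-overridden) encoding-index lookup. */
static int encoding_index(int opcode) {
    if (opcode >= 0 && opcode <= MAX_DIRECT_OPCODE)
        return encoding_table[opcode];
    return EXTENDED_SENTINEL;
}

/* Demo with a fabricated table entry: opcode 7 -> encoding slot 12. */
static int demo_lookup(void) {
    encoding_table[7] = 12;
    return encoding_index(7) == 12
        && encoding_index(0x160) == EXTENDED_SENTINEL;  /* 352 > 0xDD */
}
```

An SM-specific vtable[2] override simply replaces this function with one that remaps the index before the switch runs.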

Opcode Case Routing

The 169 distinct opcode cases (338 total case labels across both switches) group into approximately 70 handler blocks. The groupings reveal SASS ISA families:

| Group | Opcodes | Handler pattern | Instruction family |
|---|---|---|---|
| No-op passthrough | 8, 38, 46, 87, 89, 90, 93, 97, 98, 208 | goto LABEL_33 (already encoded) | Pre-encoded by upstream ISel |
| Simple emission | 7, 34, 35, 36 | sub_9314F0(encoding_index, 1 operand) | Basic ALU / simple 1-op |
| Branch/call | 9, 10, 11, 12, 13, 22 | sub_926370 / vtable[17] / linked-list walk | Control flow, call frames |
| Memory load/store | 15, 16, 18, 19, 20, 23, 24, 25, 26, 30 | sub_C01840 + address helpers | LDG, STG, LDS, etc. |
| Control flow | 31, 32, 33 | SSA phi nodes, branch tables | Phi, switch, call return |
| Generic ALU | 39, 41, 42, 50, 51, 52, 53 | sub_9314F0 passthrough | Standard arithmetic |
| Special register | 43, 44, 45 | sub_C06E90 symbol lookup | SR access, shared memory alias |
| Constant/predicate | 47, 54, 55, 56 | Direct operand copy / sub_BFFD60 | Constant bank, predicate ops |
| Address compute | 57 | 200-line handler, "__nv_reservedSMEM_offset_0_alias" | Complex addressing with SMEM |
| Immediate ops | 59, 60 | sub_C05CC0 / sub_C07690 | Immediate-operand variants |
| Barrier/sync | 61, 62, 63, 221 | Forward to vtable[3] extension | BAR, MEMBAR, SYNC |
| Conversion/move | 65 | Operand loop with per-element sub_9314F0 | MOV, CVT |
| Texture/surface | 67, 68, 69, 70 | Multi-operand type-qualified encoding | TEX, TLD, TXQ |
| Intrinsics | 71, 74, 75 | Loop-based operand emission | Hardware intrinsics |
| Tensor core | 84, 88, 91, 92 | Wide-operand encoding (case 92 = 354 lines) | HMMA, DMMA, IMMA, TCGen05 |
| Predication ext | 94, 95 | Predicate-dependent path selection | Extended predication |
| Memory extended | 99--130 (19 opcodes) | sub_C0B2C0 or sub_BFFD60 + encoding lookup | Extended memory ops |
| Warp intrinsics | 131--189 (50+ opcodes) | Mixed handlers, vtable[198]+632 dispatch | SHFL, VOTE, MATCH, REDUX |
| Async/bulk | 192--218 (15 opcodes) | sub_C0B2C0 / individual handlers | TMA, async copy, bulk ops |

The largest case handlers:

  • Cases 141/142: ~503 lines (warp shuffle/vote extended operations)
  • Case 92: ~354 lines (tensor core instructions -- widest operand format)
  • Cases 45, 57, 95: ~200 lines each (shared memory, address compute, predication)

Operand Encoding Protocol

The mega-selector encodes operands into a stack-allocated 256-byte output buffer using a tagged-pointer word format. Each operand occupies 8 bytes (a DWORD pair):

| Bits | Field | Description |
|---|---|---|
| [31:28] of word 0 | Type tag | 0x1=register, 0x4=constant bank, 0x5=immediate, 0x6=control/modifier, 0x9=special register |
| [23:0] of word 0 | Value | Register index, immediate value, or bank offset |
| word 1 | Flags | Modifier bits, encoding-format flags |
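The tagged-word layout for word 0 can be sketched as below. The helper names are illustrative; the tag values and bit positions follow the table, and bits [27:24] (unaccounted for above) are simply left clear.

```c
#include <assert.h>
#include <stdint.h>

/* Type tags observed in bits [31:28] of operand word 0. */
enum { TAG_REG = 0x1, TAG_CBANK = 0x4, TAG_IMM = 0x5,
       TAG_CTRL = 0x6, TAG_SREG = 0x9 };

/* Pack tag into [31:28] and value into [23:0]; bits [27:24] stay zero. */
static uint32_t pack_operand(uint32_t tag, uint32_t value) {
    return (tag << 28) | (value & 0x00FFFFFFu);
}
static uint32_t operand_tag(uint32_t word)   { return word >> 28; }
static uint32_t operand_value(uint32_t word) { return word & 0x00FFFFFFu; }
```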

The marshalling pipeline for a typical case:

1. sub_C01840(ctx, instr, operand_list, output_buf, max_count, ...)
   -> Iterates source operands, writes tagged words to output_buf
   -> Returns: number of operand words written

2. sub_C01F50(ctx, instr, dest_list, output_buf, max_count, ...)
   -> Same for destination operands

3. Encoding-index lookup:
   if (vtable[2] != default)
     index = vtable[2](ctx, opcode);
   else
     index = word_22B4B60[opcode];

4. sub_9314F0(output, ctx, encoding_index, count, n_words, buf, ...)
   -> Emits the instruction record to the output stream

| Helper | Calls | Purpose |
|---|---|---|
| sub_C01840 | 52 | Marshal source operands into tagged-word buffer |
| sub_9314F0 | 31 | Emit instruction with encoding index + operand buffer |
| sub_C00EA0 | 8 | Extract single operand as tagged word |
| sub_91D160 | 8 | Encode register index to encoding bits |
| sub_934630 | 6 | Build new instruction node in IR (for multi-instruction expansion) |
| sub_91D150 | 5 | Decode register index from operand word |
| sub_926370 | 4 | Emit simple instruction (branch/jump) |
| sub_C01F50 | 3 | Marshal destination operands |
| sub_7D6860 | 3 | Encode data type qualifier (FP32/FP64/INT) |
| sub_BFEF10 | 3 | Register bank capacity check / grow |
| sub_92E1B0 | 2 | Emit instruction with constant-bank operand |

Cross-Reference: Arch Dispatch Tables

The 4 arch dispatch tables (sub_B128E0--sub_B12920) are not called from the mega-selector. They operate at the Mercury encoder level:

Mega-selector (sub_C0EB10)
  -> Produces (encoding_index, operand_buffer) pairs
  -> Calls sub_9314F0 to package into instruction nodes

Mercury encoder (sub_6D9690)
  -> Reads instruction type field from instruction node
  -> Arch dispatch tables (sub_B128E0 etc.) resolve type to encoding format
  -> Encoder emits binary SASS using format + operand data

The mega-selector and arch dispatch tables thus operate at different abstraction levels: the mega-selector decides what to encode (opcode selection, operand marshalling), while the arch tables decide how to encode it (encoding format, bit layout). The arch tables' per-SM variants handle encoding-level differences (field widths, modifier positions) that are invisible to the mega-selector's opcode-level logic.

Post-ISel Modifiers -- sub_B1D670 (13 KB)

After the main ISel selection, this pass applies architecture-specific instruction modifications:

  • Opcode 13: sets instruction field [79] = 3
  • Opcode 14: sets instruction field [79] = 2
  • Opcode 11: separate modifier path

The function has 51 callees including sub_AAD690 (field accessor, called multiple times), sub_AADF40, and sub_C49400 (encoding validator). It handles encoding mode bits, register class adjustments, and predicate attachment.

Instruction Properties -- sub_B241A0 (7 KB)

Sets scheduling-relevant properties on the selected instruction:

  • inst[74] = 7 -- scheduling class
  • inst[75] = (opcode == 325) -- special flag for specific opcode
  • inst[77] = sub_A3B930(...) -- operand class from register resolver
  • inst[79] -- derived from a2[19], architecture-dependent

Contains a switch on *(context+46) (target architecture selector), confirming per-SM property assignment.

DAG Pattern Matchers -- ~800 Functions at 0xB28F60--0xB7D000

Every pattern matcher follows an identical prototype and a strict check-and-report protocol. These are the ptxas equivalent of LLVM's TableGen-generated ISel patterns, but handwritten in C++. Binary analysis confirms 801 functions with the matching *a4 <= priority-comparison idiom, with the bulk (750+) residing in the 0xB30000--0xB7D000 range and a handful of smaller matchers in the 0xB28F60--0xB30000 preamble zone.

Pattern Matcher Architecture

The pattern matching system implements a priority-based best-match selection protocol. For each instruction being lowered, the ISel infrastructure invokes all applicable matchers (dispatched through vtable function pointers, not direct calls). Each matcher independently tests whether the instruction matches its pattern; if it does, it writes a (template_id, priority) pair to the output parameters. The dispatcher selects the match with the highest priority value.

Function signature (all 801+ matchers):

char __fastcall match(
    int64_t  ctx,           // a1: ISel context (passed through to field reader)
    int64_t  dag_node,      // a2: pointer to the Ori IR instruction node
    int32_t *template_id,   // a3: OUT: encoding template index [1..152]
    int32_t *priority       // a4: IN/OUT: current best priority; written only if better
);

The priority parameter is read-then-conditionally-written: the matcher checks if (*a4 <= threshold) before overwriting. This means the dispatcher initializes *a4 = 0 and calls matchers in sequence; each matcher only upgrades the result if its specificity exceeds the current best. After all matchers complete, *a3 holds the template index of the winning pattern.

Matching pipeline (invariant across all 801 matchers):

 1. OPCODE PROPERTY CHECKS      sub_10AE5C0(ctx, node, field_id)
    Check 1-12 instruction properties against expected values.
    Any mismatch -> return immediately (early exit).

 2. SOURCE OPERAND COUNT         sub_B28F50(node) -> source_count
    Verify the instruction has the expected number of source operands.

 3. SOURCE OPERAND VALIDATION    sub_B28F30(node, i) -> operand_record
    For each source operand:
      a. Type predicate: isImmediate / isGPR / isPredicate / isUniformReg / ...
      b. Register class: class == 1023 (wildcard) OR class == specific_value

 4. RESULT OPERAND COUNT         sub_B28F40(node) -> result_count
    Verify the expected number of result (destination) operands.

 5. RESULT OPERAND VALIDATION    sub_B28F30(node, first_result + j)
    Same type + register-class checks as for source operands.
    First-result index = sub_B28E00(*(node + 92)).

 6. PRIORITY WRITE               if (*a4 <= N) { *a4 = N+1; *a3 = template; }
    Conditional update: only overwrite if this pattern is more specific
    than whatever was already matched.

Match-Score Priority System

The priority values range from 2 (least specific) to 34 (most specific), with the distribution heavily concentrated in the 8--19 range. The priority correlates directly with pattern specificity: matchers with more constraints (more sub_10AE5C0 checks, more operand type checks, tighter register class requirements) assign higher priority values.

| Priority range | Count | Interpretation |
|---|---|---|
| 2--5 | 31 | Fallback / generic patterns (few constraints) |
| 6--10 | 253 | Common patterns (3--6 constraints) |
| 11--15 | 293 | Standard patterns (5--8 constraints) |
| 16--20 | 168 | Specific patterns (6--10 constraints) |
| 21--34 | 56 | Highly specific patterns (8--12+ constraints) |

Template IDs range from 1 to 152. Multiple matchers can target the same template ID at different priority levels, forming a specificity ladder: a generic matcher might match FADD at priority 8 while a specialized matcher matches FADD.FTZ.SAT with specific register classes at priority 17. Both write the same template ID but the specialized matcher wins when its constraints are satisfied.
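The specificity ladder can be modeled in a few lines of C. This is an illustrative sketch, not decompiled code: the Node layout, the matcher bodies, and the FTZ.SAT flag are invented for the example; only the (*a4 <= N) upgrade protocol and the dispatcher's *a4 = 0 initialization mirror the recovered idiom.

```c
#include <stdint.h>

/* Invented node layout: a single flag standing in for the property checks. */
typedef struct { int is_ftz_sat; } Node;

/* Generic FADD pattern: few constraints, low priority (8). */
static char match_fadd_generic(const Node *n, int32_t *template_id, int32_t *priority) {
    (void)n;                              /* matches any FADD-shaped node */
    if (*priority <= 7) { *priority = 8; *template_id = 12; return 1; }
    return 0;
}

/* Specialized FADD.FTZ.SAT pattern: more constraints, higher priority (17). */
static char match_fadd_ftz_sat(const Node *n, int32_t *template_id, int32_t *priority) {
    if (!n->is_ftz_sat) return 0;         /* constraint mismatch -> early exit */
    if (*priority <= 16) { *priority = 17; *template_id = 12; return 1; }
    return 0;
}

/* Dispatcher: initialize *a4 = 0, try every matcher, keep the best match.
   Both matchers write template 12, as in the FADD example above; the
   function returns the winning priority for inspection. */
static int32_t select_best(const Node *n) {
    int32_t template_id = 0, priority = 0;
    match_fadd_generic(n, &template_id, &priority);
    match_fadd_ftz_sat(n, &template_id, &priority);
    return priority;
}
```

When the FTZ.SAT constraint is satisfied, the specialized matcher upgrades the priority from 8 to 17; otherwise the generic match stands.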

Dispatcher Mechanism

The matchers are not called directly from a single dispatch function. Instead, they are registered as virtual methods on per-instruction-class descriptor objects. The dispatch chain is:

sub_B285D0 (ISel driver, 9 KB)
  -> opcode switch on (instruction[18] & 0xFFFFCFFF)
     -> selects builder variant (sub_B1FA20 / sub_B20E00 / sub_B1EC10 / ...)
        -> builder invokes vtable method on instruction descriptor
           -> vtable slot contains pointer to one of the 801 pattern matchers
              -> matcher writes (template_id, priority) if pattern matches

The vtable dispatch occurs at various offsets including +2600, +2616, +2656, and +2896 (observed in sub_13AF3D0, the 137 KB ISel pattern coordinator). The matchers have no static callers -- they appear exclusively through indirect function pointer invocation, which is why the sweep reports them as "no callers in function DB."

For a given instruction, the dispatcher may invoke multiple matchers (one per applicable template variant). Each matcher independently checks its constraints and conditionally updates the priority/template pair. After all candidates have been tried, the dispatcher reads the final template_id and uses it to select the SASS encoding template.

DAG Node Property Accessor -- sub_10AE5C0

The field reader is the most-called function in the matcher range (typically 2--12 calls per matcher, so 3,000--8,000 total invocations across all 801 matchers):

// sub_10AE5C0 -- Read instruction property by field_id
int64_t DAGNode_ReadField(int64_t ctx, int64_t node, uint32_t field_id) {
    if (sub_10E32E0(node, field_id))        // field exists in descriptor?
        return sub_10D5E60(node, field_id); // read value from property table
    else
        return 0xFFFFFFFF;                  // sentinel: field not present
}

The field_id values form a large flat namespace (observed range: 5--595). These are not byte offsets into the instruction record; they are logical property identifiers resolved through a descriptor table. The backing store (managed by sub_10E32E0 / sub_10D5E60) implements a sparse property bag that maps field IDs to integer values.
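A minimal sketch of such a sparse property bag follows. The array-of-pairs layout is invented; only the exists-then-read split and the 0xFFFFFFFF sentinel mirror the recovered behavior of sub_10E32E0 / sub_10D5E60 / sub_10AE5C0.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t field_id; uint32_t value; } Property;
typedef struct { Property props[16]; size_t count; } PropertyBag;

/* ~sub_10E32E0: does the descriptor carry this field at all? */
static int field_exists(const PropertyBag *b, uint32_t field_id) {
    for (size_t i = 0; i < b->count; i++)
        if (b->props[i].field_id == field_id) return 1;
    return 0;
}

/* ~sub_10D5E60: fetch the stored value (caller guarantees presence). */
static uint32_t field_value(const PropertyBag *b, uint32_t field_id) {
    for (size_t i = 0; i < b->count; i++)
        if (b->props[i].field_id == field_id) return b->props[i].value;
    return 0;
}

/* ~sub_10AE5C0: read by field ID, sentinel when absent. */
static uint32_t read_field(const PropertyBag *b, uint32_t field_id) {
    return field_exists(b, field_id) ? field_value(b, field_id) : 0xFFFFFFFFu;
}
```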

The companion write functions follow the same field-ID namespace:

// sub_10AE590 -- Write single field
void DAGNode_WriteField(int64_t ctx, int64_t node, uint32_t field_id, uint32_t value);

// sub_10AE640 -- Write two fields atomically (multi-field update)
void DAGNode_WriteFields(int64_t ctx, int64_t node, uint32_t f1, uint32_t v1, uint32_t v2);

Inferred semantic groupings for field IDs (from cross-referencing matcher patterns):

| Field range | Likely semantics |
|---|---|
| 5--7 | Opcode class / major instruction group |
| 88 | Sub-operation modifier |
| 105 | Operation variant selector |
| 126 | Data type qualifier (e.g., field 126 in {547,548}) |
| 163 | Addressing mode / operand encoding class |
| 190--211 | Encoding format selectors |
| 220 | Specific encoding property |
| 242 | Width/size qualifier |
| 294 | Generic constraint field |
| 327 | Register format descriptor |
| 345 | Rounding / saturation mode |
| 348 | Precision qualifier |
| 355--429 | Extended instruction properties |
| 397 | Instruction validity flag (value 2115 appears as a near-universal gate) |
| 480 | High opcode range (Blackwell/SM 100+ instructions) |
| 595 | Highest observed field ID |

Field 397 with value 2115 appears in the majority of matchers as a mandatory check, suggesting it encodes a "this instruction is encoding-compatible" or "instruction is valid for ISel" flag.

Operand Record Layout

Each operand is a 32-byte record accessed by index via sub_B28F30:

// sub_B28F30 -- Get operand record by index
int64_t GetOperand(int64_t node, int index) {
    return *(int64_t*)(node + 32) + 32LL * index;
}

The 32-byte operand record:

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 1 | type_tag | Operand kind (see predicate table below) |
| +4 | 4 | primary_class | Register class ID; 1023 = wildcard (any class) |
| +14 | 1 | modifier_a | Written by sub_B28F10 |
| +15 | 1 | modifier_b | Written by sub_B28F20 |
| +20 | 4 | secondary_class | Fallback register class constraint |

Source operand count is stored at node + 92 and doubles as the first-result-operand index:

uint32_t source_count = *(uint32_t*)(node + 92);                    // sub_B28F50
uint32_t result_count = *(uint32_t*)(node + 40) + 1 - source_count; // sub_B28F40

Operand Type Predicates

Fifteen predicate functions classify operand type tags. Each is a single comparison returning bool:

| Address | Name | Test | Semantics |
|---|---|---|---|
| sub_B28E20 | isImmediate | tag == 1 | Constant / immediate literal |
| sub_B28E10 | isGPR | tag == 2 | General-purpose register |
| sub_B28E80 | isPredicate | tag == 3 | Predicate register |
| sub_B28E70 | isType4 | tag == 4 | (specific operand class) |
| sub_B28E60 | isType5 | tag == 5 | (specific operand class) |
| sub_B28E30 | isSpecialReg | tag == 6 | Special register |
| sub_B28ED0 | isType7 | tag == 7 | (specific operand class) |
| sub_B28EF0 | isType8 | tag == 8 | (specific operand class) |
| sub_B28E50 | isType9 | tag == 9 | (specific operand class) |
| sub_B28E40 | isValidReg | tag == 10 | Generic valid register |
| sub_B28EE0 | isType11 | tag == 11 | (specific operand class) |
| sub_B28EA0 | isType13 | tag == 13 | (specific operand class) |
| sub_B28EB0 | isType14 | tag == 14 | (specific operand class) |
| sub_B28E90 | isUniformReg | tag == 15 | Uniform register (SM 75+) |
| sub_B28EC0 | isType16 | tag == 16 | (specific operand class) |

Register class 1023 serves as a wildcard: if (class == 1023 || class == expected). This allows matchers to accept both unconstrained operands and operands already assigned to a specific register file.

Register Class Constraint Protocol

Operand records carry two register class fields: primary_class at offset +4 and secondary_class at offset +20. The matching protocol checks them with a cascading OR:

// Typical register class check (from sub_B33F00, sub_B390A0, etc.)
uint32_t primary   = *(uint32_t*)(operand + 4);
uint32_t secondary = *(uint32_t*)(operand + 20);

if (sub_B28E00(primary) == 1023) {
    // Wildcard -- operand is unconstrained, accept it
} else {
    uint32_t cls = sub_B28E00(secondary);
    if (cls != expected_class) return;  // mismatch
}

sub_B28E00 and sub_B28F00 are identity functions -- the register class is stored as a plain integer, not packed. The two-field scheme allows the matcher to accept an operand where either the allocation constraint (primary) is wildcard or the resolved register file (secondary) matches.

Observed register class values in matchers:

| Class | Frequency | Likely meaning |
|---|---|---|
| 1023 | ubiquitous | Wildcard (any register class) |
| 1 | very common | 32-bit GPR (R0..R255) |
| 2 | common | 64-bit GPR pair |
| 3 | occasional | 128-bit GPR quad |
| 4 | occasional | Predicate or special register file |
| 5 | rare | Extended register class |

Representative Matcher Walkthroughs

sub_B30160 -- simple 2-source, 4-result pattern (68 lines, priority 9, template 12):

1. field 480 == 2481                    -> opcode/subclass check
2. source_count == 2                    -> expects 2 source operands
3. operand[0].type == 1 (immediate)     -> first source is a constant
4. operand[1].type == 2 (GPR)           -> second source is a register
5. operand[1].class == 1023 OR sec == 1 -> 32-bit GPR or unconstrained
6. result_count == 4                    -> expects 4 result operands
7. result[0].type == 2 (GPR)            -> first result is GPR
   result[0].class == 1023 OR sec == 1
8. result[1].type == 3 OR 15            -> predicate or uniform register
9. result[2].type == 2 (GPR)            -> third result is GPR
   result[2].class == 1023 OR sec == 1
10. if (*a4 <= 8) -> *a4 = 9, *a3 = 12
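Rendered as C, the ten steps above take the following shape. The structure layouts are simplified stand-ins for the recovered Ori IR node (a flat operand array with sources first, then results), not the real in-memory format.

```c
#include <stdint.h>

typedef struct { uint32_t type_tag, primary_class, secondary_class; } Operand;
typedef struct {
    uint32_t field_480;          /* opcode/subclass property (stand-in for sub_10AE5C0 read) */
    uint32_t source_count, total_operands;
    Operand  ops[8];             /* sources first, then results */
} Node;

static char match_B30160_like(const Node *n, int32_t *template_id, int32_t *priority) {
    if (n->field_480 != 2481) return 0;                        /* step 1: opcode check   */
    if (n->source_count != 2) return 0;                        /* step 2: 2 sources      */
    const Operand *s = n->ops;
    if (s[0].type_tag != 1) return 0;                          /* step 3: immediate      */
    if (s[1].type_tag != 2) return 0;                          /* step 4: GPR            */
    if (s[1].primary_class != 1023 && s[1].secondary_class != 1) return 0; /* step 5 */
    if (n->total_operands - n->source_count != 4) return 0;    /* step 6: 4 results      */
    const Operand *r = n->ops + n->source_count;
    if (r[0].type_tag != 2) return 0;                          /* step 7: GPR result     */
    if (r[0].primary_class != 1023 && r[0].secondary_class != 1) return 0;
    if (r[1].type_tag != 3 && r[1].type_tag != 15) return 0;   /* step 8: pred/uniform   */
    if (r[2].type_tag != 2) return 0;                          /* step 9: GPR result     */
    if (r[2].primary_class != 1023 && r[2].secondary_class != 1) return 0;
    if (*priority <= 8) { *priority = 9; *template_id = 12; }  /* step 10: report        */
    return 1;
}
```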

sub_B33F00 -- medium 2-source, 5-result pattern (4,166 bytes, priority 21, template 22):

1. field 7 == 21                            -> major opcode class
2. field 163 in {705, 706}                  -> addressing mode variant
3. field 203 in {1113..1117}                -> encoding format (5 values)
4. field 105 == 477                         -> operation variant
5. field 88 == 408                          -> sub-operation modifier
6. field 345 == 1903                        -> rounding/saturation mode
7. source_count == 2                        -> 2 sources
8. operand[0].type == 1 (immediate)         -> constant source
9. operand[1].type == 2 (GPR)              -> register source
   operand[1].class: primary wildcard or secondary in {1,2}
10. result_count == 5                       -> 5 results
11. result[0].type == 2 (GPR), class != 1023, secondary == 2 (64-bit)
12. result[1].type == 3 OR 15 (pred/uniform)
13. result[2].type == 2 (GPR), class: wildcard or secondary in {1,2}
14. result[3].type == 2 (GPR), class: wildcard or secondary in {1,2}
15. if (*a4 <= 20) -> *a4 = 21, *a3 = 22

sub_B44CA0 -- complex 0-source, 7-result pattern (6,214 bytes, priority 11, template varies):

1.  field 5 == 12                           -> opcode class 12
2.  field 220 == 1206                       -> encoding property
3.  field 595 in {2937, 2938}               -> extended field (high range)
4.  field 294 == 1493                       -> constraint
5.  field 242 in {1281, 1282}               -> width qualifier
6.  field 355 == 1943                       -> extended property
7.  field 376 == 2035                       -> extended property
8.  field 377 in {2037..2041}               -> extended property (5 values)
9.  field 429 in {2252, 2253}               -> extended qualifier
10. field 126 in {547, 548}                 -> data type
11. field 397 == 2115                       -> validity gate
12. source_count == 0                       -> no source operands
13. result_count == 7                       -> 7 result operands
14. All 7 results checked: type == 10 (valid register), various class constraints
15. if (*a4 <= 10) -> *a4 = 11, *a3 = (template)

This pattern has the most field checks (12) of the representative examples, validating properties deep into the extended field namespace (field 595). Its zero-source, seven-result shape suggests a hardware intrinsic or complex output instruction like a tensor-core operation.

sub_B28FE0 -- minimal matcher in the preamble zone (31 lines, priority 8, template 42):

1. field 211 == 1182
2. field 201 == 1109
3. field 348 in {1912, 1915}   -> precision qualifier
4. field 397 == 2115           -> validity gate
5. source_count == 0           -> no sources
6. if (*a4 <= 7) -> *a4 = 8, *a3 = 42

The simplest matchers skip operand validation entirely and rely solely on opcode-property checks. These are for instructions with fixed operand formats where the operand shape is fully determined by the opcode.

Helper Function Summary

| Address | Name | Signature | Purpose |
|---|---|---|---|
| sub_10AE5C0 | DAGNode_ReadField | (ctx, node, field_id) -> value | Read instruction property by ID; returns 0xFFFFFFFF if absent |
| sub_10AE590 | DAGNode_WriteField | (ctx, node, field_id, value) | Write single instruction property |
| sub_10AE640 | DAGNode_WriteFields | (ctx, node, f1, v1, v2) | Multi-field atomic update |
| sub_B28F30 | GetOperand | (node, index) -> operand_ptr | Index into operand array (32-byte records at *(node+32)) |
| sub_B28F40 | GetResultCount | (node) -> count | Number of result operands: node[40] + 1 - node[92] |
| sub_B28F50 | GetSourceCount | (node) -> count | Number of source operands: *(node+92) |
| sub_B28E00 | DecodeRegClass | (packed) -> class_id | Identity function (class stored as plain int) |
| sub_B28F00 | DecodeRegClass2 | (packed) -> class_id | Second identity accessor (same semantics) |
| sub_B28F10 | SetModifierA | (operand, value) | Write operand modifier at offset +14 |
| sub_B28F20 | SetModifierB | (operand, value) | Write operand modifier at offset +15 |
| sub_B28E10 | isGPR | (tag) -> bool | tag == 2 |
| sub_B28E20 | isImmediate | (tag) -> bool | tag == 1 |
| sub_B28E30 | isSpecialReg | (tag) -> bool | tag == 6 |
| sub_B28E40 | isValidReg | (tag) -> bool | tag == 10 |
| sub_B28E50 | isType9 | (tag) -> bool | tag == 9 |
| sub_B28E60 | isType5 | (tag) -> bool | tag == 5 |
| sub_B28E70 | isType4 | (tag) -> bool | tag == 4 |
| sub_B28E80 | isPredicate | (tag) -> bool | tag == 3 |
| sub_B28E90 | isUniformReg | (tag) -> bool | tag == 15 |
| sub_B28EA0 | isType13 | (tag) -> bool | tag == 13 |
| sub_B28EB0 | isType14 | (tag) -> bool | tag == 14 |
| sub_B28EC0 | isType16 | (tag) -> bool | tag == 16 |
| sub_B28ED0 | isType7 | (tag) -> bool | tag == 7 |
| sub_B28EE0 | isType11 | (tag) -> bool | tag == 11 |
| sub_B28EF0 | isType8 | (tag) -> bool | tag == 8 |

SM120 Pattern Coordinator -- sub_13AF3D0 (137 KB)

The largest ISel function in the binary (137 KB, 4,225 lines, 570+ locals). It is an architecture-specific operand-emission coordinator that runs in Phase 2 as a parallel backend to the mega-selector sub_C0EB10. The two do not call each other -- they are mutually exclusive implementations of the same ISel protocol, selected per-SM by the vtable in the ISel driver. The mega-selector covers opcodes 7--221 for the default backend; the coordinator covers opcodes 2--352 for the SM120 (consumer RTX 50xx / enterprise Pro) backend.

Position in the ISel Pipeline

sub_B285D0 (ISel driver, 9 KB)
  -> selects builder variant by SM version
     -> Builder variant vtable dispatch
        |
        +-- DEFAULT BACKEND: sub_C0EB10 (mega-selector, 185 KB)
        |     opcodes 7..221, dual-switch, word_22B4B60 encoding table
        |
        +-- SM120 BACKEND: sub_A29220 (instruction iterator, 435 lines)
              -> sub_13AF3D0 (pattern coordinator, 137 KB)
                   opcodes 2..352, single switch, inline operand emission

The coordinator is called once per instruction by sub_A29220, which walks the instruction list. Before entering the main switch, the coordinator performs a predication pre-test: if bit 0x1000 is set in the opcode word and the opcode is not 169, it queries vtable[3232/8] and optionally emits the last source operand via sub_13A6AE0.

Dispatch Structure

The coordinator reads the opcode from *(instr+72) with the standard BYTE1 & 0xCF mask (identical to Phase 1's MercConverter) and enters a single 130-case switch. Unlike the mega-selector's dual-switch encoding-slot translation, the coordinator emits operands inline -- each case directly calls sub_13A6280 (the operand emitter) with explicit operand indices.

// sub_13AF3D0 -- simplified dispatch skeleton
char PatternCoordinator(context *a1, instruction *a2, output *a3,
                        pattern_table *a4, flags a5, int a6) {
    int opcode = *(DWORD*)(a2 + 72);
    BYTE1(opcode) &= 0xCF;

    // Pre-dispatch: predication check when bit 0x1000 is set
    if ((*(a2+72) & 0x1000) && opcode != 169) {
        if (vtable[3232/8] == sub_868720 || vtable[3232/8]())
            EmitLastSource(a1[1], a2, operand_count - 2*flag, a3);
    }

    // Setup output context
    vtable[104/8](output_ctx, a1, &context_ref);

    switch (opcode) {
    case 2: case 4: case 7:     // FMA/MAD 2-source
        operand_span = 16; src_count = 2;
        goto SHARED_FMA_HANDLER;
    case 3: case 5:             // FMA/MAD 3-source
        operand_span = 24; src_count = 3;
        goto SHARED_FMA_HANDLER;
    case 6:                     // IMAD/IADD3 with 3+ sources
        EmitOperand(ctx, instr, 3, out);
        EmitOperand(ctx, instr, 4, out);
        EmitOperand(ctx, instr, 5, out);
        break;
    case 8:                     // Pure vtable dispatch (vtable+2328)
        vtable[2328/8](a1, a2, operand_count, a3, a5, 0);
        break;
    case 10: case 11: case 151: case 152: case 290: case 291:
        vtable[2768/8](a1, a2, a3, a4, a5);   // Memory load/store
        break;
    case 16:                    // Texture/surface (163-line handler)
        for (i = first_src; i < last_src; i++)
            EmitOperand(ctx, instr, i, out);   // loop up to 15 operands
        break;
    // ... 120 more cases ...
    case 329:                   // Variable-count loop + vtable+2328
        for (i = 0; i < src_count; i++)
            EmitOperand(ctx, instr, i, out);
        vtable[2328/8](a1, a2, remaining, a3, a5, 0, 0, 0);
        break;
    default:
        break;                  // no-op passthrough
    }
}

Opcode Case Routing

The 130 distinct case labels (spanning 82 distinct handler blocks) cover the full SASS opcode range including SM100+/SM120 extensions:

| Opcodes | Handler pattern | Instruction family |
|---|---|---|
| 2, 3, 4, 5, 7 | Shared FMA handler with operand-span parametrization | FMA/MAD variants (32/64-bit) |
| 6 | Inline 3-source emission + optional operands 6/7 | IMAD/IADD3 wide |
| 8 | Pure vtable+2328 delegation | Builder-only instructions |
| 10, 11, 151, 152, 290, 291 | vtable+2768 delegation | Memory load/store |
| 16 | 163-line operand loop (up to 15 sources) | Texture/surface |
| 20, 21 | vtable+2680/2688 with stub check | Memory/store alternates |
| 22, 77, 83, 297, 352 | vtable+2744 with nullsub_463 check | Control flow |
| 24, 34, 209, 213, 214 | Passthrough: emit src 1 + dst 2 | Simple 2-operand ALU |
| 29, 95, 96, 190 | Conditional operand-6 check | Predicate-source instructions |
| 38, 59, 106, 180, 182, 192, 194, 215, 221 | Single EmitOperand(1) at high SM | Generic ALU |
| 42, 53, 55 | EmitOperand(1) | Paired ALU |
| 60, 61, 62, 63, 64 | Comparison / inner sub-opcode switch (case 61: 5 sub-cases) | Compare / set-predicate |
| 88, 89 | Variable source count (2 or 3) with sign-dependent offsets | Extended FMA |
| 110, 111, 114, 115, 117 | Warp operand emission | Warp shuffle / vote |
| 120, 121, 126, 127 | Barrier handler with operand loop at LABEL_53 | Barrier / sync |
| 139, 140, 141, 143 | sub_13A4DA0 for commutative operand selection | Commutative ALU |
| 183 | Extended memory with register-class-6 check | Wide memory |
| 201, 202, 204 | vtable+2328 delegation | Async / bulk operations |
| 270, 279, 282, 285, 325--328 | Goto LABEL_53 (barrier/sync shared handler) | Extended memory / warp |
| 280, 281 | vtable+2896 with nullsub_239 check, then LABEL_53 | Sync instructions |
| 329 | Variable-count operand loop + vtable+2328 | Variable-width encoding |

Three Competing-Match Selection Mechanisms

The coordinator selects among competing pattern matchers through three mechanisms:

1. LABEL_750 -- vtable alternate-match dispatch. Six opcode paths (cases 6, 36, 130, 137, plus opcodes reaching LABEL_119 when sub_7D6850 confirms a double-precision operand) jump to LABEL_750:

LABEL_750:
    replacement = vtable[16/8](output_ctx, instruction);
    *output = replacement;
    return;

This is the "try architecture-specific alternate" escape hatch. The vtable slot at offset +16 on the ISel context object points to an SM-specific matcher. If it succeeds, the coordinator's inline emission is entirely bypassed and the replacement instruction is written to the output.

2. sub_13A4DA0 -- commutative operand position selector. Called 12 times for commutative instructions (FMA, IADD3, comparison) where source operands can be swapped for better encoding. The function holds up to 4 pattern entries at offsets +12/+16 through +36/+40, each a (lo_word, hi_word_mask) pair. It tests operand properties via sub_13A48E0 against each entry; the first match returns a preferred operand index. The coordinator then calls sub_13A6280 with the returned index instead of the default.

// sub_13A4DA0 -- simplified
int SelectOperandSlot(pattern_table, instruction, default_slot, alt_slot, out_match) {
    if (!pattern_table->active) return default_slot;
    uint64_t operand_desc = GetOperandDescriptor(instruction, default_slot);
    for (i = 0; i < pattern_table->count; i++) {  // up to 4 entries
        if (operand_desc matches pattern_table->entry[i])
            { *out_match = entry[i].preferred; return default_slot; }
    }
    // Repeat with alt_slot if no match on default_slot
    operand_desc = GetOperandDescriptor(instruction, alt_slot);
    for (i = 0; i < pattern_table->count; i++) {
        if (operand_desc matches pattern_table->entry[i])
            { *out_match = entry[i].preferred; return alt_slot; }
    }
    return default_slot;
}

3. Inline vtable override checks. Many cases test whether a vtable function pointer equals a known null-stub before calling it. The stub addresses serve as sentinel values -- when the vtable slot has been overridden by an SM-specific implementation, the coordinator calls the override:

| Vtable offset | Default stub | Purpose |
|---|---|---|
| +2680 | sub_A8CBE0 | Memory operation alternate matcher |
| +2688 | sub_A8CBF0 | Store operation alternate matcher |
| +2744 | nullsub_463 | Control flow alternate |
| +2632 | nullsub_233 | Move/convert alternate |
| +2760 | nullsub_235 | Atomic/barrier alternate |
| +2896 | nullsub_239 | Sync instruction alternate |
| +3232 | sub_868720 | Pre-dispatch predication alternate |
| +3112 | sub_A8CCA0 | MADC alternate (case 36) |

When the vtable slot holds the stub, the coordinator skips the call and proceeds with its inline emission logic.
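The stub-sentinel test can be sketched as follows. The function names here are invented stand-ins (nullsub_default for a stub like nullsub_463; inline_emit and sm_override for the two possible paths); only the compare-against-stub-then-call shape mirrors the coordinator.

```c
typedef int (*alt_fn)(int opcode);

static int nullsub_default(int op) { (void)op; return 0; }  /* stand-in for nullsub_463 */
static int inline_emit(int op)     { return op + 1000; }    /* coordinator's inline path */
static int sm_override(int op)     { return op + 2000; }    /* hypothetical SM-specific alternate */

/* The vtable slot address itself is the sentinel: call through it only
   when something other than the default stub has been installed. */
static int dispatch(alt_fn slot, int opcode) {
    if (slot != nullsub_default)
        return slot(opcode);      /* overridden -> SM-specific alternate */
    return inline_emit(opcode);   /* stub -> proceed with inline emission */
}
```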

Primary Callee: sub_13A6280 (239 lines)

The operand emitter, called 83 times. It reads the operand at instruction[operand_index + 10] (each operand is 8 bytes starting at instruction + 84), checks the type tag at bits [31:28], and emits:

  • Tag 1 (register): fast-path returns if register class == 6 (UB/dead register). Otherwise reads the register descriptor from *(context+88)[reg_index], checks register class at descriptor offset +64.
  • Tags 2/3 (constant/immediate): calls sub_7DBC80 to validate constant-bank availability, then sub_A9A290 for type-5 immediate expansion. Delegates to vtable methods at *(*(context+1584) + 1504) and *(*(context+1584) + 3248).
  • Other types: pass through to the vtable dispatch chain.

The third parameter (operand index) ranges from 0 to 7 across the coordinator's call sites, with 0/1/2/3 being the most common (corresponding to the first 4 source operands in the Ori IR instruction layout).
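The tag-in-high-nibble layout the emitter decodes can be sketched in two one-liners. This is a minimal model consistent with the tag prefixes listed for the Mercury encoder (0x1 = register, 0x5 = immediate/constant, ...); the payload interpretation of the low 28 bits is an assumption.

```c
#include <stdint.h>

/* Type tag lives in bits [31:28] of the operand word. */
static uint32_t operand_tag(uint32_t word)     { return word >> 28; }

/* Remaining 28 bits assumed to hold the payload (e.g., register index). */
static uint32_t operand_payload(uint32_t word) { return word & 0x0FFFFFFFu; }
```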

Function Map Additions

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_13AF3D0 | 137 KB | SM120 ISel pattern coordinator (130-case switch, 83x operand emission) | HIGH |
| sub_A29220 | 435 lines | Instruction iterator / coordinator caller (per-instruction walk) | HIGH |
| sub_13A6280 | 239 lines | Operand emitter (type-tag dispatch, register class 6 fast-path) | HIGH |
| sub_13A7410 | -- | Destination operand emitter (with register class 6 check) | MEDIUM |
| sub_13A6AE0 | -- | Pre-dispatch source emitter (predicated instruction operands) | MEDIUM |
| sub_13A4DA0 | 180 lines | Commutative operand position selector (4-entry pattern table) | HIGH |
| sub_13A6F90 | -- | Extended destination emitter (3rd variant, class 6 check) | MEDIUM |
| sub_13A6790 | -- | Fenced memory operand emitter | MEDIUM |
| sub_13A45E0 | -- | Extra operand emitter (operands 6/7 for wide instructions) | MEDIUM |
| sub_13A5ED0 | -- | Modifier flag emitter (operands with 0x18000000 bits) | MEDIUM |
| sub_13A75D0 | -- | Register class 6 (UB) operand substitution handler | MEDIUM |
| sub_13A48E0 | -- | Operand property extractor (for sub_13A4DA0 matching) | MEDIUM |

Architecture Dispatch Tables -- 4 Copies at sub_B128E0--sub_B12920

Four nearly identical functions provide architecture-variant opcode dispatch. Each is only 13 binary bytes (3 instructions -- a thunk into shared code at 0x1C39xxx), yet the function database lists them at 15,049 bytes each and the decompiled output runs to 79,562 bytes, because the decompiler follows the thunk into the massive shared switch statement they jump into.

Each table contains a switch on *(a3+12) (the opcode word field) with 50+ cases, and secondary switches on *(a3+14) (opcode sub-field) within certain cases. The return values are SASS encoding slot indices (e.g., 197, 691, 526, 697, 772, 21). The four copies serve different SM architecture families, mapping the same logical opcode to different encoding slots depending on the target.

Opcode Variant Selectors

Two specialized variant selectors handle the final opcode-to-encoding mapping for specific instruction families:

sub_B0BE00 (19 KB) -- opcode class 194:

  • Massive switch on a2 (100+ cases)
  • All cases call sub_10AE590(ctx, inst, 194, N) with sequential N values starting from 827
  • Pattern: case K -> sub_10AE590(ctx, inst, 194, 826+K)
  • Maps sub-variant indices to SASS encoding slots for one PTX opcode family

sub_B0AA70 (5 KB) -- opcode class 306:

  • Same pattern but with opcode class 306
  • Variants numbered 1680--1726 with non-sequential case indices (2, 3, 8, 9, 14, 15, 20, 21, 26, 27, 30, 31, 36, 37, 40, 41, ...)
  • The alternating-pair pattern at stride 6 suggests type-width combinations (e.g., F32/pair, F64/pair, S32/pair, ...)

Instruction Modifier Dispatchers

Two modifier-application functions run after the main ISel selection to set type modifiers, rounding modes, and register width:

sub_B13E10 (5,792 B) -- basic modifier dispatcher:

  • All 21 callees are sub_10AE640 (DAG node modifier)
  • Switch on BYTE1(a7) & 0x1F for modifier type
  • Maps modifier values 1--6 to internal codes 31--35
  • Secondary dispatch on (a7 >> 3) for register width encoding

sub_B157E0 (11,815 B) -- extended modifier dispatcher:

  • All 37 callees are sub_10AE640
  • Handles texture/surface operations specially (opcode type 18)
  • Maps sub-opcodes (BYTE5(a7) & 0x3F) to encoding values 54--60

Mercury Master Encoder -- sub_6D9690 (94 KB)

The Mercury master encoder is the single largest backend function and the final instruction selection point before binary emission. It contains a massive switch on the instruction type field (read from instruction+8) covering all SASS instruction formats. While its primary role is encoding (documented in Mercury Encoder Pipeline and SASS Instruction Encoding), the switch itself performs the final opcode-to-encoding-format selection:

// Simplified encoding flow
void EncodeInstruction(context, instruction) {
    int type = *(int*)(instruction + 8);
    uint64_t base = 0x2000000000LL;     // encoding base constant

    switch (type) {
    case 61:    // FFMA with literal operand
        sub_6D9580(ctx, operand);       // encode literal
        break;
    case 455:   // complex multi-operand format
        // bit-field extraction and assembly
        break;
    // ... hundreds of cases ...
    }

    // Common tail: append operand words, commit
    sub_6D2750(ctx, word);              // append 8-byte operand word
    sub_6D28C0(ctx);                    // commit instruction record
}

Key encoding dispatch details:

  • Operand word type prefix in bits [31:28]: 0x1 = register, 0x5 = immediate/constant, 0x6 = control/modifier, 0x7 = literal, 0x9 = special
  • sub_7D6860 handles data type encoding (FP32/FP64/INT)
  • sub_C00BF0 provides opcode lookup from the encoding tables
  • Architecture-specific bits accumulated via SM 100+ extensions controlled by knob 4176

MercExpand -- Pseudo-Instruction Expansion

sub_C3CC60 (26 KB) runs as phase 118 (MercExpandInstructions) and expands Mercury pseudo-instructions into concrete SASS sequences. This is the third and final instruction selection point -- where abstract instruction forms that survived through ISel and Mercury encoding are replaced by their concrete multi-instruction implementations.

| Handler | Size | Instruction class |
|---|---|---|
| sub_C37A10 | 16 KB | General instruction expansion (jump table, 4+ cases) |
| def_C37B2E | 13 KB | Complex expansion cases (default handler, string "EXPANDING") |
| sub_C39B40 | 10 KB | Memory operations (LDG, STG, LDS, etc.) |
| sub_C3A460 | 6 KB | Atomic operations |
| sub_C3B560 | 8 KB | Texture operations |
| sub_C3BCD0 | 19 KB | Control flow (branches, jumps, calls) |
| sub_C3E030 | 18 KB | Finalization and cleanup |

The expansion creates new instruction nodes, links them into the doubly-linked list, and deletes the original pseudo-instruction. After all expansions, sub_C3E030 performs post-expansion verification. The expansion engine also uses sub_719D00 (50 KB), which builds output for expanded instructions across different operand widths (32/64/128-bit, predicate) -- four near-identical code blocks corresponding to template instantiations over operand width types.

OCG Encoding Template Lookup -- sub_C3F490

The OCG (Optimizing Code Generator) intrinsic pipeline on SM100+ does not use the ISel mega-selector or DAG pattern matchers. Instead, the OCG router (sub_6CC690, documented in Intrinsics) assigns each instruction one of 7 internal routing values and passes it to the SASS instruction emitter sub_6CB8A0. These routing values are not Ori IR opcodes, not binary SASS opcodes, and not encoding slot indices from word_22B4B60. They are a small, closed set of keys that exist solely to select an operand gathering template inside sub_C3F490.

Routing values assigned by the OCG router

| Value | Hex | Instruction class | Assigned when |
|---|---|---|---|
| 70 | 0x46 | Memory-ordered load/store/atomic (with barrier) | Barrier register present (v108 != 0 in conditional paths) |
| 243 | 0xF3 | Default memory operation | Fallback for general memory ops without barrier or special fence |
| 245 | 0xF5 | Load variant (LD/LDG/LDS) | Load-type operations (from OCG load/store handler) |
| 246 | 0xF6 | Reduction/atomic default | Atomic operations and reductions |
| 247 | 0xF7 | Fenced memory operation (LDGSTS) | Operations requiring memory fence semantics |
| 257 | 0x101 | Async copy without memory order | Bulk copy ops when no barrier: v108 == 0 selects 257, else 70 |
| 261 | 0x105 | Atomic with pre-existing value read | Atomic exchange / compare-and-swap returning old value |
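The v108-dependent split between values 257 and 70 for bulk copies reduces to a single conditional (v108 follows the decompiler's variable numbering; the helper name is invented for illustration):

```c
/* Bulk copy routing, as recovered: no barrier register (v108 == 0)
   selects the async-copy key 257, otherwise the barrier-ordered key 70. */
static int route_bulk_copy(int v108) {
    return v108 == 0 ? 257 : 70;
}
```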

How sub_C3F490 maps routing values to encoding templates

sub_C3F490 is a pure lookup function (184 bytes) that takes a routing value plus 7 boolean modifier flags and returns a pointer to an operand gathering template in .data at 0x22B8960--0x22BB460. The function is a nested if-else tree: the first level selects on the routing value, and inner branches refine the template choice based on the modifier flags.

sub_C3F490(routing_value, a2..a8) -> template_ptr
    a2: has pre-existing-value operand (used only by value 257)
    a3: SM generation > sm_7x (SM80+)
    a4: has predicate attachment
    a5: has scope/fence operand (SM generation > sm_8x && memory_order == 4)
    a6: (always 0 from OCG emitter, used by MercExpand callers)
    a7: (always 0 from OCG emitter, used by MercExpand callers)
    a8: (always 0 from OCG emitter, used by MercExpand callers)

The OCG emitter (sub_6CB8A0) always passes a6=a7=a8=0, which means the OCG path only reaches a subset of template leaves. The MercExpand callers (sub_C41100, sub_C40420, sub_C40B90, sub_C42330) pass all 7 flags and can reach the full template space. The returned template is a packed array: template[0] is the operand count, followed by operand slot indices that reference positions in the 39-QWORD operand buffer (v134[]). The emitter iterates over these indices, gathers the tagged operand words, builds control words from bitfields, and calls sub_9314F0 to commit the encoded instruction.
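The template walk itself is a small indirection loop, sketched below. The template contents are invented for the example; only the count-then-slot-indices layout and the 39-QWORD operand buffer mirror the recovered format.

```c
#include <stdint.h>
#include <stddef.h>

/* template_bytes[0] = operand count; template_bytes[1..count] = slot
   indices into the 39-QWORD operand buffer (v134[] in the decompilation). */
static size_t gather_operands(const uint8_t *template_bytes,
                              const uint64_t operand_buffer[39],
                              uint64_t out[8]) {
    size_t count = template_bytes[0];
    for (size_t i = 0; i < count; i++)
        out[i] = operand_buffer[template_bytes[1 + i]];  /* indirect slot fetch */
    return count;
}
```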

Two additional routing values (254, 262) are handled by sub_C3F490 but are never assigned by the OCG router -- they originate exclusively from the MercExpand memory instruction handlers, where the routing value is read from the instruction's opcode field (instr[18] masked with & 0xCFFF).

| Value | Hex | Origin | Instruction class |
|---|---|---|---|
| 254 | 0xFE | MercExpand only | Extended memory format (operand gather mode 3) |
| 262 | 0x106 | MercExpand only | Wide memory format (operand gather mode 0, with scope/fence branches) |

Template address space

The 40+ distinct templates returned by sub_C3F490 occupy a contiguous .data region:

Address range            Routing values served
0x22B8960--0x22B8E60     257 (async copy variants)
0x22B8E60--0x22B9360     70 (barrier memory variants)
0x22B9360--0x22B9860     262 (MercExpand wide memory)
0x22B9860--0x22B9E60     247, 245 (fenced / load variants)
0x22B9E60--0x22BA960     243, 246, 70 (default / reduction / barrier sub-variants)
0x22BA960--0x22BB460     Leaf templates for bare operand forms (no modifiers)

Each template is 256 bytes (0x100). For a given routing value, the modifier flags select progressively simpler templates as flags are cleared: the most complex template (all modifiers active) is reached first in the if-chain, and the simplest (no modifiers) is the final fallback.

Addressing Mode Selection

Addressing mode selection is distributed across Phases 1 and 2. During Phase 1, the operand processing function sub_6273E0 (44 KB) classifies PTX operand forms into internal categories. During Phase 2, the ISel driver and Mercury encoder select the optimal SASS addressing mode based on the register-allocated operand forms.

PTX addressing modes and their SASS encodings:

PTX syntax      Addressing mode     SASS instruction        Encoding
[%rd1]          Register indirect   LDG.E R0, [R2]          Register + zero offset
[%rd1+16]       Register + offset   LDG.E R0, [R2+0x10]     Register + immediate offset
c[2][0x100]     Constant bank       LDC R0, c[0x2][0x100]   Bank index + offset
[%rd1], %r2     Base + index        STG.E [R2], R4          Separate base/data registers

Special string references in sub_6273E0 confirm complex addressing:

  • ".nv.reservedSmem.offset0" -- reserved shared memory region
  • "COARSEOFFSET" -- coarse-grained offset computation for large address spaces
  • "__$endLabel$__%s" -- label generation for structured control flow

The ISel mega-selector (sub_C0EB10) references "__nv_reservedSMEM_offset_0_alias" for shared memory alias resolution during final encoding.

Vtable Dispatcher Zone -- 0xAF0000--0xB10000

The range 0xAF0000--0xB10000 contains approximately 2,735 tiny vtable method implementations (average 160 bytes) that form the instruction encoding hierarchy. These implement polymorphic instruction property queries:

// Typical vtable method (sub_AFXXXX, ~160 bytes)
int64_t get_property(int64_t a1, unsigned int a2) {
    if (a2 <= N)
        return (unsigned int)dword_XXXXXXX[a2];  // table lookup
    return default_value;
}

Each function maps a small integer index to an encoding constant, answering questions like "what is the register class for operand N of this instruction?" The 0xAF0000--0xB00000 sub-range has 1,269 functions (all under 200 bytes), while 0xB00000--0xB10000 has 1,466 with slightly more complex logic (13 exceeding 1 KB).

Comparison with LLVM ISel

Aspect                  LLVM                                         ptxas
ISel framework          SelectionDAG or GlobalISel (single pass)     Two-phase: MercConverter (phase 5) + ISel driver (phase 112+)
Pattern specification   TableGen .td files, machine-generated        Handwritten C++ (~750 functions)
Pattern count           Target-dependent (thousands for x86)         ~801 DAG matchers + 185 KB mega-selector
Architecture dispatch   Subtarget feature bits                       4 architecture dispatch tables + vtable overrides
Intermediate form       MachineInstr (already selected)              Ori IR (SASS opcodes after phase 5, not yet encoded)
Encoding                MCInst emission (separate pass)              Integrated: ISel + Mercury encode in same pipeline
Expansion               Pseudo-instruction expansion in AsmPrinter   MercExpand (phase 118, post-ISel)
Optimization post-ISel  MachineFunction passes                       Phases 14--111 (full optimizer runs between Phase 1 and Phase 2)

The key architectural difference: LLVM performs instruction selection once, then optimization happens on already-selected machine instructions. ptxas selects SASS opcodes early (phase 5) so the optimizer can reason about SASS-level semantics, then performs a second selection/encoding pass after optimization is complete. This two-phase design gives the optimizer accurate cost models (it sees real SASS opcodes, not abstract PTX operations) at the cost of architectural complexity.

Function Map

Address                    Size      Confidence  Identity
sub_C0EB10                 185 KB    HIGH        ISel mega-selector (719 locals, dual 169-case switch, SM-generation dispatch)
sub_6D9690                 94 KB     VERY HIGH   Mercury master encoder (instruction type switch)
sub_9F1A90                 35 KB     HIGH        MercConverter main instruction conversion pass
sub_9EF5E0                 27 KB     HIGH        Post-MercConverter lowering ("CONVERTING")
sub_C3CC60                 26 KB     HIGH        MercExpand::run (pseudo-instruction expansion)
sub_9ED2D0                 25 KB     HIGH        MercConverter opcode dispatch (master switch, & 0xCF mask)
sub_9E6600                 25 KB     HIGH        Instruction expansion (64-bit split)
sub_9EC340                 23 KB     MEDIUM      Multi-operand instruction legalization
sub_B0BE00                 19 KB     HIGH        Opcode variant selector (class 194, 100+ cases)
sub_C3BCD0                 19 KB     MEDIUM      MercExpand::expandControlFlow
sub_9D76D0                 18 KB     HIGH        Memory instruction legalization (load/store)
sub_C3E030                 18 KB     MEDIUM      MercExpand::finalizeExpansion
sub_9D80E0                 17 KB     HIGH        Memory instruction legalization (variant)
sub_9E8B20                 17 KB     MEDIUM      Texture/surface lowering
sub_C37A10                 16 KB     HIGH        MercExpand::expandInstruction (jump table)
sub_B128E0--sub_B12920     15 KB x4  HIGH        Architecture dispatch tables (4 SM families)
sub_B1FA20                 13 KB     HIGH        SASS 3-operand builder (variant A)
sub_B1D670                 13 KB     HIGH        Post-ISel instruction modifier
def_C37B2E                 13 KB     HIGH        MercExpand complex cases ("EXPANDING")
sub_B157E0                 12 KB     HIGH        Extended modifier dispatcher (37 callees)
sub_B20E00                 11 KB     HIGH        SASS 3-operand builder (variant B)
sub_C39B40                 10 KB     MEDIUM      MercExpand::expandMemoryOp
sub_9DA100                 9 KB      HIGH        Arithmetic operation handler (case 6)
sub_B285D0                 9 KB      HIGH        ISel lowering driver (66 callees)
sub_B241A0                 7 KB      HIGH        SASS instruction property setter
sub_9F3340                 7 KB      HIGH        MercConverter orchestrator ("After MercConverter")
sub_C3A460                 6 KB      MEDIUM      MercExpand::expandAtomicOp
sub_B13E10                 6 KB      HIGH        Basic modifier dispatcher (21 callees)
sub_B0AA70                 5 KB      HIGH        Opcode variant selector (class 306)
sub_9DA5C0                 2 KB      MEDIUM      Opcode class 1 handler
sub_13AF3D0                137 KB    HIGH        SM120 ISel pattern coordinator (130-case switch, 83x sub_13A6280, opcodes 2--352)
sub_A29220                 ~17 KB    HIGH        SM120 instruction iterator (calls sub_13AF3D0 per instruction)
sub_13A6280                ~10 KB    HIGH        Operand emitter (type-tag dispatch, register class 6 fast-path)
sub_13A4DA0                ~7 KB     HIGH        Commutative operand position selector (4-entry pattern table)
sub_13A7410                --        MEDIUM      Destination operand emitter (with register class 6 check)
sub_13A6AE0                --        MEDIUM      Pre-dispatch source emitter (predicated instruction operands)
sub_13A6F90                --        MEDIUM      Extended destination emitter (3rd variant, class 6 check)
sub_13A6790                --        MEDIUM      Fenced memory operand emitter
sub_13A45E0                --        MEDIUM      Extra operand emitter (wide instruction operands 6/7)
sub_13A5ED0                --        MEDIUM      Modifier flag emitter (operands with 0x18000000 bits)
sub_13A48E0                --        MEDIUM      Operand property extractor (for sub_13A4DA0 matching)
sub_10AE5C0                tiny      VERY HIGH   DAGNode_ReadField (field_id to value, delegates to sub_10D5E60)
sub_10AE590                tiny      VERY HIGH   DAGNode_WriteField (single field write)
sub_10AE640                tiny      VERY HIGH   DAGNode_WriteFields (multi-field update)
sub_B28F30                 tiny      VERY HIGH   GetOperand (index into 32-byte operand array at *(node+32))
sub_B28F40                 tiny      VERY HIGH   GetResultCount (node[40] + 1 - node[92])
sub_B28F50                 tiny      VERY HIGH   GetSourceCount (*(node+92))
sub_B28E00                 tiny      VERY HIGH   DecodeRegClass (identity function, class is plain int)
sub_B28E10                 tiny      VERY HIGH   isGPR operand predicate (tag == 2)
sub_B28E20                 tiny      VERY HIGH   isImmediate operand predicate (tag == 1)
sub_B28E40                 tiny      VERY HIGH   isValidReg operand predicate (tag == 10)
sub_B28E80                 tiny      VERY HIGH   isPredicate operand predicate (tag == 3)
sub_B28E90                 tiny      VERY HIGH   isUniformReg operand predicate (tag == 15)
sub_B28F60--sub_B74C60     ~1.3 MB   HIGH        ~801 DAG pattern matchers (priority 2--34, template 1--152)
sub_C01840                 --        HIGH        Mega-selector source operand marshaller (52 calls from mega-selector)
sub_C01F50                 --        HIGH        Mega-selector destination operand marshaller
sub_C00EA0                 --        HIGH        Single operand extractor (returns tagged operand word)
sub_BFFD60                 --        HIGH        Operand reference resolver (register ref to encoding word)
sub_C06E90                 --        HIGH        Symbol/special-register lookup for shared memory
sub_C07690                 --        MEDIUM      Immediate-operand encoding helper
sub_C0B2C0                 --        HIGH        Extended memory/warp operation encoder
sub_C05CC0                 --        MEDIUM      Immediate operation encoder (flag-dependent path)
sub_BFEBF0                 tiny      VERY HIGH   Default vtable[2] stub (opcode translator, no-op identity)
sub_BFEAA0                 tiny      VERY HIGH   Default vtable[12] stub (capability check, always false)
sub_BFEA30                 tiny      VERY HIGH   Default vtable[3] stub (extension handler, no-op)
sub_BFEF10                 --        MEDIUM      Register bank capacity check / grow
word_22B4B60               --        VERY HIGH   Static opcode-to-encoding-index table (uint16[222], default backend)
sub_C3F490                 184 B     VERY HIGH   OCG encoding template lookup (routing value + 7 flags -> template ptr)
sub_6CB8A0                 --        HIGH        OCG SASS instruction emitter (calls sub_C3F490 then sub_9314F0)
sub_C41100                 --        HIGH        MercExpand memory encoder (calls sub_C3F490 with full flag set)
sub_C40420                 --        HIGH        MercExpand memory encoder variant (calls sub_C3F490)
sub_C40B90                 --        HIGH        MercExpand memory encoder variant (calls sub_C3F490)
sub_C42330                 --        HIGH        MercExpand memory encoder variant (calls sub_C3F490)
unk_22B8960--unk_22BB460   ~11 KB    HIGH        Operand gathering templates (40+ entries, 256 B each)

Cross-References

SASS Instruction Encoding

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The SASS instruction encoder is the single largest subsystem in ptxas by function count. It translates the internal Ori IR instruction representation into packed binary SASS machine code for a specific SM target. The encoder comprises approximately 4,000 template-generated handler functions dispatched through function-pointer tables indexed by opcode, plus six massive switch-dispatch megafunctions that route field-level queries by instruction category. The core encoding primitive is a single 216-byte bitfield-insert function (sub_7B9B80) called from 18,347 sites throughout the binary. NVIDIA internally names this pipeline phase "Ori Phase Encoding" within the Mercury assembler backend.

Pipeline phase                 OriPhaseEncoding (within Mercury)
Core bitfield packer           sub_7B9B80 (216 bytes, 18,347 callers)
Encoding buffer                1280 bits = 20 QWORDs at a1+544
Instruction widths             64-bit (format 1), 128-bit (format 2), 256-bit (format 8)
Opcode hierarchy               3-level: major (9 bits) / minor (8 bits) / sub-opcode (7 bits)
SM100 encoder count            ~1,086 encode functions + ~97 decode functions
SM100 opcode categories        370 (case values 0x0 through 0x171)
SM100 major opcodes            102 unique values
Bitfield accessor primitives   2,095 functions (mostly under 200 bytes)
Confirmed strings              "AdvancedPhaseOriPhaseEncoding", "MercEncodeAndDecode", "After EncodeAndDecode", "ENCODING"

Encoding Buffer Layout

Every encoder operates on an instruction encoding context object passed as a1. The primary encoding target is a 1280-bit (160-byte, 20 QWORD) buffer at offset a1+544. The bitfield packer sub_7B9B80 writes individual fields into this buffer by iterating in 64-bit chunks:

// sub_7B9B80(a1, bit_offset, bit_width, value)
// Insert `value` into bits [bit_offset .. bit_offset+bit_width) of the encoding buffer
// (simplified view; the full reconstruction appears under "Bitfield Packer Detail")
void bitfield_insert(int64_t a1, int bit_offset, int bit_width, uint64_t value) {
    uint64_t mask = (bit_width == 64) ? ~0ULL : (1ULL << bit_width) - 1;
    value &= mask;
    int qword_idx = bit_offset >> 6;        // which QWORD
    int bit_in_qword = bit_offset & 63;     // bit position within QWORD
    *(uint64_t*)(a1 + 8 * qword_idx + 544) |= value << bit_in_qword;
    // Handle fields that cross a QWORD boundary
    if (bit_in_qword + bit_width > 64)
        *(uint64_t*)(a1 + 8 * (qword_idx + 1) + 544) |= value >> (64 - bit_in_qword);
}

The encoding context object has this layout:

Offset           Size    Content
+0x000           8 B     vtable / allocator pointer
+0x008           16 B    Format descriptor (xmmword constant from rodata)
+0x010           4 B     Bitfield position base index
+0x018           120 B   Register class maps (3 arrays of 10 DWORDs: source classes, dest classes, widths)
+0x090           4 B     Operand count (a1+144)
+0x094--+0x110   --      Explicit operand mapping table (pairs of index + bit position)
+0x194           32 B    Extended operand attributes (from xmmword tables)
+0x1D4--+0x214   64 B    Constant buffer slot table (16 DWORD slots, cleared to 0xFF by sub_7B9D30)
+0x214           4 B     Constant buffer slot counter (a1+532)
+0x218           8 B     Encoding validation context pointer (a1+536)
+0x220           8 B     Instruction bits [63:0] (a1+544)
+0x228           8 B     Instruction bits [127:64] (a1+552)
+0x230+          --      Additional encoding space (up to 1280 bits total)
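
The layout above can be sketched as a C struct. This is a hedged reconstruction: field names are invented, and the 4-byte position base at +0x10 is assumed to occupy the third DWORD of the 16-byte descriptor loaded at +0x8 (the two rows overlap; Phase 5 of the encoder template reads it as *(a1+16)):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hedged reconstruction of the encoding context layout; names invented. */
struct enc_ctx {
    void    *vtable;                   /* +0x000 vtable / allocator pointer   */
    uint32_t format_desc[2];           /* +0x008 first half of the xmmword    */
    uint32_t bit_pos_base;             /* +0x010 bitfield position base index */
    uint32_t format_desc_hi;           /* +0x014 rest of the xmmword          */
    uint32_t reg_class_maps[3][10];    /* +0x018 3 arrays of 10 DWORDs        */
    uint32_t operand_count;            /* +0x090 (a1+144)                     */
    uint32_t operand_map[31];          /* +0x094..+0x110 index/bit-pos pairs  */
    uint8_t  _gap0[0x194 - 0x110];     /* +0x110 unmapped                     */
    uint8_t  ext_attrs[32];            /* +0x194 extended operand attributes  */
    uint8_t  _gap1[0x1D4 - 0x1B4];     /* +0x1B4 unmapped                     */
    uint32_t cbuf_slots[16];           /* +0x1D4 cleared to 0xFF              */
    uint32_t cbuf_slot_count;          /* +0x214 (a1+532)                     */
    void    *validation_ctx;           /* +0x218 (a1+536)                     */
    uint64_t bits[20];                 /* +0x220 1280-bit encoding buffer     */
};
```

The offsetof values of this struct line up with the a1+144, a1+532, a1+536, and a1+544 accesses seen throughout the decompiled encoders.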

Instruction Word Format

SASS instructions use a 3-level opcode hierarchy packed into the first 32 bits of the encoding buffer. The format code in bits [0:3] determines instruction width:

128-bit instruction word:
  bits [0:3]     = 0x2       (format code: 128-bit)
  bits [4:6]     = 0x0       (scheduling group slot 0)
  bits [8:16]    = MAJOR     (9-bit major opcode, 0x00-0x171)
  bits [17:24]   = MINOR     (8-bit minor opcode / variant)
  bits [25:31]   = SUBOP     (7-bit sub-opcode / format ID)
  bits [48+]     = MODIFIERS (format-dependent modifier fields)
  bits [132:134] = 0x0       (extended opcode flag, at offset 0x84)

64-bit instruction word:
  bits [0:3]     = 0x1       (format code: 64-bit)
  bits [4:6]     = 0x0       (scheduling group slot 0)
  bits [8:16]    = MAJOR     (9-bit major opcode)
  bits [17:24]   = MINOR     (8-bit minor opcode)
  bits [25:31]   = SUBOP     (7-bit sub-opcode)
  (no bit 132 extended flag -- only 5 initial sub_7B9B80 calls)

256-bit instruction word:
  bits [0:3]     = 0x8       (format code: 256-bit)
  (used for IMAD.WIDE variants with 16 constant-bank operand slots)

The 128-bit format uses 6 initial sub_7B9B80 calls (including one at offset 0x84 for the extended flag). The 64-bit format uses only 5 (no 0x84 call). This is the reliable way to distinguish the two during analysis.

The maximum variant value observed is 0x2F (47 decimal), so each major opcode can have up to 48 sub-operations, though most have far fewer.
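
The header fields above can be unpacked with plain shifts and masks; a sketch (struct and function names invented):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: unpack the 3-level opcode hierarchy from the first 32 bits of
 * the encoding buffer. */
typedef struct { unsigned fmt, major, minor, subop; } opcode_fields;

static opcode_fields decode_opcode(uint32_t w0) {
    opcode_fields f;
    f.fmt   = w0 & 0xF;            /* bits [0:3]  : 1 = 64b, 2 = 128b, 8 = 256b */
    f.major = (w0 >> 8) & 0x1FF;   /* bits [8:16] : 9-bit major opcode          */
    f.minor = (w0 >> 17) & 0xFF;   /* bits [17:24]: 8-bit minor opcode          */
    f.subop = (w0 >> 25) & 0x7F;   /* bits [25:31]: 7-bit sub-opcode            */
    return f;
}
```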

Encoder Template

Every encoding handler function follows an identical 10-phase template. The only differences between the approximately 1,086 encoder functions for SM100 are the specific constant values and which modifier-encoding helpers are called. This is textbook C++ template/macro expansion:

int64_t encode_OPCODE_VARIANT(int64_t a1, int64_t a2) {
    // a1 = instruction encoding context (output)
    // a2 = Ori IR instruction node (input)

    // Phase 1: Set instruction format header
    sub_7B9B80(a1, 0, 4, FORMAT_CODE);     // bits[0:3] = 1 (64b) / 2 (128b) / 8 (256b)
    sub_7B9B80(a1, 4, 3, 0);              // bits[4:6] = sched group slot 0
    sub_7B9B80(a1, 0x84, 3, 0);           // bits[132:134] = extended flag (128-bit only)
    sub_7B9B80(a1, 8, 9, OPCODE_CLASS);   // bits[8:16] = major opcode
    sub_7B9B80(a1, 0x11, 8, OPCODE_MINOR); // bits[17:24] = minor opcode / variant
    sub_7B9B80(a1, 0x19, 7, FORMAT_ID);   // bits[25:31] = sub-opcode / format ID

    // Phase 2: Load operand format descriptor
    *(xmmword*)(a1 + 8) = xmmword_23FXXXX; // 128-bit format field layout from rodata
    // Copy 3 arrays of 10 DWORDs into a1+24..a1+140 (slot sizes, types, flags)

    // Phase 3: Set operand count and modifier table
    *(int*)(a1 + 144) = NUM_SOURCE_OPERANDS;
    *(xmmword*)(a1 + 404) = xmmword_YYYYYYY; // modifier descriptor table

    // Phase 4: Initialize encoding context
    sub_7B9D30(a1);         // clear constant buffer slot table (memset +468, 0xFF, 64)
    sub_7B9D60(a1, a2, 0);  // encode reuse flags + guard predicate

    // Phase 5: Encode primary opcode ID
    void* ctx = *(void**)(a1 + 536);
    int opcode = sub_10BFxxx(*(a2+32) + 32 * *(a2+40));  // extract from IR operand
    int encoded = sub_10B6180(ctx, opcode);                // map through lookup table
    sub_7B9B80(a1, 8 * *(a1+16), 1, encoded);             // insert at computed position

    // Phase 6: Encode source operands (variable number and types)
    sub_7BC030(a1, a2, 0, 0x60);  // register operand 0 at bit offset 0x60
    sub_7BC030(a1, a2, 1, 0x70);  // register operand 1 at bit offset 0x70
    sub_7BCF00(a1, a2, 2, 0x88);  // immediate operand 2 at bit offset 0x88
    sub_7BC5C0(a1, a2, 3, 0x98);  // predicate operand 3 at bit offset 0x98

    // Phase 7: Encode instruction-specific modifiers
    int mod_val = sub_10B96A0(a2);              // read modifier from IR node
    int enc_mod = sub_10B3680(ctx, mod_val);    // validate and map
    *(int64_t*)(a1+544) |= ((int64_t)enc_mod << 55) & 0x180000000000000LL;

    // Phase 8: Encode explicit operand mapping (source operands with data offsets)
    *(int*)(a1 + 148) = operand_index;
    *(int*)(a1 + 152) = 8 * bit_position;
}

Operand Type Encoders

Four type-specific helper functions encode operands into the instruction word. Each reads the operand descriptor from the IR instruction's operand table at *(a2+32) + 32*operand_index.

Register Operand Encoder -- sub_7BC030

814 bytes, 6,147 callers. Encodes a general-purpose register (R0-R255, UR0-UR63):

// sub_7BC030(insn, ir_insn, operand_index, bit_offset)
void encode_register(int64_t a1, int64_t a2, int op_idx, int bit_off) {
    if (op_idx >= *(int*)(a2 + 92))  // check operand count
        return;
    void* operand = *(void**)(a2 + 32) + 32 * op_idx;
    int reg_type_raw = *(int*)(operand + 20);

    // Map register file type to 4-bit encoding:
    //   1->0, 2->1, 3->2, 4->3, 5->4, 6->5, 7->6, 8->7,
    //   16->8, 32->9, 64->10, 128->11
    int reg_type = map_regfile(reg_type_raw);
    int reg_num = *(int*)(operand + 4);  // signed register number

    sub_7B9B80(a1, bit_off, 1, 1);           // 1-bit presence flag
    sub_7B9B80(a1, bit_off + 1, 4, reg_type); // 4-bit register type
    sub_7B9B80(a1, bit_off + 6, 10, reg_num); // 10-bit register number
}

The register file type encoding maps raw operand type codes to a 4-bit hardware register file selector. The 12 supported raw values (1 through 8, then 16, 32, 64, 128) cover GPR, uniform registers, predicate registers, special registers, and extended register files.
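
As a sketch, the mapping reduces to the following (the helper name map_regfile comes from the pseudocode above; behavior outside the 12 listed raw codes is an assumption):

```c
#include <assert.h>

/* Hedged reconstruction of the raw-to-4-bit register-file mapping used by
 * sub_7BC030; returns -1 for unsupported raw codes. */
static int map_regfile(int raw) {
    if (raw >= 1 && raw <= 8)
        return raw - 1;        /* 1..8 -> 0..7 */
    switch (raw) {
        case 16:  return 8;
        case 32:  return 9;
        case 64:  return 10;
        case 128: return 11;
        default:  return -1;   /* unsupported register file code */
    }
}
```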

Immediate / Constant-Buffer Encoder -- sub_7BCF00

856 bytes, 1,657 callers. Encodes immediate values and constant memory references (c[bank][offset]):

// sub_7BCF00(insn, ir_insn, operand_index, bit_offset)
void encode_immediate(int64_t a1, int64_t a2, int op_idx, int bit_off) {
    void* operand = *(void**)(a2 + 32) + 32 * op_idx;
    int type = *(uint8_t*)operand;

    if (type == 14 || type == 15 || type == 16) {
        // Predicate-typed immediate: store to constant buffer slot table
        *(void**)(a1 + 468 + 8 * *(int*)(a1 + 532)) = operand + 8;
        *(int*)(a1 + 532) += 1;
    }
    sub_7B9B80(a1, bit_off, 1, 1);               // presence flag
    sub_7B9B80(a1, bit_off + 11, 5, *(int*)(operand + 4)); // 5-bit value
}

Predicate Encoder -- sub_7BC5C0

416 bytes, 1,449 callers. Encodes predicate register operands (PT, P0-P6):

// sub_7BC5C0(insn, ir_insn, operand_index, bit_offset)
void encode_predicate(int64_t a1, int64_t a2, int op_idx, int bit_off) {
    void* operand = *(void**)(a2 + 32) + 32 * op_idx;
    // pred_type, pred_cond, pred_value are read from the operand descriptor
    sub_7B9B80(a1, bit_off, 2, pred_type);       // 2-bit predicate type
    sub_7B9B80(a1, bit_off + 3, 3, pred_cond);   // 3-bit condition code
    sub_7B9B80(a1, bit_off + 8, 8, pred_value);  // 8-bit predicate value
}

Uniform Register Encoder -- sub_7BC360

Used for uniform registers (UR0-UR63) and source operands with alternative bitfield layouts. 126 calls in the SM100 encoding range. Likely handles the UR register file, which has a separate encoding namespace from the main GPR file.

Instruction Format Groups

The encoder functions are organized into 16 format groups, identified by the xmmword constant loaded at a1+8. Each xmmword holds the field layout descriptor for that instruction format. The groups divide into two categories:

128-bit Formats (11 groups)

Format Descriptor   Format ID   Encoder Count   Description
xmmword_23F1DF8     0x03        145             General ALU/memory -- most common format
xmmword_23F29A8     0x19        117             Extended format for complex instructions
xmmword_23F21B0     0x0A        99              Multi-source ALU operations
xmmword_23F2678     0x13        27              Tensor/extended ALU
xmmword_23F2018     0x07        9               Miscellaneous ALU
xmmword_23F2348     0x0D        6               Specialized ALU
xmmword_23F2EF8     0x23        5               Extended variant
xmmword_23F2810     0x16        4               Bulk data / DMA
xmmword_23F2128     0x09        2               Rare format
xmmword_23F2DE8     0x21        2               Rare extended
xmmword_23F25F0     0x12        2               Rare format

64-bit Formats (5 groups)

Format Descriptor   Encoder Count   Description
xmmword_23F1F08     113             Short-form general -- widest opcode coverage (27 classes)
xmmword_23F1D70     41              Short-form 4-operand
xmmword_23F1F90     11              Short-form variant
xmmword_23F2238     8               Short-form variant
xmmword_23F2C50     1               Minimal format

The 128-bit group (format code 2) encodes long-form SASS instructions (ALU, load/store, texture, tensor core). The 64-bit group (format code 1) encodes short-form instructions (simple moves, branches, barriers, NOP-like control). Two additional functions use format code 8 (256-bit) for IMAD.WIDE variants with 16 constant-bank operand slots.

Instruction Format Group Catalog

Format Descriptor Architecture

Each format group is defined by a 128-bit xmmword constant stored in rodata at addresses 0x23F1xxx--0x23F2xxx. This descriptor is loaded via SSE into the encoding context at a1+8:

*(__m128i *)(a1 + 8) = _mm_loadu_si128(&xmmword_23F29A8);

Immediately following each xmmword in rodata are three arrays of 10 DWORDs that define the operand slot layout. The encoder copies these into the context object at a1+24 through a1+140:

Rodata Array       Context Offset     Content
dword_XXXXX8[10]   a1+24 .. a1+60     Operand slot sizes (bits per slot)
dword_XXXXE0[10]   a1+64 .. a1+100    Operand slot types (register class selector)
dword_XXXXX8[10]   a1+104 .. a1+140   Operand slot flags (encoding mode flags)

Observed slot-size values: 10 = register (10-bit number + overhead), 12 = register with type, 17 = immediate/cbuf, -1 = unused. Slot-type values: 28 = register-type, 0 = basic, -1 = unused. Slot-flag values: 0 = default, 2 = secondary (uniform/extended), -1 = unused.

The copy uses SSE aligned loads for 16-byte chunks and scalar DWORD stores for remainders. The alignment check visible in every decompiled encoder (if (a1 + 120 <= &dword_XXXXX8 || a1 + 24 >= &dword_XXXXX8)) is a compiler-generated overlap guard for the memcpy-like bulk copy.
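
The load-and-copy sequence can be sketched without the SSE machinery (names invented; the real code uses SSE loads guarded by the overlap check described above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the descriptor load plus slot-array copy into the context. */
struct slot_layout {
    uint32_t sizes[10];   /* bits per slot: 10, 12, 17, or -1 = unused  */
    uint32_t types[10];   /* 28 = register-type, 0 = basic, -1 = unused */
    uint32_t flags[10];   /* 0 = default, 2 = secondary, -1 = unused    */
};

static void load_format(uint8_t *ctx, const uint8_t desc[16],
                        const struct slot_layout *rodata) {
    memcpy(ctx + 8, desc, 16);                 /* a1+8: format descriptor    */
    memcpy(ctx + 24, rodata, sizeof *rodata);  /* a1+24..a1+140: 3x10 DWORDs */
}
```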

Bitfield Packer Detail -- sub_7B9B80

The core encoding primitive. 216 bytes compiled, 18,347 callers total. Inserts an arbitrary-width bitfield into the 1280-bit buffer at a1+544:

// sub_7B9B80(a1, bit_offset, bit_width, value)
// Reconstructed algorithm from decompiled code (the decompiler's nested
// do/while + while(1) structure is flattened into a single loop here):
__int64 bitfield_insert(__int64 a1, uint32_t bit_offset, int bit_width, uint64_t value) {
    uint32_t end = bit_offset + bit_width;
    for (uint32_t pos = 0; pos < 1280; pos += 64) {
        uint32_t chunk_end = pos + 64;
        if (bit_offset > pos + 63 || end <= pos)
            continue;  // field does not overlap this QWORD

        uint32_t start = (bit_offset >= pos) ? bit_offset : pos;
        uint32_t stop  = (end <= chunk_end) ? end : chunk_end;
        int width = stop - start;
        int shift_right = (pos > bit_offset) ? (int)(pos - bit_offset) : 0;  // bits already written
        int bit_in_qword = start & 0x3F;
        __int64 *qword = (__int64*)(a1 + 8 * (start >> 6) + 544);
        uint64_t shifted = value >> shift_right;

        if (width == 64)
            *qword |= shifted << bit_in_qword;
        else
            *qword |= (shifted & ~(-1ULL << width)) << bit_in_qword;
    }
    return 1280;
}

Key properties:

  • Handles cross-QWORD-boundary fields: a 9-bit opcode starting at bit 59 writes 5 bits to QWORD 0 and 4 bits to QWORD 1
  • Loop terminates at bit position 1280 (20 QWORDs), hard ceiling
  • For typical field widths (1--9 bits), executes 1--2 iterations
  • Called 8--12 times per encoder function (average ~10)
  • The 256-bit format encoders call it with wider fields (up to 32 bits for data values)
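
The boundary split in the first bullet can be demonstrated with a minimal re-implementation of the same insert semantics (this is not the decompiled code; widths are limited to 63 bits for simplicity):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal sketch of the sub_7B9B80 insert semantics for widths < 64. */
static void insert_field(uint64_t buf[20], unsigned off, unsigned width, uint64_t value) {
    value &= (1ULL << width) - 1;
    unsigned q = off >> 6, b = off & 63;
    buf[q] |= value << b;                  /* low part lands in QWORD q     */
    if (b + width > 64)                    /* field crosses into next QWORD */
        buf[q + 1] |= value >> (64 - b);
}
```

Inserting a 9-bit field at bit 59 writes 5 bits to QWORD 0 and 4 bits to QWORD 1, exactly as described above.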

128-bit Format 0x03 -- General ALU/Memory (145 encoders)

The most populous format group. Handles the bread-and-butter ALU and memory instructions.

Property          Value
Descriptor        xmmword_23F1DF8
Format ID         0x03 (bits[25:31])
Slot arrays       dword_23F1E08, dword_23F1E30, dword_23F1E40
Operand slots     2--7 per instruction
Typical pattern   3 reg + 1 imm + 1 pred (5 slots)
Modifier fields   4--8 per instruction
Opcode classes (29): 0x08, 0x0B, 0x0F, 0x10, 0x16, 0x17, 0x19, 0x1A, 0x1B, 0x20, 0x22, 0x25, 0x28, 0x2A, 0x2B, 0x30, 0x32, 0x34, 0x35, 0x36, 0x37, 0x38, 0x3B, 0x41, 0x45, 0x4A, 0x4B, 0x5B, 0x67.

128-bit Format 0x19 -- Extended Complex (117 encoders)

Second most common. Used for instructions with rich modifier fields or unusual operand configurations.

Property          Value
Descriptor        xmmword_23F29A8
Format ID         0x19 (bits[25:31])
Slot arrays       dword_23F29B8, dword_23F29E0, dword_23F2A08
Operand slots     3--6 per instruction
Modifier fields   5--8 per instruction

Opcode classes (8): 0x0F, 0x10, 0x1A, 0x1B, 0x22, 0x38, 0x4D, 0x5E. Notable concentration: opcode 0x1B has 41 variants in this format alone (tensor/MMA family); opcode 0x5E has 26 variants. The load/store family (0x38) uses this format for 7 of its 16 variants -- the ones with extended addressing modes.

128-bit Format 0x0A -- Multi-Source ALU (99 encoders)

Designed for instructions with 4--7 source operands. Heavily weighted toward rich ALU operations.

Property          Value
Descriptor        xmmword_23F21B0
Format ID         0x0A (bits[25:31])
Operand slots     4--7 per instruction
Typical pattern   4 reg + 1 imm + 1 pred

Opcode classes (10): 0x10, 0x16, 0x17, 0x20, 0x25, 0x28, 0x2A, 0x45, 0x4B, 0x67. Opcode 0x2A dominates with 30 variants; opcode 0x25 has 18.

128-bit Format 0x13 -- Tensor/Extended ALU (27 encoders)

Contains the most complex encoders in the binary. Opcode 0x5A variant 0x02 (sub_D89C90, 2015 bytes) has 18 modifier fields -- the maximum observed.

Property          Value
Descriptor        xmmword_23F2678
Format ID         0x13 (bits[25:31])
Slot arrays       dword_23F2688, dword_23F26B0, dword_23F26D8
Operand slots     4--7 per instruction
Modifier fields   8--18 per instruction

Opcode classes (7): 0x10, 0x16, 0x17, 0x1A, 0x41, 0x5A, 0x67.

128-bit Formats 0x07, 0x0D, 0x23, 0x16, 0x09, 0x21, 0x12 -- Rare Formats (35 encoders combined)

Descriptor        Format ID   Encoders   Opcode Classes
xmmword_23F2018   0x07        9          0x0F, 0x10
xmmword_23F2348   0x0D        6          0x0F, 0x16, 0x67
xmmword_23F2EF8   0x23        5          0x10
xmmword_23F2810   0x16        4          0x4B (bulk/DMA)
xmmword_23F2128   0x09        2          --
xmmword_23F2DE8   0x21        2          --
xmmword_23F25F0   0x12        2          0x4B

Format 0x16 and 0x12 share opcode class 0x4B, suggesting they encode different addressing-mode variants of the same bulk-data instruction.

64-bit Format A (xmmword_23F1F08) -- Short-Form General (113 encoders)

Widest opcode coverage of any single format. Covers 27 distinct opcode classes with few variants each -- the simple, common instructions.

Property           Value
Descriptor         xmmword_23F1F08
Operand slots      0--3 per instruction
Register offsets   0x40, 0x50, 0x60, 0x70

Opcode classes (27): 0x00--0x09, 0x0A--0x0F, 0x10, 0x11, 0x12, 0x14, 0x16, 0x1B, 0x1C, 0x20, 0x21, 0x23, 0x25. Many of these are NOP/control, simple moves, and compact branches.

64-bit Format B (xmmword_23F1D70) -- Short-Form 4-Operand (41 encoders)

Bimodal operand count: either 0 operands (control instructions) or 4 operands (compact arithmetic with all-register sources).

Opcode classes: 0x00--0x09, 0x10, 0x12, 0x14--0x1E, 0x26, 0x28, 0x2A.

64-bit Formats C, D, E -- Specialized Short Forms (20 encoders combined)

Descriptor        Encoders   Notes
xmmword_23F1F90   11         Short-form variant C
xmmword_23F2238   8          Short-form variant D
xmmword_23F2C50   1          Minimal format, single encoder; also appears in 128-bit category with 0 uses

Distinguishing 64-bit vs 128-bit Encoders

The 128-bit format sets the extended opcode flag at bit offset 0x84, which the 64-bit format does not:

128-bit (6 initial sub_7B9B80 calls):
  sub_7B9B80(a1, 0,    4, 2)     // format code = 2
  sub_7B9B80(a1, 4,    3, 0)     // sched group slot
  sub_7B9B80(a1, 0x84, 3, 0)    // extended flag at bit 132  <-- PRESENT
  sub_7B9B80(a1, 8,    9, MAJ)   // major opcode
  sub_7B9B80(a1, 0x11, 8, MIN)   // minor opcode
  sub_7B9B80(a1, 0x19, 7, FID)   // format ID

64-bit (5 initial sub_7B9B80 calls):
  sub_7B9B80(a1, 0,    4, 1)     // format code = 1
  sub_7B9B80(a1, 4,    3, 0)     // sched group slot
                                  // NO 0x84 call           <-- ABSENT
  sub_7B9B80(a1, 8,    9, MAJ)   // major opcode
  sub_7B9B80(a1, 0x11, 8, MIN)   // minor opcode
  sub_7B9B80(a1, 0x19, 7, FID)   // format ID

The 256-bit format (format code 8) is used by exactly 2 encoders for IMAD.WIDE (major 0x59, minor 0x02 and 0x03), each with 16 constant-buffer operand slots encoded via sub_7BCF00.

Dispatch Tables -- The Six Megafunctions

Six switch-dispatch megafunctions in the 0x10C0B20--0x10E32E0 range form the central routing logic of the instruction codec. All six switch on the opcode category at *(WORD*)(a1+12) with up to 370 cases (0x0 through 0x171), each containing sub-switches on field ID:

Function      Size     Decompiled Lines   Callers   Purpose
sub_10C0B20   180 KB   9,231              3,109     setField -- write a value into a named field
sub_10D5E60   197 KB   6,491              961       getFieldOffset -- return bit-offset of a named field
sub_10E32E0   187 KB   6,240              72        hasField -- boolean: does this instruction have field X?
sub_10CCD80   142 KB   7,581              4         setFieldDefault -- write hardcoded default for a field
sub_10CAD70   68 KB    1,864              74        getOperandFieldOffset -- bit-offset of a per-operand field
sub_10C7690   65 KB    2,313              288       setOperandField -- write a per-operand field value

sub_10C0B20 (setField) is one of the most-called functions in the entire ptxas binary at 3,109 call sites. It writes field values using sub_AF80xx writer stubs and contains jump-out targets (0xAF43F0, 0xAF44C0, 0xAF4550) for bit-manipulation operations that cross word boundaries.

sub_10D5E60 (getFieldOffset) returns extractor(a1+48, bit_position) + base_offset for each field, where base_offset is a field-specific constant (observed values: +125, +790, +1278, etc.). Returns 0xFFFFFFFF when the queried field does not exist in the given instruction category.

sub_10CAD70 (getOperandFieldOffset) operates on per-operand packed records at *(QWORD*)(a1+32) + 32*operand_index + 24. The field IDs it handles include: 1 (register class), 2 (operand type), 7, 8, 12 (operand size), 13 (address mode), 14, 15, 19, 24, 25, 26, 27, 29.

Dead cases (opcode categories without the queried field) share a common pattern: cases 3, 0x11, 0x24, 0x26, 0x2D, 0x75, 0x78, 0x84, 0x8C-0x9F, 0xA1-0xBA, 0xBE-0x16F return 0xFFFFFFFF or false.
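
The overall shape of these megafunctions is a two-level switch; a sketch of the getFieldOffset pattern (the categories, field IDs, and base offsets here are illustrative only, and the real sub_10D5E60 has up to 370 outer cases):

```c
#include <assert.h>
#include <stdint.h>

#define NO_FIELD 0xFFFFFFFFu   /* returned for dead cases */

/* Illustrative two-level dispatch: opcode category, then field ID. */
static uint32_t get_field_offset(unsigned category, unsigned field_id,
                                 uint32_t extractor_bits) {
    switch (category) {
    case 0x00:                                   /* hypothetical ALU category */
        switch (field_id) {
        case 1:  return extractor_bits + 125;    /* base offsets like +125    */
        case 2:  return extractor_bits + 790;
        default: return NO_FIELD;
        }
    case 0x38:                                   /* hypothetical memory category */
        switch (field_id) {
        case 1:  return extractor_bits + 1278;
        default: return NO_FIELD;
        }
    default:                                     /* dead case: no such field  */
        return NO_FIELD;
    }
}
```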

Bitfield Accessor Library

The 0x10B0000--0x10BF2C0 range contains 2,095 machine-generated bitfield read/write primitives for the 192-bit packed instruction format. These are the building blocks that the six megafunctions call:

  • 1,661 functions under 200 bytes: pure getters/setters for individual fields
  • 412 functions between 200-500 bytes: multi-field accessors
  • 22 functions above 500 bytes: complex accessors with validation

Seven core extractors handle all bitfield reads:

Function      Width   Storage Format
sub_10B28E0   1-bit   192-bit (3x QWORD)
sub_10B2860   2-bit   192-bit
sub_10B27E0   3-bit   192-bit
sub_10B2760   4-bit   192-bit
sub_10B26E0   5-bit   192-bit
sub_10B2650   2-bit   32-bit array
sub_10B25C0   3-bit   32-bit array

The 192-bit format (3 QWORDs = 24 bytes, stored at a1+48) handles boundary crossing: if a bitfield spans a QWORD boundary, the extractor combines partial reads from adjacent words. The 32-bit-array format is used for sub-fields that are naturally DWORD-aligned.
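
A sketch of such a 192-bit extractor (function name invented; these primitives handle small widths, 1-5 bits):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a 3-QWORD field extractor in the style of the sub_10B2xxx
 * primitives; combines partial reads when a field spans a QWORD boundary. */
static uint64_t extract_192(const uint64_t w[3], unsigned off, unsigned width) {
    unsigned q = off >> 6, b = off & 63;
    uint64_t v = w[q] >> b;
    if (b + width > 64)                 /* field spans a QWORD boundary */
        v |= w[q + 1] << (64 - b);
    return v & ((1ULL << width) - 1);
}
```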

A typical accessor is trivially simple:

// sub_10BEF80 (140 bytes)
int get_field_X(int64_t a1) {
    return (*(uint32_t*)(a1 + 24) & 3) + 51;  // extract 2-bit field, add base
}

Modifier Encoding

After operands are encoded, each handler packs instruction-specific modifier fields into the bits at a1+544 (primary word) and a1+552 (extended word). The pattern is:

  1. Read modifier value from IR node via a property extractor (sub_10B9xxx family)
  2. Validate and map through an encoding lookup table (sub_10B3xxx/sub_10B4xxx family)
  3. OR the result into the packed word at a shifted/masked position
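The three-step pattern can be sketched as follows; the lookup table contents, shift amount, and mask here are hypothetical placeholders rather than recovered constants:

```c
#include <stdint.h>

/* Step 2's encoding table: a hypothetical identity map standing in for
 * the sub_10B3xxx/sub_10B4xxx lookup tables. */
static const uint64_t rounding_encode[4] = { 0, 1, 2, 3 };

/* Steps 1-3 of the modifier pattern: the IR value (already read by a
 * property extractor) is mapped through the table and ORed into the
 * packed word at a shifted/masked position. */
static void encode_modifier(uint64_t *packed_word, unsigned ir_value,
                            unsigned shift, uint64_t mask) {
    uint64_t enc = rounding_encode[ir_value & 3];  /* step 2: table lookup */
    *packed_word |= (enc << shift) & mask;         /* step 3: OR into word */
}
```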

The most commonly used modifier-encoding functions:

| Function | Callers | Bits | Likely Meaning |
| --- | --- | --- | --- |
| sub_10B6180 | 8,091 | 1 | Boolean flag (.S, .STRONG, .SAT, etc.) |
| sub_10B6160 | 2,205 | 1 | Boolean flag (.NEG, .ABS, etc.) |
| sub_10B6140 | 1,645 | 1 | Boolean flag variant |
| sub_10B2D90 | 538 | 2 | Data type, rounding mode |
| sub_10B5580 | 475 | 5 | Shift amount, cache policy |
| sub_10B44E0 | 416 | 2 | Addressing mode |
| sub_10B6220 | 363 | 3 | Register bank, cache level |
| sub_10B4650 | 330 | 4 | Type qualifier, address mode |
| sub_10B47F0 | 243 | 4 | Type qualifier variant |
| sub_10B2F00 | 151 | 3 | 3-bit modifier field |
| sub_10B2F20 | 101 | 4 | 4-bit modifier field |

Modifier fields per instruction range from 0 (simple control instructions) to 18 (the most complex encoder, sub_D89C90 for opcode class 0x5A). The average is approximately 6 modifier fields per encoder. Bit positions in a1+544 concentrate in bits 48-63; bit positions in a1+552 concentrate in bits 0-11.

Physical Register Encoding

The SASS instruction encoder uses a two-stage pipeline to convert abstract virtual registers into hardware register fields in the final instruction word. The first stage (Ori encoding, described above in "Register Operand Encoder") packs register type and number into operand slots within the 1280-bit encoding buffer. The second stage (SASS emission) maps the compiler's abstract (register_class, sub_index) pair into an 8-bit hardware register number and writes it into the final 128-bit instruction word. This second stage is implemented by the register-class encoding tables at address range 0x1B4C000--0x1B76000 (Zone A of the emission backend).

Class-to-Hardware Formula

sub_1B6B250 (2965 bytes, 254 callers, 0 callees) is a fully unrolled lookup table that implements the mapping:

hardware_reg = register_class * 32 + sub_index

The function takes two integer arguments (a1, a2) where a1 is the register class (0--5) and a2 is the sub-register index within that class. It is compiled as a deeply nested if-chain covering all 156 valid (class, index) combinations. The decompiler output is 495 lines of cascading conditionals, but every return value satisfies the formula a1 * 32 + a2 exactly:

// sub_1B6B250 -- reconstructed from decompiled lookup table
__int64 register_class_to_hardware(int reg_class, int sub_index) {
    // Valid classes: 0, 1, 2, 3, 4, 5
    // Valid sub-indices: 1..15, 17..27 (index 0 and 16 excluded)
    // Returns 0 for any unmatched input (fallthrough).
    if (reg_class < 0 || reg_class > 5)
        return 0;
    if (sub_index < 1 || sub_index > 27 || sub_index == 16)
        return 0;
    return 32 * reg_class + sub_index;
}

The guard wrapper sub_1B73060 (19 bytes, 483 callers) short-circuits the no-register case:

// sub_1B73060 -- guard wrapper
__int64 encode_register_guarded(__int64 ctx, int reg_class, int sub_index) {
    if (reg_class | sub_index)
        return register_class_to_hardware(reg_class, sub_index);
    else
        return 0;  // no register
}

Per-Class Hardware Number Ranges

Each class occupies a 32-number stride in the hardware register namespace. Within each stride, indices 1--15 and 17--27 are populated (26 registers per class). Index 0 maps to the no-register sentinel via the guard wrapper. Index 16 is absent from the lookup table -- a gap in every class.

| Class | a1 | Hardware Range | Populated Indices | Gap | Likely Register File |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0--27 | 1--15, 17--27 | 16 | R (GPR primary) |
| 1 | 1 | 32--59 | 1--15, 17--27 | 48 | R (GPR secondary) |
| 2 | 2 | 64--91 | 1--15, 17--27 | 80 | P (predicate) |
| 3 | 3 | 96--123 | 1--15, 17--27 | 112 | UR (uniform GPR) |
| 4 | 4 | 128--155 | 1--15, 17--27 | 144 | UR (uniform ext) |
| 5 | 5 | 160--187 | 1--15, 17--27 | 176 | P/UP (uniform pred) |

Hardware numbers 28--31 (and the corresponding padding in each class) are unused, providing alignment to 32-register boundaries. The maximum hardware register number produced by the table is 187 (class 5, index 27). The 8-bit encoding field can represent 0--255, so values 188--255 are reserved.

The index-16 gap in every class is consistent across all 6 classes. This likely corresponds to a hardware-reserved slot or a register numbering convention where physical register class*32+16 has special semantics (potentially a sentinel or a register-file-boundary marker).

Split Bitfield Writer

sub_1B72F60 (32 bytes, 483 callers) writes the 8-bit hardware register number into the SASS instruction word. The encoding is split across two non-contiguous bitfields within a single DWORD:

// sub_1B72F60 -- register field writer (decompiled verbatim)
__int64 write_register_field(__int64 a1, int encoded_reg) {
    __int64 buf = *(_QWORD *)(a1 + 112);   // instruction encoding buffer
    __int64 result = *(_DWORD *)(buf + 12)  // DWORD at byte offset 12
                   | ((_WORD)encoded_reg << 9) & 0x3E00u;     // low 5 bits -> [13:9]
    *(_DWORD *)(buf + 12) = result
                   | (encoded_reg << 21) & 0x1C000000;        // high 3 bits -> [28:26]
    return result;
}

Bit-level layout within the DWORD at *(instruction_buffer + 12):

DWORD bits:  31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7 ..  0
                      [  h2:h0  ]                                      [ l4:l3:l2:l1:l0 ]
                      hw[7:5]                                          hw[4:0]

The DWORD at byte offset 12 covers bits [127:96] of the 128-bit instruction word. In full instruction coordinates:

| Field | DWORD Bits | Instruction Bits | Width | Content |
| --- | --- | --- | --- | --- |
| Low | [13:9] | [109:105] | 5 bits | hardware_reg[4:0] |
| High | [28:26] | [124:122] | 3 bits | hardware_reg[7:5] |

The 12-bit gap between instruction bits [121] and [110] is occupied by other instruction fields (modifiers, flags, secondary operand encodings). This split-field design is common in GPU ISAs where instruction bits are at a premium and different fields must be routed to different functional unit inputs.
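A round-trip sketch of the split layout. The writer mirrors sub_1B72F60's masks; the read side is our assumption, since only the writer is recovered in this section:

```c
#include <stdint.h>

/* hw[4:0] -> DWORD bits [13:9], hw[7:5] -> DWORD bits [28:26]. */
static uint32_t write_reg_field(uint32_t dword, unsigned hw_reg) {
    dword |= ((uint32_t)hw_reg << 9)  & 0x3E00u;      /* low 5 bits  */
    dword |= ((uint32_t)hw_reg << 21) & 0x1C000000u;  /* high 3 bits */
    return dword;
}

/* Assumed inverse: reassemble the 8-bit hardware number. */
static unsigned read_reg_field(uint32_t dword) {
    return ((dword >> 9) & 0x1F) | (((dword >> 26) & 0x7) << 5);
}
```

For the maximum table value 187 (class 5, index 27) the packed DWORD is 0x14003600, and the extractor recovers 187.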

sub_1B72FE0 (32 bytes, 104 callers) is byte-identical to sub_1B72F60 but occupies a different vtable slot, used by a secondary operand encoding path.

Extended Register Encoder

sub_1B6EA20 (7194 bytes, 25 callers) extends the base encoding with operand modifier support. It takes 5 parameters:

// sub_1B6EA20 -- register encoding with modifiers
__int64 encode_register_with_modifiers(
    int reg_class,      // a1: register class (0-5)
    int sub_index,      // a2: sub-register index
    int negation,       // a3: .NEG modifier flag
    int abs_value,      // a4: |.ABS| modifier flag
    int type_modifier   // a5: type cast modifier
);

When all modifier flags are zero (a3 | a4 | a5 == 0), the function returns the same value as sub_1B6B250 -- the base class * 32 + index result. When modifiers are present, the function continues into extended encoding logic that packs modifier bits alongside the register number. The guard wrapper sub_1B748C0 (35 bytes, 104 callers) provides the same no-register short-circuit for the extended variant.

Additional encoding variants for different operand positions include sub_1B6D590, sub_1B70640, sub_1B71AD0, sub_1B748F0, and sub_1B76100 (5264--6106 bytes each, 2--49 callers each). All share the same nested-if structural pattern and operate on the same class/index domain.

Encoding Pipeline Summary

The complete register encoding pipeline from virtual register to instruction bits:

Virtual Register (vreg+64 = reg_type, vreg+68 = physical_reg)
  |
  v
[Ori Encoder -- sub_7BC030, 6147 callers]
  Reads: operand+20 (reg_type_raw), operand+4 (reg_num)
  Writes: 1-bit presence + 4-bit type + 10-bit number into 1280-bit buffer
  |
  v
[SASS Emission -- sub_1B6B250 via sub_1B73060, 483 callers]
  Input: (register_class, sub_index)
  Formula: hardware_reg = class * 32 + sub_index
  Output: 8-bit hardware register number (0-187)
  |
  v
[Bitfield Writer -- sub_1B72F60, 483 callers]
  Input: 8-bit hardware register number
  Output: split across instruction bits [109:105] and [124:122]
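
Under the stated formulas, the last two stages of this pipeline can be sketched end to end; the function name and the 4-DWORD instruction word are ours:

```c
#include <stdint.h>

/* (class, index) -> 8-bit hardware number -> split bitfields in DWORD 3
 * (instruction bits [127:96]), per the guard, formula, and writer masks
 * documented above. */
static void emit_register(uint32_t instr[4], int reg_class, int sub_index) {
    if ((reg_class | sub_index) == 0)       /* guard: no register */
        return;
    unsigned hw = (unsigned)(reg_class * 32 + sub_index);
    instr[3] |= (hw << 9)  & 0x3E00u;       /* instruction bits [109:105] */
    instr[3] |= (hw << 21) & 0x1C000000u;   /* instruction bits [124:122] */
}
```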

Zone A Function Map

| Function | Size | Callers | Role | Confidence |
| --- | --- | --- | --- | --- |
| sub_1B6B250 | 2,965 B | 254 | Core class*32+index lookup table | HIGH |
| sub_1B6EA20 | 7,194 B | 25 | Extended encoding with modifier bits | HIGH |
| sub_1B73060 | 19 B | 483 | Guard wrapper for sub_1B6B250 | CERTAIN |
| sub_1B748C0 | 35 B | 104 | Guard wrapper for sub_1B70640 | CERTAIN |
| sub_1B72F60 | 32 B | 483 | Split bitfield register writer | HIGH |
| sub_1B72FE0 | 32 B | 104 | Identical writer (different vtable slot) | HIGH |
| sub_1B73080 | 6,106 B | 88 | 3-operand register encoding (class, index, modifier) | HIGH |
| sub_1B6D590 | 5,264 B | varies | Register encoding variant (operand position A) | HIGH |
| sub_1B70640 | varies | varies | Register encoding variant (operand position B) | HIGH |
| sub_1B71AD0 | varies | varies | Register encoding variant (operand position C) | HIGH |
| sub_1B748F0 | varies | varies | Register encoding variant (operand position D) | HIGH |
| sub_1B76100 | varies | varies | Register encoding variant (operand position E) | HIGH |

Decoder Functions

97 decoder functions in the 0xEB3040--0xED0FE0 range reverse the encoding: they extract operand information from packed SASS bitfields back into Ori IR representation. The decoder entry point is sub_EB3040, a dispatcher that performs binary search on the instruction type word (*(a2+12), *(a2+14), *(a2+15)) against a table at off_22E6380. For instruction types 120/121, it falls through to the generic decoder sub_7BFAE0.

The decoder template mirrors the encoder but in reverse:

void decode_OPCODE(int64_t a1, int64_t a2) {
    // 1. Set output instruction type
    *(uint16_t*)(a2 + 12) = INSTR_TYPE_ID;

    // 2. Load operand format table (same xmmword constants as encoder)
    // 3. Set operand count
    *(int*)(a1 + 144) = NUM_OPERANDS;

    // 4. Decode operands using type-specific decoders
    sub_7BD3C0(a1, a2, 0, 0x50, 2);   // GPR register (type=2)
    sub_7BE090(a1, a2, 1, 0x60, 3);   // predicate register (type=3)
    sub_7BD650(a1, a2, 2, 0x70, 10);  // extended register (type=10)

    // 5. Extract control bits (reuse flags, stall counts, yield hints)
    sub_7BD260(a1, a2);

    // 6. Translate encoded values back to IR references
    int reg = sub_AF7DF0(*(void**)(a1+536), extracted_bits);
    sub_B056B0(dest_ptr, reg);
    int pred = sub_AF7200(*(void**)(a1+536), pred_bits);
    sub_AFA380(a2, pred);

    // 7. Extract modifier bitfields (reverse of encoder phase 7)
    sub_AF53B0(*(void**)(a1+536), *(a1+550) & mask);
    sub_AFCEB0();  // commit extracted value
}

Decoder operand count distribution: 6 two-operand, 18 three-operand, 22 four-operand, 16 five-operand, 22 six-operand, and 12 eight-operand decoders (96 typed decoders; the dispatcher sub_EB3040 accounts for the 97th function in the range).

Opcode ID Extractors

Over 100 small functions in the 0x10BF000--0x10C0C00 range serve as opcode discriminators. Each maps an IR instruction node to an opcode ID by reading fields from the operand table. The most-used extractors:

| Function | Encoder Users | Major Opcode Family |
| --- | --- | --- |
| sub_10BF440 | 48 | Generic (most common) |
| sub_10BF230 | 45 | Generic |
| sub_10BF590 | 43 | Generic |
| sub_10BFA90 | 30 | 0x59 (IMAD variants) |
| sub_10BFD30 | 26 | 0xFD family |
| sub_10BFFA0 | 25 | 0x4F family |
| sub_10BF580 | 23 | 0x29 (IADD/IADD3) |
| sub_10BF680 | 16 | 0x38 (load/store) |
| sub_10C0AF0 | 14 | 0xDF (WGMMA) |

89 distinct opcode reader functions cover all instruction families.

Per-SM Architecture Encoding

The encoding system is replicated per SM target. Each SM architecture has its own set of encoder/decoder functions with different xmmword opcode constants. The SM100 (Blackwell datacenter) implementation spans these address ranges:

| Range | Functions | Layer |
| --- | --- | --- |
| 0xD27000--0xDFC000 | 592 | Encoder stubs (p1.12) |
| 0xDFC000--0xEB2AE0 | 494 | Encoder stubs continuation (p1.13) |
| 0xEB3040--0xED0FE0 | 97 | Decoder functions (p1.13) |
| 0x107B1E0--0x10AD700 | 641 | Encoder stubs continuation (p1.16) |
| 0x10ADD30--0x10AFF80 | 78 | Instruction lifecycle & scheduling |
| 0x10B0000--0x10BF2C0 | 2,095 | Bitfield accessor library (p1.16) |
| 0x10C0B20--0x10E32E0 | 184 | Dispatch table megafunctions (p1.16) |
| 0x10EE900--0x1134160 | ~400 | Binary encoders: IR fields to bits (p1.16) |
| 0x1134160--0x114F380 | ~132 | High-level encode path (p1.16) |

The total SM100 codec spans roughly 2.5 MB of binary code across approximately 4,700 functions (including the shared bitfield accessor library).

Other SM targets (SM75 Turing, SM80 Ampere, SM86 Ada, SM89 Lovelace, SM90a Hopper, SM103 Blackwell Ultra, SM120 consumer Blackwell) have parallel encoder populations in the p1.14, p1.15, p1.17--p1.22 address ranges, each with matched xmmword constants for their architecture-specific instruction set.

Per-SM Instruction Format Descriptors

316 instruction format descriptor functions at 0x1732170--0x17A9B70 form the shared, architecture-neutral instruction pattern database. Unlike the per-SM encoder stubs (replicated per architecture at separate address ranges), these descriptors are a single set of functions that describe every SASS opcode variant's encoding geometry: bitfield layout, operand slot configuration, and modifier schema. They are invoked exclusively through virtual dispatch (zero static callers) from the ISel passes (sub_A4BC60, sub_A4D3F0) via the FNV-1a hash-based instruction matcher at sub_1731440.

Descriptor Template

Every descriptor function initializes an Encoding Context object through a fixed 4-phase sequence:

// Phase 1: Opcode header (5 calls for 64-bit, 6 for 128-bit)
sub_7B9B80(a1, 0,    4, FORMAT_CODE);   // bits[3:0]     format: 1=64b, 2=128b
sub_7B9B80(a1, 4,    3, 0);             // bits[6:4]     sched group slot
sub_7B9B80(a1, 0x84, 3, 0);             // bits[134:132] ext flag (128-bit ONLY)
sub_7B9B80(a1, 8,    9, MAJOR_OP);      // bits[16:8]    9-bit major opcode
sub_7B9B80(a1, 0x11, 8, MINOR_OP);      // bits[24:17]   8-bit minor opcode
sub_7B9B80(a1, 0x19, 7, FORMAT_ID);     // bits[31:25]   7-bit format ID

// Phase 2: Format layout descriptor (Tier 1) -- selects operand geometry
*(__m128i*)(a1 + 8) = xmmword_23FXXXX;  // 128-bit format template from rodata
// + bulk copy of 3 x 10 DWORD arrays into a1+24..a1+140

// Phase 3: Architecture modifier table (Tier 2) -- selects per-SM encoding
*(__m128i*)(a1 + 404) = xmmword_YYYYYYY;  // per-SM modifier constants
*(DWORD*)(a1 + 420) = VAL1;               // explicit modifier overrides
*(DWORD*)(a1 + 424) = VAL2;

// Phase 4: Operand count + standard encoding tail
*(DWORD*)(a1 + 144) = NUM_OPERANDS;       // 0--7
sub_7B9D30(a1);                            // clear constant buffer table
sub_7B9D60(a1, a2, 0);                     // encode reuse + guard predicate
// Then: opcode extraction, register encoding, modifier field packing

Two-Tier xmmword Architecture

Each descriptor loads two classes of xmmword constants that together fully specify the instruction encoding:

Tier 1 (at a1+8): Format Layout Descriptor. Selects the instruction format -- operand slot sizes, types, and field layout. These are the 16 format groups documented in the "Instruction Format Group Catalog" section above. Addresses in the 0x23F1xxx--0x23F2xxx rodata range.

Tier 2 (at a1+404): Architecture Modifier Table. Selects per-SM encoding variations for the same format layout. Two instructions with the same Tier 1 descriptor but targeting different architectures use different Tier 2 constants. Addresses span three rodata ranges:

| Rodata Range | Group | Functions | Paired With |
| --- | --- | --- | --- |
| 0x202A280--0x202A2B0 | A | ~40 | 202A290 or 202A2A0+202A2B0 at a1+420 |
| 0x22F1B30--0x22F1B50 | B/C | ~8 | None (single 16B block) |
| 0x22F1BA0--0x22F1BB0 | D | ~3 | None |
| 0x22F1AA0--0x22F1AE0 | E | ~3 | (observed in SM100 encoder range) |
| 0x22F1C20--0x22F1C30 | F | ~2 | Paired at a1+404/a1+420 |
| 0x23B2DE0 | G | 4 | None (rare/specialized) |

SM Generation Mapping

The Tier 2 modifier groups correspond to GPU architecture generations. The mapping is inferred from operand table sizes (larger = newer), function counts per group (fewer = newer/specialized), and cross-reference with the per-SM encoder stubs at known address ranges:

| Modifier Address | Probable SM Range | ISA Family | Confidence |
| --- | --- | --- | --- |
| 0x202A280--0x202A2B0 | sm_50--sm_75 | Maxwell / Pascal / Volta / Turing | MEDIUM |
| 0x22F1B30--0x22F1B50 | sm_80--sm_86 | Ampere / Ada | MEDIUM |
| 0x22F1BA0--0x22F1BB0 | sm_89--sm_90a | Lovelace / Hopper | MEDIUM |
| 0x22F1AA0--0x22F1AE0 | sm_100+ | Blackwell datacenter | MEDIUM |
| 0x22F1C20--0x22F1C30 | sm_103 / sm_120 | Blackwell Ultra / consumer | LOW |
| 0x23B2DE0 | Cross-arch | Specialized / rare instructions | LOW |

The progression from 0x202A to 0x22F1 to 0x23B2 in rodata address space mirrors the SM generation ordering. Group A (Maxwell--Turing) is the most populous, consistent with the longest-supported ISA family. Groups E and F have the fewest functions, consistent with the newest architectures that introduce fewer format changes.

Format Code Distribution

| Format Code | Instruction Width | Descriptor Count | sub_7B9B80 Header Calls | Notes |
| --- | --- | --- | --- | --- |
| 1 | 64-bit | ~120 | 5 (no 0x84 call) | Simple moves, branches, barriers, NOP-like control |
| 2 | 128-bit | ~194 | 6 (includes 0x84) | ALU, load/store, texture, tensor core |
| 8 | 256-bit | 2 | Extended | IMAD.WIDE with 16 constant-bank slots |

Descriptor-Initialized Context Fields

The format descriptor writes these fields into the Encoding Context object. All offsets are decimal:

| Offset | Size | Initialized By | Content |
| --- | --- | --- | --- |
| +8 | 16B | Phase 2 (Tier 1 xmmword) | Format layout descriptor |
| +24--+60 | 40B | Phase 2 (bulk copy) | Operand slot sizes (10 DWORDs) |
| +64--+100 | 40B | Phase 2 (bulk copy) | Operand slot types (10 DWORDs) |
| +104--+140 | 40B | Phase 2 (bulk copy) | Operand slot flags (10 DWORDs) |
| +144 | 4B | Phase 4 | Operand count (0--7) |
| +404 | 16B | Phase 3 (Tier 2 xmmword) | Architecture modifier table |
| +420 | 4B | Phase 3 (scalar) | Architecture modifier field 1 |
| +424 | 4B | Phase 3 (scalar) | Architecture modifier field 2 |

Pipeline Position

The format descriptors bridge ISel pattern matching and per-SM encoding:

ISel Pattern Matcher (sub_1731440, FNV-1a hash on *(a2+12))
  |
  v  (virtual dispatch via vtable)
Format Descriptor (one of 316 at 0x1732170--0x17A9B70)
  Writes: a1+0..a1+144   (format layout + operand geometry)
  Writes: a1+404..a1+424  (architecture modifier table)
  |
  v  (encoding context passed down)
Per-SM Encoder Stub (e.g. 0xD27xxx for SM100)
  Reads: format context from descriptor
  Writes: a1+544..a1+703  (1280-bit encoding buffer)

Representative Examples

sub_1732170 -- 64-bit float conversion (single-dest):

| Field | Value | Meaning |
| --- | --- | --- |
| Format code | 1 | 64-bit instruction |
| Major opcode | 0x0C | Float conversion family |
| Minor opcode | 0x0D | Variant D |
| Format ID | 5 | Short-form general (23F1F08) |
| Tier 1 | xmmword_23F1F08 | Short-form general, 27 opcode classes |
| Tier 2 | xmmword_22F1B30 | Group B (Ampere/Ada) |
| Operand count | 3 | Register operands at 0x50, 0x60, 0x70 |
| Modifier fields | 12 | Spanning a1+544 and a1+552 |

sub_1740200 -- 128-bit IMAD.WIDE (dual-dest):

| Field | Value | Meaning |
| --- | --- | --- |
| Format code | 2 | 128-bit instruction |
| Major opcode | 0x23 | IMAD.WIDE family |
| Minor opcode | 0x12 | Variant with modifier 0x13 |
| Format ID | 0x13 | Tensor/extended ALU (23F2678) |
| Tier 1 | xmmword_23F2678 | Extended ALU, 7 opcode classes |
| Tier 2 | xmmword_202A280 | Group A (Maxwell--Turing) |
| Dual-dest | Yes | 0x84 field present, set to 0 |

sub_1732E90 -- 128-bit extended complex:

| Field | Value | Meaning |
| --- | --- | --- |
| Format code | 2 | 128-bit instruction |
| Major opcode | 0x0C | Float conversion family |
| Minor opcode | 0x0C | Same as major (self-referencing variant) |
| Format ID | 0x19 | Extended complex (23F29A8) |
| Tier 1 | xmmword_23F29A8 | Extended complex, 8 opcode classes |
| Tier 2 | xmmword_22F1B30 | Group B (Ampere/Ada) |

Operand Encoding Patterns

The 576 encoder functions in the p1.12 range use 52 distinct operand encoding patterns. The most common:

| Pattern (reg, imm, pred) | Count | Description |
| --- | --- | --- |
| 3 reg + 1 pred | 88 | Standard 3-source with predicate |
| 2 reg + 1 pred | 57 | Binary op with predicate |
| 3 reg only | 43 | Ternary ALU, no predicate/immediate |
| 3 reg + 1 imm + 1 pred | 42 | MAD-class with immediate + predicate |
| 2 reg only | 40 | Simple binary |
| 3 reg + 1 imm | 25 | Ternary with immediate |
| 1 reg + 1 pred | 22 | Unary with predicate |
| 4 reg + 1 imm | 21 | Quaternary with immediate |
| 4 reg only | 20 | Quaternary register-only |

Register operand bit offsets are format-dependent:

  • 64-bit format: 0x40, 0x50, 0x60, 0x70
  • 128-bit format: 0x60, 0x70, 0x88, 0x98, 0xA8

Major Opcode Summary (SM100)

102 unique major opcodes were identified across 494 encoding variants (p1.13 range alone). Opcode-to-mnemonic mapping is inferred from operand patterns and opcode density; exact mnemonic assignment requires correlation with ROT13-obfuscated instruction names found elsewhere in the binary.

Memory / Load-Store

| Major | Variants | Likely SASS Mnemonics |
| --- | --- | --- |
| 0x38 | 16 | LDG, STG, LDS, STS |
| 0x60 | 2 | Extended load |
| 0x70--0x72 | 9 | Load groups A/B/C |
| 0xA4--0xA6 | 12 | Load/store with addressing modes |
| 0xAD | 9 | Memory extended |
| 0x1E | 4 | ATOM, ATOMS |
| 0x99, 0xA2 | 2 | Extended atomics |
| 0x39 | 2 | REDUX (reduction) |

Integer Arithmetic

| Major | Variants | Likely SASS Mnemonics |
| --- | --- | --- |
| 0x59 | 30 | IMAD, IMAD.HI, IMAD.WIDE, ISCADD |
| 0x29 | 24 | IADD3, IADD3.64, IADD32I |
| 0x4F | 25 | Extended integer operations |
| 0x3B | 10 | Integer MUL/MAD extended |

Floating Point

| Major | Variants | Likely SASS Mnemonics |
| --- | --- | --- |
| 0x3A | 1 | Float operation |
| 0x3E--0x40 | 4 | FFMA, FFMA variants |
| 0x43--0x44 | 2 | Float MUL/MAD |
| 0x4A | 4 | FADD, FMUL, FFMA forms |
| 0x49 | 6 | HFMA2, HADD2, HMUL2 |
| 0x5C | 6 | HFMA2 variants |
| 0x5F | 2 | Half-float extended |

Tensor Core / WGMMA

| Major | Variants | Likely SASS Mnemonics |
| --- | --- | --- |
| 0xA8--0xA9 | 16 | Tensor core A/B (WGMMA, HMMA) |
| 0xAB--0xAC | 12 | Tensor core C/D |
| 0xAE--0xB0 | 30 | Tensor core E/F/G |
| 0xB1--0xB3 | 15 | Tensor core H/I/J |
| 0xDF | 14 | WGMMA dispatch (main family) |
| 0x12 | 4 | Matrix operations |
| 0x54 | 6 | Extended matrix |

Control Flow

| Major | Variants | Likely SASS Mnemonics |
| --- | --- | --- |
| 0x18 | 10 | BRA, SSY, CAL, EXIT, RET, BREAK, CONT |
| 0x19 | 5 | Control flow group B |
| 0x7D | 2 | YIELD, control |
| 0x24 | 2 | BAR, barrier/sync |
| 0xCF | 3 | BARRIER |
| 0xD4 | 2 | BARRIER B |
| 0x33 | 2 | DEPBAR |

Comparison / Predicate

| Major | Variants | Likely SASS Mnemonics |
| --- | --- | --- |
| 0x0D | 10 | ISETP, FSETP, DSETP |
| 0x17 | 8 | PSETP, PLOP3 |
| 0x95 | 6 | Comparison variants |

Data Movement / Conversion

| Major | Variants | Likely SASS Mnemonics |
| --- | --- | --- |
| 0x61 | 5 | MOV, MOV.64, MOV32I |
| 0x46, 0x66, 0x45 | 3 | MOV variants |
| 0x56 | 6 | F2I, I2F, F2F type conversions |
| 0x62 | 6 | Type conversion group 2 |
| 0x10 | 4 | SEL (conditional select) |
| 0x1B | 3 | PRMT (permute) |

Instruction Object Lifecycle

The instruction object constructor sub_10AFF80 (11 KB, 3 callers: sub_6F0A30, sub_6F52F0, sub_9EE390) takes 32 parameters and builds a ~900-byte instruction-level object:

  • 13 sub-object allocations via vtable allocator (vtable+24)
  • 4 linked-list structures for instruction chaining
  • 2 string buffers for instruction name and alternate name (via strlen + memcpy)
  • Architecture descriptor via sub_B19110(arch_id) at offset +408
  • Hash table using FNV-1a (seed 0x811C9DC5, prime 16777619) for instruction record lookup

The instruction unlink-and-recycle functions (sub_10ADF90, sub_10AE190) remove an instruction node from a doubly-linked list (head/tail at a1+48/56), update the count at a1+64, free operand attachments via vtable call, and return the node to a free-list at a1+72. The maximum instruction count per list is 16,383 (checked by sub_10AE7C0).
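The hash constants recovered from the constructor are the standard 32-bit FNV-1a parameters. A reference implementation for comparison (the key type, a byte string, is an assumption about how ptxas feeds the hash):

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 32-bit: offset basis 0x811C9DC5, prime 16777619. */
static uint32_t fnv1a(const void *key, size_t len) {
    const uint8_t *p = key;
    uint32_t h = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];          /* XOR the byte in first... */
        h *= 16777619u;     /* ...then multiply by the prime */
    }
    return h;
}
```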

Encoding Pipeline Layers

The full encoding pipeline operates in three layers, from high-level IR to binary output:

Layer 1: High-level encode (0x1134160--0x114F380, ~132 functions) Populates full IR records before low-level packing. Uses sub_9B3C20(a1, a2, slot, type, mode, width, reg_id) for register operands and sub_9B3D60 for immediates. Handles 255->1023 sentinel translation for "don't care" register values. Sets opcode/modifier fields via sub_AFA910/sub_AFA930. Applies conditional fixups: e.g., if opcode==2038 && subopcode==2257, sets operand_slot+84 = 5.
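
The sentinel translation can be sketched as a one-liner; reading 255 and 1023 as the 8-bit and 10-bit all-ones "don't care" values is our interpretation:

```c
/* Widen the 8-bit don't-care sentinel (255) to the 10-bit register
 * number field's sentinel (1023); all other values pass through. */
static unsigned widen_dontcare(unsigned reg8) {
    return (reg8 == 255) ? 1023 : reg8;
}
```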

Layer 2: Binary encoders (0x10EE900--0x1134160, ~400 functions) Reads operand fields from IR via sub_10BDxxx extractors, transforms through sub_10Bxxx lookup tables, and packs results into the 128-bit output word at *(QWORD*)(a1+40):

// Typical pattern (sub_10F91D0):
int v6 = sub_10BF170(operand_addr);       // extract register class
int v7 = sub_10B6180(lookup_table, v6);    // translate to encoding value
*(uint64_t*)(a1 + 40) |= ((uint64_t)v7 << 15);  // pack at bit position 15

Includes a register pair encoder (sub_112CDA0, 8.9 KB) that maps 40 register pair combinations (R0/R1, R2/R3, ... R78/R79) to packed output values at 0x2000000 intervals.
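One plausible reading of the pair table, assuming a zero base value and uniform 0x2000000 spacing for the 40 pairs (both assumptions, not verified against sub_112CDA0):

```c
#include <stdint.h>

/* Map an even register number (the low half of a pair R0/R1 .. R78/R79)
 * to a packed output value at 0x2000000 intervals. */
static uint64_t encode_register_pair(unsigned even_reg) {
    if (even_reg % 2 || even_reg > 78)
        return UINT64_MAX;                       /* not a valid pair */
    return (uint64_t)(even_reg / 2) * 0x2000000u;
}
```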

Layer 3: Template encoder stubs (0xD27000--0xEB2AE0, ~1,086 functions) The lowest-level stubs that directly write the encoding buffer via sub_7B9B80. These are the functions described by the encoder template above.

Variant/Sub-opcode Distribution

The variant field (bits[24:17], 8 bits) has a distribution that peaks at variant 0x05 with 128 functions, suggesting this is the default or most common variant (possibly .F32 type or the unmodified form):

| Variant | Count | Variant | Count |
| --- | --- | --- | --- |
| 0x00 | 21 | 0x08 | 13 |
| 0x01 | 25 | 0x09 | 14 |
| 0x02 | 62 | 0x0A | 10 |
| 0x03 | 24 | 0x0B | 19 |
| 0x04 | 20 | 0x0C | 14 |
| 0x05 | 128 | 0x0D | 9 |
| 0x06 | 30 | 0x0E | 11 |
| 0x07 | 10 | 0x0F--0x2F | decreasing |

Maximum observed variant value is 0x2F (47), giving up to 48 sub-operations per major opcode.

SASS Emission Backend

The final stage of the encoding pipeline operates at the instruction-word level: 11 per-instruction-form bitfield packers at addresses 0x1B79940--0x1B9C220 take a pre-decoded instruction descriptor and pack all fields into a 128-bit SASS instruction word. These functions sit at Level 2 of a 4-level emission hierarchy:

Level 0: SM-target dispatch    (0xC4DF70, 0xC53330, 0xC54090, 0xC59610, 0xC5ABE0, 0xC5B5C0)
Level 1: Emission orchestrators (Zone C: 0x1BA0000-0x1BE5000, ~150 functions)
Level 2: Per-form bit packers   (Zone B: 0x1B79940-0x1B9C220, 11 functions, THIS SECTION)
Level 3: Register class encoders (Zone A: 0x1B4C000-0x1B76000, ~40 functions)

Each function has exactly 1 caller and 0 callees (pure bitfield packing, no external calls). Sizes range from 6836 to 6980 bytes of compiled code. All 11 share an identical combinator body (verified: same 453 LABEL_xxx targets, same 75 unique OR-mask constants, same max comparison value of 27). They differ only in two things: the opcode base constant, and the prologue field-packing sequence.

Input / Output Interface

int *__fastcall emit_instruction_form_X(int *a1) {
    // a1 = pre-decoded instruction descriptor (array of 32-bit ints)
    // Returns: pointer to output buffer (also accessible at *((_QWORD*)a1 + 14))
    int *result = *((_QWORD *)a1 + 14);  // output = 128-bit instruction word
    // result[0] = instruction bits [31:0]   (opcode base, guard pred, sched group)
    // result[1] = instruction bits [63:32]  (register operand fields, modifiers)
    // result[2] = instruction bits [95:64]  (immediate/offset, auxiliary fields)
    // result[3] = instruction bits [127:96] (predicate control, combinator encoding)
    return result;
}

The input struct a1 is a flat array of pre-extracted instruction fields. Fields a1[0] through a1[3] carry common header values; a1[4] through a1[15] carry instruction-specific operand data (which indices are used depends on the instruction form).

Phase 1: Prologue -- Opcode Base and Field Packing

Every function begins with the same template, parameterized by different constants:

// 1. Load output buffer pointer
result = *((_QWORD *)a1 + 14);

// 2. OR opcode base into result[0] -- unique 12-bit constant per function
*result |= OPCODE_BASE;  // e.g., 0xA1E, 0x81B, 0x803

// 3. Pack guard predicate: bits [14:12] of result[0]
*result |= ((unsigned short)a1[1] << 12) & 0x7000;

// 4. Pack scheduling group: bits [16:15] of result[0]
*result |= (unsigned short)((unsigned short)a1[2] << 15);

// 5. Pack predicate encoding: bits [25:20] of result[3]
result[3] |= (a1[3] << 20) & 0x3F00000;

// 6. Pack instruction-specific operand fields (VARIES PER FUNCTION)
//    Each function packs a different set of a1[6..15] fields into
//    result[0], result[1], result[2] using different shifts and masks.

// 7. Set base combinator mask: bits [19:14] of result[3] = 0x3F
result[3] |= 0xFC000;

The prologue is the sole source of variation between functions. The field-packing differs in which a1[] indices are used, which shift amounts are applied, and which result[] DWORDs are targeted.
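A runnable condensation of steps 2-5 and 7, taking the output DWORD array directly (step 1's pointer load and step 6's per-form packing are omitted). The 0x18000 mask for the scheduling group follows the stated [16:15] bit range rather than the decompiler's 16-bit truncation:

```c
#include <stdint.h>

/* Shared prologue, parameterized by the per-function opcode base
 * (e.g. 0xA1E for sub_1B95ED0). */
static void pack_prologue(uint32_t result[4], const int a1[16],
                          uint32_t opcode_base) {
    result[0] |= opcode_base;                                   /* step 2 */
    result[0] |= ((uint32_t)a1[1] << 12) & 0x7000u;             /* guard pred [14:12] */
    result[0] |= ((uint32_t)(uint16_t)a1[2] << 15) & 0x18000u;  /* sched group [16:15] */
    result[3] |= ((uint32_t)a1[3] << 20) & 0x3F00000u;          /* pred enc [25:20] */
    result[3] |= 0xFC000u;                                      /* combinator base */
}
```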

The 11 Functions and Their Opcode Bases

| Function | Size | Opcode Base | Family | Caller Chain |
| --- | --- | --- | --- | --- |
| sub_1B79940 | 6,900 B | 0xA1B | 0xAxx | sub_1BA5340 via sub_C4DF70 |
| sub_1B7B440 | 6,868 B | 0x81B | 0x8xx | sub_1BA5340 via sub_C4DF70 |
| sub_1B87740 | 6,852 B | 0x238 | 0x2xx | sub_1BA8D80 via sub_C53330 |
| sub_1B89350 | 6,836 B | 0x213 | 0x2xx | sub_1BA8E80 via sub_C54090 |
| sub_1B8FFE0 | 6,852 B | 0x202 | 0x2xx | sub_1BA8D80 via sub_C53330 |
| sub_1B92590 | 6,868 B | 0x803 | 0x8xx | sub_1BACB10 (direct) |
| sub_1B94390 | 6,964 B | 0x21D | 0x2xx | sub_1BACDC0 via sub_C59610 |
| sub_1B95ED0 | 6,980 B | 0xA1E | 0xAxx | sub_1BACDC0 via sub_C59610 |
| sub_1B985B0 | 6,852 B | 0x804 | 0x8xx | sub_1BAD6D0 (direct) |
| sub_1B9A430 | 6,884 B | 0x807 | 0x8xx | sub_1BB1110 via sub_C5ABE0 |
| sub_1B9C220 | 6,884 B | 0x81A | 0x8xx | sub_1BAD920 via sub_C5B5C0 |

The opcode bases cluster into three families by high nibble:

  • 0x2xx (4 functions): bases 0x202, 0x213, 0x21D, 0x238
  • 0x8xx (5 functions): bases 0x803, 0x804, 0x807, 0x81A, 0x81B
  • 0xAxx (2 functions): bases 0xA1B, 0xA1E

Phase 2: Combinator -- 3-Axis Predicate Encoding into result[3]

After the prologue, all 11 functions execute an identical ~1900-line decision tree. This combinator reads three integer values from the input struct and produces a single 32-bit mask that is ORed into result[3].

The three axes are:

  • axis0 = a1[0]: instruction class selector, values 0..5 (6 values)
  • axis1 = a1[4]: slot/form index, values 1..27 (26 populated, gap at 16)
  • axis2 = a1[N]: sub-mode flag, values 0 or 1 (N varies per function -- a1[8], a1[9], a1[10], a1[11], or a1[15])

The combinator exits immediately if all three axes are zero (!(axis0 | axis1 | axis2)). Otherwise it walks a nested decision tree that tests axis0 values (0 through 5), axis1 values (1 through 27), and axis2 values (0 or 1), and ORs the appropriate mask into result[3]:

// Reconstructed combinator logic (pseudocode):
if (axis0 == 0 && axis1 == 0 && axis2 == 0) return;

// For axis0 values 1-5 combined with axis1 values 1-15:
// result[3] |= prefix_for_axis0 | 0xFC000 | (axis1 << 9)
//
// For axis1 values 17-27 combined with axis2:
// result[3] |= base_mask_for_axis1  (if axis2 == 0)
// result[3] |= extended_mask_for_axis1  (if axis2 == 1)

Combinator Mask Encoding

The 75 unique masks in the FC/FD series decompose as:

result[3] bit layout for combinator-generated fields:
  bits [19:14] = 0x3F  (always set by prologue base 0xFC000)
  bits [13:9]  = slot_index  (5-bit, derived from axis1, values 1-27)
  bits [28:26] = axis0 prefix encoding (3-bit, for axis0 values 1-5)

The 5 prefix values correspond to axis0 encodings 1-5:

| axis0 Value | Prefix OR'd | Prefix Bits [28:26] |
| --- | --- | --- |
| 0 | 0x0 | 000 (no prefix) |
| 1 | 0x40xxxxx | 001 |
| 2 | 0x80xxxxx | 010 |
| 3 | 0xC0xxxxx | 011 |
| 4 | 0x100xxxxx | 100 |
| 5 | 0x140xxxxx | 101 |

Combined with 15 slot values (axis1 = 1..15), this produces 5 x 15 = 75 masks in the 0xFC200--0x140FCE00 range.

For axis1 values 17--27, the masks shift into the 0xFE200--0xFF600 range. These slots use only the "no prefix" and "prefix 0x100" variants (axis0 values 0 and 4), and the axis2 flag selects between the two. This gives an additional 12 unique masks for the high-slot range (11 base + 11 extended, minus shared ones, equals 12 unique).
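The low-slot mask formula can be checked against the documented mask values; high slots (axis1 = 17-27) behave differently and are not modeled here:

```c
#include <stdint.h>

/* prefix[axis0] | base 0xFC000 | (axis1 << 9), for axis0 in 0..5 and
 * axis1 in 1..15. The prefix table matches the bits [28:26] encodings
 * listed above. */
static uint32_t combinator_mask(unsigned axis0, unsigned axis1) {
    static const uint32_t prefix[6] = {
        0x0, 0x4000000, 0x8000000, 0xC000000, 0x10000000, 0x14000000
    };
    return prefix[axis0] | 0xFC000u | (axis1 << 9);
}
```

For example, (axis0=0, axis1=1) yields 0xFC200 and (axis0=5, axis1=7) yields 0x140FCE00, the two endpoints of the documented range.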

Why the Combinator Exists

The combinator encodes an architecture-independent mapping from a 3-dimensional instruction property coordinate to a hardware-specific bitfield pattern in the predicate/control section of the 128-bit instruction word. This section (bits [127:96]) controls:

  • Guard predicate assignment (bits [25:20] from prologue)
  • Scheduling mode (bits [19:14] base + combinator overlay)
  • Instruction form variant (bits [13:9] from combinator)
  • Predicate class / condition code routing (bits [28:26] from combinator)

The identical combinator across all 11 functions confirms that this is not an opcode-specific encoding but rather a cross-cutting encoding for predicate/scheduling state that applies uniformly to all instruction forms.

Equivalent Lookup Table

The entire 2000-line decision tree can be replaced by a flat table of 6 x 28 x 3 = 504 entries:

// Equivalent reconstruction:
static const uint32_t combinator_table[6][28][3] = { ... };
// Access: result[3] |= combinator_table[axis0][axis1][axis2];
// Table size: 504 * 4 = 2,016 bytes (vs ~6,800 bytes of code per function)

The compiler chose a decision tree over a table lookup, likely because the C++ source used nested switch/case statements (or if/else chains with early return), and the optimizer did not convert this to a table at -O2.

Zone B Function Map (Emission Cluster)

| Address | Size | Opcode Base | Caller | Confidence |
| --- | --- | --- | --- | --- |
| sub_1B79940 | 6,900 B | 0xA1B | sub_1BA5340 | HIGH |
| sub_1B7B440 | 6,868 B | 0x81B | sub_1BA5340 | HIGH |
| sub_1B87740 | 6,852 B | 0x238 | sub_1BA8D80 | HIGH |
| sub_1B89350 | 6,836 B | 0x213 | sub_1BA8E80 | HIGH |
| sub_1B8FFE0 | 6,852 B | 0x202 | sub_1BA8D80 | HIGH |
| sub_1B92590 | 6,868 B | 0x803 | sub_1BACB10 | HIGH |
| sub_1B94390 | 6,964 B | 0x21D | sub_1BACDC0 | HIGH |
| sub_1B95ED0 | 6,980 B | 0xA1E | sub_1BACDC0 | HIGH |
| sub_1B985B0 | 6,852 B | 0x804 | sub_1BAD6D0 | HIGH |
| sub_1B9A430 | 6,884 B | 0x807 | sub_1BB1110 | HIGH |
| sub_1B9C220 | 6,884 B | 0x81A | sub_1BAD920 | HIGH |

SM89/90 Codec Layer

SM89 (Ada Lovelace) and SM90 (Hopper) share a pre-encoding instruction reordering layer absent from SM100 (Blackwell). This layer sits above the three-layer encoding pipeline: it manipulates Mercury IR instruction lists to optimize instruction interleaving before the encoding stubs pack bitfields. The entire cluster spans addresses 0x1226E80--0x1233D70, roughly 261 KB of compiled code across 18 functions.

Call Chain

sub_C60910 / sub_C5FEF0   SM-target dispatch (Level 0, 0xC5xxxx range)
  |
  v
sub_1233D70 (6 KB)         Orchestrator: guards on knob 487 and O-level > 1,
  |                        sets up cost-function parameters, calls A then B
  |
  +-> sub_122AD60 (112 KB)  Pass A: classify instructions + reorder within blocks
  +-> sub_122F650 (105 KB)  Pass B: scheduling-aware emission ordering across blocks
  +-> sub_A112C0            Post-pass finalization

The orchestrator sub_1233D70 is called only when the optimization level exceeds 1 (sub_7DDB50(ctx) > 1). It reads floating-point cost weights from the target descriptor via knob offsets +7200, +7560, +7128, +7272 and passes them through to both passes. Default base weights are 1.8, -0.8, 3.2, -2.2.

Pass A: Instruction Classification and Reordering (sub_122AD60)

4,118 decompiled lines. Traverses every instruction in each basic block and sorts them into 4 linked-list queues by instruction category:

| Category | Return Code | Instruction Type | Queue Role |
|---|---|---|---|
| Branch / control-flow | 0 | type 9 (BRA, EXIT, RET, ...) | Held at block boundaries |
| Load | 1 | type 12 (LDG, LDS, ...) | Scheduled early for latency hiding |
| Store | 2 | type 5 (STG, STS, ...) | Deferred to maximize distance from loads |
| General ALU | 4 | type 4 (IADD, FFMA, ...) | Interleaved between memory ops |
| Uncategorized | 3 | other / missing info | Treated as general |

The classifier is sub_1228670 (30 lines), which reads the instruction scheduling class via sub_7E2FE0 and returns 0--4. A companion predicate sub_1228EF0 (38 lines) returns 0 for types 9, 5, and 12 (the "special" categories), 1 for everything else.
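The recovered mapping from scheduling class to queue category is small enough to state exactly. A hypothetical C reconstruction of the classifier pair (return codes and type values are from the table above; everything else is invented):

```c
/* Hypothetical reconstruction of the classifier pair; the scheduling
 * class values (9/12/5/4) and return codes come from the recovered tables. */
static int classify(int sched_class) {          /* cf. sub_1228670 */
    switch (sched_class) {
        case 9:  return 0;   /* branch / control flow */
        case 12: return 1;   /* load  */
        case 5:  return 2;   /* store */
        case 4:  return 4;   /* general ALU */
        default: return 3;   /* uncategorized */
    }
}

static int is_special(int sched_class) {        /* cf. sub_1228EF0 */
    /* inverted sense: returns 0 for the "special" classes 9, 5, 12 */
    return !(sched_class == 9 || sched_class == 5 || sched_class == 12);
}
```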

After classification, Pass A performs register-class-aware instruction motion: it uses sub_91BF30 (register class builder), sub_91E390 (class query), and sub_91E610 (class intersection) to verify that moving an instruction does not violate register-class constraints. Instructions that pass the check have their operand flags updated at +48 (bit 0x40 = "moved" marker) and +96 (copy-chain tracking).

The reordering step sub_122AA30 (186 lines) performs the final within-block interleaving. sub_1227D90 (522 lines) handles the actual linked-list surgery: unlink an instruction from its current position and reinsert it at a new location.

Pass B: Scheduling-Aware Emission Ordering (sub_122F650)

3,917 decompiled lines. Takes the classified instruction lists from Pass A and determines the emission order that optimizes scheduling. Operates on 8 bitvector arrays allocated via the sub_BDxxxx bitvector library:

| Bitvector | Purpose |
|---|---|
| v521 | Main liveness set (all instructions) |
| v523 | Load-group register liveness |
| v525 | Store-group register liveness |
| v527 | ALU-group register liveness |
| v529 | Control-flow register liveness |
| v531 | Cross-block interference set |
| v533 | Scheduling priority set |
| v535 | Secondary interference set |

Each bitvector is sized to the function's total register count (*(ctx+224)). Pass B iterates through instructions, populates the bitvectors with defined-register information via sub_BDBC70 (set bit), then merges category-specific vectors into the main set via sub_BDC5F0 (union) in an order determined by the dependency analysis.
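The bitvector operations the pass relies on reduce to word-wise set and union over an array sized to the register count. A sketch under that assumption (the real library lives in the sub_BDxxxx range; all names here are invented):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the bitvector primitives Pass B uses; names are invented. */
typedef struct { uint64_t *words; size_t nwords; } bitvec;

static bitvec bv_alloc(size_t nbits) {               /* sized to reg count */
    bitvec v;
    v.nwords = (nbits + 63) / 64;
    v.words  = calloc(v.nwords, sizeof(uint64_t));
    return v;
}
static void bv_set(bitvec *v, size_t bit) {          /* cf. sub_BDBC70 */
    v->words[bit / 64] |= 1ull << (bit % 64);
}
static void bv_union(bitvec *dst, const bitvec *src) /* cf. sub_BDC5F0 */
{
    for (size_t i = 0; i < dst->nwords; i++)
        dst->words[i] |= src->words[i];
}
static int bv_test(const bitvec *v, size_t bit) {
    return (int)((v->words[bit / 64] >> (bit % 64)) & 1);
}

static int bv_demo(void) {
    bitvec main_set = bv_alloc(128), load_set = bv_alloc(128);
    bv_set(&main_set, 3);
    bv_set(&load_set, 100);
    bv_union(&main_set, &load_set);   /* merge a category vector into main */
    int ok = bv_test(&main_set, 3) && bv_test(&main_set, 100)
             && !bv_test(&main_set, 4);
    free(main_set.words);
    free(load_set.words);
    return ok;
}
```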

The single switch at line 2578 dispatches on the instruction category (numbering per the sub_1228670 classifier):

case 4 (ALU):           merge load + store + ALU vectors into main
case 3 (uncategorized): merge load vector only
case 0 (branch):        merge store vector only
case 2 (store):         merge ALU vector only
case 1 (load):          no merge

Knob-derived flags control reordering aggressiveness:

  • Knob at target offset +7416 (index ~103): enable load reordering
  • Knob at target offset +7488: enable general reordering
  • All reordering disabled when *(ctx+1584)+372 == 12288 (specific regalloc config)

Pass B also maintains a red-black tree structure for the emission schedule, with standard left/right/parent pointers at node offsets 0, 8, 16.
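The recovered node offsets imply a conventional intrusive tree node. A C rendering of that shape (field names invented; payload and color-bit layout were not recovered):

```c
#include <stddef.h>

/* Node shape implied by the recovered offsets (left/right/parent at
 * 0, 8, 16 on an LP64 target); field names are invented and the
 * payload/color layout was not recovered. */
typedef struct rb_node {
    struct rb_node *left;     /* +0  */
    struct rb_node *right;    /* +8  */
    struct rb_node *parent;   /* +16 */
} rb_node;
```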

Differences from SM100

| Aspect | SM89/90 | SM100 (Blackwell) |
|---|---|---|
| Pre-encode reordering | Present (sub_122AD60 + sub_122F650) | Absent -- scheduling integrated into its own pass |
| Instruction classification | 5-category scheme (branch/load/store/ALU/other) | 370-category opcode dispatch via megafunctions |
| Cost model | Floating-point heuristic (4 tunable weights) | Table-driven via hardware profile records |
| Liveness tracking | 8 bitvectors per block | Handled in scheduling pass, not in encoding |
| Knob control | Knobs 103, 106, 218, 230, 487, 501 | Different knob set for Blackwell scheduler |
| Register class validation | sub_91BF30/sub_91E390 per-move check | Per-instruction class check at encoding time |
| Binary encoder calls | None -- IR-level manipulation only | sub_7B9B80 (18,347 callers) |

The SM89/90 pair operates entirely at the Mercury IR level and produces no packed instruction bits. It rewrites the instruction linked lists in each basic block to optimize scheduling, after which the standard encoding pipeline (Layers 1--3) runs on the reordered sequence. SM100 Blackwell does not need this layer because its scheduling infrastructure (documented in scheduling/algorithm.md) already integrates instruction ordering into the scheduling pass itself.

SM89/90 Codec Function Map

| Address | Size | Lines | Identity | Confidence |
|---|---|---|---|---|
| sub_1233D70 | 6 KB | 321 | sm89_orchestrator -- guards, cost params, calls A+B | HIGH |
| sub_122AD60 | 112 KB | 4,118 | sm89_classify_reorder -- instruction classification + block reordering | HIGH |
| sub_122F650 | 105 KB | 3,917 | sm89_emission_order -- scheduling-aware emission ordering | HIGH |
| sub_122AA30 | ~3 KB | 186 | local_reorder -- within-block instruction interleaving | HIGH |
| sub_1227D90 | ~9 KB | 522 | instruction_reinsert -- unlink + reinsert at new position | HIGH |
| sub_122F1E0 | ~6 KB | 330 | scheduling_heuristic -- cost-function comparison for emission order | MEDIUM |
| sub_1228670 | ~0.5 KB | 30 | instruction_classify -- 5-category classifier (returns 0--4) | CERTAIN |
| sub_1228EF0 | ~0.5 KB | 38 | is_special -- predicate: types 9/5/12 return false | CERTAIN |
| sub_1226E80 | ~0.3 KB | 22 | list_prepend -- insert instruction at list head | CERTAIN |
| sub_1226EB0 | ~5 KB | 274 | instruction_finalize -- post-reorder operand fixup | HIGH |
| sub_1227820 | ~1 KB | 77 | operand_offset_update -- adjust operand offsets after move | HIGH |
| sub_1227B60 | ~0.5 KB | 31 | motion_check -- can instruction move to new position? | HIGH |
| sub_1228FA0 | ~2 KB | 100 | regclass_propagate -- propagate register class after move | HIGH |
| sub_12292B0 | ~0.5 KB | 38 | queue_init_A -- initialize classification queue | HIGH |
| sub_1229330 | ~0.5 KB | 38 | queue_init_B -- initialize classification queue | HIGH |
| sub_1229BD0 | ~2 KB | 107 | tree_rebalance -- red-black tree rebalance | MEDIUM |
| sub_122A050 | ~1 KB | 77 | pre_pass_init -- initialize pass A state object | HIGH |
| sub_122A1A0 | ~2 KB | 139 | block_resize -- resize bitvector for new block count | HIGH |

Function Map

| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
| sub_7B9B80 | 216 B | 18,347 | bitfield_insert -- core packer into 1280-bit buffer | CERTAIN |
| sub_7B9D30 | 38 B | 2,408 | clear_cbuf_slots -- memset(a1+468, 0xFF, 64) | HIGH |
| sub_7B9D60 | 408 B | 2,408 | encode_reuse_predicate -- reuse flags + guard predicate | HIGH |
| sub_7BC030 | 814 B | 6,147 | encode_register -- GPR operand encoder | HIGH |
| sub_7BC360 | ~500 B | 126 | encode_uniform_register -- UR operand encoder | HIGH |
| sub_7BC5C0 | 416 B | 1,449 | encode_predicate -- predicate operand encoder | HIGH |
| sub_7BCF00 | 856 B | 1,657 | encode_immediate -- immediate/cbuf operand encoder | HIGH |
| sub_7BD260 | ~300 B | 96 | decode_finalize -- extract control bits | HIGH |
| sub_7BD3C0 | ~500 B | 286 | decode_register -- GPR operand decoder | HIGH |
| sub_7BD650 | ~400 B | 115 | decode_register_alt -- destination register decoder | HIGH |
| sub_7BE090 | ~400 B | 50 | decode_predicate -- predicate operand decoder | HIGH |
| sub_10B6180 | 21 B | 8,091 | encode_bool_field -- 1-bit opcode-to-control mapping | HIGH |
| sub_10B6160 | 21 B | 2,205 | encode_bool_field_B -- 1-bit flag variant | HIGH |
| sub_10B6140 | 21 B | 1,645 | encode_bool_field_C -- 1-bit flag variant | HIGH |
| sub_10AFF80 | 11 KB | 3 | instruction_constructor -- 32-param object builder | HIGH |
| sub_10ADF90 | 2.2 KB | 357 | instruction_unlink -- linked-list remove + recycle | HIGH |
| sub_10B0BE0 | 6.5 KB | -- | hash_table_insert_64 -- FNV-1a, 8-byte key, 4x resize | HIGH |
| sub_10B1C30 | 3.9 KB | -- | hash_table_insert_32 -- FNV-1a, 4-byte key | HIGH |
| sub_10C0B20 | 180 KB | 3,109 | setField -- field value writer dispatch | HIGH |
| sub_10D5E60 | 197 KB | 961 | getFieldOffset -- field bit-position lookup dispatch | HIGH |
| sub_10E32E0 | 187 KB | 72 | hasField -- field existence query dispatch | HIGH |
| sub_10CCD80 | 142 KB | 4 | setFieldDefault -- default value writer dispatch | MEDIUM |
| sub_10CAD70 | 68 KB | 74 | getOperandFieldOffset -- per-operand field offset dispatch | HIGH |
| sub_10C7690 | 65 KB | 288 | setOperandField -- per-operand field writer dispatch | HIGH |
| sub_AF7DF0 | -- | 7,355 | encoded_to_ir_register -- hardware reg to IR translation | HIGH |
| sub_AF7200 | -- | 552 | encoded_to_ir_predicate -- hardware pred to IR translation | HIGH |
| sub_EB3040 | 1.9 KB | -- | decode_dispatcher -- binary search on instruction type | HIGH |
| sub_112CDA0 | 8.9 KB | -- | register_pair_encoder -- 40-pair mapping via if-chain | HIGH |
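The two hash-table inserters above are described as FNV-1a keyed tables. For reference, the standard 64-bit FNV-1a over an 8-byte key looks like this (textbook constants, not the decompiled body):

```c
#include <stdint.h>

/* Standard 64-bit FNV-1a over an 8-byte little-endian key -- the hash
 * family the insert helpers are reported to use. */
static uint64_t fnv1a64_u64(uint64_t key) {
    uint64_t h = 0xCBF29CE484222325ull;      /* FNV offset basis */
    for (int i = 0; i < 8; i++) {
        h ^= (key >> (8 * i)) & 0xFF;        /* fold in one byte */
        h *= 0x100000001B3ull;               /* FNV prime */
    }
    return h;
}
```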

Cross-References

Peephole Optimization

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The peephole optimization pass in ptxas is the single largest subsystem by code volume in the entire binary. Three monolithic dispatch functions -- totaling approximately 750 KB of machine code -- implement a brute-force pattern-match-and-rewrite engine that recognizes instruction idioms in the internal IR and replaces them with more efficient SASS instruction forms. Each dispatch function serves a different compilation context (generic, SM120-specific, and post-scheduling), but all three share the same architecture: a giant opcode-based switch dispatches to hundreds of pattern matchers; the highest-priority match wins; the winning rewrite modifies the instruction in-place.

None of the three mega-dispatchers can be decompiled by Hex-Rays due to their extreme size (233--280 KB each). All analysis in this page derives from disassembly, call graphs, and the 3,185 pattern-matcher functions that they invoke.

Scale Summary

| Dispatch function | Binary size | Instructions | Pattern matchers | Total call sites | Entry trampoline | Context |
|---|---|---|---|---|---|---|
| sub_169B190 | 280 KB | 65,999 | 762 | 15,870 | sub_B12930 | Generic (all SM) |
| sub_143C440 | 233 KB | ~56,241 | 1,087 | 1,971 | sub_B12940 | SM120-specific |
| sub_198BCD0 | 233 KB | 54,043 | 1,336 | 13,391 | sub_B12960 | Post-scheduling |

All three entry trampolines (sub_B12930, sub_B12940, sub_B12960) are 11-byte thunks that strip or forward one argument and tail-call the corresponding giant.

Pipeline Position

 IR instruction stream
       |
       v
 sub_B12930 -----> sub_169B190   (generic peephole)
       |
       v
 sub_B12940 -----> sub_143C440   (SM120 peephole, RTX 50-series / Pro)
       |
       v
 [instruction scheduling]
       |
       v
 sub_B12960 -----> sub_198BCD0   (post-schedule peephole)
       |
       v
 [instruction encoding via vtable]

The generic and SM120 dispatchers run before scheduling; the post-scheduling dispatcher runs after. The SM120 dispatcher (sub_143C440) appears to be architecture-gated -- it is called only when compiling for SM 120 targets (consumer RTX 50-series, enterprise Pro GPUs).

Dispatch Architecture

All three mega-dispatchers follow the same algorithm.

Entry and primary switch

push callee-saves
sub  rsp, 10h
mov  rbp, rdi            ; ctx
mov  rbx, rsi            ; instruction node
mov  [rsp+var_2C], -1    ; best_template_id = NONE
mov  [rsp+var_30], -1    ; best_priority    = NONE
movzx edi, word [rsi+0Ch]; read opcode field
call sub_13B9DC0          ; identity / normalization (returns opcode)
cmp  ax, 174h             ; 373 cases (opcodes 0..372)
ja   default
jmp  [jump_table + rax*8] ; PRIMARY SWITCH on opcode

The 16-bit opcode at instruction node offset +0x0C selects a primary case. All three dispatchers use 373-case primary switches.

Per-case pattern matching

Within each primary case, the dispatcher:

  1. Calls a sequence of pattern-matcher functions, passing pointers to best_template_id and best_priority as out-parameters.
  2. Each matcher may update these if it finds a match with higher priority than the current best.
  3. After all matchers for the opcode have run, the dispatcher checks best_template_id. If it is no longer -1, a secondary switch on the template ID selects the rewrite action.

The secondary switches are embedded inside the giant function. sub_143C440 alone contains 85 secondary jump tables (sizes 7--190 cases), totaling 1,971 switch cases.
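The match-and-cascade protocol described in the three steps above can be sketched as follows (matcher bodies and the template/priority values here are invented examples, not recovered code):

```c
#include <stdint.h>

/* Structural sketch of the dispatch-and-cascade logic; the real matchers
 * are the 3,185 sub_* functions. */
typedef int (*matcher_fn)(int32_t *template_id, int32_t *priority);

static int match_low(int32_t *t, int32_t *p) {   /* a priority-6 pattern */
    if (*p <= 5) { *p = 6;  *t = 10; }
    return 1;
}
static int match_high(int32_t *t, int32_t *p) {  /* a more specific pattern */
    if (*p <= 20) { *p = 21; *t = 42; }
    return 1;
}

static int32_t dispatch(matcher_fn *matchers, int n) {
    int32_t best_template = -1;   /* the two NONE sentinels from the prologue */
    int32_t best_priority = -1;
    for (int i = 0; i < n; i++)
        matchers[i](&best_template, &best_priority);
    return best_template;         /* -1 => no rewrite; else secondary switch */
}

static int32_t dispatch_demo(void) {
    matcher_fn ms[2] = { match_low, match_high };
    return dispatch(ms, 2);       /* the highest-priority template wins */
}
```

Because each matcher only updates the shared state when it beats the running best, the outcome is independent of matcher order.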

Rewrite action

When a rewrite is selected, the action block performs four operations:

setRewrittenOpcode(instr, new_opcode);     // sub_B28F10: writes byte at instr+14
setRewrittenModifier(instr, new_modifier); // sub_B28F20: writes byte at instr+15
setOperandMapping(instr, slot, value);     // sub_BA9CF0: writes instr+72+4*slot
markRewritten(instr);                      // sub_BA9C30 or sub_BA9CB0

sub_BA9C30 (markRewrittenSimple) sets bit 0 of the flags word at instr+140:

*(uint32_t*)(instr + 140) |= 1;

sub_BA9CB0 (markRewrittenComplex) applies priority-aware flag logic that respects existing rewrites from earlier passes -- it sets bits to 0x8 ("superseded") when a higher-priority rewrite exists.

The symmetry of call frequencies in sub_143C440 confirms this: setRewrittenOpcode and setRewrittenModifier are each called exactly 1,759 times -- every rewrite always sets both the opcode and modifier bytes.

Pattern Matcher Signature

Every one of the 3,185 pattern matchers shares the same prototype:

char __fastcall match(
    int64_t ctx,           // a1: peephole optimization context
    int64_t instr,         // a2: instruction node being examined
    int32_t *template_id,  // a3: output -- combined opcode / template ID
    int32_t *priority      // a4: input/output -- current best priority
);

The function returns a char (the last comparison result, used for early-exit optimization in the caller), but the meaningful outputs are *template_id and *priority.

Matching algorithm

Every matcher performs a deeply-nested chain of checks:

Step 1 -- Modifier/property checks. Call queryModifier(ctx, instr, slot) (sub_10AE5C0) repeatedly. Each call returns an enumerated value for a specific instruction property:

if (queryModifier(ctx, instr, 0xDC) != 1206) return 0;   // data type != .f32
if (queryModifier(ctx, instr, 0x163) != 1943) return 0;  // rounding != .rn
if (queryModifier(ctx, instr, 0x7E) - 547 > 1) return 0; // saturation not in {547, 548} (unsigned range check)

The slot indices (0x05, 0x7B, 0x7E, 0x88, 0x90, 0xA1, 0xBE, 0xD2, 0xD3, 0xDC, 0xF2, 0x101, 0x119, 0x126, 0x127, 0x142, 0x152, 0x155, 0x159, 0x15C, 0x163, 0x167, 0x178, 0x179, 0x18A, 0x18D, 0x196, 0x197, 0x199, 0x19D, 0x1A8, 0x1AD, 0x1AE, 0x1AF, 0x1B2, 0x1D1, 0x1D2, 0x1E0, 0x1E4, 0x1EC, 0x216, 0x253, etc.) index into a per-instruction property table covering type, rounding mode, saturation, negate, comparison type, and architecture-specific modifiers.

Step 2 -- Operand count. Check the number of explicit/fixed operands and the total operand slot count:

int fixed = getExplicitOperandCount(instr);  // sub_B28F50: returns *(instr+92)
int total = getTotalOperandSlots(instr);     // sub_B28F40: returns *(instr+40)+1 - *(instr+92)

Step 3 -- Operand type and register class validation. For each operand slot, retrieve the operand pointer and check its kind:

void *op = getOperand(instr, idx);   // sub_B28F30: returns *(instr+32) + 32*idx
byte kind = *(byte*)op;
if (!isRegister(kind))   return 0;   // sub_13B9CD0: kind == 2
// ...or, in matchers that expect an immediate in this slot:
if (!isImmediate(kind))  return 0;   // sub_13B9CE0: kind == 1

Register class is checked against expected values:

int regclass = getRegisterClass(*(uint32_t*)(op + 4)); // sub_13B9CC0
if (regclass != 1023 && regclass != 1) return 0;       // 1023 = wildcard

Step 4 -- Priority gate. If all checks pass and the current priority allows it:

if (*priority <= threshold) {
    *priority = threshold + 1;
    *template_id = combined_opcode_id;
}

Since matchers are called sequentially and each checks the running maximum, the highest-priority match always wins.

Operand Type Discriminators

Three families of trivial single-instruction functions serve as operand type predicates, one family per dispatch context:

SM120 matchers (Zone A of sub_143C440)

| Function | Test | Semantic |
|---|---|---|
| sub_13B9CD0 | kind == 2 | isRegister |
| sub_13B9CE0 | kind == 1 | isImmediate |
| sub_13B9D00 | kind == 2 \|\| kind == 1 | isRegOrImm |
| sub_13B9D10 | kind == ? | isConstantBuffer |
| sub_13B9D40 | kind == ? | isPredicate |
| sub_13B9D50 | kind == ? | isUniformRegister |
| sub_13B9CC0 | extracts class | getRegisterClass (1023 = wildcard) |

Generic matchers (Zone A of sub_169B190)

| Function | Test | Semantic |
|---|---|---|
| sub_15F59C0 | a1 == 2 | isRegister |
| sub_15F59D0 | a1 == 1 | isImmediate |
| sub_15F59E0 | a1 == 0 | isNone |
| sub_15F59F0 | a1 == 10 | isConstantMemory |
| sub_15F5A00 | a1 == 9 | isTexRef |
| sub_15F5A30 | a1 == 3 | isPredicate / isConstImm |
| sub_15F5A40 | a1 == 15 | isUniformRegister / isTrueConst |
| sub_15F5A80 | a1 == 6 | isLabel |
| sub_15F5A90 | a1 == 11 | isTexture |
| sub_15F5AB0 | identity | getOperandValue |

Post-schedule matchers (Zone A of sub_198BCD0)

| Function | Test | Semantic | Call count |
|---|---|---|---|
| sub_1820170 | identity | getOpcodeRaw | 9,278 |
| sub_1820180 | a1 == 2 | isRegOperand | 2,743 |
| sub_1820190 | a1 == 1 | isImmOperand | 677 |
| sub_18201A0 | a1 == 8 | isUniform | 7 |
| sub_18201B0 | a1 == 10 | isPredicateReg | 1,228 |
| sub_18201C0 | a1 == 9 | isTexRef | 211 |
| sub_18201D0 | a1 == 5 | isConstBuf | 14 |
| sub_18201E0 | a1 == 4 | isAddress | 9 |
| sub_18201F0 | a1 == 3 | isConstImm | 1,044 |
| sub_1820200 | a1 == 15 | isTrueConst | 1,044 |
| sub_1820210 | a1 == 7 | isBarrier | 9 |
| sub_1820220 | a1 == 12 | isSurface | 12 |
| sub_1820230 | a1 == 11 | isTexture | 12 |
| sub_1820240 | a1 == 6 | isLabel | 2 |
| sub_1820250 | a1 == 14 | isSpecialReg | 2 |
| sub_1820260 | a1 == 13 | isUnknown | 6 |

Priority System

Matchers use a strict numeric priority to resolve conflicts when multiple patterns match the same instruction. Higher priority means more specific and/or more profitable transformation.

| Priority range | Description | Example |
|---|---|---|
| 1--2 | Trivial matches (simple mov, basic arithmetic) | Single-operand passthrough |
| 5--11 | Common 2--3 operand combining patterns | Standard FMA combines |
| 14--20 | Complex 4-operand patterns with constraints | Multi-source ALU combines |
| 22--31 | Highly specific multi-operand patterns | Wide register + predicated ops |
| 33--36 | Maximum specificity (8--9 operands + all modifiers) | Full tensor instruction forms |

Pattern IDs range from 1 to approximately 244 in the generic and SM120 dispatchers. Multiple matchers can target the same pattern ID with different priorities, creating a priority cascade.

Instruction Node Layout

The peephole subsystem reveals the following fields of the instruction IR node:

| Offset | Size | Field | Accessor |
|---|---|---|---|
| +0x00 | 1 B | Operand type tag | isRegister, isImmediate, etc. |
| +0x04 | 4 B | Primary value (register number / immediate) | getRegisterClass / getOperandValue |
| +0x0C | 2 B | Opcode number (16-bit) | Direct read in dispatch entry |
| +0x0E | 1 B | Rewritten opcode | sub_B28F10 (setRewrittenOpcode) |
| +0x0F | 1 B | Rewritten modifier | sub_B28F20 (setRewrittenModifier) |
| +0x14 | 4 B | Secondary register field | Direct read |
| +0x20 | 8 B | Operand array base pointer | sub_B28F30 base address |
| +0x28 | 4 B | Total operand count | Part of sub_B28F40 computation |
| +0x48 | var | Operand mapping table (4 B per slot) | sub_BA9CF0 writes here |
| +0x5C | 4 B | Explicit operand count | sub_B28F50 returns this |
| +0x8C | 4 B | Flags word | Bit 0 = rewritten (set by sub_BA9C30) |

Each operand is a 32-byte record at base + 32 * index:

| Operand offset | Size | Content |
|---|---|---|
| +0 | 1 B | Type tag (1=imm, 2=reg, 3=constImm, 10=pred, 15=trueConst, ...) |
| +4 | 4 B | Primary value (register ID; 1023 = wildcard / any-reg) |
| +20 | 4 B | Secondary value (modifier / sub-register) |
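The recovered operand offsets can be rendered as a C struct (an assumed layout: the pad fields cover bytes the analysis did not recover, and all names are invented):

```c
#include <stddef.h>
#include <stdint.h>

/* C rendering of the 32-byte operand record under the assumed layout. */
typedef struct {
    uint8_t  kind;        /* +0   type tag: 1=imm, 2=reg, ...        */
    uint8_t  pad0[3];
    uint32_t value;       /* +4   primary value; 1023 = wildcard     */
    uint8_t  pad1[12];    /* +8..+19 unrecovered                     */
    uint32_t secondary;   /* +20  modifier / sub-register            */
    uint8_t  pad2[8];     /* +24..+31 unrecovered                    */
} operand_rec;            /* 32 bytes, addressed at base + 32*index  */
```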

Code Duplication

The pattern matchers exhibit extreme structural duplication. Groups of 2--10 functions are near-identical clones differing only in numeric constants (the specific opcode/modifier values they check, the template ID they assign, and the priority level).

Observed clone clusters in sub_169B190's matchers:

| Cluster size | Count | Byte size each | Address range example |
|---|---|---|---|
| ~5,560 B | 5 functions | 5,560 | 0x167CBB0--0x16E7D20 |
| ~5,282 B | 10 functions | 5,282 | 0x167E3A0--0x16807E0 |
| ~5,298 B | 4 functions | 5,298 | 0x16EA5F0--0x16ECA30 |
| ~5,846 B | 3 functions | 5,846 | 0x16EDC00--0x16EE8B0 |
| ~2,718 B | 7 functions | 2,718 | 0x166F260--0x1692B60 |
| ~2,604 B | 6 functions | 2,604 | 0x166AC30--0x166E170 |

Similarly, in sub_198BCD0's matchers, eight functions of exactly 5,282 bytes each (sub_1982810, sub_1982AE0, sub_1982DB0, sub_1983080, sub_1984B40, sub_1984E10, sub_19850E0, sub_19853B0) share identical structure, varying only in the opcode/modifier constants passed to sub_10AE5C0.

This strongly suggests compiler-generated code from C++ templates or macros that instantiate one matcher function per instruction variant from ISA specification tables -- a pattern consistent with NVIDIA's internal build tooling.

Size Distribution of Matchers

SM120 matchers (1,087 functions, 429 KB)

| Size range | Count | Description |
|---|---|---|
| < 200 B | 37 | Simple 1--2 modifier checks |
| 200--400 B | 520 | Typical 4--8 modifier checks |
| 400--600 B | 455 | 6--12 modifier checks + operand validation |
| 600--800 B | 66 | Complex multi-operand patterns |
| > 800 B | 9 | Deepest nesting, most constrained patterns |

Generic matchers (762 functions, ~310 KB)

| Size range | Count | Description |
|---|---|---|
| ~2,200 B | most common | 2--4 instruction field checks |
| ~2,800 B | moderate | Patterns with operand constraints |
| ~3,500--4,000 B | fewer | Complex multi-operand patterns |
| ~5,500--8,500 B | rare | 12+ modifier checks, 8--9 operands |

Post-schedule matchers (~1,336 functions)

| Size range | Count | Description |
|---|---|---|
| ~2,200 B | most common | Simple 2-instruction patterns |
| ~2,500 B | common | 3-instruction patterns |
| ~3,100 B | moderate | Patterns with predicate checks |
| ~5,300 B | few | Multi-instruction sequences (8+ operands) |
| ~6,800 B | 1 | Largest matcher (sub_1980D10) |

Representative Matcher Examples

Simplest: sub_143C3B0 (132 bytes, priority 2, template 1)

Checks: no explicit operands, 2 total slots, first operand is register-or-immediate with register class 1023 or 1. Matches a trivial mov-type instruction for passthrough combining.

Moderate: sub_13CF0C0 (426 bytes, priority 15, template 28)

Checks 5 modifiers: slot 0xD3 == 1181, slot 0xD2 == 1177, slot 0x0C == 59, slot 0xB3 == 772, slot 0xC8 == 1107. Then validates 1 explicit register operand plus 4 additional operands (register, register, immediate, predicate).

Complex: sub_1615980 (priority 36, template 25 -- highest observed priority)

Checks 12 modifier slots: 0x05 == 12, 0xDC == 1206, 0x253 in {2937,2938}, 0x126 == 1493, 0xF2 in {1281,1282}, 0x163 == 1943, 0x178 == 2035, 0x179 in {2037..2041}, 0x1AD in {2253..2257}, 0x7E in {547,548}, 0x19D in {2167,2168}, 0x18D == 2115. No fixed operands, 7 variable operands, each of type 10 (constant memory) with register class 1023 or specific flag constraints. This is the most constrained pattern observed -- likely a fully specified tensor instruction variant.

Post-schedule: sub_1834600 (pattern 17, priority 16)

Checks modifier slots 0xD3 == 1181, 0xD2 == 1177, 0x0C in {60,61}, 0xB3 == 772, 0xC8 == 1107. Then: first operand offset == 1, that operand is immediate, total operand count == 5, followed by register pattern checks.

Infrastructure Helper Functions

Core accessor (sub_10AE5C0, 60 bytes)

The single most-called function in the peephole subsystem (30,768 callers across the full binary). Queries a property of an instruction node by slot ID:

int queryModifier(int64_t ctx, int64_t instr, int slot) {
    if (hasProperty(instr, slot))        // sub_10E32E0
        return getPropertyValue(instr, slot); // sub_10D5E60
    return 0xFFFFFFFF;                   // property not present
}

Node accessors

| Function | Size | Semantics | Call frequency |
|---|---|---|---|
| sub_B28F30 | 12 B | getOperand(instr, idx) -- returns *(instr+32) + 32*idx | 31,399 |
| sub_B28F40 | 10 B | getTotalOperandSlots(instr) -- returns *(instr+40)+1 - *(instr+92) | ~2,500 |
| sub_B28F50 | 4 B | getExplicitOperandCount(instr) -- returns *(instr+92) | ~2,100 |

Rewrite helpers

| Function | Semantics | Call frequency in sub_143C440 |
|---|---|---|
| sub_B28F10 | setRewrittenOpcode(instr, byte) -- writes instr[14] | 1,759 |
| sub_B28F20 | setRewrittenModifier(instr, byte) -- writes instr[15] | 1,759 |
| sub_BA9CF0 | setOperandMapping(instr, slot, val) -- writes instr[72+4*slot] | 993 |
| sub_BA9C30 | markRewrittenSimple(instr) -- instr[140] \|= 1 | 1,222 |
| sub_BA9CB0 | markRewrittenComplex(instr) -- priority-aware flag update | 361 |

The ratio of markRewrittenSimple (1,222) to markRewrittenComplex (361) shows that approximately 77% of rewrites are straightforward replacements, while 23% involve priority negotiation with competing rewrites.

Call Frequency in sub_169B190 (Generic Dispatcher)

| Callee | Count | Role |
|---|---|---|
| sub_B28F10 (setRewrittenOpcode) | 2,142 | Write new opcode byte |
| sub_B28F20 (setRewrittenModifier) | 2,142 | Write new modifier byte |
| sub_15F59B0 (getOperandValue) | 1,736 | Extract register number |
| sub_10AE5C0 (queryModifier) | 1,303 | Read instruction property |
| sub_B28F30 (getOperand) | 1,281 | Get operand pointer |
| sub_BA9C30 (markRewrittenSimple) | 1,261 | Simple rewrite commit |
| sub_BA9CF0 (setOperandMapping) | 855 | Map operand slots |
| sub_BA9CB0 (markRewrittenComplex) | 589 | Priority-aware commit |

Relationship to Instruction Encoding

Each dispatch function's address range is adjacent to a zone of SASS instruction encoders that consume the rewritten instructions:

  • sub_143C440 (SM120) sits before 123 SM120 encoders at 0x14771E0--0x14A3C80 (180 KB), covering 82 unique SASS opcodes with up to 42 encoding variants per opcode.
  • sub_169B190 (generic) sits before 100 encoding table entries at 0x16DF750--0x16FFFF0 and 36 template expanders at 0x1700000--0x1722D60.
  • sub_198BCD0 (post-schedule) operates on already-scheduled instructions, performing strength reduction and idiom recognition on the final instruction stream.

The encoders are called via vtable dispatch, not directly from the peephole functions. Each encoder packs a 128-bit SASS instruction word using sub_7B9B80(state, bit_offset, bit_width, value) for bit-field insertion.
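The role of sub_7B9B80 can be sketched as a generic bit-field insert into the packed instruction word: clear the target field, then OR in the new value. This is a reconstruction in the spirit of the reported signature, not the decompiled body; the 128-bit word is modeled as two 64-bit halves, and the sketch assumes fields do not straddle the halfword boundary:

```c
#include <stdint.h>

/* Sketch of sub_7B9B80(state, bit_offset, bit_width, value):
 * clear the field, then OR in the (masked) new value. */
static void bitfield_insert(uint64_t word[2], unsigned bit_offset,
                            unsigned bit_width, uint64_t value) {
    unsigned w = bit_offset / 64, off = bit_offset % 64;
    uint64_t mask = (bit_width >= 64) ? ~0ull : ((1ull << bit_width) - 1);
    word[w] = (word[w] & ~(mask << off)) | ((value & mask) << off);
}

static uint64_t bf_demo(void) {
    uint64_t w[2] = { 0, 0 };
    bitfield_insert(w, 4, 8, 0xAB);   /* write 0xAB into bits [11:4] */
    return w[0];
}
```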

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_B12930 | 11 B | Entry trampoline for generic peephole | CERTAIN |
| sub_B12940 | 11 B | Entry trampoline for SM120 peephole | CERTAIN |
| sub_B12960 | 11 B | Entry trampoline for post-schedule peephole | CERTAIN |
| sub_169B190 | 280 KB | Generic peephole mega-dispatcher | HIGH |
| sub_143C440 | 233 KB | SM120 peephole mega-dispatcher | HIGH |
| sub_198BCD0 | 233 KB | Post-schedule peephole mega-dispatcher | HIGH |
| sub_10AE5C0 | 60 B | queryModifier(ctx, instr, slot) | HIGH |
| sub_B28F10 | small | setRewrittenOpcode(instr, byte) | HIGH |
| sub_B28F20 | small | setRewrittenModifier(instr, byte) | HIGH |
| sub_B28F30 | 12 B | getOperand(instr, idx) | CERTAIN |
| sub_B28F40 | 10 B | getTotalOperandSlots(instr) | CERTAIN |
| sub_B28F50 | 4 B | getExplicitOperandCount(instr) | CERTAIN |
| sub_BA9C30 | small | markRewrittenSimple(instr) | HIGH |
| sub_BA9CB0 | small | markRewrittenComplex(instr) | HIGH |
| sub_BA9CF0 | small | setOperandMapping(instr, slot, value) | HIGH |
| sub_13B9CC0 | small | getRegisterClass(field) | HIGH |
| sub_13B9CD0 | small | isRegister(byte) | HIGH |
| sub_13B9CE0 | small | isImmediate(byte) | HIGH |
| sub_13B9D00 | small | isRegisterOrImmediate(byte) | HIGH |
| sub_13B9D10 | small | isConstantBuffer(byte) | HIGH |
| sub_13B9D40 | small | isPredicate(byte) | HIGH |
| sub_13B9D50 | small | isUniformRegister(byte) | HIGH |
| sub_13B9DC0 | small | opcodeIdentity(uint) -- passthrough | CERTAIN |
| sub_1909030 | small | opcodePassthrough (post-schedule context) | HIGH |

Macro Instruction Expansion (sub_8127C0)

Separate from the three pattern-match-and-rewrite mega-dispatchers, ptxas contains a dedicated macro instruction expansion pass at sub_8127C0 (10,720 bytes). This pass resolves register-file constraints for composite instructions -- cases where source or destination operands span register files or where multi-word results need splitting into narrower instruction sequences.

It is called from the master lowering dispatcher sub_8380A0 and runs before instruction scheduling.

Two-phase algorithm

Phase 1 -- Operand scanning and constraint annotation. The pass iterates every instruction in the function's linked list (traversing via instr+8). For each instruction, it reads the opcode at instr+72 and dispatches through a 15-family if-else cascade. For each opcode, it calls sub_812550 (getOperandConstraint) on each source operand to determine register-file affinity:

| Return value | Meaning |
|---|---|
| 0 | Unconstrained |
| -2 | Constrained to register file B (e.g., even-aligned pair) |
| -3 | Constrained to register file A (e.g., odd-aligned pair) |
| -1 | Conflict / unresolvable |

The pass annotates register descriptor entries (indexed via ctx+88) at reg+76 (constraint code) and reg+80 (target width code), and builds a linked list of instructions requiring expansion (linked via instr+56). Registers consumed by expansion are marked dead (reg+64 = 5).

Phase 2 -- Instruction rewriting. If any instruction requires expansion, the pass iterates the worklist and performs actual rewrites: replacing composite instructions with equivalent sequences, inserting new instructions via the sub_930040 / sub_92FF10 / sub_92E720 emitters, and deleting originals via sub_9253C0. Register-file mapping uses two lookup tables: dword_21D5EE0[26] (for constraint -2) and dword_21D5F60[16] (for constraint -3).

Between phases, a cleanup loop removes worklist entries with conflicting constraints (both operands invalid), resetting reg+76 = -1.

Opcodes handled

| Opcode | Mnemonic | Expansion pattern |
|---|---|---|
| 10 | SHF | Three-source constraint check; emits I2IP (36) + new SHF when sources span register files |
| 18 | FSETP | Predicate operand finalization when operand count == 6 and modifier bits match |
| 29 | PMTRIG | Last-operand extraction and finalization |
| 36 | I2IP | Destination register marking and two-source constraint checking |
| 60 | LEPC | Store/load legalization: validates flags, checks register file == 6, recursive chain validation via sub_812480 |
| 62, 78, 79 | BAR_INDEXED, RTT, BSYNC | Same legalization path as LEPC |
| 95, 96 | STS, LDG | Last-operand extraction for stores; two-source vector-width constraint checking for loads |
| 97 | STG | Source registration for expansion tracking |
| 130 | HSET2 | Validates single-def destination, recursive source constraint chains; inserts HSET2 rewrites or converts to opcode-201 stores |
| 137 | SM73_FIRST | Same path as HSET2 |
| 149 | UFLO | Two-source validation; marks destination with width code 20; vectorization combining |
| 151 | UIMAD | Shared three-source path with SHF |
| 190 | LDGDEPBAR | Shared last-operand path with PMTRIG |
| 201, 202 | QMMA_16816, QMMA_16832 | Full multi-operand legalization; inserts barrier instructions for QMMA |
| 283 | UVIADD | Penultimate operand extraction and type resolution |
| 290 | MOV (sm_104) | Same constraint path as SHF/UIMAD |
| bit 12 set | (arch-specific) | Last-operand extraction for architecture-extended instructions |

sub_812550 -- getOperandConstraint

The single most-called helper (32 call sites), this 40-byte function reads the constraint code from the register descriptor for a given operand reference:

int getOperandConstraint(int64_t ctx, uint32_t *operand_ref) {
    int modifier_bits = operand_ref[1];
    int constraint = reg_array[*operand_ref & 0xFFFFFF].constraint;  // reg+76
    if ((modifier_bits & 0xFE000000) == 0)
        return constraint;      // no sub-register modifier => raw value
    // Apply modifier-aware transformations:
    //   constraint -2 + certain modifier combos => -3 or -1
    //   constraint -3 + modifier bit 0x3C000000 => -1; + sign bit => -2
    ...
}

sub_812480 -- validateOperandChain

Recursively walks use-def chains through HSET2 (130) and SM73_FIRST (137) instructions to verify that an entire operand chain is compatible with a target register file. Uses sub_A9BD00 to resolve the register file for a width code, then checks reg+76 and reg+80 agreement.

Knob gate

Option 183 (target profile offset 13176) controls the expansion distance threshold. When enabled, a secondary value at profile+13184 sets the maximum distance between a register definition and its use before the constraint is considered violated. Default threshold: 7.
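Under the stated knob semantics, the gate reduces to a simple def-to-use distance check. A hypothetical rendering (names and the exact distance metric are assumptions):

```c
/* Hypothetical rendering of the knob-183 distance gate: a constraint is
 * treated as violated once the definition-to-use distance exceeds the
 * profile threshold (default 7). */
static int constraint_violated(int def_index, int use_index,
                               int gate_enabled, int threshold) {
    if (!gate_enabled)
        return 0;                       /* option 183 off: never violated */
    return (use_index - def_index) > threshold;
}
```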

Function map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_8127C0 | 10,720 B | ExpandMacroInstructions (main pass) | HIGH |
| sub_812550 | 40 B | getOperandConstraint | HIGH |
| sub_812480 | ~170 B | validateOperandChain | HIGH |
| sub_8125E0 | ~450 B | canExpandStoreChain | MEDIUM |
| sub_800470 | small | isLegalizable | MEDIUM |
| sub_800360 | small | resolveOperandType | MEDIUM |
| sub_800400 | small | finalizeOperand | MEDIUM |

Cross-References

Evidence Index

Claim                                                     Source
-----                                                     ------
sub_143C440 structure, 1,087 matchers, 373-case switch    p1.20-sweep-0x13CF000-0x14A4000.txt lines 1--486
SM120 encoder zone (123 functions, 180 KB)                p1.20 lines 269--329
sub_169B190 structure, 762 matchers, 280 KB               p1.22 lines 1--460, p1.23 lines 1--588
Generic operand discriminators (sub_15F59C0 family)       p1.22 lines 181--201
Clone clusters in generic matchers                        p1.23 lines 156--174
Post-schedule discriminators (sub_1820170 family)         p1.25 lines 271--289
sub_198BCD0 structure, 1,336 callees, 373-case switch     p1.26 lines 355--398
Post-schedule 5,282-byte clone group                      p1.26 lines 401--424
Rewrite helper call frequencies                           p1.20 lines 216--227, p1.23 lines 228--237
Priority 36 as highest observed                           p1.22 lines 316--327
Instruction node layout                                   p1.20 lines 406--420, p1.22 lines 367--409

Mercury Encoder Pipeline

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Mercury is NVIDIA's intermediate encoding layer between the optimizer's Ori IR and native SASS machine code. It is not a direct binary encoding of SASS -- it is a separate representation that contains pseudo-instructions, lacks dependency barriers, and requires multiple transformation passes before it becomes executable GPU code. The Mercury pipeline occupies phases 113--122 of the 159-phase PhaseManager, forming a six-stage sub-pipeline: encode/decode verification, pseudo-instruction expansion, two WAR-hazard passes (one before and one after operation expansion), scoreboard/latency generation ("opex"), and final SASS microcode emission. All recent GPU architectures (SM 75+) use Mercury as the encoding backend; SM 100+ (Blackwell) defaults to "Capsule Mercury" (capmerc), a variant that embeds additional metadata for relocatable patching.

Pipeline phases        113--122 (8 active phases within Mercury sub-pipeline)
Core orchestrator      sub_6F52F0 (23KB, RunStages -- 18 parameters)
Master encoder         sub_6D9690 (94KB, EncodeInstruction -- largest backend function)
Opex body              sub_6FFDC0 (66KB, EmitInstructions -- scoreboard generation)
Expansion pass         sub_C3CC60 (26KB, MercExpand::run)
WAR generator          sub_6FBC20 (7.4KB, GenerateWARHazards)
SASS emitter           sub_6E4110 (24KB, MercGenerateSassUCode)
Bitfield insert        sub_7B9B80 (216 bytes, 18,347 callers across binary)
Encoding table funcs   530 functions at 0xC66000--0xD27000
Mercury mode flag      *(DWORD*)(context+385) == 2
Mode check             sub_10ADF10 returns bool from target descriptor
MercConverter          sub_9F3340 (7KB orchestrator), sub_9EF5E0 (27KB operand reorganization)
CLI option             --binary-kind mercury,capmerc,sass

Architecture

Phase 113  PostFixForMercTargets          Late Ori fixups for Mercury targets
Phase 114  FixUpTexDepBarAndSync          Texture dependency bars + sync fixups
Phase 115  AdvancedScoreboardsAndOpexes   Arch hook point (noop by default)
Phase 116  ProcessO0WaitsAndSBs           -O0 scoreboard insertion
                                          ──────────────────────────────
Phase 117  MercEncodeAndDecode            ┐
Phase 118  MercExpandInstructions         │  Six-stage Mercury core
Phase 119  MercGenerateWARs1              │
Phase 120  MercGenerateOpex               │
Phase 121  MercGenerateWARs2              │
Phase 122  MercGenerateSassUCode          ┘

sub_6F52F0 (23KB orchestrator, 18 params)
  │
  ├─ [1] Decode:     sub_6F2BF0 (59KB)  — Encode Ori→Mercury binary, decode back
  │      └─ sub_6D9690 (94KB master encoder switch)
  │           ├─ sub_6D2750 — append operand word
  │           ├─ sub_6D28C0 — commit encoded instruction
  │           ├─ sub_6D9580 — encode literal values
  │           └─ sub_931690 — create instruction record
  │
  ├─ [2] Expansion:  sub_C3CC60 (26KB)  — Expand pseudo-instructions to SASS
  │      ├─ sub_C37A10 (16KB) — expandInstruction (jump table dispatch)
  │      ├─ sub_C39B40 (10KB) — expandMemoryOp
  │      ├─ sub_C3A460 (6KB)  — expandAtomicOp
  │      ├─ sub_C3B560 (8KB)  — expandTexture
  │      ├─ sub_C3BCD0 (19KB) — expandControlFlow
  │      └─ sub_C3E030 (18KB) — finalizeExpansion
  │
  ├─ [3] WAR pass 1: sub_6FBC20 (7.4KB) — DEPBAR/scoreboard for pre-opex hazards
  │      ├─ sub_6FA5B0 — detect WAR hazard per instruction
  │      ├─ sub_6FA930 — insert scoreboard barrier (opcode 54)
  │      ├─ sub_6FA7B0 — insert WAITDP (opcode 246)
  │      └─ sub_6FAA90 — insert stall cycles
  │
  ├─ [4] Opex:       sub_6FFDC0 (66KB)  — Generate scoreboards + latency waits
  │      └─ sub_703480 (1.4KB entry) or sub_7032A0 (2.3KB MercOpex entry)
  │
  ├─ [5] WAR pass 2: sub_6FBC20          — Same pass, re-run for opex-introduced hazards
  │
  └─ [6] SASS emit:  sub_6E4110 (24KB)  — Final SASS microcode generation
         └─ sub_735290 — per-instruction encoding pipeline
              ├─ sub_733FA0 — encode instruction operands
              ├─ sub_734370 — encode immediates
              ├─ sub_734820 — encode predicates
              ├─ sub_734AD0 — encode memory operands
              └─ sub_734D20 — encode complex operands (texture/surface/barrier)

Each stage logs its completion via trace infrastructure: "After Decode", "After Expansion", "After WAR post-expansion", "After Opex", "After WAR post-opexing".

Mercury vs SASS vs Capsule Mercury

The ptxas CLI (sub_703AB0) accepts --binary-kind with three values:

Mode              CLI value   Default for           Description
----              ---------   -----------           -----------
Mercury           mercury     SM 75--99             Traditional Mercury intermediate encoding
Capsule Mercury   capmerc     SM 100+ (Blackwell)   Mercury + embedded PTX source + relocation metadata
Raw SASS          sass        (explicit only)       Direct SASS binary output

Additional CLI flags:

  • --cap-merc -- force Capsule Mercury generation
  • --self-check -- roundtrip verification: reconstitute SASS from capmerc, compare with original
  • --out-sass -- dump reconstituted SASS from capmerc

Mercury mode is flagged at *(DWORD*)(context+385) == 2. The function sub_10ADF10 queries the target descriptor to determine whether Mercury encoding is active for the current architecture.

MercConverter -- Operand Reorganization for Encoding

Phase                      141 (MercConverter)
Orchestrator               sub_9F3340 (7KB)
Post-conversion lowering   sub_9EF5E0 (27KB)
Opcode dispatch            sub_9ED2D0 (25KB, shared with phase 5)
Strings                    "CONVERTING", "After MercConverter"

Phase 141 runs the MercConverter infrastructure a second time, after the full optimization pipeline has completed. While phase 5 (ConvertUnsupportedOps) performs the initial PTX-to-SASS opcode conversion early in the pipeline, phase 141 re-invokes the same machinery to handle instructions that were introduced or modified by optimization passes (rematerialization, peephole, loop transformations) and may contain PTX-derived opcodes that were never legalized. After phase 141 completes, the "After MercConverter" diagnostic string appears, and every instruction in the IR carries a valid SASS opcode ready for Mercury encoding.

The orchestrator sub_9F3340 runs two steps sequentially:

  1. Opcode conversion (sub_9F1A90, 35KB): the main MercConverter dispatch documented in ISel. Converts any remaining PTX-derived opcodes to SASS equivalents via the master switch in sub_9ED2D0. Gated by *(BYTE*)(*(context+8) + 1398) & 0x20.

  2. Operand reorganization (sub_9EF5E0, 27KB): post-conversion lowering that restructures operand lists into a form the Mercury encoder can consume directly. Gated by *(BYTE*)(*(context+16) + 1048) != 0 AND *(context+104) != 0 (non-empty instruction BST).

Post-Conversion Lowering -- sub_9EF5E0 (27KB)

This function transforms the BST (binary search tree) of converted instructions produced by step 1 into encoding-ready conversion nodes. For each instruction record in the BST, it performs three operations:

1. Operand sort. Calls sub_9EC160, a linked-list merge sort (Floyd's slow/fast pointer midpoint, recursive split-and-merge) that sorts the operand chain by the operand index at entry+16. This establishes a canonical ordering required by the encoder.
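A minimal, runnable model of that sort -- assuming a reduced operand node carrying only the index key, since the real node is larger and the key sits at entry+16:

```c
#include <stddef.h>

/* Model of sub_9EC160: recursive linked-list merge sort keyed on the
 * operand index (the field at entry+16 in the real node). */
typedef struct Op { int index; struct Op *next; } Op;

static Op *merge(Op *a, Op *b) {
    Op head = {0}, *tail = &head;
    while (a && b) {
        if (a->index <= b->index) { tail->next = a; a = a->next; }
        else                      { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

static Op *merge_sort(Op *list) {
    if (!list || !list->next) return list;
    /* Floyd's slow/fast walk finds the midpoint. */
    Op *slow = list, *fast = list->next;
    while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
    Op *right = slow->next;
    slow->next = NULL;                 /* split, then sort-and-merge halves */
    return merge(merge_sort(list), merge_sort(right));
}
```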

2. Contiguous/gap partitioning. Walks the sorted operand list and classifies each operand into one of two sublists:

// Simplified partitioning logic (lines 215-348 of decompilation)
for (op = first; op != sentinel; op = op->next) {
    int cur_idx  = *(DWORD*)(op + 16);
    int next_idx = *(DWORD*)(op->next + 16);

    if (next_idx - cur_idx == 32) {
        // Consecutive register indices -> contiguous sublist
        append_to_contiguous_list(node, cur_idx);
    } else {
        // Non-consecutive -> gap sublist (stores both cur and next index)
        append_to_gap_list(node, cur_idx, next_idx);
    }
}

The stride of 32 reflects the operand index encoding: index = register_number * 32 + modifier_bits. Contiguous operands (stride-32 sequences like R0, R1, R2, R3) represent packed register groups -- common in wide loads (LDG.128), GMMA matrix operands, and multi-register moves. The encoder can represent these as a single register-range specifier. Gap operands break the stride and require individual encoding slots.
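The classification above can be modeled as a runnable sketch over a sorted index array (the real pass walks the linked operand list, but the stride test is the same):

```c
#include <stddef.h>

/* Model of the contiguous/gap partition: adjacent sorted operand indices
 * with a stride of exactly 32 (index = register_number*32 + modifier_bits)
 * join the contiguous sublist; any other stride goes to the gap sublist. */
static void partition_counts(const int *idx, int n,
                             int *contiguous, int *gaps) {
    *contiguous = *gaps = 0;
    for (int i = 0; i + 1 < n; i++) {
        if (idx[i + 1] - idx[i] == 32)
            (*contiguous)++;        /* e.g. R0,R1 -> packed register group */
        else
            (*gaps)++;              /* stride break -> individual slot */
    }
}
```

For R0..R3 plus a stray R10 (indices 0, 32, 64, 96, 320) this yields three contiguous pairs and one gap.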

3. Conversion node construction. Allocates a 168-byte conversion node per instruction, inserts it into a per-record BST sorted by (block_id, sub_block_id), and links the two operand sublists:

Conversion Node (168 bytes):
  +0     8B    BST left child
  +8     8B    BST right child
  +16    8B    BST parent
  +24    4B    block_id
  +28    4B    sub_block_id
  +32    48B   Contiguous operand doubly-linked list (6 pointers)
  +80    4B    Contiguous operand count
  +88    8B    Contiguous list ref-counted handle
  +96    48B   Gap operand doubly-linked list (6 pointers)
  +144   4B    Gap operand count
  +152   8B    Gap list ref-counted handle
  +160   1B    Flags

BST insertion calls sub_7C11F0 for red-black tree rebalancing. The record tracks min/max block IDs at record+32 and record+40 for range queries.
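The BST ordering on (block_id, sub_block_id) is plain lexicographic comparison; a sketch of the comparator (the field names mirror the node offsets +24/+28 above):

```c
/* Lexicographic ordering on (block_id, sub_block_id), as used to place
 * conversion nodes in the per-record BST. */
typedef struct { int block_id; int sub_block_id; } NodeKey;

static int node_key_cmp(NodeKey a, NodeKey b) {
    if (a.block_id != b.block_id)
        return a.block_id < b.block_id ? -1 : 1;
    if (a.sub_block_id != b.sub_block_id)
        return a.sub_block_id < b.sub_block_id ? -1 : 1;
    return 0;   /* equal keys: same (block, sub-block) slot */
}
```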

Encoding Validation and Fallback

After building the conversion node, the function attempts encoding:

// Lines 949-982 of decompilation
nullsub_644(*(a1+16), node, "CONVERTING");      // diagnostic trace
int result = sub_7BFC30(node);                   // encoding validation

if (result == -1) {
    // Encoding failed: recursive fallback
    sub_9CE210(a1, node);
    // Continue with next instruction in BST
} else {
    // Encoding succeeded: emit to output
    *(node + 4) = result;                        // store encoding index
    output_slot = vtable_alloc(*(a1+24), 120);   // allocate output record
    *(output_slot + 96) = node;                  // link conversion node
    sub_9314F0(&scratch, *(a1+8), 0xF, 1, 1,    // emit SASS instruction
               &control_word);                   // control = 0x60000000
}

sub_7BFC30 validates the conversion node by traversing its operand tree and checking that the contiguous/gap partition can be represented in the target encoding format. It returns the encoding index on success, or -1 if the instruction's operand pattern cannot be encoded in the available formats.

On failure, sub_9CE210 (a recursive fallback) re-processes the instruction using a different encoding strategy -- typically splitting the operand group into smaller sub-groups that each fit the available encoding width. This handles edge cases like wide operations with mixed register classes.
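The exact fallback strategy in sub_9CE210 was not fully recovered; one plausible shape of such a recursive split, with a toy encodability rule standing in for sub_7BFC30, is:

```c
#include <stdbool.h>

/* Toy stand-in for the encoding validator: pretend at most a register
 * pair fits one encoding slot. The real check is format-driven. */
static bool try_encode_group(const int *ops, int n) {
    (void)ops;
    return n <= 2;
}

/* Hypothetical shape of the recursive fallback: if a group of n operands
 * cannot be encoded as one unit, split it in half and retry each half,
 * bottoming out at single operands. */
static bool encode_with_fallback(const int *ops, int n) {
    if (n <= 0) return true;
    if (try_encode_group(ops, n)) return true;   /* whole group fits */
    if (n == 1) return false;                    /* cannot split further */
    int half = n / 2;
    return encode_with_fallback(ops, half) &&
           encode_with_fallback(ops + half, n - half);
}
```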

Relationship to Phase 5

Phase 5 and phase 141 share the same code (sub_9F3340 orchestrator, sub_9ED2D0 dispatch, sub_9EF5E0 post-conversion). The difference is context:

Property            Phase 5                          Phase 141
--------            -------                          ---------
Pipeline position   Before optimization              After optimization, before Mercury encoding
Purpose             Convert PTX opcodes to SASS      Re-legalize instructions introduced by optimizer
Input               Raw Ori IR with PTX opcodes      Optimized Ori IR with possibly-illegal opcodes
Output              Optimizer-ready SASS-opcode IR   Encoding-ready IR for Mercury phase 142+
Gate flag           *(context+1398) & 0x20           Same flag, re-checked

Stage 1: MercEncodeAndDecode -- Roundtrip Verification

Phase           117
Orchestrator    sub_6F52F0 (23KB, 18 parameters)
Decode worker   sub_6F2BF0 (59KB)
String          "After EncodeAndDecode"

This phase encodes the Ori IR instruction stream into Mercury binary form, then immediately decodes it back and verifies that the decoded result matches the original. This is a self-consistency check that catches encoding bugs early -- if the roundtrip fails, the instruction cannot be correctly represented in Mercury format.

The orchestrator sub_6F52F0 passes the entire pipeline state (18 parameters) to sub_6F2BF0, which performs the actual encode-decode cycle using the master encoder sub_6D9690.

Master Encoder -- sub_6D9690 (94KB)

The central SASS instruction encoding function and the single largest function in the ptxas backend. It contains a massive switch statement on the instruction type field (read from instruction+8) with cases covering every SASS instruction format.

// Simplified encoding flow
void EncodeInstruction(context, instruction) {
    int type = *(int*)(instruction + 8);
    uint64_t base = 0x2000000000LL;     // encoding base constant

    switch (type) {
    case 61:    // FFMA with literal operand
        sub_6D9580(ctx, operand);       // encode literal
        break;
    case 455:   // complex multi-operand format
        // ... bit-field extraction and assembly ...
        break;
    // ... hundreds of cases ...
    }

    // Common: append operand words, commit
    sub_6D2750(ctx, word);              // append 8-byte operand word
    sub_6D28C0(ctx);                    // commit instruction record
}

Encoding details:

  • Instructions are encoded as sequences of 8-byte words
  • Operand word type prefix in bits [31:28]: 0x1 = register, 0x5 = immediate/constant, 0x6 = control/modifier, 0x7 = literal, 0x9 = special
  • Control words carry the 0x60000000 prefix
  • Architecture-specific bits accumulated in a flags variable, with SM 100+ extensions via knob 4176
  • sub_7D6860 handles data type encoding (FP32/FP64/INT, etc.)
  • sub_C00BF0 provides opcode lookup from the encoding tables
  • sub_91D160 handles register operand encoding
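The type-prefix layout above can be sketched as two small helpers -- a model under the assumption that the tag occupies the top nibble of a 32-bit word and the rest is payload:

```c
#include <stdint.h>

/* Operand-word type tag in bits [31:28]; payload below. Tag values per
 * the list above: 0x1 register, 0x5 immediate/constant, 0x6 control/
 * modifier, 0x7 literal, 0x9 special. */
static uint32_t make_operand_word(uint32_t type_tag, uint32_t payload) {
    return (type_tag << 28) | (payload & 0x0FFFFFFF);
}

static uint32_t operand_word_type(uint32_t word) {
    return word >> 28;
}
```

Note that a control word with an empty payload comes out as 0x60000000, matching the control-word prefix quoted above.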

Instruction Word Format

The Mercury instruction word is a 1280-bit (160-byte, 20-QWORD) structure located at offset +544 in the encoder object. All bit-field insertions use sub_7B9B80:

// sub_7B9B80 -- bitfield insert (216 bytes, 18,347 callers)
// Signature: (encoder_obj, bit_offset, bit_width, value)
void bitfield_insert(char *a1, int bit_offset, int bit_width, uint64_t value) {
    uint64_t mask  = (bit_width >= 64) ? ~0ULL : (1ULL << bit_width) - 1;
    int qword_idx  = bit_offset >> 6;
    int bit_pos    = bit_offset & 63;
    uint64_t *word = (uint64_t *)(a1 + 8 * qword_idx + 544);
    word[0] |= (value & mask) << bit_pos;
    if (bit_pos + bit_width > 64)            // field straddles a QWORD boundary
        word[1] |= (value & mask) >> (64 - bit_pos);
    // bit offsets are valid up to 1280 (20 QWORDs at encoder+544)
}
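As a runnable check of the cross-QWORD behavior, the same insertion logic can be exercised against a standalone 20-QWORD (1280-bit) word instead of the encoder object at +544:

```c
#include <stdint.h>
#include <string.h>

/* Standalone model of the sub_7B9B80 insert over a plain 1280-bit word. */
typedef struct { uint64_t w[20]; } InstrWord;

static void bf_insert(InstrWord *iw, int bit_offset, int bit_width,
                      uint64_t value) {
    uint64_t mask = (bit_width >= 64) ? ~0ULL : (1ULL << bit_width) - 1;
    int q = bit_offset >> 6, pos = bit_offset & 63;
    iw->w[q] |= (value & mask) << pos;
    if (pos + bit_width > 64)               /* field crosses a QWORD boundary */
        iw->w[q + 1] |= (value & mask) >> (64 - pos);
}
```

An 8-bit field inserted at bit 60 lands in the top nibble of QWORD 0 and the bottom nibble of QWORD 1.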

Two companion helpers run before operand encoding:

  • sub_7B9D30 (38 bytes) -- clears the 16-entry constant buffer slot table at a1+468 to 0xFF
  • sub_7B9D60 (408 bytes) -- encodes reuse flags (1 bit) and predicate register index (5 bits) into the instruction word

Encoding Table Functions (530 functions)

The range 0xC66000--0xD27000 contains 530 functions that each initialize one row of the instruction format table. Every function calls sub_7B9B80 multiple times to describe the SASS bit layout for one instruction format variant:

// Example: sub_C6CF40 — one instruction format initializer
void init_format_XYZ(void *a1) {
    sub_7B9B80(a1, 0,    4, 1);       // bits[0:3]   = opcode field = 1
    sub_7B9B80(a1, 4,    3, 0);       // bits[4:6]   = format = 0
    sub_7B9B80(a1, 8,    9, 0xC);     // bits[8:16]  = subopcode = 12
    sub_7B9B80(a1, 0x11, 8, 0x13);    // bits[17:24] = modifier = 19
    sub_7B9B80(a1, 0x19, 7, 5);       // bits[25:31] = unit = 5
}

Function sizes are remarkably uniform (1000--1600 bytes), reflecting mechanical code generation -- roughly 10 functions per ISA opcode group, covering all SASS formats for SM 100+.

Stage 2: MercExpandInstructions -- Pseudo-Instruction Expansion

Phase     118
Entry     sub_C3CC60 (26KB, MercExpand::run)
Strings   "After MercExpand", "EXPANDING"

Mercury uses abstract instruction forms that may map to multiple real SASS instructions. This phase expands every pseudo-instruction into its concrete SASS equivalent sequence. The expansion is type-dispatched:

Handler      Size   Instruction class
-------      ----   -----------------
sub_C37A10   16KB   General instruction expansion (jump table with 4+ cases)
def_C37B2E   13KB   Complex expansion cases (default handler, creates new nodes)
sub_C39B40   10KB   Memory operations (LDG, STG, LDS, etc.)
sub_C3A460   6KB    Atomic operations
sub_C3B560   8KB    Texture operations
sub_C3BCD0   19KB   Control flow (branches, jumps, calls)

sub_C3CC60 iterates over every instruction in the function, dispatching to the appropriate handler. Handlers create new instruction nodes, link them into the list, and delete the original pseudo-instruction. After all expansions, sub_C3E030 (18KB) performs finalization and cleanup.

The expansion engine also uses sub_719D00 (50KB), which builds output for expanded instructions across different operand widths (32/64/128-bit, predicate). The four nearly identical code blocks within that function correspond to template instantiations over operand width types.

Stage 3: WAR Hazard Resolution (Phases 119, 121)

Phases      119 (MercGenerateWARs1), 121 (MercGenerateWARs2)
Entry       sub_6FC220 / sub_6FC240
Main pass   sub_6FBC20 (7.4KB)
String      "After MercWARs"
Knob        #16 (WAR generation control)

Write-After-Read hazards occur when an instruction reads a register that a later instruction will overwrite -- the hardware pipeline can execute them out of order, causing the read to see the wrong value. The WAR pass inserts explicit DEPBAR (dependency barrier) instructions and scoreboard annotations to force correct ordering.

Two passes are needed: WAR1 runs after expansion but before opex, and WAR2 runs after opex. The second pass exists because opex itself introduces new instructions (scoreboard waits, synchronization barriers) that create additional WAR hazards not present in the pre-opex stream.

WAR Pass Algorithm -- sub_6FBC20

// Simplified WAR generation pass
void GenerateWARs(context) {
    // Guard conditions
    if (!(context->instr_flags & 1))  return;  // no WAR-sensitive instrs
    if (context->mode != 2)           return;  // not Mercury mode

    // Per-instruction walk
    for (instr = first; instr != end; instr = instr->next) {
        // Detect hazard
        int severity = DetectWARHazard(state, instr);  // sub_6FA5B0

        if (severity >= 3) {
            InsertScoreboardBarrier(state, instr);       // sub_6FA930, opcode 54
            InsertWAITDP(state, instr);                  // sub_6FA7B0, opcode 246
            InsertWARStalls(state, instr, severity);     // sub_6FAA90
        }
    }

    PostWARAdjustment(state);    // sub_6FB850
    FinalizeWARPass(state);      // sub_6FB350
}

WAR Hazard Detection -- sub_6FA5B0 (2.5KB)

The detector classifies instructions by opcode:

  • Always hazardous (opcodes 49, 248, 92): unconditionally increment the WAR counter
  • Conditionally hazardous (opcode 75): partial hazard depending on operand configuration
  • Special handling (opcodes 35, 246): store/scoreboard instructions with custom WAR rules
  • Filtered out: (opcode - 34) > 0x2C plus bitmask 0x100000400001 for irrelevant types

Architecture-specific hazard rules are dispatched through vtable methods at offsets +968, +1008, +528, and +504.

The detector maintains per-instruction state:

  • *(DWORD*)(state+2) -- WAR counter (incremented per detected hazard)
  • *(DWORD*)(state+3) -- severity level (3 = medium, 4 = high)

Inserted Instructions

DEPBAR / Scoreboard barrier (opcode 54) -- sub_6FA930:

  • Created when *(BYTE*)(instr+48) & 0x10 is set (barrier-needed flag)
  • Barrier type extracted from bits 7:5 of the flag byte
  • Encoding: *(DWORD*)(new_instr+56) = 4 (barrier format)
  • Control bits: *(DWORD*)(new_instr+48) &= 0xFFF83FFF | 0x50000

WAITDP (opcode 246) -- sub_6FA7B0:

  • Skipped if a WAITDP already exists at the insertion point
  • Operands configured with codes 102/467 and 301/1520
  • Uses FNV-1a hash lookup for instruction deduplication

Stall cycles -- sub_6FAA90 (7.9KB):

  • Computes required stall cycles from architecture-specific latency tables
  • Vtable methods at +888, +896, +904 for stall calculation
  • GPU family dispatch: v8[14] == 9 triggers specific handling
  • Adjusts stall count fields in the instruction control word

Stage 4: Opex -- Operation Expansion

Phase            120 (MercGenerateOpex)
Entry            sub_703480 (1.4KB, RunOpexPass)
MercOpex entry   sub_7032A0 (2.3KB, RunMercOpexPass)
Body             sub_6FFDC0 (66KB)
String           "After MercOpex"
Knobs            #17 (expansion options), #743 (reduce-reg), #747 (dynamic batch)

"Operation expansion" is the most complex stage. It generates the dependency scoreboards, computes latency waits, and inserts synchronization barriers that the hardware needs to manage instruction-level parallelism. After opex, the instruction stream contains all the scheduling metadata required for correct execution.

Entry Points

Two entry paths exist, both calling the same sub_6FFDC0 body:

sub_703480 (RunOpexPass, 1.4KB):

  1. Creates pipeline context via sub_6FC280
  2. Queries knob #17 to disable WAR penalty flags: *(context->flags+52) &= ~0x10
  3. Architecture check: *(DWORD*)(context+52) == 20481 (SM 100a)
  4. For SM 100a: queries knob at offset 1296/1304 for loop unroll factor
  5. Sets Mercury mode: *(DWORD*)(context+385) = 2
  6. Calls sub_6FFDC0 for actual expansion

sub_7032A0 (RunMercOpexPass, 2.3KB):

  • Nearly identical to sub_703480
  • Additionally calls sub_10ADF10 to verify Mercury mode is active
  • Allocates 40-byte + 24-byte records for Mercury-specific context
  • Calls sub_6FED20 to destroy previous Mercury context before creating new one

Opex Body -- sub_6FFDC0 (66KB)

This 66KB function with 200+ local variables performs:

  1. Instruction iteration -- walks the instruction list per basic block
  2. Latency computation -- determines execution latency for each instruction based on opcode, functional unit, and architecture
  3. Scoreboard allocation -- assigns dependency scoreboard entries to track producer-consumer relationships
  4. Wait insertion -- inserts DEPBAR.WAIT instructions where a consumer must wait for a producer to complete
  5. Stall count computation -- sets per-instruction stall counts in the scheduling control word
  6. Barrier generation -- inserts memory barriers and synchronization points

The function queries three knobs that control scheduling behavior:

  • Knob #17: expansion options, WAR penalty flag control
  • Knob #743: reduce-reg scheduling mode (minimize register pressure)
  • Knob #747: dynamic batch scheduling mode

New instructions created by opex use sub_10B1F90 (instruction allocator) and sub_10AE590 (operand configuration).

Stage 5: SASS Microcode Emission

Phase   122 (MercGenerateSassUCode)
Entry   sub_6E4110 (24KB)

The final stage converts the fully expanded, WAR-resolved, scoreboard-annotated Mercury stream into native SASS binary. This is the point of no return -- after this phase, the output is executable GPU machine code.

sub_6E4110 takes 8 parameters (context, instruction list, descriptors, format info, ...) and dispatches to the per-instruction encoding pipeline:

sub_6E4110 (24KB, final SASS emission)
  ├─ sub_735290 — per-instruction encoding pipeline
  │    ├─ sub_733FA0 (5.1KB)  — encode instruction operands
  │    │    └─ sub_733870 (10KB) — source operand encoder
  │    ├─ sub_734370 (6.1KB)  — encode immediates
  │    ├─ sub_734820 (4.1KB)  — encode predicates
  │    ├─ sub_734AD0 (3.3KB)  — encode memory operands
  │    └─ sub_734D20 (8.1KB)  — encode complex operands (texture/surface/barrier)
  ├─ sub_726E00 (30.6KB) — instruction encoding with FNV-1a dedup cache
  │    └─ sub_7266A0 (11.7KB) — hash table lookup (24-byte entries, separate chaining)
  ├─ sub_6E3F80 (2.2KB) — encode branch offsets
  ├─ sub_6E3560 (2.6KB) — finalize scheduling control words
  └─ sub_712E70 (9.6KB) — handle relocations (cross-BB branch targets)

The encoding pipeline uses FNV-1a hashing (seed 0x811C9DC5, multiplier 16777619) to cache instruction encodings and avoid re-encoding identical instructions.
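Those FNV-1a parameters are the standard 32-bit ones, so the hash itself is straightforward to reproduce:

```c
#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a with the parameters quoted above: offset basis
 * 0x811C9DC5, prime 16777619 (0x01000193). */
static uint32_t fnv1a(const uint8_t *data, size_t len) {
    uint32_t hash = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];            /* XOR the byte first (the "1a" order) */
        hash *= 16777619u;          /* then multiply by the FNV prime */
    }
    return hash;
}
```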

Architecture-Specific Dispatch

Architecture selection reads *(int*)(config + 372) >> 12 to determine the SM generation. A vtable at *(context+416) with approximately 200 methods provides per-architecture behavior for encoding, latency tables, and hazard rules.

SM generation   config+372 >> 12   SM versions
-------------   ----------------   -----------
Kepler          3                  sm_30--sm_37
Maxwell         5                  sm_50--sm_53
Pascal          6                  sm_60--sm_62
Volta/Turing    7                  sm_70--sm_75
Ampere          8                  sm_80--sm_89
Hopper          9                  sm_90--sm_90a
Blackwell       (10+)              sm_100--sm_121

The encoder state initializer sub_6E8EB0 (64KB) sets architecture-specific flags and populates the opcode descriptor table (40+ entries mapping internal opcode IDs to encoding words). For SM 80 (0x5000) it sets bits 1 and 8; for SM 84 (0x5004) it sets bits 16 and 64.

Vtable dispatch helpers at 0xC65530--0xC656E0:

  • sub_C65530 -- 3-key dispatch (opcode, subop1, subop2), binary search through 24-byte table entries
  • sub_C65600 -- instruction-keyed dispatch, reads keys from instr+12/14/15
  • sub_C656E0 -- instruction-keyed dispatch with fallback to default handler sub_9B3020

Data Structures

Mercury Instruction Word

Offset  Size    Field
------  ------  --------------------------------------------------
+0      8B      vtable pointer (encoder object)
+468    64B     Constant buffer slot table (16 x DWORD, cleared to 0xFF)
+532    4B      Constant buffer slot count
+544    160B    Instruction word (1280 bits = 20 QWORDs)
                — populated by sub_7B9B80 bitfield inserts
                — max addressable bit: 1280

SASS Encoding Record (~264 bytes)

Output of sub_6D9690. Contains the encoded instruction words, operand data, and metadata. The encoding base constant is 0x2000000000LL.

Pipeline Context

Offset  Size    Field
------  ------  --------------------------------------------------
+52     4B      Architecture ID (20481 = sm100a)
+236    1B      Uses shared memory flag
+284    4B      Function flags (bits 0, 3, 7 checked by WAR pass)
+385    4B      Mercury mode flag (2 = Mercury/Capsule mode)
+416    8B      Architecture vtable pointer (~200 virtual methods)

Scheduling Control Word (per SASS instruction)

Offset  Size    Field
------  ------  --------------------------------------------------
+48     4B      Control bits (barrier flags at bits 17:13)
+56     4B      Encoding format (4 = barrier format)
+144    4B      Scheduling slot
+164    4B      Resource class
+168    1B      Stall bits
+236    4B      Latency value

Mercury Instruction Node Layout

The Mercury pipeline (phases 117--122) operates on its own instruction representation, distinct from the 296-byte Ori IR instruction node documented in Instructions & Opcodes. The master encoder sub_6D9690 (phase 117) reads Ori IR nodes and produces Mercury instruction nodes; all subsequent phases -- expansion, WAR resolution, opex, and SASS emission -- operate exclusively on Mercury nodes.

Allocation

Mercury instruction nodes are allocated by sub_10AF8C0 (92 lines), which either recycles a node from a per-block free list or allocates exactly 160 bytes from the arena. The primary API wrappers are sub_10B1F90 and sub_10B1EE0, which call sub_10AF8C0 and perform additional bookkeeping (FNV-1a deduplication cache registration, scheduling state propagation).

Node Layout (160 bytes)

Offset  Size  Type  Init value       Field            Description
------  ----  ----  ----------       -----            -----------
+0      8     ptr   0                next             Forward pointer in per-block doubly-linked list
+8      8     ptr   0                prev             Backward pointer in per-block doubly-linked list
+16     8     ptr   source loc       source_loc       Source location copied from context (slot 124)
+24     4     u32   772 (0x304)      node_type        Constant type marker -- never modified after init
+28     2     u16   0xFFFF           opcode           SASS opcode number (0xFFFF = sentinel / BB boundary)
+30     1     u8    0xFF             sub_key_1        Encoding sub-key 1 (format variant selector)
+31     1     u8    0xFF             sub_key_2        Encoding sub-key 2 (modifier selector)
+32     4     u32   counter          sequence_id      Monotonically increasing ID; FNV-1a dedup key
+36     4     --    --               (padding)        Alignment to 8-byte boundary
+40     8     ptr   ctx              context_ptr      Back-pointer to allocator / code-object base
+48     8     u64   0                encoded_data_0   Encoded operand / property data
+56     8     u64   0xFFFFFFFF       sentinel_56      Sentinel / uninitialized marker
+64     8     u64   0                encoded_data_1   Encoded operand / property data
+72     8     u64   0                encoded_data_2   Encoded operand / property data
+80     8     u64   0                encoded_data_3   Encoded operand / property data
+88     8     i64   -1               sentinel_88      Sentinel (end-of-data marker)
+96     8     i64   -1               sentinel_96      Sentinel
+104    8     u64   0xFFFFFFFF       sentinel_104     Sentinel
+112    8     u64   0                reserved_112     Reserved (zeroed)
+120    8     u64   0                reserved_120     Reserved (zeroed)
+128    8     ptr   alloc'd          sched_ctrl_ptr   Pointer to 60-byte scheduling control record
+136    8     ptr   ctx sched        sched_context    Context scheduling state (context slot 52)
+144    8     i64   0xFFFFFFFF       sched_slot       Scheduling slot (sentinel = unscheduled)
+148    4     u32   0                node_flags       Node flags (bit 1 = BB boundary, bit 10 = 0x400)
+152    4     u32   0xFFFFFFFF       block_seq        Basic-block sequence number

The opcode field at +28 carries the Mercury/SASS opcode number. Known values include: 0xFFFF (sentinel, BB boundary marker), 54 (DEPBAR -- dependency barrier), 246 (WAITDP -- wait for dependency pipeline). All other values are SASS instruction opcodes.

Scheduling Control Record (60 bytes)

Each Mercury instruction node points (via +128) to a separately allocated 60-byte scheduling control record. This record carries barrier state, stall counts, and encoding format metadata that the WAR and opex passes read and modify.

Offset  Size  Type  Init              Field             Description
------  ----  ----  ----              -----             -----------
+0      16    xmm   SSE const         header            SSE-initialized from xmmword_2027620
+16     16    xmm   SSE const         latency           SSE-initialized from xmmword_202DC90
+32     1     u8    0                 flag_32           General-purpose flag byte
+36     8     i64   -1                barrier_state     Barrier tracking sentinel
+44     4     u32   0                 stall_count       Stall cycle count
+48     4     u32   0xEE (low byte)   control_bits      Scheduling control word; bits 17:13 = barrier type
+56     4     u32   0                 encoding_format   Format discriminator (1 = basic, 4 = barrier, 15 = NOP stall)

The control_bits field at sched+48 is the primary target of WAR pass modifications:

Bits 17:13  — barrier type (masked via 0xFFF83FFF then OR'd with type << 13)
Bit  4      — barrier-needed flag (in byte at sched+50)
Bits  7:5   — barrier sub-type (in byte at sched+50)

WAR insertion functions modify this field with specific patterns:

  • sub_6FA930 (InsertScoreboardBarrier): sched[48] = (sched[48] & 0xFFF83FFF) | 0x50000; clears bit 4 of sched[50]; sets sched[56] = 4
  • sub_6FA430 (InsertNOP): sched[48] = (sched[48] & 0xFFF83FFF) | 0x44000; clears bit 4 of sched[50]; sets sched[56] = 1
  • sub_6FAFD0 (InsertStall): sched[48] = (sched[48] & 0xFFF83FFF) | 0x3C000; sets bit 4 of sched[50]; sets sched[56] = 15
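The three rewrite patterns reduce to simple mask-and-OR arithmetic on the control word, and are easy to verify in isolation:

```c
#include <stdint.h>

/* The three WAR rewrite patterns on sched+48: clear the field under
 * mask 0xFFF83FFF, then OR in the inserted instruction's pattern.
 * All three OR patterns lie within the cleared field, so bits outside
 * it are preserved. */
static uint32_t set_barrier(uint32_t cb) { return (cb & 0xFFF83FFF) | 0x50000; }
static uint32_t set_nop(uint32_t cb)     { return (cb & 0xFFF83FFF) | 0x44000; }
static uint32_t set_stall(uint32_t cb)   { return (cb & 0xFFF83FFF) | 0x3C000; }
```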

Linked-List Structure

Mercury nodes form a doubly-linked list per basic block, managed through the next (+0) and prev (+8) pointers:

           head (ctx+40)                          tail (ctx+32)
              |                                      |
              v                                      v
         [node_0]  <-->  [node_1]  <-->  ...  <-->  [node_N]
         next=node_1     next=node_2                 next=0
         prev=0          prev=node_0                 prev=node_{N-1}

New nodes are inserted before the reference node by sub_10AF8C0. The WAR pass (sub_6FBC20) iterates forward through the list; sub_6FB850 (PostWARAdjustment) iterates backward, skipping sentinel nodes (opcode == 0xFFFF).
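The insert-before operation on these links can be sketched with a minimal node carrying only the +0/+8 pointers:

```c
#include <stddef.h>

/* Minimal model of the Mercury list links (+0 next, +8 prev): splice a
 * new node immediately before a reference node. */
typedef struct MNode { struct MNode *next, *prev; int opcode; } MNode;

static void insert_before(MNode *ref, MNode *node) {
    node->next = ref;
    node->prev = ref->prev;
    if (ref->prev) ref->prev->next = node;   /* ref was not the head */
    ref->prev = node;
}
```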

FNV-1a Deduplication

The sequence_id at +32 serves as the FNV-1a hash key for the instruction deduplication cache. The hash is computed over the 4-byte ID using the standard FNV-1a parameters (seed 0x811C9DC5, multiplier 16777619). The cache resides at context+488 (hash table pointer) with capacity at context+496 and entry count at context+480. Each hash table entry is 24 bytes with separate chaining via pointer at entry+0, key at entry+8, and value (Mercury encoding record pointer) at entry+16.
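The lookup side of that cache can be modeled directly from the layout above -- a 24-byte entry on a 64-bit target (chain pointer, key, value) with separate chaining:

```c
#include <stdint.h>
#include <stddef.h>

/* Model of the dedup cache entry: chain pointer at +0, key at +8,
 * value (encoding record pointer) at +16 -- 24 bytes on x86-64. */
typedef struct Entry { struct Entry *chain; uint64_t key; void *value; } Entry;

/* FNV-1a over the 4-byte sequence ID, least-significant byte first
 * (byte order within the ID is an assumption). */
static uint32_t fnv1a_u32(uint32_t id) {
    uint32_t h = 0x811C9DC5u;
    for (int i = 0; i < 4; i++) {
        h ^= (id >> (8 * i)) & 0xFF;
        h *= 16777619u;
    }
    return h;
}

static void *cache_lookup(Entry **buckets, size_t capacity, uint32_t id) {
    for (Entry *e = buckets[fnv1a_u32(id) % capacity]; e; e = e->chain)
        if (e->key == id)
            return e->value;        /* hit: reuse prior encoding record */
    return NULL;                    /* miss: instruction must be encoded */
}
```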

Relationship to Ori IR Instruction Node

The Mercury node is distinct from the Ori IR instruction node:

Property          Ori IR node               Mercury node
--------          -----------               ------------
Size              296 bytes                 160 bytes
Allocator         sub_7DD010                sub_10AF8C0
Opcode location   +72 (32-bit word)         +28 (16-bit)
Operand model     Packed array at +84       Encoded data at +48..+120
Scheduling        Pointer at +40            Pointer at +128 (60-byte record)
List linkage      +0 / +8 (prev/next)       +0 / +8 (next/prev)
Pipeline phases   1--116                    117--122

Phase 117 (MercEncodeAndDecode) reads Ori IR nodes via the master encoder sub_6D9690 and produces Mercury nodes. All subsequent Mercury pipeline phases operate on Mercury nodes exclusively.

Configuration

| Knob | Purpose | Context |
|---|---|---|
| 16 | WAR generation control | Checked in sub_6FBC20 |
| 17 | Expansion/opex options; disables WAR penalty flags | sub_703480 entry |
| 595 | Scheduling enable check | Scheduling pre-check |
| 743 | Scheduling reduce-reg mode | sub_6FFDC0 opex body |
| 747 | Scheduling dynamic batch mode | sub_6FFDC0 opex body |
| 4176 | SM 100+ extension bits for encoding | sub_6D9690 encoder |

Diagnostic Strings

| String | Source | Trigger |
|---|---|---|
| "After Decode" | sub_6F2BF0 | Decode stage completion |
| "After Expansion" | sub_6F2BF0 | Expansion stage completion |
| "After WAR post-expansion" | sub_6F2BF0 | WAR pass 1 completion |
| "After Opex" | sub_6F2BF0 | Opex stage completion |
| "After WAR post-opexing" | sub_6F2BF0 | WAR pass 2 completion |
| "After MercWARs" | sub_6FC240 | WAR pass trace |
| "After MercOpex" | sub_7032A0 | Opex pass trace |
| "After MercExpand" | sub_C3DFC0 | Expansion pass trace |
| "After MercConverter" | 0x9F3818 | MercConverter phase completion |
| "CONVERTING" | sub_9EF5E0 | Active operand reorganization (per instruction) |
| "After EncodeAndDecode" | 0x23D1A60 | Roundtrip verification |
| "EXPANDING" | 0xC381B3 | Active instruction expansion |
| "ENCODING" | 0x21C2880 | Active instruction encoding |

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_9F1A90 | 35KB | MercConverter::ConvertInstruction (opcode dispatch, phase 5/141) | HIGH |
| sub_9EF5E0 | 27KB | MercConverter::ReorganizeOperands (post-conversion lowering) | HIGH |
| sub_9ED2D0 | 25KB | MercConverter::Dispatch (master opcode switch, & 0xCF mask) | HIGH |
| sub_9F3340 | 7KB | MercConverter::Run (orchestrator, calls 9F1A90 then 9EF5E0) | HIGH |
| sub_9EC160 | ~2KB | MergeSort (linked-list merge sort for operand chains) | HIGH |
| sub_7BFC30 | ~4KB | MercConverter::ValidateEncoding (returns -1 on failure) | HIGH |
| sub_9CE210 | ~6KB | MercConverter::FallbackConvert (recursive re-encoding) | MEDIUM |
| sub_6D9690 | 94KB | MercuryEncode::EncodeInstruction (master switch) | HIGH |
| sub_6FFDC0 | 66KB | MercuryPipeline::EmitInstructions (opex body) | HIGH |
| sub_6E8EB0 | 64KB | BasicBlock::Initialize (encoder state init) | MEDIUM |
| sub_6F2BF0 | 59KB | DecodePipeline::DecodeAndExpand | MEDIUM |
| sub_719D00 | 50KB | ExpansionEngine::buildOutput | MEDIUM |
| sub_726E00 | 30.6KB | Instruction encoding + FNV-1a dedup cache | HIGH |
| sub_C3CC60 | 26KB | MercExpand::run (pseudo-instruction expansion) | HIGH |
| sub_6FC810 | 24KB | MercuryPipeline::Configure | MEDIUM |
| sub_6E4110 | 24KB | MercGenerateSassUCode (final SASS emission) | HIGH |
| sub_6F52F0 | 23KB | DecodePipeline::RunStages (orchestrator) | MEDIUM |
| sub_C3BCD0 | 19KB | MercExpand::expandControlFlow | HIGH |
| sub_6FF070 | 18KB | Predicate handling in expansion | MEDIUM |
| sub_C3E030 | 18KB | MercExpand::finalizeExpansion | HIGH |
| sub_C37A10 | 16KB | MercExpand::expandInstruction | HIGH |
| sub_C38180 | 13KB | MercExpand::expandInstruction (complex cases) | HIGH |
| sub_7266A0 | 11.7KB | FNV-1a hash table (instruction cache) | HIGH |
| sub_733870 | 10KB | Source operand encoder | MEDIUM |
| sub_C39B40 | 10KB | MercExpand::expandMemoryOp | HIGH |
| sub_6FAA90 | 7.9KB | WAR stall insertion | HIGH |
| sub_735290 | 7.6KB | Per-instruction SASS encoding pipeline | MEDIUM |
| sub_6FBC20 | 7.4KB | WAR generation main pass | HIGH |
| sub_C3B560 | 8KB | MercExpand::expandTexture | HIGH |
| sub_734D20 | 8.1KB | Complex operand encoder (texture/surface/barrier) | MEDIUM |
| sub_C3A460 | 6KB | MercExpand::expandAtomicOp | HIGH |
| sub_734370 | 6.1KB | Immediate operand encoder | MEDIUM |
| sub_733FA0 | 5.1KB | Instruction operand encoder | MEDIUM |
| sub_734820 | 4.1KB | Predicate operand encoder | MEDIUM |
| sub_734AD0 | 3.3KB | Memory operand encoder | MEDIUM |
| sub_6FA5B0 | 2.5KB | WAR hazard detector | HIGH |
| sub_7032A0 | 2.3KB | RunMercOpexPass (entry) | HIGH |
| sub_6FC280 | 1.8KB | Create pipeline context | MEDIUM |
| sub_6FA7B0 | 1.7KB | InsertWAITDP (opcode 246) | HIGH |
| sub_703480 | 1.4KB | RunOpexPass (entry) | HIGH |
| sub_6FA930 | 1.4KB | InsertScoreboardBarrier (opcode 54) | HIGH |
| sub_10AF8C0 | ~0.5KB | MercNode::Allocate (160-byte node allocator, core initializer) | HIGH |
| sub_10B1F90 | ~0.2KB | MercNode::Create (wrapper: allocate + dedup cache + sched state) | HIGH |
| sub_10B1EE0 | ~0.2KB | MercNode::Clone (wrapper: allocate from clone source) | HIGH |
| sub_10B14B0 | ~0.2KB | MercNode::CreateBBBoundary (creates sentinel pair, opcode 0xFFFF) | HIGH |
| sub_6FAFD0 | ~1KB | InsertScoreboardStalls (allocate NOP stall nodes) | HIGH |
| sub_6FA430 | ~0.5KB | InsertNOP (allocate NOP barrier nodes) | HIGH |
| sub_7B9B80 | 216B | Bitfield insert primitive (18,347 callers) | CERTAIN |
| sub_7B9D30 | 38B | Clear constant buffer slot table | HIGH |
| sub_7B9D60 | 408B | Encode reuse flags + predicate | HIGH |

Cross-References

Capsule Mercury & Finalization

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Capsule Mercury ("capmerc") is a packaging format that wraps Mercury-encoded instruction streams with relocation metadata, debug information, and a snapshot of compilation knobs, enabling deferred finalization for a target SM that may differ from the original compilation target. Where standard Mercury produces a fully-resolved SASS binary bound to a single SM, capmerc produces an intermediate ELF object that a downstream tool (the driver or linker) can finalize into native SASS at load time.

This is the default output format for all SM 100+ targets (Blackwell, Jetson Thor, consumer RTX 50-series). The capmerc data lives in .nv.capmerc<funcname> per-function ELF sections alongside 21 types of .nv.merc.* auxiliary sections carrying cloned debug data, memory-space metadata, and Mercury-specific relocations. Finalization can be "opportunistic" -- the same capmerc object may be finalized for different SMs within or across architectural families, controlled by --opportunistic-finalization-lvl.

| | |
|---|---|
| Output modes | mercury (SM 75--99 default), capmerc (SM 100+ default), sass (explicit only) |
| CLI parser | sub_703AB0 (10KB, ParsePtxasOptions) |
| Auto-enable | SM arch > 99 sets *(context + offset+81) = 1 |
| Mercury mode flag | *(DWORD*)(context+385) == 2 (shared with Mercury) |
| Capsule descriptor | 328-byte object, one per function (sub_1C9C300) |
| Merc section classifier | sub_1C98C60 (9KB, 15 .nv.merc.* names) |
| Master ELF emitter | sub_1C9F280 (97KB, orchestrates full CUBIN output) |
| Self-check verifier | sub_720F00 (64KB Flex lexer) + sub_729540 (35KB comparator) |
| Off-target checker | sub_60F290 (compatibility validation) |
| Kernel finalizer | sub_612DE0 (47KB, fastpath optimization) |

Output Mode Selection

The ptxas CLI (sub_703AB0) registers three binary-kind options plus related flags:

| Option | String literal | Purpose |
|---|---|---|
| --binary-kind | "mercury,capmerc,sass" | Select output format |
| --cap-merc | "Generate Capsule Mercury" | Force capmerc regardless of SM |
| --self-check | "Self check for capsule mercury (capmerc)" | Roundtrip verification |
| --out-sass | "Generate output of capmerc based reconstituted sass" | Dump reconstituted SASS |
| --opportunistic-finalization-lvl | (in finalization logic) | Finalization aggressiveness |

When --binary-kind is not specified, the default is determined by SM version:

// Pseudocode from sub_703AB0 + auto-enable logic
if (sm_version > 99) {
    *(context + offset + 81) = 1;  // capmerc auto-enabled
    binary_kind = CAPMERC;
} else if (sm_version >= 75) {
    binary_kind = MERCURY;
} else {
    binary_kind = SASS;  // legacy direct encoding
}

The Mercury mode flag *(DWORD*)(context+385) == 2 is shared between Mercury and capmerc -- both use the identical Mercury encoder pipeline (phases 117--122). The capmerc distinction is purely at the ELF emission level: capmerc wraps the phase-122 SASS output in a capsule descriptor with relocation metadata instead of emitting it directly as a .text section.

Capsule Mercury ELF Structure

A capmerc-mode compilation produces a CUBIN ELF with two layers of content: standard CUBIN sections (.text.<func>, .nv.constant0, .nv.info.<func>, etc.) and a parallel set of .nv.merc.* sections carrying the metadata needed for deferred finalization.

CUBIN ELF (capmerc mode)
├── Standard sections
│   ├── .shstrtab, .strtab, .symtab, .symtab_shndx
│   ├── .text.<funcname>               (SASS binary, possibly partial)
│   ├── .nv.constant0.<funcname>       (constant bank data)
│   ├── .nv.shared.<funcname>          (shared memory layout)
│   ├── .nv.info.<funcname>            (EIATTR attributes)
│   ├── .note.nv.tkinfo, .note.nv.cuinfo
│   └── .nv.uft.entry                 (unified function table)
│
├── Per-function capsule descriptor
│   └── .nv.capmerc<funcname>          (328-byte descriptor + payload)
│
└── Mercury auxiliary sections (21 types)
    ├── .nv.merc.debug_abbrev          (DWARF abbreviation table)
    ├── .nv.merc.debug_aranges         (DWARF address ranges)
    ├── .nv.merc.debug_frame           (DWARF frame info)
    ├── .nv.merc.debug_info            (DWARF info)
    ├── .nv.merc.debug_line            (DWARF line table)
    ├── .nv.merc.debug_loc             (DWARF locations)
    ├── .nv.merc.debug_macinfo         (DWARF macro info)
    ├── .nv.merc.debug_pubnames        (DWARF public names)
    ├── .nv.merc.debug_pubtypes        (DWARF public types)
    ├── .nv.merc.debug_ranges          (DWARF ranges)
    ├── .nv.merc.debug_str             (DWARF string table)
    ├── .nv.merc.nv_debug_ptx_txt      (embedded PTX source text)
    ├── .nv.merc.nv_debug_line_sass    (SASS-level line table)
    ├── .nv.merc.nv_debug_info_reg_sass (register allocation info)
    ├── .nv.merc.nv_debug_info_reg_type (register type info)
    ├── .nv.merc.symtab_shndx          (extended section index table)
    ├── .nv.merc.nv.shared.reserved    (shared memory reservation)
    ├── .nv.merc.rela                  (Mercury relocations)
    ├── .nv.merc.rela<secname>         (per-section relocation tables)
    └── .nv.merc.<memory-space>        (cloned constant/global/local/shared)

Capsule Descriptor -- sub_1C9C300

Each function produces a .nv.capmerc<funcname> section constructed by sub_1C9C300 (24KB, 3816 bytes binary). This function processes .nv.capmerc and .merc markers, embeds KNOBS data (compilation configuration snapshot), manages constant bank replication, and creates the per-function descriptor.

The descriptor is a 328-byte object containing:

  • Mercury-encoded instruction stream for the function
  • R_MERCURY_* relocation entries that must be patched during finalization
  • KNOBS block -- a serialized snapshot of all knob values affecting code generation, optimization level, target parameters, and feature flags
  • References to the .nv.merc.* auxiliary sections
  • Function-level metadata: register counts, barrier counts, shared memory usage

The KNOBS embedding allows the finalizer to reproduce exact compilation settings without the original command-line arguments. This is critical for off-target finalization where the finalizer runs in a different context (e.g., the CUDA driver at application load time).

Capsule Descriptor Layout (328 bytes)

The descriptor is heap-allocated via sub_424070(allocator, 328) and zero-filled before field initialization. The constructor also creates a companion .merc<funcname> descriptor (same 328-byte layout) when merc section mirroring is active.

         Capsule Descriptor (328 bytes = 0x148)
         ======================================

         Group 1: Identity
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x000 │ WORD    │  2B  │ desc_version                           │
   0x002 │ WORD    │  2B  │ instr_format_version                   │
   0x004 │ DWORD   │  4B  │ section_index                          │
   0x008 │ DWORD   │  4B  │ weak_symbol_index                      │
   0x00C │ --      │  4B  │ (padding)                              │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 2: SASS Data
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x010 │ QWORD   │  8B  │ weak_symbol_desc                       │
   0x018 │ QWORD   │  8B  │ sass_data_offset                       │
   0x020 │ DWORD   │  4B  │ sass_data_size                         │
   0x024 │ --      │  4B  │ (padding)                              │
   0x028 │ QWORD   │  8B  │ func_name_ptr                          │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 3: Relocation Infrastructure
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x030 │ QWORD   │  8B  │ rela_list_a (vector)                   │
   0x038 │ QWORD   │  8B  │ rela_list_b (vector)                   │
   0x040 │ QWORD   │  8B  │ reloc_symbol_list (vector)             │
   0x048 │ QWORD   │  8B  │ aux_rela_list (vector)                 │
   0x050 │ QWORD   │  8B  │ debug_rela_list (vector)               │
   0x058 │ QWORD   │  8B  │ text_section_offset                    │
   0x060 │ QWORD   │  8B  │ reloc_index_set (sorted container)     │
   0x068 │ QWORD   │  8B  │ per_reloc_data_set (sorted container)  │
   0x070 │ BYTE    │  1B  │ sampling_mode                          │
   0x071 │ --      │  7B  │ (padding)                              │
   0x078 │ QWORD   │  8B  │ reloc_payload_map (sorted container)   │
         └─────────┴──────┴────────────────────────────────────────┘

   0x080 │ --      │ 32B  │ (reserved, not written by constructor)  │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 4: Function Metadata
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x0A0 │ QWORD   │  8B  │ section_flags                          │
   0x0A8 │ DWORD   │  4B  │ max_register_count                     │
   0x0AC │ DWORD   │  4B  │ extra_section_index                    │
   0x0B0 │ BYTE    │  1B  │ has_global_refs                        │
   0x0B1 │ BYTE    │  1B  │ has_shared_refs                        │
   0x0B2 │ BYTE    │  1B  │ has_exit                               │
   0x0B3 │ BYTE    │  1B  │ has_crs                                │
   0x0B4 │ BYTE    │  1B  │ uses_atomics                           │
   0x0B5 │ BYTE    │  1B  │ uses_shared_atomics                    │
   0x0B6 │ BYTE    │  1B  │ uses_global_atomics                    │
   0x0B7 │ BYTE    │  1B  │ has_texture_refs                       │
   0x0B8 │ --      │ 24B  │ (padding)                              │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 5: Code Generation Parameters
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x0D0 │ QWORD   │  8B  │ knobs_section_desc_ptr → 64B sub-obj  │
         │         │      │   +0x00 DWORD: knobs_section_index     │
         │         │      │   +0x08 QWORD: knobs_section_offset    │
         │         │      │   +0x10 DWORD: knobs_section_size      │
         │         │      │   +0x18 QWORD: knobs_section_name_ptr  │
   0x0D8 │ DWORD   │  4B  │ stack_frame_size                       │
   0x0DC │ --      │  4B  │ (padding)                              │
   0x0E0 │ DWORD   │  4B  │ register_count                         │
   0x0E4 │ --      │  4B  │ (padding)                              │
   0x0E8 │ DWORD   │  4B  │ barrier_info_size                      │
   0x0EC │ --      │  4B  │ (padding)                              │
   0x0F0 │ QWORD   │  8B  │ barrier_info_data_ptr                  │
   0x0F8 │ --      │  8B  │ (reserved)                             │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 6: Constant Bank & Section Info
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x100 │ QWORD   │  8B  │ const_bank_offset                      │
   0x108 │ DWORD   │  4B  │ const_bank_size                        │
   0x10C │ --      │  4B  │ (padding)                              │
   0x110 │ QWORD   │  8B  │ section_name_ptr (".nv.capmerc<func>") │
   0x118 │ QWORD   │  8B  │ section_alignment (default 16)         │
   0x120 │ DWORD   │  4B  │ const_bank_section_index               │
   0x124 │ --      │  4B  │ (padding)                              │
   0x128 │ DWORD   │  4B  │ text_section_index                     │
   0x12C │ DWORD   │  4B  │ text_rela_section_index                │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 7: KNOBS Embedding
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x130 │ QWORD   │  8B  │ kv_pair_list (vector)                  │
   0x138 │ QWORD   │  8B  │ knobs_pair_list (vector)               │
   0x140 │ WORD    │  2B  │ min_sm_version (default 256 = sentinel) │
   0x142 │ BYTE    │  1B  │ has_crs_depth                          │
   0x143 │ --      │  5B  │ (padding to 0x148)                     │
         └─────────┴──────┴────────────────────────────────────────┘

Key design observations:

Flag byte block (+0x0B0 to +0x0B7). Eight single-byte flags capture function characteristics that determine which R_MERCURY_* relocation patches the finalizer must apply. The flags are set by type-2, type-3, and type-4 markers in the capmerc stream. Each flag is a boolean (0 or 1), never a bitfield.

KNOBS indirection (+0x0D0). The KNOBS data does not live inline in the descriptor. Instead, +0x0D0 points to a separately allocated 64-byte sub-object carrying the ELF coordinates (section index, file offset, size, and name pointer) of the KNOBS section. This allows the KNOBS data to reside in a dedicated ELF section while the descriptor references it by position. The KNOBS pair list at +0x138 and the generic key-value list at +0x130 store the parsed key-value pairs from marker sub-type 90 data blocks; the "KNOBS" string literal serves as the discriminator between the two lists.

Dual-descriptor pattern. When the merc section mirror is active, the constructor allocates a second 328-byte object for the .merc<funcname> companion section. This companion receives a copy of the SASS data (not a pointer -- an actual memcpy of sass_data_size bytes), the function name with a .merc prefix, and the section flags from the original ELF section header at +0x0A0. The companion's weak_symbol_index (+0x008) is always zero.

Relocation containers. The three sorted containers at +0x060, +0x068, and +0x078 (created via sub_425CA0 with comparator pair sub_427750/sub_427760 and element size 0x20 = 32 bytes) form a three-level relocation index. The reloc_index_set stores symbol indices that appear in relocations. The per_reloc_data_set stores per-symbol relocation metadata. The reloc_payload_map associates symbol indices with the actual payload data that the finalizer patches into instruction bytes. These are populated by marker sub-types 10, 23, 25, 28, 40, 46, 49, 52, 57, 64, 68, 70, 71, 85, and 87.

min_sm_version sentinel. The default value 256 (0x100) at +0x140 acts as a sentinel meaning "no minimum SM constraint." When a target profile is available at construction time, the profile's SM version overwrites this field. Marker sub-type 95 can further override it when CRS depth information constrains the minimum SM.

Capmerc Marker Stream Format

The constructor parses a compact binary marker stream embedded in the capmerc section data. Each marker begins with a type byte followed by a sub-type byte:

| Type | Size | Format | Description |
|---|---|---|---|
| 2 | 4 bytes fixed | [02] [sub] [00] [00] | Boolean flag markers |
| 3 | 4 bytes fixed | [03] [sub] [WORD payload] | Short value markers |
| 4 | variable | [04] [sub] [WORD size] [payload...] | Variable-length data markers |

Selected marker sub-types and the descriptor fields they populate:

| Sub | Type | Descriptor Field | Purpose |
|---|---|---|---|
| 10 | 4 | +0x0B0 has_global_refs | Function accesses global memory |
| 21 | 2 | +0x0B2 has_exit | Function contains EXIT instruction |
| 22 | 2 | +0x0B3 has_crs | Function uses call return stack |
| 23 | 4 | +0x0A8 max_register_count | Register pressure (with max tracking) |
| 25 | 4 | +0x0B1 has_shared_refs | Function accesses shared memory |
| 27 | 3 | +0x000 desc_version | Descriptor format version stamp |
| 47 | 4 | +0x002 instr_format_version | Instruction encoding format version |
| 50 | 4 | +0x0E0 register_count | Allocated register count |
| 54 | 4 | +0x0B7 has_texture_refs | Function uses texture/sampler units |
| 67 | 4 | +0x0E8, +0x0F0 barrier_info | Barrier count and data |
| 72 | 4 | +0x0D0 knobs_section_desc_ptr | KNOBS section ELF binding |
| 73 | 3 | +0x0D8 stack_frame_size | Per-thread stack frame bytes |
| 74 | 2 | +0x070 sampling_mode | Interpolation/sampling mode |
| 88 | 3 | +0x0B4/B5/B6 | Atomic usage (plain/shared/global) |
| 90 | 4 | +0x138 knobs_pair_list | KNOBS key-value data block |
| 95 | 3 | +0x140, +0x142 | Min SM version + CRS depth flag |

.nv.merc.* Section Builder Pipeline

Four functions cooperate to construct the .nv.merc.* section namespace:

sub_1C9F280 (97KB, Master ELF emitter)
  │
  ├─ sub_1C9B110 (23KB) ── Mercury capsule builder
  │   Creates .nv.merc namespace, reads symtab entry count,
  │   allocates mapping arrays, duplicates sections into merc space
  │
  ├─ sub_1CA2E40 (18KB) ── Mercury section cloner
  │   Iterates all sections, clones constant/global/shared/local
  │   into .nv.merc.* namespace, creates .nv.merc.rela sections,
  │   handles .nv.global.init and .nv.shared.reserved
  │
  ├─ sub_1C9C300 (24KB) ── Capsule descriptor processor
  │   Processes .nv.capmerc and .merc markers, embeds KNOBS,
  │   handles constant bank replication and rela duplication
  │
  ├─ sub_1CA3A90 (45KB) ── Section merger
  │   Merge/combine pass for sections with both merc and non-merc
  │   copies; processes .nv.constant bank sections, handles section
  │   linking and rela association
  │
  └─ sub_1C99BB0 (25KB) ── Section index remap
      Reindexes sections after dead elimination, handles
      .symtab_shndx / .nv.merc.symtab_shndx mapping

The section classifiers sub_1C9D1F0 (16KB) and sub_1C98C60 (9KB) map section names to internal type IDs. The former handles both SASS and merc debug section variants; the latter is merc-specific and recognizes all 15 .nv.merc.debug_* names.

R_MERCURY_* Relocation Types

Capsule Mercury defines its own relocation type namespace for references within .nv.merc.rela sections and the capsule descriptor. These are distinct from standard CUDA ELF relocations (R_NV_32, etc.) and are processed during finalization rather than at link time.

| Type | Description |
|---|---|
| R_MERCURY_ABS64 | 64-bit absolute address |
| R_MERCURY_ABS32 | 32-bit absolute address |
| R_MERCURY_ABS16 | 16-bit absolute address |
| R_MERCURY_PROG_REL | PC-relative reference |
| R_MERCURY_8_0 | Sub-byte patch: bits [7:0] of target word |
| R_MERCURY_8_8 | Sub-byte patch: bits [15:8] |
| R_MERCURY_8_16 | Sub-byte patch: bits [23:16] |
| R_MERCURY_8_24 | Sub-byte patch: bits [31:24] |
| R_MERCURY_8_32 | Sub-byte patch: bits [39:32] |
| R_MERCURY_8_40 | Sub-byte patch: bits [47:40] |
| R_MERCURY_8_48 | Sub-byte patch: bits [55:48] |
| R_MERCURY_8_56 | Sub-byte patch: bits [63:56] |
| R_MERCURY_FUNC_DESC | Function descriptor reference |
| R_MERCURY_UNIFIED | Unified address space reference |
| R_MERCURY_TEX_HEADER_INDEX | Texture header table index |
| R_MERCURY_SAMP_HEADER_INDEX | Sampler header table index |
| R_MERCURY_SURF_HEADER_INDEX | Surface header table index |

Sub-Byte Relocation Design

The eight R_MERCURY_8_* types enable patching individual bytes within a 64-bit instruction word. Mercury instruction encodings pack multiple fields into single 8-byte QWORDs (the 1280-bit instruction buffer at a1+544 is organized as 20 QWORDs). During finalization for a different SM, only certain bit-fields within an instruction word may need updating -- for example, the opcode variant bits or register class encoding -- while neighboring fields remain unchanged. The sub-byte types let the finalizer patch exactly one byte at a specific position within the word without a read-modify-write cycle on the entire QWORD.

Relocation Resolution

The master relocation resolver sub_1CD48C0 (22KB) handles both standard and capmerc relocations. For R_MERCURY_UNIFIED, it converts internal relocation type 103 to type 1 (standard absolute). The resolver iterates relocation entries and handles: alias redirections, dead-function relocation skipping, __UFT_OFFSET / __UDT_OFFSET pseudo-relocations, PC-relative branch validation, NVRS (register spill) relocations, and YIELD-to-NOP conversion for forward progress guarantees.

Mercury Section Binary Layouts

Section Classifier Algorithm -- sub_1C98C60

The 9KB classifier uses a two-stage guard-then-waterfall pattern to identify .nv.merc.* sections from their ELF section headers.

Stage 1: sh_type range check (fast rejection). The section's sh_type is tested against two NVIDIA processor-specific ranges:

| Range | sh_type span | Decimal | Qualifying types |
|---|---|---|---|
| A | 0x70000006..0x70000014 | SHT_LOPROC+6..+20 | Filtered by bitmask 0x5D05 |
| B | 0x70000064..0x7000007E | SHT_LOPROC+100..+126 | All accepted (memory-space data) |
| Special | 1 | SHT_PROGBITS | Accepted (generic debug data) |

Within Range A, the bitmask 0x5D05 (binary 0101_1101_0000_0101) selects seven specific types:

| Bit | sh_type | Hex | Section types |
|---|---|---|---|
| 0 | SHT_LOPROC+6 | 0x70000006 | Memory-space clones |
| 2 | SHT_LOPROC+8 | 0x70000008 | .nv.merc.nv.shared.reserved |
| 8 | SHT_LOPROC+14 | 0x7000000E | .nv.merc.debug_line |
| 10 | SHT_LOPROC+16 | 0x70000010 | .nv.merc.debug_frame |
| 11 | SHT_LOPROC+17 | 0x70000011 | .nv.merc.debug_info |
| 12 | SHT_LOPROC+18 | 0x70000012 | .nv.merc.nv_debug_line_sass |
| 14 | SHT_LOPROC+20 | 0x70000014 | .nv.merc.debug_loc, .nv.merc.debug_ranges, .nv.merc.nv_debug_info_reg_* |

Stage 2: Name-based disambiguation (expensive path). When sh_flags bit 28 (0x10000000, SHF_NV_MERC) is set, the classifier calls sub_1CB9E50() to retrieve the section name and performs sequential strcmp() against 15 names, returning 1 on the first match. The check order matches the declaration order in the ELF structure table above. sub_4279D0 is used for .nv.merc.nv_debug_ptx_txt as a prefix match rather than exact match.

SHF_NV_MERC Flag (0x10000000)

Bit 28 of sh_flags is an NVIDIA extension: SHF_NV_MERC. All .nv.merc.* sections carry this flag. It serves two purposes:

  1. Fast filtering -- the classifier checks this bit before string comparisons, giving O(1) rejection for the common case of non-merc sections.
  2. Namespace separation -- during section index remapping (sub_1C99BB0), sections with SHF_NV_MERC are remapped into a separate merc section index space. The finalizer uses this flag to identify which sections require relocation patching during off-target finalization.

.nv.capmerc<funcname> -- Capsule Data Layout

The per-function capsule section contains the full marker stream, SASS data, KNOBS block, and optionally replicated constant bank data. The section is created by sub_1C9C300.

ELF section header:

| Field | Value |
|---|---|
| sh_type | 1 (SHT_PROGBITS) |
| sh_flags | 0x10000000 (SHF_NV_MERC) |
| sh_addralign | 16 |

Section data is organized as four consecutive regions:

         .nv.capmerc<funcname> Section Data
         ====================================

         ┌──────────────────────────────────────────────────────┐
         │ Marker Stream     (variable length)                  │
         │   Repeating TLV records:                             │
         │     [type:1B] [sub:1B] [payload:varies]              │
         │                                                      │
         │   Type 2: 4 bytes total   [02] [sub] [00 00]        │
         │     Boolean flags (has_exit, has_crs, sampling_mode) │
         │                                                      │
         │   Type 3: 4 bytes total   [03] [sub] [WORD:value]   │
         │     Short values (desc_version, stack_frame_size,    │
         │     atomic flags, min_sm_version)                    │
         │                                                      │
         │   Type 4: variable        [04] [sub] [WORD:size] ..  │
         │     Variable-length blocks (register counts, KNOBS   │
         │     data, barrier info, relocation payloads)         │
         │                                                      │
         │   Terminal marker: sub-type 95 (min_sm + CRS depth)  │
         ├──────────────────────────────────────────────────────┤
         │ SASS Data Block   (sass_data_size bytes)             │
         │   Mercury-encoded instruction bytes identical to     │
         │   what .text.<func> would contain for the compile    │
         │   target; byte-for-byte match with phase 122 output  │
         ├──────────────────────────────────────────────────────┤
         │ KNOBS Block       (knobs_section_size bytes)         │
         │   Serialized key-value pairs from marker sub-type 90 │
         │   "KNOBS" tag separates knob pairs from generic KV   │
         │   Contains: optimization level, target parameters,   │
         │   feature flags, all codegen-affecting knob values   │
         ├──────────────────────────────────────────────────────┤
         │ Constant Bank Data (const_bank_size bytes, optional) │
         │   Replicated .nv.constant0 data for deferred binding │
         │   Only present when the function references constant │
         │   bank data that the finalizer may need to patch     │
         └──────────────────────────────────────────────────────┘

.nv.merc.debug_info -- Cloned DWARF Debug Info

The cloner (sub_1CA2E40) produces a byte-for-byte copy of the source .debug_info section, placed into the merc namespace with modified ELF section header properties.

ELF section header:

| Field | Value |
|---|---|
| sh_type | 0x70000011 (SHT_LOPROC + 17) |
| sh_flags | 0x10000000 (SHF_NV_MERC) |
| sh_addralign | 1 |

Section data is standard DWARF .debug_info format:

         .nv.merc.debug_info Section Data
         ==================================

         ┌──────────────────────────────────────────────────────┐
         │ Compilation Unit Header                              │
         │   +0x00  unit_length    : 4B (DWARF-32) or 12B (-64)│
         │   +0x04  version        : 2B (typically DWARF 4)    │
         │   +0x06  debug_abbrev_offset : 4B → .nv.merc.debug_abbrev │
         │   +0x0A  address_size   : 1B (8 for 64-bit GPU)    │
         ├──────────────────────────────────────────────────────┤
         │ DIE Tree (Debug Information Entries)                  │
         │   Sequence of entries, each:                         │
         │     abbrev_code  : ULEB128                           │
         │     attributes   : per abbreviation definition       │
         │                                                      │
         │   Cross-section references (via relocations):        │
         │     DW_FORM_strp     → .nv.merc.debug_str            │
         │     DW_FORM_ref_addr → .nv.merc.debug_info           │
         │     DW_FORM_sec_offset → .nv.merc.debug_line etc.    │
         └──────────────────────────────────────────────────────┘

The critical difference from standard .debug_info: all cross-section offset references point to other .nv.merc.* sections, not the original .debug_* sections. The .nv.merc.rela.debug_info relocation table handles rebinding these offsets during finalization.

.nv.merc.rela / .nv.merc.rela<secname> -- Mercury Relocations

Mercury relocation sections use standard Elf64_Rela on-disk format (24 bytes per entry) but encode Mercury-specific relocation types with a 0x10000 offset in the type field.

ELF section header:

| Field | Value |
|---|---|
| sh_type | 4 (SHT_RELA) |
| sh_flags | 0x10000000 (SHF_NV_MERC) |
| sh_addralign | 8 |
| sh_entsize | 24 |
| sh_link | symtab section index |
| sh_info | target section index |

Section names are constructed by sub_1C980F0 as ".nv.merc.rela" + suffix (e.g., ".nv.merc.rela.debug_info").

On-disk entry layout (standard Elf64_Rela, 24 bytes):

         .nv.merc.rela Entry (24 bytes on disk)
         ========================================

         ┌─────────┬──────┬────────────────────────────────────────────┐
   0x00  │ QWORD   │  8B  │ r_offset — byte position in target section │
   0x08  │ DWORD   │  4B  │ r_type — relocation type                   │
         │         │      │   Standard: 1=R_NV_ABS64, etc.             │
         │         │      │   Mercury:  r_type > 0x10000               │
         │         │      │   Decoded:  r_type - 0x10000 → R_MERCURY_* │
   0x0C  │ DWORD   │  4B  │ r_sym — symbol table index                 │
   0x10  │ QWORD   │  8B  │ r_addend — signed addend value             │
         └─────────┴──────┴────────────────────────────────────────────┘

During resolution (sub_1CD48C0), the 24-byte on-disk entries are loaded into a 32-byte in-memory representation that adds two section index fields:

         In-Memory Relocation Entry (32 bytes)
         =======================================

         ┌─────────┬──────┬────────────────────────────────────────────┐
   0x00  │ QWORD   │  8B  │ r_offset — byte position in target section │
   0x08  │ DWORD   │  4B  │ r_type — relocation type                   │
   0x0C  │ DWORD   │  4B  │ r_sym — symbol table index                 │
   0x10  │ QWORD   │  8B  │ r_addend — signed addend value             │
   0x18  │ DWORD   │  4B  │ r_sec_idx — target section index           │
   0x1C  │ DWORD   │  4B  │ r_addend_sec — addend section index        │
         └─────────┴──────┴────────────────────────────────────────────┘

The extra 8 bytes enable cross-section targeting: r_sec_idx identifies which section r_offset is relative to, and r_addend_sec identifies the section contributing the addend base address. When r_addend_sec != 0, the resolver adds that section's load address to r_offset before patching.
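As a layout sanity check, the 32-byte in-memory form can be modeled as a C struct; on x86-64 the natural alignment reproduces the documented offsets with no padding. This is an illustrative sketch — the field names follow the diagram above, not recovered symbol names:

```c
#include <stdint.h>
#include <stddef.h>

/* In-memory relocation entry as loaded by the resolver (sub_1CD48C0).
   Field names are descriptive labels, not recovered symbols. */
typedef struct {
    uint64_t r_offset;      /* 0x00: byte position in target section */
    uint32_t r_type;        /* 0x08: relocation type */
    uint32_t r_sym;         /* 0x0C: symbol table index */
    int64_t  r_addend;      /* 0x10: signed addend value */
    uint32_t r_sec_idx;     /* 0x18: target section index */
    uint32_t r_addend_sec;  /* 0x1C: addend section index */
} merc_rela_mem;
```

Natural x86-64 alignment places every field at its documented offset, so the resolver can widen the 24-byte on-disk entries by simply appending the two section-index dwords.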

The resolver detects Mercury relocation types via r_type > 0x10000, subtracts 0x10000, then dispatches through a Mercury-specific handler table (off_2407B60) rather than the standard CUDA relocation table (off_2408B60).
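The type-detection logic reduces to a simple predicate. A minimal sketch (the constant and function names are mine):

```c
#include <stdint.h>

#define MERC_TYPE_BIAS 0x10000u  /* Mercury relocation types sit above this */

/* Returns the index used to dispatch into the Mercury handler table
   (off_2407B60), or -1 if the type belongs to the standard CUDA
   relocation table (off_2408B60). */
static int32_t mercury_handler_index(uint32_t r_type) {
    if (r_type > MERC_TYPE_BIAS)
        return (int32_t)(r_type - MERC_TYPE_BIAS);
    return -1;
}
```

Note the strict comparison: a hypothetical r_type of exactly 0x10000 would still route to the standard table, matching the `r_type > 0x10000` check recovered from the resolver.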

Complete sh_type Map

| sh_type | Hex | Section types |
|---|---|---|
| 1 | 0x00000001 | .nv.capmerc<func>, .nv.merc.debug_abbrev (PROGBITS variant), .nv.merc.debug_str, .nv.merc.nv_debug_ptx_txt |
| 4 | 0x00000004 | .nv.merc.rela* (SHT_RELA) |
| SHT_LOPROC+6 | 0x70000006 | .nv.merc.<memory-space> clones |
| SHT_LOPROC+8 | 0x70000008 | .nv.merc.nv.shared.reserved |
| SHT_LOPROC+14 | 0x7000000E | .nv.merc.debug_line |
| SHT_LOPROC+16 | 0x70000010 | .nv.merc.debug_frame |
| SHT_LOPROC+17 | 0x70000011 | .nv.merc.debug_info |
| SHT_LOPROC+18 | 0x70000012 | .nv.merc.nv_debug_line_sass |
| SHT_LOPROC+20 | 0x70000014 | .nv.merc.debug_loc, .nv.merc.debug_ranges, .nv.merc.nv_debug_info_reg_sass, .nv.merc.nv_debug_info_reg_type |
| SHT_LOPROC+100..+126 | 0x70000064..0x7000007E | Memory-space variant sections (constant banks, shared, local, global) |

The .nv.merc.* debug sections reuse the same sh_type values as their non-merc counterparts (.debug_info uses 0x70000011 in both namespaces). The SHF_NV_MERC flag (0x10000000) in sh_flags is the distinguishing marker.
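Since sh_type alone is ambiguous across the two namespaces, classification hinges on the flag bit. A minimal sketch:

```c
#include <stdint.h>

#define SHF_NV_MERC 0x10000000u  /* Mercury namespace flag in sh_flags */

/* .nv.merc.* sections reuse their non-merc sh_type values, so the
   only reliable discriminator is SHF_NV_MERC in sh_flags. */
static int is_merc_section(uint32_t sh_type, uint64_t sh_flags) {
    (void)sh_type;  /* identical in both namespaces, e.g. 0x70000011 */
    return (sh_flags & SHF_NV_MERC) != 0;
}
```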

Self-Check Mechanism

The --self-check flag activates a roundtrip verification that validates the capmerc encoding by reconstituting SASS from the capsule data and comparing it against the original:

Phase 122 output (SASS) ──────────────────────────> reference SASS
         │
         └─ capmerc packaging ─> .nv.capmerc<func>
                                        │
                                        └─ reconstitute ─> reconstituted SASS
                                                                │
                                                    section-by-section compare
                                                                │
                                                      pass / fail (error 17/18/19)

The reconstitution pipeline uses sub_720F00 (64KB), a Flex-generated SASS text lexer with thread-safety support (pthread_mutexattr_t), to parse the reconstituted instruction stream. sub_729540 (35KB) performs the actual section-by-section comparison.

Three error codes signal specific self-check failures:

| Error code | Meaning |
|---|---|
| 17 | Section content mismatch (instruction bytes differ) |
| 18 | Section count mismatch (missing or extra sections) |
| 19 | Section metadata mismatch (size, alignment, or flags differ) |

These error codes trigger longjmp-based error recovery in the master ELF emitter (sub_1C9F280), which uses _setjmp at its entry point for non-local error handling.

The --out-sass flag causes ptxas to dump the reconstituted SASS to a file, useful for debugging self-check failures by manual comparison with the original SASS output.

Opportunistic Finalization

The --opportunistic-finalization-lvl flag controls how aggressively capmerc binaries may be finalized for a target SM different from the compilation target:

| Level | Name | Behavior |
|---|---|---|
| 0 | default | Standard finalization for the compile target only |
| 1 | none | No finalization; output stays as capmerc (deferred to driver) |
| 2 | intra-family | Finalize for any SM within the same architectural family |
| 3 | intra+inter | Finalize across SM families |

Level 2 allows a capmerc binary compiled for sm_100 (datacenter Blackwell) to be finalized for sm_103 (Blackwell Ultra / GB300) without recompilation. Level 3 extends this across families -- for example, sm_100 capmerc finalized for sm_120 (consumer RTX 50-series).

The key constraint is instruction encoding compatibility: the sub-byte R_MERCURY_8_* relocations can patch SM-specific encoding bits, but the overall instruction format and register file layout must be compatible between source and target.

Off-Target Finalization

Off-target finalization is the process of converting a capmerc binary compiled for SM X into native SASS for SM Y. The compatibility checker sub_60F290 determines whether the source/target pair is compatible, examining:

  • SM version pair and generation compatibility
  • Feature flag differences between source and target
  • Instruction set compatibility (no target-only instructions used)
  • Constant bank layout compatibility
  • Register file layout match

When the check passes, the kernel finalizer sub_612DE0 (47KB) applies the "fastpath optimization" -- it directly patches the Mercury-encoded instruction stream using R_MERCURY_* relocations rather than running the full compilation pipeline. On success, ptxas emits the diagnostic:

"applied for off-target %u -> %u finalization"

where the two %u values are the source and target SM numbers.

The fastpath avoids re-running phases 117--122 of the Mercury pipeline. Instead, it:

  1. Reads the capsule descriptor from .nv.capmerc<func>
  2. Validates compatibility via sub_60F290
  3. Applies R_MERCURY_* relocation patches for the target SM
  4. Regenerates the ELF .text section with patched instruction bytes
  5. Updates .nv.info EIATTR attributes for the target (register counts, barrier counts)

This is substantially faster than full recompilation, which is why ptxas logs it as a "fastpath."

Pipeline Integration

Capmerc does not modify the Mercury encoder pipeline (phases 113--122). The instruction encoding, pseudo-instruction expansion, WAR hazard resolution, operation expansion (opex), and SASS microcode emission all execute identically regardless of output mode. The divergence happens after phase 122 completes:

| Mode | Post-Pipeline Behavior |
|---|---|
| Mercury | Phase 122 SASS output written directly to .text.<func> ELF section |
| Capmerc | Phase 122 output wrapped in 328-byte capsule descriptor; .nv.merc.* sections cloned; R_MERCURY_* relocations emitted; KNOBS data embedded |
| SASS | Phase 122 output written as raw SASS binary (no ELF wrapper) |

The master ELF emitter sub_1C9F280 (97KB) orchestrates the post-pipeline divergence:

// Simplified from sub_1C9F280
void EmitELF(context) {
    // Common: copy ELF header (64 bytes via SSE loadu)
    memcpy(output, &elf_header, 64);

    // Common: iterate sections, build section headers
    for (int i = 0; i < section_count; i++) {
        if (section[i].flags & 4) continue;  // skip virtual sections
        // ... copy section data, patch headers ...
    }

    if (is_capmerc_mode) {
        sub_1C9B110(ctx);   // create .nv.merc namespace
        sub_1CA2E40(ctx);   // clone sections into merc space
        sub_1C9C300(ctx);   // build capsule descriptors + KNOBS
        sub_1CA3A90(ctx);   // merge merc/non-merc section copies
    }

    // Common: remap section indices, build symbol table
    sub_1C99BB0(ctx);       // section index remap
    sub_1CB68D0(ctx);       // build .symtab

    // Common: resolve relocations
    sub_1CD48C0(ctx);       // relocation resolver (handles R_MERCURY_*)

    // Common: finalize and write
    sub_1CD13A0(ctx);       // serialize to file
}

Function Map

| Address | Size | Identity |
|---|---|---|
| sub_1C9F280 | 97KB | Master ELF emitter (orchestrates full CUBIN output) |
| sub_1CA3A90 | 45KB | Section merger / combined section emitter |
| sub_1CB68D0 | 49KB | Symbol table builder (handles merc section references) |
| sub_1C99BB0 | 25KB | Section index remap (.symtab_shndx / .nv.merc.symtab_shndx) |
| sub_1C9C300 | 24KB | Capsule descriptor processor (328-byte object, KNOBS embed) |
| sub_1C9B110 | 23KB | Mercury capsule builder (creates .nv.merc namespace) |
| sub_1CD48C0 | 22KB | Master relocation resolver (R_MERCURY_* + standard) |
| sub_1CA2E40 | 18KB | Mercury section cloner |
| sub_1C9D1F0 | 16KB | Debug section classifier (SASS + merc variants) |
| sub_1C98C60 | 9KB | Mercury debug section classifier (15 section names) |
| sub_720F00 | 64KB | Flex SASS text lexer (self-check reconstitution) |
| sub_729540 | 35KB | SASS assembly verification (self-check comparator) |
| sub_703AB0 | 10KB | Binary-kind CLI parser |
| sub_612DE0 | 47KB | Kernel finalizer / ELF builder (fastpath optimization) |
| sub_60F290 | -- | Off-target compatibility checker |
| sub_1CD13A0 | 11KB | ELF serialization (final file writer) |

Cross-References

Newton-Raphson & Math Templates

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

NVIDIA GPUs lack hardware integer dividers, and the SFU provides no native FP64 arithmetic. When ptxas encounters PTX operations such as div.s32, div.u64, rcp.f64, sqrt.f64, or rsqrt.f64, it expands them into multi-instruction SASS sequences that synthesize the result from simpler hardware primitives. These expansions are the math templates -- pre-built instruction sequence generators that emit 20--100+ SASS instructions per PTX operation, using the MUFU (Multi-Function Unit) for initial approximations and Newton-Raphson iterations for refinement.

The template subsystem lives at 0x1700000--0x172A090 in the ptxas binary: 36 functions occupying ~180 KB. It is invoked during instruction selection by the master lowering dispatcher sub_AED3C0 whenever the selected instruction requires multi-instruction expansion.

| Field | Value |
|---|---|
| Address range | 0x1700000--0x172A090 |
| Function count | 36 (4 top-level handlers + 4 coordinators + ~24 sub-expanders + 4 helpers) |
| Binary size | ~180 KB |
| Master lowering dispatcher | sub_AED3C0 (28 KB, vtable-dispatched) |
| Emission primitives | sub_9314F0 (standard), sub_934630 (extended), sub_935130 (branch), sub_9352C0 (wide) |
| Virtual register allocator | sub_91BF30 (535 bytes, allocates 160-byte register descriptors) |
| Immediate encoder | sub_91D160 (318 bytes, encodes constant values into operand descriptors) |
| Operand legalizer | sub_13A6A10 (called before each expansion to widen immediates / fix register classes) |
| Template name strings | __ori_template_DDIV1, __ori_template_DDIV2, __ori_template_DDIV3 |

Architecture

Two-Level Hierarchy

Every math template follows the same structural pattern: a top-level handler performs lazy initialization and operand legalization, then delegates to a coordinator that allocates virtual registers and calls a sequence of sub-expanders, each of which emits a portion of the final SASS instruction sequence.

sub_AED3C0 (Master Lowering Dispatcher, 28 KB)
  |
  +-- sub_170E8B0 (DDIV handler)        -- FP64 division
  |     +-- sub_170E260 (coordinator)    -- 298 vregs, 6 sub-expanders
  |
  +-- sub_1718D60 (DRCP/DSQRT handler)  -- FP64 reciprocal / square root
  |     +-- sub_1718790 (coordinator)    -- 289 vregs, 7 sub-expanders (inc. MUFU.RCP)
  |
  +-- sub_17276C0 (DRSQRT handler)      -- FP64 reciprocal square root
  |     +-- sub_1720D60 (coordinator A) -- 247 vregs, 5 sub-expanders (MUFU.RSQ path)
  |     +-- sub_1727130 (coordinator B) -- 59 vregs, integer div/mod path
  |
  +-- sub_1704070 (Inline DDIV handler) -- FP64 division, register-pressure variants
        +-- sub_1702990 (>20K regs)     -- full unrolled, ~50 instructions
        +-- sub_1701F10 (>16K regs)     -- partially spilled
        +-- sub_1701860 (<=16K regs)    -- minimal-register variant

Lazy Initialization

Each top-level handler uses a lazy-init pattern to avoid rebuilding the template for every invocation within a compilation unit:

// sub_170E8B0 -- DDIV handler (simplified from decompilation)
void DDIV_Handler(template_state *a1, instruction *a2) {
    if (a1->template_id == -1) {              // first invocation
        a1->template_id = ctx->next_id++;     // allocate unique ID
        DDIV_Coordinator(a1, ...);            // build template once
    }
    ctx->insert_point = a2->position;
    LegalizeOperand(ctx, a2, 1, ...);         // sub_13A6A10

    if (a1->use_template_call) {
        // Template path: emit BRA-to-template (opcode 168)
        EmitExtended(ctx, 168, 0x13, ...);    // sub_934630
    } else {
        // Inline path: emit individual FP ops directly
        EmitFP(ctx, 0x86, 0xC, a1->reg[0], ...);  // sub_92E800
        EmitFP(ctx, 0x85, 0xC, a1->reg[1], ...);
    }
}

The a1->use_template_call flag (at offset +8) controls whether the expansion is emitted as a callable template (with BRA to a named code section) or inlined directly at the call site. The template-call path produces three named code objects -- __ori_template_DDIV1, __ori_template_DDIV2, __ori_template_DDIV3 -- that are shared across all DDIV call sites in the same function.

Coordinator Pattern

All four coordinators share identical structure. They allocate virtual registers from a static descriptor table, call the shared helper sub_1701140 to build the code object scaffolding, then invoke their sub-expanders in sequence:

// sub_170E260 -- DDIV coordinator (simplified)
void DDIV_Coordinator(template_state *a1, ..., int template_id) {
    int *vreg_array = NULL;
    int count = 0;

    // Allocate 298 virtual registers from static table dword_23993E0
    for (int i = 0; i < 298; i++) {
        int reg_id = AllocVReg(ctx, dword_23993E0[2*i]);  // sub_91BF30
        int category = dword_23993E4[2*i];                  // 0=output, 1=temp
        if (category == 0)
            output_regs[out_count++] = reg_id;
        else if (category == 1)
            temp_regs[temp_count++] = reg_id;
        // Mark register as template-owned
        *(vreg_table[reg_id] + 48) |= 0x40;
    }

    // Build code object scaffolding
    BuildTemplateScaffold(ctx, template_id, &static_table, 3, ...);

    // Name the three code sections
    if (a1->use_template_call) {
        section[0]->name = intern("__ori_template_DDIV1");
        section[1]->name = intern("__ori_template_DDIV2");
        section[2]->name = intern("__ori_template_DDIV3");
    }

    // Allocate 240-byte scratch buffer (zeroed)
    void *scratch = arena_alloc(240);
    memset(scratch, 0, 232);

    // Call 6 sub-expanders in sequence
    DDIV_Part1(a1, template_id, scratch, vreg_array, ...);  // sub_1704180
    DDIV_Part2(a1, template_id, scratch, vreg_array, ...);  // sub_1705820
    DDIV_Part3(a1, template_id, scratch, vreg_array, ...);  // sub_17075A0
    DDIV_Part4(a1, template_id, scratch, vreg_array, ...);  // sub_1709130
    DDIV_Part5(a1, template_id, scratch, vreg_array, ...);  // sub_170AE80
    DDIV_Part6(a1, template_id, scratch, vreg_array, ...);  // sub_170CBD0

    // Emit convergence barriers (opcode 0x5D) between code sections
    for (each section boundary in static_table) {
        EmitBarrier(ctx, 0x5D, pred_reg, ...);  // sub_92E1B0
    }
    // Mark scheduling barriers at section endpoints
    *(section[23]->flags + 280) |= 8;
    *(section[42]->flags + 280) |= 8;
}

The static descriptor tables (dword_23993E0 for DDIV, dword_2398940 for DRCP/DSQRT, dword_2398000 for DRSQRT, dword_23976E0 for integer div) encode the register type and category for each virtual register used by the template. The category field (second element of each pair) classifies registers as output (0) or temporary (1).

FP64 Division (DDIV)

Double-precision division a / b has no single-instruction implementation on any NVIDIA GPU. ptxas synthesizes it using Newton-Raphson refinement of a single-precision reciprocal seed.

Algorithm

The DDIV template produces three code sections containing the following mathematical steps:

DDIV1 -- Initial reciprocal approximation:

  1. Extract the high 32 bits of the FP64 divisor b
  2. Convert to FP32 and compute MUFU.RCP (single-precision reciprocal, ~23 bits of mantissa)
  3. Convert the FP32 result back to a form suitable for FP64 refinement
  4. Handle special cases: divisor is zero, infinity, NaN, or denormal

DDIV2 -- Newton-Raphson refinement:

The classical Newton-Raphson iteration for reciprocal is:

x_{n+1} = x_n * (2 - b * x_n)

Each iteration approximately doubles the number of correct bits. Starting from the ~23-bit MUFU.RCP seed:

  • Iteration 1: ~46 bits -- still short of the 52-bit FP64 mantissa
  • A second partial iteration provides the guard bits needed for correct rounding

The SASS instruction sequence uses DFMA (FP64 fused multiply-add) to implement each iteration step. The FSETP/BRA branches handle edge cases where the intermediate result would overflow or underflow the FP64 range.
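The precision-doubling behavior is easy to verify numerically. The sketch below models the refinement in plain C, using an FP32 division as a stand-in for the MUFU.RCP seed (the real template operates on register pairs with DFMA; this is an arithmetic illustration only):

```c
#include <math.h>

/* One Newton-Raphson step for the reciprocal of b:
   x_{n+1} = x_n * (2 - b * x_n). Each step roughly squares the
   relative error, i.e. doubles the number of correct bits. */
static double nr_recip_step(double b, double x) {
    return x * (2.0 - b * x);
}
```

Starting from a ~23-bit seed, one step lands near 46--50 correct bits and a second step reaches full FP64 precision, which is exactly why the template needs two DFMA-based iterations plus rounding correction.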

DDIV3 -- Final multiply and exception handling:

  1. Compute a * (1/b) using the refined reciprocal
  2. Apply IEEE 754 rounding (round-to-nearest-even by default)
  3. Emit the quotient to the destination register pair
  4. Handle overflow to infinity, underflow to zero, and NaN propagation

SASS Instruction Sequence (sub_1705820)

The DDIV Part 2 sub-expander (sub_1705820, 7,545 bytes, 1,057 lines decompiled) is the largest single sub-expander and emits the core Newton-Raphson loop. The instruction mix from decompilation:

| SASS Opcode | Internal ID | Count | Role |
|---|---|---|---|
| IMAD | 0xC9 | 10 | Integer multiply-add for mantissa manipulation |
| FSETP | 0x97 | 6 | Floating-point set-predicate for branch conditions |
| MOV | 0x82 | 13 | Register-to-register moves |
| MOV (FP) | 0x0A | 10 | FP register moves with type annotation |
| IADD3 | 0x110 | 5 | Three-operand integer add for exponent arithmetic |
| SHR | 0x19 | 1 | Shift right for exponent extraction |
| BRA | 0x5F | 5 | Conditional branches for special-case handling |
| MUFU | 0x3C | 1 | MUFU.RCP -- the initial reciprocal seed |
| DFMA | 0x122 | 2 | FP64 fused multiply-add (Newton-Raphson iteration) |
| FP64 op | 0x8B | 2 | FP64 arithmetic (multiply or add) |
| FP32 hi/lo | 0x86/0x85 | 4+4 | Move FP32 halves of FP64 register pair |
| Total | | ~63 | Per sub-expander (Part 2 of 6) |

The complete DDIV template across all 6 sub-expanders emits approximately 100--120 SASS instructions, using 298 virtual registers.

Register Pressure Variants

The inline DDIV handler (sub_1704070) selects between three implementations based on the target architecture's register file size at *(*(context+1584) + 372):

| Register limit | Handler | Strategy |
|---|---|---|
| > 20,479 | sub_1702990 (5,846 bytes) | Full unrolled -- maximum ILP, 14+ dedicated scratch registers |
| > 16,383 | sub_1701F10 | Partially spilled -- trades some registers for spill/fill |
| <= 16,383 | sub_1701860 | Minimal-register -- reuses registers aggressively, more instructions |

This three-tier approach is a register-pressure/throughput tradeoff: kernels with high register demand (and thus low occupancy) use the minimal variant, while kernels with register headroom use the fully unrolled variant for better instruction-level parallelism.

FP64 Reciprocal and Square Root (DRCP/DSQRT)

The DRCP/DSQRT handler (sub_1718D60) shares the same lazy-init and template-call architecture as DDIV. Its coordinator (sub_1718790) allocates 289 virtual registers from dword_2398940 and calls 7 sub-expanders:

| Sub-expander | Address | Role |
|---|---|---|
| Part 1 | sub_170ED40 | FP64 reciprocal: extract exponent, compute MUFU.RCP seed |
| Part 2 | sub_1710280 | Newton-Raphson iteration 1 for reciprocal refinement |
| Part 3 | sub_17120F0 | Newton-Raphson iteration 2 (second doubling of precision) |
| Part 4 | sub_17139D0 | Rounding and normalization |
| Part 5 | sub_1715910 | Square root path: compute MUFU.RSQ seed, refine |
| Part 6 | sub_1717470 | Final multiply x * rsqrt(x) to get sqrt(x), exception handling |
| (shared) | sub_1701140 | Template scaffolding helper (called by all coordinators) |

The algorithm for DRCP(b) = 1/b:

  1. MUFU.RCP(float32(b)) provides ~23-bit seed
  2. Two Newton-Raphson iterations: x_{n+1} = x_n * (2 - b * x_n), each using DFMA
  3. Final rounding to FP64 precision

The algorithm for DSQRT(a) = sqrt(a):

  1. MUFU.RSQ(float32(a)) provides ~23-bit 1/sqrt(a) seed
  2. Refine 1/sqrt(a) via Newton-Raphson: y_{n+1} = y_n * (3 - a * y_n^2) / 2
  3. Compute sqrt(a) = a * (1/sqrt(a)) using the refined reciprocal square root
  4. Apply IEEE 754 rounding

Both paths share the same coordinator and register pool. The coordinator selects the DRCP path or DSQRT path based on the original PTX operation being lowered.
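The DSQRT refinement formula can likewise be checked numerically. The sketch below models the MUFU.RSQ seed with an FP32 computation (an illustration of the arithmetic, not the SASS sequence):

```c
#include <math.h>

/* One Newton-Raphson step for 1/sqrt(a):
   y_{n+1} = y_n * (3 - a * y_n^2) / 2.
   Like the reciprocal iteration, each step roughly doubles the
   number of correct bits. */
static double nr_rsqrt_step(double a, double y) {
    return y * (3.0 - a * y * y) * 0.5;
}
```

After two steps from a ~23-bit seed, sqrt(a) recovered as a * rsqrt(a) is correct to FP64 precision, matching the two-iteration structure of Parts 2--3.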

FP64 Reciprocal Square Root (DRSQRT)

The DRSQRT handler (sub_17276C0) is the most complex top-level handler. It dispatches to one of two coordinators based on a hardware capability flag:

// sub_17276C0 -- DRSQRT handler (simplified)
void DRSQRT_Handler(template_state *a1, instruction *a2) {
    int hw_flag = *(*(ctx + 1584) + 1037) & 1;

    if (a1->template_id == -1) {
        a1->template_id = ctx->next_id++;
        if (hw_flag)
            Coordinator_IntDiv(a1, ...);    // sub_1727130: 59 vregs
        else
            Coordinator_DRSQRT(a1, ...);    // sub_1720D60: 247 vregs
    }
    // ... operand legalization, template call or inline emission
}

The hardware flag at *(config + 1037) & 1 likely distinguishes architectures with enhanced SFU precision (where fewer refinement iterations are needed) from older architectures requiring the full Newton-Raphson sequence.

Coordinator A (sub_1720D60): allocates 247 virtual registers from dword_2398000 and calls 5 sub-expanders:

| Sub-expander | Address | Role |
|---|---|---|
| Part 1 | sub_1719080 | Initial MUFU.RSQ seed, exponent extraction |
| Part 2 | sub_171A260 | Newton-Raphson iteration 1 |
| Part 3 | sub_171BB80 | Newton-Raphson iteration 2 |
| Part 4 | sub_171D3A0 | Normalization and rounding |
| Part 5 | sub_171EFD0 | Exception handling (NaN, infinity, negative, zero) |

Coordinator B (sub_1727130): allocates only 59 virtual registers from dword_23976E0 and dispatches to the integer division sub-expanders (sub_1724A20 for 32-bit, sub_1728930 for 64-bit unsigned, sub_1727AC0 for 64-bit signed). This path handles the integer division/modulo lowering via sub_1729B50.

Integer Division Lowering

Integer division and modulo by variable (non-constant) values are expanded into multi-instruction SASS sequences during instruction selection. These sequences use the MUFU.RCP hardware approximation as a starting point, then correct the result with integer arithmetic.

32-bit Division -- sub_1724A20

Size: 28,138 bytes decompiled (957 lines), the largest function in the 0x1723000--0x17F8000 range. Called from: sub_1727130 (coordinator B). Virtual registers: 59 (allocated by coordinator B from dword_23976E0). Temporary pool: indices 90--126 of the parameter array, providing 37 dedicated scratch registers.

Algorithm for unsigned 32-bit a / b:

Step 1:  float_b = I2F(b)                    ; convert divisor to FP32
Step 2:  rcp     = MUFU.RCP(float_b)          ; ~23-bit reciprocal approximation
Step 3:  int_rcp = F2I(rcp)                   ; convert back to integer
Step 4:  q_est   = IMAD.HI(a, int_rcp, 0)    ; estimated quotient (high 32 bits of a*rcp)
Step 5:  r_est   = IMAD(q_est, -b, a)        ; estimated remainder = a - q*b
Step 6:  if (r_est >= b) q_est++; r_est -= b  ; correction iteration 1
Step 7:  if (r_est >= b) q_est++; r_est -= b  ; correction iteration 2
Step 8:  result  = q_est                       ; (or r_est for modulo)

The correction steps (6--7) are implemented with ISETP (opcode 0xC9) for comparison and BRA (opcode 0x5F) for conditional execution. In the worst case, two correction iterations suffice because the MUFU.RCP approximation is accurate to within 2 ULP of the true reciprocal.
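The correction logic in steps 6--7 can be modeled in isolation. This sketch assumes, as the template guarantees, that the MUFU.RCP-based estimate undershoots the true quotient by at most 2; generating the estimate itself (steps 1--5) is omitted:

```c
#include <stdint.h>

/* Given a quotient estimate q_est that undershoots the true quotient
   by at most 2, apply the two ISETP/BRA correction iterations. */
static uint32_t correct_quotient(uint32_t a, uint32_t b, uint32_t q_est) {
    uint32_t r = a - q_est * b;        /* r_est = IMAD(q_est, -b, a) */
    if (r >= b) { q_est++; r -= b; }   /* correction iteration 1 */
    if (r >= b) { q_est++; r -= b; }   /* correction iteration 2 */
    return q_est;                      /* r now holds a % b */
}
```

Two fixed (branch-free in the predicated SASS form) corrections suffice precisely because the estimate's error is bounded, which is what keeps the template at a constant instruction count.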

Key constants allocated via sub_91D160:

| Constant | Purpose |
|---|---|
| 23 | FP32 mantissa width (shift amount for exponent extraction) |
| 255 | Exponent mask (8-bit IEEE 754 exponent field) |
| 127 | IEEE 754 single-precision exponent bias |
| 254 | Double-bias for overflow guard (2 * 127) |
| 1, -1 | Correction increments for quotient adjustment |

The complete SASS instruction mix for the 32-bit division template:

| SASS Opcode | Internal ID | Count | Role |
|---|---|---|---|
| I2F | 0xD5 | 2 | Integer-to-float conversion |
| F2I | 0xD6 | 3 | Float-to-integer conversion |
| IMAD | 0x6E | 5 | Integer multiply-add (quotient estimation) |
| IMAD.WIDE | 0x6F | 3 | Wide multiply-add (64-bit intermediate) |
| IADD | 0x02 | ~3 | Integer add (correction) |
| MOV | 0x82 | 10 | Register moves |
| MOV (typed) | 0x0A | 6 | Typed register moves |
| ISETP | 0xC9 | 8 | Integer set-predicate (comparison) |
| FSETP | 0x97 | 3 | Float set-predicate |
| SHL/LEA | 0x24 | 2 | Shift-left / load effective address |
| BRA | 0x5F | 4 | Conditional branch (correction paths) |
| POPC/LOP | 0x93 | 1 | Population count / logic op |
| Total | | ~50 | |

64-bit Division

Two variants handle 64-bit operands, both called from sub_1729B50:

  • sub_1728930 (16,545 bytes): unsigned 64-bit division. The algorithm is analogous to 32-bit but requires double-width multiply (IMAD.WIDE), carry propagation, and additional correction iterations. Emits ~80 SASS instructions.

  • sub_1727AC0 (13,776 bytes): signed 64-bit division. Wraps the unsigned algorithm with sign extraction, absolute value computation, and sign fixup of the quotient and remainder.

Both allocate from the same 59-register pool managed by coordinator B.

Division by Constant

Division by compile-time constant is handled separately during the GeneralOptimize bundle passes (not by these templates). The classic Granlund-Montgomery magic-number technique converts x / C to MULHI(x, magic) >> shift, producing 2--3 instructions instead of ~50. See Strength Reduction for details.
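For a concrete instance of the magic-number transform, the well-known unsigned magic for division by 3 is 0xAAAAAAAB with a 33-bit shift (ceil(2^33 / 3)); the specific constants ptxas chooses per divisor are not documented here:

```c
#include <stdint.h>

/* x / 3 == (x * 0xAAAAAAAB) >> 33 for every 32-bit x.
   In SASS this lowers to a MULHI-style IMAD.WIDE plus a shift --
   2-3 instructions instead of the ~50-instruction runtime template. */
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

The magic constant is a slight over-approximation of 2^33 / 3; the over-approximation error, scaled down by the shift, stays below 1 for all 32-bit inputs, so the truncated product equals the true quotient.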

MUFU: The Hardware Approximation Engine

All math templates depend on the MUFU (Multi-Function Unit) instruction, which provides low-precision hardware approximations for transcendental and special functions. MUFU is a single SASS instruction (internal opcode 0x3C) with a sub-function selector:

| MUFU Sub-function | Operation | Precision | Latency (typical) |
|---|---|---|---|
| MUFU.RCP | 1/x (reciprocal) | ~23 bits (FP32 mantissa) | ~8 cycles |
| MUFU.RSQ | 1/sqrt(x) (reciprocal square root) | ~23 bits | ~8 cycles |
| MUFU.RCP64H | High-precision 1/x seed for FP64 | ~28 bits (sm_80+) | ~10 cycles |
| MUFU.RSQ64H | High-precision 1/sqrt(x) seed for FP64 | ~28 bits (sm_80+) | ~10 cycles |
| MUFU.SIN | sin(x) | ~23 bits | ~8 cycles |
| MUFU.COS | cos(x) | ~23 bits | ~8 cycles |
| MUFU.EX2 | 2^x (base-2 exponential) | ~23 bits | ~8 cycles |
| MUFU.LG2 | log2(x) (base-2 logarithm) | ~23 bits | ~8 cycles |
| MUFU.SQRT | sqrt(x) (sm_89+) | ~23 bits | ~8 cycles |

MUFU executes on the SFU (Special Function Unit), which is separate from the integer and floating-point ALU pipelines. On sm_80 (Ampere) and later, the SFU can execute one MUFU per cycle per SM partition. The key insight is that MUFU provides only FP32-precision seeds; achieving FP64 precision requires the Newton-Raphson refinement implemented by the math templates.

Fast-Math vs. IEEE Precision

For FP32 operations, the PTX modifiers .approx and .ftz control whether ptxas uses MUFU directly or applies refinement:

  • div.approx.f32: Emits a single MUFU.RCP followed by FMUL. No Newton-Raphson. Result has ~23-bit precision (not IEEE-correct rounding).
  • div.full.f32: Emits MUFU.RCP + one Newton-Raphson iteration via FFMA. Result is IEEE-correct for all normal inputs.
  • div.rn.f64: Emits the full DDIV template (~100+ instructions) with two Newton-Raphson iterations. Result is IEEE 754 round-to-nearest-even.

For transcendental functions (sin, cos, exp, log):

  • sin.approx.f32 / cos.approx.f32: Single MUFU.SIN / MUFU.COS. ~23-bit precision over a reduced range.
  • sin.f32 (full range): Range reduction to [-pi, pi] via polynomial argument reduction, then MUFU.SIN + polynomial correction. Emitted as a libdevice call or inline sequence depending on optimization level.
  • ex2.approx.f32: Single MUFU.EX2.
  • lg2.approx.f32: Single MUFU.LG2.

There are no FP64 versions of MUFU.SIN/COS/EX2/LG2. FP64 transcendentals are always implemented by linking against libdevice (the CUDA math library), which provides polynomial approximation sequences compiled from C source code. These are not handled by ptxas's internal templates but by the libdevice bitcode linked during cicc compilation, upstream of ptxas.

Template Instantiation Infrastructure

Emission Primitives

The sub-expanders construct SASS instructions using a family of emission functions:

| Function | Size | Signature | Role |
|---|---|---|---|
| sub_9314F0 | 403 bytes | (scratch, ctx, opcode, type, operand_count, operands, xmm, fp) | Standard SASS instruction emission (2--5 operands) |
| sub_934630 | 1,213 bytes | (scratch, ctx, opcode, type, ?, ?, xmm, fp, operand_buf, count) | Extended emission for control flow and >4 operands |
| sub_935130 | 390 bytes | (scratch, ctx, opcode, count, label_buf, label_count, ...) | Branch emission with label resolution |
| sub_9352C0 | (variant) | (scratch, ctx, opcode, type, operands, count, ..., extra_buf, ...) | Wide emission with extra operand buffer (used for MUFU) |
| sub_92E800 | 70 bytes | (scratch, ctx, opcode, type, reg_id, src_operand, xmm, fp) | Simplified emission for single-source FP ops |
| sub_92E720 | 51 bytes | (scratch, ctx, opcode, type, dest_pair, src_operand, xmm, fp) | Simplified emission wrapper for register pairs |
| sub_92E1B0 | (variant) | (scratch, ctx, opcode, pred_reg, xmm, fp) | Predicated barrier/convergence emission |

Operand Encoding

Each operand in the emission buffer is a 32-bit tagged value:

| Tag (bits 31:24) | Meaning |
|---|---|
| 0x90 | Destination register (bits 23:0 = register ID) |
| 0x10 | Source register |
| 0x20 | Immediate constant (from constant pool via sub_91D160) |
| 0x40 | External constant reference |
| 0x60 | Template call target / sentinel (used for BRA-to-template) |
| 0x80 | Negate modifier (OR'd onto source tag: 0x90 = negated source) |

The 64-bit modifier word 0x6000000500000000 appearing in many emission calls encodes instruction-level flags such as .reuse hints and type specifiers.
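A minimal encoder/decoder for the tagged operand words (the macro names are mine; the bit layout is as documented above):

```c
#include <stdint.h>

/* Operand word layout: tag in bits 31:24, payload (register ID,
   constant-pool index, ...) in bits 23:0. */
#define OP_TAG(w)      ((uint32_t)(w) >> 24)
#define OP_PAYLOAD(w)  ((uint32_t)(w) & 0x00FFFFFFu)
#define OP_MAKE(tag, payload) \
    (((uint32_t)(tag) << 24) | ((uint32_t)(payload) & 0x00FFFFFFu))
```

Note that OR'ing the 0x80 negate modifier onto a source tag (0x10) yields 0x90 in the tag byte, so the resulting word must be disambiguated from a destination operand by its position in the emission buffer.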

Virtual Register Allocation

Each coordinator allocates its full set of virtual registers in a single loop before any instructions are emitted. The sub_91BF30 allocator creates 160-byte register descriptors and returns a 24-bit register ID. Each register is marked with flags |= 0x40 (bit 6) to indicate it is owned by a template rather than the main register allocation pass. This prevents the register allocator from coalescing or splitting template-internal registers.

The static descriptor tables encode register types as the first element of each (type, category) pair:

| Template | Table address | Register count | Data types |
|---|---|---|---|
| DDIV | dword_23993E0 | 298 | FP64 pairs, FP32, integer, predicate |
| DRCP/DSQRT | dword_2398940 | 289 | FP64 pairs, FP32, integer, predicate |
| DRSQRT | dword_2398000 | 247 | FP64 pairs, FP32, integer, predicate |
| Integer div | dword_23976E0 | 59 | Integer (32/64-bit), predicate |

The register counts explain why these templates dominate register pressure in FP64-heavy kernels: 298 virtual registers for a single DDIV expansion is enormous by GPU standards, where the entire physical register file is 65,536 32-bit registers shared across all active warps.

Template Call vs. Inline

The use_template_call flag at template_state + 8 selects between two emission strategies:

Template-call path (flag set):

  • The coordinator builds three named code sections (__ori_template_DDIV1/2/3)
  • Each call site emits a BRA (opcode 168 via sub_934630) to the template code
  • The template code is shared across all call sites in the same function
  • Convergence barriers (opcode 0x5D via sub_92E1B0) ensure correct re-convergence
  • A CALL-like instruction (opcode 164) handles the return path

Inline path (flag clear):

  • The sub-expander instructions are emitted directly at the call site
  • Each call site gets its own copy of the full instruction sequence
  • Uses direct IADD3 (opcode 0x110) for control flow instead of BRA
  • No named code sections, no convergence barriers
  • A JMP/BRA (opcode 0x20 or 0x5F) replaces the template return

The template-call path is preferred for functions with multiple DDIV/DRCP/DSQRT operations because it avoids duplicating the large instruction sequence. The inline path is used when the function has only one such operation, or when the register allocator determines that the overhead of the template call mechanism (saving/restoring registers across the call boundary) exceeds the code-size benefit.

SM-Specific Variants

Register File Size Dispatch

The inline DDIV handler (sub_1704070) reads the register file capacity from *(*(context + 1584) + 372) and selects between three implementation tiers:

int reg_file_size = *(*(ctx + 1584) + 372);  // physical register count
if (reg_file_size > 20479)
    DDIV_FullUnroll(a1, a2, ...);     // sub_1702990: max ILP
else if (reg_file_size > 16383)
    DDIV_PartialSpill(a1, a2, ...);   // sub_1701F10: balanced
else
    DDIV_MinimalRegs(a1, a2, ...);    // sub_1701860: min pressure

The thresholds (20,479 and 16,383) correspond to register file sizes across GPU generations:

  • sm_50--sm_61 (Maxwell/Pascal): 65,536 registers per SM -> 20,479 threshold met at occupancy < 3 blocks
  • sm_70--sm_89 (Volta through Ada): 65,536 registers -> same thresholds
  • sm_100+ (Blackwell): 65,536 registers -> same, but wider warp execution changes the pressure calculus

Hardware Capability Flag

The DRSQRT handler checks *(*(context + 1584) + 1037) & 1 to select between coordinator A (full Newton-Raphson, 247 registers) and coordinator B (reduced sequence, 59 registers). This flag likely indicates the presence of MUFU.RCP64H / MUFU.RSQ64H on sm_80+ architectures, which provide higher-precision seeds (~28 bits vs. ~23 bits) and thus require fewer refinement iterations.

SASS Opcode Reference

Internal opcode IDs used by the math templates, mapped to SASS mnemonics:

| Internal ID | SASS Mnemonic | Description |
|---|---|---|
| 0x02 | IADD | Integer add (3-operand) |
| 0x0A | MOV | FP register move (typed) |
| 0x19 | SHR | Shift right (exponent extraction) |
| 0x20 | BRA/JMP | Unconditional branch (inline return path) |
| 0x24 | SHL/LEA | Shift left / load effective address |
| 0x3C | MUFU | Multi-function unit (RCP, RSQ, SIN, COS, EX2, LG2) |
| 0x5D | BSYNC | Barrier synchronization / convergence barrier |
| 0x5F | BRA | Conditional branch |
| 0x6E | IMAD | Integer multiply-add |
| 0x6F | IMAD.WIDE | Wide integer multiply-add (64-bit result) |
| 0x82 | MOV | General register move |
| 0x85 | MOV.LO | Move low 32 bits of FP64 pair |
| 0x86 | MOV.HI | Move high 32 bits of FP64 pair |
| 0x8B | DFMA/DMUL | FP64 fused multiply-add or multiply |
| 0x93 | POPC | Population count / bitwise logic |
| 0x97 | FSETP | FP32 set-predicate (comparison) |
| 0xA8 | SEL | Conditional select |
| 0xC9 | ISETP | Integer set-predicate (comparison) |
| 0xD5 | I2F | Integer-to-float conversion |
| 0xD6 | F2I | Float-to-integer conversion |
| 0x110 | IADD3 | Three-operand integer add |
| 0x122 | DFMA | FP64 fused multiply-add (Newton-Raphson core) |

Function Map

| Address | Size | Function | Role |
|---|---|---|---|
| sub_AED3C0 | 28 KB | Master lowering dispatcher | Vtable-dispatched, calls all template handlers |
| sub_170E8B0 | 1,166 bytes | DDIV top-level handler | Lazy init, operand legalization, template-call/inline dispatch |
| sub_170E260 | 1,615 bytes | DDIV coordinator | 298 vregs, 6 sub-expanders, names __ori_template_DDIV1/2/3 |
| sub_1704180 | ~5 KB | DDIV sub-expander 1 | Initial reciprocal approximation |
| sub_1705820 | 7,545 bytes | DDIV sub-expander 2 | Newton-Raphson core (MUFU.RCP + refinement) |
| sub_17075A0 | ~6 KB | DDIV sub-expander 3 | Second refinement iteration |
| sub_1709130 | ~6 KB | DDIV sub-expander 4 | Final multiply (a * 1/b) |
| sub_170AE80 | ~6 KB | DDIV sub-expander 5 | Rounding and normalization |
| sub_170CBD0 | ~5 KB | DDIV sub-expander 6 | Exception handling |
| sub_1718D60 | 790 bytes | DRCP/DSQRT top-level handler | Lazy init, shares structure with DDIV |
| sub_1718790 | 1,487 bytes | DRCP/DSQRT coordinator | 289 vregs, 7 sub-expanders |
| sub_170ED40 | ~5 KB | DRCP/DSQRT sub-expander 1 | MUFU.RCP seed extraction |
| sub_1710280 | ~5 KB | DRCP/DSQRT sub-expander 2 | Newton-Raphson iteration 1 |
| sub_17120F0 | ~6 KB | DRCP/DSQRT sub-expander 3 | Newton-Raphson iteration 2 |
| sub_17139D0 | ~6 KB | DRCP/DSQRT sub-expander 4 | Rounding/normalization |
| sub_1715910 | ~6 KB | DRCP/DSQRT sub-expander 5 | DSQRT path (MUFU.RSQ) |
| sub_1717470 | ~5 KB | DRCP/DSQRT sub-expander 6 | Final multiply, exceptions |
| sub_17276C0 | 1,011 bytes | DRSQRT top-level handler | HW capability dispatch |
| sub_1720D60 | 1,423 bytes | DRSQRT coordinator A | 247 vregs, 5 sub-expanders (full N-R) |
| sub_1719080 | ~5 KB | DRSQRT sub-expander 1 | MUFU.RSQ seed |
| sub_171A260 | ~6 KB | DRSQRT sub-expander 2 | Newton-Raphson iteration 1 |
| sub_171BB80 | ~6 KB | DRSQRT sub-expander 3 | Newton-Raphson iteration 2 |
| sub_171D3A0 | ~6 KB | DRSQRT sub-expander 4 | Normalization |
| sub_171EFD0 | ~5 KB | DRSQRT sub-expander 5 | Exception handling |
| sub_1727130 | 1,423 bytes | Integer div coordinator (B) | 59 vregs, dispatches to div templates |
| sub_1724A20 | 28,138 bytes | 32-bit integer div/mod | Newton-Raphson via MUFU.RCP + IMAD correction |
| sub_1728930 | 16,545 bytes | 64-bit unsigned div/mod | Double-width Newton-Raphson |
| sub_1727AC0 | 13,776 bytes | 64-bit signed div/mod | Signed wrapper around unsigned |
| sub_1729B50 | ~2 KB | 64-bit div dispatcher | Selects signed vs. unsigned handler |
| sub_1704070 | 263 bytes | Inline DDIV dispatcher | Register-pressure based 3-tier selection |
| sub_1702990 | 5,846 bytes | Inline DDIV (full unroll) | >20K register variant |
| sub_1701F10 | ~4 KB | Inline DDIV (partial spill) | >16K register variant |
| sub_1701860 | ~3 KB | Inline DDIV (minimal regs) | <=16K register variant |
| sub_1701140 | 8,690 bytes | Template scaffolding helper | Code object construction, called by all coordinators |
| sub_172A090 | 3,095 bytes | Conditional move emission | Scheduling barrier fixup |

Cross-References

SASS Text Generation

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Phases 129 (DumpNVuCodeText) and 130 (DumpNVuCodeHex) convert the internal instruction stream into human-readable SASS assembly text and raw hex dumps respectively. The text output is the same format produced by cuobjdump --dump-sass and is used for --verbose output, DUMPIR diagnostics, --forcetext mode, --out-sass dumps, and the --self-check roundtrip verification pipeline. The subsystem spans two distinct address ranges: a PTX-level text generation system (580 formatter functions at 0x4DA340--0x5A8E40) and a SASS-level disassembly renderer (~123 virtual printer methods at 0x17F8000--0x181FFFF).

  • Pipeline phases: 129 (DumpNVuCodeText), 130 (DumpNVuCodeHex)
  • Phase category: Debug (conditionally executed)
  • PTX formatter count: 580 functions at 0x4DA340--0x5A8E40 (~850 KB)
  • PTX dispatcher: sub_5D4190 (12.9 KB, two-level opcode dispatch)
  • SASS printer count: ~123 vtable methods at 0x17F8000--0x181FFFF
  • Builder/visitor vtable: ~520 method slots (4,160+ byte vtable)
  • Format string table: ~1.8 MB monolithic NUL-terminated string block
  • Temp buffer size: 50,000 bytes per formatter invocation
  • Largest formatter: sub_5A8E40 (wmma.load.b, 9,757 bytes)
  • Key helpers: sub_9D12F0 (operand encoder), sub_9DB7E0 (predicate printer)

Output Format

SASS text generation produces output compatible with cuobjdump --dump-sass. The format includes control information (scheduling metadata), predicate guards, opcode mnemonics, operands with modifiers, and optional annotations.

Instruction Line Format

/*ADDR*/  {CTRL} OPCODE{.MODIFIERS}  DST, SRC0{, SRC1{, SRC2}} ;  /* LINE */

Concrete examples of the format ptxas produces:

/*0000*/                   MOV R1, c[0x0][0x28] ;                /* 0x00000a0004017802 */
/*0010*/                   S2R R0, SR_CTAID.X ;                  /* 0x0000000000007919 */
/*0020*/              @P0  IMAD.MOV.U32 R4, RZ, RZ, c[0x0][0x168] ;
/*0030*/                   IMAD.MOV.U32 R5, RZ, RZ, c[0x0][0x16c] ;
/*0040*/                   ISETP.GE.AND P0, PT, R0, R2, PT ;
/*0050*/              @P0  EXIT ;
/*0060*/                   STG.E [R4.64], R0 ;
/*0070*/                   EXIT ;
/*0080*/                   BRA 0x80 ;

Control Word Format

For architectures with explicit scheduling control (SM 50--SM 70), the control word is printed in a dedicated line before each group of three instructions:

      /* 0x001c4400fe2007f6 */
/*0008*/                   MOV R1, c[0x0][0x20] ;
/*0010*/                   S2R R0, SR_TID.X ;
/*0018*/                   S2R R2, SR_CTAID.X ;

The 64-bit control word encodes scheduling data for three instructions:

| Field | Bits | Description |
|---|---|---|
| Stall count | 4 bits per instruction | Minimum cycles to wait before issue (0--15) |
| Yield hint | 1 bit per instruction | Suggest warp scheduler switch |
| Write barrier | 3 bits per instruction | Dependency barrier index (0--5, 7 = none) |
| Read barrier | 3 bits per instruction | Read dependency barrier mask |
| Wait barrier mask | 6 bits per instruction | Which barriers to wait on before issue |

For SM 75+ architectures (Turing and later), scheduling information is embedded per-instruction rather than in grouped control words, so the text output places it differently or omits the separate control line.

Hex Dump Format (Phase 130)

Phase 130 (DumpNVuCodeHex) emits the raw encoding bytes as hex values:

/*0000*/  0x00000a0004017802
/*0008*/  0x0000000000007919
/*0010*/  0x000000ff0aff7824

Each line contains the instruction address and its encoded QWORD(s). For 128-bit instructions, two QWORDs are printed.

Architecture

The text generation subsystem has two levels: a PTX-level pretty-printer that formats instructions from the Ori IR representation, and a SASS-level disassembly renderer that decodes binary-encoded SASS instructions back to text.

Level 1: PTX Instruction Text Formatters

This is the primary text generation system. The 580 formatter functions convert internal instruction representations (accessed via the instruction object at *(a1+1096)) into PTX assembly text strings.

sub_5D4190 (12.9 KB, dispatcher)
  ├─ First: calls sub_5D1660 to initialize intrinsic ID table (608 entries)
  ├─ Registers 121 named opcodes at a1+808 via sub_426150()
  ├─ Registers ~400 hash-keyed opcodes at a1+816 via sub_426150()
  └─ Dispatches to one of 580 formatters at 0x4DA340-0x5A8E40
       └─ Each: alloc 50 KB → sprintf via format table → shrink-copy → free

The dispatcher uses a two-level dispatch strategy:

  1. Named dispatch (121 opcodes): Direct string-to-function registration for recent or complex instructions. The opcode name string (e.g., "wmma.load.a", "tcgen05.mma", "barrier.cta") is looked up in a hash map at a1+808.

  2. Hash dispatch (~400 opcodes): Numeric hash values of opcode names are used as keys in a second hash map at a1+816. The hash values are stored as decimal string representations (e.g., "2644314910", "605425506"). This covers the stable ISA core -- arithmetic, logic, loads, stores, branches, conversions.

Level 2: SASS Disassembly Renderer

The SASS printer at 0x17F8000--0x181FFFF operates on binary-encoded SASS instructions and produces text through a builder/visitor pattern. This is used for the --self-check roundtrip verification and --out-sass output.

SASS instruction (binary-encoded)
  │
  ├─ Read opcode at instruction+72, mask BYTE1 &= 0xCF
  ├─ Switch on canonical opcode ID
  │
  ├─ For each operand:
  │    └─ sub_9D12F0(output_128, ctx, instr, operand_idx, stride, mode, flag)
  │         → 64-byte operand encoding structure
  │
  ├─ Emit via builder/visitor vtable at *(a1 + 24):
  │    ├─ vtable+936:  begin_predicate_guard()
  │    ├─ vtable+3768: begin_operands()
  │    ├─ vtable+16:   emit_operand(kind_id, ...)
  │    ├─ vtable+272:  emit_integer(value)
  │    ├─ vtable+1760: set_rounding_mode(mode)
  │    ├─ vtable+3952: emit_saturation_flag()
  │    ├─ vtable+3960: emit_ftz_flag()
  │    ├─ vtable+3968: emit_negate_flag()
  │    ├─ vtable+4072: emit_cache_operation()
  │    ├─ vtable+4080: emit_eviction_hint()
  │    ├─ vtable+944:  end_predicate_guard()
  │    └─ vtable+4160: end_statement()
  │
  └─ Predicate guard: sub_9DB7E0 (662 bytes, 19 callers)

The builder/visitor vtable has approximately 520 method slots (vtable spans 4,160+ bytes), making it one of the largest virtual dispatch interfaces in the binary. Different concrete visitor implementations produce different output formats (text, hex, self-check comparison).

Formatter Template

Every PTX formatter function is mechanically generated from instruction definition tables. All 580 follow an identical structure:

char* format_OPCODE(int64_t a1, int64_t a2) {
    // a1 = instruction context (instruction data at a1+1096)
    // a2 = format string table base pointer (~1.8 MB)

    // Phase 1: Allocate temp buffer
    int64_t pool = ((int64_t*)sub_4280C0(a1, a2))[3];   // arena_get_pool
    char* buf = (char*)sub_424070(pool, 50000);           // pool_alloc(50KB)
    if (!buf) sub_42BDB0(pool, 50000, ...);               // alloc_fail_abort

    // Phase 2: Build instruction text via sprintf chain
    int pos = sprintf(buf, "%s", (char*)(a2 + OFFSET_A)); // opcode prefix
    if (sub_70B6E0(*(a1+1096)))                           // has_predicate?
        pos += sprintf(buf+pos, fmt, sub_70B780(*(a1+1096))); // predicate name
    pos += sprintf(buf+pos, "%s", (char*)(a2 + OFFSET_B)); // operand template
    // ... more operands via sub_70B8E0, sub_70B910, sub_70B920 ...
    strcpy(buf+pos, (char*)(a2 + OFFSET_N));              // trailing text

    // Phase 3: Shrink-copy to exact size
    size_t len = strlen(buf);
    int64_t pool2 = ((int64_t*)sub_4280C0(buf, ...))[3];
    char* result = (char*)sub_424070(pool2, len + 1);
    strcpy(result, buf);

    // Phase 4: Free temp buffer
    sub_4248B0(buf);                                       // pool_free
    return result;
}

The format string table (a2) is a single monolithic ~1.8 MB block of NUL-terminated strings containing pre-assembled text templates with %s, %llu, %d placeholders. Different formatters access it at different offsets:

| Formatter | Offset into a2 | Approximate position |
|---|---|---|
| wgmma.mma_async | 1,731,609 | ~1.7 MB |
| wmma.mma | 1,731,130 | ~1.7 MB |
| rsqrt | 67,573 | ~67 KB |
| copysign | 110,152 | ~110 KB |
| vavrg4 | 286,309 | ~286 KB |
| guardrails.alloc | ~1,843,620 | ~1.8 MB |

This design trades memory for speed: instead of building instruction text dynamically, ptxas stores the complete format template and fills in operand names at runtime.

Instruction Operand Accessors

All formatters query the instruction object through a uniform set of tiny accessor functions:

| Address | Size | Callers | Identity |
|---|---|---|---|
| sub_70B700 | 14 B | 946 | has_predicate() |
| sub_70B6E0 | 14 B | 42 | has_predicate_v2() |
| sub_70B710 | 111 B | 348 | get_opcode_string() |
| sub_70B780 | 151 B | 514 | get_predicate_name() |
| sub_70B8E0 | 12 B | 1,449 | get_reg_operand(idx) |
| sub_70B910 | 12 B | 1,656 | get_src_part0(idx) |
| sub_70B920 | 12 B | 1,296 | get_src_part1(idx) |
| sub_70B930 | 7 B | 68 | get_operand_count() |
| sub_70B4C0 | 22 B | 46 | get_base_address() |
| sub_70CA60 | 11 B | 480 | get_operand_type(idx) |
| sub_70CA70 | 427 B | 191 | get_type_suffix() |
| sub_70CD20 | 122 B | 158 | get_operand_offset(idx) |
| sub_710860 | 39 B | 2,953 | get_data_type(idx, part) |
| sub_70FA00 | 10 B | 286 | get_target_sm(idx) |
| sub_70FA10 | 66 B | 7 | check_target_sm(idx, str) |
| sub_709910 | 14 B | 13 | get_variant_count() |
| sub_709A10 | 73 B | 46 | get_variant_string() |
| sub_707CE0 | 22 B | 93 | get_address_operand(idx) |
| sub_709760 | 127 B | 21 | get_comparison_op() |
| sub_709FE0 | 11 B | 17 | get_rounding_mode() |
| sub_70A500 | 13 B | 15 | get_saturation_mode() |
| sub_70B3F0 | -- | -- | get_ftz_flag() |
| sub_707530 | -- | -- | get_precision_string() |
| sub_707C80 | -- | -- | get_scope_string() |
| sub_7075E0 | -- | -- | get_layout_string() |
| sub_707BE0 | -- | -- | get_shape_string() |
| sub_70A810 | -- | -- | get_scale_string() |

All accessors read from the instruction object at *(a1+1096). The tiny sizes (7--151 bytes for most) indicate these are simple field extractions from the instruction record.

Memory Allocation

The formatter memory lifecycle uses a pool allocator:

| Function | Size | Callers | Identity |
|---|---|---|---|
| sub_4280C0 | 597 B | 3,928 | arena_get_pool(ctx, table) |
| sub_424070 | 2,098 B | 3,809 | pool_alloc(pool, size) |
| sub_42BDB0 | 14 B | 3,825 | alloc_fail_abort() |
| sub_4248B0 | 923 B | 1,215 | pool_free(ptr) |

Every formatter allocates a 50,000-byte temporary buffer, builds the instruction string via sprintf chains, measures the result with strlen, allocates an exact-size copy, and frees the temporary. The 50 KB buffer provides headroom for the largest instructions (WMMA loads produce multi-KB strings) but is wasteful for simple 2-operand instructions that generate ~50-byte strings.

Predicate Guard Printing

Predicate guards (@P0, @!P1, etc.) are printed by checking has_predicate() on the instruction, then formatting the guard via get_predicate_name():

// PTX-level predicate printing (in every formatter)
int pos = sprintf(buf, "%s", opcode_prefix);
if (sub_70B6E0(*(a1+1096))) {                     // has_predicate?
    int64_t pred = sub_70B780(*(a1+1096));         // get_predicate_name
    pos += sprintf(buf+pos, guard_fmt, pred);      // e.g., "@P0 " or "@!P1 "
}

// SASS-level predicate printing (in disassembly renderer)
// sub_9DB7E0 (662 bytes, 19 callers) — emits guard through builder vtable
//   calls builder->begin_predicate_guard() at vtable+936
//   emits predicate register name
//   calls builder->end_predicate_guard() at vtable+944

Register and Operand Formatting

Register operands are resolved from the instruction's operand array. The formatter accesses operands by index through get_reg_operand(idx), get_src_part0(idx), and get_src_part1(idx). The standard register naming follows NVIDIA conventions:

| Register class | Naming | Examples |
|---|---|---|
| General-purpose | R0--R255 | R0, R4, R255 |
| Zero register | RZ | RZ |
| Predicate | P0--P6, PT | @P0, PT |
| Uniform | UR0--UR63 | UR4, UR16 |
| Uniform predicate | UP0--UP6, UPT | UP0 |
| Constant buffer | c[bank][offset] | c[0x0][0x168] |
| Special | SR_* | SR_CTAID.X, SR_TID.X |

For the SASS disassembly renderer, the register class discriminator sub_91C840 (347 bytes, 232 callers) maps internal type codes 1--0x17 to output class IDs 0--18, covering integer registers, float registers, double registers, predicate registers, condition registers, texture/surface references, and uniform registers.

The operand encoder sub_9D12F0 (1,423 bytes, 289 callers) is the core serializer for SASS-level printing. It takes an instruction and operand index, resolves whether the operand is a register, immediate, or memory reference, handles constant buffer lookups, and fills a 64-byte (4x __m128i) encoding structure that the builder/visitor consumes.

Address and Offset Formatting

Memory operands are formatted with address space qualifiers and offset expressions:

[R4.64]              — register indirect, 64-bit
[R4+0x10]            — register + offset
c[0x0][0x168]        — constant buffer bank 0, offset 0x168
[UR4]                — uniform register indirect

The address space qualifier resolver sub_9CEB50 (185 bytes, 57 callers) combines address space information from the operand descriptor with the instruction context. For SASS-level output, the address space emitter sub_9E7B00 and related functions (sub_9E9910, sub_9E9A70) handle data type and memory space qualifiers.

Architecture-Conditional Formatting

86 of the 580 formatters contain architecture-conditional paths that check the target SM version via sub_70FA00 (numeric comparison) or sub_70FA10 (string comparison). Architecture-specific formatting reflects hardware evolution:

| SM | Era | Formatting impact |
|---|---|---|
| sm_20, sm_21 | Fermi (2010) | copysign has different operand layout (7 vs 5 fields) |
| sm_62 | Pascal mobile (2016) | vavrg4 gets per-component register formatting |
| sm_103 | Blackwell Ultra (2025) | rsqrt gains new operand layout for extended precision |

Five formatters additionally use string-based SM comparison via sub_70FA10:

  • sub_4DD860 (copysign): checks "sm_20", "sm_21"
  • sub_56BA60 (vavrg4): checks "sm_62"
  • sub_56C8D0 (dp2a.lo): SM string comparison
  • sub_577BA0 (dp2a.hi): SM string comparison
  • sub_583190 (rsqrt): checks "sm_103"

SASS Disassembly Renderer

The SASS-level renderer at 0x17F8000--0x181FFFF (~160 KB, ~123 virtual entry points) converts binary-encoded SASS instructions into textual SASS assembly. Unlike the PTX formatters (Level 1) which work from the high-level Ori IR via sprintf chains, the SASS renderer decodes the binary instruction encoding and drives a builder/visitor object through a structured sequence of emit_* calls. The builder's concrete implementation determines the output format -- text for --out-sass, comparison data for --self-check, or binary encoding verification.

Internal Layers

The subsystem splits into five layers by address range and function role:

| Layer | Range | Count | Role |
|---|---|---|---|
| A: Encoding templates | 0x17F8000--0x180FFFF | ~75 | Build per-opcode operand layout descriptors |
| B: Accessor vtable methods | 0x1810700--0x1810BFF | ~15 | ISA version/class discriminator predicates |
| C: Format-class printers | 0x1810D20--0x18167FF | ~50 | Workhorses: decode operands + emit through builder |
| D: Complex multi-format printers | 0x1817000--0x181CFFF | ~15 | Texture, multi-operand, predicated printers |
| E: Post-processing hooks | 0x181E000--0x181FFFF | ~8 | ISA-override detection, fixup dispatch |

All ~123 entry points have zero static callers, confirming they are virtual method overrides dispatched through vtables. The printer dispatch layer at 0xAA8000--0xACA000 (sub_AA9330, sub_AA9860, sub_AAB9C0, sub_AC99D0) invokes them.

Rendering Protocol

Every SASS printer receives (a1, a2) where a1 is the printer context (builder pointer at a1+24) and a2 is the binary-encoded instruction. The rendering follows a fixed protocol:

1. vtable[0](builder, instruction_kind_id)     // begin_instruction
2. vtable[3760](builder, sync_mode)            // set_sync_type (if applicable)
3. vtable[3768](builder)                       // begin_operand_list
4. sub_9DB7E0(a1, a2, 1)                       // emit predicate guard (@Px)
5. For each operand:
   a. sub_9D12F0(&buf, a1, a2, idx, stride, mode, flag)  // encode -> 64B struct
   b. vtable[16](builder, kind_id, buf...)                // emit_operand
6. vtable[3952](builder)                       // emit_saturation (.SAT)
7. vtable[3960](builder)                       // emit_ftz (.FTZ)
8. vtable[3968](builder)                       // emit_negate (.NEG)
9. vtable[4072](builder)                       // emit_cache_operation
10. vtable[4080](builder)                      // emit_eviction_hint
11. vtable[4160](builder)                      // end_instruction

The protocol is directly visible in decompiled code. In sub_1812F60 (16-DWORD immediate printer), the function begins with vtable[0](builder, 89) (begin instruction kind 89), calls vtable[3760] for sync type, vtable[3768] for begin operand list, sub_9DB7E0 for predicate guard, then loops 16 times calling vtable[272] (create integer operand) followed by vtable[16] (emit operand) with kind IDs 55 through 70 -- one per DWORD.

In sub_1810D20 (comparison-mode printer), the function first reads the modifier word from the operand array at instruction+84, switches on (modifier >> 4) & 0xF, calls vtable[3528]/vtable[3536] to configure comparison mode and variant, then emits 2--3 operands via the standard sub_9D12F0 + vtable[16] sequence.

Builder/Visitor Vtable

The builder object at *(a1 + 24) exposes a vtable spanning 4,160+ bytes (~520 method slots at 8 bytes each). The complete set of identified methods:

| Offset | Method | Category |
|---|---|---|
| +0 | begin_instruction(kind_id) | Framing |
| +8 | get_current_instruction() | Accessor |
| +16 | emit_operand(kind_id, operand_buf...) | Core emission |
| +24 | post_process_operand() | After-emit hook |
| +112 | get_register_size_32() | Register geometry |
| +120 | get_register_size_64() | Register geometry |
| +128 | create_register_operand() | Operand factory |
| +152 | create_memory_operand() | Operand factory |
| +192 | create_special_operand() | Operand factory |
| +208 | create_literal_operand() | Operand factory |
| +272 | create_integer_operand(value) | Operand factory |
| +304 | create_register_ref_operand() | Operand factory |
| +368 | set_address_space() | Memory qualifier |
| +936 | begin_predicate_guard() | Predicate block |
| +944 | end_predicate_guard() | Predicate block |
| +984 | set_predicate_mode() | Predicate negate/true |
| +1000 | emit_modifier() | Generic modifier |
| +1056 | set_offset_mode() | Address offset |
| +1128 | emit_width_qualifier() | .B32, .B64 |
| +1392 | set_comparison_flag() | Comparison type |
| +1760 | set_rounding_mode() | .RN, .RZ, .RM, .RP |
| +1936 | begin_sync_block() | Sync scope |
| +1944 | end_sync_block() | Sync scope |
| +2016 | set_sync_width() | Sync width |
| +2024 | set_sync_depth() | Sync depth |
| +2584 | set_uniform_flag() | .U modifier |
| +2960 | set_address_mode() | Address mode |
| +2992 | set_cache_level_a() | Cache hint (L1) |
| +3000 | set_cache_level_b() | Cache hint (L2) |
| +3096 | set_comparison_type() | Second comparison slot |
| +3128 | set_source_type_a() | Source type |
| +3136 | set_source_type_b() | Source type |
| +3144 | set_interlock_mode() | Memory ordering |
| +3152 | begin_comparison_block() | Comparison section |
| +3160 | set_comparison_width() | Comparison width |
| +3520 | set_data_width() | Operand width |
| +3528 | set_comparison_mode() | Comparison config |
| +3536 | set_comparison_variant() | Comparison variant |
| +3560 | set_conversion_type() | Conversion modifier |
| +3576 | begin_conversion() | Conversion block |
| +3760 | set_sync_type() | Synchronization type |
| +3768 | begin_operand_list() | Operand section |
| +3776 | emit_rounding_decoration() | Rounding modifier |
| +3824 | emit_texture_header() | Texture header index |
| +3952 | emit_saturation_flag() | .SAT |
| +3960 | emit_ftz_flag() | .FTZ |
| +3968 | emit_negate_flag() | .NEG |
| +4072 | emit_cache_operation() | Cache operation hint |
| +4080 | emit_eviction_hint() | Eviction priority |
| +4160 | end_instruction() | Framing |

Different concrete visitor implementations produce different output formats. The vtable design means adding a new output format (e.g., JSON, binary verification) requires only a new visitor class with no changes to any of the ~123 printer functions.

Encoding Template Builders (Layer A)

~75 functions at 0x17F8000--0x180FFFF build per-opcode instruction format descriptors that define the expected operand signature. Each function:

  1. Sets the SASS opcode ID: *(a2+12) = opcode_number
  2. Loads a 128-bit format descriptor: *(a1+8) = xmmword_23Fxxxx (from rodata)
  3. Fills up to 10 operand slots at a1+24..a1+120 with type codes, register class IDs, and modifier flags
  4. Writes expected-value constraints at a1+64..a1+160 (-1 = any)
  5. Writes type constraint modifiers at a1+104..a1+200

From sub_17F8210 (opcode 274):

*(a2+12) = 274;                                          // SASS opcode ID
*(a1+8)  = _mm_loadu_si128(&xmmword_23F21B0);           // 128-bit descriptor
*(a1+24) = 10;   // operand 0: predicate register type
*(a1+64) = -1;   // operand 0: any value accepted
*(a1+104) = 0;   // operand 0: no modifier constraint
*(a1+28) = 17;   // operand 1: specific register class
*(a1+68) = -1;   // operand 1: any value
*(a1+108) = 3;   // operand 1: modifier constraint 3
// ... remaining operands bulk-copied via SSE from xmmword_23F1C60 table

The 128-bit descriptors at xmmword_23F1xxx--23F2xxx encode canonical operand layouts. The bulk SSE copies (_mm_load_si128/_mm_loadu_si128) fill 4 operand slots per iteration, making the template builders compact despite handling up to 10 operand positions.

Format-Class Printers (Layer C)

The instruction's format class at instruction+76 determines which printer handles it. The dispatch computes index = format_class - 11, then looks up dword_23B39E0[index] for the encoding strategy:

| Strategy | Value | Description |
|---|---|---|
| Default | 0 | Standard register fields |
| Wide | 1 | 9-bit register fields, 8 sequential operands |
| Pair | 2 | 2x register fields per operand |
| Extended | 3 | Extra modifier bits |
| Special | 4+ | Texture header, 16-DWORD immediate |

Printer functions for each format class:

| Function | Size | Format class | Evidence |
|---|---|---|---|
| sub_1810D20 | 8.8 KB | Comparison-mode | Switches on (modifier >> 4) & 0xF: case 4 emits comparison with two variants, case 6 emits single-variant. Calls vtable[3528]/vtable[3536] for comparison config |
| sub_18111F0 | 11.6 KB | Wide-operand | 8 sequential sub_9D12F0 calls with indices 0--7 |
| sub_1811E20 | 11.6 KB | Wide + special | Both sub_9D12F0 and sub_9CF740 calls |
| sub_1812890 | 10.5 KB | Register + constant | sub_9CF8A0 for constant folding |
| sub_1812F60 | 15.3 KB | 16-DWORD immediate | sub_7E4CF0 iterator, 16x vtable[272] + vtable[16] with kind IDs 55--70 |
| sub_18141C0 | 6.5 KB | Per-operand comparison | Dispatch entry from sub_1820000 |
| sub_1814660 | 7.1 KB | Load/store | sub_C49400 + sub_9CEB50 for address space |
| sub_1814B10 | 17.6 KB | Load/store + predicated | sub_C49400, sub_91E860, sub_91C840, sub_9CEB50 |
| sub_1815810 | 12.7 KB | Wide variant | Similar to sub_1811E20 |
| sub_1816000 | 13.1 KB | Data-type qualified | sub_9E9910 for data type emission |
| sub_18167F0 | 11.8 KB | Memory-access | sub_9E7B00 for address space qualifier, sub_A3B930 for operand modifier |
| sub_1816FC0 | 6.4 KB | Modifier-heavy | Checks bits 6, 7, 14 of operand word for negate/absolute modifiers |

Texture/Surface Printer (Layer D)

The texture/surface printer sub_18189C0 is the largest at 45.2 KB. It handles the complete texture and surface instruction families:

sub_18189C0 (45.2 KB, 1361 lines)
  ├─ Read opcode at +72, mask to canonical form
  ├─ Giant switch on opcodes:
  │    18 (FADD/FMUL?), 119 (MUFU?), 186 (TEX),
  │    211 (TLD), 283 (SULD), 315 (SUST)
  ├─ Check operand modifier bits for predication/negation
  ├─ dword_23B39E0[format_class-11] → subtype (values 0-4)
  ├─ word_23B3A58[subtype] → builder kind ID
  ├─ Emit predicate: vtable[89], begin_operand_list
  ├─ sub_9DB7E0 → predicate guard
  ├─ For each operand: sub_9D12F0 → builder->emit_operand
  ├─ If format > 9: sub_1817C50 (12.8KB) → texture header index
  │    └─ Linearizes from 2D bit fields: bits[0:8] x bits[9:17] → index 0-52
  ├─ Emit cache/eviction: vtable[4080], vtable[4072]
  ├─ Emit saturation/ftz: vtable[3952], vtable[3960], vtable[3968]
  └─ Return 1 on success

The multi-operand printer sub_181B370 (27.8 KB) handles instructions with many operand variants (VOTE at opcode 0x7A, multi-op at 0x138), emitting up to 12 sequential operands through sub_9CEF90 (extended operand encoder) and sub_9CF740 (immediate encoder).

ISA-Override Detection (Layer E)

sub_181E1D0 (7.3 KB) is a post-processing fixup dispatcher called from sub_AA9330. It performs ISA-target-aware fixups by comparing vtable method pointers against known discriminator functions:

// If the current ISA target has the DEFAULT implementation:
if (vtable[111] == sub_1810B90)   // default comparison handler
    apply_default_fixup();
// Otherwise the target has OVERRIDDEN the method:
else
    apply_specialized_fixup();    // e.g., sub_1BCBB90 for arch-specific

This mechanism supports 45 opcodes (0x12, 0x16, 0x24, ..., 0x141) and dispatches to architecture-specific post-processors (sub_1BCBB90, sub_1BCC2D0, sub_BCCF80, sub_1BCF120) or re-emits modifiers via sub_9E9910/sub_9E9A70.

The discriminator functions at 0x1810700--0x1810BFF (~15 tiny functions) serve as sentinel values: sub_1810720, sub_1810750, sub_18108A0, sub_18108D0, sub_1810B90. Their identity (which function pointer is stored) determines which specialization path the fixup dispatcher takes.

Instruction Object Layout

The binary instruction object (a2) used by all SASS printers:

| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Context/vtable pointer |
| +8 | 8 | ISA context pointer (register file, instruction info table) |
| +24 | 8 | Builder/visitor object pointer |
| +32 | 8 | Operand metadata pointer |
| +40 | 1 | Half-precision flag |
| +48 | 8 | Operand modifier context |
| +72 | 4 | Opcode (bits 12--13 are variant flags, masked via &0xCFFF) |
| +76 | 4 | Format class (subtract 11 for dword_23B39E0[] indexing) |
| +80 | 4 | Operand count |
| +84 | 8*N | Operand array (N operands, 8 bytes each) |

Each 8-byte operand slot encodes:

| Bits | Word | Field |
|---|---|---|
| 28--30 | 0 | Operand type tag: 1=register, 4=address, 5=constant buffer, 7=special |
| 0--23 | 0 | Register/constant index |
| 24--27 | 0 | Modifier flags |
| 0 | 1 | Negate |
| 1 | 1 | Absolute value |
| 20 | 1 | Constant pool flag (0x100000) |
| 29 | 1 | Sign extension (0x20000000) |
| 30 | 1 | Uniform flag (0x40000000) |
| 31 | 1 | Negation modifier (0x80000000) |

Global Lookup Tables

| Table | Size | Index | Purpose |
|---|---|---|---|
| dword_23B39E0[10] | 40 B | format_class - 11 | Format class to encoding strategy (0--4) |
| word_23B3A58[4] | 8 B | Subtype from above | Subtype to builder kind_id mapping |
| dword_23B3A20[14] | 56 B | register_class - 3 | Register class to comparison type ID |
| dword_23B3980[7] | 28 B | width_field - 1 | Encoded width to builder width value |
| xmmword_23F1xxx--23F2xxx | ~16 B each | Per-opcode | 128-bit operand layout descriptor templates |

SASS Renderer Function Map

| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
| sub_17F8210 | ~1.3 KB | 0 (vtable) | Encoding template builder (opcode 274) | 95% |
| sub_1810D20 | 8.8 KB | 0 (vtable) | Comparison-mode format-class printer | 90% |
| sub_18111F0 | 11.6 KB | 0 (vtable) | Wide-operand format-class printer | 85% |
| sub_1811E20 | 11.6 KB | 0 (vtable) | Wide-operand + special encoding printer | 85% |
| sub_1812890 | 10.5 KB | 0 (vtable) | Register + constant operand printer | 85% |
| sub_1812F60 | 15.3 KB | 0 (vtable) | 16-DWORD immediate printer | 90% |
| sub_18141C0 | 6.5 KB | 0 (vtable) | Per-operand comparison printer | 85% |
| sub_1814660 | 7.1 KB | 0 (vtable) | Load/store with address space printer | 85% |
| sub_1814B10 | 17.6 KB | 0 (vtable) | Load/store + predication printer | 85% |
| sub_1815810 | 12.7 KB | 0 (vtable) | Wide-operand variant printer | 80% |
| sub_1816000 | 13.1 KB | 0 (vtable) | Data-type qualified printer | 85% |
| sub_18167F0 | 11.8 KB | 0 (vtable) | Memory-access instruction printer | 85% |
| sub_1816FC0 | 6.4 KB | 0 (vtable) | Modifier-heavy instruction printer | 85% |
| sub_1817C50 | 12.8 KB | ~1 | Texture header index encoder | 90% |
| sub_18189C0 | 45.2 KB | 0 (vtable) | Texture/surface instruction printer | 92% |
| sub_181B370 | 27.8 KB | 0 (vtable) | Multi-operand instruction printer | 88% |
| sub_181CF60 | 14.0 KB | 0 (vtable) | Predicated instruction printer | 85% |
| sub_181D9B0 | 12.6 KB | 0 (vtable) | Load/store variant printer | 80% |
| sub_181E1D0 | 7.3 KB | ~1 | ISA-override fixup dispatcher | 90% |
| sub_181E630 | 14.7 KB | ~1 | Comparison instruction post-processor | 88% |
| sub_181F000 | 7.6 KB | ~1 | Data-type specialized printer | 75% |
| sub_181F4F0 | 17.3 KB | ~1 | Multi-variant data-type printer | 80% |

CLI Integration

--verbose / -v

Enables printing of code generation statistics after compilation. The statistics printers at sub_ABBA50--sub_ABEB50 (8 SM-variant clones, 7,603 bytes each) emit post-scheduling metrics in "# [...] " comment format.

--forcetext

Forces text-mode SASS output regardless of the default binary output mode. Internal flag: "forcetext=%d".

--out-sass

Generates reconstituted SASS text from the Capsule Mercury representation. Used for debugging the capmerc encode/decode roundtrip. Triggers the SASS text Flex lexer sub_720F00 (64 KB) for parsing in --self-check mode.

--self-check

Roundtrip verification for Capsule Mercury: encodes the instruction stream to capmerc format, decodes it back, renders both original and reconstituted as SASS text, and compares. The Flex lexer at sub_720F00 parses the text output for comparison. The SASS text formatter sub_719D00 (50 KB) builds the output for self-check.

DUMPIR

The DUMPIR environment variable (and related knobs) triggers intermediate representation dumps at named phases. Phase 129 (DumpNVuCodeText) is one of the dump targets, emitting the full instruction stream as formatted text when DUMPIR includes that phase name.

Formatter Size Distribution

Function size directly correlates with PTX instruction complexity:

| Tier | Size range | Count | Description |
|------|------------|-------|-------------|
| Tiny | < 500 B | 13 | Simple 2-operand (wgmma.fence: 295 B) |
| Small | 500--1,000 B | 191 | Standard 3--4 operand (copysign: 794 B) |
| Medium | 1,000--2,000 B | 319 | Instructions with modifiers (bfind: 1,130 B) |
| Large | 2,000--4,000 B | 36 | Arch-conditional paths (membar: 2,788 B) |
| Very large | 4,000--6,000 B | 20 | Complex multi-form (tex.grad: 5,636 B) |
| Monster | 6,000--10,000 B | 2 | WMMA matrix loads (wmma.load.b: 9,757 B) |

The WMMA load/store formatters alone account for 34,423 bytes (4% of the total formatter footprint), reflecting the combinatorial explosion of matrix shapes, data types, layouts, and architectures.

Named Opcode Dispatch Table

The 121 named opcodes registered at a1+808 by sub_5D4190:

| Category | Opcodes |
|----------|---------|
| Memory fence | membar |
| Conversion | cvt, tensormap.replace |
| Math | div, div.full, rem, rcp, rsqrt, ex2, lg2, sqrt, tanh, copysign |
| Bit manipulation | bfind, brev, bfe, bfi, clz, popc, testp |
| Load/store | ld, ldu, ldmatrix, movmatrix, stmatrix, st.async, red.async, st.bulk, prefetch |
| Texture | tex, tex.base, tex.level, tex.grad, tld4, sured.b |
| Video SIMD | vadd--vmad, vadd2--vavrg2, vadd4--vavrg4 |
| Dot product | dp2a.lo, dp2a.hi, dp4a |
| Barriers | bar, barrier, bar.arrive, barrier.arrive, bar.red, barrier.red, bar.cta, barrier.cta, + .arrive/.red variants, bar.warp |
| Warp ops | vote, shfl, match, redux |
| Async copy | cp.async.mbarrier.arrive, cp.async.bulk, cp.async.bulk.tensor |
| Cache policy | createpolicy.range, createpolicy.fractional, createpolicy.cvt |
| Multi-memory | multimem.ld_reduce, multimem.st, multimem.red |
| WMMA | wmma.load.a, wmma.load.b, wmma.load.c, wmma.store.d, wmma.mma, mma |
| WGMMA | wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group |
| TCGen05 | tcgen05.alloc, tcgen05.relinquish_alloc_permit, tcgen05.dealloc, tcgen05.ld, tcgen05.ld.red, tcgen05.st, tcgen05.commit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.ws |
| Guardrails | tcgen05.guardrails.is_phase_valid, .are_columns_allocated, .is_current_warp_valid_owner, .in_physical_bounds, .allocation_granularity, .datapath_alignment, .sp_consistency_across_idesc_mod, .check_sparse_usage |

The remaining ~400 opcodes (arithmetic, logic, load/store, control flow, etc.) are dispatched through hash values at a1+816.

SASS Printer Key Functions

| Address | Size | Callers | Identity |
|---------|------|---------|----------|
| sub_5D4190 | 12.9 KB | 1 | PTX instruction text dispatch + intrinsic registration |
| sub_5D1660 | 46 KB | 1 | Intrinsic library registration (608 entries) |
| sub_5FF700 | 354 KB | -- | Builtin function declaration emitter (prototype generator) |
| sub_4DA340 | 61 B | 1,080 | Builtin declaration lookup helper |
| sub_719D00 | 50 KB | -- | SASS text formatter (self-check output builder) |
| sub_720F00 | 64 KB | -- | Flex lexer for SASS text parsing (self-check input) |
| sub_9D12F0 | 1.4 KB | 289 | Operand encoder (64-byte struct per operand) |
| sub_9DB7E0 | 662 B | 19 | Predicate guard printer |
| sub_91C840 | 347 B | 232 | Register class discriminator |
| sub_9CEB50 | 185 B | 57 | Address space qualifier resolver |
| sub_91E860 | 31 B | 214 | Data size accessor |
| sub_18189C0 | 45.2 KB | -- | Texture/surface instruction printer (largest SASS printer) |
| sub_181B370 | 27.8 KB | -- | Multi-operand instruction printer |
| sub_1817C50 | 12.8 KB | -- | Texture header index encoder |

Instruction Data Flow

                    ┌──────────────────────────────────┐
                    │  Ori IR Instruction Object        │
                    │  (instruction data at *(a1+1096)) │
                    └────────────────┬─────────────────┘
                                     │
               ┌─────────────────────┼──────────────────────┐
               │                     │                      │
               v                     v                      v
   sub_70B6E0/B700          sub_70B8E0/B910/B920     sub_70CA60/CA70
   has_predicate()          get_reg_operand(idx)      get_operand_type()
   get_predicate_name()     get_src_part0/1(idx)      get_type_suffix()
               │                     │                      │
               └─────────────────────┼──────────────────────┘
                                     │
                                     v
                          ┌─────────────────────┐
                          │  sprintf() chain     │
                          │  into 50 KB buffer   │
                          │  using format table  │
                          │  at a2+offset        │
                          └──────────┬──────────┘
                                     │
                                     v
                          ┌─────────────────────┐
                          │  strlen → alloc →    │
                          │  strcpy → free temp  │
                          └──────────┬──────────┘
                                     │
                                     v
                          ┌─────────────────────┐
                          │  Formatted PTX text  │
                          │  string (exact size) │
                          └─────────────────────┘

Cross-References

SM Architecture Map

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas validates the --gpu-name target against three sorted lookup tables, constructs a profile object with family metadata and a CUDA_ARCH macro value, then populates seven parallel dispatch tables that drive capability checks, code generation factory selection, performance modeling, and occupancy calculation throughout the compiler. The default target is sm_75 (Turing). Every downstream decision -- instruction legality, encoder selection, register file geometry, scheduling latencies -- routes through the profile object built here.

| Item | Details |
|------|---------|
| SM validation | sub_6765E0 (54 KB, profile object construction) |
| Capability dispatch | sub_607DB0 (14 KB, 7 parallel hash maps) |
| Default target | sub_6784B0 -- returns sm_75 when --gpu-name is omitted |
| Validation tables | 3 bsearch arrays: base (32 entries at unk_1D16220), f (6 entries at unk_1D16160), a (7 entries at unk_1D161C0) |
| Per-SM accessors | sub_609XXX cluster (24 functions, ~1.2 KB each) |
| Per-SM intrinsic init | sub_60AXXX cluster (12 functions, ~1 KB each) |
| Profile lookup | sub_608D70 (384 bytes, dispatcher registered via sub_42BEC0) |

Per-SM Deep Dives:

Complete SM Table

23 active SM base targets ship in ptxas v13.0.88 (plus 9 legacy and 2 internal/alias entries retained in the validation table for backward compatibility). Each base target optionally has an a (accelerated) and/or an f (feature-reduced) sub-variant. The __CUDA_ARCH column shows the value the -D__CUDA_ARCH__ macro expands to.

| SM | __CUDA_ARCH | Family | Product | Codegen Factory | Status | Deep Dive |
|----|-------------|--------|---------|-----------------|--------|-----------|
| sm_75 | 750 | Turing | TU10x (RTX 20xx) | 24577 | Production | turing-ampere |
| sm_80 | 800 | Ampere | A100 | 28673 | Production | turing-ampere |
| sm_86 | 860 | Ampere | A40/A10/RTX 30xx | 28673 | Production | turing-ampere |
| sm_87 | 870 | Ampere | Orin (Jetson) | 28673 | Production | turing-ampere |
| sm_88 | 880 | Ampere | -- | 28673 | Production | turing-ampere |
| sm_89 | 890 | Ada Lovelace | AD10x (RTX 40xx) / L40S | 28673 | Production | ada-hopper |
| sm_90 / sm_90a | 900 | Hopper | H100 / H200 | 32768 | Production | ada-hopper |
| sm_100 / sm_100a / sm_100f | 1000 | Blackwell | B200 (datacenter) | 36864 | Production | blackwell |
| sm_103 / sm_103a / sm_103f | 1030 | Blackwell Ultra | GB300 (datacenter) | 36864 | Production | blackwell |
| sm_110 / sm_110a / sm_110f | 1100 | Jetson Thor | Thor SoC (auto/robotics) | 36864 | Production | blackwell |
| sm_120 / sm_120a / sm_120f | 1200 | Blackwell (sm120) | RTX 50xx / RTX Pro | 36864 | Production | blackwell |
| sm_121 / sm_121a / sm_121f | 1210 | Blackwell (sm120) | DGX Spark | 36864 | Production | blackwell |

The family name stored in the profile object (from sub_6765E0) uses NVIDIA's internal naming: "Turing", "Ampere", "Hopper", "Blackwell". Ada Lovelace (sm_89) is stored as Ampere-derived internally despite being a distinct microarchitecture. sm_120/121 use "Blackwell" internally despite being a different consumer microarchitecture from sm_100 datacenter Blackwell.

Suffix Semantics

ptxas uses three suffix modes to control forward compatibility. The distinction is critical: it determines which SASS binary a cubin can execute on.

| Suffix | Meaning | Forward Compatibility | Validation Table |
|--------|---------|-----------------------|------------------|
| (none) | Base feature set | Full forward compat across generations | unk_1D16220 (32 entries) |
| a (accelerated) | Architecture-locked, advanced features | No forward compat -- locked to specific silicon | unk_1D161C0 (7 entries) |
| f (feature-reduced) | Same-family forward compat only | Forward compat within family, not across | unk_1D16160 (6 entries) |

The base variant (no suffix) produces SASS that runs on the named architecture and all later ones: sm_80 code runs on sm_86, sm_89, sm_90, sm_100, etc. The a suffix locks the binary to exact silicon: sm_90a code runs only on H100/H200 hardware and will not execute on Blackwell. The f suffix allows forward compatibility within the same family: sm_100f code runs on sm_100 and sm_103 (both Blackwell datacenter) but not on sm_120 (different family).

Compilation rules from help text:

  • sm_90a PTX must be compiled to sm_90a SASS (no cross-arch compilation)
  • sm_100f PTX can compile to sm_100f or sm_103f SASS (same family)
  • sm_100a PTX must compile to sm_100a SASS only
  • Base sm_100 PTX compiles to any sm_100+ SASS
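The rules above can be collapsed into a single compatibility predicate. This is a hypothetical sketch of the policy, not recovered code: the function and type names (`can_compile`, `sm_family`, `sm_suffix`) are illustrative, and the family grouping is taken from the family table later on this page.

```c
#include <stdbool.h>

typedef enum { SUFFIX_NONE, SUFFIX_A, SUFFIX_F } sm_suffix;

/* Family grouping for f-suffix purposes: sm_100/sm_103 share the
   Blackwell-datacenter family; sm_110 and sm_120/121 are separate. */
static int sm_family(int sm) {
    if (sm == 100 || sm == 103) return 100;  /* Blackwell datacenter */
    if (sm == 110)              return 110;  /* Jetson Thor */
    if (sm == 120 || sm == 121) return 120;  /* Blackwell consumer */
    return sm;                               /* pre-Blackwell: one family per base */
}

/* Can PTX targeting (ptx_sm, ptx_sfx) be compiled to SASS for (sass_sm, sass_sfx)? */
bool can_compile(int ptx_sm, sm_suffix ptx_sfx, int sass_sm, sm_suffix sass_sfx) {
    switch (ptx_sfx) {
    case SUFFIX_A:   /* architecture-locked: exact target only */
        return sass_sfx == SUFFIX_A && sass_sm == ptx_sm;
    case SUFFIX_F:   /* forward compat within the same family only */
        return sass_sfx == SUFFIX_F && sass_sm >= ptx_sm &&
               sm_family(sass_sm) == sm_family(ptx_sm);
    default:         /* base: full forward compat */
        return sass_sm >= ptx_sm;
    }
}
```

For example, `can_compile(100, SUFFIX_F, 103, SUFFIX_F)` holds (same family), while `can_compile(100, SUFFIX_F, 120, SUFFIX_F)` does not.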

Sub-Variant Expansion

| Base | a Variant | f Variant | CUDA_ARCH (a) | CUDA_ARCH (f) |
|------|-----------|-----------|---------------|---------------|
| sm_90 | sm_90a | -- | 90a0 | -- |
| sm_100 | sm_100a | sm_100f | 100a0 | 100f0 |
| sm_103 | sm_103a | sm_103f | 103a0 | 103f0 |
| sm_101 | sm_101a | sm_101f | -- | -- |
| sm_110 | sm_110a | sm_110f | 110a0 | 110f0 |
| sm_120 | sm_120a | sm_120f | 120a0 | 120f0 |
| sm_121 | sm_121a | sm_121f | 121a0 | 121f0 |

sm_75 through sm_89 have no a or f variants. sm_90 has only the a variant (no f). All Blackwell-era targets (sm_100+) have both a and f. sm_101 is a legacy alias for sm_110 (Jetson Thor, original internal designation); it passes validation but is not registered as a profile object, so its CUDA_ARCH values are not populated.

SM Validation Tables

Target name validation uses three sorted arrays searched via bsearch(). The CLI parser extracts the SM string from --gpu-name, strips any suffix, and searches the appropriate table.

Base Table -- unk_1D16220 (32 entries)

Contains all valid base SM names without suffix, sorted by numeric SM ID. Includes legacy architectures no longer supported for active compilation but retained for validation, plus two internal/alias entries. Each entry is 12 bytes: {uint32 sm_id, uint32 ptx_major, uint32 ptx_minor}. The bsearch comparison (sub_484B70) compares the numeric sm_id extracted from the --gpu-name string via sscanf.

sm_10, sm_11, sm_12, sm_13,                    // Tesla (legacy, PTX 1.0--1.2)
sm_20, sm_21,                                  // Fermi (legacy, PTX 2.0)
sm_30, sm_32, sm_35, sm_37,                    // Kepler (legacy, PTX 3.0--4.1)
sm_50, sm_52, sm_53,                           // Maxwell (legacy, PTX 4.0--4.2)
sm_60, sm_61, sm_62,                           // Pascal (legacy, PTX 5.0)
sm_70, sm_72,                                  // Volta (legacy, PTX 6.0--6.1)
sm_75,                                         // Turing (active, PTX 6.3)
sm_80, sm_82, sm_86, sm_87, sm_88, sm_89,      // Ampere/Ada (active, PTX 6.2--7.8)
sm_90,                                         // Hopper (active, PTX 7.8)
sm_100, sm_101, sm_103, sm_110, sm_120, sm_121 // Blackwell (active, PTX 8.6--9.0)

sm_82 (PTX 6.2): Undocumented internal Ampere target. Not registered in sub_6765E0 (no profile object). Serves as the SASS opcode generation boundary (SM82_FIRST/SM82_LAST, opcode indices 172--193). The anomalously low PTX version requirement (6.2 vs sm_80's 7.0) suggests it was an early development target added before PTX ISA versioning was finalized.

sm_101 (PTX 8.6): Original internal designation for Jetson Thor, renamed to sm_110 in a later CUDA release. Both entries coexist in the validation table for backward compatibility with PTX files referencing the old name. sub_6765E0 registers only sm_110; sm_101 is validation-only.
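Given the recovered 12-byte entry layout and the sscanf-then-bsearch flow, the base-table lookup can be sketched as follows. The entry values below are an illustrative subset of the 32-entry table (PTX versions taken from the comments above); `lookup_sm` and `cmp_sm` are hypothetical names, not recovered symbols.

```c
#include <stdio.h>
#include <stdlib.h>

/* Recovered entry layout: {uint32 sm_id, uint32 ptx_major, uint32 ptx_minor} */
typedef struct { unsigned sm_id, ptx_major, ptx_minor; } sm_entry;

/* Illustrative subset of unk_1D16220, sorted by sm_id as bsearch requires */
static const sm_entry base_table[] = {
    {75, 6, 3}, {80, 7, 0}, {86, 7, 1}, {89, 7, 8},
    {90, 7, 8}, {100, 8, 6}, {120, 8, 7},
};

static int cmp_sm(const void *key, const void *elem) {
    unsigned k = *(const unsigned *)key;
    unsigned id = ((const sm_entry *)elem)->sm_id;
    return (k > id) - (k < id);   /* mirrors the numeric comparison in sub_484B70 */
}

/* Parse "sm_NN" (suffix already stripped) and search the sorted table. */
const sm_entry *lookup_sm(const char *gpu_name) {
    unsigned id;
    if (sscanf(gpu_name, "sm_%u", &id) != 1)
        return NULL;
    return bsearch(&id, base_table,
                   sizeof base_table / sizeof *base_table,
                   sizeof *base_table, cmp_sm);
}
```

An unknown target such as `sm_42` returns NULL, which is the point where ptxas would emit its invalid `--gpu-name` diagnostic.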

Accelerated Table -- unk_1D161C0 (7 entries)

sm_90a, sm_100a, sm_101a, sm_103a, sm_110a, sm_120a, sm_121a

One Hopper entry, six Blackwell entries. sm_101a is the legacy alias for sm_110a (Jetson Thor, original internal designation).

Feature-Reduced Table -- unk_1D16160 (6 entries)

sm_100f, sm_101f, sm_103f, sm_110f, sm_120f, sm_121f

No Hopper entry (sm_90 has no f variant). All Blackwell-era. sm_101f is the legacy alias for sm_110f.

Architecture Registration -- sub_6765E0

This 54KB function constructs profile objects for every SM version. Each profile contains:

| Field | Content | Example (sm_90) |
|-------|---------|-----------------|
| SM name | "sm_%d" string | "sm_90" |
| Compute name | "compute_%d" string | "compute_90" |
| Family name | Family string | "Hopper" |
| CUDA_ARCH macro | Decimal integer | 900 |
| LTO name | "lto_%d" string | "lto_90" |
| isaClass | Architecture class ID | -- |

The function registers each profile into three hash maps indexed by sm_XX, compute_XX, and lto_XX strings. This allows lookup by any of the three naming conventions used in different contexts (CLI, PTX .target directive, LTO linking).

Family assignment from sub_6765E0:

| SM Range | Family String | Notes |
|----------|---------------|-------|
| sm_75 | "Turing" | Single entry |
| sm_80, sm_86, sm_87, sm_88 | "Ampere" | Includes sm_88 (undocumented Ada variant) |
| sm_89 | "Ampere" | Ada Lovelace stored as Ampere-derived internally |
| sm_90/90a | "Hopper" | Single silicon, two feature levels |
| sm_100/100a/100f | "Blackwell" | Datacenter B200 |
| sm_103/103a/103f | "Blackwell" | Blackwell Ultra (GB300) |
| sm_110/110a/110f | "Blackwell" | Jetson Thor -- same family string despite different product |
| sm_120/120a/120f | "Blackwell" | Consumer/enterprise (RTX 50xx) -- different uarch, same string |
| sm_121/121a/121f | "Blackwell" | DGX Spark |

All sm_100 through sm_121 share the "Blackwell" family string internally, even though sm_110 (Jetson Thor) and sm_120 (consumer RTX 50xx) are distinct microarchitectures. The compiler distinguishes them through the capability dispatch tables, not through family name.
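The family-string assignment reduces to a small range map. This is a minimal sketch of the mapping recovered from sub_6765E0; the function name and control structure are illustrative, not decompiled code.

```c
#include <string.h>

/* Family string per SM id, as assigned by sub_6765E0 (sketch). */
const char *sm_family_name(int sm) {
    if (sm == 75)                 return "Turing";
    if (sm >= 80 && sm <= 89)     return "Ampere";     /* includes Ada (sm_89) */
    if (sm == 90)                 return "Hopper";
    if (sm >= 100 && sm <= 121)   return "Blackwell";  /* DC, Thor, and consumer alike */
    return "unknown";                                  /* legacy targets not registered */
}
```

Note that the mapping is deliberately lossy: both Ada and Ampere return "Ampere", and all sm_1xx targets return "Blackwell", which is why feature gating must go through the dispatch tables rather than this string.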

Capability Dispatch -- sub_607DB0

The capability dispatch initializer builds 7 parallel hash maps at initialization time, protected by a once-guard (byte_29FE1D8). Each map indexes sm_XX / compute_XX strings to per-architecture values or handler functions. Error recovery uses setjmp/longjmp.

| Map | Global | Purpose | Value Type |
|-----|--------|---------|------------|
| 1 | qword_29FE1D0 | Handler A (primary codegen) | Function pointer |
| 2 | qword_29FE1C8 | Handler B (secondary codegen) | Function pointer |
| 3 | qword_29FE1C0 | Intrinsic table initializer | Function pointer |
| 4 | qword_29FE1B8 | Capability flags | Byte value |
| 5 | qword_29FE1B0 | Profile registration | Registered via sub_42BEC0 |
| 6 | qword_29FE1A8 | Perf-stats / occupancy handler E | Function pointer |
| 7 | qword_29FE1A0 | Perf-stats / occupancy handler F | Function pointer |

Handler Function Assignments

Each SM version registers its own handler functions into these maps. Functions within the same suffix group (e.g., sm_100/100a/100f) share all handlers -- they are the same silicon with different feature exposure.

Map 1 -- Handler A (per SM):

| SM | Handler A | SM | Handler A |
|----|-----------|----|-----------|
| sm_75 | sub_609B70 | sm_100 | sub_609C30 |
| sm_80 | sub_609CC0 | sm_110 | sub_609F30 |
| sm_86 | sub_609D50 | sm_103 | sub_608F20 |
| sm_87 | sub_609F00 | sm_120 | sub_609E40 |
| sm_88 | sub_609E70 | sm_121 | sub_609ED0 |
| sm_89 | sub_609E10 | | |
| sm_90 | sub_609DB0 | | |

Map 2 -- Handler B (per SM):

| SM | Handler B | SM | Handler B |
|----|-----------|----|-----------|
| sm_75 | sub_609B40 | sm_100 | sub_609BD0 |
| sm_80 | sub_609C90 | sm_110 | sub_608F50 |
| sm_86 | sub_609D80 | sm_103 | sub_609D20 |
| sm_87 | sub_609DE0 | sm_120 | sub_609C60 |
| sm_88 | sub_609EA0 | sm_121 | sub_609BA0 |
| sm_89 | sub_609CF0 | | |
| sm_90 | sub_609C00 | | |

Map 3 -- Intrinsic table initializer (per SM):

| SM | Initializer | SM | Initializer |
|----|-------------|----|-------------|
| sm_75 | sub_60A2E0 | sm_100 | sub_60A910 |
| sm_80 | sub_60A3E0 | sm_110 | sub_60AA20 |
| sm_86 | sub_60AC30 | sm_103 | sub_60A700 |
| sm_87 | sub_60AD30 | sm_120 | sub_608DF0 |
| sm_88 | sub_60AB30 | sm_121 | sub_60A4E0 |
| sm_89 | sub_60A810 | | |
| sm_90 | sub_60A5F0 | | |

Shared Handler Groups

Sub-variants within a base SM share all handler functions, confirming they are identical silicon:

| Group | Members | Shared Handlers |
|-------|---------|-----------------|
| Hopper | sm_90, sm_90a | All 7 maps |
| Blackwell DC | sm_100, sm_100a, sm_100f | All 7 maps |
| Blackwell Ultra | sm_103, sm_103a, sm_103f | All 7 maps |
| Jetson Thor | sm_110, sm_110a, sm_110f | All 7 maps |
| Consumer | sm_120, sm_120a, sm_120f | All 7 maps |
| DGX Spark | sm_121, sm_121a, sm_121f | All 7 maps |

Codegen Factory Values

The profile object stores an encoded architecture identifier at a known offset (visible as field +348 on the profile pointer chain, e.g., *(_QWORD *)(a1+1584)+348). This value is compared throughout the compiler to gate features:

| Codegen Factory | SM Range | SASS ISA Generation |
|-----------------|----------|---------------------|
| 24577 | sm_75 | Turing (SM 7.5) |
| 28673 | sm_80 -- sm_89 | Ampere / Ada (SM 8.x) |
| 32768 | sm_90 | Hopper (SM 9.0) |
| 36864 | sm_100 -- sm_121 | Blackwell (SM 10.x -- 12.x) |

These values appear in feature-gating checks. For example, FMA/DFMA combining in the peephole optimizer checks profile[+372] > 28673 to require sm_70+ capability. The exact encoding formula is (isa_generation << 12) | variant, where the high bits identify the SASS instruction set generation.

Related encoded values seen in the binary:

12288  = sm_30 (Kepler)       // 3 << 12
16385  = sm_50 (Maxwell)      // 4 << 12 | 1
20481  = sm_50 alt (Maxwell)  // 5 << 12 | 1
24576  = sm_60 (Pascal)       // 6 << 12
24577  = sm_75 (Turing)       // 6 << 12 | 1
28673  = sm_80 (Ampere)       // 7 << 12 | 1
28674-28677 = sm_86/87/88/89  // 7 << 12 | 2..5
32768  = sm_90 (Hopper)       // 8 << 12
36864  = sm_100 (Blackwell)   // 9 << 12
36865-36869 = sm_103..121     // 9 << 12 | 1..5
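The encoding and its inverse are trivial bit operations; the sketch below just makes the recovered `(isa_generation << 12) | variant` formula executable (function names are illustrative).

```c
/* Pack an ISA generation and variant into a codegen factory value,
   per the recovered (gen << 12) | variant scheme. */
static unsigned codegen_factory(unsigned isa_gen, unsigned variant) {
    return (isa_gen << 12) | variant;
}

/* Inverse: split a factory value back into its two components. */
static unsigned isa_generation(unsigned factory) { return factory >> 12; }
static unsigned isa_variant(unsigned factory)    { return factory & 0xFFF; }
```

For example, sm_75 is `codegen_factory(6, 1)` = 24577 and sm_121 decodes as generation 9, variant 5 (36869).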

Hardware Resource Geometry

ptxas assembles per-SM hardware parameters from three data sources: sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator properties). These parameters control register allocation limits, shared memory partitioning, occupancy calculations, and scheduling decisions throughout the compiler.

Universal Constants (sub_8688F0)

sub_8688F0 sets the baseline hardware profile shared by all SM 75+ targets. These values are architecture-invariant within the ptxas v13.0.88 binary:

| Parameter | Value | Binary Evidence | Profile Offset |
|-----------|-------|-----------------|----------------|
| Warp size | 32 threads | *(a1+1472) = 32 | +1472 |
| Max registers per thread | 255 | *(a1+612) = 0xFF0000003F | +612 |
| Register file per SM | 65,536 x 32-bit | Derived: max_warps = 65536 / (regcount * 32) | -- |
| Dependency barriers per warp | 6 | *(a1+604) = 6 | +604 |
| Named barriers per CTA | 16 | barrier_arrive_0 through barrier_arrive_15 intrinsics | -- |
| Static shared memory base | 48 KB (49,152 B) | *(a1+1484) = 49152 | +1484 |
| Shared memory config base | 1 MB (1,048,576 B) | *(v6+344) = 0x100000 in all per-SM inits | profile +344 |

The register file size of 65,536 registers is confirmed by the EIATTR_REGCOUNT formula (code 0x2F): max_warps_per_SM = total_registers / (regcount * warp_size), and by explicit reference in codegen/templates.md ("the entire physical register file is 65,536 32-bit registers shared across all active warps").

Per-SM Resource Geometry Table

Combines binary evidence (sub_8E4400 scheduling profile, sub_8688F0 baseline, sub_ABF250 occupancy properties, sub_60AXXX per-SM initializers) with NVIDIA public specifications for parameters not stored as scalar constants in the binary. Confidence column rates how directly the value was extracted from the binary vs. inferred from public documentation.

| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|----|---------|-----------------|-----------------|----------|-------------|------------------|----------------|----------------------------|------|
| sm_75 | 65,536 | 255 | 1,024 | 32 | 16 | 7 / 208 | 208 | 32 / 48 / 64 KB | 90% |
| sm_80 | 65,536 | 255 | 2,048 | 64 | 32 | 7 / 208 | 208 | 48 / 100 / 132 / 164 KB | 90% |
| sm_86 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_87 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 / 164 KB | 90% |
| sm_88 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | (same as sm_86) | 85% |
| sm_89 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_90 | 65,536 | 255 | 1,024 | 64 | 32 | 8 / 224 | 224 | 48 / 100 / 132 / 164 / 228 KB | 90% |
| sm_100 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 90% |
| sm_103 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 88% |
| sm_110 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 85% |
| sm_120 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 88% |
| sm_121 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_120) | 85% |

Column definitions:

  • Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
  • Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
  • Max Threads/CTA: Maximum threads per cooperative thread array (block). Not stored as a ptxas constant; derived from warps_per_SM * warp_size / max_CTAs.
  • Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
  • Max CTAs/SM: Maximum concurrent CTAs per SM.
  • Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD). The scheduler partition count is the number of warp scheduler units; dispatch slots is the total scheduling capacity.
  • Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute. Stored as pointer-to-table at profile offsets +1488/+1496; sm_75 has 3 entries, later architectures have more.

sm_88 note: No known product ships on sm_88. It shares all handler functions with sm_86. Listed parameters are inherited; actual hardware behavior is unverifiable.

Scheduler Partition Geometry (sub_8E4400 Detail)

The packed DWORD at offset +18 of the warp-level profile encodes scheduler partition counts. The WORD at offset +22 is the dispatch slot count -- a scheduling capacity value distinct from the raw warp count.

| Codegen Factory Range | Packed DWORD | Hex | Partitions | Dispatch Slots | SM Era |
|-----------------------|--------------|-----|------------|----------------|--------|
| <= 20479 | 458,759 | 0x00070007 | 7 | 96 | sm_50 (Maxwell) |
| 20480 -- 24575 | 786,444 | 0x000C000C | 12 | 176 | sm_60 (Pascal) |
| 24576 -- 28672 | 851,981 | 0x000D000D | 13 | 192 | sm_70 (Volta) |
| 28673 -- 32767 | 917,518 | 0x000E000E | 14 | 208 | sm_75 -- sm_89 |
| 32768 -- 36863 | 983,055 | 0x000F000F | 15 | 224 | sm_90 (Hopper) |
| > 36863 | 1,048,592 | 0x00100010 | 16 | 240 | sm_100+ (Blackwell) |

The dispatch slot count increases monotonically across generations, reflecting wider scheduling capacity. All sm_75 through sm_89 targets (Turing, Ampere, Ada Lovelace) share identical scheduling partition geometry despite their hardware differences -- the differentiation occurs in the per-SM latency tables, not in the partition structure.
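Decoding the packed DWORD is a matter of masking its 16-bit halves (both halves hold the partition count in every recovered value). One additional observation, stated here as an inference rather than a recovered formula: in every row of the table above, dispatch slots equal `(partitions - 1) * 16`. The sketch below (illustrative names) captures both.

```c
/* Low 16 bits of the packed scheduler DWORD at profile offset +18
   hold the partition count (the high 16 bits repeat it). */
static unsigned partitions_from_packed(unsigned packed) {
    return packed & 0xFFFF;
}

/* Inferred relation, consistent with all six recovered rows:
   dispatch slots = (partitions - 1) * 16. Not a recovered formula. */
static unsigned expected_dispatch_slots(unsigned partitions) {
    return (partitions - 1) * 16;
}
```

For example, 0x000E000E decodes to 14 partitions, and `expected_dispatch_slots(14)` gives 208, matching the sm_75 -- sm_89 row.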

Shared Memory Configuration Tables

ptxas stores configurable shared memory sizes as a pointer + count pair at profile offsets +1488 and +1496. The driver uses this table to validate cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize, ...) calls.

sub_8688F0 sets the sm_75 configuration:

*(a1+1488) = &unk_21D9168    // pointer to shared memory size table
*(a1+1496) = 3               // 3 valid configurations

For sm_75 (Turing), the 3 entries correspond to 32 KB, 48 KB, and 64 KB configurable shared memory. The L1/shared partitioning on Turing splits the 96 KB unified data cache between L1 and shared memory.

For sm_80 (Ampere), the configurable shared memory extends to 164 KB, reflecting the larger shared memory/L1 combined capacity. sub_ABF250 records the maximum as 167,936 bytes (163.8 KB) for the base sm_60 path and 233,472 bytes (228 KB) for sm_70+ paths, though these values encode via xmmword constants that depend on the specific SM variant.

For sm_90+ (Hopper, Blackwell), sub_ABF250 populates a maximum configurable value of 233,472 bytes (228 KB), supporting the opt-in extended shared memory mode added in Hopper.

Register Allocation Mechanics

ptxas allocates registers in units determined by the register allocation granularity stored in sub_ABF250:

| SM Generation | Alloc Granularity | a2[6][1] | a2[6][2] | Notes |
|---------------|-------------------|----------|----------|-------|
| sm_30 -- sm_60 | 64 registers / warp | 63 | 1 | Allocates in blocks of 2 regs/thread |
| sm_70+ | 256 registers / warp | 255 | 2 | Allocates in blocks of 8 regs/thread |

The register allocation unit directly affects occupancy. With 256-register granularity on sm_75+, a kernel using 33 registers effectively consumes 40 (rounded up to the next multiple of 8), which means each warp uses 40 * 32 = 1280 of the 65,536 available registers, allowing up to 51 warps -- but capped by the hardware limit of 32 warps on sm_75.

The formula the GPU driver uses (from EIATTR_REGCOUNT documentation):

effective_regs = ceil(regcount / alloc_granularity) * alloc_granularity
regs_per_warp  = effective_regs * warp_size
max_warps      = min(registers_per_SM / regs_per_warp, hw_max_warps)
max_CTAs       = min(max_warps / warps_per_CTA, hw_max_CTAs)
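The formula above can be rendered directly in C. This is a sketch of the documented driver-side calculation, not ptxas code; the function name is illustrative, and the per-thread granularity (8 on sm_70+) and 32-thread warp size come from this page.

```c
/* EIATTR_REGCOUNT occupancy formula: warps per SM limited by the
   register file, capped by the hardware warp limit. */
static unsigned max_warps_per_sm(unsigned regcount,          /* regs per thread */
                                 unsigned alloc_granularity, /* regs/thread rounding, 8 on sm_70+ */
                                 unsigned regs_per_sm,       /* 65,536 on sm_75+ */
                                 unsigned hw_max_warps) {
    unsigned effective = (regcount + alloc_granularity - 1)
                         / alloc_granularity * alloc_granularity; /* round up */
    unsigned regs_per_warp = effective * 32;                      /* warp size 32 */
    unsigned by_regs = regs_per_sm / regs_per_warp;
    return by_regs < hw_max_warps ? by_regs : hw_max_warps;
}
```

Running the worked example from this section: 33 registers round up to 40, giving 1280 registers per warp and a register-limited bound of 51 warps, which the sm_75 hardware cap then reduces to 32.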

SM Version Encoding

The raw SM version number stored in profile objects and code object headers uses a packed integer format. This is the value at v4[93] in the code object builder (sub_A465F0):

| Encoded Value | SM Target | Code Object Version | Max Threads/CTA |
|---------------|-----------|---------------------|-----------------|
| 12288 | sm_30 | 0x70007 | 96 |
| 20481 | sm_50 | 0xC000C | 176 |
| 24576 | sm_60 | -- | -- |
| 28673 | sm_80 | -- | -- |
| 36864 | sm_100 | 0x100010 | 240 |

The code object builder (sub_A465F0 at 0xA465F0) maps these encoded SM versions to ELF code object version fields and thread-per-CTA limits. The magic number 0x16375564E is written at offset 0 of every code object header, with the SM version at offset +8.

Per-SM Capability Accessors -- sub_609XXX

The 24 functions in the sub_609XXX cluster (range 0x609280--0x609F60, ~1.2KB each) are the per-SM-version capability accessor functions. They are registered into Maps 1 and 2 of the dispatch tables and return architecture-specific values: register file sizes, feature flags, warp geometry, shared memory limits, and similar hardware parameters.

These are the functions that downstream code calls (through the dispatch table) to answer questions like "how many registers does this SM have?" or "is feature X available on this target?"

Profile Layering

Three levels of SM profile information cooperate:

Level 1: sub_607DB0                    // Capability dispatch (7 hash maps)
    |                                  //   -> feature flags, handler functions
    v
Level 2: sub_6765E0                    // Profile objects (name, family, CUDA_ARCH, lto)
    |                                  //   -> identity metadata, isaClass
    v
Level 3: sub_609XXX / sub_60AXXX       // Per-SM accessor functions
                                       //   -> concrete hardware parameter values

Level 1 provides the dispatch infrastructure. Level 2 provides identity metadata for diagnostics and linking. Level 3 provides the actual numeric values that drive register allocation, scheduling, and instruction legality.

Generation-Specific Features

Turing (sm_75)

sm_75 is the default architecture for ptxas v13.0.88, returned by sub_6784B0 when no --gpu-name is specified. Codegen factory value: 24577.

  • Base tensor core (WMMA m16n16k16 f16/f32, m32n8k16, m8n32k16)
  • Integer MMA (IMMA int8/int4), binary MMA (BMMA b1)
  • Base warp-level operations (shfl.sync, vote.sync, match.sync, redux.sync)
  • Named barrier support (bar 0-15 with arrive/red/sync variants)

Ampere (sm_80 -- sm_88)

Codegen factory value: 28673. Shared with Ada Lovelace (sm_89).

  • Extended tensor core: TF32, BF16, FP64 MMA shapes
  • cp.async for asynchronous shared memory copies
  • L2 cache hints on atomic operations
  • createpolicy instructions for cache management
  • 14 additional intrinsics (__cuda_sm80_*: bf16/tf32/s4/s8/b1 MMA, createpolicy)

Ada Lovelace (sm_89)

Codegen factory value: 28673 (same as Ampere). Stored as "Ampere" internally despite being a distinct Ada Lovelace microarchitecture.

  • Same codegen path as Ampere; differentiated through capability flags, not codegen factory
  • 39 additional MMA intrinsics (__cuda_sm_8x_mma_*)

Hopper (sm_90 / sm_90a)

Codegen factory value: 32768. sm_90a is architecture-locked (H100/H200 only).

  • WGMMA (warpgroup MMA async): wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group
  • Cluster operations: barrier.cluster.arrive/wait, distributed shared memory
  • setmaxnreg: Dynamic register allocation limit
  • Cluster special registers: %clusterid, %cluster_ctaid, %cluster_ctarank, etc.
  • 38 sub-byte MMA intrinsics (__cuda_sm_9x_mma_sub_byte_internal_*: s4/u4 sparse)

Blackwell Datacenter (sm_100, sm_103)

Codegen factory value: 36864. Both a and f sub-variants available.

  • tcgen05: 5th-generation tensor core ISA (alloc, dealloc, ld, st, commit, cp, shift, mma) -- a/f sub-variants only
  • tcgen05 guardrails: 8 debug validation functions (phase validity, column allocation, bounds checking)
  • Extended MMA: 10 Blackwell-specific hmma/imma + bit MMA intrinsics (__cuda_sm_10x_*)
  • 11 tcgen05 guardrail trap intrinsics (__cuda_sm10x_tcgen05_guardrail_trap_*)
  • 18 sm_1xx bulk copy intrinsics (__cuda_sm1xx_*: cp.async.bulk.tensor 1D-5D tile/im2col)

Jetson Thor (sm_110)

Codegen factory value: 36864 (same as sm_100). Originally sm_101 before rename. Automotive/robotics SoC.

  • Retains full tcgen05/TMEM hardware on a/f sub-variants
  • Same Blackwell datacenter feature set for tensor operations
  • Differentiated through capability flags for SoC-specific constraints

Blackwell Consumer (sm_120, sm_121)

Codegen factory value: 36864 (same as sm_100). Architecturally a distinct consumer microarchitecture despite sharing the "Blackwell" family string.

  • No tcgen05: The entire tcgen05 ISA is absent on sm_120/121 -- gated by SM version checks
  • Tensor core falls back to HMMA/IMMA/WGMMA inherited from sm_70--sm_90 path
  • sm_120 = RTX 50xx consumer / RTX Blackwell Pro (enterprise)
  • sm_121 = DGX Spark

Diagnostic Strings

| String | Context | Function |
|--------|---------|----------|
| "Turing" | Family name in profile object | sub_6765E0 |
| "Ampere" | Family name in profile object | sub_6765E0 |
| "Hopper" | Family name in profile object | sub_6765E0 |
| "Blackwell" | Family name in profile object | sub_6765E0 |
| "isaClass" | Architecture class reference on profile | sub_6765E0 |
| "sm_%d" | SM name formatting | Multiple |
| "compute_%d" | Compute name formatting | sub_6765E0 |
| "lto_%d" | LTO name formatting | sub_6765E0 |

Function Map

| Address | Size | Identity | Confidence |
|---------|------|----------|------------|
| sub_607DB0 | 14 KB | SM capability dispatch -- builds 7 hash maps | 99% |
| sub_608D70 | 384 B | Profile lookup dispatcher | 80% |
| sub_608DF0 | ~1 KB | sm_120 intrinsic table initializer | 85% |
| sub_608F20 | ~1.2 KB | sm_103 handler A (capability accessor) | 90% |
| sub_608F50 | ~1.2 KB | sm_110 handler B (capability accessor) | 90% |
| sub_609280--sub_609F60 | ~1.2 KB each | 24 per-SM capability accessor functions (Maps 1+2) | 90% |
| sub_609F60 | 2.8 KB | lds128convert option handler ("always", "nonconst", "never") | 90% |
| sub_60A2E0--sub_60AD30 | ~1 KB each | 12 per-SM intrinsic table initializers (Map 3) | 85% |
| sub_60B040 | 4.5 KB | Stress test options ("stress-maxrregcount", etc.) | 85% |
| sub_6765E0 | 54 KB | SM profile object construction (family, CUDA_ARCH, lto) | 95% |
| sub_6784B0 | -- | Default architecture -- returns sm_75 | 99% |
| sub_8688F0 | 31 lines | Universal HW profile baseline (warp size, regs, barriers, shmem) | 95% |
| sub_8E4400 | 3.3 KB | Warp-level HW profile: scheduler partitions, dispatch slots | 95% |
| sub_ABF250 | ~600 B | Occupancy property table: configurable shmem, reg alloc granularity | 90% |
| sub_A95DC0 | ~1.8 KB | Extended HW profile: architecture-specific shmem config | 85% |
| sub_A465F0 | 14 KB | Code object header builder (SM version -> ELF fields) | 88% |

Profile Object Layout (1936 bytes)

Every SM's intrinsic table initializer (Map 3 handler) calls sub_917990 to allocate a 1,936-byte profile object that carries target-specific parameters throughout the compiler. This is the compilation unit's target descriptor -- the single structure that downstream code reads to answer "what hardware am I compiling for?"

Construction Sequence

1. sub_71BDE0(1936, a1)     heap allocate 1936 bytes
2. sub_C1B7A0(profile)      zero-fill + structural defaults (8 SSE blocks, 5 scalars)
3. sub_917990(a3)           overlay: codegen factory default, tail constants
4. sub_60AXXX(a1,a2,a3,a4)  per-SM: codegen factory, shmem base, capability flags

Key Fields -- Explicitly Initialized

These fields receive non-zero values during construction. Offsets are byte offsets from the profile object base pointer. Type column: D=DWORD(4B), Q=QWORD(8B), O=OWORD(16B), B=BYTE.

| Offset | Type | Default | Set By | Semantic Name | Confidence |
|---|---|---|---|---|---|
| +0 | Q | 0 | sub_C1B7A0 | object_base -- zeroed, likely vtable/class pointer | 75% |
| +112 | Q | 0x500000000 | sub_C1B7A0 | packed_config -- stores DWORD 5 at +112, DWORD 0 at +116 | 85% |
| +120 | D | 5 | sub_C1B7A0 | opt_level_default -- initial optimization level or block dimension | 85% |
| +132 | Q | 0xFFFFFFFF | sub_C1B7A0 | max_register_limit -- -1 sentinel = "no limit" | 85% |
| +340 | D | 1 | sub_C1B7A0 | enable_flag_A -- default-enabled capability | 85% |
| +344 | D | 0x100000 | per-SM init | shared_memory_config_base -- 1 MB for all SM 75+ targets | 95% |
| +348 | D | per-SM | per-SM init | codegen_factory -- ISA generation encoding `(gen << 12) \| variant` | 99% |
| +424 | Q | 0x100000000 | sub_C1B7A0 | packed_enable -- stores DWORD 1 at +424, DWORD 0 at +428 | 90% |
| +428 | D | 0 (cond.) | per-SM init | conditional_feature_flag -- sm_90+ only: set to 0 when *(a2+355) is true | 85% |
| +432 | D | computed | per-SM init | module_base_address -- callback() - 0x6FFFFE64 or -1 if disabled | 95% |
| +588 | D | 0 | sub_917990 | cleared_field -- explicitly re-zeroed; used in 10+ consumer functions | 90% |
| +708 | D | 1 | per-SM init | enable_flag_D -- universally set to 1 by all per-SM initializers | 85% |
| +944 | D | 4 | sub_C1B7A0 | pipeline_depth -- possibly barrier count or pipeline stage limit | 85% |
| +1200 | Q | "NVIDIA" | sub_43A400 | vendor_string_ptr -- pointer to vendor identification string | 95% |
| +1208 | Q | (pointer) | sub_43A400 | associated_data_ptr -- assigned from callback result | 90% |
| +1216 | D | 1 | sub_43A400 | vendor_flag -- set to 1 during ELF builder initialization | 85% |
| +1385 | B | 0 (bits) | runtime | scheduling_feature_flags -- bitfield, 21+ consumer sites | 99% |
| +1536 | Q | 1832 | sub_C1B7A0 | dynamic_region_offset -- points to tail SSE constant region start | 90% |
| +1552 | Q | 0 | runtime | pipeline_progress -- monotonically increasing counter (values 0--21); scoreboard guards check 16--19 | 95% |
| +1584 | Q | nullsub_856 | sub_C1B7A0 | sm_backend_vtable_ptr -- THE central polymorphic pointer; initialized to null stub | 99% |
| +1684 | D | CLI value | per-SM init | cli_option_value -- *(a1+108) passthrough from compiler driver | 90% |
| +1840 | D | 1 | per-SM init | elf_section_data -- initially 1 (enable), later overwritten with ptr | 85% |
| +1880 | Q | 1 | per-SM init | barrier_tracking_ptr -- initially 1, later pointer to scoreboard data | 95% |
| +1892 | D | 2 | sub_917990 | tail_mode_value -- possibly versioning or encoding mode indicator | 85% |
| +1912 | D | 0 (cond.) | per-SM init | conditional_clear -- cleared when *(a2+233) is true (debug mode) | 85% |
| +1928 | D | 1 | per-SM init | output_config_value -- compilation output configuration | 85% |

SSE Constant Blocks

10 blocks of 16 bytes each are loaded from .rodata segments via _mm_load_si128. These likely contain per-register-class sizing parameters, pipeline configuration constants, or default opcode table pointers. Exact values require .rodata dump.

| Offset | Source | Set By |
|---|---|---|
| +184 | xmmword_20206F0 | sub_C1B7A0 |
| +280 | xmmword_2027950 | sub_C1B7A0 |
| +312 | xmmword_2027600 | sub_C1B7A0 |
| +680 | xmmword_2027620 | sub_C1B7A0 |
| +696 | xmmword_22B4ED0 | sub_C1B7A0 |
| +740 | xmmword_22B4EE0 | sub_C1B7A0 |
| +788 | xmmword_22B4EF0 | sub_C1B7A0 |
| +816 | xmmword_22B4F00 | sub_C1B7A0 |
| +1832 | xmmword_21DEBA0 | sub_917990 |
| +1908 | xmmword_21DEBB0 | sub_917990 |

Scheduling Feature Flags (+1385 bitfield)

The byte at offset +1385 is the most heavily accessed bitfield on the profile object (21+ consumer sites in 15+ decompiled functions). Each bit gates a scheduling or codegen behavior.

| Bit | Mask | Meaning | Evidence |
|---|---|---|---|
| 0 | 0x01 | Function has sync barriers | sub_792CD0 sets/clears; sub_75F680, sub_75F580 check |
| 1 | 0x02 | (unknown) | -- |
| 2 | 0x04 | Extended barrier model | sub_796D60 checks jointly with +1382 & 0x20 |
| 3 | 0x08 | Scoreboard tracking enabled | sub_925670, sub_925510, sub_9253C0 check jointly with +1880 and +1552 |
| 4 | 0x10 | (unknown) | -- |
| 5 | 0x20 | Scheduling feature flag | sub_793220, sub_A36360 (scoreboard encoder) check |
| 6 | 0x40 | Temporary analysis flag | sub_752E40 sets; sub_77F0D0 clears |
| 7 | 0x80 | Preserved across resets | sub_7F7DC0: *(a1+1385) &= 0x80 (all others cleared) |
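
The set/clear/preserve patterns above can be sketched as C helpers. Offsets and masks come from the recovered layout; the macro and function names are ours, not ptxas symbols:

```c
#include <stdint.h>

#define SCHED_FLAGS_OFF 1385

#define FLAG_SYNC_BARRIERS 0x01  /* bit 0: function has sync barriers */
#define FLAG_EXT_BARRIER   0x04  /* bit 2: extended barrier model */
#define FLAG_SCOREBOARD    0x08  /* bit 3: scoreboard tracking enabled */
#define FLAG_PRESERVED     0x80  /* bit 7: survives the reset in sub_7F7DC0 */

/* View the profile object as raw bytes, as the decompilation does. */
static uint8_t *sched_flags(uint8_t *profile) {
    return profile + SCHED_FLAGS_OFF;
}

/* Mirrors sub_7F7DC0's reset: *(a1+1385) &= 0x80 -- only bit 7 survives. */
static void reset_sched_flags(uint8_t *profile) {
    *sched_flags(profile) &= FLAG_PRESERVED;
}
```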

Per-SM Initializer Differences

All 12 per-SM initializers (one per SM family) are structurally identical. Only two fields differ between families.

| Field | sm_75--sm_88 | sm_89 | sm_90 | sm_100--sm_121 |
|---|---|---|---|---|
| +348 codegen_factory | 24577--28676 | 28677 | 0x8000 (32768) | 36864--36869 |
| +428 conditional_feature_flag | not written | not written | written (if *(a2+355)) | written (if *(a2+355)) |

All other fields (+344, +432, +588, +708, +1684, +1840, +1880, +1912, +1928) are set identically across all SM families.

Critical Distinction: Profile Object vs SM Backend

The profile object (1936 bytes, this layout) is the compilation unit's target descriptor, stored somewhere in the compilation context. The pointer at context+1584 (sm_backend_vtable_ptr) points to a separate polymorphic SM backend object -- not to another profile object. Fields accessed as *(*(ctx+1584) + N) are on the SM backend, not on this profile object.

Commonly accessed SM backend fields (NOT on the 1936-byte profile):

| SM Backend Offset | Semantic | Consumer Count |
|---|---|---|
| +12 | arch_class_id (4=Maxwell, 10=Grace, 11=NVLink-Volta+) | 3+ |
| +372 | codegen_factory (THE feature-gating value, 37+ sites) | 37+ |
| +1037 | hw_capability_flags bitfield (SFU precision, barriers) | 20+ |
| +1312 | vtable dispatch for predicate marking capability | 2+ |
| +1320 | vtable function slot for optimization dispatch | 2+ |
| +1417 | Grace/ARM architecture flag (bit 7) | 1+ |
| +1418 | NVLink-capable flag (bit 0) | 1+ |

The profile's own +348 stores the codegen factory value at construction time. The SM backend's +372 is where all downstream code reads it. These are the same numeric value stored in two different objects.
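
The two homes of the value can be expressed as a pair of accessors. A minimal sketch assuming raw byte-pointer views of the two objects (struct-free, as in the decompilation); the function names are ours:

```c
#include <stdint.h>

/* Construction-time copy on the 1936-byte profile object (+348). */
static uint32_t profile_factory(const uint8_t *profile) {
    return *(const uint32_t *)(profile + 348);
}

/* What downstream passes actually read: *(*(ctx + 1584) + 372),
   i.e. one indirection through the SM backend pointer. */
static uint32_t backend_factory(const uint8_t *ctx) {
    const uint8_t *backend = *(const uint8_t *const *)(ctx + 1584);
    return *(const uint32_t *)(backend + 372);
}
```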

Object Region Map

Offset Range   Size    Content
-----------    ----    -------
[0..111]       112B    Object header, vtable chain, zeroed bulk
[112..159]     48B     Config scalars (opt level, register limit)
[160..343]     184B    Zeroed + 3 SSE constant blocks (184, 280, 312)
[344..587]     244B    Target identity (shmem base, codegen factory, enable flags)
[588..879]     292B    Capability fields, 4 SSE constant blocks (680, 696, 740, 788)
[880..1063]    184B    Scheduling config, barrier config (+944=4 default)
[1064..1279]   216B    Extended config, vendor metadata (+1200="NVIDIA")
[1280..1535]   256B    Architecture feature flags (+1385 bitfield, +1417/+1418)
[1536..1663]   128B    Dynamic region pointer, phase counter, sm_backend ptr (+1584)
[1664..1831]   168B    CLI passthrough, ELF section data, barrier tracking ptr
[1832..1935]   104B    Tail region: 2 SSE constant blocks, mode value (+1892=2)

Cross-References

Turing & Ampere (SM 75--88)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

SM 75 through SM 88 span two microarchitecture generations that ptxas treats as a contiguous feature band. sm_75 (Turing) is the default target for ptxas v13.0.88 -- when --gpu-name is omitted, sub_6784B0 returns sm_75. The Ampere targets (sm_80, sm_86, sm_87, sm_88) share generation-7 SASS encoding and add incremental tensor core and async-copy capabilities. sm_89 (Ada Lovelace) is architecturally Ampere-derived internally but is covered in Ada & Hopper because it bridges to sm_90 features.

SM targets: sm_75, sm_80, sm_86, sm_87, sm_88 (+ sm_82 validation-only)
Codegen factory range: 24577--28676
ISA generation: 6 (Turing), 7 (Ampere)
Encoding format: 128-bit per-instruction control word
Scheduler profile: 7 warps, 208 dispatch slots
Family strings: "Turing" (sm_75), "Ampere" (sm_80--88)
Sub-variants: None (no a or f suffixes)
Profile object size: 1,936 bytes (allocated by sub_917990)

SM Version Table

| SM | Product | Family | __CUDA_ARCH__ | Codegen Factory | Hex | Variant |
|---|---|---|---|---|---|---|
| sm_75 | TU10x (RTX 20xx, Quadro RTX) | Turing | 750 | 24577 | 0x6001 | 1 (gen 6) |
| sm_80 | GA100 (A100, A30) | Ampere | 800 | 28673 | 0x7001 | 1 (gen 7) |
| sm_86 | GA10x (A40, A10, RTX 30xx) | Ampere | 860 | 28674 | 0x7002 | 2 |
| sm_87 | GA10B (Jetson Orin) | Ampere | 870 | 28675 | 0x7003 | 3 |
| sm_88 | -- (undocumented) | Ampere | 880 | 28676 | 0x7004 | 4 |

Codegen factory encoding: (isa_generation << 12) | sub_variant. Turing is generation 6; Ampere is generation 7. The sub-variant distinguishes silicon cut within a generation. sm_75 and Pascal sm_60 share generation 6 (sm_60 = 24576 = 0x6000), differentiated by sub-variant 0 vs 1.
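
The encoding arithmetic can be checked with a few lines of C; the helper names are ours, not ptxas symbols:

```c
#include <stdint.h>

/* codegen factory = (isa_generation << 12) | sub_variant */
static uint32_t make_factory(uint32_t gen, uint32_t variant) {
    return (gen << 12) | variant;
}

static uint32_t factory_gen(uint32_t f)     { return f >> 12; }
static uint32_t factory_variant(uint32_t f) { return f & 0xFFF; }
```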

sm_88 note: Registered in ptxas with CUDA_ARCH=880 and codegen factory 28676, but no public product ships on this SM. It may represent an unreleased Ampere derivative or internal test target.

SM 82 -- Internal Ampere Target

sm_82 is an undocumented internal Ampere target present in the base validation table (unk_1D16220, entry [20]) but not registered in the profile constructor sub_6765E0. It has no capability dispatch handler, no profile object, and no handler functions in any of the 7 hash maps. It exists in ptxas solely as a validation table entry and as the SASS opcode generation boundary.

Validation table entry: {82, 6, 2} -- sm_82, PTX 6.2
PTX ISA requirement: 6.2 (anomalously low -- see below)
Profile object: None -- not registered in sub_6765E0
Capability handlers: None -- not registered in sub_607DB0
SASS opcode role: SM82_FIRST (index 172) through SM82_LAST (index 193)

PTX 6.2 Anomaly

sm_82 requires PTX ISA version 6.2, which is lower than both its neighbors:

| SM | PTX ISA | CUDA Toolkit |
|---|---|---|
| sm_75 | 6.3 | CUDA 10.0 |
| sm_80 | 7.0 | CUDA 11.0 |
| sm_82 | 6.2 | -- |
| sm_86 | 7.1 | CUDA 11.1 |

PTX 6.2 corresponds to CUDA 10.1 (Turing era). This backward version number strongly suggests sm_82 was created as an early Ampere development target -- a PTX-level placeholder added before the Ampere PTX ISA (7.0) was defined. The validation table entry was never removed, but no profile object was ever created for it.

SASS Opcode Boundary Role

sm_82's primary significance in ptxas is as the opcode generation boundary for Ampere SASS instructions. The opcode hierarchy uses SM-number-based range labels:

SM82_FIRST  = index 172   (first Ampere-era SASS opcode)
SM82_LAST   = index 193   (last opcode in the sm_82 range)

These 22 opcode slots (indices 172--193) cover the core Ampere SASS additions:

| Opcodes | Category |
|---|---|
| GATHER, GENMETADATA, SPMETADATA | Sparse MMA infrastructure |
| BMMA_88128, BMMA_168128, BMMA_168256 | Binary tensor core MMA shapes |
| DMMA | FP64 tensor core MMA (re-introduced at index 215 for Hopper) |
| HMMA_SP_1688, HFMA2_MMA, HMNMX2 | FP16 sparse/packed operations |
| IMMA_88, IMMA_SP_88, IMMA_16816, IMMA_16832, IMMA_SP_16832 | Integer tensor core MMA shapes |
| ARRIVES, LDGDEPBAR, LDGSTS | Async copy and barrier infrastructure |
| REDUX | Warp-wide reduction |
| CLMAD | Carry-less multiply-add (GF(2) arithmetic) |

The name SM82_FIRST/SM82_LAST is used as the boundary label even though these instructions are available on sm_80+ (any codegen factory >= 28673). The "82" in the label refers to the internal target used during Ampere development, not to a minimum SM requirement for the opcodes themselves.

Why sm_82 Matters

sm_82 is a ghost target: it occupies a validation table slot and lends its name to an opcode range, but cannot be compiled for. Passing --gpu-name sm_82 to ptxas would pass the initial validation check (bsearch succeeds in the base table) but fail during profile construction because sub_6765E0 has no case for SM 82. The practical consequence is that sm_82 is a naming artifact preserved from Ampere development, not a usable compilation target.
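
The two-stage failure can be modeled as a sketch. The table below mirrors a few of the recovered {sm, ptx_major, ptx_minor} entries, and has_profile_ctor stands in for the sub_6765E0 switch; this is an illustration of the mechanism, not the actual ptxas code:

```c
#include <stdlib.h>

typedef struct { int sm, ptx_major, ptx_minor; } sm_entry;

/* Subset of the base validation table (unk_1D16220), sorted by SM. */
static const sm_entry base_table[] = {
    {75, 6, 3}, {80, 7, 0}, {82, 6, 2}, {86, 7, 1},
};

static int cmp_sm(const void *k, const void *e) {
    return *(const int *)k - ((const sm_entry *)e)->sm;
}

/* Stage 1: bsearch over the validation table -- sm_82 passes this. */
static int in_validation_table(int sm) {
    return bsearch(&sm, base_table, sizeof base_table / sizeof *base_table,
                   sizeof *base_table, cmp_sm) != NULL;
}

/* Stage 2: profile construction -- sm_82 has no case here. */
static int has_profile_ctor(int sm) {
    switch (sm) {
    case 75: case 80: case 86: return 1;
    default: return 0;
    }
}
```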

Profile Object Construction

Each SM's intrinsic table initializer (Map 3 handler) calls sub_917990 to allocate a 1,936-byte profile object, then populates architecture-specific fields.

Source: decompiled sub_60A2E0 (sm_75), sub_60A3E0 (sm_80), sub_60AC30 (sm_86), sub_60AD30 (sm_87), sub_60AB30 (sm_88).

All five initializers are structurally identical. The only field that differs between them is offset +348 (codegen factory). Common initialization:

v6 = sub_917990(a3);              // allocate 1936-byte profile object
*(_DWORD *)(v6 + 344) = 0x100000; // shared memory config base (1 MB)
*(_DWORD *)(v6 + 348) = FACTORY;  // codegen factory (per-SM)
v7[147] = 0;                      // cleared
v7[421] = cli_option_value;       // from a1+108
v7[460] = 1;                      // enable flag
v7[482] = 1;                      // enable flag
v7[470] = 1;                      // enable flag
v7[177] = 1;                      // enable flag

The sub_917990 base constructor sets the default codegen factory to 0x2000 (8192), which every per-SM initializer immediately overwrites. Key base constructor fields:

| Offset | Default | Content |
|---|---|---|
| +348 | 0x2000 | Codegen factory (overridden) |
| +588 | 0 | Cleared |
| +1892 | 2 | Mode/config value |
| +1832--1848 | xmmword | SSE-loaded constant block |
| +1908--1924 | xmmword | SSE-loaded constant block |

Handler Dispatch

ptxas registers per-SM handler functions into 7 parallel hash maps via sub_607DB0. For sm_75--88, all handlers are thin wrappers around shared codegen infrastructure.

Map 1 (Handler A) and Map 2 (Handler B)

| SM | Handler A | Handler B |
|---|---|---|
| sm_75 | sub_609B70 | sub_609B40 |
| sm_80 | sub_609CC0 | sub_609C90 |
| sm_86 | sub_609D50 | sub_609D80 |
| sm_87 | sub_609F00 | sub_609DE0 |
| sm_88 | sub_609E70 | sub_609EA0 |

Every Handler A function is identical:

bool handler_A(int64_t a1, int64_t a2) {
    *(int32_t*)(a2 + 100) = read_option(a2, "cpf_optx");
    return sub_663C30(a2, 0) != 0;  // a2=0: sets *(a1+96) = 1
}

Every Handler B function is identical except the second argument:

bool handler_B(int64_t a1, int64_t a2) {
    *(int32_t*)(a2 + 100) = read_option(a2, "cpf_optx");
    return sub_663C30(a2, 1) != 0;  // a2=1: skips the flag set
}

The "cpf_optx" option controls OptiX IR compilation mode. sub_663C30 is the core codegen-factory-aware driver that delegates to sub_662920 (ELF section iteration, 26KB) and sub_7FBB70 (actual compilation pass). The only behavioral difference between Handler A and Handler B: when called as Handler A (arg=0), sub_663C30 sets *(a1 + 96) = 1 before processing -- likely a "primary pass" flag.

Map 3 (Intrinsic Table Initializer)

| SM | Initializer | Notes |
|---|---|---|
| sm_75 | sub_60A2E0 | Factory 24577 |
| sm_80 | sub_60A3E0 | Factory 28673 |
| sm_86 | sub_60AC30 | Factory 28674 |
| sm_87 | sub_60AD30 | Factory 28675 |
| sm_88 | sub_60AB30 | Factory 28676 |

Maps 6--7 (Performance / Occupancy)

| Map | sm_75 Handler | Purpose |
|---|---|---|
| 6 | sub_608D50 | Perf-stats / occupancy handler E |
| 7 | sub_6096E0 | Perf-stats / occupancy handler F |

These handlers return per-SM occupancy parameters used by the driver API's occupancy calculator. Other Ampere SMs have their own handlers registered into the same maps (addresses not fully traced in sweep data).

Scheduler Profile

sub_8E4400 (InitHWProfile_Warp) uses the codegen factory value to select scheduling parameters. The dispatch is a linear threshold cascade:

| Factory Range | Warps | Dispatch Slots | Architecture Class |
|---|---|---|---|
| <= 20479 | 4 | 96 | Kepler (sm_30) |
| <= 24575 | 6 | 176 | Pascal (sm_60) |
| <= 28672 | 7 | 192 | Volta (sm_70) |
| <= 32767 | 7 | 208 | Turing / Ampere (sm_75--88) |
| <= 36863 | 8 | 224 | Hopper (sm_90) |
| > 36863 | 16 | 240 | Blackwell (sm_100+) |

All SM 75--88 targets fall into the 7-warp / 208-slot bucket. After the warp count, a secondary switch maps specific codegen factory values to sub-architecture variants (stored at a1+26):

| Codegen Factory | Variant | SM |
|---|---|---|
| 24576, 32768, 36864 | 0 | sm_60, sm_90, sm_100 (base) |
| 8193, 20481, 28674 | 2 | sm_30, sm_50, sm_86 |
| 28675, 36867 | 3 | sm_87, sm_103 |
| 28676, 36868 | 4 | sm_88, sm_110 |
| 28677, 36869 | 5 | sm_89, sm_121 |

sm_75 (24577) and sm_80 (28673) are absent from the variant table and fall through to the default variant (0 or 1). This means sm_75 and sm_80 use the baseline latency model, while sm_86--88 get tuned sub-architecture parameters.
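
The cascade transcribes to a simple chain of comparisons. This is an illustrative model of the recovered thresholds, not decompiled code, and the names are ours:

```c
typedef struct { int warps, slots; } warp_geom;

/* Linear threshold cascade from sub_8E4400, per the recovered ranges. */
static warp_geom hw_warp_geometry(unsigned factory) {
    if (factory <= 20479) return (warp_geom){ 4,  96};  /* Kepler        */
    if (factory <= 24575) return (warp_geom){ 6, 176};  /* Pascal        */
    if (factory <= 28672) return (warp_geom){ 7, 192};  /* Volta         */
    if (factory <= 32767) return (warp_geom){ 7, 208};  /* Turing/Ampere */
    if (factory <= 36863) return (warp_geom){ 8, 224};  /* Hopper        */
    return (warp_geom){16, 240};                        /* Blackwell     */
}
```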

HW Latency Tables

Each SM has a dedicated latency table function containing per-opcode pipeline stage assignments, stall cycle costs, and functional unit mappings. These are called from the scheduling infrastructure to drive instruction ordering.

| Function | Size | SM | Notes |
|---|---|---|---|
| sub_8E7720 | 3.5 KB | sm_75 | Turing baseline |
| sub_8E7940 | 2.9 KB | sm_80 (base) | Ampere shared layer |
| sub_8E7B40 | 3.3 KB | sm_80 | Ampere full table |
| sub_8E7D80 | 4.4 KB | sm_86 | Largest in Ampere family |
| sub_8E8070 | 3.5 KB | sm_87 | Orin tuning |
| sub_8E8280 | 3.1 KB | sm_89 | Ada Lovelace |

sm_80 uniquely has two latency tables: a "base" table (sub_8E7940, 2.9KB) and a full table (sub_8E7B40, 3.3KB), suggesting a layered lookup where the base provides defaults and the full table overrides specific entries. sm_86's table is the largest at 4.4KB, likely because RTX 30xx consumer GPUs have different pipeline characteristics from A100 datacenter parts.

No separate table entry was found for sm_88 in the sweep data. It may share sm_86 or sm_87's latency profile, or be registered through a path not captured in the sweep.

SASS Instruction Encoding

SM 75--88 all use the 128-bit per-instruction encoding format introduced with Turing. This replaced the Volta/Pascal scheme where scheduling control was packed into a separate 64-bit control header shared by 3 instructions.

Control Word Layout

Each 128-bit SASS instruction carries a 23-bit control word encoding scheduling decisions. The control word is generated by sub_A36360 (52KB, the control-word / scoreboard encoder).

bits [0:3]   stall count       (4 bits, max 15 cycles)
bit  [4]     yield flag        (1 bit, warp scheduler hint)
bits [5:7]   write barrier idx (3 bits, selects 1 of 6 barriers)
bits [8:13]  read barrier mask (6 bits, one per barrier)
bits [14:19] wait barrier mask (6 bits, one per barrier)
bits [20:25] reuse flags       (6 bits, per source operand)
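
An illustrative pack function for the fields listed above. The real encoder is the 52KB sub_A36360; this only models the bit layout, and the struct and function names are ours:

```c
#include <stdint.h>

typedef struct {
    unsigned stall;      /* bits [0:3]   stall count */
    unsigned yield;      /* bit  [4]     yield flag */
    unsigned wr_bar;     /* bits [5:7]   write barrier index */
    unsigned rd_mask;    /* bits [8:13]  read barrier mask */
    unsigned wait_mask;  /* bits [14:19] wait barrier mask */
    unsigned reuse;      /* bits [20:25] operand reuse flags */
} ctrl_word;

static uint32_t pack_ctrl(ctrl_word c) {
    return  (c.stall     & 0xF)
         | ((c.yield     & 0x1)  << 4)
         | ((c.wr_bar    & 0x7)  << 5)
         | ((c.rd_mask   & 0x3F) << 8)
         | ((c.wait_mask & 0x3F) << 14)
         | ((c.reuse     & 0x3F) << 20);
}
```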

Instruction Word Structure

128-bit instruction word:
  bits [0:3]     = 0x2       (format code: 128-bit)
  bits [4:6]     = slot      (scheduling group slot)
  bits [8:16]    = MAJOR     (9-bit major opcode, 0x00-0x171)
  bits [17:24]   = MINOR     (8-bit minor opcode / variant)
  bits [25:31]   = SUBOP     (7-bit sub-opcode / format ID)
  bits [48+]     = MODIFIERS (format-dependent modifier fields)
  bits [132:134] = 0x0       (extended opcode flag, at offset 0x84)

The 3-level opcode hierarchy (major/minor/subop) allows up to 102 major opcodes with 48 sub-operations each. Maximum observed variant value: 0x2F (47 decimal).

Scoreboard / Dependency Barriers

SM 75--88 provide 6 hardware dependency barriers per warp. The scoreboard tracker is managed by sub_8E4920 (BuildScoreboardEntries, 6.9KB) and encoded by sub_A36360.

| Resource | Width | Range | Notes |
|---|---|---|---|
| Write barrier index | 3 bits | 0--5 active, 6--7 reserved | Assigns instruction to a barrier |
| Read barrier mask | 6 bits | 1 bit per barrier | Indicates which barriers to check before read |
| Wait barrier mask | 6 bits | 1 bit per barrier | Indicates which barriers to wait on |

The scoreboard tracker allocates 952 bytes per function when bit 4 of the flag byte at offset +1385 is set. An additional 856-byte bitset is allocated when bit 8 is also set (for barrier register tracking in writeback mode).

The scoreboard infrastructure is shared across all SM 75--88 targets. The barrier count (6) is constant for this entire range. sm_90 (Hopper) potentially increases this, and sm_100+ (Blackwell) changes the barrier model further.

Intrinsic Table

Intrinsic availability is cumulative. Each generation adds to the previous.

sm_75 Baseline (IDs 0x89--0x1FA, 370 intrinsics)

sm_75 inherits the full sm_70 (Volta) intrinsic set labeled __cuda_sm70_*:

| Category | Intrinsics | PTX Operations |
|---|---|---|
| Named barriers | barrier_arrive/red/sync (0--15) | bar.arrive, bar.red.{and,or,popc}, bar.sync |
| Warp shuffle | shflsync_bfly/down/idx/up | shfl.sync.{bfly,down,idx,up} |
| Warp vote | votesync_all/any/ballot/uni | vote.sync.{all,any,ballot,uni} |
| Warp match | matchsync_all/any_b32/b64 | match.sync.{all,any}.b{32,64} |
| Warp sync | warpsync | bar.warp.sync |
| Redux | reduxsync_* (IDs 0x01--0x11) | redux.sync.{and,or,xor,min,max,add} |
| WMMA | m16n16k16, m32n8k16, m8n32k16 | wmma.{load,store,mma} |

WMMA intrinsics cover all combinations of:

  • Shapes: m16n16k16, m32n8k16, m8n32k16
  • Operations: load_a, load_b, load_c, store_d, mma
  • Layouts: row/col combinations
  • Types: f16, f32, with/without satfinite
  • Address spaces: generic, global, shared

sm_80 Additions (IDs 0x1FB--0x22F, 53 intrinsics)

sm_80 adds two intrinsic groups:

14 __cuda_sm80_* intrinsics (IDs 0x1FB--0x208):

| Intrinsic | PTX Operation | Notes |
|---|---|---|
| createpolicy_range | createpolicy.range | L2 cache persistence control |
| createpolicy_fractional | createpolicy.fractional | L2 cache fraction control |
| createpolicy_cvt | -- | Cache policy conversion |
| mma_bf16_* | mma.sync with .bf16 | BF16 tensor core MMA |
| mma_tf32_* | mma.sync with .tf32 | TF32 tensor core MMA |
| mma_s4_* | mma.sync with .s4 | INT4 tensor core MMA |
| mma_s8_* | mma.sync with .s8 | INT8 tensor core MMA |
| mma_b1_* | mma.sync with .b1 | Binary tensor core MMA |

39 __cuda_sm_8x_mma_* intrinsics (IDs 0x209--0x22F):

Extended MMA shapes and sparse variants for the 2nd/3rd generation tensor core.

Intrinsic Gate Mechanism

The per-SM intrinsic table initializer (Map 3 handler) controls which intrinsics are available. sm_75 registers only the sm_70 intrinsic block. sm_80+ additionally registers the sm_80 and sm_8x blocks. The gate is not a runtime check -- it is a registration-time decision: if the intrinsic's handler function is not registered for a given SM, the PTX parser emits an error when it encounters a call to that intrinsic.

Peephole Optimizer Gates

The SASS-level peephole optimizer (sub_83EF00, 4,858 lines decompiled) applies pattern-matching transformations to the instruction stream. Several transformations are gated by codegen factory thresholds.

FMA/DFMA Combining

// sub_83EF00, case 0x50 (opcode 80 = FMA/DFMA)
if (*(_QWORD *)(*(_QWORD *)(a1 + 1584) + 372) > 28673) {
    // FMA combining enabled: sm_86+ (codegen factory 28674+)
    // Look for CVT -> FMA patterns and combine
}

Gate: codegen_factory > 28673. This means:

  • sm_75 (24577): FMA combining disabled
  • sm_80 (28673): FMA combining disabled (equality fails the > check)
  • sm_86 (28674): FMA combining enabled
  • sm_87 (28675): FMA combining enabled
  • sm_88 (28676): FMA combining enabled

This is a significant compiler optimization difference between A100 (sm_80) and RTX 30xx (sm_86). A100 code does not get FMA/DFMA combining, possibly because A100's pipeline already has hardware FMA fusion or because the combining transformation is not profitable on GA100 silicon.

Master Gate

All peephole passes check capability 487 ("enable peephole") and capability 350 (per-instruction gate) before applying any transformation. These capabilities are controlled by optimization level and user options, not by SM target.

BB Initialization Flags

The basic block initializer sub_6E8EB0 sets architecture-specific flags using a secondary SM version encoding (distinct from the codegen factory):

| SM Version (secondary) | Hex | Flags Set | Architecture |
|---|---|---|---|
| 20480 | 0x5000 | bits 1, 8 | sm_80 encoding space |
| 20484 | 0x5004 | bits 16, 64 | sm_84 encoding space |

This secondary encoding uses (gen << 12) with generation 5 for Ampere in the BB init context. The specific bit flags control opcode descriptor table population -- each BB gets a set of 40+ (opcode_id, encoding_word) pairs that define which SASS instructions are legal in that basic block.
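
The secondary encoding follows the same shift-and-or shape as the codegen factory; a minimal check of the constants above (helper name ours):

```c
/* Secondary (BB-init) SM encoding: (gen << 12) | variant, with
   generation 5 used for Ampere in the sub_6E8EB0 context. */
static unsigned bb_sm_code(unsigned gen, unsigned variant) {
    return (gen << 12) | variant;
}
```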

Feature Comparison

| Feature | sm_75 | sm_80 | sm_86 | sm_87 | sm_88 |
|---|---|---|---|---|---|
| ISA generation | 6 | 7 | 7 | 7 | 7 |
| Codegen factory | 24577 | 28673 | 28674 | 28675 | 28676 |
| WMMA (1st gen TC) | Yes | Yes | Yes | Yes | Yes |
| MMA bf16/tf32 (2nd gen TC) | -- | Yes | Yes | Yes | Yes |
| MMA s4/s8/b1 extended | -- | Yes | Yes | Yes | Yes |
| createpolicy (L2 cache) | -- | Yes | Yes | Yes | Yes |
| cp.async (async copy) | -- | Yes | Yes | Yes | Yes |
| sm_8x MMA intrinsics (39) | -- | Yes | Yes | Yes | Yes |
| FMA/DFMA peephole combining | -- | -- | Yes | Yes | Yes |
| Scheduler: warps / slots | 7/208 | 7/208 | 7/208 | 7/208 | 7/208 |
| Sub-arch variant | default | default | 2 | 3 | 4 |
| Separate HW latency table | Yes | Yes (2) | Yes | Yes | ? |
| Family string | Turing | Ampere | Ampere | Ampere | Ampere |

Hardware Resource Geometry

Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.

| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_75 | 65,536 | 255 | 1,024 | 32 | 16 | 7 | 208 | 32 / 48 / 64 KB | 90% |
| sm_80 | 65,536 | 255 | 2,048 | 64 | 32 | 7 | 208 | 48 / 100 / 132 / 164 KB | 90% |
| sm_86 | 65,536 | 255 | 1,536 | 48 | 16 | 7 | 208 | 48 / 100 KB | 90% |
| sm_87 | 65,536 | 255 | 1,536 | 48 | 16 | 7 | 208 | 48 / 100 / 164 KB | 90% |
| sm_88 | 65,536 | 255 | 1,536 | 48 | 16 | 7 | 208 | (same as sm_86) | 85% |

Column definitions:

  • Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
  • Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
  • Max Threads/CTA: Maximum threads per cooperative thread array (block).
  • Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
  • Max CTAs/SM: Maximum concurrent CTAs per SM.
  • Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
  • Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.

All Turing/Ampere targets share the 7-partition / 208-slot scheduling geometry. The major resource difference is sm_80 (A100 datacenter) with 2,048 max threads and 64 warps vs. the consumer/embedded parts (sm_86--88) with 1,536 max threads and 48 warps. sm_75 (Turing) is the most constrained with 1,024 max threads and 32 warps.
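
The occupancy impact of these limits can be sketched with a warp-capacity calculation. This is a simplification that ignores register and shared memory pressure, not the recovered sub_ABF250 logic, and the names are ours:

```c
typedef struct { int warps_per_sm, max_ctas; } sm_geom;

/* CTAs per SM bounded by warp capacity and the per-SM CTA limit. */
static int ctas_by_warps(sm_geom g, int threads_per_cta) {
    int warps_per_cta = (threads_per_cta + 31) / 32;  /* round up to warps */
    int by_warps = g.warps_per_sm / warps_per_cta;
    return by_warps < g.max_ctas ? by_warps : g.max_ctas;
}
```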

Codegen Factory Gating Patterns

Throughout ptxas, the codegen factory value at profile offset +348 (or equivalently *(*(QWORD*)(a1+1584)+372) from the function context) is used to gate features. Common patterns for the Turing/Ampere range:

// Pattern 1: Generation check (factory >> 12)
int gen = codegen_factory >> 12;
if (gen >= 7) { /* Ampere+ feature */ }

// Pattern 2: Exact threshold
if (codegen_factory > 28673) { /* sm_86+ only */ }
if (codegen_factory >= 28673) { /* sm_80+ */ }

// Pattern 3: Range check
if (codegen_factory >= 24577 && codegen_factory < 32768) {
    /* Turing/Ampere only, not Hopper+ */
}

The >> 12 shift extracts the ISA generation, allowing coarse checks (Turing = 6, Ampere = 7). Fine-grained checks compare against specific factory values to distinguish sm_80 from sm_86, etc.

Function Map

| Address | Size | Identity | SM | Confidence |
|---|---|---|---|---|
| sub_609B70 | ~48B | Handler A (cpf_optx + compile) | sm_75 | 99% |
| sub_609B40 | ~48B | Handler B (cpf_optx + compile) | sm_75 | 99% |
| sub_60A2E0 | ~300B | Intrinsic table initializer | sm_75 | 95% |
| sub_609CC0 | ~48B | Handler A | sm_80 | 99% |
| sub_609C90 | ~48B | Handler B | sm_80 | 99% |
| sub_60A3E0 | ~300B | Intrinsic table initializer | sm_80 | 95% |
| sub_609D50 | ~48B | Handler A | sm_86 | 99% |
| sub_609D80 | ~48B | Handler B | sm_86 | 99% |
| sub_60AC30 | ~300B | Intrinsic table initializer | sm_86 | 95% |
| sub_609F00 | ~48B | Handler A | sm_87 | 99% |
| sub_609DE0 | ~48B | Handler B | sm_87 | 99% |
| sub_60AD30 | ~300B | Intrinsic table initializer | sm_87 | 95% |
| sub_609E70 | ~48B | Handler A | sm_88 | 99% |
| sub_609EA0 | ~48B | Handler B | sm_88 | 99% |
| sub_60AB30 | ~300B | Intrinsic table initializer | sm_88 | 95% |
| sub_608D50 | ~200B | Perf/occupancy handler E | sm_75 | 85% |
| sub_6096E0 | ~200B | Perf/occupancy handler F | sm_75 | 85% |
| sub_663C30 | ~400B | Core codegen driver (shared) | all | 90% |
| sub_662920 | 26KB | ELF section iteration (shared) | all | 85% |
| sub_917990 | ~300B | Profile object constructor | all | 90% |
| sub_8E7720 | 3.5KB | HW latency table | sm_75 | 90% |
| sub_8E7940 | 2.9KB | HW latency table (base layer) | sm_80 | 90% |
| sub_8E7B40 | 3.3KB | HW latency table (full) | sm_80 | 90% |
| sub_8E7D80 | 4.4KB | HW latency table | sm_86 | 90% |
| sub_8E8070 | 3.5KB | HW latency table | sm_87 | 90% |
| sub_8E4400 | 3.3KB | InitHWProfile_Warp (shared) | all | 90% |
| sub_83EF00 | ~100KB | Primary peephole optimizer | all | 85% |
| sub_A36360 | 52KB | Control word / scoreboard encoder | all | 85% |
| sub_8E4920 | 6.9KB | BuildScoreboardEntries | all | 85% |

Cross-References

Ada Lovelace & Hopper (SM 89--90a)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas handles SM 89 (Ada Lovelace -- RTX 4090, L40S) and SM 90/90a (Hopper -- H100, H200) as adjacent but architecturally distinct targets. Ada shares the Ampere codegen factory (28673) and is stored internally as "Ampere"-family despite being a different microarchitecture. Hopper gets its own codegen factory (32768), its own family string "Hopper", and introduces the largest single-generation feature addition in ptxas: WGMMA, thread-block clusters, TMA, setmaxnreg, and distributed shared memory.

| | SM 89 (Ada) | SM 90 (Hopper) | SM 90a (Hopper accel) |
|---|---|---|---|
| Products | RTX 4090, RTX 4080, L40S, L4 | H100, H200 | H100, H200 (arch-locked) |
| Family string | "Ampere" | "Hopper" | "Hopper" |
| __CUDA_ARCH__ | 890 | 900 | 90a0 |
| Codegen factory | 28673 (7 << 12 \| 1) | 32768 (8 << 12) | 32768 |
| Handler A | sub_609E10 | sub_609DB0 | sub_609DB0 (shared) |
| Handler B | sub_609CF0 | sub_609C00 | sub_609C00 (shared) |
| Intrinsic init | sub_60A810 | sub_60A5F0 | sub_60A5F0 (shared) |
| HW latency table | sub_8E8280 (3.1KB) | sub_8E8480 (5.2KB) | sub_8E8780 (4.6KB) |
| Suffix variants | None | a only (no f) | -- |
| Forward compat | Full (runs on sm_90+) | Full (base variant) | None (locked to H100/H200) |

The "a" Suffix -- Architecture-Accelerated

SM 90a is the first target to use the a suffix. It appears in the accelerated validation table (unk_1D161C0, 7 entries). The meaning is precise: sm_90a SASS executes only on the specific silicon it was compiled for (H100/H200) and will not run on any future architecture. This trades forward compatibility for access to features that may not survive to the next generation.

In ptxas, sm_90 and sm_90a share all 7 dispatch-table handler functions. The a suffix does not produce different handler code paths -- it produces different compatibility metadata in the output cubin. The ELF header records whether the binary is forward-compatible (base) or arch-locked (accelerated), and the CUDA driver enforces this at load time.

Compilation rules from ptxas help text:

  • sm_90a PTX must be compiled to sm_90a SASS (no cross-arch)
  • sm_90 PTX can compile to sm_90 or any later SASS target
  • No sm_90f variant exists; the f suffix starts with Blackwell

SM 89 -- Ada Lovelace

Internal Classification

Ada is classified as Ampere-derived in the binary. The profile object constructed by sub_6765E0 stores "Ampere" as the family name and uses codegen factory value 28673 -- identical to sm_80 through sm_88. The compiler distinguishes Ada from Ampere through per-SM capability accessor functions, not through the factory ID.

The encoded SM version for sm_89 is 28677 (7 << 12 | 5), placing it as the 5th variant in the Ampere generation:

| Encoded Value | SM | Variant |
|---|---|---|
| 28673 | sm_80 | 7 << 12 \| 1 (base Ampere) |
| 28674 | sm_86 | 7 << 12 \| 2 |
| 28675 | sm_87 | 7 << 12 \| 3 |
| 28676 | sm_88 | 7 << 12 \| 4 |
| 28677 | sm_89 | 7 << 12 \| 5 (Ada) |

Ada-Specific Features

Ada introduces 4th-generation tensor cores with FP8 (E4M3, E5M2) support. In the PTX validator, sub_4A6050 explicitly references "%s on sm_89" in its CVT instruction validation, confirming sm_89-specific conversion rules for new data types.

The intrinsic table initializer for sm_89 (sub_60A810) enables 39 additional MMA intrinsics:

| Intrinsic ID Range | Count | Category |
|---|---|---|
| 0x209--0x22F | 39 | __cuda_sm_8x_mma_* -- extended MMA shapes and types |

These intrinsics cover FP8 MMA operations, block-scale MMA, and additional type combinations beyond what sm_80--88 support. The MMA validator at sub_49BBA0 checks for "mma with FP8 floating point type" and validates against the target SM version.

Ada Scheduling Profile

The HW latency table for sm_89 (sub_8E8280, 3.1KB) is smaller than Hopper's (5.2KB), reflecting Ada's simpler pipeline structure -- no WGMMA async pipeline, no cluster operations. The register file geometry from sub_8E4400:

| Parameter | Value | Notes |
|---|---|---|
| Warps per scheduler | 8 | Threshold: encoded SM <= 36863 |
| Dispatch slots | 224 | Same as sm_80 class |
| Sub-architecture variant | 5 | From encoded value 28677 |

SM 90 / SM 90a -- Hopper

Profile Construction

Hopper is the first SM to get its own codegen factory value (32768 = 8 << 12) since the Turing/Ampere split. The profile object stores "Hopper" as the family name. Key profile fields:

| Field | sm_90 | sm_90a |
|---|---|---|
| SM name | "sm_90" | "sm_90a" |
| Compute name | "compute_90" | "compute_90a" |
| LTO name | "lto_90" | "lto_90a" |
| CUDA_ARCH | 900 | 90a0 |
| Family | "Hopper" | "Hopper" |
| Codegen factory | 32768 | 32768 |

Hopper Scheduling Profile

The HW latency tables for Hopper are substantially larger than any previous architecture, reflecting the async pipeline and tensor core scheduling complexity:

| Function | Size | Target | Notes |
|---|---|---|---|
| sub_8E8480 | 5.2KB | sm_90 | Base Hopper latency model |
| sub_8E8780 | 4.6KB | sm_90a | Arch-accelerated variant |

sm_90a gets its own latency table (4.6KB) distinct from sm_90 (5.2KB), even though they share all dispatch handler functions. This is the only architecture where base and a variants have separate scheduling profiles -- all Blackwell variants share their tables within each base SM.

Register file geometry from sub_8E4400:

Parameter | Value | Notes
Warps per scheduler | 16 | Threshold: encoded SM > 32767 (32768 qualifies)
Dispatch slots | 240 | Maximum -- 2x the sm_80 class
Sub-architecture variant | 0 | From encoded value 32768 (base variant)
Max threads/CTA | 240 | From code object builder sub_A465F0

The jump from 8 warps / 224 slots (sm_89) to 16 warps / 240 slots (sm_90) is the largest warp geometry change in the binary. This doubling of warp capacity directly corresponds to Hopper's 4-warp warpgroup execution model.

Hopper Intrinsics

The intrinsic initializer for sm_90 (sub_60A5F0) enables 38 sub-byte MMA intrinsics:

Intrinsic ID Range | Count | Category
0x23A--0x25F | 38 | __cuda_sm_9x_mma_sub_byte_internal_*

These cover sparse sub-byte MMA operations: s4/u4 sparse variants for m16n8k32, m16n8k64, and m16n8k128 shapes. These are Hopper-specific and do not appear in the Ada (sm_89) intrinsic table.

WGMMA -- Warpgroup Matrix Multiply-Accumulate

WGMMA is Hopper's signature instruction. It operates on warpgroups (4 consecutive warps) rather than single warps, and executes asynchronously -- the tensor core operates in parallel with the warp's regular instruction stream. ptxas handles WGMMA through four PTX instructions and a dedicated compiler pass infrastructure.

PTX Instructions

Registered in the opcode dispatch table at sub_5D4190:

PTX Instruction | Codegen Handler | Formatter | Formatter Size
wgmma.mma_async | sub_50AC70 | sub_4DA380 | 295B
wgmma.fence | sub_4DA380 | sub_4DA4B0 | 295B
wgmma.commit_group | sub_4DA4B0 | sub_4DA5E0 | 311B
wgmma.wait_group | sub_4DA5E0 | sub_505B00 | 1066B

The formatters are the smallest named formatters in ptxas (295 bytes), reflecting WGMMA's compact text representation. wgmma.wait_group is significantly larger (1066B) because it must encode the pipeline depth parameter.

GMMA Pipeline Pass Infrastructure

The WGMMA pipeline optimizer is the largest single-architecture compiler subsystem in ptxas, spanning approximately 100KB of code across 15+ functions in the range 0xACE000--0xAE6000. It is active only for SM 90+ targets.

Call chain:

sub_AE4F70  (coordinator -- outside primary range)
 +-- sub_ACE480   (22.7KB)  WGMMA serialization warning emitter
 +-- sub_ADEB40   (43.1KB)  warpgroup.arrive/wait fence insertion
 +-- sub_AE17C0   (37.9KB)  pipeline stage builder
 |    +-- sub_AE0D20  (16.8KB)  live range builder
 |    +-- sub_AE06F0           GMMA operand classifier
 +-- sub_ADDDF0   (20.6KB)  pass entry (vtable dispatch)
      +-- sub_ADCA60  (21.7KB)  scheduling coordinator
           +-- sub_ADBD30  (23.9KB)  register pressure estimator
           |    +-- sub_ADAD60  (8.4KB)  live range limiter
           |    +-- sub_AD9C20  (14.4KB) register class allocator
           +-- sub_AD70B0  (22.6KB)  operand register assignment

Warpgroup Synchronization Injection

The fence insertion pass (sub_ADEB40, 43.1KB) scans for wgmma.mma_async operations and automatically injects warpgroup.arrive and warpgroup.wait instructions. These fences manage register ownership when asynchronous tensor core operations are in flight -- the hardware requires explicit handoff between the warpgroup's register file and the tensor core's accumulator registers.

Diagnostic messages emitted by the compiler:

  • "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'"
  • "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'"

WGMMA Serialization Warnings

When the compiler cannot pipeline WGMMA operations, sub_ACE480 (22.7KB, 98% confidence) emits detailed performance advisories using codes 0x1D55--0x1D57 (7509--7511 decimal). Nine distinct serialization reasons are enumerated:

Reason Code | Diagnostic Message
1 | "presence of Extern calls in the function"
2 | "wgmma pipeline crossing function boundary"
3 | "insufficient register resources for the wgmma pipeline"
4 | "program dependence on compiler-inserted warpgroup"
5 | "ill formed pipeline stage in the function"
6 | "non wgmma instructions defining accumulator registers"
7 | "non wgmma instructions reading accumulator registers"
8 | "non wgmma instructions defining input registers"
9 | "insufficient register resources for the function"

All messages are prefixed with "Potential Performance Loss: wgmma.mma_async instructions are serialized due to ...". The pass reads its configuration from offsets +26280 (1-byte enable) and +26288 (dword threshold) on the compilation context.

GMMA Live Range Management

The live range limiter (sub_ADAD60) enforces a maximum on simultaneously active live ranges within GMMA sequences:

"GMMA sequence has too many active live ranges (%d), reduce it to bring it under (%d)"

When the threshold is exceeded, the system triggers register spilling or sequence splitting through sub_ADBD30 (register pressure estimator). The live range builder uses FNV-1a hashing (constants 16777619 and 0x811C9DC5) for instruction deduplication.
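The constants quoted above are the standard 32-bit FNV-1a offset basis (0x811C9DC5) and prime (16777619), so the deduplication hash can be modeled directly. A minimal sketch, assuming the hash is applied to an instruction's serialized byte representation (the function name is ours):

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = 0x811C9DC5                       # offset basis, as recovered from the binary
    for b in data:
        h ^= b
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h

# Identical encodings hash identically, which is all deduplication needs.
assert fnv1a_32(b"IMAD R0, R1, R2") == fnv1a_32(b"IMAD R0, R1, R2")
```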

Thread-Block Clusters

Hopper introduces the concept of a thread-block cluster -- a group of cooperating CTAs that can access each other's shared memory (distributed shared memory). ptxas adds several PTX directives and special registers to support this.

Cluster Directives

The directive validator (sub_4CE6B0, 48KB) enforces mutual exclusivity of cluster configuration:

".reqnctapercluster and .maxclusterrank cannot both be specified"

Two shared-memory state spaces are distinguished:

  • .shared::cta -- CTA-local shared memory (pre-Hopper behavior)
  • .shared::cluster -- distributed shared memory accessible across CTAs in a cluster

Cluster Special Registers

Registered in sub_61B850 (special register table initializer):

Special Register | Purpose
%clusterid | Cluster ID within the grid
%nclusterid | Number of clusters in the grid
%cluster_ctaid | CTA position within the cluster
%cluster_nctaid | Number of CTAs in the cluster
%cluster_ctarank | Linear rank of CTA within the cluster
%cluster_nctarank | Total CTAs in the cluster (linear)
%is_explicit_cluster | Whether this launch uses explicit clustering
%aggr_smem_size | Aggregate shared memory across cluster

Distributed Shared Memory Intrinsics

The intrinsic handler OCG_DshmemHandler at sub_6C60B0 validates distributed shared memory operations:

  • "Cannot use both the selfcast and the broadcast modifier."
  • "Either the selfcast or the broadcast modifier must be used."

TMA -- Tensor Memory Accelerator

The Tensor Memory Accelerator (TMA) provides hardware-accelerated bulk data movement between global and shared memory, using tensor descriptors to specify multi-dimensional copy patterns. In ptxas, TMA is exposed through cp.async.bulk.tensor.

cp.async.bulk.tensor Codegen

The codegen handler (sub_5AB460, 45KB) is one of the largest single-instruction handlers in ptxas:

Property | Value
Handler function | sub_5AB460
Size | 45KB
Buffer allocation | 50,000 bytes
Registered name | "cp.async.bulk.tensor"
Dimensionality | 1D through 5D
Modes | tile, im2col
Cast variants | unicast, multicast

The TMA intrinsic handler (OCG_CpAsyncTensorHandler at sub_6C8100) validates operand counts per mode:

  • "Must have 1 input with a1t0 and no multicast"
  • "Must have 2 inputs with a1t0 and multicast"
  • "Must have 2 input with a0tx and no multicast"
  • "Must have 3 inputs with a0tx and multicast"

Tensormap Instructions

The tensormap validator (sub_4A73C0, 10.8KB) handles tensor descriptor manipulation:

  • ".tile" mode validation
  • "Tensormap field with input value >= 13" / "with input value == 4" bounds checking
  • ".tensormap::generic" addressing mode
  • "Interger Immediate for ordinal" (sic -- typo preserved from binary)

Handler | Function | Size
cp.async.bulk | sub_593210 | 5.1KB (formatter)
cp.async.mbarrier.arrive | sub_4DC180 | --
OCG_CpAsyncBulkHandler | sub_6C3470 | 20KB
OCG_CpAsyncHandler | sub_6C2AE0 | 10KB

setmaxnreg -- Dynamic Register Allocation

Hopper introduces setmaxnreg (PTX opcode 315) for dynamic register count adjustment. This allows kernels to change their register footprint at runtime, enabling techniques like CTA reconfiguration and warpgroup-level resource management.

setmaxnreg Pass

The handler (sub_97EC60, ~3.5KB, 90% confidence) walks the instruction list looking for opcode 315 and processes or removes each occurrence. Five reasons for ignoring setmaxnreg are enumerated as "Potential Performance Loss" warnings:

Reason | Message
1 | "unable to determine register count at entry"
2 | "to maintain minimum register requirements"
3 | "to allow debugging"
4 | "to maintain compatibility across compilation units"
5 | "to maintain compatibility into 'extern' call"

The setmaxnreg handling mode is controlled by knob 653.

CTA Reconfig Pragmas

The pragma validator (sub_97F540, ~4KB) enforces ordering constraints on setmaxreg.alloc and setmaxreg.dealloc:

  • "Found an 'alloc' pragma after 'dealloc'"
  • "Found incompatible thread count re-specification"
  • "Found a 'dealloc' pragma after 'alloc'"

The dealloc validator (sub_98D100, ~4.8KB) enforces register count bounds:

  • "setmaxreg.dealloc/release has register count (%d) less than launch min target (%d)"
  • "setmaxnreg.dec has register count (%d) which is larger than the largest temporal register count"

Async Pipeline Features

Hopper extends the async copy pipeline introduced in Ampere with mbarrier (memory barrier) objects. ptxas implements mbarrier handling through a cluster of detector, classifier, and emitter functions:

Function | Address | Identity
MBarrierDetector::isNonTrivialMBarrier | sub_A94440 | Checks "%mbarrier_" prefix
MBarrierDetector::classifyMBarrier | sub_A9A5F0 | Returns packed (type << 32) | is_mbarrier
MBarrierDetector::resolveMBarrierBaseName | sub_A9A920 | Extracts base name from symbol
MBarrierEmitter::rewriteMBarrierOperand | sub_AA33C0 | Constructs "%%mbarrier_%s_%s"

The mbarrier type classifier returns an enumeration of 0--12 for different operation kinds, covering arrive, arrive_drop, expect_tx, and their counted variants.
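The packed return value from the classifier can be decoded with two shifts; a sketch under the recovered layout (the helper names are ours, the 0--12 type numbering is the binary's):

```python
def pack_mbarrier_class(mb_type: int, is_mbarrier: bool) -> int:
    """Pack as sub_A9A5F0 apparently does: type in the high dword, flag in the low."""
    return (mb_type << 32) | int(is_mbarrier)

def unpack_mbarrier_class(packed: int) -> tuple[int, bool]:
    """Recover (type, is_mbarrier) from the packed 64-bit value."""
    return (packed >> 32, bool(packed & 0xFFFFFFFF))

# Type 3 (one of the arrive variants), recognized as an mbarrier symbol.
p = pack_mbarrier_class(3, True)
assert unpack_mbarrier_class(p) == (3, True)
```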

Warpgroup Attribute Management

The warpgroup attribute processor (sub_64BAF0, 30KB) handles kernel-level attributes introduced with Hopper:

Attribute String | Purpose
"warpgroup-commit_batch" | WGMMA commit batch configuration
"warpgroup-wait" | WGMMA wait configuration
"warpgroup-arrive" | WGMMA arrive configuration
"setsmemsize-state" | Shared memory size pragma state
"setmaxreg-state" | Register limit pragma state
"func_begin" | Function entry marker
"CC-temp" | Calling convention temporary

Architecture Version Threshold Checks

The binary uses the encoded SM version (codegen factory value) for feature gating throughout the compiler. Key thresholds observed:

Check Pattern | Threshold | Meaning
profile[+372] > 28673 | > sm_80 base | Post-Ampere features
a2 <= 32767 | <= sm_89 class | Pre-Hopper geometry (7 warps, 208 slots)
a2 <= 36863 | <= sm_89 extended | Ampere/Ada geometry (8 warps, 224 slots)
a2 > 36863 | >= sm_90 | Hopper+ geometry (16 warps, 240 slots)
Codegen factory == 32768 | sm_90 exactly | Hopper-specific code paths
Codegen factory >= 32768 | sm_90+ | WGMMA, cluster, TMA enabled

The register file descriptor at sub_8E4400 uses the encoded value to select warp geometry. The full cascade:

encoded <= 20479  ->  4 warps,  96 slots   (pre-Maxwell)
encoded <= 24575  ->  6 warps, 176 slots   (Pascal)
encoded <= 28672  ->  7 warps, 192 slots   (Volta)
encoded <= 32767  ->  7 warps, 208 slots   (Turing/Ampere/Ada)
encoded <= 36863  ->  8 warps, 224 slots   (Ampere extended)
encoded >  36863  -> 16 warps, 240 slots   (Hopper+)

Note: sm_89 (encoded 28677) falls in the <= 32767 range, giving it 7 warps / 208 slots. But the separate warp geometry lookup uses a different cascade where sm_89 (as Ampere class) gets 8 warps / 224 slots. The dual-cascade structure reflects the fact that different subsystems query different profile fields.
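The sub_8E4400 cascade above can be modeled directly; a sketch with the thresholds as recovered (the function name is ours):

```python
def warp_geometry(encoded_sm: int) -> tuple[int, int]:
    """Model of the sub_8E4400 cascade: encoded SM value -> (warps, dispatch slots)."""
    if encoded_sm <= 20479:
        return (4, 96)    # pre-Maxwell
    if encoded_sm <= 24575:
        return (6, 176)   # Pascal
    if encoded_sm <= 28672:
        return (7, 192)   # Volta
    if encoded_sm <= 32767:
        return (7, 208)   # Turing/Ampere/Ada
    if encoded_sm <= 36863:
        return (8, 224)   # Ampere extended
    return (16, 240)      # Hopper+ per this cascade

assert warp_geometry(28677) == (7, 208)   # sm_89 lands one past the Volta bound
assert warp_geometry(36864) == (16, 240)  # sm_100
```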

Hardware Resource Geometry

Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.

SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf
sm_89 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90%
sm_90 | 65,536 | 255 | 1,024 | 64 | 32 | 8 / 224 | 224 | 48 / 100 / 132 / 164 / 228 KB | 90%

Column definitions:

  • Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
  • Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
  • Max Threads/CTA: Maximum threads per cooperative thread array (block).
  • Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
  • Max CTAs/SM: Maximum concurrent CTAs per SM.
  • Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
  • Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.

sm_90a shares sm_90's geometry -- the a suffix affects only compatibility metadata, not hardware resource limits. The jump from sm_89 (7 partitions, 208 slots, 48 warps) to sm_90 (8 partitions, 224 slots, 64 warps) is the largest single-generation scheduling capacity increase in the binary, directly supporting Hopper's 4-warp warpgroup execution model.

MMA Instruction Validators

Several validator functions gate MMA features by SM version:

Validator | Address | Size | SM Strings Referenced
WMMA/MMA validator | sub_4C2FD0 | 12.2KB | "sm_90", "sm_80", "sm_75"
MMA scale/block validator | sub_49BBA0 | 11.4KB | "sm_%d", "mma with FP8 floating point type"
WMMA shape validator | sub_4BFED0 | 10.3KB | "sm_80", "sm_75"
CVT arch-specific | sub_4A6050 | 5.0KB | "%s on sm_89"
Special register validator | sub_49A5A0 | 3.5KB | "sm_90", "%laneid", "%warpid"
Instruction fusion validator | sub_4AA3E0 | 7.1KB | "sm_90"
Float instruction validator | sub_4A2CA0 | 3.7KB | "sm_90"
Function address validator | sub_4B1630 | 4.6KB | "sm_90", "sm_30", "sm_20"

The WMMA/MMA validator at sub_4C2FD0 performs three-way version checks: sm_75 for base WMMA, sm_80 for extended types (BF16/TF32), sm_90 for WGMMA features. FP8 MMA is gated by both sm_89 (Ada) for the data types and sm_90 (Hopper) for the warpgroup shapes.

Post-Scheduling Statistics

Eight architecture-variant statistics printers (clones at 0x700-byte intervals from sub_ABBA50) emit DUMPIR statistics. The metrics cover Hopper-specific counters:

Metric | Format String
MMA counts | "# [hmma1688=%d]" (and variants for imma, sparse, dmma)
Occupancy | "# [Occupancy = %f]"
Per-unit throughput | "# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d] ..."
Issue throughput | "# [issue thru=%f] [adu thru=%f] [alu thru=%f] ..."
WGMMA serialization | "Potential Performance Loss: wgmma.mma_async ..." (9 variants)
Shared memory | "# [SharedMem Alloc thru=%f]"

Function Map

Address | Size | Identity | Confidence
sub_4A6050 | 5.0KB | CVT validator (sm_89 special cases) | 85%
sub_4C2FD0 | 12.2KB | WMMA/MMA validator (sm_75/80/90 version checks) | 90%
sub_4DA380 | 295B | wgmma.mma_async formatter | 99%
sub_4DA4B0 | 295B | wgmma.fence formatter | 99%
sub_4DA5E0 | 311B | wgmma.commit_group formatter | 99%
sub_505B00 | 1066B | wgmma.wait_group formatter | 99%
sub_50AC70 | -- | wgmma.mma_async codegen handler | 99%
sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (TMA) | 95%
sub_609CF0 | ~1.2KB | sm_89 handler B (capability accessor) | 90%
sub_609DB0 | ~1.2KB | sm_90 handler A (capability accessor) | 90%
sub_609E10 | ~1.2KB | sm_89 handler A (capability accessor) | 90%
sub_60A5F0 | ~1KB | sm_90 intrinsic table initializer | 85%
sub_60A810 | ~1KB | sm_89 intrinsic table initializer | 85%
sub_61B850 | 10KB | Special register table (cluster regs) | 99%
sub_64BAF0 | 30KB | Warpgroup/kernel attribute processor | 80%
sub_6C60B0 | -- | Distributed shared memory intrinsic handler | 65%
sub_6C8100 | -- | TMA (cp.async.tensor) intrinsic handler | 85%
sub_8E4400 | 3.3KB | Register file geometry initializer | 90%
sub_8E8280 | 3.1KB | sm_89 (Ada) HW latency table | 85%
sub_8E8480 | 5.2KB | sm_90 (Hopper) HW latency table | 85%
sub_8E8780 | 4.6KB | sm_90a HW latency table | 85%
sub_97EC60 | ~3.5KB | setmaxnreg handler (opcode 315) | 90%
sub_97F540 | ~4KB | CTA reconfig pragma validator | 90%
sub_98D100 | ~4.8KB | setmaxreg.dealloc validator | 90%
sub_A94440 | -- | MBarrierDetector::isNonTrivialMBarrier | 85%
sub_A9A5F0 | -- | MBarrierDetector::classifyMBarrier | 85%
sub_AA33C0 | -- | MBarrierEmitter::rewriteMBarrierOperand | 85%
sub_ACE480 | 22.7KB | WGMMA serialization warning emitter | 98%
sub_AD70B0 | 22.6KB | GMMA operand register assignment | 75%
sub_AD9C20 | 14.4KB | GMMA register class allocator | 75%
sub_ADAD60 | 8.4KB | GMMA live range limiter | 90%
sub_ADBD30 | 23.9KB | GMMA register pressure estimator | 80%
sub_ADCA60 | 21.7KB | GMMA scheduling coordinator | 85%
sub_ADDDF0 | 20.6KB | GMMA pass entry (vtable) | 80%
sub_ADEB40 | 43.1KB | WGMMA sync fence insertion | 95%
sub_AE0D20 | 16.8KB | GMMA live range builder | 80%
sub_AE17C0 | 37.9KB | GMMA pipeline stage builder | 85%
sub_AE4F70 | -- | GMMA pass coordinator (outside range) | 90%
sub_AE5030 | 15.5KB | GMMA scheduling wrapper (alt entry) | 75%

Cross-References

Blackwell (SM 100--121)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas v13.0.88 handles five Blackwell-era base targets -- sm_100, sm_103, sm_110, sm_120, sm_121 -- spanning datacenter, automotive, consumer, and DGX product lines. All share the codegen factory value 36864 (generation 9, 9 << 12) and the "Blackwell" family string internally, despite being distinct microarchitectures. The defining Blackwell feature is Capsule Mercury (capmerc) as the default binary output format, automatically enabled for SM numbers exceeding 99. The datacenter variants (sm_100, sm_103, sm_110) support tcgen05 (5th-generation tensor cores with dedicated tensor memory); the consumer variants (sm_120, sm_121) do not.

SM targets | sm_100, sm_103, sm_110, sm_120, sm_121 (+ a and f sub-variants each)
Codegen factory | 36864 (0x9000, generation 9)
Family string | "Blackwell" (all five targets)
Default binary format | Capsule Mercury (capmerc) -- auto-enabled for SM > 99
SASS encoding | 128-bit per instruction (Mercury-encoded)
Warp geometry | 16 warps, 240 dispatch slots (shared with Hopper sm_90)
Sub-variants per SM | 3: base, a (accelerated), f (feature-reduced)
Profile constructor | sub_6765E0 (54KB)
Capability dispatch | sub_607DB0 (7 hash maps, once-guarded)

SM Version Table

SM | Product | __CUDA_ARCH__ | Codegen Factory | Hex | Variant
sm_100 | B100 / B200 (datacenter) | 1000 | 36864 | 0x9000 | 0 (gen 9 base)
sm_103 | GB300 (Blackwell Ultra) | 1030 | 36867 | 0x9003 | 3
sm_110 | Jetson Thor SoC | 1100 | 36868 | 0x9004 | 4
sm_120 | RTX 50xx / RTX Pro | 1200 | 36869 | 0x9005 | 5
sm_121 | DGX Spark | 1210 | 36869 | 0x9005 | 5

Codegen factory encoding: (9 << 12) | sub_variant. sm_100 is variant 0 (generation base). sm_103 is variant 3. sm_110 is variant 4. sm_120 and sm_121 appear to share variant 5 in the scheduling sub-architecture table at sub_8E4400.
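The encoding is a simple shift-and-or; a sketch (the function name is ours, the formula is the recovered one):

```python
def codegen_factory(generation: int, sub_variant: int) -> int:
    """Blackwell codegen factory encoding: (generation << 12) | sub_variant."""
    return (generation << 12) | sub_variant

assert codegen_factory(9, 0) == 36864  # sm_100
assert codegen_factory(9, 3) == 36867  # sm_103
assert codegen_factory(9, 4) == 36868  # sm_110
assert codegen_factory(9, 5) == 36869  # sm_120 / sm_121
```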

Unreleased SM numbers referenced in the binary: The SASS formatter sub_583190 (rsqrt) checks for SM codes 102, 103, 107, 124, 130 in architecture-specific dispatch paths, suggesting internal/future variants beyond the five publicly exposed targets.

Sub-Variant System

Every Blackwell SM has three sub-variants. The base and a/f variants within an SM share all 7 dispatch table handler functions -- they are identical silicon with different compatibility metadata and feature exposure.

Profile Object Fields

The profile constructor sub_6765E0 builds profile objects for each sub-variant with these fields:

SM | Base | a Variant | f Variant
sm_100 | sm_100 / compute_100 / lto_100 | sm_100a / compute_100a / lto_100a | sm_100f / compute_100f / lto_100f
sm_103 | sm_103 / compute_103 / lto_103 | sm_103a / compute_103a / lto_103a | sm_103f / compute_103f / lto_103f
sm_110 | sm_110 / compute_110 / lto_110 | sm_110a / compute_110a / lto_110a | sm_110f / compute_110f / lto_110f
sm_120 | sm_120 / compute_120 / lto_120 | sm_120a / compute_120a / lto_120a | sm_120f / compute_120f / lto_120f
sm_121 | sm_121 / compute_121 / lto_121 | sm_121a / compute_121a / lto_121a | sm_121f / compute_121f / lto_121f

CUDA_ARCH Macro Values

Sub-Variant | sm_100 | sm_103 | sm_110 | sm_120 | sm_121
Base (-D__CUDA_ARCH__) | =1000 | =1030 | =1100 | =1200 | =1210
Accelerated | =100a0 | =103a0 | =110a0 | =120a0 | =121a0
Feature-reduced | =100f0 | =103f0 | =110f0 | =120f0 | =121f0
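The macro values follow one pattern: the SM number, the suffix letter if any, then a trailing zero. A sketch of the derivation (the function name is ours):

```python
def cuda_arch_macro(sm: int, suffix: str = "") -> str:
    """__CUDA_ARCH__ definition pattern: base targets append a trailing 0 to the
    SM number; a/f targets splice the suffix letter before that trailing 0."""
    return f"-D__CUDA_ARCH__={sm}{suffix}0"

assert cuda_arch_macro(100) == "-D__CUDA_ARCH__=1000"
assert cuda_arch_macro(103, "a") == "-D__CUDA_ARCH__=103a0"
assert cuda_arch_macro(121, "f") == "-D__CUDA_ARCH__=121f0"
```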

Suffix Bit Flags in Profile Objects

From the decompiled profile constructor, suffixed variants set specific byte flags:

Suffix | Flag Position | Evidence
a (accelerated) | profile[4] = 1 | v79->m128i_i8[4] = 1; *(_BYTE *)(v82 + 4) = 1;
f (feature-reduced) | profile[5] = 1 (on all 3 objects: sm, compute, lto) | v88->m128i_i8[5] = 1; *(_BYTE *)(v91 + 5) = 1; v94[5] = 1;

The a flag is set on the SM and compute profile objects only. The f flag is set on all three (sm, compute, lto), reflecting the fact that f-compiled code must retain its feature-reduced metadata through linking.

isaClass Inheritance

Sub-variants reference their base SM's isaClass rather than defining a new one:

  • sm_100a and sm_100f reference "(profile_sm_100)->isaClass"
  • sm_103a and sm_103f reference "(profile_sm_103)->isaClass"
  • sm_120a and sm_120f reference "(profile_sm_120)->isaClass"
  • sm_121a and sm_121f reference "(profile_sm_121)->isaClass"

This confirms that sub-variants share the instruction set architecture class with their base. The a/f distinction is purely in compatibility metadata, not in the ISA or codegen.

Capability Dispatch

sub_607DB0 registers handler functions into 7 parallel hash maps. All sub-variants of a given SM register the same function pointers.

Handler Assignments (Maps 1--3)

SM | Handler A (Map 1) | Handler B (Map 2) | Intrinsic Init (Map 3)
sm_100 / 100a / 100f | sub_609C30 | sub_609BD0 | sub_60A910
sm_103 / 103a / 103f | sub_608F20 | sub_609D20 | sub_60A700
sm_110 / 110a / 110f | sub_609F30 | sub_608F50 | sub_60AA20
sm_120 / 120a / 120f | sub_609E40 | sub_609C60 | sub_608DF0
sm_121 / 121a / 121f | sub_609ED0 | sub_609BA0 | sub_60A4E0

Performance / Occupancy Handlers (Maps 6--7)

SM | Handler E (Map 6) | Handler F (Map 7)
sm_100 / 100a / 100f | sub_609080 | sub_6098A0
sm_103 / 103a / 103f | sub_609020 | sub_6091A0
sm_110 / 110a / 110f | sub_609000 | sub_609280
sm_120 / 120a / 120f | sub_608FE0 | sub_609520
sm_121 / 121a / 121f | sub_609040 | sub_6097C0

Every Blackwell SM has unique handler functions in all 7 maps. This contrasts with Hopper where sm_90 and sm_90a share all handlers. Each Blackwell SM variant is architecturally distinct enough to warrant separate capability accessors, performance models, and intrinsic tables.

Warp Geometry

The warp geometry initializer at sub_8E4400 uses the codegen factory value to select dispatch parameters. All Blackwell targets (codegen factory > 36863) fall into the maximum bucket:

encoded >  36863  -> 16 warps, 240 dispatch slots

This is identical to Hopper (sm_90). The 16-warp / 240-slot geometry supports Blackwell's warpgroup execution model (4 warps per warpgroup, 4 warpgroups per SM partition).

Sub-Architecture Variant Table

The secondary variant assignment at sub_8E4400 maps codegen factory values to sub-architecture indices:

Codegen Factory | Variant | SM
36864 | 0 | sm_100 (base)
36867 | 3 | sm_103
36868 | 4 | sm_110
36869 | 5 | sm_120, sm_121

These variant indices select different entries within the per-SM latency tables, allowing the scheduler to use silicon-specific pipeline timing.

Hardware Resource Geometry

Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.

SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf
sm_100 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 90%
sm_103 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 88%
sm_110 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 85%
sm_120 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 88%
sm_121 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_120) | 85%

Column definitions:

  • Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
  • Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
  • Max Threads/CTA: Maximum threads per cooperative thread array (block).
  • Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
  • Max CTAs/SM: Maximum concurrent CTAs per SM.
  • Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
  • Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.

All Blackwell targets share the 16-partition / 240-slot geometry (identical to Hopper sm_90). The a and f sub-variants within each SM share the same geometry -- differentiation is in compatibility metadata and feature exposure, not in resource limits. The primary distinction across Blackwell SMs is in the latency tables and tcgen05 availability, not in the scheduling partition structure.

Capsule Mercury (capmerc) -- Default Output Format

Capsule Mercury is automatically enabled for all Blackwell targets. When the SM architecture number exceeds 99, ptxas sets the capmerc flag at offset+81 in the compilation context. This applies to sm_100, sm_103, sm_110, sm_120, and sm_121 uniformly.

Three Output Modes

Mode | String | Default For | SM Range
mercury | "mercury" | sm_75 -- sm_90 | Turing through Hopper
capmerc | "capmerc" | sm_100 -- sm_121 | All Blackwell
sass | "sass" | None (explicit only) | Any
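The default-mode selection reduces to a single comparison on the SM number; a sketch of the predicate as described (the function name is ours; sass is explicit-only and never the default):

```python
def default_output_mode(sm: int) -> str:
    """Default binary format: capmerc for SM numbers exceeding 99, mercury otherwise."""
    return "capmerc" if sm > 99 else "mercury"

assert default_output_mode(90) == "mercury"   # Hopper
assert default_output_mode(100) == "capmerc"  # Blackwell
```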

Capsule Mercury vs Mercury

Both modes use the same Mercury encoder pipeline (phases 117--122). The capmerc distinction is at the ELF emission level:

  • Mercury produces a fully-resolved SASS binary in .text.<funcname> sections
  • Capsule Mercury wraps Mercury-encoded instructions in .nv.capmerc<funcname> sections with a 328-byte capsule descriptor, plus .nv.merc.* debug/metadata sections

The capsule descriptor (constructed by sub_1C9C300, 24KB) contains the Mercury instruction stream, relocation metadata (R_MERCURY_* types), KNOBS compilation configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage).

Opportunistic Finalization

Capsule Mercury enables deferred finalization -- compiling once for one SM and reconstituting SASS for a different SM at link or load time.

Level | Name | Behavior | Example
0 | default | Standard finalization (compile-target only) | --
1 | none | No finalization; output stays as capmerc | --
2 | intra-family | Finalize within same SM family | sm_100 -> sm_103
3 | intra+inter | Finalize across SM families | sm_100 -> sm_120

The compatibility checker sub_60F290 determines whether a capmerc binary compiled for SM X can be finalized for SM Y. On success, ptxas emits: "applied for off-target %u -> %u finalization".

Self-Check Mechanism

The --self-check CLI option performs roundtrip verification:

  1. Generate capmerc output (Mercury encoding + metadata)
  2. Reconstitute SASS from the capmerc data
  3. Compare section-by-section; report error codes 17 (content mismatch), 18 (count mismatch), 19 (metadata mismatch)

The reconstituted SASS can be dumped with --out-sass for debugging self-check failures.

TCGen05 -- 5th Generation Tensor Cores

TCGen05 is the defining hardware feature of Blackwell datacenter parts. It introduces tensor memory (TMEM) as a dedicated register-like storage directly connected to the tensor core, eliminating the shared-memory bottleneck of previous WGMMA designs.

SM Availability

SM | tcgen05 Available | Notes
sm_100 / 100a / 100f | Yes | Full datacenter tcgen05
sm_103 / 103a / 103f | Yes | Blackwell Ultra -- same tcgen05 ISA
sm_110 / 110a / 110f | Yes | Jetson Thor -- full tcgen05 hardware
sm_120 / 120a / 120f | No | Consumer -- no TMEM, no tcgen05
sm_121 / 121a / 121f | No | DGX Spark -- no TMEM, no tcgen05

The tcgen05 ISA is gated by SM version checks (visible as sub_70FA00(*, 29) capability queries). sm_120 and sm_121 fall back to inherited HMMA/IMMA/WGMMA tensor core paths.

PTX Instructions

Registered in the opcode dispatch table at sub_5D4190:

PTX Instruction | Codegen Handler | Formatter | Size
tcgen05.alloc | sub_569180 | sub_526370 | 1287B
tcgen05.relinquish_alloc_permit | sub_526370 | -- | --
tcgen05.dealloc | sub_58C7F0 | sub_574050 | 2130B
tcgen05.ld | sub_574050 | sub_578DB0 | 2466B
tcgen05.ld.red | sub_578DB0 | -- | --
tcgen05.st | sub_571FE0 | sub_56C190 | 1842B
tcgen05.commit | sub_56C190 | sub_5427F0 | 1575B
tcgen05.cp | sub_5427F0 | sub_4F1A90 | 903B
tcgen05.shift | sub_4F1A90 | sub_58FA20 | 4604B
tcgen05.mma | sub_5BBC30 (90KB) | -- | --
tcgen05.mma.ws | sub_58FA20 | sub_4DA720 | 343B

The tcgen05.mma codegen handler at sub_5BBC30 is 90KB -- the largest single-instruction handler in ptxas -- reflecting the complexity of 5th-gen tensor core MMA with tensor memory operands, scale factors, sparsity, and accumulator management.

TCGen05 Guardrail Functions

Eight debug/validation functions provide runtime instrumentation for tensor memory operations when compiled with --g-tensor-memory-access-check (or -g-tmem-access-check):

Guardrail | Formatter | Size | Validation
_tcgen05.guardrails.is_phase_valid | sub_4DA720 | 775B | Phase lifecycle
_tcgen05.guardrails.are_columns_allocated | sub_4DDE70 | 599B | Column allocation
_tcgen05.guardrails.is_current_warp_valid_owner | sub_4DBF20 | 791B | Warp ownership
_tcgen05.guardrails.in_physical_bounds | sub_4DB050 | 439B | Memory bounds
_tcgen05.guardrails.allocation_granularity | sub_4F0960 | 839B | Allocation alignment
_tcgen05.guardrails.datapath_alignment | sub_4DD580 | 735B | Data path checks
_tcgen05.guardrails.sp_consistency_across_idesc_mod | sub_500FA0 | 970B | Sparse descriptor
_tcgen05.guardrails.check_sparse_usage | sub_4DDB80 | 743B | Sparsity validation

The bounds checker (sub_70E0E0, 296 lines decompiled) generates inline PTX code to validate tensor memory column counts, extracting bitfield positions from tcgen05 descriptors (e.g., and.b32 %s, 0x7E0000, %s; shr.u32 %s, %s, 17; mul.lo.u32 %s, %s, 8).
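That PTX sequence masks a 6-bit field at bits 17--22 of the descriptor and scales it by 8; a direct model (the function name and the "column count" interpretation of the field are ours, taken from the surrounding description):

```python
def tmem_column_count(desc: int) -> int:
    """Mirror of the emitted PTX: and.b32 with 0x7E0000 (bits 17--22),
    shr.u32 by 17, then mul.lo.u32 by 8."""
    return ((desc & 0x7E0000) >> 17) * 8

# A descriptor with the 6-bit field set to 4 yields 32 columns.
assert tmem_column_count(4 << 17) == 32
```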

TCGen05 Intrinsics

ID Range | Count | Category
0x20--0x2A | 11 | __cuda_sm10x_tcgen05_guardrail_trap_* (trap on validation failure)
0x230--0x239 | 10 | __cuda_sm_10x_* (hmma/imma mdata + bit MMA)

Additional tcgen05 helper intrinsics observed in decompiled code:

  • __cuda_sm_100_tcgen05_ld_red_immhalfSplitOff -- load-reduce with immediate half-split offset
  • __cuda_sm_100_tcgen05_ld_immhalfSplitOff -- load with immediate half-split offset
  • __cuda_sm_100_tcgen05_st_immhalfSplitOff -- store with immediate half-split offset
  • __cuda_sm_100_tcgen05_ld_red_funcRetArr -- load-reduce returning array
  • __cuda_sm_100_tcgen05_ld_funcRetArr -- load returning array

These helpers (decompiled in sub_70D910 and sub_70DDB0) generate inline PTX for complex tensor memory access patterns including array returns via ld.param.b32 sequences.

Bulk Copy Intrinsics (sm_1xx)

18 intrinsics in the __cuda_sm1xx_* namespace cover cp.async.bulk.tensor 1D--5D in tile and im2col modes. These extend the Hopper TMA infrastructure with Blackwell-specific enhancements.

SM 100 / SM 100a / SM 100f -- Blackwell Datacenter

sm_100 is the reference Blackwell architecture. Codegen factory 36864 (0x9000), CUDA_ARCH 1000.

Products

B100, B200 (datacenter GPU), paired as GB200 NVL72 superchips.

Key Features

  • TCGen05: Full 5th-gen tensor core with TMEM (alloc, dealloc, ld, st, commit, cp, shift, mma, mma.ws)
  • Capsule Mercury: Default output format (auto-enabled for SM > 99)
  • WGMMA inherited: Warpgroup MMA from Hopper carries forward
  • Cluster operations: Thread-block clusters, distributed shared memory (from Hopper)
  • setmaxnreg: Dynamic register allocation (from Hopper)
  • Uniform register ALU: UFADD, UFFMA, UFSEL, UFSETP, UVIADDR (Blackwell uniform register ISA additions)

Handler Functions

Map | Function | Role
Handler A | sub_609C30 | Primary codegen capability accessor
Handler B | sub_609BD0 | Secondary codegen capability accessor
Intrinsic init | sub_60A910 | Intrinsic table population
Perf/occupancy E | sub_609080 | Performance statistics
Perf/occupancy F | sub_6098A0 | Occupancy calculator

HW Latency Table

sub_8E8A90 (3.0KB) -- the base Blackwell latency table. Two-part structure: a 3.0KB base table for standard instructions plus a ~949-byte TCGEN05 supplement covering tensor core scheduling classes 745--772+.

Profile Object

From sub_6765E0:

SM name:       "sm_100"
Compute name:  "compute_100"
LTO name:      "lto_100"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1000"

The profile constructor stores dword_29FE2C4 = 100 after constructing all sm_100 sub-variants, likely recording the current highest-registered base SM number.

SM 103 / SM 103a / SM 103f -- Blackwell Ultra

sm_103 is Blackwell Ultra, targeting the GB300 NVL72 platform. Codegen factory 36867 (0x9003), CUDA_ARCH 1030.

Products

GB300 (datacenter, Blackwell Ultra). Incremental silicon revision over sm_100.

Differentiation from sm_100

The SASS formatter sub_583190 (rsqrt instruction) explicitly checks for "sm_103" and applies a Blackwell Ultra-specific operand layout. The formatter dispatch first tests raw SM codes (102, 103, 107, 130, 124), then uses check_target_sm(instr, 0, "sm_103") for string-based validation.

From sweep data on the SM103-specific encoding path:

  • Sets encoding flags XOR 0x10, XOR 0x40 for sm_103-specific instruction variants
  • New operand layout for transcendental instructions (rsqrt, likely also rcp/sqrt)

sm_103 has a separate 618-byte supplementary latency table -- the smallest in the binary -- suggesting minimal scheduling parameter changes from sm_100.

Handler Functions

All unique from sm_100:

| Map | Function |
|---|---|
| Handler A | sub_608F20 |
| Handler B | sub_609D20 |
| Intrinsic init | sub_60A700 |
| Perf/occupancy E | sub_609020 |
| Perf/occupancy F | sub_6091A0 |

Profile Object

SM name:       "sm_103"
Compute name:  "compute_103"
LTO name:      "lto_103"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1030"

SM 110 / SM 110a / SM 110f -- Jetson Thor

sm_110 targets the Jetson Thor SoC for automotive and robotics applications. Codegen factory 36868 (0x9004), CUDA_ARCH 1100.

Products

Jetson Thor (automotive-grade SoC with integrated GPU). Originally internally designated sm_101 before rename.

sm_101 Legacy Alias

sm_101 (with variants sm_101a and sm_101f) was the original internal name for Jetson Thor. It was renamed to sm_110 in a later CUDA release, but all three validation table entries are retained for backward compatibility:

| Table | Entry | PTX ISA | Purpose |
|---|---|---|---|
| Base (unk_1D16220) | {101, 8, 6} | 8.6 | Accepts --gpu-name sm_101 in existing PTX files |
| Accelerated (unk_1D161C0) | {101, 8, 6} | 8.6 | Accepts sm_101a |
| Feature-reduced (unk_1D16160) | {101, 8, 8} | 8.8 | Accepts sm_101f |

The validation tables use bsearch() (sub_484B70 comparator), so both sm_101 and sm_110 are independently findable. However, sub_6765E0 (the profile constructor) registers only sm_110 / sm_110a / sm_110f -- there is no profile object for sm_101. After passing validation, sm_101 must resolve to the sm_110 profile through an internal aliasing path (likely in sub_4B1080, the target directive parser).

The PTX ISA version difference is notable: sm_101 requires PTX 8.6 (same as sm_100), while sm_110 requires PTX 9.0. This reflects the timeline -- sm_101 was named when the Jetson Thor target was first added alongside sm_100, before the sm_110 numbering and PTX 9.0 specification existed.
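The validation-then-alias flow described above can be sketched as follows. The table contents for sm_100/sm_101/sm_110 come from the recovered entries; the surrounding entries, the helper names, and the alias map itself are illustrative reconstructions, not ptxas symbols.

```python
# Sketch: bsearch-style validation table plus profile aliasing for sm_101.
from bisect import bisect_left

# (sm_code, ptx_major, ptx_minor) -- sorted by sm_code, as bsearch() requires.
# Only the 100/101/110 rows are taken from the recovered tables.
BASE_TABLE = [(90, 8, 0), (100, 8, 6), (101, 8, 6), (110, 9, 0)]

# sm_101 passes validation but has no registered profile object, so it must
# resolve to the sm_110 profile (hypothesized aliasing path in sub_4B1080).
PROFILE_ALIASES = {101: 110}

def validate_gpu_name(sm_code):
    """Binary-search the validation table; return the required PTX ISA or None."""
    keys = [entry[0] for entry in BASE_TABLE]
    i = bisect_left(keys, sm_code)
    if i < len(keys) and keys[i] == sm_code:
        return (BASE_TABLE[i][1], BASE_TABLE[i][2])
    return None

def resolve_profile(sm_code):
    """Map a validated SM code to the SM whose profile object is registered."""
    return PROFILE_ALIASES.get(sm_code, sm_code)
```

Under this model, sm_101 validates against PTX 8.6 yet compiles with the sm_110 profile, which matches the observed behavior.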

Key Characteristics

  • Full tcgen05 hardware: Retains datacenter-class tensor memory and tensor core features
  • Same WGMMA/cluster support: Inherits Hopper-era warpgroup and cluster operations
  • SoC-specific constraints: Differentiated from sm_100 through capability flags, not through missing features -- the capability accessor functions (sub_609F30, sub_608F50) return SoC-appropriate resource limits

Handler Functions

| Map | Function |
|---|---|
| Handler A | sub_609F30 |
| Handler B | sub_608F50 |
| Intrinsic init | sub_60AA20 |
| Perf/occupancy E | sub_609000 |
| Perf/occupancy F | sub_609280 |

Profile Object

SM name:       "sm_110"
Compute name:  "compute_110"
LTO name:      "lto_110"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1100"

Note: sm_110 uses different xmmword constants for profile fields [5]--[7] compared to sm_100, visible in the profile constructor where v98[5] = xmmword_2027D70; v98[6] = v103 (from v209); v98[7] = v104 (from v207). This encodes different hardware resource parameters.

SM 120 / SM 120a / SM 120f -- Blackwell Consumer

sm_120 targets consumer and enterprise workstation GPUs. Codegen factory 36869 (0x9005), CUDA_ARCH 1200.

Products

RTX 5090, RTX 5080, RTX 5070 Ti, RTX 5070, RTX 5060 (consumer). RTX Blackwell Pro (enterprise workstation).

The tcgen05 Gap

sm_120 is architecturally distinct from sm_100 despite sharing the "Blackwell" family string. The critical difference: no tcgen05 support. The entire tensor memory subsystem (alloc, dealloc, ld, st, commit, cp, shift, mma) is absent. Tensor operations fall back to:

  • HMMA/IMMA inherited from sm_70--sm_89 (direct MMA path)
  • WGMMA inherited from sm_90 (warpgroup async MMA)

This is gated by SM version checks in the capability accessor functions. The intrinsic table initializer (sub_608DF0) does not register tcgen05 intrinsic handlers for sm_120.

HW Latency Table

sm_120 has a distinct latency model, split into two parts:

| Function | Size | Content |
|---|---|---|
| sub_8E9000 | 2.9KB | Base consumer Blackwell table |
| sub_8E92E0 | 5.5KB | Extended table (largest individual table in binary) |

The 5.5KB extended table is larger than any other individual latency table, suggesting that sm_120's consumer pipeline differs substantially from datacenter sm_100 in scheduling characteristics -- likely different functional unit counts, memory latencies, and tensor core throughput profiles.

Handler Functions

| Map | Function |
|---|---|
| Handler A | sub_609E40 |
| Handler B | sub_609C60 |
| Intrinsic init | sub_608DF0 |
| Perf/occupancy E | sub_608FE0 |
| Perf/occupancy F | sub_609520 |

Profile Object

SM name:       "sm_120"
Compute name:  "compute_120"
LTO name:      "lto_120"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1200"

sm_120 uses a third distinct set of xmmword constants for profile fields, including xmmword_2027DC0 at field [6], confirming different hardware resource parameters from both sm_100 and sm_110.

SM 121 / SM 121a / SM 121f -- DGX Spark

sm_121 targets the DGX Spark desktop AI workstation. Codegen factory 36869 (0x9005), CUDA_ARCH 1210.

Products

NVIDIA DGX Spark (desktop AI workstation with Grace CPU + Blackwell GPU).

Relationship to sm_120

sm_121 shares the same codegen factory sub-variant (5) as sm_120 in the scheduling sub-architecture table, and inherits sm_120's xmmword profile constants. This suggests sm_121 is a binned or slightly modified sm_120 die, similar to how sm_86 relates to sm_80 in the Ampere generation.

Like sm_120, sm_121 has no tcgen05 support -- tensor operations use the HMMA/IMMA/WGMMA path.

Handler Functions

All unique from sm_120:

| Map | Function |
|---|---|
| Handler A | sub_609ED0 |
| Handler B | sub_609BA0 |
| Intrinsic init | sub_60A4E0 |
| Perf/occupancy E | sub_609040 |
| Perf/occupancy F | sub_6097C0 |

Profile Object

SM name:       "sm_121"
Compute name:  "compute_121"
LTO name:      "lto_121"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1210"

Blackwell Uniform Register ISA

Blackwell extends the uniform register (UR) ISA introduced in Turing/Ampere with dedicated uniform ALU instructions:

| SASS Instruction | Operation | Notes |
|---|---|---|
| UFADD | Uniform floating-point add | New in Blackwell |
| UFFMA | Uniform fused multiply-add | New in Blackwell |
| UFSEL | Uniform floating-point select | New in Blackwell |
| UFSETP | Uniform FP set-predicate | New in Blackwell |
| UVIADDR | Uniform integer-to-address | New in Blackwell |

These instructions execute on the uniform datapath (UDP, functional unit index 9), allowing floating-point uniform computations to stay in the UR file without round-tripping through the R file. Mercury encoding assigns major opcode 0x0E with 6 variants (sub_10C0550) for uniform ALU.

Architecture Version Threshold Checks

All Blackwell targets share codegen factory 36864+ (0x9000+). The binary uses these thresholds to gate Blackwell-specific features:

| Check Pattern | Threshold | Meaning |
|---|---|---|
| encoded > 36863 | > sm_90 extended | Blackwell warp geometry (16 warps, 240 slots) |
| codegen_factory >= 36864 | >= sm_100 | Blackwell generation features |
| codegen_factory == 36864 | sm_100 exactly | sm_100-specific paths |
| SM_number > 99 | sm_100+ | Capsule Mercury auto-enable |
| sub_70FA00(*, 29) | -- | tcgen05 capability query (SM-specific) |

The tcgen05 gating is not a simple threshold -- it uses a per-SM capability query (sub_70FA00 with argument 29) that returns true for sm_100/103/110 and false for sm_120/121.
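The difference between a flat threshold and the per-SM capability query can be modeled in a few lines. The true/false results mirror what was observed for sub_70FA00(*, 29); the Python names are illustrative.

```python
# Minimal model of the recovered tcgen05 gating: capability 29 is answered
# per SM, because a ">= sm_100" threshold alone would be wrong for sm_120/121.
TCGEN05_CAPABLE = {100: True, 103: True, 110: True, 120: False, 121: False}

def query_capability(sm, cap_id):
    """Per-SM capability query; cap_id 29 is the tcgen05 check."""
    if cap_id == 29:
        return TCGEN05_CAPABLE.get(sm, False)
    raise NotImplementedError(f"capability {cap_id} not modeled")

# A naive threshold check would wrongly enable tcgen05 on consumer Blackwell:
assert 120 >= 100 and query_capability(120, 29) is False
```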

BB Initialization (Secondary Encoding)

The basic block initializer sub_6E8EB0 uses a secondary encoding space for Blackwell:

| Secondary Encoding | SM | Flags |
|---|---|---|
| 20480 (0x5000) | sm_100 | Instruction set flags for datacenter |
| 20484 (0x5004) | sm_103 | XOR 0x10, XOR 0x40 for Ultra variants |

This secondary encoding uses generation 5 in the BB init context (5 << 12), separate from the primary codegen factory's generation 9.
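The arithmetic behind both encoding spaces is the same `(generation << 12) | sub_variant` shape; only the generation differs. A quick check (helper name is ours):

```python
# Both the BB-init secondary encoding and the primary codegen factory values
# decompose as (generation << 12) | sub_variant.
def encode(generation, sub_variant):
    return (generation << 12) | sub_variant

# Secondary (BB init) space: generation 5
assert encode(5, 0) == 20480 == 0x5000   # sm_100
assert encode(5, 4) == 20484 == 0x5004   # sm_103
# Primary codegen factory space: generation 9
assert encode(9, 0) == 36864 == 0x9000   # sm_100
assert encode(9, 3) == 36867 == 0x9003   # sm_103
```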

Scheduling Profile Differences

Blackwell targets share the 16-warp / 240-slot geometry with Hopper but have distinct latency tables:

| SM | Latency Table | Size | Structure |
|---|---|---|---|
| sm_100 | sub_8E8A90 | 3.0KB | Base + 949B TCGEN05 supplement |
| sm_103 | (supplementary) | 618B | Smallest table -- minimal delta from sm_100 |
| sm_110 | (shared with sm_100 or dedicated) | -- | Not separately identified in sweep |
| sm_120 | sub_8E9000 + sub_8E92E0 | 2.9KB + 5.5KB | Two-part consumer table |
| sm_121 | (likely shares sm_120's table) | -- | Same sub-variant index as sm_120 |
| Universal | sub_8E97B0 | 8.8KB | Fallback / universal |

The 5.5KB extended table for sm_120 is the largest individual latency table in the binary, reflecting the consumer microarchitecture's distinct pipeline design. The sm_100 table uses a supplement mechanism for TCGEN05-specific scheduling classes that the consumer sm_120 table does not need (since sm_120 lacks tcgen05).

Scheduling Class Assignments

From the opcode-to-scheduling-class mapper sub_89FBA0 (85KB), Blackwell-era opcodes use high-numbered scheduling classes:

| Class Range | Category | Architecture |
|---|---|---|
| 700--772+ | Mercury/Blackwell tensor ops | sm_100+ with tcgen05 |
| 745 | WGMMA primary | sm_90+ (Hopper carry-forward) |
| 744 | WGMMA variant | sm_90+ |
| 765--767 | BGMMA/QMMA (Blackwell-specific MMA types) | sm_100+ |
| 759 | HMMA/BMMA tensor core | sm_100+ |
| 757, 761 | Narrow/wide DP tensor | sm_100+ |
| 600, 604 | Tensor fence / tensor sync | sm_90+ |

Intrinsic Table

Blackwell intrinsic availability is cumulative -- all sm_70, sm_80, sm_8x, and sm_9x intrinsics carry forward. Blackwell adds two new intrinsic groups:

sm_10x Intrinsics (21 entries)

| ID Range | Count | Namespace | Category |
|---|---|---|---|
| 0x20--0x2A | 11 | __cuda_sm10x_tcgen05_guardrail_trap_* | Trap on guardrail validation failure |
| 0x230--0x239 | 10 | __cuda_sm_10x_* | hmma/imma mdata + bit MMA (Blackwell-specific shapes) |

sm_1xx Intrinsics (18 entries)

| Namespace | Count | Category |
|---|---|---|
| __cuda_sm1xx_* | 18 | cp.async.bulk.tensor 1D--5D tile/im2col (extended bulk copy) |

OCG (Optimizing Code Generator) Intrinsics

The OCG builtin name table at sub_6C9EB0 (13KB) contains the master list of Blackwell+ runtime-generated intrinsic names:

cp_async_bulk, cp_red_async_bulk, cp_async_tensor,
cp_async_prefetch_tensor, fence_view_async,
viaddmax, viaddmin, viadd, vimax, vimin, vimax3, vimin3,
write_async, tcbar, mmareadshma, tcmma,
tcshift, tcatomsws, tcldsws, tcstsws,
gdesc, breuse, bkeep, virtcount,
memclear, acqshminit, sparsify, spfactor2to4,
2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit,
16dp32bitt0t15, 16dp32bitt16t31,
selfcast, broadcast, findandset, align

These names represent the Blackwell SASS operations exposed through the OCG intrinsic interface, covering tensor core scheduling (tcbar, tcmma, tcshift), sparse operations (sparsify, spfactor2to4), integer min/max variants (viaddmax, viaddmin), and async memory operations.

New PTX Instructions (Blackwell-Specific)

Beyond tcgen05, Blackwell introduces or extends several instruction families visible in the opcode dispatch:

| Instruction | Category | Evidence |
|---|---|---|
| tcgen05.* (11 instructions) | Tensor memory ops | sub_5D4190 registration |
| fence_view_async | Memory ordering | OCG builtin table |
| write_async | Async writes | OCG builtin table |
| viaddmax / viaddmin | Integer add-with-max/min | OCG builtin table |
| BGMMA / QMMA | Block/quantized MMA | Scheduling classes 765--767 |

CLI Options -- Tensor Memory Checks

Two CLI options control tcgen05 guardrail instrumentation:

| Option | Short Form | Default | Description |
|---|---|---|---|
| --g-tensor-memory-access-check | -g-tmem-access-check | Enabled with -g | Enable tensor memory access checks for tcgen05 operations |
| --gno-tensor-memory-access-check | -gno-tmem-access-check | false | Disable checks (overrides the above) |

When enabled, the compiler inserts inline validation code (the 8 guardrail functions) around tcgen05 operations. These emit trap instructions if tensor memory invariants are violated at runtime -- useful for debugging TMEM allocation errors, bounds violations, and ownership conflicts.

Feature Comparison

| Feature | sm_100 | sm_103 | sm_110 | sm_120 | sm_121 |
|---|---|---|---|---|---|
| Codegen factory | 36864 | 36867 | 36868 | 36869 | 36869 |
| Sub-arch variant | 0 | 3 | 4 | 5 | 5 |
| Family string | Blackwell | Blackwell | Blackwell | Blackwell | Blackwell |
| Capsule Mercury default | Yes | Yes | Yes | Yes | Yes |
| tcgen05 (tensor memory) | Yes | Yes | Yes | No | No |
| WGMMA (from Hopper) | Yes | Yes | Yes | Yes | Yes |
| Cluster operations | Yes | Yes | Yes | Yes | Yes |
| setmaxnreg | Yes | Yes | Yes | Yes | Yes |
| Uniform FP ALU (UFADD etc.) | Yes | Yes | Yes | Yes | Yes |
| BGMMA/QMMA | Yes | Yes | Yes | ? | ? |
| Guardrail instrumentation | Yes | Yes | Yes | N/A | N/A |
| HW latency table | 3.0KB + 949B | 618B (supp.) | -- | 2.9KB + 5.5KB | (shared w/ sm_120) |
| a sub-variant | Yes | Yes | Yes | Yes | Yes |
| f sub-variant | Yes | Yes | Yes | Yes | Yes |
| Products | B100/B200 | GB300 | Jetson Thor | RTX 50xx | DGX Spark |

SASS Instruction Encoding

Blackwell continues the 128-bit per-instruction format introduced in Turing. The Mercury encoder handles the SM100+ instruction set through a dedicated encoding subsystem spanning approximately 851KB in the address range 0xDFC000--0x107B000.

The encoding subsystem covers 16 instruction format groups with full Blackwell ISA support including:

  • Standard ALU/FPU/memory operations (inherited from earlier architectures)
  • TCGEN05 tensor memory operations (new encoding classes)
  • BGMMA/QMMA block-scale and quantized MMA variants
  • Extended bulk copy operations (UBLKCP variants)
  • Sparse tensor operations
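Since every Blackwell instruction is a 128-bit packed word, Mercury's job reduces to inserting field values into bit ranges of that word. The sketch below shows the bit-level mechanics only; the field positions used are invented for demonstration, as the real per-encoding-class layouts are not documented here.

```python
# Illustrative packing of a 128-bit SASS word, held as a Python int and
# split into two 64-bit halves at the end. Field offsets are hypothetical.
def set_field(word, lo_bit, width, value):
    """Insert `value` into bits [lo_bit, lo_bit + width) of a 128-bit int."""
    assert value < (1 << width), "value does not fit in field"
    mask = ((1 << width) - 1) << lo_bit
    return (word & ~mask) | (value << lo_bit)

word = 0
word = set_field(word, 0, 12, 0x344)   # hypothetical opcode field in the low half
word = set_field(word, 64, 8, 0x1F)    # hypothetical field in the high half
lo, hi = word & (2**64 - 1), word >> 64
```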

Function Map

| Address | Size | Identity | SM | Confidence |
|---|---|---|---|---|
| sub_607DB0 | 14KB | Capability dispatch (all Blackwell registrations) | all | 99% |
| sub_608DF0 | ~1KB | Intrinsic table initializer | sm_120 | 85% |
| sub_608F20 | ~1.2KB | Handler A (capability accessor) | sm_103 | 90% |
| sub_608F50 | ~1.2KB | Handler B (capability accessor) | sm_110 | 90% |
| sub_609000 | ~200B | Perf/occupancy E | sm_110 | 85% |
| sub_609020 | ~200B | Perf/occupancy E | sm_103 | 85% |
| sub_609040 | ~200B | Perf/occupancy E | sm_121 | 85% |
| sub_609080 | ~200B | Perf/occupancy E | sm_100 | 85% |
| sub_6091A0 | ~200B | Perf/occupancy F | sm_103 | 85% |
| sub_609280 | ~200B | Perf/occupancy F | sm_110 | 85% |
| sub_609520 | ~200B | Perf/occupancy F | sm_120 | 85% |
| sub_6097C0 | ~200B | Perf/occupancy F | sm_121 | 85% |
| sub_6098A0 | ~200B | Perf/occupancy F | sm_100 | 85% |
| sub_609BA0 | ~48B | Handler B | sm_121 | 99% |
| sub_609BD0 | ~48B | Handler B | sm_100 | 99% |
| sub_609C30 | ~48B | Handler A | sm_100 | 99% |
| sub_609C60 | ~48B | Handler B | sm_120 | 99% |
| sub_609D20 | ~48B | Handler B | sm_103 | 99% |
| sub_609E40 | ~48B | Handler A | sm_120 | 99% |
| sub_609ED0 | ~48B | Handler A | sm_121 | 99% |
| sub_609F30 | ~48B | Handler A | sm_110 | 99% |
| sub_60A4E0 | ~1KB | Intrinsic table initializer | sm_121 | 85% |
| sub_60A700 | ~1KB | Intrinsic table initializer | sm_103 | 85% |
| sub_60A910 | ~1KB | Intrinsic table initializer | sm_100 | 85% |
| sub_60AA20 | ~1KB | Intrinsic table initializer | sm_110 | 85% |
| sub_60F290 | est. | Off-target capmerc compatibility checker | all | 75% |
| sub_612DE0 | 47KB | Kernel finalizer / ELF builder | all | 80% |
| sub_6765E0 | 54KB | Profile constructor (Blackwell entries at lines 600--1330) | all | 95% |
| sub_703AB0 | 10KB | CLI option parser (capmerc/mercury/sass) | all | 90% |
| sub_70D910 | 24 lines | tcgen05 immhalfSplitOff helper | sm_100 | 90% |
| sub_70DDB0 | 47 lines | tcgen05 funcRetArr helper | sm_100 | 90% |
| sub_70E0E0 | 296 lines | tcgen05 guardrail bounds checker | sm_100 | 90% |
| sub_8E4400 | 3.3KB | Warp geometry initializer (Blackwell = 16 warps, 240 slots) | all | 90% |
| sub_8E8A90 | 3.0KB | HW latency table (Blackwell datacenter) | sm_100 | 85% |
| sub_8E9000 | 2.9KB | HW latency table (consumer base) | sm_120 | 85% |
| sub_8E92E0 | 5.5KB | HW latency table (consumer extended) | sm_120 | 85% |
| sub_8E97B0 | 8.8KB | Universal fallback latency table | all | 85% |
| sub_89FBA0 | 85KB | Opcode-to-scheduling-class mapper | all | 90% |
| sub_5BBC30 | 90KB | tcgen05.mma codegen handler | sm_100 | 98% |
| sub_5D4190 | -- | Opcode dispatch table (tcgen05 registrations) | all | 99% |
| sub_1C9C300 | 24KB | Capsule Mercury section processor | all | 85% |
| sub_1C9B110 | 23KB | Mercury capsule builder | all | 85% |

Cross-References

TCGen05 -- 5th Generation Tensor Cores

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

TCGen05 is the Blackwell-generation tensor core instruction family introduced with SM 100. It replaces Hopper's WGMMA with a descriptor-based programming model that operates on Tensor Memory (TMEM) -- a dedicated register-file-like storage visible only to the tensor core. ptxas implements TCGen05 as 13 PTX instruction mnemonics (plus 8 debug guardrails), backed by a 90KB MMA codegen function, 11 SASS opcode groups (28 encoding variants), and a set of compiler-inserted validation hooks. TCGen05 is absent on sm_120/sm_121 (consumer Blackwell).

| Property | Value |
|---|---|
| Target architectures | sm_100, sm_100a, sm_100f, sm_103, sm_103a, sm_103f, sm_110, sm_110a, sm_110f |
| NOT available | sm_120, sm_121 (consumer/DGX Spark) -- gated by SM version checks |
| Capability check | sub_70FA00(*, 29) -- returns true for tcgen05-capable targets |
| PTX instructions | 13: alloc, dealloc, relinquish_alloc_permit, ld, ld.red, st, commit, cp, shift, fence, wait, mma, mma.ws |
| Guardrail instructions | 8: is_phase_valid, are_columns_allocated, is_current_warp_valid_owner, in_physical_bounds, allocation_granularity, datapath_alignment, sp_consistency_across_idesc_mod, check_sparse_usage |
| SASS opcode range | Opcodes 122--139 (TMEM operations), 213--221 (TCGEN05_MMA/FENCE, TMEM extended), 342--372 (TCGEN05 control) |
| Codegen factory | 36864 (9 << 12) -- shared across all Blackwell targets |
| MMA codegen | sub_5BBC30 (90KB) |
| PTX validator | sub_4C5FB0 (28KB -- shared MMA/WMMA/tcgen05 validator) |
| Intrinsic handler | sub_6D7AF0 (19KB -- TCGen05 MMA handler) |
| Intrinsic validator | sub_6D69B0 (12KB -- TCGen05 MMA validator) |
| EIATTR markers | EIATTR_TCGEN05_1CTA_USED, EIATTR_TCGEN05_2CTA_USED |
| Version constraint | Objects using tcgen05 from CUDA 12.x cannot link with 13.0+; must rebuild |

Architecture Overview

Descriptor-Based Model

TCGen05 abandons the register-operand model of previous tensor core generations (WMMA, HMMA, WGMMA) in favor of descriptors. The instruction descriptor (idesc) encodes the matrix operation configuration -- dimensions, data types, data path width, sparsity, and layout. The descriptor is passed as an operand to tcgen05.mma rather than encoded in the instruction mnemonic.

This design decouples the instruction encoding from the operation specification. Where WGMMA required hundreds of distinct intrinsic hash entries to cover every shape/type/layout combination, tcgen05 uses a single instruction with different descriptor values. The ~400 numeric MMA hash entries in the intrinsic dispatch table (at a1+816 in sub_5D4190) map WGMMA variants; tcgen05 replaces that complexity with descriptor-driven dispatch.
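The descriptor idea can be made concrete with a small bit-packing sketch: one instruction, many configurations, each a different descriptor value. The field layout below is hypothetical for demonstration; the real idesc bit assignments are not recovered in this analysis.

```python
# Sketch of descriptor-driven dispatch: pack operation parameters into an
# instruction descriptor (idesc) instead of minting one intrinsic per shape.
# Field name -> (low bit, width); this layout is invented.
FIELDS = {"dtype": (0, 4), "m": (4, 8), "n": (12, 8), "sparse": (20, 1)}

def make_idesc(**kwargs):
    desc = 0
    for name, val in kwargs.items():
        lo, width = FIELDS[name]
        assert val < (1 << width), f"{name} out of range"
        desc |= val << lo
    return desc

# Where WGMMA needed a distinct intrinsic hash entry per shape/type/layout
# combination, each combination here is just a different descriptor value:
dense  = make_idesc(dtype=2, m=64, n=128, sparse=0)
sparse = make_idesc(dtype=2, m=64, n=128, sparse=1)
```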

Tensor Memory (TMEM)

TMEM is a dedicated storage region private to the tensor core unit. It is not part of the general register file and is not directly addressable by non-tensor-core instructions. TMEM is organized into columns that are allocated, used, and deallocated explicitly by the programmer.

Key properties from binary analysis:

  • Column-based allocation: tcgen05.alloc reserves columns; tcgen05.dealloc releases them
  • Two CTA granularities: Operations execute at .cta_group::1 (single CTA) or .cta_group::2 (CTA pair) granularity. A function cannot mix both -- ptxas enforces: "Function '%s' uses single CTA(.cta_group::1) and CTA pair granularity(.cta_group::2) and that is not allowed."
  • Allocation tracking: The compiler inserts reserved shared memory variables to track allocation state:
    • __nv_reservedSMEM_tcgen05_partition -- partition identifier
    • __nv_reservedSMEM_allocation_phase -- current allocation phase
    • __nv_reservedSMEM_allocation_mask -- bitmask of allocated columns
    • __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier -- mbarrier for allocation pipeline
    • __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity -- parity tracking

TMEM Address Computation

Tensor memory addresses are computed through a standardized pattern visible in the TMEM address generator functions (sub_70E740, sub_70E940, sub_70EB00):

cvt.u32.u64 __cuda_sm_100_tcgen05_tmem_addr_base, %s;
add.u32 %s, __cuda_sm_100_tcgen05_tmem_addr_base, %s;

Five named TMEM address roles exist for MMA operations:

| Address Role | Intrinsic Name | Purpose |
|---|---|---|
| D (destination) | __cuda_sm10x_tcgen05_mma_tmemD | Accumulator/output matrix |
| A (input) | __cuda_sm10x_tcgen05_mma_tmemA | Left input matrix |
| Scale A | __cuda_sm10x_tcgen05_mma_scaleTmemA | Scale factors for A |
| Scale B | __cuda_sm10x_tcgen05_mma_scaleTmemB | Scale factors for B |
| Sparse Meta | __cuda_sm10x_tcgen05_mma_spMetaTmem | Sparsity metadata |

Constraint from the binary: "URa must be uint32 when URa is TMEM" -- uniform registers addressing TMEM must use 32-bit unsigned integers. When addressing a global descriptor: "URa must be uint64 when URa is GDESC".

PTX Instruction Set

Lifecycle Instructions

| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.alloc | 0x526370 | 1,287 B | Allocate TMEM columns for tensor core use |
| tcgen05.dealloc | 0x574050 | 2,130 B | Release allocated TMEM columns |
| tcgen05.relinquish_alloc_permit | 0x58C7F0 | 4,282 B | Relinquish allocation permit (multi-CTA coordination) |

The alloc instruction has two CTA-granularity variants visible in the prototype strings:

  • __cuda_sm10x_tcgen05_alloc_one_sm -- single-SM allocation (.cta_group::1)
  • __cuda_sm10x_tcgen05_alloc_two_sm -- two-SM allocation (.cta_group::2)

Both take a destination pointer argument (__cuda_sm10x_tc_alloc_dst_ptr_arg) and a column count (__cuda_sm10x_tc_alloc_num_cols_arg).

Data Movement Instructions

| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.ld | 0x578DB0 | 2,466 B | Load data into TMEM from shared/global memory |
| tcgen05.ld.red | 0x571FE0 | 2,066 B | Load with reduction (accumulate into TMEM) |
| tcgen05.st | 0x56C190 | 1,842 B | Store data from TMEM to shared/global memory |
| tcgen05.cp | 0x5427F0 | 903 B | Copy between TMEM regions (intra-tensor-core) |

Three intrinsic helper arrays support the ld/st/ld.red operations:

| Helper | Purpose |
|---|---|
| __cuda_sm_100_tcgen05_ld_funcRetArr | Return array descriptor for loads |
| __cuda_sm_100_tcgen05_ld_red_funcRetArr | Return array descriptor for load-reduce |
| __cuda_sm_100_tcgen05_st_funcInputArr | Input array descriptor for stores |

Each has a corresponding immhalfSplitOff parameter controlling split behavior:

  • __cuda_sm_100_tcgen05_ld_immhalfSplitOff
  • __cuda_sm_100_tcgen05_ld_red_immhalfSplitOff
  • __cuda_sm_100_tcgen05_st_immhalfSplitOff

Synchronization Instructions

| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.commit | 0x5427F0 | 1,575 B | Commit pending tensor core operations |
| tcgen05.fence | (inline) | -- | Fence preventing reordering of tcgen05 operations |
| tcgen05.wait | (inline) | -- | Wait for committed tcgen05 operations to complete |
| tcgen05.shift | 0x58FA20 | 4,604 B | Shift accumulator data within TMEM (shared formatter with mma) |

Compute Instructions

| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.mma | 0x5BBC30 (codegen) / 0x58FA20 (formatter) | 90KB / 4,604 B | Matrix multiply-accumulate |
| tcgen05.mma.ws | 0x4DA720 (formatter) | 343 B | Warp-specialized MMA variant |

TCGen05.MMA -- Matrix Multiply-Accumulate

Codegen Function: sub_5BBC30 (90KB)

The largest per-instruction codegen function for TCGen05. Registered as the "tcgen05.mma" handler in sub_5D4190 (the intrinsic dispatch table builder). The function:

  1. Allocates a 50,000-byte working buffer
  2. Queries sub_70FA00(*, 29) to validate tcgen05 capability on the current target
  3. Processes the instruction descriptor to determine operation parameters
  4. Generates tensor memory addressing code for all operands (D, A, scaleA, scaleB, sparsity meta)
  5. Emits the final MMA instruction encoding

MMA Modifiers

The binary reveals a rich set of MMA modifiers extracted by functions in the sub_70D1F0--sub_70D410 cluster:

| Modifier | String | Purpose |
|---|---|---|
| .o128 | ".o128" | 128-bit output size |
| .transA | ".transA" | Transpose A matrix |
| .transB | ".transB" | Transpose B matrix |
| .negA | "_negA" | Negate A matrix |
| .negB | "_negB" | Negate B matrix |
| _expand16bit | "_expand16bit" | 16-bit expansion mode |
| _pack16bit | "_pack16bit" | 16-bit packing mode |
| _maxabs | "_maxabs" | Maximum absolute value reduction |
| _minabs | "_minabs" | Minimum absolute value reduction |
| _fused | "_fused" | Fused operation mode |
| _blockscale | "_blockscale" | Block scaling (MX format support) |
| _ashift | "_ashift" | A-matrix shift |
| _areuse | "_areuse" | A-matrix register reuse |
| _akeep | "_akeep" | A-matrix keep (preserve for reuse) |

Data Path Configurations

The MMA data path width determines the number of elements processed per cycle and the accumulator layout. Six configurations exist:

| Data Path | Interpretation |
|---|---|
| _4dp256bit | 4 data paths, 256 bits each |
| _16dp32bit | 16 data paths, 32 bits each (two sub-variants: t0t15, t16t31) |
| _32dp32bit | 32 data paths, 32 bits each |
| _16dp256bit | 16 data paths, 256 bits each |
| _128dp256bit | 128 data paths, 256 bits each |

Constraint: "fused and l16dp32bit must be specified together" -- the fused mode requires the 16dp32bit data path.

Block Scaling (MX Format)

TCGen05 adds native block scaling support for microscaling (MX) format operations, visible through the tcmma prefix strings:

  • "tcmma_*_o must be specified with blockscale" -- output modifier requires blockscale
  • "uri width for tcmma_*_o must be 2" -- output uniform register index width must be 2
  • "tcmma_*_q with blockscale must have uri width of 2" -- quantization with blockscale
  • "tcmma_*_mxq must be specified with blockscale" -- MX quantization requires blockscale

Warp-Specialized MMA (.ws)

The .ws modifier enables warp-specialized execution where different warps in a warpgroup contribute to different phases of the MMA pipeline. Constraints from the binary:

  • "When using buffer1-3, WS modifier must be specified" -- triple buffering requires .ws
  • "ws opcode modifier not allowed with .2CTA" -- warp specialization is single-CTA only
  • "ws opcode modifier not allowed with areuse or akeep" -- .ws incompatible with A-matrix reuse
  • "ws opcode modifier not allowed with ashift" -- .ws incompatible with A-matrix shift

Triple-buffer register reuse strings for .ws mode:

| Buffer | Variant |
|---|---|
| _breuse_bkeep_buffer1 | B-reuse + B-keep, buffer 1 |
| _breuse_buffer1 | B-reuse, buffer 1 |
| _breuse_bkeep_buffer2 | B-reuse + B-keep, buffer 2 |
| _breuse_buffer2 | B-reuse, buffer 2 |
| _breuse_bkeep_buffer3 | B-reuse + B-keep, buffer 3 |
| _breuse_buffer3 | B-reuse, buffer 3 |

Sparsity Support

TCGen05 supports structured sparsity through the sparsity metadata TMEM address (spMetaTmem). The _ashift modifier is constrained: "Ashift can only be specific when URa is in TMEM".

SASS Encoding

Opcode Map

TCGen05 SASS instructions span three opcode regions in the SM 100 SASS ISA. The encoding information comes from the latency model tables (sub_8E8A90 for sm_100) and the master instruction encoder (sub_6D9690, 94KB).

TMEM Operations (Opcodes 122--139)

| Opcode | Variants | Category | Encoding Class | Operands |
|---|---|---|---|---|
| 122 | 2 | TMEM_OP / new ISA | F1F08, F1C60 | 3-op, reg10 |
| 123 | 6 | TMEM_LD (tensor mem load) | F1F08, F1DF8 | 2--3 op |
| 125 | 6 | TMEM_ST (tensor mem store) | F1F08, F1DF8 | 2--3 op |
| 127 | 9 | TMEM_ALLOC / FENCE | F1F08..F29A8 | 3--6 op |
| 129 | 3 | TMEM extended | F1F08 | 2 op |
| 130 | 26 | EXTENDED_MOV / TMEM_MVA | F1F08..F2678 | 2--9 op |
| 131 | 3 | EXTENDED_ALU / UTMA | F21B0 | 4--5 op |
| 133 | 1 | UTMA variant | F21B0 | 4 op |
| 139 | 4 | TCGEN05 operations | F21B0, F2568 | 4--8 op |

TCGEN05 MMA/FENCE (Opcodes 213--221)

| Opcode | Variants | Category | Encoding Class | Operands |
|---|---|---|---|---|
| 213 | 6 | TCGEN05_MMA | F2678 | 5--7 op |
| 216 | 2 | TCGEN05_FENCE | F2678 | 3--4 op |
| 219 | 6 | TMEM_LD extended | F1C60..F2810 | 3--7 op |
| 220 | 1 | TMEM_ST extended | F1C60 | 3 op |
| 221 | 1 | TMEM_PREFETCH | F1C60 | 3 op |
| 255 | 1 | SETSTMEMADDR | F1F08 | 1 op |
| 269 | 4 | TMEM_ALLOC_FENCE ext | F2018, F1DF8 | 2--3 op |

TCGEN05 Control (Opcodes 342--372)

28 encoding variants across 10 opcodes. These are the primary tensor core pipeline control instructions:

| Opcode | Variants | Category | Encoding Class | Operands |
|---|---|---|---|---|
| 342 | 1 | TCGEN05 ctrl A | F1F08 | 0 op (scheduling marker) |
| 343 | 1 | TCGEN05 ctrl B | F1F08 | 0 op (scheduling marker) |
| 344 | 14 | TCGEN05 execute | F1F08..F3008 | 2--7 op |
| 346 | 4 | TCGEN05 commit | F1F08, F2018 | 2--3 op |
| 349 | 1 | TCGEN05 sync | F1D70 | 0 op |
| 359 | 3 | TCGEN05 alloc | F1D70, F1F08 | 0--2 op |
| 369 | 1 | TCGEN05 dealloc | F1F08 | 0 op |
| 370 | 1 | TCGEN05 release A | F1D70 | 0 op |
| 371 | 1 | TCGEN05 release B | F1D70 | 0 op |
| 372 | 1 | TCGEN05 release C | F1D70 | 0 op |

Opcode 344 (TCGEN05 execute) has the most variants (14), spanning encoding classes from F1F08 to F3008 with 2 to 7 operands. This is the actual MMA dispatch instruction -- the wide encoding range reflects the variety of descriptor configurations, operand modes, and data path widths.

Encoding Class Distribution

The encoding classes used by TCGen05 SASS instructions:

| Class | Role | Usage |
|---|---|---|
| F1D70 | Control/sync | alloc (0-op), sync, release A/B/C |
| F1F08 | General | ctrl markers, execute, commit, alloc, dealloc, TMEM ops |
| F1C60 | Extended | TMEM_LD/ST extended, TMEM_PREFETCH |
| F1DF8 | Standard | TMEM_LD/ST, TMEM_ALLOC_FENCE ext |
| F2018 | Commit ext | TCGEN05 commit, TMEM_ALLOC_FENCE ext |
| F21B0 | ALU | TCGEN05 operations, UTMA |
| F2568 | TCGEN05 ops | TCGEN05 operations |
| F2678 | MMA/FENCE | TCGEN05_MMA, TCGEN05_FENCE |
| F29A8 | TMEM_ALLOC | TMEM_ALLOC/FENCE |
| F2810 | Extended | TMEM_LD extended |
| F3008 | Execute max | TCGEN05 execute (high-operand-count) |

Latency Model

The sm_100 latency table (sub_8E8A90) uses a two-part structure: a 3.0KB base table covering standard instructions and a 949-byte supplement dedicated to TCGEN05 operations. The sm_120 consumer Blackwell table (sub_8E9000 + sub_8E92E0, 5.5KB) is the largest individual table and does not include TCGEN05 entries (confirming the feature's absence on consumer silicon).

CTA Granularity

TCGen05 operations specify whether they execute at single-CTA or CTA-pair granularity through the .cta_group modifier:

| Granularity | Modifier | EIATTR | ELF Marker |
|---|---|---|---|
| Single CTA | .cta_group::1 | EIATTR_TCGEN05_1CTA_USED | TC_1CTA |
| CTA Pair | .cta_group::2 | EIATTR_TCGEN05_2CTA_USED | TC_2CTA |

The compiler emits the appropriate EIATTR marker into the output cubin based on which granularity the kernel uses. The CUDA runtime uses this to configure the CTA launch parameters.

The binary enforces exclusivity: a single function cannot mix .cta_group::1 and .cta_group::2 operations. The error message is explicit: "Function '%s' uses single CTA(.cta_group::1) and CTA pair granularity(.cta_group::2) and that is not allowed."
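The exclusivity rule plus marker emission can be modeled compactly. The error text is the actual string recovered from the binary; the function and its argument shapes are illustrative reconstructions.

```python
# Model of the .cta_group exclusivity check and EIATTR marker selection.
# ops is a list of (mnemonic, cta_group) pairs for one function's tcgen05 ops.
def check_cta_granularity(func_name, ops):
    groups = {group for (_, group) in ops}
    if groups == {1, 2}:
        raise ValueError(
            "Function '%s' uses single CTA(.cta_group::1) and CTA pair "
            "granularity(.cta_group::2) and that is not allowed." % func_name)
    markers = {1: "EIATTR_TCGEN05_1CTA_USED", 2: "EIATTR_TCGEN05_2CTA_USED"}
    return markers[groups.pop()] if groups else None
```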

ELF/Cubin Markers

EIATTR Entries

| EIATTR Name | Purpose |
|---|---|
| EIATTR_TCGEN05_1CTA_USED | Kernel uses tcgen05 at single-CTA granularity |
| EIATTR_TCGEN05_2CTA_USED | Kernel uses tcgen05 at CTA-pair granularity |

EICOMPAT Attributes

| EICOMPAT Name | Purpose |
|---|---|
| EICOMPAT_ATTR_INST_TCGEN05_MMA | Kernel uses tcgen05.mma instructions |
| EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED | Kernel uses deprecated (12.x-era) tcgen05.mma encoding |

Entry Fragment Markers

TMEM usage per-CTA is recorded in entry fragment markers:

| Marker | Version | Purpose |
|---|---|---|
| AT_ENTRY_FRAGMENT_TMEM_CTA1 | V1 | TMEM usage for single-CTA kernels |
| AT_ENTRY_FRAGMENT_TMEM_CTA2 | V1 | TMEM usage for CTA-pair kernels |
| AT_ENTRY_FRAGMENT_TMEM_CTA1_V2 | V2 | TMEM usage V2 format, single-CTA |
| AT_ENTRY_FRAGMENT_TMEM_CTA2_V2 | V2 | TMEM usage V2 format, CTA-pair |

Guardrail Debug Instrumentation

When compiling with -g (debug mode), ptxas inserts runtime validation checks around tcgen05 operations. These are controlled by the --g-tensor-memory-access-check / --gno-tensor-memory-access-check CLI options.

Guardrail Check Functions

Eight _tcgen05.guardrails.* pseudo-instructions insert inline validation code:

| Guardrail | Formatter Address | Size | Validates |
|---|---|---|---|
| is_phase_valid | 0x4DDE70 | 775 B | Allocation phase is correct for the operation |
| are_columns_allocated | 0x4DBF20 | 599 B | Accessed columns are currently allocated |
| is_current_warp_valid_owner | 0x4DE180 | 791 B | Current warp owns the accessed TMEM region |
| in_physical_bounds | 0x4DB050 | 439 B | Column access is within physical TMEM bounds |
| allocation_granularity | 0x4F0960 | 839 B | Column count meets granularity requirements |
| datapath_alignment | 0x4DD580 | 735 B | TMEM address is aligned for the data path width |
| sp_consistency_across_idesc_mod | 0x500FA0 | 970 B | Sparsity settings in descriptor match modifier |
| check_sparse_usage | 0x4DDB80 | 743 B | Sparse mode usage is valid for the environment |
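The column-allocation check reduces to mask arithmetic over the recovered __nv_reservedSMEM_allocation_mask (a bitmask of allocated columns). The arithmetic below is our reconstruction of what such a check must do, not decompiled code.

```python
# Sketch of the are_columns_allocated guardrail: every accessed column's
# bit must be set in the allocation mask, otherwise the trap path fires.
def columns_allocated(alloc_mask, start_col, num_cols):
    """True iff all columns [start_col, start_col + num_cols) are allocated."""
    accessed = ((1 << num_cols) - 1) << start_col
    return (accessed & ~alloc_mask) == 0

alloc_mask = 0b0000_1111                        # columns 0..3 allocated
assert columns_allocated(alloc_mask, 0, 4)      # in-bounds access: passes
assert not columns_allocated(alloc_mask, 2, 4)  # touches columns 4,5: would trap
```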

Guardrail Trap Functions

When a guardrail check fails, it calls a trap function that reports the violation and terminates:

| Trap Intrinsic | Parameters |
|---|---|
| __cuda_sm10x_tcgen05_guardrail_trap_phase_invalid_during_alloc | (.reg .b32 phase) |
| __cuda_sm10x_tcgen05_guardrail_trap_current_warp_owner_invalid | (.reg .b32 tmem_start_lane_accessed, .reg .b32 cur_warp_id, ...) |
| __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access | (.reg .b32 col_no_accessed, .reg .b32 alloced_mask, .reg .b32 instr_kind) |
| __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_being_dealloced | (.reg .b32 col_no_being_dealloced, .reg .b32 alloced_mask) |
| __cuda_sm10x_tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc | (.reg .b32 col_no_being_dealloced_not_returned_by_alloc, ...) |
| __cuda_sm10x_tcgen05_guardrail_trap_allocation_granularity_invalid | (.reg .b32 nCols) |
| __cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds | (.reg .b32 oob_access_col_no, .reg .b32 instr_kind) |
| __cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment | (.reg .b32 dp_lane, .reg .b32 matrix_kind, .reg .b32 valid_alignment_kind, ...) |
| __cuda_sm10x_tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod | (.reg .b32 idesc_sp_enabled, .reg .b32 mod_sp_enabled) |
| __cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env | (.reg .b32 idesc_sp_enabled, .reg .b32 idesc, .reg .b32 mma_kind, .reg .b32 ptx_target, .reg .b32 is_family_portable) |

These are intrinsic IDs 0x20--0x2A (11 entries total including a mask creation helper) in the intrinsic table.
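The unallocated-columns check can be modeled from the trap signature alone. The sketch below is a Python model, not recovered code: the real check is inline PTX emitted by the formatter at 0x4DBF20, and the parameter names (`start_col`, `num_cols`) are illustrative. Only the mask semantics implied by `(col_no_accessed, alloced_mask, instr_kind)` are taken from the binary.

```python
# Hedged model of the are_columns_allocated guardrail: bit i of
# alloced_mask marks column group i as currently allocated; any
# accessed column whose bit is clear triggers the trap intrinsic
# with col_no_accessed set to that column.
def are_columns_allocated(start_col: int, num_cols: int, alloced_mask: int):
    """Return (ok, offending_col)."""
    for col in range(start_col, start_col + num_cols):
        if not (alloced_mask >> col) & 1:
            return False, col  # trap(col_no_accessed=col, alloced_mask, ...)
    return True, None
```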

Guardrail Check Wrappers

The compiler also generates .FORCE_INLINE wrapper functions that combine multiple checks:

| Wrapper | Parameters |
|---|---|
| __cuda_sm10x_tcgen05_guardrails_check_phase_validity | (.reg .u32 dummyInp) |
| __cuda_sm10x_tcgen05_guardrails_check_column_allocation | (.reg .u32 start_col_num, .reg .u32 num_of_cols, ...) |
| __cuda_sm10x_tcgen05_guardrails_check_datapath_validity | (.reg .u32 tmem_addr, .reg .u32 ld_or_st) |
| __cuda_sm10x_tcgen05_guardrails_check_physical_bounds | (.reg .u32 start_col_num, .reg .u32 num_of_cols, ...) |
| __cuda_sm10x_tcgen05_guardrails_check_allocation_granularity | (.reg .u32 num_of_cols) |
| __cuda_sm10x_tcgen05_guardrails_check_datapath_alignment | (.reg .u32 tmemAddr, .reg .u32 iDesc, .reg .u32 cta_group, ...) |

Bulk Copy Operations (cp.async.bulk.tensor)

TCGen05 is complemented by asynchronous bulk copy operations for loading data into tensor memory. These are registered as separate intrinsic IDs (0x2B--0x3C, 18 entries) under the __cuda_sm1xx_* naming convention:

| Operation | Codegen Handler | Size |
|---|---|---|
| cp.async.bulk.tensor (1D--5D, tile/im2col, unicast/multicast) | sub_5AB460 | 45KB |
| cp.async.bulk | sub_593210 | -- |
| cp.async.mbarrier.arrive | sub_4DC180 | -- |

The cp.async.bulk.tensor handler is 45KB and covers all dimensionality variants (1D through 5D), both tile and im2col access patterns, and unicast/multicast delivery modes.

SM Availability Gating

Capability Check

TCGen05 availability is gated by sub_70FA00(*, 29), which checks the target SM version. The check returns true for sm_100, sm_103, and sm_110 (and their a/f sub-variants) and false for sm_120/sm_121.
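The gate's observed behavior can be written as a one-line membership test. This is a minimal model, not the recovered implementation of sub_70FA00: the integer SM encoding below is an assumption, and only the accept/reject sets come from the analysis above.

```python
# Model of the tcgen05 capability gate, sub_70FA00(*, 29):
# true for sm_100/sm_103/sm_110 (and a/f sub-variants),
# false for sm_120/sm_121.
TCGEN05_CAPABLE = {100, 103, 110}  # base SM numbers; a/f variants share the base

def has_tcgen05(sm_base: int) -> bool:
    return sm_base in TCGEN05_CAPABLE
```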

OCG Builtin Names

The OCG (Optimizing Code Generator) layer uses short mnemonic names for tcgen05 operations, visible in the builtin name lookup (sub_6C9EB0):

| OCG Name | Full Operation |
|---|---|
| tcmma | tcgen05.mma core multiply-accumulate |
| tcshift | tcgen05.shift accumulator data shift |
| gdesc | Global descriptor operations |
| memclear | Tensor memory clear |
| sparsify | Sparsity pattern application |

The .tcgen05op string identifies an Ori IR instruction as belonging to the tcgen05 family during the optimizer pipeline.

Version Compatibility

CUDA 12.x to 13.0 Breaking Change

ptxas v13.0.88 includes a linker-level version check for tcgen05 objects:

"Object '%s' cannot be linked due to version mismatch. Objects using tcgen05 in 12.x cannot be linked with 13.0 or later, they must be rebuilt with latest compiler"

The EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED attribute tags objects compiled with the 12.x-era tcgen05 encoding, which is binary-incompatible with the 13.0 encoding. The SASS instruction encoding for tcgen05.mma changed between CUDA 12.x and 13.0 -- objects must be recompiled.

SM 100 vs SM 103 Differences

Both sm_100 and sm_103 share the same tcgen05 instruction set and codegen factory (36864). They share all 7 dispatch-table handler functions. The differences between sm_100 and sm_103 are:

  • Different Handler A and Handler B capability accessor functions (sm_100: sub_609C30/sub_609BD0; sm_103: sub_608F20/sub_609D20)
  • Different intrinsic table initializers (sm_100: sub_60A910; sm_103: sub_60A700)
  • sm_103 may expose additional capability flags for GB300-specific features

Both targets produce identical SASS for tcgen05 instructions. The f sub-variants (sm_100f, sm_103f) allow cross-compilation within the family: sm_100f code can run on sm_103 hardware.

Compiler Pipeline

PTX Parsing and Validation

  1. Lexer (sub_720F00, 64KB): Recognizes tcgen05.* tokens during lexical analysis
  2. Validator (sub_4C5FB0, 28KB): Shared MMA/WMMA/tcgen05 validation function. Checks instruction legality for the current SM target, validates operand types, descriptor fields, and modifier combinations
  3. Instruction table (sub_46E000, 93KB): Registers tcgen05 instruction variants with their type combinations (e.g., .tcgen05op)

Intrinsic Dispatch

The intrinsic dispatch table builder (sub_5D4190, 41KB) registers tcgen05 handlers:

| Registration | PTX Instruction | Handler | Size |
|---|---|---|---|
| Line 112 | tcgen05.mma | sub_5BBC30 | 90KB |
| Lifecycle | tcgen05.alloc | sub_569180 | -- |
| Lifecycle | tcgen05.relinquish_alloc_permit | sub_526370 | -- |
| Lifecycle | tcgen05.dealloc | sub_58C7F0 | -- |
| Data | tcgen05.ld | sub_574050 | -- |
| Data | tcgen05.ld.red | sub_578DB0 | -- |
| Data | tcgen05.st | sub_571FE0 | -- |
| Sync | tcgen05.commit | sub_56C190 | -- |
| Copy | tcgen05.cp | sub_5427F0 | -- |
| Compute | tcgen05.shift | sub_4F1A90 | -- |
| Compute | tcgen05.mma.ws | sub_58FA20 | -- |

Intrinsic Lowering

The TCGen05 MMA handler (sub_6D7AF0, 19KB) and validator (sub_6D69B0, 12KB) in the encoding zone handle the lowering from abstract intrinsic operations to concrete SASS encoding. The handler checks modifier consistency:

  • "fused and l16dp32bit must be specified together"
  • "Inputs vector length is inconsistent with layout and num modifiers"

TMEM Address Generation

The TMEM address generator cluster (sub_70E740, sub_70E940, sub_70EB00) generates PTX parameter passing code for tensor memory addresses:

st.param.b32 [%s + %d], %s;
ld.param.b32 %s, [%s + %d];

SASS Encoding

The master instruction encoder (sub_6D9690, 94KB) handles the final binary encoding. TCGen05 instructions use the Mercury encoding pipeline (encoder factory 36864) with Blackwell-specific opcode tables.

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_4C5FB0 | 28KB | PTX validator (MMA/WMMA/tcgen05 shared) | HIGH |
| sub_4DA720 | 343 B | tcgen05.mma.ws formatter | HIGH |
| sub_4DB050 | 439 B | guardrails.in_physical_bounds formatter | HIGH |
| sub_4DBF20 | 599 B | guardrails.are_columns_allocated formatter | HIGH |
| sub_4DD580 | 735 B | guardrails.datapath_alignment formatter | HIGH |
| sub_4DDB80 | 743 B | guardrails.check_sparse_usage formatter | HIGH |
| sub_4DDE70 | 775 B | guardrails.is_phase_valid formatter | HIGH |
| sub_4DE180 | 791 B | guardrails.is_current_warp_valid_owner formatter | HIGH |
| sub_4F0960 | 839 B | guardrails.allocation_granularity formatter | HIGH |
| sub_4F1A90 | 903 B | tcgen05.shift / tcgen05.cp formatter | HIGH |
| sub_500FA0 | 970 B | guardrails.sp_consistency_across_idesc_mod formatter | HIGH |
| sub_526370 | 1,287 B | tcgen05.alloc / tcgen05.relinquish_alloc_permit formatter | HIGH |
| sub_5427F0 | 1,575 B | tcgen05.commit formatter | HIGH |
| sub_569180 | -- | tcgen05.alloc codegen handler | HIGH |
| sub_56C190 | 1,842 B | tcgen05.st formatter | HIGH |
| sub_571FE0 | 2,066 B | tcgen05.ld.red formatter | HIGH |
| sub_574050 | 2,130 B | tcgen05.dealloc formatter | HIGH |
| sub_578DB0 | 2,466 B | tcgen05.ld formatter | HIGH |
| sub_58C7F0 | 4,282 B | tcgen05.relinquish_alloc_permit / tcgen05.dealloc formatter | HIGH |
| sub_58FA20 | 4,604 B | tcgen05.shift + tcgen05.mma formatter | HIGH |
| sub_593210 | -- | cp.async.bulk codegen | HIGH |
| sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (1D--5D) | HIGH |
| sub_5BBC30 | 90KB | tcgen05.mma codegen (main) | HIGH |
| sub_6D69B0 | 12KB | TCGen05 MMA validator (encoding zone) | MED |
| sub_6D7AF0 | 19KB | TCGen05 MMA handler (encoding zone) | HIGH |
| sub_70BC30 | -- | TCGen05 parameter helper | MED |
| sub_70BCC0 | -- | TCGen05 parameter helper | MED |
| sub_70DEF0 | -- | TCGen05 parameter helper | MED |
| sub_70E0E0 | -- | SM100 guardrail bounds-check code generator | MED |
| sub_70E740 | -- | TMEM address generator (tmemD) | MED |
| sub_70E940 | -- | TMEM address generator (tmemA) | MED |
| sub_70EB00 | -- | TMEM address generator (scaleTmemA/B, spMetaTmem) | MED |
| sub_70FA00 | -- | Instruction capability checker (29 = tcgen05) | HIGH |
| sub_8E8A90 | 3.0KB + 949 B | SM 100 latency table (base + TCGEN05 supplement) | HIGH |

Cross-References

Intrinsic Table Architecture (607 Registered Entries)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas maintains two separate intrinsic subsystems that together cover every CUDA runtime helper function, every PTX opcode requiring inline code generation, and every Blackwell+ OCG builtin operation. The first subsystem (sub_5D1660 + sub_5D4190 + sub_5D7430 + sub_5FF700) handles 607 classical CUDA intrinsics and PTX opcode dispatch through a name-to-ID hash map, a body template name table, and a giant prototype generator. The second subsystem (sub_6C9EB0 and its handler cluster at 0x6C0000--0x6CC000) handles OCG (Optimizing Code Generator) builtins for SM100+ targets. Both subsystems use the same hash map infrastructure (sub_425CA0 / sub_426150 / sub_426D60) documented in Hash Tables & Bitvectors.

| Role | Function |
|---|---|
| Master registration | sub_5D1660 (46KB) -- 607 CUDA intrinsics, name-to-integer-ID hash map (608 table slots, ID 0 = null) |
| Opcode dispatch | sub_5D4190 (41KB) -- ~120 PTX opcodes to codegen handlers + ~400 MMA hash entries |
| Body template names | sub_5D7430 (161KB) -- 1,079 intrinsic names constructed from .rodata prefixes + type suffixes, stored in hash map at +824 |
| Prototype generator | sub_5FF700 (354KB) -- switch generating .weak .func PTX declarations |
| OCG intrinsic table | sub_6C9EB0 (13KB) -- __nv_ptx_builtin_ocg_* dispatch for SM100+ |
| OCG router | sub_6CC690 (22KB) -- routes OCG calls to type-specific handlers |
| OCG name resolver | sub_6C9BC0 -- resolves operation names to internal enums |
| Hash map create | sub_425CA0 (initial capacity 0x80) |
| Hash map insert | sub_426150(map, name, value) |
| Hash map lookup | sub_426D60 |
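The registration pattern shared by both subsystems is simple enough to model directly. The sketch below stands in for the recovered hash map (sub_425CA0 create / sub_426150 insert / sub_426D60 lookup) with a Python dict; the intrinsic ID value is illustrative, not recovered, and the real map stores IDs cast to `char*`.

```python
# Model of the shared name-to-ID hash map infrastructure.
def hashmap_create(capacity=0x80):
    return {}                      # capacity hint ignored in this model

def hashmap_insert(m, name, value):  # sub_426150(map, name, (char*)ID)
    m[name] = value

def hashmap_lookup(m, name):         # sub_426D60
    return m.get(name, 0)            # ID 0 doubles as the null/"not found" slot

intrinsics = hashmap_create()
hashmap_insert(intrinsics, "__cuda_sm20_div_s16", 42)  # ID value illustrative
```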

Per-Family Deep Dives:

System Overview

sub_451730 (intrinsic lowering context constructor)
  │
  ├── sub_5D4190(ctx)  ── register PTX opcode & MMA handlers ──────────────┐
  │     │ (1) Calls sub_5D1660 to populate intrinsic ID table (607 entries) │
  │     │ (2) Registers ~120 PTX opcode -> codegen handler mappings         │
  │     │ (3) Registers ~400 MMA hash -> codegen handler mappings           │
  │     │                                                                   │
  │     ├─ Hash map at +808  ── PTX opcode name -> codegen function ptr     │
  │     │    "div"     -> sub_5B76D0  (64KB)                                │
  │     │    "sqrt"    -> sub_5B4040  (49KB)                                │
  │     │    "wmma.mma"-> sub_5C7A50  (173KB)                               │
  │     │    "mma"     -> sub_5C10A0  (120KB)                               │
  │     │    ... ~116 more                                                  │
  │     │                                                                   │
  │     ├─ Hash map at +816  ── numeric MMA hash -> codegen handler ptr     │
  │     │    "2644314910" -> sub_4DDB80                                     │
  │     │    ... ~399 more (shape/type/layout combinations)                 │
  │     │                                                                   │
  │     └─ ID table at +1056 ── 9728-byte array (memcpy from unk_1D4D940)  │
  │        Hash map at +1064 ── name -> integer ID (sub_5D1660, 607)        │
  │        Count at +1072 = 608 (includes null ID 0 slot)                   │
  │                                                                         │
  ├── sub_4CE230(ctx)  ── register modifier keywords (GUARD, PRED, ...)     │
  │                                                                         │
  ├── sub_5D7430(ctx, sregs)  ── body template name table (161KB) ──────────┤
  │     │ 1,079 entries, each constructed from:                             │
  │     │   16-byte .rodata prefix (e.g. "__cuda_sm20_div_")               │
  │     │ + 4-byte type suffix (e.g. "s16\0", "u64\0", "rn_f")            │
  │     │ → registered into hash map at +824 with sequential integer IDs    │
  │     │                                                                   │
  │     └─ Hash map at +824  ── intrinsic name -> body template ID          │
  │          "__cuda_sm20_div_s16" -> 0                                     │
  │          "__cuda_sm20_div_u16" -> 1                                     │
  │          ... 1,079 total entries                                        │
  │                                                                         │
  └── sub_451330("<fermi macros>", ...)  ── load Fermi macro library        │
                                                                            │
sub_5FF700 (354KB) ─────────────────────────────────────────────────────────┘
  │ switch(body_template_id) with hundreds of cases
  │ Each case: allocate buffer via sub_4DA340, strcpy() PTX prototype
  │
  │ case 0:  ".weak .func (.reg .s32 %d) __cuda_sm20_div_s16
  │           (.reg .s32 %a0, .reg .s32 %a1)"
  │ case 4:  ".weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64
  │           (.reg .u64 %rda1, .reg .u64 %rda2)"
  │ case 9:  ".weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32
  │           (.reg .f32 %fa1, .reg .f32 %fa2)"
  │ case 25: ".weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full
  │           (...)"
  │ ... hundreds more for rcp, sqrt, dsqrt, barrier, wmma, mma, etc.
  v
Emitted into PTX output as .weak .func declarations
(linker resolves calls to runtime helper functions)

Master Registration -- sub_5D1660

This 46KB function is the master catalog. It allocates a 9728-byte table (memcpy from unk_1D4D940, 0x2600 bytes = 608 x 16B slots), creates a hash map with initial capacity 0x80 via sub_425CA0, then calls sub_426150(hashmap, "name", (char*)ID) exactly 607 times to register every CUDA runtime helper function with an integer ID (IDs 1--607, contiguous). The hash map is stored at a1+1064, the table at a1+1056, and the count 608 at a1+1072 (includes the unused null ID 0 slot).

Complete ID Allocation

607 intrinsics are registered with contiguous IDs from 0x01 through 0x25F. The binary stores count=608 at a1+1072 because the pre-built 9,728-byte table (608 x 16B slots) includes a null ID 0 sentinel. The ID ranges partition cleanly by SM generation and functional category.

| ID Range | Count | Prefix | Category | SM Floor |
|---|---|---|---|---|
| 0x001--0x011 | 17 | __cuda_reduxsync_* | Redux sync (b32 and/or/xor, f32 max/min/abs/NaN, s32/u32 add/max/min) | sm_70 |
| 0x012--0x018 | 7 | __cuda_sanitizer_memcheck_* | Compute-sanitizer hooks (free, generic, global, local, malloc, readmetadata, shared) | -- |
| 0x019--0x01F | 7 | __cuda_scalar_video_emulation_* | Video instruction emulation helpers | sm_20 |
| 0x020--0x02A | 11 | __cuda_sm10x_* | Blackwell tcgen05 guardrail traps + create_mask helper | sm_100 |
| 0x02B--0x03C | 18 | __cuda_sm1xx_* | Bulk copy + cp.async.bulk.tensor 1D--5D tile/im2col uni/multicast | sm_100+ |
| 0x03D--0x082 | 70 | __cuda_sm20_* | IEEE math: bfe, bfi, div, rcp, sqrt, dsqrt, drsqrt, rem (all rounding modes + slowpaths) | sm_20 |
| 0x083--0x086 | 4 | __cuda_sm3x_div_* | Optimized division variants (rn_ftz_f32, rn_noftz_f32 + slowpaths) | sm_30 |
| 0x087--0x088 | 2 | __cuda_sm62_dp2a/dp4a | Integer dot product emulation | sm_62 |
| 0x089--0x211 | 393 | __cuda_sm70_* | Volta+ intrinsics (barriers, shuffle, vote, match, WMMA -- all shapes, layouts, address spaces) | sm_70 |
| 0x212--0x214 | 3 | __cuda_sm80_* | Ampere: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode | sm_80 |
| 0x215--0x21E | 10 | __cuda_sm_10x_* | Blackwell hmma/imma mdata + bit MMA (and/xor m8n8k128/m16n8k128/m16n8k256) | sm_100 |
| 0x21F--0x22C | 14 | __cuda_sm_8x_* | Direct MMA operations (f16/f32 accum, 4 layout combos) + mma_shfl_f16/f32 | sm_80+ |
| 0x22D--0x25F | 51 | __cuda_sm_9x_* | Hopper sub-byte + bit MMA: s4/u4 dense m16n8k32/k64 + sparse m16n8k64/k128, bit xor (m8n8k128/m16n8k128/m16n8k256) | sm_90 |

Total: 607 registered intrinsics across 13 prefix groups. Table has 608 slots (ID 0 unused).

sm_70 Intrinsic Breakdown (IDs 0x89--0x211)

The sm_70 block is by far the largest at 393 entries. It covers every Volta-era warp-synchronous intrinsic plus the complete WMMA API. The explosion in count comes from the combinatorial product of shapes, layouts, data types, address spaces, and predicate/satfinite variants.

| Sub-Category | Examples | Combinatorial Source |
|---|---|---|
| barrier_arrive | 0--15, with/without count | 16 barrier IDs x 2 count variants |
| barrier_red_and/or/popc | 0--15, with/without count | 3 reduction ops x 16 IDs x 2 count |
| barrier_sync | 0--15, with/without count | 16 IDs x 2 count variants |
| matchsync_all/any_b32/b64 | with predicate variants | 2 match modes x 2 types x pred |
| shflsync_bfly/down/idx/up | with predicate variants | 4 shuffle modes x pred |
| votesync_all/any/ballot/uni | -- | 4 vote modes |
| warpsync | -- | 1 entry |
| wmma_* | m16n16k16, m32n8k16, m8n32k16 | 3 shapes x {load_a, load_b, load_c, store_d, mma} x {row, col} x {f16, f32} x {generic, global, shared} x {satfinite} |

The WMMA entries dominate the count. Each combination of shape (m16n16k16/m32n8k16/m8n32k16), operation (load_a/load_b/load_c/store_d/mma), layout (row/col for each matrix), data type (f16/f32), address space (generic/global/shared), and optional satfinite flag produces a separate intrinsic registration.

Opcode Dispatch -- sub_5D4190

This 41KB function first calls sub_5D1660(a1) to populate the intrinsic ID table, then builds two more hash maps for PTX opcode dispatch.

Named Opcode Table (at a1+808)

~120 PTX instruction names mapped to codegen handler function pointers. Each handler allocates a 50,000-byte buffer, queries instruction properties through accessor functions on the instruction object at a1+1096, and generates inline PTX code via sequential sprintf() calls.

| Category | Opcodes | Codegen Handlers |
|---|---|---|
| Math | div.full, div, rem, rcp, rsqrt, sqrt, ex2, lg2, tanh | sub_573860, sub_5B76D0 (64KB), sub_589810, sub_5B0CD0 (44KB), sub_57BFC0, sub_5B4040 (49KB), sub_583190, sub_52A5C0, sub_505B00 |
| Memory | membar, _ldldu, prefetch | sub_4DB410, sub_4DD860, sub_507FB0 |
| Conversion | cvt | sub_59F630 |
| Bit manipulation | bfind, brev, bfe, bfi, clz, popc, testp, copysign | sub_590C20, sub_50B5A0, sub_578470, sub_52E100, sub_4DBCC0, sub_4DB210, sub_581A10, sub_50B180 |
| Texture | tex, tex.base, tex.level, tld4, tex.grad | sub_584D10, sub_5879B0, sub_58B6A0, sub_56D700, sub_5ADDC0 (50KB) |
| Video (SIMD) | vadd/vsub/vmin/vmax/vabsdiff/vshl/vshr/vset/vmad (scalar), vadd2/vmax2/vmin2/vabsdiff2/vset2/vsub2/vavrg2 (packed 2x16), vadd4/vmin4/vmax4/vabsdiff4/vset4/vsub4/vavrg4 (packed 4x8) | per-instruction handlers |
| Dot product | dp2a.lo, dp2a.hi, dp4a | sub_56BA60, sub_56C8D0, sub_577BA0 |
| Barriers | bar, barrier, bar.arrive, barrier.arrive, bar.red, barrier.red, bar.cta/barrier.cta (.arrive/.red variants), bar.warp | sub_524FB0, sub_570290, sub_500BF0, sub_570940, sub_52D590, sub_5889B0, sub_56A5A0 |
| Warp | vote, shfl, match, redux | sub_580E50, sub_5801D0, sub_58A730, sub_567680 |
| Async copy | cp.async.mbarrier.arrive, cp.async.bulk, cp.async.bulk.tensor | sub_4DC180, sub_593210, sub_5AB460 (45KB) |
| Matrix | ldmatrix, movmatrix, stmatrix, st.async, red.async, st.bulk | sub_50D4B0, sub_4DAEA0, sub_4F05D0, sub_58E9B0, sub_5825A0, sub_549430 |
| Cache | createpolicy.range, createpolicy.fractional, createpolicy.cvt | per-instruction handlers |
| WMMA | wmma.load.a, wmma.load.b, wmma.load.c, wmma.store.d, wmma.mma | sub_5A2D10, sub_5A0EA0, sub_5A8E40, sub_5A6BD0, sub_5C7A50 (173KB) |
| MMA | mma | sub_5C10A0 (120KB) |
| WGMMA | wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group | sub_50AC70, sub_4DA380, sub_4DA4B0, sub_4DA5E0 |
| Multimem | multimem.ld_reduce, multimem.st, multimem.red | sub_58D8B0, sub_57B4C0, sub_50A850 |
| Tensormap | tensormap.replace | sub_57F6E0 |
| TCGen05 | tcgen05.alloc, tcgen05.relinquish_alloc_permit, tcgen05.dealloc, tcgen05.ld, tcgen05.ld.red, tcgen05.st, tcgen05.commit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.ws | sub_569180, sub_526370, sub_58C7F0, sub_574050, sub_578DB0, sub_571FE0, sub_56C190, sub_5427F0, sub_4F1A90, sub_5BBC30 (90KB), sub_58FA20 |
| TCGen05 guardrails | _tcgen05.guardrails.is_phase_valid, are_columns_allocated, is_current_warp_valid_owner, in_physical_bounds, allocation_granularity, datapath_alignment, sp_consistency_across_idesc_mod, check_sparse_usage | per-instruction handlers |

Numeric MMA Hash Table (at a1+816)

~400 entries where the key is a numeric string representation of a hash value (e.g., "2644314910") that encodes a specific MMA shape/type/layout combination. The hash encodes the instruction variant completely: matrix dimensions (m16n8k16, m16n8k32, etc.), data type (f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1), and layout (row/col combinations). Each entry maps to a codegen handler function pointer. This avoids a multi-dimensional lookup by collapsing the full variant space into a single hash probe.
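The concrete hash function has not been recovered; the sketch below only illustrates the dispatch pattern, using FNV-1a as a stand-in mixer. The point is structural: the full (shape, dtype, layout) variant space collapses into one numeric string key, so handler lookup is a single hash probe rather than a nested multi-dimensional table walk.

```python
# Stand-in for the numeric MMA hash keys at a1+816; the real ptxas hash
# differs, but the keyed-dispatch shape is the same.
def mma_variant_key(shape: str, dtype: str, layout: str) -> str:
    h = 2166136261                      # FNV-1a 32-bit, illustrative only
    for ch in f"{shape}.{dtype}.{layout}":
        h = ((h ^ ord(ch)) * 16777619) & 0xFFFFFFFF
    return str(h)                       # stored as a decimal string key

# One probe resolves the full variant to its codegen handler.
handlers = {mma_variant_key("m16n8k16", "f16", "row.col"): "sub_5C10A0"}
```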

Body Template Name Table -- sub_5D7430

At 161KB of machine code (0x5D7430--0x5FF700), this is the largest function in the intrinsic infrastructure by code size and the 6th largest function in the entire ptxas binary. IDA failed to decompile it; all analysis comes from raw x86-64 disassembly. The function constructs a third hash map (at context offset +824 / 0x338) containing 1,079 entries that map dynamically constructed __cuda_* intrinsic names to sequential body template IDs (0--1078).

Why 1,079 Body Templates for 607 Logical Intrinsics

The master registration table (sub_5D1660) maps 607 intrinsic names to logical IDs. The body template table (sub_5D7430) maps 1,079 variant-specific names to prototype generator case numbers. The 1.78x expansion has one dominant cause: WMMA template proliferation across GPU generations.

The 204 logical WMMA entries in sub_5D1660 cover only the original sm_70 Volta shapes (m16n16k16/m32n8k16/m8n32k16 with f16/f32 types). But the body template table includes all later-generation WMMA variants -- sm7x sub-byte/bit, sm72 integer, sm8x tf32/bf16/f64 -- that were added as hardware evolved. These ~416 extra WMMA templates have no matching entry in the 607 logical ID table; they exist only in the body template hash map and the prototype generator switch.

Non-WMMA intrinsics map approximately 1:1 between logical IDs and body templates. The math operations (div, rcp, sqrt) are already fully type-specialized at the logical level -- each rounding-mode/type combination is a separate logical intrinsic.

Three sources of expansion beyond the 607 logical entries:

  1. Later-generation WMMA variants (~416 template-only entries):
    • sm7x sub-byte WMMA (s4/u4 m8n8k32) + bit WMMA (m8n8k128): ~231 templates
    • sm72 integer WMMA (m16n16k16/m32n8k16/m8n32k16 integer types): ~105 templates
    • sm8x tf32 WMMA (m16n16k8) + bf16/f64 WMMA: ~80 templates
  2. Aligned warp sync variants (~13 extra templates): matchsync_aligned, votesync_aligned, votesync_ballot_groupwise, query_activemask/query_activemask_groupwise for cooperative group support
  3. Additional SM100 specializations (~8 extra templates): tcgen05_alloc_two_sm, extra guardrails check variants, get_warp_rank

Conversely, 18 sm1xx bulk copy intrinsics have logical IDs but zero body templates -- they bypass the template/prototype mechanism entirely and are lowered directly to inline PTX by the opcode dispatch handlers (sub_593210, sub_5AB460).

Template Distribution Table

| Logical Group | Logical IDs | Body Templates | Factor |
|---|---|---|---|
| SM20 IEEE math (div, rem, rcp, sqrt, bfe/bfi) | 70 | 70 | 1.0x |
| SM3x optimized division | 4 | 4 | 1.0x |
| SM62 integer dot product | 2 | 2 | 1.0x |
| SM70 barriers | 170 | 170 | 1.0x |
| SM70 warp sync (match, vote, shfl, query) | 19 | 32 | 1.7x |
| SM70 WMMA (f16/f32 original Volta) | 204 | 249 | 1.2x |
| SM7x WMMA extended (sub-byte, bit) | 0 | 231 | tmpl-only |
| SM72 WMMA (integer) | 0 | 105 | tmpl-only |
| SM8x WMMA (tf32, bf16, f64) | 0 | 80 | tmpl-only |
| SM80 cache policy | 3 | 4 | 1.3x |
| SM8x direct MMA | 14 | 14 | 1.0x |
| SM9x Hopper sub-byte/bit MMA | 51 | 52 | 1.0x |
| SM10x Blackwell MMA metadata | 10 | 10 | 1.0x |
| SM100 tcgen05 + guardrails | 11 | 19 | 1.7x |
| SM100+ bulk copy / TMA | 18 | 0 | (no templates) |
| Redux sync primitives | 17 | 17 | 1.0x |
| Compute-sanitizer hooks | 7 | 7 | 1.0x |
| Video instruction emulation | 7 | 7 | 1.0x |
| Total | 607 | 1,073 | 1.78x |

WMMA subtotal: 204 logical entries expand to 665 body templates (3.3x). Non-WMMA: 403 logical entries map to 408 templates (~1.0x). The remaining 7 templates (1,080 prototype switch cases minus 1,073 classified) are sanitizer/cache variants where IDA produced qmemcpy instead of strcpy, preventing exact name extraction.

The sm70 WMMA group itself expands from 204 to 249 templates because the prototype generator includes update_ptr and desc (descriptor-based addressing) variants of certain load/store operations that the logical table does not separate.

The three "tmpl-only" WMMA rows (sm7x/sm72/sm8x) are the single largest contributor to the expansion. They represent ~416 templates with zero logical ID counterparts. These families use .FORCE_INLINE .func linkage in their prototypes instead of the .weak .func used by the original sm70 WMMA entries:

sm70 (original):   .weak .func (...) __cuda_sm70_wmma_m16n16k16_load_a_col (...)
sm72 (integer):    .FORCE_INLINE .func (...) __cuda_sm72_Integer_wmma_m16n16k16_load_a_row (...)
sm7x (sub-byte):   .FORCE_INLINE .func (...) __cuda_sm7x_sub_byte_wmma_m8n8k32_load_a_row (...)
sm8x (tf32):       .FORCE_INLINE .func (...) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (...)

The .FORCE_INLINE directive forces inlining at every call site rather than emitting a separate callable function. The later-gen WMMA implementations are more complex and performance-sensitive, making per-call-site specialization profitable.

Name Construction Algorithm

The function contains zero string references because it constructs all 1,079 names at runtime. For each entry:

  1. Allocate a 20-byte buffer via sub_424070(allocator, 20)
  2. Copy prefix (16 bytes) from .rodata via SSE movdqa + movups (e.g., "__cuda_sm20_div_")
  3. Append suffix (4 bytes) via movl immediate at offset +16 (e.g., "s16\0", "u64\0", "rn_f")
  4. Register via sub_426150(context+824, buffer, template_id) with sequential integer IDs

The 533 unique .rodata prefix addresses fan out through multiple suffixes per prefix:

.rodata prefix (16B)       suffix (4B)     result (20B buffer)
───────────────────────    ───────────     ──────────────────────
"__cuda_sm20_div_"    +    "s16\0"    =   "__cuda_sm20_div_s16"
"__cuda_sm20_div_"    +    "u16\0"    =   "__cuda_sm20_div_u16"
"__cuda_sm20_div_"    +    "u64\0"    =   "__cuda_sm20_div_u64"
"__cuda_sm20_div_"    +    "s64\0"    =   "__cuda_sm20_div_s64"
"__cuda_sm20_div_"    +    "rn_f"     =   "__cuda_sm20_div_rn_f" (truncated)
"__cuda_sm20_rem_"    +    "s16\0"    =   "__cuda_sm20_rem_s16"
"__cuda_sm20_rcp_"    +    "rn_f"     =   "__cuda_sm20_rcp_rn_f" (truncated)
"__cuda_sm70_barr"    +    "ier_"     =   "__cuda_sm70_barrier_" (prefix chain)

Names truncated at the 20-byte buffer limit are still sufficient for hash map lookup -- the full untruncated name appears only inside the prototype string in sub_5FF700.
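The four-step construction above reduces to a short model: a 16-byte prefix plus a 4-byte suffix written into a 20-byte buffer, with silent truncation at the buffer limit. This is a Python sketch of the recovered SSE copy sequence, not the binary's code.

```python
# Model of sub_5D7430's name construction: 16-byte .rodata prefix
# (movdqa + movups) + 4-byte suffix (movl immediate at offset +16),
# truncated at the 20-byte buffer boundary.
def build_template_name(prefix16: bytes, suffix4: bytes) -> str:
    assert len(prefix16) == 16 and len(suffix4) == 4
    buf = (prefix16 + suffix4)[:20]           # the 20-byte allocation
    return buf.split(b"\x00", 1)[0].decode()  # C-string view of the buffer
```

Suffixes carrying their own NUL ("s16\0") terminate inside the buffer; suffixes like "rn_f" fill it exactly, producing the truncated-but-unique hash keys described above.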

Worked Example: Division (Cases 0--26)

The __cuda_sm20_div operation illustrates the template-to-prototype mapping. Division has 19 logical IDs and 19 body templates (1:1 ratio) because each type/rounding/precision variant is already a separate logical intrinsic. The suffix encodes the type specialization:

| Case | Body Template Name | Type Suffix | PTX Signature |
|---|---|---|---|
| 0 | __cuda_sm20_div_s16 | s16 | (.reg .s32 %d) ... (.reg .s32 %a0, .reg .s32 %a1) |
| 1 | __cuda_sm20_div_u16 | u16 | (.reg .u32 %d) ... (.reg .u32 %a0, .reg .u32 %a1) |
| 4 | __cuda_sm20_div_u64 | u64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 5 | __cuda_sm20_div_s64 | s64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | __cuda_sm20_div_rn_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 10 | __cuda_sm20_div_rd_f32 | rd_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 14 | __cuda_sm20_div_rn_ftz_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 22 | __cuda_sm20_div_ru_f64_v2 | ru_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |
| 25 | __cuda_sm20_div_rn_f64_full | rn_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |

Cases 2--3 (rem s16/u16), 6--7 (rem s64/u64) are interleaved between the division entries. Cases 8, 13 are _slowpath variants that implement Newton-Raphson refinement fallbacks. Cases 18--21 are the sm3x-optimized division variants with the same suffix scheme. Note: s16/u16 division uses .s32/.u32 register types because PTX has no 16-bit register class; the 16-bit operation is performed by 32-bit hardware with appropriate sign/zero extension.
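The widening convention in that last note can be made concrete. The sketch below is a Python model of the two widening rules, not the helper's recovered body: sign-extend for .s16, zero-extend for .u16, then divide at 32-bit width with C-style truncation toward zero.

```python
# Why __cuda_sm20_div_s16 takes .s32 registers: the 16-bit value is
# widened before the 32-bit divide.
def sext16(x: int) -> int:                 # .s16 -> .s32 sign extension
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def div_s16(a: int, b: int) -> int:
    q = abs(sext16(a)) // abs(sext16(b))   # C-style truncation toward zero
    return -q if (sext16(a) < 0) != (sext16(b) < 0) else q

def div_u16(a: int, b: int) -> int:        # .u16 -> .u32 zero extension
    return (a & 0xFFFF) // (b & 0xFFFF)
```

The same 16-bit pattern 0xFFF6 divides to different quotients under the two rules: -10/3 signed versus 65526/3 unsigned.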

Statistics

| Metric | Value |
|---|---|
| Machine code size | 164,560 bytes (0x5D7430--0x5FF700) |
| sub_426150 calls | 1,079 |
| Unique .rodata prefix addresses | 533 |
| Hash map destination | context+824 (0x338) |
| Buffer size per entry | 20 bytes |
| IDA decompilation | Failed (function too large/repetitive) |

Context Hash Map Summary

The intrinsic lowering context object holds five hash maps and one flat table:

| Offset | Field | Builder | Contents | Entries |
|---|---|---|---|---|
| +808 | opcode handlers | sub_5D4190 | PTX opcode name -> codegen fn ptr | ~120 |
| +816 | MMA hash handlers | sub_5D4190 | numeric hash -> codegen fn ptr | ~400 |
| +824 | body templates | sub_5D7430 | intrinsic name -> template ID | 1,079 |
| +1056 | descriptor table | sub_5D1660 | 608 x 16B intrinsic descriptor slots | 608 |
| +1064 | ID map | sub_5D1660 | intrinsic name -> logical ID (1--607) | 607 |
| +1072 | count | sub_5D1660 | 608 (includes null slot 0) | -- |

Instruction Property Accessors

All codegen handlers query instruction properties through accessor functions on the instruction object at a1+1096. These are the same accessors used by WMMA, MMA, and tcgen05 codegen.

| Accessor | Purpose | Usage Example |
|---|---|---|
| sub_70B6E0 | Check if feature enabled | sub_70B6E0(obj) -- boolean feature gate |
| sub_70B780 | Get feature parameter | Numeric feature parameter |
| sub_70FA00 | Check instruction capability for SM | sub_70FA00(*, 23) = texture, sub_70FA00(*, 29) = tcgen05 |
| sub_70E940 | Get operand count | Number of operands |
| sub_70E6E0 | Get data type | Operand data type enumeration |
| sub_70ACC0 | Get accumulator type | MMA accumulator data type |
| sub_709860 | Get register type/size | Register class and width |
| sub_70F460 | Get layout variant | row/col matrix layout |
| sub_707D60 | Check MMA shape variant | m16n16k16 vs m32n8k16, etc. |
| sub_709910 | Check sparse mode | Sparse MMA variant flag |
| sub_70F650 | Get matrix dimension (M/N) | Matrix size parameter |
| sub_70F600 | Get matrix dimension (K) | Alternate dimension parameter |
| sub_70CA60 | Get operand type by index | sub_70CA60(*, 0) -- type of first operand (21 = specific type, 58 = f32, 59 = f64) |
| sub_70BA40 | Texture mode query | Texture sampling mode |
| sub_70BD50 | Sampler mode query | Texture sampler configuration |
| sub_70BB20 | Bulk tensor mode | cp.async.bulk.tensor transfer mode |
| sub_70F0A0 | Get sparse metadata | Sparse matrix metadata parameter |

Prototype Generator -- sub_5FF700

At 354KB, this is the single largest function in the intrinsic infrastructure and the 2nd largest function in the entire ptxas binary. It takes a body template ID (a1, range 0--1079) and an allocator context (a2), allocates a buffer via sub_4DA340(size, a2), fills it with a PTX prototype string via strcpy(), and returns the result. The output is a complete .weak .func or .FORCE_INLINE .func PTX declaration that gets emitted into the PTX output stream so the linker can resolve calls to CUDA runtime helper functions.

The function is a single switch(a1) with 1,080 case labels (0--1079) plus a default case that returns an empty string "". Each case allocates an exact-sized buffer (72--1,200 bytes), copies a hardcoded PTX prototype string into it, and returns the pointer.

Prototype Generator Architecture

sub_5FF700(template_id, allocator)
  │
  │  switch(template_id)     ← 1,080 cases, 0--1079
  │
  ├── case N:
  │     buf = sub_4DA340(byte_count, allocator)    ← allocate exact-fit buffer
  │     strcpy(buf, ".weak .func (...) name (...)")  ← copy PTX prototype
  │     return buf
  │
  ├── case M:  (45 large WMMA mma cases)
  │     buf = sub_4DA340(byte_count, allocator)    ← up to 1,200 bytes
  │     *(u64*)buf = 0x662E206B6165772E            ← ".weak .f" (inline store)
  │     *(u64*)(buf+N-8) = <trailer>               ← last 8 bytes inline
  │     qmemcpy(buf+8, .rodata_addr, size-16)      ← bulk copy middle
  │     return buf
  │
  └── default:
        return ""

Three copy strategies appear in the decompilation, all producing the same result:

| Strategy | Cases | Trigger | Max Size |
|---|---|---|---|
| strcpy() with inline string literal | 1,035 | Prototype fits in decompiler string threshold | ~520 bytes |
| qmemcpy() with QWORD bookend stores | 45 | Prototype too long for IDA to reproduce as literal | 1,200 bytes |
| Indirect variable assignment + copy | ~130 | IDA SSA split (subset of strcpy) | ~120 bytes |

The qmemcpy cases are the 45 WMMA mma operations with the largest parameter lists (3--4 fragment matrices of 8 elements each). IDA stores the first and last 8 bytes as inline immediates (0x662E206B6165772E = ".weak .f", trailer varies per case) and bulk-copies the middle from .rodata. The prototype content is structurally identical to the strcpy cases.
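The leading immediate can be checked directly: decoded as a little-endian x86-64 QWORD, 0x662E206B6165772E is exactly the first eight prototype bytes.

```python
# Decode the inline bookend store from the qmemcpy cases:
# x86-64 stores the QWORD little-endian, so the low byte 0x2E ('.')
# is the first character of the prototype string.
import struct

first8 = struct.pack("<Q", 0x662E206B6165772E)
# first8 == b".weak .f"
```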

Linkage Directives

Two PTX linkage types are emitted, controlling how the linker handles the declared function:

| Directive | Count | Meaning | Used by |
|---|---|---|---|
| .weak | 616 | Overridable by user code; the linker uses the user version if present | SM20 math, SM70 barriers/sync/WMMA (original Volta), SM80 cache policy, SM8x/9x/10x MMA, redux sync, sanitizer hooks, video emulation, dp2a/dp4a |
| .FORCE_INLINE | 464 | Inlined at every call site; no separate callable function | SM70 aligned vote/match/query_activemask, SM7x sub-byte/bit WMMA, SM72 integer WMMA, SM8x tf32/bf16/f64 WMMA, SM10x tcgen05 alloc/guardrails, SM80 createpolicy_fractional |

The .weak linkage supports user-supplied replacements: if the user provides their own implementation of __cuda_sm20_div_s16, the linker will use that instead of the built-in runtime version. The .FORCE_INLINE directive forces per-call-site specialization -- the later-generation WMMA implementations are more complex and performance-sensitive, making inlining profitable.

A subset of .weak prototypes (~410) carry the .unique qualifier:

.weak .func (.reg .b32 dst) __cuda_sm70_barrier_sync (.reg .b32 arg0) .unique ;

.unique instructs the PTX linker to keep exactly one copy of the function body even if multiple compilation units reference it. All barriers, redux sync, warpsync, non-aligned vote/match/shuffle use .unique.

Prototype Format

Every emitted prototype follows one of these structural patterns:

<linkage> .func (<return_params>) <name> (<input_params>) [.unique] ;
| Case | Prototype |
|---|---|
| 0 | .weak .func (.reg .s32 %d) __cuda_sm20_div_s16 (.reg .s32 %a0, .reg .s32 %a1) |
| 4 | .weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64 (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | .weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) |
| 25 | .weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full (.reg .f64 %fda1, .reg .f64 %fda2) |
| 76 | .weak .func (.reg .b32 dst) __cuda_reduxsync_b32_xor (.reg .b32 src, .reg .b32 mask) .unique |
| 303 | .weak .func (.param .align 16 .b32 dst[8]) __cuda_sm70_wmma_m16n16k16_load_a_row (.reg .u64 ptr, .reg .u32 ldm) .unique |
| 666 | .FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) |
| 890 | .weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32 (.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, ...) |
| 1055 | .FORCE_INLINE .func (.reg .b32 warp_rank) __cuda_sm10x_get_warp_rank () |
| 1073 | .weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_readmetadata (.param .b64 ..._param_0, .param .b64 ..._param_1) |

Parameter Passing Conventions

Five distinct parameter-passing ABIs appear across the 1,080 prototypes:

Convention A -- Register-only (.reg): Used by math operations, barriers, warp sync, redux sync, video emulation. Return and input parameters are individual .reg declarations with typed names. This is the simplest and most common convention.

.weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) ;

Convention B -- Param-array with alignment (.param .align N .b32 name[K]): Used by WMMA load/mma, MMA, Hopper sub-byte MMA, Blackwell MMA. Returns an aligned array of .b32 elements. Array sizes: dst[2], dst[3], dst[4], dst[5], dst[8], mma_dst[2], mma_dst[4], mma_dst[8], ret_dst[3], ret_dst[5]. 326 prototypes use .align 16; 1 prototype (mma_shfl_f16) uses .align 8.

.weak .func (.param .align 16 .b32 d[8]) __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32
  (.param .align 16 .b32 a[8], .param .align 16 .b32 b[8], .param .align 16 .b32 c[8]) ;

Convention C -- Param-scalar (.param .b64): Used exclusively by the 7 compute-sanitizer hooks. Parameters use fully-qualified names (__cuda_sanitizer_memcheck_malloc_param_0).

.weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_malloc
  (.param .b64 __cuda_sanitizer_memcheck_malloc_param_0,
   .param .b64 __cuda_sanitizer_memcheck_malloc_param_1) ;

Convention D -- Void return (): Used by WMMA store_d, tcgen05 guardrail traps, sanitizer_free. ~140 prototypes (45 .weak + 95 .FORCE_INLINE).

.weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32
  (.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, .reg .b32 sreg1, ...) ;

Convention E -- Multi-register return (.FORCE_INLINE only): Used by extended WMMA load operations (SM7x/SM72/SM8x). Returns 1--4 registers in the return position (never 8 -- 8-element returns use Convention B's .param arrays instead).

.FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3)
  __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) ;

PTX Register Types

Eight PTX register types appear across the prototypes:

| Type | Approx. count | Usage |
|---|---|---|
| .reg .b32 | ~2,925 | Dominant: barrier args, WMMA/MMA fragments, guardrail params |
| .reg .u64 | ~520 | Pointers (WMMA/MMA base addresses) |
| .reg .u32 | ~341 | Integer params (leading dimension, counts, offsets) |
| .reg .b64 | ~246 | 64-bit bitwise (match bitmask, shuffle predicates, retval) |
| .reg .f32 | ~106 | Float math (div, rcp, sqrt) |
| .reg .f64 | ~70 | Double math (div, rcp, sqrt, dsqrt) |
| .reg .pred | ~10 | Predicate (vote output, matchsync predicate out) |
| .reg .s32 | ~6 | Signed 32-bit (SM20 div/rem s16 return values only) |

Note: .b32 is used instead of .s32/.u32/.f32 for operations where the type interpretation is determined by the instruction rather than the register declaration (WMMA fragments, MMA accumulators, barrier IDs). The .s32 type appears only in the 4 oldest SM20 div/rem_s16/u16 prototypes (cases 0--3).

Register Naming Convention

The prototype register names encode the data type and role:

| Prefix | Meaning |
|---|---|
| %d | 32-bit integer return value (SM20 div/rem s16/u16 only) |
| %a0, %a1 | 32-bit integer input parameters |
| %rdv1 | 64-bit integer return value |
| %rda1, %rda2 | 64-bit integer input parameters |
| %fv1 | f32 return value |
| %fa1, %fa2 | f32 input parameters |
| %fdv1 | f64 return value |
| %fda1, %fda2 | f64 input parameters |
| %fdnum, %fdden | f64 numerator/denominator (div_f64_v2 variants) |
| dst, dst0..dst7 | Generic output registers (WMMA load, barriers) |
| src, sreg0..sreg7 | Generic input registers (WMMA store) |
| ptr, base | 64-bit pointer registers |
| ldm | Leading dimension parameter (WMMA) |
| mask | Warp participation mask |
| cnt | Thread count (barrier_sync_count, barrier_arrive_count) |
| arg0..arg3 | Generic numbered arguments |
| parg | Predicate argument (vote) |
| retVal, dummy | Return/placeholder (tcgen05 guardrails) |
| activemask, warp_rank | Cooperative group queries |

Buffer Allocation Sizes

sub_4DA340(size, allocator) allocates an exact-fit buffer per prototype:

| Metric | Value |
|---|---|
| Minimum allocation | 72 bytes |
| Maximum allocation | 1,200 bytes |
| Median allocation | ~130 bytes |
| Most common sizes | 132 (37x), 182 (31x), 192 (30x), 125 (29x), 118 (28x) |
| Total allocations | 1,080 |

The 45 qmemcpy cases have the largest buffers: 386--1,200 bytes. These are WMMA mma operations whose prototypes enumerate all 3--4 fragment matrices (a, b, c, d) with 4--8 elements each, producing prototype strings that exceed 900 bytes.
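The exact-fit sizing follows directly from the prototype text: each case requests the string length plus a NUL terminator. The sketch below models that, plus a toy Convention-B mma prototype builder; the name and shape come from the case examples on this page, but the builder itself is illustrative, not the real generator.

```python
def alloc_size(prototype: str) -> int:
    """Exact-fit allocation: string bytes plus NUL, as each case requests
    from sub_4DA340 (model, not the actual implementation)."""
    return len(prototype) + 1

def wmma_mma_prototype(frags=("a", "b", "c"), elems=8) -> str:
    """Build a Convention-B style mma prototype (illustrative)."""
    param = lambda n: f".param .align 16 .b32 {n}[{elems}]"
    args = ", ".join(param(n) for n in frags)
    return (f".weak .func ({param('d')}) "
            f"__cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32 ({args}) ;")
```

The mma prototypes grow with the number of fragment matrices and their element counts, which is why they sit at the top of the allocation-size range.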

Case Range Layout

The 1,080 cases follow the body template registration order from sub_5D7430, roughly grouped by SM generation:

| Case range | Count | Category | Linkage |
|---|---|---|---|
| 0--69 | 70 | SM20 IEEE math (div, rem, rcp, sqrt, bfe, bfi, dsqrt, drsqrt) | .weak |
| 70--73 | 4 | SM3x optimized division (rn_ftz/noftz f32 + slowpaths) | .weak |
| 74--75 | 2 | SM62 dp2a/dp4a | .weak |
| 76--92 | 17 | Redux sync (b32/s32/u32/f32 add/max/min/xor/and/or/abs/NaN) | .weak .unique |
| 93--~274 | ~182 | SM70 barriers (sync/arrive/red, 16 IDs x with/without count) | .weak .unique |
| ~275--~302 | ~28 | SM70 vote, shuffle, match (bfly/down/idx/up, all/any/b32/b64) | .weak .unique / .FORCE_INLINE |
| ~303--~665 | ~363 | SM70 WMMA load/store (m16n16k16, m32n8k16, m8n32k16, all types/spaces) | .weak .unique |
| ~666--~889 | ~224 | SM7x/SM72/SM8x extended WMMA (sub-byte, integer, tf32, bf16, f64) | .FORCE_INLINE |
| ~890--~964 | ~75 | SM70 WMMA store_d (all shapes/layouts/spaces/types) | .weak |
| ~965--~1048 | ~84 | SM70 WMMA mma + SM8x/SM9x/SM10x MMA (f16/f32, sub-byte, bit, sparse) | .weak |
| ~1049--~1055 | ~7 | SM10x tcgen05 guardrail traps | .weak |
| ~1056--~1060 | ~5 | SM8x direct MMA (mma_shfl, row/col f16/f32 combos) | .weak |
| ~1061--~1072 | ~12 | SM10x tcgen05 alloc/guardrails check functions + get_warp_rank + create_mask | .FORCE_INLINE / .weak |
| 1073--1079 | 7 | Compute-sanitizer hooks (readmetadata, generic, global, local, shared, malloc, free) | .weak |

Statistics

| Metric | Value |
|---|---|
| Machine code size | 362,496 bytes (0x5FF700--0x658B00) |
| Decompiled lines | 9,414 |
| Switch cases | 1,080 (case 0 through case 1079 + default) |
| Local variables declared | ~716 (IDA SSA artifacts) |
| .weak prototypes | 616 (571 strcpy + 45 qmemcpy) |
| .FORCE_INLINE prototypes | 464 |
| .unique-qualified prototypes | ~410 |
| .param .align prototypes | 327 (326 align-16, 1 align-8) |
| Void-return prototypes | ~140 |
| Predicate-using prototypes | ~10 |

Major Codegen Handlers

The four largest codegen handlers together represent ~500KB of code and cover the tensor core instruction families.

sub_5C7A50 -- WMMA.MMA Codegen (173KB)

The largest codegen handler. Generates inline PTX code for wmma.mma instructions across all variant combinations.

  • Allocates a 50,000-byte buffer for code generation
  • Covers shapes: m16n16k16, m32n8k16, m8n32k16
  • Data types: f16, f32, bf16, tf32, s8, u8, s4, u4, b1
  • Layouts: row/col for each of the A, B, C, D matrices (4 layout combinations)
  • Satfinite variants for each configuration
  • Address spaces: generic, global, shared

sub_5C10A0 -- MMA Codegen (120KB)

Handles the newer mma.sync API (non-WMMA). Covers the post-Volta PTX MMA instructions.

  • Shapes: m8n8k4, m16n8k8, m16n8k16, m16n8k32, m16n8k64, m16n8k128, m16n8k256
  • Types: f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1
  • Sparse variants for sm_80+ and sm_90+ (structured sparsity 2:4)

sub_5BBC30 -- TCGen05.MMA Codegen (90KB)

Blackwell 5th-generation tensor core MMA code generation. Handles the tcgen05.mma instruction family introduced in sm_100.

  • Allocates a 50,000-byte buffer
  • Queries sub_70FA00(*, 29) to validate tcgen05 capability
  • Handles standard, sparse, and warp-shared (.ws) variants
  • Uses sub_70F0A0 for sparse metadata parameter extraction
  • Generates code for tcgen05-specific tensor memory addressing

sub_5B76D0 -- Division Codegen (64KB)

Generates inline PTX code for all div variants.

  • Integer division: s16, s64, u16, u64
  • Floating-point division: f32, f64 with all rounding modes (rn, rd, ru, rz)
  • Flush-to-zero (ftz) variants for f32
  • Checks operand type via sub_70CA60(*(_QWORD *)(a1+1096), 0) == 21
  • Emits both fastpath and slowpath (Newton-Raphson) code sequences

OCG Intrinsic System -- sub_6C9EB0

The OCG (Optimizing Code Generator) intrinsic subsystem is a separate, parallel dispatch mechanism for SM100+ builtin operations. While the classical system at sub_5D1660 maps CUDA runtime helper names to integer IDs and generates inline PTX, the OCG system maps __nv_ptx_builtin_ocg_* function names to type-specific handlers that validate parameters and emit SASS instructions directly -- bypassing PTX entirely. The OCG table contains 44 operations across 9 categories: arithmetic, packed float, vector integer, async copy/TMA, load/store/cache, reduction/fence, tensor core, tensor memory, and synchronization.

See OCG Intrinsic System (44 Operations) for the complete builtin name table, handler functions, validation strings, SASS-level handlers, and the full five-stage lowering pipeline with operand buffer layout.
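The two-path split is ultimately a prefix match on the callee name. A hedged sketch (the routing structure is a model, not the actual ptxas control flow):

```python
OCG_PREFIX = "__nv_ptx_builtin_ocg_"   # prefix string at 0x6c9ecf

def route(name: str) -> str:
    """Model of the two dispatch paths for intrinsic-looking callees."""
    if name.startswith(OCG_PREFIX):
        return "ocg"         # direct-to-SASS path (sub_6C9BC0 parses suffix)
    if name.startswith("__cuda_"):
        return "classical"   # name->ID map (sub_5D1660) -> PTX prototype
    return "user"            # ordinary user function, no intrinsic handling
```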

Intrinsic Families by SM Generation

Each SM generation introduces new intrinsic families while preserving all earlier ones. The per-SM intrinsic table initializer functions (sub_60AXXX cluster, registered in Map 3 of the capability dispatch) control which intrinsics are available on each target.

sm_20 -- Software IEEE Math (70 entries)

The foundation layer. 70 intrinsics providing IEEE-754-compliant software implementations of math operations that either lack hardware support or need exact rounding guarantees. All later SM targets inherit these.

  • Division: div_s16, div_u64, div_rn_f32, div_rn_f64_full, etc. -- all rounding modes (rn/rd/ru/rz) and types (s16/s64/u16/u64/f32/f64)
  • Reciprocal: rcp_rn_f32, rcp_rn_f64, etc. -- all rounding modes
  • Square root: sqrt_rn_f32, sqrt_rn_f64, etc. -- all rounding modes
  • Double-precision sqrt: dsqrt_rn, dsqrt_rd, dsqrt_ru, dsqrt_rz
  • Double-precision reciprocal sqrt: drsqrt_rn
  • Bit extract/insert: bfe (bit field extract), bfi (bit field insert)
  • Remainder: rem_s32, rem_u32, rem_s64, rem_u64

Codegen handlers: sub_5B76D0 (div, 64KB), sub_5B0CD0 (rcp, 44KB), sub_5B4040 (sqrt, 49KB).

sm_3x -- Optimized Division (4 entries)

Four optimized division variants introduced on Kepler to improve throughput on common division patterns.

sm_62 -- Integer Dot Product (2 entries)

dp2a and dp4a integer dot product intrinsics introduced on Pascal (GP10x). Software emulation of the hardware instructions added in sm_61/sm_62.

sm_70 -- Volta Warp-Synchronous + WMMA (393 entries)

The largest single block. Volta introduced mandatory warp-synchronous programming with explicit sync masks and the first generation of tensor core (WMMA) instructions.

Synchronization primitives:

  • barrier_arrive / barrier_sync / barrier_red (0--15, with/without count)
  • matchsync_all/any_b32/b64 with predicate variants
  • shflsync_bfly/down/idx/up with predicate variants
  • votesync_all/any/ballot/uni
  • warpsync

WMMA (Warp Matrix Multiply-Accumulate):

  • Shapes: m16n16k16, m32n8k16, m8n32k16
  • Operations per shape: load_a, load_b, load_c, store_d, mma
  • Layouts: row/col combinations for A and B matrices
  • Types: f16, f32 (with satfinite optional)
  • Address spaces: generic, global, shared

sm_80 -- Ampere Cache Policy (3 entries)

Three createpolicy intrinsics for L2 cache management: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode.

sm_10x -- Blackwell MMA Metadata + Bit MMA (10 entries)

10 hmma_mdata/imma_mdata + bit MMA intrinsics for sm_100: metadata variants at m16n8k16/k32/k64 shapes, and 1-bit AND/XOR MMA at m8n8k128/m16n8k128/m16n8k256.

sm_8x -- Direct MMA (14 entries)

14 mma_* intrinsics for sm_8x: 12 direct MMA operations (f16/f32 accumulator x 4 layout combinations of col/row for A and B) plus mma_shfl_f16 and mma_shfl_f32 for register-to-register MMA shuffle.

sm_9x -- Sub-Byte + Bit MMA (51 entries)

51 Hopper-era intrinsics: 3 bit-XOR MMA (m8n8k128/m16n8k128/m16n8k256), 24 dense sub-byte MMA (s4/u4 at m16n8k32/m16n8k64/m8n8k32, with satfinite), 8 sparse m16n8k128, and 16 sparse m16n8k64 (with _0/_1 split variants and satfinite).

sm_10x (via __cuda_sm10x_*) -- Blackwell Tensor Memory + Guardrails (11 entries)

  • 1 create_mask_from_bit_idx_and_alloc_size_v1 helper
  • 10 tcgen05_guardrail_trap_* intrinsics for debug validation of tensor memory operations

sm_1xx -- Bulk Copy (18 entries)

18 bulk copy and cp.async.bulk.tensor intrinsics covering 1D through 5D tensor copies with tile and im2col addressing modes, both unicast and multicast variants.

Intrinsic Lookup Flow

The lookup path from a function call in PTX source to the codegen handler follows this sequence:

PTX source: call.uni __cuda_sm70_warpsync, (%mask);
                    |
                    v
            sub_5D1660 hash map (a1+1064)
            key: "__cuda_sm70_warpsync"
            value: integer ID (within 0x89..0x211 range)
                    |
                    v
            sub_5FF700 switch(ID)
            Emits: .weak .func __cuda_sm70_warpsync (.reg .u32 %a0)
                    |
                    v
            sub_5D4190 named opcode hash map (a1+808)
            key: PTX opcode (e.g., "shfl", "vote", "barrier")
            value: codegen handler function pointer
                    |
                    v
            Codegen handler (e.g., sub_5801D0 for "shfl")
            Queries instruction properties via sub_70XXXX accessors
            Generates inline PTX code into 50KB buffer

For OCG intrinsics on SM100+, the path bypasses PTX entirely: sub_6A97B0 matches call nodes to SASS instructions via an RB-tree, sub_6C9BC0 parses the __nv_ptx_builtin_ocg_* name into an operation enum + sub-op array, sub_6CC690 routes to type-specific handlers and assembles operands, and sub_6CB8A0 emits the final SASS instruction. See the OCG Intrinsic System page for the full five-stage pipeline breakdown with operand buffer layout and internal SASS opcode enum values.
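The classical chain above can be modeled with three small tables. The name and ID below are real values from this page's catalog; the dict-based structure and the prototype text are a sketch, not the actual ptxas data layout.

```python
# Model of the three lookup structures in the classical path.
NAME_TO_ID = {"__cuda_sm70_barrier_arrive": 0x089}   # sub_5D1660 map (a1+1064)

PROTOTYPES = {                                       # sub_5FF700 switch body
    # prototype text illustrative; real text comes from the switch case
    0x089: ".weak .func __cuda_sm70_barrier_arrive (.reg .b32 arg0) .unique ;",
}

OPCODE_HANDLERS = {"shfl": "sub_5801D0"}             # sub_5D4190 map (a1+808)

def prototype_for(intr_id: int) -> str:
    """Models the switch: unknown IDs fall through to the empty string."""
    return PROTOTYPES.get(intr_id, "")

def resolve(name: str) -> str:
    return prototype_for(NAME_TO_ID.get(name, -1))
```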

Per-SM Intrinsic Initializers

Each SM target has its own intrinsic table initializer function registered in Map 3 of the capability dispatch (sub_607DB0). These functions control which subset of the 607 intrinsics are available on each target.

| SM | Initializer | SM | Initializer |
|---|---|---|---|
| sm_75 | sub_60A2E0 | sm_100 | sub_60A910 |
| sm_80 | sub_60A3E0 | sm_110 | sub_60AA20 |
| sm_86 | sub_60AC30 | sm_103 | sub_60A700 |
| sm_87 | sub_60AD30 | sm_120 | sub_608DF0 |
| sm_88 | sub_60AB30 | sm_121 | sub_60A4E0 |
| sm_89 | sub_60A810 | | |
| sm_90 | sub_60A5F0 | | |

Sub-variants (e.g., sm_100a, sm_100f) share the same initializer as their base SM since they represent the same silicon with different feature exposure levels.

Instruction Description Loader -- sub_9EE390

sub_9EE390 (3,584 bytes, 0x9EE390--0x9EF190) is the constructor for an instruction description object that feeds the register allocator's pre-coloring pass. Despite the diagnostic string "IntrinsicDescrFile=%s", the function loads instruction descriptions broadly -- not just intrinsic operations. It determines which instructions exist for the target SM, what register classes they use, and what scheduling properties apply. The sole caller is sub_991790 (pre-coloring pass, 12KB).

Invocation pattern: The pre-coloring pass checks context+1936 before calling sub_9EE390. If the descriptor for the current SM class already exists, it is reused. This means the expensive initialization happens once per SM architecture per ptxas process lifetime.
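This once-per-process reuse is a plain lazy-initialization cache. A minimal sketch, assuming illustrative names (the real check reads context+1936 directly):

```python
class Context:
    """Toy stand-in for the compilation context."""
    def __init__(self):
        self.descriptor = None   # models the slot at context+1936
        self.builds = 0          # counts expensive initializations

def get_descriptor(ctx: Context, sm_class: int):
    """Build the per-SM instruction descriptor on first use, reuse after."""
    if ctx.descriptor is None:          # expensive path runs at most once
        ctx.builds += 1
        ctx.descriptor = ("instr_table", sm_class)
    return ctx.descriptor
```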

Initialization Sequence

  1. Extract target properties. Read the target descriptor from context+1584. Compute the SM architecture class: v111 = target_descriptor[+372] >> 12. Read resource descriptors from option interface slots 41--44.

  2. Check option 404 (IntrinsicDescrFile). Query the option interface at context[208]. If option 404 is set, extract the file path and log " IntrinsicDescrFile=%s". This CI-internal mechanism supplies an external description file that overrides or extends the built-in instruction table. When absent, the built-in database is used.

  3. Determine instruction format class. Call sub_7DDB50(context) (GetSmVersionIndex), subtract 1, index into dword_21E6330:

    | sub_7DDB50 return | v114 | Format |
    |---|---|---|
    | 1 | 0 | basic 64-bit |
    | 2 | 1 | 128-bit |
    | 3 | 3 | extended |
    | 4 | 2 | 192-bit |
    | 5+ | 3 | extended (default) |
  4. Determine SM generation class. Read context+12 (sm_version_id), subtract 1, index into dword_21E5C80. The table is an identity mapping (1--11), one entry per SM generation.

  5. Construct instruction table (648 bytes). Call sub_10AFF80 with 32 parameters including memory pool, register count, format class, description file path, architecture descriptor (16 bytes from context+1888), SM generation class, instruction count limits, and context flags. Follow with sub_10B1A90 (init pass 2) and sub_10AEF10 (finalization).

  6. Apply option overrides. Options 497, 738, 739 from the option interface set register limits and allocation budget values on the instruction table sub-object at +312.

  7. Select SM-specific instruction set descriptor. Based on v111 (SM architecture class):

    | v111 | SM range (inferred) | Alloc size | Constructor | Vtable |
    |---|---|---|---|---|
    | 5 | sm_50--sm_62 | 200 B | sub_9CDF90 | off_23F3B00 |
    | 6 | sm_70--sm_75 | 216 B | sub_9CE030 | off_22BB738 |
    | 7 | sm_80--sm_89 | 232 B | sub_9CE120 | off_22B5150 |
    | 8+ | sm_90--sm_121 | 240 B | sub_9CE190 | off_22AD230 |
    | <5 | (reuse existing) | -- | -- | -- |

    Each successor inherits the previous class and extends it with generation-specific instructions. The descriptor is stored at context+1936 and this+48.

Object Layout

| Offset | Size | Contents |
|---|---|---|
| +0 | 8 | Vtable (off_21E6818 -> [sub_9DAA40, sub_9CADF0, sub_9CAE10, sub_9DDEE0]) |
| +8 | 8 | Back-pointer to compilation context |
| +16 | 8 | Instruction table object (648 B, built by sub_10AFF80) |
| +24 | 8 | Scheduling metadata (from sub_1BBBA60) |
| +32 | 8 | Scratch area pointer (context[198]) |
| +40 | 1 | Dirty flag (0 = clean) |
| +48 | 8 | SM-specific instruction set descriptor |
| +56--136 | -- | Resource descriptors, memory pool, sentinel, sub-allocator |

Diagnostic Strings

| String | Location | Context |
|---|---|---|
| "__nv_ptx_builtin_ocg_" | sub_6C9EB0 (0x6c9ecf) | OCG builtin name prefix |
| "instrinsic" (sic) | Multiple OCG handlers | Consistent NVIDIA typo for "intrinsic" |
| ".weak .func" | sub_5FF700 (354KB) | Prototype declaration prefix |
| "__cuda_sm20_*", "__cuda_sm70_*", etc. | sub_5D1660 | Intrinsic name patterns in registration |
| "__cuda_sanitizer_memcheck_*" | sub_5D1660 | Compute-sanitizer integration hooks |
| "__cuda_sm10x_tcgen05_guardrail_trap_*" | sub_5D1660 | Blackwell debug trap intrinsics |
| " IntrinsicDescrFile=%s" | sub_9EE390 (0x9EEC9B) | Instruction description loader -- logs external description file path (option 404) |
| ".RELU not allowed with unsigned type" | sub_6BEC60 | OCG LDC/S2R handler |

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_5D1660 | 46KB | Master intrinsic registration -- 607 name-to-ID entries (608 table slots) | 99% |
| sub_5D4190 | 41KB | Opcode dispatch -- ~120 named + ~400 MMA hash entries | 99% |
| sub_5FF700 | 354KB | Prototype generator -- .weak .func PTX declarations | 99% |
| sub_5C7A50 | 173KB | wmma.mma codegen (all shapes/types/layouts) | 98% |
| sub_5C10A0 | 120KB | mma codegen (mma.sync API, post-Volta) | 98% |
| sub_5BBC30 | 90KB | tcgen05.mma codegen (Blackwell 5th-gen tensor core) | 98% |
| sub_5B76D0 | 64KB | div codegen (integer + FP, all rounding modes) | 95% |
| sub_5ADDC0 | 50KB | tex.grad codegen (1D/2D/3D gradient textures) | 95% |
| sub_5B4040 | 49KB | sqrt codegen (f32/f64, all rounding modes) | 95% |
| sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (1D--5D, tile/im2col) | 95% |
| sub_5B0CD0 | 44KB | rcp codegen (f32/f64 reciprocal, all rounding modes) | 95% |
| sub_6C9EB0 | 13KB | OCG intrinsic table init -- see OCG Intrinsic System for full function map (27 entries) | 95% |
| sub_6BDE20 | 7KB | Intrinsic operand expansion | 88% |
| sub_6BEC60 | 5.8KB | LDC/S2R intrinsic handlers | 90% |
| sub_9EE390 | 3.5KB | Instruction description loader -- builds per-SM instruction table for pre-coloring ("IntrinsicDescrFile=%s") | 92% |
| sub_9CDF90 | 156B | SM class 5 instruction set descriptor (200B, vtable off_23F3B00) | 85% |
| sub_9CE030 | 115B | SM class 6 instruction set descriptor (216B, extends sub_9CDF90) | 85% |
| sub_9CE120 | 112B | SM class 7 instruction set descriptor (232B, vtable off_22B5150) | 85% |
| sub_9CE190 | 114B | SM class 8+ instruction set descriptor (240B, vtable off_22AD230) | 85% |
| sub_9EF190 | 1.1KB | Error handler for instruction description loader (ICE on invalid option type) | 88% |

Cross-References

Appendix: Complete Intrinsic Name Catalog (607 Entries)

Every intrinsic registered by sub_5D1660, extracted from the decompiled source. IDs are contiguous 1--607 (0x001--0x25F). The suffix after stripping the prefix encodes the operation, data type, rounding mode, address space, and optional modifiers.
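The suffix encoding can be split mechanically on underscores after stripping the __cuda_ prefix. A hedged sketch: the field arity is not fixed (slowpath, _v2, _count and similar suffixes add trailing fields), so this returns the raw field list rather than a fixed tuple.

```python
def split_name(name: str) -> list[str]:
    """Split a registered intrinsic name into its encoded fields.
    e.g. '__cuda_sm20_div_rn_f32' -> ['sm20', 'div', 'rn', 'f32']."""
    assert name.startswith("__cuda_")
    return name[len("__cuda_"):].split("_")
```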

__cuda_reduxsync_* -- Redux sync (17 entries, 0x001--0x011, sm_70)

| ID | Hex | Name |
|---|---|---|
| 1 | 0x001 | __cuda_reduxsync_b32_and |
| 2 | 0x002 | __cuda_reduxsync_b32_or |
| 3 | 0x003 | __cuda_reduxsync_b32_xor |
| 4 | 0x004 | __cuda_reduxsync_f32_max |
| 5 | 0x005 | __cuda_reduxsync_f32_max_NaN |
| 6 | 0x006 | __cuda_reduxsync_f32_max_abs |
| 7 | 0x007 | __cuda_reduxsync_f32_max_abs_NaN |
| 8 | 0x008 | __cuda_reduxsync_f32_min |
| 9 | 0x009 | __cuda_reduxsync_f32_min_NaN |
| 10 | 0x00A | __cuda_reduxsync_f32_min_abs |
| 11 | 0x00B | __cuda_reduxsync_f32_min_abs_NaN |
| 12 | 0x00C | __cuda_reduxsync_s32_add |
| 13 | 0x00D | __cuda_reduxsync_s32_max |
| 14 | 0x00E | __cuda_reduxsync_s32_min |
| 15 | 0x00F | __cuda_reduxsync_u32_add |
| 16 | 0x010 | __cuda_reduxsync_u32_max |
| 17 | 0x011 | __cuda_reduxsync_u32_min |

__cuda_sanitizer_memcheck_* -- Compute-sanitizer hooks (7 entries, 0x012--0x018, --)

| ID | Hex | Name |
|---|---|---|
| 18 | 0x012 | __cuda_sanitizer_memcheck_free |
| 19 | 0x013 | __cuda_sanitizer_memcheck_generic |
| 20 | 0x014 | __cuda_sanitizer_memcheck_global |
| 21 | 0x015 | __cuda_sanitizer_memcheck_local |
| 22 | 0x016 | __cuda_sanitizer_memcheck_malloc |
| 23 | 0x017 | __cuda_sanitizer_memcheck_readmetadata |
| 24 | 0x018 | __cuda_sanitizer_memcheck_shared |

__cuda_scalar_video_emulation_* -- Video emulation (7 entries, 0x019--0x01F, sm_20)

| ID | Hex | Name |
|---|---|---|
| 25 | 0x019 | __cuda_scalar_video_emulation_operandExtractAndSignExtend01 |
| 26 | 0x01A | __cuda_scalar_video_emulation_operandExtractAndSignExtend11 |
| 27 | 0x01B | __cuda_scalar_video_emulation_operandExtractAndSignExtend12 |
| 28 | 0x01C | __cuda_scalar_video_emulation_operandExtractAndSignExtend22 |
| 29 | 0x01D | __cuda_scalar_video_emulation_optionalMerge32 |
| 30 | 0x01E | __cuda_scalar_video_emulation_saturate64 |
| 31 | 0x01F | __cuda_scalar_video_emulation_secondOp64 |

__cuda_sm10x_* -- Blackwell tcgen05 guardrails + mask (11 entries, 0x020--0x02A, sm_100)

| ID | Hex | Name |
|---|---|---|
| 32 | 0x020 | __cuda_sm10x_create_mask_from_bit_idx_and_alloc_size_v1 |
| 33 | 0x021 | __cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds |
| 34 | 0x022 | __cuda_sm10x_tcgen05_guardrail_trap_allocation_granularity_invalid |
| 35 | 0x023 | __cuda_sm10x_tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc |
| 36 | 0x024 | __cuda_sm10x_tcgen05_guardrail_trap_current_warp_owner_invalid |
| 37 | 0x025 | __cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment |
| 38 | 0x026 | __cuda_sm10x_tcgen05_guardrail_trap_phase_invalid_during_alloc |
| 39 | 0x027 | __cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env |
| 40 | 0x028 | __cuda_sm10x_tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod |
| 41 | 0x029 | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access |
| 42 | 0x02A | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_being_dealloced |

__cuda_sm1xx_* -- Bulk copy + cp.async.bulk.tensor (18 entries, 0x02B--0x03C, sm_100+)

| ID | Hex | Name |
|---|---|---|
| 43 | 0x02B | __cuda_sm1xx_bulk_copy_multicast |
| 44 | 0x02C | __cuda_sm1xx_bulk_copy_unicast |
| 45 | 0x02D | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_multicast |
| 46 | 0x02E | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_unicast |
| 47 | 0x02F | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_multicast |
| 48 | 0x030 | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_unicast |
| 49 | 0x031 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_multicast |
| 50 | 0x032 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_unicast |
| 51 | 0x033 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_multicast |
| 52 | 0x034 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_unicast |
| 53 | 0x035 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_multicast |
| 54 | 0x036 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_unicast |
| 55 | 0x037 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_multicast |
| 56 | 0x038 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_unicast |
| 57 | 0x039 | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_multicast |
| 58 | 0x03A | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_unicast |
| 59 | 0x03B | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_multicast |
| 60 | 0x03C | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_unicast |

__cuda_sm20_* -- IEEE math (70 entries, 0x03D--0x082, sm_20)

| ID | Hex | Name |
|---|---|---|
| 61 | 0x03D | __cuda_sm20_bfe_s64_ |
| 62 | 0x03E | __cuda_sm20_bfe_u64_ |
| 63 | 0x03F | __cuda_sm20_bfi_u64_ |
| 64 | 0x040 | __cuda_sm20_dblrcp_rn_slowpath_v3 |
| 65 | 0x041 | __cuda_sm20_div_rd_f32 |
| 66 | 0x042 | __cuda_sm20_div_rd_f64_v2 |
| 67 | 0x043 | __cuda_sm20_div_rd_ftz_f32 |
| 68 | 0x044 | __cuda_sm20_div_rn_f32 |
| 69 | 0x045 | __cuda_sm20_div_rn_f64_fast |
| 70 | 0x046 | __cuda_sm20_div_rn_f64_full |
| 71 | 0x047 | __cuda_sm20_div_rn_ftz_f32 |
| 72 | 0x048 | __cuda_sm20_div_rn_ftz_f32_slowpath |
| 73 | 0x049 | __cuda_sm20_div_rn_noftz_f32_slowpath |
| 74 | 0x04A | __cuda_sm20_div_ru_f32 |
| 75 | 0x04B | __cuda_sm20_div_ru_f64_v2 |
| 76 | 0x04C | __cuda_sm20_div_ru_ftz_f32 |
| 77 | 0x04D | __cuda_sm20_div_rz_f32 |
| 78 | 0x04E | __cuda_sm20_div_rz_f64_v2 |
| 79 | 0x04F | __cuda_sm20_div_rz_ftz_f32 |
| 80 | 0x050 | __cuda_sm20_div_s16 |
| 81 | 0x051 | __cuda_sm20_div_s64 |
| 82 | 0x052 | __cuda_sm20_div_u16 |
| 83 | 0x053 | __cuda_sm20_div_u64 |
| 84 | 0x054 | __cuda_sm20_drsqrt_f64_slowpath_v2 |
| 85 | 0x055 | __cuda_sm20_drsqrt_f64_v2 |
| 86 | 0x056 | __cuda_sm20_dsqrt_rd_f64 |
| 87 | 0x057 | __cuda_sm20_dsqrt_rn_f64_mediumpath_v1 |
| 88 | 0x058 | __cuda_sm20_dsqrt_rn_f64_v3 |
| 89 | 0x059 | __cuda_sm20_dsqrt_ru_f64 |
| 90 | 0x05A | __cuda_sm20_dsqrt_rz_f64 |
| 91 | 0x05B | __cuda_sm20_rcp_f64_v3 |
| 92 | 0x05C | __cuda_sm20_rcp_rd_f32 |
| 93 | 0x05D | __cuda_sm20_rcp_rd_f32_slowpath |
| 94 | 0x05E | __cuda_sm20_rcp_rd_f64 |
| 95 | 0x05F | __cuda_sm20_rcp_rd_ftz_f32 |
| 96 | 0x060 | __cuda_sm20_rcp_rd_ftz_f32_slowpath |
| 97 | 0x061 | __cuda_sm20_rcp_rn_f32 |
| 98 | 0x062 | __cuda_sm20_rcp_rn_f32_slowpath |
| 99 | 0x063 | __cuda_sm20_rcp_rn_ftz_f32 |
| 100 | 0x064 | __cuda_sm20_rcp_rn_ftz_f32_slowpath |
| 101 | 0x065 | __cuda_sm20_rcp_ru_f32 |
| 102 | 0x066 | __cuda_sm20_rcp_ru_f32_slowpath |
| 103 | 0x067 | __cuda_sm20_rcp_ru_f64 |
| 104 | 0x068 | __cuda_sm20_rcp_ru_ftz_f32 |
| 105 | 0x069 | __cuda_sm20_rcp_ru_ftz_f32_slowpath |
| 106 | 0x06A | __cuda_sm20_rcp_rz_f32 |
| 107 | 0x06B | __cuda_sm20_rcp_rz_f32_slowpath |
| 108 | 0x06C | __cuda_sm20_rcp_rz_f64 |
| 109 | 0x06D | __cuda_sm20_rcp_rz_ftz_f32 |
| 110 | 0x06E | __cuda_sm20_rcp_rz_ftz_f32_slowpath |
| 111 | 0x06F | __cuda_sm20_rem_s16 |
| 112 | 0x070 | __cuda_sm20_rem_s64 |
| 113 | 0x071 | __cuda_sm20_rem_u16 |
| 114 | 0x072 | __cuda_sm20_rem_u64 |
| 115 | 0x073 | __cuda_sm20_sqrt_rd_f32 |
| 116 | 0x074 | __cuda_sm20_sqrt_rd_f32_slowpath |
| 117 | 0x075 | __cuda_sm20_sqrt_rd_ftz_f32 |
| 118 | 0x076 | __cuda_sm20_sqrt_rd_ftz_f32_slowpath |
| 119 | 0x077 | __cuda_sm20_sqrt_rn_f32 |
| 120 | 0x078 | __cuda_sm20_sqrt_rn_f32_slowpath |
| 121 | 0x079 | __cuda_sm20_sqrt_rn_ftz_f32 |
| 122 | 0x07A | __cuda_sm20_sqrt_rn_ftz_f32_slowpath |
| 123 | 0x07B | __cuda_sm20_sqrt_ru_f32 |
| 124 | 0x07C | __cuda_sm20_sqrt_ru_f32_slowpath |
| 125 | 0x07D | __cuda_sm20_sqrt_ru_ftz_f32 |
| 126 | 0x07E | __cuda_sm20_sqrt_ru_ftz_f32_slowpath |
| 127 | 0x07F | __cuda_sm20_sqrt_rz_f32 |
| 128 | 0x080 | __cuda_sm20_sqrt_rz_f32_slowpath |
| 129 | 0x081 | __cuda_sm20_sqrt_rz_ftz_f32 |
| 130 | 0x082 | __cuda_sm20_sqrt_rz_ftz_f32_slowpath |

__cuda_sm3x_* -- Optimized division (4 entries, 0x083--0x086, sm_30)

| ID | Hex | Name |
|---|---|---|
| 131 | 0x083 | __cuda_sm3x_div_rn_ftz_f32 |
| 132 | 0x084 | __cuda_sm3x_div_rn_ftz_f32_slowpath |
| 133 | 0x085 | __cuda_sm3x_div_rn_noftz_f32 |
| 134 | 0x086 | __cuda_sm3x_div_rn_noftz_f32_slowpath |

__cuda_sm62_* -- Integer dot product (2 entries, 0x087--0x088, sm_62)

| ID | Hex | Name |
|---|---|---|
| 135 | 0x087 | __cuda_sm62_dp2a |
| 136 | 0x088 | __cuda_sm62_dp4a |

__cuda_sm70_* -- Volta sync/warp/WMMA (393 entries, 0x089--0x211, sm_70)

| ID | Hex | Name |
|---|---|---|
| 137 | 0x089 | __cuda_sm70_barrier_arrive |
| 138 | 0x08A | __cuda_sm70_barrier_arrive_0 |
| 139 | 0x08B | __cuda_sm70_barrier_arrive_0_count |
| 140 | 0x08C | __cuda_sm70_barrier_arrive_1 |
| 141 | 0x08D | __cuda_sm70_barrier_arrive_10 |
| 142 | 0x08E | __cuda_sm70_barrier_arrive_10_count |
| 143 | 0x08F | __cuda_sm70_barrier_arrive_11 |
| 144 | 0x090 | __cuda_sm70_barrier_arrive_11_count |
| 145 | 0x091 | __cuda_sm70_barrier_arrive_12 |
| 146 | 0x092 | __cuda_sm70_barrier_arrive_12_count |
| 147 | 0x093 | __cuda_sm70_barrier_arrive_13 |
| 148 | 0x094 | __cuda_sm70_barrier_arrive_13_count |
| 149 | 0x095 | __cuda_sm70_barrier_arrive_14 |
| 150 | 0x096 | __cuda_sm70_barrier_arrive_14_count |
| 151 | 0x097 | __cuda_sm70_barrier_arrive_15 |
| 152 | 0x098 | __cuda_sm70_barrier_arrive_15_count |
| 153 | 0x099 | __cuda_sm70_barrier_arrive_1_count |
| 154 | 0x09A | __cuda_sm70_barrier_arrive_2 |
| 155 | 0x09B | __cuda_sm70_barrier_arrive_2_count |
| 156 | 0x09C | __cuda_sm70_barrier_arrive_3 |
| 157 | 0x09D | __cuda_sm70_barrier_arrive_3_count |
| 158 | 0x09E | __cuda_sm70_barrier_arrive_4 |
| 159 | 0x09F | __cuda_sm70_barrier_arrive_4_count |
| 160 | 0x0A0 | __cuda_sm70_barrier_arrive_5 |
| 161 | 0x0A1 | __cuda_sm70_barrier_arrive_5_count |
| 162 | 0x0A2 | __cuda_sm70_barrier_arrive_6 |
| 163 | 0x0A3 | __cuda_sm70_barrier_arrive_6_count |
| 164 | 0x0A4 | __cuda_sm70_barrier_arrive_7 |
| 165 | 0x0A5 | __cuda_sm70_barrier_arrive_7_count |
| 166 | 0x0A6 | __cuda_sm70_barrier_arrive_8 |
| 167 | 0x0A7 | __cuda_sm70_barrier_arrive_8_count |
| 168 | 0x0A8 | __cuda_sm70_barrier_arrive_9 |
| 169 | 0x0A9 | __cuda_sm70_barrier_arrive_9_count |
| 170 | 0x0AA | __cuda_sm70_barrier_arrive_count |
| 171 | 0x0AB | __cuda_sm70_barrier_red_and |
| 172 | 0x0AC | __cuda_sm70_barrier_red_and_0 |
| 173 | 0x0AD | __cuda_sm70_barrier_red_and_0_count |
| 174 | 0x0AE | __cuda_sm70_barrier_red_and_1 |
| 175 | 0x0AF | __cuda_sm70_barrier_red_and_10 |
| 176 | 0x0B0 | __cuda_sm70_barrier_red_and_10_count |
| 177 | 0x0B1 | __cuda_sm70_barrier_red_and_11 |
| 178 | 0x0B2 | __cuda_sm70_barrier_red_and_11_count |
| 179 | 0x0B3 | __cuda_sm70_barrier_red_and_12 |
| 180 | 0x0B4 | __cuda_sm70_barrier_red_and_12_count |
| 181 | 0x0B5 | __cuda_sm70_barrier_red_and_13 |
| 182 | 0x0B6 | __cuda_sm70_barrier_red_and_13_count |
| 183 | 0x0B7 | __cuda_sm70_barrier_red_and_14 |
| 184 | 0x0B8 | __cuda_sm70_barrier_red_and_14_count |
| 185 | 0x0B9 | __cuda_sm70_barrier_red_and_15 |
| 186 | 0x0BA | __cuda_sm70_barrier_red_and_15_count |
| 187 | 0x0BB | __cuda_sm70_barrier_red_and_1_count |
| 188 | 0x0BC | __cuda_sm70_barrier_red_and_2 |
| 189 | 0x0BD | __cuda_sm70_barrier_red_and_2_count |
| 190 | 0x0BE | __cuda_sm70_barrier_red_and_3 |
| 191 | 0x0BF | __cuda_sm70_barrier_red_and_3_count |
| 192 | 0x0C0 | __cuda_sm70_barrier_red_and_4 |
| 193 | 0x0C1 | __cuda_sm70_barrier_red_and_4_count |
| 194 | 0x0C2 | __cuda_sm70_barrier_red_and_5 |
| 195 | 0x0C3 | __cuda_sm70_barrier_red_and_5_count |
| 196 | 0x0C4 | __cuda_sm70_barrier_red_and_6 |
| 197 | 0x0C5 | __cuda_sm70_barrier_red_and_6_count |
| 198 | 0x0C6 | __cuda_sm70_barrier_red_and_7 |
| 199 | 0x0C7 | __cuda_sm70_barrier_red_and_7_count |
| 200 | 0x0C8 | __cuda_sm70_barrier_red_and_8 |
| 201 | 0x0C9 | __cuda_sm70_barrier_red_and_8_count |
| 202 | 0x0CA | __cuda_sm70_barrier_red_and_9 |
| 203 | 0x0CB | __cuda_sm70_barrier_red_and_9_count |
| 204 | 0x0CC | __cuda_sm70_barrier_red_and_count |
| 205 | 0x0CD | __cuda_sm70_barrier_red_or |
| 206 | 0x0CE | __cuda_sm70_barrier_red_or_0 |
| 207 | 0x0CF | __cuda_sm70_barrier_red_or_0_count |
| 208 | 0x0D0 | __cuda_sm70_barrier_red_or_1 |
| 209 | 0x0D1 | __cuda_sm70_barrier_red_or_10 |
| 210 | 0x0D2 | __cuda_sm70_barrier_red_or_10_count |
| 211 | 0x0D3 | __cuda_sm70_barrier_red_or_11 |
| 212 | 0x0D4 | __cuda_sm70_barrier_red_or_11_count |
| 213 | 0x0D5 | __cuda_sm70_barrier_red_or_12 |
| 214 | 0x0D6 | __cuda_sm70_barrier_red_or_12_count |
| 215 | 0x0D7 | __cuda_sm70_barrier_red_or_13 |
| 216 | 0x0D8 | __cuda_sm70_barrier_red_or_13_count |
| 217 | 0x0D9 | __cuda_sm70_barrier_red_or_14 |
| 218 | 0x0DA | __cuda_sm70_barrier_red_or_14_count |
| 219 | 0x0DB | __cuda_sm70_barrier_red_or_15 |
| 220 | 0x0DC | __cuda_sm70_barrier_red_or_15_count |
| 221 | 0x0DD | __cuda_sm70_barrier_red_or_1_count |
| 222 | 0x0DE | __cuda_sm70_barrier_red_or_2 |
| 223 | 0x0DF | __cuda_sm70_barrier_red_or_2_count |
| 224 | 0x0E0 | __cuda_sm70_barrier_red_or_3 |
| 225 | 0x0E1 | __cuda_sm70_barrier_red_or_3_count |
| 226 | 0x0E2 | __cuda_sm70_barrier_red_or_4 |
| 227 | 0x0E3 | __cuda_sm70_barrier_red_or_4_count |
| 228 | 0x0E4 | __cuda_sm70_barrier_red_or_5 |
| 229 | 0x0E5 | __cuda_sm70_barrier_red_or_5_count |
| 230 | 0x0E6 | __cuda_sm70_barrier_red_or_6 |
| 231 | 0x0E7 | __cuda_sm70_barrier_red_or_6_count |
| 232 | 0x0E8 | __cuda_sm70_barrier_red_or_7 |
| 233 | 0x0E9 | __cuda_sm70_barrier_red_or_7_count |
| 234 | 0x0EA | __cuda_sm70_barrier_red_or_8 |
| 235 | 0x0EB | __cuda_sm70_barrier_red_or_8_count |
| 236 | 0x0EC | __cuda_sm70_barrier_red_or_9 |
| 237 | 0x0ED | __cuda_sm70_barrier_red_or_9_count |
| 238 | 0x0EE | __cuda_sm70_barrier_red_or_count |
| 239 | 0x0EF | __cuda_sm70_barrier_red_popc |
| 240 | 0x0F0 | __cuda_sm70_barrier_red_popc_0 |
| 241 | 0x0F1 | __cuda_sm70_barrier_red_popc_0_count |
| 242 | 0x0F2 | __cuda_sm70_barrier_red_popc_1 |
| 243 | 0x0F3 | __cuda_sm70_barrier_red_popc_10 |
| 244 | 0x0F4 | __cuda_sm70_barrier_red_popc_10_count |
| 245 | 0x0F5 | __cuda_sm70_barrier_red_popc_11 |
| 246 | 0x0F6 | __cuda_sm70_barrier_red_popc_11_count |
| 247 | 0x0F7 | __cuda_sm70_barrier_red_popc_12 |
| 248 | 0x0F8 | __cuda_sm70_barrier_red_popc_12_count |
| 249 | 0x0F9 | __cuda_sm70_barrier_red_popc_13 |
| 250 | 0x0FA | __cuda_sm70_barrier_red_popc_13_count |
| 251 | 0x0FB | __cuda_sm70_barrier_red_popc_14 |
| 252 | 0x0FC | __cuda_sm70_barrier_red_popc_14_count |
| 253 | 0x0FD | __cuda_sm70_barrier_red_popc_15 |
| 254 | 0x0FE | __cuda_sm70_barrier_red_popc_15_count |
2550x0FF__cuda_sm70_barrier_red_popc_1_count
2560x100__cuda_sm70_barrier_red_popc_2
2570x101__cuda_sm70_barrier_red_popc_2_count
2580x102__cuda_sm70_barrier_red_popc_3
2590x103__cuda_sm70_barrier_red_popc_3_count
2600x104__cuda_sm70_barrier_red_popc_4
2610x105__cuda_sm70_barrier_red_popc_4_count
2620x106__cuda_sm70_barrier_red_popc_5
2630x107__cuda_sm70_barrier_red_popc_5_count
2640x108__cuda_sm70_barrier_red_popc_6
2650x109__cuda_sm70_barrier_red_popc_6_count
2660x10A__cuda_sm70_barrier_red_popc_7
2670x10B__cuda_sm70_barrier_red_popc_7_count
2680x10C__cuda_sm70_barrier_red_popc_8
2690x10D__cuda_sm70_barrier_red_popc_8_count
2700x10E__cuda_sm70_barrier_red_popc_9
2710x10F__cuda_sm70_barrier_red_popc_9_count
2720x110__cuda_sm70_barrier_red_popc_count
2730x111__cuda_sm70_barrier_sync
2740x112__cuda_sm70_barrier_sync_0
2750x113__cuda_sm70_barrier_sync_0_count
2760x114__cuda_sm70_barrier_sync_1
2770x115__cuda_sm70_barrier_sync_10
2780x116__cuda_sm70_barrier_sync_10_count
2790x117__cuda_sm70_barrier_sync_11
2800x118__cuda_sm70_barrier_sync_11_count
2810x119__cuda_sm70_barrier_sync_12
2820x11A__cuda_sm70_barrier_sync_12_count
2830x11B__cuda_sm70_barrier_sync_13
2840x11C__cuda_sm70_barrier_sync_13_count
2850x11D__cuda_sm70_barrier_sync_14
2860x11E__cuda_sm70_barrier_sync_14_count
2870x11F__cuda_sm70_barrier_sync_15
2880x120__cuda_sm70_barrier_sync_15_count
2890x121__cuda_sm70_barrier_sync_1_count
2900x122__cuda_sm70_barrier_sync_2
2910x123__cuda_sm70_barrier_sync_2_count
2920x124__cuda_sm70_barrier_sync_3
2930x125__cuda_sm70_barrier_sync_3_count
2940x126__cuda_sm70_barrier_sync_4
2950x127__cuda_sm70_barrier_sync_4_count
2960x128__cuda_sm70_barrier_sync_5
2970x129__cuda_sm70_barrier_sync_5_count
2980x12A__cuda_sm70_barrier_sync_6
2990x12B__cuda_sm70_barrier_sync_6_count
3000x12C__cuda_sm70_barrier_sync_7
3010x12D__cuda_sm70_barrier_sync_7_count
3020x12E__cuda_sm70_barrier_sync_8
3030x12F__cuda_sm70_barrier_sync_8_count
3040x130__cuda_sm70_barrier_sync_9
3050x131__cuda_sm70_barrier_sync_9_count
3060x132__cuda_sm70_barrier_sync_count
3070x133__cuda_sm70_matchsync_all_b32
3080x134__cuda_sm70_matchsync_all_b32_p
3090x135__cuda_sm70_matchsync_all_b64
3100x136__cuda_sm70_matchsync_all_b64_p
3110x137__cuda_sm70_matchsync_any_b32
3120x138__cuda_sm70_matchsync_any_b64
3130x139__cuda_sm70_shflsync_bfly
3140x13A__cuda_sm70_shflsync_bfly_p
3150x13B__cuda_sm70_shflsync_down
3160x13C__cuda_sm70_shflsync_down_p
3170x13D__cuda_sm70_shflsync_idx
3180x13E__cuda_sm70_shflsync_idx_p
3190x13F__cuda_sm70_shflsync_up
3200x140__cuda_sm70_shflsync_up_p
3210x141__cuda_sm70_votesync_all
3220x142__cuda_sm70_votesync_any
3230x143__cuda_sm70_votesync_ballot
3240x144__cuda_sm70_votesync_uni
3250x145__cuda_sm70_warpsync
3260x146__cuda_sm70_wmma_m16n16k16_load_a_col
3270x147__cuda_sm70_wmma_m16n16k16_load_a_col_global
3280x148__cuda_sm70_wmma_m16n16k16_load_a_col_shared
3290x149__cuda_sm70_wmma_m16n16k16_load_a_row
3300x14A__cuda_sm70_wmma_m16n16k16_load_a_row_global
3310x14B__cuda_sm70_wmma_m16n16k16_load_a_row_shared
3320x14C__cuda_sm70_wmma_m16n16k16_load_b_col
3330x14D__cuda_sm70_wmma_m16n16k16_load_b_col_global
3340x14E__cuda_sm70_wmma_m16n16k16_load_b_col_shared
3350x14F__cuda_sm70_wmma_m16n16k16_load_b_row
3360x150__cuda_sm70_wmma_m16n16k16_load_b_row_global
3370x151__cuda_sm70_wmma_m16n16k16_load_b_row_shared
3380x152__cuda_sm70_wmma_m16n16k16_load_c_col_f16
3390x153__cuda_sm70_wmma_m16n16k16_load_c_col_f16_global
3400x154__cuda_sm70_wmma_m16n16k16_load_c_col_f16_shared
3410x155__cuda_sm70_wmma_m16n16k16_load_c_col_f32
3420x156__cuda_sm70_wmma_m16n16k16_load_c_col_f32_global
3430x157__cuda_sm70_wmma_m16n16k16_load_c_col_f32_shared
3440x158__cuda_sm70_wmma_m16n16k16_load_c_row_f16
3450x159__cuda_sm70_wmma_m16n16k16_load_c_row_f16_global
3460x15A__cuda_sm70_wmma_m16n16k16_load_c_row_f16_shared
3470x15B__cuda_sm70_wmma_m16n16k16_load_c_row_f32
3480x15C__cuda_sm70_wmma_m16n16k16_load_c_row_f32_global
3490x15D__cuda_sm70_wmma_m16n16k16_load_c_row_f32_shared
3500x15E__cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16
3510x15F__cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16_satfinite
3520x160__cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32
3530x161__cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32_satfinite
3540x162__cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16
3550x163__cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16_satfinite
3560x164__cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32
3570x165__cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32_satfinite
3580x166__cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16
3590x167__cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16_satfinite
3600x168__cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32
3610x169__cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32_satfinite
3620x16A__cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16
3630x16B__cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16_satfinite
3640x16C__cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32
3650x16D__cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32_satfinite
3660x16E__cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16
3670x16F__cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16_satfinite
3680x170__cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32
3690x171__cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32_satfinite
3700x172__cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16
3710x173__cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16_satfinite
3720x174__cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32
3730x175__cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32_satfinite
3740x176__cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16
3750x177__cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16_satfinite
3760x178__cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32
3770x179__cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32_satfinite
3780x17A__cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16
3790x17B__cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16_satfinite
3800x17C__cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32
3810x17D__cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32_satfinite
3820x17E__cuda_sm70_wmma_m16n16k16_store_d_col_f16
3830x17F__cuda_sm70_wmma_m16n16k16_store_d_col_f16_global
3840x180__cuda_sm70_wmma_m16n16k16_store_d_col_f16_shared
3850x181__cuda_sm70_wmma_m16n16k16_store_d_col_f32
3860x182__cuda_sm70_wmma_m16n16k16_store_d_col_f32_global
3870x183__cuda_sm70_wmma_m16n16k16_store_d_col_f32_shared
3880x184__cuda_sm70_wmma_m16n16k16_store_d_row_f16
3890x185__cuda_sm70_wmma_m16n16k16_store_d_row_f16_global
3900x186__cuda_sm70_wmma_m16n16k16_store_d_row_f16_shared
3910x187__cuda_sm70_wmma_m16n16k16_store_d_row_f32
3920x188__cuda_sm70_wmma_m16n16k16_store_d_row_f32_global
3930x189__cuda_sm70_wmma_m16n16k16_store_d_row_f32_shared
3940x18A__cuda_sm70_wmma_m32n8k16_load_a_col
3950x18B__cuda_sm70_wmma_m32n8k16_load_a_col_global
3960x18C__cuda_sm70_wmma_m32n8k16_load_a_col_shared
3970x18D__cuda_sm70_wmma_m32n8k16_load_a_row
3980x18E__cuda_sm70_wmma_m32n8k16_load_a_row_global
3990x18F__cuda_sm70_wmma_m32n8k16_load_a_row_shared
4000x190__cuda_sm70_wmma_m32n8k16_load_b_col
4010x191__cuda_sm70_wmma_m32n8k16_load_b_col_global
4020x192__cuda_sm70_wmma_m32n8k16_load_b_col_shared
4030x193__cuda_sm70_wmma_m32n8k16_load_b_row
4040x194__cuda_sm70_wmma_m32n8k16_load_b_row_global
4050x195__cuda_sm70_wmma_m32n8k16_load_b_row_shared
4060x196__cuda_sm70_wmma_m32n8k16_load_c_col_f16
4070x197__cuda_sm70_wmma_m32n8k16_load_c_col_f16_global
4080x198__cuda_sm70_wmma_m32n8k16_load_c_col_f16_shared
4090x199__cuda_sm70_wmma_m32n8k16_load_c_col_f32
4100x19A__cuda_sm70_wmma_m32n8k16_load_c_col_f32_global
4110x19B__cuda_sm70_wmma_m32n8k16_load_c_col_f32_shared
4120x19C__cuda_sm70_wmma_m32n8k16_load_c_row_f16
4130x19D__cuda_sm70_wmma_m32n8k16_load_c_row_f16_global
4140x19E__cuda_sm70_wmma_m32n8k16_load_c_row_f16_shared
4150x19F__cuda_sm70_wmma_m32n8k16_load_c_row_f32
4160x1A0__cuda_sm70_wmma_m32n8k16_load_c_row_f32_global
4170x1A1__cuda_sm70_wmma_m32n8k16_load_c_row_f32_shared
4180x1A2__cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16
4190x1A3__cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16_satfinite
4200x1A4__cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32
4210x1A5__cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32_satfinite
4220x1A6__cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16
4230x1A7__cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16_satfinite
4240x1A8__cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32
4250x1A9__cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32_satfinite
4260x1AA__cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16
4270x1AB__cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16_satfinite
4280x1AC__cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32
4290x1AD__cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32_satfinite
4300x1AE__cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16
4310x1AF__cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16_satfinite
4320x1B0__cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32
4330x1B1__cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32_satfinite
4340x1B2__cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16
4350x1B3__cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16_satfinite
4360x1B4__cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32
4370x1B5__cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32_satfinite
4380x1B6__cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16
4390x1B7__cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16_satfinite
4400x1B8__cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32
4410x1B9__cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32_satfinite
4420x1BA__cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16
4430x1BB__cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16_satfinite
4440x1BC__cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32
4450x1BD__cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32_satfinite
4460x1BE__cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16
4470x1BF__cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16_satfinite
4480x1C0__cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32
4490x1C1__cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32_satfinite
4500x1C2__cuda_sm70_wmma_m32n8k16_store_d_col_f16
4510x1C3__cuda_sm70_wmma_m32n8k16_store_d_col_f16_global
4520x1C4__cuda_sm70_wmma_m32n8k16_store_d_col_f16_shared
4530x1C5__cuda_sm70_wmma_m32n8k16_store_d_col_f32
4540x1C6__cuda_sm70_wmma_m32n8k16_store_d_col_f32_global
4550x1C7__cuda_sm70_wmma_m32n8k16_store_d_col_f32_shared
4560x1C8__cuda_sm70_wmma_m32n8k16_store_d_row_f16
4570x1C9__cuda_sm70_wmma_m32n8k16_store_d_row_f16_global
4580x1CA__cuda_sm70_wmma_m32n8k16_store_d_row_f16_shared
4590x1CB__cuda_sm70_wmma_m32n8k16_store_d_row_f32
4600x1CC__cuda_sm70_wmma_m32n8k16_store_d_row_f32_global
4610x1CD__cuda_sm70_wmma_m32n8k16_store_d_row_f32_shared
4620x1CE__cuda_sm70_wmma_m8n32k16_load_a_col
4630x1CF__cuda_sm70_wmma_m8n32k16_load_a_col_global
4640x1D0__cuda_sm70_wmma_m8n32k16_load_a_col_shared
4650x1D1__cuda_sm70_wmma_m8n32k16_load_a_row
4660x1D2__cuda_sm70_wmma_m8n32k16_load_a_row_global
4670x1D3__cuda_sm70_wmma_m8n32k16_load_a_row_shared
4680x1D4__cuda_sm70_wmma_m8n32k16_load_b_col
4690x1D5__cuda_sm70_wmma_m8n32k16_load_b_col_global
4700x1D6__cuda_sm70_wmma_m8n32k16_load_b_col_shared
4710x1D7__cuda_sm70_wmma_m8n32k16_load_b_row
4720x1D8__cuda_sm70_wmma_m8n32k16_load_b_row_global
4730x1D9__cuda_sm70_wmma_m8n32k16_load_b_row_shared
4740x1DA__cuda_sm70_wmma_m8n32k16_load_c_col_f16
4750x1DB__cuda_sm70_wmma_m8n32k16_load_c_col_f16_global
4760x1DC__cuda_sm70_wmma_m8n32k16_load_c_col_f16_shared
4770x1DD__cuda_sm70_wmma_m8n32k16_load_c_col_f32
4780x1DE__cuda_sm70_wmma_m8n32k16_load_c_col_f32_global
4790x1DF__cuda_sm70_wmma_m8n32k16_load_c_col_f32_shared
4800x1E0__cuda_sm70_wmma_m8n32k16_load_c_row_f16
4810x1E1__cuda_sm70_wmma_m8n32k16_load_c_row_f16_global
4820x1E2__cuda_sm70_wmma_m8n32k16_load_c_row_f16_shared
4830x1E3__cuda_sm70_wmma_m8n32k16_load_c_row_f32
4840x1E4__cuda_sm70_wmma_m8n32k16_load_c_row_f32_global
4850x1E5__cuda_sm70_wmma_m8n32k16_load_c_row_f32_shared
4860x1E6__cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16
4870x1E7__cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16_satfinite
4880x1E8__cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32
4890x1E9__cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32_satfinite
4900x1EA__cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16
4910x1EB__cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16_satfinite
4920x1EC__cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32
4930x1ED__cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32_satfinite
4940x1EE__cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16
4950x1EF__cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16_satfinite
4960x1F0__cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32
4970x1F1__cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32_satfinite
4980x1F2__cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16
4990x1F3__cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16_satfinite
5000x1F4__cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32
5010x1F5__cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32_satfinite
5020x1F6__cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16
5030x1F7__cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16_satfinite
5040x1F8__cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32
5050x1F9__cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32_satfinite
5060x1FA__cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16
5070x1FB__cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16_satfinite
5080x1FC__cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32
5090x1FD__cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32_satfinite
5100x1FE__cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16
5110x1FF__cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16_satfinite
5120x200__cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32
5130x201__cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32_satfinite
5140x202__cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16
5150x203__cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16_satfinite
5160x204__cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32
5170x205__cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32_satfinite
5180x206__cuda_sm70_wmma_m8n32k16_store_d_col_f16
5190x207__cuda_sm70_wmma_m8n32k16_store_d_col_f16_global
5200x208__cuda_sm70_wmma_m8n32k16_store_d_col_f16_shared
5210x209__cuda_sm70_wmma_m8n32k16_store_d_col_f32
5220x20A__cuda_sm70_wmma_m8n32k16_store_d_col_f32_global
5230x20B__cuda_sm70_wmma_m8n32k16_store_d_col_f32_shared
5240x20C__cuda_sm70_wmma_m8n32k16_store_d_row_f16
5250x20D__cuda_sm70_wmma_m8n32k16_store_d_row_f16_global
5260x20E__cuda_sm70_wmma_m8n32k16_store_d_row_f16_shared
5270x20F__cuda_sm70_wmma_m8n32k16_store_d_row_f32
5280x210__cuda_sm70_wmma_m8n32k16_store_d_row_f32_global
5290x211__cuda_sm70_wmma_m8n32k16_store_d_row_f32_shared

__cuda_sm80_* -- Ampere createpolicy (3 entries, 0x212--0x214, sm_80)

ID   Hex    Name
530  0x212  __cuda_sm80_createpolicy_fractional
531  0x213  __cuda_sm80_createpolicy_fractional_encode
532  0x214  __cuda_sm80_createpolicy_range_encode

__cuda_sm_10x_* -- Blackwell hmma/imma/bit MMA (10 entries, 0x215--0x21E, sm_100)

ID   Hex    Name
533  0x215  __cuda_sm_10x_hmma_mdata_m16n8k16
534  0x216  __cuda_sm_10x_hmma_mdata_m16n8k32
535  0x217  __cuda_sm_10x_imma_mdata_m16n8k32
536  0x218  __cuda_sm_10x_imma_mdata_m16n8k64
537  0x219  __cuda_sm_10x_mma_bit_internal_and_m16n8k128
538  0x21A  __cuda_sm_10x_mma_bit_internal_and_m16n8k256
539  0x21B  __cuda_sm_10x_mma_bit_internal_and_m8n8k128
540  0x21C  __cuda_sm_10x_mma_bit_internal_xor_m16n8k128
541  0x21D  __cuda_sm_10x_mma_bit_internal_xor_m16n8k256
542  0x21E  __cuda_sm_10x_mma_bit_internal_xor_m8n8k128

__cuda_sm_8x_* -- Direct MMA + shfl (14 entries, 0x21F--0x22C, sm_80+)

ID   Hex    Name
543  0x21F  __cuda_sm_8x_mma_col_col_f16_f16_f16_f16
544  0x220  __cuda_sm_8x_mma_col_col_f32_f16_f16_f16
545  0x221  __cuda_sm_8x_mma_col_col_f32_f16_f16_f32
546  0x222  __cuda_sm_8x_mma_col_row_f16_f16_f16_f16
547  0x223  __cuda_sm_8x_mma_col_row_f32_f16_f16_f16
548  0x224  __cuda_sm_8x_mma_col_row_f32_f16_f16_f32
549  0x225  __cuda_sm_8x_mma_row_col_f16_f16_f16_f16
550  0x226  __cuda_sm_8x_mma_row_col_f32_f16_f16_f16
551  0x227  __cuda_sm_8x_mma_row_col_f32_f16_f16_f32
552  0x228  __cuda_sm_8x_mma_row_row_f16_f16_f16_f16
553  0x229  __cuda_sm_8x_mma_row_row_f32_f16_f16_f16
554  0x22A  __cuda_sm_8x_mma_row_row_f32_f16_f16_f32
555  0x22B  __cuda_sm_8x_mma_shfl_f16
556  0x22C  __cuda_sm_8x_mma_shfl_f32

__cuda_sm_9x_* -- Hopper sub-byte/bit MMA (51 entries, 0x22D--0x25F, sm_90)

ID   Hex    Name
557  0x22D  __cuda_sm_9x_mma_bit_internal_xor_m16n8k128
558  0x22E  __cuda_sm_9x_mma_bit_internal_xor_m16n8k256
559  0x22F  __cuda_sm_9x_mma_bit_internal_xor_m8n8k128
560  0x230  __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4
561  0x231  __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4_satfinite
562  0x232  __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4
563  0x233  __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4_satfinite
564  0x234  __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4
565  0x235  __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4_satfinite
566  0x236  __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4
567  0x237  __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4_satfinite
568  0x238  __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4
569  0x239  __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4_satfinite
570  0x23A  __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4
571  0x23B  __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4_satfinite
572  0x23C  __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4
573  0x23D  __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4_satfinite
574  0x23E  __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4
575  0x23F  __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4_satfinite
576  0x240  __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4
577  0x241  __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4_satfinite
578  0x242  __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4
579  0x243  __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4_satfinite
580  0x244  __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4
581  0x245  __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4_satfinite
582  0x246  __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4
583  0x247  __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4_satfinite
584  0x248  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4
585  0x249  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4_satfinite
586  0x24A  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4
587  0x24B  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4_satfinite
588  0x24C  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4
589  0x24D  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4_satfinite
590  0x24E  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4
591  0x24F  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4_satfinite
592  0x250  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_0
593  0x251  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_1
594  0x252  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_0
595  0x253  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_1
596  0x254  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_0
597  0x255  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_1
598  0x256  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_0
599  0x257  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_1
600  0x258  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_0
601  0x259  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_1
602  0x25A  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_0
603  0x25B  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_1
604  0x25C  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_0
605  0x25D  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_1
606  0x25E  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_0
607  0x25F  __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_1

OCG Intrinsic System (44 Builtin Operations, SM100+)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The OCG (Optimizing Code Generator) intrinsic subsystem is a separate, parallel dispatch mechanism for SM100+ builtin operations. While the classical intrinsic system at sub_5D1660 maps __cuda_* runtime helper names to integer IDs and emits inline PTX code via body templates, the OCG system maps __nv_ptx_builtin_ocg_* function names to type-specific handler functions that validate parameters and emit SASS instructions directly -- bypassing the inline-PTX expansion step entirely.
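The two-level routing can be modeled as below. This is a minimal sketch of the dispatch split described above, not recovered code; the handler names and dictionary shapes are assumptions for illustration.

```python
# Minimal model of the classical vs. OCG dispatch split (structure assumed,
# only the "__nv_ptx_builtin_ocg_" prefix and the error text are recovered).
OCG_PREFIX = "__nv_ptx_builtin_ocg_"

def route_intrinsic(name, classical_ids, ocg_handlers):
    """Return ('classical', id) or ('ocg', handler) for an intrinsic name."""
    if name.startswith(OCG_PREFIX):
        op = name[len(OCG_PREFIX):]       # operation key after the prefix
        handler = ocg_handlers.get(op)
        if handler is None:
            # "instrinsic" [sic] -- matches the misspelling in the binary
            raise ValueError("Unexpected instrinsic name (%s)" % name)
        return ("ocg", handler)
    # classical path: name -> integer ID -> inline PTX body template
    return ("classical", classical_ids[name])
```

A classical name such as `__cuda_sm70_warpsync` resolves to its table ID, while `__nv_ptx_builtin_ocg_vimax` is routed to a type-specific handler.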

OCG intrinsic table: sub_6C9EB0 (13KB) -- __nv_ptx_builtin_ocg_* dispatch for SM100+
OCG router: sub_6CC690 (22KB) -- routes OCG calls to type-specific handlers
OCG name resolver: sub_6C9BC0 -- resolves operation names to internal enums

Initialization -- sub_6C9EB0

sub_6C9EB0 initializes a 10,664-byte (0x29A8) lookup table and sets the vtable pointer to off_202CF48. The operation name prefix is stored at *(_QWORD *)(a1 + 120) = "__nv_ptx_builtin_ocg_". The table contains 44 operations in 248-byte slots starting at offset 128. Each slot holds the operation name followed by up to 30 sub-operation/modifier string pointers (unused slots are NULL from the memset).
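The slot arithmetic implied by the layout above can be sketched directly; the constants come from the decompilation (table base 128, 248-byte stride), and the asserts check rows of the builtin table later on this page.

```python
# Slot-offset arithmetic for the OCG builtin table in sub_6C9EB0:
# 248-byte slots starting at byte offset 128 inside the 0x29A8 table.
SLOT_BASE   = 128   # offset of slot 0
SLOT_STRIDE = 248   # operation name + up to 30 modifier string pointers

def slot_offset(slot):
    """Byte offset of an operation slot within the lookup table."""
    return SLOT_BASE + slot * SLOT_STRIDE

# Spot-checks against the offsets listed in the tables on this page.
assert slot_offset(0)  == 128    # add
assert slot_offset(16) == 4096   # viaddmax
assert slot_offset(31) == 7816   # tcbar
```

Note that a few rows in the recovered tables (around slot 32 and above) do not follow this stride exactly; those offsets are reproduced as extracted.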

OCG Builtin Name Table -- Complete (44 Operations)

The complete OCG builtin table extracted from sub_6C9EB0. Thirty numeric string pointers that IDA left unresolved were recovered by reading null-terminated strings from the ptxas binary at addr - 0x400000 (ELF LOAD virtual address base). The table size 0x29A8 and 248-byte slot stride are verified against the memset in the decompiled code.
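The pointer-recovery step described above amounts to reading a null-terminated string from the raw file at `vaddr - image_base`. A sketch of that step follows; the blob in the usage note stands in for the real 37.7 MB binary.

```python
# Recover a C string from the raw ptxas file given a virtual address that
# IDA left as an unresolved number. For ptxas v13.0.88 the ELF LOAD base
# is 0x400000, so file offset = vaddr - 0x400000 (single-LOAD assumption).
IMAGE_BASE = 0x400000

def read_cstr(blob, vaddr, base=IMAGE_BASE, limit=256):
    off = vaddr - base
    end = blob.index(b"\x00", off, off + limit)   # find the terminator
    return blob[off:end].decode("ascii")
```

In practice `blob` is `open("ptxas", "rb").read()`; here a synthetic buffer with a string at file offset 0x10 demonstrates the arithmetic: `read_cstr(b"\x00" * 16 + b"viaddmax\x00", 0x400010)` yields `"viaddmax"`.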

Arithmetic and ALU Operations

Slot | Offset | OCG Name | Sub-Operations / Types | SASS Equivalent
0 | 128 | add | s32, f32, s64, f64, sat | IADD3 / FADD
28 | 7072 | mnmx | s32, u32, s64, u64 | IMNMX / FMNMX
15 | 3848 | viadd | 32, f16x2 | VIADD

Vector Integer Operations (SM100+ VIMNMX family)

All six vector integer operations share the same type set (s32, u32, s16x2, u16x2), with an optional relu modifier for ReLU clamping.

Slot | Offset | OCG Name | SASS Equivalent | Description
16 | 4096 | viaddmax | VIADDMNMX | fused add + max
17 | 4344 | viaddmin | VIADDMNMX | fused add + min
18 | 4592 | vimax | VIMNMX | vector integer max
19 | 4840 | vimin | VIMNMX | vector integer min
20 | 5088 | vimax3 | VIMNMX3 | 3-way vector integer max
21 | 5336 | vimin3 | VIMNMX3 | 3-way vector integer min
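A reference model of the scalar s32 lane semantics for this family, as read from the table above (my interpretation of the names, not NVIDIA's code): viaddmax fuses an add with a max, vimax3 is a 3-way max, and relu clamps the result at zero.

```python
# Reference semantics for one s32 lane of the VIMNMX family (interpretation
# of the operation names above; wrap-around of the s32 add is ignored).

def viaddmax(a, b, c, relu=False):
    r = max(a + b, c)                  # fused add + max
    return max(r, 0) if relu else r

def viaddmin(a, b, c, relu=False):
    r = min(a + b, c)                  # fused add + min
    return max(r, 0) if relu else r

def vimax3(a, b, c, relu=False):
    r = max(a, b, c)                   # 3-way max
    return max(r, 0) if relu else r
```

For example, `viaddmax(2, 3, 10)` is `max(5, 10) = 10`, and with `relu=True` a negative result such as `viaddmax(-5, -6, -20)` clamps to `0`.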

Packed Float Operations (f16x2 arithmetic)

All three packed operations share the same modifier set: ftz (flush-to-zero) and rounding modes rn, rm, rp, rz.

Slot | Offset | OCG Name | SASS Equivalent | Description
25 | 6328 | fadd2 | HADD2 / FADD.PACKED | packed f16 addition
26 | 6576 | ffma2 | HFMA2 / FFMA.PACKED | packed f16 fused multiply-add
27 | 6824 | fmul2 | HMUL2 / FMUL.PACKED | packed f16 multiplication
29 | 7320 | fmax3 | FMNMX3 | 3-way float max (ftz, nan modifiers)
30 | 7568 | fmin3 | FMNMX3 | 3-way float min (ftz, nan modifiers)
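A lane-wise model of fadd2 on an f16x2 register: a 32-bit word holds two IEEE half floats that are added independently. This sketches the packed layout only (round-to-nearest, no ftz or explicit rounding-mode handling).

```python
import struct

# f16x2: two IEEE halves packed little-endian into one 32-bit word.
def unpack_f16x2(word):
    return struct.unpack("<2e", word.to_bytes(4, "little"))

def fadd2(x, y):
    """Lane-wise half add of two f16x2 words (rn rounding via struct pack)."""
    a0, a1 = unpack_f16x2(x)
    b0, b1 = unpack_f16x2(y)
    return int.from_bytes(struct.pack("<2e", a0 + b0, a1 + b1), "little")
```

Packing (1.0, 2.0) and adding the word to itself yields the lanes (2.0, 4.0), each rounded back to half precision.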

Async Copy and TMA Operations

Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent
1 | 376 | cp_async_commit | mem, bulk, shared, global | LDGDEPBAR
2 | 624 | cp_async_wait | mem, bulk, shared, global, read, write | DEPBAR
10 | 2608 | cp_async_bulk | mbarrier, counted, shared, global, multicast, sequenced, bytemask | UBLKCP
11 | 2856 | cp_red_async_bulk | mbarrier, counted, shared, global; types: u32/s32/u64/s64/f16/f32/f32ftz/f64/bf16; ops: add/min/max/inc/dec/and/or/xor | UBLKCP.RED
12 | 3104 | cp_async_tensor | mbarrier, shared, global, 1d/2d/3d/4d/5d, im2col, multicast | UTMAKCP
13 | 3352 | cp_async_prefetch_tensor | global, 1d/2d/3d/4d/5d, im2col | UTMAPF

Note: The SASS mnemonics UBLKCP and UTMAKCP do not appear as strings in the ptxas binary. These are SASS assembler-level names visible only in cuobjdump output; the OCG names (cp_async_bulk, cp_async_tensor) are the canonical internal form.

Load, Store, and Cache Operations

Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent
3 | 872 | cache | tensor, pf (prefetch), iv (invalidate), ivall (invalidate all) | CCTL / PREFETCH
4 | 1120 | ld_mc | ops: add/min/max/f32add/and/or/xor; types: f16x2/f16x4/f16x8/bf16x2/bf16x4/bf16x8/f32/f32x2/f32x4/f64/u32/s32/s64/u64 | LDG.MC
5 | 1368 | ldc | u32, u64 | LDC
6 | 1616 | s2r | (none -- register 0-255) | S2R
22 | 5584 | write_async | release; shared/global; gpu/sys/mmio; v2/v4; u8/s8/u16/s16/b32/b64/u32/f64 | STG.ASYNC
23 | 5832 | cctl_c | ldc/ldcu, shallow/deep, iv/ivall | CCTL

Async Reduction and Fence Operations

Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent
9 | 2360 | red_async | release; shared/global; gpu/sys/mmio; v2/v4; u32/s32/u64; add/min/max/inc/dec/and/or/xor | RED.ASYNC
14 | 3600 | fence_view_async | all, global, shared, dshared, tensor | FENCE.VIEW.ASYNC

Tensor Core Operations (Blackwell TC family)

Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent
31 | 7816 | tcbar | cta1/cta2, a1t0/a0tx, flush, multicast, b32 | TCBAR
32 | 7880 | mmareadshma | (none) | LDSM variant
33 | 8064 | tccp | 128dp256bit/4dp256bit/128dp128bit/2x64dp128bitlw02lw13/2x64dp128bitlw01lw23/4x32dp128bit/u4x16p64/u6x16p32; cta1/cta2; b32/b64 | TCCP
34 | 8312 | tcmma | gdesc/tmem; h/i/q/o/mxq; cta1/cta2; ashift/scale/lutb; areuse/akeep/breuse/bkeep; ws; buffer0-3; 2x/4x/blockscale/impl; b32/b64/u32 | TCMMA
35 | 8560 | tcshift | cta1/cta2, b32 | TCSHIFT
37 | 9056 | tcatomsws | and/or/findandset/align/cas; cta1/cta2; b32/b64 | TCATOM.SWS
38 | 9304 | tcldsws | cta1/cta2 | TCLD.SWS
39 | 9552 | tcstsws | cta1/cta2; b32/b64 | TCST.SWS

The tcmma operation at slot 34 is the primary Blackwell MMA instruction, successor to HMMA/IMMA/DMMA. Its sub-operations encode:

  • Descriptor mode: gdesc (global descriptor via UR), tmem (tensor memory direct)
  • Input formats: h (half/f16), i (integer), q (quarter/fp8), o (output descriptor), mxq (MX-format quarter for microscaled block-scaling)
  • Operand reuse: areuse/akeep (A matrix), breuse/bkeep (B matrix) -- register reuse hints
  • Warp-shared: ws -- warp-shared execution across 2 warps
  • Block scaling: blockscale with 2x/4x multipliers and impl (implementation-defined) -- FP4/FP6 microscaled format support
  • Buffers: buffer0-buffer3 -- double/quad buffering for pipelined execution

The SWS (Software Scoreboard) operations (tcatomsws, tcldsws, tcstsws) are a Blackwell synchronization mechanism for tensor core pipelines that replaces hardware scoreboards with software-managed tracking.

Tensor Memory Load/Store (Blackwell native)

Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent
42 | 10296 | ldtm | formats: 16dp128bit/16dp256bit/32dp32bit/16dp64bit/16dp32bitt0t15/16dp32bitt16t31/16dp32bit; scale: x1-x128; pack16bit; fused/stat; statistics: nan/max/maxabs/min/minabs; types: u32/s32/f32/b32; sparsity: sparsify/u2/spfactor2to4 | LDTM
43 | 10544 | sttm | formats: (same 7 as ldtm); scale: x1-x128; expand16bit; fused; b32 | STTM

The ldtm/sttm format strings encode the tensor memory data layout:

  • 16dp128bit -- 16 data-points, 128-bit total (e.g., 16x fp8)
  • 16dp256bit -- 16 data-points, 256-bit total (e.g., 16x fp16)
  • 32dp32bit -- 32 data-points, 32-bit total (e.g., 32x 1-bit)
  • 16dp32bitt0t15 / 16dp32bitt16t31 -- 16 data-points in thread groups 0-15 / 16-31
  • Scale factors x1 through x128 control the number of consecutive elements loaded
  • sparsify and spfactor2to4 enable structured 2:4 sparsity metadata generation
  • stat with nan/max/maxabs/min/minabs enables online statistics collection during load

Synchronization and Control

Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent
7 | 1864 | acqblk | (none) | barrier acquire block
8 | 2112 | preexit | (none) | EXIT.KEEPREFCOUNT
24 | 6080 | getnextworkid | selfcast, broadcast | work distribution primitive
36 | 8808 | virtcount | u32 | virtual warp counter
40 | 9800 | memclear | b32, b64 | MEMCLEAR
41 | 10048 | acqshminit | (none) | shared memory init barrier

Category Summary

Category | Count | Operations
Arithmetic / ALU | 3 | add, mnmx, viadd
Packed float | 5 | fadd2, ffma2, fmul2, fmax3, fmin3
Vector integer | 6 | viaddmax, viaddmin, vimax, vimin, vimax3, vimin3
Async copy / TMA | 6 | cp_async_commit, cp_async_wait, cp_async_bulk, cp_red_async_bulk, cp_async_tensor, cp_async_prefetch_tensor
Load / store / cache | 6 | ld_mc, ldc, s2r, write_async, cctl_c, cache
Async reduction / fence | 2 | red_async, fence_view_async
Tensor core (TC) | 8 | tcbar, mmareadshma, tccp, tcmma, tcshift, tcatomsws, tcldsws, tcstsws
Tensor memory (TM) | 2 | ldtm, sttm
Sync / control | 6 | acqblk, preexit, getnextworkid, virtcount, memclear, acqshminit
Total | 44 |

Handler Functions

The OCG handler cluster at 0x6C0000--0x6CC000 contains ~25--30 specialized handler/validator functions. Each validates parameters, types, sub-operations, and memory domains before delegating to the SASS encoding engine.

Address | Size | Handler | Confidence
sub_6C0D90 | 19KB | Atomic reduction (atom.add/min/max/cas, scope, memory order, vector width) | 90%
sub_6C1CF0 | 16KB | Mbarrier (arrive, wait, test, counted, bytemask variants) | 88%
sub_6C2AE0 | 10KB | cp.async (basic async copy) | 85%
sub_6C3470 | 20KB | cp.async.bulk (bulk async copy with type validation) | 85%
sub_6C46B0 | -- | cp.red.async.bulk (bulk async reduction) | 85%
sub_6C4DA0 | 15KB | Load/store (scope, memory order, domain validation) | 85%
sub_6C5A40 | 8KB | Cache control (CCTL: shallow/deep, iv/ivall, ldc/ldcu) | 85%
sub_6C60B0 | 7KB | Distributed shared memory (selfcast/broadcast) | 80%
sub_6C8100 | 9KB | cp.async.tensor / TMA (1--5D, multicast, tile/im2col) | 85%
sub_6C9BC0 | -- | Name resolver (operation name -> internal enum) | 80%
sub_6CC690 | 22KB | Router (dispatches to type-specific handlers via vtable) | 80%

OCG Validation Strings

The OCG handlers share a consistent validation pattern. Notable error messages (NVIDIA consistently misspells "intrinsic" as "instrinsic" throughout the codebase):

Error String | Handler | Meaning
"Op {add, min, max, inc, dec, and, or, xor} not specified" | Atomic | Missing reduction operation
"Domain param '_shared' or '_global' required" | Atomic/LS | No memory domain specified
"Unsupported non _add global memory reduction" | Atomic | Only add supported for global reductions
"Deprecated scope without memory order semantics" | Memory order | Legacy scope usage
"Required scope with memory order semantics" | Memory order | Missing scope on memory-ordered op
"byte mask not allowed with counted" | Mbarrier | Conflicting mbarrier modifiers
"Exactly one of the 'shallow' or 'deep' modifiers must be used." | CCTL | Missing cache depth modifier
"Cannot use both the selfcast and the broadcast modifier." | Dshmem | Conflicting multicast mode
"Unexpected instrinsic name (%s)" | Name resolver | Unknown OCG operation name
"Unexpected instrinsic subop (%s)" | Name resolver | Unknown sub-operation
"Unexpected instrinsic type (%s) instead of (%s) in param (%d)" | Type validator | Parameter type mismatch
"LDC requires a constant/immediate bank number" | LDC/S2R | Missing constant bank operand
"S2R register must be between 0 and 255 inclusive" | LDC/S2R | System register out of range
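Two of these checks can be modeled directly to show the shared validate-then-emit pattern. The control flow is an assumption; the diagnostic strings are the ones recovered from the binary.

```python
# Sketch of the shared OCG validation pattern: check the modifier set,
# raise the recovered diagnostic on violation, otherwise fall through
# to encoding. Only the error strings are taken from the binary.

def validate_cctl(mods):
    depth = {"shallow", "deep"} & set(mods)
    if len(depth) != 1:
        raise ValueError(
            "Exactly one of the 'shallow' or 'deep' modifiers must be used.")

def validate_s2r(reg):
    if not 0 <= reg <= 255:
        raise ValueError("S2R register must be between 0 and 255 inclusive")
```

`validate_cctl(["shallow", "iv"])` passes, while `["shallow", "deep"]` (both) or `["iv"]` (neither) raises, as does `validate_s2r(256)`.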

OCG SASS-Level Handlers

Separate from the validation layer, the SASS encoding zone at 0x6D0000--0x6E0000 contains MMA-specific handlers that operate during final instruction encoding:

Address | Size | Handler | Confidence
sub_6D4350 | 30KB | MMA intrinsic lowering (HMMA, IMMA, DMMA) | 90%
sub_6D5CB0 | 16KB | MMA operand encoder (matrix fragments, accumulator registers) | 80%
sub_6D7AF0 | 19KB | TCGen05 MMA handler (SM100 5th-gen tensor core encoding) | 90%
sub_6D69B0 | 12KB | TCGen05 MMA validator (parameter validation only) | 80%

Notable validation strings from the tcgen05 MMA handler:

  • "fused and l16dp32bit must be specified together"
  • "Inputs vector length is inconsistent with layout and num modifiers"

OCG Intrinsic Lowering Pipeline -- sub_6A97B0 + sub_6CC690

The full end-to-end flow that takes a PTX call.uni __nv_ptx_builtin_ocg_* intrinsic and produces a binary SASS instruction passes through five stages: three manipulate data structures (matching, cleanup) and two perform instruction encoding (operand assembly, SASS emission).

sub_6B5F30 (intrinsic lowering driver)
  |
  ├─ sub_6B40C0 ── pre-processing
  |
  ├─ sub_6A97B0 (LowerIntrinsicOp, 26KB) ──────────────────────────────┐
  |     │                                                               |
  |     │ Phase 1: SASS instruction matching                            |
  |     │   For each intrinsic call node in linked list [a1+32..a1+40): |
  |     │     Walk operand tree at node+288                             |
  |     │     For each leaf: read instruction ID at leaf+24             |
  |     │     Search RB-tree at context+760 for matching SASS defn      |
  |     │     On match: store ptr at node+464, back-link at SASS+440    |
  |     │                                                               |
  |     │ Phase 2: Unmatched node garbage collection                    |
  |     │   If node+464 == 0 (no SASS match):                          |
  |     │     Walk use-def chain at node+40..48                         |
  |     │     Delete matching RB-tree entries (full rebalance via       |
  |     │       sub_6A92E0)                                             |
  |     │     Unlink node from work list                                |
  |     │     Release internal resources (operands, types)              |
  |     │     Return node to free pool at a1+80                         |
  |     │                                                               |
  |     │ Phase 3: Secondary cleanup (re-scan remaining nodes)          |
  |     │   Nodes with SASS match but no definition link:               |
  |     │     Clear back-pointer, clean up, recycle to free pool        |
  |     │                                                               |
  |     │ Key data: context+760 = RB-tree root (SASS instruction defs)  |
  |     │           context+768/776 = min/max pointers                  |
  |     │           context+784 = tree node count                       |
  |     │           context+792 = tree free list                        |
  |                                                                     |
  ├─ (post-processing: sub_693D00 per remaining node) ─────────────────┘
  |
  v
sub_6D9690 (master SASS encoder, 94KB)
  |
  ├─ sub_6D9290 (OCG vtable entry point) ────────────────────────────────┐
  |     │                                                                |
  |     │ 1. Extract intrinsic name from IR node                         |
  |     │ 2. Call sub_6C9BC0(this+120, name)  ── ParseOCGBuiltinName     |
  |     │      Strips "__nv_ptx_builtin_ocg_" prefix                     |
  |     │      Iterates 43 operation slots (248B each) in OCG table      |
  |     │      Matches operation name, then parses '_'-delimited sub-ops |
  |     │      Output: this+10688 = operation enum (0..42)               |
  |     │              this+10704 = int[] of sub-op indices              |
  |     │              this+10712 = sub-op count                         |
  |     │ 3. Fall through to sub_6D8B20 for secondary dispatch           |
  |                                                                      |
  ├─ sub_6CC690 (OCGRouter, 22KB) ──────────────────────────────────────┘
  |     │
  |     │ Input: (self, instruction_node, sass_descriptor)
  |     │
  |     │ 1. Read SASS opcode from descriptor+8
  |     │ 2. Read target profile from context+1584
  |     │    Key profile fields:
  |     │      +503  = operand decode flag
  |     │      +1012 = target SM enum
  |     │      +1020 = extended address mode
  |     │      +1021 = barrier mode
  |     │      +1041 = memory order capabilities bitmask
  |     │
  |     │ 3. Vtable dispatch (off_202CF48):
  |     │    vtable[2]  = OpcodeValidator   (default: sub_6BC1D0)
  |     │    vtable[24] = ScopeValidator    (default: sub_6BCE50)
  |     │    vtable[25] = MemOrderValidator (default: sub_6BBEC0)
  |     │    Each is compared by address; if overridden, calls the
  |     │    custom validator; if default, uses inline fast-path.
  |     │
  |     │ 4. Opcode-range dispatch (descriptor+8):
  |     │    178..189: Memory ops (ld_mc, st)  -> SASS enum 243/245/247
  |     │    416..420, 434: Reduction/atomic    -> SASS enum 243/246/261
  |     │    445..448: Barrier/fence            -> memory op path
  |     │    467: cp.async.tensor/special       -> SASS enum 70 or 257
  |     │    default: zero-init modifiers, use raw descriptor
  |     │
  |     │ 5. Operand assembly into v134[] (312-byte buffer):
  |     │    sub_6CAFD0: decode src/dst registers -> v134[8..10]
  |     │    sub_6CAE80: encode uniform operands  -> v134[16]
  |     │    sub_6CAF50: encode scope/mem-order   -> v134[13]
  |     │    sub_6CBA50: encode barrier level     -> v134[26..28]
  |     │
  |     │ 6. Build control words:
  |     │    v134[26] = 0x60000000 | modifier_bits
  |     │    v134[27] = 0x60000000 | ordering | barrier | write_mask
  |     │    v134[28] = 0x60000000 | scope_flags | 0x81000
  |     │
  |     v
  sub_6CB8A0 (EmitSASS)
        │
        │ Input: (self, sass_opcode_enum, instr_node, v134[], flags...)
        │ 1. Read SM version from profile+372 (>> 12)
        │ 2. sub_6CB4B0: final operand validation
        │ 3. sub_C3F490(opcode, ...): look up SASS encoding template
        │ 4. Encode instruction word from template + v134[] operands
        │ 5. sub_9253C0: commit encoded instruction to output

Internal SASS opcode enum values assigned by the router (not binary SASS opcodes -- these are routing keys that sub_C3F490 maps to encoding templates):

| Enum | Hex | Meaning |
|---|---|---|
| 70 | 0x46 | Memory-ordered load/store/atomic (with barrier) |
| 243 | 0xF3 | Default memory operation |
| 245 | 0xF5 | Load variant (LD/LDG/LDS) |
| 246 | 0xF6 | Reduction/atomic default |
| 247 | 0xF7 | Fenced memory operation (LDGSTS) |
| 257 | 0x101 | Async copy without memory order |
| 261 | 0x105 | Atomic with pre-existing value read |
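The router's opcode-range dispatch can be sketched as a range-to-routing-key lookup. The ranges and candidate enum values below are the recovered ones; which candidate is ultimately chosen depends on sub-conditions that are not fully recovered, so this sketch returns the candidate set. All identifiers are illustrative, not ptxas symbols.

```python
# Hypothetical sketch of sub_6CC690's opcode-range dispatch (step 4 of the
# pipeline above). Ranges and routing-key candidates are recovered values.
ROUTING = [
    (set(range(178, 190)), {243, 245, 247}),          # memory ops (ld_mc, st)
    (set(range(416, 421)) | {434}, {243, 246, 261}),  # reduction/atomic
    ({467}, {70, 257}),                               # cp.async.tensor / special
]

def candidate_enums(opcode):
    # 445..448 (barrier/fence) reuse the memory-op path; anything else falls
    # through to the raw descriptor with zero-initialized modifiers.
    for keys, enums in ROUTING:
        if opcode in keys:
            return enums
    return set()
```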

Operand buffer layout (v134[], 39 QWORDs passed to sub_6CB8A0):

| Slot | Content |
|---|---|
| 0--3 | Reserved (zero) |
| 4 | Barrier register (0x90000000 \| reg) |
| 5--7 | Extra source operands (from instruction node) |
| 8--10 | Primary operands (from sub_6CAFD0 decode) |
| 11 | Secondary operand (LDC, conditional loads) |
| 12 | Predicate thread operand |
| 13 | Scope / memory-order (from sub_6CAF50) |
| 14 | Cache mode operand |
| 15 | Memory fence operand |
| 16 | Uniform / extended operand (from sub_6CAE80) |
| 17 | Memory ordering constant / barrier tracking |
| 19--21 | Source address (bulk/tensor ops) |
| 22--24 | Destination address (bulk/tensor ops) |
| 25 | Extra predicate (opcode 187 only) |
| 26 | Control word 0: 0x60000000 \| modifier_bits |
| 27 | Control word 1: 0x60000000 \| ordering \| flags |
| 28 | Control word 2: 0x60000000 \| scope \| 0x81000 |
| 29 | Write mask operand (conditional) |
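The buffer layout and control-word packing can be sketched as below. The 0x60000000 tag, the 0x81000 constant, and the buffer size are recovered values; the function name, parameter names, and the bit positions of the OR'd payloads are illustrative only.

```python
# Hypothetical sketch of the router's control-word packing (step 6).
CTRL_TAG = 0x60000000

def build_control_words(modifier_bits, ordering, barrier, write_mask, scope_flags):
    buf = [0] * 39                       # v134[]: 39 QWORDs (312 bytes)
    buf[26] = CTRL_TAG | modifier_bits   # control word 0
    buf[27] = CTRL_TAG | ordering | barrier | write_mask   # control word 1
    buf[28] = CTRL_TAG | scope_flags | 0x81000             # control word 2
    return buf
```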

OCG Lookup Flow

PTX source: call.uni __nv_ptx_builtin_ocg_tcmma, (%args...);
                    |
                    v
            sub_6A97B0 (LowerIntrinsicOp, 26KB)
            Matches call node to SASS instruction via RB-tree at ctx+760
            Garbage-collects unmatched nodes
                    |
                    v
            sub_6D9290 -> sub_6C9BC0 (ParseOCGBuiltinName)
            Strips "__nv_ptx_builtin_ocg_" prefix
            Parses op name + sub-ops from 43-slot table (sub_6C9EB0)
                    |
                    v
            sub_6CC690 (OCGRouter, 22KB)
            Vtable dispatch: validates opcode, scope, memory order
            Decodes operands into 312-byte buffer via sub_6CAFD0 cluster
            Builds control words (0x60000000 | modifier_bits)
                    |
                    v
            sub_6CB8A0 (EmitSASS)
            Looks up encoding template via sub_C3F490(sass_opcode_enum)
            Encodes instruction word, commits to output via sub_9253C0

See OCG Intrinsic Lowering Pipeline for the full five-stage breakdown with operand buffer layout and internal SASS opcode enum values.
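The name-parsing stage (sub_6C9BC0) can be sketched as a prefix strip followed by a longest-match over the operation table. The prefix is the recovered string; the operation dictionary here is a toy stand-in for the real 43-slot, 248-byte-per-entry table initialized by sub_6C9EB0, and the function name is invented.

```python
# Hypothetical sketch of ParseOCGBuiltinName (sub_6C9BC0).
PREFIX = "__nv_ptx_builtin_ocg_"
OPERATIONS = {"tcmma": 0, "red": 1, "mbarrier": 2}   # illustrative subset

def parse_ocg_builtin(name):
    if not name.startswith(PREFIX):
        return None
    body = name[len(PREFIX):]
    # Longest-match the operation name, then treat the remaining
    # '_'-delimited tokens as sub-operation modifiers.
    for op in sorted(OPERATIONS, key=len, reverse=True):
        if body == op or body.startswith(op + "_"):
            subops = body[len(op):].strip("_").split("_") if len(body) > len(op) else []
            return OPERATIONS[op], subops
    return None
```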

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_6C9EB0 | 13KB | OCG intrinsic table init (__nv_ptx_builtin_ocg_*) | 95% |
| sub_6CC690 | 22KB | OCG router -- vtable-dispatched operand assembly and SASS emission | 90% |
| sub_6C9BC0 | -- | OCG name parser -- decomposes __nv_ptx_builtin_ocg_X_Y_Z into enum + sub-op array | 95% |
| sub_6C0D90 | 19KB | OCG atomic/reduction handler | 90% |
| sub_6C1CF0 | 16KB | OCG mbarrier handler | 88% |
| sub_6C3470 | 20KB | OCG cp.async.bulk handler | 85% |
| sub_6C4DA0 | 15KB | OCG load/store handler | 85% |
| sub_6C5A40 | 8KB | OCG cache control handler | 85% |
| sub_6C60B0 | 7KB | OCG distributed shared memory handler | 80% |
| sub_6C8100 | 9KB | OCG cp.async.tensor / TMA handler | 85% |
| sub_6D4350 | 30KB | MMA intrinsic lowering (SASS encoding) | 90% |
| sub_6D7AF0 | 19KB | TCGen05 MMA handler (SASS encoding) | 90% |
| sub_6D5CB0 | 16KB | MMA operand encoder | 80% |
| sub_6D69B0 | 12KB | TCGen05 MMA validator | 80% |
| sub_6D9290 | -- | OCG vtable entry point (calls sub_6C9BC0 then sub_6D8B20) | 85% |
| sub_6CB8A0 | -- | SASS instruction emitter (template lookup via sub_C3F490) | 80% |
| sub_6CAFD0 | -- | Operand decoder (registers into v134[] slots) | 85% |
| sub_6CAE80 | -- | Uniform operand encoder | 85% |
| sub_6CAF50 | -- | Scope / memory-order encoder | 85% |
| sub_6CBA50 | -- | Barrier-level encoder | 85% |
| sub_6CB4B0 | -- | Operand validator (called by sub_6CB8A0) | 80% |
| sub_6A97B0 | 26KB | LowerIntrinsicOp -- SASS matching and unmatched-node GC | 90% |
| sub_6B5F30 | -- | Intrinsic lowering driver (calls sub_6B40C0 then sub_6A97B0) | 90% |
| sub_6A92E0 | -- | RB-tree fixup (rotation/recolor after deletion) | 90% |
| sub_6BC1D0 | -- | Default opcode validator (vtable[2] of off_202CF48) | 90% |
| sub_6BCE50 | -- | Default scope validator (vtable[24]) | 90% |
| sub_6BBEC0 | -- | Default memory-order validator (vtable[25]) | 90% |

Diagnostic Strings

| String | Location | Context |
|---|---|---|
| "__nv_ptx_builtin_ocg_" | sub_6C9EB0 (0x6c9ecf) | OCG builtin name prefix |
| "instrinsic" (sic) | Multiple OCG handlers | Consistent NVIDIA typo for "intrinsic" |
| ".RELU not allowed with unsigned type" | sub_6BEC60 | OCG LDC/S2R handler |

Cross-References

Math Intrinsics

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas provides built-in IEEE-compliant math functions for operations that have no single-instruction hardware implementation. When a PTX instruction like div.rn.f64, rcp.rn.f32, sqrt.rn.f64, or rsqrt.approx.f64 is encountered, ptxas either emits a single MUFU (Multi-Function Unit) instruction for approximate results, or generates a multi-instruction SASS sequence using Newton-Raphson refinement for IEEE-compliant precision. For operations too complex to inline, ptxas emits a call to one of 70 registered __cuda_sm20_* helper functions.

| Property | Value |
|---|---|
| Intrinsic ID range | 0x3D--0x86 (70 math entries + 4 sm_3x division variants) |
| Math codegen handlers | 9 functions: div.full, div, rem, rcp, rsqrt, sqrt, ex2, lg2, tanh |
| Newton-Raphson templates | 4 top-level handlers at 0x1700000--0x172A090 (~180 KB) |
| MUFU internal opcode | 0x3C (60 decimal), Ori mnemonic ZHSH |
| MUFU Mercury major | 0x58, minor sub-function encoded in operand fields |
| SFU functional unit | Index 8 in the latency model (RCP, RSQ, SIN, COS, EX2, LG2) |
| MUFU encoding variants | 14 (reg/reg, reg/pred, reg/ureg, bar operands) |

MUFU -- Multi-Function Unit

The MUFU instruction is a single-cycle-issue instruction that computes transcendental approximations on the SFU (Special Function Unit). Each SM has a dedicated SFU pipe that executes MUFU operations independently of the ALU pipes.

Sub-Function Table

The MUFU sub-function is encoded in the instruction's modifier field (not a separate operand). The following sub-functions are available across all SM architectures supported by ptxas v13.0:

| Sub-Function | Operation | Input | Output | Precision |
|---|---|---|---|---|
| MUFU.COS | cos(x * 2pi) | FP32 | FP32 | ~22 bits mantissa |
| MUFU.SIN | sin(x * 2pi) | FP32 | FP32 | ~22 bits mantissa |
| MUFU.EX2 | 2^x | FP32 | FP32 | ~22 bits mantissa |
| MUFU.LG2 | log2(x) | FP32 | FP32 | ~22 bits mantissa |
| MUFU.RCP | 1/x | FP32 | FP32 | ~23 bits mantissa |
| MUFU.RSQ | 1/sqrt(x) | FP32 | FP32 | ~23 bits mantissa |
| MUFU.RCP64H | 1/x (FP64 high-word seed) | FP32 | FP32 | ~23 bits, sm_80+ |
| MUFU.RSQ64H | 1/sqrt(x) (FP64 high-word seed) | FP32 | FP32 | ~23 bits, sm_80+ |
| MUFU.TANH | tanh(x) | FP32 | FP32 | ~22 bits, sm_75+ |

MUFU.RCP and MUFU.RSQ produce results accurate to approximately 1 ULP of the true FP32 value (23 mantissa bits). The trigonometric and exponential sub-functions (SIN, COS, EX2, LG2) are slightly less precise at approximately 22 bits. MUFU.TANH was added in Turing (sm_75).

MUFU in the Ori IR

In the ptxas internal representation, MUFU uses Ori opcode 0x3C (decimal 60) with the mnemonic ZHSH. During instruction selection, PTX operations like sin.approx.f32, cos.approx.f32, ex2.approx.f32, lg2.approx.f32, rcp.approx.f32, and rsqrt.approx.f32 are each lowered to a single ZHSH (MUFU) instruction with the appropriate sub-function selector.

The lowering pass responsible for MUFU emission is at sub_80E9B0 (LowerSpecialFunctions), called from the master lowering dispatcher sub_8380A0. It converts Ori-level special function opcodes into MUFU SASS instructions with appropriate sub-function encoding.

MUFU Encoding (sm_100+)

In the Mercury/Blackwell encoding, MUFU is major opcode 0x58 with a single variant at the basic encoding level (sub_10C0170). The encoding signature:

| Field | Value |
|---|---|
| Major opcode | 0x58 |
| f19 | 0xB |
| Format | 1 (single-operand class) |
| Operand count | 1 (destination implicit, source = register) |
| Encoding function | sub_10C0170 |

The variant table at 0xF7CEB0--0xF80760 defines 14 encoding patterns for MUFU, supporting combinations of:

  • reg2, reg2 -- standard register source and destination
  • reg2, pred3 -- predicated source
  • reg2, reg10 -- extended register class
  • reg2, ureg4 -- uniform register source (sm_100+ addition)
  • reg2, bar6 -- barrier operand (scheduling)

Uniform register support (ureg4) in MUFU is a Blackwell-specific addition, allowing MUFU to consume values directly from the uniform register file without a prior UMOV to a general-purpose register.

Pre-Assignment Constraints

The register allocator applies pre-assignment constraints for MUFU at sub_93F000. MUFU (internal opcode 22 in the constraint check, mapped from Ori opcode 0x3C) requires its operands in specific register classes. The constraint handler calls sub_93E9D0 with constraint type 1 (early) for MUFU operands.

Precision Levels

ptxas implements two distinct precision tiers for every math operation, selected by PTX instruction modifiers:

Approximate (.approx)

A single MUFU instruction. This is the default for sin.approx.f32, cos.approx.f32, ex2.approx.f32, lg2.approx.f32, rcp.approx.f32, and rsqrt.approx.f32. The MUFU hardware provides approximately 22--23 bits of mantissa precision in a single instruction dispatch on the SFU pipe.

PTX to SASS mapping (approximate):

| PTX Instruction | SASS | Latency |
|---|---|---|
| sin.approx.f32 | MUFU.SIN | SFU pipe, ~4 cycles |
| cos.approx.f32 | MUFU.COS | SFU pipe, ~4 cycles |
| ex2.approx.f32 | MUFU.EX2 | SFU pipe, ~4 cycles |
| lg2.approx.f32 | MUFU.LG2 | SFU pipe, ~4 cycles |
| rcp.approx.f32 | MUFU.RCP | SFU pipe, ~4 cycles |
| rsqrt.approx.f32 | MUFU.RSQ | SFU pipe, ~4 cycles |
| tanh.approx.f32 | MUFU.TANH | SFU pipe, ~4 cycles (sm_75+) |

IEEE-Compliant (.rn, .rd, .ru, .rz)

Multi-instruction sequences that use MUFU as a seed and refine with Newton-Raphson iterations using FMA instructions. These produce results that are correctly rounded to the specified IEEE 754 rounding mode (round-to-nearest-even, round-down, round-up, round-toward-zero). The instruction count ranges from ~15 for FP32 operations to ~120 for FP64 operations.

The IEEE-compliant paths are implemented in two ways:

  1. Inline templates -- Multi-instruction SASS sequences emitted directly at the call site by the Newton-Raphson template subsystem (0x1700000--0x172A090). Used for FP64 division, reciprocal, sqrt, and rsqrt.
  2. Callable helpers -- Calls to __cuda_sm20_* functions whose bodies are pre-compiled PTX routines linked from libdevice. Used for FP32 operations with directed rounding modes and all slowpath variants.

PTX Math Codegen Handlers

When sub_5D4190 builds the opcode dispatch table, it registers 9 math-related PTX instruction names to codegen handler functions. Each handler allocates a 50,000-byte temporary buffer, queries instruction properties through accessor functions on the instruction object at a1+1096, and generates inline PTX code via sequential sprintf() calls.

| PTX Opcode | Handler | Size | Description |
|---|---|---|---|
| div.full | sub_573860 | ~7 KB | FP64 full-precision division (calls Newton-Raphson template) |
| div | sub_5B76D0 | 64 KB | General division: dispatches by type (s16/u16/s64/u64/f32/f64) and rounding mode |
| rem | sub_589810 | ~13 KB | Integer remainder (s16/u16/s64/u64) |
| rcp | sub_5B0CD0 | 44 KB | Reciprocal: dispatches by type (f32/f64) and rounding mode |
| rsqrt | sub_57BFC0 | ~10 KB | Reciprocal square root |
| sqrt | sub_5B4040 | 49 KB | Square root: dispatches by type (f32/f64) and rounding mode |
| ex2 | sub_583190 | ~14 KB | Base-2 exponential |
| lg2 | sub_52A5C0 | ~5 KB | Base-2 logarithm |
| tanh | sub_505B00 | ~5 KB | Hyperbolic tangent |

Handler Dispatch Logic

All math codegen handlers follow the same structural pattern. The div handler (sub_5B76D0, 1,466 lines decompiled, 64 KB) is the largest because it covers the most type/rounding/precision combinations. The handler:

  1. Allocates a 50,000-byte output buffer via sub_424070
  2. Queries the operand type via sub_70CA60(*(a1+1096), 0):
    • Type 58 = FP32 (f32)
    • Type 59 = FP64 (f64)
    • Type 54 = signed 16-bit (s16)
    • Type 56 = unsigned 16-bit / other integer
  3. Queries rounding mode, FTZ flag, and precision modifier via additional accessors (sub_707BC0, sub_70B820, sub_70B8E0, sub_70B710)
  4. Selects the appropriate intrinsic function name and emits a PTX call via sprintf() into the output buffer
  5. Copies the result to a final allocation and frees the temporary buffer
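The buffer-and-sprintf pattern shared by these handlers can be sketched as below. The type codes (58/59/54/56) and the intrinsic naming scheme are recovered; the function name, signature, and the exact PTX call syntax emitted are illustrative assumptions, not decompiled output.

```python
# Hypothetical sketch of a math codegen handler: query the operand type,
# pick an intrinsic name, and append a PTX call into a scratch buffer
# (standing in for the 50,000-byte temporary allocation).
TYPE_NAMES = {58: "f32", 59: "f64", 54: "s16", 56: "u16"}

def emit_div_call(op_type, rounding, ftz, dst, a, b):
    buf = []
    ty = TYPE_NAMES[op_type]
    if ty in ("f32", "f64"):
        name = "__cuda_sm20_div_%s_%s%s" % (rounding, "ftz_" if ftz else "", ty)
    else:
        name = "__cuda_sm20_div_%s" % ty      # integer variants carry no rounding mode
    buf.append("call.uni (%s), %s, (%s, %s);" % (dst, name, a, b))
    return "\n".join(buf)
```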

For the tanh handler (sub_505B00, 121 lines), the dispatch is simpler:

  • Type 56: emits a short-form call to a hardware-supported path
  • Type 54: emits a multi-operand call querying register info via sub_70B8E0 and sub_70B710
  • Default: emits a generic call with 6 operand parameters

Type Codes in Math Handlers

The operand type returned by sub_70CA60(instr, 0) maps to PTX data types:

| Code | PTX Type | Used By |
|---|---|---|
| 54 | .s16 / .s32 | Integer div, rem |
| 56 | .u16 / .u32 | Integer div, rem, tanh variant |
| 58 | .f32 | Float div, rcp, sqrt, rsqrt, ex2, lg2, tanh |
| 59 | .f64 | Double div, rcp, sqrt, rsqrt |

Registered Math Intrinsics (IDs 0x3D--0x86)

The master registration function sub_5D1660 registers 70 math helper functions with IDs 0x3D through 0x82, plus 4 sm_3x-optimized division variants at 0x83--0x86. These are the __cuda_sm20_* functions whose PTX prototypes are emitted by the prototype generator sub_5FF700.

Division Intrinsics (23 entries)

| ID | Name | Category |
|---|---|---|
| 0x41 | __cuda_sm20_div_rd_f32 | FP32 div, round-down |
| 0x42 | __cuda_sm20_div_rd_f64_v2 | FP64 div, round-down |
| 0x43 | __cuda_sm20_div_rd_ftz_f32 | FP32 div, round-down, flush-to-zero |
| 0x44 | __cuda_sm20_div_rn_f32 | FP32 div, round-to-nearest |
| 0x45 | __cuda_sm20_div_rn_f64_fast | FP64 div, round-to-nearest (fast path) |
| 0x46 | __cuda_sm20_div_rn_f64_full | FP64 div, round-to-nearest (full IEEE) |
| 0x47 | __cuda_sm20_div_rn_ftz_f32 | FP32 div, round-to-nearest, FTZ |
| 0x48 | __cuda_sm20_div_rn_ftz_f32_slowpath | FP32 div RN FTZ (denormal handler) |
| 0x49 | __cuda_sm20_div_rn_noftz_f32_slowpath | FP32 div RN no-FTZ (denormal handler) |
| 0x4A | __cuda_sm20_div_ru_f32 | FP32 div, round-up |
| 0x4B | __cuda_sm20_div_ru_f64_v2 | FP64 div, round-up |
| 0x4C | __cuda_sm20_div_ru_ftz_f32 | FP32 div, round-up, FTZ |
| 0x4D | __cuda_sm20_div_rz_f32 | FP32 div, round-toward-zero |
| 0x4E | __cuda_sm20_div_rz_f64_v2 | FP64 div, round-toward-zero |
| 0x4F | __cuda_sm20_div_rz_ftz_f32 | FP32 div, round-toward-zero, FTZ |
| 0x50 | __cuda_sm20_div_s16 | Signed 16-bit integer div |
| 0x51 | __cuda_sm20_div_s64 | Signed 64-bit integer div |
| 0x52 | __cuda_sm20_div_u16 | Unsigned 16-bit integer div |
| 0x53 | __cuda_sm20_div_u64 | Unsigned 64-bit integer div |
| 0x83 | __cuda_sm3x_div_rn_ftz_f32 | sm_30+ optimized FP32 div RN FTZ |
| 0x84 | __cuda_sm3x_div_rn_ftz_f32_slowpath | sm_30+ FP32 div RN FTZ slowpath |
| 0x85 | __cuda_sm3x_div_rn_noftz_f32 | sm_30+ optimized FP32 div RN |
| 0x86 | __cuda_sm3x_div_rn_noftz_f32_slowpath | sm_30+ FP32 div RN slowpath |

Reciprocal Intrinsics (21 entries)

| ID | Name | Category |
|---|---|---|
| 0x40 | __cuda_sm20_dblrcp_rn_slowpath_v3 | FP64 reciprocal slowpath |
| 0x5B | __cuda_sm20_rcp_f64_v3 | FP64 reciprocal (default rounding) |
| 0x5C | __cuda_sm20_rcp_rd_f32 | FP32 rcp, round-down |
| 0x5D | __cuda_sm20_rcp_rd_f32_slowpath | FP32 rcp RD slowpath |
| 0x5E | __cuda_sm20_rcp_rd_f64 | FP64 rcp, round-down |
| 0x5F | __cuda_sm20_rcp_rd_ftz_f32 | FP32 rcp RD FTZ |
| 0x60 | __cuda_sm20_rcp_rd_ftz_f32_slowpath | FP32 rcp RD FTZ slowpath |
| 0x61 | __cuda_sm20_rcp_rn_f32 | FP32 rcp, round-to-nearest |
| 0x62 | __cuda_sm20_rcp_rn_f32_slowpath | FP32 rcp RN slowpath |
| 0x63 | __cuda_sm20_rcp_rn_ftz_f32 | FP32 rcp RN FTZ |
| 0x64 | __cuda_sm20_rcp_rn_ftz_f32_slowpath | FP32 rcp RN FTZ slowpath |
| 0x65 | __cuda_sm20_rcp_ru_f32 | FP32 rcp, round-up |
| 0x66 | __cuda_sm20_rcp_ru_f32_slowpath | FP32 rcp RU slowpath |
| 0x67 | __cuda_sm20_rcp_ru_f64 | FP64 rcp, round-up |
| 0x68 | __cuda_sm20_rcp_ru_ftz_f32 | FP32 rcp RU FTZ |
| 0x69 | __cuda_sm20_rcp_ru_ftz_f32_slowpath | FP32 rcp RU FTZ slowpath |
| 0x6A | __cuda_sm20_rcp_rz_f32 | FP32 rcp, round-toward-zero |
| 0x6B | __cuda_sm20_rcp_rz_f32_slowpath | FP32 rcp RZ slowpath |
| 0x6C | __cuda_sm20_rcp_rz_f64 | FP64 rcp, round-toward-zero |
| 0x6D | __cuda_sm20_rcp_rz_ftz_f32 | FP32 rcp RZ FTZ |
| 0x6E | __cuda_sm20_rcp_rz_ftz_f32_slowpath | FP32 rcp RZ FTZ slowpath |

Square Root Intrinsics (21 entries)

| ID | Name | Category |
|---|---|---|
| 0x56 | __cuda_sm20_dsqrt_rd_f64 | FP64 sqrt, round-down |
| 0x57 | __cuda_sm20_dsqrt_rn_f64_mediumpath_v1 | FP64 sqrt RN (medium-complexity path) |
| 0x58 | __cuda_sm20_dsqrt_rn_f64_v3 | FP64 sqrt, round-to-nearest |
| 0x59 | __cuda_sm20_dsqrt_ru_f64 | FP64 sqrt, round-up |
| 0x5A | __cuda_sm20_dsqrt_rz_f64 | FP64 sqrt, round-toward-zero |
| 0x73 | __cuda_sm20_sqrt_rd_f32 | FP32 sqrt, round-down |
| 0x74 | __cuda_sm20_sqrt_rd_f32_slowpath | FP32 sqrt RD slowpath |
| 0x75 | __cuda_sm20_sqrt_rd_ftz_f32 | FP32 sqrt RD FTZ |
| 0x76 | __cuda_sm20_sqrt_rd_ftz_f32_slowpath | FP32 sqrt RD FTZ slowpath |
| 0x77 | __cuda_sm20_sqrt_rn_f32 | FP32 sqrt, round-to-nearest |
| 0x78 | __cuda_sm20_sqrt_rn_f32_slowpath | FP32 sqrt RN slowpath |
| 0x79 | __cuda_sm20_sqrt_rn_ftz_f32 | FP32 sqrt RN FTZ |
| 0x7A | __cuda_sm20_sqrt_rn_ftz_f32_slowpath | FP32 sqrt RN FTZ slowpath |
| 0x7B | __cuda_sm20_sqrt_ru_f32 | FP32 sqrt, round-up |
| 0x7C | __cuda_sm20_sqrt_ru_f32_slowpath | FP32 sqrt RU slowpath |
| 0x7D | __cuda_sm20_sqrt_ru_ftz_f32 | FP32 sqrt RU FTZ |
| 0x7E | __cuda_sm20_sqrt_ru_ftz_f32_slowpath | FP32 sqrt RU FTZ slowpath |
| 0x7F | __cuda_sm20_sqrt_rz_f32 | FP32 sqrt, round-toward-zero |
| 0x80 | __cuda_sm20_sqrt_rz_f32_slowpath | FP32 sqrt RZ slowpath |
| 0x81 | __cuda_sm20_sqrt_rz_ftz_f32 | FP32 sqrt RZ FTZ |
| 0x82 | __cuda_sm20_sqrt_rz_ftz_f32_slowpath | FP32 sqrt RZ FTZ slowpath |

Reciprocal Square Root Intrinsics (2 entries)

| ID | Name | Category |
|---|---|---|
| 0x54 | __cuda_sm20_drsqrt_f64_slowpath_v2 | FP64 rsqrt slowpath |
| 0x55 | __cuda_sm20_drsqrt_f64_v2 | FP64 rsqrt default |

Remainder Intrinsics (4 entries)

| ID | Name | Category |
|---|---|---|
| 0x6F | __cuda_sm20_rem_s16 | Signed 16-bit remainder |
| 0x70 | __cuda_sm20_rem_s64 | Signed 64-bit remainder |
| 0x71 | __cuda_sm20_rem_u16 | Unsigned 16-bit remainder |
| 0x72 | __cuda_sm20_rem_u64 | Unsigned 64-bit remainder |

Bit-Field Intrinsics (3 entries)

| ID | Name | Category |
|---|---|---|
| 0x3D | __cuda_sm20_bfe_s64_ | 64-bit signed bit-field extract |
| 0x3E | __cuda_sm20_bfe_u64_ | 64-bit unsigned bit-field extract |
| 0x3F | __cuda_sm20_bfi_u64_ | 64-bit unsigned bit-field insert |

Naming Conventions

The intrinsic name encodes the complete variant specification:

__cuda_sm{gen}_{op}_{rounding}_{ftz}_{type}_{suffix}

| Component | Values | Meaning |
|---|---|---|
| sm{gen} | sm20, sm3x | Minimum SM architecture |
| {op} | div, rcp, sqrt, dsqrt, drsqrt, rem, bfe, bfi | Mathematical operation |
| {rounding} | rn, rd, ru, rz | IEEE 754 rounding mode |
| {ftz} | ftz, noftz | Flush-to-zero denormal behavior |
| {type} | f32, f64, s16, s64, u16, u64 | Operand data type |
| {suffix} | slowpath, mediumpath, full, fast, v2, v3 | Implementation variant |

The slowpath suffix indicates a handler for denormalized inputs or edge cases (NaN, infinity, zero) that the fast path branches around. The v2/v3 suffixes mark implementation revisions; successive versions may use different Newton-Raphson step counts or other algorithmic improvements.
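Under the assumption that components always appear in the order listed above, the convention can be checked mechanically with a regular expression. The regex below is illustrative, not recovered from the binary; dblrcp is included since it appears in the reciprocal table.

```python
# Decompose a __cuda_sm* helper name into its convention components.
import re

PATTERN = re.compile(
    r"^__cuda_(sm20|sm3x)_"                                # minimum SM architecture
    r"(div|rcp|sqrt|dsqrt|drsqrt|rem|bfe|bfi|dblrcp)"      # operation
    r"(?:_(rn|rd|ru|rz))?"                                 # rounding mode
    r"(?:_(ftz|noftz))?"                                   # denormal behavior
    r"(?:_(f32|f64|s16|s64|u16|u64))?"                     # operand type
    r"(?:_(.+))?$"                                         # slowpath / v2 / v3 ...
)

def decompose(name):
    m = PATTERN.match(name)
    return m.groups() if m else None
```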

Prototype Format

The prototype generator sub_5FF700 (354 KB) emits .weak .func PTX declarations for every registered intrinsic. Example prototypes:

.weak .func (.reg .s32 %d) __cuda_sm20_div_s16
    (.reg .s32 %a0, .reg .s32 %a1)

.weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64
    (.reg .u64 %rda1, .reg .u64 %rda2)

.weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32
    (.reg .f32 %fa1, .reg .f32 %fa2)

.weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full
    (.reg .f64 %fda1, .reg .f64 %fda2)

The .weak linkage allows user-provided implementations to override the built-in versions at link time.

Newton-Raphson Refinement Templates

For FP64 operations, ptxas emits multi-instruction SASS sequences inline rather than calling helper functions. These sequences are generated by the template subsystem at 0x1700000--0x172A090 (36 functions, ~180 KB). The templates use MUFU hardware as the initial seed and iterate Newton-Raphson to achieve full FP64 precision. See Newton-Raphson & Math Templates for complete details.

Template Hierarchy

sub_AED3C0 (Master Lowering Dispatcher, 28 KB)
  |
  +-- sub_170E8B0 (DDIV handler)        -- FP64 division
  |     +-- sub_170E260 (coordinator)    -- 298 vregs, 6 sub-expanders
  |
  +-- sub_1718D60 (DRCP/DSQRT handler)  -- FP64 reciprocal / square root
  |     +-- sub_1718790 (coordinator)    -- 289 vregs, 7 sub-expanders
  |
  +-- sub_17276C0 (DRSQRT handler)      -- FP64 reciprocal square root
  |     +-- sub_1720D60 (coordinator A) -- 247 vregs, 5 sub-expanders
  |     +-- sub_1727130 (coordinator B) -- 59 vregs, integer div/mod path
  |
  +-- sub_1704070 (Inline DDIV handler) -- Register-pressure variants

FP64 Division (DDIV)

Algorithm for a / b:

  1. Extract the high 32 bits of the FP64 divisor b
  2. Convert to FP32 and compute MUFU.RCP -- ~23-bit seed for 1/b
  3. Newton-Raphson iteration 1: x1 = x0 * (2 - b * x0) via DFMA -- ~46 bits
  4. Newton-Raphson iteration 2 (partial): guard bits for correct rounding
  5. Compute a * (1/b) using the refined reciprocal
  6. Apply IEEE 754 rounding, handle overflow/underflow/NaN
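Steps 2-5 can be sketched numerically as below. Python floats are FP64 throughout, so this models the convergence of the refinement, not the exact SASS rounding behavior; the FP32 round-trip stands in for the MUFU.RCP seed precision, and the final IEEE rounding/special-case handling of step 6 is omitted.

```python
# Numerical sketch of the DDIV recipe: FP32-precision reciprocal seed,
# then two Newton-Raphson refinement steps.
import struct

def fp32(x):
    # Round an FP64 value to FP32, mimicking the seed precision.
    return struct.unpack("f", struct.pack("f", x))[0]

def ddiv(a, b):
    x = fp32(1.0 / fp32(b))        # ~23-bit seed (stands in for MUFU.RCP)
    for _ in range(2):             # each NR step roughly doubles the precision
        x = x * (2.0 - b * x)      # emitted as DFMA pairs in SASS
    return a * x                   # a * (1/b); final IEEE rounding omitted
```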

The complete DDIV template emits ~100--120 SASS instructions across 3 named code sections (__ori_template_DDIV1, __ori_template_DDIV2, __ori_template_DDIV3), using 298 virtual registers. Three register-pressure variants are available:

| Register Limit | Handler | Strategy |
|---|---|---|
| > 20,479 | sub_1702990 | Full unrolled, maximum ILP |
| > 16,383 | sub_1701F10 | Partially spilled |
| <= 16,383 | sub_1701860 | Minimal-register, more instructions |

FP64 Reciprocal (DRCP)

Algorithm for 1/b:

  1. MUFU.RCP(float32(b)) -- ~23-bit seed
  2. Newton-Raphson iteration 1: x1 = x0 * (2 - b * x0) via DFMA
  3. Newton-Raphson iteration 2: doubles precision to ~52+ bits
  4. Final rounding to FP64 precision

Implemented by sub_1718D60 (coordinator at sub_1718790, 289 vregs, 7 sub-expanders: sub_170ED40 through sub_1717470).

FP64 Square Root (DSQRT)

Algorithm for sqrt(a):

  1. MUFU.RSQ(float32(a)) -- ~23-bit seed for 1/sqrt(a)
  2. Newton-Raphson refinement: y1 = y0 * (3 - a * y0^2) / 2
  3. Compute sqrt(a) = a * (1/sqrt(a))
  4. Apply IEEE 754 rounding
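The DSQRT recipe can be sketched the same way as DDIV: seed 1/sqrt(a) at FP32 precision, refine with the rsqrt Newton-Raphson step, then multiply by a. Again, FP64 Python arithmetic models the convergence only, not exact SASS rounding.

```python
# Numerical sketch of the DSQRT recipe above.
import struct, math

def fp32(x):
    return struct.unpack("f", struct.pack("f", x))[0]

def dsqrt(a):
    y = fp32(1.0 / math.sqrt(fp32(a)))   # stands in for MUFU.RSQ
    for _ in range(2):
        y = y * (3.0 - a * y * y) * 0.5  # y1 = y0 * (3 - a*y0^2) / 2
    return a * y                          # sqrt(a) = a * (1/sqrt(a))
```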

Shares the coordinator with DRCP (sub_1718790), selecting the DSQRT sub-expanders (sub_1715910, sub_1717470) based on the original PTX operation.

FP64 Reciprocal Square Root (DRSQRT)

The most complex template handler (sub_17276C0). Dispatches based on a hardware capability flag at *(*(ctx+1584)+1037) & 1:

  • Flag set (sm_80+ with enhanced SFU): sub_1727130 -- 59 vregs, fewer refinement iterations due to MUFU.RSQ64H providing better initial precision
  • Flag clear (older architectures): sub_1720D60 -- 247 vregs, full Newton-Raphson with 5 sub-expanders

Integer Division via MUFU

Integer division by variable values also uses MUFU.RCP as a starting point. The algorithm for unsigned 32-bit a / b:

I2F(b) -> MUFU.RCP -> F2I -> IMAD.HI -> correction

Specifically:

  1. float_b = I2F(b) -- convert divisor to FP32
  2. rcp = MUFU.RCP(float_b) -- ~23-bit reciprocal approximation
  3. int_rcp = F2I(rcp) -- convert back to integer
  4. q_est = IMAD.HI(a, int_rcp, 0) -- estimated quotient
  5. r_est = IMAD(q_est, -b, a) -- estimated remainder
  6. Correction: up to 2 iterations of if (r_est >= b) q_est++; r_est -= b

The correction steps (at most 2) are needed because MUFU.RCP is accurate to within 2 ULP. This sequence emits ~50 SASS instructions for 32-bit (sub_1724A20, 28 KB decompiled) and ~80 for 64-bit unsigned (sub_1728930, 16.5 KB).
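The six steps above can be sketched as follows. Python's float is FP64, so the reciprocal here is far more accurate than MUFU.RCP's FP32 seed; the correction loop is kept to show the structural pattern of the emitted sequence. Function and variable names are illustrative.

```python
# Sketch of the unsigned integer division recipe: reciprocal estimate,
# quotient/remainder estimate, then up to two correction steps.
def udiv(a, b):
    rcp = 1.0 / float(b)       # stands in for I2F + MUFU.RCP
    q = int(a * rcp)           # quotient estimate (F2I + IMAD.HI role)
    r = a - q * b              # remainder estimate (IMAD with -b)
    for _ in range(2):         # at most 2 corrections needed
        if r >= b:
            q += 1
            r -= b
    return q, r
```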

FP32 Math Paths

Approximate vs Full-Range

For FP32 operations, the codegen handler selects between:

  1. Single MUFU -- for .approx modifier. One instruction, ~23-bit precision.
  2. MUFU + correction -- for .rn/.rd/.ru/.rz with FTZ. MUFU seed plus 1--2 FMA correction steps, inline.
  3. Helper function call -- for directed rounding modes (RD/RU/RZ) without FTZ, or when denormal handling is required (slowpath variants). Calls to __cuda_sm20_* or __cuda_sm3x_* functions.

Flush-to-Zero (FTZ)

The .ftz modifier on FP32 operations flushes denormalized inputs and outputs to zero, which simplifies the math sequence:

  • Eliminates denormal input handling branches
  • Eliminates denormal output rounding logic
  • Allows a shorter inline sequence instead of a function call

Each FP32 math intrinsic exists in both FTZ and non-FTZ variants (e.g., __cuda_sm20_rcp_rn_ftz_f32 vs __cuda_sm20_rcp_rn_f32), and many also have a slowpath variant for edge cases.
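Flush-to-zero semantics can be stated precisely: any input or output whose magnitude falls below the smallest normal FP32 value (2^-126) is replaced by a signed zero. The threshold is an IEEE 754 fact; the helper functions below are illustrative, not ptxas code.

```python
# Sketch of FP32 flush-to-zero semantics.
import math

FP32_MIN_NORMAL = 2.0 ** -126

def ftz(x):
    # Flush denormal magnitudes to a zero of the same sign.
    if x != 0.0 and abs(x) < FP32_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x

def mul_ftz(a, b):
    # Flush denormal inputs, multiply, flush a denormal result.
    return ftz(ftz(a) * ftz(b))
```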

sm_3x Optimized Division

Four additional division intrinsics at IDs 0x83--0x86 provide sm_30+ optimized paths for FP32 round-to-nearest division. The __cuda_sm3x_div_rn_ftz_f32 and __cuda_sm3x_div_rn_noftz_f32 variants (plus their slowpath counterparts) take advantage of Kepler+ hardware improvements to produce shorter instruction sequences than the sm_20 versions.

FP16 Math Handling

FP16 (half) math operations do not use MUFU directly. Instead, ptxas:

  1. Promotes FP16 inputs to FP32 via H2F (half-to-float conversion)
  2. Performs the FP32 MUFU operation
  3. Converts the result back to FP16 via F2H (float-to-half conversion)
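The promote/operate/demote pattern can be sketched with Python's struct module, whose IEEE binary16 format code ('e') stands in here for the hardware H2F/F2H conversions; the reciprocal example and all function names are illustrative.

```python
# Sketch of the FP16 promote -> FP32 operate -> FP16 demote pattern.
import struct

def h2f(bits16):
    # H2F: reinterpret 16 raw bits as binary16, promote to a wider float.
    return struct.unpack("<e", struct.pack("<H", bits16))[0]

def f2h(value):
    # F2H: demote to binary16, return the raw 16-bit encoding.
    return struct.unpack("<H", struct.pack("<e", value))[0]

def half_rcp_approx(bits16):
    x = h2f(bits16)     # promote
    y = 1.0 / x         # stands in for the FP32 MUFU.RCP
    return f2h(y)       # demote
```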

For HMMA (half-precision matrix multiply-accumulate) operations, the tensor core path is used instead -- see Tensor Core Intrinsics.

The HADD2, HMUL2, HFMA2 instructions operate on packed FP16x2 values and are separate from the MUFU path. These are direct hardware instructions dispatched to the ALU pipe, not the SFU.

Codegen Handler Deep Dive

sub_5B76D0 -- Division Codegen (64 KB)

The largest math codegen handler at 1,466 decompiled lines. Its dispatch tree:

sub_70CA60(instr, 0) -> operand type
  |
  +-- type 58 (f32)
  |     +-- sub_707BC0(instr) -> rounding mode check
  |     |     +-- mode 1 -> short-form call (approx)
  |     |     +-- mode > 39 -> full Newton-Raphson inline sequence
  |     |     +-- else -> helper function call
  |     +-- sub_70B820(instr) -> precision modifier
  |           +-- <= 39 -> 3-operand compact call
  |           +-- > 39 -> multi-segment inline expansion
  |
  +-- type 59 (f64)
  |     +-- full/fast path selection
  |     +-- rounding mode -> specific __cuda_sm20_div_r{n,d,u,z}_f64 call
  |
  +-- type 54 (s16/s32)
  |     +-- __cuda_sm20_div_s{16,64} call
  |
  +-- type 56 (u16/u32)
        +-- __cuda_sm20_div_u{16,64} call

The FP32 path at rounding mode > 39 generates a multi-segment inline PTX sequence with ~20 sprintf() calls, each appending a PTX instruction to the output buffer. This is the full-range IEEE-compliant FP32 division path that uses MUFU.RCP as a seed followed by FMA-based correction.

sub_5B0CD0 -- Reciprocal Codegen (44 KB)

Similar structure to the division handler. Dispatches by type (f32/f64) and rounding mode. For FP64, calls __cuda_sm20_rcp_f64_v3. For FP32, selects between 4 rounding modes x 2 FTZ variants x 2 paths (fast/slowpath) = up to 16 different intrinsic calls.

sub_5B4040 -- Square Root Codegen (49 KB)

Handles both FP32 (__cuda_sm20_sqrt_*) and FP64 (__cuda_sm20_dsqrt_*) variants. For FP64, the dsqrt_rn_f64_mediumpath_v1 variant provides an intermediate-complexity path between the fast approximation and the full Newton-Raphson template.

sub_583190 -- Base-2 Exponential (ex2)

Dispatches by operand type:

  • FP32 with mode 1: short-form approximate path (single MUFU.EX2)
  • FP32 with rounding mode > 39: full-range inline sequence with ~18 sprintf() segments generating a PTX sequence that includes range reduction, MUFU.EX2, and polynomial correction
  • FP64: multi-operand call to a helper function

sub_57BFC0 -- Reciprocal Square Root (rsqrt)

Dispatches by type:

  • FP64 with mode 1: short-form call to __cuda_sm20_drsqrt_f64_v2
  • FP64 with rounding mode > 39: full inline sequence with ~35 sprintf() segments -- the longest inline math expansion for any single math operation. The sequence implements range reduction, MUFU.RSQ, Newton-Raphson correction, and renormalization
  • FP32: MUFU.RSQ for approximate, helper call for IEEE-compliant

Scheduling and Latency

MUFU instructions are scheduled on the SFU (Special Function Unit), which is functional unit index 8 in the ptxas latency model. Key scheduling properties:

| Property | Value |
|---|---|
| Functional unit | SFU (index 8) |
| Issue latency | 1 cycle (can issue every cycle) |
| Result latency | ~4 cycles (pipeline depth) |
| Throughput | 1 per 4 cycles per SM partition (16 per SM for 4 partitions) |
| Dual-issue | Cannot dual-issue with ALU on same warp |

The scheduler (sub_815820) places MUFU instructions to maximize overlap with ALU operations from other warps. The Newton-Raphson sequences interleave MUFU, DFMA, IMAD, and MOV instructions to hide the SFU pipeline latency behind ALU computation.

Fast-Math vs IEEE-Compliant Summary

| PTX Operation | Fast-Math (-use_fast_math) | IEEE-Compliant |
|---|---|---|
| div.f32 | MUFU.RCP + FMUL (2 instr) | __cuda_sm20_div_rn_f32 call (~15 instr) |
| div.f64 | N/A (no FP64 fast-math) | DDIV template (~100--120 instr) |
| rcp.f32 | MUFU.RCP (1 instr) | __cuda_sm20_rcp_rn_f32 call (~10 instr) |
| rcp.f64 | N/A | DRCP template (~90 instr) |
| sqrt.f32 | MUFU.RSQ + FMUL (2 instr) | __cuda_sm20_sqrt_rn_f32 call (~12 instr) |
| sqrt.f64 | N/A | DSQRT template (~80 instr) |
| rsqrt.f32 | MUFU.RSQ (1 instr) | __cuda_sm20_drsqrt_f64_v2 (for f64) |
| sin.f32 | MUFU.SIN (1 instr) | Range reduction + MUFU.SIN + correction |
| cos.f32 | MUFU.COS (1 instr) | Range reduction + MUFU.COS + correction |
| ex2.f32 | MUFU.EX2 (1 instr) | Range reduction + MUFU.EX2 + correction |
| lg2.f32 | MUFU.LG2 (1 instr) | Range reduction + MUFU.LG2 + correction |

Key Function Reference

| Address | Size | Identity |
|---|---|---|
| sub_5B76D0 | 64 KB | div codegen handler -- dispatches all division variants |
| sub_5B0CD0 | 44 KB | rcp codegen handler -- reciprocal for f32/f64 |
| sub_5B4040 | 49 KB | sqrt codegen handler -- square root for f32/f64 |
| sub_57BFC0 | ~10 KB | rsqrt codegen handler -- reciprocal square root |
| sub_583190 | ~14 KB | ex2 codegen handler -- base-2 exponential |
| sub_52A5C0 | ~5 KB | lg2 codegen handler -- base-2 logarithm |
| sub_505B00 | ~5 KB | tanh codegen handler -- hyperbolic tangent |
| sub_573860 | ~7 KB | div.full codegen handler -- full-range FP32 division |
| sub_589810 | ~13 KB | rem codegen handler -- integer remainder |
| sub_5D1660 | 46 KB | Master intrinsic registration (608 entries) |
| sub_5FF700 | 354 KB | Prototype generator (PTX .weak .func declarations) |
| sub_80E9B0 | ~1.5 KB | LowerSpecialFunctions -- MUFU emission pass |
| sub_170E8B0 | -- | DDIV top-level handler |
| sub_1718D60 | 790 B | DRCP/DSQRT coordinator wrapper |
| sub_17276C0 | 1,011 B | DRSQRT coordinator wrapper |
| sub_1704070 | 263 B | DDIV register-pressure dispatcher |
| sub_1724A20 | 28 KB | 32-bit integer division via MUFU.RCP |
| sub_1728930 | 16.5 KB | 64-bit unsigned integer division |
| sub_1727AC0 | 13.8 KB | 64-bit signed integer division |
| sub_AED3C0 | 28 KB | Master lowering dispatcher (invokes all templates) |
| sub_10C0170 | ~5 KB | MUFU Mercury encoding function (sm_100+) |

Tensor Core Intrinsics


ptxas supports five generations of tensor core operations spanning SM 70 through SM 100. The binary contains three major codegen handlers -- sub_5C7A50 (173KB, WMMA), sub_5C10A0 (120KB, MMA), and sub_5BBC30 (90KB, tcgen05.mma) -- plus four WGMMA handlers, eleven tcgen05 instruction handlers, and ~400 numeric MMA hash table entries. Together these constitute ~500KB of code generation logic dedicated to tensor core instructions, making this the single largest functional subsystem in ptxas.

| Component | Details |
|---|---|
| WMMA codegen | sub_5C7A50 (173KB) -- wmma.mma instruction code generation |
| MMA codegen | sub_5C10A0 (120KB) -- mma.sync instruction code generation |
| TCGen05 MMA codegen | sub_5BBC30 (90KB) -- tcgen05.mma instruction code generation |
| WMMA load/store | sub_5A0EA0 (7.8KB), sub_5A8E40 (9.8KB), sub_5A6BD0 (8.8KB), sub_5A2D10 (8.1KB) |
| WGMMA handlers | sub_50AC70 (1.3KB), sub_4DA380 (295B), sub_4DA4B0 (295B), sub_4DA5E0 (311B) |
| MMA validators | sub_4C2FD0 (12.2KB), sub_49BBA0 (11.4KB), sub_4BFED0 (10.3KB) |
| Numeric MMA hash | ~400 entries at compilation context offset a1+816 |
| Prototype generator | sub_5FF700 (354KB) -- generates .weak .func PTX declarations |
| SASS MMA encoders | sub_6D4350, sub_6D7AF0, sub_6D5CB0, sub_6D69B0 |

Tensor Core Generations

ptxas tracks five distinct tensor core generations, each introducing new SASS opcodes, data types, and matrix shapes. The internal scheduling counters (visible in DUMPIR statistics output) reveal the SASS-level taxonomy.

| Gen | SM | SASS Opcodes | PTX API | Key Addition |
|---|---|---|---|---|
| 1st | 75 (Turing) | HMMA (m16n8k8, m8n8k16) | wmma.mma | FP16 tensor cores; INT8/INT4/B1 WMMA |
| 2nd | 80 (Ampere) | HMMA (m16n8k16), IMMA, DMMA, BMMA | wmma.mma, mma.sync | BF16, TF32, FP64, structured sparsity |
| 3rd | 89 (Ada) | HMMA (extended) | mma.sync | FP8 (E4M3, E5M2), block-scale MMA |
| 4th | 90 (Hopper) | WGMMA | wgmma.mma_async | Warpgroup MMA, async pipeline, sub-byte sparse |
| 5th | 100 (Blackwell) | UTCHMMA, UTCIMMA, tcmma | tcgen05.mma, tcgen05.mma.ws | Tensor memory (TMEM), warp-shared, block-scale |

Scheduling Unit Names

The binary's statistics printer functions (clones at 0x700-byte intervals from sub_ABBA50) emit per-unit throughput counters using the internal SASS operation classification:

| Counter | SASS Operation | Matrix Shape | Description |
|---|---|---|---|
| hmma1688 | HMMA m16n8k8 | 16x8x8 | 1st-gen FP16 MMA (Turing) |
| hmma1688f16 | HMMA m16n8k8 (f16 accum) | 16x8x8 | FP16 accumulation variant |
| hmma16816 | HMMA m16n8k16 | 16x8x16 | 2nd-gen FP16 MMA (Ampere+) |
| hmma16816f16 | HMMA m16n8k16 (f16 accum) | 16x8x16 | FP16 accumulation variant |
| hmmaSp1688 | HMMA.SP m16n8k8 | 16x8x(8*2) | Sparse FP16 (2:4 sparsity) |
| hmmaSp1688f16 | HMMA.SP m16n8k8 (f16 accum) | 16x8x(8*2) | Sparse FP16, f16 accum |
| imma16816 | IMMA m16n8k16 | 16x8x16 | Integer MMA (INT8) |
| imma16832 | IMMA m16n8k32 | 16x8x32 | Integer MMA (INT4/sub-byte) |
| immaSp8832 | IMMA.SP m8n8k32 | 8x8x(32*2) | Sparse integer (m8 variant) |
| immaSp16832 | IMMA.SP m16n8k32 | 16x8x(32*2) | Sparse integer (m16 variant) |
| dmma | DMMA | 8x8x4 | FP64 tensor MMA |
| fma64 | FMA64 | -- | FP64 FMA (non-tensor) |

Format strings from the binary:

# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [est hmma16816=%d] [est hmma16816f16=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f]

These counters are emitted by the post-scheduling statistics pass. The scheduler treats each counter as a separate functional unit throughput class, with dedicated latency table entries:

| Ori Opcode | Latency Class | Description |
|---|---|---|
| 0x144 | 600 | Tensor fence |
| 0x145--0x146 | 759 | HMMA/BMMA tensor core operations |
| 0x147--0x148 | 757 or 761 | Narrow/wide DP tensor operations |
| 0x149 | 604 | Tensor sync |

PTX-Level Instruction Lowering

WMMA Instructions

The WMMA (Warp Matrix Multiply-Accumulate) API is the oldest tensor core interface. Five PTX instructions are registered in the opcode dispatch table at sub_5D4190:

| PTX Instruction | Codegen Handler | Size | Purpose |
|---|---|---|---|
| wmma.load.a | sub_5A0EA0 | 7,779B | Load matrix fragment A |
| wmma.load.b | sub_5A8E40 | 9,757B | Load matrix fragment B |
| wmma.load.c | sub_5A6BD0 | 8,813B | Load accumulator fragment C |
| wmma.store.d | sub_5A2D10 | 8,074B | Store result fragment D |
| wmma.mma | sub_5C7A50 | 173KB | Matrix multiply-accumulate |

All five handlers allocate a 50,000-byte code generation buffer. The load/store handlers are ~8--10KB each and cover the combinatorial product of shape, layout, data type, and address space. The wmma.mma handler at 173KB is the largest single codegen handler in ptxas.

Instruction property accessors used by WMMA codegen:

| Accessor | Purpose |
|---|---|
| sub_7075C0 | Get instruction flag A (layout/type encoding) |
| sub_707BC0 | Get instruction flag B (variant/mode encoding) |
| sub_7075E0 | Get layout string (row/col) |
| sub_707BE0 | Get shape string (m16n16k16 etc.) |
| sub_70A810 | Get scale string (satfinite etc.) |

WMMA Shapes and Types

sm_70 (Volta/Turing) -- 1st generation:

| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m32n8k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m8n32k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |

sm_72 (Turing) -- integer WMMA extension:

| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | s8, u8 | s32 | 2 x b32 | 2 x b32 | 8 x b32 |
| m32n8k16 | s8, u8 | s32 | 4 x b32 | 1 x b32 | 8 x b32 |
| m8n32k16 | s8, u8 | s32 | 1 x b32 | 4 x b32 | 8 x b32 |
| m8n8k32 | s4, u4 (sub-byte) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |
| m8n8k128 | b1 (1-bit) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |

sm_80 (Ampere) -- 2nd generation extensions:

| Shape | Data Types | Accumulator | Notes |
|---|---|---|---|
| m16n16k16 | bf16 | f32 | New BF16 support, 4 x b32 per fragment |
| m16n16k8 | tf32 | f32 | New TF32 support, 4 x b32 per fragment |
| m8n8k4 | f64 | f64 | Double-precision MMA, 1 x f64 per A/B, 2 x f64 per C/D |

All WMMA shapes support three address spaces (generic, global, shared) and optional descriptor-based addressing (_desc variants). The binary contains separate intrinsic registrations for each combination, with the full prototype available in the string table.

MMA (mma.sync) Instructions

The mma.sync API uses a single PTX opcode mma dispatched to sub_5C10A0 (120KB). Unlike WMMA, it uses asymmetric matrix shapes with M and N decoupled from K, and operates at a single-warp granularity.

The numeric MMA hash table at a1+816 collapses the full variant space (shape + type + layout) into ~400 hash entries. Each entry maps a numeric string key (e.g., "2644314910") to a specific codegen handler function pointer, avoiding multi-dimensional dispatch.
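A minimal model of this flat-dispatch design, with an invented key function and handler body (the real key derivation is whatever ptxas hashes internally):

```python
# Model of the numeric MMA dispatch table: one flat string-keyed map instead of
# nested dispatch over (shape, type, layout).
def make_key(shape, dtype, layout):
    # Stand-in for however ptxas derives its numeric key; a plain hash here.
    return str(hash((shape, dtype, layout)) & 0xFFFFFFFF)

handlers = {}

def register(shape, dtype, layout, fn):
    handlers[make_key(shape, dtype, layout)] = fn

def dispatch(shape, dtype, layout):
    # One lookup replaces a 3-dimensional decision tree.
    return handlers[make_key(shape, dtype, layout)]()

register('m16n8k16', 'f16', 'row_col', lambda: 'emit HMMA m16n8k16')
print(dispatch('m16n8k16', 'f16', 'row_col'))
```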

sm_80+ MMA intrinsics (IDs 0x209--0x22F):

The 39 intrinsics registered under __cuda_sm_8x_mma_* cover:

| Intrinsic Pattern | Layout | Types (D, C, A, B) | Regs |
|---|---|---|---|
| mma_row_col_f16_f16_f16_f16 | row x col | f16, f16, f16, f16 | D: 4 x b32 |
| mma_row_col_f32_f16_f16_f16 | row x col | f32, f16, f16, f16 | D: 8 x b32 |
| mma_row_col_f32_f16_f16_f32 | row x col | f32, f32, f16, f16 | D: 8 x b32, C: 8 x b32 |
| mma_col_col_* | col x col | same set | same |
| mma_row_row_* | row x row | same set | same |
| mma_col_row_* | col x row | same set | same |
| mma_shfl_f16 | -- | shuffle f16 | D: 2 x b32 |
| mma_shfl_f32 | -- | shuffle f32 | D: 4 x b32 |

Prototype examples from the binary:

.weak .func (.param .align 16 .b32 mma_dst[4])
    __cuda_sm_8x_mma_row_col_f16_f16_f16_f16
    (.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
     .reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3);

.weak .func (.param .align 16 .b32 mma_dst[8])
    __cuda_sm_8x_mma_row_col_f32_f16_f16_f32
    (.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
     .reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
     .reg .b32 c4, .reg .b32 c5, .reg .b32 c6, .reg .b32 c7);

Note the .param .align 16 .b32 mma_dst[N] return convention -- MMA results are returned through aligned parameter space, not registers, because the warp-cooperative nature of the operation means each thread holds only a fragment.

MMA Shape Summary Across Generations

| Shape | Types | SM Floor | SASS Opcode | Notes |
|---|---|---|---|---|
| m8n8k4 | f64 | 80 | DMMA | Double-precision |
| m8n8k16 | f16 | 75 | HMMA | Original Turing shape |
| m8n8k32 | s4/u4 | 75 | IMMA | Sub-byte integer |
| m8n8k128 | b1 | 75 | BMMA | 1-bit (XOR/AND pop) |
| m16n8k8 | f16 | 75 | HMMA | Asymmetric Turing shape |
| m16n8k16 | f16, bf16, s8/u8 | 80 | HMMA, IMMA | Primary Ampere shape |
| m16n8k32 | s8/u8, s4/u4 | 80/90 | IMMA | Integer, sub-byte at sm_90 |
| m16n8k64 | s4/u4 | 90 | IMMA | Hopper sub-byte extension |
| m16n8k128 | s4/u4 (sparse), b1 | 90 | IMMA, BMMA | Hopper sparse sub-byte |
| m16n8k256 | b1 | 90/100 | BMMA | Extended 1-bit MMA |

MMA Validation

Three validator functions gate MMA features by SM version:

sub_4C2FD0 (12.2KB) -- WMMA/MMA master validator:

Performs tiered SM-version checks:

  • SM 75: base WMMA (f16)
  • SM 80: extended types (BF16, TF32, FP64 -- "MMA with double types")
  • SM 90: WGMMA features
  • FP8: "mma with FP8 floating point type" (gated by sm_89+)

sub_49BBA0 (11.4KB) -- MMA type/scale validator:

Validates FP8 and block-scale configurations:

  • "mma with FP8 floating point type" -- sm_89+ gate
  • "mma with FP8 floating point type and FP16 accumulation" -- additional FP16 accum check
  • "mma with FP8 floating point type and .m16n8k16 shape" -- shape/type cross-validation
  • "Sparse mma with block scale" -- block-scale + sparsity interaction
  • .block_scale modifier validation

sub_4BFED0 (10.3KB) -- WMMA shape validator:

Validates WMMA-specific shapes and the .aligned modifier:

  • ".aligned modifier for wmma" -- alignment enforcement
  • SM 75/80 version checks for shape legality

sub_490F90 -- Integer MMA validator:

Checks integer MMA shape validity: "Integer MMA with shape " -- validates m/n/k dimensions against the SM-level capability set.

sub_494210 (2.3KB) -- Sparse GMMA validator:

Validates sparse MMA metadata: "Sparse GMMA with " -- checks 2:4 sparsity pattern encoding.

sub_495900 -- WMMA floating-point validator:

Checks: "'wmma.mma' with floating point type" -- validates FP type compatibility with the target shape.

sub_4428E0 -- FP64 MMA validator:

Validates: "mma with .f64 type" -- gates double-precision MMA on sm_80+.

sm_90+ Sub-Byte MMA Intrinsics (IDs 0x23A--0x25F)

Hopper introduces 38 sub-byte MMA intrinsics covering s4/u4 sparse operations at warp granularity. These are distinct from the WGMMA API and provide backward-compatible sub-byte operations through the classical mma.sync interface.

Dense sub-byte (m8n8k32, m16n8k32, m16n8k64):

| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m8n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k64 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |

Sparse sub-byte (m16n8k64, m16n8k128):

| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m16n8k64 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite, split _0/_1 | 16 |
| m16n8k128 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |

The _0 and _1 suffixes on sparse m16n8k64 represent the two halves of a split operation -- the K dimension is decomposed into two steps for the sparsity pattern. The sparse variants take an additional e (metadata) operand encoding the 2:4 sparsity pattern.
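A sketch of 2:4 metadata construction (the 2-bit packing order is an assumption; only the select-2-of-4 structure is taken from the text):

```python
def encode_2to4(group):
    # group: 4 K-elements with exactly 2 nonzeros (2:4 structured sparsity).
    idx = [i for i, v in enumerate(group) if v != 0]
    assert len(idx) == 2, "2:4 sparsity requires exactly 2 nonzeros per group of 4"
    vals = [group[i] for i in idx]
    # Two 2-bit position indices per group; bit packing order is illustrative.
    meta = idx[0] | (idx[1] << 2)
    return vals, meta

vals, meta = encode_2to4([0, 5, 0, 7])   # nonzeros at lanes 1 and 3
print(vals, bin(meta))
```

In hardware the packed metadata words accumulate across groups into the e operand; only the dense values and the metadata travel to the tensor core.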

Bit-operations (b1):

| Shape | Operation | SM | Intrinsic |
|---|---|---|---|
| m8n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m8n8k128 |
| m16n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k128 |
| m16n8k256 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k256 |

Prototype example (sparse m16n8k128):

.weak .func (.param .align 16 .b32 mma_dst[4])
    __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4
    (.reg .b32 a0, .reg .b32 a1, .reg .b32 a2, .reg .b32 a3,
     .reg .b32 b0, .reg .b32 b1, .reg .b32 b2, .reg .b32 b3,
     .reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
     .reg .b32 e);

The sparse m16n8k128 variant takes 4 A-operands, 4 B-operands (double-width due to sparsity encoding), 4 C-operands, and 1 metadata operand e.

sm_100 Blackwell MMA (IDs 0x230--0x239)

Blackwell introduces 10 intrinsics for extended MMA operations:

HMMA/IMMA metadata helpers (__cuda_sm_10x_*_mdata_*):

| Intrinsic | Shape | Returns | Inputs |
|---|---|---|---|
| __cuda_sm_10x_hmma_mdata_m16n8k16 | m16n8k16 | ret_dst[3] | a0, a1, e, f_temp |
| __cuda_sm_10x_hmma_mdata_m16n8k32 | m16n8k32 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |
| __cuda_sm_10x_imma_mdata_m16n8k32 | m16n8k32 | ret_dst[3] | a0, a1, e, f_temp |
| __cuda_sm_10x_imma_mdata_m16n8k64 | m16n8k64 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |

These mdata functions compute sparse metadata for Blackwell's 5th-gen tensor cores. The e parameter is the sparsity selector, f_temp is a scratch register. The return array includes the transformed A-operands plus the computed metadata word.

Bit MMA (AND + XOR for sm_100):

| Intrinsic | Shape | Operation | Regs |
|---|---|---|---|
| __cuda_sm_10x_mma_bit_internal_and_m8n8k128 | m8n8k128 | AND | D: 2, A: 1, B: 1, C: 2 |
| __cuda_sm_10x_mma_bit_internal_and_m16n8k128 | m16n8k128 | AND | D: 4, A: 2, B: 1, C: 4 |
| __cuda_sm_10x_mma_bit_internal_and_m16n8k256 | m16n8k256 | AND | D: 4, A: 4, B: 2, C: 4 |
| __cuda_sm_10x_mma_bit_internal_xor_m8n8k128 | m8n8k128 | XOR | same as AND |
| __cuda_sm_10x_mma_bit_internal_xor_m16n8k128 | m16n8k128 | XOR | same as AND |
| __cuda_sm_10x_mma_bit_internal_xor_m16n8k256 | m16n8k256 | XOR | same as AND |

Blackwell adds the AND reduction mode for 1-bit MMA (sm_90 only had XOR).

WGMMA -- Warpgroup MMA (SM 90+)

WGMMA (Warp Group Matrix Multiply-Accumulate) operates at warpgroup granularity (4 warps, 128 threads) and uses an asynchronous pipeline protocol. Four PTX instructions are registered:

| PTX Instruction | Handler | Size | Role |
|---|---|---|---|
| wgmma.mma_async | sub_50AC70 | 1,282B | Dispatch asynchronous MMA operation |
| wgmma.fence | sub_4DA380 | 295B | Open pipeline stage |
| wgmma.commit_group | sub_4DA4B0 | 295B | Close pipeline stage |
| wgmma.wait_group | sub_4DA5E0 | 311B | Wait for N committed groups |

WGMMA Pipeline Protocol

The hardware requires strict sequencing:

wgmma.fence              -- open pipeline stage
wgmma.mma_async  (1..N)  -- asynchronous MMA operations sharing accumulators
wgmma.commit_group       -- close pipeline stage
wgmma.wait_group N       -- wait for N outstanding groups to complete

Between fence and wait, strict register constraints apply:

  1. No non-WGMMA definitions of accumulator registers
  2. No non-WGMMA reads of accumulator registers
  3. No non-WGMMA definitions of WGMMA input registers (including descriptors)

Violation of any constraint triggers pipeline serialization via sub_AE47B0, which collapses the pipeline to individual fence/mma/commit/wait per operation. The serialization reason is reported through warning codes 0x1D55--0x1D5E (see GMMA Pipeline).
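The constraint check can be modeled as a scan over the fence..wait region (register sets and the serialize decision are schematic, not decompiled logic):

```python
# Model of the GMMA pipeline legality check: any non-WGMMA touch of the
# accumulator or input registers inside fence..wait forces serialization.
def needs_serialization(region_instrs, accum_regs, input_regs):
    for ins in region_instrs:
        if ins['is_wgmma']:
            continue
        if set(ins['defs']) & (accum_regs | input_regs):
            return True   # constraints 1 and 3: foreign definition
        if set(ins['uses']) & accum_regs:
            return True   # constraint 2: foreign read of accumulators
    return False

region = [
    {'is_wgmma': True,  'defs': {'R0'}, 'uses': {'R8'}},
    {'is_wgmma': False, 'defs': {'R4'}, 'uses': {'R0'}},  # reads accumulator R0
]
print(needs_serialization(region, accum_regs={'R0'}, input_regs={'R8'}))
```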

WGMMA Descriptor Format

The wgmma.mma_async handler (sub_50AC70, 1,282 bytes) encodes the operation's matrix dimensions, data types, layout, and scale factors into the instruction. The A operand can be either a register operand or a descriptor -- a 64-bit value encoding the matrix base address, leading dimension, stride, and swizzle pattern. The B operand is always descriptor-based.

The descriptor format allows the hardware to fetch matrix data directly from shared memory via the TMA (Tensor Memory Accelerator), bypassing register file involvement for the B matrix operand entirely.
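Descriptor-style packing looks like the following sketch; the bit positions are hypothetical and chosen only to illustrate how base address, leading dimension, stride, and swizzle share one 64-bit word:

```python
def pack_desc(base_addr, lead_dim, stride, swizzle):
    # Hypothetical field layout for a 64-bit matrix descriptor; the recovered
    # WGMMA bit positions are not reproduced here.
    assert base_addr % 16 == 0          # base must be 16-byte aligned
    d  = (base_addr >> 4) & 0x3FFF      # bits  0-13: base address / 16
    d |= (lead_dim & 0x3FFF) << 16      # bits 16-29: leading dimension
    d |= (stride   & 0x3FFF) << 32      # bits 32-45: stride
    d |= (swizzle  & 0x3)    << 62      # bits 62-63: swizzle mode
    return d

desc = pack_desc(0x1000, 128, 256, 1)
print(hex(desc))
```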

Ori Internal Encoding

| Constant | Value | Meaning |
|---|---|---|
| WGMMA opcode | 309 | Ori opcode for wgmma.mma_async |
| Arrive opcode (masked) | 271 | opcode & 0xFFFFCFFF for _warpgroup.arrive/wait |
| Commit opcode | 323 | Ori opcode for _warpgroup.commit_batch |
| Accum reg_type | 6 | vreg+64 value for tensor/accumulator registers |
| Accum src tag | 0x90000000 | High nibble tag for source accumulator set |
| Accum dst tag | 0x10000000 | High nibble tag for destination accumulator set |

Compiler-Inserted Warpgroup Instructions

The GMMA pipeline passes (phases 85 and 87) insert three compiler-internal pseudo-operations prefixed with underscore:

| Pseudo-Op | SASS Output | Purpose |
|---|---|---|
| _warpgroup.arrive | WARPSYNC / BAR.ARRIVE | Warpgroup synchronization (arrive) |
| _warpgroup.wait | WARPSYNC / BAR.WAIT | Warpgroup synchronization (wait) |
| _warpgroup.commit_batch | DEPBAR variant | Warpgroup dependency barrier |

These are not directly written by the programmer. The compiler inserts them to manage register ownership transfer between the warpgroup's register file and the tensor core's accumulator pipeline.

TCGen05 -- 5th Generation Tensor Cores (SM 100+)

Blackwell introduces TCGen05 (Tensor Core Generation 5), which operates through tensor memory (TMEM) -- a dedicated on-chip memory visible only to the tensor core engine, separate from the register file and shared memory. Eleven PTX instructions are registered:

| PTX Instruction | Handler | Size | Purpose |
|---|---|---|---|
| tcgen05.mma | sub_5BBC30 | 90KB | Tensor core MMA from TMEM |
| tcgen05.mma.ws | sub_58FA20 | 4,604B | Warp-shared MMA variant |
| tcgen05.ld | sub_574050 | -- | Load data into TMEM |
| tcgen05.ld.red | sub_578DB0 | -- | Load-reduce into TMEM |
| tcgen05.st | sub_571FE0 | -- | Store from TMEM |
| tcgen05.cp | sub_5427F0 | -- | Copy within TMEM |
| tcgen05.commit | sub_56C190 | -- | Commit pending operations |
| tcgen05.shift | sub_4F1A90 | -- | Shift TMEM contents |
| tcgen05.alloc | sub_569180 | -- | Allocate TMEM columns |
| tcgen05.dealloc | sub_58C7F0 | -- | Deallocate TMEM columns |
| tcgen05.relinquish_alloc_permit | sub_526370 | -- | Release allocation rights |

TCGen05 MMA Codegen

The tcgen05.mma handler (sub_5BBC30, 90KB) is the third-largest codegen handler in ptxas. It:

  1. Allocates a 50,000-byte code generation buffer
  2. Validates tcgen05 capability via sub_70FA00(*, 29)
  3. Handles standard, sparse (.sp), and warp-shared (.ws) variants
  4. Extracts sparse metadata via sub_70F0A0
  5. Generates TMEM address computation code

The tcgen05.mma.ws formatter (sub_58FA20, 4,604B, also used for tcgen05.shift) handles the warp-shared variant where multiple warps contribute to a single MMA operation.

TCGen05 SASS Encoding

At the SASS level, TCGen05 operations are encoded by four specialized Mercury encoder functions:

| Address | Handler | Purpose |
|---|---|---|
| sub_6D4350 | SASS MMA encoder | Primary MMA SASS emission |
| sub_6D7AF0 | SASS MMA encoder | Alternate MMA variant |
| sub_6D5CB0 | SASS MMA encoder | Additional MMA mode |
| sub_6D69B0 | SASS MMA encoder | Additional MMA mode |

The SASS encoder at sub_6D4350 references the tcmma operation namespace and validates block-scale configurations:

  • "tcmma_*_o must be specified with blockscale" -- output operand requires block-scale modifier
  • "uri width for tcmma_*_o must be 2" -- output URI width constraint
  • "tcmma_*_q with blockscale must have uri width of 2" -- scale factor operand constraint
  • "tcmma_*_mxq must be specified with blockscale" -- MX quantization operand constraint
  • "For UTCHMMA, #scaleU4 must be 0 in SPA 10.1." -- SM 100 vs 103 compatibility

The strings "UTCHMMA" (Unified Tensor Core HMMA) and "tcmma" (Tensor Core MMA) are the internal SASS-level names for Blackwell's tensor core operations.

TCGen05 Guardrails

Blackwell includes a debug/validation mode activated by --g-tensor-memory-access-check. When enabled, ptxas wraps TMEM operations with guardrail trap functions. Ten guardrail intrinsics are registered (IDs 0x20--0x2A):

| Intrinsic | Trap Condition |
|---|---|
| tcgen05_guardrail_trap_phase_invalid_during_alloc | TMEM allocation during invalid phase |
| tcgen05_guardrail_trap_current_warp_owner_invalid | Warp accessing TMEM it does not own |
| tcgen05_guardrail_trap_unallocated_columns_access | Access to unallocated TMEM columns |
| tcgen05_guardrail_trap_unallocated_columns_being_dealloced | Deallocation of unallocated columns |
| tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc | Dealloc of column not from alloc |
| tcgen05_guardrail_trap_allocation_granularity_invalid | Invalid allocation granularity |
| tcgen05_guardrail_trap_access_out_of_physical_bounds | Out-of-bounds TMEM access |
| tcgen05_guardrail_trap_invalid_datapath_alignment | TMEM datapath misalignment |
| tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod | Sparsity mismatch in instruction descriptor |
| tcgen05_guardrail_trap_sp_used_in_unsupported_env | Sparsity in unsupported config |

Eight guardrail PTX instructions are registered in the opcode dispatch table:

| PTX Instruction | Check |
|---|---|
| _tcgen05.guardrails.is_phase_valid | Phase validity for alloc |
| _tcgen05.guardrails.is_current_warp_valid_owner | Warp ownership |
| _tcgen05.guardrails.are_columns_allocated | Column allocation status |
| _tcgen05.guardrails.in_physical_bounds | Physical bounds check |
| _tcgen05.guardrails.allocation_granularity | Granularity validation |
| _tcgen05.guardrails.datapath_alignment | Alignment validation |
| _tcgen05.guardrails.sp_consistency_across_idesc_mod | Sparsity consistency |
| _tcgen05.guardrails.check_sparse_usage | Sparse usage validation |

The guardrail check functions are .FORCE_INLINE and return a boolean retVal:

.FORCE_INLINE .func (.reg .b32 retVal)
    __cuda_sm10x_tcgen05_guardrails_check_datapath_alignment
    (.reg .u32 tmemAddr, .reg .u32 iDesc, .reg .u32 cta_group,
     .reg .u32 hasWS, .reg .u32 hasSP, .reg .u32 matrix_kind);

The parameters reveal TMEM addressing structure: tmemAddr (TMEM base address), iDesc (instruction descriptor), cta_group (CTA group for cluster operations), hasWS (warp-shared flag), hasSP (sparse flag), matrix_kind (operand role).

Block-Scale MMA

Block-scale MMA allows per-block scaling factors for mixed-precision computation. In ptxas, this is gated by the .block_scale modifier on the PTX mma instruction:

  • Validator sub_49BBA0 checks ".block_scale" and "Sparse mma with block scale" (sparsity + block-scale interaction)
  • Additional intrinsic suffix __cuda_sm_100_tcgen05_ld_immhalfSplitOff and variants handle block-scale-aware loads
  • The bf16x2.ue8m0x2 type string in the binary indicates UE8M0 (unsigned exponent-only) scale factor format for MX (microscaling) quantization

Intrinsic Registration Summary

Full Tensor Core Intrinsic ID Map

| ID Range | Count | SM | Category | SASS Target |
|---|---|---|---|---|
| 0x89--0x1FA (subset) | ~200 | 70+ | __cuda_sm70_wmma_* -- WMMA load/store/mma (f16) | HMMA |
| 0x1FB--0x208 | 14 | 80 | __cuda_sm80_* -- bf16/tf32/s4/s8/b1 MMA, createpolicy | HMMA, IMMA, DMMA, BMMA |
| 0x209--0x22F | 39 | 80+ | __cuda_sm_8x_mma_* -- direct MMA operations | HMMA, IMMA |
| 0x230--0x239 | 10 | 100 | __cuda_sm_10x_* -- hmma/imma mdata + bit MMA | UTCHMMA, UTCIMMA |
| 0x23A--0x25F | 38 | 90 | __cuda_sm_9x_mma_sub_byte_internal_* -- sub-byte sparse | IMMA |

TCGen05 Intrinsics (Not in Master ID Table)

TCGen05 operations are dispatched through the named opcode table, not the numeric ID table. The 11 tcgen05.* instructions are registered directly in the opcode hash map at a1+808:

| PTX Opcode | Codegen Handler |
|---|---|
| tcgen05.alloc | sub_569180 |
| tcgen05.relinquish_alloc_permit | sub_526370 |
| tcgen05.dealloc | sub_58C7F0 |
| tcgen05.ld | sub_574050 |
| tcgen05.ld.red | sub_578DB0 |
| tcgen05.st | sub_571FE0 |
| tcgen05.commit | sub_56C190 |
| tcgen05.cp | sub_5427F0 |
| tcgen05.shift | sub_4F1A90 |
| tcgen05.mma | sub_5BBC30 |
| tcgen05.mma.ws | sub_58FA20 |

OCG-Level MMA Operations

The OCG system at sub_6C9EB0 registers additional SASS-level MMA operations for SM100+:

tcmma (2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit)
tcbar, mmareadshma
16dp32bitt0t15, 16dp32bitt16t31
sparsify, spfactor2to4
tcshift, tcatomsws, tcldsws, tcstsws

These are internal Mercury-level operations that do not correspond 1:1 to PTX instructions. The tcmma variants encode the specific SASS datapath configuration: 2x64dp128bit means 2 datapaths, 64-element wide, 128-bit lane width.
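The datapath naming decomposes mechanically; a quick parser for the configuration prefix (suffixes such as lw02lw13, the lane-pairing groups, are left unparsed):

```python
import re

def parse_tcmma_config(name):
    # '2x64dp128bit...' -> (datapaths, element width, lane bits).
    m = re.match(r'(\d+)x(\d+)dp(\d+)bit', name)
    return tuple(int(g) for g in m.groups())

print(parse_tcmma_config('2x64dp128bitlw02lw13'))  # (2, 64, 128)
```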

Data Type Matrix

Complete data type support across tensor core generations as visible in the binary:

| Data Type | PTX Type | Width | SM Floor | Use |
|---|---|---|---|---|
| FP16 | .f16 | 16b | 75 | Primary HMMA operand |
| BF16 | .bf16 | 16b | 80 | HMMA alternate format |
| TF32 | .tf32 | 19b (stored as 32b) | 80 | Reduced-precision FP32 |
| FP32 | .f32 | 32b | 75 | HMMA accumulator |
| FP64 | .f64 | 64b | 80 | DMMA (double-precision) |
| FP8 E4M3 | .e4m3 | 8b | 89 | Ada/Hopper FP8 MMA |
| FP8 E5M2 | .e5m2 | 8b | 89 | Ada/Hopper FP8 MMA |
| INT8 | .s8, .u8 | 8b | 72 | IMMA integer MMA |
| INT4 | .s4, .u4 | 4b | 72 | IMMA sub-byte MMA |
| INT1 | .b1 | 1b | 75 | BMMA 1-bit MMA (XOR/AND) |
| UE8M0 | .ue8m0x2 | 8b (packed) | 100 | Block-scale exponent factor |
| B1024 | .b1024 | 1024b | 100 | TMEM-width operand |

ELF Metadata

The cubin output includes tensor-core-specific EIATTR attributes:

| EIATTR | Purpose |
|---|---|
| EIATTR_SPARSE_MMA_MASK | Records which MMA operations use structured sparsity |
| EIATTR_TCGEN05_1CTA_USED | Kernel uses 1-CTA tcgen05 operations |
| EIATTR_TCGEN05_2CTA_USED | Kernel uses 2-CTA tcgen05 operations |

Knobs and Options

| Knob / Option | Purpose |
|---|---|
| suppress-sparse-mma-advisory-info | Suppress advisory info for mma.sp operations |
| --g-tensor-memory-access-check | Enable tcgen05 guardrail instrumentation |

Key Function Table

| Address | Size | Identity | Confidence |
|---|---|---|---|
| 0x5C7A50 | 173KB | WMMA.MMA codegen (all shapes/types/layouts) | 98% |
| 0x5C10A0 | 120KB | MMA.sync codegen (post-Volta shapes) | 98% |
| 0x5BBC30 | 90KB | TCGen05.MMA codegen (Blackwell TMEM-based) | 98% |
| 0x5A0EA0 | 7,779B | wmma.load.a formatter | 95% |
| 0x5A8E40 | 9,757B | wmma.load.b formatter | 95% |
| 0x5A6BD0 | 8,813B | wmma.load.c formatter | 95% |
| 0x5A2D10 | 8,074B | wmma.store.d formatter | 95% |
| 0x50AC70 | 1,282B | wgmma.mma_async handler | 99% |
| 0x4DA380 | 295B | wgmma.fence handler | 99% |
| 0x4DA4B0 | 295B | wgmma.commit_group handler | 99% |
| 0x4DA5E0 | 311B | wgmma.wait_group handler | 99% |
| 0x58FA20 | 4,604B | tcgen05.mma.ws / tcgen05.shift formatter | 95% |
| 0x4DA720 | 343B | tcgen05.mma.ws formatter | 90% |
| 0x569180 | -- | tcgen05.alloc handler | 90% |
| 0x526370 | -- | tcgen05.relinquish_alloc_permit handler | 90% |
| 0x58C7F0 | -- | tcgen05.dealloc handler | 90% |
| 0x574050 | -- | tcgen05.ld handler | 90% |
| 0x578DB0 | -- | tcgen05.ld.red handler | 90% |
| 0x571FE0 | -- | tcgen05.st handler | 90% |
| 0x56C190 | -- | tcgen05.commit handler | 90% |
| 0x5427F0 | -- | tcgen05.cp handler | 90% |
| 0x4F1A90 | -- | tcgen05.shift handler | 90% |
| 0x4C2FD0 | 12.2KB | WMMA/MMA master validator (sm_75/80/90) | 90% |
| 0x49BBA0 | 11.4KB | MMA type/scale validator (FP8, block-scale) | 90% |
| 0x4BFED0 | 10.3KB | WMMA shape validator | 90% |
| 0x490F90 | -- | Integer MMA shape validator | 85% |
| 0x494210 | 2,276B | Sparse GMMA validator | 85% |
| 0x495900 | -- | WMMA floating-point type validator | 85% |
| 0x496570 | -- | FP8 MMA shape validator | 85% |
| 0x4961F0 | -- | FP8 MMA accumulation validator | 85% |
| 0x4428E0 | -- | FP64 MMA type validator | 85% |
| 0x6D4350 | -- | SASS TCGen05 MMA encoder (UTCHMMA/tcmma) | 85% |
| 0x6D7AF0 | -- | SASS MMA encoder variant | 85% |
| 0x6D5CB0 | -- | SASS MMA encoder variant | 85% |
| 0x6D69B0 | -- | SASS MMA encoder variant | 85% |
| 0x50D4B0 | 1,187B | ldmatrix formatter | 90% |
| 0x4DAEA0 | -- | movmatrix formatter | 90% |
| 0x4F05D0 | -- | stmatrix formatter | 90% |
| 0x5D1660 | 46KB | Master intrinsic registration (608 entries) | 99% |
| 0x5D4190 | 41KB | Opcode dispatch (MMA hash table builder) | 99% |
| 0x5FF700 | 354KB | Prototype generator (.weak .func declarations) | 99% |

Cross-References

Synchronization & Warp Intrinsics


This page documents how ptxas v13.0.88 handles the lowering of synchronization primitives and warp-level intrinsic operations from PTX source through Ori IR to final SASS instructions. The coverage spans warp vote, shuffle, match, and redux operations; thread-block barriers; memory barriers and fences; warp-level synchronization; asynchronous barriers (mbarrier); and atomic/reduction intrinsic lowering.

| Component | Details |
|---|---|
| PTX codegen handlers | sub_580E50 (vote), sub_5801D0 (shfl), sub_58A730 (match), sub_567680 (redux), sub_524FB0 (bar), sub_570290 (barrier), sub_500BF0 (bar.arrive), sub_570940 (barrier.arrive), sub_52D590 (bar.red), sub_5889B0 (barrier.red), sub_56A5A0 (bar.warp), sub_4DB410 (membar), sub_6C0D90 (atom/red) |
| Intrinsic IDs | 0x01--0x11 (reduxsync, 17), 0x89--0x1FA (sm70 warp/barrier/wmma, 370) |
| Ori IR opcodes | 96 (barrier), 119 (vote), 130 (HSET2 in ROT13; sync internal / shared-mem LEA), 314 (atom/red) |
| SASS opcodes | VOTE, VOTEU, BAR, BAR_INDEXED, MATCH, MEMBAR, WARPSYNC, BSYNC, BSSY, DEPBAR, ERRBAR, ELECT, NANOSLEEP, ATOM, ATOMG, ATOMS, RED |
| Blackwell additions | FENCE_G, FENCE_S, FENCE_T, CGABAR_ARV, CGABAR_GET, CGABAR_SET, CGABAR_WAIT, CGAERRBAR, SYNCS_BASIC, SYNCS_LD_UNIFM |
| Optimizer phases | 25 (StageAndFence), 26 (OriRemoveRedundantBarriers), 42 (ExpandMbarrier), 71 (OptimizeSyncInstructions), 72 (LateExpandSyncInstructions), 99, 100, 114 |
| Intrinsic detection | sub_A9A410 (sm70 warp-sync prefix matcher), sub_A94440 (mbarrier classifier) |
| Related EIATTR | EIATTR_NUM_BARRIERS, EIATTR_NUM_MBARRIERS, EIATTR_MBARRIER_INSTR_OFFSETS, EIATTR_SYNC_STACK, EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS, EIATTR_GEN_ERRBAR_AT_EXIT |
| CLI options | --assume-extern-functions-do-not-sync, --no-membermask-overlap, --print-potentially-overlapping-membermasks |
| Related pages | Synchronization & Barriers (passes), Intrinsic Table |

Instruction Classification

ptxas groups synchronization and warp operations into three Ori IR opcode categories that govern scheduling, dependency tracking, and optimization treatment throughout the pipeline.

| Ori Opcode | Category | Instructions | WAR Resource Mask |
|---|---|---|---|
| 96 | Barrier | bar.sync, bar.red, bar.arrive, barrier.* | 0x200001 (bit 0 + bit 21) |
| 119 | Vote | vote.{all,any,uni,ballot}, match.*, redux.*, activemask, elect.sync | 0x1 (bit 0 only) |
| 130 (HSET2) | Sync internal | BAR/MEMBAR pseudo-ops during optimization (actual SASS BAR = 61, MEMBAR = 111) | Used by phases 26, 71 for redundancy analysis |

The WAR resource mask at opcode 96 (2097153 = 0x200001) uniquely identifies barrier instructions to the scoreboard tracker. For all other opcodes, the base mask is 1. The scheduler uses this to insert appropriate stall cycles between barrier producers and consumers.
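The mask arithmetic can be checked directly (the is_barrier helper is illustrative, not a recovered function):

```python
BARRIER_WAR_MASK = 0x200001
# 0x200001 decomposes into exactly bit 0 (the base mask) plus bit 21.
assert BARRIER_WAR_MASK == (1 << 0) | (1 << 21) == 2097153

def is_barrier(mask):
    # Schematic scoreboard test: the barrier-only bit is bit 21.
    return (mask >> 21) & 1 == 1

print(is_barrier(BARRIER_WAR_MASK), is_barrier(0x1))
```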

Warp Vote

Warp vote operations evaluate a per-lane predicate across the active threads in a warp and return a collective result. In ptxas these are lowered through the vote codegen handler at sub_580E50 (approximately 3.2KB of decompiled code).

PTX to SASS Mapping

PTX Instruction | SASS Opcode | Result | Membermask
vote.sync.all.pred p, q, membermask | VOTE.ALL | Predicate: true iff all active lanes have q=true | Explicit 32-bit mask
vote.sync.any.pred p, q, membermask | VOTE.ANY | Predicate: true iff any active lane has q=true | Explicit 32-bit mask
vote.sync.uni.pred p, q, membermask | VOTE.UNI | Predicate: true iff q is uniform across active lanes | Explicit 32-bit mask
vote.sync.ballot.b32 d, q, membermask | VOTE.BALLOT | Register: 32-bit ballot mask of lanes where q=true | Explicit 32-bit mask
activemask.b32 d | CS2R (read SR_LANEMASK_ACTIVE) | Register: current active lane mask | Implicit (all active)
elect.sync d, membermask | ELECT.SYNC | Predicate: true for exactly one active lane | Explicit 32-bit mask (sm75+)

On sm100+ (Blackwell), VOTEU is available as a uniform-register variant of VOTE for cases where the result feeds only uniform consumers.
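
The collective semantics in the table can be modeled in a few lines. This is a behavioral sketch of the PTX-level semantics only, not ptxas code; lane and mask conventions follow the table above:

```python
def vote(mode, preds, membermask):
    """Behavioral model of warp vote over a 32-lane warp.

    preds maps lane id -> predicate value; only lanes set in membermask
    participate (divergence and exited lanes are ignored in this sketch).
    """
    lanes = [i for i in range(32) if membermask >> i & 1]
    values = [bool(preds.get(i, False)) for i in lanes]
    if mode == "all":
        return all(values)
    if mode == "any":
        return any(values)
    if mode == "uni":
        return len(set(values)) <= 1
    if mode == "ballot":
        return sum(1 << i for i in lanes if preds.get(i, False))
    raise ValueError(mode)

even = {i: i % 2 == 0 for i in range(32)}
print(vote("any", even, 0xFFFFFFFF))          # True
print(vote("all", even, 0xFFFFFFFF))          # False
print(hex(vote("ballot", even, 0xFFFFFFFF)))  # 0x55555555
print(vote("uni", even, 0x1))                 # True -- a single lane is trivially uniform
```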

Codegen Handler Structure -- sub_580E50

The vote handler follows the standard intrinsic codegen pattern: allocates a 50,000-byte scratch buffer via sub_424070, then builds an inline PTX function body through sequential sprintf() calls. The handler dispatches on three architecture tiers:

sub_580E50(ctx, string_table):
    instr = *(ctx + 1096)
    buf = alloc(50000)

    // Common prologue: function header + parameter declarations
    sprintf(buf, string_table[309261...])   // ".reg .pred %p; ..."

    // Feature check: sub_70B6E0(instr) -- has membermask operand?
    if has_membermask:
        mask_val = sub_70B780(instr)
        sprintf(buf, format_string, mask_val)

    // Architecture dispatch:
    if (sub_70FA00(instr, 11) || SM > 89) && sub_7081E0(instr) == 1:
        // Path 1: sm90+ with sync variant
        // Emits: vote.sync.{mode} with full operand set
        // Reads operands 0,1,2 via sub_70B8E0, sub_70CA70, sub_709E80, sub_70B4F0

    else if SM > 69 && sub_7081E0(instr) == 1:
        if sub_70FA00(instr, 10) || sub_709E60(instr) == 1:
            // Path 2a: sm70+ with explicit sync
            // Reads operands via sub_70B510, sub_70B8E0

        else:
            // Path 2b: sm70+ standard path
            // Checks sub_70FA00(instr, 17) -- has predicate output
            // Checks sub_70BA10(instr) -- ballot mode
            // SM > 75 branch: different register conventions
            // sub_70A910(instr) == 1: uniform result path

    // Epilogue: closing braces, return buffer

The accessor sub_70FA00(instr, 0) returns the SM architecture level (e.g., 70, 75, 80, 89, 90). The value at parameter 11 checks for a specific feature flag (cluster/sync extension). sub_7081E0(instr) returns the instruction variant (1 = sync form).

Intrinsic Registration

Vote intrinsics are registered as part of the sm70 block (0x89--0x1FA) with four entries:

Intrinsic Name | PTX Equivalent
__cuda_sm70_votesync_all | vote.sync.all.pred
__cuda_sm70_votesync_any | vote.sync.any.pred
__cuda_sm70_votesync_ballot | vote.sync.ballot.b32
__cuda_sm70_votesync_uni | vote.sync.uni.pred

Detection of these intrinsics at the IR level is handled by sub_A9A410 (194 bytes binary, 908 bytes decompiled), which performs prefix matching against three string patterns:

// sub_A9A410 -- IntrinsicDetector::isSM70WarpSync
static const char* prefixes[] = {
    "__cuda_sm70_warpsync",
    "__cuda_sm70_votesync_",
    "__cuda_sm70_matchsync_",
};
for (each prefix) {
    name = getSymbolName(instr.symbol_id);
    if (!strncmp(prefix, name, strlen(prefix)))
        return 1;
}
return 0;

This function is called during instruction lowering (subsystem 6 at 0xA9F000--0xAA8000) to identify warp-synchronous call sites that need special handling during barrier optimization.

Warp Shuffle

Warp shuffle moves data between lanes within a warp. The codegen handler sub_5801D0 (approximately 3.3KB decompiled) generates inline PTX for the four shuffle modes.

PTX to SASS Mapping

PTX Instruction | SASS Opcode | Data Movement
shfl.sync.idx.b32 d|p, a, b, c, membermask | SHF / SHFL | d = lane[b] (direct indexed)
shfl.sync.up.b32 d|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid - b] (shift up)
shfl.sync.down.b32 d|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid + b] (shift down)
shfl.sync.bfly.b32 d|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid ^ b] (butterfly XOR)

The c operand packs the clamp value and width: c = ((width - 1) << 8) | clamp. The optional predicate output p indicates whether the source lane was within bounds.
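
A small sketch of the c-operand packing and the per-mode source-lane computation (behavioral model only; the clamp/bounds check that drives the optional predicate output is omitted):

```python
def pack_c(width, clamp):
    # c = ((width - 1) << 8) | clamp, per the operand layout above
    return ((width - 1) << 8) | clamp

def src_lane(mode, laneid, b):
    # Source-lane computation for each shuffle mode
    return {"idx": b,
            "up": laneid - b,
            "down": laneid + b,
            "bfly": laneid ^ b}[mode]

print(hex(pack_c(32, 31)))     # 0x1f1f -- full-warp shuffle, clamp 31
print(src_lane("bfly", 5, 1))  # 4
print(src_lane("down", 5, 2))  # 7
print(src_lane("up", 5, 2))    # 3
```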

Codegen Handler -- sub_5801D0

The shuffle handler is structurally similar to vote. It reads up to 5 operands (source value, lane offset, clamp/width, membermask, and optional predicate output) through the accessor chain:

sub_5801D0(ctx, string_table):
    // string_table offsets start at 311376
    // Prologue with 4 sprintf calls for parameter declarations

    if (SM >= 90 || feature_flag_11) && variant == 1:
        // sm90+ path: reads operands 0..4
        // sub_70B960(instr) -- gets shuffle mode enum
        // sub_70B450(instr) -- gets data type
        // Emits shfl.sync.{mode}.b32 with full 8-operand format

    else if SM > 69 && variant == 1:
        if feature_10 || sub_709E60(instr) == 1:
            // sm70+ explicit sync path
            // 7-operand format

        else:
            // Standard sm70 path
            // Checks sub_70BA10 for predicate output
            // SM > 75: different operand packing

The shuffle mode is obtained via sub_70B960(instr), returning an enum: 0=idx, 1=up, 2=down, 3=bfly. sub_70B450(instr) returns the data type (b32 for standard shuffles).

Intrinsic Registration

Intrinsic Name | PTX Equivalent
__cuda_sm70_shflsync_idx | shfl.sync.idx.b32
__cuda_sm70_shflsync_up | shfl.sync.up.b32
__cuda_sm70_shflsync_down | shfl.sync.down.b32
__cuda_sm70_shflsync_bfly | shfl.sync.bfly.b32

Each has a variant with predicate output (e.g., __cuda_sm70_shflsync_idx_pred).

Warp Match

Warp match instructions compare a value across lanes and return which lanes hold matching values. The codegen handler sub_58A730 (approximately 4.5KB decompiled) is the largest in this group.

PTX to SASS Mapping

PTX Instruction | SASS Opcode | Result
match.sync.any.b32 d, a, membermask | MATCH.ANY | d: mask of lanes holding the same value as the calling lane
match.sync.all.b32 d|p, a, membermask | MATCH.ALL | d: mask of lanes holding the same value; p: true if ALL active lanes match
match.sync.any.b64 d, a, membermask | MATCH.ANY | 64-bit value comparison variant
match.sync.all.b64 d|p, a, membermask | MATCH.ALL | 64-bit value comparison variant
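
The table's semantics, modeled from one calling lane's point of view (a behavioral sketch of the PTX semantics, not ptxas code):

```python
def match_sync(mode, values, membermask, lane):
    """Behavioral model of MATCH.{ANY,ALL} from the calling lane's view."""
    lanes = [i for i in range(32) if membermask >> i & 1]
    mine = values[lane]
    same = sum(1 << i for i in lanes if values[i] == mine)
    if mode == "any":
        return same                                   # d
    if mode == "all":
        ok = all(values[i] == mine for i in lanes)
        return (same if ok else 0), ok                # d, p
    raise ValueError(mode)

vals = {i: (7 if i < 4 else 9) for i in range(32)}
print(hex(match_sync("any", vals, 0xF, 0)))  # 0xf -- lanes 0..3 all hold 7
print(match_sync("all", vals, 0xF, 0))       # (15, True)
print(match_sync("all", vals, 0xFF, 0))      # (0, False)
```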

Codegen Handler -- sub_58A730

The match handler has three architecture tiers and handles both b32 and b64 operand widths:

sub_58A730(ctx, string_table):
    // string_table offsets start at 323786

    if (SM >= 90 || feature_flag_11) && variant == 1:
        // sm90+ path: 7-operand format
        // Reads: sub_70B4F0, sub_709E80, sub_70CA70, sub_70B8E0(0..2)

    else if SM > 69 && variant == 1:
        // sm70+ path
        // Checks feature_10 and sub_709E60 for explicit sync
        // sub_70B940(instr) -- has match predicate output?
        // sub_70D1F0(instr, 0) -- gets operand by index
        // sub_70B950(instr) -- gets comparison mode

Intrinsic Registration

Intrinsic Name | Variants
__cuda_sm70_matchsync_any_b32 | Standard
__cuda_sm70_matchsync_any_b64 | 64-bit
__cuda_sm70_matchsync_all_b32 | With predicate output
__cuda_sm70_matchsync_all_b64 | With predicate output, 64-bit

Detection uses the same sub_A9A410 prefix matcher with "__cuda_sm70_matchsync_".

Warp Redux

Warp redux performs a warp-wide reduction operation and returns the result to all participating lanes. The codegen handler sub_567680 (approximately 2.0KB decompiled) is relatively compact.

PTX to SASS Mapping

PTX Instruction | SASS Functional Unit | Operation
redux.sync.add.s32 d, a, membermask | redux | Warp-wide signed integer addition
redux.sync.min.s32 d, a, membermask | redux | Warp-wide signed integer minimum
redux.sync.max.s32 d, a, membermask | redux | Warp-wide signed integer maximum
redux.sync.min.u32 d, a, membermask | redux | Warp-wide unsigned integer minimum
redux.sync.max.u32 d, a, membermask | redux | Warp-wide unsigned integer maximum
redux.sync.add.u32 d, a, membermask | redux | Warp-wide unsigned integer addition
redux.sync.and.b32 d, a, membermask | redux | Warp-wide bitwise AND
redux.sync.or.b32 d, a, membermask | redux | Warp-wide bitwise OR
redux.sync.xor.b32 d, a, membermask | redux | Warp-wide bitwise XOR
redux.sync.min.f32.NaN d, a, membermask | redux | Warp-wide float minimum (NaN-propagating)
redux.sync.max.f32.NaN d, a, membermask | redux | Warp-wide float maximum (NaN-propagating)
redux.sync.min.f32.abs d, a, membermask | redux | Warp-wide float absolute minimum

The scheduler tracks redux operations on the dedicated redux functional unit pipeline, alongside adu, alu, cbu, fma2x, fma, half, transcendental, ipa, lsu, schedDisp, tex, ttu, udp, and the various MMA pipelines.

Codegen Handler -- sub_567680

sub_567680(ctx, string_table):
    // Prologue: function header

    if (SM >= 90 || feature_flag_11) && variant == 1:
        // sm90+ path: 8-operand format
        // Reads: sub_709E80, sub_70CA70, sub_707530, sub_7087C0,
        //        sub_707630, sub_70B8E0(0..2)
        // sub_707630 -- gets reduction operation type
        // sub_7087C0 -- gets data type qualifier

    else if SM > 79 && variant == 1:
        // sm80+ path
        // Two sub-branches:
        //   - feature_10 || explicit_sync || feature_19: simplified format
        //   - Standard: full format with sub_707650, sub_7087F0,
        //     sub_707540, sub_70D1F0

The accessor sub_707630(instr) returns the reduction operation enum (add/min/max/and/or/xor), while sub_7087C0(instr) returns the data type qualifier (s32/u32/b32/f32). Note that redux requires sm80+ in the hardware; the sm70 block in the intrinsic table registers redux-sync intrinsics as software emulation routines.

Redux Sync Intrinsic Registration (IDs 0x01--0x11)

The earliest 17 intrinsic IDs are dedicated to software-emulated redux-sync operations for pre-sm80 targets:

ID | Intrinsic Name | Operation
0x01 | __cuda_reduxsync_b32_and | Bitwise AND reduction
0x02 | __cuda_reduxsync_b32_or | Bitwise OR reduction
0x03 | __cuda_reduxsync_b32_xor | Bitwise XOR reduction
0x04 | __cuda_reduxsync_f32_max | Float maximum
0x05 | __cuda_reduxsync_f32_min | Float minimum
0x06 | __cuda_reduxsync_f32_max_abs | Float absolute maximum
0x07 | __cuda_reduxsync_f32_min_abs | Float absolute minimum
0x08 | __cuda_reduxsync_f32_max_NaN | Float maximum (NaN-propagating)
0x09 | __cuda_reduxsync_f32_min_NaN | Float minimum (NaN-propagating)
0x0A | __cuda_reduxsync_s32_add | Signed integer sum
0x0B | __cuda_reduxsync_s32_max | Signed integer maximum
0x0C | __cuda_reduxsync_s32_min | Signed integer minimum
0x0D | __cuda_reduxsync_u32_add | Unsigned integer sum
0x0E | __cuda_reduxsync_u32_max | Unsigned integer maximum
0x0F | __cuda_reduxsync_u32_min | Unsigned integer minimum
0x10--0x11 | (additional variants) | Extended operations

On sm80+, redux PTX instructions lower directly to hardware SASS instructions and bypass the software emulation path.
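
The pre-sm80 emulation routines themselves are not decompiled here, but warp-wide reductions are classically built from log2(n) butterfly exchange steps (the shfl.bfly pattern). A behavioral sketch under that assumption:

```python
def redux_emulated(op, values, membermask):
    """Butterfly-exchange model of a software warp reduction (shfl.bfly pattern).

    Sketch only: assumes the participant set is a power of two and densely
    packed, the simplest case; `op` must be associative and commutative.
    """
    vals = [v for i, v in enumerate(values) if membermask >> i & 1]
    offset = len(vals) // 2
    while offset:
        # every lane combines with its XOR partner; after log2(n) rounds
        # all positions hold the full reduction
        vals = [op(vals[i], vals[i ^ offset]) for i in range(len(vals))]
        offset //= 2
    return vals[0]

print(redux_emulated(lambda a, b: a + b, list(range(32)), 0xFFFFFFFF))  # 496
print(redux_emulated(max, list(range(32)), 0xFFFFFFFF))                 # 31
print(redux_emulated(lambda a, b: a & b, [0b1110, 0b0111], 0b11))       # 6
```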

Thread-Block Barriers

Thread-block barriers synchronize all threads within a CTA (Cooperative Thread Array). ptxas provides codegen handlers for three PTX barrier families plus their PTX 8.0 cluster-aware equivalents.

PTX Barrier Family

PTX Handler | Address | Size | PTX Instructions
sub_524FB0 | 0x524FB0 | 1.8KB | bar.sync, bar
sub_500BF0 | 0x500BF0 | 1.2KB | bar.arrive
sub_52D590 | 0x52D590 | 1.5KB | bar.red.{and,or,popc}
sub_570290 | 0x570290 | 2.5KB | barrier.sync, barrier
sub_570940 | 0x570940 | -- | barrier.arrive
sub_5889B0 | 0x5889B0 | 4.8KB | barrier.red
sub_56A5A0 | 0x56A5A0 | 1.9KB | bar.warp.sync

PTX to SASS Mapping

PTX Instruction | SASS Opcode | Behavior
bar.sync N | BAR.SYNC | Block until all CTA threads arrive at barrier N
bar.sync N, count | BAR.SYNC | Block until count threads arrive at barrier N
bar.arrive N | BAR.ARV | Signal arrival at barrier N without blocking
bar.arrive N, count | BAR.ARV | Signal arrival with thread count
bar.red.and N, p | BAR.RED.AND | Barrier + CTA-wide AND reduction on predicate
bar.red.or N, p | BAR.RED.OR | Barrier + CTA-wide OR reduction
bar.red.popc N, d | BAR.RED.POPC | Barrier + CTA-wide population count
barrier.cta.sync N | BAR.SYNC | PTX 8.0 cluster-aware CTA barrier
barrier.cta.arrive N | BAR.ARV | PTX 8.0 cluster-aware CTA arrive
barrier.cta.red N | BAR.RED | PTX 8.0 cluster-aware CTA reduction

The hardware provides 16 named barriers (indices 0--15). The EIATTR_NUM_BARRIERS metadata records the maximum barrier index used per kernel, which the driver uses to partition the convergence barrier file at launch.

Codegen Handler Details -- sub_524FB0

The bar.sync / bar handler dispatches across three architecture generations:

sub_524FB0(ctx, string_table):
    // string_table offsets start at 294205

    if feature_flag_11 || SM > 89:
        // sm90+ path: 4 format strings for prologue
        // Checks sub_70B930(instr) for count variant:
        //   count != 2: single-operand format (barrier index only)
        //   count == 2: two-operand format (barrier index + thread count)

    else if SM > 69:
        if !feature_13 || sub_70B860(instr) > 69:
            // sm70+ standard path
            // Same count dispatch as sm90
        else:
            // sm70 with feature_13: extended format
            // Reads sub_709400 (barrier scope) and sub_708200 (barrier type)

    else:  // SM <= 69 (pre-Volta)
        // Legacy path
        // sub_709400 -- barrier scope identifier
        // sub_708200 -- barrier operation type
        // count dispatch with 3-operand format strings

The accessor sub_70B930(instr) returns the operand count mode: 1 for single-operand (barrier index only), 2 for two-operand (barrier index + thread count). sub_70C180(instr) returns a special value (-1 for default thread count).

Named Barrier Intrinsic Registration

The sm70 intrinsic block registers barrier operations combinatorially:

Sub-Category | Count | Combinatorial Source
__cuda_sm70_barrier_arrive_{0..15} | 16 | 16 barrier indices
__cuda_sm70_barrier_arrive_{0..15}_cnt | 16 | With explicit thread count
__cuda_sm70_barrier_sync_{0..15} | 16 | 16 barrier indices
__cuda_sm70_barrier_sync_{0..15}_cnt | 16 | With explicit thread count
__cuda_sm70_barrier_red_and_{0..15} | 16 | AND reduction per barrier
__cuda_sm70_barrier_red_and_{0..15}_cnt | 16 | With thread count
__cuda_sm70_barrier_red_or_{0..15} | 16 | OR reduction per barrier
__cuda_sm70_barrier_red_or_{0..15}_cnt | 16 | With thread count
__cuda_sm70_barrier_red_popc_{0..15} | 16 | POPC reduction per barrier
__cuda_sm70_barrier_red_popc_{0..15}_cnt | 16 | With thread count

This combinatorial explosion produces 160 intrinsic entries for barriers alone (16 indices x 2 count variants x 5 operation types).
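
The 160-entry enumeration can be reproduced directly from the naming scheme in the table:

```python
# Generate the combinatorial sm70 barrier intrinsic names:
# 5 operation types x 16 barrier indices x 2 count variants = 160.
ops = ["arrive", "sync", "red_and", "red_or", "red_popc"]
names = [
    "__cuda_sm70_barrier_%s_%d%s" % (op, idx, suffix)
    for op in ops
    for idx in range(16)
    for suffix in ("", "_cnt")
]
print(len(names))   # 160
print(names[0])     # __cuda_sm70_barrier_arrive_0
print(names[-1])    # __cuda_sm70_barrier_red_popc_15_cnt
```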

Barrier Codegen Pattern -- sub_570290

The barrier (PTX 8.0 form) handler at sub_570290 (2.5KB) is the most complex barrier handler. It adds cluster-awareness for sm90+ and handles the barrier.cta.* variants. The handler has an elaborate multi-level dispatch:

sub_570290(ctx, string_table):
    // sm90+ path: additional CTA scope parameters
    //   sub_709E80(instr) -- barrier scope enum
    //   sub_70B8E0(instr, 0) -- barrier index operand
    //   sub_70B8E0(instr, 1) -- thread count operand

    // sm70+ with feature_10 or explicit sync:
    //   Separate code paths for count=1 vs count=2

    // sm70+ standard (no explicit sync, no feature_10):
    //   Six format strings building an elaborate inline PTX body
    //   Handles sub_70B930 count modes 1 and 2
    //   Checks sub_70C180 for default-count (-1) vs explicit count
    //   Generates bar.sync N [, count]; or bar.arrive N [, count];

Memory Barriers and Fences

Memory Barriers -- sub_4DB410

The membar codegen handler at sub_4DB410 (84 lines decompiled) is the smallest sync handler. It generates memory barrier instructions across three scope levels.

PTX Instruction | SASS Opcode | Scope
membar.cta | MEMBAR.CTA | Thread block (CTA)
membar.gpu | MEMBAR.GPU | Device (GPU)
membar.sys | MEMBAR.SYS | System (all agents)

sub_4DB410(ctx, string_table):
    mode = sub_709FE0(instr)

    if mode != 2 && mode != 4:
        // Standard 3-operand format
        // sub_70B710(instr) -- scope qualifier
        // sub_709FF0(instr) -- barrier type
        // sub_70A530(instr) -- additional qualifier

    else:  // mode 2 or 4 (cta or sys)
        if SM > 49:
            // sm50+ uses 3-operand format with scope
        else:
            // Pre-sm50: 2-operand membar + separate scope annotation

The EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS metadata records the byte offset of every membar.sys instruction in the output SASS. This allows the driver to apply software workarounds (WAR patches) at specific membar.sys locations for known hardware errata.

Fence Operations

Fence operations enforce ordering between different memory proxy domains. These are not exposed as separate PTX codegen handlers but are inserted by the compiler's synchronization passes (phases 25 and 72).

PTX Instruction | SASS Opcode (sm100+) | Purpose
fence.proxy.alias | (expanded inline) | Orders generic/alias memory accesses
fence.proxy.async | (expanded inline) | Orders async copy completion visibility
fence.proxy.async.global | (expanded inline) | Global memory async fence
fence.sc.cta | FENCE_S | Sequentially-consistent fence, CTA scope
fence.sc.gpu | FENCE_G | Sequentially-consistent fence, GPU scope
fence.acq_rel.cta | FENCE_T | Acquire-release fence, CTA scope

On Blackwell (sm100+), dedicated FENCE_G, FENCE_S, and FENCE_T SASS opcodes replace the older pattern of MEMBAR + proxy annotation sequences.

StageAndFence (Phase 25)

The StageAndFence pass (sub_1392E30, 166 bytes entry, sub_1390B30 8,956 bytes core) inserts fence instructions after loop unrolling to re-establish memory ordering correctness. When loop unrolling replicates memory operations that crossed a synchronization boundary in the original loop body, this pass inserts fence.proxy or MEMBAR pseudo-instructions at the boundaries of unrolled iterations.

The core function takes floating-point parameters (double/__m128d), indicating it incorporates latency and throughput heuristics when deciding fence placement -- preferring to merge adjacent fences or delay them to overlap with independent computation.

Warp-Level Synchronization

WARPSYNC

WARPSYNC mask synchronizes the threads in a warp identified by the lane mask. This is the fundamental warp-level sync primitive on sm70+ (Volta and later).

PTX | SASS | Purpose
bar.warp.sync membermask | WARPSYNC mask | Synchronize warp lanes specified by mask

The intrinsic __cuda_sm70_warpsync (single entry in the sm70 block) is the simplest warp-sync intrinsic, and is detected by the same sub_A9A410 prefix matcher that handles vote and match.

BSSY / BSYNC (Convergence Barriers)

The BSSY / BSYNC instruction pair replaces the pre-Volta implicit reconvergence stack. The compiler must insert these pairs explicitly at divergence/reconvergence points:

SASS Opcode | Purpose
BSSY B, target | Push a synchronization barrier; target is the reconvergence point
BSYNC B | Pop and wait at the convergence barrier B

These are not directly exposed as PTX instructions -- they are inserted by the compiler during phase 72 (LateExpandSyncInstructions) and the architecture-specific sync expansion passes (phases 99, 100, 114). The EIATTR_SYNC_STACK metadata records the convergence barrier stack depth.

ELECT

ELECT.SYNC (sm75+) elects a single active lane from the warp, setting a predicate register to true for exactly one thread.

In the SASS opcode table, ELECT appears among the Blackwell-era additions alongside ENDCOLLECTIVE, PREXIT, SETMAXREG, and SETSMEMSIZE. The ELECT opcode is used for leader-based algorithms where only one thread per warp performs a shared operation.

Asynchronous Barriers (MBARRIER)

Introduced with sm90 (Hopper), mbarrier provides hardware-accelerated asynchronous barriers in shared memory. These are critical for async copy (cp.async.bulk), TMA operations, and warpgroup-level synchronization.

MBARRIER Operation Classification

ptxas classifies mbarrier operations through sub_A94440 (MBarrierDetector, 412 bytes binary) and sub_A9A5F0 (MBarrierClassifier). The classifier resolves the mbarrier symbol name (prefix %mbarrier_) and returns an operation type enum:

Type ID | Suffix | Operation
0 | INIT | Initialize barrier object in shared memory
1 | ARRIVE | Signal arrival (non-blocking)
2 | ARRIVE_NOCOMPLETE | Arrive without completing the phase
3 | ARRIVE_DROP | Arrive and decrement expected count
4 | ARRIVE_DROP_NOCOMPLETE | Arrive, drop, no completion
5 | TEST_WAIT | Test if barrier phase is complete
6 | TEST_WAIT_PARITY | Phase-parity-based completion test
7 | CP_ASYNC_ARRIVE | cp.async arrival notification
8 | INVAL | Invalidate barrier
9 | TRY_WAIT | Wait with timeout
10 | TRY_WAIT_PARITY | Phase-parity-based wait with timeout
11 | EXPECT_TX | Set expected transaction byte count
12 | COMPLETE_TX | Mark transaction bytes as complete

The "trivial" mbarrier operations (types 0--8) are handled inline; "non-trivial" operations (types 9--12, including EXPECT_TX and the extended TRY_WAIT variants) require more complex lowering.

Mbarrier Symbol Naming

Mbarrier objects are tracked through shared memory symbols following the pattern:

%mbarrier_{basename}_{operation}

The resolver at sub_A9A920 extracts the base name from the full symbol (e.g., %mbarrier_pipeline0_ARRIVE yields base name pipeline0). The format string "%%mbarrier_%s_%s" at sub_AA33C0 is used for symbol construction during mbarrier expansion.
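
A sketch of the symbol construction and base-name extraction. The split against a known operation list is an assumption of this sketch (needed because operation suffixes like ARRIVE_DROP themselves contain underscores); how sub_A9A920 actually disambiguates is not recovered here:

```python
# Operation suffixes from the classifier table above, longest first so that
# e.g. ARRIVE_DROP_NOCOMPLETE is not mis-split as ARRIVE_DROP + _NOCOMPLETE.
OPS = sorted(
    ["INIT", "ARRIVE", "ARRIVE_NOCOMPLETE", "ARRIVE_DROP",
     "ARRIVE_DROP_NOCOMPLETE", "TEST_WAIT", "TEST_WAIT_PARITY",
     "CP_ASYNC_ARRIVE", "INVAL", "TRY_WAIT", "TRY_WAIT_PARITY",
     "EXPECT_TX", "COMPLETE_TX"],
    key=len, reverse=True)

def make_mbarrier_symbol(base, op):
    # Mirrors the recovered format string "%%mbarrier_%s_%s"
    return "%%mbarrier_%s_%s" % (base, op)

def split_mbarrier_symbol(sym):
    body = sym.removeprefix("%mbarrier_")
    for op in OPS:
        if body.endswith("_" + op):
            return body[: -len(op) - 1], op
    raise ValueError(sym)

print(make_mbarrier_symbol("pipeline0", "ARRIVE"))               # %mbarrier_pipeline0_ARRIVE
print(split_mbarrier_symbol("%mbarrier_pipeline0_ARRIVE_DROP"))  # ('pipeline0', 'ARRIVE_DROP')
```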

Reserved shared memory regions for TMA pipeline mbarriers:

  • __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier
  • __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity

ExpandMbarrier (Phase 42)

Phase 42 expands mbarrier pseudo-instructions into native barrier sequences through architecture vtable dispatch:

mov    rdi, [rsi+0x630]     ; rdi = ctx->arch_backend (offset 1584)
mov    rax, [rdi]            ; rax = arch_backend->vtable
jmp    [rax+0x168]           ; call vtable[45] -- ExpandMbarrier impl

The expansion is architecture-specific:

  • Pre-sm90: No mbarrier pseudo-ops exist; the phase is a no-op
  • sm90 (Hopper): Expands to hardware mbarrier instruction sequences, resolves shared memory addresses, inserts fence.proxy for coherence
  • sm100+ (Blackwell): Extended semantics for tcgen05.fence, cluster-level barriers, and async pipeline operations

Expansion pattern:

Before (Ori pseudo-ops):           After (native SASS):
MBARRIER_INIT  %mbar, count       MBARRIER.INIT  [smem], count
MBARRIER_ARRIVE_EXPECT_TX          MBARRIER.ARRIVE.EXPECT_TX  [smem], bytes
  %mbar, bytes
CP.ASYNC.BULK.TENSOR               CP.ASYNC.BULK.TENSOR  [dst], [src], [smem]
  dst, src, %mbar
MBARRIER_TRY_WAIT_PARITY           MBARRIER.TRY_WAIT.PARITY  pred, [smem],
  %mbar, parity, pred                parity

EIATTR Metadata

EIATTR | Purpose
EIATTR_NUM_MBARRIERS | Count of mbarrier objects used by the kernel
EIATTR_MBARRIER_INSTR_OFFSETS | Byte offsets of mbarrier instructions for driver patching

Blackwell CGA Barriers

On sm100+ (Blackwell), a new class of CGA (Cooperative Grid Array) barriers extends the mbarrier concept to cluster-level synchronization:

SASS Opcode | Purpose
CGABAR_ARV | CGA barrier arrive
CGABAR_GET | Query CGA barrier state
CGABAR_SET | Set CGA barrier parameters
CGABAR_WAIT | Wait on CGA barrier
CGAERRBAR | CGA error barrier

Atomic/Reduction Intrinsic Lowering

The OCG atomic/reduction handler at sub_6C0D90 (19KB, 813 decompiled lines) lowers atom.* and red.* intrinsic calls into Ori IR opcode 314 instructions. Unlike the warp-level sync handlers (which generate inline PTX via sprintf), this function works at the Ori IR level directly: it parses a sub-op parameter array, validates all qualifier combinations, resolves operands to register references, and emits the final instruction through sub_92C240. All diagnostics use error code 7308 and the prefix "Instrinsic - \"%s\"" (the typo is in the binary).

Parameter Array Parsing

The intrinsic name is decomposed by the OCG name parser (sub_6C9BC0) into an integer token array stored at *(ctx+10704). Each token encodes one qualifier dimension of the atomic operation. The handler reads tokens sequentially through a switch-case loop:

Token | Variable | Semantic | Decoded Value
0 | memory_order=4 | Memory ordering | Relaxed
1 | domain=12, is_mmio=1 | Memory domain | MMIO (global, mapped)
2 | domain=5 | Memory domain | Shared (_shared)
3 | scope=5 | Visibility scope | .cta
4 | scope=6 | Visibility scope | .gpu
5 | is_noreturn=1 | Return behavior | Reduction (fire-and-forget, no return value)
6 | data_size=2 | Operand width | 64-bit (u64)
7 | data_size=4 | Operand width | Vector (v2/v4)
8 | data_type=12 | Data type | .f32
9 | data_type=11 | Data type | .s32
10 | data_type=10 | Data type | .u32 (also .u64 when size=2)
11 | op=0 | Operation | .add
12 | op=1 | Operation | .min
13 | op=2 | Operation | .max
14 | op=3 | Operation | .inc
15 | op=4 | Operation | .dec
16 | op=5 | Operation | .and
17 | op=6 | Operation | .or
18 | op=7 | Operation | .xor

Tokens not matching any case are silently skipped. If the parameter array is empty (no tokens), all values take defaults: data_type=1 (unspecified), op=-1 (unspecified), data_size=1 (32-bit), and all flags zero.

Modifier Word Encoding

The parsed qualifiers are packed into a single 32-bit modifier word that accompanies the Ori instruction through the pipeline to ISel:

Bit [14:13] = type encoding:  00=u32  01=s32  10=u64/f32  11=invalid
Bit [12:10] = operation:      0=add  1=min  2=max  3=inc  4=dec  5=and  6=or  7=xor
Bit [8]     = no-return:      1=reduction (red.*)  0=atomic (atom.*)
Bit [7:5]   = memory order:   4=relaxed (only value supported here)
Bit [4:2]   = scope:          5=cta  6=gpu
Bit [1:0]   = operand flags:  bit0=(addr_type==u32)  bit1=(data_type==u32)
Top nibble  = 0x6 (constant marker: 0x60000000)

The type encoding bits [14:13] are set during cross-validation: s32 with 32-bit width sets 0x2000, u32/f32 with 64-bit width sets 0x4000, and invalid combinations set 0xE000 (with an error).
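
Packing the fields above makes the layout concrete. This is a sketch of the recovered layout only; that the unlisted bits stay zero is an assumption:

```python
def pack_modifier(type_bits, op, no_return, order, scope, flags):
    # Field positions follow the recovered layout; bits not listed there
    # are assumed zero in this sketch.
    word = 0x60000000                 # constant marker in the top nibble
    word |= (type_bits & 0x3) << 13   # 00=u32  01=s32  10=u64/f32
    word |= (op & 0x7) << 10          # add/min/max/inc/dec/and/or/xor
    word |= (no_return & 0x1) << 8    # 1 = red.*, 0 = atom.*
    word |= (order & 0x7) << 5        # 4 = relaxed
    word |= (scope & 0x7) << 2        # 5 = cta, 6 = gpu
    word |= flags & 0x3               # operand-type flags
    return word

# atom-style (no-return=0), u32, op=add, relaxed order, .gpu scope, both flags:
print(hex(pack_modifier(0, 0, 0, 4, 6, 0b11)))   # 0x6000009b
# red-style (no-return=1), s32 (sets 0x2000 per the note above), op=min, .cta:
print(hex(pack_modifier(1, 1, 1, 4, 5, 0)))      # 0x60002594
```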

Validation Chain

The handler enforces a strict 10-phase validation sequence. Each failure emits error 7308 with a descriptive message:

Phase | Check | Error Message
1 | Domain must be _shared (5) or _mmio (12); otherwise global is assumed but errors if explicitly set to something else | "Domain param '_shared' or '_global' required"
2 | Vector type operand count must match expected operand count from data_size | "Vector type does not match number of subops"
3 | Data type must be explicitly set (not default 1) | "Type {u32, s32, u64} not specified"
4 | Vector width (v12>1) requires u32 (10) or f32 (12) type | "Vector supported only for {u32, u64}"
5 | Operation must be explicitly set (not default -1); emitted twice | "Op {add, min, max, inc, dec, and, or, xor} not specified"
6 | Shared-domain reductions only support .add | "Unsupported non _add global memory reduction"
7a | Scope without memory order is deprecated | "Deprecated scope without memory order semantics" (warning)
7b | Memory order requires scope | "Required scope with memory order semantics"
8 | MMIO semantics require global domain | "Domain param '_global' required for mmio semantics"
9 | s32 requires 32-bit; f32+64-bit only with add; otherwise invalid | "Invalid data type / op combination" or "Invalid vector / data type combination"
10 | Each data operand's type field must match declared type; address operand must be u32 (10) or f32 (12) | "Operand type does not match specified type" / "Unexpected instrinsic type (%s) in param (%d)"

Operand Resolution and Shared Memory Address Materialization

After validation, the handler resolves up to three operand slots:

  1. Destination/address: Resolved via sub_926370 into a 24-bit register ID, then tagged with 0x50000000 (output register class marker).

  2. Source data operand: Read from the operand descriptor array at *(ctx+10728). Routing depends on bits [30:28] of the operand word:

    • Value 5 (shared memory pointer): Allocates a temporary register in class 6 via sub_91BF30, then emits an Ori opcode 130 pseudo-instruction via sub_934630 to materialize the shared memory address into a general register. This extra LEA/MOV is necessary because ATOMS requires an explicit register operand, not a symbolic shared-memory reference.
    • Value 1 with !(operand[1] & 0x1000000): Direct register reference (24-bit register ID from low bits).
    • Otherwise: Full register resolution through sub_91D150 + operand legalization through sub_7DEFA0.
  3. Second data operand (MMIO only, v109=1): Same three-way resolution for the second source, reading from operand descriptor offset +12.

The final instruction is emitted as:

sub_92C240(output, ctx, 314, data_type, operand_count, operand_buffer, 1)

where Ori opcode 314 represents the unified ATOM/RED operation.

SASS Opcode Selection

The Ori opcode 314 instruction flows through the optimizer pipeline and reaches ISel (sub_C0EB10), which selects the final SASS opcode based on the domain and no-return flag encoded in the modifier word:

Modifier Bits | SASS Opcode | ROT13 | Table Entry | PTX Equivalent
domain=global, no-return=0 | ATOMG | NGBZT | 103 | atom.global.*
domain=shared, no-return=0 | ATOMS | NGBZF | 105 | atom.shared.*
domain=generic, no-return=0 | ATOM | NGBZ | 102 | atom.*
no-return=1 | RED | ERQ | 104 | red.*

The operation bits [12:10] further select the SASS sub-opcode qualifier (.ADD, .MIN, .MAX, .INC, .DEC, .AND, .OR, .XOR). The type bits [14:13] determine the data type suffix (.32, .64, .S32, .U32, .F32).

Scope and Memory Order (sm70+)

When scope and memory order are both present, the modifier word carries them through to ISel where they become SASS instruction modifiers:

  • Scope .cta (token 3, value 5): Atomic is visible only within the CTA
  • Scope .gpu (token 4, value 6): Atomic is visible to all thread blocks on the device
  • Memory order relaxed (token 0, value 4): No ordering guarantees beyond atomicity

The handler does not encode acquire, release, or acq_rel memory orders -- these are handled by the separate memory fence/order handler at sub_6C1CF0. The deprecation warning for scope-without-order indicates ptxas is transitioning toward requiring explicit memory order qualifiers for all scoped atomics.

Limitations and Notable Behavior

  1. No CAS/EXCH tokens: The parameter array parser has no tokens for .cas (compare-and-swap) or .exch (exchange). These operations are either encoded through a different OCG intrinsic or use a distinct sub-op encoding not visible in this function's switch-case.

  2. Shared-memory restriction: Only atom.shared.add is supported as a reduction. All other shared-memory reduction operations (red.shared.{min,max,and,or,xor}) are rejected.

  3. MMIO path: Token 1 (domain=MMIO) enables a separate code path that processes two data operands instead of one. This supports the MMIO atomic semantics where both the address and a data value must be explicitly resolved.

  4. Error message bug: The message "Unsupported non _add global memory reduction" fires for shared-memory non-add reductions despite saying "global". This is likely a copy-paste artifact in the ptxas source.

Synchronization Pipeline Summary

The complete synchronization processing pipeline spans 8 optimizer phases:

Phase | Name | Category | Purpose
25 | StageAndFence | Lowering | Insert fences after loop unrolling
26 | OriRemoveRedundantBarriers | Optimization | Dataflow-driven barrier elimination (multi-function)
42 | ExpandMbarrier | Lowering | Expand mbarrier pseudo-ops via arch vtable
71 | OptimizeSyncInstructions | Optimization | Sync instruction redundancy elimination
72 | LateExpandSyncInstructions | Lowering | Final sync pseudo-op expansion to SASS
99 | (Architecture-specific) | Lowering | Post-RA sync expansion
100 | (Architecture-specific) | Lowering | Architecture vtable dispatch for sync
114 | (Post-scheduling) | Fixup | Post-scheduling dependency barrier fixup

The progression is: early fence insertion (25) -> cross-function barrier elimination (26) -> mbarrier expansion (42) -> optimization within partial-SSA (71) -> final expansion (72) -> post-RA architecture hooks (99, 100) -> post-scheduling fixup (114).

Ori IR Opcode 130 -- Sync Analysis Target

The optimizer phases 26 and 71 identify synchronization instructions by checking for Ori opcode 130 (HSET2 in the ROT13 name table; used as an internal Ori IR marker for barrier/sync instructions -- actual SASS BAR is opcode 61, MEMBAR is opcode 111). For each barrier instruction found:

  1. Extract the destination operand from field+84
  2. Resolve the register through the register table at context+88
  3. Test whether the register's use-count (reg+24) indicates consumption
  4. If the barrier result is dead (no thread observes it before the next dominating barrier), eliminate the barrier
  5. At block boundaries, attempt to merge barriers with compatible scopes

Knobs and Gating

| Knob / Flag | Effect |
|---|---|
| Knob 487 | Master gate for sync optimization (checked by phases 25, 71, 72) |
| Knob 358 | Sync mode selector (0=none, 1=conservative, 2=aggressive, 3+=full multi-function) |
| Knob 472 | Barrier liveness analysis mode |
| context+1368 bit 0 | Global sync optimization enable |
| context+1397 bits[6:7] | Architecture-specific sync configuration |
| context+1398 bits[3:4] | Sync expansion mode (architecture-dependent) |
| DisableErrbarAfterMembar | Suppress error barrier insertion after membar |

SASS Opcode Summary Table

Complete mapping of all synchronization and warp SASS opcodes, with their ROT13-encoded internal names as found in the ptxas binary:

| ROT13 (Binary) | SASS Opcode | Table Offset | Category |
|---|---|---|---|
| IBGR | VOTE | 4600 | Warp vote |
| IBGRH | VOTEU | -- | Uniform warp vote (sm100+) |
| ONE | BAR | 5160 | Thread-block barrier |
| ONE_VAQRKRQ | BAR_INDEXED | -- | Indexed barrier variant |
| REEONE | ERRBAR | 4184 | Error barrier |
| QRCONE | DEPBAR | -- | Dependency barrier (scoreboard) |
| ZNGPU | MATCH | -- | Warp match |
| ZRZONE | MEMBAR | -- | Memory barrier |
| JNECFLAP | WARPSYNC | -- | Warp synchronize |
| OFLAP | BSYNC | -- | Convergence barrier sync |
| OFFL | BSSY | -- | Convergence barrier set |
| E2O | R2B | 5128 | Register to barrier transfer |
| -- | ELECT | -- | Warp lane election (sm75+) |
| -- | NANOSLEEP | -- | Nanosleep hint |
| -- | FENCE_G | -- | Global fence (sm100+) |
| -- | FENCE_S | -- | Shared fence (sm100+) |
| -- | FENCE_T | -- | Texture fence (sm100+) |
| -- | CGABAR_ARV | -- | CGA barrier arrive (sm100+) |
| -- | CGABAR_GET | -- | CGA barrier query (sm100+) |
| -- | CGABAR_SET | -- | CGA barrier set (sm100+) |
| -- | CGABAR_WAIT | -- | CGA barrier wait (sm100+) |
| -- | CGAERRBAR | -- | CGA error barrier (sm100+) |
| -- | SYNCS_BASIC | -- | Basic sync (sm100+) |
| -- | SYNCS_LD_UNIFM | -- | Sync with uniform load (sm100+) |
| NGBZ | ATOM | 102 | Atomic (generic address space) |
| NGBZT | ATOMG | 103 | Atomic (global memory) |
| ERQ | RED | 104 | Reduction (fire-and-forget) |
| NGBZF | ATOMS | 105 | Atomic (shared memory) |
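
The binary's internal name table stores these opcode names ROT13-obfuscated; a standard ROT13 pass (letters rotated, digits and punctuation passed through) recovers the pairs in the table, e.g. IBGR -> VOTE and E2O -> R2B:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Decode a ROT13-obfuscated opcode name in place. Letters rotate by 13;
 * digits and punctuation ('2', '_') are left unchanged. */
void rot13(char *s)
{
    for (; *s; s++) {
        if (isupper((unsigned char)*s))
            *s = 'A' + (*s - 'A' + 13) % 26;
        else if (islower((unsigned char)*s))
            *s = 'a' + (*s - 'a' + 13) % 26;
    }
}
```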

Key Function Reference

| Address | Identity | Size | Purpose |
|---|---|---|---|
| sub_580E50 | VoteCodegen | ~3.2KB | PTX vote.* to inline PTX body |
| sub_5801D0 | ShflCodegen | ~3.3KB | PTX shfl.* to inline PTX body |
| sub_58A730 | MatchCodegen | ~4.5KB | PTX match.* to inline PTX body |
| sub_567680 | ReduxCodegen | ~2.0KB | PTX redux.* to inline PTX body |
| sub_524FB0 | BarSyncCodegen | ~1.8KB | PTX bar.sync / bar |
| sub_570290 | BarrierCodegen | ~2.5KB | PTX barrier.* (PTX 8.0) |
| sub_500BF0 | BarArriveCodegen | ~1.2KB | PTX bar.arrive |
| sub_570940 | BarrierArriveCodegen | -- | PTX barrier.arrive |
| sub_52D590 | BarRedCodegen | ~1.5KB | PTX bar.red.{and,or,popc} |
| sub_5889B0 | BarrierRedCodegen | ~4.8KB | PTX barrier.red |
| sub_56A5A0 | BarWarpCodegen | ~1.9KB | PTX bar.warp.sync |
| sub_4DB410 | MembarCodegen | ~0.8KB | PTX membar.* |
| sub_A9A410 | isSM70WarpSync | 194B | Intrinsic prefix detection |
| sub_A94440 | isNonTrivialMBarrier | 412B | Mbarrier operation classifier |
| sub_A9A5F0 | classifyMBarrier | -- | Mbarrier type enum resolver |
| sub_A9A920 | resolveMBarrierBaseName | -- | Extract mbarrier base name from symbol |
| sub_AA33C0 | constructMBarrierSymbol | -- | Build %%mbarrier_%s_%s symbol |
| sub_1392E30 | StageAndFence (phase 25) | 166B entry | Post-unroll fence insertion |
| sub_1390B30 | StageAndFence core | 8,956B | Main fence insertion logic |
| sub_790A40 | RemoveRedundantBarriers | 2,288B | Cross-function barrier elimination |
| sub_90A340 | OptimizeSyncInstructions | 1,670B | Sync instruction optimization |
| sub_1381DA0 | LateExpandSync | 1,517B | Final sync expansion |
| sub_6C0D90 | LowerAtomicRedIntrinsic | ~19KB | OCG atom.*/red.* to Ori opcode 314 |
| sub_C0EB10 | InstructionSelector | 185KB | Main ISel (handles all sync opcodes) |

Custom ELF Emitter

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas builds its ELF/cubin output without libelf or any external ELF library. The entire ELF construction pipeline is a custom implementation spread across approximately 20 functions in the 0x1C99--0x1CD1 address range, totaling roughly 300 KB of binary code. At the center is a 672-byte in-memory object called the "ELF world" (ELFW), which owns all sections, symbols, and string tables. The emitter writes standard ELF headers with NVIDIA extensions: EM_CUDA (0xBE / 190) as the machine type, NVIDIA-specific section types (SHT_CUDA_INFO = 0x70000064), and CUDA-specific ELF flags encoding the SM architecture version. The design handles both 32-bit and 64-bit ELF classes, with the class byte at ELF offset 4 set to '3' (32-bit) or 'A' (64-bit). Finalization is a single-pass algorithm that orders sections into 8 priority buckets, assigns file offsets with alignment, and handles the ELF extended section index mechanism (SHN_XINDEX) when the section count exceeds 65,280.

| Component | Details |
|---|---|
| ELFW constructor | sub_1CB53A0 (3,480 bytes, 672-byte object) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 callers) |
| Text section creator | sub_1CB42D0 (SHT_PROGBITS, SHF_ALLOC \| SHF_EXECINSTR) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes, ~1,700 lines) |
| Master ELF emitter | sub_1C9F280 (15,263 bytes, 97 KB decompiled) |
| Section layout calculator | sub_1C9DC60 (5,663 bytes, 29 KB decompiled) |
| Symbol fixup | sub_1CB2CA0 (2,038 bytes) |
| Section index remap | sub_1C99BB0 (4,900 bytes) |
| ELF structure dumper | sub_1CB91C0 (2,668 bytes) |
| File serializer | sub_1CD13A0 (2,541 bytes) |
| Cubin entry point | sub_612DE0 (47 KB, called from sub_446240) |
| ELF machine type | EM_CUDA = 0xBE (190) |
| CUDA section type | SHT_CUDA_INFO = 0x70000064 |
| ELF magic | 0x464C457F (\x7fELF) |
| Memory pool | "elfw memory space" (4,096-byte initial) |

Architecture

sub_446240 (real main -- compilation driver)
  |
  v
sub_612DE0 (47KB -- cubin generation entry)
  |  Parses: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
  |  Sets up setjmp/longjmp error recovery
  |  Handles recursive self-call for nested finalization
  |
  +-- sub_1CB53A0 -------- ELFW constructor (672-byte object)
  |     |  Creates "elfw memory space" pool (4096 initial)
  |     |  Writes ELF magic 0x464C457F into header
  |     |  Sets e_machine = 0xBE (EM_CUDA)
  |     |  Sets EI_CLASS = 32-bit ('3') or 64-bit ('A')
  |     |  Creates 7 standard sections:
  |     |    .shstrtab, .strtab, .symtab, .symtab_shndx,
  |     |    .note.nv.tkinfo, .note.nv.cuinfo, .nv.uft.entry
  |     v
  +-- sub_1CB3570 x N ---- Section creator (44 call sites)
  |     |  Creates section: name, type, flags, link, info, align, entsize
  |     |  Auto-creates .rela/.rel companion for executable sections
  |     v
  +-- sub_1CB42D0 --------- .text.<funcname> section creator
  |     |  SHT_PROGBITS, SHF_ALLOC | SHF_EXECINSTR
  |     |  One section per kernel/function
  |     v
  +-- sub_1CB68D0 --------- Symbol table builder (~1700 lines)
  |     |  Iterates internal symbol list
  |     |  Filters deleted symbols
  |     |  Handles __cuda_syscall special symbol
  |     |  Manages SHN_XINDEX overflow (>= SHN_LORESERVE)
  |     |  Builds .symtab_shndx extended index table
  |     v
  +-- sub_1CB2CA0 --------- Symbol fixup
  |     |  Renumbers symbols after section deletion
  |     |  Creates missing section symbols
  |     v
  +-- sub_1C99BB0 --------- Section index remap
  |     |  Reindexes sections after dead elimination
  |     |  Remaps .symtab_shndx / .nv.merc.symtab_shndx
  |     v
  +-- sub_1C9DC60 --------- Section layout calculator (29KB)
  |     |  Computes section offsets and sizes
  |     |  Skips .nv.constant0, .nv.reservedSmem
  |     |  Handles .debug_line special padding
  |     v
  +-- sub_1C9F280 --------- Master ELF emitter (97KB -- largest function)
  |     |  Copies ELF header (64 bytes via SSE loadu)
  |     |  Iterates sections via sub_1CB9FF0 / sub_1CB9C40
  |     |  Skips virtual sections (flag & 4)
  |     |  Patches ELF flags (SM version, ELFCLASS32/64)
  |     |  Handles program headers
  |     |  Embeds Mercury capsule if capmerc mode
  |     |  Processes debug sections
  |     v
  +-- sub_1CD13A0 --------- File serializer
        |  Iterates sections, writes with alignment padding
        |  Validates sizes: "section size mismatch"
        |  Handles 32-bit and 64-bit ELF formats
        v
      OUTPUT: .cubin / .o file on disk

ELFW Object -- sub_1CB53A0

The ELFW constructor allocates and initializes a 672-byte object that serves as the central data structure for the entire ELF construction pipeline. Every section, symbol, and string table lives under this object. The constructor is called exactly once per compilation unit.

Construction Sequence

// sub_1CB53A0 -- ELFW constructor (simplified)
void* elfw_init(int elf_class, int sm_version) {
    // 1. Allocate 672-byte ELFW object from pool allocator
    void* elfw = sub_424070(672);

    // 2. Create dedicated memory pool
    sub_4258D0("elfw memory space", 0, 4096);

    // 3. Write ELF header
    *(uint32_t*)(elfw + 0) = 0x464C457F;   // e_ident[EI_MAG0..3] = "\x7fELF"
    *(uint8_t*)(elfw + 4)  = (elf_class == 64) ? 'A' : '3';  // EI_CLASS
    *(uint16_t*)(elfw + MACHINE_OFF) = 0xBE; // e_machine = EM_CUDA (190)

    // 4. Initialize section/symbol/string containers
    init_section_table(elfw);
    init_symbol_table(elfw);
    init_string_table(elfw);

    // 5. Create 7 mandatory sections
    add_section(elfw, ".shstrtab",        SHT_STRTAB,   0, ...);
    add_section(elfw, ".strtab",          SHT_STRTAB,   0, ...);
    add_section(elfw, ".symtab",          SHT_SYMTAB,   0, ...);
    add_section(elfw, ".symtab_shndx",    SHT_SYMTAB_SHNDX, 0, ...);
    add_section(elfw, ".note.nv.tkinfo",  SHT_NOTE,     0, ...);
    add_section(elfw, ".note.nv.cuinfo",  SHT_NOTE,     0, ...);
    add_section(elfw, ".nv.uft.entry",    SHT_PROGBITS, 0, ...);

    return elfw;
}

The ELFW object stores:

  • The ELF header (first 64 bytes for 64-bit class, 52 for 32-bit)
  • A section table (dynamic array of section descriptors)
  • A symbol table (dynamic array of symbol entries)
  • String tables for section names (.shstrtab) and symbol names (.strtab)
  • Metadata for relocation processing and section ordering

ELFW Object Layout (672 bytes)

The 672-byte ELFW object divides into 13 regions. Offsets 0--63 overlay a standard ELF header (whose internal layout depends on ELF class). All pointer-sized fields are 8 bytes (the allocator returns 8-byte-aligned memory). The v17 variable in the decompilation is a uint64_t*, so v17[N] = byte offset N * 8.

Region 1 -- ELF Header Embed (bytes 0--63)

The ELF header is stored inline at the start of the ELFW object. Field positions within it vary by class (32-bit vs 64-bit), matching the standard Elf32_Ehdr / Elf64_Ehdr layout, except that EI_CLASS and EI_OSABI use non-standard CUDA values.

| Offset | Size | Name | Evidence |
|---|---|---|---|
| 0 | 4B | e_ident[EI_MAG0..3] | *(_DWORD*)v17 = 0x464C457F |
| 4 | 1B | e_ident[EI_CLASS] | (v11 != 0) + 1: 1 = ELFCLASS32, 2 = ELFCLASS64 |
| 5 | 1B | e_ident[EI_DATA] | Hardcoded 1 (little-endian) |
| 6 | 1B | e_ident[EI_VERSION] | Hardcoded 1 (EV_CURRENT) |
| 7 | 1B | e_ident[EI_OSABI] | 0x41 ('A') for 64-bit cubin, 0x33 ('3') for 32-bit |
| 8 | 1B | e_ident[EI_ABIVERSION] | Constructor parameter a3 |
| 16 | 2B | e_type | Constructor parameter a1 (cast to uint16) |
| 18 | 2B | e_machine | Hardcoded 0x00BE (EM_CUDA = 190) |
| 62 | 2B | e_shstrndx | *(_WORD*)(v17 + 31) -- set to .shstrtab section index |

For 32-bit class (EI_CLASS = 1):

| Offset | Size | Name | Dumper accessor |
|---|---|---|---|
| 20 | 4B | e_version | *(_DWORD*)(v17 + 5) |
| 28 | 4B | e_phoff | *(_DWORD*)(a1 + 28) |
| 32 | 4B | e_shoff | *(_DWORD*)(a1 + 32) |
| 36 | 4B | e_flags | *(_DWORD*)(a1 + 36) -- dumper prints "flags=%x" |
| 44 | 2B | e_phnum | *(uint16*)(a1 + 44) -- dumper prints "phnum" |
| 48 | 2B | e_shnum | *(uint16*)(a1 + 48) -- dumper prints "shnum" |

For 64-bit class (EI_CLASS = 2):

| Offset | Size | Name | Dumper accessor |
|---|---|---|---|
| 32 | 8B | e_phoff | *(_QWORD*)(a1 + 32) -- dumper prints "phoff=%llx" |
| 40 | 8B | e_shoff | *(_QWORD*)(a1 + 40) -- dumper prints "shoff=%llx" |
| 48 | 4B | e_flags | *(_DWORD*)(a1 + 48) -- dumper prints "flags=%x" |
| 56 | 2B | e_phnum | *(uint16*)(a1 + 56) -- dumper prints "phnum" |
| 60 | 2B | e_shnum | *(uint16*)(a1 + 60) -- dumper prints "shnum" |

Region 2 -- Metadata and Flags (bytes 64--107)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 64 | 1B | debugMode | a8 parameter: deviceDebug |
| 68 | 4B | compilationFlags | rawOptions & 0x70000 -- preserved option bits 16--18 |
| 72 | 4B | smVersion | a4 parameter: SM architecture number (e.g., 100 for Blackwell) |
| 76 | 4B | rawOptions | Full options bitmask a9, possibly OR'd with 0x80000 for relocatable |
| 80 | 1B | lineInfoMode | a6 parameter: lineInfo |
| 82 | 1B | is32bit | Dumper gate: controls 32-bit vs 64-bit format in sub_1CB91C0 |
| 83 | 1B | hasSymbolRemap | Set to *(WORD*)(v17 + 42) != 0 |
| 84 | 1B | flag_relocatable | rawOptions & 1 |
| 85 | 1B | flag_executable | (rawOptions & 2) != 0 |
| 86 | 1B | flag_PIC | (rawOptions & 0x200) != 0 -- position-independent code |
| 87 | 1B | flag_bit2 | (rawOptions & 4) != 0 |
| 88 | 1B | flag_bit3 | (rawOptions & 8) != 0 |
| 89 | 1B | flag_relocOrBit4 | (rawOptions >> 4) & 1, forced to 1 if relocatable mode |
| 90 | 1B | flag_bit5 | (rawOptions & 0x20) != 0 |
| 91 | 1B | flag_bit14 | (rawOptions & 0x4000) != 0 |
| 92 | 1B | flag_bit6 | (rawOptions & 0x40) != 0 |
| 93 | 1B | flag_byte1_bit0 | BYTE1(rawOptions) & 1 -- bit 8 of options |
| 94 | 1B | flag_archGuard | (a5 > 0x45) & (rawOptions >> 7) -- arch-gated feature |
| 96 | 1B | flag_bit11 | (rawOptions & 0x800) != 0 |
| 99 | 1B | flag_notBit12 | ((rawOptions >> 12) ^ 1) & 1 -- inverted bit 12 |
| 100 | 1B | flag_bit13 | (rawOptions & 0x2000) != 0 |
| 101 | 1B | highClass | (a9 & 0x8000) != 0 -- selects 64-bit header variant with wider ELF fields |

Region 3 -- Inline String Tables (bytes 108--171)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 108 | 32B | shstrtab | Section header string table, initialized via sub_1CB0530(v17 + 108, 1000) |
| 140 | 32B | strtab | Symbol name string table, initialized via sub_1CB0530(v17 + 140, 2000) |

These are 32-byte inline structures (not heap pointers). sub_1CB0530 initializes them with the given initial capacity (1000 and 2000 bytes respectively). The .shstrtab is also referenced by sub_1CA6650 during .note.nv.cuinfo attribute injection.
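
A plausible minimal model of such a growable ELF string table is sketched below. Only the initial capacities (1000/2000) and the role of the structure come from the decompilation; the layout, growth policy, and every name here (StrTab, strtab_add, ...) are assumptions for illustration. Offset 0 conventionally holds the empty string, so the first real name lands at offset 1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of an ELF string table like the inline shstrtab /
 * strtab structures at ELFW+108 and ELFW+140. */
typedef struct {
    char  *data;
    size_t used, cap;
} StrTab;

void strtab_init(StrTab *st, size_t cap)
{
    st->data = calloc(1, cap);
    st->cap  = cap;
    st->data[0] = '\0';   /* index 0 is the mandatory ELF empty string */
    st->used = 1;
}

/* Append a NUL-terminated name; return its byte offset, i.e. the value
 * that would be stored in a section's sh_name or a symbol's st_name. */
size_t strtab_add(StrTab *st, const char *name)
{
    size_t len = strlen(name) + 1;
    while (st->used + len > st->cap) {            /* grow by doubling */
        st->cap *= 2;
        st->data = realloc(st->data, st->cap);
    }
    size_t off = st->used;
    memcpy(st->data + off, name, len);
    st->used += len;
    return off;
}
```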

Region 4 -- Section Index Cache (bytes 196--215)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 200 | 2B | strtabSecIdx | .strtab section index -- *(_WORD*)(v17 + 101) = v54 |
| 202 | 2B | symtabSecIdx | .symtab section index -- *(_WORD*)(v17 + 102) = v56 |
| 204 | 2B | xindexSecIdx | .symtab_shndx section index -- *(_WORD*)(v17 + 103) |
| 206 | 2B | cuinfoSecIdx | .note.nv.cuinfo section index -- *(_WORD*)(v17 + 104) |
| 208 | 2B | tkinfoSecIdx | .note.nv.tkinfo section index -- *(_WORD*)(v17 + 105) |

These cached indices avoid repeated linear scans of the section table when cross-referencing sections (e.g., .symtab's sh_link must point to .strtab).

Region 5 -- Sorted Maps and Counters (bytes 288--327)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 288 | 8B | sortedMap_A | Red-black tree via sub_425CA0, initial capacity 512 |
| 296 | 8B | sortedMap_B | Red-black tree via sub_425CA0, initial capacity 512 |
| 304 | 4B | mapCount | Counter for sorted maps, cleared to 0 |
| 312 | 8B | countPair | Packed 0x100000000 = high DWORD 1, low DWORD 0 |
| 320 | 4B | activeFlag | Set to 1 during initialization |

Region 6 -- Section and Symbol Containers (bytes 344--419)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 344 | 8B | sectionList_A | Indexed vector (cap=64) via sub_1CD2F90. Dumper: sub_1CD3060(*(a1+344)) returns symbol count |
| 352 | 8B | sectionList_B | Indexed vector (cap=64). Dumper: sub_1CD3060(*(a1+352)) returns secondary count |
| 360 | 8B | sectionList_C | Indexed vector (cap=64). Dumper iterates *(a1+360) for section dump loop |
| 368 | 8B | secIndexMap | Virtual-to-real section index map. Dumper: *(a1+368) + 4*v11 for reverse lookup |
| 376 | 8B | relocList | Linked list head for relocations. Dumper: for (k = *(a1+376); k; k = *k) |
| 392 | 8B | nvinfoList | Linked list head for .nv.info entries. Dumper: for (j = *(a1+392); j; j = *j) |
| 408 | 8B | auxVector | Indexed vector (cap=32) via sub_1CD2F90 |
| 416 | 4B | auxCount | Counter for auxVector, cleared to 0 |

sub_1CD2F90 creates an indexed vector (growable array with count/capacity tracking). sub_1CD3060 returns the element count; sub_1CD31F0 returns the element at the current iteration index.

Region 7 -- Deletion Remap Tables (bytes 456--479)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 456 | 8B | symDeleteMap | Symbol index remap table (positive indices). Dumper: *(a1+456) + 4*idx |
| 464 | 8B | symDeleteMapNeg | Symbol index remap table (negative indices). Dumper: *(a1+464) + 4*(-idx) |
| 472 | 8B | secDeleteMap | Section index remap table. Dumper: *(a1+472) + 4*idx |

After dead code elimination deletes sections and symbols, these tables map old indices to new indices. The negative-index variant handles symbols stored with inverted sign conventions (a ptxas-internal encoding for unresolved forward references).
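
The sign-split lookup the dumper performs can be modeled as below. The split into positive and negative arrays mirrors the recovered `*(a1+456) + 4*idx` and `*(a1+464) + 4*(-idx)` accesses; the struct and function names are invented for illustration.

```c
#include <assert.h>

/* Hypothetical model of the two symbol remap tables at ELFW+456/+464. */
typedef struct {
    const int *pos;   /* symDeleteMap:    indexed by old index >= 0 */
    const int *neg;   /* symDeleteMapNeg: indexed by -(old index)   */
} SymRemap;

/* Map an old symbol index to its post-deletion index. Negative input
 * indices use the inverted-sign encoding for unresolved forward refs. */
int remap_symbol(const SymRemap *m, int old_idx)
{
    return (old_idx >= 0) ? m->pos[old_idx] : m->neg[-old_idx];
}
```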

Region 8 -- Architecture State (bytes 488--495)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 488 | 8B | archState | Architecture descriptor pointer. Initialized via sub_1CD04F0 (relocatable) or sub_1CCEEE0 (non-relocatable). Fatal error "couldn't initialize arch state" on failure |

Region 9 -- Name Sets and Input Tracking (bytes 496--519)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 496 | 8B | sectionNameSet | Sorted set of well-known section name strings. Populated from static table off_2403A60 (22 entries ending at dword_2403B70) |
| 512 | 8B | inputFileList | Indexed vector (cap=8). First entry is a 16-byte descriptor: {ptr="<input>", arch_version, ...} |

Region 10 -- Hash Maps (bytes 520--567)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 520 | 8B | hashMap_A | Hash map via sub_42D150, initial capacity 16 |
| 528 | 8B | hashMap_B | Hash map (cap=16) |
| 536 | 8B | hashMap_C | Hash map (cap=16) |
| 544 | 8B | hashMap_D | Hash map (cap=16) |
| 552 | 8B | hashMap_E | Hash map (cap=16) |
| 560 | 8B | hashMap_F | Hash map (cap=16) |

Six hash maps initialized identically with sub_42D150(sub_427630, sub_4277B0, 0x10). The two function pointers are the hash function and equality comparator. These maps serve section/symbol lookups during the construction pipeline. The specific role of each map (by-name, by-type, etc.) requires tracing callers of the hash map accessors.

Region 11 -- Extended Index and Miscellaneous (bytes 576--607)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 576 | 8B | smallSortedSet | Sorted set via sub_425CA0, element size 8 |
| 592 | 8B | symtabShndxVec | .symtab_shndx data vector. Dumper: sub_1CD31F0(*(a1+592)) for SHN_XINDEX resolution |
| 600 | 8B | mercSymtabShndx | .nv.merc.symtab_shndx data vector. Dumper: *(a1+600) for Mercury SHN_XINDEX |

Region 12 -- Memory Pool and Tail (bytes 608--671)

| Offset | Size | Name | Purpose |
|---|---|---|---|
| 608 | 8B | memoryPool | "elfw memory space" pool pointer. Only set when (a9 & 0x400) != 0 |
| 616 | 8B | memoryPoolCursor | Pool allocation cursor |
| 624 | 4B | elfFormatVersion | sub_1C97990() return value -- ELF format version from global config |
| 664 | 8B | tailSentinel | v17[83] = 0 -- zeroed during init, marks end of object |

Visual Layout

 ELFW Object (672 bytes = 0x2A0)
 +---------+------+--------------------------------------------------+
 |  0x000  | 64B  | ELF Header Embed (Elf32_Ehdr or Elf64_Ehdr)     |
 +---------+------+--------------------------------------------------+
 |  0x040  | 44B  | Metadata + Option Flags (debugMode, smVersion,   |
 |         |      |   rawOptions, 16 boolean flags)                  |
 +---------+------+--------------------------------------------------+
 |  0x06C  | 64B  | Inline String Tables                             |
 |         |      |   +0x06C: shstrtab (32B, cap=1000)               |
 |         |      |   +0x08C: strtab  (32B, cap=2000)                |
 +---------+------+--------------------------------------------------+
 |  0x0AC  | 36B  | Section Index Cache + Padding                    |
 |         |      |   5 x uint16 section indices (.strtab, .symtab,  |
 |         |      |   .symtab_shndx, .cuinfo, .tkinfo)               |
 +---------+------+--------------------------------------------------+
 |  0x120  | 40B  | Sorted Maps + Counters                           |
 +---------+------+--------------------------------------------------+
 |  0x158  | 76B  | Section/Symbol Containers                        |
 |         |      |   3 indexed vectors (sections), index map,        |
 |         |      |   relocation list, nvinfo list, aux vector        |
 +---------+------+--------------------------------------------------+
 |  0x1C8  | 24B  | Deletion Remap Tables (sym+, sym-, sec)          |
 +---------+------+--------------------------------------------------+
 |  0x1E8  |  8B  | Architecture State Pointer                       |
 +---------+------+--------------------------------------------------+
 |  0x1F0  | 24B  | Name Sets + Input Tracking                       |
 +---------+------+--------------------------------------------------+
 |  0x208  | 48B  | Six Hash Maps (16-entry initial)                 |
 +---------+------+--------------------------------------------------+
 |  0x240  | 32B  | Extended Index Vectors + Small Sorted Set         |
 +---------+------+--------------------------------------------------+
 |  0x260  | 64B  | Memory Pool + Format Version + Tail              |
 +---------+------+--------------------------------------------------+

ELF Class Selection

The ELF class byte at offset 4 determines 32-bit vs 64-bit output format. ptxas uses non-standard values:

| EI_CLASS byte | Standard ELF | ptxas meaning |
|---|---|---|
| '3' (0x33) | n/a | 32-bit CUDA ELF |
| 'A' (0x41) | n/a | 64-bit CUDA ELF |

Standard ELF uses 1 (ELFCLASS32) and 2 (ELFCLASS64). The non-standard values '3' and 'A' are a CUDA-specific convention that identifies the binary as a cubin rather than a generic ELF. The CUDA driver recognizes these values during cubin loading.
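
A loader-side check for these values can be sketched as follows. This is our illustration of the convention described above, not recovered driver code; the function name is invented.

```c
#include <assert.h>
#include <stdint.h>

/* Classify a cubin's ELF class from e_ident, using ptxas's non-standard
 * EI_CLASS bytes. Returns 32 or 64 for a cubin, 0 for anything else
 * (including a generic ELF with standard ELFCLASS32/64 values 1 and 2). */
int cubin_elf_class(const uint8_t e_ident[8])
{
    /* magic bytes 0x7F 'E' 'L' 'F' = little-endian dword 0x464C457F */
    if (!(e_ident[0] == 0x7F && e_ident[1] == 'E' &&
          e_ident[2] == 'L'  && e_ident[3] == 'F'))
        return 0;
    switch (e_ident[4]) {
    case 0x33: return 32;   /* '3' -> 32-bit CUDA ELF */
    case 0x41: return 64;   /* 'A' -> 64-bit CUDA ELF */
    default:   return 0;    /* standard or unknown class byte */
    }
}
```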

Section Creator -- sub_1CB3570

The generic section creation function, called from 44 sites throughout the ELF construction pipeline. It accepts the full set of ELF section header parameters and optionally creates a companion relocation section.

// sub_1CB3570 -- add section to ELFW (simplified)
int add_section(void* elfw, const char* name, uint32_t type, uint64_t flags,
                uint32_t link, uint32_t info, uint64_t align, uint64_t entsize) {
    // 1. Add name to .shstrtab, get string table offset
    int name_idx = strtab_add(elfw->shstrtab, name);

    // 2. Allocate section descriptor, fill ELF section header fields
    section_t* sec = alloc_section(elfw);
    sec->sh_name    = name_idx;
    sec->sh_type    = type;
    sec->sh_flags   = flags;
    sec->sh_link    = link;
    sec->sh_info    = info;
    sec->sh_addralign = align;
    sec->sh_entsize = entsize;

    // 3. For executable sections, auto-create relocation companion
    if (flags & SHF_EXECINSTR) {
        char rela_name[256];
        snprintf(rela_name, sizeof(rela_name), ".rela%s", name);
        // -- or ".rel%s" depending on ELF class --
        section_t* rela = alloc_section(elfw);
        rela->sh_type = SHT_RELA;  // or SHT_REL
        rela->sh_link = symtab_index;
        rela->sh_info = sec->index;
    }

    return sec->index;
}

The assertion "adding function section after callgraph completed" fires if a section is added after the call graph analysis phase has already run. This enforces the ordering constraint: all .text.<funcname> sections must exist before dead code elimination and call graph construction begin.

Text Section Creator -- sub_1CB42D0

Creates a per-function code section with the naming convention .text.<funcname>:

| Field | Value |
|---|---|
| sh_type | SHT_PROGBITS (1) |
| sh_flags | SHF_ALLOC \| SHF_EXECINSTR (0x6) |
| Section name | .text.<funcname> |
| Companion | .rela.text.<funcname> (auto-created) |

Each kernel entry point and each device function gets its own .text section. This per-function section layout enables the linker (nvlink) to perform function-level dead code elimination and allows the CUDA driver to load individual kernels.

Symbol Table Builder -- sub_1CB68D0

The largest function in the ELFW subsystem at 9,578 bytes (approximately 1,700 decompiled lines). It constructs the .symtab section from the internal symbol representation, handling several CUDA-specific concerns.

Processing Steps

  1. Iterate internal symbol list -- walks the ELFW symbol container
  2. Filter deleted symbols -- skips entries marked deleted, emits "reference to deleted symbol" warning (12 occurrences of this check in the function)
  3. Handle __cuda_syscall -- special-cases the CUDA syscall dispatcher symbol, which serves as the entry point for device-side system calls (vprintf, malloc, __assertfail, etc.)
  4. Compute symbol values/sizes -- resolves virtual addresses from section offsets
  5. Create section symbols -- ensures every section has a corresponding STT_SECTION symbol
  6. Handle SHN_XINDEX overflow -- when the section index exceeds SHN_LORESERVE (0xFF00 = 65,280), the symbol's st_shndx field is set to SHN_XINDEX (0xFFFF) and the real index is stored in the .symtab_shndx table
  7. Build .symtab_shndx -- populates the extended section index table for overflow cases
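
Steps 6 and 7 follow the standard ELF escape for large section indices; a sketch of the per-symbol decision (our naming, standard constants):

```c
#include <assert.h>
#include <stdint.h>

#define SHN_LORESERVE 0xFF00u   /* 65,280 */
#define SHN_XINDEX    0xFFFFu

/* Decide what goes into a symbol's 16-bit st_shndx field and what goes
 * into its .symtab_shndx side-table entry, per steps 6-7 above. */
void emit_st_shndx(uint32_t real_index, uint16_t *st_shndx, uint32_t *shndx_entry)
{
    if (real_index >= SHN_LORESERVE) {
        *st_shndx    = (uint16_t)SHN_XINDEX; /* marker: see .symtab_shndx */
        *shndx_entry = real_index;           /* full 32-bit section index */
    } else {
        *st_shndx    = (uint16_t)real_index;
        *shndx_entry = 0;                    /* unused entries stay 0 */
    }
}
```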

Error Messages

| String | Condition |
|---|---|
| "reference to deleted symbol" | Symbol was deleted but still referenced (12 checks) |
| "ignore symbol %s in unused section" | Symbol in dead-eliminated section |
| "missing sec strtab" | String table not initialized |
| "missing std sections" | Standard sections (.shstrtab, .strtab, .symtab) missing |
| "overflow number of sections %d" | Section count exceeds ELF limits |

CUDA Syscall Functions

The __cuda_syscall symbol is the dispatcher for device-side system calls. The known syscall functions referenced throughout the ptxas binary:

| Syscall | Purpose |
|---|---|
| vprintf | Device-side formatted output |
| malloc | Device-side dynamic memory allocation |
| free | Device-side memory deallocation |
| __assertfail | Device assertion failure handler |
| __profile | Profiling counter increment |
| cnpGetParameterBuffer | Cooperative launch parameter access |

These are compiled as indirect calls through the __cuda_syscall dispatch mechanism. The symbol __cuda_syscall_32f3056bbb (observed in binary strings) is a hash-mangled variant used for linking.

Section Layout Calculator -- sub_1C9DC60

Computes file offsets and virtual addresses for all sections in the ELF. This is a multi-pass algorithm that respects alignment constraints and handles several special cases.

Special Section Handling

| Section | Treatment |
|---|---|
| .nv.constant0 | Skipped (handled separately by OCG constant bank allocation) |
| .nv.reservedSmem | Skipped (shared memory layout computed by master allocator sub_1CABD60) |
| .debug_line | Receives special alignment padding for DWARF line table requirements |

The layout calculator assigns offsets in section-table order, which itself is determined by the 8-bucket priority sort performed during finalization.

ELF Finalization -- sub_1C9F280

The master ELF emitter at 15,263 binary bytes (97 KB decompiled) is the single largest function in the post-codegen address range. It assembles the complete ELF output from the ELFW internal representation.

Execution Flow

  1. Copy ELF header -- 64 bytes transferred via SSE loadu (128-bit unaligned loads) for performance
  2. Iterate sections -- uses accessor pair sub_1CB9FF0 (section count) / sub_1CB9C40 (get section by index)
  3. Skip virtual sections -- sections with flag & 4 set have no file data (metadata-only)
  4. Filter .nv.constant0 -- detected via strstr(".nv.constant0"), handled by separate constant bank logic
  5. Copy section headers -- SIMD-width stride memcpy of section header entries
  6. Patch ELF flags -- mask 0x7FFFBFFF clears CUDA-specific flag bits, then sets SM version and relocatable/executable mode
  7. Emit program headers -- creates PT_LOAD segments for loadable sections
  8. Build symbol table -- delegates to sub_1CB68D0
  9. Resolve section indices -- handles cross-references between sections
  10. Embed Mercury capsule -- if capmerc mode, embeds the .nv.merc.* sections
  11. Process debug sections -- maps .debug_info, .debug_line, .debug_frame sections
  12. Error recovery -- uses _setjmp for non-local error exit on fatal corruption

Section Ordering -- 8 Priority Buckets

During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF. The bucket assignment ensures the correct layout for the CUDA driver's section scanner:

| Bucket | Typical Contents |
|---|---|
| 0 (highest) | ELF header pseudo-section, .shstrtab |
| 1 | .strtab, .symtab, .symtab_shndx |
| 2 | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | .text.<funcname> (code sections) |
| 4 | .nv.constant0.*, .nv.shared.*, .nv.local.* (data sections) |
| 5 | .rela.*, .rel.* (relocation sections) |
| 6 | .nv.info.*, EIATTR sections |
| 7 (lowest) | .debug_*, .nv.merc.* (debug/mercury metadata) |
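
The bucket assignment can be approximated with a name-based classifier like the one below. The bucket numbers come from the recovered table; the prefix tests are our reconstruction, not ptxas's exact predicate (which may also consult section type and flags). A stable sort by this key reproduces the output ordering.

```c
#include <assert.h>
#include <string.h>

/* Approximate 8-way bucket classifier for section ordering (our
 * reconstruction; prefix tests are assumptions). */
int section_bucket(const char *name)
{
    if (!strcmp(name, ".shstrtab"))                                return 0;
    if (!strcmp(name, ".strtab") || !strncmp(name, ".symtab", 7))  return 1;
    if (!strncmp(name, ".note.nv.", 9))                            return 2;
    if (!strncmp(name, ".text.", 6))                               return 3;
    if (!strncmp(name, ".nv.constant", 12) ||
        !strncmp(name, ".nv.shared", 10) ||
        !strncmp(name, ".nv.local", 9))                            return 4;
    if (!strncmp(name, ".rela", 5) || !strncmp(name, ".rel", 4))   return 5;
    if (!strncmp(name, ".nv.info", 8))                             return 6;
    return 7;   /* .debug_*, .nv.merc.*, everything else */
}
```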

Offset Assignment and Alignment

Each section's file offset is aligned to its sh_addralign value. The algorithm walks the sorted section list, advancing a running offset counter with alignment padding:

uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
    section_t* sec = sorted_sections[i];
    if (sec->sh_addralign > 1)
        offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
    sec->sh_offset = offset;
    offset += sec->sh_size;
}

Extended Section Index Handling (SHN_XINDEX)

When the total section count exceeds 65,280 (SHN_LORESERVE = 0xFF00), standard ELF e_shnum cannot hold the value. The emitter activates the SHN_XINDEX mechanism:

  1. Sets e_shnum = 0 in the ELF header
  2. Stores the real section count in sh_size of section header index 0 (the null section)
  3. Sets e_shstrndx = SHN_XINDEX (0xFFFF)
  4. Stores the real .shstrtab index in sh_link of section header index 0

This is the standard ELF extension for large section counts, and it is necessary for large CUDA programs with many kernels (each kernel generates at minimum a .text, .rela.text, .nv.info, and .nv.constant0 section).
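
The header-side encoding (steps 1-4 above) can be sketched as follows, using the standard ELF constants; the struct is a minimal stand-in for the real header and null-section records, and the names are ours.

```c
#include <assert.h>
#include <stdint.h>

#define SHN_LORESERVE 0xFF00u   /* 65,280 */
#define SHN_XINDEX    0xFFFFu

/* Minimal stand-in for the fields the SHN_XINDEX escape touches. */
typedef struct {
    uint16_t e_shnum, e_shstrndx;   /* ELF header fields             */
    uint64_t null_sh_size;          /* section header 0: sh_size     */
    uint32_t null_sh_link;          /* section header 0: sh_link     */
} ShCountEnc;

/* Encode section count and .shstrtab index, escaping through the null
 * section header when either value exceeds SHN_LORESERVE. */
void encode_section_counts(ShCountEnc *h, uint32_t shnum, uint32_t shstrndx)
{
    if (shnum >= SHN_LORESERVE) {
        h->e_shnum      = 0;         /* 1. real count doesn't fit      */
        h->null_sh_size = shnum;     /* 2. stash it in null sh_size    */
    } else {
        h->e_shnum      = (uint16_t)shnum;
        h->null_sh_size = 0;
    }
    if (shstrndx >= SHN_LORESERVE) {
        h->e_shstrndx   = (uint16_t)SHN_XINDEX;  /* 3. escape marker   */
        h->null_sh_link = shstrndx;              /* 4. real index      */
    } else {
        h->e_shstrndx   = (uint16_t)shstrndx;
        h->null_sh_link = 0;
    }
}
```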

Cubin Generation Entry -- sub_612DE0

The top-level ELF/cubin generation function at 47 KB. Called from the compilation driver sub_446240 after all per-kernel OCG passes complete. This function orchestrates the entire output pipeline.

Key Behaviors

  • Option parsing -- reads compilation flags: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
  • Fastpath optimization -- "Finalizer fastpath optimization" string indicates a fast path for cross-target finalization when no complex linking is needed
  • Version embedding -- writes "Cuda compilation tools, release 13.0, V13.0.88" and build ID "Build cuda_13.0.r13.0/compiler.36424714_0" into the cubin
  • Error recovery -- establishes its own setjmp/longjmp frame independent of the top-level driver's
  • Recursive self-call -- handles nested finalization for scenarios where the output pipeline must invoke itself (e.g., generating both a primary cubin and an embedded Mercury cubin simultaneously)

Symbol Fixup -- sub_1CB2CA0

Adjusts symbol indices after dead code elimination removes sections from the ELFW. When sections are deleted, all symbol references to those sections become stale and must be renumbered.

For each section in ELFW:
  If section lacks a STT_SECTION symbol:
    Create one
  If section has multiple STT_SECTION symbols:
    Emit "found multiple section symbols for %s"
  Renumber all symbol st_shndx values to match new section indices

Called from 4 sites, indicating it runs at multiple points during the output pipeline (after dead function elimination, after mercury section cloning, etc.).

Section Index Remap -- sub_1C99BB0

Handles the .symtab_shndx and .nv.merc.symtab_shndx extended index tables when section indices change. This is the companion to sub_1CB2CA0: while that function fixes symbol st_shndx fields, this one fixes the extended index tables that hold the real indices when SHN_XINDEX is in use.

ELF Structure Dumper -- sub_1CB91C0

Debug-mode function that prints a formatted dump of the ELFW internal state. Triggered by internal debug flags, not by any user-visible CLI option.

Output format:

elfw structure:
header: size=%d type=%d abiv=%d, flags=%x
  shnum=%d shoff=%d
  phnum=%d phoff=%d

section <v/r>:
  [idx] name  type  flags  offset  size  link  info  align  entsize

symbol <v/r>:
  [idx] name  value  size  bind  type  shndx

The <v/r> suffix indicates virtual (v) or real (r) mode, corresponding to whether the dump shows the in-memory intermediate state or the final file-ready values. Both 32-bit and 64-bit format strings are present.

File Serializer -- sub_1CD13A0

The final step: writes the assembled ELF binary to disk. Called from 2 sites (main cubin and Mercury capsule cubin).

Validation Checks

| Check | Error String |
|---|---|
| Section data size negative | "Negative size encountered" |
| Computed size != declared size | "section size mismatch" |
| Write failure | "writing file" (logged 12 times across write operations) |

The serializer handles both 32-bit and 64-bit ELF formats, adjusting section header entry sizes (40 bytes for 32-bit, 64 bytes for 64-bit) and symbol table entry sizes (16 bytes for 32-bit, 24 bytes for 64-bit).
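
These per-class entry sizes are exactly those of the standard ELF structures; assuming a Linux build with glibc's `<elf.h>` available, the arithmetic can be checked directly (the helper names are ours):

```c
#include <assert.h>
#include <elf.h>   /* glibc ELF structure definitions (assumed available) */

/* Per-class entry sizes the serializer writes, expressed via the
 * standard structures: 40/64 B section headers, 16/24 B symbols. */
unsigned shdr_entsize(int is64) { return is64 ? sizeof(Elf64_Shdr) : sizeof(Elf32_Shdr); }
unsigned sym_entsize(int is64)  { return is64 ? sizeof(Elf64_Sym)  : sizeof(Elf32_Sym);  }
```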

ELF Header Layout

The ELF header written by the ELFW constructor follows the standard ELF format with CUDA-specific overrides:

| Offset | Size | Field | Value |
|---|---|---|---|
| 0x00 | 4B | e_ident[EI_MAG0..3] | 0x7F 'E' 'L' 'F' (magic 0x464C457F) |
| 0x04 | 1B | e_ident[EI_CLASS] | 0x33 ('3', 32-bit) or 0x41 ('A', 64-bit) |
| 0x05 | 1B | e_ident[EI_DATA] | 0x01 (little-endian) |
| 0x06 | 1B | e_ident[EI_VERSION] | 0x01 (EV_CURRENT) |
| 0x07 | 1B | e_ident[EI_OSABI] | CUDA ABI version |
| 0x12 | 2B | e_machine | 0x00BE (EM_CUDA = 190) |
| 0x14 | 4B | e_version | 0x00000001 |
| 0x24 | 4B | e_flags | SM version + CUDA flags (masked via 0x7FFFBFFF) |

The e_flags field encodes the target SM architecture (e.g., sm_100 for Blackwell) and several CUDA-specific flags including relocatable object mode vs executable mode. The mask 0x7FFFBFFF clears bits 14 and 31, which are reserved CUDA control bits.
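
As a quick illustration of the masking step (the layout of the non-reserved bits is not spelled out here, only that bits 14 and 31 are cleared):

```c
#include <assert.h>
#include <stdint.h>

/* CUDA e_flags mask recovered from the emitter: clears reserved control
   bits 14 and 31, preserving everything else. */
#define CUDA_EFLAGS_MASK 0x7FFFBFFFu

static uint32_t mask_eflags(uint32_t raw_flags) {
    return raw_flags & CUDA_EFLAGS_MASK;
}
```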

NVIDIA-Specific Section Types

Beyond standard ELF section types, the emitter uses NVIDIA-defined types in the SHT_LOPROC--SHT_HIPROC range:

| Type Constant | Value | Section |
| --- | --- | --- |
| SHT_CUDA_INFO | 0x70000000 | .nv.info.* -- per-entry EIATTR attributes |
| SHT_CUDA_CALLGRAPH | 0x70000064 | .nv.callgraph -- inter-function call edges |
| SHT_CUDA_CONSTANT | (proc-specific) | .nv.constant0.* -- per-entry constant banks |

The magic constant 1879048292 appearing in the emitter decompilation is 0x70000064 -- the SHT_CUDA_CALLGRAPH type. Binary evidence in the Section Catalog below shows the .nv.info type itself is 0x70000000; nvlink's own emitter uses a different constant, which is the source of the earlier 0x70000064 attribution for SHT_CUDA_INFO.

Cross-References

Function Map

| Address | Size (binary) | Decompiled | Callers | Callees | Purpose |
| --- | --- | --- | --- | --- | --- |
| sub_1CB53A0 | 3,480 B | 13 KB | 1 | 25 | ELFW constructor (672-byte object) |
| sub_1CB3570 | 1,963 B | 10 KB | 44 | 13 | Section creator (.rela/.rel auto-create) |
| sub_1CB42D0 | -- | -- | -- | -- | .text.<funcname> section creator |
| sub_1CB68D0 | 9,578 B | 49 KB | 1 | 36 | Symbol table builder (.symtab) |
| sub_1CB2CA0 | 2,038 B | 8 KB | 4 | 11 | Symbol fixup (post-deletion renumbering) |
| sub_1C99BB0 | 4,900 B | 25 KB | 1 | 18 | Section index remap (.symtab_shndx) |
| sub_1C9DC60 | 5,663 B | 29 KB | 1 | 24 | Section layout calculator |
| sub_1C9F280 | 15,263 B | 97 KB | 1 | 42 | Master ELF emitter (assembles complete output) |
| sub_1CB91C0 | 2,668 B | 13 KB | 1 | 5 | ELF structure dumper (debug) |
| sub_1CD13A0 | 2,541 B | 11 KB | 2 | 6 | File serializer (final write to disk) |
| sub_612DE0 | ~12,000 B | 47 KB | 1 | -- | Cubin generation entry point |
| sub_1CB9FF0 | -- | -- | -- | -- | Section count accessor |
| sub_1CB9C40 | -- | -- | -- | -- | Get section by index |

Section Catalog & EIATTR

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

A CUDA cubin is a standard ELF container with NVIDIA-proprietary extensions. ptxas v13.0.88 populates it with approximately 4*(N+M) sections minimum for a program with N entry kernels and M device functions. Each section carries a specific kind of data -- SASS instructions, constant bank contents, relocation entries, per-kernel resource metadata (EIATTR), shared memory layout, debug information, or Mercury-encoded streams for deferred finalization. This page catalogs every section type ptxas can emit, the NVIDIA-specific ELF section types used, the section ordering rules, and the complete EIATTR attribute encoding.

| Component | Function / Value |
| --- | --- |
| Section attribute builder | sub_60FBF0 (76 KB decompiled -- per-kernel section config + codegen launch) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 call sites) |
| Text section creator | sub_1CB42D0 (SHF_ALLOC \| SHF_EXECINSTR) |
| nvinfo section creator | sub_1CC7FB0 (creates .nv.info / .nv.info.<func>) |
| EIATTR record emitter | sub_1CC85F0 (emits one TLV record) |
| EIATTR builder | sub_1CC9800 (14,764 bytes, 90 KB decompiled, 2,786 lines) |
| EIATTR propagator | sub_1CC8950 (2,634 bytes -- barrier/register propagation) |
| .nv.compat handler | sub_1CC93A0 (.nv.compat attribute processor) |
| Call graph builder | sub_1CBE1B0 (.nv.callgraph section) |
| Layout calculator | sub_1C9DC60 (5,663 bytes -- offset assignment) |
| Master section allocator | sub_1CABD60 (11,856 bytes -- shared/constant/local addresses) |
| SHT_CUDA_INFO | 0x70000000 (1,879,048,192) |
| SHT_CUDA_CALLGRAPH | 0x70000064 (1,879,048,292) |
| .nv.compat section type | 0x70000086 (1,879,048,326) |

NVIDIA-Specific Section Types

Beyond standard ELF section types (SHT_PROGBITS, SHT_STRTAB, SHT_SYMTAB, SHT_RELA, SHT_NOTE), ptxas uses NVIDIA-defined types in the SHT_LOPROC--SHT_HIPROC range (0x70000000--0x7FFFFFFF):

| Constant | Value | Decimal | Used by |
| --- | --- | --- | --- |
| SHT_CUDA_INFO | 0x70000000 | 1,879,048,192 | .nv.info, .nv.info.<func> |
| SHT_CUDA_CALLGRAPH | 0x70000064 | 1,879,048,292 | .nv.callgraph |
| SHT_CUDA_COMPAT | 0x70000086 | 1,879,048,326 | .nv.compat |

The section creator sub_1CB3570 contains a range check on CUDA-specific types:

// sub_1CB3570 -- section type validation
if (elf_mode != 1 && is_relocatable
    && ((sh_type - 0x70000064) <= 0x1A || sh_type == 0x70000006)) {
    // These CUDA section types require special handling in relocatable mode
}

This tells us that NVIDIA reserves the range 0x70000064--0x7000007E (27 types) plus 0x70000006 for CUDA-specific sections that receive special treatment in relocatable object mode.

Complete Section Catalog

Standard ELF Infrastructure Sections

Created unconditionally by the ELFW constructor (sub_1CB53A0). These form the skeleton of every cubin.

| Section | Type | Flags | Purpose |
| --- | --- | --- | --- |
| (null) | SHT_NULL | -- | Required ELF null section (index 0) |
| .shstrtab | SHT_STRTAB | -- | Section name string table |
| .strtab | SHT_STRTAB | -- | Symbol name string table |
| .symtab | SHT_SYMTAB | -- | Symbol table |
| .symtab_shndx | SHT_SYMTAB_SHNDX | -- | Extended section indices (when section count > 65,280) |

NVIDIA Note Sections

Created unconditionally. Carry module-level metadata the CUDA driver reads before launching any kernel.

| Section | Type | Flags | Purpose |
| --- | --- | --- | --- |
| .note.nv.tkinfo | SHT_NOTE | -- | Toolkit info: version string, build ID, CLI arguments |
| .note.nv.cuinfo | SHT_NOTE | -- | CUDA info: SM version, feature flags |
| .note.nv.cuver | SHT_NOTE | -- | CUDA version note |

Per-Kernel Code Sections

Created by sub_1CB42D0, one set per kernel entry and device function:

| Section | Type | Flags | sh_link | Purpose |
| --- | --- | --- | --- | --- |
| .text.<func> | SHT_PROGBITS | SHF_ALLOC \| SHF_EXECINSTR (0x6) | -- | SASS instruction bytes |
| .rela.text.<func> | SHT_RELA | -- | .symtab index | Relocations for the code section |

The .rela companion is auto-created by the section creator when SHF_EXECINSTR is set. The assertion "adding function section after callgraph completed" fires if a code section is added after call graph analysis.

Per-Kernel Metadata Sections

| Section | Type | Flags | sh_link | Purpose |
| --- | --- | --- | --- | --- |
| .nv.info.<func> | SHT_CUDA_INFO | SHF_LINK_ORDER (0x40) | .text.<func> symbol | EIATTR TLV records for this kernel |
| .nv.constant0.<func> | SHT_PROGBITS | SHF_ALLOC | -- | Constant bank 0: kernel params + literal constants |
| .nv.shared.<func> | SHT_NOBITS | SHF_ALLOC \| SHF_WRITE | -- | Shared memory layout (size only, no file data) |
| .nv.local.<func> | SHT_NOBITS | SHF_ALLOC \| SHF_WRITE | -- | Local (spill) memory layout |

The .nv.info.<func> section uses SHF_LINK_ORDER (flag 0x40) to declare its association with the function's symbol. The SHT_CUDA_INFO type value 0x70000000 is used; note that the nvlink wiki previously documented 0x70000064 for this -- the discrepancy arises because nvlink uses a different constant in its own emitter. Binary evidence from ptxas shows sub_1CC7FB0 consistently passes 1879048192 (0x70000000).

Global Metadata Sections

| Section | Type | Flags | Purpose |
| --- | --- | --- | --- |
| .nv.info | SHT_CUDA_INFO | -- | Global EIATTR attributes (sh_link = 0, not per-function) |
| .nv.compat | SHT_CUDA_COMPAT | -- | Forward-compatibility attributes (SM version negotiation) |
| .nv.metadata | SHT_PROGBITS | -- | Module-level metadata |
| .nv.callgraph | SHT_CUDA_CALLGRAPH | -- | Inter-function call edges (relocatable mode, -c) |
| .nv.prototype | SHT_PROGBITS | -- | Prototype information for cross-module linking |
| .nv.rel.action | SHT_PROGBITS | -- | Relocation action table |
| .nv.resolvedrela | SHT_PROGBITS | -- | Resolved relocations (post-linking) |
| .nv.host | SHT_PROGBITS | -- | Host-side interop data |

Constant Banks

CUDA supports up to 18 numbered constant banks (0--17) plus named constant sections:

| Section | Purpose |
| --- | --- |
| .nv.constant0 | Merged constant bank 0 (whole-program mode) |
| .nv.constant0.<func> | Per-function constant bank 0 (kernel params + compiler constants) |
| .nv.constant1 -- .nv.constant17 | User-declared __constant__ variables |
| .nv.constant.entry_params | Entry point parameter block |
| .nv.constant.entry_image_header_indices | Texture/surface header index table |
| .nv.constant.driver | Driver-injected constants |
| .nv.constant.optimizer | Optimizer-generated constants (OCG) |
| .nv.constant.user | User-specified constants |
| .nv.constant.pic | Position-independent code constants |
| .nv.constant.tools_data | Tools/debugger-injected data |

The layout calculator sub_1C9DC60 skips .nv.constant0 sections during offset assignment because their addresses are managed by the OCG constant bank allocator, not the ELF layout engine.

Shared Memory Sections

| Section | Purpose |
| --- | --- |
| .nv.shared.<func> | Per-kernel shared memory (size declaration, no data) |
| .nv.shared.reserved. | Reserved shared memory for runtime allocation |
| .nv.reservedSmem | Reserved shared memory master section |
| .nv.reservedSmem.begin | Start offset of reserved region |
| .nv.reservedSmem.cap | Capacity of reserved region |
| .nv.reservedSmem.offset0 | Offset within reserved region 0 |
| .nv.global.init | Initialized global variables |

The master section allocator sub_1CABD60 assigns addresses to shared, constant, and local memory sections. The layout calculator skips .nv.reservedSmem for the same reason it skips .nv.constant0 -- its address comes from the shared memory master allocator.

Unified Function/Data Tables

| Section | Purpose |
| --- | --- |
| .nv.uft | Unified Function Table (indirect call dispatch) |
| .nv.uft.entry | UFT entry point table |
| .nv.udt | Unified Data Table |
| .nv.udt.entry | UDT entry point table |

The error "Number of .nv.uft jump slots != Number of entries" fires when the UFT and entry tables are inconsistent. "missing nv.uft.entry" fires when the required entry table section was never created.

DWARF Debug Sections

Generated when --device-debug or --generate-line-info is active:

| Section | Content |
| --- | --- |
| .debug_info | DWARF DIE tree (compilation units, types, variables) |
| .debug_abbrev | DWARF abbreviation table |
| .debug_line | Source-to-address line number mapping |
| .debug_frame | Call frame information for unwinding |
| .debug_loc | Location lists for variables |
| .debug_str | DWARF string table |
| .debug_ranges | Address ranges |
| .debug_aranges | Address range lookup table |
| .debug_pubnames | Public name index |
| .debug_pubtypes | Public type index |

NVIDIA Debug Extensions

| Section | Content |
| --- | --- |
| .nv_debug_ptx_txt | Embedded PTX source text |
| .nv_debug_line_sass | SASS-level line number mapping |
| .nv_debug_info_reg_sass | Register allocation debug info |
| .nv_debug_info_reg_type | Register type information |
| .nv_debug.shared | Shared memory debug layout |

Mercury / Capsule Mercury Sections (SM 100+)

For Capsule Mercury output (Blackwell and later), the cubin contains a parallel set of .nv.merc.* sections carrying Mercury-encoded instruction streams plus all metadata needed for deferred finalization:

| Section | Purpose |
| --- | --- |
| .nv.capmerc | Capsule Mercury descriptor |
| .nv.merc.symtab_shndx | Extended section index table (Mercury copy) |
| .nv.merc.nv.shared.reserved | Shared memory reservation metadata |
| .nv.merc.rela<secname> | Per-section relocation tables |
| .nv.merc.debug_abbrev | Cloned DWARF abbreviation table |
| .nv.merc.debug_info | Cloned DWARF info |
| .nv.merc.debug_line | Cloned DWARF line table |
| .nv.merc.debug_frame | Cloned DWARF frame info |
| .nv.merc.debug_loc | Cloned DWARF locations |
| .nv.merc.debug_str | Cloned DWARF string table |
| .nv.merc.debug_ranges | Cloned DWARF ranges |
| .nv.merc.debug_aranges | Cloned DWARF address ranges |
| .nv.merc.debug_pubnames | Cloned DWARF public names |
| .nv.merc.debug_pubtypes | Cloned DWARF public types |
| .nv.merc.debug_macinfo | Cloned DWARF macro info |
| .nv.merc.nv_debug_ptx_txt | Embedded PTX source text |
| .nv.merc.nv_debug_line_sass | SASS-level line mapping |
| .nv.merc.nv_debug_info_reg_sass | Register allocation debug info |
| .nv.merc.nv_debug_info_reg_type | Register type debug info |

The Mercury section cloner (sub_1CA2E40) iterates all sections and duplicates constant, global, shared, and local sections into the .nv.merc.* namespace, creating corresponding .nv.merc.rela sections for relocations.

Global vs Per-Kernel Sections

The .nv.info / .nv.info.<func> split is the primary distinction between global and per-kernel metadata:

Global .nv.info (one per cubin):

  • sh_link = 0 (no associated symbol)
  • Contains module-wide EIATTR records: EIATTR_CUDA_API_VERSION, EIATTR_STATISTICS, EIATTR_HAS_PRE_V10_OBJECT, EIATTR_MERCURY_ISA_VERSION
  • Created by sub_1CC7FB0(elfw, 0) -- the zero argument selects global mode

Per-kernel .nv.info.<func> (one per kernel):

  • Section name: sprintf(".nv.info.%s", func_name) (visible in sub_1CC7FB0)
  • sh_link points to the symbol table entry for the function
  • sh_flags includes SHF_LINK_ORDER (0x40) to declare its association
  • Contains per-kernel EIATTR records: EIATTR_REGCOUNT, EIATTR_NUM_BARRIERS, EIATTR_FRAME_SIZE, etc.
  • Created by sub_1CC7FB0(elfw, sym_idx) when sym_idx != 0

The .nv.info section creator (sub_1CC7FB0) first searches for an existing section of type 0x70000000 with the appropriate name. If none exists, it creates one. The per-function variant links the new section to the function's .text section via sub_1CB4180.

Section Ordering

During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF:

| Bucket | Priority | Contents |
| --- | --- | --- |
| 0 | Highest | ELF header pseudo-section, .shstrtab |
| 1 | | .strtab, .symtab, .symtab_shndx |
| 2 | | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | | .text.<func> code sections |
| 4 | | .nv.constant0.*, .nv.shared.*, .nv.local.* data sections |
| 5 | | .rela.*, .rel.* relocation sections |
| 6 | | .nv.info.* EIATTR metadata sections |
| 7 | Lowest | .debug_*, .nv.merc.* debug and Mercury metadata |

Within each bucket, sections appear in creation order. Section file offsets are assigned by sub_1C9DC60 walking the sorted list with alignment padding. The .debug_line section receives special alignment padding for DWARF line table requirements.

Offset Assignment

// sub_1C9DC60 -- simplified layout algorithm
uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
    section_t* sec = sorted_sections[i];
    if (is_virtual(sec))          continue;  // flag & 4 -> no file data
    if (is_nv_constant0(sec))     continue;  // OCG allocator manages these
    if (is_nv_reservedSmem(sec))  continue;  // shared memory allocator manages these

    if (sec->sh_addralign > 1)
        offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
    sec->sh_offset = offset;
    offset += sec->sh_size;
}

Three section types are skipped during offset assignment:

  1. Virtual sections (flag bit 2 set) -- have no file data, only metadata
  2. .nv.constant0 -- address assigned by the OCG constant bank allocator
  3. .nv.reservedSmem -- address assigned by the shared memory master allocator sub_1CABD60

EIATTR Encoding

Each .nv.info section contains a flat sequence of EIATTR (Entry Information Attribute) records. There is no section header or record count -- the parser walks from byte 0 to sh_size, consuming records sequentially. The EIATTR builder is sub_1CC9800 (14,764 binary bytes, 90 KB decompiled) -- one of the three largest functions in the output pipeline.

TLV Record Format

Offset  Size  Field
------  ----  -----
0x00    1     format      Format byte (determines payload structure)
0x01    1     attr_code   EIATTR type code (identifies the attribute)
0x02    2     size        Payload size in bytes (little-endian uint16)
0x04    var   payload     Attribute-specific data (size bytes)

Total record size = 4 + ALIGN_UP(size, 4). Records are 4-byte aligned.
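
The record-size rule can be captured in one helper (a trivial sketch, not recovered code):

```c
#include <assert.h>
#include <stdint.h>

/* Records are 4-byte aligned: a 4-byte header plus the payload rounded
   up to the next multiple of 4. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

static uint32_t eiattr_record_size(uint16_t payload_size) {
    return 4u + ALIGN_UP((uint32_t)payload_size, 4u);
}
```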

Format Byte

| Format | Name | Payload structure |
| --- | --- | --- |
| 0x01 | Free | Raw bytes, attribute-specific layout |
| 0x02 | Value | Single 32-bit value (no symbol index) |
| 0x03 | Sized | 16-bit value + padding |
| 0x04 | Indexed | [sym_index:4][value:4] -- per-symbol attribute |

Format 0x04 (indexed) is the most common for per-function attributes. The 4-byte symbol index at payload offset 0 identifies which function the attribute applies to, enabling the linker to remap symbol indices during merge.

Parsing Pseudocode

uint8_t *ptr = section_data;
uint8_t *end = section_data + section_size;

while (ptr < end) {
    uint8_t  format    = ptr[0];
    uint8_t  attr_code = ptr[1];
    uint16_t size      = *(uint16_t *)(ptr + 2);

    switch (format) {
    case 0x04: {  // Indexed: [sym_index:4][value:4]
        uint32_t sym_idx = *(uint32_t *)(ptr + 4);
        uint32_t value   = *(uint32_t *)(ptr + 8);
        process_indexed(attr_code, sym_idx, value);
        break;
    }
    case 0x02: {  // Value: single 32-bit payload
        uint32_t value = *(uint32_t *)(ptr + 4);
        process_global(attr_code, value);
        break;
    }
    default:      // Free / Sized: raw payload
        process_raw(attr_code, ptr + 4, size);
        break;
    }
    ptr += 4 + ALIGN_UP(size, 4);
}

EIATTR Record Emitter -- sub_1CC85F0

The low-level function that writes one EIATTR TLV record. Called from the builder and propagator with parameters:

// sub_1CC85F0 -- emit one EIATTR record
void emit_eiattr(
    ELFW*    elfw,       // a1: ELFW object
    uint8_t  attr_code,  // a2: EIATTR type code (e.g., 0x2F for REGCOUNT)
    int16_t  size,       // a3: payload size in bytes
    void*    payload,    // a4: pointer to payload data
    uint32_t sym_idx     // a5: symbol index (0 = global)
);

Before emitting, it calls sub_1C97840 to check whether the attribute code is supported on the current SM architecture. If not supported, the record is silently skipped. It then calls sub_1CC7FB0 to obtain or create the appropriate .nv.info section, allocates a 16-byte record descriptor, fills the format byte and attribute code, and appends it to the section's linked list (offset +392 in the ELFW object).
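
To make the TLV layout concrete, here is a minimal serialization sketch for one indexed-format record. This is illustrative code, not the decompiled emitter: the real sub_1CC85F0 appends a record descriptor to a linked list rather than writing bytes directly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Serialize one format-0x04 (indexed) EIATTR record into `out`:
   4-byte header, then [sym_index:4][value:4] payload.
   Returns the number of bytes written. */
static size_t emit_indexed_record(uint8_t *out, uint8_t attr_code,
                                  uint32_t sym_idx, uint32_t value) {
    uint16_t size = 8;             /* payload: sym_index + value */
    out[0] = 0x04;                 /* format byte: indexed */
    out[1] = attr_code;            /* e.g. 0x2F for EIATTR_REGCOUNT */
    memcpy(out + 2, &size, 2);     /* little-endian size on x86-64 */
    memcpy(out + 4, &sym_idx, 4);
    memcpy(out + 8, &value, 4);
    return 4u + size;
}
```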

EIATTR Attribute Catalog

ptxas v13.0.88 defines 97 EIATTR codes numbered 0 through 96, with EIATTR_ERROR_LAST as the sentinel at 96. The complete catalog below is cross-referenced against the nvlink v13.0.88 name table (extracted from the pointer table at VA 0x1D37D60) and verified against EIATTR codes observed in the ptxas EIATTR builder (sub_1CC9800 switch cases and sub_1CC85F0 call sites).

Complete Code Table

| Code | Hex | Name | Fmt | Category |
| --- | --- | --- | --- | --- |
| 0 | 0x00 | EIATTR_ERROR | -- | Sentinel |
| 1 | 0x01 | EIATTR_PAD | -- | Sentinel |
| 2 | 0x02 | EIATTR_IMAGE_SLOT | Idx | Texture |
| 3 | 0x03 | EIATTR_JUMPTABLE_RELOCS | Free | Metadata |
| 4 | 0x04 | EIATTR_CTAIDZ_USED | Idx | Metadata |
| 5 | 0x05 | EIATTR_MAX_THREADS | Idx | Resource |
| 6 | 0x06 | EIATTR_IMAGE_OFFSET | Idx | Texture |
| 7 | 0x07 | EIATTR_IMAGE_SIZE | Idx | Texture |
| 8 | 0x08 | EIATTR_TEXTURE_NORMALIZED | Idx | Texture |
| 9 | 0x09 | EIATTR_SAMPLER_INIT | Idx | Texture |
| 10 | 0x0A | EIATTR_PARAM_CBANK | Idx | Param |
| 11 | 0x0B | EIATTR_SMEM_PARAM_OFFSETS | Free | Param |
| 12 | 0x0C | EIATTR_CBANK_PARAM_OFFSETS | Free | Param |
| 13 | 0x0D | EIATTR_SYNC_STACK | Idx | Metadata |
| 14 | 0x0E | EIATTR_TEXID_SAMPID_MAP | Free | Texture |
| 15 | 0x0F | EIATTR_EXTERNS | Free | Metadata |
| 16 | 0x10 | EIATTR_REQNTID | Idx | Resource |
| 17 | 0x11 | EIATTR_FRAME_SIZE | Idx | Resource |
| 18 | 0x12 | EIATTR_MIN_STACK_SIZE | Idx | Resource |
| 19 | 0x13 | EIATTR_SAMPLER_FORCE_UNNORMALIZED | Idx | Texture |
| 20 | 0x14 | EIATTR_BINDLESS_IMAGE_OFFSETS | Free | Texture |
| 21 | 0x15 | EIATTR_BINDLESS_TEXTURE_BANK | Idx | Texture |
| 22 | 0x16 | EIATTR_BINDLESS_SURFACE_BANK | Idx | Texture |
| 23 | 0x17 | EIATTR_KPARAM_INFO | Free | Param |
| 24 | 0x18 | EIATTR_SMEM_PARAM_SIZE | Idx | Param |
| 25 | 0x19 | EIATTR_CBANK_PARAM_SIZE | Sized | Param |
| 26 | 0x1A | EIATTR_QUERY_NUMATTRIB | Idx | Metadata |
| 27 | 0x1B | EIATTR_MAXREG_COUNT | Sized | Resource |
| 28 | 0x1C | EIATTR_EXIT_INSTR_OFFSETS | Free | Offsets |
| 29 | 0x1D | EIATTR_S2RCTAID_INSTR_OFFSETS | Free | Offsets |
| 30 | 0x1E | EIATTR_CRS_STACK_SIZE | Idx | Resource |
| 31 | 0x1F | EIATTR_NEED_CNP_WRAPPER | Idx | Metadata |
| 32 | 0x20 | EIATTR_NEED_CNP_PATCH | Idx | Metadata |
| 33 | 0x21 | EIATTR_EXPLICIT_CACHING | Idx | Metadata |
| 34 | 0x22 | EIATTR_ISTYPEP_USED | Idx | Metadata |
| 35 | 0x23 | EIATTR_MAX_STACK_SIZE | Idx | Resource |
| 36 | 0x24 | EIATTR_SUQ_USED | Idx | Metadata |
| 37 | 0x25 | EIATTR_LD_CACHEMOD_INSTR_OFFSETS | Free | Offsets |
| 38 | 0x26 | EIATTR_LOAD_CACHE_REQUEST | Idx | Metadata |
| 39 | 0x27 | EIATTR_ATOM_SYS_INSTR_OFFSETS | Free | Offsets |
| 40 | 0x28 | EIATTR_COOP_GROUP_INSTR_OFFSETS | Free | Offsets |
| 41 | 0x29 | EIATTR_COOP_GROUP_MASK_REGIDS | Idx | Cluster |
| 42 | 0x2A | EIATTR_SW1850030_WAR | Free | WAR |
| 43 | 0x2B | EIATTR_WMMA_USED | Idx | Metadata |
| 44 | 0x2C | EIATTR_HAS_PRE_V10_OBJECT | Val | Metadata |
| 45 | 0x2D | EIATTR_ATOMF16_EMUL_INSTR_OFFSETS | Free | Offsets |
| 46 | 0x2E | EIATTR_ATOM16_EMUL_INSTR_REG_MAP | Free | Offsets |
| 47 | 0x2F | EIATTR_REGCOUNT | Idx | Resource |
| 48 | 0x30 | EIATTR_SW2393858_WAR | Free | WAR |
| 49 | 0x31 | EIATTR_INT_WARP_WIDE_INSTR_OFFSETS | Free | Offsets |
| 50 | 0x32 | EIATTR_SHARED_SCRATCH | Idx | Shared |
| 51 | 0x33 | EIATTR_STATISTICS | Free | Metadata |
| 52 | 0x34 | EIATTR_INDIRECT_BRANCH_TARGETS | Free | Offsets |
| 53 | 0x35 | EIATTR_SW2861232_WAR | Free | WAR |
| 54 | 0x36 | EIATTR_SW_WAR | Free | WAR |
| 55 | 0x37 | EIATTR_CUDA_API_VERSION | Idx | Metadata |
| 56 | 0x38 | EIATTR_NUM_MBARRIERS | Idx | Resource |
| 57 | 0x39 | EIATTR_MBARRIER_INSTR_OFFSETS | Free | Offsets |
| 58 | 0x3A | EIATTR_COROUTINE_RESUME_OFFSETS | Free | Offsets |
| 59 | 0x3B | EIATTR_SAM_REGION_STACK_SIZE | Idx | Resource |
| 60 | 0x3C | EIATTR_PER_REG_TARGET_PERF_STATS | Free | Metadata |
| 61 | 0x3D | EIATTR_CTA_PER_CLUSTER | Idx | Cluster |
| 62 | 0x3E | EIATTR_EXPLICIT_CLUSTER | Idx | Cluster |
| 63 | 0x3F | EIATTR_MAX_CLUSTER_RANK | Idx | Cluster |
| 64 | 0x40 | EIATTR_INSTR_REG_MAP | Free | Metadata |
| 65 | 0x41 | EIATTR_RESERVED_SMEM_USED | Idx | Shared |
| 66 | 0x42 | EIATTR_RESERVED_SMEM_0_SIZE | Idx | Shared |
| 67 | 0x43 | EIATTR_UCODE_SECTION_DATA | Free | Metadata |
| 68 | 0x44 | EIATTR_UNUSED_LOAD_BYTE_OFFSET | Free | Offsets |
| 69 | 0x45 | EIATTR_KPARAM_INFO_V2 | Free | Param |
| 70 | 0x46 | EIATTR_SYSCALL_OFFSETS | Free | Offsets |
| 71 | 0x47 | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | Free | WAR |
| 72 | 0x48 | EIATTR_GRAPHICS_GLOBAL_CBANK | Idx | Graphics |
| 73 | 0x49 | EIATTR_SHADER_TYPE | Idx | Graphics |
| 74 | 0x4A | EIATTR_VRC_CTA_INIT_COUNT | Idx | Graphics |
| 75 | 0x4B | EIATTR_TOOLS_PATCH_FUNC | Idx | Metadata |
| 76 | 0x4C | EIATTR_NUM_BARRIERS | Idx | Resource |
| 77 | 0x4D | EIATTR_TEXMODE_INDEPENDENT | Idx | Texture |
| 78 | 0x4E | EIATTR_PERF_STATISTICS | Free | Metadata |
| 79 | 0x4F | EIATTR_AT_ENTRY_FRAGMENTS | Free | Blackwell |
| 80 | 0x50 | EIATTR_SPARSE_MMA_MASK | Free | Blackwell |
| 81 | 0x51 | EIATTR_TCGEN05_1CTA_USED | Idx | Blackwell |
| 82 | 0x52 | EIATTR_TCGEN05_2CTA_USED | Idx | Blackwell |
| 83 | 0x53 | EIATTR_GEN_ERRBAR_AT_EXIT | Idx | Blackwell |
| 84 | 0x54 | EIATTR_REG_RECONFIG | Idx | Blackwell |
| 85 | 0x55 | EIATTR_ANNOTATIONS | Free | Metadata |
| 86 | 0x56 | EIATTR_UNKNOWN | -- | Sentinel |
| 87 | 0x57 | EIATTR_STACK_CANARY_TRAP_OFFSETS | Free | Offsets |
| 88 | 0x58 | EIATTR_STUB_FUNCTION_KIND | Idx | Metadata |
| 89 | 0x59 | EIATTR_LOCAL_CTA_ASYNC_STORE_OFFSETS | Free | Offsets |
| 90 | 0x5A | EIATTR_MERCURY_FINALIZER_OPTIONS | Free | Mercury |
| 91 | 0x5B | EIATTR_BLOCKS_ARE_CLUSTERS | Idx | Cluster |
| 92 | 0x5C | EIATTR_SANITIZE | Idx | Blackwell |
| 93 | 0x5D | EIATTR_SYSCALLS_FALLBACK | Free | Metadata |
| 94 | 0x5E | EIATTR_CUDA_REQ | Free | Metadata |
| 95 | 0x5F | EIATTR_MERCURY_ISA_VERSION | Sized | Mercury |
| 96 | 0x60 | EIATTR_ERROR_LAST | -- | Sentinel |

Fmt column: Idx = format 0x04 (indexed, per-symbol), Free = format 0x01 (raw bytes), Val = format 0x02 (single 32-bit value), Sized = format 0x03 (16-bit value).

EIATTR Codes Confirmed in ptxas Builder

The following codes appear as explicit case labels in the sub_1CC9800 switch statement or as arguments to sub_1CC85F0:

| Code | Hex | Confirmed via |
| --- | --- | --- |
| 4 | 0x04 | case 0x4 in builder -- CTAIDZ_USED |
| 13 | 0x0D | case 0xD -- SYNC_STACK |
| 15 | 0x0F | case 0xF + sub_1CC85F0(_, 0xF, ...) -- EXTERNS |
| 17 | 0x11 | case 0x11 -- FRAME_SIZE |
| 18 | 0x12 | case 0x12 + sub_1CC85F0(_, 0x12, ...) -- MIN_STACK_SIZE |
| 27 | 0x1B | case 0x1B -- MAXREG_COUNT |
| 30 | 0x1E | case 0x1E + sub_1CC85F0(_, 0x1E, ...) -- CRS_STACK_SIZE |
| 35 | 0x23 | case 0x23 -- MAX_STACK_SIZE |
| 38 | 0x26 | case 0x26 -- LOAD_CACHE_REQUEST |
| 47 | 0x2F | case 0x2F + sub_1CC85F0(_, 0x2F, ...) -- REGCOUNT |
| 56 | 0x38 | case 0x38 -- NUM_MBARRIERS |
| 59 | 0x3B | case 0x3B + sub_1CC85F0(_, 0x3B, ...) -- SAM_REGION_STACK_SIZE |
| 65 | 0x41 | case 0x41 -- RESERVED_SMEM_USED |
| 74 | 0x4A | case 0x4A -- VRC_CTA_INIT_COUNT |
| 76 | 0x4C | case 0x4C -- NUM_BARRIERS |
| 79 | 0x4F | case 0x4F + sub_1CC85F0(_, 0x4F, ...) -- AT_ENTRY_FRAGMENTS |
| 80 | 0x50 | case 0x50 + sub_1C97840(0x50, ...) -- SPARSE_MMA_MASK |
| 81 | 0x51 | case 0x51 -- TCGEN05_1CTA_USED |
| 82 | 0x52 | case 0x52 -- TCGEN05_2CTA_USED |
| 84 | 0x54 | case 0x54 -- REG_RECONFIG |

The builder's first pass uses a switch with cases 0x04, 0x0D, 0x0F, 0x11, 0x12, 0x1B, 0x1E, 0x23, 0x26, 0x2F, 0x38, 0x3B, 0x41, 0x4A, 0x4C, 0x4F, 0x50, 0x51, 0x52, 0x54 to sort EIATTR records into per-entry arrays. A second pass emits the final records via sub_1CC85F0 and sub_1CC86D0.

Symbol Index Resolution Pass

Before the main builder runs, the EIATTR builder performs a symbol index resolution pass (lines 700--884 in the decompiled builder). This pass walks all pre-existing EIATTR records and resolves symbol indices through the linker's mapping tables:

// Simplified from sub_1CC9800 lines ~716-824
for (record in eiattr_list) {
    switch (record->attr_code) {
    case 0x02: case 0x06: case 0x07: case 0x08: case 0x09:
    case 0x0A: case 0x11: case 0x12: case 0x13: case 0x14:
    case 0x17: case 0x23: case 0x26: case 0x2F: case 0x3B:
    case 0x45: {
        // Indexed format: resolve sym_idx through mapping table
        int32_t *sym_ptr = (int32_t *)record->payload;
        if (mapping_table && *sym_ptr != 0) {
            if (*sym_ptr < 0)
                *sym_ptr = negative_mapping[-(*sym_ptr)];
            else
                *sym_ptr = mapping_table[*sym_ptr];
        }
        if (*sym_ptr == 0 && record->attr_code != 0x45
                          && record->attr_code != 0x17) {
            record->attr_code = 0;  // disable record
        }
        break;
    }
    case 0x0F: {
        // EXTERNS: resolve each 4-byte symbol index in the array
        int count = record->size / 4;
        for (int i = 0; i < count; i++) {
            resolve_sym((int32_t *)record->payload + i,
                        mapping_table, negative_mapping);
        }
        break;
    }
    }
}

The bitmask 0x800800060000 (seen at line 716) gates which attribute codes take the simple indexed-resolve path. Its set bits are 17, 18, 35, and 47 -- codes 0x11 (FRAME_SIZE), 0x12 (MIN_STACK_SIZE), 0x23 (MAX_STACK_SIZE), and 0x2F (REGCOUNT); the remaining codes in the case list reach the resolution logic through the explicit switch labels rather than the mask.
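
The set bits of that constant can be verified mechanically; a membership test along these lines (assuming the mask is indexed directly by attribute code, as the decompilation suggests) shows which codes it selects:

```c
#include <assert.h>
#include <stdint.h>

/* Bitmask observed in the resolution pass: bit N set means attribute
   code N takes the simple indexed-resolve path. */
static const uint64_t kIndexedResolveMask = 0x800800060000ULL;

static int uses_simple_indexed_path(unsigned code) {
    return code < 64 && ((kIndexedResolveMask >> code) & 1u) != 0;
}
```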

Barrier and Register Propagation -- sub_1CC8950

When a device function uses barriers or a high register count, those requirements must propagate upward through the call graph to each entry kernel. The propagator sub_1CC8950 handles this:

"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
    from section flags of %s to nvinfo for entry symbol %s"
"Propagating higher barcount %d to the section flags
    of %s of entry symbol %s"
"regcount %d for %s propagated to entry %s"

The propagator emits EIATTR_REGCOUNT (0x2F) records via sub_1CC85F0(_, 0x2F, 8, ...) and handles EIATTR_NUM_BARRIERS (0x4C) through the sub_1CC7FB0 path. Barrier counts are extracted from the section flags field at bit offset 20 (7-bit field, mask 0x7F), then cleared from the section flags (&= 0xF80FFFFF) after being moved into an EIATTR record.
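
The bit arithmetic described above can be sketched directly (0xF80FFFFF is exactly the complement of a 7-bit field at bit 20):

```c
#include <assert.h>
#include <stdint.h>

/* Barrier count lives in a 7-bit field at bit 20 of the section flags. */
static uint32_t extract_barcount(uint32_t sh_flags) {
    return (sh_flags >> 20) & 0x7Fu;
}

/* Clear the field after the count has moved into an EIATTR record. */
static uint32_t clear_barcount(uint32_t sh_flags) {
    return sh_flags & 0xF80FFFFFu;
}
```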

EIATTR Categories by Function

Resource allocation (GPU driver reads these to configure hardware at launch):

| Code | Name | Description |
| --- | --- | --- |
| 0x2F | REGCOUNT | Physical register count per thread (primary occupancy determinant) |
| 0x05 | MAX_THREADS | Maximum threads per block (.maxntid) |
| 0x10 | REQNTID | Required block dimensions (.reqntid, 3x uint32) |
| 0x11 | FRAME_SIZE | Per-thread local memory frame size (bytes) |
| 0x12 | MIN_STACK_SIZE | Minimum call stack (non-recursive) |
| 0x23 | MAX_STACK_SIZE | Maximum call stack (recursive) |
| 0x1E | CRS_STACK_SIZE | Call-Return-Sync stack |
| 0x3B | SAM_REGION_STACK_SIZE | SAM (Streaming Async Memory) region stack |
| 0x4C | NUM_BARRIERS | Named barrier count (0--16) |
| 0x38 | NUM_MBARRIERS | Memory barrier (mbarrier) object count |
| 0x1B | MAXREG_COUNT | Register count hint (--maxrregcount / .maxnreg) |

Parameter bank:

| Code | Name | Description |
| --- | --- | --- |
| 0x0A | PARAM_CBANK | Constant bank number + offset for parameters |
| 0x19 | CBANK_PARAM_SIZE | Parameter constant bank size |
| 0x18 | SMEM_PARAM_SIZE | Shared memory parameter region size |
| 0x0B | SMEM_PARAM_OFFSETS | Per-parameter shared memory offsets |
| 0x0C | CBANK_PARAM_OFFSETS | Per-parameter constant bank offsets |
| 0x17 | KPARAM_INFO | Per-parameter metadata (v1) |
| 0x45 | KPARAM_INFO_V2 | Per-parameter metadata (v2, extended) |

Instruction offset tables (driver/tools locate and patch instructions at load time):

| Code | Name | Description |
| --- | --- | --- |
| 0x1C | EXIT_INSTR_OFFSETS | Byte offsets of EXIT instructions |
| 0x1D | S2RCTAID_INSTR_OFFSETS | Offsets of S2R SR_CTAID.* instructions |
| 0x25 | LD_CACHEMOD_INSTR_OFFSETS | Load instructions with cache modifier |
| 0x27 | ATOM_SYS_INSTR_OFFSETS | Atomic instructions with .sys scope |
| 0x28 | COOP_GROUP_INSTR_OFFSETS | Cooperative group instructions |
| 0x2D | ATOMF16_EMUL_INSTR_OFFSETS | Emulated FP16 atomics |
| 0x2E | ATOM16_EMUL_INSTR_REG_MAP | Register map for 16-bit atomic emulation |
| 0x31 | INT_WARP_WIDE_INSTR_OFFSETS | Integer warp-wide instructions |
| 0x34 | INDIRECT_BRANCH_TARGETS | Valid indirect branch targets (CFI) |
| 0x39 | MBARRIER_INSTR_OFFSETS | MBAR memory barrier instructions |
| 0x3A | COROUTINE_RESUME_OFFSETS | Device coroutine resume points |
| 0x44 | UNUSED_LOAD_BYTE_OFFSET | Unused load instruction byte offset |
| 0x46 | SYSCALL_OFFSETS | __cuda_syscall invocation offsets |
| 0x57 | STACK_CANARY_TRAP_OFFSETS | Stack canary trap instructions |
| 0x59 | LOCAL_CTA_ASYNC_STORE_OFFSETS | CTA-local async store instructions |

Texture and surface:

| Code | Name | Description |
| --- | --- | --- |
| 0x02 | IMAGE_SLOT | Texture/surface image slot assignment |
| 0x06 | IMAGE_OFFSET | Image descriptor table offset |
| 0x07 | IMAGE_SIZE | Image descriptor size |
| 0x08 | TEXTURE_NORMALIZED | Normalized texture coordinates flag |
| 0x09 | SAMPLER_INIT | Sampler initialization data |
| 0x0E | TEXID_SAMPID_MAP | Texture-to-sampler mapping table |
| 0x13 | SAMPLER_FORCE_UNNORMALIZED | Force unnormalized sampler |
| 0x14 | BINDLESS_IMAGE_OFFSETS | Bindless texture/surface offsets |
| 0x15 | BINDLESS_TEXTURE_BANK | Constant bank for bindless textures |
| 0x16 | BINDLESS_SURFACE_BANK | Constant bank for bindless surfaces |
| 0x4D | TEXMODE_INDEPENDENT | Independent texture mode |

Cluster and cooperative launch (sm_90+):

| Code | Name | Description |
| --- | --- | --- |
| 0x29 | COOP_GROUP_MASK_REGIDS | Cooperative group mask register IDs |
| 0x3D | CTA_PER_CLUSTER | CTAs per cluster (Hopper+) |
| 0x3E | EXPLICIT_CLUSTER | Explicit cluster dimensions |
| 0x3F | MAX_CLUSTER_RANK | Maximum cluster rank |
| 0x5B | BLOCKS_ARE_CLUSTERS | CTA blocks are clusters flag |

Shared memory:

| Code | Name | Description |
| --- | --- | --- |
| 0x32 | SHARED_SCRATCH | Shared memory scratch for register spilling |
| 0x41 | RESERVED_SMEM_USED | Reserved shared memory in use |
| 0x42 | RESERVED_SMEM_0_SIZE | Reserved shared memory partition 0 size |

Hardware workarounds:

| Code | Name | Description |
| --- | --- | --- |
| 0x2A | SW1850030_WAR | HW bug 1850030 workaround |
| 0x30 | SW2393858_WAR | HW bug 2393858 workaround |
| 0x35 | SW2861232_WAR | HW bug 2861232 workaround |
| 0x36 | SW_WAR | Generic workaround container |
| 0x47 | SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | MEMBAR.SYS workaround offsets |

Blackwell+ (sm_100+):

| Code | Name | Description |
| --- | --- | --- |
| 0x4F | AT_ENTRY_FRAGMENTS | Fragment descriptors at function entry |
| 0x50 | SPARSE_MMA_MASK | Structured sparsity mask for MMA |
| 0x51 | TCGEN05_1CTA_USED | 5th-gen tensor core (single-CTA mode) |
| 0x52 | TCGEN05_2CTA_USED | 5th-gen tensor core (two-CTA mode) |
| 0x53 | GEN_ERRBAR_AT_EXIT | Generate error barrier at kernel exit |
| 0x54 | REG_RECONFIG | Dynamic register reconfiguration (setmaxnreg) |
| 0x5C | SANITIZE | Address sanitizer instrumentation present |

Mercury:

| Code | Name | Description |
| --- | --- | --- |
| 0x5A | MERCURY_FINALIZER_OPTIONS | Options for Mercury FNLZR post-link pass |
| 0x5F | MERCURY_ISA_VERSION | Mercury ISA version for shader binary |

Graphics-specific:

| Code | Name | Description |
| --- | --- | --- |
| 0x48 | GRAPHICS_GLOBAL_CBANK | Global constant bank for graphics shaders |
| 0x49 | SHADER_TYPE | Shader type (vertex, fragment, compute, etc.) |
| 0x4A | VRC_CTA_INIT_COUNT | Virtual Register Count CTA init count |

.nv.compat Section

The .nv.compat section (SHT_CUDA_COMPAT = 0x70000086) stores forward-compatibility attributes. Its records use a different format from EIATTR -- each is a small TLV with:

Offset  Size  Field
------  ----  -----
0x00    1     format (always 0x02 = value)
0x01    1     compat_code
0x02    1     value

The sub_1CC93A0 handler processes these with a switch over compat codes 2--6:

| Code | Behavior |
| --- | --- |
| 2 | Max of existing and new value (keeps higher) |
| 3 | OR existing with new value (accumulate flags) |
| 4 | Reset to zero |
| 5 | Per-nibble max (two 2-bit fields) |
| 6 | Set to 1 if values differ (conflict detection) |

The guard *(_DWORD *)(a1 + 72) <= 0x59 (SM version <= 89 decimal) means compat processing only applies to SM 90 (Hopper) and later. Unknown compat codes trigger: "unknown .nv.compat attribute (%x) encoutered with value %x." (note the typo "encoutered" in the binary string).
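
A sketch of the merge semantics for codes 2--6 follows. It is an interpretation of the recovered behaviors, not the decompiled handler; in particular, the code-5 split into two 2-bit fields in the low nibble is an assumption about what "per-nibble max (two 2-bit fields)" means.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t max8(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Merge an incoming .nv.compat value into an existing one, following the
   per-code behaviors recovered from sub_1CC93A0. */
static uint8_t merge_compat(int code, uint8_t existing, uint8_t incoming) {
    switch (code) {
    case 2: return max8(existing, incoming);   /* keep the higher value */
    case 3: return existing | incoming;        /* accumulate flags */
    case 4: return 0;                          /* reset to zero */
    case 5:                                    /* per-field max (assumed split) */
        return (uint8_t)(max8(existing & 3, incoming & 3)
             | (max8((existing >> 2) & 3, (incoming >> 2) & 3) << 2));
    case 6: return existing != incoming;       /* conflict flag */
    default: return existing;                  /* unknown code: leave as-is */
    }
}
```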

Architecture-Gated EIATTR Emission

Not all EIATTR codes are valid on all SM architectures. The function sub_1C97840 performs architecture checks before emitting a record. Observed gates:

| EIATTR Code | Gate | Meaning |
| --- | --- | --- |
| 0x04 (CTAIDZ_USED) | -- | Always emitted |
| 0x41 (RESERVED_SMEM_USED) | sub_1C97840(0x41, sm) | SM-version dependent |
| 0x4C (NUM_BARRIERS) | sub_1C97840(0x4C, sm) | SM-version dependent |
| 0x50 (SPARSE_MMA_MASK) | sub_1C97840(0x50, sm) | SM 100+ (Blackwell) |
| 0x51 (TCGEN05_1CTA) | sub_1C97840(0x51, sm) implicit | SM 100+ |
| 0x52 (TCGEN05_2CTA) | sub_1C97840(0x52, sm) implicit | SM 100+ |
| 0x54 (REG_RECONFIG) | sub_1C97840(0x54, sm) implicit | SM 100+ |

The sub_1C97840 function takes an EIATTR code and the SM version from the ELFW object's field at offset 624, returning a boolean. This prevents older EIATTR codes from appearing in Blackwell cubins and prevents Blackwell-only codes from appearing in Hopper cubins.

Constant Bank Optimization

The master section allocator sub_1CABD60 (11,856 bytes) performs two major space optimizations during address assignment: constant value deduplication within .nv.constant0 banks, and shared memory interference-graph coloring for extern shared variables. Both run before final offset assignment.

Constant Value Deduplication -- sub_1CA6890

When multiple kernels in the same compilation unit use identical constant values, the OCG constant bank can contain duplicates. sub_1CA6890 (454 lines decompiled) eliminates them by value-matching, reducing .nv.constant0 section size.

The algorithm dispatches on constant value width:

| Value Width | Dedup Strategy | Data Structure |
| --- | --- | --- |
| 4 bytes | Hash map lookup (sub_426D60) | Hash table keyed on 32-bit value |
| 8 bytes | Hash map lookup (separate table) | Hash table keyed on 64-bit value |
| 12, 16, 20, 24, 32, 48, 64 bytes | Linear scan with memcmp (sub_1CA6760) | Per-width linked list |
| Other | No deduplication | Direct append |

For each constant data node in the section's linked list (at section+72):

  1. Extract the value bytes (node+0), alignment (node+16), and size (node+24).
  2. Look up the value in the appropriate dedup structure.
  3. If duplicate found: alias the current symbol's offset to the existing symbol's offset. Debug output: "found duplicate value 0x%x, alias %s to %s" (32-bit) or "found duplicate 64bit value 0x%llx, alias %s to %s" (64-bit) or "found duplicate %d byte value, alias %s to %s" (N-byte via sub_1CA6760).
  4. If not found: align the section cursor to the required alignment, append the data via sub_1CA6650, and insert into the dedup structure.
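
A minimal sketch of this append-or-alias loop (illustrative names; one linear scan stands in for the real implementation's separate 32-bit/64-bit hash tables and per-width lists):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of sub_1CA6890's append-or-alias policy. A duplicate value is
 * aliased to the offset of its first occurrence; a new value is appended
 * at the aligned section cursor (the sub_1CA6650 path). */
#define MAX_CONSTS 64

typedef struct {
    uint8_t  bytes[64];
    uint32_t size;      /* node+24 in the real structure */
    int64_t  offset;    /* assigned section offset */
} const_node;

static const_node pool[MAX_CONSTS];
static int pool_len;
static int64_t cursor;

static int64_t align_up(int64_t v, int64_t a) { return (v + a - 1) & ~(a - 1); }

int64_t dedup_append(const uint8_t *val, uint32_t size, uint32_t align)
{
    for (int i = 0; i < pool_len; i++)
        if (pool[i].size == size && memcmp(pool[i].bytes, val, size) == 0)
            return pool[i].offset;       /* "found duplicate value" path */
    cursor = align_up(cursor, align);
    const_node *n = &pool[pool_len++];
    memcpy(n->bytes, val, size);
    n->size = size;
    n->offset = cursor;
    cursor += size;
    return n->offset;
}
```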

After aliasing, the function rewrites pending relocations that targeted the now-eliminated range:

// Simplified relocation rewriting after dedup alias
for (reloc in pending_relocs) {
    if (reloc.section == target_section
        && reloc.offset >= old_data_offset
        && reloc.offset <  old_data_offset + old_data_size) {
        reloc.offset = reloc.offset + alias_target_offset - old_data_offset;
        // "optimize ocg constant reloc offset from %lld to %lld"
        unlink(reloc);  // remove from pending list
    }
}

Special cases:

  • Zero-valued constants: A "seen set" (parameter a15) prevents distinct zero-valued symbols from being aliased to each other, since different __constant__ variables may legitimately hold zero but need separate addresses.
  • Redirect mode: When parameter a13 is set and sub_1CB15C0 returns true for a symbol, the constant is redirected to its defining section rather than deduplicated.

The caller sub_1CABD60 wraps this in an optimization check: "optimize OCG constants for %s, old size = %lld". If dedup does not reduce the section size, it reverts: "ocg const optimization didn't help so give up".

Shared Memory Interference Graph -- sub_1CA92F0

When a CUDA program declares multiple extern __shared__ variables used by different kernels, they can potentially share the same memory if no single kernel uses both simultaneously. sub_1CA92F0 (585 lines decompiled) builds an interference graph and performs greedy graph coloring to pack shared objects into minimum total space.

Phase 1 -- Build usage sets (which kernels reference each shared object):

For each global shared object, walk all referencing functions. A kernel "uses" a shared object if it directly references it or transitively calls a device function that does (traced via sub_1CBD800). Objects used by exactly one kernel are privatized -- moved into a per-entry .nv.shared.<func> section. Unused objects are removed entirely (symbol flags set to mark deleted).

"global shared %s only used in entry %d"    -- privatize
"remove unused global shared %s"             -- delete

Phase 2 -- Build interference edges:

For each pair of remaining shared objects (i, j), test whether their usage sets intersect (via sub_42E460 set membership). If any kernel uses both, they interfere -- they cannot overlap in memory. Edges are stored as linked lists per object.

Phase 3 -- Greedy graph coloring:

Objects are processed in sorted order. For each object:

  1. Mark all colors used by interfering neighbors as unavailable.
  2. Assign the lowest available color (starting from 1).
  3. Update the color's alignment requirement (max of all objects in that color group).
  4. Update the color's size requirement (max of all objects in that color group).
"  allocate to group %d"    -- color assignment

Phase 4 -- Compute group offsets:

Groups are laid out sequentially with alignment padding:

group_offset[1] = align_up(base, group_align[1]);
for (g = 2; g <= num_groups; g++)
    group_offset[g] = align_up(group_offset[g-1] + group_size[g-1], group_align[g]);
total_size = group_offset[last] + group_size[last];

Each shared object's final offset is group_offset[its_color]. The total extern shared size is written to the section descriptor. Per-entry shared sections are expanded if a referenced object's offset + size exceeds their current size.
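
Phases 2 through 4 can be sketched as follows. `pack_shared` is a hypothetical name, and kernel usage sets are modeled as bitmasks, which matches the set-intersection test in spirit but not the actual sub_42E460 representation:

```c
#include <assert.h>
#include <stdint.h>

#define NOBJ 4

static int64_t align_up(int64_t v, int64_t a) { return (v + a - 1) & ~(a - 1); }

/* usage[i] = bitmask of kernels referencing shared object i (phase 1 result).
 * Returns total extern shared size; color_out[i] gets object i's group. */
int64_t pack_shared(const uint32_t usage[NOBJ],
                    const int64_t size[NOBJ], const int64_t align[NOBJ],
                    int color_out[NOBJ])
{
    int64_t grp_size[NOBJ + 1] = {0}, grp_align[NOBJ + 1], grp_off[NOBJ + 1];
    for (int g = 0; g <= NOBJ; g++) grp_align[g] = 1;
    int ngroups = 0;

    for (int i = 0; i < NOBJ; i++) {              /* phase 3: greedy coloring */
        int banned[NOBJ + 1] = {0};
        for (int j = 0; j < i; j++)               /* phase 2: interference */
            if (usage[i] & usage[j])              /* some kernel uses both */
                banned[color_out[j]] = 1;
        int c = 1;
        while (banned[c]) c++;                    /* lowest available color */
        color_out[i] = c;
        if (c > ngroups) ngroups = c;
        if (size[i]  > grp_size[c])  grp_size[c]  = size[i];
        if (align[i] > grp_align[c]) grp_align[c] = align[i];
    }
    grp_off[1] = align_up(0, grp_align[1]);       /* phase 4: layout */
    for (int g = 2; g <= ngroups; g++)
        grp_off[g] = align_up(grp_off[g - 1] + grp_size[g - 1], grp_align[g]);
    return grp_off[ngroups] + grp_size[ngroups];
}
```

With four objects where only object 2's usage set overlaps objects 0 and 1, objects 0, 1, and 3 share group 1 (sized to the largest member) and object 2 lands in group 2.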

"esh %s size = %lld"
"for shared object (%d) %s:"
"  offset = 0x%llx, size = 0x%llx"
"  edge to %d"
"  allocate to group %d"

Constant Bank Optimization Functions

| Address | Size | Purpose |
|---|---|---|
| sub_1CA6890 | 454 lines | Constant value deduplication (32/64-bit hash, N-byte memcmp) |
| sub_1CA6760 | 57 lines | N-byte value dedup helper (12--64 byte constants) |
| sub_1CA6650 | 65 lines | Constant data node appender (40-byte node, alignment + append) |
| sub_1CA92F0 | 585 lines | Shared memory interference graph + greedy coloring |
| sub_1CA91A0 | -- | Per-entry shared section creator (.nv.shared.<func>) |
| sub_1CA5360 | -- | Shared object comparison function (sort key) |
| sub_1CA5A00 | -- | Shared memory data copier (offset overlap check) |

Section Attribute Builder -- sub_60FBF0

The per-kernel section attribute builder sub_60FBF0 (76 KB decompiled, 2,541 lines, VA 0x60FBF0) runs once for each kernel entry point and device function. It assembles the full per-function section configuration object (648 bytes), parses compile option overrides, remaps PTX memory space codes to ELF section type IDs, conditionally creates three section types, then invokes the Mercury codegen pipeline (sub_6F52F0, DecodePipeline::RunStages).

Inputs

The function takes three parameters:

| Parameter | Content |
|---|---|
| a1 | Per-function descriptor: SM version (a1[0..1]), key-value option list (a1+38), assembler flags (a1+39), global/extern symbol lists (a1+6, a1+7), boolean flags (a1+180..182) |
| a2 | Compilation context: config base (a2+248), function list (a2+136), optional symbol tables for textures (a2+112), surfaces (a2+120), globals (a2+72), sass_map flag (a2+232), mutex (a2+240), ELFW object (a2+32), target descriptor (a2+56) |
| a3 | Output handle (released and reallocated at function entry) |

Option Parsing

The function iterates the key-value list at a1+38 and matches five string keys by character-by-character comparison:

// Simplified from sub_60FBF0 lines ~638-812
for (int i = 0; i < list_length(a1->options); i++) {
    const char** kv = list_get(a1->options, i);
    if (strcmp(kv[0], "deviceDebug") == 0)
        config->deviceDebug = 1;                // config+24
    else if (strcmp(kv[0], "lineInfo") == 0)
        config->lineInfo = 1;                   // config+25
    else if (strcmp(kv[0], "optLevel") == 0) {
        if (!config->optLevel_locked)            // config+108
            config->optLevel = strtol(kv[1], ...);  // config+104
    }
    else if (strcmp(kv[0], "IsCompute") == 0)
        config->isCompute = (strcmp(kv[1], "True") == 0);  // config+184
    else if (strcmp(kv[0], "IsPIC") == 0)
        config->isPIC = (strcmp(kv[1], "True") == 0);      // config+191
}

Additional inputs read from a1 and a2:

| Source | Config field | Description |
|---|---|---|
| a2+408 | config+44 | Optimization level from compilation context |
| a1+180 | config+190 | Negated boolean flag |
| a1+181 | config+188 | Boolean flag |
| a1+182 | config+189 | Boolean flag |
| word_2020620[a2+64 - 20] | v31+4 | SM version lookup table (SM 20..121 range) |

Memory Space Type Remapping

PTX internal memory space type codes in the 0x10000 range are remapped to compact ELF section type IDs by the helper sub_60DA40 (and inline copies). This remapping is applied to every symbol in the global, extern, texture, surface, and per-function symbol lists:

| PTX Code | Hex | Section Type ID | Memory Space |
|---|---|---|---|
| 65538 | 0x10002 | 83 | .nv.shared (per-kernel) |
| 65539 | 0x10003 | 80 | .nv.constant0 (kernel params) |
| 65540 | 0x10004 | 84 | .nv.local (spill memory) |
| 65541 | 0x10005 | 81 | .nv.constant1 |
| 65542 | 0x10006 | 82 | .nv.constant2 |
| 65544 | 0x10008 | 85 | .nv.constant3 |
| 65545 | 0x10009 | 86 | .nv.constant4 |
| 65546 | 0x1000A | 87 | .nv.constant5 |
| 65576 | 0x10028 | 88 | .nv.global.init |
| 65577 | 0x10029 | 89 | .nv.global |
| 65586 | 0x10032 | 93 | High constant bank |
| 65587 | 0x10033 | 90 | High constant bank |
| 65598 | 0x1003E | 91 | Texture/surface descriptor |
| 65599 | 0x1003F | 92 | Texture/surface descriptor |

Special handling: when the space code is 0x10003 (constant0) and the compilation mode is relocatable (*(a3+48) == 12), the descriptor's needs_reloc flag (byte 33) is set to 1, indicating the constant0 section requires special relocation handling during linking.

The value 65596 (0x1003C) serves as a threshold -- symbols with (space_type - 0x1003C) < 2 are counted into the texture/surface allocation arrays.
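
A table-driven sketch of the remapping (the real sub_60DA40's control flow is unrecovered; the pairs come from the table above, and returning -1 for unlisted codes is an assumption of this sketch):

```c
#include <assert.h>

/* Sketch of the 0x10000-range PTX space code -> ELF section type ID
 * remapping performed by sub_60DA40 and its inline copies. */
int remap_space(int ptx_code)
{
    static const int map[][2] = {
        {0x10002, 83}, {0x10003, 80}, {0x10004, 84}, {0x10005, 81},
        {0x10006, 82}, {0x10008, 85}, {0x10009, 86}, {0x1000A, 87},
        {0x10028, 88}, {0x10029, 89}, {0x10032, 93}, {0x10033, 90},
        {0x1003E, 91}, {0x1003F, 92},
    };
    for (unsigned i = 0; i < sizeof map / sizeof map[0]; i++)
        if (map[i][0] == ptx_code)
            return map[i][1];
    return -1;   /* unlisted code: assumption, not recovered behavior */
}
```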

Conditional Section Creation

Three per-kernel sections are conditionally created:

.sass_map.<func> -- created when *(a2+232) != 0 (sass_map generation enabled):

if (context->sass_map_enabled) {                 // a2+232
    descriptor = alloc(64);                       // 64-byte section descriptor
    memset(descriptor, 0, 64);
    pthread_mutex_lock(context->mutex);           // a2+240
    // Allocate instruction tree node and connect to codegen state
    name = sprintf(".sass_map%s", func_name);     // ".sass_map" + func_name
    descriptor->name = name;
    pthread_mutex_unlock(context->mutex);
}

.nv.local.<func> -- created when the register spill size (config+112) is non-zero:

if (config->spill_size != 0) {                   // config+112
    descriptor = alloc(64);
    descriptor->size = config->spill_size;
    // Name: ".nv.local." + bare_func_name (skip ".text." prefix)
    name = sprintf(".nv.local.%s", func_name + 6);
}

The spill size at config+112 is set from the sum of the register spill count and frame size when the spill flag is non-zero.

.nv.constant<N>.<func> -- created when:

  1. The compilation mode field equals 2 (*(a1->target+48) == 2)
  2. No pre-existing constant section exists (*(a1+172) == 0)
  3. The function's symbol list is empty
if (mode == 2 && !has_constant_section && no_symbols) {
    descriptor = alloc(64);
    int bank = target->get_section_type() - 0x70000064;
    int size;
    if (func_const_size <= target->get_min_const_size())
        size = target->get_min_const_size();     // vtable+80
    else
        size = target->get_max_const_size();     // vtable+88
    descriptor->size = size + func_const_size;
    name = sprintf(".nv.constant%d.%s", bank, bare_func_name);
    descriptor->data = calloc(descriptor->size);
}

Assembler Flag Processing

The assembler flag list at a1+39 is iterated. Each entry's value string (at offset +8) is split on spaces via strtok_r. Each token is validated by sub_60F790, which constructs a temporary 656-byte object to test the flag. Valid tokens are concatenated with spaces and appended to config+48 (the toolkit info string that ends up in .note.nv.tkinfo).

Codegen Pipeline Invocation

After configuration, the function calls sub_6F52F0 (DecodePipeline::RunStages) with 18 parameters including the configuration object, all 7 descriptor arrays, the ELFW context, and the function name. The return code is mapped:

| sub_6F52F0 return | sub_60FBF0 return | Meaning |
|---|---|---|
| 0 | 0 | Success |
| 1 | 14 | Mercury encode failure |
| 2 | 22 | Mercury decode failure |

Post-Pipeline Section Registration

After the Mercury pipeline returns successfully:

  1. Calls sub_60DD30 twice for pre/post code region finalization
  2. Calls sub_60DBE0 for each optional symbol table (texture, surface, global) to register their sections with the ELFW emitter
  3. Calls sub_1CB9C30 on the ELFW object (a2+32) to commit all sections
  4. If SM version <= 0x45 (SM 69): creates UFT/UDT entries (section types 68/69) for each resolved symbol
  5. Under mutex lock, ORs the per-function WAR bitmask (config+232..240) into the global accumulator at a2+504

Thread Safety

All shared state modifications are protected by the mutex at a2+240:

  • String length accumulator updates (a2+296, a2+304) for string table pre-allocation
  • WAR bitmask accumulation (a2+504)
  • .sass_map section setup (instruction tree access)
  • Instruction merge from secondary codegen contexts (a2+80, a2+88)

Key Functions

| Address | Size | Purpose |
|---|---|---|
| sub_60FBF0 | ~76 KB decompiled | Per-kernel section attribute builder (section above) |
| sub_1CC9800 | 14,764 B (90 KB decompiled) | EIATTR builder -- master nvinfo section constructor |
| sub_1CC8950 | 2,634 B | EIATTR propagator -- barrier/register cross-function propagation |
| sub_1CC85F0 | ~200 B | EIATTR record emitter -- writes one TLV record |
| sub_1CC86D0 | ~500 B | Per-entry EIATTR emission (MIN_STACK_SIZE, CRS_STACK_SIZE, SAM_REGION_STACK_SIZE) |
| sub_1CC7FB0 | ~200 B | .nv.info section creator/finder |
| sub_1CC93A0 | ~500 B | .nv.compat attribute processor |
| sub_1CB3570 | 1,963 B | Generic section creator (44 call sites) |
| sub_1CB42D0 | -- | .text.<func> section creator |
| sub_1C9DC60 | 5,663 B | Section layout calculator (offset assignment) |
| sub_1CABD60 | 11,856 B | Master section allocator (shared/constant/local addresses) |
| sub_1CBE1B0 | ~10 KB | .nv.callgraph section builder |
| sub_1C97840 | -- | Architecture-gated EIATTR check |
| sub_1CA6890 | 454 lines | Constant bank value deduplication |
| sub_1CA92F0 | 585 lines | Shared memory interference graph + coloring |

Cross-References

Debug Information

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas generates DWARF-based debug information for cuda-gdb and other GPU debuggers. The debug subsystem spans three distinct code regions: an early-pipeline DWARF line table generator at 0x45A--0x45C that encodes PTX-level source mappings, a mid-pipeline SASS-level emitter at 0x860--0x868 that produces both .debug_line and .nv_debug_line_sass sections along with register mapping tables, and a late-stage DWARF processor/dumper cluster at 0x1CBF--0x1CC9 that handles .debug_info, .debug_abbrev, .debug_loc, and .debug_frame parsing and emission. The design follows a two-tier model: PTX-level debug info records source file/line to PTX instruction mappings, while SASS-level debug info records the final PTX-to-SASS address correspondence after all optimizations. NVIDIA extends standard DWARF with proprietary sections (.nv_debug_line_sass, .nv_debug_info_reg_sass, .nv_debug_info_reg_type, .nv_debug_info_ptx) and Mercury-namespace variants (.nv.merc.debug_*) for Capsule Mercury binaries.

| Role | Function | Purpose |
|---|---|---|
| DWARF line generator (PTX) | sub_45C3A0 (9,041 bytes) | PTX source line to address mapping |
| LEB128 encoder | sub_45A870 (5,293 bytes) | Variable-length integer encoding for DWARF |
| Debug line table (SASS) | sub_866BB0 (3,273 bytes) | .debug_line / .nv_debug_line_sass |
| Debug top-level entry | sub_867880 (100 bytes) | Calls line generator twice (PTX + SASS) |
| Reg info SASS emitter | sub_8679F0 (225 bytes) | .nv_debug_info_reg_sass |
| Reg type emitter | sub_867B00 (230 bytes) | .nv_debug_info_reg_type |
| Post-RA debug annotator | sub_88D870 (2,656 bytes) | Final source line annotation |
| DWARF form name table | sub_1CBF820 (400 bytes) | DW_FORM_* ID-to-string |
| DWARF attribute name table | sub_1CBF9B0 (1,600 bytes) | DW_AT_* ID-to-string |
| .debug_abbrev parser | sub_1CC0850 (3,704 bytes) | Abbreviation table handler |
| .debug_info parser | sub_1CC4A40 (5,218 bytes) | DIE tree walker |
| CU header parser | sub_1CC5EB0 (2,023 bytes) | Compilation unit headers |
| Location expression printer | sub_1CC34E0 (3,094 bytes) | DW_OP_* decoder |
| DWARF info processor | sub_1CC24C0 (3,993 bytes) | Non-dump emission mode |
| Debug section classifier | sub_1C9D1F0 (2,667 bytes) | Section name to type ID mapper |
| Mercury debug classifier | sub_1C98C60 (1,755 bytes) | .nv.merc.debug_* classifier |
| SASS debug classifier | sub_1C99340 | .debug_* standard section classifier |
| Debug section type mapper | sub_1C998D0 | Maps section name to internal buffer pointer |
| DWARF attribute emitter | sub_66A0B0 (28 KB) | Emits DWARF attributes during IR lowering |
| DWARF debug info builder | sub_66F4E0 (59 KB) | Main DWARF debug info section builder |
| DWARF line table builder | sub_66E250 (33 KB) | Builds .debug_line during IR phase |
| Debug line number formatter | sub_671C00 (11 KB) | Formats line number records |

CLI Flags

Three flags control debug information emission:

| Flag | Internal field | Effect |
|---|---|---|
| --device-debug / -g | deviceDebug at ELFW context + 432 | Full debug info: all DWARF sections (.debug_info, .debug_abbrev, .debug_frame, .debug_line, .debug_loc, .debug_str, .debug_aranges), plus NVIDIA extensions. Disables most optimizations, preserving source-level variable correspondence. |
| --lineinfo / -ln | lineInfo in option context | Line tables only: generates .debug_line and .nv_debug_line_sass without full DWARF DIE trees. Preserves optimization levels. Sufficient for cuda-memcheck and profiler source correlation. |
| --suppress-debug-info | (suppresses emission) | Strips all debug sections from output, even if debug input was provided. |

The cubin entry point sub_612DE0 reads both deviceDebug and lineInfo flags and passes them through the ELF output pipeline. The section classifier at sub_1C9D1F0 checks the byte at context offset +432 (deviceDebug) to decide whether to emit .debug_frame and .debug_line sections -- when this byte is zero (no -g), those sections are conditionally suppressed.

The --lineinfo flag is described in the CLI as "Generate debug line table information" and is orthogonal to -g. When only --lineinfo is active, ptxas generates the two line table sections but omits the heavyweight .debug_info/.debug_abbrev/.debug_loc sections. The string "device-debug or lineinfo" appears in a validation check that prevents --extensible-whole-program from being combined with either debug mode.

Debug Section Catalog

ptxas generates three tiers of debug sections depending on compilation mode. Standard DWARF sections use the conventional .debug_* namespace. NVIDIA extensions use .nv_debug_* names. Capsule Mercury binaries additionally carry .nv.merc.debug_* clones.

Standard DWARF Sections

| Section | DWARF standard | Content |
|---|---|---|
| .debug_abbrev | Yes (DWARF 2+) | Abbreviation table defining DIE tag/attribute schemas |
| .debug_aranges | Yes | Address range table mapping compilation units to code ranges |
| .debug_frame | Yes | Call frame information (CFA rules for unwinding) |
| .debug_info | Yes | DIE tree: compilation units, subprograms, variables, types |
| .debug_line | Yes | Line number program (source file/line to PTX address mapping) |
| .debug_loc | Yes | Location lists for variables with multiple storage locations |
| .debug_macinfo | Yes | Macro information |
| .debug_pubnames | Yes | Public name lookup table |
| .debug_str | Yes | String table referenced by DW_FORM_strp |

NVIDIA Extension Sections

| Section | Content |
|---|---|
| .nv_debug_line_sass | SASS-level line table: maps SASS instruction addresses to PTX source lines. Parallel to .debug_line but at the machine code level. |
| .nv_debug_info_reg_sass | Register-to-variable mapping for SASS. Records which physical GPU register(s) hold each source variable at each program point. |
| .nv_debug_info_reg_type | Type information for register mappings. Associates register locations with DWARF type descriptions. |
| .nv_debug_info_ptx | PTX-level debug info section. Created by sub_1CC5EB0 as a PTX-namespace mirror of .debug_info. |
| .nv_debug.shared | Debug metadata for shared memory variables. |

Mercury Namespace Variants

For Capsule Mercury binaries (SM 100+), every debug section is cloned into the .nv.merc.* namespace. The Mercury debug classifier sub_1C98C60 recognizes 15 Mercury-namespaced debug sections:

.nv.merc.debug_abbrev      .nv.merc.debug_aranges
.nv.merc.debug_frame        .nv.merc.debug_info
.nv.merc.debug_loc          .nv.merc.debug_macinfo
.nv.merc.debug_pubnames     .nv.merc.debug_str
.nv.merc.debug_line         (and additional variants)

These sections carry the PTX-level debug information that travels inside the Mercury capsule, enabling deferred finalization to produce debug-capable SASS without re-invoking the full compiler.

Debug Information Pipeline

Debug data flows through three pipeline stages. Each stage operates on a different intermediate representation and produces output at a different abstraction level.

STAGE 1: PTX PARSING  (0x45A-0x45C)
  PTX source --> .loc directives --> DWARF line number program
  sub_45C3A0: Reads .loc directives from PTX input
              Builds file/directory tables
              Generates DWARF line number program (LEB128-encoded)
              Creates "$LDWend" end-of-program label
              Uses function-to-index map ("function index not found
              in debug function map" on error)
  sub_45A870: LEB128 encoder for all numeric fields:
              file number, prologue size, address advance,
              line advance, context, function offset

STAGE 2: IR LOWERING  (0x66A-0x672)
  Ori IR instructions carry debug info at instruction node offset +20
  sub_66F4E0 (59KB): Main DWARF debug info builder
  sub_66E250 (33KB): DWARF .debug_line builder
  sub_66A0B0 (28KB): DWARF attribute emitter (directory id, time stamp, file size)
  sub_671C00 (11KB): Debug line number formatter (context, functionOffset, line number)

STAGE 3: POST-RA + ELF EMISSION  (0x860-0x868, 0x1CBF-0x1CC9)
  After register allocation, physical register assignments are known.
  sub_88D870:  PostRA debug info annotator -- finalizes source line
               mappings after all code motion/scheduling
  sub_867880:  Top-level debug entry -- calls sub_866BB0 twice:
               once with a3=0 for .debug_line (PTX-level)
               once with a3=1 for .nv_debug_line_sass (SASS-level)
  sub_866BB0:  DWARF .debug_line section generator
  sub_8679F0:  .nv_debug_info_reg_sass emitter
  sub_867B00:  .nv_debug_info_reg_type emitter

Line Number Table Generation

The DWARF .debug_line section generator sub_866BB0 is the central function for line table construction. It produces standard DWARF 2 line number programs that map addresses to source locations.

Parameters

// sub_866BB0 -- DebugLineTableGenerator
// a1: debug_line_context (pointer to ~460-byte state structure)
// a2: ELF output context (ELFW object)
// a3: section index  (0 = .debug_line,  nonzero = .nv_debug_line_sass)
// a4: unused
// a5: source file path (const char*, used for .nv_debug_line_sass only)

Algorithm

  1. Section selection: Based on a3, selects the target section name. For a3 == 0, looks up or creates .debug_line via sub_1CB2C60 / sub_1CA7AB0. For a3 != 0, uses .nv_debug_line_sass.

  2. Source file collection: If a5 provides a source file path and the section is the SASS variant, copies the filename into the debug context at offset +272. Iterates source files via sub_4271E0 (directory iterator), sorts entries using sub_866B80 (file comparison callback).

  3. Directory table construction: For each source file, splits the path into directory and filename components. Records unique directories via a hash table (sub_426150 / sub_426D60). Each directory gets a sequential index.

  4. File table construction: Each source file entry is a 40-byte record:

    • File name string
    • Directory index (LEB128)
    • Modification timestamp from stat() (st_mtim.tv_sec)
    • File size from stat() (st_size)
  5. Line number program generation: Generates the DWARF line number state machine program using standard opcodes (DW_LNS_copy, DW_LNS_advance_pc, DW_LNS_advance_line, DW_LNS_set_file, etc.) and special opcodes for compact address/line delta encoding.

  6. Finalization: Writes the complete section via sub_1CA7180 (ELF section write).
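
Step 5's special opcodes follow the standard DWARF 2 formula (this is the specification's computation, not code recovered from the binary). With the common header parameters line_base = -5, line_range = 14, opcode_base = 13, a single byte encodes both an address delta and a line delta:

```c
#include <assert.h>

/* Standard DWARF 2 special-opcode computation:
 *   opcode = (line_delta - line_base) + line_range * addr_delta + opcode_base
 * Valid only when line_delta fits the [line_base, line_base+line_range)
 * window and the result fits in one byte; otherwise the generator must
 * fall back to DW_LNS_advance_line / DW_LNS_advance_pc. */
int special_opcode(int addr_delta, int line_delta,
                   int line_base, int line_range, int opcode_base)
{
    if (line_delta < line_base || line_delta >= line_base + line_range)
        return -1;   /* line delta out of window */
    int op = (line_delta - line_base) + line_range * addr_delta + opcode_base;
    if (op > 255)
        return -1;   /* does not fit in one byte */
    return op;
}
```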

Debug Line Context Structure

debug_line_context (at a1, ~460 bytes):
  +0:     vtable pointer
  +16:    SASS line info pointer (nonzero triggers second pass)
  +64:    per-section context base (160-byte stride, indexed by a3)
  +96:    filename buffer pointer
  +104:   filename buffer size
  +108:   file_count
  +112:   directory buffer pointer
  +120:   directory_count
  +216:   source file count (.debug_line variant)
  +256:   raw filename pointer
  +268:   raw filename flag
  +272:   filename copy buffer (for .nv_debug_line_sass)
  +280:   filename copy length
  +376:   source file count (.nv_debug_line_sass variant)
  +408:   .nv_debug_info_reg_sass buffer chain (linked list head)
  +416:   .nv_debug_info_reg_sass final buffer pointer
  +424:   .nv_debug_info_reg_sass total size
  +432:   .nv_debug_info_reg_type buffer chain (linked list head)
  +440:   .nv_debug_info_reg_type final buffer pointer
  +448:   .nv_debug_info_reg_type total size
  +456:   memory arena / pool allocator

Top-Level Entry

The top-level debug emitter sub_867880 is minimal -- it calls the line table generator twice:

// sub_867880 -- DebugInfoTopLevel (simplified)
void emit_debug_info(ctx, elf, aux, source_path) {
    sub_866BB0(ctx, elf, 0, aux, source_path);   // .debug_line
    if (ctx->sass_line_info)                       // offset +16
        sub_866BB0(ctx, elf, 1, aux, source_path); // .nv_debug_line_sass
}

LEB128 Encoding

The DWARF standard uses LEB128 (Little-Endian Base 128) variable-length encoding for integers throughout debug sections. ptxas implements this in sub_45A870, which handles encoding for multiple fields. Error strings in this function reveal the field types being encoded:

| Field | Error string on overflow |
|---|---|
| File number | "when generating LEB128 number for file number" |
| Prologue marker | "when generating LEB128 number for setting prologue" |
| Address advance | "when generating LEB128 number for address advance" |
| Line advance | "when generating LEB128 number for line advance" |
| Context | "when generating LEB128 number for setting context" |
| Function offset | "when generating LEB128 number for setting function Offset" |
| File timestamp | "when generating LEB128 number for timestamp" (in sub_866BB0) |
| File size | "when generating LEB128 number for file size" (in sub_866BB0) |
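
The encoding itself is the standard ULEB128 algorithm (7 payload bits per byte, continuation bit 0x80), which any reimplementation of sub_45A870 must reproduce exactly:

```c
#include <assert.h>
#include <stdint.h>

/* Reference ULEB128 encoder: emits 7 value bits per byte, LSB-first,
 * setting the high bit on every byte except the last.
 * Returns the number of bytes written to out. */
int uleb128_encode(uint64_t v, uint8_t *out)
{
    int n = 0;
    do {
        uint8_t b = v & 0x7F;
        v >>= 7;
        if (v) b |= 0x80;   /* continuation bit: more bytes follow */
        out[n++] = b;
    } while (v);
    return n;
}
```

For example, 624485 encodes to the three bytes 0xE5 0x8E 0x26.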

Register-to-Variable Mapping

After register allocation, ptxas knows the physical GPU registers assigned to each source variable. Two NVIDIA extension sections capture this:

.nv_debug_info_reg_sass

Emitted by sub_8679F0. Records which physical registers (R0--R255, P0--P7, UR0--UR63) hold which source variables at each SASS instruction address. The data is accumulated during code generation into a linked list of buffer chunks at debug context offsets +408/+416/+424. At emission time, the chunks are concatenated into a contiguous buffer and written as an ELF section:

// sub_8679F0 -- simplified
void emit_reg_sass(debug_ctx, elf) {
    // Collect linked list of buffer chunks from debug_ctx+408
    chunks = linked_list_to_array(debug_ctx->reg_sass_chain);
    buf = allocate(debug_ctx->reg_sass_total_size);
    offset = 0;
    for (chunk in chunks) {
        memcpy(buf + offset, chunk->data, chunk->size);
        offset += chunk->size;
    }
    section = create_section(elf, ".nv_debug_info_reg_sass", 0, 1, 0);
    write_section(elf, section, buf, 1, total_size);
}

.nv_debug_info_reg_type

Emitted by sub_867B00. Structurally identical to the reg_sass emitter but operates on offsets +432/+440/+448. Associates register locations with DWARF type information, enabling the debugger to interpret register contents correctly (e.g., distinguishing a 32-bit float in R5 from a 32-bit integer in R5).

DWARF Processing Subsystem

The DWARF processing cluster at 0x1CBF--0x1CC9 handles both generation and diagnostic dumping of DWARF sections. The code can operate in two modes: a dump mode that prints human-readable representations (for --dump-debug-info or internal diagnostics), and an emission mode that processes raw DWARF bytes for the final binary.

DWARF Form Table -- sub_1CBF820

Maps DWARF form IDs to string names. Supports DWARF 2 forms:

| ID | Form | Encoding |
|---|---|---|
| 1 | DW_FORM_addr | Target address |
| 3 | DW_FORM_block2 | 2-byte length block |
| 4 | DW_FORM_block4 | 4-byte length block |
| 5 | DW_FORM_data2 | 2-byte unsigned |
| 6 | DW_FORM_data4 | 4-byte unsigned |
| 7 | DW_FORM_data8 | 8-byte unsigned |
| 8 | DW_FORM_string | Null-terminated inline |
| 9 | DW_FORM_block | ULEB128 length block |
| 10 | DW_FORM_block1 | 1-byte length block |
| 11 | DW_FORM_data1 | 1-byte unsigned |
| 12 | DW_FORM_flag | Boolean byte |
| 13 | DW_FORM_sdata | Signed LEB128 |
| 14 | DW_FORM_strp | 4-byte offset into .debug_str |
| 15 | DW_FORM_udata | Unsigned LEB128 |
| 16 | DW_FORM_ref_addr | Address-sized reference |
| 17 | DW_FORM_ref1 | 1-byte CU-relative reference |
| 18 | DW_FORM_ref2 | 2-byte CU-relative reference |
| 19 | DW_FORM_ref4 | 4-byte CU-relative reference |
| 20 | DW_FORM_ref8 | 8-byte CU-relative reference |
| 21 | DW_FORM_ref_udata | ULEB128 CU-relative reference |
| 22 | DW_FORM_indirect | Form specified inline |

The absence of DWARF 4/5 forms (e.g., DW_FORM_sec_offset, DW_FORM_exprloc, DW_FORM_flag_present) indicates ptxas targets DWARF version 2, consistent with the pointer size and CU header format observed in sub_1CC5EB0.

DWARF Attribute Table -- sub_1CBF9B0

Maps DWARF attribute IDs to string names. The function recognizes a comprehensive set of standard attributes. Notable entries include:

  • Location attributes: DW_AT_location (2), DW_AT_frame_base (64), DW_AT_data_member_location (56)
  • Name/type: DW_AT_name (3), DW_AT_type (73), DW_AT_encoding (62)
  • Scope: DW_AT_low_pc (17), DW_AT_high_pc (18), DW_AT_stmt_list (16)
  • Producer: DW_AT_producer (37), DW_AT_comp_dir (27), DW_AT_language (19)
  • Subprogram: DW_AT_inline (32), DW_AT_prototyped (39), DW_AT_artificial (52)
  • Array: DW_AT_lower_bound (34), DW_AT_upper_bound (47), DW_AT_count (55)
  • Calling: DW_AT_calling_convention (54), DW_AT_return_addr (42)
  • Accessibility: DW_AT_accessibility (50), DW_AT_external (63)
  • C++ support: DW_AT_vtable_elem_location (77), DW_AT_containing_type (29)

.debug_abbrev Parser -- sub_1CC0850

Parses the abbreviation table that defines the schema for each DIE tag. The dump mode output header is:

Contents of the .debug_abbrev section:
  Number  TAG

Each entry includes:

  • Abbreviation number
  • TAG name (e.g., DW_TAG_compile_unit, DW_TAG_subprogram)
  • Children indicator: [has children] or [has no children]
  • Attribute-form pairs

The function includes a safety check: "unexpectedly too many dwarf attributes for any DW_TAG entry!" -- a guard against malformed or corrupt abbreviation tables.

.debug_info Parser -- sub_1CC4A40

Walks the DIE tree, printing entries with nesting depth indentation:

 <%d><%x>:  Abbrev Number: %d   (0x%02x %s)

Format: <nesting_depth><byte_offset>: Abbrev Number: <n> (<tag_hex> <tag_name>). Null DIEs are printed as " (nill) ". Attribute values are formatted by sub_1CC4100 (the attribute value printer) which dispatches on form type.

Compilation Unit Header -- sub_1CC5EB0

Parses and prints CU headers, and creates the NVIDIA extension .nv_debug_info_ptx section:

 Compilation Unit @ offset 0x%zx:
  Length: %d
  Version: %d
  Abbrev Offset: %d
  Pointer Size: %d

The pointer size field is significant -- it determines the size of DW_FORM_addr values and DW_FORM_ref_addr references throughout the CU.

Location Expression Decoder -- sub_1CC34E0

Decodes DWARF location expressions (DW_OP_* operations) used in DW_AT_location and related attributes. The supported operations reveal how ptxas encodes GPU variable locations:

| Operation | String | GPU usage |
|---|---|---|
| DW_OP_addr | "DW_OP_addr: 0x%x" | Absolute memory address (global/shared/local) |
| DW_OP_const4u | "DW_OP_const4u: %d" | 4-byte unsigned constant |
| DW_OP_xderef | "DW_OP_xderef" | Cross-address-space dereference (GPU memory spaces) |
| DW_OP_plus_uconst | "DW_OP_plus_uconst: %llu" | Add unsigned constant to stack top |
| DW_OP_lit0--DW_OP_lit31 | "DW_OP_lit%u" | Push literal 0--31 |
| DW_OP_reg0--DW_OP_reg31 | "DW_OP_reg%d" | Variable in register N |
| DW_OP_breg0--DW_OP_breg31 | "DW_OP_breg%d %lld" | Register N + signed offset |
| DW_OP_fbreg | "DW_OP_fbreg: %lld" | Frame base + signed offset (stack variables) |
| DW_OP_nop | "DW_OP_nop" | No operation |
| DW_OP_stack_value | "DW_OP_stack_value" | Value is on DWARF expression stack, not in memory |

The presence of DW_OP_xderef is particularly noteworthy -- this is a DWARF operation rarely used in CPU debuggers but essential for GPU debugging, where variables may reside in different memory spaces (global, shared, local, constant) that require address-space-qualified access.

Debug Section Classification

Three classifier functions map section names to internal type IDs. The type IDs route sections to the correct processing pipeline during ELF assembly.

SASS Classifier -- sub_1C99340

Recognizes standard DWARF sections by comparing the section name (obtained via sub_1CB9E50) against hardcoded strings. Returns 1 (is-debug-section) for:

.debug_abbrev    .debug_aranges   .debug_frame
.debug_info      .debug_loc       .debug_macinfo
.debug_pubnames  .debug_str       .debug_line

Plus the NVIDIA extension:

.nv_debug_info_reg_sass

Mercury Classifier -- sub_1C98C60

The Mercury classifier sub_1C98C60 checks for the .nv.merc. prefix on the same set of debug section names. It uses a strcmp chain against 15 Mercury-namespaced section names. This classifier is called from 4 sites, primarily during Capsule Mercury construction when debug sections need to be cloned into the merc namespace.

Unified Classifier -- sub_1C9D1F0

The master debug section classifier sub_1C9D1F0 (2,667 bytes, 13 callees) handles both SASS and Mercury variants. It:

  1. Checks whether the section has the Mercury flag (bit 0x10 in section flags byte at offset +11)
  2. For Mercury sections, dispatches to the .nv.merc.debug_* name check
  3. For standard sections, dispatches to the .debug_* name check
  4. Recognizes additional NVIDIA extensions: .nv_debug_line_sass, .nv_debug_info_reg_sass, .nv_debug_info_reg_type
  5. Uses setjmp/longjmp error recovery for malformed section handling

The function also checks the deviceDebug flag at context offset +432 to suppress .debug_frame and .debug_line when debug info is not requested. This is the gate that prevents line tables from appearing in release builds.

Section Type Mapper -- sub_1C998D0

Maps debug section names to internal buffer pointers within the debug context object:

Section nameReturns pointer at offset
.debug_linecontext + 80 (a1[10])
.debug_framecontext + 72 (a1[9])
.nv_debug_line_sasscontext + 88 (a1[11])
.debug_info(subsequent check)
.debug_loc(subsequent check)

This enables the ELF emitter to route section data to the correct output buffer during final assembly.

Instruction-Level Debug Metadata

Each internal instruction node in the Ori IR carries debug metadata at offset +20 (a pointer or encoded value for source line information) and offset +0 (a pointer to the PTX source location). This metadata travels through the entire optimization pipeline:

  1. PTX parsing: The parser records .loc directives and attaches source file/line/column to each instruction as it is lowered from PTX to the Ori IR.

  2. Optimization passes: Most optimization passes preserve or propagate debug metadata. When instructions are cloned (e.g., loop unrolling), the clone inherits the original's debug info. When instructions are deleted, their debug info is lost -- the debugger will map those addresses to the nearest surviving instruction's source line.

  3. Post-RA annotation (sub_88D870): After register allocation and scheduling, this pass finalizes the source-line-to-SASS-address correspondence. It walks all instructions and records the final mapping that will be encoded into the .nv_debug_line_sass section.

  4. ELF emission: The debug line table generator sub_866BB0 reads the finalized mappings and encodes them as a DWARF line number program.

The PTX-level line generator sub_45C3A0 uses the label $LDWend as an end-of-debug-range marker. The error message "function index not found in debug function map" points to an internal function-to-index map that translates ptxas function identifiers into DWARF subprogram indices.

DWARF Version and Extensions

ptxas generates DWARF version 2 debug information. Evidence:

  • The form table (sub_1CBF820) covers exactly forms 1--22, which is the DWARF 2 form set. No DWARF 3+ forms (DW_FORM_sec_offset = 0x17, DW_FORM_exprloc = 0x18) are present.
  • The CU header parser (sub_1CC5EB0) prints "Version: %d" as a field, consistent with the 11-byte DWARF 2 CU header format.
  • The attribute table includes DWARF 2 attributes only.

CUDA-Specific DWARF Extensions

NVIDIA extends standard DWARF for GPU debugging through:

  1. Address space encoding: DW_OP_xderef is used with address space qualifiers to distinguish GPU memory spaces. The DW_AT_address_class attribute (recognized in the attribute table at ID 51) encodes CUDA memory space identifiers.

  2. Parallel execution model: GPU warps execute 32 threads simultaneously. Debug info must account for the fact that each "program counter" corresponds to 32 concurrent threads, and divergent threads may be at different source locations.

  3. Register mapping: The .nv_debug_info_reg_sass section provides a CUDA-specific register-to-variable mapping that goes beyond standard DWARF location lists. GPU register files are much larger (up to 255 general-purpose registers) and have different allocation semantics than CPU registers.

  4. PTX/SASS duality: The two-tier line table (.debug_line for PTX, .nv_debug_line_sass for SASS) reflects the unique compilation model where PTX is an intermediate representation with its own debug significance.

Section Layout Considerations

The ELF section layout calculator sub_1C9DC60 applies special handling to .debug_line:

  • .debug_line sections receive special padding during layout to ensure proper alignment
  • Debug sections are placed after code and data sections in the ELF file
  • The .debug_line section name is one of only three section names explicitly checked by the layout calculator (alongside .nv.constant0 and .nv.reservedSmem)

Key Address Summary

Address rangeSubsystemFunction count
0x45A000--0x45F000PTX-level DWARF line generator + LEB128~6
0x660000--0x675000IR-level DWARF builder/emitter~8
0x860000--0x869000SASS-level debug line + register info~8
0x88D000--0x88E000Post-RA debug annotator1
0x1C98000--0x1C9A000Section classifiers (merc + SASS)~6
0x1C9D000--0x1C9E000Unified section classifier1
0x1CBF000--0x1CC7000DWARF processor/dumper cluster~12

Relocations & Symbols

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas defines two parallel relocation type systems for CUBIN ELF files: R_CUDA_* (117 types, ordinals 0--116) for SASS-encoded cubins targeting SM 30--90a, and R_MERCURY_* (65 types, ordinals 0--64) for Mercury-encoded cubins targeting SM 100+ (Blackwell and later). Both systems use standard Elf64_Rela relocation entries in .rela.text.<funcname> sections, with a custom resolution algorithm that handles alias redirection, dead function filtering, UFT/UDT pseudo-relocations, PC-relative branch validation, and sub-byte instruction patching. The symbol table (.symtab) follows standard ELF Elf64_Sym format with CUDA-specific symbol types and an extended section index mechanism (.symtab_shndx) for programs exceeding 65,280 sections.

Relocation resolversub_1CD48C0 (4,184 bytes binary, 22 KB decompiled, 17 callees)
Relocation writersub_1CD5920 (1,985 bytes binary, 11 KB decompiled)
Relocation creator (SASS)sub_1CD4510 (860 bytes binary)
Relocation creator (Mercury)sub_1CD46B0 (540 bytes binary)
Relocation pre-scansub_1CD43A0 (560 bytes binary)
Bit-field patchersub_1CD34E0 (3,700 bytes binary, sub_1CD33F0/sub_1CD3330 helpers)
Symbol table buildersub_1CB68D0 (9,578 bytes binary, 49 KB decompiled, 36 callees)
Symbol fixupsub_1CB2CA0 (2,038 bytes binary, 4 call sites)
Section index remapsub_1C99BB0 (4,900 bytes binary)
UFT managersub_1CD22E0 (1,979 bytes binary, 10 KB decompiled)
UFT slot validatorsub_1CD2AA0 (~800 bytes binary)
Bindless handlersub_1CAB300 (2,157 bytes binary, 12 KB decompiled)
R_CUDA table addressoff_2408B60 (117 entries x 64 bytes)
R_MERCURY table addressoff_2407B60 (65 entries x 64 bytes)

Relocation Type Systems

Table Selection Logic

The ELFW object stores the ELF class byte at offset 7 and a flags word at offset 48. The relocation subsystem selects between the two tables based on the IsPIC flag combined with the ELF class:

// Table selection (reconstructed from sub_1CD48C0, sub_1CD4510, sub_1CD5920)
uint32_t test_bit = (elfw->ei_class == 'A') ? 1 : 0x80000000;
bool is_mercury = (test_bit & elfw->flags) != 0;

if (is_mercury) {
    // SM 100+ Mercury encoding: off_2407B60
    // Type codes start at 0x10000; subtract to index the table
    table = &R_MERCURY_table;           // off_2407B60
    index = raw_type - 0x10000;         // range check: index <= 0x3F (63)
} else {
    // SM 30-90a SASS encoding: off_2408B60
    table = &R_CUDA_table;              // off_2408B60
    index = raw_type;                   // range check: index <= 0x73 (115)
}

Mercury relocation type codes are stored with a 0x10000 offset in the internal relocation entry's type field. This lets a single code path handle both systems -- the table selection just subtracts the offset for Mercury types.

Relocation Descriptor Table Format

Each entry in the relocation type descriptor table is 64 bytes (8 qwords). The layout is accessed through pointer arithmetic patterns like table[8 * index + N] where the table pointer type is char** (8-byte stride):

// Relocation type descriptor -- 64 bytes per entry (reconstructed)
struct reloc_type_desc {
    const char* name;       // +0:  R_CUDA_* or R_MERCURY_* name string
    uint32_t    unknown_08; // +8:  unknown field
    uint32_t    unknown_0c; // +12: unknown field
    uint32_t    bit_start;  // +16: starting bit position in instruction
    uint32_t    bit_width;  // +20: field width in bits
    uint32_t    patch_mode; // +24: patching mode (0=none, 1=direct, 6/7=split)
    uint32_t    flags_hi;   // +28: high flags (value 12-15 triggers callgraph)
    // ... remaining 32 bytes: additional patching parameters
};

The patch_mode field at offset +24 drives the bit-field patching logic in sub_1CD34E0. The switch statement handles these modes:

ModeDescriptionTypes
0No-op (sentinel/terminator)R_CUDA_NONE, R_CUDA_NONE_LAST
1, 0x12, 0x2EDirect bit-field write (full or partial 64-bit word)Most absolute/PC-relative types
6, 0x37Split low-word patching (handles cross-qword boundaries)LO types, sub-byte 8_N types
7, 0x38Split high-word patching (uses HIDWORD of value)HI types

When flags_hi (at descriptor offset +28) is in the range 12--15, the relocation creator calls sub_1CBD0D0 to register the relocation's target section in the call graph. This triggers call graph edge creation for function descriptors and branch targets.

R_CUDA_* Relocation Types

117 types from R_CUDA_NONE (ordinal 0) to R_CUDA_NONE_LAST (ordinal 116). String addresses span 0x23FBE0E--0x23FC6B6 in the ptxas binary, confirming these are contiguous in the read-only data section. Ordinals are assigned by string table order.

Absolute Address Relocations

OrdinalNameBit FieldPurpose
0R_CUDA_NONE--Sentinel / no relocation
1R_CUDA_3232-bitAbsolute 32-bit address
2R_CUDA_6464-bitAbsolute 64-bit address
5R_CUDA_ABS32_2632-bit at bit 26Absolute address, 26-bit encoding
10R_CUDA_ABS32_LO_26low 32 at bit 26Low half of 64-bit address
11R_CUDA_ABS32_HI_26high 32 at bit 26High half of 64-bit address
12R_CUDA_ABS32_2332-bit at bit 23Absolute address, 23-bit encoding
13R_CUDA_ABS32_LO_23low 32 at bit 23Low half, 23-bit encoding
14R_CUDA_ABS32_HI_23high 32 at bit 23High half, 23-bit encoding
15R_CUDA_ABS24_2624-bit at bit 2624-bit absolute address
16R_CUDA_ABS24_2324-bit at bit 2324-bit absolute, 23-bit encoding
17R_CUDA_ABS16_2616-bit at bit 2616-bit absolute address
18R_CUDA_ABS16_2316-bit at bit 2316-bit absolute, 23-bit encoding
42R_CUDA_ABS32_2032-bit at bit 20Volta+ encoding format
43R_CUDA_ABS32_LO_20low 32 at bit 20Low half, 20-bit encoding
44R_CUDA_ABS32_HI_20high 32 at bit 20High half, 20-bit encoding
45R_CUDA_ABS24_2024-bit at bit 2024-bit, 20-bit encoding
46R_CUDA_ABS16_2016-bit at bit 2016-bit, 20-bit encoding
55R_CUDA_ABS32_3232-bit at bit 32Ampere+ encoding format
56R_CUDA_ABS32_LO_32low 32 at bit 32Low half, 32-bit position
57R_CUDA_ABS32_HI_32high 32 at bit 32High half, 32-bit position
58R_CUDA_ABS47_3447-bit at bit 3447-bit wide field
59R_CUDA_ABS16_3216-bit at bit 3216-bit, 32-bit position
60R_CUDA_ABS24_3224-bit at bit 3224-bit, 32-bit position
74R_CUDA_ABS24_4024-bit at bit 4024-bit at offset 40
75R_CUDA_ABS55_16_3455-bit, 16+34 splitSplit wide field
100R_CUDA_ABS20_4420-bit at bit 4420-bit at offset 44
114R_CUDA_ABS56_16_3456-bit, 16+34 splitSplit wide field
70R_CUDA_32_LOlow 32Low half of 64-bit
71R_CUDA_32_HIhigh 32High half of 64-bit

The naming convention encodes the bit-field geometry: R_CUDA_ABS<width>_<start_bit> indicates that <width> bits of the resolved address are patched into the instruction at bit position <start_bit>. The LO/HI suffix indicates low or high 32 bits of a 64-bit value. The different start positions (20, 23, 26, 32, 34, 40, 44) correspond to different SASS instruction encoding formats across SM generations: Kepler (26), Maxwell/Pascal (23), Volta/Turing (20), Ampere/Ada/Hopper (32).

Global Address Relocations

OrdinalNamePurpose
3R_CUDA_G32Global-space 32-bit address
4R_CUDA_G64Global-space 64-bit address
84R_CUDA_G8_0Global-space byte 0 of 64-bit instruction
85R_CUDA_G8_8Global-space byte 1
86R_CUDA_G8_16Global-space byte 2
87R_CUDA_G8_24Global-space byte 3
88R_CUDA_G8_32Global-space byte 4
89R_CUDA_G8_40Global-space byte 5
90R_CUDA_G8_48Global-space byte 6
91R_CUDA_G8_56Global-space byte 7

Global address relocations target .nv.global and .nv.global.init sections. The G8_* sub-byte variants patch individual bytes within a 64-bit instruction word, used when the instruction encoding requires the address to be spread across non-contiguous bit fields.

PC-Relative Relocations

OrdinalNamePurpose
40R_CUDA_PCREL_IMM24_26PC-relative 24-bit immediate at bit 26
41R_CUDA_PCREL_IMM24_23PC-relative 24-bit immediate at bit 23

PC-relative relocations resolve branch and call targets. The resolver enforces a critical constraint:

"PC relative branch address should be in the same section"

This means intra-function branches use PC-relative relocations, but cross-function calls use absolute or function descriptor relocations. The 24-bit immediate provides a +/-8 MB range from the instruction address, sufficient for any single kernel.

Constant Field Relocations

OrdinalNamePurpose
24R_CUDA_CONST_FIELD19_2819-bit constant bank offset at bit 28
25R_CUDA_CONST_FIELD19_2319-bit constant bank offset at bit 23
36R_CUDA_CONST_FIELD21_2621-bit constant bank offset at bit 26
38R_CUDA_CONST_FIELD19_2619-bit constant bank offset at bit 26
39R_CUDA_CONST_FIELD21_2321-bit constant bank offset at bit 23
50R_CUDA_CONST_FIELD19_2019-bit constant bank offset at bit 20
54R_CUDA_CONST_FIELD21_2021-bit constant bank offset at bit 20
64R_CUDA_CONST_FIELD19_4019-bit constant bank offset at bit 40
66R_CUDA_CONST_FIELD21_3821-bit constant bank offset at bit 38
115R_CUDA_CONST_FIELD22_3722-bit constant bank offset at bit 37

Constant field relocations patch .nv.constant0.<func> bank offsets into load constant (LDC) instructions. The field width (19, 21, or 22 bits) determines the maximum addressable constant bank size: 19-bit supports 512 KB, 21-bit supports 2 MB, 22-bit supports 4 MB. During resolution, the constant bank deduplication pass (sub_1CA6890) may adjust the relocation offset:

"optimize ocg constant reloc offset from %lld to %lld"

Function Descriptor Relocations

OrdinalNamePurpose
31R_CUDA_FUNC_DESC32_2332-bit function descriptor at bit 23
32R_CUDA_FUNC_DESC32_LO_23Low 32 of descriptor at bit 23
33R_CUDA_FUNC_DESC32_HI_23High 32 of descriptor at bit 23
34R_CUDA_FUNC_DESC_32Full 32-bit function descriptor
35R_CUDA_FUNC_DESC_64Full 64-bit function descriptor
47R_CUDA_FUNC_DESC32_2032-bit function descriptor at bit 20
48R_CUDA_FUNC_DESC32_LO_20Low 32 of descriptor at bit 20
49R_CUDA_FUNC_DESC32_HI_20High 32 of descriptor at bit 20
61R_CUDA_FUNC_DESC32_3232-bit function descriptor at bit 32
62R_CUDA_FUNC_DESC32_LO_32Low 32 of descriptor at bit 32
63R_CUDA_FUNC_DESC32_HI_32High 32 of descriptor at bit 32
92--99R_CUDA_FUNC_DESC_8_0 -- R_CUDA_FUNC_DESC_8_56Sub-byte function descriptor patches

Function descriptors are used for indirect calls through function pointers. The descriptor contains the target function's entry point address and is loaded by the GPU's indirect call mechanism. The sub-byte FUNC_DESC_8_* variants patch individual bytes of the descriptor into instruction encoding slots, used in wide instruction formats where the descriptor address is spread across multiple fields. When the relocation creator detects a flags_hi value of 12--15 in the descriptor table entry, it calls sub_1CBD0D0 to register the call edge in the call graph.

Texture, Sampler, and Surface Relocations

OrdinalNamePurpose
6R_CUDA_TEX_HEADER_INDEXTexture header table index
7R_CUDA_SAMP_HEADER_INDEXSampler header table index
8R_CUDA_SURF_HW_DESCSurface hardware descriptor
9R_CUDA_SURF_HW_SW_DESCSurface hardware+software descriptor
19R_CUDA_TEX_SLOTTexture binding slot
20R_CUDA_SAMP_SLOTSampler binding slot
21R_CUDA_SURF_SLOTSurface binding slot
26R_CUDA_TEX_SLOT9_499-bit texture slot at bit 49
52R_CUDA_SURF_HEADER_INDEXSurface header table index
101R_CUDA_SAMP_HEADER_INDEX_0Sampler header index variant

These relocations connect texture/sampler/surface operations to their runtime-allocated descriptor table entries. The CUDA driver fills in the actual descriptor indices at launch time based on the kernel's resource binding.

Bindless Texture/Surface Relocations

OrdinalNamePurpose
22R_CUDA_TEX_BINDLESSOFF13_32Bindless texture offset, 13-bit at bit 32
23R_CUDA_TEX_BINDLESSOFF13_47Bindless texture offset, 13-bit at bit 47
29R_CUDA_TEX_BINDLESSOFF13_41Bindless texture offset, 13-bit at bit 41
30R_CUDA_TEX_BINDLESSOFF13_45Bindless texture offset, 13-bit at bit 45
51R_CUDA_BINDLESSOFF13_36Bindless offset, 13-bit at bit 36
65R_CUDA_BINDLESSOFF14_40Bindless offset, 14-bit at bit 40

Bindless texture/surface relocations are handled by sub_1CAB300, which creates $NVLINKBINDLESSOFF_<name> symbols for each bindless reference. During resolution:

"change reloc symbol from %d to %d"
"no bindless ref in section %s"
"unexpected usage of non-unified surface descriptors"

Sub-Byte Patch Relocations

OrdinalNamePurpose
76--83R_CUDA_8_0 -- R_CUDA_8_56Patch byte 0--7 of 64-bit instruction

These relocations patch a single byte at a specific 8-bit-aligned position within a 64-bit instruction word. They are used when the resolved value must be inserted into a non-standard bit position that does not align with the instruction encoding's immediate field boundaries.

Miscellaneous Relocations

OrdinalNamePurpose
27R_CUDA_6_316-bit field at bit 31
28R_CUDA_2_472-bit field at bit 47
37R_CUDA_QUERY_DESC21_37Query descriptor, 21-bit at bit 37
53R_CUDA_INSTRUCTION64Whole 64-bit instruction replacement
67R_CUDA_INSTRUCTION128Whole 128-bit instruction replacement
68R_CUDA_YIELD_OPCODE9_0YIELD opcode, 9-bit at bit 0
69R_CUDA_YIELD_CLEAR_PRED4_87Clear YIELD predicate, 4-bit at bit 87
72R_CUDA_UNUSED_CLEAR32Zero out 32-bit unused field
73R_CUDA_UNUSED_CLEAR64Zero out 64-bit unused field
116R_CUDA_NONE_LASTSentinel marking end of relocation table

The R_CUDA_INSTRUCTION64 and R_CUDA_INSTRUCTION128 types replace entire instruction words, used for instruction-level patching by the linker when the instruction encoding changes based on the final resolved address.

The R_CUDA_YIELD_* types handle YIELD-to-NOP conversion. When a kernel has forward-progress requirements that prevent yielding, the resolver converts YIELD instructions to NOPs:

"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."

The R_CUDA_UNUSED_CLEAR* types zero out instruction fields that are unused in the final encoding, ensuring deterministic output.

Unified Address Space Relocations

OrdinalNamePurpose
102R_CUDA_UNIFIEDUnified address (generic pointer)
103R_CUDA_UNIFIED_3232-bit unified address
104--111R_CUDA_UNIFIED_8_0 -- R_CUDA_UNIFIED_8_56Unified address sub-byte patches
112R_CUDA_UNIFIED32_LO_32Low 32 of unified at bit 32
113R_CUDA_UNIFIED32_HI_32High 32 of unified at bit 32

Unified address relocations resolve generic pointers that can point to global, shared, or constant memory. During final resolution, the resolver performs a type conversion from unified (type 103) to absolute (type 1):

// In sub_1CD48C0: unified reloc replacement
if (reloc_type == 103)  // R_CUDA_UNIFIED_32
    reloc_type = 1;     // R_CUDA_32

R_MERCURY_* Relocation Types

65 types from R_MERCURY_NONE (ordinal 0) to R_MERCURY_NONE_LAST (ordinal 64). String addresses span 0x23FB8C5--0x23FBDFA. Mercury relocations serve the same purpose as R_CUDA types but are designed for the Mercury intermediate representation used on SM 100+ targets.

Mercury Type Categories

CategoryTypesPurpose
AddressR_MERCURY_G64, R_MERCURY_ABS64, R_MERCURY_ABS32, R_MERCURY_ABS16Memory addresses
Split addressR_MERCURY_ABS32_LO, R_MERCURY_ABS32_HI64-bit address halves
Program-relativeR_MERCURY_PROG_REL64, R_MERCURY_PROG_REL32, R_MERCURY_PROG_REL32_LO, R_MERCURY_PROG_REL32_HIOffsets from program base
Tex/samp/surfR_MERCURY_TEX_HEADER_INDEX, R_MERCURY_SAMP_HEADER_INDEX, R_MERCURY_SURF_HEADER_INDEXResource descriptors
FunctionR_MERCURY_FUNC_DESC_64Function descriptor
Sub-byteR_MERCURY_8_0 -- R_MERCURY_8_56 (8 types)Byte-level patches
Global sub-byteR_MERCURY_G8_0 -- R_MERCURY_G8_56 (8 types)Global-space byte patches
Func desc sub-byteR_MERCURY_FUNC_DESC_8_0 -- R_MERCURY_FUNC_DESC_8_56 (8 types)Function descriptor byte patches
Abs-program-relativeR_MERCURY_ABS_PROG_REL32_LO, R_MERCURY_ABS_PROG_REL32_HI, R_MERCURY_ABS_PROG_REL32, R_MERCURY_ABS_PROG_REL64Absolute program-relative
Program-relative sub-byteR_MERCURY_PROG_REL8_0 -- R_MERCURY_PROG_REL8_56 (8 types)Program-relative byte patches
UnifiedR_MERCURY_UNIFIED, R_MERCURY_UNIFIED_32, R_MERCURY_UNIFIED_8_0 -- R_MERCURY_UNIFIED_8_56, R_MERCURY_UNIFIED32_LO, R_MERCURY_UNIFIED32_HIUnified address space
CleanupR_MERCURY_UNUSED_CLEAR64Zero out unused fields
SentinelsR_MERCURY_NONE, R_MERCURY_NONE_LASTTable boundaries

Mercury introduces program-relative relocations (PROG_REL*) that do not exist in the R_CUDA set. These compute offsets relative to the program base address rather than absolute virtual addresses, enabling position-independent code for the Mercury deferred finalization model. The Mercury finalizer (running at link or load time) resolves these program-relative relocations after the final code layout is known.

Relocation Encoding

ELF Relocation Entry Format

Cubin relocations use standard Elf64_Rela entries in .rela.text.<funcname> sections:

typedef struct {
    Elf64_Addr  r_offset;   // Byte offset within the section
    Elf64_Xword r_info;     // Symbol index (high 32) | Type (low 32)
    Elf64_Sxword r_addend;  // Addend for the relocation computation
} Elf64_Rela;  // 24 bytes

The r_info field packs the symbol table index in the upper 32 bits and the R_CUDA/R_MERCURY type code in the lower 32 bits:

#define ELF64_R_SYM(info)   ((info) >> 32)
#define ELF64_R_TYPE(info)  ((info) & 0xFFFFFFFF)

For Mercury types, the type code stored in r_info is the ordinal plus 0x10000. The resolver subtracts 0x10000 before indexing the R_MERCURY descriptor table.

Internal Relocation Entry

The ELFW object maintains relocations in an internal linked list at offset +376 of the ELFW structure. Each internal entry is a 32-byte node:

// Internal relocation entry (reconstructed from sub_1CD4510, sub_1CD46B0)
struct elfw_reloc {
    uint64_t offset;            // +0:  byte offset in target section
    uint64_t type_and_section;  // +8:  (target_section << 32) | reloc_type
    uint64_t addend;            // +16: relocation addend
    uint32_t symbol_index;      // +24: index into ELFW symbol table
    uint32_t alias_index;       // +28: original symbol if aliased, else 0
};

The type_and_section field encodes both the relocation type code (low 32 bits) and the target section index (high 32 bits) in a single 64-bit field.

Resolved Relocation Output

Resolved relocations are written by sub_1CD5920 to .nv.resolvedrela sections. Additionally, .nv.rel.action sections carry relocation action metadata for the CUDA driver's runtime linker.

Symbol Table Structure

.symtab Format

The symbol table uses standard Elf64_Sym entries (24 bytes each for 64-bit, 16 bytes for 32-bit):

typedef struct {
    Elf32_Word    st_name;    // String table offset
    unsigned char st_info;    // Type (low 4 bits) | Binding (high 4 bits)
    unsigned char st_other;   // Visibility (low 2 bits) | Flags
    Elf64_Half    st_shndx;   // Section index (or SHN_XINDEX=0xFFFF)
    Elf64_Addr    st_value;   // Symbol value (section offset)
    Elf64_Xword   st_size;    // Symbol size
} Elf64_Sym;

Internal Symbol Representation

The ELFW maintains an internal symbol structure (40+ bytes) with additional metadata:

OffsetSizeFieldDescription
+41st_infoLow nibble = type (STT_*), high nibble = binding strength
+51st_otherBits 0-1 = visibility, bits 4-7 = CUDA-specific flags
+62st_shndxSection index (0xFFFF = use extended index)
+88st_valueSymbol address; -1 = unallocated
+244section_linkInternal section reference
+284extra_indexSecondary symbol link
+328name_ptrPointer to symbol name string

Symbol Types

ELF TypeValueCUDA Usage
STT_NOTYPE0Undefined/external symbols
STT_OBJECT1Global/constant/shared variables
STT_FUNC2Kernel entry points, device functions
STT_SECTION3Section symbols (one per section)
STT_COMMON5Common symbols (.common symbol)
STT_CUDA_TEXTURE10Texture reference symbols
STT_CUDA_SURFACE11Surface reference symbols
STT_CUDA_SAMPLER12Sampler reference symbols
STT_CUDA_FUNC_DESC13Function descriptor (indirect call target)

The internal type field at offset +4 uses the low nibble for ELF standard types and the high nibble for binding/scope information. The resolver checks st_info & 0xF throughout its processing.

Function descriptor symbols (type 13) receive special handling in the relocation resolver. When the resolver encounters a type-13 symbol, it checks whether the symbol is allocated:

// sub_1CD48C0: function descriptor symbol handling
if ((sym->st_info & 0xF) == 13) {  // STT_CUDA_FUNC_DESC
    shndx = get_section_index(elfw, sym);
    if (shndx == 0) {
        // Unresolved -- check binding and ELFW flags
        if ((sym->st_other & 0xE0) == 0x20  // STB_GLOBAL
            || (sym->st_other & 0x10))       // CUDA-specific extern flag
        {
            // External function descriptor: keep relocation for linker
        }
    }
}

Symbol Binding and Visibility

The st_other byte encodes both ELF visibility (bits 0-1) and CUDA-specific binding flags (bits 4-7):

BitsFieldValues
0-1ELF visibility0 = STV_DEFAULT, 1 = STV_INTERNAL, 2 = STV_HIDDEN, 3 = STV_PROTECTED
4Extern flag1 = external linkage (for nvlink)
5-6Binding strength0x20 = STB_GLOBAL, 0x80 = STB_WEAK
7ReservedUsed internally during resolution

The two-bit binding value that the resolver extracts as st_other & 3 maps to:

ValueMeaningResolution
1STB_LOCAL / deadSkip relocation ("ignore reloc on dead func %s")
2STB_GLOBALNormal resolution
3STB_WEAKResolve if available, otherwise use default

Symbol Table Builder -- sub_1CB68D0

The symbol table builder (9,578 bytes, approximately 1,700 decompiled lines) processes the ELFW internal symbol list in these steps:

  1. Iterate symbols -- walks the symbol list from the ELFW object
  2. Filter deleted symbols -- 12 separate checks for "reference to deleted symbol" guard against stale entries from dead code elimination
  3. Handle __cuda_syscall -- special-cases the device-side syscall dispatcher symbol
  4. Resolve aliases -- follows alias chains to find the canonical symbol
  5. Compute values -- resolves st_value from section base + offset
  6. Create section symbols -- ensures every section has an STT_SECTION symbol; emits "found multiple section symbols for %s" if duplicates exist
  7. Handle SHN_XINDEX overflow -- when section index >= SHN_LORESERVE (0xFF00 = 65,280), sets st_shndx = SHN_XINDEX (0xFFFF) and stores the real index in .symtab_shndx
  8. Build .symtab_shndx -- populates the extended index table for overflow sections

Error strings observed in the builder:

StringCondition
"reference to deleted symbol"Symbol marked deleted but still referenced (12 checks)
"ignore symbol %s in unused section"Symbol in eliminated section
"ignore symbol string %s for sym %d"Skipping symbol name for unnamed/internal symbol
"found multiple section symbols for %s"Duplicate STT_SECTION entries
"symbol already assigned"Duplicate assignment attempt
"adding global symbols of same name"Name collision
"alias to unknown symbol"Alias target not found
"unallocated symbol"Symbol value is -1 (never assigned an address)
"missing sec strtab"String table not initialized

Symbol Fixup -- sub_1CB2CA0

After dead code elimination removes sections, symbol indices become stale. The fixup pass (2,038 bytes, called from 4 sites) renumbers all symbol st_shndx values:

  1. For each section in the ELFW:
    • If the section lacks an STT_SECTION symbol, create one
    • If the section has multiple STT_SECTION symbols, warn
  2. Walk the symbol table and remap st_shndx values through the section index mapping

The fixup runs at multiple pipeline points: after dead function elimination, after Mercury section cloning, and after any section deletion.

Section Index Remap -- sub_1C99BB0

The companion to sub_1CB2CA0 for the extended index mechanism. When section indices change, this function updates both .symtab_shndx and .nv.merc.symtab_shndx to keep the extended index tables consistent.

Relocation Resolution Algorithm

The master resolver sub_1CD48C0 implements a 7-step algorithm that processes every relocation entry in the ELFW's linked list:

Step 1: Symbol Address Computation

For each relocation entry, compute the symbol's resolved address by adding the symbol's st_value (from the section base) to the relocation offset:

if (reloc->alias_index) {
    sym = lookup_symbol(elfw, reloc->alias_index);
    reloc->offset += sym->st_value;
}

For Mercury cubins (64-bit ELF class 'A' with Mercury flag set), the resolver applies an additional address transformation that accounts for the Mercury instruction stride:

if (is_mercury && sym_value != 0) {
    int stride = 2 * arch_vtable->get_merc_stride();
    reloc->offset += stride * (sym_value >> 7);
}

Step 2: Alias Resolution

If the relocation targets an alias symbol (ELF type STT_NOTYPE with section index pointing to another symbol), redirect the relocation to the canonical target:

"change alias reloc %s to %s"

The resolver follows the alias chain through sub_1CB1E00 (get section index) and sub_1CB3D20 (get section by index), replacing the alias with its real target.

Step 3: Dead Function Filtering

If the relocation's target symbol has local binding (st_other & 3 == 1) and is in a deleted section, the relocation is zeroed out:

"ignore reloc on dead func %s"

The relocation's type is set to 0 (R_CUDA_NONE), effectively removing it. When the output mode is not 2 (relocatable output), dead relocations on STT_NOTYPE symbols with a binding prefix of 2 are also removed.

Step 4: UFT/UDT Pseudo-Relocation Handling

Relocations targeting special synthetic symbols are intercepted:

SymbolAction
__UFT_OFFSETRecord for UFT slot assignment, zero the relocation
__UFT_CANONICALMap to canonical UFT entry
__UDT_OFFSETRecord for UDT slot assignment
__UDT_CANONICALMap to canonical UDT entry
__UFT, __UFT_ENDUFT boundary markers
__UDT, __UDT_ENDUDT boundary markers

The resolver compares the symbol name against "__UFT_OFFSET" (a 13-byte comparison in the decompiled code: the 12 characters plus the terminating NUL, making it an exact match). If matched:

"ignore reloc on UFT_OFFSET"

The relocation entry is then processed by the UFT manager (sub_1CD22E0) which maps UUIDs to UFT slot indices.

Step 5: PC-Relative Branch Validation

For relocations whose descriptor table entry has *(table + 8*index + 5) == 16 (qword +5 of the 64-byte descriptor, a marker in the trailing parameter area that identifies PC-relative types), the resolver validates that the source and target sections are identical:

if (reloc_desc->qword[5] == 16 && reloc->section != target_section)
    fatal("PC relative branch address should be in the same section");

Step 6: YIELD-to-NOP Conversion

If the relocation type is R_CUDA_YIELD_OPCODE9_0 or R_CUDA_YIELD_CLEAR_PRED4_87, and the kernel has forward-progress requirements, the resolver skips the NOP conversion:

"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."

Step 7: Bit-Field Patching

The final step delegates to sub_1CD34E0, the bit-field patcher. This function uses the relocation descriptor table entry's parameters to extract the current field value (via sub_1CD33F0), add the resolved address, and write the result back (via sub_1CD3330):

// sub_1CD34E0 -- bit-field patching (simplified)
bool apply_reloc(uint8_t *reloc_desc_table, uint32_t index, bool is_addend,
                 uint64_t *instruction_data, uint64_t symbol_value,
                 uint64_t reloc_offset, uint64_t sym_addr, uint32_t sym_shndx,
                 uint64_t section_type_offset, uint64_t *old_value_out) {
    uint32_t *entry = (uint32_t *)(reloc_desc_table + index * 64 + 12); // Start at byte 12
    uint32_t *end   = (uint32_t *)(reloc_desc_table + index * 64 + 60); // 4 operations max

    while (entry < end) {
        uint32_t bit_start = entry[0];
        uint32_t bit_width = entry[1];
        uint32_t mode      = entry[2];

        switch (mode) {
        case 0:  // NOP
            break;
        case 1:  // Direct write: place value at [bit_start, bit_start+bit_width)
        case 0x12: case 0x2E: {
            uint64_t old = extract_bits(instruction_data, bit_start, bit_width);
            *old_value_out = old;
            // Add the resolved address to the existing field contents
            insert_bits(instruction_data, old + symbol_value, bit_start, bit_width);
            break;
        }
        case 6:  // Split low-word write (cross-qword boundary handling)
        case 0x37:
            // Write low portion, advance to next qword if needed
            break;
        case 7:  // Split high-word write
        case 0x38:
            // Write HIDWORD of value
            break;
        }
        entry += 4;  // Next 16-byte operation
    }
    return true;
}

If the NVRS (NVIDIA Register Spill) check fails during patching, the resolver emits:

"unexpected NVRS"

NVRS relocations are special-purpose relocations for register spill slot references. When the bit-field patcher returns false, the relocation is invalid for the current context.
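The bit-extraction and insertion helpers (sub_1CD33F0 and sub_1CD3330) can be sketched as operations on a 128-bit instruction word modeled as two little-endian 64-bit halves. This is an illustrative reconstruction, not the decompiled code; the function names and the two-qword model are assumptions, but the cross-qword spill is exactly what the split-word modes (6/7 and 0x37/0x38) in the patcher exist to handle:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the bit-field helpers behind sub_1CD33F0 (extract) and
   sub_1CD3330 (insert). A SASS instruction is modeled as two 64-bit
   halves; a field may cross the qword boundary. Names are illustrative. */

static uint64_t extract_bits(const uint64_t word[2], unsigned start, unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    unsigned q = start / 64, b = start % 64;
    uint64_t v = word[q] >> b;
    if (b + width > 64 && q == 0)       /* field crosses into the high qword */
        v |= word[1] << (64 - b);
    return v & mask;
}

static void insert_bits(uint64_t word[2], uint64_t value, unsigned start, unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    unsigned q = start / 64, b = start % 64;
    value &= mask;
    word[q] = (word[q] & ~(mask << b)) | (value << b);
    if (b + width > 64 && q == 0) {     /* spill the remainder into the high qword */
        unsigned lo = 64 - b;
        uint64_t himask = mask >> lo;
        word[1] = (word[1] & ~himask) | (value >> lo);
    }
}
```

A field starting at bit 60 with width 9 touches both qwords, which is the case the split-write modes serialize as two separate operations in the descriptor table.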

Post-Resolution

Successfully resolved relocations are either:

  • Removed from the linked list (the relocation was fully applied to the instruction bytes)
  • Kept for the output .nv.resolvedrela section (the relocation needs runtime resolution by the CUDA driver)

The relocation writer sub_1CD5920 validates every remaining relocation before serializing it:

| Check | Error |
|---|---|
| Symbol value == -1 | "symbol never allocated" |
| Offset >= section size | "relocation is past end of offset" |
| Target section unallocated | "rela section never allocated" |
| Address not found in section data | "reloc address not found" |

Unified Function Table (UFT) and Unified Data Table (UDT)

Purpose

UFT and UDT support indirect function calls and generic data references across compilation units. When nvcc compiles a program using function pointers, virtual functions, or __device__ function addresses taken in host code, the compiler generates UFT/UDT entries that the runtime linker resolves at load time.

Sections

| Section | Purpose |
|---|---|
| .nv.uft | Jump slot table (one slot per indirect-callable function) |
| .nv.uft.entry | UFT entry metadata (UUID, offset pairs) |
| .nv.udt | Data slot table (one slot per externally-referenced data object) |
| .nv.udt.entry | UDT entry metadata |
| .nv.uft.rel | UFT relocation table |

UFT Entry Structure

Each UFT entry contains a 128-bit UUID and a 64-bit offset:

struct uft_entry {
    uint64_t uuid_lo;   // Low 64 bits of UUID
    uint64_t uuid_hi;   // High 64 bits of UUID
    uint64_t offset;    // Offset into the jump slot table
};  // 24 bytes per entry

UFT Manager -- sub_1CD22E0

The UFT manager (1,979 bytes, 10 KB decompiled) processes UFT/UDT entries across all compilation units:

  1. Build UID-to-key map -- hashes uuid_lo ^ uuid_hi as the lookup key
  2. Detect conflicts -- reports "uft map conflict: 0x%llx" when two entries hash to the same key
  3. Detect duplicates -- reports "duplicate ids in uft.entry" when identical UUIDs appear
  4. Reorder entries -- "Re-ordering UFT entries" / "Re-ordering UDT entries" sorts entries for deterministic output
  5. Match UUIDs -- cross-references UUIDs against the existing UFT for linking: "matching uuid not found" if a referenced UUID does not exist
  6. Align UDT -- "udt size %lld needs aligning" pads UDT entries to required alignment
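Steps 1-3 above can be sketched as a key map keyed on uuid_lo ^ uuid_hi. The linear-scan map below is illustrative (the real code uses an internal hash container), but the key derivation and the two failure diagnostics follow the recovered strings:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of UFT manager (sub_1CD22E0) steps 1-3: hash each entry's UUID
   to a 64-bit key, then distinguish a key conflict between distinct
   UUIDs from an exact duplicate. Map structure is illustrative. */
struct uft_entry { uint64_t uuid_lo, uuid_hi, offset; };

static uint64_t uft_key(const struct uft_entry *e) { return e->uuid_lo ^ e->uuid_hi; }

/* Returns 0 on success, -1 on key conflict, -2 on duplicate UUID. */
static int uft_insert(struct uft_entry *map[], size_t *n, struct uft_entry *e)
{
    uint64_t key = uft_key(e);
    for (size_t i = 0; i < *n; i++) {
        if (uft_key(map[i]) != key) continue;
        if (map[i]->uuid_lo == e->uuid_lo && map[i]->uuid_hi == e->uuid_hi) {
            fprintf(stderr, "duplicate ids in uft.entry\n");
            return -2;
        }
        fprintf(stderr, "uft map conflict: 0x%llx\n", (unsigned long long)key);
        return -1;
    }
    map[(*n)++] = e;
    return 0;
}
```

Note that because the key is a simple XOR, two distinct UUIDs such as (1,2) and (2,1) collide, which is exactly the "uft map conflict" case.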

UFT Slot Validator -- sub_1CD2AA0

Validates consistency between .nv.uft (jump slots) and .nv.uft.entry (metadata):

"missing nv.uft.entry"
"Number of .nv.uft jump slots != Number of entries in .nv.uft.entry"
"size of uidx window != nv.uft"

Synthetic Symbols

The resolver recognizes these synthetic symbol names:

| Symbol | Purpose |
|---|---|
| __UFT_OFFSET | Points to a UFT jump slot |
| __UFT_CANONICAL | Canonical UUID entry for a UFT slot |
| __UDT_OFFSET | Points to a UDT data slot |
| __UDT_CANONICAL | Canonical UUID entry for a UDT slot |
| __UFT / __UFT_END | UFT table start/end boundaries |
| __UDT / __UDT_END | UDT table start/end boundaries |
| $NVLINKBINDLESSOFF_<name> | Bindless texture/surface offset symbol |
| __cuda_syscall | Device-side syscall dispatcher |

Extern Shared Memory Relocations

Extern shared memory variables (declared with extern __shared__) are handled specially because their addresses are not known until kernel launch. The resolver tracks these through dedicated strings:

"extern shared variable %s at offset %lld"
"reloc of extern shared %d replaced with symbol %d"
"new extern shared instance %d"

Multiple kernels may reference the same extern shared variable. The linker creates separate instances when necessary and patches the relocation to point to the correct instance.

Weak Symbol Handling

When nvlink encounters a weak symbol that conflicts with a strong definition:

"Could not replace weak symbol '%s'"

This occurs during the relocation pre-scan (sub_1CD43A0) when processing relocations that reference weak symbols. The pre-scan walks all relocations and checks the symbol binding at sym->st_other & 0xE0:

  • 0x80 = weak: eligible for replacement by a strong definition
  • 0x20 = global: normal binding
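The binding test described above fits in a few lines. The mask and values follow the recovered constants; the function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* The pre-scan's binding check: the high bits of a symbol's st_other
   byte carry an NVIDIA-specific binding encoding (mask 0xE0). Recovered
   values: 0x80 = weak, 0x20 = global. Function name is illustrative. */
enum { NV_BIND_MASK = 0xE0, NV_BIND_GLOBAL = 0x20, NV_BIND_WEAK = 0x80 };

static int is_replaceable_weak(uint8_t st_other)
{
    return (st_other & NV_BIND_MASK) == NV_BIND_WEAK;
}
```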

Linking Model

Relocatable Object Mode (-c)

When ptxas produces a relocatable object (.o), all relocations are preserved in .rela.text.<func> sections. The call graph is written to .nv.callgraph. Symbols retain their binding information for nvlink to resolve.

"No relocatable objects found. Did not generate callgraph."
"Generate relocatable object"

The --preserve-relocs flag additionally preserves relocations that would normally be resolved internally:

"This option will make PTXAS to generate relocatable references for variables and preserve ..."

Executable Mode (default)

In the default mode, ptxas resolves all internal relocations and writes .nv.resolvedrela for any relocations that require runtime resolution. External references and function descriptors for indirect calls are preserved as unresolved relocations for the CUDA driver's runtime linker.

PIC Mode

Position-independent code mode (IsPIC flag) changes the relocation encoding. The ELF flags word at ELFW offset +48 encodes this mode. PIC cubins use additional program-relative relocations and avoid absolute addresses where possible.

Cross-References

Function Map

| Address | Size (binary) | Decompiled | Callers | Callees | Purpose |
|---|---|---|---|---|---|
| sub_1CD48C0 | 4,184 B | 22 KB | 1 | 17 | Master relocation resolver (7-step algorithm) |
| sub_1CD5920 | 1,985 B | 11 KB | 1 | 9 | Relocation writer (.nv.resolvedrela) |
| sub_1CD4510 | ~860 B | 4 KB | -- | -- | Relocation creator (SASS) |
| sub_1CD46B0 | ~540 B | 4 KB | -- | -- | Relocation creator (Mercury) |
| sub_1CD43A0 | ~560 B | 3 KB | -- | -- | Relocation pre-scan (weak/extern) |
| sub_1CD34E0 | 3,700 B | 17 KB | 1 | 2 | Bit-field patcher (sub_1CD33F0 extract, sub_1CD3330 insert) |
| sub_1CD33F0 | ~300 B | 2 KB | 7 | 1 | Extract bits from instruction word |
| sub_1CD3330 | ~200 B | 1 KB | 5 | 0 | Insert bits into instruction word |
| sub_1CD22E0 | 1,979 B | 10 KB | 2 | 20 | UFT manager (UUID-to-slot mapping) |
| sub_1CD2AA0 | ~800 B | 3 KB | -- | -- | UFT slot validator |
| sub_1CB68D0 | 9,578 B | 49 KB | 1 | 36 | Symbol table builder (.symtab) |
| sub_1CB2CA0 | 2,038 B | 8 KB | 4 | 11 | Symbol fixup (post-deletion renumbering) |
| sub_1C99BB0 | 4,900 B | 25 KB | 1 | 18 | Section index remap (.symtab_shndx) |
| sub_1CB64A0 | ~500 B | 2 KB | -- | -- | Symbol resolver (checks .nv.* special names) |
| sub_1CAB300 | 2,157 B | 12 KB | 1 | 19 | Bindless texture/surface handler |
| sub_1CA6890 | 2,286 B | 15 KB | 2 | 11 | Constant bank deduplication |
| sub_1CBD0D0 | -- | -- | -- | -- | Call graph edge registration |

CLI Options

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas v13.0.88 accepts approximately 160 command-line options: 51 documented in --help output and roughly 109 internal/undocumented options discovered through binary analysis. All option names are registered via sub_432A00 (6,427 bytes at 0x432A00) using a generic option framework shared with other NVIDIA tools. The framework library (sub_1C960C0--sub_1C97640) supports short options (-X), long options (--name), and four value types: boolean toggle, list append, scalar value, and multi-value. Internal option names are stored ROT13-encoded in the binary.

| Total options | ~160 (51 documented + ~109 internal) |
|---|---|
| Option registration | sub_432A00 (0x432A00, 6,427 bytes) |
| Option parser | sub_434320 (0x434320, 10,289 bytes) |
| Framework constructor | sub_1C960C0 |
| Argv processor | sub_1C96680 |
| Help printer | sub_1C97640 |
| Options block | 1,352 bytes on stack in compilation driver |
| Name obfuscation | ROT13 for internal option names |

Architecture

          argv
           │
           ▼
   ┌───────────────────┐      ┌─────────────────────────┐
   │  sub_1C960C0      │      │   sub_432A00             │
   │  Parser ctor      │◄─────│   Register ~160 options  │
   │  (56-byte context)│      │   (name, type, default,  │
   └───────┬───────────┘      │    help text, callback)  │
           │                  └─────────────────────────┘
           ▼
   ┌───────────────────┐
   │  sub_1C96680      │
   │  Process argv     │
   │  Detect - and --  │
   │  Type dispatch:   │
   │    1=bool toggle  │
   │    2=list append  │
   │    3=scalar value │
   │    4=multi-value  │
   └───────┬───────────┘
           │
           ▼
   ┌───────────────────┐
   │  sub_434320       │
   │  Validate combos  │
   │  Populate 1352B   │
   │  options block    │
   └───────┬───────────┘
           │
           ▼
     Compilation driver
      (sub_446240)

Quick Start

The most common ptxas invocations and essential options, ordered by frequency of use:

# 1. Basic compilation: PTX -> cubin for a specific GPU
ptxas -arch sm_90 -o kernel.cubin kernel.ptx

# 2. Compilation with optimization control
ptxas -arch sm_100 -O3 -o kernel.cubin kernel.ptx

# 3. Debug build with line info
ptxas -arch sm_90 -g -lineinfo -o kernel.cubin kernel.ptx

# 4. Register-limited compilation (occupancy tuning)
ptxas -arch sm_90 -maxrregcount 64 -o kernel.cubin kernel.ptx

# 5. Verbose output with resource statistics
ptxas -arch sm_90 -v -o kernel.cubin kernel.ptx

# 6. Relocatable object for separate linking
ptxas -arch sm_90 -c -o kernel.o kernel.ptx

# 7. Fast-compile mode (trade codegen quality for build speed)
ptxas -arch sm_100 -Ofc max -o kernel.cubin kernel.ptx

# 8. Parallel compilation with multiple threads
ptxas -arch sm_90 -split-compile 0 -o kernel.cubin kernel.ptx

# 9. Internal knob override (developer/debugging)
ptxas -arch sm_90 -knob DUMPIR=AllocateRegisters -o kernel.cubin kernel.ptx

# 10. Discover all 1,294 internal knob values
DUMP_KNOBS_TO_FILE=/tmp/knobs.txt ptxas -arch sm_90 -o kernel.cubin kernel.ptx

| Goal | Options |
|---|---|
| Maximize performance | -O3 -allow-expensive-optimizations -fmad |
| Maximize occupancy | -maxrregcount N (N = 32, 64, 128, ...) |
| Minimize compile time | -Ofc max -split-compile 0 |
| Debug build | -g -lineinfo -sp-bounds-check |
| Spill diagnostics | -v -warn-spills -warn-lmem-usage |
| Internal tuning | -knob NAME=VALUE (see Knobs System) |

Option Discovery Methodology

Options were extracted from four independent sources:

  1. Official --help output -- 51 options with full metadata.
  2. Binary string extraction -- strings(1) reveals plaintext option names used in error messages and format strings.
  3. ROT13 decode -- Internal option names stored as ROT13 in the registration function. Decoding fj4575628 yields sw4575628, pbzcvyre-fgngf yields compiler-stats, etc.
  4. Decompiled code cross-reference -- String references in option processing functions (sub_434320, sub_4428E0, sub_43A400) confirm option semantics.

Tables below use these markers:

  • Unmarked rows = documented in --help
  • Rows marked (internal) = discovered through RE, not in --help
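The ROT13 decode in step 3 is trivially reproducible. Letters rotate 13 places; digits and punctuation pass through, so fj4575628 decodes to sw4575628:

```c
#include <assert.h>
#include <string.h>

/* ROT13 as used for internal option names: only ASCII letters rotate,
   so bug-tracking digits in names like "fj4575628" survive decoding. */
static void rot13(char *s)
{
    for (; *s; s++) {
        if (*s >= 'a' && *s <= 'z')      *s = 'a' + (*s - 'a' + 13) % 26;
        else if (*s >= 'A' && *s <= 'Z') *s = 'A' + (*s - 'A' + 13) % 26;
    }
}
```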

Core Compilation

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --opt-level | -O | int | 3 | Optimization level (0--4) |
| --output-file | -o | file | elf.o | Output file name and location |
| --gpu-name | -arch | enum | sm_75 | Target GPU architecture (sm_XX, compute_XX, lto_XX) |
| --compile-only | -c | bool | false | Generate relocatable object |
| --entry | -e | list | (all) | Entry function name(s) to compile |
| --verbose | -v | bool | false | Print code generation statistics |
| --version | -V | bool | -- | Print version information |
| --help | -h | bool | -- | Print help text |
| --machine | -m | int | 64 | Host architecture bitness (only 64 supported) |
| --input-as-string | -ias | list | -- | PTX modules as strings instead of files |
| --options-file | -optf | list | -- | Include CLI options from file |
| --compile-functions (internal) | -- | list | -- | Restrict compilation to named functions |
| --ptx-length (internal) | -- | int | -- | PTX input length for --input-as-string mode |
| --tool-name (internal) | -- | string | -- | Tool name for diagnostics (nvcc integration) |
| --cuda-api-version (internal) | -- | int | (auto) | CUDA API version for compatibility |
| --abi-compile (internal) | -- | bool | false | Compile using strict ABI conventions |

Debug and Instrumentation

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --device-debug | -g | bool | false | Generate debug information for device code |
| --generate-line-info | -lineinfo | bool | false | Generate line-number information |
| --sp-bounds-check | -sp-bounds-check | bool | false | Stack-pointer bounds checking; auto-enabled with -g or -O0 |
| --suppress-debug-info | -suppress-debug-info | bool | false | Suppress debug sections in output; ignored without -g or -lineinfo |
| --device-stack-protector | -device-stack-protector | bool | false | Stack canaries; heuristic per-function risk assessment |
| --sanitize | -sanitize | enum | -- | Instrumented code: memcheck or threadsteer |
| --g-tensor-memory-access-check | -g-tmem-access-check | bool | (with -g) | Tensor memory access checks for tcgen05 |
| --gno-tensor-memory-access-check | -gno-tmem-access-check | bool | false | Override: disable tensor memory access checks |
| --dont-merge-basicblocks | -no-bb-merge | bool | false | Prevent basic block merging (debuggable code) |
| --return-at-end | -ret-end | bool | false | Preserve last return instruction for breakpoints |
| --make-errors-visible-at-exit | -make-errors-visible-at-exit | bool | false | Generate instructions to surface memory faults at exit |
| --trap-into-debugger (internal) | -- | bool | false | Insert trap instructions for debugger attachment |
| --device-stack-protector-size (internal) | -- | int | (varies) | Stack protector canary size |
| --device-stack-protector-frame-size-threshold (internal) | -- | int | (varies) | Frame size threshold for canary insertion |

Register and Occupancy Control

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --maxrregcount | -maxrregcount | int/enum | (unlimited) | Max registers per function; accepts N, archmax, archmin |
| --minnctapersm | -minnctapersm | int | -- | Min CTAs per SM; ignored if -maxrregcount is set |
| --maxntid | -maxntid | list | -- | Max thread-block dimensions; ignored if -maxrregcount is set |
| --device-function-maxrregcount | -func-maxrregcount | int/enum | (unlimited) | Max registers for device functions (with -c); overrides --maxrregcount for non-entry functions |
| --register-usage-level | -regUsageLevel | int | 5 | Register-usage optimization aggressiveness (0--10); BETA |
| --override-directive-values | -override-directive-values | bool | false | CLI values override PTX directives for minnctapersm, maxntid, maxrregcount |
| --first-reserved-rreg (internal) | -- | int | -- | First reserved register number (tools integration) |
| --reg-fatpoint (internal) | -- | string | -- | Fatpoint register allocation mode selector |
| --no-fastreg (internal) | -- | bool | false | Disable fast register allocation path |
| --no-spill (internal) | -- | bool | false | Disable register spilling (debug/stress) |

Performance and Optimization

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --Ofast-compile | -Ofc | enum | 0 | Fast-compile level: 0 (disabled), min, mid, max |
| --fast-compile (internal) | -- | bool | false | Internal fast-compile flag (predecessor of --Ofast-compile) |
| --allow-expensive-optimizations | -allow-expensive-optimizations | bool | (auto at O2+) | Allow max resources for expensive optimizations |
| --split-compile | -split-compile | int | -- | Max concurrent threads for optimizer; 0 = num CPUs |
| --fmad | -fmad | bool | true | Contract FP multiply + add into FMA (FMAD/FFMA/DFMA) |
| --optimize-float-atomics | -opt-fp-atomics | bool | false | FP atomic optimizations (may affect precision) |
| --disable-optimizer-constants | -disable-optimizer-consts | bool | false | Disable optimizer constant bank |
| --cloning (internal) | -- | enum | (auto) | Inline function cloning control (yes/no) |
| --perf-per-watt-opt-level (internal) | -- | int | -- | Performance-per-watt optimization level |
| --lds128convert (internal) | -- | enum | (auto) | LDS.128 conversion: always, nonconst, never |
| --opt-pointers (internal) | -- | bool | (varies) | Enable pointer optimization passes |
| --fastpath-off (internal) | -- | bool | false | Disable fast-path optimizations |
| --full-double-div (internal) | -- | bool | (varies) | Full-precision double division |
| --limit-fold-fp (internal) | -- | bool | (varies) | Limit floating-point constant folding |
| --shift-right (internal) | -- | bool | false | Shift-right optimization control |
| --dont-reserve-null-pointer (internal) | -- | bool | false | Do not reserve null pointer in address space |

Cache Control

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --def-load-cache | -dlcm | enum | (arch-dep) | Default cache modifier on global/generic load |
| --def-store-cache | -dscm | enum | (arch-dep) | Default cache modifier on global/generic store |
| --force-load-cache | -flcm | enum | -- | Force cache modifier on global/generic load |
| --force-store-cache | -fscm | enum | -- | Force cache modifier on global/generic store |

Warnings and Diagnostics

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --warning-as-error | -Werror | bool | false | Promote all warnings to errors |
| --disable-warnings | -w | bool | false | Inhibit all warnings |
| --warn-on-spills | -warn-spills | bool | false | Warn when registers spill to local memory |
| --warn-on-local-memory-usage | -warn-lmem-usage | bool | false | Warn when local memory is used |
| --warn-on-double-precision-use | -warn-double-usage | bool | false | Warn when doubles are used |
| --suppress-stack-size-warning | -suppress-stack-size-warning | bool | false | Suppress undetermined-stack-size warning |
| --suppress-double-demote-warning | -suppress-double-demote-warning | bool | false | Suppress double demotion warning on SM without double support |
| --suppress-async-bulk-multicast-advisory-warning | -suppress-async-bulk-multicast-advisory-warning | bool | false | Suppress .multicast::cluster advisory |
| --suppress-sparse-mma-advisory-info | -suppress-sparse-mma-advisory-info | bool | false | Suppress mma.sp advisory |
| --print-potentially-overlapping-membermasks (internal) | -- | bool | false | Diagnostic for overlapping member masks |
| --no-membermask-overlap (internal) | -- | bool | false | Disable member mask overlap checks |

Output Format and Relocation

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --preserve-relocs | -preserve-relocs | bool | false | Preserve relocations in linked executable |
| --position-independent-code | -pic | bool | false (whole-prog: true) | Generate PIC; default on for whole-program compilation |
| --compiler-annotations | -annotate | bool | false | Annotate compiler-internal information in binary |
| --binary-kind (internal) | -- | enum | (arch-dep) | Target binary format: mercury, capmerc, sass |
| --force-rela (internal) | -- | bool | false | Force RELA-style relocations |
| --gen-std-elf (internal) | -- | bool | false | Generate standard ELF (vs NVIDIA custom format) |
| --link-info (internal) | -- | string | -- | Link information for assembler |
| --force-externals (internal) | -- | bool | false | Force functions as external |
| --forcetext (internal) | -- | bool | false | Force text-mode SASS output |
| --emit-internal-clo (internal) | -- | bool | false | Emit internal compiler-level object metadata |
| --hide-user-functions (internal) | -- | bool | false | Hide user function symbols in output |

Workaround Flags

Hardware and software bug workarounds tied to internal NVIDIA bug-tracking IDs. All names are ROT13-encoded in the binary (e.g., fj2614554 decodes to sw2614554). These flags toggle specific code paths that avoid known errata or compiler defects. New workarounds appear (and old ones become permanent) with each ptxas release. The validator in sub_434320 enforces architecture restrictions: a flag set on an unsupported architecture is silently cleared with a diagnostic.

| Long Name | Short Name | Type | Default | Arch Gate | Description |
|---|---|---|---|---|---|
| --sw2614554 (internal) | -- | bool | false | all | Thread-safety workaround; incompatible with --split-compile. When set, forces single-threaded compilation -- validator emits "'--sw2614554' ignored because of '--split-compile'" and disables split-compile. Addresses a race condition in the parallel optimizer. |
| --sw2837879 (internal) | -- | bool | false | all | Backend codegen workaround. No architecture gating or validator logic; consumed directly in DAG/OCG pipeline phases. Specific behavioral effect not traced beyond registration. |
| --sw1729687 (internal) | -- | bool | false | sm_50--sm_53 | Maxwell-era hardware errata workaround. Validator checks (arch_ordinal - 14) > 2 and clears the flag with a warning on any architecture beyond sm_53. Activates an alternate codegen path on Maxwell GPUs. |
| --sw200428197 (internal) | -- | bool | false | sm_80+ | Sanitizer-compatible ABI workaround. Forces scratch register reservation for CUDA sanitizer instrumentation state and applies ABI-minimum register counts. Consumed in function/ABI setup (sub_43F400, sub_441780) alongside --compile-as-tools-patch. Validator clears it with "-arch=X ignored because of --sw200428197" on sm_75 and earlier. |
| --sw200387803 (internal) | -- | bool | false | deprecated | Retired workaround. Setting it triggers a deprecation advisory (dword_29FBDB0) but no behavioral change -- the underlying fix has been permanently integrated. |
| --sw200764156 (internal) | -- | bool | true | sm_90 only | Hopper-specific hardware errata. Default is true (unique among all sw* flags). Help text reads "Enable/Disable sw200764156", confirming it is a toggle that can be turned off. On any architecture other than sm_90, the user-set value is discarded: "option -arch=X ignored because of --sw200764156". |
| --sw4575628 (internal) | -- | bool | false | sm_100+ | Cache and texturing mode workaround. Validator clears it with a warning on architectures sm_100 and earlier. In target configuration (sub_43A400), the target profile at offset +2465 independently determines whether the workaround is needed; if both the profile and the CLI flag are set simultaneously, the CLI flag is cleared with "--sw4575628 conflicts with specified texturing mode". |
| --sw200531531 (internal) | -- | bool | (varies) | unknown | Known only from ROT13 decode (fj200531531). No help text, no validator cross-references, no decompiled consumption. Consumed in backend passes not covered by available decompiled functions. |
| --sw200380282 (internal) | -- | bool | (varies) | unknown | Known only from ROT13 decode (fj200380282). Same as --sw200531531 -- registered but with no traceable validator or target configuration logic. |
| --sw4915215 (internal) | -- | bool | false | all (behavior varies) | Generation-dependent workaround. On Blackwell (sm_100+, generation=100), when enabled alongside non-PIC mode, emits informational "sw4915215=true". On other architectures, emits a different informational. Behavioral effect is in backend codegen. |
| --sw4936628 (internal) | -- | bool | false | all | Stored at options block offset +503, adjacent to --blocks-are-clusters in the registration sequence. No architecture gating in the validator. Specific behavioral effect requires deeper backend tracing; registration proximity suggests cluster/CTA-level code generation relevance. |

EIATTR-Level Workarounds

Five EIATTR attributes encode workaround metadata directly in the output ELF. These are set by target architecture rather than CLI flags -- ptxas emits them unconditionally when the target requires it, and the GPU driver applies fixups at load time.

| EIATTR Code | Name | Knob Name | Description |
|---|---|---|---|
| 42 (0x2A) | EIATTR_SW1850030_WAR | OneFlapJne1850030 | Instruction offsets requiring driver-side fixup for HW bug 1850030. |
| 48 (0x30) | EIATTR_SW2393858_WAR | OneFlapJne2393858 | Instruction offsets requiring driver-side fixup for HW bug 2393858. |
| 53 (0x35) | EIATTR_SW2861232_WAR | -- | Instruction offsets for HW bug 2861232 workaround. |
| 54 (0x36) | EIATTR_SW_WAR | -- | Generic software workaround container (variable payload). |
| 71 (0x47) | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | -- | Offsets of MEMBAR.SYS instructions needing software workaround. |

Tool and Patch Modes

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --compile-as-tools-patch | -astoolspatch | bool | false | Compile patch code for CUDA tools; forces ABI-minimum regcount |
| --extensible-whole-program | -ewp | bool | false | Extensible whole-program mode |
| --compile-as-at-entry-patch (internal) | -asatentrypatch | bool | false | Compile as at-entry instrumentation patch |
| --compile-as-entry-exit-patch (internal) | -- | bool | false | Compile as entry/exit instrumentation patch |
| --compile-device-func-without-entry (internal) | -- | bool | false | Allow device function compilation without entry point |
| --assyscall (internal) | -- | bool | false | System-call instrumentation mode |
| --fdcmpt (internal) | -- | bool | false | Forward-compatibility mode |
| --enable-syscall-abi (internal) | -- | bool | false | Enable syscall ABI for device functions |
| --assume-extern-functions-do-not-sync (internal) | -- | bool | false | Assume external functions do not synchronize |
| --function-pointer-is-function-pointer (internal) | -- | bool | false | Treat function pointers as true function pointers |

Statistics and Profiling

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --compiler-stats (internal) | -- | bool | false | Print per-phase timing (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo) and peak memory |
| --compiler-stats-file (internal) | -- | file | -- | Write statistics to JSON file |
| --fdevice-time-trace (internal) | -- | file | -- | Chrome DevTools trace format (JSON) for time profiling |
| --ftrace-phase-after (internal) | -- | string | -- | Trace/dump IR state after named optimization phase |
| --perf-stats (internal) | -- | bool | false | Print performance statistics |
| --dump-perf-stats (internal) | -- | bool | false | Dump performance statistics to output |
| --phase-wise (internal) | -- | bool | false | Per-phase statistics breakdown |
| --use-trace-pid (internal) | -- | bool | false | Include process ID in trace output |
| --verbose-tkinfo (internal) | -- | bool | false | Verbose token/parse information |

Mercury and Capsule Mercury

These options control the Mercury intermediate encoding and Capsule Mercury format, which is the default output format on sm_100+ (Blackwell).

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --cap-merc (internal) | -- | bool | (arch-dep) | Generate Capsule Mercury format |
| --self-check (internal) | -- | bool | false | Validate capmerc by comparing reconstituted SASS with original |
| --out-sass (internal) | -- | bool | false | Output reconstituted SASS from capmerc |
| --opportunistic-finalization-lvl (internal) | -- | int | -- | Opportunistic finalization level for Mercury pipeline |

Threading and Parallelism

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --jobserver | -jobserver | bool | false | Enable GNU Make jobserver support (make -j<N>) |
| --threads-dynamic-scheduling (internal) | -- | bool | (varies) | Dynamic scheduling for thread pool tasks |
| --threads-min-section-size (internal) | -- | int | (varies) | Minimum section size for thread pool partitioning |

Texture and Memory Modes

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --legacy-bar-warp-wide-behavior | -legacy-bar-warp-wide-behavior | bool | false | Legacy PTX bar semantics; deprecated, ignored for sm_70+ |
| --set-texmode-independent (internal) | -- | bool | false | Set texture mode to independent |
| --set-texmode-raw (internal) | -- | bool | false | Set texture mode to raw |
| --disable-fast-video-emulation (internal) | -- | bool | false | Disable fast video emulation path |
| --treat-bf16-as-e6m9 (internal) | -- | bool | false | Treat BF16 as E6M9 format |
| --legacy-cvtf64 (internal) | -- | bool | false | Legacy cvt.f64 conversion behavior |
| --use-gmem-for-func-addr (internal) | -- | bool | false | Global memory for function addresses |
| --blocks-are-clusters (internal) | -- | bool | false | Treat blocks as clusters (sm_90a+ TBC) |
| --enable-extended-smem (internal) | -- | bool | false | Extended shared memory support |
| --disable-smem-reservation (internal) | -- | bool | false | Disable shared memory reservation |
| --membermask-overlap (internal) | -- | bool | (varies) | Member mask overlap control |
| --ld-prefetch-random-seed (internal) | -- | int | -- | Random seed for load prefetch heuristic |
| --max-stack-size (internal) | -- | int | (auto) | Max kernel stack size |

Constant Bank Allocation

NVIDIA GPUs provide 18 hardware constant banks (c[0] through c[17]), each a 64 KB read-only memory segment accessible by all threads in a warp with uniform-address broadcast -- loads from constant banks cost a single memory transaction when all threads in the warp read the same address. The compiler assigns different data categories (kernel parameters, driver state, user constants, PIC tables, etc.) to separate banks to avoid address-space collisions. These options override the default bank assignments; all are ROT13-encoded.

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --sw-kernel-params-bank (internal) | -- | int | (varies) | Constant bank for kernel parameters |
| --sw-driver-bank (internal) | -- | int | (varies) | Constant bank for driver data |
| --sw-compiler-bank (internal) | -- | int | (varies) | Constant bank for compiler-generated constants |
| --sw-user-bank (internal) | -- | int | (varies) | Constant bank for user constants |
| --sw-pic-bank (internal) | -- | int | (varies) | Constant bank for PIC data |
| --sw-ocl-param1-bank (internal) | -- | int | (varies) | Constant bank for OpenCL parameter set 1 |
| --sw-ocl-param2-bank (internal) | -- | int | (varies) | Constant bank for OpenCL parameter set 2 |
| --sw-devtools-data-bank (internal) | -- | int | (varies) | Constant bank for developer tools data |
| --sw-bindless-tex-surf-table-bank (internal) | -- | int | (varies) | Constant bank for bindless texture/surface table |

Stress Testing

Internal options for compiler stress testing and regression verification.

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --stress-no-crp (internal) | -- | bool | false | Disable CRP (Caller/callee Register Partitioning) |
| --stress-maxrregcount (internal) | -- | int | -- | Override maxrregcount for stress testing |
| --stress-noglobalregalloc (internal) | -- | bool | false | Disable global register allocation |

Query and Control Interface

Internal options for the query/control interface used by nvcc and other tools.

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --ext-desc-file (internal) | -- | file | -- | External description file for instruction metadata |
| --ext-desc-string (internal) | -- | string | -- | External description string for instruction metadata |
| --query-controls (internal) | -- | string | -- | Query control parameters |
| --query-schema (internal) | -- | string | -- | Query schema definition |
| --apply-controls (internal) | -- | string | -- | Apply control parameters to compilation |
| --profile-options (internal) | -- | string | -- | Pass profiling options to backend |
| --knob (internal) | -knob | list | -- | Set internal knob: -knob NAME=VALUE; repeatable; see Knobs System |
| --omega-knob (internal) | -- | string | -- | Pass omega-subsystem knob settings |
| --expand-macros-in-omega (internal) | -- | bool | false | Expand macros in omega (instruction expansion) phase |
| --force-expand-macros-after-errors (internal) | -- | bool | false | Force macro expansion after errors |
| --enable-func-clone-sc (internal) | -- | bool | false | Enable function cloning for self-check |
| --use-alternate-query-implementation (internal) | -- | bool | false | Alternate query implementation |
| --use-alternate-const-ptr-implementation (internal) | -- | bool | false | Alternate constant pointer implementation |

Syscall Integration

Internal options for system-call based operations (texturing, bulk copy).

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --use-tex-grad-syscall (internal) | -- | bool | false | Syscall for texture gradient operations |
| --use-tex-surf-syscall (internal) | -- | bool | false | Syscall for texture/surface operations |
| --use-bulk-copy-syscall (internal) | -- | bool | false | Syscall for bulk copy operations |

Knobs Configuration

The -knob flag is the primary CLI mechanism for setting internal knob values -- the 1,294 tuning parameters documented in Knobs System. It is not listed in --help output and uses a single-dash prefix (not --knob).

Syntax

-knob NAME=VALUE         Set a typed knob (int, float, double, string, range)
-knob NAME               Set a boolean knob (presence = true)
-knob "A=1~B=2~C=3"     Multiple knobs in one argument, separated by ~ (tilde)

Multiple -knob flags are accumulated (list-append semantics):

ptxas -knob SchedNumBB_Limit=100 -knob DisableCSE -knob RegAllocBudget=5000 \
      -arch sm_90 -o out.cubin input.ptx

Knob names are case-insensitive. The name is resolved via ROT13-encoded lookup tables in GetKnobIndex (sub_6F0820 for DAG knobs, sub_79B240 for OCG knobs). An unrecognized name produces warning 7203: "Invalid knob specified (%s)".
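The tilde-separated grammar above can be sketched with a small splitter. This is a minimal ANSI-C illustration of the argument shape (NAME=VALUE, or bare NAME meaning boolean true), not the actual parser; name resolution, case folding, and type dispatch happen later via GetKnobIndex:

```c
#include <assert.h>
#include <string.h>

/* Sketch of the -knob argument grammar: one argument may carry several
   settings separated by '~'; each setting is NAME=VALUE or a bare NAME
   (boolean, presence = true). Splits arg in place; name/value point
   into it. Returns the number of settings parsed. Illustrative only. */
static int parse_knob_arg(char *arg, const char *names[], const char *values[], int max)
{
    int n = 0;
    while (arg && *arg && n < max) {
        char *next = strchr(arg, '~');
        if (next) *next++ = '\0';
        char *eq = strchr(arg, '=');
        if (eq) { *eq = '\0'; values[n] = eq + 1; }
        else    { values[n] = "1"; }   /* boolean knob: presence = true */
        names[n++] = arg;
        arg = next;
    }
    return n;
}
```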

Value Types

The value after = is parsed according to the knob's registered type:

| Type | Syntax | Example |
|---|---|---|
| Boolean | (no value) | -knob DisableCSE |
| Integer | decimal, 0x hex, 0 octal | -knob SchedNumBB_Limit=100 |
| Float | decimal with . | -knob CostWeight=0.75 |
| Double | decimal with . | -knob PriorityScale=1.5 |
| Int-range | low..high | -knob AllowedRange=100..200 |
| Int-list | comma-separated | -knob TargetOpcodes=1,2,3,4 |
| String | raw text | -knob DUMPIR=AllocateRegisters |
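The int-range syntax is worth a concrete sketch. Using strtol with base 0 accepts decimal, 0x hex, and leading-0 octal, matching the integer row above; the parser shape itself is illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the "low..high" int-range value syntax. strtol base 0
   accepts decimal, 0x hex, and leading-0 octal, matching the integer
   syntax row. Returns nonzero only on a well-formed range with
   lo <= hi. Illustrative, not the decompiled parser. */
static int parse_int_range(const char *s, long *lo, long *hi)
{
    char *end;
    *lo = strtol(s, &end, 0);
    if (end == s || end[0] != '.' || end[1] != '.') return 0;
    *hi = strtol(end + 2, &end, 0);
    return *end == '\0' && *lo <= *hi;
}
```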

Conditional Overrides (WHEN=)

Knobs can be set conditionally based on shader or instruction hash, applied only when a specific function is compiled:

# Apply knob only when shader hash matches
ptxas -knob "WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200" -arch sm_90 -o out.cubin input.ptx

# Multiple conditional overrides separated by ~
ptxas -knob "WHEN=SH=0xDEAD;DisableCSE~WHEN=IH=0x1234;RegAllocBudget=1000" ...

Condition prefixes: SH= (shader hash), IH= (instruction hash), K= (direct knob, no condition).

Interaction with Other Knob Sources

KnobsInit (sub_79D990) processes knob sources in this order -- later sources override earlier ones for the same knob index:

| Priority | Source | Mechanism |
|---|---|---|
| 1 (lowest) | Environment variables | KnobsInitFromEnv (sub_79C9D0), comma-separated name=value pairs |
| 2 | Knobs file | ReadKnobsFile (sub_79D070), plain-text with [knobs] header |
| 3 | -knob CLI flags | Accumulated list-append from argv processing |
| 4 | PTX .pragma | Per-function; disabled by DisablePragmaKnobs knob |
| 5 (highest) | WHEN= overrides | Per-function conditional, matched by shader/instruction hash |

Environment Variable: DUMP_KNOBS_TO_FILE

The DUMP_KNOBS_TO_FILE environment variable causes ptxas to write all 1,294 knob names and their resolved values to a file:

DUMP_KNOBS_TO_FILE=/tmp/all_knobs.txt ptxas -arch sm_90 -o out.cubin input.ptx

This is the primary mechanism for discovering which knobs exist, their current defaults for a given architecture, and verifying that CLI overrides took effect.

Commonly Used Knobs

| Knob | Type | Purpose |
|---|---|---|
| DUMPIR | string | Dump IR after a named phase (e.g., AllocateRegisters) |
| DisableCSE | bool | Disable common subexpression elimination |
| DisablePhases | string | +-delimited list of phases to skip |
| SchedNumBB_Limit | int | Basic block limit for scheduling heuristic |
| RegAllocBudget | int | Budget for register allocation cost model |
| EmitLDCU | bool | Emit LDCU instructions (SM90: requires -forcetext -sso) |
| IgnorePotentialMixedSizeProblems | bool | Suppress mixed-size register warnings |
| DisablePragmaKnobs | bool | Ignore all .pragma knob directives in PTX |

For the complete knob type system, file format, and all 1,294 knob categories, see Knobs System.

Version and Architecture Queries

| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
| --list-arch | -arch-ls | bool | -- | Print supported GPU architectures |
| --list-version | -version-ls | bool | -- | Print supported PTX ISA versions |

Option Interaction Rules

Several options interact in non-obvious ways, as revealed by the validation logic in sub_434320:

  1. --maxrregcount dominance -- When --maxrregcount is specified, --minnctapersm and --maxntid are ignored. The register constraint calculator (sub_43B660) enforces this precedence.

  2. --override-directive-values -- Only affects --minnctapersm, --maxntid, and --maxrregcount. Without this flag, PTX directives (.maxnreg, .minnctapersm, .maxntid) take precedence over CLI values.

  3. --device-function-maxrregcount vs --maxrregcount -- The former overrides the latter for device functions only, and only under --compile-only mode. For whole-program compilation, --device-function-maxrregcount is ignored.

  4. --Ofast-compile vs --fast-compile -- The documented --Ofast-compile supersedes the internal --fast-compile. Both may conflict with --allow-expensive-optimizations (the validator in sub_434320 checks for this).

  5. --device-debug auto-enables -- Setting -g auto-enables --sp-bounds-check and --g-tensor-memory-access-check. The flag --gno-tensor-memory-access-check explicitly overrides regardless of ordering.

  6. --suppress-debug-info requires -- Has no effect unless --device-debug or --generate-line-info is also specified.

  7. --compile-as-tools-patch forces -- Automatically sets maxrregcount to ABI minimum. Interacts with --sw200428197 workaround in the function/ABI setup path (sub_43F400).

  8. --split-compile and --allow-expensive-optimizations -- Both activate the thread pool (sub_1CB18B0). The jobserver client (sub_1CC7300) integrates with GNU Make's --jobserver-auth= to respect parallel build limits.

Function Map

| Address | Size | Identity |
|---|---|---|
| 0x403588 | 75 B | Usage printer (calls sub_1C97640) |
| 0x432A00 | 6,427 B | Option registration (~160 options) |
| 0x434320 | 10,289 B | Option parser and validator |
| 0x439880 | 2,935 B | Chrome trace JSON parser (--fdevice-time-trace) |
| 0x43A400 | 4,696 B | Target configuration (cache defaults, --sw4575628) |
| 0x43B660 | 3,843 B | Register/resource constraint calculator |
| 0x446240 | 11,064 B | Compilation driver (options block consumer) |
| 0x4428E0 | 13,774 B | PTX input setup (--compile-only, --extensible-whole-program) |
| 0x60B040 | 4,500 B | Stress test option handler |
| 0x703AB0 | 10,000 B | Binary-kind / capmerc CLI parser |
| 0x1C960C0 | ~1,500 B | Option parser constructor |
| 0x1C96680 | ~2,000 B | Argv processor |
| 0x1C97210 | ~1,500 B | Option value validator |
| 0x1C97640 | -- | Options help printer |

Knobs System

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The knobs system is ptxas's internal configuration mechanism -- a separate layer beneath the public CLI flags that exposes 1,294 tuning parameters to NVIDIA developers. Every significant compiler heuristic (register allocation thresholds, scheduling priorities, pass enable/disable, peephole rules) has a corresponding knob. The system is shared with cicc via a common header (generic_knobs_impl.h) but ptxas instantiates it twice: once for the DAG scheduler pipeline (99 knobs) and once for the OCG (Optimizing Code Generator) backend (1,195 knobs). All knob names are stored ROT13-encoded in the binary, a lightweight obfuscation that defeats casual discovery with the strings utility while remaining trivially reversible.

The knobs infrastructure lives primarily in two address regions: 0x6F0000--0x6F8000 (DAG knob instantiation, shared with the Mercury SASS pipeline) and 0x797000--0x7A2000 (OCG knob instantiation, the larger set). Both regions are compiled from the same template in generic_knobs_impl.h.

| Property | Value |
|---|---|
| Total knobs | 1,294 (99 DAG + 1,195 OCG) |
| Source header | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h |
| DAG GetKnobIndex | sub_6F0820 (2,782 bytes) |
| OCG GetKnobIndex | sub_79B240 (518 bytes) |
| ParseKnobValue | sub_6F7360 / sub_79F540 (DAG: 18KB, OCG: 18KB) |
| ReadKnobsFile | sub_79D070 (9,879 bytes) |
| KnobsInit (master) | sub_79D990 (40,817 bytes) |
| KnobInit (per-knob) | sub_7A0C10 (13,874 bytes) |
| Knob descriptor | 64 bytes per entry |
| Knob runtime value | 72 bytes per slot |
| Name obfuscation | ROT13 with case-insensitive comparison |
| Setting mechanisms | -knob NAME=VALUE, knobs file ([knobs] header), PTX pragma, env var |
| Debug dump | DUMP_KNOBS_TO_FILE environment variable |

Architecture

        ┌───────────────────────────────────────────┐
        │          KnobsInit (sub_79D990)           │
        │  Called once from global init sub_662920  │
        └──────┬───────────────┬───────────────┬────┘
               │               │               │
        ┌──────▼────────┐ ┌────▼──────┐   ┌────▼───────────┐
        │ ReadKnobsFile │ │ -knob CLI │   │ PTX pragma     │
        │ sub_79D070    │ │ parsing   │   │ (unless        │
        │ [knobs] fmt   │ │           │   │ DisablePragma) │
        └──────┬────────┘ └────┬──────┘   └────┬───────────┘
               │               │               │
               ▼               ▼               ▼
        ┌─────────────────────────────────────────────┐
        │       ParseKnobsString (sub_79B530)         │
        │  Handles WHEN=, INJECTSTRING, ~-delimited   │
        └──────────────────┬──────────────────────────┘
                           │
              ┌────────────▼─────────────┐
              │   GetKnobIndex           │
              │   sub_6F0820 (DAG)       │
              │   sub_79B240 (OCG)       │
              │   ROT13 decode + lookup  │
              └────────────┬─────────────┘
                           │
              ┌────────────▼─────────────┐
              │   ParseKnobValue         │
              │   sub_6F7360 (DAG)       │
              │   sub_79F540 (OCG)       │
              │   Type-specific parsing  │
              └────────────┬─────────────┘
                           │
              ┌────────────▼─────────────┐
              │   Runtime knob array     │
              │   72 bytes per slot      │
              │   Accessed by index      │
              └──────────────────────────┘

ROT13 Name Obfuscation

Every knob name in the binary is stored as a ROT13-encoded string. The GetKnobIndex function decodes each character inline during comparison, without ever materializing the cleartext name in memory. The decode is combined with a case-insensitive tolower() comparison against the user-supplied query.

The inline ROT13 decode from sub_6F0820:

// For each character in the stored ROT13 name:
char c = stored_name[i];
if ((unsigned char)((c & 0xDF) - 65) <= 12)
    c += 13;                   // A-M (or a-m) -> N-Z (or n-z)
else if ((unsigned char)((c & 0xDF) - 78) < 13)
    c -= 13;                   // N-Z (or n-z) -> A-M (or a-m)
// Then compare case-insensitively:
if (tolower(query_char) != tolower(c))
    goto mismatch;

The & 0xDF trick converts lowercase to uppercase before range-checking, so both 'a'-'m' and 'A'-'M' hit the first branch. Non-alphabetic characters pass through unchanged. This means knob names like SchedNumBB_Limit with underscores and digits are handled correctly -- only the alphabetic portion rotates.

To reverse-engineer knob names from the binary: extract the ROT13 strings from the knob definition table (64-byte stride at the table base pointer), apply ROT13, and you get the cleartext name.

Knob Descriptor Layout

Each knob is described by a 64-byte entry in the knob definition table. The table is an array at (knob_state + 16) with count at (knob_state + 24).

Offset  Size  Field
──────  ────  ─────────────────────────────────────
+0      8     name_ptr          Pointer to ROT13-encoded primary name
+8      8     name_len          Length of primary name
+16     1     type_tag          Knob type (OKT_* enum, 1-12)
+17     7     (padding)
+24     16    (reserved)
+40     8     alias_ptr         Pointer to ROT13-encoded alias name
+48     8     alias_len         Length of alias name
+56     8     (reserved)
──────  ────
        64    Total

Both primary and alias names are checked during lookup. A knob matches if either its primary name or alias decodes to the query string (case-insensitive). The alias mechanism allows backward-compatible renaming of knobs across toolkit versions.

Knob Value Layout

Runtime knob values are stored in a flat array of 72-byte slots at (knob_state + 72 * index). The slot layout depends on the type:

Offset  Size  Field
──────  ────  ─────────────────────────────────────
+0      1     type_tag          Runtime type (0=unset, 1-10)
+1      7     (padding)
+8      8     value / pointer   Primary value (int32, int64, float, double, or pointer)
+16     8     list_begin        For list types: first element pointer
+24     8     list_sentinel     For list types: sentinel node
+32     4     aux_value         Secondary value (compound types; note scalar secondaries such as the int-range high bound live at +12)
+36     4     (padding)
+40     8     list_tail         For list types: last element pointer
+48     8     list_head         For list types: head pointer
+56     4     element_count     For list types: number of elements
+60     4     (padding)
+64     8     allocator         Arena allocator pointer for list/range types
──────  ────
        72    Total

The type tag at runtime differs from the definition-table type tag. The definition type drives parsing; the runtime type reflects what was actually stored:

| Runtime Type | Meaning | Payload |
|---|---|---|
| 0 | Unset / invalid | None |
| 1 | int32 | `*(int32*)(slot + 8)` |
| 2 | float | `*(float*)(slot + 8)` |
| 3 | double / int64 | `*(int64*)(slot + 8)` |
| 4 | boolean (true) | No payload; presence = true |
| 5 | string | `*(char**)(slot + 8)` |
| 6 | when-condition list | Doubly-linked list at +16..+48, count at +56 |
| 7 | int32 with secondary | `*(int32*)(slot + 8)`, `*(int32*)(slot + 12)` |
| 8 | int-range | `*(int32*)(slot + 8)` = low, `*(int32*)(slot + 12)` = high |
| 9 | opcode-string-list | Doubly-linked list (same structure as type 6) |
| 10 | int-list (dynamic) | Growable array at +16, count at +24 |

Per-Type Slot Usage (confirmed from decompilation)

Types 1, 2, 3, 4, 5, 7, 8 -- scalar types using only bytes +0 through +15:

Type 1 (int32):      +0 = 0x01, +8 = int32 value (4 bytes)
Type 2 (float):      +0 = 0x02, +8 = float value (4 bytes, upper 4 undefined)
Type 3 (double):     +0 = 0x03, +8 = double value (8 bytes)
Type 4 (boolean):    +0 = 0x04  (no payload -- presence = true)
Type 5 (string):     +0 = 0x05, +8 = char* pointer (8 bytes, NOT owned)
Type 7 (budget):     +0 = 0x07, +8 = int32 primary, +12 = int32 secondary
Type 8 (int-range):  +0 = 0x08, +8 = int32 low, +12 = int32 high

Types 6 and 9 -- doubly-linked list types using the full 72 bytes:

+0:   byte   type tag (6 or 9)
+8:   ptr    next pointer (initially 0)
+16:  ptr    → slot+24 (sentinel backward link)
+24:  ptr    → slot+8 (sentinel forward link)
+32:  int64  (unused, set to 0)
+40:  ptr    tail of list
+48:  ptr    head of list
+56:  int32  element count (starts at 2 for sentinel nodes)
+64:  ptr    arena allocator (for node allocation)

Each list node is 24 bytes, allocated from the arena at +64:

Type 6 node: [next(8), prev(8), string_ptr(8)]
Type 9 node: [next(8), prev(8), opcode_id(4) | int_value(4)]

Type 10 -- dynamic growable array:

+0:   byte   = 0x0A
+8:   ptr    arena allocator
+16:  ptr    array base (int32 elements, grown via sub_6EFD20)
+24:  int32  element count (initialized to 0xFFFFFFFF = -1; first insert sets to 0)

The array grows by calling sub_6EFD20(slot+8, count+2) before each insertion, which reallocates if capacity is exceeded. Elements are 4-byte int32 values stored contiguously starting at the base pointer.

Knob Type System

The definition-table type tag (at descriptor offset +16) determines how ParseKnobValue interprets the value string. There are 10 logical knob types with 1,294 total registrations:

| Type Tag | Name | Count | Parse Rule |
|---|---|---|---|
| 1 | OKT_NONE | 139 | Boolean flag -- presence = true, no value needed |
| 2 | OKT_INT | 616 | strtol(value, NULL, 0) -- accepts decimal, hex (0x), octal (0) |
| 3 | OKT_BDGT | 88 | Same as INT but stores with secondary field zeroed (budget type) |
| 4 | OKT_IRNG | 8 | "lo..hi" range -- two integers separated by .. |
| 5 | OKT_ILIST | 3 | Comma-separated integers: "1,2,3,4" |
| 6 | OKT_FLOAT | 12 | sscanf(value, "%f", &result) |
| 7 | OKT_DBL | 100 | sscanf(value, "%lf", &result) |
| 8 | OKT_STR | 28 | Direct string assignment (pointer copy) |
| 9 | OKT_WHEN | 2 | When-condition string; parsed into linked list of condition nodes |
| 10 | OKT_OPCODE_STR_LIST | 4 | Opcode-name,integer pairs: "FADD,3,FMUL,2" |
| 11 | OKT_STR (variant) | -- | Same as type 8 (alternate string slot) |
| 12 | OKT_ILIST (variant) | -- | Int-list with pre-initialized allocator |

The INT type (616 knobs, 47.6%) dominates. These control thresholds, limits, and numeric heuristic parameters across the entire compiler. BDGT (budget) knobs (88) are semantically similar to INT but carry a secondary field used for budget-tracking in cost models. The 100 DBL knobs control floating-point heuristic weights (scheduling priorities, cost ratios, etc.).

Definition-Type to Runtime-Type Mapping

The definition-table type tag drives parsing; ParseKnobValue writes a different runtime type tag into the 72-byte slot. The mapping is not 1:1 -- several definition types collapse into the same runtime type, and compound types undergo a pre-initialization phase before the main parse:

| Def Type | Definition Name | Runtime Type | Runtime Name | Pre-init? |
|---|---|---|---|---|
| 1 | OKT_NONE | 4 | boolean (true) | No |
| 2 | OKT_INT | 1 | int32 | No |
| 3 | OKT_BDGT | 7 | int32 + secondary | No |
| 4 | OKT_IRNG | 8 | int-range (low, high) | No |
| 5 | OKT_ILIST | 10 | int-list (dynamic array) | No |
| 6 | OKT_FLOAT | 2 | float (single precision) | No |
| 7 | OKT_DBL | 3 | double (8-byte) | No |
| 8 | OKT_STR | 5 | string (pointer) | No |
| 9 | OKT_WHEN | 6 | linked list (when-condition) | Yes |
| 10 | OKT_OPCODE_STR_LIST | 9 | linked list (opcode-string) | Yes |
| 11 | OKT_STR (variant) | 5 | string (pointer) | No |
| 12 | OKT_ILIST (variant) | 10 | int-list (dynamic array) | Yes |

Types 11 and 12 are aliases: type 11 shares the exact handler with type 8 (both produce runtime type 5), and type 12 shares parsing logic with type 5 but its pre-switch initializes the allocator from the knob state object instead of inline.

ParseKnobValue Dispatch Algorithm

ParseKnobValue (sub_79F540, source lines 435--551 of generic_knobs_impl.h) implements a two-phase dispatch. The first switch pre-initializes compound types; the second switch parses the value string.

Phase 1 -- Pre-initialization (compound types only):

// v15 = definition type tag at (knob_descriptor + 16)
// v14 = runtime slot at (knob_state[9] + 72 * index)
switch (v15) {
case 9:   // OKT_WHEN -> runtime type 6
    KnobValueReset(v14);
    v14[0] = 6;
    // Initialize doubly-linked list with two sentinel nodes:
    //   +8  = 0 (next), +16 -> +24, +24 -> +8 (circular sentinels)
    //   +40 = tail, +48 = head, +56 = count (starts at 2)
    //   +64 = allocator from knob_state[1]
    break;

case 10:  // OKT_OPCODE_STR_LIST -> runtime type 9
    KnobValueReset(v14);
    v14[0] = 9;
    // Same linked-list initialization as case 9
    break;

case 12:  // OKT_ILIST variant -> runtime type 10
    KnobValueReset(v14);
    v14[0] = 10;
    *(ptr*)(v14 + 16) = NULL;           // growable array base
    *(ptr*)(v14 + 8)  = allocator;      // from knob_state[1]
    *(int32*)(v14 + 24) = 0xFFFFFFFF;   // sentinel count (-1)
    break;
}

Phase 2 -- Value parsing (all types):

Type 1 (OKT_NONE, boolean): No value string needed. Stores runtime type 4 (boolean true). Presence alone indicates the knob is set.

Type 2 (OKT_INT, integer): Calls sub_6F71D0(value, NULL) -- a strtol wrapper with base 0, which auto-detects decimal, hex (0x prefix), and octal (0 prefix). Stores runtime type 1, value at slot+8 as int32.

Type 3 (OKT_BDGT, budget): Same integer parsing as type 2. Stores runtime type 7 with the primary value at slot+8 and the secondary (budget counter) at slot+12 zeroed. Cost models decrement the secondary field as optimization budget is consumed.

Type 4 (OKT_IRNG, integer range): Parses "low..high" format with these edge cases:

"100..200"    -> low=100,  high=200        Standard range
"100.."       -> low=100,  high=0x7FFFFFFF  Open upper bound
"..200"       -> low=0x80000000, high=200   Open lower bound
".."          -> low=0x80000000, high=0x7FFFFFFF  Full range
"42"          -> low=42, high=42            Degenerate (single value)
""            -> error "Empty integer range value"

The .. separator is detected by checking *endptr == '.' && endptr[1] == '.'. Default bounds are INT_MIN (0x80000000) and INT_MAX (0x7FFFFFFF). Stores runtime type 8 with low at slot+8, high at slot+12.

Type 5 (OKT_ILIST, integer list): Parses comma-separated integers. Validation requires each element to start with a digit or -. Uses a growable array (runtime type 10) at slot+16, grown via sub_6EFD20(slot+8, count+2) before each insertion. Elements are 4-byte int32 values stored contiguously. Example: "1,2,3,4" produces a 4-element array.

Type 6 (OKT_FLOAT, float): Calls sscanf(value, "%f", &result). Stores runtime type 2, value at slot+8 as a 4-byte IEEE 754 single. Returns error "Invalid floating point value" if sscanf does not return 1.

Type 7 (OKT_DBL, double): Calls sscanf(value, "%lf", &result). Stores runtime type 3, value at slot+8 as an 8-byte IEEE 754 double. Returns error "Invalid double value" if sscanf does not return 1.

Type 8/11 (OKT_STR, string): Both handled identically. Stores runtime type 5 with a direct pointer copy: *(char**)(slot+8) = value. The string is NOT duplicated -- the pointer references the original buffer, so the caller must ensure the string's lifetime exceeds the knob's.

Type 9 (OKT_WHEN, when-condition): Pre-switch already initialized the linked list (runtime type 6). Allocates a 24-byte node via the allocator's vtable (allocator_vtable[3](allocator, 24)). Node layout: [next_ptr(8), prev_ptr(8), string_ptr(8)]. The condition string pointer is stored at node+16. Nodes are inserted at the tail of the doubly-linked list. Error if value is NULL; empty string is permitted.

Type 10 (OKT_OPCODE_STR_LIST, value-pair list): Pre-switch already initialized the linked list (runtime type 9). Parsing loop:

  1. Call vtable+40 to split the next comma-delimited token into opcode name and integer value strings
  2. If opcode name is NULL: error "Empty opcode string" (line 520)
  3. If integer value is NULL: error "Empty integer value" (line 522)
  4. Parse integer via strtol(nptr, 0, 10) (base 10 only, unlike OKT_INT)
  5. Resolve opcode name to internal ID via vtable+56 (SASS opcode table lookup)
  6. Allocate 24-byte node: [next(8), prev(8), opcode_id(4) | int_value(4)]
  7. Insert into linked list; loop until input exhausted

Format: "FADD,3,FMUL,2" produces two nodes: (FADD_id, 3) and (FMUL_id, 2). The opcode resolution uses the same 11,240-byte opcode recognition table as the peephole optimizer.

Type 12 (OKT_ILIST variant, opcode list): Pre-switch already initialized the growable array (runtime type 10). Parsing loop:

  1. Call vtable+64 to extract the next comma-delimited opcode name
  2. Resolve to internal ID via vtable+56
  3. Grow array via sub_6EFD20(slot+8, count+2)
  4. Store opcode ID as int32 in the array

Format: "FADD,FMUL,IADD3" -- opcode names only, no integers. Each is resolved to its internal opcode ID.

Default: Error "Invalid knob type" (line 551).

Parse Error Messages

ParseKnobValue (sub_79F540 / sub_6F7360) produces these diagnostic strings on parse failure:

| Error String | Source Line | Def Type | Condition |
|---|---|---|---|
| "Empty when-string" | 435 | 9 | WHEN knob with NULL value |
| "Empty integer range value" | 445 | 4 | IRNG knob with NULL or empty value |
| "Empty integer list value" | 451 | 5 | ILIST knob with NULL or empty value |
| "Integer list value is not an integer" | 453 | 5 | First char not digit or - |
| "End of integer range value is not ',' or null character" | 457 | 5 | ILIST terminator not , or \0 |
| "Empty integer value" | 470 | 2 | INT knob with NULL or empty value |
| "Empty integer value" | 478 | 3 | BDGT knob with NULL or empty value |
| "Empty floating point value" | 491 | 6 | FLOAT knob with NULL or empty value |
| "Invalid floating point value" | 496 | 6 | sscanf returns != 1 |
| "Empty double value" | 502 | 7 | DBL knob with NULL or empty value |
| "Invalid double value" | 506 | 7 | sscanf returns != 1 |
| "Empty value pair list" | 515 | 10 | OPCODE_STR_LIST with NULL value |
| "Empty opcode string" | 520 | 10 | Opcode name resolves to NULL |
| "Empty integer value" | 522 | 10 | Integer after opcode resolves to NULL |
| "Empty opcode list" | 536 | 12 | Opcode-list variant with NULL value |
| "Invalid knob type" | 551 | -- | Unrecognized type tag in definition table |
| "Invalid knob identifier" | 395 | -- | GetKnobIndex -- name not found |

All errors carry source attribution: generic_knobs_impl.h with a line number and function name ("GetKnobIndex", "ParseKnobValue", "ReadKnobsFile"). Error constructors: sub_79CDB0 (simple format string) and sub_79AED0 (format with knob name and value context).

Setting Knobs

Method 1: -knob CLI Flag

ptxas -knob SchedNumBB_Limit=100 -knob DisableCSE=1 input.ptx -o output.cubin

Multiple -knob flags accumulate. Each is parsed by KnobsInit (sub_79D990) during startup. The knob name is looked up via GetKnobIndex, then the value is parsed according to the knob's type.

Method 2: Knobs File

A knobs file is a plain-text file with a required [knobs] section header:

; Comments or metadata can appear before the header.
; ReadKnobsFile ignores everything until [knobs] is found.
[knobs]
SchedNumBB_Limit=100
DisableCSE=1
RegAllocBudget=5000
; WHEN= syntax is also supported inside the file:
WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200

ReadKnobsFile (sub_79D070, source lines 1060--1090 of generic_knobs_impl.h) processes the file:

1. fopen(path, "r")                               line ~1060
2. fseek(file, 0, SEEK_END)                        line 1075
3. size = ftell(file)                               line 1075
4. fseek(file, 0, SEEK_SET)                         line 1075
5. buffer = allocator->vtable[2](allocator, size+1) (heap alloc)
6. bytes = fread(buffer, 1, size, file)             line 1070
7. buffer[bytes] = '\0'                             (null-terminate)
8. marker = strstr(buffer, "[knobs]")               line 1065
9. if (!marker) error "Knobs header not found"
10. content = marker + 7                            (skip "[knobs]")
11. vtable[4](result, knob_state, content, 0)       (parse callback)
12. fclose(file)                                    line 1085

Key implementation details:

  • Entire file read at once. The file is fseek/ftell-measured, then fread into a single buffer of size + 1 bytes. No line-by-line streaming.
  • strstr-based header detection. The [knobs] marker is located via strstr, so it can appear anywhere in the file -- not necessarily on the first line. Everything before it (comments, version metadata, other INI sections) is silently ignored.
  • Parsing starts at marker+7. Exactly 7 characters ([knobs]) are skipped. The parse callback is ParseKnobsString (sub_79B530), which processes newline-delimited key=value pairs. The ~ separator and WHEN= conditional syntax are supported.
  • Result/Expected monad. Every I/O operation has a corresponding error path. Errors are accumulated via sub_79A3D0 (ErrorChainAppend) and propagated through a tagged result object. Multiple errors from a single file are chained, not short-circuited.

Error strings with source line numbers:

| Error String | Source Line | Condition |
|---|---|---|
| "fseek() error knobsfile %s" | 1075 | fseek(SEEK_END) or fseek(SEEK_SET) fails |
| "fseek() error for knobsfile %s" | 1080 | fseek(SEEK_END) fails (alternate path) |
| "fread() error knobsfile %s" | 1070 | fread returns <= 0 |
| "Knobs header not found in %s" | 1065 | strstr(buffer, "[knobs]") returns NULL |
| "fclose() error for knobsfile %s" | 1085 | fclose returns non-zero |

Method 3: PTX Pragma

Knobs can be set from PTX source via .pragma directives, unless the DisablePragmaKnobs knob is set. The pragma string is copied into a temporary buffer and parsed by ParseKnobsString (sub_79B530), following the same key=value syntax.

Method 4: WHEN= Conditional Overrides

The most powerful mechanism allows setting knobs conditionally, based on shader hash or instruction hash. The override string uses ~ (tilde) as a record separator:

WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200~WHEN=IH=0x12345;DisableCSE=1

ParseKnobsString (sub_79B530) recognizes these prefixes (case-insensitive):

  • WHEN= -- conditional knob application
  • SH= -- match by shader hash (decimal, hex with 0x, or range with ..)
  • IH= -- match by instruction hash
  • K= -- direct knob setting (no condition)
  • INJECTSTRING -- special directive terminated by ;; (double semicolon)

The full conditional override system is parsed by ParseKnobOverrides (sub_79C210), which iterates a linked list of override entries at knob_state + 68904. Each entry carries the condition (hash match criterion) and the knob assignment to apply when matched.

Hash matching uses FNV-1a (magic 0x811C9DC5, prime 16777619) for the per-function override table lookup at ctx+120 → +1128. See IsPassDisabledFull (sub_7992A0).

Priority Order

When the same knob is set by multiple mechanisms, the last write wins. KnobsInit (sub_79D990) processes sources in this order:

  1. Environment variable overrides (getenv)
  2. Knobs file (if specified via -knobs-file or equivalent)
  3. -knob CLI flags
  4. PTX pragma knobs (applied per-function at compile time)
  5. WHEN= conditional overrides (applied per-function when hash matches)

Later sources override earlier ones for the same knob index.

Two Instantiations: DAG and OCG

The knob system is a C++ template instantiated twice with different knob definition tables:

DAG Knobs (sub_6F0820)

The DAG (Directed Acyclic Graph) scheduler knob table contains 99 entries. These control the Mercury SASS pipeline: instruction expansion, WAR hazard handling, scoreboard configuration, and the decode/expand/opex pipeline stages.

| Property | Value |
|---|---|
| GetKnobIndex | sub_6F0820 |
| ParseKnobValue | sub_6F7360 |
| InitializeKnobs | sub_6F68C0 (9KB, 24 references to generic_knobs_impl.h) |
| Table size | 99 entries x 64 bytes = 6,336 bytes |

DAG knobs referenced in the binary include knob indices 8 and 17 (pipeline options in sub_6F52F0), 16 (WAR generation options in sub_6FBC20), and 743/747 (expansion options in sub_6FFDC0).

OCG Knobs (sub_79B240)

The OCG (Optimizing Code Generator) knob table contains 1,195 entries -- the vast majority of all knobs. These control the optimization passes, register allocation, instruction scheduling, and code generation.

| Property | Value |
|---|---|
| GetKnobIndex | sub_79B240 |
| ParseKnobValue | sub_79F540 |
| KnobsInit | sub_79D990 (40,817 bytes, master initializer) |
| KnobInit | sub_7A0C10 (per-knob state constructor) |
| Table size | 1,195 entries x 64 bytes = 76,480 bytes |
| Runtime values | 1,195 entries x 72 bytes = 86,040 bytes |

OCG knob indices referenced across the codebase include: 185 (pass-disable string, offset 13320), 294 (epilogue instruction count, used in tepid scheduling), 487 (LoopMakeSingleEntry enablement), 956-957 (shader hint settings at offsets 68832/68904).

Knob State Object

The master knob state object is constructed by KnobInit (sub_7A0C10):

Offset    Size    Field
────────  ──────  ──────────────────────────────
+0        8       vtable pointer (off_21C0738)
+8        8       arena allocator
+16       8       knob definition table pointer
+24       8       knob count
+32       40      (zero-initialized control fields)
+72       var     knob value array (72 * count bytes)
+80       4       max knob index (initially 0xFFFFFFFF)
+88       16      DUMP_KNOBS_TO_FILE path (growable string)

The vtable at off_21C0738 provides virtual methods for knob access:

  • vtable+72: IsKnobSet(index) -- check if a knob has a value
  • vtable+152: GetKnobIntValue(index) -- retrieve int32 value
  • And others for bool, string, double retrieval

Knob Access Helpers

Throughout the codebase, knobs are accessed by index via small helper functions:

| Function | Address | Purpose |
|---|---|---|
| GetKnobIntValue | sub_7A1B80 | Returns `*(int32*)(state + 72*idx + 8)` |
| GetKnobBoolValue | sub_7A1CC0 | Checks type == 4, returns presence |
| GetKnobStringValue | sub_7A1E10 | Returns string pointer (type 5/8) |
| SetKnobValue | sub_7A2860 | Writes value with optional WHEN=SH= condition |
| IsKnobSet | (inlined) | Checks `*(byte*)(state + 72*idx) != 0` |

Access is O(1) by index -- no hash lookup or name comparison at runtime. The GetKnobIndex name-to-index translation happens only during initialization.

Pass Disable Mechanism

The knobs system provides a string-based pass disable mechanism through knob index 185 (OCG offset 13320). The string contains +-delimited pass names:

-knob DisablePhases=LoopMakeSingleEntry+SinkCodeIntoBlock

Two check functions consult this string:

IsPassDisabled (sub_799250)

Simple version. Reads the disable flag byte at ctx+13320:

  • If byte == 0: no pass-disable configured, returns false
  • If byte == 5: string pointer at ctx+13328, performs substring match via sub_6E1520 (strcasestr-like)

Called from 16+ sites across the codebase: sub_78B430 (LoopMakeSingleEntry), sub_78DB70 (SinkCodeIntoBlock), sub_8236B0, sub_8D0640, sub_8F45E0, and others.

IsPassDisabledFull (sub_7992A0)

Full version with per-function overrides. First checks a per-function hash table at ctx+120 → +1128 using FNV-1a on the function identifier. If the function has a specific override entry, reads the disable string from there. Otherwise falls back to the global disable string at ctx+72 → +13320.

// FNV-1a hash for per-function lookup
uint32_t hash = 0x811C9DC5;
for (each byte b in function_id)
    hash = 16777619 * (hash ^ b);
uint32_t bucket = hash & (table_size - 1);

The + character is used as a delimiter between alternative phase names in the disable string, allowing "phaseA+phaseB" to match either name.

NamedPhases Parser (sub_798B60)

Parses a comma-separated list of name=value pairs into parallel arrays (max 256 entries). Used by KnobsInitFromEnv (sub_79C9D0) to process environment variable-based knob overrides.

Input:  "knob1=value1,knob2=value2,knob3=value3"
Output: names[256], values[256], full_strings[256]

Knob Categories

The 1,294 knobs cluster into functional categories. Prefix analysis of decoded knob names reveals these major groups:

| Prefix | Count | Domain |
|---|---|---|
| Sched* / PostSched* / Sb* | 89 | Instruction scheduling heuristics and thresholds |
| RegAlloc* / Reg* | 87 | Register allocation parameters, spill cost model, target selection |
| Disable* | 75 | Pass/feature disable switches (boolean) |
| Remat* / SinkRemat* | 35 | Rematerialization cost model, enable switches, placement control |
| Mercury* / Merc* | 21 | Mercury encoder configuration |
| URF* | 24 | Uniform Register File optimization |
| Enable* | 19 | Pass/feature enable switches (boolean) |
| Dump* | 15 | Debug dump controls (DUMPIR, DumpSched, etc.) |
| Peephole* | ~20 | Peephole optimization rules |
| Loop* | ~15 | Loop optimization parameters |
| Sync* / Barrier* | ~12 | Synchronization and barrier handling |
| WAR* | ~8 | Write-after-read hazard parameters |
| GMMA* / MMA* | ~10 | Matrix multiply-accumulate configuration |
| Spill* | ~8 | Spill code generation parameters |
| Budget* | ~10 | Cost model budgets (BDGT type knobs) |
| Copy* / CSE* | ~8 | Copy propagation and CSE parameters |
| (other) | ~577 | Miscellaneous per-pass tuning knobs |

Notable Individual Knobs

Selected knobs referenced by address in the binary:

| Index | Name (decoded) | Type | Referenced At | Purpose |
|---|---|---|---|---|
| 8 | (DAG pipeline) | INT | sub_6F52F0 | Pipeline option flag |
| 16 | (WAR generation) | INT | sub_6FBC20 | WAR pass behavior |
| 17 | (DAG pipeline) | INT | sub_6F52F0 | Pipeline option flag |
| 185 | (pass-disable string) | STR | sub_799250, sub_7992A0 | DisablePhases string |
| 294 | (epilogue count) | INT | sub_7A46E0 | Tepid scheduling divisor |
| 487 | (loop single-entry) | BOOL | sub_78B430 | LoopMakeSingleEntry enable |
| 743 | (expansion option) | INT | sub_6FFDC0 | Mercury expansion control |
| 747 | (expansion option) | INT | sub_6FFDC0 | Mercury expansion control |
| 956 | (shader hint) | -- | sub_79C210 | Shader hint knob (offset 68832) |
| 957 | (shader hint) | -- | sub_79C210 | Shader hint linked list (offset 68904) |

Register Allocation Knobs (87 knobs, indices 613--699)

The register allocator is the most heavily parameterized subsystem in ptxas. Its 87 knobs span indices 613 through 699 in the OCG knob table, registered in ctor_005 at addresses 0x4197F0--0x41B2E0. The knobs cluster into seven functional sub-categories. All names decoded from ROT13 strings at 0x21B9730--0x21BA6C0.

A. Spill Cost Model (26 knobs)

The spill guidance engine (sub_96D940, 84 KB) uses these knobs to compute per-candidate spill costs. The model multiplies hardware-specific latency and resource metrics by configurable scale factors, then applies threshold-based activation logic.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 658 | RegAllocSpillBarriersAcrossSuspend | NONE | Enable spill barriers across suspend points |
| 659 | RegAllocSpillBit | INT | Master spill-bit mode selector |
| 660 | RegAllocSpillBitHighRegCountHeur | INT | High register count heuristic for spill-bit decisions |
| 661 | RegAllocSpillBitHighRegScale | DBL | Scale factor for high-register-count spill cost |
| 662 | RegAllocSpillBitInfPerRegThreshold | INT | Interference-per-register threshold for spill-bit activation |
| 663 | RegAllocSpillBitLowRegCountHeur | INT | Low register count heuristic for spill-bit decisions |
| 664 | RegAllocSpillBitLowRegScale | DBL | Scale factor for low-register-count spill cost |
| 665 | RegAllocSpillBitMediumRegScale | DBL | Scale factor for medium-register-count spill cost |
| 666 | RegAllocSpillBitNonRematSpillThreshold | INT | Threshold for non-rematerializable spill-bit activation |
| 667 | RegAllocSpillBitRLivePerRegThreshold | INT | Live-per-register threshold for R-type spill decisions |
| 668 | RegAllocSpillBitRLiveThreshold | INT | Global R-live threshold for spill activation |
| 669 | RegAllocSpillForceXBlockHoistRefill | INT | Force cross-block hoisting of refill instructions |
| 670 | RegAllocSpillLatencyScale | DBL | Scale factor for latency in spill cost model |
| 671 | RegAllocSpillLatencyScale2 | DBL | Secondary latency scale (nested loops) |
| 672 | RegAllocSpillMemResScale | DBL | Scale factor for memory resource pressure in spill cost |
| 673 | RegAllocSpillMioHeavyThreshold | DBL | Threshold for MIO-heavy (memory-intensive) spill classification |
| 674 | RegAllocSpillOptBudget | BDGT | Budget for spill optimization passes |
| 675 | RegAllocSpillResourceScale | DBL | Scale factor for resource usage in spill cost |
| 676 | RegAllocSpillResCostsScale | DBL | Scale factor for resource costs (secondary weighting) |
| 677 | RegAllocSpillReturnRegister | INT | Spill handling mode for return-value registers |
| 678 | RegAllocSpillSmemFlatMode | INT | Shared memory spill: flat addressing mode selector |
| 679 | RegAllocSpillSmemLatencyScale | DBL | Scale factor for shared-memory spill latency |
| 680 | RegAllocSpillTexDepScale | DBL | Scale factor for texture dependency in spill cost |
| 681 | RegAllocSpillValidateDebug | INT | Debug: validate spill correctness (0=off, >0=level) |
| 682 | RegAllocSpillXBlock | INT | Cross-block spill mode (hoist/refill strategy) |
| 683 | RegAllocSpillXBlock2 | INT | Secondary cross-block spill mode |

The cost model uses three register-count tiers (low/medium/high), each with independent scale factors (664, 665, 661). The tier boundaries are set by the heuristic knobs (663, 660). Latency scales (670, 671) multiply the estimated stall cycles, while resource scales (672, 675, 676) multiply memory bandwidth consumption. The MIO-heavy threshold (673) triggers a separate cost path when the basic block is already saturated with memory operations.

B. Rematerialization (12 knobs)

Rematerialization recomputes values instead of spilling them. The allocator treats remat as a first-class spill alternative with its own budget and candidate ordering.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 619 | RegAllocCtxSensitiveRemat | INT | Enable context-sensitive rematerialization |
| 622 | RegAllocEnableOptimizedRemat | INT | Enable optimized rematerialization pass |
| 627 | RegAllocLiveRemat | INT | Enable live-range-aware rematerialization |
| 632 | RegAllocMaxRematHeight | INT | Max expression DAG height for remat candidates |
| 633 | RegAllocMaxRematInst | INT | Max instructions in a remat sequence |
| 635 | RegAllocMultiRegclassRemat | INT | Enable remat across multiple register classes |
| 636 | RegAllocMultiRegRemat | INT | Enable multi-register rematerialization |
| 637 | RegAllocMultiRegRematBudget | BDGT | Budget for multi-register remat attempts |
| 650 | RegAllocRematDisableRange | IRNG | Disable remat for instruction index range lo..hi |
| 651 | RegAllocRematEnable | INT | Master enable for rematerialization (0=off) |
| 652 | RegAllocRematReuseBudget | BDGT | Budget for remat-reuse optimization attempts |
| 654 | RegAllocOrderRematCandHeuristic | INT | Heuristic for ordering remat candidates |

Knob 650 (RegAllocRematDisableRange) is unique as the only IRNG-type knob in the set, accepting "lo..hi" to disable rematerialization for a range of instruction indices -- a debugging aid for bisecting remat-related miscompiles.

C. Pre-Assignment / MAC (8 knobs)

MAC (Machine-level Allocation with Constraints) pre-assigns physical registers to high-priority operands before the main Fatpoint allocator runs. Entry: sub_94A020 (331 lines).

| Index | Name | Type | Purpose |
|---|---|---|---|
| 613 | RegAllocAvoidBankConflictMac | INT | Enable bank-conflict-aware MAC pre-assignment |
| 614 | RegAllocAvoidBankConflictMacPenalty | INT | Penalty weight for bank conflicts during MAC pre-assignment |
| 615 | RegAllocAvoidBankConflictMacWindowSize | INT | Instruction window size for bank conflict analysis |
| 628 | RegAllocMacForce | NONE | Force MAC-level pre-allocation path |
| 629 | RegAllocMacVregAllocOrder | INT | Vreg processing order during MAC allocation |
| 630 | RegAllocMacVregAllocOrderCompileTime | INT | Compile-time variant of MAC vreg allocation order |
| 646 | RegAllocPrefMacOperands | INT | MAC operand preference level (1=read, 2=write, 3=both) |
| 647 | RegAllocPrefMacOperandsMaxDepth | INT | Max operand chain depth for MAC preference propagation |

D. Coalescing (3 knobs)

Register coalescing eliminates unnecessary register-to-register copies by merging live ranges.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 617 | RegAllocCoalesceBudget | BDGT | Budget limit for coalescing iterations |
| 618 | RegAllocCoalescing | NONE | Enable register coalescing |
| 634 | RegAllocMmaCoalescing | NONE | Enable MMA-specific coalescing |

E. Performance-Difference Backoff (5 knobs)

Progressive constraint relaxation: on retry iteration N, if the performance difference exceeds a limit, constraints relax between the begin and end iterations.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 641 | RegAllocPerfDiffBackoff | NONE | Enable perf-diff based constraint backoff |
| 642 | RegAllocPerfDiffBackoffBegin | INT | Iteration at which backoff begins |
| 643 | RegAllocPerfDiffBackoffEnd | INT | Iteration at which full relaxation is reached |
| 644 | RegAllocPerfDiffConflictWeight | INT | Weight factor for conflicts in perf-diff calculation |
| 645 | RegAllocPerfDiffLimit | INT | Performance difference limit triggering relaxation |

F. Register Target Selection (13 knobs)

The target selection phase determines how many physical registers to aim for -- the occupancy/performance tradeoff. More registers per thread means fewer warps can execute concurrently.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 687 | RegTargetList | ILIST | Comma-separated list of target register counts to try |
| 688 | RegTgtLowerLimitMMASlack | INT | Slack added to MMA lower register limit |
| 689 | RegTgtLowerLimitTCGENSlack | INT | Slack added to TCGEN lower register limit |
| 690 | RegTgtLowerLimitSPARSIFYSlack | INT | Slack added to SPARSIFY lower register limit |
| 691 | RegTgtLowerLimitDECOMPRESSSlack | INT | Slack added to DECOMPRESS lower register limit |
| 692 | RegTgtSelHigherWarpCntHeur | INT | Heuristic mode for higher-warp-count target selection |
| 693 | RegTgtSelHigherWarpCntHeurValue | DBL | Weight value for higher-warp-count heuristic |
| 694 | RegTgtSelHighLiveRangeHeurValue | DBL | Weight for high-live-range target selection heuristic |
| 695 | RegTgtSelLowerWarpCntHeur | INT | Heuristic mode for lower-warp-count target selection |
| 696 | RegTgtSelLowerWarpCntHeurValue | DBL | Weight value for lower-warp-count heuristic |
| 697 | RegTgtSelLowLiveRangeHeurValue | DBL | Weight for low-live-range target selection heuristic |
| 698 | RegTgtSelWithSMemSpillHeur | INT | Heuristic mode when shared-memory spilling is active |
| 699 | RegUsageLevel | INT | Register usage reporting level |

The four "Slack" knobs (688--691) fine-tune lower register limits for specific architectural features that have minimum register requirements: MMA (matrix multiply), TCGEN (tensor core generation), SPARSIFY (structured sparsity), DECOMPRESS (decompression).

G. General Allocation Control (12 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 616 | RegAllocCacheSize | INT | Cache size parameter for interference graph |
| 620 | RegAllocDebugConflictDetails | INT | Debug: print conflict graph details (verbosity level) |
| 621 | RegAllocDepDistanceThresholdForHighConflicts | INT | Dep-distance threshold above which high-conflict registers are deprioritized |
| 624 | RegAllocIndexAbiScratchRegs | INT | Index into ABI scratch register set |
| 639 | RegAllocNumNonSpillTrials | INT | Non-spill allocation trials before allowing spills |
| 640 | RegAllocOptLevel | INT | Regalloc optimization level (controls aggressiveness) |
| 648 | RegAllocPrintDetails | NONE | Enable detailed regalloc diagnostic printing |
| 649 | RegAllocRefineInf | INT | Refine interference graph iteration limit |
| 653 | RegAllocOptimizeABI | INT | Enable ABI-aware register optimization (setmaxnreg handling) |
| 655 | RegAllocReportMaxRegsAllowed | INT | Report maximum registers allowed per thread (diagnostic) |
| 656 | RegAllocCudaSmemSpillEnable | INT | Enable CUDA shared memory spill path |
| 685 | RegAllocUserSmemBytesPerCTA | INT | User-specified shared memory bytes per CTA (overrides computed) |

H. Miscellaneous (8 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 623 | RegAllocEstimatedLoopIterations | STR | String hint providing estimated loop iteration counts for spill cost weighting |
| 625 | RegAllocL1SpillRegThres | INT | Register count threshold for L1 spill mode activation |
| 626 | RegAllocL1SpillScale | DBL | Scale factor for L1 cache spill cost |
| 631 | RegAllocMaxGmmaDisallowedReg | INT | Max registers disallowed during GMMA (warp group MMA) allocation |
| 638 | RegAllocNoRetargetPrefs | NONE | Disable retarget-preference optimization |
| 657 | RegAllocSortRegs | INT | Sorting order for register candidates during allocation |
| 684 | RegAllocThresholdForDiscardConflicts | INT | Interference count above which conflicts are discarded (default 50) |
| 686 | RegAttrReuseVectorBudget | BDGT | Budget for register-attribute vector reuse optimization |

Scheduling Knobs (89 knobs, indices 229--978)

The instruction scheduler is the second most heavily parameterized subsystem after register allocation. Its 89 knobs span two contiguous blocks (indices 738--811 for the core Sched* set, and 569--574 for the PostSched* set) plus 11 scattered entries for scheduling-adjacent features. All names decoded from ROT13 strings at 0x21B6CB0--0x21BE100, registered in ctor_005 at code addresses 0x411FF0--0x420A00.

The knobs control every aspect of the list scheduler: how latencies are modeled, which functional units are treated as busy, how aggressively cross-block motion is attempted, and how register pressure feedback loops interact with the priority function. Three Blackwell-era SchedResBusy* knobs (QMMA at 964, OMMA at 977, MXQMMA at 978) sit outside the main block because they were appended in a later toolkit version for new MMA unit types.

A. Resource Busy Overrides (28 knobs)

The SchedResBusy* knobs override the hardware-profile resource busy times for individual functional units. Each knob sets the number of cycles the named unit is considered occupied after issuing an instruction to it. When unset, the scheduler uses the value from the latency model's per-SM hardware profile. Setting a SchedResBusy* knob to 0 effectively makes the unit appear always free to the scheduler.

Two knobs accept string values instead of integers: SchedResBusyOp and SchedResBusyMachineOpcode take a string identifying a specific opcode or machine opcode to override, enabling per-instruction busy-time tuning.

| Index | Name | Type | Functional Unit |
|---|---|---|---|
| 781 | SchedResBusyADU | INT | Address divergence unit |
| 782 | SchedResBusyALU | INT | Arithmetic logic unit |
| 783 | SchedResBusyCBU | INT | Convergence barrier unit |
| 784 | SchedResBusyDMMA | INT | Double-precision MMA unit |
| 785 | SchedResBusyFMA | INT | Fused multiply-add unit |
| 786 | SchedResBusyFMAWide | INT | Wide FMA unit (multi-cycle) |
| 787 | SchedResBusyFP16 | INT | Half-precision FP unit |
| 788 | SchedResBusyFP64 | INT | Double-precision FP unit |
| 789 | SchedResBusyGMMA | INT | Warp group MMA (WGMMA) unit |
| 790 | SchedResBusyHMMA16 | INT | Half-precision MMA, 16-wide |
| 791 | SchedResBusyHMMA16816 | INT | Half-precision MMA, 16x8x16 shape |
| 792 | SchedResBusyHMMA1688 | INT | Half-precision MMA, 16x8x8 shape |
| 793 | SchedResBusyHMMA32 | INT | Half-precision MMA, 32-wide |
| 794 | SchedResBusyIMMA | INT | Integer MMA unit |
| 795 | SchedResBusyLSU | INT | Load/store unit |
| 796 | SchedResBusyLSUL1 | INT | Load/store unit (L1 path) |
| 797 | SchedResBusyOp | STR | Per-opcode override (string: opcode name) |
| 798 | SchedResBusyMachineOpcode | STR | Per-machine-opcode override (string) |
| 799 | SchedResBusyUDP | INT | Uniform datapath unit |
| 800 | SchedResBusyXU64 | INT | Extended-precision (64-bit) unit |
| 964 | SchedResBusyQMMA | INT | Quarter-precision MMA unit (Blackwell) |
| 977 | SchedResBusyOMMA | INT | Octal MMA unit (Blackwell) |
| 978 | SchedResBusyMXQMMA | INT | MX-quantized MMA unit (Blackwell) |

The four HMMA variants (790--793) correspond to different tensor core shapes: HMMA16 for 16-wide half-precision, HMMA1688 for the 16x8x8 tile used on Volta/Turing, HMMA16816 for the 16x8x16 tile used on Ampere+, and HMMA32 for 32-wide half-precision operations. IMMA (794) handles integer tensor operations (INT8/INT4).

B. Latency Overrides (12 knobs)

These override the default latency values the scheduler uses for dependency edges. The SchedRead* prefix indicates read-after-write latencies; the SchedTex* and SchedLDS* variants target texture and shared-memory operations specifically.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 757 | SchedLDSLatency | INT | Shared memory (LDS) load latency in cycles |
| 770 | SchedReadAvailTarget | INT | Target availability delay for read operands |
| 771 | SchedReadLatency | INT | Default read-after-write latency |
| 772 | SchedReadSBBaseLatency | INT | Scoreboard base read latency |
| 773 | SchedReadSBBaseUseLSULat | BOOL | Use LSU latency as scoreboard base |
| 774 | SchedReadSbDmmaLatency | INT | Scoreboard read latency for DMMA operations |
| 775 | SchedReadSbLdgstsLatency | INT | Scoreboard read latency for LDGSTS (async copy) operations |
| 802 | SchedSyncsLatency | INT | Synchronization barrier latency |
| 803 | SchedSyncsPhasechkLatency | INT | Phase-check synchronization latency |
| 804 | SchedTex2TexIssueRate | INT | Minimum cycles between back-to-back texture issues |
| 808 | SchedTexLatency | INT | Texture fetch latency in cycles |
| 811 | SchedXU64Latency | INT | Extended 64-bit unit latency |

C. Register Pressure Feedback (8 knobs)

The scheduler's priority function incorporates register pressure awareness through these knobs. They control how aggressively the scheduler tries to reduce live register count: SchedMaxRTarget sets the target register count, while the SchedMaxRLive* knobs define slack bands around that target. SchedReduceIncLimit* throttles how quickly the scheduler increases its pressure-reduction efforts.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 758 | SchedLocalRefRatio | DBL | Local reference ratio weight in priority function |
| 760 | SchedMaxRLiveCarefulSlack | INT | Slack before aggressive register pressure reduction |
| 761 | SchedMaxRLiveOKslack | INT | Slack band where register pressure is acceptable |
| 762 | SchedMaxRLiveOKslackColdBlocks | INT | OK-slack for cold (infrequently executed) blocks |
| 763 | SchedMaxRTarget | INT | Target maximum register count for scheduling |
| 776 | SchedReduceIncLimit | INT | Limit on incremental register pressure reduction steps |
| 778 | SchedReduceIncLimitHigh | INT | Upper bound on incremental reduction |
| 779 | SchedReduceRegBudget | BDGT | Budget for register-pressure-reduction iterations |

D. Cross-Block Scheduling (8 knobs)

Cross-block motion allows the scheduler to move instructions across basic block boundaries for better latency hiding. These knobs control the scope and cost limits of cross-block speculation.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 742 | SchedCrossBlock | INT | Master cross-block scheduling mode selector |
| 743 | SchedCrossBlockInstsToSpeculate | INT | Max instructions to speculate across block boundary |
| 744 | SchedCrossBlockLimit | INT | Overall cross-block motion limit |
| 745 | SchedCrossBlockSpeculate | INT | Speculation mode for cross-block motion |
| 746 | SchedCrossBlockSpeculateBudget | BDGT | Budget for cross-block speculation attempts |
| 747 | SchedCrossBlockTexToSpeculate | INT | Max texture instructions to speculate across blocks |
| 288 | EnableXBlockSchedInMultiBlockInMMALoop | INT | Enable cross-block scheduling within multi-block MMA loops |
| 738 | SbXBlock | INT | Cross-block scoreboard tracking mode |

E. Texture Batching (7 knobs)

Texture operations have high latency, so the scheduler groups them into batches to maximize memory-level parallelism. These knobs control batch formation and target selection.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 741 | SchedCountLoadsPerTex | INT | Max loads to count per texture operation |
| 756 | SchedLDGBatchDelayBias | INT | Delay bias for global load batching |
| 755 | SchedLastHybridInBBWithIssueRate | INT | Last hybrid scheduler position in BB with issue rate |
| 805 | SchedTexBatchTargetSelectRegisterTarget | INT | Batch formation: prefer register-target-aware grouping |
| 806 | SchedTexBatchTargetSelectSchedulerTarget | INT | Batch formation: prefer scheduler-target grouping |
| 807 | SchedTexBatchTargetTexReadTogether | INT | Batch formation: prefer grouping tex reads together |
| 931 | UseGroupOpexesForResourceScheduling | INT | Use grouped opexes for resource scheduling decisions |

F. Dependency Modeling (6 knobs)

These control how the scheduler builds and refines the dependency graph between instructions.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 753 | SchedAddDepFromGlobalMembarToCB | INT | Add dependency edge from global membar to CB unit |
| 759 | SchedMaxMemDep | INT | Max memory dependencies per instruction |
| 764 | SchedMemNoAlias | NONE | Assume no memory aliasing (aggressive scheduling) |
| 777 | SchedReduceRefPsuedoDepLimit | INT | Limit on reducing reference pseudo-dependencies |
| 780 | SchedRefineMemDepBudget | BDGT | Budget for memory dependency refinement iterations |
| 801 | SchedSymmetricAntiDepConflictWindow | BOOL | Enable symmetric anti-dependency conflict window |

G. Post-Scheduler (6 knobs)

The post-scheduler runs after register allocation (phase 103) and adjusts the schedule to account for actual register assignments. It primarily inserts stall cycles and adjusts issue delays.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 569 | PostSchedAdvLatencyHiding | BOOL | Enable advanced latency hiding in post-scheduler |
| 570 | PostSchedBudget | BDGT | Budget for post-scheduler iterations |
| 571 | PostSchedEarlyStall | INT | Early stall insertion mode |
| 572 | PostSchedForceReverseOrder | INT | Force reverse traversal order in post-scheduler |
| 573 | PostSchedIssueDelay | BOOL | Enable issue delay computation |
| 574 | PostSchedIssueDelayForNoWBStalls | BOOL | Compute issue delays for no-writeback stalls |

H. Ordering and Preservation (5 knobs)

These control whether the scheduler preserves the original instruction order (from the optimizer or PTX source) versus reordering freely.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 229 | ForcePreserveSchedOrderSameNvOpt | INT | Force preserve scheduling order from NvOpt pass |
| 594 | PreserveSchedOrder | NONE | Preserve source scheduling order (boolean) |
| 595 | PreserveSchedOrderSame | BOOL | Preserve scheduling order for same-priority instructions |
| 751 | SchedForceReverseOrder | INT | Force reverse scheduling order (bottom-up) |
| 769 | SchedPrefFurthestDep | BOOL | Prefer instructions with furthest dependency |

I. Scoreboard (4 knobs)

The hardware scoreboard tracks instruction completion. These knobs tune how the scheduler predicts scoreboard occupancy to avoid stalls.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 738 | SbXBlock | INT | Cross-block scoreboard tracking mode |
| 739 | SbXBlockLLSB | INT | Cross-block long-latency scoreboard tracking |
| 772 | SchedReadSBBaseLatency | INT | Scoreboard base read latency |
| 773 | SchedReadSBBaseUseLSULat | BOOL | Use LSU latency as scoreboard base |

Note: SbXBlock appears in both cross-block (D) and scoreboard (I) categories because it serves both purposes -- it controls whether the scoreboard state propagates across block boundaries, which is a prerequisite for cross-block scheduling correctness.

J. MMA Coupling (3 knobs)

Matrix multiply-accumulate instructions on certain architectures share functional unit resources. These knobs control how the scheduler models coupled execution.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 752 | SchedFP16CoupledMaxellPascal | INT | FP16 coupled execution mode on Maxwell/Pascal |
| 754 | SchedHmmaImmaBmmaCoupledAmperePlus | INT | HMMA/IMMA/BMMA coupled execution on Ampere+ |
| 366 | GroupOpexesForResourceSchedulingThreshold | DBL | Threshold for grouping opexes in resource scheduling |

K. Scheduler Model (4 knobs)

These control how the scheduler models the hardware pipeline and instruction movement costs.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 765 | SchedModelIdentityMove | INT | Model identity moves as zero-latency |
| 766 | SchedModelSharedPhysicalPipe | INT | Model shared physical pipe contention |
| 767 | SchedMultiRefDeltaLive | INT | Delta-live threshold for multi-reference instructions |
| 768 | SchedMultiRefDeltaLiveMinRefs | INT | Minimum reference count for delta-live calculation |

L. Budget, Scale, and Control (7 knobs)

General scheduling control knobs covering budgets, loop iteration estimates, the master disable switch, and validation.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 740 | SchedBumpScaleAugmentFactor | DBL | Augment factor for priority bump scaling |
| 748 | SchedDisableAll | INT | Master disable for all scheduling passes |
| 749 | SchedDynBatchBudget | BDGT | Budget for dynamic batching iterations |
| 750 | SchedEstimatedLoopIterations | STR | Estimated loop iterations (string: per-loop hints) |
| 809 | ScheduleKILs | INT | Schedule KIL (kill/discard) instructions |
| 810 | SchedValidateLiveness | INT | Enable liveness validation after scheduling |
| 811 | SchedXU64Latency | INT | XU64 unit latency override |

Disable Switches (75 knobs)

The disable switches are boolean knobs that turn off specific passes, optimizations, or workarounds. All 75 knobs containing "Disable" were decoded from ROT13 strings at 0x21BDE30--0x21BFA10. Nearly all are OKT_NONE (boolean) type -- setting them with no value or any value disables the corresponding feature. The single exception is RegAllocRematDisableRange, which is OKT_IRNG and accepts a "lo..hi" instruction index range.

The bare Disable knob at 0x21BE860 appears to be a master pass-disable switch. SchedDisableAll is the master scheduler disable. DisablePragmaKnobs prevents PTX .pragma directives from setting knobs -- a meta-level control that protects the knob system itself.

A. Workaround (WAR) Switches (9 knobs)

These disable hardware or compiler bug workarounds. Each War_SW* knob corresponds to an NVIDIA internal bug tracker ID. Disabling a WAR reverts to the unpatched behavior -- useful for bisecting whether a WAR is causing a regression.

| Name | Feature Disabled |
|---|---|
| DisableWar_SW200655588 | Workaround for bug SW-200655588 |
| DisableWar_SW2549067 | Workaround for bug SW-2549067 |
| DisableWar_SW2789503 | Workaround for bug SW-2789503 |
| DisableWar_SW2965144 | Workaround for bug SW-2965144 |
| DisableWar_SW3093632 | Workaround for bug SW-3093632 |
| DisableForwardProgressWar1842954 | Forward-progress guarantee workaround (bug 1842954) |
| DisableForwardProgressWar1842954ForDeferBlocking | Same WAR, variant for defer-blocking scheduling |
| DisableHMMARegAllocWar | HMMA (half-precision MMA) register allocation workaround |
| DisableMultiViewPerfWAR | Multi-view rendering performance workaround |

B. Memory and Addressing (11 knobs)

These control address computation, memory access conversion, and shared-memory optimizations.

| Name | Feature Disabled |
|---|---|
| DisableCvtaForGenmemToSmem | Generic-to-shared address space conversion via cvta |
| DisableDoubleIndexedAddress | Double-indexed addressing mode optimization |
| DisableErrbarAfterMembar | Error barrier (BAR.SYNC 15) insertion after membar.sys |
| DisableForceLDCTOLDCUConv | LDC to LDCU (constant uniform load) conversion |
| DisableImplicitMemDesc | Implicit memory descriptor inference |
| DisableLDCU256 | LDCU.256 -- 256-bit constant uniform load |
| DisableLDCUWithURb | LDCU with uniform register base addressing |
| DisableLongIntArithAddressFolding | Long integer arithmetic folding into address computation |
| DisableRemoveSmemLea | Shared memory LEA (load effective address) removal |
| DisableSmemSizePerCTACheck | Shared memory size per CTA validation check |
| DisableStrideOnAddr | Stride-on-address optimization (base+stride*index folding) |

C. Register Allocation and Uniform Registers (9 knobs)

These control uniform register (UR) file usage, live range management, and remat-related disable ranges.

| Name | Type | Feature Disabled |
|---|---|---|
| DisableConvergentWriteUR | NONE | Convergent write-to-UR optimization |
| DisableExtendedLiveRange | NONE | Extended live range optimization |
| DisableU128 | NONE | 128-bit uniform register support |
| DisableURLiveAcrossConvBound | NONE | UR liveness across convergence boundaries |
| DisableURLivenessTradeOff | NONE | UR liveness trade-off heuristic |
| DisableUreg | NONE | Uniform register file usage entirely |
| MercuryDisableLegalizationOfTexToURBound | NONE | Mercury tex-to-UR-bound legalization |
| RegAllocRematDisableRange | IRNG | Rematerialization for instruction index range lo..hi |
| RematDisableTexThrottleRegTgt | NONE | Texture throttle register target during remat |

D. Loop Optimization (6 knobs)

| Name | Feature Disabled |
|---|---|
| DisableAlignHotLoops | Hot loop alignment (NOP padding for fetch efficiency) |
| DisableDeadLoopElimination | Dead loop elimination pass |
| DisableLoopLevelVaryingAnalysis | Loop-level varying/invariant analysis |
| DisableLoopPrecheckForYields | Loop pre-check insertion for yield points (cooperative groups) |
| DisableMeshVCTALoop | Mesh shader virtual CTA loop optimization |
| DisablePartialUnrollOverflowCheck | Overflow check during partial loop unrolling |

E. Code Motion and Scheduling (6 knobs)

| Name | Feature Disabled |
|---|---|
| DisableLatTransitivity | Latency transitivity in scheduling dependency chains |
| DisableMoveCommoning | MOV-based equivalence propagation (commoning walker) |
| DisableNestedHoist | Nested code hoisting (loop-invariant-like motion) |
| DisableOffDeck | Off-deck scheduling (prefetch to off-deck buffer) |
| DisableSourceOrder | Source-order scheduling constraint |
| SchedDisableAll | Master switch: all scheduling passes |

F. Vectorization (4 knobs)

| Name | Feature Disabled |
|---|---|
| DisableFastvecEnhancement | Fast vectorization enhancement pass |
| DisableHalfPartialVectorWrites | Half-precision partial vector write coalescing |
| DisableReadVectorization | Load vectorization (coalescing scalar reads into vector loads) |
| DisableWriteVectorization | Store vectorization (coalescing scalar writes into vector stores) |

G. Predication and Branching (4 knobs)

| Name | Feature Disabled |
|---|---|
| CmpToMovPredCrossBlockDisable | CMP-to-MOV predicate propagation across basic blocks |
| DisableBranchPredInput | Branch predicate input optimization |
| DisableCmpToPred | CMP-to-predicate conversion |
| DisablePredication | Predication pass (phase 63, OriDoPredication) |

H. Synchronization and Barriers (2 knobs)

| Name | Feature Disabled |
|---|---|
| DisableRedundantBarrierRemoval | Redundant barrier removal pass |
| DisableStageAndFence | Stage-and-fence synchronization insertion |

I. Dead Code and Store Elimination (2 knobs)

| Name | Feature Disabled |
|---|---|
| DisableDeadStoreElimination | Dead store elimination pass |
| DisableStraightenInSimpleLiveDead | Straightening within simple live/dead analysis |

J. Control Flow Merging (5 knobs)

| Name | Feature Disabled |
|---|---|
| DisableEarlyExtractBCO | Early extraction of BCO (branch code optimization objects) |
| DisableMergeEquivalentConditionalFlow | Phase 133: tail merging of equivalent conditional branches |
| DisableMergeFp16MovPhi | FP16 MOV-PHI merge optimization |
| DisableMergeSamRamBlocks | SAM/RAM block merging (surface/texture access coalescing) |
| DisableOptimizeHotColdFlow | Hot/cold flow optimization (code layout splitting) |

K. Pass Control (2 knobs)

| Name | Feature Disabled |
|---|---|
| Disable | Master disable switch (bare name) |
| DisablePragmaKnobs | PTX .pragma-based knob overrides |

L. Sanitizer (3 knobs)

These control the address sanitizer instrumentation for different memory spaces. When the sanitizer is active, these knobs can selectively disable checking for one space while keeping the others.

| Name | Feature Disabled |
|---|---|
| SanitizeDisableGlobal | Address sanitizer for global memory accesses |
| SanitizeDisableLocal | Address sanitizer for local memory accesses |
| SanitizeDisableShared | Address sanitizer for shared memory accesses |

M. Floating Point (2 knobs)

| Name | Feature Disabled |
|---|---|
| FPFoldDisable | Floating-point constant folding |
| FPRefactoringDisable | Floating-point expression refactoring |

N. Miscellaneous (10 knobs)

| Name | Feature Disabled |
|---|---|
| DisableBW225LongIntArith | BW225 (Blackwell) long integer arithmetic optimization |
| DisableBptTrapNoReturn | BPT.TRAP no-return semantics (debugger breakpoint trap) |
| DisableDependentConstExpr | Dependent constant expression optimization |
| DisableISBESharing | ISBE (indexed set buffer entry) sharing for bindless textures |
| DisableMarkF2FPackbTo16Bit | Marking F2F.PACKB as 16-bit operation |
| DisableNonUniformQuadDerivatives | Non-uniform quad derivative computation |
| DisablePadding | NOP padding insertion (alignment and scheduling) |
| DisablePicCodeGen | Position-independent code generation |
| DisableSopSr | SOP (scalar operation) on special registers (SR) |
| DisableSuperUdp | Super-UDP (enhanced uniform datapath) optimization |

Rematerialization Knobs (35 knobs)

Rematerialization knobs control the three dedicated remat pipeline phases (Phase 28: SinkRemat, Phase 54: early remat activation, Phase 69: OriDoRemat) and the cost model that decides whether recomputing a value is cheaper than keeping it live in a register. These are separate from the 12 RegAlloc*Remat* knobs documented above in section B, which control allocator-integrated rematerialization. The distinction matters: allocator-integrated remat fires during register allocation itself (sub_93AC90), while these knobs tune the standalone pre-allocation and post-predication remat passes.

The 35 knobs split into two contiguous blocks in the descriptor table plus one outlier:

  • Remat* (27 knobs, indices 702--728): Late rematerialization (Phase 69) and shared cost model
  • SinkRemat* (8 knobs, indices 824--831): Early sink+remat (Phase 28)

A. Remat Enable/Disable (5 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 709 | RematDisableTexThrottleRegTgt | INT | Disable texture-throttle register targeting during remat |
| 710 | RematEarlyEnable | INT | Enable Phase 54 early remat mode activation |
| 711 | RematEnable | INT | Master enable for Phase 69 late rematerialization |
| 712 | RematEnablePReg | NONE | Enable predicate register rematerialization (boolean flag) |
| 726 | RematStressTest | NONE | Force all remat candidates to be rematerialized (debug, boolean flag) |

Knob 711 (RematEnable) is the master switch. When zeroed via -knob RematEnable=0, Phase 69 skips its core loop entirely. Knob 710 (RematEarlyEnable) independently controls Phase 54's mode flag write (ctx+1552 = 4). Knob 726 (RematStressTest) is a debug-only boolean that forces every candidate to be rematerialized regardless of profitability -- useful for stress-testing correctness.

B. Remat Cost Model (10 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 702 | RematAbsCostFactor | DBL | Absolute cost scaling factor for remat profitability |
| 703 | RematBackOffRegTargetFactor | DBL | Back-off factor for register pressure target during remat |
| 705 | RematColdBlockRatio | DBL | Cost discount ratio for cold (rarely executed) blocks |
| 713 | RematGlobalCostFactor | DBL | Global cost multiplier for cross-block rematerialization |
| 714 | RematGlobalLowCostFactor | DBL | Cost factor for low-cost (cheap ALU: MOV, IADD, LOP3) remat |
| 716 | RematLdcCost | DBL | Cost weight assigned to LDC (load-from-constant-bank) remat |
| 719 | RematMemCost | DBL | Cost weight for memory-sourced (LD/ST) rematerialization |
| 722 | RematReadUAsLdc | INT | Treat uniform address reads as LDC for cost classification |
| 727 | RematTexInstRatioThreshold | DBL | Texture instruction ratio threshold for throttle activation |
| 728 | RematTexThrottleRegTgtScale | DBL | Scale factor for register target when texture throttle is active |

These 10 knobs parameterize the remat profitability function (sub_90B790). The cost model computes remat_cost = instruction_cost * factor and compares against register savings. The DBL-typed knobs (9 of 10) are floating-point multipliers that allow fine-grained tuning. The texture-specific knobs (727, 728) implement a throttle: when the ratio of texture instructions exceeds the threshold, the register target is scaled to avoid excessive register use that would harm texture unit throughput.
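The shape of the comparison can be sketched as follows. This is a hypothetical simplification: the struct, function name, and the specific factor composition are illustrative stand-ins, not the recovered logic of sub_90B790, which is considerably more involved.

```c
#include <stdbool.h>

/* Illustrative subset of the DBL knobs in the table above. */
typedef struct {
    double abs_cost_factor;     /* RematAbsCostFactor (702)    */
    double global_cost_factor;  /* RematGlobalCostFactor (713) */
    double cold_block_ratio;    /* RematColdBlockRatio (705)   */
} remat_cost_knobs;

/* remat_cost = instruction_cost * factor; remat is taken when the
 * scaled cost stays below the estimated register savings. */
static bool remat_is_profitable(const remat_cost_knobs *k,
                                double instruction_cost,
                                bool cross_block, bool cold_block,
                                double register_savings)
{
    double factor = k->abs_cost_factor;
    if (cross_block)
        factor *= k->global_cost_factor;  /* cross-block remat costs more */
    if (cold_block)
        factor *= k->cold_block_ratio;    /* cold blocks get a discount   */
    return instruction_cost * factor < register_savings;
}
```

With default-like values, raising RematGlobalCostFactor makes cross-block remat lose against the same register savings, which matches the knob's described role.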

C. Register Pressure Control (5 knobs)

Index  Name                       Type  Purpose
─────  ─────────────────────────  ────  ─────────────────────────────────────
706    RematConservativeRegSlack  INT   Extra registers to reserve beyond target (conservative mode)
708    RematCostRegLimit          INT   Max register count considered during cost analysis
718    RematMaxRegCount           INT   Absolute ceiling on registers for remat decisions
723    RematRegTargetFactor       DBL   Scaling factor for computing the register pressure target
724    RematRegTargetTrialLimit   INT   Max iterations when searching for optimal register target

The register target is the pressure level below which rematerialization becomes profitable. RematRegTargetFactor (723) scales the occupancy-derived target. RematRegTargetTrialLimit (724) caps the binary-search iterations in the target-finding loop. RematMaxRegCount (718) is a hard ceiling -- if current pressure exceeds this value, the remat pass operates in aggressive mode.

D. Instruction and Code Limits (2 knobs)

Index  Name                 Type  Purpose
─────  ───────────────────  ────  ─────────────────────────────────────
707    RematCostInstLimit   INT   Max instruction count for inclusion in cost model
715    RematInflationSlack  INT   Allowed code-size inflation slack (extra instructions from remat)

RematCostInstLimit (707) prevents the cost model from analyzing extremely large remat sequences. RematInflationSlack (715) limits how many extra instructions rematerialization may introduce before the pass backs off.

E. Placement Control (4 knobs)

Index  Name                        Type  Purpose
─────  ──────────────────────────  ────  ─────────────────────────────────────
717    RematLowCostPlacementLimit  DBL   Max placement distance for low-cost remat candidates
720    RematMinDistance            INT   Minimum def-to-remat distance (instructions) before remat is attempted
721    RematPlacementLookback      INT   Lookback window size for placement-site search
725    RematSortRematChain         INT   Sort remat chain by priority before placement (0=off, 1=on)

These knobs control where rematerialized instructions are placed relative to their uses. RematMinDistance (720) ensures remat is not attempted for short live ranges where the original definition is close enough. RematPlacementLookback (721) limits how far back the placement algorithm scans when searching for a profitable insertion point.

F. Remat Budget (1 knob)

Index  Name         Type  Purpose
─────  ───────────  ────  ─────────────────────────────────────
704    RematBudget  BDGT  Optimization budget for the late remat pass (Phase 69)

BDGT-typed knobs carry a primary value and a secondary counter. The budget is decremented as each remat decision is committed. When exhausted (secondary reaches zero), the pass stops processing further candidates. This provides a deterministic cap on compile-time cost.
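The BDGT semantics described above can be sketched as a two-field counter. The struct and function names here are hypothetical; only the primary/secondary split and the exhaustion behavior come from the source.

```c
#include <stdbool.h>

/* Sketch of a BDGT knob: a configured primary value plus a secondary
 * counter consumed as decisions are committed. */
typedef struct {
    int primary;    /* configured budget                  */
    int secondary;  /* remaining budget for this pass run */
} budget_knob;

static void budget_reset(budget_knob *b) { b->secondary = b->primary; }

/* Charge one committed remat decision against the budget. Once the
 * secondary reaches zero the pass stops taking further candidates. */
static bool budget_take(budget_knob *b)
{
    if (b->secondary <= 0)
        return false;
    b->secondary--;
    return true;
}
```

A pass loop would call budget_take() per candidate and break on false, giving the deterministic compile-time cap the text describes.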

G. SinkRemat (Phase 28) Knobs (8 knobs, indices 824--831)

Index  Name                                    Type  Purpose
─────  ──────────────────────────────────────  ────  ─────────────────────────────────────
824    SinkRematAbsCostLimit                   DBL   Absolute cost ceiling for sinking+remat decisions
825    SinkRematBudget                         BDGT  Optimization budget for the sink+remat pass
826    SinkRematDeltaRegsRatio                 DBL   Register pressure delta ratio threshold for sink profitability
827    SinkRematEnable                         INT   Master enable for Phase 28 SinkRemat
828    SinkRematMinDefPlaceDist                INT   Minimum definition-to-placement distance for sinking
829    SinkRematMinPlaceRefDist                INT   Minimum placement-to-reference distance for sinking
830    SinkRematMultiRefXBlkUsesPenaltyFactor  DBL   Penalty multiplier for multi-reference cross-block uses
831    SinkRematPredPenaltyFactor              DBL   Penalty multiplier for sinking predicated instructions

Phase 28's SinkRemat pass (entry: sub_913A30, core: sub_A0F020) sinks instructions closer to their uses and marks remat candidates. Knob 827 (SinkRematEnable) is the master switch. The distance knobs (828, 829) prevent unprofitable micro-sinks. The penalty factors (830, 831) make the cost model more conservative for predicated instructions and for instructions with multiple cross-block uses, where sinking may duplicate code along multiple paths.

H. MOV Weight Outlier (1 knob)

Index  Name               Type  Purpose
─────  ─────────────────  ────  ─────────────────────────────────────
475    MovWeightForRemat  DBL   MOV instruction weight in remat profitability scoring

This knob sits in the general MOV-weight family (indices 474--476) rather than the Remat block. It tunes how MOV instructions contribute to the scheduling cost model's remat profitability calculation. When the remat candidate is a MOV chain, this weight determines the per-MOV cost used to decide whether rematerialization beats keeping the value live.

DUMP_KNOBS_TO_FILE

The DUMP_KNOBS_TO_FILE environment variable triggers a full dump of all knob values to a file. Checked during KnobInit (sub_7A0C10) via getenv("DUMP_KNOBS_TO_FILE"):

char* dump_path = getenv("DUMP_KNOBS_TO_FILE");
if (dump_path) {
    size_t len = strlen(dump_path);
    // Store into SSO string at knob_state+88..104
}

The path is stored in a small-string-optimized (SSO) buffer at knob_state offsets +88 through +104:

Offset  Size  Field
──────  ────  ─────────────────────────────────────
+88     8     data pointer (or first 8 inline bytes if len <= 15)
+96     8     string length
+104    8     capacity (or remaining inline bytes)

Paths of 15 bytes or fewer are stored inline without heap allocation. Longer paths allocate via the arena allocator at knob_state+8. The dump is produced later during compilation -- KnobInit only stores the path; the actual file write happens after all knobs are resolved.
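The inline-vs-heap decision can be sketched as below. This is a standalone approximation: field names are hypothetical, and malloc stands in for the arena allocator at knob_state+8.

```c
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>

/* Sketch of the SSO path buffer described above (real layout:
 * knob_state+88..104). Paths of <= 15 bytes stay inline. */
typedef struct {
    char  *data;           /* +88: pointer, or points at inline_buf */
    size_t length;         /* +96: string length                    */
    char   inline_buf[16]; /* inline storage for short paths        */
} sso_path;

static void sso_set(sso_path *p, const char *path)
{
    p->length = strlen(path);
    if (p->length <= 15) {
        memcpy(p->inline_buf, path, p->length + 1);
        p->data = p->inline_buf;          /* inline: no allocation          */
    } else {
        p->data = malloc(p->length + 1);  /* arena alloc in the real binary */
        memcpy(p->data, path, p->length + 1);
    }
}

static bool sso_is_inline(const sso_path *p) { return p->data == p->inline_buf; }
```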

This is the primary mechanism for discovering which knobs exist and what their current values are. Setting it produces a text file with all 1,294 knob names and their resolved values.

Error Handling

The knob system uses structured error descriptors (96 bytes each) allocated from an arena:

Offset  Size  Field
──────  ────  ─────────────────────────────────────
+0      8     formatted message string pointer
+8      8     message length
+16     8     source file path pointer
+24     8     source file path length
+32     8     line number
+40     8     function name pointer
+48     48    (additional context fields)

Two error constructors plus a merge helper:

Function                    Address     Purpose
──────────────────────────  ──────────  ─────────────────────────────────────
FormatKnobError             sub_79CDB0  General knob error with vsnprintf formatting
FormatKnobErrorWithContext  sub_79AED0  Error with additional context (knob name, value)
KnobError::Merge            sub_79A780  Chains multiple errors for accumulated reporting

Errors propagate through a tagged result: bit 0 of *(result + 16) is set on error, cleared on success. The GetKnobIndex return protocol:

// Success:
*(byte*)(result + 16) &= ~1;    // clear error bit
*(int32*)(result) = knob_index;  // store index

// Failure:
*(byte*)(result + 16) |= 1;     // set error bit
*(result + 0..15) = error_desc;  // store error descriptor
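The return protocol above can be modeled as a tagged 24-byte result. The struct and helper names are hypothetical; the bit-0 flag at offset +16 and the payload layout follow the snippet.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the GetKnobIndex tagged result: bit 0 of the byte at
 * offset +16 is the error flag; the first 16 bytes hold either the
 * knob index or an error-descriptor pointer. */
typedef struct {
    unsigned char bytes[24];
} knob_result;

static void result_set_index(knob_result *r, int32_t index)
{
    r->bytes[16] &= (unsigned char)~1;       /* clear error bit */
    memcpy(r->bytes, &index, sizeof index);  /* store index     */
}

static void result_set_error(knob_result *r, void *error_desc)
{
    r->bytes[16] |= 1;                             /* set error bit */
    memcpy(r->bytes, &error_desc, sizeof error_desc);
}

static bool result_is_error(const knob_result *r) { return r->bytes[16] & 1; }
```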

KnobValue Lifecycle

Construction

KnobValue::Destroy (sub_797790) resets a 72-byte value slot before writing a new value. It switches on the type tag:

Type                   Destruction Action
─────────────────────  ─────────────────────────────────────
0-5, 7, 8              No-op (POD types, no heap allocation)
6 (int-list)           Walk doubly-linked list, free each node via allocator+32
9 (opcode-list)        Walk doubly-linked list, free each node via allocator+32
10 (int-list dynamic)  Free the growable array block

Deep Copy

KnobValue::CopyFrom (sub_7978F0) handles deep copy of value slots, switching on type to properly duplicate linked lists and allocated buffers.

KnobInit (sub_7A0C10) constructs a new knob state object by allocating 72 * count bytes for the value array, then deep-copying each slot from a source state if one exists.

Function Map

Address     Size    Function                     Confidence
──────────  ──────  ───────────────────────────  ──────────
sub_6F04B0  6,824   ReportKnobError (DAG)        HIGH
sub_6F0820  2,782   GetKnobIndex (DAG)           CERTAIN
sub_6F0A30  8,700   RegisterKnob (DAG)           HIGH
sub_6F0FF0  13,000  GetKnobValue (DAG)           HIGH
sub_6F1B10  13,000  BuildKnobTable (DAG)         HIGH
sub_6F2380  14,000  ParseKnobString (DAG)        HIGH
sub_6F68C0  9,000   InitializeKnobs (DAG)        HIGH
sub_6F7360  18,306  ParseKnobValue (DAG)         CERTAIN
sub_6F83C0  --      ParseWhenShorthand (DAG)     MEDIUM
sub_797790  385     KnobValue::Destroy           HIGH
sub_7978F0  240     KnobValue::CopyFrom          MEDIUM
sub_7973E0  400     KnobType::GetSize            MEDIUM
sub_798280  900     ParsePhaseNameFragment       MEDIUM
sub_798B60  1,776   NamedPhases::ParsePhaseList  CERTAIN
sub_799250  68      IsPassDisabled               HIGH
sub_7992A0  894     IsPassDisabledFull           HIGH
sub_79A490  600     KnobError::AppendContext     MEDIUM
sub_79A5D0  800     KnobError::Format            MEDIUM
sub_79A780  2,200   KnobError::Merge             MEDIUM
sub_79AED0  1,000   FormatKnobErrorWithContext   HIGH
sub_79B240  518     GetKnobIndex (OCG)           CERTAIN
sub_79B450  200     GetKnobIndexWithValidation   HIGH
sub_79B530  3,296   ParseKnobsString             HIGH
sub_79C210  2,200   ParseKnobOverrides           HIGH
sub_79C9D0  1,600   KnobsInitFromEnv             HIGH
sub_79CDB0  1,400   FormatKnobError              HIGH
sub_79D070  2,312   ReadKnobsFile                CERTAIN
sub_79D990  7,073   KnobsInit (master)           HIGH
sub_79F540  3,640   ParseKnobValue (OCG)         CERTAIN
sub_7A0A90  350     KnobValue::CopyListValue     MEDIUM
sub_7A0C10  1,745   KnobInit (per-knob)          HIGH
sub_7A1B80  400     GetKnobIntValue              MEDIUM
sub_7A1CC0  350     GetKnobBoolValue             MEDIUM
sub_7A1E10  400     GetKnobStringValue           MEDIUM
sub_7A2860  2,100   SetKnobValue                 MEDIUM
sub_7ACEA0  3,700   OCGKnobSetup                 MEDIUM

Reimplementation Notes

To reimplement the knobs system:

  1. Define the knob table as a compile-time array of descriptors (name, alias, type). No need for ROT13 -- that is purely obfuscation. Use an enum for knob indices so call sites reference KNOB_SchedNumBB_Limit instead of magic index 294.

  2. Parse order matters. Process sources in the documented priority order (env, file, CLI, pragma, WHEN). Last-write-wins semantics.

  3. The WHEN= system is the complex part. You need FNV-1a hashing of function identifiers and a per-function override table. The hash table at ctx+120 → +1128 uses open addressing with linear probing.

  4. Budget knobs (OKT_BDGT) are just integers with a secondary tracking field. The secondary starts at 0 and is used by cost models to track how much "budget" remains during optimization.

  5. Int-range knobs (OKT_IRNG) use .. as the range separator: "100..200" means [100, 200]. Missing bounds default to INT_MIN (0x80000000) / INT_MAX (0x7FFFFFFF).

  6. The opcode-string-list type (OKT_OPCODE_STR_LIST) carries pairs of (opcode_name, integer). The opcode name is resolved to an internal opcode ID via the SASS opcode table. Used for per-instruction tuning overrides.
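The int-range rule from step 5 can be sketched as a small parser. The function and struct names are hypothetical; the single-value fallback is an assumption not stated in the source, while the ".." separator and INT_MIN/INT_MAX defaults are.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int lo, hi; } int_range;

/* Parse an OKT_IRNG value: "100..200" -> [100, 200]; a missing bound
 * defaults to INT_MIN / INT_MAX. */
static int_range parse_int_range(const char *s)
{
    int_range r = { INT_MIN, INT_MAX };
    const char *sep = strstr(s, "..");
    if (!sep) {              /* assumption: single value = degenerate range */
        r.lo = r.hi = atoi(s);
        return r;
    }
    if (sep != s)
        r.lo = atoi(s);      /* text before ".." */
    if (sep[2] != '\0')
        r.hi = atoi(sep + 2);/* text after ".."  */
    return r;
}
```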

Cross-References

Optimization Levels

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The --opt-level (-O) flag controls how aggressively ptxas optimizes during the 159-phase pipeline. The option is parsed into a 32-bit integer at options block offset +148 by sub_434320 (line 216: sub_1C96470(v10, "opt-level", a3 + 148, 4)). The default value is 3. The documented range is 0--4, but the internal NvOpt recipe system supports levels 0--5, and the scheduler and rematerialization passes distinguish level 5 from lower values.

The optimization level propagates through the compilation context and is read by individual passes via sub_7DDB50 (232 bytes at 0x7DDB50), which combines the opt-level check with a knob 499 guard. Passes that call sub_7DDB50 receive the opt-level value (stored at compilation context offset +2104) only if knob 499 is enabled; otherwise, the function returns 1 (effectively clamping the level to O1 behavior).

CLI option              --opt-level / -O
Options block offset    +148 (int32)
Default                 3
Documented range        0--4
Internal range          0--5 (NvOpt levels)
Accessor                sub_7DDB50 (0x7DDB50, 232 bytes)
Knob guard              499 (if disabled, accessor returns 1)
Parse location          sub_434320 line 216
Debug override          sub_431A40 forces level to 0 when -g is set
Ofast-compile override  sub_434320 lines 635--679

Level Summary

Level  Name                     Use Case
─────  ───────────────────────  ─────────────────────────────────────
0      No optimization          Debug builds (-g), maximum source fidelity
1      Minimal optimization     Fast compile, basic folding/DCE
2      Standard optimization    Balanced speed/compile-time (previous default)
3      Full optimization        Default; all standard passes enabled
4      Aggressive optimization  Extra loop peeling, speculative hoisting
5      Maximum optimization     Full SinkRemat+Cutlass, highest compile time

Options Block Fields Affected by Opt Level

The option parser (sub_434320) and debug handler (sub_431A40) modify several options block fields based on the optimization level. Key interactions discovered from decompiled code:

Offset  Field                          O0                        O1           O2+          Source
──────  ─────────────────────────────  ────────────────────────  ───────────  ───────────  ─────────────────────────
+148    opt_level                      0                         1            2, 3, 4      Direct from -O
+160    register_usage_level           forced to 5 if user-set   5 (default)  5 (default)  sub_434320 lines 359--363
                                       with -O0 (warning issued)
+235    cloning_disabled               0 (disabled)              per-CLI      per-CLI      sub_434320 line 776
+288    device_debug                   (from -g)                 (from -g)    (from -g)    CLI only
+292    sp_bounds_check                1 (auto-enabled)          per-CLI      per-CLI      sub_434320 line 775
+326    no_cloning                     1 (when -g)               per-CLI      per-CLI      sub_431A40 line 42
+408    allow_expensive_optimizations  false                     false        true         sub_434320 line 768
+477    fast_compile                   forced 0 with -g          per-CLI      per-CLI      sub_431A40 line 28

The critical line at sub_434320 offset 768:

// allow-expensive-optimizations defaults to (opt_level > 1)
if (!user_set_allow_expensive)
    options->allow_expensive_optimizations = (options->opt_level > 1);

Debug Mode Override (-g)

When --device-debug (-g) is active, sub_431A40 (at 0x431A40) forces the optimization level to 0 and disables most optimization features:

// sub_431A40 pseudocode
void ApplyDebugOverrides(options_block* opts, bool suppress_warning) {
    if (suppress_warning) {
        opts->device_debug = 1;
        opts->sp_bounds_check_pair = 0x0101;  // +16 = {1, 1}
    }
    opts->sp_bounds_check = 1;                // +292

    // Warn if user explicitly set opt-level with -g
    if (was_set("opt-level") && opts->opt_level != 0)
        warn("ignoring -O with -g");

    // Warn about incompatible options
    if (was_set("register-usage-level"))
        warn("'--device-debug' overrides '--register-usage-level'");

    // Force register_usage_level to {5, 5} (pair at +160)
    *(int64*)(opts + 160) = 0x500000005LL;

    if (opts->fast_compile)
        warn("'--device-debug' overrides '--fast-compile'");
    opts->fast_compile = 0;

    if (opts->ofast_compile is "max" or "mid" or "min")
        warn("'--device-debug' overrides '--Ofast-compile'");
    opts->ofast_compile = "0";
    opts->opt_level = 0;                      // +148

    // Handle cloning
    if (was_set("cloning") && opts->device_debug && !opts->no_cloning)
        warn("-cloning=yes incompatible with -g");
    opts->cloning_disabled = 0;               // +235
}

The 0x500000005LL write to offset +160 sets both the 32-bit register-usage-level and the adjacent 32-bit field to 5, resetting any user override.
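The paired write can be demonstrated directly. The struct below is a hypothetical stand-in for the two adjacent int32 fields; the little-endian layout (as on the binary's x86-64 target) is what makes one 8-byte store set both to 5.

```c
#include <stdint.h>
#include <string.h>

/* Two adjacent int32 fields, mimicking options block +160/+164. */
typedef struct {
    int32_t register_usage_level;    /* +160 */
    int32_t register_usage_level_2;  /* +164 (adjacent field)  */
} usage_pair;

static void force_usage_level(usage_pair *p)
{
    uint64_t both = 0x500000005ULL;
    memcpy(p, &both, sizeof both);  /* single 8-byte store, as in sub_431A40 */
}
```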

Ofast-Compile Interaction

The --Ofast-compile (-Ofc) option provides a compile-time vs code-quality tradeoff orthogonal to -O. It has four settings: 0 (disabled), min, mid, max. Each setting overrides the opt-level and related flags:

Ofast-compile  Forces opt_level to  Cloning      Split-compile  Expensive opts  Notes
─────────────  ───────────────────  ───────────  ─────────────  ──────────────  ─────────────────────────────────────
0              (no change)          (no change)  (no change)    (no change)     Default
min            1                    disabled     (no change)    true (forced)   Warns if -O explicitly set to anything other than 1
mid            1                    disabled     (no change)    (no change)     Disables cloning when no split-compile
max            0                    disabled     (no change)    (no change)     Most aggressive compile-time reduction

From sub_434320 lines 635--679:

if (ofast_compile == "max") {
    if (was_set("cloning") && !no_cloning)
        warn("-cloning=yes incompatible with --Ofast-compile=max");
    no_cloning = 1;
    if (was_set("opt-level") && opt_level != 0)
        warn("-opt-level=<1,2,3> incompatible with --Ofast-compile=max");
    opt_level = 0;
}
if (ofast_compile == "mid") {
    no_cloning = 1;
    if (!split_compile) cloning_disabled = 1;
    if (was_set("opt-level") && opt_level != 1)
        warn("-opt-level=<0,2,3> incompatible with --Ofast-compile=mid");
    opt_level = 1;
    fast_compile_mode = 1;
}
if (ofast_compile == "min") {
    no_cloning = 1;
    if (!split_compile) cloning_disabled = 1;
    if (was_set("opt-level") && opt_level != 1)
        warn("-opt-level=<0,2,3> incompatible with --Ofast-compile=min");
    opt_level = 1;
    fast_compile_mode = 1;
    if (was_set("allow-expensive-optimizations") && !allow_expensive)
        warn("-allow-expensive-optimizations=false incompatible with --Ofast-compile=min");
    allow_expensive_optimizations = 1;
}

Per-Phase Gating by Optimization Level

Optimization levels control the pipeline through two mechanisms:

  1. Static isNoOp() overrides -- The AdvancedPhase gate vtables are overridden at pipeline construction time based on the target architecture and opt-level.
  2. Runtime opt-level checks -- Individual pass execute functions call sub_7DDB50 (the opt-level accessor) and early-return when the level is below their threshold.

Gate Accessor: sub_7DDB50

// sub_7DDB50 pseudocode (232 bytes at 0x7DDB50)
int getOptLevel(compilation_context* ctx) {
    knob_vtable* kv = ctx->knob_state;             // ctx + 1664
    query_func qf = kv->vtable[19];                // vtable + 152

    if (qf == sub_67EB60) {                         // fast-path: known vtable
        check_func cf = kv->vtable[9];              // vtable + 72
        bool knob_499;
        if (cf == sub_6614A0)
            knob_499 = (kv->state[35928] != 0);     // direct field read
        else
            knob_499 = cf(ctx->knob_state, 499);     // indirect query
        if (!knob_499)
            return ctx->opt_level;                   // ctx + 2104
        byte* state = kv->state;
        int iteration = state[35940];
        if (state[35936] > iteration) {
            state[35940] = iteration + 1;            // increment pass counter
            return ctx->opt_level;
        }
    } else if (qf(ctx->knob_state, 499, 1)) {
        return ctx->opt_level;
    }
    return 1;                                        // fallback: treat as O1
}

When knob 499 is disabled (or its iteration budget is exhausted), sub_7DDB50 returns 1 regardless of the actual opt-level. This provides a master kill-switch: setting knob 499 to false effectively caps all opt-level-gated behavior at O1.

Phase Activation Table

The following table lists every phase where the optimization level has been confirmed to affect behavior, based on decompiled isNoOp() methods and execute-function guard checks.

Threshold notation: > N means the phase requires opt_level > N (i.e., level N+1 and above).

Phase  Name                          Threshold  Effect at threshold                       Source
─────  ────────────────────────────  ─────────  ────────────────────────────────────────  ────────────────────────────
14     DoSwitchOptFirst              > 0        Branch/switch optimization enabled        isNoOp returns true at O0
15     OriBranchOpt                  > 0        Branch folding enabled                    isNoOp returns true at O0
18     OriLoopSimplification         4--5       Aggressive loop peeling enabled at O4+    sub_78B430 checks opt_level
22     OriLoopUnrolling              > 1        Loop unrolling requires at least O2       Execute guard via sub_7DDB50
24     OriPipelining                 > 1        Software pipelining requires at least O2  Execute guard
26     OriRemoveRedundantBarriers    > 1        Barrier optimization at O2+               Gating: opt_level > 1
28     SinkRemat                     > 1 / > 4  O2+: basic path; O5: full cutlass mode    Two-tier guard in sub_913A30
30     DoSwitchOptSecond             > 0        Second switch pass at O1+                 isNoOp returns true at O0
38     OptimizeNestedCondBranches    > 0        Nested branch simplification at O1+       isNoOp returns true at O0
49     GvnCse                        > 1        GVN-CSE requires at least O2              Execute guard
54     OriDoRematEarly               > 1        Early rematerialization at O2+            sub_7DDB50 check
63     OriDoPredication              > 1        If-conversion at O2+                      Execute guard
69     OriDoRemat                    > 1        Late rematerialization at O2+             sub_7DDB50 check
71     OptimizeSyncInstructions      > 1        Sync optimization at O2+                  Gating: opt_level > 1
72     LateExpandSyncInstructions    > 2        Late sync expansion at O3+                Gating: opt_level > 2
95     SetAfterLegalization          > 1        Post-legalization flag at O2+             sub_7DDB50 check
99     OriDoSyncronization           > 1        Sync insertion at O2+                     Gating: opt_level > 1
100    ApplyPostSyncronizationWars   > 1        WAR fixup at O2+                          Gating: opt_level > 1
110    PostSchedule                  > 0        Full post-schedule at O1+                 Mode selection
115    AdvancedScoreboardsAndOpexes  > 0        Full scoreboard generation at O1+         Hook activated at O1+
116    ProcessO0WaitsAndSBs          == 0       Conservative scoreboards at O0 only       Active only at O0

O-Level Feature Matrix

Feature                           O0        O1     O2    O3    O4          O5
────────────────────────────────  ────────  ─────  ────  ────  ──────────  ──────────
Basic block merging               off       on     on    on    on          on
Branch/switch optimization        off       on     on    on    on          on
Copy propagation + const folding  off       on     on    on    on          on
Dead code elimination             partial   on     on    on    on          on
Loop canonicalization             basic     basic  full  full  aggressive  aggressive
Loop unrolling                    off       off    on    on    on          on
Software pipelining               off       off    on    on    on          on
Strength reduction                off       on     on    on    on          on
GVN-CSE                           off       off    on    on    on          on
Predication (if-conversion)       off       off    on    on    on          on
Rematerialization (early)         off       off    on    on    on          on
Rematerialization (late)          off       off    on    on    on          on
SinkRemat (full)                  off       off    off   off   off         on
Cutlass iterative remat           off       off    off   off   off         on
Loop peeling (aggressive)         off       off    off   off   on          on
Barrier optimization              off       off    on    on    on          on
Sync instruction optimization     off       off    on    on    on          on
Late sync expansion               off       off    off   on    on          on
Post-legalization mark            off       off    on    on    on          on
Allow expensive optimizations     off       off    on    on    on          on
Speculative hoisting              off       off    on    on    on          on
Hot/cold partitioning             off       on     on    on    on          on
Full scoreboard analysis (115)    off       on     on    on    on          on
Conservative scoreboards (116)    on        off    off   off   off         off
Stack-pointer bounds check        auto-on   off    off   off   off         off
Cloning                           disabled  on     on    on    on          on

Notes:

  • "partial" DCE at O0: EarlyOriSimpleLiveDead (phase 10) still runs for basic cleanup even at O0.
  • O4 and O5 are not documented in --help but are accepted internally. O4 is equivalent to O3 plus aggressive loop peeling. O5 adds the full SinkRemat pass with cutlass iteration support.

Scoreboard Path Selection

The scoreboard generation subsystem has two mutually exclusive paths, selected by optimization level:

O0: Conservative Path (Phase 116)

Phase 116 (ProcessO0WaitsAndSBs) inserts maximum-safety scheduling metadata:

For every instruction:
    stall_count = 15            // maximum stall (15 cycles)
    wait_mask   = 0x3F          // wait on all 6 barriers
    write_barrier = 7           // no barrier assignment (7 = none)
    read_mask   = 0             // no read barriers
    yield       = 1             // yield after every instruction

This eliminates all instruction-level parallelism. Every instruction waits the maximum time and clears all dependency barriers before executing. The result is correct but extremely slow code -- suitable only for debugging.
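The O0 fill loop can be sketched with a simplified control-word struct. The field and function names are hypothetical; the constant values (15, 0x3F, 7, 0, 1) come from the listing above.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified per-instruction scheduling control word. */
typedef struct {
    uint8_t stall_count;    /* cycles to stall before issue            */
    uint8_t wait_mask;      /* dependency barriers to wait on          */
    uint8_t write_barrier;  /* barrier written by this inst (7 = none) */
    uint8_t read_mask;      /* read barriers                           */
    uint8_t yield;          /* yield hint                              */
} sched_ctrl;

/* Phase 116 behavior: maximum-safety metadata for every instruction. */
static void apply_o0_scoreboards(sched_ctrl *insts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        insts[i].stall_count   = 15;    /* maximum stall           */
        insts[i].wait_mask     = 0x3F;  /* wait on all 6 barriers  */
        insts[i].write_barrier = 7;     /* no barrier assignment   */
        insts[i].read_mask     = 0;     /* no read barriers        */
        insts[i].yield         = 1;     /* yield after every inst  */
    }
}
```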

O1+: Full Analysis Path (Phase 115)

Phase 115 (AdvancedScoreboardsAndOpexes) runs the complete dependency analysis pipeline:

  1. sub_A36360 (52 KB) -- Master control word generator with per-opcode dispatch
  2. sub_A23CF0 (54 KB) -- DAG list scheduler heuristic for barrier assignment
  3. Per-field encoders for stall, yield, barrier, and scoreboard dependency fields

The full path computes precise stall counts based on actual instruction latencies from the hardware profile, assigns the minimum necessary dependency barriers (6 available per SM), and inserts yield hints only where the warp scheduler benefits from switching.

Scheduling Direction

The scheduling infrastructure (sub_8D0640) selects scheduling direction based on opt-level:

Condition       Direction     Strategy
──────────────  ────────────  ─────────────────────────────────────
opt_level <= 2  Forward-pass  Register-pressure-reducing: prioritizes freeing registers
opt_level > 2   Reverse-pass  Latency-hiding: prioritizes ILP and memory latency overlap

At the default O3, the scheduler uses the reverse-pass strategy, which hides memory latencies at the cost of potentially higher register pressure. At O1--O2, the forward-pass strategy minimizes peak register usage.

The direction selection happens in PreScheduleSetup (sub_8CBAD0), called from the scheduling orchestrator with the boolean opt_level > 2:

PreScheduleSetup(sched, opt_level > 2);    // sub_8CBAD0
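The selection boils down to the boolean passed in. A minimal sketch, with a hypothetical enum and function name standing in for the real PreScheduleSetup plumbing:

```c
typedef enum { SCHED_FORWARD, SCHED_REVERSE } sched_dir;

/* O1-O2: forward pass, minimize peak register pressure.
 * O3+:   reverse pass, maximize ILP / memory latency hiding. */
static sched_dir select_direction(int opt_level)
{
    return opt_level > 2 ? SCHED_REVERSE : SCHED_FORWARD;
}
```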

Additionally, the ScheduleInstructionsReduceReg phase (mode 0x39) is enabled by default at O3 and above, providing a dedicated register-pressure-reduction scheduling pass before the ILP pass.

Register Allocation Differences

The register allocator itself (fat-point greedy at sub_957160) does not directly branch on the optimization level. However, the opt-level affects register allocation indirectly through:

  1. --register-usage-level (offset +160, range 0--10, default 5): At O0 with -g, this is forced to 5 regardless of user setting. The value modulates the per-class register budget at alloc + 32*class + 884.

  2. allow-expensive-optimizations (offset +408): Defaults to true when opt_level > 1. When true, the allocator and related passes are permitted to spend more compile time on better solutions (e.g., more spill-retry iterations, more aggressive coalescing).

  3. Phase gating: At O0, passes that reduce register pressure (rematerialization, predication, loop optimizations) are disabled, so the allocator receives un-optimized IR with higher register demand. This typically results in more spills at O0.

NvOpt Recipe System

The NvOpt recipe system (Phase 1: ApplyNvOptRecipes, option 391) provides an additional optimization-level axis. When enabled, the PhaseManager allocates a 440-byte NvOptRecipe sub-manager that configures per-phase aggressiveness:

NvOpt level  Behavior
───────────  ─────────────────────────────────────
0            Minimal optimization (fast-compile path, many phases set to isNoOp())
1--2         Standard optimization
3--4         Aggressive optimization (loop unrolling, speculative hoisting enabled)
5            Maximum optimization (may significantly increase compile time)

The NvOpt level is validated in sub_C173E0 via the string "Invalid nvopt level : %d.", confirming the range 0--5. Recipe data lives at NvOptRecipe+312 with per-phase records at stride 584 bytes.

The NvOpt level is distinct from the -O CLI level. The -O level controls which phases run at all (via isNoOp() and sub_7DDB50 guards); the NvOpt level controls how aggressively the phases that do run behave (via recipe parameters).

Knob Defaults That Change Per Level

Several knobs have default values that vary by optimization level. The most significant:

Knob  O0 Default  O1 Default  O2+ Default  Effect
────  ──────────  ──────────  ───────────  ─────────────────────────────────────
487   enabled     enabled     enabled      Master optimization enable (checked by many passes)
499   (varies)    (varies)    (varies)     Guard knob for sub_7DDB50 opt-level accessor
595   true        true        true         Scheduling enable (but O0 uses conservative path)
419   --          --          --           Forward scheduling mode (bit 3 in scheduler flags)

Knob 487 is the most pervasive: it is checked by loop simplification, barrier optimization, sync optimization, predication, rematerialization, and scheduling passes. Disabling it overrides the opt-level and turns off the corresponding pass regardless of -O setting.

Key Decompiled Evidence

Options block opt_level field (offset +148)

// sub_434320 line 216: parse opt-level from CLI
sub_1C96470(v10, "opt-level", a3 + 148, 4);

allow-expensive-optimizations defaults to (opt_level > 1)

// sub_434320 line 768
if (!user_set_allow_expensive)
    *(a3 + 408) = *(a3 + 148) > 1;

O0 forces sp-bounds-check and disables cloning

// sub_434320 lines 773-776
if (opt_level == 0) {
    *(a3 + 292) = 1;       // sp_bounds_check = true
    *(a3 + 235) = 0;       // cloning_disabled = false (misleading name: 0 = no cloning)
}

Debug mode forces O0

// sub_431A40 line 33
*(a3 + 148) = 0;           // opt_level = 0

Ofast-compile=max forces O0

// sub_434320 line 646
*(a3 + 148) = 0;           // opt_level = 0 for Ofast=max

Ofast-compile=mid/min forces O1

// sub_434320 lines 659, 674
*(a3 + 148) = 1;           // opt_level = 1 for Ofast=mid and min
*(a3 + 572) = 1;           // fast_compile_mode = 1

Scoreboard mutually exclusive paths

// Phase 115 (AdvancedScoreboardsAndOpexes): isNoOp() returns true at O0
// Phase 116 (ProcessO0WaitsAndSBs): isNoOp() returns true at O1+

SinkRemat two-tier gating

// sub_913A30 (SinkRemat core, phase 28)
if (getOptLevel(ctx) <= 1) return;      // O0/O1: skip entirely
if (getOptLevel(ctx) <= 4) return;      // O2-O4: skip full sink+remat
// O5: proceed to cutlass iterative mode

Sync barrier gating

// Phase 99 (OriDoSyncronization), Phase 71 (OptimizeSyncInstructions):
call   sub_7DDB50              ; get opt_level
cmp    eax, 1
jle    return                  ; skip if opt_level <= 1

// Phase 72 (LateExpandSyncInstructions):
// Requires opt_level > 2

Cross-References

Key Functions

Address      Size   Conf  Role
───────────  ─────  ────  ─────────────────────────────────────
sub_434320   --     0.95  CLI option parser; parses --opt-level at line 216, handles --Ofast-compile at lines 635--679, sets allow-expensive-optimizations default at line 768
sub_431A40   --     0.95  Debug mode override; forces opt-level to 0, disables cloning, resets register-usage-level when -g is active
sub_7DDB50   232 B  0.95  Opt-level accessor; returns ctx+2104 opt-level if knob 499 is enabled, otherwise returns 1 (O1 fallback); called by 20+ passes as the runtime opt-level gate
sub_1C96470  --     0.85  Generic CLI argument reader; called by sub_434320 to read --opt-level into options block offset +148
sub_67EB60   --     0.80  Fast-path knob query vtable function; identified inside sub_7DDB50 for knob 499 check
sub_6614A0   --     0.80  Knob state direct-read function; used by sub_7DDB50 to read knob 499 via direct field access at offset 35928
sub_78B430   --     0.85  OriLoopSimplification execute function; checks opt_level for aggressive loop peeling (O4+)
sub_913A30   --     0.90  SinkRemat core (phase 28); two-tier opt-level guard: skips at O0--O1, limited at O2--O4, full cutlass mode at O5
sub_8D0640   --     0.85  Scheduling infrastructure; selects forward (O1--O2) vs reverse (O3+) scheduling direction
sub_8CBAD0   --     0.85  PreScheduleSetup; called with opt_level > 2 boolean to configure scheduling direction
sub_957160   --     0.90  Fat-point greedy register allocator; does not directly branch on opt-level but is affected indirectly
sub_A36360   52 KB  0.90  Master scoreboard control word generator (phase 115, O1+ path)
sub_A23CF0   54 KB  0.90  DAG list scheduler heuristic for barrier assignment (phase 115, O1+ path)
sub_C173E0   --     0.85  NvOpt level validator; emits "Invalid nvopt level : %d." for out-of-range values

DUMPIR & NamedPhases

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The DUMPIR knob and NamedPhases option are the two primary mechanisms for inspecting ptxas's internal IR at arbitrary points in the 159-phase optimization pipeline. DUMPIR is an OCG string knob that triggers an IR dump after a named phase completes. NamedPhases is a separate OCG string knob (index 298) that restricts the pipeline to execute only the specified phases, effectively allowing selective phase execution and reordering. Both knobs accept phase names resolved through a case-insensitive binary search over a sorted table of 144 phase names (sub_C641D0, 305 bytes).

DUMPIR knobOCG string knob (ROT13: QhzcVE), registered in ctor_005 at 0x412B80
NamedPhases knobOCG knob index 298, runtime offset 21456 in knob value array
Phase name lookupsub_C641D0 (305 bytes, case-insensitive binary search)
Table sortsub_C63FA0 (on-demand iterative quicksort via sub_C639A0)
Name table144 entries at off_22BD0C0 + 5 arch-specific additions
NamedPhases parsersub_798B60 (1,776 bytes)
Phase fragment parsersub_798280 (900 bytes)
Report passesPhases 9, 96, 102, 126, 129, 130
Sentinel return158 (NOP phase, returned on lookup failure)

DUMPIR Knob

The DUMPIR knob is a string-valued OCG knob that takes one or more phase names. When set, the compiler dumps the Ori IR state after the named phase executes. This is the primary IR inspection mechanism for NVIDIA developers debugging the optimization pipeline.

Usage

ptxas -knob DUMPIR=AllocateRegisters input.ptx -o output.cubin

The knob value is a phase name string. The name is resolved through the phase name lookup function (sub_C641D0) using case-insensitive comparison, so allocateregisters, ALLOCATEREGISTERS, and AllocateRegisters all match.
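The lookup scheme can be sketched as follows. The table here is a tiny illustrative subset, not the real 144-entry table, and the helper names are hypothetical; the case-insensitive comparison, binary search, and the 158 sentinel come from the source.

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive string comparison, as used for phase-name matching. */
static int ci_cmp(const char *a, const char *b)
{
    while (*a && *b) {
        int d = tolower((unsigned char)*a) - tolower((unsigned char)*b);
        if (d) return d;
        a++; b++;
    }
    return tolower((unsigned char)*a) - tolower((unsigned char)*b);
}

/* Illustrative subset, sorted case-insensitively. */
static const char *phase_names[] = {
    "AllocateRegisters", "OriBranchOpt", "OriDoRemat", "SinkRemat",
};

/* Binary search over the sorted name table; 158 = NOP sentinel. */
static int lookup_phase(const char *name)
{
    int lo = 0, hi = (int)(sizeof phase_names / sizeof phase_names[0]) - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int d = ci_cmp(name, phase_names[mid]);
        if (d == 0) return mid;
        if (d < 0) hi = mid - 1; else lo = mid + 1;
    }
    return 158;  /* lookup failure: NOP phase */
}
```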

The DUMPIR knob exists in two instantiations:

  • OCG instance (ROT13: QhzcVE at 0x21BDBAD): registered in ctor_005 at 0x412B80. This is the primary instance for the optimization pipeline.
  • DAG instance (ROT13: QhzcVE at 0x21DCC95): registered in ctor_007 at 0x421920. This controls IR dumps in the Mercury SASS/DAG pipeline.

Diagnostic Reference

The DUMPIR knob is referenced in register allocation error diagnostics. When a register allocation verification failure occurs, sub_A55D80 and sub_A76030 emit:

Please use -knob DUMPIR=AllocateRegisters for debugging

This tells the developer to re-run with the DUMPIR knob set to AllocateRegisters to inspect the IR state entering register allocation, which helps diagnose mismatches between pre- and post-allocation reaching definitions.

DUMPIR is part of a family of 17 dump-related OCG knobs across two constructor registrations. The OCG pipeline registers 11 dump knobs in ctor_005 (0x412A40--0x412D60); the Mercury/DAG pipeline registers 6 in ctor_007 (0x421880--0x421A10). All knob names and their definition-table offsets are ROT13-encoded in the binary (e.g. 0k14q0 decodes to 0x14D0).

OCG Pipeline Dump Knobs (ctor_005)

| Knob Name | ROT13 | Reg Address | Def Offset | Purpose |
|---|---|---|---|---|
| DumpCallGraph | QhzcPnyyTencu | 0x412A40 | 0x1490 | Dump the inter-procedural call graph |
| DumpCFG | QhzcPST | 0x412A90 | 0x14A0 | Dump the control flow graph |
| DumpFlow | QhzcSybj | 0x412AE0 | 0x14B0 | Dump data flow information (reaching defs, live sets) |
| DumpInstPhase | QhzcVafgCunfr | 0x412B30 | 0x14C0 | Dump per-instruction phase annotations |
| DumpIR | QhzcVE | 0x412B80 | 0x14D0 | Dump the Ori IR after a named phase |
| DumpIRInfoAsInteger | QhzcVEVasbNfVagrtre | 0x412BD0 | 0x14E0 | Dump IR with integer-format operand info |
| DumpKnobs | QhzcXabof | 0x412C20 | 0x14F0 | Dump all knob values to stderr |
| DumpPerfMetricsForBlock | QhzcCresZrgevpfSbeOybpx | 0x412C70 | 0x1500 | Dump per-basic-block performance metrics |
| DumpPerfStats | QhzcCresFgngf | 0x412CC0 | 0x1510 | Dump performance statistics |
| DumpSASS | QhzcFNFF | 0x412D10 | 0x1520 | Dump generated SASS assembly |
| DumpSBInstInfo | QhzcFOVafgVasb | 0x412D60 | 0x1530 | Dump scoreboard per-instruction info |

The "Def Offset" column is the byte offset into the 16-byte-stride knob definition table. Dividing by 16 gives the definition-table index: DumpCallGraph is index 329, DumpSBInstInfo is index 339. These indices are distinct from the 72-byte runtime knob slot indices used by GetKnobIntValue.

Adjacent knobs in ctor_005 (for boundary context):

  • 0x4129F0: DoYieldInsertionWAR_SW2491854 (offset 0x1480) -- immediately before DumpCallGraph
  • 0x412DB0: EmitLDCU (offset 0x1540) -- immediately after DumpSBInstInfo

Mercury/DAG Pipeline Dump Knobs (ctor_007)

| Knob Name | ROT13 | Reg Address | Purpose |
|---|---|---|---|
| DumpAnnot | QhzcNaabg | 0x421880 | Dump instruction annotations |
| DumpCFG | QhzcPST | 0x4218D0 | Dump DAG pipeline CFG |
| DumpIR | QhzcVE | 0x421920 | Dump DAG pipeline IR |
| DumpMercOpCounts | QhzcZrepBcPbhagf | 0x421970 | Dump Mercury opcode distribution |
| DumpReconstitutedBinary | QhzcErpbafgvghgrqOvanel | 0x4219C0 | Dump reconstituted binary output |
| DumpRPO | QhzcECB | 0x421A10 | Dump reverse post-order traversal |

DumpCFG and DumpIR are registered in both pipelines. Each pipeline carries its own knob instance with a distinct ROT13 string address (0x21BDBF0 vs 0x21DCCA0 for DumpCFG), so setting one does not affect the other.

NamedPhases Knob

The NamedPhases knob (OCG index 298) provides a mechanism to restrict the optimization pipeline to execute only specific phases. Unlike DUMPIR which passively observes, NamedPhases actively controls which phases run.

Knob Location

NamedPhases is at OCG knob index 298. The runtime byte offset is 298 * 72 = 21456 from the knob state base. This is confirmed by the decompiled code in sub_798B60:

// sub_798B60 (NamedPhases parser)
v11 = *(ctx + 72);                    // knob state base pointer
v12 = *(byte*)(v11 + 21456);          // type tag at knob index 298
if (!v12) return 0;                   // knob not set => no filtering
if (v12 == 5)                         // type 5 = string
    v14 = *(ptr*)(v11 + 21464);       // string value at +8 from type tag

Parser -- sub_798B60

The NamedPhases parser (sub_798B60, 1,776 bytes) reads the knob value string and parses it into parallel arrays of up to 256 entries. It is called from two sites:

  1. OCG pipeline (sub_798B60 direct): parses the NamedPhases string from OCG knob index 298, referenced at address 0x798E90 where the string "NamedPhases" (0x21B64C8) appears in an error/diagnostic message.
  2. Mercury pipeline (sub_9F4040): the Mercury encoder's phase reordering mechanism also references the "NamedPhases" string at 0x9F42B0, using the same knob to control Mercury-side phase execution.

The parser operates as follows:

  1. Reads knob value at offset 21456 from the knob state
  2. If the knob is unset (type byte == 0), returns immediately (no filtering)
  3. If the knob is a string (type byte == 5), extracts the string pointer
  4. Copies the string into a pool-allocated buffer
  5. Tokenizes using strtok_r with comma (,) as delimiter
  6. For each token, calls sub_798280 (ParsePhaseNameFragment) to split the phase name from optional parameters
  7. Stores results in parallel arrays: names[], values[], full_strings[] (max 256 entries)

Phase Name Fragment Parser -- sub_798280

Each comma-separated token in the NamedPhases string is parsed by sub_798280 into two components:

  • Phase name: characters up to the first , separator, uppercased during parsing
  • Parameter suffix: characters after , up to the next + delimiter or end-of-string

The + character acts as an entry separator (analogous to how the DisablePhases string uses + to delimit multiple phase names). This allows:

-knob NamedPhases=PhaseA,param1+PhaseB,param2+PhaseC

Mercury NamedPhases -- sub_9F4040

The Mercury encoder pipeline (sub_9F4040, 1,850 lines decompiled) uses the NamedPhases knob to support phase reordering within the Mercury backend. In addition to standard pipeline phase names, it recognizes Mercury-specific pseudo-phases:

| Name | Decompiled Line | Match Method | Purpose |
|---|---|---|---|
| shuffle | 843 | strlen + byte compare (8 chars) | Mercury instruction shuffle pass |
| swap1 | 950 | strlen + byte compare (6 chars) | Mercury register swap level 1 |
| swap2 | 1007 | strlen + byte compare (6 chars) | Mercury register swap level 2 |
| swap3 | 1061 | strlen + byte compare (6 chars) | Mercury register swap level 3 |
| swap4 | 1119 | strlen + byte compare (6 chars) | Mercury register swap level 4 |
| swap5 | 1162 | strlen + byte compare (6 chars) | Mercury register swap level 5 |
| swap6 | 1202 | strcmp() | Mercury register swap level 6 |
| OriPerformLiveDead | 1556 | sub_C641D0() lookup | Liveness analysis within Mercury context |
| OriCopyProp | 1648 | sub_C641D0() lookup | Copy propagation within Mercury context |

shuffle and swap1--swap6 are pure Mercury pseudo-phases: they do not exist in the main 144-entry phase name table at off_22BD0C0. Their name matching is done inline with strlen-guarded character comparison (not strcmp -- except swap6 which uses a full strcmp call, likely because it is the last in a fallthrough chain).

OriPerformLiveDead and OriCopyProp resolve through sub_C641D0 (the standard binary search), meaning they ARE in the main phase table. They are special in that Mercury conditionally inserts them into its own phase sequence rather than inheriting them from the standard pipeline ordering. The insertion is guarded by state flags (v234, v252, v240 for OriPerformLiveDead; v222, v236, v257 for OriCopyProp), suggesting they are injected only when the Mercury encoder detects certain register-pressure or correctness conditions.

Phase Name Lookup -- sub_C641D0

The binary search function sub_C641D0 (305 bytes) resolves a phase name string to a phase index. It is the core name resolution used by both DUMPIR and NamedPhases.

Algorithm

int PhaseManager::lookup_phase(const char* query) {
    ensure_sorted();                          // sub_C63FA0

    // Binary search over sorted {name_ptr, index} pairs
    // Each entry is 16 bytes: [8-byte name pointer, 4-byte phase index, 4-byte padding]
    int lo = 0, hi = sorted_count;
    while (hi > 0) {
        int mid = hi / 2;
        // Case-insensitive string comparison via tolower()
        int cmp = strcasecmp(table[lo + mid].name, query);
        if (cmp < 0) {
            hi -= mid + 1;
            lo += mid + 1;
        } else if (cmp == 0) {
            return table[lo + mid].index;     // found
        } else {
            hi = mid;
        }
    }

    // Verify final position (handles edge case)
    if (lo < sorted_count && strcasecmp(table[lo].name, query) == 0)
        return table[lo].index;

    return 158;                               // sentinel: NOP phase
}

The comparison uses tolower() on each character individually, making the search fully case-insensitive. On lookup failure, the function returns 158 (the sentinel NOP phase), not an error code. This means misspelled phase names silently resolve to a no-op rather than producing an error.

Sorted Table Construction -- sub_C63FA0

The sorted name table is lazily constructed. sub_C63FA0 checks whether the current sorted count matches the expected count (stored at PhaseManager+104). If they differ, it:

  1. Grows the sorted table array if needed (1.5x growth policy)
  2. Copies name pointers from the raw phase name table (off_22BD0C0)
  3. Each entry is 16 bytes: {char* name, int phase_index}, where phase_index is the array position
  4. Sorts using iterative quicksort (sub_C639A0) with median-of-three pivot selection

The sort is performed once and cached. Subsequent lookups reuse the sorted table without re-sorting.

Report Passes

Six phases in the pipeline are dedicated diagnostic/dump passes. They are no-ops by default and activate only when specific debug options are enabled:

| Phase | Name | Trigger | Output |
|---|---|---|---|
| 9 | ReportInitialRepresentation | DUMPIR knob, --keep | Ori IR after initial lowering (pre-optimization) |
| 96 | ReportBeforeScheduling | DUMPIR knob, --keep | Ori IR entering scheduling/RA stage |
| 102 | ReportAfterRegisterAllocation | DUMPIR knob, --keep | Ori IR after register allocation |
| 126 | ReportFinalMemoryUsage | --stat=phase-wise | Memory pool consumption summary |
| 129 | DumpNVuCodeText | --keep, DUMPIR | SASS text disassembly (cuobjdump-style) |
| 130 | DumpNVuCodeHex | --keep, DUMPIR | Raw SASS hex dump |

Additionally, ReportBeforeRegisterAllocation (at 0x22BD068) is a phase name in the table but is handled as an arch-specific phase (index >= 139), providing an IR dump point immediately before register allocation in backends that override it.

Report Pass Activation

Report passes check their activation condition in the isNoOp() virtual method. When the DUMPIR knob is set to a phase name, the report pass compares the current phase name against the DUMPIR value. If they match, isNoOp() returns false and the pass executes its dump logic.

The dispatch loop in sub_C64F70 constructs diagnostic context strings around each phase execution:

// Before execution (line 117 of sub_C64F70):
*(_QWORD *)buffer = 0x2065726F666542LL;    // "Before " as 8-byte LE literal
memcpy(buffer + 7, phase_name, len + 1);   // append phase name after "Before "

// After execution (line 196 of sub_C64F70):
strcpy(buffer, "After ");                  // 6-byte prefix
memcpy(buffer + 6, phase_name, len + 1);   // append phase name after "After "

The literal 0x2065726F666542 decomposes as bytes 42 65 66 6F 72 65 20 = ASCII "Before " (7 bytes including trailing space, plus a null in the 8th byte that gets overwritten by memcpy). The "After " path uses strcpy instead of a literal store because it is only 6 bytes and the code path is post-execution (not latency-critical).

These strings appear in diagnostic output when --stat=phase-wise is enabled:

Before GeneralOptimize  ::  [Total 1234 KB]  [Freeable 567 KB]  [Freeable Leaked 12 KB] (2%)
After GeneralOptimize   ::  [Total 1456 KB]  [Freeable 789 KB]  [Freeable Leaked 23 KB] (3%)

The string addresses in the binary are:

  • "Before " at 0x22BC3D3
  • "After " at 0x22BC3DB
  • " :: " at 0x22BC3E2 (separator between phase name and stats)
  • "[Total " at 0x22BC3E9
  • "[Freeable " at 0x22BC3F6
  • "[Freeable Leaked " at 0x22BC401
  • "All Phases Summary" at 0x22BC416 (final summary label)

Phase-Wise Statistics -- --stat=phase-wise

The --stat CLI option (processed in sub_432A00 at 0x432E5A) accepts a comma-separated list of report modes:

ptxas --stat=phase-wise input.ptx -o output.cubin

| Mode | Short | Description |
|---|---|---|
| time | t | Print compilation time |
| memory | m | Print peak memory usage |
| phase-wise | p | Print per-phase time and memory delta |
| detailed | d | Print all of the above |

When phase-wise is enabled (string comparison at 0x4460F8 in sub_445EB0), the dispatch loop's timing flag (PhaseManager+72) is set, and sub_C64310 runs after every phase to print memory deltas.

IR Output Format

The DUMPIR dump emits a per-function statistics header (using # comment prefix) followed by the Ori IR listing. The statistics header is emitted by sub_A3A7E0 and contains hardware performance estimates computed from the current IR state.

Per-Function Statistics Header

# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [FP16 inst=0] [FP16 VectInst=0] [Percentage Vectorized=0.00]
# [est latency = 87] [LSpillB=0] [LRefillB=0], [SSpillB=0], [SRefillB=0], [LowLmemSpillSize=0] [FrameLmemSpillSize=0]
# [LNonSpillB=0] [LNonRefillB=0], [NonSpillSize=0]
# [Occupancy = 0.750000], [est numDivergentBranches=2] [attributeMemUsage=0], [programSize=1024]
# [est fp=12] [est half=0], [est trancedental=0], [est ipa=0], [est shared=0], [est controlFlow=8], [est loadStore=24]
# [est tex=0] [est pairs=4]
# [issue thru=0.888889] [fp thru=0.111111] [half thru=0.000000], [trancedental thru=0.000000], [ipa thru=0.000000]
# [shared thru=0.000000] [controlFlow thru=0.062500] [texLoadStore thru=0.187500], [reg thru=0.000000], [warp thru=0.000000]
# [partially unrolled loops=0] [non-unrolled loops=1]
# [CB-Bound Tex=0] [UR-Bound Tex=0] [Bindless Tex=0] [Partially Bound Tex=0]
# [UDP inst=0] [numVecToURConverts inst=0]
# [maxNumLiveValuesAtSuspend=0]
# [instHint=142] [instPairs=4]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]
# [SharedMem Alloc thru=0.000000]
# [Precise inst=0]

The format strings are at two locations in rodata:

| Address Range | Context | Notes |
|---|---|---|
| 0x21EBF76--0x21EC3B0 | Pre-register-allocation stats | Commas between some fields: [SSpillB=%d], [SRefillB=%d] |
| 0x21FA008--0x21FA0A0 | Post-register-allocation stats | No commas: [SSpillB=%d] [SRefillB=%d] |

The [Occupancy] line's typo "trancedental" (missing 's') is present in the binary itself and matches NVIDIA's original source.

Statistics Field Glossary

| Field | Meaning |
|---|---|
| inst | Total instruction count |
| texInst | Texture/surface instruction count |
| tepid | Texture instruction count (alternate metric) |
| rregs | R-register (GPR) count |
| LSpillB / LRefillB | Local-memory spill/refill byte counts |
| SSpillB / SRefillB | Shared-memory spill/refill byte counts |
| LowLmemSpillSize | Low local-memory spill total |
| FrameLmemSpillSize | Frame-level local-memory spill total |
| LNonSpillB / LNonRefillB | Local-memory non-spill traffic bytes |
| Occupancy | Estimated warp occupancy (0.0--1.0) |
| numDivergentBranches | Estimated divergent branch count |
| attributeMemUsage | Attribute memory usage (shader inputs) |
| programSize | Total program size in bytes |
| issue thru | Issue throughput (instructions per cycle) |
| fp thru / half thru | FP32 / FP16 throughput |
| trancedental thru | Transcendental (SFU) throughput |
| ipa thru | Interpolation throughput |
| shared thru | Shared memory throughput |
| texLoadStore thru | Texture + load/store throughput |
| reg thru | Register throughput |
| warp thru | Warp-level throughput |
| CB-Bound Tex | Constant-bank-bound texture references |
| UR-Bound Tex | Uniform-register-bound texture references |
| Bindless Tex | Bindless texture references |
| UDP inst | Uniform datapath instruction count |
| numVecToURConverts | Vector-to-uniform-register conversion count |
| maxNumLiveValuesAtSuspend | Peak live values at suspension point |
| instHint / instPairs | Instruction hint count / instruction pair count |
| worstcaseLat / avgcaseLat | Worst-case / average-case latency estimates |
| SharedMem Alloc thru | Shared memory allocation throughput |
| Precise inst | Precise (non-relaxed) instruction count |

Mercury Pipeline Dump Points

The Mercury/DAG pipeline emits its own "After" labels at fixed points in the encode-decode-expand flow. These labels are used by both the DumpIR (DAG) knob and the DumpAnnot knob:

| Label | String Address | Pipeline Stage |
|---|---|---|
| After Decode | 0x202D5CE | After initial SASS decode |
| After Expansion | 0x202D5DB | After instruction expansion |
| After WAR post-expansion | 0x202D604 | After WAR insertion (post-expansion) |
| After Opex | 0x202D60F | After operand expansion |
| After WAR post-opexing | 0x202DCD0 | After WAR insertion (post-opex) |
| After MercWARs | 0x202DCDF | After Mercury WAR pass |
| After MercOpex | 0x21E5C33 | After Mercury operand expansion |
| After MercConverter | 0x22B7B38 | After Mercury format conversion |
| After MercExpand | 0x22BC3DB | After Mercury instruction expansion |
| After EncodeAndDecode | 0x23D1A60 | After encode-decode round-trip |

Memory Statistics Format

The sub_C64310 (ReportPhaseStats) function formats memory sizes using three thresholds:

| Size Range | Format | Example |
|---|---|---|
| < 1024 bytes | %d (raw integer) | 512 |
| < 10 MB | %.3lf KB (kilobytes, 3 decimals) | 1234.567 KB |
| >= 10 MB | %.3lf MB (megabytes, 3 decimals) | 12.345 MB |

The memory format reuses the suffix from "PeakMemoryUsage = %.3lf KB" (at 0x1CE7BB6) by referencing the string at offset +24 to extract just " KB". The pool-consumption variant uses "[Pool Consumption = " at 0x22BC3B3.

Phase Name Table

The static phase name table at off_22BD0C0 contains 145 entries: 1 sentinel ("All Phases Summary") plus 144 phase names. After sorting by sub_C63FA0, the binary search in sub_C641D0 provides O(log n) lookup -- approximately 8 comparisons for 145 entries.

The 144 non-sentinel entries include:

  • 139 base pipeline phases (indices 0--138) with fixed names
  • 5 arch-specific phase aliases that map to indices >= 139:
    • LateEnforceArgumentRestrictions
    • UpdateAfterScheduleInstructions
    • UpdateAfterOriDoSyncronization
    • ReportBeforeRegisterAllocation
    • UpdateAfterOriAllocateRegisters

The AllocateRegisters string (0x21F0229) also appears as a phase name referenced by the register allocation subsystem (sub_A55D80, sub_A76030) and is present in the name table at 0x22BD490.

Interaction with --keep

The --keep flag triggers output file retention and activates certain report passes. When --keep is set:

  1. Phase 129 (DumpNVuCodeText) writes a human-readable SASS disassembly to a .sass file
  2. Phase 130 (DumpNVuCodeHex) writes raw SASS binary as hex
  3. Report phases 9, 96, and 102 may produce .ori intermediate representation dumps

The --keep flag is processed in the CLI option handler (sub_43CC70 at 0x43D850) which generates the .sass file extension.

Function Map

| Address | Size | Function | Confidence |
|---|---|---|---|
| sub_798280 | 900 | ParsePhaseNameFragment -- splits NAME,PARAM from NamedPhases token | MEDIUM |
| sub_798B60 | 1,776 | NamedPhases::ParsePhaseList -- tokenizes NamedPhases knob string | CERTAIN |
| sub_9F4040 | ~7,400 | MercuryNamedPhases -- Mercury pipeline phase selection/reordering | HIGH |
| sub_A3A7E0 | ~2,000 | CodeObject::EmitStats -- per-function statistics header printer | HIGH |
| sub_C639A0 | ~800 | QuicksortNameTable -- iterative quicksort for phase name table | MEDIUM |
| sub_C63FA0 | ~600 | EnsureSortedNameTable -- lazy sorted table construction | MEDIUM |
| sub_C641D0 | 305 | PhaseManager::LookupPhase -- case-insensitive binary search | CERTAIN |
| sub_C64310 | 3,168 | PhaseManager::ReportPhaseStats -- per-phase timing/memory reporter | HIGH |
| sub_C64F70 | 1,455 | PhaseManager::Dispatch -- main phase execution loop | CERTAIN |
| sub_A55D80 | ~2,000 | RegAlloc::VerifyReachingDefs -- references DUMPIR in error message | HIGH |
| sub_A76030 | ~1,000 | RegAlloc::VerifyMismatch -- references DUMPIR in error message | HIGH |

Reimplementation Notes

  1. DUMPIR is a string knob, not a boolean. The value is a phase name that triggers a dump after that specific phase. To dump at multiple points, run separate compilations with different DUMPIR values. There is no comma-separated multi-phase dump syntax for DUMPIR itself.

  2. NamedPhases uses comma+plus syntax. Commas separate name-from-parameter within a single entry; + separates multiple entries. The phase name portion is uppercased during parsing. Parameters are preserved as-is.

  3. Lookup failure is silent. An unrecognized phase name in DUMPIR or NamedPhases resolves to phase index 158 (NOP sentinel), not an error. The compiler does not warn about misspelled phase names.

  4. The sorted table is 16 bytes per entry: {char* name, int32 index, int32 padding}. The sort is stable only within the quicksort's three-way partitioning -- duplicate names (which do not occur in practice) would have undefined ordering.

  5. Two DumpIR knob instances exist (OCG and DAG). They are independent -- setting one does not affect the other. The OCG instance controls the 159-phase optimization pipeline; the DAG instance controls the Mercury SASS pipeline. Three knob names (DumpCFG, DumpIR, DumpAnnot/DumpRPO) have separate OCG and DAG instances with distinct ROT13 string addresses.

  6. Memory statistics format uses three thresholds: bytes (< 1 KB), kilobytes with 3 decimals (< 10 MB), megabytes with 3 decimals (>= 10 MB). The reporter is sub_C64310.

  7. NamedPhases in Mercury (sub_9F4040) supports 7 pure pseudo-phases (shuffle, swap1--swap6) that do not exist in the main phase table. These use inline strlen-guarded byte comparison, not strcmp (except swap6). Two additional names (OriPerformLiveDead, OriCopyProp) ARE in the main table but are conditionally injected into Mercury's phase sequence based on register-pressure/correctness state flags.

  8. The "Before" string is a raw 8-byte LE literal store, not a strcpy. The dispatch loop writes 0x2065726F666542 directly to the buffer, which is "Before " in ASCII. This is a micro-optimization for the hot path (pre-phase execution). The "After" path uses strcpy since it is post-execution.

  9. Statistics header has two variants. The pre-register-allocation format strings (at 0x21EC050) use commas between some spill fields: [SSpillB=%d], [SRefillB=%d]. The post-register-allocation variant (at 0x21FA008) drops those commas: [SSpillB=%d] [SRefillB=%d]. A reimplementation should match whichever variant is appropriate for the dump point.

  10. The "trancedental" typo is canonical. Both the format string and the stats output use "trancedental" (missing 's'). A reimplementation should preserve this spelling for compatibility with tools that parse the output.

Cross-References

  • Knobs System -- DUMPIR and NamedPhases are OCG knobs; ROT13 encoding, type system, access patterns
  • CLI Options -- --stat=phase-wise, --keep flags that activate report passes
  • Phase Manager -- dispatch loop, phase factory, name table infrastructure
  • Pass Inventory -- complete 159-phase table with report pass positions
  • Register Allocator -- DUMPIR=AllocateRegisters diagnostic reference
  • Mercury Encoder -- Mercury-side NamedPhases and DAG DumpIR knob

Memory Pool Allocator

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas replaces malloc/free with a custom hierarchical pool allocator for the vast majority of allocations. The allocator (sub_424070, 3,809 callers) is the single most-used allocation function in the binary. Every IR node, hash map, linked list, phase object, and temporary buffer flows through pools. The design serves two goals: fast allocation via size-class free lists, and per-compilation-unit lifetime management via hierarchical pool ownership.

| Role | Function |
|---|---|
| Allocator | sub_424070 (2,098 bytes, 3,809 callers) |
| Deallocator | sub_4248B0 (923 bytes, 1,215 callers) |
| Reallocator | sub_424C50 (488 bytes, 27 callers) |
| OOM handler | sub_42BDB0 (14 bytes, 3,825 callers) |
| TLS context | sub_4280C0 (597 bytes, 3,928 callers) |
| Stats header | sub_423A10 (323 bytes) -- prints "Memory space statistics for ..." banner |
| Stats detail | sub_425020 (~1,500 bytes) -- full per-pool metrics, recursive into children |
| Stats entry | sub_425AB0 (80 bytes) -- mutex-wrapped entry point for stats dump |
| OCG stats | sub_6936B0 (120 bytes) -- OCG mem space fixed-format stats to stderr |
| Pool teardown | sub_4234D0 (258 bytes) |
| Pool accounting | sub_423600 (922 bytes) |
| Slab registration | sub_423E50 (544 bytes) |
| Size-class index | sub_42BE50 (floor-log2, 64 bytes) |
| Slab growth | sub_423B60 / sub_423C70 |
| Global fallback | sub_427A10 (raw malloc wrapper) |
| System free | sub_427B30 (raw free wrapper) |
| Pool reporter | sub_C62200 (888 bytes) |
| Consumption query | sub_8DAE60 (32 bytes) |
| Snapshot | sub_8DADE0 (48 bytes) |

Pool Object Layout

The pool object is at least 7,136 bytes. It contains pool metadata at low offsets, large-block free lists indexed by power-of-2 order in the middle range, small-block free lists indexed by size class starting at offset +2128, and a mutex pointer at the end.

Pool Object (~7136 bytes)
  +0        ptr      large_block_list     singly-linked list of large-block slab descriptors
  +32       u32      min_slab_size        minimum slab allocation (default from pool creator)
  +44       u32      slab_count           number of slabs allocated for this pool
  +48       ptr      large_free_list      free list head for large blocks
  +56       u32      fragmentation_count  decremented on block split
  +60       u32      max_order            highest power-of-2 order currently tracked
  +64..     ptr[]    order_free_lists     per-order free list: *(pool + 32*(order+2)) = head
  +2112     ptr      tracking_map         hash map for allocation metadata (when enabled)
  +2128..   ptr[]    small_free_lists     625 bins: *(pool + 8*(size>>3) + 2128) = head
  +7128     mutex*   pool_mutex           pthread_mutex_t* for thread safety

Size-Class Bins (Small Path)

Small allocations (up to 4,999 bytes) are served from 625 free-list bins. Each bin holds blocks of exactly one size class. The bin index is computed from the 8-byte-aligned allocation size:

aligned_size = max(16, (requested + 7) & ~7)
bin_index    = aligned_size >> 3
bin_head     = *(pool + 8 * bin_index + 2128)

This gives bins for sizes 16, 24, 32, 40, ... up to 4,992 bytes (the largest multiple of 8 that is <= 4,999). The minimum allocation is 16 bytes because each free block stores a next-pointer (8 bytes) and a slab-descriptor back-pointer (8 bytes).

Order Free Lists (Large Path)

Large allocations (above 4,999 bytes) use power-of-2 order free lists. The order is computed by sub_42BE50, which returns floor(log2(size)) by clearing all bits except the highest set bit, then using _BitScanForward64. The free list for order k is at pool offset 32*(k+2), so order 0 lives at +64, matching the layout above. The pool tracks max_order at +60 to avoid scanning empty higher-order lists.

Allocation Algorithm -- sub_424070

The allocator takes two arguments: a pool pointer (a1) and a size (a2). When a1 is NULL, it falls through to the global allocator (sub_427A10) which wraps malloc. Otherwise, it acquires the pool mutex and dispatches to one of two paths based on the aligned size.

// Pseudocode for sub_424070
void* pool_alloc(Pool* pool, size_t size) {
    if (!pool)
        return global_alloc(size);    // sub_427A10 -> malloc

    pthread_mutex_lock(pool->mutex);  // pool + 7128
    size_t aligned = (size + 7) & ~7;

    if (aligned <= 4999) {
        // --- Small path ---
        if (aligned < 16) aligned = 16;
        size_t bin = aligned >> 3;
        FreeNode** head = &pool->small_free_lists[bin];

        if (!*head) {
            // Bin empty: allocate a new slab from parent pool
            if (!can_grow(pool->min_slab_size))    // sub_423B60
                goto oom;

            // 1. Allocate 56-byte slab descriptor from parent pool
            Pool* parent = get_tls_context()->parent_pool;
            SlabDesc* desc = pool_alloc(parent, 56);

            // 2. Compute slab memory: aligned * ceil(min_slab_size / aligned)
            size_t slab_bytes = aligned * ((aligned + pool->min_slab_size - 1) / aligned);

            // 3. Allocate slab memory from parent
            void* slab_mem = pool_alloc(parent, slab_bytes);

            // 4. Initialize slab descriptor
            desc->total_size     = slab_bytes;   // +8
            desc->available_size = slab_bytes;   // +16
            desc->owning_pool    = pool;         // +24
            desc->memory_base    = slab_mem;     // +32
            desc->is_small_slab  = 1;            // +40
            desc->slab_id        = atomic_inc(&global_slab_counter);
            desc->bin_size       = aligned;      // +48

            // 5. Carve slab into free-list nodes
            char* cursor = slab_mem + slab_bytes;
            FreeNode* list = NULL;
            while (cursor > slab_mem) {
                cursor -= aligned;
                ((FreeNode*)cursor)->next = list;
                ((FreeNode*)cursor)->slab = desc;
                list = (FreeNode*)cursor;
            }
            *head = list;

            // 6. Register slab in tracking structures
            register_slab(desc);                   // sub_423E50
            pool->slab_count++;
        }

        // Pop from free list
        FreeNode* block = *head;
        *head = block->next;
        block->slab->available_size -= aligned;

        pthread_mutex_unlock(pool->mutex);
        return block;
    }

    // --- Large path ---
    size_t total = aligned + 32;  // 32 bytes for boundary tag header

    // Search order free lists starting from floor(log2(total))
    int order = floor_log2(total);       // sub_42BE50
    while (order <= pool->max_order) {
        BoundaryTag* block = pool->order_lists[order];
        while (block) {
            if (block->payload_size >= total) {
                // Found a fit: unlink from free list
                unlink_free_block(block);
                block->sentinel = -1;  // mark allocated

                // Split remainder if >= 40 bytes
                size_t remainder = block->payload_size - total;
                if (remainder > 39) {
                    split_block(block, total, remainder);
                    pool->fragmentation_count--;
                }

                // Update slab accounting
                slab_desc->available_size -= block->tag_offset;

                pthread_mutex_unlock(pool->mutex);
                return (char*)block + 32;  // skip header
            }
            block = block->next_free;
        }
        order++;
    }

    // No fit found: allocate new large slab from parent
    // (allocates 88-byte slab descriptor + slab memory + 64 bytes for
    //  header/footer boundary tags)
    allocate_large_slab(pool, total);
    // retry search...

    pthread_mutex_unlock(pool->mutex);
    return result;
}

Critical Constants

| Constant | Meaning | Notes |
|---|---|---|
| 0x1387 (4,999) | Small/large allocation threshold | |
| 16 | Minimum allocation | Free node: 8-byte next + 8-byte slab pointer |
| 32 | Boundary tag header size | Sentinel + prev + tag_offset + payload_size |
| 39 (0x27) | Minimum split remainder | Must hold a full boundary tag + at least 8 bytes |
| 56 | Slab descriptor size (small) | 7 fields |
| 88 | Slab descriptor size (large) | Extended with boundary-tag metadata |
| 64 | Overhead for large slab | Header (32) + footer (32) boundary tags |

Deallocation Algorithm -- sub_4248B0

The deallocator takes a single pointer argument. It locates the owning pool through the slab descriptor back-pointer (stored either inline for small blocks, or recoverable from boundary tags for large blocks), then returns the memory to the appropriate free list.

// Pseudocode for sub_4248B0
void pool_free(void* ptr) {
    if (!ptr) { system_free(ptr); return; }   // sub_427B30

    // Locate slab descriptor via tracking map
    SlabDesc* desc = find_slab(ptr);
    if (!desc) { system_free(ptr); return; }

    Pool* pool = desc->owning_pool;
    pthread_mutex_lock(pool->mutex);

    if (desc->is_small_slab) {
        // Small block: push back onto size-class free list
        size_t bin_size = desc->bin_size;
        size_t bin = bin_size & ~7;
        FreeNode** head = &pool->small_free_lists[bin >> 3];
        ((FreeNode*)ptr)->slab = desc;
        ((FreeNode*)ptr)->next = *head;
        *head = (FreeNode*)ptr;
        desc->available_size += bin_size;
    } else {
        // Large block: coalesce with adjacent free blocks
        BoundaryTag* header = (BoundaryTag*)((char*)ptr - 32);
        size_t block_size = header->payload_size;

        // Validate sentinel (must be -1 = allocated)
        assert(header->sentinel == -1);
        desc->available_size += block_size;

        // Check next block's sentinel
        BoundaryTag* next = (BoundaryTag*)((char*)ptr - 32 + block_size);
        if (next->sentinel != -1) {
            // Next block is free: unlink and merge
            unlink_free_block(next);
            header->payload_size += next->payload_size;
            // Update footer
        }

        // Check prev block via footer tag
        BoundaryTag* prev_footer = (BoundaryTag*)((char*)header - header->prev_free);
        if (prev_footer->sentinel != -1) {
            // Prev block is free: merge into prev
            prev_footer->payload_size += header->payload_size;
            // Update footer
        } else {
            // Header becomes free: insert into order free list
            header->sentinel = 0;  // mark free
            int order = floor_log2(header->payload_size);
            insert_free_block(pool, order, header);
        }
    }

    pthread_mutex_unlock(pool->mutex);
}

Small Block Free-List Node

Each free block in a small bin stores two pointers in the returned memory region itself (since the block is not in use):

Small Free Node (aligned_size bytes, minimum 16)
  +0    ptr    next       next free node in this bin, or NULL
  +8    ptr    slab_desc  back-pointer to owning slab descriptor

On allocation, the node is popped from the head. On deallocation, the node is pushed back to the head. This is a classic LIFO (stack) free list with O(1) alloc and free.
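The LIFO discipline can be sketched as follows. This is a minimal model, not recovered code: the FreeNode and SlabDesc types are hypothetical stand-ins named after the layout above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types modeling the recovered 16-byte free node. */
typedef struct SlabDesc SlabDesc;
typedef struct FreeNode {
    struct FreeNode* next;   /* +0: next free node in this bin, or NULL */
    SlabDesc*        slab;   /* +8: back-pointer to owning slab descriptor */
} FreeNode;

/* O(1) push: deallocation writes the node into the freed block itself
 * and links it at the head of the bin's list. */
static void bin_push(FreeNode** head, void* block, SlabDesc* desc) {
    FreeNode* n = (FreeNode*)block;
    n->slab = desc;
    n->next = *head;
    *head = n;
}

/* O(1) pop: allocation reuses the most recently freed block (LIFO). */
static void* bin_pop(FreeNode** head) {
    FreeNode* n = *head;
    if (!n) return NULL;
    *head = n->next;
    return n;
}
```

The LIFO order means the hottest (most recently touched) block is handed out first, which is cache-friendly for the formatter-heavy allocation patterns described later.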

Boundary Tag Format (Large Blocks)

Large blocks use a classic Knuth-style boundary tag scheme. Every allocated or free block has a 32-byte header before the user payload and a 32-byte footer at the end. The sentinel field distinguishes allocated blocks (-1) from free blocks (pointer to next free block, or 0).

                              Large Block Layout
  ┌──────────────────────────────────────────────────────────────────┐
  │ Header (32 bytes)                                                │
  │   +0   i64   sentinel        -1 = allocated, else next_free ptr │
  │   +8   ptr   prev_free       previous in order free list        │
  │   +16  u64   tag_offset      always 32 (header size)            │
  │   +24  u64   payload_size    user allocation size                │
  ├──────────────────────────────────────────────────────────────────┤
  │ User Payload (payload_size - 64 bytes)                           │
  │   ... returned to caller ...                                     │
  ├──────────────────────────────────────────────────────────────────┤
  │ Footer (32 bytes, at end of block)                               │
  │   +0   i64   sentinel        mirrors header sentinel             │
  │   +8   ptr   prev_free       (unused in footer)                  │
  │   +16  u64   footer_tag      always 32                           │
  │   +24  u64   block_size      total block size including headers  │
  └──────────────────────────────────────────────────────────────────┘

The footer allows the deallocator to coalesce with the preceding block by reading block_size from the footer of the previous block, then checking whether that block's header sentinel is -1 (allocated) or a free-list pointer. This enables bidirectional coalescing in O(1) without maintaining a separate block-address data structure.
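The backward step can be sketched in a few lines. This is an illustrative model under the layout above (the BoundaryTag struct and function names are hypothetical, not recovered symbols): the previous block's footer sits immediately before our header, so one subtraction reaches the previous header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-byte boundary tag matching the recovered layout. */
typedef struct BoundaryTag {
    int64_t  sentinel;      /* -1 = allocated, else free-list link / 0 */
    void*    prev_free;
    uint64_t tag_offset;    /* header: always 32 */
    uint64_t size;          /* header: payload_size; footer: block_size */
} BoundaryTag;

/* The previous block's footer is the 32 bytes directly before our header.
 * Its size field gives the previous block's total size, so one subtraction
 * yields the previous header -- O(1), no address-ordered index needed. */
static BoundaryTag* prev_block_header(BoundaryTag* header) {
    BoundaryTag* prev_footer = header - 1;          /* 32 bytes before */
    return (BoundaryTag*)((char*)header - prev_footer->size);
}

static int prev_block_is_free(BoundaryTag* header) {
    BoundaryTag* prev_footer = header - 1;
    return prev_footer->sentinel != -1;             /* -1 = allocated */
}
```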

Block Splitting

When a large free block is larger than needed, the allocator splits it if the remainder exceeds 39 bytes (enough for a header + footer + at least 8 bytes of payload). The split creates a new free block from the remainder and inserts it into the appropriate order free list. The pool's fragmentation_count is decremented on each split.

Slab Descriptor

Every slab (contiguous memory region backing allocations) is tracked by a descriptor. Small slabs use 56-byte descriptors; large slabs use 88-byte descriptors with additional boundary-tag metadata.

Small Slab Descriptor (56 bytes)

SlabDesc (56 bytes)
  +0    ptr    chain_link       next descriptor in pool's slab chain
  +8    u64    total_size       total slab memory in bytes
  +16   u64    available_size   bytes currently free (decremented on alloc)
  +24   ptr    owning_pool      back-pointer to the pool that owns this slab
  +32   ptr    memory_base      base address of the contiguous slab memory
  +40   u8     is_small_slab    1 = small-alloc slab, 0 = large-alloc slab
  +44   u32    slab_id          global atomic sequence number
  +48   u32    bin_size         size class this slab serves

Large Slab Descriptor (88 bytes)

Large slab descriptors extend the base 56 bytes with fields for boundary-tag free-list management. The memory base at +32 points to the raw allocation, which begins with a 32-byte header boundary tag. The descriptor at +48 points to the final footer boundary tag.

Hierarchical Pool Model

Pools form a tree. The root is a global fallback that wraps malloc/free. Below it are named pools created by the compilation driver. Each named pool allocates its slab memory from its parent pool.

    ┌─────────────────────────────────┐
    │  Global Fallback (a1 = NULL)    │
    │  sub_427A10 -> malloc           │
    │  sub_427B30 -> free             │
    └─────────┬───────────────────────┘
              │
    ┌─────────▼───────────────────────┐
    │  "Top level ptxas memory pool"  │
    │  Created in sub_446240 (driver) │
    │  Lifetime: entire compilation   │
    └─────┬───────────┬───────────────┘
          │           │
    ┌─────▼─────┐   ┌─▼──────────────────────────┐
    │ "Command  │   │  Per-compilation-unit pool  │
    │  option   │   │  (from compilation_ctx +16) │
    │  parser"  │   └──┬──────────┬───────────────┘
    └───────────┘      │          │
               ┌───────▼──┐   ┌──▼───────────────────┐
               │ "PTX     │   │ "Permanent OCG        │
               │  parsing │   │  memory pool"         │
               │  state"  │   │  per-kernel OCG state │
               └──────────┘   └───┬───────────────────┘
                                  │
                              ┌───▼───────────────┐
                              │ "elfw memory       │
                              │  space" (4096 init)│
                              │  ELF output buffer │
                              └───────────────────┘

Known Named Pools

Name                             Creator                    Lifetime         Purpose
"Top level ptxas memory pool"    sub_446240                 Entire process   Root of all sub-pools
"Command option parser"          sub_446240                 Entire process   CLI option storage
"Permanent OCG memory pool"      0x1CE7B2B ref              Per-kernel       OCG IR and pass state
"PTX parsing state"              sub_451730                 Per-parse        Lexer/parser temporaries
"elfw memory space"              sub_1CB53A0 / sub_4258D0   Per-ELF-output   ELF world (672-byte object, 4096 initial)

Parent Pool Resolution

When the allocator needs a new slab, it calls sub_4280C0 to get the thread-local context, which holds a parent pool pointer at byte offset +192 (qword offset 24). This TLS context is a 280-byte (0x118) struct allocated via raw malloc on first access per thread, initialized with pthread_cond_t at +128, pthread_mutex_t at +176, and sem_t at +216.

// TLS context layout (280 bytes = 0x118)
struct TLSContext {
    uint64_t error_flags;           // +0
    uint64_t has_error;             // +8
    // ... diagnostic fields ...
    void*    parent_pool;           // +192  (qword index 24)
    // ...
    pthread_cond_t  cond;           // +128  (48 bytes)
    pthread_mutex_t mutex;          // +176  (40 bytes)
    sem_t           sem;            // +216
    // ... diagnostic suppression ... // +384-416
};

The parent pool pointer determines where slab memory is allocated from. For the top-level pool, the parent is the global allocator (NULL pool, i.e., malloc). For sub-pools, the parent is the enclosing pool.

Thread Safety

Every pool operation acquires a per-pool mutex at offset +7128. The mutex is lazily initialized: on first use, sub_4286A0 (a once-init guard) creates the mutex via sub_428320 (pthread_mutex_init). The initialization itself is serialized through a separate global once-init mechanism (sub_42BDD0 saves/restores some state around the initialization).

There is also a global mutex at qword_29FDC08 that protects the global slab counter (dword_29FDBF4) and the global emergency-reclaim state (qword_29FDC00). The allocator acquires this global mutex briefly after creating new slabs to decrement the outstanding-growth counter.

Locking Sequence

1. Lock pool->mutex  (per-pool, offset +7128)
2. Perform allocation or deallocation
3. If new slab was created:
   a. Lock global_mutex  (qword_29FDC08)
   b. Decrement dword_29FDBF4 (outstanding growth count)
   c. Unlock global_mutex
4. Unlock pool->mutex

The locking is strictly ordered (pool mutex first, then global mutex if needed), preventing deadlock between pool operations. There is no lock-free fast path -- every allocation takes the pool mutex.
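The ordered two-lock sequence can be modeled directly. This is a sketch, not recovered code: the Pool type, global names, and function are hypothetical stand-ins for pool+7128, qword_29FDC08, and dword_29FDBF4.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical globals modeling qword_29FDC08 / dword_29FDBF4. */
static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;
static int outstanding_growth;

typedef struct Pool {
    pthread_mutex_t mutex;   /* at offset +7128 in the binary */
} Pool;

/* Lock order is fixed: pool mutex first, global mutex second (and the
 * global mutex is always released before the pool mutex). Because every
 * thread acquires in the same order, the two locks cannot deadlock. */
static void pool_op(Pool* pool, int created_new_slab) {
    pthread_mutex_lock(&pool->mutex);
    /* ... allocation or deallocation under the pool lock ... */
    if (created_new_slab) {
        pthread_mutex_lock(&global_mutex);
        outstanding_growth--;               /* dword_29FDBF4 */
        pthread_mutex_unlock(&global_mutex);
    }
    pthread_mutex_unlock(&pool->mutex);
}
```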

OOM Handling

The OOM handler sub_42BDB0 is a 14-byte stub that forwards to sub_42F590 (the central diagnostic/fatal-error emitter) with the error descriptor at unk_29FA530. This triggers a longjmp to abort the current compilation.

// sub_42BDB0 -- 14 bytes, 3825 callers
void alloc_fail_abort(void* pool, size_t size, ...) {
    fatal_error(&internal_error_descriptor, size, ...);
    // does not return -- longjmp
}

Every allocation site in ptxas follows the same pattern:

void* p = pool_alloc(pool, size);
if (!p) alloc_fail_abort(pool, size);  // sub_42BDB0

The 3,825 call sites for sub_42BDB0 closely mirror the 3,809 callers of sub_424070 (the difference being realloc and a few indirect call sites). This is an unconditional abort -- there is no graceful degradation or fallback allocation strategy.

Emergency Reclaim

Before aborting, the allocator at a1 = NULL (global path) checks for a reclaimable cache at qword_29FDC00. If present, it locks the global mutex, calls sub_427B30 to free the cached block, zeroes the cache pointer, then retries the allocation. This provides a one-shot emergency reserve for the global allocator only.
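The retry-once shape of this path can be sketched as follows. This is an illustrative model (the function and the cache_block global standing in for qword_29FDC00 are hypothetical); the real code also holds the global mutex around the cache manipulation, and forcing the OOM branch is not shown here.

```c
#include <assert.h>
#include <stdlib.h>

/* Models qword_29FDC00: a single cached block held in reserve. */
static void* cache_block;

/* One-shot emergency reserve, global (a1 = NULL) path only: on failure,
 * free the cached block, clear the pointer, and retry exactly once. */
void* global_alloc_with_reserve(size_t size) {
    void* p = malloc(size);                 /* sub_427A10 */
    if (!p && cache_block) {
        free(cache_block);                  /* sub_427B30 */
        cache_block = NULL;                 /* reserve is consumed forever */
        p = malloc(size);                   /* single retry */
    }
    return p;                               /* still NULL -> fatal abort */
}
```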

Per-Phase Memory Reporting

When --stat=phase-wise is enabled (option 17928), the phase manager takes memory snapshots before and after each phase, then reports deltas.

Memory Snapshot

sub_8DADE0 captures a 48-byte snapshot from the pool state:

// sub_8DADE0 -- take_snapshot(snapshot, pool_state)
void take_snapshot(Snapshot* snap, PoolState* ps) {
    uint64_t* q = (uint64_t*)ps;                     // qword view of the pool state
    snap->pool_state    = ps;                        // +0
    snap->total_alloc   = q[80];                     // +8   (qword offset 80 = byte +640)
    snap->freeable      = q[78];                     // +16  (qword offset 78 = byte +624)
    snap->freeable_leak = q[79];                     // +24  (qword offset 79 = byte +632)
    snap->metric4       = q[76];                     // +32  (qword offset 76 = byte +608)
    snap->current_usage = ps->vtable->get_usage(ps); // +40
}

Memory Delta Queries

Three helper functions compute deltas between the current pool state and a saved snapshot:

Function    Computation                                Metric
sub_8DAE20  pool[632] - snap[3]                        Total memory delta
sub_8DAE30  pool[624] - snap[2]                        Freeable memory delta
sub_8DAE40  snap[1] + pool[624] - snap[2] - pool[640]  Freeable leaked delta

Pool Consumption Query

sub_8DAE60 returns the current pool consumption as a single integer:

// sub_8DAE60 -- pool_consumption(pool_state)
int64_t pool_consumption(PoolState* ps) {
    uint64_t* q = (uint64_t*)ps;                     // qword view of the pool state
    return *(ps->vtable->field_at_32) - q[5];
    // i.e., total allocated from parent minus some baseline
}

Reporter Output

The pool reporter (sub_C62200) prints to stderr:

[Pool Consumption = 45.678 MB]

Size formatting follows the same thresholds used throughout ptxas:

  • 0--1023 bytes: raw integer with B suffix
  • 1,024--10,485,760 bytes: %.3lf KB
  • Above 10 MB: %.3lf MB
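A minimal sketch of these thresholds (the function name is hypothetical; the divisors 1024 and 1024*1024 are an assumption consistent with the KB/MB labels, and the %.3lf format matches the recovered strings):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a byte count using the thresholds recovered from ptxas:
 * < 1024 -> raw bytes, up to 10,485,760 -> KB, above -> MB. */
static void format_size(char* out, size_t n, unsigned long long bytes) {
    if (bytes < 1024ULL)
        snprintf(out, n, "%llu B", bytes);
    else if (bytes <= 10485760ULL)                  /* up to 10 MB */
        snprintf(out, n, "%.3lf KB", bytes / 1024.0);
    else
        snprintf(out, n, "%.3lf MB", bytes / (1024.0 * 1024.0));
}
```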

The per-phase reporter (sub_C64310) prints one line per phase:

  <phase_name>  ::  [Total 1234 KB]  [Freeable 567 KB]  [Freeable Leaked 12 KB] (2%)

The leak percentage is computed only when both freeable and freeable-leaked are positive.

Memory Space Statistics Dump

ptxas contains a detailed memory-space statistics subsystem for debugging the pool allocator. The output is gated by a byte flag at context+404 (initialized to 0 in sub_434320; not exposed as a user-facing knob). When the flag is non-zero, the compilation driver calls into the statistics printers at two points: after each per-kernel compilation (sub_436DF0, sub_4428E0) and on error-path exit from the main driver (sub_446240).

Generic Pool Statistics -- sub_425020

The entry point is sub_425AB0, which acquires the pool mutex, builds a stack-local stats-context struct, and calls sub_425020. The stats context is 28 bytes:

StatsContext (28 bytes, on stack)
  +0    ptr    output_stream     FILE* for sub_42BB30 (formatted output)
  +8    u8     verbosity_flag    enables/disables output
  +12   u32    detail_level      0 = compact, 1 = standard, 2 = per-page
  +16   u8     recurse_flag      walk child pools if set
  +20   u32    indent_level      current tab depth
  +24   u32    indent_step       tabs added per recursion level

sub_425020 first calls sub_423A10 to print the banner, then walks two structures to compute totals:

  1. Large-block slab chain (pool+48 linked list): for each slab descriptor, accumulates total_size and available_size, and counts free blocks within each slab.

  2. Small-block bin scan (pool+2112 hash map, via sub_426D60): iterates all 625 size classes (0..4992 in steps of 8), summing per-bucket total_size and available_size.

The three output metrics -- total allocated, total available, and in_use = total_available - total_allocated -- are all formatted as hex strings via sprintf("0x%llx", ...).

Detail level 1 (standard) output:

Memory space statistics for 'Top level ptxas memory pool'            
==========================================================
Page size                 : 0x10000 bytes
Total allocated           :     0x1a2b3c4 bytes
Total available           :     0x1ffffff bytes
Total in use              :     0x05d4c3b bytes
Nrof small block pages    : 42
Nrof large block pages    : 7
Longest free list size    : 3
Average free list size    : 0

Detail level 2 adds per-page breakdowns:

@@ large block page    0 : 0x1234/0x10000, #=2  max=0x5000
@@ small block size  24: 0x600/0x1800 (64/128 blocks) 3 pages

Detail level 0 (compact) prints a single line:

	 available= 	     0x1ffffff, allocated= 	     0x1a2b3c4, used= 	     0x05d4c3b

When recurse_flag is set, sub_425020 calls sub_42D4C0(child_chain, sub_425020, stats_context) to recursively walk and print statistics for all child pools, incrementing the indentation at each level.

OCG Memory Space Statistics -- sub_6936B0

The OCG (Optimizing Code Generator) uses a separate fixed-page allocator tracked in a 1048-byte hash-table object with 128 buckets. sub_6936B0 prints its statistics to stderr via sub_427540:

Memory space statistics for 'OCG mem space'
===========================================
Page size                 : 0x100000 bytes
Total allocated           : 0x340000 bytes
Total available           : 0x400000 bytes

The page size is hardcoded at 0x100000 (1 MB). The counters are read from the OCG state object at offsets +1032 (total_allocated) and +1040 (total_available) of the hash-table structure at OCG-context+24.

After printing, sub_693630 tears down the OCG allocator: it walks all 128 hash buckets freeing every linked-list entry, frees the overflow list at +1024, then frees the hash table object and the parent allocation via sub_4248B0.

Trigger

Both statistics paths are gated by the same flag: *(uint8_t*)(context + 404). This flag defaults to 0 and is not registered as a CLI knob. It is an internal debug mechanism, likely set only by NVIDIA-internal debug builds or environment variables not present in the release binary.

Pool Reset and Reuse

The pool system does not expose an explicit "reset" operation that returns all allocations without freeing slabs. Instead, pool lifetime is managed through the hierarchical ownership model:

  1. Per-parse pool ("PTX parsing state"): created before parsing, destroyed after parsing is complete. All lexer/parser temporaries are freed in bulk when the pool is torn down.

  2. Per-kernel pool ("Permanent OCG memory pool"): created before the 159-phase pipeline runs on a kernel, destroyed afterward. All IR nodes, analysis results, and phase-local data die with this pool.

  3. ELF output pool ("elfw memory space"): scoped to the ELF emission phase.

The teardown helper sub_4234D0 walks the pool's slab chain and returns each slab's memory to the parent pool via sub_4248B0 (free), then frees the slab descriptors themselves. Because slabs are allocated from the parent pool, this cascades upward -- destroying a child pool returns memory to the parent without touching the system heap.

Allocation Pattern: The 50KB Buffer

A pervasive allocation pattern across ptxas is the "alloc-format-shrink" idiom, observed in all PTX text formatters:

// ~100+ call sites follow this exact pattern
Pool* pool = get_arena_pool(ctx, table);           // sub_4280C0 -> offset 24
char* buf = pool_alloc(pool, 50000);               // 50KB temp buffer
if (!buf) alloc_fail_abort(pool, 50000);
int len = snprintf(buf, 50000, format, ...);
char* result = pool_alloc(pool2, len + 1);          // exact-size copy
memcpy(result, buf, len + 1);
pool_free(buf);                                     // return 50KB to pool
return result;

The 50,000-byte temporary buffer is a "one size fits all" strategy. Because it exceeds the 4,999-byte small-path threshold, every format operation takes the large-block path. However, because the buffer is freed immediately after use, it is typically coalesced back and reused by the next formatter call, making this effectively a per-thread scratch buffer recycled through the pool.

Global State

The allocator uses several global variables for cross-pool coordination:

Address        Type    Purpose
dword_29FDBF4  u32     Outstanding slab growth count (decremented after slab creation)
dword_29FDBF8  u32     Emergency cache flag (zeroed when cache is reclaimed)
qword_29FDC00  ptr     Emergency reclaimable cache block pointer
qword_29FDC08  mutex*  Global mutex protecting the above three fields
dword_29FDBE8  u32     Global slab sequence number (atomic increment)
qword_29FDBE0  ptr     Global slab tracking map (for cross-pool slab lookup)
qword_29FDBD8  mutex*  Mutex protecting qword_29FDBE0
byte_29FA4C0   u8      Flag enabling per-pool slab tracking maps

The slab tracking map (qword_29FDBE0) is a hash map keyed by address >> 3 that maps any allocated pointer to its owning slab descriptor. The deallocator (sub_4248B0) consults this map when the per-pool tracking flag (byte_29FA4C0) is enabled. When per-pool tracking is disabled, it falls back to the global map.

Key Functions Reference

Address     Size    Callers  Identity
sub_424070  2,098   3,809    pool_alloc(pool, size) -- main allocator
sub_4248B0  923     1,215    pool_free(ptr) -- main deallocator
sub_424C50  488     27       pool_realloc(ptr, new_size) -- alloc+copy+free
sub_42BDB0  14      3,825    alloc_fail_abort() -- fatal OOM via longjmp
sub_4280C0  597     3,928    get_tls_context() -- per-thread state accessor
sub_427A10  --      --       global_alloc(size) -- malloc wrapper for NULL pool
sub_427B30  --      --       global_free(ptr) -- free wrapper for non-pool memory
sub_423A10  323     1        pool_stats_header() -- prints "Memory space statistics for ..." banner
sub_425020  ~1,500  1        pool_stats_detail() -- full metrics dump, recursive child walk
sub_425AB0  80      2        pool_stats_entry() -- mutex-wrapped entry point
sub_6936B0  120     2        ocg_memspace_stats() -- OCG allocator stats to stderr
sub_693630  166     2        ocg_memspace_teardown() -- free OCG hash-table allocator
sub_4234D0  258     1        pool_teardown() -- recursive slab deallocation
sub_423600  922     3        pool_accounting_init() -- accounting/hash-set setup
sub_423E50  544     2        register_slab() -- slab tracking insertion
sub_423B60  --      --       can_grow() -- checks whether slab expansion is permitted
sub_423C70  480     2        pool_grow() -- slab expansion handler
sub_42BE50  64      --       floor_log2(size) -- clear-to-highest-bit + BSF
sub_42B990  --      --       slab_lookup(map, addr>>3) -- find slab for address
sub_4258D0  --      --       create_named_pool(name, flags, init_size)
sub_8DADE0  48      --       take_snapshot(snap, pool_state)
sub_8DAE20  16      --       delta_total(snap)
sub_8DAE30  16      --       delta_freeable(snap)
sub_8DAE40  32      --       delta_freeable_leaked(snap)
sub_8DAE60  32      --       pool_consumption(pool_state)
sub_C62200  888     1        Pool consumption reporter (stderr)
sub_C64310  3,168   --       Per-phase timing/memory reporter

Hash Tables & Bitvectors

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas contains two independent hash map implementations and a dedicated bitvector library. These three container abstractions underpin nearly every subsystem in the compiler -- from the PTX parser's symbol tables to the optimizer's liveness analysis to the code generator's instruction deduplication cache. This page documents their binary-level layouts, hash algorithms, SIMD acceleration, and usage patterns.

Overview

Container         Object Size        Hash Algorithm                                      Address Range       Callers
General hash map  112 bytes          MurmurHash3 (strings), pointer-shift, or identity   0x425B20--0x42D850  2800+ (sub_426150)
CFG hash map      40 bytes (header)  FNV-1a (32-bit)                                     0xBDED20--0xBDFB10  ~80
Bitvector         20 bytes (header)  N/A                                                 0xBDBA60--0xBDE150  500+

The general hash map is a self-contained 112-byte object used for string-keyed lookups (intrinsic tables, PTX directive dispatch, ELF section names), pointer-keyed caches (instruction deduplication), and integer-keyed registries (opcode tables). The CFG hash map is a separate, purpose-built implementation for graph edge storage with embedded sub-hash tables for multi-edge blocks. The bitvector library provides 17+ SSE2-accelerated operations optimized for the iterative dataflow workloads that dominate ptxas compile time.


General Hash Map (112 bytes)

Construction

The constructor sub_425CA0 takes three arguments: a hash function pointer, a compare function pointer, and an initial capacity hint. It delegates allocation to sub_425B20, which allocates the 112-byte map object, a bucket pointer array, an entries array, and a used-bits bitmap.

HashMap* HashMap_create(HashFn hash, CmpFn compare, uint32_t capacity_hint);
// sub_425CA0(hash_fn, cmp_fn, initial_capacity)

After allocation, the constructor detects two well-known function-pointer pairs and sets fast-path mode flags in the flags field at offset +84 (bits 4-7):

Mode         Bits 4-7  Hash Function   Compare Function  Key Semantics
0 (custom)   0x00      User-supplied   User-supplied     String or arbitrary (MurmurHash3 typical)
1 (pointer)  0x10      sub_4277F0      sub_427810        Raw pointer identity
2 (integer)  0x20      sub_427750      sub_427760        32/64-bit integer identity

Mode detection is a construction-time optimization: the insert and lookup fast paths test the mode bits and branch directly to specialized hash/compare code, avoiding the overhead of indirect function calls through the hash/compare pointers.

Object Layout (112 bytes)

GeneralHashMap (112 bytes, allocated by sub_425B20)
  +0    ptr     hash_fn          // Hash function pointer (or NULL for mode 1/2)
  +8    ptr     compare_fn       // Compare function pointer
  +16   ptr     (reserved)       // Additional function pointer slot
  +24   ptr     (reserved)       // Additional function pointer slot
  +32   u32     has_custom_cmp   // Non-zero if custom compare supplied
  +40   u32     bucket_mask      // (bucket_count - 1); power-of-2 mask
  +48   u32     entry_count      // Number of occupied entries
  +52   u32     (padding)
  +56   u32     (reserved)
  +60   u32     (reserved)
  +64   u32     load_threshold   // entry_count threshold triggering resize
  +68   u32     (reserved)
  +72   u32     first_free       // Free-slot tracking index
  +76   u32     entry_capacity   // Allocated capacity of entries array
  +80   u32     bitmap_capacity  // Capacity of used-bits bitmap (in uint32_t words)
  +84   u16     flags            // Bits 4-7: hash mode (0=custom, 1=pointer, 2=integer)
                                 // Bits 0-1: bitmap/entry state flags
  +86   u16     (padding)
  +88   ptr     entries          // Array of 16-byte entries [key (8B), value (8B)]
  +96   ptr     used_bitmap      // Bitmap tracking which entry slots are occupied
  +104  ptr     bucket_array     // Array of pointers to index lists per bucket

Entry Format (16 bytes)

Each entry in the entries array at +88 is 16 bytes:

Entry (16 bytes, at entries + 16 * index)
  +0    ptr/u64   key            // Key (string pointer, raw pointer, or integer)
  +8    ptr/u64   value          // Associated value

Bucket chains are stored as arrays of uint32_t indices terminated by the sentinel value 0xFFFFFFFF (-1). Each bucket pointer at bucket_array[hash & bucket_mask] points to a sentinel-terminated index list. Collision resolution is therefore separate chaining through index arrays into the shared entries array -- not linked-list chaining through the entries themselves.
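A minimal model of the index-chained scan (the names here are illustrative, not recovered symbols; the real map loads the list pointer from bucket_array[hash & bucket_mask] first):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define IDX_END 0xFFFFFFFFu   /* sentinel terminating each index list */

typedef struct { uint64_t key, value; } Entry;   /* 16-byte entry */

/* Walk one bucket's index list, comparing keys in the shared entries
 * array. Returns the entry index, or IDX_END if the key is absent. */
static uint32_t bucket_find(const uint32_t* index_list,
                            const Entry* entries, uint64_t key) {
    for (size_t i = 0; index_list[i] != IDX_END; i++) {
        uint32_t e = index_list[i];
        if (entries[e].key == key)
            return e;
    }
    return IDX_END;
}
```

Keeping entries in one flat array and chaining via 32-bit indices halves the per-link footprint versus 64-bit next pointers and keeps the entry storage contiguous for iteration.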

Hash Functions

Mode 0 (Custom / String) -- MurmurHash3:

String-keyed maps use sub_427630 (273 bytes, 73 callers), which implements MurmurHash3_x86_32 with the standard mixing constants:

uint32_t MurmurHash3_x86_32(const char* key, int len) {
    // sub_427630
    uint32_t h = seed;
    const uint32_t c1 = 0xCC9E2D51;  // -862048943
    const uint32_t c2 = 0x1B873593;  //  461845907

    // Body: process 4 bytes at a time
    for (int i = 0; i < len / 4; i++) {
        uint32_t k = ((uint32_t*)key)[i];
        k *= c1;
        k = ROTL32(k, 15);
        k *= c2;
        h ^= k;
        h = ROTL32(h, 13);
        h = h * 5 - 430675100;  // 0xE6546B64
    }

    // Tail: remaining 1-3 bytes (standard MurmurHash3 tail handling)
    const uint8_t* tail = (const uint8_t*)key + (len & ~3);
    uint32_t k = 0;
    switch (len & 3) {
    case 3: k ^= tail[2] << 16;     // fall through
    case 2: k ^= tail[1] << 8;      // fall through
    case 1: k ^= tail[0];
            k *= c1; k = ROTL32(k, 15); k *= c2;
            h ^= k;
    }

    // Finalization
    h ^= len;
    h ^= h >> 16;
    h *= 0x85EBCA6B;  // -2048144789
    h ^= h >> 13;
    h *= 0xC2B2AE35;  // -1028477387
    h ^= h >> 16;
    return h;
}

Mode 1 (Pointer):

Pointer-keyed maps use a shift-XOR hash that destroys the alignment pattern common in heap pointers:

uint32_t pointer_hash(void* key) {
    uintptr_t p = (uintptr_t)key;
    return (p >> 11) ^ (p >> 8) ^ (p >> 5);
}

The right-shifts at 5, 8, and 11 bits fold the significant bits of a 64-bit heap address into a 32-bit range. The compare function (sub_427810) performs raw pointer equality.

Mode 2 (Integer):

Integer-keyed maps use the key value directly as the hash (identity hash). The compare function (sub_427760) performs integer equality. Bucket selection is key & bucket_mask.

Insert Operation (sub_426150)

sub_426150 (2534 bytes, 2800 callers) is the most heavily used hash map function. Its signature:

void* HashMap_put(HashMap* map, void* key, void* value);
// Returns: previous value if key existed, NULL if new insertion

Algorithm:

  1. Hash computation. Branch on mode bits at +84 to select hash function.
  2. Bucket lookup. Compute bucket = hash & *(map+40). Load index list from bucket_array[bucket].
  3. Key scan. Walk the index list. For each index i where i != 0xFFFFFFFF, compare entries[i].key with the target key using the mode-specific compare.
  4. Hit: Swap entries[i].value with the new value. Return the old value.
  5. Miss: Find a free slot via the used-bits bitmap at +96. Set entries[free].key = key, entries[free].value = value. Append the free index to the bucket's index list. Increment entry_count at +48.
  6. Resize check. If entry_count > load_threshold, double the capacity: allocate new entries array (2 * old_capacity) and new bucket array, rehash all entries.

Lookup Operation (sub_426D60)

sub_426D60 (345 bytes, 422 callers) is the read-only lookup:

void* HashMap_get(HashMap* map, void* key);
// Returns: value if found, NULL (0) if not found

Same hash and scan logic as insert, but no modification path. The three mode fast paths avoid indirect calls on the hot path.

Contains Operation (sub_426EC0)

sub_426EC0 (349 bytes, 29 callers) returns 1 if the key exists, 0 otherwise. Nearly identical to lookup but returns a boolean rather than the value.

Resize Policy

The resize doubles capacity when entry_count exceeds load_threshold. The threshold is set to 4 * bucket_count during initialization (from sub_425B20: v6[8] = 4 << log2(capacity)), meaning the default load factor limit is approximately 4.0 entries per bucket. Initial bucket count is rounded up to the next power of two from the capacity hint (minimum 1). The constructor at sub_425B20 calls sub_42BDF0 to compute ceil(log2(capacity)).
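The initial sizing arithmetic can be reproduced in a few lines. This is a sketch of the recovered formula (function names are illustrative; ceil_log2 models sub_42BDF0):

```c
#include <assert.h>
#include <stdint.h>

/* ceil(log2(x)) for x >= 1 -- models sub_42BDF0. */
static unsigned ceil_log2(uint64_t x) {
    unsigned bits = 0;
    while ((1ULL << bits) < x) bits++;
    return bits;
}

/* Initial sizing per the recovered constructor: bucket count is the
 * next power of two >= the capacity hint (minimum 1), and the resize
 * threshold is 4 << log2(capacity), i.e. load factor limit ~4.0. */
static void initial_sizing(uint32_t hint, uint32_t* buckets, uint32_t* threshold) {
    unsigned lg = ceil_log2(hint ? hint : 1);
    *buckets   = 1u << lg;
    *threshold = 4u << lg;          /* v6[8] = 4 << log2(capacity) */
}
```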

Destructor (sub_425D20)

sub_425D20 (121 bytes, 63 callers) frees the entries array, used-bits bitmap, bucket array, and the 112-byte map object itself. All four allocations are returned to the pool allocator.

Iteration

Hash map iteration (used by sub_425DB0, 9 callers) walks the entries array linearly, testing the used-bits bitmap to identify occupied slots. There is no guaranteed iteration order -- the order depends on insertion history and resize operations.

Usage Examples

Subsystem                       Hash Mode             Key Type  Purpose
Intrinsic table (sub_5AB660)    Custom (MurmurHash3)  String    608 intrinsic name lookups (capacity 0x80)
PTX opcode dispatch             Custom (MurmurHash3)  String    Named PTX opcodes at ctx+808
SM version backends             Custom (MurmurHash3)  String    sm_XX, compute_XX, lto_XX registries
Instruction dedup (sub_737760)  Pointer               Pointer   Avoid re-encoding identical instructions
Opcode hash tables              Integer               Integer   Fast opcode-to-handler dispatch
File offset cache               Custom                String    Cache file offsets for #line directives
Symbol table (sub_621480)       Custom (MurmurHash3)  String    Named symbol lookups at ctx+30016
Per-function disable list       Custom (FNV-1a)       String    Function-specific pass disable at ctx+120->+1128

CFG Hash Map (Graph Edges)

The CFG edge hash map is a completely separate implementation from the general hash map, located in the address range 0xBDED20--0xBDFB10. It is designed specifically for graph edge storage -- mapping block indices to successor/predecessor sets -- and has a different object layout, hash function, and collision resolution strategy.

Object Layout

CFGHashMap (40 bytes header)
  +0    ptr     first_free_node   // Free list for node recycling
  +8    ptr     node_arena        // Pool allocator for new nodes
  +16   ptr     bucket_array      // Array of 24-byte bucket headers
  +24   u64     num_buckets       // Power of two, initial = 8
  +32   i32     total_elements    // Total entries across all buckets
  +36   i32     num_unique_keys   // Distinct keys inserted

Two distinct hash map configurations exist for different node sizes:

Full Node (64 bytes) -- Successor Edge Map

Used by sub_BDED20 (12KB) for the successor edge hash map at Code Object +648. Each node represents a block's successor edge set with an optional sub-hash table for multi-successor blocks:

FullNode (64 bytes)
  +0    ptr     next              // Chain link within bucket
  +8    i32     key               // Block index (bix)
  +12   i32     value_info        // Edge count or flags
  +16   ptr     value_array       // Pointer to sub-array of successor indices
  +24   i32     value_count       // Number of successors in sub-array
  +28   i32     (padding)
  +32   ptr     sub_hash_data     // Embedded sub-hash for multi-edge blocks
  +40   u64     sub_hash_size     // Sub-hash capacity
  +48   u64     (reserved)
  +56   u32     cached_hash       // Cached FNV-1a hash of key
  +60   u32     (padding)

Simple Node (16 bytes) -- Backedge Set

Used by sub_BDF480 (10KB) for the backedge hash map at Code Object +680. Each node is a minimal key-hash pair for set membership testing:

SimpleNode (16 bytes)
  +0    ptr     next              // Chain link within bucket
  +8    i32     key               // Block index
  +12   u32     cached_hash       // Cached hash

Bucket Header (24 bytes)

Both full and simple node maps use the same bucket header:

Bucket (24 bytes)
  +0    ptr     head              // First node in collision chain
  +8    ptr     tail              // Last node in collision chain
  +16   i32     count             // Number of nodes in this bucket
  +20   i32     (padding)

FNV-1a Hash Function

All CFG hash lookups use FNV-1a (32-bit) with standard parameters, confirmed across 50+ call sites:

uint32_t fnv1a_hash_u32(uint32_t key) {
    uint32_t hash = 0x811C9DC5;                    // FNV offset basis
    hash = 16777619 * (hash ^ (key & 0xFF));       // byte 0
    hash = 16777619 * (hash ^ ((key >> 8) & 0xFF));  // byte 1
    hash = 16777619 * (hash ^ ((key >> 16) & 0xFF)); // byte 2
    hash = 16777619 * (hash ^ ((key >> 24) & 0xFF)); // byte 3
    return hash;
}

Bucket selection: bucket = hash & (num_buckets - 1).

This appears inline in the decompiled code as:

v9 = 16777619 * (HIBYTE(*a3)
   ^ (16777619 * ((unsigned __int8)BYTE2(*a3)
   ^ (16777619 * (BYTE1(v8)
   ^ (16777619 * ((unsigned __int8)v8 ^ 0x811C9DC5)))))));
v11 = v9 & (a2[3] - 1);   // bucket index

Resize Policy

The CFG hash map resizes when total_elements > num_unique_keys (load factor > 1.0). New capacity is 4 * old_bucket_count. Both sub_BDED20 and sub_BDF480 allocate a new 192-byte bucket array (8 buckets * 24 bytes/bucket = 192) during initial creation, then redistribute nodes during resize.

The resize algorithm:

  1. Allocate new bucket array (192 bytes for 8 buckets).
  2. Walk all old buckets, removing each node from old chains.
  3. Re-hash each node: new_bucket = cached_hash & (new_num_buckets - 1).
  4. Insert at head or tail of new bucket chain.
  5. Free old bucket array via pool allocator (vtable dispatch at allocator +32 for free).
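The redistribution step can be modeled in C as a concrete sketch (node and bucket shapes follow the layouts documented above; malloc/free stand in for the pool allocator's vtable dispatch):

```c
#include <stdlib.h>

/* Hypothetical shapes matching the documented 16-byte node and 24-byte bucket. */
typedef struct Node { struct Node *next; int key; unsigned cached_hash; } Node;
typedef struct { Node *head; Node *tail; int count; int pad; } Bucket;

/* Sketch of the documented resize: allocate 4x buckets, re-chain every node
 * by its cached hash, free the old array. */
static Bucket *resize(Bucket *old, int old_n, int *n_out) {
    int new_n = 4 * old_n;                       /* new capacity = 4 * old */
    Bucket *fresh = calloc(new_n, sizeof *fresh);
    for (int b = 0; b < old_n; b++) {
        for (Node *node = old[b].head; node; ) {
            Node *next = node->next;             /* unlink from old chain */
            int nb = node->cached_hash & (new_n - 1);
            node->next = NULL;
            if (fresh[nb].tail) fresh[nb].tail->next = node;
            else fresh[nb].head = node;
            fresh[nb].tail = node;
            fresh[nb].count++;
            node = next;
        }
    }
    free(old);                                   /* pool free in the binary */
    *n_out = new_n;
    return fresh;
}
```

Because the hash is cached in each node, no key is re-hashed during redistribution; only the mask changes.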

Edge Map Operations

Insert/Find (sub_BDED20, sub_BDF480):

  1. Compute FNV-1a hash of the block index key.
  2. Index into bucket array: bucket = hash & (num_buckets - 1).
  3. Walk the collision chain comparing keys.
  4. If found: return existing node.
  5. If not found: allocate node from arena, initialize, insert into bucket, increment counters.
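A minimal C model of this path over the 16-byte simple node (malloc stands in for the arena allocator; the function shape is illustrative, not the decompiled signature):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct SNode { struct SNode *next; int32_t key; uint32_t cached_hash; } SNode;
typedef struct { SNode *head; SNode *tail; int32_t count; int32_t pad; } Bkt;

static uint32_t fnv1a_u32(uint32_t k) {          /* FNV-1a as in the binary */
    uint32_t h = 0x811C9DC5u;
    for (int i = 0; i < 4; i++) h = 16777619u * (h ^ ((k >> (8 * i)) & 0xFF));
    return h;
}

/* Steps 1-5 above as one function. */
static SNode *insert_or_find(Bkt *buckets, int num_buckets, int32_t key) {
    uint32_t h = fnv1a_u32((uint32_t)key);       /* step 1: hash */
    Bkt *b = &buckets[h & (num_buckets - 1)];    /* step 2: bucket select */
    for (SNode *n = b->head; n; n = n->next)     /* step 3: walk chain */
        if (n->key == key) return n;             /* step 4: found */
    SNode *n = malloc(sizeof *n);                /* step 5: new node */
    n->next = NULL; n->key = key; n->cached_hash = h;
    if (b->tail) b->tail->next = n; else b->head = n;
    b->tail = n; b->count++;
    return n;
}
```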

Erase (sub_BDE6C0, 3KB): Removes an entry from the successor edge map and recursively erases successor edges for blocks that become unreachable (single-predecessor blocks whose only predecessor was removed). This recursive cleanup maintains CFG consistency during block deletion.

Print (sub_BDE8B0, 2KB): Iterates a block's successors and prints "\tbix%d -> bix%d\n" for each edge. Used by the CFG debug dump infrastructure.

Multi-Edge Sub-Hash

For blocks with many successors (switch statements, computed branches), the full node at +32 contains a pointer to an embedded sub-hash table. This sub-hash maps successor block indices within a single node's edge set, enabling O(1) lookup of whether a specific successor edge exists. The sub-hash uses the same 24-byte bucket structure as the outer hash map.

Storage Locations

Code Object OffsetFieldMap TypeNode Size
+648succ_mapSuccessor edges64 bytes (full)
+680backedge_mapBackedge set16 bytes (simple)

Bitvector Library

The bitvector library at 0xBDBA60--0xBDE150 is the most performance-critical infrastructure in ptxas. It supports 17+ operations, all SSE2-accelerated with manual alignment handling. The library is the backbone of liveness analysis (6 dedicated phases), dominance computation, register interference detection, and dead code elimination.

Object Layout (20 bytes)

struct BitVector {          // 20 bytes
    uint32_t* data;         // +0:  pointer to word array (heap-allocated)
    int32_t   word_count;   // +8:  number of 32-bit words in use
    int32_t   capacity;     // +12: allocated words (>= word_count)
    int32_t   bit_count;    // +16: number of valid bits
};

Word count derivation: word_count = (bit_count + 31) >> 5.

Memory is allocated through the pool allocator (vtable dispatch at allocator +24 for alloc, +32 for free). The allocate function (sub_BDBA60) implements grow-only semantics: if the new word count exceeds the current capacity, the old buffer is freed and a new one allocated. The buffer is never shrunk.

Bit Addressing

Individual bits are addressed by word index and bit position within the word:

// Set bit i:  data[i >> 5] |= (1 << (i & 0x1F))
// Clear bit i: data[i >> 5] &= ~(1 << (i & 0x1F))
// Test bit i:  (data[i >> 5] >> (i & 0x1F)) & 1

SSE2 Acceleration Pattern

All bulk operations follow a three-phase structure:

  1. Alignment prologue. Process scalar words until the destination pointer is 16-byte aligned. The alignment count is (-(uintptr_t)(dst_ptr) >> 2) & 3, yielding 0-3 scalar iterations.

  2. SSE2 main loop. Process 4 words (128 bits) per iteration using _mm_load_si128 (aligned destination), _mm_loadu_si128 (potentially unaligned source), and the appropriate SSE2 intrinsic (_mm_or_si128, _mm_and_si128, _mm_andnot_si128, _mm_xor_si128). The loop count is (remaining_words) >> 2.

  3. Scalar epilogue. Process remaining 0-6 words individually. The decompiler shows an unrolled epilogue handling up to 6 trailing words with sequential if-chains rather than a loop.

Example from sub_BDCF40 (orIfChanged):

// Prologue: align dst to 16 bytes
int align = (-(uintptr_t)dst >> 2) & 3;
for (int i = 0; i < align && i < word_count; i++)
    dst[i] |= src[i];

// SSE2 loop: 4 words per iteration
int sse_iters = (word_count - align) >> 2;
for (int i = 0; i < sse_iters; i++) {
    __m128i d = _mm_load_si128((__m128i*)&dst[align + 4*i]);
    __m128i s = _mm_loadu_si128((const __m128i*)&src[align + 4*i]);
    _mm_store_si128((__m128i*)&dst[align + 4*i], _mm_or_si128(d, s));
}

// Epilogue: 0-6 remaining words, handled scalar

Operation Catalog

Allocation and Lifecycle

| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDBA60 | allocate | (bv*, alloc*, num_bits) | Grow-only allocation. Sets bit_count, recomputes word_count, reallocates if capacity insufficient. |
| sub_BDDC00 | clear | (bv*, start_bit) | Zeroes all words from start_bit >> 5 to end. When start_bit = 0, equivalent to memset(data, 0, ...). |
| sub_BDCA60 | operator= | (dst*, src*) | Copy assignment. Reallocates dst if capacity insufficient, then memcpy. |

Bit Manipulation

| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDBFB0 | setBit | (bv*, bit_index) | `data[i>>5] \|= (1 << (i&31))` |
| sub_BDC0E0 | clearBit | (bv*, bit_index) | `data[i>>5] &= ~(1 << (i&31))` |
| sub_BDC200 | testBit | (bv*, bit_index) -> bool | `(data[i>>5] >> (i&31)) & 1` |

Bulk Boolean Operations (SSE2)

| Address | Operation | Signature | SSE2 Intrinsic | Description |
|---|---|---|---|---|
| sub_BDCDE0 | operator\|= | (dst*, src*) | _mm_or_si128 | dst \|= src |
| sub_BDC5F0 | operator&= | (dst*, src*) | _mm_and_si128 | dst &= src; zeroes dst tail beyond src length |
| sub_BDDAA0 | operator^= | (dst*, src*) | _mm_xor_si128 | dst ^= src |

Fixed-Point Detection (IfChanged variants)

These return 1 if any bit changed, 0 if the result is identical to the previous state. They are the core of iterative dataflow convergence detection.

| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDCF40 | orIfChanged | (dst*, src*) -> bool | Scans for (~dst & src) != 0 first, then applies dst \|= src from the first differing word. Returns whether any bit was newly set. |
| sub_BDC790 | andIfChanged | (dst*, src*) -> bool | Scans for (~src & dst) != 0 first, then applies dst &= src. Zeroes trailing dst words beyond src. Returns whether any bit was cleared. |

The IfChanged early-exit optimization: the function first scans word-by-word for the first position where a change would occur (~dst & src for OR, ~src & dst for AND). If no such position exists, it returns 0 immediately without touching memory. If found, it applies the operation only from that position forward, reducing cache pollution when most blocks have already converged.
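A scalar sketch of this two-phase behavior (the binary's apply phase additionally uses the SSE2 three-phase pattern; shown scalar here for clarity):

```c
#include <stdint.h>

/* Two-phase orIfChanged: scan for the first word where (~dst & src) != 0,
 * then apply dst |= src only from that position forward. */
static int or_if_changed(uint32_t *dst, const uint32_t *src, int words) {
    int first = -1;
    for (int i = 0; i < words; i++)
        if (~dst[i] & src[i]) { first = i; break; }  /* phase 1: scan */
    if (first < 0) return 0;          /* converged: no memory writes */
    for (int i = first; i < words; i++)
        dst[i] |= src[i];             /* phase 2: apply from first change */
    return 1;
}
```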

Three-Input Operations

| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDC3F0 | assignAND | (dst*, a*, b*) | dst = a & b (SSE2 _mm_and_si128) |
| sub_BDD8C0 | assignANDNOT | (dst*, a*, b*) | dst = a & ~b (SSE2 _mm_andnot_si128) |
| sub_BDCC20 | assignOR | (dst*, a*, b*) | dst = a \| b (SSE2 _mm_or_si128) |
| sub_BDD140 | orWithAND | (dst*, a*, b*) | dst \|= a & b (SSE2 _mm_and_si128 + _mm_or_si128) |

Liveness-Specific Fused Operations

The most important bitvector operations for compiler performance are the fused transfer functions used by liveness analysis:

| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDD300 | orWithAndNot | (dst*, gen*, in*, kill*) | dst \|= gen \| (in & ~kill) |
| sub_BDD560 | orWithAndNotIfChanged | (dst*, gen*, in*, kill*) -> bool | Same as above + returns change flag |

These implement the liveness transfer function LiveIn(B) = gen(B) | (LiveOut(B) - kill(B)) in a single SIMD pass, avoiding materialization of intermediate bitvectors.

orWithAndNot inner loop (sub_BDD300):

// SSE2: process 4 words per iteration
*(__m128i*)(dst + offset) = _mm_or_si128(
    _mm_or_si128(
        _mm_loadu_si128(gen + offset),      // gen
        *(__m128i*)(dst + offset)),          // existing dst
    _mm_andnot_si128(
        _mm_loadu_si128(kill + offset),     // ~kill
        _mm_loadu_si128(in + offset)));     // in & ~kill

orWithAndNotIfChanged (sub_BDD560):

This function first scans for any word position where the new value would differ from the current dst:

// Phase 1: scan for first change
for (int i = 0; i < word_count; i++) {
    uint32_t new_bits = gen[i] | (in[i] & ~kill[i]);
    if (~dst[i] & new_bits)
        goto apply_from_i;  // Found change at position i
}
return 0;  // No change -- already at fixed point

// Phase 2: apply from first change position
apply_from_i:
    // SSE2 loop: dst[j] |= gen[j] | (in[j] & ~kill[j])
    // ... (same three-phase SSE2 pattern) ...
    return 1;

This two-phase design is critical for liveness analysis performance: in the late iterations of the fixed-point solver, most blocks have already converged and the scan returns 0 without writing memory.

Query Operations

| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDDD40 | popcount | (bv*) -> int | Count set bits. Uses __popcountdi2 intrinsic per word. |
| sub_BDBFB0 | findLastSetWord | (bv*) -> int | Scans from high word to low, returns index of last non-zero word. Returns 0xFFFFFFFF if all zero. |
| sub_BDDC00 | nextSetBit | (bv*, start) -> int | Find next set bit at or after start. Uses tzcnt (trailing zero count) for intra-word scanning. Returns 0xFFFFFFFF if none found. |
| sub_BDDCB0 | prevSetBit | (bv*, start) -> int | Find previous set bit at or before start. Uses bsr (bit scan reverse). Returns 0xFFFFFFFF if none found. |

Extract Operation

| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDBD60 | extractBits | (bv*, out[], start, end) | Extract bit range [start, end] into output array. Handles cross-word boundaries with shift-and-mask. Supports extraction spans > 32 bits by filling multiple output words. |

The extract function handles three cases:

  1. Same word: Both start and end are in the same 32-bit word. Single mask-and-shift.
  2. Small span (<=32 bits, two words): Combine partial words with shift and OR.
  3. Large span (>32 bits): Loop over full words with optional shift for non-aligned start.
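The three cases can be sketched in C as follows (the function shape is illustrative, not the decompiled signature; the inclusive range [start, end] is assumed to lie within the source vector):

```c
#include <stdint.h>

/* Extract the inclusive bit range [start, end] from data into out,
 * stitching each output word from up to two source words. */
static void extract_bits(const uint32_t *data, uint32_t *out,
                         int start, int end) {
    int span = end - start + 1;
    int w = start >> 5, sh = start & 31;
    if (span <= 32 - sh) {
        /* case 1: whole range inside one source word: mask-and-shift */
        out[0] = (data[w] >> sh) & (span == 32 ? ~0u : ((1u << span) - 1));
        return;
    }
    /* cases 2 and 3: combine partial words with shift and OR */
    for (int produced = 0; produced < span; produced += 32) {
        uint32_t lo = data[w] >> sh;
        uint32_t hi = (sh && w + 1 <= (end >> 5))
                      ? data[w + 1] << (32 - sh) : 0;
        uint32_t word = lo | hi;
        int remaining = span - produced;
        if (remaining < 32) word &= (1u << remaining) - 1;  /* trailing mask */
        out[produced >> 5] = word;
        w++;
    }
}
```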

Bitvector Subset Test

The isSubsetOf operation tests whether bitvector A is a subset of B, i.e. (A & ~B) == 0. This is used by dominance queries (checking if block X is dominated by block Y). The test at sub_1245740 referenced in the copy-prop-CSE pass performs a single-bit indexed test against the dominator bitvector: testBit(dom_set[block], dominator_id).
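A scalar model of the subset test (equal word counts assumed; the binary presumably operates on the 20-byte BitVector objects directly):

```c
#include <stdint.h>

/* A is a subset of B iff (A & ~B) == 0 in every word. */
static int is_subset_of(const uint32_t *a, const uint32_t *b, int words) {
    for (int i = 0; i < words; i++)
        if (a[i] & ~b[i]) return 0;   /* a has a bit that b lacks */
    return 1;
}
```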


Usage Across Subsystems

Liveness Analysis

The iterative fixed-point solver runs in reverse post-order, computing LiveIn and LiveOut bitvectors per basic block:

LiveOut(B) |= LiveIn(S)                    -- for each successor S: orIfChanged
LiveIn(B)  = gen(B) | (LiveOut(B) - kill(B))  -- orWithAndNotIfChanged

Six dedicated phases (10, 16, 19, 33, 61, 84) perform liveness + DCE. The orWithAndNotIfChanged fusion and early-exit optimization minimize cache traffic in later iterations when most sets have stabilized. Bitvectors for liveness are stored at Code Object +832 (R registers) and +856 (UR registers).
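The solver structure can be sketched with single-word sets (the block layout, RPO array, and field names here are illustrative assumptions; the binary operates on full bitvectors via the fused IfChanged operations):

```c
#include <stdint.h>

/* Toy basic block with one 32-bit word per liveness set. */
typedef struct {
    uint32_t gen, kill, live_in, live_out;
    int succ[2], nsucc;
} Block;

/* Iterate to a fixed point: LiveOut(B) |= LiveIn(S) for each successor S,
 * then LiveIn(B) = gen(B) | (LiveOut(B) & ~kill(B)). */
static void solve_liveness(Block *blk, const int *rpo, int n) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = n - 1; i >= 0; i--) {        /* walk blocks backward */
            Block *b = &blk[rpo[i]];
            uint32_t out = b->live_out;
            for (int s = 0; s < b->nsucc; s++)
                out |= blk[b->succ[s]].live_in;   /* orIfChanged analogue */
            uint32_t in = b->gen | (out & ~b->kill); /* fused transfer fn */
            if (out != b->live_out || in != b->live_in) changed = 1;
            b->live_out = out;
            b->live_in  = in;
        }
    }
}
```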

Dominance

The iterative dominator computation (sub_BE2330, 4KB) uses bitvector intersection:

dom[entry] = {entry}
dom[b] = {b} union (intersection of dom[p] for all predecessors p)

Each block's dominator set is a bitvector indexed by block ID. Convergence uses andIfChanged to detect when the intersection no longer removes any dominators.
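A single-word-per-set sketch of this iteration (predecessor arrays and the 32-block limit are assumptions for illustration; the binary uses full bitvectors and andIfChanged):

```c
#include <stdint.h>

/* Iterative dominators over bitsets: dom[entry] = {entry};
 * dom[b] = {b} | intersection of dom[p] over all predecessors p. */
static void compute_dominators(uint32_t dom[], int pred[][4],
                               int npred[], int n) {
    dom[0] = 1u << 0;                            /* entry dominates itself */
    for (int b = 1; b < n; b++) dom[b] = ~0u;    /* start from "all blocks" */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 1; b < n; b++) {
            uint32_t meet = ~0u;
            for (int p = 0; p < npred[b]; p++)
                meet &= dom[pred[b][p]];         /* intersect predecessors */
            uint32_t next = meet | (1u << b);    /* add the block itself */
            if (next != dom[b]) { dom[b] = next; changed = 1; }
        }
    }
}
```

On a diamond CFG (0 branches to 1 and 2, both rejoin at 3), block 3 ends up dominated only by the entry and itself.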

Register Allocation

The interference graph builder (sub_926A30, 155KB decompiled) uses bitvector intersection to detect live range overlaps. Two registers interfere if their liveness bitvectors have overlapping set bits at any program point. The allocator at sub_9721C0 explicitly rebuilds liveness before allocation.

GVN-CSE

The GVN-CSE pass (phase 49) uses the general hash map with FNV-1a hashing for the value numbering table. Each hash entry is 24 bytes: [next_ptr (8B), key (8B), value/metadata (8B)] with chained collision resolution. The hash incorporates opcode, type, and recursively resolved value numbers of all operands.

CFG Analysis

The CFG builder (sub_BE0690, 54KB) populates successor and backedge hash maps during phase 3 (AnalyzeControlFlow). RPO computation (sub_BDE150, 9KB) reads from the successor map. The DOT dumper (sub_BE21D0) iterates edge sets for visualization.


Key Function Table

General Hash Map

| Address | Size | Identity | Callers | Confidence |
|---|---|---|---|---|
| sub_425B20 | 0.5KB | HashMap::allocate -- allocates 112-byte map + arrays | 127 (via sub_425CA0) | HIGH |
| sub_425CA0 | 114B | HashMap::create -- constructor with hash/compare fn ptrs | 127 | HIGH |
| sub_425D20 | 121B | HashMap::destroy -- frees all internal arrays + map | 63 | MEDIUM |
| sub_425DB0 | 292B | HashMap::clear / iterator init | 9 | MEDIUM |
| sub_426150 | 2.5KB | HashMap::put -- insert or update key/value pair | 2800 | HIGH |
| sub_426D60 | 345B | HashMap::get -- lookup key, return value or 0 | 422 | HIGH |
| sub_426EC0 | 349B | HashMap::contains -- test key existence | 29 | HIGH |
| sub_427630 | 273B | MurmurHash3_x86_32 -- string hash | 73 | HIGH |
| sub_427750 | -- | Integer hash function (identity) | -- | HIGH |
| sub_4277F0 | -- | Pointer hash function (shift-XOR) | -- | HIGH |
| sub_42D850 | 2.5KB | HashMap::insertSet -- set-mode insert with auto-resize | 282 | HIGH |

CFG Hash Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_BDED20 | 12KB | CFGHashMap::insertOrFind -- full 64-byte node | HIGH |
| sub_BDF480 | 10KB | CFGHashMap::insertOrFind_simple -- 16-byte node | HIGH |
| sub_BDE6C0 | 3KB | CFGHashMap::erase -- recursive edge removal | MEDIUM |
| sub_BDE8B0 | 2KB | CFGHashMap::printEdges -- "bix%d -> bix%d" | HIGH |
| sub_BDEA50 | 4KB | CFGHashMap::dumpRPOAndBackedges -- debug dump | HIGH |
| sub_BDFB10 | 24KB | CFGHashMap::buildBlockMap -- block array init, RPO resize | MEDIUM |
| sub_BDDDF0 | ~2KB | CFGHashMap::destroyAll -- walk and free all nodes | MEDIUM |

Bitvector Library

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_BDBA60 | ~120B | BitVector::allocate | HIGH (0.90) |
| sub_BDBFB0 | ~120B | BitVector::findLastSetWord | HIGH (0.90) |
| sub_BDBD60 | ~370B | BitVector::extractBits | HIGH (0.88) |
| sub_BDBFB0 | ~120B | BitVector::setBit | HIGH (0.90) |
| sub_BDC0E0 | ~120B | BitVector::clearBit | HIGH (0.90) |
| sub_BDC200 | ~140B | BitVector::testBit | HIGH (0.90) |
| sub_BDC3F0 | ~520B | BitVector::assignAND | HIGH (0.90) |
| sub_BDC5F0 | ~480B | BitVector::operator&= | HIGH (0.95) |
| sub_BDC790 | ~800B | BitVector::andIfChanged | HIGH (0.95) |
| sub_BDCA60 | ~280B | BitVector::operator= (copy) | MEDIUM (0.85) |
| sub_BDCC20 | ~320B | BitVector::assignOR | HIGH (0.90) |
| sub_BDCDE0 | ~400B | BitVector::operator\|= | HIGH (0.95) |
| sub_BDCF40 | ~560B | BitVector::orIfChanged | HIGH (0.95) |
| sub_BDD140 | ~480B | BitVector::orWithAND | HIGH (0.90) |
| sub_BDD300 | ~490B | BitVector::orWithAndNot | HIGH (0.92) |
| sub_BDD560 | ~650B | BitVector::orWithAndNotIfChanged | HIGH (0.92) |
| sub_BDD8C0 | ~320B | BitVector::assignANDNOT | HIGH (0.88) |
| sub_BDDAA0 | ~400B | BitVector::operator^= | HIGH (0.95) |
| sub_BDDC00 | ~200B | BitVector::nextSetBit / clear | HIGH (0.90) |
| sub_BDDCB0 | ~180B | BitVector::prevSetBit | MEDIUM (0.80) |
| sub_BDDD40 | ~160B | BitVector::popcount | MEDIUM (0.80) |

Design Notes

Why two hash map implementations? The general hash map is optimized for string and pointer lookups with three mode-specific fast paths that avoid indirect calls. The CFG hash map is optimized for integer-keyed graph edge storage with chained nodes and embedded sub-hashes for multi-edge blocks. The general map uses open addressing with index arrays; the CFG map uses linked-list chaining with explicit node allocation. These are different design tradeoffs for different access patterns.

Why 32-bit bitvector words (not 64-bit)? The SSE2 instructions operate on 128-bit vectors regardless of word size, so 32-bit and 64-bit words yield the same SIMD throughput. The 32-bit choice minimizes wasted bits in the trailing word when bit_count is not a multiple of 64, and simplifies the alignment prologue calculation ((-(uintptr_t)ptr >> 2) & 3 for 4-byte words versus (-(uintptr_t)ptr >> 3) & 1 for 8-byte words). It also matches the tzcnt/bsr operand size used by the bit-scan operations, avoiding the need for 64-bit variants.

Why orWithAndNotIfChanged as a single function? This fusion eliminates three intermediate bitvectors that would be needed if the liveness transfer function were decomposed into separate AND-NOT, OR, and change-detection operations. On a function with 200 registers and 50 basic blocks, this saves ~60KB of intermediate allocations per iteration. More importantly, the two-phase scan-then-apply design avoids cache writes for converged blocks, which is the common case in later iterations.

Cross-References

Thread Pool & Concurrency

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas compiles multiple entry functions (kernels) in a single PTX input file. When --split-compile or --allow-expensive-optimizations is active, the compilation driver dispatches each kernel to a worker thread for parallel DAGgen+OCG+ELF+DebugInfo processing. The threading infrastructure lives in two address ranges: the TLS system at 0x4280xx--0x4286xx (front-end) and the thread pool at 0x1CB17xx--0x1CB1Axx (shared infrastructure).

| Component | Details |
|---|---|
| TLS accessor | sub_4280C0 (597 bytes, 3,928 callers, 280-byte struct) |
| TLS key init | ctor_001 at 0x4094C0 (static constructor) |
| TLS destructor | destr_function at 0x427F10 |
| Thread pool ctor | sub_1CB18B0 (184-byte pool struct) |
| Task submit | sub_1CB1A50 (24-byte task node) |
| Worker thread | start_routine at 0x1CB1780 |
| Wait-all | sub_1CB1AE0 |
| Pool destroy | sub_1CB1970 |
| CPU count | sub_1CB1890 (sysconf(_SC_NPROCESSORS_CONF)) |
| Mutex init helper | sub_428620 (recursive mutex factory) |
| Global mutex lock | sub_4286A0 (lazy-init + lock) |
| Jobserver client | sub_1CC7300 (GNU Make integration) |

Architecture

                         ┌─────────────────────────────────────────┐
                         │          Compilation Driver             │
                         │           sub_446240                    │
                         │                                         │
                         │  if (thread_count > 0):                 │
                         │    pool = sub_1CB18B0(thread_count)     │
                         │    for each kernel:                     │
                         │      snapshot config → 360-byte buffer  │
                         │      copy hash maps for isolation       │
                         │      sub_1CB1A50(pool, sub_436DF0, buf) │
                         │    sub_1CB1AE0(pool)   // wait-all      │
                         │    sub_1CB1970(pool)   // destroy       │
                         └────────────┬────────────────────────────┘
                                      │ enqueue tasks
                                      ▼
                     ┌─────────────────────────────────────┐
                     │       Thread Pool (184 bytes)       │
                     │                                     │
                     │  mutex ──────── guards all fields   │
                     │  cond_work ──── wake workers        │
                     │  cond_done ──── signal completion   │
                     │  work_queue ─── priority heap       │
                     │  pending ────── pending task count  │
                     │  shutdown ───── termination flag    │
                     └───┬──────┬──────┬───────────────────┘
                        │      │      │
                  ┌─────┘      │      └─────┐
                  ▼            ▼            ▼
            ┌──────────┐ ┌──────────┐ ┌──────────┐
            │ Worker 0 │ │ Worker 1 │ │ Worker N │
             │(detached)│ │(detached)│ │(detached)│
            └──────────┘ └──────────┘ └──────────┘
                  │            │            │
                  ▼            ▼            ▼
            ┌──────────┐ ┌──────────┐ ┌──────────┐
            │ TLS ctx  │ │ TLS ctx  │ │ TLS ctx  │
            │ 280 bytes│ │ 280 bytes│ │ 280 bytes│
            │ + pool   │ │ + pool   │ │ + pool   │
            └──────────┘ └──────────┘ └──────────┘

CLI Activation

Two options activate the thread pool:

| Option | Type | Behavior |
|---|---|---|
| --split-compile N | int | Set thread count directly. 0 = CPU count via sysconf(83). 1 = serial (pool not created). N > 1 = N worker threads. |
| --allow-expensive-optimizations | bool | Auto-enabled at -O2 and above. Enables the thread pool with an automatically determined thread count. |

Both flags ultimately result in a non-zero value at offset +668 in the compilation driver's state block. The driver checks this field to decide between sequential per-kernel iteration and thread pool dispatch.

Two internal options fine-tune pool behavior:

| Option | Type | Purpose |
|---|---|---|
| --threads-dynamic-scheduling | bool | Enable dynamic scheduling of thread pool tasks (work stealing vs static partitioning). |
| --threads-min-section-size | int | Minimum section size for thread pool work partitioning; prevents excessive task granularity. |

CPU Count Detection

// sub_1CB1890 -- returns the configured CPU count
__int64 sub_1CB1890() {
    return sysconf(83);   // _SC_NPROCESSORS_CONF on Linux
}

When --split-compile 0 is specified, the pool constructor receives the return value of sub_1CB1890() as its nmemb argument.

Thread-Local Storage (TLS)

The TLS system is the foundation of ptxas's concurrency model. Every thread -- main and workers alike -- gets its own 280-byte context struct, accessed through the single most-called function in the binary: sub_4280C0 (3,928 call sites).

Initialization: ctor_001 (0x4094C0)

The TLS key is created in a static constructor that runs before main:

// ctor_001 -- .init_array entry
int ctor_001() {
    if (!qword_29FE0A0) {            // one-time guard
        pthread_key_create(&key, destr_function);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&mutex, &attr);
        dword_29FE0F4 = sched_get_priority_max(SCHED_RR);
        dword_29FE0F0 = sched_get_priority_min(SCHED_RR);
        qword_29FE0A0 = &sentinel_node;  // marks "initialized"
        // ... initialize linked list sentinels
    }
    __cxa_atexit(cleanup_func, ...);
}

The SCHED_RR priority range is queried but never used for thread creation (all threads are created with default attributes). This appears to be vestigial infrastructure for priority-based scheduling that was never activated.

TLS Accessor: sub_4280C0

char *sub_4280C0() {
    if (!qword_29FE0A0)
        goto init_once;              // fallback init (race protection)
    
    char *result = pthread_getspecific(key);
    if (result)
        return result;               // fast path: already allocated
    
    char *ctx = malloc(0x118);       // 280 bytes
    memset(ctx, 0, 0x118);
    pthread_cond_init(ctx + 128, NULL);
    pthread_mutex_init(ctx + 176, NULL);
    sem_init(ctx + 216, 0, 0);
    
    // Insert into global doubly-linked list (under global mutex)
    pthread_mutex_lock(&mutex);
    ctx->prev = sentinel;
    ctx->next = sentinel->next;
    sentinel->next->prev = ctx;
    sentinel->next = ctx;
    pthread_mutex_unlock(&mutex);
    
    pthread_setspecific(key, ctx);
    return ctx;
}

The global doubly-linked list at offsets +256 (prev) and +264 (next) allows the system to enumerate all live TLS contexts. This is used during cleanup to destroy all thread contexts at program exit.

TLS Context Layout (280 bytes)

The full byte-level layout, verified against the constructor (sub_4280C0), destructor (destr_function), and the two diagnostic reporters (sub_42F590, sub_42FBA0):

| Offset | Size | Type | Purpose | Accessor |
|---|---|---|---|---|
| 0 | 1 | byte | has_warning_or_error flag -- set to 1 when severity > 2 | sub_42F590 (direct write) |
| 1 | 1 | byte | has_fatal_error flag -- set to 1 when severity > 4 | sub_42F590 (direct write) |
| 2 | 6 | -- | padding (zeroed by memset) | -- |
| 8 | 8 | jmp_buf* | Error recovery longjmp target (installed by sub_431ED0 before _setjmp) | sub_431ED0 (save/restore) |
| 16 | 8 | void* | Error descriptor pointer -- set to the faulting error descriptor on longjmp | sub_42F590 (write on fatal), sub_431ED0 (propagate on re-throw) |
| 24 | 8 | void* | Per-thread memory pool pointer (used by sub_424070) | sub_42BDD0 (swap), sub_4287D0 (read) |
| 32 | 8 | char* | Program name string (e.g. "ptxas") -- prepended to diagnostic messages | sub_430590 (set), sub_430570 (get) |
| 40 | 8 | char* | Diagnostic suffix string -- appended to message body when non-NULL | sub_42F590 (read, format as " %s") |
| 48 | 1 | byte | Info suppression flag -- suppresses severity-2 (info) messages | sub_42F590 (check) |
| 49 | 1 | byte | Diagnostic suppression flag -- suppresses severity-3 (warning) messages | sub_430560 (set) |
| 50 | 1 | byte | Werror promotion flag (--Werror) -- promotes warnings to errors | sub_430550 (set) |
| 51 | 1 | byte | Machine-readable annotation flag -- adds @E@/@W@/@O@/@I@ severity tags | sub_4305A0 (get) |
| 52 | 1 | byte | Multi-line continuation flag -- suppresses ". " prefix on wrapped lines | sub_4305C0 (set) |
| 53 | 75 | -- | padding (zeroed by memset) | -- |
| 128 | 48 | pthread_cond_t | Per-thread condition variable | constructor: pthread_cond_init |
| 176 | 40 | pthread_mutex_t | Per-thread mutex | constructor: pthread_mutex_init |
| 216 | 32 | sem_t | Semaphore for cross-thread signaling | constructor: sem_init(0, 0) |
| 248 | 8 | void* | Saved semaphore pointer (from pool, for destr_function to sem_post) | destr_function (read at qword[31]) |
| 256 | 8 | void* | Linked list prev pointer (global TLS chain) | constructor (qword[32]) |
| 264 | 8 | void* | Linked list next pointer (global TLS chain) | constructor (qword[33]) |
| 272 | 1 | byte | Destroyed flag (prevents double-destroy) | destr_function (set to 1) |
| 273 | 7 | -- | padding (zeroed by memset) | -- |

The fields fall into seven functional groups:

Error state (offsets 0--1). Two byte flags set by the diagnostic reporter sub_42F590. Byte 0 records whether any error-or-above diagnostic was emitted; byte 1 records fatal errors specifically. The compilation driver reads these after each kernel compilation to determine the process exit code.

Error recovery (offsets 8--16). A setjmp/longjmp mechanism for non-local error exits. sub_431ED0 saves the current jmp_buf* and error byte flags, installs a fresh jmp_buf, then enters the compiler. On a fatal diagnostic, sub_42F590 stores the error descriptor at offset +16 and calls longjmp to the target at offset +8. If no jmp_buf is installed, the handler calls sub_4275E0 (abort).

Per-thread allocator (offset 24). The most performance-critical field. The pool allocator sub_424070 reads this pointer on every allocation (accessed as sub_4280C0()[3]). When non-NULL, allocations go to the calling thread's own slab without any locking. sub_42BDD0 provides an atomic swap primitive that replaces the pool pointer and returns the old value -- used during pool migration at compilation boundaries. The field is read pervasively: most of the 3,928 call sites to sub_4280C0 are pool-allocator calls that need the current thread's arena.

Diagnostic context (offsets 32, 40). The program name at +32 (e.g. "ptxas") is prepended to all diagnostic messages. The suffix at +40 is appended after the message body. Both are set per-thread to support library mode where multiple tool names coexist in the same process.

Diagnostic flags (offsets 48--52). Five single-byte flags controlling diagnostic formatting and filtering. The info suppression flag (+48) silences informational messages. The diagnostic suppression flag (+49) silences warnings entirely. The Werror flag (+50) promotes warnings to errors. The annotation flag (+51) enables machine-readable severity tags (@E@, @W@, @O@, @I@). The continuation flag (+52) enables multi-line continuation mode where wrapped lines omit the ". " prefix.

Synchronization primitives (offsets 128--248). The condvar, mutex, and semaphore are used by the thread pool for task coordination and cross-thread signaling. The saved semaphore pointer at +248 is set by the pool when assigning work to a thread; on thread exit, the destructor calls sem_post on it to notify the pool's shutdown logic.

Global linked list (offsets 256--272). A doubly-linked list threading through all live TLS contexts, protected by the global mutex at 0x29FE0xx. Used by the atexit handler to enumerate and destroy all contexts. The destroyed flag at +272 prevents double-destroy when contexts move to the free list for recycling.

TLS Destructor: destr_function (0x427F10)

When a worker thread terminates, the POSIX TLS destructor fires:

void destr_function(char *ctx) {
    if (!ctx) return;
    
    pthread_mutex_lock(&global_mutex);
    
    if (ctx[272]) {                    // already destroyed?
        pthread_mutex_unlock(&global_mutex);
        return;
    }
    
    sem_t *sem = ctx->saved_semaphore; // offset +248
    
    // Unlink from global TLS chain
    ctx->prev->next = ctx->next;
    ctx->next->prev = ctx->prev;
    
    // Destroy sync primitives
    pthread_cond_destroy(ctx + 128);
    pthread_mutex_destroy(ctx + 176);
    sem_destroy(ctx + 216);
    
    // Move to free list (recycling, not freed)
    ctx[272] = 1;                      // mark destroyed
    ctx->next = free_list_sentinel;
    ctx->prev = free_list_head;
    free_list_head->next = ctx;
    free_list_head = ctx;
    
    pthread_mutex_unlock(&global_mutex);
    
    if (sem)
        sem_post(sem);                 // notify pool that worker exited
}

The destructor does not free() the 280-byte struct. Instead, it moves it to a free list rooted at a second sentinel node (unk_29FDC40 / unk_29FDD60). This is a deliberate choice: the destructor runs during pthread_exit() or thread detachment, and ptxas reuses TLS structs across multiple pool lifetimes within a single process invocation (library mode).

The sem_post at the end notifies the pool shutdown code (sub_1CB1970) that a worker has fully terminated.

Thread Pool Implementation

Pool Struct (184 bytes)

struct ThreadPool {           // calloc(1, 0xB8) = 184 bytes
    void       *thread_handles;  // +0:   array of (pthread_t, 16 bytes each)
    void       *work_queue;      // +8:   priority heap (from sub_1CBEC10)
    int32_t     pending;         // +16:  count of tasks awaiting execution
    // padding
    pthread_mutex_t mutex;       // +24:  guards all mutable pool state (40 bytes)
    pthread_cond_t  cond_work;   // +64:  broadcast when new work arrives (48 bytes)
    pthread_cond_t  cond_done;   // +112: signaled when all work completes (48 bytes)
    int64_t     active_count;    // +160: workers currently executing tasks
    int64_t     max_threads;     // +168: total worker count (= nmemb)
    int8_t      shutdown;        // +176: set to 1 to terminate all workers
};

Constructor: sub_1CB18B0

ThreadPool *sub_1CB18B0(size_t nmemb) {
    ThreadPool *pool = calloc(1, 0xB8);            // 184 bytes, zero-init
    pool->thread_handles = calloc(nmemb, 0x10);    // 16 bytes per thread
    pool->max_threads = nmemb;
    pool->pending = 0;
    
    pthread_mutex_init(&pool->mutex, NULL);         // default (non-recursive)
    pthread_cond_init(&pool->cond_work, NULL);
    pthread_cond_init(&pool->cond_done, NULL);
    
    // Create priority heap for task ordering
    pool->work_queue = sub_1CBEC10(sub_1CB1770, 0);
    
    // Spawn nmemb detached worker threads
    for (int i = 0; i < nmemb; i++) {
        pthread_create(&pool->thread_handles[i].tid, NULL,
                       start_routine, pool);
        pthread_detach(pool->thread_handles[i].tid);
    }
    return pool;
}

Workers are detached immediately after creation. This means the pool does not use pthread_join for cleanup -- instead, it relies on the cond_done / max_threads countdown protocol in sub_1CB1970.

Work Queue: Priority Heap

The work queue at pool offset +8 is not a simple linked list -- it is a binary min-heap (priority queue) backed by a dynamically-resized array.

Heap structure (32 bytes):

| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 8 | void** | Array of element pointers |
| 8 | 8 | int64 | Current element count |
| 16 | 8 | int64 | Allocated capacity |
| 24 | 8 | fn_ptr | Comparator function |

Constructor (sub_1CBEC10): Allocates the heap struct from the pool allocator, sets the comparator to sub_1CB1770 (which always returns 1, making the heap behave as a FIFO -- every parent "beats" every child, so new elements sink to the end).

Enqueue (sub_1CBECC0): Standard heap push with sift-up. Appends element, then bubbles up through parent comparisons. Auto-grows the backing array (doubles capacity) when full via sub_424C50 (realloc equivalent).

Dequeue (sub_1CBEDD0): Standard heap pop with sift-down. Extracts root, moves last element to root, then percolates down via comparator-guided child selection.

Since the comparator sub_1CB1770 unconditionally returns 1, the heap degenerates into FIFO order. Tasks are dispatched in submission order.

Task Nodes (24 bytes)

struct TaskNode {              // malloc(0x18) = 24 bytes
    void (*func)(void *arg);   // +0:  task function pointer
    void  *arg;                // +8:  argument passed to func
    void  *reserved;           // +16: zeroed (unused)
};

Task Submit: sub_1CB1A50

int sub_1CB1A50(ThreadPool *pool, void (*func)(void*), void *arg) {
    if (!func || !pool)
        return 0;
    
    TaskNode *task = malloc(0x18);     // 24 bytes
    task->func     = func;
    task->arg      = arg;
    task->reserved = NULL;
    
    pthread_mutex_lock(&pool->mutex);
    sub_1CBECC0(task, pool->work_queue);  // heap push
    ++pool->pending;
    pthread_cond_broadcast(&pool->cond_work);
    pthread_mutex_unlock(&pool->mutex);
    return 1;
}

The broadcast wakes all sleeping workers, not just one. This is correct for the use case: multiple tasks are typically submitted in a burst (one per kernel), and all idle workers should attempt to pick up work immediately.

Worker Thread: start_routine (0x1CB1780)

void *start_routine(ThreadPool *pool) {
    pthread_mutex_t *mtx = &pool->mutex;
    pthread_cond_t  *done = &pool->cond_done;
    
    while (1) {
        pthread_mutex_lock(mtx);
        
        // Wait for work or shutdown
        while (pool->pending == 0 && !pool->shutdown)
            pthread_cond_wait(&pool->cond_work, mtx);
        
        if (pool->shutdown)
            goto exit;
        
        // Dequeue task
        TaskNode *task = (TaskNode *)sub_1CBEDD0(pool->work_queue);
        --pool->pending;
        ++pool->active_count;
        pthread_mutex_unlock(mtx);
        
        // Execute task outside the lock
        if (task) {
            task->func(task->arg);
            free(task);
        }
        
        // Post-execution bookkeeping
        pthread_mutex_lock(mtx);
        --pool->active_count;
        if (!pool->shutdown && pool->active_count == 0 && pool->pending == 0)
            pthread_cond_signal(done);   // all work complete
        pthread_mutex_unlock(mtx);
    }
    
exit:
    --pool->max_threads;               // decrement live worker count
    pthread_cond_signal(done);          // notify shutdown waiter
    pthread_mutex_unlock(mtx);
    return NULL;
}

The completion signal on cond_done fires only when both active_count == 0 and pending == 0. This is the condition that sub_1CB1AE0 (wait-all) blocks on.

Wait-All: sub_1CB1AE0

Blocks until all submitted tasks complete:

void sub_1CB1AE0(ThreadPool *pool) {
    if (!pool) return;
    
    pthread_mutex_lock(&pool->mutex);
    while (pool->pending > 0 ||
           (pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
        pthread_cond_wait(&pool->cond_done, &pool->mutex);
    pthread_mutex_unlock(&pool->mutex);
}

In the non-shutdown case, it waits for pending == 0 && active_count == 0. During shutdown, it waits for max_threads == 0 (all workers have exited).

Destroy: sub_1CB1970

Initiates graceful shutdown:

void sub_1CB1970(ThreadPool *pool) {
    if (!pool) return;
    
    pthread_mutex_lock(&pool->mutex);
    
    // Drain any remaining queued tasks
    sub_1CBEBF0(pool->work_queue);     // free heap contents
    pool->pending = 0;
    pool->shutdown = 1;
    
    // Wake all workers so they see the shutdown flag
    pthread_cond_broadcast(&pool->cond_work);
    pthread_mutex_unlock(&pool->mutex);
    
    // Wait for all workers to exit
    pthread_mutex_lock(&pool->mutex);
    while (pool->pending > 0 ||
           (pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
        pthread_cond_wait(&pool->cond_done, &pool->mutex);
    pthread_mutex_unlock(&pool->mutex);
    
    // Destroy synchronization primitives
    pthread_mutex_destroy(&pool->mutex);
    pthread_cond_destroy(&pool->cond_work);
    pthread_cond_destroy(&pool->cond_done);
    free(pool->thread_handles);
    free(pool);
}

The shutdown sequence is: (1) drain the queue, (2) set the shutdown flag, (3) broadcast on cond_work so all sleeping workers wake up and check the flag, (4) wait on cond_done until max_threads reaches zero (each exiting worker decrements it), (5) destroy synchronization primitives and free memory.
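The same create → submit → wait-all → destroy protocol can be modeled with a compact self-contained pool. This is an illustrative sketch of the recovered predicates (pending/active/live counters, shutdown flag, broadcast wakeup), not the recovered code: it substitutes a plain FIFO list for the heap and joinable threads for detached ones, and all names are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct Task {
    void (*fn)(void *);
    void *arg;
    struct Task *next;
} Task;

typedef struct Pool {
    pthread_mutex_t mu;
    pthread_cond_t work, done;
    Task *head, *tail;
    int pending;    /* queued, not yet picked up */
    int active;     /* currently executing */
    int live;       /* workers not yet exited (mirrors max_threads) */
    int shutdown;
    int nthreads;
    pthread_t *tids;
} Pool;

static void *worker(void *arg) {
    Pool *pl = (Pool *)arg;
    pthread_mutex_lock(&pl->mu);
    for (;;) {
        while (pl->pending == 0 && !pl->shutdown)
            pthread_cond_wait(&pl->work, &pl->mu);
        if (pl->shutdown)
            break;
        Task *t = pl->head;                  /* FIFO dequeue */
        pl->head = t->next;
        if (pl->head == NULL)
            pl->tail = NULL;
        pl->pending--;
        pl->active++;
        pthread_mutex_unlock(&pl->mu);
        t->fn(t->arg);                       /* run outside the lock */
        free(t);
        pthread_mutex_lock(&pl->mu);
        pl->active--;
        if (pl->active == 0 && pl->pending == 0)
            pthread_cond_signal(&pl->done);  /* all work complete */
    }
    pl->live--;                              /* countdown for destroy */
    pthread_cond_signal(&pl->done);
    pthread_mutex_unlock(&pl->mu);
    return NULL;
}

Pool *pool_create(int n) {
    Pool *pl = calloc(1, sizeof *pl);
    pthread_mutex_init(&pl->mu, NULL);
    pthread_cond_init(&pl->work, NULL);
    pthread_cond_init(&pl->done, NULL);
    pl->nthreads = pl->live = n;
    pl->tids = calloc((size_t)n, sizeof *pl->tids);
    for (int i = 0; i < n; i++)
        pthread_create(&pl->tids[i], NULL, worker, pl);
    return pl;
}

int pool_submit(Pool *pl, void (*fn)(void *), void *arg) {
    Task *t = malloc(sizeof *t);
    t->fn = fn; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&pl->mu);
    if (pl->tail) pl->tail->next = t; else pl->head = t;
    pl->tail = t;
    pl->pending++;
    pthread_cond_broadcast(&pl->work);       /* wake every idle worker */
    pthread_mutex_unlock(&pl->mu);
    return 1;
}

void pool_wait(Pool *pl) {                   /* the wait-all predicate */
    pthread_mutex_lock(&pl->mu);
    while (pl->pending > 0 ||
           (pl->shutdown ? pl->live > 0 : pl->active > 0))
        pthread_cond_wait(&pl->done, &pl->mu);
    pthread_mutex_unlock(&pl->mu);
}

void pool_destroy(Pool *pl) {
    pthread_mutex_lock(&pl->mu);
    pl->shutdown = 1;
    pthread_cond_broadcast(&pl->work);       /* workers see the flag */
    while (pl->live > 0)
        pthread_cond_wait(&pl->done, &pl->mu);
    pthread_mutex_unlock(&pl->mu);
    for (int i = 0; i < pl->nthreads; i++)
        pthread_join(pl->tids[i], NULL);
    pthread_mutex_destroy(&pl->mu);
    pthread_cond_destroy(&pl->work);
    pthread_cond_destroy(&pl->done);
    free(pl->tids);
    free(pl);
}

/* demo task used in the usage example below */
static int hits;
static void bump(void *arg) { (void)arg; __atomic_fetch_add(&hits, 1, __ATOMIC_RELAXED); }
```

The main deliberate departure from the recovered design: `pool_destroy` joins its (joinable) workers rather than relying on the detached-thread `max_threads` countdown.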

Multi-Kernel Parallel Compilation

The compilation driver (sub_446240) decides between serial and parallel execution based on the thread count at offset +668:

Serial Path (default)

for each kernel in compile_unit:
    sub_43A400(kernel)          // target configuration
    sub_43CC70(kernel)          // DAGgen → OCG → ELF → DebugInfo

Parallel Path (--split-compile / --allow-expensive-optimizations)

pool = sub_1CB18B0(thread_count)

for each kernel in compile_unit:
    sub_43A400(kernel)          // target config (still serial)
    buf = allocate 360-byte work buffer from pool
    snapshot 15 config vectors into buf
    deep-copy hash maps for thread isolation
    sub_1CB1A50(pool, sub_436DF0, buf)

sub_1CB1AE0(pool)               // block until all kernels done
sub_1CB1970(pool)               // destroy pool

Each task runs sub_436DF0, which performs the per-kernel backend pipeline:

  1. Set thread-local program name via sub_430590
  2. Acquire jobserver token (if --jobserver active): sub_1CC6EC0()
  3. Record start time
  4. sub_432500 -- run the full DAGgen+OCG pipeline
  5. Record end time, write to timing array at a1->timing[112 * cu_index]
  6. Update peak wall-clock counter (under lock via sub_607D70 / sub_607D90)
  7. Release jobserver token: sub_1CC7040()
  8. Free the 360-byte work buffer

Thread Isolation Strategy

Each worker thread operates on an isolated copy of compilation state:

| Resource | Isolation Mechanism |
|---|---|
| Memory pool | Per-thread pool pointer at TLS offset +24. Each thread's allocations go to a separate arena, eliminating heap contention. |
| Error state | Per-thread flags at TLS offsets 0--1 (error bytes), 8 (longjmp target), 16 (error descriptor), 48--52 (diagnostic control). Each thread tracks its own errors independently. |
| Hash maps | Deep-copied from the master compilation context before task submission. Workers never share mutable lookup tables. |
| Config vectors | Snapshot of 15 configuration vectors into a 360-byte per-task buffer. Workers read their own copy. |
| Timing data | Per-kernel slots in a pre-allocated timing array (112 bytes per kernel). Each worker writes only to its own kernel's slot. |

The only shared mutable state during parallel compilation is the peak wall-clock counter at offset +224 in the compilation driver's state block, protected by a global lock (lock index 6, via sub_607D70/sub_607D90). This lock is acquired briefly at the end of each per-kernel task to compare-and-update the maximum observed wall-clock time.
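The peak-counter update reduces to a short lock-protected compare-and-update. A minimal sketch, with illustrative names; the lock/unlock calls stand in for `sub_607D70(6)` / `sub_607D90(6)`:

```c
#include <pthread.h>

static pthread_mutex_t peak_mu = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long peak_ns;   /* analogue of the counter at +224 */

/* Compare-and-update the maximum observed per-kernel wall-clock time. */
void record_peak(unsigned long long elapsed_ns) {
    pthread_mutex_lock(&peak_mu);    /* stands in for sub_607D70(6) */
    if (elapsed_ns > peak_ns)
        peak_ns = elapsed_ns;
    pthread_mutex_unlock(&peak_mu);  /* stands in for sub_607D90(6) */
}
```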

GNU Make Jobserver Integration

When both --jobserver and --split-compile are active, ptxas participates in GNU Make's parallel job token protocol. The compilation driver creates the jobserver client object before spawning the thread pool, and each per-kernel worker task must acquire a token before starting and release it when done. This throttles ptxas to never exceed the make -j N budget, even when --split-compile would otherwise use more threads.

Jobserver Object (296 bytes)

The jobserver state is a 296-byte heap object allocated once per compilation, stored at global qword_29FE128. The constructor (sub_1CC7AF0) is called from the compilation driver (sub_4428E0) when *(_BYTE*)(context + 993) is set (the --jobserver CLI flag).

| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 4 | int32 (atomic) | State code (0=OK; see state table below) |
| 4 | 4 | int32 | Saved errno from last failed syscall |
| 8 | 1 | byte | Implicit token available (1=unconsumed) |
| 16 | 8 | int64 | Pending waiters (threads blocked in acquire) |
| 24 | 8 | int64 | Active count (tokens currently held) |
| 32 | 8 | void* | Token buffer base (std::vector<char> data) |
| 40 | 8 | void* | Token buffer cursor (stack top) |
| 48 | 8 | void* | Token buffer capacity end |
| 56 | 40 | pthread_mutex_t | Inner mutex (guards token accounting) |
| 96 | 40 | pthread_mutex_t | Write mutex (guards write() to Make pipe) |
| 136 | 48 | pthread_cond_t | Condition variable (wakes acquire waiters and reader thread) |
| 184 | 1 | byte | Token-ready flag (set by reader thread / release handoff) |
| 185 | 1 | byte | Last byte read from Make pipe |
| 188 | 4 | int32 | Read fd (Make pipe/FIFO read end) |
| 192 | 4 | int32 | Write fd (Make pipe/FIFO write end) |
| 196 | 4 | int32 | Internal pipe read fd (shutdown wakeup) |
| 200 | 4 | int32 | Internal pipe write fd (shutdown wakeup) |
| 204 | 1 | byte | Opened-fds flag (1=ptxas opened the Make fds itself) |
| 205 | 1 | byte | Shutdown flag |
| 208 | 8 | void* | Reader thread handle (std::thread) |
| 216 | 80 | bytes | Outer mutexes (serializing full acquire/release operations) |

MAKEFLAGS Parser: sub_1CC7300

Called during object construction to detect the Make jobserver channel:

// sub_1CC7300 -- parse MAKEFLAGS, open pipe/FIFO
void sub_1CC7300(JobserverObject *obj) {
    char *flags = getenv("MAKEFLAGS");
    if (!flags) {
        CAS(&obj->state, 5, 0);       // no MAKEFLAGS
        return;
    }
    std::string s(flags);
    size_t pos = s.find("--jobserver-auth=");
    if (pos == npos) {
        CAS(&obj->state, 6, 0);       // no --jobserver-auth=
        return;
    }
    size_t val = pos + 17;             // skip "--jobserver-auth="

    if (s.substr(val, 5) == "fifo:") {
        // --- FIFO mode ---
        std::string path = s.substr(val + 5, next_space);
        int fd = open(path.c_str(), O_RDWR | O_NONBLOCK);  // 0x802
        if (fd == -1) { CAS(&obj->state, 7, 0); return; }
        obj->read_fd  = fd;            // same fd for both
        obj->write_fd = fd;
        obj->opened_fds = 1;
    } else {
        // --- Pipe mode ---
        // parse "R,W" -- e.g. "3,4"
        std::string r_str = s.substr(val, comma_pos - val);
        std::string w_str = s.substr(comma_pos + 1, ...);
        // validate: digits only
        if (r_str.find_first_not_of("0123456789") != npos ||
            w_str.find_first_not_of("0123456789") != npos) {
            CAS(&obj->state, 7, 0); return;
        }
        int rd = dup(stoi(r_str));     // private copy
        if (fcntl(rd, F_SETFD, FD_CLOEXEC) == -1) {
            CAS(&obj->state, 7, 0); return;
        }
        int wd = dup(stoi(w_str));
        if (fcntl(wd, F_SETFD, FD_CLOEXEC) == -1) {
            close(rd);
            CAS(&obj->state, 7, 0); return;
        }
        obj->read_fd  = rd;
        obj->write_fd = wd;
        obj->opened_fds = 1;
    }
}

| Protocol | --jobserver-auth= value | Detection | fd Setup |
|---|---|---|---|
| FIFO | fifo:/path/to/fifo | Prefix match on fifo: | open(path, O_RDWR\|O_NONBLOCK) -- single fd for both read and write |
| Pipe | R,W (e.g. 3,4) | Comma-separated integers after auth= | dup() each fd + fcntl(F_SETFD, FD_CLOEXEC) -- prevents fd leak to children |

Object Construction: sub_1CC7AF0

After sub_1CC7300 succeeds (state == 0), the constructor continues:

  1. Creates an internal wakeup pipe via pipe() -- fds stored at +196/+200
  2. Spawns the reader thread (sub_1CC6720) -- passed as a std::thread functor via off_2406838
  3. Pre-allocates the token buffer vector to hold thread_count bytes

If state is 5 or 6 (no MAKEFLAGS, no auth string), the caller (sub_4428E0) emits: "GNU Jobserver support requested, but no compatible jobserver found. Ignoring '--jobserver'" and proceeds without throttling.

Reader Thread: sub_1CC6720

A dedicated background thread that reads tokens from the Make pipe/FIFO and buffers them for the acquire function:

loop:
    if state != 0 or shutdown → exit

    lock(mutex_inner)
    while pending_waiters == 0 and not shutdown:
        cond_wait(cond, mutex_inner)     // sleep until someone needs a token
    unlock(mutex_inner)

    fd_set = { read_fd, internal_pipe_read }
    select(max_fd + 1, &fd_set, NULL, NULL, NULL)    // block

    if shutdown → exit

    n = read(read_fd, &byte, 1)
    if n == 1:
        if pending_waiters > 0:
            lock(mutex_inner + mutex_write)
            push byte onto token_buffer
            token_ready = 1
            unlock(mutex_write)
            cond_signal(cond)            // wake one acquire waiter
        else:
            write(write_fd, &byte, 1)    // no waiter → return token immediately
    else if errno == EAGAIN:
        continue                         // expected for non-blocking fd
    else:
        state = 11; exit                 // I/O error

The select() monitors two fds simultaneously: the Make pipe (for incoming tokens) and the internal wakeup pipe (for shutdown notification). The internal pipe avoids a race between checking the shutdown flag and blocking in select().
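The two-fd wakeup pattern is the classic self-pipe trick. A minimal model, with hypothetical names, of how a single byte written to the internal pipe unblocks the select:

```c
#include <sys/select.h>
#include <unistd.h>

/* Block until either the token fd or the internal wakeup pipe is
 * readable. Returns 1 if the wakeup pipe fired (shutdown requested),
 * 0 if a token byte is ready, -1 on select() error. */
int wait_readable(int token_fd, int wakeup_fd) {
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(token_fd, &rfds);
    FD_SET(wakeup_fd, &rfds);
    int maxfd = token_fd > wakeup_fd ? token_fd : wakeup_fd;
    if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
        return -1;
    return FD_ISSET(wakeup_fd, &rfds) ? 1 : 0;
}
```

Because select() itself observes the write to the internal pipe, there is no window where the thread checks the shutdown flag and then blocks forever.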

Token Acquire: sub_1CC6EC0

Called by each per-kernel worker before compilation begins. Returns 0 on success.

int sub_1CC6EC0(JobserverObject *obj) {
    if (!obj) return 4;
    lock(outer_mutex_0);
    if (obj->state) { unlock; return obj->state; }

    lock(mutex_inner);
    if (obj->implicit_token_available) {
        // Fast path: consume the implicit token (no pipe I/O)
        obj->implicit_token_available = 0;
        obj->active_count++;
        unlock_all;
        return 0;
    }
    // Slow path: wait for reader thread to supply a token
    obj->pending_waiters++;
    if (obj->pending_waiters == 1)
        cond_signal(cond);               // wake reader thread
    while (!obj->token_ready && !obj->shutdown)
        cond_wait(cond, mutex_inner);
    if (obj->shutdown) { unlock_all; return 3; }
    obj->token_ready = 0;
    obj->pending_waiters--;
    obj->active_count++;
    unlock_all;
    return 0;
}

The implicit token is the standard GNU Make convention: the parent Make gives the first child an implicit token (no byte in the pipe). The first worker to call acquire consumes it for free; subsequent workers must wait for bytes to be read from the pipe.

Token Release: sub_1CC7040

Called by each per-kernel worker after compilation completes. Returns 0 on success.

int sub_1CC7040(JobserverObject *obj) {
    if (!obj) return 4;
    lock(outer_mutex_1);
    if (obj->state) { unlock; return obj->state; }

    lock(mutex_inner);
    lock(mutex_write);
    if (token_buffer not empty) {
        // Path A: write a buffered byte back to Make pipe
        byte = pop(token_buffer);
        if (write(obj->write_fd, &byte, 1) == 1) {
            obj->active_count--;
            unlock_all;
            return 0;
        }
        // write error → set state 11 or 2
    }
    unlock(mutex_write);

    if (obj->pending_waiters > 0) {
        // Path B: hand off directly to a waiting acquirer
        obj->token_ready = 1;
        obj->active_count--;
        cond_signal(cond);
        unlock_all;
        return 0;
    }
    if (!obj->implicit_token_available && obj->active_count == 1) {
        // Path C: return the implicit token
        obj->implicit_token_available = 1;
        obj->active_count = 0;
        unlock_all;
        return 0;
    }
    // Protocol error (double-free)
    CAS(&obj->state, 12, 0);
    unlock_all;
    return 12;
}

Release has three paths, in priority order:

| Path | Condition | Action |
|---|---|---|
| A | Token buffer non-empty | Pop byte, write() back to Make pipe |
| B | No buffered token but waiters exist | Set token_ready, signal condvar (avoids pipe round-trip) |
| C | No buffered token, no waiters, last active | Restore implicit token flag |
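The acquire/release accounting — N-1 pipe bytes plus one implicit token giving N total slots — can be modeled single-threaded. This is an illustrative sketch of the protocol only; the struct layout and names are hypothetical, and the reader thread, condvars, and error states are omitted:

```c
#include <unistd.h>

typedef struct {
    int rd, wr;           /* token pipe (stands in for the Make fds) */
    int implicit_free;    /* 1 = implicit token unconsumed */
    int pipe_held;        /* pipe tokens currently held by us */
} JS;

int js_init(JS *js, int n_slots) {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    js->rd = fds[0]; js->wr = fds[1];
    js->implicit_free = 1;
    js->pipe_held = 0;
    char tok = '+';
    for (int i = 0; i < n_slots - 1; i++)    /* Make pre-loads N-1 tokens */
        if (write(js->wr, &tok, 1) != 1) return -1;
    return 0;
}

int js_acquire(JS *js) {
    if (js->implicit_free) {                 /* fast path: no pipe I/O */
        js->implicit_free = 0;
        return 0;
    }
    char tok;
    if (read(js->rd, &tok, 1) != 1)          /* blocks while all slots busy */
        return -1;
    js->pipe_held++;
    return 0;
}

int js_release(JS *js) {
    if (js->pipe_held > 0) {                 /* return a byte to the pipe */
        char tok = '+';
        js->pipe_held--;
        return write(js->wr, &tok, 1) == 1 ? 0 : -1;
    }
    js->implicit_free = 1;                   /* restore the implicit token */
    return 0;
}
```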

Per-Kernel Worker Integration

In sub_436DF0 (the per-kernel compilation task submitted to the thread pool):

void sub_436DF0(int64_t *task) {
    sub_430590("ptxas", kernel_name);     // set TLS program name
    if (task[5] && sub_1CC6EC0(task[5]))  // acquire token if jobserver active
        sub_42F590(FATAL);                // acquire failed → fatal error
    // ... sub_432500(): full DAGgen + OCG pipeline ...
    if (!task[5] || !sub_1CC7040(task[5]))  // release token
        return;                             // normal return
    sub_42F590(FATAL);                    // release failed → fatal error
}

task[5] is populated from qword_29FE128 during task dispatch in sub_4428E0. When --jobserver is not active, task[5] == 0 and both acquire/release calls are skipped.

Destroy: sub_1CC6C20

Called after sub_1CB1AE0 (wait-all) and sub_1CB1970 (pool destroy) complete:

  1. Set shutdown flag (+205) via _InterlockedCompareExchange8
  2. Lock inner mutex, signal condvar (wake reader thread), unlock
  3. Write 1 byte to internal pipe write end (+200) -- unblocks select() in reader thread
  4. Join reader thread
  5. Lock inner mutex, drain all buffered tokens by writing each byte back to write_fd
  6. Unlock inner mutex
  7. Close Make fds if opened_fds is set (for FIFO: close once since read_fd == write_fd; for pipe: close both if different)
  8. Close internal pipe fds (+196, +200)
  9. Destroy condvar, free token buffer, free 296-byte object

State Machine

All state transitions use _InterlockedCompareExchange(state, new_value, 0) -- only the first error sticks; subsequent errors are silently dropped.

| State | Meaning | Set by |
|---|---|---|
| 0 | OK (operational) | Constructor |
| 2 | Unexpected I/O (short write/read) | Release, reader thread |
| 5 | No MAKEFLAGS environment variable | sub_1CC7300 |
| 6 | No --jobserver-auth= in MAKEFLAGS | sub_1CC7300 |
| 7 | open()/dup()/fcntl() failed | sub_1CC7300 |
| 11 | I/O error (errno recorded at +4) | Reader thread, release, constructor |
| 12 | Protocol error (double-free of token) | Release |

Throttling Semantics

With make -jN and --split-compile M where M > N:

ptxas creates M worker threads in the pool
but only N-1 pipe tokens exist + 1 implicit token = N total
workers that cannot acquire a token block in cond_wait
→ at most N kernels compile simultaneously, matching Make's budget
→ as each kernel finishes and releases its token, a blocked worker wakes

Without --jobserver, all M workers run freely with no external throttling.

Pool Allocator Thread Safety

The pool allocator (sub_424070) achieves thread safety through a combination of per-thread arenas and a global mutex:

  1. Per-thread arena: The TLS context at offset +24 holds a pointer to the current thread's memory pool. sub_424070 reads sub_4280C0()[3] to obtain this pointer. When non-NULL, allocations come from the thread-local slab without any locking.

  2. Global pool mutex: The pool struct contains a pthread_mutex_t at offset +7128 (within the ~7,200-byte pool descriptor). This mutex is acquired only for operations that modify the global pool state: slab acquisition from the parent pool, large-block allocation via mmap/malloc, and pool destruction.

  3. Slab-level lockfree: Within a thread-local slab (56-byte descriptor), bump-pointer allocation requires no synchronization. The allocator advances a pointer and returns; only when the slab is exhausted does it acquire the global lock to request a new slab.
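The lock-free fast path reduces to a pointer advance. A minimal sketch under stated assumptions: field names are illustrative (the recovered slab descriptor is 56 bytes), and the global-lock refill path is modeled simply as returning NULL:

```c
#include <stddef.h>

typedef struct {
    unsigned char *cur;   /* next free byte in the thread-local slab */
    unsigned char *end;   /* one past the end of the slab */
} Slab;

/* Bump-pointer allocation: no synchronization on the fast path. */
void *slab_alloc(Slab *s, size_t n) {
    n = (n + 7) & ~(size_t)7;            /* 8-byte alignment */
    if ((size_t)(s->end - s->cur) < n)
        return NULL;                     /* exhausted: the real allocator
                                            takes the global pool mutex
                                            here and requests a new slab */
    void *p = s->cur;
    s->cur += n;
    return p;
}
```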

Recursive Mutex Pattern

All mutexes created by sub_428620 (the mutex factory used throughout ptxas) are PTHREAD_MUTEX_RECURSIVE:

bool sub_428620(pthread_mutex_t *mutex) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    return pthread_mutex_init(mutex, &attr) == 0;
}

This is necessary because ptxas functions may re-enter locking code paths through the diagnostic emitter (sub_42FBA0) or pool allocator (sub_424070), both of which are called from virtually everywhere.

Global Synchronization Objects

Global TLS Mutex (mutex at 0x29FE0xx)

Protects the global doubly-linked list of TLS contexts. Acquired during:

  • TLS context allocation (sub_4280C0)
  • TLS context destruction (destr_function)
  • sub_4286A0 (explicit lock for cross-thread operations)

This is a recursive mutex (initialized in ctor_001).

Global Lock Array (sub_607D70 / sub_607D90)

A global array of locks accessed by index. Lock index 6 is used to protect the peak wall-clock counter during parallel compilation. The total number of lock indices and their complete purpose map is not fully recovered; index 6 is the only one observed in the threading hot path.

sub_607D70(6);    // acquire lock 6
// update peak wall-clock
sub_607D90(6);    // release lock 6

Function Map

| Address | Size | Callers | Identity |
|---|---|---|---|
| 0x4094C0 | 204 B | 0 | ctor_001 -- TLS key + global mutex init (.init_array) |
| 0x427F10 | 376 B | 0 | destr_function -- TLS destructor (via pthread_key_create) |
| 0x4280C0 | 597 B | 3,928 | TLS context accessor (280-byte struct, lazy alloc) |
| 0x428600 | 27 B | -- | Mutex destroy + free wrapper |
| 0x428620 | 62 B | -- | Recursive mutex init factory |
| 0x428670 | 6 B | -- | pthread_mutex_destroy PLT thunk |
| 0x428680 | 6 B | -- | pthread_mutex_lock PLT thunk |
| 0x428690 | 6 B | -- | pthread_mutex_unlock PLT thunk |
| 0x4286A0 | 163 B | -- | Global mutex lazy-init + lock |
| 0x1CB1770 | 8 B | 1 | Priority comparator (always returns 1 = FIFO) |
| 0x1CB1780 | 202 B | 0 | start_routine -- worker thread main loop |
| 0x1CB1890 | 11 B | -- | CPU count via sysconf(_SC_NPROCESSORS_CONF) |
| 0x1CB18B0 | 159 B | -- | Thread pool constructor (184-byte struct) |
| 0x1CB1970 | 168 B | -- | Thread pool graceful destroy |
| 0x1CB1A50 | 90 B | -- | Task submit (24-byte task node, heap push, broadcast) |
| 0x1CB1AE0 | 109 B | -- | Wait-all (block until pending=0, active=0) |
| 0x1CBEBF0 | -- | 1 | Heap drain (free all queued elements) |
| 0x1CBEC10 | -- | 1 | Priority heap constructor (32-byte struct) |
| 0x1CBECC0 | -- | -- | Priority heap push (sift-up) |
| 0x1CBEDD0 | -- | -- | Priority heap pop (sift-down) |
| 0x1CC6720 | ~700 B | 1 | Jobserver reader thread (select loop, pushes tokens to buffer) |
| 0x1CC6C20 | ~300 B | 1 | Jobserver destroy (drain tokens, close fds, free 296-byte object) |
| 0x1CC6EC0 | 384 B | 1 | Jobserver token acquire (consume implicit or wait for pipe token) |
| 0x1CC7040 | ~280 B | 1 | Jobserver token release (write byte back or hand off to waiter) |
| 0x1CC7300 | 2,027 B | 1 | Jobserver MAKEFLAGS parser (FIFO vs pipe detection, fd setup) |
| 0x1CC7AF0 | ~700 B | 1 | Jobserver constructor (alloc 296B, spawn reader thread) |

Cross-References

SASS Opcode Catalog

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Complete reference table of all SASS opcode mnemonics known to ptxas v13.0.88. Extracted from the ROT13-encoded opcode name table in the InstructionInfo constructor (sub_7A5D10, vtable off_233ADC0). The table stores exactly 322 named entries (indices 0--321) at object offset +0x1058, with each entry occupying 16 bytes (8-byte string pointer + 8-byte length). A parallel constructor sub_BE7390 initializes an identical table. Immediately after the name table, a 322-element identity-mapped index array (0x508 bytes of 4-byte integers 0..321) is bulk-copied from unk_21C0E00 to object offset +0x2478; this is a separate data structure (encoding category map), not additional opcode names.

All SASS mnemonic strings in the ptxas binary are ROT13-obfuscated. The cleartext names shown here are the result of applying ROT13 decoding to the stored strings.
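ROT13 is self-inverse, so the same routine both encodes and decodes the stored mnemonics. A minimal sketch; letters rotate 13 positions while digits and `_` pass through unchanged, matching how the table entries round-trip:

```c
#include <ctype.h>

/* ROT13 a NUL-terminated string in place. */
void rot13(char *s) {
    for (; *s; s++) {
        if (isupper((unsigned char)*s))
            *s = (char)('A' + (*s - 'A' + 13) % 26);
        else if (islower((unsigned char)*s))
            *s = (char)('a' + (*s - 'a' + 13) % 26);
        /* digits and '_' are left untouched */
    }
}
```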

Table Organization

Opcodes are partitioned by SM generation through explicit boundary markers embedded in the table:

| Index | Marker | Range |
|---|---|---|
| 0--135 | Base ISA | sm_70 (Volta) and all later architectures |
| 136 | SM70_LAST | End of sm_70 range |
| 137--171 | sm_73+ | Volta extensions (uniform registers, tensor shapes) |
| 171 | SM73_LAST | End of sm_73 range |
| 172--193 | sm_82+ | Ampere additions (MMA shapes, gather, REDUX) |
| 193 | SM82_LAST | End of sm_82 range |
| 194--199 | sm_86+ | Ampere+ additions (conversion packed, SUQUERY) |
| 199 | SM86_LAST | End of sm_86 range |
| 200--205 | sm_89+ | Ada Lovelace additions (QMMA shapes) |
| 205 | SM89_LAST | End of sm_89 range |
| 206--252 | sm_90+ | Hopper additions (GMMA, CGA barriers, fences, TMA) |
| 252 | SM90_LAST | End of sm_90 range |
| 253--280 | sm_100+ | Blackwell datacenter additions (UTC, QFMA4, MEMSET) |
| 280 | SM100_LAST | End of sm_100 range |
| 281--320 | sm_104+ | Blackwell Ultra additions (uniform FP, new conversions) |
| 320 | SM104_LAST | End of sm_104 range |
| 321 | LAST | Sentinel (end of table) |

Each SM generation only adds opcodes; no base opcodes are removed. The Ori IR uses the 12-bit index into this table as the base opcode field (instruction offset +72, lower 12 bits). Bits 12--13 of the opcode word encode sub-operation modifiers (.HI, .WIDE, etc.) and are stripped by the 0xFFFFCFFF mask to recover the base index.
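Splitting the opcode word is a pair of masks. Illustrative helpers (the function names are not recovered symbols): the low 12 bits index the name table, and bits 12--13 carry the sub-operation modifier cleared by `0xFFFFCFFF`:

```c
#include <stdint.h>

static inline uint32_t base_index(uint32_t w)      { return w & 0xFFFu; }         /* bits 0-11 */
static inline uint32_t sub_modifier(uint32_t w)    { return (w >> 12) & 0x3u; }   /* bits 12-13 */
static inline uint32_t strip_modifiers(uint32_t w) { return w & 0xFFFFCFFFu; }    /* clear 12-13 */
```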

Encoding Format Summary

SASS instructions use three widths, selected per opcode during encoding:

| Format Code | Width | Usage |
|---|---|---|
| 0x1 | 64-bit | Simple moves, branches, barriers, NOPs, short-form ALU |
| 0x2 | 128-bit | Most ALU, load/store, texture, tensor core, atomics |
| 0x8 | 256-bit | IMAD.WIDE variants with 16 constant-bank operand slots |

The 3-level opcode hierarchy within the encoded instruction word is: major (9 bits, at bits [8:16]) / minor (8 bits, at bits [17:24]) / sub-opcode (7 bits, at bits [25:31]). See the encoding page for full details.
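The three levels can be pulled out of the low word of an encoded instruction with shifts and masks. A hedged sketch using the bit positions stated above (treating both endpoints as inclusive); helper names are hypothetical:

```c
#include <stdint.h>

static inline uint32_t opc_major(uint64_t w) { return (uint32_t)((w >> 8)  & 0x1FF); } /* 9 bits [8:16] */
static inline uint32_t opc_minor(uint64_t w) { return (uint32_t)((w >> 17) & 0xFF);  } /* 8 bits [17:24] */
static inline uint32_t opc_sub(uint64_t w)   { return (uint32_t)((w >> 25) & 0x7F);  } /* 7 bits [25:31] */
```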

Duplicate Mnemonic Entries

Five entries in the table share a SASS mnemonic with an earlier index. These are not errors in the table -- they are distinct IR opcodes that happen to produce the same assembly mnemonic but with different binary encodings, operand widths, or functional-unit routing. The duplicates fall into two categories:

Category A -- SM-generation re-introduction. The same operation is re-implemented for a newer GPU generation with a different SASS major opcode and encoding path, typically because the tensor core or ALU microarchitecture changed:

| Later Index | Earlier Index | Mnemonic | Why re-introduced |
|---|---|---|---|
| 215 (sm_90) | 180 (sm_82) | DMMA | Hopper warpgroup-aware TC path (enc. cat. 515 vs 434) |
| 220 (sm_90) | 14 (sm_70) | FMNMX | Hopper adds 5-entry operand sub-mode table (enc. cat. 534 vs 510) |

Category B -- Operand-width extension. Blackwell Ultra (sm_104) adds 64-bit operand variants of existing integer ALU instructions. The SASS printer appends a .64 suffix at render time; the IR name table stores the same base mnemonic for both widths:

| Later Index | Earlier Index | Mnemonic | What the later index adds |
|---|---|---|---|
| 284 (sm_104) | 37 (sm_70) | IMNMX | 32-bit form, new encoding path |
| 285 (sm_104) | 37 (sm_70) | IMNMX | 64-bit form (IMNMX.64, .64.UI, .64.LO) |
| 288 (sm_104) | 7 (sm_70) | ISETP | 64-bit comparison (ISETP.64, .64.UI, .64.LO) |

Binary evidence: in the constructor sub_7A5D10, indices 284 and 285 store identical "VZAZK" string pointers at adjacent 16-byte slots (v2+8728 and v2+8744). The SASS printer (sub_7CB560) maps them to IMNMX vs IMNMX.64 based on operand metadata.

Base ISA -- sm_70 (Volta) and Later (Indices 0--135)

These opcodes are available on all SM architectures supported by ptxas v13.0.

Integer Arithmetic

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 1 | VZNQ | IMAD | Integer multiply-add (32-bit) |
| 2 | VZNQ_JVQR | IMAD_WIDE | Integer multiply-add, 32x32->64 result |
| 3 | VNQQ3 | IADD3 | Three-input integer add with carry |
| 4 | OZFX | BMSK | Generate bitmask from position and width |
| 5 | FTKG | SGXT | Sign-extend from specified bit position |
| 6 | YBC3 | LOP3 | Three-input logic operation (arbitrary LUT) |
| 7 | VFRGC | ISETP | Integer compare and set predicate (32-bit; re-introduced at index 288 for sm_104 with 64-bit support) |
| 8 | VNOF | IABS | Integer absolute value |
| 9 | YRN | LEA | Load effective address (shift-add) |
| 10 | FUS | SHF | Funnel shift (concatenate two regs, shift) |
| 33 | VQC | IDP | Integer dot product (4-element) |
| 34 | VQR | IDE | Integer dot expand |
| 37 | VZAZK | IMNMX | Integer min/max (32-bit only; re-introduced at indices 284--285 for sm_104 with 32/64-bit split) |
| 38 | CBCP | POPC | Population count (count set bits) |
| 39 | SYB | FLO | Find leading one (bit scan) |
| 53 | OERI | BREV | Bit reverse |

FP32 Arithmetic

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 11 | SSZN | FFMA | FP32 fused multiply-add |
| 12 | SNQQ | FADD | FP32 add |
| 13 | SZHY | FMUL | FP32 multiply |
| 14 | SZAZK | FMNMX | FP32 min/max (base encoding cat. 510; re-introduced at index 220 for sm_90 with extended operand modes) |
| 15 | SFJMNQQ | FSWZADD | FP32 swizzle add (cross-lane partial reduction) |
| 16 | SFRG | FSET | FP32 compare and set result register |
| 17 | SFRY | FSEL | FP32 select (conditional move) |
| 18 | SFRGC | FSETP | FP32 compare and set predicate |
| 40 | SPUX | FCHK | FP check for NaN/Inf/denorm |
| 42 | ZHSH | MUFU | Multi-function unit: RCP, RSQ, SIN, COS, EX2, LG2, RCP64H, RSQ64H |

FP64 Arithmetic

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 122 | QSZN | DFMA | FP64 fused multiply-add |
| 123 | QNQQ | DADD | FP64 add |
| 124 | QZHY | DMUL | FP64 multiply |
| 125 | QFRGC | DSETP | FP64 compare and set predicate |

FP16 Packed Arithmetic

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 126 | UNQQ2 | HADD2 | Packed FP16x2 add |
| 127 | UNQQ2_S32 | HADD2_F32 | Packed FP16x2 add with FP32 accumulator |
| 128 | USZN2 | HFMA2 | Packed FP16x2 fused multiply-add |
| 129 | UZHY2 | HMUL2 | Packed FP16x2 multiply |
| 130 | UFRG2 | HSET2 | Packed FP16x2 compare and set |
| 131 | UFRGC2 | HSETP2 | Packed FP16x2 compare and set predicate |

Type Conversion

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 35 | V2V | I2I | Integer to integer conversion (width/sign change) |
| 36 | V2VC | I2IP | Integer to integer, packed variant |
| 43 | S2S | F2F | Float to float conversion (precision change) |
| 44 | S2S_K | F2F_X | Float to float, extended (with carry chain) |
| 45 | S2V | F2I | Float to integer |
| 46 | S2V_K | F2I_X | Float to integer, extended |
| 47 | V2S | I2F | Integer to float |
| 48 | V2S_K | I2F_X | Integer to float, extended |
| 49 | SEAQ | FRND | FP round to integer (within FP format) |
| 50 | SEAQ_K | FRND_X | FP round, extended |

Data Movement

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 19 | ZBI | MOV | Move register to register |
| 20 | FRY | SEL | Predicated select (ternary conditional) |
| 21 | C2E | P2R | Pack predicate registers into GPR |
| 22 | E2C | R2P | Unpack GPR bits into predicate registers |
| 24 | CEZG | PRMT | Byte-level permute (4-byte shuffle) |
| 41 | VCN | IPA | Interpolate pixel attribute (fragment shader) |
| 57 | F2E | S2R | Read special register to GPR |
| 27 | PF2E_32 | CS2R_32 | Control/status register to GPR (32-bit) |
| 28 | PF2E_64 | CS2R_64 | Control/status register to GPR (64-bit) |

Predicate Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 23 | CYBC3 | PLOP3 | Three-input predicate logic (arbitrary LUT) |
| 26 | IBGR | VOTE | Warp-wide vote (ballot/any/all/unanimity) |
| 31 | INOFQVSS | VABSDIFF | Vector absolute difference |
| 32 | INOFQVSS4 | VABSDIFF4 | Vector absolute difference, 4-way |

Memory -- Load/Store

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 89 | YQP | LDC | Load from constant memory bank c[bank][offset] |
| 90 | NYQ | ALD | Attribute load (vertex/fragment attributes) |
| 91 | NFG | AST | Attribute store |
| 94 | YQF | LDS | Load from shared memory |
| 95 | FGF | STS | Store to shared memory |
| 96 | YQT | LDG | Load from global memory |
| 97 | FGT | STG | Store to global memory |
| 98 | YQY | LDL | Load from local memory (per-thread stack) |
| 99 | FGY | STL | Store to local memory |
| 100 | YQ | LD | Load, generic address space |
| 101 | FG | ST | Store, generic address space |

Atomic and Reduction

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 102 | NGBZ | ATOM | Atomic operation (generic address space) |
| 103 | NGBZT | ATOMG | Atomic operation (global memory) |
| 104 | ERQ | RED | Reduction (global memory, fire-and-forget) |
| 105 | NGBZF | ATOMS | Atomic operation (shared memory) |

Cache and Memory Control

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 106 | DFCP | QSPC | Query address space type |
| 107 | PPGY_AB_FO | CCTL_NO_SB | Cache control, no scoreboard wait |
| 108 | PPGY | CCTL | Cache control (invalidate/writeback/etc.) |
| 109 | PPGYY | CCTLL | Cache control, L2 level |
| 110 | PPGYG | CCTLT | Cache control, texture cache |
| 111 | ZRZONE | MEMBAR | Memory barrier (fence) |

Texture Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 83 | GRK | TEX | Texture fetch (filtered sample) |
| 84 | GYQ | TLD | Texture load (unfiltered, integer coords) |
| 85 | GYQ4 | TLD4 | Texture gather (fetch 4 texels for bilinear) |
| 86 | GZZY | TMML | Query texture mip-map level |
| 87 | GKQ | TXD | Texture fetch with explicit derivatives |
| 88 | GKD | TXQ | Texture query (dimensions, levels, format) |

Surface Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 112 | FHYQ | SULD | Surface load |
| 113 | FHFG | SUST | Surface store |
| 114 | FHNGBZ | SUATOM | Surface atomic |
| 115 | FHERQ | SURED | Surface reduction |

Graphics Pipeline

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 51 | NY2C | AL2P | Attribute location to patch offset |
| 52 | NY2C_VAQRKRQ | AL2P_INDEXED | Attribute to patch, indexed variant |
| 92 | BHG | OUT | Tessellation output emit |
| 93 | BHG_SVANY | OUT_FINAL | Tessellation output emit (final, cut primitive) |
| 116 | CVKYQ | PIXLD | Pixel information load (coverage, sample mask) |
| 117 | VFOREQ | ISBERD | Indexed set buffer for read (bindless) |
| 118 | VFORJE | ISBEWR | Indexed set buffer for write (bindless) |

Control Flow

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 67 | OEN | BRA | Branch (relative) |
| 68 | OEK | BRX | Branch indirect (register target) |
| 69 | WZC | JMP | Jump (absolute) |
| 70 | WZK | JMX | Jump indirect |
| 71 | PNYY | CALL | Function call |
| 72 | ERG | RET | Return from function |
| 73 | OFFL | BSSY | Push convergence point onto branch sync stack |
| 74 | OERNX | BREAK | Break out of convergence region |
| 77 | RKVG | EXIT | Thread exit |
| 76 | XVYY | KILL | Kill thread (discard fragment) |
| 75 | OCG | BPT | Breakpoint trap (debugger) |
| 78 | EGG | RTT | Return to trap handler |
| 79 | OFLAP | BSYNC | Branch sync (pop convergence stack, reconverge) |

Synchronization and Warp

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 54 | OZBI_O | BMOV_B | Barrier move (barrier register, B variant) |
| 55 | OZBI_E | BMOV_R | Barrier move (barrier register, R variant) |
| 56 | OZBI | BMOV | Barrier move |
| 58 | O2E | B2R | Barrier register to GPR |
| 59 | E2O | R2B | GPR to barrier register |
| 61 | ONE | BAR | Named barrier synchronization |
| 62 | ONE_VAQRKRQ | BAR_INDEXED | Barrier, indexed variant |
| 66 | QRCONE | DEPBAR | Dependency barrier (wait for scoreboard) |
| 80 | ZNGPU | MATCH | Warp match (find lanes with same value) |
| 119 | FUSY | SHFL | Warp shuffle (cross-lane data exchange) |
| 120 | JNECFLAP | WARPSYNC | Warp-wide synchronization barrier |
| 81 | ANABFYRRC | NANOSLEEP | Thread sleep for specified nanoseconds |
| 82 | ANABGENC | NANOTRAP | Nano trap (lightweight trap) |

System and Miscellaneous

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 0 | REEONE | ERRBAR | Error barrier (internal pseudo-instruction) |
| 25 | ABC | NOP | No-operation |
| 29 | CZGEVT | PMTRIG | Performance monitor trigger |
| 30 | PFZGRFG | CSMTEST | CSM (compute shader model) test |
| 60 | YRCP | LEPC | Load effective PC (get current instruction address) |
| 63 | FRGPGNVQ | SETCTAID | Set CTA (thread block) ID |
| 64 | FRGYZRZONFR | SETLMEMBASE | Set local memory base address |
| 65 | TRGYZRZONFR | GETLMEMBASE | Get local memory base address |
| 121 | LVRYQ | YIELD | Yield execution (internal, scheduler hint) |
| 135 | VAGEVAFVP | INTRINSIC | Compiler intrinsic (pseudo-opcode, lowered before encoding) |

Tensor Core (Base)

Idx | ROT13 | Mnemonic | Description
132 | UZZN_16 | HMMA_16 | FP16 matrix multiply-accumulate, 16-wide
133 | UZZN_32 | HMMA_32 | FP16 matrix multiply-accumulate, 32-wide
134 | VZZN | IMMA | Integer matrix multiply-accumulate

sm_73 Extensions (Indices 137--171)

Volta+ additions. Primarily introduces uniform register variants and additional tensor core shapes.

Uniform Register Operations

Uniform registers (UR0--UR63) hold values shared across the warp, enabling scalar execution of warp-uniform computations.

Idx | ROT13 | Mnemonic | Description
138 | HOERI | UBREV | Uniform bit reverse
139 | HOZFX | UBMSK | Uniform bitmask
140 | HPYRN | UCLEA | Uniform clear address
141 | HVFRGC | UISETP | Uniform integer set-predicate
142 | HYQP | ULDC | Uniform load constant
143 | HYRN | ULEA | Uniform load effective address
144 | HC2HE | UP2UR | Uniform predicate to uniform register
145 | HYBC3 | ULOP3 | Uniform three-input logic
146 | HCYBC3 | UPLOP3 | Uniform predicate three-input logic
147 | HFRY | USEL | Uniform select
148 | HFTKG | USGXT | Uniform sign-extend
149 | HSYB | UFLO | Uniform find leading one
150 | HVNQQ3 | UIADD3 | Uniform three-input integer add
151 | HVZNQ | UIMAD | Uniform integer multiply-add
152 | HZBI | UMOV | Uniform move
153 | HCEZG | UPRMT | Uniform byte permute
154 | IBGRH | VOTEU | Uniform vote
155 | HCBCP | UPOPC | Uniform population count
156 | HFUS | USHF | Uniform funnel shift

Additional sm_73 Operations

Idx | ROT13 | Mnemonic | Description
157 | FPNGGRE | SCATTER | Scatter write
158 | S2SC | F2FP | Float to float, packed conversion
159 | UZZN_1688 | HMMA_1688 | FP16 MMA, 16x8x8 shape
160 | UZZN_16816 | HMMA_16816 | FP16 MMA, 16x8x16 shape
161 | OZZN | BMMA | Binary (1-bit) matrix multiply-accumulate
162 | GGHPPGY | TTUCCTL | Tensor texture unit cache control
163 | GGHZNPEB | TTUMACRO | Tensor texture unit macro
164 | E2HE | R2UR | GPR to uniform register
165 | ZBIZ | MOVM | Move with mask
166 | YQFZ | LDSM | Load from shared memory to matrix register
167 | YQGENZ | LDTRAM | Load from TRAM (transposed shared memory)
168 | SBBGCEVAG | FOOTPRINT | Texture footprint query
169 | F2HE | S2UR | Special register to uniform register
170 | OEKH | BRXU | Branch indirect, uniform target

sm_82 Extensions (Indices 172--193)

Ampere additions. New MMA shapes, gather/scatter metadata, and reduction variants.

Idx | ROT13 | Mnemonic | Description
173 | TNGURE | GATHER | Gather (multi-address load)
174 | TRAZRGNQNGN | GENMETADATA | Generate metadata (for sparse MMA)
175 | FCZRGNQNGN | SPMETADATA | Sparse metadata
176 | OZZN_88128 | BMMA_88128 | Binary MMA, 8x8x128 shape
177 | OZZN_168128 | BMMA_168128 | Binary MMA, 16x8x128 shape
178 | OZZN_168256 | BMMA_168256 | Binary MMA, 16x8x256 shape
179 | PYZNQ | CLMAD | Carry-less multiply-add (GF(2) arithmetic)
180 | QZZN | DMMA | FP64 matrix multiply-accumulate (Ampere; encoding category 434; re-introduced at index 215 for Hopper with a different TC path)
181 | UZZN_FC_1688 | HMMA_SP_1688 | FP16 sparse MMA, 16x8x8
182 | USZN2_ZZN | HFMA2_MMA | FP16 FMA2, MMA variant
183 | UZAZK2 | HMNMX2 | Packed FP16x2 min/max
184 | VZZN_88 | IMMA_88 | Integer MMA, 8x8 shape
185 | VZZN_FC_88 | IMMA_SP_88 | Integer sparse MMA, 8x8
186 | VZZN_16816 | IMMA_16816 | Integer MMA, 16x8x16
187 | VZZN_16832 | IMMA_16832 | Integer MMA, 16x8x32
188 | VZZN_FC_16832 | IMMA_SP_16832 | Integer sparse MMA, 16x8x32
189 | NEEVIRF | ARRIVES | Async barrier arrive signal
190 | YQTQRCONE | LDGDEPBAR | Load-global dependency barrier
191 | YQTFGF | LDGSTS | Load-global, store-to-shared (async copy)
192 | ERQHK | REDUX | Warp-wide reduction (uniform result)

sm_86 Extensions (Indices 194--199)

Ampere+ (GA106/GA107) additions.

Idx | ROT13 | Mnemonic | Description
195 | S2VC | F2IP | Float to integer, packed
196 | HS2SC | UF2FP | Uniform float to float, packed
197 | V2SC | I2FP | Integer to float, packed
198 | FHDHREL | SUQUERY | Surface query (dimensions, format)

sm_89 Extensions (Indices 200--205)

Ada Lovelace additions. Quarter-precision MMA shapes for FP8/INT4.

Idx | ROT13 | Mnemonic | Description
201 | DZZN_16816 | QMMA_16816 | Quarter-precision MMA, 16x8x16 (FP8)
202 | DZZN_16832 | QMMA_16832 | Quarter-precision MMA, 16x8x32
203 | DZZN_FC_16832 | QMMA_SP_16832 | Quarter-precision sparse MMA, 16x8x32
204 | DZZN_FC_12864 | QMMA_SP_12864 | Quarter-precision sparse MMA, 128x64

sm_90 Extensions (Indices 206--252)

Hopper additions. Major expansion: CGA (Cooperative Grid Array) barriers, fences, GMMA (Group MMA), TMA (Tensor Memory Accelerator), and collective operations.

CGA Barriers and Synchronization

Idx | ROT13 | Mnemonic | Description
207 | NPDOYX | ACQBLK | Acquire block (CTA resource acquisition)
208 | PTNONE_NEI | CGABAR_ARV | CGA barrier arrive
209 | PTNONE_TRG | CGABAR_GET | CGA barrier get (query state)
210 | PTNONE_FRG | CGABAR_SET | CGA barrier set
211 | PTNONE_JNVG | CGABAR_WAIT | CGA barrier wait
212 | PTNREEONE | CGAERRBAR | CGA error barrier

Collective and Election

Idx | ROT13 | Mnemonic | Description
213 | PERNGRCBYVPL | CREATEPOLICY | Create scheduling/cache policy
214 | PIGN | CVTA | Convert address space (generic to specific)
215 | QZZN | DMMA | FP64 matrix multiply-accumulate (Hopper re-introduction; encoding category 515 vs 434 for index 180; uses warpgroup-aware tensor core path, shared dispatch with CVTA at case 0xD6/0xD7 in sub_6575D0)
216 | RYRPG | ELECT | Elect a leader lane in warp
217 | RAQPBYYRPGVIR | ENDCOLLECTIVE | End collective operation scope

Fences

Idx | ROT13 | Mnemonic | Description
218 | SRAPR_T | FENCE_G | Fence, global scope
219 | SRAPR_F | FENCE_S | Fence, shared/CTA scope
220 | SZAZK | FMNMX | FP32 min/max (Hopper re-introduction; encoding category 534 vs 510 for index 14; adds a 5-entry operand sub-mode table via dword_2026FC0 for extended rounding/precision modes not in the base encoding)

GMMA (Group Matrix Multiply-Accumulate)

Idx | ROT13 | Mnemonic | Description
221 | TZZN | GMMA | Group (warpgroup) matrix multiply-accumulate

Memory Extensions

Idx | ROT13 | Mnemonic | Description
222 | YQPH | LDCU | Load constant, uniform (warp-coherent constant load)
223 | YRCP | LEPC | Load effective PC (sm_90 variant)
224 | ZNCN | MAPA | Map address (for TMA address translation)
225 | CERRKVG | PREEXIT | Pre-exit (cleanup before thread exit)
226 | E2HE_U | R2UR_H | Register to uniform register, high half
227 | ERQNF | REDAS | Reduction, async (fire-and-forget with arrive)

Configuration

Idx | ROT13 | Mnemonic | Description
228 | FRGZNKERT | SETMAXREG | Set maximum register count for dynamic partitioning
229 | FRGFZRZFVMR | SETSMEMSIZE | Set shared memory size dynamically
230 | FGNF | STAS | Store async (to shared, with barrier)
231 | FGFZ | STSM | Store to shared memory, matrix layout

Synchronization Extensions

Idx | ROT13 | Mnemonic | Description
232 | FLAPF_ONFVP | SYNCS_BASIC | Sync scope, basic
233 | FLAPF_YQ_HAVSZ | SYNCS_LD_UNIFM | Sync scope with uniform load

Uniform Block Operations

Idx | ROT13 | Mnemonic | Description
234 | HOYXPC | UBLKCP | Uniform block copy
235 | HOYXERQ | UBLKRED | Uniform block reduction
236 | HOYXCS | UBLKPF | Uniform block prefetch
237 | HPIGN | UCVTA | Uniform convert address space
238 | HYRCP | ULEPC | Uniform load effective PC
239 | HZNCN | UMAPA | Uniform map address

TMA (Tensor Memory Accelerator) Operations

Idx | ROT13 | Mnemonic | Description
240 | HGZNPPGY | UTMACCTL | TMA cache control
241 | HGZNPZQSYHFU | UTMACMDFLUSH | TMA command flush
242 | HGZNYQT | UTMALDG | TMA load global
243 | HGZNCS | UTMAPF | TMA prefetch
244 | HGZERQT | UTMREDG | TMA reduction global
245 | HGZNYFG | UTMALST | TMA load/store

Vector Min/Max Extensions

Idx | ROT13 | Mnemonic | Description
246 | IUZAZK | VHMNMX | Vector half min/max (FP16x2)
247 | IVNQQ | VIADD | Vector integer add
248 | IVNQQZAZK | VIADDMNMX | Vector integer add with min/max
249 | IVZAZK | VIMNMX | Vector integer min/max
250 | IVZAZK3 | VIMNMX3 | Vector integer three-input min/max
251 | JNECTEBHC | WARPGROUP | Warpgroup collective operation

sm_100 Extensions (Indices 253--280)

Blackwell datacenter additions. UTC (Unified Tensor Core) operations, quad-precision FP, FP32x2 packed operations, and tensor core swizzle load/store.

Packed FP32 and Reduction

Idx | ROT13 | Mnemonic | Description
254 | PERQHK | CREDUX | CTA-scope reduction (cross-warp)
255 | SNQQ2 | FADD2 | Packed FP32x2 add
256 | SSZN2 | FFMA2 | Packed FP32x2 fused multiply-add
257 | SZAZK3 | FMNMX3 | FP32 three-input min/max
258 | SZHY2 | FMUL2 | Packed FP32x2 multiply

Tensor Memory

Idx | ROT13 | Mnemonic | Description
259 | YQGZ | LDTM | Load via tensor memory (5th-gen tensor core)
260 | HTRGARKGJBEXVQ | UGETNEXTWORKID | Uniform get next work ID (dynamic scheduling)

UTC (Unified Tensor Core) Operations

Idx | ROT13 | Mnemonic | Description
261 | HGPONE_1PGN | UTCBAR_1CTA | UTC barrier, 1 CTA scope
262 | HGPONE_2PGN | UTCBAR_2CTA | UTC barrier, 2 CTA scope
263 | HGPPC_1PGN | UTCCP_1CTA | UTC copy, 1 CTA scope
264 | HGPPC_2PGN | UTCCP_2CTA | UTC copy, 2 CTA scope
265 | HGPZZN_1PGN | UTCMMA_1CTA | UTC MMA, 1 CTA scope
266 | HGPZZN_2PGN | UTCMMA_2CTA | UTC MMA, 2 CTA scope
267 | HGPFUVSG_1PGN | UTCSHIFT_1CTA | UTC shift, 1 CTA scope
268 | HGPFUVSG_2PGN | UTCSHIFT_2CTA | UTC shift, 2 CTA scope

Tensor Core Swizzle

Idx | ROT13 | Mnemonic | Description
269 | IVEGPBHAG | VIRTCOUNT | Virtual thread count query
270 | GPNGBZFJF | TCATOMSWS | Tensor core atomic with swizzle
271 | GPYQFJF | TCLDSWS | Tensor core load with swizzle
272 | GPFGFJF | TCSTSWS | Tensor core store with swizzle

Quad-Precision FP

Idx | ROT13 | Mnemonic | Description
273 | DSZN4 | QFMA4 | Quad-element FP fused multiply-add
274 | DNQQ4 | QADD4 | Quad-element FP add
275 | DZHY4 | QMUL4 | Quad-element FP multiply

Additional sm_100

Idx | ROT13 | Mnemonic | Description
276 | ZRZFRG | MEMSET | Memory set (block fill)
277 | NPDFUZVAVG | ACQSHMINIT | Acquire shared memory and initialize
278 | FGGZ | STTM | Store via tensor memory
279 | SRAPR_G | FENCE_T | Fence, tensor scope

sm_104 Extensions (Indices 281--320)

Blackwell Ultra additions. Uniform FP operations, additional integer widths, conversion variants, MMA shape extensions, and MXQMMA sparse variants.

Integer Extensions

Idx | ROT13 | Mnemonic | Description
282 | VNQQ | IADD | Integer add (two-input, distinct from IADD3)
283 | HIVNQQ | UVIADD | Uniform vector integer add
284 | VZAZK | IMNMX | Integer min/max, 32-bit operands (sm_104 re-introduction; new Blackwell Ultra encoding path distinct from base index 37)
285 | VZAZK | IMNMX | Integer min/max, 64-bit operands (SASS prints as IMNMX.64; consecutive with 284 to form the 32/64-bit pair; .64.UI and .64.LO sub-modifiers select unsigned/low-half comparison modes)
286 | HVZAZK | UIMNMX | Uniform integer min/max
287 | HIVZAZK | UVIMNMX | Uniform vector integer min/max
288 | VFRGC | ISETP | Integer set-predicate (sm_104 re-introduction; supports 64-bit operand comparison as ISETP.64 with .64.UI/.64.LO sub-modifiers; new encoding path, case 0x120 in sub_7482B0 and sub_8380A0)
289 | HVFRGC | UISETP | Uniform integer set-predicate (sm_104 re-introduction of index 141; pairs with ISETP index 288 for 64-bit uniform comparison)

Data Movement Extensions

Idx | ROT13 | Mnemonic | Description
290 | ZBI | MOV | Move (sm_104 variant)
291 | HZBI | UMOV | Uniform move (sm_104 variant)
292 | FRY | SEL | Select (sm_104 variant)
293 | HFRY | USEL | Uniform select (sm_104 variant)

Uniform FP Operations

Idx | ROT13 | Mnemonic | Description
294 | HSNQQ | UFADD | Uniform FP add
295 | HSFRY | UFSEL | Uniform FP select
296 | HSSZN | UFFMA | Uniform FP fused multiply-add
297 | HSZHY | UFMUL | Uniform FP multiply
298 | HSFRG | UFSET | Uniform FP compare and set
299 | HSFRGC | UFSETP | Uniform FP compare and set predicate

Uniform Conversion

Idx | ROT13 | Mnemonic | Description
300 | HV2V | UI2I | Uniform integer to integer conversion
301 | HV2VC | UI2IP | Uniform integer to integer, packed
302 | HS2S | UF2F | Uniform float to float
303 | HSEAQ | UFRND | Uniform FP round
304 | HS2V | UF2I | Uniform float to integer
305 | HS2VC | UF2IP | Uniform float to integer, packed
306 | HV2S | UI2F | Uniform integer to float
307 | HV2SC | UI2FP | Uniform integer to float, packed
308 | HVNOF | UIABS | Uniform integer absolute value
309 | PF2HE | CS2UR | Control/status register to uniform register
310 | HS2SC | UF2FP | Uniform float to float, packed (sm_104 variant)

MMA Extensions

Idx | ROT13 | Mnemonic | Description
311 | ZKDZZN_FS_16832 | MXQMMA_SF_16832 | Mixed-quantized structured-sparse MMA, 16x8x32
312 | BZZN_16864 | OMMA_16864 | Operand MMA, 16x8x64 shape
313 | BZZN_FC_168128 | OMMA_SP_168128 | Operand sparse MMA, 16x8x128
314 | DZZN_16816 | QMMA_16816 | Quarter-precision MMA (sm_104 variant)
315 | DZZN_16832 | QMMA_16832 | Quarter-precision MMA (sm_104 variant)
316 | DZZN_FC_16832 | QMMA_SP_16832 | Quarter-precision sparse MMA (sm_104 variant)
317 | DZZN_FC_12864 | QMMA_SP_12864 | Quarter-precision sparse MMA (sm_104 variant)
318 | DZZN_FS_16832 | QMMA_SF_16832 | Quarter-precision structured sparse MMA
319 | DZZN_FS_FC_16864 | QMMA_SF_SP_16864 | Quarter-precision structured+unstructured sparse MMA

Boundary Markers

Idx | ROT13 | Mnemonic | Description
136 | FZ70_YNFG | SM70_LAST | End of sm_70 base ISA
137 | FZ73_SVEFG | SM73_FIRST | Start of sm_73 extensions
171 | FZ73_YNFG | SM73_LAST | End of sm_73
172 | FZ82_SVEFG | SM82_FIRST | Start of sm_82 extensions
193 | FZ82_YNFG | SM82_LAST | End of sm_82
194 | FZ86_SVEFG | SM86_FIRST | Start of sm_86 extensions
199 | FZ86_YNFG | SM86_LAST | End of sm_86
200 | FZ89_SVEFG | SM89_FIRST | Start of sm_89 extensions
205 | FZ89_YNFG | SM89_LAST | End of sm_89
206 | FZ90_SVEFG | SM90_FIRST | Start of sm_90 extensions
252 | FZ90_YNFG | SM90_LAST | End of sm_90
253 | FZ100_SVEFG | SM100_FIRST | Start of sm_100 extensions
280 | FZ100_YNFG | SM100_LAST | End of sm_100
281 | FZ104_SVEFG | SM104_FIRST | Start of sm_104 extensions
320 | FZ104_YNFG | SM104_LAST | End of sm_104
321 | YNFG | LAST | End-of-table sentinel

Encoding Category Map at unk_21C0E00

The 0x508 bytes (1288 bytes) at unk_21C0E00 are not additional opcode names. They are a 322-element int32 array mapping each opcode index to an encoding category number -- a level of indirection between opcode indices and binary encoding format descriptors.

Binary Evidence

  1. RSI is loaded with 0x21C0E00 (at 0x7A5D9F: mov $0x21c0e00, %esi)
  2. RDI is set to obj+0x2478 (at 0x7A5D82: lea 0x2478(%rbx), %rdi)
  3. RCX is set to 161 (at 0x7A5D22: mov $0xa1, %r13d; 0x7A5D69: mov %r13, %rcx)
  4. The rep movsq at 0x7A791D copies 161 quadwords = 1288 bytes = 322 x 4 bytes
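The size arithmetic in step 4 can be checked directly, and the base constructor's source table reduces to an identity mapping; a minimal sketch (the 161-quadword copy is byte-for-byte equivalent to writing 322 consecutive int32 values):

```python
QUADWORDS = 0xA1      # 161, the rep movsq count loaded into R13/RCX
ENTRY_COUNT = 322     # one int32 encoding category per opcode index

# 161 quadwords = 1288 bytes = 322 x 4-byte entries
assert QUADWORDS * 8 == 1288 == ENTRY_COUNT * 4

# sub_7A5D10's source table at unk_21C0E00 is the identity mapping.
category_map = list(range(ENTRY_COUNT))
assert category_map[180] == 180 and category_map[321] == 321
```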

The destination offset +0x2478 (decimal 9336) is immediately after the 322-entry name table (+4184 through +9328). Three arch-specific constructors each populate this array from a different static source table:

Constructor | Source Table | Map Content
sub_7A5D10 (base) | unk_21C0E00 | Identity: map[i] = i for all i in 0..321
sub_7C5410 | unk_21C3600 | Arch-remapped (selected entries differ)
sub_BE7390 | unk_22B2320 | Arch-remapped (selected entries differ)

Reader: sub_1377C60 (SASS Mnemonic Lookup)

The SASS mnemonic lookup function at sub_1377C60 reads this map at line 292:

v84 = *(_DWORD *)(a1 + 4 * v18 + 9336);  // encoding_category_map[opcode_index]

After matching an input mnemonic string against the ROT13 name table (with inline decoding at lines 264-273), the function reads encoding_category_map[opcode_index] and uses the result as a hash key -- combined with a 24-bit architecture discriminator via FNV-1a -- to look up the encoding format descriptor in the hash table at InstructionInfo+10672.
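The FNV-1a variant itself uses the standard 32-bit constants; a sketch of the key derivation (the exact byte order in which ptxas mixes the category and the 24-bit architecture discriminator has not been confirmed, so the `encoder_hash_key` layout below is an assumption):

```python
FNV_OFFSET = 0x811C9DC5  # standard 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193   # standard 32-bit FNV prime

def fnv1a(data: bytes, h: int = FNV_OFFSET) -> int:
    """Plain 32-bit FNV-1a: xor each byte in, then multiply."""
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFF
    return h

def encoder_hash_key(category: int, arch: int) -> int:
    # Assumed layout: 4-byte little-endian category followed by the
    # 24-bit arch discriminator; used only to illustrate the scheme.
    return fnv1a(category.to_bytes(4, "little")
                 + (arch & 0xFFFFFF).to_bytes(3, "little"))
```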

This is why duplicate mnemonics (e.g. DMMA at indices 180 and 215, or FMNMX at indices 14 and 220) can have different encoding categories (434 vs 515, 510 vs 534): the category map provides the indirection needed to select different binary encoders for the same mnemonic across architectures. The opcode name table has exactly 322 entries and no more.
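A toy model of this indirection, using the category values recovered on this page (the dict literals are illustrative stand-ins for the binary's flat arrays, not its actual layout):

```python
# encoding_category_map: opcode index -> encoding category
category_map = {14: 510, 180: 434, 215: 515, 220: 534}
# name table: two indices may share one mnemonic string
names = {14: "FMNMX", 180: "DMMA", 215: "DMMA", 220: "FMNMX"}

def encoder_category(opcode_index: int) -> int:
    # The map is keyed by opcode index, not by mnemonic, so Ampere
    # DMMA (180) and Hopper DMMA (215) select different encoders.
    return category_map[opcode_index]

assert names[180] == names[215] == "DMMA"
assert encoder_category(180) != encoder_category(215)
```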

Opcode Category Summary

Category | Base ISA | sm_73+ | sm_82+ | sm_86+ | sm_89+ | sm_90+ | sm_100+ | sm_104+ | Total
Integer ALU | 16 | 10 | 1 | 0 | 0 | 2 | 0 | 5 | 34
FP32 | 10 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 15
FP64 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5
FP16 | 6 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 8
Conversion | 10 | 1 | 0 | 3 | 0 | 0 | 0 | 10 | 24
Data Movement | 9 | 5 | 0 | 0 | 0 | 2 | 0 | 5 | 21
Predicate/Vote | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 6
Load/Store | 11 | 3 | 2 | 0 | 0 | 5 | 2 | 0 | 23
Atomic/Reduce | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 6
Cache/Fence | 6 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 11
Texture | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 8
Surface | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4
Control Flow | 13 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 15
Sync/Warp | 10 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 14
Tensor Core | 3 | 3 | 10 | 0 | 4 | 1 | 9 | 9 | 39
TMA | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 6
Uniform Block | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 6 | 10
CGA/Collective | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 5
Graphics | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8
System/Misc | 7 | 0 | 1 | 0 | 0 | 4 | 2 | 0 | 14
Boundaries | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 16

Encoding Format Correlation

From the encoding page analysis, the approximate distribution of 64-bit vs 128-bit formats for the base ISA:

64-bit format (format code 0x1): NOP, BRA, BRX, JMP, JMX, CALL, RET, EXIT, BREAK, BSSY, BSYNC, BPT, KILL, RTT, BAR, DEPBAR, WARPSYNC, BMOV, B2R, R2B, S2R, CS2R, MOV (short form), YIELD, ERRBAR, NANOSLEEP, NANOTRAP, SHFL. These are primarily control-flow, barriers, and simple data movement instructions that need fewer operand bits.

128-bit format (format code 0x2): All ALU operations (IMAD, IADD3, FFMA, FADD, FMUL, LOP3, ISETP, FSETP, etc.), all memory operations (LDG, STG, LDS, STS, LDL, STL, LD, ST, LDC), all atomics (ATOM, ATOMG, ATOMS, RED), all texture operations (TEX, TLD, TLD4, TMML, TXD, TXQ), all surface operations, tensor core operations (HMMA, IMMA, BMMA, GMMA, etc.), conversion instructions, and most uniform register operations.

256-bit format (format code 0x8): IMAD.WIDE variants with 16 constant-bank operand slots. Extremely rare -- only 2 encoder functions use this format.

The 64-bit short-form encoders cover 27 opcode classes across 174 encoder functions total. The 128-bit encoders cover the remaining ~75+ opcode classes across 912+ encoder functions.
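This split can be sketched as a simple lookup over the lists above; treating "everything not short-form" as 128-bit deliberately ignores the rare 256-bit IMAD.WIDE case described below:

```python
# The 27 short-form opcode classes listed above (control flow,
# barriers, and simple data movement).
SHORT_FORM_64 = {
    "NOP", "BRA", "BRX", "JMP", "JMX", "CALL", "RET", "EXIT", "BREAK",
    "BSSY", "BSYNC", "BPT", "KILL", "RTT", "BAR", "DEPBAR", "WARPSYNC",
    "BMOV", "B2R", "R2B", "S2R", "CS2R", "YIELD", "ERRBAR",
    "NANOSLEEP", "NANOTRAP", "SHFL",
}

def instruction_bits(mnemonic: str) -> int:
    """Approximate encoded width for a base-ISA mnemonic (format code
    0x1 -> 64 bits, 0x2 -> 128 bits; the 0x8/256-bit case is omitted)."""
    return 64 if mnemonic in SHORT_FORM_64 else 128

assert instruction_bits("BRA") == 64
assert instruction_bits("IMAD") == 128
```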

SM100 Encoding Variant Counts

Per-opcode variant counts for the SM100 (Blackwell datacenter) SASS encoder, extracted from the 683 concrete encoding handler functions at 0xED1520--0xFA5F10. Each function encodes one (opcode, operand-form) pair -- e.g., FFMA reg,reg,reg vs FFMA reg,reg,imm vs FFMA reg,reg,pred. The "Enc ID" column is the numeric value written to *(WORD*)(a2+12) by each handler, which maps to the SASS binary major opcode through the encoding dispatch megafunctions. The "SASS Mnemonic" column gives the canonical name from the 322-entry ROT13 opcode name table in InstructionInfo. Where two encoder IDs map to the same mnemonic (e.g. IADD3 IDs 0+1, LOP3 IDs 4+10), both are listed; the "Combined" column gives the merged count for that instruction.

Source: sweep report p1.14-sweep-0xED1000-0xFA6000.txt, ptxas v13.0.88.

Integer ALU

Enc ID | Variants | SASS Mnemonic | Combined | Formats
0 | 8 | IADD3 | 13 (IDs 0+1) | 23F1DF8, 23F1F08
1 | 5 | IADD3 | — | 23F1DF8, 23F1F08
15 | 19 | IMAD | 19 | 23F1DF8, 23F2018
40 | 23 | IMAD (wide) | 23 | 23F1DF8, 23F21B0
42 | 34 | IMAD (extended) | 34 | 23F1DF8, 23F21B0
4 | 4 | LOP3 | 12 (IDs 4+10) | 23F2018
10 | 8 | LOP3 | — | 23F2018
34 | 33 | ISETP | 33 | 23F1DF8, 23F29A8
30 | 2 | IMNMX | 2 | 23F1D70
43 | 13 | FLO | 13 | 23F1D70, 23F1DF8
44 | 4 | IABS | 4 | 23F1F08, 23F1F90
47 | 5 | POPC | 5 | 23F1F08, 23F1F90
49 | 2 | BREV | 2 | 23F1DF8
21 | 5 | SHF | 5 | 23F1DF8, 23F1F08
84 | 6 | SHF | 6 | 23F1F08, 23F1F90
Subtotal: 171

FP32 ALU

Enc ID | Variants | SASS Mnemonic | Combined | Formats
13 | 30 | FFMA | 30 | 23F2018..23F2EF8
14 | 11 | FADD | 11 | 23F1F90, 23F2E70
22 | 18 | FMUL | 18 | 23F1DF8..23F2678
31 | 2 | FMNMX | 2 | 23F1D70
35 | 30 | FSETP | 30 | many formats
33 | 2 | FSET/CSET | 2 | 23F2238
38 | 2 | FSWZADD | 2 | 23F2128
103 | 9 | extended FMA | 9 | 23F1DF8..23F2678
Subtotal: 104

FP64 ALU

Enc ID | Variants | SASS Mnemonic | Combined | Formats
59 | 6 | DFMA | 6 | 23F2678, 23F2EF8
91 | 2 | DADD | 2 | 23F1DF8
57 | 5 | DMUL | 5 | 23F1F08
65 | 6 | DSETP | 6 | 23F2678, 23F2EF8
Subtotal: 19

FP16 / Half-Precision

Enc ID | Variants | SASS Mnemonic | Combined | Formats
23 | 18 | HFMA2/HMUL2 | 18 | 23F1DF8..23F2678
37 | 34 | HSETP2/DSETP | 34 | 23F1DF8, 23F21B0
Subtotal: 52

Data Movement

Enc ID | Variants | SASS Mnemonic | Combined | Formats
18 | 78 | MOV | 78 | many formats
32 | 28 | SEL | 28 | 23F1D70, 23F1DF8
71 | 45 | P2R/R2P | 45 | many formats
19 | 3 | PRMT | 3 | 23F1C60, 23F1D70
20 | 3 | LEA | 3 | 23F1DF8, 23F1F08
6 | 5 | S2R | 5 | 23F1F08, 23F1F90
7 | 2 | CS2R | 2 | 23F2018
Subtotal: 164

Memory

Enc ID | Variants | SASS Mnemonic | Combined | Formats
27 | 24 | LDG/STG | 24 | 23F1F08, 23F29A8
77 | 18 | LDS/STS | 18 | 23F29A8
94 | 16 | LDL/STL | 16 | 23F29A8
74 | 6 | ST | 6 | 23F1DF8, 23F1F08
50 | 5 | ATOM/ATOMG | 5 | 23F1DF8, 23F1F08
81 | 6 | RED | 6 | 23F1F08, 23F1F90
100 | 3 | SULD | 3 | 23F1DF8, 23F1F08
Subtotal: 78

Tensor Core

Enc ID | Variants | SASS Mnemonic | Combined | Formats
78 | 35 | HMMA/IMMA | 35 | 23F1DF8, 23F29A8
90 | 5 | BMMA/QMMA | 5 | 23F2678
Subtotal: 40

Texture

Enc ID | Variants | SASS Mnemonic | Combined | Formats
5 | 1 | TLD | 1 | 23F1F08
8 | 2 | TEX | 2 | 23F1DF8, 23F1F90
9 | 1 | TLD4 | 1 | 23F1F08
88 | 2 | TEX (variant) | 2 | 23F1F08
Subtotal: 6

Predicate / Warp

Enc ID | Variants | SASS Mnemonic | Combined | Formats
79 | 7 | PLOP3 | 7 | 23F1F08..23F2018
82 | 6 | VOTE | 6 | 23F1F08, 23F1F90
48 | 7 | SHFL | 7 | 23F1D70, 23F1DF8
Subtotal: 20

Control Flow / Sync

Enc ID | Variants | SASS Mnemonic | Combined | Formats
17 | 1 | BRA | 1 | 23F1F08
73 | 10 | BAR | 10 | 23F1F08, 23F2238
92 | 1 | DEPBAR | 1 | 23F1F08
98 | 1 | MEMBAR | 1 | 23F1F08
111 | 14 | MUFU | 14 | 23F1F08, 23F1F90
45 | 1 | NOP | 1 | 23F1D70
46 | 1 | YIELD/EXIT | 1 | 23F2238
Subtotal: 29

Totals

Category | Encoder Functions | Distinct Opcodes
Integer ALU | 171 | 15 (across 10 mnemonics)
FP32 ALU | 104 | 8
FP64 ALU | 19 | 4
FP16 | 52 | 2
Data Movement | 164 | 7
Memory | 78 | 7
Tensor Core | 40 | 2
Texture | 6 | 4
Predicate/Warp | 20 | 3
Control/Sync | 29 | 7
Total | 683 | 59

The top 5 instructions by variant count -- MOV (78), P2R/R2P (45), HMMA/IMMA (35), IMAD extended (34), HSETP2/DSETP (34) -- account for 226 of 683 encoders (33%). MOV alone accounts for 11.4% of all encoder functions because every possible source type (GPR, uniform reg, immediate, constant bank, predicate, special reg) and every destination type requires a separate encoder with a distinct operand signature and bitfield extraction sequence.

The 21 encoding format descriptors (xmmword groups) cluster into three tiers by usage: heavy (165+141+101 = 407 functions across 3 formats), medium (87+47+36 = 170 across 3 formats), and light (106 functions across 15 formats). The heavy-tier formats (23F1F08, 23F1DF8, 23F29A8) are the simple/compact, primary ALU, and memory/load-store formats respectively -- these three alone cover 60% of all SM100 encoders.

Internal Index vs. Numeric Opcode

The index in this table (the position within the ROT13 name array) is the value stored in the Ori IR instruction's opcode field at offset +72 (lower 12 bits). However, this index is distinct from the encoded SASS major opcode in the binary instruction word. The mapping between IR opcode index and SASS binary major opcode is performed by the encoding dispatch tables (the "six megafunctions" at 0x10C0B20--0x10E32E0, which switch on up to 370 opcode category values from 0x0 through 0x171). A single IR opcode index may map to multiple SASS major opcodes depending on operand types and modifier bits, and vice versa.
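Extracting the IR opcode index from the +72 field is a plain mask of the low 12 bits; a sketch (the meaning of the upper bits is not established here and is treated as opaque flags):

```python
def ir_opcode_index(opcode_field: int) -> int:
    """Recover the 322-entry name-table index from the Ori IR
    instruction's opcode field at offset +72 (lower 12 bits)."""
    return opcode_field & 0xFFF

# Index 1 (IMAD) survives whatever sits in the upper flag bits.
assert ir_opcode_index(0x0001) == 1
assert ir_opcode_index(0xA001) == 1
```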

Known IR-index-to-numeric correlations (confirmed from switch statements across multiple independent functions):

IR Index | Numeric (encoding switch) | Mnemonic
1 | 0x59 | IMAD
3 | 0x29 | IADD3
25 | (64-bit, no major) | NOP
52 | (pseudo) | BB boundary
77 | (64-bit, no major) | EXIT
91 | 0x1E | ATOM
95 | (64-bit, no major) | EXIT/RET
96 | 0x38 | LDG
221 | 0xDF | GMMA

Extended Mnemonic Table (sub_896D50)

A second, much larger mnemonic table is constructed by sub_896D50 (21KB, vtable off_21DA9F8). This "extended" table serves a different purpose from the primary 322-entry table: it is used during SASS disassembly input parsing (string-to-index lookup), whereas the primary table is used during encoding (index-to-string). The two tables share the same base class (sub_A2B110) but have different vtables and different object layouts.

Table Dimensions

Property | Primary (sub_7A5D10) | Extended (sub_896D50)
Entry count | 322 (indices 0--321) | 773 (indices 0--772)
Effective mnemonics | 306 (excl. 16 boundary markers) | 772 (excl. NONE sentinel)
Entry size | 16 bytes (8B ptr + 8B len) | 16 bytes (8B ptr + 8B len)
Object offset | +0x1058 (+4184) | +0x2C60 (+11360)
Ordering | By IR opcode index | Alphabetical by ROT13 name
Encoding category map | 322 x int32 at +0x2478 | 772 x int32 at +0x5CB0 (+23728), from unk_21D92E0
Vtable | off_233ADC0 | off_21DA9F8

Why 772 Entries?

The extended table is 2.4x larger because it expands each base mnemonic into its modifier-qualified SASS forms. For example, the primary table stores one IMAD entry (index 1), but the extended table stores eight: the base form plus seven modifier-qualified variants:

Extended entry | ROT13 | Description
IMAD | VZNQ | Base form
IMAD.HI | VZNQ.UV | High-half variant
IMAD.WIDE | VZNQ.JVQR | 32x32->64
IMAD.WIDE.READ.AB | VZNQ.JVQR.ERNQ.NO | Paired read, A+B
IMAD.WIDE.READ.CH | VZNQ.JVQR.ERNQ.PU | Paired read, C high
IMAD.WIDE.READ.CL | VZNQ.JVQR.ERNQ.PY | Paired read, C low
IMAD.WIDE.WRITE.DH | VZNQ.JVQR.JEVGR.QU | Paired write, D high
IMAD.WIDE.WRITE.DL | VZNQ.JVQR.JEVGR.QY | Paired write, D low

Entry Composition

The 771 populated entries (from the decompiled string assignments at a1+11360 through a1+23712) break down as:

Category | Count | Examples
SASS base mnemonics (also in primary table) | 244 | IMAD, FADD, LDG, BRA, MOV, ...
SASS dot-modified variants | 125 | FENCE.G, ISETP.64, BAR.SYNC.DEFER_BLOCKING, HMMA.SP.16832.F16.*
SASS new base names (not in primary) | 81 | BGMMA, RPCMOV, SYNCS, MOV32I, SHL, SHR, LOP, BITEXTRACT
Mercury internal descriptors | 321 | MERCURY_addmin_srcs_r_ur_0, MERCURY_mbarrier_try_wait_...
Total SASS | 450 |
Total (SASS + Mercury) | 771 |

Of the 450 SASS entries, 7 carry annotation text in parentheses: F2F (not F64), F2I (not *64), FRND (not F64), I2F (not F64), NANOSLEEP (with Rb), NANOTRAP (with Rb), WARPSYNC (with Rb). These annotations indicate operand-type restrictions or register-variant qualifiers used by the SASS parser to disambiguate instruction forms.

32-Bit Immediate Forms

These mnemonics represent SASS instructions with a 32-bit immediate operand packed directly into the instruction word. They do not appear as separate entries in the primary IR opcode table because the immediate form is selected during encoding based on operand type, not during IR construction:

ROT13 | Mnemonic | Description
SNQQ32V | FADD32I | FP32 add with 32-bit immediate
SSZN32V | FFMA32I | FP32 FMA with 32-bit immediate
SZHY32V | FMUL32I | FP32 multiply with 32-bit immediate
UNQQ2_32V | HADD2_32I | FP16x2 add with 32-bit immediate
USZN2_32V | HFMA2_32I | FP16x2 FMA with 32-bit immediate
UZHY2_32V | HMUL2_32I | FP16x2 multiply with 32-bit immediate
VNQQ32V | IADD32I | Integer add with 32-bit immediate
VNQQ2 | IADD2 | Two-input integer add (32I related)
VZHY32V | IMUL32I | Integer multiply with 32-bit immediate
VZHY32V.JVQR | IMUL32I.WIDE | Integer multiply-wide with 32-bit immediate
VFPNQQ32V | ISCADD32I | Integer scaled-add with 32-bit immediate
YBC32V | LOP32I | Logic operation with 32-bit immediate
ZBI32V | MOV32I | Move 32-bit immediate to register
ZBI64VHE | MOV64IUR | Move 64-bit immediate to uniform register
HYBC32V | ULOP32I | Uniform logic with 32-bit immediate
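A hypothetical sketch of that encode-time selection step (the table and function below are illustrative, not the binary's actual logic; the IR carries only the base opcode):

```python
# Illustrative base -> 32-bit-immediate spelling map, drawn from the
# table above.
IMM32_FORMS = {"FADD": "FADD32I", "FFMA": "FFMA32I", "FMUL": "FMUL32I",
               "IADD": "IADD32I", "LOP": "LOP32I", "MOV": "MOV32I"}

def select_mnemonic(base: str, operand_is_imm32: bool) -> str:
    # Hypothetical: the 32I spelling is chosen during encoding when a
    # source operand is a full 32-bit immediate, otherwise the base
    # form (with its constant-bank/register operand slots) is kept.
    if operand_is_imm32 and base in IMM32_FORMS:
        return IMM32_FORMS[base]
    return base

assert select_mnemonic("MOV", True) == "MOV32I"
assert select_mnemonic("MOV", False) == "MOV"
```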

Mercury Pseudo-Instructions (321 Entries)

The single largest category. These are not real SASS instructions -- they are internal pseudo-instructions representing Mercury IR operations that need mnemonic-string identity for diagnostic and dump output. They follow a rigid naming convention:

MERCURY_{operation}_{srcs|dests}_{regclass}_{variant_index}

Register class codes in the mnemonic:

  • r = GPR (R0--R255)
  • ur = Uniform register (UR0--UR63)
  • p = Predicate register (P0--P6)
  • simm = Signed immediate
  • uimm = Unsigned immediate
  • r2 / ur2 = Register pair
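Given this rigid grammar, the variable-length operation name can be recovered by parsing from the right; a sketch (the token sets come from the convention above; real Mercury names with other shapes, such as the token-spec suffixes, would need additional cases):

```python
REG_CLASSES = {"r", "ur", "p", "simm", "uimm", "r2", "ur2"}

def parse_mercury(name: str):
    """Split MERCURY_{op}_{srcs|dests}_{regclass...}_{variant}.
    Parses right-to-left because {op} may itself contain underscores."""
    toks = name.split("_")
    assert toks[0] == "MERCURY"
    toks = toks[1:]
    variant = toks.pop() if toks and toks[-1].isdigit() else None
    classes = []
    while toks and toks[-1] in REG_CLASSES:
        classes.insert(0, toks.pop())
    direction = toks.pop() if toks and toks[-1] in ("srcs", "dests") else None
    return "_".join(toks), direction, classes, variant

assert parse_mercury("MERCURY_addmin_srcs_r_ur_0") == (
    "addmin", "srcs", ["r", "ur"], "0")
```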

Representative entries (decoded from ROT13):

ROT13 | Cleartext | Operation
ZREPHEL__vage | MERCURY__intr | Generic intrinsic placeholder
ZREPHEL_nqqzva_fepf_e_he_0 | MERCURY_addmin_srcs_r_ur_0 | Fused add-min, GPR + uniform
ZREPHEL_nqqznk_fepf_he_e_0 | MERCURY_addmax_srcs_ur_r_0 | Fused add-max, uniform + GPR
ZREPHEL_ngbz_pnf_vag_npd_ery_... | MERCURY_atom_cas_int_acq_rel_... | Atomic CAS with acquire-release
ZREPHEL_flapf_neevir_n1g0_n0g1_... | MERCURY_syncs_arrive_a1t0_a0t1_... | Sync arrive with token spec

New Base Mnemonics

Mnemonics that appear in the extended table but have no base-name match in the primary 322-entry table at all. Some are legacy forms (pre-Volta mnemonics preserved for disassembly compatibility), others are specialized operations:

ROT13 | Mnemonic | Category
NPDOHYX | ACQBULK | CGA bulk resource acquire
OVGRKGENPG | BITEXTRACT | Bitfield extract
QRPBZCERFF | DECOMPRESS | Data decompression
VQC4N | IDP4A | Integer dot-product accumulate (4-element)
VZHY | IMUL | Integer multiply (non-fused, legacy)
VFPNQQ | ISCADD | Integer scaled-add (legacy LEA form)
YQTZP | LDGMC | Load global with memory consistency
YQG | LDT | Load from texture memory
YBC | LOP | Two-input logic operation (legacy)
CFRGC | PSETP | Predicate set-predicate
ERQT | REDG | Reduction, global (explicit address space)
FUY | SHL | Shift left (legacy, replaced by SHF)
FUE | SHR | Shift right (legacy, replaced by SHF)
FCNEFVSL | SPARSIFY | Convert dense to sparse format
FGG | STT | Store to texture memory
GNGBZT | TATOMG | Texture atomic, global scope
IVFRG | VISET | Vector integer set
JNECTEBHCFRG | WARPGROUPSET | Configure warpgroup parameters

Modifier Suffix Patterns

Five distinct modifier suffix patterns are used in the extended table's dot-separated SASS mnemonics:

Pattern 1 -- Sub-operation mode. The suffix selects a functional sub-operation within a single hardware instruction. CCTL has the most variants (7):

Extended Mnemonic | Sub-operation
CCTL.C | Clean
CCTL.C.LDC | Clean via constant cache
CCTL.C.LDC.IVALL | Clean constant cache, invalidate all
CCTL.E.LDC | Evict via constant cache
CCTL.I | Invalidate
CCTL.LDCU | Load constant, uniform path
CCTL.QFAULT | Query fault status

Also: SYNCS.ARRIVE.A1T0.A0T1, SYNCS.CAS.EXCH, SYNCS.CCTL, SYNCS.FLUSH, SYNCS.LD.NON_UNIFORM, SYNCS.LD.UNIFORM, SYNCS.PHASECHK (8 variants); and BPT.DRAIN, BPT.PAUSE.

Pattern 2 -- Operand width. The .64 suffix (with optional .HI/.LO half-selectors) indicates 64-bit operand mode. Added for sm_104 (Blackwell Ultra):

Extended Mnemonic | Base Opcode
ISETP.64, ISETP.64.HI, ISETP.64.LO | ISETP (idx 288)
IMNMX.64, IMNMX.64.HI, IMNMX.64.LO | IMNMX (idx 285)
IADD.64, IADD.64.HI, IADD.64.LO | IADD (idx 282)
IADD2.64, IADD2.64.HI, IADD2.64.LO | IADD2
MOV.64, MOV.64.HI, MOV.64.LO | MOV (idx 290)
SEL.64, SEL.64.HI, SEL.64.LO | SEL (idx 292)
UMOV.64, USEL.64, UIADD3.64, UIMNMX.64, UISETP.64 | Uniform 64-bit variants

Pattern 3 -- Data access direction. IMAD.WIDE has 5 sub-variants controlling which 32-bit half of the 64-bit accumulator is read or written. These correspond to the 256-bit instruction format (format code 0x8) with 16 constant-bank operand slots:

Extended Mnemonic | Meaning
IMAD.WIDE | Default wide multiply-add
IMAD.WIDE.READ.AB | Read both A and B input halves
IMAD.WIDE.READ.CL / .CH | Read accumulator low / high half
IMAD.WIDE.WRITE.DL / .DH | Write result low / high half
IMAD.HI | High-half result only

Pattern 4 -- Scope qualifier. Fences, barriers, UTC operations, and synchronization carry scope suffixes:

Extended Mnemonic | Scope
FENCE.G | Global (GPU-wide)
FENCE.S | Shared/CTA
FENCE.T | Tensor (sm_100+)
UTCBAR.1CTA, UTCBAR.2CTA | 1-CTA / 2-CTA scope
UTCBAR.1CTA.FLUSH | 1-CTA with flush
BAR.SYNC.DEFER_BLOCKING | Deferred blocking sync
USETMAXREG.RELEASE | Release variant
USETSHMSZ.FLUSH | Flush variant

Pattern 5 -- Shape and type descriptor. Tensor core operations carry shape geometry and data type. Brace-delimited alternation syntax indicates a single encoder handling multiple shapes:

Extended Mnemonic | Meaning
HMMA.F32.{16816.F16|16816.E8M7|1688.E8M10} | FP16 MMA with FP32 accum, multiple shapes
HMMA.SP.16832.F16.* | Sparse FP16 MMA, 16x8x32
IMMA.{8816.*|8832.*} | Integer MMA, 8x8x16 or 8x8x32
IMMA.SP.{16832.*|16864.*4.*4} | Sparse integer MMA
QMMA.SF.SP | Structured + unstructured sparse
MUFU.EX2.LOW_ACC.{F16x2, BF16x2} | Low-accuracy EX2 for half types

Top Opcodes by Dot-Variant Count

Base Opcode | Variants | Category
HMMA | 8 | Tensor core shape + sparse + FP type
SYNCS | 8 | Scope-aware synchronization modes
CCTL | 7 | Cache control sub-operations
IMAD | 7 | .HI, .WIDE, .WIDE.READ.*, .WIDE.WRITE.*
IMMA | 6 | Tensor core shape + sparse
QMMA | 6 | Shape + structured/unstructured sparse
USYNCS | 6 | Uniform sync scope modes
MUFU | 5 | .EX2, .RCP, .RSQ, .EX2 with half-precision
IADD | 4 | .64, .64.HI, .64.LO, .XOR
WARPGROUP | 3 | .ARRIVE, .DEPBAR, .WAIT
RPCMOV | 3 | .32, .32.READ, .64
UTCBAR | 3 | .1CTA, .1CTA.FLUSH, .2CTA

Complete New SASS Mnemonics by Category

The following 206 SASS mnemonics appear only in the extended table -- they have no corresponding entry in the base 322-entry name table. Many represent modifier-suffixed forms of base opcodes; others are entirely new operations.

GMMA type-specialized (8): BGMMA, BGMMA_GSB, HGMMA, HGMMA_GSB, IGMMA, IGMMA_GSB, QGMMA, QGMMA_GSB

UTC type-specialized (20): UTCHMMA.1CTA, UTCHMMA.2CTA, UTCIMMA.1CTA, UTCIMMA.2CTA, UTCMXQMMA.1CTA, UTCMXQMMA.2CTA, UTCOMMA.1CTA, UTCOMMA.2CTA, UTCQMMA.1CTA, UTCQMMA.2CTA, UTCBAR.1CTA.FLUSH, UTCATOMSWS, UTCLDSWS, UTCSTSWS, UTCBAR.1CTA, UTCBAR.2CTA, UTCCP.1CTA, UTCCP.2CTA, UTCSHIFT.1CTA, UTCSHIFT.2CTA

DLC/DPC operations (13): UDLCBAR, UDLCCP, UDLCHMMA, UDLCIMMA, UDLCQMMA, UDPCBLKCP, UDPCBLKL2CCTL, UDPCBLKRED, UDPCTMACCTL, UDPCTMAL2CCTL, UDPCTMALDG, UDPCTMAREDG, UDPCTMASTG

Synchronization (17): SYNCS.ARRIVE.A1T0.A0T1, SYNCS.ARRIVE.A1TR.ART0.A0TR.A0TX, SYNCS.CAS.EXCH, SYNCS.CCTL, SYNCS.FLUSH, SYNCS.LD.NON_UNIFORM, SYNCS.LD.UNIFORM, SYNCS.PHASECHK, SYNCSU.ARRIVE.A1T0, SYNCSU.ARRIVE.MULTICAST.A1T0, WARPGROUP.ARRIVE, WARPGROUP.DEPBAR, WARPGROUP.WAIT, WARPGROUPSET, BAR.SYNC.DEFER_BLOCKING, BPT.DRAIN, BPT.PAUSE

Uniform sync (6): USYNCS.ARRIVE, USYNCS.ARRIVE.MULTICAST, USYNCS.CAS.EXCH, USYNCS.CCTL, USYNCS.LD, USYNCS.PHASECHK

Integer 64-bit variants (18): IADD.64, IADD.64.HI, IADD.64.LO, IADD.XOR, IADD2, IADD2.64, IADD2.64.HI, IADD2.64.LO, IMNMX.64, IMNMX.64.HI, IMNMX.64.LO, ISETP.64, ISETP.64.HI, ISETP.64.LO, MOV.64, MOV.64.HI, MOV.64.LO, SEL.64, SEL.64.HI, SEL.64.LO

Uniform scalar extended (27): UIADD3.64, UIMNMX.64, UISETP.64, UMOV.64, USEL.64, ULOP, ULOP32I, UMEMSETS.64, UPSETP, UR2UP, USHL, USHR, UCCTL, UBLKL2CCTL, UCGABAR_ARV, UCGABAR_GET, UCGABAR_SET, UCGABAR_WAIT, USETMAXREG, USETMAXREG.RELEASE, USETSHMSZ, USETSHMSZ.FLUSH, UREDGR, UREGPRERELEASE, USTGR, UTRACEEVENT, UVIRTCOUNT

IMAD/IMUL variants (8): IMAD.HI, IMAD.WIDE.READ.AB, IMAD.WIDE.READ.CH, IMAD.WIDE.READ.CL, IMAD.WIDE.WRITE.DH, IMAD.WIDE.WRITE.DL, IMUL.WIDE, IMUL32I.WIDE

Tensor core shapes (28): HMMA.16816.F16.*, HMMA.1688.F16.*, HMMA.F32.{...} (4 entries), HMMA.SP.{...} (4 entries), IMMA.{...} (3 entries), IMMA.SP.{...} (3 entries), DMMA.1684, DMMA.1688, DMMA.16816, BMMA.88128, BMMA.168128, BMMA.168256, QMMA.16816, QMMA.16832, QMMA.SF, QMMA.SF.SP, QMMA.SP.16832, QMMA.SP.16864, OMMA.SP

FP extensions (16): FADD32I, FFMA32I, FMUL32I, FHADD, FHADD2, FHFMA, FHFMA2, FHMUL2, UFHADD, UFHFMA, UFMNMX, MUFU.EX2, MUFU.RCP, MUFU.RSQ, MUFU.EX2.{F16x2, BF16x2}, MUFU.EX2.LOW_ACC.{F16x2, BF16x2}

Cache control (7): CCTL.C, CCTL.C.LDC, CCTL.C.LDC.IVALL, CCTL.E.LDC, CCTL.I, CCTL.LDCU, CCTL.QFAULT

Texture extensions (8): TATOMG, TTUCLOSE, TTUGO, TTULD, TTULD_CLOSE, TTUMACROFUSE, TTUOPEN, TTUST

Fence/scope (3): FENCE.G, FENCE.S, FENCE.T

Data movement (8): MOV32I, MOV64IUR, RPCMOV, RPCMOV.32, RPCMOV.32.READ, RPCMOV.64, CS2R (base without size), DECOMPRESS

Memory (4): LDGMC, LDT, STT, REDG

Other new (13): ACQBULK, BRA_IMM, JMP_IMM, JMXU, NONE, PSETP, HADD2_32I, HFMA2_32I, HMUL2_32I, IADD32I, IMUL, LOP, LOP32I

Parallel Constructor Regions

The ROT13 string data for the extended table exists in two nearly identical regions:

| Region | Address range | SASS entries | MERCURY entries |
|---|---|---|---|
| 1 | 0x2039000--0x203A500 | 139 unique | 32 |
| 2 | 0x21CA000--0x21CB100 | 139 unique | 40 |

Region 2 has 8 additional MERCURY entries not in region 1, all for sm_100/sm_104 cluster barrier and atomic operations: MERCURY_barrier_cluster_arrive_sync_unaligned_* (4), MERCURY_atom_shared_cta_popc_inc_* (3), MERCURY_atom_shared_cta_int_acq_rel_* (1). This indicates at least two InstructionInfo variant objects for different target architectures, where the newer variant gains additional Mercury instruction templates.
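
Recovering plaintext mnemonics from these regions is a single ROT13 pass: only letters rotate, while digits and punctuation pass through. As a sketch, the internal `_zzn.z8a8x4` opcode (documented below as the obfuscated form of `_mma.m8n8k4`) decodes like this:

```python
import codecs

def deobfuscate(s: str) -> str:
    """Decode a ROT13-obfuscated mnemonic; digits and dots pass through."""
    return codecs.decode(s, "rot13")

# _zzn.z8a8x4 is the stored form of _mma.m8n8k4 (see the internal MMA table)
print(deobfuscate("_zzn.z8a8x4"))  # _mma.m8n8k4
```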

Hash Table for O(1) Lookup

After populating the flat sorted array, sub_896D50 constructs a hash table for O(1) mnemonic lookup during SASS parsing. The hash table is allocated as a 488-byte header object with three backing arrays:

| Array | Slot size | Slots | Total bytes | Purpose |
|---|---|---|---|---|
| 1 | 64 bytes | 772 | 49,408 | Open-addressing hash (key prefix + metadata) |
| 2 | 36 bytes | 772 | 27,792 | Auxiliary data per mnemonic |
| 3 | 16 bytes | 35 | 560 | Overflow / collision chain |

Array 1 slots are initialized to 0xFF (empty sentinel). The hash function used for lookup is the same FNV-1a variant used by sub_1377C60 for the primary table.
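
A minimal model of the lookup structure, assuming standard 64-bit FNV-1a parameters and linear probing (the binary's exact FNV variant and probe sequence have not been recovered, so both are assumptions):

```python
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3

def fnv1a(data: bytes) -> int:
    """64-bit FNV-1a over the mnemonic bytes (standard constants assumed)."""
    h = FNV_OFFSET
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

class MnemonicTable:
    """Open-addressing table; None models the 0xFF empty-slot sentinel."""
    def __init__(self, slots: int = 772):
        self.slots = [None] * slots
        self.n = slots

    def insert(self, name: str, index: int) -> None:
        i = fnv1a(name.encode()) % self.n
        while self.slots[i] is not None:   # probe past occupied slots
            i = (i + 1) % self.n
        self.slots[i] = (name, index)

    def lookup(self, name: str):
        i = fnv1a(name.encode()) % self.n
        while self.slots[i] is not None:
            if self.slots[i][0] == name:
                return self.slots[i][1]
            i = (i + 1) % self.n
        return None                        # hit an empty slot: not present
```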

Object Tail Configuration

After building the tables and hash structure, the constructor:

  1. Queries 15 knobs via context+1664 (knobs 1, 2, 5, 11, 14, 18, 22, 25, 28, 273, 774, 775, 803, 983, 998) to conditionally register feature-gated instruction families at context+1728
  2. Stores knob 803's value at obj+108
  3. Sets the vtable to off_21DA9F8 (line 2438 in decompiled source)
  4. Writes feature bitmask 0x48018BA65 at obj+26856
  5. Stores the hash table pointer at obj+26832 and the arena pointer at obj+26840
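
The tail writes above can be modeled as fixed offsets into the raw object. This sketch assumes (unverified) that each field is an 8-byte little-endian word, which matches the x86-64 decompilation for the pointer fields but is a guess for the knob value:

```python
import struct

# Offsets recovered from the constructor tail
FIELDS = {
    "knob_803_value":  108,     # width assumed to be 8 bytes
    "hash_table_ptr":  26832,
    "arena_ptr":       26840,
    "feature_bitmask": 26856,   # constructor writes 0x48018BA65 here
}

def read_tail(obj: bytes) -> dict:
    """Extract the constructor-written tail fields from a raw object dump."""
    return {name: struct.unpack_from("<Q", obj, off)[0]
            for name, off in FIELDS.items()}
```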

Key Functions

| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_7A5D10 | -- | InstructionInfo constructor; initializes the 322-entry ROT13 opcode name table at object offset +0x1058 and the 322-entry encoding category identity map at +0x2478 (vtable off_233ADC0) | 0.92 |
| sub_BE7390 | -- | Parallel InstructionInfo constructor; initializes an identical 322-entry name table | 0.90 |
| sub_7CB560 | -- | SASS printer; maps duplicate opcode indices (e.g., 284 vs 285) to distinct mnemonic strings (IMNMX vs IMNMX.64) based on operand metadata | 0.85 |
| sub_6575D0 | 49 KB | Register-class-to-opcode dispatch; handles DMMA (index 215) shared dispatch with CVTA at cases 0xD6/0xD7 | 0.85 |
| sub_7482B0 | -- | Encoding path for ISETP (index 288, sm_104); handles case 0x120 for 64-bit integer set-predicate | 0.80 |
| sub_8380A0 | -- | Encoding path for ISETP (index 288, sm_104); second handler for case 0x120 | 0.80 |
| sub_896D50 | 21 KB | Extended mnemonic table constructor; builds the 772-entry alphabetically-sorted SASS mnemonic lookup table at object offset +11360, with a parallel 772-entry encoding category map from unk_21D92E0, plus a 3-array hash table for O(1) string lookup during disassembly parsing (vtable off_21DA9F8) | 0.90 |
| sub_A2B110 | -- | Base class constructor shared by both the primary (sub_7A5D10) and extended (sub_896D50) mnemonic table objects | 0.85 |

PTX Instruction Table

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Complete catalog of PTX instructions recognized by ptxas v13.0.88 (CUDA 13.0). All entries are verified against the binary: instruction names come from the Flex lexer's 552 token rules, type signatures from the instruction table builder's 1,141 descriptor registrations (sub_46E000 calling sub_46BED0), and formatter names from the 580 PTX text generation functions dispatched by sub_5D4190. Internal-only instructions (prefixed with _) are included where they appear in the binary and are marked accordingly.

| Item | Details |
|---|---|
| Instruction table builder | sub_46E000 (93 KB, 1,141 calls to sub_46BED0) |
| Instruction lookup | sub_46C690 (entry) / sub_46C6E0 (6.4 KB matcher) |
| PTX text formatter dispatch | sub_5D4190 (12.9 KB, 81 string + 473-entry hash) |
| Formatter functions | 0x4DA340--0x5A8E40 (580 functions) |
| Semantic validators | 0x460000--0x4D5000 (~20 validator functions) |
| Operand type encoding | Single-char codes: F=float, H=half, I=int, B=bits, N=imm, P=pred, E=bf16, Q=fp8, R=fp4 |

Organization

Instructions are grouped by functional category following NVIDIA's PTX ISA documentation structure. Each table entry lists:

  • Mnemonic: the PTX instruction name as recognized by the lexer
  • Type suffixes: legal type qualifiers (from instruction table builder encoding strings)
  • Operands: operand pattern (d=dest, a/b/c=source, p=predicate, [a]=memory)
  • SM req: minimum SM architecture (from sub_489390 version gates in validators)
  • PTX req: minimum PTX ISA version (from sub_489050 version gates)
  • Description: brief functional description

Type abbreviations in the suffix column: s=signed int, u=unsigned int, f=float, b=bits, pred=predicate. Widths: 8/16/32/64/128. Packed: f16x2, bf16x2.


Integer Arithmetic

MnemonicType suffixesOperandsSMPTXDescription
add.s16 .s32 .s64 .u16 .u32 .u64d, a, ball1.0Integer addition
sub.s16 .s32 .s64 .u16 .u32 .u64d, a, ball1.0Integer subtraction
mul.lo.s16 .s32 .s64 .u16 .u32 .u64d, a, ball1.0Multiply, low half of result
mul.hi.s16 .s32 .s64 .u16 .u32 .u64d, a, ball1.0Multiply, high half of result
mul.wide.s16 .s32 .u16 .u32d, a, ball1.0Widening multiply (16->32 or 32->64)
mul24.lo.s32 .u32d, a, ball1.024-bit multiply, low half (deprecated sm_20+)
mul24.hi.s32 .u32d, a, ball1.024-bit multiply, high half (deprecated sm_20+)
mad.lo.s16 .s32 .s64 .u16 .u32 .u64d, a, b, call1.0Multiply-add, low half
mad.hi.s16 .s32 .s64 .u16 .u32 .u64d, a, b, call1.0Multiply-add, high half
mad.wide.s16 .s32 .u16 .u32d, a, b, call1.0Widening multiply-add
mad24.lo.s32 .u32d, a, b, call1.024-bit multiply-add, low (deprecated)
mad24.hi.s32 .u32d, a, b, call1.024-bit multiply-add, high (deprecated)
mad.cc.s32 .u32 .s64 .u64d, a, b, call1.0Multiply-add with carry-out
madc.lo.s32 .u32 .s64 .u64d, a, b, call1.0Multiply-add with carry-in, low
madc.hi.s32 .u32 .s64 .u64d, a, b, call1.0Multiply-add with carry-in, high
mad.fused.hi.s32 .u32d, a, b, call1.0Fused multiply-add, high half
madc.fused.hi.s32 .u32d, a, b, call1.0Fused multiply-add with carry, high
div.s16 .s32 .s64 .u16 .u32 .u64d, a, ball1.0Integer division
rem.s16 .s32 .s64 .u16 .u32 .u64d, a, ball1.0Integer remainder
abs.s16 .s32 .s64d, aall1.0Absolute value
neg.s16 .s32 .s64d, aall1.0Negate
min.s16 .s32 .s64 .u16 .u32 .u64d, a, ball1.0Minimum
max.s16 .s32 .s64 .u16 .u32 .u64d, a, ball1.0Maximum
popc.b32 .b64d, a20+2.0Population count (count set bits)
clz.b32 .b64d, a20+2.0Count leading zeros
bfind.s32 .s64 .u32 .u64d, a20+2.0Find most significant set bit
brev.b32 .b64d, a20+2.0Bit reverse
bfe.s32 .s64 .u32 .u64d, a, b, c20+2.0Bit field extract
bfi.b32 .b64d, f, a, b, c20+2.0Bit field insert
dp4a.s32.s32 .s32.u32 .u32.s32 .u32.u32d, a, b, c61+5.04-element dot product accumulate
dp2a.lo.s32.s32 .s32.u32 .u32.s32 .u32.u32d, a, b, c61+5.02-element dot product accumulate, low
dp2a.hi.s32.s32 .s32.u32 .u32.s32 .u32.u32d, a, b, c61+5.02-element dot product accumulate, high
sad.s16 .s32 .u16 .u32d, a, b, call1.0Sum of absolute differences

Floating-Point Arithmetic

MnemonicType suffixesOperandsSMPTXDescription
add.f32 .f64d, a, ball1.0FP addition (.rn .rz .rm .rp rounding)
sub.f32 .f64d, a, ball1.0FP subtraction
mul.f32 .f64d, a, ball1.0FP multiplication
fma.f32 .f64d, a, b, c20+2.0Fused multiply-add
mad.f32 .f64d, a, b, call1.0Multiply-add (non-fused on sm_20+)
mad.rnd.f32.f32d, a, b, call1.0Multiply-add with explicit rounding
div.f32 .f64d, a, ball1.0FP division (.approx .full .rn .rz .rm .rp)
div.full.f32d, a, ball1.0Full-range division (specialized formatter)
div.rnd.f32.f32d, a, ball1.0Division with explicit rounding
div.rn.f64.f64d, a, ball1.0Double-precision division, round-nearest
abs.f32 .f64d, aall1.0FP absolute value
neg.f32 .f64d, aall1.0FP negate
min.f32 .f64d, a, ball1.0FP minimum
max.f32 .f64d, a, ball1.0FP maximum
rcp.f32 .f64d, aall1.0Reciprocal (.approx .rn .rz .rm .rp)
rcp.approx.f64.f64d, aall1.0Approximate double reciprocal
rcp.rnd.f32.f32d, aall1.0Reciprocal with explicit rounding
rcp.rn.f64.f64d, aall1.0Double reciprocal, round-nearest
sqrt.f32 .f64d, aall1.0Square root (.approx .rn .rz .rm .rp)
rsqrt.f32 .f64d, aall1.0Reciprocal square root (.approx)
sin.f32d, aall1.0Sine (approximate)
cos.f32d, aall1.0Cosine (approximate)
lg2.f32d, aall1.0Log base 2 (approximate)
ex2.f32d, aall1.0Exp base 2 (approximate)
tanh.f32d, a75+6.5Hyperbolic tangent (approximate)
testp.f32 .f64p, a20+2.0Test FP property (.finite .infinite .number .notanumber .normal .subnormal)
copysign.f32 .f64d, a, b20+2.0Copy sign from b to a
fma.f32.f32d, a, b, c20+2.0FP32 fused multiply-add (rounding modes)
fma.f64.f64d, a, b, c20+2.0FP64 fused multiply-add

Half-Precision and BFloat16 Arithmetic

MnemonicType suffixesOperandsSMPTXDescription
add.f16 .f16x2 .bf16 .bf16x2d, a, b53+4.2Half-precision addition
sub.f16 .f16x2 .bf16 .bf16x2d, a, b53+4.2Half-precision subtraction
mul.f16 .f16x2 .bf16 .bf16x2d, a, b53+4.2Half-precision multiplication
fma.f16 .f16x2 .bf16 .bf16x2d, a, b, c53+4.2Half-precision fused multiply-add
neg.f16 .f16x2 .bf16 .bf16x2d, a53+4.2Half-precision negate
abs.f16 .f16x2 .bf16 .bf16x2d, a53+4.2Half-precision absolute value
min.f16 .f16x2 .bf16 .bf16x2d, a, b80+7.0Half-precision minimum
max.f16 .f16x2 .bf16 .bf16x2d, a, b80+7.0Half-precision maximum
min.ftz.NaN.f16 .f16x2 .bf16 .bf16x2d, a, b80+7.0Min with NaN propagation
max.ftz.NaN.f16 .f16x2 .bf16 .bf16x2d, a, b80+7.0Max with NaN propagation
ex2.approx.f16 .f16x2d, a75+6.5Half-precision exp2
tanh.approx.f16 .f16x2d, a75+6.5Half-precision tanh
fma.rn.relu.f16 .bf16d, a, b, c80+7.0Fused multiply-add with ReLU

Comparison and Selection

MnemonicType suffixesOperandsSMPTXDescription
setp.s16 .s32 .s64 .u16 .u32 .u64 .f32 .f64 .f16 .bf16 .b16 .b32 .b64p[|q], a, ball1.0Set predicate on comparison
selp.s16 .s32 .s64 .u16 .u32 .u64 .f32 .f64 .b16 .b32 .b64d, a, b, pall1.0Select on predicate
slct.s32 .u32 .f32 .s64 .u64 .f64d, a, b, call1.0Select on comparison
set.s32 .u32 .f32 .s64 .u64 .f64d, a, ball1.0Compare and set

Comparison operators for setp/set: .eq .ne .lt .le .gt .ge .lo .ls .hi .hs .equ .neu .ltu .leu .gtu .geu .num .nan

Logic and Shift

MnemonicType suffixesOperandsSMPTXDescription
and.b16 .b32 .b64 .predd, a, ball1.0Bitwise AND
or.b16 .b32 .b64 .predd, a, ball1.0Bitwise OR
xor.b16 .b32 .b64 .predd, a, ball1.0Bitwise XOR
not.b16 .b32 .b64 .predd, aall1.0Bitwise NOT
cnot.b16 .b32 .b64d, aall1.0C-style logical NOT
lop3.b32d, a, b, c, immLut50+4.03-input logic operation (LUT-encoded)
shl.b16 .b32 .b64d, a, ball1.0Shift left
shr.s16 .s32 .s64 .u16 .u32 .u64d, a, ball1.0Shift right (arithmetic for .s, logical for .u)
shf.l.b32d, a, b, c32+3.2Funnel shift left
shf.r.b32d, a, b, c32+3.2Funnel shift right (.clamp .wrap)
prmt.b32d, a, b, c20+2.0Byte permute (.f4e .b4e .rc8 .ecl .ecr .rc16)
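
The immLut operand of lop3.b32 is conventionally derived (as documented in the PTX ISA) by evaluating the desired Boolean function on the truth-table constants 0xF0 for a, 0xCC for b, and 0xAA for c:

```python
def immlut(f) -> int:
    """Compute the lop3.b32 LUT byte: each bit position of the constants
    enumerates one row of the 3-input truth table."""
    ta, tb, tc = 0xF0, 0xCC, 0xAA
    return f(ta, tb, tc) & 0xFF

# (a & b) | c  ->  lop3.b32 d, a, b, c, 0xEA;
print(hex(immlut(lambda a, b, c: (a & b) | c)))  # 0xea
print(hex(immlut(lambda a, b, c: a ^ b ^ c)))    # 0x96
```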

Data Movement

MnemonicType suffixesOperandsSMPTXDescription
mov.b16 .b32 .b64 .b128 .u16 .u32 .u64 .s16 .s32 .s64 .f16 .f32 .f64 .predd, aall1.0Move register-to-register
shfl.b32d|p, a, b, c30+3.0Warp shuffle (.up .down .bfly .idx)
shfl.sync.b32d|p, a, b, c, membermask70+6.0Warp shuffle with sync
vote.predd, {p}, a20+2.0Warp vote (.all .any .uni .ballot)
vote.sync.pred .b32d, {p}, a, membermask70+6.0Warp vote with sync
match.b32 .b64d, a70+6.0Warp match (.any .all)
match.sync.b32 .b64d, a, membermask70+6.0Warp match with sync
redux.s32 .u32d, a80+7.0Warp reduction (.add .min .max .and .or .xor)
redux.sync.s32 .u32d, a, membermask80+7.0Warp reduction with sync
activemask.b32d70+6.2Get active thread mask
elect.predp90+8.0Elect one leader thread
elect.one--d, {p}90+8.0Elect one thread, return success

Load, Store, and Memory

Global, Local, Shared, Const

MnemonicType suffixesOperandsSMPTXDescription
ld.b8 .b16 .b32 .b64 .b128 .u8 .u16 .u32 .u64 .s8 .s16 .s32 .s64 .f16 .f32 .f64d, [a]all1.0Load from memory (.global .shared .local .const .param)
ld.nc.b32 .b64 .b128 .f32 .f64d, [a]35+3.2Non-coherent load (read-only cache)
ld.param.b8 .b16 .b32 .b64 .b128d, [a]all1.0Load from kernel parameter space
st.b8 .b16 .b32 .b64 .b128 .u8 .u16 .u32 .u64 .s8 .s16 .s32 .s64 .f16 .f32 .f64[a], ball1.0Store to memory
ldu.b32 .b64 .b128 .f32 .f64d, [a]20+2.0Load via uniform cache (deprecated)
prefetch.L1 .L2[a]20+2.0Prefetch to cache level
prefetchu.L1[a]20+2.0Prefetch uniform
isspacep.global .shared .local .constp, a20+2.0Test address space
cvta--d, a20+2.0Convert address space (generic <-> specific)
cvta.to.global .shared .local .constd, a20+2.0Convert to specific state space

Cache qualifiers for ld/st: .ca .cg .cs .cv .lu .wb .wt
Eviction policy (PTX 7.4+): .L2::evict_first .L2::evict_last .L2::evict_normal .L2::cache_hint

Async Copy

MnemonicType suffixesOperandsSMPTXDescription
cp.async.ca .cg[dst], [src], size80+7.0Async copy (4/8/16 bytes, global->shared)
cp.async.commit_group----80+7.0Commit outstanding async copies
cp.async.wait_group--N80+7.0Wait for async copy group completion
cp.async.wait_all----80+7.0Wait for all async copies
cp.async.mbarrier.arrive--[mbar]80+7.0Arrive at mbarrier on async copy completion
cp.async.bulk--[dst], [src], size90+8.0Bulk async copy (TMA)
cp.async.bulk.tensor--[dst], [src], dims...90+8.0Tensor async copy (TMA, 1-5D tiles)
cp.async.bulk.prefetch--[src], size90+8.0Prefetch via TMA
cp.async.bulk.prefetch.tensor--[src], dims...90+8.0Tensor prefetch via TMA
cp.async.bulk.commit_group----90+8.0Commit bulk async group
cp.async.bulk.wait_group--N90+8.0Wait for bulk group completion
cp.reduce.async.bulk.add .min .max .and .or .xor .inc .dec[dst], [src], size90+8.0Bulk async copy with reduction
cp.reduce.async.bulk.tensor.add .min .max .and .or .xor .inc .dec[dst], [src], dims...90+8.0Tensor async copy with reduction
st.async.b32 .b64 .b128[a], b90+8.1Async store
st.bulk--[a], b, size90+8.0Bulk store
red.async.add .min .max .and .or .xor .inc .dec[a], b90+8.1Async reduction

Multimem

MnemonicType suffixesOperandsSMPTXDescription
multimem.ld_reduce.f16 .bf16 .f32 .u32 .s32 .u64d, [a]90+8.1Multicast memory load with reduction
multimem.st.f16 .bf16 .f32 .u32 .s32 .u64[a], b90+8.1Multicast memory store
multimem.red.f16 .bf16 .f32 .u32 .s32 .u64[a], b90+8.1Multicast memory reduction

Cache Control

MnemonicType suffixesOperandsSMPTXDescription
discard.global .L2[a], size80+7.4Discard data (hint: no writeback)
applypriority.global .L2[a], size, prio80+7.4Set cache eviction priority
createpolicy.cvt.L2d, imm80+7.4Create cache policy from immediate
createpolicy.fractional.L2d, fraction80+7.4Create fractional cache policy
createpolicy.range.L2d, lo, hi, hit, miss80+7.4Create cache policy for address range

Conversion

MnemonicType suffixesOperandsSMPTXDescription
cvtall int/float combinationsd, aall1.0Type conversion (rounding: .rn .rz .rm .rp .rna)
cvt.pack.b16.b32 .b32.b32d, a, b80+7.1Pack two values into one register

cvt type combinations (from instruction table encoding strings):

The instruction table builder registers extensive type-pair combinations for cvt. Representative signatures:

| Source | Destination | Notes |
|---|---|---|
| F[16/32/64] | F[16/32/64] | Float-to-float, rounding modes apply |
| F[16/32/64] | I[8/16/32/64] | Float-to-integer, rounding + saturation |
| I[8/16/32/64] | F[16/32/64] | Integer-to-float, rounding modes |
| I[8/16/32/64] | I[8/16/32/64] | Integer-to-integer, sign extend / truncate |
| E16 | F[16/64] / I[8/16/32/64] | bf16 source conversions (sm_80+) |
| F[16/64] / I[8/16/32/64] | E16 | bf16 destination conversions (sm_80+) |
| H32 (tf32) | various | TensorFloat-32 conversions (sm_80+) |
| Q16 (fp8 e5m2) | F32 / H32 / E32 | FP8 conversions (sm_89+) |
| R8 (fp8 e4m3) | F32 / H32 / E32 | FP8 e4m3 conversions (sm_89+) |
| R16 (fp4) | F32 | FP4 conversions (sm_100+, PTX 8.6) |
| Q32 (fp8 e5m2) | F32 | Extended FP8 conversions (sm_100+) |

The modifiers .ftz (flush-to-zero), .sat (saturation), and .relu (clamp negative results to 0) are recognized. The .rna rounding mode (round-to-nearest, ties away from zero; PTX 8.3+) is registered for cvt.tf32.f32.
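
PTX applies these rounding modes to floating-point conversion; the integer-quantization demo below only illustrates the tie-breaking difference between .rn (ties to even) and .rna (ties away from zero), using Python's decimal module:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def rn(x: str) -> int:
    """.rn: round to nearest, ties to even (the IEEE default)."""
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

def rna(x: str) -> int:
    """.rna: round to nearest, ties away from zero (decimal's
    ROUND_HALF_UP is away-from-zero on ties for both signs)."""
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

# Only the tie cases differ:
print(rn("2.5"), rna("2.5"))    # 2 3
print(rn("-2.5"), rna("-2.5"))  # -2 -3
```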

szext -- Sign/Zero Extend

MnemonicType suffixesOperandsSMPTXDescription
szext.b32d, a, pos90+8.0Sign- or zero-extend at bit position

Texture and Surface

Texture

MnemonicType suffixesOperandsSMPTXDescription
tex.1d .2d .3d .a1d .a2d .cube .acubed, [tex, sampler, coord]all1.0Texture fetch with sampler
tex.base.1d .2d .3dd, [tex, coord]60+5.0Texture fetch base level
tex.level.1d .2d .3d .a1d .a2d .cube .acubed, [tex, sampler, coord, lod]all1.0Texture fetch at explicit LOD
tex.grad.1d .2d .3d .a1d .a2d .cube .acubed, [tex, sampler, coord, dPdx, dPdy]all1.0Texture fetch with explicit gradients
tld4.2d .a2dd, [tex, sampler, coord]20+2.0Texture gather (4 texels)
txq.width .height .depth .channel_data_type .channel_order .normalized_coords .filter_mode .addr_mode_0 .addr_mode_1 .addr_mode_2 .samp_pos .num_mip_levels .num_samplesd, [tex]20+2.0Texture query

Return types for texture ops: .v4.s32 .v4.u32 .v4.f32 (4-component return).

Surface

MnemonicType suffixesOperandsSMPTXDescription
suld.b.1d .2d .3d .a1d .a2dd, [surf, coord]20+2.0Surface load (bindless)
sust.b.1d .2d .3d .a1d .a2d[surf, coord], a20+2.0Surface store (bindless)
sust.p.1d .2d .3d .a1d .a2d[surf, coord], a20+2.0Surface store (packed format)
sured.b.1d .2d .3dd, [surf, coord], a20+2.0Surface reduction (.add .min .max .and .or)
suq.width .height .depth .channel_data_type .channel_order .array_sized, [surf]20+2.0Surface query

Surface clamp modes: .trap .clamp .zero

Tensormap

MnemonicType suffixesOperandsSMPTXDescription
tensormap.replace.tile .im2cold, [tmap], field, value90+8.0Replace tensormap field at runtime
tensormap.cp_fenceproxy--[tmap]90+8.0Tensormap copy fence proxy

Control Flow

MnemonicType suffixesOperandsSMPTXDescription
bra--targetall1.0Branch (unconditional or predicated)
bra.uni--targetall1.0Uniform branch (all threads take same direction)
brx.idx--a, [targets]70+6.0Indexed branch (jump table)
call--(ret), func, (params)20+2.0Function call (with ABI)
ret----20+2.0Return from function
exit----all1.0Exit kernel / terminate thread
trap----all1.0Trigger error
brkpt----11+1.0Breakpoint (debugger halt)
pmevent--imm20+2.0Performance monitor event
pmevent.mask--imm20+3.0Performance monitor event with mask
nanosleep--t70+6.3Sleep for t nanoseconds

Synchronization and Barriers

Legacy Barrier

MnemonicType suffixesOperandsSMPTXDescription
bar.sync--a{, b}all1.0Barrier synchronize (CTA-level)
bar.arrive--a, ball1.0Barrier arrive (non-blocking)
bar.red.and .or .popcd, a, {b}, pall1.0Barrier with reduction
bar.warp.syncmembermask70+6.0Warp-level barrier

Named Barrier (PTX 6.0+)

MnemonicType suffixesOperandsSMPTXDescription
barrier--a{, b}70+6.0Named barrier synchronize
barrier.arrive--a, b70+6.0Named barrier arrive
barrier.red.and .or .popcd, a, {b}, p70+6.0Named barrier with reduction

CTA-Cluster Barrier (PTX 7.8+)

MnemonicType suffixesOperandsSMPTXDescription
bar.cta----90+7.8CTA-level barrier sync
bar.cta.arrive----90+7.8CTA-level barrier arrive
bar.cta.red.and .or .popcd, p90+7.8CTA barrier with reduction
barrier.cta----90+7.8CTA named barrier sync
barrier.cta.arrive----90+7.8CTA named barrier arrive
barrier.cta.red.and .or .popcd, p90+7.8CTA named barrier with reduction
barrier.cluster.arrive----90+7.8Cluster-level barrier arrive
barrier.cluster.wait----90+7.8Cluster-level barrier wait

Asynchronous Barriers (mbarrier)

MnemonicType suffixesOperandsSMPTXDescription
mbarrier.init.shared.b64[mbar], count80+7.0Initialize mbarrier with expected count
mbarrier.inval.shared.b64[mbar]80+7.0Invalidate mbarrier
mbarrier.arrive.shared.b64state, [mbar]80+7.0Arrive at mbarrier
mbarrier.arrive_drop.shared.b64state, [mbar]80+7.0Arrive and drop expected count
mbarrier.test_wait.shared.b64p, [mbar], state80+7.0Test if mbarrier phase complete
mbarrier.test_wait.parity.shared.b64p, [mbar], parity80+7.1Test mbarrier parity
mbarrier.try_wait.shared.b64p, [mbar], state80+7.8Try-wait on mbarrier (with timeout)
mbarrier.try_wait.parity.shared.b64p, [mbar], parity80+7.8Try-wait on mbarrier parity
mbarrier.pending_count.b64d, state80+7.0Get pending arrival count
mbarrier.complete_tx.shared.b64[mbar], count90+8.0Complete transaction at mbarrier
mbarrier.expect_tx.shared.b64[mbar], count90+8.0Set expected transaction count
mbarrier.tx--[mbar], count90+8.0Transaction mbarrier arrive
mbarrier.arrive.expect_tx.shared.b64state, [mbar], count90+8.0Arrive with expected tx count
mbarrier.arrive_drop.expect_tx.shared.b64state, [mbar], count90+8.0Arrive-drop with expected tx count

Memory Fence

MnemonicType suffixesOperandsSMPTXDescription
membar.cta .gl .sys--all1.0Memory barrier (scope)
membar.proxy.alias--75+6.4Proxy memory barrier (alias scope)
fence.proxy.alias .async .async.global .async.shared::cta--70+6.0Fence proxy (alias/async)
fence.proxy.tensormap--[addr]90+8.0Fence tensormap proxy

Grid Dependency Control

MnemonicType suffixesOperandsSMPTXDescription
griddepcontrol.launch_dependents .wait--90+7.8Grid dependency control

Atomic and Reduction

MnemonicType suffixesOperandsSMPTXDescription
atom.s32 .u32 .u64 .f32 .f64 .b32 .b64 .f16x2 .bf16x2d, [a], ball1.1Atomic RMW (.add .min .max .and .or .xor .exch .cas .inc .dec)
atom.global(same as atom)d, [a], ball1.1Atomic on global memory
atom.shared(same as atom)d, [a], ball1.1Atomic on shared memory
red.s32 .u32 .u64 .f32 .f64 .b32 .b64 .f16x2 .bf16x2[a], ball1.1Reduction (no return value)
red.global(same as red)[a], ball1.1Reduction on global memory

Atom/red scope modifiers (PTX 6.0+): .cta .gpu .sys .cluster

Matrix (MMA / Tensor Core)

WMMA (Warp Matrix Multiply-Accumulate, PTX 6.0+)

MnemonicType suffixesOperandsSMPTXDescription
wmma.load.a.sync .alignedd, [ptr], stride70+6.0Load matrix A fragment
wmma.load.b.sync .alignedd, [ptr], stride70+6.0Load matrix B fragment
wmma.load.c.sync .alignedd, [ptr], stride70+6.0Load accumulator C fragment
wmma.store.d.sync .aligned[ptr], d, stride70+6.0Store result D fragment
wmma.mma.sync .alignedd, a, b, c70+6.0Matrix multiply-accumulate

WMMA shapes (from validator sub_4BFED0): .m16n16k16 .m32n8k16 .m8n32k16 .m16n16k8 etc.
WMMA type combinations (from string table): F16F16F16F16, F32F16F16F32, F32F32 (TF32), I32I8I8I32, I32I4I4I32, I32B1B1I32, F64F64F64F64

MMA (PTX 6.5+)

MnemonicType suffixesOperandsSMPTXDescription
mma.sync .alignedd, a, b, c75+6.5Matrix multiply-accumulate (Turing+)

MMA type combinations (verified from instruction table builder strings):

| D type | A type | B type | C type | SM | Notes |
|---|---|---|---|---|---|
| F16 | F16 | F16 | F16 | 75+ | Native FP16 |
| F32 | F16 | F16 | F32 | 75+ | Mixed-precision |
| F32 | F32 | F32 | -- | 80+ | TF32 Tensor Core |
| F32 | E16 | E16 | F32 | 80+ | BFloat16 |
| F32 | T32 | T32 | F32 | 80+ | TF32 path (string: F32T32T32F32) |
| I32 | I8 | I8 | I32 | 75+ | INT8 |
| I32 | I4 | I4 | I32 | 75+ | INT4 |
| I32 | B1 | B1 | I32 | 75+ | Binary (1-bit) |
| F64 | F64 | F64 | F64 | 80+ | Double-precision |
| F16 | Q8 | Q8 | F16 | 89+ | FP8 (e5m2) |
| F32 | Q8 | Q8 | F32 | 89+ | FP8 mixed |
| F16 | R4 | Q8 | F16 | 100+ | FP4 x FP8 |
| F32 | R4 | Q8 | F32 | 100+ | FP4 x FP8 mixed |
| F32 | R4 | R4 | F32 | 100+ | FP4 x FP4 |
| F32 | R4 | R4 | F32.Q8 | 100+ | FP4 with scale (string: F32R4R4F32Q8) |
| F32 | Q8 | Q8 | F32.Q8 | 100+ | FP8 with block scale |

MMA shapes: .m16n8k16 .m16n8k32 .m16n8k64 .m16n8k128 .m16n8k256
Sparse MMA modifiers: .sp with metadata selector and sparsity pattern

WGMMA (Warp-Group MMA, PTX 7.8+)

MnemonicType suffixesOperandsSMPTXDescription
wgmma.mma_async.alignedd, a_desc, b_desc90+7.8Warp-group async matrix multiply
wgmma.fence.aligned--90+7.8WGMMA fence (ordering)
wgmma.commit_group.aligned--90+7.8Commit WGMMA group
wgmma.wait_group.alignedN90+7.8Wait for WGMMA group completion

WGMMA operand encoding strings (from instruction table, selection): hUUhP, fUUfP, hUhhP, fUhfP, hhUhP, fhUfP (h=half dest, f=float dest, U=desc operand, P=pred)
With accumulator: hUUhdC, fUUfdC, hUhhdC, fUhfdC
With scale: hUUhdCP, fUUfdCP (P=pred control for scale)

TCGen05 (5th Generation Tensor Core, sm_100+)

MnemonicType suffixesOperandsSMPTXDescription
tcgen05.mma--d, a_desc, b_desc100+8.65th-gen tensor core MMA
tcgen05.mma.ws--d, a_desc, b_desc100+8.65th-gen MMA with warpgroup scale
tcgen05.ld--d, [desc]100+8.6TC load from descriptor
tcgen05.ld.red--d, [desc], src100+8.6TC load with reduction
tcgen05.st--[desc], src100+8.6TC store to descriptor
tcgen05.cp--[desc], [src]100+8.6TC copy
tcgen05.commit--[mbar]100+8.6TC commit
tcgen05.shift--[desc]100+8.6TC shift accumulator
tcgen05.alloc--d, nCols100+8.6Allocate TC columns
tcgen05.dealloc--nCols100+8.6Deallocate TC columns
tcgen05.relinquish_alloc_permit----100+8.6Relinquish TC allocation permit
tcgen05.fence----100+8.6TC fence
tcgen05.wait----100+8.6TC wait

TCGen05 MMA operand encodings (from instruction table): MUUuP, MUUMuP, MMUuP, MMUMuP (M=matrix, U=desc, u=uniform, P=pred)
With accumulator descriptors: MUUudP, MUUuPC, MMUudP, MMUuPC, MUUMudP, MMUMudPC
With metadata: MUUuMMP, MUUMuMMP, MMUuMMP, MMUMuMMP (sparse)

TCGen05 Guardrails (internal)

These are internal debug/verification instructions, not user-facing PTX:

MnemonicSMDescription
_tcgen05.guardrails.is_phase_valid100+Validate TC phase
_tcgen05.guardrails.are_columns_allocated100+Check column allocation
_tcgen05.guardrails.is_current_warp_valid_owner100+Check warp ownership
_tcgen05.guardrails.in_physical_bounds100+Check physical bounds
_tcgen05.guardrails.allocation_granularity100+Validate allocation granularity
_tcgen05.guardrails.datapath_alignment100+Validate datapath alignment
_tcgen05.guardrails.sp_consistency_across_idesc_mod100+Sparse consistency check
_tcgen05.guardrails.check_sparse_usage100+Validate sparse usage

Matrix Utility Instructions

MnemonicType suffixesOperandsSMPTXDescription
ldmatrix.sync .aligned .m8n8 .numd, [ptr]75+6.5Load matrix from shared memory
stmatrix.sync .aligned .m8n8 .num[ptr], a90+7.8Store matrix to shared memory
movmatrix.alignedd, a80+7.1Move/transform matrix fragment

SIMD Video Instructions

These 8/16-bit SIMD instructions operate on packed sub-word elements within 32-bit registers.

MnemonicType suffixesOperandsSMPTXDescription
vadd.s32.s32 .s32.u32 .u32.s32 .u32.u32d, a.asel, b.bsel, c20+2.0SIMD add with secondary op
vsub(same)d, a.asel, b.bsel, c20+2.0SIMD subtract
vmad(same)d, a.asel, b.bsel, c20+2.0SIMD multiply-add
vmin(same)d, a.asel, b.bsel, c20+2.0SIMD minimum
vmax(same)d, a.asel, b.bsel, c20+2.0SIMD maximum
vabsdiff(same)d, a.asel, b.bsel, c20+2.0SIMD absolute difference
vset(same)d, a.asel, b.bsel, c20+2.0SIMD set on compare
vshl.u32.u32d, a.asel, b.bsel, c20+2.0SIMD shift left
vshr.s32.u32 .u32.u32d, a.asel, b.bsel, c20+2.0SIMD shift right
vadd2(packed-pairs)d, a, b, c30+3.0Dual 16-bit SIMD add
vsub2(packed-pairs)d, a, b, c30+3.0Dual 16-bit SIMD subtract
vmin2(packed-pairs)d, a, b, c30+3.0Dual 16-bit SIMD minimum
vmax2(packed-pairs)d, a, b, c30+3.0Dual 16-bit SIMD maximum
vabsdiff2(packed-pairs)d, a, b, c30+3.0Dual 16-bit absolute difference
vset2(packed-pairs)d, a, b, c30+3.0Dual 16-bit set on compare
vavrg2(packed-pairs)d, a, b, c30+3.0Dual 16-bit average
vadd4(packed-quads)d, a, b, c30+3.0Quad 8-bit SIMD add
vsub4(packed-quads)d, a, b, c30+3.0Quad 8-bit SIMD subtract
vmin4(packed-quads)d, a, b, c30+3.0Quad 8-bit SIMD minimum
vmax4(packed-quads)d, a, b, c30+3.0Quad 8-bit SIMD maximum
vabsdiff4(packed-quads)d, a, b, c30+3.0Quad 8-bit absolute difference
vset4(packed-quads)d, a, b, c30+3.0Quad 8-bit set on compare
vavrg4(packed-quads)d, a, b, c30+3.0Quad 8-bit average

Element selectors: .b0 .b1 .b2 .b3 (byte), .h0 .h1 (half-word)
Secondary ops: .add .min .max
Saturation: .sat

Miscellaneous

MnemonicType suffixesOperandsSMPTXDescription
getctarank.shared .globald, a90+7.8Get CTA rank in cluster from address
istypep.texref .samplerref .surfrefp, aall1.0Test if variable is a type
preexit----all1.0Pre-exit notification
stacksave--d20+2.0Save stack pointer
stackrestore--a20+2.0Restore stack pointer
alloca--d, size20+2.0Dynamic stack allocation
clusterlaunchcontrol.try_cancel.async--[mbar], d100+8.7Cluster launch cancel (async)
clusterlaunchcontrol.query_cancel--d, [mbar]100+8.7Query cluster launch cancel status

Register and Shared Memory Control

MnemonicType suffixesOperandsSMPTXDescription
setmaxnreg.inc--N90+7.8Increase max register count
setmaxnreg.dec--N90+7.8Decrease max register count
setmaxreg.alloc--N100+8.6Allocate registers to max
setmaxreg.dealloc--N100+8.6Deallocate from max registers
setmaxreg.try_alloc--d, N100+8.6Try-allocate registers
setsmemsize--N90+7.8Set dynamic shared memory size
setsmemsize.flush--N100+8.6Set shared memory size with flush
getnextworkid--d90+8.0Get next dynamic work unit ID

Warp-Group Management

MnemonicType suffixesOperandsSMPTXDescription
_warpgroup.arrive----90+7.8Warpgroup arrive (internal)
_warpgroup.wait--N90+7.8Warpgroup wait
_warpgroup.commit_batch----90+7.8Warpgroup commit batch

Internal Instructions

These underscore-prefixed instructions are not part of the public PTX ISA. They are generated internally by ptxas during lowering, stub synthesis, or as pre-codegen IR representations. All are registered in the instruction table builder sub_46E000 and appear in --dumpir output, but users never write them directly.

Internal Memory

| Mnemonic | Type suffixes | Operands | String addr | Handler / Formatter | Description |
|---|---|---|---|---|---|
| _ldldu | (varies) | d, [a] | 0x1d080ee | formatter sub_4DD860 | Unified load-uniform; combines ld+ldu semantics for uniform-cache-path loads |
| _ldsm | .b8 .b16 .s8.s4 .u8.u4 .s4.s2 .u4.u2 | d, [M] | 0x1d076c2 | handlers sub_46B0C0--sub_46B160, validator sub_4AEB60 | Load shared matrix; loads matrix tiles from shared memory into registers for MMA. Opcode ID 28 |
| _movm | .b16 .s8.s4 .u8.u4 .s4.s2 .u4.u2 | d, a | 0x1d076da | handlers sub_46B1B0--sub_46B260 | Move matrix; register-to-register matrix data movement with optional format conversion. Opcode ID 29 |

Internal Cache Control

| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
| _createpolicy.fractional | .L2 | d, fraction | 0x1d0813a | 0x47752f | Internal form of createpolicy.fractional; creates fractional L2 cache eviction policy |
| _createpolicy.range | .L2 | d, lo, hi, hit, miss | 0x1d08158 | 0x477579 | Internal form of createpolicy.range; creates L2 policy for address range |

Internal Surface

| Mnemonic | Type suffixes | Operands | String addr | Table builder xrefs | Description |
|---|---|---|---|---|---|
| _sulea.b | (varies) | d, [surf, coord] | 0x1d088bc | 0x4815cd, 0x48166b | Surface load effective address, bindless; computes address for suld.b without performing the load |
| _sulea.p | (varies) | d, [surf, coord] | 0x1d088c5 | 0x48161c, 0x4816ba | Surface load effective address, packed; computes address for sust.p-mode surface access |

Internal FP / Guard

| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
| _checkfp.divide | (varies) | d, a, b | 0x1d088d2 | 0x481709 | FP division guard; inserted during lowering to validate the divisor (handles division-by-zero, denormals) before SASS div emission |

Internal Control Flow / ABI

| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
| _gen_proto | -- | (opaque) | 0x1d08903 | 0x48189a | Generate function prototype; synthesizes call prototypes for indirect / device-runtime calls during ABI resolution |
| _jcall | -- | target | 0x1d0890e | 0x4818df | Internal jump-call; used inside auto-generated unified-function-stub (UFT) wrappers synthesized by sub_451680 (.func .attribute(.unified_func_stub) __cuda_uf_stub_%s() { _jcall %s; }) |

Internal Warp

| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
| _match | (varies) | d, a | 0x1d08a24 | 0x483404 | Internal match; pre-sync lowered form of the warp match instruction, distinct from the public match.sync |

Internal MMA / Tensor Core

| Mnemonic | Type suffixes | Operands | String addr | Handlers | Description |
|---|---|---|---|---|---|
| _mma.warpgroup | 135 type combos (F16, BF16, TF32, FP8, INT8) | d, a, b, c | 0x1d072e3 | 135 handlers sub_4668A0--sub_469FD0 | Warp-group MMA; pre-codegen form of WGMMA. Each handler registers one (src, dst, acc) type triple via sub_465030. Lowers to MERCURY_warpgroup_mma_* SASS opcodes (sm_90+) |
| _zzn.z8a8x4 | (sub-byte int) | d, a, b, c | 0x1cfdc03 | data table at 0x1cfe678 | ROT13-obfuscated _mma.m8n8k4; sub-byte integer MMA with tile shape m8n8k4 for INT4/INT2 and bit-level XOR MMA (sm_75+) |

Handler address summary for internal instructions:

RangeContents
sub_46B0C0--sub_46B260_ldsm (3) + _movm (3) type-variant handlers
sub_4668A0--sub_469FD0_mma.warpgroup 135 type-variant handlers
sub_4AEB60_ldsm validator (3.7 KB) -- handles .s8.s4/.u8.u4 format rules
sub_451680_jcall UFT stub generator
sub_4DD860_ldldu PTX text formatter

Instruction Table Builder Internals

Registration Mechanism

The 93 KB function sub_46E000 runs once during parser initialization (sub_451730). Each of its 1,141 calls to sub_46BED0 has the form:

sub_46BED0(table, operand_encoding, opcode_id, opcode_name, 
           type_flags, sm_requirement, xmm_data, extra);

Where:

  • operand_encoding is a compact string like "F32F32", "I32I8I8I32", "MUUuP"
  • opcode_id is the internal opcode integer (mapped 1:1 to Ori IR opcodes from ctor_003)
  • type_flags is a bitfield encoding which .sNN/.uNN/.fNN/.bNN qualifiers are legal
  • sm_requirement gates the instruction to architectures >= this SM version

The operand encoding characters are:

Char  ID  Meaning               Type bits registered
----  --  -------               --------------------
F     1   Float operand         .f16 .f32 .f64
H     2   Half-precision        .f16 .f16x2
N     3   Immediate / numeric   (no type suffix)
I     4   Integer operand       .s8--.s64 .u8--.u64
B     5   Bitwise operand       .b8--.b128
P     6   Predicate             .pred
O     7   Optional operand      (no type suffix)
E     8   Extended type         .bf16 .e4m3 .e5m2
Q     10  FP8 type              .e5m2 .e4m3 (fp8)
R     11  FP4/narrow type       .e2m1 (fp4)
M     --  Matrix descriptor     Tensor core descriptor
U     --  Uniform descriptor    TMA/TC uniform register
C     --  Carry/accumulator     MMA accumulator control

When a letter is followed by a digit, that digit constrains the bit-width: F32 means only .f32, I16 means only .s16/.u16. The function sub_1CB0850 registers each valid width into a bitset.
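The letter-plus-digit convention can be sketched as a small decoder. This is a reconstruction for illustration only -- the struct and function names are ours, not recovered symbols, and lowercase descriptor characters (as in "MUUuP") are simply passed through as class characters:

```c
#include <ctype.h>

/* Hypothetical sketch: split an operand-encoding string such as
 * "F32F32" or "I32I8I8I32" into per-operand (class, width) pairs.
 * A width of 0 means "any legal width for the class". */
typedef struct {
    char class_char;  /* F, H, N, I, B, P, O, E, Q, R, M, U, C */
    int  width;       /* 0 = unconstrained */
} OperandSpec;

static int parse_operand_encoding(const char *enc, OperandSpec *out, int max) {
    int n = 0;
    while (*enc && n < max) {
        out[n].class_char = *enc++;
        out[n].width = 0;
        /* an optional trailing number constrains the bit-width */
        while (isdigit((unsigned char)*enc))
            out[n].width = out[n].width * 10 + (*enc++ - '0');
        n++;
    }
    return n;  /* number of operand slots decoded */
}
```

Under this reading, "I32I8I8I32" decodes to four integer operands with widths 32, 8, 8, 32 -- matching the one-bitset-per-operand registration performed by sub_1CB0850.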

Descriptor Layout

Each registered instruction creates a 368-byte descriptor node via sub_424070. Key fields:

OffsetFieldDescription
+0opcode_idInternal opcode identifier
+8type_flagsBitfield of legal type qualifiers
+12xmm_data128-bit SIMD data (rounding, modifiers)
+28extra_flagsArchitecture / behavior flags
+36operand_countNumber of operands
+40+operand_slotsPer-operand type bitsets (4 slots max, each 8 bytes)
+232name_lengthLength of the opcode name string

The two hash tables at offsets 2472 and 2480 in the instruction table provide dual-path lookup: the first table is probed first, and on a miss the second is checked. This two-table scheme separates core instructions from extended/variant instructions.
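A minimal sketch of the fallback lookup. Flat arrays stand in for the real hash tables here, and all names are ours, not recovered symbols:

```c
#include <string.h>
#include <stddef.h>

/* Sketch of the dual-table scheme: probe the primary table, fall back
 * to the extended/variant table on a miss. */
typedef struct { const char *name; int opcode_id; } Entry;

static const Entry *table_find(const Entry *t, size_t n, const char *name) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(t[i].name, name) == 0)
            return &t[i];
    return NULL;
}

static const Entry *lookup_insn(const Entry *primary, size_t np,
                                const Entry *extended, size_t ne,
                                const char *mnemonic) {
    const Entry *e = table_find(primary, np, mnemonic);
    return e ? e : table_find(extended, ne, mnemonic);
}
```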

Instruction Count by Category

Based on the 1,141 registration calls and the 431 decoded Ori IR opcodes (from ctor_003):

CategoryApproximate descriptor count
Integer arithmetic (add/sub/mul/mad/div/rem/...)~120
Float arithmetic (fadd/fmul/fma/div/rcp/sqrt/...)~100
Half/BF16 arithmetic~60
Comparison and selection (setp/selp/set/slct)~50
Logic and shift (and/or/xor/shl/shr/lop3/shf)~40
Conversion (cvt + 100+ type combinations)~180
Load/store/atomic (ld/st/atom/red + variants)~150
Texture/surface (tex/suld/sust/sured/txq)~80
Control flow (bra/call/ret/exit/bar/barrier)~50
MMA/WMMA/WGMMA/TCGen05~160
SIMD video (vadd/vsub/vmin/vmax/...)~60
Miscellaneous (prmt/bfe/bfi/shfl/vote/...)~91
Total1,141

PTX 8.x New Instructions

Instructions introduced in PTX ISA 8.0--8.7 (CUDA 12.0--12.8), verified from SM architecture gates and PTX version checks in the ptxas v13.0.88 binary:

PTXInstructionSMDescription
8.0cp.async.bulk90+TMA bulk async copy
8.0cp.async.bulk.tensor90+TMA tensor async copy
8.0cp.reduce.async.bulk90+Bulk async copy with reduction
8.0cp.reduce.async.bulk.tensor90+Tensor async copy with reduction
8.0cp.async.bulk.prefetch90+TMA prefetch
8.0cp.async.bulk.prefetch.tensor90+TMA tensor prefetch
8.0cp.async.bulk.commit_group90+Commit TMA group
8.0cp.async.bulk.wait_group90+Wait for TMA group
8.0tensormap.replace90+Runtime tensormap modification
8.0elect / elect.one90+Elect leader thread
8.0mbarrier.complete_tx90+Transaction mbarrier completion
8.0mbarrier.expect_tx90+Set expected transaction count
8.0st.bulk90+Bulk store
8.0szext90+Sign/zero extend at bit position
8.0getnextworkid90+Dynamic work unit
8.1st.async90+Asynchronous store
8.1red.async90+Asynchronous reduction
8.1multimem.ld_reduce90+Multicast load-reduce
8.1multimem.st90+Multicast store
8.1multimem.red90+Multicast reduction
8.6tcgen05.mma100+5th-gen tensor core MMA
8.6tcgen05.mma.ws100+5th-gen MMA with warp scale
8.6tcgen05.ld / tcgen05.st100+TC load/store
8.6tcgen05.ld.red100+TC load with reduction
8.6tcgen05.cp100+TC copy
8.6tcgen05.commit100+TC commit
8.6tcgen05.shift100+TC shift accumulator
8.6tcgen05.alloc / tcgen05.dealloc100+TC column allocation
8.6tcgen05.relinquish_alloc_permit100+Relinquish TC permit
8.6tcgen05.fence / tcgen05.wait100+TC fence/wait
8.6setmaxreg.alloc / .dealloc / .try_alloc100+Register management
8.6setsmemsize.flush100+Shared memory with flush
8.7clusterlaunchcontrol.try_cancel.async100+Cluster launch cancel
8.7clusterlaunchcontrol.query_cancel100+Query cancel status

Cross-References

Key Functions

AddressSizeRoleConfidence
sub_46E00093KBInstruction table builder; runs once during parser init, makes 1,141 calls to sub_46BED0 to register all PTX instruction descriptors0.95
sub_46BED0--Instruction descriptor registrar; called 1,141 times with operand encoding, opcode ID, type flags, SM requirement0.95
sub_46C690--Instruction lookup entry point; dispatches to sub_46C6E0 for name-to-descriptor matching0.90
sub_46C6E06.4KBInstruction name matcher; resolves PTX mnemonic strings to instruction table descriptors0.90
sub_5D419012.9KBPTX text formatter dispatch; 81-string + 473-entry hash table routing to per-instruction formatters0.90
sub_489390--SM version gate; validates minimum SM architecture for each instruction0.85
sub_489050--PTX ISA version gate; validates minimum PTX version for each instruction0.85
sub_4BFED0--WMMA shape validator; checks legal WMMA shape combinations (.m16n16k16, .m32n8k16, etc.)0.85
sub_451680--UFT stub generator; synthesizes _jcall wrappers for unified-function-stub entries0.90
sub_451730--Parser initialization; calls sub_46E000 to build the instruction table0.85
sub_4DD860--_ldldu PTX text formatter; handles internal unified load-uniform instruction0.85
sub_4AEB603.7KB_ldsm validator; validates .s8.s4/.u8.u4 format rules for shared matrix loads0.85
sub_465030--MMA type triple registrar; called by 135 _mma.warpgroup handlers to register (src, dst, acc) type combinations0.85
sub_424070--Instruction descriptor allocator; creates 368-byte descriptor nodes for each registered instruction0.85
sub_1CB0850--Width bitset registrar; registers valid bit-widths (e.g., F32 -> .f32 only) into per-operand bitsets0.80

EIATTR Attribute Catalog

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

EIATTR (ELF Info ATTRibute) is NVIDIA's proprietary metadata system embedded in .nv.info ELF sections within CUBIN files. Every CUDA kernel carries EIATTR records that tell the GPU driver how many registers to allocate, how much shared memory to reserve, what barriers the kernel uses, and dozens of other resource descriptors. Without this metadata, the driver cannot launch the kernel -- it has no way to determine the kernel's hardware resource footprint.

ptxas v13.0.88 defines 97 EIATTR codes, numbered 0 through 96 (0x00--0x60). The code-to-name mapping was extracted from the pointer table at VA 0x23FDC20 in the ptxas binary (16-byte entries: 8-byte string pointer + 8-byte metadata word, indexed by code number). The string names reside at 0x23FC6C7--0x23FD040. Code assignments were cross-verified against the nvlink v13.0.88 pointer table at 0x1D37D60, confirming identical enumeration across both tools.

ELF section typeSHT_CUDA_INFO = 0x70000064
Section name (global).nv.info
Section name (per-function).nv.info.<function_name>
Record formatType-Length-Value (TLV), 4-byte aligned
Known attribute count97 codes: 0--96 (v13.0.88)
Name table VA0x23FDC20 (97 entries x 16 bytes = 1,552 bytes)
EIATTR builder functionsub_1CC9800 (14,764 bytes, 90 KB decompiled -- third largest in output range)
Barrier/register propagatorsub_1CC8950 (2,634 bytes, propagates counts across call graph)
TLV record emittersub_1CC85F0 (44 lines, writes individual EIATTR records)
SM-version gatingsub_1C97840 (checks whether an EIATTR code is valid for a given SM version)

TLV Record Format

Each .nv.info section contains a flat sequence of 4-byte-aligned TLV records. There is no section header or record count -- the parser walks from byte 0 to sh_size, consuming records sequentially.

Record Layout

Offset  Size  Field
------  ----  -----
0x00    1     format      Format byte (determines payload structure)
0x01    1     attr_code   EIATTR type code (0x00--0x60)
0x02    2     size        Payload size in bytes (little-endian uint16)
0x04    var   payload     Attribute-specific data (size bytes)

Total record size = 4 + size, padded up to 4-byte alignment. The minimum record is 4 bytes (format + code + size=0, no payload).
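The size rule as a one-line helper (a sketch; the function name is ours):

```c
#include <stdint.h>

/* Total on-disk size of one TLV record: 4-byte header plus the payload,
 * with the payload rounded up to 4-byte alignment -- the same stride
 * used by the parsing loop below (4 + ALIGN_UP(size, 4)). */
static uint32_t eiattr_record_size(uint16_t payload_size) {
    return 4 + ((payload_size + 3u) & ~3u);
}
```

For example, an Indexed record with an 8-byte payload occupies 12 bytes, and a 5-byte Free payload is padded out to the same 12-byte total.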

Format Byte

The format byte at offset 0 controls how the payload is interpreted:

FormatNamePayload structureTypical use
0x01FreeRaw bytes, attribute-specific layoutOffset tables, parameter info
0x02ValueSingle 32-bit value (no symbol index)Global flags
0x03Sized16-bit value + paddingCounts, sizes
0x04Indexed[sym_index:4][value:4] -- per-symbol attributePer-kernel resources

Format 0x04 (indexed) is the most common for per-function attributes. The 4-byte symbol index at payload offset 0 identifies which function the attribute applies to. The linker uses this index for symbol remapping during merge and for per-function property extraction during finalization.

Binary Evidence -- sub_1CC85F0

The TLV record emitter function directly confirms the encoding:

// sub_1CC85F0 -- simplified from decompilation
// a2 = attr_code, a3 = 16-bit value/size, a4 = payload data, a5 = symbol index
void emit_eiattr(ElfWriter* elfw, uint8_t attr_code, uint16_t size, void* data, uint32_t sym_idx) {
    if (!is_valid_for_sm(attr_code, elfw->sm_version))
        return;

    int section_index = get_nvinfo_section(elfw, sym_idx);

    // Allocate 16-byte record buffer
    uint8_t* record = pool_alloc(16);

    // TLV header
    record[0] = 0x04;                // format = Indexed
    record[1] = attr_code;           // EIATTR type code
    *(uint16_t*)(record + 2) = size; // payload size
    *(uint32_t*)(record + 4) = section_index; // symbol index

    // Stash the payload pointer in the record; the section writer
    // serializes it when the .nv.info section is flushed
    *(uint64_t*)(record + 8) = (uint64_t)data;

    // Append to .nv.info section's linked list
    list_append(record, &elfw->nvinfo_list);
}

Parsing Pseudocode

uint8_t *ptr = section_data;
uint8_t *end = section_data + section_size;

while (ptr < end) {
    uint8_t  format    = ptr[0];
    uint8_t  attr_code = ptr[1];
    uint16_t size      = *(uint16_t *)(ptr + 2);

    if (format == 0x04) {
        // Indexed: first 4 bytes of payload = symbol index
        uint32_t sym_idx = *(uint32_t *)(ptr + 4);
        uint32_t value   = *(uint32_t *)(ptr + 8);
        process_indexed_attribute(attr_code, sym_idx, value);
    } else if (format == 0x02) {
        // Value: single 32-bit immediate
        uint32_t value = *(uint32_t *)(ptr + 4);
        process_global_attribute(attr_code, value);
    } else {
        // Free/sized: attribute-specific handling
        process_raw_attribute(attr_code, ptr + 4, size);
    }

    ptr += 4 + ALIGN_UP(size, 4);
}

Section Variants

A cubin contains two kinds of .nv.info sections:

Global .nv.info -- A single section named .nv.info with sh_link = 0 (no associated symbol). Contains attributes that apply to the entire compilation unit: CUDA API version, compatibility flags, and shared metadata not specific to any one kernel.

Per-function .nv.info.<name> -- One section per kernel or device function, named .nv.info.<function_name> with sh_link pointing to the corresponding symbol table entry. Carries per-kernel resource descriptors: register count, barrier count, stack sizes, parameter bank layout, and instruction-offset tables.

Both section variants use sh_type = SHT_CUDA_INFO (0x70000064). The ELF section type is the authoritative way to identify .nv.info sections; the name is only a convention.

Complete Code Table

All 97 EIATTR codes in numeric order. Extracted from the ptxas pointer table at VA 0x23FDC20. The "Format" column reflects the typical TLV format byte used when emitting that attribute. The "Meta" column shows the metadata word from the pointer table (lo word encodes minimum toolkit version compatibility, hi word encodes flags).

CodeHexNameFormatMetaCategory
00x00EIATTR_ERROR--1Sentinel
10x01EIATTR_PAD--1Sentinel
20x02EIATTR_IMAGE_SLOTIndexed1Texture
30x03EIATTR_JUMPTABLE_RELOCSFree1Metadata
40x04EIATTR_CTAIDZ_USEDIndexed1Metadata
50x05EIATTR_MAX_THREADSIndexed1Resource
60x06EIATTR_IMAGE_OFFSETIndexed1Texture
70x07EIATTR_IMAGE_SIZEIndexed1Texture
80x08EIATTR_TEXTURE_NORMALIZEDIndexed1Texture
90x09EIATTR_SAMPLER_INITIndexed1Texture
100x0AEIATTR_PARAM_CBANKIndexed1Param
110x0BEIATTR_SMEM_PARAM_OFFSETSFree1Param
120x0CEIATTR_CBANK_PARAM_OFFSETSFree1Param
130x0DEIATTR_SYNC_STACKIndexed1Metadata
140x0EEIATTR_TEXID_SAMPID_MAPFree1Texture
150x0FEIATTR_EXTERNSFree1Metadata
160x10EIATTR_REQNTIDIndexed1Resource
170x11EIATTR_FRAME_SIZEIndexed1Resource
180x12EIATTR_MIN_STACK_SIZEIndexed1Resource
190x13EIATTR_SAMPLER_FORCE_UNNORMALIZEDIndexed1Texture
200x14EIATTR_BINDLESS_IMAGE_OFFSETSFree1Texture
210x15EIATTR_BINDLESS_TEXTURE_BANKIndexed1Texture
220x16EIATTR_BINDLESS_SURFACE_BANKIndexed1Texture
230x17EIATTR_KPARAM_INFOFree1Param
240x18EIATTR_SMEM_PARAM_SIZEIndexed1Param
250x19EIATTR_CBANK_PARAM_SIZESized1Param
260x1AEIATTR_QUERY_NUMATTRIBIndexed1Metadata
270x1BEIATTR_MAXREG_COUNTSized1Resource
280x1CEIATTR_EXIT_INSTR_OFFSETSFree1Offsets
290x1DEIATTR_S2RCTAID_INSTR_OFFSETSFree1Offsets
300x1EEIATTR_CRS_STACK_SIZEIndexed1Resource
310x1FEIATTR_NEED_CNP_WRAPPERIndexed1Metadata
320x20EIATTR_NEED_CNP_PATCHIndexed1Metadata
330x21EIATTR_EXPLICIT_CACHINGIndexed1Metadata
340x22EIATTR_ISTYPEP_USEDIndexed1Metadata
350x23EIATTR_MAX_STACK_SIZEIndexed1Resource
360x24EIATTR_SUQ_USEDIndexed1Metadata
370x25EIATTR_LD_CACHEMOD_INSTR_OFFSETSFree1Offsets
380x26EIATTR_LOAD_CACHE_REQUESTIndexed1Metadata
390x27EIATTR_ATOM_SYS_INSTR_OFFSETSFree1Offsets
400x28EIATTR_COOP_GROUP_INSTR_OFFSETSFree1Offsets
410x29EIATTR_COOP_GROUP_MASK_REGIDSIndexed1Cluster
420x2AEIATTR_SW1850030_WARFree1WAR
430x2BEIATTR_WMMA_USEDIndexed2Metadata
440x2CEIATTR_HAS_PRE_V10_OBJECTValue3Metadata
450x2DEIATTR_ATOMF16_EMUL_INSTR_OFFSETSFree3Offsets
460x2EEIATTR_ATOM16_EMUL_INSTR_REG_MAPFree5Offsets
470x2FEIATTR_REGCOUNTIndexed5Resource
480x30EIATTR_SW2393858_WARFree5WAR
490x31EIATTR_INT_WARP_WIDE_INSTR_OFFSETSFree5Offsets
500x32EIATTR_SHARED_SCRATCHIndexed5Shared
510x33EIATTR_STATISTICSFree5Metadata
520x34EIATTR_INDIRECT_BRANCH_TARGETSFree5Offsets
530x35EIATTR_SW2861232_WARFree5WAR
540x36EIATTR_SW_WARFree5WAR
550x37EIATTR_CUDA_API_VERSIONIndexed5Metadata
560x38EIATTR_NUM_MBARRIERSIndexed5Resource
570x39EIATTR_MBARRIER_INSTR_OFFSETSFree5Offsets
580x3AEIATTR_COROUTINE_RESUME_OFFSETSFree5Offsets
590x3BEIATTR_SAM_REGION_STACK_SIZEIndexed5Resource
600x3CEIATTR_PER_REG_TARGET_PERF_STATSFree5Metadata
610x3DEIATTR_CTA_PER_CLUSTERIndexed5Cluster
620x3EEIATTR_EXPLICIT_CLUSTERIndexed5Cluster
630x3FEIATTR_MAX_CLUSTER_RANKIndexed5Cluster
640x40EIATTR_INSTR_REG_MAPFree5Metadata
650x41EIATTR_RESERVED_SMEM_USEDIndexed5Shared
660x42EIATTR_RESERVED_SMEM_0_SIZEIndexed5Shared
670x43EIATTR_UCODE_SECTION_DATAFree5Metadata
680x44EIATTR_UNUSED_LOAD_BYTE_OFFSETFree5Offsets
690x45EIATTR_KPARAM_INFO_V2Free5Param
700x46EIATTR_SYSCALL_OFFSETSFree5Offsets
710x47EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETSFree5WAR
720x48EIATTR_GRAPHICS_GLOBAL_CBANKIndexed5Graphics
730x49EIATTR_SHADER_TYPEIndexed5Graphics
740x4AEIATTR_VRC_CTA_INIT_COUNTIndexed5Graphics
750x4BEIATTR_TOOLS_PATCH_FUNCIndexed5Metadata
760x4CEIATTR_NUM_BARRIERSIndexed5Resource
770x4DEIATTR_TEXMODE_INDEPENDENTIndexed5Texture
780x4EEIATTR_PERF_STATISTICSFree5Metadata
790x4FEIATTR_AT_ENTRY_FRAGEMENTSFree5Blackwell
800x50EIATTR_SPARSE_MMA_MASKFree5Blackwell
810x51EIATTR_TCGEN05_1CTA_USEDIndexed5Blackwell
820x52EIATTR_TCGEN05_2CTA_USEDIndexed5Blackwell
830x53EIATTR_GEN_ERRBAR_AT_EXITIndexed5Blackwell
840x54EIATTR_REG_RECONFIGIndexed5Blackwell
850x55EIATTR_ANNOTATIONSFree5Metadata
860x56EIATTR_UNKNOWN--5Sentinel
870x57EIATTR_STACK_CANARY_TRAP_OFFSETSFree5Offsets
880x58EIATTR_STUB_FUNCTION_KINDIndexed5Metadata
890x59EIATTR_LOCAL_CTA_ASYNC_STORE_OFFSETSFree5Offsets
900x5AEIATTR_MERCURY_FINALIZER_OPTIONSFree5Mercury
910x5BEIATTR_BLOCKS_ARE_CLUSTERSIndexed5Cluster
920x5CEIATTR_SANITIZEIndexed5Blackwell
930x5DEIATTR_SYSCALLS_FALLBACKFree5Metadata
940x5EEIATTR_CUDA_REQFree5Metadata
950x5FEIATTR_MERCURY_ISA_VERSIONSized5Mercury
960x60EIATTR_ERROR_LAST--5Sentinel

Metadata Word Encoding

Each entry in the pointer table carries an 8-byte metadata word alongside the string pointer. The low 32 bits encode the minimum toolkit version required to parse this attribute. The high 32 bits encode flags (0 = legacy, 1 = internal-only, 2 = standard).

Meta loInterpretation
1Legacy attribute, present since earliest CUDA versions
2Introduced in CUDA ~7.0 era (Volta)
3Introduced in CUDA ~9.0 era (Turing)
5Introduced in CUDA ~11.0+ era (Ampere and later)

Codes 0--42 all carry meta=1 (legacy). The boundary at code 43 (EIATTR_WMMA_USED) marks the Volta-era expansion. Codes 46+ carry meta_lo=5, indicating the major expansion that happened with Ampere and continued through Blackwell.

Attribute Categories

Resource Allocation (GPU Driver Critical)

These attributes directly control how the GPU driver allocates hardware resources for kernel launch. Incorrect values cause silent performance degradation or launch failure.

CodeHexNameFormatDescription
470x2FEIATTR_REGCOUNTIndexedPhysical register count per thread. The GPU driver computes max_warps_per_SM = total_registers / (regcount * warp_size). Single most important occupancy-determining attribute.
50x05EIATTR_MAX_THREADSIndexedMaximum threads per block (from .maxntid PTX directive).
160x10EIATTR_REQNTIDIndexedRequired thread count per dimension (from .reqntid).
170x11EIATTR_FRAME_SIZEIndexedPer-thread local memory frame size in bytes.
180x12EIATTR_MIN_STACK_SIZEIndexedMinimum stack size per thread (non-recursive case).
350x23EIATTR_MAX_STACK_SIZEIndexedMaximum stack size per thread (recursive case, computed via call graph propagation).
300x1EEIATTR_CRS_STACK_SIZEIndexedCall-Return-Stack size for nested function calls.
590x3BEIATTR_SAM_REGION_STACK_SIZEIndexedSAM (Streaming Asynchronous Memory) region stack size.
760x4CEIATTR_NUM_BARRIERSIndexedNumber of named barriers used (max 16 on most architectures). Propagated from callees to entry points by sub_1CC8950.
560x38EIATTR_NUM_MBARRIERSIndexedNumber of memory barriers (mbarrier objects) used.
270x1BEIATTR_MAXREG_COUNTSizedMaximum register count hint (from --maxrregcount or .maxnreg).
840x54EIATTR_REG_RECONFIGIndexedDynamic register reconfiguration support (setmaxnreg instruction, sm_100+).

Parameter Bank Layout

Describes how kernel parameters are laid out in constant memory bank 0 (c[0x0]).

CodeHexNameFormatDescription
100x0AEIATTR_PARAM_CBANKIndexedConstant bank number and offset for kernel parameters.
250x19EIATTR_CBANK_PARAM_SIZESizedSize of the parameter constant bank in bytes.
240x18EIATTR_SMEM_PARAM_SIZEIndexedSize of shared memory parameter region.
110x0BEIATTR_SMEM_PARAM_OFFSETSFreeOffsets of parameters within shared memory.
120x0CEIATTR_CBANK_PARAM_OFFSETSFreeOffsets of parameters within constant bank.
230x17EIATTR_KPARAM_INFOFreeKernel parameter metadata (types, sizes, alignments).
690x45EIATTR_KPARAM_INFO_V2FreeExtended kernel parameter info (v2 format with additional fields, no metadata version constraint).

Instruction Offset Tables

Record byte offsets of specific instruction types within the kernel's .text section, enabling the driver and tools to locate and patch instructions at load time.

CodeHexNameFormatDescription
280x1CEIATTR_EXIT_INSTR_OFFSETSFreeByte offsets of all EXIT instructions.
290x1DEIATTR_S2RCTAID_INSTR_OFFSETSFreeOffsets of S2R instructions reading SR_CTAID (CTA ID). Used for cluster launch CTA-ID remapping.
370x25EIATTR_LD_CACHEMOD_INSTR_OFFSETSFreeOffsets of load instructions with explicit cache modifier.
390x27EIATTR_ATOM_SYS_INSTR_OFFSETSFreeOffsets of atomic instructions with .sys scope.
400x28EIATTR_COOP_GROUP_INSTR_OFFSETSFreeOffsets of cooperative group instructions.
450x2DEIATTR_ATOMF16_EMUL_INSTR_OFFSETSFreeOffsets of emulated FP16 atomic instructions.
460x2EEIATTR_ATOM16_EMUL_INSTR_REG_MAPFreeRegister map for 16-bit atomic emulation.
490x31EIATTR_INT_WARP_WIDE_INSTR_OFFSETSFreeOffsets of integer warp-wide instructions.
520x34EIATTR_INDIRECT_BRANCH_TARGETSFreeValid targets of indirect branches (for control flow integrity).
570x39EIATTR_MBARRIER_INSTR_OFFSETSFreeOffsets of MBAR (memory barrier) instructions.
580x3AEIATTR_COROUTINE_RESUME_OFFSETSFreeResume point offsets for device-side coroutines. Variant name EIATTR_COROUTINE_RESUME_ID_OFFSETS at 0x24064D8.
680x44EIATTR_UNUSED_LOAD_BYTE_OFFSETFreeByte offset of unused load instruction.
700x46EIATTR_SYSCALL_OFFSETSFreeOffsets of __cuda_syscall invocations.
870x57EIATTR_STACK_CANARY_TRAP_OFFSETSFreeOffsets of stack canary trap instructions (stack protector).
890x59EIATTR_LOCAL_CTA_ASYNC_STORE_OFFSETSFreeOffsets of CTA-local async store instructions.

Texture and Surface Binding

CodeHexNameFormatDescription
20x02EIATTR_IMAGE_SLOTIndexedTexture/surface image slot assignment.
60x06EIATTR_IMAGE_OFFSETIndexedOffset within the image descriptor table.
70x07EIATTR_IMAGE_SIZEIndexedSize of the image descriptor.
80x08EIATTR_TEXTURE_NORMALIZEDIndexedWhether texture coordinates are normalized.
90x09EIATTR_SAMPLER_INITIndexedSampler initialization parameters.
140x0EEIATTR_TEXID_SAMPID_MAPFreeTexture ID to sampler ID mapping table.
190x13EIATTR_SAMPLER_FORCE_UNNORMALIZEDIndexedForce unnormalized sampler coordinates.
200x14EIATTR_BINDLESS_IMAGE_OFFSETSFreeOffsets for bindless image references.
210x15EIATTR_BINDLESS_TEXTURE_BANKIndexedConstant bank used for bindless texture descriptors.
220x16EIATTR_BINDLESS_SURFACE_BANKIndexedConstant bank used for bindless surface descriptors.
770x4DEIATTR_TEXMODE_INDEPENDENTIndexedIndependent texture mode flag.

Cluster and Cooperative Launch (sm_90+)

CodeHexNameFormatDescription
410x29EIATTR_COOP_GROUP_MASK_REGIDSIndexedRegister IDs used for cooperative group masks.
610x3DEIATTR_CTA_PER_CLUSTERIndexedNumber of CTAs per cluster (Hopper cluster launch).
620x3EEIATTR_EXPLICIT_CLUSTERIndexedKernel uses explicit cluster dimensions.
630x3FEIATTR_MAX_CLUSTER_RANKIndexedMaximum cluster rank for scheduling.
910x5BEIATTR_BLOCKS_ARE_CLUSTERSIndexedCTA blocks are clusters flag.

Shared Memory and Reserved Resources

CodeHexNameFormatDescription
500x32EIATTR_SHARED_SCRATCHIndexedShared memory scratch space for register spilling.
650x41EIATTR_RESERVED_SMEM_USEDIndexedWhether reserved shared memory is used.
660x42EIATTR_RESERVED_SMEM_0_SIZEIndexedSize of reserved shared memory partition 0.

Software Workarounds

Hardware errata requiring instruction-level patching by the driver. Each WAR attribute carries a list of instruction byte offsets that the driver must modify at kernel load time.

CodeHexNameFormatDescription
420x2AEIATTR_SW1850030_WARFreeWorkaround for HW bug 1850030.
480x30EIATTR_SW2393858_WARFreeWorkaround for HW bug 2393858.
530x35EIATTR_SW2861232_WARFreeWorkaround for HW bug 2861232.
540x36EIATTR_SW_WARFreeGeneric software workaround container.
710x47EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETSFreeOffsets of MEMBAR.SYS instructions needing software workaround.

Graphics-Specific

CodeHexNameFormatDescription
720x48EIATTR_GRAPHICS_GLOBAL_CBANKIndexedGlobal constant bank for graphics shaders.
730x49EIATTR_SHADER_TYPEIndexedShader type (vertex, fragment, compute, etc.).
740x4AEIATTR_VRC_CTA_INIT_COUNTIndexedVirtual Register Count CTA init count.

Blackwell+ Features (sm_100+)

CodeHexNameFormatDescription
790x4FEIATTR_AT_ENTRY_FRAGEMENTSFreeFragment descriptors at function entry. Note: "FRAGEMENTS" is a typo preserved in the binary; corrected variant EIATTR_AT_ENTRY_FRAGMENTS exists at 0x2405DA1.
800x50EIATTR_SPARSE_MMA_MASKFreeSparsity mask for structured-sparse MMA operations.
810x51EIATTR_TCGEN05_1CTA_USEDIndexedtcgen05 (5th-gen tensor core) single-CTA mode used.
820x52EIATTR_TCGEN05_2CTA_USEDIndexedtcgen05 two-CTA mode used.
830x53EIATTR_GEN_ERRBAR_AT_EXITIndexedGenerate error barrier at kernel exit.
840x54EIATTR_REG_RECONFIGIndexedDynamic register reconfiguration (setmaxnreg).
920x5CEIATTR_SANITIZEIndexedAddress sanitizer instrumentation present.

Mercury-Specific

CodeHexNameFormatDescription
900x5AEIATTR_MERCURY_FINALIZER_OPTIONSFreeOptions for the Mercury FNLZR post-link pass.
950x5FEIATTR_MERCURY_ISA_VERSIONSizedMercury ISA version for the shader binary.

Compilation Metadata

CodeHexNameFormatDescription
30x03EIATTR_JUMPTABLE_RELOCSFreeJump table relocation entries.
40x04EIATTR_CTAIDZ_USEDIndexedWhether kernel uses %ctaid.z (3D grid).
130x0DEIATTR_SYNC_STACKIndexedSynchronization stack depth.
150x0FEIATTR_EXTERNSFreeExternal symbol references list.
260x1AEIATTR_QUERY_NUMATTRIBIndexedNumber of queryable attributes.
310x1FEIATTR_NEED_CNP_WRAPPERIndexedKernel needs CUDA Nested Parallelism wrapper.
320x20EIATTR_NEED_CNP_PATCHIndexedKernel needs CNP patching at load time.
330x21EIATTR_EXPLICIT_CACHINGIndexedExplicit cache control directives present.
340x22EIATTR_ISTYPEP_USEDIndexedisspacep instruction used.
360x24EIATTR_SUQ_USEDIndexedSurface query instruction used.
380x26EIATTR_LOAD_CACHE_REQUESTIndexedLoad cache request configuration.
430x2BEIATTR_WMMA_USEDIndexedWarp Matrix Multiply-Accumulate instructions used.
440x2CEIATTR_HAS_PRE_V10_OBJECTValueObject contains pre-CUDA 10 compiled code.
510x33EIATTR_STATISTICSFreeCompilation statistics (instruction counts, etc.).
550x37EIATTR_CUDA_API_VERSIONIndexedCUDA API version the kernel was compiled for.
600x3CEIATTR_PER_REG_TARGET_PERF_STATSFreePer-register-target performance statistics.
640x40EIATTR_INSTR_REG_MAPFreeInstruction-to-register mapping for profiling.
670x43EIATTR_UCODE_SECTION_DATAFreeMicrocode section data (internal).
750x4BEIATTR_TOOLS_PATCH_FUNCIndexedFunction patching descriptor for CUDA tools (cuda-gdb, Nsight).
780x4EEIATTR_PERF_STATISTICSFreePerformance statistics for the profiler.
850x55EIATTR_ANNOTATIONSFreeGeneral-purpose annotation data.
880x58EIATTR_STUB_FUNCTION_KINDIndexedStub function classification.
930x5DEIATTR_SYSCALLS_FALLBACKFreeSyscall fallback mechanism offsets.
940x5EEIATTR_CUDA_REQFreeCUDA requirements descriptor.

Sentinel and Error

CodeHexNameFormatDescription
00x00EIATTR_ERROR--Invalid/error sentinel. Never emitted in valid cubins.
10x01EIATTR_PAD--Padding record (ignored by parser).
860x56EIATTR_UNKNOWN--Unknown attribute placeholder.
960x60EIATTR_ERROR_LAST--Upper bound sentinel for the enum range. Code 96 is never emitted; it serves as a bound check (if (attr_code > 0x2F) at line 760 of the builder).

Payload Format Reference (Codes 0--32)

Per-attribute wire-format documentation derived from sub_1CC9800 (master EIATTR builder), sub_1CC86D0 (per-entry stack emitter), sub_1CC8950 (barrier/register propagator), and sub_1CC85F0 (TLV record emitter). Payload layouts describe the bytes that follow the 4-byte TLV header.

For Indexed-format (0x04) attributes the first 4 payload bytes are always a u32 symbol index. The remaining bytes (if any) carry the value. For Sized-format (0x03) attributes the value is encoded directly in the 16-bit size field of the TLV header -- there are no additional payload bytes.
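A sketch of scalar-value extraction under the layout just described. The helper name is ours, and a little-endian host is assumed (matching the on-disk encoding):

```c
#include <stdint.h>
#include <string.h>

/* Pull the attribute value out of one TLV record for the two scalar
 * formats: Sized (0x03) stores the value in the header's 16-bit size
 * field itself; Indexed (0x04) stores it after the u32 symbol index. */
static uint32_t eiattr_scalar_value(const uint8_t *rec) {
    uint16_t size;
    memcpy(&size, rec + 2, 2);   /* little-endian host assumed */
    if (rec[0] == 0x03)
        return size;             /* Sized: value == size field */
    uint32_t value;
    memcpy(&value, rec + 8, 4);  /* Indexed: skip sym_index at +4 */
    return value;
}
```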

Sentinel Codes (0--1)

CodeHexNamePayload
00x00EIATTR_ERRORNone. Never emitted.
10x01EIATTR_PADNone. Padding, ignored by parser.

Texture and Image Binding (2, 6--9, 14, 19--22)

All Indexed attributes in this group share the same 8-byte payload layout: [sym_index:4][value:4]. The builder's first switch (line 722) routes all of these through the same symbol-index resolution path.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index     Per-function symbol table index
0x04    4     value         Attribute-specific (see per-code table)
CodeHexNamevalue field semantics
20x02EIATTR_IMAGE_SLOTImage slot number (texture unit binding point)
60x06EIATTR_IMAGE_OFFSETByte offset within image descriptor table
70x07EIATTR_IMAGE_SIZEImage descriptor size in bytes
80x08EIATTR_TEXTURE_NORMALIZED0 = unnormalized, 1 = normalized coordinates
90x09EIATTR_SAMPLER_INITPacked sampler initialization parameters
190x13EIATTR_SAMPLER_FORCE_UNNORMALIZEDSampler ID to force unnormalized
210x15EIATTR_BINDLESS_TEXTURE_BANKConstant bank ID for bindless texture descriptors
220x16EIATTR_BINDLESS_SURFACE_BANKConstant bank ID for bindless surface descriptors

Code 14 (0x0E) -- EIATTR_TEXID_SAMPID_MAP: Free format. Variable-length array of u32 pairs mapping texture IDs to sampler IDs.

Payload: repeating [tex_id:4][samp_id:4] pairs
Size:    N * 8 bytes (N = number of tex-sampler bindings)

Code 20 (0x14) -- EIATTR_BINDLESS_IMAGE_OFFSETS: Free format. Array of u32 byte offsets for bindless image descriptor references in the kernel's constant bank. Each u32 is a symbol index that gets resolved during link.

Payload: u32[] symbol indices (resolved to byte offsets at link)
Size:    N * 4 bytes

Jump Table Relocations (3)

Code 3 (0x03) -- EIATTR_JUMPTABLE_RELOCS: Free format. Array of u32 byte offsets into the .text section where jump table relocations are needed.

Payload: u32[] byte offsets into .text
Size:    N * 4 bytes

CTAIDZ Flag (4)

Code 4 (0x04) -- EIATTR_CTAIDZ_USED: Indexed format, zero-value flag attribute. Presence of the record signals the kernel reads %ctaid.z. SM-version gated via sub_1C97840(0x04, sm_version).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index     Per-function symbol
(no value field -- presence is the signal)

The builder creates this record with two different format bytes depending on context: 0x04 (Indexed) via the TLV emitter, or 0x01 (Free) via inline construction (magic 0x0401). Both encode the same semantic: flag-only, no value.

Resource Allocation (5, 16--18, 25, 27, 30)

Codes 5, 16, 17, 18 -- Indexed, 8-byte payload [sym_index:4][value:4]:

CodeHexNamevalue field semantics
50x05EIATTR_MAX_THREADSMaximum threads per block (from .maxntid)
160x10EIATTR_REQNTIDRequired thread count per dimension (from .reqntid)
170x11EIATTR_FRAME_SIZEPer-thread local memory frame size in bytes
180x12EIATTR_MIN_STACK_SIZEMinimum per-thread stack size in bytes

EIATTR_FRAME_SIZE is weak-symbol filtered: dropped when a weak function is replaced by a stronger definition (bitmask 0x800800020000).

EIATTR_MIN_STACK_SIZE is emitted by sub_1CC86D0 with sub_1CC85F0(a1, 0x12, 8, buf, 0) where buf is [sym_index:4][min_stack:4]. A sentinel value of -1 in min_stack means "not yet computed." When sm_version == 0xFF00 (Mercury), the record is suppressed.

Code 25 (0x19) -- EIATTR_CBANK_PARAM_SIZE: Sized format (0x03). Value encoded directly in the 16-bit size field. No separate payload bytes.

TLV header: [fmt=0x03][code=0x19][param_bank_size:2]
Total record: 4 bytes (header only)

Code 27 (0x1B) -- EIATTR_MAXREG_COUNT: Sized format (0x03). Value encoded in the low byte of the 16-bit size field (range 0--255). Per-compilation-unit hint, not per-function. Set by --maxrregcount CLI flag or .maxnreg PTX directive.

TLV header: [fmt=0x03][code=0x1B][maxreg:2]
Total record: 4 bytes (header only)
Effective range: low byte only (0--255), high byte 0

Binary evidence: second switch case 0x1B (line 1094) reads *(u8*)(v150+2) -- the low byte of the size field -- as the register count value.

Code 30 (0x1E) -- EIATTR_CRS_STACK_SIZE: Indexed format, 4-byte value payload. Emitted by sub_1CC86D0 with sub_1CC85F0(a1, 0x1E, 4, buf, sym_index).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index     Per-function symbol
0x04    4     crs_bytes     Call-Return-Stack size in bytes

Total record: 12 bytes (4 header + 8 payload). Diagnostic "conflicting crs_stack attribute" fires when two records target the same function.

Parameter Bank Layout (10--12, 23--24)

Code 10 (0x0A) -- EIATTR_PARAM_CBANK: Indexed format, packed value.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     cbank_desc    lo16 = bank number, hi16 = byte offset

Typical value: bank=0, offset=0x160 (standard CUDA kernel parameter ABI).
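The lo16/hi16 packing of cbank_desc can be sketched as (helper names are illustrative, not from the binary):

```python
def pack_cbank_desc(bank: int, offset: int) -> int:
    # lo16 = constant bank number, hi16 = byte offset within the bank
    return (bank & 0xFFFF) | ((offset & 0xFFFF) << 16)

def unpack_cbank_desc(desc: int) -> tuple[int, int]:
    return desc & 0xFFFF, (desc >> 16) & 0xFFFF

# Typical CUDA kernel ABI value from above: bank 0, offset 0x160
assert pack_cbank_desc(0, 0x160) == 0x01600000
```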

Codes 11 (0x0B) and 12 (0x0C) -- Free format, variable-length u32 arrays:

EIATTR_SMEM_PARAM_OFFSETS (0x0B):

Payload: u32[] byte offsets within shared memory, one per parameter
Size:    N * 4 bytes

EIATTR_CBANK_PARAM_OFFSETS (0x0C):

Payload: u32[] packed entries, one per parameter
         Each u32: lo16 = byte offset in cbank, hi16 = parameter size
Size:    N * 4 bytes

Code 23 (0x17) -- EIATTR_KPARAM_INFO: Free format, complex per-parameter descriptors. This is the only attribute in codes 0--32 with a multi-field sub-record structure.

Payload: repeating 12-byte per-parameter entries:
  Offset  Size  Field
  ------  ----  -----
  0x00    4     param_index       Ordinal position (0-based)
  0x04    4     param_offset      Byte offset in constant bank
  0x08    2     param_size        Size in bytes
  0x0A    1     log_alignment     log2(alignment)
  0x0B    1     flags             Bit flags (pointer, ordinal, etc.)
Size: N * 12 bytes

Special behavior: the builder exempts KPARAM_INFO from being zeroed when its symbol index resolves to 0 (line 755: (_BYTE)v5 == 23 check). This allows global-scope parameter info records.
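A minimal sketch of a parser for the 12-byte per-parameter entries laid out above (hypothetical helper, little-endian byte order assumed):

```python
import struct

# index:u32, offset:u32, size:u16, log2(align):u8, flags:u8 -> 12 bytes
ENTRY = struct.Struct("<IIHBB")

def parse_kparam_info(payload: bytes):
    assert len(payload) % ENTRY.size == 0
    for off in range(0, len(payload), ENTRY.size):
        idx, p_off, p_size, log_align, flags = ENTRY.unpack_from(payload, off)
        yield {"index": idx, "offset": p_off, "size": p_size,
               "align": 1 << log_align, "flags": flags}
```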

Code 24 (0x18) -- EIATTR_SMEM_PARAM_SIZE: Indexed, [sym_index:4][smem_param_bytes:4].

Synchronization (13)

Code 13 (0x0D) -- EIATTR_SYNC_STACK: Indexed format.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     sync_depth    lo16 = stack depth (u16), hi16 = 0

Binary evidence: case 0x0D (line 1038) reads *(u16**)(v150+8) as a pointer to a u16 value. The depth value (v343) is a 16-bit unsigned integer. Used with sub_1CBD8F0 for sync stack tracking.

External Symbol References (15)

Code 15 (0x0F) -- EIATTR_EXTERNS: Free format, most complex processing of any attribute in the 0--32 range.

Payload: u32[] symbol table indices
Size:    N * 4 bytes (N = size_field / 4)

The builder handles EXTERNS in both switches:

  • First switch (line 779): iterates the u32 array, resolving each symbol index through the link-time symbol table. Dead symbols (resolved to 0) are zeroed in-place.
  • Second switch (line 1054): collects extern refs into a set (v643) for the current function.
  • Emission (line 1706): sub_1CC85F0(a1, 0x0F, 4*count, buf, sym_index) emits the final record.
  • The size field encodes N * 4 and the element count is recovered as size >> 2.

Metadata Query (26)

Code 26 (0x1A) -- EIATTR_QUERY_NUMATTRIB: Indexed, [sym_index:4][num_attributes:4].

Instruction Offset Tables (28--29)

Both attributes are Free format carrying arrays of u32 byte offsets into the .text section.

Code 28 (0x1C) -- EIATTR_EXIT_INSTR_OFFSETS:

Payload: u32[] byte offsets of EXIT instructions
Size:    N * 4 bytes

Confirmed by the builder's loop (line 2011): code 28 is explicitly checked and skipped past the symbol-resolution path, confirming the payload is a simple offset array with no embedded symbol indices.

Code 29 (0x1D) -- EIATTR_S2RCTAID_INSTR_OFFSETS:

Payload: u32[] byte offsets of S2R SR_CTAID.* instructions
Size:    N * 4 bytes

At line 2001, code 29 triggers CNP (CUDA Nested Parallelism) wrapper generation. The symbol index from the record is added to the CNP wrapper list, driving emission of NEED_CNP_WRAPPER (code 31) and NEED_CNP_PATCH (code 32) records.

CUDA Nested Parallelism Flags (31--32)

Both are Indexed-format flag attributes with no value payload. They are always emitted as a pair.

Code 31 (0x1F) -- EIATTR_NEED_CNP_WRAPPER:

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only, presence is the signal)

SM-version gated: sub_1C97840(0x1F, sm_version). Builder constructs with internal format 0x01 (magic 0x1F01 = 7937). Emitted for every function that the S2RCTAID analysis identified as needing a CNP wrapper.

Code 32 (0x20) -- EIATTR_NEED_CNP_PATCH:

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only, presence is the signal)

SM-version gated: sub_1C97840(0x20, sm_version). Builder constructs with internal format 0x01 (magic 0x2001 = 8193). Emitted for every function in the CNP call tree.

Payload Format Summary (Codes 0--32)

Code  Name                    Wire Fmt  Payload size  Payload layout
----  ----                    --------  ------------  --------------
0     ERROR                   --        0             none
1     PAD                     --        0             none
2     IMAGE_SLOT              0x04      8             [sym:4][slot_id:4]
3     JUMPTABLE_RELOCS        0x01      N*4           u32[] byte offsets
4     CTAIDZ_USED             0x04      4             [sym:4] flag-only
5     MAX_THREADS             0x04      8             [sym:4][max_threads:4]
6     IMAGE_OFFSET            0x04      8             [sym:4][offset:4]
7     IMAGE_SIZE              0x04      8             [sym:4][size:4]
8     TEXTURE_NORMALIZED      0x04      8             [sym:4][normalized:4]
9     SAMPLER_INIT            0x04      8             [sym:4][params:4]
10    PARAM_CBANK             0x04      8             [sym:4][lo16=bank,hi16=off:4]
11    SMEM_PARAM_OFFSETS      0x01      N*4           u32[] param offsets
12    CBANK_PARAM_OFFSETS     0x01      N*4           u32[] lo16=off,hi16=size
13    SYNC_STACK              0x04      8             [sym:4][depth_u16:4]
14    TEXID_SAMPID_MAP        0x01      N*8           [tex_id:4][samp_id:4] pairs
15    EXTERNS                 0x01      N*4           u32[] symbol indices
16    REQNTID                 0x04      8             [sym:4][reqntid:4]
17    FRAME_SIZE              0x04      8             [sym:4][frame_bytes:4]
18    MIN_STACK_SIZE          0x04      8             [sym:4][stack_bytes:4]
19    SAMPLER_FORCE_UNNORM    0x04      8             [sym:4][sampler_id:4]
20    BINDLESS_IMAGE_OFFSETS  0x01      N*4           u32[] sym indices
21    BINDLESS_TEXTURE_BANK   0x04      8             [sym:4][bank_id:4]
22    BINDLESS_SURFACE_BANK   0x04      8             [sym:4][bank_id:4]
23    KPARAM_INFO             0x01      N*12          12B per-param descriptors
24    SMEM_PARAM_SIZE         0x04      8             [sym:4][size_bytes:4]
25    CBANK_PARAM_SIZE        0x03      0             value in TLV size field
26    QUERY_NUMATTRIB         0x04      8             [sym:4][count:4]
27    MAXREG_COUNT            0x03      0             value in TLV size field (u8)
28    EXIT_INSTR_OFFSETS      0x01      N*4           u32[] .text byte offsets
29    S2RCTAID_INSTR_OFFSETS  0x01      N*4           u32[] .text byte offsets
30    CRS_STACK_SIZE          0x04      8             [sym:4][crs_bytes:4]
31    NEED_CNP_WRAPPER        0x04      4             [sym:4] flag-only
32    NEED_CNP_PATCH          0x04      4             [sym:4] flag-only

Payload Format Reference (Codes 33--64)

Continuation of the per-attribute wire-format documentation. Same sources and conventions as the 0--32 section above.

Metadata Flags (33--34, 36, 43)

Code 33 (0x21) -- EIATTR_EXPLICIT_CACHING: Indexed format, flag-only. Signals the kernel uses explicit cache control directives (ld.ca, ld.cg, etc.). SM-gated via sub_1C97840(0x21, sm_version).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

Binary evidence: magic 0x2101 (line 1733). Emitted when cache-on flag (v648) is set. When both cache-on and cache-off flags are set simultaneously (conflicting directives), sub_1CC8100 (cache conflict resolver) is called instead of emitting this record. The diagnostic "Turning caching %s for entry '%s' as per its request" logs cache resolution decisions.

Code 34 (0x22) -- EIATTR_ISTYPEP_USED: Indexed format, flag-only. Signals the kernel uses isspacep (type predicate) instructions.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

No special builder logic -- passes through the default path.

Code 36 (0x24) -- EIATTR_SUQ_USED: Indexed format, flag-only. Signals the kernel uses surface query instructions.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

No special builder logic.

Code 43 (0x2B) -- EIATTR_WMMA_USED: Indexed format, flag-only. Signals the kernel uses Warp Matrix Multiply-Accumulate instructions. First attribute introduced in the Volta era (meta=2).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

No special builder logic.

Resource Allocation (35, 47, 50, 55--56, 59)

Code 35 (0x23) -- EIATTR_MAX_STACK_SIZE: Indexed format, 4-byte value. Maximum per-thread stack size for recursive call chains, computed via call-graph propagation in sub_1CC8950.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     max_stack_bytes   Maximum stack size in bytes

Binary evidence: second switch case 0x23 (line 1128) reads v354[1] as the stack size value and stores it in the per-entry array s[]. Weak-symbol filtered: bitmask 0x800800060000 includes this code. Mercury suppression: when sm_version == 0xFF00, the code byte is zeroed, dropping the record.

Code 47 (0x2F) -- EIATTR_REGCOUNT: Indexed format, 4-byte value. Physical register count per thread. The single most important attribute for GPU occupancy: max_warps_per_SM = total_registers / (regcount * warp_size).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     regcount          Physical registers per thread

Binary evidence: second switch case 0x2F (line 1176) resolves the symbol and stores the record pointer in v642[] (per-entry regcount array). Diagnostic "invalid index" (line 1180) fires if the symbol resolves to null. Weak-symbol filtered: bitmask 0x800800060000 includes this code.
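The occupancy formula above can be made concrete with a worked example (an idealized bound; real hardware also caps and quantizes register allocation):

```python
def max_warps_per_sm(total_registers: int, regcount: int,
                     warp_size: int = 32) -> int:
    # max_warps_per_SM = total_registers / (regcount * warp_size)
    return total_registers // (regcount * warp_size)

# A 64K-register SM with 64 registers/thread supports at most 32 warps;
# halving register pressure to 32/thread doubles the bound.
assert max_warps_per_sm(65536, 64) == 32
assert max_warps_per_sm(65536, 32) == 64
```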

Code 50 (0x32) -- EIATTR_SHARED_SCRATCH: Indexed format, 4-byte value. Shared memory scratch space allocated for register spilling.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     scratch_bytes     Shared scratch size in bytes

No special builder logic.

Code 55 (0x37) -- EIATTR_CUDA_API_VERSION: Indexed format, 4-byte value. Records the CUDA API version the kernel was compiled for.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     api_version       CUDA API version number

No special builder logic -- passes through the default path.

Code 56 (0x38) -- EIATTR_NUM_MBARRIERS: Sized format (0x03), value encoded in the TLV size field. Number of memory barrier (mbarrier) objects used by the kernel.

TLV header: [fmt=0x03][code=0x38][mbar_count:2]
Total record: 4 bytes (header only)

Binary evidence: magic 0x3803 (14339) at lines 1664 and 2446. The mbarrier count is stored in the 16-bit size field: *((_WORD *)v511 + 1) = v651 (line 1669). SM-gated via sub_1C97840(0x38, sm_version) at lines 1654 and 2436.

Accumulative semantics: the builder sums mbarrier counts from callees during call-graph propagation (second switch case 0x38 at line 1183, falling through to LABEL_331). If any callee reports -1 (unknown), the sum stays -1 (lines 1255--1256). The emission loop at lines 2407--2454 propagates the count to all entry points that call the function.
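The -1 sentinel poisoning can be sketched as (a hypothetical stand-in for the summation at lines 1251--1257):

```python
def accumulate_mbarriers(counts):
    # -1 means "unknown count"; once seen, it poisons the whole sum
    total = 0
    for c in counts:
        if c == -1 or total == -1:
            total = -1
        else:
            total += c
    return total

assert accumulate_mbarriers([2, 3, 1]) == 6
assert accumulate_mbarriers([2, -1, 1]) == -1
```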

Code 59 (0x3B) -- EIATTR_SAM_REGION_STACK_SIZE: Indexed format, 8-byte payload. SAM (Streaming Asynchronous Memory) region stack size.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     sam_stack_bytes   SAM region stack size in bytes

Binary evidence: emitted by sub_1CC86D0 at line 114: sub_1CC85F0(a1, 0x3B, 8, buf, 0) where buf is [sym_index:4][sam_stack:4]. Only emitted when sub_1CBD9E0(a1, a2) returns nonzero, indicating the kernel actually uses SAM regions. Second switch case 0x3B (line 1186) calls sub_1CBD940(a1, sym, value) to record the SAM stack size.

Cache Control (38)

Code 38 (0x26) -- EIATTR_LOAD_CACHE_REQUEST: Indexed format, 4-byte value. Per-kernel cache mode configuration. Controls whether the driver enables explicit caching for this kernel.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     cache_mode        0 = off, nonzero = on

Binary evidence: second switch case 0x26 (line 1134) is the most complex handler in this range. The builder first checks the function kind: if (byte & 3) == 1 (device function), the record is dropped by zeroing the code byte (line 1141). For entry-point kernels, the verbose trace "Turning caching %s for entry '%s' as per its request" is emitted (line 1153), where %s is either "OFF" or "ON". When cache_mode is nonzero: adds the symbol to the caching-on list (v639[]) and sets the per-entry status to 2. When cache_mode is zero: sets status to 1 (off). The v648 and v655 flags track the presence of on/off requests for conflict detection.

Global Flags (44)

Code 44 (0x2C) -- EIATTR_HAS_PRE_V10_OBJECT: Value format (0x02), global scope. Signals the compilation unit contains pre-CUDA 10 compiled code.

TLV header: [fmt=0x02][code=0x2C][size:2]
Payload:    [flags:4]
Total record: 8 bytes

Binary evidence: top-level gating at line 686--688 checks three conditions: link mode (v609 == 2), toolkit version (> 0x63), and SM compatibility (sub_1C97840(0x2C, sm_version)). The magic 0x2C01 at line 709 constructs the record with internal format byte 0x01, which the emitter translates to Value format (0x02) for the wire encoding since the record is global scope. This is the only Value-format attribute in the 33--64 range.

Instruction Offset Tables (37, 39--40, 45--46, 48--49, 52, 57--58)

All attributes in this group use Free format (0x01) carrying variable-length arrays of u32 byte offsets into the kernel's .text section. None have explicit switch cases in the builder -- they pass through the default path. The payload layout for all is identical:

Payload: u32[] byte offsets into .text section
Size:    N * 4 bytes (N = size_field / 4)

Code  Hex   Name                         Offset semantics
----  ----  ----                         ----------------
37    0x25  LD_CACHEMOD_INSTR_OFFSETS    Load instructions with explicit cache modifier
39    0x27  ATOM_SYS_INSTR_OFFSETS       Atomic instructions with .sys scope
40    0x28  COOP_GROUP_INSTR_OFFSETS     Cooperative group instructions
45    0x2D  ATOMF16_EMUL_INSTR_OFFSETS   Emulated FP16 atomic instructions
48    0x30  SW2393858_WAR                HW bug 2393858 patch locations
49    0x31  INT_WARP_WIDE_INSTR_OFFSETS  Integer warp-wide instructions
52    0x34  INDIRECT_BRANCH_TARGETS      Valid targets of indirect branches
57    0x39  MBARRIER_INSTR_OFFSETS       Memory barrier instructions
58    0x3A  COROUTINE_RESUME_OFFSETS     Coroutine resume point offsets
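Since all of these share one wire layout, a single round-trip helper covers the group. This is a sketch under assumptions (little-endian, and the `[fmt:1][code:1][size:2]` TLV header shape used throughout this reference); the function names are not from the binary:

```python
import struct

def emit_offset_record(code: int, offsets: list[int]) -> bytes:
    # Free format (0x01): size field = N*4, payload = N little-endian u32s
    payload = struct.pack(f"<{len(offsets)}I", *offsets)
    return struct.pack("<BBH", 0x01, code, len(payload)) + payload

def parse_offset_record(record: bytes):
    fmt, code, size = struct.unpack_from("<BBH", record)
    assert fmt == 0x01
    n = size >> 2                    # element count recovered as size / 4
    return code, list(struct.unpack_from(f"<{n}I", record, 4))
```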

Code 46 (0x2E) -- EIATTR_ATOM16_EMUL_INSTR_REG_MAP: Free format, but NOT a simple offset array. Carries a register map for 16-bit atomic emulation with a structured per-entry layout rather than flat offsets. The exact sub-record layout is not fully determined from the builder alone (constructed by a separate pass).

Payload: structured register-map entries (not flat u32[] offsets)
Size:    variable

Software Workarounds (42, 48, 53--54)

All use Free format (0x01) with u32 offset arrays. The driver patches the instructions at the listed byte offsets during kernel load.

Code  Hex   Name
----  ----  ----
42    0x2A  SW1850030_WAR
48    0x30  SW2393858_WAR
53    0x35  SW2861232_WAR
54    0x36  SW_WAR

SW_WAR (0x36) is a generic container -- unlike the numbered WAR attributes, its payload format may include sub-type discriminators, though the builder treats it as a flat pass-through.

Cluster and Cooperative Launch (41, 61--63)

Code 41 (0x29) -- EIATTR_COOP_GROUP_MASK_REGIDS: Indexed, 4-byte value.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     mask_regids       Register IDs for cooperative group masks

Code 61 (0x3D) -- EIATTR_CTA_PER_CLUSTER: Indexed, 4-byte value.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     ctas_per_cluster  Number of CTAs per cluster (Hopper sm_90+)

Code 62 (0x3E) -- EIATTR_EXPLICIT_CLUSTER: Indexed, flag-only.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only, presence signals explicit cluster dimensions)

Code 63 (0x3F) -- EIATTR_MAX_CLUSTER_RANK: Indexed, 4-byte value.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     max_rank          Maximum cluster rank for scheduling

Compilation Metadata (51, 60, 64)

Code 51 (0x33) -- EIATTR_STATISTICS: Free format. Variable-length compilation statistics (instruction counts, etc.). Internal diagnostic data not consumed by the GPU driver.

Payload: structured statistics data (format varies)
Size:    variable

Code 60 (0x3C) -- EIATTR_PER_REG_TARGET_PERF_STATS: Free format. Per-register-target performance statistics for the profiler.

Payload: structured performance data (format varies)
Size:    variable

Code 64 (0x40) -- EIATTR_INSTR_REG_MAP: Free format. Instruction-to-register mapping for profiling and debugging tools.

Payload: structured register-map data
Size:    variable

Payload Format Summary (Codes 33--64)

Code  Name                         Wire Fmt  Payload size  Payload layout
----  ----                         --------  ------------  --------------
33    EXPLICIT_CACHING             0x04      4             [sym:4] flag-only
34    ISTYPEP_USED                 0x04      4             [sym:4] flag-only
35    MAX_STACK_SIZE               0x04      8             [sym:4][max_stack_bytes:4]
36    SUQ_USED                     0x04      4             [sym:4] flag-only
37    LD_CACHEMOD_INSTR_OFFSETS    0x01      N*4           u32[] .text byte offsets
38    LOAD_CACHE_REQUEST           0x04      8             [sym:4][cache_mode:4]
39    ATOM_SYS_INSTR_OFFSETS       0x01      N*4           u32[] .text byte offsets
40    COOP_GROUP_INSTR_OFFSETS     0x01      N*4           u32[] .text byte offsets
41    COOP_GROUP_MASK_REGIDS       0x04      8             [sym:4][mask_regids:4]
42    SW1850030_WAR                0x01      N*4           u32[] .text byte offsets
43    WMMA_USED                    0x04      4             [sym:4] flag-only
44    HAS_PRE_V10_OBJECT           0x02      4             [flags:4] global
45    ATOMF16_EMUL_INSTR_OFFSETS   0x01      N*4           u32[] .text byte offsets
46    ATOM16_EMUL_INSTR_REG_MAP    0x01      var           structured register map
47    REGCOUNT                     0x04      8             [sym:4][regcount:4]
48    SW2393858_WAR                0x01      N*4           u32[] .text byte offsets
49    INT_WARP_WIDE_INSTR_OFFSETS  0x01      N*4           u32[] .text byte offsets
50    SHARED_SCRATCH               0x04      8             [sym:4][scratch_bytes:4]
51    STATISTICS                   0x01      var           structured stats data
52    INDIRECT_BRANCH_TARGETS      0x01      N*4           u32[] .text byte offsets
53    SW2861232_WAR                0x01      N*4           u32[] .text byte offsets
54    SW_WAR                       0x01      var           generic WAR data
55    CUDA_API_VERSION             0x04      8             [sym:4][api_version:4]
56    NUM_MBARRIERS                0x03      0             value in TLV size field (u16)
57    MBARRIER_INSTR_OFFSETS       0x01      N*4           u32[] .text byte offsets
58    COROUTINE_RESUME_OFFSETS     0x01      N*4           u32[] .text byte offsets
59    SAM_REGION_STACK_SIZE        0x04      8             [sym:4][sam_stack_bytes:4]
60    PER_REG_TARGET_PERF_STATS    0x01      var           structured perf data
61    CTA_PER_CLUSTER              0x04      8             [sym:4][ctas:4]
62    EXPLICIT_CLUSTER             0x04      4             [sym:4] flag-only
63    MAX_CLUSTER_RANK             0x04      8             [sym:4][max_rank:4]
64    INSTR_REG_MAP                0x01      var           structured register map

Payload Format Reference (Codes 65--96)

Continuation of the per-attribute wire-format documentation. Same sources and conventions as the 0--64 sections above. Codes 65--96 represent the newest EIATTR additions (Ampere through Blackwell era). All require SM-version gating via sub_1C97840 before emission. Many have dedicated switch cases in the master builder for call-graph propagation.

Shared Memory (65--66)

Code 65 (0x41) -- EIATTR_RESERVED_SMEM_USED: Indexed format, flag-only. Signals the kernel uses reserved shared memory. SM-gated via sub_1C97840(0x41, sm_version).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only, presence is the signal)

Binary evidence: magic 0x4101 (16641) at lines 1511 and 2219 of sub_1CC9800. The builder tracks this attribute in the v615[] per-entry array and propagates it to callee entry points during the second pass (lines 2186--2229). When an entry point does not already have this record, the builder creates one using sub_1CC7FB0 for symbol resolution.

Code 66 (0x42) -- EIATTR_RESERVED_SMEM_0_SIZE: Indexed format, 4-byte value. Size of reserved shared memory partition 0 in bytes.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     rsmem_bytes       Reserved shared memory size in bytes

No explicit switch case in the builder -- passes through the default path.

Microcode Section (67)

Code 67 (0x43) -- EIATTR_UCODE_SECTION_DATA: Free format. Opaque microcode section data for internal use. Payload format is architecture-specific and not decoded by the builder.

Payload: opaque byte array
Size:    variable

Instruction Offset Tables (68, 70--71, 87, 89)

All attributes in this group use Free format (0x01) carrying variable-length arrays of u32 byte offsets into the kernel's .text section.

Payload: u32[] byte offsets into .text section
Size:    N * 4 bytes (N = size_field / 4)

Code  Hex   Name                             Offset semantics                     Emitter
----  ----  ----                             ----------------                     -------
68    0x44  UNUSED_LOAD_BYTE_OFFSET          Unused load instructions             sub_60BCF0 (code 70 pattern)
70    0x46  SYSCALL_OFFSETS                  __cuda_syscall invocations           sub_60BCF0
71    0x47  SW_WAR_MEMBAR_SYS_INSTR_OFFSETS  MEMBAR.SYS instructions needing WAR  sub_60BDC0
87    0x57  STACK_CANARY_TRAP_OFFSETS        Stack canary trap instructions       sub_60BEA0
89    0x59  LOCAL_CTA_ASYNC_STORE_OFFSETS    CTA-local async store instructions   default path

Binary evidence for sub_60BCF0 (code 70): allocates 4 * count bytes, copies offsets from the instruction table at struct+40, then calls sub_1CC85F0(a2, 70, (unsigned __int16)count, buf, a4). Emission gated by *(a1+25) flag and count > 0.

Binary evidence for sub_60BDC0 (code 71) and sub_60BEA0 (code 87): identical structure to sub_60BCF0, differing only in the attribute code passed to sub_1CC85F0.

Kernel Parameter Info V2 (69)

Code 69 (0x45) -- EIATTR_KPARAM_INFO_V2: Free format, 12-byte per-parameter entries. Extended version of KPARAM_INFO (code 23) with additional type encoding. Emitted by sub_7FD2B0.

Payload: repeating 12-byte per-parameter entries:
  Offset  Size  Field
  ------  ----  -----
  0x00    4     param_index       Ordinal position (0-based)
  0x04    4     param_offset      Byte offset in constant bank
  0x08    2     param_size        Size in bytes
  0x0A    1     log_alignment     log2(alignment)
  0x0B    1     flags             Packed nibbles:
                                    lo4 = param_type (from lookup table at 0x21D2E60)
                                    bit4 = is_pointer flag
                                    hi3 = reserved
Size: N * 12 bytes

Binary evidence: sub_7FD2B0 at line 116 calls sub_1CC85F0(a3, 69, 12, v16, a4). The flags byte at offset 0x0B is assembled from two sources: the low nibble is looked up from dword_21D2E60 indexed by param_type - 1 (line 110), and bit 4 is set when the parameter is a pointer (line 115: 16 * (*(_BYTE *)(v20 + 25) & 1)).

First-switch handling: code 69 (0x45) appears in the first switch at line 737 alongside texture and resource codes, meaning KPARAM_INFO_V2 records undergo symbol-index resolution during the first pass.
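The flags-byte assembly at offset 0x0B can be sketched as follows. The real builder looks the low nibble up in the table at 0x21D2E60; `TYPE_CODES` below is a hypothetical stand-in, not the recovered table:

```python
TYPE_CODES = {"u32": 1, "u64": 2, "f32": 3}   # illustrative only

def pack_v2_flags(param_type: str, is_pointer: bool) -> int:
    # lo4 = param_type code, bit4 = is_pointer, hi3 = reserved (zero)
    lo4 = TYPE_CODES[param_type] & 0x0F
    return lo4 | (0x10 if is_pointer else 0)

def unpack_v2_flags(flags: int):
    return flags & 0x0F, bool(flags & 0x10)

assert unpack_v2_flags(pack_v2_flags("u64", True)) == (2, True)
```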

Graphics-Specific (72--74)

Code 72 (0x48) -- EIATTR_GRAPHICS_GLOBAL_CBANK: Indexed format, 4-byte value. Global constant bank descriptor for graphics shaders.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     cbank_desc        Global constant bank descriptor

Code 73 (0x49) -- EIATTR_SHADER_TYPE: Indexed format, 4-byte value. Shader type classification.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     shader_type       Shader type enum (vertex, fragment, compute, etc.)

Code 74 (0x4A) -- EIATTR_VRC_CTA_INIT_COUNT: Constructed with internal format byte 0x02 (magic 0x4A02 = 18946), but the value is stored in the TLV size field byte, making the wire behavior Sized-like. The builder takes the maximum across all callees.

TLV header: [fmt=0x02][code=0x4A][vrc_count:2]
Payload:    [sym_index:4]
Total record: 8 bytes

Binary evidence: magic 18946 at lines 1532 and 2344. The maximum-across-callees logic at lines 1214--1215: if (v675 < *(v150+2)) v328 = *(v150+2); v675 = v328. The final value is written back at line 1538: *((_BYTE *)v196 + 2) = v675. The v617[] per-entry array tracks this attribute for propagation. SM-gated via sub_1C97840(0x4A, sm_version).

Tools Patching (75)

Code 75 (0x4B) -- EIATTR_TOOLS_PATCH_FUNC: Indexed format, 4-byte value. Function patching descriptor for CUDA debugging tools (cuda-gdb, Nsight Compute).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     patch_info        Patch descriptor for tool instrumentation

No explicit switch case -- passes through the default path.

Barrier Count (76)

Code 76 (0x4C) -- EIATTR_NUM_BARRIERS: Constructed with internal format byte 0x02 (magic 0x4C02 = 19458), with the barrier count stored in the TLV size field. This is one of the most complex attributes in the 65--96 range, with two distinct code paths.

TLV header: [fmt=0x02][code=0x4C][bar_count:2]
Payload:    [sym_index:4]
Total record: 8 bytes

Dual-path behavior controlled by *(a1+101):

  • Per-SM tracking mode (when *(a1+101) is set, line 1223): reads barrier count from the size field byte. Takes the maximum across all callees: if (n < *(v150+2)) v323 = *(v150+2); n = v323. The v628[] per-entry array tracks records. SM-gated via sub_1C97840(0x4C, sm_version).

  • Accumulative mode (when *(a1+101) is clear, falls through to LABEL_331): sums barrier counts from callees with -1 sentinel handling (lines 1251--1257): v298 = v297 + v651; if (v297 == -1) v298 = -1. The sentinel -1 means "unknown count" and poisons the sum.

Propagation in sub_1CC8950: the barrier/register propagator (2,634 bytes) also creates NUM_BARRIERS records during barrier count migration from section flags to .nv.info records.

Texture Mode (77)

Code 77 (0x4D) -- EIATTR_TEXMODE_INDEPENDENT: Indexed format, flag-only. Signals the kernel uses independent texture mode.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

No explicit switch case -- passes through the default path.

Performance Statistics (78)

Code 78 (0x4E) -- EIATTR_PERF_STATISTICS: Free format. Performance statistics for the profiler.

Payload: structured performance data
Size:    variable

No explicit switch case -- passes through the default path. Internal profiler data, not consumed by the GPU driver.

Fragment Descriptors at Entry (79)

Code 79 (0x4F) -- EIATTR_AT_ENTRY_FRAGEMENTS: Free format. The most complex handler in the 65--96 range. Carries fragment offset arrays that describe function entry point fragments. Note: "FRAGEMENTS" is a typo preserved in the binary; corrected variant EIATTR_AT_ENTRY_FRAGMENTS exists at 0x2405DA1.

Payload: u32[] fragment offsets
Size:    N * 4 bytes

Binary evidence: emitted via sub_1CC85F0(a1, 0x4F, 4*count, buf, sym) at lines 1774 and 2539. The builder uses a set data structure (v644) to collect fragment offsets from callees, then merges and deduplicates them:

  1. Line 1749: collects total fragment count from v644 set.
  2. Lines 1762--1772: iterates set entries, extracting each offset via sub_42F060.
  3. Line 1774: emits the merged offset array.
  4. Lines 2460--2548: callee propagation loop. For each callee, if an existing entry has fragments, the builder extends the array and deduplicates offsets. If no existing entry, creates a new record.

The deduplication logic (lines 2503--2525) does an O(N*M) scan: for each new offset, checks all existing offsets for duplicates before appending.
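That merge-and-deduplicate behavior can be sketched as (hypothetical helper; the linear membership scan mirrors the O(N*M) structure, not the binary's exact data layout):

```python
def merge_fragments(existing: list[int], new: list[int]) -> list[int]:
    # For each new offset, scan all existing offsets before appending,
    # preserving first-seen order -- as in the O(N*M) loop above.
    merged = list(existing)
    for off in new:
        if off not in merged:       # linear duplicate scan
            merged.append(off)
    return merged

assert merge_fragments([0x10, 0x40], [0x40, 0x80]) == [0x10, 0x40, 0x80]
```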

Cross-function ownership: when *(a1+568) != srca (the current entry's symbol differs from the fragment source), the code byte is zeroed (line 1290: *(_BYTE *)(v150+1)=0), suppressing the record for non-owning functions.

Sparse MMA Mask (80)

Code 80 (0x50) -- EIATTR_SPARSE_MMA_MASK: Sized format (0x03). Sparsity bitmask for structured-sparse MMA (Matrix Multiply-Accumulate) operations on Blackwell. SM-gated via sub_1C97840(0x50, sm_version).

TLV header: [fmt=0x03][code=0x50][mask_bits:2]
Total record: 4 bytes (header only)

Binary evidence: magic 0x5003 (20483) at lines 2085 and 1433. The mask value is stored in the TLV size field. During propagation, the builder OR's mask bits from all callees (line 1407: v158 |= *(_WORD *)(v162 + 2)). New entry-point records are initialized with bit 15 set (line 1436: *((_WORD *)v598 + 1) = 0x8000; line 1438: v158 |= 0x8000u). The v632[] per-entry array tracks records.

The .nv.uft section emission (lines 2068--2090) also creates SPARSE_MMA_MASK records, gated on *(a1+240) (UFT presence flag).
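The mask propagation described above (bit 15 seeded for new entry-point records, then OR-merged from callees) can be sketched as a hypothetical helper:

```python
def propagate_sparse_mask(callee_masks, existing=None):
    # New entry-point records start with bit 15 set (0x8000); callee
    # mask bits are then OR'd in, mirroring the builder's v158 logic.
    mask = existing if existing is not None else 0x8000
    for m in callee_masks:
        mask |= m & 0xFFFF
    return mask

assert propagate_sparse_mask([0x0003, 0x0010]) == 0x8013
```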

Tensor Core Gen05 (81--82)

These two codes are mutually exclusive. The builder enforces that a function cannot use both 1-CTA and 2-CTA tensor core modes simultaneously.

Code 81 (0x51) -- EIATTR_TCGEN05_1CTA_USED: Indexed format, flag-only. Signals the kernel uses 5th-generation tensor cores in single-CTA mode. SM-gated via sub_1C97840(0x51, sm_version) AND requires v673 > 0x81 (SM code > 129, i.e., sm_130+ / Blackwell).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

Binary evidence: magic 0x5101 (20737) at lines 1559 and 2259. Tracked in v614[] per-entry array. The v668 flag indicates any tcgen05_1CTA record was seen. The SM architecture threshold v673 > 0x81 (line 1543) gates emission: only architectures above 0x81 support tcgen05.

Code 82 (0x52) -- EIATTR_TCGEN05_2CTA_USED: Indexed format, flag-only. Signals the kernel uses 5th-generation tensor cores in two-CTA collaborative mode. SM-gated via sub_1C97840(0x52, sm_version) AND requires v673 > 0x81.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

Binary evidence: magic 0x5201 (20993) at lines 1582 and 2300. Tracked in v610[] per-entry array. The v674 flag indicates any tcgen05_2CTA record was seen.

Mutual exclusion enforcement: during callee propagation (lines 2264--2266 and 2304--2307), if a function already has a TCGEN05_1CTA record and the builder attempts to add a TCGEN05_2CTA record (or vice versa), sub_42F590 fires a diagnostic warning with the function name. This catches conflicting tensor core mode usage across the call graph.

Error Barrier at Exit (83)

Code 83 (0x53) -- EIATTR_GEN_ERRBAR_AT_EXIT: Indexed format, flag-only. Instructs the driver to generate an error barrier at kernel exit.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

No explicit switch case in the builder -- passes through the default path.

Register Reconfiguration (84)

Code 84 (0x54) -- EIATTR_REG_RECONFIG: Indexed format, flag-only with optional value. Signals the kernel uses dynamic register reconfiguration (setmaxnreg instruction, sm_100+). SM-gated via sub_1C97840(0x54, sm_version).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x02    1     reconfig_value    (in TLV size field lo byte, optional)

Binary evidence: magic 0x5401 (21505) at lines 1637 and 2395. Tracked in v616[] per-entry array with the v666 flag. During callee propagation (lines 2364--2405), if a callee has a reconfig value (ii = *(v230+2)), it is written into the target record's size field byte: *(_BYTE *)(v417 + 2) = ii (line 2403). The value propagates from callee to entry point.

Annotations (85)

Code 85 (0x55) -- EIATTR_ANNOTATIONS: Free format with nested TLV-within-TLV sub-records. Emitted by sub_60C580. General-purpose annotation container for arbitrary metadata.

Payload: sequence of sub-records, each starting with a type byte:
  Type 0: [type:4]                                  -- 4 bytes
  Type 1: [type:4][value:4]                         -- 8 bytes
  Type 2: [type:4][key:4][len:4][data:len]          -- 12+len bytes, 4-byte aligned
  Type 3: [type:4][len:4][data:len]                 -- 8+len bytes, 4-byte aligned
Size:    sum of all sub-record sizes

Binary evidence from sub_60C580:

  • Line 47: type 2 records copy key (4 bytes) + len (4 bytes) + len bytes of data (line 51--53: memcpy(v17+3, v7+3, v22)). Alignment: (len + 11) & ~3 + 4 (line 55).
  • Line 63: type 3 records copy len (4 bytes) + len bytes (line 66--67: memcpy(v17+2, v7+2, v26)). Alignment: (len + 7) & ~3 + 4 (line 68).
  • Line 71: type 1 records are 8 bytes (v19 = 8; v17[1] = v7[1]).
  • Line 79: type 0 (default) records are 4 bytes.

Total allocation: 257 * entry_count dwords (line 29: v8 = 257LL * count), providing generous headroom for variable-length sub-records.
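A sketch of a walker over the four sub-record shapes (hypothetical parser; little-endian assumed, with each variable-length record padded to a 4-byte boundary, matching the alignment arithmetic recovered from sub_60C580):

```python
import struct

def parse_annotations(payload: bytes):
    pos, out = 0, []
    while pos < len(payload):
        (t,) = struct.unpack_from("<I", payload, pos)
        if t == 2:       # [type:4][key:4][len:4][data:len], 4-byte aligned
            key, length = struct.unpack_from("<II", payload, pos + 4)
            out.append((t, key, payload[pos + 12 : pos + 12 + length]))
            pos += 12 + ((length + 3) & ~3)
        elif t == 3:     # [type:4][len:4][data:len], 4-byte aligned
            (length,) = struct.unpack_from("<I", payload, pos + 4)
            out.append((t, payload[pos + 8 : pos + 8 + length]))
            pos += 8 + ((length + 3) & ~3)
        elif t == 1:     # [type:4][value:4]
            (value,) = struct.unpack_from("<I", payload, pos + 4)
            out.append((t, value))
            pos += 8
        else:            # type 0 (and unknown): bare [type:4]
            out.append((t,))
            pos += 4
    return out
```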

Sentinel (86)

Code 86 (0x56) -- EIATTR_UNKNOWN: Never emitted. Placeholder in the enum, analogous to EIATTR_ERROR (code 0).

Stub Function Kind (88)

Code 88 (0x58) -- EIATTR_STUB_FUNCTION_KIND: Indexed format, 4-byte value. Classifies the type of stub function.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
0x04    4     stub_kind         Stub function classification enum

No explicit switch case -- passes through the default path.

Mercury Finalizer Options (90)

Code 90 (0x5A) -- EIATTR_MERCURY_FINALIZER_OPTIONS: Free format. Options for the Mercury FNLZR post-link pass. Emitted by sub_462220. Contains null-terminated key-value string pairs with a trailing CRC hash.

Payload: sequence of key-value entries followed by a hash:
  Per-entry:
    Offset        Size     Field
    ------        ----     -----
    0x00          2        key_len     strlen(key) + 1 (includes null terminator)
    0x02          2        val_len     strlen(val) + 1 (includes null terminator)
    0x04          key_len  key_str     Null-terminated key string
    0x04+key_len  val_len  val_str     Null-terminated value string

  Trailer: CRC/hash (computed by sub_4305D0)
Size:    sum of all entries + hash

Binary evidence: sub_462220 at line 656 calls sub_1CC85F0(v7, 90, v234, v225, *a5). Lines 640--647 show the key-value pair packing: strlen of key and value, packed as u16 lengths, followed by strcpy of both strings. The hash is computed at line 653 via sub_4305D0(0x123456, ...).
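The per-entry packing can be sketched as follows. The trailing CRC/hash (sub_4305D0) is omitted because its algorithm is not reproduced here; the function name and dict interface are illustrative:

```python
import struct

def pack_finalizer_options(options: dict) -> bytes:
    """Pack key-value pairs in the recovered per-entry layout:
    [key_len:u16][val_len:u16][key\\0][val\\0]. Hash trailer not included."""
    out = bytearray()
    for key, val in options.items():
        k = key.encode() + b"\x00"            # lengths include the NUL,
        v = val.encode() + b"\x00"            # matching strlen(...) + 1
        out += struct.pack("<HH", len(k), len(v))
        out += k + v
    return bytes(out)
```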

Cluster Configuration (91)

Code 91 (0x5B) -- EIATTR_BLOCKS_ARE_CLUSTERS: Indexed format, flag-only. Signals that CTA blocks are clusters (every block is its own cluster).

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

No explicit switch case -- passes through the default path.

Address Sanitizer (92)

Code 92 (0x5C) -- EIATTR_SANITIZE: Indexed format, flag-only. Signals the kernel has been instrumented with address sanitizer.

Offset  Size  Field
------  ----  -----
0x00    4     sym_index
(no value -- flag only)

No explicit switch case -- passes through the default path.

Syscall Fallback (93)

Code 93 (0x5D) -- EIATTR_SYSCALLS_FALLBACK: Free format. Syscall fallback mechanism data.

Payload: structured syscall fallback data
Size:    variable

No explicit switch case -- passes through the default path.

CUDA Requirements (94)

Code 94 (0x5E) -- EIATTR_CUDA_REQ: Free format. CUDA requirements descriptor specifying minimum runtime capabilities.

Payload: structured requirements data
Size:    variable

No explicit switch case -- passes through the default path.

Mercury ISA Version (95)

Code 95 (0x5F) -- EIATTR_MERCURY_ISA_VERSION: Sized format (0x03). Mercury ISA version encoded in the TLV size field.

TLV header: [fmt=0x03][code=0x5F][isa_version:2]
Total record: 4 bytes (header only)

Error Last Sentinel (96)

Code 96 (0x60) -- EIATTR_ERROR_LAST: Never emitted. Upper bound sentinel for the enum range. Used for bound checks in the builder: if (attr_code > 0x2F) at line 760.

Payload Format Summary (Codes 65--96)

Code  Name                           Wire fmt  Payload size  Payload layout
----  ----                           --------  ------------  --------------
65    RESERVED_SMEM_USED             0x04      4             [sym:4] flag-only
66    RESERVED_SMEM_0_SIZE           0x04      8             [sym:4][rsmem_bytes:4]
67    UCODE_SECTION_DATA             0x01      var           opaque byte array
68    UNUSED_LOAD_BYTE_OFFSET        0x01      N*4           u32[] .text byte offsets
69    KPARAM_INFO_V2                 0x01      N*12          12B per-param descriptors
70    SYSCALL_OFFSETS                0x01      N*4           u32[] .text byte offsets
71    SW_WAR_MEMBAR_SYS_OFFSETS      0x01      N*4           u32[] .text byte offsets
72    GRAPHICS_GLOBAL_CBANK          0x04      8             [sym:4][cbank_desc:4]
73    SHADER_TYPE                    0x04      8             [sym:4][shader_type:4]
74    VRC_CTA_INIT_COUNT             0x02      4             [sym:4] count in TLV size byte
75    TOOLS_PATCH_FUNC               0x04      8             [sym:4][patch_info:4]
76    NUM_BARRIERS                   0x02      4             [sym:4] count in TLV size byte
77    TEXMODE_INDEPENDENT            0x04      4             [sym:4] flag-only
78    PERF_STATISTICS                0x01      var           structured perf data
79    AT_ENTRY_FRAGEMENTS            0x01      N*4           u32[] fragment offsets
80    SPARSE_MMA_MASK                0x03      0             bitmask in TLV size field (u16)
81    TCGEN05_1CTA_USED              0x04      4             [sym:4] flag-only
82    TCGEN05_2CTA_USED              0x04      4             [sym:4] flag-only
83    GEN_ERRBAR_AT_EXIT             0x04      4             [sym:4] flag-only
84    REG_RECONFIG                   0x04      4             [sym:4] value in TLV size byte
85    ANNOTATIONS                    0x01      var           nested TLV sub-records
86    UNKNOWN                        --        0             none (never emitted)
87    STACK_CANARY_TRAP_OFFSETS      0x01      N*4           u32[] .text byte offsets
88    STUB_FUNCTION_KIND             0x04      8             [sym:4][stub_kind:4]
89    LOCAL_CTA_ASYNC_STORE_OFFSETS  0x01      N*4           u32[] .text byte offsets
90    MERCURY_FINALIZER_OPTIONS      0x01      var           key-value pairs + hash
91    BLOCKS_ARE_CLUSTERS            0x04      4             [sym:4] flag-only
92    SANITIZE                       0x04      4             [sym:4] flag-only
93    SYSCALLS_FALLBACK              0x01      var           structured syscall data
94    CUDA_REQ                       0x01      var           structured requirements
95    MERCURY_ISA_VERSION            0x03      0             value in TLV size field (u16)
96    ERROR_LAST                     --        0             none (never emitted)
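Read together, the wire formats imply a simple record walker. The sketch below assumes the 4-byte header [fmt:1][code:1][size:2] shown for code 95, with fmt 0x03 carrying its value in the size field (no payload), fmt 0x02 carrying its value in the size field plus a 4-byte sym_index, and fmts 0x01/0x04 carrying `size` payload bytes; this mapping is inferred from the summary table, not from recovered parser code:

```python
import struct

def iter_nvinfo(section: bytes):
    """Yield (code, fmt, size_field, payload) for each TLV record."""
    off = 0
    while off + 4 <= len(section):
        fmt, code, size = struct.unpack_from("<BBH", section, off)
        off += 4
        if fmt == 0x03:          # value lives in the size field; no payload
            yield code, fmt, size, b""
        elif fmt == 0x02:        # value in size field + [sym:4] payload
            yield code, fmt, size, section[off : off + 4]
            off += 4
        else:                    # 0x01/0x04: size field = payload length
            yield code, fmt, size, section[off : off + size]
            off += size
```

For instance, a REG_RECONFIG record (fmt 0x04, 4-byte sym payload) followed by a MERCURY_ISA_VERSION record (fmt 0x03, header-only) round-trip cleanly.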

Generation Pipeline

EIATTR attributes are generated during Phase 6 of the ELF output pipeline, after all per-kernel SASS encoding and memory allocation have completed. The generation is orchestrated by two functions working in sequence.

Barrier/Register Propagation -- sub_1CC8950

Before per-entry attribute emission begins, sub_1CC8950 (2,634 bytes, called once per entry point) propagates resource requirements from callees to entry kernels via the call graph:

  1. Register count propagation: Walks the call graph DFS, finding the maximum register count among all callees. The verbose trace "regcount %d for %s propagated to entry %s" logs this.

  2. Barrier count creation: When a kernel's section flags contain a barrier count (bits 20--26 of section_header + 8) but no EIATTR_NUM_BARRIERS record exists, creates one and clears the section flag bits:

Creating new EIATTR_NUM_BARRIERS and moving barcount %d
from section flags of %s to nvinfo for entry symbol %s
  3. SM-version gating: Uses sub_1C97840 to check whether EIATTR_NUM_BARRIERS (0x4C) and EIATTR_NUM_MBARRIERS (0x38) are valid for the target SM version before emitting.
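Step 1's callee-to-entry propagation amounts to a max-reduction over the call graph. A minimal sketch, assuming an acyclic call graph and illustrative map-based inputs (the real pass walks ptxas's internal structures):

```python
from functools import lru_cache

def propagate_regcounts(call_graph, regcount):
    """For each function, take the maximum register count over itself and
    all transitive callees, mirroring the "regcount %d for %s propagated
    to entry %s" trace. Assumes the call graph is acyclic."""
    @lru_cache(maxsize=None)
    def demand(fn):
        return max([regcount[fn]] +
                   [demand(callee) for callee in call_graph.get(fn, [])])
    return {fn: demand(fn) for fn in regcount}
```

For example, an entry kernel needing 24 registers that calls a 40-register helper is propagated to 40.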

Master EIATTR Builder -- sub_1CC9800

The main builder function (14,764 bytes binary, 90 KB decompiled -- third largest function in the output range) constructs the complete set of .nv.info.<func> sections. It has 51 callees and is called once per compilation unit.

The builder iterates over every entry point and device function, emitting the applicable EIATTR records for each. The SM-version gating function sub_1C97840 is called before emitting each attribute to check compatibility. Observed EIATTR code checks in the builder:

Hex code  EIATTR name         Gating condition
--------  -----------         ----------------
0x04      CTAIDZ_USED         SM-version check
0x21      EXPLICIT_CACHING    SM-version check
0x1F      NEED_CNP_WRAPPER    SM-version check
0x20      NEED_CNP_PATCH      SM-version check
0x2C      HAS_PRE_V10_OBJECT  SM-version check
0x38      NUM_MBARRIERS       SM-version check
0x41      RESERVED_SMEM_USED  SM-version check
0x4A      VRC_CTA_INIT_COUNT  SM-version check
0x4C      NUM_BARRIERS        SM-version check
0x50      SPARSE_MMA_MASK     SM-version check
0x51      TCGEN05_1CTA_USED   SM-version check
0x52      TCGEN05_2CTA_USED   SM-version check
0x54      REG_RECONFIG        SM-version check

The SM version comes from offset +624 of the compilation state object, consistent with the SM version field at a1 + 624 observed throughout ptxas.

Weak Symbol Filtering

During linking (nvlink), three specific EIATTR codes are treated specially during weak symbol resolution. When a weak function is replaced by a stronger definition, records for these three codes are dropped using the bitmask 0x800800020000:

  • Code 17 (0x11) -- EIATTR_FRAME_SIZE
  • Code 35 (0x23) -- EIATTR_MAX_STACK_SIZE
  • Code 47 (0x2F) -- EIATTR_REGCOUNT

The rationale: when a weak function is replaced, its resource descriptors must not contaminate the replacement's resource accounting.
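The mask encodes exactly those three code numbers, one bit per EIATTR code, which can be confirmed arithmetically (a sketch of the check, not recovered code):

```python
# FRAME_SIZE (0x11) | MAX_STACK_SIZE (0x23) | REGCOUNT (0x2F)
weak_drop_mask = (1 << 0x11) | (1 << 0x23) | (1 << 0x2F)
assert weak_drop_mask == 0x800800020000
```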

Consumer Tools

cuobjdump

cuobjdump --dump-elf-section=.nv.info dumps raw hex bytes of the global .nv.info section. With --dump-resource-usage, it decodes EIATTR records into human-readable resource summaries (register count, shared memory, stack sizes).

nvdisasm

nvdisasm -nvi decodes .nv.info sections into named EIATTR records with decoded values. This is the primary tool for inspecting EIATTR content without writing a custom parser.

cuda-gdb

The debugger uses EIATTR_TOOLS_PATCH_FUNC (code 75, 0x4B) to locate patchable function entry points for breakpoint insertion and instrumentation.

How EIATTR Drives GPU Resource Allocation

The .nv.info section is not just metadata for tools -- it is the primary input to the GPU driver's kernel launch resource allocator:

  1. Register allocation: EIATTR_REGCOUNT (0x2F) tells the driver how many registers each thread needs. The driver computes max_warps_per_SM = total_registers / (regcount * warp_size).

  2. Shared memory reservation: EIATTR_SMEM_PARAM_SIZE (0x18) and EIATTR_RESERVED_SMEM_0_SIZE (0x42) determine how much shared memory to carve out before dynamic shared memory allocation.

  3. Stack allocation: EIATTR_CRS_STACK_SIZE (0x1E) and EIATTR_MAX_STACK_SIZE (0x23) determine per-thread stack allocation. Too small causes memory corruption; too large reduces occupancy.

  4. Barrier reservation: EIATTR_NUM_BARRIERS (0x4C) reserves named barrier slots. Hardware supports 16 barriers per CTA on most architectures.

  5. Instruction patching: Offset tables (EXIT_INSTR_OFFSETS, S2RCTAID_INSTR_OFFSETS, SW*_WAR) tell the driver which instruction words to patch at load time. This enables hardware workarounds and CTA-ID remapping for cluster launch without recompilation.

  6. Cluster configuration: EIATTR_CTA_PER_CLUSTER (0x3D) and EIATTR_EXPLICIT_CLUSTER (0x3E) control the cluster launch hardware on sm_90+, determining how many CTAs share distributed shared memory.

  7. Tensor core mode: EIATTR_TCGEN05_1CTA_USED (0x51) and EIATTR_TCGEN05_2CTA_USED (0x52) inform the driver about 5th-gen tensor core usage modes on sm_100+.
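The register-limited warp count from point 1 can be sketched directly. The 65,536-registers-per-SM figure is typical of recent architectures but is an assumption here, and the driver's exact rounding (register allocation granularity) is not modeled:

```python
def register_limited_warps(regcount: int, total_registers: int = 65536,
                           warp_size: int = 32) -> int:
    """max_warps_per_SM = total_registers / (regcount * warp_size),
    per the EIATTR_REGCOUNT-driven computation described above."""
    return total_registers // (max(regcount, 1) * warp_size)
```

A 32-register kernel allows 64 resident warps per SM; at the 255-register ceiling the same SM holds only 8.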

Binary Artifacts

Pointer Table Layout

The EIATTR name table at VA 0x23FDC20 consists of 97 entries of 16 bytes each (1,552 bytes total):

Offset  Size  Field
------  ----  -----
0x00    8     name_ptr     Pointer to null-terminated EIATTR name string
0x08    4     meta_lo      Minimum toolkit version compatibility
0x0C    4     meta_hi      Flags (0=legacy, 1=internal, 2=standard)

The table is indexed directly by EIATTR code number: entry = table_base + code * 16.
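A reader for this table can be sketched as follows, assuming a flat image in which VAs map linearly to buffer offsets; translating file offsets for the real ELF is left to the caller, and the function name is illustrative:

```python
import struct

def eiattr_name(image: bytes, code: int,
                table_va: int = 0x23FDC20, load_va: int = 0) -> str:
    """Look up an EIATTR name: entry = table_base + code * 16,
    then follow name_ptr to a NUL-terminated string."""
    entry_off = (table_va - load_va) + code * 16
    name_ptr, meta_lo, meta_hi = struct.unpack_from("<QII", image, entry_off)
    str_off = name_ptr - load_va
    return image[str_off : image.index(b"\x00", str_off)].decode()
```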

Typos Preserved in the Binary

String in binary            Correct spelling           Address
----------------            ----------------           -------
EIATTR_AT_ENTRY_FRAGEMENTS  EIATTR_AT_ENTRY_FRAGMENTS  0x23FCCBD (code 79 name)

A corrected variant EIATTR_AT_ENTRY_FRAGMENTS exists at 0x2405DA1, and EIATTR_COROUTINE_RESUME_ID_OFFSETS at 0x24064D8 is an alternate name for code 58, both outside the main table.

Diagnostic Strings

"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
 from section flags of %s to nvinfo for entry symbol %s"       (0x2406960)

"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
 from section flags of %s to nvinfo for non-entry symbol %s"   (0x24069D0)

"Creating new EIATTR_NUM_BARRIERS and propagating higher
 barcount %d from section flags of %s to nvinfo
 for entry symbol %s"                                          (0x2406B10)

"conflicting crs_stack attribute"                               (sub_1CC9800 evidence)

"Turning caching %s for entry '%s' as per its request"          (sub_1CC9800 evidence)

"regcount %d for %s propagated to entry %s"                     (sub_1CC8950 evidence)

"no regcount?"                                                  (sub_1CC8950 evidence)

Key Functions

Address      Size      Identity                     Role
-------      ----      --------                     ----
sub_1CC9800  14,764 B  Master EIATTR builder        Constructs all .nv.info.<func> sections (90 KB decompiled, 51 callees)
sub_1CC8950  2,634 B   Barrier/register propagator  Propagates resource counts across call graph
sub_1CC85F0  ~180 B    TLV record emitter           Writes individual EIATTR records to the nvinfo linked list
sub_1C97840  ~100 B    SM-version gate              Checks if an EIATTR code is valid for a given SM target
sub_1CC86D0  ~600 B    Per-entry stack emitter      Emits MIN_STACK_SIZE (0x12), CRS_STACK_SIZE (0x1E), SAM_REGION_STACK_SIZE (0x3B) per function
sub_1CC84A0  ~400 B    EIATTR helper                Attribute lookup helper
sub_1CC83F0  ~200 B    EIATTR helper                Section flag extractor
sub_1CC8100  ~1 KB     Cache conflict resolver      Resolves conflicting cache preference attributes

Cross-References

Glossary

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Quick-reference for terms used throughout this wiki. Each entry links to the primary page with full details.

Barrier
  Hardware synchronization primitive that blocks threads until a condition is met. PTXAS inserts and optimizes barriers via dedicated passes. See Synchronization & Barriers.
BMMA
  Binary Matrix Multiply-Accumulate — tensor core operation on 1-bit inputs. Part of the WMMA/GMMA family. See Tensor Core Intrinsics.
BSSY
  Barrier Set Synchronization — SASS instruction that sets a convergence barrier for divergent control flow. Paired with BSYNC. See Scoreboards & Dependency Barriers.
BSYNC
  Barrier Synchronize — SASS instruction that waits on a convergence barrier set by BSSY. See Scoreboards & Dependency Barriers.
Capmerc
  Capsule Mercury — an ELF section (.nv.capmerc) embedding a secondary Mercury-encoded representation of the kernel for debug metadata and binary patching. See Capsule Mercury & Finalization.
CGA
  Cooperative Grid Array — Hopper+ hardware grouping of thread blocks that can synchronize and share distributed shared memory. See Ada & Hopper.
Convergence
  The point where divergent warp threads rejoin a common execution path, marked by BSSY/BSYNC pairs in SASS. See Predication.
Cubin
  CUDA Binary — the ELF-based output format produced by ptxas, containing .text (SASS), .nv.info, .nv.constant0, and other NVIDIA-specific sections. See ELF/Cubin Output.
DAG
  Directed Acyclic Graph — the core data structure within Ori IR basic blocks; instructions form a DAG of def-use edges rather than a flat list. See IR Overview & Design.
DEPBAR
  Dependency Barrier — SASS instruction (DEPBAR) that stalls until a scoreboard counter reaches a threshold, enforcing producer-consumer ordering. See Scoreboards & Dependency Barriers.
Divergence
  When threads within a warp take different control-flow paths, requiring the hardware to serialize execution. PTXAS manages divergence through predication and BSSY/BSYNC insertion. See Predication.
DMMA
  Double-precision Matrix Multiply-Accumulate — FP64 tensor core operation available on sm_80+. See Tensor Core Intrinsics.
DynBatch
  Dynamic Batch — one of the instruction scheduler's two modes (alongside ReduceReg), which batches independent instructions to maximize ILP. See Scheduler Architecture.
EIATTR
  Extended Info Attributes — per-kernel metadata in .nv.info sections: tag-length-value records carrying register counts, barrier usage, shared memory sizes, and other properties consumed by the CUDA runtime and driver. See EIATTR Attribute Catalog.
ELFW
  PTXAS's custom ELF writer (sub_1C9F280, 97 KB) — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions. See Custom ELF Emitter.
Fatpoint
  The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers; the allocator maps these sets to physical registers. See Fatpoint Algorithm.
HMMA
  Half-precision Matrix Multiply-Accumulate — FP16 tensor core operation, the original WMMA instruction class from Volta/Turing. See Tensor Core Intrinsics.
IMMA
  Integer Matrix Multiply-Accumulate — INT8/INT4 tensor core operation. See Tensor Core Intrinsics.
Knob
  An internal tuning parameter (1,294 total) stored as a ROT13-obfuscated string in the binary, read from environment variables or INI-format knob files. Controls per-pass thresholds, feature toggles, and scheduler behavior. See Knobs System.
MEMBAR
  Memory Barrier — SASS instruction that enforces memory ordering across threads, CTAs, or the GPU. See Synchronization & Barriers.
MercConverter
  The subsystem that converts abstract Ori IR instructions into Mercury-compatible instruction objects for SASS encoding. Part of instruction selection. See Instruction Selection.
Mercury
  The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words via ~4,000 per-variant handler functions. See Mercury Encoder.
MovPhi
  A pseudo-instruction in the Ori IR that represents SSA phi-node moves — parallel copies resolved during register allocation and out-of-SSA conversion. See IR Overview & Design.
NvOptRecipe
  NVIDIA Optimization Recipe — a predefined sequence of optimization phases selected by optimization level. The PhaseManager reads the recipe to determine which phases run and in what order. See Optimization Levels.
Occupancy
  The ratio of active warps to the maximum warps a streaming multiprocessor can support, determined by register count, shared memory usage, and barrier count. Higher occupancy helps hide memory latency. See Allocator Architecture.
OCG
  Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization and codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings. See Optimization Pipeline.
Opex
  Operand Expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields (virtual registers, immediates, address modes to bit patterns). See SASS Code Generation.
Ori IR
  PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym. See IR Overview & Design.
PhaseManager
  The infrastructure class (sub_C62720) that drives the 159-phase optimization pipeline: a factory, vtable dispatch, execute/isNoOp/getName interface. See Phase Manager Infrastructure.
Pipeline progress counter
  A hardware counter (Hopper+) that tracks the stage of an asynchronous pipeline operation, used by cp.async and TMA to overlap compute with memory transfers. See Ada & Hopper.
PTX
  Parallel Thread Execution — NVIDIA's virtual ISA for GPU compute. The textual input format consumed by ptxas. See PTX Instruction Table.
QMMA
  Quarter-precision Matrix Multiply-Accumulate — FP8 (E4M3/E5M2) tensor core operation available on Hopper+. See Tensor Core Intrinsics.
Register pressure
  The number of live virtual registers at a program point relative to the physical register file capacity. High pressure causes spilling. See Allocator Architecture.
Remat
  Rematerialization — recomputing a value instead of spilling and reloading it, trading ALU cycles for register file pressure reduction. See Rematerialization.
ROT13
  The trivial Caesar cipher (rotate-13) used to obfuscate all 1,294 knob name strings in the ptxas binary. Decoded at lookup time by GetKnobIndex. See Knobs System.
SASS
  Shader Assembly — NVIDIA's native GPU machine code. The binary output produced by ptxas, encoded as 128-bit instruction words. See SASS Opcode Catalog.
Scoreboard
  A hardware dependency-tracking mechanism (6 barriers on pre-Hopper, more on Hopper+) that enforces ordering between long-latency producers and their consumers. Managed by DEPBAR instructions. See Scoreboards & Dependency Barriers.
sm_backend
  The per-architecture codegen backend selected by --gpu-name. Each SM family (Turing, Ampere, Ada, Hopper, Blackwell) has distinct encoding tables, latency profiles, and feature gates. See SM Architecture Map.
Spill
  Storing a live register value to local memory when the allocator cannot fit all live values into the physical register file. Spills degrade performance significantly on GPUs. See Spilling.
tcgen05
  Fifth-generation tensor core instruction set on Blackwell (sm_100+). Replaces WMMA/GMMA with a new ISA for matrix operations. See TCGen05.
TMA
  Tensor Memory Accelerator — Hopper+ hardware unit that performs bulk asynchronous copies between global and shared memory with address generation offloaded from the SM. See Ada & Hopper.
UFT
  Uniform Function Table — a data structure in the CUBIN that maps function indices to code offsets, used by the driver for indirect call dispatch. See ELF/Cubin Output.
UDT
  Uniform Data Table — a companion to UFT that maps data indices to constant bank offsets within the CUBIN. See ELF/Cubin Output.
Warpgroup
  A Hopper+ scheduling unit consisting of 4 warps (128 threads) that execute WGMMA and other warpgroup-level instructions collectively. See Ada & Hopper.
WGMMA
  Warpgroup Matrix Multiply-Accumulate — Hopper+ tensor core instruction that operates at warpgroup granularity (4 warps), supporting asynchronous execution with pipeline progress counters. See GMMA/WGMMA Pipeline.