Glossary

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.

Quick-reference for terms used throughout this wiki. Each entry links to the primary page with full details.

| Term | Definition |
|------|------------|
| Barrier | Hardware synchronization primitive that blocks threads until a condition is met. PTXAS inserts and optimizes barriers via dedicated passes. See Synchronization & Barriers. |
| BMMA | Binary Matrix Multiply-Accumulate — tensor core operation on 1-bit inputs. Part of the WMMA/GMMA family. See Tensor Core Intrinsics. |
| BSSY | Barrier Set Synchronization — SASS instruction that sets a convergence barrier for divergent control flow. Paired with BSYNC. See Scoreboards & Dependency Barriers. |
| BSYNC | Barrier Synchronize — SASS instruction that waits on a convergence barrier set by BSSY. See Scoreboards & Dependency Barriers. |
| Capmerc | Capsule Mercury — an ELF section (.nv.capmerc) embedding a secondary Mercury-encoded representation of the kernel for debug metadata and binary patching. See Capsule Mercury & Finalization. |
| CGA | Cooperative Grid Array — Hopper+ hardware grouping of thread blocks that can synchronize and share distributed shared memory. See Ada & Hopper. |
| Convergence | The point where divergent warp threads rejoin a common execution path, marked by BSSY/BSYNC pairs in SASS. See Predication. |
| Cubin | CUDA Binary — the ELF-based output format produced by ptxas, containing .text (SASS), .nv.info, .nv.constant0, and other NVIDIA-specific sections. See ELF/Cubin Output. |
| DAG | Directed Acyclic Graph — the core data structure within Ori IR basic blocks; instructions form a DAG of def-use edges rather than a flat list. See IR Overview & Design. |
| DEPBAR | Dependency Barrier — SASS instruction (DEPBAR) that stalls until a scoreboard counter reaches a threshold, enforcing producer-consumer ordering. See Scoreboards & Dependency Barriers. |
| Divergence | When threads within a warp take different control-flow paths, requiring the hardware to serialize execution. PTXAS manages divergence through predication and BSSY/BSYNC insertion. See Predication. |
| DMMA | Double-precision Matrix Multiply-Accumulate — FP64 tensor core operation available on sm_80+. See Tensor Core Intrinsics. |
| DynBatch | Dynamic Batch — one of the instruction scheduler's two modes (alongside ReduceReg), which batches independent instructions to maximize ILP. See Scheduler Architecture. |
| EIATTR | Extended Info Attributes — per-kernel metadata in .nv.info sections: tag-length-value records carrying register counts, barrier usage, shared memory sizes, and other properties consumed by the CUDA runtime and driver. See EIATTR Attribute Catalog. |
| ELFW | PTXAS's custom ELF writer (sub_1C9F280, 97 KB) — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions. See Custom ELF Emitter. |
| Fatpoint | The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers; the allocator maps these sets to physical registers. See Fatpoint Algorithm. |
| HMMA | Half-precision Matrix Multiply-Accumulate — FP16 tensor core operation, the original WMMA instruction class from Volta/Turing. See Tensor Core Intrinsics. |
| IMMA | Integer Matrix Multiply-Accumulate — INT8/INT4 tensor core operation. See Tensor Core Intrinsics. |
| Knob | An internal tuning parameter (1,294 total) stored as a ROT13-obfuscated string in the binary, read from environment variables or INI-format knob files. Controls per-pass thresholds, feature toggles, and scheduler behavior. See Knobs System. |
| MEMBAR | Memory Barrier — SASS instruction that enforces memory ordering across threads, CTAs, or the GPU. See Synchronization & Barriers. |
| MercConverter | The subsystem that converts abstract Ori IR instructions into Mercury-compatible instruction objects for SASS encoding. Part of instruction selection. See Instruction Selection. |
| Mercury | The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words via ~4,000 per-variant handler functions. See Mercury Encoder. |
| MovPhi | A pseudo-instruction in the Ori IR that represents SSA phi-node moves — parallel copies resolved during register allocation and out-of-SSA conversion. See IR Overview & Design. |
| NvOptRecipe | NVIDIA Optimization Recipe — a predefined sequence of optimization phases selected by optimization level. The PhaseManager reads the recipe to determine which phases run and in what order. See Optimization Levels. |
| Occupancy | The ratio of active warps to the maximum warps a streaming multiprocessor can support, determined by register count, shared memory usage, and barrier count. Higher occupancy helps hide memory latency. See Allocator Architecture. |
| OCG | Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization and codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings. See Optimization Pipeline. |
| Opex | Operand Expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields (virtual registers, immediates, address modes to bit patterns). See SASS Code Generation. |
| Ori IR | PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym. See IR Overview & Design. |
| PhaseManager | The infrastructure class (sub_C62720) that drives the 159-phase optimization pipeline: a factory, vtable dispatch, execute/isNoOp/getName interface. See Phase Manager Infrastructure. |
| Pipeline progress counter | A hardware counter (Hopper+) that tracks the stage of an asynchronous pipeline operation, used by cp.async and TMA to overlap compute with memory transfers. See Ada & Hopper. |
| PTX | Parallel Thread Execution — NVIDIA's virtual ISA for GPU compute. The textual input format consumed by ptxas. See PTX Instruction Table. |
| QMMA | Quarter-precision Matrix Multiply-Accumulate — FP8 (E4M3/E5M2) tensor core operation available on Hopper+. See Tensor Core Intrinsics. |
| Register pressure | The number of live virtual registers at a program point relative to the physical register file capacity. High pressure causes spilling. See Allocator Architecture. |
| Remat | Rematerialization — recomputing a value instead of spilling and reloading it, trading ALU cycles for reduced register pressure. See Rematerialization. |
| ROT13 | The trivial Caesar cipher (rotate-13) used to obfuscate all 1,294 knob name strings in the ptxas binary. Decoded at lookup time by GetKnobIndex. See Knobs System. |
| SASS | Shader Assembly — NVIDIA's native GPU machine code. The binary output produced by ptxas, encoded as 128-bit instruction words. See SASS Opcode Catalog. |
| Scoreboard | A hardware dependency-tracking mechanism (6 barriers on pre-Hopper, more on Hopper+) that enforces ordering between long-latency producers and their consumers. Managed by DEPBAR instructions. See Scoreboards & Dependency Barriers. |
| sm_backend | The per-architecture codegen backend selected by --gpu-name. Each SM family (Turing, Ampere, Ada, Hopper, Blackwell) has distinct encoding tables, latency profiles, and feature gates. See SM Architecture Map. |
| Spill | Storing a live register value to local memory when the allocator cannot fit all live values into the physical register file. Spills degrade performance significantly on GPUs. See Spilling. |
| tcgen05 | Fifth-generation tensor core instruction set on Blackwell (sm_100+). Replaces WMMA/GMMA with a new ISA for matrix operations. See TCGen05. |
| TMA | Tensor Memory Accelerator — Hopper+ hardware unit that performs bulk asynchronous copies between global and shared memory with address generation offloaded from the SM. See Ada & Hopper. |
| UFT | Uniform Function Table — a data structure in the CUBIN that maps function indices to code offsets, used by the driver for indirect call dispatch. See ELF/Cubin Output. |
| UDT | Uniform Data Table — a companion to UFT that maps data indices to constant bank offsets within the CUBIN. See ELF/Cubin Output. |
| Warpgroup | A Hopper+ scheduling unit consisting of 4 warps (128 threads) that execute WGMMA and other warpgroup-level instructions collectively. See Ada & Hopper. |
| WGMMA | Warpgroup Matrix Multiply-Accumulate — Hopper+ tensor core instruction that operates at warpgroup granularity (4 warps), supporting asynchronous execution with pipeline progress counters. See GMMA/WGMMA Pipeline. |
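A few of the terms above are concrete enough to illustrate directly. The ROT13 obfuscation applied to knob names is a plain rotate-13 Caesar cipher, which is its own inverse — the same transform GetKnobIndex applies at lookup time can be sketched in a few lines (the knob name below is a made-up example, not one recovered from the binary):

```python
def rot13(s: str) -> str:
    # Rotate each ASCII letter 13 places; non-letters (digits, '_') pass through.
    out = []
    for ch in s:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + 13) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + 13) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

# ROT13 is an involution: applying it twice returns the original string,
# so one function both obfuscates and decodes.
print(rot13("FPURQ_ZBQR"))  # -> SCHED_MODE (hypothetical knob name)
```

Because encode and decode are the same operation, a table of obfuscated strings can be scanned for readable knob names simply by running every candidate through `rot13` once.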
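Occupancy, as defined above, falls out of a min-over-limiters calculation: registers, shared memory, and the warp cap each bound how many blocks fit on an SM, and the tightest bound wins. The sketch below uses illustrative per-SM limits (64K registers, 48 KiB shared memory, 48 warps) rather than the figures of any specific architecture, and ignores allocation-granularity rounding that real hardware applies:

```python
def warps_per_sm(regs_per_thread: int, smem_per_block: int, threads_per_block: int,
                 regfile: int = 65536, smem: int = 49152, max_warps: int = 48) -> int:
    # Compute how many blocks each resource allows, then take the minimum.
    warps_per_block = (threads_per_block + 31) // 32
    blocks_by_regs = regfile // (regs_per_thread * threads_per_block)
    blocks_by_smem = smem // smem_per_block if smem_per_block else blocks_by_regs
    blocks_by_warps = max_warps // warps_per_block
    blocks = min(blocks_by_regs, blocks_by_smem, blocks_by_warps)
    return blocks * warps_per_block

# 64 regs/thread * 256 threads/block = 16384 regs/block -> 4 blocks by registers;
# 8 KiB smem/block -> 6 blocks; 8 warps/block -> 6 blocks. Registers limit: 4 blocks.
active = warps_per_sm(regs_per_thread=64, smem_per_block=8192, threads_per_block=256)
print(active, active / 48)  # 32 active warps -> occupancy 32/48
```

This is why the allocator's register-count decisions feed directly into occupancy: dropping from 64 to 40 registers per thread in this example would lift the register bound above the other limiters.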
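The EIATTR entry describes .nv.info as tag-length-value records. As a rough sketch of how such a stream is walked — assuming a hypothetical layout of a 1-byte format code, a 1-byte attribute tag, a 2-byte little-endian length, then the value bytes; the actual record encoding is documented on the EIATTR Attribute Catalog page and may differ — a reader looks like this:

```python
import struct

def walk_tlv(blob: bytes) -> list[tuple[int, int, bytes]]:
    """Yield (format, tag, value) triples from a packed TLV stream.

    Hypothetical layout for illustration: <format:u8><tag:u8><length:u16le><value>.
    """
    records, off = [], 0
    while off + 4 <= len(blob):
        fmt, tag, length = struct.unpack_from("<BBH", blob, off)
        off += 4
        records.append((fmt, tag, blob[off:off + length]))
        off += length
    return records

# Two fabricated records: tag 0x2F with a 4-byte value, tag 0x1B with none.
blob = struct.pack("<BBH", 4, 0x2F, 4) + b"\x10\x00\x00\x00" + struct.pack("<BBH", 1, 0x1B, 0)
print(walk_tlv(blob))
```

The self-describing length field is what lets the CUDA runtime skip records whose tags it does not recognize, which is how new attribute kinds can be added without breaking older drivers.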