Glossary

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.

Quick-reference for terms used throughout this wiki. Each entry links to the primary page with full details.

| Term | Definition |
|------|------------|
| Barrier | Hardware synchronization primitive that blocks threads until a condition is met. PTXAS inserts and optimizes barriers via dedicated passes. See Synchronization & Barriers. |
| BMMA | Binary Matrix Multiply-Accumulate — tensor core operation on 1-bit inputs. Part of the WMMA/GMMA family. See Tensor Core Intrinsics. |
| BSSY | Barrier Set Synchronization — SASS instruction that sets a convergence barrier for divergent control flow. Paired with BSYNC. See Scoreboards & Dependency Barriers. |
| BSYNC | Barrier Synchronize — SASS instruction that waits on a convergence barrier set by BSSY. See Scoreboards & Dependency Barriers. |
| Capmerc | Capsule Mercury — an ELF section (.nv.capmerc) embedding a secondary Mercury-encoded representation of the kernel for debug metadata and binary patching. See Capsule Mercury & Finalization. |
| CGA | Cooperative Grid Array — Hopper+ hardware grouping of thread blocks that can synchronize and share distributed shared memory. See Ada & Hopper. |
| Convergence | The point where divergent warp threads rejoin a common execution path, marked by BSSY/BSYNC pairs in SASS. See Predication. |
| Cubin | CUDA Binary — the ELF-based output format produced by ptxas, containing .text (SASS), .nv.info, .nv.constant0, and other NVIDIA-specific sections. See ELF/Cubin Output. |
| DAG | Directed Acyclic Graph — the core data structure within Ori IR basic blocks; instructions form a DAG of def-use edges rather than a flat list. See IR Overview & Design. |
| DEPBAR | Dependency Barrier — SASS instruction (DEPBAR) that stalls until a scoreboard counter reaches a threshold, enforcing producer-consumer ordering. See Scoreboards & Dependency Barriers. |
| Divergence | When threads within a warp take different control-flow paths, requiring the hardware to serialize execution. PTXAS manages divergence through predication and BSSY/BSYNC insertion. See Predication. |
| DMMA | Double-precision Matrix Multiply-Accumulate — FP64 tensor core operation available on sm_80+. See Tensor Core Intrinsics. |
| DynBatch | Dynamic Batch — one of the instruction scheduler's two modes (alongside ReduceReg), which batches independent instructions to maximize ILP. See Scheduler Architecture. |
| EIATTR | Extended Info Attributes — per-kernel metadata in .nv.info sections: tag-length-value records carrying register counts, barrier usage, shared memory sizes, and other properties consumed by the CUDA runtime and driver. See EIATTR Attribute Catalog. |
| ELFW | PTXAS's custom ELF writer (sub_1C9F280, 97 KB) — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions. See Custom ELF Emitter. |
| Fatpoint | The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers; the allocator maps these sets to physical registers. See Fatpoint Algorithm. |
| HMMA | Half-precision Matrix Multiply-Accumulate — FP16 tensor core operation, the original WMMA instruction class from Volta/Turing. See Tensor Core Intrinsics. |
| IMMA | Integer Matrix Multiply-Accumulate — INT8/INT4 tensor core operation. See Tensor Core Intrinsics. |
| Knob | An internal tuning parameter (1,294 total) stored as a ROT13-obfuscated string in the binary, read from environment variables or INI-format knob files. Controls per-pass thresholds, feature toggles, and scheduler behavior. See Knobs System. |
| MEMBAR | Memory Barrier — SASS instruction that enforces memory ordering across threads, CTAs, or the GPU. See Synchronization & Barriers. |
| MercConverter | The subsystem that converts abstract Ori IR instructions into Mercury-compatible instruction objects for SASS encoding. Part of instruction selection. See Instruction Selection. |
| Mercury | The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words via ~4,000 per-variant handler functions. See Mercury Encoder. |
| MovPhi | A pseudo-instruction in the Ori IR that represents SSA phi-node moves — parallel copies resolved during register allocation and out-of-SSA conversion. See IR Overview & Design. |
| NvOptRecipe | NVIDIA Optimization Recipe — a predefined sequence of optimization phases selected by optimization level. The PhaseManager reads the recipe to determine which phases run and in what order. See Optimization Levels. |
| Occupancy | The ratio of active warps to the maximum warps a streaming multiprocessor can support, determined by register count, shared memory usage, and barrier count. Higher occupancy helps hide memory latency. See Allocator Architecture. |
| OCG | Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization and codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings. See Optimization Pipeline. |
| Opex | Operand Expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields (virtual registers, immediates, address modes to bit patterns). See SASS Code Generation. |
| Ori IR | PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym. See IR Overview & Design. |
| PhaseManager | The infrastructure class (sub_C62720) that drives the 159-phase optimization pipeline: a factory, vtable dispatch, execute/isNoOp/getName interface. See Phase Manager Infrastructure. |
| Pipeline progress counter | A hardware counter (Hopper+) that tracks the stage of an asynchronous pipeline operation, used by cp.async and TMA to overlap compute with memory transfers. See Ada & Hopper. |
| PTX | Parallel Thread Execution — NVIDIA's virtual ISA for GPU compute. The textual input format consumed by ptxas. See PTX Instruction Table. |
| QMMA | Quarter-precision Matrix Multiply-Accumulate — FP8 (E4M3/E5M2) tensor core operation available on Hopper+. See Tensor Core Intrinsics. |
| Register pressure | The number of live virtual registers at a program point relative to the physical register file capacity. High pressure causes spilling. See Allocator Architecture. |
| Remat | Rematerialization — recomputing a value instead of spilling and reloading it, trading ALU cycles for reduced register pressure. See Rematerialization. |
| ROT13 | The trivial Caesar cipher (rotate-13) used to obfuscate all 1,294 knob name strings in the ptxas binary. Decoded at lookup time by GetKnobIndex. See Knobs System. |
| SASS | Shader Assembly — NVIDIA's native GPU machine code. The binary output produced by ptxas, encoded as 128-bit instruction words. See SASS Opcode Catalog. |
| Scoreboard | A hardware dependency-tracking mechanism (6 barriers on pre-Hopper, more on Hopper+) that enforces ordering between long-latency producers and their consumers. Managed by DEPBAR instructions. See Scoreboards & Dependency Barriers. |
| sm_backend | The per-architecture codegen backend selected by --gpu-name. Each SM family (Turing, Ampere, Ada, Hopper, Blackwell) has distinct encoding tables, latency profiles, and feature gates. See SM Architecture Map. |
| Spill | Storing a live register value to local memory when the allocator cannot fit all live values into the physical register file. Spills degrade performance significantly on GPUs. See Spilling. |
| tcgen05 | Fifth-generation tensor core instruction set on Blackwell (sm_100+). Replaces WMMA/GMMA with a new ISA for matrix operations. See TCGen05. |
| TMA | Tensor Memory Accelerator — Hopper+ hardware unit that performs bulk asynchronous copies between global and shared memory with address generation offloaded from the SM. See Ada & Hopper. |
| UFT | Uniform Function Table — a data structure in the CUBIN that maps function indices to code offsets, used by the driver for indirect call dispatch. See ELF/Cubin Output. |
| UDT | Uniform Data Table — a companion to UFT that maps data indices to constant bank offsets within the CUBIN. See ELF/Cubin Output. |
| Warpgroup | A Hopper+ scheduling unit consisting of 4 warps (128 threads) that execute WGMMA and other warpgroup-level instructions collectively. See Ada & Hopper. |
| WGMMA | Warpgroup Matrix Multiply-Accumulate — Hopper+ tensor core instruction that operates at warpgroup granularity (4 warps), supporting asynchronous execution with pipeline progress counters. See GMMA/WGMMA Pipeline. |
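A few of the terms above are concrete enough to illustrate directly. The ROT13 obfuscation applied to knob names is a plain rotate-13 Caesar cipher, which is its own inverse — the same transform GetKnobIndex applies at lookup time can be sketched in a few lines (the knob name below is a made-up example, not one recovered from the binary):

```python
def rot13(s: str) -> str:
    # Rotate each ASCII letter 13 places; non-letters (digits, '_') pass through.
    out = []
    for ch in s:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + 13) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + 13) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

# ROT13 is an involution: applying it twice returns the original string,
# so one function both obfuscates and decodes.
print(rot13("FPURQ_ZBQR"))  # -> SCHED_MODE (hypothetical knob name)
```

Because encode and decode are the same operation, a table of obfuscated strings can be scanned for readable knob names simply by running every candidate through `rot13` once.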
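Occupancy, as defined above, falls out of a min-over-limiters calculation: registers, shared memory, and the warp cap each bound how many blocks fit on an SM, and the tightest bound wins. The sketch below uses illustrative per-SM limits (64K registers, 48 KiB shared memory, 48 warps) rather than the figures of any specific architecture, and ignores allocation-granularity rounding that real hardware applies:

```python
def warps_per_sm(regs_per_thread: int, smem_per_block: int, threads_per_block: int,
                 regfile: int = 65536, smem: int = 49152, max_warps: int = 48) -> int:
    # Compute how many blocks each resource allows, then take the minimum.
    warps_per_block = (threads_per_block + 31) // 32
    blocks_by_regs = regfile // (regs_per_thread * threads_per_block)
    blocks_by_smem = smem // smem_per_block if smem_per_block else blocks_by_regs
    blocks_by_warps = max_warps // warps_per_block
    blocks = min(blocks_by_regs, blocks_by_smem, blocks_by_warps)
    return blocks * warps_per_block

# 64 regs/thread * 256 threads/block = 16384 regs/block -> 4 blocks by registers;
# 8 KiB smem/block -> 6 blocks; 8 warps/block -> 6 blocks. Registers limit: 4 blocks.
active = warps_per_sm(regs_per_thread=64, smem_per_block=8192, threads_per_block=256)
print(active, active / 48)  # 32 active warps -> occupancy 32/48
```

This is why the allocator's register-count decisions feed directly into occupancy: dropping from 64 to 40 registers per thread in this example would lift the register bound above the other limiters.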
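The EIATTR entry describes .nv.info as tag-length-value records. As a rough sketch of how such a stream is walked — assuming a hypothetical layout of a 1-byte format code, a 1-byte attribute tag, a 2-byte little-endian length, then the value bytes; the actual record encoding is documented on the EIATTR Attribute Catalog page and may differ — a reader looks like this:

```python
import struct

def walk_tlv(blob: bytes) -> list[tuple[int, int, bytes]]:
    """Yield (format, tag, value) triples from a packed TLV stream.

    Hypothetical layout for illustration: <format:u8><tag:u8><length:u16le><value>.
    """
    records, off = [], 0
    while off + 4 <= len(blob):
        fmt, tag, length = struct.unpack_from("<BBH", blob, off)
        off += 4
        records.append((fmt, tag, blob[off:off + length]))
        off += length
    return records

# Two fabricated records: tag 0x2F with a 4-byte value, tag 0x1B with none.
blob = struct.pack("<BBH", 4, 0x2F, 4) + b"\x10\x00\x00\x00" + struct.pack("<BBH", 1, 0x1B, 0)
print(walk_tlv(blob))
```

The self-describing length field is what lets the CUDA runtime skip records whose tags it does not recognize, which is how new attribute kinds can be added without breaking older drivers.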