PTX Instruction Table

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Complete catalog of PTX instructions recognized by ptxas v13.0.88 (CUDA 12.8 / PTX ISA 8.7). All entries are verified against the binary: instruction names come from the Flex lexer's 552 token rules, type signatures from the instruction table builder's 1,141 descriptor registrations (sub_46E000 calling sub_46BED0), and formatter names from the 580 PTX text generation functions dispatched by sub_5D4190. Internal-only instructions (prefixed with _) are included where they appear in the binary but are marked accordingly.

  • Instruction table builder: sub_46E000 (93 KB, 1,141 calls to sub_46BED0)
  • Instruction lookup: sub_46C690 (entry) / sub_46C6E0 (6.4 KB matcher)
  • PTX text formatter dispatch: sub_5D4190 (12.9 KB, 81 strings + 473-entry hash)
  • Formatter functions: 0x4DA340--0x5A8E40 (580 functions)
  • Semantic validators: 0x460000--0x4D5000 (~20 validator functions)
  • Operand type encoding: single-char codes: F=float, H=half, I=int, B=bits, N=imm, P=pred, E=bf16, Q=fp8, R=fp4

Organization

Instructions are grouped by functional category following NVIDIA's PTX ISA documentation structure. Each table entry lists:

  • Mnemonic: the PTX instruction name as recognized by the lexer
  • Type suffixes: legal type qualifiers (from instruction table builder encoding strings)
  • Operands: operand pattern (d=dest, a/b/c=source, p=predicate, [a]=memory)
  • SM req: minimum SM architecture (from sub_489390 version gates in validators)
  • PTX req: minimum PTX ISA version (from sub_489050 version gates)
  • Description: brief functional description

Type abbreviations in the suffix column: s=signed int, u=unsigned int, f=float, b=bits, pred=predicate. Widths: 8/16/32/64/128. Packed: f16x2, bf16x2.


Integer Arithmetic

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| add | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer addition |
| sub | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer subtraction |
| mul.lo | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Multiply, low half of result |
| mul.hi | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Multiply, high half of result |
| mul.wide | .s16 .s32 .u16 .u32 | d, a, b | all | 1.0 | Widening multiply (16->32 or 32->64) |
| mul24.lo | .s32 .u32 | d, a, b | all | 1.0 | 24-bit multiply, low half (deprecated sm_20+) |
| mul24.hi | .s32 .u32 | d, a, b | all | 1.0 | 24-bit multiply, high half (deprecated sm_20+) |
| mad.lo | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b, c | all | 1.0 | Multiply-add, low half |
| mad.hi | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b, c | all | 1.0 | Multiply-add, high half |
| mad.wide | .s16 .s32 .u16 .u32 | d, a, b, c | all | 1.0 | Widening multiply-add |
| mad24.lo | .s32 .u32 | d, a, b, c | all | 1.0 | 24-bit multiply-add, low (deprecated) |
| mad24.hi | .s32 .u32 | d, a, b, c | all | 1.0 | 24-bit multiply-add, high (deprecated) |
| mad.cc | .s32 .u32 .s64 .u64 | d, a, b, c | all | 1.0 | Multiply-add with carry-out |
| madc.lo | .s32 .u32 .s64 .u64 | d, a, b, c | all | 1.0 | Multiply-add with carry-in, low |
| madc.hi | .s32 .u32 .s64 .u64 | d, a, b, c | all | 1.0 | Multiply-add with carry-in, high |
| mad.fused.hi | .s32 .u32 | d, a, b, c | all | 1.0 | Fused multiply-add, high half |
| madc.fused.hi | .s32 .u32 | d, a, b, c | all | 1.0 | Fused multiply-add with carry, high |
| div | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer division |
| rem | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer remainder |
| abs | .s16 .s32 .s64 | d, a | all | 1.0 | Absolute value |
| neg | .s16 .s32 .s64 | d, a | all | 1.0 | Negate |
| min | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Minimum |
| max | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Maximum |
| popc | .b32 .b64 | d, a | 20+ | 2.0 | Population count (count set bits) |
| clz | .b32 .b64 | d, a | 20+ | 2.0 | Count leading zeros |
| bfind | .s32 .s64 .u32 .u64 | d, a | 20+ | 2.0 | Find most significant set bit |
| brev | .b32 .b64 | d, a | 20+ | 2.0 | Bit reverse |
| bfe | .s32 .s64 .u32 .u64 | d, a, b, c | 20+ | 2.0 | Bit field extract |
| bfi | .b32 .b64 | d, f, a, b, c | 20+ | 2.0 | Bit field insert |
| dp4a | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a, b, c | 61+ | 5.0 | 4-element dot product accumulate |
| dp2a.lo | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a, b, c | 61+ | 5.0 | 2-element dot product accumulate, low |
| dp2a.hi | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a, b, c | 61+ | 5.0 | 2-element dot product accumulate, high |
| sad | .s16 .s32 .u16 .u32 | d, a, b, c | all | 1.0 | Sum of absolute differences |
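The dp4a entry treats each 32-bit operand as four packed 8-bit lanes. A minimal reference model of its arithmetic (plain Python sketch; the function name and flags are hypothetical, the per-lane semantics follow the public PTX description of dp4a):

```python
def dp4a(a, b, c, a_signed=False, b_signed=False):
    """Reference model of dp4a: 4-way byte dot product accumulated into c."""
    def lanes(x, signed):
        vals = [(x >> (8 * i)) & 0xFF for i in range(4)]
        # Reinterpret each byte as signed for the .s32 operand variants.
        return [v - 0x100 if signed and v >= 0x80 else v for v in vals]
    acc = c + sum(x * y for x, y in zip(lanes(a, a_signed), lanes(b, b_signed)))
    return acc & 0xFFFFFFFF  # result wraps to a 32-bit register
```

For example, `dp4a(0x01010101, 0x02020202, 0)` gives 8: four products of 1*2 plus a zero accumulator.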

Floating-Point Arithmetic

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| add | .f32 .f64 | d, a, b | all | 1.0 | FP addition (.rn .rz .rm .rp rounding) |
| sub | .f32 .f64 | d, a, b | all | 1.0 | FP subtraction |
| mul | .f32 .f64 | d, a, b | all | 1.0 | FP multiplication |
| fma | .f32 .f64 | d, a, b, c | 20+ | 2.0 | Fused multiply-add |
| mad | .f32 .f64 | d, a, b, c | all | 1.0 | Multiply-add (non-fused on sm_20+) |
| mad.rnd | .f32.f32 | d, a, b, c | all | 1.0 | Multiply-add with explicit rounding |
| div | .f32 .f64 | d, a, b | all | 1.0 | FP division (.approx .full .rn .rz .rm .rp) |
| div.full | .f32 | d, a, b | all | 1.0 | Full-range division (specialized formatter) |
| div.rnd | .f32.f32 | d, a, b | all | 1.0 | Division with explicit rounding |
| div.rn | .f64.f64 | d, a, b | all | 1.0 | Double-precision division, round-nearest |
| abs | .f32 .f64 | d, a | all | 1.0 | FP absolute value |
| neg | .f32 .f64 | d, a | all | 1.0 | FP negate |
| min | .f32 .f64 | d, a, b | all | 1.0 | FP minimum |
| max | .f32 .f64 | d, a, b | all | 1.0 | FP maximum |
| rcp | .f32 .f64 | d, a | all | 1.0 | Reciprocal (.approx .rn .rz .rm .rp) |
| rcp.approx | .f64.f64 | d, a | all | 1.0 | Approximate double reciprocal |
| rcp.rnd | .f32.f32 | d, a | all | 1.0 | Reciprocal with explicit rounding |
| rcp.rn | .f64.f64 | d, a | all | 1.0 | Double reciprocal, round-nearest |
| sqrt | .f32 .f64 | d, a | all | 1.0 | Square root (.approx .rn .rz .rm .rp) |
| rsqrt | .f32 .f64 | d, a | all | 1.0 | Reciprocal square root (.approx) |
| sin | .f32 | d, a | all | 1.0 | Sine (approximate) |
| cos | .f32 | d, a | all | 1.0 | Cosine (approximate) |
| lg2 | .f32 | d, a | all | 1.0 | Log base 2 (approximate) |
| ex2 | .f32 | d, a | all | 1.0 | Exp base 2 (approximate) |
| tanh | .f32 | d, a | 75+ | 6.5 | Hyperbolic tangent (approximate) |
| testp | .f32 .f64 | p, a | 20+ | 2.0 | Test FP property (.finite .infinite .number .notanumber .normal .subnormal) |
| copysign | .f32 .f64 | d, a, b | 20+ | 2.0 | Copy sign of a onto value of b |
| fma | .f32.f32 | d, a, b, c | 20+ | 2.0 | FP32 fused multiply-add (rounding modes) |
| fma | .f64.f64 | d, a, b, c | 20+ | 2.0 | FP64 fused multiply-add |

Half-Precision and BFloat16 Arithmetic

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| add | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 53+ | 4.2 | Half-precision addition |
| sub | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 53+ | 4.2 | Half-precision subtraction |
| mul | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 53+ | 4.2 | Half-precision multiplication |
| fma | .f16 .f16x2 .bf16 .bf16x2 | d, a, b, c | 53+ | 4.2 | Half-precision fused multiply-add |
| neg | .f16 .f16x2 .bf16 .bf16x2 | d, a | 53+ | 4.2 | Half-precision negate |
| abs | .f16 .f16x2 .bf16 .bf16x2 | d, a | 53+ | 4.2 | Half-precision absolute value |
| min | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Half-precision minimum |
| max | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Half-precision maximum |
| min.ftz.NaN | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Min with NaN propagation |
| max.ftz.NaN | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Max with NaN propagation |
| ex2.approx | .f16 .f16x2 | d, a | 75+ | 6.5 | Half-precision exp2 |
| tanh.approx | .f16 .f16x2 | d, a | 75+ | 6.5 | Half-precision tanh |
| fma.rn.relu | .f16 .bf16 | d, a, b, c | 80+ | 7.0 | Fused multiply-add with ReLU |

Comparison and Selection

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| setp | .s16 .s32 .s64 .u16 .u32 .u64 .f32 .f64 .f16 .bf16 .b16 .b32 .b64 | p[\|q], a, b | all | 1.0 | Set predicate on comparison |
| selp | .s16 .s32 .s64 .u16 .u32 .u64 .f32 .f64 .b16 .b32 .b64 | d, a, b, p | all | 1.0 | Select on predicate |
| slct | .s32 .u32 .f32 .s64 .u64 .f64 | d, a, b, c | all | 1.0 | Select on comparison |
| set | .s32 .u32 .f32 .s64 .u64 .f64 | d, a, b | all | 1.0 | Compare and set |

Comparison operators for setp/set: .eq .ne .lt .le .gt .ge .lo .ls .hi .hs .equ .neu .ltu .leu .gtu .geu .num .nan
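The ordered/unordered split in this operator list matters only for NaN inputs: the plain float comparisons are false on NaN, while their `u`-suffixed forms are true. A sketch of that predicate behavior (hypothetical helper covering the float operators only, not the unsigned-integer `.lo/.ls/.hi/.hs` forms):

```python
import math

def setp_f32(op, a, b):
    """Model of setp.<op>.f32 predicate semantics (float operators only).

    Ordered ops (.eq .ne .lt .le .gt .ge) are false when either input is
    NaN; the unordered forms (.equ .neu .ltu ...) are true instead."""
    unordered = math.isnan(a) or math.isnan(b)
    base = {"eq": a == b, "ne": a != b, "lt": a < b, "le": a <= b,
            "gt": a > b, "ge": a >= b}
    if op in base:
        return (not unordered) and base[op]     # ordered: NaN -> False
    if op.endswith("u") and op[:-1] in base:
        return unordered or base[op[:-1]]       # unordered: NaN -> True
    if op == "num":
        return not unordered
    if op == "nan":
        return unordered
    raise ValueError(f"operator {op!r} not modeled here")
```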

Logic and Shift

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| and | .b16 .b32 .b64 .pred | d, a, b | all | 1.0 | Bitwise AND |
| or | .b16 .b32 .b64 .pred | d, a, b | all | 1.0 | Bitwise OR |
| xor | .b16 .b32 .b64 .pred | d, a, b | all | 1.0 | Bitwise XOR |
| not | .b16 .b32 .b64 .pred | d, a | all | 1.0 | Bitwise NOT |
| cnot | .b16 .b32 .b64 | d, a | all | 1.0 | C-style logical NOT |
| lop3 | .b32 | d, a, b, c, immLut | 50+ | 4.0 | 3-input logic operation (LUT-encoded) |
| shl | .b16 .b32 .b64 | d, a, b | all | 1.0 | Shift left |
| shr | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Shift right (arithmetic for .s, logical for .u) |
| shf.l | .b32 | d, a, b, c | 32+ | 3.2 | Funnel shift left |
| shf.r | .b32 | d, a, b, c | 32+ | 3.2 | Funnel shift right (.clamp .wrap) |
| prmt | .b32 | d, a, b, c | 20+ | 2.0 | Byte permute (.f4e .b4e .rc8 .ecl .ecr .rc16) |
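The lop3 immLut byte is just the truth table of the desired 3-input function, obtained by evaluating it on the canonical constants 0xF0, 0xCC, 0xAA. A small sketch of that derivation and the per-bit reference semantics (helper names are hypothetical):

```python
def lop3_lut(func):
    """Compute the lop3.b32 immLut byte for an arbitrary 3-input boolean op."""
    TA, TB, TC = 0xF0, 0xCC, 0xAA   # canonical truth-table constants for a, b, c
    return func(TA, TB, TC) & 0xFF

def lop3(a, b, c, imm_lut):
    """Bitwise reference semantics: index the LUT with (a_i, b_i, c_i)."""
    d = 0
    for i in range(32):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        d |= ((imm_lut >> idx) & 1) << i
    return d
```

For instance, `lop3_lut(lambda a, b, c: (a & b) | c)` yields 0xEA, the LUT for the fused AND-OR.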

Data Movement

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| mov | .b16 .b32 .b64 .b128 .u16 .u32 .u64 .s16 .s32 .s64 .f16 .f32 .f64 .pred | d, a | all | 1.0 | Move register-to-register |
| shfl | .b32 | d\|p, a, b, c | 30+ | 3.0 | Warp shuffle (.up .down .bfly .idx) |
| shfl.sync | .b32 | d\|p, a, b, c, membermask | 70+ | 6.0 | Warp shuffle with sync |
| vote | .pred | d, {p}, a | 20+ | 2.0 | Warp vote (.all .any .uni .ballot) |
| vote.sync | .pred .b32 | d, {p}, a, membermask | 70+ | 6.0 | Warp vote with sync |
| match | .b32 .b64 | d, a | 70+ | 6.0 | Warp match (.any .all) |
| match.sync | .b32 .b64 | d, a, membermask | 70+ | 6.0 | Warp match with sync |
| redux | .s32 .u32 | d, a | 80+ | 7.0 | Warp reduction (.add .min .max .and .or .xor) |
| redux.sync | .s32 .u32 | d, a, membermask | 80+ | 7.0 | Warp reduction with sync |
| activemask | .b32 | d | 70+ | 6.2 | Get active thread mask |
| elect | .pred | p | 90+ | 8.0 | Elect one leader thread |
| elect.one | -- | d, {p} | 90+ | 8.0 | Elect one thread, return success |
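The ballot form of the vote instructions collects one predicate bit per lane into a 32-bit result. A toy model (hypothetical helper; lanes outside membermask contribute nothing, which matches the sync-variant's documented behavior for non-participating lanes):

```python
def vote_ballot(membermask, preds):
    """Model of vote.sync.ballot.b32: bit i of the result is lane i's
    predicate, considering only lanes named in membermask."""
    d = 0
    for lane in range(32):
        if (membermask >> lane) & 1 and preds[lane]:
            d |= 1 << lane
    return d
```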

Load, Store, and Memory

Global, Local, Shared, Const

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| ld | .b8 .b16 .b32 .b64 .b128 .u8 .u16 .u32 .u64 .s8 .s16 .s32 .s64 .f16 .f32 .f64 | d, [a] | all | 1.0 | Load from memory (.global .shared .local .const .param) |
| ld.nc | .b32 .b64 .b128 .f32 .f64 | d, [a] | 35+ | 3.2 | Non-coherent load (read-only cache) |
| ld.param | .b8 .b16 .b32 .b64 .b128 | d, [a] | all | 1.0 | Load from kernel parameter space |
| st | .b8 .b16 .b32 .b64 .b128 .u8 .u16 .u32 .u64 .s8 .s16 .s32 .s64 .f16 .f32 .f64 | [a], b | all | 1.0 | Store to memory |
| ldu | .b32 .b64 .b128 .f32 .f64 | d, [a] | 20+ | 2.0 | Load via uniform cache (deprecated) |
| prefetch | .L1 .L2 | [a] | 20+ | 2.0 | Prefetch to cache level |
| prefetchu | .L1 | [a] | 20+ | 2.0 | Prefetch uniform |
| isspacep | .global .shared .local .const | p, a | 20+ | 2.0 | Test address space |
| cvta | -- | d, a | 20+ | 2.0 | Convert address space (generic <-> specific) |
| cvta.to | .global .shared .local .const | d, a | 20+ | 2.0 | Convert to specific state space |

Cache qualifiers for ld/st: .ca .cg .cs .cv .lu .wb .wt
Eviction policy (PTX 7.4+): .L2::evict_first .L2::evict_last .L2::evict_normal .L2::cache_hint

Async Copy

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| cp.async | .ca .cg | [dst], [src], size | 80+ | 7.0 | Async copy (4/8/16 bytes, global->shared) |
| cp.async.commit_group | -- | -- | 80+ | 7.0 | Commit outstanding async copies |
| cp.async.wait_group | -- | N | 80+ | 7.0 | Wait for async copy group completion |
| cp.async.wait_all | -- | -- | 80+ | 7.0 | Wait for all async copies |
| cp.async.mbarrier.arrive | -- | [mbar] | 80+ | 7.0 | Arrive at mbarrier on async copy completion |
| cp.async.bulk | -- | [dst], [src], size | 90+ | 8.0 | Bulk async copy (TMA) |
| cp.async.bulk.tensor | -- | [dst], [src], dims... | 90+ | 8.0 | Tensor async copy (TMA, 1-5D tiles) |
| cp.async.bulk.prefetch | -- | [src], size | 90+ | 8.0 | Prefetch via TMA |
| cp.async.bulk.prefetch.tensor | -- | [src], dims... | 90+ | 8.0 | Tensor prefetch via TMA |
| cp.async.bulk.commit_group | -- | -- | 90+ | 8.0 | Commit bulk async group |
| cp.async.bulk.wait_group | -- | N | 90+ | 8.0 | Wait for bulk group completion |
| cp.reduce.async.bulk | .add .min .max .and .or .xor .inc .dec | [dst], [src], size | 90+ | 8.0 | Bulk async copy with reduction |
| cp.reduce.async.bulk.tensor | .add .min .max .and .or .xor .inc .dec | [dst], [src], dims... | 90+ | 8.0 | Tensor async copy with reduction |
| st.async | .b32 .b64 .b128 | [a], b | 90+ | 8.1 | Async store |
| st.bulk | -- | [a], b, size | 90+ | 8.0 | Bulk store |
| red.async | .add .min .max .and .or .xor .inc .dec | [a], b | 90+ | 8.1 | Async reduction |

Multimem

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| multimem.ld_reduce | .f16 .bf16 .f32 .u32 .s32 .u64 | d, [a] | 90+ | 8.1 | Multicast memory load with reduction |
| multimem.st | .f16 .bf16 .f32 .u32 .s32 .u64 | [a], b | 90+ | 8.1 | Multicast memory store |
| multimem.red | .f16 .bf16 .f32 .u32 .s32 .u64 | [a], b | 90+ | 8.1 | Multicast memory reduction |

Cache Control

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| discard | .global .L2 | [a], size | 80+ | 7.4 | Discard data (hint: no writeback) |
| applypriority | .global .L2 | [a], size, prio | 80+ | 7.4 | Set cache eviction priority |
| createpolicy.cvt | .L2 | d, imm | 80+ | 7.4 | Create cache policy from immediate |
| createpolicy.fractional | .L2 | d, fraction | 80+ | 7.4 | Create fractional cache policy |
| createpolicy.range | .L2 | d, lo, hi, hit, miss | 80+ | 7.4 | Create cache policy for address range |

Conversion

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| cvt | all int/float combinations | d, a | all | 1.0 | Type conversion (rounding: .rn .rz .rm .rp .rna) |
| cvt.pack | .b16.b32 .b32.b32 | d, a, b | 80+ | 7.1 | Pack two values into one register |

cvt type combinations (from instruction table encoding strings):

The instruction table builder registers extensive type-pair combinations for cvt. Representative signatures:

| Source | Destination | Notes |
|---|---|---|
| F[16\|32\|64] | F[16\|32\|64] | Float-to-float, rounding modes apply |
| F[16\|32\|64] | I[8\|16\|32\|64] | Float-to-integer, rounding + saturation |
| I[8\|16\|32\|64] | F[16\|32\|64] | Integer-to-float, rounding modes |
| I[8\|16\|32\|64] | I[8\|16\|32\|64] | Integer-to-integer, sign extend / truncate |
| E16 | F[16\|64] / I[8\|16\|32\|64] | bf16 source conversions (sm_80+) |
| F[16\|64] / I[8\|16\|32\|64] | E16 | bf16 destination conversions (sm_80+) |
| H32 (tf32) | various | TensorFloat-32 conversions (sm_80+) |
| Q16 (fp8 e5m2) | F32 / H32 / E32 | FP8 conversions (sm_89+) |
| R8 (fp8 e4m3) | F32 / H32 / E32 | FP8 e4m3 conversions (sm_89+) |
| R16 (fp4) | F32 | FP4 conversions (sm_100+, PTX 8.6) |
| Q32 (fp8 e5m2) | F32 | Extended FP8 conversions (sm_100+) |

The modifiers .ftz (flush-to-zero), .sat (saturation), and .relu (ReLU clamp to 0) are recognized. The .rna rounding mode (round-to-nearest-away, PTX 8.3+) is registered for cvt.tf32.f32.
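The bf16 destination conversions above drop the low 16 bits of an f32 pattern. A sketch of that data path with round-to-nearest-even (illustrative only; this is the textbook bias trick, not ptxas's implementation, and the helper name is hypothetical):

```python
import struct

def f32_to_bf16_rn(x):
    """Round an f32 to a bf16 bit pattern with round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF  # NaN: keep payload, force quiet bit
    # bf16 keeps the top 16 bits; add a bias so ties round to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF
```

`f32_to_bf16_rn(1.0)` gives 0x3F80, the bf16 pattern for 1.0.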

szext -- Sign/Zero Extend

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| szext | .b32 | d, a, pos | 90+ | 8.0 | Sign- or zero-extend at bit position |
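The table gives only the operand shape; one plausible reading of the sign-extend variant is sketched below (assumed semantics, not verified against the binary -- the zero-extend variant would simply mask instead of sign-extending):

```python
def szext_s32(a, pos):
    """Assumed semantics: treat the low `pos` bits of a as a signed field
    and sign-extend the result to 32 bits."""
    assert 1 <= pos <= 32
    v = a & ((1 << pos) - 1)
    if (v >> (pos - 1)) & 1:        # sign bit of the pos-bit field
        v -= 1 << pos
    return v & 0xFFFFFFFF
```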

Texture and Surface

Texture

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| tex | .1d .2d .3d .a1d .a2d .cube .acube | d, [tex, sampler, coord] | all | 1.0 | Texture fetch with sampler |
| tex.base | .1d .2d .3d | d, [tex, coord] | 60+ | 5.0 | Texture fetch base level |
| tex.level | .1d .2d .3d .a1d .a2d .cube .acube | d, [tex, sampler, coord, lod] | all | 1.0 | Texture fetch at explicit LOD |
| tex.grad | .1d .2d .3d .a1d .a2d .cube .acube | d, [tex, sampler, coord, dPdx, dPdy] | all | 1.0 | Texture fetch with explicit gradients |
| tld4 | .2d .a2d | d, [tex, sampler, coord] | 20+ | 2.0 | Texture gather (4 texels) |
| txq | .width .height .depth .channel_data_type .channel_order .normalized_coords .filter_mode .addr_mode_0 .addr_mode_1 .addr_mode_2 .samp_pos .num_mip_levels .num_samples | d, [tex] | 20+ | 2.0 | Texture query |

Return types for texture ops: .v4.s32 .v4.u32 .v4.f32 (4-component return).

Surface

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| suld.b | .1d .2d .3d .a1d .a2d | d, [surf, coord] | 20+ | 2.0 | Surface load (bindless) |
| sust.b | .1d .2d .3d .a1d .a2d | [surf, coord], a | 20+ | 2.0 | Surface store (bindless) |
| sust.p | .1d .2d .3d .a1d .a2d | [surf, coord], a | 20+ | 2.0 | Surface store (packed format) |
| sured.b | .1d .2d .3d | d, [surf, coord], a | 20+ | 2.0 | Surface reduction (.add .min .max .and .or) |
| suq | .width .height .depth .channel_data_type .channel_order .array_size | d, [surf] | 20+ | 2.0 | Surface query |

Surface clamp modes: .trap .clamp .zero

Tensormap

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| tensormap.replace | .tile .im2col | d, [tmap], field, value | 90+ | 8.0 | Replace tensormap field at runtime |
| tensormap.cp_fenceproxy | -- | [tmap] | 90+ | 8.0 | Tensormap copy fence proxy |

Control Flow

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| bra | -- | target | all | 1.0 | Branch (unconditional or predicated) |
| bra.uni | -- | target | all | 1.0 | Uniform branch (all threads take same direction) |
| brx.idx | -- | a, [targets] | 70+ | 6.0 | Indexed branch (jump table) |
| call | -- | (ret), func, (params) | 20+ | 2.0 | Function call (with ABI) |
| ret | -- | -- | 20+ | 2.0 | Return from function |
| exit | -- | -- | all | 1.0 | Exit kernel / terminate thread |
| trap | -- | -- | all | 1.0 | Trigger error |
| brkpt | -- | -- | 11+ | 1.0 | Breakpoint (debugger halt) |
| pmevent | -- | imm | 20+ | 2.0 | Performance monitor event |
| pmevent.mask | -- | imm | 20+ | 3.0 | Performance monitor event with mask |
| nanosleep | -- | t | 70+ | 6.3 | Sleep for t nanoseconds |

Synchronization and Barriers

Legacy Barrier

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| bar.sync | -- | a{, b} | all | 1.0 | Barrier synchronize (CTA-level) |
| bar.arrive | -- | a, b | all | 1.0 | Barrier arrive (non-blocking) |
| bar.red | .and .or .popc | d, a, {b}, p | all | 1.0 | Barrier with reduction |
| bar.warp.sync | -- | membermask | 70+ | 6.0 | Warp-level barrier |

Named Barrier (PTX 6.0+)

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| barrier | -- | a{, b} | 70+ | 6.0 | Named barrier synchronize |
| barrier.arrive | -- | a, b | 70+ | 6.0 | Named barrier arrive |
| barrier.red | .and .or .popc | d, a, {b}, p | 70+ | 6.0 | Named barrier with reduction |

CTA-Cluster Barrier (PTX 7.8+)

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| bar.cta | -- | -- | 90+ | 7.8 | CTA-level barrier sync |
| bar.cta.arrive | -- | -- | 90+ | 7.8 | CTA-level barrier arrive |
| bar.cta.red | .and .or .popc | d, p | 90+ | 7.8 | CTA barrier with reduction |
| barrier.cta | -- | -- | 90+ | 7.8 | CTA named barrier sync |
| barrier.cta.arrive | -- | -- | 90+ | 7.8 | CTA named barrier arrive |
| barrier.cta.red | .and .or .popc | d, p | 90+ | 7.8 | CTA named barrier with reduction |
| barrier.cluster.arrive | -- | -- | 90+ | 7.8 | Cluster-level barrier arrive |
| barrier.cluster.wait | -- | -- | 90+ | 7.8 | Cluster-level barrier wait |

Asynchronous Barriers (mbarrier)

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| mbarrier.init | .shared.b64 | [mbar], count | 80+ | 7.0 | Initialize mbarrier with expected count |
| mbarrier.inval | .shared.b64 | [mbar] | 80+ | 7.0 | Invalidate mbarrier |
| mbarrier.arrive | .shared.b64 | state, [mbar] | 80+ | 7.0 | Arrive at mbarrier |
| mbarrier.arrive_drop | .shared.b64 | state, [mbar] | 80+ | 7.0 | Arrive and drop expected count |
| mbarrier.test_wait | .shared.b64 | p, [mbar], state | 80+ | 7.0 | Test if mbarrier phase complete |
| mbarrier.test_wait.parity | .shared.b64 | p, [mbar], parity | 80+ | 7.1 | Test mbarrier parity |
| mbarrier.try_wait | .shared.b64 | p, [mbar], state | 80+ | 7.8 | Try-wait on mbarrier (with timeout) |
| mbarrier.try_wait.parity | .shared.b64 | p, [mbar], parity | 80+ | 7.8 | Try-wait on mbarrier parity |
| mbarrier.pending_count | .b64 | d, state | 80+ | 7.0 | Get pending arrival count |
| mbarrier.complete_tx | .shared.b64 | [mbar], count | 90+ | 8.0 | Complete transaction at mbarrier |
| mbarrier.expect_tx | .shared.b64 | [mbar], count | 90+ | 8.0 | Set expected transaction count |
| mbarrier.tx | -- | [mbar], count | 90+ | 8.0 | Transaction mbarrier arrive |
| mbarrier.arrive.expect_tx | .shared.b64 | state, [mbar], count | 90+ | 8.0 | Arrive with expected tx count |
| mbarrier.arrive_drop.expect_tx | .shared.b64 | state, [mbar], count | 90+ | 8.0 | Arrive-drop with expected tx count |

Memory Fence

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| membar | .cta .gl .sys | -- | all | 1.0 | Memory barrier (scope) |
| membar.proxy | .alias | -- | 75+ | 6.4 | Proxy memory barrier (alias scope) |
| fence.proxy | .alias .async .async.global .async.shared::cta | -- | 70+ | 6.0 | Fence proxy (alias/async) |
| fence.proxy.tensormap | -- | [addr] | 90+ | 8.0 | Fence tensormap proxy |

Grid Dependency Control

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| griddepcontrol | .launch_dependents .wait | -- | 90+ | 7.8 | Grid dependency control |

Atomic and Reduction

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| atom | .s32 .u32 .u64 .f32 .f64 .b32 .b64 .f16x2 .bf16x2 | d, [a], b | all | 1.1 | Atomic RMW (.add .min .max .and .or .xor .exch .cas .inc .dec) |
| atom.global | (same as atom) | d, [a], b | all | 1.1 | Atomic on global memory |
| atom.shared | (same as atom) | d, [a], b | all | 1.1 | Atomic on shared memory |
| red | .s32 .u32 .u64 .f32 .f64 .b32 .b64 .f16x2 .bf16x2 | [a], b | all | 1.1 | Reduction (no return value) |
| red.global | (same as red) | [a], b | all | 1.1 | Reduction on global memory |

Atom/red scope modifiers (PTX 6.0+): .cta .gpu .sys .cluster

Matrix (MMA / Tensor Core)

WMMA (Warp Matrix Multiply-Accumulate, PTX 6.0+)

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| wmma.load.a | .sync .aligned | d, [ptr], stride | 70+ | 6.0 | Load matrix A fragment |
| wmma.load.b | .sync .aligned | d, [ptr], stride | 70+ | 6.0 | Load matrix B fragment |
| wmma.load.c | .sync .aligned | d, [ptr], stride | 70+ | 6.0 | Load accumulator C fragment |
| wmma.store.d | .sync .aligned | [ptr], d, stride | 70+ | 6.0 | Store result D fragment |
| wmma.mma | .sync .aligned | d, a, b, c | 70+ | 6.0 | Matrix multiply-accumulate |

WMMA shapes (from validator sub_4BFED0): .m16n16k16 .m32n8k16 .m8n32k16 .m16n16k8 etc.
WMMA type combinations (from string table): F16F16F16F16, F32F16F16F32, F32F32 (TF32), I32I8I8I32, I32I4I4I32, I32B1B1I32, F64F64F64F64

MMA (PTX 6.5+)

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| mma | .sync .aligned | d, a, b, c | 75+ | 6.5 | Matrix multiply-accumulate (Turing+) |

MMA type combinations (verified from instruction table builder strings):

| D type | A type | B type | C type | SM | Notes |
|---|---|---|---|---|---|
| F16 | F16 | F16 | F16 | 75+ | Native FP16 |
| F32 | F16 | F16 | F32 | 75+ | Mixed-precision |
| F32 | F32 | F32 | -- | 80+ | TF32 Tensor Core |
| F32 | E16 | E16 | F32 | 80+ | BFloat16 |
| F32 | T32 | T32 | F32 | 80+ | TF32 path (string: F32T32T32F32) |
| I32 | I8 | I8 | I32 | 75+ | INT8 |
| I32 | I4 | I4 | I32 | 75+ | INT4 |
| I32 | B1 | B1 | I32 | 75+ | Binary (1-bit) |
| F64 | F64 | F64 | F64 | 80+ | Double-precision |
| F16 | Q8 | Q8 | F16 | 89+ | FP8 (e5m2) |
| F32 | Q8 | Q8 | F32 | 89+ | FP8 mixed |
| F16 | R4 | Q8 | F16 | 100+ | FP4 x FP8 |
| F32 | R4 | Q8 | F32 | 100+ | FP4 x FP8 mixed |
| F32 | R4 | R4 | F32 | 100+ | FP4 x FP4 |
| F32 | R4 | R4 | F32.Q8 | 100+ | FP4 with scale (string: F32R4R4F32Q8) |
| F32 | Q8 | Q8 | F32.Q8 | 100+ | FP8 with block scale |

MMA shapes: .m16n8k16 .m16n8k32 .m16n8k64 .m16n8k128 .m16n8k256
Sparse MMA modifiers: .sp with metadata selector and sparsity pattern

WGMMA (Warp-Group MMA, PTX 7.8+)

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| wgmma.mma_async | .aligned | d, a_desc, b_desc | 90+ | 7.8 | Warp-group async matrix multiply |
| wgmma.fence | .aligned | -- | 90+ | 7.8 | WGMMA fence (ordering) |
| wgmma.commit_group | .aligned | -- | 90+ | 7.8 | Commit WGMMA group |
| wgmma.wait_group | .aligned | N | 90+ | 7.8 | Wait for WGMMA group completion |

WGMMA operand encoding strings (from instruction table, selection): hUUhP, fUUfP, hUhhP, fUhfP, hhUhP, fhUfP (h=half dest, f=float dest, U=desc operand, P=pred)
With accumulator: hUUhdC, fUUfdC, hUhhdC, fUhfdC
With scale: hUUhdCP, fUUfdCP (P=pred control for scale)

TCGen05 (5th Generation Tensor Core, sm_100+)

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| tcgen05.mma | -- | d, a_desc, b_desc | 100+ | 8.6 | 5th-gen tensor core MMA |
| tcgen05.mma.ws | -- | d, a_desc, b_desc | 100+ | 8.6 | 5th-gen MMA with warpgroup scale |
| tcgen05.ld | -- | d, [desc] | 100+ | 8.6 | TC load from descriptor |
| tcgen05.ld.red | -- | d, [desc], src | 100+ | 8.6 | TC load with reduction |
| tcgen05.st | -- | [desc], src | 100+ | 8.6 | TC store to descriptor |
| tcgen05.cp | -- | [desc], [src] | 100+ | 8.6 | TC copy |
| tcgen05.commit | -- | [mbar] | 100+ | 8.6 | TC commit |
| tcgen05.shift | -- | [desc] | 100+ | 8.6 | TC shift accumulator |
| tcgen05.alloc | -- | d, nCols | 100+ | 8.6 | Allocate TC columns |
| tcgen05.dealloc | -- | nCols | 100+ | 8.6 | Deallocate TC columns |
| tcgen05.relinquish_alloc_permit | -- | -- | 100+ | 8.6 | Relinquish TC allocation permit |
| tcgen05.fence | -- | -- | 100+ | 8.6 | TC fence |
| tcgen05.wait | -- | -- | 100+ | 8.6 | TC wait |

TCGen05 MMA operand encodings (from instruction table): MUUuP, MUUMuP, MMUuP, MMUMuP (M=matrix, U=desc, u=uniform, P=pred)
With accumulator descriptors: MUUudP, MUUuPC, MMUudP, MMUuPC, MUUMudP, MMUMudPC
With metadata: MUUuMMP, MUUMuMMP, MMUuMMP, MMUMuMMP (sparse)

TCGen05 Guardrails (internal)

These are internal debug/verification instructions, not user-facing PTX:

| Mnemonic | SM | Description |
|---|---|---|
| _tcgen05.guardrails.is_phase_valid | 100+ | Validate TC phase |
| _tcgen05.guardrails.are_columns_allocated | 100+ | Check column allocation |
| _tcgen05.guardrails.is_current_warp_valid_owner | 100+ | Check warp ownership |
| _tcgen05.guardrails.in_physical_bounds | 100+ | Check physical bounds |
| _tcgen05.guardrails.allocation_granularity | 100+ | Validate allocation granularity |
| _tcgen05.guardrails.datapath_alignment | 100+ | Validate datapath alignment |
| _tcgen05.guardrails.sp_consistency_across_idesc_mod | 100+ | Sparse consistency check |
| _tcgen05.guardrails.check_sparse_usage | 100+ | Validate sparse usage |

Matrix Utility Instructions

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| ldmatrix | .sync .aligned .m8n8 .num | d, [ptr] | 75+ | 6.5 | Load matrix from shared memory |
| stmatrix | .sync .aligned .m8n8 .num | [ptr], a | 90+ | 7.8 | Store matrix to shared memory |
| movmatrix | .aligned | d, a | 80+ | 7.1 | Move/transform matrix fragment |

SIMD Video Instructions

These 8/16-bit SIMD instructions operate on packed sub-word elements within 32-bit registers.

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| vadd | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD add with secondary op |
| vsub | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD subtract |
| vmad | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD multiply-add |
| vmin | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD minimum |
| vmax | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD maximum |
| vabsdiff | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD absolute difference |
| vset | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD set on compare |
| vshl | .u32.u32 | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD shift left |
| vshr | .s32.u32 .u32.u32 | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD shift right |
| vadd2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD add |
| vsub2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD subtract |
| vmin2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD minimum |
| vmax2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD maximum |
| vabsdiff2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit absolute difference |
| vset2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit set on compare |
| vavrg2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit average |
| vadd4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD add |
| vsub4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD subtract |
| vmin4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD minimum |
| vmax4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD maximum |
| vabsdiff4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit absolute difference |
| vset4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit set on compare |
| vavrg4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit average |

Element selectors: .b0 .b1 .b2 .b3 (byte), .h0 .h1 (half-word)
Secondary ops: .add .min .max
Saturation: .sat
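These quad-byte ops reduce to simple per-lane arithmetic plus the secondary op. A sketch of vabsdiff4 with the .add secondary (hypothetical helper; unsigned lanes by default, signed reinterpretation behind a flag):

```python
def vabsdiff4(a, b, c, signed=False):
    """Model of vabsdiff4 with the .add secondary op: sum of per-byte
    absolute differences, accumulated into c (a SAD primitive)."""
    def lanes(x):
        vals = [(x >> (8 * i)) & 0xFF for i in range(4)]
        return [v - 0x100 if signed and v >= 0x80 else v for v in vals]
    diffs = [abs(x - y) for x, y in zip(lanes(a), lanes(b))]
    return (c + sum(diffs)) & 0xFFFFFFFF
```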

Miscellaneous

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| getctarank | .shared .global | d, a | 90+ | 7.8 | Get CTA rank in cluster from address |
| istypep | .texref .samplerref .surfref | p, a | all | 1.0 | Test if variable is a type |
| preexit | -- | -- | all | 1.0 | Pre-exit notification |
| stacksave | -- | d | 20+ | 2.0 | Save stack pointer |
| stackrestore | -- | a | 20+ | 2.0 | Restore stack pointer |
| alloca | -- | d, size | 20+ | 2.0 | Dynamic stack allocation |
| clusterlaunchcontrol.try_cancel.async | -- | [mbar], d | 100+ | 8.7 | Cluster launch cancel (async) |
| clusterlaunchcontrol.query_cancel | -- | d, [mbar] | 100+ | 8.7 | Query cluster launch cancel status |

Register and Shared Memory Control

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| setmaxnreg.inc | -- | N | 90+ | 7.8 | Increase max register count |
| setmaxnreg.dec | -- | N | 90+ | 7.8 | Decrease max register count |
| setmaxreg.alloc | -- | N | 100+ | 8.6 | Allocate registers to max |
| setmaxreg.dealloc | -- | N | 100+ | 8.6 | Deallocate from max registers |
| setmaxreg.try_alloc | -- | d, N | 100+ | 8.6 | Try-allocate registers |
| setsmemsize | -- | N | 90+ | 7.8 | Set dynamic shared memory size |
| setsmemsize.flush | -- | N | 100+ | 8.6 | Set shared memory size with flush |
| getnextworkid | -- | d | 90+ | 8.0 | Get next dynamic work unit ID |

Warp-Group Management

| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
| _warpgroup.arrive | -- | -- | 90+ | 7.8 | Warpgroup arrive (internal) |
| _warpgroup.wait | -- | N | 90+ | 7.8 | Warpgroup wait |
| _warpgroup.commit_batch | -- | -- | 90+ | 7.8 | Warpgroup commit batch |

Internal Instructions

These underscore-prefixed instructions are not part of the public PTX ISA. They are generated internally by ptxas during lowering, stub synthesis, or as pre-codegen IR representations. All are registered in the instruction table builder sub_46E000 and appear in --dumpir output, but users never write them directly.

Internal Memory

| Mnemonic | Type suffixes | Operands | String addr | Handler / Formatter | Description |
|---|---|---|---|---|---|
| _ldldu | (varies) | d, [a] | 0x1d080ee | formatter sub_4DD860 | Unified load-uniform; combines ld+ldu semantics for uniform-cache-path loads |
| _ldsm | .b8 .b16 .s8.s4 .u8.u4 .s4.s2 .u4.u2 | d, [M] | 0x1d076c2 | handlers sub_46B0C0--sub_46B160, validator sub_4AEB60 | Load shared matrix; loads matrix tiles from shared memory into registers for MMA. Opcode ID 28 |
| _movm | .b16 .s8.s4 .u8.u4 .s4.s2 .u4.u2 | d, a | 0x1d076da | handlers sub_46B1B0--sub_46B260 | Move matrix; register-to-register matrix data movement with optional format conversion. Opcode ID 29 |

Internal Cache Control

| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
| _createpolicy.fractional | .L2 | d, fraction | 0x1d0813a | 0x47752f | Internal form of createpolicy.fractional; creates fractional L2 cache eviction policy |
| _createpolicy.range | .L2 | d, lo, hi, hit, miss | 0x1d08158 | 0x477579 | Internal form of createpolicy.range; creates L2 policy for address range |

Internal Surface

| Mnemonic | Type suffixes | Operands | String addr | Table builder xrefs | Description |
|---|---|---|---|---|---|
| _sulea.b | (varies) | d, [surf, coord] | 0x1d088bc | 0x4815cd, 0x48166b | Surface load effective address, bindless; computes address for suld.b without performing the load |
| _sulea.p | (varies) | d, [surf, coord] | 0x1d088c5 | 0x48161c, 0x4816ba | Surface load effective address, packed; computes address for sust.p-mode surface access |

Internal FP / Guard

| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
| _checkfp.divide | (varies) | d, a, b | 0x1d088d2 | 0x481709 | FP division guard; inserted during lowering to validate the divisor (handles division-by-zero, denormals) before SASS div emission |

Internal Control Flow / ABI

| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
| _gen_proto | -- | (opaque) | 0x1d08903 | 0x48189a | Generate function prototype; synthesizes call prototypes for indirect / device-runtime calls during ABI resolution |
| _jcall | -- | target | 0x1d0890e | 0x4818df | Internal jump-call; used inside auto-generated unified-function-stub (UFT) wrappers synthesized by sub_451680 (`.func .attribute(.unified_func_stub) __cuda_uf_stub_%s() { _jcall %s; }`) |

Internal Warp

| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
| _match | (varies) | d, a | 0x1d08a24 | 0x483404 | Internal match; pre-sync lowered form of the warp match instruction, distinct from the public match.sync |

Internal MMA / Tensor Core

| Mnemonic | Type suffixes | Operands | String addr | Handlers | Description |
|---|---|---|---|---|---|
| _mma.warpgroup | 135 type combos (F16, BF16, TF32, FP8, INT8) | d, a, b, c | 0x1d072e3 | 135 handlers sub_4668A0--sub_469FD0 | Warp-group MMA; pre-codegen form of WGMMA. Each handler registers one (src, dst, acc) type triple via sub_465030. Lowers to MERCURY_warpgroup_mma_* SASS opcodes (sm_90+) |
| _zzn.z8a8x4 | (sub-byte int) | d, a, b, c | 0x1cfdc03 | data table at 0x1cfe678 | ROT13-obfuscated _mma.m8n8k4; sub-byte integer MMA with tile shape m8n8k4 for INT4/INT2 and bit-level XOR MMA (sm_75+) |

Handler address summary for internal instructions:

| Range | Contents |
|---|---|
| sub_46B0C0--sub_46B260 | _ldsm (3) + _movm (3) type-variant handlers |
| sub_4668A0--sub_469FD0 | _mma.warpgroup 135 type-variant handlers |
| sub_4AEB60 | _ldsm validator (3.7 KB) -- handles .s8.s4/.u8.u4 format rules |
| sub_451680 | _jcall UFT stub generator |
| sub_4DD860 | _ldldu PTX text formatter |

Instruction Table Builder Internals

Registration Mechanism

The 93 KB function sub_46E000 runs once during parser initialization (sub_451730). Each of its 1,141 calls to sub_46BED0 has the form:

sub_46BED0(table, operand_encoding, opcode_id, opcode_name, 
           type_flags, sm_requirement, xmm_data, extra);

Where:

  • operand_encoding is a compact string like "F32F32", "I32I8I8I32", "MUUuP"
  • opcode_id is the internal opcode integer (mapped 1:1 to Ori IR opcodes from ctor_003)
  • type_flags is a bitfield encoding which .sNN/.uNN/.fNN/.bNN qualifiers are legal
  • sm_requirement gates the instruction to architectures >= this SM version

The operand encoding characters are:

| Char | ID | Meaning | Type bits registered |
|---|---|---|---|
| F | 1 | Float operand | .f16 .f32 .f64 |
| H | 2 | Half-precision | .f16 .f16x2 |
| N | 3 | Immediate / numeric | (no type suffix) |
| I | 4 | Integer operand | .s8--.s64 .u8--.u64 |
| B | 5 | Bitwise operand | .b8--.b128 |
| P | 6 | Predicate | .pred |
| O | 7 | Optional operand | (no type suffix) |
| E | 8 | Extended type | .bf16 .e4m3 .e5m2 |
| Q | 10 | FP8 type | .e5m2 .e4m3 (fp8) |
| R | 11 | FP4/narrow type | .e2m1 (fp4) |
| M | -- | Matrix descriptor | Tensor core descriptor |
| U | -- | Uniform descriptor | TMA/TC uniform register |
| C | -- | Carry/accumulator | MMA accumulator control |

When a letter is followed by a digit, that digit constrains the bit-width: F32 means only .f32, I16 means only .s16/.u16. The function sub_1CB0850 registers each valid width into a bitset.
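The class-letter-plus-optional-width scheme is easy to model. A hypothetical re-implementation of just the string-splitting step (the real builder additionally maps each slot to a type bitset via sub_1CB0850):

```python
def parse_operand_encoding(enc):
    """Split an encoding string like 'I32I8I8I32' into (class, width) slots.

    A width of None means every legal width for that operand class.
    Lowercase letters (u, d, h, f in the WGMMA/TCGen05 strings) are
    accepted as distinct classes."""
    slots, i = [], 0
    while i < len(enc):
        ch = enc[i]
        if not ch.isalpha():
            raise ValueError(f"unexpected character {ch!r} in {enc!r}")
        i += 1
        digits = ""
        while i < len(enc) and enc[i].isdigit():
            digits += enc[i]
            i += 1
        slots.append((ch, int(digits) if digits else None))
    return slots
```

For example, `parse_operand_encoding("I32I8I8I32")` yields the four integer slots of the dp4a signature, with widths 32, 8, 8, 32.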

Descriptor Layout

Each registered instruction creates a 368-byte descriptor node via sub_424070. Key fields:

| Offset | Field | Description |
|---|---|---|
| +0 | opcode_id | Internal opcode identifier |
| +8 | type_flags | Bitfield of legal type qualifiers |
| +12 | xmm_data | 128-bit SIMD data (rounding, modifiers) |
| +28 | extra_flags | Architecture / behavior flags |
| +36 | operand_count | Number of operands |
| +40+ | operand_slots | Per-operand type bitsets (4 slots max, each 8 bytes) |
| +232 | name_length | Length of the opcode name string |

The two hash tables at offsets 2472 and 2480 in the instruction table provide dual-path lookup -- the first table is the primary lookup; if the opcode is not found, the second table is checked. This two-table scheme separates core instructions from extended/variant instructions.
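A toy model of the registration-plus-dual-lookup scheme (all names are hypothetical, and the real descriptors are 368-byte C structs rather than tuples; this only illustrates the two-table fallback):

```python
class InstructionTable:
    """Toy model of the two-hash-table instruction registry described above."""
    def __init__(self):
        self.primary = {}    # core instructions
        self.secondary = {}  # extended / variant instructions

    def register(self, name, encoding, opcode_id, sm_req, extended=False):
        table = self.secondary if extended else self.primary
        table.setdefault(name, []).append((encoding, opcode_id, sm_req))

    def lookup(self, name):
        # Primary table first; fall back to the secondary table.
        return self.primary.get(name) or self.secondary.get(name)
```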

Instruction Count by Category

Based on the 1,141 registration calls and the 431 decoded Ori IR opcodes (from ctor_003):

| Category | Approximate descriptor count |
|---|---|
| Integer arithmetic (add/sub/mul/mad/div/rem/...) | ~120 |
| Float arithmetic (fadd/fmul/fma/div/rcp/sqrt/...) | ~100 |
| Half/BF16 arithmetic | ~60 |
| Comparison and selection (setp/selp/set/slct) | ~50 |
| Logic and shift (and/or/xor/shl/shr/lop3/shf) | ~40 |
| Conversion (cvt + 100+ type combinations) | ~180 |
| Load/store/atomic (ld/st/atom/red + variants) | ~150 |
| Texture/surface (tex/suld/sust/sured/txq) | ~80 |
| Control flow (bra/call/ret/exit/bar/barrier) | ~50 |
| MMA/WMMA/WGMMA/TCGen05 | ~160 |
| SIMD video (vadd/vsub/vmin/vmax/...) | ~60 |
| Miscellaneous (prmt/bfe/bfi/shfl/vote/...) | ~91 |
| Total | 1,141 |

PTX 8.x New Instructions

Instructions introduced in PTX ISA 8.0--8.7 (CUDA 12.0--12.8), verified from SM architecture gates and PTX version checks in the ptxas v13.0.88 binary:

| PTX | Instruction | SM | Description |
|---|---|---|---|
| 8.0 | cp.async.bulk | 90+ | TMA bulk async copy |
| 8.0 | cp.async.bulk.tensor | 90+ | TMA tensor async copy |
| 8.0 | cp.reduce.async.bulk | 90+ | Bulk async copy with reduction |
| 8.0 | cp.reduce.async.bulk.tensor | 90+ | Tensor async copy with reduction |
| 8.0 | cp.async.bulk.prefetch | 90+ | TMA prefetch |
| 8.0 | cp.async.bulk.prefetch.tensor | 90+ | TMA tensor prefetch |
| 8.0 | cp.async.bulk.commit_group | 90+ | Commit TMA group |
| 8.0 | cp.async.bulk.wait_group | 90+ | Wait for TMA group |
| 8.0 | tensormap.replace | 90+ | Runtime tensormap modification |
| 8.0 | elect / elect.one | 90+ | Elect leader thread |
| 8.0 | mbarrier.complete_tx | 90+ | Transaction mbarrier completion |
| 8.0 | mbarrier.expect_tx | 90+ | Set expected transaction count |
| 8.0 | st.bulk | 90+ | Bulk store |
| 8.0 | szext | 90+ | Sign/zero extend at bit position |
| 8.0 | getnextworkid | 90+ | Dynamic work unit |
| 8.1 | st.async | 90+ | Asynchronous store |
| 8.1 | red.async | 90+ | Asynchronous reduction |
| 8.1 | multimem.ld_reduce | 90+ | Multicast load-reduce |
| 8.1 | multimem.st | 90+ | Multicast store |
| 8.1 | multimem.red | 90+ | Multicast reduction |
| 8.6 | tcgen05.mma | 100+ | 5th-gen tensor core MMA |
| 8.6 | tcgen05.mma.ws | 100+ | 5th-gen MMA with warp scale |
| 8.6 | tcgen05.ld / tcgen05.st | 100+ | TC load/store |
| 8.6 | tcgen05.ld.red | 100+ | TC load with reduction |
| 8.6 | tcgen05.cp | 100+ | TC copy |
| 8.6 | tcgen05.commit | 100+ | TC commit |
| 8.6 | tcgen05.shift | 100+ | TC shift accumulator |
| 8.6 | tcgen05.alloc / tcgen05.dealloc | 100+ | TC column allocation |
| 8.6 | tcgen05.relinquish_alloc_permit | 100+ | Relinquish TC permit |
| 8.6 | tcgen05.fence / tcgen05.wait | 100+ | TC fence/wait |
| 8.6 | setmaxreg.alloc / .dealloc / .try_alloc | 100+ | Register management |
| 8.6 | setsmemsize.flush | 100+ | Shared memory with flush |
| 8.7 | clusterlaunchcontrol.try_cancel.async | 100+ | Cluster launch cancel |
| 8.7 | clusterlaunchcontrol.query_cancel | 100+ | Query cancel status |

Cross-References

Key Functions

| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_46E000 | 93 KB | Instruction table builder; runs once during parser init, makes 1,141 calls to sub_46BED0 to register all PTX instruction descriptors | 0.95 |
| sub_46BED0 | -- | Instruction descriptor registrar; called 1,141 times with operand encoding, opcode ID, type flags, SM requirement | 0.95 |
| sub_46C690 | -- | Instruction lookup entry point; dispatches to sub_46C6E0 for name-to-descriptor matching | 0.90 |
| sub_46C6E0 | 6.4 KB | Instruction name matcher; resolves PTX mnemonic strings to instruction table descriptors | 0.90 |
| sub_5D4190 | 12.9 KB | PTX text formatter dispatch; 81-string + 473-entry hash table routing to per-instruction formatters | 0.90 |
| sub_489390 | -- | SM version gate; validates minimum SM architecture for each instruction | 0.85 |
| sub_489050 | -- | PTX ISA version gate; validates minimum PTX version for each instruction | 0.85 |
| sub_4BFED0 | -- | WMMA shape validator; checks legal WMMA shape combinations (.m16n16k16, .m32n8k16, etc.) | 0.85 |
| sub_451680 | -- | UFT stub generator; synthesizes _jcall wrappers for unified-function-stub entries | 0.90 |
| sub_451730 | -- | Parser initialization; calls sub_46E000 to build the instruction table | 0.85 |
| sub_4DD860 | -- | _ldldu PTX text formatter; handles internal unified load-uniform instruction | 0.85 |
| sub_4AEB60 | 3.7 KB | _ldsm validator; validates .s8.s4/.u8.u4 format rules for shared matrix loads | 0.85 |
| sub_465030 | -- | MMA type triple registrar; called by 135 _mma.warpgroup handlers to register (src, dst, acc) type combinations | 0.85 |
| sub_424070 | -- | Instruction descriptor allocator; creates 368-byte descriptor nodes for each registered instruction | 0.85 |
| sub_1CB0850 | -- | Width bitset registrar; registers valid bit-widths (e.g., F32 -> .f32 only) into per-operand bitsets | 0.80 |