Intrinsic Table Architecture (607 Registered Entries)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas maintains two separate intrinsic subsystems that together cover every CUDA runtime helper function, every PTX opcode requiring inline code generation, and every Blackwell+ OCG builtin operation. The first subsystem (sub_5D1660 + sub_5D4190 + sub_5D7430 + sub_5FF700) handles 607 classical CUDA intrinsics and PTX opcode dispatch through a name-to-ID hash map, a body template name table, and a giant prototype generator. The second subsystem (sub_6C9EB0 and its handler cluster at 0x6C0000--0x6CC000) handles OCG (Optimized Code Generation) builtins for SM100+ targets. Both subsystems use the same hash map infrastructure (sub_425CA0 / sub_426150 / sub_426D60) documented in Hash Tables & Bitvectors.

| Component | Function | Notes |
|---|---|---|
| Master registration | sub_5D1660 (46KB) | 607 CUDA intrinsics; name-to-integer-ID hash map (608 table slots, ID 0 = null) |
| Opcode dispatch | sub_5D4190 (41KB) | ~120 PTX opcodes to codegen handlers + ~400 MMA hash entries |
| Body template names | sub_5D7430 (161KB) | 1,079 intrinsic names constructed from .rodata prefixes + type suffixes, stored in hash map at +824 |
| Prototype generator | sub_5FF700 (354KB) | switch generating .weak .func PTX declarations |
| OCG intrinsic table | sub_6C9EB0 (13KB) | __nv_ptx_builtin_ocg_* dispatch for SM100+ |
| OCG router | sub_6CC690 (22KB) | routes OCG calls to type-specific handlers |
| OCG name resolver | sub_6C9BC0 | resolves operation names to internal enums |
| Hash map create | sub_425CA0 | initial capacity 0x80 |
| Hash map insert | sub_426150 | sub_426150(map, name, value) |
| Hash map lookup | sub_426D60 | -- |

Per-Family Deep Dives:

System Overview

sub_451730 (intrinsic lowering context constructor)
  │
  ├── sub_5D4190(ctx)  ── register PTX opcode & MMA handlers ──────────────┐
  │     │ (1) Calls sub_5D1660 to populate intrinsic ID table (607 entries) │
  │     │ (2) Registers ~120 PTX opcode -> codegen handler mappings         │
  │     │ (3) Registers ~400 MMA hash -> codegen handler mappings           │
  │     │                                                                   │
  │     ├─ Hash map at +808  ── PTX opcode name -> codegen function ptr     │
  │     │    "div"     -> sub_5B76D0  (64KB)                                │
  │     │    "sqrt"    -> sub_5B4040  (49KB)                                │
  │     │    "wmma.mma"-> sub_5C7A50  (173KB)                               │
  │     │    "mma"     -> sub_5C10A0  (120KB)                               │
  │     │    ... ~116 more                                                  │
  │     │                                                                   │
  │     ├─ Hash map at +816  ── numeric MMA hash -> codegen handler ptr     │
  │     │    "2644314910" -> sub_4DDB80                                     │
  │     │    ... ~399 more (shape/type/layout combinations)                 │
  │     │                                                                   │
  │     └─ ID table at +1056 ── 9728-byte array (memcpy from unk_1D4D940)  │
  │        Hash map at +1064 ── name -> integer ID (sub_5D1660, 607)        │
  │        Count at +1072 = 608 (includes null ID 0 slot)                   │
  │                                                                         │
  ├── sub_4CE230(ctx)  ── register modifier keywords (GUARD, PRED, ...)     │
  │                                                                         │
  ├── sub_5D7430(ctx, sregs)  ── body template name table (161KB) ──────────┤
  │     │ 1,079 entries, each constructed from:                             │
  │     │   16-byte .rodata prefix (e.g. "__cuda_sm20_div_")               │
  │     │ + 4-byte type suffix (e.g. "s16\0", "u64\0", "rn_f")            │
  │     │ → registered into hash map at +824 with sequential integer IDs    │
  │     │                                                                   │
  │     └─ Hash map at +824  ── intrinsic name -> body template ID          │
  │          "__cuda_sm20_div_s16" -> 0                                     │
  │          "__cuda_sm20_div_u16" -> 1                                     │
  │          ... 1,079 total entries                                        │
  │                                                                         │
  └── sub_451330("<fermi macros>", ...)  ── load Fermi macro library        │
                                                                            │
sub_5FF700 (354KB) ─────────────────────────────────────────────────────────┘
  │ switch(body_template_id) with hundreds of cases
  │ Each case: allocate buffer via sub_4DA340, strcpy() PTX prototype
  │
  │ case 0:  ".weak .func (.reg .s32 %d) __cuda_sm20_div_s16
  │           (.reg .s32 %a0, .reg .s32 %a1)"
  │ case 4:  ".weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64
  │           (.reg .u64 %rda1, .reg .u64 %rda2)"
  │ case 9:  ".weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32
  │           (.reg .f32 %fa1, .reg .f32 %fa2)"
  │ case 25: ".weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full
  │           (...)"
  │ ... hundreds more for rcp, sqrt, dsqrt, barrier, wmma, mma, etc.
  v
Emitted into PTX output as .weak .func declarations
(linker resolves calls to runtime helper functions)

Master Registration -- sub_5D1660

This 46KB function is the master catalog. It allocates a 9728-byte table (memcpy from unk_1D4D940, 0x2600 bytes = 608 x 16B slots), creates a hash map with initial capacity 0x80 via sub_425CA0, then calls sub_426150(hashmap, "name", (char*)ID) exactly 607 times to register every CUDA runtime helper function with an integer ID (IDs 1--607, contiguous). The hash map is stored at a1+1064, the table at a1+1056, and the count 608 at a1+1072 (includes the unused null ID 0 slot).
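The registration pattern can be sketched as follows. This is an illustrative model, not recovered code: the helper names (`register`) and the exact ID placements within the documented ranges are assumptions; only the table geometry (608 x 16B slots) and the name-to-ID mapping shape come from the analysis above.

```python
# Minimal sketch of the sub_5D1660 registration pattern: a pre-built
# descriptor table plus a name -> integer-ID hash map.
# The example IDs below are illustrative placements within the documented
# ID ranges, not values recovered from the binary.

SLOT_SIZE = 16            # each descriptor slot is 16 bytes
NUM_SLOTS = 608           # 607 intrinsics + null ID 0 sentinel

descriptor_table = bytearray(NUM_SLOTS * SLOT_SIZE)   # memcpy'd template, 0x2600 bytes
name_to_id = {}                                       # models the hash map at ctx+1064

def register(name, intrinsic_id):
    # mirrors sub_426150(hashmap, name, (char*)ID); called 607 times
    name_to_id[name] = intrinsic_id

register("__cuda_reduxsync_b32_and", 0x001)   # hypothetical first redux-range entry
register("__cuda_sm20_div_s16", 0x03D)        # hypothetical entry in the SM20 math range

assert len(descriptor_table) == 0x2600        # 9,728 bytes, as documented
```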

Complete ID Allocation

607 intrinsics are registered with contiguous IDs from 0x01 through 0x25F. The binary stores count=608 at a1+1072 because the pre-built 9,728-byte table (608 x 16B slots) includes a null ID 0 sentinel. The ID ranges partition cleanly by SM generation and functional category.

| ID Range | Count | Prefix | Category | SM Floor |
|---|---|---|---|---|
| 0x001--0x011 | 17 | __cuda_reduxsync_* | Redux sync (b32 and/or/xor, f32 max/min/abs/NaN, s32/u32 add/max/min) | sm_70 |
| 0x012--0x018 | 7 | __cuda_sanitizer_memcheck_* | Compute-sanitizer hooks (free, generic, global, local, malloc, readmetadata, shared) | -- |
| 0x019--0x01F | 7 | __cuda_scalar_video_emulation_* | Video instruction emulation helpers | sm_20 |
| 0x020--0x02A | 11 | __cuda_sm10x_* | Blackwell tcgen05 guardrail traps + create_mask helper | sm_100 |
| 0x02B--0x03C | 18 | __cuda_sm1xx_* | Bulk copy + cp.async.bulk.tensor 1D--5D tile/im2col uni/multicast | sm_100+ |
| 0x03D--0x082 | 70 | __cuda_sm20_* | IEEE math: bfe, bfi, div, rcp, sqrt, dsqrt, drsqrt, rem (all rounding modes + slowpaths) | sm_20 |
| 0x083--0x086 | 4 | __cuda_sm3x_div_* | Optimized division variants (rn_ftz_f32, rn_noftz_f32 + slowpaths) | sm_30 |
| 0x087--0x088 | 2 | __cuda_sm62_dp2a/dp4a | Integer dot product emulation | sm_62 |
| 0x089--0x211 | 393 | __cuda_sm70_* | Volta+ intrinsics (barriers, shuffle, vote, match, WMMA -- all shapes, layouts, address spaces) | sm_70 |
| 0x212--0x214 | 3 | __cuda_sm80_* | Ampere: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode | sm_80 |
| 0x215--0x21E | 10 | __cuda_sm_10x_* | Blackwell hmma/imma mdata + bit MMA (and/xor m8n8k128/m16n8k128/m16n8k256) | sm_100 |
| 0x21F--0x22C | 14 | __cuda_sm_8x_* | Direct MMA operations (f16/f32 accum, 4 layout combos) + mma_shfl_f16/f32 | sm_80+ |
| 0x22D--0x25F | 51 | __cuda_sm_9x_* | Hopper sub-byte + bit MMA: s4/u4 dense m16n8k32/k64 + sparse m16n8k64/k128, bit xor (m8n8k128/m16n8k128/m16n8k256) | sm_90 |

Total: 607 registered intrinsics across 13 prefix groups. Table has 608 slots (ID 0 unused).

sm_70 Intrinsic Breakdown (IDs 0x89--0x211)

The sm_70 block is by far the largest at 393 entries. It covers every Volta-era warp-synchronous intrinsic plus the complete WMMA API. The explosion in count comes from the combinatorial product of shapes, layouts, data types, address spaces, and predicate/satfinite variants.

| Sub-Category | Examples | Combinatorial Source |
|---|---|---|
| barrier_arrive | 0--15, with/without count | 16 barrier IDs x 2 count variants |
| barrier_red_and/or/popc | 0--15, with/without count | 3 reduction ops x 16 IDs x 2 count |
| barrier_sync | 0--15, with/without count | 16 IDs x 2 count variants |
| matchsync_all/any_b32/b64 | with predicate variants | 2 match modes x 2 types x pred |
| shflsync_bfly/down/idx/up | with predicate variants | 4 shuffle modes x pred |
| votesync_all/any/ballot/uni | -- | 4 vote modes |
| warpsync | -- | 1 entry |
| wmma_* | m16n16k16, m32n8k16, m8n32k16 | 3 shapes x {load_a, load_b, load_c, store_d, mma} x {row, col} x {f16, f32} x {generic, global, shared} x {satfinite} |

The WMMA entries dominate the count. Each combination of shape (m16n16k16/m32n8k16/m8n32k16), operation (load_a/load_b/load_c/store_d/mma), layout (row/col for each matrix), data type (f16/f32), address space (generic/global/shared), and optional satfinite flag produces a separate intrinsic registration.
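The combinatorial structure can be made concrete with a short enumeration. This is a simplified model for illustration only: the real registration set is not a full cross product (e.g., mma operations take no address-space variants, and satfinite applies to only some configurations), so the count here is an upper-bound sketch, not the exact 204 sm_70 WMMA entries.

```python
# Illustrative enumeration of the sm_70 WMMA variant space described above.
# Simplified: real registrations are an irregular subset of this product.
from itertools import product

shapes  = ["m16n16k16", "m32n8k16", "m8n32k16"]
ops     = ["load_a", "load_b", "load_c", "store_d", "mma"]
layouts = ["row", "col"]
types   = ["f16", "f32"]
spaces  = ["", "_global", "_shared"]   # generic / global / shared

names = [f"__cuda_sm70_wmma_{s}_{op}_{ly}_{ty}{sp}"
         for s, op, ly, ty, sp in product(shapes, ops, layouts, types, spaces)]

print(len(names))   # 3 * 5 * 2 * 2 * 3 = 180 combinations in this simplified model
```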

Opcode Dispatch -- sub_5D4190

This 41KB function first calls sub_5D1660(a1) to populate the intrinsic ID table, then builds two more hash maps for PTX opcode dispatch.

Named Opcode Table (at a1+808)

~120 PTX instruction names mapped to codegen handler function pointers. Each handler allocates a 50,000-byte buffer, queries instruction properties through accessor functions on the instruction object at a1+1096, and generates inline PTX code via sequential sprintf() calls.

| Category | Opcodes | Codegen Handlers |
|---|---|---|
| Math | div.full, div, rem, rcp, rsqrt, sqrt, ex2, lg2, tanh | sub_573860, sub_5B76D0 (64KB), sub_589810, sub_5B0CD0 (44KB), sub_57BFC0, sub_5B4040 (49KB), sub_583190, sub_52A5C0, sub_505B00 |
| Memory | membar, _ldldu, prefetch | sub_4DB410, sub_4DD860, sub_507FB0 |
| Conversion | cvt | sub_59F630 |
| Bit manipulation | bfind, brev, bfe, bfi, clz, popc, testp, copysign | sub_590C20, sub_50B5A0, sub_578470, sub_52E100, sub_4DBCC0, sub_4DB210, sub_581A10, sub_50B180 |
| Texture | tex, tex.base, tex.level, tld4, tex.grad | sub_584D10, sub_5879B0, sub_58B6A0, sub_56D700, sub_5ADDC0 (50KB) |
| Video (SIMD) | vadd/vsub/vmin/vmax/vabsdiff/vshl/vshr/vset/vmad (scalar), vadd2/vmax2/vmin2/vabsdiff2/vset2/vsub2/vavrg2 (packed 2x16), vadd4/vmin4/vmax4/vabsdiff4/vset4/vsub4/vavrg4 (packed 4x8) | per-instruction handlers |
| Dot product | dp2a.lo, dp2a.hi, dp4a | sub_56BA60, sub_56C8D0, sub_577BA0 |
| Barriers | bar, barrier, bar.arrive, barrier.arrive, bar.red, barrier.red, bar.cta/barrier.cta (.arrive/.red variants), bar.warp | sub_524FB0, sub_570290, sub_500BF0, sub_570940, sub_52D590, sub_5889B0, sub_56A5A0 |
| Warp | vote, shfl, match, redux | sub_580E50, sub_5801D0, sub_58A730, sub_567680 |
| Async copy | cp.async.mbarrier.arrive, cp.async.bulk, cp.async.bulk.tensor | sub_4DC180, sub_593210, sub_5AB460 (45KB) |
| Matrix | ldmatrix, movmatrix, stmatrix, st.async, red.async, st.bulk | sub_50D4B0, sub_4DAEA0, sub_4F05D0, sub_58E9B0, sub_5825A0, sub_549430 |
| Cache | createpolicy.range, createpolicy.fractional, createpolicy.cvt | per-instruction handlers |
| WMMA | wmma.load.a, wmma.load.b, wmma.load.c, wmma.store.d, wmma.mma | sub_5A2D10, sub_5A0EA0, sub_5A8E40, sub_5A6BD0, sub_5C7A50 (173KB) |
| MMA | mma | sub_5C10A0 (120KB) |
| WGMMA | wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group | sub_50AC70, sub_4DA380, sub_4DA4B0, sub_4DA5E0 |
| Multimem | multimem.ld_reduce, multimem.st, multimem.red | sub_58D8B0, sub_57B4C0, sub_50A850 |
| Tensormap | tensormap.replace | sub_57F6E0 |
| TCGen05 | tcgen05.alloc, tcgen05.relinquish_alloc_permit, tcgen05.dealloc, tcgen05.ld, tcgen05.ld.red, tcgen05.st, tcgen05.commit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.ws | sub_569180, sub_526370, sub_58C7F0, sub_574050, sub_578DB0, sub_571FE0, sub_56C190, sub_5427F0, sub_4F1A90, sub_5BBC30 (90KB), sub_58FA20 |
| TCGen05 guardrails | _tcgen05.guardrails.is_phase_valid, are_columns_allocated, is_current_warp_valid_owner, in_physical_bounds, allocation_granularity, datapath_alignment, sp_consistency_across_idesc_mod, check_sparse_usage | per-instruction handlers |

Numeric MMA Hash Table (at a1+816)

~400 entries where the key is a numeric string representation of a hash value (e.g., "2644314910") that encodes a specific MMA shape/type/layout combination. The hash encodes the instruction variant completely: matrix dimensions (m16n8k16, m16n8k32, etc.), data type (f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1), and layout (row/col combinations). Each entry maps to a codegen handler function pointer. This avoids a multi-dimensional lookup by collapsing the full variant space into a single hash probe.
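The single-probe idea can be sketched as follows. The actual hash function ptxas uses to derive keys like "2644314910" is not known from this analysis; `zlib.crc32` stands in purely to show the key-construction pattern (a variant tuple collapsed into one numeric string).

```python
# Sketch of single-probe MMA dispatch: collapse shape/type/layout into one
# numeric string key, as the hash table at a1+816 does. zlib.crc32 is a
# stand-in; the real hash function in ptxas is unidentified.
import zlib

def mma_key(shape: str, a_type: str, b_type: str, layout: str) -> str:
    desc = f"{shape}.{a_type}.{b_type}.{layout}".encode()
    return str(zlib.crc32(desc))      # numeric string key, "2644314910"-style

handlers = {}
handlers[mma_key("m16n8k16", "f16", "f16", "row.col")] = "sub_4DDB80"

# one hash probe replaces a 4-dimensional (shape, a_type, b_type, layout) lookup
print(handlers[mma_key("m16n8k16", "f16", "f16", "row.col")])   # sub_4DDB80
```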

Body Template Name Table -- sub_5D7430

At 161KB of machine code (0x5D7430--0x5FF700), this is the largest function in the intrinsic infrastructure by code size and the 6th largest function in the entire ptxas binary. IDA failed to decompile it; all analysis comes from raw x86-64 disassembly. The function constructs a third hash map (at context offset +824 / 0x338) containing 1,079 entries that map dynamically constructed __cuda_* intrinsic names to sequential body template IDs (0--1078).

Why 1,079 Body Templates for 607 Logical Intrinsics

The master registration table (sub_5D1660) maps 607 intrinsic names to logical IDs. The body template table (sub_5D7430) maps 1,079 variant-specific names to prototype generator case numbers. The 1.78x expansion has one dominant cause: WMMA template proliferation across GPU generations.

The 204 logical WMMA entries in sub_5D1660 cover only the original sm_70 Volta shapes (m16n16k16/m32n8k16/m8n32k16 with f16/f32 types). But the body template table includes all later-generation WMMA variants -- sm7x sub-byte/bit, sm72 integer, sm8x tf32/bf16/f64 -- that were added as hardware evolved. These ~416 extra WMMA templates have no matching entry in the 607 logical ID table; they exist only in the body template hash map and the prototype generator switch.

Non-WMMA intrinsics map approximately 1:1 between logical IDs and body templates. The math operations (div, rcp, sqrt) are already fully type-specialized at the logical level -- each rounding-mode/type combination is a separate logical intrinsic.

Three sources of expansion beyond the 607 logical entries:

  1. Later-generation WMMA variants (~416 template-only entries):
    • sm7x sub-byte WMMA (s4/u4 m8n8k32) + bit WMMA (m8n8k128): ~231 templates
    • sm72 integer WMMA (m16n16k16/m32n8k16/m8n32k16 integer types): ~105 templates
    • sm8x tf32 WMMA (m16n16k8) + bf16/f64 WMMA: ~80 templates
  2. Aligned warp sync variants (~13 extra templates): matchsync_aligned, votesync_aligned, votesync_ballot_groupwise, query_activemask/query_activemask_groupwise for cooperative group support
  3. Additional SM100 specializations (~8 extra templates): tcgen05_alloc_two_sm, extra guardrails check variants, get_warp_rank

Conversely, 18 sm1xx bulk copy intrinsics have logical IDs but zero body templates -- they bypass the template/prototype mechanism entirely and are lowered directly to inline PTX by the opcode dispatch handlers (sub_593210, sub_5AB460).

Template Distribution Table

| Logical Group | Logical | Templates | Factor |
|---|---|---|---|
| SM20 IEEE math (div, rem, rcp, sqrt, bfe/bfi) | 70 | 70 | 1.0x |
| SM3x optimized division | 4 | 4 | 1.0x |
| SM62 integer dot product | 2 | 2 | 1.0x |
| SM70 barriers | 170 | 170 | 1.0x |
| SM70 warp sync (match, vote, shfl, query) | 19 | 32 | 1.7x |
| SM70 WMMA (f16/f32 original Volta) | 204 | 249 | 1.2x |
| SM7x WMMA extended (sub-byte, bit) | 0 | 231 | tmpl-only |
| SM72 WMMA (integer) | 0 | 105 | tmpl-only |
| SM8x WMMA (tf32, bf16, f64) | 0 | 80 | tmpl-only |
| SM80 cache policy | 3 | 4 | 1.3x |
| SM8x direct MMA | 14 | 14 | 1.0x |
| SM9x Hopper sub-byte/bit MMA | 51 | 52 | 1.0x |
| SM10x Blackwell MMA metadata | 10 | 10 | 1.0x |
| SM100 tcgen05 + guardrails | 11 | 19 | 1.7x |
| SM100+ bulk copy / TMA | 18 | 0 | (no templates) |
| Redux sync primitives | 17 | 17 | 1.0x |
| Compute-sanitizer hooks | 7 | 7 | 1.0x |
| Video instruction emulation | 7 | 7 | 1.0x |
| Total | 607 | 1,073 | 1.78x |
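As a sanity check, the per-group counts in the distribution table do sum to the documented totals:

```python
# Column sums for the template distribution table: 18 groups, in row order.
logical  = [70, 4, 2, 170, 19, 204, 0, 0, 0, 3, 14, 51, 10, 11, 18, 17, 7, 7]
template = [70, 4, 2, 170, 32, 249, 231, 105, 80, 4, 14, 52, 10, 19, 0, 17, 7, 7]

print(sum(logical), sum(template))   # 607 1073
```

(The 1.78x factor quoted in the text is 1,079/607, i.e., it counts the full body template table including the 7 unclassified entries.)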

WMMA subtotal: 204 logical entries expand to 665 body templates (3.3x). Non-WMMA: 403 logical entries map to 408 templates (~1.0x). The remaining 7 templates (1,080 prototype switch cases minus 1,073 classified) are sanitizer/cache variants where IDA produced qmemcpy instead of strcpy, preventing exact name extraction.

The sm70 WMMA group itself expands from 204 to 249 templates because the prototype generator includes update_ptr and desc (descriptor-based addressing) variants of certain load/store operations that the logical table does not separate.

The three "tmpl-only" WMMA rows (sm7x/sm72/sm8x) are the single largest contributor to the expansion. They represent ~416 templates with zero logical ID counterparts. These families use .FORCE_INLINE .func linkage in their prototypes instead of the .weak .func used by the original sm70 WMMA entries:

sm70 (original):   .weak .func (...) __cuda_sm70_wmma_m16n16k16_load_a_col (...)
sm72 (integer):    .FORCE_INLINE .func (...) __cuda_sm72_Integer_wmma_m16n16k16_load_a_row (...)
sm7x (sub-byte):   .FORCE_INLINE .func (...) __cuda_sm7x_sub_byte_wmma_m8n8k32_load_a_row (...)
sm8x (tf32):       .FORCE_INLINE .func (...) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (...)

The .FORCE_INLINE directive forces inlining at every call site rather than emitting a separate callable function. The later-gen WMMA implementations are more complex and performance-sensitive, making per-call-site specialization profitable.

Name Construction Algorithm

The function contains zero string references because it constructs all 1,079 names at runtime. For each entry:

  1. Allocate a 20-byte buffer via sub_424070(allocator, 20)
  2. Copy prefix (16 bytes) from .rodata via SSE movdqa + movups (e.g., "__cuda_sm20_div_")
  3. Append suffix (4 bytes) via movl immediate at offset +16 (e.g., "s16\0", "u64\0", "rn_f")
  4. Register via sub_426150(context+824, buffer, template_id) with sequential integer IDs

The 533 unique .rodata prefix addresses fan out through multiple suffixes per prefix:

.rodata prefix (16B)       suffix (4B)     result (20B buffer)
───────────────────────    ───────────     ──────────────────────
"__cuda_sm20_div_"    +    "s16\0"    =   "__cuda_sm20_div_s16"
"__cuda_sm20_div_"    +    "u16\0"    =   "__cuda_sm20_div_u16"
"__cuda_sm20_div_"    +    "u64\0"    =   "__cuda_sm20_div_u64"
"__cuda_sm20_div_"    +    "s64\0"    =   "__cuda_sm20_div_s64"
"__cuda_sm20_div_"    +    "rn_f"     =   "__cuda_sm20_div_rn_f" (truncated)
"__cuda_sm20_rem_"    +    "s16\0"    =   "__cuda_sm20_rem_s16"
"__cuda_sm20_rcp_"    +    "rn_f"     =   "__cuda_sm20_rcp_rn_f" (truncated)
"__cuda_sm70_barr"    +    "ier_"     =   "__cuda_sm70_barrier_" (prefix chain)

Names truncated at the 20-byte buffer limit are still sufficient for hash map lookup -- the full untruncated name appears only inside the prototype string in sub_5FF700.
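The construction and truncation behavior can be modeled in a few lines. This is a behavioral sketch of the algorithm described above, not a transcription of the assembly:

```python
# Sketch of the 16-byte-prefix + 4-byte-suffix name construction,
# including the 20-byte buffer truncation described above.
def build_name(prefix16: bytes, suffix4: bytes) -> str:
    assert len(prefix16) == 16 and len(suffix4) == 4
    buf = (prefix16 + suffix4)[:20]            # fixed 20-byte buffer, no room for NUL
    return buf.split(b"\0", 1)[0].decode()     # C-string semantics on lookup

print(build_name(b"__cuda_sm20_div_", b"s16\0"))   # __cuda_sm20_div_s16
print(build_name(b"__cuda_sm20_div_", b"rn_f"))    # __cuda_sm20_div_rn_f (truncated)
```

A suffix with no NUL terminator fills the buffer exactly, yielding the truncated 20-character keys noted above.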

Worked Example: Division (Cases 0--26)

The __cuda_sm20_div operation illustrates the template-to-prototype mapping. Division has 19 logical IDs and 19 body templates (1:1 ratio) because each type/rounding/precision variant is already a separate logical intrinsic. The suffix encodes the type specialization:

| Case | Body Template Name | Type Suffix | PTX Signature |
|---|---|---|---|
| 0 | __cuda_sm20_div_s16 | s16 | (.reg .s32 %d) ... (.reg .s32 %a0, .reg .s32 %a1) |
| 1 | __cuda_sm20_div_u16 | u16 | (.reg .u32 %d) ... (.reg .u32 %a0, .reg .u32 %a1) |
| 4 | __cuda_sm20_div_u64 | u64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 5 | __cuda_sm20_div_s64 | s64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | __cuda_sm20_div_rn_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 10 | __cuda_sm20_div_rd_f32 | rd_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 14 | __cuda_sm20_div_rn_ftz_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 22 | __cuda_sm20_div_ru_f64_v2 | ru_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |
| 25 | __cuda_sm20_div_rn_f64_full | rn_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |

Cases 2--3 (rem s16/u16), 6--7 (rem s64/u64) are interleaved between the division entries. Cases 8, 13 are _slowpath variants that implement Newton-Raphson refinement fallbacks. Cases 18--21 are the sm3x-optimized division variants with the same suffix scheme. Note: s16/u16 division uses .s32/.u32 register types because PTX has no 16-bit register class; the 16-bit operation is performed by 32-bit hardware with appropriate sign/zero extension.
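The 16-bit-via-32-bit widening can be illustrated with a small model. This is an assumption-laden sketch of the general pattern (sign-extend, operate at 32 bits, narrow the result), not the actual helper body, which this analysis has not extracted:

```python
# Model of why __cuda_sm20_div_s16 is declared with .s32 registers:
# the 16-bit operands are sign-extended, divided in 32-bit hardware
# with C-style truncation toward zero, and the result re-narrowed.
def sext16(x: int) -> int:
    """Sign-extend a 16-bit value (models the .s32 widening)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def div_s16(a: int, b: int) -> int:
    wa, wb = sext16(a), sext16(b)
    q = abs(wa) // abs(wb)
    if (wa < 0) != (wb < 0):          # truncation toward zero, C semantics
        q = -q
    return q & 0xFFFF                 # narrow back to 16 bits

print(hex(div_s16(0xFFF9, 2)))   # -7 / 2 = -3 -> 0xfffd
```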

Statistics

| Metric | Value |
|---|---|
| Machine code size | 164,560 bytes (0x5D7430--0x5FF700) |
| sub_426150 calls | 1,079 |
| Unique .rodata prefix addresses | 533 |
| Hash map destination | context+824 (0x338) |
| Buffer size per entry | 20 bytes |
| IDA decompilation | Failed (function too large/repetitive) |

Context Hash Map Summary

The intrinsic lowering context object holds five hash maps and one flat table:

| Offset | Field | Builder | Contents | Entries |
|---|---|---|---|---|
| +808 | opcode handlers | sub_5D4190 | PTX opcode name -> codegen fn ptr | ~120 |
| +816 | MMA hash handlers | sub_5D4190 | numeric hash -> codegen fn ptr | ~400 |
| +824 | body templates | sub_5D7430 | intrinsic name -> template ID | 1,079 |
| +1056 | descriptor table | sub_5D1660 | 608 x 16B intrinsic descriptor slots | 608 |
| +1064 | ID map | sub_5D1660 | intrinsic name -> logical ID (1--607) | 607 |
| +1072 | count | sub_5D1660 | 608 (includes null slot 0) | -- |

Instruction Property Accessors

All codegen handlers query instruction properties through accessor functions on the instruction object at a1+1096. These are the same accessors used by WMMA, MMA, and tcgen05 codegen.

| Accessor | Purpose | Usage Example |
|---|---|---|
| sub_70B6E0 | Check if feature enabled | sub_70B6E0(obj) -- boolean feature gate |
| sub_70B780 | Get feature parameter | Numeric feature parameter |
| sub_70FA00 | Check instruction capability for SM | sub_70FA00(*, 23) = texture, sub_70FA00(*, 29) = tcgen05 |
| sub_70E940 | Get operand count | Number of operands |
| sub_70E6E0 | Get data type | Operand data type enumeration |
| sub_70ACC0 | Get accumulator type | MMA accumulator data type |
| sub_709860 | Get register type/size | Register class and width |
| sub_70F460 | Get layout variant | row/col matrix layout |
| sub_707D60 | Check MMA shape variant | m16n16k16 vs m32n8k16, etc. |
| sub_709910 | Check sparse mode | Sparse MMA variant flag |
| sub_70F650 | Get matrix dimension (M/N) | Matrix size parameter |
| sub_70F600 | Get matrix dimension (K) | Alternate dimension parameter |
| sub_70CA60 | Get operand type by index | sub_70CA60(*, 0) -- type of first operand (21 = specific type, 58 = f32, 59 = f64) |
| sub_70BA40 | Texture mode query | Texture sampling mode |
| sub_70BD50 | Sampler mode query | Texture sampler configuration |
| sub_70BB20 | Bulk tensor mode | cp.async.bulk.tensor transfer mode |
| sub_70F0A0 | Get sparse metadata | Sparse matrix metadata parameter |

Prototype Generator -- sub_5FF700

At 354KB, this is the single largest function in the intrinsic infrastructure and the 2nd largest function in the entire ptxas binary. It takes a body template ID (a1, range 0--1079) and an allocator context (a2), allocates a buffer via sub_4DA340(size, a2), fills it with a PTX prototype string via strcpy(), and returns the result. The output is a complete .weak .func or .FORCE_INLINE .func PTX declaration that gets emitted into the PTX output stream so the linker can resolve calls to CUDA runtime helper functions.

The function is a single switch(a1) with 1,080 case labels (0--1079) plus a default case that returns an empty string "". Each case allocates an exact-sized buffer (72--1,200 bytes), copies a hardcoded PTX prototype string into it, and returns the pointer.

Prototype Generator Architecture

sub_5FF700(template_id, allocator)
  │
  │  switch(template_id)     ← 1,080 cases, 0--1079
  │
  ├── case N:
  │     buf = sub_4DA340(byte_count, allocator)    ← allocate exact-fit buffer
  │     strcpy(buf, ".weak .func (...) name (...)")  ← copy PTX prototype
  │     return buf
  │
  ├── case M:  (45 large WMMA mma cases)
  │     buf = sub_4DA340(byte_count, allocator)    ← up to 1,200 bytes
  │     *(u64*)buf = 0x662E206B6165772E            ← ".weak .f" (inline store)
  │     *(u64*)(buf+N-8) = <trailer>               ← last 8 bytes inline
  │     qmemcpy(buf+8, .rodata_addr, size-16)      ← bulk copy middle
  │     return buf
  │
  └── default:
        return ""

Three copy strategies appear in the decompilation, all producing the same result:

| Strategy | Cases | Trigger | Max Size |
|---|---|---|---|
| strcpy() with inline string literal | 1,035 | Prototype fits in decompiler string threshold | ~520 bytes |
| qmemcpy() with QWORD bookend stores | 45 | Prototype too long for IDA to reproduce as literal | 1,200 bytes |
| Indirect variable assignment + copy | ~130 | IDA SSA split (subset of strcpy) | ~120 bytes |

The qmemcpy cases are the 45 WMMA mma operations with the largest parameter lists (3--4 fragment matrices of 8 elements each). IDA stores the first and last 8 bytes as inline immediates (0x662E206B6165772E = ".weak .f", trailer varies per case) and bulk-copies the middle from .rodata. The prototype content is structurally identical to the strcpy cases.
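The leading-immediate decoding is easy to verify: packing the QWORD little-endian recovers the string prefix.

```python
# Decode the inline-immediate "bookend": the first 8 bytes of the
# prototype are stored as one QWORD; little-endian byte order means
# the low byte is the first character.
import struct

qword = 0x662E206B6165772E
prefix = struct.pack("<Q", qword).decode("ascii")
print(prefix)   # .weak .f
```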

Linkage Directives

Two PTX linkage types are emitted, controlling how the linker handles the declared function:

| Directive | Count | Meaning | Used By |
|---|---|---|---|
| .weak | 616 | Overridable by user code; linker uses user version if present | SM20 math, SM70 barriers/sync/WMMA (original Volta), SM80 cache policy, SM8x/9x/10x MMA, redux sync, sanitizer hooks, video emulation, dp2a/dp4a |
| .FORCE_INLINE | 464 | Inlined at every call site; no separate callable function | SM70 aligned vote/match/query_activemask, SM7x sub-byte/bit WMMA, SM72 integer WMMA, SM8x tf32/bf16/f64 WMMA, SM10x tcgen05 alloc/guardrails, SM80 createpolicy_fractional |

The .weak linkage supports user-supplied replacements: if the user provides their own implementation of __cuda_sm20_div_s16, the linker will use that instead of the built-in runtime version. The .FORCE_INLINE directive forces per-call-site specialization -- the later-generation WMMA implementations are more complex and performance-sensitive, making inlining profitable.

A subset of .weak prototypes (~410) carry the .unique qualifier:

.weak .func (.reg .b32 dst) __cuda_sm70_barrier_sync (.reg .b32 arg0) .unique ;

.unique instructs the PTX linker to keep exactly one copy of the function body even if multiple compilation units reference it. All barriers, redux sync, warpsync, non-aligned vote/match/shuffle use .unique.

Prototype Format

Every emitted prototype follows one of these structural patterns:

<linkage> .func (<return_params>) <name> (<input_params>) [.unique] ;

| Case | Prototype |
|---|---|
| 0 | .weak .func (.reg .s32 %d) __cuda_sm20_div_s16 (.reg .s32 %a0, .reg .s32 %a1) |
| 4 | .weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64 (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | .weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) |
| 25 | .weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full (.reg .f64 %fda1, .reg .f64 %fda2) |
| 76 | .weak .func (.reg .b32 dst) __cuda_reduxsync_b32_xor (.reg .b32 src, .reg .b32 mask) .unique |
| 303 | .weak .func (.param .align 16 .b32 dst[8]) __cuda_sm70_wmma_m16n16k16_load_a_row (.reg .u64 ptr, .reg .u32 ldm) .unique |
| 666 | .FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) |
| 890 | .weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32 (.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, ...) |
| 1055 | .FORCE_INLINE .func (.reg .b32 warp_rank) __cuda_sm10x_get_warp_rank () |
| 1073 | .weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_readmetadata (.param .b64 ..._param_0, .param .b64 ..._param_1) |

Parameter Passing Conventions

Five distinct parameter-passing ABIs appear across the 1,080 prototypes:

Convention A -- Register-only (.reg): Used by math operations, barriers, warp sync, redux sync, video emulation. Return and input parameters are individual .reg declarations with typed names. This is the simplest and most common convention.

.weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) ;

Convention B -- Param-array with alignment (.param .align N .b32 name[K]): Used by WMMA load/mma, MMA, Hopper sub-byte MMA, Blackwell MMA. Returns an aligned array of .b32 elements. Array sizes: dst[2], dst[3], dst[4], dst[5], dst[8], mma_dst[2], mma_dst[4], mma_dst[8], ret_dst[3], ret_dst[5]. 326 prototypes use .align 16; 1 prototype (mma_shfl_f16) uses .align 8.

.weak .func (.param .align 16 .b32 d[8]) __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32
  (.param .align 16 .b32 a[8], .param .align 16 .b32 b[8], .param .align 16 .b32 c[8]) ;

Convention C -- Param-scalar (.param .b64): Used exclusively by the 7 compute-sanitizer hooks. Parameters use fully-qualified names (__cuda_sanitizer_memcheck_malloc_param_0).

.weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_malloc
  (.param .b64 __cuda_sanitizer_memcheck_malloc_param_0,
   .param .b64 __cuda_sanitizer_memcheck_malloc_param_1) ;

Convention D -- Void return (): Used by WMMA store_d, tcgen05 guardrail traps, sanitizer_free. ~140 prototypes (45 .weak + 95 .FORCE_INLINE).

.weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32
  (.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, .reg .b32 sreg1, ...) ;

Convention E -- Multi-register return (.FORCE_INLINE only): Used by extended WMMA load operations (SM7x/SM72/SM8x). Returns 1--4 registers in the return position (never 8 -- 8-element returns use Convention B's .param arrays instead).

.FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3)
  __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) ;

PTX Register Types

Eight PTX register types appear across the prototypes:

| Type | Approx. Count | Usage |
|---|---|---|
| .reg .b32 | ~2,925 | Dominant: barrier args, WMMA/MMA fragments, guardrail params |
| .reg .u64 | ~520 | Pointers (WMMA/MMA base addresses) |
| .reg .u32 | ~341 | Integer params (leading dimension, counts, offsets) |
| .reg .b64 | ~246 | 64-bit bitwise (match bitmask, shuffle predicates, retval) |
| .reg .f32 | ~106 | Float math (div, rcp, sqrt) |
| .reg .f64 | ~70 | Double math (div, rcp, sqrt, dsqrt) |
| .reg .pred | ~10 | Predicate (vote output, matchsync predicate out) |
| .reg .s32 | ~6 | Signed 32-bit (SM20 div/rem s16 return values only) |
Note: .b32 is used instead of .s32/.u32/.f32 for operations where the type interpretation is determined by the instruction rather than the register declaration (WMMA fragments, MMA accumulators, barrier IDs). The .s32 type appears only in the 4 oldest SM20 div/rem_s16/u16 prototypes (cases 0--3).

Register Naming Convention

The prototype register names encode the data type and role:

| Prefix | Meaning |
|---|---|
| %d | 32-bit integer return value (SM20 div/rem s16/u16 only) |
| %a0, %a1 | 32-bit integer input parameters |
| %rdv1 | 64-bit integer return value |
| %rda1, %rda2 | 64-bit integer input parameters |
| %fv1 | f32 return value |
| %fa1, %fa2 | f32 input parameters |
| %fdv1 | f64 return value |
| %fda1, %fda2 | f64 input parameters |
| %fdnum, %fdden | f64 numerator/denominator (div_f64_v2 variants) |
| dst, dst0..dst7 | Generic output registers (WMMA load, barriers) |
| src, sreg0..sreg7 | Generic input registers (WMMA store) |
| ptr, base | 64-bit pointer registers |
| ldm | Leading dimension parameter (WMMA) |
| mask | Warp participation mask |
| cnt | Thread count (barrier_sync_count, barrier_arrive_count) |
| arg0..arg3 | Generic numbered arguments |
| parg | Predicate argument (vote) |
| retVal, dummy | Return/placeholder (tcgen05 guardrails) |
| activemask, warp_rank | Cooperative group queries |

Buffer Allocation Sizes

sub_4DA340(size, allocator) allocates an exact-fit buffer per prototype:

| Metric | Value |
|---|---|
| Minimum allocation | 72 bytes |
| Maximum allocation | 1,200 bytes |
| Median allocation | ~130 bytes |
| Most common sizes | 132 (37x), 182 (31x), 192 (30x), 125 (29x), 118 (28x) |
| Total allocations | 1,080 |

The 45 qmemcpy cases have the largest buffers: 386--1,200 bytes. These are WMMA mma operations whose prototypes enumerate all 3--4 fragment matrices (a, b, c, d) with 4--8 elements each, producing prototype strings that exceed 900 bytes.

Case Range Layout

The 1,080 cases follow the body template registration order from sub_5D7430, roughly grouped by SM generation:

| Case Range | Count | Category | Linkage |
|---|---|---|---|
| 0--69 | 70 | SM20 IEEE math (div, rem, rcp, sqrt, bfe, bfi, dsqrt, drsqrt) | .weak |
| 70--73 | 4 | SM3x optimized division (rn_ftz/noftz f32 + slowpaths) | .weak |
| 74--75 | 2 | SM62 dp2a/dp4a | .weak |
| 76--92 | 17 | Redux sync (b32/s32/u32/f32 add/max/min/xor/and/or/abs/NaN) | .weak .unique |
| 93--~274 | ~182 | SM70 barriers (sync/arrive/red, 16 IDs x with/without count) | .weak .unique |
| ~275--~302 | ~28 | SM70 vote, shuffle, match (bfly/down/idx/up, all/any/b32/b64) | .weak .unique / .FORCE_INLINE |
| ~303--~665 | ~363 | SM70 WMMA load/store (m16n16k16, m32n8k16, m8n32k16, all types/spaces) | .weak .unique |
| ~666--~889 | ~224 | SM7x/SM72/SM8x extended WMMA (sub-byte, integer, tf32, bf16, f64) | .FORCE_INLINE |
| ~890--~964 | ~75 | SM70 WMMA store_d (all shapes/layouts/spaces/types) | .weak |
| ~965--~1048 | ~84 | SM70 WMMA mma + SM8x/SM9x/SM10x MMA (f16/f32, sub-byte, bit, sparse) | .weak |
| ~1049--~1055 | ~7 | SM10x tcgen05 guardrail traps | .weak |
| ~1056--~1060 | ~5 | SM8x direct MMA (mma_shfl, row/col f16/f32 combos) | .weak |
| ~1061--~1072 | ~12 | SM10x tcgen05 alloc/guardrails check functions + get_warp_rank + create_mask | .FORCE_INLINE / .weak |
| 1073--1079 | 7 | Compute-sanitizer hooks (readmetadata, generic, global, local, shared, malloc, free) | .weak |

Statistics

| Metric | Value |
|---|---|
| Machine code size | 362,496 bytes (0x5FF700--0x658B00) |
| Decompiled lines | 9,414 |
| Switch cases | 1,080 (case 0 through case 1079 + default) |
| Local variables declared | ~716 (IDA SSA artifacts) |
| .weak prototypes | 616 (571 strcpy + 45 qmemcpy) |
| .FORCE_INLINE prototypes | 464 |
| .unique-qualified prototypes | ~410 |
| .param .align prototypes | 327 (326 align-16, 1 align-8) |
| Void-return prototypes | ~140 |
| Predicate-using prototypes | ~10 |

Major Codegen Handlers

The four largest codegen handlers together account for roughly 450KB of code (173 + 120 + 90 + 64KB) and cover the tensor core and division instruction families.

sub_5C7A50 -- WMMA.MMA Codegen (173KB)

The largest codegen handler. Generates inline PTX code for wmma.mma instructions across all variant combinations.

  • Allocates a 50,000-byte buffer for code generation
  • Covers shapes: m16n16k16, m32n8k16, m8n32k16
  • Data types: f16, f32, bf16, tf32, s8, u8, s4, u4, b1
  • Layouts: row/col for each of the A, B, C, D matrices (4 layout combinations)
  • Satfinite variants for each configuration
  • Address spaces: generic, global, shared
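
The combinatorial load on this handler is easiest to see by enumerating one slice of the variant space. A sketch for the sm_70 f16/f32 wmma.mma family alone (later types such as bf16/tf32/sub-byte add further axes), with names following the appendix pattern:

```python
from itertools import product

# Enumerate __cuda_sm70_wmma_<shape>_mma_<aLayout>_<bLayout>_<t1>_<t2>[_satfinite]
# for the sm_70 f16/f32 family, as listed in the appendix catalog.
shapes = ["m16n16k16", "m32n8k16", "m8n32k16"]
layouts = ["col", "row"]
types = ["f16", "f32"]
names = [f"__cuda_sm70_wmma_{s}_mma_{la}_{lb}_{t1}_{t2}{sat}"
         for s, la, lb, t1, t2, sat in product(shapes, layouts, layouts,
                                               types, types, ["", "_satfinite"])]
# 3 shapes x 4 layout combos x 4 type combos x 2 satfinite = 96 variants
```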

sub_5C10A0 -- MMA Codegen (120KB)

Handles the newer mma.sync API (non-WMMA). Covers the post-Volta PTX MMA instructions.

  • Shapes: m8n8k4, m16n8k8, m16n8k16, m16n8k32, m16n8k64, m16n8k128, m16n8k256
  • Types: f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1
  • Sparse variants for sm_80+ and sm_90+ (structured sparsity 2:4)

sub_5BBC30 -- TCGen05.MMA Codegen (90KB)

Blackwell 5th-generation tensor core MMA code generation. Handles the tcgen05.mma instruction family introduced in sm_100.

  • Allocates a 50,000-byte buffer
  • Queries sub_70FA00(*, 29) to validate tcgen05 capability
  • Handles standard, sparse, and warp-shared (.ws) variants
  • Uses sub_70F0A0 for sparse metadata parameter extraction
  • Generates code for tcgen05-specific tensor memory addressing

sub_5B76D0 -- Division Codegen (64KB)

Generates inline PTX code for all div variants.

  • Integer division: s16, s64, u16, u64
  • Floating-point division: f32, f64 with all rounding modes (rn, rd, ru, rz)
  • Flush-to-zero (ftz) variants for f32
  • Checks operand type via sub_70CA60(*(_QWORD *)(a1+1096), 0) == 21
  • Emits both fastpath and slowpath (Newton-Raphson) code sequences
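
The slowpath's refinement idea can be illustrated with classic Newton-Raphson reciprocal iteration. This is a conceptual Python sketch, not the emitted PTX sequence (which uses rcp.approx.f32, fused multiply-adds, and exception-case fixups that this sketch omits):

```python
# Newton-Raphson division: start from a coarse reciprocal seed and iterate
# r' = r * (2 - d*r); each step roughly squares the relative error.
def div_newton_raphson(n, d, seed, steps=5):
    r = seed
    for _ in range(steps):
        r = r * (2.0 - d * r)
    return n * r

q = div_newton_raphson(1.0, 3.0, seed=0.3)
```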

OCG Intrinsic System -- sub_6C9EB0

The OCG (Optimized Code Generation) intrinsic subsystem is a separate, parallel dispatch mechanism for SM100+ builtin operations. While the classical system at sub_5D1660 maps CUDA runtime helper names to integer IDs and generates inline PTX, the OCG system maps __nv_ptx_builtin_ocg_* function names to type-specific handlers that validate parameters and emit SASS instructions directly -- bypassing PTX entirely. The OCG table contains 44 operations across 9 categories: arithmetic, packed float, vector integer, async copy/TMA, load/store/cache, reduction/fence, tensor core, tensor memory, and synchronization.

See OCG Intrinsic System (44 Operations) for the complete builtin name table, handler functions, validation strings, SASS-level handlers, and the full five-stage lowering pipeline with operand buffer layout.
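
The dispatch shape can be sketched as a prefix check plus a name-to-enum map. The two operation names below are hypothetical placeholders, not entries from the real 44-operation table:

```python
# Minimal sketch of OCG name resolution: conceptually, sub_6C9BC0 strips the
# "__nv_ptx_builtin_ocg_" prefix and maps the remainder to an internal
# operation enum before routing to a type-specific handler.
OCG_PREFIX = "__nv_ptx_builtin_ocg_"
OCG_OPS = {"example_add": 1, "example_ld": 2}  # hypothetical enum values

def resolve_ocg(name):
    if not name.startswith(OCG_PREFIX):
        return None  # not an OCG builtin; handled by the classical path
    return OCG_OPS.get(name[len(OCG_PREFIX):])
```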

Intrinsic Families by SM Generation

Each SM generation introduces new intrinsic families while preserving all earlier ones. The per-SM intrinsic table initializer functions (sub_60AXXX cluster, registered in Map 3 of the capability dispatch) control which intrinsics are available on each target.

sm_20 -- Software IEEE Math (70 entries)

The foundation layer. 70 intrinsics providing IEEE-754-compliant software implementations of math operations that either lack hardware support or need exact rounding guarantees. All later SM targets inherit these.

  • Division: div_s16, div_u64, div_rn_f32, div_rn_f64_full, etc. -- all rounding modes (rn/rd/ru/rz) and types (s16/s64/u16/u64/f32/f64)
  • Reciprocal: rcp_rn_f32, rcp_rn_f64, etc. -- all rounding modes
  • Square root: sqrt_rn_f32, sqrt_rn_f64, etc. -- all rounding modes
  • Double-precision sqrt: dsqrt_rn, dsqrt_rd, dsqrt_ru, dsqrt_rz
  • Double-precision reciprocal sqrt: drsqrt_f64_v2 (plus slowpath variant)
  • Bit extract/insert: bfe (bit field extract), bfi (bit field insert)
  • Remainder: rem_s16, rem_u16, rem_s64, rem_u64

Codegen handlers: sub_5B76D0 (div, 64KB), sub_5B0CD0 (rcp, 44KB), sub_5B4040 (sqrt, 49KB).

sm_3x -- Optimized Division (4 entries)

Four optimized division variants introduced on Kepler to improve throughput on common division patterns.

sm_62 -- Integer Dot Product (2 entries)

dp2a and dp4a integer dot-product intrinsics introduced on Pascal (GP10x), providing software emulation of the hardware instructions added in sm_61/sm_62.

sm_70 -- Volta Warp-Synchronous + WMMA (393 entries)

The largest single block. Volta introduced mandatory warp-synchronous programming with explicit sync masks and the first generation of tensor core (WMMA) instructions.

Synchronization primitives:

  • barrier_arrive / barrier_sync / barrier_red (0--15, with/without count)
  • matchsync_all/any_b32/b64 with predicate variants
  • shflsync_bfly/down/idx/up with predicate variants
  • votesync_all/any/ballot/uni
  • warpsync
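
The barrier block is highly regular: each of the five barrier families (arrive, sync, red_and, red_or, red_popc) has a generic form, a _count form, and _0.._15 variants with and without _count, giving 34 names per family and 170 barrier names total (catalog IDs 0x089--0x132). A sketch reproducing the scheme:

```python
# Generate the sm_70 barrier intrinsic names following the appendix catalog:
# <base>, <base>_count, and <base>_{0..15}[ _count] for each family.
families = ["barrier_arrive", "barrier_sync",
            "barrier_red_and", "barrier_red_or", "barrier_red_popc"]

def barrier_names(family):
    base = f"__cuda_sm70_{family}"
    names = [base, f"{base}_count"]
    for i in range(16):
        names += [f"{base}_{i}", f"{base}_{i}_count"]
    return names

total = sum(len(barrier_names(f)) for f in families)  # 5 x 34 = 170
```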

WMMA (Warp Matrix Multiply-Accumulate):

  • Shapes: m16n16k16, m32n8k16, m8n32k16
  • Operations per shape: load_a, load_b, load_c, store_d, mma
  • Layouts: row/col combinations for A and B matrices
  • Types: f16, f32 (with satfinite optional)
  • Address spaces: generic, global, shared

sm_80 -- Ampere Cache Policy (3 entries)

Three createpolicy intrinsics for L2 cache management: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode.

sm_10x -- Blackwell MMA Metadata + Bit MMA (10 entries)

10 hmma_mdata/imma_mdata + bit MMA intrinsics for sm_100: metadata variants at m16n8k16/k32/k64 shapes, and 1-bit AND/XOR MMA at m8n8k128/m16n8k128/m16n8k256.

sm_8x -- Direct MMA (14 entries)

14 mma_* intrinsics for sm_8x: 12 direct MMA operations (f16/f32 accumulator x 4 layout combinations of col/row for A and B) plus mma_shfl_f16 and mma_shfl_f32 for register-to-register MMA shuffle.

sm_9x -- Sub-Byte + Bit MMA (51 entries)

51 Hopper-era intrinsics: 3 bit-XOR MMA (m8n8k128/m16n8k128/m16n8k256), 24 dense sub-byte MMA (s4/u4 at m16n8k32/m16n8k64/m8n8k32, with satfinite), 8 sparse m16n8k128, and 16 sparse m16n8k64 (with _0/_1 split variants and satfinite).

sm_10x (via __cuda_sm10x_*) -- Blackwell Tensor Memory + Guardrails (11 entries)

  • 1 create_mask_from_bit_idx_and_alloc_size_v1 helper
  • 10 tcgen05_guardrail_trap_* intrinsics for debug validation of tensor memory operations

sm_1xx -- Bulk Copy (18 entries)

18 bulk copy and cp.async.bulk.tensor intrinsics covering 1D through 5D tensor copies with tile and im2col addressing modes, both unicast and multicast variants.
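
The 18-entry count decomposes cleanly, as a sketch enumerating the names from the appendix catalog shows: 2 plain bulk copies, plus cp.async.bulk.tensor for 1D--5D (tile addressing at every rank, im2col only at 3D--5D), each in unicast and multicast form.

```python
from itertools import product

# Enumerate the __cuda_sm1xx_* bulk-copy intrinsic names per the catalog.
names = ["__cuda_sm1xx_bulk_copy_unicast", "__cuda_sm1xx_bulk_copy_multicast"]
for rank in range(1, 6):
    modes = ["tile"] if rank < 3 else ["tile", "im2col"]  # im2col needs >= 3D
    for mode, cast in product(modes, ["unicast", "multicast"]):
        names.append(f"__cuda_sm1xx_cp_async_bulk_tensor_{rank}d_{mode}_{cast}")
```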

Intrinsic Lookup Flow

The lookup path from a function call in PTX source to the codegen handler follows this sequence:

PTX source: call.uni __cuda_sm70_warpsync, (%mask);
                    |
                    v
            sub_5D1660 hash map (a1+1064)
            key: "__cuda_sm70_warpsync"
            value: integer ID (within 0x89..0x211 range)
                    |
                    v
            sub_5FF700 switch(ID)
            Emits: .weak .func __cuda_sm70_warpsync (.reg .u32 %a0)
                    |
                    v
            sub_5D4190 named opcode hash map (a1+808)
            key: PTX opcode (e.g., "shfl", "vote", "barrier")
            value: codegen handler function pointer
                    |
                    v
            Codegen handler (e.g., sub_5801D0 for "shfl")
            Queries instruction properties via sub_70XXXX accessors
            Generates inline PTX code into 50KB buffer
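
The first stage of this flow can be modeled as a plain dictionary lookup. The single entry below is taken from the appendix catalog; the real map at a1+1064 holds all 607 names:

```python
# Sketch of stage 1: the sub_5D1660 hash map resolves an intrinsic name to
# its integer ID, which sub_5FF700's switch(ID) then uses to emit the
# .weak .func prototype.
intrinsic_ids = {"__cuda_sm70_warpsync": 0x145}  # ID from the appendix

def lookup_intrinsic(name):
    iid = intrinsic_ids.get(name)
    if iid is None:
        raise KeyError(f"unregistered intrinsic: {name}")
    return iid

iid = lookup_intrinsic("__cuda_sm70_warpsync")
```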

For OCG intrinsics on SM100+, the path bypasses PTX entirely: sub_6A97B0 matches call nodes to SASS instructions via an RB-tree, sub_6C9BC0 parses the __nv_ptx_builtin_ocg_* name into an operation enum + sub-op array, sub_6CC690 routes to type-specific handlers and assembles operands, and sub_6CB8A0 emits the final SASS instruction. See the OCG Intrinsic System page for the full five-stage pipeline breakdown with operand buffer layout and internal SASS opcode enum values.

Per-SM Intrinsic Initializers

Each SM target has its own intrinsic table initializer function registered in Map 3 of the capability dispatch (sub_607DB0). These functions control which subset of the 607 intrinsics are available on each target.

| SM | Initializer |
|---|---|
| sm_75 | sub_60A2E0 |
| sm_80 | sub_60A3E0 |
| sm_86 | sub_60AC30 |
| sm_87 | sub_60AD30 |
| sm_88 | sub_60AB30 |
| sm_89 | sub_60A810 |
| sm_90 | sub_60A5F0 |
| sm_100 | sub_60A910 |
| sm_103 | sub_60A700 |
| sm_110 | sub_60AA20 |
| sm_120 | sub_608DF0 |
| sm_121 | sub_60A4E0 |

Sub-variants (e.g., sm_100a, sm_100f) share the same initializer as their base SM since they represent the same silicon with different feature exposure levels.

Instruction Description Loader -- sub_9EE390

sub_9EE390 (3,584 bytes, 0x9EE390--0x9EF190) is the constructor for an instruction description object that feeds the register allocator's pre-coloring pass. Despite the diagnostic string "IntrinsicDescrFile=%s", the function loads instruction descriptions broadly -- not just intrinsic operations. It determines which instructions exist for the target SM, what register classes they use, and what scheduling properties apply. The sole caller is sub_991790 (pre-coloring pass, 12KB).

Invocation pattern: The pre-coloring pass checks context+1936 before calling sub_9EE390. If the descriptor for the current SM class already exists, it is reused. This means the expensive initialization happens once per SM architecture per ptxas process lifetime.
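
The caching pattern is the familiar check-then-construct idiom; a minimal sketch, with class and attribute names as stand-ins for the decompiled offsets:

```python
# The expensive constructor (modeling sub_9EE390) runs only when the cached
# descriptor slot (modeling context+1936) is empty; later calls reuse it.
class CompilationContext:
    def __init__(self):
        self.instr_descr = None  # models context+1936
        self.builds = 0

    def instruction_descr(self):
        if self.instr_descr is None:
            self.builds += 1      # models the call into sub_9EE390
            self.instr_descr = object()
        return self.instr_descr

ctx = CompilationContext()
first, second = ctx.instruction_descr(), ctx.instruction_descr()
```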

Initialization Sequence

  1. Extract target properties. Read the target descriptor from context+1584. Compute the SM architecture class: v111 = target_descriptor[+372] >> 12. Read resource descriptors from option interface slots 41--44.

  2. Check option 404 (IntrinsicDescrFile). Query the option interface at context[208]. If option 404 is set, extract the file path and log " IntrinsicDescrFile=%s". This CI-internal mechanism supplies an external description file that overrides or extends the built-in instruction table. When absent, the built-in database is used.

  3. Determine instruction format class. Call sub_7DDB50(context) (GetSmVersionIndex), subtract 1, index into dword_21E6330:

    | sub_7DDB50 return | v114 | Format |
    |---|---|---|
    | 1 | 0 | basic 64-bit |
    | 2 | 1 | 128-bit |
    | 3 | 3 | extended |
    | 4 | 2 | 192-bit |
    | 5+ | 3 | extended (default) |
  4. Determine SM generation class. Read context+12 (sm_version_id), subtract 1, index into dword_21E5C80. The table is an identity mapping (1--11), one entry per SM generation.

  5. Construct instruction table (648 bytes). Call sub_10AFF80 with 32 parameters including memory pool, register count, format class, description file path, architecture descriptor (16 bytes from context+1888), SM generation class, instruction count limits, and context flags. Follow with sub_10B1A90 (init pass 2) and sub_10AEF10 (finalization).

  6. Apply option overrides. Options 497, 738, 739 from the option interface set register limits and allocation budget values on the instruction table sub-object at +312.

  7. Select SM-specific instruction set descriptor. Based on v111 (SM architecture class):

    | v111 | SM range (inferred) | Alloc size | Constructor | Vtable |
    |---|---|---|---|---|
    | 5 | sm_50--sm_62 | 200 B | sub_9CDF90 | off_23F3B00 |
    | 6 | sm_70--sm_75 | 216 B | sub_9CE030 | off_22BB738 |
    | 7 | sm_80--sm_89 | 232 B | sub_9CE120 | off_22B5150 |
    | 8+ | sm_90--sm_121 | 240 B | sub_9CE190 | off_22AD230 |
    | <5 | (reuse existing) | -- | -- | -- |

    Each successor inherits the previous class and extends it with generation-specific instructions. The descriptor is stored at context+1936 and this+48.

Object Layout

| Offset | Size | Contents |
|---|---|---|
| +0 | 8 | Vtable (off_21E6818 -> [sub_9DAA40, sub_9CADF0, sub_9CAE10, sub_9DDEE0]) |
| +8 | 8 | Back-pointer to compilation context |
| +16 | 8 | Instruction table object (648 B, built by sub_10AFF80) |
| +24 | 8 | Scheduling metadata (from sub_1BBBA60) |
| +32 | 8 | Scratch area pointer (context[198]) |
| +40 | 1 | Dirty flag (0 = clean) |
| +48 | 8 | SM-specific instruction set descriptor |
| +56--136 | -- | Resource descriptors, memory pool, sentinel, sub-allocator |

Diagnostic Strings

| String | Location | Context |
|---|---|---|
| "__nv_ptx_builtin_ocg_" | sub_6C9EB0 (0x6c9ecf) | OCG builtin name prefix |
| "instrinsic" (sic) | Multiple OCG handlers | Consistent NVIDIA typo for "intrinsic" |
| ".weak .func" | sub_5FF700 (354KB) | Prototype declaration prefix |
| "__cuda_sm20_*", "__cuda_sm70_*", etc. | sub_5D1660 | Intrinsic name patterns in registration |
| "__cuda_sanitizer_memcheck_*" | sub_5D1660 | Compute-sanitizer integration hooks |
| "__cuda_sm10x_tcgen05_guardrail_trap_*" | sub_5D1660 | Blackwell debug trap intrinsics |
| " IntrinsicDescrFile=%s" | sub_9EE390 (0x9EEC9B) | Instruction description loader -- logs external description file path (option 404) |
| ".RELU not allowed with unsigned type" | sub_6BEC60 | OCG LDC/S2R handler |

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_5D1660 | 46KB | Master intrinsic registration -- 607 name-to-ID entries (608 table slots) | 99% |
| sub_5D4190 | 41KB | Opcode dispatch -- ~120 named + ~400 MMA hash entries | 99% |
| sub_5FF700 | 354KB | Prototype generator -- .weak .func PTX declarations | 99% |
| sub_5C7A50 | 173KB | wmma.mma codegen (all shapes/types/layouts) | 98% |
| sub_5C10A0 | 120KB | mma codegen (mma.sync API, post-Volta) | 98% |
| sub_5BBC30 | 90KB | tcgen05.mma codegen (Blackwell 5th-gen tensor core) | 98% |
| sub_5B76D0 | 64KB | div codegen (integer + FP, all rounding modes) | 95% |
| sub_5ADDC0 | 50KB | tex.grad codegen (1D/2D/3D gradient textures) | 95% |
| sub_5B4040 | 49KB | sqrt codegen (f32/f64, all rounding modes) | 95% |
| sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (1D--5D, tile/im2col) | 95% |
| sub_5B0CD0 | 44KB | rcp codegen (f32/f64 reciprocal, all rounding modes) | 95% |
| sub_6C9EB0 | 13KB | OCG intrinsic table init -- see OCG Intrinsic System for full function map (27 entries) | 95% |
| sub_6BDE20 | 7KB | Intrinsic operand expansion | 88% |
| sub_6BEC60 | 5.8KB | LDC/S2R intrinsic handlers | 90% |
| sub_9EE390 | 3.5KB | Instruction description loader -- builds per-SM instruction table for pre-coloring ("IntrinsicDescrFile=%s") | 92% |
| sub_9CDF90 | 156B | SM class 5 instruction set descriptor (200B, vtable off_23F3B00) | 85% |
| sub_9CE030 | 115B | SM class 6 instruction set descriptor (216B, extends sub_9CDF90) | 85% |
| sub_9CE120 | 112B | SM class 7 instruction set descriptor (232B, vtable off_22B5150) | 85% |
| sub_9CE190 | 114B | SM class 8+ instruction set descriptor (240B, vtable off_22AD230) | 85% |
| sub_9EF190 | 1.1KB | Error handler for instruction description loader (ICE on invalid option type) | 88% |

Cross-References

Appendix: Complete Intrinsic Name Catalog (607 Entries)

Every intrinsic registered by sub_5D1660, extracted from the decompiled source. IDs are contiguous 1--607 (0x001--0x25F). The suffix after stripping the prefix encodes the operation, data type, rounding mode, address space, and optional modifiers.
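
For the sm_20 math family the suffix encoding is regular enough to capture in a pattern: `<op>_<rounding>[_ftz]_<type>[_slowpath]`. The regex below is an illustration derived from the catalog, not a parser used by ptxas (which maps names to IDs directly through the hash map):

```python
import re

# Decompose an sm_20 math intrinsic name into its encoded fields.
SM20_MATH = re.compile(
    r"__cuda_sm20_(?P<op>[a-z]+)_(?P<rnd>r[ndzu])"
    r"(?P<ftz>_ftz)?_(?P<ty>f32|f64)(?P<slow>_slowpath)?$")

m = SM20_MATH.match("__cuda_sm20_sqrt_ru_ftz_f32_slowpath")
fields = (m["op"], m["rnd"], bool(m["ftz"]), m["ty"], bool(m["slow"]))
```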

__cuda_reduxsync_* -- Redux sync (17 entries, 0x001--0x011, sm_70)

| ID | Hex | Name |
|---|---|---|
| 1 | 0x001 | __cuda_reduxsync_b32_and |
| 2 | 0x002 | __cuda_reduxsync_b32_or |
| 3 | 0x003 | __cuda_reduxsync_b32_xor |
| 4 | 0x004 | __cuda_reduxsync_f32_max |
| 5 | 0x005 | __cuda_reduxsync_f32_max_NaN |
| 6 | 0x006 | __cuda_reduxsync_f32_max_abs |
| 7 | 0x007 | __cuda_reduxsync_f32_max_abs_NaN |
| 8 | 0x008 | __cuda_reduxsync_f32_min |
| 9 | 0x009 | __cuda_reduxsync_f32_min_NaN |
| 10 | 0x00A | __cuda_reduxsync_f32_min_abs |
| 11 | 0x00B | __cuda_reduxsync_f32_min_abs_NaN |
| 12 | 0x00C | __cuda_reduxsync_s32_add |
| 13 | 0x00D | __cuda_reduxsync_s32_max |
| 14 | 0x00E | __cuda_reduxsync_s32_min |
| 15 | 0x00F | __cuda_reduxsync_u32_add |
| 16 | 0x010 | __cuda_reduxsync_u32_max |
| 17 | 0x011 | __cuda_reduxsync_u32_min |

__cuda_sanitizer_memcheck_* -- Compute-sanitizer hooks (7 entries, 0x012--0x018, --)

| ID | Hex | Name |
|---|---|---|
| 18 | 0x012 | __cuda_sanitizer_memcheck_free |
| 19 | 0x013 | __cuda_sanitizer_memcheck_generic |
| 20 | 0x014 | __cuda_sanitizer_memcheck_global |
| 21 | 0x015 | __cuda_sanitizer_memcheck_local |
| 22 | 0x016 | __cuda_sanitizer_memcheck_malloc |
| 23 | 0x017 | __cuda_sanitizer_memcheck_readmetadata |
| 24 | 0x018 | __cuda_sanitizer_memcheck_shared |

__cuda_scalar_video_emulation_* -- Video emulation (7 entries, 0x019--0x01F, sm_20)

| ID | Hex | Name |
|---|---|---|
| 25 | 0x019 | __cuda_scalar_video_emulation_operandExtractAndSignExtend01 |
| 26 | 0x01A | __cuda_scalar_video_emulation_operandExtractAndSignExtend11 |
| 27 | 0x01B | __cuda_scalar_video_emulation_operandExtractAndSignExtend12 |
| 28 | 0x01C | __cuda_scalar_video_emulation_operandExtractAndSignExtend22 |
| 29 | 0x01D | __cuda_scalar_video_emulation_optionalMerge32 |
| 30 | 0x01E | __cuda_scalar_video_emulation_saturate64 |
| 31 | 0x01F | __cuda_scalar_video_emulation_secondOp64 |

__cuda_sm10x_* -- Blackwell tcgen05 guardrails + mask (11 entries, 0x020--0x02A, sm_100)

| ID | Hex | Name |
|---|---|---|
| 32 | 0x020 | __cuda_sm10x_create_mask_from_bit_idx_and_alloc_size_v1 |
| 33 | 0x021 | __cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds |
| 34 | 0x022 | __cuda_sm10x_tcgen05_guardrail_trap_allocation_granularity_invalid |
| 35 | 0x023 | __cuda_sm10x_tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc |
| 36 | 0x024 | __cuda_sm10x_tcgen05_guardrail_trap_current_warp_owner_invalid |
| 37 | 0x025 | __cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment |
| 38 | 0x026 | __cuda_sm10x_tcgen05_guardrail_trap_phase_invalid_during_alloc |
| 39 | 0x027 | __cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env |
| 40 | 0x028 | __cuda_sm10x_tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod |
| 41 | 0x029 | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access |
| 42 | 0x02A | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_being_dealloced |

__cuda_sm1xx_* -- Bulk copy + cp.async.bulk.tensor (18 entries, 0x02B--0x03C, sm_100+)

| ID | Hex | Name |
|---|---|---|
| 43 | 0x02B | __cuda_sm1xx_bulk_copy_multicast |
| 44 | 0x02C | __cuda_sm1xx_bulk_copy_unicast |
| 45 | 0x02D | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_multicast |
| 46 | 0x02E | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_unicast |
| 47 | 0x02F | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_multicast |
| 48 | 0x030 | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_unicast |
| 49 | 0x031 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_multicast |
| 50 | 0x032 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_unicast |
| 51 | 0x033 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_multicast |
| 52 | 0x034 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_unicast |
| 53 | 0x035 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_multicast |
| 54 | 0x036 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_unicast |
| 55 | 0x037 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_multicast |
| 56 | 0x038 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_unicast |
| 57 | 0x039 | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_multicast |
| 58 | 0x03A | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_unicast |
| 59 | 0x03B | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_multicast |
| 60 | 0x03C | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_unicast |

__cuda_sm20_* -- IEEE math (70 entries, 0x03D--0x082, sm_20)

| ID | Hex | Name |
|---|---|---|
| 61 | 0x03D | __cuda_sm20_bfe_s64_ |
| 62 | 0x03E | __cuda_sm20_bfe_u64_ |
| 63 | 0x03F | __cuda_sm20_bfi_u64_ |
| 64 | 0x040 | __cuda_sm20_dblrcp_rn_slowpath_v3 |
| 65 | 0x041 | __cuda_sm20_div_rd_f32 |
| 66 | 0x042 | __cuda_sm20_div_rd_f64_v2 |
| 67 | 0x043 | __cuda_sm20_div_rd_ftz_f32 |
| 68 | 0x044 | __cuda_sm20_div_rn_f32 |
| 69 | 0x045 | __cuda_sm20_div_rn_f64_fast |
| 70 | 0x046 | __cuda_sm20_div_rn_f64_full |
| 71 | 0x047 | __cuda_sm20_div_rn_ftz_f32 |
| 72 | 0x048 | __cuda_sm20_div_rn_ftz_f32_slowpath |
| 73 | 0x049 | __cuda_sm20_div_rn_noftz_f32_slowpath |
| 74 | 0x04A | __cuda_sm20_div_ru_f32 |
| 75 | 0x04B | __cuda_sm20_div_ru_f64_v2 |
| 76 | 0x04C | __cuda_sm20_div_ru_ftz_f32 |
| 77 | 0x04D | __cuda_sm20_div_rz_f32 |
| 78 | 0x04E | __cuda_sm20_div_rz_f64_v2 |
| 79 | 0x04F | __cuda_sm20_div_rz_ftz_f32 |
| 80 | 0x050 | __cuda_sm20_div_s16 |
| 81 | 0x051 | __cuda_sm20_div_s64 |
| 82 | 0x052 | __cuda_sm20_div_u16 |
| 83 | 0x053 | __cuda_sm20_div_u64 |
| 84 | 0x054 | __cuda_sm20_drsqrt_f64_slowpath_v2 |
| 85 | 0x055 | __cuda_sm20_drsqrt_f64_v2 |
| 86 | 0x056 | __cuda_sm20_dsqrt_rd_f64 |
| 87 | 0x057 | __cuda_sm20_dsqrt_rn_f64_mediumpath_v1 |
| 88 | 0x058 | __cuda_sm20_dsqrt_rn_f64_v3 |
| 89 | 0x059 | __cuda_sm20_dsqrt_ru_f64 |
| 90 | 0x05A | __cuda_sm20_dsqrt_rz_f64 |
| 91 | 0x05B | __cuda_sm20_rcp_f64_v3 |
| 92 | 0x05C | __cuda_sm20_rcp_rd_f32 |
| 93 | 0x05D | __cuda_sm20_rcp_rd_f32_slowpath |
| 94 | 0x05E | __cuda_sm20_rcp_rd_f64 |
| 95 | 0x05F | __cuda_sm20_rcp_rd_ftz_f32 |
| 96 | 0x060 | __cuda_sm20_rcp_rd_ftz_f32_slowpath |
| 97 | 0x061 | __cuda_sm20_rcp_rn_f32 |
| 98 | 0x062 | __cuda_sm20_rcp_rn_f32_slowpath |
| 99 | 0x063 | __cuda_sm20_rcp_rn_ftz_f32 |
| 100 | 0x064 | __cuda_sm20_rcp_rn_ftz_f32_slowpath |
| 101 | 0x065 | __cuda_sm20_rcp_ru_f32 |
| 102 | 0x066 | __cuda_sm20_rcp_ru_f32_slowpath |
| 103 | 0x067 | __cuda_sm20_rcp_ru_f64 |
| 104 | 0x068 | __cuda_sm20_rcp_ru_ftz_f32 |
| 105 | 0x069 | __cuda_sm20_rcp_ru_ftz_f32_slowpath |
| 106 | 0x06A | __cuda_sm20_rcp_rz_f32 |
| 107 | 0x06B | __cuda_sm20_rcp_rz_f32_slowpath |
| 108 | 0x06C | __cuda_sm20_rcp_rz_f64 |
| 109 | 0x06D | __cuda_sm20_rcp_rz_ftz_f32 |
| 110 | 0x06E | __cuda_sm20_rcp_rz_ftz_f32_slowpath |
| 111 | 0x06F | __cuda_sm20_rem_s16 |
| 112 | 0x070 | __cuda_sm20_rem_s64 |
| 113 | 0x071 | __cuda_sm20_rem_u16 |
| 114 | 0x072 | __cuda_sm20_rem_u64 |
| 115 | 0x073 | __cuda_sm20_sqrt_rd_f32 |
| 116 | 0x074 | __cuda_sm20_sqrt_rd_f32_slowpath |
| 117 | 0x075 | __cuda_sm20_sqrt_rd_ftz_f32 |
| 118 | 0x076 | __cuda_sm20_sqrt_rd_ftz_f32_slowpath |
| 119 | 0x077 | __cuda_sm20_sqrt_rn_f32 |
| 120 | 0x078 | __cuda_sm20_sqrt_rn_f32_slowpath |
| 121 | 0x079 | __cuda_sm20_sqrt_rn_ftz_f32 |
| 122 | 0x07A | __cuda_sm20_sqrt_rn_ftz_f32_slowpath |
| 123 | 0x07B | __cuda_sm20_sqrt_ru_f32 |
| 124 | 0x07C | __cuda_sm20_sqrt_ru_f32_slowpath |
| 125 | 0x07D | __cuda_sm20_sqrt_ru_ftz_f32 |
| 126 | 0x07E | __cuda_sm20_sqrt_ru_ftz_f32_slowpath |
| 127 | 0x07F | __cuda_sm20_sqrt_rz_f32 |
| 128 | 0x080 | __cuda_sm20_sqrt_rz_f32_slowpath |
| 129 | 0x081 | __cuda_sm20_sqrt_rz_ftz_f32 |
| 130 | 0x082 | __cuda_sm20_sqrt_rz_ftz_f32_slowpath |

__cuda_sm3x_* -- Optimized division (4 entries, 0x083--0x086, sm_30)

| ID | Hex | Name |
|---|---|---|
| 131 | 0x083 | __cuda_sm3x_div_rn_ftz_f32 |
| 132 | 0x084 | __cuda_sm3x_div_rn_ftz_f32_slowpath |
| 133 | 0x085 | __cuda_sm3x_div_rn_noftz_f32 |
| 134 | 0x086 | __cuda_sm3x_div_rn_noftz_f32_slowpath |

__cuda_sm62_* -- Integer dot product (2 entries, 0x087--0x088, sm_62)

| ID | Hex | Name |
|---|---|---|
| 135 | 0x087 | __cuda_sm62_dp2a |
| 136 | 0x088 | __cuda_sm62_dp4a |

__cuda_sm70_* -- Volta sync/warp/WMMA (393 entries, 0x089--0x211, sm_70)

IDHexName
1370x089__cuda_sm70_barrier_arrive
1380x08A__cuda_sm70_barrier_arrive_0
1390x08B__cuda_sm70_barrier_arrive_0_count
1400x08C__cuda_sm70_barrier_arrive_1
1410x08D__cuda_sm70_barrier_arrive_10
1420x08E__cuda_sm70_barrier_arrive_10_count
1430x08F__cuda_sm70_barrier_arrive_11
1440x090__cuda_sm70_barrier_arrive_11_count
1450x091__cuda_sm70_barrier_arrive_12
1460x092__cuda_sm70_barrier_arrive_12_count
1470x093__cuda_sm70_barrier_arrive_13
1480x094__cuda_sm70_barrier_arrive_13_count
1490x095__cuda_sm70_barrier_arrive_14
1500x096__cuda_sm70_barrier_arrive_14_count
1510x097__cuda_sm70_barrier_arrive_15
1520x098__cuda_sm70_barrier_arrive_15_count
1530x099__cuda_sm70_barrier_arrive_1_count
1540x09A__cuda_sm70_barrier_arrive_2
1550x09B__cuda_sm70_barrier_arrive_2_count
1560x09C__cuda_sm70_barrier_arrive_3
1570x09D__cuda_sm70_barrier_arrive_3_count
1580x09E__cuda_sm70_barrier_arrive_4
1590x09F__cuda_sm70_barrier_arrive_4_count
1600x0A0__cuda_sm70_barrier_arrive_5
1610x0A1__cuda_sm70_barrier_arrive_5_count
1620x0A2__cuda_sm70_barrier_arrive_6
1630x0A3__cuda_sm70_barrier_arrive_6_count
1640x0A4__cuda_sm70_barrier_arrive_7
1650x0A5__cuda_sm70_barrier_arrive_7_count
1660x0A6__cuda_sm70_barrier_arrive_8
1670x0A7__cuda_sm70_barrier_arrive_8_count
1680x0A8__cuda_sm70_barrier_arrive_9
1690x0A9__cuda_sm70_barrier_arrive_9_count
1700x0AA__cuda_sm70_barrier_arrive_count
1710x0AB__cuda_sm70_barrier_red_and
1720x0AC__cuda_sm70_barrier_red_and_0
1730x0AD__cuda_sm70_barrier_red_and_0_count
1740x0AE__cuda_sm70_barrier_red_and_1
1750x0AF__cuda_sm70_barrier_red_and_10
1760x0B0__cuda_sm70_barrier_red_and_10_count
1770x0B1__cuda_sm70_barrier_red_and_11
1780x0B2__cuda_sm70_barrier_red_and_11_count
1790x0B3__cuda_sm70_barrier_red_and_12
1800x0B4__cuda_sm70_barrier_red_and_12_count
1810x0B5__cuda_sm70_barrier_red_and_13
1820x0B6__cuda_sm70_barrier_red_and_13_count
1830x0B7__cuda_sm70_barrier_red_and_14
1840x0B8__cuda_sm70_barrier_red_and_14_count
1850x0B9__cuda_sm70_barrier_red_and_15
1860x0BA__cuda_sm70_barrier_red_and_15_count
1870x0BB__cuda_sm70_barrier_red_and_1_count
1880x0BC__cuda_sm70_barrier_red_and_2
1890x0BD__cuda_sm70_barrier_red_and_2_count
1900x0BE__cuda_sm70_barrier_red_and_3
1910x0BF__cuda_sm70_barrier_red_and_3_count
1920x0C0__cuda_sm70_barrier_red_and_4
1930x0C1__cuda_sm70_barrier_red_and_4_count
1940x0C2__cuda_sm70_barrier_red_and_5
1950x0C3__cuda_sm70_barrier_red_and_5_count
1960x0C4__cuda_sm70_barrier_red_and_6
1970x0C5__cuda_sm70_barrier_red_and_6_count
1980x0C6__cuda_sm70_barrier_red_and_7
1990x0C7__cuda_sm70_barrier_red_and_7_count
2000x0C8__cuda_sm70_barrier_red_and_8
2010x0C9__cuda_sm70_barrier_red_and_8_count
2020x0CA__cuda_sm70_barrier_red_and_9
2030x0CB__cuda_sm70_barrier_red_and_9_count
2040x0CC__cuda_sm70_barrier_red_and_count
2050x0CD__cuda_sm70_barrier_red_or
2060x0CE__cuda_sm70_barrier_red_or_0
2070x0CF__cuda_sm70_barrier_red_or_0_count
2080x0D0__cuda_sm70_barrier_red_or_1
2090x0D1__cuda_sm70_barrier_red_or_10
2100x0D2__cuda_sm70_barrier_red_or_10_count
2110x0D3__cuda_sm70_barrier_red_or_11
2120x0D4__cuda_sm70_barrier_red_or_11_count
2130x0D5__cuda_sm70_barrier_red_or_12
2140x0D6__cuda_sm70_barrier_red_or_12_count
2150x0D7__cuda_sm70_barrier_red_or_13
2160x0D8__cuda_sm70_barrier_red_or_13_count
2170x0D9__cuda_sm70_barrier_red_or_14
2180x0DA__cuda_sm70_barrier_red_or_14_count
2190x0DB__cuda_sm70_barrier_red_or_15
2200x0DC__cuda_sm70_barrier_red_or_15_count
2210x0DD__cuda_sm70_barrier_red_or_1_count
2220x0DE__cuda_sm70_barrier_red_or_2
2230x0DF__cuda_sm70_barrier_red_or_2_count
2240x0E0__cuda_sm70_barrier_red_or_3
2250x0E1__cuda_sm70_barrier_red_or_3_count
2260x0E2__cuda_sm70_barrier_red_or_4
2270x0E3__cuda_sm70_barrier_red_or_4_count
2280x0E4__cuda_sm70_barrier_red_or_5
2290x0E5__cuda_sm70_barrier_red_or_5_count
2300x0E6__cuda_sm70_barrier_red_or_6
2310x0E7__cuda_sm70_barrier_red_or_6_count
2320x0E8__cuda_sm70_barrier_red_or_7
2330x0E9__cuda_sm70_barrier_red_or_7_count
2340x0EA__cuda_sm70_barrier_red_or_8
2350x0EB__cuda_sm70_barrier_red_or_8_count
2360x0EC__cuda_sm70_barrier_red_or_9
2370x0ED__cuda_sm70_barrier_red_or_9_count
2380x0EE__cuda_sm70_barrier_red_or_count
2390x0EF__cuda_sm70_barrier_red_popc
2400x0F0__cuda_sm70_barrier_red_popc_0
2410x0F1__cuda_sm70_barrier_red_popc_0_count
2420x0F2__cuda_sm70_barrier_red_popc_1
2430x0F3__cuda_sm70_barrier_red_popc_10
2440x0F4__cuda_sm70_barrier_red_popc_10_count
2450x0F5__cuda_sm70_barrier_red_popc_11
2460x0F6__cuda_sm70_barrier_red_popc_11_count
2470x0F7__cuda_sm70_barrier_red_popc_12
2480x0F8__cuda_sm70_barrier_red_popc_12_count
2490x0F9__cuda_sm70_barrier_red_popc_13
2500x0FA__cuda_sm70_barrier_red_popc_13_count
2510x0FB__cuda_sm70_barrier_red_popc_14
2520x0FC__cuda_sm70_barrier_red_popc_14_count
2530x0FD__cuda_sm70_barrier_red_popc_15
2540x0FE__cuda_sm70_barrier_red_popc_15_count
2550x0FF__cuda_sm70_barrier_red_popc_1_count
2560x100__cuda_sm70_barrier_red_popc_2
2570x101__cuda_sm70_barrier_red_popc_2_count
2580x102__cuda_sm70_barrier_red_popc_3
2590x103__cuda_sm70_barrier_red_popc_3_count
2600x104__cuda_sm70_barrier_red_popc_4
2610x105__cuda_sm70_barrier_red_popc_4_count
2620x106__cuda_sm70_barrier_red_popc_5
2630x107__cuda_sm70_barrier_red_popc_5_count
2640x108__cuda_sm70_barrier_red_popc_6
2650x109__cuda_sm70_barrier_red_popc_6_count
2660x10A__cuda_sm70_barrier_red_popc_7
2670x10B__cuda_sm70_barrier_red_popc_7_count
2680x10C__cuda_sm70_barrier_red_popc_8
2690x10D__cuda_sm70_barrier_red_popc_8_count
2700x10E__cuda_sm70_barrier_red_popc_9
2710x10F__cuda_sm70_barrier_red_popc_9_count
2720x110__cuda_sm70_barrier_red_popc_count
2730x111__cuda_sm70_barrier_sync
2740x112__cuda_sm70_barrier_sync_0
2750x113__cuda_sm70_barrier_sync_0_count
2760x114__cuda_sm70_barrier_sync_1
2770x115__cuda_sm70_barrier_sync_10
2780x116__cuda_sm70_barrier_sync_10_count
2790x117__cuda_sm70_barrier_sync_11
2800x118__cuda_sm70_barrier_sync_11_count
2810x119__cuda_sm70_barrier_sync_12
2820x11A__cuda_sm70_barrier_sync_12_count
2830x11B__cuda_sm70_barrier_sync_13
2840x11C__cuda_sm70_barrier_sync_13_count
2850x11D__cuda_sm70_barrier_sync_14
2860x11E__cuda_sm70_barrier_sync_14_count
2870x11F__cuda_sm70_barrier_sync_15
2880x120__cuda_sm70_barrier_sync_15_count
2890x121__cuda_sm70_barrier_sync_1_count
2900x122__cuda_sm70_barrier_sync_2
2910x123__cuda_sm70_barrier_sync_2_count
2920x124__cuda_sm70_barrier_sync_3
2930x125__cuda_sm70_barrier_sync_3_count
2940x126__cuda_sm70_barrier_sync_4
2950x127__cuda_sm70_barrier_sync_4_count
2960x128__cuda_sm70_barrier_sync_5
2970x129__cuda_sm70_barrier_sync_5_count
2980x12A__cuda_sm70_barrier_sync_6
2990x12B__cuda_sm70_barrier_sync_6_count
3000x12C__cuda_sm70_barrier_sync_7
3010x12D__cuda_sm70_barrier_sync_7_count
3020x12E__cuda_sm70_barrier_sync_8
3030x12F__cuda_sm70_barrier_sync_8_count
3040x130__cuda_sm70_barrier_sync_9
3050x131__cuda_sm70_barrier_sync_9_count
3060x132__cuda_sm70_barrier_sync_count
3070x133__cuda_sm70_matchsync_all_b32
3080x134__cuda_sm70_matchsync_all_b32_p
3090x135__cuda_sm70_matchsync_all_b64
3100x136__cuda_sm70_matchsync_all_b64_p
3110x137__cuda_sm70_matchsync_any_b32
3120x138__cuda_sm70_matchsync_any_b64
3130x139__cuda_sm70_shflsync_bfly
3140x13A__cuda_sm70_shflsync_bfly_p
3150x13B__cuda_sm70_shflsync_down
3160x13C__cuda_sm70_shflsync_down_p
3170x13D__cuda_sm70_shflsync_idx
3180x13E__cuda_sm70_shflsync_idx_p
3190x13F__cuda_sm70_shflsync_up
3200x140__cuda_sm70_shflsync_up_p
3210x141__cuda_sm70_votesync_all
3220x142__cuda_sm70_votesync_any
3230x143__cuda_sm70_votesync_ballot
3240x144__cuda_sm70_votesync_uni
3250x145__cuda_sm70_warpsync
3260x146__cuda_sm70_wmma_m16n16k16_load_a_col
3270x147__cuda_sm70_wmma_m16n16k16_load_a_col_global
3280x148__cuda_sm70_wmma_m16n16k16_load_a_col_shared
3290x149__cuda_sm70_wmma_m16n16k16_load_a_row
3300x14A__cuda_sm70_wmma_m16n16k16_load_a_row_global
3310x14B__cuda_sm70_wmma_m16n16k16_load_a_row_shared
3320x14C__cuda_sm70_wmma_m16n16k16_load_b_col
3330x14D__cuda_sm70_wmma_m16n16k16_load_b_col_global
3340x14E__cuda_sm70_wmma_m16n16k16_load_b_col_shared
3350x14F__cuda_sm70_wmma_m16n16k16_load_b_row
3360x150__cuda_sm70_wmma_m16n16k16_load_b_row_global
3370x151__cuda_sm70_wmma_m16n16k16_load_b_row_shared
3380x152__cuda_sm70_wmma_m16n16k16_load_c_col_f16
3390x153__cuda_sm70_wmma_m16n16k16_load_c_col_f16_global
| ID | Hex | Name |
|----|-----|------|
| 340 | 0x154 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16_shared |
| 341 | 0x155 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32 |
| 342 | 0x156 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32_global |
| 343 | 0x157 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32_shared |
| 344 | 0x158 | __cuda_sm70_wmma_m16n16k16_load_c_row_f16 |
| 345 | 0x159 | __cuda_sm70_wmma_m16n16k16_load_c_row_f16_global |
| 346 | 0x15A | __cuda_sm70_wmma_m16n16k16_load_c_row_f16_shared |
| 347 | 0x15B | __cuda_sm70_wmma_m16n16k16_load_c_row_f32 |
| 348 | 0x15C | __cuda_sm70_wmma_m16n16k16_load_c_row_f32_global |
| 349 | 0x15D | __cuda_sm70_wmma_m16n16k16_load_c_row_f32_shared |
| 350 | 0x15E | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16 |
| 351 | 0x15F | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16_satfinite |
| 352 | 0x160 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32 |
| 353 | 0x161 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32_satfinite |
| 354 | 0x162 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16 |
| 355 | 0x163 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16_satfinite |
| 356 | 0x164 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32 |
| 357 | 0x165 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32_satfinite |
| 358 | 0x166 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16 |
| 359 | 0x167 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16_satfinite |
| 360 | 0x168 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32 |
| 361 | 0x169 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32_satfinite |
| 362 | 0x16A | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16 |
| 363 | 0x16B | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16_satfinite |
| 364 | 0x16C | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32 |
| 365 | 0x16D | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32_satfinite |
| 366 | 0x16E | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16 |
| 367 | 0x16F | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16_satfinite |
| 368 | 0x170 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32 |
| 369 | 0x171 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32_satfinite |
| 370 | 0x172 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16 |
| 371 | 0x173 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16_satfinite |
| 372 | 0x174 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32 |
| 373 | 0x175 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32_satfinite |
| 374 | 0x176 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16 |
| 375 | 0x177 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16_satfinite |
| 376 | 0x178 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32 |
| 377 | 0x179 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32_satfinite |
| 378 | 0x17A | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16 |
| 379 | 0x17B | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16_satfinite |
| 380 | 0x17C | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32 |
| 381 | 0x17D | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32_satfinite |
| 382 | 0x17E | __cuda_sm70_wmma_m16n16k16_store_d_col_f16 |
| 383 | 0x17F | __cuda_sm70_wmma_m16n16k16_store_d_col_f16_global |
| 384 | 0x180 | __cuda_sm70_wmma_m16n16k16_store_d_col_f16_shared |
| 385 | 0x181 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32 |
| 386 | 0x182 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32_global |
| 387 | 0x183 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32_shared |
| 388 | 0x184 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16 |
| 389 | 0x185 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16_global |
| 390 | 0x186 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16_shared |
| 391 | 0x187 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32 |
| 392 | 0x188 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32_global |
| 393 | 0x189 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32_shared |
| 394 | 0x18A | __cuda_sm70_wmma_m32n8k16_load_a_col |
| 395 | 0x18B | __cuda_sm70_wmma_m32n8k16_load_a_col_global |
| 396 | 0x18C | __cuda_sm70_wmma_m32n8k16_load_a_col_shared |
| 397 | 0x18D | __cuda_sm70_wmma_m32n8k16_load_a_row |
| 398 | 0x18E | __cuda_sm70_wmma_m32n8k16_load_a_row_global |
| 399 | 0x18F | __cuda_sm70_wmma_m32n8k16_load_a_row_shared |
| 400 | 0x190 | __cuda_sm70_wmma_m32n8k16_load_b_col |
| 401 | 0x191 | __cuda_sm70_wmma_m32n8k16_load_b_col_global |
| 402 | 0x192 | __cuda_sm70_wmma_m32n8k16_load_b_col_shared |
| 403 | 0x193 | __cuda_sm70_wmma_m32n8k16_load_b_row |
| 404 | 0x194 | __cuda_sm70_wmma_m32n8k16_load_b_row_global |
| 405 | 0x195 | __cuda_sm70_wmma_m32n8k16_load_b_row_shared |
| 406 | 0x196 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16 |
| 407 | 0x197 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16_global |
| 408 | 0x198 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16_shared |
| 409 | 0x199 | __cuda_sm70_wmma_m32n8k16_load_c_col_f32 |
| 410 | 0x19A | __cuda_sm70_wmma_m32n8k16_load_c_col_f32_global |
| 411 | 0x19B | __cuda_sm70_wmma_m32n8k16_load_c_col_f32_shared |
| 412 | 0x19C | __cuda_sm70_wmma_m32n8k16_load_c_row_f16 |
| 413 | 0x19D | __cuda_sm70_wmma_m32n8k16_load_c_row_f16_global |
| 414 | 0x19E | __cuda_sm70_wmma_m32n8k16_load_c_row_f16_shared |
| 415 | 0x19F | __cuda_sm70_wmma_m32n8k16_load_c_row_f32 |
| 416 | 0x1A0 | __cuda_sm70_wmma_m32n8k16_load_c_row_f32_global |
| 417 | 0x1A1 | __cuda_sm70_wmma_m32n8k16_load_c_row_f32_shared |
| 418 | 0x1A2 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16 |
| 419 | 0x1A3 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16_satfinite |
| 420 | 0x1A4 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32 |
| 421 | 0x1A5 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32_satfinite |
| 422 | 0x1A6 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16 |
| 423 | 0x1A7 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16_satfinite |
| 424 | 0x1A8 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32 |
| 425 | 0x1A9 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32_satfinite |
| 426 | 0x1AA | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16 |
| 427 | 0x1AB | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16_satfinite |
| 428 | 0x1AC | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32 |
| 429 | 0x1AD | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32_satfinite |
| 430 | 0x1AE | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16 |
| 431 | 0x1AF | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16_satfinite |
| 432 | 0x1B0 | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32 |
| 433 | 0x1B1 | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32_satfinite |
| 434 | 0x1B2 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16 |
| 435 | 0x1B3 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16_satfinite |
| 436 | 0x1B4 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32 |
| 437 | 0x1B5 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32_satfinite |
| 438 | 0x1B6 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16 |
| 439 | 0x1B7 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16_satfinite |
| 440 | 0x1B8 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32 |
| 441 | 0x1B9 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32_satfinite |
| 442 | 0x1BA | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16 |
| 443 | 0x1BB | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16_satfinite |
| 444 | 0x1BC | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32 |
| 445 | 0x1BD | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32_satfinite |
| 446 | 0x1BE | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16 |
| 447 | 0x1BF | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16_satfinite |
| 448 | 0x1C0 | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32 |
| 449 | 0x1C1 | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32_satfinite |
| 450 | 0x1C2 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16 |
| 451 | 0x1C3 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16_global |
| 452 | 0x1C4 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16_shared |
| 453 | 0x1C5 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32 |
| 454 | 0x1C6 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32_global |
| 455 | 0x1C7 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32_shared |
| 456 | 0x1C8 | __cuda_sm70_wmma_m32n8k16_store_d_row_f16 |
| 457 | 0x1C9 | __cuda_sm70_wmma_m32n8k16_store_d_row_f16_global |
| 458 | 0x1CA | __cuda_sm70_wmma_m32n8k16_store_d_row_f16_shared |
| 459 | 0x1CB | __cuda_sm70_wmma_m32n8k16_store_d_row_f32 |
| 460 | 0x1CC | __cuda_sm70_wmma_m32n8k16_store_d_row_f32_global |
| 461 | 0x1CD | __cuda_sm70_wmma_m32n8k16_store_d_row_f32_shared |
| 462 | 0x1CE | __cuda_sm70_wmma_m8n32k16_load_a_col |
| 463 | 0x1CF | __cuda_sm70_wmma_m8n32k16_load_a_col_global |
| 464 | 0x1D0 | __cuda_sm70_wmma_m8n32k16_load_a_col_shared |
| 465 | 0x1D1 | __cuda_sm70_wmma_m8n32k16_load_a_row |
| 466 | 0x1D2 | __cuda_sm70_wmma_m8n32k16_load_a_row_global |
| 467 | 0x1D3 | __cuda_sm70_wmma_m8n32k16_load_a_row_shared |
| 468 | 0x1D4 | __cuda_sm70_wmma_m8n32k16_load_b_col |
| 469 | 0x1D5 | __cuda_sm70_wmma_m8n32k16_load_b_col_global |
| 470 | 0x1D6 | __cuda_sm70_wmma_m8n32k16_load_b_col_shared |
| 471 | 0x1D7 | __cuda_sm70_wmma_m8n32k16_load_b_row |
| 472 | 0x1D8 | __cuda_sm70_wmma_m8n32k16_load_b_row_global |
| 473 | 0x1D9 | __cuda_sm70_wmma_m8n32k16_load_b_row_shared |
| 474 | 0x1DA | __cuda_sm70_wmma_m8n32k16_load_c_col_f16 |
| 475 | 0x1DB | __cuda_sm70_wmma_m8n32k16_load_c_col_f16_global |
| 476 | 0x1DC | __cuda_sm70_wmma_m8n32k16_load_c_col_f16_shared |
| 477 | 0x1DD | __cuda_sm70_wmma_m8n32k16_load_c_col_f32 |
| 478 | 0x1DE | __cuda_sm70_wmma_m8n32k16_load_c_col_f32_global |
| 479 | 0x1DF | __cuda_sm70_wmma_m8n32k16_load_c_col_f32_shared |
| 480 | 0x1E0 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16 |
| 481 | 0x1E1 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16_global |
| 482 | 0x1E2 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16_shared |
| 483 | 0x1E3 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32 |
| 484 | 0x1E4 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32_global |
| 485 | 0x1E5 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32_shared |
| 486 | 0x1E6 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16 |
| 487 | 0x1E7 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16_satfinite |
| 488 | 0x1E8 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32 |
| 489 | 0x1E9 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32_satfinite |
| 490 | 0x1EA | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16 |
| 491 | 0x1EB | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16_satfinite |
| 492 | 0x1EC | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32 |
| 493 | 0x1ED | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32_satfinite |
| 494 | 0x1EE | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16 |
| 495 | 0x1EF | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16_satfinite |
| 496 | 0x1F0 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32 |
| 497 | 0x1F1 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32_satfinite |
| 498 | 0x1F2 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16 |
| 499 | 0x1F3 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16_satfinite |
| 500 | 0x1F4 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32 |
| 501 | 0x1F5 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32_satfinite |
| 502 | 0x1F6 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16 |
| 503 | 0x1F7 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16_satfinite |
| 504 | 0x1F8 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32 |
| 505 | 0x1F9 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32_satfinite |
| 506 | 0x1FA | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16 |
| 507 | 0x1FB | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16_satfinite |
| 508 | 0x1FC | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32 |
| 509 | 0x1FD | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32_satfinite |
| 510 | 0x1FE | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16 |
| 511 | 0x1FF | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16_satfinite |
| 512 | 0x200 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32 |
| 513 | 0x201 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32_satfinite |
| 514 | 0x202 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16 |
| 515 | 0x203 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16_satfinite |
| 516 | 0x204 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32 |
| 517 | 0x205 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32_satfinite |
| 518 | 0x206 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16 |
| 519 | 0x207 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16_global |
| 520 | 0x208 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16_shared |
| 521 | 0x209 | __cuda_sm70_wmma_m8n32k16_store_d_col_f32 |
| 522 | 0x20A | __cuda_sm70_wmma_m8n32k16_store_d_col_f32_global |
| 523 | 0x20B | __cuda_sm70_wmma_m8n32k16_store_d_col_f32_shared |
| 524 | 0x20C | __cuda_sm70_wmma_m8n32k16_store_d_row_f16 |
| 525 | 0x20D | __cuda_sm70_wmma_m8n32k16_store_d_row_f16_global |
| 526 | 0x20E | __cuda_sm70_wmma_m8n32k16_store_d_row_f16_shared |
| 527 | 0x20F | __cuda_sm70_wmma_m8n32k16_store_d_row_f32 |
| 528 | 0x210 | __cuda_sm70_wmma_m8n32k16_store_d_row_f32_global |
| 529 | 0x211 | __cuda_sm70_wmma_m8n32k16_store_d_row_f32_shared |

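The `__cuda_sm70_wmma_*` names above follow a strict grammar (shape, fragment, layout, element type, optional address space, optional `_satfinite`), which is why each shape family occupies a contiguous block of 68 consecutive IDs. A small sketch (illustrative only, not ptxas source) that regenerates the per-shape name space and checks the count against the table:

```python
# Sketch of the __cuda_sm70_wmma_* name grammar observed in the ID table.
# Per shape: 6 load_a + 6 load_b + 12 load_c + 32 mma + 12 store_d = 68 names.
from itertools import product

def sm70_wmma_names(shape):
    names = []
    # load_a / load_b: layout x {generic, global, shared}
    for frag in ("load_a", "load_b"):
        for layout, space in product(("col", "row"), ("", "_global", "_shared")):
            names.append(f"__cuda_sm70_wmma_{shape}_{frag}_{layout}{space}")
    # load_c additionally encodes the accumulator element type (f16/f32)
    for layout, ctype, space in product(("col", "row"), ("f16", "f32"),
                                        ("", "_global", "_shared")):
        names.append(f"__cuda_sm70_wmma_{shape}_load_c_{layout}_{ctype}{space}")
    # mma: A layout x B layout x D type x C type x optional satfinite
    for la, lb, dt, ct, sat in product(("col", "row"), ("col", "row"),
                                       ("f16", "f32"), ("f16", "f32"),
                                       ("", "_satfinite")):
        names.append(f"__cuda_sm70_wmma_{shape}_mma_{la}_{lb}_{dt}_{ct}{sat}")
    # store_d mirrors load_c
    for layout, dtype, space in product(("col", "row"), ("f16", "f32"),
                                        ("", "_global", "_shared")):
        names.append(f"__cuda_sm70_wmma_{shape}_store_d_{layout}_{dtype}{space}")
    return names

# Matches the table: e.g. m32n8k16 spans 0x18A..0x1CD = 68 consecutive IDs.
assert all(len(sm70_wmma_names(s)) == 68
           for s in ("m16n16k16", "m32n8k16", "m8n32k16"))
assert 0x1CD - 0x18A + 1 == 68
```

This regularity is consistent with the body-template construction in `sub_5D7430`, where names are built from `.rodata` prefixes plus type suffixes rather than stored whole.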
__cuda_sm80_* -- Ampere createpolicy (3 entries, 0x212--0x214, sm_80)

| ID | Hex | Name |
|----|-----|------|
| 530 | 0x212 | __cuda_sm80_createpolicy_fractional |
| 531 | 0x213 | __cuda_sm80_createpolicy_fractional_encode |
| 532 | 0x214 | __cuda_sm80_createpolicy_range_encode |

__cuda_sm_10x_* -- Blackwell hmma/imma/bit MMA (10 entries, 0x215--0x21E, sm_100)

| ID | Hex | Name |
|----|-----|------|
| 533 | 0x215 | __cuda_sm_10x_hmma_mdata_m16n8k16 |
| 534 | 0x216 | __cuda_sm_10x_hmma_mdata_m16n8k32 |
| 535 | 0x217 | __cuda_sm_10x_imma_mdata_m16n8k32 |
| 536 | 0x218 | __cuda_sm_10x_imma_mdata_m16n8k64 |
| 537 | 0x219 | __cuda_sm_10x_mma_bit_internal_and_m16n8k128 |
| 538 | 0x21A | __cuda_sm_10x_mma_bit_internal_and_m16n8k256 |
| 539 | 0x21B | __cuda_sm_10x_mma_bit_internal_and_m8n8k128 |
| 540 | 0x21C | __cuda_sm_10x_mma_bit_internal_xor_m16n8k128 |
| 541 | 0x21D | __cuda_sm_10x_mma_bit_internal_xor_m16n8k256 |
| 542 | 0x21E | __cuda_sm_10x_mma_bit_internal_xor_m8n8k128 |

__cuda_sm_8x_* -- Direct MMA + shfl (14 entries, 0x21F--0x22C, sm_80+)

| ID | Hex | Name |
|----|-----|------|
| 543 | 0x21F | __cuda_sm_8x_mma_col_col_f16_f16_f16_f16 |
| 544 | 0x220 | __cuda_sm_8x_mma_col_col_f32_f16_f16_f16 |
| 545 | 0x221 | __cuda_sm_8x_mma_col_col_f32_f16_f16_f32 |
| 546 | 0x222 | __cuda_sm_8x_mma_col_row_f16_f16_f16_f16 |
| 547 | 0x223 | __cuda_sm_8x_mma_col_row_f32_f16_f16_f16 |
| 548 | 0x224 | __cuda_sm_8x_mma_col_row_f32_f16_f16_f32 |
| 549 | 0x225 | __cuda_sm_8x_mma_row_col_f16_f16_f16_f16 |
| 550 | 0x226 | __cuda_sm_8x_mma_row_col_f32_f16_f16_f16 |
| 551 | 0x227 | __cuda_sm_8x_mma_row_col_f32_f16_f16_f32 |
| 552 | 0x228 | __cuda_sm_8x_mma_row_row_f16_f16_f16_f16 |
| 553 | 0x229 | __cuda_sm_8x_mma_row_row_f32_f16_f16_f16 |
| 554 | 0x22A | __cuda_sm_8x_mma_row_row_f32_f16_f16_f32 |
| 555 | 0x22B | __cuda_sm_8x_mma_shfl_f16 |
| 556 | 0x22C | __cuda_sm_8x_mma_shfl_f32 |

__cuda_sm_9x_* -- Hopper sub-byte/bit MMA (51 entries, 0x22D--0x25F, sm_90)

| ID | Hex | Name |
|----|-----|------|
| 557 | 0x22D | __cuda_sm_9x_mma_bit_internal_xor_m16n8k128 |
| 558 | 0x22E | __cuda_sm_9x_mma_bit_internal_xor_m16n8k256 |
| 559 | 0x22F | __cuda_sm_9x_mma_bit_internal_xor_m8n8k128 |
| 560 | 0x230 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4 |
| 561 | 0x231 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4_satfinite |
| 562 | 0x232 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4 |
| 563 | 0x233 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4_satfinite |
| 564 | 0x234 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4 |
| 565 | 0x235 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4_satfinite |
| 566 | 0x236 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4 |
| 567 | 0x237 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4_satfinite |
| 568 | 0x238 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4 |
| 569 | 0x239 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4_satfinite |
| 570 | 0x23A | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4 |
| 571 | 0x23B | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4_satfinite |
| 572 | 0x23C | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4 |
| 573 | 0x23D | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4_satfinite |
| 574 | 0x23E | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4 |
| 575 | 0x23F | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4_satfinite |
| 576 | 0x240 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4 |
| 577 | 0x241 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4_satfinite |
| 578 | 0x242 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4 |
| 579 | 0x243 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4_satfinite |
| 580 | 0x244 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4 |
| 581 | 0x245 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4_satfinite |
| 582 | 0x246 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4 |
| 583 | 0x247 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4_satfinite |
| 584 | 0x248 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4 |
| 585 | 0x249 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4_satfinite |
| 586 | 0x24A | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4 |
| 587 | 0x24B | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4_satfinite |
| 588 | 0x24C | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4 |
| 589 | 0x24D | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4_satfinite |
| 590 | 0x24E | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4 |
| 591 | 0x24F | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4_satfinite |
| 592 | 0x250 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_0 |
| 593 | 0x251 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_1 |
| 594 | 0x252 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_0 |
| 595 | 0x253 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_1 |
| 596 | 0x254 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_0 |
| 597 | 0x255 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_1 |
| 598 | 0x256 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_0 |
| 599 | 0x257 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_1 |
| 600 | 0x258 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_0 |
| 601 | 0x259 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_1 |
| 602 | 0x25A | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_0 |
| 603 | 0x25B | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_1 |
| 604 | 0x25C | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_0 |
| 605 | 0x25D | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_1 |
| 606 | 0x25E | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_0 |
| 607 | 0x25F | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_1 |
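The tail-end families are packed back to back: each heading's hex range is contiguous with the previous one, and the last sm_9x entry is ID 607 (0x25F), the final registered intrinsic. A quick arithmetic sanity check (illustrative, not ptxas code) over the ranges quoted in the headings above:

```python
# Verify the family ranges quoted in the section headings: each count matches
# its hex span, the spans are contiguous, and the last span ends at ID 607.
families = [
    ("__cuda_sm80_*",   0x212, 0x214),  # 3 Ampere createpolicy entries
    ("__cuda_sm_10x_*", 0x215, 0x21E),  # 10 Blackwell hmma/imma/bit MMA entries
    ("__cuda_sm_8x_*",  0x21F, 0x22C),  # 14 direct MMA + shfl entries
    ("__cuda_sm_9x_*",  0x22D, 0x25F),  # 51 Hopper sub-byte/bit MMA entries
]
counts = {name: hi - lo + 1 for name, lo, hi in families}
assert counts == {"__cuda_sm80_*": 3, "__cuda_sm_10x_*": 10,
                  "__cuda_sm_8x_*": 14, "__cuda_sm_9x_*": 51}

# Contiguity: each family starts one past the previous family's last ID.
for (_, _, prev_hi), (_, lo, _) in zip(families, families[1:]):
    assert lo == prev_hi + 1

# The decimal ID column is just the hex ID in base 10; the table ends at 607,
# matching the "607 Registered Entries" total (ID 0 is the null slot).
assert families[-1][2] == 0x25F == 607
```

The same check extends backward through the `__cuda_sm70_wmma_*` block, whose three shape families (68 IDs each) fill 0x122 through 0x211 without gaps before `__cuda_sm80_*` begins.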