Intrinsic Table Architecture (607 Registered Entries)
All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); function addresses in other versions will differ.
ptxas maintains two separate intrinsic subsystems that together cover every CUDA runtime helper function, every PTX opcode requiring inline code generation, and every Blackwell+ OCG builtin operation. The first subsystem (sub_5D1660 + sub_5D4190 + sub_5D7430 + sub_5FF700) handles 607 classical CUDA intrinsics and PTX opcode dispatch through a name-to-ID hash map, a body template name table, and a giant prototype generator. The second subsystem (sub_6C9EB0 and its handler cluster at 0x6C0000--0x6CC000) handles OCG (Optimized Code Generation) builtins for SM100+ targets. Both subsystems use the same hash map infrastructure (sub_425CA0 / sub_426150 / sub_426D60) documented in Hash Tables & Bitvectors.
| Component | Function |
|---|---|
| Master registration | sub_5D1660 (46KB) -- 607 CUDA intrinsics, name-to-integer-ID hash map (608 table slots, ID 0 = null) |
| Opcode dispatch | sub_5D4190 (41KB) -- ~120 PTX opcodes to codegen handlers + ~400 MMA hash entries |
| Body template names | sub_5D7430 (161KB) -- 1,079 intrinsic names constructed from .rodata prefixes + type suffixes, stored in hash map at +824 |
| Prototype generator | sub_5FF700 (354KB) -- switch generating .weak .func PTX declarations |
| OCG intrinsic table | sub_6C9EB0 (13KB) -- __nv_ptx_builtin_ocg_* dispatch for SM100+ |
| OCG router | sub_6CC690 (22KB) -- routes OCG calls to type-specific handlers |
| OCG name resolver | sub_6C9BC0 -- resolves operation names to internal enums |
| Hash map create | sub_425CA0 (initial capacity 0x80) |
| Hash map insert | sub_426150(map, name, value) |
| Hash map lookup | sub_426D60 |
Per-Family Deep Dives:
- OCG Intrinsic System -- SM100+ OCG builtins (44 operations), lowering pipeline, SASS handler map
- Math Intrinsics -- IEEE math software emulation (div, rcp, sqrt, rem)
- Tensor Core Intrinsics -- WMMA, MMA, WGMMA, tcgen05 lowering
- Sync & Warp Intrinsics -- Barriers, vote, shuffle, match, redux
System Overview
sub_451730 (intrinsic lowering context constructor)
│
├── sub_5D4190(ctx) ── register PTX opcode & MMA handlers ──────────────┐
│ │ (1) Calls sub_5D1660 to populate intrinsic ID table (607 entries) │
│ │ (2) Registers ~120 PTX opcode -> codegen handler mappings │
│ │ (3) Registers ~400 MMA hash -> codegen handler mappings │
│ │ │
│ ├─ Hash map at +808 ── PTX opcode name -> codegen function ptr │
│ │ "div" -> sub_5B76D0 (64KB) │
│ │ "sqrt" -> sub_5B4040 (49KB) │
│ │ "wmma.mma"-> sub_5C7A50 (173KB) │
│ │ "mma" -> sub_5C10A0 (120KB) │
│ │ ... ~116 more │
│ │ │
│ ├─ Hash map at +816 ── numeric MMA hash -> codegen handler ptr │
│ │ "2644314910" -> sub_4DDB80 │
│ │ ... ~399 more (shape/type/layout combinations) │
│ │ │
│ └─ ID table at +1056 ── 9728-byte array (memcpy from unk_1D4D940) │
│ Hash map at +1064 ── name -> integer ID (sub_5D1660, 607) │
│ Count at +1072 = 608 (includes null ID 0 slot) │
│ │
├── sub_4CE230(ctx) ── register modifier keywords (GUARD, PRED, ...) │
│ │
├── sub_5D7430(ctx, sregs) ── body template name table (161KB) ──────────┤
│ │ 1,079 entries, each constructed from: │
│ │ 16-byte .rodata prefix (e.g. "__cuda_sm20_div_") │
│ │ + 4-byte type suffix (e.g. "s16\0", "u64\0", "rn_f") │
│ │ → registered into hash map at +824 with sequential integer IDs │
│ │ │
│ └─ Hash map at +824 ── intrinsic name -> body template ID │
│ "__cuda_sm20_div_s16" -> 0 │
│ "__cuda_sm20_div_u16" -> 1 │
│ ... 1,079 total entries │
│ │
└── sub_451330("<fermi macros>", ...) ── load Fermi macro library │
│
sub_5FF700 (354KB) ─────────────────────────────────────────────────────────┘
│ switch(body_template_id) with hundreds of cases
│ Each case: allocate buffer via sub_4DA340, strcpy() PTX prototype
│
│ case 0: ".weak .func (.reg .s32 %d) __cuda_sm20_div_s16
│ (.reg .s32 %a0, .reg .s32 %a1)"
│ case 4: ".weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64
│ (.reg .u64 %rda1, .reg .u64 %rda2)"
│ case 9: ".weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32
│ (.reg .f32 %fa1, .reg .f32 %fa2)"
│ case 25: ".weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full
│ (...)"
│ ... hundreds more for rcp, sqrt, dsqrt, barrier, wmma, mma, etc.
v
Emitted into PTX output as .weak .func declarations
(linker resolves calls to runtime helper functions)
Master Registration -- sub_5D1660
This 46KB function is the master catalog. It allocates a 9728-byte table (memcpy from unk_1D4D940, 0x2600 bytes = 608 x 16B slots), creates a hash map with initial capacity 0x80 via sub_425CA0, then calls sub_426150(hashmap, "name", (char*)ID) exactly 607 times to register every CUDA runtime helper function with an integer ID (IDs 1--607, contiguous). The hash map is stored at a1+1064, the table at a1+1056, and the count 608 at a1+1072 (includes the unused null ID 0 slot).
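The registration flow reduces to a simple pattern, sketched below as a behavioral Python model only: the real code uses the sub_425CA0 hash map infrastructure with (char*)ID values and the pre-built 608-slot descriptor table, and just three of the 607 names are shown here.

```python
# Minimal model of the sub_5D1660 registration pattern: a name-to-ID
# hash map plus a flat table with an unused null sentinel at ID 0.
# The names below are illustrative examples, not the full 607 catalog.

def build_intrinsic_table(names):
    """Register each name with a contiguous integer ID starting at 1."""
    id_map = {}                         # models the hash map at a1+1064
    table = [None] * (len(names) + 1)   # models the 608-slot table; slot 0 is the null sentinel
    for i, name in enumerate(names, start=1):
        id_map[name] = i                # models sub_426150(map, name, (char*)i)
        table[i] = name
    return id_map, table

names = ["__cuda_sm20_div_s16", "__cuda_sm20_div_u16", "__cuda_sm20_div_u64"]
id_map, table = build_intrinsic_table(names)
assert id_map["__cuda_sm20_div_s16"] == 1
assert table[0] is None                 # count stored at +1072 = len(names) + 1
```

The stored count (608) exceeds the registration count (607) for exactly this reason: the table reserves slot 0 so that ID 0 can serve as a "not an intrinsic" value.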
Complete ID Allocation
607 intrinsics are registered with contiguous IDs from 0x01 through 0x25F. The binary stores count=608 at a1+1072 because the pre-built 9,728-byte table (608 x 16B slots) includes a null ID 0 sentinel. The ID ranges partition cleanly by SM generation and functional category.
| ID Range | Count | Prefix | Category | SM Floor |
|---|---|---|---|---|
| 0x001--0x011 | 17 | __cuda_reduxsync_* | Redux sync (b32 and/or/xor, f32 max/min/abs/NaN, s32/u32 add/max/min) | sm_70 |
| 0x012--0x018 | 7 | __cuda_sanitizer_memcheck_* | Compute-sanitizer hooks (free, generic, global, local, malloc, readmetadata, shared) | -- |
| 0x019--0x01F | 7 | __cuda_scalar_video_emulation_* | Video instruction emulation helpers | sm_20 |
| 0x020--0x02A | 11 | __cuda_sm10x_* | Blackwell tcgen05 guardrail traps + create_mask helper | sm_100 |
| 0x02B--0x03C | 18 | __cuda_sm1xx_* | Bulk copy + cp.async.bulk.tensor 1D--5D tile/im2col uni/multicast | sm_100+ |
| 0x03D--0x082 | 70 | __cuda_sm20_* | IEEE math: bfe, bfi, div, rcp, sqrt, dsqrt, drsqrt, rem (all rounding modes + slowpaths) | sm_20 |
| 0x083--0x086 | 4 | __cuda_sm3x_div_* | Optimized division variants (rn_ftz_f32, rn_noftz_f32 + slowpaths) | sm_30 |
| 0x087--0x088 | 2 | __cuda_sm62_dp2a/dp4a | Integer dot product emulation | sm_62 |
| 0x089--0x211 | 393 | __cuda_sm70_* | Volta+ intrinsics (barriers, shuffle, vote, match, WMMA -- all shapes, layouts, address spaces) | sm_70 |
| 0x212--0x214 | 3 | __cuda_sm80_* | Ampere: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode | sm_80 |
| 0x215--0x21E | 10 | __cuda_sm_10x_* | Blackwell hmma/imma mdata + bit MMA (and/xor m8n8k128/m16n8k128/m16n8k256) | sm_100 |
| 0x21F--0x22C | 14 | __cuda_sm_8x_* | Direct MMA operations (f16/f32 accum, 4 layout combos) + mma_shfl_f16/f32 | sm_80+ |
| 0x22D--0x25F | 51 | __cuda_sm_9x_* | Hopper sub-byte + bit MMA: s4/u4 dense m16n8k32/k64 + sparse m16n8k64/k128, bit xor (m8n8k128/m16n8k128/m16n8k256) | sm_90 |
Total: 607 registered intrinsics across 13 prefix groups. Table has 608 slots (ID 0 unused).
sm_70 Intrinsic Breakdown (IDs 0x89--0x211)
The sm_70 block is by far the largest at 393 entries. It covers every Volta-era warp-synchronous intrinsic plus the complete WMMA API. The explosion in count comes from the combinatorial product of shapes, layouts, data types, address spaces, and predicate/satfinite variants.
| Sub-Category | Examples | Combinatorial Source |
|---|---|---|
| barrier_arrive | 0--15, with/without count | 16 barrier IDs x 2 count variants |
| barrier_red_and/or/popc | 0--15, with/without count | 3 reduction ops x 16 IDs x 2 count |
| barrier_sync | 0--15, with/without count | 16 IDs x 2 count variants |
| matchsync_all/any_b32/b64 | with predicate variants | 2 match modes x 2 types x pred |
| shflsync_bfly/down/idx/up | with predicate variants | 4 shuffle modes x pred |
| votesync_all/any/ballot/uni | -- | 4 vote modes |
| warpsync | -- | 1 entry |
| wmma_* | m16n16k16, m32n8k16, m8n32k16 | 3 shapes x {load_a, load_b, load_c, store_d, mma} x {row, col} x {f16, f32} x {generic, global, shared} x {satfinite} |
The WMMA entries dominate the count. Each combination of shape (m16n16k16/m32n8k16/m8n32k16), operation (load_a/load_b/load_c/store_d/mma), layout (row/col for each matrix), data type (f16/f32), address space (generic/global/shared), and optional satfinite flag produces a separate intrinsic registration.
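The combinatorial product can be made concrete with a short enumeration. This is illustrative only: the axes below are simplified and per-operation filters (layouts on loads/stores, satfinite on mma, etc.) are ignored, so the count does not match the real 204 WMMA registrations.

```python
from itertools import product

# Illustrative sketch of how the WMMA variant space multiplies out.
# Four simplified axes already produce 90 distinct intrinsic names;
# the real registration adds type and satfinite axes with per-op rules.

shapes  = ["m16n16k16", "m32n8k16", "m8n32k16"]
ops     = ["load_a", "load_b", "load_c", "store_d", "mma"]
layouts = ["row", "col"]
spaces  = ["", "_global", "_shared"]

variants = [f"__cuda_sm70_wmma_{s}_{op}_{l}{sp}"
            for s, op, l, sp in product(shapes, ops, layouts, spaces)]
print(len(variants))  # 3 * 5 * 2 * 3 = 90 combinations from these four axes alone
```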
Opcode Dispatch -- sub_5D4190
This 41KB function first calls sub_5D1660(a1) to populate the intrinsic ID table, then builds two more hash maps for PTX opcode dispatch.
Named Opcode Table (at a1+808)
~120 PTX instruction names mapped to codegen handler function pointers. Each handler allocates a 50,000-byte buffer, queries instruction properties through accessor functions on the instruction object at a1+1096, and generates inline PTX code via sequential sprintf() calls.
| Category | Opcodes | Codegen Handlers |
|---|---|---|
| Math | div.full, div, rem, rcp, rsqrt, sqrt, ex2, lg2, tanh | sub_573860, sub_5B76D0 (64KB), sub_589810, sub_5B0CD0 (44KB), sub_57BFC0, sub_5B4040 (49KB), sub_583190, sub_52A5C0, sub_505B00 |
| Memory | membar, _ldldu, prefetch | sub_4DB410, sub_4DD860, sub_507FB0 |
| Conversion | cvt | sub_59F630 |
| Bit manipulation | bfind, brev, bfe, bfi, clz, popc, testp, copysign | sub_590C20, sub_50B5A0, sub_578470, sub_52E100, sub_4DBCC0, sub_4DB210, sub_581A10, sub_50B180 |
| Texture | tex, tex.base, tex.level, tld4, tex.grad | sub_584D10, sub_5879B0, sub_58B6A0, sub_56D700, sub_5ADDC0 (50KB) |
| Video (SIMD) | vadd/vsub/vmin/vmax/vabsdiff/vshl/vshr/vset/vmad (scalar), vadd2/vmax2/vmin2/vabsdiff2/vset2/vsub2/vavrg2 (packed 2x16), vadd4/vmin4/vmax4/vabsdiff4/vset4/vsub4/vavrg4 (packed 4x8) | per-instruction handlers |
| Dot product | dp2a.lo, dp2a.hi, dp4a | sub_56BA60, sub_56C8D0, sub_577BA0 |
| Barriers | bar, barrier, bar.arrive, barrier.arrive, bar.red, barrier.red, bar.cta/barrier.cta (.arrive/.red variants), bar.warp | sub_524FB0, sub_570290, sub_500BF0, sub_570940, sub_52D590, sub_5889B0, sub_56A5A0 |
| Warp | vote, shfl, match, redux | sub_580E50, sub_5801D0, sub_58A730, sub_567680 |
| Async copy | cp.async.mbarrier.arrive, cp.async.bulk, cp.async.bulk.tensor | sub_4DC180, sub_593210, sub_5AB460 (45KB) |
| Matrix | ldmatrix, movmatrix, stmatrix, st.async, red.async, st.bulk | sub_50D4B0, sub_4DAEA0, sub_4F05D0, sub_58E9B0, sub_5825A0, sub_549430 |
| Cache | createpolicy.range, createpolicy.fractional, createpolicy.cvt | per-instruction handlers |
| WMMA | wmma.load.a, wmma.load.b, wmma.load.c, wmma.store.d, wmma.mma | sub_5A2D10, sub_5A0EA0, sub_5A8E40, sub_5A6BD0, sub_5C7A50 (173KB) |
| MMA | mma | sub_5C10A0 (120KB) |
| WGMMA | wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group | sub_50AC70, sub_4DA380, sub_4DA4B0, sub_4DA5E0 |
| Multimem | multimem.ld_reduce, multimem.st, multimem.red | sub_58D8B0, sub_57B4C0, sub_50A850 |
| Tensormap | tensormap.replace | sub_57F6E0 |
| TCGen05 | tcgen05.alloc, tcgen05.relinquish_alloc_permit, tcgen05.dealloc, tcgen05.ld, tcgen05.ld.red, tcgen05.st, tcgen05.commit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.ws | sub_569180, sub_526370, sub_58C7F0, sub_574050, sub_578DB0, sub_571FE0, sub_56C190, sub_5427F0, sub_4F1A90, sub_5BBC30 (90KB), sub_58FA20 |
| TCGen05 guardrails | _tcgen05.guardrails.is_phase_valid, are_columns_allocated, is_current_warp_valid_owner, in_physical_bounds, allocation_granularity, datapath_alignment, sp_consistency_across_idesc_mod, check_sparse_usage | per-instruction handlers |
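The handler-side pattern these functions share can be sketched as a behavioral Python model. The accessor and handler shapes below are stand-ins for the sub_70xxxx accessors and sub_5xxxxx handlers; the emitted PTX is illustrative, not what any real handler produces.

```python
# Behavioral sketch of the shared codegen-handler shape: query the
# instruction object through accessors, then build inline PTX with
# sequential formatted appends. A simplified stand-in for the real
# 50,000-byte-buffer + sprintf() flow.

class Instruction:
    """Stand-in for the instruction object at a1+1096."""
    def __init__(self, dtype, operand_count):
        self.dtype = dtype                  # models the sub_70E6E0 data-type accessor
        self.operand_count = operand_count  # models the sub_70E940 operand-count accessor

def codegen_handler(instr):
    buf = []                                          # models the scratch buffer
    buf.append(f"// inline expansion for .{instr.dtype} variant")
    for i in range(instr.operand_count):              # one formatted append per step
        buf.append(f"mov.{instr.dtype} %t{i}, %a{i};")
    return "\n".join(buf)

print(codegen_handler(Instruction("b32", 2)))
```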
Numeric MMA Hash Table (at a1+816)
~400 entries where the key is a numeric string representation of a hash value (e.g., "2644314910") that encodes a specific MMA shape/type/layout combination. The hash encodes the instruction variant completely: matrix dimensions (m16n8k16, m16n8k32, etc.), data type (f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1), and layout (row/col combinations). Each entry maps to a codegen handler function pointer. This avoids a multi-dimensional lookup by collapsing the full variant space into a single hash probe.
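The single-probe dispatch can be sketched as follows. The CRC-based key function is purely an assumption for illustration -- the binary's actual encoding of shape/type/layout into keys like "2644314910" has not been recovered; only the dispatch shape matches.

```python
import zlib

# Sketch of collapsing a multi-dimensional MMA variant into one
# hash-keyed probe. The key derivation here is invented; the real
# table maps precomputed numeric strings to handler pointers.

def mma_variant_key(shape, atype, btype, ctype, layout):
    desc = f"{shape}.{atype}.{btype}.{ctype}.{layout}"
    return str(zlib.crc32(desc.encode()))   # decimal-string key, like "2644314910"

handlers = {}                                # models the hash map at a1+816
handlers[mma_variant_key("m16n8k16", "f16", "f16", "f32", "row.col")] = "sub_4DDB80"

# Dispatch is a single probe instead of a nested shape/type/layout lookup:
key = mma_variant_key("m16n8k16", "f16", "f16", "f32", "row.col")
assert handlers[key] == "sub_4DDB80"
```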
Body Template Name Table -- sub_5D7430
At 161KB of machine code (0x5D7430--0x5FF700), this is the largest function in the intrinsic infrastructure by code size and the 6th largest function in the entire ptxas binary. IDA failed to decompile it; all analysis comes from raw x86-64 disassembly. The function constructs a third hash map (at context offset +824 / 0x338) containing 1,079 entries that map dynamically constructed __cuda_* intrinsic names to sequential body template IDs (0--1078).
Why 1,079 Body Templates for 607 Logical Intrinsics
The master registration table (sub_5D1660) maps 607 intrinsic names to logical IDs. The body template table (sub_5D7430) maps 1,079 variant-specific names to prototype generator case numbers. The 1.78x expansion has one dominant cause: WMMA template proliferation across GPU generations.
The 204 logical WMMA entries in sub_5D1660 cover only the original sm_70 Volta shapes (m16n16k16/m32n8k16/m8n32k16 with f16/f32 types). But the body template table includes all later-generation WMMA variants -- sm7x sub-byte/bit, sm72 integer, sm8x tf32/bf16/f64 -- that were added as hardware evolved. These ~416 extra WMMA templates have no matching entry in the 607 logical ID table; they exist only in the body template hash map and the prototype generator switch.
Non-WMMA intrinsics map approximately 1:1 between logical IDs and body templates. The math operations (div, rcp, sqrt) are already fully type-specialized at the logical level -- each rounding-mode/type combination is a separate logical intrinsic.
Three sources of expansion beyond the 607 logical entries:
- Later-generation WMMA variants (~416 template-only entries):
- sm7x sub-byte WMMA (s4/u4 m8n8k32) + bit WMMA (m8n8k128): ~231 templates
- sm72 integer WMMA (m16n16k16/m32n8k16/m8n32k16 integer types): ~105 templates
- sm8x tf32 WMMA (m16n16k8) + bf16/f64 WMMA: ~80 templates
- Aligned warp sync variants (~13 extra templates): matchsync_aligned, votesync_aligned, votesync_ballot_groupwise, query_activemask/query_activemask_groupwise for cooperative group support
- Additional SM100 specializations (~8 extra templates): tcgen05_alloc_two_sm, extra guardrails check variants, get_warp_rank
Conversely, 18 sm1xx bulk copy intrinsics have logical IDs but zero body templates -- they bypass the template/prototype mechanism entirely and are lowered directly to inline PTX by the opcode dispatch handlers (sub_593210, sub_5AB460).
Template Distribution Table
| Logical Group | Logical | Template | Factor |
|---|---|---|---|
| SM20 IEEE math (div, rem, rcp, sqrt, bfe/bfi) | 70 | 70 | 1.0x |
| SM3x optimized division | 4 | 4 | 1.0x |
| SM62 integer dot product | 2 | 2 | 1.0x |
| SM70 barriers | 170 | 170 | 1.0x |
| SM70 warp sync (match, vote, shfl, query) | 19 | 32 | 1.7x |
| SM70 WMMA (f16/f32 original Volta) | 204 | 249 | 1.2x |
| SM7x WMMA extended (sub-byte, bit) | 0 | 231 | tmpl-only |
| SM72 WMMA (integer) | 0 | 105 | tmpl-only |
| SM8x WMMA (tf32, bf16, f64) | 0 | 80 | tmpl-only |
| SM80 cache policy | 3 | 4 | 1.3x |
| SM8x direct MMA | 14 | 14 | 1.0x |
| SM9x Hopper sub-byte/bit MMA | 51 | 52 | 1.0x |
| SM10x Blackwell MMA metadata | 10 | 10 | 1.0x |
| SM100 tcgen05 + guardrails | 11 | 19 | 1.7x |
| SM100+ bulk copy / TMA | 18 | 0 | (no templates) |
| Redux sync primitives | 17 | 17 | 1.0x |
| Compute-sanitizer hooks | 7 | 7 | 1.0x |
| Video instruction emulation | 7 | 7 | 1.0x |
| Total | 607 | 1,073 | 1.78x |
WMMA subtotal: 204 logical entries expand to 665 body templates (3.3x). Non-WMMA: 403 logical entries map to 408 templates (~1.0x). The remaining 7 templates (1,080 prototype switch cases minus 1,073 classified) are sanitizer/cache variants where IDA produced qmemcpy instead of strcpy, preventing exact name extraction.
The sm70 WMMA group itself expands from 204 to 249 templates because the prototype generator includes update_ptr and desc (descriptor-based addressing) variants of certain load/store operations that the logical table does not separate.
The three "tmpl-only" WMMA rows (sm7x/sm72/sm8x) are the single largest contributor to the expansion. They represent ~416 templates with zero logical ID counterparts. These families use .FORCE_INLINE .func linkage in their prototypes instead of the .weak .func used by the original sm70 WMMA entries:
sm70 (original): .weak .func (...) __cuda_sm70_wmma_m16n16k16_load_a_col (...)
sm72 (integer): .FORCE_INLINE .func (...) __cuda_sm72_Integer_wmma_m16n16k16_load_a_row (...)
sm7x (sub-byte): .FORCE_INLINE .func (...) __cuda_sm7x_sub_byte_wmma_m8n8k32_load_a_row (...)
sm8x (tf32): .FORCE_INLINE .func (...) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (...)
The .FORCE_INLINE directive forces inlining at every call site rather than emitting a separate callable function. The later-gen WMMA implementations are more complex and performance-sensitive, making per-call-site specialization profitable.
Name Construction Algorithm
The function contains zero string references because it constructs all 1,079 names at runtime. For each entry:
- Allocate a 20-byte buffer via sub_424070(allocator, 20)
- Copy prefix (16 bytes) from .rodata via SSE movdqa+movups (e.g., "__cuda_sm20_div_")
- Append suffix (4 bytes) via movl immediate at offset +16 (e.g., "s16\0", "u64\0", "rn_f")
- Register via sub_426150(context+824, buffer, template_id) with sequential integer IDs
The 533 unique .rodata prefix addresses fan out through multiple suffixes per prefix:
.rodata prefix (16B) suffix (4B) result (20B buffer)
─────────────────────── ─────────── ──────────────────────
"__cuda_sm20_div_" + "s16\0" = "__cuda_sm20_div_s16"
"__cuda_sm20_div_" + "u16\0" = "__cuda_sm20_div_u16"
"__cuda_sm20_div_" + "u64\0" = "__cuda_sm20_div_u64"
"__cuda_sm20_div_" + "s64\0" = "__cuda_sm20_div_s64"
"__cuda_sm20_div_" + "rn_f" = "__cuda_sm20_div_rn_f" (truncated)
"__cuda_sm20_rem_" + "s16\0" = "__cuda_sm20_rem_s16"
"__cuda_sm20_rcp_" + "rn_f" = "__cuda_sm20_rcp_rn_f" (truncated)
"__cuda_sm70_barr" + "ier_" = "__cuda_sm70_barrier_" (prefix chain)
Names truncated at the 20-byte buffer limit are still sufficient for hash map lookup -- the full untruncated name appears only inside the prototype string in sub_5FF700.
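A few lines of Python model the construction and its truncation behavior. The 20-byte buffer size and the 16+4 prefix/suffix split are from the disassembly; the helper name is invented.

```python
# Model of the 20-byte-buffer name construction: a 16-byte prefix copied
# whole, a 4-byte suffix appended, NUL-stripped for hash map lookup.
# Names longer than 20 bytes come out truncated, exactly as the keys do.

def build_template_name(prefix16, suffix4):
    assert len(prefix16) == 16 and len(suffix4) == 4
    buf = (prefix16 + suffix4).encode()      # 20-byte buffer, no room for more
    return buf.split(b"\0", 1)[0].decode()   # the key as seen by the hash map

assert build_template_name("__cuda_sm20_div_", "s16\0") == "__cuda_sm20_div_s16"
assert build_template_name("__cuda_sm20_div_", "rn_f") == "__cuda_sm20_div_rn_f"  # truncated
```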
Worked Example: Division (Cases 0--26)
The __cuda_sm20_div operation illustrates the template-to-prototype mapping. Division has 19 logical IDs and 19 body templates (1:1 ratio) because each type/rounding/precision variant is already a separate logical intrinsic. The suffix encodes the type specialization:
| Case | Body Template Name | Type Suffix | PTX Signature |
|---|---|---|---|
| 0 | __cuda_sm20_div_s16 | s16 | (.reg .s32 %d) ... (.reg .s32 %a0, .reg .s32 %a1) |
| 1 | __cuda_sm20_div_u16 | u16 | (.reg .u32 %d) ... (.reg .u32 %a0, .reg .u32 %a1) |
| 4 | __cuda_sm20_div_u64 | u64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 5 | __cuda_sm20_div_s64 | s64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | __cuda_sm20_div_rn_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 10 | __cuda_sm20_div_rd_f32 | rd_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 14 | __cuda_sm20_div_rn_ftz_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 22 | __cuda_sm20_div_ru_f64_v2 | ru_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |
| 25 | __cuda_sm20_div_rn_f64_full | rn_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |
Cases 2--3 (rem s16/u16), 6--7 (rem s64/u64) are interleaved between the division entries. Cases 8, 13 are _slowpath variants that implement Newton-Raphson refinement fallbacks. Cases 18--21 are the sm3x-optimized division variants with the same suffix scheme. Note: s16/u16 division uses .s32/.u32 register types because PTX has no 16-bit register class; the 16-bit operation is performed by 32-bit hardware with appropriate sign/zero extension.
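The sign-extension scheme in that note can be modeled directly -- a behavioral sketch of the semantics, not the body of the emitted helper.

```python
# Behavioral model of 16-bit signed division on 32-bit hardware:
# operands are sign-extended to 32 bits, divided with truncation
# toward zero (C semantics), and the low 16 bits of the quotient kept.

def sext16(x):
    """Model an .s16 operand arriving in an .s32 register."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def div_s16(a, b):
    q = int(sext16(a) / sext16(b))   # truncating division toward zero
    return q & 0xFFFF                # low 16 bits back into the .s32 result

assert div_s16(0xFFFF, 2) == 0       # -1 / 2 truncates toward zero
assert div_s16(100, 7) == 14
```

The u16 variant is identical except that it zero-extends instead of sign-extending.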
Statistics
| Metric | Value |
|---|---|
| Machine code size | 164,560 bytes (0x5D7430--0x5FF700) |
| sub_426150 calls | 1,079 |
| Unique .rodata prefix addresses | 533 |
| Hash map destination | context+824 (0x338) |
| Buffer size per entry | 20 bytes |
| IDA decompilation | Failed (function too large/repetitive) |
Context Hash Map Summary
The intrinsic lowering context object holds five hash maps and one flat table:
| Offset | Field | Builder | Contents | Entries |
|---|---|---|---|---|
| +808 | opcode handlers | sub_5D4190 | PTX opcode name -> codegen fn ptr | ~120 |
| +816 | MMA hash handlers | sub_5D4190 | numeric hash -> codegen fn ptr | ~400 |
| +824 | body templates | sub_5D7430 | intrinsic name -> template ID | 1,079 |
| +1056 | descriptor table | sub_5D1660 | 608 x 16B intrinsic descriptor slots | 608 |
| +1064 | ID map | sub_5D1660 | intrinsic name -> logical ID (1-607) | 607 |
| +1072 | count | sub_5D1660 | 608 (includes null slot 0) | -- |
Instruction Property Accessors
All codegen handlers query instruction properties through accessor functions on the instruction object at a1+1096. These are the same accessors used by WMMA, MMA, and tcgen05 codegen.
| Accessor | Purpose | Usage Example |
|---|---|---|
| sub_70B6E0 | Check if feature enabled | sub_70B6E0(obj) -- boolean feature gate |
| sub_70B780 | Get feature parameter | Numeric feature parameter |
| sub_70FA00 | Check instruction capability for SM | sub_70FA00(*, 23) = texture, sub_70FA00(*, 29) = tcgen05 |
| sub_70E940 | Get operand count | Number of operands |
| sub_70E6E0 | Get data type | Operand data type enumeration |
| sub_70ACC0 | Get accumulator type | MMA accumulator data type |
| sub_709860 | Get register type/size | Register class and width |
| sub_70F460 | Get layout variant | row/col matrix layout |
| sub_707D60 | Check MMA shape variant | m16n16k16 vs m32n8k16, etc. |
| sub_709910 | Check sparse mode | Sparse MMA variant flag |
| sub_70F650 | Get matrix dimension (M/N) | Matrix size parameter |
| sub_70F600 | Get matrix dimension (K) | Alternate dimension parameter |
| sub_70CA60 | Get operand type by index | sub_70CA60(*, 0) -- type of first operand (21 = specific type, 58 = f32, 59 = f64) |
| sub_70BA40 | Texture mode query | Texture sampling mode |
| sub_70BD50 | Sampler mode query | Texture sampler configuration |
| sub_70BB20 | Bulk tensor mode | cp.async.bulk.tensor transfer mode |
| sub_70F0A0 | Get sparse metadata | Sparse matrix metadata parameter |
Prototype Generator -- sub_5FF700
At 354KB, this is the single largest function in the intrinsic infrastructure and the 2nd largest function in the entire ptxas binary. It takes a body template ID (a1, range 0--1079) and an allocator context (a2), allocates a buffer via sub_4DA340(size, a2), fills it with a PTX prototype string via strcpy(), and returns the result. The output is a complete .weak .func or .FORCE_INLINE .func PTX declaration that gets emitted into the PTX output stream so the linker can resolve calls to CUDA runtime helper functions.
The function is a single switch(a1) with 1,080 case labels (0--1079) plus a default case that returns an empty string "". Each case allocates an exact-sized buffer (72--1,200 bytes), copies a hardcoded PTX prototype string into it, and returns the pointer.
Prototype Generator Architecture
sub_5FF700(template_id, allocator)
│
│ switch(template_id) ← 1,080 cases, 0--1079
│
├── case N:
│ buf = sub_4DA340(byte_count, allocator) ← allocate exact-fit buffer
│ strcpy(buf, ".weak .func (...) name (...)") ← copy PTX prototype
│ return buf
│
├── case M: (45 large WMMA mma cases)
│ buf = sub_4DA340(byte_count, allocator) ← up to 1,200 bytes
│ *(u64*)buf = 0x662E206B6165772E ← ".weak .f" (inline store)
│ *(u64*)(buf+N-8) = <trailer> ← last 8 bytes inline
│ qmemcpy(buf+8, .rodata_addr, size-16) ← bulk copy middle
│ return buf
│
└── default:
return ""
Three copy strategies appear in the decompilation, all producing the same result:
| Strategy | Cases | Trigger | Max Size |
|---|---|---|---|
| strcpy() with inline string literal | 1,035 | Prototype fits in decompiler string threshold | ~520 bytes |
| qmemcpy() with QWORD bookend stores | 45 | Prototype too long for IDA to reproduce as literal | 1,200 bytes |
| Indirect variable assignment + copy | ~130 | IDA SSA split (subset of strcpy) | ~120 bytes |
The qmemcpy cases are the 45 WMMA mma operations with the largest parameter lists (3--4 fragment matrices of 8 elements each). IDA stores the first and last 8 bytes as inline immediates (0x662E206B6165772E = ".weak .f", trailer varies per case) and bulk-copies the middle from .rodata. The prototype content is structurally identical to the strcpy cases.
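Collapsed to its essentials, the generator is a table lookup returning an exact-fit copy. In the Python sketch below, the dict stands in for the 1,080-case switch; the two prototype strings shown are from the division worked example above.

```python
# Sketch of the sub_5FF700 shape: one switch over template IDs, each case
# returning an exact-fit copy of a hardcoded prototype string, with ""
# for unknown IDs. Only two of the 1,080 cases are shown.

PROTOTYPES = {
    0: ".weak .func (.reg .s32 %d) __cuda_sm20_div_s16 (.reg .s32 %a0, .reg .s32 %a1)",
    9: ".weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2)",
}

def prototype_for(template_id):
    proto = PROTOTYPES.get(template_id, "")   # the default case returns ""
    return bytearray(proto.encode())          # models sub_4DA340 exact-fit buffer + strcpy

assert b"__cuda_sm20_div_s16" in prototype_for(0)
assert prototype_for(12345) == b""
```

Whether a given case uses strcpy, qmemcpy with bookend stores, or indirect assignment is purely a decompiler artifact; all three strategies produce the same exact-fit buffer.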
Linkage Directives
Two PTX linkage types are emitted, controlling how the linker handles the declared function:
| Directive | Count | Meaning | Used By |
|---|---|---|---|
.weak | 616 | Overridable by user code; linker uses user version if present | SM20 math, SM70 barriers/sync/WMMA (original Volta), SM80 cache policy, SM8x/9x/10x MMA, redux sync, sanitizer hooks, video emulation, dp2a/dp4a |
.FORCE_INLINE | 464 | Inlined at every call site; no separate callable function | SM70 aligned vote/match/query_activemask, SM7x sub-byte/bit WMMA, SM72 integer WMMA, SM8x tf32/bf16/f64 WMMA, SM10x tcgen05 alloc/guardrails, SM80 createpolicy_fractional |
The .weak linkage supports user-supplied replacements: if the user provides their own implementation of __cuda_sm20_div_s16, the linker will use that instead of the built-in runtime version. The .FORCE_INLINE directive forces per-call-site specialization -- the later-generation WMMA implementations are more complex and performance-sensitive, making inlining profitable.
A subset of .weak prototypes (~410) carry the .unique qualifier:
.weak .func (.reg .b32 dst) __cuda_sm70_barrier_sync (.reg .b32 arg0) .unique ;
.unique instructs the PTX linker to keep exactly one copy of the function body even if multiple compilation units reference it. All barriers, redux sync, warpsync, non-aligned vote/match/shuffle use .unique.
Prototype Format
Every emitted prototype follows one of these structural patterns:
<linkage> .func (<return_params>) <name> (<input_params>) [.unique] ;
| Case | Prototype |
|---|---|
| 0 | .weak .func (.reg .s32 %d) __cuda_sm20_div_s16 (.reg .s32 %a0, .reg .s32 %a1) |
| 4 | .weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64 (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | .weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) |
| 25 | .weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full (.reg .f64 %fda1, .reg .f64 %fda2) |
| 76 | .weak .func (.reg .b32 dst) __cuda_reduxsync_b32_xor (.reg .b32 src, .reg .b32 mask) .unique |
| 303 | .weak .func (.param .align 16 .b32 dst[8]) __cuda_sm70_wmma_m16n16k16_load_a_row (.reg .u64 ptr, .reg .u32 ldm) .unique |
| 666 | .FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) |
| 890 | .weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32 (.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, ...) |
| 1055 | .FORCE_INLINE .func (.reg .b32 warp_rank) __cuda_sm10x_get_warp_rank () |
| 1073 | .weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_readmetadata (.param .b64 ..._param_0, .param .b64 ..._param_1) |
Parameter Passing Conventions
Five distinct parameter-passing ABIs appear across the 1,080 prototypes:
Convention A -- Register-only (.reg): Used by math operations, barriers, warp sync, redux sync, video emulation. Return and input parameters are individual .reg declarations with typed names. This is the simplest and most common convention.
.weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) ;
Convention B -- Param-array with alignment (.param .align N .b32 name[K]): Used by WMMA load/mma, MMA, Hopper sub-byte MMA, Blackwell MMA. Returns an aligned array of .b32 elements. Array sizes: dst[2], dst[3], dst[4], dst[5], dst[8], mma_dst[2], mma_dst[4], mma_dst[8], ret_dst[3], ret_dst[5]. 326 prototypes use .align 16; 1 prototype (mma_shfl_f16) uses .align 8.
.weak .func (.param .align 16 .b32 d[8]) __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32
(.param .align 16 .b32 a[8], .param .align 16 .b32 b[8], .param .align 16 .b32 c[8]) ;
Convention C -- Param-scalar (.param .b64): Used exclusively by the 7 compute-sanitizer hooks. Parameters use fully-qualified names (__cuda_sanitizer_memcheck_malloc_param_0).
.weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_malloc
(.param .b64 __cuda_sanitizer_memcheck_malloc_param_0,
.param .b64 __cuda_sanitizer_memcheck_malloc_param_1) ;
Convention D -- Void return (): Used by WMMA store_d, tcgen05 guardrail traps, sanitizer_free. ~140 prototypes (45 .weak + 95 .FORCE_INLINE).
.weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32
(.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, .reg .b32 sreg1, ...) ;
Convention E -- Multi-register return (.FORCE_INLINE only): Used by extended WMMA load operations (SM7x/SM72/SM8x). Returns 1--4 registers in the return position (never 8 -- 8-element returns use Convention B's .param arrays instead).
.FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3)
__cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) ;
PTX Register Types
Eight PTX register types appear across the prototypes:
| Type | Approx. Count | Usage |
|---|---|---|
| .reg .b32 | ~2,925 | Dominant: barrier args, WMMA/MMA fragments, guardrail params |
| .reg .u64 | ~520 | Pointers (WMMA/MMA base addresses) |
| .reg .u32 | ~341 | Integer params (leading dimension, counts, offsets) |
| .reg .b64 | ~246 | 64-bit bitwise (match bitmask, shuffle predicates, retval) |
| .reg .f32 | ~106 | Float math (div, rcp, sqrt) |
| .reg .f64 | ~70 | Double math (div, rcp, sqrt, dsqrt) |
| .reg .pred | ~10 | Predicate (vote output, matchsync predicate out) |
| .reg .s32 | ~6 | Signed 32-bit (SM20 div/rem s16 return values only) |
Note: .b32 is used instead of .s32/.u32/.f32 for operations where the type interpretation is determined by the instruction rather than the register declaration (WMMA fragments, MMA accumulators, barrier IDs). The .s32 type appears only in the 4 oldest SM20 div/rem_s16/u16 prototypes (cases 0--3).
Register Naming Convention
The prototype register names encode the data type and role:
| Prefix | Meaning |
|---|---|
| %d | 32-bit integer return value (SM20 div/rem s16/u16 only) |
| %a0, %a1 | 32-bit integer input parameters |
| %rdv1 | 64-bit integer return value |
| %rda1, %rda2 | 64-bit integer input parameters |
| %fv1 | f32 return value |
| %fa1, %fa2 | f32 input parameters |
| %fdv1 | f64 return value |
| %fda1, %fda2 | f64 input parameters |
| %fdnum, %fdden | f64 numerator/denominator (div_f64_v2 variants) |
| dst, dst0..dst7 | Generic output registers (WMMA load, barriers) |
| src, sreg0..sreg7 | Generic input registers (WMMA store) |
| ptr, base | 64-bit pointer registers |
| ldm | Leading dimension parameter (WMMA) |
| mask | Warp participation mask |
| cnt | Thread count (barrier_sync_count, barrier_arrive_count) |
| arg0..arg3 | Generic numbered arguments |
| parg | Predicate argument (vote) |
| retVal, dummy | Return/placeholder (tcgen05 guardrails) |
| activemask, warp_rank | Cooperative group queries |
Buffer Allocation Sizes
sub_4DA340(size, allocator) allocates an exact-fit buffer per prototype:
| Metric | Value |
|---|---|
| Minimum allocation | 72 bytes |
| Maximum allocation | 1,200 bytes |
| Median allocation | ~130 bytes |
| Most common sizes | 132 (37x), 182 (31x), 192 (30x), 125 (29x), 118 (28x) |
| Total allocations | 1,080 |
The 45 qmemcpy cases have the largest buffers: 386--1,200 bytes. These are WMMA mma operations whose prototypes enumerate all 3--4 fragment matrices (a, b, c, d) with 4--8 elements each, producing prototype strings that exceed 900 bytes.
Case Range Layout
The 1,080 cases follow the body template registration order from sub_5D7430, roughly grouped by SM generation:
| Case Range | Count | Category | Linkage |
|---|---|---|---|
| 0--69 | 70 | SM20 IEEE math (div, rem, rcp, sqrt, bfe, bfi, dsqrt, drsqrt) | .weak |
| 70--73 | 4 | SM3x optimized division (rn_ftz/noftz f32 + slowpaths) | .weak |
| 74--75 | 2 | SM62 dp2a/dp4a | .weak |
| 76--92 | 17 | Redux sync (b32/s32/u32/f32 add/max/min/xor/and/or/abs/NaN) | .weak .unique |
| 93--~274 | ~182 | SM70 barriers (sync/arrive/red, 16 IDs x with/without count) | .weak .unique |
| ~275--~302 | ~28 | SM70 vote, shuffle, match (bfly/down/idx/up, all/any/b32/b64) | .weak .unique / .FORCE_INLINE |
| ~303--~665 | ~363 | SM70 WMMA load/store (m16n16k16, m32n8k16, m8n32k16, all types/spaces) | .weak .unique |
| ~666--~889 | ~224 | SM7x/SM72/SM8x extended WMMA (sub-byte, integer, tf32, bf16, f64) | .FORCE_INLINE |
| ~890--~964 | ~75 | SM70 WMMA store_d (all shapes/layouts/spaces/types) | .weak |
| ~965--~1048 | ~84 | SM70 WMMA mma + SM8x/SM9x/SM10x MMA (f16/f32, sub-byte, bit, sparse) | .weak |
| ~1049--~1055 | ~7 | SM10x tcgen05 guardrail traps | .weak |
| ~1056--~1060 | ~5 | SM8x direct MMA (mma_shfl, row/col f16/f32 combos) | .weak |
| ~1061--~1072 | ~12 | SM10x tcgen05 alloc/guardrails check functions + get_warp_rank + create_mask | .FORCE_INLINE / .weak |
| 1073--1079 | 7 | Compute-sanitizer hooks (readmetadata, generic, global, local, shared, malloc, free) | .weak |
Statistics
| Metric | Value |
|---|---|
| Machine code size | 362,496 bytes (0x5FF700--0x658B00) |
| Decompiled lines | 9,414 |
| Switch cases | 1,080 (case 0 through case 1079 + default) |
| Local variables declared | ~716 (IDA SSA artifacts) |
| .weak prototypes | 616 (571 strcpy + 45 qmemcpy) |
| .FORCE_INLINE prototypes | 464 |
| .unique-qualified prototypes | ~410 |
| .param .align prototypes | 327 (326 align-16, 1 align-8) |
| Void-return prototypes | ~140 |
| Predicate-using prototypes | ~10 |
Major Codegen Handlers
The four largest codegen handlers together represent ~450KB of code, covering the tensor core instruction families plus division.
sub_5C7A50 -- WMMA.MMA Codegen (173KB)
The largest codegen handler. Generates inline PTX code for wmma.mma instructions across all variant combinations.
- Allocates a 50,000-byte buffer for code generation
- Covers shapes: m16n16k16, m32n8k16, m8n32k16
- Data types: f16, f32, bf16, tf32, s8, u8, s4, u4, b1
- Layouts: row/col for each of the A, B, C, D matrices (4 layout combinations)
- Satfinite variants for each configuration
- Address spaces: generic, global, shared
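The size of this handler follows directly from the variant cross-product. A rough enumeration over the dimensions listed above (a sketch only -- the exact legal-combination filter inside ptxas is not known, so the real variant count is smaller):

```python
from itertools import product

# Dimensions as listed for sub_5C7A50; not every combination is legal PTX.
shapes   = ["m16n16k16", "m32n8k16", "m8n32k16"]
ab_types = ["f16", "bf16", "tf32", "s8", "u8", "s4", "u4", "b1"]
layouts  = ["row", "col"]          # chosen independently for A and B
satf     = [False, True]

variants = list(product(shapes, ab_types, layouts, layouts, satf))
print(len(variants))               # 3 * 8 * 2 * 2 * 2 = 192 raw combinations
```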
sub_5C10A0 -- MMA Codegen (120KB)
Handles the newer mma.sync API (non-WMMA). Covers the post-Volta PTX MMA instructions.
- Shapes: m8n8k4, m16n8k8, m16n8k16, m16n8k32, m16n8k64, m16n8k128, m16n8k256
- Types: f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1
- Sparse variants for sm_80+ and sm_90+ (structured sparsity 2:4)
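The 2:4 structured-sparsity constraint mentioned above is easy to state in code: within every aligned group of four consecutive elements along K, at most two may be nonzero. A sketch of the constraint itself (not of ptxas internals):

```python
def is_2_to_4_sparse(row):
    """True if every aligned group of 4 elements has at most 2 nonzeros."""
    assert len(row) % 4 == 0
    return all(
        sum(1 for x in row[i:i + 4] if x != 0) <= 2
        for i in range(0, len(row), 4)
    )

print(is_2_to_4_sparse([1, 0, 2, 0,  0, 0, 3, 4]))   # True
print(is_2_to_4_sparse([1, 2, 3, 0,  0, 0, 0, 0]))   # False: first group has 3
```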
sub_5BBC30 -- TCGen05.MMA Codegen (90KB)
Blackwell 5th-generation tensor core MMA code generation. Handles the tcgen05.mma instruction family introduced in sm_100.
- Allocates a 50,000-byte buffer
- Queries sub_70FA00(*, 29) to validate tcgen05 capability
- Handles standard, sparse, and warp-shared (.ws) variants
- Uses sub_70F0A0 for sparse metadata parameter extraction
- Generates code for tcgen05-specific tensor memory addressing
sub_5B76D0 -- Division Codegen (64KB)
Generates inline PTX code for all div variants.
- Integer division: s16, s64, u16, u64
- Floating-point division: f32, f64 with all rounding modes (rn, rd, ru, rz)
- Flush-to-zero (ftz) variants for f32
- Checks operand type via sub_70CA60(*(_QWORD *)(a1+1096), 0) == 21
- Emits both fastpath and slowpath (Newton-Raphson) code sequences
OCG Intrinsic System -- sub_6C9EB0
The OCG (Optimized Code Generation) intrinsic subsystem is a separate, parallel dispatch mechanism for SM100+ builtin operations. While the classical system at sub_5D1660 maps CUDA runtime helper names to integer IDs and generates inline PTX, the OCG system maps __nv_ptx_builtin_ocg_* function names to type-specific handlers that validate parameters and emit SASS instructions directly -- bypassing PTX entirely. The OCG table contains 44 operations across 9 categories: arithmetic, packed float, vector integer, async copy/TMA, load/store/cache, reduction/fence, tensor core, tensor memory, and synchronization.
See OCG Intrinsic System (44 Operations) for the complete builtin name table, handler functions, validation strings, SASS-level handlers, and the full five-stage lowering pipeline with operand buffer layout.
Intrinsic Families by SM Generation
Each SM generation introduces new intrinsic families while preserving all earlier ones. The per-SM intrinsic table initializer functions (sub_60AXXX cluster, registered in Map 3 of the capability dispatch) control which intrinsics are available on each target.
sm_20 -- Software IEEE Math (70 entries)
The foundation layer. 70 intrinsics providing IEEE-754-compliant software implementations of math operations that either lack hardware support or need exact rounding guarantees. All later SM targets inherit these.
- Division: div_s16, div_u64, div_rn_f32, div_rn_f64_full, etc. -- all rounding modes (rn/rd/ru/rz) and types (s16/s64/u16/u64/f32/f64)
- Reciprocal: rcp_rn_f32, rcp_rn_f64, etc. -- all rounding modes
- Square root: sqrt_rn_f32, sqrt_rn_f64, etc. -- all rounding modes
- Double-precision sqrt: dsqrt_rn, dsqrt_rd, dsqrt_ru, dsqrt_rz
- Double-precision reciprocal sqrt: drsqrt_rn
- Bit extract/insert: bfe (bit field extract), bfi (bit field insert)
- Remainder: rem_s16, rem_u16, rem_s64, rem_u64
Codegen handlers: sub_5B76D0 (div, 64KB), sub_5B0CD0 (rcp, 44KB), sub_5B4040 (sqrt, 49KB).
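The bfe/bfi pair above implements PTX bit-field extract and insert. Their core semantics can be sketched in the unsigned 64-bit flavor matching the registered bfe_u64_/bfi_u64_ intrinsics (the real instructions additionally clamp their position/length operands, which this sketch omits):

```python
MASK64 = (1 << 64) - 1

def bfe_u64(a, pos, length):
    """Unsigned bit-field extract: `length` bits of a starting at bit pos."""
    if length == 0:
        return 0
    return (a >> pos) & ((1 << length) - 1)

def bfi_u64(a, b, pos, length):
    """Bit-field insert: low `length` bits of a placed into b at bit pos."""
    mask = (((1 << length) - 1) << pos) & MASK64
    return (b & ~mask & MASK64) | ((a << pos) & mask)

print(hex(bfe_u64(0xFF00, 8, 8)))        # 0xff
print(hex(bfi_u64(0xAB, 0x1000, 4, 8)))  # 0x1ab0
```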
sm_3x -- Optimized Division (4 entries)
Four optimized division variants introduced on Kepler to improve throughput on common division patterns.
sm_62 -- Integer Dot Product (2 entries)
dp2a and dp4a integer dot product intrinsics introduced on Pascal (GP10x). Software emulation of the hardware instructions added in sm_61/sm_62.
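dp4a's semantics -- a four-way byte dot product accumulated into a 32-bit integer -- can be sketched as follows (signed-times-signed flavor; the hardware instruction also supports unsigned and mixed-signedness operands):

```python
def s8(x):
    """Interpret the low byte of x as a signed 8-bit value."""
    x &= 0xFF
    return x - 256 if x >= 128 else x

def dp4a_s32(a, b, c):
    """d = c + sum of products of the four signed bytes of a and b."""
    return c + sum(s8(a >> (8 * i)) * s8(b >> (8 * i)) for i in range(4))

print(dp4a_s32(0x01010101, 0x02020202, 10))   # 10 + 4*(1*2) = 18
print(dp4a_s32(0xFF000000, 0x01000000, 0))    # (-1)*1 = -1
```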
sm_70 -- Volta Warp-Synchronous + WMMA (393 entries)
The largest single block. Volta introduced mandatory warp-synchronous programming with explicit sync masks and the first generation of tensor core (WMMA) instructions.
Synchronization primitives:
- barrier_arrive / barrier_sync / barrier_red (barrier IDs 0--15, with/without count)
- matchsync_all/any_b32/b64 with predicate variants
- shflsync_bfly/down/idx/up with predicate variants
- votesync_all/any/ballot/uni
- warpsync
WMMA (Warp Matrix Multiply-Accumulate):
- Shapes: m16n16k16, m32n8k16, m8n32k16
- Operations per shape: load_a, load_b, load_c, store_d, mma
- Layouts: row/col combinations for A and B matrices
- Types: f16, f32 (with satfinite optional)
- Address spaces: generic, global, shared
sm_80 -- Ampere Cache Policy (3 entries)
Three createpolicy intrinsics for L2 cache management: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode.
sm_10x -- Blackwell MMA Metadata + Bit MMA (10 entries)
10 hmma_mdata/imma_mdata + bit MMA intrinsics for sm_100: metadata variants at m16n8k16/k32/k64 shapes, and 1-bit AND/XOR MMA at m8n8k128/m16n8k128/m16n8k256.
sm_8x -- Direct MMA (14 entries)
14 mma_* intrinsics for sm_8x: 12 direct MMA operations (f16/f32 accumulator x 4 layout combinations of col/row for A and B) plus mma_shfl_f16 and mma_shfl_f32 for register-to-register MMA shuffle.
sm_9x -- Sub-Byte + Bit MMA (51 entries)
51 Hopper-era intrinsics: 3 bit-XOR MMA (m8n8k128/m16n8k128/m16n8k256), 24 dense sub-byte MMA (s4/u4 at m16n8k32/m16n8k64/m8n8k32, with satfinite), 8 sparse m16n8k128, and 16 sparse m16n8k64 (with _0/_1 split variants and satfinite).
sm_10x (via __cuda_sm10x_*) -- Blackwell Tensor Memory + Guardrails (11 entries)
- 1 create_mask_from_bit_idx_and_alloc_size_v1 helper
- 10 tcgen05_guardrail_trap_* intrinsics for debug validation of tensor memory operations
sm_1xx -- Bulk Copy (18 entries)
18 bulk copy and cp.async.bulk.tensor intrinsics covering 1D through 5D tensor copies with tile and im2col addressing modes, both unicast and multicast variants.
Intrinsic Lookup Flow
The lookup path from a function call in PTX source to the codegen handler follows this sequence:
PTX source: call.uni __cuda_sm70_warpsync, (%mask);
|
v
sub_5D1660 hash map (a1+1064)
key: "__cuda_sm70_warpsync"
value: integer ID (within 0x89..0x211 range)
|
v
sub_5FF700 switch(ID)
Emits: .weak .func __cuda_sm70_warpsync (.reg .u32 %a0)
|
v
sub_5D4190 named opcode hash map (a1+808)
key: PTX opcode (e.g., "shfl", "vote", "barrier")
value: codegen handler function pointer
|
v
Codegen handler (e.g., sub_5801D0 for "shfl")
Queries instruction properties via sub_70XXXX accessors
Generates inline PTX code into 50KB buffer
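Structurally, the two hash-map hops above behave like the toy model below. The entries are illustrative stand-ins (the ID value and handler strings are invented); the real maps, created by sub_425CA0, store integer IDs and codegen function pointers:

```python
# Toy model of the two-level dispatch: intrinsic name -> integer ID
# (sub_5D1660's map), then PTX opcode -> codegen handler (sub_5D4190's map).
name_to_id = {"__cuda_sm70_warpsync": 0x200}          # illustrative ID only
opcode_handlers = {
    "shfl": lambda: "emit shfl.sync PTX",
    "bar":  lambda: "emit bar.warp.sync PTX",
}

def lower_call(name, opcode):
    intrinsic_id = name_to_id.get(name)               # hop 1: name -> ID
    if intrinsic_id is None:
        raise KeyError(f"unknown intrinsic {name}")
    handler = opcode_handlers[opcode]                 # hop 2: opcode -> handler
    return intrinsic_id, handler()

print(lower_call("__cuda_sm70_warpsync", "bar"))
```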
For OCG intrinsics on SM100+, the path bypasses PTX entirely: sub_6A97B0 matches call nodes to SASS instructions via an RB-tree, sub_6C9BC0 parses the __nv_ptx_builtin_ocg_* name into an operation enum + sub-op array, sub_6CC690 routes to type-specific handlers and assembles operands, and sub_6CB8A0 emits the final SASS instruction. See the OCG Intrinsic System page for the full five-stage pipeline breakdown with operand buffer layout and internal SASS opcode enum values.
Per-SM Intrinsic Initializers
Each SM target has its own intrinsic table initializer function registered in Map 3 of the capability dispatch (sub_607DB0). These functions control which subset of the 607 intrinsics is available on each target.
| SM | Initializer | SM | Initializer |
|---|---|---|---|
| sm_75 | sub_60A2E0 | sm_100 | sub_60A910 |
| sm_80 | sub_60A3E0 | sm_110 | sub_60AA20 |
| sm_86 | sub_60AC30 | sm_103 | sub_60A700 |
| sm_87 | sub_60AD30 | sm_120 | sub_608DF0 |
| sm_88 | sub_60AB30 | sm_121 | sub_60A4E0 |
| sm_89 | sub_60A810 | | |
| sm_90 | sub_60A5F0 | | |
Sub-variants (e.g., sm_100a, sm_100f) share the same initializer as their base SM since they represent the same silicon with different feature exposure levels.
Instruction Description Loader -- sub_9EE390
sub_9EE390 (3,584 bytes, 0x9EE390--0x9EF190) is the constructor for an instruction description object that feeds the register allocator's pre-coloring pass. Despite the diagnostic string "IntrinsicDescrFile=%s", the function loads instruction descriptions broadly -- not just intrinsic operations. It determines which instructions exist for the target SM, what register classes they use, and what scheduling properties apply. The sole caller is sub_991790 (pre-coloring pass, 12KB).
Invocation pattern: The pre-coloring pass checks context+1936 before calling sub_9EE390. If the descriptor for the current SM class already exists, it is reused. This means the expensive initialization happens once per SM architecture per ptxas process lifetime.
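The check-then-build pattern is ordinary memoization keyed on a context slot. A sketch under stated assumptions (the field name instr_descr is a placeholder for context+1936; build_descriptor stands in for the expensive sub_9EE390 path):

```python
class Context:
    def __init__(self):
        self.instr_descr = None    # stands in for the context+1936 slot

def build_descriptor(sm_class):
    # Placeholder for the expensive sub_9EE390 initialization.
    print(f"expensive init for SM class {sm_class}")
    return {"sm_class": sm_class}

def get_descriptor(ctx, sm_class):
    # Reuse the cached descriptor; build at most once per process lifetime.
    if ctx.instr_descr is None:
        ctx.instr_descr = build_descriptor(sm_class)
    return ctx.instr_descr

ctx = Context()
a = get_descriptor(ctx, 8)
b = get_descriptor(ctx, 8)
print(a is b)   # True: the second call reuses the cached object
```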
Initialization Sequence
1. Extract target properties. Read the target descriptor from context+1584. Compute the SM architecture class: v111 = target_descriptor[+372] >> 12. Read resource descriptors from option interface slots 41--44.
2. Check option 404 (IntrinsicDescrFile). Query the option interface at context[208]. If option 404 is set, extract the file path and log " IntrinsicDescrFile=%s". This CI-internal mechanism supplies an external description file that overrides or extends the built-in instruction table. When absent, the built-in database is used.
3. Determine instruction format class. Call sub_7DDB50(context) (GetSmVersionIndex), subtract 1, and index into dword_21E6330:

| sub_7DDB50 return | v114 | Format |
|---|---|---|
| 1 | 0 | basic 64-bit |
| 2 | 1 | 128-bit |
| 3 | 3 | extended |
| 4 | 2 | 192-bit |
| 5+ | 3 | extended (default) |

4. Determine SM generation class. Read context+12 (sm_version_id), subtract 1, and index into dword_21E5C80. The table is an identity mapping (1--11), one entry per SM generation.
5. Construct the instruction table (648 bytes). Call sub_10AFF80 with 32 parameters including memory pool, register count, format class, description file path, architecture descriptor (16 bytes from context+1888), SM generation class, instruction count limits, and context flags. Follow with sub_10B1A90 (init pass 2) and sub_10AEF10 (finalization).
6. Apply option overrides. Options 497, 738, 739 from the option interface set register limits and allocation budget values on the instruction table sub-object at +312.
7. Select the SM-specific instruction set descriptor based on v111 (SM architecture class):

| v111 | SM range (inferred) | Alloc size | Constructor | Vtable |
|---|---|---|---|---|
| 5 | sm_50--sm_62 | 200 B | sub_9CDF90 | off_23F3B00 |
| 6 | sm_70--sm_75 | 216 B | sub_9CE030 | off_22BB738 |
| 7 | sm_80--sm_89 | 232 B | sub_9CE120 | off_22B5150 |
| 8+ | sm_90--sm_121 | 240 B | sub_9CE190 | off_22AD230 |
| <5 | (reuse existing) | -- | -- | -- |

Each successor class inherits the previous one and extends it with generation-specific instructions. The descriptor is stored at context+1936 and this+48.
Object Layout
| Offset | Size | Contents |
|---|---|---|
| +0 | 8 | Vtable (off_21E6818 -> [sub_9DAA40, sub_9CADF0, sub_9CAE10, sub_9DDEE0]) |
| +8 | 8 | Back-pointer to compilation context |
| +16 | 8 | Instruction table object (648 B, built by sub_10AFF80) |
| +24 | 8 | Scheduling metadata (from sub_1BBBA60) |
| +32 | 8 | Scratch area pointer (context[198]) |
| +40 | 1 | Dirty flag (0 = clean) |
| +48 | 8 | SM-specific instruction set descriptor |
| +56--136 | -- | Resource descriptors, memory pool, sentinel, sub-allocator |
Diagnostic Strings
| String | Location | Context |
|---|---|---|
| "__nv_ptx_builtin_ocg_" | sub_6C9EB0 (0x6c9ecf) | OCG builtin name prefix |
| "instrinsic" (sic) | Multiple OCG handlers | Consistent NVIDIA typo for "intrinsic" |
| ".weak .func" | sub_5FF700 (354KB) | Prototype declaration prefix |
| "__cuda_sm20_*", "__cuda_sm70_*", etc. | sub_5D1660 | Intrinsic name patterns in registration |
| "__cuda_sanitizer_memcheck_*" | sub_5D1660 | Compute-sanitizer integration hooks |
| "__cuda_sm10x_tcgen05_guardrail_trap_*" | sub_5D1660 | Blackwell debug trap intrinsics |
| " IntrinsicDescrFile=%s" | sub_9EE390 (0x9EEC9B) | Instruction description loader -- logs external description file path (option 404) |
| ".RELU not allowed with unsigned type" | sub_6BEC60 | OCG LDC/S2R handler |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_5D1660 | 46KB | Master intrinsic registration -- 607 name-to-ID entries (608 table slots) | 99% |
| sub_5D4190 | 41KB | Opcode dispatch -- ~120 named + ~400 MMA hash entries | 99% |
| sub_5FF700 | 354KB | Prototype generator -- .weak .func PTX declarations | 99% |
| sub_5C7A50 | 173KB | wmma.mma codegen (all shapes/types/layouts) | 98% |
| sub_5C10A0 | 120KB | mma codegen (mma.sync API, post-Volta) | 98% |
| sub_5BBC30 | 90KB | tcgen05.mma codegen (Blackwell 5th-gen tensor core) | 98% |
| sub_5B76D0 | 64KB | div codegen (integer + FP, all rounding modes) | 95% |
| sub_5ADDC0 | 50KB | tex.grad codegen (1D/2D/3D gradient textures) | 95% |
| sub_5B4040 | 49KB | sqrt codegen (f32/f64, all rounding modes) | 95% |
| sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (1D--5D, tile/im2col) | 95% |
| sub_5B0CD0 | 44KB | rcp codegen (f32/f64 reciprocal, all rounding modes) | 95% |
| sub_6C9EB0 | 13KB | OCG intrinsic table init -- see OCG Intrinsic System for full function map (27 entries) | 95% |
| sub_6BDE20 | 7KB | Intrinsic operand expansion | 88% |
| sub_6BEC60 | 5.8KB | LDC/S2R intrinsic handlers | 90% |
| sub_9EE390 | 3.5KB | Instruction description loader -- builds per-SM instruction table for pre-coloring ("IntrinsicDescrFile=%s") | 92% |
| sub_9CDF90 | 156B | SM class 5 instruction set descriptor (200B, vtable off_23F3B00) | 85% |
| sub_9CE030 | 115B | SM class 6 instruction set descriptor (216B, extends sub_9CDF90) | 85% |
| sub_9CE120 | 112B | SM class 7 instruction set descriptor (232B, vtable off_22B5150) | 85% |
| sub_9CE190 | 114B | SM class 8+ instruction set descriptor (240B, vtable off_22AD230) | 85% |
| sub_9EF190 | 1.1KB | Error handler for instruction description loader (ICE on invalid option type) | 88% |
Cross-References
- OCG Intrinsic System -- SM100+ OCG builtin table (44 operations), lowering pipeline, SASS handler map
- SM Architecture Map -- Per-SM capability dispatch tables and intrinsic initializer assignments
- Math Intrinsics -- Detailed coverage of sm_20 IEEE math intrinsic codegen (div, rcp, sqrt, rem)
- Tensor Core Intrinsics -- WMMA, MMA, WGMMA, tcgen05 instruction lowering
- Sync & Warp Intrinsics -- Barrier, vote, shuffle, match, redux intrinsics
- Newton-Raphson Templates -- Software math slowpath sequences used by div/rcp/sqrt
- TCGen05 -- 5th Gen Tensor Cores -- Blackwell tensor core ISA detail
- Hash Tables & Bitvectors -- Hash map infrastructure (sub_425CA0 / sub_426150 / sub_426D60)
- Mercury Encoder -- Master SASS encoder sub_6D9690 (94KB) that encodes validated intrinsics
- SASS Instruction Encoding -- Instruction encoding infrastructure
- Pipeline Overview -- OCG-time measurement covers intrinsic lowering
Appendix: Complete Intrinsic Name Catalog (607 Entries)
Every intrinsic registered by sub_5D1660, extracted from the decompiled source. IDs are contiguous 1--607 (0x001--0x25F). The suffix after stripping the prefix encodes the operation, data type, rounding mode, address space, and optional modifiers.
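A rough decoder for the naming scheme can be sketched for the simple math-intrinsic shape prefix_op_rounding_type; it is a hypothetical illustration of the encoding described above and makes no attempt at the WMMA, barrier, or versioned (_v2/_slowpath) families:

```python
import re

# Hypothetical decoder, e.g.
#   __cuda_sm20_div_rn_ftz_f32 -> sm=sm20 op=div round=rn ftz=ftz type=f32
PATTERN = re.compile(
    r"__cuda_(?P<sm>sm\w+?)_(?P<op>[a-z]+)"
    r"(?:_(?P<round>rn|rd|ru|rz))?"
    r"(?:_(?P<ftz>ftz|noftz))?"
    r"_(?P<type>[sufb]\d+)"
)

def decode(name):
    m = PATTERN.fullmatch(name)
    return m.groupdict() if m else None   # None: name shape not handled

print(decode("__cuda_sm20_div_rn_ftz_f32"))
print(decode("__cuda_sm20_rem_u64"))
```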
__cuda_reduxsync_* -- Redux sync (17 entries, 0x001--0x011, sm_70)
| ID | Hex | Name |
|---|---|---|
| 1 | 0x001 | __cuda_reduxsync_b32_and |
| 2 | 0x002 | __cuda_reduxsync_b32_or |
| 3 | 0x003 | __cuda_reduxsync_b32_xor |
| 4 | 0x004 | __cuda_reduxsync_f32_max |
| 5 | 0x005 | __cuda_reduxsync_f32_max_NaN |
| 6 | 0x006 | __cuda_reduxsync_f32_max_abs |
| 7 | 0x007 | __cuda_reduxsync_f32_max_abs_NaN |
| 8 | 0x008 | __cuda_reduxsync_f32_min |
| 9 | 0x009 | __cuda_reduxsync_f32_min_NaN |
| 10 | 0x00A | __cuda_reduxsync_f32_min_abs |
| 11 | 0x00B | __cuda_reduxsync_f32_min_abs_NaN |
| 12 | 0x00C | __cuda_reduxsync_s32_add |
| 13 | 0x00D | __cuda_reduxsync_s32_max |
| 14 | 0x00E | __cuda_reduxsync_s32_min |
| 15 | 0x00F | __cuda_reduxsync_u32_add |
| 16 | 0x010 | __cuda_reduxsync_u32_max |
| 17 | 0x011 | __cuda_reduxsync_u32_min |
__cuda_sanitizer_memcheck_* -- Compute-sanitizer hooks (7 entries, 0x012--0x018, --)
| ID | Hex | Name |
|---|---|---|
| 18 | 0x012 | __cuda_sanitizer_memcheck_free |
| 19 | 0x013 | __cuda_sanitizer_memcheck_generic |
| 20 | 0x014 | __cuda_sanitizer_memcheck_global |
| 21 | 0x015 | __cuda_sanitizer_memcheck_local |
| 22 | 0x016 | __cuda_sanitizer_memcheck_malloc |
| 23 | 0x017 | __cuda_sanitizer_memcheck_readmetadata |
| 24 | 0x018 | __cuda_sanitizer_memcheck_shared |
__cuda_scalar_video_emulation_* -- Video emulation (7 entries, 0x019--0x01F, sm_20)
| ID | Hex | Name |
|---|---|---|
| 25 | 0x019 | __cuda_scalar_video_emulation_operandExtractAndSignExtend01 |
| 26 | 0x01A | __cuda_scalar_video_emulation_operandExtractAndSignExtend11 |
| 27 | 0x01B | __cuda_scalar_video_emulation_operandExtractAndSignExtend12 |
| 28 | 0x01C | __cuda_scalar_video_emulation_operandExtractAndSignExtend22 |
| 29 | 0x01D | __cuda_scalar_video_emulation_optionalMerge32 |
| 30 | 0x01E | __cuda_scalar_video_emulation_saturate64 |
| 31 | 0x01F | __cuda_scalar_video_emulation_secondOp64 |
__cuda_sm10x_* -- Blackwell tcgen05 guardrails + mask (11 entries, 0x020--0x02A, sm_100)
| ID | Hex | Name |
|---|---|---|
| 32 | 0x020 | __cuda_sm10x_create_mask_from_bit_idx_and_alloc_size_v1 |
| 33 | 0x021 | __cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds |
| 34 | 0x022 | __cuda_sm10x_tcgen05_guardrail_trap_allocation_granularity_invalid |
| 35 | 0x023 | __cuda_sm10x_tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc |
| 36 | 0x024 | __cuda_sm10x_tcgen05_guardrail_trap_current_warp_owner_invalid |
| 37 | 0x025 | __cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment |
| 38 | 0x026 | __cuda_sm10x_tcgen05_guardrail_trap_phase_invalid_during_alloc |
| 39 | 0x027 | __cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env |
| 40 | 0x028 | __cuda_sm10x_tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod |
| 41 | 0x029 | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access |
| 42 | 0x02A | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_being_dealloced |
__cuda_sm1xx_* -- Bulk copy + cp.async.bulk.tensor (18 entries, 0x02B--0x03C, sm_100+)
| ID | Hex | Name |
|---|---|---|
| 43 | 0x02B | __cuda_sm1xx_bulk_copy_multicast |
| 44 | 0x02C | __cuda_sm1xx_bulk_copy_unicast |
| 45 | 0x02D | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_multicast |
| 46 | 0x02E | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_unicast |
| 47 | 0x02F | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_multicast |
| 48 | 0x030 | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_unicast |
| 49 | 0x031 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_multicast |
| 50 | 0x032 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_unicast |
| 51 | 0x033 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_multicast |
| 52 | 0x034 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_unicast |
| 53 | 0x035 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_multicast |
| 54 | 0x036 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_unicast |
| 55 | 0x037 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_multicast |
| 56 | 0x038 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_unicast |
| 57 | 0x039 | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_multicast |
| 58 | 0x03A | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_unicast |
| 59 | 0x03B | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_multicast |
| 60 | 0x03C | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_unicast |
__cuda_sm20_* -- IEEE math (70 entries, 0x03D--0x082, sm_20)
| ID | Hex | Name |
|---|---|---|
| 61 | 0x03D | __cuda_sm20_bfe_s64_ |
| 62 | 0x03E | __cuda_sm20_bfe_u64_ |
| 63 | 0x03F | __cuda_sm20_bfi_u64_ |
| 64 | 0x040 | __cuda_sm20_dblrcp_rn_slowpath_v3 |
| 65 | 0x041 | __cuda_sm20_div_rd_f32 |
| 66 | 0x042 | __cuda_sm20_div_rd_f64_v2 |
| 67 | 0x043 | __cuda_sm20_div_rd_ftz_f32 |
| 68 | 0x044 | __cuda_sm20_div_rn_f32 |
| 69 | 0x045 | __cuda_sm20_div_rn_f64_fast |
| 70 | 0x046 | __cuda_sm20_div_rn_f64_full |
| 71 | 0x047 | __cuda_sm20_div_rn_ftz_f32 |
| 72 | 0x048 | __cuda_sm20_div_rn_ftz_f32_slowpath |
| 73 | 0x049 | __cuda_sm20_div_rn_noftz_f32_slowpath |
| 74 | 0x04A | __cuda_sm20_div_ru_f32 |
| 75 | 0x04B | __cuda_sm20_div_ru_f64_v2 |
| 76 | 0x04C | __cuda_sm20_div_ru_ftz_f32 |
| 77 | 0x04D | __cuda_sm20_div_rz_f32 |
| 78 | 0x04E | __cuda_sm20_div_rz_f64_v2 |
| 79 | 0x04F | __cuda_sm20_div_rz_ftz_f32 |
| 80 | 0x050 | __cuda_sm20_div_s16 |
| 81 | 0x051 | __cuda_sm20_div_s64 |
| 82 | 0x052 | __cuda_sm20_div_u16 |
| 83 | 0x053 | __cuda_sm20_div_u64 |
| 84 | 0x054 | __cuda_sm20_drsqrt_f64_slowpath_v2 |
| 85 | 0x055 | __cuda_sm20_drsqrt_f64_v2 |
| 86 | 0x056 | __cuda_sm20_dsqrt_rd_f64 |
| 87 | 0x057 | __cuda_sm20_dsqrt_rn_f64_mediumpath_v1 |
| 88 | 0x058 | __cuda_sm20_dsqrt_rn_f64_v3 |
| 89 | 0x059 | __cuda_sm20_dsqrt_ru_f64 |
| 90 | 0x05A | __cuda_sm20_dsqrt_rz_f64 |
| 91 | 0x05B | __cuda_sm20_rcp_f64_v3 |
| 92 | 0x05C | __cuda_sm20_rcp_rd_f32 |
| 93 | 0x05D | __cuda_sm20_rcp_rd_f32_slowpath |
| 94 | 0x05E | __cuda_sm20_rcp_rd_f64 |
| 95 | 0x05F | __cuda_sm20_rcp_rd_ftz_f32 |
| 96 | 0x060 | __cuda_sm20_rcp_rd_ftz_f32_slowpath |
| 97 | 0x061 | __cuda_sm20_rcp_rn_f32 |
| 98 | 0x062 | __cuda_sm20_rcp_rn_f32_slowpath |
| 99 | 0x063 | __cuda_sm20_rcp_rn_ftz_f32 |
| 100 | 0x064 | __cuda_sm20_rcp_rn_ftz_f32_slowpath |
| 101 | 0x065 | __cuda_sm20_rcp_ru_f32 |
| 102 | 0x066 | __cuda_sm20_rcp_ru_f32_slowpath |
| 103 | 0x067 | __cuda_sm20_rcp_ru_f64 |
| 104 | 0x068 | __cuda_sm20_rcp_ru_ftz_f32 |
| 105 | 0x069 | __cuda_sm20_rcp_ru_ftz_f32_slowpath |
| 106 | 0x06A | __cuda_sm20_rcp_rz_f32 |
| 107 | 0x06B | __cuda_sm20_rcp_rz_f32_slowpath |
| 108 | 0x06C | __cuda_sm20_rcp_rz_f64 |
| 109 | 0x06D | __cuda_sm20_rcp_rz_ftz_f32 |
| 110 | 0x06E | __cuda_sm20_rcp_rz_ftz_f32_slowpath |
| 111 | 0x06F | __cuda_sm20_rem_s16 |
| 112 | 0x070 | __cuda_sm20_rem_s64 |
| 113 | 0x071 | __cuda_sm20_rem_u16 |
| 114 | 0x072 | __cuda_sm20_rem_u64 |
| 115 | 0x073 | __cuda_sm20_sqrt_rd_f32 |
| 116 | 0x074 | __cuda_sm20_sqrt_rd_f32_slowpath |
| 117 | 0x075 | __cuda_sm20_sqrt_rd_ftz_f32 |
| 118 | 0x076 | __cuda_sm20_sqrt_rd_ftz_f32_slowpath |
| 119 | 0x077 | __cuda_sm20_sqrt_rn_f32 |
| 120 | 0x078 | __cuda_sm20_sqrt_rn_f32_slowpath |
| 121 | 0x079 | __cuda_sm20_sqrt_rn_ftz_f32 |
| 122 | 0x07A | __cuda_sm20_sqrt_rn_ftz_f32_slowpath |
| 123 | 0x07B | __cuda_sm20_sqrt_ru_f32 |
| 124 | 0x07C | __cuda_sm20_sqrt_ru_f32_slowpath |
| 125 | 0x07D | __cuda_sm20_sqrt_ru_ftz_f32 |
| 126 | 0x07E | __cuda_sm20_sqrt_ru_ftz_f32_slowpath |
| 127 | 0x07F | __cuda_sm20_sqrt_rz_f32 |
| 128 | 0x080 | __cuda_sm20_sqrt_rz_f32_slowpath |
| 129 | 0x081 | __cuda_sm20_sqrt_rz_ftz_f32 |
| 130 | 0x082 | __cuda_sm20_sqrt_rz_ftz_f32_slowpath |
__cuda_sm3x_* -- Optimized division (4 entries, 0x083--0x086, sm_30)
| ID | Hex | Name |
|---|---|---|
| 131 | 0x083 | __cuda_sm3x_div_rn_ftz_f32 |
| 132 | 0x084 | __cuda_sm3x_div_rn_ftz_f32_slowpath |
| 133 | 0x085 | __cuda_sm3x_div_rn_noftz_f32 |
| 134 | 0x086 | __cuda_sm3x_div_rn_noftz_f32_slowpath |
__cuda_sm62_* -- Integer dot product (2 entries, 0x087--0x088, sm_62)
| ID | Hex | Name |
|---|---|---|
| 135 | 0x087 | __cuda_sm62_dp2a |
| 136 | 0x088 | __cuda_sm62_dp4a |
__cuda_sm70_* -- Volta sync/warp/WMMA (393 entries, 0x089--0x211, sm_70)
| ID | Hex | Name |
|---|---|---|
| 137 | 0x089 | __cuda_sm70_barrier_arrive |
| 138 | 0x08A | __cuda_sm70_barrier_arrive_0 |
| 139 | 0x08B | __cuda_sm70_barrier_arrive_0_count |
| 140 | 0x08C | __cuda_sm70_barrier_arrive_1 |
| 141 | 0x08D | __cuda_sm70_barrier_arrive_10 |
| 142 | 0x08E | __cuda_sm70_barrier_arrive_10_count |
| 143 | 0x08F | __cuda_sm70_barrier_arrive_11 |
| 144 | 0x090 | __cuda_sm70_barrier_arrive_11_count |
| 145 | 0x091 | __cuda_sm70_barrier_arrive_12 |
| 146 | 0x092 | __cuda_sm70_barrier_arrive_12_count |
| 147 | 0x093 | __cuda_sm70_barrier_arrive_13 |
| 148 | 0x094 | __cuda_sm70_barrier_arrive_13_count |
| 149 | 0x095 | __cuda_sm70_barrier_arrive_14 |
| 150 | 0x096 | __cuda_sm70_barrier_arrive_14_count |
| 151 | 0x097 | __cuda_sm70_barrier_arrive_15 |
| 152 | 0x098 | __cuda_sm70_barrier_arrive_15_count |
| 153 | 0x099 | __cuda_sm70_barrier_arrive_1_count |
| 154 | 0x09A | __cuda_sm70_barrier_arrive_2 |
| 155 | 0x09B | __cuda_sm70_barrier_arrive_2_count |
| 156 | 0x09C | __cuda_sm70_barrier_arrive_3 |
| 157 | 0x09D | __cuda_sm70_barrier_arrive_3_count |
| 158 | 0x09E | __cuda_sm70_barrier_arrive_4 |
| 159 | 0x09F | __cuda_sm70_barrier_arrive_4_count |
| 160 | 0x0A0 | __cuda_sm70_barrier_arrive_5 |
| 161 | 0x0A1 | __cuda_sm70_barrier_arrive_5_count |
| 162 | 0x0A2 | __cuda_sm70_barrier_arrive_6 |
| 163 | 0x0A3 | __cuda_sm70_barrier_arrive_6_count |
| 164 | 0x0A4 | __cuda_sm70_barrier_arrive_7 |
| 165 | 0x0A5 | __cuda_sm70_barrier_arrive_7_count |
| 166 | 0x0A6 | __cuda_sm70_barrier_arrive_8 |
| 167 | 0x0A7 | __cuda_sm70_barrier_arrive_8_count |
| 168 | 0x0A8 | __cuda_sm70_barrier_arrive_9 |
| 169 | 0x0A9 | __cuda_sm70_barrier_arrive_9_count |
| 170 | 0x0AA | __cuda_sm70_barrier_arrive_count |
| 171 | 0x0AB | __cuda_sm70_barrier_red_and |
| 172 | 0x0AC | __cuda_sm70_barrier_red_and_0 |
| 173 | 0x0AD | __cuda_sm70_barrier_red_and_0_count |
| 174 | 0x0AE | __cuda_sm70_barrier_red_and_1 |
| 175 | 0x0AF | __cuda_sm70_barrier_red_and_10 |
| 176 | 0x0B0 | __cuda_sm70_barrier_red_and_10_count |
| 177 | 0x0B1 | __cuda_sm70_barrier_red_and_11 |
| 178 | 0x0B2 | __cuda_sm70_barrier_red_and_11_count |
| 179 | 0x0B3 | __cuda_sm70_barrier_red_and_12 |
| 180 | 0x0B4 | __cuda_sm70_barrier_red_and_12_count |
| 181 | 0x0B5 | __cuda_sm70_barrier_red_and_13 |
| 182 | 0x0B6 | __cuda_sm70_barrier_red_and_13_count |
| 183 | 0x0B7 | __cuda_sm70_barrier_red_and_14 |
| 184 | 0x0B8 | __cuda_sm70_barrier_red_and_14_count |
| 185 | 0x0B9 | __cuda_sm70_barrier_red_and_15 |
| 186 | 0x0BA | __cuda_sm70_barrier_red_and_15_count |
| 187 | 0x0BB | __cuda_sm70_barrier_red_and_1_count |
| 188 | 0x0BC | __cuda_sm70_barrier_red_and_2 |
| 189 | 0x0BD | __cuda_sm70_barrier_red_and_2_count |
| 190 | 0x0BE | __cuda_sm70_barrier_red_and_3 |
| 191 | 0x0BF | __cuda_sm70_barrier_red_and_3_count |
| 192 | 0x0C0 | __cuda_sm70_barrier_red_and_4 |
| 193 | 0x0C1 | __cuda_sm70_barrier_red_and_4_count |
| 194 | 0x0C2 | __cuda_sm70_barrier_red_and_5 |
| 195 | 0x0C3 | __cuda_sm70_barrier_red_and_5_count |
| 196 | 0x0C4 | __cuda_sm70_barrier_red_and_6 |
| 197 | 0x0C5 | __cuda_sm70_barrier_red_and_6_count |
| 198 | 0x0C6 | __cuda_sm70_barrier_red_and_7 |
| 199 | 0x0C7 | __cuda_sm70_barrier_red_and_7_count |
| 200 | 0x0C8 | __cuda_sm70_barrier_red_and_8 |
| 201 | 0x0C9 | __cuda_sm70_barrier_red_and_8_count |
| 202 | 0x0CA | __cuda_sm70_barrier_red_and_9 |
| 203 | 0x0CB | __cuda_sm70_barrier_red_and_9_count |
| 204 | 0x0CC | __cuda_sm70_barrier_red_and_count |
| 205 | 0x0CD | __cuda_sm70_barrier_red_or |
| 206 | 0x0CE | __cuda_sm70_barrier_red_or_0 |
| 207 | 0x0CF | __cuda_sm70_barrier_red_or_0_count |
| 208 | 0x0D0 | __cuda_sm70_barrier_red_or_1 |
| 209 | 0x0D1 | __cuda_sm70_barrier_red_or_10 |
| 210 | 0x0D2 | __cuda_sm70_barrier_red_or_10_count |
| 211 | 0x0D3 | __cuda_sm70_barrier_red_or_11 |
| 212 | 0x0D4 | __cuda_sm70_barrier_red_or_11_count |
| 213 | 0x0D5 | __cuda_sm70_barrier_red_or_12 |
| 214 | 0x0D6 | __cuda_sm70_barrier_red_or_12_count |
| 215 | 0x0D7 | __cuda_sm70_barrier_red_or_13 |
| 216 | 0x0D8 | __cuda_sm70_barrier_red_or_13_count |
| 217 | 0x0D9 | __cuda_sm70_barrier_red_or_14 |
| 218 | 0x0DA | __cuda_sm70_barrier_red_or_14_count |
| 219 | 0x0DB | __cuda_sm70_barrier_red_or_15 |
| 220 | 0x0DC | __cuda_sm70_barrier_red_or_15_count |
| 221 | 0x0DD | __cuda_sm70_barrier_red_or_1_count |
| 222 | 0x0DE | __cuda_sm70_barrier_red_or_2 |
| 223 | 0x0DF | __cuda_sm70_barrier_red_or_2_count |
| 224 | 0x0E0 | __cuda_sm70_barrier_red_or_3 |
| 225 | 0x0E1 | __cuda_sm70_barrier_red_or_3_count |
| 226 | 0x0E2 | __cuda_sm70_barrier_red_or_4 |
| 227 | 0x0E3 | __cuda_sm70_barrier_red_or_4_count |
| 228 | 0x0E4 | __cuda_sm70_barrier_red_or_5 |
| 229 | 0x0E5 | __cuda_sm70_barrier_red_or_5_count |
| 230 | 0x0E6 | __cuda_sm70_barrier_red_or_6 |
| 231 | 0x0E7 | __cuda_sm70_barrier_red_or_6_count |
| 232 | 0x0E8 | __cuda_sm70_barrier_red_or_7 |
| 233 | 0x0E9 | __cuda_sm70_barrier_red_or_7_count |
| 234 | 0x0EA | __cuda_sm70_barrier_red_or_8 |
| 235 | 0x0EB | __cuda_sm70_barrier_red_or_8_count |
| 236 | 0x0EC | __cuda_sm70_barrier_red_or_9 |
| 237 | 0x0ED | __cuda_sm70_barrier_red_or_9_count |
| 238 | 0x0EE | __cuda_sm70_barrier_red_or_count |
| 239 | 0x0EF | __cuda_sm70_barrier_red_popc |
| 240 | 0x0F0 | __cuda_sm70_barrier_red_popc_0 |
| 241 | 0x0F1 | __cuda_sm70_barrier_red_popc_0_count |
| 242 | 0x0F2 | __cuda_sm70_barrier_red_popc_1 |
| 243 | 0x0F3 | __cuda_sm70_barrier_red_popc_10 |
| 244 | 0x0F4 | __cuda_sm70_barrier_red_popc_10_count |
| 245 | 0x0F5 | __cuda_sm70_barrier_red_popc_11 |
| 246 | 0x0F6 | __cuda_sm70_barrier_red_popc_11_count |
| 247 | 0x0F7 | __cuda_sm70_barrier_red_popc_12 |
| 248 | 0x0F8 | __cuda_sm70_barrier_red_popc_12_count |
| 249 | 0x0F9 | __cuda_sm70_barrier_red_popc_13 |
| 250 | 0x0FA | __cuda_sm70_barrier_red_popc_13_count |
| 251 | 0x0FB | __cuda_sm70_barrier_red_popc_14 |
| 252 | 0x0FC | __cuda_sm70_barrier_red_popc_14_count |
| 253 | 0x0FD | __cuda_sm70_barrier_red_popc_15 |
| 254 | 0x0FE | __cuda_sm70_barrier_red_popc_15_count |
| 255 | 0x0FF | __cuda_sm70_barrier_red_popc_1_count |
| 256 | 0x100 | __cuda_sm70_barrier_red_popc_2 |
| 257 | 0x101 | __cuda_sm70_barrier_red_popc_2_count |
| 258 | 0x102 | __cuda_sm70_barrier_red_popc_3 |
| 259 | 0x103 | __cuda_sm70_barrier_red_popc_3_count |
| 260 | 0x104 | __cuda_sm70_barrier_red_popc_4 |
| 261 | 0x105 | __cuda_sm70_barrier_red_popc_4_count |
| 262 | 0x106 | __cuda_sm70_barrier_red_popc_5 |
| 263 | 0x107 | __cuda_sm70_barrier_red_popc_5_count |
| 264 | 0x108 | __cuda_sm70_barrier_red_popc_6 |
| 265 | 0x109 | __cuda_sm70_barrier_red_popc_6_count |
| 266 | 0x10A | __cuda_sm70_barrier_red_popc_7 |
| 267 | 0x10B | __cuda_sm70_barrier_red_popc_7_count |
| 268 | 0x10C | __cuda_sm70_barrier_red_popc_8 |
| 269 | 0x10D | __cuda_sm70_barrier_red_popc_8_count |
| 270 | 0x10E | __cuda_sm70_barrier_red_popc_9 |
| 271 | 0x10F | __cuda_sm70_barrier_red_popc_9_count |
| 272 | 0x110 | __cuda_sm70_barrier_red_popc_count |
| 273 | 0x111 | __cuda_sm70_barrier_sync |
| 274 | 0x112 | __cuda_sm70_barrier_sync_0 |
| 275 | 0x113 | __cuda_sm70_barrier_sync_0_count |
| 276 | 0x114 | __cuda_sm70_barrier_sync_1 |
| 277 | 0x115 | __cuda_sm70_barrier_sync_10 |
| 278 | 0x116 | __cuda_sm70_barrier_sync_10_count |
| 279 | 0x117 | __cuda_sm70_barrier_sync_11 |
| 280 | 0x118 | __cuda_sm70_barrier_sync_11_count |
| 281 | 0x119 | __cuda_sm70_barrier_sync_12 |
| 282 | 0x11A | __cuda_sm70_barrier_sync_12_count |
| 283 | 0x11B | __cuda_sm70_barrier_sync_13 |
| 284 | 0x11C | __cuda_sm70_barrier_sync_13_count |
| 285 | 0x11D | __cuda_sm70_barrier_sync_14 |
| 286 | 0x11E | __cuda_sm70_barrier_sync_14_count |
| 287 | 0x11F | __cuda_sm70_barrier_sync_15 |
| 288 | 0x120 | __cuda_sm70_barrier_sync_15_count |
| 289 | 0x121 | __cuda_sm70_barrier_sync_1_count |
| 290 | 0x122 | __cuda_sm70_barrier_sync_2 |
| 291 | 0x123 | __cuda_sm70_barrier_sync_2_count |
| 292 | 0x124 | __cuda_sm70_barrier_sync_3 |
| 293 | 0x125 | __cuda_sm70_barrier_sync_3_count |
| 294 | 0x126 | __cuda_sm70_barrier_sync_4 |
| 295 | 0x127 | __cuda_sm70_barrier_sync_4_count |
| 296 | 0x128 | __cuda_sm70_barrier_sync_5 |
| 297 | 0x129 | __cuda_sm70_barrier_sync_5_count |
| 298 | 0x12A | __cuda_sm70_barrier_sync_6 |
| 299 | 0x12B | __cuda_sm70_barrier_sync_6_count |
| 300 | 0x12C | __cuda_sm70_barrier_sync_7 |
| 301 | 0x12D | __cuda_sm70_barrier_sync_7_count |
| 302 | 0x12E | __cuda_sm70_barrier_sync_8 |
| 303 | 0x12F | __cuda_sm70_barrier_sync_8_count |
| 304 | 0x130 | __cuda_sm70_barrier_sync_9 |
| 305 | 0x131 | __cuda_sm70_barrier_sync_9_count |
| 306 | 0x132 | __cuda_sm70_barrier_sync_count |
| 307 | 0x133 | __cuda_sm70_matchsync_all_b32 |
| 308 | 0x134 | __cuda_sm70_matchsync_all_b32_p |
| 309 | 0x135 | __cuda_sm70_matchsync_all_b64 |
| 310 | 0x136 | __cuda_sm70_matchsync_all_b64_p |
| 311 | 0x137 | __cuda_sm70_matchsync_any_b32 |
| 312 | 0x138 | __cuda_sm70_matchsync_any_b64 |
| 313 | 0x139 | __cuda_sm70_shflsync_bfly |
| 314 | 0x13A | __cuda_sm70_shflsync_bfly_p |
| 315 | 0x13B | __cuda_sm70_shflsync_down |
| 316 | 0x13C | __cuda_sm70_shflsync_down_p |
| 317 | 0x13D | __cuda_sm70_shflsync_idx |
| 318 | 0x13E | __cuda_sm70_shflsync_idx_p |
| 319 | 0x13F | __cuda_sm70_shflsync_up |
| 320 | 0x140 | __cuda_sm70_shflsync_up_p |
| 321 | 0x141 | __cuda_sm70_votesync_all |
| 322 | 0x142 | __cuda_sm70_votesync_any |
| 323 | 0x143 | __cuda_sm70_votesync_ballot |
| 324 | 0x144 | __cuda_sm70_votesync_uni |
| 325 | 0x145 | __cuda_sm70_warpsync |
| 326 | 0x146 | __cuda_sm70_wmma_m16n16k16_load_a_col |
| 327 | 0x147 | __cuda_sm70_wmma_m16n16k16_load_a_col_global |
| 328 | 0x148 | __cuda_sm70_wmma_m16n16k16_load_a_col_shared |
| 329 | 0x149 | __cuda_sm70_wmma_m16n16k16_load_a_row |
| 330 | 0x14A | __cuda_sm70_wmma_m16n16k16_load_a_row_global |
| 331 | 0x14B | __cuda_sm70_wmma_m16n16k16_load_a_row_shared |
| 332 | 0x14C | __cuda_sm70_wmma_m16n16k16_load_b_col |
| 333 | 0x14D | __cuda_sm70_wmma_m16n16k16_load_b_col_global |
| 334 | 0x14E | __cuda_sm70_wmma_m16n16k16_load_b_col_shared |
| 335 | 0x14F | __cuda_sm70_wmma_m16n16k16_load_b_row |
| 336 | 0x150 | __cuda_sm70_wmma_m16n16k16_load_b_row_global |
| 337 | 0x151 | __cuda_sm70_wmma_m16n16k16_load_b_row_shared |
| 338 | 0x152 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16 |
| 339 | 0x153 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16_global |
| 340 | 0x154 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16_shared |
| 341 | 0x155 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32 |
| 342 | 0x156 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32_global |
| 343 | 0x157 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32_shared |
| 344 | 0x158 | __cuda_sm70_wmma_m16n16k16_load_c_row_f16 |
| 345 | 0x159 | __cuda_sm70_wmma_m16n16k16_load_c_row_f16_global |
| 346 | 0x15A | __cuda_sm70_wmma_m16n16k16_load_c_row_f16_shared |
| 347 | 0x15B | __cuda_sm70_wmma_m16n16k16_load_c_row_f32 |
| 348 | 0x15C | __cuda_sm70_wmma_m16n16k16_load_c_row_f32_global |
| 349 | 0x15D | __cuda_sm70_wmma_m16n16k16_load_c_row_f32_shared |
| 350 | 0x15E | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16 |
| 351 | 0x15F | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16_satfinite |
| 352 | 0x160 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32 |
| 353 | 0x161 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32_satfinite |
| 354 | 0x162 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16 |
| 355 | 0x163 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16_satfinite |
| 356 | 0x164 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32 |
| 357 | 0x165 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32_satfinite |
| 358 | 0x166 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16 |
| 359 | 0x167 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16_satfinite |
| 360 | 0x168 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32 |
| 361 | 0x169 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32_satfinite |
| 362 | 0x16A | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16 |
| 363 | 0x16B | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16_satfinite |
| 364 | 0x16C | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32 |
| 365 | 0x16D | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32_satfinite |
| 366 | 0x16E | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16 |
| 367 | 0x16F | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16_satfinite |
| 368 | 0x170 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32 |
| 369 | 0x171 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32_satfinite |
| 370 | 0x172 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16 |
| 371 | 0x173 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16_satfinite |
| 372 | 0x174 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32 |
| 373 | 0x175 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32_satfinite |
| 374 | 0x176 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16 |
| 375 | 0x177 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16_satfinite |
| 376 | 0x178 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32 |
| 377 | 0x179 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32_satfinite |
| 378 | 0x17A | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16 |
| 379 | 0x17B | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16_satfinite |
| 380 | 0x17C | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32 |
| 381 | 0x17D | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32_satfinite |
| 382 | 0x17E | __cuda_sm70_wmma_m16n16k16_store_d_col_f16 |
| 383 | 0x17F | __cuda_sm70_wmma_m16n16k16_store_d_col_f16_global |
| 384 | 0x180 | __cuda_sm70_wmma_m16n16k16_store_d_col_f16_shared |
| 385 | 0x181 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32 |
| 386 | 0x182 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32_global |
| 387 | 0x183 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32_shared |
| 388 | 0x184 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16 |
| 389 | 0x185 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16_global |
| 390 | 0x186 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16_shared |
| 391 | 0x187 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32 |
| 392 | 0x188 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32_global |
| 393 | 0x189 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32_shared |
| 394 | 0x18A | __cuda_sm70_wmma_m32n8k16_load_a_col |
| 395 | 0x18B | __cuda_sm70_wmma_m32n8k16_load_a_col_global |
| 396 | 0x18C | __cuda_sm70_wmma_m32n8k16_load_a_col_shared |
| 397 | 0x18D | __cuda_sm70_wmma_m32n8k16_load_a_row |
| 398 | 0x18E | __cuda_sm70_wmma_m32n8k16_load_a_row_global |
| 399 | 0x18F | __cuda_sm70_wmma_m32n8k16_load_a_row_shared |
| 400 | 0x190 | __cuda_sm70_wmma_m32n8k16_load_b_col |
| 401 | 0x191 | __cuda_sm70_wmma_m32n8k16_load_b_col_global |
| 402 | 0x192 | __cuda_sm70_wmma_m32n8k16_load_b_col_shared |
| 403 | 0x193 | __cuda_sm70_wmma_m32n8k16_load_b_row |
| 404 | 0x194 | __cuda_sm70_wmma_m32n8k16_load_b_row_global |
| 405 | 0x195 | __cuda_sm70_wmma_m32n8k16_load_b_row_shared |
| 406 | 0x196 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16 |
| 407 | 0x197 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16_global |
| 408 | 0x198 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16_shared |
| 409 | 0x199 | __cuda_sm70_wmma_m32n8k16_load_c_col_f32 |
| 410 | 0x19A | __cuda_sm70_wmma_m32n8k16_load_c_col_f32_global |
| 411 | 0x19B | __cuda_sm70_wmma_m32n8k16_load_c_col_f32_shared |
| 412 | 0x19C | __cuda_sm70_wmma_m32n8k16_load_c_row_f16 |
| 413 | 0x19D | __cuda_sm70_wmma_m32n8k16_load_c_row_f16_global |
| 414 | 0x19E | __cuda_sm70_wmma_m32n8k16_load_c_row_f16_shared |
| 415 | 0x19F | __cuda_sm70_wmma_m32n8k16_load_c_row_f32 |
| 416 | 0x1A0 | __cuda_sm70_wmma_m32n8k16_load_c_row_f32_global |
| 417 | 0x1A1 | __cuda_sm70_wmma_m32n8k16_load_c_row_f32_shared |
| 418 | 0x1A2 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16 |
| 419 | 0x1A3 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16_satfinite |
| 420 | 0x1A4 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32 |
| 421 | 0x1A5 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32_satfinite |
| 422 | 0x1A6 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16 |
| 423 | 0x1A7 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16_satfinite |
| 424 | 0x1A8 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32 |
| 425 | 0x1A9 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32_satfinite |
| 426 | 0x1AA | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16 |
| 427 | 0x1AB | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16_satfinite |
| 428 | 0x1AC | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32 |
| 429 | 0x1AD | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32_satfinite |
| 430 | 0x1AE | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16 |
| 431 | 0x1AF | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16_satfinite |
| 432 | 0x1B0 | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32 |
| 433 | 0x1B1 | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32_satfinite |
| 434 | 0x1B2 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16 |
| 435 | 0x1B3 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16_satfinite |
| 436 | 0x1B4 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32 |
| 437 | 0x1B5 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32_satfinite |
| 438 | 0x1B6 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16 |
| 439 | 0x1B7 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16_satfinite |
| 440 | 0x1B8 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32 |
| 441 | 0x1B9 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32_satfinite |
| 442 | 0x1BA | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16 |
| 443 | 0x1BB | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16_satfinite |
| 444 | 0x1BC | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32 |
| 445 | 0x1BD | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32_satfinite |
| 446 | 0x1BE | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16 |
| 447 | 0x1BF | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16_satfinite |
| 448 | 0x1C0 | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32 |
| 449 | 0x1C1 | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32_satfinite |
| 450 | 0x1C2 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16 |
| 451 | 0x1C3 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16_global |
| 452 | 0x1C4 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16_shared |
| 453 | 0x1C5 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32 |
| 454 | 0x1C6 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32_global |
| 455 | 0x1C7 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32_shared |
| 456 | 0x1C8 | __cuda_sm70_wmma_m32n8k16_store_d_row_f16 |
| 457 | 0x1C9 | __cuda_sm70_wmma_m32n8k16_store_d_row_f16_global |
| 458 | 0x1CA | __cuda_sm70_wmma_m32n8k16_store_d_row_f16_shared |
| 459 | 0x1CB | __cuda_sm70_wmma_m32n8k16_store_d_row_f32 |
| 460 | 0x1CC | __cuda_sm70_wmma_m32n8k16_store_d_row_f32_global |
| 461 | 0x1CD | __cuda_sm70_wmma_m32n8k16_store_d_row_f32_shared |
| 462 | 0x1CE | __cuda_sm70_wmma_m8n32k16_load_a_col |
| 463 | 0x1CF | __cuda_sm70_wmma_m8n32k16_load_a_col_global |
| 464 | 0x1D0 | __cuda_sm70_wmma_m8n32k16_load_a_col_shared |
| 465 | 0x1D1 | __cuda_sm70_wmma_m8n32k16_load_a_row |
| 466 | 0x1D2 | __cuda_sm70_wmma_m8n32k16_load_a_row_global |
| 467 | 0x1D3 | __cuda_sm70_wmma_m8n32k16_load_a_row_shared |
| 468 | 0x1D4 | __cuda_sm70_wmma_m8n32k16_load_b_col |
| 469 | 0x1D5 | __cuda_sm70_wmma_m8n32k16_load_b_col_global |
| 470 | 0x1D6 | __cuda_sm70_wmma_m8n32k16_load_b_col_shared |
| 471 | 0x1D7 | __cuda_sm70_wmma_m8n32k16_load_b_row |
| 472 | 0x1D8 | __cuda_sm70_wmma_m8n32k16_load_b_row_global |
| 473 | 0x1D9 | __cuda_sm70_wmma_m8n32k16_load_b_row_shared |
| 474 | 0x1DA | __cuda_sm70_wmma_m8n32k16_load_c_col_f16 |
| 475 | 0x1DB | __cuda_sm70_wmma_m8n32k16_load_c_col_f16_global |
| 476 | 0x1DC | __cuda_sm70_wmma_m8n32k16_load_c_col_f16_shared |
| 477 | 0x1DD | __cuda_sm70_wmma_m8n32k16_load_c_col_f32 |
| 478 | 0x1DE | __cuda_sm70_wmma_m8n32k16_load_c_col_f32_global |
| 479 | 0x1DF | __cuda_sm70_wmma_m8n32k16_load_c_col_f32_shared |
| 480 | 0x1E0 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16 |
| 481 | 0x1E1 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16_global |
| 482 | 0x1E2 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16_shared |
| 483 | 0x1E3 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32 |
| 484 | 0x1E4 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32_global |
| 485 | 0x1E5 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32_shared |
| 486 | 0x1E6 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16 |
| 487 | 0x1E7 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16_satfinite |
| 488 | 0x1E8 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32 |
| 489 | 0x1E9 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32_satfinite |
| 490 | 0x1EA | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16 |
| 491 | 0x1EB | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16_satfinite |
| 492 | 0x1EC | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32 |
| 493 | 0x1ED | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32_satfinite |
| 494 | 0x1EE | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16 |
| 495 | 0x1EF | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16_satfinite |
| 496 | 0x1F0 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32 |
| 497 | 0x1F1 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32_satfinite |
| 498 | 0x1F2 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16 |
| 499 | 0x1F3 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16_satfinite |
| 500 | 0x1F4 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32 |
| 501 | 0x1F5 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32_satfinite |
| 502 | 0x1F6 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16 |
| 503 | 0x1F7 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16_satfinite |
| 504 | 0x1F8 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32 |
| 505 | 0x1F9 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32_satfinite |
| 506 | 0x1FA | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16 |
| 507 | 0x1FB | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16_satfinite |
| 508 | 0x1FC | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32 |
| 509 | 0x1FD | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32_satfinite |
| 510 | 0x1FE | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16 |
| 511 | 0x1FF | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16_satfinite |
| 512 | 0x200 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32 |
| 513 | 0x201 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32_satfinite |
| 514 | 0x202 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16 |
| 515 | 0x203 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16_satfinite |
| 516 | 0x204 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32 |
| 517 | 0x205 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32_satfinite |
| 518 | 0x206 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16 |
| 519 | 0x207 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16_global |
| 520 | 0x208 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16_shared |
| 521 | 0x209 | __cuda_sm70_wmma_m8n32k16_store_d_col_f32 |
| 522 | 0x20A | __cuda_sm70_wmma_m8n32k16_store_d_col_f32_global |
| 523 | 0x20B | __cuda_sm70_wmma_m8n32k16_store_d_col_f32_shared |
| 524 | 0x20C | __cuda_sm70_wmma_m8n32k16_store_d_row_f16 |
| 525 | 0x20D | __cuda_sm70_wmma_m8n32k16_store_d_row_f16_global |
| 526 | 0x20E | __cuda_sm70_wmma_m8n32k16_store_d_row_f16_shared |
| 527 | 0x20F | __cuda_sm70_wmma_m8n32k16_store_d_row_f32 |
| 528 | 0x210 | __cuda_sm70_wmma_m8n32k16_store_d_row_f32_global |
| 529 | 0x211 | __cuda_sm70_wmma_m8n32k16_store_d_row_f32_shared |
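The `__cuda_sm70_wmma_*` block above is a strict combinatorial grid: each of the three shapes (m16n16k16, m32n8k16, m8n32k16) contributes exactly 68 names. The sketch below is a hypothetical reconstruction of that grid (not ptxas's own generator, whose real name builder is sub_5D7430); the interpretation of the two `mma` type suffixes as D/C accumulator types is an assumption read off the name pattern.

```python
from itertools import product

def wmma_names(shape):
    """Enumerate the __cuda_sm70_wmma_<shape>_* names seen in the ID table."""
    names = []
    spaces = ["", "_global", "_shared"]  # generic / .global / .shared variants
    # load_a / load_b: fragment x layout x address space (12 names)
    for frag, layout, space in product(["a", "b"], ["col", "row"], spaces):
        names.append(f"__cuda_sm70_wmma_{shape}_load_{frag}_{layout}{space}")
    # load_c / store_d additionally carry the element type (24 names)
    for op, layout, ty, space in product(
        [("load", "c"), ("store", "d")], ["col", "row"], ["f16", "f32"], spaces
    ):
        names.append(f"__cuda_sm70_wmma_{shape}_{op[0]}_{op[1]}_{layout}_{ty}{space}")
    # mma: A layout x B layout x two type slots x optional satfinite (32 names)
    for la, lb, td, tc, sat in product(
        ["col", "row"], ["col", "row"],
        ["f16", "f32"], ["f16", "f32"], ["", "_satfinite"]
    ):
        names.append(f"__cuda_sm70_wmma_{shape}_mma_{la}_{lb}_{td}_{tc}{sat}")
    return names

# 12 + 24 + 32 = 68 names per shape, matching IDs 326-529 (3 shapes x 68).
print(len(wmma_names("m16n16k16")))
```

Running the generator for all three shapes reproduces the 204 consecutive IDs 0x146-0x211.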
__cuda_sm80_* -- Ampere createpolicy (3 entries, 0x212--0x214, sm_80)
| ID | Hex | Name |
|---|---|---|
| 530 | 0x212 | __cuda_sm80_createpolicy_fractional |
| 531 | 0x213 | __cuda_sm80_createpolicy_fractional_encode |
| 532 | 0x214 | __cuda_sm80_createpolicy_range_encode |
__cuda_sm_10x_* -- Blackwell hmma/imma/bit MMA (10 entries, 0x215--0x21E, sm_100)
| ID | Hex | Name |
|---|---|---|
| 533 | 0x215 | __cuda_sm_10x_hmma_mdata_m16n8k16 |
| 534 | 0x216 | __cuda_sm_10x_hmma_mdata_m16n8k32 |
| 535 | 0x217 | __cuda_sm_10x_imma_mdata_m16n8k32 |
| 536 | 0x218 | __cuda_sm_10x_imma_mdata_m16n8k64 |
| 537 | 0x219 | __cuda_sm_10x_mma_bit_internal_and_m16n8k128 |
| 538 | 0x21A | __cuda_sm_10x_mma_bit_internal_and_m16n8k256 |
| 539 | 0x21B | __cuda_sm_10x_mma_bit_internal_and_m8n8k128 |
| 540 | 0x21C | __cuda_sm_10x_mma_bit_internal_xor_m16n8k128 |
| 541 | 0x21D | __cuda_sm_10x_mma_bit_internal_xor_m16n8k256 |
| 542 | 0x21E | __cuda_sm_10x_mma_bit_internal_xor_m8n8k128 |
__cuda_sm_8x_* -- Direct MMA + shfl (14 entries, 0x21F--0x22C, sm_80+)
| ID | Hex | Name |
|---|---|---|
| 543 | 0x21F | __cuda_sm_8x_mma_col_col_f16_f16_f16_f16 |
| 544 | 0x220 | __cuda_sm_8x_mma_col_col_f32_f16_f16_f16 |
| 545 | 0x221 | __cuda_sm_8x_mma_col_col_f32_f16_f16_f32 |
| 546 | 0x222 | __cuda_sm_8x_mma_col_row_f16_f16_f16_f16 |
| 547 | 0x223 | __cuda_sm_8x_mma_col_row_f32_f16_f16_f16 |
| 548 | 0x224 | __cuda_sm_8x_mma_col_row_f32_f16_f16_f32 |
| 549 | 0x225 | __cuda_sm_8x_mma_row_col_f16_f16_f16_f16 |
| 550 | 0x226 | __cuda_sm_8x_mma_row_col_f32_f16_f16_f16 |
| 551 | 0x227 | __cuda_sm_8x_mma_row_col_f32_f16_f16_f32 |
| 552 | 0x228 | __cuda_sm_8x_mma_row_row_f16_f16_f16_f16 |
| 553 | 0x229 | __cuda_sm_8x_mma_row_row_f32_f16_f16_f16 |
| 554 | 0x22A | __cuda_sm_8x_mma_row_row_f32_f16_f16_f32 |
| 555 | 0x22B | __cuda_sm_8x_mma_shfl_f16 |
| 556 | 0x22C | __cuda_sm_8x_mma_shfl_f32 |
__cuda_sm_9x_* -- Hopper sub-byte/bit MMA (51 entries, 0x22D--0x25F, sm_90)
| ID | Hex | Name |
|---|---|---|
| 557 | 0x22D | __cuda_sm_9x_mma_bit_internal_xor_m16n8k128 |
| 558 | 0x22E | __cuda_sm_9x_mma_bit_internal_xor_m16n8k256 |
| 559 | 0x22F | __cuda_sm_9x_mma_bit_internal_xor_m8n8k128 |
| 560 | 0x230 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4 |
| 561 | 0x231 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4_satfinite |
| 562 | 0x232 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4 |
| 563 | 0x233 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4_satfinite |
| 564 | 0x234 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4 |
| 565 | 0x235 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4_satfinite |
| 566 | 0x236 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4 |
| 567 | 0x237 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4_satfinite |
| 568 | 0x238 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4 |
| 569 | 0x239 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4_satfinite |
| 570 | 0x23A | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4 |
| 571 | 0x23B | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4_satfinite |
| 572 | 0x23C | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4 |
| 573 | 0x23D | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4_satfinite |
| 574 | 0x23E | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4 |
| 575 | 0x23F | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4_satfinite |
| 576 | 0x240 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4 |
| 577 | 0x241 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4_satfinite |
| 578 | 0x242 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4 |
| 579 | 0x243 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4_satfinite |
| 580 | 0x244 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4 |
| 581 | 0x245 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4_satfinite |
| 582 | 0x246 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4 |
| 583 | 0x247 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4_satfinite |
| 584 | 0x248 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4 |
| 585 | 0x249 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4_satfinite |
| 586 | 0x24A | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4 |
| 587 | 0x24B | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4_satfinite |
| 588 | 0x24C | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4 |
| 589 | 0x24D | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4_satfinite |
| 590 | 0x24E | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4 |
| 591 | 0x24F | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4_satfinite |
| 592 | 0x250 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_0 |
| 593 | 0x251 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_1 |
| 594 | 0x252 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_0 |
| 595 | 0x253 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_1 |
| 596 | 0x254 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_0 |
| 597 | 0x255 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_1 |
| 598 | 0x256 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_0 |
| 599 | 0x257 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_1 |
| 600 | 0x258 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_0 |
| 601 | 0x259 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_1 |
| 602 | 0x25A | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_0 |
| 603 | 0x25B | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_1 |
| 604 | 0x25C | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_0 |
| 605 | 0x25D | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_1 |
| 606 | 0x25E | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_0 |
| 607 | 0x25F | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_1 |
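The "51 entries" claim for the `__cuda_sm_9x_*` family can be cross-checked by regenerating the names from the grid visible in the table. This is a hypothetical reconstruction for verification only; the meaning of the trailing `_0`/`_1` suffix on the sparse m16n8k64 variants is not stated in the table and is left uninterpreted here.

```python
from itertools import product

base = "__cuda_sm_9x_mma"
names = []

# 3 single-bit XOR shapes (IDs 557-559)
for shape in ["m16n8k128", "m16n8k256", "m8n8k128"]:
    names.append(f"{base}_bit_internal_xor_{shape}")

types = ["s4", "u4"]
sats = ["", "_satfinite"]

# dense sub-byte: 3 shapes x 4 signedness pairs x satfinite = 24 names
for shape, ta, tb, sat in product(
    ["m16n8k32", "m16n8k64", "m8n8k32"], types, types, sats
):
    names.append(f"{base}_sub_byte_internal_{shape}_{ta}_{tb}{sat}")

# sparse m16n8k128: 4 signedness pairs x satfinite = 8 names
for ta, tb, sat in product(types, types, sats):
    names.append(f"{base}_sub_byte_internal_sparse_m16n8k128_{ta}_{tb}{sat}")

# sparse m16n8k64: same grid, doubled by a _0/_1 suffix = 16 names
for ta, tb, sat, sel in product(types, types, sats, ["_0", "_1"]):
    names.append(f"{base}_sub_byte_internal_sparse_m16n8k64_{ta}_{tb}{sat}{sel}")

# 3 + 24 + 8 + 16 = 51, matching the stated range 0x22D-0x25F.
print(len(names))
```

The same per-family arithmetic (hex range width vs. generated grid size) holds for the sm80, sm_10x, and sm_8x tables above.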