Instruction Selection

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Instruction selection in ptxas is a two-phase process that converts PTX virtual ISA operations into concrete SASS machine opcodes. Unlike LLVM, which performs selection inside one framework (SelectionDAG or GlobalISel), ptxas splits instruction selection across two pipeline stages separated by the entire optimization pipeline. Phase 1 converts PTX opcodes to Ori IR opcodes during initial lowering (pipeline phase 5, ConvertUnsupportedOps). Phase 2 converts Ori IR to final SASS binary forms during code generation (pipeline phases 112--122, ISel driver + Mercury encoder). The two phases serve different purposes: Phase 1 legalizes the IR so the optimizer can reason about it, while Phase 2 selects the final machine encoding for the target architecture after register allocation and scheduling are complete.

| Component | Location / details |
|---|---|
| Phase 1 location | Phase 5: ConvertUnsupportedOps (PTX opcode to Ori opcode) |
| Phase 2 location | Phases 112+: ISel driver + Mercury encoder (Ori to SASS binary) |
| MercConverter dispatch | sub_9ED2D0 (25 KB, master switch on *(instr+72) & 0xCF mask) |
| ISel driver | sub_B285D0 (9 KB, 66 callees, vtable entry) |
| ISel mega-selector | sub_C0EB10 (185 KB, 500+ locals, giant switch) |
| DAG pattern matchers | ~801 functions at 0xB28F60--0xB7D000 (~1.3 MB) |
| Arch dispatch tables | 4 copies at sub_B128E0--sub_B12920 (15,049 bytes each) |
| Mercury master encoder | sub_6D9690 (94 KB, instruction type switch) |
| MercExpand | sub_C3CC60 (26 KB, pseudo-instruction expansion) |
| SM120 pattern coordinator | sub_13AF3D0 (137 KB, 130-case switch, opcodes 2--352) |
| Opcode variant selectors | sub_B0BE00 (19 KB, class 194), sub_B0AA70 (5 KB, class 306) |

Architecture

PTX source text
     |
     v
[Bison parser]  sub_4CE6B0 (48KB)
     |  Reduction actions build raw Ori nodes with PTX-derived opcodes
     v
+------------------------------------------------------------------+
| RAW ORI IR (PTX opcodes: add.f32, ld.global, mad.lo.s32, ...)    |
+------------------------------------------------------------------+
     |
     |  PHASE 1: PTX-to-Ori Opcode Legalization (phase 5)
     |
     |  sub_9F3340 (orchestrator, 7KB)
     |    -> sub_9F1A90 (MercConverter main, 35KB)
     |         -> sub_9ED2D0 (opcode dispatch, 25KB)
     |              Switch on (*(instr+72)) with BYTE1 & 0xCF mask
     |              ~120 case values -> ~60 handler functions
     |              + vtable dispatch for architecture-extensible ops
     |         -> sub_934630 (instruction creation, called N times)
     |    -> sub_9EF5E0 (post-conversion lowering, 27KB)
     |
     v
+------------------------------------------------------------------+
| OPTIMIZER-READY ORI IR (SASS opcodes: FADD, IMAD, LDG, STG, ...).|
| Every instruction has a valid SASS opcode for the target SM.      |
+------------------------------------------------------------------+
     |
     |  [Phases 14-111: Full optimization pipeline]
     |  Register allocation, scheduling, peephole, etc.
     |
     v
+------------------------------------------------------------------+
| OPTIMIZED ORI IR (register-allocated, scheduled)                  |
+------------------------------------------------------------------+
     |
     |  PHASE 2: Ori-to-SASS Selection & Encoding (phases 112+)
     |
     |  sub_B285D0 (ISel driver, 9KB)
     |    -> sub_C0EB10 (mega-selector, 185KB, default backend)
     |    -> sub_13AF3D0 (pattern coordinator, 137KB, SM120 backend)
     |    -> sub_B1FA20 / sub_B20E00 (builder variants)
     |    -> sub_B28F60..sub_B74C60 (~801 DAG pattern matchers)
     |    -> sub_B128E0..sub_B12920 (4 arch dispatch tables)
     |
     |  sub_6D9690 (Mercury master encoder, 94KB)
     |    -> Switch on instruction type (*(instr+8))
     |    -> sub_C00BF0 (opcode lookup)
     |    -> sub_91D160 (register encoding)
     |    -> sub_7B9B80 (bitfield insert, 18,347 callers)
     |
     |  sub_C3CC60 (MercExpand, 26KB)
     |    -> sub_C37A10 (expand instruction, 16KB)
     |    -> sub_C39B40 (expand memory, 10KB)
     |    -> sub_C3BCD0 (expand control flow, 19KB)
     |
     v
+------------------------------------------------------------------+
| SASS binary (packed machine code in 64/128/256-bit words)         |
+------------------------------------------------------------------+

Phase 1: PTX-to-Ori Opcode Conversion

Phase 1 runs as ConvertUnsupportedOps (pipeline phase 5), the most substantial bridge phase. Its job is to replace every PTX-derived opcode in the raw Ori IR with a valid SASS-level opcode for the target SM. After this phase completes, the optimizer sees only SASS-level instruction semantics.

The conversion is not a simple table lookup. Many PTX operations have no 1:1 SASS equivalent and must be expanded into multi-instruction sequences. The expansion depends on the target architecture, the operand types, and the available hardware functional units.

MercConverter Dispatch -- sub_9ED2D0 (25 KB)

The central dispatch function of Phase 1. Despite the sweep's initial identification of it as PhaseRunner::executePhaseSequence, the decompiled code reveals a classic opcode switch: it reads the 32-bit word at *(instr+72), masks byte 1 with 0xCF (clearing bits 4--5 of that byte, i.e. bits 12--13 of the word, which hold modifier flags), and dispatches to per-category handler functions. The switch covers approximately 120 distinct case values (opcode indices 1--352) routing to roughly 60 handler functions plus vtable-dispatched methods for architecture-extensible operations.

// sub_9ED2D0 -- simplified dispatch logic
void MercConverter_Dispatch(context, instruction) {
    // Pre-dispatch: check predication eligibility
    bool can_predicate = sub_7E18A0(instruction, *(context+8));
    if (can_predicate)
        can_predicate = vtable[205](*(*(context+8)+1584), instruction);
    *(context+40) = can_predicate;

    // Read opcode, mask out modifier bits
    int opcode = *(DWORD*)(instruction + 72);
    BYTE1(opcode) &= 0xCF;

    // Special case: opcode 130 (HSET2 in ROT13; internal marker) with GPR operand -> clear predication
    if (opcode == 130) {
        int operand = *(DWORD*)(instruction + 84);
        if (((operand >> 28) & 7) == 1 && reg_type(operand) == 6)
            *(context+40) = 0;
    }

    // Main dispatch
    switch (opcode) {
    case 1:   sub_9DA5C0(context, instruction);  break;  // opcode class 1
    case 6:   sub_9DA100(context, instruction);  break;  // arithmetic
    case 8:   sub_9D2440(context, instruction);  break;  // specific class
    case 10: case 11: case 149: case 151: case 152: case 290: case 291:
              sub_9D80E0(context, instruction);  break;  // memory load/store
    case 16:  sub_9E8B20(context, instruction);  break;  // texture/surface
    case 61: case 63: case 80:
              sub_9E6600(context, instruction);  break;  // instruction expansion
    case 108: sub_9D76D0(context, instruction);  break;  // memory legalization
    // ... ~100 more cases ...
    default:  emit_noop(context, 0xFFFF);        break;  // unknown -> passthrough
    }

    // Post-dispatch: apply predication and operand adjustments
    vtable[107](context, instruction);
}
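
The opcode masking step can be checked in isolation. The following is a minimal sketch, assuming the opcode is held in a 32-bit word; the function names are hypothetical and not taken from the binary. It shows that the byte-wise form (BYTE1 &= 0xCF) and the word-wise form (& 0xFFFFCFFF, as used by the ISel driver) clear the same two bits.

```c
#include <stdint.h>

/* Word-wise form: clear bits 12-13 of the 32-bit opcode word. */
static inline uint32_t strip_modifier_bits(uint32_t raw_opcode) {
    return raw_opcode & 0xFFFFCFFFu;
}

/* Byte-wise form: isolate byte 1, apply the 0xCF mask, and put it back.
   Mirrors BYTE1(opcode) &= 0xCF from the dispatch listing. */
static inline uint32_t strip_modifier_bits_bytewise(uint32_t raw_opcode) {
    uint32_t byte1 = (raw_opcode >> 8) & 0xFFu;
    byte1 &= 0xCFu;
    return (raw_opcode & 0xFFFF00FFu) | (byte1 << 8);
}
```

For instance, a raw word of 0x3082 (opcode 130 with both modifier bits set) reduces to 0x82, so the switch sees case 130 regardless of the modifier bits.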

MercConverter Opcode Dispatch Table

The complete switch covers opcodes 1--352. Cases route to three dispatch mechanisms: direct function calls (for common PTX categories), vtable-indirect calls (for architecture-extensible operations), and the emit_noop fallback for unrecognized opcodes. Below is the reconstructed routing table from the decompiled sub_9ED2D0.

Direct handler dispatch (73 handlers):

| Opcode(s) | Handler | Size | Category |
|---|---|---|---|
| 1 | sub_9DA5C0 | 2 KB | Opcode class 1 (basic ALU) |
| 6 | sub_9DA100 | 9 KB | Arithmetic operations |
| 8 | sub_9D2440 | -- | Specific class |
| 10, 11, 149, 151, 152, 290, 291 | sub_9D80E0 | 17 KB | Memory load/store |
| 15, 85 | sub_9EC340 | 23 KB | Multi-operand legalization |
| 16 | sub_9E8B20 | 17 KB | Texture/surface lowering |
| 17 | sub_9E7FB0 | -- | Surface operations |
| 22 | sub_9D6DB0 | -- | Specific lowering |
| 23 | sub_9E58F0 | -- | Specific lowering |
| 24 | sub_9D9F60 | -- | Specific lowering |
| 26 | sub_9E54C0 | -- | Specific lowering |
| 27 | sub_9E4BB0 | -- | Specific lowering |
| 28 | sub_9D9E70 | -- | Specific lowering |
| 32, 271 | sub_9E2440 | -- | Bitfield operations |
| 34 | sub_9E55E0 | -- | Specific lowering |
| 38, 59, 106, 180, 182, 192, 194, 215, 221, 242 | sub_9DA6B0 | -- | Generic ALU group |
| 41, 284 | sub_9D1DA0 | -- | Specific lowering |
| 42, 53, 55, 66 | sub_9D54B0 | -- | Grouped operations |
| 47 | sub_9E74E0 | -- | Conditional (arch flag check) |
| 51 | sub_9E2F60 | -- | Specific lowering |
| 52, 54, 72, 97 | sub_9D09C0 | -- | Group with v8=1 (deletion flag) |
| 57, 101 | sub_9D6170 | -- | Paired operations |
| 60, 62, 78, 79 | sub_9E5EE0 | -- | Comparison group |
| 61, 63, 80 | sub_9E6600 | 25 KB | Instruction expansion (64-bit split) |
| 67 | sub_9D9C30 | -- | Specific lowering |
| 70 | sub_9E3490 | -- | Specific lowering |
| 75 | sub_9E0C10 | -- | Specific lowering |
| 77 | sub_9E4DF0 | -- | Specific lowering |
| 83 | sub_9D6AB0 | -- | Specific lowering |
| 88, 89 | sub_9D5990 | -- | Paired operations |
| 90 | sub_9D2820 | -- | Specific lowering |
| 91 | sub_9E7600 | -- | Specific lowering |
| 92 | sub_9E7890 | -- | Specific lowering |
| 93, 95 | sub_9E1D40 | -- | Comparison variants |
| 94 | sub_9E1DF0 | -- | Specific lowering |
| 96 | sub_9D41C0 | -- | Specific lowering |
| 98 | sub_9D3230 | -- | Specific lowering |
| 100 | sub_9D70E0 | -- | Specific lowering |
| 102 | sub_9D9750 | -- | Specific lowering |
| 103, 104 | sub_9E31D0 | -- | Paired operations |
| 108 | sub_9D76D0 | 18 KB | Memory instruction legalization |
| 124 | sub_9E18B0 | -- | Specific lowering |
| 135 | sub_9D6560 | -- | Specific lowering |
| 139, 140, 141, 143 | sub_9D4C10 | -- | Related operations group |
| 145 | sub_9D3020 | -- | Specific lowering |
| 155, 268 | sub_9E5260 | -- | Paired operations |
| 156 | sub_9D94B0 | -- | Specific lowering |
| 158, 167 | sub_9E4A00 | -- | Paired operations |
| 161 | sub_9D21D0 | -- | Specific lowering |
| 162 | sub_9D9660 | -- | Specific lowering |
| 166 | sub_9E2100 | -- | Specific lowering |
| 170 | sub_9E2DF0 | -- | Specific lowering |
| 173, 267 | sub_9EB5C0 | -- | Paired operations |
| 174 | sub_9D9300 | -- | Specific lowering |
| 184 | sub_9D2E70 | -- | Specific lowering |
| 185 | sub_9E32F0 | -- | Specific lowering |
| 188, 190 | sub_9E2970 | -- | Paired operations |
| 195 | sub_9D2AB0 | -- | Specific lowering |
| 196 | sub_9D9080 | -- | Specific lowering |
| 198 | sub_9D66F0 | -- | Specific lowering |
| 201, 202, 204, 285 | sub_9EAC30 | -- | Async/bulk group |
| 203 | sub_9D8E90 | -- | Specific lowering |
| 205 | sub_9E1260 | -- | Specific lowering |
| 209 | sub_9E5740 | -- | Specific lowering |
| 210, 213, 214 | sub_9D8B30 | -- | Grouped operations |
| 240 | sub_9D6280 | -- | Specific lowering |
| 241 | sub_9E2CC0 | -- | Specific lowering |
| 247 | sub_9D0F70 | -- | Specific lowering |
| 248 | sub_9D0DF0 | -- | Specific lowering |
| 262 | sub_9E7440 | -- | Specific lowering |
| 264 | sub_9D73F0 | -- | Specific lowering |
| 276 | sub_9D5EC0 | -- | Specific lowering |
| 292 | sub_9D0E90 | -- | Specific lowering |

Vtable-indirect dispatch (for architecture-extensible operations):

| Opcode(s) | Vtable slot (byte offset) | Category (inferred) |
|---|---|---|
| 2, 3, 4, 5, 7 | vtable[0] (+0) | Generic fallback |
| 14, 39, 40, 105, 125, 299, 300, 321 | vtable[7] (+56) | Group A operations |
| 18 | vtable[3] (+24) | Specific class |
| 31 | vtable[4] (+32) | Specific class |
| 35 | vtable[6] (+48) | Specific class |
| 36 | vtable[21] (+168) | Specific class |
| 43 | vtable[9] (+72) | Specific class |
| 50 | vtable[12] (+96) | Specific class |
| 65 | vtable[22] (+176) | Specific class |
| 73 | vtable[15] (+120) | Specific class |
| 74 | vtable[16] (+128) | Specific class |
| 81 | vtable[24] (+192) | Specific class |
| 110, 111, 112, 114 | vtable[25] (+200) | Warp shuffle group |
| 118 | vtable[10] (+80) | Specific class |
| 119 | vtable[28] (+224) | Specific class |
| 120, 121, 126, 127, 128, 280, 281 | vtable[27] (+216) | Barrier/sync group |
| 122, 123, 310, 311, 312 | vtable[26] (+208) | Related group |
| 130 (HSET2), 169 | vtable[29] (+232) | Move/convert group (130 is MOV-like internally; actual SASS MOV = 19) |
| 157 | vtable[84] (+672) | Specific class |
| 176, 177 | vtable[34] (+272) | Paired operations |
| 183, 288 | vtable[36] (+288) | Paired operations |
| 186 | vtable[35] (+280) | Specific class |
| 211 | vtable[39] (+312) | Specific class |
| 220 | vtable[40] (+320) | Specific class |
| 223, 238 | vtable[41] (+328) | Paired operations |
| 228 | vtable[42] (+336) | Specific class |
| 243 | vtable[43] (+344) | Specific class |
| 245--253, 257 | vtable[67--77] (+536--+624) | SM 100+ operations |
| 265, 266 | vtable[93] (+744) | Paired operations |
| 270 | vtable[77] (+616) | Specific class |
| 277 | vtable[65] or vtable[11] (+520/+88) | Operand-type dependent |
| 279--351 | various high vtable offsets | SM 100+ / Blackwell operations |

The vtable mechanism allows architecture backends to override conversion behavior without modifying the core dispatch. The vtable factory at sub_1CCEEE0 (17 KB, 244 callees) selects which overrides are active based on the SM version.
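
The override mechanism can be illustrated with a small function-pointer table. This is a sketch under stated assumptions, not a reconstruction of sub_1CCEEE0: the slot count, the handler names, the SM threshold, and the choice of overridden slots are all hypothetical. Only the 8-bytes-per-slot offset arithmetic comes from the table above.

```c
#include <stddef.h>

#define SLOT_COUNT 94  /* hypothetical: large enough to hold slot 93 */

typedef int (*convert_fn)(int opcode);

/* Hypothetical handlers: each returns a tag so the caller can tell which ran. */
static int convert_default(int opcode) { (void)opcode; return 0; }
static int convert_sm100(int opcode)   { (void)opcode; return 100; }

typedef struct { convert_fn slot[SLOT_COUNT]; } arch_vtable;

/* Byte offset of a slot, assuming 8-byte function pointers
   (vtable[7] -> +56, vtable[25] -> +200, as in the table). */
static size_t slot_offset(size_t index) { return index * 8; }

/* Factory sketch: install defaults, then overlay SM-specific overrides. */
static void build_vtable(arch_vtable *vt, int sm_version) {
    for (size_t i = 0; i < SLOT_COUNT; ++i)
        vt->slot[i] = convert_default;
    if (sm_version >= 100) {                 /* assumed gating condition */
        for (size_t i = 67; i <= 77; ++i)    /* slots the table marks as SM 100+ */
            vt->slot[i] = convert_sm100;
    }
}

/* The core dispatch stays unchanged: it just calls through the slot. */
static int dispatch(const arch_vtable *vt, size_t index, int opcode) {
    return vt->slot[index](opcode);
}
```

The point of the design is that the switch in the dispatcher never needs to know the SM version; only the factory does.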

Per-Category Handlers

The larger handlers implement non-trivial conversion logic:

| Handler | Size | Category | Key behavior |
|---|---|---|---|
| sub_9E6600 | 25 KB | Instruction expansion | Splits 64-bit ops on 32-bit ALU into hi/lo pairs with carry chains. Calls sub_9D4380 (instruction builder) ~10 times per expansion. |
| sub_9EC340 | 23 KB | Multi-operand legalization | Operand type test: ((v >> 28) & 7) == 1 means register. Register class query via sub_7BE7B0. Creates new instructions via sub_7DEAD0. |
| sub_9D76D0 | 18 KB | Memory legalization (load/store) | Register type dispatch: 6=GPR, 7=predicate, 3=address. Uses sub_9D4380 (instruction builder) and sub_9CD420 (predication). |
| sub_9D80E0 | 17 KB | Memory legalization (variant) | Handles the load/store opcode group (10, 11, 149, 151, 152, 290, 291); an alternate code path to sub_9D76D0 for different operand patterns. |
| sub_9E8B20 | 17 KB | Texture/surface lowering | Register type 6 = GPR. Manipulates bitmask at register descriptor offset +48. |
| sub_9DA100 | 9 KB | Arithmetic operations | Handles opcode case 6 -- standard ALU instruction legalization. |
| sub_9DA6B0 | -- | Generic ALU group | Covers 10 opcode values (38, 59, 106, 180, 182, 192, 194, 215, 221, 242). |

1:1 vs 1:N Expansion

Most PTX operations map 1:1 to a single SASS opcode. When they do not, the handlers in sub_9E6600 and related functions create multi-instruction sequences:

PTX                                    Ori IR (after Phase 1)
-----------------------------------    -----------------------------------
add.f32  %r1, %r2, %r3          -->   FADD  R1, R2, R3                [1:1]
add.s32  %r4, %r5, %r6          -->   IADD3 R4, R5, R6, RZ           [1:1, operand added]
mul.lo.s64 %rd1, %rd2, %rd3     -->   IMAD.LO  R1, R2, R6, RZ       [1:N split]
                                       IMAD.HI  R0, R2, R6, RZ
                                       IMAD      R0, R3, R6, R0
                                       IMAD      R0, R2, R7, R0
div.f32  %r7, %r8, %r9          -->   MUFU.RCP  R10, R9              [1:N, Newton-Raphson]
                                       FMUL      R7, R8, R10
                                       (+ correction iterations)
bar.sync 0                       -->   BAR                            [1:1]

The expansion creates new instruction nodes via sub_934630 and links them into the doubly-linked instruction list. The original PTX-level instruction is replaced by the expanded sequence.
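
The arithmetic behind the four-instruction mul.lo.s64 split in the listing above can be verified on the host. The sketch below is an illustration of the decomposition only (the function name is hypothetical); it does not model instruction nodes or register assignment.

```c
#include <stdint.h>

/* Low 64 bits of a 64x64 multiply, built from 32-bit pieces.
   a = (a_hi:a_lo), b = (b_hi:b_lo). */
static uint64_t mul_lo_64_split(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t wide = (uint64_t)a_lo * b_lo;
    uint32_t lo = (uint32_t)wide;          /* IMAD.LO: low half of a_lo * b_lo   */
    uint32_t hi = (uint32_t)(wide >> 32);  /* IMAD.HI: high half of a_lo * b_lo  */
    hi += a_hi * b_lo;                     /* IMAD: cross term, wraps mod 2^32   */
    hi += a_lo * b_hi;                     /* IMAD: cross term, wraps mod 2^32   */

    return ((uint64_t)hi << 32) | lo;
}
```

The a_hi * b_hi term never appears because it only affects bits 64 and above, which mul.lo discards.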

Type-Dependent Opcode Selection

PTX's explicitly-typed opcodes (where the type is a qualifier like .f32, .s64) map to different SASS mnemonics based on the type:

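| PTX type | SASS prefix | Example PTX | Example SASS |
|---|---|---|---|
| .f16 / .f16x2 | H | add.f16 | HADD2 |
| .f32 | F | add.f32 | FADD |
| .f64 | D | add.f64 | DADD |
| .s32 / .u32 | I | add.s32 | IADD3 |
| .s64 / .u64 | I (split) | add.s64 | IADD3 + IADD3.X (carry chain) |
| .pred | P | setp.eq.f32 | FSETP |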

The type qualifier disappears from the instruction syntax during conversion. It becomes encoded in the SASS mnemonic itself (the F in FADD, the I in IADD3) and in the register class of the operands.
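
The .s64 / .u64 row relies on a carry chain between two 32-bit adds. A minimal host-side sketch of that split follows, assuming a plain two-input add; the function name is hypothetical and the mapping to IADD3 / IADD3.X in the comments follows the table rather than any decoded instruction stream.

```c
#include <stdint.h>

/* 64-bit add expressed as two 32-bit adds linked by a carry. */
static uint64_t add_64_split(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo = a_lo + b_lo;            /* IADD3: low halves, wraps mod 2^32 */
    uint32_t carry = (lo < a_lo) ? 1u : 0u; /* carry out of the low add        */
    uint32_t hi = a_hi + b_hi + carry;    /* IADD3.X: high halves plus carry   */

    return ((uint64_t)hi << 32) | lo;
}
```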

SM-Dependent Legalization

The MercConverter gates operations by SM version through the architecture vtable. An instruction available natively on one SM may require a multi-instruction lowering sequence on another:

  • 64-bit integer arithmetic on SM 50--75 (no native 64-bit ALU): splits into 32-bit hi/lo pairs
  • FP16 operations on pre-SM 53 targets: promoted to FP32 (handled by Phase 2 PromoteFP16)
  • bfe/bfi variants: some bit-field extract/insert modes not supported on all targets
  • Tensor core intrinsics: SM 70 has HMMA v1, SM 75 has HMMA v2, SM 80+ has HMMA v3/DMMA, SM 100 has TCGen05

The architecture vtable factory at sub_1CCEEE0 populates the vtable with SM-specific method overrides. The vtable has at least 94 method slots (the dispatch table above references slots up to vtable[93] at offset +744), with the highest-numbered slots (offset 624+) serving SM 100+ Blackwell operations.

Phase 2: Ori-to-SASS Selection & Encoding

Phase 2 runs during code generation (phases 112+) after the optimizer, register allocator, and scheduler have completed. It operates on fully optimized, register-allocated Ori IR and produces final SASS machine code. Phase 2 has three major components: the ISel driver with DAG pattern matching, the Mercury master encoder, and MercExpand pseudo-instruction expansion.

ISel Driver -- sub_B285D0 (9 KB)

The top-level ISel coordinator is a vtable entry point with 66 callees. It selects the appropriate instruction builder variant based on the target architecture:

// Simplified ISel driver
void ISel_LowerInstruction(context, instruction) {
    int sm = *(context + 184);          // SM version
    int opcode = instruction[18] & 0xFFFFCFFF;

    // Select architecture-variant builder
    if (sm == 14)
        Builder_VariantA(context, instruction);    // sub_B1FA20 (13 KB)
    else
        Builder_VariantB(context, instruction);    // sub_B20E00 (11 KB)

    // Apply post-ISel modifiers
    ApplyModifiers(context, instruction);           // sub_B1D670 (13 KB)
    SetProperties(context, instruction);            // sub_B241A0 (7 KB)
}

The two builder variants (sub_B1FA20 and sub_B20E00) are structurally near-identical, with 50 callees each. Both call sub_7E3EF0 (operand index helper) 6 times (3 source + 3 destination operands) and use sub_A3B930 (operand register class resolver). The key difference is the validation function: variant A uses sub_C49440, variant B uses sub_C49400, reflecting different encoding constraints for different SM families.

ISel Mega-Selector -- sub_C0EB10 (185 KB)

The single largest function in the Phase 2 ISel range: 185 KB decompiled, 6,016 lines, 719+ local variables. It performs the final Ori-IR-to-SASS opcode and operand encoding for 169 distinct instruction types (SASS opcode indices 7--221). While the ~801 DAG pattern matchers handle template-based ISel through a priority contest, the mega-selector handles complex instructions that require procedural, multi-step encoding logic -- instructions where the operand marshalling depends on runtime state (calling conventions, symbol resolution, address space aliasing).

Dual-Switch SM-Generation Dispatch

The function contains two copies of the same 169-case switch statement, separated by a vtable-based opcode translation mechanism. This dual-switch structure is the SM-generation dispatch:

// sub_C0EB10 -- simplified dispatch skeleton
void MegaSelector(context *a1, instruction *a2, isel_ctx *a3) {
    int64_t *vtable = *(a3->backend);
    int opcode = *(int *)(a2 + 8);           // SASS opcode type

    // Pre-dispatch: capability check via vtable[12]
    auto cap_check = vtable[12];              // offset +96
    if (cap_check != sub_BFEAA0)              // default stub?
        if (cap_check(a3, a2))
            ctx->flags[256] = 1;              // set encoding flag

    // Read opcode translator from vtable[2]
    auto translator = vtable[2];              // offset +16

    if (translator != sub_BFEBF0) {
        // PATH A: SM-specific translation
        int encoding_index = translator(a3, opcode);
        int isel_opcode = *(ctx + 8);         // post-translation opcode
        switch (isel_opcode) {                // PRIMARY SWITCH (169 cases)
            case 7: case 34: case 35: case 36:
                emit_simple(encoding_index, ...);
                break;
            case 8: case 38: case 46: ...
                /* already encoded */ break;
            // ... 169 cases total ...
            default: goto high_opcode_path;
        }
    } else {
        // PATH B: static table lookup (default backend)
        int encoding_index = 355;             // sentinel for extended opcodes
        if (opcode <= 0xDD)
            encoding_index = word_22B4B60[opcode];
        switch (opcode) {                     // FALLBACK SWITCH (same 169 cases)
            case 7: ...: goto handler_7;      // jumps into Path A handlers
            // ... identical case set ...
            default: return;
        }
    }

high_opcode_path:
    if (opcode > 0x199) return;
    // Try vtable[3] extension dispatch for SM 100+ / Blackwell
    auto extension = vtable[3];               // offset +24
    if (extension != sub_BFEA30)
        extension(a3, a2);                    // arch-extension handler
}

The dual-switch pattern is a code-generation artifact: the compiler emitted two copies because the vtable path and static-table path produce different values for the encoding index but need identical case routing. This doubles the binary size but avoids a conditional merge point at every case entry.

Three Vtable Dispatch Points

| Vtable slot | Offset | Default stub | Purpose |
|---|---|---|---|
| vtable[2] | +16 | sub_BFEBF0 | Opcode-to-encoding-index translator. SM-specific override remaps opcodes to different encoding slots. Fallback: word_22B4B60[] static table. |
| vtable[12] | +96 | sub_BFEAA0 | Pre-dispatch capability check. Returns boolean that sets ctx[256] encoding flag. |
| vtable[3] | +24 | sub_BFEA30 | Extension opcode handler for opcodes outside the 169-case set (barrier/sync 61--63/221, opcodes > 0x199, SM 100+ extensions). |

The word_22B4B60 static table is a uint16[] array indexed by SASS opcode (0--0xDD = 221). Each entry is a SASS encoding slot index. Opcodes > 221 receive the sentinel value 355. This provides the default encoding mapping; SM-specific vtable overrides can remap any opcode to a different encoding index, enabling per-architecture instruction variants without modifying the mega-selector logic.
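
The lookup rule described here (override first, then static table, then sentinel) can be sketched as follows. This is an assumption-laden illustration: the table is passed in as a parameter because the contents of word_22B4B60 are not reproduced in this page, NULL stands in for "the slot still holds the default stub", and example_override is a made-up translator.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_TABLE_OPCODE  0xDD  /* last opcode covered by the static table */
#define EXTENDED_SENTINEL 355   /* value used for opcodes beyond the table */

typedef int (*translator_fn)(int opcode);

/* Hypothetical SM-specific translator, for demonstration only. */
static int example_override(int opcode) { return opcode + 1000; }

static int lookup_encoding_index(const uint16_t *table,
                                 translator_fn override,
                                 int opcode) {
    if (override != NULL)                 /* stand-in for "!= default stub" */
        return override(opcode);
    if (opcode >= 0 && opcode <= MAX_TABLE_OPCODE)
        return table[opcode];             /* default mapping */
    return EXTENDED_SENTINEL;             /* extended opcode range */
}
```

With a table whose entry 7 is 42, the default path returns 42 for opcode 7 and 355 for opcode 222, while passing example_override bypasses the table entirely.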

Opcode Case Routing

The 169 distinct opcode cases (338 total case labels across both switches) group into approximately 70 handler blocks. The groupings reveal SASS ISA families:

| Group | Opcodes | Handler pattern | Instruction family |
|---|---|---|---|
| No-op passthrough | 8, 38, 46, 87, 89, 90, 93, 97, 98, 208 | goto LABEL_33 (already encoded) | Pre-encoded by upstream ISel |
| Simple emission | 7, 34, 35, 36 | sub_9314F0(encoding_index, 1 operand) | Basic ALU / simple 1-op |
| Branch/call | 9, 10, 11, 12, 13, 22 | sub_926370 / vtable[17] / linked-list walk | Control flow, call frames |
| Memory load/store | 15, 16, 18, 19, 20, 23, 24, 25, 26, 30 | sub_C01840 + address helpers | LDG, STG, LDS, etc. |
| Control flow | 31, 32, 33 | SSA phi nodes, branch tables | Phi, switch, call return |
| Generic ALU | 39, 41, 42, 50, 51, 52, 53 | sub_9314F0 passthrough | Standard arithmetic |
| Special register | 43, 44, 45 | sub_C06E90 symbol lookup | SR access, shared memory alias |
| Constant/predicate | 47, 54, 55, 56 | Direct operand copy / sub_BFFD60 | Constant bank, predicate ops |
| Address compute | 57 | 200-line handler, "__nv_reservedSMEM_offset_0_alias" | Complex addressing with SMEM |
| Immediate ops | 59, 60 | sub_C05CC0 / sub_C07690 | Immediate-operand variants |
| Barrier/sync | 61, 62, 63, 221 | Forward to vtable[3] extension | BAR, MEMBAR, SYNC |
| Conversion/move | 65 | Operand loop with per-element sub_9314F0 | MOV, CVT |
| Texture/surface | 67, 68, 69, 70 | Multi-operand type-qualified encoding | TEX, TLD, TXQ |
| Intrinsics | 71, 74, 75 | Loop-based operand emission | Hardware intrinsics |
| Tensor core | 84, 88, 91, 92 | Wide-operand encoding (case 92 = 354 lines) | HMMA, DMMA, IMMA, TCGen05 |
| Predication ext | 94, 95 | Predicate-dependent path selection | Extended predication |
| Memory extended | 99--130 (19 opcodes) | sub_C0B2C0 or sub_BFFD60 + encoding lookup | Extended memory ops |
| Warp intrinsics | 131--189 (50+ opcodes) | Mixed handlers, vtable[198]+632 dispatch | SHFL, VOTE, MATCH, REDUX |
| Async/bulk | 192--218 (15 opcodes) | sub_C0B2C0 / individual handlers | TMA, async copy, bulk ops |

The largest case handlers:

  • Cases 141/142: ~503 lines (warp shuffle/vote extended operations)
  • Case 92: ~354 lines (tensor core instructions -- widest operand format)
  • Cases 45, 57, 95: ~200 lines each (shared memory, address compute, predication)

Operand Encoding Protocol

The mega-selector encodes operands into a stack-allocated 256-byte output buffer using a tagged-pointer word format. Each operand occupies 8 bytes (a DWORD pair):

| Bits | Field | Description |
|---|---|---|
| [31:28] of word 0 | Type tag | 0x1=register, 0x4=constant bank, 0x5=immediate, 0x6=control/modifier, 0x9=special register |
| [23:0] of word 0 | Value | Register index, immediate value, or bank offset |
| word 1 | Flags | Modifier bits, encoding-format flags |
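
The layout above can be expressed as a pair of pack/unpack helpers. This is a minimal sketch: the struct and function names are hypothetical, and bits [27:24] of word 0, which the table does not describe, are simply left as zero.

```c
#include <stdint.h>

/* Type tags from the table above. */
enum operand_tag {
    TAG_REGISTER    = 0x1,
    TAG_CONST_BANK  = 0x4,
    TAG_IMMEDIATE   = 0x5,
    TAG_CONTROL     = 0x6,
    TAG_SPECIAL_REG = 0x9
};

/* One operand = 8 bytes (a DWORD pair), so a 256-byte buffer holds 32. */
typedef struct {
    uint32_t word0;  /* [31:28] tag, [23:0] value */
    uint32_t word1;  /* modifier / encoding-format flags */
} operand_slot;

static operand_slot pack_operand(uint32_t tag, uint32_t value, uint32_t flags) {
    operand_slot s;
    s.word0 = ((tag & 0xFu) << 28) | (value & 0x00FFFFFFu);
    s.word1 = flags;
    return s;
}

static uint32_t operand_tag_of(operand_slot s)   { return s.word0 >> 28; }
static uint32_t operand_value_of(operand_slot s) { return s.word0 & 0x00FFFFFFu; }
```

Under this layout, register R5 with no flags packs to word 0 = 0x10000005.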

The marshalling pipeline for a typical case:

1. sub_C01840(ctx, instr, operand_list, output_buf, max_count, ...)
   -> Iterates source operands, writes tagged words to output_buf
   -> Returns: number of operand words written

2. sub_C01F50(ctx, instr, dest_list, output_buf, max_count, ...)
   -> Same for destination operands

3. Encoding-index lookup:
   if (vtable[2] != default)
     index = vtable[2](ctx, opcode);
   else
     index = word_22B4B60[opcode];

4. sub_9314F0(output, ctx, encoding_index, count, n_words, buf, ...)
   -> Emits the instruction record to the output stream

| Helper | Calls | Purpose |
|---|---|---|
| sub_C01840 | 52 | Marshal source operands into tagged-word buffer |
| sub_9314F0 | 31 | Emit instruction with encoding index + operand buffer |
| sub_C00EA0 | 8 | Extract single operand as tagged word |
| sub_91D160 | 8 | Encode register index to encoding bits |
| sub_934630 | 6 | Build new instruction node in IR (for multi-instruction expansion) |
| sub_91D150 | 5 | Decode register index from operand word |
| sub_926370 | 4 | Emit simple instruction (branch/jump) |
| sub_C01F50 | 3 | Marshal destination operands |
| sub_7D6860 | 3 | Encode data type qualifier (FP32/FP64/INT) |
| sub_BFEF10 | 3 | Register bank capacity check / grow |
| sub_92E1B0 | 2 | Emit instruction with constant-bank operand |
Cross-Reference: Arch Dispatch Tables

The 4 arch dispatch tables (sub_B128E0--sub_B12920) are not called from the mega-selector. They operate at the Mercury encoder level:

Mega-selector (sub_C0EB10)
  -> Produces (encoding_index, operand_buffer) pairs
  -> Calls sub_9314F0 to package into instruction nodes

Mercury encoder (sub_6D9690)
  -> Reads instruction type field from instruction node
  -> Arch dispatch tables (sub_B128E0 etc.) resolve type to encoding format
  -> Encoder emits binary SASS using format + operand data

The mega-selector and arch dispatch tables thus operate at different abstraction levels: the mega-selector decides what to encode (opcode selection, operand marshalling), while the arch tables decide how to encode it (encoding format, bit layout). The arch tables' per-SM variants handle encoding-level differences (field widths, modifier positions) that are invisible to the mega-selector's opcode-level logic.

Post-ISel Modifiers -- sub_B1D670 (13 KB)

After the main ISel selection, this pass applies architecture-specific instruction modifications:

  • Opcode 13: sets instruction field [79] = 3
  • Opcode 14: sets instruction field [79] = 2
  • Opcode 11: separate modifier path

The function has 51 callees including sub_AAD690 (field accessor, called multiple times), sub_AADF40, and sub_C49400 (encoding validator). It handles encoding mode bits, register class adjustments, and predicate attachment.

Instruction Properties -- sub_B241A0 (7 KB)

Sets scheduling-relevant properties on the selected instruction:

  • inst[74] = 7 -- scheduling class
  • inst[75] = (opcode == 325) -- special flag for specific opcode
  • inst[77] = sub_A3B930(...) -- operand class from register resolver
  • inst[79] -- derived from a2[19], architecture-dependent

Contains a switch on *(context+46) (target architecture selector), confirming per-SM property assignment.

DAG Pattern Matchers -- ~801 Functions at 0xB28F60--0xB7D000

Every pattern matcher follows an identical prototype and a strict check-and-report protocol. These are the ptxas equivalent of LLVM's TableGen-generated ISel patterns, but handwritten in C++. Binary analysis confirms 801 functions with the matching *a4 <= priority-comparison idiom, with the bulk (750+) residing in the 0xB30000--0xB7D000 range and a handful of smaller matchers in the 0xB28F60--0xB30000 preamble zone.

Pattern Matcher Architecture

The pattern matching system implements a priority-based best-match selection protocol. For each instruction being lowered, the ISel infrastructure invokes all applicable matchers (dispatched through vtable function pointers, not direct calls). Each matcher independently tests whether the instruction matches its pattern; if it does, it writes a (template_id, priority) pair to the output parameters. The dispatcher selects the match with the highest priority value.

Function signature (all 801+ matchers):

char __fastcall match(
    int64_t  ctx,           // a1: ISel context (passed through to field reader)
    int64_t  dag_node,      // a2: pointer to the Ori IR instruction node
    int32_t *template_id,   // a3: OUT: encoding template index [1..152]
    int32_t *priority       // a4: IN/OUT: current best priority; written only if better
);

The priority parameter is read-then-conditionally-written: the matcher checks if (*a4 <= threshold) before overwriting, and on success stores threshold + 1, so a write always raises the stored priority. This means the dispatcher initializes *a4 = 0 and calls matchers in sequence; each matcher upgrades the result only if its own priority exceeds the current best. After all matchers complete, *a3 holds the template index of the winning pattern.

Matching pipeline (invariant across all 801 matchers):

 1. OPCODE PROPERTY CHECKS      sub_10AE5C0(ctx, node, field_id)
    Check 1-12 instruction properties against expected values.
    Any mismatch -> return immediately (early exit).

 2. SOURCE OPERAND COUNT         sub_B28F50(node) -> source_count
    Verify the instruction has the expected number of source operands.

 3. SOURCE OPERAND VALIDATION    sub_B28F30(node, i) -> operand_record
    For each source operand:
      a. Type predicate: isImmediate / isGPR / isPredicate / isUniformReg / ...
      b. Register class: class == 1023 (wildcard) OR class == specific_value

 4. RESULT OPERAND COUNT         sub_B28F40(node) -> result_count
    Verify the expected number of result (destination) operands.

 5. RESULT OPERAND VALIDATION    sub_B28F30(node, first_result + j)
    Same type + register-class checks as for source operands.
    First-result index = sub_B28E00(*(node + 92)).

 6. PRIORITY WRITE               if (*a4 <= N) { *a4 = N+1; *a3 = template; }
    Conditional update: only overwrite if this pattern is more specific
    than whatever was already matched.

Match-Score Priority System

The priority values range from 2 (least specific) to 34 (most specific), with the distribution heavily concentrated in the 8--19 range. The priority correlates directly with pattern specificity: matchers with more constraints (more sub_10AE5C0 checks, more operand type checks, tighter register class requirements) assign higher priority values.

| Priority range | Count | Interpretation |
|---|---|---|
| 2--5 | 31 | Fallback / generic patterns (few constraints) |
| 6--10 | 253 | Common patterns (3--6 constraints) |
| 11--15 | 293 | Standard patterns (5--8 constraints) |
| 16--20 | 168 | Specific patterns (6--10 constraints) |
| 21--34 | 56 | Highly specific patterns (8--12+ constraints) |

Template IDs range from 1 to 152. Multiple matchers can target the same template ID at different priority levels, forming a specificity ladder: a generic matcher might match FADD at priority 8 while a specialized matcher matches FADD.FTZ.SAT with specific register classes at priority 17. Both write the same template ID but the specialized matcher wins when its constraints are satisfied.
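
The priority contest can be sketched with two toy matchers. Everything here is hypothetical except the `if (*priority <= N) { *priority = N+1; ... }` idiom and the example priorities 8 and 17: the node struct, the matcher names, and the template IDs are invented, and the two matchers deliberately write different template IDs (12 and 13) so that the winner is observable, whereas the ptxas matchers described above may share one.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an Ori IR node: only the properties tested here. */
typedef struct {
    bool ftz;
    bool sat;
} ir_node;

typedef void (*matcher_fn)(const ir_node *node, int32_t *template_id, int32_t *priority);

/* Generic pattern: no constraints, priority 8. */
static void match_fadd_generic(const ir_node *node, int32_t *template_id, int32_t *priority) {
    (void)node;
    if (*priority <= 7) { *priority = 8; *template_id = 12; }
}

/* Specialized pattern: requires both modifiers, priority 17. */
static void match_fadd_ftz_sat(const ir_node *node, int32_t *template_id, int32_t *priority) {
    if (!node->ftz || !node->sat) return;  /* early exit on mismatch */
    if (*priority <= 16) { *priority = 17; *template_id = 13; }
}

/* Dispatcher sketch: start at priority 0, run every candidate, keep the best. */
static int32_t select_template(const ir_node *node, int32_t *priority_out) {
    /* Specialized matcher listed first to show that call order does not matter. */
    static const matcher_fn matchers[] = { match_fadd_ftz_sat, match_fadd_generic };
    int32_t template_id = 0;
    int32_t priority = 0;
    for (size_t i = 0; i < sizeof matchers / sizeof matchers[0]; ++i)
        matchers[i](node, &template_id, &priority);
    if (priority_out) *priority_out = priority;
    return template_id;
}
```

Because each matcher compares against the running best before writing, the result is the same whichever order the candidates are tried in.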

Dispatcher Mechanism

The matchers are not called directly from a single dispatch function. Instead, they are registered as virtual methods on per-instruction-class descriptor objects. The dispatch chain is:

sub_B285D0 (ISel driver, 9 KB)
  -> opcode switch on (instruction[18] & 0xFFFFCFFF)
     -> selects builder variant (sub_B1FA20 / sub_B20E00 / sub_B1EC10 / ...)
        -> builder invokes vtable method on instruction descriptor
           -> vtable slot contains pointer to one of the 801 pattern matchers
              -> matcher writes (template_id, priority) if pattern matches

The vtable dispatch occurs at various offsets including +2600, +2616, +2656, and +2896 (observed in sub_13AF3D0, the 137 KB ISel pattern coordinator). The matchers have no static callers -- they appear exclusively through indirect function pointer invocation, which is why the sweep reports them as "no callers in function DB."

For a given instruction, the dispatcher may invoke multiple matchers (one per applicable template variant). Each matcher independently checks its constraints and conditionally updates the priority/template pair. After all candidates have been tried, the dispatcher reads the final template_id and uses it to select the SASS encoding template.

DAG Node Property Accessor -- sub_10AE5C0

The field reader is the most-called function in the matcher range (typically 2--12 calls per matcher, or roughly 3,000--8,000 call sites across all 801 matchers):

// sub_10AE5C0 -- Read instruction property by field_id
int64_t DAGNode_ReadField(int64_t ctx, int64_t node, uint32_t field_id) {
    if (sub_10E32E0(node, field_id))        // field exists in descriptor?
        return sub_10D5E60(node, field_id); // read value from property table
    else
        return 0xFFFFFFFF;                  // sentinel: field not present
}

The field_id values form a large flat namespace (observed range: 5--595). These are not byte offsets into the instruction record; they are logical property identifiers resolved through a descriptor table. The backing store (managed by sub_10E32E0 / sub_10D5E60) implements a sparse property bag that maps field IDs to integer values.
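A minimal sketch of what a sparse property bag with an "absent" sentinel could look like is shown below. The actual storage behind sub_10E32E0 / sub_10D5E60 is not known; the `Property` and `Node` types and the linear search are assumptions chosen for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIELD_ABSENT 0xFFFFFFFFu  /* sentinel returned for missing fields */

/* Hypothetical sparse property bag: a short array of (id, value) pairs. */
typedef struct { uint32_t field_id; uint32_t value; } Property;
typedef struct { const Property *props; size_t count; } Node;

/* Linear lookup; returns FIELD_ABSENT when the id is not present. */
static uint32_t read_field(const Node *node, uint32_t field_id) {
    assert(node != NULL);
    for (size_t i = 0; i < node->count; i++)
        if (node->props[i].field_id == field_id)
            return node->props[i].value;
    return FIELD_ABSENT;
}
```

The sentinel means a matcher can compare the result against an expected value without a separate existence check: an absent field never equals a real property value.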

The companion write functions follow the same field-ID namespace:

// sub_10AE590 -- Write single field
void DAGNode_WriteField(int64_t ctx, int64_t node, uint32_t field_id, uint32_t value);

// sub_10AE640 -- Write two fields atomically (multi-field update)
void DAGNode_WriteFields(int64_t ctx, int64_t node, uint32_t f1, uint32_t v1, uint32_t v2);

Inferred semantic groupings for field IDs (from cross-referencing matcher patterns):

| Field range | Likely semantics |
|---|---|
| 5--7 | Opcode class / major instruction group |
| 88 | Sub-operation modifier |
| 105 | Operation variant selector |
| 126 | Data type qualifier (e.g., field 126 in {547,548}) |
| 163 | Addressing mode / operand encoding class |
| 190--211 | Encoding format selectors |
| 220 | Specific encoding property |
| 242 | Width/size qualifier |
| 294 | Generic constraint field |
| 327 | Register format descriptor |
| 345 | Rounding / saturation mode |
| 348 | Precision qualifier |
| 355--429 | Extended instruction properties |
| 397 | Instruction validity flag (value 2115 appears as a near-universal gate) |
| 480 | High opcode range (Blackwell/SM 100+ instructions) |
| 595 | Highest observed field ID |

Field 397 with value 2115 appears in the majority of matchers as a mandatory check, suggesting it encodes a "this instruction is encoding-compatible" or "instruction is valid for ISel" flag.

Operand Record Layout

Each operand is a 32-byte record accessed by index via sub_B28F30:

// sub_B28F30 -- Get operand record by index
int64_t GetOperand(int64_t node, int index) {
    return *(int64_t*)(node + 32) + 32LL * index;
}

The 32-byte operand record:

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 1 | type_tag | Operand kind (see predicate table below) |
| +4 | 4 | primary_class | Register class ID; 1023 = wildcard (any class) |
| +14 | 1 | modifier_a | Written by sub_B28F10 |
| +15 | 1 | modifier_b | Written by sub_B28F20 |
| +20 | 4 | secondary_class | Fallback register class constraint |

Source operand count is stored at node + 92 and doubles as the first-result-operand index:

uint32_t source_count = *(uint32_t*)(node + 92);   // sub_B28F50
uint32_t result_count = *(uint32_t*)(node + 40) + 1 - source_count; // sub_B28F40
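The offsets in the table above can be checked mechanically by writing the record as a C struct. This is a hypothetical view: only the five named fields come from the analysis, the `reserved*` gaps are unknown bytes, and reading the value at node + 40 as a "last operand index" is an inference from the `+ 1` in the formula.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the 32-byte operand record. Only the named
 * fields come from the table; the reserved gaps are unknown. */
typedef struct {
    uint8_t  type_tag;         /* +0  */
    uint8_t  reserved1[3];     /* +1  */
    uint32_t primary_class;    /* +4  */
    uint8_t  reserved8[6];     /* +8  */
    uint8_t  modifier_a;       /* +14 */
    uint8_t  modifier_b;       /* +15 */
    uint8_t  reserved16[4];    /* +16 */
    uint32_t secondary_class;  /* +20 */
    uint8_t  reserved24[8];    /* +24 */
} OperandRecord;

_Static_assert(sizeof(OperandRecord) == 32, "record must be 32 bytes");
_Static_assert(offsetof(OperandRecord, primary_class) == 4, "primary_class at +4");
_Static_assert(offsetof(OperandRecord, modifier_a) == 14, "modifier_a at +14");
_Static_assert(offsetof(OperandRecord, modifier_b) == 15, "modifier_b at +15");
_Static_assert(offsetof(OperandRecord, secondary_class) == 20, "secondary_class at +20");

/* Result count from the two stored values. Assumption: the value at
 * node+40 is the last operand index, so adding 1 gives the total. */
static uint32_t result_count(uint32_t last_operand_index, uint32_t source_count) {
    assert(last_operand_index + 1 >= source_count);
    return last_operand_index + 1 - source_count;
}
```

For a node with 2 sources and 4 results, the stored value at node + 40 would be 5 under this reading, and `result_count(5, 2)` gives 4.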

Operand Type Predicates

Fifteen predicate functions classify operand type tags. Each is a single comparison returning bool:

| Address | Name | Test | Semantics |
|---|---|---|---|
| sub_B28E20 | isImmediate | tag == 1 | Constant / immediate literal |
| sub_B28E10 | isGPR | tag == 2 | General-purpose register |
| sub_B28E80 | isPredicate | tag == 3 | Predicate register |
| sub_B28E70 | isType4 | tag == 4 | (specific operand class) |
| sub_B28E60 | isType5 | tag == 5 | (specific operand class) |
| sub_B28E30 | isSpecialReg | tag == 6 | Special register |
| sub_B28ED0 | isType7 | tag == 7 | (specific operand class) |
| sub_B28EF0 | isType8 | tag == 8 | (specific operand class) |
| sub_B28E50 | isType9 | tag == 9 | (specific operand class) |
| sub_B28E40 | isValidReg | tag == 10 | Generic valid register |
| sub_B28EE0 | isType11 | tag == 11 | (specific operand class) |
| sub_B28EA0 | isType13 | tag == 13 | (specific operand class) |
| sub_B28EB0 | isType14 | tag == 14 | (specific operand class) |
| sub_B28E90 | isUniformReg | tag == 15 | Uniform register (SM 75+) |
| sub_B28EC0 | isType16 | tag == 16 | (specific operand class) |

Register class 1023 serves as a wildcard: if (class == 1023 || class == expected). This allows matchers to accept both unconstrained operands and operands already assigned to a specific register file.

Register Class Constraint Protocol

Operand records carry two register class fields: primary_class at offset +4 and secondary_class at offset +20. The matching protocol checks them with a cascading OR:

// Typical register class check (from sub_B33F00, sub_B390A0, etc.)
uint32_t primary   = *(uint32_t*)(operand + 4);
uint32_t secondary = *(uint32_t*)(operand + 20);

if (sub_B28E00(primary) == 1023) {
    // Wildcard -- operand is unconstrained, accept it
} else {
    uint32_t cls = sub_B28E00(secondary);
    if (cls != expected_class) return;  // mismatch
}

sub_B28E00 and sub_B28F00 are identity functions -- the register class is stored as a plain integer, not packed. The two-field scheme allows the matcher to accept an operand where either the allocation constraint (primary) is wildcard or the resolved register file (secondary) matches.
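As a compact restatement of the protocol, the check can be written as a predicate. The function and macro names here are hypothetical; the set-based variant covers the "secondary in {1,2}" form that appears in the matcher walkthroughs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REGCLASS_WILDCARD 1023u

/* Accept when the primary class is the wildcard, or when the secondary
 * class is one of the allowed values. */
static bool reg_class_accepts_any(uint32_t primary, uint32_t secondary,
                                  const uint32_t *allowed, size_t n) {
    assert(allowed != NULL || n == 0);
    if (primary == REGCLASS_WILDCARD)
        return true;                  /* unconstrained operand */
    for (size_t i = 0; i < n; i++)
        if (secondary == allowed[i])
            return true;              /* resolved register file matches */
    return false;
}

/* Single-class convenience wrapper. */
static bool reg_class_accepts(uint32_t primary, uint32_t secondary, uint32_t expected) {
    return reg_class_accepts_any(primary, secondary, &expected, 1);
}
```

Note that when the primary class is the wildcard, the secondary class is never consulted, which matches the short-circuit in the decompiled check.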

Observed register class values in matchers:

| Class | Frequency | Likely meaning |
|---|---|---|
| 1023 | ubiquitous | Wildcard (any register class) |
| 1 | very common | 32-bit GPR (R0..R255) |
| 2 | common | 64-bit GPR pair |
| 3 | occasional | 128-bit GPR quad |
| 4 | occasional | Predicate or special register file |
| 5 | rare | Extended register class |

Representative Matcher Walkthroughs

sub_B30160 -- simple 2-source, 4-result pattern (68 lines, priority 9, template 12):

1. field 480 == 2481                    -> opcode/subclass check
2. source_count == 2                    -> expects 2 source operands
3. operand[0].type == 1 (immediate)     -> first source is a constant
4. operand[1].type == 2 (GPR)           -> second source is a register
5. operand[1].class == 1023 OR sec == 1 -> 32-bit GPR or unconstrained
6. result_count == 4                    -> expects 4 result operands
7. result[0].type == 2 (GPR)            -> first result is GPR
   result[0].class == 1023 OR sec == 1
8. result[1].type == 3 OR 15            -> predicate or uniform register
9. result[2].type == 2 (GPR)            -> third result is GPR
   result[2].class == 1023 OR sec == 1
10. if (*a4 <= 8) -> *a4 = 9, *a3 = 12

sub_B33F00 -- medium 2-source, 5-result pattern (4,166 bytes, priority 21, template 22):

1. field 7 == 21                            -> major opcode class
2. field 163 in {705, 706}                  -> addressing mode variant
3. field 203 in {1113..1117}                -> encoding format (5 values)
4. field 105 == 477                         -> operation variant
5. field 88 == 408                          -> sub-operation modifier
6. field 345 == 1903                        -> rounding/saturation mode
7. source_count == 2                        -> 2 sources
8. operand[0].type == 1 (immediate)         -> constant source
9. operand[1].type == 2 (GPR)              -> register source
   operand[1].class: primary wildcard or secondary in {1,2}
10. result_count == 5                       -> 5 results
11. result[0].type == 2 (GPR), class != 1023, secondary == 2 (64-bit)
12. result[1].type == 3 OR 15 (pred/uniform)
13. result[2].type == 2 (GPR), class: wildcard or secondary in {1,2}
14. result[3].type == 2 (GPR), class: wildcard or secondary in {1,2}
15. if (*a4 <= 20) -> *a4 = 21, *a3 = 22

sub_B44CA0 -- complex 0-source, 7-result pattern (6,214 bytes, priority 11, template varies):

1.  field 5 == 12                           -> opcode class 12
2.  field 220 == 1206                       -> encoding property
3.  field 595 in {2937, 2938}               -> extended field (high range)
4.  field 294 == 1493                       -> constraint
5.  field 242 in {1281, 1282}               -> width qualifier
6.  field 355 == 1943                       -> extended property
7.  field 376 == 2035                       -> extended property
8.  field 377 in {2037..2041}               -> extended property (5 values)
9.  field 429 in {2252, 2253}               -> extended qualifier
10. field 126 in {547, 548}                 -> data type
11. field 397 == 2115                       -> validity gate
12. source_count == 0                       -> no source operands
13. result_count == 7                       -> 7 result operands
14. All 7 results checked: type == 10 (valid register), various class constraints
15. if (*a4 <= 10) -> *a4 = 11, *a3 = (template)

This pattern has the most field checks (11) of the representative examples, validating properties deep into the extended field namespace (field 595). Its zero-source, seven-result shape suggests a hardware intrinsic or complex output instruction like a tensor-core operation.

sub_B28FE0 -- minimal matcher in the preamble zone (31 lines, priority 8, template 42):

1. field 211 == 1182
2. field 201 == 1109
3. field 348 in {1912, 1915}   -> precision qualifier
4. field 397 == 2115           -> validity gate
5. source_count == 0           -> no sources
6. if (*a4 <= 7) -> *a4 = 8, *a3 = 42

The simplest matchers skip operand validation entirely and rely solely on opcode-property checks. These are for instructions with fixed operand formats where the operand shape is fully determined by the opcode.
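To make the shape of such a matcher concrete, here is a sketch that replays the sub_B28FE0 checks against a toy node. The `ToyNode` type (a dense field array) is a stand-in invented for this example and is not how the real property store works; the field IDs, expected values, priority, and template come from the walkthrough above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_FIELD 596   /* observed field ids top out at 595 */

/* Toy node: a dense field array standing in for the real property bag. */
typedef struct {
    uint32_t fields[MAX_FIELD];
    uint32_t source_count;
} ToyNode;

/* Mark every field absent (0xFFFFFFFF) and clear the source count. */
static void toy_node_init(ToyNode *n) {
    assert(n);
    memset(n->fields, 0xFF, sizeof n->fields);
    n->source_count = 0;
}

/* Same checks as the sub_B28FE0 walkthrough, in the same order. */
static void match_minimal(const ToyNode *n, uint32_t *template_id, uint32_t *priority) {
    assert(n && template_id && priority);
    if (n->fields[211] != 1182) return;
    if (n->fields[201] != 1109) return;
    if (n->fields[348] != 1912 && n->fields[348] != 1915) return;
    if (n->fields[397] != 2115) return;      /* validity gate */
    if (n->source_count != 0) return;
    if (*priority <= 7) { *priority = 8; *template_id = 42; }
}
```

Every failed check is an early return that leaves the (template, priority) pair untouched, so a non-matching matcher has no side effects on the selection.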

Helper Function Summary

| Address | Name | Signature | Purpose |
|---|---|---|---|
| sub_10AE5C0 | DAGNode_ReadField | (ctx, node, field_id) -> value | Read instruction property by ID; returns 0xFFFFFFFF if absent |
| sub_10AE590 | DAGNode_WriteField | (ctx, node, field_id, value) | Write single instruction property |
| sub_10AE640 | DAGNode_WriteFields | (ctx, node, f1, v1, v2) | Multi-field atomic update |
| sub_B28F30 | GetOperand | (node, index) -> operand_ptr | Index into operand array (32-byte records at *(node+32)) |
| sub_B28F40 | GetResultCount | (node) -> count | Number of result operands: node[40] + 1 - node[92] |
| sub_B28F50 | GetSourceCount | (node) -> count | Number of source operands: *(node+92) |
| sub_B28E00 | DecodeRegClass | (packed) -> class_id | Identity function (class stored as plain int) |
| sub_B28F00 | DecodeRegClass2 | (packed) -> class_id | Second identity accessor (same semantics) |
| sub_B28F10 | SetModifierA | (operand, value) | Write operand modifier at offset +14 |
| sub_B28F20 | SetModifierB | (operand, value) | Write operand modifier at offset +15 |
| sub_B28E10 | isGPR | (tag) -> bool | tag == 2 |
| sub_B28E20 | isImmediate | (tag) -> bool | tag == 1 |
| sub_B28E30 | isSpecialReg | (tag) -> bool | tag == 6 |
| sub_B28E40 | isValidReg | (tag) -> bool | tag == 10 |
| sub_B28E50 | isType9 | (tag) -> bool | tag == 9 |
| sub_B28E60 | isType5 | (tag) -> bool | tag == 5 |
| sub_B28E70 | isType4 | (tag) -> bool | tag == 4 |
| sub_B28E80 | isPredicate | (tag) -> bool | tag == 3 |
| sub_B28E90 | isUniformReg | (tag) -> bool | tag == 15 |
| sub_B28EA0 | isType13 | (tag) -> bool | tag == 13 |
| sub_B28EB0 | isType14 | (tag) -> bool | tag == 14 |
| sub_B28EC0 | isType16 | (tag) -> bool | tag == 16 |
| sub_B28ED0 | isType7 | (tag) -> bool | tag == 7 |
| sub_B28EE0 | isType11 | (tag) -> bool | tag == 11 |
| sub_B28EF0 | isType8 | (tag) -> bool | tag == 8 |

SM120 Pattern Coordinator -- sub_13AF3D0 (137 KB)

The second-largest ISel function in the binary after the 185 KB mega-selector (137 KB, 4,225 lines, 570+ locals). It is an architecture-specific operand-emission coordinator that runs in Phase 2 as a parallel backend to the mega-selector sub_C0EB10. The two do not call each other -- they are mutually exclusive implementations of the same ISel protocol, selected per-SM by the vtable in the ISel driver. The mega-selector covers opcodes 7--221 for the default backend; the coordinator covers opcodes 2--352 for the SM120 (consumer RTX 50xx / enterprise Pro) backend.

Position in the ISel Pipeline

sub_B285D0 (ISel driver, 9 KB)
  -> selects builder variant by SM version
     -> Builder variant vtable dispatch
        |
        +-- DEFAULT BACKEND: sub_C0EB10 (mega-selector, 185 KB)
        |     opcodes 7..221, dual-switch, word_22B4B60 encoding table
        |
        +-- SM120 BACKEND: sub_A29220 (instruction iterator, 435 lines)
              -> sub_13AF3D0 (pattern coordinator, 137 KB)
                   opcodes 2..352, single switch, inline operand emission

The coordinator is called once per instruction by sub_A29220, which walks the instruction list. Before entering the main switch, the coordinator performs a predication pre-test: if bit 0x1000 is set in the opcode word and the opcode is not 169, it queries vtable[3232/8] and optionally emits the last source operand via sub_13A6AE0.

Dispatch Structure

The coordinator reads the opcode from *(instr+72) with the standard BYTE1 & 0xCF mask (identical to Phase 1's MercConverter) and enters a single 130-case switch. Unlike the mega-selector's dual-switch encoding-slot translation, the coordinator emits operands inline -- each case directly calls sub_13A6280 (the operand emitter) with explicit operand indices.

// sub_13AF3D0 -- simplified dispatch skeleton
char PatternCoordinator(context *a1, instruction *a2, output *a3,
                        pattern_table *a4, flags a5, int a6) {
    int opcode = *(DWORD*)(a2 + 72);
    BYTE1(opcode) &= 0xCF;

    // Pre-dispatch: predication check when bit 0x1000 is set
    if ((*(a2+72) & 0x1000) && opcode != 169) {
        if (vtable[3232/8] == sub_868720 || vtable[3232/8]())
            EmitLastSource(a1[1], a2, operand_count - 2*flag, a3);
    }

    // Setup output context
    vtable[104/8](output_ctx, a1, &context_ref);

    switch (opcode) {
    case 2: case 4: case 7:     // FMA/MAD 2-source
        operand_span = 16; src_count = 2;
        goto SHARED_FMA_HANDLER;
    case 3: case 5:             // FMA/MAD 3-source
        operand_span = 24; src_count = 3;
        goto SHARED_FMA_HANDLER;
    case 6:                     // IMAD/IADD3 with 3+ sources
        EmitOperand(ctx, instr, 3, out);
        EmitOperand(ctx, instr, 4, out);
        EmitOperand(ctx, instr, 5, out);
        break;
    case 8:                     // Pure vtable dispatch (vtable+2328)
        vtable[2328/8](a1, a2, operand_count, a3, a5, 0);
        break;
    case 10: case 11: case 151: case 152: case 290: case 291:
        vtable[2768/8](a1, a2, a3, a4, a5);   // Memory load/store
        break;
    case 16:                    // Texture/surface (163-line handler)
        for (i = first_src; i < last_src; i++)
            EmitOperand(ctx, instr, i, out);   // loop up to 15 operands
        break;
    // ... 120 more cases ...
    case 329:                   // Variable-count loop + vtable+2328
        for (i = 0; i < src_count; i++)
            EmitOperand(ctx, instr, i, out);
        vtable[2328/8](a1, a2, remaining, a3, a5, 0, 0, 0);
        break;
    default:
        break;                  // no-op passthrough
    }
}
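The opcode masking and the pre-dispatch condition used at the top of the skeleton can be isolated into two small helpers. The helper and macro names are hypothetical; the constants (the 0xCF byte mask, bit 0x1000, opcode 169) are the ones quoted in this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Applying & 0xCF to byte 1 clears bits 12 and 13 of the 32-bit word. */
#define OPCODE_MASK   0xFFFFCFFFu
#define PRED_FLAG_BIT 0x1000u

_Static_assert((0xFFFF00FFu | (0xCFu << 8)) == OPCODE_MASK,
               "byte-1 mask 0xCF must equal the word mask 0xFFFFCFFF");

/* Strip the two flag bits to obtain the value the switch dispatches on. */
static uint32_t base_opcode(uint32_t opcode_word) {
    return opcode_word & OPCODE_MASK;
}

/* True when the pre-dispatch predication path should be considered:
 * bit 0x1000 set in the raw word and the masked opcode is not 169. */
static bool needs_pre_dispatch(uint32_t opcode_word) {
    return (opcode_word & PRED_FLAG_BIT) != 0 && base_opcode(opcode_word) != 169u;
}
```

The static assertion documents why the byte-level mask in the coordinator and the word-level mask `0xFFFFCFFF` in the ISel driver describe the same operation.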

Opcode Case Routing

The 130 distinct case labels (spanning 82 distinct handler blocks) cover the full SASS opcode range including SM100+/SM120 extensions:

| Opcodes | Handler pattern | Instruction family |
|---|---|---|
| 2, 3, 4, 5, 7 | Shared FMA handler with operand-span parametrization | FMA/MAD variants (32/64-bit) |
| 6 | Inline 3-source emission + optional operands 6/7 | IMAD/IADD3 wide |
| 8 | Pure vtable+2328 delegation | Builder-only instructions |
| 10, 11, 151, 152, 290, 291 | vtable+2768 delegation | Memory load/store |
| 16 | 163-line operand loop (up to 15 sources) | Texture/surface |
| 20, 21 | vtable+2680/2688 with stub check | Memory/store alternates |
| 22, 77, 83, 297, 352 | vtable+2744 with nullsub_463 check | Control flow |
| 24, 34, 209, 213, 214 | Passthrough: emit src 1 + dst 2 | Simple 2-operand ALU |
| 29, 95, 96, 190 | Conditional operand-6 check | Predicate-source instructions |
| 38, 59, 106, 180, 182, 192, 194, 215, 221 | Single EmitOperand(1) at high SM | Generic ALU |
| 42, 53, 55 | EmitOperand(1) | Paired ALU |
| 60, 61, 62, 63, 64 | Comparison / inner sub-opcode switch (case 61: 5 sub-cases) | Compare / set-predicate |
| 88, 89 | Variable source count (2 or 3) with sign-dependent offsets | Extended FMA |
| 110, 111, 114, 115, 117 | Warp operand emission | Warp shuffle / vote |
| 120, 121, 126, 127 | Barrier handler with operand loop at LABEL_53 | Barrier / sync |
| 139, 140, 141, 143 | sub_13A4DA0 for commutative operand selection | Commutative ALU |
| 183 | Extended memory with register-class-6 check | Wide memory |
| 201, 202, 204 | vtable+2328 delegation | Async / bulk operations |
| 270, 279, 282, 285, 325--328 | Goto LABEL_53 (barrier/sync shared handler) | Extended memory / warp |
| 280, 281 | vtable+2896 with nullsub_239 check, then LABEL_53 | Sync instructions |
| 329 | Variable-count operand loop + vtable+2328 | Variable-width encoding |

Three Competing-Match Selection Mechanisms

The coordinator selects among competing pattern matchers through three mechanisms:

1. LABEL_750 -- vtable alternate-match dispatch. Six opcode paths (cases 6, 36, 130, 137, plus opcodes reaching LABEL_119 when sub_7D6850 confirms a double-precision operand) jump to LABEL_750:

LABEL_750:
    replacement = vtable[16/8](output_ctx, instruction);
    *output = replacement;
    return;

This is the "try architecture-specific alternate" escape hatch. The vtable slot at offset +16 on the ISel context object points to an SM-specific matcher. If it succeeds, the coordinator's inline emission is entirely bypassed and the replacement instruction is written to the output.

2. sub_13A4DA0 -- commutative operand position selector. Called 12 times for commutative instructions (FMA, IADD3, comparison) where source operands can be swapped for better encoding. The function holds up to 4 pattern entries at offsets +12/+16 through +36/+40, each a (lo_word, hi_word_mask) pair. It tests operand properties via sub_13A48E0 against each entry; the first match returns a preferred operand index. The coordinator then calls sub_13A6280 with the returned index instead of the default.

// sub_13A4DA0 -- simplified
int SelectOperandSlot(pattern_table, instruction, default_slot, alt_slot, out_match) {
    if (!pattern_table->active) return default_slot;
    uint64_t operand_desc = GetOperandDescriptor(instruction, default_slot);
    for (i = 0; i < pattern_table->count; i++) {  // up to 4 entries
        if (operand_desc matches pattern_table->entry[i])
            { *out_match = pattern_table->entry[i].preferred; return default_slot; }
    }
    // Repeat with alt_slot if no match on default_slot
    operand_desc = GetOperandDescriptor(instruction, alt_slot);
    for (i = 0; i < pattern_table->count; i++) {
        if (operand_desc matches pattern_table->entry[i])
            { *out_match = pattern_table->entry[i].preferred; return alt_slot; }
    }
    }
    return default_slot;
}

3. Inline vtable override checks. Many cases test whether a vtable function pointer equals a known null-stub before calling it. The stub addresses serve as sentinel values -- when the vtable slot has been overridden by an SM-specific implementation, the coordinator calls the override:

| Vtable offset | Default stub | Purpose |
|---|---|---|
| +2680 | sub_A8CBE0 | Memory operation alternate matcher |
| +2688 | sub_A8CBF0 | Store operation alternate matcher |
| +2744 | nullsub_463 | Control flow alternate |
| +2632 | nullsub_233 | Move/convert alternate |
| +2760 | nullsub_235 | Atomic/barrier alternate |
| +2896 | nullsub_239 | Sync instruction alternate |
| +3232 | sub_868720 | Pre-dispatch predication alternate |
| +3112 | sub_A8CCA0 | MADC alternate (case 36) |

When the vtable slot holds the stub, the coordinator skips the call and proceeds with its inline emission logic.
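The stub-as-sentinel idiom is ordinary function-pointer comparison. The sketch below uses made-up names (`AltMatcher`, `default_stub`, `sm_override`, `try_alternate`) and a simplified one-argument signature; it only illustrates the control flow, not the real vtable layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*AltMatcher)(int opcode);

/* Stand-in for a default stub such as nullsub_463: does nothing. */
static bool default_stub(int opcode) { (void)opcode; return false; }

/* Hypothetical SM-specific override that claims opcode 22. */
static bool sm_override(int opcode) { return opcode == 22; }

/* Returns true when an override handled the instruction; false means the
 * caller should continue with its inline emission. */
static bool try_alternate(AltMatcher slot, AltMatcher stub, int opcode) {
    assert(slot != NULL);
    if (slot == stub)
        return false;       /* slot still holds the stub: skip the call */
    return slot(opcode);
}
```

Comparing against the stub address avoids an indirect call on the common path where no SM-specific implementation has been installed.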

Primary Callee: sub_13A6280 (239 lines)

The operand emitter, called 83 times. It reads the operand at instruction[operand_index + 10] (each operand is 8 bytes starting at instruction + 84), checks the type tag at bits [31:28], and emits:

  • Tag 1 (register): fast-path returns if register class == 6 (UB/dead register). Otherwise reads the register descriptor from *(context+88)[reg_index], checks register class at descriptor offset +64.
  • Tags 2/3 (constant/immediate): calls sub_7DBC80 to validate constant-bank availability, then sub_A9A290 for type-5 immediate expansion. Delegates to vtable methods at *(*(context+1584) + 1504) and *(*(context+1584) + 3248).
  • Other types: pass through to the vtable dispatch chain.

The third parameter (operand index) ranges from 0 to 7 across the coordinator's call sites, with 0/1/2/3 being the most common (corresponding to the first 4 source operands in the Ori IR instruction layout).

Function Map Additions

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_13AF3D0 | 137 KB | SM120 ISel pattern coordinator (130-case switch, 83x operand emission) | HIGH |
| sub_A29220 | 435 lines | Instruction iterator / coordinator caller (per-instruction walk) | HIGH |
| sub_13A6280 | 239 lines | Operand emitter (type-tag dispatch, register class 6 fast-path) | HIGH |
| sub_13A7410 | -- | Destination operand emitter (with register class 6 check) | MEDIUM |
| sub_13A6AE0 | -- | Pre-dispatch source emitter (predicated instruction operands) | MEDIUM |
| sub_13A4DA0 | 180 lines | Commutative operand position selector (4-entry pattern table) | HIGH |
| sub_13A6F90 | -- | Extended destination emitter (3rd variant, class 6 check) | MEDIUM |
| sub_13A6790 | -- | Fenced memory operand emitter | MEDIUM |
| sub_13A45E0 | -- | Extra operand emitter (operands 6/7 for wide instructions) | MEDIUM |
| sub_13A5ED0 | -- | Modifier flag emitter (operands with 0x18000000 bits) | MEDIUM |
| sub_13A75D0 | -- | Register class 6 (UB) operand substitution handler | MEDIUM |
| sub_13A48E0 | -- | Operand property extractor (for sub_13A4DA0 matching) | MEDIUM |

Architecture Dispatch Tables -- 4 Copies at sub_B128E0--sub_B12920

Four nearly identical functions (15,049 bytes of decompiled output each) provide architecture-variant opcode dispatch. Each is only 13 bytes of machine code (3 instructions -- a thunk into shared code at 0x1C39xxx); the decompiled size comes from the massive shared switch statement they jump into.

Each table contains a switch on *(a3+12) (the opcode word field) with 50+ cases, and secondary switches on *(a3+14) (opcode sub-field) within certain cases. The return values are SASS encoding slot indices (e.g., 197, 691, 526, 697, 772, 21). The four copies serve different SM architecture families, mapping the same logical opcode to different encoding slots depending on the target.

Opcode Variant Selectors

Two specialized variant selectors handle the final opcode-to-encoding mapping for specific instruction families:

sub_B0BE00 (19 KB) -- opcode class 194:

  • Massive switch on a2 (100+ cases)
  • All cases call sub_10AE590(ctx, inst, 194, N) with sequential N values starting from 827
  • Pattern: case K -> sub_10AE590(ctx, inst, 194, 826+K)
  • Maps sub-variant indices to SASS encoding slots for one PTX opcode family
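Since every case follows the same `826+K` rule, the whole switch is equivalent to a linear map. A sketch, with hypothetical names and a plain struct standing in for the sub_10AE590 call:

```c
#include <assert.h>
#include <stdint.h>

#define FIELD_CLASS_194 194u   /* field id written by every case */
#define VARIANT_BASE    826u   /* case K writes the value 826 + K */

/* The (field_id, value) pair that would be passed to the field writer. */
typedef struct { uint32_t field_id; uint32_t value; } FieldWrite;

/* Compute the write for sub-variant index k (k starts at 1, so the
 * first written value is 827). */
static FieldWrite select_variant_194(uint32_t k) {
    assert(k >= 1);
    FieldWrite w = { FIELD_CLASS_194, VARIANT_BASE + k };
    return w;
}
```

A compiler would normally fold such a switch into arithmetic, so the switch form surviving in the binary suggests the cases were emitted individually, for example by a generator.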

sub_B0AA70 (5 KB) -- opcode class 306:

  • Same pattern but with opcode class 306
  • Variants numbered 1680--1726 with non-sequential case indices (2, 3, 8, 9, 14, 15, 20, 21, 26, 27, 30, 31, 36, 37, 40, 41, ...)
  • The alternating-pair pattern at stride 6 suggests type-width combinations (e.g., F32/pair, F64/pair, S32/pair, ...)

Instruction Modifier Dispatchers

Two modifier-application functions run after the main ISel selection to set type modifiers, rounding modes, and register width:

sub_B13E10 (5,792 B) -- basic modifier dispatcher:

  • All 21 callees are sub_10AE640 (DAG node modifier)
  • Switch on BYTE1(a7) & 0x1F for modifier type
  • Maps modifier values 1--6 to internal codes 31--35
  • Secondary dispatch on (a7 >> 3) for register width encoding

sub_B157E0 (11,815 B) -- extended modifier dispatcher:

  • All 37 callees are sub_10AE640
  • Handles texture/surface operations specially (opcode type 18)
  • Maps sub-opcodes (BYTE5(a7) & 0x3F) to encoding values 54--60

Mercury Master Encoder -- sub_6D9690 (94 KB)

The Mercury master encoder is one of the largest backend functions (94 KB) and the final instruction selection point before binary emission. It contains a massive switch on the instruction type field (read from instruction+8) covering all SASS instruction formats. While its primary role is encoding (documented in Mercury Encoder Pipeline and SASS Instruction Encoding), the switch itself performs the final opcode-to-encoding-format selection:

// Simplified encoding flow
void EncodeInstruction(context, instruction) {
    int type = *(int*)(instruction + 8);
    uint64_t base = 0x2000000000LL;     // encoding base constant

    switch (type) {
    case 61:    // FFMA with literal operand
        sub_6D9580(ctx, operand);       // encode literal
        break;
    case 455:   // complex multi-operand format
        // bit-field extraction and assembly
        break;
    // ... hundreds of cases ...
    }

    // Common tail: append operand words, commit
    sub_6D2750(ctx, word);              // append 8-byte operand word
    sub_6D28C0(ctx);                    // commit instruction record
}

Key encoding dispatch details:

  • Operand word type prefix in bits [31:28]: 0x1 = register, 0x5 = immediate/constant, 0x6 = control/modifier, 0x7 = literal, 0x9 = special
  • sub_7D6860 handles data type encoding (FP32/FP64/INT)
  • sub_C00BF0 provides opcode lookup from the encoding tables
  • Architecture-specific bits accumulated via SM 100+ extensions controlled by knob 4176
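The type prefix in bits [31:28] can be read and written with a shift. In this sketch the enum names are invented, and the helpers operate on the 32-bit word that carries the prefix (how that word sits inside the 8-byte operand word is not covered here).

```c
#include <assert.h>
#include <stdint.h>

/* Prefix values listed above; the enumerator names are hypothetical. */
typedef enum {
    OPK_REGISTER  = 0x1,
    OPK_IMMEDIATE = 0x5,
    OPK_CONTROL   = 0x6,
    OPK_LITERAL   = 0x7,
    OPK_SPECIAL   = 0x9
} OperandKind;

/* Extract the 4-bit type prefix from bits [31:28]. */
static uint32_t operand_prefix(uint32_t word) {
    return word >> 28;
}

/* Build a tagged word from a kind and a 28-bit payload. */
static uint32_t tag_operand(OperandKind kind, uint32_t payload) {
    assert((payload >> 28) == 0);   /* payload must fit below the prefix */
    return ((uint32_t)kind << 28) | payload;
}
```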

MercExpand -- Pseudo-Instruction Expansion

sub_C3CC60 (26 KB) runs as phase 118 (MercExpandInstructions) and expands Mercury pseudo-instructions into concrete SASS sequences. This is the third and final instruction selection point -- where abstract instruction forms that survived through ISel and Mercury encoding are replaced by their concrete multi-instruction implementations.

| Handler | Size | Instruction class |
|---|---|---|
| sub_C37A10 | 16 KB | General instruction expansion (jump table, 4+ cases) |
| def_C37B2E | 13 KB | Complex expansion cases (default handler, string "EXPANDING") |
| sub_C39B40 | 10 KB | Memory operations (LDG, STG, LDS, etc.) |
| sub_C3A460 | 6 KB | Atomic operations |
| sub_C3B560 | 8 KB | Texture operations |
| sub_C3BCD0 | 19 KB | Control flow (branches, jumps, calls) |
| sub_C3E030 | 18 KB | Finalization and cleanup |

The expansion creates new instruction nodes, links them into the doubly-linked list, and deletes the original pseudo-instruction. After all expansions, sub_C3E030 performs post-expansion verification. The expansion engine also uses sub_719D00 (50 KB), which builds output for expanded instructions across different operand widths (32/64/128-bit, predicate) -- four near-identical code blocks corresponding to template instantiations over operand width types.

OCG Encoding Template Lookup -- sub_C3F490

The OCG (Optimized Code Generation) intrinsic pipeline on SM100+ does not use the ISel mega-selector or DAG pattern matchers. Instead, the OCG router (sub_6CC690, documented in Intrinsics) assigns each instruction one of 7 internal routing values and passes it to the SASS instruction emitter sub_6CB8A0. These routing values are not Ori IR opcodes, not binary SASS opcodes, and not encoding slot indices from word_22B4B60. They are a small, closed set of keys that exist solely to select an operand gathering template inside sub_C3F490.

Routing values assigned by the OCG router

| Value | Hex | Instruction class | Assigned when |
|---|---|---|---|
| 70 | 0x46 | Memory-ordered load/store/atomic (with barrier) | Barrier register present (v108 != 0 in conditional paths) |
| 243 | 0xF3 | Default memory operation | Fallback for general memory ops without barrier or special fence |
| 245 | 0xF5 | Load variant (LD/LDG/LDS) | Load-type operations (from OCG load/store handler) |
| 246 | 0xF6 | Reduction/atomic default | Atomic operations and reductions |
| 247 | 0xF7 | Fenced memory operation (LDGSTS) | Operations requiring memory fence semantics |
| 257 | 0x101 | Async copy without memory order | Bulk copy ops when no barrier: v108 == 0 selects 257, else 70 |
| 261 | 0x105 | Atomic with pre-existing value read | Atomic exchange / compare-and-swap returning old value |

How sub_C3F490 maps routing values to encoding templates

sub_C3F490 is a pure lookup function (184 bytes) that takes a routing value plus 7 boolean modifier flags and returns a pointer to an operand gathering template in .data at 0x22B8960--0x22BB460. The function is a nested if-else tree: the first-level switch selects on the routing value, then inner branches refine the template based on the modifier flags.

sub_C3F490(routing_value, a2..a8) -> template_ptr
    a2: has pre-existing-value operand (used only by value 257)
    a3: SM generation > sm_7x (SM80+)
    a4: has predicate attachment
    a5: has scope/fence operand (SM generation > sm_8x && memory_order == 4)
    a6: (always 0 from OCG emitter, used by MercExpand callers)
    a7: (always 0 from OCG emitter, used by MercExpand callers)
    a8: (always 0 from OCG emitter, used by MercExpand callers)

The OCG emitter (sub_6CB8A0) always passes a6=a7=a8=0, which means the OCG path only reaches a subset of template leaves. The MercExpand callers (sub_C41100, sub_C40420, sub_C40B90, sub_C42330) pass all 7 flags and can reach the full template space. The returned template is a packed array: template[0] is the operand count, followed by operand slot indices that reference positions in the 39-QWORD operand buffer (v134[]). The emitter iterates over these indices, gathers the tagged operand words, builds control words from bitfields, and calls sub_9314F0 to commit the encoded instruction.
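The gather step driven by a packed template can be sketched as follows. The element width of a template entry is not stated above, so 32-bit entries are an assumption, as are the function name and the bounds checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPERAND_BUFFER_QWORDS 39   /* size of the operand buffer (v134[]) */

/* tmpl[0] is the operand count; tmpl[1..count] are slot indices into the
 * operand buffer. Copies the selected words to out in template order and
 * returns how many were gathered. */
static size_t gather_operands(const uint32_t *tmpl, const uint64_t *buffer,
                              uint64_t *out, size_t out_capacity) {
    assert(tmpl && buffer && out);
    size_t count = tmpl[0];
    assert(count <= out_capacity);
    for (size_t i = 0; i < count; i++) {
        uint32_t slot = tmpl[1 + i];
        assert(slot < OPERAND_BUFFER_QWORDS);
        out[i] = buffer[slot];
    }
    return count;
}
```

The template fixes the operand order of the encoded instruction independently of where each operand happens to sit in the buffer, which is why one emitter can serve many instruction shapes.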

Two additional routing values (254, 262) are handled by sub_C3F490 but are never assigned by the OCG router -- they originate exclusively from the MercExpand memory instruction handlers, where the routing value is read from the instruction's opcode field (instr[18] masked with & 0xCFFF).

| Value | Hex | Origin | Instruction class |
|---|---|---|---|
| 254 | 0xFE | MercExpand only | Extended memory format (operand gather mode 3) |
| 262 | 0x106 | MercExpand only | Wide memory format (operand gather mode 0, with scope/fence branches) |

Template address space

The 40+ distinct templates returned by sub_C3F490 occupy a contiguous .data region:

| Address range | Routing values served |
|---|---|
| 0x22B8960--0x22B8E60 | 257 (async copy variants) |
| 0x22B8E60--0x22B9360 | 70 (barrier memory variants) |
| 0x22B9360--0x22B9860 | 262 (MercExpand wide memory) |
| 0x22B9860--0x22B9E60 | 247, 245 (fenced / load variants) |
| 0x22B9E60--0x22BA960 | 243, 246, 70 (default / reduction / barrier sub-variants) |
| 0x22BA960--0x22BB460 | Leaf templates for bare operand forms (no modifiers) |

Each template is 256 bytes (0x100). For a given routing value, the modifier flags select progressively simpler templates as flags are cleared: the most complex template (all modifiers active) is reached first in the if-chain, and the simplest (no modifiers) is the final fallback.
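The region bounds and the 256-byte stride are enough to work out how many template slots the region holds and which slot a returned pointer refers to. The arithmetic below assumes the region is densely packed with no gaps; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TEMPLATE_REGION_START 0x22B8960u
#define TEMPLATE_REGION_END   0x22BB460u
#define TEMPLATE_SIZE         0x100u

/* Number of 256-byte slots in the region: 0x2B00 / 0x100 = 43. */
static uint32_t template_slot_count(void) {
    return (TEMPLATE_REGION_END - TEMPLATE_REGION_START) / TEMPLATE_SIZE;
}

/* Slot index for a template address, or -1 if the address lies outside
 * the region or is not aligned to a slot boundary. */
static int template_slot_index(uint32_t addr) {
    if (addr < TEMPLATE_REGION_START || addr >= TEMPLATE_REGION_END)
        return -1;
    if ((addr - TEMPLATE_REGION_START) % TEMPLATE_SIZE != 0)
        return -1;
    return (int)((addr - TEMPLATE_REGION_START) / TEMPLATE_SIZE);
}
```

The 43 slots this yields are consistent with the "40+ distinct templates" figure, and the first sub-range (0x22B8960--0x22B8E60) covers slots 0 through 4.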

Addressing Mode Selection

Addressing mode selection is distributed across Phases 1 and 2. During Phase 1, the operand processing function sub_6273E0 (44 KB) classifies PTX operand forms into internal categories. During Phase 2, the ISel driver and Mercury encoder select the optimal SASS addressing mode based on the register-allocated operand forms.

PTX addressing modes and their SASS encodings:

| PTX syntax | Addressing mode | SASS instruction | Encoding |
|---|---|---|---|
| [%rd1] | Register indirect | LDG.E R0, [R2] | Register + zero offset |
| [%rd1+16] | Register + offset | LDG.E R0, [R2+0x10] | Register + immediate offset |
| c[2][0x100] | Constant bank | LDC R0, c[0x2][0x100] | Bank index + offset |
| [%rd1], %r2 | Base + index | STG.E [R2], R4 | Separate base/data registers |

Special string references in sub_6273E0 confirm complex addressing:

  • ".nv.reservedSmem.offset0" -- reserved shared memory region
  • "COARSEOFFSET" -- coarse-grained offset computation for large address spaces
  • "__$endLabel$__%s" -- label generation for structured control flow

The ISel mega-selector (sub_C0EB10) references "__nv_reservedSMEM_offset_0_alias" for shared memory alias resolution during final encoding.

Vtable Dispatcher Zone -- 0xAF0000--0xB10000

The range 0xAF0000--0xB10000 contains approximately 2,735 tiny vtable method implementations (average 160 bytes) that form the instruction encoding hierarchy. These implement polymorphic instruction property queries:

// Typical vtable method (sub_AFXXXX, ~160 bytes)
int64_t get_property(int64_t a1, unsigned int a2) {
    if (a2 <= N)
        return (unsigned int)dword_XXXXXXX[a2];  // table lookup
    return default_value;
}

Each function maps a small integer index to an encoding constant, answering questions like "what is the register class for operand N of this instruction?" The 0xAF0000--0xB00000 sub-range has 1,269 functions (all under 200 bytes), while 0xB00000--0xB10000 has 1,466 with slightly more complex logic (13 exceeding 1 KB).

Comparison with LLVM ISel

| Aspect | LLVM | ptxas |
|---|---|---|
| ISel framework | SelectionDAG or GlobalISel (single pass) | Two-phase: MercConverter (phase 5) + ISel driver (phase 112+) |
| Pattern specification | TableGen .td files, machine-generated | Handwritten C++ (~801 functions) |
| Pattern count | Target-dependent (thousands for x86) | ~801 DAG matchers + 185 KB mega-selector |
| Architecture dispatch | Subtarget feature bits | 4 architecture dispatch tables + vtable overrides |
| Intermediate form | MachineInstr (already selected) | Ori IR (SASS opcodes after phase 5, not yet encoded) |
| Encoding | MCInst emission (separate pass) | Integrated: ISel + Mercury encode in same pipeline |
| Expansion | Pseudo-instruction expansion in AsmPrinter | MercExpand (phase 118, post-ISel) |
| Optimization post-ISel | MachineFunction passes | Phases 14--111 (full optimizer runs between Phase 1 and Phase 2) |

The key architectural difference is when selection happens relative to optimization. LLVM performs instruction selection once, and every later machine-level pass runs on already-selected machine instructions. ptxas selects SASS opcodes early (phase 5) so the optimizer can reason about SASS-level semantics, then performs a second selection and encoding pass (phases 112+) after optimization is complete. This two-phase design gives the optimizer accurate cost models, because it sees real SASS opcodes rather than abstract PTX operations, at the cost of added architectural complexity.
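
The split can be pictured as two small mappings with the optimizer running between them. The sketch below is illustrative only: the enum names, the opcode pairings, and the encoding-index values are hypothetical placeholders. Only the general shape of Phase 2 (a backend translation hook that defaults to identity, followed by a `uint16` opcode-to-encoding-index table) is taken from the function map on this page.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcode sets; real ptxas opcode numbering is not shown here. */
enum ptx_op { PTX_ADD_S32, PTX_MUL_LO_S32, PTX_OP_COUNT };
enum ori_op { ORI_IADD3, ORI_IMAD, ORI_OP_COUNT };

/* Phase 1 (early): rewrite a PTX opcode into an Ori opcode that already
 * names a SASS-level operation, so the optimizer sees machine semantics. */
enum ori_op phase1_lower(enum ptx_op op)
{
    switch (op) {
    case PTX_ADD_S32:    return ORI_IADD3;
    case PTX_MUL_LO_S32: return ORI_IMAD;
    default:             return ORI_OP_COUNT;  /* not handled in this sketch */
    }
}

/* Phase 2 (late): a backend hook may remap the opcode; the default hook
 * is an identity function. */
typedef enum ori_op (*translate_fn)(enum ori_op);

enum ori_op translate_identity(enum ori_op op) { return op; }

/* Hypothetical stand-in for a static opcode-to-encoding-index table. */
static const uint16_t k_encoding_index[ORI_OP_COUNT] = { 0x0210, 0x0224 };

/* Returns 0 and stores the encoding index, or -1 for an unknown opcode. */
int phase2_select(enum ori_op op, translate_fn translate, uint16_t *out)
{
    assert(translate != NULL && out != NULL);
    enum ori_op t = translate(op);
    if ((unsigned)t >= (unsigned)ORI_OP_COUNT)
        return -1;
    *out = k_encoding_index[t];
    return 0;
}
```

Everything the optimizer does would sit between the call to `phase1_lower` and the call to `phase2_select`, which is why it can treat the Ori opcode as a stable, machine-level fact.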

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_C0EB10 | 185 KB | ISel mega-selector (719 locals, dual 169-case switch, SM-generation dispatch) | HIGH |
| sub_6D9690 | 94 KB | Mercury master encoder (instruction type switch) | VERY HIGH |
| sub_9F1A90 | 35 KB | MercConverter main instruction conversion pass | HIGH |
| sub_9EF5E0 | 27 KB | Post-MercConverter lowering ("CONVERTING") | HIGH |
| sub_C3CC60 | 26 KB | MercExpand::run (pseudo-instruction expansion) | HIGH |
| sub_9ED2D0 | 25 KB | MercConverter opcode dispatch (master switch, & 0xCF mask) | HIGH |
| sub_9E6600 | 25 KB | Instruction expansion (64-bit split) | HIGH |
| sub_9EC340 | 23 KB | Multi-operand instruction legalization | MEDIUM |
| sub_B0BE00 | 19 KB | Opcode variant selector (class 194, 100+ cases) | HIGH |
| sub_C3BCD0 | 19 KB | MercExpand::expandControlFlow | MEDIUM |
| sub_9D76D0 | 18 KB | Memory instruction legalization (load/store) | HIGH |
| sub_C3E030 | 18 KB | MercExpand::finalizeExpansion | MEDIUM |
| sub_9D80E0 | 17 KB | Memory instruction legalization (variant) | HIGH |
| sub_9E8B20 | 17 KB | Texture/surface lowering | MEDIUM |
| sub_C37A10 | 16 KB | MercExpand::expandInstruction (jump table) | HIGH |
| sub_B128E0--sub_B12920 | 15 KB x4 | Architecture dispatch tables (4 SM families) | HIGH |
| sub_B1FA20 | 13 KB | SASS 3-operand builder (variant A) | HIGH |
| sub_B1D670 | 13 KB | Post-ISel instruction modifier | HIGH |
| def_C37B2E | 13 KB | MercExpand complex cases ("EXPANDING") | HIGH |
| sub_B157E0 | 12 KB | Extended modifier dispatcher (37 callees) | HIGH |
| sub_B20E00 | 11 KB | SASS 3-operand builder (variant B) | HIGH |
| sub_C39B40 | 10 KB | MercExpand::expandMemoryOp | MEDIUM |
| sub_9DA100 | 9 KB | Arithmetic operation handler (case 6) | HIGH |
| sub_B285D0 | 9 KB | ISel lowering driver (66 callees) | HIGH |
| sub_B241A0 | 7 KB | SASS instruction property setter | HIGH |
| sub_9F3340 | 7 KB | MercConverter orchestrator ("After MercConverter") | HIGH |
| sub_C3A460 | 6 KB | MercExpand::expandAtomicOp | MEDIUM |
| sub_B13E10 | 6 KB | Basic modifier dispatcher (21 callees) | HIGH |
| sub_B0AA70 | 5 KB | Opcode variant selector (class 306) | HIGH |
| sub_9DA5C0 | 2 KB | Opcode class 1 handler | MEDIUM |
| sub_13AF3D0 | 137 KB | SM120 ISel pattern coordinator (130-case switch, 83x sub_13A6280, opcodes 2--352) | HIGH |
| sub_A29220 | ~17 KB | SM120 instruction iterator (calls sub_13AF3D0 per instruction) | HIGH |
| sub_13A6280 | ~10 KB | Operand emitter (type-tag dispatch, register class 6 fast-path) | HIGH |
| sub_13A4DA0 | ~7 KB | Commutative operand position selector (4-entry pattern table) | HIGH |
| sub_13A7410 | -- | Destination operand emitter (with register class 6 check) | MEDIUM |
| sub_13A6AE0 | -- | Pre-dispatch source emitter (predicated instruction operands) | MEDIUM |
| sub_13A6F90 | -- | Extended destination emitter (3rd variant, class 6 check) | MEDIUM |
| sub_13A6790 | -- | Fenced memory operand emitter | MEDIUM |
| sub_13A45E0 | -- | Extra operand emitter (wide instruction operands 6/7) | MEDIUM |
| sub_13A5ED0 | -- | Modifier flag emitter (operands with 0x18000000 bits) | MEDIUM |
| sub_13A48E0 | -- | Operand property extractor (for sub_13A4DA0 matching) | MEDIUM |
| sub_10AE5C0 | tiny | DAGNode_ReadField (field_id to value, delegates to sub_10D5E60) | VERY HIGH |
| sub_10AE590 | tiny | DAGNode_WriteField (single field write) | VERY HIGH |
| sub_10AE640 | tiny | DAGNode_WriteFields (multi-field update) | VERY HIGH |
| sub_B28F30 | tiny | GetOperand (index into 32-byte operand array at *(node+32)) | VERY HIGH |
| sub_B28F40 | tiny | GetResultCount (node[40] + 1 - node[92]) | VERY HIGH |
| sub_B28F50 | tiny | GetSourceCount (*(node+92)) | VERY HIGH |
| sub_B28E00 | tiny | DecodeRegClass (identity function, class is plain int) | VERY HIGH |
| sub_B28E10 | tiny | isGPR operand predicate (tag == 2) | VERY HIGH |
| sub_B28E20 | tiny | isImmediate operand predicate (tag == 1) | VERY HIGH |
| sub_B28E40 | tiny | isValidReg operand predicate (tag == 10) | VERY HIGH |
| sub_B28E80 | tiny | isPredicate operand predicate (tag == 3) | VERY HIGH |
| sub_B28E90 | tiny | isUniformReg operand predicate (tag == 15) | VERY HIGH |
| sub_B28F60--sub_B74C60 | ~1.3 MB | ~801 DAG pattern matchers (priority 2--34, template 1--152) | HIGH |
| sub_C01840 | -- | Mega-selector source operand marshaller (52 calls from mega-selector) | HIGH |
| sub_C01F50 | -- | Mega-selector destination operand marshaller | HIGH |
| sub_C00EA0 | -- | Single operand extractor (returns tagged operand word) | HIGH |
| sub_BFFD60 | -- | Operand reference resolver (register ref to encoding word) | HIGH |
| sub_C06E90 | -- | Symbol/special-register lookup for shared memory | HIGH |
| sub_C07690 | -- | Immediate-operand encoding helper | MEDIUM |
| sub_C0B2C0 | -- | Extended memory/warp operation encoder | HIGH |
| sub_C05CC0 | -- | Immediate operation encoder (flag-dependent path) | MEDIUM |
| sub_BFEBF0 | tiny | Default vtable[2] stub (opcode translator, no-op identity) | VERY HIGH |
| sub_BFEAA0 | tiny | Default vtable[12] stub (capability check, always false) | VERY HIGH |
| sub_BFEA30 | tiny | Default vtable[3] stub (extension handler, no-op) | VERY HIGH |
| sub_BFEF10 | -- | Register bank capacity check / grow | MEDIUM |
| word_22B4B60 | -- | Static opcode-to-encoding-index table (uint16[222], default backend) | VERY HIGH |
| sub_C3F490 | 184 B | OCG encoding template lookup (routing value + 7 flags -> template ptr) | VERY HIGH |
| sub_6CB8A0 | -- | OCG SASS instruction emitter (calls sub_C3F490 then sub_9314F0) | HIGH |
| sub_C41100 | -- | MercExpand memory encoder (calls sub_C3F490 with full flag set) | HIGH |
| sub_C40420 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
| sub_C40B90 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
| sub_C42330 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
| unk_22B8960--unk_22BB460 | ~11 KB | Operand gathering templates (40+ entries, 256 B each) | HIGH |
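
The tiny DAG-node accessors and operand predicates listed above are simple enough to reconstruct as a sketch. The layout below is an assumption: the numbers 32, 40, and 92 are treated as byte offsets holding a pointer and two 32-bit integers on an LP64 target, the operand tag is placed in the first byte of each 32-byte operand, and field 40 is read as the index of the last operand. Only the offsets, the 32-byte operand stride, the `node[40] + 1 - node[92]` formula, and the tag values come from the table; all struct, field, and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed operand layout: 32 bytes, type tag in the first byte. */
struct operand {
    uint8_t tag;
    uint8_t pad[31];
};

/* Assumed node layout; only offsets 32, 40, and 92 come from the map. */
struct dag_node {
    uint8_t         pad0[32];
    struct operand *operands;    /* offset 32: operand array            */
    int32_t         last_index;  /* offset 40: assumed last operand idx */
    uint8_t         pad1[48];
    int32_t         src_count;   /* offset 92: number of sources        */
};

_Static_assert(sizeof(struct operand) == 32, "operand stride");
_Static_assert(offsetof(struct dag_node, operands) == 32, "offset 32");
_Static_assert(offsetof(struct dag_node, last_index) == 40, "offset 40");
_Static_assert(offsetof(struct dag_node, src_count) == 92, "offset 92");

/* Counterpart of GetOperand: index into the 32-byte operand array. */
struct operand *get_operand(const struct dag_node *n, int i)
{
    assert(n != NULL && i >= 0 && i <= n->last_index);
    return &n->operands[i];
}

/* Counterpart of GetSourceCount. */
int get_source_count(const struct dag_node *n) { return n->src_count; }

/* Counterpart of GetResultCount: node[40] + 1 - node[92]. */
int get_result_count(const struct dag_node *n)
{
    return n->last_index + 1 - n->src_count;
}

/* Operand predicates, using the tag values from the table. */
int is_immediate(const struct operand *op)   { return op->tag == 1;  }
int is_gpr(const struct operand *op)         { return op->tag == 2;  }
int is_predicate(const struct operand *op)   { return op->tag == 3;  }
int is_valid_reg(const struct operand *op)   { return op->tag == 10; }
int is_uniform_reg(const struct operand *op) { return op->tag == 15; }
```

Under this reading, a node with three operands (`last_index == 2`) and two sources reports one result, and a pattern matcher would combine the predicates to test, for example, whether an operand is a GPR or an immediate.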

Cross-References