Instruction Selection

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Instruction selection in ptxas is a two-phase process that converts PTX virtual ISA operations into concrete SASS machine opcodes. Unlike LLVM, which uses a single SelectionDAG or GlobalISel framework, ptxas distributes instruction selection across two distinct pipeline stages separated by the entire optimization pipeline: Phase 1 converts PTX opcodes to Ori IR opcodes during initial lowering (phase 5, ConvertUnsupportedOps), and Phase 2 converts Ori IR to final SASS binary forms during code generation (phases 112--122, ISel driver + Mercury encoder). The two phases serve fundamentally different purposes: Phase 1 legalizes the IR so the optimizer can reason about it, while Phase 2 selects the optimal machine encoding for the target architecture after register allocation and scheduling are complete.


Phase 1 location	Phase 5: `ConvertUnsupportedOps` (PTX opcode to Ori opcode)
Phase 2 location	Phases 112+: ISel driver + Mercury encoder (Ori to SASS binary)
MercConverter dispatch	`sub_9ED2D0` (25 KB, master switch on `*(instr+72) & 0xCF` mask)
ISel driver	`sub_B285D0` (9 KB, 66 callees, vtable entry)
ISel mega-selector	`sub_C0EB10` (185 KB, 719+ locals, giant switch)
DAG pattern matchers	~801 functions at `0xB28F60`--`0xB7D000` (~1.3 MB)
Arch dispatch tables	4 copies at `sub_B128E0`--`sub_B12920` (15,049 bytes each)
Mercury master encoder	`sub_6D9690` (94 KB, instruction type switch)
MercExpand	`sub_C3CC60` (26 KB, pseudo-instruction expansion)
SM120 pattern coordinator	`sub_13AF3D0` (137 KB, 130-case switch, opcodes 2--352)
Opcode variant selectors	`sub_B0BE00` (19 KB, class 194), `sub_B0AA70` (5 KB, class 306)

Architecture

PTX source text
     |
     v
[Bison parser]  sub_4CE6B0 (48KB)
     |  Reduction actions build raw Ori nodes with PTX-derived opcodes
     v
+------------------------------------------------------------------+
| RAW ORI IR (PTX opcodes: add.f32, ld.global, mad.lo.s32, ...)    |
+------------------------------------------------------------------+
     |
     |  PHASE 1: PTX-to-Ori Opcode Legalization (phase 5)
     |
     |  sub_9F3340 (orchestrator, 7KB)
     |    -> sub_9F1A90 (MercConverter main, 35KB)
     |         -> sub_9ED2D0 (opcode dispatch, 25KB)
     |              Switch on (*(instr+72)) with BYTE1 & 0xCF mask
     |              ~120 case values -> ~60 handler functions
     |              + vtable dispatch for architecture-extensible ops
     |         -> sub_934630 (instruction creation, called N times)
     |    -> sub_9EF5E0 (post-conversion lowering, 27KB)
     |
     v
+------------------------------------------------------------------+
| OPTIMIZER-READY ORI IR (SASS opcodes: FADD, IMAD, LDG, STG, ...) |
| Every instruction has a valid SASS opcode for the target SM.      |
+------------------------------------------------------------------+
     |
     |  [Phases 14-111: Full optimization pipeline]
     |  Register allocation, scheduling, peephole, etc.
     |
     v
+------------------------------------------------------------------+
| OPTIMIZED ORI IR (register-allocated, scheduled)                  |
+------------------------------------------------------------------+
     |
     |  PHASE 2: Ori-to-SASS Selection & Encoding (phases 112+)
     |
     |  sub_B285D0 (ISel driver, 9KB)
     |    -> sub_C0EB10 (mega-selector, 185KB, default backend)
     |    -> sub_13AF3D0 (pattern coordinator, 137KB, SM120 backend)
     |    -> sub_B1FA20 / sub_B20E00 (builder variants)
     |    -> sub_B28F60..sub_B74C60 (~801 DAG pattern matchers)
     |    -> sub_B128E0..sub_B12920 (4 arch dispatch tables)
     |
     |  sub_6D9690 (Mercury master encoder, 94KB)
     |    -> Switch on instruction type (*(instr+8))
     |    -> sub_C00BF0 (opcode lookup)
     |    -> sub_91D160 (register encoding)
     |    -> sub_7B9B80 (bitfield insert, 18,347 callers)
     |
     |  sub_C3CC60 (MercExpand, 26KB)
     |    -> sub_C37A10 (expand instruction, 16KB)
     |    -> sub_C39B40 (expand memory, 10KB)
     |    -> sub_C3BCD0 (expand control flow, 19KB)
     |
     v
+------------------------------------------------------------------+
| SASS binary (packed machine code in 64/128/256-bit words)         |
+------------------------------------------------------------------+

Phase 1: PTX-to-Ori Opcode Conversion

Phase 1 runs as ConvertUnsupportedOps (pipeline phase 5), the most substantial bridge phase. Its job is to replace every PTX-derived opcode in the raw Ori IR with a valid SASS-level opcode for the target SM. After this phase completes, the optimizer sees only SASS-level instruction semantics.

The conversion is not a simple table lookup. Many PTX operations have no 1:1 SASS equivalent and must be expanded into multi-instruction sequences. The expansion depends on the target architecture, the operand types, and the available hardware functional units.

MercConverter Dispatch -- `sub_9ED2D0` (25 KB)

The central dispatch function of Phase 1. Although the initial sweep identified it as PhaseRunner::executePhaseSequence, the decompiled code reveals a classic opcode switch: it reads the 32-bit word at *(instr+72), masks byte 1 with 0xCF (clearing bits 4--5 of that byte, which are bits 12--13 of the word and carry modifier flags), and dispatches to per-category handler functions. The switch covers approximately 120 distinct case values (opcode indices 1--352) routing to roughly 60 handler functions plus vtable-dispatched methods for architecture-extensible operations.

// sub_9ED2D0 -- simplified dispatch logic
void MercConverter_Dispatch(context, instruction) {
    // Pre-dispatch: check predication eligibility
    bool can_predicate = sub_7E18A0(instruction, *(context+8));
    if (can_predicate)
        can_predicate = vtable[205](*(*(context+8)+1584), instruction);
    *(context+40) = can_predicate;

    // Read opcode, mask out modifier bits
    int opcode = *(DWORD*)(instruction + 72);
    BYTE1(opcode) &= 0xCF;

    // Special case: opcode 130 (HSET2 in ROT13; internal marker) with GPR operand -> clear predication
    if (opcode == 130) {
        int operand = *(DWORD*)(instruction + 84);
        if (((operand >> 28) & 7) == 1 && reg_type(operand) == 6)
            *(context+40) = 0;
    }

    // Main dispatch
    switch (opcode) {
    case 1:   sub_9DA5C0(context, instruction);  break;  // opcode class 1
    case 6:   sub_9DA100(context, instruction);  break;  // arithmetic
    case 8:   sub_9D2440(context, instruction);  break;  // specific class
    case 10: case 11: case 149: case 151: case 152: case 290: case 291:
              sub_9D80E0(context, instruction);  break;  // memory load/store
    case 16:  sub_9E8B20(context, instruction);  break;  // texture/surface
    case 61: case 63: case 80:
              sub_9E6600(context, instruction);  break;  // instruction expansion
    case 108: sub_9D76D0(context, instruction);  break;  // memory legalization
    // ... ~100 more cases ...
    default:  emit_noop(context, 0xFFFF);        break;  // unknown -> passthrough
    }

    // Post-dispatch: apply predication and operand adjustments
    vtable[107](context, instruction);
}
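The masking step can be checked in isolation. The following is a minimal sketch, not code recovered from the binary: the function name `strip_modifier_bits` and the sample word values are assumptions made for illustration. It shows that masking byte 1 with 0xCF is the same operation as ANDing the whole word with 0xFFFFCFFF, the constant that appears in the ISel driver.

```c
#include <assert.h>
#include <stdint.h>

/* Clear bits 4-5 of byte 1 of the opcode word, as BYTE1(opcode) &= 0xCF does. */
static uint32_t strip_modifier_bits(uint32_t raw_word) {
    uint32_t byte1 = (raw_word >> 8) & 0xFFu;   /* isolate byte 1 */
    byte1 &= 0xCFu;                             /* drop bits 4 and 5 of that byte */
    uint32_t opcode = (raw_word & ~0xFF00u) | (byte1 << 8);

    /* The byte-level mask is equivalent to one 32-bit AND that clears bits 12-13. */
    assert(opcode == (raw_word & 0xFFFFCFFFu));
    return opcode;
}
```

For example, a hypothetical raw word 0x3082 (opcode 130 with both modifier bits set) reduces to 130, while 0x1160 reduces to 352 because bit 8 belongs to the opcode index and is preserved.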

MercConverter Opcode Dispatch Table

The complete switch covers opcodes 1--352. Cases route to three dispatch mechanisms: direct function calls (for common PTX categories), vtable-indirect calls (for architecture-extensible operations), and the emit_noop fallback for unrecognized opcodes. Below is the reconstructed routing table from the decompiled sub_9ED2D0.

Direct handler dispatch (73 handlers listed):

Opcode(s)	Handler	Size	Category
1	`sub_9DA5C0`	2 KB	Opcode class 1 (basic ALU)
6	`sub_9DA100`	9 KB	Arithmetic operations
8	`sub_9D2440`	--	Specific class
10, 11, 149, 151, 152, 290, 291	`sub_9D80E0`	17 KB	Memory load/store
15, 85	`sub_9EC340`	23 KB	Multi-operand legalization
16	`sub_9E8B20`	17 KB	Texture/surface lowering
17	`sub_9E7FB0`	--	Surface operations
22	`sub_9D6DB0`	--	Specific lowering
23	`sub_9E58F0`	--	Specific lowering
24	`sub_9D9F60`	--	Specific lowering
26	`sub_9E54C0`	--	Specific lowering
27	`sub_9E4BB0`	--	Specific lowering
28	`sub_9D9E70`	--	Specific lowering
32, 271	`sub_9E2440`	--	Bitfield operations
34	`sub_9E55E0`	--	Specific lowering
38, 59, 106, 180, 182, 192, 194, 215, 221, 242	`sub_9DA6B0`	--	Generic ALU group
41, 284	`sub_9D1DA0`	--	Specific lowering
42, 53, 55, 66	`sub_9D54B0`	--	Grouped operations
47	`sub_9E74E0`	--	Conditional (arch flag check)
51	`sub_9E2F60`	--	Specific lowering
52, 54, 72, 97	`sub_9D09C0`	--	Group with `v8=1` (deletion flag)
57, 101	`sub_9D6170`	--	Paired operations
60, 62, 78, 79	`sub_9E5EE0`	--	Comparison group
61, 63, 80	`sub_9E6600`	25 KB	Instruction expansion (64-bit split)
67	`sub_9D9C30`	--	Specific lowering
70	`sub_9E3490`	--	Specific lowering
75	`sub_9E0C10`	--	Specific lowering
77	`sub_9E4DF0`	--	Specific lowering
83	`sub_9D6AB0`	--	Specific lowering
88, 89	`sub_9D5990`	--	Paired operations
90	`sub_9D2820`	--	Specific lowering
91	`sub_9E7600`	--	Specific lowering
92	`sub_9E7890`	--	Specific lowering
93, 95	`sub_9E1D40`	--	Comparison variants
94	`sub_9E1DF0`	--	Specific lowering
96	`sub_9D41C0`	--	Specific lowering
98	`sub_9D3230`	--	Specific lowering
100	`sub_9D70E0`	--	Specific lowering
102	`sub_9D9750`	--	Specific lowering
103, 104	`sub_9E31D0`	--	Paired operations
108	`sub_9D76D0`	18 KB	Memory instruction legalization
124	`sub_9E18B0`	--	Specific lowering
135	`sub_9D6560`	--	Specific lowering
139, 140, 141, 143	`sub_9D4C10`	--	Related operations group
145	`sub_9D3020`	--	Specific lowering
155, 268	`sub_9E5260`	--	Paired operations
156	`sub_9D94B0`	--	Specific lowering
158, 167	`sub_9E4A00`	--	Paired operations
161	`sub_9D21D0`	--	Specific lowering
162	`sub_9D9660`	--	Specific lowering
166	`sub_9E2100`	--	Specific lowering
170	`sub_9E2DF0`	--	Specific lowering
173, 267	`sub_9EB5C0`	--	Paired operations
174	`sub_9D9300`	--	Specific lowering
184	`sub_9D2E70`	--	Specific lowering
185	`sub_9E32F0`	--	Specific lowering
188, 190	`sub_9E2970`	--	Paired operations
195	`sub_9D2AB0`	--	Specific lowering
196	`sub_9D9080`	--	Specific lowering
198	`sub_9D66F0`	--	Specific lowering
201, 202, 204, 285	`sub_9EAC30`	--	Async/bulk group
203	`sub_9D8E90`	--	Specific lowering
205	`sub_9E1260`	--	Specific lowering
209	`sub_9E5740`	--	Specific lowering
210, 213, 214	`sub_9D8B30`	--	Grouped operations
240	`sub_9D6280`	--	Specific lowering
241	`sub_9E2CC0`	--	Specific lowering
247	`sub_9D0F70`	--	Specific lowering
248	`sub_9D0DF0`	--	Specific lowering
262	`sub_9E7440`	--	Specific lowering
264	`sub_9D73F0`	--	Specific lowering
276	`sub_9D5EC0`	--	Specific lowering
292	`sub_9D0E90`	--	Specific lowering

Vtable-indirect dispatch (for architecture-extensible operations):

Opcode(s)	Vtable offset	Category (inferred)
2, 3, 4, 5, 7	`vtable[0]` (+0)	Generic fallback
14, 39, 40, 105, 125, 299, 300, 321	`vtable[7]` (+56)	Group A operations
18	`vtable[3]` (+24)	Specific class
31	`vtable[4]` (+32)	Specific class
35	`vtable[6]` (+48)	Specific class
36	`vtable[21]` (+168)	Specific class
43	`vtable[9]` (+72)	Specific class
50	`vtable[12]` (+96)	Specific class
65	`vtable[22]` (+176)	Specific class
73	`vtable[15]` (+120)	Specific class
74	`vtable[16]` (+128)	Specific class
81	`vtable[24]` (+192)	Specific class
110, 111, 112, 114	`vtable[25]` (+200)	Warp shuffle group
118	`vtable[10]` (+80)	Specific class
119	`vtable[28]` (+224)	Specific class
120, 121, 126, 127, 128, 280, 281	`vtable[27]` (+216)	Barrier/sync group
122, 123, 310, 311, 312	`vtable[26]` (+208)	Related group
130 (`HSET2`), 169	`vtable[29]` (+232)	Move/convert group (130 is MOV-like internally; actual SASS `MOV` = 19)
157	`vtable[84]` (+672)	Specific class
176, 177	`vtable[34]` (+272)	Paired operations
183, 288	`vtable[36]` (+288)	Paired operations
186	`vtable[35]` (+280)	Specific class
211	`vtable[39]` (+312)	Specific class
220	`vtable[40]` (+320)	Specific class
223, 238	`vtable[41]` (+328)	Paired operations
228	`vtable[42]` (+336)	Specific class
243	`vtable[43]` (+344)	Specific class
245--253, 257	`vtable[67--77]` (+536--+624)	SM 100+ operations
265, 266	`vtable[93]` (+744)	Paired operations
270	`vtable[77]` (+616)	Specific class
277	`vtable[65]` or `vtable[11]` (+520/+88)	Operand-type dependent
279--351	various high vtable offsets	SM 100+ / Blackwell operations

The vtable mechanism allows architecture backends to override conversion behavior without modifying the core dispatch. The vtable factory at sub_1CCEEE0 (17 KB, 244 callees) selects which overrides are active based on the SM version.
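The slot/offset pairs in the table follow from 8-byte function pointers (offset = slot index x 8), and the override mechanism can be sketched as a factory that fills a table with defaults and then replaces selected slots. This is an illustrative model only: `convert_fn`, `build_vtable`, the 4-slot table, and the SM threshold are all assumptions, not the layout or logic of `sub_1CCEEE0`.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*convert_fn)(int opcode);

enum { SLOT_COUNT = 4, POINTER_SIZE = 8 };  /* toy table; 8-byte slots as in the offsets above */

/* Byte offset of a slot: vtable[7] -> +56, vtable[25] -> +200. */
static size_t slot_offset(size_t slot_index) {
    return slot_index * POINTER_SIZE;
}

static int convert_generic(int opcode)  { return opcode; }         /* default behaviour */
static int convert_override(int opcode) { return opcode + 1000; }  /* stand-in for an SM-specific handler */

/* Hypothetical factory: fill every slot with the default, then override by SM version. */
static void build_vtable(convert_fn table[SLOT_COUNT], int sm_version) {
    assert(table != NULL);
    for (size_t i = 0; i < SLOT_COUNT; ++i)
        table[i] = convert_generic;
    if (sm_version >= 100)          /* threshold chosen for illustration only */
        table[3] = convert_override;
}
```

The dispatch code never changes: it always calls through the slot, and only the factory decides which function sits there.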

Per-Category Handlers

The larger handlers implement non-trivial conversion logic:

Handler	Size	Category	Key behavior
`sub_9E6600`	25 KB	Instruction expansion	Splits 64-bit ops on 32-bit ALU into hi/lo pairs with carry chains. Calls `sub_9D4380` (instruction builder) ~10 times per expansion.
`sub_9EC340`	23 KB	Multi-operand legalization	Operand type test: `((v >> 28) & 7) == 1` means register. Register class query via `sub_7BE7B0`. Creates new instructions via `sub_7DEAD0`.
`sub_9D76D0`	18 KB	Memory legalization (load/store)	Register type dispatch: 6=GPR, 7=predicate, 3=address. Uses `sub_9D4380` (instruction builder) and `sub_9CD420` (predication).
`sub_9D80E0`	17 KB	Memory load/store legalization	Handles opcodes 10, 11, 149, 151, 152, 290, 291 (`sub_9D76D0` handles opcode 108). Alternate code path for different operand patterns.
`sub_9E8B20`	17 KB	Texture/surface lowering	Register type 6 = GPR. Manipulates bitmask at register descriptor offset `+48`.
`sub_9DA100`	9 KB	Arithmetic operations	Handles opcode case 6 -- standard ALU instruction legalization.
`sub_9DA6B0`	--	Generic ALU group	Covers 10 opcode values (38, 59, 106, 180, 182, 192, 194, 215, 221, 242).

1:1 vs 1:N Expansion

Most PTX operations map 1:1 to a single SASS opcode. When they do not, the handlers in sub_9E6600 and related functions create multi-instruction sequences:

PTX                                    Ori IR (after Phase 1)
-----------------------------------    -----------------------------------
add.f32  %r1, %r2, %r3          -->   FADD  R1, R2, R3                [1:1]
add.s32  %r4, %r5, %r6          -->   IADD3 R4, R5, R6, RZ           [1:1, operand added]
mul.lo.s64 %rd1, %rd2, %rd3     -->   IMAD.LO  R1, R2, R6, RZ       [1:N split]
                                       IMAD.HI  R0, R2, R6, RZ
                                       IMAD      R0, R3, R6, R0
                                       IMAD      R0, R2, R7, R0
div.f32  %r7, %r8, %r9          -->   MUFU.RCP  R10, R9              [1:N, Newton-Raphson]
                                       FMUL      R7, R8, R10
                                       (+ correction iterations)
bar.sync 0                       -->   BAR                            [1:1]

The expansion creates new instruction nodes via sub_934630 and links them into the doubly-linked instruction list. The original PTX-level instruction is replaced by the expanded sequence.
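The arithmetic behind the `mul.lo.s64` split can be written out with 32-bit operations only. This is a sketch of the general technique, not a transcription of `sub_9E6600`; the function names are hypothetical and the mapping of each step to a particular IMAD variant is only indicative.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a * b, computed only from 32-bit halves. */
static uint64_t mul_lo_64_split(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t low_product = (uint64_t)a_lo * b_lo;
    uint32_t r_lo = (uint32_t)low_product;          /* low half, as IMAD.LO would give */
    uint32_t r_hi = (uint32_t)(low_product >> 32);  /* carry into the high half, as IMAD.HI */

    r_hi += a_hi * b_lo;    /* cross terms only affect the high half; */
    r_hi += a_lo * b_hi;    /* unsigned arithmetic wraps modulo 2^32  */
    /* a_hi * b_hi is shifted past bit 63 and never contributes. */
    return ((uint64_t)r_hi << 32) | r_lo;
}

/* Usage: the split result agrees with the native 64-bit multiply. */
static void check_mul_lo_64_split(void) {
    uint64_t a = 0x0123456789ABCDEFull, b = 0xFEDCBA9876543210ull;
    assert(mul_lo_64_split(a, b) == a * b);
}
```

Four 32-bit multiply steps suffice, which matches the four-instruction sequence in the table. Because only the low 64 bits are kept, the same sequence serves signed and unsigned operands.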

Type-Dependent Opcode Selection

PTX's explicitly-typed opcodes (where the type is a qualifier like .f32, .s64) map to different SASS mnemonics based on the type:

PTX type	SASS prefix	Example PTX	Example SASS
`.f16` / `.f16x2`	`H`	`add.f16`	`HADD2`
`.f32`	`F`	`add.f32`	`FADD`
`.f64`	`D`	`add.f64`	`DADD`
`.s32` / `.u32`	`I`	`add.s32`	`IADD3`
`.s64` / `.u64`	`I` (split)	`add.s64`	`IADD3` + `IADD3.X` (carry chain)
`.pred`	`P`	`setp.eq.f32`	`FSETP`

The type qualifier disappears from the instruction syntax during conversion. It becomes encoded in the SASS mnemonic itself (the F in FADD, the I in IADD3) and in the register class of the operands.
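The `.s64` row relies on a carry chain between the two 32-bit halves. A minimal sketch of that idea, assuming a register pair modelled as a struct (the names `reg_pair`, `add64_split`, `split64`, and `join64` are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t lo; uint32_t hi; } reg_pair;  /* a 64-bit value held in two 32-bit registers */

static reg_pair add64_split(reg_pair a, reg_pair b) {
    reg_pair r;
    r.lo = a.lo + b.lo;                        /* low add (the IADD3 step) */
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  /* the sum wrapped, so there is a carry out */
    r.hi = a.hi + b.hi + carry;                /* high add with carry-in (the IADD3.X step) */
    return r;
}

static reg_pair split64(uint64_t v) {
    reg_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

static uint64_t join64(reg_pair p) {
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* Usage: 0xFFFFFFFF + 1 carries into the high word. */
static void check_add64_split(void) {
    assert(join64(add64_split(split64(0xFFFFFFFFull), split64(1))) == 0x100000000ull);
}
```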

SM-Dependent Legalization

The MercConverter gates operations by SM version through the architecture vtable. An instruction available natively on one SM may require a multi-instruction lowering sequence on another:

64-bit integer arithmetic on SM 50--75 (no native 64-bit ALU): splits into 32-bit hi/lo pairs
FP16 operations on pre-SM 53 targets: promoted to FP32 (handled by the separate PromoteFP16 pipeline phase, pipeline phase 2, not by the ISel Phase 2 described on this page)
bfe/bfi variants: some bit-field extract/insert modes not supported on all targets
Tensor core intrinsics: SM 70 has HMMA v1, SM 75 has HMMA v2, SM 80+ has HMMA v3/DMMA, SM 100 has TCGen05

The architecture vtable factory at sub_1CCEEE0 populates the vtable with SM-specific method overrides. The vtable has approximately 90 method slots (up to offset +720), with the highest-numbered slots (offset 624+) serving SM 100+ Blackwell operations.

Phase 2: Ori-to-SASS Selection & Encoding

Phase 2 runs during code generation (phases 112+) after the optimizer, register allocator, and scheduler have completed. It operates on fully optimized, register-allocated Ori IR and produces final SASS machine code. Phase 2 has three major components: the ISel driver with DAG pattern matching, the Mercury master encoder, and MercExpand pseudo-instruction expansion.

ISel Driver -- `sub_B285D0` (9 KB)

The top-level ISel coordinator is a vtable entry point with 66 callees. It selects the appropriate instruction builder variant based on the target architecture:

// Simplified ISel driver
void ISel_LowerInstruction(context, instruction) {
    int sm = *(context + 184);          // SM version
    int opcode = instruction[18] & 0xFFFFCFFF;

    // Select architecture-variant builder
    if (sm == 14)
        Builder_VariantA(context, instruction);    // sub_B1FA20 (13 KB)
    else
        Builder_VariantB(context, instruction);    // sub_B20E00 (11 KB)

    // Apply post-ISel modifiers
    ApplyModifiers(context, instruction);           // sub_B1D670 (13 KB)
    SetProperties(context, instruction);            // sub_B241A0 (7 KB)
}

The two builder variants (sub_B1FA20 and sub_B20E00) are structurally near-identical, with 50 callees each. Both call sub_7E3EF0 (operand index helper) 6 times (3 source + 3 destination operands) and use sub_A3B930 (operand register class resolver). The key difference is the validation function: variant A uses sub_C49440, variant B uses sub_C49400, reflecting different encoding constraints for different SM families.

ISel Mega-Selector -- `sub_C0EB10` (185 KB)

The single largest function in the Phase 2 ISel range: 185 KB decompiled, 6,016 lines, 719+ local variables. It performs the final Ori-IR-to-SASS opcode and operand encoding for 169 distinct instruction types (SASS opcode indices 7--221). While the ~801 DAG pattern matchers handle template-based ISel through a priority contest, the mega-selector handles complex instructions that require procedural, multi-step encoding logic -- instructions where the operand marshalling depends on runtime state (calling conventions, symbol resolution, address space aliasing).

Dual-Switch SM-Generation Dispatch

The function contains two copies of the same 169-case switch statement, separated by a vtable-based opcode translation mechanism. This dual-switch structure is the SM-generation dispatch:

// sub_C0EB10 -- simplified dispatch skeleton
void MegaSelector(context *a1, instruction *a2, isel_ctx *a3) {
    int64_t *vtable = *(a3->backend);
    int opcode = *(int *)(a2 + 8);           // SASS opcode type

    // Pre-dispatch: capability check via vtable[12]
    auto cap_check = vtable[12];              // offset +96
    if (cap_check != sub_BFEAA0)              // default stub?
        if (cap_check(a3, a2))
            ctx->flags[256] = 1;              // set encoding flag

    // Read opcode translator from vtable[2]
    auto translator = vtable[2];              // offset +16

    if (translator != sub_BFEBF0) {
        // PATH A: SM-specific translation
        int encoding_index = translator(a3, opcode);
        int isel_opcode = *(ctx + 8);         // post-translation opcode
        switch (isel_opcode) {                // PRIMARY SWITCH (169 cases)
            case 7: case 34: case 35: case 36:
                emit_simple(encoding_index, ...);
                break;
            case 8: case 38: case 46: ...
                /* already encoded */ break;
            // ... 169 cases total ...
            default: goto high_opcode_path;
        }
    } else {
        // PATH B: static table lookup (default backend)
        int encoding_index = 355;             // sentinel for extended opcodes
        if (opcode <= 0xDD)
            encoding_index = word_22B4B60[opcode];
        switch (opcode) {                     // FALLBACK SWITCH (same 169 cases)
            case 7: ...: goto handler_7;      // jumps into Path A handlers
            // ... identical case set ...
            default: return;
        }
    }

high_opcode_path:
    if (opcode > 0x199) return;
    // Try vtable[3] extension dispatch for SM 100+ / Blackwell
    auto extension = vtable[3];               // offset +24
    if (extension != sub_BFEA30)
        extension(a3, a2);                    // arch-extension handler
}

The dual-switch pattern is a code-generation artifact: the compiler emitted two copies because the vtable path and static-table path produce different values for the encoding index but need identical case routing. This doubles the binary size but avoids a conditional merge point at every case entry.

Three Vtable Dispatch Points

Vtable slot	Offset	Default stub	Purpose
`vtable[2]`	+16	`sub_BFEBF0`	Opcode-to-encoding-index translator. SM-specific override remaps opcodes to different encoding slots. Fallback: `word_22B4B60[]` static table.
`vtable[12]`	+96	`sub_BFEAA0`	Pre-dispatch capability check. Returns boolean that sets `ctx[256]` encoding flag.
`vtable[3]`	+24	`sub_BFEA30`	Extension opcode handler for opcodes outside the 169-case set (barrier/sync 61--63/221, opcodes > 0x199, SM 100+ extensions).

The word_22B4B60 static table is a uint16[] array indexed by SASS opcode (0--0xDD = 221). Each entry is a SASS encoding slot index. Opcodes > 221 receive the sentinel value 355. This provides the default encoding mapping; SM-specific vtable overrides can remap any opcode to a different encoding index, enabling per-architecture instruction variants without modifying the mega-selector logic.
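The lookup rule can be summarised in a few lines. The sketch below is an assumption-laden model: the table contents are placeholders (the real `word_22B4B60` entries are not reproduced), a NULL pointer stands in for "the vtable slot still holds the default stub `sub_BFEBF0`", and `example_translator` is a made-up override.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LAST_TABLE_OPCODE = 0xDD, EXTENDED_SENTINEL = 355 };

typedef int (*translator_fn)(int opcode);

/* Placeholder for word_22B4B60: real entries are not reproduced here. */
static uint16_t default_table[LAST_TABLE_OPCODE + 1];

/* Hypothetical SM-specific override used only for demonstration. */
static int example_translator(int opcode) { return opcode + 100; }

static int encoding_index_for(translator_fn translator, int opcode) {
    assert(opcode >= 0);
    if (translator != NULL)              /* SM-specific override installed */
        return translator(opcode);
    if (opcode <= LAST_TABLE_OPCODE)     /* default backend: static table */
        return default_table[opcode];
    return EXTENDED_SENTINEL;            /* opcodes above 221 */
}
```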

Opcode Case Routing

The 169 distinct opcode cases (338 total case labels across both switches) group into approximately 70 handler blocks. The groupings reveal SASS ISA families:

Group	Opcodes	Handler pattern	Instruction family
No-op passthrough	8, 38, 46, 87, 89, 90, 93, 97, 98, 208	`goto LABEL_33` (already encoded)	Pre-encoded by upstream ISel
Simple emission	7, 34, 35, 36	`sub_9314F0(encoding_index, 1 operand)`	Basic ALU / simple 1-op
Branch/call	9, 10, 11, 12, 13, 22	`sub_926370` / vtable[17] / linked-list walk	Control flow, call frames
Memory load/store	15, 16, 18, 19, 20, 23, 24, 25, 26, 30	`sub_C01840` + address helpers	LDG, STG, LDS, etc.
Control flow	31, 32, 33	SSA phi nodes, branch tables	Phi, switch, call return
Generic ALU	39, 41, 42, 50, 51, 52, 53	`sub_9314F0` passthrough	Standard arithmetic
Special register	43, 44, 45	`sub_C06E90` symbol lookup	SR access, shared memory alias
Constant/predicate	47, 54, 55, 56	Direct operand copy / `sub_BFFD60`	Constant bank, predicate ops
Address compute	57	200-line handler, `"__nv_reservedSMEM_offset_0_alias"`	Complex addressing with SMEM
Immediate ops	59, 60	`sub_C05CC0` / `sub_C07690`	Immediate-operand variants
Barrier/sync	61, 62, 63, 221	Forward to vtable[3] extension	BAR, MEMBAR, SYNC
Conversion/move	65	Operand loop with per-element `sub_9314F0`	MOV, CVT
Texture/surface	67, 68, 69, 70	Multi-operand type-qualified encoding	TEX, TLD, TXQ
Intrinsics	71, 74, 75	Loop-based operand emission	Hardware intrinsics
Tensor core	84, 88, 91, 92	Wide-operand encoding (case 92 = 354 lines)	HMMA, DMMA, IMMA, TCGen05
Predication ext	94, 95	Predicate-dependent path selection	Extended predication
Memory extended	99--130 (19 opcodes)	`sub_C0B2C0` or `sub_BFFD60` + encoding lookup	Extended memory ops
Warp intrinsics	131--189 (50+ opcodes)	Mixed handlers, vtable[198]+632 dispatch	SHFL, VOTE, MATCH, REDUX
Async/bulk	192--218 (15 opcodes)	`sub_C0B2C0` / individual handlers	TMA, async copy, bulk ops

The largest case handlers:

Cases 141/142: ~503 lines (warp shuffle/vote extended operations)
Case 92: ~354 lines (tensor core instructions -- widest operand format)
Cases 45, 57, 95: ~200 lines each (shared memory, address compute, predication)

Operand Encoding Protocol

The mega-selector encodes operands into a stack-allocated 256-byte output buffer using a tagged-pointer word format. Each operand occupies 8 bytes (a DWORD pair):

Bits	Field	Description
`[31:28]` of word 0	Type tag	`0x1`=register, `0x4`=constant bank, `0x5`=immediate, `0x6`=control/modifier, `0x9`=special register
`[23:0]` of word 0	Value	Register index, immediate value, or bank offset
word 1	Flags	Modifier bits, encoding-format flags
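Packing and unpacking one operand slot according to the table might look as follows. This is a sketch under stated assumptions: the type and function names are invented, and bits [27:24] of word 0, which the table does not describe, are simply left at zero. At 8 bytes per slot, the 256-byte buffer has room for 32 slots.

```c
#include <assert.h>
#include <stdint.h>

enum operand_tag {
    TAG_REGISTER    = 0x1,
    TAG_CONST_BANK  = 0x4,
    TAG_IMMEDIATE   = 0x5,
    TAG_CONTROL     = 0x6,
    TAG_SPECIAL_REG = 0x9
};

typedef struct { uint32_t word0; uint32_t word1; } operand_slot;  /* one DWORD pair */

static operand_slot pack_operand(enum operand_tag tag, uint32_t value, uint32_t flags) {
    operand_slot slot;
    assert(value <= 0xFFFFFFu);                    /* the value field is 24 bits wide */
    slot.word0 = ((uint32_t)tag << 28) | value;    /* tag in [31:28], value in [23:0] */
    slot.word1 = flags;                            /* modifier and format flags */
    return slot;
}

static unsigned operand_tag_of(operand_slot slot)   { return slot.word0 >> 28; }
static uint32_t operand_value_of(operand_slot slot) { return slot.word0 & 0xFFFFFFu; }
```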

The marshalling pipeline for a typical case:

1. sub_C01840(ctx, instr, operand_list, output_buf, max_count, ...)
   -> Iterates source operands, writes tagged words to output_buf
   -> Returns: number of operand words written

2. sub_C01F50(ctx, instr, dest_list, output_buf, max_count, ...)
   -> Same for destination operands

3. Encoding-index lookup:
   if (vtable[2] != default)
     index = vtable[2](ctx, opcode);
   else
     index = word_22B4B60[opcode];

4. sub_9314F0(output, ctx, encoding_index, count, n_words, buf, ...)
   -> Emits the instruction record to the output stream

Helper	Calls	Purpose
`sub_C01840`	52	Marshal source operands into tagged-word buffer
`sub_9314F0`	31	Emit instruction with encoding index + operand buffer
`sub_C00EA0`	8	Extract single operand as tagged word
`sub_91D160`	8	Encode register index to encoding bits
`sub_934630`	6	Build new instruction node in IR (for multi-instruction expansion)
`sub_91D150`	5	Decode register index from operand word
`sub_926370`	4	Emit simple instruction (branch/jump)
`sub_C01F50`	3	Marshal destination operands
`sub_7D6860`	3	Encode data type qualifier (FP32/FP64/INT)
`sub_BFEF10`	3	Register bank capacity check / grow
`sub_92E1B0`	2	Emit instruction with constant-bank operand

Cross-Reference: Arch Dispatch Tables

The 4 arch dispatch tables (sub_B128E0--sub_B12920) are not called from the mega-selector. They operate at the Mercury encoder level:

Mega-selector (sub_C0EB10)
  -> Produces (encoding_index, operand_buffer) pairs
  -> Calls sub_9314F0 to package into instruction nodes

Mercury encoder (sub_6D9690)
  -> Reads instruction type field from instruction node
  -> Arch dispatch tables (sub_B128E0 etc.) resolve type to encoding format
  -> Encoder emits binary SASS using format + operand data

The mega-selector and arch dispatch tables thus operate at different abstraction levels: the mega-selector decides what to encode (opcode selection, operand marshalling), while the arch tables decide how to encode it (encoding format, bit layout). The arch tables' per-SM variants handle encoding-level differences (field widths, modifier positions) that are invisible to the mega-selector's opcode-level logic.

Post-ISel Modifiers -- `sub_B1D670` (13 KB)

After the main ISel selection, this pass applies architecture-specific instruction modifications:

Opcode 13: sets instruction field [79] = 3
Opcode 14: sets instruction field [79] = 2
Opcode 11: separate modifier path

The function has 51 callees including sub_AAD690 (field accessor, called multiple times), sub_AADF40, and sub_C49400 (encoding validator). It handles encoding mode bits, register class adjustments, and predicate attachment.

Instruction Properties -- `sub_B241A0` (7 KB)

Sets scheduling-relevant properties on the selected instruction:

inst[74] = 7 -- scheduling class
inst[75] = (opcode == 325) -- special flag for specific opcode
inst[77] = sub_A3B930(...) -- operand class from register resolver
inst[79] -- derived from a2[19], architecture-dependent

Contains a switch on *(context+46) (target architecture selector), confirming per-SM property assignment.

DAG Pattern Matchers -- ~801 Functions at `0xB28F60`--`0xB7D000`

Every pattern matcher follows an identical prototype and a strict check-and-report protocol. These are the ptxas equivalent of LLVM's TableGen-generated ISel patterns, but handwritten in C++. Binary analysis confirms 801 functions with the matching *a4 <= priority-comparison idiom, with the bulk (750+) residing in the 0xB30000--0xB7D000 range and a handful of smaller matchers in the 0xB28F60--0xB30000 preamble zone.

Pattern Matcher Architecture

The pattern matching system implements a priority-based best-match selection protocol. For each instruction being lowered, the ISel infrastructure invokes all applicable matchers (dispatched through vtable function pointers, not direct calls). Each matcher independently tests whether the instruction matches its pattern; if it does, it writes a (template_id, priority) pair to the output parameters. The dispatcher selects the match with the highest priority value.

Function signature (all 801+ matchers):

char __fastcall match(
    int64_t  ctx,           // a1: ISel context (passed through to field reader)
    int64_t  dag_node,      // a2: pointer to the Ori IR instruction node
    int32_t *template_id,   // a3: OUT: encoding template index [1..152]
    int32_t *priority       // a4: IN/OUT: current best priority; written only if better
);

The priority parameter is read-then-conditionally-written: the matcher checks if (*a4 <= threshold) before overwriting. This means the dispatcher initializes *a4 = 0 and calls matchers in sequence; each matcher only upgrades the result if its specificity exceeds the current best. After all matchers complete, *a3 holds the template index of the winning pattern.

Matching pipeline (invariant across all 801 matchers):

 1. OPCODE PROPERTY CHECKS      sub_10AE5C0(ctx, node, field_id)
    Check 1-12 instruction properties against expected values.
    Any mismatch -> return immediately (early exit).

 2. SOURCE OPERAND COUNT         sub_B28F50(node) -> source_count
    Verify the instruction has the expected number of source operands.

 3. SOURCE OPERAND VALIDATION    sub_B28F30(node, i) -> operand_record
    For each source operand:
      a. Type predicate: isImmediate / isGPR / isPredicate / isUniformReg / ...
      b. Register class: class == 1023 (wildcard) OR class == specific_value

 4. RESULT OPERAND COUNT         sub_B28F40(node) -> result_count
    Verify the expected number of result (destination) operands.

 5. RESULT OPERAND VALIDATION    sub_B28F30(node, first_result + j)
    Same type + register-class checks as for source operands.
    First-result index = sub_B28E00(*(node + 92)).

 6. PRIORITY WRITE               if (*a4 <= N) { *a4 = N+1; *a3 = template; }
    Conditional update: only overwrite if this pattern is more specific
    than whatever was already matched.

Match-Score Priority System

The priority values range from 2 (least specific) to 34 (most specific), with the distribution heavily concentrated in the 8--19 range. The priority correlates directly with pattern specificity: matchers with more constraints (more sub_10AE5C0 checks, more operand type checks, tighter register class requirements) assign higher priority values.

Priority range	Count	Interpretation
2--5	31	Fallback / generic patterns (few constraints)
6--10	253	Common patterns (3--6 constraints)
11--15	293	Standard patterns (5--8 constraints)
16--20	168	Specific patterns (6--10 constraints)
21--34	56	Highly specific patterns (8--12+ constraints)

Template IDs range from 1 to 152. Multiple matchers can target the same template ID at different priority levels, forming a specificity ladder: a generic matcher might match FADD at priority 8 while a specialized matcher matches FADD.FTZ.SAT with specific register classes at priority 17. Both write the same template ID but the specialized matcher wins when its constraints are satisfied.
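The specificity ladder can be modelled with two toy matchers and a dispatcher loop. Everything here is hypothetical scaffolding (the `toy_node` fields, the template ID 12, the use of `bool` instead of `char` for the return type); only the `if (*priority <= N) { *priority = N+1; ... }` idiom and the priorities 8 and 17 come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { bool is_fadd, ftz, sat; } toy_node;   /* hypothetical stand-in for a DAG node */
typedef bool (*matcher_fn)(const toy_node *n, int32_t *template_id, int32_t *priority);

enum { TEMPLATE_NONE = 0, TEMPLATE_FADD = 12 };        /* illustrative IDs */

/* Generic pattern: one constraint, ends at priority 8. */
static bool match_fadd_generic(const toy_node *n, int32_t *template_id, int32_t *priority) {
    if (!n->is_fadd) return false;                     /* early exit on mismatch */
    if (*priority <= 7) { *priority = 8; *template_id = TEMPLATE_FADD; }
    return true;
}

/* Specialized pattern: three constraints, ends at priority 17. */
static bool match_fadd_ftz_sat(const toy_node *n, int32_t *template_id, int32_t *priority) {
    if (!n->is_fadd || !n->ftz || !n->sat) return false;
    if (*priority <= 16) { *priority = 17; *template_id = TEMPLATE_FADD; }
    return true;
}

/* Dispatcher: start at priority 0 and let every candidate try to upgrade the result. */
static int32_t select_template(const toy_node *n, int32_t *best_priority) {
    static const matcher_fn candidates[] = { match_fadd_ftz_sat, match_fadd_generic };
    int32_t template_id = TEMPLATE_NONE;
    assert(n != NULL && best_priority != NULL);
    *best_priority = 0;
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; ++i)
        candidates[i](n, &template_id, best_priority);
    return template_id;
}
```

The specialized matcher is deliberately listed first: because each matcher only writes when the stored priority is lower than its own, the generic matcher cannot downgrade the result, so the outcome does not depend on call order.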

Dispatcher Mechanism

The matchers are not called directly from a single dispatch function. Instead, they are registered as virtual methods on per-instruction-class descriptor objects. The dispatch chain is:

sub_B285D0 (ISel driver, 9 KB)
  -> opcode switch on (instruction[18] & 0xFFFFCFFF)
     -> selects builder variant (sub_B1FA20 / sub_B20E00 / sub_B1EC10 / ...)
        -> builder invokes vtable method on instruction descriptor
           -> vtable slot contains pointer to one of the 801 pattern matchers
              -> matcher writes (template_id, priority) if pattern matches

The vtable dispatch occurs at various offsets including +2600, +2616, +2656, and +2896 (observed in sub_13AF3D0, the 137 KB ISel pattern coordinator). The matchers have no static callers -- they appear exclusively through indirect function pointer invocation, which is why the sweep reports them as "no callers in function DB."

For a given instruction, the dispatcher may invoke multiple matchers (one per applicable template variant). Each matcher independently checks its constraints and conditionally updates the priority/template pair. After all candidates have been tried, the dispatcher reads the final template_id and uses it to select the SASS encoding template.
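
A minimal sketch of this "try every candidate, keep the best" loop, assuming a plain array of function pointers in place of the real vtable slots. The `Node` fields, both matcher bodies, the template IDs 12 and 22, and the use of -1 for "no match" are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced view of an instruction node. */
typedef struct { int opcode_class; int has_ftz; } Node;

typedef void (*Matcher)(const Node *node, int *template_id, int *priority);

/* Generic matcher: one constraint, low priority. */
static void match_generic(const Node *n, int *tmpl, int *prio) {
    if (n->opcode_class != 21) return;
    if (*prio <= 7) { *prio = 8; *tmpl = 12; }
}

/* Specialized matcher: extra constraint, higher priority. */
static void match_special(const Node *n, int *tmpl, int *prio) {
    if (n->opcode_class != 21 || !n->has_ftz) return;
    if (*prio <= 16) { *prio = 17; *tmpl = 22; }
}

/* Invokes every candidate, then returns the surviving template ID
 * (-1 if no matcher accepted the node). */
int select_template(const Node *n, const Matcher *cands, size_t count) {
    int tmpl = -1, prio = 0;
    assert(cands != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        cands[i](n, &tmpl, &prio);
    return tmpl;
}
```

Because each matcher only overwrites a strictly lower priority, the result does not depend on the order of the candidate array.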

DAG Node Property Accessor -- `sub_10AE5C0`

The field reader is the most-called function in the matcher range (typically 2--12 call sites per matcher, so an estimated 3,000--8,000 static call sites across all 801 matchers):

// sub_10AE5C0 -- Read instruction property by field_id
int64_t DAGNode_ReadField(int64_t ctx, int64_t node, uint32_t field_id) {
    if (sub_10E32E0(node, field_id))        // field exists in descriptor?
        return sub_10D5E60(node, field_id); // read value from property table
    else
        return 0xFFFFFFFF;                  // sentinel: field not present
}

The field_id values form a large flat namespace (observed range: 5--595). These are not byte offsets into the instruction record; they are logical property identifiers resolved through a descriptor table. The backing store (managed by sub_10E32E0 / sub_10D5E60) implements a sparse property bag that maps field IDs to integer values.
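
The sparse property bag behaviour can be modelled as below. The linear scan, the `FieldEntry` / `PropertyBag` types, and the name `read_field` are assumptions made for the sketch; the real backing store behind sub_10E32E0 / sub_10D5E60 resolves IDs through a descriptor table whose layout is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIELD_ABSENT 0xFFFFFFFFu   /* sentinel returned for missing fields */

/* Hypothetical storage: one (field_id, value) pair per present field. */
typedef struct { uint32_t field_id; uint32_t value; } FieldEntry;
typedef struct { const FieldEntry *entries; size_t count; } PropertyBag;

/* Returns the value stored for field_id, or FIELD_ABSENT when the
 * bag has no entry for that ID. */
uint32_t read_field(const PropertyBag *bag, uint32_t field_id) {
    assert(bag != NULL);
    for (size_t i = 0; i < bag->count; i++)
        if (bag->entries[i].field_id == field_id)
            return bag->entries[i].value;
    return FIELD_ABSENT;
}
```

A matcher comparing the result against an expected constant (for example 2115 for field 397) rejects absent fields automatically, since the sentinel never equals a real value in the observed range.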

The companion write functions follow the same field-ID namespace:

// sub_10AE590 -- Write single field
void DAGNode_WriteField(int64_t ctx, int64_t node, uint32_t field_id, uint32_t value);

// sub_10AE640 -- Write two fields atomically (multi-field update)
void DAGNode_WriteFields(int64_t ctx, int64_t node, uint32_t f1, uint32_t v1, uint32_t v2);

Inferred semantic groupings for field IDs (from cross-referencing matcher patterns):

Field range	Likely semantics
5--7	Opcode class / major instruction group
88	Sub-operation modifier
105	Operation variant selector
126	Data type qualifier (e.g., field 126 in {547,548})
163	Addressing mode / operand encoding class
190--211	Encoding format selectors
220	Specific encoding property
242	Width/size qualifier
294	Generic constraint field
327	Register format descriptor
345	Rounding / saturation mode
348	Precision qualifier
355--429	Extended instruction properties
397	Instruction validity flag (value 2115 appears as a near-universal gate)
480	High opcode range (Blackwell/SM 100+ instructions)
595	Highest observed field ID

Field 397 with value 2115 appears in the majority of matchers as a mandatory check, suggesting it encodes a "this instruction is encoding-compatible" or "instruction is valid for ISel" flag.

Operand Record Layout

Each operand is a 32-byte record accessed by index via sub_B28F30:

// sub_B28F30 -- Get operand record by index
int64_t GetOperand(int64_t node, int index) {
    return *(int64_t*)(node + 32) + 32LL * index;
}

The 32-byte operand record:

Offset	Size	Field	Description
+0	1	`type_tag`	Operand kind (see predicate table below)
+4	4	`primary_class`	Register class ID; 1023 = wildcard (any class)
+14	1	`modifier_a`	Written by `sub_B28F10`
+15	1	`modifier_b`	Written by `sub_B28F20`
+20	4	`secondary_class`	Fallback register class constraint

Source operand count is stored at node + 92 and doubles as the first-result-operand index:

uint32_t source_count = *(uint32_t*)(node + 92);   // sub_B28F50
uint32_t result_count = *(node + 40) + 1 - source_count; // sub_B28F40
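
The offsets in the table can be checked mechanically with a C struct. In this sketch the padding members stand for bytes whose meaning is not documented above, `field40` is a placeholder name for the value read from node + 40, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  type_tag;         /* +0  operand kind */
    uint8_t  pad0[3];          /* +1  not documented */
    uint32_t primary_class;    /* +4  1023 = wildcard */
    uint8_t  pad1[6];          /* +8  not documented */
    uint8_t  modifier_a;       /* +14 */
    uint8_t  modifier_b;       /* +15 */
    uint8_t  pad2[4];          /* +16 not documented */
    uint32_t secondary_class;  /* +20 */
    uint8_t  pad3[8];          /* +24 not documented */
} OperandRecord;

_Static_assert(sizeof(OperandRecord) == 32, "record is 32 bytes");
_Static_assert(offsetof(OperandRecord, primary_class) == 4, "primary at +4");
_Static_assert(offsetof(OperandRecord, modifier_a) == 14, "modifier_a at +14");
_Static_assert(offsetof(OperandRecord, modifier_b) == 15, "modifier_b at +15");
_Static_assert(offsetof(OperandRecord, secondary_class) == 20, "secondary at +20");

/* Equivalent of base + 32 * index on the operand array. */
const OperandRecord *get_operand(const OperandRecord *base, int index) {
    assert(index >= 0);
    return base + index;
}

/* Same arithmetic as sub_B28F40: field40 + 1 - source_count. */
uint32_t result_count(uint32_t field40, uint32_t source_count) {
    return field40 + 1 - source_count;
}
```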

Operand Type Predicates

Fifteen predicate functions classify operand type tags. Each is a single comparison returning bool:

Address	Name	Test	Semantics
`sub_B28E20`	`isImmediate`	`tag == 1`	Constant / immediate literal
`sub_B28E10`	`isGPR`	`tag == 2`	General-purpose register
`sub_B28E80`	`isPredicate`	`tag == 3`	Predicate register
`sub_B28E70`	`isType4`	`tag == 4`	(specific operand class)
`sub_B28E60`	`isType5`	`tag == 5`	(specific operand class)
`sub_B28E30`	`isSpecialReg`	`tag == 6`	Special register
`sub_B28ED0`	`isType7`	`tag == 7`	(specific operand class)
`sub_B28EF0`	`isType8`	`tag == 8`	(specific operand class)
`sub_B28E50`	`isType9`	`tag == 9`	(specific operand class)
`sub_B28E40`	`isValidReg`	`tag == 10`	Generic valid register
`sub_B28EE0`	`isType11`	`tag == 11`	(specific operand class)
`sub_B28EA0`	`isType13`	`tag == 13`	(specific operand class)
`sub_B28EB0`	`isType14`	`tag == 14`	(specific operand class)
`sub_B28E90`	`isUniformReg`	`tag == 15`	Uniform register (SM 75+)
`sub_B28EC0`	`isType16`	`tag == 16`	(specific operand class)

Register class 1023 serves as a wildcard: if (class == 1023 || class == expected). This allows matchers to accept both unconstrained operands and operands already assigned to a specific register file.

Register Class Constraint Protocol

Operand records carry two register class fields: primary_class at offset +4 and secondary_class at offset +20. The matching protocol checks them with a cascading OR:

// Typical register class check (from sub_B33F00, sub_B390A0, etc.)
uint32_t primary   = *(uint32_t*)(operand + 4);
uint32_t secondary = *(uint32_t*)(operand + 20);

if (sub_B28E00(primary) == 1023) {
    // Wildcard -- operand is unconstrained, accept it
} else {
    uint32_t cls = sub_B28E00(secondary);
    if (cls != expected_class) return;  // mismatch
}

sub_B28E00 and sub_B28F00 are identity functions -- the register class is stored as a plain integer, not packed. The two-field scheme allows the matcher to accept an operand where either the allocation constraint (primary) is wildcard or the resolved register file (secondary) matches.
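
As a reusable form of the check above, the following sketch accepts a set of allowed classes, which also covers constraints written as "secondary in {1,2}". The function name and the array-based interface are assumptions; the matchers themselves inline these comparisons.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_CLASS_WILDCARD 1023u

/* Returns 1 if the operand passes: either the primary class is the
 * wildcard, or the secondary class is one of the allowed values. */
int reg_class_ok(uint32_t primary, uint32_t secondary,
                 const uint32_t *allowed, size_t n_allowed) {
    assert(allowed != NULL && n_allowed > 0);
    if (primary == REG_CLASS_WILDCARD)
        return 1;
    for (size_t i = 0; i < n_allowed; i++)
        if (secondary == allowed[i])
            return 1;
    return 0;
}
```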

Observed register class values in matchers:

Class	Frequency	Likely meaning
1023	ubiquitous	Wildcard (any register class)
1	very common	32-bit GPR (R0..R255)
2	common	64-bit GPR pair
3	occasional	128-bit GPR quad
4	occasional	Predicate or special register file
5	rare	Extended register class

Representative Matcher Walkthroughs

sub_B30160 -- simple 2-source, 4-result pattern (68 lines, priority 9, template 12):

1. field 480 == 2481                    -> opcode/subclass check
2. source_count == 2                    -> expects 2 source operands
3. operand[0].type == 1 (immediate)     -> first source is a constant
4. operand[1].type == 2 (GPR)           -> second source is a register
5. operand[1].class == 1023 OR sec == 1 -> 32-bit GPR or unconstrained
6. result_count == 4                    -> expects 4 result operands
7. result[0].type == 2 (GPR)            -> first result is GPR
   result[0].class == 1023 OR sec == 1
8. result[1].type == 3 OR 15            -> predicate or uniform register
9. result[2].type == 2 (GPR)            -> third result is GPR
   result[2].class == 1023 OR sec == 1
10. if (*a4 <= 8) -> *a4 = 9, *a3 = 12

sub_B33F00 -- medium 2-source, 5-result pattern (4,166 bytes, priority 21, template 22):

1. field 7 == 21                            -> major opcode class
2. field 163 in {705, 706}                  -> addressing mode variant
3. field 203 in {1113..1117}                -> encoding format (5 values)
4. field 105 == 477                         -> operation variant
5. field 88 == 408                          -> sub-operation modifier
6. field 345 == 1903                        -> rounding/saturation mode
7. source_count == 2                        -> 2 sources
8. operand[0].type == 1 (immediate)         -> constant source
9. operand[1].type == 2 (GPR)              -> register source
   operand[1].class: primary wildcard or secondary in {1,2}
10. result_count == 5                       -> 5 results
11. result[0].type == 2 (GPR), class != 1023, secondary == 2 (64-bit)
12. result[1].type == 3 OR 15 (pred/uniform)
13. result[2].type == 2 (GPR), class: wildcard or secondary in {1,2}
14. result[3].type == 2 (GPR), class: wildcard or secondary in {1,2}
15. if (*a4 <= 20) -> *a4 = 21, *a3 = 22

sub_B44CA0 -- complex 0-source, 7-result pattern (6,214 bytes, priority 11, template varies):

1.  field 5 == 12                           -> opcode class 12
2.  field 220 == 1206                       -> encoding property
3.  field 595 in {2937, 2938}               -> extended field (high range)
4.  field 294 == 1493                       -> constraint
5.  field 242 in {1281, 1282}               -> width qualifier
6.  field 355 == 1943                       -> extended property
7.  field 376 == 2035                       -> extended property
8.  field 377 in {2037..2041}               -> extended property (5 values)
9.  field 429 in {2252, 2253}               -> extended qualifier
10. field 126 in {547, 548}                 -> data type
11. field 397 == 2115                       -> validity gate
12. source_count == 0                       -> no source operands
13. result_count == 7                       -> 7 result operands
14. All 7 results checked: type == 10 (valid register), various class constraints
15. if (*a4 <= 10) -> *a4 = 11, *a3 = (template)

This pattern has the most field checks (11) of the representative examples, validating properties deep into the extended field namespace (field 595). Its zero-source, seven-result shape suggests a hardware intrinsic or complex output instruction such as a tensor-core operation.

sub_B28FE0 -- minimal matcher in the preamble zone (31 lines, priority 8, template 42):

1. field 211 == 1182
2. field 201 == 1109
3. field 348 in {1912, 1915}   -> precision qualifier
4. field 397 == 2115           -> validity gate
5. source_count == 0           -> no sources
6. if (*a4 <= 7) -> *a4 = 8, *a3 = 42

The simplest matchers skip operand validation entirely and rely solely on opcode-property checks. These are for instructions with fixed operand formats where the operand shape is fully determined by the opcode.
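
The sub_B28FE0 walkthrough translates almost line for line into C. The sketch below assumes the four field values have already been read into a flat struct (`NodeView`, a hypothetical type) instead of going through sub_10AE5C0; the constants are the ones listed in the walkthrough.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flattened view of the properties this matcher reads. */
typedef struct {
    uint32_t field_211;
    uint32_t field_201;
    uint32_t field_348;
    uint32_t field_397;
    uint32_t source_count;
} NodeView;

/* Early-out on the first failed constraint; on success, perform the
 * conditional priority write (priority 8, template 42). */
void match_minimal(const NodeView *n, int *template_id, int *priority) {
    assert(n != 0 && template_id != 0 && priority != 0);
    if (n->field_211 != 1182) return;
    if (n->field_201 != 1109) return;
    if (n->field_348 != 1912 && n->field_348 != 1915) return;  /* precision */
    if (n->field_397 != 2115) return;                          /* validity gate */
    if (n->source_count != 0) return;
    if (*priority <= 7) { *priority = 8; *template_id = 42; }
}
```

Larger matchers follow the same early-return shape and add the operand type and register class checks between the field checks and the final write.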

Helper Function Summary

Address	Name	Signature	Purpose
`sub_10AE5C0`	`DAGNode_ReadField`	`(ctx, node, field_id) -> value`	Read instruction property by ID; returns `0xFFFFFFFF` if absent
`sub_10AE590`	`DAGNode_WriteField`	`(ctx, node, field_id, value)`	Write single instruction property
`sub_10AE640`	`DAGNode_WriteFields`	`(ctx, node, f1, v1, v2)`	Multi-field atomic update
`sub_B28F30`	`GetOperand`	`(node, index) -> operand_ptr`	Index into operand array (32-byte records at `*(node+32)`)
`sub_B28F40`	`GetResultCount`	`(node) -> count`	Number of result operands: `node[40] + 1 - node[92]`
`sub_B28F50`	`GetSourceCount`	`(node) -> count`	Number of source operands: `*(node+92)`
`sub_B28E00`	`DecodeRegClass`	`(packed) -> class_id`	Identity function (class stored as plain int)
`sub_B28F00`	`DecodeRegClass2`	`(packed) -> class_id`	Second identity accessor (same semantics)
`sub_B28F10`	`SetModifierA`	`(operand, value)`	Write operand modifier at offset +14
`sub_B28F20`	`SetModifierB`	`(operand, value)`	Write operand modifier at offset +15
`sub_B28E10`	`isGPR`	`(tag) -> bool`	`tag == 2`
`sub_B28E20`	`isImmediate`	`(tag) -> bool`	`tag == 1`
`sub_B28E30`	`isSpecialReg`	`(tag) -> bool`	`tag == 6`
`sub_B28E40`	`isValidReg`	`(tag) -> bool`	`tag == 10`
`sub_B28E50`	`isType9`	`(tag) -> bool`	`tag == 9`
`sub_B28E60`	`isType5`	`(tag) -> bool`	`tag == 5`
`sub_B28E70`	`isType4`	`(tag) -> bool`	`tag == 4`
`sub_B28E80`	`isPredicate`	`(tag) -> bool`	`tag == 3`
`sub_B28E90`	`isUniformReg`	`(tag) -> bool`	`tag == 15`
`sub_B28EA0`	`isType13`	`(tag) -> bool`	`tag == 13`
`sub_B28EB0`	`isType14`	`(tag) -> bool`	`tag == 14`
`sub_B28EC0`	`isType16`	`(tag) -> bool`	`tag == 16`
`sub_B28ED0`	`isType7`	`(tag) -> bool`	`tag == 7`
`sub_B28EE0`	`isType11`	`(tag) -> bool`	`tag == 11`
`sub_B28EF0`	`isType8`	`(tag) -> bool`	`tag == 8`

SM120 Pattern Coordinator -- `sub_13AF3D0` (137 KB)

At 137 KB (4,225 lines, 570+ locals), this function is second in size only to the 185 KB mega-selector sub_C0EB10 among the ISel functions described here. It is an architecture-specific operand-emission coordinator that runs in Phase 2 as a parallel backend to the mega-selector. The two do not call each other -- they are mutually exclusive implementations of the same ISel protocol, selected per-SM by the vtable in the ISel driver. The mega-selector covers opcodes 7--221 for the default backend; the coordinator covers opcodes 2--352 for the SM120 (consumer RTX 50xx / enterprise Pro) backend.

Position in the ISel Pipeline

sub_B285D0 (ISel driver, 9 KB)
  -> selects builder variant by SM version
     -> Builder variant vtable dispatch
        |
        +-- DEFAULT BACKEND: sub_C0EB10 (mega-selector, 185 KB)
        |     opcodes 7..221, dual-switch, word_22B4B60 encoding table
        |
        +-- SM120 BACKEND: sub_A29220 (instruction iterator, 435 lines)
              -> sub_13AF3D0 (pattern coordinator, 137 KB)
                   opcodes 2..352, single switch, inline operand emission

The coordinator is called once per instruction by sub_A29220, which walks the instruction list. Before entering the main switch, the coordinator performs a predication pre-test: if bit 0x1000 is set in the opcode word and the opcode is not 169, it queries vtable[3232/8] and optionally emits the last source operand via sub_13A6AE0.

Dispatch Structure

The coordinator reads the opcode from *(instr+72) with the standard BYTE1 & 0xCF mask (identical to Phase 1's MercConverter) and enters a single 130-case switch. Unlike the mega-selector's dual-switch encoding-slot translation, the coordinator emits operands inline -- each case directly calls sub_13A6280 (the operand emitter) with explicit operand indices.

// sub_13AF3D0 -- simplified dispatch skeleton
char PatternCoordinator(context *a1, instruction *a2, output *a3,
                        pattern_table *a4, flags a5, int a6) {
    int opcode = *(DWORD*)(a2 + 72);
    BYTE1(opcode) &= 0xCF;

    // Pre-dispatch: predication check when bit 0x1000 is set
    if ((*(a2+72) & 0x1000) && opcode != 169) {
        if (vtable[3232/8] == sub_868720 || vtable[3232/8]())
            EmitLastSource(a1[1], a2, operand_count - 2*flag, a3);
    }

    // Setup output context
    vtable[104/8](output_ctx, a1, &context_ref);

    switch (opcode) {
    case 2: case 4: case 7:     // FMA/MAD 2-source
        operand_span = 16; src_count = 2;
        goto SHARED_FMA_HANDLER;
    case 3: case 5:             // FMA/MAD 3-source
        operand_span = 24; src_count = 3;
        goto SHARED_FMA_HANDLER;
    case 6:                     // IMAD/IADD3 with 3+ sources
        EmitOperand(ctx, instr, 3, out);
        EmitOperand(ctx, instr, 4, out);
        EmitOperand(ctx, instr, 5, out);
        break;
    case 8:                     // Pure vtable dispatch (vtable+2328)
        vtable[2328/8](a1, a2, operand_count, a3, a5, 0);
        break;
    case 10: case 11: case 151: case 152: case 290: case 291:
        vtable[2768/8](a1, a2, a3, a4, a5);   // Memory load/store
        break;
    case 16:                    // Texture/surface (163-line handler)
        for (i = first_src; i < last_src; i++)
            EmitOperand(ctx, instr, i, out);   // loop up to 15 operands
        break;
    // ... 120 more cases ...
    case 329:                   // Variable-count loop + vtable+2328
        for (i = 0; i < src_count; i++)
            EmitOperand(ctx, instr, i, out);
        vtable[2328/8](a1, a2, remaining, a3, a5, 0, 0, 0);
        break;
    default:
        break;                  // no-op passthrough
    }
}
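
The opcode masking and the pre-dispatch test at the top of the skeleton reduce to two small bit operations. The sketch below shows them in isolation; the function names are hypothetical, and the vtable query that follows the test in the real function is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* BYTE1(opcode) &= 0xCF clears bits 12 and 13 of the 32-bit opcode
 * word, which is the same as masking the word with 0xFFFFCFFF. */
uint32_t normalize_opcode(uint32_t opcode_word) {
    uint32_t opcode = opcode_word & 0xFFFFCFFFu;
    assert((opcode & 0x3000u) == 0);   /* both flag bits are gone */
    return opcode;
}

/* Pre-dispatch condition: bit 0x1000 is set in the raw word and the
 * normalized opcode is not 169. */
int needs_predication_pretest(uint32_t opcode_word) {
    return (opcode_word & 0x1000u) != 0
        && normalize_opcode(opcode_word) != 169u;
}
```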

Opcode Case Routing

The 130 distinct case labels (spanning 82 distinct handler blocks) cover the full SASS opcode range including SM100+/SM120 extensions:

Opcodes	Handler pattern	Instruction family
2, 3, 4, 5, 7	Shared FMA handler with operand-span parametrization	FMA/MAD variants (32/64-bit)
6	Inline 3-source emission + optional operands 6/7	IMAD/IADD3 wide
8	Pure vtable+2328 delegation	Builder-only instructions
10, 11, 151, 152, 290, 291	vtable+2768 delegation	Memory load/store
16	163-line operand loop (up to 15 sources)	Texture/surface
20, 21	vtable+2680/2688 with stub check	Memory/store alternates
22, 77, 83, 297, 352	vtable+2744 with `nullsub_463` check	Control flow
24, 34, 209, 213, 214	Passthrough: emit src 1 + dst 2	Simple 2-operand ALU
29, 95, 96, 190	Conditional operand-6 check	Predicate-source instructions
38, 59, 106, 180, 182, 192, 194, 215, 221	Single `EmitOperand(1)` at high SM	Generic ALU
42, 53, 55	`EmitOperand(1)`	Paired ALU
60, 61, 62, 63, 64	Comparison / inner sub-opcode switch (case 61: 5 sub-cases)	Compare / set-predicate
88, 89	Variable source count (2 or 3) with sign-dependent offsets	Extended FMA
110, 111, 114, 115, 117	Warp operand emission	Warp shuffle / vote
120, 121, 126, 127	Barrier handler with operand loop at LABEL_53	Barrier / sync
139, 140, 141, 143	`sub_13A4DA0` for commutative operand selection	Commutative ALU
183	Extended memory with register-class-6 check	Wide memory
201, 202, 204	vtable+2328 delegation	Async / bulk operations
270, 279, 282, 285, 325--328	Goto LABEL_53 (barrier/sync shared handler)	Extended memory / warp
280, 281	vtable+2896 with `nullsub_239` check, then LABEL_53	Sync instructions
329	Variable-count operand loop + vtable+2328	Variable-width encoding

Three Competing-Match Selection Mechanisms

The coordinator selects among competing pattern matchers through three mechanisms:

1. LABEL_750 -- vtable alternate-match dispatch. Six opcode paths (cases 6, 36, 130, 137, plus opcodes reaching LABEL_119 when sub_7D6850 confirms a double-precision operand) jump to LABEL_750:

LABEL_750:
    replacement = vtable[16/8](output_ctx, instruction);
    *output = replacement;
    return;

This is the "try architecture-specific alternate" escape hatch. The vtable slot at offset +16 on the ISel context object points to an SM-specific matcher. If it succeeds, the coordinator's inline emission is entirely bypassed and the replacement instruction is written to the output.

2. sub_13A4DA0 -- commutative operand position selector. Called 12 times for commutative instructions (FMA, IADD3, comparison) where source operands can be swapped for better encoding. The function holds up to 4 pattern entries at offsets +12/+16 through +36/+40, each a (lo_word, hi_word_mask) pair. It tests operand properties via sub_13A48E0 against each entry; the first match returns a preferred operand index. The coordinator then calls sub_13A6280 with the returned index instead of the default.

// sub_13A4DA0 -- simplified pseudocode
int SelectOperandSlot(pattern_table, instruction, default_slot, alt_slot, out_match) {
    if (!pattern_table->active) return default_slot;

    // Pass 1: test the operand in default_slot against each entry
    uint64_t operand_desc = GetOperandDescriptor(instruction, default_slot);
    for (i = 0; i < pattern_table->count; i++) {  // up to 4 entries
        if (operand_desc matches pattern_table->entry[i]) {
            *out_match = pattern_table->entry[i].preferred;
            return default_slot;
        }
    }

    // Pass 2: no match on default_slot, so repeat with alt_slot
    operand_desc = GetOperandDescriptor(instruction, alt_slot);
    for (i = 0; i < pattern_table->count; i++) {
        if (operand_desc matches pattern_table->entry[i]) {
            *out_match = pattern_table->entry[i].preferred;
            return alt_slot;
        }
    }
    return default_slot;  // nothing matched: keep the default slot
}

3. Inline vtable override checks. Many cases test whether a vtable function pointer equals a known null-stub before calling it. The stub addresses serve as sentinel values -- when the vtable slot has been overridden by an SM-specific implementation, the coordinator calls the override:

Vtable offset	Default stub	Purpose
+2680	`sub_A8CBE0`	Memory operation alternate matcher
+2688	`sub_A8CBF0`	Store operation alternate matcher
+2744	`nullsub_463`	Control flow alternate
+2632	`nullsub_233`	Move/convert alternate
+2760	`nullsub_235`	Atomic/barrier alternate
+2896	`nullsub_239`	Sync instruction alternate
+3232	`sub_868720`	Pre-dispatch predication alternate
+3112	`sub_A8CCA0`	MADC alternate (case 36)

When the vtable slot holds the stub, the coordinator skips the call and proceeds with its inline emission logic.
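
The stub-as-sentinel idiom looks like this in isolation. Everything in the sketch is a stand-in: `default_stub` plays the role of a stub such as `nullsub_463`, `sm_override` is an invented SM-specific implementation, and returning the opcode unchanged represents "proceed with inline emission".

```c
#include <assert.h>
#include <stddef.h>

typedef int (*AltMatcher)(int opcode);

/* Stand-in for a default stub: does nothing useful. */
static int default_stub(int opcode) { (void)opcode; return 0; }

/* Hypothetical SM-specific override installed in the vtable slot. */
static int sm_override(int opcode) { return opcode + 1000; }

/* Compare the slot against the known stub address. Only a slot that
 * has been overridden is actually called. */
int dispatch_with_stub_check(AltMatcher slot, int opcode) {
    assert(slot != NULL);
    if (slot == default_stub)
        return opcode;        /* stub present: inline emission path */
    return slot(opcode);      /* overridden: call the alternate */
}
```

Comparing the pointer before calling avoids an indirect call in the common case where no override is installed, which is a plausible reason for the pattern appearing so often in the decompiled output.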

Primary Callee: `sub_13A6280` (239 lines)

The operand emitter, called 83 times. It reads the operand at instruction[operand_index + 10] (each operand is 8 bytes starting at instruction + 84), checks the type tag at bits [31:28], and emits:

Tag 1 (register): fast-path returns if register class == 6 (UB/dead register). Otherwise reads the register descriptor from *(context+88)[reg_index], checks register class at descriptor offset +64.
Tags 2/3 (constant/immediate): calls sub_7DBC80 to validate constant-bank availability, then sub_A9A290 for type-5 immediate expansion. Delegates to vtable methods at *(*(context+1584) + 1504) and *(*(context+1584) + 3248).
Other types: pass through to the vtable dispatch chain.

The third parameter (operand index) ranges from 0 to 7 across the coordinator's call sites, with 0/1/2/3 being the most common (corresponding to the first 4 source operands in the Ori IR instruction layout).
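
The tag dispatch described above can be sketched as follows, assuming the tag occupies bits [31:28] of a 32-bit operand word. The enum and function names are hypothetical, and the per-kind emission logic (register descriptor lookup, constant-bank validation) is intentionally left out.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { OP_REGISTER, OP_CONST_OR_IMM, OP_OTHER } OperandKind;

/* Extract the 4-bit type tag from bits [31:28]. */
uint32_t operand_tag(uint32_t word) {
    return word >> 28;
}

/* Tag 1 is a register, tags 2 and 3 are constant/immediate, and every
 * other tag falls through to the vtable dispatch chain. */
OperandKind classify_operand(uint32_t word) {
    uint32_t tag = operand_tag(word);
    assert(tag <= 0xFu);
    switch (tag) {
    case 1:  return OP_REGISTER;
    case 2:
    case 3:  return OP_CONST_OR_IMM;
    default: return OP_OTHER;
    }
}
```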

Function Map Additions

Address	Size	Identity	Confidence
`sub_13AF3D0`	137 KB	SM120 ISel pattern coordinator (130-case switch, 83x operand emission)	HIGH
`sub_A29220`	435 lines	Instruction iterator / coordinator caller (per-instruction walk)	HIGH
`sub_13A6280`	239 lines	Operand emitter (type-tag dispatch, register class 6 fast-path)	HIGH
`sub_13A7410`	--	Destination operand emitter (with register class 6 check)	MEDIUM
`sub_13A6AE0`	--	Pre-dispatch source emitter (predicated instruction operands)	MEDIUM
`sub_13A4DA0`	180 lines	Commutative operand position selector (4-entry pattern table)	HIGH
`sub_13A6F90`	--	Extended destination emitter (3rd variant, class 6 check)	MEDIUM
`sub_13A6790`	--	Fenced memory operand emitter	MEDIUM
`sub_13A45E0`	--	Extra operand emitter (operands 6/7 for wide instructions)	MEDIUM
`sub_13A5ED0`	--	Modifier flag emitter (operands with 0x18000000 bits)	MEDIUM
`sub_13A75D0`	--	Register class 6 (UB) operand substitution handler	MEDIUM
`sub_13A48E0`	--	Operand property extractor (for sub_13A4DA0 matching)	MEDIUM

Architecture Dispatch Tables -- 4 Copies at `sub_B128E0`--`sub_B12920`

Four nearly identical functions (listed at 15,049 bytes each) provide architecture-variant opcode dispatch. Each entry point is itself only 13 binary bytes (3 instructions -- a thunk into shared code at 0x1C39xxx); the decompiled output is 79,562 bytes because of the massive shared switch statement the thunks jump into.

Each table contains a switch on *(a3+12) (the opcode word field) with 50+ cases, and secondary switches on *(a3+14) (opcode sub-field) within certain cases. The return values are SASS encoding slot indices (e.g., 197, 691, 526, 697, 772, 21). The four copies serve different SM architecture families, mapping the same logical opcode to different encoding slots depending on the target.

Opcode Variant Selectors

Two specialized variant selectors handle the final opcode-to-encoding mapping for specific instruction families:

sub_B0BE00 (19 KB) -- opcode class 194:

Massive switch on a2 (100+ cases)
All cases call sub_10AE590(ctx, inst, 194, N) with sequential N values starting from 827
Pattern: case K -> sub_10AE590(ctx, inst, 194, 826+K)
Maps sub-variant indices to SASS encoding slots for one PTX opcode family
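
The class-194 mapping is a constant offset, which a short sketch makes explicit. The function name is hypothetical, and the assumption that case indices start at 1 is inferred from the first written value being 827.

```c
#include <assert.h>

/* Value written to field 194 for switch case k: 826 + k. */
int class194_variant(int k) {
    assert(k >= 1);   /* inferred lower bound, since values start at 827 */
    return 826 + k;
}
```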

sub_B0AA70 (5 KB) -- opcode class 306:

Same pattern but with opcode class 306
Variants numbered 1680--1726 with non-sequential case indices (2, 3, 8, 9, 14, 15, 20, 21, 26, 27, 30, 31, 36, 37, 40, 41, ...)
The alternating-pair pattern, mostly at stride 6 (narrowing to stride 4 between 26/27 and 30/31 and again between 36/37 and 40/41), suggests type-width combinations (e.g., F32/pair, F64/pair, S32/pair, ...)

Instruction Modifier Dispatchers

Two modifier-application functions run after the main ISel selection to set type modifiers, rounding modes, and register width:

sub_B13E10 (5,792 B) -- basic modifier dispatcher:

All 21 call sites target sub_10AE640 (DAG node modifier)
Switch on BYTE1(a7) & 0x1F for modifier type
Maps modifier values 1--6 to internal codes 31--35
Secondary dispatch on (a7 >> 3) for register width encoding

sub_B157E0 (11,815 B) -- extended modifier dispatcher:

All 37 call sites target sub_10AE640
Handles texture/surface operations specially (opcode type 18)
Maps sub-opcodes (BYTE5(a7) & 0x3F) to encoding values 54--60
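
The two selector expressions above use decompiler byte macros. A sketch of the extraction, assuming a7 is a 64-bit value and that BYTE1 / BYTE5 count bytes from the least significant end (the usual reading of those macros); the function names are hypothetical and the value-to-code mappings are not reproduced.

```c
#include <assert.h>
#include <stdint.h>

/* Byte n of a 64-bit value, with byte 0 the least significant. */
static uint8_t byte_n(uint64_t v, unsigned n) {
    assert(n < 8);
    return (uint8_t)(v >> (8 * n));
}

/* BYTE1(a7) & 0x1F: modifier type selector in sub_B13E10. */
uint32_t modifier_type(uint64_t a7) {
    return byte_n(a7, 1) & 0x1Fu;
}

/* BYTE5(a7) & 0x3F: sub-opcode selector in sub_B157E0. */
uint32_t sub_opcode(uint64_t a7) {
    return byte_n(a7, 5) & 0x3Fu;
}
```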

Mercury Master Encoder -- `sub_6D9690` (94 KB)

The Mercury master encoder is one of the largest backend functions and the final instruction selection point before binary emission. It contains a massive switch on the instruction type field (read from instruction+8) covering all SASS instruction formats. While its primary role is encoding (documented in Mercury Encoder Pipeline and SASS Instruction Encoding), the switch itself performs the final opcode-to-encoding-format selection:

// Simplified encoding flow
void EncodeInstruction(context, instruction) {
    int type = *(int*)(instruction + 8);
    uint64_t base = 0x2000000000LL;     // encoding base constant

    switch (type) {
    case 61:    // FFMA with literal operand
        sub_6D9580(ctx, operand);       // encode literal
        break;
    case 455:   // complex multi-operand format
        // bit-field extraction and assembly
        break;
    // ... hundreds of cases ...
    }

    // Common tail: append operand words, commit
    sub_6D2750(ctx, word);              // append 8-byte operand word
    sub_6D28C0(ctx);                    // commit instruction record
}

Key encoding dispatch details:

Operand word type prefix in bits [31:28]: 0x1 = register, 0x5 = immediate/constant, 0x6 = control/modifier, 0x7 = literal, 0x9 = special
sub_7D6860 handles data type encoding (FP32/FP64/INT)
sub_C00BF0 provides opcode lookup from the encoding tables
Architecture-specific bits accumulated via SM 100+ extensions controlled by knob 4176

MercExpand -- Pseudo-Instruction Expansion

sub_C3CC60 (26 KB) runs as phase 118 (MercExpandInstructions) and expands Mercury pseudo-instructions into concrete SASS sequences. This is the third and final instruction selection point -- where abstract instruction forms that survived through ISel and Mercury encoding are replaced by their concrete multi-instruction implementations.

Handler	Size	Instruction class
`sub_C37A10`	16 KB	General instruction expansion (jump table, 4+ cases)
`def_C37B2E`	13 KB	Complex expansion cases (default handler, string `"EXPANDING"`)
`sub_C39B40`	10 KB	Memory operations (LDG, STG, LDS, etc.)
`sub_C3A460`	6 KB	Atomic operations
`sub_C3B560`	8 KB	Texture operations
`sub_C3BCD0`	19 KB	Control flow (branches, jumps, calls)
`sub_C3E030`	18 KB	Finalization and cleanup

The expansion creates new instruction nodes, links them into the doubly-linked list, and deletes the original pseudo-instruction. After all expansions, sub_C3E030 performs post-expansion verification. The expansion engine also uses sub_719D00 (50 KB), which builds output for expanded instructions across different operand widths (32/64/128-bit, predicate) -- four near-identical code blocks corresponding to template instantiations over operand width types.

OCG Encoding Template Lookup -- `sub_C3F490`

The OCG (Optimized Code Generation) intrinsic pipeline on SM100+ does not use the ISel mega-selector or DAG pattern matchers. Instead, the OCG router (sub_6CC690, documented in Intrinsics) assigns each instruction one of 7 internal routing values and passes it to the SASS instruction emitter sub_6CB8A0. These routing values are not Ori IR opcodes, not binary SASS opcodes, and not encoding slot indices from word_22B4B60. They are a small, closed set of keys that exist solely to select an operand gathering template inside sub_C3F490.

Routing values assigned by the OCG router

Value	Hex	Instruction class	Assigned when
70	`0x46`	Memory-ordered load/store/atomic (with barrier)	Barrier register present (`v108 != 0` in conditional paths)
243	`0xF3`	Default memory operation	Fallback for general memory ops without barrier or special fence
245	`0xF5`	Load variant (LD/LDG/LDS)	Load-type operations (from OCG load/store handler)
246	`0xF6`	Reduction/atomic default	Atomic operations and reductions
247	`0xF7`	Fenced memory operation (LDGSTS)	Operations requiring memory fence semantics
257	`0x101`	Async copy without memory order	Bulk copy ops when no barrier: `v108 == 0` selects 257, else 70
261	`0x105`	Atomic with pre-existing value read	Atomic exchange / compare-and-swap returning old value
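
For reference, the seven routing values can be written as a C enum, together with the one selection rule the table states explicitly (bulk copy: no barrier register selects 257, otherwise 70). The enumerator names and `route_bulk_copy` are invented for this sketch, and `barrier_reg` models the decompiler variable `v108` as a non-negative integer, which is an assumption.

```c
#include <assert.h>

enum RoutingValue {
    ROUTE_BARRIER_MEM = 70,    /* 0x46  */
    ROUTE_DEFAULT_MEM = 243,   /* 0xF3  */
    ROUTE_LOAD        = 245,   /* 0xF5  */
    ROUTE_REDUCTION   = 246,   /* 0xF6  */
    ROUTE_FENCED_MEM  = 247,   /* 0xF7  */
    ROUTE_ASYNC_COPY  = 257,   /* 0x101 */
    ROUTE_ATOMIC_OLD  = 261    /* 0x105 */
};

/* Bulk-copy rule from the table: 257 when no barrier register is
 * present, 70 otherwise. */
enum RoutingValue route_bulk_copy(int barrier_reg) {
    assert(barrier_reg >= 0);
    return barrier_reg == 0 ? ROUTE_ASYNC_COPY : ROUTE_BARRIER_MEM;
}
```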

How `sub_C3F490` maps routing values to encoding templates

sub_C3F490 is a pure lookup function (184 bytes) that takes a routing value plus 7 boolean modifier flags and returns a pointer to an operand gathering template in .data at 0x22B8960--0x22BB460. The function is a nested decision tree: the first level selects on the routing value, then inner branches refine the template based on the modifier flags.

sub_C3F490(routing_value, a2..a8) -> template_ptr
    a2: has pre-existing-value operand (used only by value 257)
    a3: SM generation > sm_7x (SM80+)
    a4: has predicate attachment
    a5: has scope/fence operand (SM generation > sm_8x && memory_order == 4)
    a6: (always 0 from OCG emitter, used by MercExpand callers)
    a7: (always 0 from OCG emitter, used by MercExpand callers)
    a8: (always 0 from OCG emitter, used by MercExpand callers)

The OCG emitter (sub_6CB8A0) always passes a6=a7=a8=0, which means the OCG path only reaches a subset of template leaves. The MercExpand callers (sub_C41100, sub_C40420, sub_C40B90, sub_C42330) pass all 7 flags and can reach the full template space. The returned template is a packed array: template[0] is the operand count, followed by operand slot indices that reference positions in the 39-QWORD operand buffer (v134[]). The emitter iterates over these indices, gathers the tagged operand words, builds control words from bitfields, and calls sub_9314F0 to commit the encoded instruction.
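
The template-driven gather step can be sketched as below. The element width of the template is not stated above, so the sketch assumes one byte per entry; `gather_operands` and its output-array interface are hypothetical, and the control-word construction and the sub_9314F0 commit are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPERAND_BUFFER_QWORDS 39

/* tmpl[0] is the operand count; tmpl[1..count] are slot indices into
 * the 39-QWORD operand buffer. Copies the selected words into out in
 * template order and returns the number of words gathered. */
size_t gather_operands(const uint8_t *tmpl,
                       const uint64_t buffer[OPERAND_BUFFER_QWORDS],
                       uint64_t *out, size_t out_cap) {
    size_t count = tmpl[0];
    assert(count <= out_cap);
    for (size_t i = 0; i < count; i++) {
        uint8_t slot = tmpl[1 + i];
        assert(slot < OPERAND_BUFFER_QWORDS);
        out[i] = buffer[slot];
    }
    return count;
}
```

The template therefore controls both which operand words are emitted and their order, while the buffer itself can be filled in any convenient order beforehand.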

Two additional routing values (254, 262) are handled by sub_C3F490 but are never assigned by the OCG router -- they originate exclusively from the MercExpand memory instruction handlers, where the routing value is read from the instruction's opcode field (instr[18] masked with & 0xCFFF).

Value	Hex	Origin	Instruction class
254	`0xFE`	MercExpand only	Extended memory format (operand gather mode 3)
262	`0x106`	MercExpand only	Wide memory format (operand gather mode 0, with scope/fence branches)

Template address space

The 40+ distinct templates returned by sub_C3F490 occupy a contiguous .data region:

Address range	Routing values served
`0x22B8960`--`0x22B8E60`	257 (async copy variants)
`0x22B8E60`--`0x22B9360`	70 (barrier memory variants)
`0x22B9360`--`0x22B9860`	262 (MercExpand wide memory)
`0x22B9860`--`0x22B9E60`	247, 245 (fenced / load variants)
`0x22B9E60`--`0x22BA960`	243, 246, 70 (default / reduction / barrier sub-variants)
`0x22BA960`--`0x22BB460`	Leaf templates for bare operand forms (no modifiers)

Each template is 256 bytes (0x100). For a given routing value, the modifier flags select progressively simpler templates as flags are cleared: the most complex template (all modifiers active) is reached first in the if-chain, and the simplest (no modifiers) is the final fallback.
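
The region bounds and the fixed 256-byte size make the template count easy to check: (0x22BB460 - 0x22B8960) / 0x100 = 43, consistent with the "40+" figure. The sketch below encodes that arithmetic; the function names are hypothetical, and treating 0x22BB460 as an exclusive end address is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define TEMPLATE_BASE 0x22B8960u
#define TEMPLATE_END  0x22BB460u   /* assumed exclusive */
#define TEMPLATE_SIZE 0x100u

/* Number of 256-byte templates that fit in the region. */
uint32_t template_count(void) {
    assert(TEMPLATE_END > TEMPLATE_BASE);
    return (TEMPLATE_END - TEMPLATE_BASE) / TEMPLATE_SIZE;
}

/* Zero-based index of the template starting at addr, or -1 if addr is
 * outside the region or not on a 256-byte template boundary. */
int template_index(uint32_t addr) {
    if (addr < TEMPLATE_BASE || addr >= TEMPLATE_END)
        return -1;
    if ((addr - TEMPLATE_BASE) % TEMPLATE_SIZE != 0)
        return -1;
    return (int)((addr - TEMPLATE_BASE) / TEMPLATE_SIZE);
}
```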

Addressing Mode Selection

Addressing mode selection is distributed across Phases 1 and 2. During Phase 1, the operand processing function sub_6273E0 (44 KB) classifies PTX operand forms into internal categories. During Phase 2, the ISel driver and Mercury encoder select the optimal SASS addressing mode based on the register-allocated operand forms.

PTX addressing modes and their SASS encodings:

PTX syntax	Addressing mode	SASS instruction	Encoding
`[%rd1]`	Register indirect	`LDG.E R0, [R2]`	Register + zero offset
`[%rd1+16]`	Register + offset	`LDG.E R0, [R2+0x10]`	Register + immediate offset
`c[2][0x100]`	Constant bank	`LDC R0, c[0x2][0x100]`	Bank index + offset
`[%rd1], %r2`	Base + data (store)	`STG.E [R2], R4`	Separate base/data registers

String references in sub_6273E0 show that it also handles more complex addressing cases:

".nv.reservedSmem.offset0" -- reserved shared memory region
"COARSEOFFSET" -- coarse-grained offset computation for large address spaces
"__$endLabel$__%s" -- label generation for structured control flow

The ISel mega-selector (sub_C0EB10) references "__nv_reservedSMEM_offset_0_alias" for shared memory alias resolution during final encoding.

Vtable Dispatcher Zone -- `0xAF0000`--`0xB10000`

The range 0xAF0000--0xB10000 contains approximately 2,735 tiny vtable method implementations (average 160 bytes) that form the instruction encoding hierarchy. These implement polymorphic instruction property queries:

// Typical vtable method (sub_AFXXXX, ~160 bytes)
int64_t get_property(int64_t a1, unsigned int a2) {
    if (a2 <= N)
        return (unsigned int)dword_XXXXXXX[a2];  // table lookup
    return default_value;
}

Each function maps a small integer index to an encoding constant, answering questions like "what is the register class for operand N of this instruction?" The 0xAF0000--0xB00000 sub-range has 1,269 functions (all under 200 bytes), while 0xB00000--0xB10000 has 1,466 with slightly more complex logic (13 exceeding 1 KB).
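A concrete instance of the pattern above, with placeholder data, might look like the following. The table contents, the bound, and the default value are invented for illustration; the real methods read from per-class dword tables in the binary whose values are not reproduced here.

```c
#include <stdint.h>

/* HYPOTHETICAL per-operand encoding constants for one instruction class. */
static const uint32_t kOperandClass[] = {2, 2, 3, 1};

enum {
    MAX_INDEX     = 3, /* last valid index into kOperandClass */
    DEFAULT_CLASS = 0  /* returned for out-of-range queries */
};

/* Same shape as the decompiled methods: a bounds check, then a table lookup.
 * `self` is the object pointer and is unused by this simple variant. */
int64_t get_property(int64_t self, unsigned int index) {
    (void)self;
    if (index <= MAX_INDEX)
        return (int64_t)kOperandClass[index];
    return DEFAULT_CLASS;
}
```

Because the index is unsigned, a single upper-bound comparison is enough to reject every out-of-range value.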

Comparison with LLVM ISel

Aspect	LLVM	ptxas
ISel framework	SelectionDAG or GlobalISel (single pass)	Two-phase: MercConverter (phase 5) + ISel driver (phase 112+)
Pattern specification	TableGen `.td` files, machine-generated	Handwritten C++ (~801 matcher functions)
Pattern count	Target-dependent (thousands for x86)	~801 DAG matchers + 185 KB mega-selector
Architecture dispatch	Subtarget feature bits	4 architecture dispatch tables + vtable overrides
Intermediate form	MachineInstr (already selected)	Ori IR (SASS opcodes after phase 5, not yet encoded)
Encoding	MCInst emission (separate pass)	Integrated: ISel + Mercury encode in same pipeline
Expansion	Pseudo-instruction expansion in AsmPrinter	MercExpand (phase 118, post-ISel)
Optimization post-ISel	MachineFunction passes	Phases 14--111 (full optimizer runs between Phase 1 and Phase 2)

The key architectural difference: LLVM performs instruction selection once, and all subsequent optimization operates on already-selected machine instructions. ptxas selects SASS opcodes early (phase 5) so that the optimizer can reason about SASS-level semantics, then runs a second selection and encoding pass once optimization is complete. This two-phase design gives the optimizer accurate cost models (it sees real SASS opcodes, not abstract PTX operations) at the cost of added architectural complexity.

Function Map

Address	Size	Identity	Confidence
`sub_C0EB10`	185 KB	ISel mega-selector (719 locals, dual 169-case switch, SM-generation dispatch)	HIGH
`sub_6D9690`	94 KB	Mercury master encoder (instruction type switch)	VERY HIGH
`sub_9F1A90`	35 KB	MercConverter main instruction conversion pass	HIGH
`sub_9EF5E0`	27 KB	Post-MercConverter lowering (`"CONVERTING"`)	HIGH
`sub_C3CC60`	26 KB	MercExpand::run (pseudo-instruction expansion)	HIGH
`sub_9ED2D0`	25 KB	MercConverter opcode dispatch (master switch, `& 0xCF` mask)	HIGH
`sub_9E6600`	25 KB	Instruction expansion (64-bit split)	HIGH
`sub_9EC340`	23 KB	Multi-operand instruction legalization	MEDIUM
`sub_B0BE00`	19 KB	Opcode variant selector (class 194, 100+ cases)	HIGH
`sub_C3BCD0`	19 KB	MercExpand::expandControlFlow	MEDIUM
`sub_9D76D0`	18 KB	Memory instruction legalization (load/store)	HIGH
`sub_C3E030`	18 KB	MercExpand::finalizeExpansion	MEDIUM
`sub_9D80E0`	17 KB	Memory instruction legalization (variant)	HIGH
`sub_9E8B20`	17 KB	Texture/surface lowering	MEDIUM
`sub_C37A10`	16 KB	MercExpand::expandInstruction (jump table)	HIGH
`sub_B128E0`--`sub_B12920`	15 KB x4	Architecture dispatch tables (4 SM families)	HIGH
`sub_B1FA20`	13 KB	SASS 3-operand builder (variant A)	HIGH
`sub_B1D670`	13 KB	Post-ISel instruction modifier	HIGH
`def_C37B2E`	13 KB	MercExpand complex cases (`"EXPANDING"`)	HIGH
`sub_B157E0`	12 KB	Extended modifier dispatcher (37 callees)	HIGH
`sub_B20E00`	11 KB	SASS 3-operand builder (variant B)	HIGH
`sub_C39B40`	10 KB	MercExpand::expandMemoryOp	MEDIUM
`sub_9DA100`	9 KB	Arithmetic operation handler (case 6)	HIGH
`sub_B285D0`	9 KB	ISel lowering driver (66 callees)	HIGH
`sub_B241A0`	7 KB	SASS instruction property setter	HIGH
`sub_9F3340`	7 KB	MercConverter orchestrator (`"After MercConverter"`)	HIGH
`sub_C3A460`	6 KB	MercExpand::expandAtomicOp	MEDIUM
`sub_B13E10`	6 KB	Basic modifier dispatcher (21 callees)	HIGH
`sub_B0AA70`	5 KB	Opcode variant selector (class 306)	HIGH
`sub_9DA5C0`	2 KB	Opcode class 1 handler	MEDIUM
`sub_13AF3D0`	137 KB	SM120 ISel pattern coordinator (130-case switch, 83x `sub_13A6280`, opcodes 2--352)	HIGH
`sub_A29220`	~17 KB	SM120 instruction iterator (calls `sub_13AF3D0` per instruction)	HIGH
`sub_13A6280`	~10 KB	Operand emitter (type-tag dispatch, register class 6 fast-path)	HIGH
`sub_13A4DA0`	~7 KB	Commutative operand position selector (4-entry pattern table)	HIGH
`sub_13A7410`	--	Destination operand emitter (with register class 6 check)	MEDIUM
`sub_13A6AE0`	--	Pre-dispatch source emitter (predicated instruction operands)	MEDIUM
`sub_13A6F90`	--	Extended destination emitter (3rd variant, class 6 check)	MEDIUM
`sub_13A6790`	--	Fenced memory operand emitter	MEDIUM
`sub_13A45E0`	--	Extra operand emitter (wide instruction operands 6/7)	MEDIUM
`sub_13A5ED0`	--	Modifier flag emitter (operands with 0x18000000 bits)	MEDIUM
`sub_13A48E0`	--	Operand property extractor (for `sub_13A4DA0` matching)	MEDIUM
`sub_10AE5C0`	tiny	DAGNode_ReadField (field_id to value, delegates to `sub_10D5E60`)	VERY HIGH
`sub_10AE590`	tiny	DAGNode_WriteField (single field write)	VERY HIGH
`sub_10AE640`	tiny	DAGNode_WriteFields (multi-field update)	VERY HIGH
`sub_B28F30`	tiny	GetOperand (index into 32-byte operand array at `*(node+32)`)	VERY HIGH
`sub_B28F40`	tiny	GetResultCount (`node[40] + 1 - node[92]`)	VERY HIGH
`sub_B28F50`	tiny	GetSourceCount (`*(node+92)`)	VERY HIGH
`sub_B28E00`	tiny	DecodeRegClass (identity function, class is plain int)	VERY HIGH
`sub_B28E10`	tiny	isGPR operand predicate (`tag == 2`)	VERY HIGH
`sub_B28E20`	tiny	isImmediate operand predicate (`tag == 1`)	VERY HIGH
`sub_B28E40`	tiny	isValidReg operand predicate (`tag == 10`)	VERY HIGH
`sub_B28E80`	tiny	isPredicate operand predicate (`tag == 3`)	VERY HIGH
`sub_B28E90`	tiny	isUniformReg operand predicate (`tag == 15`)	VERY HIGH
`sub_B28F60`--`sub_B74C60`	~1.3 MB	~801 DAG pattern matchers (priority 2--34, template 1--152)	HIGH
`sub_C01840`	--	Mega-selector source operand marshaller (52 calls from mega-selector)	HIGH
`sub_C01F50`	--	Mega-selector destination operand marshaller	HIGH
`sub_C00EA0`	--	Single operand extractor (returns tagged operand word)	HIGH
`sub_BFFD60`	--	Operand reference resolver (register ref to encoding word)	HIGH
`sub_C06E90`	--	Symbol/special-register lookup for shared memory	HIGH
`sub_C07690`	--	Immediate-operand encoding helper	MEDIUM
`sub_C0B2C0`	--	Extended memory/warp operation encoder	HIGH
`sub_C05CC0`	--	Immediate operation encoder (flag-dependent path)	MEDIUM
`sub_BFEBF0`	tiny	Default vtable[2] stub (opcode translator, no-op identity)	VERY HIGH
`sub_BFEAA0`	tiny	Default vtable[12] stub (capability check, always false)	VERY HIGH
`sub_BFEA30`	tiny	Default vtable[3] stub (extension handler, no-op)	VERY HIGH
`sub_BFEF10`	--	Register bank capacity check / grow	MEDIUM
`word_22B4B60`	--	Static opcode-to-encoding-index table (`uint16[222]`, default backend)	VERY HIGH
`sub_C3F490`	184 B	OCG encoding template lookup (routing value + 7 flags -> template ptr)	VERY HIGH
`sub_6CB8A0`	--	OCG SASS instruction emitter (calls `sub_C3F490` then `sub_9314F0`)	HIGH
`sub_C41100`	--	MercExpand memory encoder (calls `sub_C3F490` with full flag set)	HIGH
`sub_C40420`	--	MercExpand memory encoder variant (calls `sub_C3F490`)	HIGH
`sub_C40B90`	--	MercExpand memory encoder variant (calls `sub_C3F490`)	HIGH
`sub_C42330`	--	MercExpand memory encoder variant (calls `sub_C3F490`)	HIGH
`unk_22B8960`--`unk_22BB460`	~11 KB	Operand gathering templates (40+ entries, 256 B each)	HIGH
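The tiny DAG node accessors and operand predicates listed in the map can be summarized in a short sketch. The tag values and the result-count formula come from the table; everything else is an assumption made for illustration: the struct layouts, the position of the tag inside the 32-byte operand record, the reading of node[40] as the highest operand index, and all names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Operand type tags, as listed in the function map. */
enum {
    TAG_IMMEDIATE   = 1,
    TAG_GPR         = 2,
    TAG_PREDICATE   = 3,
    TAG_VALID_REG   = 10,
    TAG_UNIFORM_REG = 15
};

/* 32-byte operand record. ASSUMPTION: the tag is the first field. */
typedef struct {
    uint32_t tag;
    uint8_t  rest[28];
} Operand;
_Static_assert(sizeof(Operand) == 32, "operand records are 32 bytes");

/* Simplified node with named fields standing in for the raw offsets. */
typedef struct {
    Operand *operands;     /* stands in for *(node+32) */
    int32_t  operand_hi;   /* stands in for node[40] (assumed highest index) */
    int32_t  source_count; /* stands in for node[92] */
} Node;

const Operand *get_operand(const Node *n, int i) { return &n->operands[i]; }
int get_source_count(const Node *n) { return n->source_count; }

/* Mirrors node[40] + 1 - node[92]. */
int get_result_count(const Node *n) {
    return n->operand_hi + 1 - n->source_count;
}

bool is_immediate(const Operand *op)   { return op->tag == TAG_IMMEDIATE; }
bool is_gpr(const Operand *op)         { return op->tag == TAG_GPR; }
bool is_predicate(const Operand *op)   { return op->tag == TAG_PREDICATE; }
bool is_valid_reg(const Operand *op)   { return op->tag == TAG_VALID_REG; }
bool is_uniform_reg(const Operand *op) { return op->tag == TAG_UNIFORM_REG; }
```

Under these assumptions, a node whose highest operand index is 2 and which has 2 sources reports 1 result.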

Cross-References

PTX-to-Ori Lowering -- Phase 1 context: bridge phases, MercConverter call chain
Code Generation Overview -- ISel within the codegen pipeline
SASS Instruction Encoding -- bit-level encoding format, operand encoders
Mercury Encoder Pipeline -- Mercury master encoder, MercExpand
Peephole Optimization -- post-ISel pattern rewrites (3 mega-dispatchers)
Newton-Raphson Templates -- DDIV/DRCP/DSQRT expansion sequences
Intrinsics: OCG Lowering Pipeline -- OCG router that assigns routing values, operand buffer layout
Ori IR -- instruction format, opcode field layout
SASS Opcodes -- target instruction set
