Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Late Expansion & Legalization

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas pipeline contains six legalization passes spread across the 159-phase sequence. Their collective job is to replace Ori IR operations that the target SM cannot execute natively with equivalent sequences of legal instructions. "Unsupported ops" means exactly this: operations that exist in the PTX ISA or internal Ori representation but have no single-instruction mapping on the compilation target. The replacement may be an inline expansion (a sequence of simpler instructions), a call to a libdevice helper function, or an SM-specific intrinsic sequence.

The six passes run at deliberately different pipeline positions because each intervening group of optimization passes can expose new unsupported operations or create new legalization opportunities.

Passes covered6 (phases 5, 45, 55, 78, 93, 137)
CategoryLowering
Backend dispatchArchitecture-specific via two backend objects at context+0x630 and context+0x640
Libdevice functions608 helper functions registered at sub_5D1660 (9,728-byte table from unk_1D4D940)
Legalization flagSetAfterLegalization (phase 95) marks the point past which no unsupported ops should remain
Update passUpdateAfterConvertUnsupportedOps (phase 132, factory 8) rebuilds IR metadata after late expansion
Knob gatesKnob 499 (ConvertUnsupportedOps, LateExpansionUnsupportedOps), knob 487 (LateExpansion, SetAfterLegalization, LateExpansionUnsupportedOps), knob 214 / 464 (LateExpansionUnsupportedOps inner loop)

Why Six Passes

A monolithic legalize-everything pass early in the pipeline would cripple optimization. Many optimizations (CSE, LICM, strength reduction, predication) work on high-level operation semantics. If div.rn.f64 were expanded into a 30-instruction Newton-Raphson sequence at phase 5, loop-invariant code motion at phase 35 would see 30 independent instructions instead of one hoistable division. Conversely, some unsupported operations only appear after optimization passes transform the IR: predication (phase 63) can create new predicated ops that need legalization, GMMA fixup (phase 87) can introduce new WGMMA-related sequences, and conditional flow merging (phases 133/136) can expose operations that were previously dead.

The six passes form a progressive legalization strategy:

PhaseNamePipeline PositionPurpose
5ConvertUnsupportedOpsBefore optimization (stage 1)Early legalization of obviously unsupported ops; preserves optimization opportunities for everything else
45MidExpansionAfter early/mid optimization (stage 3)Target-dependent expansion after loop unrolling, strength reduction, and GVN have run
55LateExpansionAfter high-level optimizations (stage 4)Expansion of ops that optimization passes should see in unexpanded form
78LateExpansionUnsupportedOpsAfter all optimization (stage 5)Catches remaining unsupported ops after predication, rematerialization, and uniform conversion
93LateExpansionUnsupportedOps2After GMMA/attr passes (stage 5)Second catch -- handles ops exposed by GMMA propagation, GMMA fixup, and register attribute setting
137LateExpansionUnsupportedOpsMidAfter late merge (stage 10)Final catch between the two conditional flow merge passes

Architecture Backend Dispatch

None of the six passes contain legalization logic directly. Each is a thin dispatcher that forwards to a virtual method on one of two architecture backend objects stored in the compilation context. The backend objects are constructed per-SM-target and provide the actual SM-specific legalization implementations.

Two backend objects:

Context OffsetUsed ByRole
context+0x640ConvertUnsupportedOps, LateExpansionOuter backend -- wraps an inner object at +0x10, provides two-level dispatch
context+0x630MidExpansion, LateExpansionUnsupportedOps, LateExpansionUnsupportedOps2, LateExpansionUnsupportedOpsMid, SetAfterLegalizationSM backend -- single-level dispatch through vtable

The two-level dispatch through context+0x640 allows the outer backend to override the entire legalization strategy (by replacing vtable slot 0), while the inner object provides the SM-specific implementation when the outer backend delegates. This separation exists because ConvertUnsupportedOps and LateExpansion may need to coordinate with higher-level compilation modes (e.g., library compilation, OptiX IR) that wrap the SM backend.

Backend Vtable Slots

The SM backend at context+0x630 dispatches legalization through these vtable offsets:

Vtable OffsetDecimalCalled By
+0xB0176MidExpansion
+0xD8216LateExpansionUnsupportedOps2
+0x108264SetAfterLegalization
+0x178376LateExpansionUnsupportedOps
+0x180384LateExpansionUnsupportedOpsMid

The outer backend at context+0x640 dispatches:

Vtable OffsetDecimalCalled By
+0x000ConvertUnsupportedOps (type check -- compared against sub_661280)
+0x78120ConvertUnsupportedOps (delegated to inner object)
+0x5888LateExpansion (type check -- compared against sub_6612E0)
inner +0xE0224LateExpansion (delegated to inner object)

Pass Details

Phase 5 -- ConvertUnsupportedOps

Factory index:  5
Vtable:         off_22BD690
execute():      sub_C60A20  (thunk -> context+0x640 dispatch)
isNoOp():       sub_C5F610  (returns 0 -- always runs)
Flag side-effect: sets context+1378 bit 0 (isConvertUnsupportedDone)
Knob gate:      499 (checked via sub_7DDB50)
Pipeline:       Bracketed by AdvancedPhaseBeforeConvUnSup (4) and AdvancedPhaseAfterConvUnSup (7)

This is the earliest legalization pass, running at phase 5 before any optimization. It converts operations that are clearly illegal on the target SM into equivalent sequences. The pass always runs (isNoOp = false) and is unconditional -- every compilation executes it.

Dispatch mechanism. The execute function (sub_C60A20) reads the backend at context+0x640, checks whether vtable slot 0 is the default implementation (sub_661280), and either calls the overridden method directly or unwraps to the inner object at backend+0x10 and calls vtable offset +0x78 (120). This two-level indirection allows library-mode and OptiX-mode compilation to inject custom legalization logic.

Flag effect. After execution, the pass sets bit 0 of context+1378, signaling to downstream passes that early legalization has completed. Passes like OriCreateMacroInsts (phase 8) check this flag to know whether certain patterns have already been lowered.

What gets legalized early: Operations that cannot survive optimization in their original form. Examples include operations that reference address spaces not supported on the target, certain modifier combinations that have no encoding, and PTX instructions that are syntactically valid but architecturally illegal (e.g., atom.add.f64 on targets without native FP64 atomics).

Phase 45 -- MidExpansion

Factory index:  51
Vtable:         off_22BDDC0
execute():      sub_C5EFB0  (thunk -> context+0x630 vtable+0xB0)
isNoOp():       sub_C5EFD0  (returns 0 -- always runs)
Field side-effect: sets context+1552 = 3
Pipeline:       After ExpandMbarrier (42), ForwardProgress (43), OptimizeUniformAtomic (44)
                Before GeneralOptimizeMid2 (46)

MidExpansion runs after the CTA/mbarrier/barrier expansion passes and before the second mid-level GeneralOptimize bundle. It handles target-dependent expansions that must occur after barrier-related lowering but before the mid-level optimization cleanup.

Dispatch. Dispatches directly through the SM backend vtable at offset +0xB0 (176). No two-level indirection -- the SM backend provides the implementation directly.

Side effect. Sets context+1552 to 3. This field is the pipeline progress counter (not exclusively a legalization counter -- see Context Fields below) and is read by subsequent passes to determine which pipeline stages have completed. The value 3 indicates "mid-expansion complete."

Phase 55 -- LateExpansion

Factory index:  63
Vtable:         off_22BDFA0
execute():      sub_C60AA0  (thunk -> context+0x640 dispatch)
isNoOp():       sub_C5EE20  (returns 0 -- always runs)
Field side-effect: sets context+1552 = 7 (via inner dispatch)
Pipeline:       After OriDoRematEarly (54), before SpeculativeHoistComInsts (56)
                Followed by GeneralOptimizeLate (58)

LateExpansion is the primary post-optimization legalization pass. It runs after all high-level optimizations (loop unrolling, strength reduction, GVN-CSE, reassociation, predication setup) have completed, expanding operations that were deliberately kept in high-level form for those passes.

Dispatch. Uses the outer backend at context+0x640. Checks vtable slot +0x58 (88) against the default (sub_6612E0). If overridden, calls the override. Otherwise, calls the inner object's vtable at +0xE0 (224) and then sets context+1552 = 7, advancing the pipeline progress counter.

What gets expanded here: This is the pass where most math library calls are introduced. Operations like div.rn.f64, sqrt.rn.f32, rcp.rd.f64 that were kept as single Ori instructions through optimization are now replaced with Newton-Raphson sequences or calls to the 608-function libdevice library. The SM20 library functions (division, square root, reciprocal, bit-field extract/insert) and SM70 functions (WMMA matrix operations, barrier reductions) are the primary candidates.

Optimization interaction. GeneralOptimizeLate (phase 58) runs immediately after, cleaning up the expanded sequences with copy propagation, constant folding, and dead code elimination. This is why expansion happens here rather than later -- the expanded code benefits from one more optimization round.

Phase 78 -- LateExpansionUnsupportedOps

Factory index:  90
Vtable:         off_22BE3D8
execute():      sub_C5EA50  (thunk -> context+0x630 vtable+0x178)
isNoOp():       sub_C5EA70  (returns 0 -- always runs)
Knob gate:      499 (via sub_7DDB50), plus flag check: context+1414 bit 2
Pipeline:       After AdvancedPhaseLateConvUnSup (77), before OriHoistInvariantsLate2 (79)

The first of three "late unsupported ops" catches. It runs after all optimizations have completed (phases 13-76) and catches operations that optimization passes themselves introduced or exposed.

Gating. This pass has the most complex gating of the six. In addition to the standard knob 499 check (via sub_7DDB50), it also checks bit 2 of context+1414. If the bit is clear, the pass is skipped even though isNoOp returns false. This allows the backend to dynamically disable the pass when no unsupported ops were detected during earlier compilation phases.

Implementation. When active, calls sub_7917F0 which:

  1. Checks context+1382 bit 2 (another prerequisite flag)
  2. Checks knob 214 (via the capability dispatch at context+1664)
  3. If the function table at context+0 + 1056 is not yet initialized, calls the expansion setup functions (sub_785E20, sub_781F80, sub_7E6090, sub_7E6AD0)
  4. Iterates over basic blocks, applying per-instruction legalization with convergence check (knob 464 gates the inner loop)

This iterative structure -- expand, check if more work needed, repeat -- handles cascading expansions where expanding one operation exposes another unsupported operation.

Phase 93 -- LateExpansionUnsupportedOps2

Factory index:  109
Vtable:         off_22BE6D0
execute():      sub_C5E790  (thunk -> context+0x630 vtable+0xD8)
isNoOp():       sub_C5E7B0  (returns 0 -- always runs)
Pipeline:       After AdvancedPhaseAfterSetRegAttr (92), before FinalInspectionPass (94)

The second late catch, positioned after the GMMA/WGMMA passes (85-87), register attribute setting (90), and texture dependency analysis (91). These intervening passes can introduce new operations that need legalization:

  • GMMA propagation (phase 85) may introduce WGMMA accumulator movement operations
  • GMMA sequence fixup (phase 87) may insert hardware ordering instructions
  • Register attribute setting (phase 90) may expose operations that become illegal once register classes are assigned

Dispatch. Uses the SM backend vtable at offset +0xD8 (216). The dispatch is architecture-dependent: the execute function reads vtable slot 12 (backend[12]), compares against a default implementation (sub_661310), and either calls the override or falls through to a two-step sequence that calls methods at offsets 280 and 3088 on an inner object.

Phase 137 -- LateExpansionUnsupportedOpsMid

Factory index:  93
Vtable:         off_22BE450
execute():      sub_C607E0  (thunk -> context+0x630 vtable+0x180)
isNoOp():       sub_C5EA00  (returns 0 -- always runs)
Default check:  compares vtable+0x180 against sub_7D6D50 -- if default, entire pass is no-op
Pipeline:       After LateMergeEquivalentConditionalFlow (136), before OriSplitHighPressureLiveRanges (138)

The final legalization catch, positioned between the two conditional flow merge passes (133, 136) and the last-resort live range splitter (138). The merge passes can combine basic blocks in ways that create new instruction sequences containing unsupported operations.

Conditional execution. Unlike the other five passes, this one has a soft no-op mechanism: the execute function reads vtable slot +0x180 (384) and compares the function pointer against the default implementation (sub_7D6D50). If the backend has not overridden this slot, the pass returns immediately without doing any work. This means the pass is truly active only on SM targets that define a LateExpansionUnsupportedOpsMid handler -- typically newer architectures (Hopper/Blackwell) that have more complex merge and expansion interactions.

Supporting Passes

Phase 95 -- SetAfterLegalization

Factory index:  111
Vtable:         off_22BE720
execute():      sub_C5F8A0
isNoOp():       sub_C5E9C0  (returns 0 -- always runs)
Pipeline:       After FinalInspectionPass (94), before ReportBeforeScheduling (96)

Not a legalization pass per se. It marks the compilation context as post-legalization by calling the SM backend's vtable at offset +0x108 (264). This sets the legalization_complete flag that downstream passes (scheduling, register allocation, encoding) check to assert that no unsupported operations remain. The pass is gated by optimization level: sub_7DDB50 returns the current optimization level, and the dispatch only fires at -O2 and above.

Phase 132 -- UpdateAfterConvertUnsupportedOps

Factory index:  8
Vtable:         off_22BD708
execute():      sub_C5F570  (rep ret -- NOP)
isNoOp():       sub_C5F590  (returns 1 -- skipped by default)
Pipeline:       First pass in Stage 10

A placeholder update pass that rebuilds IR metadata after late unsupported-op conversion. Its execute() is a NOP (rep ret) and isNoOp() returns 1 (true), so it is skipped by default. Architecture backends can override the vtable to activate it when late expansion produces structural changes requiring metadata rebuild.

Libdevice Function Library

The legalization passes replace unsupported operations with calls to a library of 608 predefined helper functions. These are not external libraries -- they are PTX function bodies embedded in the ptxas binary itself, compiled and linked into the output at need.

The function table is initialized by sub_5D1660, which copies a 9,728-byte pre-built table from unk_1D4D940 and registers 608 function names in a hash map for lookup.

Library Function Categories

SM PrefixCountOperations
__cuda_sm20_70Division (f32/f64, all rounding modes), reciprocal (f32/f64, all rounding modes), square root (f32/f64), double-precision reciprocal sqrt, bit-field extract/insert 64-bit, integer division/remainder (s16/s64/u16/u64)
__cuda_sm3x_4FP32 division with FTZ variants (Kepler-specific paths)
__cuda_sm62_2DP2A, DP4A dot-product accumulate (pre-Volta emulation)
__cuda_sm70_397Barrier operations (arrive/red/wait with 0-15 barrier IDs and count variants), WMMA matrix operations (204 variants for different shapes/types), warp shuffle sync, warp vote sync, match sync
__cuda_sm80_3Cache policy creation (fractional, range encode)
__cuda_sm1xx_18Bulk copy (unicast/multicast), async bulk tensor copy (1D-5D tile/im2col, unicast/multicast)
__cuda_sm10x_16TCGen05 guardrail traps (bounds check, alignment, allocation), tcgen05 MMA operations, mask creation
__cuda_scalar_video_emulation_7Video instruction emulation (operand extract, sign extend, saturate, merge)
__cuda_reduxsync_18Redux-sync reductions (and/or/xor for b32, add/max/min for s32/u32/f32 with NaN/abs variants)
__cuda_sanitizer_6Memory sanitizer checks (malloc/free/generic/global/local/shared/metadata)
Other~67Miscellaneous: dummy entries, user-function stubs, device synchronize

SM-Dependent Legalization Examples

The core design principle: what is "unsupported" depends entirely on the target SM. An operation legal on one architecture may require library expansion on another.

Integer division/remainder. PTX div.s64 and rem.u64 have no single SASS instruction on any SM. They are always expanded to multi-instruction sequences via __cuda_sm20_div_s64, __cuda_sm20_rem_u64, etc. These are "sm20" functions because the expansion has been the same since Fermi.

FP32 division with rounding. div.rn.f32 on Turing (sm_75) uses a hardware-assisted Newton-Raphson (MUFU.RCP + refinement). On Kepler (sm_3x, no longer shipped but the code path remains), different refinement sequences are needed, using __cuda_sm3x_div_rn_ftz_f32 and its slowpath variant.

Barrier operations. On Volta+ (sm_70), barrier.arrive with a specific barrier ID and thread count is a single SASS instruction (BAR.ARV). On pre-Volta targets, these must be emulated with the 397 __cuda_sm70_barrier_* library functions that implement the semantic equivalent using older synchronization primitives.

WMMA/Tensor Core. Warp-level matrix multiply-accumulate (wmma.*) on sm_70 has dedicated hardware instructions (HMMA). The 204 __cuda_sm70_wmma_* variants cover the combinatorial explosion of shapes (m16n16k16, m8n32k16, m32n8k16), types (f16, bf16, tf32, s8, u8, s4, u4, b1), layouts (row/col), and accumulator types.

DP2A/DP4A. The integer dot-product-accumulate instructions have native hardware support starting at sm_61. On sm_62 (Xavier), they use __cuda_sm62_dp2a and __cuda_sm62_dp4a emulation routines.

Bulk tensor copy (Blackwell). The cp.async.bulk.tensor family on sm_100+ (Blackwell) supports 1D through 5D tile and im2col access patterns, with unicast and multicast variants. These 18 __cuda_sm1xx_cp_async_bulk_tensor_* functions provide the expansion for targets where hardware support is partial or absent.

TCGen05 guardrails (Blackwell). The 5th-generation tensor core operations (sm_100+) include runtime guardrail traps -- bounds checking, alignment validation, allocation granularity checks -- implemented as __cuda_sm10x_tcgen05_guardrail_trap_* functions inserted during legalization.

Context Fields

The legalization passes interact with several fields on the compilation context:

OffsetTypeDescription
+0x630void*SM backend object (main legalization dispatch target)
+0x640void*Outer backend object (wraps SM backend, used by ConvertUnsupportedOps and LateExpansion)
+1378byteBit 0: ConvertUnsupportedOps has run
+1382byteBit 2: prerequisite flag for LateExpansionUnsupportedOps
+1414byteBit 2: enable flag for LateExpansionUnsupportedOps
+1552int32Pipeline progress counter -- written by multiple passes across legalization, optimization, and post-RA stages (see value table below)
+1664void*Capability dispatch object (knob/option queries)

The pipeline progress counter at context+1552 provides a monotonically increasing value that downstream passes can check to determine which pipeline stages have completed. Despite being documented previously as a "legalization stage counter," it is written by passes outside the legalization family (rematerialization, backward copy propagation, architecture-specific peephole, post-RA finalization):

ValueWriterPhaseFunction
0Context constructor--sub_7F7DC0
3MidExpansion45sub_C5EF80
4OriDoRematEarly54sub_C5EF30
7LateExpansion55sub_6612E0
8Peephole/ISel refinement (arch-specific)variessub_849C60
9OriBackCopyPropagate83sub_C5EB80
10PostRAFinalizer (arch-specific)variessub_88E9D0
12SetAfterLegalization95sub_C5E980

Downstream passes compare against these thresholds: sub_A11060 checks > 4 to enable cross-block rematerialization; sub_752CF0 checks <= 3; sub_766520 checks <= 11; sub_781F80 checks <= 12; sub_78B8D0 checks > 18.

Pipeline Position Summary

Phase 0-4:   Initial setup, FP16 promotion, CFG analysis
Phase 5:     ConvertUnsupportedOps          <-- LEGALIZATION #1
Phase 6-44:  Optimization passes (branch, loop, strength reduction, GVN, barrier expansion)
Phase 45:    MidExpansion                    <-- LEGALIZATION #2
Phase 46-54: Mid/late optimization (GVN-CSE, reassociation, predication setup, remat)
Phase 55:    LateExpansion                   <-- LEGALIZATION #3
Phase 56-77: Late optimization (predication, commoning, LICM, remat, sync, phi destruction, uniform)
Phase 78:    LateExpansionUnsupportedOps     <-- LEGALIZATION #4
Phase 79-92: Post-opt (LICM, arch opt, back copy prop, GMMA, reg attrs)
Phase 93:    LateExpansionUnsupportedOps2    <-- LEGALIZATION #5
Phase 94:    FinalInspectionPass
Phase 95:    SetAfterLegalization (marks legalization complete)
Phase 96-136: Scheduling, RA, Mercury, post-RA, late merge
Phase 137:   LateExpansionUnsupportedOpsMid  <-- LEGALIZATION #6
Phase 138:   OriSplitHighPressureLiveRanges

Key Functions

AddressSizeRole
sub_C60A20~40BConvertUnsupportedOps execute dispatcher
sub_C5EFB0~16BMidExpansion execute dispatcher
sub_C60AA0~50BLateExpansion execute dispatcher
sub_C5EA50~16BLateExpansionUnsupportedOps execute dispatcher
sub_C607E0~30BLateExpansionUnsupportedOpsMid execute dispatcher
sub_C5E790~16BLateExpansionUnsupportedOps2 execute dispatcher
sub_C5F8A0~30BSetAfterLegalization execute
sub_7DDB50232BOptimization level gate (knob 499 check)
sub_7917F0~400BLateExpansionUnsupportedOps core implementation
sub_9059B0~500BLateExpansion core implementation (with expansion loop)
sub_5D1660~8KBLibdevice function table initializer (608 entries)
sub_785E20--Expansion setup (function table initialization)
sub_781F80--Expansion setup (mode configuration)
sub_7E6090--Instruction expansion driver
sub_7E6AD0--Instruction expansion driver (secondary)
sub_753600--Per-instruction legalization check
sub_753B50--Retry/convergence loop for iterative expansion
sub_13AF3D026,795BOperand legalization dispatcher -- 164-case switch on opcode, called from sub_A29220
sub_13A62801,289BGeneral operand materializer -- ensures operand is in legal register (called 83x)
sub_13A6AE0~250BSpecial-class operand materializer -- handles condition code and predicate classes
sub_13A7410~50BTry-inline-then-materialize wrapper -- checks sub_822750 before falling back
sub_13A6F90~40BArch-immediate materializer -- like sub_13A7410 without pre-check
sub_13A45E0--Predicate operand materializer
sub_13A75D0--Uniform register conversion (class 6 to class 3)
sub_A29220--Pass driver that calls sub_13AF3D0 per instruction
sub_13ADB903,353BExtended operand legalization variant (arch-specific override, vtable-dispatched)

Operand Legalization Dispatcher

The SASS encoding backend cannot encode arbitrary operand forms. Before an instruction reaches the per-instruction encoder, every operand must be in a form the hardware encoding supports: a register in the correct class, an immediate that fits the bit-field width, or an absent-operand sentinel. The operand legalization dispatcher (sub_13AF3D0, 26,795 bytes) enforces these constraints. It is called once per instruction from the pass driver sub_A29220 and runs after ISel but before the SASS encoders.

Dispatcher Structure

The function reads the instruction opcode from field +72, masks off the predication flags (bits 12-13, mask & 0xCFFF), and enters a switch with 164 case labels covering Ori IR opcodes 0 through 352. Each case implements the legalization recipe for one opcode or a group of opcodes with identical operand layouts.

Before the switch, a pre-pass handles predicated instructions. If bit 12 of the opcode is set (indicating a predicate guard is present), the function first checks backend vtable slot +3232 for a custom handler. If none exists or it declines, sub_13A6AE0 is called on the predicate guard operand (at position operand_count - 2) to ensure it is in a legal register.

The switch routes to five categories of legalization logic:

Direct operand materialization. The majority of cases call sub_13A6280 on each operand that might need conversion. Example for a 3-source FMA (case 6):

sub_13A6280(context, instruction, 3, insert_point, ...)  // src0
sub_13A7410(backend, instruction, 4, 1, insert_point, ...) // src1 (try inline first)
sub_13A6280(context, instruction, 5, insert_point, ...)  // src2
// then check optional predicate operands 6,7 via sentinel test

Variable-length operand scanning. Case 16 (store) scans up to 15 operand slots, testing each against the 0x70000000 sentinel to find where active operands end before legalizing each one.

Architecture-specific delegation. Cases 70, 243, 245-247, 254-255, 257-259, 262 delegate entirely to vtable+2816. Cases 280-281 delegate to vtable+2328 with adjusted operand counts. These are SM-specific instructions (tensor core, WGMMA, bulk copy) where operand constraints vary by architecture.

Opcode rewriting. Case 137 (SM73_FIRST boundary marker, ROT13: FZ73_SVEFG; note: MOV = opcode 19) rewrites the opcode field itself: to 0x82 (130) for conditional MOV, or to 0x109 (265) for MOV-from-special-register when the source is in register class 4.

Passthrough. Cases 22, 24, 34, 38, 44, 45, 59, 73, 74, 77, 83, 106, 135, 161, 180, 182, 192, 194, 198, 209, 213-215, 221, 297, 352 and the default case require no operand legalization and exit immediately.

The 0x70000000 Null-Operand Sentinel

Each operand occupies an 8-byte slot in the instruction. The lower 4 bytes encode the operand value and type:

BitsFieldValues
[30:28]Type1=register, 2=signed immediate, 3=unsigned immediate, 5=predicate, 7=null
[23:0]PayloadRegister index or immediate value
[31]Negate1=operand is negated
+7 (byte)FlagsBit 0: uniform/constant bank reference

The sentinel value 0x70000000 encodes type 7 ("null") with zero payload and no negation. It marks operand slots that are architecturally absent -- optional predicate guards not specified, trailing source operands of variable-width instructions, or unused operand positions in instructions with fewer sources than the maximum slot count.

The dispatcher tests for the sentinel with:

if ( ((*((_DWORD *)instr + offset) ^ 0x70000000) & 0x70000000) != 0 )
    // operand is PRESENT -- legalize it

The XOR produces zero in bits [30:28] only when they are exactly 0b111 (type 7). The AND isolates those bits. If the result is zero, the operand is null and legalization is skipped. If non-zero, the operand is present and must be processed.

The function contains 59 references to 0x70000000. The heaviest user is case 16 (store), which chains 14 successive sentinel tests (at instruction offsets +84 through +196) to determine the store's vector width -- effectively implementing for each slot: if sentinel, stop; else legalize.

Operand Materialization Helpers

The dispatcher calls six helper functions depending on the operand class:

FunctionCallsRole
sub_13A628083General materializer. The core function. Checks if the operand can remain as-is (register in a legal class, or inline immediate that fits). If not, creates a MOV instruction via sub_92E800 to load the value into a fresh register, inserts it before the current instruction, and replaces the operand slot with a register reference (0x10000000 | reg_index). Short-circuits immediately for uniform registers (class 6). Uses sub_7DBC80 to test inline-immediate feasibility and sub_91D150/sub_91D160 for constant pool operations.
sub_13A741015Try-inline-then-materialize. Checks sub_822750 first ("can this immediate be encoded inline for this arch?"). If yes, keeps the immediate. If no, tries sub_822990/sub_8229D0 for extended encoding paths. Falls back to sub_13A6280 only if all inline attempts fail.
sub_13A6AE015Special-class materializer. Handles operands in non-standard register classes. For class 5 (predicate): returns immediately. For class 2 (condition code): creates a MOV with opcode 0x108. For immediates: calls sub_91D150 for constant pool lookup and replaces the operand. Used on predicate guard operands and instructions with condition-code sources.
sub_13A6F907Arch-immediate materializer. Like sub_13A7410 but skips the sub_822750 pre-check. Used for operands where inline encoding is known to be architecture-dependent (texture coordinates, barrier IDs).
sub_13A45E05Predicate materializer. Handles materialization of optional predicate operand slots, called exclusively after a sentinel test confirms the operand is present.
sub_13A75D01Uniform register conversion. Called once (case 6, FMA) to handle uniform register class 6 operands that need conversion to general-purpose class 3.

Materialization Flow (sub_13A6280 Detail)

The general materializer at sub_13A6280 (1,289 bytes) implements this decision tree for a single operand:

  1. Uniform register early exit. If the operand is a register (type 1) in class 6 (uniform), return immediately -- uniform registers are always legal in the encoding.

  2. Inline immediate check. If the operand is an immediate (type 2/3), call sub_7DBC80 to test whether the value fits in the instruction's immediate field. If it fits and passes the floating-point validity check (vtable+1504) and architecture encoding check (vtable+3248), keep the immediate as-is.

  3. Register reclassification. If the operand is a register in class 3 (general-purpose), query the architecture via vtable+1240 and vtable+904 to determine if the register should be reclassified to uniform class 6 (for data types with width <= 3 register slots).

  4. Data-type conversion. For boolean (sub_7D66E0) or floating-point (sub_7D6780) operand types, call vtable+904 to map the data type to the appropriate register class.

  5. Materialization. Call sub_92E800 to create a MOV instruction (opcode 0x82 = 130) that loads the constant/immediate into a new register. Insert it at the insertion point. Replace the operand slot: lower word becomes 0x10000000 | new_reg_index (type 1 = register), upper word is cleared to & 0xFEC00000.

  6. Insertion point update. If the insertion point a4 currently points to the instruction being legalized, advance it to the newly inserted MOV so subsequent materializations are ordered correctly.

Opcode Groups and Legalization Recipes

OpcodesInstruction ClassOperands LegalizedNotes
2-7Arithmetic (ADD/MUL/FMA)dst, src0, src1 [, src2]FMA (6) has optional predicate slots checked via sentinel
8LD (load)Variable based on addressing modeOperand count read from +80
10-11, 151-152, 290-291Compare/selectsrc0, src1Standard 2-source legalization
16ST (store)1-15 data operandsSentinel-scanned variable width
32ATOM (atomic)dst, addr, dataSpecialized register conversion
36TEX (texture)coords + handleTexture handle materialization
42, 53, 55Shift/logicsrc0 + try-inline src1sub_13A6280 + sub_13A7410
51PRMT (permute)src0, control, src1sub_13A6F90 for arch-dependent control operand
61Branch-conditionalNested switch on modifier bits6 sub-cases for different branch forms
70, 243-262Tensor/WGMMA/bulkDelegated to vtable+2816Architecture-specific
82, 166, 196FP convertsrc + try-inlinesub_13A6280 + sub_13A7410 + optional sub_13A6F90
88-89ATOMS/ATOMGLoop over sourcesPer-source legalization with count
110-121Wide arithmeticsrc0, src1, src23 consecutive sub_13A6280 calls
137MOVOpcode rewriteRewrites to 0x82 or 0x109 based on register class
230-232LD/ST extendedsrc + inline + archsub_13A6280 + sub_13A7410 + sub_13A6F90
270-289Control flow / miscVariableSeveral sub-groups with different patterns
280-281Multi-sourceDelegated to vtable+2328Operand count adjusted by -4

Architecture Override Points

The dispatcher provides three escape hatches for architecture-specific behavior:

Vtable OffsetDecimalOpcodesPurpose
+28160xB0070, 243, 245-247, 254-255, 257-259, 262Full delegation for SM-specific instructions
+23280x918280-281 (+ other cases)Multi-source instructions with adjusted operand counts
+32320xCA0Pre-switch (predicated instructions)Custom predicate guard handling

The vtable+2816 handler receives (backend, instruction, insert_point, pass_context, mode_flag) and is expected to perform complete operand legalization for the instruction. The vtable+2328 handler receives an adjusted operand count (total - 4), suggesting these instructions have 4 fixed operands plus a variable source list.

Relationship to Legalization Passes

The operand legalization dispatcher operates at a different abstraction level than the six legalization passes described above. The legalization passes (phases 5-137) operate on the Ori IR, replacing unsupported operations with sequences of supported ones. The operand legalization dispatcher operates on individual operands within already-legal instructions, ensuring each operand is in a form the SASS encoder can bit-pack into machine code.

The dispatcher runs as part of the SASS encoding pipeline (called from sub_A29220), well after all six Ori-level legalization passes have completed. It is invoked per-instruction during the encoding walk, not as a standalone pass.

Ori legalization passes (phases 5-137)
  Replace unsupported OPERATIONS with legal sequences
         |
         v
SASS operand legalization (sub_13AF3D0, during encoding)
  Ensure each OPERAND of a legal instruction is encodable
         |
         v
SASS per-instruction encoders (522 functions)
  Pack operands into binary instruction word

Cross-References