Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SASS Text Generation

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Phases 129 (DumpNVuCodeText) and 130 (DumpNVuCodeHex) convert the internal instruction stream into human-readable SASS assembly text and raw hex dumps respectively. The text output is the same format produced by cuobjdump --dump-sass and is used for --verbose output, DUMPIR diagnostics, --forcetext mode, --out-sass dumps, and the --self-check roundtrip verification pipeline. The subsystem spans two distinct address ranges: a PTX-level text generation system (580 formatter functions at 0x4DA340--0x5A8E40) and a SASS-level disassembly renderer (~123 virtual printer methods at 0x17F8000--0x181FFFF).

Pipeline phases129 (DumpNVuCodeText), 130 (DumpNVuCodeHex)
Phase categoryDebug (conditionally executed)
PTX formatter count580 functions at 0x4DA340--0x5A8E40 (~850 KB)
PTX dispatchersub_5D4190 (12.9 KB, two-level opcode dispatch)
SASS printer count~123 vtable methods at 0x17F8000--0x181FFFF
Builder/visitor vtable~520 method slots (4,160+ byte vtable)
Format string table~1.8 MB monolithic NUL-terminated string block
Temp buffer size50,000 bytes per formatter invocation
Largest formattersub_5A8E40 (wmma.load.b, 9,757 bytes)
Key helperssub_9D12F0 (operand encoder), sub_9DB7E0 (predicate printer)

Output Format

SASS text generation produces output compatible with cuobjdump --dump-sass. The format includes control information (scheduling metadata), predicate guards, opcode mnemonics, operands with modifiers, and optional annotations.

Instruction Line Format

/*ADDR*/  {CTRL} OPCODE{.MODIFIERS}  DST, SRC0{, SRC1{, SRC2}} ;  /* LINE */

Concrete examples of the format ptxas produces:

/*0000*/                   MOV R1, c[0x0][0x28] ;                /* 0x00000a0004017802 */
/*0010*/                   S2R R0, SR_CTAID.X ;                  /* 0x0000000000007919 */
/*0020*/              @P0  IMAD.MOV.U32 R4, RZ, RZ, c[0x0][0x168] ;
/*0030*/                   IMAD.MOV.U32 R5, RZ, RZ, c[0x0][0x16c] ;
/*0040*/                   ISETP.GE.AND P0, PT, R0, R2, PT ;
/*0050*/              @P0  EXIT ;
/*0060*/                   STG.E [R4.64], R0 ;
/*0070*/                   EXIT ;
/*0080*/                   BRA 0x80 ;

Control Word Format

For architectures with explicit scheduling control (SM 50--SM 70), the control word is printed in a dedicated line before each group of three instructions:

      /* 0x001c4400fe2007f6 */
/*0008*/                   MOV R1, c[0x0][0x20] ;
/*0010*/                   S2R R0, SR_TID.X ;
/*0018*/                   S2R R2, SR_CTAID.X ;

The 64-bit control word encodes scheduling data for three instructions:

FieldBitsDescription
Stall count4 bits per instructionMinimum cycles to wait before issue (0--15)
Yield hint1 bit per instructionSuggest warp scheduler switch
Write barrier3 bits per instructionDependency barrier index (0--5, 7 = none)
Read barrier3 bits per instructionRead dependency barrier mask
Wait barrier mask6 bits per instructionWhich barriers to wait on before issue

For SM 75+ architectures (Turing and later), scheduling information is embedded per-instruction rather than in grouped control words, so the text output places it differently or omits the separate control line.

Hex Dump Format (Phase 130)

Phase 130 (DumpNVuCodeHex) emits the raw encoding bytes as hex values:

/*0000*/  0x00000a0004017802
/*0008*/  0x0000000000007919
/*0010*/  0x000000ff0aff7824

Each line contains the instruction address and its encoded QWORD(s). For 128-bit instructions, two QWORDs are printed.

Architecture

The text generation subsystem has two levels: a PTX-level pretty-printer that formats instructions from the Ori IR representation, and a SASS-level disassembly renderer that decodes binary-encoded SASS instructions back to text.

Level 1: PTX Instruction Text Formatters

This is the primary text generation system. The 580 formatter functions convert internal instruction representations (accessed via the instruction object at *(a1+1096)) into PTX assembly text strings.

sub_5D4190 (12.9 KB, dispatcher)
  ├─ First: calls sub_5D1660 to initialize intrinsic ID table (608 entries)
  ├─ Registers 121 named opcodes at a1+808 via sub_426150()
  ├─ Registers ~400 hash-keyed opcodes at a1+816 via sub_426150()
  └─ Dispatches to one of 580 formatters at 0x4DA340-0x5A8E40
       └─ Each: alloc 50 KB → sprintf via format table → shrink-copy → free

The dispatcher uses a two-level dispatch strategy:

  1. Named dispatch (121 opcodes): Direct string-to-function registration for recent or complex instructions. The opcode name string (e.g., "wmma.load.a", "tcgen05.mma", "barrier.cta") is looked up in a hash map at a1+808.

  2. Hash dispatch (~400 opcodes): Numeric hash values of opcode names are used as keys in a second hash map at a1+816. The hash values are stored as decimal string representations (e.g., "2644314910", "605425506"). This covers the stable ISA core -- arithmetic, logic, loads, stores, branches, conversions.

Level 2: SASS Disassembly Renderer

The SASS printer at 0x17F8000--0x181FFFF operates on binary-encoded SASS instructions and produces text through a builder/visitor pattern. This is used for the --self-check roundtrip verification and --out-sass output.

SASS instruction (binary-encoded)
  │
  ├─ Read opcode at instruction+72, mask BYTE1 &= 0xCF
  ├─ Switch on canonical opcode ID
  │
  ├─ For each operand:
  │    └─ sub_9D12F0(output_128, ctx, instr, operand_idx, stride, mode, flag)
  │         → 64-byte operand encoding structure
  │
  ├─ Emit via builder/visitor vtable at *(a1 + 24):
  │    ├─ vtable+936:  begin_predicate_guard()
  │    ├─ vtable+3768: begin_operands()
  │    ├─ vtable+16:   emit_operand(kind_id, ...)
  │    ├─ vtable+272:  emit_integer(value)
  │    ├─ vtable+1760: set_rounding_mode(mode)
  │    ├─ vtable+3952: emit_saturation_flag()
  │    ├─ vtable+3960: emit_ftz_flag()
  │    ├─ vtable+3968: emit_negate_flag()
  │    ├─ vtable+4072: emit_cache_operation()
  │    ├─ vtable+4080: emit_eviction_hint()
  │    ├─ vtable+944:  end_predicate_guard()
  │    └─ vtable+4160: end_statement()
  │
  └─ Predicate guard: sub_9DB7E0 (662 bytes, 19 callers)

The builder/visitor vtable has approximately 520 method slots (vtable spans 4,160+ bytes), making it one of the largest virtual dispatch interfaces in the binary. Different concrete visitor implementations produce different output formats (text, hex, self-check comparison).

Formatter Template

Every PTX formatter function is mechanically generated from instruction definition tables. All 580 follow an identical structure:

char* format_OPCODE(int64_t a1, int64_t a2) {
    // a1 = instruction context (instruction data at a1+1096)
    // a2 = format string table base pointer (~1.8 MB)

    // Phase 1: Allocate temp buffer
    int64_t pool = ((int64_t*)sub_4280C0(a1, a2))[3];   // arena_get_pool
    char* buf = (char*)sub_424070(pool, 50000);           // pool_alloc(50KB)
    if (!buf) sub_42BDB0(pool, 50000, ...);               // alloc_fail_abort

    // Phase 2: Build instruction text via sprintf chain
    int pos = sprintf(buf, "%s", (char*)(a2 + OFFSET_A)); // opcode prefix
    if (sub_70B6E0(*(a1+1096)))                           // has_predicate?
        pos += sprintf(buf+pos, fmt, sub_70B780(*(a1+1096))); // predicate name
    pos += sprintf(buf+pos, "%s", (char*)(a2 + OFFSET_B)); // operand template
    // ... more operands via sub_70B8E0, sub_70B910, sub_70B920 ...
    strcpy(buf+pos, (char*)(a2 + OFFSET_N));              // trailing text

    // Phase 3: Shrink-copy to exact size
    size_t len = strlen(buf);
    int64_t pool2 = ((int64_t*)sub_4280C0(buf, ...))[3];
    char* result = (char*)sub_424070(pool2, len + 1);
    strcpy(result, buf);

    // Phase 4: Free temp buffer
    sub_4248B0(buf);                                       // pool_free
    return result;
}

The format string table (a2) is a single monolithic ~1.8 MB block of NUL-terminated strings containing pre-assembled text templates with %s, %llu, %d placeholders. Different formatters access it at different offsets:

FormatterOffset into a2Approximate position
wgmma.mma_async1,731,609~1.7 MB
wmma.mma1,731,130~1.7 MB
rsqrt67,573~67 KB
copysign110,152~110 KB
vavrg4286,309~286 KB
guardrails.alloc~1,843,620~1.8 MB

This design trades memory for speed: instead of building instruction text dynamically, ptxas stores the complete format template and fills in operand names at runtime.

Instruction Operand Accessors

All formatters query the instruction object through a uniform set of tiny accessor functions:

AddressSizeCallersIdentity
sub_70B70014 B946has_predicate()
sub_70B6E014 B42has_predicate_v2()
sub_70B710111 B348get_opcode_string()
sub_70B780151 B514get_predicate_name()
sub_70B8E012 B1,449get_reg_operand(idx)
sub_70B91012 B1,656get_src_part0(idx)
sub_70B92012 B1,296get_src_part1(idx)
sub_70B9307 B68get_operand_count()
sub_70B4C022 B46get_base_address()
sub_70CA6011 B480get_operand_type(idx)
sub_70CA70427 B191get_type_suffix()
sub_70CD20122 B158get_operand_offset(idx)
sub_71086039 B2,953get_data_type(idx, part)
sub_70FA0010 B286get_target_sm(idx)
sub_70FA1066 B7check_target_sm(idx, str)
sub_70991014 B13get_variant_count()
sub_709A1073 B46get_variant_string()
sub_707CE022 B93get_address_operand(idx)
sub_709760127 B21get_comparison_op()
sub_709FE011 B17get_rounding_mode()
sub_70A50013 B15get_saturation_mode()
sub_70B3F0----get_ftz_flag()
sub_707530----get_precision_string()
sub_707C80----get_scope_string()
sub_7075E0----get_layout_string()
sub_707BE0----get_shape_string()
sub_70A810----get_scale_string()

All accessors read from the instruction object at *(a1+1096). The tiny sizes (7--151 bytes for most) indicate these are simple field extractions from the instruction record.

Memory Allocation

The formatter memory lifecycle uses a pool allocator:

FunctionSizeCallersIdentity
sub_4280C0597 B3,928arena_get_pool(ctx, table)
sub_4240702,098 B3,809pool_alloc(pool, size)
sub_42BDB014 B3,825alloc_fail_abort()
sub_4248B0923 B1,215pool_free(ptr)

Every formatter allocates a 50,000-byte temporary buffer, builds the instruction string via sprintf chains, measures the result with strlen, allocates an exact-size copy, and frees the temporary. The 50 KB buffer provides headroom for the largest instructions (WMMA loads produce multi-KB strings) but is wasteful for simple 2-operand instructions that generate ~50-byte strings.

Predicate Guard Printing

Predicate guards (@P0, @!P1, etc.) are printed by checking has_predicate() on the instruction, then formatting the guard via get_predicate_name():

// PTX-level predicate printing (in every formatter)
int pos = sprintf(buf, "%s", opcode_prefix);
if (sub_70B6E0(*(a1+1096))) {                     // has_predicate?
    int64_t pred = sub_70B780(*(a1+1096));         // get_predicate_name
    pos += sprintf(buf+pos, guard_fmt, pred);      // e.g., "@P0 " or "@!P1 "
}

// SASS-level predicate printing (in disassembly renderer)
// sub_9DB7E0 (662 bytes, 19 callers) — emits guard through builder vtable
//   calls builder->begin_predicate_guard() at vtable+936
//   emits predicate register name
//   calls builder->end_predicate_guard() at vtable+944

Register and Operand Formatting

Register operands are resolved from the instruction's operand array. The formatter accesses operands by index through get_reg_operand(idx), get_src_part0(idx), and get_src_part1(idx). The standard register naming follows NVIDIA conventions:

Register classNamingExamples
General-purposeR0--R255R0, R4, R255
Zero registerRZRZ
PredicateP0--P6, PT@P0, PT
UniformUR0--UR63UR4, UR16
Uniform predicateUP0--UP6, UPTUP0
Constant bufferc[bank][offset]c[0x0][0x168]
SpecialSR_*SR_CTAID.X, SR_TID.X

For the SASS disassembly renderer, the register class discriminator sub_91C840 (347 bytes, 232 callers) maps internal type codes 1--0x17 to output class IDs 0--18, covering integer registers, float registers, double registers, predicate registers, condition registers, texture/surface references, and uniform registers.

The operand encoder sub_9D12F0 (1,423 bytes, 289 callers) is the core serializer for SASS-level printing. It takes an instruction and operand index, resolves whether the operand is a register, immediate, or memory reference, handles constant buffer lookups, and fills a 64-byte (4x __m128i) encoding structure that the builder/visitor consumes.

Address and Offset Formatting

Memory operands are formatted with address space qualifiers and offset expressions:

[R4.64]              — register indirect, 64-bit
[R4+0x10]            — register + offset
c[0x0][0x168]        — constant buffer bank 0, offset 0x168
[UR4]                — uniform register indirect

The address space qualifier resolver sub_9CEB50 (185 bytes, 57 callers) combines address space information from the operand descriptor with the instruction context. For SASS-level output, the address space emitter sub_9E7B00 and related functions (sub_9E9910, sub_9E9A70) handle data type and memory space qualifiers.

Architecture-Conditional Formatting

86 of the 580 formatters contain architecture-conditional paths that check the target SM version via sub_70FA00 (numeric comparison) or sub_70FA10 (string comparison). Architecture-specific formatting reflects hardware evolution:

SMEraFormatting impact
sm_20, sm_21Fermi (2010)copysign has different operand layout (7 vs 5 fields)
sm_62Pascal mobile (2016)vavrg4 gets per-component register formatting
sm_103Blackwell Ultra (2025)rsqrt gains new operand layout for extended precision

Five formatters additionally use string-based SM comparison via sub_70FA10:

  • sub_4DD860 (copysign): checks "sm_20", "sm_21"
  • sub_56BA60 (vavrg4): checks "sm_62"
  • sub_56C8D0 (dp2a.lo): SM string comparison
  • sub_577BA0 (dp2a.hi): SM string comparison
  • sub_583190 (rsqrt): checks "sm_103"

SASS Disassembly Renderer

The SASS-level renderer at 0x17F8000--0x181FFFF (~160 KB, ~123 virtual entry points) converts binary-encoded SASS instructions into textual SASS assembly. Unlike the PTX formatters (Level 1) which work from the high-level Ori IR via sprintf chains, the SASS renderer decodes the binary instruction encoding and drives a builder/visitor object through a structured sequence of emit_* calls. The builder's concrete implementation determines the output format -- text for --out-sass, comparison data for --self-check, or binary encoding verification.

Internal Layers

The subsystem splits into five layers by address range and function role:

LayerRangeCountRole
A: Encoding templates0x17F8000--0x180FFFF~75Build per-opcode operand layout descriptors
B: Accessor vtable methods0x1810700--0x1810BFF~15ISA version/class discriminator predicates
C: Format-class printers0x1810D20--0x18167FF~50Workhorses: decode operands + emit through builder
D: Complex multi-format printers0x1817000--0x181CFFF~15Texture, multi-operand, predicated printers
E: Post-processing hooks0x181E000--0x181FFFF~8ISA-override detection, fixup dispatch

All ~123 entry points have zero static callers, confirming they are virtual method overrides dispatched through vtables. The printer dispatch layer at 0xAA8000--0xACA000 (sub_AA9330, sub_AA9860, sub_AAB9C0, sub_AC99D0) invokes them.

Rendering Protocol

Every SASS printer receives (a1, a2) where a1 is the printer context (builder pointer at a1+24) and a2 is the binary-encoded instruction. The rendering follows a fixed protocol:

1. vtable[0](builder, instruction_kind_id)     // begin_instruction
2. vtable[3760](builder, sync_mode)            // set_sync_type (if applicable)
3. vtable[3768](builder)                       // begin_operand_list
4. sub_9DB7E0(a1, a2, 1)                       // emit predicate guard (@Px)
5. For each operand:
   a. sub_9D12F0(&buf, a1, a2, idx, stride, mode, flag)  // encode -> 64B struct
   b. vtable[16](builder, kind_id, buf...)                // emit_operand
6. vtable[3952](builder)                       // emit_saturation (.SAT)
7. vtable[3960](builder)                       // emit_ftz (.FTZ)
8. vtable[3968](builder)                       // emit_negate (.NEG)
9. vtable[4072](builder)                       // emit_cache_operation
10. vtable[4080](builder)                      // emit_eviction_hint
11. vtable[4160](builder)                      // end_instruction

The protocol is directly visible in decompiled code. In sub_1812F60 (16-DWORD immediate printer), the function begins with vtable[0](builder, 89) (begin instruction kind 89), calls vtable[3760] for sync type, vtable[3768] for begin operand list, sub_9DB7E0 for predicate guard, then loops 16 times calling vtable[272] (create integer operand) followed by vtable[16] (emit operand) with kind IDs 55 through 70 -- one per DWORD.

In sub_1810D20 (comparison-mode printer), the function first reads the modifier word from the operand array at instruction+84, switches on (modifier >> 4) & 0xF, calls vtable[3528]/vtable[3536] to configure comparison mode and variant, then emits 2--3 operands via the standard sub_9D12F0 + vtable[16] sequence.

Builder/Visitor Vtable

The builder object at *(a1 + 24) exposes a vtable spanning 4,160+ bytes (~520 method slots at 8 bytes each). The complete set of identified methods:

OffsetMethodCategory
+0begin_instruction(kind_id)Framing
+8get_current_instruction()Accessor
+16emit_operand(kind_id, operand_buf...)Core emission
+24post_process_operand()After-emit hook
+112get_register_size_32()Register geometry
+120get_register_size_64()Register geometry
+128create_register_operand()Operand factory
+152create_memory_operand()Operand factory
+192create_special_operand()Operand factory
+208create_literal_operand()Operand factory
+272create_integer_operand(value)Operand factory
+304create_register_ref_operand()Operand factory
+368set_address_space()Memory qualifier
+936begin_predicate_guard()Predicate block
+944end_predicate_guard()Predicate block
+984set_predicate_mode()Predicate negate/true
+1000emit_modifier()Generic modifier
+1056set_offset_mode()Address offset
+1128emit_width_qualifier().B32, .B64
+1392set_comparison_flag()Comparison type
+1760set_rounding_mode().RN, .RZ, .RM, .RP
+1936begin_sync_block()Sync scope
+1944end_sync_block()Sync scope
+2016set_sync_width()Sync width
+2024set_sync_depth()Sync depth
+2584set_uniform_flag().U modifier
+2960set_address_mode()Address mode
+2992set_cache_level_a()Cache hint (L1)
+3000set_cache_level_b()Cache hint (L2)
+3096set_comparison_type()Second comparison slot
+3128set_source_type_a()Source type
+3136set_source_type_b()Source type
+3144set_interlock_mode()Memory ordering
+3152begin_comparison_block()Comparison section
+3160set_comparison_width()Comparison width
+3520set_data_width()Operand width
+3528set_comparison_mode()Comparison config
+3536set_comparison_variant()Comparison variant
+3560set_conversion_type()Conversion modifier
+3576begin_conversion()Conversion block
+3760set_sync_type()Synchronization type
+3768begin_operand_list()Operand section
+3776emit_rounding_decoration()Rounding modifier
+3824emit_texture_header()Texture header index
+3952emit_saturation_flag().SAT
+3960emit_ftz_flag().FTZ
+3968emit_negate_flag().NEG
+4072emit_cache_operation()Cache operation hint
+4080emit_eviction_hint()Eviction priority
+4160end_instruction()Framing

Different concrete visitor implementations produce different output formats. The vtable design means adding a new output format (e.g., JSON, binary verification) requires only a new visitor class with no changes to any of the ~123 printer functions.

Encoding Template Builders (Layer A)

~75 functions at 0x17F8000--0x180FFFF build per-opcode instruction format descriptors that define the expected operand signature. Each function:

  1. Sets the SASS opcode ID: *(a2+12) = opcode_number
  2. Loads a 128-bit format descriptor: *(a1+8) = xmmword_23Fxxxx (from rodata)
  3. Fills up to 10 operand slots at a1+24..a1+120 with type codes, register class IDs, and modifier flags
  4. Writes expected-value constraints at a1+64..a1+160 (-1 = any)
  5. Writes type constraint modifiers at a1+104..a1+200

From sub_17F8210 (opcode 274):

*(a2+12) = 274;                                          // SASS opcode ID
*(a1+8)  = _mm_loadu_si128(&xmmword_23F21B0);           // 128-bit descriptor
*(a1+24) = 10;   // operand 0: predicate register type
*(a1+64) = -1;   // operand 0: any value accepted
*(a1+104) = 0;   // operand 0: no modifier constraint
*(a1+28) = 17;   // operand 1: specific register class
*(a1+68) = -1;   // operand 1: any value
*(a1+108) = 3;   // operand 1: modifier constraint 3
// ... remaining operands bulk-copied via SSE from xmmword_23F1C60 table

The 128-bit descriptors at xmmword_23F1xxx--23F2xxx encode canonical operand layouts. The bulk SSE copies (_mm_load_si128/_mm_loadu_si128) fill 4 operand slots per iteration, making the template builders compact despite handling up to 10 operand positions.

Format-Class Printers (Layer C)

The instruction's format class at instruction+76 determines which printer handles it. The dispatch computes index = format_class - 11, then looks up dword_23B39E0[index] for the encoding strategy:

StrategyValueDescription
Default0Standard register fields
Wide19-bit register fields, 8 sequential operands
Pair22x register fields per operand
Extended3Extra modifier bits
Special4+Texture header, 16-DWORD immediate

Printer functions for each format class:

FunctionSizeFormat classEvidence
sub_1810D208.8 KBComparison-modeSwitches on (modifier >> 4) & 0xF: case 4 emits comparison with two variants, case 6 emits single-variant. Calls vtable[3528]/vtable[3536] for comparison config
sub_18111F011.6 KBWide-operand8 sequential sub_9D12F0 calls with indices 0--7
sub_1811E2011.6 KBWide + specialBoth sub_9D12F0 and sub_9CF740 calls
sub_181289010.5 KBRegister + constantsub_9CF8A0 for constant folding
sub_1812F6015.3 KB16-DWORD immediatesub_7E4CF0 iterator, 16x vtable[272] + vtable[16] with kind IDs 55--70
sub_18141C06.5 KBPer-operand comparisonDispatch entry from sub_1820000
sub_18146607.1 KBLoad/storesub_C49400 + sub_9CEB50 for address space
sub_1814B1017.6 KBLoad/store + predicatedsub_C49400, sub_91E860, sub_91C840, sub_9CEB50
sub_181581012.7 KBWide variantSimilar to sub_1811E20
sub_181600013.1 KBData-type qualifiedsub_9E9910 for data type emission
sub_18167F011.8 KBMemory-accesssub_9E7B00 for address space qualifier, sub_A3B930 for operand modifier
sub_1816FC06.4 KBModifier-heavyChecks bits 6, 7, 14 of operand word for negate/absolute modifiers

Texture/Surface Printer (Layer D)

The texture/surface printer sub_18189C0 is the largest at 45.2 KB. It handles the complete texture and surface instruction families:

sub_18189C0 (45.2 KB, 1361 lines)
  ├─ Read opcode at +72, mask to canonical form
  ├─ Giant switch on opcodes:
  │    18 (FADD/FMUL?), 119 (MUFU?), 186 (TEX),
  │    211 (TLD), 283 (SULD), 315 (SUST)
  ├─ Check operand modifier bits for predication/negation
  ├─ dword_23B39E0[format_class-11] → subtype (values 0-4)
  ├─ word_23B3A58[subtype] → builder kind ID
  ├─ Emit predicate: vtable[89], begin_operand_list
  ├─ sub_9DB7E0 → predicate guard
  ├─ For each operand: sub_9D12F0 → builder->emit_operand
  ├─ If format > 9: sub_1817C50 (12.8KB) → texture header index
  │    └─ Linearizes from 2D bit fields: bits[0:8] x bits[9:17] → index 0-52
  ├─ Emit cache/eviction: vtable[4080], vtable[4072]
  ├─ Emit saturation/ftz: vtable[3952], vtable[3960], vtable[3968]
  └─ Return 1 on success

The multi-operand printer sub_181B370 (27.8 KB) handles instructions with many operand variants (VOTE at opcode 0x7A, multi-op at 0x138), emitting up to 12 sequential operands through sub_9CEF90 (extended operand encoder) and sub_9CF740 (immediate encoder).

ISA-Override Detection (Layer E)

sub_181E1D0 (7.3 KB) is a post-processing fixup dispatcher called from sub_AA9330. It performs ISA-target-aware fixups by comparing vtable method pointers against known discriminator functions:

// If the current ISA target has the DEFAULT implementation:
if (vtable[111] == sub_1810B90)   // default comparison handler
    apply_default_fixup();
// Otherwise the target has OVERRIDDEN the method:
else
    apply_specialized_fixup();    // e.g., sub_1BCBB90 for arch-specific

This mechanism supports 45 opcodes (0x12, 0x16, 0x24, ..., 0x141) and dispatches to architecture-specific post-processors (sub_1BCBB90, sub_1BCC2D0, sub_BCCF80, sub_1BCF120) or re-emits modifiers via sub_9E9910/sub_9E9A70.

The discriminator functions at 0x1810700--0x1810BFF (~15 tiny functions) serve as sentinel values: sub_1810720, sub_1810750, sub_18108A0, sub_18108D0, sub_1810B90. Their identity (which function pointer is stored) determines which specialization path the fixup dispatcher takes.

Instruction Object Layout

The binary instruction object (a2) used by all SASS printers:

OffsetSizeField
+08Context/vtable pointer
+88ISA context pointer (register file, instruction info table)
+248Builder/visitor object pointer
+328Operand metadata pointer
+401Half-precision flag
+488Operand modifier context
+724Opcode (bits 12--13 are variant flags, masked via &0xCFFF)
+764Format class (subtract 11 for dword_23B39E0[] indexing)
+804Operand count
+84+8*NOperand array (N operands, 8 bytes each)

Each 8-byte operand slot encodes:

BitsWordField
28--300Operand type tag: 1=register, 4=address, 5=constant buffer, 7=special
0--230Register/constant index
24--270Modifier flags
01Negate
11Absolute value
201Constant pool flag (0x100000)
291Sign extension (0x20000000)
301Uniform flag (0x40000000)
311Negation modifier (0x80000000)

Global Lookup Tables

TableSizeIndexPurpose
dword_23B39E0[10]40 Bformat_class - 11Format class to encoding strategy (0--4)
word_23B3A58[4]8 BSubtype from aboveSubtype to builder kind_id mapping
dword_23B3A20[14]56 Bregister_class - 3Register class to comparison type ID
dword_23B3980[7]28 Bwidth_field - 1Encoded width to builder width value
xmmword_23F1xxx--23F2xxx~16 B eachPer-opcode128-bit operand layout descriptor templates

SASS Renderer Function Map

AddressSizeCallersIdentityConfidence
sub_17F8210~1.3 KB0 (vtable)Encoding template builder (opcode 274)95%
sub_1810D208.8 KB0 (vtable)Comparison-mode format-class printer90%
sub_18111F011.6 KB0 (vtable)Wide-operand format-class printer85%
sub_1811E2011.6 KB0 (vtable)Wide-operand + special encoding printer85%
sub_181289010.5 KB0 (vtable)Register + constant operand printer85%
sub_1812F6015.3 KB0 (vtable)16-DWORD immediate printer90%
sub_18141C06.5 KB0 (vtable)Per-operand comparison printer85%
sub_18146607.1 KB0 (vtable)Load/store with address space printer85%
sub_1814B1017.6 KB0 (vtable)Load/store + predication printer85%
sub_181581012.7 KB0 (vtable)Wide-operand variant printer80%
sub_181600013.1 KB0 (vtable)Data-type qualified printer85%
sub_18167F011.8 KB0 (vtable)Memory-access instruction printer85%
sub_1816FC06.4 KB0 (vtable)Modifier-heavy instruction printer85%
sub_1817C5012.8 KB~1Texture header index encoder90%
sub_18189C045.2 KB0 (vtable)Texture/surface instruction printer92%
sub_181B37027.8 KB0 (vtable)Multi-operand instruction printer88%
sub_181CF6014.0 KB0 (vtable)Predicated instruction printer85%
sub_181D9B012.6 KB0 (vtable)Load/store variant printer80%
sub_181E1D07.3 KB~1ISA-override fixup dispatcher90%
sub_181E63014.7 KB~1Comparison instruction post-processor88%
sub_181F0007.6 KB~1Data-type specialized printer75%
sub_181F4F017.3 KB~1Multi-variant data-type printer80%

CLI Integration

--verbose / -v

Enables printing of code generation statistics after compilation. The statistics printers at sub_ABBA50--sub_ABEB50 (8 SM-variant clones, 7,603 bytes each) emit post-scheduling metrics in "# [...] " comment format.

--forcetext

Forces text-mode SASS output regardless of the default binary output mode. Internal flag: "forcetext=%d".

--out-sass

Generates reconstituted SASS text from the Capsule Mercury representation. Used for debugging the capmerc encode/decode roundtrip. Triggers the SASS text Flex lexer sub_720F00 (64 KB) for parsing in --self-check mode.

--self-check

Roundtrip verification for Capsule Mercury: encodes the instruction stream to capmerc format, decodes it back, renders both original and reconstituted as SASS text, and compares. The Flex lexer at sub_720F00 parses the text output for comparison. The SASS text formatter sub_719D00 (50 KB) builds the output for self-check.

DUMPIR

The DUMPIR environment variable (and related knobs) triggers intermediate representation dumps at named phases. Phase 129 (DumpNVuCodeText) is one of the dump targets, emitting the full instruction stream as formatted text when DUMPIR includes that phase name.

Formatter Size Distribution

Function size directly correlates with PTX instruction complexity:

TierSize rangeCountDescription
Tiny< 500 B13Simple 2-operand (wgmma.fence: 295 B)
Small500--1,000 B191Standard 3--4 operand (copysign: 794 B)
Medium1,000--2,000 B319Instructions with modifiers (bfind: 1,130 B)
Large2,000--4,000 B36Arch-conditional paths (membar: 2,788 B)
Very large4,000--6,000 B20Complex multi-form (tex.grad: 5,636 B)
Monster6,000--10,000 B2WMMA matrix loads (wmma.load.b: 9,757 B)

The WMMA load/store formatters account for 34,423 bytes (4% of the total range), reflecting the combinatorial explosion of matrix shapes, data types, layouts, and architectures.

Named Opcode Dispatch Table

The 121 named opcodes registered at a1+808 by sub_5D4190:

CategoryOpcodes
Memory fencemembar
Conversioncvt, tensormap.replace
Mathdiv, div.full, rem, rcp, rsqrt, ex2, lg2, sqrt, tanh, copysign
Bit manipulationbfind, brev, bfe, bfi, clz, popc, testp
Load/store_ldldu, ldmatrix, movmatrix, stmatrix, st.async, red.async, st.bulk, prefetch
Texturetex, tex.base, tex.level, tex.grad, tld4, sured.b
Video SIMDvadd--vmad, vadd2--vavrg2, vadd4--vavrg4
Dot productdp2a.lo, dp2a.hi, dp4a
Barriersbar, barrier, bar.arrive, barrier.arrive, bar.red, barrier.red, bar.cta, barrier.cta, + .arrive/.red variants, bar.warp
Warp opsvote, shfl, match, redux
Async copycp.async.mbarrier.arrive, cp.async.bulk, cp.async.bulk.tensor
Cache policycreatepolicy.range, createpolicy.fractional, createpolicy.cvt
Multi-memorymultimem.ld_reduce, multimem.st, multimem.red
WMMAwmma.load.a, wmma.load.b, wmma.load.c, wmma.store.d, wmma.mma, mma
WGMMAwgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group
TCGen05tcgen05.alloc, tcgen05.relinquish_alloc_permit, tcgen05.dealloc, tcgen05.ld, tcgen05.ld.red, tcgen05.st, tcgen05.commit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.ws
Guardrails_tcgen05.guardrails.is_phase_valid, .are_columns_allocated, .is_current_warp_valid_owner, .in_physical_bounds, .allocation_granularity, .datapath_alignment, .sp_consistency_across_idesc_mod, .check_sparse_usage

The remaining ~400 opcodes (arithmetic, logic, load/store, control flow, etc.) are dispatched through hash values at a1+816.

SASS Printer Key Functions

AddressSizeCallersIdentity
sub_5D419012.9 KB1PTX instruction text dispatch + intrinsic registration
sub_5D166046 KB1Intrinsic library registration (608 entries)
sub_5FF700354 KB--Builtin function declaration emitter (prototype generator)
sub_4DA34061 B1,080Builtin declaration lookup helper
sub_719D0050 KB--SASS text formatter (self-check output builder)
sub_720F0064 KB--Flex lexer for SASS text parsing (self-check input)
sub_9D12F01.4 KB289Operand encoder (64-byte struct per operand)
sub_9DB7E0662 B19Predicate guard printer
sub_91C840347 B232Register class discriminator
sub_9CEB50185 B57Address space qualifier resolver
sub_91E86031 B214Data size accessor
sub_18189C045.2 KB--Texture/surface instruction printer (largest SASS printer)
sub_181B37027.8 KB--Multi-operand instruction printer
sub_1817C5012.8 KB--Texture header index encoder

Instruction Data Flow

                    ┌──────────────────────────────────┐
                    │  Ori IR Instruction Object        │
                    │  (instruction data at *(a1+1096)) │
                    └────────────────┬─────────────────┘
                                     │
               ┌─────────────────────┼──────────────────────┐
               │                     │                      │
               v                     v                      v
   sub_70B6E0/B700          sub_70B8E0/B910/B920     sub_70CA60/CA70
   has_predicate()          get_reg_operand(idx)      get_operand_type()
   get_predicate_name()     get_src_part0/1(idx)      get_type_suffix()
               │                     │                      │
               └─────────────────────┼──────────────────────┘
                                     │
                                     v
                          ┌─────────────────────┐
                          │  sprintf() chain     │
                          │  into 50 KB buffer   │
                          │  using format table  │
                          │  at a2+offset        │
                          └──────────┬──────────┘
                                     │
                                     v
                          ┌─────────────────────┐
                          │  strlen → alloc →    │
                          │  strcpy → free temp  │
                          └──────────┬──────────┘
                                     │
                                     v
                          ┌─────────────────────┐
                          │  Formatted PTX text  │
                          │  string (exact size) │
                          └─────────────────────┘

Cross-References