SM75 Turing

The SM75 (Turing, compute capability 7.5) instruction selection backend occupies 984 KB at 0xF16000--0x100C000 and is the largest single-architecture ISel backend in the nvlink v13.0.88 binary. It contains 1,737 functions organized into four functional layers -- operand predicates, instruction emitters, pattern matchers, and post-ISel emit+encode functions -- plus a 280 KB mega-hub dispatch function (sub_FBB810) that is the largest function in the entire binary at 65,999 instructions.

Turing is architecturally significant as the first SM generation to introduce the Uniform Register File (URF), which manifests throughout this backend as operand kind 10 (UREG). The ISel uses a priority-based linear scan: for each IR instruction, all 276 pattern matchers run in sequence, and the highest-priority match wins.

Key Facts

| Property | Value |
| --- | --- |
| Address range | 0xF16000--0x100C000 (984 KB) |
| Total functions | ~1,737 |
| Mega-hub dispatch | sub_FBB810 at 0xFBB810 (280 KB, 65,999 instructions, 1,733 callees) |
| Operand predicates | 15 functions at 0xF16030--0xF16110 |
| Instruction emitters | 18 functions at 0xF10080--0xF15A50 |
| Pattern matchers | 276 functions at 0xF16150--0xFBB780 |
| Emit+encode functions | 38 functions at 0xFFFDF0--0x100BBF0 |
| Encoding context size | 576+ bytes |
| ISel architecture | Priority-based linear scan (not tree-pattern or DAG) |

Address Map

| Range | Size | Subsystem | Count |
| --- | --- | --- | --- |
| 0xF16030--0xF16110 | <1 KB | Operand predicate functions | 15 |
| 0xF10080--0xF15A50 | 22 KB | Instruction emitters | 18 |
| 0xF16150--0xFBB780 | 678 KB | ISel pattern matchers | 276 |
| 0xFBB810 | 280 KB | Mega-hub dispatch (sub_FBB810) | 1 |
| 0xFFFDF0--0x100BBF0 | 48 KB | Post-ISel emit+encode | 38 |
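
As a worked illustration of the address map, the sketch below classifies a function by its start address and recomputes the size of the stated backend range. The enum, struct, and function names are hypothetical and are not symbols from the binary. The predicate range uses 0xF16110 as its upper bound because that is the last predicate address listed in the per-function predicate table.

```c
#include <stddef.h>
#include <stdint.h>

/* Subsystem labels follow the address map; the enum itself is hypothetical. */
typedef enum {
    SM75_UNKNOWN = 0,
    SM75_EMITTER,
    SM75_OPERAND_PREDICATE,
    SM75_PATTERN_MATCHER,
    SM75_MEGA_HUB,
    SM75_EMIT_ENCODE
} sm75_subsystem;

typedef struct {
    uint64_t first;        /* start address of the first function in the group */
    uint64_t last;         /* start address of the last function in the group  */
    sm75_subsystem kind;
} sm75_range;

static const sm75_range sm75_ranges[] = {
    { 0xF10080, 0xF15A50,  SM75_EMITTER },
    { 0xF16030, 0xF16110,  SM75_OPERAND_PREDICATE },
    { 0xF16150, 0xFBB780,  SM75_PATTERN_MATCHER },
    { 0xFBB810, 0xFBB810,  SM75_MEGA_HUB },
    { 0xFFFDF0, 0x100BBF0, SM75_EMIT_ENCODE },
};

/* Classify a function by its start address; SM75_UNKNOWN if it is unlisted. */
sm75_subsystem sm75_classify(uint64_t fn_start)
{
    for (size_t i = 0; i < sizeof sm75_ranges / sizeof sm75_ranges[0]; ++i) {
        if (fn_start >= sm75_ranges[i].first && fn_start <= sm75_ranges[i].last)
            return sm75_ranges[i].kind;
    }
    return SM75_UNKNOWN;
}

/* Size of the stated backend range: (0x100C000 - 0xF16000) / 1024. */
uint64_t sm75_backend_kib(void)
{
    return (UINT64_C(0x100C000) - UINT64_C(0xF16000)) / 1024;
}
```

With these definitions, `sm75_classify(0xF77140)` reports a pattern matcher and `sm75_backend_kib()` evaluates to 984, matching the figure quoted for the backend.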

Instruction Selection Flow

For each NVVM IR instruction to be lowered to SM75 machine code, sub_FBB810 executes the following protocol:

1. sub_FBB810 iterates through all 276+ pattern matchers
2. Each matcher calls sub_A49150(ctx, node, field_id) to read instruction attributes
3. Each matcher calls sub_530FD0(node) to check explicit operand count
4. Each matcher calls sub_530FB0(node, idx) to retrieve operand at index
5. Each matcher calls sub_530FC0(node) to check implicit operand count
6. Operand type checked via sub_F16040/F16070/etc predicates (kind tag)
7. Register class validated via sub_F16030: value 1023 = "any" (wildcard)
8. If all checks pass: *priority_out = priority, *pattern_id_out = id
9. After all matchers run, mega-hub picks highest-priority match
10. Corresponding emitter called to generate 128-bit encoding

The priority mechanism ensures specific patterns override general ones. Higher values win. If the current best priority already exceeds a matcher's threshold, that matcher early-outs (optimization to avoid redundant checks). Priority ranges across the 276 matchers:

| Priority Range | Meaning | Example |
| --- | --- | --- |
| 2--4 | Fallback/default patterns (minimal constraints) | sub_FBB780 (pattern 1, priority 2): matches any instruction with 0 explicit ops and 2 implicit uniform-register ops |
| 7--10 | Simple patterns (few attribute checks) | NOP/barrier variants, basic shifts |
| 14--19 | Standard patterns (moderate constraints) | IADD3, I2I, MUFU, ISETP, texture fetch, surface load |
| 22--24 | Complex patterns (many attribute + operand checks) | Memory indexed 3-op, branch with predication |
| 33--36 | Very specific patterns (maximum constraints) | SHFL/VOTE with 8 mixed operands, STG with 7 uniform-reg operands |
| 39 | Most specific (HMMA widest variants) | sub_F77140 (9 implicit operands, R128 tensor core ops) |
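
The selection protocol can be sketched in C as follows. This is a minimal illustration, not recovered code: `mock_node`, the two demo matchers, and `select_pattern` are hypothetical, and the early-out is modelled on the assumption that a matcher only updates the outputs when its own priority is strictly higher than the current best. Because pattern IDs repeat across matchers in the tables on this page, the sketch also records which matcher won.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an IR instruction node. */
typedef struct {
    int      explicit_ops;
    int      implicit_ops;
    uint32_t attr;          /* one made-up instruction attribute */
} mock_node;

/* Same shape as the matcher signature documented on this page. */
typedef char (*matcher_fn)(void *ctx, const mock_node *node,
                           uint32_t *pattern_id, uint32_t *priority);

/* Record a match only if it beats the current best priority (assumed rule). */
static char try_accept(uint32_t id, uint32_t prio,
                       uint32_t *pattern_id, uint32_t *priority)
{
    if (*priority >= prio)
        return 0;               /* early-out: a better match already exists */
    *priority = prio;
    *pattern_id = id;
    return 1;
}

/* Made-up specific pattern: extra attribute check, priority 17. */
static char match_specific(void *ctx, const mock_node *n,
                           uint32_t *id, uint32_t *prio)
{
    (void)ctx;
    if (n->explicit_ops == 0 && n->implicit_ops == 2 && n->attr == 42)
        return try_accept(7, 17, id, prio);
    return 0;
}

/* Made-up general pattern: operand counts only, priority 2. */
static char match_generic(void *ctx, const mock_node *n,
                          uint32_t *id, uint32_t *prio)
{
    (void)ctx;
    if (n->explicit_ops == 0 && n->implicit_ops == 2)
        return try_accept(1, 2, id, prio);
    return 0;
}

static const matcher_fn demo_matchers[] = { match_specific, match_generic };

typedef struct {
    int      matcher_index;     /* -1 when nothing matched */
    uint32_t pattern_id;
    uint32_t priority;
} isel_choice;

/* Run every matcher in order and keep the highest-priority match. */
isel_choice select_pattern(void *ctx, const mock_node *node,
                           const matcher_fn *matchers, size_t count)
{
    isel_choice best = { -1, 0, 0 };
    for (size_t i = 0; i < count; ++i) {
        if (matchers[i](ctx, node, &best.pattern_id, &best.priority))
            best.matcher_index = (int)i;
    }
    return best;
}
```

For a node with attribute 42 the specific matcher wins at priority 17 and the general matcher early-outs; for any other node with the same operand counts, only the general matcher fires.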

Operand Predicates

Fifteen trivial inline functions at 0xF16030--0xF16110 classify operand types by a single-byte tag at operand offset +0. Every pattern matcher bottoms out in calls to these predicates.

| Address | Function | Test | Operand Kind | Confidence |
| --- | --- | --- | --- | --- |
| 0xF16030 | sm75_get_regclass_id | return a1 | Identity (passthrough) | HIGH |
| 0xF16040 | sm75_is_register_operand | a1 == 2 | REG -- general register | HIGH |
| 0xF16050 | sm75_is_immediate_operand | a1 == 1 | IMM -- immediate/literal value | HIGH |
| 0xF16060 | sm75_is_memory_operand | a1 == 6 | Memory/address operand | MEDIUM |
| 0xF16070 | sm75_is_uniform_register | a1 == 10 | UREG -- uniform register (Turing+) | MEDIUM |
| 0xF16080 | sm75_is_predicate_operand | a1 == 9 | PRED -- predicate register | MEDIUM |
| 0xF16090 | sm75_is_cbuf_operand | a1 == 5 | Constant buffer reference | LOW |
| 0xF160A0 | sm75_is_texture_operand | a1 == 4 | Texture/sampler reference | LOW |
| 0xF160B0 | sm75_is_true_predicate | a1 == 3 | PT -- always-true guard | MEDIUM |
| 0xF160C0 | sm75_is_false_predicate | a1 == 15 | PN -- always-false guard | MEDIUM |
| 0xF160D0 | sm75_is_kind_13 | a1 == 13 | Unknown | LOW |
| 0xF160E0 | sm75_is_kind_14 | a1 == 14 | Unknown | LOW |
| 0xF160F0 | sm75_is_kind_16 | a1 == 16 | Unknown | LOW |
| 0xF16100 | sm75_is_kind_7 | a1 == 7 | Unknown | LOW |
| 0xF16110 | sm75_is_kind_11 | a1 == 11 | Unknown | LOW |

The identity function at 0xF16030 is used as both a register-class accessor (return value compared against 1023/1/2/4/5) and a generic field value passthrough. A second identity function at 0xF16130 has a different type signature in the original source (both compile to identical machine code, but they occupy distinct vtable slots).

The predicate pair sub_F160B0 / sub_F160C0 is always called as sub_F160B0(v) || sub_F160C0(v) -- accepting either PT (always true, kind 3) or PN (always false, kind 15), matching the SASS convention where a predicate guard can be either polarity.
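
A minimal C rendering of these predicates is shown below. The function names follow the labels used in the table, the enum names are hypothetical, and the two combined helpers only illustrate the `PT || PN` idiom and the REG-or-UREG slot check described on this page. They are not functions identified in the binary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Kind tags as documented; the enumerator names are illustrative. */
enum sm75_operand_kind {
    KIND_IMM  = 1,
    KIND_REG  = 2,
    KIND_PT   = 3,
    KIND_PRED = 9,
    KIND_UREG = 10,
    KIND_PN   = 15
};

/* Each predicate is a single comparison on the one-byte kind tag. */
bool sm75_is_register_operand(uint8_t kind)  { return kind == KIND_REG;  }
bool sm75_is_immediate_operand(uint8_t kind) { return kind == KIND_IMM;  }
bool sm75_is_uniform_register(uint8_t kind)  { return kind == KIND_UREG; }
bool sm75_is_predicate_operand(uint8_t kind) { return kind == KIND_PRED; }
bool sm75_is_true_predicate(uint8_t kind)    { return kind == KIND_PT;   }
bool sm75_is_false_predicate(uint8_t kind)   { return kind == KIND_PN;   }

/* Hypothetical helper: the PT-or-PN check as the matchers spell it. */
bool sm75_is_fixed_guard(uint8_t kind)
{
    return sm75_is_true_predicate(kind) || sm75_is_false_predicate(kind);
}

/* Hypothetical helper: a slot that accepts either register file. */
bool sm75_is_reg_or_ureg(uint8_t kind)
{
    return sm75_is_register_operand(kind) || sm75_is_uniform_register(kind);
}
```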

Operand Kind Tag Summary

| Kind | Symbol | Meaning | Predicate Address |
| --- | --- | --- | --- |
| 1 | IMM | Immediate / constant value | 0xF16050 |
| 2 | REG | General register operand | 0xF16040 |
| 3 | PT | Predicate true (always-true guard) | 0xF160B0 |
| 4 | -- | Texture / sampler reference | 0xF160A0 |
| 5 | -- | Constant buffer reference | 0xF16090 |
| 6 | -- | Memory / address operand | 0xF16060 |
| 9 | PRED | Predicate register operand | 0xF16080 |
| 10 | UREG | Uniform register (Turing+) | 0xF16070 |
| 15 | PN | Predicate false (always-false guard) | 0xF160C0 |

Kind 10 (UREG) is the defining Turing addition. It reflects the Uniform Register File introduced in SM75, which provides scalar registers shared across all threads in a warp. This operand kind appears pervasively in the pattern matchers, often alongside kind 2 (REG) as alternatives in the same operand slot.

Register Class IDs

| ID | Symbol | Width | Usage |
| --- | --- | --- | --- |
| 1 | R32 / GPR32 | 32-bit | General purpose register |
| 2 | R64 / GPR64 | 64-bit | General purpose register pair |
| 4 | R128 / GPR128 | 128-bit | Register quad (HMMA tensor core) |
| 5 | P | 1-bit | Predicate register |
| 1023 (0x3FF) | ANY | wildcard | Matches any register class ("don't care") |
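
The register class check can be illustrated with the sketch below. The acceptance rule is an assumption: it treats an operand as acceptable when its class is the wildcard 1023 or equals the class the pattern asks for. The page only establishes that the value is compared against 1023/1/2/4/5, so the exact logic inside each matcher may differ. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Register class IDs as documented; enumerator names are illustrative. */
enum sm75_regclass {
    RC_R32  = 1,
    RC_R64  = 2,
    RC_R128 = 4,
    RC_P    = 5,
    RC_ANY  = 1023      /* 0x3FF wildcard */
};

/* Assumed rule: the wildcard or an exact class match is accepted. */
bool sm75_regclass_accepts(uint32_t operand_rc, uint32_t wanted_rc)
{
    return operand_rc == RC_ANY || operand_rc == wanted_rc;
}

/* Width in bits for the concrete classes; 0 for the wildcard or unknown IDs. */
unsigned sm75_regclass_width_bits(uint32_t rc)
{
    switch (rc) {
    case RC_R32:  return 32;
    case RC_R64:  return 64;
    case RC_R128: return 128;
    case RC_P:    return 1;
    default:      return 0;
    }
}
```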

Instruction Emitters

Eighteen functions at 0xF10080--0xF15A50 implement the "emit" phase of instruction selection. Each takes an emitter context (a1, 576+ bytes) and an instruction node (a2) and produces the SM75 128-bit instruction encoding. All share a common structure:

Phase 1: Set instruction opcode at a2+12 (e.g., 126, 18, 104)
Phase 2: Load register class descriptor from rodata into a1+8 (SSE load)
Phase 3: Populate 10-slot operand descriptor arrays at a1+24..a1+140
         (register IDs, types, flags -- using SSE memcpy for speed)
Phase 4: Set explicit operand count at a1+144
Phase 5: Bind operands via sub_4C6380/sub_4C60F0/sub_4C6DC0
Phase 6: Encode bitfields into encoding words at a1+544 and a1+552
Phase 7: Set instruction class tag at a1+276
Phase 8: Write branch target / relocation info to instruction node

Identified Emitters

| Address | Size | Identity | Opcode | Operand Bindings | Instruction Class |
| --- | --- | --- | --- | --- | --- |
| 0xF10080 | 4,975 B | sm75_emit_memop_5src | 126 (memory) | 5: src reg, gen x3, pred | 0xE000000004 (load) |
| 0xF10620 | 4,969 B | sm75_emit_memop_6src | 126 (memory) | 6: as above + extra src | 0xE000000003 (store) |
| 0xF10BE0 | 4,857 B | sm75_emit_alu_2src_uniform | 18 (int ALU) | 2: gen(rc=10) + pred | 0x7000000001 |
| 0xF11090 | ~4.8 KB | sm75_emit_alu_2src_uniform_B | 18 | 2 | 0x7000000001 |
| 0xF11540 | ~4.8 KB | sm75_emit_alu_2src_uniform_C | 18 | 2 | 0x7000000001 |
| 0xF119F0 | ~4.8 KB | sm75_emit_alu_variant_D | 18 | -- | -- |
| 0xF11EE0 | ~4.8 KB | sm75_emit_alu_variant_E | 18 | -- | -- |
| 0xF123D0 | ~4.8 KB | sm75_emit_alu_variant_F | 18 | -- | -- |
| 0xF128E0 | ~4.8 KB | sm75_emit_memop_variant_G | 126 | -- | -- |
| 0xF12DF0 | ~4.8 KB | sm75_emit_memop_variant_H | 126 | -- | -- |
| 0xF13310 | ~4.8 KB | sm75_emit_variant_I | -- | -- | -- |
| 0xF13830 | ~4.8 KB | sm75_emit_variant_J | -- | -- | -- |
| 0xF13D50 | ~4.8 KB | sm75_emit_variant_K | -- | -- | -- |
| 0xF14310 | ~4.8 KB | sm75_emit_variant_L | -- | -- | -- |
| 0xF148D0 | ~4.8 KB | sm75_emit_variant_M | -- | -- | -- |
| 0xF14E90 | ~4.8 KB | sm75_emit_variant_N | -- | -- | -- |
| 0xF15470 | ~4.8 KB | sm75_emit_variant_O | -- | -- | -- |
| 0xF15A50 | ~4.8 KB | sm75_emit_variant_P | -- | -- | -- |

Opcode Families

Three instruction opcode numbers are confirmed from a2+12 assignments:

| Opcode | Family | SASS Instructions |
| --- | --- | --- |
| 18 | Integer ALU | IADD, IMAD, ISCADD, LEA, SHF, BFE, BFI, LOP3, PRMT |
| 104 | FP32 operations | FADD, FMUL, FMAD, FFMA |
| 126 | Memory / load-store | LDG, STG, LDS, STS, LDL, STL |

Instruction Class Tags

The 5-byte value written at context offset +276 encodes both the instruction family (high nibble) and operand configuration (low word):

| Tag | Meaning |
| --- | --- |
| 0x7000000001 | Integer ALU, 2-operand form |
| 0xE000000003 | Memory store (opcode 126, 6 sources) |
| 0xE000000004 | Memory load (opcode 126, 5 sources) |
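
To make the tag layout concrete, the sketch below splits a tag into its two parts. It assumes the tag is a 40-bit value whose family nibble sits in bits 36--39 and whose operand configuration is the low 32 bits. That reading is consistent with the three tags listed, but the exact field boundaries are an assumption, and the function names are hypothetical.

```c
#include <stdint.h>

/* Family nibble: top 4 bits of the 40-bit (5-byte) tag (assumed layout). */
unsigned sm75_tag_family(uint64_t tag)
{
    return (unsigned)((tag >> 36) & 0xF);
}

/* Operand configuration: low 32 bits of the tag (assumed layout). */
uint32_t sm75_tag_config(uint64_t tag)
{
    return (uint32_t)(tag & UINT64_C(0xFFFFFFFF));
}
```

Under that assumption, 0x7000000001 decodes to family 0x7 with configuration 1, and the two memory tags share family 0xE while differing in configuration (3 for store, 4 for load).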

Pattern Matchers

All 276 pattern matchers at 0xF16150--0xFBB780 share the same signature:

char __fastcall sm75_match_XXX(
    void*     ctx,          // a1: ISel context
    void*     instr_node,   // a2: IR instruction node
    uint32_t* pattern_id,   // a3: output -- matched pattern ID
    uint32_t* priority      // a4: in/out -- current best priority
);

Each performs a deeply-nested sequence of checks:

  1. Check instruction attributes via sub_A49150(ctx, node, field_id) -- see field ID dictionary below
  2. Check explicit operand count via sub_530FD0(node)
  3. For each explicit operand: validate kind tag and register class
  4. Check implicit operand count via sub_530FC0(node)
  5. For each implicit operand: validate kind tag and register class
  6. If all checks pass and *a4 <= threshold: set *a4 = new_priority, *a3 = pattern_id

Pattern Matcher Categories

The 276 matchers group into 13 functional categories by the instruction families they match:

| Category | Address Range | Count | Representative Patterns (pattern ID, priority) |
| --- | --- | --- | --- |
| NOP / barrier | 0xF16150--0xF163A0 | ~5 | NOP variant (pattern 33, priority 4), control flow simple (42, 8) |
| HMMA (tensor core) | 0xF1C3F0--0xF20D10 | ~10 | HMMA f16/f32 (4, 15), HMMA f64 (13, 15), HMMA with UR (9, 15) |
| ALU / arithmetic | 0xF20D10--0xF2B2A0 | ~30 | IADD3 reg+imm (2, 17), ALU 2-op variants |
| Memory / load-store | 0xF307E0--0xF36A20 | ~15 | Memory indexed 3-op (12, 24), STG indexed 6-op (25, 36) |
| Conversion / cast | 0xF3C0F0--0xF437C0 | ~20 | I2I 3-op (57, 17), MUFU/F2F (82, 19), TEX 3-op (121, 19) |
| Predicated ops | 0xF4AA30--0xF4FB70 | ~10 | ISETP 3-op (209, 19), texture fetch 3-op (218, 19) |
| Store variants | 0xF58BB0--0xF5C120 | ~10 | STG with predicate (10, 19), store predicated variants |
| Surface / texture | 0xF6DC60--0xF71B60 | ~15 | SULD 4-op predicated (5, 19) with side-effect check |
| Complex HMMA | 0xF76170--0xF77DF0 | ~8 | HMMA wide R128 (7, 34), HMMA widest 9-op (8, 39) -- largest matchers |
| ALU extended | 0xF82CF0--0xF96B40 | ~50 | IMAD predicated 6-op (1, 19), IADD/IMUL/SHF/BFE/BFI/LOP3/PRMT |
| Comparison / SETP | 0xF97CE0--0xF9CD30 | ~15 | DSETP 8-op (22, 34), DSETP 9-op+pred (23, 36) |
| Branch / call | 0xFA0310--0xFAA4E0 | ~20 | BRA complex predicated (1, 24), call/return variants |
| Final / fallback | 0xFB7A90--0xFBB780 | ~5 | SHF 3-op imm (2, 10), fallback simplest (1, 2) |

Complex HMMA Matchers (Largest)

The most complex matchers target Half-precision Matrix Multiply-Accumulate (HMMA) instructions for Turing's tensor cores. These are the largest individual matcher functions (6--8 KB each) because HMMA has the most operands and encoding options:

sub_F77140 -- HMMA widest variant A (8,408 bytes, 179 lines):

  • Checks field 0x216 == 2717 (HMMA opcode variant)
  • Additional checks: fields 0xA1 == 700, 0xA2 in range 702--703
  • 1 explicit operand: register R128
  • 9 implicit operands: R128 x3, predicate, R64, R32, UREG R32, PT/PN check
  • Sets pattern_id=8, priority=39 (maximum observed)

sub_F77DF0 -- HMMA widest variant B (8,401 bytes, 179 lines):

  • Checks field 0x21A == 2729 (different HMMA subtype)
  • Same 9-implicit-operand structure
  • Sets pattern_id=12, priority=39

sub_F76DD0 -- HMMA wide operand A (7,226 bytes, 164 lines):

  • Checks field 0x216 == 2716
  • 1 explicit R128 + 8 implicit (R128 x3, predicate, R32, UREG x2)
  • Sets pattern_id=7, priority=34
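
The attribute-check stage of the widest HMMA matcher can be sketched as below. Only the three field comparisons are taken from the description of sub_F77140. The `attr` and `mock_node` types and the `read_field` helper are hypothetical stand-ins for the IR node and for sub_A49150, whose real behaviour for absent fields is not documented here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute storage standing in for an IR node. */
typedef struct { uint32_t field_id; uint32_t value; } attr;
typedef struct { const attr *attrs; size_t count; } mock_node;

/* Stand-in for sub_A49150(ctx, node, field_id); returns 0 if the field is absent. */
uint32_t read_field(const mock_node *n, uint32_t field_id)
{
    for (size_t i = 0; i < n->count; ++i) {
        if (n->attrs[i].field_id == field_id)
            return n->attrs[i].value;
    }
    return 0;
}

/* Attribute checks listed for sub_F77140 (operand checks omitted). */
bool hmma_widest_a_attrs_match(const mock_node *n)
{
    if (read_field(n, 0x216) != 2717)       /* HMMA opcode variant A */
        return false;
    if (read_field(n, 0xA1) != 700)         /* HMMA input precision A */
        return false;
    uint32_t prec_b = read_field(n, 0xA2);  /* HMMA input precision B */
    return prec_b >= 702 && prec_b <= 703;
}
```

A node carrying 0x216 = 2716 fails the first comparison here; that value is the one checked by sub_F76DD0 instead.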

Fallback Matcher

sub_FBB780 (1,108 bytes, 34 lines) is the fallback pattern that matches when nothing else does:

  • Zero instruction attribute checks
  • Requires: explicit operand count == 0, implicit count == 2, first implicit = uniform register R32
  • Sets pattern_id=1, priority=2 (lowest observed)
  • Because priority 2 is the lowest observed, any other pattern that matches overrides it
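
Put together, the fallback matcher might look like the sketch below. The operand and node types are hypothetical, the R32 check is assumed to accept either class 1 or the wildcard 1023, and the update rule assumes the matcher writes its outputs only when no better priority is already recorded. None of this is decompiled code.

```c
#include <stdint.h>

/* Hypothetical operand and node representations. */
typedef struct { uint8_t kind; uint32_t regclass; } mock_operand;
typedef struct {
    int          n_explicit;
    int          n_implicit;
    mock_operand implicit_ops[10];
} mock_node;

/* Sketch of sub_FBB780: no attribute checks, only operand-shape checks. */
char sm75_match_fallback(void *ctx, const mock_node *node,
                         uint32_t *pattern_id, uint32_t *priority)
{
    (void)ctx;
    if (node->n_explicit != 0)              /* zero explicit operands */
        return 0;
    if (node->n_implicit != 2)              /* exactly two implicit operands */
        return 0;

    const mock_operand *first = &node->implicit_ops[0];
    if (first->kind != 10)                  /* kind 10 = UREG */
        return 0;
    if (first->regclass != 1 && first->regclass != 1023)
        return 0;                           /* R32 or wildcard (assumed) */

    if (*priority >= 2)                     /* a better match already won */
        return 0;
    *priority = 2;
    *pattern_id = 1;
    return 1;
}
```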

Post-ISel Emit+Encode Functions

Thirty-eight functions at 0xFFFDF0--0x100BBF0 combine pattern matching with instruction encoding for complex instructions requiring immediate bitfield packing. They share the emitter signature __int64 (ctx, instr_node) and use sub_4C28B0(ctx, bit_offset, width, value) extensively to pack individual fields into the 128-bit SASS instruction word.

Encoding Protocol

1. Pack opcode bits:     sub_4C28B0(a1, 0, 4, 2)   -- 4 bits at offset 0
2. Pack sub-opcode:      sub_4C28B0(a1, 4, 3, 0)   -- 3 bits at offset 4
3. Pack encoding fields from rodata tables
4. Initialize operand binding via sub_4C2A60/sub_4C2A90
5. Extract instruction-specific modifiers via sub_A551C0..sub_A55470
6. Encode modifiers into bit positions via sub_A4xxxx/sub_A50xxx
7. Pack into encoding words at ctx+544 and ctx+552 (shift+OR)
8. Set relocation metadata at ctx+148..ctx+160
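
A bitfield packer in the spirit of sub_4C28B0(ctx, bit_offset, width, value) can be sketched as follows. The real routine's internals are not shown on this page, so this is an assumption-laden illustration: bit 0 is taken to be the least significant bit of encoding word 0, fields are OR-ed in without clearing previous contents (the "shift+OR" of step 7), and `enc128` and `pack_bits` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Two 64-bit encoding words, as stored at ctx+544 and ctx+552. */
typedef struct { uint64_t word[2]; } enc128;

/* OR `width` low bits of `value` into the 128-bit encoding at `bit`. */
void pack_bits(enc128 *e, unsigned bit, unsigned width, uint64_t value)
{
    assert(width >= 1 && width <= 64 && bit + width <= 128);
    for (unsigned i = 0; i < width; ++i) {
        unsigned pos = bit + i;                 /* absolute bit position */
        uint64_t b = (value >> i) & 1u;         /* i-th bit of the field */
        e->word[pos >> 6] |= b << (pos & 63);   /* select word, then bit */
    }
}
```

Working bit by bit keeps the sketch short and handles fields that straddle the boundary between the two words. Steps 1 and 2 of the protocol would then read `pack_bits(&e, 0, 4, 2)` followed by `pack_bits(&e, 4, 3, 0)`, leaving the value 2 in the low word.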

Identified Emit+Encode Functions

| Address | Size | Identity | Opcode | Sources |
| --- | --- | --- | --- | --- |
| 0xFFFDF0 | 6,810 B | sm75_emit_encode_memop_complex_4src | 126 | 4 src via sub_4C4D60 + sub_4C5C30 |
| 0x1000460 | 6,959 B | sm75_emit_encode_memop_complex_6src | 126 | 6 src, two relocation entries |
| 0x1000CD0 | 5,492 B | sm75_emit_encode_alu_3src_A | 18 | 3 src |
| 0x10012B0 | 5,493 B | sm75_emit_encode_alu_3src_B | 18 | 3 src |
| 0x1001890 | 5,407 B | sm75_emit_encode_alu_3src_C | 18 | 3 src |
| 0x1001E10 | 5,460 B | sm75_emit_encode_alu_3src_D | 18 | 3 src |
| 0x1002340 | 5,687 B | sm75_emit_encode_alu_4src_A | 18 | 4 src |
| 0x10028F0 | 5,688 B | sm75_emit_encode_alu_4src_B | 18 | 4 src |
| 0x10088D0 | 5,190 B | sm75_emit_encode_fp32_4op | 104 | 4 src, sub_A51DD0(node) == 1875 check |
| 0x1008DD0 | 5,329 B | sm75_emit_encode_fp32_4op_B | 104 | 4 src |
| 0x10092F0 | 5,191 B | sm75_emit_encode_fp32_4op_C | 104 | 4 src |
| 0x10097F0 | 5,330 B | sm75_emit_encode_fp32_4op_D | 104 | 4 src |
| 0x1009D10 | 5,138 B | sm75_emit_encode_fp32_4op_E | 104 | 4 src |
| 0x100A210 | 5,296 B | sm75_emit_encode_fp32_4op_F | 104 | 4 src |
| 0x100A730 | 5,435 B | sm75_emit_encode_fp32_4op_G | 104 | 4 src |
| 0x100AC70 | 5,297 B | sm75_emit_encode_fp32_4op_H | 104 | 4 src |
| 0x100B190 | 5,436 B | sm75_emit_encode_fp32_4op_I | 104 | 4 src |
| 0x100B6D0 | 5,297 B | sm75_emit_encode_fp32_4op_J | 104 | 4 src |
| 0x100BBF0 | 5,296 B | sm75_emit_encode_fp32_4op_K | 104 | 4 src |

The remaining 19 emit+encode functions (0x1002EA0--0x10083A0) follow the same structure with varying field encoding positions and are labeled as generic variants (I through AA).

Opcode distribution among emit+encode functions: 11 for FP32 (opcode 104), 8 for integer ALU (opcode 18), 2 for memory (opcode 126), 17 for undetermined variants.

Encoding Context Structure

The 576-byte emitter context is the central data structure threading through all emitter and emit+encode functions. It accumulates the operand bindings and bitfield encodings for one SM75 SASS instruction.

Offset  Size   Field
+0      8      Reserved / vtable pointer
+8      16     XMM register class descriptor (SSE-loaded from rodata)
+12     2      Instruction opcode number (18, 104, or 126)
+16     4      Base bit position for predicate encoding
+24     40     Operand register numbers: 10 x 4-byte slots (indices 0--9)
+64     40     Operand types / constraints: 10 x 4-byte slots
+104    40     Operand flags: 10 x 4-byte slots (0=def, 1-5=use, -1=unused)
+144    4      Explicit operand count
+148    4      Relocation type (first)
+152    4      Relocation bit offset (first)
+156    4      Relocation type (second)
+160    4      Relocation bit offset (second)
+276    8      Instruction class tag (e.g., 0xE000000004)
+404    32     Match/emit dispatch table pointer
+536    8      Pointer to instruction descriptor table
+544    8      Encoding word 0 (64-bit bitfield, low half of 128-bit)
+552    8      Encoding word 1 (64-bit bitfield, high half of 128-bit)
+558    2      Immediate value (16-bit)
+572    4      Branch/offset target

The operand descriptor arrays at offsets +24, +64, and +104 are populated with optimized SIMD memcpy (aligned SSE loads/stores copying 4 elements at a time from rodata descriptor tables).
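
The non-overlapping part of this layout can be written down as a C struct and checked at compile time. The sketch is a reading of the offset table, not a recovered type: field names are hypothetical, regions the table does not describe are left as opaque padding, and the entries at +8/+12/+16 and +558 are kept as raw bytes because their listed extents overlap neighbouring fields. It assumes a typical ABI where `uint32_t` is 4-byte aligned and `uint64_t` is 8-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  head[24];          /* +0:   reserved/vtable, class descriptor  */
    uint32_t op_reg[10];        /* +24:  operand register numbers           */
    uint32_t op_type[10];       /* +64:  operand types / constraints        */
    int32_t  op_flags[10];      /* +104: 0=def, 1-5=use, -1=unused          */
    uint32_t explicit_count;    /* +144 */
    uint32_t reloc0_type;       /* +148 */
    uint32_t reloc0_bit;        /* +152 */
    uint32_t reloc1_type;       /* +156 */
    uint32_t reloc1_bit;        /* +160 */
    uint8_t  gap0[276 - 164];   /* +164: not described in the table         */
    uint8_t  class_tag[8];      /* +276: kept as bytes (not 8-byte aligned) */
    uint8_t  gap1[544 - 284];   /* +284: dispatch/descriptor pointers etc.  */
    uint64_t enc_word[2];       /* +544, +552: 128-bit encoding             */
    uint8_t  tail[576 - 560];   /* +560: includes the +572 branch target    */
} sm75_emit_ctx;

/* Compile-time checks against the documented offsets. */
_Static_assert(offsetof(sm75_emit_ctx, op_reg) == 24, "op_reg at +24");
_Static_assert(offsetof(sm75_emit_ctx, op_type) == 64, "op_type at +64");
_Static_assert(offsetof(sm75_emit_ctx, op_flags) == 104, "op_flags at +104");
_Static_assert(offsetof(sm75_emit_ctx, explicit_count) == 144, "count at +144");
_Static_assert(offsetof(sm75_emit_ctx, class_tag) == 276, "tag at +276");
_Static_assert(offsetof(sm75_emit_ctx, enc_word) == 544, "encoding at +544");
_Static_assert(sizeof(sm75_emit_ctx) == 576, "context is 576 bytes");

/* Store both halves of the 128-bit instruction word. */
void sm75_ctx_set_encoding(sm75_emit_ctx *ctx, uint64_t lo, uint64_t hi)
{
    ctx->enc_word[0] = lo;      /* +544: low half  */
    ctx->enc_word[1] = hi;      /* +552: high half */
}
```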

Rodata Register Class Descriptors

Each instruction family has a 16-byte register class descriptor loaded from rodata into context offset +8 via SSE:

| Rodata Address | Instruction Family | Used By |
| --- | --- | --- |
| xmmword_1F46E28 | Memory operations (opcode 126) | sub_F10080, sub_F10620, sub_FFFDF0, sub_1000460 |
| xmmword_1F466B8 | Integer ALU (opcode 18) | sub_F10BE0, sub_F11090, sub_F11540 |
| xmmword_1F46630 | FP32 operations (opcode 104) | sub_10088D0--sub_100BBF0 |
| xmmword_1F47268 | Complex memory (emit+encode) | Post-ISel emit+encode functions |

Each descriptor has three parallel arrays of 10 DWORDs defining per-slot operand register IDs, type/constraint descriptors, and flag words. Example for memory operations: dword_1F46E38[0..9] (register IDs), dword_1F46E60[0..9] (types), dword_1F46E88[0..9] (flags).
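
In the memory-operation example, the addresses are spaced so that the 16-byte descriptor and the three 40-byte arrays are contiguous (0x1F46E28 + 16 = 0x1F46E38, then +40 and +40), which mirrors context offsets +8 through +143. The sketch below models that copy with plain `memcpy`. The struct and function names are hypothetical, and the SSE-based copy in the binary is replaced by portable C.

```c
#include <stdint.h>
#include <string.h>

/* Rodata block as implied by the example addresses (136 bytes in total). */
typedef struct {
    uint8_t  rc_desc[16];   /* e.g. xmmword_1F46E28 */
    uint32_t reg_ids[10];   /* e.g. dword_1F46E38   */
    uint32_t types[10];     /* e.g. dword_1F46E60   */
    int32_t  flags[10];     /* e.g. dword_1F46E88   */
} sm75_rodata_desc;

/* Destination slice of the emitter context, offsets +8 .. +143. */
typedef struct {
    uint8_t  rc_desc[16];   /* +8   */
    uint32_t op_reg[10];    /* +24  */
    uint32_t op_type[10];   /* +64  */
    int32_t  op_flags[10];  /* +104 */
} sm75_ctx_operands;

/* Copy one family descriptor into the context's operand arrays. */
void sm75_load_descriptor(sm75_ctx_operands *dst, const sm75_rodata_desc *src)
{
    memcpy(dst->rc_desc,  src->rc_desc, sizeof dst->rc_desc);
    memcpy(dst->op_reg,   src->reg_ids, sizeof dst->op_reg);
    memcpy(dst->op_type,  src->types,   sizeof dst->op_type);
    memcpy(dst->op_flags, src->flags,   sizeof dst->op_flags);
}

/* Count slots whose flag is not -1 (the documented "unused" marker). */
unsigned sm75_used_slots(const sm75_ctx_operands *ops)
{
    unsigned used = 0;
    for (int i = 0; i < 10; ++i) {
        if (ops->op_flags[i] != -1)
            ++used;
    }
    return used;
}
```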

External Dependencies

The SM75 backend relies on shared infrastructure functions used across all ISel backends:

IR Node Accessors

| Function | Signature | Description | Callers |
| --- | --- | --- | --- |
| sub_A49150 | (ctx, node, field_id) -> value | Read instruction attribute by field ID | 30,768 (binary-wide) |
| sub_530FD0 | (node) -> count | Get explicit operand count | Universal |
| sub_530FB0 | (node, idx) -> operand* | Get operand at index | 31,399 (binary-wide) |
| sub_530FC0 | (node) -> count | Get implicit operand count | Universal |
| sub_A49720 | (node) -> bool | Check instruction has side effects | Surface load matchers |
| sub_A51DD0 | (node) -> class | Get instruction class / post-condition | FP32 emit+encode |

Operand Binding Functions

| Function | Description |
| --- | --- |
| sub_4C6380(ctx, node, op, off, rc) | Bind source register operand |
| sub_4C60F0(ctx, node, op, off, rc) | Bind general register operand |
| sub_4C6DC0(ctx, node, op, off, rc) | Bind predicate register operand |
| sub_4C5F90(ctx, node) | Finalize operand binding |
| sub_4C28B0(ctx, bit, width, val) | Pack value into encoding bitfield |
| sub_4C2A60(ctx) | Initialize encoding |
| sub_4C2A90(ctx, node, flag) | Bind primary result |
| sub_4C4D60(ctx, node, op, off) | Bind source operand (complex) |
| sub_4C5C30(ctx, node, op, off) | Bind special operand |

Modifier Extraction and Encoding

Modifier fields are extracted from the IR node via sub_A55xxx functions and encoded into bit positions via sub_A4xxxx/sub_A50xxx functions:

| Extractor | Field | Encoder | Width |
| --- | --- | --- | --- |
| sub_A551C0 | Modifier 1 | sub_A4F970 | 3-bit |
| sub_A55220 | Modifier 2 | sub_A4D940 | 2-bit |
| sub_A55280 | Modifier 3 | sub_A4DC60 | 2-bit (alt) |
| sub_A55320 | Modifier 4 | sub_A4FDE0 | 4-bit |
| sub_A55340 | Modifier 5 | sub_A50260 | 2-bit |
| sub_A55400 | Modifier 6 | sub_A500E0 | 2-bit |
| sub_A55450 | Modifier 7 | sub_A500F0 | 5-bit |
| sub_A55470 | Modifier 8 | sub_A4FBC0 | 4-bit |

Additional encoding functions handle specific operand attributes: sub_509D90 (register source A), sub_509DB0 (register source B), sub_509F20 (comparison mode), sub_509160 (data type/precision), sub_509290 (rounding mode), sub_509890 (saturation/clamp), sub_50AC80 (source negate/abs), sub_50ACD0 (source modifier composite), sub_509800 (address mode), sub_509930 (thread scope), sub_509A90 (memory order), sub_50C820 (cache policy), sub_50B570 (texture mode).

Field ID Dictionary

Field IDs passed to sub_A49150 to query instruction attributes. These are the keys used by every pattern matcher to classify instructions:

| Field ID | Hex | Semantic Name | Known Values |
| --- | --- | --- | --- |
| 5 | 0x05 | Instruction major class | 12 = memory/special |
| 28 | 0x1C | Branch/jump type subfield | 123--124 |
| 46 | 0x2E | Integer comparison mode | 213 |
| 59 | 0x3B | Warp operation mode | 273--274 |
| 88 | 0x58 | Data type / precision code | 406--408 |
| 89 | 0x59 | Store type | 410--416 |
| 91 | 0x5B | Address space qualifier | 425--427 |
| 92 | 0x5C | Memory ordering | 429--430 |
| 105 | 0x69 | ALU function select | 477 |
| 116 | 0x74 | Texture/surface function | 512--513 |
| 123 | 0x7B | Special function unit selector | 536 = texture/surface |
| 126 | 0x7E | Cache coherence / eviction policy | 547--548 |
| 136 | 0x88 | Source negate/absolute modifier | 598--599 |
| 161 | 0xA1 | HMMA input precision A | 700 |
| 162 | 0xA2 | HMMA input precision B | 702--703 |
| 190 | 0xBE | NOP/barrier subtype | 815 |
| 201 | 0xC9 | Control flow subtype | 1109 |
| 203 | 0xCB | Integer multiply mode | 1113--1119 |
| 207 | 0xCF | Integer multiply variant | 1150--1158 |
| 211 | 0xD3 | Conversion subtype | 1182 |
| 220 | 0xDC | Load/store address mode | 1206 |
| 226 | 0xE2 | Matrix layout | 1229 |
| 229 | 0xE5 | Special instruction code | 1238 |
| 242 | 0xF2 | Addressing mode detail | 1281--1282 |
| 253 | 0xFD | MUFU function select | 1321 |
| 254 | 0xFE | I2I conversion mode | 1324 |
| 255 | 0xFF | Texture fetch type | 1327--1328 |
| 265 | 0x109 | Texture opcode variant | 1363/1366 |
| 281 | 0x119 | Warp shuffle type | 1435--1440 |
| 285 | 0x11D | Warp shuffle mode | 1454--1457 |
| 287 | 0x11F | Special indexed operation | 1464 |
| 294 | 0x126 | Memory bank selector | 1493 |
| 295 | 0x127 | Surface load type | 1495 |
| 302 | 0x12E | Set-predicate class | 1525 |
| 329 | 0x149 | Integer addressing mode | 1833--1837 |
| 338 | 0x152 | Source predication mode A | 1871/1873--1874 |
| 339 | 0x153 | HMMA accumulator type | 1877 |
| 341 | 0x155 | Source predication mode B | 1881--1882 |
| 345 | 0x159 | Memory scope / synchronization | 1899--1903 |
| 348 | 0x15C | Execution model qualifier | 1912--1915 |
| 355 | 0x163 | Data size / vector width | 1943/1947 |
| 356 | 0x164 | Texture data type | 1949 |
| 359 | 0x167 | Surface data type | 1960 |
| 376 | 0x178 | Memory persistence | 2035 |
| 377 | 0x179 | Memory eviction priority | 2037--2041 |
| 379 | 0x17B | Branch condition type | 2046 |
| 380 | 0x17C | Branch target type A | 2048--2049 |
| 381 | 0x17D | Branch target type B | 2052--2053 |
| 382 | 0x17E | Branch modifier | 2055--2060 |
| 394 | 0x18A | Convert source type | 2107--2108 |
| 397 | 0x18D | Destination predication | 2115 |
| 399 | 0x18F | HMMA sub-operation | 2121 |
| 404 | 0x194 | Comparison extension | 2140--2141 |
| 406 | 0x196 | HMMA configuration | 2146 |
| 407 | 0x197 | Comparison precision | 2148--2151 |
| 409 | 0x199 | Set-predicate comparison | 2155 |
| 413 | 0x19D | Memory segment | 2167--2168 |
| 423 | 0x1A7 | Source data type | bitmask test |
| 424 | 0x1A8 | Function lookup | bitmask test (739) |
| 429 | 0x1AD | Memory ordering qualifier | 2253--2257 |
| 430 | 0x1AE | Source A comparison | 2259--2260 |
| 431 | 0x1AF | Source A comparison ext | 2262--2263 |
| 465 | 0x1D1 | Source B comparison | 2420--2421 |
| 466 | 0x1D2 | Source B comparison ext | 2423--2424 |
| 468 | 0x1D4 | HMMA step select | 2429--2430 |
| 480 | 0x1E0 | Matrix multiply type | 2480/2482/2485 |
| 484 | 0x1E4 | Set-predicate subclass | 2502 |
| 492 | 0x1EC | Comparison boolean combine | 2524--2525 |
| 494 | 0x1EE | Branch target form | 2529--2530 |
| 505 | 0x1F9 | HMMA operand layout A | 2569 |
| 506 | 0x1FA | HMMA operand layout B | 2571 |
| 508 | 0x1FC | Shift/funnel type | 2576--2577 |
| 524 | 0x20C | Branch distance | 2678--2679 |
| 534 | 0x216 | HMMA opcode variant A | 2716--2717 |
| 535 | 0x217 | HMMA source C layout | 2719--2720 |
| 536 | 0x218 | HMMA source D layout | 2722--2723 |
| 538 | 0x21A | HMMA opcode variant B | 2729 |
| 539 | 0x21B | HMMA mode X | 2731--2736 |
| 540 | 0x21C | HMMA mode Y | 2738--2743 |
| 547 | 0x223 | HMMA step A | 2767--2768 |
| 548 | 0x224 | HMMA step B | 2770 |
| 549 | 0x225 | HMMA step C | 2772 |
| 569 | 0x239 | Integer set-predicate type | 2850--2851 |
| 575 | 0x23F | Memory base addressing | 2870 |
| 576 | 0x240 | Memory indexed addressing | 2872 |
| 583 | 0x247 | Conversion class | 2892 |
| 595 | 0x253 | Store qualifier | 2937--2938 |

Turing-Specific Design Observations

Uniform Register File (URF). SM75 introduced the uniform register file -- scalar registers whose value is identical across all threads in a warp. This eliminates redundant per-lane computation for warp-uniform values. In the ISel backend, UREG (kind 10) appears as a first-class operand type alongside REG (kind 2). Many pattern matchers accept either kind in the same operand slot, reflecting that SASS instructions can take operands from either the general or uniform register file.

HMMA complexity. The tensor core (HMMA) instruction family drives the most complex patterns in this backend. The R128 register class (ID 4) exists specifically for HMMA, representing four consecutive 32-bit registers that hold matrix fragments. The highest-priority matchers (priority 39) are all HMMA variants, and the largest individual matcher functions (8+ KB) target HMMA.

Linear scan architecture. Unlike tree-pattern matchers (as used in LLVM's TableGen-generated ISel), this backend evaluates all patterns sequentially. The 280 KB mega-hub calls each of the 276 matchers in order, collects the highest-priority match, then dispatches to the winning emitter. This is computationally expensive (O(patterns) per instruction) but simple to extend: adding a new pattern requires only inserting a new matcher function into the sequence.

128-bit instruction encoding. SM75 uses 128-bit SASS instructions (two 64-bit words at context offsets +544 and +552). The sub_4C28B0 primitive packs arbitrary-width bit fields at arbitrary positions within these two words. Modifier fields are extracted from the IR node and encoded at precise bit positions, with different emit+encode variants differing only in which bits they set within the 128-bit word.

Confidence Assessment

| Claim | Confidence | Verification |
| --- | --- | --- |
| ISA class string "Turing" for sm_75 | CONFIRMED | Decompiled sub_484F50 line 251: "Turing"; string in nvlink_strings.json at 0x1d409dc |
| SM75 backend at 0xF16000--0x100C000 (984 KB) | HIGH | Address range consistent with decompiled function addresses in the catalog; mega-hub at sub_FBB810 falls within range |
| Mega-hub sub_FBB810 at 280 KB, 65,999 instructions | HIGH | Size claim derived from binary analysis; too large for Hex-Rays decompilation, consistent with other mega-hubs |
| 276 pattern matchers at 0xF16150--0xFBB780 | HIGH | Pattern addresses verified against decompiled function catalog; representative patterns like sub_FBB780 (fallback) confirmed |
| 15 operand predicates at 0xF16030--0xF16110 | HIGH | Address range and trivial predicate structure consistent with ISel infrastructure |
| Operand kind tags: 1=IMM, 2=REG, 10=UREG, etc. | HIGH | Consistent with shared infrastructure used across SM75/80/89/90 backends |
| Priority-based linear scan ISel architecture | HIGH | Protocol described matches mega-hub structure: iterate all matchers, pick highest priority |
| 18 emitter functions at 0xF10080--0xF15A50 | HIGH | Addresses consistent with function catalog |
| Opcode families: 18 (int ALU), 104 (FP32), 126 (memory) | HIGH | Opcode numbers from decompiled emit+encode functions at *(a2+12) assignments |
| Register class 1023 = wildcard | HIGH | Consistent across all ISel backends; sentinel value used in operand matching |
| Dispatch table: sm_75 encoding table = sub_15C3210 | CONFIRMED | Decompiled sub_15C0CE0 shows sm_75 registration (earlier in file) |
| __CUDA_ARCH__=750 | CONFIRMED | String at 0x1d409c8; decompiled sub_484F50 line 252 |
| 128-bit instruction encoding at ctx+544/+552 | HIGH | Consistent across all SM75+ backends; encoding word offsets documented |
| Field ID dictionary (500+ field IDs) | MEDIUM | Field IDs from pattern matcher analysis; individual values not independently verified but consistent with sub_A49150 usage |

For general SM75 architecture details, see the ptxas wiki: Turing/Ampere and cicc wiki: SM70-89.

Cross-References

Sibling Wikis