Mercury Overview
Mercury is NVIDIA's internal codename for a new GPU ISA binary format that replaces the legacy SASS (Shader ASSembler) encoding for modern GPU architectures. The name is ROT13-obfuscated throughout the binary as "Zrephel" -- applying ROT13 to "MERCURY" yields "ZREPHEL", the form seen in all instruction-level string tables. In nvlink v13.0.88, Mercury surfaces across four distinct subsystems: the MercExpand instruction expansion engine, the capsule mercury (capmerc) ELF format, the R_MERCURY relocation family, and the FNLZR (finalizer) that converts between SASS and Mercury representations.
String Evidence Summary
| Type | Count | Address Range | Examples |
|---|---|---|---|
mercury / Mercury | 82 | 0x1D35A17--0x245EF38 | R_MERCURY_ABS64, EIATTR_MERCURY_ISA_VERSION, mercury,capmerc,sass |
Zrephel / ZREPHEL | 667 | 0x1D42C80--0x1D4DF80 | ROT13-encoded SASS builtins: ZREPHEL_zoneevre_neevir = MERCURY_mbarrier_arrive |
R_MERCURY_* | 67 | 0x1D35A17--0x1D35F4C | 65 unique relocation types plus R_MERCURY_NONE and R_MERCURY_NONE_LAST sentinels |
.nv.merc.* | 20 | 0x24582E8--0x2458D00 | .nv.merc.debug_info, .nv.merc.rela, .nv.merc.symtab_shndx |
capmerc | 7 | 0x1D33FA9--0x1D41EF8 | capmerc.cubin, --binary-kind capmerc, self-check strings |
FNLZR | 17 | 0x1D32381--0x2458F10 | FNLZR: Input ELF: %s, FNLZR: Pre-Link Mode, FNLZR: JIT Path |
Architecture Generation Mapping
Mercury is not a single monolithic format. It has two deployment tiers tied to GPU architecture:
| Architecture | SM Range | Mercury Role | Default --binary-kind |
|---|---|---|---|
| Hopper | SM90, SM90a | Mercury format available but not default. SASS remains the standard output. MercExpand runs in the backend pipeline | sass |
| Blackwell | SM100, SM100a, SM100f | Mercury is default. Capsule Mercury (capmerc) is the standard ELF output format | capmerc |
| Blackwell Ultra / Future | SM103, SM120, SM121 | Mercury-only. No legacy SASS path | capmerc |
The --binary-kind CLI flag at 0x1D41D94 (xref 0x4ACC47) selects the output format:
--binary-kind <mercury|capmerc|sass>
Specify the type of target ELF binary kind.
Default on sm100+ is capmerc.
The three valid values are parsed from the string "mercury,capmerc,sass" at 0x1D41D03 (xref 0x4AC55C). The option description at 0x1D41E78 confirms the sm100+ default: "Specify the type of target ELF binary kind. Default on sm100+ is capmerc".
ROT13 Obfuscation Scheme
NVIDIA applies ROT13 encoding to Mercury-related instruction mnemonics throughout the binary. This is a trivial Caesar cipher (A-M swap with N-Z) applied character-by-character, preserving underscores and digits. The encoding is consistent across all 667 ZREPHEL_* strings.
Key decodings:
| ROT13 (in binary) | Decoded (real name) | Instruction Category |
|---|---|---|
ZREPHEL | MERCURY | ISA prefix |
ZREPHEL_zoneevre_neevir | MERCURY_mbarrier_arrive | Barrier ops (124 strings) |
ZREPHEL_oneevre_neevir_flap | MERCURY_barrier_arrive_sync | Barrier ops (86 strings) |
ZREPHEL_jnectebhc_zzn_flap_0 | MERCURY_warpgroup_mma_sync_0 | Warpgroup MMA (40 strings) |
ZREPHEL_ngbz_nqq_f32 | MERCURY_atom_add_s32 | Atomics (36 strings) |
ZREPHEL_erqhk_f32_flap | MERCURY_redux_s32_sync | Reductions (32 strings) |
ZREPHEL_srapr_zoneevref | MERCURY_fence_mbarriers | Fences (32 strings) |
ZREPHEL_gptra05_yq | MERCURY_tcgen05_ld | SM100 tensor ops (4 strings) |
The 667 ROT13-encoded entries represent the Mercury-ISA-specific builtin instruction templates. These are instruction descriptors used by the ISel pattern matchers and the MercExpand engine. Each string encodes the full instruction signature: opcode, operand types (e.g., fepf = srcs for source operands, he4 = ur4 for uniform register 4-wide), and variant index.
Instruction category distribution across the 667 ZREPHEL builtins:
| Category (decoded) | Count | Description |
|---|---|---|
mbarrier | 124 | Async barrier operations (arrive, wait, test, try_wait, pend) |
barrier | 86 | Classic warp-level barrier synchronization |
warpgroup | 40 | Warpgroup MMA operations (fp16, fp8, int, sparse variants) |
atom | 36 | Atomic memory operations |
redux | 32 | Warp-level reductions (s32, u32) |
fence | 32 | Memory fence operations |
max, min, addmin, addmax | 78 | Integer min/max combinators |
elect | 20 | Warp-level leader election |
match | 16 | Warp match operations |
vabsdiff4 | 14 | SIMD 4-byte absolute difference |
mov, selmov | 16 | Data movement |
vote | 12 | Warp-level voting |
tcgen05 | 4 | SM100 Blackwell tensor core gen05 |
| Others | 57 | cvt, cvta, mapa, createpolicy, shfl, ld, st, cp, red, fma, sad, predict, multimem, griddepcontrol |
Mercury Pipeline in the Backend Compiler
Mercury processing occurs in the backend scheduling/encoding pipeline within the embedded ptxas. Four named Mercury passes are identified from the pass table at 0x2443C00:
| Pass Name | String Address | Xref Address | Stage |
|---|---|---|---|
MercEncodeAndDecode | 0x2443CA2 | 0x24443F0 | Encode IR to Mercury binary, then decode for verification |
MercExpandInstructions | 0x2443CB6 | 0x24443F8 | Expand pseudo-instructions into Mercury machine operations |
MercGenerateWARs1 | 0x2443CCD | 0x2444400 | Generate write-after-read hazard barriers (first pass) |
MercGenerateOpex | 0x2443CDF | 0x2444408 | Generate operand extensions for wide encodings |
MercGenerateWARs2 | 0x2443CF0 | 0x2444410 | Generate WAR barriers (second pass, post-opex) |
MercGenerateSassUCode | 0x2443D02 | 0x2444418 | Generate final SASS microcode from Mercury representation |
PostFixForMercTargets | 0x2443C44 | 0x24443C0 | Target-specific fixups for Mercury architectures |
Additional Mercury-prefixed pass markers confirmed from logging strings:
| Marker String | Address | Xref | Engine |
|---|---|---|---|
"After MercExpand" | 0x1DFE320 | 0x5FF15E | MercExpand dispatch at sub_5FDDB0 |
"After MercConverter" | 0x241F913 | 0x19798F8 | MercConverter in scheduling pipeline |
"After MercWARs" | 0x1D41C60 | 0x4A480A | WAR hazard barrier insertion |
"After MercOpex" | 0x1D41C6F | 0x4ABC3E | Operand extension generation |
The MercExpand Engine
MercExpand is the instruction expansion pass that lowers IR-level pseudo-instructions into Mercury machine operations. It occupies the address range 0x5E4470--0x600260 (~112 KB, ~40 functions) in the binary.
Mercury IR Before Expansion
Before MercExpand runs, each function's IR is a doubly-linked list of instruction nodes threaded through basic blocks. Every IR node has the following layout (all offsets from the node pointer):
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | prev | Previous node in linked list |
| +8 | 8 | next | Next node in linked list |
| +16 | -- | ir_body | Start of IR body (handlers receive node+16) |
| +28 | 2 | opcode_tag | IR opcode type tag (the dispatch key) |
| +32 | 4 | data_type_id | Data type identifier (used for surface ops check: 559-560) |
| +36 | 4 | sub_type | Sub-type or variant index |
| +48 | 8 | operand_list | Pointer to operand array base |
| +56 | 4 | num_operands | Operand count (signed, checked > 0 for surface ops) |
| +112 | 8 | metadata_ptr | Pointer to metadata block (contains descriptor index at +20) |
| +120 | 8 | cfg_group_ptr | Pointer to CFG group node (basic block membership) |
| +128 | 8 | extended_info | Extended instruction info pointer |
| +148 | 4 | flags | Instruction flag bitvector (bit 0 = terminator, bit 18 = barrier flag) |
| +149 | 1 | sched_hint | Scheduling hint byte (bit 7 = needs ordering constraint) |
| +152 | 4 | bb_id | Basic block identifier |
The opcode tag at +28 is a 16-bit signed integer. Pre-expansion opcodes are abstract: they represent generic operations (e.g., "add", "load", "branch") that do not yet encode Mercury-specific encoding constraints. The tag value -1 (0xFFFF) marks basic block terminators; tag 120 marks special metadata nodes (PHI inputs, call descriptors, inline constants).
Each instruction also carries an attribute bag accessible via sub_A49150(context, ir_body, attr_id) and sub_A49120(context, ir_body, attr_id, value). Key attribute IDs observed in the expansion pass:
| Attribute ID | Description | Observed Values |
|---|---|---|
| 5 | Data type class | 12 = predicate type |
| 88 | Post-expansion flag | 408 = "was MercExpand'd" marker |
| 118 | Return semantics | 519 = return/exit opcode marker |
| 199 | Operand mode | 1104 = special operand form |
| 200 | Instruction class | 1107 = MOV pseudo requiring expansion |
| 202 | Memory scope | 1111 = specific scope tag |
| 227 | Target opcode | 1233 = MOV target opcode marker |
| 257 | Comparison mode | 1332 = specific comparison |
| 295 | Memory operation class | 1495/1496/1499/1500 = different memory categories |
| 297 | Address space | 1505/1506 = global/shared address space |
| 315 | Barrier type | 1774 = specific barrier form |
| 319 | Register constraint | 1791 = register alignment constraint |
| 345 | Expansion directive | 1901 = standard expansion |
| 348 | MercExpand state | 1912/1913/1914/1915 = expansion lifecycle states |
| 359 | Return mode | 1960 = return instruction marker |
| 527, 534, 538 | Opex flags | Operand extension indicators (case 90) |
The attribute-348 field is particularly important -- it tracks where each instruction is in the MercExpand lifecycle:
| Value | Meaning |
|---|---|
| 1912 | Needs expansion (pre-expand marker) |
| 1913 | Currently expanding (in-progress) |
| 1914 | Expansion complete, needs fixup |
| 1915 | Fully expanded and finalized |
Core Architecture
The engine operates on a per-basic-block basis, iterating the IR instruction linked list. For each node, it dispatches to specialized handlers based on the IR opcode type (field at node offset +28).
Main dispatch function: sub_5FDDB0 (MercExpand_Dispatch, ~25.5 KB)
Dispatch Initialization
Before entering the main loop, the dispatch function performs three setup steps:
-
Mercury capability check: Reads the target descriptor at
context+312and queries capability flags via the vtable at +72. If the target's byte at descriptor offset +216 equals 1, reads a flag from offset +224. Alternatively checks byte at offset +864. These determine whether Mercury expansion is needed at all (v85= skip flag). -
CFG group allocation: Allocates two CFG group objects, each 64 bytes, via the arena allocator at
***context+16. Each CFG group has a doubly-linked list structure:
CFG Group (64 bytes):
+0: prev pointer (or NULL for head)
+8: sentinel (points to self+16)
+16: self-pointer (list anchor)
+24: NULL (reserved)
+32: tail pointer (points to self)
+40: back-pointer (points to self+16)
+48: state word = 2 (initial state: "open")
+56: ref-count object pointer
The two groups serve as: v81 = "definition group" (tracks where new instructions are inserted) and v11/v82 = "resource group" (tracks register resource accounting boundaries).
- State initialization: Sets
v90 = -1(last processed BB ID),v86 = 0(predication tracking),v83 = 0(expansion-needed flag),v84 = 0(secondary flag).
The Main Iteration Loop
MercExpand_Dispatch(pass_state):
context = pass_state->context // at +24
func_body = context->func_body // at +312
bb_list = context->basic_blocks // at +24 (linked list)
// 1. Check if Mercury expansion is needed
target_desc = func_body->target_descriptor // at [func_body+72]
mercury_ver = target_desc[9]->byte_216
if mercury_ver == 1:
skip_flag = target_desc[9]->dword_224
has_mercury = call_vtable(func_body, cap_12) | sub_A48AA0(context)
else:
has_mercury = sub_A48AA0(context) | (target_desc[9]->byte_864 != 0)
// 2. Allocate two CFG group trackers
def_group = alloc_cfg_group(context) // 64 bytes, state=2
res_group = alloc_cfg_group(context) // 64 bytes, state=2
// 3. Initialize tracking state
last_bb_id = -1
pred_state = 0
expand_needed = 0
secondary_flag = 0
// 4. Walk basic block linked list
node = bb_list->first_child // context+24 -> offset+24
while node != bb_list_sentinel:
opcode_tag = node->word_28
// --- Handle BB terminator (opcode_tag == -1) ---
if opcode_tag == -1:
if node->flags & 1: // bit 0 of +148
prev_pred = context->byte_237 ^ 1
update predication tracking
pred_state = 1
pass_state->dword_704 = node->dword_152
// --- Handle special node (opcode_tag == 120) ---
elif opcode_tag == 120:
goto instruction_processing // skip to per-instruction
// --- Handle normal instructions ---
else:
// Assign node to current def_group (if expansion active)
if !has_mercury && def_group->state != 2:
node->cfg_group_ptr = def_group
// Allocate fresh def_group
def_group = alloc_cfg_group(context)
// Call resource accounting on current res_group
if res_group->state != 2:
sub_5F8B60(pass_state, node+16, res_group, &last_bb_id)
// Allocate fresh res_group
res_group = alloc_cfg_group(context)
// --- Per-instruction processing ---
instruction_processing:
// Query target capabilities (vtable+1048 and +792)
if expand_needed:
if call_vtable(target, cap_1048, node+16):
sub_5F8980(pass_state, node) // reset expansion
expand_needed = 0
secondary_flag = 0
// Check vtable+792 for secondary capability
if call_vtable(target, cap_792, node+16):
secondary_flag = result
// Record current/previous instruction pointers
pass_state->ptr_8 = node
pass_state->ptr_16 = node
// Check scheduling constraints from target descriptor
if context->dword_208 != context->dword_212
&& node->byte_149 < 0:
target_info = context->ptr_312 -> ptr_72
if target_info->byte_3384 == 1 && target_info->dword_3392:
sub_5F9B10(pass_state, node) // apply scheduling hint
elif !target_info->byte_3456
|| (target_info->byte_3456 == 1 && target_info->dword_3464):
return 1 // early exit: constraint met
// MOV pseudo-instruction special case
if sub_A497D0(node+16) // check instruction kind
&& sub_A49150(context, node+16, 200) == 1107:
sub_5FC6B0(pass_state, node) // ExpandMOV
// Opcode dispatch switch
switch opcode_tag:
// ... (see dispatch table below)
// Post-dispatch: check for abort
if pass_state->byte_886:
return 1 // abort expansion
// Advance to next node
node = next_node
The Opcode Dispatch Table
The switch statement at the heart of sub_5FDDB0 dispatches on 30 distinct opcode tag values. Each case either calls a direct handler function or dispatches through a vtable. The vtable is at *pass_state (the first pointer in the pass state object).
switch (opcode_tag): // node->word_28
case 0: vtable[48/8=6] (pass_state, node) // Generic expansion
case 5: register_width_clamp(node, max=15) // Clamp register width
case 8: register_width_clamp(node, max=15) // (same as 5)
case 9: register_width_clamp(node, max=15) // (same as 5)
case 11: complex_3way_dispatch(pass_state, node) // See below
case 12: vtable[136/8=17] (pass_state, node) // Conversion expansion
case 17: debug_node_handler(pass_state, node) // Debug info (if +1536 set)
case 19: vtable[40/8=5] (pass_state, node) // Pre-header ops
case 22: vtable[144/8=18] (pass_state, node) // Texture ops (pair with 23)
case 23: vtable[144/8=18] (pass_state, node) // Texture ops (pair with 22)
case 27: memory_op_dispatch(pass_state, node) // Memory load/store (complex)
case 34: comparison_handler(pass_state, node) // Compare-and-branch
case 35: sub_5F7D20 + vtable[32/8=4] // Predicated ops
case 50: vtable[112/8=14] (pass_state, node) // Miscellaneous
case 56: atomic_handler(pass_state, node) // Atomic read-modify-write
case 57: vtable[72/8=9] (pass_state, node) // Arithmetic
case 59: vtable[160/8=20] (pass_state, node) // Returns next_node
case 65: vtable[160/8=20] (pass_state, node) // (same as 59)
case 66: vtable[152/8=19] (pass_state, node) // Returns next_node
case 67: vtable[152/8=19] (pass_state, node) // (same as 66)
case 69: vtable[168/8=21] (pass_state, node) // Special form
case 71: vtable[16/8=2] (pass_state, node) // Branch + WAR check
case 73: call_handler(pass_state, node) // Function calls
case 77: predicate_handler(pass_state, node) // Predicate definition
case 78: predicate_handler(pass_state, node) // (same as 77)
case 90: opex_check(pass_state, node) // Operand extension check
case 94: comparison_fixup(pass_state, node) // Compare post-fixup
case 95: sub_5F8380(pass_state, node) // Immediate handling
case 97: vtable[176/8=22] (pass_state, node) // Post-pass ops
case 98: conditional_rewrite(pass_state, node) // Conditional vtable+536
case 99: vtable[128/8=16] (pass_state, node) // Shuffle/permute
case 120: special_node_handler(pass_state, node) // Metadata / inline const
case 130: barrier_handler(pass_state, node) // Barrier ops (WAR flag)
case 205: sub_5F7A80(pass_state, node) // Named sub-handler
case 210: vtable[24/8=3] (pass_state, node) // Late-stage ops
default: (no action, fall through to post-dispatch)
Case 11: The Three-Way Dispatch
Case 11 is the most complex case. It checks three capabilities to select a handler:
case 11:
target = context->ptr_416
if call_vtable(target, cap_584, node+16):
sub_5F80E0(pass_state, node) // FNV-1a hash lookup path
elif call_vtable(target, cap_1160, node+16):
sub_5FAC90(pass_state, node) // Shared memory expansion
elif node->num_operands > 0
&& node->operand_list->byte_32 == 6
&& (node->operand_list->dword_36 - 559) <= 1:
sub_5FC1B0(pass_state, node) // Surface read/write
else:
vtable[88/8=11](pass_state, node) // Generic fallback
Case 27: Memory Operation Dispatch
Case 27 handles load/store instructions with the richest attribute-based dispatch:
case 27:
// Check expansion state attribute (348)
if has_attr(348):
state = get_attr(348)
if state == 1915: skip (already done)
if state == 1913: sub_5FA360 -- in-progress expansion
if state == 1912: sub_5F9FD0 -- needs expansion
// Clear attribute 348 and mark as expanded
clear_attr(348)
set_attr(348, 1915)
invalidate(node)
// Classify by address space (attribute 297)
addr_space = get_attr(297)
if addr_space == 1506 && num_operands == 6:
sub_5FB5B0(pass_state, node) // HandleGlobalMem
elif addr_space == 1506 && num_operands == 5
|| addr_space == 1505:
// Check register constraint (attr 319)
if get_attr(319) == -1:
// Inject alignment constraint
operands->dword_132 += 1
operands->dword_148 = 1
set_attr(319, 1791)
invalidate(node)
elif addr_space == 1500:
sub_5FBC30(pass_state, node) // HandleConstMem
elif (addr_space - 1495) & ~4 == 0
&& get_attr(202) == 1111
&& get_attr(359) == 1960:
sub_5FCE20(pass_state, node) // ExpandRETURN
elif addr_space == 1496:
flag = sub_5FA920(pass_state, node, secondary_flag)
expand_needed |= flag
secondary_flag &= flag
else:
vtable[80/8=10](pass_state, node) // Generic memory handler
sub_5F7A00(pass_state) // Post-expansion NOP removal
Case 120: Special Node Handler
Case 120 handles metadata nodes, particularly inline constants and PHI inputs:
case 120:
metadata = node->metadata_ptr // at +112
kind = *metadata // first dword
if kind == 3:
// Call descriptor -- skip, handled by call expansion
pass
elif kind == 0:
// Inline constant definition
if !has_mercury:
def_5F8C7E(pass_state, node+16, def_group)
sub_A48B20(context, node, 0)
elif kind == 2:
// PHI input from another basic block
if context->word_1544 > 0x100 && metadata[2] != 52:
// Insert into resource group's linked list
alloc list_node from res_group freelist
list_node->data = metadata
append to res_group's doubly-linked list
res_group->count++
sub_A48B20(context, node, 0)
else:
sub_A48B20(context, node, 0) // generic cleanup
Post-Loop Finalization
After processing all instructions, the dispatch function performs final bookkeeping:
// After all nodes processed:
func_info = context->ptr_1280 // at context[160]
if func_info:
if func_info->byte_92:
func_info->dword_88 += pass_state->dword_880
else:
func_info->byte_92 = 1
func_info->oword_76 = 0 // zero 16 bytes
func_info->dword_88 = accumulated_value
context->dword_1540 = 1 // mark expansion complete
return 0 // success
The Per-Instruction Handler (sub_5F38E0)
sub_5F38E0 is the 35 KB per-instruction expansion function. It is called from sub_5FDDB0 indirectly for most non-trivial opcode types. While the dispatch function selects which handler runs, this function performs the actual expansion -- converting one IR instruction into potentially multiple Mercury machine operations with proper register constraints and scheduling parameters.
Initialization
The function initializes a 144-byte local state structure on the stack:
HandleInstruction(state, context, bb_list, is_predicated, pass_number):
// Zero the local state
state->ptr_24 = &input_operands
state->dword_48 = 0
memset(state+48, 0, 0x10) // clear 16 bytes
state->ptr_72 = NULL
state->byte_40 = 1 // "first instruction" flag
state->ptr_88 = NULL // last_scheduling_dep
state->ptr_96 = NULL // last_texture_dep
state->ptr_104 = NULL // last_shared_dep
state->ptr_112 = NULL // last_global_dep
state->ptr_128 = NULL // last_barrier_dep
state->ptr_136 = NULL // last_const_dep
state->ptr_120 = NULL // last_fence_dep
state->byte_41 = 0 // not in reorder zone
state->byte_42 = 0 // no opex needed
state->byte_43 = is_predicated ^ 1 // inverted predication flag
// Invalidate register state for this basic block
sub_5EA4F0(context, is_predicated)
The Descriptor Lookup
For each IR instruction, the handler looks up the corresponding target instruction descriptor -- a 184-byte structure that defines the Mercury encoding constraints:
// Get descriptor index from instruction metadata
desc_index = *(inst->metadata_ptr + 20) // metadata[5]
// Lookup in descriptor table (indexed or hash-map backed)
if desc_index <= target_state->max_direct_index: // at target+840
descriptor = target_state->desc_table + 184 * desc_index // at target+832
else:
descriptor = hash_lookup(target_state->desc_hash, desc_index) // at target+848
The descriptor at pointer v19 has this layout:
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 4 | desc_id | Descriptor unique ID |
| +4 | 4 | sched_class | Scheduling class (stored as v193) |
| +8-103 | -- | constraint_bitvectors | Register constraint data (6 pairs at +8, +40, +104, +120) |
| +104 | 16 | src_reg_constraints | Source register constraint bitvectors |
| +120 | 16 | dst_reg_constraints | Destination register constraint bitvectors |
| +152 | 2 | reg_class_0 | Register class constraint word 0 |
| +154 | 2 | reg_class_1 | Register class constraint word 1 |
| +156 | 2 | reg_class_2 | Register class constraint word 2 |
| +158 | 2 | reg_class_3 | Register class constraint word 3 |
| +160 | 2 | reg_class_4 | Register class constraint word 4 |
| +162 | 2 | reg_class_5 | Register class constraint word 5 |
| +164 | 1 | has_src_constraint | Source operand constraint flag |
| +165 | 1 | has_dst_constraint | Destination operand constraint flag |
| +166 | 1 | has_src_bitvec | Source bitvector constraint present |
| +167 | 1 | has_dst_bitvec | Destination bitvector constraint present |
| +176 | 1 | is_pseudo | Pseudo-instruction flag |
| +177 | 1 | is_eliminated | Eliminated during expansion |
Register Constraint Propagation
After descriptor lookup, the handler applies register constraints in a strict order. The constraint system has four layers, each with source (operand_index=0) and destination (operand_index=4) variants, plus bidirectional variants (operand_index=2,3):
// Layer 1: Simple operand constraints (sub_5F1C50)
if descriptor->has_src_constraint: // byte at +164
sub_5F1C50(context, operands, &offset, node+16, 0, &descriptor[+8])
sub_5F1C50(context, operands, &offset, node+16, 4, &descriptor[+8])
if descriptor->has_dst_constraint: // byte at +165
sub_5F1C50(context, operands, &offset, node+16, 3, &descriptor[+40])
sub_5F1C50(context, operands, &offset, node+16, 2, &descriptor[+40])
// Layer 2: Bitvector constraints (sub_5F0180)
if descriptor->has_src_bitvec: // byte at +166
sub_5F0180(context, operands, &offset, node+16, 0, &descriptor[+104], 0, 0, 0, 0)
sub_5F0180(context, operands, &offset, node+16, 4, &descriptor[+104], 0, 0, 0, 0)
if descriptor->has_dst_bitvec: // byte at +167
sub_5F0180(context, operands, &offset, node+16, 3, &descriptor[+120], 0, 0, 0, 0)
sub_5F0180(context, operands, &offset, node+16, 2, &descriptor[+120], 0, 0, 0, 0)
// Layer 3: Register class constraints (6 class words at +152..+162)
for i in 0..5:
class_word = descriptor->reg_class[i] // at +152 + 2*i
if class_word != 0:
sub_5F0180(context, operands, &offset, node+16,
src_or_dst[i], 0, class_word, 0, 0, 0)
sub_5F0180(context, operands, &offset, node+16,
complement[i], 0, class_word, 0, 0, 0)
The sub_5F0180 function (14.2 KB) is the core register constraint propagation engine. It takes 10 parameters and operates on bitvector scan data:
- Parameter
a5selects the operand direction: 0=source, 2=dst-as-src, 3=src-as-dst, 4=destination - Parameters
a6-a10provide constraint data from different layers of the descriptor - The function maintains 6 local bitvector buffers (48 bytes each at
v41-v46) for intermediate results - It uses the state structure at
a1+168,a1+232,a1+328to track accumulated constraints across operands
Scheduling Distance Computation
After register constraints, the handler computes scheduling distances -- the minimum number of cycles between dependent instructions:
// Query scheduling distance from target capabilities
target_caps = sub_4FBCF0(target->desc_312, node+16, 0)
if target_caps:
// Check capability 95 (write-after-read distance)
if has_cap(target_caps, 95):
war_distance = get_cap_value(target_caps, 95)
// Check capability 96 (read-after-write distance)
elif has_cap(target_caps, 96):
raw_distance = get_cap_value(target_caps, 96)
// Check capability 97 (barrier-to-instruction distance)
// Check capability 98 (instruction size for opex)
// Compute stall cycles for dependent pair
if previous_instruction:
desc_prev = previous_instruction->metadata_ptr
has_reorder = sub_5EF3C0(context, prev_inst, node+16)
offset_prev = desc_prev->dword_0
new_start = offset_prev + (has_reorder ? 2 : 1)
if new_start < state->current_offset:
new_start = state->current_offset
state->current_offset = new_start
// Apply result to instruction metadata
inst_meta->dword_0 = state->current_offset // cycle start
inst_meta->dword_4 = state->max_offset // cycle end bound
Dependency Chain Tracking
The function tracks seven categories of instruction dependencies for scheduling:
// Track dependencies by memory space / operation type
state->ptr_88 -- last instruction using scheduling class (general dep)
state->ptr_96 -- last texture/sampler operation
state->ptr_104 -- last fence/sync operation
state->ptr_112 -- last global memory barrier
state->ptr_120 -- last fence instruction
state->ptr_128 -- last barrier instruction
state->ptr_136 -- reserved for cross-BB tracking
// For each instruction, compute distance to its dependency:
if descriptor->sched_class == 12: // barrier class
state->ptr_112 = current_node
// Scheduling class queries via sub_502210 and sub_5021F0:
if sub_502210(target_table, descriptor, state->ptr_152):
// This instruction depends on last in same class
dep = state->ptr_88
if dep:
dep_desc = dep->metadata_ptr
distance = dep_desc->dword_0 + sub_4B13F0(target, dep, descriptor)
if distance < state->dword_24:
distance = state->dword_24
state->dword_24 = distance
state->ptr_88 = current_node
// Similar checks for texture (sub_5021F0 flag 32/31),
// fence (flag 6/7), and barrier distances
Target Opcode Lookup (sub_5EA930)
After scheduling, sub_5EA930 (12.1 KB) performs the final target instruction descriptor lookup that maps the expanded instruction to the encoder tables. This function:
- Reads the descriptor index from
instruction->metadata_ptr->dword_20 - Looks up the 184-byte descriptor (same table as above)
- Queries the target capabilities database via
sub_4FBCF0for encoder-specific information - Checks capabilities 93-98 for instruction size, stall penalties, and encoding constraints
- Sets bit flags on the metadata block:
byte_50 |= 0x10when the instruction exceeds the minimum encoding sizebyte_50 |= 0x08when capability 51 value equals 4byte_51 |= 0x04when instruction must use extended encodingbyte_51 |= 0x08when capability 51 value equals 2byte_51 |= 0x10when capability 51 value equals 3
These bit flags directly control the Mercury encoder: they select between compact (64-bit) and extended (128-bit) instruction encodings, and determine whether operand extension (opex) words are needed.
Instruction Emission Calls
After constraints and scheduling, two final calls emit the expanded instruction:
// 1. Branch expansion
sub_5EEB20(state, node+16) // HandleBranch -- resolves branch targets
// 2. Register constraint finalization
sub_5EBA30(state, context, node+16) // Finalize register allocations
// 3. Post-expansion constraint writeback
// Write back constraint results to instruction metadata
if descriptor->has_src_constraint:
sub_5EDEA0(context, node+16, &descriptor[+8], 1)
if descriptor->has_dst_constraint:
sub_5EDEA0(context, node+16, &descriptor[+40], 0)
if descriptor->has_src_bitvec:
sub_5EDA80(context, node+16, &descriptor[+104], 0, 0, 1, 0)
if descriptor->has_dst_bitvec:
sub_5EDA80(context, node+16, &descriptor[+120], 0, 0, 0, 0)
// Write back each non-zero register class word
for i in 0..5:
if descriptor->reg_class[i]:
sub_5EDA80(context, node+16, 0, class_word[i], 0, src_flag[i], extra[i])
Epilogue: Last Instruction Fixup
After the main loop exits, the handler performs a final pass on the last instruction in the block:
// Compute final scheduling distance for the block
last_meta = last_instruction->metadata_ptr
max_offset = state->dword_32
if state->dword_24 >= max_offset:
max_offset = state->dword_24
// Account for trailing barriers/fences
if state->ptr_112: // had global barrier
trailing = *(state->ptr_112->metadata_ptr) + 6
if max_offset < trailing: max_offset = trailing
if state->ptr_104: // had fence
trailing = *(state->ptr_104->metadata_ptr) + 6
if max_offset < trailing: max_offset = trailing
if state->ptr_88: // had scheduling dep
trailing = sub_4B13F0(...) computation
if max_offset < trailing: max_offset = trailing
// Check capability 97 for minimum block distance
target_caps = sub_4FBCF0(target->desc_312, last_instruction, 0)
if target_caps && has_cap(target_caps, 97):
cap_97_distance = get_cap_value(target_caps, 97)
if max_offset < cap_97_distance: max_offset = cap_97_distance
// Compute final stall count
stall_count = max_offset - last_meta->dword_0
if stall_count < last_meta->dword_12:
stall_count = last_meta->dword_12
// Check capability 93 for additional penalty
if has_cap(target_caps, 93):
stall_count += get_cap_value(target_caps, 93)
// Minimum stall: 1, or 2 if instruction has opex/extension flags
min_stall = 1 + ((last_meta->byte_48 & 0x11) != 0)
if stall_count < min_stall:
stall_count = min_stall
// Final target lookup
if sub_5EA930(state, last_instruction, stall_count, state->dword_36):
last_meta->byte_50 |= 0x10 // needs extended encoding
last_meta->dword_56 = stall_count
// Write total cycle count to basic block terminator
bb_terminator->metadata_ptr->dword_4 = stall_count + last_meta->dword_4
// Apply target constraints for final pass
if !pass_number:
sub_5EB130(state, context, last_instruction, 1)
The FNV-1a Hash Lookup (sub_5F80E0)
The instruction dispatch for case 11 / vtable+584 uses an FNV-1a hash map to find pre-computed expansion templates:
sub_5F80E0(pass_state, node):
context = pass_state[3] // ptr at index 3
if context->byte_512: // hash map enabled
assert context->dword_480 != 0 // hash table size > 0
// FNV-1a hash of the data type ID
data_type = node->dword_32 // 4-byte value
hash = 0x811C9DC5 // FNV offset basis
hash = (hash ^ byte_0(data_type)) * 16777619 // FNV prime
hash = (hash ^ byte_1(data_type)) * 16777619
hash = (hash ^ byte_2(data_type)) * 16777619
hash = (hash ^ byte_3(data_type)) * 16777619
// Probe the hash table
bucket = context->ptr_488 + 24 * (hash & (context->ptr_496 - 1))
while bucket != NULL:
bucket = *bucket // follow chain
if bucket && bucket->dword_8 == data_type:
break // found
expansion_template = bucket->ptr_16
else:
expansion_template = NULL
// Store IR body for expansion
context->ptr_992 = node->ptr_16
// Create expanded instruction via sub_A4CA70
new_node = sub_A4CA70(context, node, pass_state+2, expansion_template)
pass_state[1] = new_node
// Call target-specific fixup (vtable+56 on target at context+416)
call_vtable(target, fixup_56, node+16, new_node+16)
// Transfer CFG group ownership
new_node->ptr_120 = node->ptr_120
node->ptr_120 = NULL
// Invalidate and clean up original node
sub_A49DF0(context, new_node+16, 0)
sub_5F70E0(pass_state, node+16)
sub_A48B20(context, node, new_node+16)
Post-Expansion NOP Removal (sub_5F7A00)
After certain expansions, the engine runs a cleanup pass that removes unnecessary NOP-like instructions:
sub_5F7A00(pass_state):
target_info = *(context->ptr_312 + 72)
if target_info->byte_1008 != 1:
return
if target_info->dword_1016 == 0:
return
// Scan from first_inserted to last_inserted
node = pass_state->ptr_1 // first inserted instruction
sentinel = pass_state->ptr_2 // last original instruction
while node != sentinel:
opcode = node->word_28
next = node->next
// Remove opcode 162 (NOP) and opcode 349 (expansion placeholder)
if opcode == 162 || opcode == 349:
sub_A48D20(context, node, 0) // delete node from list
node = next
Key Functions
| Address | Name | Size | Role |
|---|---|---|---|
0x5FDDB0 | MercExpand_Dispatch | 25.5 KB | Main entry point, instruction dispatch loop |
0x5F38E0 | MercExpand_HandleInstruction | 35.0 KB | Per-instruction expansion, 2nd largest function in engine |
0x5F0180 | MercExpand_PropagateRegConstraints | 14.2 KB | Register constraint propagation (bitvector scanning, 10 params) |
0x5F1C50 | MercExpand_ApplySimpleConstraints | ~8 KB | Simple operand constraint application |
0x5F8B60 | MercExpand_ApplyResourceConstraints | 16.0 KB | Register resource accounting (52 register types) |
0x5EA930 | MercExpand_LookupTargetOpInfo | 12.1 KB | Target instruction descriptor lookup + encoder flag computation |
0x5EB130 | MercExpand_ApplyRegConstraintsFromTarget | 11.0 KB | Target-specific register constraints (capabilities 40-47) |
0x5EA4F0 | MercExpand_InvalidateRegisterState | 4.3 KB | Register cache invalidation (13 register slots, 15+ generation counters) |
0x5F60E0 | IRTree_Walk | 18.6 KB | Recursive tree walker (manually unrolled to 5 nesting levels) |
0x5F80E0 | MercExpand_HashDispatch | 2.8 KB | FNV-1a hash lookup for expansion templates |
0x5F7A00 | MercExpand_PostCleanup | 1.1 KB | Remove NOP/placeholder instructions (opcodes 162, 349) |
0x5F9C70 | MercExpand_ReleaseCFGGroup | 0.9 KB | Reference-counted CFG group deallocation |
0x5F9B10 | MercExpand_ApplySchedHint | ~2 KB | Apply scheduling hint from target descriptor (+3384/+3456) |
0x5FCE20 | MercExpand_ExpandRETURN | ~8 KB | Return/exit expansion |
0x5FC6B0 | MercExpand_ExpandMOV | ~4 KB | MOV pseudo-instruction expansion |
0x5EEB20 | MercExpand_HandleBranch | 7.7 KB | Branch target resolution |
0x5EBA30 | MercExpand_FinalizeRegAlloc | ~6 KB | Register constraint finalization |
0x5EF3C0 | MercExpand_CheckReorderability | ~2 KB | Check if two instructions can be reordered |
0x5F2BA0 | MercExpand_InsertStallCycles | ~4 KB | Insert explicit stall cycles between instructions |
Register State Cache and Invalidation
sub_5EA4F0 (MercExpand_InvalidateRegisterState, 4.3 KB) manages a generation-counter-based register cache. The cache tracks 13 physical register file partitions, each with a value slot and a generation counter:
sub_5EA4F0(state, is_predicated):
// Bump 15 generation counters (different register file partitions)
state->gen_64++ // general purpose registers (even)
state->gen_96++ // general purpose registers (odd)
state->gen_128++ // predicate registers
state->gen_160++ // special registers group A
state->gen_192++ // special registers group B
state->gen_224++ // uniform registers even
state->gen_256++ // uniform registers odd
state->gen_288++ // condition code registers
state->gen_320++ // barrier registers
state->gen_352++ // address registers
state->gen_384++ // texture sampler registers
state->gen_416++ // global version counter
state->gen_448++ // reserved / opex
// Zero the 15 corresponding dirty flags
state->dirty_68 = 0
state->dirty_100 = 0
state->dirty_132 = 0
... (all 15 cleared)
// Invalidate the 13-slot register value cache
// Cache is at state->ptr_400, each entry is 8 bytes (value + generation)
cache = state->ptr_400
gen = ++state->dword_416 // global generation
// For each of the 13 cache slots:
for slot in [0, 1, 4, 5, 10, 11, 12, 13, 18, 19, 22, 23, ...]:
if cache[slot].generation != gen:
state->dirty_420++ // count invalidations
cache[slot].value = -1 // mark invalid
cache[slot].generation = gen
// Apply target-specific constraints (capabilities 39-40)
target = context->ptr_312
if has_cap(target, 40):
// Walk linked list at target+2896 (sentinel at +2904)
for entry in linked_list(target+2896):
slot_idx = entry->dword_16
new_val = entry->dword_20
cache_entry = &cache[slot_idx]
if cache_entry.generation != gen:
dirty_420++
cache_entry.value = new_val
cache_entry.generation = gen
if has_cap(target, 39):
// Walk linked list at target+2824 (sentinel at +2832)
for entry in linked_list(target+2824):
sub_5EA3A0(state, entry->dword_16, entry->dword_20)
state->byte_34 = is_predicated
Instruction Handlers
The dispatch loop delegates to specialized handlers per instruction category:
| Function | Handler | Notes |
|---|---|---|
0x5EC540 | HandlePredication | Predicate register setup |
0x5EC940 | HandleBarrier | Barrier synchronization |
0x5ECC60 | HandleSync | Warp synchronization |
0x5ED060 | HandleDepInfo | Dependency information |
0x5ED3A0 | HandleTexSampler | Texture/sampler operations |
0x5ED850 | HandleAtomicOp | Atomic memory operations |
0x5EDA80 | HandleMemOp | Memory load/store |
0x5EE030 | HandleConversion | Type conversion |
0x5EE750 | HandleCmp | Comparison operations |
0x5EE930 | HandleSelect | Select/conditional move |
0x5EEB20 | HandleBranch | Branch expansion (7.7 KB) |
0x5EF0E0 | HandleCall | Function call expansion |
0x5EF3C0 | CheckReorderability | Instruction reorder check |
0x5EF4D0 | HandleReturn | Return/exit expansion |
0x5EF760 | HandlePhi | PHI node expansion |
0x5FAC90 | HandleSharedMem | Shared memory access (9.6 KB) |
0x5FB5B0 | HandleGlobalMem | Global memory access |
0x5FBC30 | HandleConstMem | Constant memory access |
0x5FC1B0 | HandleSurfaceOp | Surface read/write |
0x5FC6B0 | ExpandMOV | MOV instruction (attribute 200==1107 special case) |
0x5FCE20 | ExpandRETURN | Return/exit (creates opcode 270, sets attribute 118=519) |
How Expanded Instructions Feed Into Encoder Tables
The MercExpand output feeds directly into the downstream MercEncodeAndDecode pass. The connection points are:
-
Descriptor index (metadata +20): Each expanded instruction carries a descriptor index that maps 1:1 to an entry in the Mercury encoder table. The encoder table base is at
target_state+832, with each entry being 184 bytes. This is the same table consulted during expansion -- the encoder reads the same descriptor to determine the binary encoding format. -
Encoding size flags (metadata byte +50 and +51): Set by
sub_5EA930during expansion:- Bit 4 of byte +50 (
0x10): instruction exceeds minimum encoding width, needs extended format - Bit 3 of byte +50 (
0x08): capability-51 value 4 flag - Bit 2 of byte +51 (
0x04): requires opex (operand extension) words - Bit 3 of byte +51 (
0x08): capability-51 value 2 flag - Bit 4 of byte +51 (
0x10): capability-51 value 3 flag
- Bit 4 of byte +50 (
-
Stall count (metadata +56): Written by the per-instruction handler, this integer encodes the minimum stall cycles before the next instruction. The Mercury encoder embeds this in the instruction's yield/stall field.
-
Scheduling offset (metadata +0 and +4): The
dword_0field gives the cycle at which this instruction can issue;dword_4gives the upper bound. The encoder uses these to compute the final instruction ordering in the.nv.mercsection. -
Register class results (metadata +36): The encoder reads the final register constraint resolution written back by
sub_5EDEA0/sub_5EDA80. These determine which register encoding fields are used in the binary word. -
CFG group membership (node +120): The CFG group pointer threads instructions into basic blocks. The encoder traverses these groups to produce per-block instruction sequences with proper entry/exit markers.
Internal Data Structures
Target instruction descriptor (184 bytes per entry, base at state offset +832):
- Offset 0: descriptor index
- Offset 4: scheduling class
- Offsets 8-39: source operand simple constraints (4 pairs of 8-byte bitvectors)
- Offsets 40-103: destination operand simple constraints (4 pairs of 8-byte bitvectors)
- Offsets 104-119: source register bitvector constraints (16 bytes)
- Offsets 120-135: destination register bitvector constraints (16 bytes)
- Offsets 152-163: 6 register class constraint words (2 bytes each)
- Offsets 164-167: 4 constraint flag bytes (has_src_constraint, has_dst_constraint, has_src_bitvec, has_dst_bitvec)
- Offsets 176-177: pseudo/eliminated flags
- Offsets 2880-2904: register constraint linked lists (from target capabilities 39-40)
- Offsets 3384-3456: scheduling hints
- Offset 3672: capability flag 51
Register state cache (at MercExpand state offset +400):
- 13 register slots mapping to physical register file partitions
- Each slot: 8 bytes (4-byte value + 4-byte generation counter)
- Generation counters for cache invalidation (15 separate counters at offsets +64 through +448)
- Dirty counter at offset +420 tracks total invalidation events
- Slots cover: general purpose (R0-R255), predicates (P0-P6), uniform registers (UR0-UR63), condition codes (CC), barrier registers, address registers, texture sampler state, and reserved/opex slots
CFG group node (64 bytes, reference-counted):
- +0: prev pointer (doubly-linked list)
- +8: sentinel pointer (self+16)
- +16: self-pointer (list anchor)
- +32: tail pointer
- +48: state word (2=open/active, other values=closed/finalized)
- +56: ref-count object pointer (shared_ptr-like, released via
sub_5F9C70)
FNV-1a hash maps: Used for IR node tracking and lookup tables throughout MercExpand. The hash map at state offset +480 uses FNV-1a (prime=16777619, basis=0x811C9DC5) with chained buckets of 24 bytes each. Node identification uses hash maps keyed on node metadata at offset +112 -> +20.
Capsule Mercury (capmerc) Format
Capsule Mercury is the new ELF binary format for SM100+ targets. It wraps Mercury-encoded instructions in a specialized ELF layout with .nv.merc.* sections. The capmerc.cubin filename extension is used (string at 0x1D33FA9, xrefs from 0x40A84F and 0x42A26F).
ELF Sections
The 20 .nv.merc.* section names identified in the binary:
| Section Name | Description |
|---|---|
.nv.merc | Main Mercury instruction section |
.nv.merc.rela | Mercury relocation entries |
.nv.merc.symtab_shndx | Extended section header index (for >65535 sections) |
.nv.merc.nv.shared.reserved. | Reserved shared memory region |
.nv.merc.debug_abbrev | DWARF abbreviation tables |
.nv.merc.debug_aranges | DWARF address ranges |
.nv.merc.debug_frame | DWARF call frame information |
.nv.merc.debug_info | DWARF debug information entries |
.nv.merc.debug_line | DWARF line number program |
.nv.merc.debug_loc | DWARF location lists |
.nv.merc.debug_macinfo | DWARF macro information |
.nv.merc.debug_pubnames | DWARF public names |
.nv.merc.debug_pubtypes | DWARF public types |
.nv.merc.debug_ranges | DWARF address ranges |
.nv.merc.debug_str | DWARF string table |
.nv.merc.nv_debug_ptx_txt | Embedded PTX source text |
.nv.merc.nv_debug_line_sass | NVIDIA SASS-level line tables |
.nv.merc.nv_debug_info_reg_sass | NVIDIA SASS register debug info |
.nv.merc.nv_debug_info_reg_type | NVIDIA register type debug info |
Self-Check Mechanism
nvlink includes a self-check facility for capsule mercury output, enabled via the --self-check CLI flag (string at 0x1D41D3A). The self-check description at 0x1D41EC8: "Self check for capsule mercury (capmerc)".
Self-check validates three sections independently:
| Check | Error String | Address |
|---|---|---|
| Text section | "Self check for capsule mercury text section failed" | 0x2458F38 |
| Debug section | "Self check for capsule mercury debug section failed" | 0x2458F70 |
| Relocation section | "Self check for capsule mercury relocation section failed" | 0x2458FA8 |
On failure, the error at 0x1F44288 references internal documentation: "Failure of '%s' section in self-check for capsule mercury. See the Jira confluence page 'MERCSW-125' for more information that includes some debugging steps." The MERCSW Jira project is NVIDIA's internal Mercury software tracker.
An additional option produces reconstituted SASS for debugging: "Generate output of capmerc based reconstituted sass only through -self-check" (string at 0x1D41EF8).
Mercury Uplift
The "mercury uplift" path converts legacy SASS ELF binaries into Mercury format. The error string at 0x2458FE8 ("Invalid elf provided for mercury uplift.", xref 0x24590B8) confirms this conversion direction. A related skip path at 0x1D3BCB7 ("skip mercury section %i", xref 0x45F624) handles sections that should not be uplifted.
The uplift path coexists with the "don't uplift %s" diagnostic at 0x1D3410E (xref 0x42BBDC), indicating per-symbol or per-section uplift control.
ELF Attributes for Mercury
Two EIATTR (ELF Info Attribute) types are Mercury-specific:
| Attribute | String Address |
|---|---|
EIATTR_MERCURY_ISA_VERSION | 0x1D36F31 |
EIATTR_MERCURY_FINALIZER_OPTIONS | 0x1D37170 |
Four EICOMPAT (ELF Info Compatibility) attributes relate to Mercury and finalization:
| Attribute | String Address | Description |
|---|---|---|
EICOMPAT_ATTR_MERCURY_ISA_MAJOR_MINOR_VERSION | 0x245EF08 | Mercury ISA version (major.minor) |
EICOMPAT_ATTR_MERCURY_ISA_PATCH_VERSION | 0x245EF38 | Mercury ISA patch version |
EICOMPAT_ATTR_ENABLE_OPPORTUNISTIC_FINALIZATION | 0x245EED8 | Controls cross-family finalization |
EICOMPAT_ATTR_CAN_FASTPATH_FINALIZE | 0x245EF88 | Fast-path finalization eligibility |
Relationship to FNLZR (Finalizer)
The FNLZR (Finalizer) subsystem is the runtime component that converts between Mercury and SASS representations. It operates in two modes, logged via diagnostic strings:
| Mode | String | Address | Xref |
|---|---|---|---|
| Pre-Link | "FNLZR: Pre-Link Mode" | 0x1D323BD | 0x427999 |
| Post-Link | "FNLZR: Post-Link Mode" | 0x1D32397 | 0x427951 |
| JIT | "FNLZR: JIT Path" | 0x1DF8C40 | 0x52DDE1 |
FNLZR logs its input: "FNLZR: Input ELF: %s" (0x1D32381), and tracks lifecycle: "FNLZR: Starting %s" / "FNLZR: Ending %s" / "FNLZR: Flags [ %u | %u ]".
Finalization also appears in the capsule mercury code region with thread-level parallelism: "Failed to create finalizer thread" (0x2458EC0), suggesting that the finalizer runs as a separate thread during link.
The --opportunistic-finalization-lvl flag (string at 0x1D41F70) controls cross-architecture finalization behavior:
--opportunistic-finalization-lvl <0|1|2|3>
0 = default behavior
1 = no opportunistic finalization
2 = intra family finalization only
3 = intra and inter family finalization
Fast-path finalization is confirmed by the diagnostic at 0x1D40610: "[Finalizer] fastpath optimization applied for off-target %u -> %u finalization", indicating cross-SM finalization (e.g., compiling SM90 code for an SM100 target).
See FNLZR (Finalizer) for detailed analysis of the finalization subsystem.
MercGenerateSassUCode
The final Mercury pipeline stage is MercGenerateSassUCode (0x2443D02, xref 0x2444418), which converts the Mercury internal representation into SASS microcode -- the actual GPU-executable instruction encoding. Related dump utilities exist:
| Function | String Address | Description |
|---|---|---|
DumpNVuCodeText | 0x2443DA2 | Dump microcode in text format |
DumpNVuCodeHex | 0x2443DB2 | Dump microcode in hex format |
The .ucode section name at 0x1EEC922 and EIATTR_UCODE_SECTION_DATA at 0x1D36D20 confirm that microcode is a distinct section in the output ELF.
Cross-References
nvlink Internal
- Capsule Mercury Format -- detailed capmerc ELF layout and encoding
- R_MERCURY Relocations -- the 67 Mercury relocation types
- Mercury ELF Sections -- the 20
.nv.merc.*sections - Mercury Compiler Passes -- MercExpand, MercConverter, MercWARs, MercOpex
- FNLZR (Finalizer) -- SASS-to-Mercury and Mercury-to-SASS conversion
- Embedded ptxas Overview -- MercExpand mega-hub at
0x5B1D80in the address map - ISel Hubs -- MercExpand is the 5th mega-hub dispatch function
- SM100 Blackwell -- Mercury is the default encoding for SM100+ targets
- Output Phase -- Mercury output path in the linker pipeline
Sibling Wikis
- ptxas: Mercury Encoder Pipeline -- standalone ptxas Mercury encoder (phases 113--122: encode/decode, MercExpand, WAR, opex, UCode emission)
- ptxas: Capsule Mercury & Finalization -- standalone ptxas capmerc output format, Mercury section binary layouts, finalization pipeline
- ptxas: SASS Encoding -- SASS instruction encoding that Mercury wraps
Confidence Assessment
| Claim | Rating | Evidence |
|---|---|---|
| Mercury = ROT13("Zrephel") obfuscation scheme | HIGH | 667 ZREPHEL strings verified in nvlink_strings.json (addr range 0x1D42C80--0x1D4DF80). ROT13 decode confirmed character-by-character. |
| 667 ZREPHEL builtin instruction entries | HIGH | Exact count from nvlink_strings.json string scan. All entries begin with ZREPHEL_ prefix. |
ROT13 decoder at sub_1A40AC0 (15,629 bytes, SIMD) | MEDIUM | Function exists at stated address in decompiled code. SIMD _mm_load_si128 usage inferred from decompiler output. Byte count from function bounds. |
--binary-kind flag parsed from "mercury,capmerc,sass" at 0x1D41D03 | HIGH | String verified at exact address in nvlink_strings.json with xref to 0x4AC55C. |
SM100+ default is capmerc (activation via byte_2A5F222) | HIGH | String "Default on sm100+ is capmerc" verified. Global byte_2A5F222 confirmed in decompiled sub_4AC380. |
| Architecture compatibility: 104->120, 130->107, 101->110 remapping | HIGH | Verified from decompiled sub_4709E0 and sub_470DA0. |
| Capability bitmask: sm100=1, sm103=8, sm110=2, sm121=64 | HIGH | Decompiled from sub_470DA0, verified from switch on architecture codes 'd','g','n','y'. |
MercExpand engine spans 0x5E4470--0x600260 (~112 KB) | MEDIUM | Address range from sweep analysis. Entry sub_5FF110 and dispatch sub_5FDDB0 verified from decompiled code. Sub-function count (~40) is approximate. |
30 opcode dispatch cases in sub_5FDDB0 | HIGH | Switch statement cases verified from decompiled sub_5FDDB0 (25.5KB). |
| IR node layout (offsets +0/+8/+16/+28/+32/+48/+56/+112/+120/+148/+149/+152) | MEDIUM | Offsets derived from decompiled code field access patterns. Consistent across multiple functions. Field names are inferred from usage context. |
| 184-byte target instruction descriptor format | MEDIUM | Size inferred from 184 * desc_index multiplication in decompiled code. Field offsets verified from constraint propagation code. |
FNV-1a hash constants (basis=0x811C9DC5, prime=16777619) | HIGH | Standard FNV-1a constants, verified from decompiled sub_5F80E0. |
| Register state cache: 13 slots, 15 generation counters | MEDIUM | Slot count and counter count from decompiled sub_5EA4F0. Specific slot-to-register-file mapping is inferred. |
String evidence: 82 mercury/Mercury, 7 capmerc, 17 FNLZR, 20 .nv.merc.* | HIGH | Counts verified by scanning nvlink_strings.json. Address ranges confirmed. |
FNLZR diagnostic strings ("FNLZR: Input ELF: %s", etc.) | HIGH | All 12 FNLZR strings verified at exact addresses with xrefs to sub_4275C0. |
| Self-check error strings and MERCSW-125 Jira reference | HIGH | 4 self-check strings verified. MERCSW-125 string confirmed at 0x1F44288. |
capmerc.cubin output filename at 0x1D33FA9 | HIGH | Verified in nvlink_strings.json with xrefs from 0x40A84F and 0x42A26F. |
Fastpath optimization log string at 0x1D40610 | HIGH | Exact string verified in nvlink_strings.json. |
Mercury uplift error at 0x2458FE8 | HIGH | "Invalid elf provided for mercury uplift." verified at exact address. |
MercGenerateSassUCode at 0x2443D02 (pipeline stage) | HIGH | String verified. Xref at 0x2444418 confirmed (master phase table entry). |
| Instruction category distribution (mbarrier=124, barrier=86, etc.) | MEDIUM | Counts from categorizing 667 ZREPHEL strings by decoded prefix. Manual categorization may have minor counting errors. |
| Hopper (SM90) uses Mercury but not as default | MEDIUM | Inferred from global flag thresholds (sm > 99 vs sm > 89). Not directly confirmed from string evidence. |
"don't uplift %s" diagnostic at 0x1D3410E | HIGH | Verified in nvlink_strings.json with xref at 0x42BBDC. |
| EIATTR/EICOMPAT Mercury attributes at stated addresses | HIGH | All 6 attribute strings verified at exact addresses in nvlink_strings.json. |