Capsule Mercury & Finalization

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Capsule Mercury ("capmerc") is a packaging format that wraps Mercury-encoded instruction streams with relocation metadata, debug information, and a snapshot of compilation knobs, enabling deferred finalization for a target SM that may differ from the original compilation target. Where standard Mercury produces a fully-resolved SASS binary bound to a single SM, capmerc produces an intermediate ELF object that a downstream tool (the driver or linker) can finalize into native SASS at load time. This is the default output format for all SM 100+ targets (Blackwell, Jetson Thor, consumer RTX 50-series). The capmerc data lives in .nv.capmerc<funcname> per-function ELF sections alongside 21 types of .nv.merc.* auxiliary sections carrying cloned debug data, memory-space metadata, and Mercury-specific relocations. Finalization can be "opportunistic" -- the same capmerc object may be finalized for different SMs within or across architectural families, controlled by --opportunistic-finalization-lvl.

Output modes             mercury (SM 75--99 default), capmerc (SM 100+ default), sass (explicit only)
CLI parser               sub_703AB0 (10KB, ParsePtxasOptions)
Auto-enable              SM arch > 99 sets *(context + offset+81) = 1
Mercury mode flag        *(DWORD*)(context+385) == 2 (shared with Mercury)
Capsule descriptor       328-byte object, one per function (sub_1C9C300)
Merc section classifier  sub_1C98C60 (9KB, 15 .nv.merc.* names)
Master ELF emitter       sub_1C9F280 (97KB, orchestrates full CUBIN output)
Self-check verifier      sub_720F00 (64KB Flex lexer) + sub_729540 (35KB comparator)
Off-target checker       sub_60F290 (compatibility validation)
Kernel finalizer         sub_612DE0 (47KB, fastpath optimization)

Output Mode Selection

The ptxas CLI parser (sub_703AB0) registers --binary-kind, which accepts three values, plus four related flags:

Option                            String literal                                         Purpose
--binary-kind                     "mercury,capmerc,sass"                                 Select output format
--cap-merc                        "Generate Capsule Mercury"                             Force capmerc regardless of SM
--self-check                      "Self check for capsule mercury (capmerc)"             Roundtrip verification
--out-sass                        "Generate output of capmerc based reconstituted sass"  Dump reconstituted SASS
--opportunistic-finalization-lvl  (in finalization logic)                                Finalization aggressiveness

When --binary-kind is not specified, the default is determined by SM version:

// Pseudocode from sub_703AB0 + auto-enable logic
if (sm_version > 99) {
    *(context + offset + 81) = 1;  // capmerc auto-enabled
    binary_kind = CAPMERC;
} else if (sm_version >= 75) {
    binary_kind = MERCURY;
} else {
    binary_kind = SASS;  // legacy direct encoding
}

The Mercury mode flag *(DWORD*)(context+385) == 2 is shared between Mercury and capmerc -- both use the identical Mercury encoder pipeline (phases 117--122). The capmerc distinction is purely at the ELF emission level: capmerc wraps the phase-122 SASS output in a capsule descriptor with relocation metadata instead of emitting it directly as a .text section.

Capsule Mercury ELF Structure

A capmerc-mode compilation produces a CUBIN ELF with two layers of content: standard CUBIN sections (.text.<func>, .nv.constant0, .nv.info.<func>, etc.) and a parallel set of .nv.merc.* sections carrying the metadata needed for deferred finalization.

CUBIN ELF (capmerc mode)
├── Standard sections
│   ├── .shstrtab, .strtab, .symtab, .symtab_shndx
│   ├── .text.<funcname>               (SASS binary, possibly partial)
│   ├── .nv.constant0.<funcname>       (constant bank data)
│   ├── .nv.shared.<funcname>          (shared memory layout)
│   ├── .nv.info.<funcname>            (EIATTR attributes)
│   ├── .note.nv.tkinfo, .note.nv.cuinfo
│   └── .nv.uft.entry                 (unified function table)
│
├── Per-function capsule descriptor
│   └── .nv.capmerc<funcname>          (328-byte descriptor + payload)
│
└── Mercury auxiliary sections (21 types)
    ├── .nv.merc.debug_abbrev          (DWARF abbreviation table)
    ├── .nv.merc.debug_aranges         (DWARF address ranges)
    ├── .nv.merc.debug_frame           (DWARF frame info)
    ├── .nv.merc.debug_info            (DWARF info)
    ├── .nv.merc.debug_line            (DWARF line table)
    ├── .nv.merc.debug_loc             (DWARF locations)
    ├── .nv.merc.debug_macinfo         (DWARF macro info)
    ├── .nv.merc.debug_pubnames        (DWARF public names)
    ├── .nv.merc.debug_pubtypes        (DWARF public types)
    ├── .nv.merc.debug_ranges          (DWARF ranges)
    ├── .nv.merc.debug_str             (DWARF string table)
    ├── .nv.merc.nv_debug_ptx_txt      (embedded PTX source text)
    ├── .nv.merc.nv_debug_line_sass    (SASS-level line table)
    ├── .nv.merc.nv_debug_info_reg_sass (register allocation info)
    ├── .nv.merc.nv_debug_info_reg_type (register type info)
    ├── .nv.merc.symtab_shndx          (extended section index table)
    ├── .nv.merc.nv.shared.reserved    (shared memory reservation)
    ├── .nv.merc.rela                  (Mercury relocations)
    ├── .nv.merc.rela<secname>         (per-section relocation tables)
    └── .nv.merc.<memory-space>        (cloned constant/global/local/shared)

Capsule Descriptor -- sub_1C9C300

Each function produces a .nv.capmerc<funcname> section constructed by sub_1C9C300 (24KB, 3816 bytes binary). This function processes .nv.capmerc and .merc markers, embeds KNOBS data (compilation configuration snapshot), manages constant bank replication, and creates the per-function descriptor.

The descriptor is a 328-byte object that records or references:

  • Mercury-encoded instruction stream for the function
  • R_MERCURY_* relocation entries that must be patched during finalization
  • KNOBS block -- a serialized snapshot of all knob values affecting code generation, optimization level, target parameters, and feature flags
  • References to the .nv.merc.* auxiliary sections
  • Function-level metadata: register counts, barrier counts, shared memory usage

The KNOBS embedding allows the finalizer to reproduce exact compilation settings without the original command-line arguments. This is critical for off-target finalization where the finalizer runs in a different context (e.g., the CUDA driver at application load time).

Capsule Descriptor Layout (328 bytes)

The descriptor is heap-allocated via sub_424070(allocator, 328) and zero-filled before field initialization. The constructor also creates a companion .merc<funcname> descriptor (same 328-byte layout) when merc section mirroring is active.

         Capsule Descriptor (328 bytes = 0x148)
         ======================================

         Group 1: Identity
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x000 │ WORD    │  2B  │ desc_version                           │
   0x002 │ WORD    │  2B  │ instr_format_version                   │
   0x004 │ DWORD   │  4B  │ section_index                          │
   0x008 │ DWORD   │  4B  │ weak_symbol_index                      │
   0x00C │ --      │  4B  │ (padding)                              │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 2: SASS Data
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x010 │ QWORD   │  8B  │ weak_symbol_desc                       │
   0x018 │ QWORD   │  8B  │ sass_data_offset                       │
   0x020 │ DWORD   │  4B  │ sass_data_size                         │
   0x024 │ --      │  4B  │ (padding)                              │
   0x028 │ QWORD   │  8B  │ func_name_ptr                          │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 3: Relocation Infrastructure
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x030 │ QWORD   │  8B  │ rela_list_a (vector)                   │
   0x038 │ QWORD   │  8B  │ rela_list_b (vector)                   │
   0x040 │ QWORD   │  8B  │ reloc_symbol_list (vector)             │
   0x048 │ QWORD   │  8B  │ aux_rela_list (vector)                 │
   0x050 │ QWORD   │  8B  │ debug_rela_list (vector)               │
   0x058 │ QWORD   │  8B  │ text_section_offset                    │
   0x060 │ QWORD   │  8B  │ reloc_index_set (sorted container)     │
   0x068 │ QWORD   │  8B  │ per_reloc_data_set (sorted container)  │
   0x070 │ BYTE    │  1B  │ sampling_mode                          │
   0x071 │ --      │  7B  │ (padding)                              │
   0x078 │ QWORD   │  8B  │ reloc_payload_map (sorted container)   │
         └─────────┴──────┴────────────────────────────────────────┘

         ┌─────────┬──────┬────────────────────────────────────────┐
   0x080 │ --      │ 32B  │ (reserved, not written by constructor) │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 4: Function Metadata
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x0A0 │ QWORD   │  8B  │ section_flags                          │
   0x0A8 │ DWORD   │  4B  │ max_register_count                     │
   0x0AC │ DWORD   │  4B  │ extra_section_index                    │
   0x0B0 │ BYTE    │  1B  │ has_global_refs                        │
   0x0B1 │ BYTE    │  1B  │ has_shared_refs                        │
   0x0B2 │ BYTE    │  1B  │ has_exit                               │
   0x0B3 │ BYTE    │  1B  │ has_crs                                │
   0x0B4 │ BYTE    │  1B  │ uses_atomics                           │
   0x0B5 │ BYTE    │  1B  │ uses_shared_atomics                    │
   0x0B6 │ BYTE    │  1B  │ uses_global_atomics                    │
   0x0B7 │ BYTE    │  1B  │ has_texture_refs                       │
   0x0B8 │ --      │ 24B  │ (padding)                              │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 5: Code Generation Parameters
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x0D0 │ QWORD   │  8B  │ knobs_section_desc_ptr → 64B sub-obj  │
         │         │      │   +0x00 DWORD: knobs_section_index     │
         │         │      │   +0x08 QWORD: knobs_section_offset    │
         │         │      │   +0x10 DWORD: knobs_section_size      │
         │         │      │   +0x18 QWORD: knobs_section_name_ptr  │
   0x0D8 │ DWORD   │  4B  │ stack_frame_size                       │
   0x0DC │ --      │  4B  │ (padding)                              │
   0x0E0 │ DWORD   │  4B  │ register_count                         │
   0x0E4 │ --      │  4B  │ (padding)                              │
   0x0E8 │ DWORD   │  4B  │ barrier_info_size                      │
   0x0EC │ --      │  4B  │ (padding)                              │
   0x0F0 │ QWORD   │  8B  │ barrier_info_data_ptr                  │
   0x0F8 │ --      │  8B  │ (reserved)                             │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 6: Constant Bank & Section Info
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x100 │ QWORD   │  8B  │ const_bank_offset                      │
   0x108 │ DWORD   │  4B  │ const_bank_size                        │
   0x10C │ --      │  4B  │ (padding)                              │
   0x110 │ QWORD   │  8B  │ section_name_ptr (".nv.capmerc<func>") │
   0x118 │ QWORD   │  8B  │ section_alignment (default 16)         │
   0x120 │ DWORD   │  4B  │ const_bank_section_index               │
   0x124 │ --      │  4B  │ (padding)                              │
   0x128 │ DWORD   │  4B  │ text_section_index                     │
   0x12C │ DWORD   │  4B  │ text_rela_section_index                │
         └─────────┴──────┴────────────────────────────────────────┘

         Group 7: KNOBS Embedding
         ┌─────────┬──────┬────────────────────────────────────────┐
   0x130 │ QWORD   │  8B  │ kv_pair_list (vector)                  │
   0x138 │ QWORD   │  8B  │ knobs_pair_list (vector)               │
   0x140 │ WORD    │  2B  │ min_sm_version (default 256 = sentinel) │
   0x142 │ BYTE    │  1B  │ has_crs_depth                          │
   0x143 │ --      │  5B  │ (padding to 0x148)                     │
         └─────────┴──────┴────────────────────────────────────────┘

Key design observations:

Flag byte block (+0x0B0 to +0x0B7). Eight single-byte flags capture function characteristics that determine which R_MERCURY_* relocation patches the finalizer must apply. The flags are set by type-2, type-3, and type-4 markers in the capmerc stream. Each flag is a boolean (0 or 1), never a bitfield.

KNOBS indirection (+0x0D0). The KNOBS data does not live inline in the descriptor. Instead, +0x0D0 points to a separately allocated 64-byte sub-object carrying the ELF coordinates (section index, file offset, size, and name pointer) of the KNOBS section. This allows the KNOBS data to reside in a dedicated ELF section while the descriptor references it by position. The KNOBS pair list at +0x138 and the generic key-value list at +0x130 store the parsed key-value pairs from marker type 90 data blocks; the "KNOBS" string literal serves as the discriminator between the two lists.

Dual-descriptor pattern. When the merc section mirror is active, the constructor allocates a second 328-byte object for the .merc<funcname> companion section. This companion receives a copy of the SASS data (not a pointer -- an actual memcpy of sass_data_size bytes), the function name with a .merc prefix, and the section flags from the original ELF section header at +0x0A0. The companion's weak_symbol_index (+0x008) is always zero.

Relocation containers. The three sorted containers at +0x060, +0x068, and +0x078 (created via sub_425CA0 with comparator pair sub_427750/sub_427760 and element size 0x20 = 32 bytes) form a three-level relocation index. The reloc_index_set stores symbol indices that appear in relocations. The per_reloc_data_set stores per-symbol relocation metadata. The reloc_payload_map associates symbol indices with the actual payload data that the finalizer patches into instruction bytes. These are populated by marker sub-types 10, 23, 25, 28, 40, 46, 49, 52, 57, 64, 68, 70, 71, 85, and 87.

min_sm_version sentinel. The default value 256 (0x100) at +0x140 acts as a sentinel meaning "no minimum SM constraint." When a target profile is available at construction time, the profile's SM version overwrites this field. Marker sub-type 95 can further override it when CRS depth information constrains the minimum SM.
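The layout above can be restated as a C struct so that the compiler checks the offsets. This is a sketch, not a recovered definition: the field names come from the diagram, but the type name CapsuleDesc, the helper capsule_desc_init, the pad_/reserved_ member names, and the use of uint64_t for every pointer- or container-sized slot are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the 328-byte capsule descriptor. */
typedef struct CapsuleDesc {
    /* Group 1: identity */
    uint16_t desc_version;              /* 0x000 */
    uint16_t instr_format_version;      /* 0x002 */
    uint32_t section_index;             /* 0x004 */
    uint32_t weak_symbol_index;         /* 0x008 */
    uint32_t pad_00c;
    /* Group 2: SASS data */
    uint64_t weak_symbol_desc;          /* 0x010 */
    uint64_t sass_data_offset;          /* 0x018 */
    uint32_t sass_data_size;            /* 0x020 */
    uint32_t pad_024;
    uint64_t func_name_ptr;             /* 0x028 */
    /* Group 3: relocation infrastructure */
    uint64_t rela_list_a;               /* 0x030 */
    uint64_t rela_list_b;               /* 0x038 */
    uint64_t reloc_symbol_list;         /* 0x040 */
    uint64_t aux_rela_list;             /* 0x048 */
    uint64_t debug_rela_list;           /* 0x050 */
    uint64_t text_section_offset;       /* 0x058 */
    uint64_t reloc_index_set;           /* 0x060 */
    uint64_t per_reloc_data_set;        /* 0x068 */
    uint8_t  sampling_mode;             /* 0x070 */
    uint8_t  pad_071[7];
    uint64_t reloc_payload_map;         /* 0x078 */
    uint8_t  reserved_080[32];          /* not written by the constructor */
    /* Group 4: function metadata */
    uint64_t section_flags;             /* 0x0A0 */
    uint32_t max_register_count;        /* 0x0A8 */
    uint32_t extra_section_index;       /* 0x0AC */
    uint8_t  has_global_refs;           /* 0x0B0 */
    uint8_t  has_shared_refs;           /* 0x0B1 */
    uint8_t  has_exit;                  /* 0x0B2 */
    uint8_t  has_crs;                   /* 0x0B3 */
    uint8_t  uses_atomics;              /* 0x0B4 */
    uint8_t  uses_shared_atomics;       /* 0x0B5 */
    uint8_t  uses_global_atomics;       /* 0x0B6 */
    uint8_t  has_texture_refs;          /* 0x0B7 */
    uint8_t  pad_0b8[24];
    /* Group 5: code generation parameters */
    uint64_t knobs_section_desc_ptr;    /* 0x0D0 */
    uint32_t stack_frame_size;          /* 0x0D8 */
    uint32_t pad_0dc;
    uint32_t register_count;            /* 0x0E0 */
    uint32_t pad_0e4;
    uint32_t barrier_info_size;         /* 0x0E8 */
    uint32_t pad_0ec;
    uint64_t barrier_info_data_ptr;     /* 0x0F0 */
    uint64_t reserved_0f8;
    /* Group 6: constant bank and section info */
    uint64_t const_bank_offset;         /* 0x100 */
    uint32_t const_bank_size;           /* 0x108 */
    uint32_t pad_10c;
    uint64_t section_name_ptr;          /* 0x110 */
    uint64_t section_alignment;         /* 0x118 */
    uint32_t const_bank_section_index;  /* 0x120 */
    uint32_t pad_124;
    uint32_t text_section_index;        /* 0x128 */
    uint32_t text_rela_section_index;   /* 0x12C */
    /* Group 7: KNOBS embedding */
    uint64_t kv_pair_list;              /* 0x130 */
    uint64_t knobs_pair_list;           /* 0x138 */
    uint16_t min_sm_version;            /* 0x140 */
    uint8_t  has_crs_depth;             /* 0x142 */
    uint8_t  pad_143[5];
} CapsuleDesc;

_Static_assert(sizeof(CapsuleDesc) == 0x148, "descriptor must be 328 bytes");
_Static_assert(offsetof(CapsuleDesc, section_flags) == 0x0A0, "Group 4 offset");
_Static_assert(offsetof(CapsuleDesc, knobs_section_desc_ptr) == 0x0D0, "Group 5 offset");
_Static_assert(offsetof(CapsuleDesc, min_sm_version) == 0x140, "Group 7 offset");

/* Zero-fill, then apply the two documented defaults. */
void capsule_desc_init(CapsuleDesc *d)
{
    assert(d != NULL);
    memset(d, 0, sizeof *d);
    d->section_alignment = 16;   /* default alignment */
    d->min_sm_version = 256;     /* sentinel: no minimum SM constraint */
}
```

Spelling out the padding explicitly, rather than relying on compiler-inserted padding, keeps the offsets identical across ABIs and makes the reserved regions visible.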

Capmerc Marker Stream Format

The constructor parses a compact binary marker stream embedded in the capmerc section data. Each marker begins with a type byte followed by a sub-type byte:

Type  Size           Format                               Description
2     4 bytes fixed  [02] [sub] [00] [00]                 Boolean flag markers
3     4 bytes fixed  [03] [sub] [WORD payload]            Short value markers
4     variable       [04] [sub] [WORD size] [payload...]  Variable-length data markers

Selected marker sub-types and the descriptor fields they populate:

Sub  Type  Descriptor Field               Purpose
10   4     +0x0B0 has_global_refs         Function accesses global memory
21   2     +0x0B2 has_exit                Function contains EXIT instruction
22   2     +0x0B3 has_crs                 Function uses call return stack
23   4     +0x0A8 max_register_count      Register pressure (with max tracking)
25   4     +0x0B1 has_shared_refs         Function accesses shared memory
27   3     +0x000 desc_version            Descriptor format version stamp
47   4     +0x002 instr_format_version    Instruction encoding format version
50   4     +0x0E0 register_count          Allocated register count
54   4     +0x0B7 has_texture_refs        Function uses texture/sampler units
67   4     +0x0E8, +0x0F0 barrier_info    Barrier count and data
72   4     +0x0D0 knobs_section_desc_ptr  KNOBS section ELF binding
73   3     +0x0D8 stack_frame_size        Per-thread stack frame bytes
74   2     +0x070 sampling_mode           Interpolation/sampling mode
88   3     +0x0B4/B5/B6                   Atomic usage (plain/shared/global)
90   4     +0x138 knobs_pair_list         KNOBS key-value data block
95   3     +0x140, +0x142                 Min SM version + CRS depth flag
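A minimal decoder for the three marker shapes might look like the sketch below. It covers only the framing described above. The Marker struct and marker_next function are hypothetical names, and two details are assumptions not confirmed by the analysis: that WORD fields are little-endian, and that the type-4 size field counts payload bytes only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded marker (hypothetical representation). */
typedef struct Marker {
    uint8_t        type;     /* 2, 3, or 4 */
    uint8_t        sub;      /* sub-type, e.g. 21 = has_exit */
    uint16_t       word;     /* type 3: value; type 4: payload size; type 2: 0 */
    const uint8_t *payload;  /* type 4 only, otherwise NULL */
} Marker;

/* Decode the marker at p. Returns the number of bytes consumed,
 * or 0 if the input is truncated or the type byte is unknown. */
size_t marker_next(const uint8_t *p, size_t len, Marker *out)
{
    assert(p != NULL && out != NULL);
    if (len < 4)
        return 0;                       /* every marker has a 4-byte head */

    /* ASSUMPTION: WORD fields are stored little-endian. */
    uint16_t word = (uint16_t)(p[2] | (p[3] << 8));

    out->type = p[0];
    out->sub = p[1];
    out->word = 0;
    out->payload = NULL;

    switch (p[0]) {
    case 2:                             /* boolean flag: [02][sub][00][00] */
        return 4;
    case 3:                             /* short value: [03][sub][WORD] */
        out->word = word;
        return 4;
    case 4:                             /* variable: [04][sub][WORD size][payload] */
        /* ASSUMPTION: size counts payload bytes, excluding the head. */
        if ((size_t)word > len - 4)
            return 0;
        out->word = word;
        out->payload = p + 4;
        return 4 + (size_t)word;
    default:
        return 0;
    }
}
```

A caller would loop, advancing by the returned length until it reaches the end of the stream or gets 0, and dispatch on (type, sub) to fill the descriptor fields listed in the table.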

.nv.merc.* Section Builder Pipeline

Five functions cooperate under the master ELF emitter sub_1C9F280 to construct the .nv.merc.* section namespace:

sub_1C9F280 (97KB, Master ELF emitter)
  │
  ├─ sub_1C9B110 (23KB) ── Mercury capsule builder
  │   Creates .nv.merc namespace, reads symtab entry count,
  │   allocates mapping arrays, duplicates sections into merc space
  │
  ├─ sub_1CA2E40 (18KB) ── Mercury section cloner
  │   Iterates all sections, clones constant/global/shared/local
  │   into .nv.merc.* namespace, creates .nv.merc.rela sections,
  │   handles .nv.global.init and .nv.shared.reserved
  │
  ├─ sub_1C9C300 (24KB) ── Capsule descriptor processor
  │   Processes .nv.capmerc and .merc markers, embeds KNOBS,
  │   handles constant bank replication and rela duplication
  │
  ├─ sub_1CA3A90 (45KB) ── Section merger
  │   Merge/combine pass for sections with both merc and non-merc
  │   copies; processes .nv.constant bank sections, handles section
  │   linking and rela association
  │
  └─ sub_1C99BB0 (25KB) ── Section index remap
      Reindexes sections after dead elimination, handles
      .symtab_shndx / .nv.merc.symtab_shndx mapping

The section classifiers sub_1C9D1F0 (16KB) and sub_1C98C60 (9KB) map section names to internal type IDs. The former handles both SASS and merc debug section variants; the latter is merc-specific and recognizes all 15 .nv.merc.debug_* and .nv.merc.nv_debug_* names.

R_MERCURY_* Relocation Types

Capsule Mercury defines its own relocation type namespace for references within .nv.merc.rela sections and the capsule descriptor. These are distinct from standard CUDA ELF relocations (R_NV_32, etc.) and are processed during finalization rather than at link time.

Type                         Description
R_MERCURY_ABS64              64-bit absolute address
R_MERCURY_ABS32              32-bit absolute address
R_MERCURY_ABS16              16-bit absolute address
R_MERCURY_PROG_REL           PC-relative reference
R_MERCURY_8_0                Sub-byte patch: bits [7:0] of target word
R_MERCURY_8_8                Sub-byte patch: bits [15:8]
R_MERCURY_8_16               Sub-byte patch: bits [23:16]
R_MERCURY_8_24               Sub-byte patch: bits [31:24]
R_MERCURY_8_32               Sub-byte patch: bits [39:32]
R_MERCURY_8_40               Sub-byte patch: bits [47:40]
R_MERCURY_8_48               Sub-byte patch: bits [55:48]
R_MERCURY_8_56               Sub-byte patch: bits [63:56]
R_MERCURY_FUNC_DESC          Function descriptor reference
R_MERCURY_UNIFIED            Unified address space reference
R_MERCURY_TEX_HEADER_INDEX   Texture header table index
R_MERCURY_SAMP_HEADER_INDEX  Sampler header table index
R_MERCURY_SURF_HEADER_INDEX  Surface header table index

Sub-Byte Relocation Design

The eight R_MERCURY_8_* types enable patching individual bytes within a 64-bit instruction word. Mercury instruction encodings pack multiple fields into single 8-byte QWORDs (the 1280-bit instruction buffer at a1+544 is organized as 20 QWORDs). During finalization for a different SM, only certain bit-fields within an instruction word may need updating -- for example, the opcode variant bits or register class encoding -- while neighboring fields remain unchanged. The sub-byte types let the finalizer patch exactly one byte at a specific position within the word without a read-modify-write cycle on the entire QWORD.
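The effect of an R_MERCURY_8_<shift> patch can be sketched in two equivalent ways: as a masked update of a 64-bit value, or as a single byte store into the word's in-memory bytes. The function names below are hypothetical, and the byte-store variant assumes the instruction word is stored little-endian, which is not confirmed here.

```c
#include <assert.h>
#include <stdint.h>

/* Value-level view: replace bits [shift+7:shift] of word with value.
 * shift is the numeric suffix of the relocation name (0, 8, ..., 56). */
uint64_t mercury_patch8(uint64_t word, unsigned shift, uint8_t value)
{
    assert(shift < 64 && shift % 8 == 0);
    uint64_t mask = (uint64_t)0xFF << shift;
    return (word & ~mask) | ((uint64_t)value << shift);
}

/* Memory-level view: one byte store, no load of the other seven bytes.
 * ASSUMPTION: the QWORD is stored little-endian, so bits [shift+7:shift]
 * live at byte index shift / 8. */
void mercury_patch8_bytes(uint8_t qword[8], unsigned shift, uint8_t value)
{
    assert(shift < 64 && shift % 8 == 0);
    qword[shift / 8] = value;
}
```

For example, applying a patch with shift 16 and value 0xAB to 0x1122334455667788 yields 0x1122334455AB7788; only the byte that held 0x66 changes.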

Relocation Resolution

The master relocation resolver sub_1CD48C0 (22KB) handles both standard and capmerc relocations. For R_MERCURY_UNIFIED, it converts internal relocation type 103 to type 1 (standard absolute). The resolver iterates relocation entries and handles: alias redirections, dead-function relocation skipping, __UFT_OFFSET / __UDT_OFFSET pseudo-relocations, PC-relative branch validation, NVRS (register spill) relocations, and YIELD-to-NOP conversion for forward progress guarantees.

Mercury Section Binary Layouts

Section Classifier Algorithm -- sub_1C98C60

The 9KB classifier uses a two-stage guard-then-waterfall pattern to identify .nv.merc.* sections from their ELF section headers.

Stage 1: sh_type range check (fast rejection). The section's sh_type is tested against two NVIDIA processor-specific ranges:

Range    sh_type span            Decimal               Qualifying types
A        0x70000006..0x70000014  SHT_LOPROC+6..+20     Filtered by bitmask 0x5D05
B        0x70000064..0x7000007E  SHT_LOPROC+100..+126  All accepted (memory-space data)
Special  1                       SHT_PROGBITS          Accepted (generic debug data)

Within Range A, the bitmask 0x5D05 (binary 0101_1101_0000_0101) selects seven specific types:

Bit  sh_type        Hex         Section types
0    SHT_LOPROC+6   0x70000006  Memory-space clones
2    SHT_LOPROC+8   0x70000008  .nv.merc.nv.shared.reserved
8    SHT_LOPROC+14  0x7000000E  .nv.merc.debug_line
10   SHT_LOPROC+16  0x70000010  .nv.merc.debug_frame
11   SHT_LOPROC+17  0x70000011  .nv.merc.debug_info
12   SHT_LOPROC+18  0x70000012  .nv.merc.nv_debug_line_sass
14   SHT_LOPROC+20  0x70000014  .nv.merc.debug_loc, .nv.merc.debug_ranges, .nv.merc.nv_debug_info_reg_*
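The Stage 1 gate reduces to two range comparisons and one shift-and-mask. The sketch below reproduces that arithmetic only. The function and macro names are hypothetical, and the order in which the three cases are tested is an assumption rather than something read off the binary.

```c
#include <assert.h>
#include <stdint.h>

#define MERC_SHT_LOPROC   0x70000000u  /* same value as ELF SHT_LOPROC */
#define MERC_RANGE_A_MASK 0x5D05u      /* bit n set => SHT_LOPROC+6+n accepted */

/* Returns 1 if sh_type passes the Stage 1 check, 0 if it is rejected. */
int merc_sh_type_gate(uint32_t sh_type)
{
    if (sh_type == 1u)                              /* SHT_PROGBITS */
        return 1;

    if (sh_type >= MERC_SHT_LOPROC + 6 && sh_type <= MERC_SHT_LOPROC + 20) {
        unsigned bit = sh_type - (MERC_SHT_LOPROC + 6);   /* Range A: 0..14 */
        assert(bit <= 14);
        return (int)((MERC_RANGE_A_MASK >> bit) & 1u);
    }

    if (sh_type >= MERC_SHT_LOPROC + 100 && sh_type <= MERC_SHT_LOPROC + 126)
        return 1;                                   /* Range B: all accepted */

    return 0;
}
```

Encoding the seven accepted Range A types as a bitmask replaces a seven-way comparison chain with a single shift, which is consistent with this stage serving as the fast-rejection path.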

Stage 2: Name-based disambiguation (expensive path). When sh_flags bit 28 (0x10000000, SHF_NV_MERC) is set, the classifier calls sub_1CB9E50() to retrieve the section name and performs sequential strcmp() against 15 names, returning 1 on the first match. The check order matches the declaration order in the ELF structure table above. sub_4279D0 is used for .nv.merc.nv_debug_ptx_txt as a prefix match rather than exact match.

SHF_NV_MERC Flag (0x10000000)

Bit 28 of sh_flags is an NVIDIA extension: SHF_NV_MERC. All .nv.merc.* sections carry this flag. It serves two purposes:

  1. Fast filtering -- the classifier checks this bit before string comparisons, giving O(1) rejection for the common case of non-merc sections.
  2. Namespace separation -- during section index remapping (sub_1C99BB0), sections with SHF_NV_MERC are remapped into a separate merc section index space. The finalizer uses this flag to identify which sections require relocation patching during off-target finalization.
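Both uses of the flag can be illustrated with a small sketch: a one-instruction flag test, and a pass that numbers merc and non-merc sections independently. Only the flag value comes from the analysis above. The function names and the numbering scheme (each space counted from 0 in original section order) are assumptions, and the actual remapping in sub_1C99BB0 may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHF_NV_MERC 0x10000000ull   /* sh_flags bit 28 */

/* O(1) test used before any name comparison. */
int is_merc_section(uint64_t sh_flags)
{
    return (sh_flags & SHF_NV_MERC) != 0;
}

/* Assign each section an index within its own space (merc or plain).
 * ASSUMPTION: both spaces are numbered from 0 in original order. */
void split_index_spaces(const uint64_t *sh_flags, size_t n,
                        uint32_t *new_index,
                        uint32_t *n_plain, uint32_t *n_merc)
{
    assert(sh_flags != NULL && new_index != NULL);
    assert(n_plain != NULL && n_merc != NULL);
    *n_plain = 0;
    *n_merc = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t *counter = is_merc_section(sh_flags[i]) ? n_merc : n_plain;
        new_index[i] = (*counter)++;
    }
}
```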

.nv.capmerc<funcname> -- Capsule Data Layout

The per-function capsule section contains the full marker stream, SASS data, KNOBS block, and optionally replicated constant bank data. The section is created by sub_1C9C300.

ELF section header:

Field         Value
sh_type       1 (SHT_PROGBITS)
sh_flags      0x10000000 (SHF_NV_MERC)
sh_addralign  16

Section data is organized as four consecutive regions:

         .nv.capmerc<funcname> Section Data
         ====================================

         ┌──────────────────────────────────────────────────────┐
         │ Marker Stream     (variable length)                  │
         │   Repeating TLV records:                             │
         │     [type:1B] [sub:1B] [payload:varies]              │
         │                                                      │
         │   Type 2: 4 bytes total   [02] [sub] [00 00]        │
         │     Boolean flags (has_exit, has_crs, sampling_mode) │
         │                                                      │
         │   Type 3: 4 bytes total   [03] [sub] [WORD:value]   │
         │     Short values (desc_version, stack_frame_size,    │
         │     atomic flags, min_sm_version)                    │
         │                                                      │
         │   Type 4: variable        [04] [sub] [WORD:size] ..  │
         │     Variable-length blocks (register counts, KNOBS   │
         │     data, barrier info, relocation payloads)         │
         │                                                      │
         │   Terminal marker: sub-type 95 (min_sm + CRS depth)  │
         ├──────────────────────────────────────────────────────┤
         │ SASS Data Block   (sass_data_size bytes)             │
         │   Mercury-encoded instruction bytes identical to     │
         │   what .text.<func> would contain for the compile    │
         │   target; byte-for-byte match with phase 122 output  │
         ├──────────────────────────────────────────────────────┤
         │ KNOBS Block       (knobs_section_size bytes)         │
         │   Serialized key-value pairs from marker sub-type 90 │
         │   "KNOBS" tag separates knob pairs from generic KV   │
         │   Contains: optimization level, target parameters,   │
         │   feature flags, all codegen-affecting knob values   │
         ├──────────────────────────────────────────────────────┤
         │ Constant Bank Data (const_bank_size bytes, optional) │
         │   Replicated .nv.constant0 data for deferred binding │
         │   Only present when the function references constant │
         │   bank data that the finalizer may need to patch     │
         └──────────────────────────────────────────────────────┘

.nv.merc.debug_info -- Cloned DWARF Debug Info

The cloner (sub_1CA2E40) produces a byte-for-byte copy of the source .debug_info section, placed into the merc namespace with modified ELF section header properties.

ELF section header:

Field         Value
sh_type       0x70000011 (SHT_LOPROC + 17)
sh_flags      0x10000000 (SHF_NV_MERC)
sh_addralign  1

Section data is standard DWARF .debug_info format:

         .nv.merc.debug_info Section Data
         ==================================

         ┌──────────────────────────────────────────────────────┐
         │ Compilation Unit Header                              │
         │   +0x00  unit_length    : 4B (DWARF-32) or 12B (-64)│
         │   +0x04  version        : 2B (typically DWARF 4)    │
         │   +0x06  debug_abbrev_offset : 4B → .nv.merc.debug_abbrev │
         │   +0x0A  address_size   : 1B (8 for 64-bit GPU)    │
         ├──────────────────────────────────────────────────────┤
         │ DIE Tree (Debug Information Entries)                  │
         │   Sequence of entries, each:                         │
         │     abbrev_code  : ULEB128                           │
         │     attributes   : per abbreviation definition       │
         │                                                      │
         │   Cross-section references (via relocations):        │
         │     DW_FORM_strp     → .nv.merc.debug_str            │
         │     DW_FORM_ref_addr → .nv.merc.debug_info           │
         │     DW_FORM_sec_offset → .nv.merc.debug_line etc.    │
         └──────────────────────────────────────────────────────┘

The critical difference from standard .debug_info: all cross-section offset references point to other .nv.merc.* sections, not the original .debug_* sections. The .nv.merc.rela.debug_info relocation table handles rebinding these offsets during finalization.

.nv.merc.rela / .nv.merc.rela<secname> -- Mercury Relocations

Mercury relocation sections use standard Elf64_Rela on-disk format (24 bytes per entry) but encode Mercury-specific relocation types with a 0x10000 offset in the type field.

ELF section header:

Field         Value
sh_type       4 (SHT_RELA)
sh_flags      0x10000000 (SHF_NV_MERC)
sh_addralign  8
sh_entsize    24
sh_link       symtab section index
sh_info       target section index

Section names are constructed by sub_1C980F0 as ".nv.merc.rela" + suffix (e.g., ".nv.merc.rela.debug_info").

On-disk entry layout (standard Elf64_Rela, 24 bytes):

         .nv.merc.rela Entry (24 bytes on disk)
         ========================================

         ┌─────────┬──────┬────────────────────────────────────────────┐
   0x00  │ QWORD   │  8B  │ r_offset — byte position in target section │
   0x08  │ DWORD   │  4B  │ r_type — relocation type                   │
         │         │      │   Standard: 1=R_NV_ABS64, etc.             │
         │         │      │   Mercury:  r_type > 0x10000               │
         │         │      │   Decoded:  r_type - 0x10000 → R_MERCURY_* │
   0x0C  │ DWORD   │  4B  │ r_sym — symbol table index                 │
   0x10  │ QWORD   │  8B  │ r_addend — signed addend value             │
         └─────────┴──────┴────────────────────────────────────────────┘

During resolution (sub_1CD48C0), the 24-byte on-disk entries are loaded into a 32-byte in-memory representation that adds two section index fields:

         In-Memory Relocation Entry (32 bytes)
         =======================================

         ┌─────────┬──────┬────────────────────────────────────────────┐
   0x00  │ QWORD   │  8B  │ r_offset — byte position in target section │
   0x08  │ DWORD   │  4B  │ r_type — relocation type                   │
   0x0C  │ DWORD   │  4B  │ r_sym — symbol table index                 │
   0x10  │ QWORD   │  8B  │ r_addend — signed addend value             │
   0x18  │ DWORD   │  4B  │ r_sec_idx — target section index           │
   0x1C  │ DWORD   │  4B  │ r_addend_sec — addend section index        │
         └─────────┴──────┴────────────────────────────────────────────┘

The extra 8 bytes enable cross-section targeting: r_sec_idx identifies which section r_offset is relative to, and r_addend_sec identifies the section contributing the addend base address. When r_addend_sec != 0, the resolver adds that section's load address to r_offset before patching.

The resolver detects Mercury relocation types via r_type > 0x10000, subtracts 0x10000, then dispatches through a Mercury-specific handler table (off_2407B60) rather than the standard CUDA relocation table (off_2408B60).

Complete sh_type Map

sh_type               Hex                     Section types
1                     0x00000001              .nv.capmerc<func>, .nv.merc.debug_abbrev (PROGBITS variant), .nv.merc.debug_str, .nv.merc.nv_debug_ptx_txt
4                     0x00000004              .nv.merc.rela* (SHT_RELA)
SHT_LOPROC+6          0x70000006              .nv.merc.<memory-space> clones
SHT_LOPROC+8          0x70000008              .nv.merc.nv.shared.reserved
SHT_LOPROC+14         0x7000000E              .nv.merc.debug_line
SHT_LOPROC+16         0x70000010              .nv.merc.debug_frame
SHT_LOPROC+17         0x70000011              .nv.merc.debug_info
SHT_LOPROC+18         0x70000012              .nv.merc.nv_debug_line_sass
SHT_LOPROC+20         0x70000014              .nv.merc.debug_loc, .nv.merc.debug_ranges, .nv.merc.nv_debug_info_reg_sass, .nv.merc.nv_debug_info_reg_type
SHT_LOPROC+100..+126  0x70000064..0x7000007E  Memory-space variant sections (constant banks, shared, local, global)

The .nv.merc.* debug sections reuse the same sh_type values as their non-merc counterparts (.debug_info uses 0x70000011 in both namespaces). The SHF_NV_MERC flag (0x10000000) in sh_flags is the distinguishing marker.

Self-Check Mechanism

The --self-check flag activates a roundtrip verification that validates the capmerc encoding by reconstituting SASS from the capsule data and comparing it against the original:

Phase 122 output (SASS) ──────────────────────────> reference SASS
         │
         └─ capmerc packaging ─> .nv.capmerc<func>
                                        │
                                        └─ reconstitute ─> reconstituted SASS
                                                                │
                                                    section-by-section compare
                                                                │
                                                      pass / fail (error 17/18/19)

The reconstitution pipeline uses sub_720F00 (64KB), a Flex-generated SASS text lexer with thread-safety support (pthread_mutexattr_t), to parse the reconstituted instruction stream. sub_729540 (35KB) performs the actual section-by-section comparison.

Three error codes signal specific self-check failures:

Error code  Meaning
17          Section content mismatch (instruction bytes differ)
18          Section count mismatch (missing or extra sections)
19          Section metadata mismatch (size, alignment, or flags differ)

These error codes trigger longjmp-based error recovery in the master ELF emitter (sub_1C9F280), which uses _setjmp at its entry point for non-local error handling.
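The mapping from comparison outcome to error code can be sketched as a plain function. This is an illustration of the three failure classes, not a reconstruction of sub_729540: the Section struct, the function name, and the order of the checks (count, then metadata, then content) are assumptions, and the longjmp-based recovery is replaced by an ordinary return value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one section as seen by the comparator. */
typedef struct Section {
    const uint8_t *data;
    size_t         size;
    uint64_t       align;
    uint64_t       flags;
} Section;

enum {
    SELF_CHECK_OK = 0,
    ERR_CONTENT   = 17,  /* instruction bytes differ */
    ERR_COUNT     = 18,  /* missing or extra sections */
    ERR_METADATA  = 19   /* size, alignment, or flags differ */
};

/* Compare reference sections against reconstituted ones, pairwise. */
int self_check_compare(const Section *ref, size_t nref,
                       const Section *got, size_t ngot)
{
    assert(ref != NULL && got != NULL);
    if (nref != ngot)
        return ERR_COUNT;

    for (size_t i = 0; i < nref; i++) {
        if (ref[i].size != got[i].size ||
            ref[i].align != got[i].align ||
            ref[i].flags != got[i].flags)
            return ERR_METADATA;

        if (ref[i].size != 0 &&
            memcmp(ref[i].data, got[i].data, ref[i].size) != 0)
            return ERR_CONTENT;
    }
    return SELF_CHECK_OK;
}
```

Checking metadata before content means memcmp only runs on sections whose sizes already agree.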

The --out-sass flag causes ptxas to dump the reconstituted SASS to a file, useful for debugging self-check failures by manual comparison with the original SASS output.

Opportunistic Finalization

The --opportunistic-finalization-lvl flag controls how aggressively capmerc binaries may be finalized for a target SM different from the compilation target:

Level  Name          Behavior
0      default       Standard finalization for the compile target only
1      none          No finalization; output stays as capmerc (deferred to driver)
2      intra-family  Finalize for any SM within the same architectural family
3      intra+inter   Finalize across SM families

Level 2 allows a capmerc binary compiled for sm_100 (datacenter Blackwell) to be finalized for sm_103 (Blackwell Ultra / GB300) without recompilation. Level 3 extends this across families -- for example, sm_100 capmerc finalized for sm_120 (consumer RTX 50-series).

The key constraint is instruction encoding compatibility: the sub-byte R_MERCURY_8_* relocations can patch SM-specific encoding bits, but the overall instruction format and register file layout must be compatible between source and target.
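As an illustration of what "sub-byte" patching means, the sketch below overwrites a bit field inside a single byte of an instruction stream while preserving the neighboring bits. The field positions and widths of the real R_MERCURY_8_* relocation types are not given in this section, so `shift` and `width` are free parameters here and `patch_subbyte` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Overwrite `width` bits starting at bit `shift` (bit 0 = LSB) of code[offset]
 * with the low bits of `value`; all other bits of the byte are kept.
 * Hypothetical helper: real relocation field layouts are not modeled. */
void patch_subbyte(uint8_t *code, size_t offset,
                   unsigned shift, unsigned width, uint8_t value)
{
    assert(code != NULL);
    assert(width >= 1 && shift + width <= 8);

    uint8_t mask  = (uint8_t)(((1u << width) - 1u) << shift);
    uint8_t field = (uint8_t)(((unsigned)value << shift) & mask);

    code[offset] = (uint8_t)((code[offset] & (uint8_t)~mask) | field);
}
```

For example, patching the high nibble of `0xA5` with the value `0x3` (`shift = 4`, `width = 4`) yields `0x35`; the low nibble is untouched.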

Off-Target Finalization

Off-target finalization is the process of converting a capmerc binary compiled for SM X into native SASS for SM Y. The compatibility checker sub_60F290 determines whether the source/target pair is compatible, examining:

  • SM version pair and generation compatibility
  • Feature flag differences between source and target
  • Instruction set compatibility (no target-only instructions used)
  • Constant bank layout compatibility
  • Register file layout match
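The five criteria above can be pictured as independent checks whose failures are reported together. The sketch below is a hedged model, not a reconstruction of sub_60F290: the `SmProfile` fields, the bitmask encodings, and in particular the "target must not be older than source" rule for the SM pair are all assumptions chosen to make the example concrete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-SM summary; field names and encodings are illustrative. */
typedef struct {
    unsigned sm;                 /* e.g. 100, 103, 120 */
    uint32_t features;           /* bitmask of hardware features */
    uint32_t isa_classes;        /* bitmask of supported instruction classes */
    unsigned cbank_layout_id;    /* constant bank layout identifier */
    unsigned regfile_layout_id;  /* register file layout identifier */
} SmProfile;

/* One bit per criterion from the list above. */
enum {
    INCOMPAT_SM_PAIR  = 1 << 0,
    INCOMPAT_FEATURES = 1 << 1,
    INCOMPAT_ISA      = 1 << 2,
    INCOMPAT_CBANK    = 1 << 3,
    INCOMPAT_REGFILE  = 1 << 4
};

/* Returns 0 if compatible, else a bitmask of failed criteria.
 * used_isa: instruction classes the compiled kernel actually uses. */
unsigned check_off_target(const SmProfile *src, const SmProfile *dst,
                          uint32_t used_isa)
{
    assert(src != NULL && dst != NULL);
    unsigned failed = 0;

    if (dst->sm < src->sm)                      /* ASSUMPTION: no downgrade */
        failed |= INCOMPAT_SM_PAIR;
    if ((src->features & ~dst->features) != 0)  /* source feature missing */
        failed |= INCOMPAT_FEATURES;
    if ((used_isa & ~dst->isa_classes) != 0)    /* unsupported instruction */
        failed |= INCOMPAT_ISA;
    if (src->cbank_layout_id != dst->cbank_layout_id)
        failed |= INCOMPAT_CBANK;
    if (src->regfile_layout_id != dst->regfile_layout_id)
        failed |= INCOMPAT_REGFILE;

    return failed;
}
```

Returning a bitmask rather than stopping at the first failure makes it easy to report every reason a source/target pair was rejected.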

When the check passes, the kernel finalizer sub_612DE0 (47KB) applies the "fastpath optimization" -- it directly patches the Mercury-encoded instruction stream using R_MERCURY_* relocations rather than running the full compilation pipeline. On success, ptxas emits the diagnostic:

"applied for off-target %u -> %u finalization"

where the two %u values are the source and target SM numbers.
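For reference, the format string expands as shown in the sketch below. Only the format string itself comes from the binary; the wrapper `format_off_target_diag` and its buffer handling are hypothetical and say nothing about how ptxas routes the message.

```c
#include <assert.h>
#include <stdio.h>

/* Format string as quoted above. */
static const char kOffTargetFmt[] =
    "applied for off-target %u -> %u finalization";

/* Hypothetical wrapper: writes the diagnostic into buf.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small or formatting fails. */
int format_off_target_diag(char *buf, size_t cap,
                           unsigned src_sm, unsigned dst_sm)
{
    assert(buf != NULL && cap > 0);
    int n = snprintf(buf, cap, kOffTargetFmt, src_sm, dst_sm);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

With `src_sm = 100` and `dst_sm = 103`, the buffer receives `applied for off-target 100 -> 103 finalization`.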

The fastpath avoids re-running phases 117--122 of the Mercury pipeline. Instead, it:

  1. Reads the capsule descriptor from .nv.capmerc<func>
  2. Validates compatibility via sub_60F290
  3. Applies R_MERCURY_* relocation patches for the target SM
  4. Regenerates the ELF .text section with patched instruction bytes
  5. Updates .nv.info EIATTR attributes for the target (register counts, barrier counts)

This is substantially faster than full recompilation, which is why ptxas logs it as a "fastpath."

Pipeline Integration

Capmerc does not modify the Mercury encoder pipeline (phases 113--122). The instruction encoding, pseudo-instruction expansion, WAR hazard resolution, operation expansion (opex), and SASS microcode emission all execute identically regardless of output mode. The divergence happens after phase 122 completes:

| Mode | Post-Pipeline Behavior |
|---|---|
| Mercury | Phase 122 SASS output written directly to .text.<func> ELF section |
| Capmerc | Phase 122 output wrapped in 328-byte capsule descriptor; .nv.merc.* sections cloned; R_MERCURY_* relocations emitted; KNOBS data embedded |
| SASS | Phase 122 output written as raw SASS binary (no ELF wrapper) |

The master ELF emitter sub_1C9F280 (97KB) orchestrates the post-pipeline divergence:

// Simplified from sub_1C9F280
void EmitELF(Context *ctx) {
    // Common: copy ELF header (64 bytes via SSE loadu)
    memcpy(output, &elf_header, 64);

    // Common: iterate sections, build section headers
    for (int i = 0; i < section_count; i++) {
        if (section[i].flags & 4) continue;  // skip virtual sections
        // ... copy section data, patch headers ...
    }

    if (is_capmerc_mode) {
        sub_1C9B110(ctx);   // create .nv.merc namespace
        sub_1CA2E40(ctx);   // clone sections into merc space
        sub_1C9C300(ctx);   // build capsule descriptors + KNOBS
        sub_1CA3A90(ctx);   // merge merc/non-merc section copies
    }

    // Common: remap section indices, build symbol table
    sub_1C99BB0(ctx);       // section index remap
    sub_1CB68D0(ctx);       // build .symtab

    // Common: resolve relocations
    sub_1CD48C0(ctx);       // relocation resolver (handles R_MERCURY_*)

    // Common: finalize and write
    sub_1CD13A0(ctx);       // serialize to file
}

Function Map

| Address | Size | Identity |
|---|---|---|
| sub_1C9F280 | 97KB | Master ELF emitter (orchestrates full CUBIN output) |
| sub_1CA3A90 | 45KB | Section merger / combined section emitter |
| sub_1CB68D0 | 49KB | Symbol table builder (handles merc section references) |
| sub_1C99BB0 | 25KB | Section index remap (.symtab_shndx / .nv.merc.symtab_shndx) |
| sub_1C9C300 | 24KB | Capsule descriptor processor (328-byte object, KNOBS embed) |
| sub_1C9B110 | 23KB | Mercury capsule builder (creates .nv.merc namespace) |
| sub_1CD48C0 | 22KB | Master relocation resolver (R_MERCURY_* + standard) |
| sub_1CA2E40 | 18KB | Mercury section cloner |
| sub_1C9D1F0 | 16KB | Debug section classifier (SASS + merc variants) |
| sub_1C98C60 | 9KB | Mercury debug section classifier (15 section names) |
| sub_720F00 | 64KB | Flex SASS text lexer (self-check reconstitution) |
| sub_729540 | 35KB | SASS assembly verification (self-check comparator) |
| sub_703AB0 | 10KB | Binary-kind CLI parser |
| sub_612DE0 | 47KB | Kernel finalizer / ELF builder (fastpath optimization) |
| sub_60F290 | -- | Off-target compatibility checker |
| sub_1CD13A0 | 11KB | ELF serialization (final file writer) |

Cross-References