Linker Context Object

The linker context is a 672-byte arena-allocated structure (elfw -- ELF wrapper) that serves as the single mutable state object threaded through every phase of nvlink's pipeline. Created by sub_4438F0 (elfw_create) during Phase 5 and destroyed by sub_4475B0 (elfw_destroy) at the end of main(), this object holds the ELF header fields, all section and symbol storage arrays, hash tables for name-based lookup, architecture-specific vtable pointers, relocation lists, index mapping tables, and configuration flags decoded from the merge-flags bitmask. Every pipeline function -- merge, layout, relocation, finalization, serialization -- receives this structure as its first argument (a1).

This page documents the internal layout of the 672-byte structure at reimplementation depth, the allocation and initialization sequence, the destruction sequence, and how the context flows through the linker pipeline.

Key Functions

Address     Name                           Size      Role
-------     ----                           ----      ----
sub_4438F0  elfw_create                    14,821 B  Allocates 672-byte struct, initializes ELF header, creates hash tables, sorted arrays, core sections
sub_4475B0  elfw_destroy                   3,023 B   Destroys all sub-structures, frees arrays, hash tables, and the struct itself
sub_443260  elfw_get_section               ~640 B    Resolves a signed section index to a section data record pointer (core accessor)
sub_443500  elfw_is_callee                 ~560 B    Checks whether a symbol is a cross-function callee
sub_443730  elfw_setup_phdrs               ~560 B    Initializes program header name strings at offset 192
sub_444690  elfw_reset_arch_state          ~480 B    Resets arch vtable at offset 488 and clears flag state
sub_444710  elfw_set_flags                 16 B      OR-merges bits into e_flags at offset 48
sub_444720  elfw_remap_symbol              ~416 B    Remaps a signed symbol index through the mapping table at offsets 456/464
sub_4447B0  elfw_is_syscall                ~288 B    Checks if a symbol name is a CUDA syscall (offset 496 hash table or static table)
sub_444830  elfw_is_syscall_32f            16 B      Exact match for __cuda_syscall_32f3056bbb
sub_444840  elfw_is_cnp_syscall            ~288 B    Checks syscall status and cnp prefix
sub_447570  elfw_layout_relocate_finalize  ~64 B     Orchestrator: calls layout, relocation, finalization in sequence
sub_4478F0  elfw_dump_structure            15,098 B  Debug dump of entire context (all sections, symbols, relocations)
sub_4504B0  elfw_init_callgraph            ~large    Initializes callgraph tracking structures

Structure Layout (672 bytes)

The structure combines three functional regions: an embedded ELF header template plus link-mode state (bytes 0--83), configuration flags and buffers (bytes 84--287), and the linker's mutable state (bytes 288--671). All offsets are byte offsets from the base of the 672-byte allocation. Array notations like ctx[36] use 8-byte QWORD indexing, so ctx[36] is at ctx + 36 * 8 = ctx + 288.
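
The QWORD-index notation can be made concrete with a small helper. This is a minimal sketch, assuming the context is viewed as a raw 672-byte image; the names elfw_qword_to_offset and elfw_read_qword are hypothetical and do not come from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ELFW_SIZE = 672, ELFW_QWORDS = ELFW_SIZE / 8 };  /* 84 QWORD slots */

/* Convert the N in ctx[N] to a byte offset from the base of the struct. */
static size_t elfw_qword_to_offset(size_t qword_index) {
    assert(qword_index < ELFW_QWORDS);
    return qword_index * sizeof(uint64_t);
}

/* Read the 8-byte slot ctx[N] from a raw context image (hypothetical helper).
 * memcpy avoids alignment and aliasing problems on the byte buffer. */
static uint64_t elfw_read_qword(const unsigned char *ctx, size_t qword_index) {
    uint64_t value;
    memcpy(&value, ctx + elfw_qword_to_offset(qword_index), sizeof value);
    return value;
}
```

With this helper, elfw_qword_to_offset(36) yields 288 and elfw_qword_to_offset(83) yields 664, matching the first and last QWORD entries in the Region 3 table.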

Region 1: ELF Header Template (offsets 0--83)

These fields are written directly into the output ELF during serialization.

Offset  Size  QWORD  Field                   Description
------  ----  -----  -----                   -----------
  0      4    [0]    elf_magic               0x464C457F ("\x7fELF")
  4      1           ei_class                1=Elf32, 2=Elf64 (computed as (is_64bit != 0) + 1)
  5      1           ei_data                 1 (ELFDATA2LSB, little-endian)
  6      1           ei_version              1 (EV_CURRENT)
  7      1           ei_osabi                0x41 (ELFOSABI_CUDA_DEVICE) or 0x33 (non-device ELF)
  8      1           ei_abiversion           SM ABI version (a3 parameter)
 16      2    [2]    e_type                  1=ET_REL, 2=ET_EXEC, 0xFF00=Mercury relocatable
 18      2           e_machine               190 (0xBE = EM_CUDA), always
 20      4           e_version_or_api        API version (device) or 1
 48      4    [6]    e_flags                 Architecture flags (encoding depends on OSABI)
 62      2           shstrtab_section_idx    Index of .shstrtab section
 64      1    [8]    verbose_flags           Verbose output control byte
 68      4           link_mode_bits          merge_flags & 0x70000
 72      4    [9]    sm_arch_major           SM major version (e.g., 100 for Blackwell)
 76      4           merge_flags             Full merge-flags bitmask (a9 | 0x80000 if relocatable)
 80      1   [10]    debug_flag              Debug info generation flag (a6 parameter)
 82      1           section_virtualization  Non-zero if section virtualization is active (set by merge)
 83      1           has_shstrtab            Non-zero if shstrtab section index was stored

Region 2: Configuration Flags and Buffers (offsets 84--287)

Individual boolean flags are decomposed from merge_flags during construction. These single-byte flags govern behavior throughout the pipeline without requiring repeated bitmask checks.

Offset  Size  Field                   merge_flags bit    Description
------  ----  -----                   ---------------    -----------
 84      1    preserve_relocs         bit 0              --preserve-relocs
 85      1    force_rela              bit 1              --force-rela
 86      1    reserve_null            bit 9              --reserve-null-pointer
 87      1    disable_smem_res        bit 2              --disable-smem-reservation
 88      1    allow_undef_globals     bit 3              --allow-undefined-globals
 89      1    is_relocatable          bit 4 (or forced)  Relocatable link mode
 90      1    optimize_data           bit 5              --optimize-data-layout
 91      1    syscall_const_off       bit 14             --syscall-const-offset
 92      1    no_opt                  bit 6              --no-opt
 93      1    extra_flag              bit 8              (reserved)
 94      1    extended_smem           (conditional)      sm_minor > 0x45 && bit 7
 96      1    dump_callgraph          bit 11             --dump-callgraph
 99      1    no_warn_dead_code       ~bit 12            !(bit 12)
100      1    verbose_keep            bit 13             --verbose-keep
101      1    is_device_elf           (a9 & 0x8000)      Device ELF mode (vs host)

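Two NVIDIA note headers are initialized by sub_43E490. The second argument (1000 or 2000) is the note type identifier, not a buffer capacity, and the constructor alone does not settle which header belongs to tkinfo and which to cuinfo (see Confidence Assessment):

Offset  Size  Field              Note type
------  ----  -----              ---------
108      32   tkinfo_buffer      1000 (24-byte note header, name "NVIDIA Corp")
140      32   cuinfo_buffer      2000 (24-byte note header, name "NVIDIA Corp")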

Program header name buffer (set by sub_443730):

Offset  Size  Field              Description
------  ----  -----              -----------
192      8    phdr_string_buf    Pointer to growable string buffer
168      4    phdr_offset_0      Offset for PHDR name 0
172      4    phdr_offset_1      Offset for PHDR name 1
176      4    phdr_offset_2      Offset for PHDR name 2
180      4    phdr_offset_3      Offset for PHDR name 3
184      4    phdr_offset_4      Offset for PHDR name 4

Section index tracking for core ELF sections:

Offset  Size  Field                Description
------  ----  -----                -----------
200      2    api_version_cache    Cached API version (a7); not a section index
202      2    strtab_section_idx   .strtab section index (word 101)
204      2    symtab_section_idx   .symtab section index (word 102)
206      2    symtab_shndx_idx     .symtab_shndx section index (word 103)
208      2    cuinfo_section_idx   .note.nv.cuinfo section index (word 104)
210      2    tkinfo_section_idx   .note.nv.tkinfo section index (word 105)
228      8    reloc_tracking       Relocation processing state (zeroed during init)

Region 3: Linker Mutable State (offsets 288--671)

This is the core of the linker context. All section, symbol, and relocation data is accessed through these fields.

Offset  Bytes  QWORD   Field                        Type / Description
------  -----  ------  -----                        ------------------
288      8     [36]    symbol_name_hash             LinkerHash* -- hash table for symbol lookup by name (512 buckets)
296      8     [37]    section_name_hash            LinkerHash* -- hash table for section lookup by name (512 buckets)
304      4     [38]    shstrtab_entry_count         Number of entries in shstrtab
312      8     [39]    initial_value                Set to 0x100000000 (count=1, flags=0)
320      4     [40]    strtab_entry_count           Set to 1 initially
344      8     [43]    pos_symbol_array             SortedArray* -- positive-index symbols (64-element initial)
352      8     [44]    neg_symbol_array             SortedArray* -- negative-index symbols (64-element initial)
360      8     [45]    section_array                SortedArray* -- section data records (64-element initial)
368      8     [46]    section_order_map            Pointer -- maps physical section indices back to virtual (inverse of the table at offset 472)
376      8     [47]    reloc_list_0                 SortedArray* -- relocation linked list / pending relocations
384      8     [48]    reloc_list_1                 SortedArray* -- resolved relocation list
392      8     [49]    reloc_list_2                 SortedArray* -- additional relocation list
408      8     [51]    input_section_list           SortedArray* -- per-input section tracking (32-element initial)
416      4     [52]    input_section_count          Counter for input_section_list
448      8     [56]    global_symbol_list           SortedArray* -- (freed during destroy)
456      8     [57]    symbol_index_mapping         Pointer -- maps old symbol indices to new (uint32_t array)
464      8     [58]    neg_symbol_index_mapping     Pointer -- maps old negative symbol indices to new (uint32_t array)
472      8     [59]    section_virtualization_table  Pointer -- maps virtual section indices to physical (uint32_t array)
480      8     [60]    file_list                    LinkedList* -- singly-linked list of input file records (node+8 = data)
488      8     [61]    arch_vtable                  Pointer -- architecture-specific vtable (from sub_45AC50 or sub_459640)
496      8     [62]    entry_hash                   LinkerHash* -- hash table for entry-point/syscall symbol lookup (32 buckets)
504      8     [63]    (reserved)                   (padding or unused pointer)
512      8     [64]    input_file_records           SortedArray* -- 16-byte records: {name_ptr, sm_minor, reserved} (8-element initial)
520      8     [65]    sorted_array_0               SortedArray* -- 16-element sorted array (section management)
528      8     [66]    sorted_array_1               SortedArray* -- 16-element sorted array (section management)
536      8     [67]    sorted_array_2               SortedArray* -- 16-element sorted array (section management)
544      8     [68]    sorted_array_3               SortedArray* -- 16-element sorted array (section management)
552      8     [69]    sorted_array_4               SortedArray* -- 16-element sorted array (section management)
560      8     [70]    sorted_array_5               SortedArray* -- 16-element sorted array (section management)
576      8     [72]    reloc_type_hash              LinkerHash* -- hash table for relocation type tracking (8 buckets, int-keyed)
584      8     [73]    (reserved)                   (unused or padding)
592      8     [74]    merged_symbol_array          SortedArray* -- symbols after merge (used as override for section resolution)
600      8     [75]    extended_symbol_store        SortedArray* -- extended symbol entries (for cross-reference)
608      8     [76]    private_arena_ptr            Arena* -- owning "elfw memory space" arena (only if merge_flags & 0x400)
616      8     [77]    private_arena_handle         Arena handle from sub_45CAE0
624      4     [78]    option_parser_result         Result from sub_42F8B0 (option parser state)
664      8     [83]    end_marker                   Zeroed, marks end of struct

Signed Index Convention

A critical design pattern throughout the linker context is the signed section/symbol index convention:

  • Positive index: Refers to the pos_symbol_array (offset 344) or regular section entries
  • Negative index: Refers to the neg_symbol_array (offset 352), with the absolute value as the array index
  • Index 0: Null / undefined

This dual-array scheme is used because the merge phase introduces new sections/symbols that may conflict with or extend the numbering of the original input. The accessor sub_443260 implements this dispatch:

SectionRecord *elfw_get_section(elfw *ctx, int signed_index) {
    SymbolRecord *sym;
    if (signed_index < 0)
        sym = SortedArray_get(ctx->neg_symbol_array, -signed_index);   // ctx[44]
    else
        sym = SortedArray_get(ctx->pos_symbol_array, signed_index);    // ctx[43]

    if (!sym) return NULL;

    uint16_t section_idx = sym->section_index;                         // sym + 6
    if (section_idx == 0xFFFF) {
        // Deferred resolution: symbol has an indirect reference at sym+24
        int ref = *(int32_t *)((char *)sym + 24);                      // byte offset 24
        int remapped = ref;        // index for merged_symbol_array; assumed unchanged when no remap applies
        bool resolved = false;

        if (ctx->extended_symbol_store) {                              // ctx[75]
            // Resolve through extended store (negative references only)
            if (ref < 0) {
                section_idx = SortedArray_get(ctx->extended_symbol_store, -ref);
                resolved = true;
            }
            // else fall through to merged_symbol_array
        } else {
            // Resolve through mapping tables
            uint32_t *pos_map = ctx->symbol_index_mapping;             // ctx[57]
            uint32_t *neg_map = ctx->neg_symbol_index_mapping;         // ctx[58]
            remapped = remap_index(pos_map, neg_map, ref);
        }
        if (!resolved)
            section_idx = SortedArray_get(ctx->merged_symbol_array, remapped);  // ctx[74]
    }

    // Virtualization check
    if (ctx->section_virtualization_active) {                          // byte at +82
        uint32_t phys = ctx->section_virtualization_table[section_idx];  // ctx[59]
        if (phys && ctx->section_order_map[phys] != section_idx)
            fatal("secidx not virtual");
    }

    return SortedArray_get(ctx->section_array, section_idx);           // ctx[45]
}

The 0xFFFF sentinel at sym+6 means "this symbol's section has not been directly assigned yet -- look it up through the mapping chain." This indirection supports lazy resolution during multi-pass merge operations.

Symbol Index Remapping

The remapping table pair at offsets 456 (symbol_index_mapping) and 464 (neg_symbol_index_mapping) translates symbol indices from input objects to the output object's symbol numbering. The accessor sub_444720 implements the remap:

uint32_t elfw_remap_symbol(elfw *ctx, int old_index) {
    uint32_t *pos_map = (uint32_t *)ctx[57];     // offset 456
    if (!pos_map || old_index == 0) return 0;

    if (old_index > 0) {
        uint32_t new_idx = pos_map[old_index];
        if (new_idx == 0)
            fatal("reference to deleted symbol");
        return new_idx;
    } else {
        uint32_t *neg_map = (uint32_t *)ctx[58];  // offset 464
        uint32_t new_idx = neg_map[-old_index];
        if (new_idx == 0)
            fatal("reference to deleted symbol");
        return new_idx;
    }
}

The "reference to deleted symbol" error appears 14+ times across the binary, always through this remapping path. It fires when dead-code elimination has removed a symbol that is still referenced by a relocation or section link.

Construction: sub_4438F0 (elfw_create)

Parameters

elfw *elfw_create(
    a1:  type_code,       // Overloaded: arena pointer in some paths, type code in others
    a2:  is_64bit,        // 0 = 32-bit ELF, nonzero = 64-bit ELF
    a3:  abi_version,     // e_ident[EI_ABIVERSION] -- SM ABI version
    a4:  sm_major,        // SM architecture major version (e.g., 90, 100)
    a5:  sm_minor,        // SM architecture minor version / letter code
    a6:  debug_flag,      // Enable debug info generation
    a7:  api_version,     // CUDA API version or e_version
    a8:  verbose_flags,   // Verbose output control
    a9:  merge_flags,     // Master bitmask controlling all link behavior
    a10: is_relocatable   // Explicit relocatable-link flag
)

Initialization Sequence

The construction proceeds in 11 ordered steps:

Step 1 -- Private arena (conditional on merge_flags & 0x400):

Creates an isolated "elfw memory space" arena via sub_432020 with 4096-byte page size. Stores the arena at ctx[76] and its handle at ctx[77]. When this flag is not set, allocations go to the parent arena.

Step 2 -- Struct allocation and zeroing:

Allocates 672 bytes from the arena via sub_4307C0. Zeroes the entire buffer with memset. Sets the end marker at ctx[83] to 0.

Step 3 -- ELF header template:

Writes the \x7fELF magic, class byte, data encoding (LSB), version, OSABI (0x41 for device ELF, 0x33 for non-device), and machine type (190 = EM_CUDA). For device ELF (merge_flags & 0x8000), sets e_type = ET_EXEC (2) for non-relocatable or ET_REL (1) for relocatable. If not device ELF, sets e_type = 0x80000000 for relocatable.
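
The device-ELF branch of this step can be sketched as follows. This is an illustration under stated assumptions, not the decompiled constructor: elfw_write_header_template is a hypothetical name, the context is treated as a raw byte image, and only the identification bytes, e_type, and e_machine are written at the offsets given in the Region 1 table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ELFW_SIZE = 672 };

/* Hypothetical helper: fill the header template of a context image for the
 * device-ELF case only (OSABI 0x41). Multi-byte fields are stored
 * little-endian byte by byte so the sketch does not depend on host order. */
static void elfw_write_header_template(unsigned char *ctx, int is_64bit,
                                       uint8_t abi_version, int is_relocatable) {
    assert(ctx != NULL);
    memset(ctx, 0, ELFW_SIZE);                /* Step 2: zero the whole struct */

    ctx[0] = 0x7F; ctx[1] = 'E'; ctx[2] = 'L'; ctx[3] = 'F';  /* 0x464C457F as a LE dword */
    ctx[4] = (uint8_t)((is_64bit != 0) + 1);  /* ei_class: 1=Elf32, 2=Elf64 */
    ctx[5] = 1;                               /* ei_data: little-endian */
    ctx[6] = 1;                               /* ei_version: EV_CURRENT */
    ctx[7] = 0x41;                            /* ei_osabi: device ELF */
    ctx[8] = abi_version;                     /* ei_abiversion (a3) */

    uint16_t e_type = is_relocatable ? 1 : 2; /* ET_REL or ET_EXEC */
    uint16_t e_machine = 190;                 /* EM_CUDA */
    ctx[16] = (uint8_t)(e_type & 0xFF);
    ctx[17] = (uint8_t)(e_type >> 8);
    ctx[18] = (uint8_t)(e_machine & 0xFF);
    ctx[19] = (uint8_t)(e_machine >> 8);
}
```

The non-device branch and the Mercury e_type value are left out because the text does not pin down their encodings precisely enough to sketch.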

Step 4 -- Merge-flags decomposition:

Extracts the 17 individual boolean flags into bytes 84--100. Forces is_relocatable = 1 if a10 is set or if merge_flags & 0x180000 is nonzero, setting merge_flags |= 0x80000.
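
A minimal sketch of the decomposition, using the bit positions and byte offsets from the Region 2 table. The function name elfw_decompose_flags is hypothetical, the context is modeled as a raw byte image, and the forced-relocatable override at offset 89 is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: spread merge_flags bits into single-byte flags.
 * Offsets and bit numbers follow the Region 2 table; field names are
 * omitted because only the bit/offset pairs matter here. */
static void elfw_decompose_flags(unsigned char *ctx, uint32_t merge_flags,
                                 uint32_t sm_minor) {
    assert(ctx != NULL);
    ctx[84]  = (merge_flags & 0x0001) != 0;   /* bit 0  */
    ctx[85]  = (merge_flags & 0x0002) != 0;   /* bit 1  */
    ctx[86]  = (merge_flags & 0x0200) != 0;   /* bit 9  */
    ctx[87]  = (merge_flags & 0x0004) != 0;   /* bit 2  */
    ctx[88]  = (merge_flags & 0x0008) != 0;   /* bit 3  */
    ctx[89]  = (merge_flags & 0x0010) != 0;   /* bit 4 (may also be forced) */
    ctx[90]  = (merge_flags & 0x0020) != 0;   /* bit 5  */
    ctx[91]  = (merge_flags & 0x4000) != 0;   /* bit 14 */
    ctx[92]  = (merge_flags & 0x0040) != 0;   /* bit 6  */
    ctx[93]  = (merge_flags & 0x0100) != 0;   /* bit 8  */
    ctx[94]  = (sm_minor > 0x45) && (merge_flags & 0x0080) != 0;  /* bit 7, gated */
    ctx[96]  = (merge_flags & 0x0800) != 0;   /* bit 11 */
    ctx[99]  = (merge_flags & 0x1000) == 0;   /* inverted bit 12 */
    ctx[100] = (merge_flags & 0x2000) != 0;   /* bit 13 */
    ctx[101] = (merge_flags & 0x8000) != 0;   /* device ELF */
}
```

Storing each bit as its own byte trades a few bytes of space for cheap single-byte tests in the later phases, which is the motivation the Region 2 description gives.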

Step 5 -- Architecture state initialization:

Calls sub_45AC50 (Mercury/relocatable path) or sub_459640 (non-Mercury path) with the SM version. The returned pointer is stored at ctx[61] (offset 488). This is a vtable of ~70 function pointers covering all architecture-specific behaviors: relocation handlers, instruction encoders, section layout rules. If the call returns NULL, the constructor emits "couldn't initialize arch state" via sub_467460 and aborts.

Step 6 -- Hash table creation:

Creates two 512-bucket LinkerHash instances for symbol name lookup (ctx[36], offset 288) and section name lookup (ctx[37], offset 296). Both use MurmurHash3-based string hashing (sub_44E000) and strcmp equality (sub_44E180).

Step 7 -- Sorted array creation:

Creates nine sorted arrays via sub_465020 (six with 16-element initial capacity) and sub_464AE0 (three with 64-element initial capacity):

Field              Offset    QWORD       Initial capacity  Contents
-----              ------    -----       ----------------  --------
sorted_array_0..5  520--560  [65]--[70]  16                Section management (six arrays for different section categories)
pos_symbol_array   344       [43]        64                Positive-index symbols
neg_symbol_array   352       [44]        64                Negative-index symbols
section_array      360       [45]        64                Section data records

Step 8 -- Nested sub-structures:

Allocates a 104-byte sub-structure and links it into section_array. Allocates a 48-byte sub-structure and links it into both pos_symbol_array and neg_symbol_array. These serve as the initial (index-0) sentinel entries.

Step 9 -- Input file records:

Creates an 8-element sorted array at ctx[64] (offset 512). Allocates a 16-byte <input> record containing the string "<input>" as the name pointer and the SM minor version as the architecture identifier. Links this record into the array.
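
The 16-byte record can be pictured as below. This is a sketch assuming a 64-bit host and the {name_ptr, sm_minor, reserved} field order from the Region 3 table; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one input file record: 8-byte name pointer
 * followed by two 4-byte fields, 16 bytes in total on a 64-bit host. */
typedef struct {
    const char *name;      /* file name, or "<input>" for the placeholder */
    uint32_t    sm_minor;  /* architecture identifier */
    uint32_t    reserved;  /* purpose not documented; kept zero here */
} InputFileRecord;

/* Build the placeholder record that the constructor links in first. */
static InputFileRecord make_placeholder_record(uint32_t sm_minor) {
    InputFileRecord rec = { "<input>", sm_minor, 0 };
    return rec;
}

/* Sanity check: pointer plus two 32-bit fields, with no padding. */
static void check_record_layout(void) {
    assert(sizeof(InputFileRecord) == sizeof(char *) + 8);
}
```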

Step 10 -- Core ELF sections:

Uses sub_441AC0 (elfw_add_section) to create the mandatory sections:

Section        Type                   Link           Alignment               Entry size
-------        ----                   ----           ---------               ----------
.shstrtab      SHT_STRTAB (3)         0              1                       0
.strtab        SHT_STRTAB (3)         0              1                       0
.symtab        SHT_SYMTAB (2)         .strtab index  8 (Elf64) or 4 (Elf32)  24 (Elf64) or 16 (Elf32)
.symtab_shndx  SHT_SYMTAB_SHNDX (18)  .symtab index  4                       4
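
The .symtab entry sizes of 24 and 16 bytes are the sizes of the standard ELF symbol records. The sketch below restates those standard layouts with fixed-width types (the names Sym64, Sym32, and symtab_entsize are local to this example, not taken from nvlink) to show where the numbers come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard Elf64 symbol layout: 24 bytes. */
typedef struct {
    uint32_t st_name;
    uint8_t  st_info;
    uint8_t  st_other;
    uint16_t st_shndx;
    uint64_t st_value;
    uint64_t st_size;
} Sym64;

/* Standard Elf32 symbol layout: 16 bytes (note the different field order). */
typedef struct {
    uint32_t st_name;
    uint32_t st_value;
    uint32_t st_size;
    uint8_t  st_info;
    uint8_t  st_other;
    uint16_t st_shndx;
} Sym32;

/* Entry size that .symtab would be created with for a given ELF class. */
static size_t symtab_entsize(int is_64bit) {
    size_t size = is_64bit ? sizeof(Sym64) : sizeof(Sym32);
    assert(size == (is_64bit ? 24u : 16u));  /* matches the Entry size column */
    return size;
}
```

The 4-byte entry size of .symtab_shndx follows the same logic: each entry is one 32-bit extended section index.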

For device ELF only:

Section          Type          Alignment  Description
-------          ----          ---------  -----------
.note.nv.tkinfo  SHT_NOTE (7)  0x2000000  Toolkit info note
.note.nv.cuinfo  SHT_NOTE (7)  0x1000000  CUDA compilation info

For non-relocatable output only:

Section        Type        Entry size  Description
-------        ----        ----------  -----------
.nv.uft.entry  0x70000011  32          Unified function table entries

Step 11 -- Entry-point hash and callgraph:

Creates a 32-bucket string-keyed LinkerHash at ctx[62] (offset 496) for entry-point symbol tracking. Populates it with the built-in CUDA syscall names from the static table at off_1D3A9C0. Creates an 8-bucket integer-keyed LinkerHash at ctx[72] (offset 576) for relocation type tracking. Calls sub_4504B0 to initialize callgraph structures.

Destruction: sub_4475B0 (elfw_destroy)

The destructor takes the context pointer and a mode flag. It has two code paths:

Path 1: Private arena (ctx[76] != 0)

When the context owns a private arena, destruction is simple -- destroy the arena and everything allocated from it goes away:

void elfw_destroy(elfw *ctx, uint64_t mode) {
    if (ctx->private_arena_ptr) {                  // ctx[76]
        sub_45CAE0(ctx->private_arena_handle, mode);   // release arena handle
        sub_431C70(ctx->private_arena_ptr, 0);         // arena_destroy(arena, no_merge)
        return;
    }
    // ... individual cleanup follows
}

Path 2: Individual cleanup (ctx[76] == 0)

When no private arena exists, each sub-structure must be freed individually. The sequence walks every field that owns allocated memory:

  1. Index mapping tables (offsets 456, 464, 472): Free with arena_free
  2. Hash tables (offsets 288, 296): Walk entries via sub_448C00 calling sub_440080 (symbol record destructor), then destroy hash tables via sub_448A40
  3. Reloc tracking pointers (offsets 336, 328): Free via arena_free
  4. Sorted arrays at offsets 520--560: Free each via sub_466E00 with destructor sub_45CAD0
  5. Relocation lists (offsets 376, 384, 392): Destroy via sub_464550
  6. Positive symbols (offset 344): Iterate all entries starting at index 0, free each record, then destroy the array via sub_464B90
  7. Negative symbols (offset 352): Iterate entries starting at index 1 (skipping sentinel), free each record, destroy array
  8. Merged/extended arrays (offsets 592, 600): Destroy via sub_464B90 if non-null
  9. Input file records (offset 512): Iterate entries, free each 16-byte record, destroy array
  10. Global symbol list (offset 448): Free via sub_464520
  11. Additional arrays (offsets 424, 432, 440): Free via sub_464520
  12. Callgraph cleanup: sub_44CC60(ctx) -- destroys callgraph state
  13. Sorted arrays (offsets 256, 272, 280, 264): Free via sub_464520
  14. Section order map (offset 368): Free via arena_free
  15. Section records in section_array (offset 360): Iterate entries, free section data at entry+72, free each record, destroy array
  16. File list (offset 480): Walk linked list, free each node's data at node+8
  17. Entry hash and reloc type hash (offsets 496, 576): Destroy via sub_448A40
  18. File list storage (offset 480): Free via sub_464520
  19. Additional storage (offset 448): Free via sub_464520
  20. Arch state (offset 488): Destroy via sub_45B680
  21. The struct itself: Free via arena_free

The ordering is significant -- hash tables must be walked before the entries they reference are freed, and the arch state must outlive any code that might call through the vtable during teardown.

Context Flow Through Pipeline Phases

The linker context is the single thread of mutable state that connects all pipeline phases. In main(), it is stored in a local variable and passed as the first argument to every phase function.

Phase 5: Context Creation

elfw *ctx = elfw_create(
    (byte_2A5F1E8 == 0) + 1,     // type: 1=exec, 2=relocatable
    1,                             // 64-bit
    sm_abi_version,                // ABI version
    sm_major,                      // e.g., 100
    sm_minor,                      // e.g., 0x61
    debug_flag,                    // from -g
    api_version,                   // from --cuda-api-version
    verbose_flags,                 // from -v
    merge_flags,                   // accumulated from all CLI options
    is_relocatable                 // from -r
);

After creation, ctx holds an ELF skeleton with core sections but no application content.

Phase 7: Input File Loop

Each input cubin passes through sub_426570 (validate arch) which reads ctx->sm_arch_major (offset 72) and ctx->e_flags (offset 48) to verify architecture compatibility. Input file records are appended to ctx[64] (offset 512).

Phase 9: Merge

merge_elf (sub_45E7D0) is called once per input object:

int err = merge_elf(ctx, input_elf, filename, ...);

During merge, the function:

  • Adds symbols to pos_symbol_array / neg_symbol_array (offsets 344, 352)
  • Hashes symbol names into symbol_name_hash (offset 288)
  • Adds sections to section_array (offset 360)
  • Hashes section names into section_name_hash (offset 296)
  • Appends relocations to reloc_list_0 / reloc_list_1 (offsets 376, 384)
  • Builds the symbol index mapping tables (offsets 456, 464) for each input
  • Populates the callgraph via sub_4504B0
  • Writes CUDA metadata into tkinfo_buffer / cuinfo_buffer (offsets 108, 140)

Phase 10: Shared Memory Layout

sub_439830 reads the section arrays to identify .nv.shared.* and .nv.global sections, computes overlapping-set analysis, and updates section offsets in ctx->section_array. The section_virtualization_table (offset 472) may be populated here.

Phase 11: Dead Code Elimination

sub_44AD40 traverses the callgraph built during merge, marks reachable symbols, then removes unreachable sections from ctx->section_array and symbols from the symbol arrays. The "reference to deleted symbol" remapping errors can originate from dangling references after this phase.

Phase 12: Layout

sub_465720 and related functions read all sections from ctx->section_array, sort them, compute file offsets and virtual addresses, and store the results back into section records. The section_order_map (offset 368) is built here.

Phase 13: Relocation

sub_469D60 reads relocations from the relocation lists (offsets 376--392), resolves symbol addresses through the mapping tables (offsets 456--464), and applies relocation patches to section data. The arch_vtable at offset 488 provides architecture-specific relocation handlers.

Phase 14: Finalization

sub_445000 performs final relocation application and ELF finalization. It may call sub_444690 to reset the arch state. The function reads nearly every field of the context.

Phase 15: Serialization

sub_45BF00 reads the ELF header template (offsets 0--62), iterates all sections in ctx->section_array, and serializes the complete ELF to a byte buffer. The buffer is then written to disk via sub_45C920 (file) or returned to a caller via sub_45C950 (memory).

Phase 16: Destruction

sub_4475B0 cleans up the context as described in the destruction section above.

Orchestrator: sub_447570

The function at 0x447570 is a thin orchestrator that chains three phases through the context:

void elfw_layout_relocate_finalize(elfw *ctx, bool do_layout, bool do_reloc) {
    if (do_layout)
        sub_439830(ctx);    // shared memory layout
    else
        sub_438C60(ctx);    // alternative layout path

    if (do_reloc)
        sub_469D60(ctx);    // apply relocations

    sub_445000(ctx, ...);   // finalization
}

This collapses Phases 10, 13, and 14 into a single call when the pipeline mode requires all three in sequence.

Architecture Vtable (offset 488)

The arch_vtable pointer at offset 488 deserves special attention. It points to a structure created by sub_459640 (non-Mercury) or sub_45AC50 (Mercury/relocatable), which allocates a ~632-byte block containing approximately 70 function pointers. Each pointer is a handler for a specific relocation type or architecture-specific operation.

The vtable is dispatched on SM version during creation:

SM range  Handler set                   Notes
--------  -----------                   -----
30--39    Kepler handlers               Legacy
50--59    Maxwell handlers
60--69    Pascal handlers
70--74    Volta handlers
75--79    Turing handlers
80--89    Ampere/Ada handlers
90--99    Hopper handlers
100+      Mercury (Blackwell+) handlers New relocation types

The vtable is called through the context during relocation:

// sub_469D60 -- apply_relocations
void (*handler)(elfw *ctx, ...) = ctx->arch_vtable->handlers[reloc_type];
handler(ctx, section, offset, addend, symbol_value);
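
A self-contained sketch of this style of dispatch is shown below. Everything in it is illustrative: the slot count of 70 is taken from the "approximately 70" figure above, and the RelocRequest type, the handler signature, and the handle_abs64 handler are assumptions rather than the layout of the real vtable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ARCH_HANDLER_SLOTS = 70 };  /* assumed; the text says "approximately 70" */

/* Hypothetical argument bundle for one relocation. */
typedef struct {
    uint64_t offset;        /* where the patch lands in the section */
    int64_t  addend;
    uint64_t symbol_value;
    uint64_t result;        /* value the handler computed */
} RelocRequest;

typedef int (*reloc_handler)(RelocRequest *req);

/* Hypothetical vtable: one handler slot per relocation type. */
typedef struct {
    reloc_handler handlers[ARCH_HANDLER_SLOTS];
} ArchVtable;

/* Example handler: plain absolute relocation, result = S + A. */
static int handle_abs64(RelocRequest *req) {
    req->result = req->symbol_value + (uint64_t)req->addend;
    return 0;
}

/* Look up the handler by relocation type; -1 if the slot is out of range or empty. */
static int arch_dispatch(const ArchVtable *vt, unsigned reloc_type, RelocRequest *req) {
    assert(vt != NULL && req != NULL);
    if (reloc_type >= ARCH_HANDLER_SLOTS || vt->handlers[reloc_type] == NULL)
        return -1;
    return vt->handlers[reloc_type](req);
}
```

The bounds and NULL checks are additions for safety in the sketch; whether the real dispatch performs them is not stated in the text.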

Syscall Symbol Table (offset 496)

The hash table at ctx[62] (offset 496) maps CUDA syscall symbol names to boolean presence flags. It is populated during construction from a static table at off_1D3A9C0 containing the names of all built-in CUDA device runtime syscalls. The accessors sub_4447B0, sub_444830, and sub_444840 check this table:

  • sub_4447B0: Returns 1 if the name is __cuda_syscall* or is in the entry hash
  • sub_444830: Exact match for __cuda_syscall_32f3056bbb
  • sub_444840: Checks syscall status AND verifies the name starts with cnp (cooperative launch prefix)

These checks are used during dead-code elimination and cudadevrt handling to distinguish intrinsic syscalls (which cannot be dead-code eliminated) from regular functions.
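
The three checks can be approximated as below. This is a sketch under loud assumptions: the real entry hash is replaced by a tiny static list, and the name "cnpHypotheticalLaunch" in that list is a placeholder invented for the example, not an entry from off_1D3A9C0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the entry hash at offset 496. The single name below is a
 * made-up placeholder; the real table contents are not listed in the text. */
static const char *const k_entry_names[] = { "cnpHypotheticalLaunch", NULL };

static int name_in_entry_table(const char *name) {
    for (size_t i = 0; k_entry_names[i] != NULL; i++)
        if (strcmp(k_entry_names[i], name) == 0)
            return 1;
    return 0;
}

/* Like sub_4447B0: a __cuda_syscall prefix or membership in the entry table. */
static int is_syscall(const char *name) {
    static const char prefix[] = "__cuda_syscall";
    assert(name != NULL);
    return strncmp(name, prefix, sizeof prefix - 1) == 0 || name_in_entry_table(name);
}

/* Like sub_444830: exact match against one specific syscall name. */
static int is_syscall_32f(const char *name) {
    assert(name != NULL);
    return strcmp(name, "__cuda_syscall_32f3056bbb") == 0;
}

/* Like sub_444840: must be a syscall and must start with "cnp". */
static int is_cnp_syscall(const char *name) {
    return is_syscall(name) && strncmp(name, "cnp", 3) == 0;
}
```

Because a name cannot start with both "__cuda_syscall" and "cnp", the cnp check in this sketch can only succeed for names found in the table.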

Section Virtualization (offset 472)

When section virtualization is active (byte at offset 82 is nonzero), the section_virtualization_table at offset 472 maps each virtual section index to its physical section index. This table is populated during the merge phase when sections from different input objects map to the same output section (e.g., multiple .nv.constant0 sections merging into one).

The invariant enforced by every accessor is:

if (ctx->section_virtualization_active) {
    uint32_t physical = ctx->section_virtualization_table[virtual_idx];
    if (physical != 0 && ctx->section_order_map[physical] != virtual_idx)
        fatal("secidx not virtual");
}

This consistency check appears in sub_443260, sub_443500, and many other accessors. It catches corruption in the merge-time section mapping.
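
The invariant is a round-trip check between two tables, which the following sketch makes testable. It assumes both tables are plain uint32_t arrays of the same length and returns a boolean instead of calling fatal(); the function name and the length parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when virtual_idx passes the round-trip check:
 * virt_table maps virtual -> physical, order_map maps physical -> virtual.
 * A physical index of 0 means "no mapping" and is always accepted. */
static bool secidx_is_consistent(const uint32_t *virt_table,
                                 const uint32_t *order_map,
                                 size_t count, uint32_t virtual_idx) {
    assert(virt_table != NULL && order_map != NULL);
    assert(virtual_idx < count);

    uint32_t physical = virt_table[virtual_idx];
    if (physical == 0)
        return true;
    /* The range check is an addition for the sketch's safety. */
    return physical < count && order_map[physical] == virtual_idx;
}
```

A caller that wants the original behavior would report "secidx not virtual" and abort whenever this function returns false.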

Confidence Assessment

Each claim below was verified against decompiled functions (sub_4438F0 at /decompiled/sub_4438F0_0x4438f0.c, sub_4475B0, sub_443260, sub_443500, sub_444720, sub_444710, sub_43E490, sub_42F8B0), string references in nvlink_strings.json, and raw research report W080.

ClaimConfidenceEvidence
Struct size = 672 bytesHIGHsub_4307C0(v14, 672) on sub_4438F0 line 130; memset of 672 bytes on line 135
ELF magic at offset 0 (0x464C457F)HIGH*(_DWORD *)v17 = 1179403647 literal on sub_4438F0 line 141
ei_class at offset 4HIGH*((_BYTE *)v17 + 4) = (a2 != 0) + 1 on sub_4438F0 line 146
ei_data+ei_version at offset 5 as word 0x0101HIGH*(_WORD *)((char *)v17 + 5) = 257 on sub_4438F0 line 142
ei_osabi at offset 7 (0x41 device / 0x33 non-device)HIGH*((_BYTE *)v17 + 7) = 65 on line 149, or 51 on line 197
ei_abiversion at offset 8HIGH*((_BYTE *)v17 + 8) = a3 on lines 150 and 198
| e_type at offset 16 | HIGH | `*((_WORD *)v17 + 8) = v114` on line 151 (word 8 = byte 16) |
| e_machine = 190 at offset 18 | HIGH | `*((_WORD *)v17 + 9) = 190` on lines 152 and 199 |
| e_flags at offset 48 | HIGH | `sub_444710`: `*(_DWORD *)(a1 + 48) \|= a2` (DWORD-wide access at byte 48) |
| e_version_or_api at offset 20 | HIGH | `*((_DWORD *)v17 + 5) = a7` on line 223 (dword 5 = byte 20) |
| link_mode_bits = merge_flags & 0x70000 at offset 68 | HIGH | `*((_DWORD *)v17 + 17) = v20 & 0x70000` on lines 172, 208, 216 (dword 17 = byte 68) |
| verbose_flags at offset 64 | HIGH | `*((_BYTE *)v17 + 64) = a8` on line 236 |
| sm_arch_major at offset 72 | HIGH | `*((_DWORD *)v17 + 18) = a4` on line 145 (dword 18 = byte 72) |
| merge_flags at offset 76 | HIGH | `*((_DWORD *)v17 + 19) = a9` on lines 158, 164, 207, 215 (dword 19 = byte 76) |
| debug_flag at offset 80 | HIGH | `*((_BYTE *)v17 + 80) = a6` on line 235 |
| has_shstrtab at offset 83 | HIGH | `*((_BYTE *)v17 + 83) = !v31` on line 241 (tests word 42 for 0) |
| section_virtualization flag byte at offset 82 | HIGH | `sub_443260` line 80: `*(_BYTE *)(a1 + 82)` gates virtualization check; same in `sub_443500` line 86 |
| preserve_relocs at offset 84 (bit 0) | HIGH | `*((_BYTE *)v17 + 84) = v20 & 1` on line 237 |
| force_rela at offset 85 (bit 1) | HIGH | `*((_BYTE *)v17 + 85) = (v20 & 2) != 0` on line 238 |
| allow_undef_globals at offset 86 (bit 9) | HIGH | `*((_BYTE *)v17 + 86) = (v20 & 0x200) != 0` on line 240 |
| no_opt at offset 87 (bit 2) | HIGH | `*((_BYTE *)v17 + 87) = (v20 & 4) != 0` on line 242 |
| optimize_data at offset 88 (bit 3) | HIGH | `*((_BYTE *)v17 + 88) = (v20 & 8) != 0` on line 243 |
| Byte at offset 89 = (bit 4) \|\| mercury_flag | HIGH | `v32 = (v20 >> 4) & 1; if (v13) LOBYTE(v32) = 1; *((_BYTE *)v17 + 89) = v32` on lines 246-249 |
| emit_ptx at offset 90 (bit 5) | HIGH | `*((_BYTE *)v17 + 90) = (v20 & 0x20) != 0` on line 244 |
| Flag bit 0x4000 at offset 91 | HIGH | `*((_BYTE *)v17 + 91) = (v20 & 0x4000) != 0` on line 245 |
| Flag bit 6 at offset 92 | HIGH | `*((_BYTE *)v17 + 92) = (v20 & 0x40) != 0` on line 250 |
| Flag bit 8 at offset 93 | HIGH | `*((_BYTE *)v17 + 93) = BYTE1(v20) & 1` on line 253 |
| extended_smem at offset 94 | HIGH | `*((_BYTE *)v17 + 94) = (a5 > 0x45u) & ((unsigned __int8)v20 >> 7)` on line 260 |
| Flag bit 0x800 at offset 96 | HIGH | `*((_BYTE *)v17 + 96) = (v20 & 0x800) != 0` on line 259 |
| no_warn_dead_code at offset 99 (!bit 12) | HIGH | `*((_BYTE *)v17 + 99) = ((v20 >> 12) ^ 1) & 1` on line 251 |
| Byte at offset 100 = (merge_flags & 0x2000) != 0 | HIGH | `*((_BYTE *)v17 + 100) = (v20 & 0x2000) != 0` on line 252 (overwrites earlier word-wide write) |
| is_device_elf at offset 101 | HIGH | `*((_BYTE *)v17 + 101) = (a9 & 0x8000) != 0` on line 144 |
| tkinfo_buffer at offset 108 (labeling error) | LOW | `sub_43E490(v17 + 108, 1000)` initializes a 24-byte ELF note header (namesz, descsz, type=1000, "NVIDIA Corp"), not a 1000-byte buffer. Per line 539, `sub_433760(v17, cuinfo_idx, v17+108, 4, 32)`, offset 108 is tied to the cuinfo note, not tkinfo |
| cuinfo_buffer at offset 140 (labeling error) | LOW | `sub_43E490(v17 + 140, 2000)` initializes a second ELF note header (type=2000). The 1000/2000 values are NVIDIA note TYPE identifiers, not capacities. Which of 108/140 is tkinfo vs cuinfo is ambiguous from the constructor alone |
| Two 24-byte NVIDIA note headers at +108 and +140 | HIGH | `sub_43E490` sets namesz=12, descsz∈{8,24,0}, type=a2, and strcpy "NVIDIA Corp" at +12 of each |
| shstrtab_section_idx at offset 62 | HIGH | `*((_WORD *)v17 + 31) = v53` on line 368 (word 31 = byte 62) |
| Section indices at 200--210 (tkinfo/symtab/strtab/etc) | LOW | DOCUMENTED ERROR in wiki body: the page currently claims tkinfo at 200, symtab_section_idx at 202, symtab_shndx at 204, strtab at 206, cuinfo at 208, cuinfo_note at 210. The decompiled constructor shows otherwise: word 101 (byte 202) = strtab idx [line 427]; word 102 (byte 204) = symtab idx [line 494]; word 103 (byte 206) = symtab_shndx idx [line 522/573]; word 104 (byte 208) = cuinfo idx [line 538]; word 105 (byte 210) = tkinfo idx [line 531]. Byte 200 (word 100) is the API version cache `a7` [lines 177, 221], not a section index. The page body needs correction. |
| strtab_section_idx at offset 202 (word 101) | HIGH | `*((_WORD *)v17 + 101) = v58` on line 427 after .strtab creation |
| symtab_section_idx at offset 204 (word 102) | HIGH | `*((_WORD *)v17 + 102) = v63` on line 494 after .symtab creation; also read in `sub_441AC0` as link field |
| symtab_shndx_idx at offset 206 (word 103) | HIGH | `*((_WORD *)v17 + 103) = v68/v78` on lines 522/573 after .symtab_shndx creation |
| cuinfo_section_idx at offset 208 (word 104) | HIGH | `*((_WORD *)v17 + 104) = v91` on line 538 after .note.nv.cuinfo creation |
| tkinfo_section_idx at offset 210 (word 105) | HIGH | `*((_WORD *)v17 + 105) = sub_440350(...)` on line 531 after .note.nv.tkinfo creation |
| symbol_name_hash at offset 288 | HIGH | `v17[36] = sub_4489C0(sub_44E000, sub_44E180, 512)` on line 261; `sub_440BE0` reads `a1+288` on lines 125, 182, 211, 313 |
| section_name_hash at offset 296 | HIGH | `v17[37] = sub_4489C0(sub_44E000, sub_44E180, 512)` on line 262; `sub_441AC0` reads `a1+296` on lines 93, 174, 203 for .section name lookup |
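| `v17[39] = 0x100000000LL` at offset 312 | HIGH | Line 264; the 64-bit store puts 0 in the low dword and count=1 in the high dword. On a little-endian target the low dword is dword 78 (byte 312) and the high dword is dword 79 (byte 316), so the count lands at byte 316 |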
| strtab_entry_count = 1 at offset 320 | HIGH | `*((_DWORD *)v17 + 80) = 1` on line 265 (dword 80 = byte 320) |
| pos_symbol_array at offset 344 | HIGH | `v17[43] = sub_464AE0(64)` on line 272; `sub_443260` line 31: `*(_QWORD *)(a1 + 344)` for positive index |
| neg_symbol_array at offset 352 | HIGH | `v17[44] = sub_464AE0(64)` on line 273; `sub_443260` line 29: `*(_QWORD *)(a1 + 352)` for negative index |
| section_array at offset 360 | HIGH | `v17[45] = sub_464AE0(64)` on line 274; `sub_443260` line 98: `v16 = *(_QWORD *)(a1 + 360)` for final resolution |
| 104-byte section-array sentinel appended | HIGH | `v36 = sub_4307C0(v33, 104)` on line 276; `sub_464C30(v36, v17[45])` on line 285 |
| 48-byte sentinel appended to both pos/neg symbol arrays | HIGH | `v41 = sub_4307C0(v38, 48)` on line 287; two `sub_464C30` calls on lines 293-294 |
| section_order_map at offset 368 | HIGH | `sub_443260` line 92: `v15 = *(_QWORD *)(a1 + 368)`; destructor frees `a1[46]` on line 110 |
| reloc_list_0 at offset 376 | HIGH | Destructor: `sub_464550(a1[47], 0)` on line 52 (qword 47 = byte 376) |
| reloc_list_1 at offset 384 | HIGH | Destructor: `sub_464550(a1[48], 0)` on line 53 |
| reloc_list_2 at offset 392 | HIGH | Destructor: `sub_464550(a1[49], 0)` on line 55 |
| input_section_list at offset 408 (qword 51) | HIGH | `v17[51] = sub_464AE0(32)` on line 295 (32-element initial capacity) |
| Reloc counter at offset 416 | HIGH | `*((_DWORD *)v17 + 104) = 0` on line 296 (dword 104 = byte 416) |
| global_symbol_list at offset 448 (qword 56) | HIGH | Destructor: `sub_464520(a1[56])` on line 129 |
| symbol_index_mapping at offset 456 | HIGH | `sub_444720` line 10: `v6 = *(_QWORD *)(a1 + 456)`; destructor `sub_431000(a1[57], a2)` on line 39 |
| neg_symbol_index_mapping at offset 464 | HIGH | `sub_444720` line 16: `*(_QWORD *)(a1 + 464)`; destructor `sub_431000(a1[58], a2)` on line 38 |
| section_virtualization_table at offset 472 | HIGH | `sub_443260` line 89: `*(_QWORD *)(a1 + 472)` indexed by `v13`; destructor `sub_431000(a1[59], a2)` on line 37 |
| file_list at offset 480 | HIGH | Destructor walks `j = (_QWORD *)a1[60]; j = (_QWORD *)*j; sub_431000(j[1], ...)` on line 124 |
| arch_vtable at offset 488 | HIGH | Constructor line 189: `v17[61] = sub_459640(v25)` or line 229: `v17[61] = sub_45AC50(v25)`; destructor `sub_45B680(a1 + 61)` on line 130 |
| entry_hash at offset 496 | HIGH | Constructor lines 588-589: `v71 = sub_4489C0(sub_44E000, sub_44E180, 32); v17[62] = v71`; destructor `sub_448A40(a1[62])` on line 126; `sub_443500` reads `a1 + 496` for syscall lookup |
| Input file records at offset 512 (qword 64) | HIGH | `v17[64] = sub_464AE0(8)` on line 297; `v44 = sub_4307C0(..., 16)`; `*v44 = "<input>"` on line 307; `*((_DWORD *)v44 + 2) = v24` (sm_minor) on line 308 |
| Six sorted arrays at offsets 520--560 (qwords 65--70) | HIGH | `v17[65..70] = sub_465020(sub_44E000, sub_44E180, 16)` six times on lines 266-271 |
| reloc_type_hash at offset 576 (qword 72) | HIGH | `v17[72] = sub_4489C0(sub_44E120, sub_44E130, 8)` on line 596; destructor `sub_448A40(a1[72])` on line 127 |
| merged_symbol_array at offset 592 (qword 74) | HIGH | `sub_443260` line 78: `sub_464DB0(*(_QWORD *)(a1 + 592), v24)`; destructor `v13 = a1[74]; if (v13) sub_464B90(v13)` on lines 84-86; also read in `sub_443500` line 79 |
| extended_symbol_store at offset 600 (qword 75) | HIGH | `sub_443260` line 37: `v23 = *(_QWORD *)(a1 + 600)`; destructor `v14 = a1[75]; if (v14) sub_464B90(v14)` on lines 87-89 |
| private_arena_ptr at offset 608 (qword 76) | HIGH | Destructor: `if (a1[76]) { ... sub_431C70(a1[76], 0); }` on lines 29-33; constructor `v17[76] = v117` on line 256 |
| private_arena_handle at offset 616 (qword 77) | HIGH | Constructor `v17[77] = v118` on line 257; destructor `sub_45CAE0(a1[77], a2)` on line 31 |
| option_parser_result at offset 624 (labeling error) | LOW | `*((_DWORD *)v17 + 156) = sub_42F8B0()` on line 597 (dword 156 = byte 624). But `sub_42F8B0` is a one-line function that returns the literal constant 5, not an option parser result. The correct label is "arch class = 5" (matches the elf-writer.md wiki page). The value is set once to 5 and never changed. |
| end_marker at offset 664 (qword 83) | HIGH | `v17[83] = 0` on line 134 (early clear before memset) |
| Boolean flag CLI names (offsets 84--100) | MEDIUM | Bit extractions and offsets verified against constructor lines 237-260. Semantic CLI names (e.g., `--reserve-null`, `--disable-smem-reservation`) cannot be confirmed from `sub_4438F0` alone and require tracing through the option parser |
| symbol_name_hash uses 512 buckets | HIGH | Constructor line 261: third param to `sub_4489C0` is 512 |
| section_name_hash uses 512 buckets | HIGH | Constructor line 262: third param to `sub_4489C0` is 512 |
| entry_hash uses 32 buckets | HIGH | Constructor line 588: third param to `sub_4489C0` is 32 |
| reloc_type_hash uses 8 buckets | HIGH | Constructor line 596: third param is 8; uses `sub_44E120`/`sub_44E130` (integer-keyed) |
| Syscall names loaded from off_1D3A9C0 | HIGH | Constructor lines 587-595: loop reads `v70 = off_1D3A9C0`, `v72 = *v70++; sub_448E70(v71, v72, 0)` until `v70 == &n` |
| merge_flags & 0x400 gates private arena | HIGH | Line 123: `if ((a9 & 0x400) != 0)` calls `sub_432020("elfw memory space", 0, 4096)` |
| merge_flags & 0x8000 = is_device_elf bit | HIGH | Line 144: `*((_BYTE *)v17 + 101) = (a9 & 0x8000) != 0` |
| Forced-relocatable via `a10 \|\| (a9 & 0x180000)` | HIGH | Lines 153, 201: identical condition sets `v13=1`, `v20 = a9 \| 0x80000`, `e_type = 1` |
| String "elfw memory space" | HIGH | Found at line 12520 of nvlink_strings.json; used in `sub_432020` call on line 125 |
| String "couldn't initialize arch state" | HIGH | Found at line 12622 of nvlink_strings.json; used in `sub_467460` call on line 233 |
| String "reference to deleted symbol" | HIGH | Found at line 11766 of nvlink_strings.json; used 4+ times in `sub_4438F0` and `sub_443260`/`sub_444720`/`sub_443500` |
| String "secidx not virtual" | HIGH | Found at line 12185 of nvlink_strings.json; used in `sub_443260` line 94 and `sub_443500` line 93 |
| Strings ".note.nv.tkinfo", ".note.nv.cuinfo" | HIGH | Found at lines 11279, 11308 of nvlink_strings.json; referenced in `sub_441AC0` section creation path |
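
The flag rows for offsets 68 and 84--101 can be read together as one decoding step from the merge-flags bitmask. The following is a minimal sketch of that step, assuming the bit extractions listed in the table. The struct name `elfw_flags_sketch`, the function name `decode_merge_flags`, and the generic `flag_*` field names are hypothetical; the remaining field names reuse the table's labels, whose CLI semantics are only MEDIUM confidence. The parameters `a5` and `mercury` stand in for the decompiler variables `a5` and `v13`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the decoded flag bytes; comments give the byte
 * offset inside the 672-byte context and the source bit in the mask.
 * This is an illustration, not a recovered header. */
typedef struct {
    uint32_t link_mode_bits;  /* +68:  flags & 0x70000 */
    bool preserve_relocs;     /* +84:  bit 0 */
    bool force_rela;          /* +85:  bit 1 */
    bool allow_undef_globals; /* +86:  bit 9  (0x200) */
    bool no_opt;              /* +87:  bit 2 */
    bool optimize_data;       /* +88:  bit 3 */
    bool bit4_or_mercury;     /* +89:  bit 4, forced to 1 when mercury is set */
    bool emit_ptx;            /* +90:  bit 5 */
    bool flag_0x4000;         /* +91:  bit 14 */
    bool flag_bit6;           /* +92:  bit 6 */
    bool flag_bit8;           /* +93:  bit 8 */
    bool extended_smem;       /* +94:  bit 7, only when a5 > 0x45 */
    bool flag_0x800;          /* +96:  bit 11 */
    bool no_warn_dead_code;   /* +99:  inverted bit 12 */
    bool flag_0x2000;         /* +100: bit 13 */
    bool is_device_elf;       /* +101: bit 15 (0x8000) */
} elfw_flags_sketch;

/* Decode the bitmask the way the constructor rows describe it. */
static elfw_flags_sketch decode_merge_flags(uint32_t flags, unsigned a5, bool mercury)
{
    elfw_flags_sketch f;
    f.link_mode_bits      = flags & 0x70000u;
    f.preserve_relocs     = (flags & 0x1u) != 0;
    f.force_rela          = (flags & 0x2u) != 0;
    f.allow_undef_globals = (flags & 0x200u) != 0;
    f.no_opt              = (flags & 0x4u) != 0;
    f.optimize_data       = (flags & 0x8u) != 0;
    f.bit4_or_mercury     = ((flags >> 4) & 1u) != 0 || mercury;
    f.emit_ptx            = (flags & 0x20u) != 0;
    f.flag_0x4000         = (flags & 0x4000u) != 0;
    f.flag_bit6           = (flags & 0x40u) != 0;
    f.flag_bit8           = ((flags >> 8) & 1u) != 0;
    f.extended_smem       = (a5 > 0x45u) && ((flags >> 7) & 1u) != 0;
    f.flag_0x800          = (flags & 0x800u) != 0;
    f.no_warn_dead_code   = (((flags >> 12) ^ 1u) & 1u) != 0;
    f.flag_0x2000         = (flags & 0x2000u) != 0;
    f.is_device_elf       = (flags & 0x8000u) != 0;
    return f;
}
```

A single `flags` parameter is enough for this sketch: the forced-relocatable path only ORs in `0x80000`, which lies outside every mask decoded here, so `a9` and `v20` give the same result for these fields.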

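The evidence column repeatedly converts a decompiler element index into a byte offset, for example "word 101 = byte 202" or "dword 156 = byte 624". A minimal sketch of that arithmetic follows, for anyone re-checking rows by hand. The helper names are hypothetical and are not functions found in nvlink.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset addressed by *((T *)base + index), where sizeof(T) == width. */
static size_t elem_offset(size_t index, size_t width)
{
    return index * width;
}

static size_t word_offset(size_t index)  { return elem_offset(index, 2); } /* _WORD  */
static size_t dword_offset(size_t index) { return elem_offset(index, 4); } /* _DWORD */
static size_t qword_offset(size_t index) { return elem_offset(index, 8); } /* _QWORD */

/* Halves of a 64-bit value. On a little-endian target the low half is
 * stored at the lower address and the high half 4 bytes after it. */
static uint32_t low_dword(uint64_t value)  { return (uint32_t)(value & 0xFFFFFFFFu); }
static uint32_t high_dword(uint64_t value) { return (uint32_t)(value >> 32); }
```

With these helpers, `qword_offset(83) + 8` evaluates to 672, which agrees with the stated structure size, and the `0x100000000LL` store at qword 39 splits into a low dword of 0 and a high dword of 1.
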
Cross-References