Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Cubin Loading

A cubin is a CUDA device ELF -- an ELF binary with e_machine == 190 (EM_CUDA) containing compiled SASS instructions, constant data, and NVIDIA-specific metadata sections. When nvlink encounters a cubin input (either directly on the command line or extracted from a fatbin container), it must validate the ELF structure, confirm the SM architecture matches the link target, distinguish SASS from PTX-only cubins, classify every symbol into kernels / device functions / globals, and decide whether the object requires pre-link finalization. This page documents the complete path from raw bytes to a validated cubin whose sections, symbols, and metadata have been integrated into the linker state ready for the merge phase.

Confidence: HIGH for the detection, validation, architecture-match, and FNLZR paths (all decompiled functions read end-to-end from sub_43D970, sub_43D9A0, sub_43D9B0, sub_43DA40, sub_43DA80, sub_43DD30, sub_43E100, sub_43E260, sub_43E2F0, sub_43E420, sub_43E610, sub_43E6F0, sub_426570, sub_4275C0). HIGH for section-name dispatch and .nv.compat TLV parsing (extracted from sub_45E3C0 and sub_45E7D0 lines 975-1012 and 1804-1851). MEDIUM for the exact STT type branches in sub_45D180 (read but not line-by-line verified against SysV ELF spec) and HIGH for the absence of LZ4/zlib/SHF_COMPRESSED cubin decompression (the only compression symbols in the binary live inside the bundled libcompiler OCG at addresses 0x1CExxxx--0x1D3xxxx, not on the cubin-loading path).

Key Functions

AddressNameSizeRole
sub_43D970is_elf19 BChecks 4-byte ELF magic (0x7F454C46)
sub_43D9A0is_elf6418 BTests ELF class byte (e_ident[EI_CLASS] == 2)
sub_43D9B0is_host_elf42 BTests e_type == ET_REL (1) to distinguish host from device
sub_43DA40is_sass_cubin52 BChecks SASS flag in e_flags (class-dependent bitmask)
sub_43DA80elf_extent420 BComputes max (sh_offset + sh_size, ph_offset + ph_size) over all sections and program headers
sub_43DD30validate_elf_structure536 BFull structural validation of section/program headers against buffer size
sub_43E100load_cubin_from_file232 BElf32 file loader: open, read, validate, return in-memory buffer
sub_43E260get_nv_cuinfo_section200 BResolves .note.nv.cuinfo / .note.nv.cuver or indexed via e_ident[19]
sub_43E2F0get_nv_tkinfo_section180 BResolves .note.nv.tkinfo note section (legacy compat format)
sub_43E420get_elf_toolkit_version116 BExtracts toolkit version from e_flags or .nv.compat section
sub_43E6F0has_abi_suffix172 BDetects the a suffix flag (ABI variant) in e_flags
sub_43E610read_nv_compat168 BParses the .nv.compat section for extended arch metadata
sub_426570validate_arch_and_add7,427 BValidates architecture match, configures link mode, adds cubin to linker
sub_4275C0post_link_transform3,989 BFNLZR (Finalizer) -- post-link binary rewriting for Mercury/SASS targets
sub_4878A0arch_string_match328 BCompares input arch string against target --arch value
sub_45E7D0merge_cubin_into_elfw52,000+ BSymbol-table iteration, section-name dispatch, .nv.compat TLV parsing, .nv.info attribute patching
sub_45E3C0classify_and_register_section2,800 BMaps section name to NVIDIA section type; the standalone section-classifier entry point
sub_45D180add_function_symbol16,000+ BSTT_FUNC handler: weak-symbol resolution, register-count check, new symbol creation
sub_4504B0get_or_create_nv_info350 BLazily creates .nv.info or .nv.info.<function> target section

Detection: Is It a Cubin?

Cubin detection is a two-step test performed inside main() after the 56-byte header probe:

  1. ELF magic check (sub_43D970): The first 4 bytes must be 0x7F454C46 ("\x7fELF").
  2. Machine type check: The e_machine field in the ELF header must be 190 (0xBE, EM_CUDA).
// sub_43D970 -- is_elf
// Returns true if the buffer starts with ELF magic
bool is_elf(uint32_t *buf) {
    if (!buf) return false;
    return *buf == 0x464C457F;  // "\x7fELF" as little-endian uint32
}

The e_machine check happens inline in main() after the ELF magic matches. Any ELF with e_machine != 190 is classified as a host ELF and handled separately.

EM_CUDA = 190 = 0xBE is NVIDIA's assigned e_machine value in the ELF machine-ID registry. It is the single most important discriminator in the entire linker: it distinguishes a CUDA device binary from every other ELF on the system. The check happens in three distinct places in the cubin-loading path:

  1. sub_43E100 line 52 (load_cubin_from_file final gate): if (get_elf32_header(buf)->e_machine != 190) return NULL;
  2. The main file-dispatch loop: before calling sub_426570, main() reads e_machine from the header probe and dispatches cubins vs host ELFs.
  3. sub_43DD30 does not check e_machine -- it only validates the structural integrity of the header and section arrays. A host ELF that accidentally matches e_machine==190 would still fail later in sub_426570 due to the architecture mismatch (host ELFs have no valid e_flags SM version).

The EM_CUDA constant is hardcoded as a literal in three decompiled files:

  • sub_43E100 line 52: *(_WORD *)(sub_46B590(v10) + 18) == 190
  • sub_43DFC0 (auxiliary) at offset +18
  • Indirectly via sub_448360 (Elf64 header accessor) callers who check e_machine

No other value (EM_NVIDIA_GPU, EM_MERCURY, etc.) is ever accepted as a cubin. Mercury-format objects still carry e_machine == 190; the Mercury distinction is encoded in e_type == 0xFF00 and e_ident[EI_OSABI] == 0x41, not in e_machine.

ELF Class Detection

sub_43D9A0 reads e_ident[EI_CLASS] at byte offset 4 of the ELF buffer:

// sub_43D9A0 -- is_elf64
bool is_elf64(void *elf_buf) {
    if (!elf_buf) return false;
    return ((uint8_t *)elf_buf)[4] == 2;  // ELFCLASS64
}

This determines whether the cubin uses Elf32 or Elf64 structures. All modern CUDA targets (sm_20+) use Elf64 when compiled with -m64. The Elf32 path exists for legacy 32-bit device code (deprecated since CUDA 12).

Cubin vs Host ELF Distinction

sub_43D9B0 distinguishes a cubin (device ELF) from a host .o file by checking e_type:

// sub_43D9B0 -- is_host_elf
// Returns true if the ELF has e_type == ET_REL (1)
// For cubins, e_type is typically ET_EXEC (2) or the Mercury type 0xFF00
bool is_host_elf(void *elf_buf) {
    if (!elf_buf) return false;
    if (((uint8_t *)elf_buf)[4] == 2)  // ELFCLASS64
        return get_elf64_header(elf_buf)->e_type == 1;  // ET_REL
    else
        return get_elf32_header(elf_buf)->e_type == 1;
}

Device cubins produced by ptxas have e_type == ET_EXEC (2). Relocatable device objects (produced with -r) have e_type == ET_REL (1) but still carry e_machine == 190. Mercury objects use the custom type 0xFF00. The combination of e_machine == 190 with any e_type value routes through the cubin handler; sub_43D9B0 is used later during architecture validation to handle relocatable cubins specially.

SASS Flag Detection

sub_43DA40 determines whether a cubin contains SASS (compiled machine code) or is a PTX-only stub:

// sub_43DA40 -- is_sass_cubin
bool is_sass_cubin(void *elf_buf) {
    if (!elf_buf) return false;
    if (((uint8_t *)elf_buf)[4] != 2)  // Must be ELFCLASS64
        return false;

    Elf64_Ehdr *ehdr = get_elf64_header(elf_buf);

    uint32_t sass_flag;
    if (ehdr->e_ident[EI_OSABI] != 0x41)   // 0x41 = NVIDIA CUDA OSABI (65)
        sass_flag = 0x4000;                  // Elf32-style: bit 14
    else
        sass_flag = 0x2;                     // Elf64/Mercury: bit 1

    return (ehdr->e_flags & sass_flag) != 0;
}

The flag semantics:

ELF OSABISASS flag bitHex maskMeaning
!= 0x41 (legacy)bit 140x4000Legacy Elf32-style flag layout in e_flags
== 0x41 (NVIDIA CUDA)bit 10x2Modern 64-bit flag layout (Mercury / sm >= 100)

When the SASS flag is set, the cubin contains actual machine instructions. When clear, it is a PTX-only cubin that serves as a compatibility fallback and cannot execute directly.

Elf32 File Loader: sub_43E100

sub_43E100 is a standalone cubin loader that reads a cubin from a file path, validates it, and returns a heap-allocated buffer. It is the Elf32 loading path (the condition checks e_ident[EI_CLASS] == 1):

// sub_43E100 -- load_cubin_from_file (Elf32 path)
void *load_cubin_from_file(const char *path) {
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;

    // Get file size
    if (fseek(fp, 0, SEEK_END) == -1) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size == -1 || fseek(fp, 0, SEEK_SET) == -1 || size <= 52) {
        fclose(fp); return NULL;
    }

    // Allocate and read into arena memory
    void *buf = arena_alloc(arena, size);
    if (!buf) { arena_oom(arena, size); fclose(fp); return NULL; }

    size_t nread = fread(buf, 1, size, fp);
    fclose(fp);

    // Validate: correct read length, ELFCLASS32, ELF magic
    Elf32_Ehdr *ehdr = get_elf32_header(buf);
    if (nread != size || ehdr->e_ident[EI_CLASS] != 1 || ehdr->e_ident_magic != 0x464C457F) {
        arena_free(buf, 1);
        return NULL;
    }

    // Full structural validation
    if (!validate_elf_structure(buf, size)) {
        arena_free(buf, size);
        return NULL;
    }

    // Must be EM_CUDA
    if (get_elf32_header(buf)->e_machine != 190) {
        arena_free(buf, size);
        return NULL;
    }

    return buf;
}

Key details:

  • The minimum file size threshold is 52 bytes (the size of an Elf32 header).
  • Memory is allocated from the linker's arena allocator, not malloc.
  • sub_43DD30 (validate_elf_structure) performs a thorough check of all section headers and program headers, verifying that every offset+size pair falls within the buffer bounds.
  • The e_machine == 190 check is the final gate.

Structural Validation: sub_43DD30

sub_43DD30 validates the complete ELF structural integrity of an in-memory cubin before any further processing. It checks both Elf32 and Elf64 paths:

For Elf32:

  • e_phentsize == 40 (sizeof Elf32_Phdr must be 40 if program headers exist)
  • e_shentsize is zero or e_shstrndx == 32 (section header entry size sanity)
  • Program header table offset (e_phoff) is within the buffer and e_phoff > 0x33 (beyond the ELF header)
  • Total program header table size (e_phoff + e_phentsize * e_phnum) fits in buffer
  • Section header table offset (e_shoff) and total size fits in buffer
  • For each section: if the section type is not SHT_NOBITS (8) and not in the NVIDIA-specific range (0x70000007--0x70000015, which includes SHT_CUDA_INFO, SHT_CUDA_CALLGRAPH, etc.), the section data range [sh_offset, sh_offset + sh_size) must fit within the buffer

For Elf64:

  • e_phentsize == 64 (sizeof Elf64_Phdr)
  • e_shentsize is zero or e_shstrndx == 56
  • Same offset/size boundary checks as Elf32, adjusted for 64-bit field widths
  • Overflow protection: checks sh_offset + sh_size does not wrap around

The NVIDIA-specific section types that are exempted from the data-range check (they may be virtual/metadata-only):

Type valueConstant nameHex
0x70000007SHT_CUDA_INFO0x70000007
0x70000008SHT_CUDA_CALLGRAPH (approx)0x70000008
0x7000000ASHT_CUDA_RELOCINFO (approx)0x7000000A
0x70000015SHT_CUDA_UDT/SHT_CUDA_UFT0x70000015

The validation function computes these exemptions with a bitmask check on (section_type - 0x70000007):

uint32_t relative = section_type - 0x70000007;
bool exempt = (section_type == SHT_NOBITS);
if (relative <= 14)
    exempt |= (0x400D >> relative) & 1;  // bits 0,2,3,14 set

The bitmask 0x400D in binary is 0100 0000 0000 1101, exempting offsets 0, 2, 3, and 14 relative to 0x70000007.

Architecture Extraction from e_flags

The SM architecture version is encoded in e_flags with a layout that depends on the ELF OSABI byte:

Legacy Layout (OSABI != 0x41)

e_flags (Elf32_Ehdr or Elf64_Ehdr):
  bits [7:0]    = SM version number (e.g., 75 for sm_75, 90 for sm_90)
  bit  [11]     = ABI suffix ('a') flag
  bit  [14]     = SASS flag (contains machine code)
  bit  [31]     = relocatable flag (signed: e_flags < 0)
  bits [19:16]  = toolkit version (Elf32 only, from e_flags of Elf32)

Modern Layout (OSABI == 0x41, Mercury)

e_flags (Elf64_Ehdr, always 64-bit for Mercury):
  bits [15:8]   = SM version number (shifted right by 8)
  bit  [1]      = SASS flag
  bit  [2]      = pre-link indicator (controls FNLZR behavior)
  bit  [3]      = ABI suffix ('a') flag
  bit  [10]     = relocatable flag
  bits [31:20]  = toolkit version (from e_flags >> 20, via .nv.compat)

sub_43E420 extracts the toolkit version:

// sub_43E420 -- get_elf_toolkit_version
uint32_t get_elf_toolkit_version(void *elf_buf) {
    if (!elf_buf) return 0;

    if (is_elf64(elf_buf)) {
        Elf64_Ehdr *ehdr = get_elf64_header(elf_buf);
        if (!ehdr) return 0;
        if (ehdr->e_ident[EI_OSABI] != 0x41)
            return ehdr->e_flags >> 16;  // Legacy: version in upper bits
        // Mercury: look up .nv.compat section
        void *compat = find_section(elf_buf, ".nv.compat");
        if (!compat) return 0;
        if (compat->version_field > 1)
            return compat->toolkit_ver_short;  // 16-bit field at offset 28
        // Fallback to a different compat section structure
        void *compat2 = get_compat_v1(elf_buf);
        if (!compat2) return 0;
        return compat2->toolkit_ver;           // 32-bit field at offset 24
    } else {
        Elf32_Ehdr *ehdr = get_elf32_header(elf_buf);
        if (!ehdr) return 0;
        return ehdr->e_flags >> 16;  // bits [31:16]
    }
}

sub_43E6F0 checks the ABI suffix flag:

// sub_43E6F0 -- has_abi_suffix
// Returns 1 if the cubin was compiled with the 'a' variant (e.g., sm_90a)
bool has_abi_suffix(void *elf_buf) {
    if (!elf_buf || !is_elf64(elf_buf)) return false;

    Elf64_Ehdr *ehdr = get_elf64_header(elf_buf);

    if (ehdr->e_ident[EI_OSABI] == 0x41) {   // Mercury
        if (ehdr->e_ident[EI_ABIVERSION] <= 0x59)  // 89 decimal
            return false;  // sm <= 89 never has suffix in Mercury
        // Check .nv.compat section
        void *compat = find_section(elf_buf, ".nv.compat");
        if (!compat) return true;  // default to yes if no compat info
        if (compat->version > 1) {
            // New compat format: explicit flag at byte offset 4
            nv_compat_info info;
            if (read_nv_compat(elf_buf, &info))
                return info.abi_flag == 1;
        }
        // Old compat format: check e_flags bit
        uint32_t flag = (ehdr->e_ident[EI_OSABI] == 0x41) ? 0x8 : 0x800;
        return (ehdr->e_flags & flag) != 0;
    }

    // Legacy path
    uint8_t sm = (uint8_t)ehdr->e_flags;
    if (sm <= 0x59) return false;  // sm <= 89
    return (ehdr->e_flags >> 11) & 1;  // bit 11
}

The a suffix (e.g., sm_90a) indicates architecture-specific features that break forward compatibility. Only SM versions > 89 support this suffix. The threshold 0x59 (89 decimal) appears in both legacy and Mercury code paths.

Architecture Validation: sub_426570

sub_426570 is the central validation function called from main() for every cubin input. It validates that the input cubin's SM architecture matches the --arch target, handles the SASS-vs-PTX distinction, and configures the link mode. At 7,427 bytes, it is the most complex function in the cubin loading path.

Inputs

ParameterTypeRole
a1elfw *The output ELF wrapper being built
a2void *The input cubin's in-memory ELF buffer
a3const char *The input file path (for error messages)
a4bool *Output: set to 1 for legacy/32-bit, 0 for SASS cubins

Early Rejection

The function immediately rejects ET_DYN objects (e_type == 2): shared libraries cannot be device-linked.

Word Size Validation

Compares the cubin's ELF class against the --machine setting (dword_2A5F30C, either 32 or 64):

bool cubin_is_64 = is_elf64(cubin);
bool target_is_64 = (dword_2A5F30C == 64);
if (cubin_is_64 != target_is_64) {
    // Fatal error: "expected %s" where %s is "-m32" or "-m64"
    error(ERR_ARCH_MISMATCH, filepath, target_is_64 ? "-m64" : "-m32");
}

Architecture String Construction

For 32-bit cubins, the SM version is extracted from e_flags & 0xFF and the ABI suffix from sub_43E6F0. The function formats an architecture string:

char arch_str[12];
bool is_ptx_only = byte_2A5F2C1;  // PTX/compute mode flag

if (is_ptx_only)
    snprintf(arch_str, 12, "compute_%d%c", sm_version, has_suffix ? 'a' : 0);
else
    snprintf(arch_str, 12, "sm_%d%c", sm_version, has_suffix ? 'a' : 0);

The buffer is 12 bytes, and there is an explicit overflow check: if snprintf returns > 11, the error "specified arch exceeds buffer length" is raised.

Architecture Match

sub_4878A0 (arch_string_match) compares the constructed arch_str against the global target qword_2A5F318 (the --arch value). The comparison is not a simple string equality -- it parses both architecture strings into structured records and applies version compatibility rules:

  • Exact match: sm_90 == sm_90 (passes)
  • Family match: sm_90a is compatible with sm_90 as a target
  • Cross-family rejection: sm_75 cubins cannot link into an sm_90 target

If the match fails and byte_2A5F221 (SASS mode flag) is set, a fallback path tries to match via the .nv.compat section (sub_43E610 + sub_4709E0). The .nv.compat section contains extended compatibility information that can declare a cubin is forward-compatible with a range of architectures.

On final failure, the error "SM Arch ('%s') not found in '%s'" is emitted (via sub_467460 with format descriptor unk_2A5B6A0), where %s is the constructed arch string and the target arch.

64-bit / Mercury Path

For 64-bit cubins with OSABI 0x41 (Mercury), the SM version comes from e_flags >> 8 instead of e_flags & 0xFF:

if (ehdr->e_ident[EI_OSABI] == 0x41) {
    sm_version = ehdr->e_flags >> 8;     // Mercury: bits [15:8]
} else {
    sm_version = (uint8_t)ehdr->e_flags; // Legacy: bits [7:0]
}

An additional check validates the ELF class byte (e_ident[7], used by NVIDIA as a sub-class indicator). For legacy Elf32-format cubins, the expected class is 7. For modern cubins (sm > 72, byte_2A5F224 set), the expected class is 8. If the cubin does not carry the relocatable flag (e_flags & 0x400 for OSABI 0x41, or e_flags & 0x4000 for legacy), an error is raised for a class mismatch.

SASS vs PTX-Only Mode Selection

The validation function sets two global mode flags based on the cubin type:

if (is_sass_cubin) {
    // SASS cubin: set SASS mode
    if (output_flag) *output_flag = 0;
    byte_2A5F221 = 1;  // SASS mode: enables FNLZR, relaxed compat checking
} else {
    // PTX-only or relocatable cubin
    if (output_flag) *output_flag = 1;
    if (!first_cubin_seen) {
        configure_32bit_mode(elfw, ...);
        byte_2A5F222 = 0;  // disable Mercury mode
        byte_2A5F225 = 0;  // disable Mercury-capable flag
        byte_2A5B510 = 1;  // mark first cubin processed
        byte_2A5F220 = 1;
        dword_2A5B528 = 0;
    }
}

When the first non-SASS cubin arrives, the linker locks into legacy mode: Mercury flags are cleared and cannot be re-enabled. This is a one-way transition -- once a legacy cubin enters the link, the entire output is legacy.

Toolkit Version Validation

Two additional checks enforce toolkit consistency:

  1. Minimum version: If sub_468560() (get current toolkit version) returns a value less than the cubin's toolkit version (sub_43E420), the cubin was built with a newer toolkit than the linker knows about. This produces an error.

  2. SM-specific version locks:

    • SM 50 with toolkit version <= 64 (0x40): error (too old)
    • SM 90 with toolkit version <= 119 (0x77): error (too old for Hopper)
  3. EWP objects (e_type == 0xFF00, Mercury executable): If detected, byte_2A5F229 is set. All subsequent objects must have toolkit version matching the current linker's version exactly. Error: "linking with -ewp objects requires using current toolkit".

The FNLZR Post-Link Transform: sub_4275C0

After the merge-relocate-finalize pipeline produces a linked cubin, Mercury targets (sm >= 100) and certain SASS targets require a post-link binary rewriting pass called the FNLZR (Finalizer). sub_4275C0 orchestrates this transformation.

Pre-Link vs Post-Link Mode

The Finalizer operates in two modes controlled by the a5 parameter:

Post-Link Mode (a5 == true): Applied after the linker has produced the final linked ELF. The Finalizer rewrites instruction encodings, resolves final scheduling, and applies architecture-specific binary patches. This is the normal path for sm >= 100 targets.

// Post-link mode: verify the cubin has the SASS flag set
uint32_t sass_check = (ehdr->e_ident[EI_OSABI] == 0x41) ? 0x1 : 0x80000000;
if (!(ehdr->e_flags & sass_check))
    error("Internal error");  // cubin must contain SASS for post-link

Pre-Link Mode (a5 == false): Applied to individual cubins before merging, when the cubin requires pre-link finalization (e.g., instruction encoding normalization). This mode checks that the cubin does NOT already have the SASS flag set (it should be in pre-link format):

// Pre-link mode: verify NOT already finalized
if (ehdr->e_ident[EI_OSABI] == 0x41) {
    bool already_finalized = (ehdr->e_flags >> 2) & 1;  // bit 2
    if (already_finalized) error("Internal error");
} else {
    bool already_finalized = (ehdr->e_flags & (0x80000000 | 0x4000)) == 0;
    // inverted: if neither bit is set, it's unfinalizable
    if (already_finalized) error("Internal error");
}

Configuration Flags

The Finalizer receives a 160-byte configuration structure (v28[0..19]) initialized mostly to zero, with specific fields set from global flags:

OffsetSourceMeaning
v28[3] byte 4byte_2A5F310 (debug)Debug info preservation
v28[3] byte 7byte_2A5F210Extended shared memory flag
v28[13] byte 0post-link flag1 if Mercury mode active
v28[13] byte 1byte_2A5F225Mercury-capable flag
v28[13] byte 2always 1Finalizer enable
v28[13] byte 3byte_2A5F224SM > 72 flag
v28[13] byte 4byte_2A5F223Additional arch flag

The operation mode (v28[3] low dword) is set to:

  • 4: Default (no debug, not debug+pre-link)
  • 5: Debug mode (byte_2A5F310 set)
  • v28[8] = 3: When neither debug nor the pre-link special flag is set

Invocation

The actual binary rewriting is performed by sub_4748F0, which is the FNLZR engine entry point. All 20 qwords from v28 are passed as arguments (the x86-64 calling convention spills them onto the stack):

int result = fnlzr_engine(
    target_sm,       // a3: the SM version number
    input_elf,       // the cubin to transform
    elf_ptr,         // pointer to pointer (may be updated)
    config[0..19],   // the 160-byte config structure
    0, 0             // reserved
);
if (result != 0)
    error("FNLZR failure", filename);

On success, the ELF at *elf_ptr has been rewritten in place. On failure, a fatal error is emitted.

Diagnostic Output

When verbose mode is active (dword_2A5F308 & 1), the Finalizer prints to stderr:

FNLZR: Input ELF: <filename>
FNLZR: Post-Link Mode          (or "Pre-Link Mode")
FNLZR: Flags [ <post_link> | <mercury_capable> ]
FNLZR: Starting <filename>
FNLZR: Ending <filename>

If no filename is available (in-memory cubin from fatbin extraction), the placeholder "in-memory-ELF-image" is used.

Cubin-Specific Section Handling

Once the cubin has passed architecture validation and (if needed) finalization, sub_45E7D0 ingests its contents. The core operation is a one-pass iteration over the cubin's section header table. For every section, sub_45E7D0 extracts the section name and remaps the section type based on the name prefix. This remap is necessary because ptxas emits some sections with generic ELF types (SHT_PROGBITS = 1, SHT_NOBITS = 8) but the linker internally tracks them as specific NVIDIA types with the SHT_LOPROC = 0x70000000 prefix.

Section Name to Type Dispatch

The dispatch logic lives at sub_45E7D0:975-1012 and verbatim in the standalone helper sub_45E3C0:67-102. It inspects the section name prefix and the incoming section type:

// When section type == SHT_NOBITS (8) -- uninitialized data sections
if      (memcmp(name, ".nv.global",          10) == 0)  type = 0x70000007;  // SHT_CUDA_GLOBAL
else if (memcmp(name, ".nv.shared.",         11) == 0)  type = 0x7000000A;  // SHT_CUDA_SHARED
else if (memcmp(name, ".nv.shared.reserved.",20) == 0)  type = 0x70000015;  // SHT_CUDA_SHARED_RESERVED
else if (memcmp(name, ".nv.local.",          10) == 0)  type = 0x70000009;  // SHT_CUDA_LOCAL
// type stays 8 (SHT_NOBITS) otherwise

// When section type == SHT_PROGBITS (1) -- initialized data
if      (memcmp(name, ".nv.constant",        12) == 0)  type = 0x70000024 + strtol(name+12, 0, 10);
         // .nv.constant0 -> 0x70000064, .nv.constant17 -> 0x70000075
else if (memcmp(name, ".nv.global.init",     15) == 0)  type = 0x70000008;  // SHT_CUDA_GLOBAL_INIT
// type stays 1 otherwise (plain .text, .data, .nv.rel, etc.)

// When incoming type is the generic SHT_CUDA_CONSTANT (0x70000006), same strtol+0x70000024 remap

Raw numeric values from the decompiled code:

Decimal (decompiled)HexNVIDIA section typeSource section name
18790481990x70000007SHT_CUDA_GLOBAL.nv.global
18790482000x70000008SHT_CUDA_GLOBAL_INIT.nv.global.init
18790482010x70000009SHT_CUDA_LOCAL.nv.local.*
18790482020x7000000ASHT_CUDA_SHARED.nv.shared.*
18790482130x70000015SHT_CUDA_SHARED_RESERVED.nv.shared.reserved.*
18790482920x70000064SHT_CUDA_CONSTANT0.nv.constant0
18790481980x70000006SHT_CUDA_CONSTANT (generic)incoming base for constant banks

The expression strtol(name+12, 0, 10) + 1879048292 at line 1001 (LABEL_427) reads the decimal digits that follow .nv.constant and adds them to the base. .nv.constant0 becomes 0x70000064, .nv.constant17 becomes 0x70000075. This is how the 18 numbered banks get their distinct section type IDs and how the merge phase later identifies which bank a section belongs to without re-parsing the name. The same encoding is reused for the 7 specialized/named constant banks, which also use SHT_CUDA_CONSTANT as an input type -- see Constant Banks for the full table.

Per-Section Dispatch After Classification

Once the section type has been normalized, sub_45E7D0 dispatches based on the classified type:

Classified typeDispatch action
0x70000000 (SHT_CUDA_INFO, .nv.info*)Parsed as TLV records; .nv.info handled by sub_44E8B0, target section allocated via sub_4504B0
0x70000001 (SHT_CUDA_CALLGRAPH, .nv.callgraph)Indexed as a call-graph input for dead-code elimination
0x70000002 (SHT_CUDA_CALLGRAPH_INFO, .nv.callgraph.info)Per-function callgraph metadata
0x70000006-0x70000075 (constant banks)Routed through sub_438640 for bank-specific merging
0x70000007 (SHT_CUDA_GLOBAL)Global-variable placement, processed by layout phase
0x70000008 (SHT_CUDA_GLOBAL_INIT)Carries initialized global data; copied verbatim
0x70000009, 0x7000000A, 0x70000015 (local/shared)Size accumulated for the per-kernel shared-memory budget
0x70000086 (SHT_CUDA_COMPAT, .nv.compat)TLV parsed by sub_43E610 + inline loop at sub_45E7D0:1804-1851
0x7 (SHT_NOTE) with flag bit 0x1000000Note section processed by sub_442270 / sub_469D00, used for .note.nv.cuinfo etc.
SHT_PROGBITS with .text.<func> nameTreated as a per-function text segment, merged by standard symbol resolution

The special-case for .text. is implicit: a SHT_PROGBITS section whose name starts with .text. keeps its generic type but is linked to the STT_FUNC symbol of the same name during the second pass. Unlike host ELFs, cubins typically emit a single monolithic .text section per module rather than per-function .text.* sections, so this path is rarely exercised.

Other Cubin-Specific Sections

Besides the name-dispatched sections above, nvlink recognizes several other sections by explicit lookup in sub_4483B0 (find_section_by_name):

Section nameELF typePurpose
.nv.compat0x70000086 (SHT_CUDA_COMPAT, or SHT_NOTE 7 with SHF_CUDA 0x1000000)Forward-compatibility attributes (ISA class, ABI variant)
.note.nv.cuinfoSHT_NOTE 7CUDA build info note (alternative to .nv.compat)
.note.nv.cuverSHT_NOTE 7CUDA toolkit version note
.note.nv.tkinfoSHT_NOTE 7Toolkit compatibility note (legacy / pre-Mercury)
.nv.info0x70000000Global per-cubin metadata TLV
.nv.info.<function>0x70000000 with SHF_INFO_LINK 0x40Per-function metadata TLV
.nv.prototype0x70000002 (SHT_CUDA_PROTOTYPE)Function prototype records for indirect calls
.nv.callgraph0x70000001 (SHT_CUDA_CALLGRAPH)Call-graph edges used by dead-code elimination
.nv.relSHT_RELA 4NVIDIA-specific relocations (see R_CUDA* catalog)
.debug_*SHT_PROGBITS 1DWARF debug sections

The resolution order for cubin info sections is:

  1. Check e_ident[EI_ABIVERSION] (byte 8): if non-zero and not 0xFF, it is a direct section index into the CU info note.
  2. If 0xFF, look up .note.nv.cuinfo or fall back to .note.nv.cuver.
  3. Otherwise treat e_ident[EI_ABIVERSION] as a section index via sub_448370.

This logic is implemented in sub_43E260 and sub_43E2F0 and reflects the historical evolution of CUDA metadata: early cubins used distinct note sections, later cubins pack everything into .nv.compat.

The .nv.compat TLV Record Format

The .nv.compat section (also accepted as an SHT_NOTE variant with the 0x1000000 CUDA flag) carries ISA compatibility attributes as a stream of tagged records. sub_43E610 retrieves the section buffer and delegates the raw parse to sub_43E500. For the structured pass, sub_45E7D0:1806-1851 walks the buffer record-by-record:

Record layout:
  [0]       tag_kind    (1 byte)    // 0x00..0x08 for known kinds
  [1]       tag_id      (1 byte)    // attribute id (0..8)
  [2..3]    length      (2 bytes)   // byte length of attribute value or 16-bit immediate
  [4..4+N]  value       (N bytes)   // opaque, attribute-specific

Stream termination: end of section buffer
Alignment: 4 bytes

The parser classifies tag_id by a bitmask on (1 << tag_id):

tag_id1 << tag_id& 0x6C?& 0x180?HandlerMeaning
20x04yesnosub_451920(linker, 2, value_byte)ISA_CLASS -- rejected if ISA class > 0x7F on Mercury
30x08yesnosub_451920(linker, 3, value_byte)ABI variant
50x20yesnosub_451920(linker, 5, value_byte)ISA feature
60x40yesnosub_451920(linker, 6, value_byte)ISA feature
70x80noyessub_451BA0(linker, 7, value_u16)Shader-model minimum (16-bit value)
80x100noyessub_451BA0(linker, 8, value_u16)Shader-model cap (16-bit value)
40x10nono(length skip only)Padding / length-prefixed value

Unknown attributes emit "unknown .nv.compat attribute (%x) encoutered.\n" (note the original typo) to stderr when the 0x10 verbose flag is set, then continue. The record-advance logic is cursor += (tag_kind == 4) ? 4 + length : 4, so tag_kind == 4 is a length-prefixed record while other kinds carry the immediate value in the same 4 bytes.

After the walk finishes, a "missing attributes" sweep runs at sub_45E7D0:1743-1801. For each required attribute (2, 3, 5, 6, 7), if the corresponding BYTE(v600, tag_id) was never set, a default value is installed -- and for tag_id == 2 on Mercury, ISA class > 0x7F triggers the error unk_2A5B900 ("ISA_CLASS out of range").

Symbol Classification: Kernels, Device Functions, and Globals

Once section classification is done, sub_45E7D0:762-800 walks the symbol table twice. The first pass extracts only STT_FUNC symbols ((st_info & 0xF) == 2) and passes each to sub_45D180 for full classification. The second pass (sub_45E7D0:802-1014) handles every other symbol type.

STT_FUNC Pass (First Pass)

For every STT_FUNC symbol, sub_45D180 makes a binding decision:

st_info upper nibbleELF bindingOutcome in nvlink
0x0STB_LOCALSection-local; recorded but not globally visible
0x1STB_GLOBALGlobal function; entered in the global symbol table, may resolve conflicts
0x2STB_WEAKWeak function; merged against existing definitions with register-count voting

Inside sub_45D180, the function then consults st_other & 0x10 (NVIDIA's visibility bit) to distinguish a kernel entry point from a device function:

BitHexMeaning
st_other[4]0x10Kernel entry (__global__). Cleared for __device__ functions.
st_other[5]0x20Bindless function (texture/surface binding descriptor)
st_other[6]0x40Exported across translation units
st_other[7]0x80Internal / intrinsic

When (st_other ^ existing_st_other) & 0x10 is non-zero, the function emits unk_2A5B8F0 -- "conflicting definitions for %s, one is a kernel and the other is not" -- and aborts. Kernels and device functions with the same mangled name cannot coexist.

The classification then dispatches:

  1. Kernel (st_other & 0x10): symbol is tagged with the kernel bit and placed in the kernel table. A per-function .nv.info.<name> section is expected; its absence downgrades to a warning.
  2. Device function (st_other & 0x10 == 0): symbol goes into the regular function table; callers resolve via the call-graph pass.
  3. Weak definition: sub_45D180 reads the register count from the existing symbol (if any), compares against the new candidate, and picks the higher of the two so the kernel launch allocates enough registers for any caller. Same logic for barrier counts, stack sizes, and shared-memory usage.

Non-Function Pass (Second Pass)

The second pass (sub_45E7D0:802-1014) handles st_info & 0xF values 0, 1, 3-12. The important branches are:

st_info & 0xFELF typenvlink handler
0 (STT_NOTYPE)untypedPassed to sub_4411B0 / sub_438A00 for opaque placement
1 (STT_OBJECT)global variableHandled by sub_440740 with the object's section binding
2 (STT_FUNC)functionSkipped here; processed in first pass
10 (STT_LOOS, STT_NV_EXT_TEX)textureRouted through sub_438A00 (bindless texture path)
11 (STT_NV_EXT_SURF)surfaceRouted through sub_438B20 (surface handle path)
12 (STT_NV_EXT_SAMP)samplerRouted through sub_438A90 (sampler handle path)
13(reserved, NVIDIA-private)Routed to the kernel-specific path as if it were a function entry

For each STT_OBJECT entry (global variable), the code:

  1. Reads st_shndx from st_info (symbol's owning section index).
  2. Looks up the corresponding linker section via v19[1] -- the section index map built earlier in the first pass.
  3. Fails with "section not mapped" (unk_2A5B990) if the source section was filtered out.
  4. Adds the symbol via sub_4411B0 (symbol-add by name) or sub_440740 (symbol-add with explicit attributes).
  5. On collision with an existing definition, checks that the st_info types match -- if not, emits unk_2A5BA10 ("symbol type mismatch for %s").

A special case handles st_shndx == SHN_ABS (0xFFF1) and st_shndx == 0: these symbols bypass section mapping entirely and go straight into the global namespace.

Symbol Index Remapping

After both passes, v19[1] contains the input-to-linker symbol-index map. This map is used during the .nv.info TLV parse at sub_45E7D0:1900-1975, where symbol-referencing attributes (kinds 2, 6, 7, 8, 9, 10, 15, 17, 18, 19, 20, 23, 35, 38, 47, 55, 59, 69) have their embedded symbol indices rewritten from the cubin-local numbering to the linker-global numbering. The kinds list is the exhaustive switch case from sub_45E7D0:1917-1977.

Constant Bank Extraction

The constant-bank extraction happens as a natural consequence of the section dispatch described above. There is no separate extraction pass -- the moment a .nv.constant<N> section is classified with type 0x70000064 + N, it becomes a routine section input for the constant-bank merge pipeline (sub_438640).

From .nv.constant to SHT_CUDA_CONSTANTn

Concrete example from a sample cubin:

Input section header:
  sh_name    -> ".nv.constant0"        (from shstrtab)
  sh_type    -> 1  (SHT_PROGBITS)      or 0x70000006 (SHT_CUDA_CONSTANT)
  sh_flags   -> 0x2 (SHF_ALLOC)
  sh_offset  -> 0x5000
  sh_size    -> 0x140
  sh_link    -> 0 (no associated section)
  sh_info    -> symbol index of owning kernel

After sub_45E7D0:1006-1012 classification:
  internal_type -> 0x70000064  (strtol("0", 0, 10) == 0, 0 + 0x70000024 = 0x70000024,
                                then + 0x40 from the NV offset = 0x70000064)

Wait -- the literal in the decompiled code is 1879048292 which equals 0x70000064 directly. The 0x70000024 base from the earlier description is the value 1879048228, but the actual code uses 1879048292 = 0x70000064 which already includes the +64 offset. Correcting the formula:

internal_type = strtol(name + 12, 0, 10) + 0x70000064   // bank 0 at base
              = strtol(name + 12, 0, 10) + 1879048292

So .nv.constant0 -> 0x70000064 (SHT_CUDA_CONSTANT0), .nv.constant1 -> 0x70000065, ..., .nv.constant17 -> 0x70000075. This matches the Constant Banks page's type range exactly.

Constant Data Flow

Once classified, the constant-bank section is added to the linker state via the same sub_441AC0 + sub_440590 + sub_440350 chain used for all other sections. The raw bytes are copied by sub_432B10 (memory allocator + memcpy). At merge time, sub_438640 collects all sections with the same bank number across all input cubins and concatenates/deduplicates them using the hash-based dedup engine sub_4339A0. The dedup hash keys on the byte content, so two cubins with identical bank-0 constants share storage.

Relocations that point into constant banks (R_CUDA_ABS32_LO_0, R_CUDA_ABS32_HI_0, R_CUDA_CONST_FIELD, etc.) are resolved after merge in sub_46C1B0 -- see linker/r-cuda-relocations.md for the full relocation set.

Compressed Section Handling

nvlink does NOT decompress cubin sections. There is no LZ4, zlib, zstd, or SHF_COMPRESSED (0x800) handling on the cubin-loading path. The validator sub_43DD30 does not even acknowledge the SHF_COMPRESSED flag -- any section whose bytes appear compressed would simply be treated as opaque bytes, merged verbatim, and almost certainly fail later relocation resolution or FNLZR parsing.

Proof by exhaustion:

  1. A keyword search for LZ4, lz4, zlib, inflate, deflate, zstd, compress, ELFCOMPRESS, SHF_COMPRESSED, and 0x800000 across all 40,210 decompiled functions finds only matches inside the bundled OCG (open code generator, libcompiler) in the 0x1CCxxxx--0x1D3xxxx address range. Those functions are invoked from PTX codegen (constant table compression) and LTO bitcode packing, not from the cubin-loading or section-merging paths.
  2. sub_45E7D0 reads section bytes via &a2->__size[sh_offset] and passes them straight to memcpy. There is no conditional decompression.
  3. Ptxas does not emit SHF_COMPRESSED sections in cubin output. DWARF debug sections in cubins are uncompressed, unlike their host-ELF counterparts. The host linker (ld) may insert .debug_* compression into the final executable, but device cubins embedded in a fatbin are raw.
  4. Fatbin containers themselves can be LZ4-compressed (the fatbin header records a compression method), but that decompression happens in sub_43B2D0 -- the fatbin extractor -- before the resulting cubin bytes reach sub_426570. See Fatbin Extraction for the decompression details.

This means that a cubin on disk is always in the exact byte format ptxas produced. The linker simply maps it into arena memory and walks it.

Validation Checks on Cubin Structure

A cubin passes through three layers of validation, each adding another invariant:

Layer 1: Structural (sub_43DD30)

Documented above. This layer verifies:

  • Header field sizes (e_phentsize, e_shentsize, e_shstrndx)
  • Program header table range within buffer
  • Section header table range within buffer
  • Per-section data range within buffer (excluding SHT_NOBITS and NVIDIA's exempted types)
  • No integer overflow in offset + size arithmetic

Return: bool -- false on any violation. The caller (sub_43E100, the fatbin extractor, sub_426570) must discard the cubin on failure.

Layer 2: Extent (sub_43DA80)

sub_43DA80 computes the high-water mark of the ELF: the maximum of:

  • e_phoff + e_phentsize * e_phnum (end of program header table)
  • e_shoff + e_shentsize * e_shnum (end of section header table)
  • For each section: sh_offset + sh_size (skipping SHT_NOBITS and NV-exempt types)

This value must not exceed the buffer length. sub_43DD30 uses it as the final bounds gate. Arithmetic overflow is checked explicitly: the code guards against v34 * v33 / v33 != v34 (integer-multiplication overflow) and v35 > ~v9 (additive overflow).

Layer 3: Architecture (sub_426570)

  • e_type != ET_DYN (3) -- shared objects rejected outright
  • ELF class matches --machine (dword_2A5F30C, 32 or 64)
  • e_ident[EI_OSABI] matches the expected value (0x41 for Mercury, 0 for legacy)
  • e_ident[EI_CLASS] sub-field (NVIDIA uses byte 7 as a sub-class indicator; expected 7 for Elf32/legacy, 8 for Mercury)
  • e_flags SM version matches the --arch target (with family compatibility)
  • e_flags relocatable flag (bit 0x400 for Mercury, 0x4000 for legacy)
  • Toolkit version from .nv.compat / e_flags is not newer than the linker
  • SM-specific toolkit lower bounds: SM 50 requires toolkit > 64 (0x40), SM 90 requires toolkit > 119 (0x77)
  • First cubin lock-in: after the first non-SASS cubin arrives, Mercury mode is disabled for the rest of the link

Layer 4: Post-Link Precondition (sub_4275C0)

Applied only if FNLZR is invoked:

  • Post-link mode: cubin MUST have the SASS flag set (e_flags & 0x1 on Mercury, e_flags & 0x80000000 on legacy)
  • Pre-link mode: cubin MUST NOT have the "already finalized" flag ((e_flags >> 2) & 1 on Mercury, (e_flags & 0x80004000) == 0 on legacy)

Violations in either case emit "Internal error" (unk_2A5B670). These are defensive asserts -- if the earlier pipeline is correct, they never fire.

Integer Overflow Protection

sub_43DD30 and sub_43DA80 both contain explicit overflow guards:

// sub_43DD30 Elf64 loop, line 105
v23 = sh_offset;
v24 = sh_size;
if (base + v23 + v24 > base + buf_size || v24 > ~v23 || v23 + v24 > ~v19)
    return 0;   // overflow: section would wrap the address space

The triple-condition check validates:

  1. Section content fits in the buffer
  2. sh_offset + sh_size does not overflow (sh_size > ~sh_offset means the sum would wrap)
  3. The section header address + sh_offset + sh_size does not wrap (corner case for sections near the high end of the ELF)

This level of paranoia is necessary because cubins may be embedded in fatbins of arbitrary origin, and a malicious fatbin could contain a truncated or crafted ELF that tries to index past the end of the allocation.

Complete Cubin Loading Flow

Input file identified as cubin (ELF magic + e_machine == 190)
  |
  v
sub_43D970: is_elf() -- validate ELF magic
  |
  v
sub_43D9A0: is_elf64() -- determine Elf32 vs Elf64
  |
  +--> Elf32: sub_43E100 loads from file with size >= 52 check
  +--> Elf64: loaded in main() via fread into arena buffer
  |
  v
sub_43DD30: validate_elf_structure() -- bounds-check all headers
  |
  v
sub_426570: validate_arch_and_add()
  |  1. Reject e_type == ET_DYN (shared libraries)
  |  2. Check word size (32/64) matches --machine
  |  3. Check ELF OSABI byte for class expectations
  |  4. Extract SM version from e_flags
  |  5. Format "sm_XX" or "compute_XX" string
  |  6. Match against --arch via sub_4878A0
  |  7. Fallback: check .nv.compat via sub_43E610
  |  8. Validate toolkit version
  |  9. Set SASS / legacy mode flags
  |
  v
sub_43DA40: is_sass_cubin() -- check SASS flag
  |
  +--> SASS: proceed to merge, later FNLZR post-link
  +--> PTX-only: lock into legacy mode
  |
  v
sub_42A680: register_module_for_linking()
  |
  v
Cubin enters the merge phase (sub_45E7D0)

Error Messages

Error descriptorMessage patternCondition
unk_2A5B700(null header / corrupt ELF)ELF header at offset v12 is NULL
unk_2A5B690Architecture word-size mismatchCubin is 32-bit but target is -m64, or vice versa
unk_2A5B680ELF class mismatchCubin ELF class byte does not match expected value (7 or 8)
unk_2A5B6A0"SM Arch ('%s') not found in '%s'"Cubin SM arch does not match --arch target
unk_2A5B6B0Architecture requires modern ELF classMercury cubin with non-Mercury linker configuration
unk_2A5B670"specified arch exceeds buffer length" / "Internal error"Buffer overflow in arch string or FNLZR precondition failure
unk_2A5B640Toolkit version too newCubin toolkit version exceeds linker's known version
unk_2A5B630SM 50 requires newer toolkitSM 50 cubin with toolkit version <= 64
unk_2A5B620SM 90 requires newer toolkitSM 90 cubin with toolkit version <= 119
unk_2A5B6E0First cubin arch notificationInformational: logs the architecture of the first cubin processed
unk_2A5B6C0"FNLZR failure"Post-link binary rewriting failed
unk_2A5B5C0Relocatable flag warningCubin has unexpected relocatable flag state

All error messages are emitted through the unified diagnostic function sub_467460, which handles severity levels (fatal error, warning, info) based on the descriptor address prefix.

Cross-References

  • Input File Loop -- how cubins are dispatched from the main file loop
  • Fatbin Extraction -- cubins extracted from fatbin containers follow the same validation path; fatbin LZ4 decompression happens before the cubin bytes reach sub_426570
  • ELF Parsing -- the sub_448360 / sub_46B590 ELF header accessor functions; sub_43DD30 validator documented in both pages
  • Constant Banks -- the name-to-index mapping and merge pipeline for sections classified here as SHT_CUDA_CONSTANTn
  • .nv.info Metadata -- the TLV attribute catalog parsed by sub_4504B0 and sub_44E8B0 after section classification
  • NVIDIA Sections -- full catalog of SHT_CUDA_* section types referenced by the dispatch table
  • Section Merging -- the downstream merge stage that consumes classified sections
  • R_CUDA Relocations -- relocations that reference constant banks and other NV sections after merge
  • Merge Phase -- where validated cubins are merged into the output ELF (driver for sub_45E7D0)
  • Finalization Phase -- the FNLZR post-link transform context
  • Mercury / FNLZR -- detailed breakdown of the sub_4748F0 finalizer engine invoked by sub_4275C0
  • Device ELF Format -- e_machine == 190, e_flags SM version encoding, and the Mercury e_type == 0xFF00 variant