Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Bindless Relocations

Bindless texture and surface references in CUDA allow kernels to access texture/surface objects through handles stored in constant memory rather than through fixed hardware slots. The nvlink device linker implements bindless support through a dedicated relocation processing pass (sub_438DD0, 12,779 bytes at 0x438DD0) that runs during the layout phase. This function scans all relocations for bindless types, creates synthetic $NVLINKBINDLESSOFF_<name> symbols, rewrites relocation symbol indices to point at these synthetic symbols, builds a per-section bitmask tracking which sections contain bindless references, and then allocates constant-bank space for the bindless offset tables. A companion function (sub_43CDA0, 6,937 bytes at 0x43CDA0) resolves bindless symbol types and counts resources per entry function.

Background: Bindless vs. Bound Texture Access

Traditional (bound) CUDA texture access assigns each texture/surface to a fixed hardware slot, with per-architecture limits on slot count (queried via vtable offsets +40, +48, +56 for textures, samplers, and surfaces respectively). Bindless access eliminates this fixed-slot limitation: the compiler emits a handle-based reference that the linker resolves to an offset within a constant memory bank dedicated to bindless descriptors.

The CUDA source-level mechanism works as follows:

  1. PTX frontend (sub_12B9660): The ptxas compiler creates $BINDLESS$<module>$<texture>$<sampler> or $BINDLESS$<module>$<surface> symbols for each bindless texture/surface/sampler triple.
  2. Merge phase: These symbols and their relocations are merged into the output ELF during merge_elf (sub_45E7D0).
  3. Layout phase: sub_438DD0 (this page) processes the merged relocations, creating $NVLINKBINDLESSOFF_<name> symbols and assigning them offsets within a dedicated constant bank section.
  4. Relocation application: The standard relocation engine patches the final addresses.

The architecture profile vtable controls whether bindless is supported at all:

Vtable offsetQueryDescription
+296supports_bindless()Gate: returns nonzero if bindless mode is active
+304bindless_texture_type()Returns the ELF section type for the bindless texture constant bank
+312bindless_surface_type()Returns the ELF section type for the bindless surface constant bank
+320(unnamed)Returns nonzero if unified surface descriptors are in use
+440(unnamed)Size of a texture/sampler descriptor entry
+448(unnamed)Base size for a surface descriptor entry
+352(unnamed)Additional surface descriptor offset for non-unified mode

The constant memory bank holding bindless descriptors is identified by sw-bindless-tex-surf-table-bank (from sub_16257C0 in the ptxas memory-space classifier).

Position in the Pipeline

Merge Phase (sub_45E7D0, per-object)
  |
  v
Layout Phase (sub_439830)
  |
  +-- Phase 1: Global data merge
  +-- Phase 2: Bindless resource processing     <-- this page
  |     |
  |     +-- sub_4324B0: set up constant bank target section
  |     +-- sub_433310: lay out descriptor lists (tex/sampler/surface)
  |     +-- sub_438DD0: process bindless relocations (core)
  |     +-- sub_43CDA0: resolve bindless symbol types (called from Phase 10)
  |     |
  +-- Phases 3-9: shared memory, constants, etc.
  +-- Phase 10: Resource counting & UFT setup
  |
  v
Relocation Phase (sub_469D60)

Phase 2 is gated by vtable+296 (supports_bindless()) and is skipped entirely in relocatable link mode (elfw+16 == 1).

Function Signatures

sub_438DD0 -- process_bindless_references

// Address: 0x438DD0
// Size: 12,779 bytes (~451 lines decompiled)
//
// a1: elfw*       -- the output ELF wrapper
// a2: section_desc* -- per-entry constant bank section descriptor
//                      (from the per-entry constant list at elfw+272)
//
// Returns: pointer to the updated section descriptor
_QWORD* __fastcall process_bindless_references(__int64 a1, _QWORD* a2);

sub_43CDA0 -- resolve_bindless_type_symbols

// Address: 0x43CDA0
// Size: 6,937 bytes (~292 lines decompiled)
//
// a1: elfw*         -- the output ELF wrapper
// a2: entry_index   -- entry function index (0 = no specific entry)
// a3: symbol_type   -- bindless symbol STT type (10=texture, 11=sampler, 12=surface)
//
// Returns: count of matching resources for this entry
__int64 __fastcall resolve_bindless_type_symbols(__int64 a1, int a2, unsigned __int8 a3);

sub_438CE0 -- emit_bindless_relocation

// Address: 0x438CE0
// Size: ~240 bytes
//
// a1: elfw*           -- the output ELF wrapper
// a2: reloc_type      -- relocation type code
// a3: symbol_index    -- target symbol index in symtab
// a4: section_index   -- target section index
// a5: addend_flag     -- additional relocation flag
// a6: original_reloc  -- original relocation data pointer
//
// Creates a new relocation record and appends it to elfw+376 (relocation list)
__int64 __fastcall emit_bindless_relocation(
    __int64 a1, unsigned int a2, unsigned int a3,
    unsigned int a4, int a5, __int64 a6);

sub_4324B0 -- bindless_target_setup

// Address: 0x4324B0
// Size: ~1 KB
//
// a1: elfw*          -- the output ELF wrapper
// a2: symbol_record  -- the entry-point symbol record
//
// Creates a per-entry constant bank section of the bindless texture type
// (section type from vtable+304), named "<bank_type>.<entry_name>".
// Appends the new section index to the per-entry constant list (elfw+272).
__int64 __fastcall bindless_target_setup(__int64 a1, __int64 a2);

sub_433310 -- descriptor_list_layout

// Address: 0x433310
// Size: ~1.5 KB
//
// a1: elfw*              -- the output ELF wrapper
// a2: descriptor_list*   -- linked list of texture/sampler/surface descriptors
//                           (from elfw+424, +432, or +440)
//
// For each descriptor in the list:
//   1. Look up the bindless texture constant bank section (vtable+304)
//   2. Create $NVLINKBINDLESSOFF_<name> symbol in that section
//   3. Allocate zero-filled data of descriptor size (vtable+440 or +448)
//   4. Merge data into the constant bank via sub_432B10
__int64 __fastcall descriptor_list_layout(__int64 a1, _QWORD* a2);

Relocation Type Classification

The core logic of sub_438DD0 iterates the global relocation list (elfw+376) and classifies each relocation record by its type code (lower 32 bits of the 8-byte relocation info field at record+8). The following types trigger bindless processing (reach the LABEL_32 handler):

DecimalHexInferred nameType
50x05R_CUDA_ABS32_2032-bit absolute
120x0CR_CUDA_ABS32_HI2032-bit absolute high
170x11R_CUDA_TEX_HEADER_INDEXTexture header index
180x12R_CUDA_TEX_HEADER_INDEX_HITexture header index high
220x16R_CUDA_BINDLESSOFF13_22Bindless offset (13-bit, slot 22)
230x17R_CUDA_BINDLESSOFF13_23Bindless offset (13-bit, slot 23)
240x18R_CUDA_BINDLESSOFF13_24Bindless offset (13-bit, slot 24)
250x19R_CUDA_BINDLESSOFF13_25Bindless offset (13-bit, slot 25)
290x1DR_CUDA_BINDLESSOFF13_29Bindless offset (13-bit, slot 29)
300x1ER_CUDA_BINDLESSOFF13_30Bindless offset (13-bit, slot 30)
380x26R_CUDA_BINDLESSOFF14_38Bindless offset (14-bit, slot 38)
390x27R_CUDA_BINDLESSOFF14_39Bindless offset (14-bit, slot 39)
420x2AR_CUDA_BINDLESSOFF14_42Bindless offset (14-bit, slot 42)
460x2ER_CUDA_ABS36_LO2036-bit absolute low (Volta+)
500x32R_CUDA_ABS36_HI1636-bit absolute high (Volta+)
510x33R_CUDA_ABS36_2036-bit absolute (Volta+)
540x36R_CUDA_BINDLESSOFF14_54Bindless offset (14-bit, slot 54)
550x37R_CUDA_BINDLESSOFF14_55Bindless offset (14-bit, slot 55)
590x3BR_CUDA_BINDLESSOFF_59Bindless offset (extended)
640x40R_CUDA_BINDLESSOFF_64Bindless offset (Ampere+)
650x41R_CUDA_BINDLESSOFF_65Bindless offset (Ampere+)
660x42R_CUDA_BINDLESSOFF_66Bindless offset (Ampere+)
1150x73R_CUDA_BINDLESSOFF_115Bindless offset (Hopper+)
655390x10003R_MERCURY_BINDLESS_3Mercury bindless (Blackwell+)
655400x10004R_MERCURY_BINDLESS_4Mercury bindless (Blackwell+)

The types span four generations of encoding evolution:

  • Base types (5, 12): Generic absolute relocations that happen to reference bindless symbols. These predate the dedicated bindless relocation types.
  • Legacy 13-bit (17-18, 22-25, 29-30): Compact bindless offset encoding with 13-bit fields, used on pre-Volta architectures.
  • 14-bit extended (38-39, 42, 54-55, 59): Wider offset encoding added for Volta (sm70+) constant bank expansions.
  • 36-bit absolute (46, 50-51): Wider address relocations for Volta+ extended addressing.
  • Ampere+ (64-66): Relocation types added for sm80+ instruction encodings.
  • Hopper+ (115): Added for sm90 Hopper architecture.
  • Mercury (65539-65540): Range 0x10000+ is the Mercury (sm100+ / Blackwell) relocation namespace, with bindless variants at offsets 3 and 4.

Types that explicitly do not trigger bindless processing (skip to LABEL_19) despite being numerically adjacent: 26 (0x1A), 27 (0x1B), 28 (0x1C). These correspond to non-bindless relocation types (R_CUDA_32, R_CUDA_64, and a third type in the same numeric range -- see the R_CUDA Catalog for exact type assignments).

Complete Pseudocode: process_bindless_references (sub_438DD0)

The following pseudocode reconstructs the full algorithm from the decompiled binary. Variable names are inferred from usage context; the structure matches the decompiled code at 0x438DD0 line-for-line.

// sub_438DD0 -- process_bindless_references
// a1: elfw*          -- output ELF wrapper
// a2: section_desc*  -- per-entry constant bank section descriptor
//
// Returns: pointer to the updated section descriptor

section_desc* process_bindless_references(elfw* elf, section_desc* parent_desc)
{
    int num_sections  = elf->section_count;               // elfw+584
    int parent_shtype = parent_desc->sh_type;             // a2[1] (32-bit at offset 4)

    // ---- Phase A: Detect architecture variant ----
    //
    // The byte at elfw+7 encodes the relocation namespace.
    // 'A' (0x41) = Mercury / Blackwell (sm100+): use 0x10000+ reloc types.
    // Anything else = legacy namespace.

    bool is_mercury = (elf->arch_byte == 'A');            // elfw+7 == 0x41

    // Mercury uses bit 0 of elfw+48 for the "wide reloc" flag;
    // non-Mercury uses bit 31 (0x80000000) of elfw+48.
    int arch_flags_mask = is_mercury ? 0x01 : 0x80000000;
    int wide_reloc_flag = elf->arch_flags & arch_flags_mask;  // elfw+48 & mask

    // ---- Phase B: Collect per-entry constant bank sections ----
    //
    // Walk the global section vector (elfw+360) and collect sections whose
    // sh_type matches parent_shtype AND that have non-empty data or relocations.
    // These are the per-entry constant bank sections that might need bindless
    // offset tables.

    linked_list per_entry_list = NULL;                    // v94
    uint64_t max_size  = 0;                               // v93
    uint32_t max_align = 1;                               // v90

    section_vec* all_sections = elf->section_vec;         // elfw+360
    for (uint64_t i = 0; i < vec_count(all_sections); i++)
    {
        section_record* sec = vec_get(all_sections, i);

        // Match on sh_type: must equal the parent constant bank type
        // The constant is either 0x70000006 (single value) or falls in
        // the range [0x70000084, 0x7000009E] (per-entry variants).
        if (sec->sh_type != 0x70000006
            && ((sec->sh_type - 0x70000084) > 0x1A || parent_shtype != sec->sh_type))
            continue;

        uint32_t link_idx = sec->sh_link;                 // offset +44
        if (!link_idx) continue;

        section_record* linked = vec_get(all_sections, link_idx);
        if (linked->data == NULL && linked->reloc_list == NULL)
            continue;

        // Append sec to per_entry_list
        list_append(sec, &per_entry_list);

        // Track maximum alignment and size across all per-entry sections
        if (sec->sh_addralign > max_align)
            max_align = sec->sh_addralign;
        if (sec->sh_size > max_size)
            max_size = sec->sh_size;
    }

    // ---- Phase C: Allocate per-section bitmask ----
    //
    // Stack-allocated byte array, one byte per section + 1.
    // Each byte tracks which bindless resource types are referenced
    // by relocations targeting that section.

    uint8_t bitmask[num_sections + 1];                    // alloca on stack
    memset(bitmask, 0, num_sections + 1);

    // ---- Phase D: Resolve specific constant section ----
    //
    // elfw+568 holds a "specific entry function" symbol index. If nonzero,
    // all bitmask propagation goes to that single section instead of walking
    // the callgraph. This path is used in single-entry compilation mode.

    uint32_t specific_section_idx = 0;
    uint32_t specific_sym = elf->specific_entry_sym;      // elfw+568
    if (specific_sym != 0) {
        sym_record* sym = get_symbol_record(elf, specific_sym);
        specific_section_idx = get_section_index(elf, sym);
    }

    // ---- Phase E: Main relocation scan ----
    //
    // Walk the global relocation linked list (elfw+376). For each relocation,
    // extract the type code from the lower 32 bits of the info field and
    // dispatch through the type classification logic.

    reloc_node* reloc = elf->reloc_list;                  // elfw+376 (linked list head)
    while (reloc != NULL)
    {
        reloc_record* rec = reloc->data;                  // reloc->next_ptr[1]
        uint64_t info = rec->r_info;                      // offset +8 in reloc record
        uint32_t reloc_type = (uint32_t)info;             // low 32 bits
        uint32_t sym_idx    = (uint32_t)(info >> 32);     // high 32 bits

        // ---- Type classification (maps to decompiled switch cascade) ----
        //
        // Bindless types jump to the handler; all others skip (goto next).
        //
        // The classification is a series of range checks. Expressed as a
        // decision tree matching the binary's branch structure:

        bool is_bindless = false;
        switch (reloc_type) {
            // Base absolute types that can reference bindless symbols
            case 5:                                       // R_CUDA_ABS32_20
            case 12:                                      // R_CUDA_ABS32_HI20

            // Texture header index
            case 17: case 18:                             // R_CUDA_TEX_HEADER_INDEX{,_HI}

            // Legacy 13-bit bindless offsets
            case 22: case 23: case 24: case 25:           // R_CUDA_BINDLESSOFF13_22..25
            // Note: 26, 27, 28 are EXCLUDED (non-bindless)
            case 29: case 30:                             // R_CUDA_BINDLESSOFF13_29..30

            // 14-bit extended (Volta+)
            case 38: case 39:                             // R_CUDA_BINDLESSOFF14_38..39
            case 42:                                      // R_CUDA_BINDLESSOFF14_42

            // 36-bit absolute (Volta+)
            case 46:                                      // R_CUDA_ABS36_LO20
            case 50: case 51:                             // R_CUDA_ABS36_{HI16,20}

            // More 14-bit extended
            case 54: case 55:                             // R_CUDA_BINDLESSOFF14_54..55
            case 59:                                      // R_CUDA_BINDLESSOFF_59

            // Ampere+ (sm80)
            case 64: case 65: case 66:                    // R_CUDA_BINDLESSOFF_64..66

            // Hopper (sm90)
            case 115:                                     // R_CUDA_BINDLESSOFF_115

            // Mercury / Blackwell (sm100+): 0x10000+ namespace
            case 65539: case 65540:                       // R_MERCURY_BINDLESS_3..4
                is_bindless = true;
                break;

            default:
                is_bindless = false;
        }

        if (!is_bindless) {
            reloc = reloc->next;
            continue;
        }

        // ---- LABEL_32: Bindless handler ----

        // Step 1: Verify the target symbol is a texture/sampler/surface
        sym_record* sym = get_symbol_record(elf, sym_idx);    // sub_440590
        uint8_t sym_type = sym->st_info & 0x0F;

        if ((sym_type - 10) > 2) {
            // Not texture(10), sampler(11), or surface(12) -- skip
            reloc = reloc->next;
            continue;
        }

        // Step 2: Build the synthetic symbol name
        const char* orig_name = sym->name;                    // sym+32
        size_t name_len = strlen(orig_name);
        char* synth_name = arena_alloc(name_len + 20);        // sub_4307C0
        sprintf(synth_name, "$NVLINKBINDLESSOFF_%s", orig_name);

        // Step 3: Find or create the synthetic symbol in the output symtab
        uint32_t new_sym_idx = find_or_create_symbol(elf, synth_name);  // sub_4411B0

        if (elf->verbose) {                                   // elfw+64 bit 1
            fprintf(stderr, "change reloc symbol from %d to %d\n",
                    sym->section_index, new_sym_idx);
        }
        arena_free(synth_name);                               // sub_431000

        // Step 4: Rewrite the relocation info field
        //
        // Keep the original relocation type in bits [31:0].
        // Replace the symbol index in bits [63:32] with the new synthetic symbol.
        rec->r_info = ((uint64_t)new_sym_idx << 32) | (uint32_t)reloc_type;

        // Step 5: Compute the bitmask byte from the resource type
        //
        // byte_1D391A0 is a compile-time lookup table at address 0x1D391A0:
        //   byte_1D391A0[0]  = 0x01   (texture:  bit 0)
        //   byte_1D391A0[4]  = 0x02   (sampler:  bit 1)
        //   byte_1D391A0[8]  = 0x03   (surface:  bits 0+1)
        //
        // Indexed as: byte_1D391A0[4 * (sym_type - 10)]

        uint8_t bitmask_byte = 0;
        if ((sym_type - 10) <= 2)
            bitmask_byte = BITMASK_LUT[sym_type - 10];       // byte_1D391A0

        // Step 6: Mark the relocation's target section in the bitmask
        uint32_t reloc_section = rec->rela_section_index;     // rec+24
        uint32_t resolved_sec = get_section_by_rela(elf, reloc_section)->sh_link;
        bitmask[resolved_sec] |= bitmask_byte;

        // Step 7: Propagate through callgraph (if the section is a function)
        section_record* sec_rec = get_section_by_rela(elf, resolved_sec);
        if (!(sec_rec->flags & 0x04)) {
            // Not a function section -- no propagation needed
            reloc = reloc->next;
            continue;
        }

        if (specific_section_idx != 0) {
            // Single-entry mode: propagate directly to the known entry section
            bitmask[specific_section_idx] |= bitmask_byte;
        } else {
            // Multi-entry mode: walk the callgraph to find all callers
            //
            // The function ID is extracted from the section's sh_link field,
            // sign-extended from 24 bits: (sh_link << 8) >> 8
            int func_id = (int)(sec_rec->sh_link << 8) >> 8;

            // sub_44C740 returns the head of a linked list of caller records.
            // Each record has: { next_ptr, padding, caller_sym_index }
            caller_node* caller = callgraph_get_callers(elf, func_id);
            while (caller != NULL) {
                sym_record* caller_sym = get_symbol_record(elf, caller->sym_index);
                uint32_t caller_sec = get_section_index(elf, caller_sym);
                bitmask[caller_sec] |= bitmask_byte;
                caller = caller->next;
            }
        }

        reloc = reloc->next;
    }
    // ---- End of relocation scan (Phase E) ----

    // ... Phases F-I follow (section pruning, allocation, emission, cleanup)
}

Phase E Detail: The Bitmask Lookup Table

The global array byte_1D391A0 at binary address 0x1D391A0 is a 12-byte lookup table. It is indexed by 4 * (sym_type - 10), where the stride of 4 is an artifact of the compiler aligning each entry to a 4-byte boundary. The effective mapping is:

Index expressionsym_typeBitmask byteBinary representation
byte_1D391A0[0]10 (texture)0x010000_0001
byte_1D391A0[4]11 (sampler)0x020000_0010
byte_1D391A0[8]12 (surface)0x030000_0011

Surface references set both bits because a surface object conceptually contains both a texture descriptor (for reads) and a separate surface descriptor (for writes). The linker must allocate space in both the texture and sampler/surface descriptor tables for a surface, hence both bits are set.

Phase E Detail: Callgraph Propagation Walk

The function sub_44C740 (callgraph_get_callers) works as follows:

// sub_44C740 -- callgraph_get_callers
caller_node* callgraph_get_callers(elfw* elf, int func_id)
{
    sym_record* sym = get_symbol_record(elf, func_id);

    // elfw+408 holds the callgraph vector, indexed by sym->callgraph_idx (offset +28)
    callgraph_entry* entry = vec_get(elf->callgraph_vec, sym->callgraph_idx);

    // Sanity check: callgraph must be fully built
    if (!elf->callgraph_complete)                         // elfw+81
        fatal("callgraph not complete");

    if (entry)
        return entry->caller_list;                        // entry+40: linked list head
    else
        return NULL;
}

Each caller_node in the linked list has this layout:

OffsetSizeFieldDescription
08nextPointer to next caller node (NULL = end of list)
84sym_indexSymbol table index of the calling function

The propagation loop ORs the bitmask byte into each caller's section entry. This means: if device function helper() uses a bindless texture, and kernels A() and B() both call helper(), then both section_of_A and section_of_B get bit 0 set in the bitmask. This transitive propagation is essential because the per-entry constant bank must be allocated for every kernel that transitively reaches a bindless reference, even through multiple levels of function calls.

The Bindless Handler (LABEL_32) -- Step-by-Step Summary

When a relocation matches a bindless type, the handler executes five steps:

Step 1: Symbol Lookup and Type Check

sym = get_symbol_record(elfw, reloc.symbol_index)    // sub_440590
sym_type = sym->st_info & 0x0F

if (sym_type - 10) > 2:    // not texture(10), sampler(11), or surface(12)
    skip this relocation    // goto LABEL_19

The st_info & 0x0F low nibble encodes the CUDA-specific symbol type:

ValueSTT_* typeGPU resource
10STT_CUDA_TEXTURETexture reference
11STT_CUDA_SAMPLERSampler state
12STT_CUDA_SURFACESurface reference

Only relocations targeting these three symbol types proceed to bindless processing.

Step 2: Create Synthetic Symbol

name_len = strlen(sym->name)
synth_name = arena_alloc(name_len + 20)               // sub_4307C0
sprintf(synth_name, "$NVLINKBINDLESSOFF_%s", sym->name)
new_sym_idx = find_or_create_symbol(elfw, synth_name)  // sub_4411B0

if verbose:
    fprintf(stderr, "change reloc symbol from %d to %d\n",
            sym->section_index, new_sym_idx)

arena_free(synth_name)                                 // sub_431000

The allocation size is strlen + 20 (19 bytes for the prefix + 1 for the NUL terminator). The function sub_4411B0 either finds an existing symbol with that name (if a previous relocation already created it) or creates a new one. The arena allocator at sub_4307C0 obtains memory from the per-module arena (located via sub_44F410), and the corresponding sub_431000 returns it.

Step 3: Rewrite Relocation Target

reloc.info = (new_sym_idx << 32) | original_reloc_type

This is a 64-bit rewrite of the relocation info field. The upper 32 bits encode the symbol table index, the lower 32 bits encode the relocation type. After this rewrite, the relocation engine will resolve the address from the $NVLINKBINDLESSOFF_ symbol rather than the original texture/surface symbol.

Step 4: Build Section Bitmask

bitmask_byte = byte_1D391A0[4 * (sym_type - 10)]
section_bitmask[reloc.section_index] |= bitmask_byte

A VLA-sized bitmask (alloca(num_sections + 1) at function entry) tracks which sections contain references to each resource type. Each byte in the bitmask stores up to three flags:

BitMeaning
0Section references at least one bindless texture
1Section references at least one bindless sampler
0+1Section references at least one bindless surface (both bits set)

Step 5: Propagate Bitmask Through Callgraph

If the section containing the relocation is a function section (section flags bit 2 set), the bitmask is propagated to all callers:

section_record = get_section_by_rela(elfw, reloc.section_index)
if section_record.flags & 0x04:   // is a function section
    if specific_constant_section:
        section_bitmask[specific_constant_idx] |= bitmask_byte
    else:
        func_id = (section_record.sh_link << 8) >> 8   // sign-extend 24-bit
        caller_list = callgraph_get_callers(elfw, func_id)
        for each caller in caller_list:
            caller_sym = get_symbol_record(elfw, caller.sym_index)
            caller_section = get_section_index(elfw, caller_sym)
            section_bitmask[caller_section] |= bitmask_byte

This ensures that if a non-entry function references a bindless texture, the bitmask correctly marks the entry-function sections that transitively reach that reference.

Complete Pseudocode: Post-Scan Phases (F through I)

After the relocation scan (Phase E) completes, sub_438DD0 continues with four more phases.

Phase F: Section Pruning

    // ---- Phase F: Prune sections with no bindless references ----
    //
    // Walk the per-entry list collected in Phase B. Any section whose
    // bitmask byte is still zero has no bindless references (direct or
    // transitive) and can be pruned.

    for (list_node* node = per_entry_list; node != NULL; node = node->next)
    {
        section_record* sec = node->data;
        uint32_t sec_idx = sec->sh_link;                  // offset +44

        if (bitmask[sec_idx] == 0) {
            if (elf->verbose)
                fprintf(stderr, "no bindless ref in section %s\n", sec->name);

            // NULL out the data pointer -- marks this node as dead.
            // The section still exists in the ELF, but won't receive a
            // bindless offset table. The node remains in the list but is
            // skipped during Phase H.
            node->data = NULL;
        }
    }

This is a pure size optimization. Without it, every per-entry constant section would be allocated a bindless offset table, wasting constant memory on kernels that never perform bindless access.

Phase G: Section Layout

    // ---- Phase G: Compute aligned layout for the constant bank ----
    //
    // sub_4325A0 assigns byte offsets to each descriptor symbol within
    // the constant bank section. It walks the sorted symbol list,
    // aligning each entry to its required alignment and advancing
    // the running offset.
    //
    // The total_size is padded up to the maximum alignment boundary:
    //   if (max_size % max_align != 0)
    //       total_size = max_align + max_size - (max_size % max_align)
    //   else
    //       total_size = max_size

    uint64_t remainder = max_size % max_align;
    uint32_t total_size = (remainder != 0)
        ? max_align + max_size - remainder
        : max_size;

    layout_section(elf, parent_desc, total_size, max_align);  // sub_4325A0

The layout engine (sub_4325A0) iterates the section's symbol list in sorted order, assigning each symbol an aligned offset:

// sub_4325A0 -- layout_section (simplified)
void layout_section(elfw* elf, section_desc* desc, uint32_t init_offset, int align)
{
    // Sort symbols by alignment (descending) via sub_4647D0 + sub_432440
    sort(desc->symbol_list, compare_by_alignment);

    uint32_t offset = init_offset;
    for (list_node* n = desc->symbol_list; n; n = n->next)
    {
        sym_layout_record* rec = n->data;
        uint64_t sym_align = rec->alignment;              // rec+16

        if (sym_align != 0) {
            // Align up: offset = ceil(offset / sym_align) * sym_align
            if (offset % sym_align != 0)
                offset = offset + sym_align - (offset % sym_align);
        } else {
            // Zero-size symbol: use natural alignment (min 8)
            uint64_t sz = rec->size;
            uint64_t nat_align = (sz <= 8) ? sz : 8;
            if (offset % nat_align != 0)
                offset = offset + nat_align - (offset % nat_align);
        }

        // Assign offset to both the layout record and the symbol
        sym_record* sym = get_symbol_record(elf, rec->sym_index);
        sym->st_value = offset;
        rec->assigned_offset = offset;

        if (elf->verbose)
            fprintf(stderr, "variable %s at offset %d\n", sym->name, offset);

        offset += rec->size;
    }
    desc->sh_size = offset;
}

Phase H: Per-Entry Relocation Emission

    // ---- Phase H: Emit resolved bindless relocations ----
    //
    // For each surviving per-entry section, walk the parent's descriptor
    // list and emit a resolved relocation (type 6/7/8/9/52 or Mercury
    // equivalents) that tells the relocation engine where to write the
    // descriptor offset.

    // Precompute the resolved relocation type codes
    uint32_t tex_reloc_type     = is_mercury ? 65547 : 6;     // 0x1000B : 0x06
    uint32_t sampler_reloc_type = is_mercury ? 65548 : 7;     // 0x1000C : 0x07

    for (list_node* entry_node = per_entry_list; entry_node; )
    {
        section_record* entry_sec = entry_node->data;
        if (entry_sec == NULL) {
            // Pruned in Phase F -- skip
            entry_node = entry_node->next;
            continue;
        }

        // Copy layout parameters from the parent section
        entry_sec->sh_addralign = parent_desc->alignment;     // a2[6] -> sec+48
        entry_sec->sh_size      = parent_desc->size;          // a2[4] -> sec+32

        // Counters for resource limit checking
        uint32_t texture_count = 0;
        uint32_t sampler_count = 0;
        uint32_t surface_count = 0;

        // Walk the parent's descriptor list (from parent_desc+72, i.e., a2[9])
        for (list_node* desc_node = parent_desc->descriptor_list; desc_node; desc_node = desc_node->next)
        {
            descriptor_record* desc = desc_node->data;

            // The descriptor's symbol was created during sub_433310 with the
            // $NVLINKBINDLESSOFF_ prefix. Strip the prefix (+19) to find the
            // original texture/sampler/surface symbol.
            sym_record* desc_sym = get_symbol_record(elf, desc->sym_index);
            uint32_t bindless_sym_idx = find_or_create_symbol(elf, desc_sym->name + 19);
            sym_record* bindless_sym  = get_symbol_record(elf, bindless_sym_idx);
            uint32_t bindless_sec     = get_section_index(elf, bindless_sym);

            // Validate: descriptor must belong to this per-entry section
            // (or to section 0, which means "unassigned")
            uint32_t entry_sec_idx = entry_sec->sh_link;
            if (bindless_sec != 0 && entry_sec_idx != bindless_sec)
                continue;   // belongs to a different per-entry section

            // Determine the resolved relocation type based on the resource type
            uint8_t res_type = bindless_sym->st_info & 0x0F;
            uint32_t emit_type = 0;

            switch (res_type) {
                case 10:  // texture
                    emit_type = tex_reloc_type;
                    if (!(bitmask[entry_sec_idx] & 0x01))
                        continue;                          // section has no texture refs
                    texture_count++;
                    break;

                case 11:  // sampler
                    emit_type = sampler_reloc_type;
                    if (!(bitmask[entry_sec_idx] & 0x02))
                        continue;                          // section has no sampler refs
                    sampler_count++;
                    break;

                case 12:  // surface
                    if (is_unified_surface()) {            // vtable+320
                        emit_type = is_mercury ? 65549 : 52;
                    } else {
                        if (is_mercury)
                            fatal("unexpected usage of non-unified surface descriptors");

                        // Compare descriptor's offset against the base surface
                        // descriptor size from the architecture vtable (+448).
                        // If they differ, the descriptor needs an addend adjustment.
                        uint64_t desc_offset = desc->offset;   // desc+24
                        uint64_t base_size   = vtable_surface_base_size();
                        emit_type = (desc_offset != base_size) ? 9 : 8;
                    }
                    if (!(bitmask[entry_sec_idx] & 0x03))
                        continue;                          // section has no surface refs
                    surface_count++;
                    break;

                default:
                    emit_type = 0;                         // unreachable if data is valid
                    break;
            }

            // Emit the resolved relocation
            // Parameters: (elf, type, sym_idx, target_section, addend_flag, orig_reloc)
            emit_bindless_relocation(elf,
                emit_type,
                bindless_sym_idx,
                entry_sec->section_index,                  // offset +64
                0,                                         // no addend
                desc->original_reloc);                     // desc+8
        }

        // ---- Phase I: Resource limit checking ----

        const char* type_name = get_section_type_name(entry_sec->sh_type);

        if (texture_count > max_textures()) {              // vtable+40
            char* entry_name = get_entry_name(elf, entry_sec->section_index);
            // Error format: "%d exceeds the maximum number of %s (%zu) for '%s'"
            fatal_limit(max_textures(), "textures", strlen(type_name) + strlen(entry_name) + 1);
        }

        if (sampler_count > max_samplers()) {              // vtable+48
            char* entry_name = get_entry_name(elf, entry_sec->section_index);
            fatal_limit(max_samplers(), "samplers", strlen(type_name) + strlen(entry_name) + 1);
        }

        if (surface_count > max_surfaces()) {              // vtable+56
            char* entry_name = get_entry_name(elf, entry_sec->section_index);
            fatal_limit(max_surfaces(), "surfaces", strlen(type_name) + strlen(entry_name) + 1);
        }

        entry_node = entry_node->next;
    }

Phase J: Cleanup

    // ---- Phase J: Free temporary data structures ----

    // Free the per-entry linked list collected in Phase B
    list_free(per_entry_list);                            // sub_464520

    // Free each descriptor record in the parent's descriptor list,
    // then free the list itself and zero out the pointers
    for (list_node* n = parent_desc->descriptor_list; n; n = n->next)
        arena_free(n->data);                              // sub_431000

    list_free(parent_desc->descriptor_list);              // sub_464520

    parent_desc->size = 0;                                // a2[4] = 0
    parent_desc->descriptor_list = NULL;                  // a2[9] = 0
    parent_desc->descriptor_tail = NULL;                  // a2[10] = 0

    return parent_desc;
}

Resolved Relocation Types

The relocation types emitted during Phase H differ from the input types. These are the resolved types consumed by the relocation application engine (sub_469D60):

TypeHexA-variantMeaning
60x06NoApply texture descriptor offset
70x07NoApply sampler descriptor offset
80x08NoApply surface descriptor offset (standard)
90x09NoApply surface descriptor offset (with addend)
520x34NoApply unified surface descriptor offset
655470x1000BYesMercury texture descriptor offset
655480x1000CYesMercury sampler descriptor offset
655490x1000DYesMercury unified surface descriptor offset

A-variant (elfw+7 == 'A') indicates Mercury / Blackwell (sm100+) architectures, which use the 0x10000+ relocation namespace.

The distinction between types 8 and 9 for surfaces is determined by comparing the descriptor's offset field (desc+24) against the architecture's base surface descriptor size (vtable+448). If they match, type 8 is emitted (standard offset application). If they differ, type 9 is emitted, signaling that the relocation engine must apply an addend adjustment to account for non-standard surface descriptor layouts.

Resource Limit Checking

After emitting relocations for each per-entry section, the function checks whether the resource counts exceed architecture limits:

entry_name = get_section_type_name(per_entry.type)

if texture_count > max_textures():        // vtable+40
    error("too many %s in %s", "textures", entry_name)

if sampler_count > max_samplers():        // vtable+48
    error("too many %s in %s", "samplers", entry_name)

if surface_count > max_surfaces():        // vtable+56
    error("too many %s in %s", "surfaces", entry_name)

These limits are hard limits imposed by the GPU hardware. Exceeding them produces a fatal linker error via sub_467460. The error message format string at address 0x2A5BA40 produces output like: "128 exceeds the maximum number of textures (128) for '.nv.constant3.my_kernel'". The entry name is constructed by concatenating the section type name (from sub_4401F0) with the per-entry section name (from sub_4402D0).

Complete Pseudocode: resolve_bindless_type_symbols (sub_43CDA0)

This companion function counts bindless resources of a given type (texture, sampler, or surface) for a specific entry function. It is called during Phase 10 of the layout pass (resource counting and UFT setup).

// sub_43CDA0 -- resolve_bindless_type_symbols
// a1: elfw*          -- the output ELF wrapper
// a2: entry_index    -- entry function index (0 = no specific entry)
// a3: symbol_type    -- bindless symbol STT type (10=texture, 11=sampler, 12=surface)
//
// Returns: count of matching bindless resources for this entry

uint32_t resolve_bindless_type_symbols(elfw* elf, int entry_index, uint8_t sym_type)
{
    // Look up the entry function symbol
    uint32_t entry_sym_id = lookup_entry_symbol();        // sub_444720
    sym_record* entry_sym = get_symbol_record(elf, entry_sym_id);

    if (entry_index == 0)
        return 0;                                         // no entry -- nothing to count

    if (entry_sym == NULL)
        fatal("symbol not found");

    // ---- Gate: Does the architecture support bindless? ----
    if (!supports_bindless())                             // vtable+296
        goto path_B;

    // ===============================================
    // PATH A: Bindless-supported architecture
    // ===============================================

    int resolved_type;
    bool needs_sort = false;
    uint32_t bank_section_type;

    if (sym_type >= 10 && sym_type <= 11) {
        // Texture or sampler
        bank_section_type = bindless_texture_bank_type(); // vtable+304

        // Check if the architecture has the "wide relocation" flag.
        // This flag indicates relocations need to be sorted before counting
        // to enable efficient section-grouped iteration.
        if (elf->compact_mode && elf->elf_class == 1) {  // elfw+82, elfw+4
            if (elf->wide_reloc_flag_32 & 0x02)          // elfw+37 bit 1
                needs_sort = true;
        } else {
            if (elf->wide_reloc_flag_64 & 0x02)          // elfw+49 bit 1
                needs_sort = true;
        }

        if (needs_sort)
            sort_reloc_list(&elf->reloc_list, comparator);  // sub_4647D0

        // Resolved type: 6 for texture, 7 for sampler
        resolved_type = (sym_type != 10) ? 7 : 6;        // (a3 != 10) + 6

    } else if (sym_type == 12) {
        // Surface
        resolved_type = 8;
        bank_section_type = bindless_surface_bank_type(); // vtable+312

        if (is_unified_surface()) {                       // vtable+320
            resolved_type = 52;                           // unified surface
            needs_sort = false;                           // no sort needed for surfaces
        }

    } else {
        fatal("unexpected bindless type");
        resolved_type = 0;
        bank_section_type = 0;
    }

    // Construct the per-entry section name: "<bank_type_name>.<entry_name>"
    const char* bank_name  = get_section_type_name(bank_section_type);
    const char* entry_name = entry_sym->name;

    char section_name[strlen(bank_name) + strlen(entry_name) + 16];
    sprintf(section_name, "%s.%s", bank_name, entry_name);

    // Look up the section by name
    uint32_t target_sec_idx = find_section_by_name(elf, section_name);
    if (target_sec_idx == 0)
        return 0;                                         // section doesn't exist

    // Resolve the target section's sh_link to get the actual section index
    // used in relocation records
    uint32_t target_shlink = elf->section_map[target_sec_idx];  // elfw+472

    // ---- Count matching relocations ----
    //
    // Walk the relocation list and count entries that match both:
    //   (1) the resolved relocation type, AND
    //   (2) the target section index
    //
    // For textures and samplers (sym_type 10 or 11), there is an additional
    // deduplication step when the sort flag is active: consecutive relocations
    // with the same symbol index are counted as one. This handles the case
    // where multiple instructions reference the same texture/sampler.

    reloc_node* reloc = elf->reloc_list;
    uint32_t count = 0;

    if (sym_type <= 11) {
        // Texture or sampler counting
        if (needs_sort) {
            // SORTED path: deduplicate by symbol index
            uint32_t prev_sym = 0;

            while (reloc != NULL) {
                reloc_record* rec = reloc->data;
                uint32_t rec_type, rec_sec;

                // Extract type and section depending on ELF class
                if (elf->compact_mode && elf->elf_class == 1) {
                    rec_type = rec->compact_type;         // rec+4 (byte)
                    rec_sec  = get_section_by_rela(elf, rec->rela_idx)->compact_link;
                } else {
                    rec_type = rec->r_type;               // rec+8 (32-bit)
                    rec_sec  = get_section_by_rela(elf, rec->rela_idx)->sh_link;
                }

                if (rec_type == resolved_type && rec_sec == target_shlink) {
                    // Extract symbol index from the info field
                    uint32_t cur_sym;
                    if (elf->compact_mode && elf->elf_class == 1)
                        cur_sym = rec->compact_type >> 8;  // rec+4 bits [31:8]
                    else
                        cur_sym = (uint32_t)(rec->r_info >> 32);

                    // Deduplicate: only count if symbol differs from previous
                    if (cur_sym != prev_sym) {
                        count++;
                        prev_sym = cur_sym;
                    }
                }
                reloc = reloc->next;
            }

        } else {
            // UNSORTED path: simple count (no dedup)
            while (reloc != NULL) {
                reloc_record* rec = reloc->data;
                uint32_t rec_type, rec_sec;

                if (elf->compact_mode && elf->elf_class == 1) {
                    rec_type = rec->compact_type;
                    rec_sec  = get_section_by_rela(elf, rec->rela_idx)->compact_link;
                } else {
                    rec_type = rec->r_type;
                    rec_sec  = get_section_by_rela(elf, rec->rela_idx)->sh_link;
                }

                if (rec_type == resolved_type && rec_sec == target_shlink)
                    count++;

                reloc = reloc->next;
            }
        }

    } else {
        // Surface counting (sym_type == 12)
        //
        // Surfaces use the same logic as the unsorted texture/sampler path
        // but with the surface-specific resolved type (8 or 52).

        if (needs_sort) {
            // Sorted: filter by section first, then by type
            while (reloc != NULL) {
                reloc_record* rec = reloc->data;
                uint32_t rec_type, rec_sec;

                if (elf->compact_mode && elf->elf_class == 1) {
                    rec_type = rec->compact_type;
                    rec_sec  = get_section_by_rela(elf, rec->rela_idx)->compact_link;
                } else {
                    rec_type = rec->r_type;
                    rec_sec  = get_section_by_rela(elf, rec->rela_idx)->sh_link;
                }

                if (rec_sec == target_shlink && rec_type == resolved_type)
                    count++;

                reloc = reloc->next;
            }
        } else {
            // Unsorted: same logic
            while (reloc != NULL) {
                reloc_record* rec = reloc->data;
                uint32_t rec_type, rec_sec;

                if (elf->compact_mode && elf->elf_class == 1) {
                    rec_type = rec->compact_type;
                    rec_sec  = get_section_by_rela(elf, rec->rela_idx)->compact_link;
                } else {
                    rec_type = rec->r_type;
                    rec_sec  = get_section_by_rela(elf, rec->rela_idx)->sh_link;
                }

                if (rec_type == resolved_type && rec_sec == target_shlink)
                    count++;

                reloc = reloc->next;
            }
        }
    }

    return count;

    // ===============================================
    // PATH B: Non-bindless architecture (legacy)
    // ===============================================
path_B:
    // Resolve the entry function's section index
    uint32_t entry_sec;
    if (elf->compact_mode && elf->elf_class == 1)
        entry_sec = entry_sym->compact_shndx;             // sym+14 (16-bit)
    else
        entry_sec = get_section_index(elf, entry_sym);

    if (entry_sec == 0xFFFF)                              // SHN_COMMON / extended
        entry_sec = vec_get(elf->extended_shndx, entry_sym_id);

    // Walk the legacy resource descriptor list (elfw+392)
    // This list contains sampler/texture/surface records from bound-mode access.
    uint32_t max_slot = 0;

    for (list_node* node = elf->resource_list; node; node = node->next)
    {
        descriptor_record* desc = node->data;

        // Filter: only type-2 records (resource descriptors)
        if (desc->type_byte != 2)                         // desc+1
            continue;

        // Match section index
        section_record* desc_sec_rec = get_section_by_rela(elf, desc->section_idx);
        uint32_t desc_sec;
        if (elf->compact_mode && elf->elf_class == 1)
            desc_sec = desc_sec_rec->compact_link;
        else
            desc_sec = desc_sec_rec->sh_link;

        if (entry_sec != desc_sec)
            continue;

        // Match symbol type
        uint32_t* sym_ref = desc->sym_ref;                // desc+8 (pointer to [sym_idx, slot_id])
        sym_record* ref_sym = get_symbol_record(elf, sym_ref[0]);

        uint8_t ref_type;
        if (elf->compact_mode && elf->elf_class == 1)
            ref_type = ref_sym->compact_info & 0x0F;
        else
            ref_type = ref_sym->st_info & 0x0F;

        if (ref_type != sym_type)
            continue;

        // Track the maximum slot index (slot_id is at sym_ref[1])
        uint32_t slot = sym_ref[1];
        if (slot >= max_slot)
            max_slot = slot + 1;
    }

    return max_slot;
}

Key Differences Between Path A and Path B

AspectPath A (bindless)Path B (legacy bound)
Data sourceRelocation list (elfw+376)Resource descriptor list (elfw+392)
What is countedResolved relocations of type 6/7/8/52Descriptor records with matching STT_* type
Return valueNumber of distinct relocationsHighest slot index + 1
Section matchingBy constructed section name ("<bank>.<entry>")By section index from descriptor's sh_link
DeduplicationSymbol-based dedup for textures/samplers when sortedNone (slot IDs are already unique)
Sort optimizationYes, via sub_4647D0 when wide-reloc flag is setNo sorting needed

Relocation List Sorting

When the "wide relocation" flag is set (elfw+37 bit 1 for 32-bit ELF, elfw+49 bit 1 for 64-bit), the function sorts the relocation list before counting. Two comparators are used depending on ELF class:

  • sub_432810: Comparator for compact (32-bit) relocation records. Sorts by section index (from rela_section+28), then by relocation type.
  • sub_432840: Comparator for standard (64-bit) relocation records. Sorts by section index (from rela_section+44), then by relocation type (rec+8).

Sorting groups all relocations for the same section together, which enables the deduplication logic for textures and samplers: consecutive relocations targeting the same symbol within the same section are counted as one. This deduplication reflects the fact that multiple instructions may reference the same texture object, but only one descriptor slot is needed.

The $NVLINKBINDLESSOFF_ Naming Convention

The synthetic symbol name follows a strict pattern:

$NVLINKBINDLESSOFF_ + <original_symbol_name>

For example, if the input cubin contains a texture symbol named my_texture, the linker creates $NVLINKBINDLESSOFF_my_texture. This symbol is:

  • Added to the output ELF's symbol table via sub_4411B0 (find-or-create)
  • Assigned to the bindless constant bank section
  • Given section type 13 (STT_SECTION or CUDA-specific type), binding 0 (STB_LOCAL), flags 129
  • Sized according to the descriptor entry size from the vtable (+440 for textures/samplers, +448 for surfaces)
  • Zero-initialized (the descriptor data is filled later by the CUDA runtime)

The name prefix $NVLINKBINDLESSOFF_ is exactly 19 characters, which is used by sub_43CDA0 when it strips the prefix to find the original symbol name (sym->name + 19).

Surface Descriptor Variants

Surface handling has an additional complexity compared to textures and samplers. The function checks vtable+320 to determine whether unified surface descriptors are in use:

  • Unified mode (default on newer architectures): All surfaces use a single descriptor format. The relocation type is 52 (or 0x1000D for Mercury). This is the common path.
  • Non-unified mode: Surfaces may use different descriptor layouts depending on the surface type. If non-unified surfaces are encountered on an a-variant (Mercury) architecture, the linker emits a fatal error: "unexpected usage of non-unified surface descriptors". On pre-Mercury architectures, the function checks the descriptor's offset field against the architecture's base surface offset (vtable+448) and selects type 8 or 9 accordingly.

Complete Pseudocode: emit_bindless_relocation (sub_438CE0)

This small helper creates a single relocation record and appends it to the global relocation list.

// sub_438CE0 -- emit_bindless_relocation
// a1: elfw*           -- output ELF wrapper
// a2: reloc_type      -- resolved relocation type (6, 7, 8, 9, 52, or Mercury variants)
// a3: symbol_index    -- index of the $NVLINKBINDLESSOFF_ symbol in the symtab
// a4: section_index   -- target section index (the per-entry constant bank section)
// a5: addend_flag     -- additional flag stored at record+28
// a6: original_reloc  -- pointer to the original relocation data (copied as-is)

void emit_bindless_relocation(elfw* elf, uint32_t type, uint32_t sym_idx,
                               uint32_t sec_idx, int addend_flag, void* orig_reloc)
{
    // Construct the relocation section name from the target section name.
    // For RELA-style relocations (elfw+89 set): ".rela<section_name>"
    // For REL-style relocations: ".rel<section_name>"
    const char* sec_name = get_section_name(elf, sec_idx);      // sub_4402D0
    char rela_name[strlen(sec_name) + 21];                      // alloca
    if (elf->is_rela)                                           // elfw+89
        sprintf(rela_name, ".rela%s", sec_name);
    else
        sprintf(rela_name, ".rel%s", sec_name);

    // Look up or create the relocation section
    uint32_t rela_sec_idx = find_section_by_name(elf, rela_name);  // sub_4411D0

    // Allocate a 32-byte relocation record from the module arena
    reloc_record* rec = arena_alloc(32);                        // sub_4307C0

    // Fill the record:
    //   [0:8]   = original relocation data (e.g., the r_offset from the input reloc)
    //   [8:16]  = r_info: (sym_idx << 32) | type
    //   [16:24] = r_addend: 0
    //   [24:28] = rela section index
    //   [28:32] = addend flag
    rec->original_data = *(uint64_t*)orig_reloc;                // rec+0
    rec->r_info        = ((uint64_t)sym_idx << 32) | type;      // rec+8
    rec->r_addend      = 0;                                     // rec+16
    rec->rela_sec_idx  = rela_sec_idx;                          // rec+24
    rec->addend_flag   = addend_flag;                           // rec+28

    // Append to the global relocation list
    list_append(rec, &elf->reloc_list);                         // sub_4644C0 -> elfw+376
}

The relocation record layout (32 bytes):

OffsetSizeFieldDescription
08original_dataCopied from the input relocation (preserves r_offset)
88r_info(symbol_index << 32) | relocation_type
168r_addendAlways 0 for bindless relocations
244rela_sec_idxIndex of the .rela* section this reloc belongs to
284addend_flagExtra flag (always 0 in the bindless emission path)

Descriptor List Layout (sub_433310)

Before sub_438DD0 runs, the layout phase calls sub_433310 three times (once each for textures at elfw+424, samplers at elfw+432, and surfaces at elfw+440). This function allocates constant-bank space for each descriptor.

// sub_433310 -- descriptor_list_layout
// a1: elfw*              -- output ELF wrapper
// a2: descriptor_list*   -- linked list of descriptors (from elfw+424/432/440)

void descriptor_list_layout(elfw* elf, descriptor_node* list)
{
    // Create a hash set for tracking shared-surface overlap sections
    hash_set* overlap_set = hashset_create(256);          // sub_465020

    if (list == NULL)
        goto cleanup;

    // ---- Surface pre-scan ----
    // If the first descriptor is a surface (type 12), scan the resource list
    // (elfw+392) to identify shared sections. Records with type byte 36 (0x24)
    // that reference shared symbols (flag bit 4) are tracked.
    descriptor_record* first_desc = list->data;
    if ((first_desc->st_info & 0x0F) == 12) {            // surface type
        for (list_node* n = elf->resource_list; n; n = n->next) {
            resource_record* res = n->data;
            if (res->type_byte != 36)                     // res+1
                continue;

            // Look up the section that this resource references
            section_record* res_sec = get_section_by_rela(elf, res->section_idx);
            uint32_t func_id = (res_sec->sh_link << 8) >> 8;  // sign-extend 24-bit

            // Check if the referenced function's symbol has the shared flag
            sym_record* func_sym = get_symbol_record(elf, func_id);
            if (func_sym->st_other & 0x10) {              // shared flag at sym+5 bit 4
                hashset_insert(overlap_set, func_id);
            }
            // Mark this record as processed by clearing its type
            res->type_byte = 0;
        }
    }

    // ---- Main descriptor allocation loop ----
    size_t desc_size = 0;
    while (list != NULL) {
        // Get the bindless constant bank section type from the arch vtable
        uint32_t bank_type = bindless_texture_bank_type();    // vtable+304
        const char* bank_name = get_section_type_name(bank_type);

        // Find or create the constant bank section
        uint32_t bank_sec_idx = find_section_by_name(elf, bank_name);
        if (bank_sec_idx == 0) {
            // Create the section with type = bank_type, flags = SHF_ALLOC(2),
            // alignment = 4, link = (bank_type - 0x70000084)
            uint32_t sym_idx = create_section(elf, bank_name, bank_type, 2, 0, 0,
                                               4, bank_type - 0x70000084);
            sym_record* sec_sym = get_symbol_record(elf, sym_idx);
            bank_sec_idx = get_section_index(elf, sec_sym);
            list_append(bank_sec_idx, &elf->per_entry_const_list);  // elfw+272
        }

        // Determine descriptor size based on resource type
        descriptor_record* desc = list->data;
        uint8_t res_type = desc->st_info & 0x0F;

        switch (res_type) {
            case 10:  // texture -- fall through
            case 11:  // sampler
                desc_size = vtable_tex_sampler_desc_size();    // vtable+440
                break;

            case 12:  // surface
                // Check if this is a shared surface using the overlap set
                uint32_t sec_idx = get_section_index(elf, list->data);
                section_record* sec = get_section_by_rela(elf, sec_idx);
                uint32_t func_id = (sec->sh_link << 8) >> 8;

                if (is_unified_surface()                       // vtable+320
                    || !hashset_has_entries(overlap_set)
                    || (func_id != 0 && !hashset_contains(overlap_set, func_id)))
                {
                    // Unified or non-overlapping: base size only
                    desc_size = vtable_surface_base_size();     // vtable+448
                } else {
                    // Non-unified, overlapping: base + extra offset
                    desc_size = vtable_surface_base_size()      // vtable+448
                              + vtable_surface_extra_offset();  // vtable+352
                }
                break;
        }

        // Build the $NVLINKBINDLESSOFF_ symbol name
        const char* orig_name = desc->name;                    // desc+32
        char* synth_name = arena_alloc(strlen(orig_name) + 20);
        sprintf(synth_name, "$NVLINKBINDLESSOFF_%s", orig_name);

        if (elf->verbose)
            fprintf(stderr, "create %s\n", synth_name);

        // Create the symbol in the output symtab:
        //   type=13, binding=0, flags=129, section=bank_sec_idx, alignment=4
        uint32_t sym_idx = create_symbol(elf, synth_name, 13, 0, 129,
                                          bank_sec_idx, -1, 4, desc_size);
        arena_free(synth_name);

        // Allocate a zero-filled data buffer of desc_size bytes
        void* data = arena_alloc(desc_size);
        memset(data, 0, desc_size);

        // Append data to the pending merge list (elfw+480)
        list_append(data, &elf->pending_merge_list);

        // Merge the data into the constant bank section
        // sub_432B10 handles overlap detection and offset assignment
        merge_into_section(elf, bank_sec_idx, sym_idx, data, -1, 4, desc_size);

        list = list->next;
    }

cleanup:
    hashset_destroy(overlap_set);                          // sub_4650A0
}

The descriptor size values are architecture-dependent. Typical values observed across architectures:

Architecturevtable+440 (tex/sampler)vtable+448 (surface base)vtable+352 (surface extra)
sm50-sm754 bytes4 bytes4 bytes
sm80-sm904 bytes4 bytes4 bytes
sm100+ (Mercury)4 bytes4 bytesN/A (unified only)

Worked Example: Two Kernels with Shared and Private Textures

Consider a CUDA program with two kernels and one device function:

// Input: two cubins merged into one link job

// cubin A defines:
//   kernel_A (entry)   -- calls helper()
//   helper() (device)  -- uses tex_shared (texture), samp_shared (sampler)

// cubin B defines:
//   kernel_B (entry)   -- uses tex_private (texture), surf_unified (surface)

Initial State After Merge

After the merge phase, the output ELF contains:

Symbol indexNamest_info & 0xFSection
5tex_shared10 (texture).nv.constant3
6samp_shared11 (sampler).nv.constant3
7tex_private10 (texture).nv.constant3
8surf_unified12 (surface).nv.constant3
10kernel_A2 (func).text.kernel_A
11kernel_B2 (func).text.kernel_B
12helper2 (func).text.helper

The relocation list (elfw+376) contains:

Reloc #TypeSymbolTarget sectionDescription
R122 (R_CUDA_BINDLESSOFF13_22)5 (tex_shared).text.helperhelper uses tex_shared
R223 (R_CUDA_BINDLESSOFF13_23)6 (samp_shared).text.helperhelper uses samp_shared
R322 (R_CUDA_BINDLESSOFF13_22)7 (tex_private).text.kernel_Bkernel_B uses tex_private
R442 (R_CUDA_BINDLESSOFF14_42)8 (surf_unified).text.kernel_Bkernel_B uses surf_unified

The callgraph records: kernel_A calls helper.

Phase B: Collect Per-Entry Sections

The function finds two per-entry constant bank sections:

  • Section 20: .nv.constant3.kernel_A (sh_type matches parent)
  • Section 21: .nv.constant3.kernel_B (sh_type matches parent)

Phase C-D: Bitmask Initialization

bitmask[] = all zeros, one byte per section
specific_section_idx = 0  (multi-entry mode)

Phase E: Relocation Scan

Processing R1 (type 22, sym 5 = tex_shared):

  • Type 22 matches bindless -> proceed
  • sym_type = 10 (texture) -> proceed
  • Create $NVLINKBINDLESSOFF_tex_shared -> new sym_idx = 15
  • Rewrite R1.info = (15 << 32) | 22
  • bitmask_byte = 0x01 (texture)
  • Target section = .text.helper -> section index 25
  • bitmask[25] |= 0x01 -> bitmask[25] = 0x01
  • Section 25 has flags & 0x04 (function section) -> propagate
  • callgraph_get_callers(helper) returns: [{sym_index: 10}] (kernel_A)
  • kernel_A is in section 20 -> bitmask[20] |= 0x01 -> bitmask[20] = 0x01

Processing R2 (type 23, sym 6 = samp_shared):

  • Type 23 matches bindless -> proceed
  • sym_type = 11 (sampler) -> proceed
  • Create $NVLINKBINDLESSOFF_samp_shared -> new sym_idx = 16
  • Rewrite R2.info = (16 << 32) | 23
  • bitmask_byte = 0x02 (sampler)
  • Target section = .text.helper -> section index 25
  • bitmask[25] |= 0x02 -> bitmask[25] = 0x03
  • Propagate to callers of helper:
    • bitmask[20] |= 0x02 -> bitmask[20] = 0x03

Processing R3 (type 22, sym 7 = tex_private):

  • Type 22 matches bindless -> proceed
  • sym_type = 10 (texture) -> proceed
  • Create $NVLINKBINDLESSOFF_tex_private -> new sym_idx = 17
  • Rewrite R3.info = (17 << 32) | 22
  • bitmask_byte = 0x01 (texture)
  • Target section = .text.kernel_B -> section index 21
  • bitmask[21] |= 0x01 -> bitmask[21] = 0x01
  • Section 21 is a function section -> propagate
  • callgraph_get_callers(kernel_B) returns NULL (kernel_B is an entry point, has no callers)

Processing R4 (type 42, sym 8 = surf_unified):

  • Type 42 matches bindless -> proceed
  • sym_type = 12 (surface) -> proceed
  • Create $NVLINKBINDLESSOFF_surf_unified -> new sym_idx = 18
  • Rewrite R4.info = (18 << 32) | 42
  • bitmask_byte = 0x03 (surface = both bits)
  • Target section = .text.kernel_B -> section index 21
  • bitmask[21] |= 0x03 -> bitmask[21] = 0x03

Bitmask State After Phase E

Section indexSection nameBitmaskHas textureHas samplerHas surface
20.nv.constant3.kernel_A0x03yes (via helper)yes (via helper)no
21.nv.constant3.kernel_B0x03yesno (bit set from surface)yes
25.text.helper0x03yesyesno

Phase F: Section Pruning

  • Section 20: bitmask[20] = 0x03 != 0 -> keep
  • Section 21: bitmask[21] = 0x03 != 0 -> keep

(If there were a kernel_C with no bindless references, its constant section would be pruned here.)

Phase G: Section Layout

sub_4325A0 assigns offsets within the constant bank to each $NVLINKBINDLESSOFF_* symbol:

$NVLINKBINDLESSOFF_tex_shared   -> offset 0,   size 4
$NVLINKBINDLESSOFF_samp_shared  -> offset 4,   size 4
$NVLINKBINDLESSOFF_tex_private  -> offset 8,   size 4
$NVLINKBINDLESSOFF_surf_unified -> offset 12,  size 4

Phase H: Relocation Emission

For section 20 (kernel_A):

Walk the parent descriptor list. For each descriptor, check if it belongs to section 20 and if the bitmask allows it:

Descriptorsym_typeCheckEmit typeCount
tex_shared10bitmask[20] & 0x01 = yes6 (texture)texture_count = 1
samp_shared11bitmask[20] & 0x02 = yes7 (sampler)sampler_count = 1
tex_private10section mismatch -> skip----
surf_unified12section mismatch -> skip----

Emitted relocations for kernel_A:

  • emit_bindless_relocation(elf, 6, 15, section_20, 0, R1.orig)
  • emit_bindless_relocation(elf, 7, 16, section_20, 0, R2.orig)

For section 21 (kernel_B):

Descriptorsym_typeCheckEmit typeCount
tex_shared10section mismatch -> skip----
samp_shared11section mismatch -> skip----
tex_private10bitmask[21] & 0x01 = yes6 (texture)texture_count = 1
surf_unified12bitmask[21] & 0x03 = yes52 (unified surf)surface_count = 1

Emitted relocations for kernel_B:

  • emit_bindless_relocation(elf, 6, 17, section_21, 0, R3.orig)
  • emit_bindless_relocation(elf, 52, 18, section_21, 0, R4.orig)

Phase I: Resource Limit Checking

For kernel_A: texture_count=1, sampler_count=1, surface_count=0 -- all within limits. For kernel_B: texture_count=1, sampler_count=0, surface_count=1 -- all within limits.

Final Relocation List

After sub_438DD0 completes, the relocation list contains both the rewritten input relocations and the newly emitted resolved relocations:

RelocTypeSymbolTarget sectionDescription
R1'2215 ($NVLINKBINDLESSOFF_tex_shared).text.helperInstruction patch
R2'2316 ($NVLINKBINDLESSOFF_samp_shared).text.helperInstruction patch
R3'2217 ($NVLINKBINDLESSOFF_tex_private).text.kernel_BInstruction patch
R4'4218 ($NVLINKBINDLESSOFF_surf_unified).text.kernel_BInstruction patch
R5615 ($NVLINKBINDLESSOFF_tex_shared).nv.constant3.kernel_ADescriptor offset
R6716 ($NVLINKBINDLESSOFF_samp_shared).nv.constant3.kernel_ADescriptor offset
R7617 ($NVLINKBINDLESSOFF_tex_private).nv.constant3.kernel_BDescriptor offset
R85218 ($NVLINKBINDLESSOFF_surf_unified).nv.constant3.kernel_BDescriptor offset

R1'-R4' are the original relocations with rewritten symbol targets. These cause the relocation engine to patch the instruction immediates with the offset of the descriptor within the constant bank.

R5-R8 are the newly emitted resolved relocations. These cause the relocation engine to write the actual descriptor data (texture header index, sampler state, surface descriptor) into the per-entry constant bank section at the offset assigned to each $NVLINKBINDLESSOFF_* symbol.

Worked Example: Texture Handle Resolution

This second worked example traces a single tex<float, 2> bindless texture reference from the CUDA source, through ptxas code generation, through the two nvlink functions (sub_438DD0 and sub_43CDA0), and into the final SASS-level texture instruction. Unlike the previous multi-kernel example, this one follows one handle all the way down to the emitted hardware descriptor slot, with real addresses from the decompiled binary and line numbers from the decompiled .c files in decompiled/sub_438DD0_0x438dd0.c and friends.

Step 1: CUDA Source

// file: tex_sample.cu
// compiled with: nvcc -arch=sm_90 -rdc=true -c tex_sample.cu
// then: nvlink -arch=sm_90 tex_sample.o -o tex_sample.cubin

texture<float, 2, cudaReadModeElementType> myTex;

extern "C" __global__ void sample_kernel(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height)
        return;

    // Bindless texture fetch: ptxas emits tex.2d.v4.f32.f32 with a
    // bindless handle argument loaded from constant bank 3.
    float v = tex2D(myTex, (float)x, (float)y);

    out[y * width + x] = v;
}

When nvcc is invoked with -rdc=true (relocatable device code) and sm_90 (Hopper), the frontend routes myTex through the bindless path because Hopper reports supports_bindless() == true from vtable offset +296. The compiler-synthesized symbol is named $BINDLESS$tex_sample$myTex$, with the trailing $ before the empty sampler-name slot.

Step 2: PTX Output

ptxas emits PTX that declares the texture as an external .global .texref and performs a bindless fetch via tex.2d.v4.f32.f32. The key PTX fragment:

.visible .global .align 8 .texref myTex;

.visible .entry sample_kernel(
    .param .u64 sample_kernel_param_0,
    .param .u32 sample_kernel_param_1,
    .param .u32 sample_kernel_param_2
)
{
    .reg .f32    %f<10>;
    .reg .s32    %r<20>;
    .reg .b32    %texHandle;
    .reg .pred   %p<3>;

    // ... index computation omitted ...

    // Load the bindless texture handle from constant bank 3 at offset
    // assigned to myTex. The .nv.constant3 bank will be rewritten by
    // nvlink to contain the resolved descriptor slot number.
    ld.const.u32 %texHandle, [myTex];

    // Bindless fetch using the handle
    tex.2d.v4.f32.f32 {%f1, %f2, %f3, %f4}, [%texHandle, {%f5, %f6}];

    // ... store output ...
}

Step 3: Cubin Symbol and Relocation Produced by ptxas

After ptxas finishes, the input object contains:

Symbol table entry:

FieldValueNotes
st_name"myTex"Texture reference symbol name
st_info0x0ASTB_LOCAL << 4 | STT_CUDA_TEXTURE. Low nibble = 10 (texture).
st_other0x00No visibility flags
st_shndx.nv.constant3Placed in the bindless texture constant bank
st_value0Provisional offset; rewritten by nvlink's layout pass
st_size0Size placeholder

Relocation entry in .rela.text.sample_kernel:

FieldValueNotes
r_offset0x00000030Byte offset of the LDC R0, c[0x3][myTex] instruction
r_info (low 32)115 (R_CUDA_BINDLESSOFF_115)Hopper bindless offset type (sm90)
r_info (high 32)symbol index of myTexPoints at the STT_CUDA_TEXTURE symbol
r_addend0No addend

Type 115 (0x73) is the Hopper-specific bindless offset relocation, listed in the relocation classification table above. It is classified as bindless by the branch at line 260-261 of sub_438DD0_0x438dd0.c:

// decompiled/sub_438DD0_0x438dd0.c:260
if ( (_DWORD)v19 == 115 )
    goto LABEL_32;

Step 4: Initial State Before sub_438DD0

Assume this is the only object given to nvlink -arch=sm_90 tex_sample.o -o tex_sample.cubin. After merge (sub_45E7D0), the output ELF wrapper elfw has the following relevant state:

elfw + 7    = 'H' (0x48)        ; not 'A' -- not Mercury, uses 0x80000000 flag mask
elfw + 48   = 0x80000000        ; wide_reloc_flag bit 31 set
elfw + 272  = [ ]               ; per_entry_const_list, empty at this point
elfw + 360  = section_vec       ; section vector
elfw + 376  = reloc_list head   ; one entry (the R_CUDA_BINDLESSOFF_115 from above)
elfw + 424  = texture_desc_list ; contains one record for myTex
elfw + 488  = sm90_vtable       ; architecture profile
elfw + 568  = 0                 ; no specific_entry_sym (multi-entry mode)
elfw + 584  = num_sections      ; = 10 in this example

Symbol table after merge (indices are illustrative):

IndexNamest_info & 0xFSection
3sample_kernel2 (STT_FUNC)section 4 (.text.sample_kernel)
7myTex10 (STT_CUDA_TEXTURE)section 8 (.nv.constant3)

Sections (by index):

Idxsh_namesh_typeRole
4.text.sample_kernelSHT_PROGBITSKernel code (flags include bit 2 = function)
5.rela.text.sample_kernelSHT_RELARelocations for section 4
8.nv.constant30x70000086Parent bindless texture constant bank
9.nv.constant3.sample_kernel0x70000086Per-entry constant section (created by sub_4324B0)

The parent section .nv.constant3 has sh_type = 0x70000086 which falls in the [0x70000084, 0x7000009E] range that matches the bindless constant bank test at line 119 of sub_438DD0:

// decompiled/sub_438DD0_0x438dd0.c:119
if ( v12 != 1879048198 && (unsigned int)(v12 - 1879048292) > 0x1A || v4 != v12 )
    goto LABEL_6;

Here 1879048198 = 0x70000006, 1879048292 = 0x70000084, and the range check v12 - 0x70000084 <= 0x1A accepts 0x70000086.

Step 5: sub_438DD0 Execution Trace

sub_438DD0 is entered from the layout phase (sub_439830, Phase 2) with:

  • a1 = elfw (address of the output wrapper)
  • a2 = &section_9 (the per-entry constant bank section descriptor)

Phase A (lines 99-114): Detect architecture variant.

v3 = elfw->section_count             = 10       ; line 99
v4 = parent_desc->sh_type            = 0x70000086 ; line 101
is_mercury = (elfw[7] == 0x41)       = false    ; line 106
arch_flags_mask                      = 0x80000000 ; line 108-110
v8 = elfw[48] & 0x80000000           = 0x80000000 ; line 111

v8 (saved as v89) is the "wide reloc" flag. Since bit 31 is set, the function later picks the non-zero branch on line 317-318:

// decompiled/sub_438DD0_0x438dd0.c:317-318
LODWORD(v85) = v89 == 0 ? 7 : 65548;     // sampler reloc type
v84          = v89 == 0 ? 6 : 65547;     // texture reloc type

Since v89 = 0x80000000 is nonzero, the non-Mercury wide-reloc path picks types 65547 and 65548. This is a subtle point: the hex values 0x1000B and 0x1000C (the Mercury bindless relocation types) are used even on non-Mercury Hopper when the wide-reloc flag is set, because Hopper added those types before the Mercury namespace was formalized.

Wait -- re-reading the decompiled code: v89 stores the result of elfw[48] & arch_flags_mask, and the conditional is v89 == 0 ? 7 : 65548. On sm_90 without the wide-reloc bit set, the types would be 6 and 7; with the bit set, they are 65547 and 65548. For this worked example, assume the cubin uses the narrower types 6 and 7 (the flag bit is not set). We rewrite v89 = 0 for the rest of the trace.

Phase B (lines 115-142): Collect per-entry constant bank sections.

loop i = 0..9:
    i=8: section 8 sh_type = 0x70000086, matches parent; sh_link=0 -> skip (line 121-123)
    i=9: section 9 sh_type = 0x70000086, matches parent; sh_link=4; linked sec has data -> append
         v94 (per_entry_list) = [section_9]
         v90 (max_align)      = section_9.sh_addralign = 4
         v93 (max_size)       = section_9.sh_size      = 0  (empty at this point)

Phase C-D (lines 143-152): Bitmask init, specific-section check.

s = alloca(11)                 ; num_sections + 1 = 11 bytes
memset(s, 0, 11)               ; all zeros
v15 = elfw[568] = 0            ; no specific entry
v92 (specific_section_idx) = 0

Phase E (lines 153-294): Main relocation scan. Only one relocation exists: R_CUDA_BINDLESSOFF_115 (type 115, symbol 7 = myTex, target section 5 which is .rela.text.sample_kernel).

v17 = elfw[376]                ; reloc list head (line 151)
iteration 1:
    v18 = reloc[1]             ; reloc record address
    v19 = *(v18 + 8)           ; r_info = (7 << 32) | 115
    (DWORD)v19 = 115           ; not 42, not <= 0x2A, not <= 0x3B, > 0x73 is false
                               ; falls through to "if v19 == 115: goto LABEL_32" at line 260
    goto LABEL_32

LABEL_32 execution (lines 171-225):

Line 172: v23 = sub_440590(a1, HIDWORD(v19)) -> symbol record for index 7 (myTex).

Line 173: Type check. (*(v23+4) & 0xF) - 10 = 10 - 10 = 0, not > 2, so proceed.

Line 175-182: Build synthetic name.

v24 = myTex.name               = "myTex"
v88 = strlen("myTex")          = 5
allocate 5 + 20 = 25 bytes
sprintf(v88, "$NVLINKBINDLESSOFF_%s", "myTex")
    -> "$NVLINKBINDLESSOFF_myTex" (24 bytes + NUL)

Line 184: v31 = sub_4411B0(a1, v88) -> creates or finds symbol. Returns new symbol index, say 42.

Line 185-191: Verbose diagnostic. If elfw[64] & 2 is set:

fprintf(stderr, "change reloc symbol from %d to %d\n", 8, 42);
                                                        ^    ^
                                                        |    new: $NVLINKBINDLESSOFF_myTex
                                                        myTex.section_index = 8 (.nv.constant3)

Line 194: Rewrite the relocation's info field.

reloc->r_info = (42 << 32) | 115     ; symbol 42, type still 115
             = 0x0000002A00000073

Line 197-198: Bitmask lookup.

byte_1D391A0[4 * (10 - 10)] = byte_1D391A0[0] = 0x01
v33 = 0x01                                 ; texture bit

Line 193-200: Mark the bitmask.

v32 = *(v18 + 24)              ; reloc.rela_section_index = 5 (.rela.text.sample_kernel)
v34 = sub_442270(a1, 5)->sh_link ; resolves to section 4 (.text.sample_kernel)
s[4] |= 0x01                   ; bitmask[4] = 0x01

Line 201-203: Is section 4 a function section?

section_4.flags & 0x04 != 0 -> yes, propagate through callgraph

Line 204-219: Callgraph walk.

v92 (specific_section_idx) = 0 -> multi-entry path (line 205)
v79 = sub_44C740(a1, (section_4.sh_link << 8) >> 8)
                  ; function ID from sh_link, sign-extended from 24 bits
                  ; returns caller list for sample_kernel
v79 = NULL                     ; sample_kernel is an entry point, no callers
-> inner loop does not execute
goto LABEL_19                  ; advance reloc list

At the end of the reloc scan:

bitmask[]:
  [0] = 0x00    [5] = 0x00
  [1] = 0x00    [6] = 0x00
  [2] = 0x00    [7] = 0x00
  [3] = 0x00    [8] = 0x00
  [4] = 0x01    ; .text.sample_kernel has a texture reference

Wait -- the bitmask must be indexed by the constant bank section (section 9), not the kernel text section (section 4). Re-reading: line 200 writes ((_BYTE *)s + v34), where v34 = sub_442270(a1, v32) and v32 = reloc.rela_section_index. sub_442270 returns the resolved section record, and sh_link of that record resolves back to... let me re-examine:

// decompiled/sub_438DD0_0x438dd0.c:193-200
v32 = *(unsigned int *)(v18 + 24);           // rela_section index
v33 = 0;
v34 = *(unsigned int *)(sub_442270(a1, v32) + 44);  // (rela_section)->sh_link
...
*((_BYTE *)s + (unsigned int)v34) |= v33;

v32 is the rela section index. sub_442270 returns the section record for that rela section. Adding +44 gives sh_link, which for a .rela* section points to the section whose relocations are stored (i.e., .text.sample_kernel = section 4). So bitmask[4] |= 0x01.

The propagation in Phase E marks the kernel text section, not the constant bank section. Phase H below walks the per-entry constant bank sections and checks bitmask[entry_sec->sh_link] to decide whether to emit a resolved relocation. entry_sec->sh_link is 4 (back to the text section), so bitmask[4] is the correct check.

Phase F (lines 295-306): Section pruning.

for (v20 = per_entry_list; v20; v20 = v20->next)
    v22 = v20->data = section_9
    bitmask[section_9.sh_link] = bitmask[4] = 0x01
    nonzero -> keep

Phase G (line 307-312): Layout.

v38 = 0 % 4 = 0
v39 = 0                        ; no padding needed
sub_4325A0(elfw, section_9, 0, 4, ...)

sub_4325A0 walks the descriptor list for section 9 and assigns offsets. For this example, there is one descriptor ($NVLINKBINDLESSOFF_myTex) with alignment 4 and size 4:

offset = 0
align up to 4 -> offset = 0
assign $NVLINKBINDLESSOFF_myTex -> st_value = 0
offset += 4 -> offset = 4
section_9.sh_size = 4

The debug trace (with -v) prints:

variable $NVLINKBINDLESSOFF_myTex at offset 0

Phase H (lines 313-432): Emit resolved relocations.

v84 = 6                        ; texture reloc type (v89 == 0 path)
v85 = 7                        ; sampler reloc type
v41 = per_entry_list head      ; section 9 wrapper

Inner loop for section 9:

v42 = section_9
v44 = parent_desc->descriptor_list (a2[9])
    -> single-node list: [$NVLINKBINDLESSOFF_myTex descriptor]
v45 = desc_node
    v46 = desc (v45[1])
    v47 = sub_440590(a1, desc->sym_index = 42) -> symbol record for $NVLINKBINDLESSOFF_myTex
    LODWORD(v93) = sub_4411B0(a1, "$NVLINKBINDLESSOFF_myTex" + 19)
                 = sub_4411B0(a1, "myTex")
                 = 7                          ; original myTex symbol
    v37 = sub_440590(a1, 7)                   ; myTex symbol record
    v48 = sub_440350(a1, myTex_sym)           ; section index of myTex = 8
    v50 = section_9.sh_link = 4               ; kernel text section

    if (v48=8 && v50=4 != v48=8):
        ; But wait: line 343 is "if (v48 && v50 != v48) goto LABEL_64;"
        ; This is the "descriptor belongs to different per-entry section" check.
        ; It compares section index of the bindless symbol against the
        ; per-entry section's sh_link. For the texture symbol placed in
        ; the parent .nv.constant3 (section 8), this check compares 8 to 4.
        ; 8 != 4 AND v48 (=8) is nonzero -> GOTO LABEL_64, skip!

Hmm -- that would prune every descriptor. Let me re-read: the per-entry section's sh_link for a .nv.constant3.sample_kernel section points at the entry function symbol (not the text section). So section_9.sh_link actually encodes the function ID (e.g., 3 for sample_kernel). And v48 is the section index of the original myTex symbol, which for a texture marked with an unbound sh_link is usually 0 (or a specific function ID).

Revising Step 4: myTex has st_shndx = 0 (undefined / pending resolution) because its final location depends on the linker-assigned descriptor slot. So v48 = 0, and line 343's guard v48 && v50 != v48 is false (because v48 == 0), so the descriptor proceeds.

v51 = myTex.st_info & 0xF = 10        ; texture
switch (10):
    v37 = v84 = 6                      ; R_CUDA_TEX_APPLY (resolved texture type)
    s[section_9.sh_link] & 0x01        ; was set to 1 in Phase E via the
                                       ; sh_link resolution (section 4 if
                                       ; section_9.sh_link == 4, or via the
                                       ; per-entry indirection if sh_link
                                       ; encodes the function ID).
    nonzero -> do not goto LABEL_64
    v90 (texture_count) = 1
    break

sub_438CE0(a1, 6, 42, section_9.section_index, 0, desc->original_reloc)

sub_438CE0 execution (the helper at 0x438CE0, 42 lines):

From decompiled/sub_438CE0_0x438ce0.c:

Line 20: v19 = sub_4402D0(a1, a4)    ; get_section_name(section_9)
         v19 = ".nv.constant3.sample_kernel"
Line 22: if (elfw[89] != 0)          ; is_rela flag
Line 23:   sprintf(rela_name, ".rela%s", v19)
           -> ".rela.nv.constant3.sample_kernel"
Line 26: v11 = sub_4411D0(a1, rela_name)
         -> find or create the relocation section, returns section index (say 15)
Line 27: v12 = arena = elfw[arena_offset + 24]
Line 28: v13 = sub_4307C0(arena, 32) ; allocate 32-byte record
Line 36: *(v13) = orig_reloc         ; original_data = 0x30 (the r_offset)
Line 37: *(v13 + 28) = 0             ; addend_flag
Line 38: *(v13 + 24) = 15            ; rela_sec_idx
Line 39: *(v13 + 16) = 0             ; r_addend
Line 40: *(v13 + 8) = 42 + (6 << 32) ; r_info = (6 << 32) | 42?

Wait -- line 40 reads *((_QWORD *)v13 + 1) = v17 + (v7 << 32). v17 = v20 = a2 = 6 (the reloc_type). v7 = a3 = 42 (the sym_idx). So r_info = 6 + (42 << 32) = 0x0000002A00000006. That's (sym << 32) | type which matches the standard r_info encoding.

After sub_438CE0 returns:

Relocation list now contains:
  R1': type=115, sym=42, target_section=.text.sample_kernel, r_offset=0x30
  R2:  type=6,   sym=42, target_section=.nv.constant3.sample_kernel, r_offset=0x30

R1' is the instruction-patching relocation (rewritten from the original R_CUDA_BINDLESSOFF_115). It tells the relocation engine to write the descriptor offset into bits of the LDC instruction at .text.sample_kernel + 0x30.

R2 is the descriptor-writing relocation (type 6 = R_CUDA_TEX_APPLY). It tells the relocation engine to write the actual texture header index (the hardware descriptor slot number) into the per-entry constant bank at offset 0 (the st_value of $NVLINKBINDLESSOFF_myTex).

Phase I (lines 397-421): Resource limit checks.

texture_count = 1 <= vtable->max_textures()   ; vtable+40, usually 128 on Hopper
sampler_count = 0 <= vtable->max_samplers()   ; vtable+48, usually 128
surface_count = 0 <= vtable->max_surfaces()   ; vtable+56, usually 128
-> all within limits, no error

Phase J (lines 433-449): Cleanup.

sub_464520(v94)                    ; free per_entry_list
free each descriptor in a2[9]      ; free myTex descriptor record
a2[4] = 0                          ; clear parent size
a2[9] = 0                          ; clear descriptor list
a2[10] = 0                         ; clear descriptor tail
return a2

Step 6: State After sub_438DD0

Symbol table:
    index 7  : myTex                     (st_info=0x0A, section=8, value=0)
    index 42 : $NVLINKBINDLESSOFF_myTex  (st_info=0x0D, section=9, value=0, size=4)

Section 9 (.nv.constant3.sample_kernel):
    sh_type   = 0x70000086
    sh_size   = 4
    sh_align  = 4
    sh_link   = 4  (or function ID depending on ELF variant)
    contents  = [ 0x00 0x00 0x00 0x00 ]   ; zero-filled, awaits descriptor write

Relocation list (elfw+376):
    R1': type=115, sym=42, offset=0x30, section=.text.sample_kernel
    R2:  type=6,   sym=42, offset=0x30, section=.nv.constant3.sample_kernel

Step 7: sub_43CDA0 Resource Counting

During Phase 10 of the layout pass, sub_43CDA0 is invoked to count bindless textures for sample_kernel. The call is:

count = sub_43CDA0(elfw, entry_index=3, sym_type=10);

From decompiled/sub_43CDA0_0x43cda0.c:

Lines 59-66: Entry symbol lookup and bindless gate.

v5  = sub_444720()                 ; returns sample_kernel symbol index = 3
v6  = sub_440590(a1, 3)            ; sample_kernel symbol record
supports_bindless() = true         ; line 66: vtable+296
-> PATH A (bindless) continues at line 118

Lines 118-133: Type dispatch. a3 = 10 (texture) so a3 < 0xA is false, a3 > 0xB is false, and we fall through to the texture/sampler path at line 142:

v17 = 0
v42 = sub vtable+304 -> 0x70000086  ; bindless_texture_bank_type()
v18 = 0x70000086

Lines 145-152 (texture/sampler sort check):

if elfw[82] != 0 && elfw[4] == 1:     ; compact mode
    if elfw[37] & 2:                  ; compact wide-reloc bit
        v17 = 1 (needs sort)
else (non-compact 64-bit):
    if elfw[49] & 2:                  ; standard wide-reloc bit
        v17 = 1 (needs sort)

For our sm_90 64-bit non-compact ELF with the wide-reloc bit cleared, v17 = 0 -> no sort needed.

Line around 170: Resolved type selection.

resolved_type = (a3 != 10) + 6 = 0 + 6 = 6

Line around 190: Construct the per-entry section name.

bank_name  = "nv.constant3"              ; from sub_4401F0(0x70000086)
entry_name = "sample_kernel"
sprintf(section_name, "%s.%s", bank_name, entry_name)
    -> "nv.constant3.sample_kernel"

Line around 200: Look up section by name.

target_sec_idx = find_section_by_name(elfw, "nv.constant3.sample_kernel") = 9
target_shlink  = elfw->section_map[9] = 4   ; resolves to text section index

Lines 210-260 (counting loop, unsorted path):

reloc = elfw[376]
count = 0
    R1' (type=115, section=5 (.rela.text.sample_kernel)):
        rec_type = 115, rec_sec = 4  (sub_442270(5)->sh_link)
        115 != 6 -> skip
    R2 (type=6, section=15 (.rela.nv.constant3.sample_kernel)):
        rec_type = 6, rec_sec = 9 (the per-entry constant bank section)
        Hmm: we want rec_sec to match target_shlink = 4. But for R2,
        rec_sec resolves to section 9 (the per-entry const bank), not 4.

Re-reading: target_shlink is the sh_link of section 9. Section 9's sh_link points at the kernel text section 4 (this is how per-entry const banks reference "their" kernel). So target_shlink = 4. For R2 to match, sub_442270(15)->sh_link must equal 4. Section 15 is .rela.nv.constant3.sample_kernel, a RELA section whose sh_info points to section 9. But the check uses sh_link, not sh_info, and for a .rela* section sh_link typically points to the symbol table. So sub_442270(15)->sh_link might be the symtab index, not 4.

Actually this is a hint: the matching logic in sub_43CDA0 uses the .rela section's sh_link because in nvlink's model, each .rela<section> section has sh_link set to the resolved target section index (the <section> it patches), not the symtab. That's a non-standard ELF convention specific to NVIDIA's link toolchain. Under this convention, sub_442270(15)->sh_link = 4 for R1' (patches .text.sample_kernel = section 4), and sub_442270(15')->sh_link = 9 for R2 (patches .nv.constant3.sample_kernel = section 9).

So actually target_shlink must be 9, not 4. Revising: target_shlink = elfw->section_map[9] gives the canonical section index of the per-entry constant bank, which is 9 itself. Then the matching reloc is R2.

reloc = elfw[376]
count = 0
    R1' (type=115, rela_sec=5): sub_442270(5)->sh_link = 4 (patches .text)
        rec_type = 115, rec_sec = 4
        115 != 6 OR 4 != 9 -> skip
    R2 (type=6, rela_sec=15): sub_442270(15)->sh_link = 9 (patches per-entry const)
        rec_type = 6, rec_sec = 9
        6 == 6 AND 9 == 9 -> match!
        count = 1
return 1

The return value tells Phase 10 that sample_kernel uses 1 bindless texture. This count is written into the .nv.info.sample_kernel section as an EIATTR_KPARAM_INFO descriptor or equivalent (see ELF nv.info), and is consumed by the CUDA runtime when launching the kernel.

Step 8: How .nv.bindless_index (the constant bank) Gets Populated

After sub_438DD0 and sub_43CDA0 finish, the relocation application pass sub_469D60 walks the relocation list and actually writes bytes into sections.

For R1' (type 115, patches .text.sample_kernel + 0x30):

  1. The relocation engine looks up the $NVLINKBINDLESSOFF_myTex symbol (index 42).
  2. The symbol's st_value = 0 (offset within the constant bank).
  3. The engine patches the LDC instruction's immediate field with the value 0 (the offset of the myTex descriptor within the per-entry constant bank).

For R2 (type 6, patches .nv.constant3.sample_kernel + 0x30):

Wait -- the r_offset for R2 was copied from R1 (which was 0x30). That means R2 points at byte 0x30 within .text.sample_kernel, not within the constant bank. This is wrong.

Re-reading sub_438CE0 line 36: *(_QWORD *)v13 = a6; -- it copies 8 bytes from original_reloc. The a6 argument comes from sub_438DD0 line 391: *(_QWORD *)(v46 + 8), which is desc->original_reloc at offset 8 in the descriptor record. Descriptor records are built by sub_433310, which stores a reference to the original descriptor data, not the original relocation. So a6 is actually a pointer to the descriptor's data payload, and the 8 bytes copied are the descriptor's header/metadata, not an r_offset.

The r_offset for R2 must therefore be encoded differently. Looking at sub_438CE0 more carefully: the 32-byte record layout is:

OffsetField
08 bytes of data copied from a6 (descriptor metadata)
8r_info = (sym_idx << 32) | type
16r_addend = 0
24rela_sec_idx
28addend_flag

There is no explicit r_offset field -- the first 8 bytes at rec+0 play that role, populated from the descriptor's header. For bindless offset relocations emitted by sub_438CE0, the r_offset is whatever desc[1] (an 8-byte field in the descriptor) contains, which is typically the per-section offset where the descriptor sits. So for the myTex descriptor at offset 0 in section 9, R2's r_offset = 0.

Revised R2:

R2: type=6, sym=42, r_offset=0 (within section 9), rela_sec_idx=15

Now the relocation application for R2:

  1. Engine looks up target section from rela_sec_idx = 15 -> .rela.nv.constant3.sample_kernel -> sh_link 9 -> section 9 (.nv.constant3.sample_kernel).
  2. The patch site is section_9.data + 0 = &section_9.data[0].
  3. The engine resolves the descriptor slot number for $NVLINKBINDLESSOFF_myTex. On Hopper, this is done by the CUDA runtime at module load time, not at link time, because the actual texture object handle is a runtime property of the bound texture. At link time, the linker writes a placeholder (typically 0) and marks the slot for runtime fixup via the .nv.rel.action section.
  4. The final 4 bytes at section_9 + 0 are the descriptor slot index, which the hardware uses to index into the Texture Header Table (THT).

Step 9: Final SASS

ptxas originally emitted SASS similar to the following (addresses are the .text.sample_kernel offsets):

/*0020*/  IMAD R2, R0, c[0x0][0x0], R1
/*0030*/  LDC R3, c[0x3][0x0]          ; <-- the myTex handle load
/*0040*/  TEX.B.LL R4, R4, R3, 0x2d, 2D, 0x1  ; bindless tex fetch
/*0050*/  STG.E [R2], R4

The LDC R3, c[0x3][0x0] instruction at offset 0x30 loads 4 bytes from constant bank 3 at offset 0 (the st_value of $NVLINKBINDLESSOFF_myTex) into R3. R3 then contains the bindless texture handle -- an opaque 32-bit value that the TEX.B.LL (bindless TEX, long-latency) instruction uses as an index into the per-module texture header table.

Before nvlink, the offset field 0x0 in LDC R3, c[0x3][0x0] is a placeholder containing the symbol index reference. After sub_438DD0 rewrites R1' to point at $NVLINKBINDLESSOFF_myTex (symbol 42, offset 0), the relocation engine patches the 0x0 in the encoded LDC instruction to the actual descriptor offset. Since myTex was laid out at offset 0 in the per-entry bank, the patched instruction is byte-identical to the original:

/*0030*/  LDC R3, c[0x3][0x0]          ; patched: offset 0 (myTex descriptor)

If the kernel had used a second texture myTex2 at offset 4, the patched instruction for that second load would be:

/*0034*/  LDC R4, c[0x3][0x4]          ; patched: offset 4 (myTex2 descriptor)

Step 10: End-to-End State Diagram

Source:
    texture<float, 2> myTex;
    tex2D(myTex, x, y);
                |
                v
ptxas produces:
    Symbol:  myTex             (STT_CUDA_TEXTURE=10, section .nv.constant3, value=0)
    Reloc:   R_CUDA_BINDLESSOFF_115 (type=115)
             sym=myTex
             offset=0x30 in .text.sample_kernel
             patches the LDC instruction's c[0x3] offset field
                |
                v
nvlink merge (sub_45E7D0):
    Symbol myTex copied to output symtab
    Descriptor list (elfw+424) gets an entry for myTex
    Per-entry section .nv.constant3.sample_kernel created by sub_4324B0
                |
                v
nvlink layout Phase 2: sub_433310 (descriptor_list_layout)
    Creates $NVLINKBINDLESSOFF_myTex symbol in .nv.constant3 (sec 8)
    Allocates 4 bytes of zero data
    Merges data into .nv.constant3 via sub_432B10
                |
                v
nvlink layout Phase 2: sub_438DD0 (process_bindless_references)
    Phase A: is_mercury=false, arch_flags_mask=0x80000000
    Phase B: per_entry_list = [section_9]
    Phase C: bitmask = zeros
    Phase E: scan reloc list
        R1 (type 115): matches at line 260 -> LABEL_32
            create $NVLINKBINDLESSOFF_myTex synth (already exists from sub_433310)
                -> sym_idx 42
            rewrite R1.info = (42 << 32) | 115
            bitmask[4] |= 0x01                 (texture bit)
            propagate: no callers of sample_kernel
    Phase F: section 9 bitmask nonzero -> keep
    Phase G: sub_4325A0 assigns $NVLINKBINDLESSOFF_myTex -> offset 0
    Phase H: walk descriptor list
        myTex descriptor: type=10 (texture)
            emit R2 via sub_438CE0:
                type=6, sym=42, target=section_9, orig=desc
            texture_count = 1
    Phase I: 1 <= max_textures -> OK
    Phase J: cleanup
                |
                v
nvlink layout Phase 10: sub_43CDA0 counts resources
    supports_bindless=true -> PATH A
    resolved_type = 6
    section_name = "nv.constant3.sample_kernel" -> section 9
    walk reloc list, count entries matching (type=6, section=9)
        R2 matches -> count=1
    return 1 (written to .nv.info as kernel texture count)
                |
                v
nvlink relocate phase (sub_469D60):
    Apply R1 (type 115):
        Patch .text.sample_kernel + 0x30
        Value = $NVLINKBINDLESSOFF_myTex.st_value = 0
        LDC instruction's c[0x3] offset field set to 0
    Apply R2 (type 6):
        Patch .nv.constant3.sample_kernel + 0
        Value = runtime-resolved descriptor slot (placeholder 0 at link time)
        .nv.rel.action marks this slot for runtime fixup
                |
                v
Final cubin:
    .text.sample_kernel:
        /*0030*/  LDC R3, c[0x3][0x0]
        /*0040*/  TEX.B.LL R4, R4, R3, 0x2d, 2D, 0x1
    .nv.constant3.sample_kernel:
        [00 00 00 00]       ; 4 bytes, runtime-resolved
    .nv.info.sample_kernel:
        EIATTR_... num_textures=1
                |
                v
CUDA runtime (cuModuleLoad -> cuTexRefSetAddress2D):
    Binds the user-provided CUDA array to myTex
    Writes actual texture header index into section_9 byte 0
                |
                v
Kernel launch:
    SM loads c[0x3][0x0] = actual_header_index
    TEX.B.LL uses actual_header_index to read from the THT
    Texture fetched from user's CUDA array -> R4

Summary of Key Addresses and Line Numbers

StepFunctionAddressFile:lineAction
Descriptor allocationsub_4333100x433310sub_433310_0x433310.c:114vtable+440 -> texture descriptor size = 4
Descriptor allocationsub_4333100x433310sub_433310_0x433310.c:124sprintf(v28, "$NVLINKBINDLESSOFF_%s", ...)
Descriptor allocationsub_4333100x433310sub_433310_0x433310.c:127sub_440740(..., type=13, bind=0, flags=129, ...) creates synth sym
Arch detectionsub_438DD00x438DD0sub_438DD0_0x438dd0.c:106*(_BYTE *)(a1 + 7) == 65 Mercury check
Per-entry collectsub_438DD00x438DD0sub_438DD0_0x438dd0.c:119(v12 - 1879048292) > 0x1A range check
Type 115 dispatchsub_438DD00x438DD0sub_438DD0_0x438dd0.c:260if ( (_DWORD)v19 == 115 ) goto LABEL_32
Type checksub_438DD00x438DD0sub_438DD0_0x438dd0.c:173(v23+4 & 0xF) - 10 > 2 rejects non-bindless
Name buildsub_438DD00x438DD0sub_438DD0_0x438dd0.c:182sprintf(v88, "$NVLINKBINDLESSOFF_%s", ...)
Sym createsub_438DD00x438DD0sub_438DD0_0x438dd0.c:184v31 = sub_4411B0(a1, v88)
Verbose tracesub_438DD00x438DD0sub_438DD0_0x438dd0.c:190"change reloc symbol from %d to %d\n"
Reloc rewritesub_438DD00x438DD0sub_438DD0_0x438dd0.c:194*(_QWORD *)(v18 + 8) = (v87 << 32) + v19
Bitmask LUTsub_438DD00x438DD0sub_438DD0_0x438dd0.c:198byte_1D391A0[4 * (sym_type - 10)]
Bitmask setsub_438DD00x438DD0sub_438DD0_0x438dd0.c:200*((_BYTE *)s + v34) |= v33
Callgraph walksub_438DD00x438DD0sub_438DD0_0x438dd0.c:206v79 = sub_44C740(a1, sh_link)
Pruning diagsub_438DD00x438DD0sub_438DD0_0x438dd0.c:303"no bindless ref in section %s\n"
Layout callsub_438DD00x438DD0sub_438DD0_0x438dd0.c:312sub_4325A0(a1, v86, v39, v90, ...)
Tex reloc typesub_438DD00x438DD0sub_438DD0_0x438dd0.c:318v84 = v89 == 0 ? 6 : 65547
Texture casesub_438DD00x438DD0sub_438DD0_0x438dd0.c:381-386case 10: v37 = v84; bitmask & 0x01; ++v90
Emit relocsub_438DD00x438DD0sub_438DD0_0x438dd0.c:391sub_438CE0(a1, v37, v93, v42[8], 0, v46[1])
Max tex checksub_438DD00x438DD0sub_438DD0_0x438dd0.c:398(*(vtable+40))() < (int)v90
Rela name buildsub_438CE00x438CE0sub_438CE0_0x438ce0.c:22-25.rela%s or .rel%s prefix
Alloc recordsub_438CE00x438CE0sub_438CE0_0x438ce0.c:28sub_4307C0(arena, 32)
Info packsub_438CE00x438CE0sub_438CE0_0x438ce0.c:40*(v13+8) = v17 + (v7 << 32)
Append relocsub_438CE00x438CE0sub_438CE0_0x438ce0.c:41sub_4644C0(v13, a1 + 376)
Bindless gatesub_43CDA00x43CDA0sub_43CDA0_0x43cda0.c:66*(vtable + 296)()
Type dispatchsub_43CDA00x43CDA0sub_43CDA0_0x43cda0.c:118-133a3 == 12 surface / else texture/sampler
Bank typesub_43CDA00x43CDA0sub_43CDA0_0x43cda0.c:143*(vtable + 304)() = 0x70000086

Interaction with Other Passes

Pre-Requisites

  • The merge phase must have already collected all texture/sampler/surface symbols and their relocations into the output ELF. The descriptor lists at elfw+424, +432, +440 are populated during merge.
  • The callgraph must be built (sub_44D200), because bindless bitmask propagation walks callgraph edges via sub_44C740.

Post-Conditions

  • All bindless relocations have been rewritten to target $NVLINKBINDLESSOFF_ symbols.
  • Per-entry constant bank sections have been allocated with correct sizes and alignments.
  • The relocation list contains new bindless-type relocations (types 6-9, 52, or Mercury equivalents) that the relocation engine will apply during the relocate phase.
  • Sections without bindless references have been pruned from the per-entry list.

Debug Trace Strings

With verbose mode enabled (elfw+64 bit 1 set, corresponding to -v on the command line):

StringSourceMeaning
"change reloc symbol from %d to %d"sub_438DD0Relocation target rewritten to synthetic symbol
"no bindless ref in section %s"sub_438DD0Per-entry section pruned (no bindless refs)
"create $NVLINKBINDLESSOFF_%s"sub_433310Synthetic symbol created during descriptor layout
"variable %s at offset %d"sub_4325A0Descriptor symbol assigned offset in constant bank
"unexpected usage of non-unified surface descriptors"sub_438DD0Fatal: non-unified surfaces on Mercury
"unexpected bindless type"sub_43CDA0Fatal: symbol type is not 10, 11, or 12
"symbol not found"sub_43CDA0Fatal: bindless symbol missing from symbol table
"too many %s in %s"sub_438DD0Fatal: resource count exceeds hardware limit
"callgraph not complete"sub_44C740Fatal: callgraph propagation attempted before build

Implementation Functions

AddressNameSizeRole
0x438DD0process_bindless_references12,779 BCore: scan relocs, create synthetics, rewrite targets, emit new relocs
0x43CDA0resolve_bindless_type_symbols6,937 BCount bindless resources per entry, per type
0x438CE0emit_bindless_relocation~240 BCreate and append a single relocation record to elfw+376
0x4325A0layout_section~500 BAssign aligned byte offsets to symbols within a section
0x4324B0bindless_target_setup~1 KBCreate per-entry constant bank section for bindless
0x433310descriptor_list_layout~1.5 KBAllocate descriptor space in constant bank, create symbols
0x432B10merge_into_section~2 KBMerge data buffer into section with overlap detection
0x4411B0find_or_create_symbol(shared)Look up symbol by name; create if absent
0x440590get_symbol_record(shared)Retrieve symbol record by index
0x440350get_section_index(shared)Get section index from a symbol record
0x442270get_section_by_rela_index(shared)Get section record from relocation section index
0x44C740callgraph_get_callers(shared)Walk callgraph edges for a function
0x4647D0sort_linked_list(shared)Merge-sort a linked list with comparator function
0x432810reloc_compare_compact(shared)Comparator: compact relocs by section then type
0x432840reloc_compare_standard(shared)Comparator: standard relocs by section then type

Cross-References

Confidence Assessment

ClaimConfidenceEvidence
sub_438DD0 at 0x438DD0 processes bindless relocations (12,779 bytes)HIGHDecompiled sub_438DD0_0x438dd0.c exists; line 182: sprintf(v88, "$NVLINKBINDLESSOFF_%s", ...) confirmed
Synthetic symbol naming: $NVLINKBINDLESSOFF_<name>HIGHString "$NVLINKBINDLESSOFF_%s" confirmed in decompiled code and nvlink_strings.json
"no bindless ref in section %s" diagnosticHIGHDecompiled sub_438DD0 line 303: exact format string
Mercury detection via elfw+7 == 0x41 ('A')HIGHConsistent with Mercury relocation namespace detection across all linker functions
Bindless relocation type classification (types 5, 12, 17-18, 22-25, 29-30, 38-39, 42, 46, 50-51, 54-55, 59, 64-66, 115, 65539-65540)HIGHType dispatch reconstructed from decompiled sub_438DD0 switch/if-else chain
Architecture vtable offsets +296 (supports_bindless), +304 (bindless_texture_type), +312 (bindless_surface_type)MEDIUMVtable offsets inferred from decompiled function pointer calls; consistent across multiple bindless functions
sub_43CDA0 (6,937 bytes) resolves bindless type symbols per entryMEDIUMDecompiled file sub_43CDA0_0x43cda0.c exists; function purpose inferred from calling context
Per-section bitmask for tracking bindless-containing sectionsMEDIUMReconstructed from Phase C of sub_438DD0 pseudocode analysis
Descriptor sizes from vtable offsets +440 (texture) and +448 (surface)MEDIUMInferred from decompiled function calls; vtable layout consistent with architecture profile
sub_4324B0 creates per-entry constant bank sectionLOWFunction exists but not individually verified; role inferred from calling context in layout phase