Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Relocations & Symbols

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas defines two parallel relocation type systems for CUBIN ELF files: R_CUDA_* (117 types, ordinals 0--116) for SASS-encoded cubins targeting SM 30--90a, and R_MERCURY_* (65 types, ordinals 0--64) for Mercury-encoded cubins targeting SM 100+ (Blackwell and later). Both systems use standard Elf64_Rela relocation entries in .rela.text.<funcname> sections, with a custom resolution algorithm that handles alias redirection, dead function filtering, UFT/UDT pseudo-relocations, PC-relative branch validation, and sub-byte instruction patching. The symbol table (.symtab) follows standard ELF Elf64_Sym format with CUDA-specific symbol types and an extended section index mechanism (.symtab_shndx) for programs exceeding 65,280 sections.

Relocation resolversub_1CD48C0 (4,184 bytes binary, 22 KB decompiled, 17 callees)
Relocation writersub_1CD5920 (1,985 bytes binary, 11 KB decompiled)
Relocation creator (SASS)sub_1CD4510 (860 bytes binary)
Relocation creator (Mercury)sub_1CD46B0 (540 bytes binary)
Relocation pre-scansub_1CD43A0 (560 bytes binary)
Bit-field patchersub_1CD34E0 (3,700 bytes binary, sub_1CD33F0/sub_1CD3330 helpers)
Symbol table buildersub_1CB68D0 (9,578 bytes binary, 49 KB decompiled, 36 callees)
Symbol fixupsub_1CB2CA0 (2,038 bytes binary, 4 call sites)
Section index remapsub_1C99BB0 (4,900 bytes binary)
UFT managersub_1CD22E0 (1,979 bytes binary, 10 KB decompiled)
UFT slot validatorsub_1CD2AA0 (~800 bytes binary)
Bindless handlersub_1CAB300 (2,157 bytes binary, 12 KB decompiled)
R_CUDA table addressoff_2408B60 (117 entries x 64 bytes)
R_MERCURY table addressoff_2407B60 (65 entries x 64 bytes)

Relocation Type Systems

Table Selection Logic

The ELFW object stores the ELF class byte at offset 7 and a flags word at offset 48. The relocation subsystem selects between the two tables based on the IsPIC flag combined with the ELF class:

// Table selection (reconstructed from sub_1CD48C0, sub_1CD4510, sub_1CD5920)
uint32_t test_bit = (elfw->ei_class == 'A') ? 1 : 0x80000000;
bool is_mercury = (test_bit & elfw->flags) != 0;

if (is_mercury) {
    // SM 100+ Mercury encoding: off_2407B60
    // Type codes start at 0x10000; subtract to index the table
    table = &R_MERCURY_table;           // off_2407B60
    index = raw_type - 0x10000;         // range check: index <= 0x3F (63)
} else {
    // SM 30-90a SASS encoding: off_2408B60
    table = &R_CUDA_table;              // off_2408B60
    index = raw_type;                   // range check: index <= 0x73 (115)
}

Mercury relocation type codes are stored with a 0x10000 offset in the internal relocation entry's type field. This lets a single code path handle both systems -- the table selection just subtracts the offset for Mercury types.

Relocation Descriptor Table Format

Each entry in the relocation type descriptor table is 64 bytes (8 qwords). The layout is accessed through pointer arithmetic patterns like table[8 * index + N] where the table pointer type is char** (8-byte stride):

// Relocation type descriptor -- 64 bytes per entry (reconstructed)
struct reloc_type_desc {
    const char* name;       // +0:  R_CUDA_* or R_MERCURY_* name string
    uint32_t    unknown_04; // +8:  unknown field
    uint32_t    unknown_08; // +12: unknown field
    uint32_t    bit_start;  // +16: starting bit position in instruction
    uint32_t    bit_width;  // +20: field width in bits
    uint32_t    patch_mode; // +24: patching mode (0=none, 1=direct, 6/7=split)
    uint32_t    flags_hi;   // +28: high flags (value 12-15 triggers callgraph)
    // ... remaining 32 bytes: additional patching parameters
};

The patch_mode field at offset +24 drives the bit-field patching logic in sub_1CD34E0. The switch statement handles these modes:

ModeDescriptionTypes
0No-op (sentinel/terminator)R_CUDA_NONE, R_CUDA_NONE_LAST
1, 0x12, 0x2EDirect bit-field write (full or partial 64-bit word)Most absolute/PC-relative types
6, 0x37Split low-word patching (handles cross-qword boundaries)LO types, sub-byte 8_N types
7, 0x38Split high-word patching (uses HIDWORD of value)HI types

When flags_hi (at descriptor offset +28) is in the range 12--15, the relocation creator calls sub_1CBD0D0 to register the relocation's target section in the call graph. This triggers call graph edge creation for function descriptors and branch targets.

R_CUDA_* Relocation Types

117 types from R_CUDA_NONE (ordinal 0) to R_CUDA_NONE_LAST (ordinal 116). String addresses span 0x23FBE0E--0x23FC6B6 in the ptxas binary, confirming these are contiguous in the read-only data section. Ordinals are assigned by string table order.

Absolute Address Relocations

OrdinalNameBit FieldPurpose
0R_CUDA_NONE--Sentinel / no relocation
1R_CUDA_3232-bitAbsolute 32-bit address
2R_CUDA_6464-bitAbsolute 64-bit address
5R_CUDA_ABS32_2632-bit at bit 26Absolute address, 26-bit encoding
10R_CUDA_ABS32_LO_26low 32 at bit 26Low half of 64-bit address
11R_CUDA_ABS32_HI_26high 32 at bit 26High half of 64-bit address
12R_CUDA_ABS32_2332-bit at bit 23Absolute address, 23-bit encoding
13R_CUDA_ABS32_LO_23low 32 at bit 23Low half, 23-bit encoding
14R_CUDA_ABS32_HI_23high 32 at bit 23High half, 23-bit encoding
15R_CUDA_ABS24_2624-bit at bit 2624-bit absolute address
16R_CUDA_ABS24_2324-bit at bit 2324-bit absolute, 23-bit encoding
17R_CUDA_ABS16_2616-bit at bit 2616-bit absolute address
18R_CUDA_ABS16_2316-bit at bit 2316-bit absolute, 23-bit encoding
42R_CUDA_ABS32_2032-bit at bit 20Volta+ encoding format
43R_CUDA_ABS32_LO_20low 32 at bit 20Low half, 20-bit encoding
44R_CUDA_ABS32_HI_20high 32 at bit 20High half, 20-bit encoding
45R_CUDA_ABS24_2024-bit at bit 2024-bit, 20-bit encoding
46R_CUDA_ABS16_2016-bit at bit 2016-bit, 20-bit encoding
55R_CUDA_ABS32_3232-bit at bit 32Ampere+ encoding format
56R_CUDA_ABS32_LO_32low 32 at bit 32Low half, 32-bit position
57R_CUDA_ABS32_HI_32high 32 at bit 32High half, 32-bit position
58R_CUDA_ABS47_3447-bit at bit 3447-bit wide field
59R_CUDA_ABS16_3216-bit at bit 3216-bit, 32-bit position
60R_CUDA_ABS24_3224-bit at bit 3224-bit, 32-bit position
74R_CUDA_ABS24_4024-bit at bit 4024-bit at offset 40
75R_CUDA_ABS55_16_3455-bit, 16+34 splitSplit wide field
100R_CUDA_ABS20_4420-bit at bit 4420-bit at offset 44
114R_CUDA_ABS56_16_3456-bit, 16+34 splitSplit wide field
70R_CUDA_32_LOlow 32Low half of 64-bit
71R_CUDA_32_HIhigh 32High half of 64-bit

The naming convention encodes the bit-field geometry: R_CUDA_ABS<width>_<start_bit> indicates that <width> bits of the resolved address are patched into the instruction at bit position <start_bit>. The LO/HI suffix indicates low or high 32 bits of a 64-bit value. The different start positions (20, 23, 26, 32, 34, 40, 44) correspond to different SASS instruction encoding formats across SM generations: Kepler (26), Maxwell/Pascal (23), Volta/Turing (20), Ampere/Ada/Hopper (32).

Global Address Relocations

OrdinalNamePurpose
3R_CUDA_G32Global-space 32-bit address
4R_CUDA_G64Global-space 64-bit address
84R_CUDA_G8_0Global-space byte 0 of 64-bit instruction
85R_CUDA_G8_8Global-space byte 1
86R_CUDA_G8_16Global-space byte 2
87R_CUDA_G8_24Global-space byte 3
88R_CUDA_G8_32Global-space byte 4
89R_CUDA_G8_40Global-space byte 5
90R_CUDA_G8_48Global-space byte 6
91R_CUDA_G8_56Global-space byte 7

Global address relocations target .nv.global and .nv.global.init sections. The G8_* sub-byte variants patch individual bytes within a 64-bit instruction word, used when the instruction encoding requires the address to be spread across non-contiguous bit fields.

PC-Relative Relocations

OrdinalNamePurpose
40R_CUDA_PCREL_IMM24_26PC-relative 24-bit immediate at bit 26
41R_CUDA_PCREL_IMM24_23PC-relative 24-bit immediate at bit 23

PC-relative relocations resolve branch and call targets. The resolver enforces a critical constraint:

"PC relative branch address should be in the same section"

This means intra-function branches use PC-relative relocations, but cross-function calls use absolute or function descriptor relocations. The 24-bit immediate provides a +/-8 MB range from the instruction address, sufficient for any single kernel.

Constant Field Relocations

OrdinalNamePurpose
24R_CUDA_CONST_FIELD19_2819-bit constant bank offset at bit 28
25R_CUDA_CONST_FIELD19_2319-bit constant bank offset at bit 23
36R_CUDA_CONST_FIELD21_2621-bit constant bank offset at bit 26
38R_CUDA_CONST_FIELD19_2619-bit constant bank offset at bit 26
39R_CUDA_CONST_FIELD21_2321-bit constant bank offset at bit 23
50R_CUDA_CONST_FIELD19_2019-bit constant bank offset at bit 20
54R_CUDA_CONST_FIELD21_2021-bit constant bank offset at bit 20
64R_CUDA_CONST_FIELD19_4019-bit constant bank offset at bit 40
66R_CUDA_CONST_FIELD21_3821-bit constant bank offset at bit 38
115R_CUDA_CONST_FIELD22_3722-bit constant bank offset at bit 37

Constant field relocations patch .nv.constant0.<func> bank offsets into load constant (LDC) instructions. The field width (19, 21, or 22 bits) determines the maximum addressable constant bank size: 19-bit supports 512 KB, 21-bit supports 2 MB, 22-bit supports 4 MB. During resolution, the constant bank deduplication pass (sub_1CA6890) may adjust the relocation offset:

"optimize ocg constant reloc offset from %lld to %lld"

Function Descriptor Relocations

OrdinalNamePurpose
31R_CUDA_FUNC_DESC32_2332-bit function descriptor at bit 23
32R_CUDA_FUNC_DESC32_LO_23Low 32 of descriptor at bit 23
33R_CUDA_FUNC_DESC32_HI_23High 32 of descriptor at bit 23
34R_CUDA_FUNC_DESC_32Full 32-bit function descriptor
35R_CUDA_FUNC_DESC_64Full 64-bit function descriptor
47R_CUDA_FUNC_DESC32_2032-bit function descriptor at bit 20
48R_CUDA_FUNC_DESC32_LO_20Low 32 of descriptor at bit 20
49R_CUDA_FUNC_DESC32_HI_20High 32 of descriptor at bit 20
61R_CUDA_FUNC_DESC32_3232-bit function descriptor at bit 32
62R_CUDA_FUNC_DESC32_LO_32Low 32 of descriptor at bit 32
63R_CUDA_FUNC_DESC32_HI_32High 32 of descriptor at bit 32
92--99R_CUDA_FUNC_DESC_8_0 -- R_CUDA_FUNC_DESC_8_56Sub-byte function descriptor patches

Function descriptors are used for indirect calls through function pointers. The descriptor contains the target function's entry point address and is loaded by the GPU's indirect call mechanism. The sub-byte FUNC_DESC_8_* variants patch individual bytes of the descriptor into instruction encoding slots, used in wide instruction formats where the descriptor address is spread across multiple fields. When the relocation creator detects a flags_hi value of 12--15 in the descriptor table entry, it calls sub_1CBD0D0 to register the call edge in the call graph.

Texture, Sampler, and Surface Relocations

OrdinalNamePurpose
6R_CUDA_TEX_HEADER_INDEXTexture header table index
7R_CUDA_SAMP_HEADER_INDEXSampler header table index
8R_CUDA_SURF_HW_DESCSurface hardware descriptor
9R_CUDA_SURF_HW_SW_DESCSurface hardware+software descriptor
19R_CUDA_TEX_SLOTTexture binding slot
20R_CUDA_SAMP_SLOTSampler binding slot
21R_CUDA_SURF_SLOTSurface binding slot
26R_CUDA_TEX_SLOT9_499-bit texture slot at bit 49
52R_CUDA_SURF_HEADER_INDEXSurface header table index
101R_CUDA_SAMP_HEADER_INDEX_0Sampler header index variant

These relocations connect texture/sampler/surface operations to their runtime-allocated descriptor table entries. The CUDA driver fills in the actual descriptor indices at launch time based on the kernel's resource binding.

Bindless Texture/Surface Relocations

OrdinalNamePurpose
22R_CUDA_TEX_BINDLESSOFF13_32Bindless texture offset, 13-bit at bit 32
23R_CUDA_TEX_BINDLESSOFF13_47Bindless texture offset, 13-bit at bit 47
29R_CUDA_TEX_BINDLESSOFF13_41Bindless texture offset, 13-bit at bit 41
30R_CUDA_TEX_BINDLESSOFF13_45Bindless texture offset, 13-bit at bit 45
51R_CUDA_BINDLESSOFF13_36Bindless offset, 13-bit at bit 36
65R_CUDA_BINDLESSOFF14_40Bindless offset, 14-bit at bit 40

Bindless texture/surface relocations are handled by sub_1CAB300, which creates $NVLINKBINDLESSOFF_<name> symbols for each bindless reference. During resolution:

"change reloc symbol from %d to %d"
"no bindless ref in section %s"
"unexpected usage of non-unified surface descriptors"

Sub-Byte Patch Relocations

OrdinalNamePurpose
76--83R_CUDA_8_0 -- R_CUDA_8_56Patch byte 0--7 of 64-bit instruction

These relocations patch a single byte at a specific 8-bit-aligned position within a 64-bit instruction word. They are used when the resolved value must be inserted into a non-standard bit position that does not align with the instruction encoding's immediate field boundaries.

Miscellaneous Relocations

OrdinalNamePurpose
27R_CUDA_6_316-bit field at bit 31
28R_CUDA_2_472-bit field at bit 47
37R_CUDA_QUERY_DESC21_37Query descriptor, 21-bit at bit 37
53R_CUDA_INSTRUCTION64Whole 64-bit instruction replacement
67R_CUDA_INSTRUCTION128Whole 128-bit instruction replacement
68R_CUDA_YIELD_OPCODE9_0YIELD opcode, 9-bit at bit 0
69R_CUDA_YIELD_CLEAR_PRED4_87Clear YIELD predicate, 4-bit at bit 87
72R_CUDA_UNUSED_CLEAR32Zero out 32-bit unused field
73R_CUDA_UNUSED_CLEAR64Zero out 64-bit unused field
116R_CUDA_NONE_LASTSentinel marking end of relocation table

The R_CUDA_INSTRUCTION64 and R_CUDA_INSTRUCTION128 types replace entire instruction words, used for instruction-level patching by the linker when the instruction encoding changes based on the final resolved address.

The R_CUDA_YIELD_* types handle YIELD-to-NOP conversion. When a kernel has forward-progress requirements that prevent yielding, the resolver converts YIELD instructions to NOPs:

"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."

The R_CUDA_UNUSED_CLEAR* types zero out instruction fields that are unused in the final encoding, ensuring deterministic output.

Unified Address Space Relocations

OrdinalNamePurpose
102R_CUDA_UNIFIEDUnified address (generic pointer)
103R_CUDA_UNIFIED_3232-bit unified address
104--111R_CUDA_UNIFIED_8_0 -- R_CUDA_UNIFIED_8_56Unified address sub-byte patches
112R_CUDA_UNIFIED32_LO_32Low 32 of unified at bit 32
113R_CUDA_UNIFIED32_HI_32High 32 of unified at bit 32

Unified address relocations resolve generic pointers that can point to global, shared, or constant memory. During final resolution, the resolver performs a type conversion from unified (type 103) to absolute (type 1):

// In sub_1CD48C0: unified reloc replacement
if (reloc_type == 103)  // R_CUDA_UNIFIED_32
    reloc_type = 1;     // R_CUDA_32

R_MERCURY_* Relocation Types

65 types from R_MERCURY_NONE (ordinal 0) to R_MERCURY_NONE_LAST (ordinal 64). String addresses span 0x23FB8C5--0x23FBDFA. Mercury relocations serve the same purpose as R_CUDA types but are designed for the Mercury intermediate representation used on SM 100+ targets.

Mercury Type Categories

CategoryTypesPurpose
AddressR_MERCURY_G64, R_MERCURY_ABS64, R_MERCURY_ABS32, R_MERCURY_ABS16Memory addresses
Split addressR_MERCURY_ABS32_LO, R_MERCURY_ABS32_HI64-bit address halves
Program-relativeR_MERCURY_PROG_REL64, R_MERCURY_PROG_REL32, R_MERCURY_PROG_REL32_LO, R_MERCURY_PROG_REL32_HIOffsets from program base
Tex/samp/surfR_MERCURY_TEX_HEADER_INDEX, R_MERCURY_SAMP_HEADER_INDEX, R_MERCURY_SURF_HEADER_INDEXResource descriptors
FunctionR_MERCURY_FUNC_DESC_64Function descriptor
Sub-byteR_MERCURY_8_0 -- R_MERCURY_8_56 (8 types)Byte-level patches
Global sub-byteR_MERCURY_G8_0 -- R_MERCURY_G8_56 (8 types)Global-space byte patches
Func desc sub-byteR_MERCURY_FUNC_DESC_8_0 -- R_MERCURY_FUNC_DESC_8_56 (8 types)Function descriptor byte patches
Abs-program-relativeR_MERCURY_ABS_PROG_REL32_LO, R_MERCURY_ABS_PROG_REL32_HI, R_MERCURY_ABS_PROG_REL32, R_MERCURY_ABS_PROG_REL64Absolute program-relative
Program-relative sub-byteR_MERCURY_PROG_REL8_0 -- R_MERCURY_PROG_REL8_56 (8 types)Program-relative byte patches
UnifiedR_MERCURY_UNIFIED, R_MERCURY_UNIFIED_32, R_MERCURY_UNIFIED_8_0 -- R_MERCURY_UNIFIED_8_56, R_MERCURY_UNIFIED32_LO, R_MERCURY_UNIFIED32_HIUnified address space
CleanupR_MERCURY_UNUSED_CLEAR64Zero out unused fields
SentinelsR_MERCURY_NONE, R_MERCURY_NONE_LASTTable boundaries

Mercury introduces program-relative relocations (PROG_REL*) that do not exist in the R_CUDA set. These compute offsets relative to the program base address rather than absolute virtual addresses, enabling position-independent code for the Mercury deferred finalization model. The Mercury finalizer (running at link or load time) resolves these program-relative relocations after the final code layout is known.

Relocation Encoding

ELF Relocation Entry Format

Cubin relocations use standard Elf64_Rela entries in .rela.text.<funcname> sections:

typedef struct {
    Elf64_Addr  r_offset;   // Byte offset within the section
    Elf64_Xword r_info;     // Symbol index (high 32) | Type (low 32)
    Elf64_Sxword r_addend;  // Addend for the relocation computation
} Elf64_Rela;  // 24 bytes

The r_info field packs the symbol table index in the upper 32 bits and the R_CUDA/R_MERCURY type code in the lower 32 bits:

#define ELF64_R_SYM(info)   ((info) >> 32)
#define ELF64_R_TYPE(info)  ((info) & 0xFFFFFFFF)

For Mercury types, the type code stored in r_info is the ordinal plus 0x10000. The resolver subtracts 0x10000 before indexing the R_MERCURY descriptor table.

Internal Relocation Entry

The ELFW object maintains relocations in an internal linked list at offset +376 of the ELFW structure. Each internal entry is a 32-byte node:

// Internal relocation entry (reconstructed from sub_1CD4510, sub_1CD46B0)
struct elfw_reloc {
    uint64_t offset;            // +0:  byte offset in target section
    uint64_t type_and_section;  // +8:  (target_section << 32) | reloc_type
    uint64_t addend;            // +16: relocation addend
    uint32_t symbol_index;      // +24: index into ELFW symbol table
    uint32_t alias_index;       // +28: original symbol if aliased, else 0
};

The type_and_section field encodes both the relocation type code (low 32 bits) and the target section index (high 32 bits) in a single 64-bit field.

Resolved Relocation Output

Resolved relocations are written by sub_1CD5920 to .nv.resolvedrela sections. Additionally, .nv.rel.action sections carry relocation action metadata for the CUDA driver's runtime linker.

Symbol Table Structure

.symtab Format

The symbol table uses standard Elf64_Sym entries (24 bytes each for 64-bit, 16 bytes for 32-bit):

typedef struct {
    Elf32_Word    st_name;    // String table offset
    unsigned char st_info;    // Type (low 4 bits) | Binding (high 4 bits)
    unsigned char st_other;   // Visibility (low 2 bits) | Flags
    Elf16_Half    st_shndx;   // Section index (or SHN_XINDEX=0xFFFF)
    Elf64_Addr    st_value;   // Symbol value (section offset)
    Elf64_Xword   st_size;    // Symbol size
} Elf64_Sym;

Internal Symbol Representation

The ELFW maintains an internal symbol structure (40+ bytes) with additional metadata:

OffsetSizeFieldDescription
+41st_infoLow nibble = type (STT_*), high nibble = binding strength
+51st_otherBits 0-1 = visibility, bits 4-7 = CUDA-specific flags
+62st_shndxSection index (0xFFFF = use extended index)
+88st_valueSymbol address; -1 = unallocated
+244section_linkInternal section reference
+284extra_indexSecondary symbol link
+328name_ptrPointer to symbol name string

Symbol Types

ELF TypeValueCUDA Usage
STT_NOTYPE0Undefined/external symbols
STT_OBJECT1Global/constant/shared variables
STT_FUNC2Kernel entry points, device functions
STT_SECTION3Section symbols (one per section)
STT_COMMON5Common symbols (.common symbol)
STT_CUDA_TEXTURE10Texture reference symbols
STT_CUDA_SURFACE11Surface reference symbols
STT_CUDA_SAMPLER12Sampler reference symbols
STT_CUDA_FUNC_DESC13Function descriptor (indirect call target)

The internal type field at offset +4 uses the low nibble for ELF standard types and the high nibble for binding/scope information. The resolver checks st_info & 0xF throughout its processing.

Function descriptor symbols (type 13) receive special handling in the relocation resolver. When the resolver encounters a type-13 symbol, it checks whether the symbol is allocated:

// sub_1CD48C0: function descriptor symbol handling
if ((sym->st_info & 0xF) == 13) {  // STT_CUDA_FUNC_DESC
    shndx = get_section_index(elfw, sym);
    if (shndx == 0) {
        // Unresolved -- check binding and ELFW flags
        if ((sym->st_other & 0xE0) == 0x20  // STB_GLOBAL
            || (sym->st_other & 0x10))       // CUDA-specific extern flag
        {
            // External function descriptor: keep relocation for linker
        }
    }
}

Symbol Binding and Visibility

The st_other byte encodes both ELF visibility (bits 0-1) and CUDA-specific binding flags (bits 4-7):

BitsFieldValues
0-1ELF visibility0 = STV_DEFAULT, 1 = STV_INTERNAL, 2 = STV_HIDDEN, 3 = STV_PROTECTED
4Extern flag1 = external linkage (for nvlink)
5-6Binding strength0x20 = STB_GLOBAL, 0x80 = STB_WEAK
7ReservedUsed internally during resolution

The binding byte at st_other & 3 (low 2 bits of the high nibble) maps to:

ValueMeaningResolution
1STB_LOCAL / deadSkip relocation ("ignore reloc on dead func %s")
2STB_GLOBALNormal resolution
3STB_WEAKResolve if available, otherwise use default

Symbol Table Builder -- sub_1CB68D0

The symbol table builder (9,578 bytes, approximately 1,700 decompiled lines) processes the ELFW internal symbol list in these steps:

  1. Iterate symbols -- walks the symbol list from the ELFW object
  2. Filter deleted symbols -- 12 separate checks for "reference to deleted symbol" guard against stale entries from dead code elimination
  3. Handle __cuda_syscall -- special-cases the device-side syscall dispatcher symbol
  4. Resolve aliases -- follows alias chains to find the canonical symbol
  5. Compute values -- resolves st_value from section base + offset
  6. Create section symbols -- ensures every section has an STT_SECTION symbol; emits "found multiple section symbols for %s" if duplicates exist
  7. Handle SHN_XINDEX overflow -- when section index >= SHN_LORESERVE (0xFF00 = 65,280), sets st_shndx = SHN_XINDEX (0xFFFF) and stores the real index in .symtab_shndx
  8. Build .symtab_shndx -- populates the extended index table for overflow sections

Error strings observed in the builder:

StringCondition
"reference to deleted symbol"Symbol marked deleted but still referenced (12 checks)
"ignore symbol %s in unused section"Symbol in eliminated section
"ignore symbol string %s for sym %d"Skipping symbol name for unnamed/internal symbol
"found multiple section symbols for %s"Duplicate STT_SECTION entries
"symbol already assigned"Duplicate assignment attempt
"adding global symbols of same name"Name collision
"alias to unknown symbol"Alias target not found
"unallocated symbol"Symbol value is -1 (never assigned an address)
"missing sec strtab"String table not initialized

Symbol Fixup -- sub_1CB2CA0

After dead code elimination removes sections, symbol indices become stale. The fixup pass (2,038 bytes, called from 4 sites) renumbers all symbol st_shndx values:

  1. For each section in the ELFW:
    • If the section lacks an STT_SECTION symbol, create one
    • If the section has multiple STT_SECTION symbols, warn
  2. Walk the symbol table and remap st_shndx values through the section index mapping

The fixup runs at multiple pipeline points: after dead function elimination, after Mercury section cloning, and after any section deletion.

Section Index Remap -- sub_1C99BB0

The companion to sub_1CB2CA0 for the extended index mechanism. When section indices change, this function updates both .symtab_shndx and .nv.merc.symtab_shndx to keep the extended index tables consistent.

Relocation Resolution Algorithm

The master resolver sub_1CD48C0 implements a 7-step algorithm that processes every relocation entry in the ELFW's linked list:

Step 1: Symbol Address Computation

For each relocation entry, compute the symbol's resolved address by adding the symbol's st_value (from the section base) to the relocation offset:

if (reloc->alias_index) {
    sym = lookup_symbol(elfw, reloc->alias_index);
    reloc->offset += sym->st_value;
}

For Mercury cubins (64-bit ELF class 'A' with Mercury flag set), the resolver applies an additional address transformation that accounts for the Mercury instruction stride:

if (is_mercury && sym_value != 0) {
    int stride = 2 * arch_vtable->get_merc_stride();
    reloc->offset += stride * (sym_value >> 7);
}

Step 2: Alias Resolution

If the relocation targets an alias symbol (ELF type STT_NOTYPE with section index pointing to another symbol), redirect the relocation to the canonical target:

"change alias reloc %s to %s"

The resolver follows the alias chain through sub_1CB1E00 (get section index) and sub_1CB3D20 (get section by index), replacing the alias with its real target.

Step 3: Dead Function Filtering

If the relocation's target symbol has local binding (st_other & 3 == 1) and is in a deleted section, the relocation is zeroed out:

"ignore reloc on dead func %s"

The relocation's type is set to 0 (R_CUDA_NONE), effectively removing it. For the output mode != 2 (relocatable), dead relocations on STT_NOTYPE symbols with a binding prefix of 2 are also removed.

Step 4: UFT/UDT Pseudo-Relocation Handling

Relocations targeting special synthetic symbols are intercepted:

SymbolAction
__UFT_OFFSETRecord for UFT slot assignment, zero the relocation
__UFT_CANONICALMap to canonical UFT entry
__UDT_OFFSETRecord for UDT slot assignment
__UDT_CANONICALMap to canonical UDT entry
__UFT, __UFT_ENDUFT boundary markers
__UDT, __UDT_ENDUDT boundary markers

The resolver checks if a symbol name starts with "__UFT_OFFSET" (exact 13-character comparison in the decompiled code). If matched:

"ignore reloc on UFT_OFFSET"

The relocation entry is then processed by the UFT manager (sub_1CD22E0) which maps UUIDs to UFT slot indices.

Step 5: PC-Relative Branch Validation

For relocations whose descriptor table entry has *(table + 8*index + 5) == 16 (indicating a PC-relative type), the resolver validates that the source and target sections are identical:

if (reloc_desc->patch_mode == 16 && reloc->section != target_section)
    fatal("PC relative branch address should be in the same section");

Step 6: YIELD-to-NOP Conversion

If the relocation type is R_CUDA_YIELD_OPCODE9_0 or R_CUDA_YIELD_CLEAR_PRED4_87, and the kernel has forward-progress requirements, the resolver skips the NOP conversion:

"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."

Step 7: Bit-Field Patching

The final step delegates to sub_1CD34E0, the bit-field patcher. This function uses the relocation descriptor table entry's parameters to extract the current field value (via sub_1CD33F0), add the resolved address, and write the result back (via sub_1CD3330):

// sub_1CD34E0 -- bit-field patching (simplified)
bool apply_reloc(reloc_desc_table, index, is_addend, instruction_data,
                 symbol_value, reloc_offset, sym_addr, sym_shndx,
                 section_type_offset, old_value_out) {
    entry = &reloc_desc_table[index * 64 + 12]; // Start at byte 12
    end = &reloc_desc_table[index * 64 + 60];   // 4 operations max

    while (entry < end) {
        uint32_t bit_start = entry[0];
        uint32_t bit_width = entry[1];
        uint32_t mode      = entry[2];

        switch (mode) {
        case 0:  // NOP
            break;
        case 1:  // Direct write: place value at [bit_start, bit_start+bit_width)
        case 0x12: case 0x2E:
            old = extract_bits(instruction_data, bit_start, bit_width);
            insert_bits(instruction_data, resolved_value, bit_start, bit_width);
            break;
        case 6:  // Split low-word write (cross-qword boundary handling)
        case 0x37:
            // Write low portion, advance to next qword if needed
            break;
        case 7:  // Split high-word write
        case 0x38:
            // Write HIDWORD of value
            break;
        }
        entry += 4;  // Next 16-byte operation
    }
    return true;
}

If the NVRS (NVIDIA Register Spill) check fails during patching, the resolver emits:

"unexpected NVRS"

NVRS relocations are special-purpose relocations for register spill slot references. When the bit-field patcher returns false, the relocation is invalid for the current context.

Post-Resolution

Successfully resolved relocations are either:

  • Removed from the linked list (the relocation was fully applied to the instruction bytes)
  • Kept for the output .nv.resolvedrela section (the relocation needs runtime resolution by the CUDA driver)

The relocation writer sub_1CD5920 validates every remaining relocation before serializing it:

CheckError
Symbol value == -1"symbol never allocated"
Offset >= section size"relocation is past end of offset"
Target section unallocated"rela section never allocated"
Address not found in section data"reloc address not found"

Unified Function Table (UFT) and Unified Data Table (UDT)

Purpose

UFT and UDT support indirect function calls and generic data references across compilation units. When nvcc compiles a program using function pointers, virtual functions, or __device__ function addresses taken in host code, the compiler generates UFT/UDT entries that the runtime linker resolves at load time.

Sections

SectionPurpose
.nv.uftJump slot table (one slot per indirect-callable function)
.nv.uft.entryUFT entry metadata (UUID, offset pairs)
.nv.udtData slot table (one slot per externally-referenced data object)
.nv.udt.entryUDT entry metadata
.nv.uft.relUFT relocation table

UFT Entry Structure

Each UFT entry contains a 128-bit UUID and a 64-bit offset:

struct uft_entry {
    uint64_t uuid_lo;   // Low 64 bits of UUID
    uint64_t uuid_hi;   // High 64 bits of UUID
    uint64_t offset;     // Offset into the jump slot table
};  // 24 bytes per entry

UFT Manager -- sub_1CD22E0

The UFT manager (1,979 bytes, 10 KB decompiled) processes UFT/UDT entries across all compilation units:

  1. Build UID-to-key map -- hashes uuid_lo ^ uuid_hi as the lookup key
  2. Detect conflicts -- reports "uft map conflict: 0x%llx" when two entries hash to the same key
  3. Detect duplicates -- reports "duplicate ids in uft.entry" when identical UUIDs appear
  4. Reorder entries -- "Re-ordering UFT entries" / "Re-ordering UDT entries" sorts entries for deterministic output
  5. Match UUIDs -- cross-references UUIDs against the existing UFT for linking: "matching uuid not found" if a referenced UUID does not exist
  6. Align UDT -- "udt size %lld needs aligning" pads UDT entries to required alignment

UFT Slot Validator -- sub_1CD2AA0

Validates consistency between .nv.uft (jump slots) and .nv.uft.entry (metadata):

"missing nv.uft.entry"
"Number of .nv.uft jump slots != Number of entries in .nv.uft.entry"
"size of uidx window != nv.uft"

Synthetic Symbols

The resolver recognizes these synthetic symbol names:

SymbolPurpose
__UFT_OFFSETPoints to a UFT jump slot
__UFT_CANONICALCanonical UUID entry for a UFT slot
__UDT_OFFSETPoints to a UDT data slot
__UDT_CANONICALCanonical UUID entry for a UDT slot
__UFT / __UFT_ENDUFT table start/end boundaries
__UDT / __UDT_ENDUDT table start/end boundaries
$NVLINKBINDLESSOFF_<name>Bindless texture/surface offset symbol
__cuda_syscallDevice-side syscall dispatcher

Extern Shared Memory Relocations

Extern shared memory variables (declared with extern __shared__) are handled specially because their addresses are not known until kernel launch. The resolver tracks these through dedicated strings:

"extern shared variable %s at offset %lld"
"reloc of extern shared %d replaced with symbol %d"
"new extern shared instance %d"

Multiple kernels may reference the same extern shared variable. The linker creates separate instances when necessary and patches the relocation to point to the correct instance.

Weak Symbol Handling

When nvlink encounters a weak symbol that conflicts with a strong definition:

"Could not replace weak symbol '%s'"

This occurs during the relocation pre-scan (sub_1CD43A0) when processing relocations that reference weak symbols. The pre-scan walks all relocations and checks the symbol binding at sym->st_other & 0xE0:

  • 0x80 = weak: eligible for replacement by a strong definition
  • 0x20 = global: normal binding

Linking Model

Relocatable Object Mode (-c)

When ptxas produces a relocatable object (.o), all relocations are preserved in .rela.text.<func> sections. The call graph is written to .nv.callgraph. Symbols retain their binding information for nvlink to resolve.

"No relocatable objects found. Did not generate callgraph."
"Generate relocatable object"

The --preserve-relocs flag additionally preserves relocations that would normally be resolved internally:

"This option will make PTXAS to generate relocatable references for variables and preserve ..."

Executable Mode (default)

In the default mode, ptxas resolves all internal relocations and writes .nv.resolvedrela for any relocations that require runtime resolution. External references and function descriptors for indirect calls are preserved as unresolved relocations for the CUDA driver's runtime linker.

PIC Mode

Position-independent code mode (IsPIC flag) changes the relocation encoding. The ELF flags word at ELFW offset +48 encodes this mode. PIC cubins use additional program-relative relocations and avoid absolute addresses where possible.

Cross-References

Function Map

AddressSize (binary)DecompiledCallersCalleesPurpose
sub_1CD48C04,184 B22 KB117Master relocation resolver (7-step algorithm)
sub_1CD59201,985 B11 KB19Relocation writer (.nv.resolvedrela)
sub_1CD4510~860 B4 KB----Relocation creator (SASS)
sub_1CD46B0~540 B4 KB----Relocation creator (Mercury)
sub_1CD43A0~560 B3 KB----Relocation pre-scan (weak/extern)
sub_1CD34E03,700 B17 KB12Bit-field patcher (sub_1CD33F0 extract, sub_1CD3330 insert)
sub_1CD33F0~300 B2 KB71Extract bits from instruction word
sub_1CD3330~200 B1 KB50Insert bits into instruction word
sub_1CD22E01,979 B10 KB220UFT manager (UUID-to-slot mapping)
sub_1CD2AA0~800 B3 KB----UFT slot validator
sub_1CB68D09,578 B49 KB136Symbol table builder (.symtab)
sub_1CB2CA02,038 B8 KB411Symbol fixup (post-deletion renumbering)
sub_1C99BB04,900 B25 KB118Section index remap (.symtab_shndx)
sub_1CB64A0~500 B2 KB----Symbol resolver (checks .nv.* special names)
sub_1CAB3002,157 B12 KB119Bindless texture/surface handler
sub_1CA68902,286 B15 KB211Constant bank deduplication
sub_1CBD0D0--------Call graph edge registration