Relocations & Symbols
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas defines two parallel relocation type systems for CUBIN ELF files: R_CUDA_* (117 types, ordinals 0--116) for SASS-encoded cubins targeting SM 30--90a, and R_MERCURY_* (65 types, ordinals 0--64) for Mercury-encoded cubins targeting SM 100+ (Blackwell and later). Both systems use standard Elf64_Rela relocation entries in .rela.text.<funcname> sections, with a custom resolution algorithm that handles alias redirection, dead function filtering, UFT/UDT pseudo-relocations, PC-relative branch validation, and sub-byte instruction patching. The symbol table (.symtab) follows standard ELF Elf64_Sym format with CUDA-specific symbol types and an extended section index mechanism (.symtab_shndx) for programs exceeding 65,280 sections.
| Relocation resolver | sub_1CD48C0 (4,184 bytes binary, 22 KB decompiled, 17 callees) |
| Relocation writer | sub_1CD5920 (1,985 bytes binary, 11 KB decompiled) |
| Relocation creator (SASS) | sub_1CD4510 (860 bytes binary) |
| Relocation creator (Mercury) | sub_1CD46B0 (540 bytes binary) |
| Relocation pre-scan | sub_1CD43A0 (560 bytes binary) |
| Bit-field patcher | sub_1CD34E0 (3,700 bytes binary, sub_1CD33F0/sub_1CD3330 helpers) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes binary, 49 KB decompiled, 36 callees) |
| Symbol fixup | sub_1CB2CA0 (2,038 bytes binary, 4 call sites) |
| Section index remap | sub_1C99BB0 (4,900 bytes binary) |
| UFT manager | sub_1CD22E0 (1,979 bytes binary, 10 KB decompiled) |
| UFT slot validator | sub_1CD2AA0 (~800 bytes binary) |
| Bindless handler | sub_1CAB300 (2,157 bytes binary, 12 KB decompiled) |
| R_CUDA table address | off_2408B60 (117 entries x 64 bytes) |
| R_MERCURY table address | off_2407B60 (65 entries x 64 bytes) |
Relocation Type Systems
Table Selection Logic
The ELFW object stores the ELF class byte at offset 7 and a flags word at offset 48. The relocation subsystem selects between the two tables based on the IsPIC flag combined with the ELF class:
// Table selection (reconstructed from sub_1CD48C0, sub_1CD4510, sub_1CD5920)
uint32_t test_bit = (elfw->ei_class == 'A') ? 1 : 0x80000000;
bool is_mercury = (test_bit & elfw->flags) != 0;
if (is_mercury) {
// SM 100+ Mercury encoding: off_2407B60
// Type codes start at 0x10000; subtract to index the table
table = &R_MERCURY_table; // off_2407B60
index = raw_type - 0x10000; // range check: index <= 0x3F (63)
} else {
// SM 30-90a SASS encoding: off_2408B60
table = &R_CUDA_table; // off_2408B60
index = raw_type; // range check: index <= 0x73 (115)
}
Mercury relocation type codes are stored with a 0x10000 offset in the internal relocation entry's type field. This lets a single code path handle both systems -- the table selection just subtracts the offset for Mercury types.
Relocation Descriptor Table Format
Each entry in the relocation type descriptor table is 64 bytes (8 qwords). The layout is accessed through pointer arithmetic patterns like table[8 * index + N] where the table pointer type is char** (8-byte stride):
// Relocation type descriptor -- 64 bytes per entry (reconstructed)
struct reloc_type_desc {
const char* name; // +0: R_CUDA_* or R_MERCURY_* name string
uint32_t unknown_04; // +8: unknown field
uint32_t unknown_08; // +12: unknown field
uint32_t bit_start; // +16: starting bit position in instruction
uint32_t bit_width; // +20: field width in bits
uint32_t patch_mode; // +24: patching mode (0=none, 1=direct, 6/7=split)
uint32_t flags_hi; // +28: high flags (value 12-15 triggers callgraph)
// ... remaining 32 bytes: additional patching parameters
};
The patch_mode field at offset +24 drives the bit-field patching logic in sub_1CD34E0. The switch statement handles these modes:
| Mode | Description | Types |
|---|---|---|
| 0 | No-op (sentinel/terminator) | R_CUDA_NONE, R_CUDA_NONE_LAST |
| 1, 0x12, 0x2E | Direct bit-field write (full or partial 64-bit word) | Most absolute/PC-relative types |
| 6, 0x37 | Split low-word patching (handles cross-qword boundaries) | LO types, sub-byte 8_N types |
| 7, 0x38 | Split high-word patching (uses HIDWORD of value) | HI types |
When flags_hi (at descriptor offset +28) is in the range 12--15, the relocation creator calls sub_1CBD0D0 to register the relocation's target section in the call graph. This triggers call graph edge creation for function descriptors and branch targets.
R_CUDA_* Relocation Types
117 types from R_CUDA_NONE (ordinal 0) to R_CUDA_NONE_LAST (ordinal 116). String addresses span 0x23FBE0E--0x23FC6B6 in the ptxas binary, confirming these are contiguous in the read-only data section. Ordinals are assigned by string table order.
Absolute Address Relocations
| Ordinal | Name | Bit Field | Purpose |
|---|---|---|---|
| 0 | R_CUDA_NONE | -- | Sentinel / no relocation |
| 1 | R_CUDA_32 | 32-bit | Absolute 32-bit address |
| 2 | R_CUDA_64 | 64-bit | Absolute 64-bit address |
| 5 | R_CUDA_ABS32_26 | 32-bit at bit 26 | Absolute address, 26-bit encoding |
| 10 | R_CUDA_ABS32_LO_26 | low 32 at bit 26 | Low half of 64-bit address |
| 11 | R_CUDA_ABS32_HI_26 | high 32 at bit 26 | High half of 64-bit address |
| 12 | R_CUDA_ABS32_23 | 32-bit at bit 23 | Absolute address, 23-bit encoding |
| 13 | R_CUDA_ABS32_LO_23 | low 32 at bit 23 | Low half, 23-bit encoding |
| 14 | R_CUDA_ABS32_HI_23 | high 32 at bit 23 | High half, 23-bit encoding |
| 15 | R_CUDA_ABS24_26 | 24-bit at bit 26 | 24-bit absolute address |
| 16 | R_CUDA_ABS24_23 | 24-bit at bit 23 | 24-bit absolute, 23-bit encoding |
| 17 | R_CUDA_ABS16_26 | 16-bit at bit 26 | 16-bit absolute address |
| 18 | R_CUDA_ABS16_23 | 16-bit at bit 23 | 16-bit absolute, 23-bit encoding |
| 42 | R_CUDA_ABS32_20 | 32-bit at bit 20 | Volta+ encoding format |
| 43 | R_CUDA_ABS32_LO_20 | low 32 at bit 20 | Low half, 20-bit encoding |
| 44 | R_CUDA_ABS32_HI_20 | high 32 at bit 20 | High half, 20-bit encoding |
| 45 | R_CUDA_ABS24_20 | 24-bit at bit 20 | 24-bit, 20-bit encoding |
| 46 | R_CUDA_ABS16_20 | 16-bit at bit 20 | 16-bit, 20-bit encoding |
| 55 | R_CUDA_ABS32_32 | 32-bit at bit 32 | Ampere+ encoding format |
| 56 | R_CUDA_ABS32_LO_32 | low 32 at bit 32 | Low half, 32-bit position |
| 57 | R_CUDA_ABS32_HI_32 | high 32 at bit 32 | High half, 32-bit position |
| 58 | R_CUDA_ABS47_34 | 47-bit at bit 34 | 47-bit wide field |
| 59 | R_CUDA_ABS16_32 | 16-bit at bit 32 | 16-bit, 32-bit position |
| 60 | R_CUDA_ABS24_32 | 24-bit at bit 32 | 24-bit, 32-bit position |
| 74 | R_CUDA_ABS24_40 | 24-bit at bit 40 | 24-bit at offset 40 |
| 75 | R_CUDA_ABS55_16_34 | 55-bit, 16+34 split | Split wide field |
| 100 | R_CUDA_ABS20_44 | 20-bit at bit 44 | 20-bit at offset 44 |
| 114 | R_CUDA_ABS56_16_34 | 56-bit, 16+34 split | Split wide field |
| 70 | R_CUDA_32_LO | low 32 | Low half of 64-bit |
| 71 | R_CUDA_32_HI | high 32 | High half of 64-bit |
The naming convention encodes the bit-field geometry: R_CUDA_ABS<width>_<start_bit> indicates that <width> bits of the resolved address are patched into the instruction at bit position <start_bit>. The LO/HI suffix indicates low or high 32 bits of a 64-bit value. The different start positions (20, 23, 26, 32, 34, 40, 44) correspond to different SASS instruction encoding formats across SM generations: Kepler (26), Maxwell/Pascal (23), Volta/Turing (20), Ampere/Ada/Hopper (32).
Global Address Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 3 | R_CUDA_G32 | Global-space 32-bit address |
| 4 | R_CUDA_G64 | Global-space 64-bit address |
| 84 | R_CUDA_G8_0 | Global-space byte 0 of 64-bit instruction |
| 85 | R_CUDA_G8_8 | Global-space byte 1 |
| 86 | R_CUDA_G8_16 | Global-space byte 2 |
| 87 | R_CUDA_G8_24 | Global-space byte 3 |
| 88 | R_CUDA_G8_32 | Global-space byte 4 |
| 89 | R_CUDA_G8_40 | Global-space byte 5 |
| 90 | R_CUDA_G8_48 | Global-space byte 6 |
| 91 | R_CUDA_G8_56 | Global-space byte 7 |
Global address relocations target .nv.global and .nv.global.init sections. The G8_* sub-byte variants patch individual bytes within a 64-bit instruction word, used when the instruction encoding requires the address to be spread across non-contiguous bit fields.
PC-Relative Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 40 | R_CUDA_PCREL_IMM24_26 | PC-relative 24-bit immediate at bit 26 |
| 41 | R_CUDA_PCREL_IMM24_23 | PC-relative 24-bit immediate at bit 23 |
PC-relative relocations resolve branch and call targets. The resolver enforces a critical constraint:
"PC relative branch address should be in the same section"
This means intra-function branches use PC-relative relocations, but cross-function calls use absolute or function descriptor relocations. The 24-bit immediate provides a +/-8 MB range from the instruction address, sufficient for any single kernel.
Constant Field Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 24 | R_CUDA_CONST_FIELD19_28 | 19-bit constant bank offset at bit 28 |
| 25 | R_CUDA_CONST_FIELD19_23 | 19-bit constant bank offset at bit 23 |
| 36 | R_CUDA_CONST_FIELD21_26 | 21-bit constant bank offset at bit 26 |
| 38 | R_CUDA_CONST_FIELD19_26 | 19-bit constant bank offset at bit 26 |
| 39 | R_CUDA_CONST_FIELD21_23 | 21-bit constant bank offset at bit 23 |
| 50 | R_CUDA_CONST_FIELD19_20 | 19-bit constant bank offset at bit 20 |
| 54 | R_CUDA_CONST_FIELD21_20 | 21-bit constant bank offset at bit 20 |
| 64 | R_CUDA_CONST_FIELD19_40 | 19-bit constant bank offset at bit 40 |
| 66 | R_CUDA_CONST_FIELD21_38 | 21-bit constant bank offset at bit 38 |
| 115 | R_CUDA_CONST_FIELD22_37 | 22-bit constant bank offset at bit 37 |
Constant field relocations patch .nv.constant0.<func> bank offsets into load constant (LDC) instructions. The field width (19, 21, or 22 bits) determines the maximum addressable constant bank size: 19-bit supports 512 KB, 21-bit supports 2 MB, 22-bit supports 4 MB. During resolution, the constant bank deduplication pass (sub_1CA6890) may adjust the relocation offset:
"optimize ocg constant reloc offset from %lld to %lld"
Function Descriptor Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 31 | R_CUDA_FUNC_DESC32_23 | 32-bit function descriptor at bit 23 |
| 32 | R_CUDA_FUNC_DESC32_LO_23 | Low 32 of descriptor at bit 23 |
| 33 | R_CUDA_FUNC_DESC32_HI_23 | High 32 of descriptor at bit 23 |
| 34 | R_CUDA_FUNC_DESC_32 | Full 32-bit function descriptor |
| 35 | R_CUDA_FUNC_DESC_64 | Full 64-bit function descriptor |
| 47 | R_CUDA_FUNC_DESC32_20 | 32-bit function descriptor at bit 20 |
| 48 | R_CUDA_FUNC_DESC32_LO_20 | Low 32 of descriptor at bit 20 |
| 49 | R_CUDA_FUNC_DESC32_HI_20 | High 32 of descriptor at bit 20 |
| 61 | R_CUDA_FUNC_DESC32_32 | 32-bit function descriptor at bit 32 |
| 62 | R_CUDA_FUNC_DESC32_LO_32 | Low 32 of descriptor at bit 32 |
| 63 | R_CUDA_FUNC_DESC32_HI_32 | High 32 of descriptor at bit 32 |
| 92--99 | R_CUDA_FUNC_DESC_8_0 -- R_CUDA_FUNC_DESC_8_56 | Sub-byte function descriptor patches |
Function descriptors are used for indirect calls through function pointers. The descriptor contains the target function's entry point address and is loaded by the GPU's indirect call mechanism. The sub-byte FUNC_DESC_8_* variants patch individual bytes of the descriptor into instruction encoding slots, used in wide instruction formats where the descriptor address is spread across multiple fields. When the relocation creator detects a flags_hi value of 12--15 in the descriptor table entry, it calls sub_1CBD0D0 to register the call edge in the call graph.
Texture, Sampler, and Surface Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 6 | R_CUDA_TEX_HEADER_INDEX | Texture header table index |
| 7 | R_CUDA_SAMP_HEADER_INDEX | Sampler header table index |
| 8 | R_CUDA_SURF_HW_DESC | Surface hardware descriptor |
| 9 | R_CUDA_SURF_HW_SW_DESC | Surface hardware+software descriptor |
| 19 | R_CUDA_TEX_SLOT | Texture binding slot |
| 20 | R_CUDA_SAMP_SLOT | Sampler binding slot |
| 21 | R_CUDA_SURF_SLOT | Surface binding slot |
| 26 | R_CUDA_TEX_SLOT9_49 | 9-bit texture slot at bit 49 |
| 52 | R_CUDA_SURF_HEADER_INDEX | Surface header table index |
| 101 | R_CUDA_SAMP_HEADER_INDEX_0 | Sampler header index variant |
These relocations connect texture/sampler/surface operations to their runtime-allocated descriptor table entries. The CUDA driver fills in the actual descriptor indices at launch time based on the kernel's resource binding.
Bindless Texture/Surface Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 22 | R_CUDA_TEX_BINDLESSOFF13_32 | Bindless texture offset, 13-bit at bit 32 |
| 23 | R_CUDA_TEX_BINDLESSOFF13_47 | Bindless texture offset, 13-bit at bit 47 |
| 29 | R_CUDA_TEX_BINDLESSOFF13_41 | Bindless texture offset, 13-bit at bit 41 |
| 30 | R_CUDA_TEX_BINDLESSOFF13_45 | Bindless texture offset, 13-bit at bit 45 |
| 51 | R_CUDA_BINDLESSOFF13_36 | Bindless offset, 13-bit at bit 36 |
| 65 | R_CUDA_BINDLESSOFF14_40 | Bindless offset, 14-bit at bit 40 |
Bindless texture/surface relocations are handled by sub_1CAB300, which creates $NVLINKBINDLESSOFF_<name> symbols for each bindless reference. During resolution:
"change reloc symbol from %d to %d"
"no bindless ref in section %s"
"unexpected usage of non-unified surface descriptors"
Sub-Byte Patch Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 76--83 | R_CUDA_8_0 -- R_CUDA_8_56 | Patch byte 0--7 of 64-bit instruction |
These relocations patch a single byte at a specific 8-bit-aligned position within a 64-bit instruction word. They are used when the resolved value must be inserted into a non-standard bit position that does not align with the instruction encoding's immediate field boundaries.
Miscellaneous Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 27 | R_CUDA_6_31 | 6-bit field at bit 31 |
| 28 | R_CUDA_2_47 | 2-bit field at bit 47 |
| 37 | R_CUDA_QUERY_DESC21_37 | Query descriptor, 21-bit at bit 37 |
| 53 | R_CUDA_INSTRUCTION64 | Whole 64-bit instruction replacement |
| 67 | R_CUDA_INSTRUCTION128 | Whole 128-bit instruction replacement |
| 68 | R_CUDA_YIELD_OPCODE9_0 | YIELD opcode, 9-bit at bit 0 |
| 69 | R_CUDA_YIELD_CLEAR_PRED4_87 | Clear YIELD predicate, 4-bit at bit 87 |
| 72 | R_CUDA_UNUSED_CLEAR32 | Zero out 32-bit unused field |
| 73 | R_CUDA_UNUSED_CLEAR64 | Zero out 64-bit unused field |
| 116 | R_CUDA_NONE_LAST | Sentinel marking end of relocation table |
The R_CUDA_INSTRUCTION64 and R_CUDA_INSTRUCTION128 types replace entire instruction words, used for instruction-level patching by the linker when the instruction encoding changes based on the final resolved address.
The R_CUDA_YIELD_* types handle YIELD-to-NOP conversion. When a kernel has forward-progress requirements that prevent yielding, the resolver converts YIELD instructions to NOPs:
"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."
The R_CUDA_UNUSED_CLEAR* types zero out instruction fields that are unused in the final encoding, ensuring deterministic output.
Unified Address Space Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 102 | R_CUDA_UNIFIED | Unified address (generic pointer) |
| 103 | R_CUDA_UNIFIED_32 | 32-bit unified address |
| 104--111 | R_CUDA_UNIFIED_8_0 -- R_CUDA_UNIFIED_8_56 | Unified address sub-byte patches |
| 112 | R_CUDA_UNIFIED32_LO_32 | Low 32 of unified at bit 32 |
| 113 | R_CUDA_UNIFIED32_HI_32 | High 32 of unified at bit 32 |
Unified address relocations resolve generic pointers that can point to global, shared, or constant memory. During final resolution, the resolver performs a type conversion from unified (type 103) to absolute (type 1):
// In sub_1CD48C0: unified reloc replacement
if (reloc_type == 103) // R_CUDA_UNIFIED_32
reloc_type = 1; // R_CUDA_32
R_MERCURY_* Relocation Types
65 types from R_MERCURY_NONE (ordinal 0) to R_MERCURY_NONE_LAST (ordinal 64). String addresses span 0x23FB8C5--0x23FBDFA. Mercury relocations serve the same purpose as R_CUDA types but are designed for the Mercury intermediate representation used on SM 100+ targets.
Mercury Type Categories
| Category | Types | Purpose |
|---|---|---|
| Address | R_MERCURY_G64, R_MERCURY_ABS64, R_MERCURY_ABS32, R_MERCURY_ABS16 | Memory addresses |
| Split address | R_MERCURY_ABS32_LO, R_MERCURY_ABS32_HI | 64-bit address halves |
| Program-relative | R_MERCURY_PROG_REL64, R_MERCURY_PROG_REL32, R_MERCURY_PROG_REL32_LO, R_MERCURY_PROG_REL32_HI | Offsets from program base |
| Tex/samp/surf | R_MERCURY_TEX_HEADER_INDEX, R_MERCURY_SAMP_HEADER_INDEX, R_MERCURY_SURF_HEADER_INDEX | Resource descriptors |
| Function | R_MERCURY_FUNC_DESC_64 | Function descriptor |
| Sub-byte | R_MERCURY_8_0 -- R_MERCURY_8_56 (8 types) | Byte-level patches |
| Global sub-byte | R_MERCURY_G8_0 -- R_MERCURY_G8_56 (8 types) | Global-space byte patches |
| Func desc sub-byte | R_MERCURY_FUNC_DESC_8_0 -- R_MERCURY_FUNC_DESC_8_56 (8 types) | Function descriptor byte patches |
| Abs-program-relative | R_MERCURY_ABS_PROG_REL32_LO, R_MERCURY_ABS_PROG_REL32_HI, R_MERCURY_ABS_PROG_REL32, R_MERCURY_ABS_PROG_REL64 | Absolute program-relative |
| Program-relative sub-byte | R_MERCURY_PROG_REL8_0 -- R_MERCURY_PROG_REL8_56 (8 types) | Program-relative byte patches |
| Unified | R_MERCURY_UNIFIED, R_MERCURY_UNIFIED_32, R_MERCURY_UNIFIED_8_0 -- R_MERCURY_UNIFIED_8_56, R_MERCURY_UNIFIED32_LO, R_MERCURY_UNIFIED32_HI | Unified address space |
| Cleanup | R_MERCURY_UNUSED_CLEAR64 | Zero out unused fields |
| Sentinels | R_MERCURY_NONE, R_MERCURY_NONE_LAST | Table boundaries |
Mercury introduces program-relative relocations (PROG_REL*) that do not exist in the R_CUDA set. These compute offsets relative to the program base address rather than absolute virtual addresses, enabling position-independent code for the Mercury deferred finalization model. The Mercury finalizer (running at link or load time) resolves these program-relative relocations after the final code layout is known.
Relocation Encoding
ELF Relocation Entry Format
Cubin relocations use standard Elf64_Rela entries in .rela.text.<funcname> sections:
typedef struct {
Elf64_Addr r_offset; // Byte offset within the section
Elf64_Xword r_info; // Symbol index (high 32) | Type (low 32)
Elf64_Sxword r_addend; // Addend for the relocation computation
} Elf64_Rela; // 24 bytes
The r_info field packs the symbol table index in the upper 32 bits and the R_CUDA/R_MERCURY type code in the lower 32 bits:
#define ELF64_R_SYM(info) ((info) >> 32)
#define ELF64_R_TYPE(info) ((info) & 0xFFFFFFFF)
For Mercury types, the type code stored in r_info is the ordinal plus 0x10000. The resolver subtracts 0x10000 before indexing the R_MERCURY descriptor table.
Internal Relocation Entry
The ELFW object maintains relocations in an internal linked list at offset +376 of the ELFW structure. Each internal entry is a 32-byte node:
// Internal relocation entry (reconstructed from sub_1CD4510, sub_1CD46B0)
struct elfw_reloc {
uint64_t offset; // +0: byte offset in target section
uint64_t type_and_section; // +8: (target_section << 32) | reloc_type
uint64_t addend; // +16: relocation addend
uint32_t symbol_index; // +24: index into ELFW symbol table
uint32_t alias_index; // +28: original symbol if aliased, else 0
};
The type_and_section field encodes both the relocation type code (low 32 bits) and the target section index (high 32 bits) in a single 64-bit field.
Resolved Relocation Output
Resolved relocations are written by sub_1CD5920 to .nv.resolvedrela sections. Additionally, .nv.rel.action sections carry relocation action metadata for the CUDA driver's runtime linker.
Symbol Table Structure
.symtab Format
The symbol table uses standard Elf64_Sym entries (24 bytes each for 64-bit, 16 bytes for 32-bit):
typedef struct {
Elf32_Word st_name; // String table offset
unsigned char st_info; // Type (low 4 bits) | Binding (high 4 bits)
unsigned char st_other; // Visibility (low 2 bits) | Flags
Elf16_Half st_shndx; // Section index (or SHN_XINDEX=0xFFFF)
Elf64_Addr st_value; // Symbol value (section offset)
Elf64_Xword st_size; // Symbol size
} Elf64_Sym;
Internal Symbol Representation
The ELFW maintains an internal symbol structure (40+ bytes) with additional metadata:
| Offset | Size | Field | Description |
|---|---|---|---|
| +4 | 1 | st_info | Low nibble = type (STT_*), high nibble = binding strength |
| +5 | 1 | st_other | Bits 0-1 = visibility, bits 4-7 = CUDA-specific flags |
| +6 | 2 | st_shndx | Section index (0xFFFF = use extended index) |
| +8 | 8 | st_value | Symbol address; -1 = unallocated |
| +24 | 4 | section_link | Internal section reference |
| +28 | 4 | extra_index | Secondary symbol link |
| +32 | 8 | name_ptr | Pointer to symbol name string |
Symbol Types
| ELF Type | Value | CUDA Usage |
|---|---|---|
STT_NOTYPE | 0 | Undefined/external symbols |
STT_OBJECT | 1 | Global/constant/shared variables |
STT_FUNC | 2 | Kernel entry points, device functions |
STT_SECTION | 3 | Section symbols (one per section) |
STT_COMMON | 5 | Common symbols (.common symbol) |
STT_CUDA_TEXTURE | 10 | Texture reference symbols |
STT_CUDA_SURFACE | 11 | Surface reference symbols |
STT_CUDA_SAMPLER | 12 | Sampler reference symbols |
STT_CUDA_FUNC_DESC | 13 | Function descriptor (indirect call target) |
The internal type field at offset +4 uses the low nibble for ELF standard types and the high nibble for binding/scope information. The resolver checks st_info & 0xF throughout its processing.
Function descriptor symbols (type 13) receive special handling in the relocation resolver. When the resolver encounters a type-13 symbol, it checks whether the symbol is allocated:
// sub_1CD48C0: function descriptor symbol handling
if ((sym->st_info & 0xF) == 13) { // STT_CUDA_FUNC_DESC
shndx = get_section_index(elfw, sym);
if (shndx == 0) {
// Unresolved -- check binding and ELFW flags
if ((sym->st_other & 0xE0) == 0x20 // STB_GLOBAL
|| (sym->st_other & 0x10)) // CUDA-specific extern flag
{
// External function descriptor: keep relocation for linker
}
}
}
Symbol Binding and Visibility
The st_other byte encodes both ELF visibility (bits 0-1) and CUDA-specific binding flags (bits 4-7):
| Bits | Field | Values |
|---|---|---|
| 0-1 | ELF visibility | 0 = STV_DEFAULT, 1 = STV_INTERNAL, 2 = STV_HIDDEN, 3 = STV_PROTECTED |
| 4 | Extern flag | 1 = external linkage (for nvlink) |
| 5-6 | Binding strength | 0x20 = STB_GLOBAL, 0x80 = STB_WEAK |
| 7 | Reserved | Used internally during resolution |
The binding byte at st_other & 3 (low 2 bits of the high nibble) maps to:
| Value | Meaning | Resolution |
|---|---|---|
| 1 | STB_LOCAL / dead | Skip relocation ("ignore reloc on dead func %s") |
| 2 | STB_GLOBAL | Normal resolution |
| 3 | STB_WEAK | Resolve if available, otherwise use default |
Symbol Table Builder -- sub_1CB68D0
The symbol table builder (9,578 bytes, approximately 1,700 decompiled lines) processes the ELFW internal symbol list in these steps:
- Iterate symbols -- walks the symbol list from the ELFW object
- Filter deleted symbols -- 12 separate checks for
"reference to deleted symbol"guard against stale entries from dead code elimination - Handle
__cuda_syscall-- special-cases the device-side syscall dispatcher symbol - Resolve aliases -- follows alias chains to find the canonical symbol
- Compute values -- resolves
st_valuefrom section base + offset - Create section symbols -- ensures every section has an
STT_SECTIONsymbol; emits"found multiple section symbols for %s"if duplicates exist - Handle SHN_XINDEX overflow -- when section index >=
SHN_LORESERVE(0xFF00 = 65,280), setsst_shndx = SHN_XINDEX(0xFFFF) and stores the real index in.symtab_shndx - Build .symtab_shndx -- populates the extended index table for overflow sections
Error strings observed in the builder:
| String | Condition |
|---|---|
"reference to deleted symbol" | Symbol marked deleted but still referenced (12 checks) |
"ignore symbol %s in unused section" | Symbol in eliminated section |
"ignore symbol string %s for sym %d" | Skipping symbol name for unnamed/internal symbol |
"found multiple section symbols for %s" | Duplicate STT_SECTION entries |
"symbol already assigned" | Duplicate assignment attempt |
"adding global symbols of same name" | Name collision |
"alias to unknown symbol" | Alias target not found |
"unallocated symbol" | Symbol value is -1 (never assigned an address) |
"missing sec strtab" | String table not initialized |
Symbol Fixup -- sub_1CB2CA0
After dead code elimination removes sections, symbol indices become stale. The fixup pass (2,038 bytes, called from 4 sites) renumbers all symbol st_shndx values:
- For each section in the ELFW:
- If the section lacks an
STT_SECTIONsymbol, create one - If the section has multiple
STT_SECTIONsymbols, warn
- If the section lacks an
- Walk the symbol table and remap
st_shndxvalues through the section index mapping
The fixup runs at multiple pipeline points: after dead function elimination, after Mercury section cloning, and after any section deletion.
Section Index Remap -- sub_1C99BB0
The companion to sub_1CB2CA0 for the extended index mechanism. When section indices change, this function updates both .symtab_shndx and .nv.merc.symtab_shndx to keep the extended index tables consistent.
Relocation Resolution Algorithm
The master resolver sub_1CD48C0 implements a 7-step algorithm that processes every relocation entry in the ELFW's linked list:
Step 1: Symbol Address Computation
For each relocation entry, compute the symbol's resolved address by adding the symbol's st_value (from the section base) to the relocation offset:
if (reloc->alias_index) {
sym = lookup_symbol(elfw, reloc->alias_index);
reloc->offset += sym->st_value;
}
For Mercury cubins (64-bit ELF class 'A' with Mercury flag set), the resolver applies an additional address transformation that accounts for the Mercury instruction stride:
if (is_mercury && sym_value != 0) {
int stride = 2 * arch_vtable->get_merc_stride();
reloc->offset += stride * (sym_value >> 7);
}
Step 2: Alias Resolution
If the relocation targets an alias symbol (ELF type STT_NOTYPE with section index pointing to another symbol), redirect the relocation to the canonical target:
"change alias reloc %s to %s"
The resolver follows the alias chain through sub_1CB1E00 (get section index) and sub_1CB3D20 (get section by index), replacing the alias with its real target.
Step 3: Dead Function Filtering
If the relocation's target symbol has local binding (st_other & 3 == 1) and is in a deleted section, the relocation is zeroed out:
"ignore reloc on dead func %s"
The relocation's type is set to 0 (R_CUDA_NONE), effectively removing it. For the output mode != 2 (relocatable), dead relocations on STT_NOTYPE symbols with a binding prefix of 2 are also removed.
Step 4: UFT/UDT Pseudo-Relocation Handling
Relocations targeting special synthetic symbols are intercepted:
| Symbol | Action |
|---|---|
__UFT_OFFSET | Record for UFT slot assignment, zero the relocation |
__UFT_CANONICAL | Map to canonical UFT entry |
__UDT_OFFSET | Record for UDT slot assignment |
__UDT_CANONICAL | Map to canonical UDT entry |
__UFT, __UFT_END | UFT boundary markers |
__UDT, __UDT_END | UDT boundary markers |
The resolver checks if a symbol name starts with "__UFT_OFFSET" (exact 13-character comparison in the decompiled code). If matched:
"ignore reloc on UFT_OFFSET"
The relocation entry is then processed by the UFT manager (sub_1CD22E0) which maps UUIDs to UFT slot indices.
Step 5: PC-Relative Branch Validation
For relocations whose descriptor table entry has *(table + 8*index + 5) == 16 (indicating a PC-relative type), the resolver validates that the source and target sections are identical:
if (reloc_desc->patch_mode == 16 && reloc->section != target_section)
fatal("PC relative branch address should be in the same section");
Step 6: YIELD-to-NOP Conversion
If the relocation type is R_CUDA_YIELD_OPCODE9_0 or R_CUDA_YIELD_CLEAR_PRED4_87, and the kernel has forward-progress requirements, the resolver skips the NOP conversion:
"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."
Step 7: Bit-Field Patching
The final step delegates to sub_1CD34E0, the bit-field patcher. This function uses the relocation descriptor table entry's parameters to extract the current field value (via sub_1CD33F0), add the resolved address, and write the result back (via sub_1CD3330):
// sub_1CD34E0 -- bit-field patching (simplified)
bool apply_reloc(reloc_desc_table, index, is_addend, instruction_data,
symbol_value, reloc_offset, sym_addr, sym_shndx,
section_type_offset, old_value_out) {
entry = &reloc_desc_table[index * 64 + 12]; // Start at byte 12
end = &reloc_desc_table[index * 64 + 60]; // 4 operations max
while (entry < end) {
uint32_t bit_start = entry[0];
uint32_t bit_width = entry[1];
uint32_t mode = entry[2];
switch (mode) {
case 0: // NOP
break;
case 1: // Direct write: place value at [bit_start, bit_start+bit_width)
case 0x12: case 0x2E:
old = extract_bits(instruction_data, bit_start, bit_width);
insert_bits(instruction_data, resolved_value, bit_start, bit_width);
break;
case 6: // Split low-word write (cross-qword boundary handling)
case 0x37:
// Write low portion, advance to next qword if needed
break;
case 7: // Split high-word write
case 0x38:
// Write HIDWORD of value
break;
}
entry += 4; // Next 16-byte operation
}
return true;
}
If the NVRS (NVIDIA Register Spill) check fails during patching, the resolver emits:
"unexpected NVRS"
NVRS relocations are special-purpose relocations for register spill slot references. When the bit-field patcher returns false, the relocation is invalid for the current context.
Post-Resolution
Successfully resolved relocations are either:
- Removed from the linked list (the relocation was fully applied to the instruction bytes)
- Kept for the output
.nv.resolvedrelasection (the relocation needs runtime resolution by the CUDA driver)
The relocation writer sub_1CD5920 validates every remaining relocation before serializing it:
| Check | Error |
|---|---|
| Symbol value == -1 | "symbol never allocated" |
| Offset >= section size | "relocation is past end of offset" |
| Target section unallocated | "rela section never allocated" |
| Address not found in section data | "reloc address not found" |
Unified Function Table (UFT) and Unified Data Table (UDT)
Purpose
UFT and UDT support indirect function calls and generic data references across compilation units. When nvcc compiles a program using function pointers, virtual functions, or __device__ function addresses taken in host code, the compiler generates UFT/UDT entries that the runtime linker resolves at load time.
Sections
| Section | Purpose |
|---|---|
.nv.uft | Jump slot table (one slot per indirect-callable function) |
.nv.uft.entry | UFT entry metadata (UUID, offset pairs) |
.nv.udt | Data slot table (one slot per externally-referenced data object) |
.nv.udt.entry | UDT entry metadata |
.nv.uft.rel | UFT relocation table |
UFT Entry Structure
Each UFT entry contains a 128-bit UUID and a 64-bit offset:
struct uft_entry {
uint64_t uuid_lo; // Low 64 bits of UUID
uint64_t uuid_hi; // High 64 bits of UUID
uint64_t offset; // Offset into the jump slot table
}; // 24 bytes per entry
UFT Manager -- sub_1CD22E0
The UFT manager (1,979 bytes, 10 KB decompiled) processes UFT/UDT entries across all compilation units:
- Build UID-to-key map -- hashes
uuid_lo ^ uuid_hias the lookup key - Detect conflicts -- reports
"uft map conflict: 0x%llx"when two entries hash to the same key - Detect duplicates -- reports
"duplicate ids in uft.entry"when identical UUIDs appear - Reorder entries --
"Re-ordering UFT entries"/"Re-ordering UDT entries"sorts entries for deterministic output - Match UUIDs -- cross-references UUIDs against the existing UFT for linking:
"matching uuid not found"if a referenced UUID does not exist - Align UDT --
"udt size %lld needs aligning"pads UDT entries to required alignment
UFT Slot Validator -- sub_1CD2AA0
Validates consistency between .nv.uft (jump slots) and .nv.uft.entry (metadata):
"missing nv.uft.entry"
"Number of .nv.uft jump slots != Number of entries in .nv.uft.entry"
"size of uidx window != nv.uft"
Synthetic Symbols
The resolver recognizes these synthetic symbol names:
| Symbol | Purpose |
|---|---|
__UFT_OFFSET | Points to a UFT jump slot |
__UFT_CANONICAL | Canonical UUID entry for a UFT slot |
__UDT_OFFSET | Points to a UDT data slot |
__UDT_CANONICAL | Canonical UUID entry for a UDT slot |
__UFT / __UFT_END | UFT table start/end boundaries |
__UDT / __UDT_END | UDT table start/end boundaries |
$NVLINKBINDLESSOFF_<name> | Bindless texture/surface offset symbol |
__cuda_syscall | Device-side syscall dispatcher |
Extern Shared Memory Relocations
Extern shared memory variables (declared with extern __shared__) are handled specially because their addresses are not known until kernel launch. The resolver tracks these through dedicated strings:
"extern shared variable %s at offset %lld"
"reloc of extern shared %d replaced with symbol %d"
"new extern shared instance %d"
Multiple kernels may reference the same extern shared variable. The linker creates separate instances when necessary and patches the relocation to point to the correct instance.
Weak Symbol Handling
When nvlink encounters a weak symbol that conflicts with a strong definition:
"Could not replace weak symbol '%s'"
This occurs during the relocation pre-scan (sub_1CD43A0) when processing relocations that reference weak symbols. The pre-scan walks all relocations and checks the symbol binding at sym->st_other & 0xE0:
0x80= weak: eligible for replacement by a strong definition0x20= global: normal binding
Linking Model
Relocatable Object Mode (-c)
When ptxas produces a relocatable object (.o), all relocations are preserved in .rela.text.<func> sections. The call graph is written to .nv.callgraph. Symbols retain their binding information for nvlink to resolve.
"No relocatable objects found. Did not generate callgraph."
"Generate relocatable object"
The --preserve-relocs flag additionally preserves relocations that would normally be resolved internally:
"This option will make PTXAS to generate relocatable references for variables and preserve ..."
Executable Mode (default)
In the default mode, ptxas resolves all internal relocations and writes .nv.resolvedrela for any relocations that require runtime resolution. External references and function descriptors for indirect calls are preserved as unresolved relocations for the CUDA driver's runtime linker.
PIC Mode
Position-independent code mode (IsPIC flag) changes the relocation encoding. The ELF flags word at ELFW offset +48 encodes this mode. PIC cubins use additional program-relative relocations and avoid absolute addresses where possible.
Cross-References
- Custom ELF Emitter -- ELFW object, header construction, file serialization
- Section Catalog & EIATTR -- complete section type inventory, EIATTR encoding
- Debug Information -- DWARF section generation
- Pipeline Overview -- where relocation resolution fits in the 11-phase pipeline
- Capsule Mercury -- Mercury-specific relocation handling
Function Map
| Address | Size (binary) | Decompiled | Callers | Callees | Purpose |
|---|---|---|---|---|---|
sub_1CD48C0 | 4,184 B | 22 KB | 1 | 17 | Master relocation resolver (7-step algorithm) |
sub_1CD5920 | 1,985 B | 11 KB | 1 | 9 | Relocation writer (.nv.resolvedrela) |
sub_1CD4510 | ~860 B | 4 KB | -- | -- | Relocation creator (SASS) |
sub_1CD46B0 | ~540 B | 4 KB | -- | -- | Relocation creator (Mercury) |
sub_1CD43A0 | ~560 B | 3 KB | -- | -- | Relocation pre-scan (weak/extern) |
sub_1CD34E0 | 3,700 B | 17 KB | 1 | 2 | Bit-field patcher (sub_1CD33F0 extract, sub_1CD3330 insert) |
sub_1CD33F0 | ~300 B | 2 KB | 7 | 1 | Extract bits from instruction word |
sub_1CD3330 | ~200 B | 1 KB | 5 | 0 | Insert bits into instruction word |
sub_1CD22E0 | 1,979 B | 10 KB | 2 | 20 | UFT manager (UUID-to-slot mapping) |
sub_1CD2AA0 | ~800 B | 3 KB | -- | -- | UFT slot validator |
sub_1CB68D0 | 9,578 B | 49 KB | 1 | 36 | Symbol table builder (.symtab) |
sub_1CB2CA0 | 2,038 B | 8 KB | 4 | 11 | Symbol fixup (post-deletion renumbering) |
sub_1C99BB0 | 4,900 B | 25 KB | 1 | 18 | Section index remap (.symtab_shndx) |
sub_1CB64A0 | ~500 B | 2 KB | -- | -- | Symbol resolver (checks .nv.* special names) |
sub_1CAB300 | 2,157 B | 12 KB | 1 | 19 | Bindless texture/surface handler |
sub_1CA6890 | 2,286 B | 15 KB | 2 | 11 | Constant bank deduplication |
sub_1CBD0D0 | -- | -- | -- | -- | Call graph edge registration |