Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Relocation Phase

The relocation phase is the sixth stage of nvlink's linking pipeline, invoked from main() after the layout phase has assigned addresses to all sections and symbols. Its job is to walk every pending relocation in the output ELF, resolve the target symbol, compute the final value, and either patch the value into the instruction stream or (when --preserve-relocs is active) emit the relocation into a .nv.resolvedrela section for the runtime loader. The primary entry point is sub_469D60 (apply_relocations, 26,578 bytes, 985 decompiled lines), called from main() as the "relocate" phase in the timing checkpoint sequence init -> read -> merge -> layout -> relocate -> finalize -> write.

Unlike a conventional ELF linker that iterates .rela.* sections and patches target bytes, nvlink maintains relocations in a singly-linked list rooted at offset +376 in the linker context object. Each node in this list holds a pointer to a 32-byte relocation record. The function walks this list linearly, resolving each relocation through a multi-stage pipeline: addend computation, symbol lookup, alias chain resolution, dead function filtering, unified table remapping, descriptor table dispatch, and finally bit-field patching of the instruction word via the application engine sub_468760.

Primary functionsub_469D60 at 0x469D60 (26,578 bytes)
Application enginesub_468760 at 0x468760 (14,322 bytes)
Resolved-rela emittersub_46ADC0 at 0x46ADC0 (11,515 bytes)
Relocation vtablesub_459640 at 0x459640 (16,109 bytes, used by finalization)
Called frommain() at 0x409800, between layout and finalization
Timing label"relocate"
Key globalsoff_1D3CBE0 (Mercury descriptor table), off_1D3DBE0 (CUDA descriptor table)
CLI option--preserve-relocs (byte at 2A5F2CE)

Pipeline Position

Layout Phase (sub_439830)
   |
   v
Relocation Phase (sub_469D60)  <-- this page
   |
   v
Finalization Phase (sub_445000)

main() calls sub_469D60 with two arguments: the linker context pointer (a1) and a mutex attribute pointer (a2). The linker context carries all state needed for relocation resolution: the relocation linked list at offset +376, the preserve-relocs linked list at offset +384, architecture flags, the output ELF wrapper, and symbol/section accessors.

Relocation Linked List

All relocations pending resolution are stored in a singly-linked list. Each node is a pair:

struct reloc_node {
    reloc_node*   next;         // offset +0: pointer to next node (NULL = end)
    reloc_record* reloc;        // offset +8: pointer to the 32-byte relocation record
};

The relocation record itself is stored as two SSE-width (128-bit) values, loaded via _mm_loadu_si128:

struct reloc_record {           // 32 bytes total, accessed as two __m128i
    int64_t  addend;            // [0:8]   target addend / offset value
    int64_t  reloc_info;        // [8:16]  low 32 bits = relocation type,
                                //         high 32 bits = symbol index
    uint32_t section_idx;       // [16:20] target section index in output ELF
    uint32_t sym_addend_idx;    // [20:24] symbol index for addend resolution
    int64_t  extra;             // [24:32] extra data / secondary offset
};

The walk is a simple while (v4 != NULL) loop that reads v4[0] (next pointer) and v4[1] (relocation record pointer) at each step. Nodes are removed from the list in-place when the relocation is fully applied: the predecessor's next pointer is redirected to skip the consumed node, and both the node and its record are freed via sub_431000 (arena_free).

Resolution Algorithm

For each relocation record, sub_469D60 executes the following steps:

Step 1: Addend Resolution

sym_addend_idx = reloc->sym_addend_idx;   // field at byte 20
if (sym_addend_idx != 0) {
    symbol = sub_440590(ctx, sym_addend_idx);  // resolve symbol by index
    reloc->addend += *(int64_t*)(symbol + 8);  // add symbol value to addend
}

sub_440590 is the symbol-index-to-record accessor. It returns a pointer to the symbol record, whose field at offset +8 is the resolved symbol value (address). This handles relocations that reference a symbol plus a constant addend -- the standard S + A pattern.

Step 2: Architecture-Dependent Descriptor Table Selection

The relocation type (low 32 bits of reloc_info) is used to index into one of two descriptor tables depending on the target architecture:

uint32_t reloc_type = reloc->reloc_info & 0xFFFFFFFF;
uint32_t flags_mask;

if (ctx->elf_class == 'A')      // byte at ctx+7, 'A' = 0x41
    flags_mask = 1;
else
    flags_mask = 0x80000000;

if (flags_mask & *(uint32_t*)(ctx + 48)) {
    // Mercury (SM100+) path
    descriptor_table = &off_1D3CBE0;
    if (reloc_type != 0)
        adjusted_type = reloc_type - 0x10000;
    else
        adjusted_type = 0;
} else {
    // CUDA (pre-Mercury) path
    descriptor_table = &off_1D3DBE0;
    adjusted_type = reloc_type;
}

The two global tables off_1D3CBE0 and off_1D3DBE0 are arrays of relocation descriptors. Each descriptor is 64 bytes (16 x 4-byte fields), containing bit-field specifications that tell the application engine which bits of the instruction word to patch. Mercury relocations use type codes offset by 0x10000 from CUDA relocations -- the linker subtracts 0x10000 to normalize the index into the Mercury descriptor table.

The relocation type 0 is a sentinel meaning "no relocation" or "already resolved." If reloc_type == 0 and reloc_type is non-zero after normalization, the error "unexpected reloc" is emitted via sub_467460.

Step 3: Symbol Resolution and Section Lookup

target_symbol = sub_440590(ctx, HIDWORD(reloc_info));  // high 32 bits = sym index
sym_section   = sub_440350(ctx, target_symbol);         // get symbol's section index
section_rec   = sub_442270(ctx, section_idx);           // section idx -> record
parent_sec    = sub_442270(ctx, section_rec->parent);   // section's parent section

sub_440350 returns the section index that contains the target symbol. sub_442270 converts a section index to its section record pointer. The parent section (at offset +44 in the section record) is used to locate the actual data buffer where the relocation will be applied.

Step 4: Special Section Handling

If the link type (ctx + 16) is not 1 (non-relocatable link) and the parent section has a magic identifier 0x6FFFFF0E (1879048206) at offset +4, the relocation targets a "computed goto" or descriptor section. In this case, the addend is replaced with a value from sub_463660 (the unified table offset resolver):

if (link_type != 1 && parent_section->magic == 0x6FFFFF0E) {
    uft_entry = sub_463660(ctx, target_symbol);
    reloc->addend = *(int64_t*)(uft_entry + 8);
    if (ctx->compilation_mode == 2) {   // ctx+104
        if (reloc->addend != 0) {
            slot_size = 2 * (*(fn_ptr*)(ctx->vtable + 624))();
            reloc->addend += slot_size * (reloc->addend >> 7);
        }
    }
}

This handles Unified Function Table (UFT) and Unified Descriptor Table (UDT) relocations, where the addend encodes a table slot index that must be multiplied by the per-slot size.

Step 5: Unified Relocation Remapping

For relocatable links (link_type == 1), unified relocation types are remapped to their base equivalents. The decompiled code contains a large switch-case that maps unified relocation types to standard ones:

Unified typeRemapped toNotes
1022Base absolute relocation
1031fprintf: "replace unified reloc %d with %d\n", 103, 1
10476
10577
10678
10779
10880
10981
11082
11183
11256
11357
6558665538Mercury equivalents (type - 0x10000 base)
6558765539
6558865552
6558965553
6559065554
6559165555
6559265556
6559365557
6559465558
6559865541
6559965542
6559565559

For types not in this table, the function checks whether the target symbol name matches one of the unified table synthetic symbols: __UFT_OFFSET, __UFT_CANONICAL, __UDT_OFFSET, __UDT_CANONICAL, __UDT, __UFT, __UFT_END, __UDT_END. If any matches, the relocation type is set to 0 (resolved) with the verbose trace "replace unified reloc %d with %d\n", old_type, 0.

Step 6: Alias Chain Resolution

After the symbol is resolved, the function checks if the target symbol is a weak alias (symbol type STT_FUNC = 2, at byte +4 low nibble) with an unresolved value (offset +8 is zero). If so, it follows the alias chain:

if (sym_section_idx != 0 && (symbol->st_info & 0xF) == STT_FUNC) {
    if (symbol->st_value == 0) {
        // Follow alias: look up the canonical symbol
        new_sym_idx = sub_440350(ctx, symbol);
        canonical = sub_442270(ctx, new_sym_idx);
        canonical_section_idx = canonical->parent & 0x00FFFFFF;
        if (canonical_section_idx != reloc->sym_hi && canonical->magic != 0x6FFFFF0E) {
            old_name = symbol->name;
            symbol = sub_440590(ctx, canonical_section_idx);
            if (ctx->verbose_flags & 4)
                fprintf(stderr, "change alias reloc %s to %s\n", old_name, symbol->name);
            reloc->reloc_info = ((uint64_t)canonical_section_idx << 32) | reloc_type;
        }
    }
}

The verbose trace "change alias reloc %s to %s\n" is emitted when debug verbosity bit 2 is set in the flags at ctx+64. This alias chain walk is crucial for weak function resolution -- when multiple translation units define the same weak symbol, the merge phase picks one canonical definition, and all other references must be redirected.

Step 7: Dead Function Filtering

If the target symbol is marked as dead (symbol type STT_FUNC with binding STB_LOCAL = 1, at byte +5 bits 0-1):

if ((symbol->st_info & 0xF) == STT_FUNC && (symbol->st_bind & 3) == STB_LOCAL) {
    if (ctx->verbose_flags & 4)
        fprintf(stderr, "ignore reloc on dead func %s\n", symbol->name);
    reloc->reloc_info = 0;   // zero out type and symbol
    adjusted_type = 0;
    reloc_type = 0;
}

Dead functions are those eliminated by the dead code elimination pass. Their relocations are silently dropped. The verbose trace "ignore reloc on dead func %s\n" helps track which relocations were discarded.

Step 8: Special Relocation Handling

Several special cases are handled before the general application:

Common undefined symbols (section index SHN_COMMON = 0xFFF2 or related): If the section type at offset +4 matches 0x6FFFFF07 (1879048199), 0x6FFFFF08 (1879048200), or 0x6FFFFF12 (1879048210), the relocation is deferred to the finalization phase. The node is simply advanced past without removal.

YIELD-to-NOP suppression (relocation types 68-69): When the forward-progress-required flag (ctx+94) is set, YIELD instructions are not converted to NOP, and the relocation is handled specially:

"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."

PC-relative branch validation: When the descriptor table entry at index +5 (descriptor mode) equals 16 (PC-relative), the function validates that the relocation target and source are in the same section:

"PC relative branch address should be in the same section"

UFT_OFFSET ignoring: If the target symbol is named __UFT_OFFSET and the linker's UDT mode (ctx+240) is zero, the relocation is dropped:

"ignore reloc on UFT_OFFSET"

Step 9: Data Buffer Location

Before calling the application engine, the function must locate the exact byte position in the output section's data buffer where the relocation applies. Section data is stored in a linked list of data chunks (at section record offset +72), each chunk containing a base address and length:

chunk_list = *(chunk_node**)(section_record + 72);
target_offset = reloc->addend;

while (chunk_list) {
    chunk_data = chunk_list->data;       // chunk_list[1]
    chunk_base = chunk_data->base;       // chunk_data[1]
    if (target_offset >= chunk_base) {
        delta = target_offset - chunk_base;
        if (delta < chunk_data->size)    // chunk_data[3]
            break;
    }
    chunk_list = chunk_list->next;       // chunk_list[0]
}

if (!chunk_list)
    error("reloc address not found");

patch_ptr = chunk_data->buffer + delta;  // chunk_data[0] + delta

If no chunk contains the target offset, the fatal error "reloc address not found" is emitted.

Step 10: Application Engine Dispatch

The actual bit-patching is performed by sub_468760:

success = sub_468760(
    descriptor_table,       // off_1D3CBE0 or off_1D3DBE0
    adjusted_type,          // normalized relocation type index
    is_absolute,            // flag: 1 if symbol has absolute address
    patch_ptr,              // pointer into section data buffer
    extra_offset,           // reloc->extra field
    addend_value,           // computed addend
    symbol_value,           // resolved symbol address (from sym+8)
    symbol_size,            // symbol size (from sym+28)
    section_type_delta,     // section type - 0x6FFFFF84
    &output_value           // receives the computed final value
);

If the engine returns 0 (failure), the fatal error "unexpected NVRS" is emitted. On success, the relocation node is unlinked from the list.

Application Engine: sub_468760

The application engine (sub_468760, 14,322 bytes, 582 lines) is the bit-level instruction patching workhorse. It receives a relocation descriptor and the target instruction word, then applies the relocation by modifying specific bit fields.

Descriptor Table Format

Each entry in the descriptor table (off_1D3CBE0 / off_1D3DBE0) is 64 bytes, organized as an array of up to 4 relocation actions. Each action is 16 bytes (4 x uint32):

struct reloc_action {           // 16 bytes
    uint32_t  bit_offset;       // offset +0: starting bit position in instruction word
    uint32_t  bit_width;        // offset +4: number of bits to patch
    uint32_t  action_type;      // offset +8: relocation action code (0=end, 1..0x14+)
    uint32_t  reserved;         // offset +12: reserved / flags
};

struct reloc_descriptor {       // 64 bytes = 4 actions
    reloc_action  actions[4];   // at offsets +12 through +60 within the entry
                                // (first 12 bytes are the entry header)
};

The engine iterates actions from the first to the last, stopping when action_type == 0 or when all 4 slots are consumed (the end sentinel is at byte offset +60 from the entry start, stored in v100).

Action Types

The action_type field in each descriptor action determines how the value is computed and patched. The engine implements a switch statement over these codes:

CodeNameSemantics
0ENDTerminator -- stop processing this descriptor
1ABS_FULLAbsolute: write value to bit field (also used by 0x12, 0x2E)
6, 0x37ABS_LOAbsolute low 32 bits: extract low word of value
7, 0x38ABS_HIAbsolute high 32 bits: extract high word of value
8ABS_PLUS_SIZEAbsolute + symbol size addend
9ABS_SHIFTEDAbsolute with right-shift by 2 (4-byte aligned addresses)
0xASEC_TYPE_LOSection type low bits, masked by (255 >> (8 - width))
0xBSEC_TYPE_HISection type high bits, shifted right by 4 then masked
0x10PC_RELPC-relative: value - section_offset
0x13, 0x14CLEARClear bits: write zeros to the specified bit field

Bit-Field Patching

The patching mechanism operates on 64-bit words addressed through the patch_ptr. Given a bit_offset and bit_width, the engine:

  1. Computes which 64-bit word(s) the field spans: word_index = bit_offset / 64, local_offset = bit_offset % 64
  2. If the field fits within a single 64-bit word (local_offset + bit_width <= 64), performs a single read-modify-write using shift-and-mask:
    mask = ~((-1ULL << (64 - bit_width)) >> (64 - (local_offset + bit_width)));
    word = (word & ~mask) | ((value << (64 - bit_width)) >> (64 - (local_offset + bit_width)));
    
  3. If the field spans multiple 64-bit words, loops through intermediate words using sub_4685B0 (the bit-field writer helper), shifting the value right by the consumed bits at each step.

The helper sub_468670 is the inverse operation -- it extracts a bit field from the instruction word, used in non-absolute modes where the engine must read the existing value before adding to it.

This bit-level granularity is necessary because GPU instructions encode operands, immediates, and relocation targets in non-byte-aligned bit fields. A single SASS instruction may be 64 or 128 bits wide, with the relocated value occupying an arbitrary sub-field.

Worked Example: Applying 3 Relocation Types

This section walks three representative relocations through sub_469D60 and sub_468760 end to end -- from the symbol being referenced, through the 32-byte relocation record, into the CUDA descriptor table at off_1D3DBE0, and finally producing a before/after hex dump of the patched bytes. The three examples cover the three architectural shapes of the descriptor table:

  1. R_CUDA_ABS32_LO_20 -- a 16-bit instruction bit-field write (low half of a 32-bit absolute).
  2. R_CUDA_FUNC_DESC_32 -- a 32-bit data patch into a function descriptor slot.
  3. R_CUDA_CONST_FIELD19_20 -- a 19-bit instruction bit-field write with an implicit >> 2 (byte offset to DWORD offset).

All three examples assume a pre-Mercury target (sm_70 Volta), so the descriptor table selected by sub_469D60 is off_1D3DBE0 (the CUDA table), and the relocation types are used as direct indices (no 0x10000 normalization). Each entry in the 64-byte CUDA descriptor table is laid out as [12 bytes header | action[0] 16 bytes | action[1] 16 bytes | action[2] 16 bytes | 4 bytes sentinel], with the action iteration bounded by v100 = (unsigned int *)(v12 + 60) at line 132 of sub_468760.

Example 1: R_CUDA_ABS32_LO_20 (index 33)

Scenario. A kernel needs to load the address of a global __device__ variable g_table into a register. The compiler emits a MOV32I (or equivalent sm_70 IMAD/MOV32I-style wide-immediate) instruction split into two halves -- an R_CUDA_ABS32_HI_20 high-half relocation and an R_CUDA_ABS32_LO_20 low-half relocation. After layout, g_table has been assigned the absolute address 0x00C0_FFEE in the merged .nv.global section. This example patches the low half of that address into the 16-bit immediate field at bit 20 of the second instruction of the pair.

a. Symbol being referenced.

Symbol name      : g_table
Section          : .nv.global.data (sh_type = SHT_CUDA_GLOBAL, 0x70000062)
st_value (post-layout)
                 : 0x00C0_FFEE     (32-bit absolute address)
st_size          : 0x0080           (128 bytes, 32 floats)
st_info          : STB_GLOBAL | STT_OBJECT
Binding section  : output section index 11

After the layout phase, sub_440590(ctx, sym_idx) returns a symbol record whose field at +8 is 0x00000000_00C0FFEE.

b. Relocation record bytes (32 bytes, loaded as two __m128i).

Offset  Bytes (little-endian)                            Interpretation
------  ---------------------------------                ----------------------------------
+0      08 00 00 00 00 00 00 00                          addend = 0x08  (target offset within
                                                         .text.foo: second half of the pair,
                                                         the low-half instruction)
+8      21 00 00 00 07 00 00 00                          reloc_info low32 = 0x21 = 33 (type)
                                                         reloc_info high32 = 0x07 = sym idx 7
+16     05 00 00 00                                      section_idx = 5 (.text.foo)
+20     00 00 00 00                                      sym_addend_idx = 0 (no S+A rewrite)
+24     00 00 00 00 00 00 00 00                          extra = 0

The low 32 bits of reloc_info give relocation type 33 = 0x21 = R_CUDA_ABS32_LO_20. The high 32 bits give symbol index 7, which sub_440590 resolves to g_table.

c. Instruction being patched.

The target is the second instruction of a HI/LO pair inside .text.foo at byte offset 0x08 (reloc addend). The pre-relocation 64-bit instruction word, as stored in the section data buffer:

patch_ptr (offset 0x08 in .text.foo):
   Byte 0  1  2  3  4  5  6  7
         38 72 00 00 00 00 00 00
   as u64: 0x0000_0000_0000_7238

The low 16 bits of the final 32-bit address 0x00C0_FFEE are 0xFFEE. The relocation must write 0xFFEE into bits [20:36) of this instruction word.

d. Descriptor action slots from off_1D3DBE0.

The engine computes the descriptor pointer as off_1D3DBE0 + (33 << 6) = off_1D3DBE0 + 2112, then reads action slots starting at +12:

Offset  Bytes          action field              Value
------  ------------   ----------------------    ------------------
+12     14 00 00 00    action[0].bit_offset      0x14 = 20
+16     10 00 00 00    action[0].bit_width       0x10 = 16
+20     06 00 00 00    action[0].action_type     6 (ABS_LO = low 32 bits)
+24     00 00 00 00    action[0].reserved        0
+28     00 00 00 00    action[1].bit_offset      0
+32     00 00 00 00    action[1].bit_width       0
+36     00 00 00 00    action[1].action_type     0 (END)
+40     00 00 00 00    action[1].reserved        0
+44 ... 00 ...         action[2] / action[3]     END / zero

Only one real action: action[0] is code 6 (ABS_LO). The engine routes to the case 6u, 0x37u branch of sub_468760 (lines 173--211 of the decompiled function).

e. Before/after hex dump.

// Pre-patch extraction (sub_468670 at bit 20, width 16):
old = bitfield_extract(patch_ptr, 20, 16);
//   end = 20 + 16 = 36 <= 64 (single-word case)
//   old = (*patch_ptr << (64 - 36)) >> (64 - 16)
//       = (0x0000_0000_0000_7238 << 28) >> 48
//       = 0x0000_0723_8000_0000_0000 >> 48  (logically; truncated to 64 bits)
//       = 0x0000  (the immediate slot was empty pre-link)
old = 0;

// a7 = symbol_value = 0x00C0_FFEE, a3 = is_absolute = 0
// v10 = a7 = 0x00C0_FFEE (line 122)
// v80 = (uint32_t)v10 = 0xFFEE (line 176, LOBYTE of a 32-bit view)
// Adding old: v80 += 0 -> v80 = 0xFFEE
// v55 = v80 = 0xFFEE

// Write-back (single-word branch, LABEL_48 at line 253):
//   bit_offset v81 = 20, bit_width v52 = 16, v57 = 36
//   mask = (~(-1ULL << (64 - 16))) >> (64 - 36)
//        = 0x0000_0000_0000_FFFF << 20     (equivalently)
//        = 0x0000_0000_FFF0_0000
//   placed = (0xFFEE << 48) >> 28
//          = 0xFFEE_0000_0000_0000 >> 28
//          = 0x0000_000F_FEE0_0000
//                  ^--- the 16-bit value 0xFFEE now occupies bits [20:36)
//   *patch_ptr = (*patch_ptr & ~mask) | placed
//              = (0x0000_0000_0000_7238 & 0xFFFF_FFFF_000F_FFFF)
//              | 0x0000_0000_FFE0_0000
//              = 0x0000_0000_FFE0_7238

Hex dump of the 8 bytes at .text.foo + 0x08:

BEFORE:  38 72 00 00  00 00 00 00          // 0x0000_0000_0000_7238
AFTER:   38 72 E0 FF  00 00 00 00          // 0x0000_0000_FFE0_7238

                ^^ ^^   low byte of 0xFFEE at byte 2, high byte 0xFF at byte 3,
                         shifted by 20 bits (bits [20:36))

Paired with the matching R_CUDA_ABS32_HI_20 (index 29, action code 7 = ABS_HI) on the previous instruction, the full 32-bit address 0x00C0_FFEE is reconstructed in a register at runtime by a two-instruction sequence.

Example 2: R_CUDA_FUNC_DESC_32 (index 52)

Scenario. A device-side function pointer table (.nv.global slot) needs to be filled with a 32-bit function descriptor handle for the function kernel_launch_helper. Unlike the instruction-field relocations, R_CUDA_FUNC_DESC_32 patches a raw 32-bit data word in a data section -- the descriptor format is just bit_offset=0, bit_width=32, action_type=1 (ABS_FULL). The value written is the function's descriptor address (assigned by sub_463660 / the unified function table resolver).

a. Symbol being referenced.

Symbol name      : kernel_launch_helper
Section          : .text.kernel_launch_helper (sh_type = SHT_CUDA_TEXT)
st_value         : 0x0000_2340        (function entry PC)
st_info          : STB_GLOBAL | STT_FUNC

Because this is a function descriptor relocation, the target symbol is a STT_FUNC with a canonical value. sub_469D60's alias chain resolution (Step 6) runs first, picking the canonical definition if this is a weak alias. The final symbol->st_value + 8 read yields the function's code address 0x0000_2340. The descriptor table / UFT mechanism wraps this into a 32-bit descriptor handle during resolution; for this example, assume the resolved descriptor value (stored in the symbol record) is 0x0000_2340 (1:1 mapping on architectures without a UFT indirection).

b. Relocation record bytes.

Offset  Bytes (little-endian)                            Interpretation
------  ---------------------------------                ----------------------------------
+0      20 01 00 00 00 00 00 00                          addend = 0x0120  (target byte offset
                                                         within .nv.global.functable)
+8      34 00 00 00 13 00 00 00                          reloc_info low32  = 0x34 = 52 (type)
                                                         reloc_info high32 = 0x13 = sym idx 19
+16     0C 00 00 00                                      section_idx = 12 (.nv.global.functable)
+20     00 00 00 00                                      sym_addend_idx = 0
+24     00 00 00 00 00 00 00 00                          extra = 0

Relocation type 52 = 0x34 = R_CUDA_FUNC_DESC_32. Symbol index 19 resolves to kernel_launch_helper.

c. Data word being patched.

The target is 4 bytes of uninitialized data inside .nv.global.functable at byte offset 0x0120. Data sections are treated as 64-bit words by sub_468760 (the engine always addresses through unsigned __int64 *a4), but only bits [0:32) are written by this descriptor, leaving the high 32 bits untouched.

patch_ptr (offset 0x0120 in .nv.global.functable):
   Byte  0  1  2  3  4  5  6  7
         00 00 00 00 00 00 00 00         // zero-initialized slot
   as u64: 0x0000_0000_0000_0000

d. Descriptor action slots from off_1D3DBE0.

descriptor_ptr = off_1D3DBE0 + (52 << 6) = off_1D3DBE0 + 3328

Offset  Bytes          action field              Value
------  ------------   ----------------------    ------------------
+12     00 00 00 00    action[0].bit_offset      0
+16     20 00 00 00    action[0].bit_width       0x20 = 32
+20     01 00 00 00    action[0].action_type     1 (ABS_FULL)
+24     00 00 00 00    action[0].reserved        0
+28     00 00 00 00    action[1].action_type     0 (END)
+32 ... 00 ...         action[2] / action[3]     END / zero

Action code 1 routes to the case 1u, 0x12u, 0x2Eu branch (lines 140--172 of sub_468760).

e. Before/after hex dump.

The ABS_FULL path handles the "bit_width == 32" case directly. Tracing the decompiled logic:

// Entry to case 1 (line 143)
v18 = *v15;        // bit_offset = 0
v19 = v15[1];      // bit_width  = 32
// Not the (v18 == 0 && v19 == 64) whole-word fast path, fall through:
if ( !a3 )         // a3 = is_absolute = 0
{
    v112 = v15[1];
    v98 = sub_468670(a4, 0, 32);   // extract old 32-bit field
    v10 += v98;                    // add to value: 0x2340 + 0 = 0x2340
    v19 = v112;
    *a10 = v98;
}
v15 += 4;
sub_4685B0(a4, v10, 0, 32);        // write 0x2340 into bits [0:32)

The bit-field writer (sub_4685B0, lines 35--37):

// bit_offset=0, bit_width=32, value=0x2340
// v5 = 0 + 32 = 32; since 32 <= 64, single-word branch
// mask = (-1LL << (64 - 32)) >> (64 - 32) = 0xFFFF_FFFF_0000_0000 >> 32
//      = 0x0000_0000_FFFF_FFFF
// *a1 = (*a1 & ~0x0000_0000_FFFF_FFFF)
//     | ((0x2340 << 32) >> 32)
//     = 0x0000_0000_0000_0000 | 0x0000_0000_0000_2340
//     = 0x0000_0000_0000_2340

Hex dump of the 8 bytes at .nv.global.functable + 0x0120:

BEFORE:  00 00 00 00  00 00 00 00          // slot zeroed pre-link
AFTER:   40 23 00 00  00 00 00 00          // 32-bit descriptor 0x00002340 (little-endian)

         ^^ ^^ ^^ ^^
         low 32 bits = kernel_launch_helper's function descriptor

Only the first 4 bytes are affected. Bytes 4--7 are the high half of the 64-bit word the engine operates on; they are untouched because the mask is exactly 0xFFFFFFFF in the low half. If another R_CUDA_FUNC_DESC_32 patch were to land at byte offset 0x0124, it would write to the high half of the same 64-bit word without disturbing the low half.

Example 3: R_CUDA_CONST_FIELD19_20 (index 42)

Scenario. A kernel loads from a compiler-generated constant in .nv.constant0. After merging, the constant symbol __cuda_local_const_0 is placed at byte offset 0x240 within .nv.constant0. The target instruction is a 64-bit sm_70 LDC-family encoding with a 19-bit DWORD-offset field starting at bit 20. R_CUDA_CONST_FIELD19_20 is standard-table index 42; its descriptor uses action code 9 (ABS_SHIFTED), which right-shifts the byte offset by 2 to convert to a DWORD offset before writing.

a. Symbol being referenced.

Symbol name      : __cuda_local_const_0
Section          : .nv.constant0 (sh_type = SHT_CUDA_CONSTANT0, 0x70000064)
st_value (post-merge)
                 : 0x0000_0240        (byte offset within merged .nv.constant0)
st_info          : STB_LOCAL | STT_OBJECT

sub_440590 returns the symbol record whose +8 field is 0x240.

b. Relocation record bytes.

Offset  Bytes (little-endian)                            Interpretation
------  ---------------------------------                ----------------------------------
+0      40 00 00 00 00 00 00 00                          addend = 0x40  (offset of target
                                                         instruction within .text.kernel)
+8      2A 00 00 00 2B 00 00 00                          reloc_info low32  = 0x2A = 42 (type)
                                                         reloc_info high32 = 0x2B = sym idx 43
+16     06 00 00 00                                      section_idx = 6 (.text.kernel)
+20     00 00 00 00                                      sym_addend_idx = 0
+24     00 00 00 00 00 00 00 00                          extra = 0

Relocation type 42 = 0x2A = R_CUDA_CONST_FIELD19_20. Symbol index 43 resolves to __cuda_local_const_0.

c. Instruction being patched.

A 64-bit load-constant instruction at .text.kernel + 0x40. The compiler has pre-encoded the bank index (bits [14:19) = 0x00 for bank 0) and zeroed the 19-bit offset field:

patch_ptr (offset 0x40 in .text.kernel):
   Byte 0  1  2  3  4  5  6  7
         B8 79 00 00 00 00 00 00
   as u64: 0x0000_0000_0000_79B8

The bit layout of this pre-relocation word:

  63                    39       20 19  14 13              0
  +----------------------+--------+------+------------------+
  | (scheduling, pred)   | offset | bank |    opcode/dst    |
  |         ...          | 0x00000|  0x0 |      0x79B8      |
  |                      | 19 bit | 5 b  |                  |
  +----------------------+--------+------+------------------+

d. Descriptor action slots from off_1D3DBE0.

descriptor_ptr = off_1D3DBE0 + (42 << 6) = off_1D3DBE0 + 2688

Offset  Bytes          action field              Value
------  ------------   ----------------------    ------------------
+12     14 00 00 00    action[0].bit_offset      0x14 = 20
+16     13 00 00 00    action[0].bit_width       0x13 = 19
+20     09 00 00 00    action[0].action_type     9 (ABS_SHIFTED, >> 2)
+24     00 00 00 00    action[0].reserved        0
+28     00 00 00 00    action[1].action_type     0 (END)
+32 ... 00 ...         action[2] / action[3]     END / zero

Action code 9 routes to the case 9u branch (lines 301--337 of sub_468760).

e. Before/after hex dump.

The ABS_SHIFTED action does one extra step before the standard extract/add/write cycle: it right-shifts v10 (the running value, initialized to the symbol address) by 2 once, at the top of the case:

// case 9u: (line 302)
v10 >>= 2;                     // 0x0000_0240 >> 2 = 0x0000_0090 (DWORD offset)

v61 = *v15;                    // bit_offset = 20
v62 = v15[1];                  // bit_width  = 19
v63 = *v15;                    // bit_offset (cached)
if ( !a3 )                     // a3 = is_absolute = 0
{
    v96 = sub_468670(a4, 20, 19);   // extract old 19-bit field
    // old = (0x0000_0000_0000_79B8 << (64 - 39)) >> (64 - 19)
    //     = (0x0000_0000_0000_79B8 << 25) >> 45
    //     = 0x0000_0000_F370_0000_0000 >> 45
    //     (first drop high 8 bits above bit 63, then shift right 45)
    //     = 0x00000  (all bits in [20:39) were zero pre-link)
    v10 += v96;                // v10 += 0  ->  v10 = 0x0000_0090
    *a10 = v96;
}
// v61 = 20 <= 63, so v42 = a4; v64 = 20 + 19 = 39 <= 64 -> LABEL_98
// v44 = v10 = 0x0000_0090
// v48 = 64 - 19 = 45, v49 = 64 - 39 = 25
// v50 = -1LL << 45 = 0xFFFF_E000_0000_0000
// Write-back at LABEL_59 (line 573):
//   *v42 = (*v42 & ~(v50 >> v49)) | (v44 << v48 >> v49)
//        = (0x0000_0000_0000_79B8 & ~(0xFFFF_E000_0000_0000 >> 25))
//        | ((0x0000_0090 << 45) >> 25)
//   mask  = 0xFFFF_E000_0000_0000 >> 25 = 0x0000_007F_FFF0_0000
//   ~mask = 0xFFFF_FF80_000F_FFFF
//   placed = (0x0000_0090 << 45) >> 25
//          = 0x0012_0000_0000_0000 >> 25
//          = 0x0000_0009_0000_0000
//                  ^--- DWORD offset 0x90 at bits [20:39)
//   *v42 = (0x0000_0000_0000_79B8 & 0xFFFF_FF80_000F_FFFF)
//        | 0x0000_0009_0000_0000
//        = 0x0000_0009_0000_79B8

Hex dump of the 8 bytes at .text.kernel + 0x40:

BEFORE:  B8 79 00 00  00 00 00 00          // 0x0000_0000_0000_79B8
AFTER:   B8 79 00 00  09 00 00 00          // 0x0000_0009_0000_79B8

                      ^^
                      byte 4 = 0x09: the DWORD offset 0x90 occupies bits [20:39)
                      Verification: 0x09 << 32 = 0x0000_0009_0000_0000
                                    bits [20:39) of that = bits [20:39) of
                                    0x0000_0009_0000_0000, i.e. 0x90

Verification of the ISA semantics: the hardware interprets the 19-bit field as a DWORD index (4-byte-multiplied). 0x90 << 2 = 0x240, which matches the merged byte offset of __cuda_local_const_0 in .nv.constant0. The instruction now decodes as LDC R?, c[0x0][0x240], loading the correct constant.

Summary of the Three Examples

RelocationIndexAction codeBit offsetBit widthValue transformTarget kind
R_CUDA_ABS32_LO_20336 (ABS_LO)2016value & 0xFFFF (low half of 32-bit)Instruction field
R_CUDA_FUNC_DESC_32521 (ABS_FULL)032value (verbatim)Data word
R_CUDA_CONST_FIELD19_20429 (ABS_SHIFTED)2019value >> 2 (byte to DWORD)Instruction field

The three examples exercise three distinct mechanical paths through sub_468760: the wide-immediate split-half path (ABS_LO), the full-width data patch path (ABS_FULL), and the byte-to-DWORD shifted path (ABS_SHIFTED). All three share the common infrastructure of bit-field extract (sub_468670), value accumulation, and bit-field write (sub_4685B0), and all three produce deterministic, architecture-independent bit patterns in the output section data buffer.

Preserve-Relocs Path

When --preserve-relocs is active (byte at ctx+85 is nonzero), resolved relocations are not discarded after application. Instead, they are appended to a secondary linked list rooted at ctx+384:

if (ctx->preserve_relocs) {
    if ((symbol->st_bind & 3) != STB_LOCAL
        || (sym_section != 0
            && section_has_data))
    {
        if (section_type != 4)  // not SHT_RELA
            reloc->extra = output_value;
        sub_4644C0(reloc_record, ctx + 384);  // append to preserve list
    }
} else {
    sub_431000(reloc_record);  // free the record
}

sub_4644C0 is a linked-list append operation. After the main relocation walk completes, the preserve list is processed by sub_46ADC0 to emit .nv.resolvedrela sections.

Resolved-Rela Emission: sub_46ADC0

The function sub_46ADC0 (11,515 bytes, 406 lines) walks the preserve-relocs list at ctx+384 and writes each relocation into a .nv.resolvedrela section. This is used when the output is a relocatable object (-r flag) or when --preserve-relocs is specified, so that a subsequent link step or the CUDA runtime loader can perform final relocation at load time.

For each preserved relocation:

  1. Symbol index remapping: Calls sub_444720 to remap the symbol index from internal numbering to output ELF symbol table numbering.

  2. Symbol value validation: The resolved symbol's value at offset +8 must not be -1 (unallocated):

    "symbol never allocated"
    
  3. Section data location: Same chunk-list walk as the main engine, with the same error:

    "reloc address not found"
    
  4. Offset validation: Verifies the relocation offset does not exceed the section's data size:

    "relocation is past end of offset"
    
  5. Rela section allocation: Calls sub_442760 to find or create the .nv.resolvedrela section for the target, with error:

    "rela section never allocated"
    
  6. Descriptor-driven extraction: For relocations with non-trivial descriptors, the engine extracts existing bit-field values from the patched instruction data using the same sub_468670 bit-field reader, accumulating the extracted value into the relocation addend at offset +16.

  7. Output format: Writes the relocation record in either REL or RELA format depending on the ELF class (byte at ctx+4). For RELA (class == 2), the full 24-byte record {offset, info, addend} is emitted via sub_4336B0. For REL, the record is 12 bytes with the addend folded into info.

After all preserved relocations are emitted, if this is a non-Mercury relocatable link, the function also generates a .nv.rel.action section containing the relocation descriptor actions themselves, so downstream tools can re-apply them:

if (link_type == 2 && !mercury_flag) {
    action_section = sub_441AC0(ctx, ".nv.rel.action", SHT_CUDA_xxx, ...);
    // Iterate descriptor table entries, emit action records
}

Relocation Vtable: sub_459640

While not directly called by sub_469D60, the relocation vtable at sub_459640 (16,109 bytes, 570 lines) is a critical companion used by the finalization phase (sub_445000). It creates a 632-byte vtable of function pointers, one per relocation type, dispatched by architecture:

ArchitectureDescription
sm 30-39Kepler handlers
sm 50-59Maxwell handlers
sm 60-69Pascal handlers
sm 70-74Volta handlers
sm 75-79Turing handlers
sm 80-89Ampere/Ada handlers
sm 90-99Hopper handlers
sm 100+Mercury/Blackwell handlers

Each handler slot corresponds to a specific R_CUDA (or R_MERCURY) relocation type. The vtable provides approximately 70 handler slots, covering all GPU relocation types across all supported architectures. The finalization phase uses this vtable for the second pass of relocation application -- while sub_469D60 handles the initial resolution and unified table fixup, sub_445000 applies architecture-specific patching using the vtable dispatch.

Error Conditions

Error stringSeverityCondition
"unexpected reloc"FatalRelocation type nonzero but <= 0x10000 in Mercury mode
"reloc address not found"FatalTarget offset not contained in any section data chunk
"unexpected NVRS"FatalApplication engine returned failure (invalid descriptor)
"PC relative branch address should be in the same section"FatalPC-relative relocation crosses section boundary
"symbol never allocated"FatalPreserved relocation references unallocated symbol
"rela section never allocated"FatalCould not create .nv.resolvedrela output section
"relocation is past end of offset"FatalRelocation offset exceeds section data size

Diagnostic Traces

All traces are gated by (ctx->verbose_flags & 4) != 0 (bit 2 of the debug flags at ctx+64):

Trace stringWhen emitted
"change alias reloc %s to %s\n"Weak alias chain followed to canonical symbol
"ignore reloc on dead func %s\n"Relocation dropped because target function was eliminated
"replace unified reloc %d with %d\n"Unified table relocation type remapped to base type
"resolve reloc %d for sym=%d+%lld at <section=%d,offset=%llx>\n"Per-relocation resolution trace (full detail)
"ignore reloc on UFT_OFFSET\n"UFT_OFFSET relocation dropped when UDT mode inactive
"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement.\n"YIELD conversion suppressed

Function Map

AddressSizeIdentityRole
0x469D6026,578 Bapply_relocationsMain relocation phase entry point
0x46876014,322 Breloc_apply_engineBit-field patching engine, descriptor-driven
0x46ADC011,515 Bemit_resolved_relaWrites .nv.resolvedrela for preserve-relocs
0x45964016,109 Breloc_vtable_createPer-architecture relocation handler vtable
0x468670~240 Bbitfield_extractExtracts bit field from instruction word
0x4685B0~240 Bbitfield_writeWrites value into bit field of instruction word
0x440590~2 KBsym_idx_to_recordSymbol index to record pointer accessor
0x440350~2 KBsym_get_sectionGets section index containing a symbol
0x442270~2 KBsec_idx_to_recordSection index to record pointer accessor
0x444BD0~2 KBsym_is_definedChecks if symbol has a definition
0x463660~2 KBuft_get_offsetUFT/UDT offset resolver
0x4644C0~1 KBlist_appendAppends node to singly-linked list
0x444720~2 KBsym_remap_indexRemaps symbol index for output ELF
0x4336B0~2 KBsection_write_dataWrites data into a section's data buffer
0x4411D0~2 KBsection_find_by_nameFinds section by name string
0x467460~2 KBerror_emitVariadic error emission entry point

Cross-References

Confidence Assessment

ClaimConfidenceEvidence
sub_469D60 at 0x469D60, 26,578 bytes, 985 linesHIGHstat -c%s = 26,578; wc -l = 985
sub_468760 (application engine), 14,322 B, 582 linesHIGHstat -c%s = 14,322; wc -l = 582
sub_46ADC0 (resolved-rela emitter), 11,515 B, 406 linesHIGHstat -c%s = 11,515; wc -l = 406
sub_459640 (relocation vtable), 16,109 B, 570 linesHIGHstat -c%s = 16,109; wc -l = 570
Signature: (ctx, mutex_attr) -- two argumentsHIGHDecompiled: char __fastcall sub_469D60(__int64 a1, pthread_mutexattr_t *a2)
off_1D3CBE0 (Mercury descriptor table)HIGHReferenced at lines 202, 207 of sub_469D60 decompiled code
off_1D3DBE0 (CUDA descriptor table)HIGHReferenced at line 214 of sub_469D60 decompiled code
Mercury relocation type offset 0x10000HIGHv9 - 0x10000 at line 203 and <= 0x10000 check at line 197 of decompiled code
SSE _mm_loadu_si128 for relocation record loadingHIGH_mm_loadu_si128(v5) at line 236 of decompiled code
"unexpected reloc" error stringHIGHString at 0x1d3bcd0 in nvlink_strings.json (full: "unexpected reloc section")
"reloc address not found" error stringHIGHString at 0x1d3c990 in nvlink_strings.json
"unexpected NVRS" error stringHIGHString at 0x1d3caf8 in nvlink_strings.json
"PC relative branch address should be in the same section"HIGHString at 0x1d3ca68 in nvlink_strings.json
"symbol never allocated" error stringHIGHString at 0x1d3cb17 in nvlink_strings.json
"rela section never allocated" error stringHIGHString at 0x1d3cb2e in nvlink_strings.json
"change alias reloc %s to %s" traceHIGHString at 0x1d3caa1 in nvlink_strings.json
"ignore reloc on dead func %s" traceHIGHString at 0x1d3cabe in nvlink_strings.json
"replace unified reloc %d with %d" traceHIGHString at 0x1d3c9a8 in nvlink_strings.json
"resolve reloc %d for sym=%d+%lld at <section=%d,offset=%llx>"HIGHString at 0x1d3ca28 in nvlink_strings.json
"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."HIGHString at 0x1d3c9d0 in nvlink_strings.json
"ignore reloc on UFT_OFFSET" traceHIGH"__UFT_OFFSET" string at 0x1d3a025 in nvlink_strings.json
Unified relocation remapping table (102->2, 103->1, etc.)MEDIUMValues inferred from switch-case in decompiled sub_469D60; individual mappings verified against code
UFT synthetic symbols (__UFT_OFFSET, __UFT_CANONICAL, etc.)HIGH"__UFT_OFFSET" at 0x1d3a025; related strings nearby in nvlink_strings.json
Relocation descriptor format (64 bytes, 4 actions of 16 bytes)MEDIUMInferred from decompiled loop structure in sub_468760 and descriptor table stride; not independently labeled
Action type codes (0=END, 1=ABS_FULL, 6=ABS_LO, 7=ABS_HI, etc.)MEDIUMValues from switch-case in sub_468760; semantics inferred from patching behavior
Bit-field patching mechanism (64-bit read-modify-write)HIGHShift-and-mask operations visible in sub_468760 and helpers sub_4685B0/sub_468670
sub_4685B0 (bitfield_write) and sub_468670 (bitfield_extract)HIGHBoth files exist in decompiled/
Preserve-relocs path appends to list at ctx+384MEDIUMOffset inferred from decompiled pointer arithmetic; list-append call visible
Relocation vtable architecture ranges (sm 30--39 Kepler, etc.)MEDIUMArchitecture dispatch visible in sub_459640; specific SM ranges are editorial grouping
All 18 function addresses in the function map tableHIGHAll verified to exist in decompiled/ directory
10-step resolution algorithmMEDIUMStep boundaries are editorial grouping of the decompiled control flow; individual steps verified
Relocation record struct (32 bytes, two __m128i)HIGH__m128i type visible in decompiled variable declarations; _mm_loadu_si128 confirms 128-bit loading