R_CUDA Relocations

nvlink defines 119 CUDA-specific ELF relocation types for the EM_CUDA (190) machine type. Relocation records of these types are stored in .rela.* sections of device ELF (cubin) files and are consumed by the relocation engine during the link phase. Each type encodes how a resolved symbol address is patched into the instruction stream or data section: the bit-field width, the bit-field position within the 64- or 128-bit instruction word, and the computation to perform (absolute, PC-relative, hi/lo split, etc.).

The types are organized into two global descriptor tables baked into the nvlink binary's .rodata segment. A validation/dispatch function at sub_42F6C0 selects between them based on whether the relocation is a standard code/data relocation or a section-attribute relocation. The application engine at sub_468760 reads 64-byte per-type descriptors from these tables and performs up to three sequential bit-field patching actions per relocation.

Key Facts

Property                     Value
--------                     -----
Machine type                 EM_CUDA (190)
Total unique type names      119
Standard relocation table    off_1D37600 (117 entries, index 0--116)
Attribute relocation table   off_1D371E0 (65 entries, index 0--64)
Attribute type offset        0x10000 (attribute type = standard type + 65536)
Descriptor size              64 bytes per type (12-byte header + 3 actions x 16 bytes + 4-byte sentinel)
Validation function          sub_42F6C0 at 0x42F6C0
Architecture class function  sub_42F8C0 at 0x42F8C0
Max-type-for-class function  sub_42F690 at 0x42F690
Application engine           sub_468760 at 0x468760 (14,322 bytes)
CUDA descriptor table        off_1D3DBE0 (used by relocation engine)
Mercury descriptor table     off_1D3CBE0 (Mercury types are CUDA types + 0x10000)

Naming Convention

Every R_CUDA type name follows a systematic pattern:

R_CUDA_<category><bits>_<bitposition>

The components are:

  • Category: the semantic class of the relocation (ABS, G, FUNC_DESC, TEX, etc.)
  • Bits: the width of the relocated value in bits (2, 6, 8, 9, 13, 14, 16, 19, 20, 21, 22, 24, 32, 47, 55, 56, 64, 128)
  • Bit position: the starting bit offset within the instruction word where the value is inserted

For example, R_CUDA_ABS32_20 means: patch bits [20:52) of the instruction word with a 32-bit absolute address. R_CUDA_PCREL_IMM24_23 means: compute a PC-relative offset, take 24 bits, and insert starting at bit 23.

Some types use a compound suffix with _HI or _LO to indicate which half of a 32-bit value is being patched (high 16 bits or low 16 bits).
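
The naming pattern can be checked mechanically. The following is a minimal sketch, not nvlink code: parse_reloc_name and reloc_name_parts are hypothetical names introduced here for illustration. It handles only the simple R_CUDA_<category><bits>_<bitposition> form and deliberately rejects compound _HI/_LO names and names without a category (such as R_CUDA_8_0).

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of nvlink: splits a name of the form
 * R_CUDA_<category><bits>_<bitposition> into its three components. */
typedef struct {
    char category[32];   /* e.g. "ABS" or "PCREL_IMM" */
    int  bits;           /* width of the relocated value */
    int  bit_position;   /* starting bit inside the instruction word */
} reloc_name_parts;

static bool parse_reloc_name(const char *name, reloc_name_parts *out) {
    static const char prefix[] = "R_CUDA_";
    assert(name != NULL && out != NULL);
    if (strncmp(name, prefix, sizeof prefix - 1) != 0)
        return false;
    const char *body = name + (sizeof prefix - 1);

    /* The bit position is the run of digits after the last underscore. */
    const char *last = strrchr(body, '_');
    if (last == NULL || !isdigit((unsigned char)last[1]))
        return false;
    for (const char *p = last + 1; *p; p++)
        if (!isdigit((unsigned char)*p))
            return false;

    /* The width is the run of digits immediately before that underscore. */
    const char *w = last;
    while (w > body && isdigit((unsigned char)w[-1]))
        w--;
    if (w == last || w == body)
        return false;            /* no width digits, or empty category */
    size_t cat_len = (size_t)(w - body);
    if (cat_len >= sizeof out->category)
        return false;

    memcpy(out->category, body, cat_len);
    out->category[cat_len] = '\0';
    out->bits = atoi(w);         /* atoi stops at the underscore */
    out->bit_position = atoi(last + 1);
    return true;
}
```

For "R_CUDA_ABS32_20" this yields category "ABS", 32 bits, bit position 20, which corresponds to the half-open bit range [20:52) described above.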

Dual Descriptor Tables

Standard Table (off_1D37600)

The standard relocation table at off_1D37600 contains 117 entries (indices 0 through 116, validated against limit 0x75 = 117). Each entry spans two pointer-sized slots, which is why the lookup below indexes with 2 * type_index: the first slot points to the type name string (e.g., "R_CUDA_ABS32_20"), and the remaining fields encode the relocation class and architecture compatibility.

The validation function sub_42F6C0 checks:

// Standard relocation path
if (!is_attribute) {
    if (type_index >= 117)       // limit 0x75
        error("unknown attribute");
    entry = &off_1D37600[2 * type_index];
    if (entry->arch_class > target_class)
        warning("Relocation %s not supported on %s", entry->name, class_name);
}

Attribute Table (off_1D371E0)

The attribute relocation table at off_1D371E0 contains 65 entries (indices 0 through 64, validated against limit 0x41 = 65). Attribute relocations are identified by having their type encoded with the 0x10000 offset -- when the relocation engine encounters a type >= 0x10000, it subtracts 0x10000 and uses this table instead.

// Attribute relocation path (type >= 0x10000)
type_index -= 0x10000;
if (type_index >= 65)            // limit 0x41
    error("unknown attribute");
entry = &off_1D371E0[2 * type_index];

Attribute relocations apply to .nv.info.* attribute sections rather than to instruction streams. A separate validation function at sub_42F760 handles attribute-specific compatibility checking with a three-way dispatch based on the attribute usage field (dword_1D37D68[4 * type + 1]): value 0 = warning, value 1 = error, value 2 = silent ignore.
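
The table selection described in this section can be summarized as a small classifier. This is a sketch under stated assumptions: choose_table and reloc_table_choice are hypothetical names, and only the limits (0x75, 0x41) and the 0x10000 offset come from the text above. It does not model the architecture-class check or the warning/error/ignore dispatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    STANDARD_LIMIT  = 0x75,     /* 117 standard entries  */
    ATTRIBUTE_LIMIT = 0x41,     /* 65 attribute entries  */
    ATTRIBUTE_BASE  = 0x10000   /* attribute type offset */
};

static_assert(STANDARD_LIMIT == 117, "0x75 is 117");
static_assert(ATTRIBUTE_LIMIT == 65, "0x41 is 65");

/* Hypothetical result type: which table a raw type selects. */
typedef struct {
    bool     valid;         /* index is inside the selected table */
    bool     is_attribute;  /* true: attribute table, false: standard table */
    uint32_t index;         /* index after removing the 0x10000 offset */
} reloc_table_choice;

static reloc_table_choice choose_table(uint32_t raw_type) {
    reloc_table_choice c = { false, false, 0 };
    if (raw_type >= ATTRIBUTE_BASE) {
        c.is_attribute = true;
        c.index = raw_type - ATTRIBUTE_BASE;
        c.valid = c.index < ATTRIBUTE_LIMIT;
    } else {
        c.index = raw_type;
        c.valid = c.index < STANDARD_LIMIT;
    }
    return c;
}
```

A raw type of 0x10003 selects attribute index 3, while 117 and 0x10041 both fall outside their tables and would take the error path.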

Architecture Class System

The function at sub_42F8C0 maps an SM version number to an architecture class used for relocation compatibility checking:

int reloc_arch_class(int sm_version) {
    if (sm_version == 0)   return 0;   // invalid / unset
    if (sm_version <= 70)  return 1;   // Kepler through Volta (sm_30--sm_70)
    if (sm_version <= 72)  return 2;   // Volta extended (sm_72)
    if (sm_version >= 76)  return 5;   // Ampere+ (sm_80--sm_90+)
    return 3;                           // Turing (sm_75)
}

Each descriptor entry stores a minimum architecture class. The validation function compares the entry's class against the target to ensure the relocation type is supported on the architecture being linked. The five class names are stored in a string pointer array at off_1D371A0 (indexed 0--4), used in error/warning messages.

The maximum valid relocation index varies by architecture class. The function sub_42F690 scans backward from index 115 (the last non-special standard type) through the descriptor table, returning the first index whose architecture class is not 5 (the highest). This determines which types are valid for a given target.
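
The backward scan can be sketched as follows. This is an illustration only: last_index_below_class is a hypothetical name, and the class array passed to it stands in for the architecture-class field of the real descriptor table, whose contents are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Scans backward from `start` and returns the first index whose class
 * differs from `top_class`, or -1 if every entry has that class. */
static int last_index_below_class(const int *arch_class, int start, int top_class) {
    assert(arch_class != NULL && start >= 0);
    for (int i = start; i >= 0; i--) {
        if (arch_class[i] != top_class)
            return i;
    }
    return -1;
}
```

With a toy class array {0, 1, 1, 3, 5, 5}, a scan from index 5 with top class 5 returns 3. In the scheme described above, the real scan would start at index 115 with top class 5.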

Descriptor Format

Each relocation type has a 64-byte descriptor in the application engine's table (off_1D3DBE0 for CUDA, off_1D3CBE0 for Mercury). The descriptor is divided into a 12-byte header followed by three 16-byte action slots and a 4-byte sentinel:

Descriptor (64 bytes total):
  +0   Header (12 bytes)
       +0   uint32_t  field_0;      // Used by resolved-rela emitter (sub_46ADC0)
       +4   uint32_t  field_1;      // Used by resolved-rela emitter
       +8   uint32_t  field_2;      // Used by resolved-rela emitter
  +12  Action 0 (16 bytes)
       +12  uint32_t  bit_offset;   // Starting bit position in instruction word
       +16  uint32_t  bit_width;    // Number of bits to patch
       +20  uint32_t  action_type;  // Operation code (see table below)
       +24  uint32_t  reserved;     // Padding / flags
  +28  Action 1 (16 bytes)
       +28  uint32_t  bit_offset;
       +32  uint32_t  bit_width;
       +36  uint32_t  action_type;
       +40  uint32_t  reserved;
  +44  Action 2 (16 bytes)
       +44  uint32_t  bit_offset;
       +48  uint32_t  bit_width;
       +52  uint32_t  action_type;
       +56  uint32_t  reserved;
  +60  Sentinel (4 bytes, marks end of action array)
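
The layout above maps directly onto a C struct. The following sketch uses hypothetical type names (reloc_action, reloc_descriptor, descriptor_at); only the offsets and sizes come from the layout. The static assertions confirm that the arithmetic adds up to 64 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 16-byte action slot. */
typedef struct {
    uint32_t bit_offset;   /* starting bit position in the instruction word */
    uint32_t bit_width;    /* number of bits to patch */
    uint32_t action_type;  /* operation code, 0x00 for an empty slot */
    uint32_t reserved;
} reloc_action;

/* One 64-byte descriptor: header, three actions, sentinel. */
typedef struct {
    uint32_t     header[3];
    reloc_action actions[3];
    uint32_t     sentinel;
} reloc_descriptor;

static_assert(sizeof(reloc_action) == 16, "action slot is 16 bytes");
static_assert(sizeof(reloc_descriptor) == 64, "descriptor is 64 bytes");
static_assert(offsetof(reloc_descriptor, actions) == 12, "first action at +12");
static_assert(offsetof(reloc_descriptor, sentinel) == 60, "sentinel at +60");

/* Assumes `table` is suitably aligned for uint32_t access. */
static const reloc_descriptor *descriptor_at(const void *table, uint32_t type_index) {
    return (const reloc_descriptor *)((const uint8_t *)table + ((size_t)type_index << 6));
}
```

Because the descriptor size is a power of two, indexing by type_index << 6 is equivalent to ordinary array indexing on a reloc_descriptor array.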

The application engine (sub_468760) iterates the three action slots sequentially. Action type 0x00 marks an empty slot rather than stopping the loop: the engine skips to the next slot and terminates only when the slot pointer reaches the sentinel. The engine indexes into the descriptor table and positions its action pointer and sentinel:

descriptor_base = table + (type_index << 6);  // type_index * 64
action_ptr = descriptor_base + 12;             // first action at byte offset +12
end_ptr    = descriptor_base + 60;             // sentinel at byte offset +60

The header fields at offsets +0, +4, +8 are not used by the application engine itself. They are consumed by the resolved-rela emitter (sub_46ADC0) during --preserve-relocs processing. The emitter also reads up to three (bit_offset, bit_width, present-flag) triples for extracting the already-patched instruction fields back into addend records. The mapping is: descriptor uint32 indices [3,4,5] = action 0's extraction spec, [7,8,9] = action 1's, [11,12,13] = action 2's. The third element of each triple is the "present" flag gating whether extraction occurs; uint32 index 5 is byte offset +20, so this flag occupies the same word as the slot's action_type.

Action Slot Processing Pseudocode

The core loop in sub_468760 processes each action slot in order. The following pseudocode captures the complete dispatch:

int reloc_apply_engine(
    void*      table,            // descriptor table base (off_1D3DBE0 or off_1D3CBE0)
    uint32_t   type_index,       // normalized relocation type index
    bool       is_absolute,      // true if symbol has absolute address
    uint64_t*  patch_ptr,        // pointer into section data (instruction words)
    int64_t    extra_offset,     // reloc record extra field
    int        section_offset,   // section base / PC address
    uint64_t   symbol_value,     // resolved symbol address (S)
    uint32_t   symbol_size,      // symbol st_size
    uint32_t   section_type,     // section type delta (section type minus 0x6FFFFF84)
    int64_t*   output_value      // receives extracted original value
) {
    // Compute initial relocation value
    uint64_t value = symbol_value;
    if (is_absolute)
        value = symbol_value + extra_offset;   // S + A

    uint8_t* desc = (uint8_t*)table + ((size_t)type_index << 6);  // cast: no arithmetic on void*
    uint32_t* action = (uint32_t*)(desc + 12); // first action slot
    uint32_t* end    = (uint32_t*)(desc + 60); // sentinel
    *output_value = 0;

    while (action != end) {
        uint32_t bit_off  = action[0];
        uint32_t bit_wid  = action[1];
        uint32_t act_type = action[2];

        switch (act_type) {

        case 0x00: // END -- skip this slot, continue to next
            action += 4;
            break;

        case 0x01:  // ABS_FULL
        case 0x12:  // ABS_FULL (alias)
        case 0x2E:  // ABS_FULL (alias)
            // Fast path: full 64-bit word write
            if (bit_off == 0 && bit_wid == 64) {
                if (!is_absolute) {
                    *output_value = *patch_ptr;
                    value += *patch_ptr;
                }
                *patch_ptr = value;
                action += 4;
                break;
            }
            // Narrow field: extract old, add, write back
            if (!is_absolute) {
                int64_t old = bitfield_extract(patch_ptr, bit_off, bit_wid);
                value += old;
                *output_value = old;
            }
            action += 4;
            bitfield_write(patch_ptr, value, bit_off, bit_wid);
            break;

        case 0x06:  // ABS_LO -- low bits of (S + A)
        case 0x37:  // ABS_LO (alias)
            if (!is_absolute) {
                int64_t old = bitfield_extract(patch_ptr, bit_off, bit_wid);
                *output_value = old;
                value += old;
            }
            // Write the low bit_wid bits of value
            bitfield_write(patch_ptr, (uint64_t)value, bit_off, bit_wid);
            action += 4;
            break;

        case 0x07:  // ABS_HI -- high 32 bits of (S + A)
        case 0x38:  // ABS_HI (alias)
            if (!is_absolute) {
                int64_t old = bitfield_extract(patch_ptr, bit_off, bit_wid);
                *output_value = old;
                value = (value >> 32) + old;
            } else {
                value = value >> 32;
            }
            bitfield_write(patch_ptr, value, bit_off, bit_wid);
            action += 4;
            break;

        case 0x08:  // ABS_SIZE -- S + A + symbol_size
            if (!is_absolute) {
                int64_t old = bitfield_extract(patch_ptr, bit_off, bit_wid);
                value = old + symbol_size;
                *output_value = old;
            } else {
                value = extra_offset + symbol_size;
            }
            bitfield_write(patch_ptr, value, bit_off, bit_wid);
            action += 4;
            break;

        case 0x09:  // SHIFTED_2 -- (S + A) >> 2
            value >>= 2;
            if (!is_absolute) {
                int64_t old = bitfield_extract(patch_ptr, bit_off, bit_wid);
                value += old;
                *output_value = old;
            }
            bitfield_write(patch_ptr, value, bit_off, bit_wid);
            action += 4;
            break;

        case 0x0A:  // SEC_TYPE_LO -- low nybble of section type delta
            value = section_type & (uint64_t)(0xFF >> (8 - bit_wid));
            if (!is_absolute)
                value += bitfield_extract(patch_ptr, bit_off, bit_wid);
            bitfield_write(patch_ptr, value, bit_off, bit_wid);
            action += 4;
            break;

        case 0x0B:  // SEC_TYPE_HI -- high nybble of section type delta
            value = (section_type >> 4) & (uint64_t)(0xFF >> (8 - bit_wid));
            if (!is_absolute)
                value += bitfield_extract(patch_ptr, bit_off, bit_wid);
            bitfield_write(patch_ptr, value, bit_off, bit_wid);
            action += 4;
            break;

        case 0x10:  // PC_REL -- (int32_t)(S + A) - PC
            if (!is_absolute) {
                int64_t old = bitfield_extract(patch_ptr, bit_off, bit_wid);
                value += old;
                *output_value = old;
            }
            value = (int32_t)value - section_offset;
            bitfield_write(patch_ptr, value, bit_off, bit_wid);
            action += 4;
            break;

        case 0x13:  // CLEAR -- zero the bit-field
        case 0x14:  // CLEAR (alias)
            bitfield_write(patch_ptr, 0, bit_off, bit_wid);
            action += 4;
            break;

        case 0x16: case 0x17: case 0x18: case 0x19:  // MASKED_SHIFT 0-3
        case 0x1A: case 0x1B: case 0x1C: case 0x1D:  // MASKED_SHIFT 4-7
        case 0x2F: case 0x30: case 0x31: case 0x32:  // MASKED_SHIFT 8-11
        case 0x33: case 0x34: case 0x35: case 0x36:  // MASKED_SHIFT 12-15
        {
            int idx = act_type - 22;
            if (!is_absolute) {
                int64_t old = bitfield_extract(patch_ptr, bit_off, bit_wid);
                value += old;
                *output_value = old;
            }
            value = (value & mask_table[idx]) >> shift_table[idx];
            bitfield_write(patch_ptr, value, bit_off, bit_wid);
            action += 4;
            break;
        }

        default:
            return 0;  // Unknown action -- caller emits "unexpected NVRS"
        }

        if (action == end)
            return 1;
    }
    return 1;
}

Action Types

Code        Name                Computation
----        ----                -----------
0x00        end                 Skip slot; terminate if at sentinel
0x01        abs_full            S + A -- full absolute, store all bits
0x06        abs_lo              (S + A) & mask -- low bits of absolute
0x07        abs_hi              ((S + A) >> 32) & mask -- high 32 bits of absolute
0x08        abs_size            S + A + symbol_size -- absolute plus symbol size addend
0x09        abs_shifted         (S + A) >> 2 -- absolute right-shifted by 2 (4-byte aligned)
0x0A        sec_type_lo         section_type & (0xFF >> (8 - width)) -- low nybble of section type
0x0B        sec_type_hi         (section_type >> 4) & (0xFF >> (8 - width)) -- high nybble of section type
0x10        pc_rel              (int32_t)(S + A) - PC -- PC-relative offset
0x12        abs_full            Alias of 0x01 (same behavior)
0x13        clear               Zero the bit-field (write 0)
0x14        clear               Alias of 0x13 (same behavior)
0x16--0x1D  masked_shift_0..7   ((S + A) & mask_table[n]) >> shift_table[n]
0x2E        abs_full            Alias of 0x01 (same behavior, different encoding)
0x2F--0x36  masked_shift_8..15  ((S + A) & mask_table[n]) >> shift_table[n]
0x37        abs_lo              Alias of 0x06
0x38        abs_hi              Alias of 0x07

The masked-shift actions (codes 0x16--0x1D and 0x2F--0x36) use a pair of SSE constant vectors loaded from xmmword_1D3F8E0 through xmmword_1D3F930. These contain mask and shift values indexed by (action_type - 22), enabling a single code path to handle 16 different extract-and-place patterns for multi-field instruction encodings.
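
The mask-then-shift step itself is simple to illustrate. In the sketch below, the table contents are HYPOTHETICAL placeholders chosen for readability; the real constants at xmmword_1D3F8E0 onward are not reproduced in this document. The names mask_shift_entry, demo_table, masked_shift, and masked_shift_index are also assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t mask;    /* bits of (S + A) to keep */
    unsigned shift;   /* right shift applied after masking */
} mask_shift_entry;

/* HYPOTHETICAL values for illustration only, not the nvlink constants. */
static const mask_shift_entry demo_table[] = {
    { 0x000000000000FFFFull,  0 },   /* bits [0:16)  */
    { 0x00000000FFFF0000ull, 16 },   /* bits [16:32) */
    { 0x0000FFFF00000000ull, 32 },   /* bits [32:48) */
};

/* Mask first, then shift: the parentheses matter, because in C
 * the >> operator binds more tightly than &. */
static uint64_t masked_shift(uint64_t value, mask_shift_entry e) {
    assert(e.shift < 64);
    return (value & e.mask) >> e.shift;
}

/* Index computation as stated in the text: action_type - 22. */
static int masked_shift_index(uint32_t action_type) {
    return (int)action_type - 22;
}
```

With these placeholder entries, the value 0x123456789ABC splits into 0x9ABC, 0x5678, and 0x1234, one 16-bit slice per entry. Action code 0x16 maps to index 0 and 0x1D to index 7.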

Bit-field Patching

The engine uses two helper functions for the actual bit manipulation:

  • sub_468670 (bitfield_extract): extracts the current value of a bit-field from the instruction word
  • sub_4685B0 (bitfield_write): splices a new value into a bit-field

Both handle fields that span 64-bit word boundaries. The instruction data is treated as an array of uint64_t words (little-endian). The general algorithms:

Extraction (sub_468670): Given a starting bit position and width, extract the field value:

int64_t bitfield_extract(uint64_t* words, int bit_offset, int bit_width) {
    // Normalize: advance pointer past full 64-bit words
    if (bit_offset >= 64) {
        words += (bit_offset / 64);
        bit_offset = bit_offset % 64;
    }
    int end_bit = bit_offset + bit_width;

    // Single-word case: field fits within one uint64_t
    if (end_bit <= 64)
        return *words << (64 - end_bit) >> (64 - bit_width);

    // Multi-word case: recursive split across 64-bit boundary
    int64_t low = bitfield_extract(words, bit_offset, 64 - bit_offset);
    int64_t high;
    if (end_bit - 64 > 64) {
        // Three-word span (up to 192 bits, theoretical max)
        int64_t mid = bitfield_extract(words + 1, 0, 64);
        high = bitfield_extract(words + 2, 0, end_bit - 128);
    } else {
        // Two-word span (common case for 128-bit instructions)
        high = words[1] << (128 - end_bit) >> (64 - (end_bit - 64));
    }
    return low | (high << (64 - bit_offset));
}

Insertion (sub_4685B0): Given a value, starting bit position, and width, splice the value into the instruction words:

void bitfield_write(uint64_t* words, uint64_t value, int bit_offset, int bit_width) {
    // Normalize: advance pointer past full 64-bit words
    if (bit_offset >= 64) {
        words += (bit_offset / 64);
        bit_offset = bit_offset % 64;
    }
    int end_bit = bit_offset + bit_width;

    // Multi-word case: iterate through intermediate words
    if (end_bit > 64) {
        uint64_t* end_word = words + ((end_bit - 65) / 64) + 1;
        int off = bit_offset;
        while (words != end_word) {
            // Preserve bits below bit_offset, fill above with value
            *words = (*words & ~(-1ULL << off)) | (value << off);
            value >>= (64 - off);
            off = 0;
            words++;
        }
        end_bit = end_bit - (((end_bit - 65) / 64) * 64) - 64;
        bit_width = end_bit;
    }

    // Final (or only) word: read-modify-write with constructed mask
    //   mask = bit_width ones positioned at [end_bit - bit_width, end_bit)
    uint64_t mask = (-1ULL << (64 - bit_width)) >> (64 - end_bit);
    *words = (*words & ~mask) | ((value << (64 - bit_width)) >> (64 - end_bit));
}

The mask formula (-1ULL << (64 - W)) >> (64 - E) where E = offset + W creates a window of W ones starting at bit position E - W. The value is aligned to the same position using an identical pair of shifts. This is a standard bit-field insertion idiom that avoids branching.
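
The single-word case of this idiom can be exercised in isolation. The sketch below is a simplified stand-in for the helpers above, limited to fields that fit inside one 64-bit word; field_mask, field_extract, and field_insert are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* W ones positioned at [offset, offset + width), built with two shifts. */
static uint64_t field_mask(unsigned offset, unsigned width) {
    unsigned end = offset + width;
    assert(width >= 1 && end <= 64);
    return (~0ull << (64 - width)) >> (64 - end);
}

/* Shift the field to the top of the word, then down to bit 0. */
static uint64_t field_extract(uint64_t word, unsigned offset, unsigned width) {
    unsigned end = offset + width;
    assert(width >= 1 && end <= 64);
    return (word << (64 - end)) >> (64 - width);
}

/* Clear the window, then OR in the value aligned by the same shift pair. */
static uint64_t field_insert(uint64_t word, uint64_t value,
                             unsigned offset, unsigned width) {
    uint64_t mask   = field_mask(offset, width);
    uint64_t placed = (value << (64 - width)) >> (64 - (offset + width));
    return (word & ~mask) | placed;
}
```

Shifting the value left by 64 - W first discards any bits above the field width, so an oversized value cannot spill into neighboring fields. For a 32-bit field at bit 26, field_mask returns 0x03FFFFFFFC000000.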

Worked Example: R_CUDA_ABS32_26

R_CUDA_ABS32_26 (standard table index 5) is a common relocation type that patches a 32-bit absolute address into a SASS instruction starting at bit 26. This section traces the complete path from relocation record to patched instruction.

Scenario: A MOV instruction at offset 0x100 in .text references a global symbol _Z10my_kernelPi resolved to address 0x0000_0000_DEAD_BEEF. The .rela.text section contains:

Elf64_Rela {
    r_offset = 0x100,        // instruction offset within .text
    r_info   = (sym << 32) | 5,  // type = 5 (R_CUDA_ABS32_26)
    r_addend = 0              // no addend
}
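
The r_info packing follows the standard ELF64 convention: symbol index in the upper 32 bits, relocation type in the lower 32 bits. A minimal sketch, using a local rela64 struct and hypothetical helper names rather than the system elf.h so that it stays self-contained:

```c
#include <assert.h>
#include <stdint.h>

/* Same field layout as Elf64_Rela. */
typedef struct {
    uint64_t r_offset;
    uint64_t r_info;
    int64_t  r_addend;
} rela64;

static_assert(sizeof(rela64) == 24, "Elf64_Rela is 24 bytes");

/* Equivalent of ELF64_R_SYM: upper 32 bits of r_info. */
static uint32_t rela_sym(const rela64 *r) {
    return (uint32_t)(r->r_info >> 32);
}

/* Equivalent of ELF64_R_TYPE: lower 32 bits of r_info. */
static uint32_t rela_type(const rela64 *r) {
    return (uint32_t)(r->r_info & 0xFFFFFFFFu);
}

/* Equivalent of ELF64_R_INFO. */
static uint64_t rela_info(uint32_t sym, uint32_t type) {
    return ((uint64_t)sym << 32) | type;
}
```

For the record above, rela_type returns 5 regardless of the symbol index, and that value is what the engine uses as the descriptor index.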

Step 1: Descriptor lookup. The relocation engine selects the CUDA descriptor table at off_1D3DBE0 and computes the descriptor address:

descriptor = off_1D3DBE0 + (5 << 6) = off_1D3DBE0 + 320

The 64-byte descriptor for type 5 contains (reconstructed from the type semantics):

Offset  Bytes (hex)                           Interpretation
------  -----------                           --------------
+0      xx xx xx xx xx xx xx xx xx xx xx xx   Header (12 bytes, used by sub_46ADC0)
+12     1A 00 00 00                           action[0].bit_offset  = 26
+16     20 00 00 00                           action[0].bit_width   = 32
+20     01 00 00 00                           action[0].action_type = 0x01 (ABS_FULL)
+24     00 00 00 00                           action[0].reserved    = 0
+28     00 00 00 00                           action[1].bit_offset  = 0
+32     00 00 00 00                           action[1].bit_width   = 0
+36     00 00 00 00                           action[1].action_type = 0x00 (END)
+40     00 00 00 00                           action[1].reserved    = 0
+44     00 00 00 00                           action[2].bit_offset  = 0
+48     00 00 00 00                           action[2].bit_width   = 0
+52     00 00 00 00                           action[2].action_type = 0x00 (END)
+56     00 00 00 00                           action[2].reserved    = 0
+60     xx xx xx xx                           Sentinel (4 bytes)

The engine positions its pointers: action_ptr = descriptor + 12, end_ptr = descriptor + 60.

Step 2: Value computation. The relocation is not absolute (is_absolute = false), so value = symbol_value = 0xDEAD_BEEF.

Step 3: Action dispatch. The engine reads action slot 0:

  • bit_offset = 26
  • bit_width = 32
  • action_type = 0x01 (ABS_FULL)

This is not the fast path (bit_offset != 0 || bit_width != 64), so the engine takes the narrow-field path.

Step 4: Extract old value. Since is_absolute == false, the engine extracts the existing 32-bit field from the instruction word. Assume the instruction word at patch_ptr is 0x0000_0000_0000_0000 (field pre-initialized to zero):

old = bitfield_extract(patch_ptr, 26, 32)

With end_bit = 26 + 32 = 58 <= 64, this is the single-word case:

old = *patch_ptr << (64 - 58) >> (64 - 32)
    = 0 << 6 >> 32
    = 0

The engine computes value = value + old = 0xDEAD_BEEF + 0 = 0xDEAD_BEEF and stores *output_value = 0.

Step 5: Write new value. The engine constructs a mask and writes the value into the instruction word. With bit_offset = 26, bit_width = 32, end_bit = 58:

mask = (-1ULL << (64 - 32)) >> (64 - 58)
     = 0xFFFF_FFFF_0000_0000 >> 6
     = 0x03FF_FFFF_FC00_0000

value_positioned = (0xDEAD_BEEF << (64 - 32)) >> (64 - 58)
                 = (0xDEAD_BEEF_0000_0000) >> 6
                 = 0x037A_B6FB_BC00_0000

*patch_ptr = (*patch_ptr & ~mask) | value_positioned
           = (0 & 0xFC00_0000_03FF_FFFF) | 0x037A_B6FB_BC00_0000
           = 0x037A_B6FB_BC00_0000

The 32 bits of 0xDEAD_BEEF now occupy bits [26:58) of the instruction word:

Bit layout of patched instruction word:
  bits [63:58] = 0b000000       (unchanged)
  bits [57:26] = 0xDEAD_BEEF    (patched value)
  bits [25:0]  = 0x0000000      (unchanged)

Step 6: Advance and terminate. The action pointer advances by 16 bytes to action slot 1, which has action_type = 0x00 (END). The engine skips to the next slot. The action pointer advances to offset +44 (action slot 2), which is also END. After advancing once more, action_ptr == end_ptr (offset +60), so the engine returns 1 (success).

Multi-word variant: If the relocation were R_CUDA_ABS47_34 (47-bit field at bit offset 34), the field would span two 64-bit words: bits [34:64) in word 0 (30 bits) and bits [0:17) in word 1 (17 bits). The engine would call bitfield_write which iterates: first writing the low 30 bits into word 0 at offset 34, then shifting value right by 30 and writing the remaining 17 bits into word 1 at offset 0.
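
The two-word split can be sketched directly. This is not the loop used by sub_4685B0; it is a simplified illustration restricted to fields of at most 64 bits inside a 128-bit instruction, with hypothetical names low_mask and field_write128.

```c
#include <assert.h>
#include <stdint.h>

/* n low bits set; handles n == 64 without an out-of-range shift. */
static uint64_t low_mask(unsigned n) {
    return n >= 64 ? ~0ull : ((1ull << n) - 1);
}

/* Writes `width` bits of `value` at bit `offset` of a 128-bit instruction
 * stored as two little-endian 64-bit words. */
static void field_write128(uint64_t words[2], uint64_t value,
                           unsigned offset, unsigned width) {
    assert(width >= 1 && width <= 64 && offset + width <= 128);
    if (offset >= 64) {                      /* wholly inside word 1 */
        uint64_t m = low_mask(width) << (offset - 64);
        words[1] = (words[1] & ~m) | ((value << (offset - 64)) & m);
        return;
    }
    unsigned in_low = 64 - offset;           /* bits available in word 0 */
    if (width <= in_low) {                   /* wholly inside word 0 */
        uint64_t m = low_mask(width) << offset;
        words[0] = (words[0] & ~m) | ((value << offset) & m);
        return;
    }
    /* Straddling case: low part goes to the top of word 0 ... */
    uint64_t m0 = low_mask(in_low) << offset;
    words[0] = (words[0] & ~m0) | ((value << offset) & m0);
    /* ... and the remaining high part goes to the bottom of word 1. */
    unsigned in_high = width - in_low;
    uint64_t m1 = low_mask(in_high);
    words[1] = (words[1] & ~m1) | ((value >> in_low) & m1);
}
```

For a 47-bit field at bit 34, in_low is 30 and in_high is 17, matching the split described above. Writing 47 one-bits into a zeroed instruction sets word 0 to 0xFFFFFFFC00000000 and word 1 to 0x1FFFF.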

Relocation Categories

The 119 types fall into 15 semantic categories based on their name prefix and purpose.

No-op and Sentinel

Index  Name              Description
-----  ----              -----------
0      R_CUDA_NONE       No relocation (placeholder / deleted)
116    R_CUDA_NONE_LAST  Sentinel marking end of valid type range

Full-Width Data Relocations

These apply to data sections (.nv.global, .nv.constant*, etc.) rather than instructions.

Index  Name          Bits  Description
-----  ----          ----  -----------
1      R_CUDA_32     32    32-bit absolute address
2      R_CUDA_32_HI  16    Upper 16 bits of 32-bit address
3      R_CUDA_32_LO  16    Lower 16 bits of 32-bit address
4      R_CUDA_64     64    64-bit absolute address

Byte-Level Relocations (R_CUDA_8_*)

Byte-granularity relocations for patching individual bytes within data structures, typically in descriptor tables or attribute sections.

Index  Name         Byte offset  Description
-----  ----         -----------  -----------
5      R_CUDA_8_0   0            Byte at offset 0
6      R_CUDA_8_8   1            Byte at offset 8 bits
7      R_CUDA_8_16  2            Byte at offset 16 bits
8      R_CUDA_8_24  3            Byte at offset 24 bits
9      R_CUDA_8_32  4            Byte at offset 32 bits
10     R_CUDA_8_40  5            Byte at offset 40 bits
11     R_CUDA_8_48  6            Byte at offset 48 bits
12     R_CUDA_8_56  7            Byte at offset 56 bits

Global Address Relocations (R_CUDA_G*)

Used for global memory address references. These are the primary relocations for .nv.global section symbols.

Index  Name          Bits  Description
-----  ----          ----  -----------
13     R_CUDA_G32    32    32-bit global address
14     R_CUDA_G64    64    64-bit global address
15     R_CUDA_G8_0   8     Global byte at offset 0
16     R_CUDA_G8_8   8     Global byte at offset 8
17     R_CUDA_G8_16  8     Global byte at offset 16
18     R_CUDA_G8_24  8     Global byte at offset 24
19     R_CUDA_G8_32  8     Global byte at offset 32
20     R_CUDA_G8_40  8     Global byte at offset 40
21     R_CUDA_G8_48  8     Global byte at offset 48
22     R_CUDA_G8_56  8     Global byte at offset 56

Absolute Instruction Relocations (R_CUDA_ABS*)

These patch absolute addresses into SASS instruction bit-fields. The first number is the bit-width of the value; the second is the starting bit position within the instruction word.

Index  Name                Width  Bit pos  Description
-----  ----                -----  -------  -----------
23     R_CUDA_ABS16_20     16     20       16-bit absolute at bit 20
24     R_CUDA_ABS16_23     16     23       16-bit absolute at bit 23
25     R_CUDA_ABS16_26     16     26       16-bit absolute at bit 26
26     R_CUDA_ABS16_32     16     32       16-bit absolute at bit 32
27     R_CUDA_ABS20_44     20     44       20-bit absolute at bit 44
28     R_CUDA_ABS24_20     24     20       24-bit absolute at bit 20
29     R_CUDA_ABS24_23     24     23       24-bit absolute at bit 23
30     R_CUDA_ABS24_26     24     26       24-bit absolute at bit 26
31     R_CUDA_ABS24_32     24     32       24-bit absolute at bit 32
32     R_CUDA_ABS24_40     24     40       24-bit absolute at bit 40
33     R_CUDA_ABS32_20     32     20       32-bit absolute at bit 20
34     R_CUDA_ABS32_23     32     23       32-bit absolute at bit 23
35     R_CUDA_ABS32_26     32     26       32-bit absolute at bit 26
36     R_CUDA_ABS32_32     32     32       32-bit absolute at bit 32
37     R_CUDA_ABS32_HI_20  16     20       High 16 bits of 32-bit absolute at bit 20
38     R_CUDA_ABS32_HI_23  16     23       High 16 bits of 32-bit absolute at bit 23
39     R_CUDA_ABS32_HI_26  16     26       High 16 bits of 32-bit absolute at bit 26
40     R_CUDA_ABS32_HI_32  16     32       High 16 bits of 32-bit absolute at bit 32
41     R_CUDA_ABS32_LO_20  16     20       Low 16 bits of 32-bit absolute at bit 20
42     R_CUDA_ABS32_LO_23  16     23       Low 16 bits of 32-bit absolute at bit 23
43     R_CUDA_ABS32_LO_26  16     26       Low 16 bits of 32-bit absolute at bit 26
44     R_CUDA_ABS32_LO_32  16     32       Low 16 bits of 32-bit absolute at bit 32
45     R_CUDA_ABS47_34     47     34       47-bit absolute at bit 34 (sm_75+ wide immediate)
46     R_CUDA_ABS55_16_34  55     34       55-bit absolute at bit 34 (16-bit aligned)
47     R_CUDA_ABS56_16_34  56     34       56-bit absolute at bit 34 (16-bit aligned)

The HI/LO variants are used in instruction pairs where a 32-bit address is split across two instructions: one loads the upper 16 bits (MOV32I or LUI-like) and the other loads the lower 16 bits.
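
The split itself is plain unsigned arithmetic. The sketch below shows only the value computation for a HI/LO pair, not the instruction patching; addr_hi16, addr_lo16, addr_join, and check_split_roundtrip are hypothetical names, and the sketch assumes a straightforward unsigned split with no carry adjustment between the halves.

```c
#include <assert.h>
#include <stdint.h>

/* Upper half, as a _HI relocation would supply it. */
static uint16_t addr_hi16(uint32_t addr) {
    return (uint16_t)(addr >> 16);
}

/* Lower half, as a _LO relocation would supply it. */
static uint16_t addr_lo16(uint32_t addr) {
    return (uint16_t)(addr & 0xFFFFu);
}

/* What the instruction pair reconstructs at run time. */
static uint32_t addr_join(uint16_t hi, uint16_t lo) {
    return ((uint32_t)hi << 16) | lo;
}

/* Sanity check: splitting and rejoining must be lossless. */
static void check_split_roundtrip(uint32_t addr) {
    assert(addr_join(addr_hi16(addr), addr_lo16(addr)) == addr);
}
```

For the address 0xDEADBEEF, the HI relocation would patch 0xDEAD and the LO relocation 0xBEEF into their respective 16-bit fields.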

The ABS47, ABS55, and ABS56 types are newer additions (sm_75+/sm_90+) that exploit wider immediate fields in Turing and later ISA encodings.

PC-Relative Relocations (R_CUDA_PCREL_*)

Index  Name                   Width  Bit pos  Description
-----  ----                   -----  -------  -----------
48     R_CUDA_PCREL_IMM24_23  24     23       PC-relative 24-bit immediate at bit 23
49     R_CUDA_PCREL_IMM24_26  24     26       PC-relative 24-bit immediate at bit 26

PC-relative relocations compute (S + A) - PC where PC is the address of the instruction being patched. These are used for branch instructions (BRA, BRX, CALL). The 24-bit width limits the branch offset to +/- 8M instructions (each instruction being 8 or 16 bytes depending on encoding).
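
The computation and the range limit of a signed 24-bit field can be sketched as follows. The helper names pcrel_value and fits_signed are hypothetical; the truncation to int32_t mirrors the pseudocode for action 0x10. Whether the engine itself performs a range check is not stated in this document, so fits_signed is an illustration of the limit rather than a description of nvlink behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* (int32_t)(S + A) - PC, as in the pc_rel action. */
static int64_t pcrel_value(uint64_t s_plus_a, int64_t pc) {
    return (int64_t)(int32_t)(uint32_t)s_plus_a - pc;
}

/* True if v is representable as a signed two's-complement `width`-bit value. */
static bool fits_signed(int64_t v, unsigned width) {
    assert(width >= 1 && width <= 63);
    int64_t lo = -((int64_t)1 << (width - 1));
    int64_t hi = ((int64_t)1 << (width - 1)) - 1;
    return v >= lo && v <= hi;
}
```

A signed 24-bit field covers -8,388,608 through 8,388,607, so a target at 0x1000 seen from PC 0x100 (offset 0xF00) fits comfortably, while an offset of exactly 2^23 does not.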

Function Descriptor Relocations (R_CUDA_FUNC_DESC_*)

These relocate references to function descriptor entries, used for indirect calls, virtual function tables, and device-side function pointers.

Index  Name                      Bits  Description
-----  ----                      ----  -----------
50     R_CUDA_FUNC_DESC_32       32    32-bit function descriptor reference
51     R_CUDA_FUNC_DESC_64       64    64-bit function descriptor reference
52     R_CUDA_FUNC_DESC_8_0      8     Descriptor byte at offset 0
53     R_CUDA_FUNC_DESC_8_8      8     Descriptor byte at offset 8
54     R_CUDA_FUNC_DESC_8_16     8     Descriptor byte at offset 16
55     R_CUDA_FUNC_DESC_8_24     8     Descriptor byte at offset 24
56     R_CUDA_FUNC_DESC_8_32     8     Descriptor byte at offset 32
57     R_CUDA_FUNC_DESC_8_40     8     Descriptor byte at offset 40
58     R_CUDA_FUNC_DESC_8_48     8     Descriptor byte at offset 48
59     R_CUDA_FUNC_DESC_8_56     8     Descriptor byte at offset 56
60     R_CUDA_FUNC_DESC32_20     32    At bit 20
61     R_CUDA_FUNC_DESC32_23     32    At bit 23
62     R_CUDA_FUNC_DESC32_32     32    At bit 32
63     R_CUDA_FUNC_DESC32_HI_20  16    At bit 20
64     R_CUDA_FUNC_DESC32_HI_23  16    At bit 23
65     R_CUDA_FUNC_DESC32_HI_32  16    At bit 32
66     R_CUDA_FUNC_DESC32_LO_20  16    At bit 20
67     R_CUDA_FUNC_DESC32_LO_23  16    At bit 23
68     R_CUDA_FUNC_DESC32_LO_32  16    At bit 32

The byte-level variants (FUNC_DESC_8_) are used for patching function descriptors in data sections rather than instruction immediate fields. The instruction variants (FUNC_DESC32_) patch call instructions that embed the function descriptor index.

Texture, Sampler, and Surface Relocations

These handle bindable resource references in SASS instructions.

Index  Name                        Description
-----  ----                        -----------
69     R_CUDA_TEX_HEADER_INDEX     Texture header index (binds to .nv.tex.header)
70     R_CUDA_TEX_SLOT             Texture slot number
71     R_CUDA_SAMP_HEADER_INDEX    Sampler header index
72     R_CUDA_SAMP_SLOT            Sampler slot number
73     R_CUDA_SAMP_HEADER_INDEX_0  Sampler header index (variant 0)
74     R_CUDA_SURF_HEADER_INDEX    Surface header index
75     R_CUDA_SURF_SLOT            Surface slot number
76     R_CUDA_SURF_HW_DESC         Surface hardware descriptor
77     R_CUDA_SURF_HW_SW_DESC      Surface combined hardware/software descriptor

These are resolved during linking by looking up the resource in the merged texture/sampler/surface header tables. The HEADER_INDEX types reference the global .nv.tex.header / .nv.samp.header / .nv.surf.header sections, while SLOT types reference the logical binding slot number.

Constant Bank Relocations (R_CUDA_CONST_FIELD*)

Relocations for references into constant memory banks (.nv.constant0, .nv.constant2, etc.).

Index  Name                     Width  Bit pos  Description
-----  ----                     -----  -------  -----------
78     R_CUDA_CONST_FIELD19_20  19     20       19-bit constant offset at bit 20
79     R_CUDA_CONST_FIELD19_23  19     23       19-bit constant offset at bit 23
80     R_CUDA_CONST_FIELD19_26  19     26       19-bit constant offset at bit 26
81     R_CUDA_CONST_FIELD19_28  19     28       19-bit constant offset at bit 28
82     R_CUDA_CONST_FIELD19_40  19     40       19-bit constant offset at bit 40
83     R_CUDA_CONST_FIELD21_20  21     20       21-bit constant offset at bit 20
84     R_CUDA_CONST_FIELD21_23  21     23       21-bit constant offset at bit 23
85     R_CUDA_CONST_FIELD21_26  21     26       21-bit constant offset at bit 26
86     R_CUDA_CONST_FIELD21_38  21     38       21-bit constant offset at bit 38
87     R_CUDA_CONST_FIELD22_37  22     37       22-bit constant offset at bit 37

A 19-bit field encodes 2^19 = 524,288 distinct offsets: 512 KB of constant memory if the field holds a byte offset, or 512 K dwords (2 MB) if it holds a 4-byte-aligned dword index. The 21-bit and 22-bit variants (sm_75+) expand the addressable range for larger constant banks.
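
The capacity of each field width is a power of two, which a short sketch makes concrete. The helper names field_capacity and fits_unsigned are hypothetical, and the sketch treats the field as an unsigned quantity, which is an assumption: this document does not state whether constant offsets are range-checked or how they are scaled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of distinct values an unsigned field of `width` bits can hold. */
static uint64_t field_capacity(unsigned width) {
    assert(width >= 1 && width <= 63);
    return 1ull << width;
}

/* True if `value` is representable in an unsigned `width`-bit field. */
static bool fits_unsigned(uint64_t value, unsigned width) {
    return value < field_capacity(width);
}
```

The three widths used by the CONST_FIELD types give 524,288 (19 bits), 2,097,152 (21 bits), and 4,194,304 (22 bits) encodable offsets.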

Bindless Texture Relocations (R_CUDA_TEX_BINDLESSOFF* / R_CUDA_BINDLESSOFF*)

Index  Name                         Width  Bit pos  Description
-----  ----                         -----  -------  -----------
88     R_CUDA_TEX_BINDLESSOFF13_32  13     32       Texture bindless offset at bit 32
89     R_CUDA_TEX_BINDLESSOFF13_41  13     41       Texture bindless offset at bit 41
90     R_CUDA_TEX_BINDLESSOFF13_45  13     45       Texture bindless offset at bit 45
91     R_CUDA_TEX_BINDLESSOFF13_47  13     47       Texture bindless offset at bit 47
92     R_CUDA_TEX_SLOT9_49          9      49       Texture slot 9-bit at bit 49
93     R_CUDA_BINDLESSOFF13_36      13     36       Generic bindless offset at bit 36
94     R_CUDA_BINDLESSOFF14_40      14     40       Generic bindless 14-bit offset at bit 40

Bindless texture relocations patch the bindless resource offset into texture sampling instructions. The 13-bit width supports up to 8192 unique textures per kernel launch. See Bindless Relocations for the full resolution pipeline.

Unified Table Relocations (R_CUDA_UNIFIED_*)

Relocations for the Unified Descriptor Table (UDT) and Unified Function Table (UFT). These are the primary relocation types for CUDA Dynamic Parallelism and indirect function calls through the unified tables.

Index  Name                    Bits     Description
-----  ----                    ----     -----------
95     R_CUDA_UNIFIED          special  Unified table reference (generic)
96     R_CUDA_UNIFIED_32       32       32-bit unified table offset
97     R_CUDA_UNIFIED32_HI_32  16       High 16 bits of unified table offset at bit 32
98     R_CUDA_UNIFIED32_LO_32  16       Low 16 bits of unified table offset at bit 32
99     R_CUDA_UNIFIED_8_0      8        Unified byte at offset 0
100    R_CUDA_UNIFIED_8_8      8        Unified byte at offset 8
101    R_CUDA_UNIFIED_8_16     8        Unified byte at offset 16
102    R_CUDA_UNIFIED_8_24     8        Unified byte at offset 24
103    R_CUDA_UNIFIED_8_32     8        Unified byte at offset 32
104    R_CUDA_UNIFIED_8_40     8        Unified byte at offset 40
105    R_CUDA_UNIFIED_8_48     8        Unified byte at offset 48
106    R_CUDA_UNIFIED_8_56     8        Unified byte at offset 56

During the relocation phase, unified relocations (types 102--113 in the internal remapping) are translated to their base equivalents. Relocations targeting synthetic symbols (__UFT_OFFSET, __UDT_OFFSET, __UFT_CANONICAL, __UDT_CANONICAL, __UDT, __UFT, __UFT_END, __UDT_END) are resolved to type 0 (no-op) because the unified table manager has already computed the final offsets.
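
The synthetic-symbol rule can be sketched as a name lookup. The symbol list below is taken from the paragraph above; the function names is_unified_synthetic and effective_type, and the use of exact string comparison, are assumptions made for this illustration and may not match how nvlink identifies these symbols internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Synthetic unified-table symbols listed in the text. */
static const char *const unified_synthetic[] = {
    "__UFT_OFFSET", "__UDT_OFFSET", "__UFT_CANONICAL", "__UDT_CANONICAL",
    "__UDT", "__UFT", "__UFT_END", "__UDT_END",
};

/* Assumption: symbols are matched by exact name. */
static bool is_unified_synthetic(const char *name) {
    assert(name != NULL);
    size_t n = sizeof unified_synthetic / sizeof unified_synthetic[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, unified_synthetic[i]) == 0)
            return true;
    }
    return false;
}

/* Relocations against synthetic symbols collapse to type 0 (no-op). */
static unsigned effective_type(unsigned type, const char *symbol_name) {
    return is_unified_synthetic(symbol_name) ? 0u : type;
}
```

A relocation against __UFT_OFFSET therefore becomes a no-op, while the same relocation type against an ordinary symbol is left unchanged.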

Instruction-Level Relocations (R_CUDA_INSTRUCTION*)

| Index | Name | Width | Description |
|---|---|---|---|
| 107 | R_CUDA_INSTRUCTION64 | 64 | Full 64-bit instruction replacement |
| 108 | R_CUDA_INSTRUCTION128 | 128 | Full 128-bit instruction replacement |

These are whole-instruction relocations that replace the entire instruction word. Used by the instruction-level optimization pass (peephole) and for instruction encoding conversions where the entire instruction must be rewritten (e.g., converting a 64-bit instruction to a 128-bit encoding or vice versa).

Yield Relocations (R_CUDA_YIELD_*)

| Index | Name | Description |
|---|---|---|
| 109 | R_CUDA_YIELD_OPCODE9_0 | Patch 9-bit opcode field at bit 0 for YIELD |
| 110 | R_CUDA_YIELD_CLEAR_PRED4_87 | Clear 4-bit predicate field at bit 87 for YIELD |

YIELD relocations convert YIELD instructions to NOP when forward-progress guarantees are not required. The relocation engine checks the forward-progress-required flag (ctx+94); if the flag is set, it skips the conversion and emits the trace message: "Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."
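The flag check amounts to a gate in front of the patch. A minimal sketch, assuming a hypothetical `struct link_ctx` that models only the flag discussed here (the real context layout is not reproduced, and `yield_reloc_applies` is an illustrative name):

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical context: only the forward-progress flag is modelled. */
struct link_ctx {
    bool forward_progress_required;
};

/* Returns true when the YIELD-to-NOP patch should be applied, false when
 * it must be skipped because forward progress is required. */
bool yield_reloc_applies(const struct link_ctx *ctx)
{
    if (ctx->forward_progress_required) {
        fprintf(stderr, "Ignoring the reloc to convert YIELD to NOP "
                        "due to forward progress requirement.\n");
        return false;
    }
    return true;
}
```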

Unused-Clear Relocations

| Index | Name | Width | Description |
|---|---|---|---|
| 111 | R_CUDA_UNUSED_CLEAR32 | 32 | Clear 32 bits (write zeros) |
| 112 | R_CUDA_UNUSED_CLEAR64 | 64 | Clear 64 bits (write zeros) |

These zero out fields that are no longer needed after linking, such as placeholder entries in merged data sections.
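As an illustration of the clearing behaviour, the sketch below zeroes a 32- or 64-bit span inside a section buffer with bounds checking. `clear_unused` and its parameters are hypothetical names chosen for this example, and byte-aligned fields are an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: zero `width_bits` (32 or 64) starting at byte
 * `offset` of a section buffer. Returns 0 on success, -1 when the width
 * is unsupported or the span would run past the end of the section. */
int clear_unused(unsigned char *section, size_t section_size,
                 size_t offset, unsigned width_bits)
{
    size_t nbytes = width_bits / 8;

    if (width_bits != 32 && width_bits != 64)
        return -1;
    if (offset > section_size || nbytes > section_size - offset)
        return -1;

    memset(section + offset, 0, nbytes);
    return 0;
}
```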

Miscellaneous Types

| Index | Name | Description |
|---|---|---|
| 113 | R_CUDA_QUERY_DESC21_37 | 21-bit query descriptor offset at bit 37 |
| 114 | R_CUDA_6_31 | 6-bit value at bit 31 |
| 115 | R_CUDA_2_47 | 2-bit value at bit 47 |

The QUERY_DESC type is used for CUDA's query descriptor mechanism. The 6-bit and 2-bit types are narrow-field relocations for specific instruction encoding fields in newer architectures.

Per-Architecture Vtable

In addition to the descriptor-table-driven application engine, nvlink maintains a per-architecture relocation vtable created by sub_459640 (16,109 bytes). This 632-byte table (79 function pointer slots, 8 bytes each) contains architecture-specific handler functions for relocation types that require different patching behavior across GPU generations.

The vtable is populated based on the target SM range:

| SM Range | Architecture | Notes |
|---|---|---|
| 30--39 | Kepler | Shared "legacy" handler set |
| 50--59 | Maxwell | Adds additional instruction-field handlers |
| 60--69 | Pascal | Adds 60+ series handlers, wider field support |
| 70--72, 73--79 | Volta/Turing | New instruction format, result[33]/result[34] populated |
| 80--88, 89 | Ampere/Ada | Adds bindless handlers, new field variants |
| 90--99 | Hopper | Major differences in slots 10/11/28/50--53, new desc types |
| 100--103, 110--121 | Mercury (Blackwell+) | Distinct handler for slot 13, new constant field sizes |

The vtable is allocated via sub_4307C0 (arena allocator) and the first 78 slots are populated. Slots that are not explicitly set remain NULL (zero), and the relocation engine skips NULL handlers. This is how unsupported relocation types are detected at runtime -- a NULL vtable entry for a required type triggers an error.
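A table of function pointers with NULL meaning "no handler" can be sketched as follows. The handler signature, the type names, and the `-1` return convention are assumptions made for illustration; only the slot count (79) comes from the text.

```c
#include <stddef.h>
#include <stdint.h>

#define RELOC_VTABLE_SLOTS 79   /* 632 bytes with 8-byte pointers */

/* Hypothetical handler signature: patch `insn` with `value`, return 0. */
typedef int (*reloc_handler_fn)(uint64_t *insn, uint64_t value);

struct reloc_vtable {
    reloc_handler_fn slot[RELOC_VTABLE_SLOTS];
};

/* Calls the handler in `slot`, or returns -1 when the slot index is out
 * of range or the entry is NULL (no handler for this architecture). The
 * caller decides whether a missing handler is an error. */
int dispatch_reloc(const struct reloc_vtable *vt, unsigned slot,
                   uint64_t *insn, uint64_t value)
{
    if (slot >= RELOC_VTABLE_SLOTS || vt->slot[slot] == NULL)
        return -1;
    return vt->slot[slot](insn, value);
}

/* Example handler for demonstration only: ORs the value into the word. */
int example_or_handler(uint64_t *insn, uint64_t value)
{
    *insn |= value;
    return 0;
}
```

Zero-initializing the structure gives every slot a NULL entry, so only the slots an architecture explicitly sets become callable.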

Mercury vs CUDA Type Mapping

Mercury (sm >= 100) uses a parallel set of relocation types offset by 0x10000. When the linker context's ELF class byte (offset +7) is 'A' (0x41, indicating Mercury), the relocation engine subtracts 0x10000 from the type and uses the Mercury descriptor table (off_1D3CBE0) instead of the CUDA table (off_1D3DBE0).

if (ctx->elf_class == 'A') {              // Mercury
    if (reloc_type <= 0x10000)
        fatal("unexpected reloc");          // types at or below 0x10000 are rejected
    reloc_type -= 0x10000;
    descriptor_table = off_1D3CBE0;         // Mercury table
} else {                                    // CUDA
    descriptor_table = off_1D3DBE0;         // CUDA table
}

Both tables have the same structure (64 bytes per entry) but different action encodings reflecting the different instruction formats between pre-Mercury (64-bit SASS) and Mercury (128-bit SASS) architectures.
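The 64-byte stride means a descriptor lookup is a shift by 6 from the table base. The sketch below combines the table selection with that indexing. The struct field names are hypothetical and follow the header/actions/sentinel split given under Key Facts; the tables are passed in as parameters rather than referenced by address, and no upper bounds check is modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 12-byte header, three 16-byte actions, 4-byte
 * sentinel. Field contents are left opaque. */
struct reloc_action {
    uint8_t raw[16];
};

struct reloc_descriptor {
    uint8_t header[12];
    struct reloc_action action[3];
    uint8_t sentinel[4];
};

_Static_assert(sizeof(struct reloc_descriptor) == 64,
               "descriptor must be 64 bytes");

#define MERCURY_TYPE_BASE 0x10000u

/* Returns the descriptor for `reloc_type`, or NULL when a Mercury context
 * ('A') sees a type that is not above the Mercury base. */
const struct reloc_descriptor *
select_descriptor(char elf_class, uint32_t reloc_type,
                  const struct reloc_descriptor *cuda_table,
                  const struct reloc_descriptor *mercury_table)
{
    const struct reloc_descriptor *table = cuda_table;

    if (elf_class == 'A') {
        if (reloc_type <= MERCURY_TYPE_BASE)
            return NULL;
        reloc_type -= MERCURY_TYPE_BASE;
        table = mercury_table;
    }
    /* type_index << 6 is the byte offset of a 64-byte entry. */
    return (const struct reloc_descriptor *)
        ((const uint8_t *)table + ((size_t)reloc_type << 6));
}
```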

Validation Error Messages

The validation infrastructure produces these diagnostic messages:

| Source function | Message | Condition |
|---|---|---|
| sub_42F6C0 | "unknown attribute" | Type index exceeds table bounds |
| sub_42F6C0 | "Relocation %s not supported on %s" | Architecture class mismatch |
| sub_42F760 | "unknown attribute" | Attribute type index > 96 |
| sub_42F760 | "Attribute %s not supported on %s" | Attribute arch class mismatch |
| sub_42F760 | "unknown usage" | Usage field has unrecognized value |
| sub_42F850 | "STO_CUDA_OBSCURE" | Symbol with obscure storage class |
| sub_469D60 | "unexpected reloc" | Mercury type found without Mercury context |

The error descriptors at unk_2A5BAB0 (warning) and unk_2A5BAC0 (error) control whether these diagnostics are warnings or fatal errors.
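The bounds conditions behind the "unknown attribute" diagnostic can be sketched as a range check against the two table sizes (0x75 standard entries, 0x41 attribute entries, attribute offset 0x10000). This is an illustration only: it assumes the table is chosen purely by whether the raw type is at or above 0x10000, and the enum and function names are hypothetical.

```c
#include <stdint.h>

enum reloc_check {
    RELOC_OK_STANDARD,   /* index valid in the standard table  */
    RELOC_OK_ATTRIBUTE,  /* index valid in the attribute table */
    RELOC_UNKNOWN        /* out of bounds: "unknown attribute" */
};

#define ATTR_TYPE_BASE   0x10000u
#define STD_TABLE_COUNT  0x75u    /* 117 entries, index 0..116 */
#define ATTR_TABLE_COUNT 0x41u    /* 65 entries, index 0..64   */

/* Classifies a raw type and, on success, stores the table index in
 * *index_out. *index_out is left untouched for unknown types. */
enum reloc_check classify_reloc_type(uint32_t type, uint32_t *index_out)
{
    if (type >= ATTR_TYPE_BASE) {
        uint32_t idx = type - ATTR_TYPE_BASE;
        if (idx >= ATTR_TABLE_COUNT)
            return RELOC_UNKNOWN;
        *index_out = idx;
        return RELOC_OK_ATTRIBUTE;
    }
    if (type >= STD_TABLE_COUNT)
        return RELOC_UNKNOWN;
    *index_out = type;
    return RELOC_OK_STANDARD;
}
```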

Full Type Catalog

The complete list of 119 R_CUDA relocation types extracted from nvlink v13.0.88, sorted by name:

| # | Name | Type |
|---|---|---|
| 1 | R_CUDA_2_47 | misc |
| 2 | R_CUDA_32 | data |
| 3 | R_CUDA_32_HI | data |
| 4 | R_CUDA_32_LO | data |
| 5 | R_CUDA_6_31 | misc |
| 6 | R_CUDA_64 | data |
| 7 | R_CUDA_8_0 | byte |
| 8 | R_CUDA_8_16 | byte |
| 9 | R_CUDA_8_24 | byte |
| 10 | R_CUDA_8_32 | byte |
| 11 | R_CUDA_8_40 | byte |
| 12 | R_CUDA_8_48 | byte |
| 13 | R_CUDA_8_56 | byte |
| 14 | R_CUDA_8_8 | byte |
| 15 | R_CUDA_ABS16_20 | abs-instr |
| 16 | R_CUDA_ABS16_23 | abs-instr |
| 17 | R_CUDA_ABS16_26 | abs-instr |
| 18 | R_CUDA_ABS16_32 | abs-instr |
| 19 | R_CUDA_ABS20_44 | abs-instr |
| 20 | R_CUDA_ABS24_20 | abs-instr |
| 21 | R_CUDA_ABS24_23 | abs-instr |
| 22 | R_CUDA_ABS24_26 | abs-instr |
| 23 | R_CUDA_ABS24_32 | abs-instr |
| 24 | R_CUDA_ABS24_40 | abs-instr |
| 25 | R_CUDA_ABS32_20 | abs-instr |
| 26 | R_CUDA_ABS32_23 | abs-instr |
| 27 | R_CUDA_ABS32_26 | abs-instr |
| 28 | R_CUDA_ABS32_32 | abs-instr |
| 29 | R_CUDA_ABS32_HI_20 | abs-instr |
| 30 | R_CUDA_ABS32_HI_23 | abs-instr |
| 31 | R_CUDA_ABS32_HI_26 | abs-instr |
| 32 | R_CUDA_ABS32_HI_32 | abs-instr |
| 33 | R_CUDA_ABS32_LO_20 | abs-instr |
| 34 | R_CUDA_ABS32_LO_23 | abs-instr |
| 35 | R_CUDA_ABS32_LO_26 | abs-instr |
| 36 | R_CUDA_ABS32_LO_32 | abs-instr |
| 37 | R_CUDA_ABS47_34 | abs-instr |
| 38 | R_CUDA_ABS55_16_34 | abs-instr |
| 39 | R_CUDA_ABS56_16_34 | abs-instr |
| 40 | R_CUDA_BINDLESSOFF13_36 | bindless |
| 41 | R_CUDA_BINDLESSOFF14_40 | bindless |
| 42 | R_CUDA_CONST_FIELD19_20 | const |
| 43 | R_CUDA_CONST_FIELD19_23 | const |
| 44 | R_CUDA_CONST_FIELD19_26 | const |
| 45 | R_CUDA_CONST_FIELD19_28 | const |
| 46 | R_CUDA_CONST_FIELD19_40 | const |
| 47 | R_CUDA_CONST_FIELD21_20 | const |
| 48 | R_CUDA_CONST_FIELD21_23 | const |
| 49 | R_CUDA_CONST_FIELD21_26 | const |
| 50 | R_CUDA_CONST_FIELD21_38 | const |
| 51 | R_CUDA_CONST_FIELD22_37 | const |
| 52 | R_CUDA_FUNC_DESC_32 | func-desc |
| 53 | R_CUDA_FUNC_DESC32_20 | func-desc |
| 54 | R_CUDA_FUNC_DESC32_23 | func-desc |
| 55 | R_CUDA_FUNC_DESC32_32 | func-desc |
| 56 | R_CUDA_FUNC_DESC32_HI_20 | func-desc |
| 57 | R_CUDA_FUNC_DESC32_HI_23 | func-desc |
| 58 | R_CUDA_FUNC_DESC32_HI_32 | func-desc |
| 59 | R_CUDA_FUNC_DESC32_LO_20 | func-desc |
| 60 | R_CUDA_FUNC_DESC32_LO_23 | func-desc |
| 61 | R_CUDA_FUNC_DESC32_LO_32 | func-desc |
| 62 | R_CUDA_FUNC_DESC_64 | func-desc |
| 63 | R_CUDA_FUNC_DESC_8_0 | func-desc |
| 64 | R_CUDA_FUNC_DESC_8_16 | func-desc |
| 65 | R_CUDA_FUNC_DESC_8_24 | func-desc |
| 66 | R_CUDA_FUNC_DESC_8_32 | func-desc |
| 67 | R_CUDA_FUNC_DESC_8_40 | func-desc |
| 68 | R_CUDA_FUNC_DESC_8_48 | func-desc |
| 69 | R_CUDA_FUNC_DESC_8_56 | func-desc |
| 70 | R_CUDA_FUNC_DESC_8_8 | func-desc |
| 71 | R_CUDA_G32 | global |
| 72 | R_CUDA_G64 | global |
| 73 | R_CUDA_G8_0 | global |
| 74 | R_CUDA_G8_16 | global |
| 75 | R_CUDA_G8_24 | global |
| 76 | R_CUDA_G8_32 | global |
| 77 | R_CUDA_G8_40 | global |
| 78 | R_CUDA_G8_48 | global |
| 79 | R_CUDA_G8_56 | global |
| 80 | R_CUDA_G8_8 | global |
| 81 | R_CUDA_INSTRUCTION128 | instr |
| 82 | R_CUDA_INSTRUCTION64 | instr |
| 83 | R_CUDA_NONE | sentinel |
| 84 | R_CUDA_NONE_LAST | sentinel |
| 85 | R_CUDA_PCREL_IMM24_23 | pc-rel |
| 86 | R_CUDA_PCREL_IMM24_26 | pc-rel |
| 87 | R_CUDA_QUERY_DESC21_37 | misc |
| 88 | R_CUDA_SAMP_HEADER_INDEX | sampler |
| 89 | R_CUDA_SAMP_HEADER_INDEX_0 | sampler |
| 90 | R_CUDA_SAMP_SLOT | sampler |
| 91 | R_CUDA_SURF_HEADER_INDEX | surface |
| 92 | R_CUDA_SURF_HW_DESC | surface |
| 93 | R_CUDA_SURF_HW_SW_DESC | surface |
| 94 | R_CUDA_SURF_SLOT | surface |
| 95 | R_CUDA_TEX_BINDLESSOFF13_32 | bindless |
| 96 | R_CUDA_TEX_BINDLESSOFF13_41 | bindless |
| 97 | R_CUDA_TEX_BINDLESSOFF13_45 | bindless |
| 98 | R_CUDA_TEX_BINDLESSOFF13_47 | bindless |
| 99 | R_CUDA_TEX_HEADER_INDEX | texture |
| 100 | R_CUDA_TEX_SLOT | texture |
| 101 | R_CUDA_TEX_SLOT9_49 | texture |
| 102 | R_CUDA_UNIFIED | unified |
| 103 | R_CUDA_UNIFIED_32 | unified |
| 104 | R_CUDA_UNIFIED32_HI_32 | unified |
| 105 | R_CUDA_UNIFIED32_LO_32 | unified |
| 106 | R_CUDA_UNIFIED_8_0 | unified |
| 107 | R_CUDA_UNIFIED_8_16 | unified |
| 108 | R_CUDA_UNIFIED_8_24 | unified |
| 109 | R_CUDA_UNIFIED_8_32 | unified |
| 110 | R_CUDA_UNIFIED_8_40 | unified |
| 111 | R_CUDA_UNIFIED_8_48 | unified |
| 112 | R_CUDA_UNIFIED_8_56 | unified |
| 113 | R_CUDA_UNIFIED_8_8 | unified |
| 114 | R_CUDA_UNUSED_CLEAR32 | clear |
| 115 | R_CUDA_UNUSED_CLEAR64 | clear |
| 116 | R_CUDA_YIELD_CLEAR_PRED4_87 | yield |
| 117 | R_CUDA_YIELD_OPCODE9_0 | yield |
| 118 | R_CUDA_INSTRUCTION64 | instr |
| 119 | R_CUDA_INSTRUCTION128 | instr |

Note: The string extraction reports 119 unique R_CUDA name strings. In the list above, however, rows 118 and 119 repeat R_CUDA_INSTRUCTION64 and R_CUDA_INSTRUCTION128 (rows 82 and 81), so only 117 distinct names are shown. Some types appear in both the standard and attribute tables with the same name but different table indices and different descriptor actions. The two tables together hold 117 + 65 = 182 slots, and many attribute slots share names with their standard counterparts.

Cross-References

Sibling Wiki

  • ptxas wiki: Relocations & Symbols -- how ptxas generates R_CUDA and R_MERCURY relocation entries during code emission (the producer side of what nvlink consumes)

Confidence Assessment

| Claim | Confidence | Evidence |
|---|---|---|
| sub_42F6C0 validates standard table at off_1D37600 (117 entries, limit 0x75) | HIGH | Decompiled sub_42F6C0_0x42f6c0.c line 26: a1 >= 0x75; line 25: &off_1D37600 |
| sub_42F6C0 validates attribute table at off_1D371E0 (65 entries, limit 0x41) | HIGH | Decompiled line 17: &off_1D371E0; line 18: a1 < 0x41 |
| Attribute type offset is 0x10000 | HIGH | Decompiled line 15: a1 -= 0x10000 |
| sub_42F8C0 arch class mapping: sm<=70 class 1, sm<=72 class 2, sm>=76 class 5, sm 73-75 class 3 | HIGH | Decompiled sub_42F8C0_0x42f8c0.c: 2*(a1>=76)+3 formula confirmed |
| sub_42F6C0 emits "unknown attribute" on bounds violation | HIGH | Decompiled line 21: exact string literal |
| Five architecture class names at off_1D371A0 | HIGH | Decompiled line 36: &off_1D371A0 + v10 |
| 64-byte descriptor entries at off_1D3DBE0 (CUDA) and off_1D3CBE0 (Mercury) | HIGH | Decompiled sub_468760: type_index << 6 indexing confirmed |
| Application engine sub_468760 at 0x468760 with 10-parameter signature | HIGH | Decompiled file exists with matching signature |
| 119 unique R_CUDA type name strings extracted from binary | HIGH | All names extracted from nvlink_strings.json; complete catalog verified |
| "STO_CUDA_OBSCURE" emitted by sub_42F850 | HIGH | String confirmed in nvlink_strings.json |
| Action types and aliases (0x01/0x12/0x2E, 0x06/0x37, 0x07/0x38) | MEDIUM | Reconstructed from decompiled sub_468760 switch; full per-action verification not performed |
| Per-architecture vtable with 79 slots from sub_459640 | MEDIUM | Function exists; slot count inferred from 632-byte allocation (79 * 8) |
| YIELD relocation forward-progress check at ctx+94 | MEDIUM | Reconstructed from decompiled relocation phase analysis |
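
The architecture class mapping listed for sub_42F8C0 can be written out as a short function. This is a sketch built only from the thresholds and the `2*(a1>=76)+3` formula quoted in the table; the function name is hypothetical and the decompiled control flow may be structured differently.

```c
/* Sketch of the SM-to-class mapping summarized above:
 *   sm <= 70      -> class 1
 *   sm 71..72     -> class 2
 *   sm 73..75     -> class 3
 *   sm >= 76      -> class 5
 */
int arch_class_for_sm(unsigned sm)
{
    if (sm <= 70)
        return 1;
    if (sm <= 72)
        return 2;
    /* (sm >= 76) evaluates to 0 or 1, giving 3 or 5. */
    return 2 * (sm >= 76) + 3;
}
```

As described, this mapping never yields class 4, even though five class names are listed at off_1D371A0.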