Line Table Merging

Line table merging is the process by which nvlink constructs DWARF .debug_line sections in the output cubin. Two distinct subsystems handle this depending on the compilation path: the linker-side merger (functions at 0x470000--0x480000) operates during regular linking, collecting per-CU line programs from input ELF objects and concatenating them into merged output sections; the LTO-side generator (functions at 0x12D0000--0x12D5000 and 0x181A320) operates during link-time optimization, building DWARF line tables from scratch for freshly compiled SASS code. Both paths ultimately produce two output sections: .debug_line (DWARF standard, indexed by a3 == 0) for source-level PTX line information, and .nv_debug_line_sass (NVIDIA extension, indexed by a3 > 0) for SASS-level machine instruction line mappings.

This page documents the LTO-side line table generator in reimplementation-grade detail, as this is the more complex and more heavily used path during --lto builds. For the linker-side approach (section-level concatenation), see Section Merging.

Key Facts

Property                      Value
----------------------------  -----
Header builder                sub_181A320 at 0x181A320 (16,225 bytes)
Line program encoder          sub_12D04E0 at 0x12D04E0 (33,592 bytes)
Master debug section builder  sub_12D2010 at 0x12D2010 (56,450 bytes)
State machine initializer     sub_12D1990 at 0x12D1990 (7,032 bytes)
State machine object          464 bytes, three 8 KB buffers + one 256 KB buffer
Directory deduplication       Hash map via sub_4489C0/sub_449A80/sub_448E70
DWARF version                 2 or 3, selected per compilation unit
Address size                  4-byte (DWARF32) or 8-byte (DWARF64)
Section names                 .debug_line (index 0), .nv_debug_line_sass (index > 0)
Grid dimension encoder        sub_12D4440 at 0x12D4440 (7,692 bytes)

Architecture Overview

The LTO line table pipeline has three phases:

  1. Collection (sub_12D2010): The master debug section builder iterates all functions in the compiled module. For each function, it walks the instruction stream (32-byte stride records), classifies debug record types (1, 49, 51, 52, 55, 56, 58, 83), and collects source location entries -- file index, line number, column, instruction address, is_stmt flag, and inline context id.

  2. Encoding (sub_12D04E0): The line program encoder receives the collected entries as a flat array of 12-byte records (file:u16, is_stmt:u16, line:u32, address:u32) and encodes them into DWARF line number program opcodes using the standard state machine model: special opcodes for compact line+address deltas, standard opcodes (DW_LNS_advance_pc, DW_LNS_advance_line, DW_LNS_set_file, DW_LNS_copy) for large deltas and file changes, and NVIDIA extended opcodes for inline context and is_stmt (prologue) changes. DW_LNS_negate_stmt is declared in the header but not emitted.

  3. Serialization (sub_181A320): The header builder assembles the final .debug_line or .nv_debug_line_sass section by concatenating the DWARF header (version, prologue length, header fields), directory table, file name table, and the encoded line program. It registers the section in the output ELF via sub_4411B0/sub_434BC0 and emits relocations as needed.

sub_12D2010 (master)
  |
  +---> iterate functions in module
  |       |
  |       +---> walk instruction records (32-byte stride)
  |       |       classify: type 1  = line entry
  |       |                 type 49 = inline start
  |       |                 type 51 = scope entry
  |       |                 type 52 = convergent boundary
  |       |                 type 55 = function end
  |       |                 type 56 = prologue marker
  |       |                 type 58 = debug frame entry
  |       |                 type 83 = SASS offset marker
  |       |
  |       +---> collect file entries, resolve $LDWend labels
  |       |
  |       +---> call sub_12D04E0 (encode line program)
  |
  +---> call sub_181A320 (build section header + serialize)

State Machine Object Layout

The DWARF line program state machine is allocated by sub_12D1990 as a 464-byte object with four attached buffers:

Offset     Size  Field                 Description
---------  ----  --------------------  -----------
+0         8     allocator             Memory allocator context pointer
+64        8     opcode_buffer         Pointer to 8,000-byte line program opcode buffer
+72        4     opcode_capacity       Current capacity (initially 8,000)
+76        1     min_instr_length      Minimum instruction length in bytes (initialized together with +77..+79 by one packed 4-byte store)
+77        1     default_is_stmt       Default value of is_stmt register
+78        1     line_base             Line base for special opcode computation
+79        1     line_range            Line range for special opcode computation
+80        1     opcode_base           Opcode base (10 for DWARF standard)
+82--90    9     std_opcode_lengths    Standard opcode operand count table
+96        8     directory_buffer      Pointer to 8,000-byte directory table buffer
+104       4     directory_capacity    Directory buffer capacity
+108       4     directory_size        Bytes written to directory buffer
+112       8     file_buffer           Pointer to 8,000-byte file name table buffer
+120       4     file_capacity         File buffer capacity
+124       4     file_size             Bytes written to file buffer
+128       8     program_buffer        Pointer to 256,000-byte assembled output
+136       8     program_size          Bytes written to program buffer
+160       1     is_64bit              DWARF64 mode flag (affects address encoding)
+176--200  --    file_entry_array      Pointer + count for internal file records
+208       8     total_program_length  Total length of encoded program data

The four buffers use a doubling growth strategy: when a write would exceed capacity, a new buffer at 2x the current size is allocated, the old data is copied, and the old buffer is freed via sub_431000.

Header Builder: sub_181A320

Signature

void dwarf_build_line_section(
    line_ctx*    ctx,          // a1: line table context (160-byte stride per index)
    elfw*        elf_writer,   // a2: output ELF wrapper
    uint32_t     section_index,// a3: 0 = .debug_line, >0 = .nv_debug_line_sass
    hash_map*    symbol_map,   // a4: directory deduplication source
    const char*  comp_dir      // a5: compilation directory (NULL if already set)
);

Section Selection

The function begins by choosing the target section name based on a3:

if (a3 == 0) {
    section_id = find_section_by_name(elf_writer, ".debug_line");
    if (!section_id)
        section_id = section_create(elf_writer, ".debug_line", 0, 1, 0);
} else {
    section_id = find_section_by_name(elf_writer, ".nv_debug_line_sass");
    if (!section_id)
        section_id = section_create(elf_writer, ".nv_debug_line_sass", 0, 1, 0);
}

This uses sub_4411B0 (find by name) and sub_434BC0 (create) -- the same section registry infrastructure documented in Section Merging.

Source File Collection

When source files have not yet been collected (indicated by the source file count at ctx + 160*a3 + 216 being zero for .debug_line, or ctx + 160*a3 + 376 for .nv_debug_line_sass), the header builder performs a full file scan:

  1. Retrieve file list: Call sub_449F00(a4) to extract all file entries from the input symbol map. Sort them via sub_4647D0 using comparator sub_181A2F0.

  2. Allocate file records: Allocate file_count * 40 bytes for the file record array (40 bytes per entry: filename pointer, directory pointer, directory index, file serial number, modification time, file size).

  3. Allocate directory pointer array: Allocate (file_count + 1) * 8 bytes, with slot 0 reserved (DWARF convention: directory index 0 means "current compilation directory").

  4. Split paths: For each file, separate the full path into a directory component and a filename component by scanning backwards for / or \:

for (each file entry) {
    path = entry->filename;
    len = strlen(path);
    // scan backwards from end for path separator
    for (i = len - 1; i >= 0; i--) {
        if (path[i] == '/' || path[i] == '\\')
            break;
    }
    if (i > 0) {
        dir_string  = strncpy(alloc(i+1), path, i);
        file_string = strcpy(alloc(len-i), &path[i+1]);
    } else {
        dir_string  = NULL;
        file_string = path;
    }
}
  5. Directory deduplication: Each unique directory is inserted into a hash map via sub_448E70(hash_map, dir_string, directory_index). If sub_449A80(hash_map, dir_string) returns a non-zero result, the existing directory index is reused. The directory_pointers array is populated only for new unique directories, starting at index 1.

  6. File metadata: For each file, if the input record does not carry modification time and file size, stat() is called on the original source path to obtain them. These are stored in the file record at offsets +24 (mtime) and +32 (size).
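The path-splitting and deduplication steps above can be sketched in portable C. This is an illustration only: split_path and intern_dir are hypothetical names, a linear search stands in for the hash map behind sub_448E70/sub_449A80, and malloc stands in for the arena allocator. The separator-at-index-0 case is treated as "no directory" to mirror the `i > 0` test in the pseudocode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split `path` at its last '/' or '\\'.
 * Returns a malloc'd directory string (caller frees), or NULL when the
 * path has no usable directory part; *base points at the filename. */
static char *split_path(const char *path, const char **base)
{
    assert(path != NULL && base != NULL);
    size_t i = strlen(path);
    while (i > 0 && path[i - 1] != '/' && path[i - 1] != '\\')
        i--;
    /* i is one past the last separator, or 0 when there is none. */
    if (i <= 1) {              /* no separator, or separator at index 0 */
        *base = path;
        return NULL;
    }
    size_t dir_len = i - 1;    /* directory excludes the separator */
    char *dir = malloc(dir_len + 1);
    if (dir == NULL)
        abort();
    memcpy(dir, path, dir_len);
    dir[dir_len] = '\0';
    *base = path + i;
    return dir;
}

/* Hypothetical stand-in for the hash map: returns the 1-based index of
 * `dir`, appending it when unseen. Slot 0 of `dirs` stays reserved for
 * the compilation directory; the caller sizes the array. */
static unsigned intern_dir(const char **dirs, unsigned *count, const char *dir)
{
    for (unsigned k = 1; k <= *count; k++)
        if (strcmp(dirs[k], dir) == 0)
            return k;          /* existing directory: reuse its index */
    *count += 1;
    dirs[*count] = dir;
    return *count;
}
```

Splitting "/src/kernels/a.cu" with this sketch yields the directory "/src/kernels" and the basename "a.cu"; interning the same directory twice returns index 1 both times.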

Directory Table Encoding

The directory table is a sequence of null-terminated strings written into the directory buffer:

directory_1\0 directory_2\0 ... directory_N\0 \0

The trailing \0 marks the end of the directory table (DWARF convention). If only a single directory exists, the table is empty (just the terminal \0).

File Name Table Encoding

The file name table encodes each file as:

filename\0  dir_index(ULEB128)  mod_time(ULEB128)  file_size(ULEB128)

Each filename is the basename (without directory), and dir_index references the directory table (1-based). Modification time and file size are encoded as ULEB128 via sub_4FA8B0. On encoding failure, a diagnostic is emitted through sub_467460:

  • "when generating LEB128 number for timestamp" -- mtime encoding overflow
  • "when generating LEB128 number for file size" -- file size encoding overflow
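A minimal sketch of one file name table entry follows. It uses the textbook ULEB128 algorithm rather than the binary's sub_4FA8B0, whose exact signature and error reporting are not reproduced here; encode_file_entry is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard ULEB128: 7 payload bits per byte, high bit set on every byte
 * except the last. Returns the number of bytes written (at most 10). */
static size_t uleb128_encode(uint64_t value, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t byte = (uint8_t)(value & 0x7F);
        value >>= 7;
        if (value != 0)
            byte |= 0x80;      /* more bytes follow */
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Hypothetical helper: write filename\0, dir_index, mod_time, file_size.
 * `out` must have room for strlen(name) + 1 + 30 bytes. */
static size_t encode_file_entry(uint8_t *out, const char *name,
                                uint64_t dir_index, uint64_t mtime,
                                uint64_t size)
{
    assert(out != NULL && name != NULL);
    size_t n = strlen(name) + 1;   /* include the terminating NUL */
    memcpy(out, name, n);
    n += uleb128_encode(dir_index, out + n);
    n += uleb128_encode(mtime, out + n);
    n += uleb128_encode(size, out + n);
    return n;
}
```

For "a.cu" in directory 1 with mtime 0 and size 300, the entry is 9 bytes: the five bytes of "a.cu\0", then 01, 00, and AC 02 for the size.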

DWARF Header Assembly

After the directory and file tables are built, the function assembles the complete DWARF .debug_line section header:

Offset  Size    Field
------  ------  -----
0       4       total_length (entire section minus 4 bytes)
4       2       version (from ctx+68, typically 2 or 3)
6       4       header_length (bytes after this field up to the first opcode)
10      1       minimum_instruction_length (ctx+76)
11      1       default_is_stmt (ctx+77)
12      1       line_base (ctx+78)
13      1       line_range (ctx+79)
14      1       opcode_base (ctx+80)
15..    N       standard_opcode_lengths[opcode_base-1]
        M       directory_table (includes trailing \0)
        K       file_name_table (includes trailing \0)

The standard opcode lengths array has opcode_base - 1 entries, copied byte-by-byte from ctx+82..90. For opcode_base=10, this is 9 bytes covering DW_LNS_copy (opcode 1) through DW_LNS_fixed_advance_pc (opcode 9).

The total section is assembled into a single contiguous allocation (with 256 extra bytes of slack):

alloc_size = header_size + directory_size + file_size + program_size + 256;
output = arena_alloc(alloc_size);
memcpy(output + 0,  &total_length, 4);
memcpy(output + 4,  &version, 2);
memcpy(output + 6,  &header_length, 4);
memcpy(output + 10, header_fields, opcode_base + 4);
memcpy(output + hdr_end, directory_data, directory_size);
memcpy(output + dir_end, file_data, file_size);
memcpy(output + file_end, program_data, program_size);
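The length arithmetic behind those copies can be sketched as below. This is an assumption-laden illustration, not the binary's code: write_line_header and put_le are hypothetical names, the 32-bit DWARF header form is assumed, fields are written little-endian, and the parameter bytes are the values this page reports for sub_12D1990.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store the low `nbytes` bytes of v in little-endian order. */
static void put_le(uint8_t *p, uint32_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Hypothetical helper: write the header through standard_opcode_lengths.
 * Returns the bytes written: 10 fixed + 5 parameters + 9 lengths = 24. */
static size_t write_line_header(uint8_t *out, uint16_t version,
                                uint32_t dir_size, uint32_t file_size,
                                uint32_t program_size)
{
    /* min_instr_length, default_is_stmt, line_base (-5), line_range, opcode_base */
    static const uint8_t params[5] = { 1, 1, 0xFB, 14, 10 };
    static const uint8_t std_lengths[9] = { 0, 1, 1, 1, 1, 0, 0, 0, 1 };

    assert(out != NULL);
    /* header_length counts everything after its own field up to the program. */
    uint32_t header_length = 5u + 9u + dir_size + file_size;
    /* total_length counts everything after its own 4-byte field. */
    uint32_t total_length = 2u + 4u + header_length + program_size;

    put_le(out + 0, total_length, 4);
    put_le(out + 4, version, 2);
    put_le(out + 6, header_length, 4);
    memcpy(out + 10, params, sizeof params);
    memcpy(out + 15, std_lengths, sizeof std_lengths);
    return 24;
}
```

With a 1-byte directory table, a 12-byte file table, and a 40-byte program, header_length is 27 and total_length is 73.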

DWARF64 Support

When ctx + 160*a3 + 160 (the is_64bit flag at the next stride slot) is set, the header gains an additional 4-byte field: a .debug_str section reference is created via sub_4411B0/sub_434BC0, and a relocation entry is emitted via sub_469B50 to patch the string offset. The relocation type depends on sub_440260 (ELF class check): type 1 for 32-bit, type 65539 for 64-bit.

Relocation Emission

After the section data is written via sub_4343C0(elf_writer, section_id, 0, output, 0, 1, total_size), the function checks for pending relocations at ctx + 160*a3 + 88..100. Each relocation's offset is adjusted by adding the header size (since the program data starts after the header in the output section). Relocations are emitted via sub_4699B0, referencing target sections looked up by name through sub_4411B0 or created on demand via sub_440BE0.

Line Program Encoder: sub_12D04E0

Signature

uint64_t dwarf_emit_line_program(
    line_ctx*        ctx,         // a1: line table context
    line_state*      state,       // a2: 464-byte state machine object
    int              entry_count, // a3: number of line entries
    uint16_t*        entries,     // a4: array of 12-byte line entries
    const char*      comp_dir,    // a5: compilation directory name
    bool             is_64bit,    // a6: DWARF64 address encoding
    int64_t*         context_map, // a7: inline context index map (or NULL)
    bool             has_is_stmt  // a8: entries carry is_stmt flag
);

Entry Format

Each line entry in the a4 array is 12 bytes (accessed as uint16_t[6]):

Offset  Size  Field
------  ----  -----
0       2     file_number (1-based DWARF file index)
2       2     flags (bit 0 = is_stmt)
4       4     line_number (1-based source line)
8       4     address (instruction byte offset from function start)
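As a sketch, the 12-byte entry maps onto a C struct as follows. The type and function names are hypothetical, and the loader assumes the 32-bit fields are stored low word first (little-endian), which this page does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of one line entry as described in the table above. */
typedef struct {
    uint16_t file_number;   /* 1-based DWARF file index */
    uint16_t flags;         /* bit 0 = is_stmt */
    uint32_t line_number;   /* 1-based source line */
    uint32_t address;       /* byte offset from function start */
} line_entry;

_Static_assert(sizeof(line_entry) == 12, "line entry must be 12 bytes");

/* Rebuild an entry from the uint16_t[6] view used by the encoder.
 * Assumption: low 16-bit word first for the two 32-bit fields. */
static line_entry load_entry(const uint16_t w[6])
{
    line_entry e;
    assert(w != 0);
    e.file_number = w[0];
    e.flags       = w[1];
    e.line_number = (uint32_t)w[2] | ((uint32_t)w[3] << 16);
    e.address     = (uint32_t)w[4] | ((uint32_t)w[5] << 16);
    return e;
}

/* The is_stmt flag is the low bit of the flags word. */
static int entry_is_stmt(const line_entry *e)
{
    assert(e != 0);
    return e->flags & 1;
}
```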

Function Name Registration

Before encoding line entries, the encoder registers the current function's source file in the state machine:

  1. Allocate a slot in the file tracking array at state[22] (with capacity doubling at state[26]).
  2. Copy the compilation directory name a5 into arena memory.
  3. Write into the line program buffer: opcode 0 (extended opcode escape), then a length byte of 9 (8-byte address) or 5 (4-byte address), then sub-opcode 2 (DW_LNE_set_address). This is followed by the 8-byte or 4-byte address field that will be patched by relocation; the length counts the sub-opcode byte plus the address.
  4. Increment the file counter at state[25].

File Number Tracking

The encoder emits DW_LNS_set_file (opcode 4) whenever the file number changes between consecutive entries:

if (current_file != state->current_file) {
    uleb128_encode(current_file, &leb_buf, 255);
    emit_byte(state, 4);  // DW_LNS_set_file
    emit_bytes(state, leb_buf, leb_len);
    state->current_file = current_file;
}

The file number is ULEB128-encoded via sub_4FA8B0. On overflow, the diagnostic "when generating LEB128 number for file number" is emitted.

Prologue Tracking

When has_is_stmt (a8) is set and the is_stmt flag in the current entry differs from the state machine's current value (state[39]), the encoder emits the NVIDIA extended opcode 0x92 carrying the new value, rather than the standard DW_LNS_negate_stmt toggle:

if (has_is_stmt && entry->is_stmt != state->is_stmt_current) {
    emit_byte(state, 0);     // extended opcode escape
    length_offset = state->write_pos;  // remember where the length byte goes
    emit_byte(state, 0);     // placeholder for length
    emit_byte(state, 0x92);  // NVIDIA-extended: set prologue
    // ULEB128 encode the is_stmt value
    uleb128_encode(entry->is_stmt, &leb_buf, 255);
    emit_bytes(state, leb_buf, leb_len);
    // patch the length byte: operand bytes + 1 for the sub-opcode
    state->buffer[length_offset] = leb_len + 1;
    state->is_stmt_current = entry->is_stmt;
}

The opcode 0x92 (146 decimal) is an NVIDIA-proprietary extended opcode not part of the DWARF standard.

Inline Context Tracking

When a context map (a7) is provided and the inline context changes, the encoder emits an NVIDIA-extended opcode 0x90 (144 decimal):

if (context_map && context_map[entry_index].context != state->current_context) {
    new_context     = context_map[entry_index].context;
    function_offset = context_map[entry_index].func_offset;
    emit_byte(state, 0);     // extended escape
    length_offset = state->write_pos;  // remember where the length byte goes
    emit_byte(state, 0);     // placeholder for length
    emit_byte(state, 0x90);  // NVIDIA-extended: set context
    uleb128_encode(new_context, &leb_buf, 255);
    emit_bytes(state, leb_buf, leb_len);
    uleb128_encode(function_offset, &leb_buf2, 255);
    emit_bytes(state, leb_buf2, leb_len2);
    // patch length: both operands + 1 for the sub-opcode
    state->buffer[length_offset] = leb_len + leb_len2 + 1;
    state->current_context = new_context;
}

The diagnostics for these ULEB128 encodings are:

  • "when generating LEB128 number for setting context" -- context index overflow
  • "when generating LEB128 number for setting function Offset" -- function offset overflow

Line and Address Advance Encoding

The core of the DWARF line program encoding uses the standard special opcode scheme. For each entry, the encoder computes:

line_delta   = entry->line - state->current_line;
addr_delta   = entry->address - state->current_address;

If state->current_address < 0 (uninitialized), the encoder uses standard opcodes for the first entry:

  1. DW_LNS_advance_pc (opcode 2): Encode addr_delta as ULEB128.
  2. DW_LNS_advance_line (opcode 3): Encode line_delta as SLEB128 (signed).
  3. DW_LNS_copy (opcode 1): Emit a matrix row.

For subsequent entries, the encoder attempts to use special opcodes for compact encoding:

if ((unsigned)(line_delta + 5) <= 13) {  // line_base <= line_delta < line_base + line_range
    special = 14 * addr_delta + (line_delta + 5) + 10;  // (line_delta - line_base) + opcode_base
    if (special > 0 && special <= 255) {
        emit_byte(state, special);
        // single byte encodes both line and address advance
        goto done;
    }
}

The special opcode formula is: opcode = (line_delta - line_base) + (line_range * addr_delta) + opcode_base. With line_base = -5, line_range = 14, and opcode_base = 10, this becomes opcode = (line_delta + 5) + 14 * addr_delta + 10. This matches the DWARF standard formula with the NVIDIA-chosen parameters.

When the special opcode does not fit (line delta too large, address delta too large, or either is negative), the encoder falls back to standard opcodes:

  • Line advance only (addr_delta == 0): Emit DW_LNS_advance_line + SLEB128, then DW_LNS_copy.
  • Address advance only (line_delta == 0): Emit DW_LNS_advance_pc + ULEB128. If the context map indicates an end-of-sequence marker, also emit DW_LNS_copy.
  • Both (fallback): Emit DW_LNS_advance_line + SLEB128, then DW_LNS_advance_pc + ULEB128, then DW_LNS_copy.
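The special-opcode attempt and the three fallback cases can be sketched as one function. This is a simplified illustration with hypothetical names: it models only non-negative address deltas, and it always emits DW_LNS_copy in the address-only case, whereas the page describes that copy as conditional on an end-of-sequence indication. The LEB128 routines are the textbook algorithms, not sub_4FA8B0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_BASE = -5, LINE_RANGE = 14, OPCODE_BASE = 10 };

static size_t uleb128(uint64_t v, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t b = (uint8_t)(v & 0x7F);
        v >>= 7;
        out[n++] = (uint8_t)(b | (v ? 0x80 : 0));
    } while (v);
    return n;
}

static size_t sleb128(int64_t v, uint8_t *out)
{
    size_t n = 0;
    for (;;) {
        uint8_t b = (uint8_t)(v & 0x7F);
        v >>= 7;   /* assumes arithmetic shift, as on mainstream compilers */
        int done = (v == 0 && !(b & 0x40)) || (v == -1 && (b & 0x40));
        out[n++] = (uint8_t)(b | (done ? 0 : 0x80));
        if (done)
            return n;
    }
}

/* Hypothetical helper: encode one row advance into `out` (>= 24 bytes). */
static size_t encode_row(uint8_t *out, int64_t line_delta, uint64_t addr_delta)
{
    size_t n = 0;
    int64_t adj = line_delta - LINE_BASE;          /* 0..13 when in window */
    assert(out != NULL);
    if (adj >= 0 && adj < LINE_RANGE && addr_delta <= 17) {
        uint64_t special = (uint64_t)adj + LINE_RANGE * addr_delta + OPCODE_BASE;
        if (special <= 255) {
            out[n++] = (uint8_t)special;           /* advances both, emits row */
            return n;
        }
    }
    if (line_delta != 0) {
        out[n++] = 3;                              /* DW_LNS_advance_line */
        n += sleb128(line_delta, out + n);
    }
    if (addr_delta != 0) {
        out[n++] = 2;                              /* DW_LNS_advance_pc */
        n += uleb128(addr_delta, out + n);
    }
    out[n++] = 1;                                  /* DW_LNS_copy */
    return n;
}
```

For example, a line delta of 1 with an address delta of 2 fits in the single byte 44, while a line delta of 100 with no address change needs four bytes: 03 E4 00 01.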

The associated diagnostics are:

  • "when generating LEB128 number for line advance" -- line SLEB128 overflow
  • "when generating LEB128 number for address advance" -- address ULEB128 overflow

End of Sequence

After all entries are processed, the encoder emits the DWARF end-of-sequence marker:

emit_byte(state, 0);  // extended escape
emit_byte(state, 1);  // length = 1
emit_byte(state, 1);  // DW_LNE_end_sequence

The state machine registers are then reset: state[29] = 0xFFFFFFFF00000001 (line=1, address=0xFFFFFFFF), state[32] = 0 (context cleared).

Master Debug Section Builder: sub_12D2010

Signature

void dwarf_build_debug_sections(
    int64_t     module_ctx,    // a1: module-level compilation context
    uint64_t    a2,            // a2: pointer to function descriptor array
    int64_t     a3,            // a3: directory hash map (or 0 for serial mode)
    int64_t     a4,            // a4: reserved
    const char* comp_dir       // a5: compilation directory string
);

Overview

This 56 KB function is the largest in the entire DWARF subsystem. It orchestrates the complete pipeline: parsing debug records from the compiled instruction stream, building source file maps, handling inline expansion boundaries, and dispatching to sub_12D04E0 for line program encoding.

Instruction Record Scanning

The function dereferences a2 to obtain a pointer to the function's instruction record array. Each record is 32 bytes (accessed with stride 16 as uint16_t*), and the first 16-bit word is a record type discriminator:

Type  Meaning                Processing
----  ---------------------  ----------
1     Line entry             Extracts line number from offset +2 (32-bit), accumulates
49    Inline function start  Pushes inline context, tracks nesting
51    Scope entry            Records debug scope boundary
52    Convergent boundary    Marks convergent control flow region
55    Function end           Closes current function's line table
56    Prologue end           Marks end of function prologue for debugger
58    Debug frame entry      Generates .debug_frame relocations via sub_12B8650
83    SASS offset marker     Records SASS byte offset for .nv_debug_line_sass

Debug Frame Integration

When type-58 records are present alongside type-55 (function end) and type-52 (convergent boundary), the builder generates .debug_frame cross-references. For each frame entry, it calls:

sub_12B8650(module, reloc_type, ".debug_frame", ".debug_frame",
            entry_offset - addr_size, target_offset + base_addr);

The reloc_type is 2 for 32-bit or 4 for 64-bit (based on byte at *module + 4). This creates relocations that allow the debugger to correlate .debug_frame unwind entries with .debug_line addresses.

Serial vs. Parallel Mode

The builder operates in two modes based on whether module_ctx[2] (the parallel dispatch function pointer) is NULL:

Serial mode (module_ctx[2] == 0): Calls sub_12D04E0 directly with a single state machine at module_ctx + 16 (byte offset +128 from the 8-byte-strided context).

Parallel mode (module_ctx[2] != 0): Creates additional data structures for thread-safe operation:

  1. Allocates a hash map via sub_4489C0(sub_44E1C0, sub_44E1E0, 0x400) where sub_44E1C0/sub_44E1E0 are mutex lock/unlock functions and 0x400 (1024) is the initial bucket count.
  2. Allocates three dynamically-growing arrays via sub_464AE0(8, ...) for tracking file indices, inline contexts, and address mappings.
  3. Calls the parallel dispatch function at module_ctx[2] with:
    • The function index
    • Output pointers for line data, address data, source file data
    • The is_stmt flag
    • String buffer for filename parsing
    • The five tracking arrays
    • A flag word at &v484

The parallel dispatch function populates the tracking arrays, and the builder then processes them in a single-threaded pass to construct the final line program.

Inline Context Handling

For each compiled function, the builder maintains an inline context stack. When sub_464BB0 (array size query) reports entries in the context tracking array (v401), the builder iterates them in reverse order to build a context map:

for (i = array_size(v401) - 1; i >= 0; i--) {
    context_id  = array_get(v401, i);
    existing    = hash_lookup(context_map, context_id);
    if (existing) {
        // reuse existing context index
    } else {
        // assign new context index
        context_entry[context_index].file      = array_get(v440, i);
        context_entry[context_index].line       = array_get(v417, i);
        context_entry[context_index].func_addr  = function_addresses[i];
        hash_insert(context_map, context_id, context_index);
        // if source file has "+" in name, parse base address via sscanf("%llu")
    }
}

The sscanf call for "+" in filenames handles NVIDIA's convention of encoding inlined function base addresses in the source file name: "filename.cu+12345" means the inlined function starts at byte offset 12345.
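A minimal sketch of that suffix parse, assuming the offset follows the last "+" in the name (the page does not say whether the binary searches for the first or the last one). parse_inline_base is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 and stores the decimal offset when
 * `name` has the form "file+<digits>", otherwise returns 0. */
static int parse_inline_base(const char *name, unsigned long long *offset)
{
    assert(name != NULL && offset != NULL);
    const char *plus = strrchr(name, '+');   /* assumption: last '+' wins */
    if (plus == NULL)
        return 0;
    /* sscanf returns 1 only when a number was actually converted. */
    return sscanf(plus + 1, "%llu", offset) == 1;
}
```

Given "filename.cu+12345" the helper stores 12345; a name without "+" leaves the offset untouched and reports no match.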

$LDWend Label Resolution

A special label $LDWend is used to mark the end of the DWARF info contribution for a compilation unit. When the builder encounters this symbol in the input symbol map:

if (symbol_name && strcmp(symbol_name, "$LDWend") == 0) {
    end_address = symbol_value;
}

This end address is used to determine where the last line table entry should be placed, ensuring the line table covers the full range of the function's SASS instructions.

Final Assembly

After all functions have been processed, the builder calls sub_12D04E0 one final time to encode the accumulated line entries. It passes the complete sorted entry array (with inline context map if parallel mode was used), producing the encoded DWARF line program in the state machine's output buffer. The state machine buffer is then consumed by sub_181A320 during the header serialization phase.

NVIDIA-Proprietary Extended Opcodes

nvlink's DWARF line tables include two proprietary extended opcodes not defined in the DWARF standard. Both are emitted using the standard DWARF extended opcode mechanism: a 0x00 escape byte, followed by a ULEB128 length, followed by the sub-opcode byte and its operands. Standard DWARF consumers that encounter these opcodes will skip them correctly because the length prefix allows unknown extended opcodes to be passed over.

Both opcodes are emitted by the LTO-side encoder (sub_12D04E0) and by the linker-side merger (sub_480570). They appear only in .nv_debug_line_sass sections (the NVIDIA SASS-level line tables), never in standard .debug_line sections, because they carry NVIDIA-specific inline context and statement metadata that standard DWARF debuggers do not understand.
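To illustrate why the length prefix lets a consumer pass over 0x90 and 0x92, here is a small sketch of the skipping step. The names are hypothetical, and the code omits bounds checks that a real reader would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Textbook ULEB128 decode; returns the number of bytes consumed. */
static size_t uleb128_decode(const uint8_t *p, uint64_t *value)
{
    size_t n = 0;
    unsigned shift = 0;
    uint8_t byte;
    *value = 0;
    do {
        byte = p[n++];
        if (shift < 64)
            *value |= (uint64_t)(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    return n;
}

/* `p` points at the 0x00 escape byte. Stores the sub-opcode in *sub and
 * returns the total size of the extended opcode, so a consumer can step
 * to p + size without understanding the operands. */
static size_t extended_opcode_size(const uint8_t *p, uint8_t *sub)
{
    uint64_t len;
    assert(p != NULL && sub != NULL && p[0] == 0x00);
    size_t n = 1 + uleb128_decode(p + 1, &len);   /* escape + length field */
    *sub = p[n];                                  /* first byte covered by len */
    return n + (size_t)len;
}
```

Applied to the byte sequence 00 04 90 03 80 02 followed by 00 01 01, the first call reports sub-opcode 0x90 and size 6, which lands the reader exactly on the end-of-sequence opcode.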

DW_LNE_NV_set_context (0x90)

Sets the inline context for subsequent line table rows. This opcode tracks which inlined function call site the following instructions belong to.

Encoding:

Byte 0:      0x00        (extended opcode escape)
Byte 1:      length      (ULEB128 = context_leb_len + offset_leb_len + 1)
Byte 2:      0x90        (sub-opcode: DW_LNE_NV_set_context)
Bytes 3..N:  context_id  (ULEB128 — inline context index into the context table)
Bytes N+1..: func_offset (ULEB128 — byte offset of the inlined function)

The context_id operand is a zero-based index into the per-CU inline context table. Each entry in this table records the source file, source line, and base address of an inlined call site. When context_id is 0, the state machine returns to the top-level (non-inlined) function context.

The func_offset operand is the byte offset of the inlined function within the compilation unit. For inlined functions whose source file name follows NVIDIA's "filename.cu+12345" convention, this offset is parsed from the +-delimited suffix via sscanf("%llu").

Emission logic in sub_12D04E0 (lines 668--766 of the decompiled source):

// When context map is present and context changes
if (context_map && context_map[entry_index].context != state->current_context) {
    emit_byte(state, 0x00);                     // extended escape
    length_offset = state->write_pos;
    emit_byte(state, 0x00);                     // placeholder for length
    emit_byte(state, 0x90);                     // DW_LNE_NV_set_context

    uleb128_encode(context_map[entry_index].context, &leb_buf, 255);
    emit_bytes(state, leb_buf, leb_len);        // context_id
    ctx_len = leb_len;

    uleb128_encode(context_map[entry_index].func_offset, &leb_buf, 255);
    emit_bytes(state, leb_buf, leb_len);        // func_offset

    // patch length = ctx_len + offset_len + 1 (for sub-opcode byte)
    state->buffer[length_offset] = ctx_len + leb_len + 1;
    state->current_context = context_map[entry_index].context;
}

The linker-side emitter (sub_480570 at lines 289--355) uses an identical encoding but sources the context index from a lookup table at a2 + 72 and the function offset from v7[2].m128i_u32[2]. The diagnostic strings for encoding failures are "when generating LEB128 number for setting context" and "when generating LEB128 number for setting function Offset".

State machine effect: Updates the internal current_context register (state[32]) and current_func_offset register (state[33]). Does not emit a matrix row.

DW_LNE_NV_set_stmt (0x92)

Sets the is_stmt (is-statement) flag for subsequent line table rows. While standard DWARF provides DW_LNS_negate_stmt (opcode 6) to toggle the flag, NVIDIA uses this extended opcode to set it to an explicit value, which avoids ambiguity in the presence of complex inlining where the toggle semantic becomes error-prone.

Encoding:

Byte 0:    0x00        (extended opcode escape)
Byte 1:    length      (ULEB128 = is_stmt_leb_len + 1)
Byte 2:    0x92        (sub-opcode: DW_LNE_NV_set_stmt)
Bytes 3..: is_stmt_val (ULEB128 — 0 = not a statement, 1 = statement)

The is_stmt_val operand is the low bit of the entry's flags field (entry->flags & 1). In practice, the encoded ULEB128 is always a single byte (0 or 1), so the total extended opcode sequence is 4 bytes: {0x00, 0x02, 0x92, 0x00} or {0x00, 0x02, 0x92, 0x01}.

Emission logic in sub_12D04E0 (lines 504--563 of the decompiled source):

// When has_is_stmt flag is set and is_stmt value differs from state
if (has_is_stmt && (entry->flags & 1) != state->is_stmt_current) {
    emit_byte(state, 0x00);                     // extended escape
    length_offset = state->write_pos;
    emit_byte(state, 0x00);                     // placeholder (skips +1 byte)
    emit_byte(state, 0x92);                     // DW_LNE_NV_set_stmt

    uleb128_encode(entry->flags & 1, &leb_buf, 255);
    emit_bytes(state, leb_buf, leb_len);        // is_stmt value

    // patch length = leb_len + 1 (for sub-opcode byte)
    state->buffer[length_offset] = leb_len + 1;
    state->is_stmt_current = entry->flags & 1;
}

Note: the decompiled code writes the sub-opcode as *(_BYTE *)(v83 + v82 + 1) = -110, where -110 as a signed byte is 0x92 (146 unsigned). The length placeholder at v82 (one position before) is patched after the operand is encoded.

The diagnostic string for encoding failure is "when generating LEB128 number for setting prologue", which reveals NVIDIA's internal nomenclature: they refer to this opcode as the "prologue" marker, reflecting its primary use case of distinguishing function prologue instructions (where is_stmt = 0) from body instructions (where is_stmt = 1).

State machine effect: Updates the internal is_stmt_current register (state[39]). Does not emit a matrix row.

Extended Opcode Summary

Sub-opcode             Hex   Decimal  Signed byte  Name                Operands
---------------------  ----  -------  -----------  ------------------  --------
DW_LNE_end_sequence    0x01  1        1            End sequence        (none)
DW_LNE_set_address     0x02  2        2            Set address         4-byte or 8-byte address
DW_LNE_NV_set_context  0x90  144      -112         Set inline context  ULEB128 context_id, ULEB128 func_offset
DW_LNE_NV_set_stmt     0x92  146      -110         Set is_stmt flag    ULEB128 is_stmt_val

Sub-opcodes 0x01 and 0x02 are standard DWARF. Sub-opcodes 0x90 and 0x92 are NVIDIA-proprietary. The gap at 0x91 is unused in the current binary -- no emitter or consumer references this value.

Wire Format Examples

Set inline context to index 3, function offset 0x100:

00              extended escape
04              length = 4 bytes (1 sub-opcode + 1 context_id + 2 func_offset)
90              DW_LNE_NV_set_context
03              context_id = 3 (ULEB128)
80 02           func_offset = 256 (ULEB128: 0x80 0x02)

Set is_stmt to 1 (mark as statement):

00              extended escape
02              length = 2 bytes (1 sub-opcode + 1 value)
92              DW_LNE_NV_set_stmt
01              is_stmt = 1 (ULEB128)

Set is_stmt to 0 (mark as non-statement / prologue):

00              extended escape
02              length = 2 bytes
92              DW_LNE_NV_set_stmt
00              is_stmt = 0 (ULEB128)

End of sequence:

00              extended escape
01              length = 1 byte
01              DW_LNE_end_sequence
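The byte sequences above can be reproduced with a short sketch that follows the placeholder-then-patch scheme described for sub_12D04E0. Function names are hypothetical, and the ULEB128 routine is the textbook algorithm rather than the binary's encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static size_t uleb128_encode(uint64_t v, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t b = (uint8_t)(v & 0x7F);
        v >>= 7;
        out[n++] = (uint8_t)(b | (v ? 0x80 : 0));
    } while (v);
    return n;
}

/* Emit 00 <len> 90 <context_id> <func_offset>; `out` needs >= 23 bytes. */
static size_t emit_nv_set_context(uint8_t *out, uint64_t context_id,
                                  uint64_t func_offset)
{
    size_t n = 0;
    assert(out != NULL);
    out[n++] = 0x00;                 /* extended opcode escape */
    size_t length_pos = n++;         /* placeholder, patched below */
    out[n++] = 0x90;                 /* DW_LNE_NV_set_context */
    n += uleb128_encode(context_id, out + n);
    n += uleb128_encode(func_offset, out + n);
    /* length = sub-opcode byte + operand bytes; at most 21, so one byte */
    out[length_pos] = (uint8_t)(n - length_pos - 1);
    return n;
}

/* Emit 00 <len> 92 <is_stmt>; `out` needs >= 4 bytes. */
static size_t emit_nv_set_stmt(uint8_t *out, int is_stmt)
{
    size_t n = 0;
    assert(out != NULL);
    out[n++] = 0x00;
    size_t length_pos = n++;
    out[n++] = 0x92;                 /* DW_LNE_NV_set_stmt */
    n += uleb128_encode(is_stmt ? 1 : 0, out + n);
    out[length_pos] = (uint8_t)(n - length_pos - 1);
    return n;
}
```

Calling emit_nv_set_context with context 3 and offset 0x100 produces the six bytes of the first example, and emit_nv_set_stmt with 1 produces 00 02 92 01.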

Standard DWARF Opcodes Used

The complete set of standard DWARF opcodes emitted by nvlink:

Opcode                   Value    Operands     Meaning
-----------------------  -------  -----------  -------
DW_LNS_copy              1        0            Emit a row to the line table matrix
DW_LNS_advance_pc        2        1 (ULEB128)  Advance address register by operand bytes
DW_LNS_advance_line      3        1 (SLEB128)  Advance line register by signed operand
DW_LNS_set_file          4        1 (ULEB128)  Set file register to operand (1-based)
DW_LNS_set_column        5        1 (ULEB128)  Set column register (declared in header, not emitted in practice)
DW_LNS_negate_stmt       6        0            Toggle is_stmt register (declared in header, not emitted -- 0x92 used instead)
DW_LNS_set_basic_block   7        0            Set basic_block flag (declared in header, not emitted)
DW_LNS_const_add_pc      8        0            Advance address by (255 - opcode_base) / line_range (declared, not emitted)
DW_LNS_fixed_advance_pc  9        1 (uhalf)    Advance address by fixed 16-bit value (declared, not emitted)
Special opcodes          10--255  0            Compact line + address advance, emit row

In practice, the encoder only emits opcodes 1--4 and special opcodes 10--255. Opcodes 5--9 are declared in the standard opcode lengths table (required by the DWARF header format) but never generated. The DW_LNS_negate_stmt toggle (opcode 6) is superseded by the DW_LNE_NV_set_stmt (0x92) extended opcode for explicit value setting.

Standard Opcode Lengths Table

The header declares 9 standard opcodes (opcode_base - 1) with the following operand counts, initialized at sub_12D1990 offset +82 as the packed value 0x101010100:

std_opcode_lengths[0] = 0   // DW_LNS_copy:            no operands
std_opcode_lengths[1] = 1   // DW_LNS_advance_pc:      1 ULEB128
std_opcode_lengths[2] = 1   // DW_LNS_advance_line:    1 SLEB128
std_opcode_lengths[3] = 1   // DW_LNS_set_file:        1 ULEB128
std_opcode_lengths[4] = 1   // DW_LNS_set_column:      1 ULEB128
std_opcode_lengths[5] = 0   // DW_LNS_negate_stmt:     no operands
std_opcode_lengths[6] = 0   // DW_LNS_set_basic_block: no operands
std_opcode_lengths[7] = 0   // DW_LNS_const_add_pc:    no operands
std_opcode_lengths[8] = 1   // DW_LNS_fixed_advance_pc: 1 uhalf

DWARF Parameters

nvlink uses the following DWARF line number program parameters, confirmed by the state machine initializer at sub_12D1990 (packed as 0x0EFB0101 at byte offset +76):

| Parameter | Value | Offset | Description |
|---|---|---|---|
| minimum_instruction_length | 1 | +76 | Minimum instruction unit in bytes |
| default_is_stmt | 1 | +77 | All lines are statements by default |
| line_base | -5 | +78 (signed) | Minimum line delta encodable in special opcode |
| line_range | 14 | +79 | Number of line values encodable per address step |
| opcode_base | 10 | +80 | First special opcode value |
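
The packed constant can be decoded mechanically. The sketch below is illustrative only: the struct and function names are hypothetical, and the byte order is assumed to be little-endian, which is what makes 0x0EFB0101 line up with offsets +76 through +79.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four parameters packed into one 32-bit store. */
struct line_params {
    uint8_t min_inst_length;  /* offset +76 */
    uint8_t default_is_stmt;  /* offset +77 */
    int8_t  line_base;        /* offset +78, signed */
    uint8_t line_range;       /* offset +79 */
};

struct line_params decode_line_params(uint32_t packed)
{
    struct line_params p;
    uint8_t raw_base = (uint8_t)(packed >> 16);

    p.min_inst_length = (uint8_t)packed;
    p.default_is_stmt = (uint8_t)(packed >> 8);
    /* Sign-extend by hand so the result does not rely on
       implementation-defined narrowing: 0xFB becomes -5. */
    p.line_base  = (int8_t)(raw_base >= 0x80 ? (int)raw_base - 256 : (int)raw_base);
    p.line_range = (uint8_t)(packed >> 24);

    /* line_range is later used as a divisor, so it must be nonzero. */
    assert(p.line_range != 0);
    return p;
}
```

Passing 0x0EFB0101 yields 1, 1, -5, and 14, matching the table. opcode_base (10) sits at +80 and is not part of this 32-bit constant.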

Special Opcode Computation

The special opcode formula follows the DWARF standard exactly:

special_opcode = (line_delta - line_base) + (line_range * addr_delta) + opcode_base
               = (line_delta + 5) + (14 * addr_delta) + 10

The encoder applies this in two steps (decompiled line 772):

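// Check line delta fits in the window: line_base <= line_delta < line_base + line_range.
// The comparison must be unsigned so that a single test rejects both sides of the window.
if ((unsigned)(line_delta + 5) > 13)   // equivalently: line_delta > 8 or line_delta < -5
    goto use_standard_opcodes;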

// Compute special opcode
special = 14 * addr_delta + line_delta + opcode_base + 5;
if (special > 255)             // does not fit in a single byte
    goto use_standard_opcodes;

emit_byte(state, special);     // single byte encodes both deltas + emits row

The encodable ranges with these parameters:

| Property | Range | Notes |
|---|---|---|
| Line delta | -5 to +8 | 14 values (line_range) |
| Address delta | 0 to 17 | floor((255 - opcode_base) / line_range) = 17 |
| Special opcode range | 10 to 255 | 246 possible values |

When line or address deltas exceed these ranges, the encoder falls back to standard opcodes (DW_LNS_advance_line + DW_LNS_advance_pc + DW_LNS_copy). The fallback paths also handle the case where addr_delta == 0 (line advance only) or line_delta == 0 (address advance only), emitting only the necessary standard opcodes plus DW_LNS_copy.
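
The selection logic can be sketched as a small standalone C function together with its inverse. This is a minimal illustration under the parameters documented here (line_base = -5, line_range = 14, opcode_base = 10); the function names are hypothetical, and returning 0 to signal "use standard opcodes" is a convention chosen for this example, not something taken from the decompiled code.

```c
#include <assert.h>
#include <stdint.h>

enum { LINE_BASE = -5, LINE_RANGE = 14, OPCODE_BASE = 10 };

/* Returns the special opcode (10..255) for the given deltas, or 0 if the
   pair does not fit and the standard-opcode fallback is required. */
unsigned special_opcode_for(uint32_t addr_delta, int32_t line_delta)
{
    if (line_delta < LINE_BASE || line_delta >= LINE_BASE + LINE_RANGE)
        return 0;
    /* Reject large address deltas first so the arithmetic cannot overflow. */
    if (addr_delta > (255u - OPCODE_BASE) / LINE_RANGE)
        return 0;

    unsigned op = (unsigned)(line_delta - LINE_BASE)
                + LINE_RANGE * addr_delta + OPCODE_BASE;
    return op <= 255 ? op : 0;
}

/* Inverse mapping, as a DWARF consumer would apply it. */
void decode_special_opcode(unsigned op, uint32_t *addr_delta, int32_t *line_delta)
{
    assert(op >= OPCODE_BASE && op <= 255);
    unsigned adjusted = op - OPCODE_BASE;
    *addr_delta = adjusted / LINE_RANGE;
    *line_delta = LINE_BASE + (int32_t)(adjusted % LINE_RANGE);
}
```

Note that the two limits interact: at the maximum address delta of 17 the base value is already 14 * 17 + 10 = 248, so only line deltas from -5 to +2 still fit in a byte. A pair such as (17, +3) passes the line-window check but fails the 255 check and falls back to standard opcodes.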

Buffer Management

All three encoding functions use the same buffer growth pattern:

if (write_position + needed >= capacity - 1) {
    new_capacity = 2 * capacity;
    new_buffer = arena_alloc(new_capacity);
    memset(new_buffer, 0, new_capacity);       // zero-fill the entire new buffer
    memcpy(new_buffer, buffer_ptr, capacity);  // copy the old contents (old capacity)
    arena_free(buffer_ptr);                    // release the old buffer
    buffer_ptr = new_buffer;
    capacity   = new_capacity;
}

The initial buffer sizes are:

  • Opcode buffer: 8,000 bytes
  • Directory table: 8,000 bytes
  • File name table: 8,000 bytes
  • Assembled output: 256,000 bytes

For large compilation units with thousands of source lines (common in template-heavy CUDA code), the opcode buffer may grow to several megabytes through repeated doubling.
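
The growth pattern can be reproduced in portable C as follows. This is a sketch, not the nvlink implementation: the struct and function names are hypothetical, calloc and free stand in for the arena allocator, and the loop doubles repeatedly until the request fits, whereas the decompiled pattern above doubles once per check.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; field names are illustrative. */
struct byte_buf {
    unsigned char *data;
    size_t len;       /* bytes written so far */
    size_t capacity;  /* bytes allocated */
};

/* Ensure room for `needed` more bytes. Returns 0 on success, -1 on failure. */
int byte_buf_reserve(struct byte_buf *b, size_t needed)
{
    assert(b != NULL);
    size_t new_cap = b->capacity ? b->capacity : 16;

    /* Same one-byte slack as the original check (pos + needed >= cap - 1). */
    while (b->len + needed + 1 >= new_cap)
        new_cap *= 2;
    if (new_cap == b->capacity)
        return 0;                      /* already large enough */

    unsigned char *p = calloc(new_cap, 1);   /* zero-filled, like the memset */
    if (p == NULL)
        return -1;
    if (b->len != 0)
        memcpy(p, b->data, b->len);    /* only written bytes need copying */
    free(b->data);
    b->data = p;
    b->capacity = new_cap;
    return 0;
}

/* Append n bytes, growing the buffer if required. */
int byte_buf_append(struct byte_buf *b, const void *src, size_t n)
{
    if (byte_buf_reserve(b, n) != 0)
        return -1;
    if (n != 0)
        memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```

Starting from an 8,000-byte opcode buffer, ten doublings already reach roughly 8 MB, which is consistent with the multi-megabyte growth described above.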

LEB128 Encoding

The ULEB128 and SLEB128 encoding functions (sub_4FA8B0 and sub_4FA910) are shared infrastructure used across the entire DWARF subsystem:

  • sub_4FA8B0 (ULEB128): Encodes an unsigned integer into a caller-provided buffer. Returns 0 on success, 1 if the encoded value exceeds the 255-byte buffer limit.
  • sub_4FA910 (SLEB128): Encodes a signed integer using sign-extended 7-bit groups.

Both functions write to a stack-allocated 256-byte buffer and return the encoded length through a by-reference integer parameter (v316 in the decompiled output). The caller then copies the encoded bytes into the line program buffer.
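
For reference, standard ULEB128 and SLEB128 encoding looks like the sketch below. It mirrors the calling convention described above (0 on success, 1 on overflow, length returned by reference), but the names and exact signatures are hypothetical and are not those of sub_4FA8B0 or sub_4FA910.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode an unsigned value as ULEB128 into out[0..cap).
   Returns 0 on success, 1 if the buffer is too small. */
int uleb128_encode(uint64_t value, uint8_t *out, size_t cap, size_t *len)
{
    assert(out != NULL && len != NULL);
    size_t n = 0;
    do {
        uint8_t byte = (uint8_t)(value & 0x7f);
        value >>= 7;
        if (value != 0)
            byte |= 0x80;              /* more groups follow */
        if (n >= cap)
            return 1;
        out[n++] = byte;
    } while (value != 0);
    *len = n;
    return 0;
}

/* Encode a signed value as SLEB128 using sign-extended 7-bit groups. */
int sleb128_encode(int64_t value, uint8_t *out, size_t cap, size_t *len)
{
    assert(out != NULL && len != NULL);
    size_t n = 0;
    int more = 1;
    while (more) {
        uint8_t byte = (uint8_t)(value & 0x7f);
        /* Arithmetic shift right by 7, written portably. */
        value = value < 0 ? ~(~value >> 7) : value >> 7;
        /* Done once the remaining bits are pure sign extension. */
        if ((value == 0 && !(byte & 0x40)) || (value == -1 && (byte & 0x40)))
            more = 0;
        else
            byte |= 0x80;
        if (n >= cap)
            return 1;
        out[n++] = byte;
    }
    *len = n;
    return 0;
}
```

With these definitions, a DW_LNS_advance_line operand of -5 encodes to the single byte 0x7B, and the well-known test values 624485 (unsigned) and -123456 (signed) encode to E5 8E 26 and C0 BB 78 respectively.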

Interaction with Section Merging

During regular (non-LTO) linking, .debug_line sections from multiple input cubins are merged by the standard section merging infrastructure documented in Section Merging. The linker-side functions at 0x470000--0x480000 handle this:

| Function | Address | Role |
|---|---|---|
| debug_line_table_add_entry | sub_4776E0 | Builds line entries (56-byte records) |
| debug_line_program_header_grow | sub_477A60 | Grows header arrays (32-byte records) |
| debug_line_table_emit_filenames | sub_477E10 | Emits file name table from merged inputs |
| debug_line_program_serialize_ops | sub_4783C0 | Concatenates per-CU line programs |
| debug_line_info_encode | sub_478A20 | Encodes DWARF header from merged data |

These functions follow the same DWARF header format as the LTO path but operate on pre-parsed line table entries from input ELF .debug_line sections rather than generating them from scratch.

Sibling Wiki Cross-References

The line table data that nvlink merges originates from two upstream toolchain stages:

  • ptxas wiki -- Debug Information: ptxas generates both .debug_line (PTX-level) and .nv_debug_line_sass (SASS-level) sections. The SASS line table builder at sub_866BB0 produces the per-CU fragments that nvlink's LTO-side generator (sub_12D04E0) replaces during link-time optimization, and that the linker-side merger (sub_4776E0--sub_478A20) concatenates during regular linking.
  • cicc wiki -- Debug Info Pipeline: cicc emits .loc and .file directives in PTX output, which ptxas then converts to DWARF line program bytes. The -generate-line-info flag in cicc controls whether DILocation metadata is preserved through the optimizer, which directly determines whether nvlink receives any line table data to merge.

Confidence Assessment

| Claim | Confidence | Evidence |
|---|---|---|
| Header builder sub_181A320 at 0x181A320 (16,225 bytes) | HIGH | Decompiled file sub_181A320_0x181a320.c present at exact address |
| Line program encoder sub_12D04E0 at 0x12D04E0 (33,592 bytes) | HIGH | Decompiled file sub_12D04E0_0x12d04e0.c present at exact address |
| Master debug section builder sub_12D2010 (56,450 bytes) | HIGH | Decompiled file sub_12D2010_0x12d2010.c present at exact address |
| State machine initializer sub_12D1990 | HIGH | Decompiled code confirms allocation and buffer initialization pattern |
| Section selection: a3 == 0 for .debug_line, a3 > 0 for .nv_debug_line_sass | HIGH | Strings .nv_debug_line_sass at 0x241282C area; section name dispatch confirmed in decompiled sub_181A320 |
| DWARF parameters: line_base=-5, line_range=14, opcode_base=10 | HIGH | Packed constant 0x0EFB0101 documented at sub_12D1990 offset +76; the values decode correctly: 0x01 (min_instr_length=1), 0x01 (default_is_stmt=1), 0xFB (line_base=-5 signed), 0x0E (line_range=14) |
| Special opcode formula matches DWARF standard | HIGH | Formula (line_delta + 5) + 14 * addr_delta + 10 is standard DWARF with the documented parameters |
| NVIDIA extended opcode DW_LNE_NV_set_context (0x90) | HIGH | Encoding logic documented with decompiled line references; LEB128 diagnostic strings "when generating LEB128 number for setting context" at 0x1F25D40 and "when generating LEB128 number for setting function Offset" at 0x1F25D78 confirmed |
| NVIDIA extended opcode DW_LNE_NV_set_stmt (0x92) | HIGH | Diagnostic string "when generating LEB128 number for setting prologue" at 0x1F25CA0 confirmed; sub-opcode value -110 as signed byte = 0x92 is mathematically correct |
| $LDWend label resolution | HIGH | String $LDWend at 0x1F25DB2 confirmed in strings file |
| LEB128 encoding diagnostics (file number, timestamp, file size, address advance, line advance) | HIGH | All five diagnostic strings confirmed at 0x1F25C70--0x1F25D10 and 0x24127D0--0x2412800 |
| Linker-side merger functions at 0x470000--0x480000 | HIGH | All five functions confirmed in decompiled/: sub_4776E0, sub_477A60, sub_477E10, sub_4783C0, sub_478A20, plus sub_480570 |
| State machine object layout (464 bytes, four buffers) | MEDIUM | Decompiled sub_12D1990 confirms buffer allocation pattern; exact 464-byte size inferred from allocation calls and field layout but not directly readable from decompiled output |
| Grid dimension encoder sub_12D4440 (7,692 bytes) | HIGH | Decompiled file present at exact address |
| Parallel vs. serial mode dispatch | MEDIUM | Claim based on module_ctx[2] null check for dispatch; function pointer pattern is consistent but parallel thread safety details (hash map with mutex lock/unlock) are inferred from helper function addresses |
| Buffer growth: 8,000 / 8,000 / 8,000 / 256,000 initial sizes | MEDIUM | Initial sizes inferred from decompiled allocation calls in sub_12D1990; exact constants hard to read due to decompiler optimization of allocation parameters |