Layout Phase
The layout phase (sub_439830, 65,776 bytes at 0x439830) is the single largest phase in the linker core after merge. It assigns addresses to every data item across all NVIDIA GPU memory spaces -- global memory, shared memory (global, local, extern, and reserved), and constant banks -- merges overlapping data, deduplicates constants, resolves extern shared memory fixups, and lays out per-kernel constant sections. The function is called from main() immediately after the merge phase completes and before relocation application.
Despite the sweep identifying this function as "shared_memory_layout," the decompiled code reveals it is actually the unified layout engine for the entire device ELF: it handles global data, all four shared memory categories, constant bank layout, constant deduplication, OCG (object-code-generator) constant optimization, bindless resource allocation, reserved shared memory symbols, UFT section setup, and texture/sampler/surface resource counting. The name in the binary's debug trace is simply "layout".
Position in the Pipeline
Merge Phase (sub_45E7D0, per-object)
|
v
Layout Phase (sub_439830) <-- this page
|
+-- Global data merging (.nv.global)
+-- Shared memory allocation (global, local, extern, reserved)
+-- Constant bank layout & dedup
+-- Bindless resource processing
+-- UFT section setup
|
v
Relocation Phase (sub_469D60)
The function receives a single argument: the elfw (ELF wrapper) object that holds the merged output state. It returns after all addresses have been assigned and all layout optimizations have been applied.
Function Signature
// sub_439830 -- unified layout engine
// Address: 0x439830
// Size: 65,776 bytes (~2,173 lines of decompiled pseudocode)
// ~400 local variables
//
// a1: elfw* -- the output ELF wrapper object
// Returns: result of last operation (discarded by caller)
size_t __fastcall sub_439830(__int64 a1);
Key fields accessed on the elfw object a1:
| Offset | Type | Description |
|---|---|---|
+7 | byte | Architecture class (65 = 'A' for a-variant) |
+16 | int16 | Link mode (1 = relocatable, 0xFF00 = special) |
+48 | dword | Architecture flags bitmask |
+64 | byte | Debug/verbose flags (bit 1 = verbose layout trace) |
+80 | byte | EWP (early-write-pass) mode flag |
+81 | byte | Skip-global-init flag |
+83 | byte | Optimize-data-layout flag |
+85 | byte | No-shared-layout flag |
+87 | byte | Reserve-null-pointer flag |
+90 | byte | No-opt flag |
+91 | byte | Force OCG constant optimization flag |
+97 | byte | Merge-constants flag |
+100 | byte | Extended shared memory mode |
+216 | dword | Section index: .nv.global.init |
+220 | dword | Section index: secondary global init |
+228 | dword | Section index: .nv.shared (global shared) |
+248 | dword | Section index: .nv.reservedSmem |
+256 | list* | Per-entry shared memory section list |
+264 | list* | Extern shared memory section list |
+272 | list* | Per-entry constant bank section list |
+280 | list* | Per-entry local data section list |
+344 | vec* | Positive symbol array (symbols with index >= 0) |
+352 | vec* | Negative symbol array (symbols with index < 0) |
+360 | vec* | Section array (all section records) |
+376 | list* | Relocation record list |
+424 | list* | Texture descriptor list |
+432 | list* | Sampler descriptor list |
+440 | list* | Surface descriptor list |
+448 | list* | Global variable pending-merge list |
+488 | vtable* | Architecture profile vtable pointer |
+504 | dword | Syscall-const-offset value |
+512 | vec* | EWP function metadata vector |
+568 | dword | Specific constant section index |
+576 | htab* | Overlapped-constant dedup hash table |
High-Level Execution Flow
The function proceeds through ten sequential phases. Each phase handles one memory space or optimization pass. The order is critical because later phases depend on addresses assigned by earlier ones.
Phase 1: Global Data Merge (.nv.global)
if (elfw->pending_globals != NULL):
section = find_or_create_section(".nv.global", SHT_CUDA_GLOBAL=0x70000007)
walk symbol linked list at section+72 to find tail
for each pending global in elfw+448:
move symbol into .nv.global section (sub_440430)
copy data into section buffer (sub_433760)
advance to next
The function sub_433760 is the core data-copy primitive. It allocates a 40-byte data node, computes the aligned offset (current_size rounded up to the symbol's alignment), stores the data pointer and size, and appends the node to the section's symbol linked list. The section's total size (section+32) is updated to aligned_offset + data_size.
If .nv.global does not yet exist, it is created via sub_441AC0 with section type SHT_CUDA_GLOBAL (0x70000007), flags 3 (ALLOC|WRITE), and alignment 1.
Phase 2: Bindless Resource Processing
If the architecture supports bindless resources (checked via the vtable at elfw+488, offset 296), the layout phase processes texture, sampler, and surface descriptors:
- Specific constant section: If
elfw+568has a section index, callsub_4324B0to set up that section as the bindless data target. - Fallback scan: Otherwise, iterate all symbols looking for entry-point functions (symbol type check: bit 4 set and binding != LOCAL), and call
sub_4324B0on each matching section. - Descriptor list layout: Call
sub_433310three times to lay out the texture list (elfw+424), sampler list (elfw+432), and surface list (elfw+440). - Per-entry bindless section processing: For each entry in the constant bank list (
elfw+272), if the section type matches the bindless constant bank type (queried via vtable offset 304 or 312), callsub_438DD0to process bindless relocations. This function creates synthetic$NVLINKBINDLESSOFF_<name>symbols and rewrites relocation targets.
This phase is skipped entirely in relocatable link mode (link_mode == 1).
Phase 3: Global Init Section Layout
if (elfw+216 has a section index):
section = get_section_header(elfw, index)
sub_4325A0(elfw, section, 0) // lay out at offset 0
if (elfw+220 has a section index):
section = get_section_header(elfw, index)
sub_4325A0(elfw, section, 0) // lay out at offset 0
The function sub_4325A0 is the general-purpose section layout engine. It sorts the section's symbol list (via sub_4647D0 with comparator sub_432440), then iterates symbols computing aligned offsets:
for each symbol in section's symbol list:
if symbol has explicit alignment:
offset = roundup(current_offset, alignment)
else if symbol has a size:
align = min(size, 8)
offset = roundup(current_offset, align)
else:
assert(no_opt_mode) // "should only reach here with no opt"
symbol.value = offset
section_record.value = offset
current_offset = offset + symbol.size
// verbose: "variable %s at offset %d"
Phase 4: Global Shared Memory Layout
This is the most complex shared memory phase. It handles .nv.shared and related per-entry shared memory sections.
Step 4a: Build overlap sets. Create a set container (via sub_465020), then scan the relocation list (elfw+376) for relocations that reference entry-point function sections. For each such relocation, add its function ID to the overlap set via sub_4644C0.
Step 4b: Lay out global shared. If there is a global shared section (elfw+228):
-
Relocatable mode (
link_mode == 1): Walk the section's symbol list directly, assigning offsets from their pre-existing positions. For each symbol, setsection_record.value = symbol.offsetand track the maximum offset as the section's running size. -
No-opt mode (
elfw+90set): Callsub_4325A0directly for a simple linear layout. -
Optimized mode (default): Call
sub_436BD0(the shared memory optimizer, 15,711 bytes). This function builds an interference graph between shared memory variables based on which kernel entry points reference them. Variables that are never live simultaneously in any kernel can share the same address, reducing total shared memory consumption. The optimizer prints"global shared %s only used in entry %d"and"remove unused global shared %s"during verbose operation.
Step 4c: Per-entry shared memory layout. Iterate the per-entry shared section list (elfw+256). For each entry:
- Determine which entry function the shared section belongs to (via
sub_440590to get the section record, then extract the function index). - Check if the entry function reaches global shared memory using the overlap set (
sub_4655C0). If not, the entry's shared offset starts fresh rather than after the global shared region. - If the entry reaches global shared, its shared sections start after the global shared region's high-water mark.
- In relocatable mode: walk symbols directly assigning pre-existing offsets.
- Otherwise: if the entry is not from an EWP (early-write-pass) function, call
sub_4325A0to lay out the section. - Track the maximum shared memory address across all entries.
Verbose output includes "esh %s has offset %d", "shared entry %s:", "entry %s does not reach global shared", and "shared entry %s from EWP ignored".
Phase 5: Extern Shared Memory Fixups
Extern shared variables (__shared__ extern) have no static size -- their offset is determined at link time based on the total non-extern shared memory consumed. This phase resolves those addresses.
There are two distinct code paths:
Path A: No-opt mode (elfw+90 set). All extern shared variables are placed at the same offset (after global shared). The maximum alignment across all extern shared sections (minimum 16 bytes) is computed, the base offset is aligned accordingly, and every extern shared symbol receives that aligned offset. The per-entry shared sections also get their base offset updated.
Path B: Optimized mode (default, and the larger code path). This handles per-entry extern shared placement with overlap analysis:
- Count entries: Call
sub_464740to count the number of extern shared entries. - Allocate per-entry overlap sets: Create one overlap set per extern shared entry.
- Populate sets: Scan the relocation list. For each relocation that targets an extern shared section with zero data (i.e., truly extern), identify which entry function it belongs to and add a mapping from the extern shared symbol to each reachable entry's set.
- Enforce minimum alignment: For every extern shared section record, if alignment is less than 16, force it to 16.
- Iterative placement loop (
do { ... } while (v422)):- For each extern shared section, scan its overlap set to find the maximum offset of any overlapping shared region.
- Compute the extern shared offset:
max_overlap + 16 - (max_overlap % 16), with special handling for the reserve-null-pointer flag. - If the computed offset changed from the previous iteration, set a "changed" flag and iterate again.
- This fixed-point iteration converges because offsets can only grow.
- Verbose:
"extern shared variable %s at offset %lld".
- Overlapping set merging: If the EWP flag (
elfw+80) is set, additional logic detects and merges overlapping extern shared sets. It prints"overlapping sets %d and %d"and rewrites relocation targets. The format"reloc of extern shared %d replaced with symbol %d"appears in the verbose trace. - Final extern layout: For each merged set, iterate its overlap set members, query their shared section sizes via
sub_436A80, and assign final offsets. Verbose:"extern shared instance %d at offset %lld".
Phase 6: Shared Memory Cleanup and Reserved Shared
After the extern shared pass, if neither EWP nor relocatable mode is active and the no-shared-layout flag is clear, the global shared section is cleaned up: its data nodes are freed, the symbol list is cleared, and the section index at elfw+228 is zeroed.
Reserved shared memory (elfw+248): Three special symbols are looked up and assigned architecture-specific values from the profile vtable:
| Symbol name | Vtable offset | Meaning |
|---|---|---|
.nv.reservedSmem.begin | +600 | Start address of GPU-reserved shared memory |
.nv.reservedSmem.cap | +584 | Capacity of reserved shared region |
.nv.reservedSmem.offset0 | +608 | Base offset for reserved shared |
If the architecture is not an a-variant (i.e., elfw+7 != 'A'), and reserved shared sections exist, the reserved shared symbols are offset by the reservedSmem.offset0 value. Verbose: "shared variable %s updated offset to %lld".
Phase 7: Local Data Section Layout
for each entry in per-entry local data list (elfw+280):
// verbose: "local entry %s:"
section = get_section_header(elfw, entry_index)
sub_4325A0(elfw, section, 0)
Each per-entry local data section is laid out independently starting at offset 0 using the standard layout engine.
Phase 8: Overlapped Constant Replacement
If the overlapped-constant hash table (elfw+576) is populated (from the merge phase's constant dedup analysis), this phase rewrites relocations that target constants which were identified as overlapping with larger constants:
for each relocation in elfw+376:
dup_sym = lookup(overlap_table, reloc.symbol)
if found:
offset_delta = original_sym.value - dup_sym.value
rewrite reloc target to dup_sym
if offset_delta < 0:
fatal("overlapped offset < full offset?")
if offset_delta > 0:
reloc.addend += offset_delta
// also rewrite rela section index if needed
// verbose: "replace overlapped constant %d with %d, offset %d"
Phase 9: Per-Entry Constant Bank Layout
This phase handles the CUDA constant memory banks (.nv.constant0, .nv.constant2, etc.) that are split per kernel entry point. It contains four sub-paths:
Sub-path 9a: OCG (object-code-generator) constant sections. For each entry in the per-entry constant list (elfw+272), if the section type matches the OCG constant bank type (vtable offset 136), the section is laid out per entry function:
- Walk the callgraph edges (
sub_44C740) to find which entry functions reference this constant section. - For each reachable entry, create a per-entry constant section named
"<bank_type>.<entry_name>"(e.g.,".nv.constant0.my_kernel"). - Copy the constant data into the per-entry section using
sub_432B10(overlapping data merge). - Track the maximum constant section size across all entries.
- Align the per-entry copies to the maximum alignment.
- Copy relocations that reference the original OCG section into the per-entry copies, adjusting offsets.
- Verbose:
"OCG constant section %d reaches entry %d(%s)","need new ocg section %s","new ocg %s offset = %lld in %d","reset ocg constant reloc offset from %lld to %lld". - After creating all per-entry copies, the original shared OCG section's symbol list is freed and its data zeroed.
Sub-path 9b: Non-OCG constant sections. If the constant bank type matches the standard constant type (vtable offset 144 -- typically .nv.constant0), the section is laid out using sub_4325A0, with an optional syscall-const-offset value from elfw+504. Verbose: "constant entry %s:".
Sub-path 9c: Constant section merging. If the merge-constants flag (elfw+97) is set, the function creates a TEMP_MERGED_CONSTANTS temporary section, calls sub_4339A0 (the constant deduplication engine, 13,199 bytes) to merge all constant data into it, then replaces the original section's contents. This function identifies duplicate 32-bit and 64-bit values and aliases symbols. Verbose: "layout and merge section %s", "found duplicate value 0x%x, alias %s to %s".
Sub-path 9d: OCG constant optimization. If any OCG constant section exceeds the architecture's constant bank size limit (vtable offset 32), or if the force-ocg-optimization flag is set, the function enters the OCG constant optimization path:
- Create
TEMP_OCG_CONSTANTStemporary section. - Build a per-entry set of referenced constants.
- For each OCG constant section with data, call
sub_4339A0to deduplicate. - If the optimized size is within the bank size limit: replace all OCG constant section contents with the merged version. Copy the merged symbol list to each per-entry section.
- If the optimized size still exceeds the limit: discard the optimization attempt. Verbose:
"ocg const optimization didn't help so give up". - Verbose:
"optimize OCG constants for %s, old size = %lld","new OCG constant size = %lld".
Phase 10: Resource Counting and UFT Setup
Resource counting. If the architecture does not support bindless resources (vtable offset 296 returns false), allocate a counting structure (sub_432740) and call sub_432870 three times to count textures, samplers, and surfaces against their respective architecture limits (vtable offsets 40, 48, 56).
UFT section setup. Call sub_463F70 to create and validate the Unified Function Table sections (.nv.uft, .nv.udt, .nv.uft.entry, .nv.udt.entry). This function validates that the UFT entry window size matches the UFT section size and aligns the UDT section as needed.
This phase is skipped entirely in relocatable link mode.
Key Sub-Functions
| Address | Name | Size | Role |
|---|---|---|---|
0x4325A0 | section_layout_engine | ~1.4 KB | Sort symbols by alignment, assign offsets with padding |
0x433760 | section_data_copy | ~600 B | Allocate 40-byte data node, compute aligned offset, append to section |
0x436BD0 | shared_memory_optimizer | 15,711 B | Build interference graph, group non-overlapping shared vars |
0x432B10 | merge_overlapping_global | 11,683 B | Validate and merge overlapping data in .nv.global |
0x4339A0 | constant_dedup | 13,199 B | Find duplicate 32/64-bit constants, alias symbols |
0x438DD0 | process_bindless_refs | 12,779 B | Handle bindless texture/surface relocations |
0x4376D0 | extern_shared_set_builder | ~2 KB | Build overlap sets for extern shared placement |
0x436A80 | extern_shared_section_lookup | ~1 KB | Find per-entry shared section for a given entry function |
0x44DB00 | global_data_fixup | ~2 KB | Pre-layout global data initialization |
0x433310 | descriptor_list_layout | ~1 KB | Lay out texture/sampler/surface descriptor lists |
0x4324B0 | bindless_target_setup | ~1 KB | Set up constant section as bindless data target |
0x463F70 | uft_setup_sections | 3,978 B | Create/validate .nv.uft/.nv.udt sections |
0x432740 | resource_counter_create | ~1 KB | Allocate resource counting structure |
0x432870 | resource_counter_check | ~2 KB | Count resources against architecture limits |
The Section Layout Engine (sub_4325A0)
This is the workhorse function called repeatedly throughout the layout phase. It implements a sorted linear allocator:
sub_4325A0(elfw, section, initial_offset):
assert(section != NULL, "section not found")
if not extended_smem_mode or not arch_supports(section.type):
sort section.symbol_list by alignment (sub_4647D0)
current = initial_offset
for each symbol in section.symbol_list:
sym_record = get_sym_record(elfw, symbol.sym_index)
alignment = symbol.alignment // from symbol.offset+16
if alignment > 0:
current = roundup(current, alignment)
elif symbol.size > 0:
natural_align = min(symbol.size, 8)
current = roundup(current, natural_align)
else:
assert(no_opt_mode)
sym_record.value = current
symbol.value = current
// verbose: "variable %s at offset %d"
current += symbol.size
section.total_size = current
return current
The Data Copy Primitive (sub_433760)
This function appends a data contribution to a section:
sub_433760(elfw, section_index, source_sym, alignment, data_ptr):
section = get_section_header(elfw, section_index)
if section is NULL: return
if alignment > section.alignment:
section.alignment = alignment
node = arena_alloc(40) // 40-byte data node
node.source_sym = source_sym
node.data_ptr = 0
node.alignment = alignment
node.data = data_ptr
node.size = 0
// compute aligned offset within section
current_size = section.size
remainder = current_size % alignment
if remainder:
current_size = current_size + alignment - remainder
node.offset = current_size
section.size = current_size + data_size
// append to section's linked list (tail pointer at section+80)
if section.symbol_list is empty:
prepend(node, section.symbol_list)
section.tail = section.symbol_list
else:
assert(section.tail != NULL, "tail data node not found")
append_after(node, section.tail)
section.tail = node
Shared Memory Optimization (sub_436BD0)
The shared memory optimizer is a graph-coloring-based algorithm that reduces total shared memory consumption by aliasing non-overlapping variables. Two shared memory variables can share the same address if no kernel entry point uses both simultaneously.
The algorithm:
- For each shared variable, determine which entry functions reference it (via callgraph traversal).
- Build an interference graph: two variables interfere if any entry references both.
- Variables that are only referenced by a single entry are candidates for entry-local placement. Verbose:
"global shared %s only used in entry %d". - Unreferenced variables are removed entirely. Verbose:
"remove unused global shared %s". - Non-interfering variables are grouped, and the group's address is the maximum of the individual requirements. Verbose:
"allocate to group %d".
Constant Deduplication (sub_4339A0)
The constant deduplication engine finds identical constant values and merges them:
- Create a temporary section
TEMP_MERGED_CONSTANTSorTEMP_USER_DATA. - Iterate all symbols in the constant section.
- For each symbol, check if an identical value already exists in the dedup hash table (
sub_448E70). - If a duplicate 32-bit value is found: alias the symbol. Verbose:
"found duplicate value 0x%x, alias %s to %s". - If a duplicate 64-bit value is found: alias the symbol. Verbose:
"found duplicate 64bit value 0x%llx, alias %s to %s". - Unused constants (determined by
sub_43FB70reachability check) are marked as dead:sym.binding = LOCAL. Verbose:"remove unused constant %s". - After dedup, the original section's symbol list is replaced with the merged version.
- Relocations targeting the original section are rewritten to point into the merged section. Verbose:
"change reloc in section %d, offset from %lld to %lld".
Verbose Trace Strings
When the verbose layout flag is set (elfw+64, bit 1), the function emits extensive trace output to stderr. These strings are the primary evidence for understanding the layout logic:
| String | Phase | Meaning |
|---|---|---|
"global shared:" | 4b | Header before global shared layout |
"esh %s has offset %d" | 4c | Per-entry shared variable with pre-existing offset |
"entry %s does not reach global shared" | 4c | Entry function not in global shared overlap set |
"shared entry %s:" | 4c | Header before per-entry shared layout |
"shared entry %s from EWP ignored" | 4c | Skip EWP entry's shared section |
"extern shared" | 5 | Header before extern shared processing |
"extern shared variable %s at offset %lld" | 5b | Extern shared address assigned |
"overlapping sets %d and %d" | 5b | Two extern shared sets merged |
"reloc of extern shared %d replaced with symbol %d" | 5b | Relocation rewrite for extern shared |
"extern shared instance %d at offset %lld" | 5b | Final extern shared instance placement |
"global reserved shared:" | 6 | Header before reserved shared layout |
"shared variable %s updated offset to %lld" | 6 | Reserved shared offset adjustment |
"local entry %s:" | 7 | Header before per-entry local layout |
"replace overlapped constant %d with %d, offset %d" | 8 | Constant overlap rewrite |
"ocg const bank %s, size=%lld:" | 9a | OCG constant section header |
"OCG constant section %d reaches entry %d(%s)" | 9a | OCG section reachability |
"need new ocg section %s" | 9a | Creating per-entry OCG section |
"new ocg %s offset = %lld in %d" | 9a | Per-entry OCG offset assignment |
"reset ocg constant reloc offset from %lld to %lld" | 9a | OCG relocation rewrite |
"constant entry %s:" | 9b | Non-OCG constant layout header |
"layout and merge section %s" | 9c | Constant merge begin |
"optimize space in %s (%d)" | 9d | Constant dedup begin |
"local constant %s at offset %lld" | 9d | Local constant placement |
"constant %s at offset %lld" | 9d | Global constant placement |
"remove unused constant %s" | 9d | Dead constant elimination |
"no symbol for reloc section %d at offset %lld?" | 9d | Relocation target lookup failure |
"change reloc in section %d, offset from %lld to %lld" | 9d | Relocation offset rewrite |
"optimize OCG constants for %s, old size = %lld" | 9d | OCG optimization begin |
"new OCG constant size = %lld" | 9d | OCG optimization result |
"ocg const optimization didn't help so give up" | 9d | OCG optimization abandoned |
Relocatable Link Mode
When link_mode == 1 (relocatable link, -r flag), the layout phase operates in a simplified mode:
- Global shared layout directly copies pre-existing offsets from symbols rather than computing new ones.
- Per-entry shared sections are laid out the same way.
- Reserved shared sections copy offsets directly.
- The extern shared optimization path is entirely skipped.
- OCG constant optimization is skipped.
- Resource counting and UFT setup are skipped.
- Constant deduplication is skipped.
This mode preserves the address assignments from the individual input objects so the output can be linked again later.
Architecture-Dependent Behavior
The layout phase queries the architecture profile vtable (elfw+488) at numerous points:
| Vtable offset | Query | Effect on layout |
|---|---|---|
| +32 | max_constant_bank_size() | Triggers OCG constant optimization when exceeded |
| +40 | max_textures() | Texture resource count limit |
| +48 | max_samplers() | Sampler resource count limit |
| +56 | max_surfaces() | Surface resource count limit |
| +136 | ocg_constant_type() | Section type for OCG constant banks |
| +144 | standard_constant_type() | Section type for .nv.constant0 |
| +192 | constant_section_name_id() | Generates constant section name string |
| +200 | supports_extended_smem() | Controls extended shared memory layout |
| +208 | is_constant_type() | Validates section type is a constant bank |
| +296 | supports_bindless() | Enables/disables bindless resource processing |
| +304 | bindless_texture_type() | Section type for bindless texture bank |
| +312 | bindless_surface_type() | Section type for bindless surface bank |
| +584 | reserved_smem_capacity() | Value for .nv.reservedSmem.cap |
| +600 | reserved_smem_begin() | Value for .nv.reservedSmem.begin |
| +608 | reserved_smem_offset0() | Value for .nv.reservedSmem.offset0 |
The a-variant architectures (sm_90a, sm_100a, sm_103a) are detected via elfw+7 == 'A' and gate certain constant-bank and reserved-shared behaviors.
Data Flow Summary
Input state (after merge):
.nv.global -- merged global data, possibly with pending entries
.nv.shared.* -- per-entry shared memory sections, sizes known
extern shared -- zero-size placeholders awaiting address assignment
.nv.constant* -- constant bank data, possibly with duplicates
.nv.reservedSmem.* -- symbols awaiting architecture-specific values
relocations -- unresolved references to shared/constant/global
Output state (after layout):
.nv.global -- all globals merged, overlaps validated
.nv.shared.* -- all shared vars assigned addresses, optimized
extern shared -- addresses resolved to post-non-extern offsets
.nv.constant* -- deduped, per-entry copies created, unused removed
.nv.reservedSmem.* -- filled with arch-specific values
.nv.uft/.nv.udt -- UFT sections created and validated
relocations -- targets rewritten for constant overlap/merge
Worked Example: Section Layout Computation
This example traces a single invocation of sub_439830 on a small merged ELF with two kernels and demonstrates exactly how each byte offset is produced. All arithmetic matches the decompiled control flow of sub_4325A0 (section layout engine) and sub_433760 (data-copy primitive).
Setup: Input Merged ELF
After the merge phase (sub_45E7D0), the elfw object holds the following sections. Section sizes are the totals accumulated during merge; the per-symbol offsets within each section are what the layout phase is about to compute or re-compute.
| Section | Type | Size | Align | Flags | Purpose |
|---|---|---|---|---|---|
.text.kernel_a | SHT_PROGBITS | 2048 B | 128 | AX | SASS for kernel_a |
.text.kernel_b | SHT_PROGBITS | 1024 B | 128 | AX | SASS for kernel_b |
.nv.constant0.kernel_a | SHT_PROGBITS (cbank0) | 256 B | 4 | A | Driver param bank (kernel_a) |
.nv.constant0.kernel_b | SHT_PROGBITS (cbank0) | 128 B | 4 | A | Driver param bank (kernel_b) |
.nv.constant2 | SHT_PROGBITS (cbank2) | 4096 B | 16 | A | User const bank, shared between both kernels |
.nv.shared.kernel_a | SHT_NOBITS (smem) | 128 B | 16 | AW | Per-entry shared for kernel_a |
.nv.shared | SHT_NOBITS (smem) | -- | 8 | AW | Two globals: g_tmp (64 B, align 8) and g_hist (96 B, align 16) |
Both kernels contain R_CUDA_ABS32_LO_20 relocations into .nv.shared (global shared variables g_tmp and g_hist) and R_CUDA_CONST_FIELD relocations into .nv.constant2. .nv.constant0.kernel_a contains three symbols and .nv.constant0.kernel_b contains two, both split out during merge (Phase 9a's OCG path).
The relevant elfw field state at the start of layout:
elfw+90 (no_opt_flag) = 0 -- optimizer enabled
elfw+80 (EWP flag) = 0 -- no early-write-pass
elfw+16 (link_mode) = 0 -- final link, not -r
elfw+228 (.nv.shared idx) = 14 -- global shared exists
elfw+256 (per-entry smem) = list{.nv.shared.kernel_a}
elfw+272 (per-entry cbank) = list{.nv.constant0.kernel_a, .nv.constant0.kernel_b, .nv.constant2}
Step 1: Per-Section Symbol Offsets (sub_4325A0)
The layout engine is called on each data-bearing section. .text.* sections are not touched by sub_439830 -- they hold fully resolved SASS after merge and have stable internal layout. Only shared/constant/global data sections have their symbol offsets reassigned.
Step 1a: .nv.constant0.kernel_a (Phase 9b)
Symbol list after merge (in insertion order):
| Symbol | Size | Align |
|---|---|---|
__cudaparm_kernel_a_ptr_out | 8 | 8 |
__cudaparm_kernel_a_count | 4 | 4 |
__cudaparm_kernel_a_scale | 4 | 4 |
__cudaparm_kernel_a_bias | 4 | 4 |
__cudaparm_kernel_a_mode | 4 | 4 |
sub_4325A0 first calls sub_4647D0(section+72, sub_432440) to stable-sort by alignment (decreasing). The list becomes {ptr_out, count, scale, bias, mode} (unchanged here -- insertion was already sorted). It then walks the list:
current = 0
ptr_out: align=8 -> roundup(0,8)=0 -> value=0, current=0+8=8
count: align=4 -> roundup(8,4)=8 -> value=8, current=8+4=12
scale: align=4 -> roundup(12,4)=12 -> value=12, current=12+4=16
bias: align=4 -> roundup(16,4)=16 -> value=16, current=16+4=20
mode: align=4 -> roundup(20,4)=20 -> value=20, current=20+4=24
Stored back: section+32 (total_size) = 24. Verbose trace: variable __cudaparm_kernel_a_ptr_out at offset 0 etc. The driver will later prepend the built-in 0x160-byte gridDim/blockDim/... header at runtime, so these user-visible offsets start at 0.
Step 1b: .nv.constant0.kernel_b (Phase 9b)
Symbol list:
| Symbol | Size | Align |
|---|---|---|
__cudaparm_kernel_b_ptr_in | 8 | 8 |
__cudaparm_kernel_b_ptr_out | 8 | 8 |
__cudaparm_kernel_b_n | 4 | 4 |
Walk:
current = 0
ptr_in: align=8 -> value=0, current=8
ptr_out: align=8 -> value=8, current=16
n: align=4 -> value=16, current=20
section+32 = 20. Note: the section's internal size is the tight 20, but the section's own sh_addralign will be promoted to 8 (the max symbol alignment seen) by the line if (v11 > *(_QWORD *)(a2 + 48)) *(_QWORD *)(a2 + 48) = v11; in sub_4325A0.
Step 1c: .nv.constant2 (Phase 9b, shared user cbank)
This section is referenced by both kernels, so it is laid out as a single unified bank. Symbols (post-merge, after any dedup from Phase 9c):
| Symbol | Size | Align |
|---|---|---|
lookup_table | 2048 | 16 |
coeffs | 1024 | 16 |
weights | 768 | 16 |
masks | 128 | 8 |
thresholds | 16 | 4 |
Walk:
current = 0
lookup_table: align=16 -> roundup(0,16)=0 -> value=0, current=2048
coeffs: align=16 -> roundup(2048,16)=2048 -> value=2048, current=3072
weights: align=16 -> roundup(3072,16)=3072 -> value=3072, current=3840
masks: align=8 -> roundup(3840,8)=3840 -> value=3840, current=3968
thresholds: align=4 -> roundup(3968,4)=3968 -> value=3968, current=3984
section+32 = 3984. The section occupies bytes [0, 3984) within cbank2; the remaining 4096 - 3984 = 112 bytes are unused padding at the bank's tail. The hardware cbank2 window for a kernel is a fixed 64 KB slice of constant memory; only the bytes covered by actual symbols are materialized in the output ELF.
Verbose trace for this section:
constant entry .nv.constant2:
variable lookup_table at offset 0
variable coeffs at offset 2048
variable weights at offset 3072
variable masks at offset 3840
variable thresholds at offset 3968
Step 2: Shared Memory Overlap Graph (Phase 4)
Both kernels reference the globals in .nv.shared. The shared memory optimizer (sub_436BD0) determines which variables interfere.
Step 2a: scan relocations and build reachability sets. sub_439830 walks elfw+376 (the relocation list) and for each relocation that targets a .nv.shared symbol, identifies which entry function owns the referencing .text.* section. This produces:
g_tmp -> {kernel_a, kernel_b} (both kernels call a helper that touches g_tmp)
g_hist -> {kernel_a} (only kernel_a updates the histogram)
s_local -> {kernel_a} (from .nv.shared.kernel_a, per-entry)
Step 2b: build interference graph. Two shared vars interfere iff their entry-sets overlap:
g_tmp g_hist s_local
g_tmp -- X X (both share kernel_a)
g_hist X -- X (both live in kernel_a)
s_local X X --
An edge means "cannot share an address." In ASCII:
+--------+
| g_tmp |
+--------+
/ \
/ \
/ \
+--------+ +---------+
| g_hist |----| s_local |
+--------+ +---------+
Every pair interferes because kernel_a touches all three. Result: no aliasing is possible; each variable needs its own offset. Verbose trace: no "global shared %s only used in entry %d" lines are emitted (each var has more than one user or is already entry-local).
Step 2c: lay out .nv.shared globals. sub_4325A0 is called (via the optimized path in sub_436BD0 which reduces to the plain layout here since no merging is possible):
current = 0
g_hist: align=16 -> roundup(0,16)=0 -> value=0, current=96
g_tmp: align=8 -> roundup(96,8)=96 -> value=96, current=160
Sort pulled g_hist ahead of g_tmp (higher alignment first). section+32 = 160. Global shared high-water mark = 160.
Step 2d: lay out .nv.shared.kernel_a (Phase 4c). kernel_a reaches the global shared region (its relocation set contains g_tmp and g_hist), so the per-entry layout starts after the global high-water mark. sub_436BD0 calls sub_4325A0 with initial_offset = roundup(160, per_entry_align).
Per-entry symbols in .nv.shared.kernel_a:
| Symbol | Size | Align |
|---|---|---|
s_local (shared tile) | 128 | 16 |
initial = 160
s_local: align=16 -> roundup(160,16)=160 -> value=160, current=160+128=288
section+32 = 288. Verbose: shared entry .nv.shared.kernel_a: then variable s_local at offset 160.
Step 2e: kernel_b has no per-entry .nv.shared.kernel_b. Its shared memory requirement is just the global region it reaches (g_tmp at offset 96). If kernel_b did not reach any global shared variable, the verbose trace would emit entry kernel_b does not reach global shared and its hypothetical per-entry section would start at offset 0 instead of 160 -- this is the key optimization: non-overlapping kernels can reuse the low shared addresses.
Final per-kernel shared memory footprints:
| Kernel | Shared high-water | Composition |
|---|---|---|
kernel_a | 288 B | g_hist[0..96) + g_tmp[96..160) + s_local[160..288) |
kernel_b | 160 B | g_hist[0..96) + g_tmp[96..160) (g_hist addr exists but is dead for b) |
Step 3: Constant Bank Layout Summary
After Phase 9 completes, the final per-bank picture is:
cbank0 / kernel_a:
+0 +---------------------+
| __cudaparm_ptr_out | (8 B)
+8 +---------------------+
| __cudaparm_count | (4 B)
+12 +---------------------+
| __cudaparm_scale | (4 B)
+16 +---------------------+
| __cudaparm_bias | (4 B)
+20 +---------------------+
| __cudaparm_mode | (4 B)
+24 +---------------------+
... (driver fills rest)
cbank0 / kernel_b:
+0 +---------------------+
| __cudaparm_ptr_in | (8 B)
+8 +---------------------+
| __cudaparm_ptr_out | (8 B)
+16 +---------------------+
| __cudaparm_n | (4 B)
+20 +---------------------+
cbank2 (shared by both kernels):
+0 +---------------------+
| lookup_table | (2048 B, align 16)
+2048 +---------------------+
| coeffs | (1024 B, align 16)
+3072 +---------------------+
| weights | (768 B, align 16)
+3840 +---------------------+
| masks | (128 B, align 8)
+3968 +---------------------+
| thresholds | (16 B, align 4)
+3984 +---------------------+
| (padding up to 4096)|
+4096 +---------------------+
Both kernels see the same cbank2 addresses -- that is the whole point of a non-per-entry constant bank. In contrast, cbank0 is split per entry because the parameter layout differs between kernels.
Step 4: Final File sh_offset Table
The layout phase only assigns symbol offsets within sections; file offsets (sh_offset in each section header) are materialized later during the output phase (sub_452010 / writer sub-pipeline), which walks the section list and places each section sequentially in the ELF file respecting sh_addralign.
For the merged ELF described above, assuming a standard 64-byte ELF64 header and no program headers (typical for a .cubin fragment before fatbinary wrap), the file layout is:
| # | Section | sh_type | sh_addralign | sh_size | Raw write start | sh_offset (aligned) | sh_offset + sh_size |
|---|---|---|---|---|---|---|---|
| 0 | NULL | SHT_NULL | 0 | 0 | -- | 0 | 0 |
| 1 | .text.kernel_a | PROGBITS | 128 | 2048 | 64 | 128 | 2176 |
| 2 | .text.kernel_b | PROGBITS | 128 | 1024 | 2176 | 2176 | 3200 |
| 3 | .nv.constant0.kernel_a | PROGBITS | 8 | 24 | 3200 | 3200 | 3224 |
| 4 | .nv.constant0.kernel_b | PROGBITS | 8 | 20 | 3224 | 3224 | 3244 |
| 5 | .nv.constant2 | PROGBITS | 16 | 3984 | 3244 | 3248 | 7232 |
| 6 | .nv.shared | NOBITS | 16 | 160 | 7232 | 7232 | 7232 |
| 7 | .nv.shared.kernel_a | NOBITS | 16 | 288 | 7232 | 7232 | 7232 |
| 8 | .shstrtab | STRTAB | 1 | ~120 | 7232 | 7232 | 7352 |
| 9 | .symtab | SYMTAB | 8 | ... | 7352 | 7360 | ... |
| 10 | .strtab | STRTAB | 1 | ... | ... | ... | ... |
Alignment padding events to notice:
- Section 1 (
.text.kernel_a) aligns from raw 64 up to 128: 64 bytes of zero padding immediately after the ELF header. - Section 2 (
.text.kernel_b) is already 128-aligned at 2176 (since 2048 was a multiple of 128): no padding. - Section 3 (
.nv.constant0.kernel_a) at 3200 is 8-aligned already: no padding. - Section 4 (
.nv.constant0.kernel_b) at 3224 is 8-aligned: no padding. - Section 5 (
.nv.constant2) at 3244 must align to 16: 4 bytes of padding (3244 -> 3248). - Sections 6 and 7 are
SHT_NOBITS: they occupy zero bytes in the file, so theirsh_offsetis set to the current write position but the cursor does not advance past them. Both recordsh_offset = 7232. - Section 9 (
.symtab) aligns 7352 -> 7360: 8 bytes of padding.
The section content cursor therefore advances: 64 -> 128 -> 2176 -> 3200 -> 3224 -> 3244 -> 3248 -> 7232 (data sections done) -> 7352 -> 7360 -> ... (metadata sections). Total file padding: 64 + 4 + 8 = 76 bytes.
Step 5: Order of Offset Assignment
For reference, the exact order in which sub_439830 assigns offsets for this example is:
- Phase 1 (global data): no
.nv.globalpending, skipped. - Phase 3 (global init): no
.nv.global.init, skipped. - Phase 4b (global shared):
g_hist=0,g_tmp=96. Section size = 160. - Phase 4c (per-entry shared):
.nv.shared.kernel_a.s_local = 160. Section size = 288. - Phase 5 (extern shared): no extern shared in this example, skipped.
- Phase 6 (reserved shared): no
.nv.reservedSmemused, skipped. - Phase 7 (local data): no
.nv.local.*, skipped. - Phase 8 (overlapped constant replacement): dedup table empty, skipped.
- Phase 9a (OCG split): cbank0 was already per-entry, no split needed.
- Phase 9b (non-OCG constants):
.nv.constant0.kernel_a-> symbols at 0/8/12/16/20 (size 24)..nv.constant0.kernel_b-> symbols at 0/8/16 (size 20)..nv.constant2-> symbols at 0/2048/3072/3840/3968 (size 3984). - Phase 9c (constant merge):
merge-constantsflag was clear, skipped. - Phase 9d (OCG size optimization): cbank0 sizes (24 and 20) are well under the 64 KB limit, no optimization.
- Phase 10 (resource counting + UFT): textures/samplers/surfaces = 0;
sub_463F70creates empty.nv.uft/.nv.udt.
At this point sub_439830 returns and control passes to the relocation phase (sub_469D60), which uses the offsets assigned above to patch R_CUDA_ABS32_LO_20 (for g_tmp/g_hist/s_local) and R_CUDA_CONST_FIELD (for lookup_table/coeffs/etc.) into the already-finalized .text.* bytes.
Cross-References
- Pipeline Overview -- layout phase in the context of the full 14-phase pipeline
- Merge Phase -- preceding phase that produces the merged elfw consumed by layout
- Relocation Phase -- succeeding phase that patches instruction/data bytes against addresses assigned here
- Section Merging -- how sections are merged during Phase 9 before layout assigns addresses
- Data Layout Optimization -- the constant deduplication and overlap merge sub-algorithms called from layout
- Bindless Relocations -- Phase 2 bindless resource processing detail (
sub_438DD0) - Unified Function Tables -- Phase 10 UFT/UDT section creation and validation (
sub_463F70) - Constant Banks --
.nv.constant0/.nv.constant2layout and R_CUDA_CONST_FIELD relocations - Program Headers -- how laid-out sections map to ELF program headers in the output
- Architecture Profiles -- the vtable at
elfw+488queried throughout layout for architecture-specific limits - ELF Writer Structure -- the elfw data structure that layout mutates
- Section Record -- per-section metadata records updated during layout
- Symbol Record -- per-symbol records whose
valuefield is assigned during layout - Dead Code Elimination -- callgraph reachability analysis that determines which symbols are live for layout
Confidence Assessment
| Claim | Confidence | Evidence |
|---|---|---|
sub_439830 at 0x439830, 65,776 bytes, ~2,173 lines | HIGH | stat -c%s = 65,776; wc -l = 2,173 |
Single argument (elfw*) signature | HIGH | Decompiled: size_t __fastcall sub_439830(__int64 a1) |
"global shared %s only used in entry %d" | HIGH | String at 0x1d389a8 in nvlink_strings.json |
"remove unused global shared %s" | HIGH | String at 0x1d389d0 in nvlink_strings.json |
"variable %s at offset %d" | HIGH | String at 0x1d38739 in nvlink_strings.json |
"esh %s has offset %d" | HIGH | String at 0x1d38bcb in nvlink_strings.json |
"entry %s does not reach global shared" | HIGH | String at 0x1d38da8 in nvlink_strings.json |
"shared entry %s:" | HIGH | String at 0x1d38be1 in nvlink_strings.json |
"extern shared variable %s at offset %lld" | HIGH | String found in nvlink_strings.json (partial match at 0x1d38a80) |
"overlapping sets %d and %d" | HIGH | String at 0x1d38c02 in nvlink_strings.json |
"reloc of extern shared %d replaced with symbol %d" | HIGH | String at 0x1d38e28 in nvlink_strings.json |
"local entry %s:" | HIGH | String at 0x1d38c7c in nvlink_strings.json |
"constant entry %s:" | HIGH | String at 0x1d38c8d in nvlink_strings.json |
"found duplicate value 0x%x, alias %s to %s" | HIGH | String at 0x1d38888 in nvlink_strings.json |
"remove unused constant %s" | HIGH | String at 0x1d38cf9 in nvlink_strings.json |
"layout and merge section %s" | HIGH | String at 0x1d38d46 in nvlink_strings.json |
"optimize space in %s (%d)" | HIGH | String at 0x1d38ccf in nvlink_strings.json |
"need new ocg section %s" | HIGH | String at 0x1d38ca1 in nvlink_strings.json |
"new OCG constant size = %lld" | HIGH | String at 0x1d38d8a in nvlink_strings.json |
"ocg const optimization didn't help so give up" | HIGH | String at 0x1d39058 in nvlink_strings.json |
"reset ocg constant reloc offset from %lld to %lld" | HIGH | String at 0x1d38f60 in nvlink_strings.json |
"should only reach here with no opt" assertion | HIGH | String at 0x1d38758 in nvlink_strings.json |
"tail data node not found" assertion | HIGH | String at 0x1d38839 in nvlink_strings.json |
"shared variable %s updated offset to %lld" | HIGH | String found in nvlink_strings.json |
TEMP_MERGED_CONSTANTS / TEMP_OCG_CONSTANTS temp sections | HIGH | Strings at 0x1d38d30 and 0x1d38d77 in nvlink_strings.json |
.nv.reservedSmem.begin / .cap / .offset0 symbols | HIGH | ".nv.reservedSmem.begin" at 0x1d38c37 in nvlink_strings.json |
.nv.global.init section reference | HIGH | String at 0x1d38940 in nvlink_strings.json |
.nv.uft / .nv.udt sections | HIGH | Strings at 0x1d39f74 and 0x1d38924 in nvlink_strings.json |
sub_436BD0 (shared memory optimizer), 15,711 B | HIGH | stat -c%s = 15,711 bytes |
sub_4339A0 (constant dedup), 13,199 B | HIGH | stat -c%s = 13,199 bytes |
sub_438DD0 (process_bindless_refs), 12,779 B | HIGH | stat -c%s = 12,779 bytes |
sub_463F70 (uft_setup_sections), 3,978 B | HIGH | stat -c%s = 3,978 bytes |
| All 14 function addresses in the key sub-functions table | HIGH | All verified to exist in decompiled/ directory |
| 10-phase sequential execution order | MEDIUM | Phase ordering matches decompiled control flow in sub_439830; sub-phase boundaries are editorial grouping |
elfw field offsets (+7, +16, +48, +64, +80, +81, etc.) | MEDIUM | Consistent with decompiled pointer arithmetic; individual offsets are inferred from a1 + N patterns |
| Architecture vtable offsets (+32, +40, +48, +136, +296, etc.) | MEDIUM | Offsets inferred from decompiled indirect calls *(fn*)(a1 + 488 + N); consistent across the codebase |
| Interference graph algorithm for shared memory optimization | MEDIUM | Algorithmic description inferred from control flow of sub_436BD0; graph-coloring terminology is editorial |