Data Layout Optimization
The data layout optimization pass deduplicates constant values, eliminates unused constants, and compacts OCG (object-code-generator) constant bank sections across translation units. It is implemented by sub_4339A0 (constant_dedup, 13,199 bytes at 0x4339A0) and orchestrated by the layout phase sub_439830 during phases 9c and 9d. The optimization reduces constant memory pressure by finding identical 32-bit and 64-bit values across compilation units, aliasing duplicate symbols to a single physical copy, and rewriting relocations to target the canonical instance.
The pass is enabled by --optimize-data-layout and disabled by --no-opt. These two flags are mutually exclusive; specifying both triggers a fatal error. When neither flag is given, the optimization runs automatically if any OCG constant bank section exceeds the architecture's maximum constant bank size (queried via vtable offset +32, typically 64 KB for CUDA constant memory).
| Item | Details |
|---|---|
| Primary function | sub_4339A0 at 0x4339A0 (13,199 bytes, 426 decompiled lines) |
| Caller | sub_439830 (layout phase) at phases 9c and 9d |
| Hash table lookup | sub_449A80 at 0x449A80 (hash-table value lookup) |
| Hash table insert | sub_448E70 at 0x448E70 (hash-table key-value insert) |
| Reachability check | sub_43FB70 at 0x43FB70 (symbol entry-function reachability) |
| Data copy | sub_433760 at 0x433760 (section data append with alignment) |
| Large-value dedup | sub_433870 at 0x433870 (byte-level memcmp deduplication for 12..64 byte values) |
| Overlap validation | sub_4343C0 at 0x4343C0 (section data overlap detection and identity check) |
| 32-bit hash function | sub_44E120 at 0x44E120 (identity: returns key unchanged) |
| 64-bit hash function | sub_44E1C0 at 0x44E1C0 (shift-xor: (k>>11) ^ (k>>8) ^ (k>>5)) |
| 64-bit xor-fold hash | sub_44E150 at 0x44E150 (custom hash: lo32 ^ hi32) |
| Hash table create | sub_4489C0 at 0x4489C0 (allocate 112-byte header, set mode tag) |
| CLI enable | --optimize-data-layout (stored in byte_2A5F2A8, maps to elfw+83) |
| CLI disable | --no-opt (stored in byte_2A5F2A9, maps to elfw+90) |
| Verbose gate | elfw+64, bit 1 (debug/verbose flags) |
Position in the Pipeline
Phase 9b: Standard constant bank layout (sub_4325A0)
|
v
Phase 9c: Constant merging via sub_4339A0 <-- merge-constants mode
|
v
Phase 9d: OCG constant optimization <-- this page (primary focus)
| via sub_4339A0
v
Phase 10: Resource counting + UFT setup
The dedup engine sub_4339A0 is called from two distinct code paths within the layout phase:
- Phase 9c (merge-constants mode): When elfw+97 (merge-constants flag) is set, the layout phase creates a TEMP_MERGED_CONSTANTS temporary section and calls sub_4339A0 with the a13 parameter set to 1, meaning "copy all constants, even unreferenced ones." This mode deduplicates the standard constant bank (.nv.constant0) contents.
- Phase 9d (OCG constant optimization): Triggered when elfw+91 (force-OCG-optimization) is set or any OCG constant section exceeds the bank size limit. Creates a TEMP_OCG_CONSTANTS temporary section and calls sub_4339A0 with a13=0 and a non-null overlap-set argument (a15), enabling entry-function-level dead constant elimination and per-entry OCG section tracking.
Function Signature
// sub_4339A0 -- constant_dedup
// Address: 0x4339A0
// Size: 13,199 bytes (~426 lines of decompiled pseudocode)
//
// a1: elfw* -- linker context / ELF wrapper
// a2: section* -- source constant section record
// a3: uint32_t -- target temporary section index
// a4: htab_32* -- hash table for 32-bit value dedup (key=value, val=symbol*)
// a5: htab_64* -- hash table for 64-bit value dedup (key=value, val=symbol*)
// a6: list** -- bucket for 12-byte value linked list
// a7: list** -- bucket for 16-byte value linked list
// a8: list** -- bucket for 20-byte value linked list
// a9: list** -- bucket for 24-byte value linked list
// a10: list** -- bucket for 32-byte value linked list
// a11: list** -- bucket for 48-byte value linked list
// a12: list** -- bucket for 64-byte value linked list
// a13: char -- copy-all flag (1=copy unreferenced constants, 0=skip dead ones)
// a14: reloc** -- relocation list head pointer (for OCG reloc rewriting)
// a15: set* -- per-entry overlap set (NULL in merge-constants mode)
_QWORD* __fastcall sub_4339A0(
__int64 a1, __int64 a2, unsigned int a3,
__int64 a4, __int64 a5,
_QWORD* a6, _QWORD* a7, _QWORD* a8,
_QWORD* a9, _QWORD* a10, _QWORD* a11,
_QWORD* a12, char a13,
_QWORD* a14, __int64 a15);
Deduplication Algorithm
The function iterates the source section's symbol linked list (rooted at a2+72). For each symbol, it extracts the data pointer, alignment, and size from the 40-byte symbol node, then dispatches into one of three dedup strategies based on the data size.
Size-Based Dispatch
for each symbol node in source_section->symbol_list:
sym_record = get_symbol(elfw, node->sym_index)
if (a13 == true) and (node->sym_index != 0) and reachable(elfw, sym_record):
// Phase 9c path: force-copy reachable constant
aligned_offset = align_up(target_section->size, node->alignment)
sym_record->value = aligned_offset
section_data_copy(elfw, target_section_idx, data, alignment, size)
continue
switch (node->data_size):
case 4: dedup_32bit(...)
case 8: dedup_64bit(...)
case 12: dedup_memcmp(a6, 12, ...)
case 16: dedup_memcmp(a7, 16, ...)
case 20: dedup_memcmp(a8, 20, ...)
case 24: dedup_memcmp(a9, 24, ...)
case 32: dedup_memcmp(a10, 32, ...)
case 48: dedup_memcmp(a11, 48, ...)
case 64: dedup_memcmp(a12, 64, ...)
default: section_data_copy(...) // no dedup, just copy
The reachability check sub_43FB70 determines whether a constant symbol is referenced by any live entry-point function. It checks the EWP mode flag (elfw+80) and the merge-constants flag (elfw+97). If neither is set, it queries entry-function reachability via the callgraph: for global symbols (binding bit 4 set), it checks against the function set at elfw+536; for local symbols, against elfw+544. Both queries use sub_43EB40, a string-based set membership test on the symbol name.
32-Bit Value Deduplication
For 4-byte constants, the algorithm uses a dedicated hash table (a4) keyed on the integer value:
uint32_t val = *(uint32_t*)data_ptr;
uint64_t key = (uint64_t)val; // zero-extended for hash table
symbol* existing = hash_lookup(a4, key); // sub_449A80
if (existing != NULL && val != 0) {
// Duplicate found -- alias this symbol to the existing one
if (existing != sym_record) {
// verbose: "found duplicate value 0x%x, alias %s to %s"
sym_record->value = existing->value; // offset +8: copy address
}
} else if (val != 0 || a15 == NULL) {
// First occurrence or zero value without overlap tracking
aligned_offset = align_up(target_size, alignment)
sym_record->value = aligned_offset
section_data_copy(elfw, target_idx, data, alignment, size)
hash_insert(a4, key, sym_record) // sub_448E70
} else {
    // val == 0 and overlap set tracking is active (OCG mode):
    // check whether this symbol's entry function was already seen
    if (!set_contains(a15, sym_record->name_ptr)) {
        // First zero-value in this entry's scope
        aligned_offset = align_up(target_size, alignment);
        sym_record->value = aligned_offset;
        section_data_copy(elfw, target_idx, data, alignment, size);
        hash_insert(a4, key, sym_record);
        set_insert(a15, sym_record->name_ptr);
        // Then rewrite OCG relocations targeting this constant
        rewrite_ocg_relocs(elfw, a14, sym_record, aligned_offset);
    } else {
        // Already have a zero-value for this entry --
        // rewrite OCG relocations anyway (different offset)
        rewrite_ocg_relocs(elfw, a14, sym_record, sym_record->value);
    }
}
The zero-value case receives special treatment. Zero-valued constants are common (uninitialized constant declarations compile to zero-filled entries), and in OCG mode each entry function may have its own instance. The overlap set a15 tracks which entry-function scopes have already allocated a zero-value slot to prevent redundant copies while still allowing per-entry uniqueness.
When a duplicate is found, the aliasing sets sym_record->value (field at offset +8 in the symbol record) to the existing symbol's value. This is the address within the target section. No data is copied -- the duplicate symbol simply points to the same physical location as the original.
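The 32-bit path can be modeled as a short standalone program. This is a sketch, not the binary's code: a fixed-size linear-probe table stands in for the sub_449A80/sub_448E70 hash table, `symbol` carries only the value field (offset +8), and the zero-value/overlap-set handling is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAB_SIZE 64   /* power of two, so (key & mask) works as in mode 2 */

typedef struct { uint64_t value; } symbol;   /* only the offset field (+8) */

typedef struct {
    uint64_t key[TAB_SIZE];
    symbol  *val[TAB_SIZE];
    uint8_t  used[TAB_SIZE];
} htab32;

static symbol *lookup32(htab32 *t, uint64_t key) {
    for (uint32_t i = (uint32_t)key & (TAB_SIZE - 1); t->used[i];
         i = (i + 1) & (TAB_SIZE - 1))
        if (t->key[i] == key) return t->val[i];
    return NULL;
}

static void insert32(htab32 *t, uint64_t key, symbol *s) {
    uint32_t i = (uint32_t)key & (TAB_SIZE - 1);
    while (t->used[i]) i = (i + 1) & (TAB_SIZE - 1);
    t->used[i] = 1; t->key[i] = key; t->val[i] = s;
}

static uint64_t align_up(uint64_t off, uint64_t align) {
    return (off + align - 1) & ~(align - 1);
}

/* Place or alias one non-zero 4-byte constant (zero/OCG handling omitted). */
void dedup32(htab32 *t, uint64_t *section_size, symbol *sym,
             uint32_t value, uint64_t alignment) {
    uint64_t key = value;                 /* zero-extended, as in the pass */
    symbol *existing = lookup32(t, key);
    if (existing && existing != sym) {
        sym->value = existing->value;     /* alias: no data copied */
    } else {
        uint64_t off = align_up(*section_size, alignment);
        sym->value = off;                 /* first occurrence: place it */
        *section_size = off + 4;          /* stand-in for section_data_copy */
        insert32(t, key, sym);
    }
}
```

Three constants with one duplicate yield two placements and one alias, mirroring the verbose "found duplicate value" path.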
64-Bit Value Deduplication
The 8-byte case follows the same pattern as 32-bit but uses the second hash table (a5) and the full 64-bit value as the key:
uint64_t val = *(uint64_t*)data_ptr;
symbol* existing = hash_lookup(a5, val);
if (existing != NULL && val != 0) {
if (existing != sym_record) {
// verbose: "found duplicate 64bit value 0x%llx, alias %s to %s"
sym_record->value = existing->value;
}
} else if (val != 0 || a15 == NULL) {
// same as 32-bit first-occurrence path
...
hash_insert(a5, val, sym_record)
} else {
// same zero-value OCG tracking path as 32-bit
...
}
Large-Value Deduplication (12--64 Bytes)
For values of sizes 12, 16, 20, 24, 32, 48, and 64 bytes, the function delegates to sub_433870 (dedup_memcmp, 216 bytes at 0x433870). Each size class has its own singly-linked list (parameters a6 through a12). The function signature, reconstructed from the decompiled output:
// sub_433870 -- dedup_memcmp
// Address: 0x433870, Size: 216 bytes
//
// a1: elfw* -- linker context
// a2: list_head** -- pointer to head of seen-values list for this size class
// a3: int -- expected data size (12, 16, 20, 24, 32, 48, or 64)
// a4: data_node* -- 40-byte data node from the source symbol list
// +0: void* data_ptr (pointer to raw constant bytes)
// +8: uint64_t name_or_id (symbol name/reference)
// +16: uint64_t alignment (required alignment in bytes)
// +24: uint64_t data_size (redundant with a3)
// +32: uint32_t sym_index (symbol table index)
// a5: int64_t* -- output: receives the assigned offset in target section
// a6: const char** -- output: receives the symbol name (for verbose logging)
// a7: int64_t* -- pointer to target_section->size (updated on copy)
// a8: uint32_t -- target section index
The algorithm walks the linked list, comparing alignment and content byte-for-byte:
void dedup_memcmp(elfw* ctx, list_node** head, int size,
data_node* node, int64_t* offset_out,
const char** name_out, int64_t* section_size,
uint32_t target_idx)
{
list_node* cur = *head;
uint64_t align = node->alignment; // offset +16 of data_node
void* data = node->data_ptr; // offset +0
while (cur != NULL) {
data_node* existing = cur->payload;
// Two-stage comparison:
// 1. Alignment must match (fast integer compare, rejects most)
// 2. Byte-exact content match via memcmp
if (existing->alignment == align &&
memcmp(data, existing->data_ptr, size) == 0)
{
// Duplicate found -- look up the canonical symbol's offset
symbol* sym = get_symbol(ctx, existing->sym_index);
if (ctx->verbose_layout) // (*(ctx+64) & 2)
fprintf(stderr,
"found duplicate %d byte value, alias %s to %s\n",
size, *name_out, sym->name);
*offset_out = sym->value; // alias to existing offset
return;
}
cur = cur->next;
}
// No match found -- allocate space in the target section
int64_t current_size = *section_size;
if (current_size % align)
current_size = current_size + align - (current_size % align);
*offset_out = current_size;
section_data_copy(ctx, target_idx, data, align, node->data_size);
// Prepend this data_node to the seen-values list (sub_4644C0)
// so future duplicates will find it
list_prepend(node, head);
}
Key observations from the decompiled code:
- Alignment check first: The alignment comparison (existing->alignment == align) acts as a cheap filter before the expensive memcmp. Constants with different alignment requirements are never considered duplicates even if their byte content matches, because merging them would require the stricter alignment for all users.
- Prepend, not append: New entries are prepended to the list head via sub_4644C0 (generic linked-list prepend). This means recently seen values are found first, which is beneficial when translation units define constants in similar order.
- No hash, no index: The linked-list approach for large values (rather than a hash table) is a deliberate tradeoff. These sizes are rare in practice -- most GPU constants are 4 or 8 bytes (scalar floats, doubles, pointers). The O(n) linear scan with memcmp is faster than computing and comparing a hash for the small typical population of each size class. A program with 50 distinct float4 (16-byte) constants would require at most 50 comparisons per lookup, each of which is a single 16-byte memcmp -- likely a single SSE instruction on x86-64.
- Size-class isolation: Each of the seven supported large sizes has an independent list. A 16-byte constant is never compared against a 32-byte constant. This eliminates the need for a size field in the comparison and keeps each list short.
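The alignment-then-memcmp walk with list prepend can be sketched as a self-contained program. Names here are illustrative stand-ins: `seen_node` and the malloc-based allocation replace the binary's arena-allocated 40-byte nodes and the sub_4644C0 prepend, and placement alignment is assumed already satisfied for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct seen_node {
    struct seen_node *next;
    const void *data;       /* raw constant bytes */
    uint64_t alignment;     /* placement alignment of the canonical copy */
    uint64_t offset;        /* where the canonical copy was placed */
} seen_node;

/* Returns the canonical offset for `data` in one size class, placing it
 * at *section_size when no duplicate exists. */
uint64_t dedup_bytes(seen_node **head, const void *data, size_t size,
                     uint64_t alignment, uint64_t *section_size) {
    for (seen_node *cur = *head; cur; cur = cur->next) {
        /* 1. cheap alignment compare  2. byte-exact memcmp */
        if (cur->alignment == alignment && memcmp(data, cur->data, size) == 0)
            return cur->offset;            /* duplicate: alias, no copy */
    }
    seen_node *n = malloc(sizeof *n);
    n->data = data;
    n->alignment = alignment;
    n->offset = *section_size;
    n->next = *head;                       /* prepend, like sub_4644C0 */
    *head = n;
    *section_size += size;                 /* stand-in for section_data_copy */
    return n->offset;
}
```

Because new entries are prepended, a run of identical constants from consecutive translation units hits the match on the first comparison.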
Values Outside Standard Size Classes
Constants whose size does not match any of the nine dedup-eligible sizes (4, 8, 12, 16, 20, 24, 32, 48, 64) bypass deduplication entirely. The function falls through to sub_433760 (section_data_copy), which appends the data to the target section at the next aligned offset. This handles unusual sizes like packed structs or non-power-of-two arrays.
If the alignment matches the size exactly for the 4-byte or 8-byte cases, the function uses the fast hash-table path. If alignment differs from size (e.g., a 4-byte value with 8-byte alignment), the function still uses the hash-table dedup but applies the larger alignment during placement.
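A minimal sketch of that placement arithmetic, mirroring the `current_size + align - (current_size % align)` pattern in the decompiled copy path (`place` is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

/* Next placement offset for a value with the given alignment requirement. */
uint64_t place(uint64_t section_size, uint64_t align) {
    if (section_size % align)
        section_size += align - (section_size % align);
    return section_size;
}
```

A 4-byte value with 8-byte alignment appended after 4 bytes of data lands at offset 8, not 4.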
OCG Constant Relocation Rewriting
After aliasing or placing a constant in the target section, the function rewrites relocations that reference the original constant section. This is critical because the constants have moved from their original per-TU sections into the merged temporary section.
The relocation rewriting logic is embedded directly in sub_4339A0 (appearing four times in the decompiled output -- once for each of the 32-bit and 64-bit paths, each with two sub-cases for the overlap-set tracking). The algorithm walks the relocation linked list (a14):
// Invoked after placing/aliasing a constant symbol
for each reloc_node in *a14:
reloc = reloc_node->payload;
reloc_section = get_section(elfw, reloc->section_idx);
if (reloc_section->parent == source_section_idx &&
reloc->offset >= sym_record->original_value &&
reloc->offset < sym_record->original_value + sym_record->size) {
// This relocation targets the constant we just processed
new_offset = reloc->offset + new_value - sym_record->original_value;
reloc->offset = new_offset;
// verbose: "optimize ocg constant reloc offset from %lld to %lld"
// Remove from relocation list (unlink node)
}
The check reloc->offset >= original_value && reloc->offset < original_value + size identifies relocations that fall within the address range of the source constant. The offset is then recomputed relative to the new position in the merged section: new_offset = old_offset + (new_address - old_address).
This relocation rewriting is what makes the "optimize ocg constant reloc offset from %lld to %lld" message appear in verbose output. Each time a constant is placed or aliased, all relocations pointing into its original range are patched to point into the new location.
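The range check and offset recomputation can be isolated into a small sketch; `reloc` and its field names are simplified stand-ins for the binary's relocation nodes:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t offset; } reloc;

/* Patch one relocation if it falls inside the moved constant's old range
 * [old_value, old_value + size). Returns 1 when the relocation was patched. */
int rewrite_reloc(reloc *r, uint64_t old_value, uint64_t size,
                  uint64_t new_value) {
    if (r->offset >= old_value && r->offset < old_value + size) {
        r->offset += new_value - old_value;  /* keep the intra-constant delta */
        return 1;
    }
    return 0;
}
```

A relocation pointing 4 bytes into a constant keeps pointing 4 bytes into it after the move; relocations outside the range are untouched.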
Section Data Overlap Validation (sub_4343C0)
When constant data is copied into a target section, the section data append function delegates to sub_4343C0 (overlap_validate, at 0x4343C0). This function maintains a sorted linked list of data fragments within each section and validates that new data does not conflict with existing content at overlapping offsets.
Function Signature
// sub_4343C0 -- overlap_validate
// Address: 0x4343C0
//
// a1: elfw* -- linker context (reinterpreted as FILE* by decompiler)
// a2: uint32_t -- symbol index associated with this data
// a3: uint32_t -- target section index
// a4: void* -- pointer to data bytes to write
// a5: uint64_t -- offset within target section (-1 = append at end)
// a6: uint32_t -- alignment requirement
// a7: size_t -- data size in bytes
Overlap Detection Algorithm
The function maintains a sorted linked list of data records attached to each section (at section header offset +72). Each record is a 40-byte node:
data_record (40 bytes):
+0: void* data_ptr -- pointer to raw bytes
+8: uint64_t offset -- offset within section
+16: uint64_t alignment -- alignment used during placement
+24: uint64_t size -- data size
+32: uint32_t sym_index -- originating symbol index
+36: uint32_t padding -- (zeroed)
The insertion algorithm searches for the correct position in offset-sorted order, checking for three overlap cases:
void overlap_validate(elfw* ctx, uint32_t sym_idx, uint32_t sec_idx,
void* data, uint64_t offset, uint32_t align, size_t size)
{
section* sec = get_section(ctx, sec_idx);
// --- Section size update ---
// If offset == -1 (sentinel): append at current section end
// If offset + size > section_size: extend section
// If offset < section_size and offset + size > section_size:
// verbose: "offset %lld goes past section %d size"
// extend anyway (allow partial overlap past end)
// --- ELF NOBITS (0x70000002) section special case ---
// For NOBITS-like sections, pads data to 128 bytes with
// fill pattern 0xF804002D01 (an NVIDIA NOP encoding) before insertion
// --- Allocate new data record ---
data_record* rec = arena_alloc(40);
rec->data_ptr = data;
rec->offset = offset;
rec->alignment = align;
rec->size = size;
rec->sym_index = sym_idx;
// Update symbol's value and size from this record
if (sym_idx != 0) {
symbol* sym = get_symbol(ctx, sym_idx);
sym->size = size;
if (is_defined_symbol(sym))
sym->value = offset;
}
// --- Find insertion point in sorted list ---
if (offset == -1) {
// Append to tail of list
list_append(rec, &sec->data_list);
return;
}
node* prev = NULL;
node* cur = sec->data_list;
if (!cur) {
// Empty list -- just insert
list_prepend(rec, &sec->data_list);
return;
}
data_record* cur_rec = cur->payload;
uint64_t cur_offset = cur_rec->offset;
uint64_t cur_size = cur_rec->size;
// Walk list to find correct sorted position
while (cur && cur_rec->offset < offset) {
if (cur_rec->offset + cur_rec->size > offset) {
// CASE 1: New data starts inside existing record
goto validate_partial_overlap;
}
prev = cur;
cur = cur->next;
if (cur) cur_rec = cur->payload;
}
if (!cur || data == NULL || size == 0 || !cur_rec->size) {
// No overlap possible -- insert at this position
list_insert_after(rec, prev, &sec->data_list);
return;
}
if (offset == cur_rec->offset) {
// CASE 2: Exact start-offset match with existing record
goto validate_exact_overlap;
}
if (cur_rec->size >= size || offset + size != cur_rec->size + cur_rec->offset) {
// Non-overlapping or size mismatch -- just insert
list_insert_after(rec, prev ? prev : cur, &sec->data_list);
return;
}
// CASE 3: New data extends to exactly where existing record ends
// (tail overlap -- new data covers existing record's suffix)
goto validate_tail_overlap;
}
Three Overlap Cases
Case 1: Partial forward overlap -- The new data's start offset falls strictly inside an existing record's range (existing.offset < new.offset < existing.offset + existing.size):
validate_partial_overlap:
// verbose: "offset %lld (sym %d) overlaps in section %d"
// Verify the new data does not extend past the existing record
if (offset + size > cur_rec->offset + cur_rec->size)
fatal_error("overlapping data spans too much");
// Verify overlapping bytes are identical
void* overlap_ptr = cur_rec->data_ptr + (offset - cur_rec->offset);
if (memcmp(data, overlap_ptr, size) != 0)
fatal_error("overlapping non-identical data");
// Overlap is valid (identical content) -- merge symbol tracking
hash_insert(sym_dedup_table, sym_idx, cur_rec->sym_index);
free_record(rec);
Case 2: Exact offset match -- The new data starts at exactly the same offset as an existing record. Two sub-cases:
validate_exact_overlap:
// verbose: "offset %lld (sym %d) overlaps in section %d"
if (size <= cur_rec->size) {
// 2a: New data is same size or smaller -- verify prefix match
if (memcmp(data, cur_rec->data_ptr, size) != 0)
fatal_error("overlapping non-identical data");
// Discard new record, merge symbol into existing
free_record(rec);
hash_insert(sym_dedup_table, sym_idx, cur_rec->sym_index);
} else {
// 2b: New data is LARGER -- it subsumes the existing record
// Verify the existing content matches the corresponding prefix
if (memcmp(data, cur_rec->data_ptr, cur_rec->size) != 0)
fatal_error("overlapping non-identical data");
// Replace existing record with new (larger) one
cur->payload = rec;
hash_insert(sym_dedup_table, cur_rec->sym_index, sym_idx);
free_record(cur_rec);
// Check if the next record is also subsumed
node* next = cur->next;
if (next) {
data_record* next_rec = next->payload;
if (next_rec->offset < offset + size) {
// Next record overlaps with new data
if (offset + size != next_rec->offset + next_rec->size)
fatal_error("overlapping data spans too much");
// Verify trailing overlap bytes match
size_t tail_off = next_rec->offset - offset;
if (memcmp(data + tail_off, next_rec->data_ptr,
size - tail_off) != 0)
fatal_error("overlapping non-identical data");
// Absorb next record -- unlink and merge symbol
cur->next = next->next;
hash_insert(sym_dedup_table, next_rec->sym_index, sym_idx);
free_record(next_rec);
}
}
}
Case 3: Tail overlap -- The new data ends at exactly the same offset as an existing record (new.offset + new.size == existing.offset + existing.size) and is larger than the existing record:
validate_tail_overlap:
// verbose: "offset %lld (sym %d) overlaps in section %d"
size_t overlap_size = cur_rec->size;
void* tail_in_new = data + size - overlap_size;
if (memcmp(tail_in_new, cur_rec->data_ptr, overlap_size) != 0)
fatal_error("overlapping non-identical data");
// Replace existing with new larger record
cur->payload = rec;
hash_insert(sym_dedup_table, cur_rec->sym_index, sym_idx);
free_record(cur_rec);
Validation Invariants
The overlap validation enforces three strict invariants:
- Content identity: Any bytes that are covered by both an existing record and a new record must be bitwise identical. A memcmp mismatch triggers a fatal error ("overlapping non-identical data" via sub_467460).
- Span containment: A new record may partially overlap an existing record only if it is fully contained within the existing record's range, or if it fully subsumes the existing record. "Crossing" overlaps where neither contains the other trigger "overlapping data spans too much".
- Sorted order: Records are maintained in ascending offset order within the linked list, enabling the single-pass walk to detect all overlaps.
These checks are critical for correctness: when multiple translation units contribute the same constant (e.g., a __constant__ variable defined in a header included by multiple .cu files), their data regions may legally overlap in the merged section. The validation ensures the link is sound -- overlapping regions contain identical content, so the merged section is well-defined.
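The content-identity rule for Case 1 (partial forward overlap) can be sketched in isolation. This is an illustrative model: the real function reports failures through sub_467460 rather than returning codes, and the names below are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { OV_OK = 0, OV_SPANS_TOO_MUCH = 1, OV_NOT_IDENTICAL = 2 };

/* New data starting strictly inside an existing record's range
 * [ex_off, ex_off + ex_size): the new data must be fully contained
 * and the covered bytes must be bitwise identical. */
int check_partial_overlap(const uint8_t *ex_data, uint64_t ex_off,
                          uint64_t ex_size,
                          const uint8_t *new_data, uint64_t new_off,
                          uint64_t new_size) {
    if (new_off + new_size > ex_off + ex_size)
        return OV_SPANS_TOO_MUCH;          /* crossing overlap: fatal */
    if (memcmp(new_data, ex_data + (new_off - ex_off), new_size) != 0)
        return OV_NOT_IDENTICAL;           /* covered bytes must match */
    return OV_OK;                          /* alias; new record is discarded */
}
```

The same __constant__ value contributed twice passes the check; any byte mismatch or crossing overlap is a link-time error.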
The Two Invocation Contexts
Phase 9c: Merge-Constants Mode
Trigger: elfw+97 (merge-constants flag) is set
Target section: TEMP_MERGED_CONSTANTS
Arguments: a13=1, a14=NULL (no reloc list), a15=NULL (no overlap set)
In this mode, the function copies all constants into a single unified section. The a13=1 flag causes reachable constants to be force-copied even if they could be deduplicated (the function still deduplicates when the same value appears, but it does not skip unreachable constants). After sub_4339A0 returns:
- The original section's symbol list is freed via sub_464550.
- The original section's symbol list pointer is replaced with the temporary section's list.
- The original section's size and alignment are updated from the temporary section.
- The temporary section's pointers are zeroed.
Phase 9d: OCG Constant Optimization
Trigger: elfw+91 (force-OCG-optimization) OR any OCG section size exceeds
max_constant_bank_size (vtable+32)
Target section: TEMP_OCG_CONSTANTS
Arguments: a13=0, a14=&reloc_list, a15=overlap_set
This is the more aggressive path. The layout phase iterates all per-entry constant bank sections (elfw+272), and for each OCG constant section (type matching vtable+136) that has data, it calls sub_4339A0. The verbose output is:
"optimize OCG constants for %s, old size = %lld"
After all OCG sections have been processed through the dedup engine, the layout phase checks whether the optimization actually helped:
new_size = temp_section->size;
max_bank = arch_vtable->max_constant_bank_size();
if (new_size <= max_bank) {
// Optimization succeeded -- replace all OCG sections with merged data
for each ocg_section in elfw+272:
if ocg_section->type == OCG_CONSTANT_TYPE:
free old symbol list
copy merged symbol list (deep copy of 40-byte nodes)
set section->size = temp_section->size
set section->alignment = temp_section->alignment
// verbose: "new OCG constant size = %lld"
} else {
// Optimization did not reduce size below the limit
free temp section's symbol list
// verbose: "ocg const optimization didn't help so give up"
}
The first OCG section receives the merged list directly (move, not copy). Subsequent OCG sections receive deep copies of the merged symbol list, where each 40-byte data node is duplicated via arena_alloc and _mm_loadu_si128 (SSE2 128-bit copy for the 40-byte nodes, which are copied as two 16-byte loads plus one 8-byte load).
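Assuming the node layout described above, the 40-byte copy could look like the following SSE2 sketch (x86-64 only; `copy_node40` is an illustrative name, and the binary stores into arena memory rather than a caller-provided buffer):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 */

/* Copy one 40-byte node as two unaligned 16-byte SSE moves plus one
 * 8-byte move -- the pattern the decompiler shows for the deep copy. */
void copy_node40(void *dst, const void *src) {
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_loadu_si128((const __m128i *)((const uint8_t *)src + 16));
    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storeu_si128((__m128i *)((uint8_t *)dst + 16), hi);
    uint64_t tail;                               /* final 8 bytes (32..39) */
    memcpy(&tail, (const uint8_t *)src + 32, 8);
    memcpy((uint8_t *)dst + 32, &tail, 8);
}
```

Unaligned loads/stores avoid any alignment requirement on the arena pointers; 16 + 16 + 8 covers the full 40-byte node.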
The "optimize space" Pre-Pass
Before the dedup engine runs, the layout phase performs a pre-pass (visible in the decompiled code around line 1757 of sub_439830) that iterates local and global symbol vectors to identify and remove unreferenced constants. This pre-pass produces the "optimize space in %s (%d)" verbose message.
The pre-pass walks:
- Positive symbol array (elfw+344): For each symbol with type code 0xD (constant data) in the source section, calls sub_43FB70 to check reachability. Reachable constants are copied to a TEMP_USER_DATA section and inserted into a dedup hash table. Unreachable constants have their binding changed to LOCAL (bits masked: sym+5 = (sym+5 & 0xFC) | 1), effectively hiding them. Verbose: "local constant %s at offset %lld" or "remove unused constant %s".
- Negative symbol array (elfw+352): Same logic, but for globally-visible constants. Alignment defaults to min(size, 8) if no explicit alignment is recorded. Verbose: "constant %s at offset %lld" or "remove unused constant %s".
After scanning both vectors, the pre-pass lays out the temporary section via sub_4325A0, then replaces the original section's contents:
section_layout_engine(elfw, temp_section, reserve_null_ptr_flag);
original_section->symbol_list = temp_section->symbol_list;
original_section->size = temp_section->size;
original_section->alignment = temp_section->alignment;
temp_section->size = 0;
temp_section->symbol_list = NULL;
Then relocations targeting the original section are rewritten. For each relocation whose section index matches the source constant section (checked against both the .rela section index and the base section index), the function looks up the relocation's target offset in the dedup hash table. If found, it rewrites the relocation's offset:
for each reloc in elfw+376:
if reloc->section_idx matches source section:
sym = hash_lookup(dedup_table, reloc->target_offset);
if sym == NULL:
// verbose: "no symbol for reloc section %d at offset %lld?"
else:
new_offset = sym->value;
if new_offset != reloc->target_offset:
// verbose: "change reloc in section %d, offset from %lld to %lld"
reloc->target_offset = new_offset;
Hash Table Implementation
Creation and Mode Selection
The dedup hash tables are created by sub_4489C0 (htab_create), which allocates a 112-byte header and sets up the bucket array and entry pool. The constructor accepts three arguments: a hash function pointer, a compare function pointer, and an initial capacity hint.
// sub_4489C0 -- htab_create
// a1: hash_fn -- hash function pointer
// a2: cmp_fn -- equality comparison function pointer
// a3: capacity -- initial capacity hint (rounded up to power of two)
htab* htab_create(hash_fn a1, cmp_fn a2, uint32_t a3) {
htab* t = htab_alloc(a3); // sub_448840: allocates header + arrays
t->hash_fn_custom = a1; // offset +0
t->cmp_fn_custom = a2; // offset +8
// Fast-path recognition: if the function pointers match known
// simple implementations, set a mode tag in the header to avoid
// indirect call overhead during lookup/insert.
if (a2 == sub_44E130 && a1 == sub_44E120) {
// 32-bit identity hash + int equality
// Set bits [7:4] of header+84 to 0x2 --> mode 2 (modular)
t->flags = (t->flags & 0xF00F) | 0x0020;
}
if (a2 == sub_44E1E0 && a1 == sub_44E1C0) {
// Shift-xor hash + int64 equality
// Set bits [7:4] of header+84 to 0x1 --> mode 1 (shift-xor)
t->flags = (t->flags & 0xF00F) | 0x0010;
}
return t;
}
The mode tag stored at header+84 bits [7:4] controls which fast path the lookup and insert functions use, eliminating indirect function-call overhead for the two most common hash table configurations.
| Hash table | Hash function | Compare function | Mode tag | Key type |
|---|---|---|---|---|
| 32-bit values (a4) | sub_44E120 (identity) | sub_44E130 (int ==) | 2 (modular) | uint32_t zero-extended to 64-bit |
| 64-bit values (a5) | sub_44E1C0 (shift-xor) | sub_44E1E0 (int64 ==) | 1 (shift-xor) | uint64_t |
Hash Functions
The two hash functions used by the constant dedup tables are trivially simple, optimized for speed over distribution quality -- acceptable because constant values in GPU programs have low collision rates in practice.
32-bit identity hash (sub_44E120 at 0x44E120):
// The key IS the hash. For 32-bit constant values, this gives perfect
// distribution when values are distinct (common case: each float literal
// has a unique bit pattern).
uint32_t hash_identity(uint32_t key) {
return key;
}
64-bit xor-fold hash (sub_44E150 at 0x44E150):
// Folds a 64-bit key into 32 bits by XORing the high and low halves.
// For doubles, this combines the exponent+sign bits (high word) with
// the mantissa (low word), producing reasonable spread.
uint32_t hash_xor_fold(uint64_t key) {
return (uint32_t)key ^ (uint32_t)(key >> 32);
}
Shift-xor hash (sub_44E1C0 at 0x44E1C0):
// Three-way shift-xor used as the fast-path hash for mode 1 tables.
// This is the function recognized by htab_create to trigger mode 1.
// Applied to the 64-bit key; the three shifts (11, 8, 5) mix nearby
// bits to spread clustered values across buckets.
uint32_t hash_shift_xor(uint64_t key) {
return (uint32_t)(key >> 11) ^ (uint32_t)(key >> 8) ^ (uint32_t)(key >> 5);
}
In practice, the 32-bit dedup table uses the identity hash with mode 2 (modular: key & mask), and the 64-bit dedup table uses the shift-xor hash with mode 1. Both mode 1 and mode 2 inline the hash computation into the lookup/insert hot path, avoiding the overhead of an indirect call through a function pointer.
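The two inlined fast paths reduce to the following bucket-selection sketch, assuming `bucket_mask` is bucket_count - 1 as described in the header layout (function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Mode 2 fast path (32-bit dedup table): identity hash, modular reduction. */
uint32_t bucket_mode2(uint64_t key, uint32_t bucket_mask) {
    return (uint32_t)key & bucket_mask;
}

/* Mode 1 fast path (64-bit dedup table): three-way shift-xor, then mask. */
uint32_t bucket_mode1(uint64_t key, uint32_t bucket_mask) {
    uint32_t h = (uint32_t)(key >> 11)
               ^ (uint32_t)(key >> 8)
               ^ (uint32_t)(key >> 5);
    return h & bucket_mask;
}
```

With 64 buckets, the float bit pattern 0x3f800000 lands in bucket 0 under mode 2, while mode 1 spreads small clustered keys across distinct buckets.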
Header Layout (112 Bytes)
The hash table header allocated by sub_448840 has the following layout:
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | hash_fn | Custom hash function pointer (mode 0 path) |
| +8 | 8 | cmp_fn | Custom compare function pointer (mode 0 path) |
| +16 | 8 | hash_fn_alt | Alternate hash function (used when +32 flag is set) |
| +24 | 8 | cmp_fn_alt | Alternate compare function |
| +32 | 8 | use_alt_fns | Non-zero selects +16/+24 over +0/+8 in mode 0 |
| +40 | 4 | bucket_mask | bucket_count - 1 (always a power-of-two minus one) |
| +48 | 8 | entry_count | Current number of entries stored |
| +56 | 4 | xor_checksum | Running XOR of all inserted hash values (decompiled as *((_DWORD *)v3 + 14)) |
| +64 | 8 | load_limit | Resize threshold (initially 4 << ceil_log2(capacity)) |
| +72 | 4 | scan_cursor | Bitmap scan position for free-slot search |
| +76 | 4 | entry_capacity | Current capacity of the entry array |
| +80 | 4 | bitmap_capacity | Current capacity of the occupancy bitmap |
| +84 | 2 | flags | Bits [1:0]: entry array ownership, bits [3:2]: bitmap ownership, bits [7:4]: mode tag (0/1/2) |
| +88 | 8 | entries | Pointer to flat array of 16-byte entries [key:8, value:8] |
| +96 | 8 | bitmap | Pointer to uint32 occupancy bitmap (1 = slot occupied) |
| +104 | 8 | buckets | Pointer to array of bucket_count pointers to chain arrays |
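As a sanity check, the layout above can be expressed as a C struct. This assumes an LP64 target (8-byte pointers); the field and padding names are this article's reconstructions, not symbols from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t key; uint64_t value; } htab_entry; /* 16-byte slot */

/* Hypothetical C view of the 112-byte header described in the table. */
typedef struct htab {
    uint32_t (*hash_fn)(uint64_t);              /* +0   mode 0 hash        */
    int      (*cmp_fn)(uint64_t, uint64_t);     /* +8   mode 0 compare     */
    uint32_t (*hash_fn_alt)(uint64_t);          /* +16  alternate hash     */
    int      (*cmp_fn_alt)(uint64_t, uint64_t); /* +24  alternate compare  */
    uint64_t use_alt_fns;                       /* +32  selects +16/+24    */
    uint32_t bucket_mask;                       /* +40  bucket_count - 1   */
    uint32_t pad0;                              /* +44  (implied padding)  */
    uint64_t entry_count;                       /* +48                     */
    uint32_t xor_checksum;                      /* +56                     */
    uint32_t pad1;                              /* +60  (implied padding)  */
    uint64_t load_limit;                        /* +64  resize threshold   */
    uint32_t scan_cursor;                       /* +72                     */
    uint32_t entry_capacity;                    /* +76                     */
    uint32_t bitmap_capacity;                   /* +80                     */
    uint16_t flags;                             /* +84  mode in bits [7:4] */
    uint8_t  pad2[2];                           /* +86  (implied padding)  */
    htab_entry *entries;                        /* +88  flat entry pool    */
    uint32_t   *bitmap;                         /* +96  occupancy bitmap   */
    uint32_t  **buckets;                        /* +104 chain pointers     */
} htab;

/* The documented offsets fall out of natural alignment on LP64. */
_Static_assert(sizeof(htab) == 112, "header is 112 bytes");
```

Note the three 4-byte and one 2-byte padding holes (+44, +60, +86) implied by natural alignment; the documented offsets leave no other gaps.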
Bucket Chain Structure
Each bucket slot in the buckets array (at header offset +104) is a pointer to a dynamically-sized chain array. The chain array layout is:
+0: uint32_t capacity -- max number of index entries before realloc
+4: uint32_t idx[0] -- first entry index
+8: uint32_t idx[1] -- second entry index
...
+N: uint32_t 0xFFFFFFFF -- sentinel (end of chain)
A NULL bucket pointer means no entries hash to that bucket. When a chain needs to grow beyond its capacity, sub_448E70 allocates a new array of size 2 * old_capacity + 2, copies the existing indices, and frees the old chain via sub_431000.
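The chain layout and growth rule can be sketched as follows. This is our reconstruction of the described behavior; the binary's allocator and sub_431000 (the free path) are replaced by plain malloc/free.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHAIN_END 0xFFFFFFFFu

/* chain layout: [capacity, idx0, idx1, ..., CHAIN_END] */
static uint32_t *chain_append(uint32_t *chain, uint32_t slot_idx) {
    if (!chain) {
        /* new 12-byte chain: capacity=1, one index, sentinel */
        chain = malloc(3 * sizeof(uint32_t));
        chain[0] = 1;
        chain[1] = slot_idx;
        chain[2] = CHAIN_END;
        return chain;
    }
    uint32_t cap = chain[0], n = 0;
    while (chain[1 + n] != CHAIN_END)    /* count used index slots */
        n++;
    if (n == cap) {
        /* grow to 2*cap + 2 index slots, copy indices, free old array */
        uint32_t newcap = 2 * cap + 2;
        uint32_t *grown = malloc((newcap + 2) * sizeof(uint32_t));
        grown[0] = newcap;
        memcpy(grown + 1, chain + 1, n * sizeof(uint32_t));
        free(chain);
        chain = grown;
    }
    chain[1 + n] = slot_idx;             /* overwrite old sentinel */
    chain[2 + n] = CHAIN_END;            /* write new sentinel     */
    return chain;
}
```

The first growth step takes a capacity-1 chain to capacity 4, matching the `2 * old_capacity + 2` rule.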
Entry Array
The entry pool (at header offset +88) is a flat array of 16-byte slots:
entry[i] = {
+0: uint64_t key -- the constant value (zero-extended for 32-bit)
+8: uint64_t value -- pointer to the canonical symbol record
}
New entries are allocated by scanning the occupancy bitmap for the first zero bit (using _BitScanForward on the bitwise complement). The bitmap at header offset +96 tracks which entry slots are in use -- one bit per slot, packed into 32-bit words.
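The free-slot scan can be sketched like this. It is a hedged reconstruction: `__builtin_ctz` (GCC/Clang) stands in for the `_BitScanForward` intrinsic seen in the decompilation, and the cursor/grow handling is simplified.

```c
#include <assert.h>
#include <stdint.h>

/* Find the first zero bit in a packed uint32 occupancy bitmap,
   starting the scan at word index `cursor`. Returns the slot index,
   or -1 when every word is full (the real code then grows the bitmap). */
static int32_t find_free_slot(const uint32_t *bitmap, uint32_t nwords,
                              uint32_t cursor) {
    for (uint32_t i = cursor; i < nwords; i++) {
        uint32_t inverted = ~bitmap[i];   /* zero bits become one bits */
        if (inverted)                     /* ctz is undefined on 0, so guard */
            return (int32_t)(i * 32 + (uint32_t)__builtin_ctz(inverted));
    }
    return -1;
}
```

Inverting the word turns "first free slot" into "first set bit", which a single bit-scan instruction answers directly.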
Lookup Algorithm (sub_449A80)
The lookup function sub_449A80 implements hash lookup with three dispatch modes selected by bits [7:4] of header+84:
// sub_449A80 -- htab_lookup
// a1: htab* -- hash table header
// a2: uint64_t -- key to look up
// Returns: value (symbol*) if found, 0 if not found
symbol* htab_lookup(htab* t, uint64_t key) {
    uint8_t mode = (t->flags >> 4) & 0xF;
    uint32_t bucket_idx;
    switch (mode) {
    case 1:  // Shift-xor fast path (64-bit dedup table)
        bucket_idx = t->bucket_mask &
                     ((uint32_t)(key >> 11) ^
                      (uint32_t)(key >> 8) ^
                      (uint32_t)(key >> 5));
        break;
    case 2:  // Modular fast path (32-bit dedup table)
        bucket_idx = t->bucket_mask & (uint32_t)key;
        break;
    default: {  // Mode 0: indirect call through function pointer
        uint32_t h = t->use_alt_fns ? t->hash_fn_alt(key)
                                    : t->hash_fn(key);
        bucket_idx = t->bucket_mask & h;
        break;
    }
    }
    uint32_t* chain = t->buckets[bucket_idx];
    if (!chain)
        return NULL;
    // Walk the chain; chain[0] is capacity, entries start at chain[1]
    uint32_t* p = chain + 1;  // skip capacity field
    while (*p != 0xFFFFFFFF) {
        entry_t* e = &t->entries[*p];
        if (mode == 0) {
            // Use custom compare function
            if (t->use_alt_fns ? t->cmp_fn_alt(e->key, key)
                               : t->cmp_fn(e->key, key))
                return e->value;
        } else {
            // Modes 1 and 2: direct integer equality
            if (e->key == key)
                return e->value;
        }
        p++;
    }
    return NULL;
}
Insert and Resize (sub_448E70)
The insert function sub_448E70 first attempts a lookup; if the key exists, it replaces the value and returns the old value. If the key is new:
- Find free entry slot: scans the occupancy bitmap starting at `scan_cursor` for the first word with a zero bit, using `_BitScanForward(~bitmap[i])` to find the exact bit position. If all existing bitmap words are full, grows the bitmap by doubling.
- Grow entry array if needed: if the free slot index exceeds `entry_capacity`, doubles the entry array via realloc (or arena alloc + memcpy if arena-managed).
- Write entry: sets `entries[slot].key = key` and `entries[slot].value = value`.
- Append to chain: finds the bucket chain for the key's hash. If the chain exists, appends the slot index before the sentinel; if the chain's index count exceeds its capacity, reallocates at `2 * old_capacity + 2`. If no chain exists, creates a new 12-byte chain `[capacity=1, slot_idx, 0xFFFFFFFF]`.
- Update occupancy: sets the bit in the bitmap, XORs the hash into the running checksum, and increments `entry_count`.
- Resize check: if `entry_count` exceeds `load_limit`, triggers a full rehash:
  - Doubles `bucket_mask` to `2 * old_mask + 1` (doubling the bucket count).
  - Doubles `load_limit`.
  - Allocates a new bucket pointer array and zeroes it.
  - Frees all old chain arrays via `sub_431000`.
  - Walks the occupancy bitmap: for each occupied slot, recomputes the hash (via an indirect call through `hash_fn`) and appends the slot index to the chain for its new bucket, using the same chain-append logic with capacity tracking.
The rehash keeps inserts amortized O(1). The resize threshold starts at 4 << ceil_log2(initial_capacity) and doubles on each rehash.
Dedup Table Sizing for Constants
Both dedup tables are created with an initial capacity hint of 256 (passed from the layout phase). After sub_45CB00 (ceil_log2) and sub_448840:
- `ceil_log2(256)` = 8
- `bucket_count` = `1 << 8` = 256
- `bucket_mask` = 255
- `load_limit` = `4 << 8` = 1024
So each table starts with 256 buckets and will not rehash until it holds more than 1024 entries. For typical CUDA programs with hundreds of distinct constants, no rehash occurs.
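The sizing arithmetic can be checked with a small sketch. `ceil_log2` here is our reimplementation for illustration, not the binary's sub_45CB00.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest b such that (1 << b) >= n. */
static uint32_t ceil_log2(uint32_t n) {
    uint32_t bits = 0;
    while ((1u << bits) < n)
        bits++;
    return bits;
}

/* Derive the table parameters from a capacity hint, as described above. */
static void table_params(uint32_t hint, uint32_t *bucket_count,
                         uint32_t *bucket_mask, uint32_t *load_limit) {
    uint32_t b = ceil_log2(hint);
    *bucket_count = 1u << b;           /* power-of-two bucket array  */
    *bucket_mask  = *bucket_count - 1; /* for hash & mask bucketing  */
    *load_limit   = 4u << b;           /* 4x buckets before rehash   */
}
```

A hint of 256 yields exactly the 256/255/1024 figures above; a non-power-of-two hint rounds up (e.g. 300 gives 512 buckets).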
CLI Option Interaction
The two relevant CLI options are registered in sub_427AE0 (option parsing):
--no-opt "Turn off linker optimization of data resources"
Type: bool, stored in byte_2A5F2A9
--optimize-data-layout "Force linker optimization of data resources"
Type: bool, stored in byte_2A5F2A8
These are mutually exclusive. If both are specified, the option parser emits a fatal error via sub_467460 with error descriptor unk_2A5B650, passing "-no-opt" and "-optimize-data-layout" as conflicting option names.
The flags propagate to the elfw object during construction:
| Global | Elfw offset | Effect |
|---|---|---|
| byte_2A5F2A9 | +90 | no-opt: disables shared memory optimization, constant dedup, forces simple linear layout |
| byte_2A5F2A8 | +83 | optimize-data-layout: forces the "optimize space" pre-pass and OCG constant optimization even when sections are within the bank size limit |
When --no-opt is active, the layout phase:
- Uses simple linear layout for shared memory (skips interference graph optimization).
- Skips constant deduplication entirely.
- Uses direct offset assignment instead of sorted layout in `sub_4325A0`.
- Produces the assertion `"should only reach here with no opt"` if a zero-alignment symbol is encountered.
When --optimize-data-layout is active, the layout phase:
- Forces the "optimize space" pre-pass (`elfw+83` check at line 1757).
- Forces OCG constant optimization (`elfw+91` OR size-exceeded check).
- Enables dead constant elimination via `sub_43FB70` reachability.
When neither flag is set (the default), optimization triggers automatically only when an OCG constant section exceeds the architecture's bank size limit.
Verbose Trace Strings
All strings are gated by (*(_BYTE*)(elfw + 64) & 2) != 0 (verbose layout flag, bit 1):
| String | Context | Meaning |
|---|---|---|
"optimize space in %s (%d)" | Pre-pass entry | Starting constant space optimization for a section |
"local constant %s at offset %lld" | Pre-pass local scan | Copying reachable local constant to temp section |
"constant %s at offset %lld" | Pre-pass global scan | Copying reachable global constant to temp section |
"remove unused constant %s" | Pre-pass dead elim | Demoting unreachable constant to LOCAL binding |
"no symbol for reloc section %d at offset %lld?" | Reloc rewrite | No dedup-table entry for a relocation target |
"change reloc in section %d, offset from %lld to %lld" | Reloc rewrite | Updating relocation offset after dedup |
"layout and merge section %s" | Phase 9c entry | Starting merge-constants mode dedup |
"found duplicate value 0x%x, alias %s to %s" | 32-bit dedup | Aliasing a 32-bit constant to existing copy |
"found duplicate 64bit value 0x%llx, alias %s to %s" | 64-bit dedup | Aliasing a 64-bit constant to existing copy |
"found duplicate %d byte value, alias %s to %s" | Large-value dedup | Aliasing a 12..64-byte constant (from sub_433870) |
"optimize ocg constant reloc offset from %lld to %lld" | OCG reloc patch | Rewriting relocation after constant placement |
"optimize OCG constants for %s, old size = %lld" | Phase 9d entry | Starting OCG constant optimization for a section |
"new OCG constant size = %lld" | Phase 9d result | Total size after OCG deduplication |
"ocg const optimization didn't help so give up" | Phase 9d abandon | Deduped size still exceeds bank limit |
Worked Example
Consider two translation units that each define a constant:
// TU A: .nv.constant0.kernelA
float c1 = 3.14f; // 4 bytes, value 0x4048F5C3
float c2 = 2.71f; // 4 bytes, value 0x402D70A4
// TU B: .nv.constant0.kernelB
float c3 = 3.14f; // 4 bytes, value 0x4048F5C3 (duplicate of c1)
double c4 = 1.0; // 8 bytes, value 0x3FF0000000000000
After merge, the per-entry constant sections contain:
kernelA constants: c1 (0x4048F5C3), c2 (0x402D70A4) -- 8 bytes total
kernelB constants: c3 (0x4048F5C3), c4 (0x3FF0...) -- 12 bytes total
When sub_4339A0 processes these:
- c1 (value 0x4048F5C3): not in the 32-bit hash table. Insert at offset 0. Hash table: `{0x4048F5C3 -> c1}`.
- c2 (value 0x402D70A4): not in the 32-bit hash table. Insert at offset 4. Hash table: `{0x4048F5C3 -> c1, 0x402D70A4 -> c2}`.
- c3 (value 0x4048F5C3): found in the hash table, a duplicate of c1. Alias: `c3.value = c1.value = 0`. No data copied. Verbose: `"found duplicate value 0x4048f5c3, alias c3 to c1"`.
- c4 (value 0x3FF0...): not in the 64-bit hash table. Insert at offset 8 (aligned to 8). Hash table updated.
Result: merged section is 16 bytes instead of 20, and c3 shares c1's storage.
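The walk-through can be re-enacted with a toy model. This is illustrative only: a linear scan stands in for the two hash tables, both 32-bit and 64-bit keys share one table here, and alignment is assumed to equal the value size.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t key; uint64_t offset; } entry;

typedef struct {
    entry    seen[8];     /* stand-in for the dedup hash tables */
    int      nseen;
    uint64_t out_size;    /* bytes emitted into the merged section */
} dedup;

/* Place a constant of `size` bytes, returning its canonical offset.
   A repeated value is aliased to the first copy; nothing is emitted. */
static uint64_t place(dedup *d, uint64_t value, uint64_t size) {
    for (int i = 0; i < d->nseen; i++)
        if (d->seen[i].key == value)
            return d->seen[i].offset;       /* duplicate: alias it */
    /* align the append cursor to the value's natural alignment */
    if (d->out_size % size)
        d->out_size += size - d->out_size % size;
    uint64_t off = d->out_size;
    d->seen[d->nseen++] = (entry){ value, off };
    d->out_size += size;
    return off;
}
```

Running c1, c2, c3, c4 through `place` reproduces the offsets 0, 4, 0 (aliased), and 8, for a 16-byte merged section instead of 20.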
Key Implementation Details
Zero-value special casing. The hash tables use the constant value as the key, which means a zero value would map to hash bucket 0 and could collide with the "not found" sentinel. The code explicitly checks val != 0 before looking up in the hash table. Zero-valued constants are handled separately: they are either always copied (when a15 == NULL) or tracked per-entry via the overlap set (when a15 != NULL).
Alignment preservation. When a constant is aliased, the duplicate symbol receives the original's offset but its alignment requirement is not checked against the target. This is safe because the first occurrence was placed with at least its own alignment, and all subsequent duplicates have the same size (and therefore the same natural alignment).
Deep copy for multi-entry OCG. In phase 9d, the first OCG section receives the merged symbol list by direct pointer assignment (move semantics). All subsequent OCG sections receive deep copies, where each 40-byte data node is cloned via SSE2 memcpy (_mm_loadu_si128 x2 + 8-byte load). This ensures each per-entry section has independent node memory that can be freed separately.
Relocation list surgery. The OCG relocation rewriting removes matched relocations from the singly-linked list in-place. The predecessor pointer v42/v52/v66/v72 (depending on which code path) tracks the previous node for unlinking. If the match is the first node, *a14 (the list head) is updated directly; otherwise, the predecessor's next pointer skips the matched node.
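The unlink pattern can be sketched generically. The `reloc` type and the match predicate (a plain offset comparison) are illustrative simplifications, not the binary's structures.

```c
#include <assert.h>
#include <stddef.h>

typedef struct reloc { long offset; struct reloc *next; } reloc;

/* Remove the first node whose offset matches. If the match is the head,
   *head is updated directly; otherwise the predecessor's next pointer
   is spliced around the matched node. Returns the unlinked node so the
   caller can rewrite and reuse it, or NULL if nothing matched. */
static reloc *unlink_reloc(reloc **head, long offset) {
    reloc *prev = NULL;
    for (reloc *cur = *head; cur; prev = cur, cur = cur->next) {
        if (cur->offset == offset) {
            if (prev)
                prev->next = cur->next;
            else
                *head = cur->next;
            cur->next = NULL;
            return cur;
        }
    }
    return NULL;
}
```

Tracking a predecessor pointer avoids a second pass over the list, which matters because the rewriting loops walk each relocation list exactly once.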
Confidence Assessment
| Claim | Confidence | Evidence |
|---|---|---|
| sub_4339A0 at 0x4339A0, 15-parameter signature | HIGH | Decompiled sub_4339A0: function at 0x4339a0 with parameters (a1..a15) including char a13, _QWORD *a14, __int64 a15 exactly matching wiki |
| Hash table create sub_4489C0 mode detection for 32-bit and 64-bit | HIGH | Decompiled sub_4489C0: compares sub_44E130/sub_44E120 for mode 2 and sub_44E1E0/sub_44E1C0 for mode 1; bitmask 0xF00F \| 0x20 / 0x10 confirmed |
| Identity hash sub_44E120 returns key unchanged | HIGH | Decompiled: return a1; |
| XOR-fold hash sub_44E150 returns lo32 ^ hi32 | HIGH | Decompiled: return (unsigned int)a1 ^ HIDWORD(a1); |
| Shift-xor hash sub_44E1C0 returns (a>>11)^(a>>8)^(a>>5) | HIGH | Decompiled: exact expression confirmed |
| sub_433870 (dedup_memcmp) 8-parameter function matching wiki signature | HIGH | Decompiled: sub_433870(a1, a2, a3, a4, a5, a6, a7, a8) with memcmp at line 33, alignment check *(_QWORD *)(v13 + 16) == v10 at line 33 |
| sub_433870 emits "found duplicate %d byte value, alias %s to %s" | HIGH | Decompiled line 42: fprintf(stderr, "found duplicate %d byte value, alias %s to %s\n", ...) and string at addr 0x1d38858 in nvlink_strings.json |
| sub_433870 checks alignment before memcmp | HIGH | Decompiled line 33: *(_QWORD *)(v13 + 16) == v10 (alignment at data_node+16) precedes memcmp |
| sub_433870 uses sub_4644C0 for list prepend on miss | HIGH | Decompiled line 54: return sub_4644C0(a4, a2); |
| sub_433760 (section_data_copy) allocates 40-byte node | HIGH | Decompiled: sub_4307C0(v10, 40) at line 28 |
| sub_433760 updates sh_addralign at section offset +48 | HIGH | Decompiled line 25: if ( v7 > *(_QWORD *)(result + 48) ) *(_QWORD *)(result + 48) = v7; |
| sub_433760 emits "tail data node not found" assertion | HIGH | Decompiled line 53: string literal confirmed; also at addr 0x1d38839 in nvlink_strings.json |
| sub_433760 computes aligned offset v16 + v7 - v16 % v7 | HIGH | Decompiled lines 37-40: v17 = v16 % v7; v18 = v16 + v7 - v16 % v7; if (v17) v16 = v18; |
| Verbose gate at elfw+64, bit 1 | HIGH | Decompiled sub_433870 line 41: if ( (*(_BYTE *)(a1 + 64) & 2) != 0 ) |
| "found duplicate value 0x%x, alias %s to %s" string | HIGH | String at addr 0x1d38888 in nvlink_strings.json |
| "found duplicate 64bit value 0x%llx, alias %s to %s" string | HIGH | String at addr 0x1d388f0 in nvlink_strings.json |
| "optimize ocg constant reloc offset from %lld to %lld" string | HIGH | String at addr 0x1d388b8 in nvlink_strings.json |
| "optimize OCG constants for %s, old size = %lld" string | HIGH | String at addr 0x1d39028 in nvlink_strings.json |
| "optimize space in %s (%d)" string | HIGH | String at addr 0x1d38ccf in nvlink_strings.json |
| "remove unused constant %s" string | HIGH | String at addr 0x1d38cf9 in nvlink_strings.json |
| "no symbol for reloc section %d at offset %lld?" string | HIGH | String at addr 0x1d38fc0 in nvlink_strings.json |
| CLI flags --no-opt at byte_2A5F2A9 and --optimize-data-layout at byte_2A5F2A8 | MEDIUM | Inferred from option parsing patterns in sub_427AE0; flag offsets reconstructed from layout phase conditional checks |
| Size-class dispatch: 4, 8, 12, 16, 20, 24, 32, 48, 64 bytes | MEDIUM | Reconstructed from sub_4339A0 switch-case structure matching parameters a6..a12 to specific sizes; confirmed via parameter count and linked-list bucket pattern |
| OCG relocation rewriting appears 4 times in sub_4339A0 | MEDIUM | Inferred from decompiled code structure showing repeated reloc-walk patterns with predecessor tracking; exact count reconstructed from block analysis |
| Phase 9c uses a13=1, Phase 9d uses a13=0 with non-null a15 | MEDIUM | Reconstructed from sub_439830 call sites into sub_4339A0; parameter passing patterns match documented behavior |
Related Pages
- Layout Phase -- parent phase (sub_439830) containing all 10 layout sub-phases
- Section Merging -- how sections from multiple TUs are combined before dedup
- Dead Code Elimination -- callgraph-based reachability that feeds into constant liveness
- Hash Tables -- general hash table infrastructure used by the dedup engine
- Symbol Resolution -- symbol records whose offsets are modified during constant dedup
- Bindless Relocations -- bindless descriptor sections in the same constant banks optimized here
- R_CUDA Relocations -- R_CUDA_CONST_FIELD relocations rewritten during OCG dedup