Whole-Program Devirtualization
CICC v13.0 includes LLVM's WholeProgramDevirtPass at sub_2703170 (13,077 bytes), which replaces indirect virtual calls with direct calls using whole-program type information. On GPU this optimization is far more consequential than on CPU: an indirect call in PTX compiles to a call.uni through a register, which prevents the backend from inlining the callee, forces all live registers across the call boundary into local memory spills, destroys instruction scheduling freedom, and creates a warp-divergence hazard if threads in the same warp resolve the function pointer to different targets. A single devirtualized call site in a hot kernel loop can therefore improve performance by an order of magnitude -- the direct call enables inlining by the inliner cost model, which in turn eliminates .param-space marshaling, enables cross-boundary register allocation, and restores the instruction scheduler's ability to interleave memory and arithmetic operations.
CICC's devirtualization operates in a privileged position: GPU compilation is inherently a closed-world model. Every function that can be called on the device must be visible at link time -- there is no dynamic loading, no shared libraries, and no dlopen on GPU. This means the set of possible implementations for any virtual function is fully known, making single-implementation devirtualization almost always profitable and branch funnels rare. The pass runs as a module-level pass (pipeline parser slot 121, registered as "wholeprogramdevirt") during the LTO phase, after the NVModuleSummary builder has computed type test metadata and before GlobalDCE eliminates dead virtual methods.
| Entry point | sub_2703170 (0x2703170, 13,077 bytes) |
| Address range | 0x2703170--0x2706485 |
| Stack frame | 856 bytes (0x358) |
| Pass name | "wholeprogramdevirt" (pipeline slot 121) |
| Pass type | Module pass |
| Callee-saved | r15, r14, r13, r12, rbx |
| Return value | 1 = module modified, 0 = no changes |
| Remark category | "wholeprogramdevirt" / "Devirtualized" |
| Helper range | sub_2700B00--sub_2708220 (branch funnel helpers, summary I/O) |
The Closed-World GPU Advantage
Upstream LLVM's WholeProgramDevirt is designed primarily for LTO pipelines where some modules may not be visible (ThinLTO import/export split, shared libraries with hidden visibility). The pass must therefore be conservative: it can only devirtualize when !type metadata proves that the vtable set is complete. On GPU, this conservatism is unnecessary. All device code is statically linked into a single fatbinary -- there are no device-side shared libraries, no runtime code loading (the driver JIT compiles PTX, but does not add new device functions), and __device__ virtual functions cannot escape to host code. The entire class hierarchy is visible.
CICC exploits this by running WPD in regular LTO mode (not ThinLTO export/import split), where the pass directly resolves virtual calls against the merged module. The NVModuleSummary builder records type_test metadata for all device vtables, and the pass consumes this metadata to build a complete picture of every virtual call site and every possible target. In practice, GPU programs rarely have deep polymorphic hierarchies in device code (the hardware penalties discourage it), so most virtual call sites resolve to a single implementation.
The Formal Closed-World Argument
The closed-world guarantee on GPU rests on five architectural invariants, each of which eliminates a source of conservatism that forces upstream LLVM to leave calls indirect:
| # | Invariant | What upstream LLVM must worry about | Why GPU is immune |
|---|---|---|---|
| 1 | No device-side shared libraries | A .so loaded at runtime could add a new vtable entry for a class. LTO must mark !vcall_visibility metadata linkage-unit to prove the vtable set is closed within the link unit. | The CUDA driver loads PTX/SASS as a monolithic blob. cuModuleLoad does not support incremental symbol addition. There is no dl_iterate_phdr on device. |
| 2 | No dlopen on device | Host-side dlopen can inject new implementations of virtual functions. Upstream must check !vcall_visibility for translation-unit scope. | Device code has no equivalent of dlopen. The only way to add device code is to recompile and reload the entire module. |
| 3 | No device-side RTTI | dynamic_cast and typeid on host can defeat devirtualization by requiring the vtable to contain RTTI pointers that reference external type_info objects. | CUDA explicitly prohibits dynamic_cast and typeid in __device__ functions. Device vtables contain no RTTI pointers. The NVVM IR verifier (sub_12DD660) rejects code that attempts dynamic_cast in device context. |
| 4 | No exceptions on device | Virtual destructors in exception-handling code create additional vtable entries and __cxa_throw unwinding paths that must be considered. | CUDA does not support exceptions in device code. Virtual destructors are simple (no EH cleanup), and the compiler can see every destructor call site. |
| 5 | Complete link-time visibility | ThinLTO's import/export split means some modules may not be available during WPD. The pass must use summary-based resolution with wholeprogramdevirt-summary-action=import/export. | CICC uses wholeprogramdevirt-summary-action=none (direct resolution on the merged module). All device functions, including those from separate compilation units, are linked by nvlink into a single merged module before the LTO pipeline runs. |
The practical consequence: CICC sets whole-program-visibility effectively to true for all device code. The !vcall_visibility metadata that upstream uses to distinguish "linkage-unit" from "translation-unit" scope becomes irrelevant -- every device vtable is within a single, complete, closed translation unit.
How NVModuleSummary Feeds WPD
The NVModuleSummary builder at sub_D7D4E0 (2,571 decompiled lines, 74KB) produces the type metadata that WPD consumes. The interaction is:
- `NVModuleSummary` walks every `GlobalValue` in the module (linked list at `Module+72`). For each function (opcode `0x3D`), it extracts attribute groups #34 (reference edges with type metadata) and #35 (direct call targets) via `sub_B91C10`.
- For reference edges with type info (attribute #34), the builder decodes MDNode operands (lines 1193-1228 of the decompilation): each parameter position >= 2 yields a type node (opcode range 5-36), walked to a parent MDTuple (opcode 17) containing the type name string at offset 24 (indirect through pointer if length > 64).
- These type-metadata edges are packed into the `FunctionSummary` record by `sub_D77220` as the `v378` (type-checked references) argument. The resulting metadata lands in the module as `llvm.type.test` / `type_test_assume` named metadata nodes.
- WPD reads these nodes back via `sub_B6AC80(module, 0x166)` at its entry point, completing the producer-consumer chain.

DevirtSCCRepeatedPass: The Outer Loop
WPD at the module level is one of two devirtualization mechanisms. The other operates at CGSCC granularity: DevirtSCCRepeatedPass at sub_2284BC0 (16KB) wraps the CGSCC pipeline in a fixed-point iteration loop that re-runs until no new devirtualization opportunities are discovered or a maximum iteration count is reached. On reaching the limit, the pass emits "Max devirtualization iterations reached". The abort-on-max-devirt-iterations-reached knob (registered at constructor 378) controls whether this is a fatal error or a warning. The iteration count at O1-O3 is 1; at tier 3 (maximum optimization) it is 5, giving the inliner and devirtualizer multiple rounds to discover indirect-to-direct call conversions that expose further inlining opportunities.
The two mechanisms are complementary: module-level WPD resolves virtual calls using global type hierarchy information (vtable metadata), while CGSCC-level devirtualization catches cases where inlining reveals new constant function pointers that can be resolved without type metadata.
Algorithm
The pass executes in seven phases:
Phase 1: Metadata Extraction (0x2703170--0x27031CA)
The entry point fetches four named metadata nodes from the module using sub_B6AC80 (getNamedMetadata):
| Enum ID | Metadata Node | Purpose |
|---|---|---|
| 0x166 (358) | llvm.type.test / type_test_assume | Records of @llvm.assume(@llvm.type.test(%ptr, %typeID)) intrinsic results |
| 0x164 (356) | llvm.type.checked.load | Call sites using type-checked vtable loads |
| 0x165 (357) | llvm.type.checked.load.relative | Relative vtable pointer variant (compact vtables) |
| 0x0B (11) | Module-level type metadata | Type summaries describing vtable layouts |
If neither type_test_assume nor module-level type metadata is present, the pass checks for type_checked_load and type_checked_load_relative as fallbacks. If none exist, the pass returns 0 immediately.
The assembly sequence at the entry point:
; 0x2703170: entry
mov esi, 0x166 ; enum ID = 358 (type_test_assume)
call sub_B6AC80 ; rbx = getNamedMetadata(module, 0x166)
mov esi, 0x164 ; enum ID = 356 (type_checked_load)
call sub_B6AC80 ; r13 = getNamedMetadata(module, 0x164)
mov esi, 0x165 ; enum ID = 357 (type_checked_load_relative)
call sub_B6AC80 ; [rbp-0x338] = result
mov esi, 0x0B ; enum ID = 11 (module-level type metadata)
call sub_B6AC80 ; r12 = result
Phase 2: Type Test Record Iteration (0x2703296--0x2703383)
Type test records are stored in an array at offset +0xA0 of the metadata state, with count at +0xA8. Each record is 144 bytes (0x90):
struct TypeTestRecord { // 0x90 = 144 bytes per record
uint8_t *type_value; // +0x00: pointer to type test value
// ... call site references, metadata links ...
};
// Iteration pattern at 0x2703296:
TypeTestRecord *base = state->records; // [state + 0xA0]
uint32_t count = state->record_count; // [state + 0xA8]
TypeTestRecord *end = base + count; // stride = 0x90
// Address computation in binary:
// lea rax, [rax+rax*8] ; count * 9
// shl rax, 4 ; count * 144 = count * 0x90
// add rax, rdx ; end pointer
for (TypeTestRecord *rec = base; rec != end; rec++) {
if (rec->type_value[0] != 0) continue; // skip already-processed
// ... look up type in hierarchy ...
}
For each record whose type byte is 0 (unprocessed), the pass computes a string hash of the type name via sub_B91420 (get type name) and sub_B2F650 (string hash), then looks up the type in a red-black tree rooted at offset +0xE0 of the module state.
Phase 3: Hash Table Construction (0x2703589--0x2703AE2)
Unique type test values are tracked in an open-addressed hash table with 56-byte entries. The hash function combines bit-shifted fields to reduce clustering:
uint32_t hash(uint32_t val, uint32_t mask) {
return ((val >> 4) ^ (val >> 9)) & mask;
}
The table uses power-of-2 sizing with LLVM-layer sentinels (empty = 0xFFFFFFFFE000, deleted = 0xFFFFFFFFF000). See Hash Table and Collection Infrastructure for the probing and growth policy.
Each 56-byte hash table entry stores:
| Offset | Size | Field |
|---|---|---|
+0x00 | 8 | Type test value (key) |
+0x08 | 8 | Flags / padding |
+0x10 | 8 | Type info pointer |
+0x18 | 8 | Associated data (resolution result) |
+0x20 | 8 | Red-black tree node (self-referential on init) |
+0x28 | 8 | Link pointer |
+0x30 | 8 | Count / size |
Slot addressing uses the identity slot_index * 7 * 8 = slot_index * 56:
; At 0x27035A0:
lea rdx, ds:0[rsi*8] ; rsi = slot index, rdx = slot*8
sub rdx, rsi ; rdx = slot*8 - slot = slot*7
mov rsi, [rdi+rdx*8] ; load from table base + slot*56
Table growth is handled by sub_2702540, which reallocates and rehashes all entries using the same (val >> 4) ^ (val >> 9) function against the new mask. Entry initialization at 0x2703A33:
; Insert new entry:
add [rbp-0x2D0], 1 ; increment unique type count
call sub_2702540 ; grow table if needed (returns new entry ptr in rax)
mov dword [rax+10h], 0 ; clear type info
mov qword [rax+18h], 0 ; clear data
mov [rax], rdx ; store type test value
lea rdx, [rax+10h]
mov [rax+20h], rdx ; self-referential link (RB tree node init)
mov [rax+28h], rdx ; self-referential link
mov qword [rax+30h], 0 ; zero count
Phase 4: Type Hierarchy Lookup via Red-Black Tree (0x27032F7--0x2703362, 0x2704183--0x2704267)
For each unique type, the pass searches a red-black tree keyed by hashed type name. The tree is rooted at offset +0xE0 of the module state, with the sentinel node at +0xD8. The search is a two-phase process with a three-field comparison:
Phase 4a: Compute Type Name Hash
// At 0x27032F7:
char *name = sub_B91420(type_value); // returns (name_ptr, name_len)
uint64_t hash = sub_B2F650(name, len); // string hash
// Tree root and sentinel:
RBTreeNode *root = module_state[+0xE0]; // root pointer
RBTreeNode *sentinel = module_state + 0xD8; // sentinel node address
sub_B2F650 (stringHash) is LLVM's standard xxHash-style string hasher. It produces a 64-bit hash that is stored at node[+0x20] for each type in the tree.
Phase 4b: Descend Tree by Hash
// At 0x270330C:
RBTreeNode *current = root;
RBTreeNode *best = sentinel; // rcx = sentinel initially
while (current != NULL) {
uint64_t node_hash = current[+0x20]; // hash stored in node
if (target_hash < node_hash) {
best = current; // track nearest greater
current = current[+0x10]; // left child
} else if (target_hash > node_hash) {
current = current[+0x18]; // right child
} else {
// hash matches -- proceed to Phase 4c
break;
}
}
if (current == NULL) goto not_found;
The binary encodes this as:
compare_node:
cmp rsi, [r15+20h] ; compare target hash vs node hash
ja go_right ; target > node -> right child
jnb hash_match ; target == node -> verify
mov rcx, r15 ; track best (left-leaning)
mov r15, [r15+10h] ; r15 = left child
test r15, r15
jnz compare_node
jmp not_found
go_right:
mov r15, [r15+18h] ; r15 = right child
test r15, r15
jnz compare_node
jmp not_found
Phase 4c: Verify Full Match (Hash Collision Resolution)
On hash match, the pass performs a two-step verification to handle collisions:
// At 0x2704200:
// Step 1: compare string lengths
if (current[+0x30] != target_length) {
// Length mismatch -- this is a hash collision, not a real match.
// Continue tree traversal to the next candidate.
goto next_candidate;
}
// Step 2: compare actual type name strings
char *node_name = (char *)current[+0x28]; // node's type name data
char *target_name = target_string; // from sub_B91420
int cmp = memcmp(node_name, target_name, target_length);
if (cmp != 0) goto next_candidate;
// Verified match -- read vtable data
The binary at 0x2704200--0x2704240:
cmp r12, [r15+30h] ; compare string length
jnz next_candidate ; length mismatch
mov rdi, [r15+28h] ; s1 = node's string data
mov rsi, [rbp-0x348] ; s2 = target string data
mov rdx, r12 ; n = length
call _memcmp
test eax, eax
jz found_match
Phase 4d: Extract Vtable Data
After verifying the type match, the pass reads the vtable descriptor from the type node:
// At 0x2704248:
void *vtable_start = current[+0x68]; // vtable start address
void *vtable_data = current[+0x70]; // vtable data pointer (function pointers)
if (vtable_data == NULL) goto skip_type; // no vtable -> nothing to devirtualize
The vtable_data pointer leads to an array of function pointers representing the virtual method implementations for this type. The pass iterates this array comparing each entry against call site signatures to identify devirtualization candidates.
Phase 5: Virtual Call Resolution (0x2703974--0x27039BA)
For each call site on a matched type, the pass calls sub_26FEE10 (resolveVirtualCall):
bool resolveVirtualCall(
void *module_state, // rdi: r15 (module/pass state)
void *target_candidates, // rsi: candidates vector from [rbp-0x230]
void *hash_entry, // rdx: r12 (pointer to hash table entry + 8)
uint32_t candidate_count, // rcx: from [rbp-0x228]
void *call_site_info // r8: r13 (call site from [r15+0x28])
);
// Returns: al = 1 if unique resolution found, 0 otherwise
The resolution algorithm within sub_26FEE10 works by comparing the vtable offset encoded in each call site's llvm.type.test / llvm.type.checked.load intrinsic against the vtable slot offsets of all candidate implementations. When exactly one candidate matches, the resolution succeeds with strategy 1 (direct call). When multiple candidates exist but all return the same constant or can be distinguished by a single offset, strategy 2 (unique member) is chosen. When multiple distinct targets exist, strategy 3 (branch funnel) is produced.
The resolution result is written to hash_entry[+0x28] as a strategy selector:
| Value | Strategy | Upstream LLVM counter |
|---|---|---|
| 1 | Direct call (single implementation) | NumSingleImpl |
| 2 | Unique member dispatch | NumUniformRetVal / NumUniqueRetVal |
| 3 | Branch funnel | NumBranchFunnel |
Before calling sub_26FEE10, the pass checks two preconditions:
// At 0x2703974:
void *call_site_list = module_state[+0x28]; // r13 = [r15+0x28]
if (call_site_list == NULL) goto skip;
if (type_value[0] != 0) goto skip; // byte check: direct type info only
void *existing = hash_entry[+0x28];
if (existing != 0) goto already_resolved; // skip if previously resolved
Phase 6: Strategy Application (0x2703BA3--0x27046F0)
Strategy 1 -- Direct Call Replacement (0x27044DA)
When only one class implements the virtual function (the common case on GPU), the indirect call is replaced with a direct call to the resolved function. This is handled by sub_26F9AB0 (rewriteCallToDirectCall):
// At 0x27044DA:
void rewriteCallToDirectCall(
void *type_entry, // rdi: r12
void *call_site, // rsi: [r15+0x38]
uint64_t vtable_data, // rdx: byte_3F871B3 (vtable offset data)
uint32_t flags, // ecx: 0
void *resolved_function // r8: [rbx+0x40]
);
This is the simplest and most common optimization: the call.reg becomes call.direct, enabling downstream inlining. On GPU this is by far the dominant strategy. Consider a CUDA kernel with a virtual method call inside a loop:
; Before devirtualization (PTX):
ld.global.u64 %rd1, [%rd0]; // load vtable ptr
ld.global.u64 %rd2, [%rd1+16]; // load function ptr at vtable slot 2
call.uni %rd2, (%args); // indirect call -- full scheduling barrier
; After devirtualization (PTX):
call.uni _ZN7Derived4workEv, (%args); // direct call -- inlinable
The direct call then becomes an inlining candidate with CICC's 20,000-unit budget (89x the upstream LLVM default of 225), and the inliner typically eliminates it entirely, producing fully-inlined code with no call overhead.
Strategy 2 -- Unique Member Dispatch (0x27045C9)
When multiple classes exist but the call can be dispatched through a unique member offset, the pass rewrites via sub_26F9080 (rewriteToUniqueMember), passing the diagnostic string "unique_member" (13 chars). The member offset is read from hash_entry[+0x60] and the base type from hash_entry[+0x00].
; At 0x27045D9:
mov r11, [r12] ; type info (base type)
mov rsi, [r12+60h] ; member offset
lea rax, "unique_member" ; diagnostic string (13 chars)
call sub_26F9080 ; rewriteToUniqueMember
; rdx = r14 (type test record)
; rcx = r13 (call site)
; r9 = rdi (vtable byte offset / 8)
; "unique_member" + length 0x0D pushed on stack
After the initial rewrite, sub_26FAF90 performs call-site-specific fixup, checking [rbx+0x40] to determine if additional adjustment is needed (e.g., adjusting this pointer offset for multiple inheritance).
Upstream LLVM's equivalent covers two sub-strategies: uniform return value optimization (all implementations return the same constant -- replace the call with that constant) and unique return value optimization (for i1 returns, compare the vptr against the one vtable that returns a different value). Both are folded under the "unique_member" label in CICC's implementation.
Strategy 3 -- Branch Funnel (0x27043B5)
When multiple possible targets exist and cannot be reduced to a single dispatch, the pass creates a branch funnel -- a compact conditional dispatch sequence that checks the vtable pointer and branches to the correct target. This is handled by three functions:
- `sub_26F78E0` -- create branch funnel metadata (with diagnostic string `"branch_funnel"`, 13 chars)
- `sub_BCF480` -- build the conditional dispatch structure
- `sub_BA8C10` -- emit the indirect branch sequence
; At 0x27043B5:
mov r12, [rbx] ; vtable pointer
mov rdi, [r12] ; function pointer from vtable
call sub_BCB120 ; get function declaration
; At 0x27043D3:
lea rax, "branch_funnel" ; 13 chars at 0x42BCB92
call sub_26F78E0 ; create branch funnel metadata
call sub_BCF480 ; build dispatch structure
call sub_BA8C10 ; emit indirect branch sequence
The branch funnel supports two dispatch granularities:
| Granularity | String | Function | Description |
|---|---|---|---|
| Byte | "byte" (4 chars, at 0x3F8C256) | sub_26F9120 | Check byte offset into vtable to select target |
| Bit | "bit" (3 chars, at 0x43ADFE0+0xE) | sub_26F9120 | Check bit offset for single-bit discrimination |
The emission sequence at 0x270450C--0x27045BF:
; Byte-granularity dispatch:
lea rcx, "byte" ; at 0x3F8C256
mov [rbp-0x318], 4 ; string length
call sub_26F9120 ; emit byte-offset check
; Bit-granularity dispatch:
lea rbx, "bit" ; at 0x43ADFE0 + 0xE
mov [rbp-0x328], 3 ; string length
call sub_26F9120 ; emit bit-offset check
; Finalize:
call sub_26FB610 ; r8=byte_result, r9=bit_result
; rdi=r12, rdx=byte_3F871B3
The finalization call sub_26FB610 receives both byte and bit results and produces the final dispatch sequence. On GPU, branch funnels are rare because device code hierarchies are typically shallow, but the infrastructure exists for cases like thrust/CUB polymorphic iterators.
Upstream LLVM gates branch funnels behind the wholeprogramdevirt-branch-funnel-threshold knob (default: 10 targets per call site). CICC inherits this threshold.
Phase 7: Cleanup (0x2704144--0x270342E)
After processing all types, the pass performs four cleanup operations:
- Function attribute cleanup (`0x2704144`): iterates the module's function list (red-black tree at `[rax+10h]`), calling `sub_B98000` with parameter `0x1C` (attribute cleanup enum) on each function.
- Import list cleanup (`0x270416C`): processes entries at `module[+0x110..+0x118]`, calling `sub_B43D60` to release function metadata for imported declarations.
- Type hierarchy destruction: `sub_26F92C0` releases all type hierarchy data structures.
- Hash table deallocation (`0x27033C3`): iterates all non-sentinel entries, calls `sub_26F75B0` to release per-entry resolution data, then `sub_C7D6A0` to free the table buffer. Type test result vectors (0x70-byte elements with sub-vectors at offsets `+0x10`, `+0x28`, `+0x40`, `+0x58`) are freed element by element.
Hash table cleanup detail:
// At 0x27033C3:
uint32_t count = hash_table_entry_count; // [rbp-0x2B8]
if (count == 0) goto skip_cleanup;
void *base = hash_table_base; // [rbp-0x2C8]
void *end = base + count * 56; // count * 7 * 8
for (void *entry = base; entry < end; entry += 56) {
uint64_t key = *(uint64_t *)entry;
if (key == 0xFFFFFFFFE000) continue; // empty sentinel
if (key == 0xFFFFFFFFF000) continue; // deleted sentinel
sub_26F75B0(entry[+0x18]); // release resolution data
}
sub_C7D6A0(base); // free table buffer
GPU-Specific Constraints
Virtual Functions in Device Code
CUDA allows __device__ virtual functions, but with restrictions that simplify devirtualization:
- No RTTI on device. There is no `typeid` or `dynamic_cast` on GPU. This means vtable layouts do not contain RTTI pointers, simplifying vtable reconstruction. The NVVM IR verifier rejects code that attempts `dynamic_cast` in device context.
- No exceptions on device. Virtual destructors do not need to handle `__cxa_throw` unwinding paths.
- Closed world. No device-side shared libraries, no `dlopen`, no runtime code generation. All virtual targets are known at compile time.
- No separate compilation for virtual dispatch. Device linking (nvlink) resolves all symbols before PTX emission, so the merged module always has complete type information.
- Simplified vtable layout. Without RTTI pointers and exception tables, device vtables are a flat array of function pointers at known offsets. This makes vtable slot arithmetic straightforward for the WPD pass.
Cost of Unresolved Indirect Calls
If devirtualization fails, the PTX backend must emit a call.uni or call through a register. This has several penalties:
- No inlining. The callee is unknown, so the inliner cannot evaluate it.
- Full `.param` marshaling. Every argument must be written to `.param` space; no copy elision is possible. The call ABI (opcodes 510-513: `CallDirect`, `CallDirectNoProto`, `CallIndirect`, `CallIndirectNoProto`) forces `.param`-space round-tripping.
- Register pressure spike. All live registers across the call must be spilled to local memory (device DRAM, ~400 cycle latency on SM 70-90).
- Scheduling barrier. The call is a full fence for instruction scheduling -- no operations can be reordered across it.
- Divergence hazard. If different threads in a warp resolve the pointer to different functions, execution serializes both paths. In the worst case (32 different targets), this is a 32x slowdown.
- Occupancy reduction. The register spills increase per-thread local memory usage, reducing occupancy and thus hiding less memory latency.
This is why CICC's default inlining budget of 20,000 (89x the upstream LLVM default) makes sense in combination with aggressive devirtualization: the pass converts expensive indirect calls into direct calls, and the inliner then eliminates them entirely.
Relationship to LowerTypeTests
The LowerTypeTests pass (sub_188C730, 96,984 bytes at 0x188C730; also sub_2638ED0 at 70KB) is the other half of the type-test infrastructure. While WPD consumes type test metadata to resolve virtual calls, LowerTypeTests produces the runtime type-checking implementation. The interaction:
| Pass | Role | When |
|---|---|---|
| NVModuleSummary (sub_D7D4E0) | Produces type metadata in function summaries | During summary construction |
| WholeProgramDevirt (sub_2703170) | Consumes type metadata, resolves virtual calls | LTO phase, after summary, before GlobalDCE |
| LowerTypeTests (sub_188C730) | Lowers remaining @llvm.type.test intrinsics to runtime bit tests | After WPD, if CFI is active |
On GPU, LowerTypeTests is largely dead code -- CUDA does not use Control-Flow Integrity (CFI), and WPD resolves most type tests statically. The sweep at 0x1880000 confirms: "WPD/CFI/LowerTypeTests cluster is also upstream-only; CUDA does not use CFI or type-based devirtualization" in the sense of runtime CFI checks. The type metadata is consumed entirely by WPD's compile-time resolution.
LowerTypeTests validates its input with: "Second argument of llvm.type.test must be metadata" and "Second argument of llvm.type.test must be a metadata string". These error paths are unreachable in normal CUDA compilation but exist because CICC links the full upstream LLVM IPO library.
Optimization Remarks
When a call site is successfully devirtualized, the pass emits an optimization remark through the diagnostic handler. The remark is constructed at 0x2703EDA using three components:
| Component | String | Address |
|---|---|---|
| Remark name | "Devirtualized" (13 chars) | 0x42BCBEE |
| Pass name | "wholeprogramdevirt" (18 chars) | 0x42BC950 |
| Body prefix | "devirtualized " (14 chars) | 0x42BCBE2 |
| Attribute key | "FunctionName" (12 chars) | 0x42BC980 |
The remark construction sequence:
// At 0x2703EDA:
sub_B17560(&remark, "Devirtualized", 13, "wholeprogramdevirt", 18);
sub_B18290(&remark, "devirtualized ", 14); // append body
sub_B16430(&remark, "FunctionName", 12); // create named attribute
sub_26F69E0(&remark, resolved_function); // attach target name
sub_B180C0(&remark); // finalize
sub_1049740(diag_handler, &remark); // publish to handler
The remark is visible via -Rpass=wholeprogramdevirt and includes the name of the resolved target function (obtained from the function's name metadata or via sub_26F69E0 for unnamed functions).
After remark emission, extensive cleanup of small-string-optimized (SSO) std::string objects is performed -- each remark component checks if the string buffer was heap-allocated (compare pointer vs stack buffer address) and frees if necessary.
Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
| wholeprogramdevirt-branch-funnel-threshold | unsigned | 10 | Maximum number of call targets per call site for branch funnel emission. Beyond this threshold, the call site is left indirect. |
| whole-program-visibility | bool | false | Force enable whole-program visibility even without !vcall_visibility metadata. On GPU this is effectively always true. |
| disable-whole-program-visibility | bool | false | Force disable whole-program visibility for debugging. |
| wholeprogramdevirt-summary-action | enum | none | Controls summary interaction: none, import, export. CICC uses none (direct resolution on merged module). |
| wholeprogramdevirt-read-summary | string | empty | Read type resolutions from a bitcode/YAML file. |
| wholeprogramdevirt-write-summary | string | empty | Write type resolutions to a bitcode/YAML file. |
| wholeprogramdevirt-skip | string list | empty | Comma-separated list of function names to exclude from devirtualization. |
| wholeprogramdevirt-check | enum | none | Runtime checking mode: none, trap (abort on incorrect devirt), fallback (fall back to indirect call). |
| wholeprogramdevirt-keep-unreachable-function | bool | true | Keep unreachable functions as possible devirt targets (conservative default). |
| wholeprogramdevirt-print-index-based | bool | false | Print index-based devirtualization messages for debugging. |
| wholeprogramdevirt-cutoff | signed | -1 | Maximum number of devirtualization actions to perform. -1 = unlimited. Useful for bisecting devirtualization-induced miscompiles. |
| abort-on-max-devirt-iterations-reached | bool | false | When DevirtSCCRepeatedPass at sub_2284BC0 hits its iteration limit, abort instead of warning. Registered at constructor 378. |
Complexity
| Operation | Complexity | Notes |
|---|---|---|
| Hash table insert/lookup | O(1) amortized, O(n) worst case | Linear probing with sentinel-based open addressing |
| Type hierarchy lookup | O(log n) | Red-black tree keyed by type name hash, with memcmp verification |
| Per-type call resolution | O(call_sites * candidates) | For each type, check every call site against every candidate target |
| Branch funnel emission | O(vtable_entries) per site | Linear in number of possible targets |
| String hash (sub_B2F650) | O(name_length) | One-pass hash of the type name string |
| Total pass | O(T * S * C * log T) | T = types, S = call sites per type, C = candidates. Typically sparse on GPU. |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| WholeProgramDevirtPass::run | sub_2703170 | 13,077 | Pass entry point |
| buildTypeTestInfo | sub_2702830 | ~2,600 | Build type test records from metadata |
| growHashTable | sub_2702540 | ~740 | Grow and rehash the type test hash table |
| resolveVirtualCall | sub_26FEE10 | ~3,200 | Attempt single-target resolution for a call site |
| rewriteCallToDirectCall | sub_26F9AB0 | ~1,600 | Strategy 1: replace indirect call with direct call |
| rewriteToUniqueMember | sub_26F9080 | ~640 | Strategy 2: unique member dispatch rewrite |
| finalizeUniqueMember | sub_26FAF90 | ~1,700 | Strategy 2: call-site-specific fixup |
| createBranchFunnelMeta | sub_26F78E0 | ~1,100 | Strategy 3: create branch funnel metadata |
| buildBranchFunnel | sub_BCF480 | ~6,400 | Strategy 3: build conditional dispatch structure |
| emitIndirectBranch | sub_BA8C10 | ~8,200 | Strategy 3: emit indirect branch sequence |
| emitDispatchCheck | sub_26F9120 | ~500 | Branch funnel byte/bit offset check |
| finalizeBranchFunnel | sub_26FB610 | ~1,800 | Branch funnel finalization |
| destroyTypeHierarchy | sub_26F92C0 | ~400 | Release type hierarchy data structures |
| releaseResolutionData | sub_26F75B0 | ~300 | Free per-entry resolution data |
| attachFunctionName | sub_26F69E0 | ~240 | Attach function name to optimization remark |
| branchFunnelHelper | sub_2700B00 | ~9,800 | Branch funnel main helper (called from sub_2703170) |
| summaryIO | sub_2706490 | ~7,600 | WPD summary read/write (-wholeprogramdevirt-read-summary) |
| DevirtSCCRepeatedPass::run | sub_2284BC0 | 16,000 | CGSCC devirtualization iteration loop |
| getNamedMetadata | sub_B6AC80 | ~200 | Fetch named metadata node from module |
| getTypeInfoName | sub_B91420 | ~300 | Compute type info name string |
| stringHash | sub_B2F650 | ~180 | Hash a type name string (xxHash-style) |
| createRemarkHeader | sub_B17560 | ~250 | Create optimization remark header |
| appendRemarkBody | sub_B18290 | ~200 | Append body text to remark |
| createNamedAttribute | sub_B16430 | ~200 | Create named attribute for remark |
| publishRemark | sub_1049740 | ~100 | Publish remark to diagnostic handler |
Cross-References
- NVModuleSummary Builder -- produces the `type_test` metadata consumed by this pass; records devirtualization-relevant type GUIDs in per-function summaries via `sub_D7D4E0`.
- Inliner Cost Model -- devirtualized direct calls become inlining candidates with a 20,000-unit budget; the entire value of devirtualization on GPU depends on the inliner subsequently eliminating the call.
- ThinLTO Function Import -- in ThinLTO mode the pass would operate in export/import phases, but CICC primarily uses regular LTO for device code.
- Pipeline & Ordering -- WPD is registered at pipeline parser slot 121 as a module pass; it runs during the LTO phase after summary construction and before GlobalDCE.
- NVPTX Call ABI -- describes the `.param`-space calling convention that makes indirect calls so expensive (opcodes 510-513: CallDirect, CallDirectNoProto, CallIndirect, CallIndirectNoProto).
- LazyCallGraph & CGSCC -- devirtualization converts ref edges to call edges in the call graph, triggering SCC re-computation via `switchInternalEdgeToCall`. The `DevirtSCCRepeatedPass` at `sub_2284BC0` wraps the CGSCC pipeline in a fixed-point loop.
- GPU Execution Model -- explains why indirect calls are so expensive on GPU (warp divergence, scheduling barriers, register spilling to local memory).
- Hash Infrastructure -- the type test hash table uses the same sentinel-based open-addressing pattern as CICC's universal DenseMap infrastructure.