NVVM IR Node Layout

The NVVM frontend in cicc v13.0 uses a custom intermediate representation distinct from LLVM's native IR. Each IR node is a variable-length structure allocated from a bump allocator, with operands stored backward from the node header pointer. The node uniquing infrastructure lives in sub_162D4F0 (49KB), which routes each opcode to a dedicated DenseMap inside the NVVM context object.

Node Header Layout

The pointer a1 returned from allocation points to the start of the fixed header. Operands are at negative offsets behind it.

| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 1B | uint8_t | opcode | Switch key in sub_162D4F0; values 0x04..0x22+ |
| +2 | 2B | uint16_t | subopcode | Intrinsic ID; read for opcodes 0x1C, 0x1D, 0x1E |
| +4 | 4B | -- | (padding) | Not accessed directly |
| +8 | 4B | uint32_t | num_operands | Controls operand access range |
| +16 | 8B | tagged_ptr | context_ptr | Low 3 bits are tag; mask with & ~7 for pointer |
| +24 | 8B | varies | extra_A | DWORD for opcodes 0x1A/0x1B; pointer for 0x10/0x22 |
| +28 | 4B | uint32_t | extra_B | Present for opcode 0x1B |
| +32 | 8B | varies | extra_C | Present for opcode 0x10 |
| +40 | 1B | uint8_t | extra_flag | Present for opcode 0x10 |

Minimum header size is 24 bytes. Total node allocation: 24 + 8 * num_operands bytes minimum, though opcode-specific extra fields extend the header region for certain node types.

Operand Storage

Operands are stored as 8-byte QWORD pointers at negative offsets from the header. The stride is exactly 8 bytes per operand. Access follows this pattern (decompiled from sub_162D4F0):

operand[k] = *(_QWORD *)(a1 + 8 * (k - num_ops))

For a node with num_operands = 3:

  • operand[0] is at a1 - 24
  • operand[1] is at a1 - 16
  • operand[2] is at a1 - 8

A 2-operand node occupies 40 bytes total (16 operand bytes + 24 header bytes). A node with opcode 0x1B and 5 operands requires approximately 88 bytes (40 operand bytes + ~48 header bytes including extra fields).
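The access pattern and the size arithmetic above can be sketched in C. This is a minimal illustration, assuming a little-endian 64-bit host and a node that carries only the 24-byte minimum header; the names `node_min_size`, `node_num_operands`, and `node_operand` are hypothetical and do not appear in the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NODE_HEADER_MIN = 24, OPERAND_STRIDE = 8 };

/* Minimum allocation: operand slots first, then the 24-byte header. */
static size_t node_min_size(uint32_t num_operands) {
    return NODE_HEADER_MIN + (size_t)OPERAND_STRIDE * num_operands;
}

/* num_operands lives at header offset +8. */
static uint32_t node_num_operands(const uint8_t *hdr) {
    uint32_t n;
    memcpy(&n, hdr + 8, sizeof n);
    return n;
}

/* Mirrors: operand[k] = *(_QWORD *)(a1 + 8 * (k - num_ops)) */
static uint64_t node_operand(const uint8_t *hdr, uint32_t k) {
    uint32_t n = node_num_operands(hdr);
    assert(k < n);                       /* operand index must be in range */
    uint64_t v;
    memcpy(&v, hdr - (size_t)OPERAND_STRIDE * (n - k), sizeof v);
    return v;
}
```

With `num_operands = 3`, `node_operand(hdr, 0)` reads from `hdr - 24` and `node_operand(hdr, 2)` from `hdr - 8`, matching the list above.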

Tagged Pointer Semantics

The context_ptr at offset +16 uses low-bit tagging to encode indirection:

  • Bits [2:0] = 0: pointer is a direct reference to the context object.
  • Bit [2] = 1: pointer is an indirect reference (pointer-to-pointer).

The decompiled dereferencing pattern:

v = *(a1 + 16) & 0xFFFFFFFFFFFFFFF8;  // mask off tag bits
if (*(a1 + 16) & 4)                    // bit 2 set = indirect
    v = *v;                             // one extra dereference

This technique saves a field by encoding the indirection flag inside the pointer itself, relying on 8-byte alignment guarantees.
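A compilable version of the dereferencing pattern might look like the sketch below. It assumes a 64-bit host; `make_tagged` and `resolve_context` are hypothetical helper names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAG_MASK     ((uintptr_t)7)   /* low 3 bits carry the tag */
#define TAG_INDIRECT ((uintptr_t)4)   /* bit 2 set: pointer-to-pointer */

/* Build a tagged value; only valid for 8-byte-aligned pointers. */
static uintptr_t make_tagged(void *p, int indirect) {
    assert(((uintptr_t)p & TAG_MASK) == 0);
    return (uintptr_t)p | (indirect ? TAG_INDIRECT : 0);
}

/* Resolve the context pointer stored at header offset +16. */
static void *resolve_context(const uint8_t *hdr) {
    uintptr_t raw;
    memcpy(&raw, hdr + 16, sizeof raw);
    void *p = (void *)(raw & ~TAG_MASK);   /* mask off tag bits */
    if (raw & TAG_INDIRECT)
        p = *(void **)p;                   /* one extra dereference */
    return p;
}
```

Both the direct and the indirect encoding resolve to the same object, which is why callers never need to know which form a node uses.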

Opcode Dispatch Table

The uniquing function sub_162D4F0 performs a byte-level switch on *(_BYTE *)a1. Each case extracts the tagged context pointer, dereferences it, then probes an opcode-specific DenseMap for a structurally identical node.

Uniquing Opcode Dispatch (sub_162D4F0, 49KB)

The opcodes fall into two categories: "simple" opcodes that use sub-function tables at fixed stride, and "complex" opcodes that use dedicated DenseMap instances at individually-known offsets.

Simple opcodes (0x04--0x15) -- These 18 opcodes share a uniform dispatch pattern. Each routes to a sub-function table at a fixed byte offset within the context object, spaced 32 bytes apart:

| Opcode | Context Byte Offset | Semantic Category |
|---|---|---|
| 0x04 | +496 | Type / value constant |
| 0x05 | +528 | Binary operation |
| 0x06 | +560 | (simple node) |
| 0x07 | +592 | (simple node) |
| 0x08 | +624 | (simple node) |
| 0x09 | +656 | Undef / poison |
| 0x0A | +688 | (simple node) |
| 0x0B | +720 | (simple node) |
| 0x0C | +752 | (simple node) |
| 0x0D | +784 | Integer constant |
| 0x0E | +816 | FP constant |
| 0x0F | +848 | Constant expression |
| 0x10 | -- | Special: uses DenseMap at qw[178] |
| 0x11 | +912 | (simple node) |
| 0x12 | +944 | (simple node) |
| 0x13 | +976 | Struct / aggregate type |
| 0x14 | +1008 | (simple node) |
| 0x15 | +1072 | (simple node) |

Each sub-function table entry at these offsets is a 32-byte structure containing the callback address and metadata for hash-table probing.
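The offsets listed for the simple opcodes follow `496 + 32 * (opcode - 0x04)` with two exceptions visible in the table: 0x10 has no simple-table slot, and 0x15 is listed at +1072 rather than +1040. A small sketch of that mapping, where `simple_table_offset` is a hypothetical name and the special cases are taken directly from the table rather than from decompiled code:

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of the simple-opcode table inside the context object,
 * or -1 when the table above lists no simple-table slot. */
static int simple_table_offset(uint8_t opcode) {
    assert(opcode >= 0x04 && opcode <= 0x15);
    if (opcode == 0x10)
        return -1;      /* special case: DenseMap at qw[178] */
    if (opcode == 0x15)
        return 1072;    /* listed off the regular 32-byte stride */
    return 496 + 32 * (opcode - 0x04);
}
```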

Complex opcodes (0x16--0x22) -- These opcodes each own a full DenseMap within the context object. Each DenseMap occupies 4 qwords at the indicated base, plus associated dword counters:

| Opcode | QWord Base | Byte Offset | DenseMap Dwords | Identified Semantic |
|---|---|---|---|---|
| 0x16 | qw[130] | +1040 | dw[264..266] | Metadata node |
| 0x17 | -- | +1104 | -- | (simple-table path at +1104) |
| 0x18 | -- | +1136 | -- | Alloca (bitcode 0x18/0x58) |
| 0x19 | -- | -- | -- | Load |
| 0x1A | qw[146] | +1168 | dw[296..298] | Branch (br) |
| 0x1B | qw[150] | +1200 | dw[304..306] | Switch |
| 0x1C | qw[154] | +1232 | dw[312..314] | Invoke (reads subopcode) |
| 0x1D | qw[158] | +1264 | dw[320..322] | Unreachable / resume (reads subopcode) |
| 0x1E | qw[162] | +1296 | dw[328..330] | LandingPad (reads subopcode) |
| 0x1F | qw[166] | +1328 | dw[336..338] | Call instruction |
| 0x20 | -- | -- | -- | PHI node |
| 0x21 | -- | -- | -- | IndirectBr |
| 0x22 | qw[178] | +1424 | dw[360..362] | Special (extra_A = ptr) |

Opcodes 0x1C, 0x1D, and 0x1E read the subopcode field at *(unsigned __int16 *)(a1 + 2) as part of the hash key, because these node types require the intrinsic ID to distinguish structurally identical nodes with different semantic meaning.

Hash Function

Every DenseMap in the uniquing tables uses the same hash:

hash(ptr) = (ptr >> 9) ^ (ptr >> 4)

Hash computation for multi-operand nodes (sub_15B3480) extends this by combining the hash of each operand pointer with a mixing step. The hash seed is the opcode byte, then each operand is folded in:

seed ^= hash(operand[i]) + 0x9E3779B9 + (seed << 6) + (seed >> 2);

Sentinel values: empty = -8 (0xFFFFFFFFFFFFFFF8), tombstone = -16 (0xFFFFFFFFFFFFFFF0).
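The two formulas above can be combined into a short C sketch. The 32-bit accumulator width is an assumption (the source does not state the width), and `hash_ptr` / `hash_node` are hypothetical names, not symbols recovered from sub_15B3480.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel bucket keys used by the uniquing tables. */
#define EMPTY_KEY     ((uint64_t)-8)    /* 0xFFFFFFFFFFFFFFF8 */
#define TOMBSTONE_KEY ((uint64_t)-16)   /* 0xFFFFFFFFFFFFFFF0 */

/* hash(ptr) = (ptr >> 9) ^ (ptr >> 4), truncated to 32 bits (assumed). */
static uint32_t hash_ptr(uint64_t ptr) {
    return (uint32_t)(ptr >> 9) ^ (uint32_t)(ptr >> 4);
}

/* Seed with the opcode byte, then fold in each operand pointer. */
static uint32_t hash_node(uint8_t opcode, const uint64_t *operands, size_t n) {
    assert(operands != NULL || n == 0);
    uint32_t seed = opcode;
    for (size_t i = 0; i < n; i++)
        seed ^= hash_ptr(operands[i]) + 0x9E3779B9u + (seed << 6) + (seed >> 2);
    return seed;
}
```

Because the seed starts from the opcode, two nodes with identical operands but different opcodes start from different seeds before any operand is folded in.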

Node Erasure (sub_1621740, 14KB)

The mirror of insertion. Dispatches by the same opcode byte, finds the node in the corresponding DenseMap, overwrites the bucket with the tombstone sentinel (-16), and decrements NumItems while incrementing NumTombstones. When tombstone count exceeds NumBuckets >> 3, a rehash at the same capacity is triggered to reclaim tombstone slots.
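The erase path described above can be illustrated with a toy pointer set. This is a sketch under stated assumptions, not the decompiled logic: it uses linear probing (the probing scheme in sub_1621740 is not described here), stores bare 64-bit keys, and all names (`NodeSet`, `set_insert`, `set_erase`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EMPTY_KEY     ((uint64_t)-8)
#define TOMBSTONE_KEY ((uint64_t)-16)

typedef struct {
    uint64_t *buckets;
    uint32_t  num_items, num_tombstones, num_buckets; /* num_buckets: power of 2 */
} NodeSet;

static uint32_t hash_ptr(uint64_t p) { return (uint32_t)(p >> 9) ^ (uint32_t)(p >> 4); }

static void set_init(NodeSet *s, uint32_t num_buckets) {
    assert(num_buckets && (num_buckets & (num_buckets - 1)) == 0);
    s->buckets = malloc(sizeof(uint64_t) * num_buckets);
    assert(s->buckets);
    for (uint32_t i = 0; i < num_buckets; i++) s->buckets[i] = EMPTY_KEY;
    s->num_items = s->num_tombstones = 0;
    s->num_buckets = num_buckets;
}

/* Returns the bucket index of key, or -1. Tombstones do not stop the probe. */
static int64_t set_find(const NodeSet *s, uint64_t key) {
    uint32_t mask = s->num_buckets - 1;
    for (uint32_t i = hash_ptr(key) & mask;; i = (i + 1) & mask) {
        if (s->buckets[i] == key) return i;
        if (s->buckets[i] == EMPTY_KEY) return -1;
    }
}

static void set_insert(NodeSet *s, uint64_t key) {
    assert(key != EMPTY_KEY && key != TOMBSTONE_KEY);
    assert(s->num_items + s->num_tombstones + 1 < s->num_buckets);
    if (set_find(s, key) >= 0) return;
    uint32_t mask = s->num_buckets - 1, i = hash_ptr(key) & mask;
    while (s->buckets[i] != EMPTY_KEY && s->buckets[i] != TOMBSTONE_KEY)
        i = (i + 1) & mask;
    if (s->buckets[i] == TOMBSTONE_KEY) s->num_tombstones--;  /* reuse slot */
    s->buckets[i] = key;
    s->num_items++;
}

/* Rehash at the same capacity to reclaim tombstone slots. */
static void set_rehash(NodeSet *s) {
    uint64_t *old = s->buckets;
    uint32_t n = s->num_buckets;
    set_init(s, n);
    for (uint32_t i = 0; i < n; i++)
        if (old[i] != EMPTY_KEY && old[i] != TOMBSTONE_KEY)
            set_insert(s, old[i]);
    free(old);
}

/* Overwrite with the tombstone, adjust counters, rehash past the threshold. */
static int set_erase(NodeSet *s, uint64_t key) {
    int64_t i = set_find(s, key);
    if (i < 0) return 0;
    s->buckets[i] = TOMBSTONE_KEY;
    s->num_items--;
    s->num_tombstones++;
    if (s->num_tombstones > (s->num_buckets >> 3))
        set_rehash(s);
    return 1;
}
```

With 16 buckets the threshold is 2, so the third erase without an intervening insert triggers the same-capacity rehash and resets the tombstone count to zero.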

Bitcode Instruction Opcode Table

NVIDIA uses LLVM's standard instruction opcode numbering with minor adjustments. The bitcode reader sub_166A310 / sub_151B070 (parseFunctionBody, 60KB/123KB) dispatches on a contiguous range. The NVVM verifier sub_2C80C90 confirms the mapping via its per-opcode validation switch:

| Opcode | Decimal | LLVM Instruction | Verifier Checks |
|---|---|---|---|
| 0x0B | 11 | ret | -- |
| 0x0E | 14 | br | -- |
| 0x0F | 15 | switch | -- |
| 0x15 | 21 | invoke | "invoke" unsupported via sub_2C76F10 |
| 0x18 | 24 | alloca | Alignment <= 2^23; AS must be Generic |
| 0x19 | 25 | load | -- |
| 0x1A | 26 | br (cond) | Validates "Branch condition is not 'i1' type!" |
| 0x1B | 27 | switch (extended) | -- |
| 0x1C | 28 | invoke (extended) | -- |
| 0x1D | 29 | unreachable | -- |
| 0x1E | 30 | resume | -- |
| 0x1F | 31 | call | Pragma metadata validation |
| 0x20 | 32 | phi | -- |
| 0x21 | 33 | indirectbr | "indirectbr" unsupported |
| 0x22 | 34 | call (variant) | Validates callee type signature |
| 0x23 | 35 | resume (verifier) | "resume" unsupported |
| 0x23--0x34 | 35--52 | Binary ops (add/sub/mul/div/rem/shift/logic) | -- |
| 0x35--0x38 | 53--56 | Casts (trunc/zext/sext/fpcast) | -- |
| 0x3C | 60 | alloca | Alignment and address-space checks |
| 0x3D | 61 | load | Atomic loads rejected; tensor memory AS rejected |
| 0x3E | 62 | store | Atomic stores rejected; tensor memory AS rejected |
| 0x40 | 64 | fence | Only acq_rel/seq_cst in UnifiedNVVMIR mode |
| 0x41 | 65 | cmpxchg | Only i32/i64/i128; must be generic/global/shared AS |
| 0x42 | 66 | atomicrmw | Address space validation |
| 0x4F | 79 | addrspacecast | "Cannot cast non-generic to different non-generic" |
| 0x55 | 85 | Intrinsic call | Routes to sub_2C7B6A0 (143KB verifier) |
| 0x58 | 88 | alloca (inalloca) | Same as 0x18 |
| 0x5F | 95 | landingpad | "landingpad" unsupported |

The binary opcodes in the 0x23--0x34 range follow LLVM's BinaryOperator numbering:

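| Opcode | Decimal | Operation | IRBuilder Helper |
|---|---|---|---|
| 0x23 | 35 | add | -- |
| 0x24 | 36 | fadd | -- |
| 0x25 | 37 | sub | -- |
| 0x26 | 38 | fsub | -- |
| 0x27 | 39 | mul | -- |
| 0x28 | 40 | fmul | -- |
| 0x29 | 41 | udiv | -- |
| 0x2A | 42 | sdiv | -- |
| 0x2B | 43 | fdiv | -- |
| 0x2C | 44 | urem | -- |
| 0x2D | 45 | srem | -- |
| 0x2E | 46 | frem | -- |
| 0x2F | 47 | shl | -- |
| 0x30 | 48 | lshr | -- |
| 0x31 | 49 | ashr | -- |
| 0x32 | 50 | and | -- |
| 0x33 | 51 | or | -- |
| 0x34 | 52 | xor | -- |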

InstCombine Internal Opcode Table

The InstCombine mega-visitor sub_10EE7A0 (405KB, the single largest function in cicc) uses a different opcode numbering -- the full LLVM Instruction::getOpcode() values rather than the bitcode record codes. These are accessed via sub_987FE0 (getOpcode equivalent). Key ranges observed:

| Opcode Range | LLVM Instructions |
|---|---|
| 0x0B | Ret |
| 0x0E | Br |
| 0x0F | Switch |
| 0x15 | Invoke |
| 0x1A | Unreachable |
| 0x3F | FNeg |
| 0x41--0x43 | Add, FAdd, Sub |
| 0x99 | GetElementPtr |
| 0xAA | Trunc |
| 0xAC--0xAE | ZExt, SExt, FPToUI |
| 0xB4--0xB5 | PtrToInt, IntToPtr |
| 0xCF--0xD2 | ICmp, FCmp, PHI, Call |
| 0xE3--0xEB | VAArg, ExtractElement, InsertElement, ShuffleVector, ExtractValue, InsertValue |
| 0x11A | Fence |
| 0x11D | AtomicCmpXchg |
| 0x125 | AtomicRMW |
| 0x134--0x174 | FPTrunc, FPExt, UIToFP, Alloca, Load, Store, FMul, UDiv, SDiv, ... |
| 0x17D--0x192 | BitCast, Freeze, LandingPad, CatchSwitch, CatchRet, CallBr, ... |
| 0x2551, 0x255F, 0x254D | NVIDIA custom intrinsic operations |

The NVIDIA custom opcodes (0x2551, 0x255F, 0x254D) are in a range far above standard LLVM and handle CUDA-specific operations (texture, surface, or warp-level ops encoded as custom IR nodes) that have no upstream LLVM equivalent.

NVVM Context Object

The context object referenced by context_ptr is a large structure of roughly 3,656 bytes; the size is confirmed by the destructor sub_B76CB0 (97KB), which tears down an object of that size. It contains uniquing tables for every NVVM opcode, plus type caches, metadata interning tables, and allocator state.

Context Layout Overview

| Byte Offset | Size | Field | Description |
|---|---|---|---|
| +0..+200 | 200B | Core state | Module pointer, allocator, flags |
| +200 | 8B | vtable_0 | Points to unk_49ED3E0 |
| +224 | 8B | vtable_1 | Points to unk_49ED440 |
| +248 | 8B | vtable_2 | Points to unk_49ED4A0 |
| +272..+792 | 520B | Hash table array | 16 DenseMaps freed at stride 32 by destructor |
| +496..+1136 | 640B | Simple opcode tables | 18 sub-function tables, 32B each (opcodes 0x04..0x15) |
| +1040..+1424 | 384B | Complex opcode DenseMaps | Dedicated DenseMaps for opcodes 0x16..0x22 |
| +1424..+2800 | ~1376B | Extended tables | Additional hash tables, type caches, metadata maps |
| +2800..+3656 | ~856B | Allocator state | Bump allocator slabs, counters, statistics |

Simple Opcode Table Region (+496..+1136)

The 18 entries for opcodes 0x04 through 0x15 (plus a few extras) are 32-byte structures at fixed offsets:

struct SimpleOpcodeTable {
    void  *buckets;       // +0:  heap-allocated bucket array
    int32  num_items;     // +8:  live entry count
    int32  num_tombstones; // +12: tombstone count
    int32  num_buckets;   // +16: always power-of-2
    int32  reserved;      // +20: padding
    void  *callback;      // +24: hash-insert function pointer (or NULL)
};

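Byte offsets increase monotonically: +496, +528, +560, +592, +624, +656, +688, +720, +752, +784, +816, +848, +880, +912, +944, +976, +1008, +1072, +1104, +1136. The stride is 32 bytes throughout except for the 64-byte step from +1008 to +1072; the skipped slot at +1040 coincides with qw[130], the DenseMap base listed for opcode 0x16.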

Complex Opcode DenseMap Region (+1040..+1424)

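Each DenseMap for a complex opcode occupies 4 qwords at base qw[N], plus associated dword counters. In every row of the mapping below, the first counter index equals 2*N + 4 (for opcode 0x16, 2*130 + 4 = 264), which fixes the offset term in the struct: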

struct OpcodeUniqueMap {
    int64  num_entries;    // qw[N]:   includes tombstones
    void  *buckets;        // qw[N+1]: heap-allocated bucket array
    int32  num_items;      // dw[2*N + offset]: live entries
    int32  num_tombstones; // dw[2*N + offset + 1]: tombstone count
    int32  num_buckets;    // dw[2*N + offset + 2]: capacity (power-of-2)
};

Complete mapping:

| Opcode | qw Base | Byte Offset (qw) | dw Counters | Byte Offset (dw) |
|---|---|---|---|---|
| 0x16 | qw[130] | +1040 | dw[264..266] | +1056..+1064 |
| 0x1A | qw[146] | +1168 | dw[296..298] | +1184..+1192 |
| 0x1B | qw[150] | +1200 | dw[304..306] | +1216..+1224 |
| 0x1C | qw[154] | +1232 | dw[312..314] | +1248..+1256 |
| 0x1D | qw[158] | +1264 | dw[320..322] | +1280..+1288 |
| 0x1E | qw[162] | +1296 | dw[328..330] | +1312..+1320 |
| 0x1F | qw[166] | +1328 | dw[336..338] | +1344..+1352 |
| 0x10 | qw[178] | +1424 | dw[360..362] | +1440..+1448 |
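The index arithmetic behind the qw and dw columns of this mapping can be checked mechanically. The sketch below is only a consistency check on the numbers listed above; the `MapRow` type and `verify_rows` function are hypothetical, and the `2*N + 4` relation is an observation about the table, not a recovered formula.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t  opcode;
    uint32_t qw_base;      /* N in qw[N] */
    uint32_t byte_offset;  /* listed byte offset of qw[N] */
    uint32_t dw_first;     /* first index of the dw counter triple */
} MapRow;

static const MapRow ROWS[] = {
    {0x16, 130, 1040, 264}, {0x1A, 146, 1168, 296},
    {0x1B, 150, 1200, 304}, {0x1C, 154, 1232, 312},
    {0x1D, 158, 1264, 320}, {0x1E, 162, 1296, 328},
    {0x1F, 166, 1328, 336}, {0x10, 178, 1424, 360},
};

/* Asserts both relations for every row and returns the row count. */
static unsigned verify_rows(void) {
    unsigned n = sizeof ROWS / sizeof ROWS[0];
    for (unsigned i = 0; i < n; i++) {
        assert(8u * ROWS[i].qw_base == ROWS[i].byte_offset);    /* qword index to bytes */
        assert(2u * ROWS[i].qw_base + 4u == ROWS[i].dw_first);  /* counter index */
    }
    return n;
}
```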

Destructor (sub_1608300, 90KB)

The context destructor confirms the layout by freeing resources in order:

  1. Calls j___libc_free_0 on bucket pointers at offsets +272 through +792 (stride 32) -- frees all 16 simple opcode hash tables.
  2. Destroys sub-objects via sub_16BD9D0, sub_1605960, sub_16060D0 -- these tear down the complex DenseMap instances and any heap-allocated overflow chains.
  3. Releases vtable-referenced objects at offsets +200, +224, +248.

The separate LLVMContext destructor (sub_B76CB0, 97KB) frees 28+ hash tables from the full ~3,656-byte context structure, confirming that the uniquing tables are only part of the overall context.

Type Tag System

The context object's hash tables also serve as uniquing tables for type nodes. The byte at offset +16 in each IR node encodes the type tag (distinct from the opcode byte at +0):

| Type Tag | Meaning | Notes |
|---|---|---|
| 5 | Instruction / expression | Binary ops, comparisons |
| 8 | Constant aggregate | ConstantArray, ConstantStruct |
| 9 | Undef / poison | UndefValue |
| 13 | Integer constant | APInt at +24, bitwidth at +32 |
| 14 | FP constant | APFloat storage |
| 15 | Constant expression | ConstantExpr (GEP, cast, etc.) |
| 16 | Struct / aggregate type | Element list at +32 |
| 17 | MDTuple / metadata node | Metadata tuple |
| 37 | Comparison instruction | ICmp / FCmp predicate |

The type tag at +16 is used by InstCombine (sub_1743DA0) and many other passes to quickly classify nodes without reading the full opcode. The observed range is 5--75, considerably denser than standard LLVM's Value subclass IDs.

Instruction Creation Helpers

NVIDIA's LLVM fork provides a set of instruction creation functions that allocate nodes from the bump allocator, insert them into the appropriate uniquing table, and update use-lists. These are the core IR mutation API:

Primary Instruction Factories

| Address | Size | Signature | LLVM Equivalent |
|---|---|---|---|
| sub_B504D0 | -- | (opcode, op0, op1, state, 0, 0) | BinaryOperator::Create / IRBuilder::CreateBinOp |
| sub_B50640 | -- | (val, state, 0, 0) | Result-typed instruction / CreateNeg wrapper |
| sub_B51BF0 | -- | (inst, src, destTy, state, 0, 0) | IRBuilder::CreateZExtOrBitCast |
| sub_B51D30 | -- | (opcode, src, destTy, state, 0, 0) | CmpInst::Create / IRBuilder::CreateCast |
| sub_B52190 | -- | (...) | BitCastInst::Create |
| sub_B52260 | -- | (...) | GetElementPtrInst::Create (single-index) |
| sub_B52500 | -- | (...) | CastInst::Create with predicate |
| sub_B33D10 | -- | (ctx, intrinsicID, args, numArgs, ...) | IRBuilder::CreateIntrinsicCall |
| sub_BD2DA0 | -- | (80) | Instruction::Create (allocates 80-byte IR node) |
| sub_BD2C40 | -- | (72, N) | Instruction::Create (72-byte base, N operands) |

Opcode Constants for Creation

These numeric opcode values are passed as the first argument to sub_B504D0, or to sub_B51D30 for the two cast rows at the end of the table:

| Value | Operation | Example |
|---|---|---|
| 13 | Sub | sub_B504D0(13, a, b, ...) |
| 15 | FNeg / FSub variant | sub_B504D0(15, ...) |
| 18 | SDiv | sub_B504D0(18, ...) |
| 21 | FMul | sub_B504D0(21, ...) |
| 25 | Or | sub_B504D0(25, ...) |
| 26 | And | sub_B504D0(26, a, mask, ...) |
| 28 | Xor | sub_B504D0(28, ...) |
| 29 | Add | sub_B504D0(29, ...) |
| 30 | Sub | sub_B504D0(30, zero, operand) |
| 32 | Shl | sub_B504D0(32, ...) |
| 33 | AShr | sub_B504D0(33, ...) |
| 38 | And (FP context) | sub_B504D0(38, ...) |
| 40 | ZExt (via sub_B51D30) | sub_B51D30(40, source, resultType) |
| 49 | CastInst | sub_B51D30(49, src, destTy, ...) |

Node Builder / Cloner (sub_16275A0, 21KB)

The IR builder at sub_16275A0 creates new nodes by cloning operand lists from a source node, using the tagged pointer Use-list encoding described above. It dispatches to three specialized constructors:

| Address | Role |
|---|---|
| sub_1627350 | Multi-operand node create (MDTuple::get equivalent). Takes (ctx, operand_array, count, flag0, flag1). Called 463+ times from func-attrs and metadata passes. |
| sub_15B9E00 | Binary node create. Fixed 2-operand layout, minimal header. |
| sub_15C4420 | Variadic node create. Variable operand count, allocates backward operand storage. |

All three ultimately route through the uniquing function sub_162D4F0 to deduplicate structurally-identical nodes.

Infrastructure Functions

| Address | Call Count | Role |
|---|---|---|
| sub_1623A60 | 349x | IRBuilder::CreateBinOp or SCEV type extension |
| sub_1623210 | 337x | IRBuilder::CreateUnaryOp or SCEV use registration |
| sub_15FB440 | 276x | Create node with 5 args: (opcode, type, op1, op2, flags) |
| sub_161E7C0 | 463x | Node accessor / property query (most-called IR function) |
| sub_164B780 | 336x | Use-chain linked list manipulation |
| sub_1648A60 | 406x | Memory allocator: (size, alignment) |

Allocation

NVVM IR nodes are allocated from a slab-based bump allocator:

  • Slab growth: 4096 << (slab_index >> 7) -- exponential, capped at 4TB.
  • Alignment: 8 bytes (pointer aligned via (ptr + 7) & ~7).
  • Deallocation: no individual free; entire slabs are released at once.
  • Overflow: triggers a new slab via malloc().

This is the standard LLVM BumpPtrAllocator pattern, consistent with how upstream LLVM manages IR node lifetimes. The lack of per-node deallocation means the NVVM frontend cannot reclaim memory for dead nodes until the entire context is destroyed.
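The slab sizing and alignment rules listed above can be sketched as follows. The shift cap of 30 is derived from the stated 4TB limit (4096 << 30 is 4 TiB) and matches upstream BumpPtrAllocator; the `Slab` struct and the function names are hypothetical, and the sketch tracks addresses as integers rather than calling malloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slab size for a given slab index: 4096 << (slab_index >> 7), capped. */
static uint64_t slab_size(uint32_t slab_index) {
    uint32_t shift = slab_index >> 7;      /* doubles every 128 slabs */
    if (shift > 30) shift = 30;            /* 4096 << 30 = 4 TiB cap */
    return (uint64_t)4096 << shift;
}

/* Pointer alignment: (ptr + 7) & ~7. */
static uintptr_t align_up8(uintptr_t p) {
    return (p + 7) & ~(uintptr_t)7;
}

typedef struct { uintptr_t cur, end; } Slab;

/* Bump-allocate size bytes; returns 0 when the slab is exhausted,
 * at which point a real allocator would start a new slab. */
static uintptr_t bump_alloc(Slab *s, size_t size) {
    assert(s->cur <= s->end);
    uintptr_t p = align_up8(s->cur);
    if (p > s->end || size > s->end - p)
        return 0;
    s->cur = p + size;
    return p;
}
```

There is deliberately no free function in the sketch: as noted above, memory is only reclaimed when whole slabs are released.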

Cross-References

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| Node uniquing: lookup-or-insert, opcode dispatch | sub_162D4F0 | 49KB | -- |
| Node erase from uniquing tables (tombstone writer) | sub_1621740 | 14KB | -- |
| IR builder / node cloner | sub_16275A0 | 21KB | -- |
| Multi-operand node create (MDTuple::get) | sub_1627350 | -- | -- |
| Binary node create | sub_15B9E00 | -- | -- |
| Variadic node create | sub_15C4420 | -- | -- |
| Hash computation for multi-operand nodes | sub_15B3480 | -- | -- |
| Context destructor (frees 20+ hash tables) | sub_1608300 | 90KB | -- |
| LLVMContext destructor (~3,656-byte object) | sub_B76CB0 | 97KB | -- |
| BinaryOperator::Create / IRBuilder::CreateBinOp | sub_B504D0 | -- | -- |
| Result-typed instruction create / CreateNeg | sub_B50640 | -- | -- |
| IRBuilder::CreateZExtOrBitCast | sub_B51BF0 | -- | -- |
| CmpInst::Create / IRBuilder::CreateCast | sub_B51D30 | -- | -- |
| BitCastInst::Create | sub_B52190 | -- | -- |
| GetElementPtrInst::Create (single-index) | sub_B52260 | -- | -- |
| CastInst::Create with predicate | sub_B52500 | -- | -- |
| IRBuilder::CreateIntrinsicCall | sub_B33D10 | -- | -- |
| Instruction::Create (80-byte allocation) | sub_BD2DA0 | -- | -- |
| Instruction::Create (variable-size) | sub_BD2C40 | -- | -- |
| create_empty_ir_node (204 callers, EDG front-end) | sub_72C9A0 | -- | -- |
| IR builder / node constructor (349x calls) | sub_1623A60 | -- | -- |
| IR builder / node constructor variant (337x calls) | sub_1623210 | -- | -- |
| Create node with 5 args (276x calls) | sub_15FB440 | -- | -- |
| Node accessor / property query (463x calls) | sub_161E7C0 | -- | -- |
| BitcodeReader::parseFunctionBody (stock LLVM) | sub_166A310 | 60KB | -- |
| parseFunctionBody (two-phase compilation path) | sub_151B070 | 123KB | -- |
| parseFunctionBody (standalone libNVVM path) | sub_9F2A40 | 185KB | -- |
| InstCombinerImpl::visitInstruction (full opcode switch) | sub_10EE7A0 | 405KB | -- |
| InstCombine master visit dispatcher | sub_F2CFA0 | -- | -- |
| NVVMModuleVerifier (per-opcode validation) | sub_2C80C90 | 51KB | -- |