NVVM IR Node Layout

The NVVM frontend in cicc v13.0 uses a custom intermediate representation distinct from LLVM's native IR. Each IR node is a variable-length structure allocated from a bump allocator, with operands stored backward from the node header pointer. The node uniquing infrastructure lives in sub_162D4F0 (49KB), which routes each opcode to a dedicated DenseMap inside the NVVM context object.

The pointer a1 returned from allocation points to the start of the fixed header. Operands are at negative offsets behind it.

Offset	Size	Type	Field	Notes
+0	1B	`uint8_t`	`opcode`	Switch key in `sub_162D4F0`; values 0x04..0x22+
+2	2B	`uint16_t`	`subopcode`	Intrinsic ID; read for opcodes 0x1C, 0x1D, 0x1E
+4	4B	--	(padding)	Not accessed directly
+8	4B	`uint32_t`	`num_operands`	Controls operand access range
+16	8B	`tagged_ptr`	`context_ptr`	Low 3 bits are tag; mask with `& ~7` for pointer
+24	8B	varies	`extra_A`	DWORD for opcodes 0x1A/0x1B; pointer for 0x10/0x22
+28	4B	`uint32_t`	`extra_B`	Present for opcode 0x1B
+32	8B	varies	`extra_C`	Present for opcode 0x10
+40	1B	`uint8_t`	`extra_flag`	Present for opcode 0x10

Minimum header size is 24 bytes. Total node allocation: 24 + 8 * num_operands bytes minimum, though opcode-specific extra fields extend the header region for certain node types.

Operand Storage

Operands are stored as 8-byte QWORD pointers at negative offsets from the header. The stride is exactly 8 bytes per operand. Access follows this pattern (decompiled from sub_162D4F0):

operand[k] = *(_QWORD *)(a1 + 8 * (k - num_ops))

For a node with num_operands = 3:

operand[0] is at a1 - 24
operand[1] is at a1 - 16
operand[2] is at a1 - 8

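A 2-operand node occupies 40 bytes total (16 operand bytes + 24 header bytes). A node with opcode 0x1B and 5 operands needs 40 operand bytes plus its extended header: the field table above places extra_B at +28, which implies a 32-byte header and roughly 72 bytes in total. A header of ~48 bytes (about 88 bytes with 5 operands) matches opcode 0x10, whose extra_flag sits at +40.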
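The layout arithmetic above can be sketched in C as follows. This is an illustrative sketch rather than decompiled code: the helper names `nvvm_node_size` and `nvvm_operand_slot` are hypothetical, and the header size is supplied by the caller because it depends on the opcode-specific extra fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: total bytes for one node, given the header size
 * (24 for the minimal header) and the operand count. */
static size_t nvvm_node_size(size_t header_bytes, uint32_t num_operands)
{
    return header_bytes + 8u * (size_t)num_operands;
}

/* Hypothetical helper: address of operand slot k, mirroring the
 * decompiled expression *(_QWORD *)(a1 + 8 * (k - num_ops)).
 * Assumes k < num_ops. */
static uint64_t *nvvm_operand_slot(uint8_t *a1, uint32_t k, uint32_t num_ops)
{
    return (uint64_t *)(a1 - 8 * (ptrdiff_t)(num_ops - k));
}
```

With `num_ops = 3`, slot 0 resolves to `a1 - 24` and slot 2 to `a1 - 8`, matching the list above.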

Tagged Pointer Semantics

The context_ptr at offset +16 uses low-bit tagging to encode indirection:

Bit [2] = 0: pointer is a direct reference to the context object.
Bit [2] = 1: pointer is an indirect reference (pointer-to-pointer).
Bits [1:0] are masked off together with bit 2 but are not tested in this pattern.

The decompiled dereferencing pattern:

v = *(_QWORD *)(a1 + 16) & 0xFFFFFFFFFFFFFFF8LL;  // mask off the 3 tag bits
if (*(_QWORD *)(a1 + 16) & 4)                     // bit 2 set = indirect
    v = *(_QWORD *)v;                             // one extra dereference

This technique saves a field by encoding the indirection flag inside the pointer itself, relying on 8-byte alignment guarantees.
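A self-contained C version of the same decode is sketched below. The function name `nvvm_resolve_context` is hypothetical, and the sketch assumes a 64-bit target where the stored word holds a pointer whose low 3 bits are free because of 8-byte alignment.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: resolve the tagged context_ptr stored at a1 + 16. */
static void *nvvm_resolve_context(const uint8_t *a1)
{
    uint64_t raw;
    memcpy(&raw, a1 + 16, sizeof raw);        /* read the tagged word */

    uintptr_t v = (uintptr_t)(raw & ~(uint64_t)7);  /* strip tag bits */
    if (raw & 4)                              /* bit 2 set: indirect  */
        v = *(const uintptr_t *)v;            /* one extra dereference */
    return (void *)v;
}
```

Keeping the mask and the bit test on the same raw word avoids testing a tag that has already been stripped.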

Opcode Dispatch Table

The uniquing function sub_162D4F0 performs a byte-level switch on *(_BYTE *)a1. Each case extracts the tagged context pointer, dereferences it, then probes an opcode-specific DenseMap for a structurally identical node.

Uniquing Opcode Dispatch (`sub_162D4F0`, 49KB)

The opcodes fall into two categories: "simple" opcodes that use sub-function tables at fixed stride, and "complex" opcodes that use dedicated DenseMap instances at individually-known offsets.

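Simple opcodes (0x04--0x15) -- The 18 opcodes in this range share a uniform dispatch pattern, except 0x10. Each routes to a sub-function table at a fixed byte offset within the context object. The tables are spaced 32 bytes apart, with one break visible in the table: 0x15 sits at +1072 rather than +1040, which is the base of the 0x16 DenseMap (qw[130]).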

Opcode	Context Byte Offset	Semantic Category
0x04	+496	Type / value constant
0x05	+528	Binary operation
0x06	+560	(simple node)
0x07	+592	(simple node)
0x08	+624	(simple node)
0x09	+656	Undef / poison
0x0A	+688	(simple node)
0x0B	+720	(simple node)
0x0C	+752	(simple node)
0x0D	+784	Integer constant
0x0E	+816	FP constant
0x0F	+848	Constant expression
0x10	--	Special: uses DenseMap at qw[178]
0x11	+912	(simple node)
0x12	+944	(simple node)
0x13	+976	Struct / aggregate type
0x14	+1008	(simple node)
0x15	+1072	(simple node)

Each sub-function table entry at these offsets is a 32-byte structure containing the callback address and metadata for hash-table probing.
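The offsets in the simple-opcode table follow a closed form for most entries. The sketch below encodes it; `simple_table_offset` is a hypothetical name, and the two special cases (0x10 and 0x15) are taken directly from the table rather than from any decompiled formula.

```c
#include <stdint.h>

/* Hypothetical helper: context byte offset of the simple table for an
 * opcode, or -1 when the opcode has no simple-table entry listed. */
static int simple_table_offset(uint8_t opcode)
{
    if (opcode < 0x04 || opcode > 0x15 || opcode == 0x10)
        return -1;                 /* outside range, or DenseMap path */
    if (opcode == 0x15)
        return 1072;               /* listed value; breaks the 32-byte stride */
    return 496 + 32 * (opcode - 0x04);
}
```

For example, opcode 0x0D (integer constant) gives 496 + 32 * 9 = 784, which matches the table.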

Complex opcodes (0x16--0x22) -- These opcodes each own a full DenseMap within the context object. Each DenseMap occupies 4 qwords at the indicated base, plus associated dword counters:

Opcode	QWord Base	Byte Offset	DenseMap Dwords	Identified Semantic
0x16	qw[130]	+1040	dw[264..266]	Metadata node
0x17	--	+1104	--	(simple-table path at +1104)
0x18	--	+1136	--	Alloca (bitcode 0x18/0x58)
0x19	--	--	--	Load
0x1A	qw[146]	+1168	dw[296..298]	Branch (br)
0x1B	qw[150]	+1200	dw[304..306]	Switch
0x1C	qw[154]	+1232	dw[312..314]	Invoke (reads subopcode)
0x1D	qw[158]	+1264	dw[320..322]	Unreachable / resume (reads subopcode)
0x1E	qw[162]	+1296	dw[328..330]	LandingPad (reads subopcode)
0x1F	qw[166]	+1328	dw[336..338]	Call instruction
0x20	--	--	--	PHI node
0x21	--	--	--	IndirectBr
0x22	qw[178]	+1424	dw[360..362]	Special (extra_A = ptr)

Opcodes 0x1C, 0x1D, and 0x1E read the subopcode field at *(unsigned __int16 *)(a1 + 2) as part of the hash key, because these node types require the intrinsic ID to distinguish structurally identical nodes with different semantic meaning.

Hash Function

Every DenseMap in the uniquing tables uses the same hash:

hash(ptr) = (ptr >> 9) ^ (ptr >> 4)

Hash computation for multi-operand nodes (sub_15B3480) extends this by combining the hash of each operand pointer with a mixing step. The hash seed is the opcode byte, then each operand is folded in:

seed ^= hash(operand[i]) + 0x9E3779B9 + (seed << 6) + (seed >> 2);

Sentinel values: empty = -8 (0xFFFFFFFFFFFFFFF8), tombstone = -16 (0xFFFFFFFFFFFFFFF0).
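The pointer hash and the operand-folding step can be written out as a small C sketch. The function names are hypothetical, and the 64-bit arithmetic width is an assumption: the text does not state whether the original truncates to 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Pointer hash as given above: (ptr >> 9) ^ (ptr >> 4). */
static uint64_t ptr_hash(uint64_t ptr)
{
    return (ptr >> 9) ^ (ptr >> 4);
}

/* Hypothetical helper: seed with the opcode byte, then fold in each
 * operand pointer with the mixing step shown above. */
static uint64_t node_hash(uint8_t opcode, const uint64_t *ops, size_t n)
{
    uint64_t seed = opcode;
    for (size_t i = 0; i < n; ++i)
        seed ^= ptr_hash(ops[i]) + 0x9E3779B9u + (seed << 6) + (seed >> 2);
    return seed;
}
```

A node with no operands hashes to its opcode byte alone, so the opcode-specific tables still separate such nodes by construction.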

Node Erasure (`sub_1621740`, 14KB)

The mirror of insertion. Dispatches by the same opcode byte, finds the node in the corresponding DenseMap, overwrites the bucket with the tombstone sentinel (-16), and decrements NumItems while incrementing NumTombstones. When tombstone count exceeds NumBuckets >> 3, a rehash at the same capacity is triggered to reclaim tombstone slots.
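The erase path described above can be illustrated with a reduced pointer-keyed set. This is a sketch under stated assumptions, not the decompiled routine: the type `unique_set`, both function names, and the quadratic probe sequence are hypothetical, and the real tables compare node structure rather than a single pointer.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_EMPTY     ((uint64_t)-8)    /* 0xFFFFFFFFFFFFFFF8 */
#define SLOT_TOMBSTONE ((uint64_t)-16)   /* 0xFFFFFFFFFFFFFFF0 */

typedef struct {
    uint64_t *buckets;
    uint32_t  num_items;
    uint32_t  num_tombstones;
    uint32_t  num_buckets;     /* power of two, assumed non-zero */
} unique_set;

/* Returns 1 if key was found and replaced by a tombstone, 0 otherwise. */
static int unique_set_erase(unique_set *s, uint64_t key)
{
    uint32_t mask = s->num_buckets - 1;
    uint32_t idx = (uint32_t)((key >> 9) ^ (key >> 4)) & mask;

    for (uint32_t probe = 1; probe <= s->num_buckets; ++probe) {
        uint64_t cur = s->buckets[idx];
        if (cur == key) {
            s->buckets[idx] = SLOT_TOMBSTONE;
            s->num_items--;
            s->num_tombstones++;
            return 1;
        }
        if (cur == SLOT_EMPTY)
            return 0;                  /* an empty slot ends the probe chain */
        idx = (idx + probe) & mask;    /* assumed quadratic probing */
    }
    return 0;
}

/* Threshold from the text: rehash once tombstones exceed NumBuckets >> 3. */
static int unique_set_needs_rehash(const unique_set *s)
{
    return s->num_tombstones > (s->num_buckets >> 3);
}
```

Writing a tombstone instead of the empty sentinel keeps later entries in the same probe chain reachable, which is why erasure cannot simply clear the bucket.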

Bitcode Instruction Opcode Table

NVIDIA uses LLVM's standard instruction opcode numbering with minor adjustments. The bitcode reader sub_166A310 / sub_151B070 (parseFunctionBody, 60KB/123KB) dispatches on a contiguous range. The NVVM verifier sub_2C80C90 confirms the mapping via its per-opcode validation switch:

Opcode (hex)	Decimal	LLVM Instruction	Verifier Checks
0x0B	11	`ret`	--
0x0E	14	`br`	--
0x0F	15	`switch`	--
0x15	21	`invoke`	"invoke" unsupported via `sub_2C76F10`
0x18	24	`alloca`	Alignment <= 2^23; AS must be Generic
0x19	25	`load`	--
0x1A	26	`br` (cond)	Validates "Branch condition is not 'i1' type!"
0x1B	27	`switch` (extended)	--
0x1C	28	`invoke` (extended)	--
0x1D	29	`unreachable`	--
0x1E	30	`resume`	--
0x1F	31	`call`	Pragma metadata validation
0x20	32	`phi`	--
0x21	33	`indirectbr`	"indirectbr" unsupported
0x22	34	`call` (variant)	Validates callee type signature
0x23	35	`resume` (verifier)	"resume" unsupported
0x23--0x34	35--52	Binary ops (add/sub/mul/div/rem/shift/logic)	--
0x35--0x38	53--56	Casts (trunc/zext/sext/fpcast)	--
0x3C	60	`alloca`	Alignment and address-space checks
0x3D	61	`load`	Atomic loads rejected; tensor memory AS rejected
0x3E	62	`store`	Atomic stores rejected; tensor memory AS rejected
0x40	64	`fence`	Only acq_rel/seq_cst in UnifiedNVVMIR mode
0x41	65	`cmpxchg`	Only i32/i64/i128; must be generic/global/shared AS
0x42	66	`atomicrmw`	Address space validation
0x4F	79	`addrspacecast`	"Cannot cast non-generic to different non-generic"
0x55	85	Intrinsic call	Routes to `sub_2C7B6A0` (143KB verifier)
0x58	88	`alloca` (inalloca)	Same as 0x18
0x5F	95	`landingpad`	"landingpad" unsupported

The binary opcodes in the 0x23--0x34 range follow LLVM's BinaryOperator numbering:

Opcode (hex)	Decimal	Operation	IRBuilder Helper
0x23	35	`add`	--
0x24	36	`fadd`	--
0x25	37	`sub`	--
0x26	38	`fsub`	--
0x27	39	`mul`	--
0x28	40	`fmul`	--
0x29	41	`udiv`	--
0x2A	42	`sdiv`	--
0x2B	43	`fdiv`	--
0x2C	44	`urem`	--
0x2D	45	`srem`	--
0x2E	46	`frem`	--
0x2F	47	`shl`	--
0x30	48	`lshr`	--
0x31	49	`ashr`	--
0x32	50	`and`	--
0x33	51	`or`	--
0x34	52	`xor`	--
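Because the range is contiguous, the table above reduces to an array lookup. The sketch below is illustrative only; `binop_name` is a hypothetical helper, not a function identified in the binary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mnemonic for a binary opcode in 0x23..0x34,
 * or NULL for anything outside that range. */
static const char *binop_name(uint8_t opcode)
{
    static const char *const names[] = {
        "add",  "fadd", "sub",  "fsub", "mul",  "fmul",
        "udiv", "sdiv", "fdiv", "urem", "srem", "frem",
        "shl",  "lshr", "ashr", "and",  "or",   "xor",
    };
    if (opcode < 0x23 || opcode > 0x34)
        return NULL;
    return names[opcode - 0x23];
}
```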

InstCombine Internal Opcode Table

The InstCombine mega-visitor sub_10EE7A0 (405KB, the single largest function in cicc) uses a different opcode numbering -- the full LLVM Instruction::getOpcode() values rather than the bitcode record codes. These are accessed via sub_987FE0 (getOpcode equivalent). Key ranges observed:

Opcode Range	LLVM Instructions
0x0B	Ret
0x0E	Br
0x0F	Switch
0x15	Invoke
0x1A	Unreachable
0x3F	FNeg
0x41--0x43	Add, FAdd, Sub
0x99	GetElementPtr
0xAA	Trunc
0xAC--0xAE	ZExt, SExt, FPToUI
0xB4--0xB5	PtrToInt, IntToPtr
0xCF--0xD2	ICmp, FCmp, PHI, Call
0xE3--0xEB	VAArg, ExtractElement, InsertElement, ShuffleVector, ExtractValue, InsertValue
0x11A	Fence
0x11D	AtomicCmpXchg
0x125	AtomicRMW
0x134--0x174	FPTrunc, FPExt, UIToFP, Alloca, Load, Store, FMul, UDiv, SDiv, ...
0x17D--0x192	BitCast, Freeze, LandingPad, CatchSwitch, CatchRet, CallBr, ...
0x2551, 0x255F, 0x254D	NVIDIA custom intrinsic operations

The NVIDIA custom opcodes (0x2551, 0x255F, 0x254D) are in a range far above standard LLVM and handle CUDA-specific operations (texture, surface, or warp-level ops encoded as custom IR nodes) that have no upstream LLVM equivalent.

NVVM Context Object

The context object referenced by context_ptr is a large structure of roughly 3,656 bytes. The size comes from the 97KB destructor sub_B76CB0, which tears down an object of that size. The structure contains uniquing tables for every NVVM opcode, plus type caches, metadata interning tables, and allocator state.

Context Layout Overview

Byte Offset	Size	Field	Description
+0..+200	200B	Core state	Module pointer, allocator, flags
+200	8B	vtable_0	Points to `unk_49ED3E0`
+224	8B	vtable_1	Points to `unk_49ED440`
+248	8B	vtable_2	Points to `unk_49ED4A0`
+272..+792	520B	Hash table array	16 DenseMaps freed at stride 32 by destructor
+496..+1136	640B	Simple opcode tables	18 sub-function tables, 32B each (opcodes 0x04..0x15)
+1040..+1424	384B	Complex opcode DenseMaps	Dedicated DenseMaps for opcodes 0x16..0x22
+1424..+2800	~1376B	Extended tables	Additional hash tables, type caches, metadata maps
+2800..+3656	~856B	Allocator state	Bump allocator slabs, counters, statistics

Simple Opcode Table Region (+496..+1136)

The 18 entries for opcodes 0x04 through 0x15 (plus a few extras) are 32-byte structures at fixed offsets:

struct SimpleOpcodeTable {
    void  *buckets;       // +0:  heap-allocated bucket array
    int32  num_items;     // +8:  live entry count
    int32  num_tombstones; // +12: tombstone count
    int32  num_buckets;   // +16: always power-of-2
    int32  reserved;      // +20: padding
    void  *callback;      // +24: hash-insert function pointer (or NULL)
};
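The field offsets in the comments can be checked mechanically. The sketch below restates the structure with `<stdint.h>` types and asserts the layout at compile time; it assumes an LP64 target (8-byte pointers), and mapping the `int32` pseudo-type to `int32_t` is an assumption about the decompiler notation.

```c
#include <stddef.h>
#include <stdint.h>

struct SimpleOpcodeTable {
    void    *buckets;         /* +0:  heap-allocated bucket array */
    int32_t  num_items;       /* +8:  live entry count */
    int32_t  num_tombstones;  /* +12: tombstone count */
    int32_t  num_buckets;     /* +16: power of two */
    int32_t  reserved;        /* +20: padding */
    void    *callback;        /* +24: hash-insert function pointer or NULL */
};

/* Compile-time checks of the documented offsets (C11, LP64 assumed). */
_Static_assert(offsetof(struct SimpleOpcodeTable, num_items) == 8, "num_items at +8");
_Static_assert(offsetof(struct SimpleOpcodeTable, num_tombstones) == 12, "num_tombstones at +12");
_Static_assert(offsetof(struct SimpleOpcodeTable, num_buckets) == 16, "num_buckets at +16");
_Static_assert(offsetof(struct SimpleOpcodeTable, callback) == 24, "callback at +24");
_Static_assert(sizeof(struct SimpleOpcodeTable) == 32, "32-byte stride");
```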

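The 20 observed byte offsets increase monotonically: +496, +528, +560, +592, +624, +656, +688, +720, +752, +784, +816, +848, +880, +912, +944, +976, +1008, +1072, +1104, +1136. The step is 32 bytes everywhere except between +1008 and +1072, where the sequence skips +1040, the base of the 0x16 DenseMap (qw[130]).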

Complex Opcode DenseMap Region (+1040..+1424)

Each DenseMap for a complex opcode occupies 4 qwords plus associated dword counters:

struct OpcodeUniqueMap {
    int64  num_entries;    // qw[N]:   includes tombstones
    void  *buckets;        // qw[N+1]: heap-allocated bucket array
    int32  num_items;      // dw[2*N + 4]: live entries
    int32  num_tombstones; // dw[2*N + 5]: tombstone count
    int32  num_buckets;    // dw[2*N + 6]: capacity (power-of-2)
};

Complete mapping:

Opcode	qw Base	Byte Offset (qw)	dw Counters	Byte Offset (dw)
0x16	qw[130]	+1040	dw[264..266]	+1056..+1064
0x1A	qw[146]	+1168	dw[296..298]	+1184..+1192
0x1B	qw[150]	+1200	dw[304..306]	+1216..+1224
0x1C	qw[154]	+1232	dw[312..314]	+1248..+1256
0x1D	qw[158]	+1264	dw[320..322]	+1280..+1288
0x1E	qw[162]	+1296	dw[328..330]	+1312..+1320
0x1F	qw[166]	+1328	dw[336..338]	+1344..+1352
0x10	qw[178]	+1424	dw[360..362]	+1440..+1448
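The qw and dw columns are related by simple index arithmetic. The sketch below derives both from the qword base index; `map_slot` and `map_slot_for_qw` are hypothetical names, and the `2*N + 4` relation is inferred from the table rows (for example 2 * 130 + 4 = 264) rather than read from the binary.

```c
#include <stdint.h>

typedef struct {
    uint32_t qw_byte;    /* byte offset of qw[N] within the context */
    uint32_t dw_first;   /* index of the first dword counter */
    uint32_t dw_last;    /* index of the last dword counter */
} map_slot;

/* Hypothetical helper: locate a complex-opcode DenseMap from its
 * qword base index N. */
static map_slot map_slot_for_qw(uint32_t n)
{
    map_slot s;
    s.qw_byte  = 8u * n;        /* qwords are 8 bytes wide */
    s.dw_first = 2u * n + 4u;   /* counters follow the two leading qwords */
    s.dw_last  = s.dw_first + 2u;
    return s;
}
```

For opcode 0x1A (qw[146]) this yields byte offset 1168 and counters dw[296..298], in agreement with the table.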

Destructor (`sub_1608300`, 90KB)

The context destructor confirms the layout by freeing resources in order:

Calls j___libc_free_0 on bucket pointers at offsets +272 through +792 (stride 32) -- frees all 16 simple opcode hash tables.
Destroys sub-objects via sub_16BD9D0, sub_1605960, sub_16060D0 -- these tear down the complex DenseMap instances and any heap-allocated overflow chains.
Releases vtable-referenced objects at offsets +200, +224, +248.

The separate LLVMContext destructor (sub_B76CB0, 97KB) frees 28+ hash tables from the full ~3,656-byte context structure, confirming that the uniquing tables are only part of the overall context.

Type Tag System

The context object's hash tables also serve as uniquing tables for type nodes. The byte at offset +16 in each IR node encodes the type tag (distinct from the opcode byte at +0):

Type Tag	Meaning	Notes
5	Instruction / expression	Binary ops, comparisons
8	Constant aggregate	ConstantArray, ConstantStruct
9	Undef / poison	`UndefValue`
13	Integer constant	APInt at +24, bitwidth at +32
14	FP constant	APFloat storage
15	Constant expression	ConstantExpr (GEP, cast, etc.)
16	Struct / aggregate type	Element list at +32
17	MDTuple / metadata node	Metadata tuple
37	Comparison instruction	ICmp / FCmp predicate

The type tag at +16 is used by InstCombine (sub_1743DA0) and many other passes to quickly classify nodes without reading the full opcode. The observed range is 5--75, considerably denser than standard LLVM's Value subclass IDs.

Instruction Creation Helpers

NVIDIA's LLVM fork provides a set of instruction creation functions that allocate nodes from the bump allocator, insert them into the appropriate uniquing table, and update use-lists. These are the core IR mutation API:

Primary Instruction Factories

Address	Size	Signature	LLVM Equivalent
`sub_B504D0`	--	`(opcode, op0, op1, state, 0, 0)`	`BinaryOperator::Create` / `IRBuilder::CreateBinOp`
`sub_B50640`	--	`(val, state, 0, 0)`	Result-typed instruction / `CreateNeg` wrapper
`sub_B51BF0`	--	`(inst, src, destTy, state, 0, 0)`	`IRBuilder::CreateZExtOrBitCast`
`sub_B51D30`	--	`(opcode, src, destTy, state, 0, 0)`	`CmpInst::Create` / `IRBuilder::CreateCast`
`sub_B52190`	--	`(...)`	`BitCastInst::Create`
`sub_B52260`	--	`(...)`	`GetElementPtrInst::Create` (single-index)
`sub_B52500`	--	`(...)`	`CastInst::Create` with predicate
`sub_B33D10`	--	`(ctx, intrinsicID, args, numArgs, ...)`	`IRBuilder::CreateIntrinsicCall`
`sub_BD2DA0`	--	`(80)`	`Instruction::Create` (allocates 80-byte IR node)
`sub_BD2C40`	--	`(72, N)`	`Instruction::Create` (72-byte base, N operands)

Opcode Constants for Creation

These numeric opcode values are passed as the first argument to sub_B504D0, or to sub_B51D30 for the cast opcodes 40 and 49:

Value	Operation	Example
13	Sub	`sub_B504D0(13, a, b, ...)`
15	FNeg / FSub variant	`sub_B504D0(15, ...)`
18	SDiv	`sub_B504D0(18, ...)`
21	FMul	`sub_B504D0(21, ...)`
25	Or	`sub_B504D0(25, ...)`
26	And	`sub_B504D0(26, a, mask, ...)`
28	Xor	`sub_B504D0(28, ...)`
29	Add	`sub_B504D0(29, ...)`
30	Sub	`sub_B504D0(30, zero, operand)`
32	Shl	`sub_B504D0(32, ...)`
33	AShr	`sub_B504D0(33, ...)`
38	And (FP context)	`sub_B504D0(38, ...)`
40	ZExt (via sub_B51D30)	`sub_B51D30(40, source, resultType)`
49	CastInst	`sub_B51D30(49, src, destTy, ...)`

Node Builder / Cloner (`sub_16275A0`, 21KB)

The IR builder at sub_16275A0 creates new nodes by cloning operand lists from a source node, using the tagged pointer Use-list encoding described above. It dispatches to three specialized constructors:

Address	Role
`sub_1627350`	Multi-operand node create (MDTuple::get equivalent). Takes `(ctx, operand_array, count, flag0, flag1)`. Called 463+ times from func-attrs and metadata passes.
`sub_15B9E00`	Binary node create. Fixed 2-operand layout, minimal header.
`sub_15C4420`	Variadic node create. Variable operand count, allocates backward operand storage.

All three ultimately route through the uniquing function sub_162D4F0 to deduplicate structurally-identical nodes.

Infrastructure Functions

Address	Call Count	Role
`sub_1623A60`	349x	`IRBuilder::CreateBinOp` or SCEV type extension
`sub_1623210`	337x	`IRBuilder::CreateUnaryOp` or SCEV use registration
`sub_15FB440`	276x	Create node with 5 args: `(opcode, type, op1, op2, flags)`
`sub_161E7C0`	463x	Node accessor / property query (most-called IR function)
`sub_164B780`	336x	Use-chain linked list manipulation
`sub_1648A60`	406x	Memory allocator: `(size, alignment)`

Allocation

NVVM IR nodes are allocated from a slab-based bump allocator:

Slab growth: 4096 << (slab_index >> 7) -- exponential, capped at 4TB.
Alignment: 8 bytes (pointer aligned via (ptr + 7) & ~7).
Deallocation: no individual free; entire slabs are released at once.
Overflow: triggers a new slab via malloc().

This is the standard LLVM BumpPtrAllocator pattern, consistent with how upstream LLVM manages IR node lifetimes. The lack of per-node deallocation means the NVVM frontend cannot reclaim memory for dead nodes until the entire context is destroyed.
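The slab-size and alignment rules listed above can be sketched as two small C helpers. The names are hypothetical, and clamping the shift at 30 is an inference from the stated 4TB cap (4096 << 30 is 2^42 bytes), not a value read from the binary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: size in bytes of the slab at a given index,
 * following 4096 << (slab_index >> 7) with an assumed cap. */
static uint64_t slab_size(uint32_t slab_index)
{
    uint32_t shift = slab_index >> 7;   /* doubles every 128 slabs */
    if (shift > 30)
        shift = 30;                     /* assumed cap: 4096 << 30 */
    return (uint64_t)4096 << shift;
}

/* Round a bump pointer up to the next 8-byte boundary. */
static uintptr_t align8(uintptr_t p)
{
    return (p + 7) & ~(uintptr_t)7;
}
```

The first 128 slabs are 4096 bytes each, the next 128 are 8192 bytes, and so on until the cap is reached.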

Cross-References

DenseMap / Hash Infrastructure -- universal hash function and DenseMap layout
DAG Node -- SelectionDAG-level node layout (104-byte SDNode)
NVVM Container -- the NVVMPassOptions/container that wraps the context
Bitcode I/O -- bitcode opcode encoding and parseFunctionBody
InstCombine -- the 405KB mega-visitor that consumes these nodes
NVVM Verifier -- per-opcode validation rules

Function Map

Function	Address	Size	Role
Node uniquing: lookup-or-insert, opcode dispatch	`sub_162D4F0`	49KB	--
Node erase from uniquing tables (tombstone writer)	`sub_1621740`	14KB	--
IR builder / node cloner	`sub_16275A0`	21KB	--
Multi-operand node create (MDTuple::get)	`sub_1627350`	--	--
Binary node create	`sub_15B9E00`	--	--
Variadic node create	`sub_15C4420`	--	--
Hash computation for multi-operand nodes	`sub_15B3480`	--	--
Context destructor (frees 20+ hash tables)	`sub_1608300`	90KB	--
LLVMContext destructor (~3,656-byte object)	`sub_B76CB0`	97KB	--
`BinaryOperator::Create` / `IRBuilder::CreateBinOp`	`sub_B504D0`	--	--
Result-typed instruction create / `CreateNeg`	`sub_B50640`	--	--
`IRBuilder::CreateZExtOrBitCast`	`sub_B51BF0`	--	--
`CmpInst::Create` / `IRBuilder::CreateCast`	`sub_B51D30`	--	--
`BitCastInst::Create`	`sub_B52190`	--	--
`GetElementPtrInst::Create` (single-index)	`sub_B52260`	--	--
`CastInst::Create` with predicate	`sub_B52500`	--	--
`IRBuilder::CreateIntrinsicCall`	`sub_B33D10`	--	--
`Instruction::Create` (80-byte allocation)	`sub_BD2DA0`	--	--
`Instruction::Create` (variable-size)	`sub_BD2C40`	--	--
`create_empty_ir_node` (204 callers, EDG front-end)	`sub_72C9A0`	--	--
IR builder / node constructor (349x calls)	`sub_1623A60`	--	--
IR builder / node constructor variant (337x calls)	`sub_1623210`	--	--
Create node with 5 args (276x calls)	`sub_15FB440`	--	--
Node accessor / property query (463x calls)	`sub_161E7C0`	--	--
`BitcodeReader::parseFunctionBody` (stock LLVM)	`sub_166A310`	60KB	--
`parseFunctionBody` (two-phase compilation path)	`sub_151B070`	123KB	--
`parseFunctionBody` (standalone libNVVM path)	`sub_9F2A40`	185KB	--
`InstCombinerImpl::visitInstruction` (full opcode switch)	`sub_10EE7A0`	405KB	--
InstCombine master visit dispatcher	`sub_F2CFA0`	--	--
NVVMModuleVerifier (per-opcode validation)	`sub_2C80C90`	51KB	--
