NVVM Container Binary Format
The NVVM container is a proprietary binary envelope that wraps LLVM bitcode with compiler metadata for transport between pipeline stages in cicc v13.0. It carries target architecture, optimization options, fast-math flags, memory window configurations, per-kernel resource tables, and the IR payload itself -- all in a single serializable blob. Two serialization paths exist: a compact binary wire format used in production (nvcc / ptxas pipelines) and an XML-based format used for debugging and interchange. This page specifies the binary format in sufficient detail to write a conformant parser and serializer.
The format is implemented across 26 functions in the 0xCCBB10--0xCDD2D0 address range (Cluster C in the binary layout). The six top-level entry points:
| Function | Address | Size | Role |
|---|---|---|---|
NvvmContainer_serialize | 0xCDD2D0 | 47,540 B | Binary + XML serializer |
NvvmContainer_deserialize_options | 0xCD1D80 | 51,859 B | Binary tag/value decoder |
NvvmContainer_parse_header | 0xCDCA30 | 10,206 B | XML path header parser |
NvvmContainer_check_versions | 0xCD41B0 | 16,708 B | Version compatibility gate |
NvvmContainer_validate_versions | 0xCCD5F0 | 8,987 B | Standalone version validator |
NvvmContainer_init_options_struct | 0xCCBB10 | small | Zero-init 248-byte container struct |
Supporting parsers called from NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes):
| Function | Address | Size | Role |
|---|---|---|---|
NvvmOptions_parse_arch_enum | 0xCD09E0 | 14,516 B | ArchVariant enum string-to-int |
NvvmOptions_parse_fast_math | 0xCCF590 | 12,771 B | FastMathOptions sub-structure |
NvvmOptions_parse_multi_view | 0xCD6D20 | 12,188 B | MultiViewOptions sub-structure |
NvvmOptions_parse_cb_reserved_area | 0xCCE780 | 9,802 B | CB reserved area config |
NvvmOptions_parse_reg_targets | 0xCD7CE0 | 9,542 B | Register target config |
NvvmOptions_parse_serialize_helper | 0xCD58A0 | 9,579 B | Option serialization helper |
NvvmOptions_parse_shader_const_iface | 0xCCEEA0 | 8,355 B | ShaderConstIface (DCI) |
NvvmOptions_parse_align_entries | 0xCD8610 | 6,739 B | Alignment entry config |
NvvmOptions_parse_pgo_section | 0xCD02C0 | 5,482 B | PGO configuration |
NvvmOptions_parse_section | 0xCD5510 | 5,166 B | Nested YAML section parser |
NvvmOptions_parse_memory_windows | 0xCCE100 | 5,042 B | Memory window config |
NvvmOptions_parse_cbank_config | 0xCCE4B0 | 4,173 B | Constant bank config |
NvvmOptions_parse_bool_or_int | 0xCCC4A0 | small | Boolean/int option parser |
NvvmOptions_parse_tristate | 0xCCCFB0 | small | Tri-state option parser |
NvvmOptions_parse_string | 0xCD5150 | small | String option parser |
The finalizer knobs parser (0xCD9990, 31,702 bytes) is called separately to ingest the full set of NVIDIA-specific backend knobs (see NVVMPassOptions).
Binary-level helpers:
| Function | Address | Role |
|---|---|---|
NvvmContainer_write_tag_value | 0xCD17A0 | Write one tag/value pair (called 121 times from serializer) |
NvvmContainer_write_blob | 0xCD1AB0 | Write blob data + tag reference |
NvvmContainer_compute_crc | 0xCCD2B0 | CRC with seeds 0x8DF5D74C, 0xBAA56A96 |
Global state: qword_4F87148 holds the NVVM options global state pointer, checked by many downstream consumers.
Binary Header
Every binary container begins with a fixed 24-byte header. The header is self-describing: HeaderSize at offset 0x0E stores its own length (always 24), and two size fields partition the remainder into a scalar tag region and a blob data region.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Magic (0x7F4E5C7D) | 0x00
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Ver.Major | Ver.Minor | NvvmIR.Major | NvvmIR.Minor | 0x04
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NvvmDbg.Major | NvvmDbg.Minor | Llvm.Major | Llvm.Minor | 0x08
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| IRLevel (u16) | HeaderSize (u16) | 0x0C
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ScalarFieldsEnd (u32) | 0x10
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| BlobDataEnd (u32) | 0x14
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
struct NvvmContainerBinaryHeader {
uint32_t magic; /* 0x00: must be 0x7F4E5C7D */
uint8_t version_major; /* 0x04: container format major (1) */
uint8_t version_minor; /* 0x05: container format minor (<=0x41) */
uint8_t nvvm_ir_major; /* 0x06: NVVM IR version major (2) */
uint8_t nvvm_ir_minor; /* 0x07: NVVM IR version minor (<=0x62) */
uint8_t nvvm_debug_major; /* 0x08: debug info version major (3) */
uint8_t nvvm_debug_minor; /* 0x09: debug info version minor (<=2) */
uint8_t llvm_major; /* 0x0A: LLVM version (see encoding) */
uint8_t llvm_minor; /* 0x0B: LLVM version (see encoding) */
uint16_t ir_level; /* 0x0C: IRLevel enum */
uint16_t header_size; /* 0x0E: always 24 (0x0018) */
uint32_t scalar_fields_end; /* 0x10: byte offset past scalar region */
uint32_t blob_data_end; /* 0x14: byte offset past blob region */
};
The three data regions in order:
[0 .. 24) -- Header (fixed)
[24 .. scalar_fields_end) -- Scalar tag/value pairs
[scalar_fields_end .. blob_data_end) -- Blob data region
The total container size is blob_data_end bytes. After the blob data region, the IR payload (LLVM bitcode, optionally compressed) follows immediately.
LLVM Version Encoding
The llvm_major and llvm_minor bytes encode the LLVM version as a combined integer: llvm_major * 100 + llvm_minor. For cicc v13.0 (LLVM 20), this yields 20 * 100 + 0 = 2000. The version check compares the combined value, not the individual bytes.
IRLevel Enum
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | Default: IR after Device-Code-Interface unification |
| 1 | NVVM_IR_LEVEL_LTO | Link-Time Optimization IR (partially optimized) |
| 2 | NVVM_IR_LEVEL_OPTIX | OptiX pipeline IR |
Scalar Tag/Value Encoding
Immediately after the 24-byte header, a sequence of (tag, value) pairs encodes every container field that differs from its default value. The encoding is a variable-length scheme optimized for small values:
Case 1 -- value fits in 16 bits (0x0000..0xFFFE):
[tag : int16] [value : int16] -- 4 bytes total
Case 2 -- value needs 32 bits:
[tag : int16] [0xFFFF : int16] [value : int32] -- 8 bytes total
Terminator:
[0x0000 : int16] -- tag 0 ends the sequence
All multi-byte fields are little-endian. The sentinel value 0xFFFF in the value slot signals that a full 32-bit value follows. This means the maximum encodable 16-bit value is 0xFFFE (65534); values of exactly 0xFFFF or larger require the extended form.
The serializer (sub_CD17A0, called 121 times from NvvmContainer_serialize) writes each tag/value pair using this scheme. The deserializer enters a switch loop over tags 1--402, decoding each value and writing it to the appropriate offset in the deserialized container struct.
Delta Encoding Strategy
The serializer allocates a default-initialized 440-byte Options struct and compares each field in the current Options against the corresponding default. Only fields that differ from the default are written as tag/value pairs. This makes typical containers very compact -- a standard compilation targeting SM 89 with -O2 might emit fewer than 20 tag/value pairs, covering just SmMajor, SmMinor, CompileMode, and a handful of target-specific flags.
The deserializer reverses this: it allocates a default Options struct first, then overwrites individual fields as tags are encountered. Unknown tags are silently skipped, which is the mechanism that provides forward compatibility -- a newer serializer can emit tags that an older deserializer simply ignores.
Blob Data Region
Tags in the 200+ and 400+ ranges reference variable-length data stored in the blob region. The scalar value for a blob tag is the byte offset into the blob region where the data begins. The blob region starts at scalar_fields_end bytes from the container start.
To resolve a blob reference: blob_ptr = container_base + scalar_fields_end + offset_value.
Blob entries do not carry explicit length fields in the tag/value stream. The deserializer knows each blob type's expected size from the tag ID (e.g., tag 201 is always 24 bytes, tag 203 is always 40 bytes). Variable-length blobs like strings (tags 209, 210, 213, 216, 217) are null-terminated. Length-prefixed blobs (tag 218) carry a 4-byte length prefix.
Complete Tag Table
144 distinct tag IDs organized into six ranges. The "Offset" column refers to the byte position within the deserialized 440-byte Options struct.
Range 1--39: Core Scalar Options
| Tag | Type | Name | Options Offset | Notes |
|---|---|---|---|---|
| 1 | int32 | SmMajor | +0 (ArchVariant) | SM major version (e.g., 8 for SM 89) |
| 2 | int32 | SmMinor | +0 (ArchVariant) | SM minor version (e.g., 9 for SM 89) |
| 3 | int32 | NumRegs | +216 | Register count hint |
| 4 | int32 | NumBarriers | +220 | Barrier count |
| 5 | int32 | SharedMemorySize | +224 | Shared memory size in bytes |
| 6 | int32 | VertexMode | +72 | See VertexMode enum |
| 7 | bit | ReserveLocalAddressZero | +20 bit 0 | Reserve address 0 in local memory |
| 8 | bit | FastMath.IgnoreInf | +200 bit 0 | Treat infinities as NaN |
| 9 | bit | FastMath.IgnoreNaN | +200 bit 1 | Assume no NaN values present |
| 10 | bit | FastMath.IgnoreSignedZero | +200 bit 2 | Ignore sign of zero |
| 11 | bit | FastMath.ReorderFloat | +200 bit 3 | Allow float reordering |
| 12 | bit | FastMath.ReorderHalf | +200 bit 4 | Allow half-precision reordering |
| 13 | bit | FastMath.Ftz | +200 bit 5 | Flush denormals to zero |
| 14 | bit | FastMath.FastSqrt | +200 bit 6 | Use fast sqrt approximation |
| 15 | bit | FastMath.Fmad | +200 bit 7 | Allow fused multiply-add |
| 16 | bit | FastMath.AllowRcpRsqToSqrt | +201 bit 0 | Allow rcp(rsqrt(x)) to sqrt(x) |
| 17 | bit | FastMath.CanReorderFloatDistribute | +201 bit 1 | Allow distributive reordering |
| 18 | int32 | FastMath.Reserved | +204 | Reserved fast-math field |
| 19 | int32 | MaxRRegsAllowed | +216 | Maximum registers per thread (primary) |
| 20 | int32 | SchedRegTarget | +220 | Scheduling register pressure target |
| 21 | int32 | UnrollControl | +224 | Unroll factor control |
| 22 | bool | AcceleratedArch | +232 | True for sm_XXa variants |
| 23 | bool | StdELF | +233 | Use standard ELF output format |
| 24 | int32 | MaxRRegsAllowed2 | +216 | Secondary max-regs (override) |
| 25 | int32 | SchedRegTarget2 | +220 | Secondary sched target |
| 26 | bit | FastMath.ReassociateFloatAddOverMad | +201 bit 2 | Float add reassociation over MAD |
| 27 | bit | ForceImmediateConstants | +20 bit 1 | Force immediate constant loading |
| 28 | bit | HideFunctions | +20 bit 2 | Hide internal functions from output |
| 29 | bit | UseDX10AddressInRange | +20 bit 3 | DX10 address range mode |
| 30 | int32 | UnrollControl2 | +224 | Secondary unroll control |
| 31 | bit | FastMath.NoFloatMAD | +201 bit 3 | Disable float MAD formation |
| 32 | bool | AcceleratedArch2 | +232 | Secondary accelerated-arch flag |
| 33 | bit | FastMath.LaxFP16ApproximateDivision | +201 bit 4 | Lax FP16 approximate division |
| 34 | bool | StdELF2 | +233 | Secondary StdELF |
| 35 | int32 | ShaderCodegenSelMask | +236 | Shader codegen selection bitmask |
| 36 | bool | OmegaPtxErrorHandling | +240 | Enable Omega-style PTX error handling |
| 37 | int32 | FDLInsertMode | +244 | See FDLInsertMode enum |
| 38 | bit | IsPIC | +20 bit 4 | Position-independent code flag |
| 39 | bit | NoSpillsConstraint | +20 bit 5 | Hard constraint: no register spills |
Tag 99: Compression Metadata
| Tag | Type | Name | Notes |
|---|---|---|---|
| 99 | int32 | CompressAlgoId | Compression algorithm selector for IR payload |
When present, the IR payload following the blob region is compressed. The value selects a codec via sub_16886D0(algo_id). If the value is 0, the runtime substitutes the default algorithm ID 0x75D49913 (1,977,119,507 decimal). The codec is a pluggable compression/encryption layer accessed through four function pointers:
/* Compression codec API (addresses in the 0x1688xxx range) */
void *codec_acquire(uint32_t algo_id); /* sub_16886D0 */
int codec_compress(void *codec, void *data,
size_t size); /* sub_1688730 */
int codec_decompress(void *codec, void *data,
size_t size); /* sub_16887A0 */
void codec_release(void *codec); /* sub_1688720 */
The write path in NvvmContainer_serialize (0xCDD2D0) compresses the LLVM bitcode payload via sub_C8D290, then computes a CRC hash via NvvmContainer_compute_crc (0xCCD2B0) with the two seed values -1914584148 (0x8DF5D74C) and -1162247642 (0xBAA56A96). The CRC value is stored as the CompressAlgoId tag 99 value, which doubles as an integrity check token: the deserializer uses the same CRC seeds to verify the payload before decompression.
The compression subsystem lives outside the main container cluster at addresses 0x16886D0--0x16887A0, in the utility library region of the binary.
Range 101--173: Extended Target Options
These tags configure per-kernel and target-specific hardware parameters. Most map into a sub-structure accessed through the Options struct. The "Byte.Bit" column indicates the packed bitfield location within the target options sub-structure.
| Tag | Type | Name | Location | Notes |
|---|---|---|---|---|
| 101 | bool | HasTextureOps | offset 0 | Target supports texture operations |
| 102 | bool | HasSurfaceOps | offset 0 | Target supports surface operations |
| 103 | bool | HasAtomics | offset 0 | Target supports atomic operations |
| 104 | bool | HasVote | offset 0 | Target supports warp vote intrinsics |
| 105 | int32 | MaxThreadsPerBlock | offset 4 | Maximum CTA thread count |
| 106 | byte | PreferL1SizeFlag | offset 8 | L1 cache vs shared memory preference |
| 107 | bool | HasWarpShuffle | offset 0 | Target supports warp shuffle |
| 108 | bool | HasFunnelShift | offset 0 | Target supports funnel shift |
| 109 | int32 | CBankOfstLow | offset 12 | Constant bank offset lower bound |
| 110 | int32 | CBankOfstHi | offset 16 | Constant bank offset upper bound |
| 111 | int32 | CBankSize | offset 20 | Constant bank size in bytes |
| 112 | bit | Bit0_68 | byte 68, bit 0 | Target capability flag |
| 113 | bit | Bit1_68 | byte 68, bit 1 | Target capability flag |
| 114 | bit | Bit2_68 | byte 68, bit 2 | Target capability flag |
| 115 | bit | Bit3_68 | byte 68, bit 3 | Target capability flag |
| 116 | bit | Bit4_68 | byte 68, bit 4 | Target capability flag |
| 117 | bit | Bit5_68 | byte 68, bit 5 | Target capability flag |
| 118 | bit | Bit7_68 | byte 68, bit 7 | Target capability flag (bit 6 skipped) |
| 119 | bit | EnableCoalesce | byte 69, bit 0 | Enable memory coalescing optimization |
| 120 | bit | EnableVectorize | byte 69, bit 2 | Enable auto-vectorization |
| 121 | 2-bit | CompactionMode | byte 69, bits 3--4 | Thread compaction strategy (0--3) |
| 122 | int32 | StackFrameSize | offset 96 | Stack frame size in bytes |
| 123 | int32 | StackAlignment | offset 100 | Stack alignment requirement |
| 124 | int32 | ParamSpaceSize | offset 104 | Parameter space size |
| 125 | int32 | ParamAlignment | offset 108 | Parameter space alignment |
| 126 | int32 | LocalMemSize | offset 116 | Local memory size per thread |
| 127 | int32 | SharedBankConfig | offset 156 | Shared memory bank configuration |
| 128 | int32 | MinGridSize | offset 248 | Minimum grid size for occupancy |
| 129 | int32 | MaxGridDimX | offset 252 | Maximum X-dimension grid size |
| 130 | int32 | SharedMemPerBlock | offset 264 | Shared memory per block |
| 131 | 2-bit | WarpScheduleMode | byte 70, bits 0--1 | Warp scheduling strategy |
| 132 | bit | EnablePrefetch | byte 70, bit 2 | Enable memory prefetch instructions |
| 133 | bit | Bit4_70 | byte 70, bit 4 | Target capability flag |
| 134 | bit | Bit5_70 | byte 70, bit 5 | Target capability flag |
| 135 | bit | Bit6_70 | byte 70, bit 6 | Target capability flag |
| 136 | bit | Bit7_70 | byte 70, bit 7 | Target capability flag |
| 137 | int32 | MaxDynShared | offset 268 | Maximum dynamic shared memory |
| 138 | bool | HasLDG | offset 5 | Target supports LDG instruction |
| 139 | bit | Bit1_71 | byte 71, bit 1 | Target capability flag |
| 140 | bit | Bit2_71 | byte 71, bit 2 | Target capability flag |
| 141 | bool | HasBarrierReduce | offset 40 | Target supports barrier-reduce |
| 142 | int32 | CacheConfig | offset 280 | Cache configuration selector |
| 143 | bit | Bit6_68 | byte 68, bit 6 | Target capability flag |
| 144 | bit | Bit3_71 | byte 71, bit 3 | Target capability flag |
| 145 | bit | Bit0_71 | byte 71, bit 0 | Target capability flag |
| 146 | int32 | ConstBankSize | offset 256 | Constant bank total size |
| 147 | int32 | ShMemBankStride | offset 152 | Shared memory bank stride |
| 148 | 2-bit | ScheduleMode2 | byte 71, bits 4--5 | Secondary scheduling mode |
| 149 | bit | Bit6_71 | byte 71, bit 6 | Target capability flag |
| 150 | bit | Bit7_71 | byte 71, bit 7 | Target capability flag |
| 151 | int32 | LocalMemAlignment | offset 112 | Local memory alignment |
| 152 | bit | EnableBarrierOpt | byte 69, bit 5 | Enable barrier optimization |
| 153 | bit | Bit0_72 | byte 72, bit 0 | Target capability flag |
| 154 | bit | Bit6_69 | byte 69, bit 6 | Target capability flag |
| 155 | bit | Bit7_69 | byte 69, bit 7 | Target capability flag |
| 156 | bit | Bit1_72 | byte 72, bit 1 | Target capability flag |
| 157 | bool | HasDP4A | offset 1 | Target supports DP4A dot-product |
| 158 | bit | Bit3_72 | byte 72, bit 3 | Target capability flag |
| 159 | int32 | ConstBankSize2 | offset 260 | Secondary constant bank size |
| 160 | int32 | MaxRegsPerThread | offset 284 | Hard limit on registers per thread |
| 161 | int32 | ClusterSize | offset 276 | Thread block cluster size (SM 90+) |
| 162 | bit | Bit4_72 | byte 72, bit 4 | Target capability flag |
| 163 | bit | Bit5_72 | byte 72, bit 5 | Target capability flag |
| 164 | bit | Bit6_72 | byte 72, bit 6 | Target capability flag |
| 165 | bit | Bit7_72 | byte 72, bit 7 | Target capability flag |
| 166 | int32 | MaxCTAPerSM | offset 160 | Maximum CTAs per SM |
| 167 | int32 | TexIndirectLimit | offset 272 | Texture indirect access limit |
| 168 | bit | Bit0_432 | byte 432, bit 0 | Extended capability flag |
| 169 | bit | Bit1_432 | byte 432, bit 1 | Extended capability flag |
| 170 | bit | Bit2_432 | byte 432, bit 2 | Extended capability flag |
| 171 | bool | HasTMAOps | offset 289 | Target supports TMA operations (SM 90+) |
| 172 | bit | Bit3_70 | byte 70, bit 3 | Target capability flag |
| 173 | bool | HasTCGen05 | offset 290 | Target supports TCGen05 (SM 100+) |
Range 201--218: Blob Data Tags
| Tag | Size | Name | Description |
|---|---|---|---|
| 201 | 24 B | MemoryWindowCBank | 3 memory window entries for constant bank (see below) |
| 202 | 24 B | MemoryWindowLocal | 3 memory window entries for local memory |
| 203 | 40 B | MemoryWindowShared | 10 x uint32_t for shared memory windows + flags |
| 204 | 48 B | MultiViewOptions | Multi-view rendering header + typed arrays |
| 205 | var | TargetResourceTable | 24-byte header + 36 bytes per entry |
| 206 | var | PerKernelCBankOffsets | 4-byte count + 4 bytes per kernel |
| 207 | var | PerKernelStackSizes | 4-byte count + 4 bytes per kernel |
| 208 | var | PerKernelSMEMSizes | 8-byte count + 8 bytes per kernel |
| 209 | var | TargetFuncName | Null-terminated string |
| 210 | var | TargetEntryName | Null-terminated string |
| 211 | 8 B | PerKernelQWORD | 8-byte per-kernel datum |
| 212 | 12 B | ExtraMemParams | 8 + 4 bytes of memory parameters |
| 213 | var | AuxString1 | Null-terminated auxiliary string |
| 214 | var | PerKernelRegisters | 4-byte count + 4 bytes per kernel |
| 215 | var | PerKernelBarriers | 4-byte count + 4 bytes per kernel |
| 216 | var | AuxString2 | Null-terminated auxiliary string |
| 217 | var | AuxString3 | Null-terminated auxiliary string |
| 218 | var | AuxByteArray | 4-byte length prefix + raw bytes |
Range 301--309: Extended Int32 Fields
| Tag | Type | Name | Options Offset | Notes |
|---|---|---|---|---|
| 301 | int32 | ExtOpt.Field344 | +344 | Cluster/group configuration selector |
| 302 | int32 | ExtOpt.Field348 | +348 | Extended option |
| 303 | int32 | ExtOpt.Field352 | +352 | Extended option |
| 304 | int32 | ExtOpt.Field356 | +356 | Extended option |
| 305 | int32 | ExtOpt.Field360 | +360 | Extended option |
| 306 | int32 | ExtOpt.Field400 | +400 | Extended option |
| 307 | int32 | ExtOpt.Field364 | +364 | Extended option |
| 308 | int32 | ExtOpt.Field368 | +368 | Extended option |
| 309 | int32 | ExtOpt.Field372 | +372 | Extended option |
Range 351--353: Extended Int64 Blob References
| Tag | Size | Name | Options Offset |
|---|---|---|---|
| 351 | 8 B | ExtOpt.QWord376 | +376 |
| 352 | 8 B | ExtOpt.QWord384 | +384 |
| 353 | 8 B | ExtOpt.QWord392 | +392 |
Range 401--402: Structured Blob Data
These tags are conditionally parsed based on the value of tag 301 (ExtOpt.Field344):
| Tag | Condition | Size | Name | Notes |
|---|---|---|---|---|
| 401 | Field344 == 1 | 56+ B | TMADescriptor | SM 90 Hopper TMA bulk-copy descriptors. 44-byte fixed header + 16 bytes per entry. |
| 402 | Field344 == 4 | 40+ B | TCGen05Config | SM 100 Blackwell TCGen05 tensor configurations. 32-byte fixed header + 12 bytes per entry. |
The conditional parsing means a single container cannot carry both TMA and TCGen05 data -- the Field344 value selects which hardware generation's tensor memory interface is active.
TMADescriptor Layout (Tag 401, Field344 == 1)
TMA (Tensor Memory Access) descriptors configure cp.async.bulk operations on SM 90 Hopper. The TMA descriptor extraction is performed by sub_9483E0 during intrinsic lowering. The blob layout:
struct TMADescriptor {
/* +0 */ uint32_t num_entries; /* Number of TMA descriptors */
/* +4 */ uint32_t dimensionality; /* 1d..5d tensor rank */
/* +8 */ uint32_t element_size; /* Bytes per element */
/* +12 */ uint32_t interleave_layout; /* Memory interleave pattern */
/* +16 */ uint32_t swizzle_mode; /* Swizzle mode selector */
/* +20 */ uint32_t fill_mode; /* Out-of-bounds fill behavior */
/* +24 */ uint32_t [5] global_dims; /* Global tensor dimensions */
/* +44 */ /* --- 16 bytes per entry --- */
/* uint32_t box_dim; Per-entry box dimension */
/* uint32_t stride; Per-entry stride */
/* uint32_t elem_stride; Per-entry element stride */
/* uint32_t reserved; Reserved/padding */
};
See SM 90 Hopper for the TMA instruction format and the cp.async.bulk.tensor.g2s.tile.{1d,2d,3d,4d,5d} intrinsic family.
TCGen05Config Layout (Tag 402, Field344 == 4)
TCGen05 (Tensor Core Generation 5) configurations describe Blackwell SM 100 tensor memory operations. The TCGen05 instruction set includes tcgen05.alloc, tcgen05.dealloc, tcgen05.commit, tcgen05.fence, tcgen05.wait, and tcgen05.relinquish.alloc -- all gated by the SM 100 arch-conditional check at sub_30462A0. The blob layout:
struct TCGen05Config {
/* +0 */ uint32_t num_entries; /* Number of TCGen05 configs */
/* +4 */ uint32_t accumulator_size; /* Accumulator memory size */
/* +8 */ uint32_t commit_mode; /* Commit mode (multicast flags) */
/* +12 */ uint32_t fence_mode; /* Fence mode selector */
/* +16 */ uint32_t [4] reserved; /* Reserved fields */
/* +32 */ /* --- 12 bytes per entry --- */
/* uint32_t config_id; TCGen05 config identifier */
/* uint32_t fragment_count; Number of fragments */
/* uint32_t flags; Per-config flags */
};
See SM 100 Blackwell for the TCGen05 instruction set and the tcgen05.* intrinsic family.
Deserialized Container Struct
After parsing, the container is represented as a 248-byte in-memory structure allocated by NvvmContainer_init_options_struct (0xCCBB10). This struct holds the container metadata plus a pointer to the full 440-byte Options struct.
struct NvvmContainerHeader { /* 248 bytes total */
/* 0x00 */ uint32_t sm_major; /* Tag 1: SM major version */
/* 0x04 */ uint32_t sm_minor; /* Tag 2: SM minor version */
/* 0x08 */ uint32_t num_regs; /* Tag 3 */
/* 0x0C */ uint32_t num_barriers; /* Tag 4 */
/* 0x10 */ uint32_t shared_mem_size; /* Tag 5 */
/* 0x14 */ uint8_t flags_14; /* Packed bits: tags 7,27,28,29,38,39*/
/* bit 0: ReserveLocalAddressZero (tag 7) */
/* bit 1: ForceImmediateConstants (tag 27) */
/* bit 2: HideFunctions (tag 28) */
/* bit 3: UseDX10AddressInRange (tag 29) */
/* bit 4: IsPIC (tag 38) */
/* bit 5: NoSpillsConstraint (tag 39) */
/* 0x15 */ uint8_t _pad15[3];
/* 0x18 */ uint8_t multi_view_options[48]; /* Tag 204 blob */
/* 0x48 */ uint32_t vertex_mode; /* Tag 6 */
/* 0x4C */ uint8_t _pad4c[4];
/* 0x50 */ uint32_t max_rregs; /* Tag 19 */
/* 0x54 */ uint32_t sched_reg_target; /* Tag 20 */
/* 0x58 */ uint32_t unroll_control; /* Tag 21 */
/* 0x5C */ uint8_t _pad5c[4];
/* 0x60 */ uint8_t mem_win_cbank[24]; /* Tag 201 blob */
/* 0x78 */ uint8_t mem_win_local[24]; /* Tag 202 blob */
/* 0x90 */ uint8_t mem_win_shared[40]; /* Tag 203 blob */
/* 0xB8 */ uint8_t _padb8[12];
/* 0xC4 */ uint8_t accelerated_arch; /* Tag 22 */
/* 0xC5 */ uint8_t std_elf; /* Tag 23 */
/* 0xC6 */ uint8_t _padc6[2];
/* 0xC8 */ uint8_t fast_math[8]; /* Tags 8-17,26,31,33 bitfields */
/* 0xD0 */ uint8_t _padd0[8];
/* 0xD8 */ uint32_t max_rregs_2; /* Tag 24 */
/* 0xDC */ uint32_t sched_reg_2; /* Tag 25 */
/* 0xE0 */ uint32_t unroll_ctl_2; /* Tag 30 */
/* 0xE4 */ uint32_t compress_algo_id; /* Tag 99 */
/* 0xE8 */ uint8_t omega_ptx_err; /* Tag 32 */
/* 0xE9 */ uint8_t std_elf_2; /* Tag 34 */
/* 0xEA */ uint8_t _padea[2];
/* 0xEC */ uint32_t shader_cg_sel; /* Tag 35 */
/* 0xF0 */ uint8_t fdl_bit; /* Tag 36 */
/* 0xF1 */ uint8_t _padf1[3];
/* 0xF4 */ uint32_t fdl_insert_mode; /* Tag 37 */
};
/* sizeof(NvvmContainerHeader) == 248 (0xF8) */
The Options pointer is stored at offset 208 (0xD0) of the container header during deserialization -- the container header acts as both a data holder and an index into the full Options struct.
Options Struct (440 bytes)
The full compiler options structure is allocated separately and linked from the container header. It is parsed by NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes) in the XML path, or populated field-by-field from tags in the binary path.
struct NvvmOptions { /* 440 bytes total */
/* +0 */ uint32_t arch_variant; /* ArchVariant enum */
/* +4 */ uint32_t compile_mode; /* CompileMode enum */
/* +8 */ uint32_t opt_level; /* OptLevel enum */
/* +12 */ uint32_t debug_info; /* DebugInfo enum */
/* +16 */ uint32_t client_version;
/* +20 */ uint8_t flags_20; /* Packed booleans: 6 bits */
/* bit 0: ReserveLocalAddressZero */
/* bit 1: ForceImmediateConstants */
/* bit 2: HideFunctions */
/* bit 3: UseDX10AddressInRange */
/* bit 4: IsPIC */
/* bit 5: NoSpillsConstraint */
/* +21 */ uint8_t _pad21[3];
/* +24 */ uint8_t multi_view[48]; /* MultiViewOptions sub-structure */
/* +72 */ uint32_t vertex_mode; /* VertexMode enum */
/* +76 */ uint8_t _pad76[4];
/* +80 */ uint8_t dci_info[120]; /* DCIInfo sub-structure */
/* +200 */ uint8_t fast_math_byte0; /* FastMath bits 0-7 */
/* +201 */ uint8_t fast_math_byte1; /* FastMath bits 8-12 */
/* +202 */ uint8_t _pad202[2];
/* +204 */ uint32_t fast_math_reserved;
/* +208 */ uint8_t _pad208[8];
/* +216 */ uint32_t max_rregs_allowed;
/* +220 */ uint32_t sched_reg_target;
/* +224 */ uint32_t unroll_control;
/* +228 */ uint32_t okey; /* CompressAlgoId / OKey */
/* +232 */ uint8_t accelerated_arch;
/* +233 */ uint8_t std_elf;
/* +234 */ uint8_t _pad234[2];
/* +236 */ uint32_t shader_codegen_sel_mask;
/* +240 */ uint8_t omega_ptx_error_handling;
/* +241 */ uint8_t _pad241[3];
/* +244 */ uint32_t fdl_insert_mode;
/* +248 */ uint8_t target_opts[192]; /* Extended target options (tags 101-173) */
};
/* sizeof(NvvmOptions) == 440 (0x1B8) */
DCIInfo Sub-Structure (Options +80, 120 bytes)
The Device-Code-Interface sub-structure at offset +80 contains the shader constant interface and constant bank reserved area configurations. Parsed by NvvmOptions_parse_shader_const_iface (0xCCEEA0, 8,355 bytes) and NvvmOptions_parse_cb_reserved_area (0xCCE780, 9,802 bytes).
ShaderConstIface XML fields (from sub_CCEEA0):
| Field | Type | Description |
|---|---|---|
OptimizerConstBank | int32 | Constant bank index used by the optimizer |
DriverConstBank | int32 | Constant bank index used by the driver |
BindlessTextureBank | int32 | Constant bank for bindless texture handles |
LocalMemoryWindow | struct | Memory window config for local memory |
SharedMemoryWindow | struct | Memory window config for shared memory |
VectorizeAndRemapTLD | bool | Enable vectorization and TLD remapping |
ELFControlsDCI | bool | ELF controls DCI interface layout |
DiscardDefaultValueOutputs | bool | Discard outputs that match default values |
CBReservedArea XML fields (from sub_CCE780):
| Field | Type | Description |
|---|---|---|
ByteOffsetToEndOfReservedArea | int32 | End-of-reserved-area offset in constant bank |
CbAddressBitsInReservedVABase | int32 | Address bits for reserved virtual address base |
CbBankToReservedVABase | int32 | Constant bank index for reserved VA base |
ForceHighLatencyConstExpr | bool | Force high-latency constant expression evaluation |
ReservedCbReadBank | int32 | Reserved constant bank read bank index |
MultiViewOptions Sub-Structure (Options +24, 48 bytes)
The multi-view rendering options sub-structure at offset +24 carries graphics pipeline multi-view configuration. Parsed by NvvmOptions_parse_multi_view (0xCD6D20, 12,188 bytes). Serialized as blob tag 204.
| Field | Type | Description |
|---|---|---|
NumViews | int32 | Number of rendering views |
NominalViewIDs | int32[] | Array of nominal view identifiers |
PerViewRTIndexConstants | int32[] | Per-view render target index constants |
EnableViewInstanceMask | bool | Enable per-view instance masking |
ComputePerPatchAttribsForViewZero | bool | Compute per-patch attributes for view 0 |
IsImplicit | bool | Implicit multi-view mode |
CompileMode Enum
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI | Whole-program with ABI compliance |
| 1 | NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI | Whole-program without ABI (internal) |
| 2 | NVVM_COMPILE_MODE_SEPARATE_ABI | Separate compilation (relocatable, --device-c) |
| 3 | NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI | Extensible whole-program with ABI |
OptLevel Enum
| Value | Name |
|---|---|
| 0 | NVVM_OPT_LEVEL_NONE |
| 1 | NVVM_OPT_LEVEL_1 |
| 2 | NVVM_OPT_LEVEL_2 (default) |
| 3 | NVVM_OPT_LEVEL_3 |
DebugInfo Enum
| Value | Name |
|---|---|
| 0 | NVVM_DEBUG_INFO_NONE (default) |
| 1 | NVVM_DEBUG_INFO_LINE_INFO |
| 2 | NVVM_DEBUG_INFO_DWARF |
VertexMode Enum
| Value | Name |
|---|---|
| 0 | NVVM_VERTEX_MODE_SINGLE |
| 1 | NVVM_VERTEX_MODE_A |
| 2 | NVVM_VERTEX_MODE_B |
| 3 | NVVM_VERTEX_MODE_AB |
FDLInsertMode Enum
| Value | Name |
|---|---|
| 0 | NVVM_FDL_MODE_NONE |
| 1 | NVVM_FDL_MODE_ALL |
| 2 | NVVM_FDL_MODE_APP |
ArchVariant Enum
The architecture enum uses a numeric encoding where the value equals major * 10 + minor for older architectures and major * 10 + minor (with 3-digit major) for Blackwell. There are two parallel enum spaces: "virtual" architecture variants (used for compute_XX targets) and "HW" variants (used for sm_XX real silicon targets). The virtual variants are serialized by name in the XML format via NvvmOptions_parse_arch_enum (0xCD09E0, 14,516 bytes).
Virtual Architecture Variants
| Enum Name | Numeric Value | Generation | SM |
|---|---|---|---|
NVVM_ARCH_KEPLER_3_0 | 30 | Kepler | 3.0 |
NVVM_ARCH_KEPLER_3_2 | 32 | Kepler | 3.2 |
NVVM_ARCH_KEPLER_3_5 | 35 | Kepler | 3.5 |
NVVM_ARCH_KEPLER_3_7 | 37 | Kepler | 3.7 |
NVVM_ARCH_MAXWELL_5_0 | 50 | Maxwell | 5.0 |
NVVM_ARCH_MAXWELL_5_2 | 52 | Maxwell | 5.2 |
NVVM_ARCH_MAXWELL_5_3 | 53 | Maxwell | 5.3 |
NVVM_ARCH_PASCAL_6_0 | 60 | Pascal | 6.0 |
NVVM_ARCH_PASCAL_6_1 | 61 | Pascal | 6.1 |
NVVM_ARCH_PASCAL_6_2 | 62 | Pascal | 6.2 |
NVVM_ARCH_VOLTA_7_0 | 70 | Volta | 7.0 |
NVVM_ARCH_VOLTA_7_2 | 72 | Volta | 7.2 |
NVVM_ARCH_TURING_7_3 | 73 | Turing | 7.3 |
NVVM_ARCH_TURING_7_5 | 75 | Turing | 7.5 |
NVVM_ARCH_AMPERE_8_0 | 80 | Ampere | 8.0 |
NVVM_ARCH_AMPERE_8_2 | 82 | Ampere | 8.2 |
NVVM_ARCH_AMPERE_8_6 | 86 | Ampere | 8.6 |
NVVM_ARCH_AMPERE_8_7 | 87 | Ampere | 8.7 |
NVVM_ARCH_AMPERE_8_8 | 88 | Ampere | 8.8 |
NVVM_ARCH_ADA_8_9 | 89 | Ada Lovelace | 8.9 |
NVVM_ARCH_HOPPER_9_0 | 90 | Hopper | 9.0 |
NVVM_ARCH_BLACKWELL_10_0 | 100 | Blackwell | 10.0 |
NVVM_ARCH_BLACKWELL_10_1 | 101 | Blackwell | 10.1 |
NVVM_ARCH_BLACKWELL_10_3 | 103 | Blackwell | 10.3 |
NVVM_ARCH_BLACKWELL_11_0 | 110 | Blackwell (Jetson Thor) | 11.0 |
NVVM_ARCH_BLACKWELL_12_0 | 120 | Blackwell (RTX 50xx / Pro) | 12.0 |
NVVM_ARCH_BLACKWELL_12_1 | 121 | Blackwell (DGX Spark) | 12.1 |
Note: NVVM_ARCH_BLACKWELL_10_1 maps to __CUDA_ARCH 1010, while NVVM_ARCH_BLACKWELL_11_0 maps to __CUDA_ARCH 1100. Despite both being in the BLACKWELL family, they are distinct architectures with separate entries in the processor table. sm_110 (Jetson Thor) was originally designated sm_101 before being renumbered to its own 11.x line.
HW Architecture Variants
The HW variants use a major * 1000 + minor * 10 encoding for their internal numeric values. These map to real silicon rather than virtual compute capabilities:
| Enum Name | Internal Value | Notes |
|---|---|---|
NVVM_ARCH_HW_SM_5_0 | 500 | Maxwell HW baseline |
| ... | ... | One entry per supported HW SM through 9.0 |
NVVM_ARCH_HW_SM_10_0 | 1000 | Blackwell datacenter |
NVVM_ARCH_HW_SM_10_1 | 1010 | Blackwell Ultra (GB300) |
NVVM_ARCH_HW_SM_10_3 | 1030 | Blackwell variant |
NVVM_ARCH_HW_SM_10_4 | 1200 | Maps to SM 120 value -- not publicly documented |
The HW_SM_10_4 = 1200 mapping is notable: SM 10.4 in the HW enum space corresponds to the SM 120 consumer architecture. This reveals that "SM 120" is internally considered a Blackwell 10.4 die variant, not a separate generation.
FastMathOptions Bitfields
The fast-math configuration occupies two bytes at Options offset +200 and +201, with an additional int32 at +204. Each bit independently controls one floating-point relaxation.
Byte +200 (tags 8--15)
Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
| Fmad | Fast | Ftz |Reorder|Reorder|Ignore | Ignore|Ignore |
| | Sqrt | | Half | Float | Sign0 | NaN | Inf |
+-------+-------+-------+-------+-------+-------+-------+-------+
tag 15 tag 14 tag 13 tag 12 tag 11 tag 10 tag 9 tag 8
Byte +201 (tags 16--17, 26, 31, 33)
Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | Lax | No |Reassoc|CanReor| Allow |
| | | | FP16 | Float | Float |derDist| Rcp |
| | | | Div | MAD |AddMAD |ribute | Rsq |
+-------+-------+-------+-------+-------+-------+-------+-------+
tag 33 tag 31 tag 26 tag 17 tag 16
FastMath Divide Sub-Enum
The Divide field within FastMathOptions is a nested enum serialized by name in the XML path:
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_FAST_MATH_DIVIDE_PRECISE_NO_FTZ | IEEE-compliant division, no flush-to-zero |
| 1 | NVVM_FAST_MATH_DIVIDE_PRECISE_ALLOW_FTZ | IEEE division with FTZ permitted |
| 2 | NVVM_FAST_MATH_DIVIDE_FULL_RANGE_APPROX | Full-range approximation |
| 3 | NVVM_FAST_MATH_DIVIDE_FAST_APPROX | Fast approximation (least precise) |
These correspond to the nvcc flags -prec-div=1 (precise) and -prec-div=0 (fast), with FTZ interaction determined by -ftz.
Complete FastMath XML Field Inventory
The full set of XML field names parsed by NvvmOptions_parse_fast_math (0xCCF590, 12,771 bytes):
| XML Field Name | Binary Tag | Type | Description |
|---|---|---|---|
IgnoreInf | 8 | bit | Treat infinities as NaN |
IgnoreNaN | 9 | bit | Assume no NaN values present |
IgnoreSignedZero | 10 | bit | Ignore sign of zero |
ReorderFloat | 11 | bit | Allow float reordering |
ReorderHalf | 12 | bit | Allow half-precision reordering |
Ftz | 13 | bit | Flush denormals to zero |
FastSqrt | 14 | bit | Use fast sqrt approximation |
Fmad | 15 | bit | Allow fused multiply-add |
AllowRcpRsqToSqrt | 16 | bit | Allow rcp(rsqrt(x)) to sqrt(x) |
CanReorderFloatDistribute | 17 | bit | Allow distributive reordering |
ReassociateFloatAddOverMad | 26 | bit | Float add reassociation over MAD |
NoFloatMAD | 31 | bit | Disable float MAD formation |
LaxFP16ApproximateDivision | 33 | bit | Lax FP16 approximate division |
Divide | -- | enum | Division precision sub-enum (above) |
The Divide field is serialized as a nested enum element in XML; in the binary format it is encoded as part of the fast-math reserved int32 at Options +204 (tag 18).
Memory Window Configuration
Memory windows define how the compiler maps address spaces to hardware memory banks. Three window types are serialized as blobs via tags 201--203, parsed by NvvmOptions_parse_cbank_config (0xCCE4B0) and NvvmOptions_parse_memory_windows (0xCCE100).
MemoryWindow Type Enum
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_MEMORY_WINDOW_SPECIAL_REGISTER | Accessed via special registers |
| 1 | NVVM_MEMORY_WINDOW_CBANK | Constant bank window |
| 2 | NVVM_MEMORY_WINDOW_IMMEDIATE | Immediate offset addressing |
Window Entry Layout (8 bytes)
struct MemoryWindowEntry {
uint32_t window_type; /* MemoryWindow type enum */
uint32_t cbank; /* Constant bank index */
/* The following are part of the containing blob: */
/* uint32_t cbank_ofst_low; -- lower bound of offset range */
/* uint32_t cbank_ofst_hi; -- upper bound of offset range */
};
- Tag 201 (
MemoryWindowCBank): 24 bytes = 3 entries of{window_type, cbank, low, hi}truncated to fit, or 3 x 8 bytes depending on sub-field packing. - Tag 202 (
MemoryWindowLocal): 24 bytes, same structure. - Tag 203 (
MemoryWindowShared): 40 bytes = 10 xuint32_tvalues encoding shared memory bank strides, offsets, and configuration flags.
Version Compatibility Logic
Version checking is the first operation performed on a container buffer, implemented in NvvmContainer_check_versions (0xCD41B0). The logic is conservative on major versions and lenient on minor versions:
1. Verify magic == 0x7F4E5C7D
Fail: return NULL (not a container)
2. Version.Major must == 1
Fail: "NvvmContainer major version N not compatible" → return NULL
3. Version.Minor compared to 0x41 (65)
If container minor > tool minor:
Warning: "Linked container's NvvmContainer minor version N newer than tool"
Parse continues regardless.
4. NvvmIRVersion.Major must == 2
Fail: "NvvmIR major version N not compatible" → return NULL
5. NvvmIRVersion.Minor compared to 0x62 (98)
If container minor > tool minor: warning, parse continues.
6. NvvmDebugVersion.Major must == 3
Fail: "NvvmDebug major version N not compatible" → return NULL
7. NvvmDebugVersion.Minor compared to 2
If container minor > tool minor: warning, parse continues.
8. LlvmVersion (major*100 + minor) must be <= 2000
Fail: "LLVM version N not compatible" → return NULL
A separate standalone validator (0xCCD5F0) adds a mode-dependent check: in binary dump mode (a5=0), the LLVM version must be exactly 20; in normal mode (a5=1), it must be <= 20.
The philosophy is clear: major version bumps signal breaking format changes and are hard failures. Minor version bumps add new tags but never change existing tag semantics -- the delta encoding and unknown-tag-skipping design ensures forward compatibility.
Current Version Constants (cicc v13.0)
| Field | Major | Minor |
|---|---|---|
| Version (container format) | 1 | 0x41 (65) |
| NvvmIRVersion | 2 | 0x62 (98) |
| NvvmDebugVersion | 3 | 2 |
| LlvmVersion | 20 | 0 |
XML Serialization Format
The XML path (NvvmContainer_parse_header at 0xCDCA30) uses NVIDIA's YAML-based serialization framework with virtual dispatch. The top-level XML document contains these elements:
<NvvmContainer>
<Version major="1" minor="65"/>
<NvvmIRVersion major="2" minor="98"/>
<NvvmDebugVersion major="3" minor="2"/>
<LlvmVersion major="20" minor="0"/>
<IRLevel>NVVM_IR_LEVEL_UNIFIED_AFTER_DCI</IRLevel>
<Options>
<ArchVariant>NVVM_ARCH_ADA_8_9</ArchVariant>
<CompileMode>NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI</CompileMode>
<OptLevel>NVVM_OPT_LEVEL_2</OptLevel>
<DebugInfo>NVVM_DEBUG_INFO_NONE</DebugInfo>
<FastMathOptions>
<Ftz>1</Ftz>
<Fmad>1</Fmad>
<Divide>NVVM_FAST_MATH_DIVIDE_FAST_APPROX</Divide>
...
</FastMathOptions>
<MaxRRegsAllowed>255</MaxRRegsAllowed>
...
</Options>
<IsBinary>1</IsBinary>
<Module>... base64-encoded LLVM bitcode ...</Module>
</NvvmContainer>
All enum values are serialized by their full string names (e.g., NVVM_COMPILE_MODE_SEPARATE_ABI), not by numeric value. The XML format does not use delta encoding -- every field is written regardless of whether it matches the default, making XML containers significantly larger but human-readable.
Serialization Flow
The serializer (0xCDD2D0) has two modes controlled by parameter a3: binary (a3=1) and XML (a3=0).
Binary Serialization (a3=1)
1. Compute version fields (use defaults if not set):
Version = {1, 0x41}
NvvmIRVersion = {2, 0x62}
NvvmDebugVersion = {3, 2}
LlvmVersion = {20, 0}
2. Allocate 248-byte NvvmContainerHeader (zeroed)
3. Allocate 440-byte default Options struct
4. Allocate two growable arrays:
scalar_tags[] -- int32 entries for tag/value pairs
blob_data[] -- byte entries for blob payloads
5. For each field in current Options vs. default Options:
If field differs:
Scalar → sub_CD17A0(scalar_tags, tag_id, value)
Blob → sub_CD1AB0(blob_data, scalar_tags, tag_id, ptr, size)
6. Optional IR compression:
If a4 flag set:
Compress LLVM bitcode via sub_C8D290
Compute CRC via sub_CCD2B0 → store as tag 99
Compress via sub_1688730(codec, data, size)
7. Append terminator: tag 0 to scalar_tags
8. Write 24-byte header (with computed ScalarFieldsEnd, BlobDataEnd)
9. Write scalar_tags array
10. Write blob_data array
11. Write compressed or raw IR payload
Deserialization (0xCD1D80)
1. Verify magic == 0x7F4E5C7D
2. Allocate 248-byte NvvmContainerHeader
3. Allocate 440-byte Options struct with defaults
4. Store Options pointer at container header offset 208
5. Compute tag_ptr = buffer + header_size (from offset 0x0E)
6. Compute blob_base = buffer + scalar_fields_end (from offset 0x10)
7. Enter switch loop:
Read tag (int16), decode value (int16 or sentinel + int32)
Switch on tag (103 unique case labels):
Tags 1-39: → write scalar to Options field
Tag 99: → store compression algo ID
Tags 101-173: → write to extended target options
Tags 201-218: → resolve blob offset, copy blob data
Tags 301-309: → write to extended int32 fields
Tags 351-353: → copy 8-byte blob to extended fields
Tags 401-402: → conditionally parse structured blob
Tag 0 → exit loop
8. If tag 99 present: decompress IR payload
9. Return container pointer
Annotated Hex Dump
A minimal container targeting SM 89 (Ada Lovelace) with default options (only SmMajor and SmMinor differ from defaults):
Offset Hex Decoded
------ ----------------------------------------- ---------------------------------
0x0000 7D 5C 4E 7F Magic: 0x7F4E5C7D
0x0004 01 41 Version: 1.65
0x0006 02 62 NvvmIRVersion: 2.98
0x0008 03 02 NvvmDebugVersion: 3.2
0x000A 14 00 LlvmVersion: 20.0
0x000C 00 00 IRLevel: 0 (UNIFIED_AFTER_DCI)
0x000E 18 00 HeaderSize: 24
0x0010 2C 00 00 00 ScalarFieldsEnd: 44
0x0014 2C 00 00 00 BlobDataEnd: 44 (no blobs)
--- Scalar tag/value region ---
0x0018 01 00 08 00 Tag 1 (SmMajor) = 8
0x001C 02 00 09 00 Tag 2 (SmMinor) = 9
0x0020 0D 00 01 00 Tag 13 (Ftz) = 1
0x0024 0F 00 01 00 Tag 15 (Fmad) = 1
0x0028 00 00 Terminator (tag 0)
0x002A 00 00 Padding to alignment
--- Blob data region ---
(empty -- ScalarFieldsEnd == BlobDataEnd)
--- IR payload follows at offset 0x002C ---
0x002C DE C0 17 0B ... LLVM bitcode (0xDEC0170B magic)
This example shows the efficiency of delta encoding: only 4 tag/value pairs (16 bytes of tags) plus the 24-byte header produce a fully-specified container. All other fields (CompileMode, OptLevel, DebugInfo, all target options) inherit their defaults during deserialization.
A container with a 32-bit value would look like:
0x00XX 13 00 FF FF 00 04 00 00 Tag 19 (MaxRRegsAllowed) = 1024
(0xFFFF sentinel, then 0x0400 LE)
Pipeline Integration
The container serves as the inter-stage transport format within the cicc compilation pipeline. Two entry paths exist:
| Path | Entry Function | Address | Pipeline |
|---|---|---|---|
| Path A (LibNVVM) | nvvmCompileProgram dispatcher | 0x9047E0 | 3-phase: LNK -> OPT -> LLC |
| Path B (standalone) | cicc_main orchestrator | 0x12642A0 | 4-stage: LNK -> OPT -> OPTIXIR -> LLC |
Both paths deserialize the container at phase 1, then translate Options into per-stage compiler flags:
SmMajor/SmMinorfrom tags 1--2 become-mcpu=sm_XXFastMath.Ftzfrom tag 13 becomes-nvptx-f32ftzFastMath.Fmadfrom tag 15 becomes the IEEE mode flagOptLevelbecomes-nvptx-opt-level=NCompileMode == 2(SEPARATE_ABI) adds--device-cIRLevel == 1(LTO) enters the LTO pipeline with partially-optimized bitcodeIRLevel == 2(OPTIX) activates the OptiX IR stage (bit 6 of pipeline bitmask) and disables LICM and IP-MSP
The container format is the single source of truth for all compilation parameters. When cicc is invoked by nvcc, the driver serializes its accumulated flags into a container, passes the container as input, and cicc deserializes it back into compiler options. This round-trip through binary serialization ensures that all pipeline stages see exactly the same configuration, eliminating the flag-parsing divergence that would otherwise arise from each stage having its own CLI parser.
YAML Serialization Framework
The XML/YAML path uses a generic serialization framework built on a bundled YAML parser/emitter library (Cluster A: 0xCB0000--0xCBFA60). The library provides:
| Function | Address | Role |
|---|---|---|
yaml_parser_main | 0xCB9640 | Top-level YAML parser (25,873 bytes) |
yaml_emitter_main_loop | 0xCBDA10 | Main YAML emitter loop (23,583 bytes) |
yaml_scanner_scan_tokens | 0xCB7E40 | Token scanner (17,924 bytes) |
yaml_parser_parse_flow | 0xCB8C00 | Flow-style parsing (15,188 bytes) |
yaml_parser_load_document | 0xCBA570 | Document loader/resolver (9,695 bytes) |
The serialization framework uses virtual dispatch: each serializable type registers a serialize/deserialize function pair, and the framework dispatches based on the YAML node type (scalar=1, sequence, mapping). All enum values are serialized by their full string names (NVVM_COMPILE_MODE_SEPARATE_ABI, NVVM_ARCH_ADA_8_9, etc.), not by numeric value.
Finalizer Knobs Integration
The container Options struct also feeds into the NVIDIA finalizer knobs system through NvvmOptions_parse_finalizer_knobs (0xCD9990, 31,702 bytes -- the 7th largest function in the binary). This parser ingests the complete set of NVIDIA-specific backend configuration knobs:
- Shader pipeline controls:
PromoteHalf,PromoteFixed,USePIXBAR,VSIsVREnabled,VSIsLastVTGStage - Codegen controls:
DisablePredication,DisableXBlockSched,EnableJumpTable,ScheduleKils - Memory controls:
DoMMACoalescing,AssumeConvertMemoryToRegProfitable - Barrier controls:
DisableERRBARAfterMEMBAR,GenConvBranchForWarpSync - PGO controls:
PGOEpoch,PGOBatchSize,PGOCounterMemBaseVAIndex - Per-CTA controls:
CTASizeX,CTASizeY,CTASizeZ,SharedMemorySize,SMemScratchBase - Register controls:
MaxActiveWarpsPerSM,NumReservedUReg,NumScratchURegs
These knobs are distinct from the NVVMPassOptions system (see NVVMPassOptions) -- the finalizer knobs configure the backend code generator, while NVVMPassOptions configure the optimization pipeline.
Tag Summary Statistics
| Range | Count | Description |
|---|---|---|
| 1--39 | 38 | Core scalar options (SM version, fast-math, unroll, flags) |
| 99 | 1 | Compression metadata |
| 101--173 | 73 | Extended target options (hardware capabilities, memory config) |
| 201--218 | 18 | Blob data (memory windows, resource tables, strings) |
| 301--309 | 9 | Extended int32 fields (cluster config, extended options) |
| 351--353 | 3 | Extended int64 blob references |
| 401--402 | 2 | Structured conditional blobs (TMA / TCGen05) |
| Total | 144 | Distinct tag IDs across 6 ranges |
The deserializer switch statement has 103 unique case labels -- the remaining 41 tags share code paths with other tags (e.g., all single-bit tags in a byte share a case that reads the bit position from a secondary table).
Cross-References
- NVVMPassOptions -- 222-slot optimization pipeline configuration
- Pipeline Entry -- LibNVVM API and CLI entry points
- OptiX IR -- IRLevel=2 OptiX pipeline
- LTO Pipeline -- IRLevel=1 link-time optimization
- SM 90 Hopper -- TMA descriptor usage (tag 401)
- SM 100 Blackwell -- TCGen05 config usage (tag 402)
- Bitcode I/O -- LLVM bitcode reader/writer wrapping the IR payload
- nvcc Interface -- Driver-to-cicc container passing