Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

NVVM Container Binary Format

The NVVM container is a proprietary binary envelope that wraps LLVM bitcode with compiler metadata for transport between pipeline stages in cicc v13.0. It carries target architecture, optimization options, fast-math flags, memory window configurations, per-kernel resource tables, and the IR payload itself -- all in a single serializable blob. Two serialization paths exist: a compact binary wire format used in production (nvcc / ptxas pipelines) and an XML-based format used for debugging and interchange. This page specifies the binary format in sufficient detail to write a conformant parser and serializer.

The format is implemented across 26 functions in the 0xCCBB10--0xCDD2D0 address range (Cluster C in the binary layout). The six top-level entry points:

FunctionAddressSizeRole
NvvmContainer_serialize0xCDD2D047,540 BBinary + XML serializer
NvvmContainer_deserialize_options0xCD1D8051,859 BBinary tag/value decoder
NvvmContainer_parse_header0xCDCA3010,206 BXML path header parser
NvvmContainer_check_versions0xCD41B016,708 BVersion compatibility gate
NvvmContainer_validate_versions0xCCD5F08,987 BStandalone version validator
NvvmContainer_init_options_struct0xCCBB10smallZero-init 248-byte container struct

Supporting parsers called from NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes):

FunctionAddressSizeRole
NvvmOptions_parse_arch_enum0xCD09E014,516 BArchVariant enum string-to-int
NvvmOptions_parse_fast_math0xCCF59012,771 BFastMathOptions sub-structure
NvvmOptions_parse_multi_view0xCD6D2012,188 BMultiViewOptions sub-structure
NvvmOptions_parse_cb_reserved_area0xCCE7809,802 BCB reserved area config
NvvmOptions_parse_reg_targets0xCD7CE09,542 BRegister target config
NvvmOptions_parse_serialize_helper0xCD58A09,579 BOption serialization helper
NvvmOptions_parse_shader_const_iface0xCCEEA08,355 BShaderConstIface (DCI)
NvvmOptions_parse_align_entries0xCD86106,739 BAlignment entry config
NvvmOptions_parse_pgo_section0xCD02C05,482 BPGO configuration
NvvmOptions_parse_section0xCD55105,166 BNested YAML section parser
NvvmOptions_parse_memory_windows0xCCE1005,042 BMemory window config
NvvmOptions_parse_cbank_config0xCCE4B04,173 BConstant bank config
NvvmOptions_parse_bool_or_int0xCCC4A0smallBoolean/int option parser
NvvmOptions_parse_tristate0xCCCFB0smallTri-state option parser
NvvmOptions_parse_string0xCD5150smallString option parser

The finalizer knobs parser (0xCD9990, 31,702 bytes) is called separately to ingest the full set of NVIDIA-specific backend knobs (see NVVMPassOptions).

Binary-level helpers:

FunctionAddressRole
NvvmContainer_write_tag_value0xCD17A0Write one tag/value pair (called 121 times from serializer)
NvvmContainer_write_blob0xCD1AB0Write blob data + tag reference
NvvmContainer_compute_crc0xCCD2B0CRC with seeds 0x8DF5D74C, 0xBAA56A96

Global state: qword_4F87148 holds the NVVM options global state pointer, checked by many downstream consumers.

Binary Header

Every binary container begins with a fixed 24-byte header. The header is self-describing: HeaderSize at offset 0x0E stores its own length (always 24), and two size fields partition the remainder into a scalar tag region and a blob data region.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Magic  (0x7F4E5C7D)                       |  0x00
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Ver.Major    |  Ver.Minor    | NvvmIR.Major  | NvvmIR.Minor  |  0x04
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NvvmDbg.Major | NvvmDbg.Minor | Llvm.Major    | Llvm.Minor    |  0x08
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         IRLevel (u16)         |       HeaderSize (u16)        |  0x0C
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     ScalarFieldsEnd (u32)                     |  0x10
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      BlobDataEnd (u32)                        |  0x14
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
struct NvvmContainerBinaryHeader {
    uint32_t magic;              /* 0x00: must be 0x7F4E5C7D              */
    uint8_t  version_major;      /* 0x04: container format major (1)      */
    uint8_t  version_minor;      /* 0x05: container format minor (<=0x41) */
    uint8_t  nvvm_ir_major;      /* 0x06: NVVM IR version major (2)       */
    uint8_t  nvvm_ir_minor;      /* 0x07: NVVM IR version minor (<=0x62)  */
    uint8_t  nvvm_debug_major;   /* 0x08: debug info version major (3)    */
    uint8_t  nvvm_debug_minor;   /* 0x09: debug info version minor (<=2)  */
    uint8_t  llvm_major;         /* 0x0A: LLVM version (see encoding)     */
    uint8_t  llvm_minor;         /* 0x0B: LLVM version (see encoding)     */
    uint16_t ir_level;           /* 0x0C: IRLevel enum                    */
    uint16_t header_size;        /* 0x0E: always 24 (0x0018)              */
    uint32_t scalar_fields_end;  /* 0x10: byte offset past scalar region  */
    uint32_t blob_data_end;      /* 0x14: byte offset past blob region    */
};

The three data regions in order:

[0 .. 24)                         -- Header (fixed)
[24 .. scalar_fields_end)         -- Scalar tag/value pairs
[scalar_fields_end .. blob_data_end) -- Blob data region

The total container size is blob_data_end bytes. After the blob data region, the IR payload (LLVM bitcode, optionally compressed) follows immediately.

LLVM Version Encoding

The llvm_major and llvm_minor bytes encode the LLVM version as a combined integer: llvm_major * 100 + llvm_minor. For cicc v13.0 (LLVM 20), this yields 20 * 100 + 0 = 2000. The version check compares the combined value, not the individual bytes.

IRLevel Enum

ValueNameMeaning
0NVVM_IR_LEVEL_UNIFIED_AFTER_DCIDefault: IR after Device-Code-Interface unification
1NVVM_IR_LEVEL_LTOLink-Time Optimization IR (partially optimized)
2NVVM_IR_LEVEL_OPTIXOptiX pipeline IR

Scalar Tag/Value Encoding

Immediately after the 24-byte header, a sequence of (tag, value) pairs encodes every container field that differs from its default value. The encoding is a variable-length scheme optimized for small values:

Case 1 -- value fits in 16 bits (0x0000..0xFFFE):
  [tag : int16] [value : int16]          -- 4 bytes total

Case 2 -- value needs 32 bits:
  [tag : int16] [0xFFFF : int16] [value : int32]  -- 8 bytes total

Terminator:
  [0x0000 : int16]                       -- tag 0 ends the sequence

All multi-byte fields are little-endian. The sentinel value 0xFFFF in the value slot signals that a full 32-bit value follows. This means the maximum encodable 16-bit value is 0xFFFE (65534); values of exactly 0xFFFF or larger require the extended form.

The serializer (sub_CD17A0, called 121 times from NvvmContainer_serialize) writes each tag/value pair using this scheme. The deserializer enters a switch loop over tags 1--402, decoding each value and writing it to the appropriate offset in the deserialized container struct.

Delta Encoding Strategy

The serializer allocates a default-initialized 440-byte Options struct and compares each field in the current Options against the corresponding default. Only fields that differ from the default are written as tag/value pairs. This makes typical containers very compact -- a standard compilation targeting SM 89 with -O2 might emit fewer than 20 tag/value pairs, covering just SmMajor, SmMinor, CompileMode, and a handful of target-specific flags.

The deserializer reverses this: it allocates a default Options struct first, then overwrites individual fields as tags are encountered. Unknown tags are silently skipped, which is the mechanism that provides forward compatibility -- a newer serializer can emit tags that an older deserializer simply ignores.

Blob Data Region

Tags in the 200+ and 400+ ranges reference variable-length data stored in the blob region. The scalar value for a blob tag is the byte offset into the blob region where the data begins. The blob region starts at scalar_fields_end bytes from the container start.

To resolve a blob reference: blob_ptr = container_base + scalar_fields_end + offset_value.

Blob entries do not carry explicit length fields in the tag/value stream. The deserializer knows each blob type's expected size from the tag ID (e.g., tag 201 is always 24 bytes, tag 203 is always 40 bytes). Variable-length blobs like strings (tags 209, 210, 213, 216, 217) are null-terminated. Length-prefixed blobs (tag 218) carry a 4-byte length prefix.

Complete Tag Table

144 distinct tag IDs organized into six ranges. The "Offset" column refers to the byte position within the deserialized 440-byte Options struct.

Range 1--39: Core Scalar Options

TagTypeNameOptions OffsetNotes
1int32SmMajor+0 (ArchVariant)SM major version (e.g., 8 for SM 89)
2int32SmMinor+0 (ArchVariant)SM minor version (e.g., 9 for SM 89)
3int32NumRegs+216Register count hint
4int32NumBarriers+220Barrier count
5int32SharedMemorySize+224Shared memory size in bytes
6int32VertexMode+72See VertexMode enum
7bitReserveLocalAddressZero+20 bit 0Reserve address 0 in local memory
8bitFastMath.IgnoreInf+200 bit 0Treat infinities as NaN
9bitFastMath.IgnoreNaN+200 bit 1Assume no NaN values present
10bitFastMath.IgnoreSignedZero+200 bit 2Ignore sign of zero
11bitFastMath.ReorderFloat+200 bit 3Allow float reordering
12bitFastMath.ReorderHalf+200 bit 4Allow half-precision reordering
13bitFastMath.Ftz+200 bit 5Flush denormals to zero
14bitFastMath.FastSqrt+200 bit 6Use fast sqrt approximation
15bitFastMath.Fmad+200 bit 7Allow fused multiply-add
16bitFastMath.AllowRcpRsqToSqrt+201 bit 0Allow rcp(rsqrt(x)) to sqrt(x)
17bitFastMath.CanReorderFloatDistribute+201 bit 1Allow distributive reordering
18int32FastMath.Reserved+204Reserved fast-math field
19int32MaxRRegsAllowed+216Maximum registers per thread (primary)
20int32SchedRegTarget+220Scheduling register pressure target
21int32UnrollControl+224Unroll factor control
22boolAcceleratedArch+232True for sm_XXa variants
23boolStdELF+233Use standard ELF output format
24int32MaxRRegsAllowed2+216Secondary max-regs (override)
25int32SchedRegTarget2+220Secondary sched target
26bitFastMath.ReassociateFloatAddOverMad+201 bit 2Float add reassociation over MAD
27bitForceImmediateConstants+20 bit 1Force immediate constant loading
28bitHideFunctions+20 bit 2Hide internal functions from output
29bitUseDX10AddressInRange+20 bit 3DX10 address range mode
30int32UnrollControl2+224Secondary unroll control
31bitFastMath.NoFloatMAD+201 bit 3Disable float MAD formation
32boolAcceleratedArch2+232Secondary accelerated-arch flag
33bitFastMath.LaxFP16ApproximateDivision+201 bit 4Lax FP16 approximate division
34boolStdELF2+233Secondary StdELF
35int32ShaderCodegenSelMask+236Shader codegen selection bitmask
36boolOmegaPtxErrorHandling+240Enable Omega-style PTX error handling
37int32FDLInsertMode+244See FDLInsertMode enum
38bitIsPIC+20 bit 4Position-independent code flag
39bitNoSpillsConstraint+20 bit 5Hard constraint: no register spills

Tag 99: Compression Metadata

TagTypeNameNotes
99int32CompressAlgoIdCompression algorithm selector for IR payload

When present, the IR payload following the blob region is compressed. The value selects a codec via sub_16886D0(algo_id). If the value is 0, the runtime substitutes the default algorithm ID 0x75D49913 (1,977,119,507 decimal). The codec is a pluggable compression/encryption layer accessed through four function pointers:

/* Compression codec API (addresses in the 0x1688xxx range) */
void *codec_acquire(uint32_t algo_id);            /* sub_16886D0 */
int   codec_compress(void *codec, void *data,
                     size_t size);                 /* sub_1688730 */
int   codec_decompress(void *codec, void *data,
                       size_t size);               /* sub_16887A0 */
void  codec_release(void *codec);                  /* sub_1688720 */

The write path in NvvmContainer_serialize (0xCDD2D0) compresses the LLVM bitcode payload via sub_C8D290, then computes a CRC hash via NvvmContainer_compute_crc (0xCCD2B0) with the two seed values -1914584148 (0x8DF5D74C) and -1162247642 (0xBAA56A96). The CRC value is stored as the CompressAlgoId tag 99 value, which doubles as an integrity check token: the deserializer uses the same CRC seeds to verify the payload before decompression.

The compression subsystem lives outside the main container cluster at addresses 0x16886D0--0x16887A0, in the utility library region of the binary.

Range 101--173: Extended Target Options

These tags configure per-kernel and target-specific hardware parameters. Most map into a sub-structure accessed through the Options struct. The "Byte.Bit" column indicates the packed bitfield location within the target options sub-structure.

TagTypeNameLocationNotes
101boolHasTextureOpsoffset 0Target supports texture operations
102boolHasSurfaceOpsoffset 0Target supports surface operations
103boolHasAtomicsoffset 0Target supports atomic operations
104boolHasVoteoffset 0Target supports warp vote intrinsics
105int32MaxThreadsPerBlockoffset 4Maximum CTA thread count
106bytePreferL1SizeFlagoffset 8L1 cache vs shared memory preference
107boolHasWarpShuffleoffset 0Target supports warp shuffle
108boolHasFunnelShiftoffset 0Target supports funnel shift
109int32CBankOfstLowoffset 12Constant bank offset lower bound
110int32CBankOfstHioffset 16Constant bank offset upper bound
111int32CBankSizeoffset 20Constant bank size in bytes
112bitBit0_68byte 68, bit 0Target capability flag
113bitBit1_68byte 68, bit 1Target capability flag
114bitBit2_68byte 68, bit 2Target capability flag
115bitBit3_68byte 68, bit 3Target capability flag
116bitBit4_68byte 68, bit 4Target capability flag
117bitBit5_68byte 68, bit 5Target capability flag
118bitBit7_68byte 68, bit 7Target capability flag (bit 6 skipped)
119bitEnableCoalescebyte 69, bit 0Enable memory coalescing optimization
120bitEnableVectorizebyte 69, bit 2Enable auto-vectorization
1212-bitCompactionModebyte 69, bits 3--4Thread compaction strategy (0--3)
122int32StackFrameSizeoffset 96Stack frame size in bytes
123int32StackAlignmentoffset 100Stack alignment requirement
124int32ParamSpaceSizeoffset 104Parameter space size
125int32ParamAlignmentoffset 108Parameter space alignment
126int32LocalMemSizeoffset 116Local memory size per thread
127int32SharedBankConfigoffset 156Shared memory bank configuration
128int32MinGridSizeoffset 248Minimum grid size for occupancy
129int32MaxGridDimXoffset 252Maximum X-dimension grid size
130int32SharedMemPerBlockoffset 264Shared memory per block
1312-bitWarpScheduleModebyte 70, bits 0--1Warp scheduling strategy
132bitEnablePrefetchbyte 70, bit 2Enable memory prefetch instructions
133bitBit4_70byte 70, bit 4Target capability flag
134bitBit5_70byte 70, bit 5Target capability flag
135bitBit6_70byte 70, bit 6Target capability flag
136bitBit7_70byte 70, bit 7Target capability flag
137int32MaxDynSharedoffset 268Maximum dynamic shared memory
138boolHasLDGoffset 5Target supports LDG instruction
139bitBit1_71byte 71, bit 1Target capability flag
140bitBit2_71byte 71, bit 2Target capability flag
141boolHasBarrierReduceoffset 40Target supports barrier-reduce
142int32CacheConfigoffset 280Cache configuration selector
143bitBit6_68byte 68, bit 6Target capability flag
144bitBit3_71byte 71, bit 3Target capability flag
145bitBit0_71byte 71, bit 0Target capability flag
146int32ConstBankSizeoffset 256Constant bank total size
147int32ShMemBankStrideoffset 152Shared memory bank stride
1482-bitScheduleMode2byte 71, bits 4--5Secondary scheduling mode
149bitBit6_71byte 71, bit 6Target capability flag
150bitBit7_71byte 71, bit 7Target capability flag
151int32LocalMemAlignmentoffset 112Local memory alignment
152bitEnableBarrierOptbyte 69, bit 5Enable barrier optimization
153bitBit0_72byte 72, bit 0Target capability flag
154bitBit6_69byte 69, bit 6Target capability flag
155bitBit7_69byte 69, bit 7Target capability flag
156bitBit1_72byte 72, bit 1Target capability flag
157boolHasDP4Aoffset 1Target supports DP4A dot-product
158bitBit3_72byte 72, bit 3Target capability flag
159int32ConstBankSize2offset 260Secondary constant bank size
160int32MaxRegsPerThreadoffset 284Hard limit on registers per thread
161int32ClusterSizeoffset 276Thread block cluster size (SM 90+)
162bitBit4_72byte 72, bit 4Target capability flag
163bitBit5_72byte 72, bit 5Target capability flag
164bitBit6_72byte 72, bit 6Target capability flag
165bitBit7_72byte 72, bit 7Target capability flag
166int32MaxCTAPerSMoffset 160Maximum CTAs per SM
167int32TexIndirectLimitoffset 272Texture indirect access limit
168bitBit0_432byte 432, bit 0Extended capability flag
169bitBit1_432byte 432, bit 1Extended capability flag
170bitBit2_432byte 432, bit 2Extended capability flag
171boolHasTMAOpsoffset 289Target supports TMA operations (SM 90+)
172bitBit3_70byte 70, bit 3Target capability flag
173boolHasTCGen05offset 290Target supports TCGen05 (SM 100+)

Range 201--218: Blob Data Tags

TagSizeNameDescription
20124 BMemoryWindowCBank3 memory window entries for constant bank (see below)
20224 BMemoryWindowLocal3 memory window entries for local memory
20340 BMemoryWindowShared10 x uint32_t for shared memory windows + flags
20448 BMultiViewOptionsMulti-view rendering header + typed arrays
205varTargetResourceTable24-byte header + 36 bytes per entry
206varPerKernelCBankOffsets4-byte count + 4 bytes per kernel
207varPerKernelStackSizes4-byte count + 4 bytes per kernel
208varPerKernelSMEMSizes8-byte count + 8 bytes per kernel
209varTargetFuncNameNull-terminated string
210varTargetEntryNameNull-terminated string
2118 BPerKernelQWORD8-byte per-kernel datum
21212 BExtraMemParams8 + 4 bytes of memory parameters
213varAuxString1Null-terminated auxiliary string
214varPerKernelRegisters4-byte count + 4 bytes per kernel
215varPerKernelBarriers4-byte count + 4 bytes per kernel
216varAuxString2Null-terminated auxiliary string
217varAuxString3Null-terminated auxiliary string
218varAuxByteArray4-byte length prefix + raw bytes

Range 301--309: Extended Int32 Fields

TagTypeNameOptions OffsetNotes
301int32ExtOpt.Field344+344Cluster/group configuration selector
302int32ExtOpt.Field348+348Extended option
303int32ExtOpt.Field352+352Extended option
304int32ExtOpt.Field356+356Extended option
305int32ExtOpt.Field360+360Extended option
306int32ExtOpt.Field400+400Extended option
307int32ExtOpt.Field364+364Extended option
308int32ExtOpt.Field368+368Extended option
309int32ExtOpt.Field372+372Extended option

Range 351--353: Extended Int64 Blob References

TagSizeNameOptions Offset
3518 BExtOpt.QWord376+376
3528 BExtOpt.QWord384+384
3538 BExtOpt.QWord392+392

Range 401--402: Structured Blob Data

These tags are conditionally parsed based on the value of tag 301 (ExtOpt.Field344):

TagConditionSizeNameNotes
401Field344 == 156+ BTMADescriptorSM 90 Hopper TMA bulk-copy descriptors. 44-byte fixed header + 16 bytes per entry.
402Field344 == 440+ BTCGen05ConfigSM 100 Blackwell TCGen05 tensor configurations. 32-byte fixed header + 12 bytes per entry.

The conditional parsing means a single container cannot carry both TMA and TCGen05 data -- the Field344 value selects which hardware generation's tensor memory interface is active.

TMADescriptor Layout (Tag 401, Field344 == 1)

TMA (Tensor Memory Access) descriptors configure cp.async.bulk operations on SM 90 Hopper. The TMA descriptor extraction is performed by sub_9483E0 during intrinsic lowering. The blob layout:

struct TMADescriptor {
    /* +0  */ uint32_t num_entries;          /* Number of TMA descriptors     */
    /* +4  */ uint32_t dimensionality;       /* 1d..5d tensor rank            */
    /* +8  */ uint32_t element_size;         /* Bytes per element             */
    /* +12 */ uint32_t interleave_layout;    /* Memory interleave pattern     */
    /* +16 */ uint32_t swizzle_mode;         /* Swizzle mode selector         */
    /* +20 */ uint32_t fill_mode;            /* Out-of-bounds fill behavior   */
    /* +24 */ uint32_t [5] global_dims;      /* Global tensor dimensions      */
    /* +44 */ /* --- 16 bytes per entry --- */
    /*        uint32_t box_dim;              Per-entry box dimension          */
    /*        uint32_t stride;               Per-entry stride                 */
    /*        uint32_t elem_stride;          Per-entry element stride         */
    /*        uint32_t reserved;             Reserved/padding                 */
};

See SM 90 Hopper for the TMA instruction format and the cp.async.bulk.tensor.g2s.tile.{1d,2d,3d,4d,5d} intrinsic family.

TCGen05Config Layout (Tag 402, Field344 == 4)

TCGen05 (Tensor Core Generation 5) configurations describe Blackwell SM 100 tensor memory operations. The TCGen05 instruction set includes tcgen05.alloc, tcgen05.dealloc, tcgen05.commit, tcgen05.fence, tcgen05.wait, and tcgen05.relinquish.alloc -- all gated by the SM 100 arch-conditional check at sub_30462A0. The blob layout:

struct TCGen05Config {
    /* +0  */ uint32_t num_entries;          /* Number of TCGen05 configs     */
    /* +4  */ uint32_t accumulator_size;     /* Accumulator memory size       */
    /* +8  */ uint32_t commit_mode;          /* Commit mode (multicast flags) */
    /* +12 */ uint32_t fence_mode;           /* Fence mode selector           */
    /* +16 */ uint32_t [4] reserved;         /* Reserved fields               */
    /* +32 */ /* --- 12 bytes per entry --- */
    /*        uint32_t config_id;            TCGen05 config identifier        */
    /*        uint32_t fragment_count;       Number of fragments              */
    /*        uint32_t flags;                Per-config flags                 */
};

See SM 100 Blackwell for the TCGen05 instruction set and the tcgen05.* intrinsic family.

Deserialized Container Struct

After parsing, the container is represented as a 248-byte in-memory structure allocated by NvvmContainer_init_options_struct (0xCCBB10). This struct holds the container metadata plus a pointer to the full 440-byte Options struct.

struct NvvmContainerHeader {          /* 248 bytes total                   */
    /* 0x00 */ uint32_t sm_major;     /* Tag 1: SM major version           */
    /* 0x04 */ uint32_t sm_minor;     /* Tag 2: SM minor version           */
    /* 0x08 */ uint32_t num_regs;     /* Tag 3                             */
    /* 0x0C */ uint32_t num_barriers; /* Tag 4                             */
    /* 0x10 */ uint32_t shared_mem_size; /* Tag 5                          */
    /* 0x14 */ uint8_t  flags_14;     /* Packed bits: tags 7,27,28,29,38,39*/
    /*         bit 0: ReserveLocalAddressZero  (tag 7)                     */
    /*         bit 1: ForceImmediateConstants  (tag 27)                    */
    /*         bit 2: HideFunctions            (tag 28)                    */
    /*         bit 3: UseDX10AddressInRange     (tag 29)                   */
    /*         bit 4: IsPIC                    (tag 38)                    */
    /*         bit 5: NoSpillsConstraint       (tag 39)                   */
    /* 0x15 */ uint8_t  _pad15[3];
    /* 0x18 */ uint8_t  multi_view_options[48]; /* Tag 204 blob            */
    /* 0x48 */ uint32_t vertex_mode;  /* Tag 6                             */
    /* 0x4C */ uint8_t  _pad4c[4];
    /* 0x50 */ uint32_t max_rregs;    /* Tag 19                            */
    /* 0x54 */ uint32_t sched_reg_target; /* Tag 20                        */
    /* 0x58 */ uint32_t unroll_control; /* Tag 21                          */
    /* 0x5C */ uint8_t  _pad5c[4];
    /* 0x60 */ uint8_t  mem_win_cbank[24];  /* Tag 201 blob                */
    /* 0x78 */ uint8_t  mem_win_local[24];  /* Tag 202 blob                */
    /* 0x90 */ uint8_t  mem_win_shared[40]; /* Tag 203 blob                */
    /* 0xB8 */ uint8_t  _padb8[12];
    /* 0xC4 */ uint8_t  accelerated_arch; /* Tag 22                        */
    /* 0xC5 */ uint8_t  std_elf;      /* Tag 23                            */
    /* 0xC6 */ uint8_t  _padc6[2];
    /* 0xC8 */ uint8_t  fast_math[8]; /* Tags 8-17,26,31,33 bitfields     */
    /* 0xD0 */ uint8_t  _padd0[8];
    /* 0xD8 */ uint32_t max_rregs_2;  /* Tag 24                            */
    /* 0xDC */ uint32_t sched_reg_2;  /* Tag 25                            */
    /* 0xE0 */ uint32_t unroll_ctl_2; /* Tag 30                            */
    /* 0xE4 */ uint32_t compress_algo_id; /* Tag 99                        */
    /* 0xE8 */ uint8_t  omega_ptx_err; /* Tag 32                           */
    /* 0xE9 */ uint8_t  std_elf_2;    /* Tag 34                            */
    /* 0xEA */ uint8_t  _padea[2];
    /* 0xEC */ uint32_t shader_cg_sel; /* Tag 35                           */
    /* 0xF0 */ uint8_t  fdl_bit;      /* Tag 36                            */
    /* 0xF1 */ uint8_t  _padf1[3];
    /* 0xF4 */ uint32_t fdl_insert_mode; /* Tag 37                         */
};
/* sizeof(NvvmContainerHeader) == 248 (0xF8) */

The Options pointer is stored at offset 208 (0xD0) of the container header during deserialization -- the container header acts as both a data holder and an index into the full Options struct.

Options Struct (440 bytes)

The full compiler options structure is allocated separately and linked from the container header. It is parsed by NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes) in the XML path, or populated field-by-field from tags in the binary path.

struct NvvmOptions {                   /* 440 bytes total                  */
    /* +0   */ uint32_t arch_variant;  /* ArchVariant enum                 */
    /* +4   */ uint32_t compile_mode;  /* CompileMode enum                 */
    /* +8   */ uint32_t opt_level;     /* OptLevel enum                    */
    /* +12  */ uint32_t debug_info;    /* DebugInfo enum                   */
    /* +16  */ uint32_t client_version;
    /* +20  */ uint8_t  flags_20;      /* Packed booleans: 6 bits          */
    /*          bit 0: ReserveLocalAddressZero                             */
    /*          bit 1: ForceImmediateConstants                             */
    /*          bit 2: HideFunctions                                       */
    /*          bit 3: UseDX10AddressInRange                               */
    /*          bit 4: IsPIC                                               */
    /*          bit 5: NoSpillsConstraint                                  */
    /* +21  */ uint8_t  _pad21[3];
    /* +24  */ uint8_t  multi_view[48]; /* MultiViewOptions sub-structure  */
    /* +72  */ uint32_t vertex_mode;    /* VertexMode enum                 */
    /* +76  */ uint8_t  _pad76[4];
    /* +80  */ uint8_t  dci_info[120];  /* DCIInfo sub-structure           */
    /* +200 */ uint8_t  fast_math_byte0; /* FastMath bits 0-7              */
    /* +201 */ uint8_t  fast_math_byte1; /* FastMath bits 8-12             */
    /* +202 */ uint8_t  _pad202[2];
    /* +204 */ uint32_t fast_math_reserved;
    /* +208 */ uint8_t  _pad208[8];
    /* +216 */ uint32_t max_rregs_allowed;
    /* +220 */ uint32_t sched_reg_target;
    /* +224 */ uint32_t unroll_control;
    /* +228 */ uint32_t okey;           /* CompressAlgoId / OKey           */
    /* +232 */ uint8_t  accelerated_arch;
    /* +233 */ uint8_t  std_elf;
    /* +234 */ uint8_t  _pad234[2];
    /* +236 */ uint32_t shader_codegen_sel_mask;
    /* +240 */ uint8_t  omega_ptx_error_handling;
    /* +241 */ uint8_t  _pad241[3];
    /* +244 */ uint32_t fdl_insert_mode;
    /* +248 */ uint8_t  target_opts[192]; /* Extended target options (tags 101-173) */
};
/* sizeof(NvvmOptions) == 440 (0x1B8) */

DCIInfo Sub-Structure (Options +80, 120 bytes)

The Device-Code-Interface sub-structure at offset +80 contains the shader constant interface and constant bank reserved area configurations. Parsed by NvvmOptions_parse_shader_const_iface (0xCCEEA0, 8,355 bytes) and NvvmOptions_parse_cb_reserved_area (0xCCE780, 9,802 bytes).

ShaderConstIface XML fields (from sub_CCEEA0):

FieldTypeDescription
OptimizerConstBankint32Constant bank index used by the optimizer
DriverConstBankint32Constant bank index used by the driver
BindlessTextureBankint32Constant bank for bindless texture handles
LocalMemoryWindowstructMemory window config for local memory
SharedMemoryWindowstructMemory window config for shared memory
VectorizeAndRemapTLDboolEnable vectorization and TLD remapping
ELFControlsDCIboolELF controls DCI interface layout
DiscardDefaultValueOutputsboolDiscard outputs that match default values

CBReservedArea XML fields (from sub_CCE780):

FieldTypeDescription
ByteOffsetToEndOfReservedAreaint32End-of-reserved-area offset in constant bank
CbAddressBitsInReservedVABaseint32Address bits for reserved virtual address base
CbBankToReservedVABaseint32Constant bank index for reserved VA base
ForceHighLatencyConstExprboolForce high-latency constant expression evaluation
ReservedCbReadBankint32Reserved constant bank read bank index

MultiViewOptions Sub-Structure (Options +24, 48 bytes)

The multi-view rendering options sub-structure at offset +24 carries graphics pipeline multi-view configuration. Parsed by NvvmOptions_parse_multi_view (0xCD6D20, 12,188 bytes). Serialized as blob tag 204.

FieldTypeDescription
NumViewsint32Number of rendering views
NominalViewIDsint32[]Array of nominal view identifiers
PerViewRTIndexConstantsint32[]Per-view render target index constants
EnableViewInstanceMaskboolEnable per-view instance masking
ComputePerPatchAttribsForViewZeroboolCompute per-patch attributes for view 0
IsImplicitboolImplicit multi-view mode

CompileMode Enum

ValueNameMeaning
0NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABIWhole-program with ABI compliance
1NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABIWhole-program without ABI (internal)
2NVVM_COMPILE_MODE_SEPARATE_ABISeparate compilation (relocatable, --device-c)
3NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABIExtensible whole-program with ABI

OptLevel Enum

ValueName
0NVVM_OPT_LEVEL_NONE
1NVVM_OPT_LEVEL_1
2NVVM_OPT_LEVEL_2 (default)
3NVVM_OPT_LEVEL_3

DebugInfo Enum

ValueName
0NVVM_DEBUG_INFO_NONE (default)
1NVVM_DEBUG_INFO_LINE_INFO
2NVVM_DEBUG_INFO_DWARF

VertexMode Enum

ValueName
0NVVM_VERTEX_MODE_SINGLE
1NVVM_VERTEX_MODE_A
2NVVM_VERTEX_MODE_B
3NVVM_VERTEX_MODE_AB

FDLInsertMode Enum

ValueName
0NVVM_FDL_MODE_NONE
1NVVM_FDL_MODE_ALL
2NVVM_FDL_MODE_APP

ArchVariant Enum

The architecture enum uses a numeric encoding where the value equals major * 10 + minor for older architectures and major * 10 + minor (with 3-digit major) for Blackwell. There are two parallel enum spaces: "virtual" architecture variants (used for compute_XX targets) and "HW" variants (used for sm_XX real silicon targets). The virtual variants are serialized by name in the XML format via NvvmOptions_parse_arch_enum (0xCD09E0, 14,516 bytes).

Virtual Architecture Variants

Enum NameNumeric ValueGenerationSM
NVVM_ARCH_KEPLER_3_030Kepler3.0
NVVM_ARCH_KEPLER_3_232Kepler3.2
NVVM_ARCH_KEPLER_3_535Kepler3.5
NVVM_ARCH_KEPLER_3_737Kepler3.7
NVVM_ARCH_MAXWELL_5_050Maxwell5.0
NVVM_ARCH_MAXWELL_5_252Maxwell5.2
NVVM_ARCH_MAXWELL_5_353Maxwell5.3
NVVM_ARCH_PASCAL_6_060Pascal6.0
NVVM_ARCH_PASCAL_6_161Pascal6.1
NVVM_ARCH_PASCAL_6_262Pascal6.2
NVVM_ARCH_VOLTA_7_070Volta7.0
NVVM_ARCH_VOLTA_7_272Volta7.2
NVVM_ARCH_TURING_7_373Turing7.3
NVVM_ARCH_TURING_7_575Turing7.5
NVVM_ARCH_AMPERE_8_080Ampere8.0
NVVM_ARCH_AMPERE_8_282Ampere8.2
NVVM_ARCH_AMPERE_8_686Ampere8.6
NVVM_ARCH_AMPERE_8_787Ampere8.7
NVVM_ARCH_AMPERE_8_888Ampere8.8
NVVM_ARCH_ADA_8_989Ada Lovelace8.9
NVVM_ARCH_HOPPER_9_090Hopper9.0
NVVM_ARCH_BLACKWELL_10_0100Blackwell10.0
NVVM_ARCH_BLACKWELL_10_1101Blackwell10.1
NVVM_ARCH_BLACKWELL_10_3103Blackwell10.3
NVVM_ARCH_BLACKWELL_11_0110Blackwell (Jetson Thor)11.0
NVVM_ARCH_BLACKWELL_12_0120Blackwell (RTX 50xx / Pro)12.0
NVVM_ARCH_BLACKWELL_12_1121Blackwell (DGX Spark)12.1

Note: NVVM_ARCH_BLACKWELL_10_1 maps to __CUDA_ARCH 1010, while NVVM_ARCH_BLACKWELL_11_0 maps to __CUDA_ARCH 1100. Despite both being in the BLACKWELL family, they are distinct architectures with separate entries in the processor table. sm_110 (Jetson Thor) was originally designated sm_101 before being renumbered to its own 11.x line.

HW Architecture Variants

The HW variants use a major * 1000 + minor * 10 encoding for their internal numeric values. These map to real silicon rather than virtual compute capabilities:

Enum NameInternal ValueNotes
NVVM_ARCH_HW_SM_5_0500Maxwell HW baseline
......One entry per supported HW SM through 9.0
NVVM_ARCH_HW_SM_10_01000Blackwell datacenter
NVVM_ARCH_HW_SM_10_11010Blackwell Ultra (GB300)
NVVM_ARCH_HW_SM_10_31030Blackwell variant
NVVM_ARCH_HW_SM_10_41200Maps to SM 120 value -- not publicly documented

The HW_SM_10_4 = 1200 mapping is notable: SM 10.4 in the HW enum space corresponds to the SM 120 consumer architecture. This reveals that "SM 120" is internally considered a Blackwell 10.4 die variant, not a separate generation.

FastMathOptions Bitfields

The fast-math configuration occupies two bytes at Options offset +200 and +201, with an additional int32 at +204. Each bit independently controls one floating-point relaxation.

Byte +200 (tags 8--15)

  Bit 7   Bit 6   Bit 5   Bit 4   Bit 3   Bit 2   Bit 1   Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
|  Fmad | Fast  |  Ftz  |Reorder|Reorder|Ignore | Ignore|Ignore |
|       | Sqrt  |       | Half  | Float | Sign0 |  NaN  |  Inf  |
+-------+-------+-------+-------+-------+-------+-------+-------+
  tag 15  tag 14  tag 13  tag 12  tag 11  tag 10  tag 9   tag 8

Byte +201 (tags 16--17, 26, 31, 33)

  Bit 7   Bit 6   Bit 5   Bit 4   Bit 3   Bit 2   Bit 1   Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       | Lax   | No    |Reassoc|CanReor| Allow |
|       |       |       | FP16  | Float | Float |derDist| Rcp   |
|       |       |       | Div   | MAD   |AddMAD |ribute | Rsq   |
+-------+-------+-------+-------+-------+-------+-------+-------+
                          tag 33  tag 31  tag 26  tag 17  tag 16

FastMath Divide Sub-Enum

The Divide field within FastMathOptions is a nested enum serialized by name in the XML path:

ValueNameMeaning
0NVVM_FAST_MATH_DIVIDE_PRECISE_NO_FTZIEEE-compliant division, no flush-to-zero
1NVVM_FAST_MATH_DIVIDE_PRECISE_ALLOW_FTZIEEE division with FTZ permitted
2NVVM_FAST_MATH_DIVIDE_FULL_RANGE_APPROXFull-range approximation
3NVVM_FAST_MATH_DIVIDE_FAST_APPROXFast approximation (least precise)

These correspond to the nvcc flags -prec-div=1 (precise) and -prec-div=0 (fast), with FTZ interaction determined by -ftz.

Complete FastMath XML Field Inventory

The full set of XML field names parsed by NvvmOptions_parse_fast_math (0xCCF590, 12,771 bytes):

XML Field NameBinary TagTypeDescription
IgnoreInf8bitTreat infinities as NaN
IgnoreNaN9bitAssume no NaN values present
IgnoreSignedZero10bitIgnore sign of zero
ReorderFloat11bitAllow float reordering
ReorderHalf12bitAllow half-precision reordering
Ftz13bitFlush denormals to zero
FastSqrt14bitUse fast sqrt approximation
Fmad15bitAllow fused multiply-add
AllowRcpRsqToSqrt16bitAllow rcp(rsqrt(x)) to sqrt(x)
CanReorderFloatDistribute17bitAllow distributive reordering
ReassociateFloatAddOverMad26bitFloat add reassociation over MAD
NoFloatMAD31bitDisable float MAD formation
LaxFP16ApproximateDivision33bitLax FP16 approximate division
Divide--enumDivision precision sub-enum (above)

The Divide field is serialized as a nested enum element in XML; in the binary format it is encoded as part of the fast-math reserved int32 at Options +204 (tag 18).

Memory Window Configuration

Memory windows define how the compiler maps address spaces to hardware memory banks. Three window types are serialized as blobs via tags 201--203, parsed by NvvmOptions_parse_cbank_config (0xCCE4B0) and NvvmOptions_parse_memory_windows (0xCCE100).

MemoryWindow Type Enum

ValueNameMeaning
0NVVM_MEMORY_WINDOW_SPECIAL_REGISTERAccessed via special registers
1NVVM_MEMORY_WINDOW_CBANKConstant bank window
2NVVM_MEMORY_WINDOW_IMMEDIATEImmediate offset addressing

Window Entry Layout (8 bytes)

struct MemoryWindowEntry {
    uint32_t window_type;   /* MemoryWindow type enum      */
    uint32_t cbank;         /* Constant bank index         */
    /* The following are part of the containing blob: */
    /* uint32_t cbank_ofst_low;  -- lower bound of offset range */
    /* uint32_t cbank_ofst_hi;   -- upper bound of offset range */
};
  • Tag 201 (MemoryWindowCBank): 24 bytes = 3 entries of {window_type, cbank, low, hi} truncated to fit, or 3 x 8 bytes depending on sub-field packing.
  • Tag 202 (MemoryWindowLocal): 24 bytes, same structure.
  • Tag 203 (MemoryWindowShared): 40 bytes = 10 x uint32_t values encoding shared memory bank strides, offsets, and configuration flags.

Version Compatibility Logic

Version checking is the first operation performed on a container buffer, implemented in NvvmContainer_check_versions (0xCD41B0). The logic is conservative on major versions and lenient on minor versions:

1. Verify magic == 0x7F4E5C7D
   Fail: return NULL (not a container)

2. Version.Major must == 1
   Fail: "NvvmContainer major version N not compatible" → return NULL

3. Version.Minor compared to 0x41 (65)
   If container minor > tool minor:
     Warning: "Linked container's NvvmContainer minor version N newer than tool"
   Parse continues regardless.

4. NvvmIRVersion.Major must == 2
   Fail: "NvvmIR major version N not compatible" → return NULL

5. NvvmIRVersion.Minor compared to 0x62 (98)
   If container minor > tool minor: warning, parse continues.

6. NvvmDebugVersion.Major must == 3
   Fail: "NvvmDebug major version N not compatible" → return NULL

7. NvvmDebugVersion.Minor compared to 2
   If container minor > tool minor: warning, parse continues.

8. LlvmVersion (major*100 + minor) must be <= 2000
   Fail: "LLVM version N not compatible" → return NULL

A separate standalone validator (0xCCD5F0) adds a mode-dependent check: in binary dump mode (a5=0), the LLVM version must be exactly 20; in normal mode (a5=1), it must be <= 20.

The philosophy is clear: major version bumps signal breaking format changes and are hard failures. Minor version bumps add new tags but never change existing tag semantics -- the delta encoding and unknown-tag-skipping design ensures forward compatibility.

Current Version Constants (cicc v13.0)

FieldMajorMinor
Version (container format)10x41 (65)
NvvmIRVersion20x62 (98)
NvvmDebugVersion32
LlvmVersion200

XML Serialization Format

The XML path (NvvmContainer_parse_header at 0xCDCA30) uses NVIDIA's YAML-based serialization framework with virtual dispatch. The top-level XML document contains these elements:

<NvvmContainer>
  <Version major="1" minor="65"/>
  <NvvmIRVersion major="2" minor="98"/>
  <NvvmDebugVersion major="3" minor="2"/>
  <LlvmVersion major="20" minor="0"/>
  <IRLevel>NVVM_IR_LEVEL_UNIFIED_AFTER_DCI</IRLevel>
  <Options>
    <ArchVariant>NVVM_ARCH_ADA_8_9</ArchVariant>
    <CompileMode>NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI</CompileMode>
    <OptLevel>NVVM_OPT_LEVEL_2</OptLevel>
    <DebugInfo>NVVM_DEBUG_INFO_NONE</DebugInfo>
    <FastMathOptions>
      <Ftz>1</Ftz>
      <Fmad>1</Fmad>
      <Divide>NVVM_FAST_MATH_DIVIDE_FAST_APPROX</Divide>
      ...
    </FastMathOptions>
    <MaxRRegsAllowed>255</MaxRRegsAllowed>
    ...
  </Options>
  <IsBinary>1</IsBinary>
  <Module>... base64-encoded LLVM bitcode ...</Module>
</NvvmContainer>

All enum values are serialized by their full string names (e.g., NVVM_COMPILE_MODE_SEPARATE_ABI), not by numeric value. The XML format does not use delta encoding -- every field is written regardless of whether it matches the default, making XML containers significantly larger but human-readable.

Serialization Flow

The serializer (0xCDD2D0) has two modes controlled by parameter a3: binary (a3=1) and XML (a3=0).

Binary Serialization (a3=1)

 1. Compute version fields (use defaults if not set):
      Version        = {1, 0x41}
      NvvmIRVersion  = {2, 0x62}
      NvvmDebugVersion = {3, 2}
      LlvmVersion    = {20, 0}

 2. Allocate 248-byte NvvmContainerHeader (zeroed)
 3. Allocate 440-byte default Options struct
 4. Allocate two growable arrays:
      scalar_tags[]   -- int32 entries for tag/value pairs
      blob_data[]     -- byte entries for blob payloads

 5. For each field in current Options vs. default Options:
      If field differs:
        Scalar → sub_CD17A0(scalar_tags, tag_id, value)
        Blob   → sub_CD1AB0(blob_data, scalar_tags, tag_id, ptr, size)

 6. Optional IR compression:
      If a4 flag set:
        Compress LLVM bitcode via sub_C8D290
        Compute CRC via sub_CCD2B0 → store as tag 99
        Compress via sub_1688730(codec, data, size)

 7. Append terminator: tag 0 to scalar_tags
 8. Write 24-byte header (with computed ScalarFieldsEnd, BlobDataEnd)
 9. Write scalar_tags array
10. Write blob_data array
11. Write compressed or raw IR payload

Deserialization (0xCD1D80)

 1. Verify magic == 0x7F4E5C7D
 2. Allocate 248-byte NvvmContainerHeader
 3. Allocate 440-byte Options struct with defaults
 4. Store Options pointer at container header offset 208
 5. Compute tag_ptr = buffer + header_size  (from offset 0x0E)
 6. Compute blob_base = buffer + scalar_fields_end  (from offset 0x10)
 7. Enter switch loop:
      Read tag (int16), decode value (int16 or sentinel + int32)
      Switch on tag (103 unique case labels):
        Tags 1-39:    → write scalar to Options field
        Tag 99:       → store compression algo ID
        Tags 101-173: → write to extended target options
        Tags 201-218: → resolve blob offset, copy blob data
        Tags 301-309: → write to extended int32 fields
        Tags 351-353: → copy 8-byte blob to extended fields
        Tags 401-402: → conditionally parse structured blob
      Tag 0 → exit loop
 8. If tag 99 present: decompress IR payload
 9. Return container pointer

Annotated Hex Dump

A minimal container targeting SM 89 (Ada Lovelace) with default options (only SmMajor and SmMinor differ from defaults):

Offset  Hex                                        Decoded
------  -----------------------------------------  ---------------------------------
0x0000  7D 5C 4E 7F                                Magic: 0x7F4E5C7D
0x0004  01 41                                      Version: 1.65
0x0006  02 62                                      NvvmIRVersion: 2.98
0x0008  03 02                                      NvvmDebugVersion: 3.2
0x000A  14 00                                      LlvmVersion: 20.0
0x000C  00 00                                      IRLevel: 0 (UNIFIED_AFTER_DCI)
0x000E  18 00                                      HeaderSize: 24
0x0010  2C 00 00 00                                ScalarFieldsEnd: 44
0x0014  2C 00 00 00                                BlobDataEnd: 44 (no blobs)

--- Scalar tag/value region ---
0x0018  01 00 08 00                                Tag 1 (SmMajor) = 8
0x001C  02 00 09 00                                Tag 2 (SmMinor) = 9
0x0020  0D 00 01 00                                Tag 13 (Ftz) = 1
0x0024  0F 00 01 00                                Tag 15 (Fmad) = 1
0x0028  00 00                                      Terminator (tag 0)
0x002A  00 00                                      Padding to alignment

--- Blob data region ---
(empty -- ScalarFieldsEnd == BlobDataEnd)

--- IR payload follows at offset 0x002C ---
0x002C  DE C0 17 0B ...                            LLVM bitcode (0xDEC0170B magic)

This example shows the efficiency of delta encoding: only 4 tag/value pairs (16 bytes of tags) plus the 24-byte header produce a fully-specified container. All other fields (CompileMode, OptLevel, DebugInfo, all target options) inherit their defaults during deserialization.

A container with a 32-bit value would look like:

0x00XX  13 00 FF FF  00 04 00 00                   Tag 19 (MaxRRegsAllowed) = 1024
                                                   (0xFFFF sentinel, then 0x0400 LE)

Pipeline Integration

The container serves as the inter-stage transport format within the cicc compilation pipeline. Two entry paths exist:

PathEntry FunctionAddressPipeline
Path A (LibNVVM)nvvmCompileProgram dispatcher0x9047E03-phase: LNK -> OPT -> LLC
Path B (standalone)cicc_main orchestrator0x12642A04-stage: LNK -> OPT -> OPTIXIR -> LLC

Both paths deserialize the container at phase 1, then translate Options into per-stage compiler flags:

  • SmMajor / SmMinor from tags 1--2 become -mcpu=sm_XX
  • FastMath.Ftz from tag 13 becomes -nvptx-f32ftz
  • FastMath.Fmad from tag 15 becomes the IEEE mode flag
  • OptLevel becomes -nvptx-opt-level=N
  • CompileMode == 2 (SEPARATE_ABI) adds --device-c
  • IRLevel == 1 (LTO) enters the LTO pipeline with partially-optimized bitcode
  • IRLevel == 2 (OPTIX) activates the OptiX IR stage (bit 6 of pipeline bitmask) and disables LICM and IP-MSP

The container format is the single source of truth for all compilation parameters. When cicc is invoked by nvcc, the driver serializes its accumulated flags into a container, passes the container as input, and cicc deserializes it back into compiler options. This round-trip through binary serialization ensures that all pipeline stages see exactly the same configuration, eliminating the flag-parsing divergence that would otherwise arise from each stage having its own CLI parser.

YAML Serialization Framework

The XML/YAML path uses a generic serialization framework built on a bundled YAML parser/emitter library (Cluster A: 0xCB0000--0xCBFA60). The library provides:

FunctionAddressRole
yaml_parser_main0xCB9640Top-level YAML parser (25,873 bytes)
yaml_emitter_main_loop0xCBDA10Main YAML emitter loop (23,583 bytes)
yaml_scanner_scan_tokens0xCB7E40Token scanner (17,924 bytes)
yaml_parser_parse_flow0xCB8C00Flow-style parsing (15,188 bytes)
yaml_parser_load_document0xCBA570Document loader/resolver (9,695 bytes)

The serialization framework uses virtual dispatch: each serializable type registers a serialize/deserialize function pair, and the framework dispatches based on the YAML node type (scalar=1, sequence, mapping). All enum values are serialized by their full string names (NVVM_COMPILE_MODE_SEPARATE_ABI, NVVM_ARCH_ADA_8_9, etc.), not by numeric value.

Finalizer Knobs Integration

The container Options struct also feeds into the NVIDIA finalizer knobs system through NvvmOptions_parse_finalizer_knobs (0xCD9990, 31,702 bytes -- the 7th largest function in the binary). This parser ingests the complete set of NVIDIA-specific backend configuration knobs:

  • Shader pipeline controls: PromoteHalf, PromoteFixed, USePIXBAR, VSIsVREnabled, VSIsLastVTGStage
  • Codegen controls: DisablePredication, DisableXBlockSched, EnableJumpTable, ScheduleKils
  • Memory controls: DoMMACoalescing, AssumeConvertMemoryToRegProfitable
  • Barrier controls: DisableERRBARAfterMEMBAR, GenConvBranchForWarpSync
  • PGO controls: PGOEpoch, PGOBatchSize, PGOCounterMemBaseVAIndex
  • Per-CTA controls: CTASizeX, CTASizeY, CTASizeZ, SharedMemorySize, SMemScratchBase
  • Register controls: MaxActiveWarpsPerSM, NumReservedUReg, NumScratchURegs

These knobs are distinct from the NVVMPassOptions system (see NVVMPassOptions) -- the finalizer knobs configure the backend code generator, while NVVMPassOptions configure the optimization pipeline.

Tag Summary Statistics

RangeCountDescription
1--3938Core scalar options (SM version, fast-math, unroll, flags)
991Compression metadata
101--17373Extended target options (hardware capabilities, memory config)
201--21818Blob data (memory windows, resource tables, strings)
301--3099Extended int32 fields (cluster config, extended options)
351--3533Extended int64 blob references
401--4022Structured conditional blobs (TMA / TCGen05)
Total144Distinct tag IDs across 6 ranges

The deserializer switch statement has 103 unique case labels -- the remaining 41 tags share code paths with other tags (e.g., all single-bit tags in a byte share a case that reads the bit position from a secondary table).

Cross-References