Entry Point & CLI
The cicc binary has a surprisingly complex entry point. Rather than a straightforward main → compile → exit flow, it implements a dual-path architecture where the same binary can operate as either a LibNVVM-based compiler (Path A) or a standalone compiler (Path B), selected at runtime through environment variables and obfuscated string comparisons. This design allows NVIDIA to ship a single binary that serves both the nvcc toolchain and the LibNVVM API.
The entry point region (0x8F0000–0x96FFFF, ~520 KB) handles CLI parsing, architecture detection with a 3-column flag fan-out system, and dispatch into one of several compilation pipelines. A hidden "wizard mode" gated behind an environment variable with a magic number enables developer diagnostics that are otherwise completely inaccessible.
| Item | Value |
|---|---|
| main() thunk | 0x4396A0 (16 bytes) — return sub_8F9C90(argc, argv, envp) |
| Real main | sub_8F9C90 (10,066 bytes, 1,990 lines) |
| Wizard mode | getenv("NVVMCCWIZ") == 553282 → byte_4F6D280 = 1 |
| Default arch | compute_75 / sm_75 (Turing) |
| Flag catalog | sub_9624D0 (75KB, 2,626 lines, 4 output vectors) |
| Architecture map | sub_95EB40 (38KB, 23 architectures, 3-column fan-out) |
| Flag translation | sub_8FE280 (red-black tree at qword_4F6D2A0, 40+ nvcc→cicc mappings) |
| Pipeline stages | LNK → OPT → [OPTIXIR] → LLC |
| Dual path | Path A (sub_905EE0) / Path B (sub_1265970) |
| Libdevice | Path A: unk_3EA0080 / Path B: unk_420FD80 (455,876 bytes each) |
| Arch bitmask | 0x60081200F821 (validates SM 75–121) |
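The arch bitmask row can be read as one bit per supported SM. The sketch below assumes that bit i of the mask stands for SM 75 + i. The source gives only the mask value and the 75–121 range, so this bit layout and the helper name sm_in_mask are assumptions made for illustration.

```c
#include <stdint.h>

/* Mask value taken from the summary table. */
#define ARCH_MASK 0x60081200F821ULL

/* Assumption: bit i of the mask corresponds to SM (75 + i). */
int sm_in_mask(unsigned sm)
{
    if (sm < 75 || sm > 121)
        return 0;                          /* outside the 47-bit window */
    return (int)((ARCH_MASK >> (sm - 75)) & 1u);
}
```

Under that assumption the set bits decode to SM 75, 80, 86, 87, 88, 89, 90, 100, 103, 110, 120 and 121, which fits the stated 75–121 range.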
Architecture
main (0x4396A0, 16B thunk)
│
└─ sub_8F9C90 (10KB, REAL MAIN)
│
├─ getenv("NVVMCCWIZ") == 553282 → wizard mode
├─ sub_16C5290: extract program name from argv[0]
│
├─ ARGUMENT LOOP (v15 = 1..argc)
│ ├─ -o <file> → v257 (output)
│ ├─ -nvvmir-library <path> → v256 (libdevice)
│ ├─ -lgenfe/-libnvvm/-lnk/-opt/-llc → v263 (mode)
│ ├─ -arch/-mcpu/--nv_arch → v242 (SM number)
│ ├─ --emit-optix-ir → v243=1, v258=1
│ ├─ -nvc → v258=1
│ ├─ -irversion → print IR version, exit
│ ├─ .bc/.ci/.i/.ii/.cup/.optixir → s (input file)
│ └─ obfuscated option → v253 (0 or 1)
│
├─ v253 RESOLUTION (if still == 2)
│ └─ getenv(obfuscated) → compare → set v253 = 0 or 1
│
├─ DISPATCH (v263 × v253)
│ ├─ v263==0, v253==1 → sub_902D10 (simple Path A)
│ ├─ v263==0, v253==0 → sub_1262860 (simple Path B)
│ ├─ v263==1 → sub_905E50 / sub_12658E0 (lgenfe)
│ ├─ v263≥2, v253==1 → sub_905EE0 (multi-stage Path A)
│ └─ v263≥2, v253==0 → sub_1265970 (multi-stage Path B)
│
└─ CLEANUP: free all vectors, strings, argv copy
Real Main — sub_8F9C90
The exported main() at 0x4396A0 is a 16-byte thunk that immediately tail-calls sub_8F9C90 — the actual entry point. This function is a monolithic CLI parser and dispatcher: it copies argv into a local buffer, checks for wizard mode, iterates over all arguments accumulating state in ~12 local variables, resolves the compilation path, and finally dispatches to the appropriate pipeline function. The entire function is a single 10KB basic-block-heavy control flow graph with ~80 branch targets.
| Field | Value |
|---|---|
| Address | 0x8F9C90–0x8FC3E2 |
| Size | 10,066 bytes |
| Stack frame | 0x978 bytes (2,424 bytes) |
| Local buffers | v284[2096] for argv copy (stack if argc ≤ 256, else heap) |
Argument Handling and Argv Copy
The function begins with a defensive copy of argv into a local buffer. When 8 * argc fits within 0x800 bytes (argc ≤ 256), the copy lives in v284[2096] on the stack. For larger argument lists -- which can occur during complex nvcc invocations with many pass-through flags -- it allocates heap memory via sub_16CD150. This copy is necessary because the argument loop modifies pointers (advancing i to skip flag values), and the caller's argv must not be disturbed.
if (8 * argc > 0x800)
v284 = sub_16CD150(8 * argc); // heap alloc for large argc
// else use stack buffer v284[2096]
memcpy(v284, argv, 8 * argc); // copy all pointers
After copying, sub_16C5290 extracts the base program name from argv[0] -- stripping directory prefixes -- and stores it in dest. This name appears in error messages and verbose output throughout the pipeline.
Key Local Variables
The function's behavior is controlled by two critical dispatch variables: v253 (which compilation backend to use) and v263 (which phase of the pipeline to invoke). These are accumulated during the argument loop and combined after parsing to select one of ~10 possible code paths. The interaction between them creates a matrix of behaviors that covers everything from simple single-file compilation to multi-stage LibNVVM pipeline processing.
| Variable | Init | Purpose |
|---|---|---|
| v253 | 2 | Dispatch mode: 0=Path B, 1=Path A, 2=default (needs env resolution) |
| v263 | 0 | Invocation mode: 0=default, 1=lgenfe, 2=libnvvm, 3=lnk, 4=opt, 6=llc |
| v242 | 0 | Target architecture (SM number) |
| v258 | 0 | NVC flag |
| v243 | 0 | OptiX IR flag |
| v259 | 0 | Verbose (only effective in wizard mode) |
| v261 | 0 | Dryrun |
| v262 | 0 | Keep intermediates (only effective in wizard mode) |
| s | NULL | Input file path |
| v257 | NULL | Output file path |
| v256 | NULL | NVVM IR library path |
| v266 | vector | Pass-through options vector |
Wizard Mode
v10 = getenv("NVVMCCWIZ"); // 0x8F9D36
if (v10 && strtol(v10, NULL, 10) == 553282) // 0x8F9D92
byte_4F6D280 = 1;
Global byte_4F6D280 gates the effectiveness of -v, -keep, and -dryrun. Without wizard mode these flags are silently ignored: v259 and v262 stay 0. This looks like a deliberate anti-reverse-engineering measure, since discovering the -v flag is not enough without the magic environment variable. The magic number 553282 (0x87142) appears to be arbitrary.
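A minimal sketch of the value check in isolation, assuming the decompiled strtol call is the whole test. The helper name wizard_value_matches is hypothetical; the caller would pass the result of getenv("NVVMCCWIZ").

```c
#include <stdlib.h>

/* Returns 1 when an NVVMCCWIZ value would enable wizard mode. */
int wizard_value_matches(const char *value)
{
    /* Same test as the decompiled check: base-10 strtol, end pointer ignored. */
    return value != NULL && strtol(value, NULL, 10) == 553282;
}
```

Because the end pointer is ignored, strtol accepts leading whitespace and stops at the first non-digit, so values such as "553282xyz" would pass this check as well. The hex spelling "0x87142" would not, since base 10 parses only the leading "0".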
Invocation Modes (v263)
The v263 variable determines which stage of the compilation pipeline cicc enters. When nvcc invokes cicc directly, v263 stays at 0 (default). But cicc can also be invoked in sub-pipeline mode — for example, -lnk runs only the linking phase, -opt runs only the optimizer, and -llc runs only code generation. This is how the multi-stage pipeline works: the outer driver calls cicc multiple times with different -lXXX flags, or a single invocation with -libnvvm runs all stages internally.
Each mode has its own format for the -discard-value-names flag, which tells the LLVM backend whether to strip IR value names (reducing memory usage). The different formats exist because each sub-pipeline stage has its own option namespace:
| v263 | Flag | Mode | discard-value-names format |
|---|---|---|---|
| 0 | (none) | Default (nvcc invocation) | -discard-value-names |
| 1 | -lgenfe | EDG frontend linkage | --discard_value_names=1 (underscores) |
| 2 | -libnvvm | LibNVVM API | -discard-value-names=1 (dashes) |
| 3 | -lnk | Linker | -lnk-discard-value-names=1 |
| 4 | -opt | Optimizer | -opt-discard-value-names=1 |
| 5 | (internal) | Undocumented (sets v278 high byte) | — |
| 6 | -llc | Standalone LLVM codegen | — |
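The table can be expressed as a lookup from the v263 mode to the flag spelling. This is a sketch under the assumption that the spelling depends on v263 alone; the function name is hypothetical and the strings are copied from the table.

```c
#include <stddef.h>

/* Returns the discard-value-names spelling for a v263 mode,
   or NULL for the modes the table lists without one (5 and 6). */
const char *discard_value_names_flag(int mode)
{
    switch (mode) {
    case 0:  return "-discard-value-names";
    case 1:  return "--discard_value_names=1";    /* underscores */
    case 2:  return "-discard-value-names=1";     /* dashes */
    case 3:  return "-lnk-discard-value-names=1";
    case 4:  return "-opt-discard-value-names=1";
    default: return NULL;
    }
}
```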
Input File Extensions
Input files are identified by extension during the argument loop. The last matching file wins (s is overwritten each time). Unrecognized arguments are added to the v266 pass-through vector and forwarded to sub-pipelines. The .cup extension has a special restriction — it's only accepted when the preceding argument is --orig_src_path_name or --orig_src_file_name, which are metadata flags inserted by nvcc to track the original source file.
| Extension | Format | Condition |
|---|---|---|
| .bc | LLVM bitcode | Always accepted |
| .ci | CUDA intermediate (preprocessed) | Always accepted |
| .i | Preprocessed C/C++ | Always accepted |
| .ii | Preprocessed C++ | Always accepted |
| .cup | CUDA source | Only after --orig_src_path_name or --orig_src_file_name |
| .optixir | OptiX IR | Always accepted |
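The extension rules, including the .cup restriction, can be sketched as a predicate over the current and preceding argument. The suffix comparison and the function names are assumptions; the binary may match extensions differently.

```c
#include <string.h>

static int has_suffix(const char *s, const char *suffix)
{
    size_t n = strlen(s), m = strlen(suffix);
    return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* arg: current argument; prev: preceding argument, or NULL for the first. */
int is_input_file(const char *arg, const char *prev)
{
    static const char *const always[] = { ".bc", ".ci", ".i", ".ii", ".optixir" };
    for (size_t k = 0; k < sizeof always / sizeof always[0]; k++)
        if (has_suffix(arg, always[k]))
            return 1;
    /* .cup counts only right after one of the two nvcc metadata flags. */
    if (has_suffix(arg, ".cup"))
        return prev != NULL &&
               (strcmp(prev, "--orig_src_path_name") == 0 ||
                strcmp(prev, "--orig_src_file_name") == 0);
    return 0;
}
```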
Obfuscated Strings
At 0x8F98A0, sub_8F98A0 decrypts strings using an XOR + ROT13-like cipher:
v40 = v37 ^ (-109 * ((offset + 97) ^ 0xC5));
// then ROT13 on alphabetic characters
This hides an environment variable name and option prefix from static analysis. The decrypted strings control the v253 (Path A vs Path B) resolution when no explicit mode is specified.
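A sketch of the two-step decoding, written so it can be tried on sample data. It assumes that offset is the zero-based byte index within the string, that v37 is the encrypted byte, and that the key expression is truncated to 8 bits; none of this is confirmed beyond the single decompiled line. The function names are hypothetical, and the encoder is simply the algebraic inverse added for round-trip experiments.

```c
#include <stddef.h>
#include <stdint.h>

/* ROT13 on ASCII letters; every other byte passes through unchanged. */
static uint8_t rot13(uint8_t c)
{
    if (c >= 'a' && c <= 'z') return (uint8_t)('a' + (c - 'a' + 13) % 26);
    if (c >= 'A' && c <= 'Z') return (uint8_t)('A' + (c - 'A' + 13) % 26);
    return c;
}

/* Key byte: -109 * ((offset + 97) ^ 0xC5), reduced modulo 256.
   Unsigned arithmetic keeps the wraparound well defined. */
static uint8_t key_at(size_t offset)
{
    unsigned x = (unsigned)((offset + 97) ^ 0xC5);
    return (uint8_t)(0u - 109u * x);
}

/* Decode: XOR with the per-position key, then ROT13. */
void cicc_decode_sketch(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = rot13((uint8_t)(in[i] ^ key_at(i)));
}

/* Inverse of the above: ROT13 first, then the same XOR. */
void cicc_encode_sketch(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(rot13(in[i]) ^ key_at(i));
}
```

Lengths are passed explicitly because an encoded byte may be zero, so the encrypted form cannot be treated as a NUL-terminated string.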
Error Messages
| Message | Condition | Address |
|---|---|---|
| "Missing output file\n" | -o with no next argument | 0x8FA365 |
| "Missing NVVM IR library file\n" | -nvvmir-library with no next arg | 0x8FAB34 |
| "Unparseable architecture: " + value | Invalid arch string | Multiple |
| "Missing input file\n" | No recognized input file | 0x8FBEAD |
| "Recognized input file extensions are: .bc .ci .i .cup .optixir" | After missing input | 0x8FBE97 |
| "Error: Output file was not specified (See -o option).\n" | Multi-stage without -o | 0x8FB655 |
The v253 Dispatch Variable
The v253 variable is the single most important dispatch control in the entire entry point. It determines whether the compilation uses Path A (the EDG/PTX-producing pipeline) or Path B (the standalone LLVM-based pipeline). Understanding its resolution logic is essential to reproducing cicc's behavior.
Initialization and Explicit Setting
v253 begins at 2 (unresolved default). During the argument loop, obfuscated string matching can set it directly:
| Source | Value | Meaning |
|---|---|---|
| Initial default | 2 | Needs environment variable resolution |
| Obfuscated option suffix matches byte_3C23AC3 | 1 | Path A explicitly requested |
| Obfuscated option suffix matches byte_3C23AB4 | 0 | Path B explicitly requested |
Environment Variable Resolution
When v253 remains at 2 after argument parsing (the common case), cicc resolves it through the obfuscated environment variable NV_NVVM_VERSION (decrypted from byte_3C23A9F). A recognized value selects the path directly; otherwise the default depends on the target architecture and the -nvc flag:
if (v253 == 2) {
env = getenv(decrypt(byte_3C23A9F)); // NV_NVVM_VERSION
if (env matches decrypt(byte_3C23A82)) // "nvvm-latest"
v253 = 1; // Path A
else if (env matches decrypt(byte_3C23A7B)) // "nvvm70"
v253 = 0; // Path B
else if (v242 > 99 && !v258) // SM >= 100, not -nvc
v253 = 0; // Path B (new architectures default to standalone)
else
v253 = 1; // Path A (legacy default)
}
The architectural threshold at SM 100 (Blackwell) is notable: for SM < 100, the default is Path A (the EDG frontend path). For SM >= 100, unless the -nvc flag is present, the default switches to Path B. This suggests NVIDIA is migrating newer architectures toward the standalone LLVM pipeline, possibly as a precursor to eventually deprecating the EDG-based path.
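The resolution rule can be restated as a small pure function, which makes the SM 100 threshold easy to check. This is a sketch: exact string equality for "env matches" is an assumption, the function name is hypothetical, and the environment lookup is left to the caller.

```c
#include <string.h>

/* env: value of the version variable, or NULL when unset.
   sm:  numeric SM (v242); nvc: the -nvc flag (v258).
   Returns the resolved v253: 1 = Path A, 0 = Path B. */
int resolve_v253(const char *env, int sm, int nvc)
{
    if (env && strcmp(env, "nvvm-latest") == 0) return 1;  /* Path A */
    if (env && strcmp(env, "nvvm70") == 0)      return 0;  /* Path B */
    if (sm > 99 && !nvc)                        return 0;  /* new-arch default */
    return 1;                                              /* legacy default */
}
```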
Version Strings Injected per Path
After v253 is resolved and for multi-stage modes (v263 >= 3), the entry point injects a version string into the pass-through options:
| v253 | Injected string | Semantics |
|---|---|---|
| 1 (Path A) | "-nvvm-version=nvvm-latest" (25 bytes from xmmword_3C23BC0) | Targets the latest NVVM IR specification |
| 0 (Path B) | "-nvvm-version=nvvm70" (20 bytes) | Targets NVVM 7.0 IR (frozen at LLVM 7.0.1 bitcode format) |
This version string propagates through the entire pipeline, controlling bitcode compatibility, intrinsic name resolution, and metadata format expectations.
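As a quick check on the byte counts in the table, the two strings can be returned from a helper and measured. The helper name is hypothetical; the lengths exclude the terminating NUL.

```c
#include <string.h>

/* Returns the version argument injected for a resolved v253 value. */
const char *nvvm_version_arg(int v253)
{
    return v253 == 1 ? "-nvvm-version=nvvm-latest"   /* Path A */
                     : "-nvvm-version=nvvm70";       /* Path B */
}
```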
Post-Parse Dispatch Logic
After the argument loop terminates, the dispatch logic combines v253 and v263 to select the target function. The combined keep-and-verbose flag v260 = v262 & v259 is also computed -- both wizard-mode flags must be active for intermediate file retention and verbose logging to function simultaneously.
Simple Dispatch (v263 == 0)
When cicc is invoked without any -lXXX mode flag (the standard nvcc invocation path):
if (v253 == 1)
v8 = sub_902D10(dest, 0, &v266, s, v257, v256, v260, v262, v261);
// Path A: CLI → lgenfe → LibNVVM pipeline
else
v8 = sub_1262860(dest, 0, &v266, s, v257, v256, v260, v262, v261);
// Path B: CLI → standalone LLVM pipeline
Both functions take the same parameters: program name, zero (unused), pass-through options, input file, output file, libdevice path, verbose+keep, keep, and dryrun. The return value becomes the process exit code.
lgenfe Dispatch (v263 == 1)
The -lgenfe mode builds a full argv-style array with the program name as the first entry, followed by all v266 pass-through options. This argv is then passed to one of two function pairs:
| v253 | Init function | Pipeline function |
|---|---|---|
| 1 (Path A) | sub_B6EEA0 (LLVMContext + metadata kind registration) | sub_905880 (EDG lgenfe) |
| 0 (Path B) | sub_1602D10 (standalone context initialization) | sub_1265340 (standalone lgenfe) |
The init functions create the LLVM context and register the 42+ metadata kinds used throughout the pipeline (dbg, tbaa, prof, noalias, etc.). These must be registered before any IR construction begins.
Multi-Stage Dispatch (v263 >= 2)
For -libnvvm, -lnk, -opt, and -llc modes, the dispatch constructs a CompilationState structure with input/output strings, extra arguments, and the v278 mode byte, then calls:
| v253 | Function | Size | Role |
|---|---|---|---|
| 1 | sub_905EE0 | 43 KB | Path A multi-stage pipeline driver |
| 0 | sub_1265970 | 48 KB | Path B multi-stage pipeline driver |
For -libnvvm (v263 == 2), the extra args are taken directly from v266 without prepending the program name. For -lnk/-opt/-llc (v263 >= 3), the appropriate version string (nvvm-latest or nvvm70) is appended to the pass-through options before dispatch.
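The three dispatch cases can be summarized as a function of the two variables. This is a sketch that returns function names as strings rather than calling anything; the lgenfe row uses the pipeline functions from the lgenfe table above, and the helper name is hypothetical.

```c
#include <stddef.h>

/* Returns the name of the function selected for a given mode and path.
   v253 must already be resolved to 0 (Path B) or 1 (Path A). */
const char *select_pipeline(int v263, int v253)
{
    if (v263 == 0)   /* simple dispatch */
        return v253 == 1 ? "sub_902D10" : "sub_1262860";
    if (v263 == 1)   /* lgenfe dispatch */
        return v253 == 1 ? "sub_905880" : "sub_1265340";
    /* multi-stage dispatch: -libnvvm, -lnk, -opt, -llc */
    return v253 == 1 ? "sub_905EE0" : "sub_1265970";
}
```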
Cleanup
After the pipeline function returns, sub_8F9C90 performs deterministic cleanup in reverse allocation order: the v281 extra-argument char** array and each entry, the v275 output string, the s2 input string, each element of the v266 pass-through vector, the vector's backing buffer, the dest program name, and the v282 argv copy buffer (if heap-allocated). The return value v8 is 0 on success, 1 on argument errors, or the pipeline function's return code (stored in v264).
Path A — EDG → LibNVVM Pipeline
Path A is the full CUDA C++ compilation path. It starts with the EDG 6.6 C++ frontend parsing CUDA source code into an IL tree, then converts that IL into LLVM IR via the lgenfe (LLVM Generation Front End) stage, and finally runs the LibNVVM pipeline to optimize and lower the IR to PTX. This is the path taken when cicc is invoked by nvcc for .cu file compilation, and it represents the standard CUDA compilation flow that most users encounter.
Path A Orchestrator — sub_902D10
The orchestrator is a 9 KB function that sequences the three major stages of Path A compilation. It acts as the conductor between the CLI processing layer, the EDG frontend, and the LibNVVM optimizer/codegen.
| Field | Value |
|---|---|
| Address | 0x902D10 |
| Size | ~9 KB |
| Timer | Creates 8-byte timer via sub_22077B0 → sub_B6EEA0 |
Execution flow:
- Timer creation. Allocates and initializes an 8-byte timing context. The sub_B6EEA0 init function also registers the 42+ LLVM metadata kinds (dbg=1, tbaa=2, prof=3, ... noalias.addrspace=42) that all subsequent IR construction depends on. This is why the timer creation happens first: the metadata registration is a side effect of context initialization.
- CLI processing. Calls sub_900130 (39 KB) to parse the accumulated CLI flags into structured forms: command buffer v58, emit-llvm-bc flag v52, architecture compute/SM numbers v55/v56, and file paths. On failure: "Error processing command line: <cmd>\n".
- Include path setup. If an input file is present (v64), calls sub_C98ED0 to configure system and user include paths for the EDG frontend.
- EDG frontend (lgenfe). Calls sub_905880 with timer name "CUDA C++ Front-End". This stage:
  - Allocates an 880-byte module object via sub_BA8740
  - Processes lgenfe CLI options from the options struct
  - In dryrun mode: skips execution, frees the module, returns null
  - On success: returns a module pointer and sets the output path
- LibNVVM pipeline. If lgenfe succeeds (module pointer is non-null), calls sub_905EE0 with the module for the full optimization and codegen pipeline.
- Time profiler output. After pipeline completion, checks sub_C96F30() for active profiling. If profiling is enabled, writes timing data to the output file via sub_C9C600. Failure emits: "Error: Failed to write time profiler data.\n".
- Cleanup. Frees the timer (sub_B6E710), option strings, and option arrays.
EDG Frontend Stage — sub_905880
The lgenfe stage bridges the EDG 6.6 C++ frontend to LLVM IR generation. This is where CUDA C++ source code becomes NVVM IR.
| Field | Value |
|---|---|
| Address | 0x905880 |
| Size | ~6 KB |
| Timer label | "CUDA C++ Front-End" |
| Module size | 880 bytes (allocated by sub_BA8740) |
The function reconstructs a verbose command line for diagnostic output (quoting paths for --orig_src_file_name, --orig_src_path_name, --compiler_bindir, --sdk_dir), builds an argument array, and calls sub_908750(numArgs, argArray, opt_level) to create the LLVM module. On success, it copies the output path into the module at offset 21*8 and, if the keep flag is set via a3->byte[66], calls sub_905860 to write intermediate files.
The actual EDG parsing and IL-to-IR conversion happens inside sub_908750, which eventually calls sub_617BD0 — the lgenfe_main function documented in the EDG Frontend page.
EDG Module Binding — sub_908850
After the EDG frontend produces its IL tree, sub_908850 (10 KB) bridges the output to the LLVM backend. This function performs the critical step of configuring the LLVM module's data layout and target triple based on the target architecture.
Data layout strings are selected based on unk_4F06A68 (address space width):
| Width | p3 flag | Data layout string |
|---|---|---|
| 8 (64-bit) | unk_4D0461C set | "e-p:64:64:64-p3:32:32:32-i1:8:8-..." (167 chars) |
| 8 (64-bit) | Not set | "e-p:64:64:64-i1:8:8-..." (155 chars) |
| 4 (32-bit) | — | "e-p:32:32:32-i1:8:8-..." (155 chars) |
The p3:32:32:32 component enables 32-bit pointers in address space 3 (shared memory), which is critical for SM architectures where shared memory accesses use 32-bit addressing even in 64-bit compilation mode.
Target triple is set to "nvptx64-nvidia-cuda" for 64-bit or "nvptx-nvidia-cuda" for 32-bit. The function also:
- Creates a 496-byte target info structure via sub_AE3F70
- Iterates global function declarations, marking device functions for compilation via sub_91CA00
- Iterates global variables, processing initializers for device-side storage via sub_9172F0
- Runs LLVM module verification via sub_B89FE0 -- on failure: "there was an error in verifying the lgenfe output!"
- Stores the module globally at unk_4F6D2F8
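The selection of triple and data layout described above can be sketched as two small lookups. The function names are hypothetical, and the layout helper returns only the leading portion of each string because the source elides the remainder.

```c
#include <stddef.h>

/* width: pointer width in bytes (the value read from unk_4F06A68). */
const char *target_triple(int width)
{
    return width == 8 ? "nvptx64-nvidia-cuda" : "nvptx-nvidia-cuda";
}

/* p3_flag: nonzero when unk_4D0461C is set.
   Returns only the known prefix of the layout string. */
const char *data_layout_prefix(int width, int p3_flag)
{
    if (width == 8)
        return p3_flag ? "e-p:64:64:64-p3:32:32:32-i1:8:8-"
                       : "e-p:64:64:64-i1:8:8-";
    return "e-p:32:32:32-i1:8:8-";
}
```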
LibNVVM Pipeline Driver — sub_905EE0
This 43 KB function is the core of Path A. It orchestrates the full compilation through 14 sequential phases, using an interesting indirection mechanism: rather than calling LibNVVM API functions directly, it resolves them at runtime through sub_12BC0F0(id) — a dispatch function that takes a numeric ID and returns a function pointer.
| Field | Value |
|---|---|
| Address | 0x905EE0 |
| Size | 43KB (1,268 lines) |
| Timer | "LibNVVM" |
| Orchestrator | sub_902D10 (simple mode) |
14-Phase Compilation Flow
The compilation proceeds through these phases sequentially. Phases 2.1–2.14 are the core compilation unit lifecycle: create, populate, configure, compile, extract results, destroy. The -keep flag (when wizard mode is active) causes intermediate .lnk.bc and .opt.bc files to be written to disk, which is invaluable for debugging the pipeline.
| Phase | Action |
|---|---|
| 0 | Verbose command-line reconstruction |
| 1 | Input file handling — NVVM container (sub_9047E0) or regular bitcode |
| 2.1 | Pipeline infrastructure: verbose stream, address space query, module prep |
| 2.2 | nvvmCreateCU (dispatch ID 2151) |
| 2.3 | nvvmCUAddModule (dispatch ID 4660) |
| 2.4 | Check for -opt/-llc direct stage invocation |
| 2.5 | Parse and add additional modules from extra args |
| 2.6 | Libdevice linking: external file or embedded unk_3EA0080 (455,876 bytes) |
| 2.7 | Intermediate file path construction (.lnk.bc, .opt.bc) |
| 2.8 | Option parsing via sub_9685E0 → 4 stage vectors (lnk/opt/lto/llc) |
| 2.9 | Verbose stage logging + callback registration |
| 2.10 | Load 37 LLVM options from off_4B90FE0 via nvvmSetOptionStrings (ID 0xFEED) |
| 2.11 | nvvmCUCompile (ID 0xBEAD) with phase code 57069 (0xDEED) |
| 2.12 | Get compilation log/warnings |
| 2.13 | Get PTX output |
| 2.14 | nvvmDestroyCU (ID 21257 = 0x5309) |
| 3 | Error/warning output to caller |
| 4 | Write output file (text or binary detection via ELF magic) |
| 5 | Timer stop |
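Phase 4 distinguishes text from binary output by looking for the ELF magic. A minimal sketch of such a test is shown here; the four magic bytes are standard ELF, while the function name and how cicc acts on the result are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 when the buffer starts with the ELF magic 0x7F 'E' 'L' 'F'. */
int looks_like_elf(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = { 0x7F, 'E', 'L', 'F' };
    return len >= 4 && memcmp(buf, magic, 4) == 0;
}
```

PTX is plain text, so a compiled result that fails this test would be written as text.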
Input File Handling — Phase 1
Phase 1 has two sub-paths based on the a3->byte[65] container flag:
Path 1A: NVVM IR Container. When the input is an NVVM container (a binary format wrapping IR plus compilation options), sub_9047E0 (10 KB) parses it. The container format encodes the target SM version, FTZ mode, precision settings, and IEEE mode. The parser extracts these and converts them to LLVM CLI flags:
// Pseudo-code for container option extraction
push("-march=nvptx");
push("-mcpu=sm_" + str(container->sm_version / 10));
if (container->flags[200] & 0x20) push("-nvptx-f32ftz");
if (container->flags[200] & 0x80) push("-nvptx-prec-sqrtf32=1");
else push("-nvptx-prec-sqrtf32=0");
push(container->flags[204] ? "-nvvm-ieee-mode=S" : "-nvvm-ieee-mode=T");
if (container->mode == 2) push("--device-c"); // relocatable compilation
If parsing fails, the error message is "Invalid NVVM IR Container" (error code 259).
Path 1B: Regular LLVM bitcode. For raw .bc files, the function creates a timer object, configures the SM architecture via sub_B6F950, opens the file via sub_C7EAD0, and parses it into an LLVM module via sub_A01950.
LibNVVM API Dispatch IDs
Internal function sub_12BC0F0(id) returns API function pointers by numeric ID. This indirection exists because the LibNVVM API is implemented within the same binary — these aren't dynamically-linked external functions but rather internal call points resolved through a dispatch table. The hex IDs double as a form of internal documentation:
| ID | Hex | Function |
|---|---|---|
| 2151 | 0x0867 | nvvmCreateCU |
| 4111 | 0x100F | nvvmGetCompiledResult |
| 4660 | 0x1234 | nvvmCUAddModule |
| 17185 | 0x4321 | nvvmCUSetExtraArgs |
| 21257 | 0x5309 | nvvmDestroyCU |
| 41856 | 0xA380 | nvvmGetCompiledResultSize (returns the log size) |
| 46903 | 0xB737 | nvvmGetCompiledResultLog |
| 46967 | 0xB777 | nvvmGetErrorString |
| 48813 | 0xBEAD | nvvmCUCompile |
| 48879 | 0xBEEF | Callback registrar |
| 61451 | 0xF00B | nvvmGetCompiledResultPTXSize |
| 62298 | 0xF35A | nvvmCUAddModuleFromBuffer |
| 65261 | 0xFEED | nvvmSetOptionStrings |
The complete dispatch table in sub_12BC0F0, implemented as a binary search tree on the ID value, contains 26 IDs that resolve to 25 distinct targets (2167 is an alias of 2151):
| ID | Hex | Target | Semantic Name |
|---|---|---|---|
| 2151 | 0x0867 | sub_12BB090 | nvvmCreateCU |
| 2167 | 0x0877 | sub_12BB090 | (alias) |
| 3911 | 0x0F47 | sub_12BBF40 | nvvmCUSetProgressCallback |
| 4111 | 0x100F | sub_12BA8F0 | nvvmGetCompiledResult |
| 4606 | 0x11FE | sub_12BA330 | nvvmCULinkModule |
| 4660 | 0x1234 | sub_12BC650 | nvvmCUAddModule |
| 8320 | 0x2080 | sub_12BB400 | nvvmCUSetOption |
| 11245 | 0x2BED | sub_12BB290 | nvvmCUGetLog |
| 17185 | 0x4321 | sub_12BBD80 | nvvmCUSetExtraArgs |
| 21257 | 0x5309 | sub_12B9C40 | nvvmDestroyCU |
| 23294 | 0x5AFE | sub_12BAF10 | nvvmVerify |
| 41856 | 0xA380 | sub_12BA220 | nvvmGetCompiledResultSize |
| 45242 | 0xB0BA | sub_12BAB40 | nvvmCUGetWarnings |
| 46903 | 0xB737 | sub_12BA7C0 | nvvmGetCompiledResultLog |
| 46967 | 0xB777 | sub_12B9980 | nvvmGetErrorString |
| 48813 | 0xBEAD | sub_12BA110 | nvvmCUCompile |
| 48879 | 0xBEEF | sub_12BACF0 | nvvmCURegisterCallback |
| 49522 | 0xC172 | sub_12BA470 | nvvmCUGetIR |
| 51966 | 0xCAFE | sub_12B9A50 | nvvmGetVersion |
| 56495 | 0xDCAF | sub_12B9A40 | (unknown) |
| 57005 | 0xDEAD | sub_12B9C00 | nvvmInit |
| 61451 | 0xF00B | sub_12BA560 | nvvmGetCompiledResultPTXSize |
| 61453 | 0xF00D | sub_12BA6A0 | nvvmCURegisterLNKCallback |
| 61806 | 0xF16E | sub_12BAA30 | nvvmCUGetOptIR |
| 62298 | 0xF35A | sub_12BC8B0 | nvvmCUAddModuleFromBuffer |
| 65261 | 0xFEED | sub_12B9AB0 | nvvmSetOptionStrings |
Public LibNVVM API vs Internal CU API
The dispatch table above reveals a critical architectural detail: cicc's internal API uses compilation unit semantics (nvvmCreateCU, nvvmCUAddModule, nvvmCUCompile), while the public LibNVVM shared library (libnvvm.so) exports a different API surface using program semantics (nvvmCreateProgram, nvvmAddModuleToProgram, nvvmCompileProgram). The public API is documented in NVIDIA's nvvm.h header; the internal API exists only within cicc and is never exported.
Evidence for this mapping comes from nvlink's -dlto code path, which loads libnvvm.so at runtime (presumably via dlopen()) and resolves symbols by their public names with dlsym():
// nvlink sub_4BC290 — loads libnvvm.so for device LTO
dlsym(handle, "nvvmCreateProgram"); // → internally nvvmCreateCU
dlsym(handle, "nvvmCompileProgram"); // → internally nvvmCUCompile
dlsym(handle, "nvvmGetCompiledResultSize");
dlsym(handle, "nvvmGetCompiledResult");
dlsym(handle, "nvvmDestroyProgram"); // → internally nvvmDestroyCU
The complete mapping between the public libnvvm.so API (as used by external callers like nvlink and user programs) and cicc's internal CU dispatch IDs:
| Public API (libnvvm.so) | Internal Name | Dispatch ID | Hex | Target |
|---|---|---|---|---|
| nvvmCreateProgram | nvvmCreateCU | 2151 | 0x0867 | sub_12BB090 |
| nvvmAddModuleToProgram | nvvmCUAddModule | 4660 | 0x1234 | sub_12BC650 |
| nvvmLazyAddModuleToProgram | nvvmCUAddModuleFromBuffer | 62298 | 0xF35A | sub_12BC8B0 |
| nvvmCompileProgram | nvvmCUCompile | 48813 | 0xBEAD | sub_12BA110 |
| nvvmVerifyProgram | nvvmVerify | 23294 | 0x5AFE | sub_12BAF10 |
| nvvmGetCompiledResultSize | nvvmGetCompiledResultPTXSize | 61451 | 0xF00B | sub_12BA560 |
| nvvmGetCompiledResult | nvvmGetCompiledResult | 4111 | 0x100F | sub_12BA8F0 |
| nvvmGetProgramLogSize | nvvmGetCompiledResultSize | 41856 | 0xA380 | sub_12BA220 |
| nvvmGetProgramLog | nvvmGetCompiledResultLog | 46903 | 0xB737 | sub_12BA7C0 |
| nvvmDestroyProgram | nvvmDestroyCU | 21257 | 0x5309 | sub_12B9C40 |
Note the naming confusion in the internal API: nvvmGetCompiledResultSize (ID 0xA380) returns the log size, while nvvmGetCompiledResultPTXSize (ID 0xF00B) returns the actual PTX output size. The public API resolves this with clearer names (nvvmGetProgramLogSize vs nvvmGetCompiledResultSize).
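The remaining entries are not part of the mapping above. Most have no public equivalent, although the public nvvm.h header does declare nvvmGetErrorString and nvvmVersion, which may correspond to nvvmGetErrorString and nvvmGetVersion below: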
| Internal Name | Dispatch ID | Hex | Target | Purpose |
|---|---|---|---|---|
| nvvmInit | 57005 | 0xDEAD | sub_12B9C00 | One-time initialization of LLVM infrastructure |
| nvvmGetVersion | 51966 | 0xCAFE | sub_12B9A50 | Returns internal NVVM version tuple |
| nvvmGetErrorString | 46967 | 0xB777 | sub_12B9980 | Maps nvvmResult code to human-readable string |
| nvvmSetOptionStrings | 65261 | 0xFEED | sub_12B9AB0 | Bulk-loads LLVM CLI option table (37 entries) |
| nvvmCUSetExtraArgs | 17185 | 0x4321 | sub_12BBD80 | Passes additional argc/argv to compilation |
| nvvmCUSetOption | 8320 | 0x2080 | sub_12BB400 | Sets a single compilation option |
| nvvmCUSetProgressCallback | 3911 | 0x0F47 | sub_12BBF40 | Registers progress/cancellation callback |
| nvvmCURegisterCallback | 48879 | 0xBEEF | sub_12BACF0 | Registers stage-boundary callback (verbose output) |
| nvvmCURegisterLNKCallback | 61453 | 0xF00D | sub_12BA6A0 | Registers LNK-stage-specific callback |
| nvvmCUGetLog | 11245 | 0x2BED | sub_12BB290 | Alternative log retrieval interface |
| nvvmCUGetWarnings | 45242 | 0xB0BA | sub_12BAB40 | Retrieves warning-only messages |
| nvvmCUGetIR | 49522 | 0xC172 | sub_12BA470 | Retrieves intermediate LLVM IR after linking |
| nvvmCUGetOptIR | 61806 | 0xF16E | sub_12BAA30 | Retrieves optimized IR (post-OPT stage); also used by -irversion |
| nvvmCULinkModule | 4606 | 0x11FE | sub_12BA330 | Explicit module linking (separate from add-then-compile) |
| (unknown) | 56495 | 0xDCAF | sub_12B9A40 | Unknown (located 16 bytes before nvvmGetVersion at sub_12B9A50) |
| (alias) | 2167 | 0x0877 | sub_12BB090 | Alias for nvvmCreateCU (same target, different ID) |
The nvvmCUGetOptIR function at sub_12BAA30 serves double duty: it is both the post-optimization IR retrieval API and the target of sub_12BC0E0 (a thunk called from sub_8F9C90 for the -irversion flag). When the user passes -irversion, the real main calls sub_12BC0E0 which dispatches to sub_12BAA30, which returns the IR version tuple as major * 100 + minor. This value is printed to stdout and the process exits immediately.
The sub_12BC0F0 Dispatch Mechanism
sub_12BC0F0 is a ~3 KB function at 0x12BC0F0 that implements a binary search tree over the dispatch IDs listed above (26 IDs, 25 distinct targets). The function takes a single unsigned int argument (the ID) and returns a function pointer (void*). The tree is hardcoded as a series of comparison-and-branch instructions, not as a data-driven lookup table.
// Pseudocode for sub_12BC0F0(unsigned int id)
void* nvvm_dispatch(unsigned int id) {
// Binary search over the 26 dispatch IDs (25 distinct targets)
if (id < 17185) {
if (id < 4660) {
if (id == 2151 || id == 2167) return sub_12BB090;
if (id == 3911) return sub_12BBF40;
if (id == 4111) return sub_12BA8F0;
if (id == 4606) return sub_12BA330;
} else {
if (id == 4660) return sub_12BC650;
if (id == 8320) return sub_12BB400;
if (id == 11245) return sub_12BB290;
}
} else {
// ... upper half of the tree
if (id == 48813) return sub_12BA110; // 0xBEAD
if (id == 65261) return sub_12B9AB0; // 0xFEED
// etc.
}
return NULL; // unknown ID
}
The hex IDs are deliberately memorable patterns used as a form of internal documentation: 0xDEAD = init, 0xBEAD = compile, 0xBEEF = callback, 0xCAFE = version, 0xFEED = options, 0xF00D = LNK callback, 0xF00B = result size. The secondary ID 0x0877 (2167) is an alias for 0x0867 (2151) and dispatches to the same sub_12BB090 target, suggesting an internal API version migration where both old and new IDs must remain functional.
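For experimenting with the ID space outside the binary, the same mapping can be written as a sorted table searched with the standard bsearch. This is a data-driven equivalent for illustration only, not how sub_12BC0F0 is built: the binary hardcodes the comparisons. Targets are returned as name strings, and the struct and function names are hypothetical.

```c
#include <stdlib.h>

struct dispatch_entry { unsigned id; const char *target; };

/* IDs and targets copied from the complete dispatch table, sorted by ID. */
static const struct dispatch_entry dispatch_table[] = {
    {  2151, "sub_12BB090" }, {  2167, "sub_12BB090" },
    {  3911, "sub_12BBF40" }, {  4111, "sub_12BA8F0" },
    {  4606, "sub_12BA330" }, {  4660, "sub_12BC650" },
    {  8320, "sub_12BB400" }, { 11245, "sub_12BB290" },
    { 17185, "sub_12BBD80" }, { 21257, "sub_12B9C40" },
    { 23294, "sub_12BAF10" }, { 41856, "sub_12BA220" },
    { 45242, "sub_12BAB40" }, { 46903, "sub_12BA7C0" },
    { 46967, "sub_12B9980" }, { 48813, "sub_12BA110" },
    { 48879, "sub_12BACF0" }, { 49522, "sub_12BA470" },
    { 51966, "sub_12B9A50" }, { 56495, "sub_12B9A40" },
    { 57005, "sub_12B9C00" }, { 61451, "sub_12BA560" },
    { 61453, "sub_12BA6A0" }, { 61806, "sub_12BAA30" },
    { 62298, "sub_12BC8B0" }, { 65261, "sub_12B9AB0" },
};

/* bsearch comparator: first argument is the key, second a table element. */
static int cmp_id(const void *key, const void *elem)
{
    unsigned id = *(const unsigned *)key;
    unsigned other = ((const struct dispatch_entry *)elem)->id;
    return (id > other) - (id < other);
}

/* Returns the target name for an ID, or NULL when the ID is unknown. */
const char *dispatch_target(unsigned id)
{
    const struct dispatch_entry *e = bsearch(
        &id, dispatch_table,
        sizeof dispatch_table / sizeof dispatch_table[0],
        sizeof dispatch_table[0], cmp_id);
    return e ? e->target : NULL;
}
```

The alias shows up directly in the data: 2151 and 2167 are separate rows with the same target.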
Dual-Path Initialization
The two compilation paths (Path A and Path B) use independent initialization sequences: the same underlying LLVM infrastructure is bootstrapped through different entry points. This is why two copies of libdevice, two LLVM options tables, and two sets of verbose callbacks exist.
Path A initialization (EDG → LibNVVM):
sub_B6EEA0 — Creates LLVMContext + registers 42+ metadata kinds
(dbg=1, tbaa=2, prof=3, ... noalias.addrspace=42)
sub_900130 — 39 KB CLI parser for Path A flags
sub_905880 — EDG frontend produces LLVM module (880-byte object)
sub_908850 — Binds module to target: data layout, triple, verification
→ sub_905EE0 enters LibNVVM pipeline with module
Path B initialization (Standalone):
sub_1602D10 — Creates standalone LLVMContext (no EDG metadata assumptions)
sub_125FB30 — 8 KB CLI parser for Path B flags
sub_1265340 — Pre-compilation setup (configure output path, timer)
→ sub_1265970 enters LibNVVM pipeline with bitcode input
The version resolver sub_12B9F70 at 0x12B9F70 is shared between both paths and determines which NVVM IR compatibility mode to use. It reads two obfuscated environment variables in sequence:
// Pseudocode for sub_12B9F70(unsigned int sm_version)
int nvvm_version_resolve(unsigned int sm_version) {
// Try NV_NVVM_VERSION first (decrypted from 0x3C23A90)
char *env = getenv(decrypt("NV_NVVM_VERSION"));
if (!env) {
// Fallback: try LIBNVVM_NVVM_VERSION (decrypted from 0x42812F0)
env = getenv(decrypt("LIBNVVM_NVVM_VERSION"));
}
if (env) {
if (strcmp(env, "nvvm70") == 0) return 0; // Path B mode
if (strcmp(env, "nvvm-latest") == 0) return 1; // Path A mode
}
// Default: SM >= 100 uses Path B, SM < 100 uses Path A
return (sm_version > 99) ? 0 : 1;
}
This function is called from both sub_8F9C90 (the real main, for v253 resolution) and sub_12BB580 (inside the LibNVVM compilation unit initialization). The dual call-site ensures that the version mode is consistent regardless of whether the compiler was invoked via CLI or via the LibNVVM API.
The nvvmInit function (ID 0xDEAD, sub_12B9C00) performs one-time LLVM infrastructure initialization. It is called implicitly during nvvmCreateCU (sub_12BB090) via a pthread_once guard at dword_4F92D9C. The initialization includes:
- Registering LLVM target triples (nvptx64-nvidia-cuda, nvptx-nvidia-cuda)
- Initializing the NVPTX target machine factory
- Setting up the LLVM pass registry
- Configuring thread-safety based on LIBNVVM_DISABLE_CONCURRENT_API (byte_4F92D70)
When byte_4F92D70 == 1 (concurrent API disabled), the pipeline operates in single-threaded mode — no pthread_mutex locks are acquired around compilation unit operations, and Phase II concurrent optimization is disabled regardless of the module's function count.
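The one-time initialization guard described above can be sketched with POSIX threads. This is a minimal illustration, not the recovered code: `init_guard`, `concurrent_api_disabled`, `init_once`, and `create_cu` are hypothetical stand-ins for dword_4F92D9C, byte_4F92D70, nvvmInit, and nvvmCreateCU, and reading the environment variable inside the init routine is a simplification.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for dword_4F92D9C and byte_4F92D70. */
static pthread_once_t init_guard = PTHREAD_ONCE_INIT;
static int concurrent_api_disabled = 0;
static int init_calls = 0;

/* One-time setup. The real routine would register targets and passes here. */
static void init_once(void) {
    init_calls++;
    /* Any value of the variable, even an empty one, disables concurrency. */
    concurrent_api_disabled =
        (getenv("LIBNVVM_DISABLE_CONCURRENT_API") != NULL);
}

/* Called for every compilation unit; returns how often setup has run. */
int create_cu(void) {
    pthread_once(&init_guard, init_once);
    return init_calls;
}

/* Nonzero when the pipeline should run single-threaded and lock-free. */
int single_threaded_mode(void) {
    pthread_once(&init_guard, init_once);
    return concurrent_api_disabled;
}
```

However many times `create_cu()` is called, and from however many threads, `init_once()` runs exactly once.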
Internal API Usage Sequence
The complete sequence of dispatch table calls during a standard Path A compilation (from sub_905EE0):
1. sub_12BC0F0(2151) → nvvmCreateCU(&handle)
Creates compilation unit. Calls nvvmInit via pthread_once on first use.
2. sub_12BC0F0(46967) → nvvmGetErrorString
Saved for later error message formatting.
3. sub_12BC0F0(4660) → nvvmCUAddModule(handle, IR_data, IR_size, NULL)
Adds the user's LLVM bitcode module.
4. sub_12BC0F0(21257) → nvvmDestroyCU
Saved as cleanup function pointer (not called yet).
5. sub_12BCB00 [thunk] → nvvmCUAddModuleFromBuffer(handle, buf, size, NULL)
Called N times: once per additional module from extra args,
once for libdevice (embedded or external).
6. sub_12BC0F0(48879) → nvvmCURegisterCallback
Registers verbose stage callbacks:
sub_903BA0 with ID 61453 (LNK stage)
sub_903730 with ID 47710 (LLC stage)
When -keep mode active, also registers:
sub_9085A0 with ID 64222 (OPT output → .opt.bc file)
sub_908220 with ID 56993 (LLC output → final file)
7. sub_12BC0F0(65261) → nvvmSetOptionStrings(opts_table, 37)
Loads 37 LLVM backend configuration strings from off_4B90FE0.
Calls sub_1C31130() internally to register/reset LLVM options.
8. sub_12BC0F0(48813) → nvvmCUCompile(handle, 57069)
Main compilation. Phase code 57069 (0xDEED) triggers full
LNK → OPT → [OPTIXIR] → LLC pipeline in sub_12C35D0.
9. sub_12BC0F0(17185) → nvvmCUSetExtraArgs(handle, argc, argv)
Passes additional arguments collected from the CLI.
10. sub_12BC0F0(41856) → nvvmGetCompiledResultSize(handle, &log_size)
Queries the compilation log size.
11. sub_12BC0F0(46903) → nvvmGetCompiledResultLog(handle, log_buf)
Retrieves the compilation log (warnings/errors).
12. sub_12BC0F0(61451) → nvvmGetCompiledResultPTXSize(handle, &ptx_size)
Queries the PTX output size.
13. sub_12BC0F0(4111) → nvvmGetCompiledResult(handle, ptx_buf)
Copies the generated PTX into the caller's buffer.
14. sub_12BC0F0(21257) → nvvmDestroyCU(&handle)
Destroys the compilation unit, frees all internal resources.
Path B (sub_1265970) follows the identical sequence but uses off_4C6EEE0 for the options table (step 7), unk_420FD80 for the embedded libdevice (step 5), and appends "-nvvm-version=nvvm70" instead of "-nvvm-version=nvvm-latest" to the pipeline arguments.
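The ID-based lookup that sub_12BC0F0 performs in the sequence above can be pictured as a small table search. The sketch below is an assumption about its shape only: the three IDs come from the list, while `api_fn`, `api_table`, `api_lookup`, and the stub entry points are hypothetical and take no arguments for brevity.

```c
#include <stddef.h>

typedef int (*api_fn)(void);

/* Hypothetical entry points; the real ones take handles and buffers. */
static int stub_create_cu(void)  { return 0; }
static int stub_compile_cu(void) { return 0; }
static int stub_destroy_cu(void) { return 0; }

struct api_entry { unsigned id; api_fn fn; };

/* IDs as listed above; only three of the entries are shown. */
static const struct api_entry api_table[] = {
    { 2151,  stub_create_cu  },  /* nvvmCreateCU  */
    { 48813, stub_compile_cu },  /* nvvmCUCompile */
    { 21257, stub_destroy_cu },  /* nvvmDestroyCU */
};

/* Return the function registered for `id`, or NULL if the ID is unknown. */
api_fn api_lookup(unsigned id) {
    for (size_t i = 0; i < sizeof api_table / sizeof api_table[0]; i++)
        if (api_table[i].id == id)
            return api_table[i].fn;
    return NULL;
}
```

A caller resolves each function pointer once, for example `api_fn destroy = api_lookup(21257);`, and keeps it for later use, which matches how step 4 saves nvvmDestroyCU before it is needed.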
nvvmResult Error Codes
The nvvmGetErrorString function (ID 0xB777, sub_12B9980) maps integer result codes from all API functions to descriptive strings:
| Code | Constant | Description |
|---|---|---|
| 0 | NVVM_SUCCESS | Operation completed successfully |
| 1 | NVVM_ERROR_OUT_OF_MEMORY | Memory allocation failed |
| 2 | NVVM_ERROR_PROGRAM_CREATION_FAILURE | Failed to create compilation unit |
| 3 | NVVM_ERROR_IR_VERSION_MISMATCH | Incompatible NVVM IR version detected |
| 4 | NVVM_ERROR_INVALID_INPUT | Malformed input (bad bitcode, wrong magic) |
| 5 | NVVM_ERROR_INVALID_PROGRAM | Null or invalid compilation unit handle |
| 6 | NVVM_ERROR_INVALID_IR | IR failed verification |
| 7 | NVVM_ERROR_INVALID_OPTION | Unrecognized compilation option |
| 8 | NVVM_ERROR_NO_MODULE_IN_PROGRAM | Compilation unit has no modules added |
| 9 | NVVM_ERROR_COMPILATION | Compilation failed (linker, optimizer, or codegen error) |
| 10 | NVVM_ERROR_CANCELLED | Compilation cancelled by user callback |
The pipeline orchestrator sub_12C35D0 maps its internal return codes to these: 0 → NVVM_SUCCESS, 7 → NVVM_ERROR_INVALID_OPTION, 9 → NVVM_ERROR_COMPILATION, 10 → NVVM_ERROR_CANCELLED, 100 → NVVM_ERROR_COMPILATION (post-pipeline verification failure).
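The code mapping just described is small enough to write out. The following is a sketch under the assumption that undocumented orchestrator codes are simply rejected; the function names `orchestrator_to_nvvm_result` and `nvvm_result_name` are hypothetical.

```c
#include <stddef.h>

/* Constant names from the table above, indexed by nvvmResult code. */
static const char *const result_names[] = {
    "NVVM_SUCCESS",
    "NVVM_ERROR_OUT_OF_MEMORY",
    "NVVM_ERROR_PROGRAM_CREATION_FAILURE",
    "NVVM_ERROR_IR_VERSION_MISMATCH",
    "NVVM_ERROR_INVALID_INPUT",
    "NVVM_ERROR_INVALID_PROGRAM",
    "NVVM_ERROR_INVALID_IR",
    "NVVM_ERROR_INVALID_OPTION",
    "NVVM_ERROR_NO_MODULE_IN_PROGRAM",
    "NVVM_ERROR_COMPILATION",
    "NVVM_ERROR_CANCELLED",
};

/* Map an orchestrator return code to an nvvmResult code.
   Only 0, 7, 9, 10 and 100 are documented; anything else yields -1. */
int orchestrator_to_nvvm_result(int rc) {
    switch (rc) {
    case 0:   return 0;   /* NVVM_SUCCESS */
    case 7:   return 7;   /* NVVM_ERROR_INVALID_OPTION */
    case 9:   return 9;   /* NVVM_ERROR_COMPILATION */
    case 10:  return 10;  /* NVVM_ERROR_CANCELLED */
    case 100: return 9;   /* post-pipeline verification failure */
    default:  return -1;  /* not documented (assumption: reject) */
    }
}

/* Return the constant name for a result code, or NULL if out of range. */
const char *nvvm_result_name(int code) {
    if (code < 0 || code > 10)
        return NULL;
    return result_names[code];
}
```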
37 LLVM Options from off_4B90FE0
Phase 2.10 loads a hardcoded table of 37 LLVM option strings from off_4B90FE0 (296 bytes = 37 pointers). These are static, compiled-in LLVM backend configuration flags that are injected into every compilation unit via nvvmSetOptionStrings (ID 0xFEED). The options include target architecture flags (-march=nvptx64, -mcpu=sm_XX), math precision controls (-nvptx-f32ftz, -nvptx-prec-sqrtf32=), optimization levels, debug info flags, and NVPTX-specific feature knobs. The sub_12B9AB0 target function calls sub_1C31130() -- the LLVM option registration/reset function -- to apply them.
Embedded Libdevice
A key design decision: two identical copies of the libdevice bitcode are statically embedded in the binary. Each is 455,876 bytes (~445 KB) of LLVM bitcode containing ~400+ math functions (__nv_sin, __nv_cos, __nv_exp, __nv_log, __nv_sqrt, etc.) plus atomic operation helpers and FP16/BF16 conversion routines. The duplication exists because Path A and Path B have separate initialization sequences and the linker didn't deduplicate the .rodata sections.
When the user provides -nvvmir-library <path>, the external file is used instead. This allows overriding the built-in math library — useful for testing custom libdevice builds.
| Path | Address | Size | Purpose |
|---|---|---|---|
| Path A | unk_3EA0080 | 455,876 bytes | Default libdevice for LibNVVM mode |
| Path B | unk_420FD80 | 455,876 bytes | Default libdevice for standalone mode |
Verbose Callbacks and Intermediate Files
Phase 2.9 registers callback functions that fire at pipeline stage boundaries. When verbose mode is active, these callbacks produce reconstructed command-line output for each stage:
[ "<src>" -lnk -nvvmir-library "<path>" "<input>" -o "<file>.lnk.bc" <opts> -nvvm-version=nvvm-latest ]
[ "<src>" -llc "<llc_path>" -o "<output>" <opts> -nvvm-version=nvvm-latest ]
The callback registration uses sub_12BC0F0(48879) (ID 0xBEEF = nvvmCURegisterCallback) with stage-specific callback IDs:
| Callback | ID | Stage |
|---|---|---|
| sub_903BA0 | 61453 | LNK stage output |
| sub_903730 | 47710 | LLC stage output |
| sub_9085A0 | 64222 | OPT output (keep mode) |
| sub_908220 | 56993 | LLC output (keep mode) |
Intermediate file paths (.lnk.bc for linked-but-unoptimized, .opt.bc for optimized-but-not-yet-codegen'd) are always constructed as strings, but the actual files are only written to disk when the -keep flag is active in wizard mode.
Path A Error Messages
All errors from sub_905EE0 are written to stderr via sub_223E0D0. Error categories:
| Category | Prefix | Example |
|---|---|---|
| File I/O | "<src>: " | "error in open <file>", "input file <f> read error" |
| LibNVVM API | "libnvvm: error: " | "failed to create the libnvvm compilation unit" |
| Output | "<src>: " | "IO error: <system_error_msg>" |
| Fatal | (none) | "basic_string::append" (std::string overflow at 0x3FFFFFFFFFFFFFFF) |
The error code from LibNVVM API calls maps to nvvmResult: 0 = success, 1 = out of memory, 4 = invalid input, 5 = invalid compilation unit (null handle).
Path B — Standalone cicc Pipeline (sub_1265970)
Path B is the standalone compilation path used when cicc is invoked with LLVM bitcode input (.bc files), by the LibNVVM API directly, or as the default for SM >= 100 architectures. Despite the different entry point, it shares the same underlying LLVM infrastructure as Path A — the difference is in how modules are loaded and how the pipeline stages are orchestrated. Path B appends -nvvm-version=nvvm70 to the optimizer arguments, indicating it targets the NVVM 7.0 IR specification (corresponding to LLVM 7.0.1 bitcode format, the version NVIDIA froze their IR compatibility at).
The 4-stage pipeline (LNK → OPT → OPTIXIR → LLC) runs in-memory: each stage takes an LLVM Module, transforms it, and passes it to the next stage. The OPTIXIR stage is optional and only active when --emit-optix-ir is specified. A user-provided cancellation callback can abort compilation between stages (return code 10).
| Field | Value |
|---|---|
| Address | 0x1265970 |
| Size | ~48KB (1,371 lines) |
| Timer | "LibNVVM" (same name as Path A) |
| Version string | -nvvm-version=nvvm70 |
Path B Entry — sub_1262860
sub_1262860 (418 lines) is the command-line entry point for Path B, analogous to sub_902D10 for Path A. It parses CLI flags, initializes the compilation context, and calls sub_1265970 for the actual compilation.
| Field | Value |
|---|---|
| Address | 0x1262860 |
| Timer init | sub_1602D10 (standalone context, contrasted with Path A's sub_B6EEA0) |
| CLI parser | sub_125FB30 (Path B's equivalent of Path A's sub_900130) |
The flow is: allocate timer handle → parse CLI via sub_125FB30 → configure output path → call sub_1265340 for pre-compilation setup → call sub_1265970 for compilation → write output. Output can go to stdout if the output path is "-", handled by sub_125C500. On failure: "\n Error processing command line: <details>".
Path B Compilation Orchestrator — sub_1265970
This 48 KB function mirrors sub_905EE0's role but with Path B's initialization and context. It handles both LibNVVM API invocations (when a11 = 1) and CLI invocations (when a11 = 0), with the same 14-phase structure as Path A but using Path B's context objects and the nvvm70 version string.
Key behavioral differences from Path A:
- Context initialization. Path B uses sub_1602D10 for context init (rather than sub_B6EEA0), which creates a standalone LLVM context without the EDG frontend's metadata registration assumptions.
- NVVM IR container handling. Container parsing is performed by sub_12642A0 (Path B's container parser) rather than sub_9047E0.
- Embedded libdevice address. Uses unk_420FD80 (the second copy) rather than unk_3EA0080.
- LLVM options table. Loads 37 options from off_4C6EEE0 (Path B's copy) rather than off_4B90FE0.
- Verbose callbacks. Registers sub_1263280 (ID 61453) and sub_12636E0 (ID 47710) for LNK and OPT stage output respectively, and sub_1268040 / sub_1267CC0 for keep-mode output.
- Version string. Always appends "-nvvm-version=nvvm70" rather than "-nvvm-version=nvvm-latest".
4-Stage Pipeline Orchestrator — sub_12C35D0
The orchestrator creates two backend objects — nvopt (512 bytes, the optimizer) and nvllc (480 bytes, the code generator) — and wires them together with the stage dispatch structure. Each stage is controlled by a bit in a stage bitmask derived from sub_12D2AA0, which parses architecture and options into per-stage configuration.
| Field | Value |
|---|---|
| Address | 0x12C35D0 |
| Size | 41KB (1,446 lines) |
| Backend objects | nvopt (512 bytes) + nvllc (480 bytes) |
| Stage | Bit | Timer String | Core Function |
|---|---|---|---|
| LNK | 0x01 | "LNK" / "LibNVVM module linking step." | sub_12C06E0 (63KB, module linker) |
| OPT | 0x80 | "OPT" / "LibNVVM optimization step." | sub_12E7E70 (full LLVM pipeline) |
| OPTIXIR | 0x40 | "OPTIXIR" / "LibNVVM Optix IR step." | sub_12F9270 (OptiX IR gen) |
| LLC | 0x04 | "LLC" / "LibNVVM code-generation step." | sub_12F5100 (SelectionDAG codegen) |
Pipeline stage bitmask (from sub_12D2AA0): bit 0=LNK, bit 2=LLC, bit 5=verify, bit 6=OPTIXIR, bit 7=OPT.
Return codes: 0=success, 7=parse failure, 9=link/layout/verification error, 10=cancelled, 100=post-pipeline verification failure.
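To make the bit assignments concrete, here is a short sketch that decodes a stage bitmask into the stages that would run, in pipeline order. The bit values are the ones given above; the enum and the function `enabled_stages` are hypothetical names.

```c
#include <stddef.h>

/* Stage bits as documented for the mask produced by sub_12D2AA0. */
enum stage_bit {
    STAGE_LNK     = 1 << 0,
    STAGE_LLC     = 1 << 2,
    STAGE_VERIFY  = 1 << 5,
    STAGE_OPTIXIR = 1 << 6,
    STAGE_OPT     = 1 << 7
};

/* Fill `out` with the names of the enabled stages in execution order
   (LNK, OPT, OPTIXIR, LLC) and return how many were written. */
size_t enabled_stages(unsigned mask, const char *out[4]) {
    static const struct { unsigned bit; const char *name; } order[] = {
        { STAGE_LNK,     "LNK"     },
        { STAGE_OPT,     "OPT"     },
        { STAGE_OPTIXIR, "OPTIXIR" },
        { STAGE_LLC,     "LLC"     },
    };
    size_t n = 0;
    for (size_t i = 0; i < sizeof order / sizeof order[0]; i++)
        if (mask & order[i].bit)
            out[n++] = order[i].name;
    return n;
}
```

A mask of 0x85 (LNK, LLC, and OPT set) yields the three stages LNK, OPT, LLC, showing that execution order follows the pipeline rather than bit position.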
Backend Object Initialization
The orchestrator allocates and initializes two backend objects with distinct vtables:
// nvllc — code generator backend (480 bytes)
v8 = sub_22077B0(480);
sub_12EC960(v8, "nvllc", 5);
v8->vtable = &unk_49E7FF0;
// nvopt — optimizer backend (512 bytes)
v10 = sub_22077B0(512);
sub_12EC960(v10, "nvopt", 5);
v10->vtable = &unk_49E6A58;
v10->sub_vtable = &unk_49E6B20; // at offset +60*8
v10->plugin_slots[0..2] = 0; // offsets 61-63 cleared
A stage dispatch structure (vtable &unk_49E6B38) links the OPT output to the LLC input and stores the cancellation callback pointer.
Cancellation Callback
Between every pipeline stage, the orchestrator checks an optional user-provided cancellation callback stored at state[26]:
cancellation_fn = state[26];
if (cancellation_fn && cancellation_fn(state[27], 0))
return 10; // CANCELLED
This mechanism allows the LibNVVM API caller to abort a long-running compilation. Return code 10 propagates up through the entire call chain, causing sub_8F9C90 to return 10 as the process exit code.
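A stage loop with cancellation polling might look like the sketch below. It assumes the callback is polled before each stage, including the first, and that a nonzero stage result aborts the loop; `run_stages`, `demo_stage`, and `demo_cancel` are hypothetical names used only for illustration.

```c
#include <stddef.h>

typedef int (*cancel_fn)(void *user, int reserved);
typedef int (*stage_fn)(void *module);

/* Run stages in order, polling the optional cancel callback before each.
   Returns 0 on success, 10 if cancelled, else the failing stage's code. */
int run_stages(stage_fn *stages, size_t n, void *module,
               cancel_fn cancel, void *user) {
    for (size_t i = 0; i < n; i++) {
        if (cancel && cancel(user, 0))
            return 10;                /* CANCELLED */
        int rc = stages[i](module);
        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Demo stage (hypothetical): always succeeds. */
int demo_stage(void *module) {
    (void)module;
    return 0;
}

/* Demo callback (hypothetical): returns nonzero once the poll budget
   stored behind `user` is used up. */
int demo_cancel(void *user, int reserved) {
    int *polls_left = user;
    (void)reserved;
    return (*polls_left)-- <= 0;
}
```

With a poll budget of 1, the first stage runs and the second poll cancels the compilation, so `run_stages` returns 10 without running the remaining stages.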
Two-Phase Optimization (OPT Stage)
The OPT stage calls sub_12E7E70, which implements a two-phase optimization protocol. Both phases call the same underlying pipeline function sub_12E54A0, but a TLS variable qword_4FBB3B0 is set to 1 or 2 to indicate which phase is active:
| Phase | TLS value | Purpose |
|---|---|---|
| Phase I | 1 | Analysis + early IR optimization (module-level, CGSCC, function passes) |
| Phase II | 2 | Backend optimization + codegen preparation (lowering, legalization) |
| Complete | 3 | Compilation finished for this module |
Between phases, sub_12D4250 checks concurrency eligibility: if the module contains more than one defined function (non-declaration), and the options permit it, Phase II can run with multiple threads. Thread count is determined from opts[1026] or falls back to get_nprocs(). When concurrency is enabled, sub_12E7B90 is the concurrent worker entry point.
For single-function modules, the optimizer skips the two-phase protocol entirely and runs a single un-phased call to sub_12E54A0 -- no phase counter is set, and the optimizer executes both analysis and backend passes in one invocation.
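The phase protocol can be sketched with a C11 thread-local variable. This is an illustration of the control flow only, assuming that every module with more than one defined function takes the two-phase route; `opt_phase`, `run_pipeline`, `optimize_module`, and `last_phase` are hypothetical names standing in for qword_4FBB3B0, sub_12E54A0, and sub_12E7E70.

```c
/* Hypothetical stand-in for the TLS phase counter qword_4FBB3B0. */
static _Thread_local int opt_phase = 0;

/* Number of pipeline invocations made by the last optimize_module call. */
static _Thread_local int pipeline_calls = 0;

/* Placeholder for the shared pipeline function; a real one would
   consult opt_phase to decide which passes to run. */
static void run_pipeline(void) {
    pipeline_calls++;
}

/* Single-function modules get one un-phased call; larger modules
   run Phase I, then Phase II, then mark completion (3). */
int optimize_module(int defined_functions) {
    pipeline_calls = 0;
    if (defined_functions <= 1) {
        opt_phase = 0;          /* no phase is set on the fast path */
        run_pipeline();
        return pipeline_calls;
    }
    opt_phase = 1;              /* Phase I */
    run_pipeline();
    opt_phase = 2;              /* Phase II */
    run_pipeline();
    opt_phase = 3;              /* complete */
    return pipeline_calls;
}

/* Phase value left behind on the calling thread. */
int last_phase(void) {
    return opt_phase;
}
```

Because the counter is thread-local, concurrent workers in Phase II each see their own phase value and cannot disturb another compilation in the same process.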
Data Layout Validation
After the LLC stage but before returning, the orchestrator validates the module's data layout string. If the module has no data layout:
"DataLayoutError: Data Layout string is empty"
→ return 9
On layout mismatch, it produces a detailed diagnostic:
"<error details>\nExample valid data layout:\n64-bit: <reference_layout>"
The reference layout string is loaded from off_4CD4948[0].
Module Linker — sub_12C06E0
The LNK stage's core function (63KB) links multiple LLVM bitcode modules into a single module. This is where user code gets linked with the libdevice math library and any additional modules. The linker performs several validation steps to catch incompatible IR early — before the expensive optimization and codegen stages:
- Bitcode magic validation: checks for 0x42,0x43,0xC0,0xDE (raw LLVM bitcode, 'BC' followed by 0xC0DE) or 0xDE,0xC0,0x17,0x0B (bitcode wrapper, magic 0x0B17C0DE stored little-endian). Anything else → error code 9.
- Triple validation: every module's target triple must start with "nvptx64-". Modules without a triple get a clear error: "Module does not contain a triple, should be 'nvptx64-'".
- IR version compatibility: sub_12BFF60 reads "nvvmir.version" metadata (2 or 4 element tuples: major.minor or major.minor.debug_major.debug_minor). The NVVM_IR_VER_CHK environment variable can disable this check entirely (set to "0"), useful when mixing IR from different CUDA toolkit versions.
- Symbol size matching: for multi-module linking, compares the byte sizes of identically-named globals across modules. Size computation uses type codes (1=half(16b), 2=float(32b), 3=double(64b), 7=ptr, 0xB=integer, 0xD=struct, 0xE=array). A mismatch produces: "Size does not match for <sym> in <mod> with size X specified in <other> with size Y."
Single-module fast path: When only one module is present (after adding user code and libdevice), the linker returns it directly via sub_1C3DFC0 without invoking the full linking machinery.
Multi-module linking: For N > 1 modules, the linker copies the primary module's target triple to all secondary modules, then calls sub_12F5610 to perform the LLVM link. After user modules are linked, builtin modules (from a1[3..4]) are linked via sub_1CCEBE0, followed by target feature configuration via sub_1CB9110 and sub_1619140.
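The first two linker checks, bitcode magic and triple prefix, can be sketched as follows. The magic values are LLVM's standard ones: raw bitcode begins with the bytes 'B', 'C', 0xC0, 0xDE, and the wrapper header begins with 0x0B17C0DE stored little-endian. The function names `bitcode_kind` and `triple_ok` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Classify a buffer by its first four bytes:
   1 = raw LLVM bitcode, 2 = bitcode wrapper, 0 = neither. */
int bitcode_kind(const unsigned char *buf, size_t len) {
    static const unsigned char raw[4]     = { 0x42, 0x43, 0xC0, 0xDE };
    static const unsigned char wrapper[4] = { 0xDE, 0xC0, 0x17, 0x0B };
    if (len < 4)
        return 0;
    if (memcmp(buf, raw, 4) == 0)
        return 1;
    if (memcmp(buf, wrapper, 4) == 0)
        return 2;
    return 0;
}

/* A module passes the triple check only if its target triple is
   present and starts with "nvptx64-". */
int triple_ok(const char *triple) {
    return triple != NULL && strncmp(triple, "nvptx64-", 8) == 0;
}
```

A buffer for which `bitcode_kind` returns 0 corresponds to the error code 9 case, and a NULL triple corresponds to the "Module does not contain a triple" diagnostic.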
NVVM IR Version Checker — sub_12BFF60
The version checker reads "nvvmir.version" named metadata and validates it against the compiler's expected version range.
| Field | Value |
|---|---|
| Address | 0x12BFF60 |
| Size | ~9 KB (362 lines) |
| Metadata key | "nvvmir.version" |
| Debug metadata | "llvm.dbg.cu" |
Version tuples come in two forms:
- 2-element: (major, minor), the IR version only. Special case: (2, 0) always passes.
- 4-element: (major, minor, debug_major, debug_minor), the IR version plus debug info version. Special case: debug_major == 3, debug_minor <= 2 always passes.
The NVVM_IR_VER_CHK environment variable is checked multiple times throughout the validation. When set to "0", all version checks are bypassed, returning 0 (compatible). This is a critical escape hatch for mixing bitcode from different CUDA toolkit versions.
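The bypass and the two documented special cases can be sketched as below. The full range comparison against the compiler's expected versions is not reproduced, since its bounds are not given here; all function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* The check is bypassed only when the variable's value is exactly "0". */
int check_disabled_for(const char *value) {
    return value != NULL && strcmp(value, "0") == 0;
}

/* Reads the real environment variable named in the text. */
int version_check_disabled(void) {
    return check_disabled_for(getenv("NVVM_IR_VER_CHK"));
}

/* IR version special case: (2, 0) always passes. */
int ir_pair_always_passes(unsigned major, unsigned minor) {
    return major == 2 && minor == 0;
}

/* Debug version special case: major 3 with minor <= 2 always passes. */
int debug_pair_always_passes(unsigned debug_major, unsigned debug_minor) {
    return debug_major == 3 && debug_minor <= 2;
}
```

A caller would test `version_check_disabled()` first and return 0 (compatible) immediately, then fall back to the special cases and finally to the range comparison.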
Memory Management
jemalloc — The Global Allocator
cicc statically links a jemalloc 5.x allocator in the address range 0x12FC000–0x131FFFF (~400 functions). This replaces the system malloc/free entirely. The jemalloc configuration parser (sub_12FCDB0, 131,600 bytes -- the largest single function in this range) handles the MALLOC_CONF environment variable and /etc/malloc.conf symlink, supporting dozens of tuning options: abort, cache_oblivious, metadata_thp, trust_madvise, retain, dss, tcache, narenas, percpu_arena, background_thread, san_guard_small, san_guard_large, and more.
The choice of jemalloc over glibc's allocator is significant for compiler workloads. jemalloc's thread-local caching (tcache) and arena-per-CPU design (percpu_arena) reduce contention during the concurrent Phase II optimization, where multiple threads may be simultaneously allocating and freeing IR nodes, instruction objects, and analysis results.
The jemalloc stats subsystem (functions at 0x400000–0x42FFFF) provides comprehensive per-arena statistics including allocation counts, active/dirty/muzzy page tracking, mutex contention metrics, and HPA hugify counts. These can be triggered via MALLOC_CONF="stats_print:true".
EDG Memory Regions — sub_822260
The EDG 6.6 frontend uses a custom memory region system configured with USE_MMAP_FOR_MEMORY_REGIONS = 1. During post-parse validation in sub_617BD0 (lgenfe_main), sub_822260() is called 11 times to initialize memory regions 1 through 11. These regions serve as arena-style allocators for different categories of EDG internal data:
- Token buffers (preprocessor token storage)
- IL node pools (intermediate language tree nodes)
- Symbol tables (name→declaration mappings)
- Type representations (structural type information)
The mmap-backed regions grow by mapping additional pages on demand, avoiding the fragmentation problems that would occur with individual malloc calls for the millions of small, short-lived objects the frontend creates during parsing. Region cleanup happens in bulk when the frontend completes -- all pages for a region are unmapped at once rather than individually freed.
The EDG heap allocator cluster at 0x821000–0x823FFF includes tracked allocation (sub_822B10/sub_822B90) with a 1024-entry inline tracking array (unk_4F19620, 1024 * 24 bytes) that overflows to heap when exceeded. The tracking count is maintained in dword_4F19600. The finalization function sub_823310 walks bucket chains to free all tracked allocations.
Large Argument Lists
The argv copy in sub_8F9C90 uses a threshold-based allocation strategy:
if (8 * argc <= 0x800) // argc <= 256
v284 = stack_buffer; // 2096 bytes on stack
else
v284 = sub_16CD150(8 * argc); // heap allocation
This avoids heap allocation for the common case (most cicc invocations have at most 256 arguments, the largest count that fits the 0x800-byte threshold) while handling the worst case gracefully. The heap path uses sub_16CD150 (a realloc-like wrapper), and the buffer is freed during cleanup if it was heap-allocated.
Signal Handling and Crash Recovery
EDG Signal Handler
The EDG frontend registers a signal handler at 0x723610 during initialization:
// signal handler (0x723610)
void handler(int sig) {
write(STDERR_FILENO, "\n", 1);
dword_4F0790C = 1; // set "interrupted" flag
sub_7235F0(9); // initiate orderly shutdown
}
This handler is registered for SIGINT, allowing the compiler to be interrupted gracefully during long frontend operations (template instantiation, constexpr evaluation). The global dword_4F0790C flag is checked periodically by the parser loop, enabling cooperative cancellation.
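A cooperative-cancellation handler of this kind can be sketched in portable C. The sketch keeps only the newline write and the flag, and omits the shutdown call that the real handler makes through sub_7235F0; `interrupted`, `install_sigint_handler`, and `was_interrupted` are hypothetical names.

```c
#include <signal.h>
#include <unistd.h>

/* Flag polled by the main loop; sig_atomic_t makes the store safe
   to perform from inside a signal handler. */
static volatile sig_atomic_t interrupted = 0;

static void on_sigint(int sig) {
    (void)sig;
    /* write() is async-signal-safe, unlike printf(). */
    ssize_t ignored = write(STDERR_FILENO, "\n", 1);
    (void)ignored;
    interrupted = 1;
}

/* Install the handler for SIGINT; returns 0 on success, -1 on failure. */
int install_sigint_handler(void) {
    return signal(SIGINT, on_sigint) == SIG_ERR ? -1 : 0;
}

/* Long-running loops call this periodically and unwind when it is set. */
int was_interrupted(void) {
    return interrupted != 0;
}
```

The handler does as little as possible and leaves the actual unwinding to the code that polls the flag, which is what makes the cancellation cooperative.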
LLVM Crash Recovery
The LLVM infrastructure provides its own crash handling via the print-on-crash and print-on-crash-path CLI options (registered in the 0x4F0000–0x51FFFF range). When enabled, the LLVM pass manager dumps the current IR to a specified path on any unhandled signal (SIGSEGV, SIGABRT, etc.). This is separate from the EDG handler and covers the optimization and codegen phases.
Concurrent API Protection
The global constructor at 0x4A5810 checks LIBNVVM_DISABLE_CONCURRENT_API. When set (to any value), byte_4F92D70 = 1 disables thread-safe LibNVVM API usage. The pipeline orchestrator (sub_12C35D0) uses pthread_once(&dword_4F92D9C, init_routine) for one-time setup, and TLS at __readfsqword(0)-24 stores exception handling stack frames while __readfsqword(0)-32 stores the cleanup function sub_12BCC20. These TLS slots ensure that concurrent compilations in the same process do not corrupt each other's state.
Timer Infrastructure
Compilation timing is implemented through a hierarchical timer system. Timer creation (sub_C996C0) takes a label and context string; timer stop (sub_C9AF60) records the elapsed time. The timer hierarchy is:
"CUDA C++ Front-End" ← EDG parsing + IL-to-IR conversion (Path A only)
└─ "LibNVVM" ← Full optimization + codegen pipeline
├─ "LNK" ← Module linking (sub_12C06E0)
├─ "OPT" ← LLVM optimization (sub_12E7E70)
│ ├─ "Phase I" ← Analysis + early optimization
│ └─ "Phase II" ← Backend optimization + codegen prep
├─ "OPTIXIR" ← OptiX IR generation (optional)
└─ "LLC" ← SelectionDAG codegen (sub_12F5100)
The profiler is controlled by sub_C96F30() (returns nonzero when active). Timer data is written to the output file after compilation via sub_C9C600 (Path A) or sub_16DD960 (Path B). The -time flag or environment variable controls activation. The timer names appear in the profiler output, making them essential for identifying compilation bottlenecks.
Architecture Detection — sub_95EB40
One of the most important functions in cicc: the architecture detection system translates a single user-facing flag like -arch=compute_90a into three independent flag strings, one for each pipeline stage. This 3-column fan-out is necessary because the EDG frontend, the LLVM optimizer, and the LLVM backend each use different flag formats to specify the target architecture. The mapping is stored in a std::map<string, ArchTriple> in a red-black tree at a1+248.
| Column | Target | Example |
|---|---|---|
| Column 1 | EDG frontend | -R __CUDA_ARCH=750 |
| Column 2 | Optimizer | -opt-arch=sm_75 |
| Column 3 | LLC backend | -mcpu=sm_75 |
Architecture Validation Bitmask
Before the 3-column mapping is consulted, the architecture number is validated against a hardcoded 64-bit bitmask. This is a fast rejection filter: the SM number minus 75 gives a bit index, and if that bit isn't set in the constant 0x60081200F821, the architecture is rejected. This means cicc v13.0 has a fixed, compile-time-determined set of supported architectures — you cannot add new SM targets without rebuilding the binary.
offset = arch_number - 75;
if (offset > 0x2E || !_bittest64(&0x60081200F821, offset))
→ ERROR: "is an unsupported option"
Valid architectures (bit positions in 0x60081200F821). Note the gaps: SM 76–79, 81–85, 91–99, 101–102, 104–109, and 111–119 are all absent:
| Bit | SM | Generation |
|---|---|---|
| 0 | 75 | Turing |
| 5 | 80 | Ampere |
| 11 | 86 | Ampere |
| 12 | 87 | Ampere (Jetson Orin) |
| 13 | 88 | Ada (undocumented) |
| 14 | 89 | Ada Lovelace |
| 15 | 90 | Hopper |
| 25 | 100 | Blackwell |
| 28 | 103 | Blackwell |
| 35 | 110 | Jetson Thor |
| 45 | 120 | Blackwell (sm120) — RTX 50xx / Pro |
| 46 | 121 | Blackwell (sm120) — DGX Spark |
Suffix handling: a and f variants share the base SM number for validation but get distinct -mcpu=sm_XXa/-mcpu=sm_XXf strings.
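The bitmask test can be reproduced in a few lines of portable C. The constant and the 0x2E bound are taken from the pseudocode above; a plain shift stands in for the `_bittest64` intrinsic, and the function name `sm_supported` is hypothetical.

```c
#include <stdint.h>

#define ARCH_MASK UINT64_C(0x60081200F821)

/* Return 1 if the SM number passes the validation bitmask, else 0. */
int sm_supported(unsigned sm) {
    /* Unsigned subtraction wraps to a huge value when sm < 75,
       so the single range test below also rejects small numbers. */
    unsigned offset = sm - 75u;
    if (offset > 0x2E)
        return 0;
    return (int)((ARCH_MASK >> offset) & 1u);
}
```

For example, `sm_supported(90)` tests bit 15 and succeeds, while `sm_supported(76)` tests bit 1 and fails.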
Architecture Parsing in the EDG Frontend
The EDG frontend (sub_617BD0, option ID 0x52 = --nv_arch) performs its own independent architecture parsing that produces three global variables:
| Global | Address | Purpose |
|---|---|---|
| unk_4D045E8 | 0x4D045E8 | SM compute version (integer: 75, 80, ..., 121) |
| unk_4D045E4 | 0x4D045E4 | Accelerated flag (1 if suffix a) |
| unk_4D045E0 | 0x4D045E0 | Fast flag (1 if suffix f; also sets accelerated=1) |
The f suffix (fast-mode) is new to SM >= 100 architectures. When present, it implies a forward-compatible feature set that may not exactly match the base SM version's capabilities.
Flag Catalog — sub_9624D0
The flag catalog is the largest function in the entry point range at 75KB. It takes the raw CLI arguments and sorts them into four output vectors, one per pipeline stage (lnk, opt, lto, llc). This is the translation layer between user-facing flags and the internal per-stage options that each pipeline component understands.
A clever detail: the function takes a "mode cookie" parameter (a4) that distinguishes CUDA compilation (0xABBA) from OpenCL compilation (0xDEED). Several flags behave differently depending on this cookie — for example, -prec-div=0 maps to -nvptx-prec-divf32=1 in CUDA mode but -nvptx-prec-divf32=0 in OpenCL mode, reflecting the different default precision expectations of the two languages.
| Field | Value |
|---|---|
| Address | 0x9624D0 |
| Size | 75KB (2,626 lines) |
| Mode cookie | a4: 0xABBA=CUDA, 0xDEED=OpenCL |
| Output vectors | lnk, opt, lto, llc (32-byte std::string elements with SSO) |
-Ofast-compile Levels
NVIDIA's -Ofast-compile is a compile-time vs runtime-performance tradeoff. At "max" level, it disables memory space optimization and LSA optimization entirely — these are expensive analysis passes that improve runtime performance but slow compilation significantly. The "mid" and "min" levels provide intermediate points. This feature is targeted at iterative development workflows where compile speed matters more than code quality.
| Level String | Internal Value | Effect |
|---|---|---|
| "max" | 2 | Most optimizations skipped, forces -lsa-opt=0 -memory-space-opt=0 |
| "mid" | 3 | Medium speedup |
| "min" | 4 | Minimal speedup |
| "0" | 1 → reset to 0 | Disabled |
Error: "libnvvm : error: -Ofast-compile specified more than once". Only one -Ofast-compile per compilation is allowed.
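The level parsing and the once-only rule can be sketched as follows. The internal values come from the table above; the negative return codes, the `already_seen` parameter, and the function name `parse_ofast_compile` are assumptions, as is treating "0" as counting toward the once-only rule.

```c
#include <string.h>

/* Parse an -Ofast-compile level.
   Returns the internal value (2, 3, 4, or 0 for disabled),
   -1 for an unknown level, and -2 when the flag was already given
   (the case reported as "specified more than once"). */
int parse_ofast_compile(const char *level, int *already_seen) {
    if (*already_seen)
        return -2;
    *already_seen = 1;
    if (strcmp(level, "max") == 0)
        return 2;
    if (strcmp(level, "mid") == 0)
        return 3;
    if (strcmp(level, "min") == 0)
        return 4;
    if (strcmp(level, "0") == 0)
        return 0;   /* stored as 1 internally, then reset to 0 */
    return -1;
}
```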
Flag-to-Pipeline Routing (Selected)
This table shows how a single user-facing flag gets split into per-stage options. The pattern reveals NVIDIA's compilation architecture: the LNK stage communicates via -R macro definitions (these become #defines visible to the linker), the OPT stage uses NVIDIA-specific optimizer flags (-opt-use-*), and the LLC stage uses LLVM backend flags (-nvptx-*). Some flags like -ftz=1 propagate to all three stages, while others like -aggressive-inline only affect the optimizer.
| User Flag | LNK Forward | OPT Forward | LLC Forward |
|---|---|---|---|
| -ftz=1 | -R __CUDA_FTZ=1 | -nvptx-f32ftz | -nvptx-f32ftz |
| -prec-div=1 (CUDA) | -R __CUDA_PREC_DIV=1 | -opt-use-prec-div=true | -nvptx-prec-divf32=2 |
| -prec-div=0 (CUDA) | — | -opt-use-prec-div=false | -nvptx-prec-divf32=1 |
| -prec-sqrt=1 | -R __CUDA_PREC_SQRT=1 | — | -nvptx-prec-sqrtf32=1 |
| -fma=1 | — | — | -nvptx-fma-level=1 |
| -fast-math (CUDA) | -R __CUDA_USE_FAST_MATH=1 | -opt-use-fast-math | — |
| -unsafe-math | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-fma-level=1 -nvptx-f32ftz |
| -aggressive-inline | — | -inline-budget=40000 | — |
| -new-nvvm-remat | — | — | -enable-new-nvvm-remat=true -nv-disable-remat=true -rp-aware-mcse=true |
nvcc→cicc Flag Translation — sub_8FE280
When cicc is invoked by nvcc (the CUDA compiler driver), the flags arrive in nvcc's format and need to be translated to cicc's internal format. This translation happens through a red-black tree at qword_4F6D2A0, populated once on first use (guarded by qword_4F6D2C8). Each entry maps an nvcc flag to a pair: an EDG passthrough string and a cicc internal string. Some flags only affect one side — for example, -fmad=1 has no EDG equivalent (FMA is a backend concern) but maps to cicc's -fma=1. Others are dual-mapped: -O0 becomes both --device-O=0 for EDG and -opt=0 for cicc.
| nvcc Flag | EDG Passthrough | cicc Internal |
|---|---|---|
| -O0..-O3 | --device-O=N | -opt=N |
| -fmad=1 | — | -fma=1 |
| -prec_sqrt=1 | — | -prec-sqrt=1 |
| -Ofast-compile=max | — | -Ofast-compile=max |
| -Ofc=max | — | -Ofast-compile=max (alias) |
| --emit-optix-ir | --emit-lifetime-intrinsics | --emit-optix-ir |
| -discard-value-names | --discard_value_names=1 | -discard-value-names=1 |
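The translation table behaves like an ordered map from an nvcc flag to a pair of strings. The sketch below uses a sorted array with `bsearch` as a stand-in for the red-black tree; it lists only a handful of the mappings from the table, with -O0 standing for the -O0..-O3 family, and the names `xlat`, `xlat_table`, and `translate` are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* One mapping: nvcc flag -> (EDG passthrough, cicc internal).
   A NULL edg field means the flag has no EDG equivalent. */
struct xlat { const char *nvcc, *edg, *cicc; };

/* Must stay sorted by strcmp order of the nvcc field for bsearch. */
static const struct xlat xlat_table[] = {
    { "-O0",                  "--device-O=0",            "-opt=0" },
    { "-Ofast-compile=max",   NULL,                      "-Ofast-compile=max" },
    { "-Ofc=max",             NULL,                      "-Ofast-compile=max" },
    { "-discard-value-names", "--discard_value_names=1", "-discard-value-names=1" },
    { "-fmad=1",              NULL,                      "-fma=1" },
    { "-prec_sqrt=1",         NULL,                      "-prec-sqrt=1" },
};

/* Compare a search key (a C string) with a table element. */
static int xlat_cmp(const void *key, const void *elem) {
    return strcmp((const char *)key, ((const struct xlat *)elem)->nvcc);
}

/* Return the mapping for an nvcc flag, or NULL if it is not translated. */
const struct xlat *translate(const char *flag) {
    return bsearch(flag, xlat_table,
                   sizeof xlat_table / sizeof xlat_table[0],
                   sizeof xlat_table[0], xlat_cmp);
}
```

Both a balanced tree and a sorted array give logarithmic lookup; the array is simply the shorter way to show the key-to-pair structure.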
Environment Variables
cicc checks 20 distinct environment variables across its subsystems. The six NVIDIA-specific variables are the most important for understanding and reimplementing the entry point behavior:
| Variable | Function | Effect |
|---|---|---|
| NVVMCCWIZ | sub_8F9C90 | Set to 553282 → enables wizard mode (byte_4F6D280 = 1) |
| NVVM_IR_VER_CHK | sub_12BFF60 | Set to "0" → disables NVVM IR version checking |
| LIBNVVM_DISABLE_CONCURRENT_API | ctor at 0x4A5810 | Any value → disables thread-safe API (byte_4F92D70 = 1) |
| NV_NVVM_VERSION | sub_8F9C90, sub_12B9F70 | "nvvm70" or "nvvm-latest" → controls Path A/B default and IR compat mode |
| LIBNVVM_NVVM_VERSION | sub_12B9F70 | Same as NV_NVVM_VERSION (checked as fallback) |
| LLVM_OVERRIDE_PRODUCER | ctors at 0x48CC90, 0x4CE640 | Overrides the producer string in output bitcode metadata |
The NV_NVVM_VERSION and LIBNVVM_NVVM_VERSION variables are obfuscated in the binary using the same XOR+ROT13 cipher as the CLI option strings. They are decrypted from 0x3C23A90 and 0x42812F0 respectively.
Key Global Variables
These globals persist across the entire compilation and are accessed from multiple subsystems. The wizard mode flag and flag mapping tree are set during CLI parsing and read throughout the pipeline. The embedded libdevice addresses are compile-time constants (.rodata), while the data model width is set during architecture configuration.
| Variable | Purpose |
|---|---|
| byte_4F6D280 | Wizard mode flag (gates -v, -keep) |
| qword_4F6D2A0 | Flag mapping red-black tree root |
| qword_4F6D2C8 | Tree initialization guard |
| byte_4F6D2D0 | --partial-link active flag |
| byte_4F6D2DC | --force-llp64 active flag |
| unk_3EA0080 | Embedded libdevice bitcode (Path A, 455,876 bytes) |
| unk_420FD80 | Embedded libdevice bitcode (Path B, 455,876 bytes) |
| off_4B90FE0 | LLVM options table (Path A, 37 entries) |
| off_4C6EEE0 | LLVM options table (Path B, 37 entries) |
| unk_4F06A68 | Data model width (8=64-bit, 4=32-bit) |
| unk_4D0461C | Enable p3:32:32:32 in data layout (shared mem 32-bit ptrs) |
| byte_4F92D70 | Concurrent API disabled flag |
| dword_4F92D9C | pthread_once guard for one-time pipeline setup |
| qword_4FBB3B0 | TLS: optimization phase counter (1=Phase I, 2=Phase II, 3=done) |
| unk_4F6D2F8 | Global module pointer (set by sub_908850 after EDG binding) |
Function Map — Entry Point Cluster
| Function | Address | Size | Role |
|---|---|---|---|
| main() thunk → sub_8F9C90 | 0x4396A0 | 16 B | -- |
| String deobfuscation (XOR + ROT13) | 0x8F98A0 | ~512 B | -- |
| Push string to std::vector<std::string> | 0x8F9C20 | ~128 B | -- |
| Real main — CLI parser + dispatcher | 0x8F9C90 | 10,066 B | -- |
| nvcc→cicc flag translation (red-black tree) | 0x8FE280 | ~4 KB | -- |
| Path A CLI processing | 0x900130 | 39 KB | -- |
| Path A orchestrator (simple mode) | 0x902D10 | ~9 KB | -- |
| LLC stage verbose callback | 0x903730 | ~5 KB | -- |
| LNK stage verbose callback | 0x903BA0 | ~5 KB | -- |
| NVVM IR container parser (Path A) | 0x9047E0 | 10 KB | -- |
| CUDA C++ Front-End (lgenfe stage) | 0x905880 | ~6 KB | -- |
| lgenfe single-stage wrapper (Path A) | 0x905E50 | ~256 B | -- |
| LibNVVM pipeline driver (Path A) | 0x905EE0 | 43 KB | -- |
| Backend SM config + EDG module binding | 0x908850 | 10 KB | -- |
| Architecture detection (3-column fan-out) | 0x95EB40 | 38 KB | -- |
| Flag catalog (4 output vectors) | 0x9624D0 | 75 KB | -- |
| Pipeline option parser (4 stage vectors) | 0x9685E0 | ~8 KB | -- |
| Path B CLI processing | 0x125FB30 | ~8 KB | -- |
| Path B entry (simple mode) | 0x1262860 | ~4 KB | -- |
| Path B LNK verbose callback | 0x1263280 | ~1 KB | -- |
| Path B OPT verbose callback | 0x12636E0 | ~1 KB | -- |
| NVVM container parser (Path B) | 0x12642A0 | ~3 KB | -- |
| Path B pre-compilation setup | 0x1265340 | ~4 KB | -- |
| lgenfe single-stage wrapper (Path B) | 0x12658E0 | ~256 B | -- |
| LibNVVM compilation entry (Path B) | 0x1265970 | 48 KB | -- |
| LibNVVM API dispatch table (25 entries) | 0x12BC0F0 | ~3 KB | -- |
| Thunk → sub_12BC8B0 (nvvmCUAddModuleFromBuffer) | 0x12BCB00 | ~64 B | -- |
| NVVM IR version checker | 0x12BFF60 | ~9 KB | -- |
| Module linker (LNK stage core) | 0x12C06E0 | 63 KB | -- |
| 4-stage pipeline orchestrator | 0x12C35D0 | 41 KB | -- |
| Stage bitmask parser | 0x12D2AA0 | ~4 KB | -- |
| Concurrency eligibility check | 0x12D4250 | ~2 KB | -- |
| Two-phase optimizer entry | 0x12E7E70 | ~8 KB | -- |
| Concurrent worker entry point | 0x12E7B90 | ~4 KB | -- |
| LLC core (SelectionDAG codegen) | 0x12F5100 | ~12 KB | -- |
| OptiX IR generator | 0x12F9270 | ~6 KB | -- |
| Path B context initialization | 0x1602D10 | ~2 KB | -- |
Cross-References
- EDG Frontend: sub_617BD0 (lgenfe_main), the 282-case CLI dispatch inside the EDG 6.6 frontend
- NVVM Container Format: container parsing by sub_9047E0 (Path A) and sub_12642A0 (Path B)
- Optimizer Pipeline: the OPT stage driven by sub_12E7E70 (two-phase optimization)
- IR Generation: module creation via sub_908850 (EDG module binding)
- PTX Emission: the LLC stage's PTX output via sub_12F5100