
nvlink Reverse Engineering Reference

This wiki documents the internal architecture of nvlink v13.0.88, NVIDIA's CUDA device linker, based on static reverse engineering of the stripped x86-64 ELF binary. nvlink is shipped as part of CUDA Toolkit 13.0 (release V13.0.88, build cuda_13.0.r13.0/compiler.36424714_0, August 2025).

| Property | Value |
|---|---|
| Binary | nvlink v13.0.88, 37 MB, x86-64, stripped ELF |
| Build | cuda_13.0.r13.0/compiler.36424714_0 (Wed Aug 20, 2025) |
| Functions | 40,532 total, 552K call edges |
| Strings | 31K extracted |
| Tool identity | NVIDIA (R) Cuda linker |
| Copyright | 2005--2025 NVIDIA Corporation |
| Supported SMs | 22 architectures, sm_75 (Turing) through sm_121 (Blackwell) |
| Input formats | cubin, PTX, fatbin, NVVM IR / LTO IR, archives (.a), host ELF (.o/.so) |
| Output formats | CUDA device ELF (cubin), capsule mercury (SM100+), host linker scripts |

Target audience: Senior C++ developers with ELF/linker experience seeking reimplementation-grade understanding of NVIDIA's device linking pipeline.

nvlink is the CUDA device linker -- the tool that combines separately compiled GPU object files into a single device executable. It is invoked by nvcc (or directly by build systems) after cicc and ptxas have compiled individual translation units into cubin objects. nvlink resolves cross-TU symbol references, applies R_CUDA relocations, merges .nv.info metadata, lays out shared memory, eliminates dead code, and emits the final device ELF.

But nvlink is not just a linker. It is a hybrid linker-compiler: a large fraction of the binary is an embedded GPU compiler backend that enables link-time optimization and JIT compilation of PTX and NVVM IR inputs. The tool name says "linker," but the binary contains a full assembler.

The 95/5 Split

The single most important structural fact about nvlink is its size distribution:

| Component | Approximate size | Function count | Role |
|---|---|---|---|
| Embedded ptxas backend | ~24 MB (65%) | ~20,000 | PTX-to-SASS assembler: instruction selection, register allocation, scheduling, encoding |
| Instruction encoding tables | ~8 MB (22%) | ~8,000 | Per-SM binary encoding/decoding for SASS instructions |
| Linker core | ~1.2 MB (3%) | ~600 | ELF merge, symbol resolution, relocation, layout, output |
| LTO orchestration | ~1.5 MB (4%) | ~800 | IR collection, libnvvm dlopen, PTX compilation dispatch |
| Infrastructure | ~1.5 MB (4%) | ~1,100 | Memory arenas, option parsing, error reporting, compression |
| Mercury / FNLZR | ~0.8 MB (2%) | ~400 | SM100+ capsule mercury post-link transformation |

Approximately 95% of the binary is compiler backend -- the same ptxas assembler that ships as a separate tool in the CUDA Toolkit, statically linked into nvlink to support LTO and PTX JIT compilation. Only the ~1.2 MB linker core (roughly 3% of the binary) implements the actual linking logic. This has a direct consequence for reverse engineering: most functions in the binary are instruction selection patterns, SASS encoders, or register allocator internals, not linker logic.

Architecture Overview

nvlink operates in a linear pipeline with two optional compiler paths:

Input Files (cubin, PTX, fatbin, NVVM IR, archives, host ELF)
  |
  +-- [File type detection: 56-byte header read, magic number dispatch]
  |
  +-- PTX input?  --> embedded ptxas JIT: PTX -> SASS cubin
  |
  +-- NVVM IR / LTO IR?  --> dlopen libnvvm.so, compile IR -> PTX,
  |                          then embedded ptxas: PTX -> SASS cubin
  |
  +-- fatbin?  --> extract members, recurse per-member
  |
  +-- archive (.a)?  --> iterate members, recurse
  |
  v
Merge Phase (merge_elf: 89KB function, iterates all sections)
  |
  +-- Symbol resolution, weak symbol selection
  +-- Section merging (.nv.global, .nv.shared, .nv.constant, .text)
  +-- Debug section merging (DWARF line tables, .nv.info)
  |
  v
Shared Memory Layout (65KB function, overlap set analysis)
  |
  v
Entry Property Computation (98KB function, register/barrier propagation)
  |
  v
Dead Code Elimination (callgraph reachability, removes unused functions)
  |
  v
Data Layout Optimization (constant dedup, overlapping data merge)
  |
  v
Layout Phase (section ordering, address assignment)
  |
  v
Relocation Phase (R_CUDA relocation application, UFT/UDT resolution)
  |
  v
Finalization (final relocation patching, ELF section generation)
  |
  v
Output (ELF serialization -> cubin file)
  |
  +-- SM100+?  --> FNLZR post-link transform -> capsule mercury
  +-- --gen-host-linker-script?  --> write SECTIONS { .nvFatBinSegment ... }
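
The header-based dispatch at the top of the pipeline can be sketched as follows. This is an illustrative model, not nvlink's code: the 56-byte header read and the e_machine value 190 for device ELF come from this wiki, the ELF and archive magics are standard, and the fatbin magic constant shown here is an assumption.

```python
# Hypothetical sketch of nvlink-style input classification: read a
# fixed-size header and dispatch on magic bytes.
ELF_MAGIC = b"\x7fELF"
AR_MAGIC = b"!<arch>\n"
FATBIN_MAGIC = (0xBA55ED50).to_bytes(4, "little")  # assumed constant

def classify_input(header: bytes) -> str:
    """Classify an input file from its first 56 bytes."""
    if header.startswith(ELF_MAGIC):
        # Both cubin and host ELF start with \x7fELF; distinguish them by
        # e_machine (190 for CUDA device ELF), read from offset 18.
        e_machine = int.from_bytes(header[18:20], "little")
        return "cubin" if e_machine == 190 else "host-elf"
    if header.startswith(AR_MAGIC):
        return "archive"
    if header.startswith(FATBIN_MAGIC):
        return "fatbin"
    # PTX and textual IR inputs are plain text; a crude heuristic suffices
    # for this sketch.
    if header.lstrip().startswith((b"//", b".version")):
        return "ptx"
    return "unknown"
```

A real implementation would recurse into fatbin and archive members, as the diagram above shows.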

LTO Model

When invoked with -lto, nvlink performs link-time optimization:

  1. Collect: Gather NVVM IR modules from all input objects and fatbin members
  2. Compile IR to PTX: dlopen libnvvm.so (path from --nvvmpath), invoke the NVVM compiler API to lower IR to PTX. Supports both whole-program and partial LTO modes (--force-whole-lto, --force-partial-lto)
  3. Assemble PTX to SASS: Feed the resulting PTX through the embedded ptxas backend. Supports split compilation with thread pool parallelism (--split-compile-extended)
  4. Link: Merge the resulting cubin objects through the normal linking pipeline

The LTO pipeline reuses option forwarding (--Xptxas, --Xnvvm) to pass flags through to the embedded compiler stages.
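
The forwarding mechanism can be modeled as a simple pre-pass over the command line that buckets flags per sub-tool and replays them when the embedded compiler stage runs. This is a minimal sketch, not nvlink's actual parser; the accepted spellings are assumptions.

```python
# Hedged sketch of -Xptxas / -Xnvvm style option forwarding: collect the
# forwarded flags per sub-tool, leaving linker-level options untouched.
def collect_forwarded(argv):
    forwarded = {"ptxas": [], "nvvm": []}
    rest = []
    it = iter(argv)
    for arg in it:
        if arg in ("-Xptxas", "--Xptxas"):
            forwarded["ptxas"].append(next(it))      # separate-argument form
        elif arg in ("-Xnvvm", "--Xnvvm"):
            forwarded["nvvm"].append(next(it))
        elif arg.startswith("--Xptxas="):
            forwarded["ptxas"].append(arg.split("=", 1)[1])  # joined form
        elif arg.startswith("--Xnvvm="):
            forwarded["nvvm"].append(arg.split("=", 1)[1])
        else:
            rest.append(arg)                          # linker-level option
    return forwarded, rest
```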

Mercury

Mercury is NVIDIA's internal codename for a new ISA format used on SM100+ (Blackwell and later) architectures. When --arch sm_100 or higher is specified, nvlink:

  • Sets an internal mercury mode flag (byte at global +0x222)
  • Routes output through the FNLZR (Finalizer) post-link binary rewriter
  • Produces "capsule mercury" output instead of traditional cubin
  • Uses R_MERCURY relocations alongside R_CUDA relocations
  • Generates .nv.merc and related mercury-specific ELF sections

The Mercury codename appears ROT13-encoded throughout the ptxas backend as zrephel (Mercury), part of NVIDIA's standard obfuscation of internal pass names and section identifiers.
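
The mapping is easy to verify with Python's built-in ROT13 codec:

```python
import codecs

# ROT13 is an involution: decoding the obfuscated name recovers the
# codename, and encoding the codename reproduces the obfuscated form.
assert codecs.decode("zrephel", "rot13") == "mercury"
assert codecs.encode("mercury", "rot13") == "zrephel"
```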

Supported Architectures

nvlink supports 22 SM architectures spanning five GPU generations:

| Generation | SM versions | Notes |
|---|---|---|
| Turing | sm_75 | Minimum supported architecture |
| Ampere | sm_80, sm_86, sm_87, sm_89 | sm_89 is Ada Lovelace (different die) |
| Hopper | sm_90, sm_90a | Cluster launch, WGMMA, TMA |
| Blackwell | sm_100, sm_100a, sm_103, sm_103a | Mercury format, capsule output |
| Next-gen | sm_110, sm_120, sm_121 | Jetson Thor, consumer Blackwell, DGX Spark |

Architecture selection (--arch) gates: instruction encoding tables, relocation types, Mercury vs. SASS output path, FNLZR compatibility checks, and ptxas backend ISel dispatch via per-SM vtables.
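
The vtable-style gating can be modeled roughly as below. The field names and profile contents are illustrative assumptions; only the SM numbers and the SM100+ Mercury gating come from the text above.

```python
# Hypothetical model of per-SM profile dispatch: each architecture record
# bundles the flags and handlers the backend selects on.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ArchProfile:
    sm: int                 # e.g. 75, 90, 100
    mercury: bool           # SM100+ routes through FNLZR / capsule output
    encode: Callable[[str], bytes] = field(default=lambda insn: b"")

PROFILES = {
    75:  ArchProfile(sm=75,  mercury=False),
    90:  ArchProfile(sm=90,  mercury=False),
    100: ArchProfile(sm=100, mercury=True),
    121: ArchProfile(sm=121, mercury=True),
}

def output_path(arch: int) -> str:
    """Select the output path the way --arch gates it."""
    return "capsule-mercury" if PROFILES[arch].mercury else "cubin"
```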

Key Data Structures

The linker's internal state revolves around a few central structures:

  • elfw (ELF wrapper): The primary data structure for both input and output ELF objects. Created by elfw_create (0x4438F0), contains sections, symbols, relocations, string tables. The merge phase copies sections between elfw objects.
  • Memory arenas: Custom allocator (arena_alloc at 0x4307C0) used by nearly every function. Thread-safe, size-class-based free lists (625 buckets), large-block page pool. Named arenas ("nvlink option parser", "nvlink memory space", "elfw memory space") for diagnostics.
  • Callgraph: Directed graph of function-to-function calls, built during merge and used for dead code elimination, register count propagation, and stack size computation.
  • Architecture profile: Per-SM configuration record accessed from a global profile database in the 0x470000 region. Gates feature availability, encoding table selection, and compatibility checks.
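
A toy model of the size-class arena described above: allocations round up to a size class, and freed blocks go onto per-class free lists for reuse. The 625-bucket count and named-arena diagnostics come from the text; the class step and everything else here are illustrative assumptions.

```python
# Illustrative sketch of a size-class arena allocator -- a model in the
# spirit of arena_alloc, not the binary's actual code.
class Arena:
    def __init__(self, name, class_step=16, num_classes=625):
        self.name = name                     # named arenas aid diagnostics
        self.class_step = class_step
        self.free_lists = [[] for _ in range(num_classes)]
        self.live = 0

    def _class_of(self, size):
        # Round the request up to the next size class.
        return (size + self.class_step - 1) // self.class_step

    def alloc(self, size):
        cls = self._class_of(size)
        self.live += 1
        if cls < len(self.free_lists) and self.free_lists[cls]:
            return self.free_lists[cls].pop()     # reuse a freed block
        return bytearray(cls * self.class_step)   # carve a new block

    def free(self, block):
        cls = self._class_of(len(block))
        self.live -= 1
        if cls < len(self.free_lists):
            self.free_lists[cls].append(block)    # recycle into free list
```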

Binary Layout (Address Map)

| Address range | Size | Subsystem |
|---|---|---|
| 0x400000--0x470000 | 448 KB | main(), CLI options, memory arenas, ELF core, merge, layout, relocation, output |
| 0x470000--0x530000 | 768 KB | Architecture profiles, finalization, debug info, archive parsing, knobs, Mercury/capsule dispatch |
| 0x530000--0x620000 | 960 KB | IR node primitives, SM50--7x ISel backend, MercExpand engine |
| 0x620000--0x920000 | 3 MB | SASS instruction binary encoders (1000+ functions, 4--9 KB each) |
| 0x920000--0xA70000 | 1.3 MB | Instruction descriptor initialization tables |
| 0xA70000--0xCA0000 | 2 MB | Instruction codec: encoding + decoding infrastructure |
| 0xCA0000--0xDA0000 | 1 MB | SM80 (Ampere) instruction selection / encoding backend |
| 0xDA0000--0xF16000 | 1.5 MB | SM100+ (Blackwell) instruction encoder/decoder |
| 0xF16000--0x100C000 | 984 KB | SM75 (Turing) instruction selection / encoding backend |
| 0x100C000--0x11EA000 | 1.9 MB | Shared backend codegen + SM89/90 (Ada/Hopper) backend |
| 0x11EA000--0x12B0000 | 824 KB | PTX compiler frontend: ISel patterns, parsing, operand construction |
| 0x12B0000--0x1430000 | 1.5 MB | LTO compilation engine: ISel, DWARF, ELF emission, MMA lowering |
| 0x1430000--0x15C0000 | 1.6 MB | PTX assembler frontend: instruction handlers, builtin registration |
| 0x15C0000--0x16E0000 | 1.2 MB | CUDA device-code compilation backend (PTX-to-SASS pipeline) |
| 0x16E0000--0x1850000 | 1.5 MB | Backend compiler: 6 major subsystems |
| 0x1850000--0x1A00000 | 1.7 MB | Instruction scheduling, register allocation, codegen verification |
| 0x1A00000--0x1B60000 | 1.4 MB | Core SASS backend: operand emission, encoding |
| 0x1B60000--0x1D32172 | 1.8 MB | ISel + lowering, end of .text |

Wiki Structure

This wiki is organized into 13 sections plus supporting reference pages. Every page is written at reimplementation-grade depth.

Linking Pipeline

End-to-end flow from main() through output writing. Covers entry point, CLI parsing, mode dispatch, library resolution, input file loop, merge, layout, relocation, finalization, and output serialization. Start here for the linker-specific logic.

  • Pipeline Overview -- End-to-end pipeline diagram with phase boundaries and key function addresses.
  • Entry Point & Main -- The 58KB main() function: initialization, file dispatch, phase orchestration.
  • CLI Option Parsing -- 60+ registered options, the option parser infrastructure, validation logic.
  • Mode Dispatch -- How input file type determines the processing path.
  • Library Resolution -- -l/-L search, libcudadevrt special handling, archive iteration.
  • Input File Loop -- The linked-list input iteration with per-file type dispatch.
  • Merge Phase -- The 89KB merge_elf function: section merging, symbol resolution, weak selection.
  • Layout Phase -- Shared memory layout, section ordering, address assignment.
  • Relocation Phase -- R_CUDA relocation application, UFT/UDT resolution.
  • Finalization Phase -- Final patching, entry property computation, register propagation.
  • Output Writing -- ELF serialization, host linker script generation, verbose stats.

Input Processing

How each input format is identified, validated, and converted to an internal ELF representation.

Linker Core

The fundamental linking algorithms: symbol resolution, section merging, relocation processing, and optimization.

LTO

The LTO pipeline: IR collection, libnvvm integration, compilation modes, and option forwarding.

Embedded ptxas

The ~24MB PTX assembler backend statically linked into nvlink. Covers ISel, register allocation, scheduling, and encoding.

Mercury

The SM100+ (Blackwell) capsule mercury format and associated infrastructure.

GPU Targets

Architecture profiles, compatibility logic, and per-generation feature details.

CUDA Device ELF

The output format: NVIDIA's CUDA device ELF (cubin) with its proprietary sections and metadata.

Debug Information

DWARF processing, line table merging, and NVIDIA-specific debug extensions.

Infrastructure

Shared services used across the linker: memory management, error handling, threading, and utilities.

Data Structures

Detailed layouts of the key internal structures.

Configuration

Command-line interface, environment variables, and embedded ptxas option forwarding.

Reference

Lookup tables and catalogs for quick reference during reverse engineering.

Reading Order

The wiki contains 85+ pages. These four reading paths guide you through the material based on your goal. Each path lists pages in the recommended order.

Path 1: New to GPU Linking

Goal: understand what a CUDA device linker does and how nvlink transforms input cubins into a final device executable.

  1. Pipeline Overview -- The end-to-end flow diagram. Establishes all phases and their relationships. Read this first to build the mental model that every other page assumes.
  2. Entry Point & Main -- How nvlink is invoked, the 58KB main() function, initialization sequence, and phase orchestration.
  3. File Type Detection -- The 56-byte header read and magic number dispatch. Understand how nvlink decides whether an input is a cubin, PTX, fatbin, NVVM IR, archive, or host ELF.
  4. Cubin Loading -- The most common input path: how device ELF objects are validated, parsed, and imported into the internal elfw representation.
  5. Merge Phase -- The 89KB merge_elf function that combines multiple input ELFs. This is the heart of the linker: section merging, symbol resolution, and weak symbol selection.
  6. Symbol Resolution -- How global, local, and weak symbols are resolved across translation units. COMDAT handling and multi-definition detection.
  7. R_CUDA Relocations -- CUDA-specific relocation types and how they differ from standard ELF relocations.
  8. Layout Phase -- Shared memory layout, section ordering, address assignment. The overlap-set analysis that makes .nv.shared work.
  9. Relocation Phase -- How relocations are applied to produce position-dependent binary code.
  10. Output Writing -- Final ELF serialization, host linker script generation, and the Mercury output path for SM100+.
  11. Device ELF Format -- The output format: e_machine 190, NVIDIA-proprietary sections, and the differences from standard ELF.
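
The weak/strong selection rule covered in step 6 can be sketched as a small model -- a hedged illustration of the standard ELF convention, not nvlink's implementation:

```python
# Minimal model of symbol resolution: a strong definition beats weak ones,
# two strong definitions conflict, and among weak definitions the first
# one seen wins.
def resolve(symbols):
    """symbols: list of (name, binding) with binding in {'strong', 'weak'}."""
    chosen = {}
    for name, binding in symbols:
        prev = chosen.get(name)
        if prev is None:
            chosen[name] = binding
        elif binding == "strong":
            if prev == "strong":
                raise ValueError(f"multiple definition of {name}")
            chosen[name] = "strong"
        # a weak definition never displaces an existing one
    return chosen
```

COMDAT grouping adds another layer on top of this rule; see the Symbol Resolution page.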

Path 2: LTO Deep-Dive

Goal: understand nvlink's link-time optimization pipeline -- how NVVM IR modules are collected, compiled to PTX via libnvvm, assembled to SASS via the embedded ptxas, and merged back into the normal linking flow.

  1. LTO Overview -- Architecture of the LTO pipeline: whole-program vs. relocatable compilation, the three-phase flow (collect IR, compile IR to PTX, assemble PTX to SASS).
  2. NVVM IR / LTO IR Input -- How NVVM IR modules are detected in input files, the -lto requirement, and libdevice detection.
  3. libnvvm Integration -- The dlopen/dlsym mechanics for loading libnvvm.so, the API surface nvlink calls, and __nvvmHandle resolution.
  4. Whole vs. Partial LTO -- How --force-whole-lto and --force-partial-lto change compilation scope, and the fallback behavior when LTO fails.
  5. Option Forwarding to cicc -- How --Xptxas, --Xnvvm, and --maxrregcount are forwarded through the LTO pipeline.
  6. Split Compilation -- Thread pool dispatch for --split-compile-extended, work item lifecycle, and per-function parallelism.
  7. PTX Input & JIT -- How the PTX output from libnvvm is fed into the embedded ptxas backend for assembly.
  8. Merge Phase -- The LTO-compiled cubins re-enter the normal merge flow. Understanding the merge phase is necessary to see where LTO output rejoins the pipeline.
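
The split-compilation dispatch described in step 6 can be modeled with a thread pool. In this sketch, compile_one is a placeholder standing in for the embedded ptxas invocation; the real work-item lifecycle is covered on the Split Compilation page.

```python
# Hypothetical sketch of split compilation: independent PTX work items are
# compiled in parallel, then the results are merged back in input order.
from concurrent.futures import ThreadPoolExecutor

def compile_one(ptx_module: str) -> str:
    return f"cubin({ptx_module})"   # stand-in for PTX -> SASS assembly

def split_compile(modules, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order even though execution is parallel,
        # so the downstream merge sees a deterministic sequence.
        return list(pool.map(compile_one, modules))
```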

Path 3: Mercury Investigation

Goal: understand the SM100+ (Blackwell) capsule mercury format -- the new ISA container that replaces traditional cubin output for Blackwell and later architectures.

  1. Mercury Overview -- What Mercury is, why it exists, the ROT13 obfuscation convention (zrephel = Mercury), and the global mode flag that gates Mercury output.
  2. Mercury ELF Sections -- The .nv.merc, HRKE/HRKI/HRCE/HRCI/HRDE/HRDI sections. How Mercury data is embedded within the ELF container.
  3. Capsule Mercury Format -- The container structure, capsule layout, and relationship to traditional cubin.
  4. Mercury Compiler Passes -- The MercExpand engine (204KB at sub_5B1D80) and Mercury-specific ISel patterns.
  5. FNLZR (Finalizer) -- The post-link binary rewriter, the 10-phase finalization pipeline, pre-link vs. post-link modes, and capability masks.
  6. R_MERCURY Relocations -- Mercury-specific relocation types, how they differ from R_CUDA, and the relocation application flow.

Path 4: Architecture Hacker

Goal: understand how nvlink supports 22 SM architectures (sm_75 through sm_121), how per-architecture dispatch works, and how the embedded ptxas backend selects encoding tables and ISel patterns per target.

  1. Architecture Profiles -- The profile database structure: per-SM feature flags, capability masks, and encoding table pointers.
  2. Compatibility Checking -- Cross-architecture linking rules: family matching, version validation, and the constraints on mixing objects from different SM versions.
  3. Architecture Dispatch (vtables) -- Per-SM vtable-based dispatch for ISel, encoding, and feature queries. How the binary selects the correct code paths for each target.
  4. SM75 Turing -- The minimum supported architecture, baseline feature set, and Turing-specific encoding.
  5. SM90 Hopper -- Cluster launch, WGMMA, TMA, asynchronous barriers. The last pre-Mercury architecture.
  6. SM100 Blackwell -- The Mercury transition: new output format, FNLZR integration, new MMA shapes.
  7. SM103 / SM110 / SM120 / SM121 -- Blackwell Ultra, Jetson Thor, consumer RTX 50-series, DGX Spark. How newer architectures derive from SM100.
  8. Embedded ptxas Overview -- The ~24MB embedded compiler backend: subsystem decomposition and relationship to standalone ptxas.
  9. Instruction Selection Hubs -- The five ISel mega-functions (160--280KB each) and how they dispatch per SM variant.

Relationship to Other CUDA Toolchain Wikis

nvlink is one component in a four-tool CUDA compilation pipeline. Each tool in the pipeline has its own reverse engineering wiki in this project, and together they provide complete coverage of device code compilation from CUDA C++ source to linked device executable.

The Pipeline

CUDA C++ source (.cu)
  |
  v
cudafe++  -->  host .int.c  +  device IL
  |
  v
cicc  -->  PTX assembly (.ptx)
  |
  v
ptxas  -->  device object (.cubin)
  |
  v
nvlink  -->  linked device executable (cubin or capsule mercury)

Cross-Wiki References

cudafe++ wiki (8.5 MB binary, 6,483 functions) -- Documents the CUDA frontend compiler built on EDG 6.6. cudafe++ separates device code from host code and emits the EDG Intermediate Language consumed by cicc. The execution-space bitfield encoding (__device__, __global__, __host__) defined by cudafe++ determines which symbols appear in the device objects that nvlink eventually merges. The lambda wrapper templates (__nv_dl_wrapper_t, __nv_hdl_wrapper_t) injected by cudafe++ produce the weak symbols that nvlink resolves during merge.

cicc wiki (60 MB binary, 80,562 functions) -- Documents the CUDA C-to-PTX compiler built on EDG 6.6 + LLVM 20.0.0. cicc transforms CUDA source into PTX assembly or NVVM IR bitcode. When nvlink operates in LTO mode, it loads libnvvm.so (a shared library containing cicc's LLVM backend) to compile NVVM IR modules back to PTX. The --Xnvvm option forwarding documented in this wiki's Option Forwarding page passes flags to cicc's LLVM pipeline.

ptxas wiki (37.7 MB binary, 40,185 functions) -- Documents the PTX-to-SASS assembler. nvlink embeds a copy of the ptxas backend statically linked into its binary (~24 MB, ~20,000 functions). The embedded ptxas section of this wiki provides a summary, but the ptxas wiki covers the same compiler in far greater depth: the 159-phase optimization pipeline, the fatpoint register allocator, the Mercury encoder architecture, and the 1,294 ROT13-encoded internal knobs. When investigating nvlink's embedded compiler behavior, consult the ptxas wiki for detailed pass descriptions.

When to Use Which Wiki

| Question | Wiki |
|---|---|
| How does nvlink resolve cross-TU device symbols? | nvlink -- Symbol Resolution |
| How does the PTX-to-SASS compiler work internally? | ptxas -- the 74-page PTX assembler wiki |
| How does LTO compile NVVM IR to PTX? | nvlink LTO Overview + cicc LLVM optimizer |
| What CUDA C++ syntax maps to which device IL? | cudafe++ -- EDG frontend, execution spaces |
| How are Mercury/capsule sections structured? | nvlink Mercury + ptxas capsule mercury |
| How does --Xptxas affect code generation? | nvlink Embedded ptxas Options + ptxas knobs system |
| What are the 159 optimization phases? | ptxas -- Pass Inventory & Ordering |
| How does register allocation determine occupancy? | ptxas -- Allocator Architecture |

Methodology

All analysis was performed by static reverse engineering of the stripped nvlink binary using IDA Pro 9.x and Hex-Rays decompilation. No dynamic analysis was used. Of the 40,532 functions in the binary, 40,366 (99.6%) were successfully decompiled. The remaining 166 are CRT thunks, computed-jump trampolines, and mega-functions exceeding decompiler limits.

For full details on the reverse engineering approach, confidence scoring, and data sources, see Methodology.