ELF/Cubin Output

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

After all per-kernel SASS encoding completes, ptxas enters the ELF output phase -- the final stage of the compilation pipeline. This phase transforms the accumulated per-kernel SASS bytes, relocation metadata, constant bank data, shared memory layouts, and debug information into a complete NVIDIA CUBIN file. The CUBIN is a standard ELF container with NVIDIA-proprietary extensions: machine type EM_CUDA (0xBE), non-standard ELF class bytes, CUDA-specific section types, and a rich per-entry metadata system called EIATTR. The output pipeline is a custom implementation with no libelf dependency -- ptxas constructs every byte of the ELF from scratch, including headers, section tables, symbol tables, string tables, relocations, and program headers.

The output phase handles three binary kinds: SASS (raw resolved SASS, legacy default), Mercury (SM 75--99 default), and Capsule Mercury (SM 100+ default, supporting deferred finalization). All three produce a valid CUBIN ELF; the difference is whether the .text sections contain final SASS bytes or Mercury-encoded streams that a downstream finalizer resolves at link or load time.
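The default selection described above can be sketched as a small C function. The enum and function names are hypothetical; only the SM thresholds come from the text, and treating everything below SM 75 as plain SASS is an assumption based on the "legacy default" wording.

```c
#include <assert.h>

/* Binary kinds named in the text. Names are illustrative, not ptxas symbols. */
typedef enum {
    BINARY_KIND_SASS,            /* raw resolved SASS (legacy default)      */
    BINARY_KIND_MERCURY,         /* default for SM 75 through 99            */
    BINARY_KIND_CAPSULE_MERCURY  /* default for SM 100 and above            */
} binary_kind_t;

/* Return the default binary kind for an SM version such as 75, 90 or 100.
 * Assumption: anything below SM 75 falls back to plain SASS. */
binary_kind_t default_binary_kind(int sm_version)
{
    assert(sm_version > 0);
    if (sm_version >= 100)
        return BINARY_KIND_CAPSULE_MERCURY;
    if (sm_version >= 75)
        return BINARY_KIND_MERCURY;
    return BINARY_KIND_SASS;
}
```

An explicit command-line choice would presumably override this default; the sketch covers only the default mapping.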


Cubin entry point	`sub_612DE0` (~12 KB binary, 47 KB decompiled; called from `sub_446240`)
ELFW constructor	`sub_1CB53A0` (3,480 bytes, 672-byte central object)
Section creator	`sub_1CB3570` (1,963 bytes, 44 call sites)
Symbol table builder	`sub_1CB68D0` (9,578 bytes, ~1,700 decompiled lines)
Master ELF emitter	`sub_1C9F280` (15,263 bytes, 97 KB decompiled -- largest function in output range)
Section layout calculator	`sub_1C9DC60` (5,663 bytes)
Master section allocator	`sub_1CABD60` (11,856 bytes, 67 KB decompiled -- shared/constant/local addresses)
nvinfo/EIATTR builder	`sub_1CC9800` (14,764 bytes, 90 KB decompiled)
Master relocation resolver	`sub_1CD48C0` (4,184 bytes, 22 KB decompiled)
File serializer	`sub_1CD13A0` (2,541 bytes, writes final bytes to disk)
ELF machine type	`EM_CUDA` = `0xBE` (190)
CUDA section type	`SHT_CUDA_INFO` = `0x70000064`
ELF timing	`"ELF-time : %.3f ms (%.2f%%)"` in `--compiler-stats` output
Peak ELF memory	`"PeakELFMemoryUsage : %.3lf KB"`

Pipeline Overview

The ELF output pipeline runs as a single-threaded sequence after all per-kernel OCG passes have completed (multi-kernel compilation may be parallel, but ELF emission is serialized). The flow is orchestrated by sub_612DE0, which reads compilation flags, constructs the ELFW central object, then drives 11 phases to produce the final .cubin or .o file.

sub_446240 (compilation driver -- "real main")
  |  all per-kernel OCG passes complete
  v
sub_612DE0 (cubin entry, 47 KB decompiled)
  |  reads: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
  |  establishes setjmp/longjmp error recovery
  |  writes "Cuda compilation tools, release 13.0, V13.0.88"
  |         "Build cuda_13.0.r13.0/compiler.36424714_0"
  |
  |-- Phase 1: ELFW construction
  |     sub_1CB53A0 -- create 672-byte ELFW object, 7 standard sections
  |
  |-- Phase 2: Per-kernel section creation
  |     sub_1CB42D0 x N -- .text.<func>, .rela.text.<func> (one per kernel)
  |     sub_1CB3570 x 44 -- .nv.constant0.<func>, .nv.shared.<func>, etc.
  |
  |-- Phase 3: Call graph analysis
  |     sub_1CBB920 -- recursion detection (DFS)
  |     sub_1CBC090 -- dead function elimination
  |     sub_1CBE1B0 -- .nv.callgraph section builder
  |
  |-- Phase 4: Symbol fixup
  |     sub_1CB2CA0 -- renumber symbols after dead code elimination
  |     sub_1C99BB0 -- remap .symtab_shndx extended indices
  |
  |-- Phase 5: Memory allocation
  |     sub_1CABD60 -- assign addresses: shared, constant, local memory
  |     sub_1CA92F0 -- shared memory interference graph
  |     sub_1CA6890 -- constant bank deduplication
  |
  |-- Phase 6: nvinfo/EIATTR generation
  |     sub_1CC9800 -- build .nv.info.<func> sections (EIATTR attributes)
  |     sub_1CC8950 -- propagate barrier/register counts across call graph
  |
  |-- Phase 7: Symbol table construction
  |     sub_1CB68D0 -- build .symtab, handle SHN_XINDEX overflow
  |
  |-- Phase 8: Section layout
  |     sub_1C9DC60 -- compute file offsets with alignment padding
  |
  |-- Phase 9: Relocation resolution
  |     sub_1CD48C0 -- resolve all R_CUDA_* relocations
  |     sub_1CD5920 -- write .nv.resolvedrela sections
  |
  |-- Phase 10: Capsule Mercury embedding (SM 100+ only)
  |     sub_1C9B110 -- create .nv.merc.* section namespace
  |     sub_1CA2E40 -- clone memory-space sections into merc namespace
  |     sub_1C9C300 -- build 328-byte capsule descriptors per function
  |
  |-- Phase 11: Final assembly & write
  |     sub_1C9F280 -- master ELF emitter (assemble complete CUBIN)
  |     sub_1CD13A0 -- file serializer (write to disk)
  |
  v
OUTPUT: .cubin / .o file

Custom ELF Emitter

ptxas builds the entire ELF output without libelf. The custom implementation spans approximately 20 functions in the 0x1C99--0x1CD6 address range (~300 KB of binary code). At the center is the ELFW ("ELF world") object -- a 672-byte structure that owns all sections, symbols, and string tables for a single compilation unit.

ELFW Object Layout

The ELFW constructor sub_1CB53A0 allocates 672 bytes from the pool allocator sub_424070, creates a dedicated "elfw memory space" pool (4,096-byte initial allocation), writes the ELF header, and initializes the null section plus 7 mandatory sections:

Index	Section	Type	Purpose
0	(null)	`SHT_NULL`	Required ELF null section
1	`.shstrtab`	`SHT_STRTAB`	Section name string table
2	`.strtab`	`SHT_STRTAB`	Symbol name string table
3	`.symtab`	`SHT_SYMTAB`	Symbol table
4	`.symtab_shndx`	`SHT_SYMTAB_SHNDX`	Extended section indices (for >65,280 sections)
5	`.note.nv.tkinfo`	`SHT_NOTE`	NVIDIA toolkit info (version, build ID, CLI args)
6	`.note.nv.cuinfo`	`SHT_NOTE`	NVIDIA CUDA info (SM version, features)
7	`.nv.uft.entry`	`SHT_PROGBITS`	Unified Function Table entries

ELF Header

The ELF header uses standard structure with NVIDIA-specific overrides:

Offset	Size	Field	CUDA Value
`0x00`	4	`e_ident[EI_MAG0..3]`	`0x7F 'E' 'L' 'F'` (magic `0x464C457F`)
`0x04`	1	`e_ident[EI_CLASS]`	`0x33` (`'3'`, 32-bit) or `0x41` (`'A'`, 64-bit)
`0x05`	1	`e_ident[EI_DATA]`	`0x01` (little-endian)
`0x06`	1	`e_ident[EI_VERSION]`	`0x01` (`EV_CURRENT`)
`0x07`	1	`e_ident[EI_OSABI]`	CUDA ABI version
`0x12`	2	`e_machine`	`0x00BE` (`EM_CUDA` = 190)
`0x24` (ELF32) / `0x30` (ELF64)	4	`e_flags`	SM version bits `[7:0]` + CUDA control flags

Standard ELF uses ELFCLASS32 = 1 and ELFCLASS64 = 2. CUDA cubins use non-standard values '3' (0x33) and 'A' (0x41), which the CUDA driver recognizes as the cubin signature. Any tool using standard libelf will reject these as invalid, which is one reason ptxas uses a custom emitter.
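As an illustration, a reader-side check for these identification bytes might look like the following C sketch. It takes the byte values in the table at face value; the function name is hypothetical, and the check is a reading aid rather than the driver's actual validation logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EM_CUDA 0xBEu

/* Returns 32 or 64 when the buffer looks like a cubin header according to the
 * table above, or 0 when it does not. Only the identification bytes and
 * e_machine are examined. */
int cubin_address_class(const uint8_t *hdr, size_t len)
{
    uint16_t machine;

    if (hdr == NULL || len < 0x14)
        return 0;
    if (hdr[0] != 0x7F || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
        return 0;                       /* wrong magic */
    if (hdr[5] != 0x01)
        return 0;                       /* EI_DATA must be little-endian */

    /* e_machine is a little-endian 16-bit field at offset 0x12. */
    machine = (uint16_t)(hdr[0x12] | (hdr[0x13] << 8));
    if (machine != EM_CUDA)
        return 0;

    switch (hdr[4]) {                   /* EI_CLASS as listed in the table */
    case 0x33: return 32;               /* '3' */
    case 0x41: return 64;               /* 'A' */
    default:   return 0;                /* includes standard ELFCLASS32/64 */
    }
}
```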

The e_flags field packs the SM architecture version in the low byte (e.g., 100 for sm_100) along with flags for relocatable vs. executable mode and address size. The mask 0x7FFFBFFF (clears bits 14 and 31) is applied during finalization to strip internal control flags that must not appear in the output cubin.
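The masking step is plain bit arithmetic and can be sketched in C as follows. The macro and function names are hypothetical; the mask value and the low-byte SM field are taken from the paragraph above.

```c
#include <assert.h>
#include <stdint.h>

/* Mask from the text: every bit set except bit 14 and bit 31. */
#define CUBIN_EFLAGS_OUTPUT_MASK 0x7FFFBFFFu

/* Clear the internal control bits before e_flags is written to the file. */
uint32_t strip_internal_eflags(uint32_t e_flags)
{
    uint32_t out = e_flags & CUBIN_EFLAGS_OUTPUT_MASK;
    assert((out & ((1u << 31) | (1u << 14))) == 0);
    return out;
}

/* The SM architecture version lives in bits [7:0]. */
unsigned eflags_sm_version(uint32_t e_flags)
{
    return e_flags & 0xFFu;
}
```

For example, an in-memory value of 0x80004064 (both internal bits set, SM 100 in the low byte) would be written out as 0x00000064.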

Section Generation

Each kernel/function produces a set of ELF sections. For a program with N entry functions and M device functions, the CUBIN contains on the order of 4*(N+M) sections minimum. The section creator sub_1CB3570 (44 call sites) handles the generic case; sub_1CB42D0 is specialized for .text.<funcname> code sections.

Per-Kernel Sections

For each kernel entry my_kernel, the output pipeline generates:

Section	Type	Content
`.text.my_kernel`	`SHT_PROGBITS`	SASS instruction bytes (SHF_ALLOC | SHF_EXECINSTR)
`.rela.text.my_kernel`	`SHT_RELA`	Relocation entries for the code section
`.nv.info.my_kernel`	`SHT_CUDA_INFO` (`0x70000064`)	EIATTR metadata (register count, barriers, stack sizes, etc.)
`.nv.constant0.my_kernel`	`SHT_PROGBITS`	Constant bank 0 data (kernel parameters + literal constants)

Additional sections are generated as needed (the .nv.shared and .nv.local entries are per-kernel):

Section	Condition
`.nv.shared.my_kernel`	Kernel uses shared memory
`.nv.local.my_kernel`	Kernel uses local (spill) memory
`.nv.global.init`	Program uses initialized global variables
`.nv.callgraph`	Relocatable object mode (`-c`)
`.nv.prototype`	Prototype information for cross-module linking

Global Sections

Section	Purpose
`.nv.info`	Global EIATTR attributes (not per-kernel)
`.nv.constant0`	Merged constant bank (whole-program mode)
`.nv.reservedSmem`	Reserved shared memory for runtime (tmem allocation, mbarrier parity)
`.nv.metadata`	Module-level metadata
`.nv.compat`	Forward-compatibility attributes
`.note.nv.cuver`	CUDA version note
`.nv.uft`	Unified Function Table (indirect call support)
`.nv.udt`	Unified Data Table
`.nv.uft.entry`	UFT entry point table
`.nv.udt.entry`	UDT entry point table
`.nv.rel.action`	Relocation action table
`.nv.resolvedrela`	Resolved relocations (post-linking)
`.nv.host`	Host-side interop data

Constant Banks

CUDA supports up to 18 numbered constant banks (0--17) plus 7 named banks:

Bank	Name	Purpose
0	`.nv.constant0`	Kernel parameters + compiler constants (per-entry)
1--17	`.nv.constant1`--`.nv.constant17`	User-declared `__constant__` variables
--	`.nv.constant.entry_params`	Entry point parameter block
--	`.nv.constant.entry_image_header_indices`	Texture/surface header index table
--	`.nv.constant.driver`	Driver-injected constants
--	`.nv.constant.optimizer`	Optimizer-generated constants (OCG)
--	`.nv.constant.user`	User-specified constants
--	`.nv.constant.pic`	Position-independent code constants
--	`.nv.constant.tools_data`	Tools/debugger-injected data

Section Ordering

During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF:

Bucket	Contents
0 (highest)	ELF header pseudo-section, `.shstrtab`
1	`.strtab`, `.symtab`, `.symtab_shndx`
2	`.note.nv.tkinfo`, `.note.nv.cuinfo`
3	`.text.<funcname>` (code sections)
4	`.nv.constant0.*`, `.nv.shared.*`, `.nv.local.*` (data sections)
5	`.rela.*`, `.rel.*` (relocation sections)
6	`.nv.info.*` (EIATTR metadata sections)
7 (lowest)	`.debug_*`, `.nv.merc.*` (debug and Mercury metadata)
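The bucket table above can be read as a prefix classifier. The C sketch below is one way to express it; the function names are hypothetical, the ELF header pseudo-section is left out because it has no section name, and returning -1 for names the table does not list is an assumption.

```c
#include <assert.h>
#include <string.h>

/* True when s begins with prefix p. */
static int has_prefix(const char *s, const char *p)
{
    return strncmp(s, p, strlen(p)) == 0;
}

/* Map a section name to its priority bucket (0 = highest, 7 = lowest).
 * Returns -1 for names the table does not cover. */
int section_bucket(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, ".shstrtab") == 0)
        return 0;
    if (strcmp(name, ".strtab") == 0 || strcmp(name, ".symtab") == 0 ||
        strcmp(name, ".symtab_shndx") == 0)
        return 1;
    if (strcmp(name, ".note.nv.tkinfo") == 0 ||
        strcmp(name, ".note.nv.cuinfo") == 0)
        return 2;
    if (has_prefix(name, ".text."))
        return 3;
    /* Checked before the data prefixes so that .nv.merc.* always lands in 7. */
    if (has_prefix(name, ".debug_") || has_prefix(name, ".nv.merc."))
        return 7;
    if (has_prefix(name, ".nv.constant0.") || has_prefix(name, ".nv.shared.") ||
        has_prefix(name, ".nv.local."))
        return 4;
    if (has_prefix(name, ".rela.") || has_prefix(name, ".rel."))
        return 5;
    if (has_prefix(name, ".nv.info."))
        return 6;
    return -1;
}
```

Sorting sections by this bucket value, with a stable sort to keep creation order inside a bucket, would reproduce the ordering in the table.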

Section file offsets are assigned by sub_1C9DC60, walking the sorted list with alignment padding:

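uint64_t offset = elf_header_size;   /* first byte after the ELF header */
for (int i = 0; i < section_count; i++) {
    section_t* sec = sorted_sections[i];
    /* Widen first so the mask below is computed in 64 bits. */
    uint64_t align = sec->sh_addralign;
    /* Round up to the alignment; the mask trick assumes a power of two. */
    if (align > 1)
        offset = (offset + align - 1) & ~(align - 1);
    sec->sh_offset = offset;
    offset += sec->sh_size;          /* next section starts after this one */
}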

Two section types receive special treatment during layout: .nv.constant0 (address assigned by the OCG constant bank allocator, not the layout calculator) and .nv.reservedSmem (address assigned by the shared memory master allocator sub_1CABD60).

EIATTR Metadata

Each kernel's .nv.info.<funcname> section contains a sequence of EIATTR (Entry Information Attribute) records. These encode per-kernel metadata that the CUDA driver reads at launch time to configure the hardware correctly. The EIATTR builder is sub_1CC9800 (14,764 binary bytes, 90 KB decompiled, 51 callees) -- one of the largest functions in the output pipeline.

EIATTR Encoding

Each EIATTR record is a TLV (type-length-value) structure:

+--------+--------+------------------+
| type   | length | value            |
| 2 bytes| 2 bytes| length bytes     |
+--------+--------+------------------+

The type field is a 16-bit EIATTR code. The length field specifies the payload size. The value is type-dependent (scalar, array, or structured).
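A reader for this record layout can be sketched in C as below. It follows the 2-byte type and 2-byte length layout exactly as drawn above and assumes both header fields are little-endian, like the payload integers; the function names and the type codes used in any example are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 16-bit value. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Scan a .nv.info payload for the first record with the given type.
 * Returns a pointer to the record's value and stores its length in *out_len,
 * or returns NULL when the type is absent or a record is truncated. */
const uint8_t *eiattr_find(const uint8_t *buf, size_t size,
                           uint16_t type, uint16_t *out_len)
{
    size_t pos = 0;

    assert(buf != NULL && out_len != NULL);
    while (size - pos >= 4) {
        uint16_t rec_type = read_le16(buf + pos);
        uint16_t rec_len  = read_le16(buf + pos + 2);

        pos += 4;                       /* skip the 4-byte record header */
        if (rec_len > size - pos)
            return NULL;                /* value runs past the section end */
        if (rec_type == type) {
            *out_len = rec_len;
            return buf + pos;
        }
        pos += rec_len;                 /* advance to the next record */
    }
    return NULL;
}
```

A flag attribute with a 0-byte payload is reported as a non-NULL pointer with a length of 0, which lets callers distinguish "present" from "absent".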

EIATTR Catalog

ptxas v13.0.88 defines 98 EIATTR codes. The critical ones that every cubin emitter must produce:

EIATTR	Purpose	Encoding
`EIATTR_REGCOUNT`	Register count for this kernel	4-byte LE integer
`EIATTR_NUM_BARRIERS`	Hardware barrier count (0--16)	4-byte LE integer
`EIATTR_FRAME_SIZE`	Per-thread stack frame size in bytes	4-byte LE integer
`EIATTR_MIN_STACK_SIZE`	Minimum call stack size	4-byte LE integer
`EIATTR_MAX_STACK_SIZE`	Maximum call stack size (recursive)	4-byte LE integer
`EIATTR_CRS_STACK_SIZE`	Call/Return/Sync stack size	4-byte LE integer
`EIATTR_EXIT_INSTR_OFFSETS`	Byte offsets of EXIT instructions in `.text`	Array of 4-byte offsets
`EIATTR_S2RCTAID_INSTR_OFFSETS`	Byte offsets of S2R SR_CTAID.* instructions	Array of 4-byte offsets
`EIATTR_CTAIDZ_USED`	Kernel reads CTA ID Z dimension	Flag (0-byte payload)
`EIATTR_REQNTID`	Required thread block dimensions	3x 4-byte integers (X, Y, Z)
`EIATTR_MAX_THREADS`	Maximum threads per block	4-byte LE integer
`EIATTR_PARAM_CBANK`	Constant bank for kernel parameters	4-byte bank index + offset
`EIATTR_CBANK_PARAM_SIZE`	Size of parameter constant bank region	4-byte LE integer
`EIATTR_KPARAM_INFO`	Per-parameter ordinal/offset/size/alignment	Structured (V1)
`EIATTR_KPARAM_INFO_V2`	Per-parameter info, extended format	Structured (V2)
`EIATTR_MAXREG_COUNT`	Maximum register count directive	4-byte LE integer
`EIATTR_EXTERNS`	List of external symbol references	Array of symbol indices

Additional EIATTR codes for textures/surfaces, barriers, cooperative groups, tensor cores, and hardware workarounds:

EIATTR	Purpose
`EIATTR_IMAGE_SLOT`	Texture/surface image binding slot
`EIATTR_SAMPLER_INIT`	Sampler initialization data
`EIATTR_TEXID_SAMPID_MAP`	Texture-to-sampler mapping
`EIATTR_BINDLESS_IMAGE_OFFSETS`	Bindless texture/surface offset table
`EIATTR_SYNC_STACK`	Synchronization stack requirements
`EIATTR_COOP_GROUP_MASK_REGIDS`	Cooperative group register IDs
`EIATTR_NUM_MBARRIERS`	Number of mbarrier objects used
`EIATTR_MBARRIER_INSTR_OFFSETS`	mbarrier instruction locations
`EIATTR_WMMA_USED`	Kernel uses WMMA (Tensor Core) instructions
`EIATTR_TCGEN05_1CTA_USED`	Kernel uses 5th-gen Tensor Core (1-CTA mode)
`EIATTR_TCGEN05_2CTA_USED`	Kernel uses 5th-gen Tensor Core (2-CTA mode)
`EIATTR_SPARSE_MMA_MASK`	Structured sparsity mask for MMA
`EIATTR_CTA_PER_CLUSTER`	CTAs per cluster (SM 90+)
`EIATTR_EXPLICIT_CLUSTER`	Kernel requires explicit cluster launch
`EIATTR_MAX_CLUSTER_RANK`	Maximum cluster rank
`EIATTR_REG_RECONFIG`	Register reconfiguration (setmaxreg)
`EIATTR_SAM_REGION_STACK_SIZE`	SAM (Shared Address Mode) region stack
`EIATTR_RESERVED_SMEM_USED`	Kernel uses reserved shared memory
`EIATTR_RESERVED_SMEM_0_SIZE`	Size of reserved shared memory region 0
`EIATTR_SW1850030_WAR`	Hardware workaround (bug SW-1850030)
`EIATTR_SW2393858_WAR`	Hardware workaround (bug SW-2393858)
`EIATTR_SW2861232_WAR`	Hardware workaround (bug SW-2861232)
`EIATTR_CUDA_API_VERSION`	Required CUDA API version
`EIATTR_MERCURY_ISA_VERSION`	Mercury ISA version for capmerc binaries
`EIATTR_MERCURY_FINALIZER_OPTIONS`	Finalizer tuning for capmerc
`EIATTR_SYSCALL_OFFSETS`	Byte offsets of syscall instructions
`EIATTR_INSTR_REG_MAP`	Instruction-to-register allocation map (debug)
`EIATTR_STATISTICS`	Per-kernel compilation statistics
`EIATTR_PERF_STATISTICS`	Performance statistics

Barrier and Register Count Propagation

The EIATTR builder runs a cross-function propagation pass via sub_1CC8950 (2,634 bytes). When a device function uses barriers or a high register count, these requirements must propagate upward to the entry kernel that calls it:

"regcount %d for %s propagated to entry %s"
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
    from section flags of %s to nvinfo for entry symbol %s"
"Propagating higher barcount %d to the section flags
    of %s of entry symbol %s"

This ensures that the CUDA driver allocates sufficient barriers and registers for the entry kernel, accounting for all callees.
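The propagation rule can be illustrated with a short C sketch: the entry kernel ends up with the maximum register and barrier count over every function reachable from it. The data structure and names are hypothetical, and a dense adjacency matrix is used only to keep the example small; sub_1CC8950 works on the real call graph.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FUNCS 64

/* Hypothetical call graph: calls[a][b] is true when function a calls b. */
typedef struct {
    int  count;
    int  regcount[MAX_FUNCS];
    int  barcount[MAX_FUNCS];
    bool calls[MAX_FUNCS][MAX_FUNCS];
} callgraph_t;

/* Depth-first walk that records the largest counts seen so far.
 * The seen[] array stops the walk from looping on recursive calls. */
static void visit(const callgraph_t *g, int f, bool *seen,
                  int *max_reg, int *max_bar)
{
    if (seen[f])
        return;
    seen[f] = true;
    if (g->regcount[f] > *max_reg) *max_reg = g->regcount[f];
    if (g->barcount[f] > *max_bar) *max_bar = g->barcount[f];
    for (int callee = 0; callee < g->count; callee++)
        if (g->calls[f][callee])
            visit(g, callee, seen, max_reg, max_bar);
}

/* Raise the entry's counts to cover every function it can reach. */
void propagate_to_entry(callgraph_t *g, int entry)
{
    bool seen[MAX_FUNCS] = { false };
    int max_reg = 0, max_bar = 0;

    assert(g->count <= MAX_FUNCS && entry >= 0 && entry < g->count);
    visit(g, entry, seen, &max_reg, &max_bar);
    g->regcount[entry] = max_reg;
    g->barcount[entry] = max_bar;
}
```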

Relocation Processing

The relocation system handles symbol resolution for branch targets, constant bank references, function descriptors, texture/surface bindings, and address computations. The master relocation resolver is sub_1CD48C0 (4,184 binary bytes, 22 KB decompiled). ptxas defines 117 CUDA-specific relocation types (R_CUDA_NONE through R_CUDA_NONE_LAST).

Relocation Categories

Category	Types	Purpose
Absolute address	`R_CUDA_32`, `R_CUDA_64`, `R_CUDA_ABS*`	Global memory addresses
Global address	`R_CUDA_G32`, `R_CUDA_G64`, `R_CUDA_G8_*`	Global-space addresses
PC-relative	`R_CUDA_PCREL_IMM24_26`, `R_CUDA_PCREL_IMM24_23`	Branch/call targets
Constant field	`R_CUDA_CONST_FIELD19_*`, `R_CUDA_CONST_FIELD21_*`, `R_CUDA_CONST_FIELD22_*`	Constant bank references
Function descriptor	`R_CUDA_FUNC_DESC*`	Indirect function call targets
Texture/surface	`R_CUDA_TEX_HEADER_INDEX`, `R_CUDA_SAMP_HEADER_INDEX`, `R_CUDA_SURF_*`	Texture/surface binding
Bindless	`R_CUDA_BINDLESSOFF`, `R_CUDA_TEX_BINDLESSOFF`	Bindless texture/surface offsets
Sub-byte patching	`R_CUDA_8_0` through `R_CUDA_8_56`	Individual byte within a 64-bit instruction
Unified address	`R_CUDA_UNIFIED`, `R_CUDA_UNIFIED_32`, `R_CUDA_UNIFIED_8_*`	Unified address space
Instruction-level	`R_CUDA_INSTRUCTION64`, `R_CUDA_INSTRUCTION128`	Whole-instruction replacement
Yield/NOP	`R_CUDA_YIELD_OPCODE9_0`, `R_CUDA_YIELD_CLEAR_PRED4_87`	YIELD-to-NOP patching
Cleanup	`R_CUDA_UNUSED_CLEAR32`, `R_CUDA_UNUSED_CLEAR64`	Zero out unused instruction fields
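To make the sub-byte patching row concrete, here is a minimal C sketch of replacing one byte inside a 64-bit instruction word. It assumes that the numeric suffix of R_CUDA_8_0 through R_CUDA_8_56 is the bit offset of the patched byte; that reading and the function name are assumptions, not something the table states.

```c
#include <assert.h>
#include <stdint.h>

/* Replace the 8-bit field that starts at bit_offset inside a 64-bit
 * instruction word. Valid offsets are 0, 8, ..., 56. */
uint64_t patch_byte_field(uint64_t insn, unsigned bit_offset, uint8_t value)
{
    uint64_t mask;

    assert(bit_offset <= 56 && bit_offset % 8 == 0);
    mask = (uint64_t)0xFF << bit_offset;
    return (insn & ~mask) | ((uint64_t)value << bit_offset);
}
```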

Resolution Logic

The resolver performs these operations for each relocation entry:

Alias resolution -- redirect relocations from alias symbols to their targets ("change alias reloc %s to %s")
Dead function filtering -- skip relocations on eliminated functions ("ignore reloc on dead func %s")
UFT/UDT pseudo-relocation -- handle __UFT_OFFSET, __UFT_CANONICAL, __UDT_OFFSET, __UDT_CANONICAL synthetic symbols
PC-relative validation -- ensure branch targets are in the same section ("PC relative branch address should be in the same section")
YIELD-to-NOP conversion -- convert YIELD instructions to NOP when forward progress requirements prevent yielding
Unified reloc replacement -- convert type 103 (unified) to type 1 (absolute) for final resolution
Address computation -- compute final patched value from symbol address + addend

Output relocation sections (.nv.resolvedrela) are written by sub_1CD5920.

Debug Information

When --device-debug or --generate-line-info is active, ptxas generates DWARF debug sections. The debug subsystem spans 0x1CBF--0x1CC9 and includes parsers, emitters, and dumpers for the standard DWARF sections plus NVIDIA-specific debug extensions.

Debug Sections

Section	Content
`.debug_info`	DWARF DIE tree (compilation units, types, variables)
`.debug_abbrev`	DWARF abbreviation table
`.debug_line`	Source-to-address line number mapping
`.debug_frame`	Call frame information for unwinding
`.debug_loc`	Location lists for variables
`.debug_str`	DWARF string table
`.debug_ranges`	Address ranges
`.debug_aranges`	Address range lookup table
`.debug_pubnames`	Public name index
`.debug_pubtypes`	Public type index
`.nv_debug_ptx_txt`	Embedded PTX source text
`.nv_debug_line_sass`	SASS-level line number mapping
`.nv_debug_info_reg_sass`	Register allocation debug info
`.nv_debug_info_reg_type`	Register type information

Key debug infrastructure functions:

Function	Purpose
`sub_1CBF820`	DWARF form name table (`DW_FORM_addr`, `DW_FORM_data4`, etc.)
`sub_1CBF9B0`	DWARF attribute name table (`DW_AT_producer`, `DW_AT_comp_dir`, etc.)
`sub_1CC0850`	`.debug_abbrev` parser/emitter (18 KB decompiled)
`sub_1CC4A40`	`.debug_info` DIE tree walker (28 KB decompiled)
`sub_1CC34E0`	DWARF location expression decoder (`DW_OP_*` operations)
`sub_1CC24C0`	`.debug_info` emission pass (18 KB decompiled)
`sub_1CC5EB0`	Compilation unit header parser
`sub_1C9D1F0`	Debug section classifier/mapper (16 KB decompiled)

The --suppress-debug-info option (sub_432A00) disables debug section generation even when debug flags are present.

Capsule Mercury Output Path

For SM 100+ targets (Blackwell, Jetson Thor, consumer RTX 50-series), the default output mode is Capsule Mercury. In this mode, the CUBIN ELF contains two layers of content: standard CUBIN sections and a parallel set of .nv.merc.* sections carrying Mercury-encoded instruction streams plus all metadata needed for deferred finalization.

Mercury-Specific Sections

Section	Purpose
`.nv.merc.debug_abbrev`	Cloned DWARF abbreviation table
`.nv.merc.debug_info`	Cloned DWARF info
`.nv.merc.debug_line`	Cloned DWARF line table
`.nv.merc.debug_frame`	Cloned DWARF frame info
`.nv.merc.debug_loc`	Cloned DWARF locations
`.nv.merc.debug_str`	Cloned DWARF string table
`.nv.merc.debug_ranges`	Cloned DWARF ranges
`.nv.merc.debug_aranges`	Cloned DWARF address ranges
`.nv.merc.debug_pubnames`	Cloned DWARF public names
`.nv.merc.debug_pubtypes`	Cloned DWARF public types
`.nv.merc.debug_macinfo`	Cloned DWARF macro info
`.nv.merc.nv_debug_ptx_txt`	Embedded PTX source text
`.nv.merc.nv_debug_line_sass`	SASS-level line mapping
`.nv.merc.nv_debug_info_reg_sass`	Register allocation debug info
`.nv.merc.nv_debug_info_reg_type`	Register type debug info
`.nv.merc.symtab_shndx`	Extended section index table (merc copy)
`.nv.merc.nv.shared.reserved`	Shared memory reservation metadata
`.nv.merc.rela<secname>`	Per-section relocation tables

Capsule Mercury Construction

The capmerc path is integrated into the master ELF emitter. The sequence:

sub_1C9B110 (23 KB decompiled) creates the .nv.merc.* section namespace
sub_1CA2E40 (18 KB decompiled) clones constant/global/shared/local sections into the merc namespace, creating .nv.merc.rela relocation sections
sub_1C9C300 (24 KB decompiled) processes .nv.capmerc<funcname> sections and constructs a 328-byte capsule descriptor per function covering the Mercury-encoded instruction stream, relocation metadata, a KNOBS configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage)
sub_1CA3A90 (45 KB decompiled) merges sections that have both merc and non-merc copies
sub_1C99BB0 (25 KB decompiled) remaps section indices after merc section insertion

Off-Target Finalization

The cubin entry sub_612DE0 implements a "fastpath optimization" for off-target finalization:

"[Finalizer] fastpath optimization applied for off-target %u -> %u finalization"

When a capmerc binary compiled for SM X is finalized for SM Y (within the same family), the fastpath patches the Mercury instruction stream directly without full recompilation. The compatibility checker sub_60F290 determines whether fastpath is safe based on instruction set compatibility, register file layout, and memory model.

Self-Check

The --self-check option performs roundtrip verification: generate capmerc, reconstitute SASS from the capsule, and compare section-by-section. The verifier uses a Flex SASS lexer (sub_720F00, 64 KB) and a comparator (sub_729540, 35 KB). Error string: "Failure of '%s' section in self-check for capsule mercury".

Multi-Kernel Output

A typical CUDA program compiles multiple kernels and device functions into a single CUBIN. The output pipeline handles this through per-function section isolation, combined with cross-function analysis for call graph construction, barrier propagation, and dead code elimination.

Per-Function Section Layout

Each entry function and each device function gets its own .text section (the -ffunction-sections pattern). This enables:

Function-level dead code elimination -- sub_1CBC090 removes .text, .rela.text, .nv.info, and .nv.constant0 sections for unreachable functions
Linker granularity -- nvlink can select individual functions from relocatable objects
Driver loading -- the CUDA runtime can load individual kernels by name

Call Graph Construction

The call graph builder (sub_1CBE1B0) emits a .nv.callgraph section that encodes inter-function call edges. This section is present only in relocatable object mode (-c). The recursion detector (sub_1CBB920) performs a DFS traversal with manual 9-level unrolling, emitting "recursion at function %d" for each cycle found.
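For reference, recursion detection on a call graph is commonly done with a three-state depth-first search, sketched below in C. This is a plain recursive version, not the manually unrolled form found in sub_1CBB920, and the matrix representation and names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FUNCS 64

/* Node states: unvisited, on the current DFS path, fully explored. */
enum { WHITE = 0, GREY, BLACK };

/* Returns true when a call edge leads back to a function on the current path. */
static bool dfs(bool calls[][MAX_FUNCS], int n, int f, int *state)
{
    state[f] = GREY;
    for (int callee = 0; callee < n; callee++) {
        if (!calls[f][callee])
            continue;
        if (state[callee] == GREY)
            return true;                /* back edge: recursion */
        if (state[callee] == WHITE && dfs(calls, n, callee, state))
            return true;
    }
    state[f] = BLACK;
    return false;
}

/* calls[a][b] is true when function a calls function b. */
bool callgraph_has_recursion(bool calls[][MAX_FUNCS], int n)
{
    int state[MAX_FUNCS] = { WHITE };

    assert(n >= 0 && n <= MAX_FUNCS);
    for (int f = 0; f < n; f++)
        if (state[f] == WHITE && dfs(calls, n, f, state))
            return true;
    return false;
}
```

This version stops at the first cycle; reporting every cycle, as the "recursion at function %d" message suggests ptxas does, would need the walk to continue after a back edge is found.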

Dead functions are eliminated by sub_1CBC090:

"dead function %d(%s)"
"removed un-used section %s (%d)"   (x8 -- once per section type)
"function %d(%s) has address taken but no call to it"

Memory Allocation Across Kernels

The master section allocator sub_1CABD60 (11,856 binary bytes, 67 KB decompiled, 69 callees) assigns addresses to all memory-space sections across all kernels. It runs a multi-pass algorithm:

Global shared allocation -- shared variables visible to multiple kernels
Per-entry shared memory -- shared variables private to each kernel
Extern shared handling -- dynamically-sized shared memory (extern __shared__)
Reserved shared memory -- runtime reservations (.nv.reservedSmem.begin, .nv.reservedSmem.cap, .nv.reservedSmem.offset0, .nv.reservedSmem.offset1)
Local memory -- per-thread spill storage
Constant bank merging -- merges constant bank data across kernels, with deduplication (sub_1CA6890: "found duplicate value 0x%x, alias %s to %s")
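The deduplication idea in the last item can be illustrated with a small C sketch: identical 32-bit values share one slot in the merged bank, and later occurrences alias the first one. The linear search, the fixed 4-byte element size, and all names are assumptions made for brevity; the text does not describe how sub_1CA6890 looks values up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge n 32-bit constants into `merged`, reusing the slot of an earlier
 * identical value. offsets[i] receives the byte offset assigned to
 * values[i]. Returns the merged bank size in bytes.
 * `merged` must have room for n elements. */
size_t dedup_constants(const uint32_t *values, size_t n,
                       uint32_t *merged, size_t *offsets)
{
    size_t unique = 0;

    assert(values != NULL && merged != NULL && offsets != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t j = 0;

        /* Look for an earlier copy of the same value. */
        while (j < unique && merged[j] != values[i])
            j++;
        if (j == unique)
            merged[unique++] = values[i];   /* first occurrence: new slot */
        offsets[i] = j * sizeof(uint32_t);  /* duplicates alias slot j    */
    }
    return unique * sizeof(uint32_t);
}
```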

The shared memory allocator sub_1CA92F0 (2,804 bytes) builds an interference graph for shared objects and performs group allocation for non-overlapping variables.

SHN_XINDEX Overflow

Large CUDA programs can exceed the ELF 65,280-section limit (SHN_LORESERVE = 0xFF00). Each kernel generates at minimum 4 sections (.text, .rela.text, .nv.info, .nv.constant0), so a program with roughly 16,320 or more kernels (65,280 / 4) triggers the overflow mechanism:

e_shnum = 0 in the ELF header (signals overflow)
Section header [0].sh_size = real section count
e_shstrndx = SHN_XINDEX (0xFFFF)
Section header [0].sh_link = real .shstrtab index
Symbol st_shndx = SHN_XINDEX when real index >= 0xFF00
.symtab_shndx entries hold the actual section indices

This is standard ELF overflow handling, and it is production-critical -- sub_1CB68D0 checks for it with "overflow number of sections %d".

Key Functions

Address	Size	Decompiled	Purpose
`sub_612DE0`	~12 KB	47 KB	Cubin generation entry point
`sub_1C9F280`	15,263 B	97 KB	Master ELF emitter
`sub_1CC9800`	14,764 B	90 KB	nvinfo/EIATTR section builder
`sub_1CABD60`	11,856 B	67 KB	Master section allocator (shared/const/local)
`sub_1CB68D0`	9,578 B	49 KB	Symbol table builder (.symtab)
`sub_1CA3A90`	6,289 B	45 KB	Section merger (merc + non-merc)
`sub_1C9DC60`	5,663 B	29 KB	Section layout calculator
`sub_1C99BB0`	4,900 B	25 KB	Section index remap (.symtab_shndx)
`sub_1C9C300`	3,816 B	24 KB	Capsule Mercury section processor
`sub_1C9B110`	4,585 B	23 KB	Mercury capsule builder
`sub_1CD48C0`	4,184 B	22 KB	Master relocation resolver
`sub_1CBC090`	2,870 B	20 KB	Dead function eliminator
`sub_1CA2E40`	3,152 B	18 KB	Mercury section cloner
`sub_1CA92F0`	2,804 B	16 KB	Shared memory interference graph
`sub_1C9D1F0`	2,667 B	16 KB	Debug section classifier
`sub_1CA6890`	2,286 B	15 KB	Constant bank deduplication
`sub_1CC8950`	2,634 B	15 KB	Barrier/register count propagator
`sub_1CBB920`	~2,000 B	14 KB	Recursion detector (DFS)
`sub_1CB91C0`	2,668 B	13 KB	ELF structure dumper (debug)
`sub_1CB53A0`	3,480 B	13 KB	ELFW constructor
`sub_1CAB300`	2,157 B	12 KB	Bindless texture/surface handler
`sub_1CD5920`	1,985 B	11 KB	Relocation writer (.nv.resolvedrela)
`sub_1CD13A0`	2,541 B	11 KB	File serializer (final disk write)
`sub_1CBE1B0`	1,992 B	10 KB	.nv.callgraph section builder
`sub_1CD22E0`	1,979 B	10 KB	UFT manager
`sub_1CB3570`	1,963 B	10 KB	Section creator (44 call sites)
`sub_1C98C60`	1,755 B	9 KB	Mercury debug section classifier
`sub_1CB2CA0`	2,038 B	8 KB	Symbol fixup (post-deletion)
`sub_1CC7FB0`	--	--	.nv.info section name formatter
`sub_1CB9FF0`	--	--	Section count accessor
`sub_1CB9C40`	--	--	Get section by index

Cross-References

Custom ELF Emitter -- deep dive into ELFW object, header construction, section management, file serialization
Section Catalog & EIATTR -- complete inventory of section types and EIATTR attribute encoding
Relocations & Symbols -- relocation resolution, UFT/UDT management, symbol table details
Debug Information -- DWARF generation and .debug_* section handling
Capsule Mercury & Finalization -- capmerc packaging format, off-target finalization, self-check
Mercury Encoder -- Mercury instruction encoding (phases 117--122) that feeds the ELF emitter
SASS Code Generation -- the upstream per-kernel compilation that produces SASS bytes
Pipeline Overview -- where the ELF phase fits in the full PTX-to-SASS flow


PTXAS Reverse Engineering Reference