
ELF/Cubin Output

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.

After all per-kernel SASS encoding completes, ptxas enters the ELF output phase -- the final stage of the compilation pipeline. This phase transforms the accumulated per-kernel SASS bytes, relocation metadata, constant bank data, shared memory layouts, and debug information into a complete NVIDIA CUBIN file. The CUBIN is a standard ELF container with NVIDIA-proprietary extensions: machine type EM_CUDA (0xBE), non-standard ELF class bytes, CUDA-specific section types, and a rich per-entry metadata system called EIATTR. The output pipeline is a custom implementation with no libelf dependency -- ptxas constructs every byte of the ELF from scratch, including headers, section tables, symbol tables, string tables, relocations, and program headers.

The output phase handles three binary kinds: SASS (raw resolved SASS, legacy default), Mercury (SM 75--99 default), and Capsule Mercury (SM 100+ default, supporting deferred finalization). All three produce a valid CUBIN ELF; the difference is whether the .text sections contain final SASS bytes or Mercury-encoded streams that a downstream finalizer resolves at link or load time.

| Item | Location / Value |
|---|---|
| Cubin entry point | sub_612DE0 (47 KB, called from sub_446240) |
| ELFW constructor | sub_1CB53A0 (3,480 bytes, 672-byte central object) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 call sites) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes, ~1,700 decompiled lines) |
| Master ELF emitter | sub_1C9F280 (15,263 bytes, 97 KB decompiled -- largest function in output range) |
| Section layout calculator | sub_1C9DC60 (5,663 bytes) |
| Master section allocator | sub_1CABD60 (11,856 bytes, 67 KB decompiled -- shared/constant/local addresses) |
| nvinfo/EIATTR builder | sub_1CC9800 (14,764 bytes, 90 KB decompiled) |
| Master relocation resolver | sub_1CD48C0 (4,184 bytes, 22 KB decompiled) |
| File serializer | sub_1CD13A0 (2,541 bytes, writes final bytes to disk) |
| ELF machine type | EM_CUDA = 0xBE (190) |
| CUDA section type | SHT_CUDA_INFO = 0x70000064 |
| ELF timing | "ELF-time : %.3f ms (%.2f%%)" in --compiler-stats output |
| Peak ELF memory | "PeakELFMemoryUsage : %.3lf KB" |

Pipeline Overview

The ELF output pipeline runs as a single-threaded sequence after all per-kernel OCG passes have completed (multi-kernel compilation may be parallel, but ELF emission is serialized). The flow is orchestrated by sub_612DE0, which reads compilation flags, constructs the ELFW central object, then drives 11 phases to produce the final .cubin or .o file.

sub_446240 (compilation driver -- "real main")
  |  all per-kernel OCG passes complete
  v
sub_612DE0 (cubin entry, 47KB)
  |  reads: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
  |  establishes setjmp/longjmp error recovery
  |  writes "Cuda compilation tools, release 13.0, V13.0.88"
  |         "Build cuda_13.0.r13.0/compiler.36424714_0"
  |
  |-- Phase 1: ELFW construction
  |     sub_1CB53A0 -- create 672-byte ELFW object, 7 standard sections
  |
  |-- Phase 2: Per-kernel section creation
  |     sub_1CB42D0 x N -- .text.<func>, .rela.text.<func> (one per kernel)
  |     sub_1CB3570 x 44 -- .nv.constant0.<func>, .nv.shared.<func>, etc.
  |
  |-- Phase 3: Call graph analysis
  |     sub_1CBB920 -- recursion detection (DFS)
  |     sub_1CBC090 -- dead function elimination
  |     sub_1CBE1B0 -- .nv.callgraph section builder
  |
  |-- Phase 4: Symbol fixup
  |     sub_1CB2CA0 -- renumber symbols after dead code elimination
  |     sub_1C99BB0 -- remap .symtab_shndx extended indices
  |
  |-- Phase 5: Memory allocation
  |     sub_1CABD60 -- assign addresses: shared, constant, local memory
  |     sub_1CA92F0 -- shared memory interference graph
  |     sub_1CA6890 -- constant bank deduplication
  |
  |-- Phase 6: nvinfo/EIATTR generation
  |     sub_1CC9800 -- build .nv.info.<func> sections (EIATTR attributes)
  |     sub_1CC8950 -- propagate barrier/register counts across call graph
  |
  |-- Phase 7: Symbol table construction
  |     sub_1CB68D0 -- build .symtab, handle SHN_XINDEX overflow
  |
  |-- Phase 8: Section layout
  |     sub_1C9DC60 -- compute file offsets with alignment padding
  |
  |-- Phase 9: Relocation resolution
  |     sub_1CD48C0 -- resolve all R_CUDA_* relocations
  |     sub_1CD5920 -- write .nv.resolvedrela sections
  |
  |-- Phase 10: Capsule Mercury embedding (SM 100+ only)
  |     sub_1C9B110 -- create .nv.merc.* section namespace
  |     sub_1CA2E40 -- clone memory-space sections into merc namespace
  |     sub_1C9C300 -- build 328-byte capsule descriptors per function
  |
  |-- Phase 11: Final assembly & write
  |     sub_1C9F280 -- master ELF emitter (assemble complete CUBIN)
  |     sub_1CD13A0 -- file serializer (write to disk)
  |
  v
OUTPUT: .cubin / .o file

Custom ELF Emitter

ptxas builds the entire ELF output without libelf. The custom implementation spans approximately 20 functions in the 0x1C99--0x1CD6 address range (~300 KB of binary code). At the center is the ELFW ("ELF world") object -- a 672-byte structure that owns all sections, symbols, and string tables for a single compilation unit.

ELFW Object Layout

The ELFW constructor sub_1CB53A0 allocates 672 bytes from the pool allocator sub_424070, creates a dedicated "elfw memory space" pool (4,096-byte initial allocation), writes the ELF header, and initializes 7 mandatory sections (plus the required null entry at index 0):

| Index | Section | Type | Purpose |
|---|---|---|---|
| 0 | (null) | SHT_NULL | Required ELF null section |
| 1 | .shstrtab | SHT_STRTAB | Section name string table |
| 2 | .strtab | SHT_STRTAB | Symbol name string table |
| 3 | .symtab | SHT_SYMTAB | Symbol table |
| 4 | .symtab_shndx | SHT_SYMTAB_SHNDX | Extended section indices (for >65,280 sections) |
| 5 | .note.nv.tkinfo | SHT_NOTE | NVIDIA toolkit info (version, build ID, CLI args) |
| 6 | .note.nv.cuinfo | SHT_NOTE | NVIDIA CUDA info (SM version, features) |
| 7 | .nv.uft.entry | SHT_PROGBITS | Unified Function Table entries |

ELF Header

The ELF header uses the standard layout with NVIDIA-specific overrides:

| Offset | Size | Field | CUDA Value |
|---|---|---|---|
| 0x00 | 4 | e_ident[EI_MAG0..3] | 0x7F 'E' 'L' 'F' (magic 0x464C457F) |
| 0x04 | 1 | e_ident[EI_CLASS] | 0x33 ('3', 32-bit) or 0x41 ('A', 64-bit) |
| 0x05 | 1 | e_ident[EI_DATA] | 0x01 (little-endian) |
| 0x06 | 1 | e_ident[EI_VERSION] | 0x01 (EV_CURRENT) |
| 0x07 | 1 | e_ident[EI_OSABI] | CUDA ABI version |
| 0x12 | 2 | e_machine | 0x00BE (EM_CUDA = 190) |
| 0x24 | 4 | e_flags | SM version bits [7:0] + CUDA control flags |

Standard ELF uses ELFCLASS32 = 1 and ELFCLASS64 = 2. CUDA cubins use non-standard values '3' (0x33) and 'A' (0x41), which the CUDA driver recognizes as the cubin signature. Any tool using standard libelf will reject these as invalid, which is one reason ptxas uses a custom emitter.

The e_flags field packs the SM architecture version in the low byte (e.g., 100 for sm_100) along with flags for relocatable vs. executable mode and address size. The mask 0x7FFFBFFF (clears bits 14 and 31) is applied during finalization to strip internal control flags that must not appear in the output cubin.
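A minimal reader for these fields can be sketched in C. The struct below mirrors the 32-bit header offsets from the table above and is an illustration, not NVIDIA's actual definition:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative 32-bit ELF header layout matching the offsets in the
 * table above. A sketch for recognizing cubins, not NVIDIA's struct. */
typedef struct {
    uint8_t  e_ident[16];  /* magic, class, data, version, OS ABI */
    uint16_t e_type;       /* 0x10 */
    uint16_t e_machine;    /* 0x12: 0x00BE = EM_CUDA */
    uint32_t e_version;    /* 0x14 */
    uint32_t e_entry;      /* 0x18 */
    uint32_t e_phoff;      /* 0x1C */
    uint32_t e_shoff;      /* 0x20 */
    uint32_t e_flags;      /* 0x24: SM version in bits [7:0] */
} cubin_ehdr32_t;

/* Nonzero if the header carries the cubin signature: ELF magic,
 * non-standard class byte '3'/'A', and machine EM_CUDA. */
int is_cubin(const cubin_ehdr32_t* h) {
    static const uint8_t magic[4] = {0x7F, 'E', 'L', 'F'};
    if (memcmp(h->e_ident, magic, sizeof magic) != 0) return 0;
    if (h->e_ident[4] != 0x33 && h->e_ident[4] != 0x41) return 0;
    return h->e_machine == 0x00BE;
}

/* SM architecture version is packed in the low byte of e_flags. */
unsigned cubin_sm_version(const cubin_ehdr32_t* h) {
    return h->e_flags & 0xFFu;
}
```

A standard libelf-based tool would stop at the class-byte check, which is exactly why these files need custom handling.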

Section Generation

Each kernel/function produces a set of ELF sections. For a program with N entry functions and M device functions, the CUBIN contains on the order of 4*(N+M) sections minimum. The section creator sub_1CB3570 (44 call sites) handles the generic case; sub_1CB42D0 is specialized for .text.<funcname> code sections.

Per-Kernel Sections

For each kernel entry my_kernel, the output pipeline generates:

| Section | Type | Content |
|---|---|---|
| .text.my_kernel | SHT_PROGBITS | SASS instruction bytes (SHF_ALLOC \| SHF_EXECINSTR) |
| .rela.text.my_kernel | SHT_RELA | Relocation entries for the code section |
| .nv.info.my_kernel | SHT_CUDA_INFO (0x70000064) | EIATTR metadata (register count, barriers, stack sizes, etc.) |
| .nv.constant0.my_kernel | SHT_PROGBITS | Constant bank 0 data (kernel parameters + literal constants) |

Additional per-kernel sections are generated as needed:

| Section | Condition |
|---|---|
| .nv.shared.my_kernel | Kernel uses shared memory |
| .nv.local.my_kernel | Kernel uses local (spill) memory |
| .nv.global.init | Program uses initialized global variables |
| .nv.callgraph | Relocatable object mode (-c) |
| .nv.prototype | Prototype information for cross-module linking |

Global Sections

| Section | Purpose |
|---|---|
| .nv.info | Global EIATTR attributes (not per-kernel) |
| .nv.constant0 | Merged constant bank (whole-program mode) |
| .nv.reservedSmem | Reserved shared memory for runtime (tmem allocation, mbarrier parity) |
| .nv.metadata | Module-level metadata |
| .nv.compat | Forward-compatibility attributes |
| .note.nv.cuver | CUDA version note |
| .nv.uft | Unified Function Table (indirect call support) |
| .nv.udt | Unified Data Table |
| .nv.uft.entry | UFT entry point table |
| .nv.udt.entry | UDT entry point table |
| .nv.rel.action | Relocation action table |
| .nv.resolvedrela | Resolved relocations (post-linking) |
| .nv.host | Host-side interop data |

Constant Banks

CUDA supports up to 18 numbered constant banks (0--17) plus 7 named banks:

| Bank | Name | Purpose |
|---|---|---|
| 0 | .nv.constant0 | Kernel parameters + compiler constants (per-entry) |
| 1--17 | .nv.constant1--.nv.constant17 | User-declared __constant__ variables |
| -- | .nv.constant.entry_params | Entry point parameter block |
| -- | .nv.constant.entry_image_header_indices | Texture/surface header index table |
| -- | .nv.constant.driver | Driver-injected constants |
| -- | .nv.constant.optimizer | Optimizer-generated constants (OCG) |
| -- | .nv.constant.user | User-specified constants |
| -- | .nv.constant.pic | Position-independent code constants |
| -- | .nv.constant.tools_data | Tools/debugger-injected data |

Section Ordering

During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF:

| Bucket | Contents |
|---|---|
| 0 (highest) | ELF header pseudo-section, .shstrtab |
| 1 | .strtab, .symtab, .symtab_shndx |
| 2 | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | .text.<funcname> (code sections) |
| 4 | .nv.constant0.*, .nv.shared.*, .nv.local.* (data sections) |
| 5 | .rela.*, .rel.* (relocation sections) |
| 6 | .nv.info.* (EIATTR metadata sections) |
| 7 (lowest) | .debug_*, .nv.merc.* (debug and Mercury metadata) |

Section file offsets are assigned by sub_1C9DC60, walking the sorted list with alignment padding:

uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
    section_t* sec = sorted_sections[i];
    /* round offset up to the section's alignment (power of two) */
    if (sec->sh_addralign > 1)
        offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
    sec->sh_offset = offset;
    offset += sec->sh_size;
}

Two section types receive special treatment during layout: .nv.constant0 (address assigned by the OCG constant bank allocator, not the layout calculator) and .nv.reservedSmem (address assigned by the shared memory master allocator sub_1CABD60).

EIATTR Metadata

Each kernel's .nv.info.<funcname> section contains a sequence of EIATTR (Entry Information Attribute) records. These encode per-kernel metadata that the CUDA driver reads at launch time to configure the hardware correctly. The EIATTR builder is sub_1CC9800 (14,764 binary bytes, 90 KB decompiled, 51 callees) -- one of the largest functions in the output pipeline.

EIATTR Encoding

Each EIATTR record is a TLV (type-length-value) structure:

+--------+--------+------------------+
| type   | length | value            |
| 2 bytes| 2 bytes| length bytes     |
+--------+--------+------------------+

The type field is a 16-bit EIATTR code. The length field specifies the payload size. The value is type-dependent (scalar, array, or structured).
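Following that layout, a record writer can be sketched in a few lines. The function name and little-endian 16-bit framing are taken from the description above; this is an illustration, not ptxas code:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of an EIATTR TLV record writer: 2-byte type, 2-byte length,
 * then `len` payload bytes, little-endian throughout. Function name
 * and buffer handling are illustrative. Returns bytes written. */
size_t eiattr_emit(uint8_t* out, uint16_t type, const void* value, uint16_t len) {
    out[0] = (uint8_t)(type & 0xFF);   /* type, little-endian */
    out[1] = (uint8_t)(type >> 8);
    out[2] = (uint8_t)(len & 0xFF);    /* payload length */
    out[3] = (uint8_t)(len >> 8);
    if (len)
        memcpy(out + 4, value, len);   /* type-dependent payload */
    return 4 + (size_t)len;
}
```

A scalar attribute with a 4-byte payload occupies 8 bytes total; a flag attribute with a 0-byte payload is just the 4-byte header.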

EIATTR Catalog

ptxas v13.0.88 defines 98 EIATTR codes. The critical ones that every cubin emitter must produce:

| EIATTR | Purpose | Encoding |
|---|---|---|
| EIATTR_REGCOUNT | Register count for this kernel | 4-byte LE integer |
| EIATTR_NUM_BARRIERS | Hardware barrier count (0--16) | 4-byte LE integer |
| EIATTR_FRAME_SIZE | Per-thread stack frame size in bytes | 4-byte LE integer |
| EIATTR_MIN_STACK_SIZE | Minimum call stack size | 4-byte LE integer |
| EIATTR_MAX_STACK_SIZE | Maximum call stack size (recursive) | 4-byte LE integer |
| EIATTR_CRS_STACK_SIZE | Call/Return/Sync stack size | 4-byte LE integer |
| EIATTR_EXIT_INSTR_OFFSETS | Byte offsets of EXIT instructions in .text | Array of 4-byte offsets |
| EIATTR_S2RCTAID_INSTR_OFFSETS | Byte offsets of S2R SR_CTAID.* instructions | Array of 4-byte offsets |
| EIATTR_CTAIDZ_USED | Kernel reads CTA ID Z dimension | Flag (0-byte payload) |
| EIATTR_REQNTID | Required thread block dimensions | 3x 4-byte integers (X, Y, Z) |
| EIATTR_MAX_THREADS | Maximum threads per block | 4-byte LE integer |
| EIATTR_PARAM_CBANK | Constant bank for kernel parameters | 4-byte bank index + offset |
| EIATTR_CBANK_PARAM_SIZE | Size of parameter constant bank region | 4-byte LE integer |
| EIATTR_KPARAM_INFO | Per-parameter ordinal/offset/size/alignment | Structured (V1) |
| EIATTR_KPARAM_INFO_V2 | Per-parameter info, extended format | Structured (V2) |
| EIATTR_MAXREG_COUNT | Maximum register count directive | 4-byte LE integer |
| EIATTR_EXTERNS | List of external symbol references | Array of symbol indices |

Additional EIATTR codes for textures/surfaces, barriers, cooperative groups, tensor cores, and hardware workarounds:

| EIATTR | Purpose |
|---|---|
| EIATTR_IMAGE_SLOT | Texture/surface image binding slot |
| EIATTR_SAMPLER_INIT | Sampler initialization data |
| EIATTR_TEXID_SAMPID_MAP | Texture-to-sampler mapping |
| EIATTR_BINDLESS_IMAGE_OFFSETS | Bindless texture/surface offset table |
| EIATTR_SYNC_STACK | Synchronization stack requirements |
| EIATTR_COOP_GROUP_MASK_REGIDS | Cooperative group register IDs |
| EIATTR_NUM_MBARRIERS | Number of mbarrier objects used |
| EIATTR_MBARRIER_INSTR_OFFSETS | mbarrier instruction locations |
| EIATTR_WMMA_USED | Kernel uses WMMA (Tensor Core) instructions |
| EIATTR_TCGEN05_1CTA_USED | Kernel uses 5th-gen Tensor Core (1-CTA mode) |
| EIATTR_TCGEN05_2CTA_USED | Kernel uses 5th-gen Tensor Core (2-CTA mode) |
| EIATTR_SPARSE_MMA_MASK | Structured sparsity mask for MMA |
| EIATTR_CTA_PER_CLUSTER | CTAs per cluster (SM 90+) |
| EIATTR_EXPLICIT_CLUSTER | Kernel requires explicit cluster launch |
| EIATTR_MAX_CLUSTER_RANK | Maximum cluster rank |
| EIATTR_REG_RECONFIG | Register reconfiguration (setmaxreg) |
| EIATTR_SAM_REGION_STACK_SIZE | SAM (Shared Address Mode) region stack |
| EIATTR_RESERVED_SMEM_USED | Kernel uses reserved shared memory |
| EIATTR_RESERVED_SMEM_0_SIZE | Size of reserved shared memory region 0 |
| EIATTR_SW1850030_WAR | Hardware workaround (bug SW-1850030) |
| EIATTR_SW2393858_WAR | Hardware workaround (bug SW-2393858) |
| EIATTR_SW2861232_WAR | Hardware workaround (bug SW-2861232) |
| EIATTR_CUDA_API_VERSION | Required CUDA API version |
| EIATTR_MERCURY_ISA_VERSION | Mercury ISA version for capmerc binaries |
| EIATTR_MERCURY_FINALIZER_OPTIONS | Finalizer tuning for capmerc |
| EIATTR_SYSCALL_OFFSETS | Byte offsets of syscall instructions |
| EIATTR_INSTR_REG_MAP | Instruction-to-register allocation map (debug) |
| EIATTR_STATISTICS | Per-kernel compilation statistics |
| EIATTR_PERF_STATISTICS | Performance statistics |

Barrier and Register Count Propagation

The EIATTR builder runs a cross-function propagation pass via sub_1CC8950 (2,634 bytes). When a device function uses barriers or a high register count, these requirements must propagate upward to the entry kernel that calls it:

"regcount %d for %s propagated to entry %s"
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
    from section flags of %s to nvinfo for entry symbol %s"
"Propagating higher barcount %d to the section flags
    of %s of entry symbol %s"

This ensures that the CUDA driver allocates sufficient barriers and registers for the entry kernel, accounting for all callees.
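The effect of this pass can be sketched as a max-propagation over the call graph. All data structures below are invented for illustration; the real pass works over ELFW symbols and section flags, and recursion is rejected separately by the Phase 3 detector, so an acyclic graph is assumed:

```c
/* Sketch: propagate the maximum register/barrier requirement from
 * callees up to each caller. Assumes an acyclic call graph (ptxas
 * detects recursion separately in Phase 3). Adjacency matrix and
 * array sizes are invented for illustration. */
#define MAX_FUNCS 16

int nfuncs;
int calls[MAX_FUNCS][MAX_FUNCS];   /* calls[i][j] != 0: i calls j */

/* Returns the propagated requirement for function f, updating req[]. */
unsigned propagate_req(unsigned req[], unsigned visited[], int f) {
    if (visited[f]) return req[f];
    visited[f] = 1;
    for (int j = 0; j < nfuncs; j++) {
        if (!calls[f][j]) continue;
        unsigned r = propagate_req(req, visited, j);
        if (r > req[f]) req[f] = r;   /* callee needs more: lift it */
    }
    return req[f];
}
```

Running this from each entry symbol yields the per-entry counts that end up in EIATTR_REGCOUNT and EIATTR_NUM_BARRIERS.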

Relocation Processing

The relocation system handles symbol resolution for branch targets, constant bank references, function descriptors, texture/surface bindings, and address computations. The master relocation resolver is sub_1CD48C0 (4,184 binary bytes, 22 KB decompiled). ptxas defines 117 CUDA-specific relocation types (R_CUDA_NONE through R_CUDA_NONE_LAST).

Relocation Categories

| Category | Types | Purpose |
|---|---|---|
| Absolute address | R_CUDA_32, R_CUDA_64, R_CUDA_ABS* | Global memory addresses |
| Global address | R_CUDA_G32, R_CUDA_G64, R_CUDA_G8_* | Global-space addresses |
| PC-relative | R_CUDA_PCREL_IMM24_26, R_CUDA_PCREL_IMM24_23 | Branch/call targets |
| Constant field | R_CUDA_CONST_FIELD19_*, R_CUDA_CONST_FIELD21_*, R_CUDA_CONST_FIELD22_* | Constant bank references |
| Function descriptor | R_CUDA_FUNC_DESC* | Indirect function call targets |
| Texture/surface | R_CUDA_TEX_HEADER_INDEX, R_CUDA_SAMP_HEADER_INDEX, R_CUDA_SURF_* | Texture/surface binding |
| Bindless | R_CUDA_BINDLESSOFF*, R_CUDA_TEX_BINDLESSOFF* | Bindless texture/surface offsets |
| Sub-byte patching | R_CUDA_8_0 through R_CUDA_8_56 | Individual byte within a 64-bit instruction |
| Unified address | R_CUDA_UNIFIED, R_CUDA_UNIFIED_32, R_CUDA_UNIFIED_8_* | Unified address space |
| Instruction-level | R_CUDA_INSTRUCTION64, R_CUDA_INSTRUCTION128 | Whole-instruction replacement |
| Yield/NOP | R_CUDA_YIELD_OPCODE9_0, R_CUDA_YIELD_CLEAR_PRED4_87 | YIELD-to-NOP patching |
| Cleanup | R_CUDA_UNUSED_CLEAR32, R_CUDA_UNUSED_CLEAR64 | Zero out unused instruction fields |

Resolution Logic

The resolver performs these operations for each relocation entry:

  1. Alias resolution -- redirect relocations from alias symbols to their targets ("change alias reloc %s to %s")
  2. Dead function filtering -- skip relocations on eliminated functions ("ignore reloc on dead func %s")
  3. UFT/UDT pseudo-relocation -- handle __UFT_OFFSET, __UFT_CANONICAL, __UDT_OFFSET, __UDT_CANONICAL synthetic symbols
  4. PC-relative validation -- ensure branch targets are in the same section ("PC relative branch address should be in the same section")
  5. YIELD-to-NOP conversion -- convert YIELD instructions to NOP when forward progress requirements prevent yielding
  6. Unified reloc replacement -- convert type 103 (unified) to type 1 (absolute) for final resolution
  7. Address computation -- compute final patched value from symbol address + addend

Output relocation sections (.nv.resolvedrela) are written by sub_1CD5920.
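For the simplest case, step 7 for an absolute 32-bit relocation amounts to writing symbol address plus addend little-endian at the relocation offset. A sketch, with struct names invented for illustration:

```c
#include <stdint.h>

/* Sketch of step 7 for an absolute 32-bit relocation (R_CUDA_32
 * style): patched value = symbol address + addend, stored
 * little-endian at r_offset in the section bytes. The struct is
 * illustrative, not ptxas's internal layout. */
typedef struct {
    uint64_t r_offset;   /* byte offset within the section */
    uint64_t sym_addr;   /* resolved symbol address */
    int64_t  r_addend;   /* constant adjustment */
} reloc_t;

void apply_abs32(uint8_t* section, const reloc_t* r) {
    uint32_t v = (uint32_t)(r->sym_addr + (uint64_t)r->r_addend);
    /* little-endian byte store; the sub-byte R_CUDA_8_* relocations
       patch these same bytes one at a time */
    for (int i = 0; i < 4; i++)
        section[r->r_offset + i] = (uint8_t)(v >> (8 * i));
}
```

The more exotic categories (constant-field, yield/NOP, whole-instruction) differ only in which bits of the 64- or 128-bit instruction word the computed value lands in.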

Debug Information

When --device-debug or --generate-line-info is active, ptxas generates DWARF debug sections. The debug subsystem spans 0x1CBF--0x1CC9 and includes parsers, emitters, and dumpers for the standard DWARF sections plus NVIDIA-specific debug extensions.

Debug Sections

| Section | Content |
|---|---|
| .debug_info | DWARF DIE tree (compilation units, types, variables) |
| .debug_abbrev | DWARF abbreviation table |
| .debug_line | Source-to-address line number mapping |
| .debug_frame | Call frame information for unwinding |
| .debug_loc | Location lists for variables |
| .debug_str | DWARF string table |
| .debug_ranges | Address ranges |
| .debug_aranges | Address range lookup table |
| .debug_pubnames | Public name index |
| .debug_pubtypes | Public type index |
| .nv_debug_ptx_txt | Embedded PTX source text |
| .nv_debug_line_sass | SASS-level line number mapping |
| .nv_debug_info_reg_sass | Register allocation debug info |
| .nv_debug_info_reg_type | Register type information |

Key debug infrastructure functions:

| Function | Purpose |
|---|---|
| sub_1CBF820 | DWARF form name table (DW_FORM_addr, DW_FORM_data4, etc.) |
| sub_1CBF9B0 | DWARF attribute name table (DW_AT_producer, DW_AT_comp_dir, etc.) |
| sub_1CC0850 | .debug_abbrev parser/emitter (18 KB decompiled) |
| sub_1CC4A40 | .debug_info DIE tree walker (28 KB decompiled) |
| sub_1CC34E0 | DWARF location expression decoder (DW_OP_* operations) |
| sub_1CC24C0 | .debug_info emission pass (18 KB decompiled) |
| sub_1CC5EB0 | Compilation unit header parser |
| sub_1C9D1F0 | Debug section classifier/mapper (16 KB decompiled) |

The --suppress-debug-info option (sub_432A00) disables debug section generation even when debug flags are present.

Capsule Mercury Output Path

For SM 100+ targets (Blackwell, Jetson Thor, consumer RTX 50-series), the default output mode is Capsule Mercury. In this mode, the CUBIN ELF contains two layers of content: standard CUBIN sections and a parallel set of .nv.merc.* sections carrying Mercury-encoded instruction streams plus all metadata needed for deferred finalization.

Mercury-Specific Sections

| Section | Purpose |
|---|---|
| .nv.merc.debug_abbrev | Cloned DWARF abbreviation table |
| .nv.merc.debug_info | Cloned DWARF info |
| .nv.merc.debug_line | Cloned DWARF line table |
| .nv.merc.debug_frame | Cloned DWARF frame info |
| .nv.merc.debug_loc | Cloned DWARF locations |
| .nv.merc.debug_str | Cloned DWARF string table |
| .nv.merc.debug_ranges | Cloned DWARF ranges |
| .nv.merc.debug_aranges | Cloned DWARF address ranges |
| .nv.merc.debug_pubnames | Cloned DWARF public names |
| .nv.merc.debug_pubtypes | Cloned DWARF public types |
| .nv.merc.debug_macinfo | Cloned DWARF macro info |
| .nv.merc.nv_debug_ptx_txt | Embedded PTX source text |
| .nv.merc.nv_debug_line_sass | SASS-level line mapping |
| .nv.merc.nv_debug_info_reg_sass | Register allocation debug info |
| .nv.merc.nv_debug_info_reg_type | Register type debug info |
| .nv.merc.symtab_shndx | Extended section index table (merc copy) |
| .nv.merc.nv.shared.reserved | Shared memory reservation metadata |
| .nv.merc.rela<secname> | Per-section relocation tables |

Capsule Mercury Construction

The capmerc path is integrated into the master ELF emitter. The sequence:

  1. sub_1C9B110 (23 KB decompiled) creates the .nv.merc.* section namespace
  2. sub_1CA2E40 (18 KB decompiled) clones constant/global/shared/local sections into the merc namespace, creating .nv.merc.rela relocation sections
  3. sub_1C9C300 (24 KB decompiled) processes .nv.capmerc<funcname> sections. Constructs a 328-byte capsule descriptor per function containing: Mercury-encoded instruction stream, relocation metadata, KNOBS configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage)
  4. sub_1CA3A90 (45 KB decompiled) merges sections that have both merc and non-merc copies
  5. sub_1C99BB0 (25 KB decompiled) remaps section indices after merc section insertion

Off-Target Finalization

The cubin entry sub_612DE0 implements a "fastpath optimization" for off-target finalization:

"[Finalizer] fastpath optimization applied for off-target %u -> %u finalization"

When a capmerc binary compiled for SM X is finalized for SM Y (within the same family), the fastpath patches the Mercury instruction stream directly without full recompilation. The compatibility checker sub_60F290 determines whether fastpath is safe based on instruction set compatibility, register file layout, and memory model.

Self-Check

The --self-check option performs roundtrip verification: generate capmerc, reconstitute SASS from the capsule, and compare section-by-section. The verifier uses a Flex SASS lexer (sub_720F00, 64 KB) and a comparator (sub_729540, 35 KB). Error string: "Failure of '%s' section in self-check for capsule mercury".

Multi-Kernel Output

A typical CUDA program compiles multiple kernels and device functions into a single CUBIN. The output pipeline handles this through per-function section isolation, combined with cross-function analysis for call graph construction, barrier propagation, and dead code elimination.

Per-Function Section Layout

Each entry function and each device function gets its own .text section (the -ffunction-sections pattern). This enables:

  • Function-level dead code elimination -- sub_1CBC090 removes .text, .rela.text, .nv.info, and .nv.constant0 sections for unreachable functions
  • Linker granularity -- nvlink can select individual functions from relocatable objects
  • Driver loading -- the CUDA runtime can load individual kernels by name

Call Graph Construction

The call graph builder (sub_1CBE1B0) emits a .nv.callgraph section that encodes inter-function call edges. This section is present only in relocatable object mode (-c). The recursion detector (sub_1CBB920) performs a DFS traversal with manual 9-level unrolling, emitting "recursion at function %d" for each cycle found.

Dead functions are eliminated by sub_1CBC090:

"dead function %d(%s)"
"removed un-used section %s (%d)"   (x8 -- once per section type)
"function %d(%s) has address taken but no call to it"

Memory Allocation Across Kernels

The master section allocator sub_1CABD60 (11,856 binary bytes, 67 KB decompiled, 69 callees) assigns addresses to all memory-space sections across all kernels. It runs a multi-pass algorithm:

  1. Global shared allocation -- shared variables visible to multiple kernels
  2. Per-entry shared memory -- shared variables private to each kernel
  3. Extern shared handling -- dynamically-sized shared memory (extern __shared__)
  4. Reserved shared memory -- runtime reservations (.nv.reservedSmem.begin, .nv.reservedSmem.cap, .nv.reservedSmem.offset0, .nv.reservedSmem.offset1)
  5. Local memory -- per-thread spill storage
  6. Constant bank merging -- merges constant bank data across kernels, with deduplication (sub_1CA6890: "found duplicate value 0x%x, alias %s to %s")

The shared memory allocator sub_1CA92F0 (2,804 bytes) builds an interference graph for shared objects and performs group allocation for non-overlapping variables.
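The idea behind group allocation can be sketched as a greedy first-fit over an interference matrix: variables that never interfere may occupy overlapping addresses. All names and data structures below are invented; the real allocator's policy is more involved:

```c
/* Sketch of interference-driven shared memory allocation: two objects
 * may share addresses only if they never interfere (are never live at
 * the same time). Greedy first-fit; matrix and sizes are invented. */
#define MAX_VARS 8

int interferes[MAX_VARS][MAX_VARS];  /* symmetric interference edges */

void alloc_shared(int n, const unsigned size[], unsigned offset[]) {
    for (int i = 0; i < n; i++) {
        offset[i] = 0;
        int moved;
        do {                         /* bump past interfering overlaps */
            moved = 0;
            for (int j = 0; j < i; j++) {
                if (!interferes[i][j]) continue;  /* may share space */
                if (offset[i] < offset[j] + size[j] &&
                    offset[j] < offset[i] + size[i]) {
                    offset[i] = offset[j] + size[j];
                    moved = 1;
                }
            }
        } while (moved);
    }
}
```

Two kernels' private shared variables are a natural non-interfering group: only one kernel runs against a given shared memory allocation, so their variables can reuse the same addresses.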

SHN_XINDEX Overflow

Large CUDA programs can exceed the ELF 65,280-section limit (SHN_LORESERVE = 0xFF00). Each kernel generates at minimum 4 sections (.text, .rela.text, .nv.info, .nv.constant0), so a program with 16,000+ kernels triggers the overflow mechanism:

  1. e_shnum = 0 in the ELF header (signals overflow)
  2. Section header [0].sh_size = real section count
  3. e_shstrndx = SHN_XINDEX (0xFFFF)
  4. Section header [0].sh_link = real .shstrtab index
  5. Symbol st_shndx = SHN_XINDEX when real index >= 0xFF00
  6. .symtab_shndx entries hold the actual section indices

This is standard ELF overflow handling, and it is production-critical -- sub_1CB68D0 checks for it with "overflow number of sections %d".
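The reader side of this scheme can be sketched in a few lines (illustrative helper, not ptxas or driver code):

```c
#include <stdint.h>

#define SHN_LORESERVE 0xFF00u
#define SHN_XINDEX    0xFFFFu

/* Sketch: recover a symbol's real section index. When st_shndx holds
 * SHN_XINDEX, the true index lives in the parallel .symtab_shndx
 * array (one uint32_t per symbol). Illustrative helper only. */
uint32_t real_shndx(uint16_t st_shndx, const uint32_t* shndx_tab,
                    uint32_t sym_index) {
    if (st_shndx == SHN_XINDEX)
        return shndx_tab[sym_index];  /* extended 32-bit index */
    return st_shndx;                  /* fits below SHN_LORESERVE */
}
```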

Key Functions

| Address | Size | Decompiled | Purpose |
|---|---|---|---|
| sub_612DE0 | ~12 KB | 47 KB | Cubin generation entry point |
| sub_1C9F280 | 15,263 B | 97 KB | Master ELF emitter |
| sub_1CC9800 | 14,764 B | 90 KB | nvinfo/EIATTR section builder |
| sub_1CABD60 | 11,856 B | 67 KB | Master section allocator (shared/const/local) |
| sub_1CB68D0 | 9,578 B | 49 KB | Symbol table builder (.symtab) |
| sub_1CA3A90 | 6,289 B | 45 KB | Section merger (merc + non-merc) |
| sub_1C9DC60 | 5,663 B | 29 KB | Section layout calculator |
| sub_1C99BB0 | 4,900 B | 25 KB | Section index remap (.symtab_shndx) |
| sub_1C9C300 | 3,816 B | 24 KB | Capsule Mercury section processor |
| sub_1C9B110 | 4,585 B | 23 KB | Mercury capsule builder |
| sub_1CD48C0 | 4,184 B | 22 KB | Master relocation resolver |
| sub_1CBC090 | 2,870 B | 20 KB | Dead function eliminator |
| sub_1CA2E40 | 3,152 B | 18 KB | Mercury section cloner |
| sub_1CA92F0 | 2,804 B | 16 KB | Shared memory interference graph |
| sub_1C9D1F0 | 2,667 B | 16 KB | Debug section classifier |
| sub_1CA6890 | 2,286 B | 15 KB | Constant bank deduplication |
| sub_1CC8950 | 2,634 B | 15 KB | Barrier/register count propagator |
| sub_1CBB920 | ~2,000 B | 14 KB | Recursion detector (DFS) |
| sub_1CB91C0 | 2,668 B | 13 KB | ELF structure dumper (debug) |
| sub_1CB53A0 | 3,480 B | 13 KB | ELFW constructor |
| sub_1CAB300 | 2,157 B | 12 KB | Bindless texture/surface handler |
| sub_1CD5920 | 1,985 B | 11 KB | Relocation writer (.nv.resolvedrela) |
| sub_1CD13A0 | 2,541 B | 11 KB | File serializer (final disk write) |
| sub_1CBE1B0 | 1,992 B | 10 KB | .nv.callgraph section builder |
| sub_1CD22E0 | 1,979 B | 10 KB | UFT manager |
| sub_1CB3570 | 1,963 B | 10 KB | Section creator (44 call sites) |
| sub_1C98C60 | 1,755 B | 9 KB | Mercury debug section classifier |
| sub_1CB2CA0 | 2,038 B | 8 KB | Symbol fixup (post-deletion) |
| sub_1CC7FB0 | -- | -- | .nv.info section name formatter |
| sub_1CB9FF0 | -- | -- | Section count accessor |
| sub_1CB9C40 | -- | -- | Get section by index |

Cross-References