NVPTX Target Infrastructure

The NVPTXTargetMachine, NVPTXSubtarget, and NVPTXTargetTransformInfo form the target description layer that the entire LLVM backend consults for every decision from type legality through instruction cost to vectorization factor selection. In upstream LLVM, these are three separate source files totaling roughly 1,500 lines; in cicc v13.0 they are spread across the 0xDF0000-0xE00000 address range (TTI hooks), the 0x330-0x35B range (NVPTXTargetLowering), the type legalization tables embedded in NVPTXSubtarget, and the pipeline assembler at 0x12EA000-0x12F0000 (TargetMachine construction). The NVIDIA delta relative to upstream is moderate -- the TTI hooks return GPU-specific constants rather than CPU ones, the SubtargetFeatures carry NVIDIA-proprietary math precision flags, and the TargetMachine creation path has a dual-path design that handles both the cicc standalone pipeline and the LibNVVM API pipeline.

Key Facts

| Property | Value |
|---|---|
| SM processor table | qword_502A920 (45 entries, stride-2, ctor_605 at 0x584510) |
| Target lookup | sub_12EA530 (4KB, calls sub_16D3AC0 = TargetRegistry::lookupTarget) |
| TargetMachine creation | sub_12F4060 (16KB, NVIDIA options) / sub_12E54A0 (50KB, pipeline path) |
| TTI wrapper pass | sub_1BFB520 (208-byte alloc, wraps sub_1BFB9A0) |
| Register bit width (Vector) | sub_DFE640 -- returns 32 (fixed) |
| Scalable vectors | sub_DFE610 -- returns false |
| Max interleave factor | sub_DFB120 (at TTI+448), sub_DFB730 (vectorized variant) |
| SubtargetFeatures | Offsets +2498, +2584, +2843, +2870, +2871 |
| Target triples | nvptx64-nvidia-cuda, nvptx-nvidia-cuda, nvsass-nvidia-* (6 total) |

NVPTXTargetMachine

Dual-Path Target Initialization

cicc constructs the TargetMachine through two independent code paths depending on whether compilation enters through the standalone cicc CLI or through the LibNVVM API. Both converge on TargetRegistry::lookupTarget (sub_16D3AC0) but assemble the target triple, feature string, and TargetOptions differently.

Path 1 -- cicc standalone (sub_12F7D90 -> sub_12F4060):

sub_12F7D90 — CLI parser:
    parse "-arch=compute_XX" → SM version (multiplied by 10)
    parse "-opt=N"           → optimization level
    parse "-ftz=N"           → flush-to-zero mode
    parse "-fma=N"           → FMA contraction level
    parse "-prec-div=N"      → float division precision
    parse "-prec-sqrt=N"     → sqrt precision
    parse "--device-c"       → device compilation flag

sub_12F4060 — TargetMachine creation (16KB):
    triple = (pointerWidth == 64) ? "nvptx64" : "nvptx"
    features = ""
    if (sharedmem32bit):
        features += "+sharedmem32bitptr"
    features += ",+fma-level=N,+prec-divf32=N,+prec-sqrtf32=N"

    opts = TargetOptions {
        flags: 0,
        reloc: PIC (1),
        codeModel: 8,
        optLevel: from_cli,
        threadModel: 1
    }

    TM = TargetRegistry::lookupTarget(triple, cpu_string)
    if (!TM):
        error "Error: Cannot specify multiple -llcO#\n"
    return TM->createTargetMachine(triple, cpu, features, opts)
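
The Path 1 logic above can be modeled as a small sketch. Function and parameter names here are invented for illustration; the real logic lives in sub_12F4060 and builds the feature string with a leading comma quirk that the sketch normalizes away:

```python
def build_target_config(pointer_width, sharedmem32bit, fma_level,
                        prec_div, prec_sqrt, opt_level):
    """Mirror the pseudocode: pick the triple by pointer width, then
    encode NVIDIA math-precision settings as subtarget features."""
    triple = "nvptx64" if pointer_width == 64 else "nvptx"

    features = []
    if sharedmem32bit:
        features.append("+sharedmem32bitptr")
    features += [
        f"+fma-level={fma_level}",
        f"+prec-divf32={prec_div}",
        f"+prec-sqrtf32={prec_sqrt}",
    ]

    # TargetOptions values observed in both creation paths
    opts = {
        "flags": 0,
        "reloc": 1,        # PIC, always
        "code_model": 8,   # large code model
        "opt_level": opt_level,
        "thread_model": 1,
    }
    return triple, ",".join(features), opts
```

The hard-coded `reloc`/`code_model`/`thread_model` values match the TargetOptions table below; only the optimization level varies with the CLI.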

Path 2 -- pipeline assembler (sub_12E54A0):

The master pipeline assembly function (50KB, called from both Phase I and Phase II) constructs the target independently:

sub_12E54A0:
    ptrSize = Module::getDataLayout().getPointerSizeInBits(0)
    if (8 * ptrSize == 64):
        triple = "nvptx64"                          // 7 chars
    else:
        triple = "nvptx"                            // 5 chars

    target = sub_16D3AC0(&triple, &cpu_string)      // TargetRegistry::lookupTarget
    if (!target):
        error "Failed to locate nvptx target\n"     // sub_1C3EFD0

    // TargetOptions setup:
    opts[0] = 0                                     // no flags
    opts[1] = 1                                     // PIC relocation
    opts[2] = 8                                     // code model
    opts[3] = 1                                     // opt level indicator
    opts[4] = 1                                     // thread model
    opts[5] = 0                                     // reserved

    sub_167F890(subtargetInfo)                       // initialize SubtargetInfo
    TLI = sub_14A04B0(targetLibInfo, moduleName)     // TargetLibraryInfo
    sub_149CBC0(TLI)                                 // finalize TLI
    TTI = sub_1BFB9A0(DataLayout, a2, a3, v269)     // TargetTransformInfo

    optLevel = read qword_4FBB430                    // cl::opt<int> value
    PassManagerBuilder = sub_1611EE0(PM)

The pipeline assembler path also checks for an extension hook: if the target has a createExtendedTargetMachine vtable entry at offset +88, it calls that instead, enabling custom target backends. The returned TargetMachine pointer feeds into the 150+ pass registrations that follow.

TargetOptions

The TargetOptions struct passed to both paths uses LLVM's standard layout. The key NVIDIA-specific values:

| Field | Value | Meaning |
|---|---|---|
| Relocation model | 1 (PIC) | Position-independent code, always |
| Code model | 8 | Large code model (matches PTX's flat addressing) |
| Thread model | 1 | POSIX-style threading assumed |
| Optimization level | From CLI | Stored in qword_4FBB430, default from qword_4FBB430[2] |

NVIDIA-Specific Target Features

The feature string passed to createTargetMachine encodes math precision and shared memory configuration as subtarget features. These are not upstream LLVM features -- they are NVIDIA extensions:

| Feature | CLI Source | Subtarget Effect |
|---|---|---|
| +sharedmem32bitptr | nvptx-short-ptr / nvptx-32-bit-smem | Enables 32-bit pointers for address space 3 (shared memory); adds p3:32:32:32 to data layout |
| +fma-level=N | -fma=N | 0=off, 1=on, 2=aggressive FMA contraction |
| +prec-divf32=N | -prec-div=N | 0=approx, 1=full, 2=IEEE+ftz, 3=IEEE compliant |
| +prec-sqrtf32=N | -prec-sqrt=N | 0=approx (rsqrt.approx), 1=rn (sqrt.rn) |

Registered in ctor_607 (0x584B60, 14KB):

| Knob | Type | Default | Description |
|---|---|---|---|
| nvptx-sched4reg | bool | -- | Schedule for register pressure |
| nvptx-fma-level | int | -- | FMA contraction level |
| nvptx-prec-divf32 | int | -- | F32 division precision |
| nvptx-prec-sqrtf32 | int | -- | Sqrt precision |
| nvptx-approx-log2f32 | bool | -- | Use lg2.approx for log2 |
| nvptx-force-min-byval-param-align | bool | -- | Force 4-byte byval alignment |
| nvptx-normalize-select | bool | -- | Override shouldNormalizeToSelectSequence |
| enable-bfi64 | bool | -- | Enable 64-bit BFI instructions |

NVPTXSubtarget Feature Flags

The NVPTXSubtarget object carries the type legalization tables and architecture-specific feature flags that the SelectionDAG, register allocator, and type legalizer consult at every step. These are populated during target construction and indexed by the SM processor table.

Feature Flag Offsets

| Offset | Size | Purpose | Stride |
|---|---|---|---|
| +120 | ptr | Register class array (8-byte stride entries) | -- |
| +2498 | 259 | Type legality flags (indexed per MVT) | 259 bytes per type action |
| +2584 | 259 | Float legality flags (indexed per MVT) | 259 bytes per type action |
| +2843 | 1 | Integer type support flag | -- |
| +2870 | 1 | Branch distance flag | -- |
| +2871 | 1 | Jump table eligibility flag | -- |

The type legality arrays at +2498 and +2584 are the backbone of SelectionDAG's getTypeAction() and isTypeLegal() queries. Each entry covers one MVT (Machine Value Type) and stores the action: Legal, Promote, Expand, Scalarize, or SplitVector. For NVPTX, i32 and f32 are always Legal; i64 and f64 are Legal on all supported SM versions but with expanded arithmetic costs; vectors wider than 128 bits are always Split or Scalarized.

The function sub_201BB90 reads these offsets during type legalization to determine expansion strategy. The branch distance flags at +2870/+2871 control sub_20650A0, which decides jump table eligibility beyond the standard no-jump-tables flag.
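
The legality queries backed by these arrays can be modeled as a simple lookup. Action names follow LLVM's TargetLowering; the per-MVT assignments shown are only the ones stated above (i32/f32/i64/f64 Legal, vectors wider than 128 bits Split), not a dump of the binary's stride-259 tables:

```python
LEGAL, PROMOTE, EXPAND, SCALARIZE, SPLIT = (
    "Legal", "Promote", "Expand", "Scalarize", "Split")

# One action per MVT; the real tables store one byte per MVT at stride 259.
TYPE_ACTIONS = {
    "i32": LEGAL, "f32": LEGAL,   # always Legal on NVPTX
    "i64": LEGAL, "f64": LEGAL,   # Legal, but costed as expanded arithmetic
    "v8i32": SPLIT,               # 256-bit vector -> SplitVector
    "v16f32": SPLIT,              # 512-bit vector -> SplitVector
}

def get_type_action(mvt):
    # Unlisted types default to Expand in this sketch; the binary's
    # default depends on the SM version.
    return TYPE_ACTIONS.get(mvt, EXPAND)

def is_type_legal(mvt):
    return get_type_action(mvt) == LEGAL
```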

Initialization Flow

The SubtargetFeatures initialization follows this path:

  1. ctor_605 (0x584510, 2.6KB) populates qword_502A920 with the 45-entry SM processor table at static init time.
  2. sub_167F890 initializes the SubtargetInfo during pipeline setup.
  3. sub_982C80 initializes the 224-byte NVPTX feature flag table based on SM version and OS/ABI info.
  4. sub_97DEE0 performs initial population of the feature bitfield.
  5. sub_982B20 applies SM-version-specific refinements from the global table at qword_4F7FCC8.

The 224-byte feature table (sub_982C80) initializes bytes 0-127 to all-1s (0xFF), then selectively clears bits based on the target configuration. This "default-enabled, selectively-disabled" pattern means that features are assumed present unless explicitly turned off for a given target.
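
The "default-enabled, selectively-disabled" pattern is easy to model as a bitfield: start with the first 128 bytes all set, then clear bits for the concrete target. The bit positions here are illustrative; only the 224-byte size and the all-1s prefix come from the text:

```python
def init_feature_table(disabled_bits):
    """Bytes 0-127 default to 0xFF (all features on); the remaining
    96 bytes start zeroed. Selected bits are then cleared."""
    table = bytearray([0xFF] * 128) + bytearray(96)  # 224 bytes total
    for bit in disabled_bits:
        table[bit // 8] &= ~(1 << (bit % 8))
    return table

def has_feature(table, bit):
    return bool(table[bit // 8] & (1 << (bit % 8)))
```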

NVPTXTargetTransformInfo Hook Table

The TTI is the interface through which all LLVM optimization passes query target-specific costs and capabilities. For NVPTX, every hook returns a value calibrated for a scalar-register GPU architecture rather than a SIMD-register CPU.

| TTI Hook | Address | Return Value | Upstream Equivalent |
|---|---|---|---|
| getRegisterBitWidth(Vector) | sub_DFE640 | TypeSize::getFixed(32) | AVX2 returns 256, AVX-512 returns 512 |
| supportsScalableVectors() | sub_DFE610 | false | AArch64 SVE returns true |
| getMaxInterleaveFactor() | sub_DFB120 | Register-pressure-bounded | CPU returns 2-4 based on uarch |
| getMaxInterleaveFactor(vectorized) | sub_DFB730 | Separate limit for vectorized loops | -- |
| getRegisterBitWidth(Scalar) | sub_DFB1B0 | 32 | Matches PTX 32-bit register file |
| getInstructionCost() | sub_20E14F0 (32KB) | Per-opcode latency from sched model | -- |
| hasAttribute(30) | sub_B2D610 | Checks noimplicitfloat | Standard LLVM |
| hasAttribute(47) | sub_B2D610 | Checks alwaysvectorize | Standard LLVM |
| hasAttribute(18) | sub_B2D610 | Checks optnone | Standard LLVM |

Impact on Loop Vectorization

The 32-bit register width return from sub_DFE640 is the single most consequential TTI hook for GPU compilation. The standard LLVM VF formula is:

VF = registerBitWidth / elementBitWidth

With registerBitWidth = 32:

  • float (32-bit): VF = 1 -- no vectorization from the register-width formula alone
  • half (16-bit): VF = 2
  • i8 (8-bit): VF = 4

This means that profitable vectorization of 32-bit types (the dominant case in CUDA) must come entirely from the cost model determining that ld.v2.f32 or ld.v4.f32 is cheaper than multiple scalar loads, not from the register-width heuristic. The LoopVectorize pass (sub_2AF1970) has an explicit override: when the VF formula produces VF <= 1 and the byte_500D208 knob is set, it forces VF = 4 for outer loops.
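
The VF formula and the outer-loop override can be sketched directly; `force_outer_vf4` stands in for the byte_500D208 knob described above:

```python
def select_vf(register_bits, element_bits, force_outer_vf4=False):
    """Register-width VF formula with the outer-loop override."""
    vf = register_bits // element_bits
    if vf <= 1 and force_outer_vf4:
        return 4          # forced outer-loop vectorization
    return max(vf, 1)
```

With the NVPTX width of 32, this reproduces the bullets above: VF=1 for float, VF=2 for half, VF=4 for i8.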

Impact on SLP Vectorization

The SLP vectorizer (sub_2BD1C50) receives the target vector register width as parameter a3 and uses it to determine maximum bundle width. With 32 bits, SLP bundles are limited to:

  • 2x i16 (32 bits total)
  • 4x i8 (32 bits total)
  • 1x i32 or f32 (degenerate -- no SLP benefit)

In practice, the SLP vectorizer's profitability model can override this limit when paired loads/stores demonstrate memory coalescing benefit, but the register width serves as the initial upper bound.
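
The initial bundle-width bound follows the same arithmetic; a minimal sketch (the profitability override is not modeled here):

```python
def max_bundle_lanes(register_bits, element_bits):
    """Initial SLP bundle upper bound; the cost model may override it
    when paired loads/stores coalesce."""
    return max(register_bits // element_bits, 1)
```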

Impact on Interleave Count

The getMaxInterleaveFactor hook (sub_DFB120, queried at TTI+448) caps the interleave count (IC) for loop unroll-and-jam. The interleave selection algorithm in sub_2AED330 reads this value and combines it with scheduling info at TTI+56:

maxIC    = TTI.getMaxInterleaveFactor(VF)
issueWidth = *(TTI + 56 + 32)              // scheduling model: issue width
latency    = *(TTI + 56 + 36)              // scheduling model: latency
IC         = IC / max(issueWidth, latency)  // cap by pipeline throughput

This models the SM's instruction issue pipeline: even if register pressure allows IC=8, the warp scheduler may saturate at lower IC values, making additional interleaving waste register budget without throughput gain.
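
The capping arithmetic above can be sketched as follows, with the TTI field offsets abstracted into plain parameters (the clamp to at least 1 is an assumption; the decompiled formula shows only the division):

```python
def cap_interleave_count(ic, max_ic, issue_width, latency):
    ic = min(ic, max_ic)                     # register-pressure bound
    ic //= max(issue_width, latency, 1)      # pipeline-throughput bound
    return max(ic, 1)                        # never below 1 (assumed)
```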

Arithmetic Cost for i64

NVPTX GPUs have 32-bit ALUs. All 64-bit integer arithmetic is emulated through pairs of 32-bit operations with carry propagation. The TTI getArithmeticInstrCost hook reflects this by returning approximately 2x the base cost for i64 operations:

| Operation | i32 Cost | i64 Cost | Ratio |
|---|---|---|---|
| ADD/SUB | 1 | 2 | 2x (add.cc + addc) |
| MUL | 1 | ~4 | 4x (mul.lo + mul.hi + add chain) |
| DIV/REM | high | very high | Library call on both |
| Shift | 1 | 2-3 | funnel shift pair |

This cost differential causes LLVM optimization passes (InstCombine, SCEV-based transformations, IV widening) to prefer i32 operations, which NVIDIA's custom IV Demotion pass (sub_18B1DE0) further exploits by narrowing 64-bit induction variables to 32-bit where the trip count permits.
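
The 2x ADD cost is exactly the add.cc/addc pair: add the low 32-bit halves, then propagate the carry into the high halves. A sketch of that emulation:

```python
MASK32 = 0xFFFFFFFF

def add64_via_32(a, b):
    """Emulate a 64-bit add with two 32-bit adds, as PTX does on
    32-bit ALUs: add.cc sets the carry, addc consumes it."""
    lo = (a & MASK32) + (b & MASK32)                  # add.cc
    carry = lo >> 32
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32     # addc
    return (hi << 32) | (lo & MASK32)
```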

SM Processor Table

The processor table at qword_502A920 is a flat array of 90 entries (45 SM variants x 2 fields per entry) with stride-2 layout: even indices hold the SM name string pointer, odd indices hold the PTX version code.

Populated by ctor_605 at 0x584510 (2.6KB), called during static initialization before main. The table is read-only after construction.

qword_502A920[2*i + 0] = const char* sm_name    // e.g., "sm_100"
qword_502A920[2*i + 1] = uint64_t   ptx_version // 5, 6, or 7
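
A minimal model of the stride-2 layout and its lookup; the entries shown are examples consistent with the version-code table below, not the full 45-entry table:

```python
# Even slots hold the SM name, odd slots the PTX version code,
# mirroring qword_502A920's flat stride-2 layout.
sm_table = [
    "sm_90",  5,
    "sm_90a", 6,
    "sm_100", 6,
    "sm_100a", 7,
]

def lookup_ptx_version(name):
    for i in range(0, len(sm_table), 2):
        if sm_table[i] == name:
            return sm_table[i + 1]
    return None
```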

PTX Version Codes

| Code | Meaning | SM Range |
|---|---|---|
| 5 | Legacy PTX | sm_20 through sm_90 (all base variants) |
| 6 | Modern PTX | sm_90a, sm_100-sm_121 (base variants only) |
| 7 | Extended PTX | sm_100a/f through sm_121a/f (accelerated/forward-compatible) |

Notable observations:

  • sm_90a is the only pre-Blackwell SM with PTX version 6.
  • The f (forward-compatible) suffix uses the same PTX version as a (accelerated).
  • No entries exist for sm_84, sm_85 (Ada Lovelace numbering gap).
  • sm_73 (Volta sub-variant) and sm_88 (Ada sub-variant) are present but not publicly documented.
  • The table contains 15 legacy architectures (sm_20 through sm_75) that are no longer accessible through the CLI mapping but remain in the backend's processor table.

Data Layout String

The NVPTX data layout string follows LLVM's standard format with three variants selected based on pointer width and shared memory pointer mode:

64-bit with shared memory specialization (most common)

e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64

64-bit without shared memory specialization

e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64

32-bit mode

e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64

Key fields

| Field | Meaning | NVIDIA Note |
|---|---|---|
| e | Little-endian | All NVIDIA GPUs |
| p:64:64:64 | Generic pointers: 64-bit, 64-bit aligned | Default for 64-bit compilation |
| p3:32:32:32 | Address space 3 (shared memory): 32-bit pointers | Controlled by nvptx-short-ptr / nvptx-32-bit-smem / unk_4D0461C |
| n16:32:64 | Native integer widths: 16, 32, 64 | Tells LLVM that i16/i32/i64 are all hardware-supported |
| v16:16:16 / v32:32:32 | Vector alignment: natural | 16-bit and 32-bit vectors aligned to their width |

The p3:32:32:32 entry is the NVIDIA delta: shared memory lives in a 48KB-228KB on-chip SRAM per SM, addressable with 32-bit pointers even in 64-bit mode. Using 32-bit pointers for shared memory saves register pressure and instruction count for every shared memory access.
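
A minimal parser for the pointer specs in the data layout string is enough to show how p3:32:32:32 gives address space 3 a 32-bit pointer while the generic address space stays 64-bit (this is an illustrative sketch, not LLVM's DataLayout parser):

```python
LAYOUT_64 = ("e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-"
             "i64:64:64-i128:128:128-f16:16:16-f32:32:32-f64:64:64-"
             "v16:16:16-v32:32:32-n16:32:64")

def pointer_sizes(data_layout):
    """Map address space -> pointer size in bits, from p/pN specs."""
    sizes = {}
    for spec in data_layout.split("-"):
        if spec.startswith("p") and ":" in spec:
            head, size, *_ = spec.split(":")
            addrspace = int(head[1:]) if len(head) > 1 else 0
            sizes[addrspace] = int(size)
    return sizes
```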

A separate data layout string e-i64:64-v16:16-v32:32-n16:32:64 appears in the IR linker (sub_106AB30) as a compatibility check during module linking. This shortened form is used to validate that two modules being linked share the same NVPTX target data layout.

Data layout validation is performed at multiple points:

  • sub_2C74F70 in the NVVM verifier checks the layout string on every module
  • If empty: "Empty target data layout, must exist"
  • If invalid: prints "Example valid data layout:" with reference 32-bit and 64-bit strings from off_4C5D0A0 / off_4C5D0A8

Target Triple Construction

The target triple is constructed at module creation time by checking the pointer width:

if (unk_4F06A68 == 8)                    // 64-bit data model
    triple = "nvptx64-nvidia-cuda"       // 19 chars
else
    triple = "nvptx-nvidia-cuda"         // 17 chars

Eight triples are valid in UnifiedNVVMIR mode:

| Triple | Width | Runtime |
|---|---|---|
| nvptx-nvidia-cuda | 32-bit | CUDA |
| nvptx64-nvidia-cuda | 64-bit | CUDA |
| nvptx-nvidia-nvcl | 32-bit | OpenCL |
| nvptx64-nvidia-nvcl | 64-bit | OpenCL |
| nvsass-nvidia-cuda | SASS | CUDA native assembly |
| nvsass-nvidia-nvcl | SASS | OpenCL native assembly |
| nvsass-nvidia-directx | SASS | DirectX backend |
| nvsass-nvidia-spirv | SASS | SPIR-V backend |

In non-UnifiedNVVMIR mode, validation is looser: the triple must start with nvptx- or nvptx64- and contain -cuda. The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) are notable evidence that NVIDIA's SASS-level backend supports DirectX and SPIR-V shader compilation alongside traditional CUDA/OpenCL.
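
The loose non-UnifiedNVVMIR check described above reduces to two string tests:

```python
def is_valid_loose_triple(triple):
    """Non-UnifiedNVVMIR validation: nvptx-/nvptx64- prefix plus -cuda."""
    return (triple.startswith(("nvptx-", "nvptx64-"))
            and "-cuda" in triple)
```

Note that this rejects the nvcl and nvsass triples, which are only accepted in UnifiedNVVMIR mode.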

Configuration Knobs

Backend Options (ctor_609_0, 0x585D30, 37KB)

| Knob | Type | Default | Description |
|---|---|---|---|
| nvptx-short-ptr | bool | -- | 32-bit pointers for const/local/shared |
| nvptx-32-bit-smem | bool | -- | 32-bit shared memory pointers |
| nvptx-enable-machine-sink | bool | -- | Enable Machine Sinking |
| enable-new-nvvm-remat | bool | true | Enable new rematerialization |
| nv-disable-remat | bool | false | Disable all remat passes |
| nv-disable-mem2reg | bool | false | Disable MI Mem2Reg pass |
| nv-disable-scev-cgp | bool | false | Disable SCEV address mode opt |
| disable-nvptx-load-store-vectorizer | bool | false | Disable load/store vectorizer |
| disable-nvptx-require-structured-cfg | bool | false | Turn off structured CFG requirement |
| nvptx-exit-on-unreachable | bool | true | Lower unreachable as exit |
| nvptx-early-byval-copy | bool | -- | Copy byval args early |
| enable-nvvm-peephole | bool | true | Enable NVVM Peephole Optimizer |
| lower-func-args | bool | true | Lower large aggregate params |
| enable-sink | bool | true | Enable Sinking |
| disable-post-opt | bool | false | Disable LLVM IR opts post-opt |
| usedessa | int | 2 | Select deSSA method |
| ldg | bool | true | Load Global Constant Transform |
| print-isel-input | bool | false | Print LLVM IR input to isel |
| no-reg-target-nvptxremat | bool | false | Only old remat without reg targets |
| disable-set-array-alignment | bool | false | Disable alignment enhancements |
| nvptx-lower-global-ctor-dtor | bool | -- | Lower GPU ctor/dtors to globals |

Register Pressure & FCA Options (ctor_074, 0x49AAB0)

| Knob | Type | Default | Description |
|---|---|---|---|
| fca-size | int | 8 | Max size of first-class aggregates (bytes) |
| reg-target-adjust | int | 0 (range -10..+10) | Register pressure target adjustment |
| pred-target-adjust | int | 0 (range -10..+10) | Predicate register target adjustment |
| remat-load-param | bool | -- | Support remating const ld.param not in NVVM IR |
| cta-reconfig-aware-rpa | bool | -- | CTA reconfiguration-aware register pressure analysis |

Extension Options (ctor_610, 0x5888A0)

| Knob | Type | Default | Description |
|---|---|---|---|
| unroll-assumed-size | int | 4 | Assumed size for unknown local array types |
| enable-loop-peeling | bool | -- | Enable loop peeling |
| enable-256-bit-load-store | bool | -- | Enable 256-bit vector loads/stores |
| ias-param-always-point-to-global | bool | -- | Parameters always point to global memory |
| ias-strong-global-assumptions | bool | -- | Strong global memory assumptions |
| ias-wmma-memory-space-opt | bool | -- | Memory Space Optimization for WMMA |

TTI Cost Model Options (ctor_061, 0x494D20)

| Knob | Type | Default | Description |
|---|---|---|---|
| costmodel-reduxcost | bool | -- | Recognize reduction patterns |
| cache-line-size | int | -- | Cache line size for cost model |
| min-page-size | int | -- | Minimum page size |
| predictable-branch-threshold | float | -- | Threshold for predictable branch cost |

Differences from Upstream LLVM

  1. Dual-path TargetMachine construction. Upstream LLVM has a single target creation path through LLVMTargetMachine::createPassConfig. NVIDIA has two independent paths (CLI and pipeline assembler) that converge at TargetRegistry::lookupTarget.

  2. NVIDIA-proprietary target features. The +sharedmem32bitptr, +fma-level=N, +prec-divf32=N, +prec-sqrtf32=N features do not exist in upstream NVPTX. Upstream NVPTX has +ptx75, +sm_90 style features. NVIDIA's math precision features are passed through the target feature string to avoid adding new cl::opt for each.

  3. 224-byte feature table. The sub_982C80 feature table with its "default all-1s then selectively clear" initialization pattern is unique to cicc. Upstream NVPTXSubtarget uses a much simpler feature set derived from +sm_XX and +ptx_YY features.

  4. Scheduling info at TTI+56. The issue-width and latency values stored in the TTI sub-structure at offset +56 are used by the interleave count selection algorithm. Upstream LLVM's NVPTX backend does not populate these scheduling parameters -- it relies on the default "no scheduling model" behavior.

  5. Extension hook at vtable+88. The pipeline assembler checks for a createExtendedTargetMachine entry, enabling loadable target backend extensions. This is not present in upstream LLVM.

Function Map

| Function | Address | Size |
|---|---|---|
| NVPTX Target Lookup and Creation | sub_12EA530 | 4 KB |
| TargetMachine Creation with NVIDIA Options | sub_12F4060 | 16 KB |
| Master Pipeline Assembly (includes TM setup) | sub_12E54A0 | 50 KB |
| CICC CLI Argument Parser | sub_12F7D90 | 14 KB |
| TargetRegistry::lookupTarget() | sub_16D3AC0 | -- |
| SubtargetInfo initialization | sub_167F890 | -- |
| TTIWrapperPass allocation (208 bytes) | sub_1BFB520 | -- |
| TargetTransformInfo / DataLayout creation | sub_1BFB9A0 | -- |
| TargetLibraryInfo creation | sub_14A04B0 | -- |
| TargetLibraryInfo finalization | sub_149CBC0 | -- |
| TTI::getRegisterBitWidth(Vector) -- returns 32 | sub_DFE640 | -- |
| TTI::supportsScalableVectors() -- returns false | sub_DFE610 | -- |
| TTI::getMaxInterleaveFactor() (at TTI+448) | sub_DFB120 | -- |
| TTI::getMaxInterleaveFactor(vectorized) | sub_DFB730 | -- |
| TTI::getRegisterBitWidth(Scalar) or cache-line query | sub_DFB1B0 | -- |
| TTI::getInstructionCost() / scheduling cost model | sub_20E14F0 | 33 KB |
| TTI::hasAttribute(N) -- function attribute query | sub_B2D610 | -- |
| TTI::getInstructionCost() (IR-level variant) | sub_B91420 | -- |
| NVPTX feature flag table initializer (224 bytes) | sub_982C80 | -- |
| Feature bitfield initial population | sub_97DEE0 | -- |
| SM-version-specific feature refinements | sub_982B20 | -- |
| SubtargetFeature reads at +2843, +2584, +2498 | sub_201BB90 | -- |
| Branch distance / jump table checks at +2870, +2871 | sub_20650A0 | -- |
| EDG SM architecture feature gating (38KB, ~60 flags) | sub_60E7C0 | -- |
| Module initialization with triple and data layout | sub_908850 | -- |
| SM processor table population (0x584510, 2.6KB) | ctor_605 | -- |
| NVPTX backend math options (0x584B60, 14KB) | ctor_607 | -- |
| NVPTX backend options (0x585D30, 37KB) | ctor_609_0 | -- |

Cross-References