NVPTX Target Infrastructure
The NVPTXTargetMachine, NVPTXSubtarget, and NVPTXTargetTransformInfo form the target description layer that the entire LLVM backend consults for every decision from type legality through instruction cost to vectorization factor selection. In upstream LLVM, these are three separate source files totaling roughly 1,500 lines; in cicc v13.0 they are spread across the 0xDF0000-0xE00000 address range (TTI hooks), the 0x330-0x35B range (NVPTXTargetLowering), the type legalization tables embedded in NVPTXSubtarget, and the pipeline assembler at 0x12EA000-0x12F0000 (TargetMachine construction). The NVIDIA delta relative to upstream is moderate -- the TTI hooks return GPU-specific constants rather than CPU ones, the SubtargetFeatures carry NVIDIA-proprietary math precision flags, and the TargetMachine creation path has a dual-path design that handles both the cicc standalone pipeline and the LibNVVM API pipeline.
Key Facts
| Property | Value |
|---|---|
| SM processor table | qword_502A920 (45 entries, stride-2, ctor_605 at 0x584510) |
| Target lookup | sub_12EA530 (4KB, calls sub_16D3AC0 = TargetRegistry::lookupTarget) |
| TargetMachine creation | sub_12F4060 (16KB, NVIDIA options) / sub_12E54A0 (50KB, pipeline path) |
| TTI wrapper pass | sub_1BFB520 (208-byte alloc, wraps sub_1BFB9A0) |
| Register bit width (Vector) | sub_DFE640 -- returns 32 (fixed) |
| Scalable vectors | sub_DFE610 -- returns false |
| Max interleave factor | sub_DFB120 (at TTI+448), sub_DFB730 (vectorized variant) |
| SubtargetFeatures | Offsets +2498, +2584, +2843, +2870, +2871 |
| Target triples | nvptx64-nvidia-cuda, nvptx-nvidia-cuda, nvsass-nvidia-* (8 total) |
NVPTXTargetMachine
Dual-Path Target Initialization
cicc constructs the TargetMachine through two independent code paths depending on whether compilation enters through the standalone cicc CLI or through the LibNVVM API. Both converge on TargetRegistry::lookupTarget (sub_16D3AC0) but assemble the target triple, feature string, and TargetOptions differently.
Path 1 -- cicc standalone (sub_12F7D90 -> sub_12F4060):
sub_12F7D90 — CLI parser:
parse "-arch=compute_XX" → SM version (multiplied by 10)
parse "-opt=N" → optimization level
parse "-ftz=N" → flush-to-zero mode
parse "-fma=N" → FMA contraction level
parse "-prec-div=N" → float division precision
parse "-prec-sqrt=N" → sqrt precision
parse "--device-c" → device compilation flag
sub_12F4060 — TargetMachine creation (16KB):
triple = (pointerWidth == 64) ? "nvptx64" : "nvptx"
features = ""
if (sharedmem32bit):
features += "+sharedmem32bitptr"
features += ",+fma-level=N,+prec-divf32=N,+prec-sqrtf32=N"
opts = TargetOptions {
flags: 0,
reloc: PIC (1),
codeModel: 8,
optLevel: from_cli,
threadModel: 1
}
TM = TargetRegistry::lookupTarget(triple, cpu_string)
if (!TM):
error "Error: Cannot specify multiple -llcO#\n"
return TM->createTargetMachine(triple, cpu, features, opts)
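The Path 1 assembly above can be sketched in Python (an illustrative model of the decompiled logic, not the actual implementation; function names are descriptive, not taken from the binary):

```python
def build_feature_string(sharedmem32bit, fma_level, prec_div, prec_sqrt):
    """Model of sub_12F4060's feature-string assembly (illustrative)."""
    features = []
    if sharedmem32bit:
        features.append("+sharedmem32bitptr")
    # NVIDIA-proprietary math-precision features ride in the feature string
    features.append(f"+fma-level={fma_level}")
    features.append(f"+prec-divf32={prec_div}")
    features.append(f"+prec-sqrtf32={prec_sqrt}")
    return ",".join(features)

def select_triple(pointer_width_bits):
    """Triple prefix selection by pointer width."""
    return "nvptx64" if pointer_width_bits == 64 else "nvptx"
```

For example, `-fma=1 -prec-div=1 -prec-sqrt=1` with 32-bit shared memory pointers yields `+sharedmem32bitptr,+fma-level=1,+prec-divf32=1,+prec-sqrtf32=1`.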
Path 2 -- pipeline assembler (sub_12E54A0):
The master pipeline assembly function (50KB, called from both Phase I and Phase II) constructs the target independently:
sub_12E54A0:
ptrSize = Module::getDataLayout().getPointerSize(0) // pointer size in bytes
if (8 * ptrSize == 64): // i.e. a 64-bit pointer width
triple = "nvptx64" // 7 chars
else:
triple = "nvptx" // 5 chars
target = sub_16D3AC0(&triple, &cpu_string) // TargetRegistry::lookupTarget
if (!target):
error "Failed to locate nvptx target\n" // sub_1C3EFD0
// TargetOptions setup:
opts[0] = 0 // no flags
opts[1] = 1 // PIC relocation
opts[2] = 8 // code model
opts[3] = 1 // opt level indicator
opts[4] = 1 // thread model
opts[5] = 0 // reserved
sub_167F890(subtargetInfo) // initialize SubtargetInfo
TLI = sub_14A04B0(targetLibInfo, moduleName) // TargetLibraryInfo
sub_149CBC0(TLI) // finalize TLI
TTI = sub_1BFB9A0(DataLayout, a2, a3, v269) // TargetTransformInfo
optLevel = read qword_4FBB430 // cl::opt<int> value
PassManagerBuilder = sub_1611EE0(PM)
The pipeline assembler path also checks for an extension hook: if the target has a createExtendedTargetMachine vtable entry at offset +88, it calls that instead, enabling custom target backends. The returned TargetMachine pointer feeds into the 150+ pass registrations that follow.
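A minimal model of that extension-hook dispatch, with a presence check standing in for the vtable+88 indirection (class and method names are illustrative, not from the binary):

```python
class StandardTarget:
    """A target exposing only the standard factory."""
    def createTargetMachine(self, triple, cpu, features, opts):
        return ("standard", triple)

class ExtendedTarget(StandardTarget):
    """A target that additionally provides the extension hook."""
    def createExtendedTargetMachine(self, triple, cpu, features, opts):
        return ("extended", triple)

def create_target_machine(target, triple, cpu="", features="", opts=None):
    # Prefer the extended factory when the slot is populated,
    # mirroring the vtable+88 check described above.
    hook = getattr(target, "createExtendedTargetMachine", None)
    if hook is not None:
        return hook(triple, cpu, features, opts)
    return target.createTargetMachine(triple, cpu, features, opts)
```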
TargetOptions
The TargetOptions struct passed to both paths uses LLVM's standard layout. The key NVIDIA-specific values:
| Field | Value | Meaning |
|---|---|---|
| Relocation model | 1 (PIC) | Position-independent code, always |
| Code model | 8 | Large code model (matches PTX's flat addressing) |
| Thread model | 1 | POSIX-style threading assumed |
| Optimization level | From CLI | Stored in qword_4FBB430, default from qword_4FBB430[2] |
NVIDIA-Specific Target Features
The feature string passed to createTargetMachine encodes math precision and shared memory configuration as subtarget features. These are not upstream LLVM features -- they are NVIDIA extensions:
| Feature | CLI Source | Subtarget Effect |
|---|---|---|
+sharedmem32bitptr | nvptx-short-ptr / nvptx-32-bit-smem | Enables 32-bit pointers for address space 3 (shared memory); adds p3:32:32:32 to data layout |
+fma-level=N | -fma=N | 0=off, 1=on, 2=aggressive FMA contraction |
+prec-divf32=N | -prec-div=N | 0=approx, 1=full, 2=IEEE+ftz, 3=IEEE compliant |
+prec-sqrtf32=N | -prec-sqrt=N | 0=approx (rsqrt.approx), 1=rn (sqrt.rn) |
Registered in ctor_607 (0x584B60, 14KB):
| Knob | Type | Default | Description |
|---|---|---|---|
nvptx-sched4reg | bool | -- | Schedule for register pressure |
nvptx-fma-level | int | -- | FMA contraction level |
nvptx-prec-divf32 | int | -- | F32 division precision |
nvptx-prec-sqrtf32 | int | -- | Sqrt precision |
nvptx-approx-log2f32 | bool | -- | Use lg2.approx for log2 |
nvptx-force-min-byval-param-align | bool | -- | Force 4-byte byval alignment |
nvptx-normalize-select | bool | -- | Override shouldNormalizeToSelectSequence |
enable-bfi64 | bool | -- | Enable 64-bit BFI instructions |
NVPTXSubtarget Feature Flags
The NVPTXSubtarget object carries the type legalization tables and architecture-specific feature flags that the SelectionDAG, register allocator, and type legalizer consult at every step. These are populated during target construction and indexed by the SM processor table.
Feature Flag Offsets
| Offset | Size | Purpose | Stride |
|---|---|---|---|
| +120 | ptr | Register class array (8-byte stride entries) | -- |
| +2498 | 259 | Type legality flags (indexed per MVT) | 259 bytes per type action |
| +2584 | 259 | Float legality flags (indexed per MVT) | 259 bytes per type action |
| +2843 | 1 | Integer type support flag | -- |
| +2870 | 1 | Branch distance flag | -- |
| +2871 | 1 | Jump table eligibility flag | -- |
The type legality arrays at +2498 and +2584 are the backbone of SelectionDAG's getTypeAction() and isTypeLegal() queries. Each entry covers one MVT (Machine Value Type) and stores the action: Legal, Promote, Expand, Scalarize, or SplitVector. For NVPTX, i32 and f32 are always Legal; i64 and f64 are Legal on all supported SM versions but with expanded arithmetic costs; vectors wider than 128 bits are always Split or Scalarized.
The function sub_201BB90 reads these offsets during type legalization to determine expansion strategy. The branch distance flags at +2870/+2871 control sub_20650A0, which decides jump table eligibility beyond the standard no-jump-tables flag.
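A sketch of how such a per-MVT action table is consumed, using LLVM's action names and the NVPTX defaults stated above (the encoding and the specific entries here are assumptions for illustration):

```python
# LLVM-style legalization actions (numeric encoding assumed)
LEGAL, PROMOTE, EXPAND, SCALARIZE, SPLIT = range(5)

# NVPTX defaults as described in the text (illustrative subset):
TYPE_ACTION = {
    "i32": LEGAL, "f32": LEGAL,
    "i64": LEGAL, "f64": LEGAL,   # legal, but costed as expanded arithmetic
    "v8i32": SPLIT,               # 256-bit vector: wider than 128 bits
    "v4i64": SPLIT,
}

def get_type_action(mvt):
    """Model of getTypeAction(): unknown types default to Expand here."""
    return TYPE_ACTION.get(mvt, EXPAND)

def is_type_legal(mvt):
    """Model of isTypeLegal(): legal iff the table says Legal."""
    return get_type_action(mvt) == LEGAL
```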
Initialization Flow
The SubtargetFeatures initialization follows this path:
1. ctor_605 (0x584510, 2.6KB) populates qword_502A920 with the 45-entry SM processor table at static init time.
2. sub_167F890 initializes the SubtargetInfo during pipeline setup.
3. sub_982C80 initializes the 224-byte NVPTX feature flag table based on SM version and OS/ABI info.
4. sub_97DEE0 performs initial population of the feature bitfield.
5. sub_982B20 applies SM-version-specific refinements from the global table at qword_4F7FCC8.
The 224-byte feature table (sub_982C80) initializes bytes 0-127 to all-1s (0xFF), then selectively clears bits based on the target configuration. This "default-enabled, selectively-disabled" pattern means that features are assumed present unless explicitly turned off for a given target.
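The pattern can be modeled as follows (a sketch; the meaning of individual bits and the contents of bytes 128-223 are not established, so the tail is zero-filled here as an assumption):

```python
TABLE_SIZE = 224  # total feature table size in bytes

def init_feature_table(disabled_bits=()):
    """Model of sub_982C80's "default-enabled, selectively-disabled"
    pattern: bytes 0-127 start all-1s, then bits are cleared per target."""
    table = bytearray([0xFF] * 128) + bytearray(TABLE_SIZE - 128)
    for bit in disabled_bits:
        # clear one feature bit (mask to keep the byte in 0..255)
        table[bit // 8] &= ~(1 << (bit % 8)) & 0xFF
    return table
```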
NVPTXTargetTransformInfo Hook Table
The TTI is the interface through which all LLVM optimization passes query target-specific costs and capabilities. For NVPTX, every hook returns a value calibrated for a scalar-register GPU architecture rather than a SIMD-register CPU.
| TTI Hook | Address | Return Value | Upstream Equivalent |
|---|---|---|---|
getRegisterBitWidth(Vector) | sub_DFE640 | TypeSize::getFixed(32) | AVX2 returns 256, AVX-512 returns 512 |
supportsScalableVectors() | sub_DFE610 | false | AArch64 SVE returns true |
getMaxInterleaveFactor() | sub_DFB120 | Register-pressure-bounded | CPU returns 2-4 based on uarch |
getMaxInterleaveFactor(vectorized) | sub_DFB730 | Separate limit for vectorized loops | -- |
getRegisterBitWidth(Scalar) | sub_DFB1B0 | 32 | Matches PTX 32-bit register file |
getInstructionCost() | sub_20E14F0 (32KB) | Per-opcode latency from sched model | -- |
hasAttribute(30) | sub_B2D610 | Checks noimplicitfloat | Standard LLVM |
hasAttribute(47) | sub_B2D610 | Checks alwaysvectorize | Standard LLVM |
hasAttribute(18) | sub_B2D610 | Checks optnone | Standard LLVM |
Impact on Loop Vectorization
The 32-bit register width return from sub_DFE640 is the single most consequential TTI hook for GPU compilation. The standard LLVM VF formula is:
VF = registerBitWidth / elementBitWidth
With registerBitWidth = 32:
- float (32-bit): VF = 1 -- no vectorization from the register-width formula alone
- half (16-bit): VF = 2
- i8 (8-bit): VF = 4
This means that profitable vectorization of 32-bit types (the dominant case in CUDA) must come entirely from the cost model determining that ld.v2.f32 or ld.v4.f32 is cheaper than multiple scalar loads, not from the register-width heuristic. The LoopVectorize pass (sub_2AF1970) has an explicit override: when the VF formula produces VF <= 1 and the byte_500D208 knob is set, it forces VF = 4 for outer loops.
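A minimal model of the VF selection described above, including the knob-driven override (hypothetical parameter names; the byte_500D208 check is modeled as a boolean flag):

```python
def select_vf(register_bit_width, element_bit_width, force_outer_vf4=False):
    """Width-based VF formula plus the outer-loop override sketched above."""
    vf = max(register_bit_width // element_bit_width, 1)
    # When the formula yields VF <= 1 and the knob is set,
    # LoopVectorize forces VF = 4 for outer loops.
    if vf <= 1 and force_outer_vf4:
        return 4
    return vf
```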
Impact on SLP Vectorization
The SLP vectorizer (sub_2BD1C50) receives the target vector register width as parameter a3 and uses it to determine maximum bundle width. With 32 bits, SLP bundles are limited to:
- 2x i16 (32 bits total)
- 4x i8 (32 bits total)
- 1x i32 or f32 (degenerate -- no SLP benefit)
In practice, the SLP vectorizer's profitability model can override this limit when paired loads/stores demonstrate memory coalescing benefit, but the register width serves as the initial upper bound.
Impact on Interleave Count
The getMaxInterleaveFactor hook (sub_DFB120, queried at TTI+448) caps the interleave count (IC) for loop unroll-and-jam. The interleave selection algorithm in sub_2AED330 reads this value and combines it with scheduling info at TTI+56:
maxIC = TTI.getMaxInterleaveFactor(VF)
issueWidth = *(TTI + 56 + 32) // scheduling model: issue width
latency = *(TTI + 56 + 36) // scheduling model: latency
IC = IC / max(issueWidth, latency) // cap by pipeline throughput
This models the SM's instruction issue pipeline: even if register pressure allows IC=8, the warp scheduler may saturate at lower IC values, making additional interleaving waste register budget without throughput gain.
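The cap computation above can be sketched as follows (illustrative; clamping to at least 1 is an assumption):

```python
def cap_interleave_count(requested_ic, max_ic, issue_width, latency):
    """Model of the sub_2AED330 cap: bound by getMaxInterleaveFactor,
    then divide by pipeline throughput from the scheduling model."""
    ic = min(requested_ic, max_ic)
    ic //= max(issue_width, latency, 1)  # cap by pipeline throughput
    return max(ic, 1)
```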
Arithmetic Cost for i64
NVPTX GPUs have 32-bit ALUs. All 64-bit integer arithmetic is emulated through pairs of 32-bit operations with carry propagation. The TTI getArithmeticInstrCost hook reflects this by returning approximately 2x the base cost for i64 operations:
| Operation | i32 Cost | i64 Cost | Ratio |
|---|---|---|---|
| ADD/SUB | 1 | 2 | 2x (add.cc + addc) |
| MUL | 1 | ~4 | 4x (mul.lo + mul.hi + add chain) |
| DIV/REM | high | very high | Library call on both |
| Shift | 1 | 2-3 | funnel shift pair |
This cost differential causes LLVM optimization passes (InstCombine, SCEV-based transformations, IV widening) to prefer i32 operations, which NVIDIA's custom IV Demotion pass (sub_18B1DE0) further exploits by narrowing 64-bit induction variables to 32-bit where the trip count permits.
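The cost differential can be modeled as a simple multiplier lookup (the exact multipliers beyond the ratios stated in the table are approximations):

```python
# Approximate i64-over-i32 cost multipliers from the table above
I64_MULTIPLIER = {"add": 2, "sub": 2, "mul": 4, "shl": 2, "lshr": 2}

def arithmetic_cost(opcode, bits, base_cost=1):
    """Rough model of the i32-vs-i64 cost differential: 64-bit ops
    are emulated via 32-bit pairs, so they cost a multiple of base."""
    if bits == 64:
        return base_cost * I64_MULTIPLIER.get(opcode, 2)
    return base_cost
```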
SM Processor Table
The processor table at qword_502A920 is a flat array of 90 entries (45 SM variants x 2 fields per entry) with stride-2 layout: even indices hold the SM name string pointer, odd indices hold the PTX version code.
Populated by ctor_605 at 0x584510 (2.6KB), called during static initialization before main. The table is read-only after construction.
qword_502A920[2*i + 0] = const char* sm_name // e.g., "sm_100"
qword_502A920[2*i + 1] = uint64_t ptx_version // 5, 6, or 7
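A sketch of a stride-2 lookup over this layout (the table excerpt is a tiny illustrative subset consistent with the version codes in this section, not the full 45-entry contents):

```python
# Flat stride-2 layout as in qword_502A920: name at even index,
# PTX version code at the following odd index (excerpt only).
SM_TABLE = [
    "sm_90",   5,
    "sm_90a",  6,
    "sm_100",  6,
    "sm_100a", 7,
]

def ptx_version_for(sm_name):
    """Scan even indices for the name; return the odd-index code."""
    for i in range(0, len(SM_TABLE), 2):
        if SM_TABLE[i] == sm_name:
            return SM_TABLE[i + 1]
    return None
```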
PTX Version Codes
| Code | Meaning | SM Range |
|---|---|---|
| 5 | Legacy PTX | sm_20 through sm_90 (all base variants) |
| 6 | Modern PTX | sm_90a, sm_100-sm_121 (base variants only) |
| 7 | Extended PTX | sm_100a/f through sm_121a/f (accelerated/forward-compatible) |
Notable observations:
- sm_90a is the only pre-Blackwell SM with PTX version 6.
- The f (forward-compatible) suffix uses the same PTX version as a (accelerated).
- No entries exist for sm_84, sm_85 (Ada Lovelace numbering gap).
- sm_73 (Volta sub-variant) and sm_88 (Ada sub-variant) are present but not publicly documented.
- The table contains 15 legacy architectures (sm_20 through sm_75) that are no longer accessible through the CLI mapping but remain in the backend's processor table.
Data Layout String
The NVPTX data layout string follows LLVM's standard format with three variants selected based on pointer width and shared memory pointer mode:
64-bit with shared memory specialization (most common)
e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
64-bit without shared memory specialization
e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
32-bit mode
e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
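The three-way selection can be sketched as follows (the common tail is shared verbatim across the variants above):

```python
def select_data_layout(pointer_bits, smem_32bit_ptr):
    """Pick among the three NVPTX data layout variants by pointer
    width and shared-memory pointer mode (strings from the text)."""
    common = ("-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128"
              "-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64")
    if pointer_bits == 32:
        return "e-p:32:32:32" + common
    if smem_32bit_ptr:
        # p3:32:32:32 = 32-bit pointers for address space 3 (shared memory)
        return "e-p:64:64:64-p3:32:32:32" + common
    return "e-p:64:64:64" + common
```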
Key fields
| Field | Meaning | NVIDIA Note |
|---|---|---|
e | Little-endian | All NVIDIA GPUs |
p:64:64:64 | Generic pointers: 64-bit, 64-bit aligned | Default for 64-bit compilation |
p3:32:32:32 | Address space 3 (shared memory): 32-bit pointers | Controlled by nvptx-short-ptr / nvptx-32-bit-smem / unk_4D0461C |
n16:32:64 | Native integer widths: 16, 32, 64 | Tells LLVM that i16/i32/i64 are all hardware-supported |
v16:16:16 / v32:32:32 | Vector alignment: natural | 16-bit and 32-bit vectors aligned to their width |
The p3:32:32:32 entry is the NVIDIA delta: shared memory lives in a 48KB-228KB on-chip SRAM per SM, addressable with 32-bit pointers even in 64-bit mode. Using 32-bit pointers for shared memory saves register pressure and instruction count for every shared memory access.
A separate data layout string e-i64:64-v16:16-v32:32-n16:32:64 appears in the IR linker (sub_106AB30) as a compatibility check during module linking. This shortened form is used to validate that two modules being linked share the same NVPTX target data layout.
Data layout validation is performed at multiple points:
- sub_2C74F70 in the NVVM verifier checks the layout string on every module
- If empty: "Empty target data layout, must exist"
- If invalid: prints "Example valid data layout:" with reference 32-bit and 64-bit strings from off_4C5D0A0 / off_4C5D0A8
Target Triple Construction
The target triple is constructed at module creation time by checking the pointer width:
if (unk_4F06A68 == 8) // 64-bit data model
triple = "nvptx64-nvidia-cuda" // 19 chars
else
triple = "nvptx-nvidia-cuda" // 17 chars
Eight triples are valid in UnifiedNVVMIR mode:
| Triple | Width | Runtime |
|---|---|---|
nvptx-nvidia-cuda | 32-bit | CUDA |
nvptx64-nvidia-cuda | 64-bit | CUDA |
nvptx-nvidia-nvcl | 32-bit | OpenCL |
nvptx64-nvidia-nvcl | 64-bit | OpenCL |
nvsass-nvidia-cuda | SASS | CUDA native assembly |
nvsass-nvidia-nvcl | SASS | OpenCL native assembly |
nvsass-nvidia-directx | SASS | DirectX backend |
nvsass-nvidia-spirv | SASS | SPIR-V backend |
In non-UnifiedNVVMIR mode, validation is looser: the triple must start with nvptx- or nvptx64- and contain -cuda. The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) are notable evidence that NVIDIA's SASS-level backend supports DirectX and SPIR-V shader compilation alongside traditional CUDA/OpenCL.
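Both validation modes can be modeled as follows (a sketch; the strict set mirrors the eight-triple table above, the loose mode mirrors the prefix/substring rule):

```python
VALID_TRIPLES = {
    "nvptx-nvidia-cuda", "nvptx64-nvidia-cuda",
    "nvptx-nvidia-nvcl", "nvptx64-nvidia-nvcl",
    "nvsass-nvidia-cuda", "nvsass-nvidia-nvcl",
    "nvsass-nvidia-directx", "nvsass-nvidia-spirv",
}

def triple_is_valid(triple, unified_nvvm_ir):
    """UnifiedNVVMIR mode: exact-match against the eight triples.
    Otherwise: nvptx-/nvptx64- prefix and a -cuda component."""
    if unified_nvvm_ir:
        return triple in VALID_TRIPLES
    return (triple.startswith(("nvptx-", "nvptx64-"))
            and "-cuda" in triple)
```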
Configuration Knobs
Backend Options (ctor_609_0, 0x585D30, 37KB)
| Knob | Type | Default | Description |
|---|---|---|---|
nvptx-short-ptr | bool | -- | 32-bit pointers for const/local/shared |
nvptx-32-bit-smem | bool | -- | 32-bit shared memory pointers |
nvptx-enable-machine-sink | bool | -- | Enable Machine Sinking |
enable-new-nvvm-remat | bool | true | Enable new rematerialization |
nv-disable-remat | bool | false | Disable all remat passes |
nv-disable-mem2reg | bool | false | Disable MI Mem2Reg pass |
nv-disable-scev-cgp | bool | false | Disable SCEV address mode opt |
disable-nvptx-load-store-vectorizer | bool | false | Disable load/store vectorizer |
disable-nvptx-require-structured-cfg | bool | false | Turn off structured CFG requirement |
nvptx-exit-on-unreachable | bool | true | Lower unreachable as exit |
nvptx-early-byval-copy | bool | -- | Copy byval args early |
enable-nvvm-peephole | bool | true | Enable NVVM Peephole Optimizer |
lower-func-args | bool | true | Lower large aggregate params |
enable-sink | bool | true | Enable Sinking |
disable-post-opt | bool | false | Disable LLVM IR opts post-opt |
usedessa | int | 2 | Select deSSA method |
ldg | bool | true | Load Global Constant Transform |
print-isel-input | bool | false | Print LLVM IR input to isel |
no-reg-target-nvptxremat | bool | false | Only old remat without reg targets |
disable-set-array-alignment | bool | false | Disable alignment enhancements |
nvptx-lower-global-ctor-dtor | bool | -- | Lower GPU ctor/dtors to globals |
Register Pressure & FCA Options (ctor_074, 0x49AAB0)
| Knob | Type | Default | Description |
|---|---|---|---|
fca-size | int | 8 | Max size of first-class aggregates (bytes) |
reg-target-adjust | int | 0 (range -10..+10) | Register pressure target adjustment |
pred-target-adjust | int | 0 (range -10..+10) | Predicate register target adjustment |
remat-load-param | bool | -- | Support remating const ld.param not in NVVM IR |
cta-reconfig-aware-rpa | bool | -- | CTA reconfiguration-aware register pressure analysis |
Extension Options (ctor_610, 0x5888A0)
| Knob | Type | Default | Description |
|---|---|---|---|
unroll-assumed-size | int | 4 | Assumed size for unknown local array types |
enable-loop-peeling | bool | -- | Enable loop peeling |
enable-256-bit-load-store | bool | -- | Enable 256-bit vector loads/stores |
ias-param-always-point-to-global | bool | -- | Parameters always point to global memory |
ias-strong-global-assumptions | bool | -- | Strong global memory assumptions |
ias-wmma-memory-space-opt | bool | -- | Memory Space Optimization for WMMA |
TTI Cost Model Options (ctor_061, 0x494D20)
| Knob | Type | Default | Description |
|---|---|---|---|
costmodel-reduxcost | bool | -- | Recognize reduction patterns |
cache-line-size | int | -- | Cache line size for cost model |
min-page-size | int | -- | Minimum page size |
predictable-branch-threshold | float | -- | Threshold for predictable branch cost |
Differences from Upstream LLVM
1. Dual-path TargetMachine construction. Upstream LLVM has a single target creation path through LLVMTargetMachine::createPassConfig. NVIDIA has two independent paths (CLI and pipeline assembler) that converge at TargetRegistry::lookupTarget.
2. NVIDIA-proprietary target features. The +sharedmem32bitptr, +fma-level=N, +prec-divf32=N, and +prec-sqrtf32=N features do not exist in upstream NVPTX, which uses +ptx75 / +sm_90 style features. NVIDIA's math precision settings are passed through the target feature string to avoid adding a new cl::opt for each.
3. 224-byte feature table. The sub_982C80 feature table with its "default all-1s then selectively clear" initialization pattern is unique to cicc. Upstream NVPTXSubtarget uses a much simpler feature set derived from +sm_XX and +ptx_YY features.
4. Scheduling info at TTI+56. The issue-width and latency values stored in the TTI sub-structure at offset +56 are used by the interleave count selection algorithm. Upstream LLVM's NVPTX backend does not populate these scheduling parameters -- it relies on the default "no scheduling model" behavior.
5. Extension hook at vtable+88. The pipeline assembler checks for a createExtendedTargetMachine entry, enabling loadable target backend extensions. This is not present in upstream LLVM.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVPTX Target Lookup and Creation | sub_12EA530 | 4 KB | -- |
| TargetMachine Creation with NVIDIA Options | sub_12F4060 | 16 KB | -- |
| Master Pipeline Assembly (includes TM setup) | sub_12E54A0 | 50 KB | -- |
| CICC CLI Argument Parser | sub_12F7D90 | 14 KB | -- |
| TargetRegistry::lookupTarget() | sub_16D3AC0 | -- | -- |
| SubtargetInfo initialization | sub_167F890 | -- | -- |
| TTIWrapperPass allocation (208 bytes) | sub_1BFB520 | -- | -- |
| TargetTransformInfo / DataLayout creation | sub_1BFB9A0 | -- | -- |
| TargetLibraryInfo creation | sub_14A04B0 | -- | -- |
| TargetLibraryInfo finalization | sub_149CBC0 | -- | -- |
| TTI::getRegisterBitWidth(Vector) -- returns 32 | sub_DFE640 | -- | -- |
| TTI::supportsScalableVectors() -- returns false | sub_DFE610 | -- | -- |
| TTI::getMaxInterleaveFactor() (at TTI+448) | sub_DFB120 | -- | -- |
| TTI::getMaxInterleaveFactor(vectorized) | sub_DFB730 | -- | -- |
| TTI::getRegisterBitWidth(Scalar) or cache-line query | sub_DFB1B0 | -- | -- |
| TTI::getInstructionCost() / scheduling cost model | sub_20E14F0 | 33 KB | -- |
| TTI::hasAttribute(N) -- function attribute query | sub_B2D610 | -- | -- |
| TTI::getInstructionCost() (IR-level variant) | sub_B91420 | -- | -- |
| NVPTX feature flag table initializer (224 bytes) | sub_982C80 | -- | -- |
| Feature bitfield initial population | sub_97DEE0 | -- | -- |
| SM-version-specific feature refinements | sub_982B20 | -- | -- |
| SubtargetFeature reads at +2843, +2584, +2498 | sub_201BB90 | -- | -- |
| Branch distance / jump table checks at +2870, +2871 | sub_20650A0 | -- | -- |
| EDG SM architecture feature gating (38KB, ~60 flags) | sub_60E7C0 | -- | -- |
| Module initialization with triple and data layout | sub_908850 | -- | -- |
| SM processor table population (0x584510, 2.6KB) | ctor_605 | -- | -- |
| NVPTX backend math options (0x584B60, 14KB) | ctor_607 | -- | -- |
| NVPTX backend options (0x585D30, 37KB) | ctor_609_0 | -- | -- |
Cross-References
- GPU Target Architecture -- Full SM table, architecture gating thresholds, NVVM container arch enum
- LoopVectorize & VPlan -- TTI hook usage in VF selection and interleave count
- SLP Vectorizer -- TTI register width as SLP bundle width limit
- SelectionDAG -- NVPTXTargetLowering, type legality from SubtargetFeatures
- Memory Space Optimization -- Address space numbering convention
- IV Demotion -- Exploits i64 cost differential reported by TTI
- Register Allocation -- Register pressure budgets bounded by TTI
- Instruction Scheduling -- Scheduling model data at TTI+56
- CLI Flags -- -arch, -ftz, -fma, -prec-div, -prec-sqrt routing
- Optimization Levels -- qword_4FBB430 optimization level storage
- Pipeline & Ordering -- Where TTI is registered in the pass pipeline