Constant Folding: Math & Intrinsics

NVIDIA-modified pass. GPU-specific changes (110+ math name variants, 60+ NVVM intrinsic IDs, exception-safe host evaluation) are documented throughout this page.

Upstream source: llvm/lib/Analysis/ConstantFolding.cpp (LLVM 20.0.0). The upstream ConstantFoldCall function handles standard llvm.* intrinsics; NVIDIA's extensions (sub_14D90D0 eligibility checker, sub_14D1BC0 evaluator) are layered on top.

LLVM version note: The upstream ConstantFolding.cpp in LLVM 20 handles approximately 30 standard math intrinsics (llvm.sin, llvm.cos, llvm.sqrt, etc.) and a small set of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in nvvm.* form). CICC extends this to 110+ math name variants (C, glibc __*_finite, C++ mangled _Z*) and 60+ NVVM intrinsic IDs. The upstream disable-fp-call-folding knob (cl::Hidden, default false) is preserved; NVIDIA adds a separate FPFoldDisable CiccOption for independent control.

CICC v13.0 extends LLVM's ConstantFolding analysis with two large custom functions that together enable compile-time evaluation of over 110 distinct math function name variants and 60+ NVVM intrinsic IDs. Upstream LLVM's ConstantFoldCall handles standard llvm.sin, llvm.cos, llvm.sqrt, and a handful of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in their nvvm.* forms, plus FP-to-integer conversion intrinsics). CICC goes further. It matches the common C math library names (sin, sinf) and the glibc __*_finite internal variants by direct string comparison, adds C++ mangled forms (_Z3cosf, _Z4acosd) and a large set of NVVM approximate/FTZ math intrinsics, and evaluates them with the host C math library through an exception-safe wrapper. That wrapper refuses to produce a result when errno reports a domain or range error, or when the host FPU raises an invalid-operation, divide-by-zero, overflow, or underflow flag.

The system is split into two cooperating functions. The eligibility checker sub_14D90D0 (27 KB, called nvvmIntrinsicConstantFold in the sweep analysis) is a fast predicate that answers "can this call be constant-folded?" without touching operand values. The evaluator sub_14D1BC0 (54 KB, called nvvmConstantFoldLibCall) performs the actual computation when all operands are constant. A third function, the NVVM InstCombine intrinsic folder sub_1169C30 (87 KB), handles algebraic simplification of NVVM intrinsics and is documented separately on the InstCombine page.

| Component | Implementation |
|---|---|
| Eligibility checker | sub_14D90D0 (0x14D90D0, 27 KB, 282 basic blocks, 489 edges) |
| Math evaluator | sub_14D1BC0 (0x14D1BC0, 54 KB) |
| Constant extractor | sub_14D1620 (0x14D1620) |
| Safe unary eval wrapper | sub_14D19F0 (0x14D19F0) |
| Safe binary eval wrapper | sub_14D1A80 (0x14D1A80) |
| ConstantFP builder | sub_14D17B0 (0x14D17B0) |
| Custom fabs | sub_14D1280 (0x14D1280); SSE2 sign-bit mask |
| Custom floor | sub_14D13B0 (0x14D13B0); truncation + sign correction |
| Custom ceil | sub_14D1410 (0x14D1410); truncation + sign correction |
| Custom sqrt | sub_14D1470 (0x14D1470); thin wrapper around libc sqrt |
| Vector math mapping | sub_149E420 (0x149E420, 26 KB) |
| LLVM knob | disable-fp-call-folding (upstream, cl::Hidden, default false) |
| NVIDIA knob | FPFoldDisable (NVIDIA CiccOption, disables FP constant folding) |

Two-Tier Architecture: Eligibility vs. Evaluation

The constant folding system operates as a two-phase protocol. The caller (from the ConstantFolding pass or InstCombine visitCallInst path) first invokes the eligibility checker to determine whether a call instruction is a candidate, then invokes the evaluator to produce the folded constant. This split exists for performance: the eligibility check is cheap (no operand extraction, no FP computation), while the evaluator is expensive (extracts APFloat values, calls host math library, checks FP exceptions).
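A minimal sketch of the two-phase protocol, assuming a toy name-based predicate and a handful of libm calls. canFold, evaluate, and tryFold are hypothetical names, not CICC functions; the point is only the shape: the predicate never looks at operand values, and the evaluator runs only after the predicate says yes.

```cpp
#include <cassert>
#include <cmath>
#include <optional>
#include <string_view>
#include <vector>

// Phase 1 (hypothetical): cheap predicate on the callee name only.
// It never inspects operand values.
bool canFold(std::string_view callee) {
  return callee == "sin" || callee == "sqrt" || callee == "pow";
}

// Phase 2 (hypothetical): the expensive part. Assumes every operand is
// already a known constant and computes the folded value, or gives up.
std::optional<double> evaluate(std::string_view callee, const std::vector<double>& args) {
  if (callee == "sin" && args.size() == 1) return std::sin(args[0]);
  if (callee == "sqrt" && args.size() == 1 && args[0] >= 0.0) return std::sqrt(args[0]);
  if (callee == "pow" && args.size() == 2) return std::pow(args[0], args[1]);
  return std::nullopt;
}

// Caller: run the predicate first so non-candidates never reach phase 2.
std::optional<double> tryFold(std::string_view callee, const std::vector<double>& args) {
  if (!canFold(callee)) return std::nullopt;
  return evaluate(callee, args);
}
```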

Eligibility Checker: sub_14D90D0

The function takes a tagged IR node pointer and a context (intrinsic descriptor). The node pointer carries a 3-bit tag in its low bits; the function masks with ~7 to recover the aligned base. Before examining intrinsic IDs, it performs three attribute pre-filter checks on the callee:

  1. Speculatable/ReadNone (attribute kind 0x15 = 21): The callee must be safe to speculatively execute. If the direct callee lacks this attribute, the function follows one level of indirection through the resolved function target at [callee + 0x70] and re-checks.

  2. NoUnwind (attribute kind 5): The callee must not throw. Same indirection chain.

  3. Convergent gate (attribute kind 0x34 = 52): If the callee is marked convergent, the function returns 0 immediately. This is the critical safety check for GPU code -- convergent intrinsics like __syncthreads(), __ballot_sync(), and warp shuffle operations have warp-synchronous semantics that would be violated by folding them away, even when all arguments happen to be constant.

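After attribute filtering, the function reads the intrinsic ID from [context + 0x24] (offset +36, unsigned 32-bit enum) and dispatches through a two-level scheme: a jump table for the standard LLVM intrinsic IDs (0--211), and cascading range checks with bitmask tests for the NVVM-specific IDs above 211.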
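The pointer untagging and attribute pre-filter can be sketched as follows. The attribute kind numbers come from the text; the Callee struct, its resolved field (standing in for the target at [callee + 0x70]), and all function names are hypothetical simplifications of LLVM's attribute queries.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Attribute kind numbers as reported for the eligibility checker.
constexpr unsigned kNoUnwind = 5;
constexpr unsigned kSpeculatable = 0x15;  // 21
constexpr unsigned kConvergent = 0x34;    // 52

// Hypothetical callee model; `resolved` models the one level of
// indirection through the resolved function target.
struct alignas(8) Callee {
  std::set<unsigned> attrs;
  const Callee* resolved = nullptr;
  bool has(unsigned kind) const { return attrs.count(kind) != 0; }
};

// Recover the 8-byte-aligned base from a pointer whose low 3 bits are a tag.
const Callee* untag(std::uintptr_t tagged) {
  return reinterpret_cast<const Callee*>(tagged & ~std::uintptr_t{7});
}

// Check the callee first, then one level of indirection.
bool hasDirectOrResolved(const Callee& c, unsigned kind) {
  return c.has(kind) || (c.resolved != nullptr && c.resolved->has(kind));
}

// Pre-filter in the order described: speculatable, nounwind, not convergent.
bool passesAttributeFilter(std::uintptr_t taggedCallee) {
  const Callee& c = *untag(taggedCallee);
  if (!hasDirectOrResolved(c, kSpeculatable)) return false;
  if (!hasDirectOrResolved(c, kNoUnwind)) return false;
  if (c.has(kConvergent)) return false;  // never fold warp-synchronous calls
  return true;
}
```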

Evaluation: sub_14D1BC0

The evaluator receives the function name string, its length, an opcode/intrinsic-ID enum, a return type descriptor, an array of constant operand IR nodes, the operand count (1, 2, or 3), a flag enabling name-based matching, and a context pointer. It returns a ConstantFP or ConstantInt IR node on success, or null on failure.

The top-level dispatch is on operand count:

  • Unary (count = 1): Trigonometric, exponential, logarithmic, rounding, and absolute value functions.
  • Binary (count = 2): pow, fmod, atan2, copysign, fmin, fmax.
  • Ternary (count = 3): FMA / fused multiply-add (opcodes 99 and 100 only).
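As a rough illustration of the operand-count dispatch, the sketch below switches on arity first and on an operation name second. The string names stand in for CICC's opcode enum, which this sketch does not reproduce, and evalByArity is a hypothetical name. std::fma shows why the ternary case is special: it rounds a*b+c once, so it cannot be folded as a separate multiply and add.

```cpp
#include <cassert>
#include <cmath>
#include <optional>
#include <string_view>

// Hypothetical evaluator core: dispatch on operand count, then on the op.
std::optional<double> evalByArity(std::string_view op, const double* a, int count) {
  switch (count) {
    case 1:  // unary: rounding, exponential, ...
      if (op == "floor") return std::floor(a[0]);
      if (op == "exp") return std::exp(a[0]);
      break;
    case 2:  // binary: fmod, copysign, fmax, ...
      if (op == "fmod") return std::fmod(a[0], a[1]);
      if (op == "copysign") return std::copysign(a[0], a[1]);
      if (op == "fmax") return std::fmax(a[0], a[1]);
      break;
    case 3:  // ternary: fused multiply-add, a single rounding of a*b+c
      if (op == "fma") return std::fma(a[0], a[1], a[2]);
      break;
  }
  return std::nullopt;  // unknown op or wrong operand count
}
```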

Foldable Intrinsics Master Table

Standard LLVM Intrinsic IDs (0--211)

These are dispatched via a jump table at jpt_14D91F0 in the eligibility checker. The evaluator handles them via cascading opcode comparisons.

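| ID | Hex | Intrinsic | Category |
|---|---|---|---|
| 5 | 0x05 | llvm.bswap | Bitwise |
| 6 | 0x06 | llvm.ceil | Rounding |
| 8 | 0x08 | llvm.copysign | Sign |
| 11 | 0x0B | llvm.cos | Trig |
| 12 | 0x0C | llvm.ctlz | Bitwise |
| 13 | 0x0D | llvm.ctpop | Bitwise |
| 30 | 0x1E | llvm.exp | Exponential |
| 31 | 0x1F | llvm.exp2 | Exponential |
| 32 | 0x20 | llvm.fabs | Absolute |
| 33 | 0x21 | llvm.floor | Rounding |
| 54 | 0x36 | llvm.fma | Ternary |
| 55 | 0x37 | llvm.fmuladd | Ternary |
| 96 | 0x60 | llvm.log | Logarithmic |
| 97 | 0x61 | llvm.log10 | Logarithmic |
| 99 | 0x63 | llvm.log2 | Logarithmic |
| 100 | 0x64 | llvm.lround | Rounding |
| 115 | 0x73 | llvm.maxnum | MinMax |
| 122 | 0x7A | llvm.minnum | MinMax |
| 123 | 0x7B | llvm.nearbyint | Rounding |
| 124 | 0x7C | llvm.pow | Power |
| 129 | 0x81 | llvm.powi | Power |
| 132 | 0x84 | llvm.rint | Rounding |
| 139 | 0x8B | llvm.round | Rounding |
| 140 | 0x8C | llvm.roundeven | Rounding |
| 146 | 0x92 | llvm.sin | Trig |
| 147 | 0x93 | llvm.tan | Trig |
| 187 | 0xBB | llvm.sqrt | Root |
| 188 | 0xBC | llvm.trunc | Rounding |
| 189--211 | 0xBD--0xD3 | Integer ops (umax, sadd.with.overflow, etc.) | Integer |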

NVVM-Specific Intrinsic IDs (>211)

These are dispatched via cascading range checks with bitmask tests in the eligibility checker.

| ID range | Hex | Intrinsic | Category |
|---|---|---|---|
| 3637--3639 | 0xE35--0xE37 | nvvm.bitcast.* / nvvm.move.* | Bitwise |
| 3660 | 0xE4C | nvvm.ptr.gen.to.* | Pointer |
| 3764--3765 | 0xEB4--0xEB5 | nvvm.ceil.f / nvvm.ceil.d | Rounding |
| 3778--3779 | 0xEC2--0xEC3 | nvvm.ctlz.i / nvvm.ctlz.ll | Bitwise |
| 3787 | 0xECB | nvvm.cos.approx.ftz.f | Trig |
| 3811 | 0xEE3 | nvvm.div.* / nvvm.fabs variant | Arith |
| 3870--3871 | 0xF1E--0xF1F | nvvm.exp2.approx.ftz.f / .d | Exponential |
| 3911--3912 | 0xF47--0xF48 | nvvm.fabs.f / .d | Absolute |
| 3924--3925 | 0xF54--0xF55 | nvvm.floor.f / .d | Rounding |
| 3944 | 0xF68 | nvvm.log.approx.ftz.f | Logarithmic |
| 3946 | 0xF6A | nvvm.log2.approx.ftz.f | Logarithmic |
| 3948 | 0xF6C | nvvm.log10.approx.ftz.f | Logarithmic |
| 3950 | 0xF6E | nvvm.rcp.approx.ftz.d | Reciprocal |
| 3952 | 0xF70 | nvvm.rsqrt.approx.ftz.f | Root |
| 3954 | 0xF72 | nvvm.sqrt.f / .approx.ftz.f | Root |
| 4072--4074 | 0xFE8--0xFEA | nvvm.sin/cos.approx.ftz variants | Trig |
| 4114--4115 | 0x1012--0x1013 | nvvm.max.i / .ui | MinMax |
| 4118--4119 | 0x1016--0x1017 | nvvm.min.i / .ui | MinMax |
| 4167--4168 | 0x1047--0x1048 | nvvm.max.ll / .ull | MinMax |
| 4170--4172 | 0x104A--0x104C | nvvm.min.ll / .ull | MinMax |
| 4230--4231 | 0x1086--0x1087 | nvvm.mul.hi.* | Multiply |
| 4413 | 0x113D | nvvm.sin.approx.ftz.f | Trig |
| 4475, 4478 | 0x117B, 0x117E | nvvm.sqrt.f / .rn.d | Root |
| 4483--4484 | 0x1183--0x1184 | nvvm.sqrt.approx.f / .ftz.f | Root |
| 5293 | 0x14AD | nvvm.f2i / nvvm.d2i | Conversion |
| 5300 | 0x14B4 | nvvm.i2f / nvvm.i2d | Conversion |
| 7297--7298 | 0x1C81--0x1C82 | nvvm.fmax.f / .d | MinMax |
| 7301--7302 | 0x1C85--0x1C86 | nvvm.fmin.f / .d | MinMax |
| 7334--7335 | 0x1CA6--0x1CA7 | nvvm.fmax.ftz.f / .ftz.nan.f | MinMax |
| 7339--7340 | 0x1CAB--0x1CAC | nvvm.fmin.ftz.f / .ftz.nan.f | MinMax |
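The two-level ID check (a jump table for standard IDs, range checks plus bitmask tests for NVVM IDs) can be sketched as below. The text does not give CICC's actual masks, so the window base, the bit layout, and the small subset of IDs here are assumptions drawn from the tables above; isFoldableIntrinsicId and bit are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t bit(unsigned k) { return std::uint64_t{1} << k; }

// Level 1: a dense switch over standard IDs (0..211), which compilers
// typically lower to a jump table. Level 2: NVVM IDs tested with a range
// check plus a 64-bit bitmask. Only a subset of IDs is modeled.
bool isFoldableIntrinsicId(unsigned id) {
  if (id <= 211) {
    switch (id) {
      case 6:    // llvm.ceil
      case 11:   // llvm.cos
      case 32:   // llvm.fabs
      case 146:  // llvm.sin
      case 187:  // llvm.sqrt
        return true;
      default:
        return false;
    }
  }
  // Window of 64 IDs starting at 3911; bit k set means ID 3911 + k folds.
  constexpr unsigned kBase = 3911;
  constexpr std::uint64_t kMask =
      bit(0) | bit(1) |    // 3911-3912 nvvm.fabs.f / .d
      bit(13) | bit(14) |  // 3924-3925 nvvm.floor.f / .d
      bit(33) | bit(35) |  // 3944, 3946 nvvm.log / log2 approx
      bit(37) | bit(39) |  // 3948 log10 approx, 3950 rcp approx
      bit(41) | bit(43);   // 3952 rsqrt approx, 3954 sqrt
  if (id >= kBase && id - kBase < 64) return (kMask >> (id - kBase)) & 1u;
  return false;  // other NVVM ranges omitted from this sketch
}
```

One range check plus one shift-and-test replaces dozens of individual comparisons, which is presumably why a sparse set of large IDs is handled this way rather than with a jump table.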

Name-Based Foldable Functions (Case 0 Fallthrough)

When the intrinsic ID is 0 (not a recognized LLVM intrinsic), both the eligibility checker and the evaluator fall through to string-based matching. The evaluator uses two tiers: a fast path that dispatches on the intrinsic ID, then a slow path that compares the callee name. The slow path is taken only when the name-matching flag (the seventh parameter, a7 in the decompiled signature) is set.

Plain C library names (44 table entries; atan2 and atan2f appear in both the inverse-trig and binary rows):

| Category | Functions |
|---|---|
| Trigonometric | sin, sinf, cos, cosf, tan, tanf |
| Inverse trig | acos, acosf, asin, asinf, atan, atanf, atan2, atan2f |
| Hyperbolic | sinh, sinhf, cosh, coshf, tanh, tanhf |
| Exponential | exp, expf, exp2, exp2f |
| Logarithmic | log, logf, log10, log10f |
| Rounding | ceil, ceilf, floor, floorf, round, roundf |
| Absolute / Root | fabs, fabsf, sqrt, sqrtf |
| Binary | pow, powf, fmod, fmodf, atan2, atan2f |

Glibc __*_finite variants (20 entries):

__acos_finite, __acosf_finite, __asin_finite, __asinf_finite, __atan2_finite, __atan2f_finite, __cosh_finite, __coshf_finite, __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite, __log_finite, __logf_finite, __log10_finite, __log10f_finite, __pow_finite, __powf_finite, __sinh_finite, __sinhf_finite

C++ mangled names (~48 entries): _Z3cosf, _Z3cosd, _Z3sinf, _Z3sind, _Z3tanf, _Z3tand, _Z3expf, _Z3expd, _Z3logf, _Z3logd, _Z4acosf, _Z4acosd, _Z4asinf, _Z4asind, _Z4atanf, _Z4atand, _Z4ceilf, _Z4ceild, _Z4coshf, _Z4coshd, _Z4exp2f, _Z4exp2d, _Z4fabsf, _Z4fabsd, _Z4sinhf, _Z4sinhd, _Z4sqrtf, _Z4sqrtd, _Z4tanhf, _Z4tanhd, _Z4fmodff, _Z4fmoddd, _Z5floorf, _Z5floord, _Z5log10f, _Z5log10d, _Z5atan2ff, _Z5atan2dd, _Z5powff, _Z5powdd, _Z5roundf, _Z5roundd

Total across all three name forms: approximately 112 distinct recognized strings.
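The C++ forms follow the Itanium ABI pattern _Z, then the decimal length of the base name, then the base name, then one code per parameter. A name like _Z5atan2ff therefore splits into the base atan2 and two float parameters. The hypothetical parseMangledMath below decomposes only that narrow shape (all-f or all-d parameter lists); it is a sketch, not a general demangler.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Result of splitting a mangled math overload such as "_Z5atan2ff".
struct MangledMath {
  std::string base;  // unmangled name, e.g. "atan2"
  char type;         // 'f' (float) or 'd' (double)
  int arity;         // number of parameters, all of the same type
};

// "_Z" <decimal length> <name> <one or more identical 'f'/'d' codes>.
std::optional<MangledMath> parseMangledMath(std::string_view s) {
  if (s.size() < 4 || s.substr(0, 2) != "_Z") return std::nullopt;
  std::size_t i = 2, len = 0;
  while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
    len = len * 10 + static_cast<std::size_t>(s[i++] - '0');
  if (len == 0 || i + len >= s.size()) return std::nullopt;  // need name + >= 1 param
  std::string_view name = s.substr(i, len);
  std::string_view params = s.substr(i + len);
  char t = params[0];
  if (t != 'f' && t != 'd') return std::nullopt;
  for (char c : params)
    if (c != t) return std::nullopt;  // mixed or unsupported parameter types
  return MangledMath{std::string(name), t, static_cast<int>(params.size())};
}
```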

Name Matching Algorithm

The evaluator's name matching is a hand-tuned trie-like dispatch optimized for the specific set of math function names. It avoids hash tables or sorted arrays in favor of cascading character comparisons:

nameMatch(name, length):
    // Strip C++ mangling prefix
    if name[0] == '_' and name[1] == 'Z':
        dispatch on name[2]:  // length digit
            '3' -> match 3-char base: cos, sin, tan, exp, log
            '4' -> match 4-char base: acos, asin, atan, ceil, cosh, exp2, fabs, sinh, sqrt, tanh, fmod
            '5' -> match 5-char base: floor, log10, atan2, pow, round
        verify trailing parameter codes: 'f' = float, 'd' = double ("ff"/"dd" for two-argument functions)
        return FOUND if base and suffix matched, else NOT_FOUND

    // Strip glibc __finite prefix
    if name[0] == '_' and name[1] == '_':
        dispatch on name[2]:
            'a' -> __acos_finite, __acosf_finite, __asin_finite, __asinf_finite,
                   __atan2_finite, __atan2f_finite
            'c' -> __cosh_finite, __coshf_finite
            'e' -> __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite
            'l' -> __log_finite, __logf_finite, __log10_finite, __log10f_finite
            'p' -> __pow_finite, __powf_finite
            's' -> __sinh_finite, __sinhf_finite
        verify with memcmp against string constant
        return FOUND if matched, else NOT_FOUND

    // Plain C library name
    dispatch on name[0]:
        'a' -> acos, asin, atan, atan2 + 'f' variants
        'c' -> cos, cosf, ceil, ceilf, cosh, coshf
        'e' -> exp, expf, exp2, exp2f
        'f' -> fabs, fabsf, floor, floorf, fmod, fmodf
        'l' -> log, logf, log10, log10f
        'p' -> pow, powf
        'r' -> round, roundf
        's' -> sin, sinf, sinh, sinhf, sqrt, sqrtf
        't' -> tan, tanf, tanh, tanhf

    // Within each group, dispatch on name length:
    length 3: direct 3-byte compare ("sin", "cos", "tan", "exp", "log", "pow")
    length 4: DWORD compare (4-byte integer, little-endian):
        0x736F6361 = "acos"    0x6E697361 = "asin"
        0x6E617461 = "atan"    0x6C696563 = "ceil"
        0x68736F63 = "cosh"    0x73626166 = "fabs"
        0x66736F63 = "cosf"    0x686E6973 = "sinh"
        0x74727173 = "sqrt"    0x686E6174 = "tanh"
        0x32707865 = "exp2"    0x66707865 = "expf"
        ...
    length 5+: memcmp against literal string constant
    return FOUND or NOT_FOUND

The 4-byte integer comparison trick deserves attention: instead of calling memcmp for 4-character names, the code loads the name as a uint32_t and compares against a pre-computed little-endian constant. For example, *(uint32_t*)name == 0x736F6361 checks for "acos" ('a'=0x61, 'c'=0x63, 'o'=0x6F, 's'=0x73). This micro-optimization eliminates function call overhead for the most common name lengths.
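A portable version of the 4-byte comparison can be sketched as below, assuming a little-endian host as the original constants do. fourcc and isName4 are hypothetical helpers. std::memcpy is used instead of *(uint32_t*)name because the cast form violates C++ aliasing and alignment rules, while the memcpy form typically still compiles to a single 32-bit load.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Pack four characters into the integer a little-endian 4-byte load would
// produce (first character in the lowest byte). Computed arithmetically,
// so the constant itself is host-independent.
constexpr std::uint32_t fourcc(const char (&s)[5]) {
  return std::uint32_t(std::uint8_t(s[0])) | std::uint32_t(std::uint8_t(s[1])) << 8 |
         std::uint32_t(std::uint8_t(s[2])) << 16 | std::uint32_t(std::uint8_t(s[3])) << 24;
}
static_assert(fourcc("acos") == 0x736F6361, "matches the constant in the text");

// Compare a length-4 name against a packed constant with one 32-bit compare.
bool isName4(const char* name, std::size_t len, std::uint32_t packed) {
  if (len != 4) return false;
  std::uint32_t word;
  std::memcpy(&word, name, 4);  // well-defined load of the first 4 bytes
  return word == packed;        // assumes a little-endian host, like the original
}
```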

Exception-Safe Host Evaluation

The core safety mechanism is the FP exception wrapper used for all transcendental evaluation. Both the unary wrapper (sub_14D19F0) and binary wrapper (sub_14D1A80) follow the same protocol:

#include <cerrno>
#include <cfenv>

struct Type;
struct Value;
Value* createConstantFP(Type* resultType, double v);  // ConstantFP builder (sub_14D17B0)

Value* safeMathEval(double (*mathFunc)(double), Type* resultType, double arg) {
    std::feclearexcept(FE_ALL_EXCEPT);   // clear all FP exception flags
    errno = 0;                           // decompiled form: *__errno_location() = 0 (glibc)

    double result = mathFunc(arg);       // call host C library

    // Check errno for domain/range error
    int e = errno;
    if (e == EDOM || e == ERANGE) {      // 33 or 34 on Linux
        std::feclearexcept(FE_ALL_EXCEPT);
        errno = 0;
        return nullptr;                  // refuse to fold
    }

    // Check FP exception flags (mask = 0x1D = 29 with the x86 values
    // FE_INVALID(1) | FE_DIVBYZERO(4) | FE_OVERFLOW(8) | FE_UNDERFLOW(16))
    if (std::fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)) {
        std::feclearexcept(FE_ALL_EXCEPT);
        errno = 0;
        return nullptr;                  // refuse to fold
    }

    // FE_INEXACT (32 on x86) is intentionally NOT checked:
    // most transcendentals produce inexact results and that is acceptable.

    return createConstantFP(resultType, result);
}

This design means the folder refuses to produce a result whenever the host FPU signals any exceptional condition other than inexact. The implications:

  • exp(1000.0) overflows on the host (FE_OVERFLOW, errno ERANGE) -- not folded, left in IR for runtime evaluation.
  • log(-1.0) produces a domain error -- not folded.
  • sqrt(-0.01) triggers FE_INVALID -- not folded.
  • sin(0.5) produces an inexact result (since sin(0.5) is irrational) -- folded normally.
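The same protocol can be exercised on a host as a standalone function, assuming glibc-style libm behavior (errno set and FP flags raised on domain, pole, and range errors). safeUnaryEval is a hypothetical name, and it returns std::optional<double> in place of an IR constant.

```cpp
#include <cassert>
#include <cerrno>
#include <cfenv>
#include <cmath>
#include <optional>

// Hypothetical host-side version of the wrapper: returns the value, or
// nullopt if errno or any FP flag other than FE_INEXACT reports a problem.
std::optional<double> safeUnaryEval(double (*fn)(double), double arg) {
  std::feclearexcept(FE_ALL_EXCEPT);  // start from clean FP flags
  errno = 0;                          // and a clean errno
  double result = fn(arg);
  const bool rejected =
      errno == EDOM || errno == ERANGE ||
      std::fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW) != 0;
  std::feclearexcept(FE_ALL_EXCEPT);  // leave no state behind either way
  errno = 0;
  if (rejected) return std::nullopt;
  return result;                      // FE_INEXACT alone is acceptable
}
```

For example, passing [](double x) { return std::log(x); } with -1.0 yields an empty optional, while the same call with std::sin and 0.5 yields a value. Captureless lambdas are used because std::log and friends are overloaded, and the C++ standard does not guarantee that standard library functions can have their address taken.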

Domain Pre-Checks

In addition to the post-evaluation exception check, certain functions have explicit domain guards before calling the host math library:

| Function | Precondition | Rationale |
|---|---|---|
| log, logf, log10, log10f | argument > 0.0 | Negative inputs produce NaN; zero produces -inf (pole error) |
| sqrt, sqrtf | argument >= 0.0 | Negative inputs produce NaN |
| acos, asin | no pre-check | Relies on FP exception mechanism |

The asymmetry is deliberate: log/sqrt get explicit checks because their domain violations are common and cheap to detect, while acos/asin rely on the post-evaluation FE_INVALID check.
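The pre-checks amount to a small guard in front of the libm call. A sketch, with passesDomainPrecheck as a hypothetical name: written this way, a NaN argument also fails both comparisons and is rejected, a detail the text does not state for CICC.

```cpp
#include <cassert>
#include <cmath>
#include <string_view>

// Hypothetical guard mirroring the pre-check table: reject obviously bad
// inputs before calling libm. acos/asin deliberately have no entry.
bool passesDomainPrecheck(std::string_view fn, double x) {
  if (fn == "log" || fn == "logf" || fn == "log10" || fn == "log10f")
    return x > 0.0;   // excludes negatives, zero (pole), and NaN
  if (fn == "sqrt" || fn == "sqrtf")
    return x >= 0.0;  // -0.0 passes, since sqrt(-0.0) is a valid -0.0
  return true;        // everything else relies on the post-call exception check
}
```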

Host FPU vs. GPU Precision

The constant folder evaluates using the host CPU's math library (j_sin, j_cos, j_exp, etc. -- PLT stubs to glibc). This creates a potential precision mismatch: the folded constant may not be bit-identical to what the GPU hardware would compute. NVIDIA mitigates this through several mechanisms:

  1. Exact handling of exact functions. fabs, floor, and ceil have custom host-side implementations, and round calls libc round() directly. All four are exact operations, so the folded value matches the GPU result exactly:

    • fabs (sub_14D1280): Pure SSE2 bitwise AND with 0x7FFFFFFFFFFFFFFF (clears the sign bit). Bit-exact regardless of platform.
    • floor (sub_14D13B0): Custom truncation: for |x| < 2^52, truncate toward zero, subtract 1.0 when x is negative and non-integral (truncation moved it up), and preserve the sign bit so that floor(-0.0) is -0.0. For |x| >= 2^52, return x unchanged (already integral).
    • ceil (sub_14D1410): Mirror of floor: truncate toward zero, add 1.0 when x is positive and non-integral, and preserve the sign bit so that ceil(-0.5) is -0.0.
    • round (j__round): Uses libc round() directly (halfway cases rounded away from zero), matching CUDA's round(). This is not the PTX cvt.rni rounding, which sends halfway cases to the even integer.
  2. Exception rejection for transcendentals. For sin, cos, exp, log and other transcendentals, CICC accepts the host result even though it is not guaranteed to be bit-identical to the device result: IEEE-754 recommends but does not require correctly rounded transcendental functions, and glibc and the CUDA math library each document their own ULP error bounds. The exception wrapper catches the cases where host and device behavior are most likely to diverge (denormals, overflow boundary).

  3. exp2(x) folded as pow(2.0, x). Rather than calling exp2() directly (which might differ between host and device implementations), the evaluator computes pow(2.0, x) through the binary wrapper, so exp2 goes through the same exception-checked path as pow.

  4. No half-precision transcendental folding. The type check at the evaluator's entry rejects type byte 1 (half) for all trig/exp/log functions. Only basic operations (convert, compare) work on fp16. This is safe because half-precision math functions are implemented as promote-to-float, compute, demote-to-half -- by the time the constant folder runs, the promotion has already been inlined.
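The floor and ceil strategies described in item 1 can be sketched with a 64-bit integer truncation. This illustrates the technique under stated assumptions (IEEE-754 double, std::int64_t conversion truncating toward zero); it is not a transcription of sub_14D13B0 or sub_14D1410, and floorByTrunc and ceilByTrunc are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr double kTwo52 = 4503599627370496.0;  // 2^52: magnitudes >= this are integral

// floor via truncation + correction.
double floorByTrunc(double x) {
  if (!(std::fabs(x) < kTwo52)) return x;  // already integral, or inf/NaN
  double t = static_cast<double>(static_cast<std::int64_t>(x));  // toward zero
  if (t > x) t -= 1.0;                     // negative non-integer was rounded up
  return std::copysign(t, x);              // keeps floor(-0.0) == -0.0
}

// ceil is the mirror image.
double ceilByTrunc(double x) {
  if (!(std::fabs(x) < kTwo52)) return x;
  double t = static_cast<double>(static_cast<std::int64_t>(x));
  if (t < x) t += 1.0;                     // positive non-integer was rounded down
  return std::copysign(t, x);              // ceil(-0.5) must be -0.0, not +0.0
}
```

The copysign step matters because converting the integer 0 back to double always yields +0.0, which would lose the sign for inputs in (-1, 0].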

FTZ and Approximate Intrinsics

NVVM intrinsics like nvvm.exp2.approx.ftz.f and nvvm.sin.approx.ftz.f carry .approx (reduced precision) and .ftz (flush-to-zero for denormals) modifiers. These are present in the foldable ID list, which may seem surprising -- folding an "approximate" intrinsic with exact host math could produce a different value than the hardware.

The rationale: constant folding evaluates the mathematical function, not the hardware instruction. If the input and the result are both normal floats, the folded value is a valid (and typically more accurate) answer regardless of FTZ. The FTZ modifier only affects denormal values. A denormal result normally raises FE_UNDERFLOW on the host, so the exception wrapper rejects it; a denormal input, however, raises no host exception on its own, so in that case the folded value reflects the unflushed input. The .approx modifier licenses reduced accuracy in exchange for runtime speed, so a host-computed value that is at least as accurate is treated as an acceptable compile-time substitute.
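A small host-side model makes the denormal-input case concrete. FE_UNDERFLOW is tied to tiny results, not to denormal inputs, so an input flush can change a result without any host exception. The model below treats FTZ as flushing the input to a signed zero before calling log2; that is an assumption covering one aspect of .ftz behavior (real hardware also flushes denormal results). flushDenormal, hostLog2, and ftzLog2 are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Model of flush-to-zero on an input: denormals become zero of the same sign.
float flushDenormal(float x) {
  return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}

// Host evaluation vs. a modeled FTZ evaluation of log2 for the same input.
float hostLog2(float x) { return std::log2(x); }
float ftzLog2(float x) { return std::log2(flushDenormal(x)); }
```

For std::numeric_limits<float>::denorm_min(), hostLog2 returns a finite value of about -149 without raising FE_UNDERFLOW, while ftzLog2 returns negative infinity. Normal inputs are unaffected by the flush.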

Comparison with Upstream LLVM

Upstream LLVM's ConstantFolding.cpp (LLVM 20) handles NVPTX intrinsics in canConstantFoldCallTo and ConstantFoldCall. The overlap and gaps:

| Capability | Upstream LLVM | CICC v13.0 |
|---|---|---|
| llvm.sin, llvm.cos, llvm.exp, llvm.log, etc. | Yes | Yes |
| nvvm.ceil.f, nvvm.floor.f, nvvm.fabs, nvvm.sqrt.* | Yes | Yes |
| nvvm.fmax.*, nvvm.fmin.* (all variants) | Yes (including .xorsign_abs) | Yes (subset: .f, .d, .ftz, .ftz.nan) |
| nvvm.f2i_*, nvvm.d2i_* (FP-to-int with rounding modes) | Yes (all 32 variants) | Partial (IDs 5293, 5300 only) |
| Plain C math names (sin, cosf, exp2f, etc.) | Via TargetLibraryInfo | Direct name matching (44 entries) |
| Glibc __*_finite variants | Yes (the same 20 names) | Yes (20 entries) |
| C++ mangled _Z3cosf, _Z4acosd, etc. | No | Yes (~48 entries) |
| nvvm.cos.approx.ftz.f, nvvm.exp2.approx.ftz.f, etc. | No | Yes |
| nvvm.rcp.approx.ftz.d, nvvm.rsqrt.approx.ftz.f | No | Yes |
| nvvm.mul.hi.* | No | Yes |
| Convergent intrinsic rejection | Implicit (no fold path) | Explicit attribute check |
| FMA constant fold | Yes (via APFloat) | Yes (opcodes 99/100, APFloat fma) |
| Integer min/max/ctlz/cttz | Partial | Yes (full NVVM ID coverage) |

The critical CICC-only capabilities are the C++ mangled names (emitted by device-side C++ math overloads) and the .approx.ftz intrinsic family. The __*_finite names (emitted when code is compiled with -ffinite-math-only against older glibc headers) are also recognized upstream, using the same 20-name list.

Integer Constant Folding

The evaluator also handles integer-domain operations when operands have type tag 13 (ConstantInt) or when FP operands encode integer comparisons:

Binary integer ops (operand count = 2, both ConstantInt):

  • Opcodes 189, 195, 198, 209, 210, 211: APInt binary operations (add, sub, mul, sdiv, udiv, srem) via sub_16A7290 and related APInt helpers.
  • Opcodes 0xEC2/0xEC3 (3778/3779): ctlz (count leading zeros).
  • Opcodes 0x1014/0x1015, 0x1016/0x1017: Signed/unsigned min/max via APInt comparison.
  • Opcodes 0x104B/0x104C, 0x1087/0x1088: Additional signed/unsigned min/max encodings.
  • Opcode 3811: Division where divisor is known zero -- returns UndefValue.
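Two details of these integer folds fit in a few lines. Signed and unsigned max disagree on the same bit pattern, which is why the min/max encodings come in signed/unsigned pairs, and a count-leading-zeros fold must pick a result for zero. This sketch returns 32 for zero, the value PTX clz.b32 produces; the function names are hypothetical, and APInt is replaced by fixed 32-bit integers.

```cpp
#include <cassert>
#include <cstdint>

// Same bits, different ordering: signed vs. unsigned max.
std::uint32_t smax32(std::uint32_t a, std::uint32_t b) {
  return static_cast<std::int32_t>(a) > static_cast<std::int32_t>(b) ? a : b;
}
std::uint32_t umax32(std::uint32_t a, std::uint32_t b) { return a > b ? a : b; }

// Count leading zeros, scanning from the most significant bit.
// Returns 32 for a zero input.
unsigned ctlz32(std::uint32_t x) {
  unsigned n = 0;
  for (std::uint32_t mask = 0x80000000u; mask != 0 && !(x & mask); mask >>= 1) ++n;
  return n;
}
```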

Integer comparison fold (type tag 14 with integer-domain opcodes):

  • Opcode 0xBB (187), 0x8C (140): icmp eq/ne -- predicate 0.
  • Opcode 0x61 (97): icmp slt -- predicate 2.
  • Opcode 0xBC (188): icmp sgt -- predicate 4.
  • Opcode 0xCE (206): icmp uge -- predicate 3.
  • Opcode 0x08 (8): icmp ult -- predicate 1.

These produce ConstantInt 0 or 1 via sub_169EBA0/sub_169D440.

Libdevice Integration

NVIDIA's libdevice (libdevice.10.bc) provides optimized LLVM bitcode implementations of math functions. After linking libdevice, calls like __nv_sinf are typically inlined and disappear before constant folding runs. However, if inlining fails or is disabled, residual __nv_* calls may survive.

The constant folder does not recognize __nv_* prefixed names directly. The __ name-matching path only handles glibc __*_finite patterns, not NVIDIA's __nv_* convention. Un-inlined libdevice residuals are handled upstream by the NVVM InstCombine intrinsic canonicalizer (sub_1169C30), which recognizes __nv_* prefixes and may convert them to standard LLVM intrinsics that the constant folder can then process.

The __nvvm_reflect mechanism (used for __CUDA_ARCH queries) is resolved by a separate earlier pass (NVVMReflect) that replaces __nvvm_reflect("__CUDA_ARCH") with a constant integer based on the target SM. By the time the constant folder runs, all __nvvm_reflect calls have been eliminated.

Configuration Knobs

| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-fp-call-folding | cl::opt<bool> | false | Upstream LLVM hidden flag. When true, prevents constant folding of any function returning or accepting floating-point types. Checked in canConstantFoldCallTo. |
| FPFoldDisable | NVIDIA CiccOption | false | NVIDIA-specific flag that disables FP constant folding at the NVVM level. |
| instcombine-negator-enabled | cl::opt<bool> | true | Controls the negation propagation system in sub_1169C30 (InstCombine intrinsic folder). |
| instcombine-negator-max-depth | cl::opt<int> | platform-dependent | Depth limit for the negator chain in InstCombine intrinsic folding. Prevents exponential blowup when pushing negation through deep arithmetic chains. |

The FPFoldDisable knob is significant for debugging precision issues: when a kernel produces different results with -O0 vs -O2, disabling FP folding isolates whether constant-folded values are the source of the discrepancy.

ConstantFP Result Creation

The result builder sub_14D17B0 creates the final LLVM ConstantFP IR node from the evaluated double result. It dispatches on the return type byte at *(type + 8):

| Type byte | Precision | Behavior |
|---|---|---|
| 1 | half | Not reached from math folder (filtered at entry). Infrastructure exists: converts through APFloat semantics. |
| 2 | float | Narrows double to float via a C cast (rounding to nearest, ties to even, under the default rounding mode), then converts the float to APFloat via sub_169D3B0. |
| 3 | double | Stores full double precision via sub_169D3F0 (double to APFloat). |

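Both paths finish with sub_159CCF0(*type, &storage), which constructs the ConstantFP node from the APFloat storage. Because the float path narrows via a C cast, the folded float value is exactly (float)host_result. The cast itself is correctly rounded (round-to-nearest-even under the default rounding mode), but it is a second rounding on top of the double-precision evaluation, so in rare cases the folded value can differ by one ULP from a correctly rounded single-precision evaluation.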
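The type-byte dispatch can be modeled as below, with std::variant<float, double> standing in for an APFloat-backed ConstantFP. makeResult and FoldedFP are hypothetical names; type byte 1 (half) returns nothing, mirroring the fact that the math folder never reaches it.

```cpp
#include <cassert>
#include <optional>
#include <variant>

// Hypothetical stand-in for a ConstantFP: either a float or a double payload.
using FoldedFP = std::variant<float, double>;

// Dispatch on the type byte: 2 = float, 3 = double; anything else
// (including 1 = half) produces no result here.
std::optional<FoldedFP> makeResult(unsigned char typeByte, double value) {
  switch (typeByte) {
    case 2:
      // Narrowing conversion: rounds to the nearest float under the default
      // rounding mode, a second rounding after the double evaluation.
      return FoldedFP{static_cast<float>(value)};
    case 3:
      return FoldedFP{value};
    default:
      return std::nullopt;
  }
}
```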

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| nvvmIntrinsicConstantFold | 0x14D90D0 | 27 KB | Eligibility predicate: can this intrinsic be constant-folded? |
| nvvmConstantFoldLibCall | 0x14D1BC0 | 54 KB | Math evaluator: compute constant result from constant args |
| extractDoubleFromConstantFP | 0x14D1620 | -- | Extract double from ConstantFP IR node |
| safeMathEvalUnary | 0x14D19F0 | -- | Exception-safe unary evaluation wrapper |
| safeMathEvalBinary | 0x14D1A80 | -- | Exception-safe binary evaluation wrapper |
| createConstantFPResult | 0x14D17B0 | -- | Build ConstantFP from evaluated double |
| customFabs | 0x14D1280 | -- | SSE2 sign-bit clear |
| customFloor | 0x14D13B0 | -- | Truncation + sign correction |
| customCeil | 0x14D1410 | -- | Truncation + sign correction |
| customSqrt | 0x14D1470 | -- | Thin wrapper around libc sqrt |
| fptoui_fptosi_fold | 0x14D1500 | -- | FP-to-integer conversion fold |
| apintMoveTransfer | 0x14D15E0 | -- | APInt move/transfer helper |
| vectorMathLibMapping | 0x149E420 | 26 KB | Scalar-to-vectorized math mapping table |
| platformFuncCanonicalize | 0x149FA60 | 15 KB | Platform-specific name canonicalization |
| constantExprFoldSCEV | 0x14D44C0 | 20 KB | ConstantExpr fold / SCEV integration |
| constantFoldAggregate | 0x14D5510 | 16 KB | ConstantFold for aggregate types |
| constantFoldGEPExtract | 0x14D66F0 | 17 KB | ConstantFold for GEP and extract |
| constantExprSCEVBuild | 0x14DBA90 | 22 KB | ConstantExpr + SCEV builder |
| AttributeList::hasAttribute | 0x1560260 | -- | Attribute query (used 8 times in eligibility checker) |
| Value::getName | 0x1649960 | -- | Name string extraction (case 0 path) |
| NVVM InstCombine intrinsic fold | 0x1169C30 | 87 KB | Algebraic simplification of NVVM intrinsics (see InstCombine) |

Cross-References

  • InstCombine -- The NVVM intrinsic canonicalizer (sub_1169C30) handles algebraic simplification, negation propagation, and operand folding for NVVM intrinsics. It calls constant folding as a sub-step.
  • Pipeline & Ordering -- Where constant folding sits in the optimization pipeline (runs within InstCombine and as a standalone analysis).
  • Builtin Table: Math Functions -- The complete list of CUDA math builtins and their mapping to NVVM intrinsics.
  • CLI Flags -- FPFoldDisable and other optimization control flags.
  • LLVM Knobs -- The full disable-fp-call-folding flag and related InstCombine depth limits.