Constant Folding: Math & Intrinsics
NVIDIA-modified pass. GPU-specific changes (110+ math name variants, 60+ NVVM intrinsic IDs, exception-safe host evaluation) are documented throughout this page.
Upstream source: `llvm/lib/Analysis/ConstantFolding.cpp` (LLVM 20.0.0). The upstream `ConstantFoldCall` function handles standard `llvm.*` intrinsics; NVIDIA's extensions (`sub_14D90D0` eligibility checker, `sub_14D1BC0` evaluator) are layered on top.
LLVM version note: The upstream `ConstantFolding.cpp` in LLVM 20 handles approximately 30 standard math intrinsics (`llvm.sin`, `llvm.cos`, `llvm.sqrt`, etc.) and a small set of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in `nvvm.*` form). CICC extends this to 110+ math name variants (C, glibc `__*_finite`, C++ mangled `_Z*`) and 60+ NVVM intrinsic IDs. The upstream `disable-fp-call-folding` knob (`cl::Hidden`, default `false`) is preserved; NVIDIA adds a separate `FPFoldDisable` CiccOption for independent control.
CICC v13.0 extends LLVM's ConstantFolding analysis with two large custom functions that together enable compile-time evaluation of over 110 distinct math function name variants and 60+ NVVM intrinsic IDs. Upstream LLVM's ConstantFoldCall handles standard llvm.sin, llvm.cos, llvm.sqrt, and a handful of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in their nvvm.* forms, plus FP-to-integer conversion intrinsics). CICC goes far beyond this: it recognizes every C math library name (sin, sinf), every glibc __*_finite internal variant, every C++ mangled form (_Z3cosf, _Z4acosd), and the full set of NVVM approximate/FTZ math intrinsics -- then evaluates them using the host C math library with an exception-safe wrapper that refuses to produce results when the host FPU signals domain errors, overflow, or underflow.
The system is split into two cooperating functions. The eligibility checker sub_14D90D0 (27 KB, called nvvmIntrinsicConstantFold in the sweep analysis) is a fast predicate that answers "can this call be constant-folded?" without touching operand values. The evaluator sub_14D1BC0 (54 KB, called nvvmConstantFoldLibCall) performs the actual computation when all operands are constant. A third function, the NVVM InstCombine intrinsic folder sub_1169C30 (87 KB), handles algebraic simplification of NVVM intrinsics and is documented separately on the InstCombine page.
| Component | Identity |
|---|---|
| Eligibility checker | sub_14D90D0 (0x14D90D0, 27 KB, 282 basic blocks, 489 edges) |
| Math evaluator | sub_14D1BC0 (0x14D1BC0, 54 KB) |
| Constant extractor | sub_14D1620 (0x14D1620) |
| Safe unary eval wrapper | sub_14D19F0 (0x14D19F0) |
| Safe binary eval wrapper | sub_14D1A80 (0x14D1A80) |
| ConstantFP builder | sub_14D17B0 (0x14D17B0) |
| Custom fabs | sub_14D1280 (0x14D1280) -- SSE2 sign-bit mask |
| Custom floor | sub_14D13B0 (0x14D13B0) -- truncation + sign correction |
| Custom ceil | sub_14D1410 (0x14D1410) -- truncation + sign correction |
| Custom sqrt | sub_14D1470 (0x14D1470) -- thin wrapper around libc sqrt |
| Vector math mapping | sub_149E420 (0x149E420, 26 KB) |
| LLVM knob | disable-fp-call-folding (upstream, cl::Hidden, default false) |
| NVIDIA knob | FPFoldDisable (NVIDIA CiccOption, disables FP constant folding) |
Two-Tier Architecture: Eligibility vs. Evaluation
The constant folding system operates as a two-phase protocol. The caller (from the ConstantFolding pass or InstCombine visitCallInst path) first invokes the eligibility checker to determine whether a call instruction is a candidate, then invokes the evaluator to produce the folded constant. This split exists for performance: the eligibility check is cheap (no operand extraction, no FP computation), while the evaluator is expensive (extracts APFloat values, calls host math library, checks FP exceptions).
Eligibility Checker: sub_14D90D0
The function takes a tagged IR node pointer and a context (intrinsic descriptor). The node pointer carries a 3-bit tag in its low bits; the function masks with ~7 to recover the aligned base. Before examining intrinsic IDs, it performs three attribute pre-filter checks on the callee:
- Speculatable/ReadNone (attribute kind `0x15` = 21): The callee must be safe to speculatively execute. If the direct callee lacks this attribute, the function follows one level of indirection through the resolved function target at `[callee + 0x70]` and re-checks.
- NoUnwind (attribute kind `5`): The callee must not throw. Same indirection chain.
- Convergent gate (attribute kind `0x34` = 52): If the callee is marked `convergent`, the function returns 0 immediately. This is the critical safety check for GPU code -- convergent intrinsics like `__syncthreads()`, `__ballot_sync()`, and warp shuffle operations have warp-synchronous semantics that would be violated by folding them away, even when all arguments happen to be constant.
After attribute filtering, the function reads the intrinsic ID from [context + 0x24] (offset +36, unsigned 32-bit enum) and dispatches through a two-level scheme.
Evaluation: sub_14D1BC0
The evaluator receives the function name string, its length, an opcode/intrinsic-ID enum, a return type descriptor, an array of constant operand IR nodes, the operand count (1, 2, or 3), a flag enabling name-based matching, and a context pointer. It returns a ConstantFP or ConstantInt IR node on success, or null on failure.
The top-level dispatch is on operand count:
- Unary (count = 1): Trigonometric, exponential, logarithmic, rounding, and absolute value functions.
- Binary (count = 2): `pow`, `fmod`, `atan2`, `copysign`, `fmin`, `fmax`.
- Ternary (count = 3): FMA / fused multiply-add (opcodes 99 and 100 only).
Foldable Intrinsics Master Table
Standard LLVM Intrinsic IDs (0--211)
These are dispatched via a jump table at jpt_14D91F0 in the eligibility checker. The evaluator handles them via cascading opcode comparisons.
| ID | Hex | Intrinsic | Category |
|---|---|---|---|
| 5 | 0x05 | llvm.bswap | Bitwise |
| 6 | 0x06 | llvm.ceil | Rounding |
| 8 | 0x08 | llvm.copysign | Sign |
| 11 | 0x0B | llvm.cos | Trig |
| 12 | 0x0C | llvm.ctlz | Bitwise |
| 13 | 0x0D | llvm.ctpop | Bitwise |
| 30 | 0x1E | llvm.exp | Exponential |
| 31 | 0x1F | llvm.exp2 | Exponential |
| 32 | 0x20 | llvm.fabs | Absolute |
| 33 | 0x21 | llvm.floor | Rounding |
| 54 | 0x36 | llvm.fma | Ternary |
| 55 | 0x37 | llvm.fmuladd | Ternary |
| 96 | 0x60 | llvm.log | Logarithmic |
| 97 | 0x61 | llvm.log10 | Logarithmic |
| 99 | 0x63 | llvm.log2 | Logarithmic |
| 100 | 0x64 | llvm.lround | Rounding |
| 115 | 0x73 | llvm.maxnum | MinMax |
| 122 | 0x7A | llvm.minnum | MinMax |
| 123 | 0x7B | llvm.nearbyint | Rounding |
| 124 | 0x7C | llvm.pow | Power |
| 129 | 0x81 | llvm.powi | Power |
| 132 | 0x84 | llvm.rint | Rounding |
| 139 | 0x8B | llvm.round | Rounding |
| 140 | 0x8C | llvm.roundeven | Rounding |
| 146 | 0x92 | llvm.sin | Trig |
| 147 | 0x93 | llvm.tan | Trig |
| 187 | 0xBB | llvm.sqrt | Root |
| 188 | 0xBC | llvm.trunc | Rounding |
| 189--211 | 0xBD--0xD3 | Integer ops (umax, sadd.with.overflow, etc.) | Integer |
NVVM-Specific Intrinsic IDs (>211)
These are dispatched via cascading range checks with bitmask tests in the eligibility checker.
| ID Range | Hex | Intrinsic | Category |
|---|---|---|---|
| 3637--3639 | 0xE35--0xE37 | nvvm.bitcast.* / nvvm.move.* | Bitwise |
| 3660 | 0xE4C | nvvm.ptr.gen.to.* | Pointer |
| 3764--3765 | 0xEB4--0xEB5 | nvvm.ceil.f / nvvm.ceil.d | Rounding |
| 3778--3779 | 0xEC2--0xEC3 | nvvm.ctlz.i / nvvm.ctlz.ll | Bitwise |
| 3787 | 0xECB | nvvm.cos.approx.ftz.f | Trig |
| 3811 | 0xEE3 | nvvm.div.* / nvvm.fabs variant | Arith |
| 3870--3871 | 0xF1E--0xF1F | nvvm.exp2.approx.ftz.f / .d | Exponential |
| 3911--3912 | 0xF47--0xF48 | nvvm.fabs.f / .d | Absolute |
| 3924--3925 | 0xF54--0xF55 | nvvm.floor.f / .d | Rounding |
| 3944 | 0xF68 | nvvm.log.approx.ftz.f | Logarithmic |
| 3946 | 0xF6A | nvvm.log2.approx.ftz.f | Logarithmic |
| 3948 | 0xF6C | nvvm.log10.approx.ftz.f | Logarithmic |
| 3950 | 0xF6E | nvvm.rcp.approx.ftz.d | Reciprocal |
| 3952 | 0xF70 | nvvm.rsqrt.approx.ftz.f | Root |
| 3954 | 0xF72 | nvvm.sqrt.f / .approx.ftz.f | Root |
| 4072--4074 | 0xFE8--0xFEA | nvvm.sin/cos.approx.ftz variants | Trig |
| 4114--4115 | 0x1012--0x1013 | nvvm.max.i / .ui | MinMax |
| 4118--4119 | 0x1016--0x1017 | nvvm.min.i / .ui | MinMax |
| 4167--4168 | 0x1047--0x1048 | nvvm.max.ll / .ull | MinMax |
| 4170--4172 | 0x104A--0x104C | nvvm.min.ll / .ull | MinMax |
| 4230--4231 | 0x1086--0x1087 | nvvm.mul.hi.* | Multiply |
| 4413 | 0x113D | nvvm.sin.approx.ftz.f | Trig |
| 4475, 4478 | 0x117B, 0x117E | nvvm.sqrt.f / .rn.d | Root |
| 4483--4484 | 0x1183--0x1184 | nvvm.sqrt.approx.f / .ftz.f | Root |
| 5293 | 0x14AD | nvvm.f2i / nvvm.d2i | Conversion |
| 5300 | 0x14B4 | nvvm.i2f / nvvm.i2d | Conversion |
| 7297--7298 | 0x1C81--0x1C82 | nvvm.fmax.f / .d | MinMax |
| 7301--7302 | 0x1C85--0x1C86 | nvvm.fmin.f / .d | MinMax |
| 7334--7335 | 0x1CA6--0x1CA7 | nvvm.fmax.ftz.f / .ftz.nan.f | MinMax |
| 7339--7340 | 0x1CAB--0x1CAC | nvvm.fmin.ftz.f / .ftz.nan.f | MinMax |
Name-Based Foldable Functions (Case 0 Fallthrough)
When the intrinsic ID is 0 (unrecognized LLVM intrinsic), both the eligibility checker and the evaluator fall through to string-based matching. The evaluator uses a two-tier name matching system: fast-path intrinsic ID dispatch, then slow-path name comparison when the a7 flag is set.
Plain C library names (44 entries):
| Category | Functions |
|---|---|
| Trigonometric | sin, sinf, cos, cosf, tan, tanf |
| Inverse trig | acos, acosf, asin, asinf, atan, atanf, atan2, atan2f |
| Hyperbolic | sinh, sinhf, cosh, coshf, tanh, tanhf |
| Exponential | exp, expf, exp2, exp2f |
| Logarithmic | log, logf, log10, log10f |
| Rounding | ceil, ceilf, floor, floorf, round, roundf |
| Absolute / Root | fabs, fabsf, sqrt, sqrtf |
| Binary | pow, powf, fmod, fmodf, atan2, atan2f |
Glibc __*_finite variants (20 entries):
__acos_finite, __acosf_finite, __asin_finite, __asinf_finite, __atan2_finite, __atan2f_finite, __cosh_finite, __coshf_finite, __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite, __log_finite, __logf_finite, __log10_finite, __log10f_finite, __pow_finite, __powf_finite, __sinh_finite, __sinhf_finite
C++ mangled names (~48 entries): _Z3cosf, _Z3cosd, _Z3sinf, _Z3sind, _Z3tanf, _Z3tand, _Z3expf, _Z3expd, _Z3logf, _Z3logd, _Z4acosf, _Z4acosd, _Z4asinf, _Z4asind, _Z4atanf, _Z4atand, _Z4ceilf, _Z4ceild, _Z4coshf, _Z4coshd, _Z4exp2f, _Z4exp2d, _Z4fabsf, _Z4fabsd, _Z4sinhf, _Z4sinhd, _Z4sqrtf, _Z4sqrtd, _Z4tanhf, _Z4tanhd, _Z4fmodff, _Z4fmoddd, _Z5floorf, _Z5floord, _Z5log10f, _Z5log10d, _Z5atan2ff, _Z5atan2dd, _Z5powff, _Z5powdd, _Z5roundf, _Z5roundd
Total across all three name forms: approximately 112 distinct recognized strings.
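The mangled forms follow the Itanium pattern `_Z<len><base><type-suffix>`, which is what makes the length-digit dispatch described below possible. A sketch of the decode step (hypothetical helper, not the binary's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Split "_Z5atan2ff" into base name "atan2" and suffix "ff".
 * Returns false for anything that is not a _Z<digit><base><suffix> form.
 * Caller provides base[>=6] and suffix[>=3]. */
bool split_mangled(const char *name, char *base, char *suffix) {
    if (name[0] != '_' || name[1] != 'Z') return false;
    int len = name[2] - '0';
    if (len < 3 || len > 5) return false;   /* only 3..5-char bases occur */
    memcpy(base, name + 3, (size_t)len);
    base[len] = '\0';
    strcpy(suffix, name + 3 + len);         /* 'f' = float, 'd' = double */
    return true;
}
```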
Name Matching Algorithm
The evaluator's name matching is a hand-tuned trie-like dispatch optimized for the specific set of math function names. It avoids hash tables or sorted arrays in favor of cascading character comparisons:
nameMatch(name, length):
// Strip C++ mangling prefix
if name[0] == '_' and name[1] == 'Z':
dispatch on name[2]: // length digit
'3' -> match 3-char base: cos, sin, tan, exp, log
'4' -> match 4-char base: acos, asin, atan, ceil, cosh, exp2, fabs, sinh, sqrt, tanh, fmod
'5' -> match 5-char base: floor, log10, atan2, pow, round
verify trailing type suffix: 'f' = float, 'd' = double
return FOUND
// Strip glibc __finite prefix
if name[0] == '_' and name[1] == '_':
dispatch on name[2]:
'a' -> __acos_finite, __acosf_finite, __asin_finite, __asinf_finite,
__atan2_finite, __atan2f_finite
'c' -> __cosh_finite, __coshf_finite
'e' -> __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite
'l' -> __log_finite, __logf_finite, __log10_finite, __log10f_finite
'p' -> __pow_finite, __powf_finite
's' -> __sinh_finite, __sinhf_finite
verify with memcmp against string constant
return FOUND
// Plain C library name
dispatch on name[0]:
'a' -> acos, asin, atan + 'f' variants
'c' -> cos, cosf, ceil, ceilf, cosh, coshf
'e' -> exp, expf, exp2, exp2f
'f' -> fabs, fabsf, floor, floorf
'l' -> log, logf, log10, log10f
'p' -> pow, powf
'r' -> round, roundf
's' -> sin, sinf, sinh, sinhf, sqrt, sqrtf
't' -> tan, tanf, tanh, tanhf
// Within each group, dispatch on name length:
length 3: direct 3-byte compare ("sin", "cos", "tan", "exp", "log", "pow")
length 4: DWORD compare (4-byte integer, little-endian):
0x736F6361 = "acos" 0x6E697361 = "asin"
0x6E617461 = "atan" 0x6C696563 = "ceil"
0x68736F63 = "cosh" 0x73626166 = "fabs"
0x66736F63 = "cosf" 0x686E6973 = "sinh"
0x74727173 = "sqrt" 0x686E6174 = "tanh"
0x32707865 = "exp2" 0x66707865 = "expf"
...
length 5+: memcmp against literal string constant
return FOUND or NOT_FOUND
The 4-byte integer comparison trick deserves attention: instead of calling memcmp for 4-character names, the code loads the name as a uint32_t and compares against a pre-computed little-endian constant. For example, *(uint32_t*)name == 0x736F6361 checks for "acos" ('a'=0x61, 'c'=0x63, 'o'=0x6F, 's'=0x73). This micro-optimization eliminates function call overhead for the most common name lengths.
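The trick is easy to reproduce (a demonstration, not the binary's code; `memcpy` stands in for the unaligned pointer cast to keep it well-defined C, and the constant assumes a little-endian host, as CICC's x86-64 code does):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Compare a 4-character name against a precomputed little-endian constant,
 * as the evaluator does for length-4 names like "acos". */
bool is_acos(const char *name) {
    uint32_t w;
    memcpy(&w, name, 4);         /* one 32-bit load instead of memcmp */
    return w == 0x736F6361u;     /* 'a'=0x61 'c'=0x63 'o'=0x6F 's'=0x73 */
}
```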
Exception-Safe Host Evaluation
The core safety mechanism is the FP exception wrapper used for all transcendental evaluation. Both the unary wrapper (sub_14D19F0) and binary wrapper (sub_14D1A80) follow the same protocol:
Value* safeMathEval(double (*mathFunc)(double), Type* resultType, double arg) {
feclearexcept(FE_ALL_EXCEPT); // clear all FP exception flags
*__errno_location() = 0; // clear errno
double result = mathFunc(arg); // call host C library
// Check errno for domain/range error
int e = *__errno_location();
if (e == EDOM || e == ERANGE) { // errno 33 or 34
feclearexcept(FE_ALL_EXCEPT);
*__errno_location() = 0;
return nullptr; // refuse to fold
}
// Check FP exception flags (mask = 0x1D = 29)
// FE_INVALID(1) | FE_DIVBYZERO(4) | FE_OVERFLOW(8) | FE_UNDERFLOW(16)
if (fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)) {
feclearexcept(FE_ALL_EXCEPT);
*__errno_location() = 0;
return nullptr; // refuse to fold
}
// FE_INEXACT (32) is intentionally NOT checked --
// most transcendentals produce inexact results and that is acceptable.
return createConstantFP(resultType, result);
}
This design means the folder refuses to produce a result whenever the host FPU signals any exceptional condition other than inexact. The implications:
- `sin(1e308)` may raise spurious FP exceptions during the host's argument reduction -- not folded, left in IR for runtime evaluation.
- `log(-1.0)` produces a domain error -- not folded.
- `sqrt(-0.01)` triggers `FE_INVALID` -- not folded.
- `sin(0.5)` produces an inexact result (since sin(0.5) is irrational) -- folded normally.
Domain Pre-Checks
In addition to the post-evaluation exception check, certain functions have explicit domain guards before calling the host math library:
| Function | Precondition | Rationale |
|---|---|---|
| `log`, `logf`, `log10`, `log10f` | argument > 0.0 | Negative inputs produce NaN |
| `sqrt`, `sqrtf` | argument >= 0.0 | Negative inputs produce NaN |
| `acos`, `asin` | no pre-check | Relies on FP exception mechanism |
The asymmetry is deliberate: log/sqrt get explicit checks because their domain violations are common and cheap to detect, while acos/asin rely on the post-evaluation FE_INVALID check.
Host FPU vs. GPU Precision
The constant folder evaluates using the host CPU's math library (j_sin, j_cos, j_exp, etc. -- PLT stubs to glibc). This creates a potential precision mismatch: the folded constant may not be bit-identical to what the GPU hardware would compute. NVIDIA mitigates this through several mechanisms:
- Custom implementations for exact functions. `fabs`, `floor`, `ceil`, and `round` have custom host-side implementations that match GPU rounding semantics exactly:
  - fabs (`sub_14D1280`): Pure SSE2 bitwise AND with `0x7FFFFFFFFFFFFFFF` (clear sign bit). Bit-exact regardless of platform.
  - floor (`sub_14D13B0`): Custom truncation: for `|x| < 2^52`, truncate to integer, subtract 1.0 if truncation rounded toward zero for negative values, preserve sign bit. For `|x| >= 2^52`, return unchanged (already integral).
  - ceil (`sub_14D1410`): Mirror of floor: truncate to integer, add 1.0 if truncation rounded toward zero for positive values.
  - round (`j__round`): Uses libc `round()` directly (round-half-away-from-zero, matching PTX `round.rni`).
- Exception rejection for transcendentals. For `sin`, `cos`, `exp`, `log` and other transcendentals, CICC accepts the host result because IEEE-754 guarantees these are correctly rounded within 1 ULP on both host and device. The exception wrapper catches cases where host and device behavior might diverge (denormals, overflow boundary).
- `exp2(x)` folded as `pow(2.0, x)`. Rather than calling `exp2()` directly (which might differ between host and device implementations), the evaluator computes `pow(2.0, x)` through the binary wrapper, ensuring consistent behavior.
- No half-precision transcendental folding. The type check at the evaluator's entry rejects type byte 1 (half) for all trig/exp/log functions. Only basic operations (convert, compare) work on fp16. This is safe because half-precision math functions are implemented as promote-to-float, compute, demote-to-half -- by the time the constant folder runs, the promotion has already been inlined.
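The floor algorithm described above can be reconstructed as follows (a sketch built from the text's description, not a decompilation of `sub_14D13B0`; sign-of-zero preservation is omitted here):

```c
#include <assert.h>

/* Truncation + sign correction: for |x| < 2^52, truncate toward zero,
 * then subtract 1.0 if truncation rounded a negative non-integer up.
 * At |x| >= 2^52 every double is already integral, so return unchanged
 * (this branch also passes NaN through). */
double my_floor(double x) {
    double ax = x < 0.0 ? -x : x;
    if (!(ax < 4503599627370496.0))     /* 2^52 */
        return x;
    double t = (double)(long long)x;    /* truncate toward zero */
    return (t > x) ? t - 1.0 : t;       /* step down for negative non-integers */
}
```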
FTZ and Approximate Intrinsics
NVVM intrinsics like nvvm.exp2.approx.ftz.f and nvvm.sin.approx.ftz.f carry .approx (reduced precision) and .ftz (flush-to-zero for denormals) modifiers. These are present in the foldable ID list, which may seem surprising -- folding an "approximate" intrinsic with exact host math could produce a different value than the hardware.
The rationale: constant folding evaluates the mathematical function, not the hardware instruction. If the input is a normal float and the result is a normal float, the folded value is correct regardless of FTZ or approximation quality. The FTZ modifier only affects denormal inputs (which the exception wrapper would catch via FE_UNDERFLOW), and the .approx modifier only matters for runtime execution speed. For compile-time constants, exact evaluation is strictly better.
Comparison with Upstream LLVM
Upstream LLVM's ConstantFolding.cpp (LLVM 20.0.0, the version pinned above) handles NVPTX intrinsics in canConstantFoldCallTo and ConstantFoldCall. The overlap and gaps:
| Capability | Upstream LLVM | CICC v13.0 |
|---|---|---|
| `llvm.sin`, `llvm.cos`, `llvm.exp`, `llvm.log`, etc. | Yes | Yes |
| `nvvm.ceil.f`, `nvvm.floor.f`, `nvvm.fabs`, `nvvm.sqrt.*` | Yes | Yes |
| `nvvm.fmax.*`, `nvvm.fmin.*` (all variants) | Yes (including `.xorsign_abs`) | Yes (subset: `.f`, `.d`, `.ftz`, `.ftz.nan`) |
| `nvvm.f2i_*`, `nvvm.d2i_*` (FP-to-int with rounding modes) | Yes (all 32 variants) | Partial (IDs 5293, 5300 only) |
| Plain C math names (`sin`, `cosf`, `exp2f`, etc.) | Via TargetLibraryInfo | Direct name matching (44 entries) |
| Glibc `__*_finite` variants | No | Yes (20 entries) |
| C++ mangled `_Z3cosf`, `_Z4acosd`, etc. | No | Yes (~48 entries) |
| `nvvm.cos.approx.ftz.f`, `nvvm.exp2.approx.ftz.f`, etc. | No | Yes |
| `nvvm.rcp.approx.ftz.d`, `nvvm.rsqrt.approx.ftz.f` | No | Yes |
| `nvvm.mul.hi.*` | No | Yes |
| Convergent intrinsic rejection | Implicit (no fold path) | Explicit attribute check |
| FMA constant fold | Yes (via APFloat) | Yes (opcodes 99/100, APFloat fma) |
| Integer min/max/ctlz/cttz | Partial | Yes (full NVVM ID coverage) |
The critical CICC-only capabilities are the __*_finite variants (needed when code is compiled with -ffinite-math-only), the C++ mangled names (emitted by device-side C++ math overloads), and the .approx.ftz intrinsic family.
Integer Constant Folding
The evaluator also handles integer-domain operations when operands have type tag 13 (ConstantInt) or when FP operands encode integer comparisons:
Binary integer ops (operand count = 2, both ConstantInt):

- Opcodes 189, 195, 198, 209, 210, 211: APInt binary operations (add, sub, mul, sdiv, udiv, srem) via `sub_16A7290` and related APInt helpers.
- Opcodes `0xEC2`/`0xEC3` (3778/3779): `ctlz` (count leading zeros).
- Opcodes `0x1014`/`0x1015`, `0x1016`/`0x1017`: Signed/unsigned min/max via APInt comparison.
- Opcodes `0x104B`/`0x104C`, `0x1087`/`0x1088`: Additional signed/unsigned min/max encodings.
- Opcode 3811: Division where divisor is known zero -- returns `UndefValue`.

Integer comparison fold (type tag 14 with integer-domain opcodes):

- Opcode `0xBB` (187), `0x8C` (140): `icmp eq/ne` -- predicate 0.
- Opcode `0x61` (97): `icmp slt` -- predicate 2.
- Opcode `0xBC` (188): `icmp sgt` -- predicate 4.
- Opcode `0xCE` (206): `icmp uge` -- predicate 3.
- Opcode `0x08` (8): `icmp ult` -- predicate 1.

These produce ConstantInt 0 or 1 via `sub_169EBA0`/`sub_169D440`.
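One of these integer folds, `ctlz` (opcode `0xEC2` for the 32-bit form), is simple enough to sketch portably (illustrative; the binary uses APInt helpers rather than compiler builtins):

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 32-bit value -- the operation behind the
 * nvvm.ctlz.i fold.  Portable shift loop, no builtin required. */
int ctlz32(uint32_t x) {
    if (x == 0) return 32;                   /* defined result for zero */
    int n = 0;
    while (!(x & 0x80000000u)) { x <<= 1; ++n; }
    return n;
}
```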
Libdevice Integration
NVIDIA's libdevice (libdevice.10.bc) provides optimized LLVM bitcode implementations of math functions. After linking libdevice, calls like __nv_sinf are typically inlined and disappear before constant folding runs. However, if inlining fails or is disabled, residual __nv_* calls may survive.
The constant folder does not recognize __nv_* prefixed names directly. The __ name-matching path only handles glibc __*_finite patterns, not NVIDIA's __nv_* convention. Un-inlined libdevice residuals are handled upstream by the NVVM InstCombine intrinsic canonicalizer (sub_1169C30), which recognizes __nv_* prefixes and may convert them to standard LLVM intrinsics that the constant folder can then process.
The __nvvm_reflect mechanism (used for __CUDA_ARCH queries) is resolved by a separate earlier pass (NVVMReflect) that replaces __nvvm_reflect("__CUDA_ARCH") with a constant integer based on the target SM. By the time the constant folder runs, all __nvvm_reflect calls have been eliminated.
Configuration Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
| `disable-fp-call-folding` | `cl::opt<bool>` | false | Upstream LLVM hidden flag. When true, prevents constant folding of any function returning or accepting floating-point types. Checked in `canConstantFoldCallTo`. |
| `FPFoldDisable` | NVIDIA CiccOption | false | NVIDIA-specific flag that disables FP constant folding at the NVVM level. |
| `instcombine-negator-enabled` | `cl::opt<bool>` | true | Controls the negation propagation system in `sub_1169C30` (InstCombine intrinsic folder). |
| `instcombine-negator-max-depth` | `cl::opt<int>` | platform-dependent | Depth limit for the negator chain in InstCombine intrinsic folding. Prevents exponential blowup when pushing negation through deep arithmetic chains. |
The FPFoldDisable knob is significant for debugging precision issues: when a kernel produces different results with -O0 vs -O2, disabling FP folding isolates whether constant-folded values are the source of the discrepancy.
ConstantFP Result Creation
The result builder sub_14D17B0 creates the final LLVM ConstantFP IR node from the evaluated double result. It dispatches on the return type byte at *(type + 8):
| Type byte | Precision | Behavior |
|---|---|---|
| 1 | half | Not reached from math folder (filtered at entry). Infrastructure exists: converts through APFloat semantics. |
| 2 | float | Truncates double to float via C cast, then converts float to APFloat via sub_169D3B0. |
| 3 | double | Stores full double precision via sub_169D3F0 (double to APFloat). |
Both paths finish with sub_159CCF0(*type, &storage) which constructs the ConstantFP node from the APFloat storage. The float path's truncation via C cast means the folded float value matches what (float)host_result produces -- this is IEEE-754 correct because the cast performs round-to-nearest-even.
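The float path's behavior is checkable directly: a plain C cast from double to float rounds to nearest-even, so the folded value is exactly `(float)host_result` (a demonstration assuming IEEE-754 types; `fold_to_float` is an illustrative name):

```c
#include <assert.h>

/* The builder's float path is just a C cast: double -> float with
 * round-to-nearest-even, then the float becomes the ConstantFP payload. */
float fold_to_float(double host_result) {
    return (float)host_result;
}
```

Exactly representable values (0.5, small integers) survive the cast unchanged; values like 0.1 pick up a second rounding, which is the behavior the GPU would also see when consuming a float constant.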
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| `nvvmIntrinsicConstantFold` | 0x14D90D0 | 27 KB | Eligibility predicate: can this intrinsic be constant-folded? |
| `nvvmConstantFoldLibCall` | 0x14D1BC0 | 54 KB | Math evaluator: compute constant result from constant args |
| `extractDoubleFromConstantFP` | 0x14D1620 | -- | Extract double from ConstantFP IR node |
| `safeMathEvalUnary` | 0x14D19F0 | -- | Exception-safe unary evaluation wrapper |
| `safeMathEvalBinary` | 0x14D1A80 | -- | Exception-safe binary evaluation wrapper |
| `createConstantFPResult` | 0x14D17B0 | -- | Build ConstantFP from evaluated double |
| `customFabs` | 0x14D1280 | -- | SSE2 sign-bit clear |
| `customFloor` | 0x14D13B0 | -- | Truncation + sign correction |
| `customCeil` | 0x14D1410 | -- | Truncation + sign correction |
| `customSqrt` | 0x14D1470 | -- | Thin wrapper around libc sqrt |
| `fptoui_fptosi_fold` | 0x14D1500 | -- | FP-to-integer conversion fold |
| `apintMoveTransfer` | 0x14D15E0 | -- | APInt move/transfer helper |
| `vectorMathLibMapping` | 0x149E420 | 26 KB | Scalar-to-vectorized math mapping table |
| `platformFuncCanonicalize` | 0x149FA60 | 15 KB | Platform-specific name canonicalization |
| `constantExprFoldSCEV` | 0x14D44C0 | 20 KB | ConstantExpr fold / SCEV integration |
| `constantFoldAggregate` | 0x14D5510 | 16 KB | ConstantFold for aggregate types |
| `constantFoldGEPExtract` | 0x14D66F0 | 17 KB | ConstantFold for GEP and extract |
| `constantExprSCEVBuild` | 0x14DBA90 | 22 KB | ConstantExpr + SCEV builder |
| `AttributeList::hasAttribute` | 0x1560260 | -- | Attribute query (used 8 times in eligibility checker) |
| `Value::getName` | 0x1649960 | -- | Name string extraction (case 0 path) |
| NVVM InstCombine intrinsic fold | 0x1169C30 | 87 KB | Algebraic simplification of NVVM intrinsics (see InstCombine) |
Cross-References
- InstCombine -- The NVVM intrinsic canonicalizer (`sub_1169C30`) handles algebraic simplification, negation propagation, and operand folding for NVVM intrinsics. It calls constant folding as a sub-step.
- Pipeline & Ordering -- Where constant folding sits in the optimization pipeline (runs within InstCombine and as a standalone analysis).
- Builtin Table: Math Functions -- The complete list of CUDA math builtins and their mapping to NVVM intrinsics.
- CLI Flags -- `FPFoldDisable` and other optimization control flags.
- LLVM Knobs -- The full `disable-fp-call-folding` flag and related InstCombine depth limits.