KnownBits & DemandedBits for GPU
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
NVIDIA's KnownBits and DemandedBits infrastructure in cicc v13.0 diverges from upstream LLVM in three structural ways. First, the two analyses are fused into a single 127 KB function (sub_11A7600) that simultaneously computes known-zero/known-one bitmasks and simplifies instructions whose demanded bits allow constant folding or narrowing -- upstream LLVM separates computeKnownBits (in ValueTracking) from SimplifyDemandedBits (in InstCombine). Second, a dedicated GPU-specific known-bits oracle (sub_F0C4B0) provides range constraints for NVIDIA special registers (%tid, %ntid, %ctaid, %nctaid, %warpsize, %laneid) that have no CPU equivalent. Third, an early NVVM pipeline pass (nvvm-intr-range at sub_216F4B0) attaches !range metadata to every special-register read intrinsic, giving downstream analyses the same bounded-range information that CPU targets only get from profile data or programmer assertions. Together these form the primary dataflow backbone for address calculation optimization, type narrowing, and dead-bit elimination in GPU kernels.
| Component | Function (address, size) |
|---|---|
| Merged computeKnownBits + SimplifyDemandedBits | sub_11A7600 (0x11A7600, 127 KB, 4,156 lines) |
| Secondary SimplifyDemandedBits helper | sub_11A1430 (0x11A1430, 6.3 KB, 6 opcodes) |
| Per-operand demand propagation trampoline | sub_11AE940 (0x11AE940) |
| Generic computeKnownBits (reference) | sub_9AC0E0 (fallback for unhandled opcodes) |
| Debug-only reference computeKnownBits | sub_9AC330 (cross-validation oracle) |
| computeKnownBitsFromOperator | sub_11A3F30 (0x11A3F30, 50 KB) |
| computeKnownBitsFromAssume | sub_11A6910 (0x11A6910, 12.5 KB) |
| computeKnownBitsFromRangeMetadata | sub_11A68C0 |
| Post-analysis NVIDIA fixup | sub_99B5E0 (alignment + range refinement) |
| NVIDIA intrinsic known-bits oracle | sub_F0C4B0 (special register ranges) |
| Intrinsic return range analysis | sub_10CA790 + sub_11A1390 |
| NVVMIntrRange pass | sub_216F4B0 (nvvm-intr-range) |
| SelectionDAG computeKnownBits | sub_33D4EF0 (0x33D4EF0, 114 KB, 3,286 lines) |
| Pointer alignment known-bits | sub_BD5420 (getPointerAlignmentBits) |
| Debug cross-validation flag | qword_4F90C28 (enables abort-on-mismatch) |
| Max recursion depth | 6 (checked in sub_11AE940) |
GPU-Specific Known-Bits Sources
The key difference from CPU targets: GPU code has dozens of values with statically knowable ranges that never exist on a CPU. Every CUDA thread reads its identity from special registers whose values are bounded by hardware launch parameters. NVIDIA exploits this in two places: the nvvm-intr-range pass adds !range metadata at the IR level, and the target-specific known-bits oracle sub_F0C4B0 provides bitmask information directly to computeKnownBits.
Special Register Range Table
The following ranges apply to every NVVM intrinsic that reads a PTX special register. The !range metadata attached by nvvm-intr-range (sub_216F4B0) encodes [lo, hi) as an LLVM MDNode. The known-bits column shows which bits are guaranteed zero given the maximum value.
| Register | PTX | NVVM Intrinsic ID Range | Value Range | i32 Known Zero (upper bits) |
|---|---|---|---|---|
| %tid.x/y/z | %tid.x | 350--352 | [0, maxntid-1] | bits [ceil(log2(maxntid)), 31] |
| %ntid.x/y/z | %ntid.x | 353--355 | [1, 1024] | bits [11, 31] (at most 1024) |
| %ctaid.x/y/z | %ctaid.x | 356--358 | [0, gridDim-1] | bits [ceil(log2(gridDim)), 31] |
| %nctaid.x/y/z | %nctaid.x | 359--361 | [1, 2^31-1] | bit 31 (always non-negative) |
| %warpsize | %WARP_SZ | ~370 | {32} (constant) | bits [0,4] = 00000, bit 5 = 1, bits [6,31] = 0 |
| %laneid | %laneid | ~371 | [0, 31] | bits [5, 31] |
| %warpid | %warpid | ~372 | [0, maxWarpsPerSM-1] | SM-dependent upper bits |
| %smid | %smid | ~375 | [0, numSMs-1] | architecture-dependent |
| %nsmid | %nsmid | ~376 | [1, numSMs] | architecture-dependent |
| %gridid | %gridid | ~378 | [0, 2^32-1] | none (full range) |
| %clock | %clock | ~380 | [0, 2^32-1] | none |
| %lanemask_eq/lt/le/gt/ge | %lanemask_* | ~382--386 | [0, 2^32-1] | none |
When __launch_bounds__(maxThreadsPerBlock, minBlocksPerMP) is present on a kernel, nvvm-intr-range tightens the %tid ranges to [0, maxThreadsPerBlock-1] and %ntid to [1, maxThreadsPerBlock]. Similarly, nvvm.reqntid metadata (from __launch_bounds__ with exact dimensions or reqntid pragmas) can constrain each dimension independently to an exact value.
The knob nvvm-intr-range-sm (constructor ctor_359) selects the SM variant used to determine architectural limits for registers like %warpid, %smid, and %nsmid.
Address Space Known Bits
CUDA uses separate address spaces with distinct pointer bit-widths and alignment properties. These feed directly into sub_BD5420 (getPointerAlignmentBits), which OR's known-zero low bits into the KnownBits result for any pointer-typed value:
| Address Space | PTX | Pointer Width | Known Alignment | Known Bits Effect |
|---|---|---|---|---|
| 0 (generic) | default | 64 bits | none guaranteed | pointer alignment only |
| 1 (global) | .global | 64 bits | >= 16 bytes (typical) | low 4 bits often known-zero |
| 3 (shared) | .shared | 32 bits | >= 4 bytes (minimum) | low 2 bits known-zero, bits [32,63] irrelevant |
| 4 (constant) | .const | 64 bits | >= 4 bytes | low 2 bits known-zero |
| 5 (local) | .local | 32 bits | >= 4 bytes (stack) | low 2 bits known-zero, bits [32,63] irrelevant |
The 32-bit address spaces (shared and local) are critical: any value known to be a shared-memory pointer has bits [32, 63] entirely dead. The DemandedBits analysis exploits this to eliminate zero-extensions and truncations around shared-memory address calculations, keeping everything in 32-bit arithmetic.
Launch Parameter Integration
The __launch_bounds__ attribute, __maxnreg__ pragma, and nvvm.reqntid / nvvm.maxntid metadata all flow into the known-bits infrastructure:
- nvvm-intr-range pass (`sub_216F4B0`): Runs early in the pipeline. Reads kernel metadata (`nvvm.reqntid`, `nvvm.maxntid`) via `sub_93AE30`. Attaches `!range` metadata to every `llvm.nvvm.read.ptx.sreg.*` intrinsic call. The metadata format is `!{i32 lo, i32 hi}` where `hi` is exclusive.
- computeKnownBitsFromRangeMetadata (`sub_11A68C0`): Called during the standard `computeKnownBits` traversal. Reads `!range` metadata from any value and derives known-zero/known-one masks. For a range `[0, 1024)`, this yields `knownZero = 0xFFFFFC00` (bits 10--31 known zero).
- Intrinsic return range analysis (`sub_10CA790` + `sub_11A1390`): A separate path used when the merged `computeKnownBits` + `SimplifyDemandedBits` processes ZExt/SExt of intrinsic calls. Computes `[lo, hi]` bounds for the intrinsic's return value and checks whether the extension can be eliminated because the return range fits within the demanded bits.
The Merged Analysis: Algorithm and Pseudocode
Unlike upstream LLVM where InstCombiner::SimplifyDemandedBits calls computeKnownBits as a subroutine, cicc fuses them. The entry point sub_11AE870 wraps sub_11AE3E0, which calls the core sub_11A7600. A hash table at InstCombiner + 2064 tracks visited instructions to prevent infinite recursion.
Core Algorithm
```cpp
// sub_11A7600 — merged computeKnownBits + SimplifyDemandedBits
// Returns: replacement instruction pointer, or NULL if no simplification
Instruction* computeKnownBitsAndSimplify(
    AnalysisCtx *ctx,    // a1 — holds IR module, pass info
    IRNode *inst,        // a2 — instruction to analyze
    APInt *demanded,     // a3 — which output bits the consumer needs
    KnownBits *result,   // a4 — output {knownZero, knownOne}
    unsigned depth,      // a5 — recursion depth (checked in caller)
    QueryState *state    // a6 — worklist context
) {
  uint8_t opcode = inst->opcode_tag;  // single-byte opcode at offset 0
  unsigned width = demanded->getBitWidth();

  // Stack-allocate 4 APInt accumulators for operand known bits
  APInt kz0(width, 0), ko0(width, 0);  // operand 0
  APInt kz1(width, 0), ko1(width, 0);  // operand 1

  switch (opcode) {
  case '*': { // Mul — lines 654-1037
    // Pattern: if one operand is a known power-of-2 from an intrinsic call,
    // replace Mul with Shl (critical for threadIdx * stride)
    if (auto *rhs = matchConstantPow2Call(inst->getOperand(1))) {
      if (inst->getOperand(0)->hasOneUse())
        return createShl(inst->getOperand(0), log2(rhs));
    }
    // Generic: narrow demanded mask by leading zeros, propagate to operands
    unsigned effectiveBits = width - demanded->countLeadingZeros();
    APInt narrowDemand = APInt::getLowBitsSet(width, effectiveBits);
    propagateDemandToOperand(ctx, inst, 0, narrowDemand, &kz0, &ko0, depth+1, state);
    propagateDemandToOperand(ctx, inst, 1, narrowDemand, &kz1, &ko1, depth+1, state);
    KnownBits::computeForMul(result, {kz0,ko0}, {kz1,ko1}, inst->hasNUW(), inst->hasNSW());
    break;
  }
  case '6': // ZExt — lines 1677-1919
    // Check if the source is an intrinsic call with a known return range
    if (auto range = getIntrinsicReturnRange(inst->getOperand(0))) {
      if (range.fitsBitWidth(demanded->getActiveBits()))
        return inst->getOperand(0);  // eliminate the extension
    }
    // Standard: shift demanded bits down, propagate to source, zext result
    propagateDemandToOperand(ctx, inst, 0, demanded->trunc(srcWidth), ...);
    KnownBits::zext(result, srcWidth);
    break;
  case 'U': { // NVIDIA intrinsic — lines 3521-4085
    unsigned intrinsicID = getIntrinsicID(inst);
    switch (intrinsicID) {
    case 0x0F:  handleBFE_BFI(inst, demanded, result); break;
    case 0x42:  handlePopcount(inst, demanded, result); break;
    case 0x01:  handleAbs(inst, demanded, result); break;
    case 0xB4:  handleFSHL(inst, demanded, result); break;
    case 0xB5:  handleFSHR(inst, demanded, result); break;
    case 0x12B: handleBswap(inst, demanded, result); break;
    default:
      // Fall through to the NVIDIA intrinsic known-bits oracle
      sub_F0C4B0(inst, result, depth, state);
      break;
    }
    break;
  }
  // ... 13 more opcode cases (Add, Sub, Xor, PHI, Trunc, SExt, etc.)
  default:
    sub_9AC0E0(inst, result, depth, state);  // generic fallback
    break;
  }

  // POST-ANALYSIS REFINEMENT (lines 2134-2281)
  // 1. Pointer alignment: if the type is a pointer, OR alignment bits into knownZero
  if (inst->getType()->isPointerTy()) {
    unsigned alignBits = getPointerAlignmentBits(inst);  // sub_BD5420
    result->knownZero |= APInt::getLowBitsSet(width, alignBits);
  }

  // 2. Debug cross-validation (when qword_4F90C28 is set)
  if (DEBUG_FLAG) {
    KnownBits reference;
    sub_9AC330(inst, &reference, depth, state);  // independent computation
    if (reference != *result) {
      print("computeKnownBits(): ", reference);
      print("SimplifyDemandedBits(): ", *result);
      abort();
    }
  }

  // 3. Demand-covers-known check: can we replace with a constant?
  if (demanded->isSubsetOf(result->knownZero | result->knownOne))
    return ConstantInt::get(inst->getType(), result->knownOne);

  return nullptr;
}
```
Demand Propagation Per Operand
The trampoline sub_11AE940 is the per-operand demand propagation entry point. It increments depth, checks the depth limit (depth > 6 returns all-unknown), and dispatches between the big handler (sub_11A7600) and the binary-arithmetic-specific helper (sub_11A1430) based on opcode class:
```cpp
// sub_11AE940 — per-operand demand propagation trampoline
Instruction* propagateDemandToOperand(
    AnalysisCtx *ctx, IRNode *parent, unsigned opIdx,
    APInt *demand, KnownBits *out, unsigned depth, QueryState *state
) {
  if (depth > 6)
    return nullptr;  // MaxAnalysisRecursionDepth reached

  IRNode *operand = parent->getOperand(opIdx);
  uint8_t opcode = operand->opcode_tag;

  // Binary arithmetic subset goes to the helper
  if (opcode == '*' || opcode == '9' || opcode == ':' ||
      opcode == ';' || opcode == ',' || opcode == '8')
    return sub_11A1430(ctx, operand, demand, out, depth, state);

  // Everything else goes to the big merged handler
  return sub_11A7600(ctx, operand, demand, out, depth, state);
}
```
The secondary helper sub_11A1430 handles Add/Sub/Xor/Mul/BitCast/ExtractElement with a tighter structure: it uses a four-accumulator cascade with three successive isSubsetOf checks per operation, which is more aggressive than upstream LLVM's single post-merge check.
The Four-Accumulator Cascade
For binary operators (Add, Sub, Xor), cicc maintains four APInt accumulators (two per operand) and performs a three-tier check:
```cpp
// Three-tier demand satisfaction check (sub_11A1430 pattern)
// More aggressive than upstream's single post-merge check
KnownBits kb0, kb1;
computeKnownBits(op0, &kb0, depth+1, state);
computeKnownBits(op1, &kb1, depth+1, state);
KnownBits merged = mergeForOpcode(kb0, kb1, opcode);
sub_99B5E0(inst, &merged, depth, state);  // NVIDIA post-fixup

// Check 1: does the merged result cover the demand?
if (demanded.isSubsetOf(merged.knownZero | merged.knownOne))
  return ConstantInt::get(merged.knownOne);

// Check 2: does the union of operand known-bits cover the demand?
if (demanded.isSubsetOf((kb0.knownZero | kb1.knownZero) |
                        (kb0.knownOne | kb1.knownOne)))
  return ConstantInt::get(...);

// Check 3: does all accumulated zero|one cover the demand?
if (demanded.isSubsetOf(allAccumulatedZero | allAccumulatedOne))
  return followUseDef(...);
```
The post-analysis fixup sub_99B5E0 is NVIDIA-specific and does not exist in upstream LLVM. It applies additional refinements from thread index range constraints, warp-level uniformity, and shared memory alignment guarantees.
DemandedBits for GPU: Narrowing Optimizations
The DemandedBits analysis is the backward complement to KnownBits' forward analysis. When a consumer only needs the low N bits of a value, the producer can be narrowed or eliminated. On GPU, this interaction is dramatically more productive than on CPU because of three factors:
- 32-bit address spaces: Shared memory (AS 3) and local memory (AS 5) use 32-bit pointers. When address calculations are performed in i64 (as the generic address space requires), the upper 32 bits are entirely undemanded for shared/local accesses. DemandedBits proves this and enables truncation to i32.
- Bounded thread indices: `threadIdx.x * stride + offset` patterns produce values that fit in far fewer bits than i32. If `threadIdx.x < 256` (from `__launch_bounds__`) and `stride < 4096`, the product fits in 20 bits. DemandedBits propagates this, enabling downstream shifts and masks to operate on narrower types.
- Type demotion to i16/fp16: When DemandedBits proves only the low 16 bits of an i32 computation matter, cicc can demote to 16-bit operations. The function at `sub_1185740` (InstCombine's `visitTrunc`) inserts narrowing truncations. This is particularly valuable for texture coordinate calculations and index arithmetic in tensor core operations.
Dead Bit Elimination
The core optimization check appears approximately 15 times across the analysis functions:
```cpp
// Inline version (width <= 64):
uint64_t unknown = ~(knownZero | knownOne);
if ((demanded & unknown) == 0) {
  // All demanded bits are determined -> replace with a constant
  return ConstantInt::get(type, knownOne);
}

// Wide version (width > 64):
if (demanded.isSubsetOf(knownZero | knownOne)) {
  return ConstantInt::get(type, knownOne);  // sub_AD6220
}
```
This is the heart of the analysis: backward-propagated demand meets forward-propagated known-bits. When they cover every bit the consumer needs, the entire instruction is dead and can be replaced with a compile-time constant.
GPU Patterns Enabled by Known Bits
The following simplifications are GPU-specific and do not have CPU equivalents:
Mul to Shl for threadIdx arithmetic (lines 714--861): When both operands of a multiply originate from intrinsic calls with known power-of-2 returns (e.g., threadIdx.x * blockDim.x where blockDim is a power-of-2 from __launch_bounds__), the multiply is replaced with a left shift. The pattern matcher checks sub_BCAC40 (hasOneUse) and sub_10A0620 (createShl replacement).
Bswap + BFE fusion (lines 3959--4007): Detects a byte-swap feeding into a bit-field extract and replaces with a direct byte read at the swapped offset. Common in endianness conversion code for shared memory operations.
ZExt/SExt elimination via intrinsic return range (sub_10CA790 path): When a ZExt or SExt extends the result of an NVVM intrinsic call, and the intrinsic's annotated return range fits entirely within the demanded bits, the extension is eliminated. This fires frequently for threadIdx.x reads extended to i64 for address calculations.
BitCast-through-ZExt folding (sub_11A1430 at 0x11A2360): When a BitCast's source is a ZExt and the demanded bits fit within the original narrow type, the bitcast+zext chain collapses to the original value. Common in CUDA address calculations involving zero-extension followed by pointer reinterpretation.
SelectionDAG computeKnownBits
The DAG-level known-bits analysis at sub_33D4EF0 (114 KB, 3,286 lines) mirrors the IR-level analysis but operates on SDNode opcodes. It handles 112 opcode cases organized into 14 groups.
NVPTX Target Node Known Bits
For NVPTX-specific DAG opcodes (above ISD::BUILTIN_OP_END = 499), the function delegates to NVPTXTargetLowering::computeKnownBitsForTargetNode via vtable slot 254 at offset 2032. The key NVPTX-specific cases:
| Opcode Range | NVPTX DAG Node | Known-Bits Behavior |
|---|---|---|
| 0x152--0x161 (338--353) | TEX, SULD, surface ops | Result width known: bits above element size set to zero |
| 0x12A (298) | LoadV2, LoadParam | Extension mode from flags byte bits[2:3]: zext/sext/none |
| 0x16A, 0x16C (362, 364) | StoreParam, StoreRetval | When flags bits[2:3] == 0b11: element type width known |
| 0x175 (373) | ConstantPool | Uses ConstantRange::fromKnownBits intersection |
| 0xCA (202) | INTRINSIC_WO_CHAIN | Boolean-like: bit 0 unknown, bits [1..width] known zero |
| >= 499 | All target-specific | Delegates to vtable[254] computeKnownBitsForTargetNode |
The DAG-level analysis uses the same recursion depth cap of 6 (a6 > 5 returns all-unknown), matching LLVM's MaxAnalysisRecursionDepth.
Texture/Surface Fetch Result Width
Cases 0x152--0x161 encode the known bit-width of texture and surface fetch results. For an 8-bit texture fetch zero-extended to i32, the analysis sets bits [8, 31] as known-zero in the result. This enables downstream shift and mask elimination in texture sampling code.
KnownBits Data Structure Layout
Both the IR-level and DAG-level implementations use the same 32-byte struct:
```cpp
struct KnownBits {            // 32 bytes total
  union {
    uint64_t val;             // +0x00: inline storage (width <= 64)
    uint64_t *ptr;            // +0x00: heap pointer (width > 64)
  } knownZero;
  uint32_t knownZero_width;   // +0x08: bit-width
  uint32_t _pad0;             // +0x0C: padding
  union {
    uint64_t val;             // +0x10: inline storage (width <= 64)
    uint64_t *ptr;            // +0x10: heap pointer (width > 64)
  } knownOne;
  uint32_t knownOne_width;    // +0x18: bit-width
  uint32_t _pad1;             // +0x1C: padding
};
// Invariant:  (knownZero & knownOne) == 0   (no bit both 0 and 1)
// Threshold:  width > 64 triggers heap allocation via sub_C43690
```
Roughly 43% of sub_11A1430's binary size consists of APInt destructor sequences (cmp [rbp+var], 0x40; jbe skip; call free) for the width > 64 cleanup paths.
Configuration
| Knob | Source | Default | Effect |
|---|---|---|---|
| nvvm-intr-range-sm | ctor_359 | Current target SM | SM variant used to compute special register ranges for nvvm-intr-range pass |
| scev-cgp-tid-max-value | ctor_XXX | Architecture limit | Maximum value of thread ID used in SCEV-based CodeGenPrep address calculations |
| nv-remat-threshold-for-spec-reg | unk_4FD3860 | 20 | Threshold controlling when special register reads are rematerialized instead of spilled (interacts with known-bits because remat preserves range metadata) |
| qword_4F90C28 | internal debug flag | 0 (disabled) | Enables cross-validation abort: runs independent reference computeKnownBits (sub_9AC330) and aborts if results disagree with merged analysis |
| Max recursion depth | hardcoded | 6 | Matches LLVM's MaxAnalysisRecursionDepth; checked in sub_11AE940 |
| APInt inline threshold | hardcoded | 64 bits | Values <= 64 bits use inline uint64 storage; wider values heap-allocate |
Diagnostic Strings
The merged analysis emits the following diagnostics (only in debug/assert builds when qword_4F90C28 is set):
| String | Location | Trigger |
|---|---|---|
| "computeKnownBits(): " | sub_11A7600 line ~2204 | Cross-validation mismatch: prints the reference implementation's result |
| "SimplifyDemandedBits(): " | sub_11A7600 line ~2208 | Cross-validation mismatch: prints the merged analysis result |
| "Mismatched known bits for <inst> in <func>" | sub_11A7600 line ~2200 | Precedes the two values above; followed by abort() |
The nvvm-intr-range pass emits:
| String | Location |
|---|---|
| "Add !range metadata to NVVM intrinsics." | sub_216F4B0 (pass registration) |
NVVM IR Node Layout
The KnownBits analysis traverses IR nodes using cicc's internal representation. Each node is 32 bytes:
```cpp
struct IRNode {           // 32 bytes (0x20)
  uint8_t opcode;         // +0x00: single-byte opcode tag (ASCII-based)
  uint8_t flags;          // +0x01: bit 1, bit 2 = nsw/nuw flags
  uint16_t _reserved;     // +0x02
  uint32_t operand_idx;   // +0x04: 27-bit operand index + 5-bit flags
                          //        byte 7 bit 6 (0x40) = use-list vs indexed
  // ... remaining 24 bytes: use-list pointers, type info, metadata
};
// Operand resolution:
//   If byte[7] & 0x40 (use-list flag set):
//     operand = *(node - 8) -> *(ptr + 0x20)
//   If byte[7] & 0x40 == 0 (indexed):
//     idx = (node[4..7] & 0x7FFFFFF)
//     operand = node - (idx << 5)   // 27-bit index * 32 bytes
```
The 27-bit index allows up to 134 million nodes (4 GB theoretical IR size).
Function Map
IR-Level Known-Bits
| Function | Address | Size |
|---|---|---|
| computeKnownBitsAndSimplify -- merged main analysis | sub_11A7600 | 127 KB |
| SimplifyDemandedBitsHelper -- binary arithmetic subset | sub_11A1430 | 6.3 KB |
| Per-operand demand propagation trampoline (depth check) | sub_11AE940 | varies |
| SimplifyDemandedBits entry wrapper (allocates APInts) | sub_11AE870 | thin |
| SimplifyDemandedBits result caching (hash table at IC+2064) | sub_11AE3E0 | 235 lines |
| computeKnownBitsFromOperator / PHI merge | sub_11A3F30 | 50 KB |
| computeKnownBitsFromAssume (processes @llvm.assume) | sub_11A6910 | 12.5 KB |
| computeKnownBitsFromRangeMetadata (reads !range) | sub_11A68C0 | varies |
| Generic computeKnownBits (fallback, no simplification) | sub_9AC0E0 | varies |
| Reference computeKnownBits (debug cross-validation only) | sub_9AC330 | varies |
| NVIDIA post-analysis fixup (alignment + range refinement) | sub_99B5E0 | varies |
| NVIDIA intrinsic known-bits oracle (special registers) | sub_F0C4B0 | varies |
| isNVVMFunction check (NVIDIA-specific flag) | sub_F0C3D0 | varies |
| Intrinsic return range analysis (computes [lo, hi]) | sub_10CA790 | 11.2 KB |
| Extract return range bounds from range analysis result | sub_11A1390 | varies |
| getPointerAlignmentBits (alignment-derived known zeros) | sub_BD5420 | varies |
| isDemandedBitsFullyKnown (demand subset-of known) | sub_10024C0 | varies |
| NVVMIntrRange pass -- attaches !range metadata | sub_216F4B0 | varies |
SelectionDAG-Level Known-Bits
| Function | Address | Size |
|---|---|---|
| SelectionDAG::computeKnownBits (recursive, 112 opcode cases) | sub_33D4EF0 | 114 KB |
| Creates all-demanded mask, delegates to sub_33D4EF0 | sub_33DD090 | wrapper |
| computeMinLeadingZeros (calls sub_33D25A0 + returns) | sub_33D4D80 | wrapper |
| computeNumSignBits (parallel switch structure) | sub_33D25A0 | 49 KB |
| computeOverflowForAdd / computeOverflowForSub | sub_33DCF10 | varies |
KnownBits Arithmetic Helpers
| Function | Address |
|---|---|
| KnownBits::computeForMul(result, nuw, nsw, kb0, kb1) | sub_C70430 |
| KnownBits::add(a, b, nsw, nuw, carry) | sub_C74E10 |
| KnownBits::sub(a, b, nsw, nuw) | sub_C75B70 |
| KnownBits::computeForAddSub(isSub, nsw, nuw, a, b) | sub_C76560 |
| KnownBits::shl(a, shamt) | sub_C73220 |
| KnownBits::lshr(a, b) | sub_C738B0 |
| KnownBits::ashr(a, b) | sub_C73E40 |
| KnownBits::and(a, b, commutative) | sub_C787D0 |
| KnownBits::or(a, b) | sub_C78F20 |
| KnownBits::xor(a, b) | sub_C790F0 |
| KnownBits::mergeForPHI / smax(a, b) | sub_C79480 |
| KnownBits::truncate / smulh(a, b) | sub_C7B4D0 |
| KnownBits::cttz(a, shift) | sub_C7BCF0 |
| KnownBits::ctpop(a) | sub_C7BD50 |
| KnownBits::bswap(a) | sub_C7BDB0 |
| KnownBits::abs(a, known_shift) | sub_C746C0 |
| KnownBits::umin(a, b) | sub_C740A0 |
| KnownBits::umax(a, b) | sub_C74180 |
| KnownBits::ctlz(a, poisonAtZero) | sub_C778B0 |
APInt Utilities
| Function | Address |
|---|---|
| APInt(width, 0) -- zero-init constructor (heap for width > 64) | sub_C43690 |
| APInt copy constructor | sub_C43780 |
| APInt::operator&= | sub_C43B90 |
| APInt::operator\|= | sub_C43BD0 |
| APInt::setBits(lo, hi) | sub_C43C90 |
| APInt::flipAllBits | sub_C43D10 |
| APInt::trunc(width) | sub_C44740 |
| APInt::zext(width) | sub_C449B0 |
| APInt::sext(width) | sub_C44830 |
| APInt::countTrailingZeros | sub_C44590 |
| APInt::countLeadingZeros | sub_C444A0 |
| APInt::countPopulation | sub_C44630 |
| APInt::isSubsetOf(other) | sub_C446F0 |
| APInt::reverseBits / byteSwap | sub_C44AB0 |
| ConstantInt::get(type, APInt) -- creates constant replacement | sub_AD6220 |
| ConstantInt::get(type, value, isSigned) | sub_AD64C0 |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Analysis architecture | Separate computeKnownBits (ValueTracking) and SimplifyDemandedBits (InstCombine) | Fused into single 127 KB function (sub_11A7600) that simultaneously computes bitmasks and simplifies instructions |
| GPU register ranges | No special register concept; all values have full-width range | Dedicated oracle (sub_F0C4B0) provides known-zero bits for %tid, %ntid, %ctaid, %warpsize, %laneid, and 10+ PTX special registers |
| Range metadata injection | No equivalent pass; range info comes from profile data or programmer annotations | nvvm-intr-range pass (sub_216F4B0) attaches !range metadata to every special-register read; tightened by __launch_bounds__ |
| Warp size | Not a concept; no constant is known | %warpsize is statically known to be exactly 32 (known-zero bits [0,4] and [6,31], bit 5 = 1) |
| Cross-validation | No cross-validation in release builds | Debug flag qword_4F90C28 enables abort-on-mismatch between computeKnownBits and SimplifyDemandedBits results |
| SelectionDAG integration | Separate DAG-level computeKnownBits (~60 KB) | Extended DAG-level version at sub_33D4EF0 (114 KB, 3,286 lines) with GPU-specific value tracking |
| Max recursion depth | 6 (configurable) | Same default 6, checked in sub_11AE940 with identical semantics |
Cross-References
- InstCombine -- The primary consumer of the KnownBits analysis; `sub_11AE870` is called from the binary operator visitor's Phase 0
- SelectionDAG -- DAG-level known-bits at `sub_33D4EF0` feeds into DAGCombine and instruction selection pattern matching
- Loop Strength Reduction -- LSR interacts with shared-memory known-bits through the `lsr-no-ptr-address-space3` knob that disables LSR for 32-bit shared memory pointers
- GVN -- `sub_9AC330` (reference computeKnownBits) is also called from GVN to validate value numbering decisions
- LICM -- Loop-invariant code motion uses known-bits to prove that hoisted expressions are safe (no integer overflow when known-bits constrain the range)