KnownBits & DemandedBits for GPU

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

NVIDIA's KnownBits and DemandedBits infrastructure in cicc v13.0 diverges from upstream LLVM in three structural ways. First, the two analyses are fused into a single 127 KB function (sub_11A7600) that simultaneously computes known-zero/known-one bitmasks and simplifies instructions whose demanded bits allow constant folding or narrowing -- upstream LLVM separates computeKnownBits (in ValueTracking) from SimplifyDemandedBits (in InstCombine). Second, a dedicated GPU-specific known-bits oracle (sub_F0C4B0) provides range constraints for NVIDIA special registers (%tid, %ntid, %ctaid, %nctaid, %warpsize, %laneid) that have no CPU equivalent. Third, an early NVVM pipeline pass (nvvm-intr-range at sub_216F4B0) attaches !range metadata to every special-register read intrinsic, giving downstream analyses the same bounded-range information that CPU targets only get from profile data or programmer assertions. Together these form the primary dataflow backbone for address calculation optimization, type narrowing, and dead-bit elimination in GPU kernels.

Merged computeKnownBits + SimplifyDemandedBits: sub_11A7600 (0x11A7600, 127 KB, 4,156 lines)
Secondary SimplifyDemandedBits helper: sub_11A1430 (0x11A1430, 6.3 KB, 6 opcodes)
Per-operand demand propagation trampoline: sub_11AE940 (0x11AE940)
Generic computeKnownBits (reference): sub_9AC0E0 (fallback for unhandled opcodes)
Debug-only reference computeKnownBits: sub_9AC330 (cross-validation oracle)
computeKnownBitsFromOperator: sub_11A3F30 (0x11A3F30, 50 KB)
computeKnownBitsFromAssume: sub_11A6910 (0x11A6910, 12.5 KB)
computeKnownBitsFromRangeMetadata: sub_11A68C0
Post-analysis NVIDIA fixup: sub_99B5E0 (alignment + range refinement)
NVIDIA intrinsic known-bits oracle: sub_F0C4B0 (special register ranges)
Intrinsic return range analysis: sub_10CA790 + sub_11A1390
NVVMIntrRange pass: sub_216F4B0 (nvvm-intr-range)
SelectionDAG computeKnownBits: sub_33D4EF0 (0x33D4EF0, 114 KB, 3,286 lines)
Pointer alignment known-bits: sub_BD5420 (getPointerAlignmentBits)
Debug cross-validation flag: qword_4F90C28 (enables abort-on-mismatch)
Max recursion depth: 6 (checked in sub_11AE940)

GPU-Specific Known-Bits Sources

The key difference from CPU targets: GPU code has dozens of values with statically knowable ranges that never exist on a CPU. Every CUDA thread reads its identity from special registers whose values are bounded by hardware launch parameters. NVIDIA exploits this in two places: the nvvm-intr-range pass adds !range metadata at the IR level, and the target-specific known-bits oracle sub_F0C4B0 provides bitmask information directly to computeKnownBits.

Special Register Range Table

The following ranges apply to every NVVM intrinsic that reads a PTX special register. The !range metadata attached by nvvm-intr-range (sub_216F4B0) encodes [lo, hi) as an LLVM MDNode. The known-bits column shows which bits are guaranteed zero given the maximum value.

Register | PTX | NVVM Intrinsic ID Range | Value Range | i32 Known Zero (upper bits)
%tid.x/y/z | %tid.x | 350--352 | [0, maxntid-1] | bits [ceil(log2(maxntid)), 31]
%ntid.x/y/z | %ntid.x | 353--355 | [1, 1024] | bits [11, 31] (at most 1024)
%ctaid.x/y/z | %ctaid.x | 356--358 | [0, gridDim-1] | bits [ceil(log2(gridDim)), 31]
%nctaid.x/y/z | %nctaid.x | 359--361 | [1, 2^31-1] | bit 31 (always non-negative)
%warpsize | %WARP_SZ | ~370 | {32} (constant) | bits [0,4] = 00000, bit 5 = 1, bits [6,31] = 0
%laneid | %laneid | ~371 | [0, 31] | bits [5, 31]
%warpid | %warpid | ~372 | [0, maxWarpsPerSM-1] | SM-dependent upper bits
%smid | %smid | ~375 | [0, numSMs-1] | architecture-dependent
%nsmid | %nsmid | ~376 | [1, numSMs] | architecture-dependent
%gridid | %gridid | ~378 | [0, 2^32-1] | none (full range)
%clock | %clock | ~380 | [0, 2^32-1] | none
%lanemask_eq/lt/le/gt/ge | %lanemask_* | ~382--386 | [0, 2^32-1] | none

When __launch_bounds__(maxThreadsPerBlock, minBlocksPerMP) is present on a kernel, nvvm-intr-range tightens the %tid ranges to [0, maxThreadsPerBlock-1] and %ntid to [1, maxThreadsPerBlock]. Similarly, nvvm.reqntid metadata (from __launch_bounds__ with exact dimensions or reqntid pragmas) can constrain each dimension independently to an exact value.

The knob nvvm-intr-range-sm (constructor ctor_359) selects the SM variant used to determine architectural limits for registers like %warpid, %smid, and %nsmid.

Address Space Known Bits

CUDA uses separate address spaces with distinct pointer bit-widths and alignment properties. These feed directly into sub_BD5420 (getPointerAlignmentBits), which ORs alignment-derived known-zero low bits into the KnownBits result for any pointer-typed value:

Address Space | PTX | Pointer Width | Known Alignment | Known Bits Effect
0 (generic) | default | 64 bits | none guaranteed | pointer alignment only
1 (global) | .global | 64 bits | >= 16 bytes (typical) | low 4 bits often known-zero
3 (shared) | .shared | 32 bits | >= 4 bytes (minimum) | low 2 bits known-zero, bits [32,63] irrelevant
4 (constant) | .const | 64 bits | >= 4 bytes | low 2 bits known-zero
5 (local) | .local | 32 bits | >= 4 bytes (stack) | low 2 bits known-zero, bits [32,63] irrelevant

The 32-bit address spaces (shared and local) are critical: any value known to be a shared-memory pointer has bits [32, 63] entirely dead. The DemandedBits analysis exploits this to eliminate zero-extensions and truncations around shared-memory address calculations, keeping everything in 32-bit arithmetic.

Launch Parameter Integration

The __launch_bounds__ attribute, __maxnreg__ pragma, and nvvm.reqntid / nvvm.maxntid metadata all flow into the known-bits infrastructure:

  1. nvvm-intr-range pass (sub_216F4B0): Runs early in the pipeline. Reads kernel metadata (nvvm.reqntid, nvvm.maxntid) via sub_93AE30. Attaches !range metadata to every llvm.nvvm.read.ptx.sreg.* intrinsic call. The metadata format is !{i32 lo, i32 hi} where hi is exclusive.

  2. computeKnownBitsFromRangeMetadata (sub_11A68C0): Called during standard computeKnownBits traversal. Reads !range metadata from any value and derives known-zero/known-one masks. For a range [0, 1024), this yields knownZero = 0xFFFFFC00 (bits 10--31 known zero).

  3. Intrinsic return range analysis (sub_10CA790 + sub_11A1390): A separate path used when the merged computeKnownBits+SimplifyDemandedBits processes ZExt/SExt of intrinsic calls. Computes [lo, hi] bounds for the intrinsic's return value and checks whether the extension can be eliminated because the return range fits within the demanded bits.

The Merged Analysis: Algorithm and Pseudocode

Unlike upstream LLVM where InstCombiner::SimplifyDemandedBits calls computeKnownBits as a subroutine, cicc fuses them. The entry point sub_11AE870 wraps sub_11AE3E0, which calls the core sub_11A7600. A hash table at InstCombiner + 2064 tracks visited instructions to prevent infinite recursion.

Core Algorithm

// sub_11A7600 — merged computeKnownBits + SimplifyDemandedBits
// Returns: replacement instruction pointer, or NULL if no simplification
Instruction* computeKnownBitsAndSimplify(
    AnalysisCtx    *ctx,        // a1 — holds IR module, pass info
    IRNode         *inst,       // a2 — instruction to analyze
    APInt          *demanded,   // a3 — which output bits the consumer needs
    KnownBits      *result,     // a4 — output {knownZero, knownOne}
    unsigned        depth,      // a5 — recursion depth (checked in caller)
    QueryState     *state       // a6 — worklist context
) {
    uint8_t opcode = inst->opcode_tag;   // single-byte opcode at offset 0
    unsigned width = demanded->getBitWidth();

    // Stack-allocate 4 APInt accumulators for operand known bits
    APInt kz0(width, 0), ko0(width, 0);  // operand 0
    APInt kz1(width, 0), ko1(width, 0);  // operand 1

    switch (opcode) {
    case '*': // Mul — lines 654-1037
        // Pattern: if one operand is known power-of-2 from intrinsic call,
        //          replace Mul with Shl (critical for threadIdx * stride)
        if (auto *rhs = matchConstantPow2Call(inst->getOperand(1))) {
            if (inst->getOperand(0)->hasOneUse())
                return createShl(inst->getOperand(0), log2(rhs));
        }
        // Generic: narrow demanded mask by leading zeros, propagate to operands
        unsigned effectiveBits = width - demanded->countLeadingZeros();
        APInt narrowDemand = APInt::getLowBitsSet(width, effectiveBits);
        propagateDemandToOperand(ctx, inst, 0, narrowDemand, &kz0, &ko0, depth+1, state);
        propagateDemandToOperand(ctx, inst, 1, narrowDemand, &kz1, &ko1, depth+1, state);
        KnownBits::computeForMul(result, {kz0,ko0}, {kz1,ko1}, inst->hasNUW(), inst->hasNSW());
        break;

    case '6': // ZExt — lines 1677-1919
        // Check if source is intrinsic call with known return range
        if (auto range = getIntrinsicReturnRange(inst->getOperand(0))) {
            if (range.fitsBitWidth(demanded->getActiveBits()))
                return inst->getOperand(0);  // eliminate extension
        }
        // Standard: shift demanded bits down, propagate to source, zext result
        propagateDemandToOperand(ctx, inst, 0, demanded->trunc(srcWidth), ...);
        KnownBits::zext(result, srcWidth);
        break;

    case 'U': // NVIDIA Intrinsic — lines 3521-4085
        unsigned intrinsicID = getIntrinsicID(inst);
        switch (intrinsicID) {
        case 0x0F: handleBFE_BFI(inst, demanded, result); break;
        case 0x42: handlePopcount(inst, demanded, result); break;
        case 0x01: handleAbs(inst, demanded, result);      break;
        case 0xB4: handleFSHL(inst, demanded, result);     break;
        case 0xB5: handleFSHR(inst, demanded, result);     break;
        case 0x12B: handleBswap(inst, demanded, result);   break;
        default:
            // Fall through to NVIDIA intrinsic known-bits oracle
            sub_F0C4B0(inst, result, depth, state);
            break;
        }
        break;

    // ... 13 more opcode cases (Add, Sub, Xor, PHI, Trunc, SExt, etc.)

    default:
        sub_9AC0E0(inst, result, depth, state);  // generic fallback
        break;
    }

    // POST-ANALYSIS REFINEMENT (lines 2134-2281)
    // 1. Pointer alignment: if type is pointer, OR alignment bits into knownZero
    if (inst->getType()->isPointerTy()) {
        unsigned alignBits = getPointerAlignmentBits(inst);  // sub_BD5420
        result->knownZero |= APInt::getLowBitsSet(width, alignBits);
    }

    // 2. Debug cross-validation (when qword_4F90C28 is set)
    if (DEBUG_FLAG) {
        KnownBits reference;
        sub_9AC330(inst, &reference, depth, state);  // independent computation
        if (reference != *result) {
            print("computeKnownBits(): ", reference);
            print("SimplifyDemandedBits(): ", *result);
            abort();
        }
    }

    // 3. Demand-covers-known check: can we replace with constant?
    if (demanded->isSubsetOf(result->knownZero | result->knownOne))
        return ConstantInt::get(inst->getType(), result->knownOne);

    return nullptr;
}

Demand Propagation Per Operand

The trampoline sub_11AE940 is the per-operand demand propagation entry point. It increments depth, checks the depth limit (depth > 6 returns all-unknown), and dispatches between the big handler (sub_11A7600) and the binary-arithmetic-specific helper (sub_11A1430) based on opcode class:

// sub_11AE940 — per-operand demand propagation trampoline
Instruction* propagateDemandToOperand(
    AnalysisCtx *ctx, IRNode *parent, unsigned opIdx,
    APInt *demand, KnownBits *out, unsigned depth, QueryState *state
) {
    if (depth > 6)
        return nullptr;  // MaxAnalysisRecursionDepth reached

    IRNode *operand = parent->getOperand(opIdx);
    uint8_t opcode = operand->opcode_tag;

    // Binary arithmetic subset goes to the helper
    if (opcode == '*' || opcode == '9' || opcode == ':' ||
        opcode == ';' || opcode == ',' || opcode == '8')
        return sub_11A1430(ctx, operand, demand, out, depth, state);

    // Everything else goes to the big merged handler
    return sub_11A7600(ctx, operand, demand, out, depth, state);
}

The secondary helper sub_11A1430 handles Add/Sub/Xor/Mul/BitCast/ExtractElement with a tighter structure: it uses a four-accumulator cascade with three successive isSubsetOf checks per operation, which is more aggressive than upstream LLVM's single post-merge check.

The Four-Accumulator Cascade

For binary operators (Add, Sub, Xor), cicc maintains four APInt accumulators (two per operand) and performs a three-tier check:

// Three-tier demand satisfaction check (sub_11A1430 pattern)
// More aggressive than upstream single-check approach
KnownBits kb0, kb1;
computeKnownBits(op0, &kb0, depth+1, state);
computeKnownBits(op1, &kb1, depth+1, state);
KnownBits merged = mergeForOpcode(kb0, kb1, opcode);
sub_99B5E0(inst, &merged, depth, state);  // NVIDIA post-fixup

// Check 1: merged result covers demand?
if (demanded.isSubsetOf(merged.knownZero | merged.knownOne))
    return ConstantInt::get(merged.knownOne);

// Check 2: union of operand known-bits covers demand?
if (demanded.isSubsetOf((kb0.knownZero | kb1.knownZero) |
                        (kb0.knownOne  | kb1.knownOne)))
    return ConstantInt::get(...);

// Check 3: all accumulated zero|one covers demand?
if (demanded.isSubsetOf(allAccumulatedZero | allAccumulatedOne))
    return followUseDef(...);

The post-analysis fixup sub_99B5E0 is NVIDIA-specific and does not exist in upstream LLVM. It applies additional refinements from thread index range constraints, warp-level uniformity, and shared memory alignment guarantees.

DemandedBits for GPU: Narrowing Optimizations

The DemandedBits analysis is the backward complement to KnownBits' forward analysis. When a consumer only needs the low N bits of a value, the producer can be narrowed or eliminated. On GPU, this interaction is dramatically more productive than on CPU because of three factors:

  1. 32-bit address spaces: Shared memory (AS 3) and local memory (AS 5) use 32-bit pointers. When address calculations are performed in i64 (as the generic address space requires), the upper 32 bits are entirely undemanded for shared/local accesses. DemandedBits proves this and enables truncation to i32.

  2. Bounded thread indices: threadIdx.x * stride + offset patterns produce values that fit in far fewer bits than i32. If threadIdx.x < 256 (from __launch_bounds__) and stride < 4096, the product fits in 20 bits. DemandedBits propagates this, enabling downstream shifts and masks to operate on narrower types.

  3. Type demotion to i16/fp16: When DemandedBits proves only the low 16 bits of an i32 computation matter, cicc can demote to 16-bit operations. The function at sub_1185740 (InstCombine's visitTrunc) inserts narrowing truncations. This is particularly valuable for texture coordinate calculations and index arithmetic in tensor core operations.

Dead Bit Elimination

The core optimization check appears approximately 15 times across the analysis functions:

// Inline version (width <= 64):
uint64_t unknown = ~(knownZero | knownOne);
if ((demanded & unknown) == 0) {
    // All demanded bits are determined -> replace with constant
    return ConstantInt::get(type, knownOne);
}

// Wide version (width > 64):
if (demanded.isSubsetOf(knownZero | knownOne)) {
    return ConstantInt::get(type, knownOne);  // sub_AD6220
}

This is the heart of the analysis: backward-propagated demand meets forward-propagated known-bits. When they cover every bit the consumer needs, the entire instruction is dead and can be replaced with a compile-time constant.

GPU Patterns Enabled by Known Bits

The following simplifications are GPU-specific and do not have CPU equivalents:

Mul to Shl for threadIdx arithmetic (lines 714--861): When both operands of a multiply originate from intrinsic calls with known power-of-2 returns (e.g., threadIdx.x * blockDim.x where blockDim is a power-of-2 from __launch_bounds__), the multiply is replaced with a left shift. The pattern matcher checks sub_BCAC40 (hasOneUse) and sub_10A0620 (createShl replacement).

Bswap + BFE fusion (lines 3959--4007): Detects a byte-swap feeding into a bit-field extract and replaces with a direct byte read at the swapped offset. Common in endianness conversion code for shared memory operations.

ZExt/SExt elimination via intrinsic return range (sub_10CA790 path): When a ZExt or SExt extends the result of an NVVM intrinsic call, and the intrinsic's annotated return range fits entirely within the demanded bits, the extension is eliminated. This fires frequently for threadIdx.x reads extended to i64 for address calculations.

BitCast-through-ZExt folding (sub_11A1430 at 0x11A2360): When a BitCast's source is a ZExt and the demanded bits fit within the original narrow type, the bitcast+zext chain collapses to the original value. Common in CUDA address calculations involving zero-extension followed by pointer reinterpretation.

SelectionDAG computeKnownBits

The DAG-level known-bits analysis at sub_33D4EF0 (114 KB, 3,286 lines) mirrors the IR-level analysis but operates on SDNode opcodes. It handles 112 opcode cases organized into 14 groups.

NVPTX Target Node Known Bits

For NVPTX-specific DAG opcodes (above ISD::BUILTIN_OP_END = 499), the function delegates to NVPTXTargetLowering::computeKnownBitsForTargetNode via vtable slot 254 at offset 2032. The key NVPTX-specific cases:

Opcode Range | NVPTX DAG Node | Known-Bits Behavior
0x152--0x161 (338--353) | TEX, SULD, surface ops | Result width known: bits above element size set to zero
0x12A (298) | LoadV2, LoadParam | Extension mode from flags byte bits [2:3]: zext/sext/none
0x16A, 0x16C (362, 364) | StoreParam, StoreRetval | When flags bits [2:3] == 0b11: element type width known
0x175 (373) | ConstantPool | Uses ConstantRange::fromKnownBits intersection
0xCA (202) | INTRINSIC_WO_CHAIN | Boolean-like: bit 0 unknown, bits [1..width] known zero
>= 499 | All target-specific | Delegates to vtable[254] computeKnownBitsForTargetNode

The DAG-level analysis uses the same recursion depth cap of 6 (a6 > 5 returns all-unknown), matching LLVM's MaxRecursionDepth.

Texture/Surface Fetch Result Width

Cases 0x152--0x161 encode the known bit-width of texture and surface fetch results. For an 8-bit texture fetch zero-extended to i32, the analysis sets bits [8, 31] as known-zero in the result. This enables downstream shift and mask elimination in texture sampling code.

KnownBits Data Structure Layout

Both the IR-level and DAG-level implementations use the same 32-byte struct:

struct KnownBits {                      // 32 bytes total
    union {
        uint64_t  val;                  // +0x00: inline storage (width <= 64)
        uint64_t *ptr;                  // +0x00: heap pointer  (width > 64)
    } knownZero;
    uint32_t knownZero_width;           // +0x08: bit-width
    uint32_t _pad0;                     // +0x0C: padding
    union {
        uint64_t  val;                  // +0x10: inline storage (width <= 64)
        uint64_t *ptr;                  // +0x10: heap pointer  (width > 64)
    } knownOne;
    uint32_t knownOne_width;            // +0x18: bit-width
    uint32_t _pad1;                     // +0x1C: padding
};

// Invariant: (knownZero & knownOne) == 0   (no bit both 0 and 1)
// Threshold: width > 64 triggers heap allocation via sub_C43690

Roughly 43% of sub_11A1430's binary size consists of APInt destructor sequences (cmp [rbp+var], 0x40; jbe skip; call free) for the width > 64 cleanup paths.

Configuration

Knob | Source | Default | Effect
nvvm-intr-range-sm | ctor_359 | Current target SM | SM variant used to compute special register ranges for nvvm-intr-range pass
scev-cgp-tid-max-value | ctor_XXX | Architecture limit | Maximum value of thread ID used in SCEV-based CodeGenPrep address calculations
nv-remat-threshold-for-spec-reg | unk_4FD3860 | 20 | Threshold controlling when special register reads are rematerialized instead of spilled (interacts with known-bits because remat preserves range metadata)
qword_4F90C28 | internal debug flag | 0 (disabled) | Enables cross-validation abort: runs independent reference computeKnownBits (sub_9AC330) and aborts if results disagree with merged analysis
Max recursion depth | hardcoded | 6 | Matches LLVM's MaxAnalysisRecursionDepth; checked in sub_11AE940
APInt inline threshold | hardcoded | 64 bits | Values <= 64 bits use inline uint64 storage; wider values heap-allocate

Diagnostic Strings

The merged analysis emits the following diagnostics (only in debug/assert builds when qword_4F90C28 is set):

String | Location | Trigger
"computeKnownBits(): " | sub_11A7600 line ~2204 | Cross-validation mismatch: prints the reference implementation's result
"SimplifyDemandedBits(): " | sub_11A7600 line ~2208 | Cross-validation mismatch: prints the merged analysis result
"Mismatched known bits for <inst> in <func>" | sub_11A7600 line ~2200 | Precedes the two values above; followed by abort()

The nvvm-intr-range pass emits:

String | Location
"Add !range metadata to NVVM intrinsics." | sub_216F4B0 (pass registration)

NVVM IR Node Layout

The KnownBits analysis traverses IR nodes using cicc's internal representation. Each node is 32 bytes:

struct IRNode {             // 32 bytes (0x20)
    uint8_t  opcode;        // +0x00: single-byte opcode tag (ASCII-based)
    uint8_t  flags;         // +0x01: bit 1, bit 2 = nsw/nuw flags
    uint16_t _reserved;     // +0x02
    uint32_t operand_idx;   // +0x04: 27-bit operand index + 5-bit flags
                            //        byte 7 bit 6 (0x40) = use-list vs indexed
    // ... remaining 24 bytes: use-list pointers, type info, metadata
};

// Operand resolution:
// If byte[7] & 0x40 (use-list flag set):
//     operand = *(node - 8) -> *(ptr + 0x20)
// If byte[7] & 0x40 == 0 (indexed):
//     idx = (node[4..7] & 0x7FFFFFF)
//     operand = node - (idx << 5)    // 27-bit index * 32 bytes

The 27-bit index allows up to 134 million nodes (4 GB theoretical IR size).

Function Map

IR-Level Known-Bits

Function | Address | Size
computeKnownBitsAndSimplify -- merged main analysis | sub_11A7600 | 127 KB
SimplifyDemandedBitsHelper -- binary arithmetic subset | sub_11A1430 | 6.3 KB
Per-operand demand propagation trampoline (depth check) | sub_11AE940 | varies
SimplifyDemandedBits entry wrapper (allocates APInts) | sub_11AE870 | thin
SimplifyDemandedBits result caching (hash table at IC+2064) | sub_11AE3E0 | 235 lines
computeKnownBitsFromOperator / PHI merge | sub_11A3F30 | 50 KB
computeKnownBitsFromAssume (processes @llvm.assume) | sub_11A6910 | 12.5 KB
computeKnownBitsFromRangeMetadata (reads !range) | sub_11A68C0 | varies
Generic computeKnownBits (fallback, no simplification) | sub_9AC0E0 | varies
Reference computeKnownBits (debug cross-validation only) | sub_9AC330 | varies
NVIDIA post-analysis fixup (alignment + range refinement) | sub_99B5E0 | varies
NVIDIA intrinsic known-bits oracle (special registers) | sub_F0C4B0 | varies
isNVVMFunction check (NVIDIA-specific flag) | sub_F0C3D0 | varies
Intrinsic return range analysis (computes [lo, hi]) | sub_10CA790 | 11.2 KB
Extract return range bounds from range analysis result | sub_11A1390 | varies
getPointerAlignmentBits (alignment-derived known zeros) | sub_BD5420 | varies
isDemandedBitsFullyKnown (demand subset-of known) | sub_10024C0 | varies
NVVMIntrRange pass -- attaches !range metadata | sub_216F4B0 | varies

SelectionDAG-Level Known-Bits

Function | Address | Size
SelectionDAG::computeKnownBits (recursive, 112 opcode cases) | sub_33D4EF0 | 114 KB
Creates all-demanded mask, delegates to sub_33D4EF0 | sub_33DD090 | wrapper
computeMinLeadingZeros (calls sub_33D25A0 + returns) | sub_33D4D80 | wrapper
computeNumSignBits (parallel switch structure) | sub_33D25A0 | 49 KB
computeOverflowForAdd / computeOverflowForSub | sub_33DCF10 | varies

KnownBits Arithmetic Helpers

Function | Address
KnownBits::computeForMul(result, nuw, nsw, kb0, kb1) | sub_C70430
KnownBits::add(a, b, nsw, nuw, carry) | sub_C74E10
KnownBits::sub(a, b, nsw, nuw) | sub_C75B70
KnownBits::computeForAddSub(isSub, nsw, nuw, a, b) | sub_C76560
KnownBits::shl(a, shamt) | sub_C73220
KnownBits::lshr(a, b) | sub_C738B0
KnownBits::ashr(a, b) | sub_C73E40
KnownBits::and(a, b, commutative) | sub_C787D0
KnownBits::or(a, b) | sub_C78F20
KnownBits::xor(a, b) | sub_C790F0
KnownBits::mergeForPHI / smax(a, b) | sub_C79480
KnownBits::truncate / smulh(a, b) | sub_C7B4D0
KnownBits::cttz(a, shift) | sub_C7BCF0
KnownBits::ctpop(a) | sub_C7BD50
KnownBits::bswap(a) | sub_C7BDB0
KnownBits::abs(a, known_shift) | sub_C746C0
KnownBits::umin(a, b) | sub_C740A0
KnownBits::umax(a, b) | sub_C74180
KnownBits::ctlz(a, poisonAtZero) | sub_C778B0

APInt Utilities

Function | Address
APInt(width, 0) -- zero-init constructor (heap for width > 64) | sub_C43690
APInt copy constructor | sub_C43780
APInt::operator&= | sub_C43B90
APInt::operator|= | sub_C43BD0
APInt::setBits(lo, hi) | sub_C43C90
APInt::flipAllBits | sub_C43D10
APInt::trunc(width) | sub_C44740
APInt::zext(width) | sub_C449B0
APInt::sext(width) | sub_C44830
APInt::countTrailingZeros | sub_C44590
APInt::countLeadingZeros | sub_C444A0
APInt::countPopulation | sub_C44630
APInt::isSubsetOf(other) | sub_C446F0
APInt::reverseBits / byteSwap | sub_C44AB0
ConstantInt::get(type, APInt) -- creates constant replacement | sub_AD6220
ConstantInt::get(type, value, isSigned) | sub_AD64C0

Differences from Upstream LLVM

Aspect | Upstream LLVM | CICC v13.0
Analysis architecture | Separate computeKnownBits (ValueTracking) and SimplifyDemandedBits (InstCombine) | Fused into single 127 KB function (sub_11A7600) that simultaneously computes bitmasks and simplifies instructions
GPU register ranges | No special register concept; all values have full-width range | Dedicated oracle (sub_F0C4B0) provides known-zero bits for %tid, %ntid, %ctaid, %warpsize, %laneid, and 10+ PTX special registers
Range metadata injection | No equivalent pass; range info comes from profile data or programmer annotations | nvvm-intr-range pass (sub_216F4B0) attaches !range metadata to every special-register read; tightened by __launch_bounds__
Warp size | Not a concept; no constant is known | %warpsize is statically known to be exactly 32 (known-zero bits [0,4] and [6,31], bit 5 = 1)
Cross-validation | No cross-validation in release builds | Debug flag qword_4F90C28 enables abort-on-mismatch between computeKnownBits and SimplifyDemandedBits results
SelectionDAG integration | Separate DAG-level computeKnownBits (~60 KB) | Extended DAG-level version at sub_33D4EF0 (114 KB, 3,286 lines) with GPU-specific value tracking
Max recursion depth | 6 (configurable) | Same default 6, checked in sub_11AE940 with identical semantics

Cross-References

  • InstCombine -- The primary consumer of KnownBits analysis; sub_11AE870 is called from the binary operator visitor's Phase 0
  • SelectionDAG -- DAG-level known-bits at sub_33D4EF0 feeds into DAGCombine and instruction selection pattern matching
  • Loop Strength Reduction -- LSR interacts with shared-memory known-bits through the lsr-no-ptr-address-space3 knob that disables LSR for 32-bit shared memory pointers
  • GVN -- sub_9AC330 (reference computeKnownBits) is also called from GVN to validate value numbering decisions
  • LICM -- Loop-invariant code motion uses known-bits to prove that hoisted expressions are safe (no integer overflow when known-bits constrain the range)