FP128/I128 Emulation

No NVIDIA GPU in any SM generation has native 128-bit arithmetic hardware. Neither fp128 (IEEE 754 binary128) nor i128 (128-bit integer) operations can be lowered to PTX instructions directly. CICC handles this by replacing every fp128 and i128 operation in LLVM IR with a call to one of 48 distinct NVIDIA runtime library functions whose implementations live in a separate bitcode module. The pass at sub_1C8C170 walks each function in the module, inspects every instruction, dispatches on the LLVM opcode byte, and emits the appropriate __nv_* call in place of the original operation. This is a correctness-critical legalization pass -- if any fp128/i128 operation survives past it, instruction selection will abort because NVPTX has no patterns for 128-bit types.

The pass is structurally part of lower-ops (LowerOpsPass), NVIDIA's umbrella module pass for lowering operations that the NVPTX backend cannot handle natively. Within the lower-ops framework, sub_1C8C170 is the dedicated handler for 128-bit types. It runs as a module-level pass early in the pipeline, after libdevice linking and before the main optimization sequence, so that the generated calls can be inlined and optimized by subsequent passes.

| Property | Value |
|---|---|
| Entry point | sub_1C8C170 |
| Size | 25 KB (~960 lines decompiled) |
| Pass framework | Part of lower-ops / LowerOpsPass (module pass) |
| Registration | New PM slot 144 at sub_2342890; param enable-optimization |
| Runtime functions | 48 distinct __nv_* library calls |
| Upstream equivalent | None. Upstream LLVM lowers fp128 through SoftenFloat in type legalization. CICC replaces this with explicit call insertion at the IR level. |

Opcode Dispatch

The pass reads the LLVM instruction opcode from the byte at offset +16 of the instruction node and dispatches through a dense switch. The following table lists every handled opcode and the corresponding lowering action. All unlisted opcodes in the range 0x18--0x58 produce an early return (no 128-bit type involvement, or handled elsewhere).

| Opcode | LLVM Instruction | Lowering Target | Handler |
|---|---|---|---|
| 0x24 | fadd | __nv_add_fp128 | sub_1C8A5C0 |
| 0x26 | fsub | __nv_sub_fp128 | sub_1C8A5C0 |
| 0x28 | fmul | __nv_mul_fp128 | sub_1C8A5C0 |
| 0x29 | udiv | __nv_udiv128 | sub_1C8BD70 |
| 0x2A | sdiv | __nv_idiv128 | sub_1C8BD70 |
| 0x2B | fdiv | __nv_div_fp128 | sub_1C8A5C0 |
| 0x2C | urem | __nv_urem128 | sub_1C8BD70 |
| 0x2D | srem | __nv_irem128 | sub_1C8BD70 |
| 0x2E | frem | __nv_rem_fp128 | sub_1C8A5C0 |
| 0x36 | trunc/ext | Type-based conversion | sub_1C8ADC0 |
| 0x3F | fptoui | __nv_fp128_to_uint* or __nv_cvt_f*_u128_rz | sub_1C8ADC0 / sub_1C8BF90 |
| 0x40 | fptosi | __nv_fp128_to_int* or __nv_cvt_f*_i128_rz | sub_1C8ADC0 / sub_1C8BF90 |
| 0x41 | uitofp | __nv_uint*_to_fp128 or __nv_cvt_u128_f*_rn | sub_1C8ADC0 / sub_1C8BF90 |
| 0x42 | sitofp | __nv_int*_to_fp128 or __nv_cvt_i128_f*_rn | sub_1C8ADC0 / sub_1C8BF90 |
| 0x43 | fptrunc | __nv_fp128_to_float or __nv_fp128_to_double | sub_1C8ADC0 |
| 0x44 | fpext | __nv_float_to_fp128 or __nv_double_to_fp128 | sub_1C8ADC0 |
| 0x4C | fcmp | __nv_fcmp_* (predicate-selected) | dedicated |

Ignored opcode ranges: 0x18--0x23, 0x25, 0x27, 0x2F--0x35, 0x37--0x3E, 0x45--0x4B, 0x4D--0x58. Although 0x37 (store) falls in an ignored range, it receives a type check similar to the one for 0x36, applied to the store's target type.
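The dispatch can be pictured as a lookup keyed by the opcode byte. The Python sketch below is only a model of the table: the opcode values and runtime names come from this page, while `lowering_target`, `FIXED_TARGETS`, and `TYPE_DEPENDENT` are hypothetical names, and nothing here reflects how the decompiled switch is actually laid out. The real pass also checks operand types before lowering, which this model leaves out.

```python
# Illustrative model of the opcode dispatch; not the decompiled code.
# Opcode values and target names are taken from the opcode table.
FIXED_TARGETS = {
    0x24: "__nv_add_fp128",   # fadd
    0x26: "__nv_sub_fp128",   # fsub
    0x28: "__nv_mul_fp128",   # fmul
    0x29: "__nv_udiv128",     # udiv
    0x2A: "__nv_idiv128",     # sdiv
    0x2B: "__nv_div_fp128",   # fdiv
    0x2C: "__nv_urem128",     # urem
    0x2D: "__nv_irem128",     # srem
    0x2E: "__nv_rem_fp128",   # frem
}

# Opcodes whose target depends on operand types or on the fcmp predicate.
TYPE_DEPENDENT = {0x36, 0x3F, 0x40, 0x41, 0x42, 0x43, 0x44, 0x4C}


def lowering_target(opcode):
    """Return a fixed runtime name, 'type-dependent', or None (early return)."""
    if opcode in FIXED_TARGETS:
        return FIXED_TARGETS[opcode]
    if opcode in TYPE_DEPENDENT:
        return "type-dependent"
    return None  # ignored opcode: the instruction is left untouched


# Every opcode in 0x18..0x58 that the model does not handle.
ignored = [op for op in range(0x18, 0x59) if lowering_target(op) is None]
```

Enumerating `ignored` gives a quick cross-check that the handled opcodes and the ignored ranges listed above partition 0x18..0x58 without overlap.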

Library Function Inventory

FP128 Arithmetic (5 functions)

Binary operations on IEEE 754 binary128. Each takes two fp128 operands, returns fp128.

| Function | Operation | String Length |
|---|---|---|
| __nv_add_fp128 | fp128 addition | 14 |
| __nv_sub_fp128 | fp128 subtraction | 14 |
| __nv_mul_fp128 | fp128 multiplication | 14 |
| __nv_div_fp128 | fp128 division | 14 |
| __nv_rem_fp128 | fp128 remainder | 14 |

All five are lowered through sub_1C8A5C0, which constructs the call with a fixed string length of 14 characters.

I128 Division and Remainder (4 functions)

Integer division and remainder for i128. No native PTX instruction exists for 128-bit integer divide.

| Function | Operation | Signedness | String Length |
|---|---|---|---|
| __nv_udiv128 | i128 division | unsigned | 12 |
| __nv_idiv128 | i128 division | signed | 12 |
| __nv_urem128 | i128 remainder | unsigned | 12 |
| __nv_irem128 | i128 remainder | signed | 12 |

Lowered through sub_1C8BD70 with string length 12. Note: i128 add/sub/mul are NOT lowered here -- those can be decomposed into pairs of 64-bit operations by standard LLVM legalization. Only division and remainder require the runtime call path because they involve complex multi-word algorithms.
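To see why addition needs no runtime call, consider how a 128-bit add splits into two 64-bit adds joined by a carry. The sketch below is a generic illustration of that standard technique in Python, using masks to imitate 64-bit registers. It is not code from CICC or LLVM, and `split128` and `add128` are hypothetical names.

```python
MASK64 = (1 << 64) - 1


def split128(x):
    """Split an unsigned 128-bit value into (lo, hi) 64-bit halves."""
    return x & MASK64, (x >> 64) & MASK64


def add128(a, b):
    """Add two unsigned 128-bit values using only 64-bit adds and a carry."""
    a_lo, a_hi = split128(a)
    b_lo, b_hi = split128(b)
    lo = (a_lo + b_lo) & MASK64
    carry = 1 if lo < a_lo else 0          # low half wrapped around
    hi = (a_hi + b_hi + carry) & MASK64    # result wraps modulo 2**128
    return (hi << 64) | lo
```

Division has no comparably short decomposition: the quotient digits depend on each other, so a multi-word loop is needed, which is consistent with routing only division and remainder to library calls.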

FP128-to-Integer Conversions (10 functions)

Convert fp128 to integer types of various widths. The target width is determined by calling sub_1642F90 (the type-width check) and reading the type's bit-width field (type_id >> 8).

| Function | Conversion |
|---|---|
| __nv_fp128_to_uint8 | fp128 -> i8 (unsigned) |
| __nv_fp128_to_uint16 | fp128 -> i16 (unsigned) |
| __nv_fp128_to_uint32 | fp128 -> i32 (unsigned) |
| __nv_fp128_to_uint64 | fp128 -> i64 (unsigned) |
| __nv_fp128_to_uint128 | fp128 -> i128 (unsigned) |
| __nv_fp128_to_int8 | fp128 -> i8 (signed) |
| __nv_fp128_to_int16 | fp128 -> i16 (signed) |
| __nv_fp128_to_int32 | fp128 -> i32 (signed) |
| __nv_fp128_to_int64 | fp128 -> i64 (signed) |
| __nv_fp128_to_int128 | fp128 -> i128 (signed) |

Integer-to-FP128 Conversions (10 functions)

Convert integer types to fp128.

| Function | Conversion |
|---|---|
| __nv_uint8_to_fp128 | i8 (unsigned) -> fp128 |
| __nv_uint16_to_fp128 | i16 (unsigned) -> fp128 |
| __nv_uint32_to_fp128 | i32 (unsigned) -> fp128 |
| __nv_uint64_to_fp128 | i64 (unsigned) -> fp128 |
| __nv_uint128_to_fp128 | i128 (unsigned) -> fp128 |
| __nv_int8_to_fp128 | i8 (signed) -> fp128 |
| __nv_int16_to_fp128 | i16 (signed) -> fp128 |
| __nv_int32_to_fp128 | i32 (signed) -> fp128 |
| __nv_int64_to_fp128 | i64 (signed) -> fp128 |
| __nv_int128_to_fp128 | i128 (signed) -> fp128 |

String lengths for both fp128-to-integer and integer-to-fp128 conversions vary from 18 to 21 characters depending on the function name. Lowered through sub_1C8ADC0.
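The naming scheme is regular enough to generate, which makes the 18 to 21 character range easy to check. The following Python sketch rebuilds the twenty names from width and signedness. The helpers `fp128_to_int_name` and `int_to_fp128_name` are hypothetical and only mirror the names listed above; the pass itself appears to embed the strings as literals.

```python
WIDTHS = (8, 16, 32, 64, 128)


def fp128_to_int_name(width, signed):
    """Name of the fp128 -> integer routine for a given width/signedness."""
    kind = "int" if signed else "uint"
    return f"__nv_fp128_to_{kind}{width}"


def int_to_fp128_name(width, signed):
    """Name of the integer -> fp128 routine for a given width/signedness."""
    kind = "int" if signed else "uint"
    return f"__nv_{kind}{width}_to_fp128"


# All twenty conversion names, in both directions.
names = [
    build(width, signed)
    for build in (fp128_to_int_name, int_to_fp128_name)
    for signed in (False, True)
    for width in WIDTHS
]

# Distinct string lengths that a call-construction helper would need.
lengths = sorted({len(name) for name in names})
```

The shortest names are the signed 8-bit ones and the longest are the unsigned 128-bit ones, which accounts for the spread in string length.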

FP128-to-Float/Double Conversions (4 functions)

Truncation and extension between fp128 and the native floating-point types.

| Function | Conversion | Opcode |
|---|---|---|
| __nv_fp128_to_float | fp128 -> float | 0x43 (fptrunc) |
| __nv_fp128_to_double | fp128 -> double | 0x43 (fptrunc) |
| __nv_float_to_fp128 | float -> fp128 | 0x44 (fpext) |
| __nv_double_to_fp128 | double -> fp128 | 0x44 (fpext) |

I128-to-Float/Double Conversions (8 functions)

These handle the non-fp128 path: converting i128 directly to/from float/double without going through fp128 as an intermediate. The _rz suffix denotes round-toward-zero mode; _rn denotes round-to-nearest-even.

| Function | Conversion | Rounding | String Length |
|---|---|---|---|
| __nv_cvt_f32_u128_rz | float -> i128 (unsigned) | toward zero | 20 |
| __nv_cvt_f32_i128_rz | float -> i128 (signed) | toward zero | 20 |
| __nv_cvt_f64_u128_rz | double -> i128 (unsigned) | toward zero | 20 |
| __nv_cvt_f64_i128_rz | double -> i128 (signed) | toward zero | 20 |
| __nv_cvt_u128_f32_rn | i128 (unsigned) -> float | to nearest | 20 |
| __nv_cvt_i128_f32_rn | i128 (signed) -> float | to nearest | 20 |
| __nv_cvt_u128_f64_rn | i128 (unsigned) -> double | to nearest | 20 |
| __nv_cvt_i128_f64_rn | i128 (signed) -> double | to nearest | 20 |

All eight are lowered through sub_1C8BF90 with a fixed string length of 20 characters. The rounding mode choice is deliberate: _rz for integer-from-float (truncation semantics matching C/C++ cast behavior) and _rn for float-from-integer (IEEE 754 default rounding for conversions).
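The two rounding modes are easy to observe with ordinary doubles. The Python sketch below uses the built-in conversions as stand-ins for the runtime routines, whose implementations are not shown here: `int()` truncates toward zero, and CPython's `float()` on an integer rounds to nearest with ties to even. The function names are hypothetical, and out-of-range and NaN inputs are not modeled.

```python
def float_to_int_rz(x):
    """Float -> integer, rounding toward zero (C/C++ cast semantics)."""
    return int(x)  # Python's int() discards the fractional part


def int_to_float_rn(n):
    """Integer -> double, rounding to nearest with ties to even."""
    return float(n)  # CPython performs correctly rounded int -> float


# Toward zero: both positive and negative values lose their fraction.
positive = float_to_int_rz(2.9)     # 2
negative = float_to_int_rz(-2.9)    # -2

# To nearest even: 2**53 + 1 sits exactly between two doubles.
tie_down = int_to_float_rn(2**53 + 1)   # rounds to 2**53
tie_up = int_to_float_rn(2**53 + 3)     # rounds to 2**53 + 4

# A value near the top of the signed i128 range still converts.
near_max = int_to_float_rn(2**127 - 1)  # rounds up to 2**127
```

The 128-bit case matters because an i128 carries far more significant bits than a double's 53-bit significand, so integer-to-float conversion is almost always inexact for large values.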

The dispatch logic selects between the __nv_fp128_to_* / __nv_*_to_fp128 family and the __nv_cvt_* family based on whether the source or destination type is fp128 (type_id == 5). If neither operand is fp128 but one is i128, the __nv_cvt_* path is taken.
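A compact way to read this rule is as a two-step decision: first pick the family from the operand types, then, for the __nv_cvt_* family, pick the name from the opcode. The Python sketch below models that under stated assumptions: types are represented as hypothetical (tag, bits) pairs, the tags 5 and 0xB come from this page, and `conversion_family` and `cvt_name` are illustrative names rather than functions found in the binary.

```python
FP128_TAG = 5    # type tag for fp128, as given on this page
INT_TAG = 0xB    # type tag for integers, as given on this page


def conversion_family(src, dst):
    """Pick the name family for a conversion between (tag, bits) types."""
    if src[0] == FP128_TAG or dst[0] == FP128_TAG:
        return "fp128"   # __nv_fp128_to_* / __nv_*_to_fp128 (sub_1C8ADC0)
    if (INT_TAG, 128) in (src, dst):
        return "cvt"     # __nv_cvt_* (sub_1C8BF90)
    return None          # no 128-bit type involved


# Name patterns per opcode, following the opcode dispatch table.
CVT_PATTERNS = {
    0x3F: "__nv_cvt_f{w}_u128_rz",   # fptoui
    0x40: "__nv_cvt_f{w}_i128_rz",   # fptosi
    0x41: "__nv_cvt_u128_f{w}_rn",   # uitofp
    0x42: "__nv_cvt_i128_f{w}_rn",   # sitofp
}


def cvt_name(opcode, float_bits):
    """Build the __nv_cvt_* name for an opcode and a 32/64-bit float type."""
    return CVT_PATTERNS[opcode].format(w=float_bits)
```

For example, `conversion_family((2, 32), (INT_TAG, 128))` selects the "cvt" family for a float-to-i128 conversion, and `cvt_name(0x3F, 32)` then yields the unsigned round-toward-zero variant.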

FP128 Comparison Predicates

The fcmp instruction (opcode 0x4C) is dispatched by extracting the comparison predicate from bits 0--14 of the halfword at instruction offset +18. Each LLVM fcmp predicate maps to a dedicated runtime function.

Ordered Comparisons (7 functions)

Ordered comparisons return false if either operand is NaN.

| Function | Predicate | Semantics |
|---|---|---|
| __nv_fcmp_oeq | oeq | ordered equal |
| __nv_fcmp_ogt | ogt | ordered greater-than |
| __nv_fcmp_oge | oge | ordered greater-or-equal |
| __nv_fcmp_olt | olt | ordered less-than |
| __nv_fcmp_ole | ole | ordered less-or-equal |
| __nv_fcmp_one | one | ordered not-equal |
| __nv_fcmp_ord | ord | ordered (neither is NaN) |

Unordered Comparisons (7 functions)

Unordered comparisons return true if either operand is NaN.

| Function | Predicate | Semantics |
|---|---|---|
| __nv_fcmp_uno | uno | unordered (either is NaN) |
| __nv_fcmp_ueq | ueq | unordered or equal |
| __nv_fcmp_ugt | ugt | unordered or greater-than |
| __nv_fcmp_uge | uge | unordered or greater-or-equal |
| __nv_fcmp_ult | ult | unordered or less-than |
| __nv_fcmp_ule | ule | unordered or less-or-equal |
| __nv_fcmp_une | une | unordered or not-equal |

The predicate naming follows the standard LLVM fcmp convention: o prefix = ordered, u prefix = unordered. The 14 predicates cover the complete set of IEEE 754 comparison semantics excluding true and false (which are constant-folded before reaching this pass). Each function takes two fp128 operands and returns i1.
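The predicate extraction and the ordered/unordered semantics can be sketched together. The Python model below masks bits 0 to 14 of the halfword and maps the result to a runtime name. It assumes that CICC keeps upstream LLVM's numbering of fcmp predicates (false = 0, oeq = 1, ..., une = 14, true = 15), which this page does not confirm. Python floats stand in for fp128 values, and all function names are hypothetical.

```python
import math

# Assumption: same numbering as upstream LLVM's fcmp predicates.
PREDICATES = ["false", "oeq", "ogt", "oge", "olt", "ole", "one", "ord",
              "uno", "ueq", "ugt", "uge", "ult", "ule", "une", "true"]


def predicate_from_halfword(halfword):
    """Keep bits 0..14 of the halfword read at instruction offset +18."""
    return halfword & 0x7FFF


def fcmp_runtime_name(halfword):
    """Map the encoded predicate to its __nv_fcmp_* name, or None."""
    pred = PREDICATES[predicate_from_halfword(halfword)]
    if pred in ("false", "true"):
        return None  # constant-folded before this pass, per the text
    return "__nv_fcmp_" + pred


def fcmp_reference(pred, a, b):
    """Reference result of a predicate on two floats."""
    unordered = math.isnan(a) or math.isnan(b)
    if pred == "ord":
        return not unordered
    if pred == "uno":
        return unordered
    base = {"eq": a == b, "gt": a > b, "ge": a >= b,
            "lt": a < b, "le": a <= b, "ne": a != b}[pred[1:]]
    if pred[0] == "o":
        return (not unordered) and base   # ordered: NaN forces False
    return unordered or base              # unordered: NaN forces True
```

The reference function makes the NaN rule concrete: `fcmp_reference("oeq", float("nan"), 1.0)` is False while `fcmp_reference("ueq", float("nan"), 1.0)` is True.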

Trunc/Ext Handling (Opcode 0x36)

The trunc/zext/sext opcode path requires special logic because it must distinguish between genuine 128-bit truncation/extension and other type conversions that happen to use the same opcode.

sub_1C8C170::handle_trunc_ext(inst):
    if sub_1642F90(*operand, 128):      // Is the operand type 128-bit?
        // Determine source and dest bit-widths from DataLayout
        src_bits = type_id >> 8          // Bit-width encoded in high byte
        dst_bits = target_type_id >> 8
        if src_bits > dst_bits:
            emit_truncation(inst, src_bits, dst_bits)
        else:
            emit_extension(inst, src_bits, dst_bits, is_signed)
    elif type_id == 5:                   // fp128 type marker
        emit_fp128_conversion(inst)
    else:
        return                           // Not a 128-bit operation

The type_id value 5 is the LLVM type tag for fp128 in CICC's internal representation (consistent with the type code table: 1=half, 2=float, 3=double, 4=fp80, 5=fp128, 6=bf16, 0xB=integer with bit-width at type_id >> 8).
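The packed type identifier can be decoded with a mask and a shift. The Python sketch below assumes the tag occupies the low byte, which is an inference from the bit-width living at type_id >> 8 and is not stated outright on this page. The tag values come from the type code table quoted above; `decode_type_id` and `make_int_type_id` are hypothetical helpers.

```python
# Type tags as listed in the type code table.
TYPE_TAGS = {1: "half", 2: "float", 3: "double", 4: "fp80",
             5: "fp128", 6: "bf16", 0xB: "integer"}


def decode_type_id(type_id):
    """Split a packed type_id into (kind, bit-width field)."""
    tag = type_id & 0xFF      # assumption: the tag is the low byte
    bits = type_id >> 8       # bit-width field, meaningful for integers
    return TYPE_TAGS.get(tag, "other"), bits


def make_int_type_id(bits):
    """Pack an integer type of the given width (inverse of decode)."""
    return (bits << 8) | 0xB


def is_i128(type_id):
    """True when the type_id describes a 128-bit integer."""
    return decode_type_id(type_id) == ("integer", 128)
```

Under this layout an i128 would be encoded as 0x800B and fp128 as plain 5, which is why the fp128 check can be a direct equality test while the integer check has to look at the width field.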

Lowering Helpers

Four internal helper functions perform the actual call construction. Each creates a new CallInst with the library function name, replaces all uses of the original instruction with the call result, and erases the original instruction.

| Helper | Address | Purpose | Name Length |
|---|---|---|---|
| sub_1C8A5C0 | 0x1C8A5C0 | Binary fp128 arithmetic (add/sub/mul/div/rem) | 14 |
| sub_1C8BD70 | 0x1C8BD70 | Binary i128 division (udiv/idiv/urem/irem) | 12 |
| sub_1C8ADC0 | 0x1C8ADC0 | FP128 conversions (to/from all integer widths, to/from float/double) | 18--21 (varies) |
| sub_1C8BF90 | 0x1C8BF90 | I128-to/from-float/double conversions | 20 |

The "name length" column refers to the string length passed to the call construction routine. This is a fixed constant in each helper, not computed at runtime, which means the function name strings are embedded as literals in the binary (confirmed by string sweep at 0x1C8C170).

Each helper follows the same pattern:

helper(module, instruction, name_string, name_length):
    // 1. Get or create function declaration in module
    func = module.getOrInsertFunction(name_string, return_type, param_types...)
    // 2. Build argument list from instruction operands
    args = extract_operands(instruction)
    // 3. Create CallInst
    call = IRBuilder.CreateCall(func, args)
    // 4. Replace uses and erase
    instruction.replaceAllUsesWith(call)
    instruction.eraseFromParent()
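The replace-and-erase pattern can be exercised on a toy IR to make the bookkeeping visible. The Python sketch below is not LLVM's API and not the decompiled helper: `Value`, `Inst`, and `lower_to_call` are hypothetical classes and functions that imitate use lists just closely enough to show the call taking over every use of the original instruction.

```python
class Value:
    """Anything that can be used as an operand."""
    def __init__(self, name):
        self.name = name
        self.users = []  # one entry per use, so duplicates are possible


class Inst(Value):
    """An instruction living in a block (a plain Python list)."""
    def __init__(self, name, opcode, operands, block):
        super().__init__(name)
        self.opcode = opcode
        self.operands = list(operands)
        for operand in self.operands:
            operand.users.append(self)
        self.block = block
        block.append(self)

    def replace_all_uses_with(self, new):
        for user in self.users:
            user.operands = [new if o is self else o for o in user.operands]
            new.users.append(user)
        self.users = []

    def erase(self):
        for operand in self.operands:
            operand.users.remove(self)
        self.block.remove(self)


def lower_to_call(inst, callee_name):
    """Replace inst with a call to callee_name at the same position."""
    position = inst.block.index(inst)
    call = Inst(inst.name + ".call", "call " + callee_name,
                inst.operands, inst.block)
    inst.block.remove(call)              # constructor appended it at the end
    inst.block.insert(position, call)    # place it where inst was
    inst.replace_all_uses_with(call)
    inst.erase()
    return call


# Usage: lower an fadd whose result is used twice by a later instruction.
block = []
a, b = Value("a"), Value("b")
add = Inst("sum", "fadd", [a, b], block)
use = Inst("res", "fmul", [add, add], block)
call = lower_to_call(add, "__nv_add_fp128")
```

After the call, `block` holds the new call followed by the unchanged user, and the user's operands both point at the call, which is the state the four numbered steps above are meant to reach.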

Libdevice Resolution

The 48 __nv_* functions emitted by this pass are not present in the standard libdevice.10.bc. The standard libdevice (455,876 bytes embedded at unk_3EA0080 / unk_420FD80) contains more than 400 math functions (__nv_sinf, __nv_expf, etc.) but does not include any fp128 or i128 emulation routines.

Instead, these functions are resolved through one of two mechanisms:

  1. Separate bitcode library: A dedicated 128-bit emulation bitcode module linked after lower-ops runs. This module contains the actual multi-word software implementations of 128-bit arithmetic using 64-bit operations.

  2. Late synthesis during type legalization: The SelectionDAG type legalization pass (SoftenFloat action) can also handle fp128 operations, but CICC's IR-level lowering preempts this by replacing operations before they reach the backend. The __nv_* functions, once declared in the module, must be resolvable at link time.

The call declarations emitted by the pass use external linkage, meaning the linker must supply definitions. If a definition is missing, the compilation will fail at the NVPTX link stage with an unresolved symbol error. The benefit of performing this lowering at the IR level rather than in SelectionDAG is that the resulting calls are visible to the LLVM optimizer: the inliner can inline the emulation routines, SROA can decompose the intermediate values, and the loop optimizers can hoist invariant 128-bit computations.

Configuration

The pass has no dedicated knobs. It is controlled indirectly through the lower-ops pass framework:

| Parameter | Effect |
|---|---|
| enable-optimization | Parameter to LowerOpsPass registration (slot 144). When enabled, the lowered calls may be marked with optimization attributes. |

There are no knobs in knobs.txt specific to fp128 or i128 lowering. The pass runs unconditionally whenever lower-ops is in the pipeline -- there is no way to disable 128-bit emulation because leaving fp128/i128 operations in the IR would cause a fatal error in the NVPTX backend.

Diagnostic Strings

The pass itself emits no diagnostic messages or debug prints. All diagnostic information comes from the embedded function name strings:

"__nv_add_fp128"         "__nv_sub_fp128"         "__nv_mul_fp128"
"__nv_div_fp128"         "__nv_rem_fp128"
"__nv_udiv128"           "__nv_idiv128"
"__nv_urem128"           "__nv_irem128"
"__nv_fp128_to_uint8"    "__nv_fp128_to_int8"
"__nv_fp128_to_uint16"   "__nv_fp128_to_int16"
"__nv_fp128_to_uint32"   "__nv_fp128_to_int32"
"__nv_fp128_to_uint64"   "__nv_fp128_to_int64"
"__nv_fp128_to_uint128"  "__nv_fp128_to_int128"
"__nv_uint8_to_fp128"    "__nv_int8_to_fp128"
"__nv_uint16_to_fp128"   "__nv_int16_to_fp128"
"__nv_uint32_to_fp128"   "__nv_int32_to_fp128"
"__nv_uint64_to_fp128"   "__nv_int64_to_fp128"
"__nv_uint128_to_fp128"  "__nv_int128_to_fp128"
"__nv_fp128_to_float"    "__nv_fp128_to_double"
"__nv_float_to_fp128"    "__nv_double_to_fp128"
"__nv_cvt_f32_u128_rz"   "__nv_cvt_f32_i128_rz"
"__nv_cvt_f64_u128_rz"   "__nv_cvt_f64_i128_rz"
"__nv_cvt_u128_f32_rn"   "__nv_cvt_i128_f32_rn"
"__nv_cvt_u128_f64_rn"   "__nv_cvt_i128_f64_rn"
"__nv_fcmp_oeq"          "__nv_fcmp_ogt"          "__nv_fcmp_oge"
"__nv_fcmp_olt"          "__nv_fcmp_ole"          "__nv_fcmp_one"
"__nv_fcmp_ord"          "__nv_fcmp_uno"          "__nv_fcmp_ueq"
"__nv_fcmp_ugt"          "__nv_fcmp_uge"          "__nv_fcmp_ult"
"__nv_fcmp_ule"          "__nv_fcmp_une"

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| Main entry | sub_1C8C170 | 25 KB | Opcode dispatch, instruction walk, type checks |
| FP128 binary lowering | sub_1C8A5C0 | -- | Emits __nv_{add,sub,mul,div,rem}_fp128 calls |
| FP128 conversion lowering | sub_1C8ADC0 | -- | Emits __nv_fp128_to_* / __nv_*_to_fp128 calls |
| I128 division lowering | sub_1C8BD70 | -- | Emits __nv_{u,i}div128 / __nv_{u,i}rem128 calls |
| I128-float lowering | sub_1C8BF90 | -- | Emits __nv_cvt_* calls (rz/rn variants) |
| Type width check | sub_1642F90 | -- | Tests whether a type has a given bit-width (e.g., 128) |

Cross-References