Function, Call & Inline Asm Codegen
This page covers the four subsystems that together translate CUDA/C++ function definitions and call sites into LLVM IR: function prolog generation, call instruction emission, inline assembly compilation, and builtin lowering. The code lives in the 0x930000--0x960000 address range (Path A) with a parallel copy at 0x1270000--0x12D0000 (Path B).
| Component | Implementation |
|---|---|
| EmitFunction | sub_946060 (Path A) -- creates entry BB, allocapt sentinel, dispatches to prolog |
| GenerateFunctionProlog | sub_938240 (16 KB) -- parameter iteration, ABI dispatch, alloca emission |
| EmitCallExpr | sub_93CB50 (1,293 lines) -- type resolution, ABI classification, call emission |
| EmitInlineAsm | sub_1292420 (53 KB, 2,087 lines) -- 7-phase asm template-to-IR pipeline |
| BuiltinLowering | sub_12B3FD0 (103 KB, 3,409 lines) -- mega-switch over ~250 builtin IDs |
| EmitFunctionAttrs | sub_12735D0 / sub_1273F90 -- grid_constant, preserve_n, custom ABI metadata |
Function Prolog: Entry Block Setup
Every LLVM function produced by cicc starts with the same structural skeleton: an entry basic block containing a sentinel instruction, a cluster of alloca instructions for parameters and locals, and a return basic block for the unified exit path. The outer driver EmitFunction (sub_946060) builds this skeleton; the inner workhorse GenerateFunctionProlog (sub_938240) populates it with parameter handling code.
EmitFunction -- The Outer Driver
EmitFunction executes a fixed 10-step initialization sequence before tail-calling into the prolog generator:
EmitFunction(IRGenState *S, FunctionDecl *Decl, Function *F,
ParamList *Params, TypeInfoArray *TI, SourceLoc Loc, bool ByvalDemotion):
1. Resolve function type through typedef chain (kind==12 -> follow offset+160)
2. Call SetupFunctionMetadata(S, Decl)
3. Optionally set section name on F via Value::setSection
4. Create "entry" basic block:
entryBB = BasicBlock::Create(S, "entry", F, nullptr)
5. Create the "allocapt" sentinel instruction:
voidTy = Type::getVoidTy(ctx)
undef = UndefValue::get(voidTy)
allocapt = new BitCastInst(undef, voidTy) // void-to-void no-op
entryBB->getInstList().push_back(allocapt)
allocapt->setName("allocapt")
S->AllocaInsertPt = allocapt // stored at IRGenState+456
6. Create "return" basic block:
retBB = BasicBlock::Create(S, "return", nullptr, nullptr)
S->ReturnBlock = retBB // stored at IRGenState+200
7. Set up return value slot:
if returnType is void:
S->RetVal = nullptr
elif ABI kind == 2 (sret) AND isAggregate(returnType):
S->RetVal = F->arg_begin() // reuse the sret pointer
else:
S->RetVal = CreateTmpAlloca(S, returnType, "retval")
8. Store alignment of return type at S+216
9. Initialize insertion state: S->CurrentBB = entryBB
10. Tail-call GenerateFunctionProlog(S, Decl, F, Params, TI, Loc, ByvalDemotion)
The allocapt sentinel is the critical mechanism. It is a dead `bitcast` of a void `undef` value to void that serves purely as an insertion anchor. When CreateTmpAlloca (sub_921D70) is called with no explicit array size -- the common case -- it inserts the new AllocaInst before the allocapt marker rather than at the current builder insertion point. This ensures that all alloca instructions cluster at the top of the entry block regardless of where in the function body they were requested, which is a hard requirement for LLVM's mem2reg pass to promote them to SSA registers.
The sentinel is eventually dead-code-eliminated in a later pass since it produces no usable value.
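The anchoring behavior can be modeled in a few lines. This is an illustrative sketch, not cicc code; the class and method names are invented for the example:

```python
# Model of the "allocapt" sentinel: allocas insert BEFORE the sentinel,
# ordinary body instructions append to the end, so allocas always
# cluster at the top of the entry block.

class EntryBlock:
    def __init__(self):
        # The sentinel starts as the only instruction in the block.
        self.instructions = ["allocapt"]

    def create_tmp_alloca(self, name):
        # Insert before the sentinel, not at the current "builder" point.
        idx = self.instructions.index("allocapt")
        self.instructions.insert(idx, f"alloca {name}")

    def append(self, inst):
        # Body instructions go to the end of the block.
        self.instructions.append(inst)

bb = EntryBlock()
bb.create_tmp_alloca("retval")
bb.append("store i32 0, %x")    # body code emitted in between
bb.create_tmp_alloca("y.addr")  # still lands above the sentinel
```

Even though `y.addr` is requested after a body instruction was emitted, it ends up above the sentinel, exactly the clustering mem2reg needs.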
GenerateFunctionProlog -- Parameter Lowering
The prolog iterates four parallel data structures in lockstep:
| Cursor | Source | Stride | Termination |
|---|---|---|---|
| EDG parameter node | Linked list from Decl | next at offset +112 | nullptr |
| LLVM argument slot | F->arg_begin() | 40 bytes | F->arg_end() |
| Type info entry | From the ABI classifier | 40 bytes | (parallel with args) |
| Parameter index | 1-based counter | +1 | (parallel with params) |
A post-loop assertion validates that both cursors reached their end simultaneously: "Argument mismatch in generation function prolog!".
Struct Return: The agg.result Convention
Before entering the parameter loop, a helper (sub_938130) checks whether the first argument's ABI kind equals 2 (sret). When true, the prolog names the first LLVM argument "agg.result" and advances the argument cursor by one slot (+40 bytes), so that subsequent parameter processing starts at the second argument. This mirrors the standard LLVM sret convention where the caller pre-allocates space for a returned struct and passes a pointer as a hidden first parameter.
ABI Variant Dispatch
For each parameter, the ABI variant field at TypeInfo+12 selects one of four lowering paths:
Variant 0/1 -- Indirect/Aggregate Pass. The parameter arrives as a pointer to caller-allocated memory. If the type is an aggregate (struct/union/class/array -- type kinds 8--11 checked by IsAggregateType at sub_91B770), the prolog creates a local alloca named <param>.addr, stores the incoming argument into it, and registers the alloca in the declaration map via EmitParamDecl. If the type is a scalar, it goes directly to EmitParamDecl without an intermediate alloca.
Variant 2 -- Direct Pass (most common). The parameter is passed by value in a register or register pair. Two sub-paths exist:
- **Byval demotion path.** When the `ByvalDemotion` flag (parameter `a7`) is set and the parameter carries a `byval` attribute (TypeInfo+16 nonzero), the prolog consults a global name-set (`dword_4D04688`) to decide whether to create a `__val_param` temporary. If selected, it allocates a `"tmp"` alloca via `CreateTmpAlloca`, stores the argument into it, renames the alloca `"__val_param" + param_name`, and falls through to `EmitParamDecl`. The `__val_param` prefix is NVIDIA-specific and marks parameters that have been demoted from byval to a local copy for downstream optimization passes.
- **Normal path.** For non-byval scalars, calls `EmitParamDecl` directly. A guard validates that non-aggregate arguments are not marked indirect: `"Non-aggregate arguments passed indirectly are not supported!"`.
Variant 3 -- Coercion. The parameter's LLVM type does not match the source type and requires a coercion cast. For aggregates, a "tmp" alloca is created. For scalars, the declaration is looked up and wrapped with a bitcast. The result is forwarded to EmitParamDecl.
EmitParamDecl -- Registration
EmitParamDecl (sub_9446C0) performs the final steps for each parameter:
- For scalar (non-aggregate, non-indirect) parameters: creates an alloca named `<param>.addr`, stores the incoming argument into it, and names the argument with the original parameter name.
- Inserts the mapping `(EDG decl pointer -> LLVM Value*)` into a hash map with open-addressing/quadratic-probing collision resolution. A duplicate check guards against re-declaration: `"unexpected: declaration for variable already exists!"`.
- If debug info is enabled (`dword_4D046B4`), emits debug metadata for the parameter via `sub_9433F0`.
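The decl-to-Value map's collision handling can be sketched as follows. This is a model of the described behavior, with invented names, not a reconstruction of the binary's actual table layout:

```python
# Open-addressing hash map with a quadratic probe sequence and a hard
# error on duplicate declarations, mirroring EmitParamDecl's guard.

class DeclMap:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity  # each slot: (decl, value) or None

    def insert(self, decl, value):
        n = len(self.slots)
        for i in range(n):
            idx = (hash(decl) + i * i) % n  # quadratic probing
            slot = self.slots[idx]
            if slot is None:
                self.slots[idx] = (decl, value)
                return
            if slot[0] == decl:
                raise RuntimeError(
                    "unexpected: declaration for variable already exists!")
        raise RuntimeError("map full")

    def lookup(self, decl):
        n = len(self.slots)
        for i in range(n):
            idx = (hash(decl) + i * i) % n
            slot = self.slots[idx]
            if slot is None:
                return None
            if slot[0] == decl:
                return slot[1]
        return None

m = DeclMap()
m.insert("param_x", "%x.addr")
```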
Naming Convention Table
| IR Value | Name Assigned |
|---|---|
| sret argument | "agg.result" |
| Unnamed parameter | "temp_param" |
| C++ this parameter | "this" (detected by bit 0 at EDG node offset +172) |
| Parameter alloca | <param_name> + ".addr" |
| Byval temp alloca | "__val_param" + <param_name> |
| Return value alloca | "retval" |
| Entry basic block | "entry" |
| Return basic block | "return" |
| Alloca sentinel | "allocapt" |
CreateTmpAlloca Internals
CreateTmpAlloca (sub_921D70) computes alignment from the type size using _BitScanReverse64 (effectively log2(size)), looks up or creates the pointer-to-type in the module's type system, then delegates to CreateAllocaInst (sub_921B80). The key detail: when no explicit array size is provided, the alloca is inserted at the allocapt marker position (IRGenState+456+24), not at the current builder insertion point.
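The bit-scan step amounts to rounding the size down to a power of two. A minimal sketch, assuming that rounded value is used directly as the alignment (the exact clamping cicc applies is not documented here); Python's `int.bit_length` stands in for the MSVC `_BitScanReverse64` intrinsic:

```python
# _BitScanReverse64 returns the index of the highest set bit,
# i.e. floor(log2(size)) for nonzero size.

def bit_scan_reverse64(x):
    assert x != 0
    return x.bit_length() - 1       # index of the highest set bit

def compute_alignment(type_size):
    # Power-of-two alignment derived from the type size (assumption:
    # no extra clamping beyond the bit scan).
    return 1 << bit_scan_reverse64(type_size)
```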
Call Codegen
Call emission (sub_93CB50) is a 1,293-line function that handles direct calls, indirect calls, builtins, special intrinsics, and printf interception. It receives the caller's codegen context, the EDG call expression node, and an optional pre-allocated destination for aggregate returns.
Phase 1: Type Resolution
The callee operand is extracted from the call node's first operand slot (offset +72). The function resolves the callee's declaration via sub_72B0F0, then peels through the type chain -- stripping typedef aliases (kind 12) by following offset +160 -- until it reaches a pointer-to-function type (kind 6) wrapping a function type (kind 7). Fatal assertions guard both steps: "Expected pointer to function!" and "unexpected: Callee does not have routine type!".
Phase 2: Builtin Dispatch
For direct calls (opcode 20), the resolved callee declaration is checked for the builtin flag: byte[199] & 2. When set, the entire normal call path is bypassed. Control transfers to sub_955A70 (or sub_12B3FD0 on Path B), the builtin lowering mega-switch described in a later section. If the builtin returns an aggregate, the call codegen allocates an "agg.tmp" stack slot and emits a store of the result into it.
Phase 3: Intrinsic Special Cases
If the callee is not a builtin but carries an intrinsic ID (word[176] != 0), a handful of intrinsic IDs receive special treatment:
| Intrinsic ID | Description |
|---|---|
| 10214 | Surface/texture primitive |
| 10219, 10227 | Warp-level primitives (detected via (id - 10219) & 0xFFF7 == 0) |
| 15752 | Special return convention intrinsic |
These dispatch to sub_939370, a dedicated handler that bypasses the normal ABI classification entirely.
Phase 4: Argument Processing
Arguments are codegen'd by walking the argument linked list and calling sub_921F50 on each expression. Results are collected into a dynamically-growing array (24 bytes per entry, managed by sub_C8D5F0).
When bit 1 of the call node's flags byte (offset +60) is set -- indicating variadic or reversed-evaluation convention -- arguments are first collected into a temporary linked list and then written into the array in reverse order. This preserves the C right-to-left evaluation order for variadic calls.
Phase 5: ABI Classification
The ABI classifier (sub_9378E0) receives the return type, parameter types, and byval flags, and produces a calling-convention descriptor. Each parameter gets an ABI kind:
| ABI Kind | Meaning | Codegen Action |
|---|---|---|
| 0 | Direct (register) | Push value directly if scalar; alloca + store if byval aggregate |
| 1 | Indirect (pointer) | Push pointer directly (only valid for aggregates) |
| 2 | Indirect + byval | Push value directly (callee copies) |
| 3 | Coercion/expand | Multi-register split, handled by sub_923000 |
For the return value, ABI kind 2 means sret: a hidden first parameter is prepended to the argument list, pointing to a caller-allocated "tmp" alloca.
Phase 6: Callee Bitcast Folding
If the callee operand is a bitcast (byte[0] == 5), the optimizer walks back to the original function pointer and compares return types and parameter counts. If the signature matches exactly (pointer equality on type nodes, parameter-by-parameter comparison), the bitcast is folded out. This removes unnecessary bitcast wrappers that arise from C-style casts between compatible function pointer types.
Phase 7: Pre-Call Hooks and printf Interception
Debug location metadata is emitted via sub_92FD10. Then a special case: if the call is direct (opcode 20) and the callee name is literally "printf", control transfers to sub_939F40 which performs GPU printf lowering -- converting the printf call into a vprintf-style call that writes formatted output through the GPU's printf buffer mechanism.
Phase 8: preserve_n Operand Bundles
If the call node's preserve_data field (offset +64) is non-null, up to three operand bundles are attached to the call instruction:
preserve_data[0] >= 0 => "preserve_n_data" = ConstantInt(value)
preserve_data[1] >= 0 => "preserve_n_control" = ConstantInt(value)
preserve_data[2] >= 0 => "preserve_n_after" = ConstantInt(value)
These NVPTX-specific operand bundles are register-pressure hints consumed by the instruction scheduler and register allocator. The value -1 means "not specified" and suppresses the bundle.
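The suppression rule can be expressed as a short filter. This is an illustrative sketch of the logic above, not the binary's code:

```python
# Build the preserve_n operand-bundle list: each field contributes a
# bundle only when it is >= 0; the sentinel -1 means "not specified".

BUNDLE_TAGS = ["preserve_n_data", "preserve_n_control", "preserve_n_after"]

def build_preserve_bundles(preserve_data):
    bundles = []
    for tag, value in zip(BUNDLE_TAGS, preserve_data):
        if value >= 0:              # -1 suppresses the bundle
            bundles.append((tag, value))
    return bundles
```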
Phase 9: Call Emission and Attribute Attachment
The LLVM CallInst is created by sub_921880, which takes the callee, the argument array, return type, and the optional operand bundle. Calling-convention attributes (sret, byval, alignment) are collected by sub_93AE30 and attached to the call. For indirect calls, the instruction is named "call" for readability; direct calls inherit the callee's name.
Phase 10: Return Value Handling
| Return ABI Kind | Handling |
|---|---|
| 0 or 1 (direct scalar) | Return the CallInst result directly |
| 0 or 1 (direct aggregate) | Allocate "agg.tmp", store the result, return the alloca |
| 2 (sret) | Return the sret pointer (aggregate) or load from it (scalar) |
| 3 (expanded/multi-register) | Call sub_923000 to split across multiple extracts |
For indirect calls, callalign metadata is constructed by querying the alignment requirement of the return type and each argument type, wrapping them in an MDTuple, and attaching it to the call instruction. This metadata is consumed by the NVPTX backend to generate correct alignment annotations in PTX.
Call Emission Pseudocode
EmitCallExpr(Result *Out, CodegenCtx *Ctx, CallNode *Call, u64 DestFlags, u32 Align):
callee_decl = ResolveCallee(Call->operand[0])
func_type = PeelTypedefs(callee_decl->type) // kind 6 -> kind 7
// ---- Builtin fast path ----
if Call->opcode == CALL_DIRECT AND callee_decl->flags[199] & 2:
result = BuiltinLowering(Ctx, Call)
if isAggregate(func_type->returnType):
dest = DestFlags.ptr OR CreateTmpAlloca("agg.tmp")
Store(result, dest, ComputeAlign(returnType))
Out = {dest, INDIRECT, sizeof(returnType)}
else:
Out = result
return
// ---- Special intrinsics ----
if callee_decl->intrinsicID in {10214, 10219, 10227, 15752}:
return SpecialIntrinsicHandler(Out, Ctx, callee_decl->intrinsicID, Call)
// ---- Normal call path ----
callee_val = CodegenCallee(Ctx, Call->operand[0])
args[] = CodegenArguments(Ctx, Call->argList)
if Call->flags & REVERSED_EVAL:
Reverse(args)
abi_desc = ClassifyABI(func_type->returnType, paramTypes, byvalFlags)
if abi_desc.returnIsSRet:
sret_ptr = DestFlags.ptr OR CreateTmpAlloca("tmp")
PrependArg(args, sret_ptr)
for each (arg, abi_entry) in zip(args, abi_desc.params):
if abi_entry.kind == DIRECT AND abi_entry.isByval:
tmp = CreateAllocaForAggregate(arg)
Store(arg, tmp)
arg = tmp
elif abi_entry.kind == INDIRECT:
assert isAggregate(arg.type)
callee_val = FoldCalleebitcast(callee_val, func_type)
EmitDebugLoc(Ctx, Call->srcLoc)
if Call->opcode == CALL_DIRECT AND callee_name == "printf":
return PrintfExpansion(Ctx, abi_desc, args, Call->srcLoc)
bundle = BuildPreserveNBundle(Call->preserveData)
call_inst = EmitCall(func_type, callee_val, args, bundle)
AttachCCAttrs(call_inst, abi_desc)
Out = HandleReturnValue(call_inst, abi_desc, func_type->returnType)
Inline Assembly Codegen
The inline asm handler (sub_1292420, 53 KB) translates a CUDA __asm__() statement into an LLVM InlineAsm call instruction through a strict 7-phase pipeline. A nearly-identical duplicate exists at sub_932270 for the Path A codegen context -- same parsing logic, same constraint table, different diagnostic function pointers.
Phase 1: Template String Parsing
The raw PTX template string from the EDG AST is scanned character-by-character into a fragment array. Each fragment (48 bytes) is either a literal text chunk (kind=0) or an operand substitution reference (kind=1 with an operand index at offset +0x28).
The parser handles the CUDA-to-LLVM syntax translation:
| CUDA Syntax | LLVM IR Output | Parser Action |
|---|---|---|
| `$` (literal dollar) | `$$` | Escape doubling |
| `%%` | `%` | Literal percent |
| `%N` (operand ref) | Fragment kind=1, index=N | Multi-digit decimal parse |
| `%=` (unique ID) | `${:uid}` | LLVM unique-identifier modifier |
| `%[name]` | -- | Fatal: "symbolic operand reference not supported!" |
| `%cN` (modifier+operand) | Fragment kind=1, modifier=c, index=N | Alpha char + decimal parse |
For operands referencing string literal constants (the C constraint), the parser resolves the constant through the EDG value chain, validates the type is array of char, extracts each byte, escapes any $ characters, strips the trailing NUL, and emits the entire string as a literal fragment.
Phase 2: Template Reconstruction
The fragment array is serialized into the final LLVM inline-asm template string:
- Literal fragments: appended verbatim.
- Operand references without modifier: converted to `$N` (e.g., operand 3 becomes `$3`).
- Operand references with modifier: converted to `${N:c}` (e.g., operand 0 with modifier `h` becomes `${0:h}`).
This is where the CUDA %N convention is translated to LLVM's $N convention. Literal % characters in PTX (like %tid.x) pass through unchanged because they were never parsed as operand references.
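The combined parse-and-reconstruct translation can be modeled compactly. This is a toy re-implementation of the syntax table above, not cicc's two-pass fragment pipeline; for brevity it parses modifier operands as a single digit and omits the `%[name]` fatal case and `C`-constraint inlining:

```python
# Translate the CUDA '%' inline-asm convention to LLVM's '$' convention.

def translate_template(src):
    out, i, n = [], 0, len(src)
    while i < n:
        c = src[i]
        if c == '$':                        # literal dollar -> escaped
            out.append('$$'); i += 1
        elif c == '%':
            j = i + 1
            if j < n and src[j] == '%':     # %% -> literal percent
                out.append('%'); i = j + 1
            elif j < n and src[j] == '=':   # %= -> unique-ID modifier
                out.append('${:uid}'); i = j + 1
            elif j < n and src[j].isdigit():    # %N -> $N (multi-digit)
                k = j
                while k < n and src[k].isdigit():
                    k += 1
                out.append('$' + src[j:k]); i = k
            elif j + 1 < n and src[j].isalpha() and src[j + 1].isdigit():
                # %cN -> ${N:c} (modifier + operand)
                out.append('${' + src[j + 1] + ':' + src[j] + '}'); i = j + 2
            else:
                # PTX register names like %tid.x pass through unchanged
                out.append('%'); i = j
        else:
            out.append(c); i += 1
    return ''.join(out)
```

Running it on the end-to-end example later in this page reproduces the Phase 2 output: `"mov.u32 %0, %tid.x"` becomes `"mov.u32 $0, %tid.x"`.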
Phase 3: Constraint String Construction
The parser iterates the EDG operand linked list, building a comma-separated LLVM constraint string. Each EDG operand carries a constraint type-chain -- a linked list of tag bytes that map through a 256-byte global lookup table (aXg0123456789rh[]) to produce LLVM constraint letters.
Output operands (flags & 2 != 0):
- Pointer types: constraint prefix `"=*"` + letters (indirect output).
- Non-pointer types: constraint prefix `"="` + letters (direct output).
- Read-write operands (byte at +24 == 3): a tied input operand is generated with the output's index as the constraint, linking them as a two-address pair.
Input operands:
- Same tag-to-letter mapping.
- Tags 10--19 are prohibited: `"tied input/output operands not supported!"` (GCC-style matching-digit constraints are not implemented).
- Tag 23 (the `C` constraint on inputs) creates an `undef` value -- the constant's value was already inlined into the template string during Phase 1.
Special tag handling:
| Tag | Effect |
|---|---|
| 8, 9 | Sets is_address + is_memory flags; tag 9 also emits "imr" composite constraint |
| 0x14, 0x15, 0x16, 0x18, 0x26, 0x2A | Pointer-through types: follow type chain, set is_address |
| 0x19, 0x1B, 0x1C | Memory constraints |
| 23 | Remapped to tag 20 before table lookup |
Phase 4: Clobber List
The EDG clobber linked list (at asmInfo+144) is iterated. Each clobber node has a tag byte selecting the clobber type:
- Tag 1: Memory clobber. Appends `",~{memory}"` to the constraint string.
- Tag 58: Named register clobber. Uses the name string from the node. Appends `",~{<name>}"`.
- Other tags: Looks up the register name from a global table (`off_4B6DCE0[tag]`). Appends `",~{<name>}"`.
Phase 5: InlineAsm Object Creation
The LLVM function type for the asm is constructed based on the output count:
- Zero outputs: `void` return type.
- One output: scalar return type matching the output operand.
- Multiple outputs: anonymous struct return type.
The volatile/sideeffect flag is read from asmInfo+128 (bit 2). A diagnostic (0xE9F) warns when outputs exist but the asm is not marked volatile, as this risks miscompilation.
The InlineAsm object is created via InlineAsm::get(funcType, asmString, constraintString, hasSideEffects, isAlignStack=0, dialect=0) and a CallInst is emitted to invoke it.
Phase 6: Result Extraction
For single-output asm, the CallInst result is used directly. For multiple outputs, each result is extracted with extractvalue instructions:
- Results with type size <= 16 bytes: a compact `extractvalue` path.
- Results with type size > 16 bytes: a full instruction node (88 bytes) is allocated, the `extractvalue` is constructed with explicit index arrays, linked into the basic block's instruction list, and named `"asmresult"`.
Each extracted value is then stored into its output destination via sub_12843D0, which reads the output codegen-info records built during Phase 3.
Phase 7: Cleanup
All temporary vectors and strings are freed: the fragment array (with per-element string cleanup), constraint strings, operand/type/destination vectors, and tied-operand tracking arrays.
End-to-End Example
CUDA source: __asm__("mov.u32 %0, %tid.x" : "=r"(result));
Phase 1 parse: [literal("mov.u32 "), operand(idx=0), literal(", %tid.x")]
Phase 2 recon: "mov.u32 $0, %tid.x"
Phase 3 constr: "=r"
Phase 4 clobber: ""
Phase 5 create: InlineAsm::get("mov.u32 $0, %tid.x", "=r", sideeffects=true)
call i32 asm sideeffect "mov.u32 $0, %tid.x", "=r"()
Phase 6 extract: (single output -- use call result directly)
store i32 %asm_result, i32* %result.addr
Builtin Lowering
The builtin lowering mega-switch (sub_12B3FD0, 103 KB) is one of the largest single functions in the binary. It handles ~250 builtin IDs across ~130 case labels, dispatching CUDA intrinsic functions like __syncthreads(), __shfl_sync(), and __hmma_m16n16k16_mma_f16f16 into LLVM IR.
Entry Logic
The function extracts the callee from the call expression, validates the builtin bit (flags byte[199] & 2), then looks up the builtin ID by name via sub_12731E0. If the ID is 0 (name not in the builtin table), execution falls through to the LLVM intrinsic fallback path at line 3154.
Five Lowering Strategies
| Strategy | Usage (%) | Mechanism |
|---|---|---|
| Sub-handler delegation | 66% (~165 IDs) | Calls a specialized function for a family of builtins |
| Intrinsic call emission | 12% (~30 IDs) | 1:1 mapping to a single llvm.nvvm.* intrinsic via sub_1285290 |
| Inline IR generation | 10% (~25 IDs) | Builds IR nodes directly (alloca, load, store, cast, insertvalue) |
| Table-driven selection | 10% (~25 IDs) | Selects intrinsic ID from a table keyed by operand type/size |
| SM-gated conditional | 2% (~5 IDs) | Different lowering depending on target SM version |
Per-Category Dispatch
Atomics and synchronization (IDs 0xB5--0xCC, 181--204). Atomic operations delegate to sub_12A7DA0; fences and barriers to sub_12AB550. Cases 0xBA--0xBC map directly to LLVM intrinsic 6 (likely llvm.nvvm.atomic.*) with type-overloaded arguments. Case 0xCB is SM-gated: on SM <= 63 it emits an inline constant; on SM >= 70 it emits intrinsic 3769.
Warp shuffle (IDs 0x15F--0x166, 351--358). All eight variants delegate to sub_12ABB90 parameterized by shuffle mode (0=idx, 1=up, 2=down, 3=butterfly) and sync flag (0=legacy, 1=__shfl_sync_*). The clamp flag distinguishes butterfly from other modes.
Warp vote/ballot (IDs 0x12E--0x135, 0x152--0x159, 0x18B--0x192). Three groups of 8 IDs each, all delegating to sub_12B3540 with the builtin ID as a discriminator. This covers __ballot_sync, __all_sync, __any_sync across integer/float/predicate operand types.
Surface and texture operations (IDs 0xCF--0x113, 0x287--0x2A5, 207--275 + 647--677). The largest category at ~95 IDs (38%). Organized into pairs using two sub-handlers: sub_12ADE80(ctx, intrinsic_base, surface_type, variant, args) for individual load/store operations, and sub_12AA9B0(ctx, surface_type, expr) for combined operations. Surface types are encoded as integers (0=generic, 1=1D, 5=2D, 7=3D, 8=cubemap, 10=1D array, 11=2D array, 14=buffer). Intrinsic bases 3701/3702 are primary read/write; 3698/3699 are 2D-array variants.
The texture handler (case 0x287) is the most complex single case at ~230 lines. It walks the AST to extract the texture name string and return element type, constructs an intrinsic name as "<texname>_<typename>" using a type-name resolution switch (mapping integer subtypes 0--10 to strings like "uchar", "int", "ulonglong"), and emits the call. A global flag (dword_4F06B98) controls whether plain char maps to uchar or schar.
Tensor core / WMMA (IDs 0x16E--0x1D9, 0x2A6--0x2E8, 366--473 + 678--744). The second-largest category at ~85 IDs (34%). Three sub-handlers partition the work: sub_12AC1A0 handles wmma::mma_sync with bias/scale flags (has_bias, has_scale) encoding four accumulator modes; sub_12AC5F0 handles store_matrix_sync; sub_12ACA80 handles load_matrix_sync. IDs group into triplets by matrix shape: m16n16k16, m32n8k16, m8n32k16, m16n16k8 (TF32), bf16, and fp8 (SM 89+) families.
WGMMA (IDs 0x2E9--0x302, 745--770). SM 90+ warpgroup MMA operations. Cases 0x2E9--0x2EE handle fence/commit/wait. Cases 0x2F1--0x2FC implement __wgmma_mma_async through a massive ~800-line handler that selects from a 144-entry intrinsic table spanning IDs 5304--5447. The table is indexed by a 5-dimensional grid: N-size (16/32/64/128), B-operand source (shared vs register), element type (s64 vs other), scale/negate flags, and case variant. Mode bits are packed into a single integer: bit0=accumulate | bit1=transpose | bit2=negate-C | bit4=negate-A.
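The mode-bit packing described for `__wgmma_mma_async` can be written out directly. A sketch of the packing as stated in the text (bit 3 is unused in this description):

```python
# Pack WGMMA mode flags into a single integer:
# bit0 = accumulate, bit1 = transpose, bit2 = negate-C, bit4 = negate-A.

def pack_wgmma_mode(accumulate, transpose, negate_c, negate_a):
    return (int(accumulate)
            | (int(transpose) << 1)
            | (int(negate_c) << 2)
            | (int(negate_a) << 4))
```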
Memory copy (IDs 0x199, 0x291--0x299, 409 + 657--665). Memcpy variants encode alignment directly in the builtin ID: ID 658 = align 2, ID 659 = align 4, ID 660 = align 8, ID 661 = align 16. The actual emission delegates to sub_12897A0. Memset operations (IDs 410, 663, 665) delegate to sub_12A6DF0.
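The alignment encoding in the memcpy builtin IDs is a simple power-of-two progression, which can be recovered arithmetically. A sketch under the assumption that the mapping is exactly the four IDs listed:

```python
# Memcpy builtin IDs encode alignment as 2^(id - 657):
# 658 -> 2, 659 -> 4, 660 -> 8, 661 -> 16.

def memcpy_align_from_id(builtin_id):
    assert 658 <= builtin_id <= 661
    return 1 << (builtin_id - 657)
```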
TMA bulk operations (IDs 0x19B--0x1A0, 411--416). Cases 0x19B and 0x19C are the largest individual handlers (~300 and ~450 lines respectively) for SM 90+ tensor memory access bulk copy/scatter operations. They build operand vectors iteratively and select from intrinsic tables indexed by element count (IDs 4218--4223 for stores, 4244--4250 for loads).
LLVM Intrinsic Fallback Path
When the builtin ID is 0, the default path (lines 3154--3407) looks up the LLVM intrinsic by name via sub_15E2770. If the intrinsic is type-overloaded, argument types are used to resolve the declaration. Each argument is lowered via sub_128F980, with type-mismatch bitcasts (opcode 47) and vector zexts (opcode 33) inserted as needed. Struct-return intrinsics are handled by iterating the return struct's fields with extractvalue.
Function Attributes
CUDA function attributes are lowered through a three-stage pipeline: EDG frontend parsing, attribute emission during IR generation, and a final metadata-attachment pass.
Stage 1: Frontend Parsing (sub_64F1A0)
The EDG parser scans the token stream for preserve_n_data, preserve_n_control, and preserve_n_after identifiers, parses each as an integer, and stores them in a 12-byte struct at offset +336 of the function declaration node:
struct preserve_reg_info {
int32_t preserve_n_data; // +0, -1 = not specified
int32_t preserve_n_control; // +4, -1 = not specified
int32_t preserve_n_after; // +8, -1 = not specified
};
Stage 2: Attribute Emission (sub_12735D0)
During IR generation, the attribute emitter checks declaration flags and writes attribute bundles:
- **Bit 0x20 at decl+198 (kernel function):** emits `("kernel", 1)`, then iterates the parameter array (40-byte entries); for each parameter with `byte[+33] != 0`, emits `("grid_constant", param_index)`, where `param_index` is 1-based. This marks individual kernel parameters as grid-constant, enabling the backend to place them in constant memory.
- **Bit 0x04 at decl+199 (custom ABI):** emits `("full_custom_abi", 0xFFFFFFFF)`.
- **Preserve-reg struct at decl+336:** for each of the three fields, if the value is >= 0, emits the corresponding attribute and then writes -1 back (a consumed-flag pattern) to prevent double-emission.
Stage 3: Metadata Attachment (sub_1273F90)
The reader pass iterates all functions' attribute bundles and re-encodes them as LLVM named metadata:
grid_constant. Per-parameter type values are collected into a vector, then bundled under the MDString key "grid_constant" as an MDTuple. The downstream consumer sub_CE8660 queries this metadata to determine aliasing/readonly semantics for kernel parameters.
preserve_reg_abi. The three preserve_n values are collected with their MDString keys ("preserve_n_data", "preserve_n_control", "preserve_n_after") into a vector, then bundled under the composite key "preserve_reg_abi" as an MDTuple. The register allocator and prologue-epilogue inserter query this via sub_314D260.
full_custom_abi. Emitted as a simple (MDString, MDNode(i32 0xFFFFFFFF)) pair. When a function has this attribute but NOT the full_custom_abi flag, the alternative "numParams" key records the explicit parameter count as a nested MDTuple.
Final Metadata Layout
For a __global__ kernel with grid_constant parameters and register preservation:
!kernel_attrs = !{
!MDString("kernel"), !MDNode(i32 1),
!MDString("grid_constant"), !MDTuple(
!MDNode(i32 <param1_type>), !MDNode(i32 <param2_type>), ...
),
!MDString("preserve_reg_abi"), !MDTuple(
!MDString("preserve_n_data"), !MDNode(i32 N),
!MDString("preserve_n_control"), !MDNode(i32 M),
!MDString("preserve_n_after"), !MDNode(i32 K)
)
}
Attribute Semantics
| Attribute | Meaning | Backend Effect |
|---|---|---|
| grid_constant | Kernel parameter is immutable across the grid | Place in constant memory; optimize loads |
| preserve_n_data | N data registers must be preserved across calls | Register allocator reserves R0--RN |
| preserve_n_control | N predicate registers to preserve | Prologue/epilogue saves predicates |
| preserve_n_after | N registers preserved after a call (callee-save count) | Adjusts spill/restore boundaries |
| full_custom_abi | Function bypasses standard CUDA calling convention | Parameter passing determined by explicit annotations |
| numParams | Explicit parameter count for non-full_custom_abi functions | Custom ABI parameter setup |
Cross-Reference
| Address | Function | Role |
|---|---|---|
| sub_946060 | EmitFunction | Creates entry BB, allocapt, return BB, dispatches to prolog |
| sub_938240 | GenerateFunctionProlog | Iterates parameters, ABI dispatch, alloca emission |
| sub_9446C0 | EmitParamDecl | Creates alloca+store, registers decl->Value mapping |
| sub_921D70 | CreateTmpAlloca | Alloca creation with alignment, inserted at allocapt |
| sub_921B80 | CreateAllocaInst | Low-level alloca IR emission |
| sub_938130 | IsSRetReturn | Checks ABI kind == 2 |
| sub_91B770 | IsAggregateType | Type kinds 8--11 (struct/union/class/array) |
| sub_93CB50 | EmitCallExpr | Full call instruction emission (1,293 lines) |
| sub_9378E0 | ClassifyABI | Return + parameter ABI classification |
| sub_939F40 | PrintfExpansion | GPU vprintf lowering for printf calls |
| sub_93AE30 | CollectCCAttrs | Builds sret/byval/align attribute list |
| sub_955A70 / sub_12B3FD0 | BuiltinLowering | Mega-switch over ~250 builtin IDs |
| sub_1292420 / sub_932270 | EmitInlineAsm | 7-phase asm template-to-IR pipeline |
| sub_12735D0 | EmitFunctionAttrs | Writes attribute bundles during IR gen |
| sub_1273F90 | ReadFunctionAttrs | Attaches LLVM named metadata from bundles |
| sub_64F1A0 | ParsePreserveAttrs | EDG parser for preserve_n_* tokens |