AsmPrinter & PTX Body Emission

The NVPTXAsmPrinter is cicc's final code-generation stage: the component that converts the machine-level IR (MachineFunction, MachineBasicBlock, MachineInstr) into the textual PTX that ptxas consumes. Unlike a conventional LLVM AsmPrinter, which emits real machine assembly for a physical ISA, the NVPTX variant emits PTX -- a virtual ISA with its own declarative syntax for registers, parameters, address spaces, textures, and kernel launch metadata. The AsmPrinter is not merely "formatting instructions"; it is responsible for the entire PTX module structure: file header directives, global variable declarations with topological ordering, function signatures with .param space marshaling, register class declarations, the instruction body with debug annotations, and convergence-control pseudo-instructions required by the warp execution model. In cicc v13.0 the printer spans two address clusters -- the NVPTX-specific emission layer at 0x2140000-0x21FFFFF and the LLVM AsmPrinter override at 0x31E0000-0x3240000.

Pass registration             sub_214ABE0 -- "NVPTX Assembly Printer"
emitFunctionBody              sub_31EC4F0 (12KB, 2565 asm lines)
Header emission (emitHeader)  sub_214F370 (7.2KB)
Function header orchestrator  sub_215A3C0 (10KB)
Kernel attribute emission     sub_214DA90 (8.7KB)
Parameter list emission       sub_21502D0 (22KB)
Stack frame + register decls  sub_2158E80 (17KB)
Global variable emission      sub_2156420 (20KB)
Call prototype emission       sub_21CF8D0 (29KB)
Inline asm handler            sub_31F26A0 / sub_397DF10 (30KB)
AsmPrinter::doFinalization    sub_3972F10 (24KB)

PTX Output Structure

A complete PTX module emitted by cicc follows this exact structure. Every element in this layout corresponds to a specific emitter function:

//                                          ← sub_214F370 (emitHeader)
// Generated by NVIDIA NVVM Compiler
// Compiler Build ID: ...
// Based on NVVM 7.0.1
//
.version 8.5                                ← PTXVersion / 10 . PTXVersion % 10
.target sm_90, texmode_independent          ← subtarget name + driver interface
.address_size 64                            ← 64 or 32 from subtarget

// Start of file scope inline assembly      ← sub_215ACD0 (doInitialization)
...inline asm...
// End of file scope inline assembly

.extern .func (.param .b32 _) _Z3foov      ← sub_2151550 (forward declarations)
.global .texref my_tex;                     ← sub_2156420 (module-level globals)
.global .surfref my_surf;
.global .samplerref my_samp = { ... };
.global .align 4 .b8 data[1024];

.visible .entry _Z6kernelPf(               ← sub_215A3C0 (function header)
    .param .u64 _Z6kernelPf_param_0
)
.reqntid 256, 1, 1                          ← sub_214DA90 (kernel attributes)
.maxnreg 32
{
    .local .align 16 .b8 __local_depot0[64];← sub_2158E80 (frame + registers)
    .reg .b64   %SP;
    .reg .b64   %SPL;
    .reg .pred  %p<5>;
    .reg .b32   %r<47>;
    .reg .b64   %rd<8>;
    .reg .f32   %f<20>;

    // .loc 1 42 0                          ← sub_31D55F0 (per-instruction debug)
    ld.param.u64 %rd1, [_Z6kernelPf_param_0];
    mov.u32 %r1, %tid.x;
    ...
}
// -- End function

Header Directive Emission -- sub_214F370

The header is emitted once during doInitialization (sub_215ACD0). The function builds the output into a SmallString<128> buffer then flushes via OutStreamer.EmitRawText. The emission order is fixed:

  1. Comment block. "// Generated by NVIDIA NVVM Compiler", followed by "// Compiler Build ID: " with the build identifier string, then "// Based on NVVM 7.0.1" (the version string is read from llvm.ident metadata via sub_216F7F0).

  2. .version X.Y -- the PTX ISA version. Computed as PTXVersion / 10 for major, PTXVersion % 10 for minor. In cicc v13.0 targeting SM 90, this is typically .version 8.5.

  3. .target sm_XX[, texmode_independent][, debug] -- the SM target name from NVPTXSubtarget::getTargetName(). The texmode_independent modifier is appended when the driver interface is NVCL (OpenCL). If the driver interface is CUDA and the subtarget lacks double-precision support, map_f64_to_f32 is appended instead. The , debug suffix is added when MCAsmInfo::doesSupportDebugInformation() returns true.

  4. .address_size 64 (or 32) -- from NVPTXSubtarget::is64Bit(). All modern CUDA compilation uses 64-bit.
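The fixed directive order described in the list can be sketched in a few lines of Python. This is an illustrative reconstruction, not the decompiled logic: the function name header_directives and its parameters (driver, has_double, debug) are hypothetical stand-ins for the subtarget and MCAsmInfo queries named above.

```python
def header_directives(ptx_version, target_name, is_64bit,
                      driver="cuda", has_double=True, debug=False):
    """Return the three header directives in their fixed emission order.

    All parameter names are hypothetical; they model the subtarget queries.
    """
    # PTXVersion is stored as major * 10 + minor, so 85 prints as 8.5.
    major, minor = divmod(ptx_version, 10)

    target = ".target " + target_name
    if driver == "nvcl":
        # OpenCL driver interface.
        target += ", texmode_independent"
    elif not has_double:
        # CUDA driver interface without double-precision support.
        target += ", map_f64_to_f32"
    if debug:
        target += ", debug"

    return [
        ".version {}.{}".format(major, minor),
        target,
        ".address_size {}".format(64 if is_64bit else 32),
    ]
```

Calling header_directives(85, "sm_90", True) yields the .version 8.5 / .target sm_90 / .address_size 64 triple for a CUDA compile.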

The doInitialization function (sub_215ACD0) also performs two critical rejection checks: it looks up the llvm.global_ctors and llvm.global_dtors named global variables. If either is a non-empty array, it issues a fatal error of the form "Module has a nontrivial global ctor, which NVPTX does not support." GPU kernels have no program startup or teardown phase in which global constructors or destructors could execute.

Function Declaration: .entry vs .func

The function header orchestrator (sub_215A3C0) emits the complete prologue for each function definition. The emission sequence is:

Step (a): Coroutine pragma. Checks a linked list at this+792 for metadata nodes with type byte 'N' (0x4E) matching the current function. If found, emits .pragma "coroutine";.

Step (b): Linkage directive. Calls sub_214CAD0 which emits .visible, .extern, or .common depending on the function's linkage. CUDA kernel compilation mode is gated by *(this+232)->field_952 == 1.

Step (c): Entry vs function. Calls sub_1C2F070 (isKernelFunction). If the function is a kernel: emit .entry. Otherwise: emit .func.

Step (d): Return type. For .func only. Calls sub_1C2FA50 to check whether the function returns a value. If so, calls sub_214C940 to emit the return type specification (e.g., (.param .b32 retval0)). Kernels have no return values in PTX.

Step (e): Function name. sub_214D1D0 emits the mangled C++ name.

Step (f): Parameter list. sub_21502D0 (22KB) emits the complete .param declaration list. This is the most complex part of the header -- see the next section.

Step (g): Kernel attributes. Only for .entry functions. sub_214DA90 emits launch-bound and cluster directives.

Step (h): Additional attributes. sub_214E300 emits .local_maxnreg if set.

Step (i): Noreturn. If the function has metadata attribute 29 (noreturn) and is not a kernel, emits .noreturn.

Step (j): Open body. Emits {\n.

Step (k): Frame and registers. sub_2158E80 emits the local depot, stack pointer registers, and all virtual register declarations.
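Steps (a) through (j) can be summarized as an ordered list of text fragments. The sketch below is only a model of the ordering: the dict keys (linkage, is_kernel, return_type, and so on) are hypothetical, the exact whitespace between fragments is not reproduced, and the argument form of .local_maxnreg is an assumption.

```python
def function_header_steps(fn):
    """List the header fragments for `fn` (a plain dict) in emission order."""
    steps = []
    if fn.get("coroutine"):
        steps.append('.pragma "coroutine";')            # step (a)
    steps.append(fn["linkage"])                          # step (b)
    is_kernel = fn["is_kernel"]
    steps.append(".entry" if is_kernel else ".func")     # step (c)
    if not is_kernel and fn.get("return_type"):
        steps.append("(" + fn["return_type"] + ")")      # step (d), .func only
    steps.append(fn["name"])                             # step (e)
    steps.append("(" + ", ".join(fn["params"]) + ")")    # step (f)
    if is_kernel:
        steps.extend(fn.get("kernel_attrs", []))         # step (g), .entry only
    if fn.get("local_maxnreg") is not None:
        # Assumed operand form; the source only says the directive is emitted.
        steps.append(".local_maxnreg {}".format(fn["local_maxnreg"]))  # step (h)
    if fn.get("noreturn") and not is_kernel:
        steps.append(".noreturn")                        # step (i)
    steps.append("{")                                    # step (j)
    return steps
```

Step (k), the frame and register declarations, follows the opening brace and is left out here.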

.param Space Marshaling

PTX uses .param space for all function arguments. The parameter emission function sub_21502D0 handles the full taxonomy of NVPTX parameter types. The emitted parameter name follows the pattern FUNCNAME_param_N where N is a monotonic index starting at 0.

Scalar parameters are emitted as .param .TYPE _param_N where TYPE is the PTX scalar type (.b32, .b64, .f32, .f64, .pred). Scalars narrower than 32 bits are widened to 32 bits, following the PTX ABI convention that integral .param scalars occupy at least 4 bytes. The widening logic: if bit-width <= 32, widen to .b32; if 32 < bit-width < 64, widen to .b64; otherwise keep as-is.
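The widening rule is small enough to state as code. The following is a minimal sketch; the helper names widened_param_bits and scalar_param_decl are hypothetical, and only untyped .bN declarations are modeled.

```python
def widened_param_bits(bits):
    """Apply the three-way widening rule to a scalar bit-width."""
    if bits <= 32:
        return 32
    if bits < 64:
        return 64
    return bits


def scalar_param_decl(func_name, index, bits):
    """Format a .param declaration using the FUNCNAME_param_N naming pattern."""
    return ".param .b{} {}_param_{}".format(
        widened_param_bits(bits), func_name, index)
```

For example, an i8 argument at index 1 of _Z3foov would be declared as .param .b32 _Z3foov_param_1 under this rule.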

Aggregate / byval parameters are emitted as .param .align ALIGN .b8 _param_N[SIZE] -- a byte array with explicit alignment. The alignment comes from the function's DataLayout and the parameter attribute.

Texture / surface / sampler parameters get special treatment:

  • .param .texref _param_N -- texture reference (direct binding)
  • .param .surfref _param_N -- surface reference
  • .param .samplerref _param_N -- sampler reference
  • .param .u64 .ptr .texref _param_N -- pointer to texture (indirect)
  • .param .u64 .ptr .surfref _param_N -- pointer to surface
  • .param .u64 .ptr .samplerref _param_N -- pointer to sampler

The distinction between direct references and pointer-to-references reflects whether the texture/surface handle is passed by value or by indirection through a 64-bit pointer.
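The aggregate and handle forms differ only in their fixed prefixes, which a short sketch makes explicit. The function names and the indirect flag are hypothetical; the emitted strings follow the patterns listed above.

```python
HANDLE_KINDS = ("texref", "surfref", "samplerref")


def aggregate_param(func_name, index, size, align):
    """Byte-array form used for aggregate / byval parameters."""
    return ".param .align {} .b8 {}_param_{}[{}]".format(
        align, func_name, index, size)


def handle_param(func_name, index, kind, indirect=False):
    """Texture / surface / sampler parameter, direct or via a 64-bit pointer."""
    if kind not in HANDLE_KINDS:
        raise ValueError("unknown handle kind: " + kind)
    prefix = ".param .u64 .ptr ." if indirect else ".param ."
    return "{}{} {}_param_{}".format(prefix, kind, func_name, index)
```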

Call prototypes (sub_21CF8D0, 29KB) are emitted for indirect calls. When a function pointer call occurs, the AsmPrinter generates a .callprototype declaration: prototype_N : .callprototype (.param .b32 _) _ (.param .b64 _, .param .b32 _). The prototype index N is monotonically increasing.
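As a rough illustration of the prototype text and its monotonic index, consider the sketch below. The class name PrototypeEmitter is hypothetical, and starting the index at 0 is an assumption; the source only states that the index increases monotonically.

```python
import itertools


class PrototypeEmitter:
    """Formats .callprototype declarations with an increasing index."""

    def __init__(self):
        # Assumed to start at 0; the source only says "monotonically increasing".
        self._next_index = itertools.count()

    def emit(self, return_params, arg_params):
        index = next(self._next_index)
        # A void callee has no parenthesized return list before the "_" name.
        ret = "(" + ", ".join(return_params) + ") " if return_params else ""
        return "prototype_{} : .callprototype {}_ ({})".format(
            index, ret, ", ".join(arg_params))
```

Each indirect call site would take the next index, so two calls through the same pointer type still receive distinct prototype labels in this model.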

Register Declarations

Inside the function body, sub_2158E80 emits register declarations for every virtual register class used. The nine register classes, their vtable addresses, PTX type suffixes, prefixes, and encoded IDs are documented in Register Classes. The encoding scheme, declaration emission format, and the internal-only tenth class are covered in Register Encoding Scheme and Register Declaration Emission.

The emitted text for each class follows the pattern:

.reg .pred  %p<5>;       ← 5 predicate registers needed
.reg .b16   %rs<12>;     ← 12 short integer registers
.reg .b32   %r<47>;      ← 47 general-purpose 32-bit
.reg .b64   %rd<8>;      ← 8 double-width integer
.reg .f32   %f<20>;      ← 20 single-precision float
.reg .f64   %fd<3>;      ← 3 double-precision float

The count for each class is max_register_index + 1. The emitter iterates the function's virtual register map at this+800, deduplicates register classes using a hash table at this+808..832, and tracks the maximum index per class.
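The max_register_index + 1 rule can be sketched as follows. The class keys ("pred", "i16", ...) and the function name are hypothetical, and the real emitter walks the virtual register map rather than a list of pairs; only the PTX types and prefixes come from the sample output.

```python
# (PTX type, register prefix) per class, in the order of the sample output.
REG_CLASSES = {
    "pred": (".pred", "%p"),
    "i16": (".b16", "%rs"),
    "i32": (".b32", "%r"),
    "i64": (".b64", "%rd"),
    "f32": (".f32", "%f"),
    "f64": (".f64", "%fd"),
}


def register_decls(vregs):
    """Build .reg lines from (class, index) pairs; count = max index + 1."""
    max_index = {}
    for cls, index in vregs:
        max_index[cls] = max(max_index.get(cls, -1), index)

    lines = []
    for cls, (ptx_type, prefix) in REG_CLASSES.items():
        if cls in max_index:  # unused classes get no declaration
            lines.append(".reg {:<7}{}<{}>;".format(
                ptx_type, prefix, max_index[cls] + 1))
    return lines
```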

The stack frame is emitted before registers when the function has a non-zero local frame:

.local .align 16 .b8 __local_depot0[512];   ← ALIGN from frame info, N = function index
.reg .b64   %SP;                             ← stack pointer (64-bit mode)
.reg .b64   %SPL;                            ← stack pointer local

The __local_depot name is a fixed prefix (#define DEPOTNAME "__local_depot" in the source). %SP is the global stack pointer; %SPL points into the local depot. In 32-bit mode these are .reg .b32.
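A small sketch of the frame prologue, assuming the three lines are emitted together whenever the frame size is non-zero. The function and parameter names are hypothetical.

```python
DEPOT_NAME = "__local_depot"


def frame_decls(function_index, frame_size, align, is_64bit=True):
    """Return the local depot and stack pointer declarations, if any."""
    if frame_size == 0:
        return []  # no local frame, nothing emitted
    reg_type = ".b64" if is_64bit else ".b32"
    return [
        ".local .align {} .b8 {}{}[{}];".format(
            align, DEPOT_NAME, function_index, frame_size),
        ".reg {}   %SP;".format(reg_type),
        ".reg {}   %SPL;".format(reg_type),
    ]
```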

Global Variable & Texture Emission -- sub_2156420

Module-level global variables are emitted by sub_2156420 (20KB), called from emitGlobals during doInitialization. Globals must be emitted in topological order because ptxas does not support forward references. The ordering is computed by sub_2157D50 which performs a DFS over global variable use-def chains, detecting circular dependencies (fatal: "Circular dependency found in global variable set").
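The ordering requirement amounts to a depth-first topological sort with cycle detection. The sketch below shows that technique in general form; it is not a transcription of sub_2157D50, and the deps mapping (global name to the names its initializer references) is a hypothetical input format.

```python
def order_globals(deps):
    """Return globals so that every dependency precedes its user.

    `deps` maps a global's name to the names its initializer references.
    """
    order, done, in_progress = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in in_progress:
            # Reaching a node that is still on the DFS stack means a cycle.
            raise RuntimeError(
                "Circular dependency found in global variable set")
        in_progress.add(name)
        for dep in deps.get(name, ()):
            visit(dep)
        in_progress.discard(name)
        done.add(name)
        order.append(name)  # post-order: dependencies are already emitted

    for name in deps:
        visit(name)
    return order
```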

Texture references: .global .texref NAME; -- emitted when sub_1C2E830 classifies the global as a texture. Surface references: .global .surfref NAME;. Sampler references get an optional initializer block:

.global .samplerref my_sampler = {
    addr_mode_0 = clamp_to_edge,
    addr_mode_1 = wrap,
    filter_mode = linear,
    force_unnormalized_coords = 1
};

Address mode values: wrap, clamp_to_border, clamp_to_edge, mirror. Filter mode values: nearest, linear. The force_unnormalized_coords field is boolean.

Data globals receive an address-space qualifier from sub_214FA80: .global (addrspace 1), .shared (addrspace 3), .const (addrspace 4), .local (addrspace 5). Managed-memory globals get .attribute(.managed). Unified addressing gets .attribute(.unified) or .attribute(.unified(N)).
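The address-space mapping reads naturally as a lookup table. In this sketch the function name storage_prefix is hypothetical, and placing .attribute(.managed) directly after the qualifier is an assumption about the output layout.

```python
ADDRESS_SPACE_QUALIFIER = {
    1: ".global",
    3: ".shared",
    4: ".const",
    5: ".local",
}


def storage_prefix(addrspace, managed=False):
    """Return the state-space qualifier for a data global."""
    if addrspace not in ADDRESS_SPACE_QUALIFIER:
        raise ValueError("no qualifier listed for addrspace {}".format(addrspace))
    prefix = ADDRESS_SPACE_QUALIFIER[addrspace]
    if managed:
        # Assumed position: immediately after the state-space qualifier.
        prefix += " .attribute(.managed)"
    return prefix
```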

Skipped globals: Variables whose names start with "llvm.metadata", "llvm.", or "nvvm." are silently skipped.

Demoted globals (shared memory demotion, addrspace 3) emit a comment: "// NAME has been demoted".

Instruction Emission -- sub_31EC4F0

The core emission loop emitFunctionBody at sub_31EC4F0 (12KB) overrides llvm::AsmPrinter::emitFunctionBody. It allocates a 0xF28-byte stack frame (holding SmallString buffers, a DenseMap for instruction-mix statistics, and tracking structures) and proceeds through three phases:

Phase 1: Per-MBB Outer Loop

Iterates the MachineFunction's MBB linked list. The iteration strips tagged-pointer bits (AND ~7) from the ilist node pointers. For each MBB:

  1. Calls emitBasicBlockStart(MBB) via vtable dispatch.
  2. Enters the instruction inner loop.
  3. Calls emitBasicBlockEnd(MBB).
  4. Collects instruction-mix statistics when debug counters are active.
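The tagged-pointer stripping in the outer loop can be modeled with integers. This is a toy sketch: the memory dict stands in for "load the next pointer from this node", and the node layout and sentinel handling are assumptions rather than the actual ilist representation.

```python
TAG_MASK = 0x7  # low three bits carry tag data, not address bits


def strip_tag(raw_pointer):
    """Clear the tag bits (the AND ~7 seen in the disassembly)."""
    return raw_pointer & ~TAG_MASK


def walk_blocks(memory, head, sentinel):
    """Follow tagged next-pointers until the sentinel node is reached."""
    visited = []
    node = strip_tag(head)
    while node != sentinel:
        visited.append(node)
        node = strip_tag(memory[node])  # next pointer may carry tag bits
    return visited
```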

Phase 2: Per-Instruction Inner Loop

For each MachineInstr, reads the opcode at MI+0x44 (uint16) and dispatches through a 46-case jump table:

Default path (real instructions): Calls emitInstruction(MI) via [vtable+0x128], which dispatches to the tablegen-generated printInstruction(). This function uses the NVPTXGenAsmWriter.inc tables to format each instruction: printInstruction() calls NVPTXInstPrinter::printOperand for each operand, producing text like mov.u32 %r0, %r1 or add.f32 %f2, %f0, %f1. After emission, the instruction counter is incremented and, if debug info is present, sub_31D55F0 emits a .loc directive.

Inline assembly (opcodes 1, 2): Routed to sub_31F26A0 / sub_397DF10 (30KB). The inline asm handler parses ${} operand references, handles .att_syntax / .intel_syntax mode switching, and emits // begin inline asm / // end inline asm comment markers. PTX inline assembly is passed through essentially verbatim, with operand substitution.

Meta-instructions (opcodes 3-7, 10-18): These include STACKMAP, PATCHPOINT, EH_LABEL, GC_LABEL, KILL, CFI_INSTRUCTION, DBG_VALUE, DBG_VALUE_LIST, and DBG_LABEL. Most emit labels or debug comments rather than PTX instructions. The KILL pseudo emits a "kill:" comment listing each killed register with sub_2FF6320 (printReg). DBG_LABEL emits "DEBUG_LABEL: <label>".

Convergence control (opcodes 24, 33): CONVERGENCECTRL_ENTRY calls sub_31DB9B0 to mark the entry point of a convergent region. CONVERGENCECTRL_LOOP calls sub_31DB950 to mark a loop-back convergence point. These pseudo-instructions are critical for the PTX assembler to correctly track warp divergence and reconvergence. See the dedicated Convergence Control Framework section below for the full lowering pipeline.

FAKE_USE (opcode 43): Debug-only. Emits "fake_use:" followed by register operands.

MEMBARRIER (opcode 44): Emits "MEMBARRIER" as a raw comment.

Pre- and post-instruction hooks: Before each instruction, the Handlers vector at this+0x240 is iterated, calling beginInstruction(MI) on each handler. After each instruction, endInstruction() is called. The AsmPrinter maintains two handler lists (at +0x240 and +0x228) supporting both debug-info handlers and exception/unwind handlers.
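The opcode routing described in this phase can be condensed into a classifier. The category labels are hypothetical names, and opcodes not mentioned in the text are simply sent to the default path here, which may not match the binary for every pseudo-opcode.

```python
def classify_opcode(opcode):
    """Map a MachineInstr opcode to the emission path described above."""
    if opcode in (1, 2):
        return "inline_asm"
    if 3 <= opcode <= 7 or 10 <= opcode <= 18:
        return "meta"
    if opcode == 24:
        return "convergence_entry"
    if opcode == 33:
        return "convergence_loop"
    if opcode == 43:
        return "fake_use"
    if opcode == 44:
        return "membarrier"
    # Everything else is treated as a real instruction in this sketch.
    return "default"
```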

Phase 3: Post-Function Processing

After all MBBs are emitted:

  1. Zero-length function avoidance. If no real instructions were emitted (tracked by var_F30 and var_ED1), inserts a NOP via sub_31DCBB0 with comment "avoids zero-length function".
  2. Function-end label. Creates a "func_end" temp symbol via sub_31DCC50 and emits it for DWARF range tracking.
  3. DWARF line table finalization. Creates CIE/FDE symbols, binds them via emitAssignment, and inserts a debug-loc entry for the function-end symbol.
  4. Handler finalization. Calls endFunction(MF) on all handlers in both lists.
  5. PGO / BBAddrMap emission. If enabled via dword_50360A8, emits BB address maps for profile-guided optimization. Missing labels trigger diagnostic: "pgo-analysis-map is enabled for function... but it does not have labels".
  6. End comment. Emits "-- End function\n" as a raw comment.

Debug Info Emission

Debug information in PTX is emitted as .loc and .file directives embedded in the instruction stream, not as separate DWARF sections (the PTX assembler ptxas constructs the actual DWARF from these directives).

The debug emission is layered:

Layer                       Function                    Behavior
──────────────────────────  ──────────────────────────  ────────────────────────────────────────
Per-instruction .loc        sub_31D55F0                 Emits .loc FileIndex Line Col for instructions with attached DebugLoc
Source-line comments        sub_31D89B0                 Emits source location as comments when asm-printer debug counter is active
Function-name + inlined-at  emitInlinedAtInfo (NVIDIA)  Appends , function_name LAB, inlined_at FILE LINE COL to .loc
Per-MBB boundary            sub_31E6100                 Maintains file/line-to-MCSymbol mapping for MBB boundaries
.file directives            emitDwarfFileEntries        Maps source filenames to file indices during doFinalization
DWARF line section          sub_E81A00                  Binds CIE/FDE symbols for line table construction

The NVIDIA extension to .loc is the function_name and inlined_at attributes. Upstream LLVM's .loc only has file line column. cicc appends inlining context so that ptxas can reconstruct the full inline call stack in DWARF. The InlinedAtLocs set tracks which inlined-at locations have already been emitted, preventing duplicates. A work list (SmallVector<DebugLoc, 8>) is built by walking the inlined-at chain, then emitted in reverse order so that outer locations appear before inner ones.
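The work-list walk can be illustrated with a tiny model. The Loc class, its key method, and the choice to skip (rather than stop at) already-emitted entries are all assumptions made for this sketch; the real code operates on DebugLoc and the InlinedAtLocs set.

```python
class Loc:
    """Minimal stand-in for a DebugLoc with an optional inlined-at parent."""

    def __init__(self, file, line, col, inlined_at=None):
        self.file, self.line, self.col = file, line, col
        self.inlined_at = inlined_at

    def key(self):
        return (self.file, self.line, self.col)


def pending_inlined_at(loc, emitted):
    """Return not-yet-emitted inlined-at locations, outermost first."""
    work = []
    current = loc.inlined_at
    while current is not None:      # walk inner -> outer
        work.append(current)
        current = current.inlined_at

    result = []
    for entry in reversed(work):    # emit outer -> inner
        if entry.key() not in emitted:
            emitted.add(entry.key())
            result.append(entry)
    return result
```

A second instruction that shares the same inlining chain produces an empty list in this model, which corresponds to the duplicate suppression described above.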

When InterleaveSrcInPtx is enabled, the AsmPrinter reads source file lines and emits them as comments interleaved with the PTX.

Module-Level Metadata Directives

Kernel launch-bound metadata directives are emitted by sub_214DA90 in this order:

Directive                   Metadata Source                        Notes
──────────────────────────  ─────────────────────────────────────  ────────────────────────────────────
.blocksareclusters          nvvm.blocksareclusters                 Fatal error if .reqntid not also set
.reqntid X, Y, Z            nvvm.reqntid (comma-separated strtol)  Unspecified dims default to 1
.maxntid X, Y, Z            Structured metadata readers            Unspecified dims default to 1
.minnctapersm N             sub_1C2EF70                            Min CTAs per SM
.explicitcluster            nvvm.cluster_dim                       SM 90+ only (field_1212 > 0x59)
.reqnctapercluster X, Y, Z  Cluster dim readers                    SM 90+ only
.maxclusterrank N           sub_1C2EF50                            SM 90+ only
.maxnreg N                  sub_1C2EF90                            Register limit per thread
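As an example of the "unspecified dims default to 1" rule, the sketch below formats a .reqntid directive from a comma-separated string. It uses Python's int() instead of strtol, so malformed input raises an exception rather than parsing a numeric prefix; the function name is hypothetical.

```python
def reqntid_directive(value):
    """Format .reqntid from a comma-separated string such as "256" or "8, 4"."""
    parts = [p.strip() for p in value.split(",")]
    dims = [int(p) for p in parts if p][:3]
    dims += [1] * (3 - len(dims))  # missing Y / Z dimensions default to 1
    return ".reqntid {}, {}, {}".format(*dims)
```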

The .pragma "nounroll" directive is emitted at MBB level by sub_3970E40 when llvm.loop.unroll.disable metadata is detected on a loop header. This is an NVIDIA modification to the MBB printer.

The .abi_preserve family of directives is emitted by sub_3937240: .abi_preserve, .abi_preserve_after, .abi_preserve_uniform, .abi_preserve_control. These are NVIDIA-specific PTX directives for register ABI preservation across function calls.

Convergence Control Framework

CUDA's SIMT execution model requires the compiler to track which threads in a warp must execute the same instruction simultaneously. When a conditional branch causes warp divergence (some threads take one path, others take the other), the hardware needs to know where threads reconverge. The convergence control framework propagates this information from LLVM IR intrinsics through MachineInstr pseudo-instructions to the final PTX output, where ptxas uses it to emit correct convergence/reconvergence barriers in SASS.

Three-Layer Architecture

Convergence information flows through three representation layers during compilation:

LLVM IR                    MachineInstr                AsmPrinter
─────────────────────      ──────────────────────      ──────────────────
llvm.experimental          CONVERGENCECTRL_ENTRY       sub_31DB9B0
  .convergence.entry  →    (opcode 24)            →   (emitConvergenceEntry)

llvm.experimental          CONVERGENCECTRL_LOOP        sub_31DB950
  .convergence.loop   →    (opcode 33)            →   (emitConvergenceLoop)

llvm.experimental          CONVERGENCECTRL_ANCHOR      (no AsmPrinter case --
  .convergence.anchor →    (opcode 34)                  dropped before emission)

"convergencectrl"          (operand bundle tag          (verified at IR level,
 operand bundle      →      preserved through ISel)      consumed by pseudo-instrs)

Layer 1: IR intrinsics. Three llvm.experimental.convergence.* intrinsics define convergent regions at the LLVM IR level. Each returns an abstract "convergence token" (type token) that is consumed by calls carrying the convergencectrl operand bundle. The bundle ties a call to a specific convergence scope. The inliner at sub_29ED7A0 reports "convergent call needs convergencectrl operand" when a call marked with the convergent attribute (attribute kind 0x34 = 52) lacks the bundle.

Layer 2: MachineInstr pseudo-opcodes. During instruction selection (SelectionDAG lowering), the convergence intrinsics are lowered to target-independent MachineInstr pseudo-opcodes. These survive register allocation and all machine-level optimization passes unchanged -- they carry no register operands and produce no real instructions. Their sole purpose is to mark positions in the MBB instruction stream for the AsmPrinter.

Layer 3: AsmPrinter emission. The emitFunctionBody loop at sub_31EC4F0 dispatches opcodes 24 and 33 to dedicated emitter functions that translate the pseudo-instructions into whatever PTX annotation ptxas requires for reconvergence tracking. The CONVERGENCECTRL_ANCHOR pseudo (opcode 34) does not appear in the AsmPrinter's 46-case jump table, indicating it is either dropped during ISel or consumed by an earlier machine pass.

Convergence Token Semantics

The convergence token model enforces a strict dominance and nesting discipline:

  1. convergence.entry produces a token that represents the function's entry convergence scope. All threads that enter the function are converged at this point. The token must dominate all its uses.

  2. convergence.loop produces a token scoped to a natural loop. The token marks the point where loop-back-edge threads reconverge before the next iteration. The loop header must dominate all blocks in the cycle.

  3. convergence.anchor produces a token at an arbitrary program point, used for structured convergence within non-loop regions (e.g., structured if/else regions where reconvergence is needed at the join point).

  4. convergencectrl operand bundle attaches a convergence token to a call site. This tells the compiler "this call must execute with the set of threads defined by this token's scope." For example:

%tok = call token @llvm.experimental.convergence.entry()
%result = call float @__shfl_sync(i32 %mask, float %val, i32 %lane)
          [ "convergencectrl"(token %tok) ]

The LLVM verifier (sub_BFC6A0, 211KB) checks that convergent calls carry the bundle; the convergence verifier (sub_E35A10, 14KB) checks the structural invariants.

ConvergenceVerifier -- sub_E35A10

The standalone convergence verification pass at sub_E35A10 (14KB) enforces five invariants on convergence token usage:

Invariant               Diagnostic String
──────────────────────  ────────────────────────────────────────────────────────
Token dominance         "Convergence control token must dominate all its uses."
Region nesting          "Convergence region is not well-nested."
Cycle heart dominance   "Cycle heart must dominate all blocks in the cycle."
Single token per cycle  "Two static convergence token uses in a cycle..."
Loop token type         Checks llvm.experimental.convergence.loop usage in cycles

The verifier calls sub_B19720 for domination checks, sub_E342D0 for cycle detection (using the generic cycle info infrastructure), sub_E45390 for diagnostic emission, and sub_E348A0 for error reporting. It runs as part of the IR verification pipeline, not as a separate pass -- the convergence invariants are checked alongside other LLVM IR well-formedness rules.

NVIDIA Convergent Branch Intrinsics

In addition to the upstream llvm.experimental.convergence.* intrinsics, cicc defines two NVIDIA-specific convergent branch intrinsics that interact with the convergence framework:

Intrinsic                           Builtin ID   Minimum SM       Error on Violation
──────────────────────────────────  ───────────  ───────────────  ──────────────────────────────────────────
llvm.nvvm.branch.if.all.convergent  3755 / 8282  sm_70+ (Volta)   "not supported on pre-Volta Architectures"
llvm.nvvm.branch.if.convergent      3754 / 8283  sm_80+ (Ampere)  "not supported on pre-Ampere Architectures"

These intrinsics produce a boolean result that must be consumed by exactly one branch instruction (enforced by sub_2C7B6A0 with diagnostic: "result of llvm.nvvm.branch.if.convergent and llvm.nvvm.branch.if.all.convergent can only be used by exactly one branch instruction"). The .all variant tests whether all threads in the warp are converged (equivalent to a "uniform predicate" test); the non-.all variant tests whether the current execution context is convergent (the thread set matches the convergence token's scope).

SM version gating is checked in both the NVVM verifier (sub_1C36530) and the lowering pass (sub_2C7B6A0). The SM version is stored as SM * 10 internally (so sm_70 = 700, sm_80 = 800), compared against thresholds at unk_4D045E8.
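The gating comparison can be sketched as follows, using the SM * 10 encoding. The table and function names are hypothetical, and the thresholds are taken from the stated minimums (sm_70 and sm_80) rather than read from unk_4D045E8.

```python
# intrinsic -> (minimum encoded SM, diagnostic on violation)
REQUIREMENTS = {
    "llvm.nvvm.branch.if.all.convergent":
        (700, "not supported on pre-Volta Architectures"),
    "llvm.nvvm.branch.if.convergent":
        (800, "not supported on pre-Ampere Architectures"),
}


def sm_gate_error(intrinsic, sm):
    """Return the diagnostic if `sm` (e.g. 75) is too old, else None."""
    threshold, message = REQUIREMENTS[intrinsic]
    encoded = sm * 10  # internal encoding: sm_70 -> 700
    return None if encoded >= threshold else message
```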

The convergent Function Attribute (Kind 0x34)

The convergent function/call attribute (attribute kind 52, bit 0x20 at byte offset +33 in the function attribute flags) marks operations that have warp-synchronous semantics. This attribute affects multiple compilation stages:

Constant folding gate (sub_2C7B430). The NVIDIA intrinsic fold function checks hasAttribute(callee, -1, 0x34) before attempting any constant fold. If the callee is convergent, folding is rejected unconditionally -- even if all arguments are compile-time constants. This prevents __syncthreads(), __ballot_sync(), __shfl_sync(), and warp-vote operations from being eliminated.

Inline asm convergence flag. During SelectionDAG lowering of inline assembly (sub_1560260), the convergent attribute is tested via operand bundle or function attribute. If set, bit 5 of the inline asm flags word is set (isConvergent), encoding into the DAG node as: flags = hasSideEffects | (isAlignStack << 1) | (dialect << 2) | (convergent << 5).
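The flag-word layout given above packs into an integer as shown in this sketch. The function names are hypothetical; the bit positions are exactly those in the formula (bits 3 and 4 are left untouched here).

```python
def pack_inline_asm_flags(has_side_effects, is_align_stack, dialect, convergent):
    """Pack the inline asm flags word; `dialect` is 0 or 1."""
    return (int(has_side_effects)
            | (int(is_align_stack) << 1)
            | (dialect << 2)
            | (int(convergent) << 5))


def is_convergent(flags):
    """Test bit 5 of the flags word."""
    return bool(flags & (1 << 5))
```

A convergent asm statement with side effects and the default dialect therefore encodes as 0x21.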

Loop unrolling epilog forcing. When a loop body contains convergent calls (hasCallInLoop check), the unroller forces epilog remainder style rather than prolog, because epilog preserves the property that all threads participate in each full iteration of the unrolled body.

StructurizeCFG skip. Functions carrying the convergent attribute (attribute ID 56 in the attribute check at sub_B2D610) are skipped by the StructurizeCFG pass -- they are assumed to already have correct convergence structure.

Dead barrier elimination gate. The dead sync elimination engine (sub_2C83D20) identifies barrier intrinsics by checking bit 0x20 at byte +33 (the convergent attribute flag) on the callee, combined with opcode 85 (the internal barrier opcode) and a barrier intrinsic ID confirmation via sub_CEA1A0.

Operand Bundle Registration

The convergencectrl operand bundle tag is registered during LLVMContext initialization at sub_B6EEA0 (9KB), alongside the other standard bundle tags:

Operand bundle tags registered at context creation:
  "funclet"           -- EH funclet scope
  "gc-transition"     -- GC state transition
  "ptrauth"           -- pointer authentication
  "kcfi"              -- kernel control flow integrity
  "convergencectrl"   -- convergence token attachment

These tags are interned as string IDs in the context's operand bundle tag table. When the bitcode reader parses a call instruction with operand bundles (sub_14FCE40, 107KB), the convergencectrl bundle is reconstructed from the bitcode record and attached to the CallInst/InvokeInst. The inliner at sub_29ED7A0 (96KB) checks "convergent call needs convergencectrl operand" to verify that convergent calls in the callee carry appropriate bundles after inlining.

Pseudo-Instruction Lowering in emitFunctionBody

The emitFunctionBody loop at sub_31EC4F0 handles the two convergence pseudo-instructions as part of its 46-case opcode switch:

Case 24 -- CONVERGENCECTRL_ENTRY. Calls sub_31DB9B0 (emitConvergenceEntry). This function sits 0x60 bytes after sub_31DB950 in the binary layout (0x31DB950 to 0x31DB9B0), so the two emitters are adjacent. The entry pseudo marks the function entry convergence point. It does not emit visible PTX text -- instead it updates internal state that the OutStreamer uses for reconvergence tracking in the generated object.

Case 33 -- CONVERGENCECTRL_LOOP. Calls sub_31DB950 (emitConvergenceLoop). This marks loop-back convergence points. Like the entry pseudo, it produces no visible PTX output but influences ptxas's reconvergence analysis.

Both pseudo-instructions are "silent" -- they do not increment the instruction counter (var_F30), do not trigger .loc emission, and do not invoke the beginInstruction/endInstruction handler callbacks. They fall through the switch without reaching the default path's instruction-counting logic.

Post-Function Convergence Close-Out

After all MBBs in a function are emitted, the emitFunctionBody function performs convergence-related cleanup in Phase 3a (0x31ECFFD-0x31ED0FA):

Phase 3a: Convergence control close-out
  if (var_ED1 == true):                          // any real instructions seen?
      OutStreamer->emitAlignment(MF->getAlignment())
      for sym in MF->globalSymbolTable[0x48..0x50]:
          if (sym[-0x16] & 0x7FFF) != 0:         // visibility flags
              sub_31E1750(sym)                    // resolveBlockAddress
              if block_was_removed:
                  emit diagnostic "Address of block that was removed by Co..."
                  OutStreamer->emitLabel(fallback_sym)

The var_ED1 flag tracks whether any non-meta instructions appeared in the function body. When set, the close-out phase emits function alignment, resolves block-address symbols in the global symbol table (checking visibility flags at sym[-0x16] & 0x7FFF), and handles the edge case where a basic block was removed by CodeGen after a block-address was taken -- this would produce a dangling convergence reference, so a diagnostic is emitted and a fallback label is created.

Convergence and the StructurizeCFG Pass

The StructurizeCFG pass (documented in StructurizeCFG) is the primary consumer of convergence information during the CFG transformation phase. PTX requires reducible control flow: every back-edge must target a loop header that dominates all blocks in the cycle, and every divergent branch must reconverge at a post-dominator.

The pass performs a domtree-guided reconvergence insertion that stores head/tail pointers into function metadata at *(func_obj+672) and *(func_obj+680). These pointers are read by subsequent PTX emission passes to emit correct convergence annotations. Functions with the convergent attribute (or optnone) are skipped entirely -- they are assumed to already have correct structure.

When non-uniform divergent regions are identified, the pass creates new "reconvergence" basic blocks, copies phi entries, and reroutes edges so that all divergent paths merge at a single post-dominator. The sub_35CB4A0 uniformity check and sub_35C9ED0 NCA (nearest common ancestor) computation in the dominator tree determine where reconvergence points are inserted.

NVIDIA Extensions Beyond Upstream

cicc's AsmPrinter diverges from upstream LLVM's NVPTXAsmPrinter in several important ways:

Convergence control pseudo-instructions. Upstream LLVM (as of the LLVM 20 base) has llvm.experimental.convergence.* intrinsics, but the AsmPrinter handling of CONVERGENCECTRL_ENTRY and CONVERGENCECTRL_LOOP as dedicated opcode cases (24 and 33 in the jump table) with calls to sub_31DB9B0 / sub_31DB950 is cicc-specific. These ensure correct warp-level synchronization semantics in the emitted PTX. Additionally, cicc adds two NVIDIA-specific convergent branch intrinsics (llvm.nvvm.branch.if.convergent for sm_80+ and llvm.nvvm.branch.if.all.convergent for sm_70+) that have no upstream equivalent. See the Convergence Control Framework section for the full pipeline.

Enhanced .loc with inlined-at. The function_name and inlined_at extensions to .loc directives are NVIDIA additions. Upstream LLVM's NVPTX backend emits only standard .loc file line col. cicc's version walks the full inlining chain to produce richer debug information.

Cluster directives (SM 90+). The entire cluster attribute family (.blocksareclusters, .explicitcluster, .reqnctapercluster, .maxclusterrank) and the 15 cluster special registers are NVIDIA extensions to PTX not present in upstream LLVM's NVPTX backend.

.abi_preserve directives. The register ABI preservation annotations emitted by sub_3937240 have no upstream equivalent.

.pragma "coroutine". The coroutine pragma emission in the function header orchestrator is NVIDIA-specific, supporting CUDA coroutine execution.

PGO/BBAddrMap integration. The BBAddrMap and PGO analysis info structures (0x80 and 0x98 bytes respectively, dynamically allocated when analysis passes are absent) are LLVM 16+ features that cicc integrates into the PTX emission path.

Instruction-mix statistics. The per-MBB instruction-mix collection ("INST_<name>: <count>" format) under the "asm-printer" statistic group is significantly more elaborate than upstream's simple instruction counter.

Dual handler lists. cicc maintains two separate AsmPrinterHandler lists (at this+0x240 and this+0x228), iterated independently for beginInstruction/endInstruction/endFunction. Upstream uses a single handler list.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| NVPTXAsmPrinter pass registration | sub_214ABE0 | -- | -- |
| Return type / .attribute(.unified) emission | sub_214C940 | 1.9KB | -- |
| Linkage directive emission (.visible/.extern/.common) | sub_214CAD0 | 2.4KB | -- |
| Kernel attribute emission (.reqntid, .maxnreg, cluster) | sub_214DA90 | 8.7KB | -- |
| .local_maxnreg emission | sub_214E300 | 1.3KB | -- |
| emitHeader (.version, .target, .address_size) | sub_214F370 | 7.2KB | -- |
| Address space qualifier emission | sub_214FA80 | 1.9KB | -- |
| emitFunctionParamList (.param declarations) | sub_21502D0 | 22KB | -- |
| Parameter name generation (_param_N) | sub_2150230 | -- | -- |
| Function forward declaration emission | sub_2151550 | 3.9KB | -- |
| emitFunctionEntryLabel (.entry/.func) | sub_2151D30 | 7.0KB | -- |
| Function alias emission (.alias) | sub_21518E0 | 5.0KB | -- |
| Static initializer expression emission | sub_2153350 | 5.3KB | -- |
| Byte-level constant data emission | sub_2153AE0 | 9.9KB | -- |
| printModuleLevelGV (texref/surfref/samplerref/data) | sub_2156420 | 20KB | -- |
| Global variable topological sort | sub_2157D50 | 5.9KB | -- |
| Register class -> encoded ID | sub_21583D0 | 4.6KB | -- |
| Stack frame + register declaration emission | sub_2158E80 | 17KB | -- |
| Function header orchestrator | sub_215A3C0 | 10KB | -- |
| Module-level emission entry (ctor/dtor check, DWARF) | sub_215ACD0 | 8.1KB | -- |
| GenericToNVVM pass registration | sub_215DC20 | -- | -- |
| Register class -> PTX type suffix | sub_2163730 | 1.7KB | -- |
| Register class -> PTX register prefix | sub_21638D0 | 1.6KB | -- |
| llvm.ident / "Based on NVVM 7.0.1" reader | sub_216F7F0 | 5.7KB | -- |
| emitCallPrototype (.callprototype for indirect calls) | sub_21CF8D0 | 29KB | -- |
| Atomic opcode emission (13 operations) | sub_21E5E70 | -- | -- |
| L2-hinted atomic emission (SM 80+) | sub_21E6420 | -- | -- |
| Address space conversion (cvta) + MMA helpers | sub_21E7FE0 | -- | -- |
| Standard special register emission (%tid, %ctaid, etc.) | sub_21E86B0 | -- | -- |
| Cluster barrier emission (SM 90+) | sub_21E8EA0 | -- | -- |
| Cluster special register emission (SM 90+) | sub_21E9060 | -- | -- |
| Memory barrier emission (membar/fence) | sub_21E94F0 | -- | -- |
| printReg (register number -> %rN string) | sub_2FF6320 | -- | -- |
| Per-instruction .loc DWARF directive | sub_31D55F0 | -- | -- |
| Instruction-level debug comment emission | sub_31D89B0 | -- | -- |
| emitConvergenceEntry (CONVERGENCECTRL_ENTRY pseudo, opcode 24) | sub_31DB9B0 | -- | -- |
| emitConvergenceLoop (CONVERGENCECTRL_LOOP pseudo, opcode 33) | sub_31DB950 | -- | -- |
| ConvergenceVerifier::verify (token dominance/nesting checks) | sub_E35A10 | 14KB | -- |
| Cycle detection for convergence verification | sub_E342D0 | -- | -- |
| Convergence verification error reporting | sub_E348A0 | -- | -- |
| Inliner/verifier core ("convergent call needs convergencectrl operand") | sub_29ED7A0 | 96KB | -- |
| NVVM convergent branch intrinsic SM-version gating | sub_1C36530 | -- | -- |
| Convergent branch lowering + single-use enforcement | sub_2C7B6A0 | -- | -- |
| Metadata kind + operand bundle tag registration (incl. convergencectrl) | sub_B6EEA0 | 9KB | -- |
| emitNops (zero-length function avoidance) | sub_31DCBB0 | -- | -- |
| createTempSymbol ("func_end", "Ltmp") | sub_31DCC50 | -- | -- |
| emitFunctionBody (main loop) | sub_31EC4F0 | 12KB | -- |
| emitInlineAsm | sub_31F26A0 | -- | -- |
| .abi_preserve directive emission | sub_3937240 | 14KB | -- |
| MBB printer + .pragma "nounroll" | sub_3970E40 | 18KB | -- |
| doFinalization | sub_3972F10 | 24KB | -- |
| emitInlineAsm (parser/streamer) | sub_397DF10 | 30KB | -- |

Cross-References

  • PTX Emission -- hub page for the emission stage with additional detail on atomic/barrier/special-register emission
  • Code Generation -- the MachineInstr-producing stage that feeds the AsmPrinter
  • SelectionDAG -- instruction selection that creates the MachineInstrs
  • NVPTX Call ABI -- .param space calling convention detail
  • Register Allocation -- determines which virtual registers exist for the register declaration phase
  • Inliner Cost Model -- inlining decisions that create the inlined-at debug chains the AsmPrinter must emit
  • StructurizeCFG -- CFG restructuring pass that creates reconvergence basic blocks for divergent control flow
  • Dead Sync Elimination -- dead barrier elimination engine that uses the convergent attribute to identify barrier intrinsics
  • SM 70-89 Architecture -- SM version gating for convergent branch intrinsics
  • GPU Execution Model -- SIMT warp divergence/reconvergence background