OR-1 dataflow CPU sketch

Assembler Redesign Plan: Frame-Based PE Model

This document describes the changes needed in the OR1 assembler (asm/) to match the frame-based PE redesign described in pe-design.md and architecture-overview.md. It covers every assembler pass and supporting module, identifying what changes, what stays, and the dependency order for implementation.

The assembler pipeline is: parse -> lower -> expand -> resolve -> place -> allocate -> codegen. Most changes concentrate in allocate and codegen; earlier passes are minimally affected.

1. Overview

The PE redesign replaces the context-slot matching model with a frame-based model. Key architectural changes that affect the assembler:

  • Context slots replaced by activation IDs and frames. The old ctx_slots parameter (up to 16 slots) becomes frame_count (4 concurrent frames) with 3-bit act_id (8 unique IDs). Each frame has 64 addressable slots.
  • Instructions become templates. IRAM entries are 16-bit words [type:1][opcode:5][mode:3][wide:1][fref:6]. Constants and destinations are NOT in the instruction -- they live in frame slots referenced by fref.
  • Instruction deduplication. Multiple activations can share the same IRAM entry because per-activation data lives in frames, not instructions. The number of IRAM entries needed is the number of unique operation shapes, not the total operations executed.
  • 8 matchable offsets per frame. Dyadic instructions must be assigned offsets 0-7 within each activation. The assembler must enforce this constraint and split function bodies that exceed it.
  • Pre-formed flit 1 values. Output destinations are 16-bit packed flit 1 values stored in frame slots. The PE reads the slot and puts it on the bus verbatim as flit 1. Token type is determined by the prefix bits in the stored flit value.
  • Frame setup via PE-local write tokens. Constants, destinations, and other per-activation data are loaded into frame slots by a stream of PE-local write tokens before execution begins.
  • Frame lifecycle tokens. ALLOC (frame control, prefix 011+00, op=0) allocates a frame; FREE (frame control, op=1) or FREE_FRAME instruction releases it.
  • Mode field replaces output routing logic. The 3-bit mode field in the instruction word encodes output routing (INHERIT/CHANGE_TAG/SINK) and constant presence. The assembler must compute the correct mode for each instruction based on its edge topology and constant usage.

2. IR Type Changes (asm/ir.py)

2.1 IRNode Changes

Current fields affected:

Current Field | Change | Notes
ctx_slot: Optional[Union[int, CtxSlotRef, CtxSlotRange]] | Rename to act_slot or remove entirely | Macro parameter support (CtxSlotRef) may still be needed for parameterized activation grouping
iram_offset: Optional[int] | Keep | Semantics unchanged: index into PE's IRAM. Range 0-255.
ctx: Optional[int] | Rename to act_id: Optional[int] | 3-bit activation ID (0-7) instead of context slot

New fields to add to IRNode:

New Field | Type | Purpose
mode | Optional[int] | 3-bit instruction mode (0-7), computed during allocate
fref | Optional[int] | 6-bit frame slot base index (0-63), computed during allocate
wide | bool | 32-bit frame values flag, default False
frame_layout | Optional[FrameSlotMap] | Per-node frame slot assignments (see below)

2.2 New IR Types

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FrameSlotMap:
    """Frame slot assignments for one instruction within an activation.

    Maps logical roles to physical frame slot indices within the 64-slot
    frame. Computed by the allocate pass.

    Attributes:
        match_slot: Frame slot for dyadic match operand storage (offsets 0-7)
        const_slot: Frame slot for IRAM constant (if any)
        dest_slots: Frame slot(s) for pre-formed flit 1 destination values
        sink_slot: Frame slot for SINK mode write-back target
    """
    match_slot: Optional[int] = None
    const_slot: Optional[int] = None
    dest_slots: tuple[int, ...] = ()
    sink_slot: Optional[int] = None

@dataclass(frozen=True)
class FrameLayout:
    """Complete frame layout for one activation on one PE.

    Computed by the allocate pass. Used by codegen to generate PE-local
    write tokens for frame setup.

    Attributes:
        act_id: Activation ID (0-7)
        pe_id: PE where this frame lives
        slots: Dict mapping slot index to (role, value) pairs
        total_slots: Total slots used (must be <= 64)
    """
    act_id: int
    pe_id: int
    slots: dict[int, tuple[str, int]]
    total_slots: int

2.3 SystemConfig Changes

Current SystemConfig:

@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = DEFAULT_IRAM_CAPACITY  # 128
    ctx_slots: int = DEFAULT_CTX_SLOTS           # 16

New SystemConfig:

@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 256              # 8-bit offset, 256 entries
    frame_count: int = 4                  # max concurrent frames per PE
    frame_slots: int = 64                 # slots per frame
    matchable_offsets: int = 8            # max dyadic instructions per activation per PE

The defaults change: DEFAULT_IRAM_CAPACITY goes from 128 to 256 (matching the 8-bit offset field). DEFAULT_CTX_SLOTS is removed entirely and replaced by frame_count, frame_slots, and matchable_offsets.

2.4 What Stays

  • IREdge -- unchanged except ctx_override may need renaming or semantic adjustment (see section 5)
  • IRGraph structure -- unchanged
  • IRDataDef -- unchanged (SM data definitions are orthogonal to PE frames)
  • IRRegion, RegionKind -- unchanged
  • NameRef, ResolvedDest -- unchanged
  • MacroDef, IRMacroCall, MacroParam, ParamRef -- unchanged
  • CallSite, CallSiteResult -- structure unchanged; free_ctx_nodes is renamed to free_frame_nodes, while trampoline_nodes keeps its name (see section 5.2)
  • SourceLoc, ConstExpr, IRRepetitionBlock -- unchanged

2.5 CtxSlotRef / CtxSlotRange

CtxSlotRef and CtxSlotRange are used in macro templates for parameterized context slot assignment. Under the frame model, these become ActSlotRef / ActSlotRange (or are removed if activation ID assignment is always automatic). The expand pass resolves these to concrete values during macro expansion. The rename is straightforward but touches ir.py, lower.py, and expand.py.

3. Opcode Changes (asm/opcodes.py)

3.1 Renamed Opcodes

Current | New | Notes
RoutingOp.FREE_CTX | RoutingOp.FREE_FRAME | Deallocates a frame instead of a context slot

The mnemonic mapping changes: "free_ctx" -> "free_frame" in MNEMONIC_TO_OP. The OP_TO_MNEMONIC reverse mapping updates correspondingly.

3.2 New Opcodes

Opcode | Type | Arity | Purpose
RoutingOp.EXTRACT_TAG | Monadic CM | Monadic | Captures executing token's identity as 16-bit packed flit 1 value (return continuation)
RoutingOp.ALLOC_REMOTE | Monadic CM | Monadic | Triggers ALLOC frame control token to a target PE (may be handled at codegen level instead)

EXTRACT_TAG must be added to MNEMONIC_TO_OP (mnemonic: "extract_tag"), to _MONADIC_OPS_TUPLES, and to cm_inst.py as a new RoutingOp enum member.

Whether ALLOC_REMOTE is an ALU opcode or purely a codegen-emitted frame control token depends on whether the assembler exposes it as a user-facing instruction or handles it internally during call wiring. For static calls, ALLOC is emitted by codegen as a frame control token, not as an ALU instruction. For dynamic calls, it may need to be an instruction.

3.3 Mode-Dependent Arity

The current is_monadic(op, const) and is_dyadic(op, const) functions remain correct in concept. The mode field is orthogonal to arity -- mode determines output routing behaviour, not input operand count. No changes needed to arity classification.

3.4 Monadic/Dyadic Classification

Unchanged. The distinction between monadic and dyadic is fundamental to matching (dyadic tokens go through presence-based matching in the 670s; monadic tokens bypass matching). The assembler's arity classification drives IRAM offset assignment (dyadic at offsets 0-7, monadic at 8+).

4. Lower Pass Changes (asm/lower.py)

Minimal changes. The lower pass translates Lark CST to IR and is mostly opcode-agnostic.

4.1 Opcode Mnemonic Updates

The MNEMONIC_TO_OP lookup in _resolve_opcode() will pick up new opcodes automatically when they are added to opcodes.py. No lower.py code changes needed for new opcodes.

The free_ctx mnemonic rename to free_frame will need a corresponding change in lower.py only if the mnemonic string is hardcoded anywhere (it is not -- lower.py uses MNEMONIC_TO_OP for all lookups).

4.2 Context Slot Qualifiers

The [ctx_slot] qualifier syntax in dfasm (&node[3] <| add) is parsed in lower.py and stored as ctx_slot on IRNode. This syntax and field will be renamed to reflect activation IDs if kept. If activation ID assignment is always automatic, the qualifier syntax may be removed entirely.

References in lower.py:

  • inst_def transformer rule: parses [N] qualifier into ctx_slot
  • ctx_slot_ref transformer rule: creates CtxSlotRef for macro parameter [${param}]
  • ctx_slot_range transformer rule: creates CtxSlotRange for [N:M]

4.3 New Syntax (Optional, Deferred)

Frame directives in dfasm syntax (e.g., @frame_layout pragmas) could be added later. Not needed for v0 -- the allocator computes frame layouts automatically.

5. Expand Pass Changes (asm/expand.py)

5.1 Function Call Wiring

The expand pass currently generates cross-context call wiring:

  • Trampoline PASS nodes with ctx_override edges
  • FREE_CTX cleanup nodes
  • Per-call-site context slot allocation via CallSite metadata

Under the frame model, the call wiring changes:

  • ctx_override edges become frame-boundary edges. The semantic meaning is the same (data crosses activation boundaries), but the mechanism changes: instead of packing a target context into the instruction's const field with ctx_mode=1, the destination's pre-formed flit 1 value in the frame slot already encodes the target PE, offset, and act_id. Cross-activation routing is handled by frame setup, not by instruction encoding.

  • FREE_CTX nodes become FREE_FRAME nodes. The opcode changes from RoutingOp.FREE_CTX to RoutingOp.FREE_FRAME. The expand pass references RoutingOp.FREE_CTX directly in _wire_call_site() at line 1160 of expand.py.

  • Trampoline nodes may simplify. In the current design, trampolines exist to bridge context boundaries (an edge cannot cross contexts without a PASS node carrying ctx_mode). Under the frame model, destinations in frame slots already encode the target activation, so cross-activation edges are just edges whose destination flit 1 value points to a different activation. Trampolines may still be useful for fan-out or return routing but are no longer needed purely for context bridging.

  • @ret wiring stays conceptually the same. Return routing in the frame model is a pre-formed flit 1 value loaded into a frame slot. The expand pass creates edges for return routing; the allocate pass resolves those edges to flit 1 values in frame slots.

5.2 CallSite Metadata

CallSite fields to rename:

  • free_ctx_nodes -> free_frame_nodes

The trampoline_nodes field stays (trampolines may still be generated for fan-out or return routing patterns).

5.3 CtxSlotRef Resolution

The expand pass resolves CtxSlotRef and CtxSlotRange during macro expansion. These will be renamed to ActSlotRef / ActSlotRange (or removed). The resolution logic in _substitute_node() and _substitute_edge() needs only the corresponding rename.

5.4 Built-in Macros

The built-in macros in builtins.py do NOT reference ctx_slot, ctx, free_ctx, or any context-specific concepts directly. They use generic opcodes (add, brgt, gate, pass, inc, const) and edge routing (@ret, ${param}). No changes needed to built-in macros.

6. Resolve Pass Changes (asm/resolve.py)

No changes needed. The resolve pass validates that edge endpoints exist and detects scope violations. It operates on node names and graph structure, not on PE-level concepts like contexts or frames. The resolve pass does not reference ctx, ctx_slot, ctx_override, or any context-related fields.

7. Place Pass Changes (asm/place.py)

7.1 New Constraint: Matchable Offset Limit

The placement pass must enforce the 8-matchable-offset constraint: at most 8 dyadic instructions per activation per PE. This is a new constraint that does not exist in the current codebase.

Currently, _count_iram_cost() counts dyadic nodes as costing 2 IRAM slots and monadic as 1. Under the frame model:

  • Dyadic nodes cost 1 IRAM slot (match state lives outside IRAM: presence bits in the 670s, operands in frame SRAM)
  • Monadic nodes cost 1 IRAM slot
  • The 8-dyadic-per-activation limit is a separate constraint from IRAM capacity

The placement pass needs to track dyadic instruction count per activation group per PE, in addition to total IRAM usage.
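As a rough illustration of that bookkeeping, the sketch below counts dyadic nodes per (PE, activation group) and reports the groups that exceed the limit. The function name and the (pe, scope, is_dyadic) tuple shape are assumptions made for this example; the real place.py works on IRNode objects.

```python
from collections import Counter

def find_matchable_overflows(nodes, matchable_offsets=8):
    """Return (pe, scope, dyadic_count) for every group over the limit.

    nodes: iterable of (pe, scope, is_dyadic) tuples. This shape is a
    hypothetical simplification of IRNode, used only for illustration.
    """
    # Only dyadic nodes consume matchable offsets; monadic nodes are ignored.
    counts = Counter((pe, scope) for pe, scope, is_dyadic in nodes if is_dyadic)
    return sorted(
        (pe, scope, n) for (pe, scope), n in counts.items() if n > matchable_offsets
    )
```

A non-empty result would presumably be reported as a RESOURCE error or used to drive splitting of the function body across PEs.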

7.2 IRAM Cost Recalculation

_count_iram_cost() should return 1 for all node types (dyadic and monadic both use 1 IRAM slot in the frame model). The current cost of 2 for dyadic nodes was because the old matching store occupied IRAM slots; in the frame model, match operands live in frame SRAM, not IRAM.

7.3 Context Slots -> Frames

_auto_place_nodes() currently tracks ctx_used per PE (context slots consumed). This becomes frames_used per PE (concurrent frames, max 4). The ctx_scopes_per_pe tracking of function scopes per PE maps directly to frame tracking: each function scope on a PE consumes one frame.

SystemConfig.ctx_slots references in place.py become SystemConfig.frame_count.

7.4 Instruction Deduplication Awareness

Because IRAM entries are activation-independent templates, multiple activations of the same function on the same PE share IRAM entries. The placement pass could account for this when computing IRAM utilisation: if two activations of function $foo run on PE0, they share IRAM entries, so the IRAM cost is the unique instruction count, not the total.

This is an optimisation, not a correctness requirement. The placement pass can conservatively count IRAM entries per unique function body without deduplication for v0.
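A minimal sketch of the shared-template accounting, assuming every node costs one IRAM slot and each function body is counted once per PE regardless of how many activations run there. The function name and input shapes are hypothetical.

```python
def iram_cost_per_pe(activations, body_sizes):
    """Estimate IRAM usage per PE when activations share templates.

    activations: iterable of (pe, function_name) pairs, one per activation.
    body_sizes:  dict mapping function_name to its instruction count.
    Both shapes are assumptions made for this illustration.
    """
    unique_bodies = {}
    for pe, function in activations:
        # A set collapses repeated activations of the same function on a PE.
        unique_bodies.setdefault(pe, set()).add(function)
    return {
        pe: sum(body_sizes[f] for f in functions)
        for pe, functions in unique_bodies.items()
    }
```

With two activations of $foo (10 instructions) and one of $bar (4 instructions) on PE0, this yields 14 rather than 24.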

8. Allocate Pass Changes (asm/allocate.py) -- MAJOR

This is the largest change. The current allocate pass assigns IRAM offsets and context slots, then resolves destinations to Addr values. The frame model requires fundamentally different allocation logic.

8.1 IRAM Offset Assignment

Current behaviour (_assign_iram_offsets()):

  • Dyadic nodes get offsets 0..D-1
  • Monadic nodes get offsets D..D+M-1
  • Total must fit in iram_capacity

New behaviour:

  • Dyadic nodes get offsets 0-7 (within the matchable offset range). At most 8 dyadic instructions per activation group per PE. Because IRAM is activation-independent (shared templates), dyadic offset assignment is per-PE, not per-activation.
  • Monadic nodes get offsets 8-255 (or wherever dyadic offsets end).
  • The matchable_offsets limit (default 8) constrains dyadic count.
  • With instruction deduplication, multiple activations of the same function body share IRAM offsets. The allocator assigns offsets once per unique instruction template, not per activation.
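The new offset scheme could look roughly like the sketch below. It takes the "wherever dyadic offsets end" option and packs monadic templates directly after the dyadic ones; the function name and the (name, is_dyadic) input shape are assumptions, not the existing _assign_iram_offsets() signature.

```python
def assign_iram_offsets(templates, matchable_offsets=8, iram_capacity=256):
    """Assign IRAM offsets to unique instruction templates on one PE.

    templates: list of (name, is_dyadic) pairs, already deduplicated.
    Returns a dict mapping name to IRAM offset.
    """
    dyadic = [name for name, is_dyadic in templates if is_dyadic]
    monadic = [name for name, is_dyadic in templates if not is_dyadic]
    if len(dyadic) > matchable_offsets:
        raise ValueError(
            f"{len(dyadic)} dyadic templates exceed {matchable_offsets} matchable offsets"
        )
    if len(dyadic) + len(monadic) > iram_capacity:
        raise ValueError("IRAM capacity exceeded")
    # Dyadic templates occupy the low, matchable offsets.
    offsets = {name: i for i, name in enumerate(dyadic)}
    # Monadic templates follow immediately after the last dyadic offset.
    base = len(dyadic)
    for i, name in enumerate(monadic):
        offsets[name] = base + i
    return offsets
```

Starting monadic offsets at a fixed base of 8 instead would only change the `base` line.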

8.2 Activation ID Assignment

Replaces _assign_context_slots(). The current function assigns context slot indices (0-15) per function scope per PE. The new function assigns 3-bit activation IDs (0-7) with at most 4 concurrent activations per PE.

Key differences:

  • Smaller space: 8 act_ids (3-bit) vs 16 context slots. But only 4 can be concurrently active (4 physical frames).
  • ABA distance: the allocator must maintain ABA distance between act_ids to prevent stale token collisions. With 4 concurrent frames out of 8 possible IDs, 4 IDs of ABA distance exist before wraparound. For static programs, the allocator can assign act_ids sequentially (0, 1, 2, 3 for 4 concurrent activations).
  • Per-call-site allocation: same concept as current call_site_to_ctx_on_pe -- each call site that creates a new activation gets a fresh act_id. But the budget is 4 concurrent frames instead of 16 context slots.

The function scope grouping logic (_extract_function_scope()) stays.
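For static programs, the sequential strategy can be sketched as follows. This is an illustrative simplification that assumes the scopes on each PE are already known and all concurrently live; the function name and input shape are hypothetical.

```python
def assign_act_ids(scopes_per_pe, frame_count=4):
    """Assign sequential activation IDs to the function scopes on each PE.

    scopes_per_pe: dict mapping pe_id to an ordered list of scope names.
    Returns a dict mapping (pe_id, scope) to act_id.
    """
    assignments = {}
    for pe, scopes in scopes_per_pe.items():
        if len(scopes) > frame_count:
            raise ValueError(
                f"PE{pe}: {len(scopes)} activations exceed {frame_count} frames"
            )
        # IDs restart at 0 on every PE because frames are PE-local.
        for act_id, scope in enumerate(scopes):
            assignments[(pe, scope)] = act_id
    return assignments
```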

8.3 Frame Layout Allocation (NEW)

This is entirely new functionality. After IRAM offset and act_id assignment, the allocator must compute the frame layout for each activation: which frame slots hold what data.

Frame slot roles:

Role | Slot Range | Count per Instruction | Notes
Match operand | 0-7 | 1 per dyadic instruction | Indexed by matchable offset. Presence bit in 670.
Constant | 8+ | 0-1 per instruction | mode[0] (has_const) selects whether const is read from frame[fref]
Destination | variable | 0-2 per instruction | Pre-formed flit 1 values. 1 for modes 0/1, 2 for modes 2/3.
Accumulator/sink | variable | 0-1 per instruction | For SINK modes (6/7). Write-back target.
SM parameters | variable | 0-2 per SM instruction | SM_id + addr, data or return routing.

Slot assignment algorithm:

  1. Reserve slots 0-7 for match operands (one per dyadic instruction, indexed by the instruction's matchable offset).
  2. Assign constant slots starting at slot 8. Constants that are shared across instructions within the same activation can be deduplicated (same value -> same slot).
  3. Assign destination slots. Each destination is a pre-formed flit 1 value. Destinations shared across instructions can be deduplicated.
  4. Assign sink/accumulator slots for SINK mode instructions.
  5. Assign SM parameter slots for SM operations.
  6. Verify total slots <= 64. If exceeded, report a frame overflow error.

fref computation:

The fref field in the instruction word points to the base of a contiguous group of frame slots used by that instruction. The slot count depends on the mode:

Mode | Slots at fref | Layout
0 | 1 | [dest]
1 | 2 | [const, dest]
2 | 2 | [dest1, dest2]
3 | 3 | [const, dest1, dest2]
4 | 0 | (no frame access)
5 | 1 | [const]
6 | 1 | [sink_target] (write)
7 | 1 | [sink_target] (read-modify-write)

The allocator must arrange frame slots such that each instruction's constant and destination(s) are contiguous starting at fref. This may require careful slot packing or a simple sequential allocation strategy.
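The simple sequential strategy might be sketched like this: each instruction gets the next free run of slots after the match-operand region, sized by its mode. No deduplication is attempted, and using fref 0 for mode 4 is an arbitrary convention chosen for this example, as are the names.

```python
# Slot layout at fref for each mode, taken from the table above.
SLOTS_AT_FREF = {
    0: ("dest",),
    1: ("const", "dest"),
    2: ("dest1", "dest2"),
    3: ("const", "dest1", "dest2"),
    4: (),
    5: ("const",),
    6: ("sink_target",),
    7: ("sink_target",),
}

def assign_frefs(modes, first_free=8, frame_slots=64):
    """Sequentially assign an fref to each instruction, given its mode.

    Returns (frefs, next_free_slot). Slots below first_free are assumed to
    be reserved for match operands.
    """
    frefs = []
    next_slot = first_free
    for mode in modes:
        width = len(SLOTS_AT_FREF[mode])
        if next_slot + width > frame_slots:
            raise ValueError("frame overflow: more than 64 slots needed")
        # Mode 4 touches no frame slots, so its fref value is a placeholder.
        frefs.append(next_slot if width else 0)
        next_slot += width
    return frefs, next_slot
```

For modes [0, 1, 3, 4, 5] this yields frefs [8, 9, 11, 0, 14] and leaves slot 15 as the next free slot.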

Mode computation:

The allocator determines the mode for each instruction based on:

Condition | Mode
0 dests, no const, has sink | 6 (SINK)
0 dests, const via frame accumulator (RMW) | 7 (SINK+CONST)
1 dest, no const | 0 (INHERIT, single output)
1 dest, has const | 1 (INHERIT, single output + const)
2 dests, no const | 2 (INHERIT, fan-out)
2 dests, has const | 3 (INHERIT, fan-out + const)
CHANGE_TAG, no const | 4
CHANGE_TAG, has const | 5

CHANGE_TAG is used when the instruction's left operand provides the output destination dynamically (e.g., dynamic return routing via EXTRACT_TAG + CHANGE_TAG).
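The mode table can be expressed compactly because bit 0 is the has_const flag and the upper two bits select the routing class. The sketch below is one way to write that selection; the function name and its boolean parameters are assumptions, and how the allocator detects the CHANGE_TAG and SINK cases from the IR is left out.

```python
def compute_mode(dest_count, has_const, change_tag=False, sink=False):
    """Select the 3-bit instruction mode from edge topology and const usage.

    dest_count: number of static destinations (0, 1, or 2).
    has_const:  True if the instruction reads a constant (or, for SINK,
                performs a read-modify-write) via the frame.
    """
    if change_tag and sink:
        raise ValueError("CHANGE_TAG and SINK are mutually exclusive")
    if sink:
        if dest_count != 0:
            raise ValueError("SINK instructions have no static destinations")
        base = 6
    elif change_tag:
        # Destination comes from the left operand, so dest_count is not checked.
        base = 4
    elif dest_count == 1:
        base = 0
    elif dest_count == 2:
        base = 2
    else:
        raise ValueError(f"unsupported destination count: {dest_count}")
    # Bit 0 of the mode is the has_const flag.
    return base | int(has_const)
```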

8.4 Pre-Formed Flit 1 Computation

The allocator must convert resolved destination Addr values into 16-bit packed flit 1 values for storage in frame destination slots.

Flit 1 formats (from architecture-overview.md):

DYADIC WIDE:    [0][0][port:1][PE:2][offset:8][act_id:3]   = 16 bits
MONADIC NORM:   [0][1][0][PE:2][offset:8][act_id:3]         = 16 bits
MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2]      = 16 bits

The packing function takes:

  • dest_pe: int (2 bits)
  • dest_offset: int (8 bits for dyadic/monadic normal, 7 for inline)
  • dest_act_id: int (3 bits, for dyadic and monadic normal)
  • dest_port: Port (1 bit, for dyadic wide only)
  • dest_type: TokenType (dyadic wide, monadic normal, or monadic inline)

And produces a 16-bit packed flit 1 value.

The dest_type is determined by whether the destination instruction is dyadic (-> dyadic wide flit) or monadic (-> monadic normal flit). For trigger-only destinations (e.g., switch not-taken path), monadic inline is used.

This replaces the current Addr(a, port, pe) resolution. The Addr type may still be used as an intermediate representation, but the final output is a packed flit 1 value in a frame slot.
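A hedged sketch of the packing function follows, assuming the leftmost field in each layout is the most significant bit. DestType is a hypothetical stand-in for the TokenType parameter, and the port is passed as a plain 0/1 integer rather than a Port value.

```python
from enum import Enum

class DestType(Enum):
    """Hypothetical stand-in for the destination token type."""
    DYADIC_WIDE = "dyadic_wide"
    MONADIC_NORM = "monadic_norm"
    MONADIC_INLINE = "monadic_inline"

def pack_flit1(dest_type, pe, offset, act_id=0, port=0):
    """Pack a destination into a 16-bit flit 1 value (MSB-first layouts)."""
    if not 0 <= pe < 4:
        raise ValueError("pe must fit in 2 bits")
    if not 0 <= act_id < 8:
        raise ValueError("act_id must fit in 3 bits")
    if port not in (0, 1):
        raise ValueError("port must be 0 or 1")
    if dest_type is DestType.MONADIC_INLINE:
        if not 0 <= offset < 128:
            raise ValueError("inline offset must fit in 7 bits")
        # [0][1][1][PE:2][10][offset:7][spare:2]
        return (0b011 << 13) | (pe << 11) | (0b10 << 9) | (offset << 2)
    if not 0 <= offset < 256:
        raise ValueError("offset must fit in 8 bits")
    if dest_type is DestType.DYADIC_WIDE:
        # [0][0][port:1][PE:2][offset:8][act_id:3]
        return (port << 13) | (pe << 11) | (offset << 3) | act_id
    if dest_type is DestType.MONADIC_NORM:
        # [0][1][0][PE:2][offset:8][act_id:3]
        return (0b010 << 13) | (pe << 11) | (offset << 3) | act_id
    raise ValueError(f"unknown destination type: {dest_type!r}")
```

For example, a dyadic destination on PE1 at offset 5, act_id 2, right port (port=1) packs to 0x282A under these assumptions.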

8.5 Destination Resolution Rework

The current _resolve_destinations() function creates ResolvedDest objects containing Addr(a=iram_offset, port=edge.port, pe=dest_pe). Under the frame model, destination resolution must additionally:

  1. Look up the destination node's act_id (not just iram_offset and pe).
  2. Determine the destination token type (dyadic wide vs monadic normal vs monadic inline).
  3. Compute the packed flit 1 value.
  4. Assign the flit 1 value to a frame slot.

The ResolvedDest type may be extended to carry the packed flit 1 value, or the flit 1 computation may happen in a separate sub-pass after destination resolution.

8.6 ctx_mode / ctx_override Removal

The current allocator and codegen handle ctx_override edges by setting ctx_mode=1 on the source instruction and packing the target context and generation into the const field. This entire mechanism is removed.

Under the frame model, cross-activation routing is handled by frame destination slots. The destination flit 1 value already encodes the target PE, offset, and act_id. No instruction-level ctx_mode is needed. The ctx_override flag on IREdge may be kept for semantic annotation (this edge crosses activation boundaries) but has no effect on instruction encoding.

9. Codegen Changes (asm/codegen.py) -- MAJOR

9.1 Instruction Generation

Current codegen (_build_iram_for_pe()) generates ALUInst and SMInst objects. These are Python dataclasses that model the old instruction format with embedded destinations and constants:

ALUInst(op, dest_l, dest_r, const, ctx_mode)
SMInst(op, sm_id, const, ret, ret_dyadic)

New codegen generates 16-bit instruction words matching the hardware format:

@dataclass(frozen=True)
class Instruction:
    """16-bit instruction word for IRAM.

    [type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
    """
    type: int       # 0 = CM, 1 = SM
    opcode: int     # 5-bit opcode
    mode: int       # 3-bit mode (0-7)
    wide: bool      # 32-bit frame values
    fref: int       # 6-bit frame slot base index

The Instruction type replaces both ALUInst and SMInst. Constants and destinations are NOT in the instruction -- they are in frame slots.

The PEConfig.iram field type changes from dict[int, ALUInst | SMInst] to dict[int, Instruction] (or dict[int, int] if storing raw 16-bit words).
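If raw 16-bit words are stored, encoding and decoding could look like the sketch below. It assumes the leftmost field ([type:1]) is the most significant bit; the function names are illustrative and not part of the existing codebase.

```python
def encode_instruction(type_, opcode, mode, wide, fref):
    """Pack [type:1][opcode:5][mode:3][wide:1][fref:6] into a 16-bit word."""
    if type_ not in (0, 1):
        raise ValueError("type must be 0 (CM) or 1 (SM)")
    if not 0 <= opcode < 32:
        raise ValueError("opcode must fit in 5 bits")
    if not 0 <= mode < 8:
        raise ValueError("mode must fit in 3 bits")
    if not 0 <= fref < 64:
        raise ValueError("fref must fit in 6 bits")
    return (type_ << 15) | (opcode << 10) | (mode << 7) | (int(bool(wide)) << 6) | fref

def decode_instruction(word):
    """Inverse of encode_instruction; returns (type, opcode, mode, wide, fref)."""
    return (
        (word >> 15) & 0x1,
        (word >> 10) & 0x1F,
        (word >> 7) & 0x7,
        bool((word >> 6) & 0x1),
        word & 0x3F,
    )
```

A round trip through both functions is a cheap unit test for the bit layout.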

9.2 Frame Setup Sequence Generation (NEW)

Codegen must generate the bootstrap sequence that loads frame contents before execution begins. This is a stream of tokens:

  1. FrameControlToken (ALLOC) for each activation:

    flit 1: [0][1][1][PE:2][00][op=0][spare:3][act_id:3]
    flit 2: (return routing or unused)
    
  2. PELocalWriteToken for each frame slot that needs initialization:

    flit 1: [0][1][1][PE:2][01][region=1][spare:1][slot:5][act_id:3]
    flit 2: [data:16]  (the frame slot value)
    
  3. PELocalWriteToken for IRAM entries:

    flit 1: [0][1][1][PE:2][01][region=0][spare:1][slot:5][act_id:ignored]
    flit 2: [instruction:16]
    

The ordering matters: IRAM writes before frame setup, frame setup (ALLOC + slot writes) before seed tokens. More specifically:

IRAM writes (all PEs)
  -> ALLOC frame control (all activations)
    -> Frame slot writes (constants, destinations per activation)
      -> Seed tokens (initial data tokens to start execution)

9.3 New Token Types

The emulator will need new token types (or the existing types must be adapted):

  • FrameControlToken -- ALLOC/FREE frame lifecycle. Currently not in tokens.py. Codegen needs to emit these.
  • PELocalWriteToken -- writes to IRAM (region=0) or frame slots (region=1). The current IRAMWriteToken is a special case of this (region=0 only). It may be generalised or a new type added.

These are emulator-level changes that codegen depends on. The codegen module must import and construct whatever token types the emulator provides.

9.4 Seed Token Generation

Current seed token generation creates MonadToken or DyadToken with a ctx field. Changes:

  • ctx field -> act_id field in token constructors
  • gen field on DyadToken is removed (ABA protection is via the 670 valid bit, not generation counters)
  • Seed tokens target (pe, offset, act_id) triples, packed from the destination's allocation data

9.5 Direct Mode Output

generate_direct() currently returns AssemblyResult with:

  • pe_configs: list[PEConfig] -- PEConfig with iram dict of ALUInst/SMInst
  • sm_configs: list[SMConfig]
  • seed_tokens: list[MonadToken]

New AssemblyResult:

  • pe_configs: list[PEConfig] -- PEConfig with iram dict of Instruction (new type), plus frame_layouts: dict[int, FrameLayout] mapping act_id to frame layout data
  • sm_configs: list[SMConfig] -- unchanged
  • seed_tokens: list -- may include DyadToken and MonadToken with act_id instead of ctx
  • setup_tokens: list -- NEW: frame control and PE-local write tokens for bootstrap

9.6 Token Stream Mode Output

generate_tokens() currently produces:

SM init tokens -> IRAM write tokens -> seed tokens

New ordering:

SM init tokens -> IRAM write tokens -> ALLOC tokens -> frame slot write tokens -> seed tokens
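One way to enforce this ordering is a stable sort keyed on the bootstrap phase, sketched below. The phase names and the (phase, payload) tuple representation are assumptions for illustration; real tokens would be the emulator's token objects.

```python
# Bootstrap phases in the order the token stream must emit them.
PHASE_ORDER = ("sm_init", "iram_write", "alloc", "frame_write", "seed")

def order_token_stream(tokens):
    """Order (phase, payload) pairs by bootstrap phase.

    sorted() is stable, so tokens within one phase keep their relative order.
    """
    rank = {phase: i for i, phase in enumerate(PHASE_ORDER)}
    return sorted(tokens, key=lambda token: rank[token[0]])
```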

9.7 ctx_mode Removal

The entire ctx_mode / ctx_override handling in _build_iram_for_pe() (lines 96-132 of codegen.py) is removed. Cross-activation routing is handled by frame destination slots, not by instruction encoding.

9.8 Route Restriction Computation

_compute_route_restrictions() is unchanged in concept. It scans edges to determine which PEs and SMs a given PE needs to route to. The implementation stays the same.

10. Serialize Pass Changes (asm/serialize.py)

10.1 Field Renaming

  • ctx -> act_id in node serialization if activation ID is displayed
  • Any ctx_slot qualifier in dfasm output becomes activation-related

10.2 New Fields

If mode, fref, and wide are displayed in serialized output (for debugging allocated IR), _serialize_node() needs to format them. Format could be: &node|pe0|act2|mode1|fref8 <| add
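A small sketch of how such a formatter could behave, assuming unallocated fields are None and are simply omitted so that pre-allocation output is unaffected. The function name and signature are hypothetical and do not reflect the current _serialize_node().

```python
def format_node(name, opcode, pe=None, act_id=None, mode=None, fref=None):
    """Format a node definition with optional placement/allocation qualifiers."""
    parts = [f"&{name}"]
    qualifiers = (("pe", pe), ("act", act_id), ("mode", mode), ("fref", fref))
    for prefix, value in qualifiers:
        # Fields not yet populated by place/allocate are left out entirely.
        if value is not None:
            parts.append(f"{prefix}{value}")
    return "|".join(parts) + f" <| {opcode}"
```

Calling format_node("node", "add", pe=0, act_id=2, mode=1, fref=8) reproduces the example format above, while format_node("node", "add") emits only the node name and opcode.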

10.3 Round-Trip Support

The serialize pass must be able to round-trip new IR fields. Since mode, fref, and frame layout are only populated after allocation, serialization before allocation produces the same output as today (no new fields to display). Post-allocation serialization adds the new fields.

11. Built-in Macro Changes (asm/builtins.py)

No changes needed. The built-in macros (#loop_counted, #loop_while, #permit_inject, #reduce_2/3/4) use generic opcodes and edge routing. They do not reference ctx, ctx_slot, free_ctx, or any context-specific concepts.

The free_ctx opcode rename to free_frame does not affect builtins because none of them use free_ctx.

12. Error Types (asm/errors.py)

12.1 New Error Categories

Add to ErrorCategory:

class ErrorCategory(Enum):
    # ... existing ...
    FRAME = "frame"  # Frame layout overflow, slot conflicts

12.2 New Error Conditions

Error | Category | Source Pass | Condition
Frame slot overflow | FRAME | allocate | Total slots > 64 for an activation
Matchable offset overflow | RESOURCE | allocate/place | > 8 dyadic instructions per activation per PE
Frame count overflow | RESOURCE | allocate | > 4 concurrent activations on one PE
Act ID exhaustion | RESOURCE | allocate | > 8 activation IDs needed (wraparound)

13. Dfgraph Pipeline Impact (dfgraph/)

13.1 dfgraph/pipeline.py

The pipeline runner calls allocate() and uses IRGraph types. If SystemConfig fields change, pipeline.py may need minor updates for default values or error handling. The pipeline runner itself does not inspect allocation results in detail.

13.2 dfgraph/graph_json.py

Currently includes ctx field in node JSON output. This becomes act_id. The field rename is straightforward.

13.3 dfgraph/categories.py

References RoutingOp.FREE_CTX in the CONFIG category mapping. This becomes RoutingOp.FREE_FRAME. One line change.

EXTRACT_TAG (if added) maps to the ROUTING or CONFIG category depending on its semantics.

14. Monitor Impact (monitor/)

14.1 monitor/snapshot.py

PESnapshot currently captures matching_store (2D array of MatchEntry) and gen_counters. Under the frame model:

  • matching_store becomes frame state (per-frame slot values, presence bits)
  • gen_counters is removed (no generation counters in frame model)
  • New: frame_allocations (act_id -> frame_id mapping), frame_slots (per-frame slot contents)

14.2 monitor/graph_json.py

Node state overlay: ctx field -> act_id. Frame slot contents may be added to the state overlay for debugging.

14.3 monitor/repl.py

The pe command displays PE state including matching store contents. This needs updating to display frame state instead.

15. Dependency Order

Implementation order based on the dependency graph:

Phase 1: Foundation Types

  1. cm_inst.py: Add RoutingOp.FREE_FRAME, RoutingOp.EXTRACT_TAG. Add Instruction dataclass. Update is_monadic_alu().
  2. asm/ir.py: Add FrameSlotMap, FrameLayout. Rename ctx_slot -> act_slot, ctx -> act_id on IRNode. Update SystemConfig (remove ctx_slots, add frame_count, frame_slots, matchable_offsets). Rename CtxSlotRef -> ActSlotRef, CtxSlotRange -> ActSlotRange.
  3. asm/opcodes.py: Add new opcodes to MNEMONIC_TO_OP, _MONADIC_OPS_TUPLES. Rename free_ctx -> free_frame.
  4. asm/errors.py: Add ErrorCategory.FRAME.

Phase 2: Allocation and Codegen (the big changes)

  1. asm/allocate.py: Rewrite _assign_context_slots() -> _assign_act_ids(). Add _compute_frame_layouts(). Add _compute_modes(). Add _pack_flit1(). Update _assign_iram_offsets() for new offset scheme. Update _resolve_destinations() to produce flit 1 values. Remove ctx_mode/ctx_override handling.
  2. asm/codegen.py: Replace ALUInst/SMInst generation with Instruction generation. Add frame setup token generation. Add FrameControlToken / PELocalWriteToken emission. Update seed token generation. Remove ctx_mode handling. Update PEConfig construction. Update AssemblyResult.

Phase 3: Upstream Pass Adjustments

  1. asm/lower.py: Rename ctx_slot references. Update qualifier parsing if syntax changes.
  2. asm/expand.py: Replace FREE_CTX with FREE_FRAME in call wiring. Rename CtxSlotRef -> ActSlotRef. Update CallSite field names. Simplify trampoline logic if ctx_override is removed.
  3. asm/place.py: Update _count_iram_cost() (all nodes cost 1). Replace ctx_slots tracking with frame_count. Add matchable-offset-per-activation constraint.

Phase 4: Output and Tooling

  1. asm/serialize.py: Rename ctx -> act_id in output. Add mode/fref display for post-allocation IR.
  2. asm/builtins.py: No changes (verified).
  3. dfgraph/categories.py: FREE_CTX -> FREE_FRAME.
  4. dfgraph/graph_json.py: ctx -> act_id in JSON output.
  5. monitor/: Update snapshot, graph_json, and REPL for frame model.

Phase 5: Emulator Updates (out of scope for this doc)

  1. emu/types.py: PEConfig changes (iram type, remove ctx_slots, add frame parameters).
  2. emu/pe.py: Frame-based matching instead of context-slot matching store.
  3. tokens.py: ctx -> act_id, remove gen. Add FrameControlToken, PELocalWriteToken.

16. What Stays the Same

  • dfasm grammar (dfasm.lark): mostly unchanged. May add frame directives later, but not required for v0.
  • Parse pass (Lark parser): no changes.
  • Error infrastructure (errors.py): structure unchanged, one new category added.
  • Graph structure (IRGraph, IREdge, IRRegion): unchanged.
  • Pipeline architecture: still 7 passes in the same order.
  • Resolve pass: completely unchanged.
  • Built-in macros: completely unchanged.
  • SM-related codegen: SMConfig construction, SM init tokens, data defs -- all unchanged.
  • Name resolution and scoping: unchanged.
  • Macro expansion core: parameter substitution, variadic repetition, opcode parameters -- all unchanged. Only call wiring details change.

17. Risk Assessment

Area | Risk | Mitigation
Frame layout allocation | New, complex algorithm | Start with simple sequential allocation, optimise later
Flit 1 packing | Bit-level encoding must match hardware spec | Unit tests against architecture-overview.md bit layouts
Instruction deduplication | Interaction with frame layouts is subtle | Defer dedup to optimisation pass; v0 allocates per-activation
Mode computation | 8 modes, context-dependent selection | Exhaustive test matrix against mode table
ABA distance | Act_id assignment must maintain distance | Sequential assignment (0,1,2,3) is safe for static programs
ctx_override removal | Affects expand pass call wiring | Can keep ctx_override as semantic annotation initially
Emulator dependency | Codegen must emit tokens emulator can consume | Phase emulator changes alongside codegen; can use adapter layer