# Assembler Redesign Plan: Frame-Based PE Model

This document describes the changes needed in the OR1 assembler (`asm/`) to match the frame-based PE redesign described in `pe-design.md` and `architecture-overview.md`. It covers every assembler pass and supporting module, identifying what changes, what stays, and the dependency order for implementation.

The assembler pipeline is: parse -> lower -> expand -> resolve -> place -> allocate -> codegen. Most changes concentrate in allocate and codegen; earlier passes are minimally affected.

## 1. Overview

The PE redesign replaces the context-slot matching model with a frame-based model. Key architectural changes that affect the assembler:

- **Context slots replaced by activation IDs and frames.** The old `ctx_slots` parameter (up to 16 slots) becomes `frame_count` (4 concurrent frames) with 3-bit `act_id` (8 unique IDs). Each frame has 64 addressable slots.
- **Instructions become templates.** IRAM entries are 16-bit words `[type:1][opcode:5][mode:3][wide:1][fref:6]`. Constants and destinations are NOT in the instruction -- they live in frame slots referenced by `fref`.
- **Instruction deduplication.** Multiple activations can share the same IRAM entry because per-activation data lives in frames, not instructions. The number of IRAM entries needed is the number of unique operation shapes, not the total operations executed.
- **8 matchable offsets per frame.** Dyadic instructions must be assigned offsets 0-7 within each activation. The assembler must enforce this constraint and split function bodies that exceed it.
- **Pre-formed flit 1 values.** Output destinations are 16-bit packed flit 1 values stored in frame slots. The PE reads the slot and puts it on the bus verbatim as flit 1. Token type is determined by the prefix bits in the stored flit value.
- **Frame setup via PE-local write tokens.** Constants, destinations, and other per-activation data are loaded into frame slots by a stream of PE-local write tokens before execution begins.
- **Frame lifecycle tokens.** ALLOC (frame control, prefix `011+00`, op=0) allocates a frame; FREE (frame control, op=1) or FREE_FRAME instruction releases it.
- **Mode field replaces output routing logic.** The 3-bit `mode` field in the instruction word encodes output routing (INHERIT/CHANGE_TAG/SINK) and constant presence. The assembler must compute the correct mode for each instruction based on its edge topology and constant usage.

## 2. IR Type Changes (`asm/ir.py`)

### 2.1 IRNode Changes

Current fields affected:

| Current Field | Change | Notes |
|---|---|---|
| `ctx_slot: Optional[Union[int, CtxSlotRef, CtxSlotRange]]` | Rename to `act_slot` or remove entirely | Macro parameter support (`CtxSlotRef`) may still be needed for parameterized activation grouping |
| `iram_offset: Optional[int]` | Keep | Semantics unchanged: index into PE's IRAM. Range 0-255. |
| `ctx: Optional[int]` | Rename to `act_id: Optional[int]` | 3-bit activation ID (0-7) instead of context slot |

New fields to add to IRNode:

| New Field | Type | Purpose |
|---|---|---|
| `mode` | `Optional[int]` | 3-bit instruction mode (0-7), computed during allocate |
| `fref` | `Optional[int]` | 6-bit frame slot base index (0-63), computed during allocate |
| `wide` | `bool` | 32-bit frame values flag, default False |
| `frame_layout` | `Optional[FrameSlotMap]` | Per-node frame slot assignments (see below) |

### 2.2 New IR Types

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameSlotMap:
    """Frame slot assignments for one instruction within an activation.

    Maps logical roles to physical frame slot indices within the 64-slot frame.
    Computed by the allocate pass.

    Attributes:
        match_slot: Frame slot for dyadic match operand storage (offsets 0-7)
        const_slot: Frame slot for the instruction's constant (if any)
        dest_slots: Frame slot(s) for pre-formed flit 1 destination values
        sink_slot: Frame slot for SINK mode write-back target
    """
    match_slot: Optional[int] = None
    const_slot: Optional[int] = None
    dest_slots: tuple[int, ...] = ()
    sink_slot: Optional[int] = None
```

```python
@dataclass(frozen=True)
class FrameLayout:
    """Complete frame layout for one activation on one PE.

    Computed by the allocate pass. Used by codegen to generate
    PE-local write tokens for frame setup.

    Attributes:
        act_id: Activation ID (0-7)
        pe_id: PE where this frame lives
        slots: Dict mapping slot index to (role, value) pairs
        total_slots: Total slots used (must be <= 64)
    """
    act_id: int
    pe_id: int
    slots: dict[int, tuple[str, int]]
    total_slots: int
```

### 2.3 SystemConfig Changes

Current `SystemConfig`:

```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = DEFAULT_IRAM_CAPACITY  # 128
    ctx_slots: int = DEFAULT_CTX_SLOTS          # 16
```

New `SystemConfig`:

```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 256      # 8-bit offset, 256 entries
    frame_count: int = 4          # max concurrent frames per PE
    frame_slots: int = 64         # slots per frame
    matchable_offsets: int = 8    # max dyadic instructions per activation per PE
```

The defaults change: `DEFAULT_IRAM_CAPACITY` goes from 128 to 256 (matching the 8-bit offset field). `DEFAULT_CTX_SLOTS` is removed entirely and replaced by `frame_count`, `frame_slots`, and `matchable_offsets`.
### 2.4 What Stays

- `IREdge` -- unchanged except `ctx_override` may need renaming or semantic adjustment (see section 5)
- `IRGraph` structure -- unchanged
- `IRDataDef` -- unchanged (SM data definitions are orthogonal to PE frames)
- `IRRegion`, `RegionKind` -- unchanged
- `NameRef`, `ResolvedDest` -- unchanged
- `MacroDef`, `IRMacroCall`, `MacroParam`, `ParamRef` -- unchanged
- `CallSite`, `CallSiteResult` -- one field needs renaming (`free_ctx_nodes` -> `free_frame_nodes`); `trampoline_nodes` stays (see section 5.2)
- `SourceLoc`, `ConstExpr`, `IRRepetitionBlock` -- unchanged

### 2.5 CtxSlotRef / CtxSlotRange

`CtxSlotRef` and `CtxSlotRange` are used in macro templates for parameterized context slot assignment. Under the frame model, these become `ActSlotRef` / `ActSlotRange` (or are removed if activation ID assignment is always automatic). The expand pass resolves these to concrete values during macro expansion.

The rename is straightforward but touches `ir.py`, `lower.py`, and `expand.py`.

## 3. Opcode Changes (`asm/opcodes.py`)

### 3.1 Renamed Opcodes

| Current | New | Notes |
|---|---|---|
| `RoutingOp.FREE_CTX` | `RoutingOp.FREE_FRAME` | Deallocates a frame instead of a context slot |

The mnemonic mapping changes: `"free_ctx"` -> `"free_frame"` in `MNEMONIC_TO_OP`. The `OP_TO_MNEMONIC` reverse mapping updates correspondingly.

### 3.2 New Opcodes

| Opcode | Type | Arity | Purpose |
|---|---|---|---|
| `RoutingOp.EXTRACT_TAG` | Monadic CM | Monadic | Captures executing token's identity as 16-bit packed flit 1 value (return continuation) |
| `RoutingOp.ALLOC_REMOTE` | Monadic CM | Monadic | Triggers ALLOC frame control token to a target PE (may be handled at codegen level instead) |

`EXTRACT_TAG` must be added to `MNEMONIC_TO_OP` (mnemonic: `"extract_tag"`), to `_MONADIC_OPS_TUPLES`, and to `cm_inst.py` as a new `RoutingOp` enum member.

Whether `ALLOC_REMOTE` is an ALU opcode or purely a codegen-emitted frame control token depends on whether the assembler exposes it as a user-facing instruction or handles it internally during call wiring. For static calls, ALLOC is emitted by codegen as a frame control token, not as an ALU instruction. For dynamic calls, it may need to be an instruction.

### 3.3 Mode-Dependent Arity

The current `is_monadic(op, const)` and `is_dyadic(op, const)` functions remain correct in concept. The `mode` field is orthogonal to arity -- mode determines output routing behaviour, not input operand count. No changes needed to arity classification.

### 3.4 Monadic/Dyadic Classification

Unchanged. The distinction between monadic and dyadic is fundamental to matching (dyadic tokens go through presence-based matching in the 670s; monadic tokens bypass matching). The assembler's arity classification drives IRAM offset assignment (dyadic at offsets 0-7, monadic at 8+).

## 4. Lower Pass Changes (`asm/lower.py`)

Minimal changes. The lower pass translates Lark CST to IR and is mostly opcode-agnostic.

### 4.1 Opcode Mnemonic Updates

The `MNEMONIC_TO_OP` lookup in `_resolve_opcode()` will pick up new opcodes automatically when they are added to `opcodes.py`. No lower.py code changes needed for new opcodes.

The `free_ctx` mnemonic rename to `free_frame` will need a corresponding change in lower.py only if the mnemonic string is hardcoded anywhere (it is not -- lower.py uses `MNEMONIC_TO_OP` for all lookups).

### 4.2 Context Slot Qualifiers

The `[ctx_slot]` qualifier syntax in dfasm (`&node[3] <| add`) is parsed in lower.py and stored as `ctx_slot` on IRNode. This syntax and field will be renamed to reflect activation IDs if kept. If activation ID assignment is always automatic, the qualifier syntax may be removed entirely.
References in lower.py:

- `inst_def` transformer rule: parses `[N]` qualifier into `ctx_slot`
- `ctx_slot_ref` transformer rule: creates `CtxSlotRef` for macro parameter `[${param}]`
- `ctx_slot_range` transformer rule: creates `CtxSlotRange` for `[N:M]`

### 4.3 New Syntax (Optional, Deferred)

Frame directives in dfasm syntax (e.g., `@frame_layout` pragmas) could be added later. Not needed for v0 -- the allocator computes frame layouts automatically.

## 5. Expand Pass Changes (`asm/expand.py`)

### 5.1 Function Call Wiring

The expand pass currently generates cross-context call wiring:

- Trampoline `PASS` nodes with `ctx_override` edges
- `FREE_CTX` cleanup nodes
- Per-call-site context slot allocation via `CallSite` metadata

Under the frame model, the call wiring changes:

- **`ctx_override` edges** become frame-boundary edges. The semantic meaning is the same (data crosses activation boundaries), but the mechanism changes: instead of packing a target context into the instruction's `const` field with `ctx_mode=1`, the destination's pre-formed flit 1 value in the frame slot already encodes the target PE, offset, and act_id. Cross-activation routing is handled by frame setup, not by instruction encoding.
- **`FREE_CTX` nodes** become `FREE_FRAME` nodes. The opcode changes from `RoutingOp.FREE_CTX` to `RoutingOp.FREE_FRAME`. The expand pass references `RoutingOp.FREE_CTX` directly in `_wire_call_site()` at line 1160 of expand.py.
- **Trampoline nodes** may simplify. In the current design, trampolines exist to bridge context boundaries (an edge cannot cross contexts without a PASS node carrying `ctx_mode`). Under the frame model, destinations in frame slots already encode the target activation, so cross-activation edges are just edges whose destination flit 1 value points to a different activation. Trampolines may still be useful for fan-out or return routing but are no longer needed purely for context bridging.
- **`@ret` wiring** stays conceptually the same. Return routing in the frame model is a pre-formed flit 1 value loaded into a frame slot. The expand pass creates edges for return routing; the allocate pass resolves those edges to flit 1 values in frame slots.

### 5.2 CallSite Metadata

`CallSite` fields to rename:

- `free_ctx_nodes` -> `free_frame_nodes`

The `trampoline_nodes` field stays (trampolines may still be generated for fan-out or return routing patterns).

### 5.3 CtxSlotRef Resolution

The expand pass resolves `CtxSlotRef` and `CtxSlotRange` during macro expansion. These will be renamed to `ActSlotRef` / `ActSlotRange` (or removed). The resolution logic in `_substitute_node()` and `_substitute_edge()` needs only a straightforward rename.

### 5.4 Built-in Macros

The built-in macros in `builtins.py` do NOT reference `ctx_slot`, `ctx`, `free_ctx`, or any context-specific concepts directly. They use generic opcodes (`add`, `brgt`, `gate`, `pass`, `inc`, `const`) and edge routing (`@ret`, `${param}`). No changes needed to built-in macros.

## 6. Resolve Pass Changes (`asm/resolve.py`)

No changes needed. The resolve pass validates that edge endpoints exist and detects scope violations. It operates on node names and graph structure, not on PE-level concepts like contexts or frames.

The resolve pass does not reference `ctx`, `ctx_slot`, `ctx_override`, or any context-related fields.

## 7. Place Pass Changes (`asm/place.py`)

### 7.1 New Constraint: Matchable Offset Limit

The placement pass must enforce the 8-matchable-offset constraint: at most 8 dyadic instructions per activation per PE. This is a new constraint that does not exist in the current codebase.

Currently, `_count_iram_cost()` counts dyadic nodes as costing 2 IRAM slots and monadic as 1.
Under the frame model:

- Dyadic nodes cost 1 IRAM slot (the matching store entry is in the 670s, not IRAM)
- Monadic nodes cost 1 IRAM slot
- The 8-dyadic-per-activation limit is a separate constraint from IRAM capacity

The placement pass needs to track dyadic instruction count per activation group per PE, in addition to total IRAM usage.

### 7.2 IRAM Cost Recalculation

`_count_iram_cost()` should return 1 for all node types (dyadic and monadic both use 1 IRAM slot in the frame model). The current cost of 2 for dyadic nodes exists because the old matching store occupied IRAM slots; in the frame model, match operands live in frame SRAM, not IRAM.

### 7.3 Context Slots -> Frames

`_auto_place_nodes()` currently tracks `ctx_used` per PE (context slots consumed). This becomes `frames_used` per PE (concurrent frames, max 4). The `ctx_scopes_per_pe` tracking of function scopes per PE maps directly to frame tracking: each function scope on a PE consumes one frame.

`SystemConfig.ctx_slots` references in place.py become `SystemConfig.frame_count`.

### 7.4 Instruction Deduplication Awareness

Because IRAM entries are activation-independent templates, multiple activations of the same function on the same PE share IRAM entries. The placement pass could account for this when computing IRAM utilisation: if two activations of function `$foo` run on PE0, they share IRAM entries, so the IRAM cost is the unique instruction count, not the total.

This is an optimisation, not a correctness requirement. For v0, the placement pass can conservatively count IRAM entries per unique function body without deduplication.

## 8. Allocate Pass Changes (`asm/allocate.py`) -- MAJOR

This is the largest change. The current allocate pass assigns IRAM offsets and context slots, then resolves destinations to `Addr` values. The frame model requires fundamentally different allocation logic.
### 8.1 IRAM Offset Assignment

Current behaviour (`_assign_iram_offsets()`):

- Dyadic nodes get offsets 0..D-1
- Monadic nodes get offsets D..D+M-1
- Total must fit in `iram_capacity`

New behaviour:

- Dyadic nodes get offsets 0-7 (within the matchable offset range). At most 8 dyadic instructions per activation group per PE. Because IRAM is activation-independent (shared templates), dyadic offset assignment is per-PE, not per-activation.
- Monadic nodes get offsets 8-255 (or wherever dyadic offsets end).
- The `matchable_offsets` limit (default 8) constrains dyadic count.
- With instruction deduplication, multiple activations of the same function body share IRAM offsets. The allocator assigns offsets once per unique instruction template, not per activation.

### 8.2 Activation ID Assignment

Replaces `_assign_context_slots()`. The current function assigns context slot indices (0-15) per function scope per PE. The new function assigns 3-bit activation IDs (0-7) with at most 4 concurrent activations per PE.

Key differences:

- **Smaller space:** 8 act_ids (3-bit) vs 16 context slots. But only 4 can be concurrently active (4 physical frames).
- **ABA distance:** the allocator must maintain ABA distance between act_ids to prevent stale token collisions. With 4 concurrent frames out of 8 possible IDs, 4 IDs of ABA distance exist before wraparound. For static programs, the allocator can assign act_ids sequentially (0, 1, 2, 3 for 4 concurrent activations).
- **Per-call-site allocation:** same concept as current `call_site_to_ctx_on_pe` -- each call site that creates a new activation gets a fresh act_id. But the budget is 4 concurrent frames instead of 16 context slots.

The function scope grouping logic (`_extract_function_scope()`) stays.

### 8.3 Frame Layout Allocation (NEW)

This is entirely new functionality. After IRAM offset and act_id assignment, the allocator must compute the frame layout for each activation: which frame slots hold what data.

**Frame slot roles:**

| Role | Slot Range | Count per Instruction | Notes |
|---|---|---|---|
| Match operand | 0-7 | 1 per dyadic instruction | Indexed by matchable offset. Presence bit in 670. |
| Constant | 8+ | 0-1 per instruction | `mode[0]` (has_const) selects whether const is read from frame[fref] |
| Destination | variable | 0-2 per instruction | Pre-formed flit 1 values. 1 for modes 0/1, 2 for modes 2/3. |
| Accumulator/sink | variable | 0-1 per instruction | For SINK modes (6/7). Write-back target. |
| SM parameters | variable | 0-2 per SM instruction | SM_id + addr, data or return routing. |

**Slot assignment algorithm:**

1. Reserve slots 0-7 for match operands (one per dyadic instruction, indexed by the instruction's matchable offset).
2. Assign constant slots starting at slot 8. Constants that are shared across instructions within the same activation can be deduplicated (same value -> same slot).
3. Assign destination slots. Each destination is a pre-formed flit 1 value. Destinations shared across instructions can be deduplicated.
4. Assign sink/accumulator slots for SINK mode instructions.
5. Assign SM parameter slots for SM operations.
6. Verify total slots <= 64. If exceeded, report a frame overflow error.

**fref computation:**

The `fref` field in the instruction word points to the base of a contiguous group of frame slots used by that instruction. The slot count depends on the mode:

| Mode | Slots at fref | Layout |
|---|---|---|
| 0 | 1 | [dest] |
| 1 | 2 | [const, dest] |
| 2 | 2 | [dest1, dest2] |
| 3 | 3 | [const, dest1, dest2] |
| 4 | 0 | (no frame access) |
| 5 | 1 | [const] |
| 6 | 1 | [sink_target] (write) |
| 7 | 1 | [sink_target] (read-modify-write) |

The allocator must arrange frame slots such that each instruction's constant and destination(s) are contiguous starting at `fref`. This may require careful slot packing or a simple sequential allocation strategy.
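The simple sequential strategy can be sketched as follows. This is a minimal illustration, not the allocator's actual implementation: `SLOTS_PER_MODE` and `assign_frefs` are hypothetical names, the sketch assumes slots 0-7 stay reserved for match operands, and it performs no constant or destination deduplication.

```python
# Hypothetical sketch: the names and the purely sequential strategy are assumptions.

# Slots consumed at fref for each 3-bit mode, copied from the fref table.
SLOTS_PER_MODE = {0: 1, 1: 2, 2: 2, 3: 3, 4: 0, 5: 1, 6: 1, 7: 1}

MATCH_SLOTS = 8   # slots 0-7 are reserved for match operands
FRAME_SLOTS = 64  # slots per frame


def assign_frefs(modes: list[int]) -> list[int]:
    """Give each instruction a contiguous slot group, packed upward from slot 8.

    Returns one fref per instruction. Mode 4 instructions touch no frame
    slots, so their fref is unused and set to 0 here.
    """
    frefs = []
    next_free = MATCH_SLOTS
    for mode in modes:
        width = SLOTS_PER_MODE[mode]
        if width == 0:
            frefs.append(0)
            continue
        if next_free + width > FRAME_SLOTS:
            raise ValueError(
                f"frame overflow: need slot {next_free + width - 1}, "
                f"frame has {FRAME_SLOTS} slots"
            )
        frefs.append(next_free)
        next_free += width
    return frefs


# Modes 1, 3, 4, 0 occupy slots 8-9, 10-12, none, and 13 respectively.
print(assign_frefs([1, 3, 4, 0]))  # [8, 10, 0, 13]
```

Sequential packing wastes slots whenever two instructions share a constant or destination, but it guarantees contiguity at `fref` trivially, which makes it a reasonable baseline to compare a deduplicating allocator against.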

**Mode computation:**

The allocator determines the mode for each instruction based on:

| Condition | Mode |
|---|---|
| 0 dests, no const, has sink | 6 (SINK) |
| 0 dests, const via frame accumulator (RMW) | 7 (SINK+CONST) |
| 1 dest, no const | 0 (INHERIT, single output) |
| 1 dest, has const | 1 (INHERIT, single output + const) |
| 2 dests, no const | 2 (INHERIT, fan-out) |
| 2 dests, has const | 3 (INHERIT, fan-out + const) |
| CHANGE_TAG, no const | 4 |
| CHANGE_TAG, has const | 5 |

CHANGE_TAG is used when the instruction's left operand provides the output destination dynamically (e.g., dynamic return routing via `EXTRACT_TAG` + `CHANGE_TAG`).

### 8.4 Pre-Formed Flit 1 Computation

The allocator must convert resolved destination `Addr` values into 16-bit packed flit 1 values for storage in frame destination slots.

**Flit 1 formats** (from `architecture-overview.md`):

```
DYADIC WIDE:    [0][0][port:1][PE:2][offset:8][act_id:3]     = 16 bits
MONADIC NORM:   [0][1][0][PE:2][offset:8][act_id:3]          = 16 bits
MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2]       = 16 bits
```

The packing function takes:

- `dest_pe: int` (2 bits)
- `dest_offset: int` (8 bits for dyadic/monadic normal, 7 for inline)
- `dest_act_id: int` (3 bits, for dyadic and monadic normal)
- `dest_port: Port` (1 bit, for dyadic wide only)
- `dest_type: TokenType` (dyadic wide, monadic normal, or monadic inline)

And produces a 16-bit packed flit 1 value.

The `dest_type` is determined by whether the destination instruction is dyadic (-> dyadic wide flit) or monadic (-> monadic normal flit). For trigger-only destinations (e.g., switch not-taken path), monadic inline is used.

This replaces the current `Addr(a, port, pe)` resolution. The `Addr` type may still be used as an intermediate representation, but the final output is a packed flit 1 value in a frame slot.
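A minimal sketch of such a packing function is shown below. It assumes the bracketed fields are listed most-significant-bit first (leftmost field at bit 15) and that spare bits are written as zero; `FlitKind` is a hypothetical stand-in for `TokenType`, and the port is passed as a raw bit rather than a `Port` value. None of these names exist in the codebase.

```python
from enum import Enum


class FlitKind(Enum):  # hypothetical stand-in for the document's TokenType
    DYADIC_WIDE = "dyadic_wide"
    MONADIC_NORM = "monadic_norm"
    MONADIC_INLINE = "monadic_inline"


def _field(name: str, value: int, bits: int) -> int:
    """Range-check a field so an oversized value cannot spill into its neighbours."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return value


def pack_flit1(kind: FlitKind, pe: int, offset: int,
               act_id: int = 0, port: int = 0) -> int:
    """Pack a destination into a 16-bit flit 1 value (assumed MSB-first layout)."""
    pe = _field("pe", pe, 2)
    if kind is FlitKind.DYADIC_WIDE:
        # [0][0][port:1][PE:2][offset:8][act_id:3]
        return (_field("port", port, 1) << 13 | pe << 11
                | _field("offset", offset, 8) << 3
                | _field("act_id", act_id, 3))
    if kind is FlitKind.MONADIC_NORM:
        # [0][1][0][PE:2][offset:8][act_id:3]
        return (0b010 << 13 | pe << 11
                | _field("offset", offset, 8) << 3
                | _field("act_id", act_id, 3))
    # MONADIC_INLINE: [0][1][1][PE:2][10][offset:7][spare:2], spare bits zero
    return 0b011 << 13 | pe << 11 | 0b10 << 9 | _field("offset", offset, 7) << 2


# Dyadic destination: right port (bit value 1), PE 1, offset 5, act_id 2.
print(f"{pack_flit1(FlitKind.DYADIC_WIDE, pe=1, offset=5, act_id=2, port=1):#06x}")  # 0x282a
```

Range-checking every field before shifting is the main design point: a 9-bit offset silently corrupting the PE field would be very hard to diagnose once the value is sitting in a frame slot.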
### 8.5 Destination Resolution Rework

The current `_resolve_destinations()` function creates `ResolvedDest` objects containing `Addr(a=iram_offset, port=edge.port, pe=dest_pe)`. Under the frame model, destination resolution must additionally:

1. Look up the destination node's `act_id` (not just `iram_offset` and `pe`).
2. Determine the destination token type (dyadic wide vs monadic normal vs monadic inline).
3. Compute the packed flit 1 value.
4. Assign the flit 1 value to a frame slot.

The `ResolvedDest` type may be extended to carry the packed flit 1 value, or the flit 1 computation may happen in a separate sub-pass after destination resolution.

### 8.6 ctx_mode / ctx_override Removal

The current allocator and codegen handle `ctx_override` edges by setting `ctx_mode=1` on the source instruction and packing the target context and generation into the `const` field. This entire mechanism is removed.

Under the frame model, cross-activation routing is handled by frame destination slots. The destination flit 1 value already encodes the target PE, offset, and act_id. No instruction-level `ctx_mode` is needed.

The `ctx_override` flag on `IREdge` may be kept for semantic annotation (this edge crosses activation boundaries) but has no effect on instruction encoding.

## 9. Codegen Changes (`asm/codegen.py`) -- MAJOR

### 9.1 Instruction Generation

Current codegen (`_build_iram_for_pe()`) generates `ALUInst` and `SMInst` objects. These are Python dataclasses that model the old instruction format with embedded destinations and constants:

```python
ALUInst(op, dest_l, dest_r, const, ctx_mode)
SMInst(op, sm_id, const, ret, ret_dyadic)
```

New codegen generates 16-bit instruction words matching the hardware format:

```python
@dataclass(frozen=True)
class Instruction:
    """16-bit instruction word for IRAM.

    [type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
    """
    type: int      # 0 = CM, 1 = SM
    opcode: int    # 5-bit opcode
    mode: int      # 3-bit mode (0-7)
    wide: bool     # 32-bit frame values
    fref: int      # 6-bit frame slot base index
```

The `Instruction` type replaces both `ALUInst` and `SMInst`. Constants and destinations are NOT in the instruction -- they are in frame slots.

The `PEConfig.iram` field type changes from `dict[int, ALUInst | SMInst]` to `dict[int, Instruction]` (or `dict[int, int]` if storing raw 16-bit words).

### 9.2 Frame Setup Sequence Generation (NEW)

Codegen must generate the bootstrap sequence that loads frame contents before execution begins. This is a stream of tokens:

1. **FrameControlToken (ALLOC)** for each activation:

   ```
   flit 1: [0][1][1][PE:2][00][op=0][spare:3][act_id:3]
   flit 2: (return routing or unused)
   ```

2. **PELocalWriteToken** for each frame slot that needs initialization:

   ```
   flit 1: [0][1][1][PE:2][01][region=1][spare:1][slot:5][act_id:3]
   flit 2: [data:16] (the frame slot value)
   ```

3. **PELocalWriteToken** for IRAM entries:

   ```
   flit 1: [0][1][1][PE:2][01][region=0][spare:1][slot:5][act_id:ignored]
   flit 2: [instruction:16]
   ```

The ordering matters: IRAM writes before frame setup, frame setup (ALLOC + slot writes) before seed tokens. More specifically:

```
IRAM writes (all PEs)
  -> ALLOC frame control (all activations)
  -> Frame slot writes (constants, destinations per activation)
  -> Seed tokens (initial data tokens to start execution)
```

### 9.3 New Token Types

The emulator will need new token types (or the existing types must be adapted):

- **FrameControlToken** -- ALLOC/FREE frame lifecycle. Currently not in `tokens.py`. Codegen needs to emit these.
- **PELocalWriteToken** -- writes to IRAM (region=0) or frame slots (region=1). The current `IRAMWriteToken` is a special case of this (region=0 only). It may be generalised or a new type added.

These are emulator-level changes that codegen depends on.
The codegen module must import and construct whatever token types the emulator provides.

### 9.4 Seed Token Generation

Current seed token generation creates `MonadToken` or `DyadToken` with `ctx` field. Changes:

- `ctx` field -> `act_id` field in token constructors
- `gen` field on `DyadToken` is removed (ABA protection is via the 670 valid bit, not generation counters)
- Seed tokens target `(pe, offset, act_id)` triples, packed from the destination's allocation data

### 9.5 Direct Mode Output

`generate_direct()` currently returns `AssemblyResult` with:

- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of `ALUInst/SMInst`
- `sm_configs: list[SMConfig]`
- `seed_tokens: list[MonadToken]`

New `AssemblyResult`:

- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of `Instruction` (new type), plus `frame_layouts: dict[int, FrameLayout]` mapping act_id to frame layout data
- `sm_configs: list[SMConfig]` -- unchanged
- `seed_tokens: list` -- may include `DyadToken` and `MonadToken` with `act_id` instead of `ctx`
- `setup_tokens: list` -- NEW: frame control and PE-local write tokens for bootstrap

### 9.6 Token Stream Mode Output

`generate_tokens()` currently produces:

```
SM init tokens -> IRAM write tokens -> seed tokens
```

New ordering:

```
SM init tokens -> IRAM write tokens -> ALLOC tokens -> frame slot write tokens -> seed tokens
```

### 9.7 ctx_mode Removal

The entire `ctx_mode` / `ctx_override` handling in `_build_iram_for_pe()` (lines 96-132 of codegen.py) is removed. Cross-activation routing is handled by frame destination slots, not by instruction encoding.

### 9.8 Route Restriction Computation

`_compute_route_restrictions()` is unchanged in concept. It scans edges to determine which PEs and SMs a given PE needs to route to. The implementation stays the same.

## 10. Serialize Pass Changes (`asm/serialize.py`)

### 10.1 Field Renaming

- `ctx` -> `act_id` in node serialization if activation ID is displayed
- Any `ctx_slot` qualifier in dfasm output becomes activation-related

### 10.2 New Fields

If `mode`, `fref`, and `wide` are displayed in serialized output (for debugging allocated IR), `_serialize_node()` needs to format them. Format could be: `&node|pe0|act2|mode1|fref8 <| add`

### 10.3 Round-Trip Support

The serialize pass must be able to round-trip new IR fields. Since `mode`, `fref`, and frame layout are only populated after allocation, serialization before allocation produces the same output as today (no new fields to display). Post-allocation serialization adds the new fields.

## 11. Built-in Macro Changes (`asm/builtins.py`)

No changes needed. The built-in macros (`#loop_counted`, `#loop_while`, `#permit_inject`, `#reduce_2/3/4`) use generic opcodes and edge routing. They do not reference `ctx`, `ctx_slot`, `free_ctx`, or any context-specific concepts.

The `free_ctx` opcode rename to `free_frame` does not affect builtins because none of them use `free_ctx`.

## 12. Error Types (`asm/errors.py`)

### 12.1 New Error Categories

Add to `ErrorCategory`:

```python
class ErrorCategory(Enum):
    # ... existing ...
    FRAME = "frame"  # Frame layout overflow, slot conflicts
```

### 12.2 New Error Conditions

| Error | Category | Source Pass | Condition |
|---|---|---|---|
| Frame slot overflow | FRAME | allocate | Total slots > 64 for an activation |
| Matchable offset overflow | RESOURCE | allocate/place | > 8 dyadic instructions per activation per PE |
| Frame count overflow | RESOURCE | allocate | > 4 concurrent activations on one PE |
| Act ID exhaustion | RESOURCE | allocate | > 8 activation IDs needed (wraparound) |

## 13. Dfgraph Pipeline Impact (`dfgraph/`)

### 13.1 `dfgraph/pipeline.py`

The pipeline runner calls `allocate()` and uses `IRGraph` types.
If `SystemConfig` fields change, `pipeline.py` may need minor updates for default values or error handling. The pipeline runner itself does not inspect allocation results in detail.

### 13.2 `dfgraph/graph_json.py`

Currently includes `ctx` field in node JSON output. This becomes `act_id`. The field rename is straightforward.

### 13.3 `dfgraph/categories.py`

References `RoutingOp.FREE_CTX` in the CONFIG category mapping. This becomes `RoutingOp.FREE_FRAME`. One line change.

`EXTRACT_TAG` (if added) maps to the ROUTING or CONFIG category depending on its semantics.

## 14. Monitor Impact (`monitor/`)

### 14.1 `monitor/snapshot.py`

`PESnapshot` currently captures `matching_store` (2D array of `MatchEntry`) and `gen_counters`. Under the frame model:

- `matching_store` becomes frame state (per-frame slot values, presence bits)
- `gen_counters` is removed (no generation counters in frame model)
- New: `frame_allocations` (act_id -> frame_id mapping), `frame_slots` (per-frame slot contents)

### 14.2 `monitor/graph_json.py`

Node state overlay: `ctx` field -> `act_id`. Frame slot contents may be added to the state overlay for debugging.

### 14.3 `monitor/repl.py`

The `pe` command displays PE state including matching store contents. This needs updating to display frame state instead.

## 15. Dependency Order

Implementation order based on the dependency graph:

### Phase 1: Foundation Types

1. **`cm_inst.py`**: Add `RoutingOp.FREE_FRAME`, `RoutingOp.EXTRACT_TAG`. Add `Instruction` dataclass. Update `is_monadic_alu()`.
2. **`asm/ir.py`**: Add `FrameSlotMap`, `FrameLayout`. Rename `ctx_slot` -> `act_slot`, `ctx` -> `act_id` on IRNode. Update `SystemConfig` (remove `ctx_slots`, add `frame_count`, `frame_slots`, `matchable_offsets`). Rename `CtxSlotRef` -> `ActSlotRef`, `CtxSlotRange` -> `ActSlotRange`.
3. **`asm/opcodes.py`**: Add new opcodes to `MNEMONIC_TO_OP`, `_MONADIC_OPS_TUPLES`. Rename `free_ctx` -> `free_frame`.
4. **`asm/errors.py`**: Add `ErrorCategory.FRAME`.
### Phase 2: Allocation and Codegen (the big changes)

5. **`asm/allocate.py`**: Rewrite `_assign_context_slots()` -> `_assign_act_ids()`. Add `_compute_frame_layouts()`. Add `_compute_modes()`. Add `_pack_flit1()`. Update `_assign_iram_offsets()` for new offset scheme. Update `_resolve_destinations()` to produce flit 1 values. Remove `ctx_mode`/`ctx_override` handling.
6. **`asm/codegen.py`**: Replace `ALUInst`/`SMInst` generation with `Instruction` generation. Add frame setup token generation. Add `FrameControlToken` / `PELocalWriteToken` emission. Update seed token generation. Remove `ctx_mode` handling. Update `PEConfig` construction. Update `AssemblyResult`.

### Phase 3: Upstream Pass Adjustments

7. **`asm/lower.py`**: Rename `ctx_slot` references. Update qualifier parsing if syntax changes.
8. **`asm/expand.py`**: Replace `FREE_CTX` with `FREE_FRAME` in call wiring. Rename `CtxSlotRef` -> `ActSlotRef`. Update `CallSite` field names. Simplify trampoline logic if `ctx_override` is removed.
9. **`asm/place.py`**: Update `_count_iram_cost()` (all nodes cost 1). Replace `ctx_slots` tracking with `frame_count`. Add matchable-offset-per-activation constraint.

### Phase 4: Output and Tooling

10. **`asm/serialize.py`**: Rename `ctx` -> `act_id` in output. Add mode/fref display for post-allocation IR.
11. **`asm/builtins.py`**: No changes (verified).
12. **`dfgraph/categories.py`**: `FREE_CTX` -> `FREE_FRAME`.
13. **`dfgraph/graph_json.py`**: `ctx` -> `act_id` in JSON output.
14. **`monitor/`**: Update snapshot, graph_json, and REPL for frame model.

### Phase 5: Emulator Updates (out of scope for this doc)

15. **`emu/types.py`**: `PEConfig` changes (`iram` type, remove `ctx_slots`, add frame parameters).
16. **`emu/pe.py`**: Frame-based matching instead of context-slot matching store.
17. **`tokens.py`**: `ctx` -> `act_id`, remove `gen`. Add `FrameControlToken`, `PELocalWriteToken`.

## 16. What Stays the Same

- **dfasm grammar** (`dfasm.lark`): mostly unchanged. May add frame directives later, but not required for v0.
- **Parse pass** (Lark parser): no changes.
- **Error infrastructure** (`errors.py`): structure unchanged, one new category added.
- **Graph structure** (`IRGraph`, `IREdge`, `IRRegion`): unchanged.
- **Pipeline architecture**: still 7 passes in the same order.
- **Resolve pass**: completely unchanged.
- **Built-in macros**: completely unchanged.
- **SM-related codegen**: SMConfig construction, SM init tokens, data defs -- all unchanged.
- **Name resolution and scoping**: unchanged.
- **Macro expansion core**: parameter substitution, variadic repetition, opcode parameters -- all unchanged. Only call wiring details change.

## 17. Risk Assessment

| Area | Risk | Mitigation |
|---|---|---|
| Frame layout allocation | New, complex algorithm | Start with simple sequential allocation, optimise later |
| Flit 1 packing | Bit-level encoding must match hardware spec | Unit tests against architecture-overview.md bit layouts |
| Instruction deduplication | Interaction with frame layouts is subtle | Defer dedup to optimisation pass; v0 allocates per-activation |
| Mode computation | 8 modes, context-dependent selection | Exhaustive test matrix against mode table |
| ABA distance | Act_id assignment must maintain distance | Sequential assignment (0,1,2,3) is safe for static programs |
| `ctx_override` removal | Affects expand pass call wiring | Can keep `ctx_override` as semantic annotation initially |
| Emulator dependency | Codegen must emit tokens emulator can consume | Phase emulator changes alongside codegen; can use adapter layer |
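As an illustration of the "exhaustive test matrix" mitigation for mode computation, the mode table from section 8.3 is small enough to check row by row. The sketch below is an assumption-laden example, not the planned `_compute_modes()`: `select_mode` and its parameters are hypothetical, and it treats the low mode bit as the const/RMW flag for every routing class, which matches the `mode[0]` (has_const) note in the frame slot roles table.

```python
# Hypothetical sketch: select_mode and its signature are illustrative assumptions.

def select_mode(dest_count: int, has_const: bool,
                change_tag: bool = False, sink: bool = False) -> int:
    """Pick the 3-bit mode for one instruction from the section 8.3 table."""
    if change_tag:
        base = 4  # destination supplied dynamically by the left operand
    elif sink:
        if dest_count != 0:
            raise ValueError("SINK modes take no destinations")
        base = 6
    elif dest_count == 1:
        base = 0
    elif dest_count == 2:
        base = 2
    else:
        raise ValueError(
            f"no mode for {dest_count} destinations without sink or CHANGE_TAG"
        )
    # Low bit set when a constant (or, for SINK, the RMW variant) is used.
    return base | int(has_const)


# One row per entry in the mode table:
# (dest_count, has_const, change_tag, sink, expected mode)
MODE_TABLE = [
    (1, False, False, False, 0),
    (1, True,  False, False, 1),
    (2, False, False, False, 2),
    (2, True,  False, False, 3),
    (0, False, True,  False, 4),
    (0, True,  True,  False, 5),
    (0, False, False, True,  6),
    (0, True,  False, True,  7),
]

for dests, const, ctag, sink, expected in MODE_TABLE:
    assert select_mode(dests, const, ctag, sink) == expected
```

Keeping the expected values in a literal table, separate from the selection logic, means a change to the hardware mode encoding shows up as a one-line diff in the table rather than a silent behaviour change.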