OR-1 dataflow CPU sketch
# Assembler Redesign Plan: Frame-Based PE Model

This document describes the changes needed in the OR1 assembler (`asm/`)
to match the frame-based PE redesign described in `pe-design.md` and
`architecture-overview.md`. It covers every assembler pass and supporting
module, identifying what changes, what stays, and the dependency order
for implementation.

The assembler pipeline is: parse -> lower -> expand -> resolve -> place ->
allocate -> codegen. Most changes concentrate in allocate and codegen;
earlier passes are minimally affected.

## 1. Overview

The PE redesign replaces the context-slot matching model with a
frame-based model. Key architectural changes that affect the assembler:

- **Context slots replaced by activation IDs and frames.** The old
  `ctx_slots` parameter (up to 16 slots) becomes `frame_count` (4
  concurrent frames) with 3-bit `act_id` (8 unique IDs). Each frame has
  64 addressable slots.
- **Instructions become templates.** IRAM entries are 16-bit words
  `[type:1][opcode:5][mode:3][wide:1][fref:6]`. Constants and
  destinations are NOT in the instruction -- they live in frame slots
  referenced by `fref`.
- **Instruction deduplication.** Multiple activations can share the same
  IRAM entry because per-activation data lives in frames, not
  instructions. The number of IRAM entries needed is the number of unique
  operation shapes, not the total operations executed.
- **8 matchable offsets per frame.** Dyadic instructions must be assigned
  offsets 0-7 within each activation. The assembler must enforce this
  constraint and split function bodies that exceed it.
- **Pre-formed flit 1 values.** Output destinations are 16-bit packed
  flit 1 values stored in frame slots. The PE reads the slot and puts it
  on the bus verbatim as flit 1. Token type is determined by the prefix
  bits in the stored flit value.
- **Frame setup via PE-local write tokens.** Constants, destinations, and
  other per-activation data are loaded into frame slots by a stream of
  PE-local write tokens before execution begins.
- **Frame lifecycle tokens.** ALLOC (frame control, prefix `011+00`,
  op=0) allocates a frame; FREE (frame control, op=1) or FREE_FRAME
  instruction releases it.
- **Mode field replaces output routing logic.** The 3-bit `mode` field
  in the instruction word encodes output routing (INHERIT/CHANGE_TAG/SINK)
  and constant presence. The assembler must compute the correct mode for
  each instruction based on its edge topology and constant usage.

## 2. IR Type Changes (`asm/ir.py`)

### 2.1 IRNode Changes

Current fields affected:

| Current Field | Change | Notes |
|---|---|---|
| `ctx_slot: Optional[Union[int, CtxSlotRef, CtxSlotRange]]` | Rename to `act_slot` or remove entirely | Macro parameter support (`CtxSlotRef`) may still be needed for parameterized activation grouping |
| `iram_offset: Optional[int]` | Keep | Semantics unchanged: index into PE's IRAM. Range 0-255. |
| `ctx: Optional[int]` | Rename to `act_id: Optional[int]` | 3-bit activation ID (0-7) instead of context slot |

New fields to add to IRNode:

| New Field | Type | Purpose |
|---|---|---|
| `mode` | `Optional[int]` | 3-bit instruction mode (0-7), computed during allocate |
| `fref` | `Optional[int]` | 6-bit frame slot base index (0-63), computed during allocate |
| `wide` | `bool` | 32-bit frame values flag, default False |
| `frame_layout` | `Optional[FrameSlotMap]` | Per-node frame slot assignments (see below) |

### 2.2 New IR Types

```python
@dataclass(frozen=True)
class FrameSlotMap:
    """Frame slot assignments for one instruction within an activation.

    Maps logical roles to physical frame slot indices within the 64-slot
    frame. Computed by the allocate pass.

    Attributes:
        match_slot: Frame slot for dyadic match operand storage (offsets 0-7)
        const_slot: Frame slot for IRAM constant (if any)
        dest_slots: Frame slot(s) for pre-formed flit 1 destination values
        sink_slot: Frame slot for SINK mode write-back target
    """
    match_slot: Optional[int] = None
    const_slot: Optional[int] = None
    dest_slots: tuple[int, ...] = ()
    sink_slot: Optional[int] = None
```

```python
@dataclass(frozen=True)
class FrameLayout:
    """Complete frame layout for one activation on one PE.

    Computed by the allocate pass. Used by codegen to generate PE-local
    write tokens for frame setup.

    Attributes:
        act_id: Activation ID (0-7)
        pe_id: PE where this frame lives
        slots: Dict mapping slot index to (role, value) pairs
        total_slots: Total slots used (must be <= 64)
    """
    act_id: int
    pe_id: int
    slots: dict[int, tuple[str, int]]
    total_slots: int
```

### 2.3 SystemConfig Changes

Current `SystemConfig`:

```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = DEFAULT_IRAM_CAPACITY  # 128
    ctx_slots: int = DEFAULT_CTX_SLOTS  # 16
```

New `SystemConfig`:

```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 256    # 8-bit offset, 256 entries
    frame_count: int = 4        # max concurrent frames per PE
    frame_slots: int = 64       # slots per frame
    matchable_offsets: int = 8  # max dyadic instructions per activation per PE
```

The defaults change: `DEFAULT_IRAM_CAPACITY` goes from 128 to 256
(matching the 8-bit offset field). `DEFAULT_CTX_SLOTS` is removed
entirely and replaced by `frame_count`, `frame_slots`, and
`matchable_offsets`.
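The new limits all follow from field widths stated in the overview (8-bit IRAM offset, 6-bit `fref`, 3-bit `act_id`), so one option is to reject out-of-range configurations at construction time. The sketch below is illustrative only: the `__post_init__` checks and the `_check` helper are assumptions, not part of the existing `asm/ir.py`.

```python
from dataclasses import dataclass


def _check(name: str, value: int, upper: int) -> None:
    # Hypothetical helper: raise if value is outside 1..upper.
    if not 1 <= value <= upper:
        raise ValueError(f"{name}={value} outside 1..{upper}")


@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 256    # 8-bit offset, 256 entries
    frame_count: int = 4        # max concurrent frames per PE
    frame_slots: int = 64       # slots per frame
    matchable_offsets: int = 8  # max dyadic instructions per activation per PE

    def __post_init__(self) -> None:
        # Assumed validation, derived from the bit widths in section 1.
        _check("iram_capacity", self.iram_capacity, 256)   # 8-bit IRAM offset
        _check("frame_slots", self.frame_slots, 64)        # 6-bit fref
        _check("frame_count", self.frame_count, 8)         # 3-bit act_id
        _check("matchable_offsets", self.matchable_offsets, self.frame_slots)
```

With this in place, `SystemConfig(pe_count=4, sm_count=1)` constructs normally, while `SystemConfig(pe_count=4, sm_count=1, frame_slots=128)` raises `ValueError`.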

### 2.4 What Stays

- `IREdge` -- unchanged except `ctx_override` may need renaming or
  semantic adjustment (see section 5)
- `IRGraph` structure -- unchanged
- `IRDataDef` -- unchanged (SM data definitions are orthogonal to PE frames)
- `IRRegion`, `RegionKind` -- unchanged
- `NameRef`, `ResolvedDest` -- unchanged
- `MacroDef`, `IRMacroCall`, `MacroParam`, `ParamRef` -- unchanged
- `CallSite`, `CallSiteResult` -- fields may need renaming
  (`trampoline_nodes`, `free_ctx_nodes` -> `free_frame_nodes`)
- `SourceLoc`, `ConstExpr`, `IRRepetitionBlock` -- unchanged

### 2.5 CtxSlotRef / CtxSlotRange

`CtxSlotRef` and `CtxSlotRange` are used in macro templates for
parameterized context slot assignment. Under the frame model, these
become `ActSlotRef` / `ActSlotRange` (or are removed if activation ID
assignment is always automatic). The expand pass resolves these to
concrete values during macro expansion. The rename is straightforward
but touches `ir.py`, `lower.py`, and `expand.py`.

## 3. Opcode Changes (`asm/opcodes.py`)

### 3.1 Renamed Opcodes

| Current | New | Notes |
|---|---|---|
| `RoutingOp.FREE_CTX` | `RoutingOp.FREE_FRAME` | Deallocates a frame instead of a context slot |

The mnemonic mapping changes: `"free_ctx"` -> `"free_frame"` in
`MNEMONIC_TO_OP`. The `OP_TO_MNEMONIC` reverse mapping updates
correspondingly.

### 3.2 New Opcodes

| Opcode | Type | Arity | Purpose |
|---|---|---|---|
| `RoutingOp.EXTRACT_TAG` | Monadic CM | Monadic | Captures executing token's identity as 16-bit packed flit 1 value (return continuation) |
| `RoutingOp.ALLOC_REMOTE` | Monadic CM | Monadic | Triggers ALLOC frame control token to a target PE (may be handled at codegen level instead) |

`EXTRACT_TAG` must be added to `MNEMONIC_TO_OP` (mnemonic: `"extract_tag"`),
to `_MONADIC_OPS_TUPLES`, and to `cm_inst.py` as a new `RoutingOp` enum
member.

Whether `ALLOC_REMOTE` is an ALU opcode or purely a codegen-emitted
frame control token depends on whether the assembler exposes it as a
user-facing instruction or handles it internally during call wiring.
For static calls, ALLOC is emitted by codegen as a frame control token,
not as an ALU instruction. For dynamic calls, it may need to be an
instruction.

### 3.3 Mode-Dependent Arity

The current `is_monadic(op, const)` and `is_dyadic(op, const)` functions
remain correct in concept. The `mode` field is orthogonal to arity --
mode determines output routing behaviour, not input operand count. No
changes needed to arity classification.

### 3.4 Monadic/Dyadic Classification

Unchanged. The distinction between monadic and dyadic is fundamental to
matching (dyadic tokens go through presence-based matching in the 670s;
monadic tokens bypass matching). The assembler's arity classification
drives IRAM offset assignment (dyadic at offsets 0-7, monadic at 8+).

## 4. Lower Pass Changes (`asm/lower.py`)

Minimal changes. The lower pass translates Lark CST to IR and is mostly
opcode-agnostic.

### 4.1 Opcode Mnemonic Updates

The `MNEMONIC_TO_OP` lookup in `_resolve_opcode()` will pick up new
opcodes automatically when they are added to `opcodes.py`. No lower.py
code changes needed for new opcodes.

The `free_ctx` mnemonic rename to `free_frame` will need a corresponding
change in lower.py only if the mnemonic string is hardcoded anywhere
(it is not -- lower.py uses `MNEMONIC_TO_OP` for all lookups).

### 4.2 Context Slot Qualifiers

The `[ctx_slot]` qualifier syntax in dfasm (`&node[3] <| add`) is
parsed in lower.py and stored as `ctx_slot` on IRNode. This syntax and
field will be renamed to reflect activation IDs if kept. If activation
ID assignment is always automatic, the qualifier syntax may be removed
entirely.

References in lower.py:
- `inst_def` transformer rule: parses `[N]` qualifier into `ctx_slot`
- `ctx_slot_ref` transformer rule: creates `CtxSlotRef` for macro
  parameter `[${param}]`
- `ctx_slot_range` transformer rule: creates `CtxSlotRange` for `[N:M]`

### 4.3 New Syntax (Optional, Deferred)

Frame directives in dfasm syntax (e.g., `@frame_layout` pragmas) could
be added later. Not needed for v0 -- the allocator computes frame
layouts automatically.

## 5. Expand Pass Changes (`asm/expand.py`)

### 5.1 Function Call Wiring

The expand pass currently generates cross-context call wiring:
- Trampoline `PASS` nodes with `ctx_override` edges
- `FREE_CTX` cleanup nodes
- Per-call-site context slot allocation via `CallSite` metadata

Under the frame model, the call wiring changes:

- **`ctx_override` edges** become frame-boundary edges. The semantic
  meaning is the same (data crosses activation boundaries), but the
  mechanism changes: instead of packing a target context into the
  instruction's `const` field with `ctx_mode=1`, the destination's
  pre-formed flit 1 value in the frame slot already encodes the target
  PE, offset, and act_id. Cross-activation routing is handled by frame
  setup, not by instruction encoding.

- **`FREE_CTX` nodes** become `FREE_FRAME` nodes. The opcode changes
  from `RoutingOp.FREE_CTX` to `RoutingOp.FREE_FRAME`. The expand pass
  references `RoutingOp.FREE_CTX` directly in `_wire_call_site()` at
  line 1160 of expand.py.

- **Trampoline nodes** may simplify. In the current design, trampolines
  exist to bridge context boundaries (an edge cannot cross contexts
  without a PASS node carrying `ctx_mode`). Under the frame model,
  destinations in frame slots already encode the target activation, so
  cross-activation edges are just edges whose destination flit 1 value
  points to a different activation. Trampolines may still be useful for
  fan-out or return routing but are no longer needed purely for context
  bridging.

- **`@ret` wiring** stays conceptually the same. Return routing in the
  frame model is a pre-formed flit 1 value loaded into a frame slot.
  The expand pass creates edges for return routing; the allocate pass
  resolves those edges to flit 1 values in frame slots.

### 5.2 CallSite Metadata

`CallSite` fields to rename:
- `free_ctx_nodes` -> `free_frame_nodes`

The `trampoline_nodes` field stays (trampolines may still be generated
for fan-out or return routing patterns).

### 5.3 CtxSlotRef Resolution

The expand pass resolves `CtxSlotRef` and `CtxSlotRange` during macro
expansion. These will be renamed to `ActSlotRef` / `ActSlotRange` (or
removed). The resolution logic in `_substitute_node()` and
`_substitute_edge()` is straightforward renaming.

### 5.4 Built-in Macros

The built-in macros in `builtins.py` do NOT reference `ctx_slot`, `ctx`,
`free_ctx`, or any context-specific concepts directly. They use generic
opcodes (`add`, `brgt`, `gate`, `pass`, `inc`, `const`) and edge routing
(`@ret`, `${param}`). No changes needed to built-in macros.

## 6. Resolve Pass Changes (`asm/resolve.py`)

No changes needed. The resolve pass validates that edge endpoints exist
and detects scope violations. It operates on node names and graph
structure, not on PE-level concepts like contexts or frames. The resolve
pass does not reference `ctx`, `ctx_slot`, `ctx_override`, or any
context-related fields.

## 7. Place Pass Changes (`asm/place.py`)

### 7.1 New Constraint: Matchable Offset Limit

The placement pass must enforce the 8-matchable-offset constraint: at
most 8 dyadic instructions per activation per PE. This is a new
constraint that does not exist in the current codebase.
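
A minimal sketch of such a check, assuming each placed node exposes its PE, its activation group (for example its function scope), and whether it is dyadic. `PlacedNode` and `matchable_offset_violations` are hypothetical names used for illustration; the real pass would work on `IRNode` and report through the assembler's error types.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedNode:
    """Hypothetical stand-in for the IRNode fields this check needs."""
    name: str
    pe: int
    scope: str    # activation group, e.g. the enclosing function scope
    dyadic: bool


def matchable_offset_violations(
    nodes: list[PlacedNode], limit: int = 8
) -> dict[tuple[int, str], int]:
    """Return {(pe, scope): dyadic_count} for every group over the limit."""
    counts = Counter((n.pe, n.scope) for n in nodes if n.dyadic)
    return {group: count for group, count in counts.items() if count > limit}


# Nine dyadic nodes of $foo on PE0 exceed the default limit of 8;
# the monadic node does not count toward it.
nodes = [PlacedNode(f"&add{i}", pe=0, scope="$foo", dyadic=True) for i in range(9)]
nodes.append(PlacedNode("&inc0", pe=0, scope="$foo", dyadic=False))
violations = matchable_offset_violations(nodes)
```

Keeping the check separate from IRAM capacity accounting mirrors the point that the two limits are independent: a PE can have free IRAM entries and still be out of matchable offsets for one activation group.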

Currently, `_count_iram_cost()` counts dyadic nodes as costing 2 IRAM
slots and monadic as 1. Under the frame model:
- Dyadic nodes cost 1 IRAM slot (the matching store entry is in the
  670s, not IRAM)
- Monadic nodes cost 1 IRAM slot
- The 8-dyadic-per-activation limit is a separate constraint from IRAM
  capacity

The placement pass needs to track dyadic instruction count per
activation group per PE, in addition to total IRAM usage.

### 7.2 IRAM Cost Recalculation

`_count_iram_cost()` should return 1 for all node types (dyadic and
monadic both use 1 IRAM slot in the frame model). The current cost of 2
for dyadic nodes was because the old matching store occupied IRAM slots;
in the frame model, match operands live in frame SRAM, not IRAM.

### 7.3 Context Slots -> Frames

`_auto_place_nodes()` currently tracks `ctx_used` per PE (context slots
consumed). This becomes `frames_used` per PE (concurrent frames, max 4).
The `ctx_scopes_per_pe` tracking of function scopes per PE maps directly
to frame tracking: each function scope on a PE consumes one frame.

`SystemConfig.ctx_slots` references in place.py become
`SystemConfig.frame_count`.

### 7.4 Instruction Deduplication Awareness

Because IRAM entries are activation-independent templates, multiple
activations of the same function on the same PE share IRAM entries. The
placement pass could account for this when computing IRAM utilisation:
if two activations of function `$foo` run on PE0, they share IRAM
entries, so the IRAM cost is the unique instruction count, not the total.

This is an optimisation, not a correctness requirement. The placement
pass can conservatively count IRAM entries per unique function body
without deduplication for v0.

## 8. Allocate Pass Changes (`asm/allocate.py`) -- MAJOR

This is the largest change. The current allocate pass assigns IRAM
offsets and context slots, then resolves destinations to `Addr` values.
The frame model requires fundamentally different allocation logic.

### 8.1 IRAM Offset Assignment

Current behaviour (`_assign_iram_offsets()`):
- Dyadic nodes get offsets 0..D-1
- Monadic nodes get offsets D..D+M-1
- Total must fit in `iram_capacity`

New behaviour:
- Dyadic nodes get offsets 0-7 (within the matchable offset range).
  At most 8 dyadic instructions per activation group per PE. Because
  IRAM is activation-independent (shared templates), dyadic offset
  assignment is per-PE, not per-activation.
- Monadic nodes get offsets 8-255 (or wherever dyadic offsets end).
- The `matchable_offsets` limit (default 8) constrains dyadic count.
- With instruction deduplication, multiple activations of the same
  function body share IRAM offsets. The allocator assigns offsets once
  per unique instruction template, not per activation.

### 8.2 Activation ID Assignment

Replaces `_assign_context_slots()`. The current function assigns context
slot indices (0-15) per function scope per PE. The new function assigns
3-bit activation IDs (0-7) with at most 4 concurrent activations per PE.

Key differences:
- **Smaller space:** 8 act_ids (3-bit) vs 16 context slots. But only 4
  can be concurrently active (4 physical frames).
- **ABA distance:** the allocator must maintain ABA distance between
  act_ids to prevent stale token collisions. With 4 concurrent frames
  out of 8 possible IDs, 4 IDs of ABA distance exist before wraparound.
  For static programs, the allocator can assign act_ids sequentially
  (0, 1, 2, 3 for 4 concurrent activations).
- **Per-call-site allocation:** same concept as current
  `call_site_to_ctx_on_pe` -- each call site that creates a new
  activation gets a fresh act_id. But the budget is 4 concurrent frames
  instead of 16 context slots.

The function scope grouping logic (`_extract_function_scope()`) stays.

### 8.3 Frame Layout Allocation (NEW)

This is entirely new functionality. After IRAM offset and act_id
assignment, the allocator must compute the frame layout for each
activation: which frame slots hold what data.

**Frame slot roles:**

| Role | Slot Range | Count per Instruction | Notes |
|---|---|---|---|
| Match operand | 0-7 | 1 per dyadic instruction | Indexed by matchable offset. Presence bit in 670. |
| Constant | 8+ | 0-1 per instruction | `mode[0]` (has_const) selects whether const is read from frame[fref] |
| Destination | variable | 0-2 per instruction | Pre-formed flit 1 values. 1 for modes 0/1, 2 for modes 2/3. |
| Accumulator/sink | variable | 0-1 per instruction | For SINK modes (6/7). Write-back target. |
| SM parameters | variable | 0-2 per SM instruction | SM_id + addr, data or return routing. |

**Slot assignment algorithm:**

1. Reserve slots 0-7 for match operands (one per dyadic instruction,
   indexed by the instruction's matchable offset).
2. Assign constant slots starting at slot 8. Constants that are shared
   across instructions within the same activation can be deduplicated
   (same value -> same slot).
3. Assign destination slots. Each destination is a pre-formed flit 1
   value. Destinations shared across instructions can be deduplicated.
4. Assign sink/accumulator slots for SINK mode instructions.
5. Assign SM parameter slots for SM operations.
6. Verify total slots <= 64. If exceeded, report a frame overflow error.

**fref computation:**

The `fref` field in the instruction word points to the base of a
contiguous group of frame slots used by that instruction. The slot count
depends on the mode:

| Mode | Slots at fref | Layout |
|---|---|---|
| 0 | 1 | [dest] |
| 1 | 2 | [const, dest] |
| 2 | 2 | [dest1, dest2] |
| 3 | 3 | [const, dest1, dest2] |
| 4 | 0 | (no frame access) |
| 5 | 1 | [const] |
| 6 | 1 | [sink_target] (write) |
| 7 | 1 | [sink_target] (read-modify-write) |

The allocator must arrange frame slots such that each instruction's
constant and destination(s) are contiguous starting at `fref`. This
may require careful slot packing or a simple sequential allocation
strategy.

**Mode computation:**

The allocator determines the mode for each instruction based on:

| Condition | Mode |
|---|---|
| 0 dests, no const, has sink | 6 (SINK) |
| 0 dests, const via frame accumulator (RMW) | 7 (SINK+CONST) |
| 1 dest, no const | 0 (INHERIT, single output) |
| 1 dest, has const | 1 (INHERIT, single output + const) |
| 2 dests, no const | 2 (INHERIT, fan-out) |
| 2 dests, has const | 3 (INHERIT, fan-out + const) |
| CHANGE_TAG, no const | 4 |
| CHANGE_TAG, has const | 5 |

CHANGE_TAG is used when the instruction's left operand provides the
output destination dynamically (e.g., dynamic return routing via
`EXTRACT_TAG` + `CHANGE_TAG`).

### 8.4 Pre-Formed Flit 1 Computation

The allocator must convert resolved destination `Addr` values into
16-bit packed flit 1 values for storage in frame destination slots.

**Flit 1 formats** (from `architecture-overview.md`):

```
DYADIC WIDE:    [0][0][port:1][PE:2][offset:8][act_id:3]     = 16 bits
MONADIC NORM:   [0][1][0][PE:2][offset:8][act_id:3]          = 16 bits
MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2]       = 16 bits
```

The packing function takes:
- `dest_pe: int` (2 bits)
- `dest_offset: int` (8 bits for dyadic/monadic normal, 7 for inline)
- `dest_act_id: int` (3 bits, for dyadic and monadic normal)
- `dest_port: Port` (1 bit, for dyadic wide only)
- `dest_type: TokenType` (dyadic wide, monadic normal, or monadic inline)

And produces a 16-bit packed flit 1 value.

The `dest_type` is determined by whether the destination instruction is
dyadic (-> dyadic wide flit) or monadic (-> monadic normal flit). For
trigger-only destinations (e.g., switch not-taken path), monadic inline
is used.

This replaces the current `Addr(a, port, pe)` resolution. The `Addr`
type may still be used as an intermediate representation, but the final
output is a packed flit 1 value in a frame slot.

### 8.5 Destination Resolution Rework

The current `_resolve_destinations()` function creates `ResolvedDest`
objects containing `Addr(a=iram_offset, port=edge.port, pe=dest_pe)`.
Under the frame model, destination resolution must additionally:

1. Look up the destination node's `act_id` (not just `iram_offset` and
   `pe`).
2. Determine the destination token type (dyadic wide vs monadic normal
   vs monadic inline).
3. Compute the packed flit 1 value.
4. Assign the flit 1 value to a frame slot.

The `ResolvedDest` type may be extended to carry the packed flit 1
value, or the flit 1 computation may happen in a separate sub-pass
after destination resolution.
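
Step 3 can be sketched directly from the three flit 1 formats in section 8.4. The code below is a rough illustration, not the final `_pack_flit1()`: it assumes the fields are packed MSB-first in the order written, and `FlitKind`, `pack_flit1`, and `_field` are hypothetical names (the plan itself refers to a `TokenType` and a `Port` type whose definitions are not shown here).

```python
from enum import Enum


class FlitKind(Enum):
    """Hypothetical stand-in for the destination token type."""
    DYADIC_WIDE = "dyadic_wide"
    MONADIC_NORM = "monadic_norm"
    MONADIC_INLINE = "monadic_inline"


def _field(name: str, value: int, bits: int) -> int:
    # Range-check a field against its bit width before packing.
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return value


def pack_flit1(kind: FlitKind, pe: int, offset: int,
               act_id: int = 0, port: int = 0) -> int:
    """Pack a destination into a 16-bit flit 1 value (assumed MSB-first)."""
    pe_bits = _field("pe", pe, 2) << 11
    if kind is FlitKind.DYADIC_WIDE:
        # [0][0][port:1][PE:2][offset:8][act_id:3]
        return ((_field("port", port, 1) << 13) | pe_bits
                | (_field("offset", offset, 8) << 3) | _field("act_id", act_id, 3))
    if kind is FlitKind.MONADIC_NORM:
        # [0][1][0][PE:2][offset:8][act_id:3]
        return ((0b010 << 13) | pe_bits
                | (_field("offset", offset, 8) << 3) | _field("act_id", act_id, 3))
    # MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2], spare left as 0
    return (0b011 << 13) | pe_bits | (0b10 << 9) | (_field("offset", offset, 7) << 2)
```

For example, under these assumptions a dyadic destination on PE 1, offset 5, act_id 2, right port (`port=1`) packs to `0x282A`. Range-checking each field at pack time gives the allocator one place to catch an offset or act_id that does not fit the flit format.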

### 8.6 ctx_mode / ctx_override Removal

The current allocator and codegen handle `ctx_override` edges by setting
`ctx_mode=1` on the source instruction and packing the target context
and generation into the `const` field. This entire mechanism is removed.

Under the frame model, cross-activation routing is handled by frame
destination slots. The destination flit 1 value already encodes the
target PE, offset, and act_id. No instruction-level `ctx_mode` is
needed. The `ctx_override` flag on `IREdge` may be kept for semantic
annotation (this edge crosses activation boundaries) but has no effect
on instruction encoding.

## 9. Codegen Changes (`asm/codegen.py`) -- MAJOR

### 9.1 Instruction Generation

Current codegen (`_build_iram_for_pe()`) generates `ALUInst` and
`SMInst` objects. These are Python dataclasses that model the old
instruction format with embedded destinations and constants:

```python
ALUInst(op, dest_l, dest_r, const, ctx_mode)
SMInst(op, sm_id, const, ret, ret_dyadic)
```

New codegen generates 16-bit instruction words matching the hardware
format:

```python
@dataclass(frozen=True)
class Instruction:
    """16-bit instruction word for IRAM.

    [type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
    """
    type: int    # 0 = CM, 1 = SM
    opcode: int  # 5-bit opcode
    mode: int    # 3-bit mode (0-7)
    wide: bool   # 32-bit frame values
    fref: int    # 6-bit frame slot base index
```

The `Instruction` type replaces both `ALUInst` and `SMInst`. Constants
and destinations are NOT in the instruction -- they are in frame slots.

The `PEConfig.iram` field type changes from
`dict[int, ALUInst | SMInst]` to `dict[int, Instruction]` (or
`dict[int, int]` if storing raw 16-bit words).
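
If raw 16-bit words are stored, the conversion could look like the sketch below. It assumes the fields are packed MSB-first in the order `[type:1][opcode:5][mode:3][wide:1][fref:6]`; `encode` and `decode` are hypothetical helper names, and the opcode number used in the example is arbitrary rather than taken from the real opcode table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """Mirror of the Instruction dataclass above, repeated so the sketch runs alone."""
    type: int    # 0 = CM, 1 = SM
    opcode: int  # 5-bit opcode
    mode: int    # 3-bit mode (0-7)
    wide: bool   # 32-bit frame values
    fref: int    # 6-bit frame slot base index


def encode(inst: Instruction) -> int:
    """Pack an Instruction into a 16-bit word (assumed MSB-first layout)."""
    for name, value, bits in (("type", inst.type, 1), ("opcode", inst.opcode, 5),
                              ("mode", inst.mode, 3), ("fref", inst.fref, 6)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return ((inst.type << 15) | (inst.opcode << 10) | (inst.mode << 7)
            | (int(inst.wide) << 6) | inst.fref)


def decode(word: int) -> Instruction:
    """Unpack a 16-bit word back into its fields."""
    return Instruction(type=(word >> 15) & 0x1, opcode=(word >> 10) & 0x1F,
                       mode=(word >> 7) & 0x7, wide=bool((word >> 6) & 0x1),
                       fref=word & 0x3F)


# A CM instruction with (arbitrary) opcode 3, mode 1 ([const, dest]), fref 8.
word = encode(Instruction(type=0, opcode=3, mode=1, wide=False, fref=8))
```

Keeping `decode` next to `encode` makes a round-trip test cheap, which is useful if `PEConfig.iram` ends up holding raw words and the monitor needs to display them.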

### 9.2 Frame Setup Sequence Generation (NEW)

Codegen must generate the bootstrap sequence that loads frame contents
before execution begins. This is a stream of tokens:

1. **FrameControlToken (ALLOC)** for each activation:
   ```
   flit 1: [0][1][1][PE:2][00][op=0][spare:3][act_id:3]
   flit 2: (return routing or unused)
   ```

2. **PELocalWriteToken** for each frame slot that needs initialization:
   ```
   flit 1: [0][1][1][PE:2][01][region=1][spare:1][slot:5][act_id:3]
   flit 2: [data:16] (the frame slot value)
   ```

3. **PELocalWriteToken** for IRAM entries:
   ```
   flit 1: [0][1][1][PE:2][01][region=0][spare:1][slot:5][act_id:ignored]
   flit 2: [instruction:16]
   ```

The ordering matters: IRAM writes before frame setup, frame setup
(ALLOC + slot writes) before seed tokens. More specifically:

```
IRAM writes (all PEs)
  -> ALLOC frame control (all activations)
  -> Frame slot writes (constants, destinations per activation)
  -> Seed tokens (initial data tokens to start execution)
```

### 9.3 New Token Types

The emulator will need new token types (or the existing types must be
adapted):

- **FrameControlToken** -- ALLOC/FREE frame lifecycle. Currently not in
  `tokens.py`. Codegen needs to emit these.
- **PELocalWriteToken** -- writes to IRAM (region=0) or frame slots
  (region=1). The current `IRAMWriteToken` is a special case of this
  (region=0 only). It may be generalised or a new type added.

These are emulator-level changes that codegen depends on. The codegen
module must import and construct whatever token types the emulator
provides.

### 9.4 Seed Token Generation

Current seed token generation creates `MonadToken` or `DyadToken` with
`ctx` field. Changes:

- `ctx` field -> `act_id` field in token constructors
- `gen` field on `DyadToken` is removed (ABA protection is via the 670
  valid bit, not generation counters)
- Seed tokens target `(pe, offset, act_id)` triples, packed from the
  destination's allocation data

### 9.5 Direct Mode Output

`generate_direct()` currently returns `AssemblyResult` with:
- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of
  `ALUInst/SMInst`
- `sm_configs: list[SMConfig]`
- `seed_tokens: list[MonadToken]`

New `AssemblyResult`:
- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of
  `Instruction` (new type), plus `frame_layouts: dict[int, FrameLayout]`
  mapping act_id to frame layout data
- `sm_configs: list[SMConfig]` -- unchanged
- `seed_tokens: list` -- may include `DyadToken` and `MonadToken` with
  `act_id` instead of `ctx`
- `setup_tokens: list` -- NEW: frame control and PE-local write tokens
  for bootstrap

### 9.6 Token Stream Mode Output

`generate_tokens()` currently produces:
```
SM init tokens -> IRAM write tokens -> seed tokens
```

New ordering:
```
SM init tokens -> IRAM write tokens -> ALLOC tokens -> frame slot write tokens -> seed tokens
```

### 9.7 ctx_mode Removal

The entire `ctx_mode` / `ctx_override` handling in `_build_iram_for_pe()`
(lines 96-132 of codegen.py) is removed. Cross-activation routing is
handled by frame destination slots, not by instruction encoding.

### 9.8 Route Restriction Computation

`_compute_route_restrictions()` is unchanged in concept. It scans edges
to determine which PEs and SMs a given PE needs to route to. The
implementation stays the same.

## 10. Serialize Pass Changes (`asm/serialize.py`)

### 10.1 Field Renaming

- `ctx` -> `act_id` in node serialization if activation ID is displayed
- Any `ctx_slot` qualifier in dfasm output becomes activation-related

### 10.2 New Fields

If `mode`, `fref`, and `wide` are displayed in serialized output (for
debugging allocated IR), `_serialize_node()` needs to format them.
Format could be: `&node|pe0|act2|mode1|fref8 <| add`

### 10.3 Round-Trip Support

The serialize pass must be able to round-trip new IR fields. Since
`mode`, `fref`, and frame layout are only populated after allocation,
serialization before allocation produces the same output as today (no
new fields to display). Post-allocation serialization adds the new
fields.

## 11. Built-in Macro Changes (`asm/builtins.py`)

No changes needed. The built-in macros (`#loop_counted`, `#loop_while`,
`#permit_inject`, `#reduce_2/3/4`) use generic opcodes and edge routing.
They do not reference `ctx`, `ctx_slot`, `free_ctx`, or any
context-specific concepts.

The `free_ctx` opcode rename to `free_frame` does not affect builtins
because none of them use `free_ctx`.

## 12. Error Types (`asm/errors.py`)

### 12.1 New Error Categories

Add to `ErrorCategory`:

```python
class ErrorCategory(Enum):
    # ... existing ...
    FRAME = "frame"  # Frame layout overflow, slot conflicts
```

### 12.2 New Error Conditions

| Error | Category | Source Pass | Condition |
|---|---|---|---|
| Frame slot overflow | FRAME | allocate | Total slots > 64 for an activation |
| Matchable offset overflow | RESOURCE | allocate/place | > 8 dyadic instructions per activation per PE |
| Frame count overflow | RESOURCE | allocate | > 4 concurrent activations on one PE |
| Act ID exhaustion | RESOURCE | allocate | > 8 activation IDs needed (wraparound) |

## 13. Dfgraph Pipeline Impact (`dfgraph/`)

### 13.1 `dfgraph/pipeline.py`

The pipeline runner calls `allocate()` and uses `IRGraph` types. If
`SystemConfig` fields change, `pipeline.py` may need minor updates for
default values or error handling. The pipeline runner itself does not
inspect allocation results in detail.

### 13.2 `dfgraph/graph_json.py`

Currently includes `ctx` field in node JSON output. This becomes
`act_id`. The field rename is straightforward.

### 13.3 `dfgraph/categories.py`

References `RoutingOp.FREE_CTX` in the CONFIG category mapping. This
becomes `RoutingOp.FREE_FRAME`. One line change.

`EXTRACT_TAG` (if added) maps to the ROUTING or CONFIG category
depending on its semantics.

## 14. Monitor Impact (`monitor/`)

### 14.1 `monitor/snapshot.py`

`PESnapshot` currently captures `matching_store` (2D array of
`MatchEntry`) and `gen_counters`. Under the frame model:
- `matching_store` becomes frame state (per-frame slot values, presence
  bits)
- `gen_counters` is removed (no generation counters in frame model)
- New: `frame_allocations` (act_id -> frame_id mapping), `frame_slots`
  (per-frame slot contents)

### 14.2 `monitor/graph_json.py`

Node state overlay: `ctx` field -> `act_id`. Frame slot contents may be
added to the state overlay for debugging.

### 14.3 `monitor/repl.py`

The `pe` command displays PE state including matching store contents.
This needs updating to display frame state instead.

## 15. Dependency Order

Implementation order based on the dependency graph:

### Phase 1: Foundation Types
1. **`cm_inst.py`**: Add `RoutingOp.FREE_FRAME`, `RoutingOp.EXTRACT_TAG`.
   Add `Instruction` dataclass. Update `is_monadic_alu()`.
2. **`asm/ir.py`**: Add `FrameSlotMap`, `FrameLayout`. Rename
   `ctx_slot` -> `act_slot`, `ctx` -> `act_id` on IRNode. Update
   `SystemConfig` (remove `ctx_slots`, add `frame_count`,
   `frame_slots`, `matchable_offsets`). Rename `CtxSlotRef` ->
   `ActSlotRef`, `CtxSlotRange` -> `ActSlotRange`.
3. **`asm/opcodes.py`**: Add new opcodes to `MNEMONIC_TO_OP`,
   `_MONADIC_OPS_TUPLES`. Rename `free_ctx` -> `free_frame`.
4. **`asm/errors.py`**: Add `ErrorCategory.FRAME`.

### Phase 2: Allocation and Codegen (the big changes)
5. **`asm/allocate.py`**: Rewrite `_assign_context_slots()` ->
   `_assign_act_ids()`. Add `_compute_frame_layouts()`. Add
   `_compute_modes()`. Add `_pack_flit1()`. Update
   `_assign_iram_offsets()` for new offset scheme. Update
   `_resolve_destinations()` to produce flit 1 values. Remove
   `ctx_mode`/`ctx_override` handling.
6. **`asm/codegen.py`**: Replace `ALUInst`/`SMInst` generation with
   `Instruction` generation. Add frame setup token generation. Add
   `FrameControlToken` / `PELocalWriteToken` emission. Update seed
   token generation. Remove `ctx_mode` handling. Update `PEConfig`
   construction. Update `AssemblyResult`.

### Phase 3: Upstream Pass Adjustments
7. **`asm/lower.py`**: Rename `ctx_slot` references. Update qualifier
   parsing if syntax changes.
8. **`asm/expand.py`**: Replace `FREE_CTX` with `FREE_FRAME` in call
   wiring. Rename `CtxSlotRef` -> `ActSlotRef`. Update `CallSite`
   field names. Simplify trampoline logic if `ctx_override` is removed.
9. **`asm/place.py`**: Update `_count_iram_cost()` (all nodes cost 1).
   Replace `ctx_slots` tracking with `frame_count`. Add
   matchable-offset-per-activation constraint.

### Phase 4: Output and Tooling
10. **`asm/serialize.py`**: Rename `ctx` -> `act_id` in output. Add
    mode/fref display for post-allocation IR.
11. **`asm/builtins.py`**: No changes (verified).
12. **`dfgraph/categories.py`**: `FREE_CTX` -> `FREE_FRAME`.
13. **`dfgraph/graph_json.py`**: `ctx` -> `act_id` in JSON output.
14. **`monitor/`**: Update snapshot, graph_json, and REPL for frame
    model.

### Phase 5: Emulator Updates (out of scope for this doc)
15. **`emu/types.py`**: `PEConfig` changes (`iram` type, remove
    `ctx_slots`, add frame parameters).
16. **`emu/pe.py`**: Frame-based matching instead of context-slot
    matching store.
17. **`tokens.py`**: `ctx` -> `act_id`, remove `gen`. Add
    `FrameControlToken`, `PELocalWriteToken`.

## 16. What Stays the Same

- **dfasm grammar** (`dfasm.lark`): mostly unchanged. May add frame
  directives later, but not required for v0.
- **Parse pass** (Lark parser): no changes.
- **Error infrastructure** (`errors.py`): structure unchanged, one new
  category added.
- **Graph structure** (`IRGraph`, `IREdge`, `IRRegion`): unchanged.
- **Pipeline architecture**: still 7 passes in the same order.
- **Resolve pass**: completely unchanged.
- **Built-in macros**: completely unchanged.
- **SM-related codegen**: SMConfig construction, SM init tokens, data
  defs -- all unchanged.
- **Name resolution and scoping**: unchanged.
- **Macro expansion core**: parameter substitution, variadic repetition,
  opcode parameters -- all unchanged. Only call wiring details change.

## 17.
Risk Assessment 843 844| Area | Risk | Mitigation | 845|---|---|---| 846| Frame layout allocation | New, complex algorithm | Start with simple sequential allocation, optimise later | 847| Flit 1 packing | Bit-level encoding must match hardware spec | Unit tests against architecture-overview.md bit layouts | 848| Instruction deduplication | Interaction with frame layouts is subtle | Defer dedup to optimisation pass; v0 allocates per-activation | 849| Mode computation | 8 modes, context-dependent selection | Exhaustive test matrix against mode table | 850| ABA distance | Act_id assignment must maintain distance | Sequential assignment (0,1,2,3) is safe for static programs | 851| `ctx_override` removal | Affects expand pass call wiring | Can keep `ctx_override` as semantic annotation initially | 852| Emulator dependency | Codegen must emit tokens emulator can consume | Phase emulator changes alongside codegen; can use adapter layer |
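
Several of the mitigations above rely on unit tests, and the resource limits from section 12.2 are an easy place to start. The following is a minimal sketch of such a check, not the planned `asm/allocate.py` code: `ActivationUsage` and `check_pe_limits` are hypothetical names, the local `ErrorCategory` is a stand-in for the enum in `asm/errors.py`, and the `"resource"` value string is an assumption. The numeric limits are the ones quoted in this plan.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCategory(Enum):
    # Stand-in for asm/errors.py; the "resource" string is assumed.
    RESOURCE = "resource"
    FRAME = "frame"


# Limits quoted in this plan.
FRAME_SLOTS = 64       # addressable slots per frame
MATCHABLE_OFFSETS = 8  # dyadic instructions per activation per PE
FRAME_COUNT = 4        # concurrent frames per PE
ACT_IDS = 8            # unique values of a 3-bit act_id


@dataclass(frozen=True)
class ActivationUsage:
    """Hypothetical per-activation summary gathered during allocation."""
    name: str
    slots_used: int
    dyadic_count: int


def check_pe_limits(activations, concurrent):
    """Return (category, message) pairs for each violated limit on one PE."""
    problems = []
    for act in activations:
        if act.slots_used > FRAME_SLOTS:
            problems.append((ErrorCategory.FRAME,
                             f"{act.name}: {act.slots_used} slots > {FRAME_SLOTS}"))
        if act.dyadic_count > MATCHABLE_OFFSETS:
            problems.append((ErrorCategory.RESOURCE,
                             f"{act.name}: {act.dyadic_count} dyadic > {MATCHABLE_OFFSETS}"))
    if concurrent > FRAME_COUNT:
        problems.append((ErrorCategory.RESOURCE,
                         f"{concurrent} concurrent activations > {FRAME_COUNT}"))
    if len(activations) > ACT_IDS:
        problems.append((ErrorCategory.RESOURCE,
                         f"{len(activations)} activation IDs needed > {ACT_IDS}"))
    return problems
```

An empty result means the PE fits within all four limits. The real pass would presumably raise or collect assembler errors with source locations rather than return tuples.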