OR-1 dataflow CPU sketch
# Assembler Redesign Plan: Frame-Based PE Model

This document describes the changes needed in the OR1 assembler (`asm/`)
to match the frame-based PE redesign described in `pe-design.md` and
`architecture-overview.md`. It covers every assembler pass and supporting
module, identifying what changes, what stays, and the dependency order
for implementation.

The assembler pipeline is: parse -> lower -> expand -> resolve -> place ->
allocate -> codegen. Most changes concentrate in allocate and codegen;
earlier passes are minimally affected.

## 1. Overview

The PE redesign replaces the context-slot matching model with a
frame-based model. Key architectural changes that affect the assembler:

- **Context slots replaced by activation IDs and frames.** The old
  `ctx_slots` parameter (up to 16 slots) becomes `frame_count` (4
  concurrent frames) with 3-bit `act_id` (8 unique IDs). Each frame has
  64 addressable slots.
- **Instructions become templates.** IRAM entries are 16-bit words
  `[type:1][opcode:5][mode:3][wide:1][fref:6]`. Constants and
  destinations are NOT in the instruction -- they live in frame slots
  referenced by `fref`.
- **Instruction deduplication.** Multiple activations can share the same
  IRAM entry because per-activation data lives in frames, not
  instructions. The number of IRAM entries needed is the number of unique
  operation shapes, not the total operations executed.
- **8 matchable offsets per frame.** Dyadic instructions must be assigned
  offsets 0-7 within each activation. The assembler must enforce this
  constraint and split function bodies that exceed it.
- **Pre-formed flit 1 values.** Output destinations are 16-bit packed
  flit 1 values stored in frame slots. The PE reads the slot and puts it
  on the bus verbatim as flit 1. Token type is determined by the prefix
  bits in the stored flit value.
- **Frame setup via PE-local write tokens.** Constants, destinations, and
  other per-activation data are loaded into frame slots by a stream of
  PE-local write tokens before execution begins.
- **Frame lifecycle tokens.** ALLOC (frame control, prefix `011+00`,
  op=0) allocates a frame; FREE (frame control, op=1) or FREE_FRAME
  instruction releases it.
- **Mode field replaces output routing logic.** The 3-bit `mode` field
  in the instruction word encodes output routing (INHERIT/CHANGE_TAG/SINK)
  and constant presence. The assembler must compute the correct mode for
  each instruction based on its edge topology and constant usage.

## 2. IR Type Changes (`asm/ir.py`)

### 2.1 IRNode Changes

Current fields affected:

| Current Field | Change | Notes |
|---|---|---|
| `ctx_slot: Optional[Union[int, CtxSlotRef, CtxSlotRange]]` | Rename to `act_slot` or remove entirely | Macro parameter support (`CtxSlotRef`) may still be needed for parameterized activation grouping |
| `iram_offset: Optional[int]` | Keep | Semantics unchanged: index into PE's IRAM. Range 0-255. |
| `ctx: Optional[int]` | Rename to `act_id: Optional[int]` | 3-bit activation ID (0-7) instead of context slot |

New fields to add to IRNode:

| New Field | Type | Purpose |
|---|---|---|
| `mode` | `Optional[int]` | 3-bit instruction mode (0-7), computed during allocate |
| `fref` | `Optional[int]` | 6-bit frame slot base index (0-63), computed during allocate |
| `wide` | `bool` | 32-bit frame values flag, default False |
| `frame_layout` | `Optional[FrameSlotMap]` | Per-node frame slot assignments (see below) |

### 2.2 New IR Types

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameSlotMap:
    """Frame slot assignments for one instruction within an activation.

    Maps logical roles to physical frame slot indices within the 64-slot
    frame. Computed by the allocate pass.

    Attributes:
        match_slot: Frame slot for dyadic match operand storage (offsets 0-7)
        const_slot: Frame slot for the instruction's constant (if any)
        dest_slots: Frame slot(s) for pre-formed flit 1 destination values
        sink_slot: Frame slot for SINK mode write-back target
    """
    match_slot: Optional[int] = None
    const_slot: Optional[int] = None
    dest_slots: tuple[int, ...] = ()
    sink_slot: Optional[int] = None
```

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameLayout:
    """Complete frame layout for one activation on one PE.

    Computed by the allocate pass. Used by codegen to generate PE-local
    write tokens for frame setup.

    Attributes:
        act_id: Activation ID (0-7)
        pe_id: PE where this frame lives
        slots: Dict mapping slot index to (role, value) pairs
        total_slots: Total slots used (must be <= 64)
    """
    act_id: int
    pe_id: int
    slots: dict[int, tuple[str, int]]
    total_slots: int
```

### 2.3 SystemConfig Changes

Current `SystemConfig`:

```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = DEFAULT_IRAM_CAPACITY  # 128
    ctx_slots: int = DEFAULT_CTX_SLOTS  # 16
```

New `SystemConfig`:

```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 256     # 8-bit offset, 256 entries
    frame_count: int = 4         # max concurrent frames per PE
    frame_slots: int = 64        # slots per frame
    matchable_offsets: int = 8   # max dyadic instructions per activation per PE
```

The defaults change: `DEFAULT_IRAM_CAPACITY` goes from 128 to 256
(matching the 8-bit offset field). `DEFAULT_CTX_SLOTS` is removed
entirely and replaced by `frame_count`, `frame_slots`, and
`matchable_offsets`.

### 2.4 What Stays

- `IREdge` -- unchanged except `ctx_override` may need renaming or
  semantic adjustment (see sections 5 and 8.6)
- `IRGraph` structure -- unchanged
- `IRDataDef` -- unchanged (SM data definitions are orthogonal to PE frames)
- `IRRegion`, `RegionKind` -- unchanged
- `NameRef`, `ResolvedDest` -- unchanged
- `MacroDef`, `IRMacroCall`, `MacroParam`, `ParamRef` -- unchanged
- `CallSite`, `CallSiteResult` -- structure unchanged; `free_ctx_nodes`
  is renamed to `free_frame_nodes`, while `trampoline_nodes` stays
  (see section 5.2)
- `SourceLoc`, `ConstExpr`, `IRRepetitionBlock` -- unchanged

### 2.5 CtxSlotRef / CtxSlotRange

`CtxSlotRef` and `CtxSlotRange` are used in macro templates for
parameterized context slot assignment. Under the frame model, these
become `ActSlotRef` / `ActSlotRange` (or are removed if activation ID
assignment is always automatic). The expand pass resolves these to
concrete values during macro expansion. The rename is straightforward
but touches `ir.py`, `lower.py`, and `expand.py`.

## 3. Opcode Changes (`asm/opcodes.py`)

### 3.1 Renamed Opcodes

| Current | New | Notes |
|---|---|---|
| `RoutingOp.FREE_CTX` | `RoutingOp.FREE_FRAME` | Deallocates a frame instead of a context slot |

The mnemonic mapping changes: `"free_ctx"` -> `"free_frame"` in
`MNEMONIC_TO_OP`. The `OP_TO_MNEMONIC` reverse mapping updates
correspondingly.

### 3.2 New Opcodes

| Opcode | Type | Arity | Purpose |
|---|---|---|---|
| `RoutingOp.EXTRACT_TAG` | Monadic CM | Monadic | Captures executing token's identity as 16-bit packed flit 1 value (return continuation) |
| `RoutingOp.ALLOC_REMOTE` | Monadic CM | Monadic | Triggers ALLOC frame control token to a target PE (may be handled at codegen level instead) |

`EXTRACT_TAG` must be added to `MNEMONIC_TO_OP` (mnemonic: `"extract_tag"`),
to `_MONADIC_OPS_TUPLES`, and to `cm_inst.py` as a new `RoutingOp` enum
member.

Whether `ALLOC_REMOTE` is an ALU opcode or purely a codegen-emitted
frame control token depends on whether the assembler exposes it as a
user-facing instruction or handles it internally during call wiring.
For static calls, ALLOC is emitted by codegen as a frame control token,
not as an ALU instruction. For dynamic calls, it may need to be an
instruction.

### 3.3 Mode-Dependent Arity

The current `is_monadic(op, const)` and `is_dyadic(op, const)` functions
remain correct in concept. The `mode` field is orthogonal to arity --
mode determines output routing behaviour, not input operand count. No
changes needed to arity classification.

### 3.4 Monadic/Dyadic Classification

Unchanged. The distinction between monadic and dyadic is fundamental to
matching (dyadic tokens go through presence-based matching in the 670s;
monadic tokens bypass matching). The assembler's arity classification
drives IRAM offset assignment (dyadic at offsets 0-7, monadic at 8+).

## 4. Lower Pass Changes (`asm/lower.py`)

Minimal changes. The lower pass translates Lark CST to IR and is mostly
opcode-agnostic.

### 4.1 Opcode Mnemonic Updates

The `MNEMONIC_TO_OP` lookup in `_resolve_opcode()` will pick up new
opcodes automatically when they are added to `opcodes.py`. No lower.py
code changes needed for new opcodes.

The `free_ctx` mnemonic rename to `free_frame` will need a corresponding
change in lower.py only if the mnemonic string is hardcoded anywhere
(it is not -- lower.py uses `MNEMONIC_TO_OP` for all lookups).

### 4.2 Context Slot Qualifiers

The `[ctx_slot]` qualifier syntax in dfasm (`&node[3] <| add`) is
parsed in lower.py and stored as `ctx_slot` on IRNode. This syntax and
field will be renamed to reflect activation IDs if kept. If activation
ID assignment is always automatic, the qualifier syntax may be removed
entirely.

References in lower.py:
- `inst_def` transformer rule: parses `[N]` qualifier into `ctx_slot`
- `ctx_slot_ref` transformer rule: creates `CtxSlotRef` for macro
  parameter `[${param}]`
- `ctx_slot_range` transformer rule: creates `CtxSlotRange` for `[N:M]`

### 4.3 New Syntax (Optional, Deferred)

Frame directives in dfasm syntax (e.g., `@frame_layout` pragmas) could
be added later. Not needed for v0 -- the allocator computes frame
layouts automatically.

## 5. Expand Pass Changes (`asm/expand.py`)

### 5.1 Function Call Wiring

The expand pass currently generates cross-context call wiring:
- Trampoline `PASS` nodes with `ctx_override` edges
- `FREE_CTX` cleanup nodes
- Per-call-site context slot allocation via `CallSite` metadata

Under the frame model, the call wiring changes:

- **`ctx_override` edges** become frame-boundary edges. The semantic
  meaning is the same (data crosses activation boundaries), but the
  mechanism changes: instead of packing a target context into the
  instruction's `const` field with `ctx_mode=1`, the destination's
  pre-formed flit 1 value in the frame slot already encodes the target
  PE, offset, and act_id. Cross-activation routing is handled by frame
  setup, not by instruction encoding.

- **`FREE_CTX` nodes** become `FREE_FRAME` nodes. The opcode changes
  from `RoutingOp.FREE_CTX` to `RoutingOp.FREE_FRAME`. The expand pass
  references `RoutingOp.FREE_CTX` directly in `_wire_call_site()` at
  line 1160 of expand.py.

- **Trampoline nodes** may simplify. In the current design, trampolines
  exist to bridge context boundaries (an edge cannot cross contexts
  without a PASS node carrying `ctx_mode`). Under the frame model,
  destinations in frame slots already encode the target activation, so
  cross-activation edges are just edges whose destination flit 1 value
  points to a different activation. Trampolines may still be useful for
  fan-out or return routing but are no longer needed purely for context
  bridging.

- **`@ret` wiring** stays conceptually the same. Return routing in the
  frame model is a pre-formed flit 1 value loaded into a frame slot.
  The expand pass creates edges for return routing; the allocate pass
  resolves those edges to flit 1 values in frame slots.

### 5.2 CallSite Metadata

`CallSite` fields to rename:
- `free_ctx_nodes` -> `free_frame_nodes`

The `trampoline_nodes` field stays (trampolines may still be generated
for fan-out or return routing patterns).

### 5.3 CtxSlotRef Resolution

The expand pass resolves `CtxSlotRef` and `CtxSlotRange` during macro
expansion. These will be renamed to `ActSlotRef` / `ActSlotRange` (or
removed). The resolution logic in `_substitute_node()` and
`_substitute_edge()` is straightforward renaming.

### 5.4 Built-in Macros

The built-in macros in `builtins.py` do NOT reference `ctx_slot`, `ctx`,
`free_ctx`, or any context-specific concepts directly. They use generic
opcodes (`add`, `brgt`, `gate`, `pass`, `inc`, `const`) and edge routing
(`@ret`, `${param}`). No changes needed to built-in macros.

## 6. Resolve Pass Changes (`asm/resolve.py`)

No changes needed. The resolve pass validates that edge endpoints exist
and detects scope violations. It operates on node names and graph
structure, not on PE-level concepts like contexts or frames. The resolve
pass does not reference `ctx`, `ctx_slot`, `ctx_override`, or any
context-related fields.

## 7. Place Pass Changes (`asm/place.py`)

### 7.1 New Constraint: Matchable Offset Limit

The placement pass must enforce the 8-matchable-offset constraint: at
most 8 dyadic instructions per activation per PE. This is a new
constraint that does not exist in the current codebase.

Currently, `_count_iram_cost()` counts dyadic nodes as costing 2 IRAM
slots and monadic as 1. Under the frame model:
- Dyadic nodes cost 1 IRAM slot (match presence bits live in the 670s
  and match operands in frame slots, not IRAM)
- Monadic nodes cost 1 IRAM slot
- The 8-dyadic-per-activation limit is a separate constraint from IRAM
  capacity

The placement pass needs to track dyadic instruction count per
activation group per PE, in addition to total IRAM usage.

### 7.2 IRAM Cost Recalculation

`_count_iram_cost()` should return 1 for all node types (dyadic and
monadic both use 1 IRAM slot in the frame model). The current cost of 2
for dyadic nodes was because the old matching store occupied IRAM slots;
in the frame model, match operands live in frame SRAM, not IRAM.
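
The two place-pass rules above can be sketched as follows. This is a minimal illustration, not the existing `place.py` code: `PlacedNode`, `iram_cost`, `dyadic_counts`, and `over_limit` are hypothetical names, and treating the function scope as the activation group is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedNode:
    """Hypothetical stand-in for the IRNode fields the place pass needs."""
    name: str
    pe: int
    scope: str       # function scope, used here as the activation group
    is_dyadic: bool


def iram_cost(node: PlacedNode) -> int:
    """Every node occupies exactly one IRAM entry in the frame model."""
    return 1


def dyadic_counts(nodes: list[PlacedNode]) -> Counter:
    """Count dyadic nodes per (pe, activation group)."""
    return Counter((n.pe, n.scope) for n in nodes if n.is_dyadic)


def over_limit(nodes: list[PlacedNode],
               matchable_offsets: int = 8) -> list[tuple[int, str]]:
    """Return the (pe, scope) groups that exceed the matchable-offset limit."""
    counts = dyadic_counts(nodes)
    return sorted(key for key, n in counts.items() if n > matchable_offsets)
```

A group reported by `over_limit` would map to the "Matchable offset overflow" error, or trigger splitting of the function body.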

### 7.3 Context Slots -> Frames

`_auto_place_nodes()` currently tracks `ctx_used` per PE (context slots
consumed). This becomes `frames_used` per PE (concurrent frames, max 4).
The `ctx_scopes_per_pe` tracking of function scopes per PE maps directly
to frame tracking: each function scope on a PE consumes one frame.

`SystemConfig.ctx_slots` references in place.py become
`SystemConfig.frame_count`.

### 7.4 Instruction Deduplication Awareness

Because IRAM entries are activation-independent templates, multiple
activations of the same function on the same PE share IRAM entries. The
placement pass could account for this when computing IRAM utilisation:
if two activations of function `$foo` run on PE0, they share IRAM
entries, so the IRAM cost is the unique instruction count, not the total.

This is an optimisation, not a correctness requirement. The placement
pass can conservatively count IRAM entries per unique function body
without deduplication for v0.

## 8. Allocate Pass Changes (`asm/allocate.py`) -- MAJOR

This is the largest change. The current allocate pass assigns IRAM
offsets and context slots, then resolves destinations to `Addr` values.
The frame model requires fundamentally different allocation logic.

### 8.1 IRAM Offset Assignment

Current behaviour (`_assign_iram_offsets()`):
- Dyadic nodes get offsets 0..D-1
- Monadic nodes get offsets D..D+M-1
- Total must fit in `iram_capacity`

New behaviour:
- Dyadic nodes get offsets 0-7 (within the matchable offset range).
  At most 8 dyadic instructions per activation group per PE. Because
  IRAM is activation-independent (shared templates), dyadic offset
  assignment is per-PE, not per-activation.
- Monadic nodes get offsets 8-255 (or wherever dyadic offsets end).
- The `matchable_offsets` limit (default 8) constrains dyadic count.
- With instruction deduplication, multiple activations of the same
  function body share IRAM offsets. The allocator assigns offsets once
  per unique instruction template, not per activation.
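
A minimal sketch of the new offset scheme, assuming nodes arrive as `(name, is_dyadic)` pairs for a single PE. The function name `assign_iram_offsets` and its parameters are hypothetical. Starting monadic offsets at `matchable_offsets` is only one of the two readings the list above allows, and template deduplication is not modelled.

```python
def assign_iram_offsets(
    nodes: list[tuple[str, bool]],
    matchable_offsets: int = 8,
    iram_capacity: int = 256,
) -> dict[str, int]:
    """Assign per-PE IRAM offsets to (name, is_dyadic) pairs.

    Dyadic nodes fill offsets 0..matchable_offsets-1. Monadic nodes start
    at matchable_offsets (assumption: the alternative is to pack them
    directly after the last dyadic offset).
    """
    dyadic = [name for name, is_dyadic in nodes if is_dyadic]
    monadic = [name for name, is_dyadic in nodes if not is_dyadic]

    if len(dyadic) > matchable_offsets:
        raise ValueError(
            f"{len(dyadic)} dyadic nodes exceed {matchable_offsets} matchable offsets"
        )
    if matchable_offsets + len(monadic) > iram_capacity:
        raise ValueError("IRAM capacity exceeded")

    offsets = {name: i for i, name in enumerate(dyadic)}
    offsets.update(
        {name: matchable_offsets + i for i, name in enumerate(monadic)}
    )
    return offsets
```

For example, `assign_iram_offsets([("a", True), ("b", False), ("c", True)])` places `a` and `c` at offsets 0 and 1 and `b` at offset 8.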

### 8.2 Activation ID Assignment

Replaces `_assign_context_slots()`. The current function assigns context
slot indices (0-15) per function scope per PE. The new function assigns
3-bit activation IDs (0-7) with at most 4 concurrent activations per PE.

Key differences:
- **Smaller space:** 8 act_ids (3-bit) vs 16 context slots. But only 4
  can be concurrently active (4 physical frames).
- **ABA distance:** the allocator must maintain ABA distance between
  act_ids to prevent stale token collisions. With 4 concurrent frames
  out of 8 possible IDs, 4 IDs of ABA distance exist before wraparound.
  For static programs, the allocator can assign act_ids sequentially
  (0, 1, 2, 3 for 4 concurrent activations).
- **Per-call-site allocation:** same concept as current
  `call_site_to_ctx_on_pe` -- each call site that creates a new
  activation gets a fresh act_id. But the budget is 4 concurrent frames
  instead of 16 context slots.

The function scope grouping logic (`_extract_function_scope()`) stays.
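
For the static case, sequential assignment could look like the sketch below. It assumes every function scope on a PE is concurrently live, so the count is bounded by `frame_count`. It does not model act_id reuse or ABA distance. `assign_act_ids` and the `scopes_by_pe` input shape are hypothetical.

```python
def assign_act_ids(
    scopes_by_pe: dict[int, list[str]],
    frame_count: int = 4,
) -> dict[tuple[int, str], int]:
    """Give each function scope on a PE a sequential act_id (0, 1, 2, ...)."""
    result: dict[tuple[int, str], int] = {}
    for pe, scopes in scopes_by_pe.items():
        if len(scopes) > frame_count:
            raise ValueError(
                f"PE{pe}: {len(scopes)} activations exceed {frame_count} frames"
            )
        for act_id, scope in enumerate(scopes):
            result[(pe, scope)] = act_id
    return result
```

Note that act_ids are per PE in this sketch, so the same scope may receive different IDs on different PEs.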

### 8.3 Frame Layout Allocation (NEW)

This is entirely new functionality. After IRAM offset and act_id
assignment, the allocator must compute the frame layout for each
activation: which frame slots hold what data.

**Frame slot roles:**

| Role | Slot Range | Count per Instruction | Notes |
|---|---|---|---|
| Match operand | 0-7 | 1 per dyadic instruction | Indexed by matchable offset. Presence bit in 670. |
| Constant | 8+ | 0-1 per instruction | `mode[0]` (has_const) selects whether const is read from frame[fref] |
| Destination | variable | 0-2 per instruction | Pre-formed flit 1 values. 1 for modes 0/1, 2 for modes 2/3. |
| Accumulator/sink | variable | 0-1 per instruction | For SINK modes (6/7). Write-back target. |
| SM parameters | variable | 0-2 per SM instruction | SM_id + addr, data or return routing. |

**Slot assignment algorithm:**

1. Reserve slots 0-7 for match operands (one per dyadic instruction,
   indexed by the instruction's matchable offset).
2. Assign constant slots starting at slot 8. Constants that are shared
   across instructions within the same activation can be deduplicated
   (same value -> same slot).
3. Assign destination slots. Each destination is a pre-formed flit 1
   value. Destinations shared across instructions can be deduplicated.
4. Assign sink/accumulator slots for SINK mode instructions.
5. Assign SM parameter slots for SM operations.
6. Verify total slots <= 64. If exceeded, report a frame overflow error.

**fref computation:**

The `fref` field in the instruction word points to the base of a
contiguous group of frame slots used by that instruction. The slot count
depends on the mode:

| Mode | Slots at fref | Layout |
|---|---|---|
| 0 | 1 | [dest] |
| 1 | 2 | [const, dest] |
| 2 | 2 | [dest1, dest2] |
| 3 | 3 | [const, dest1, dest2] |
| 4 | 0 | (no frame access) |
| 5 | 1 | [const] |
| 6 | 1 | [sink_target] (write) |
| 7 | 1 | [sink_target] (read-modify-write) |

The allocator must arrange frame slots such that each instruction's
constant and destination(s) are contiguous starting at `fref`. This
may require careful slot packing or a simple sequential allocation
strategy.
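
The "simple sequential allocation strategy" might look like this sketch. It takes each instruction's mode, reserves slots 0-7 for match operands, and hands out contiguous groups from slot 8 upward. `assign_frefs` and `SLOTS_AT_FREF` are hypothetical names. No deduplication of constants or destinations is attempted, and SM parameter slots are not modelled.

```python
from typing import Optional

# Slots consumed at fref per mode, copied from the table above.
SLOTS_AT_FREF = {0: 1, 1: 2, 2: 2, 3: 3, 4: 0, 5: 1, 6: 1, 7: 1}


def assign_frefs(
    modes: dict[str, int],
    first_free: int = 8,
    frame_slots: int = 64,
) -> dict[str, Optional[int]]:
    """Assign each instruction a contiguous slot group, in insertion order.

    Returns the fref (base slot) per instruction, or None for mode 4,
    which does not access the frame.
    """
    frefs: dict[str, Optional[int]] = {}
    next_slot = first_free
    for name, mode in modes.items():
        width = SLOTS_AT_FREF[mode]
        if width == 0:
            frefs[name] = None
            continue
        if next_slot + width > frame_slots:
            raise ValueError(f"frame overflow while placing {name!r}")
        frefs[name] = next_slot
        next_slot += width
    return frefs
```

Because groups never share slots here, this wastes space compared with a packing allocator that deduplicates shared constants and destinations, but it satisfies the contiguity requirement trivially.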

**Mode computation:**

The allocator determines the mode for each instruction based on:

| Condition | Mode |
|---|---|
| 0 dests, no const, has sink | 6 (SINK) |
| 0 dests, const via frame accumulator (RMW) | 7 (SINK+CONST) |
| 1 dest, no const | 0 (INHERIT, single output) |
| 1 dest, has const | 1 (INHERIT, single output + const) |
| 2 dests, no const | 2 (INHERIT, fan-out) |
| 2 dests, has const | 3 (INHERIT, fan-out + const) |
| CHANGE_TAG, no const | 4 |
| CHANGE_TAG, has const | 5 |

CHANGE_TAG is used when the instruction's left operand provides the
output destination dynamically (e.g., dynamic return routing via
`EXTRACT_TAG` + `CHANGE_TAG`).
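
The table can be expressed as a small function. This is a sketch under assumptions: `compute_mode` is a hypothetical name, the read-modify-write case (mode 7) is represented as `sink=True` with `has_const=True`, and `change_tag` takes precedence over `sink` when both are set.

```python
def compute_mode(
    dest_count: int,
    has_const: bool,
    *,
    change_tag: bool = False,
    sink: bool = False,
) -> int:
    """Map edge topology and constant usage to the 3-bit mode field."""
    if change_tag:
        # Destination comes from the left operand at run time.
        return 5 if has_const else 4
    if sink:
        # Result is written back into the frame instead of emitted.
        return 7 if has_const else 6
    if dest_count not in (1, 2):
        raise ValueError(f"INHERIT modes need 1 or 2 dests, got {dest_count}")
    base = 0 if dest_count == 1 else 2
    return base + int(has_const)
```

In every branch bit 0 of the result equals `has_const`, which matches the `mode[0]` convention noted in the frame slot roles table.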

### 8.4 Pre-Formed Flit 1 Computation

The allocator must convert resolved destination `Addr` values into
16-bit packed flit 1 values for storage in frame destination slots.

**Flit 1 formats** (from `architecture-overview.md`):

```
DYADIC WIDE:    [0][0][port:1][PE:2][offset:8][act_id:3]      = 16 bits
MONADIC NORM:   [0][1][0][PE:2][offset:8][act_id:3]           = 16 bits
MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2]        = 16 bits
```

The packing function takes:
- `dest_pe: int` (2 bits)
- `dest_offset: int` (8 bits for dyadic/monadic normal, 7 for inline)
- `dest_act_id: int` (3 bits, for dyadic and monadic normal)
- `dest_port: Port` (1 bit, for dyadic wide only)
- `dest_type: TokenType` (dyadic wide, monadic normal, or monadic inline)

And produces a 16-bit packed flit 1 value.

The `dest_type` is determined by whether the destination instruction is
dyadic (-> dyadic wide flit) or monadic (-> monadic normal flit). For
trigger-only destinations (e.g., switch not-taken path), monadic inline
is used.

This replaces the current `Addr(a, port, pe)` resolution. The `Addr`
type may still be used as an intermediate representation, but the final
output is a packed flit 1 value in a frame slot.
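
A possible shape for the packing function is sketched below. It assumes the bracketed layouts are written most-significant bit first, uses a plain `int` for the port instead of the `Port` type, and leaves the inline spare bits at zero. `FlitKind` and `pack_flit1` are hypothetical names, not existing code.

```python
from enum import Enum


class FlitKind(Enum):
    DYADIC_WIDE = "dyadic_wide"
    MONADIC_NORM = "monadic_norm"
    MONADIC_INLINE = "monadic_inline"


def _field(value: int, bits: int, name: str) -> int:
    """Range-check a field value against its bit width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return value


def pack_flit1(kind: FlitKind, pe: int, offset: int,
               act_id: int = 0, port: int = 0) -> int:
    """Pack a destination into a 16-bit flit 1 value (MSB-first layout)."""
    pe_bits = _field(pe, 2, "pe") << 11
    if kind is FlitKind.DYADIC_WIDE:
        # [0][0][port:1][PE:2][offset:8][act_id:3]
        return ((_field(port, 1, "port") << 13) | pe_bits
                | (_field(offset, 8, "offset") << 3)
                | _field(act_id, 3, "act_id"))
    if kind is FlitKind.MONADIC_NORM:
        # [0][1][0][PE:2][offset:8][act_id:3]
        return ((0b010 << 13) | pe_bits
                | (_field(offset, 8, "offset") << 3)
                | _field(act_id, 3, "act_id"))
    # MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2]
    return ((0b011 << 13) | pe_bits | (0b10 << 9)
            | (_field(offset, 7, "offset") << 2))
```

Range-checking each field at pack time lets an out-of-range offset or act_id surface as an assembler error instead of a silently corrupted destination.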

### 8.5 Destination Resolution Rework

The current `_resolve_destinations()` function creates `ResolvedDest`
objects containing `Addr(a=iram_offset, port=edge.port, pe=dest_pe)`.
Under the frame model, destination resolution must additionally:

1. Look up the destination node's `act_id` (not just `iram_offset` and
   `pe`).
2. Determine the destination token type (dyadic wide vs monadic normal
   vs monadic inline).
3. Compute the packed flit 1 value.
4. Assign the flit 1 value to a frame slot.

The `ResolvedDest` type may be extended to carry the packed flit 1
value, or the flit 1 computation may happen in a separate sub-pass
after destination resolution.

### 8.6 ctx_mode / ctx_override Removal

The current allocator and codegen handle `ctx_override` edges by setting
`ctx_mode=1` on the source instruction and packing the target context
and generation into the `const` field. This entire mechanism is removed.

Under the frame model, cross-activation routing is handled by frame
destination slots. The destination flit 1 value already encodes the
target PE, offset, and act_id. No instruction-level `ctx_mode` is
needed. The `ctx_override` flag on `IREdge` may be kept for semantic
annotation (this edge crosses activation boundaries) but has no effect
on instruction encoding.

## 9. Codegen Changes (`asm/codegen.py`) -- MAJOR

### 9.1 Instruction Generation

Current codegen (`_build_iram_for_pe()`) generates `ALUInst` and
`SMInst` objects. These are Python dataclasses that model the old
instruction format with embedded destinations and constants:

```python
ALUInst(op, dest_l, dest_r, const, ctx_mode)
SMInst(op, sm_id, const, ret, ret_dyadic)
```

New codegen generates 16-bit instruction words matching the hardware
format:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """16-bit instruction word for IRAM.

    [type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
    """
    type: int     # 0 = CM, 1 = SM
    opcode: int   # 5-bit opcode
    mode: int     # 3-bit mode (0-7)
    wide: bool    # 32-bit frame values
    fref: int     # 6-bit frame slot base index
```

The `Instruction` type replaces both `ALUInst` and `SMInst`. Constants
and destinations are NOT in the instruction -- they are in frame slots.

The `PEConfig.iram` field type changes from
`dict[int, ALUInst | SMInst]` to `dict[int, Instruction]` (or
`dict[int, int]` if storing raw 16-bit words).
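
If raw 16-bit words are stored, an encode/decode pair along these lines would be needed. The sketch assumes the bracketed format is written most-significant bit first and works on plain integers so that it stands alone. `encode_instruction` and `decode_instruction` are hypothetical names.

```python
def encode_instruction(type_: int, opcode: int, mode: int,
                       wide: bool, fref: int) -> int:
    """Pack [type:1][opcode:5][mode:3][wide:1][fref:6] into a 16-bit word."""
    for name, value, bits in (("type", type_, 1), ("opcode", opcode, 5),
                              ("mode", mode, 3), ("fref", fref, 6)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return ((type_ << 15) | (opcode << 10) | (mode << 7)
            | (int(wide) << 6) | fref)


def decode_instruction(word: int) -> tuple[int, int, int, bool, int]:
    """Inverse of encode_instruction: (type, opcode, mode, wide, fref)."""
    return (
        (word >> 15) & 0x1,
        (word >> 10) & 0x1F,
        (word >> 7) & 0x7,
        bool((word >> 6) & 0x1),
        word & 0x3F,
    )
```

A round trip through both functions returns the original fields, which makes a convenient regression test once the real opcode numbering is fixed.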

### 9.2 Frame Setup Sequence Generation (NEW)

Codegen must generate the bootstrap sequence that loads frame contents
before execution begins. This is a stream of three kinds of tokens,
listed here by kind rather than in emission order:

1. **FrameControlToken (ALLOC)** for each activation:
   ```
   flit 1: [0][1][1][PE:2][00][op=0][spare:3][act_id:3]
   flit 2: (return routing or unused)
   ```

2. **PELocalWriteToken** for each frame slot that needs initialization:
   ```
   flit 1: [0][1][1][PE:2][01][region=1][spare:1][slot:5][act_id:3]
   flit 2: [data:16] (the frame slot value)
   ```

3. **PELocalWriteToken** for IRAM entries:
   ```
   flit 1: [0][1][1][PE:2][01][region=0][spare:1][slot:5][act_id:ignored]
   flit 2: [instruction:16]
   ```

The ordering matters: IRAM writes before frame setup, frame setup
(ALLOC + slot writes) before seed tokens. More specifically:

```
IRAM writes (all PEs)
  -> ALLOC frame control (all activations)
  -> Frame slot writes (constants, destinations per activation)
  -> Seed tokens (initial data tokens to start execution)
```
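
One way to enforce this ordering is to tag each generated token with its phase and apply a stable sort, as in the sketch below. The `Phase` enum and `order_bootstrap` are hypothetical, and tokens are represented as `(phase, payload)` pairs purely for illustration.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Bootstrap phases in required emission order."""
    IRAM_WRITE = 0
    ALLOC = 1
    FRAME_WRITE = 2
    SEED = 3


def order_bootstrap(tokens: list[tuple[Phase, str]]) -> list[tuple[Phase, str]]:
    """Sort tokens by phase.

    sorted() is stable, so tokens within one phase keep the relative
    order in which codegen produced them.
    """
    return sorted(tokens, key=lambda token: token[0])
```

Keeping the within-phase order intact matters if, for example, frame slot writes for one activation are generated together and should stay adjacent in the stream.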

### 9.3 New Token Types

The emulator will need new token types (or the existing types must be
adapted):

- **FrameControlToken** -- ALLOC/FREE frame lifecycle. Currently not in
  `tokens.py`. Codegen needs to emit these.
- **PELocalWriteToken** -- writes to IRAM (region=0) or frame slots
  (region=1). The current `IRAMWriteToken` is a special case of this
  (region=0 only). It may be generalised or a new type added.

These are emulator-level changes that codegen depends on. The codegen
module must import and construct whatever token types the emulator
provides.

### 9.4 Seed Token Generation

Current seed token generation creates `MonadToken` or `DyadToken` with
`ctx` field. Changes:

- `ctx` field -> `act_id` field in token constructors
- `gen` field on `DyadToken` is removed (ABA protection is via the 670
  valid bit, not generation counters)
- Seed tokens target `(pe, offset, act_id)` triples, packed from the
  destination's allocation data

### 9.5 Direct Mode Output

`generate_direct()` currently returns `AssemblyResult` with:
- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of
  `ALUInst/SMInst`
- `sm_configs: list[SMConfig]`
- `seed_tokens: list[MonadToken]`

New `AssemblyResult`:
- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of
  `Instruction` (new type), plus `frame_layouts: dict[int, FrameLayout]`
  mapping act_id to frame layout data
- `sm_configs: list[SMConfig]` -- unchanged
- `seed_tokens: list` -- may include `DyadToken` and `MonadToken` with
  `act_id` instead of `ctx`
- `setup_tokens: list` -- NEW: frame control and PE-local write tokens
  for bootstrap

### 9.6 Token Stream Mode Output

`generate_tokens()` currently produces:
```
SM init tokens -> IRAM write tokens -> seed tokens
```

New ordering:
```
SM init tokens -> IRAM write tokens -> ALLOC tokens -> frame slot write tokens -> seed tokens
```

### 9.7 ctx_mode Removal

The entire `ctx_mode` / `ctx_override` handling in `_build_iram_for_pe()`
(lines 96-132 of codegen.py) is removed. Cross-activation routing is
handled by frame destination slots, not by instruction encoding.

### 9.8 Route Restriction Computation

`_compute_route_restrictions()` is unchanged in concept. It scans edges
to determine which PEs and SMs a given PE needs to route to. The
implementation stays the same.

## 10. Serialize Pass Changes (`asm/serialize.py`)

### 10.1 Field Renaming

- `ctx` -> `act_id` in node serialization if activation ID is displayed
- Any `ctx_slot` qualifier in dfasm output becomes activation-related

### 10.2 New Fields

If `mode`, `fref`, and `wide` are displayed in serialized output (for
debugging allocated IR), `_serialize_node()` needs to format them.
Format could be: `&node|pe0|act2|mode1|fref8 <| add`
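
A formatter for the suggested syntax could be as small as the sketch below. `format_node` is a hypothetical helper, not the existing `_serialize_node()`. It follows the example format literally and therefore omits `wide`. Fields left as `None` are skipped, so pre-allocation output is unchanged.

```python
from typing import Optional


def format_node(
    name: str,
    opcode: str,
    pe: Optional[int] = None,
    act_id: Optional[int] = None,
    mode: Optional[int] = None,
    fref: Optional[int] = None,
) -> str:
    """Render a node as `&name|peN|actN|modeN|frefN <| opcode`."""
    parts = [f"&{name}"]
    for prefix, value in (("pe", pe), ("act", act_id),
                          ("mode", mode), ("fref", fref)):
        if value is not None:
            parts.append(f"{prefix}{value}")
    return "|".join(parts) + f" <| {opcode}"
```

With all fields set this reproduces the example above. With none set it yields `&node <| add`.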

### 10.3 Round-Trip Support

The serialize pass must be able to round-trip new IR fields. Since
`mode`, `fref`, and frame layout are only populated after allocation,
serialization before allocation produces the same output as today (no
new fields to display). Post-allocation serialization adds the new
fields.

## 11. Built-in Macro Changes (`asm/builtins.py`)

No changes needed. The built-in macros (`#loop_counted`, `#loop_while`,
`#permit_inject`, `#reduce_2/3/4`) use generic opcodes and edge routing.
They do not reference `ctx`, `ctx_slot`, `free_ctx`, or any
context-specific concepts.

The `free_ctx` opcode rename to `free_frame` does not affect builtins
because none of them use `free_ctx`.

## 12. Error Types (`asm/errors.py`)

### 12.1 New Error Categories

Add to `ErrorCategory`:

```python
class ErrorCategory(Enum):
    # ... existing ...
    FRAME = "frame"  # Frame layout overflow, slot conflicts
```

### 12.2 New Error Conditions

| Error | Category | Source Pass | Condition |
|---|---|---|---|
| Frame slot overflow | FRAME | allocate | Total slots > 64 for an activation |
| Matchable offset overflow | RESOURCE | allocate/place | > 8 dyadic instructions per activation per PE |
| Frame count overflow | RESOURCE | allocate | > 4 concurrent activations on one PE |
| Act ID exhaustion | RESOURCE | allocate | > 8 activation IDs needed (wraparound) |

## 13. Dfgraph Pipeline Impact (`dfgraph/`)

### 13.1 `dfgraph/pipeline.py`

The pipeline runner calls `allocate()` and uses `IRGraph` types. If
`SystemConfig` fields change, `pipeline.py` may need minor updates for
default values or error handling. The pipeline runner itself does not
inspect allocation results in detail.

### 13.2 `dfgraph/graph_json.py`

Currently includes `ctx` field in node JSON output. This becomes
`act_id`. The field rename is straightforward.

### 13.3 `dfgraph/categories.py`

References `RoutingOp.FREE_CTX` in the CONFIG category mapping. This
becomes `RoutingOp.FREE_FRAME`. One line change.

`EXTRACT_TAG` (if added) maps to the ROUTING or CONFIG category
depending on its semantics.

## 14. Monitor Impact (`monitor/`)

### 14.1 `monitor/snapshot.py`

`PESnapshot` currently captures `matching_store` (2D array of
`MatchEntry`) and `gen_counters`. Under the frame model:
- `matching_store` becomes frame state (per-frame slot values, presence
  bits)
- `gen_counters` is removed (no generation counters in frame model)
- New: `frame_allocations` (act_id -> frame_id mapping), `frame_slots`
  (per-frame slot contents)

### 14.2 `monitor/graph_json.py`

Node state overlay: `ctx` field -> `act_id`. Frame slot contents may be
added to the state overlay for debugging.

### 14.3 `monitor/repl.py`

The `pe` command displays PE state including matching store contents.
This needs updating to display frame state instead.

769## 15. Dependency Order
770
771Implementation order based on the dependency graph:
772
### Phase 1: Foundation Types
1. **`cm_inst.py`**: Add `RoutingOp.FREE_FRAME`, `RoutingOp.EXTRACT_TAG`.
   Add `Instruction` dataclass. Update `is_monadic_alu()`.
2. **`asm/ir.py`**: Add `FrameSlotMap`, `FrameLayout`. Rename
   `ctx_slot` -> `act_slot`, `ctx` -> `act_id` on IRNode. Update
   `SystemConfig` (remove `ctx_slots`, add `frame_count`,
   `frame_slots`, `matchable_offsets`). Rename `CtxSlotRef` ->
   `ActSlotRef`, `CtxSlotRange` -> `ActSlotRange`.
3. **`asm/opcodes.py`**: Add new opcodes to `MNEMONIC_TO_OP`,
   `_MONADIC_OPS_TUPLES`. Rename `free_ctx` -> `free_frame`.
4. **`asm/errors.py`**: Add `ErrorCategory.FRAME`.
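
As a rough illustration of the `SystemConfig` update in step 2, the frame-related fields might look like the sketch below. The defaults (4 frames, 64 slots, 8 matchable offsets) come from the overview; the class name `SystemConfigSketch` and the validation rules are assumptions, and the real `SystemConfig` has further fields not shown.

```python
from dataclasses import dataclass

FREF_BITS = 6  # width of the fref field in the 16-bit instruction word


@dataclass(frozen=True)
class SystemConfigSketch:
    """Frame-related fields only; replaces the old ctx_slots parameter."""
    frame_count: int = 4        # concurrent frames per PE
    frame_slots: int = 64       # addressable slots per frame
    matchable_offsets: int = 8  # dyadic instructions use offsets 0-7

    def __post_init__(self) -> None:
        # Assumed sanity checks; the plan does not specify them.
        if self.frame_count < 1:
            raise ValueError("frame_count must be at least 1")
        if self.frame_slots > (1 << FREF_BITS):
            raise ValueError("frame_slots exceeds what a 6-bit fref can address")
        if not 0 < self.matchable_offsets <= self.frame_slots:
            raise ValueError("matchable_offsets must fit within frame_slots")
```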

### Phase 2: Allocation and Codegen (the big changes)
5. **`asm/allocate.py`**: Rewrite `_assign_context_slots()` ->
   `_assign_act_ids()`. Add `_compute_frame_layouts()`. Add
   `_compute_modes()`. Add `_pack_flit1()`. Update
   `_assign_iram_offsets()` for new offset scheme. Update
   `_resolve_destinations()` to produce flit 1 values. Remove
   `ctx_mode`/`ctx_override` handling.
6. **`asm/codegen.py`**: Replace `ALUInst`/`SMInst` generation with
   `Instruction` generation. Add frame setup token generation. Add
   `FrameControlToken` / `PELocalWriteToken` emission. Update seed
   token generation. Remove `ctx_mode` handling. Update `PEConfig`
   construction. Update `AssemblyResult`.
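
Where codegen or its tests need the 16-bit IRAM word itself, the template layout `[type:1][opcode:5][mode:3][wide:1][fref:6]` can be packed as in the sketch below. It assumes the fields are listed most-significant first, which should be checked against `architecture-overview.md`; `pack_instruction` and `unpack_instruction` are hypothetical helpers, not part of the planned `Instruction` dataclass.

```python
# Field names and widths, assumed to run from the most significant bit down.
_FIELDS = (("type", 1), ("opcode", 5), ("mode", 3), ("wide", 1), ("fref", 6))


def pack_instruction(type_bit: int, opcode: int, mode: int, wide: int, fref: int) -> int:
    """Pack the five template fields into one 16-bit word."""
    values = (type_bit, opcode, mode, wide, fref)
    word = 0
    for (name, width), value in zip(_FIELDS, values):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_instruction(word: int) -> dict[str, int]:
    """Split a 16-bit word back into its named fields."""
    if not 0 <= word < (1 << 16):
        raise ValueError("instruction word must fit in 16 bits")
    fields = {}
    shift = 16
    for name, width in _FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


word = pack_instruction(type_bit=1, opcode=3, mode=2, wide=0, fref=5)
print(hex(word))  # 0x8d05
```

A round trip through both helpers is a cheap unit test for the encoding, in the same spirit as the flit 1 tests proposed in the risk table.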

### Phase 3: Upstream Pass Adjustments
7. **`asm/lower.py`**: Rename `ctx_slot` references. Update qualifier
   parsing if syntax changes.
8. **`asm/expand.py`**: Replace `FREE_CTX` with `FREE_FRAME` in call
   wiring. Rename `CtxSlotRef` -> `ActSlotRef`. Update `CallSite`
   field names. Simplify trampoline logic if `ctx_override` is removed.
9. **`asm/place.py`**: Update `_count_iram_cost()` (all nodes cost 1).
   Replace `ctx_slots` tracking with `frame_count`. Add
   matchable-offset-per-activation constraint.
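
The matchable-offset constraint in step 9 amounts to counting dyadic nodes per activation and flagging any activation that needs more than 8 offsets. The sketch below shows one way to express that check. `NodeInfo` and `find_offset_overflows` are hypothetical; because placement runs before act IDs are assigned, activations are keyed here by an assumed label such as the function scope name.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class NodeInfo(NamedTuple):
    """Minimal stand-in for the IRNode fields this check needs."""
    name: str
    activation: str  # assumed grouping key, e.g. function scope name
    is_dyadic: bool


def find_offset_overflows(
    nodes: Iterable[NodeInfo], matchable_offsets: int = 8
) -> dict[str, int]:
    """Return {activation: dyadic_count} for activations over the limit."""
    dyadic_per_activation = Counter(n.activation for n in nodes if n.is_dyadic)
    return {
        activation: count
        for activation, count in dyadic_per_activation.items()
        if count > matchable_offsets
    }
```

Monadic nodes are ignored because only dyadic instructions occupy matchable offsets. An activation reported here would either raise a placement error or be a candidate for splitting.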

### Phase 4: Output and Tooling
10. **`asm/serialize.py`**: Rename `ctx` -> `act_id` in output. Add
    mode/fref display for post-allocation IR.
11. **`asm/builtins.py`**: No changes (verified).
12. **`dfgraph/categories.py`**: `FREE_CTX` -> `FREE_FRAME`.
13. **`dfgraph/graph_json.py`**: `ctx` -> `act_id` in JSON output.
14. **`monitor/`**: Update snapshot, graph_json, and REPL for frame
    model.

### Phase 5: Emulator Updates (out of scope for this doc)
15. **`emu/types.py`**: `PEConfig` changes (`iram` type, remove
    `ctx_slots`, add frame parameters).
16. **`emu/pe.py`**: Frame-based matching instead of context-slot
    matching store.
17. **`tokens.py`**: `ctx` -> `act_id`, remove `gen`. Add
    `FrameControlToken`, `PELocalWriteToken`.

## 16. What Stays the Same

- **dfasm grammar** (`dfasm.lark`): mostly unchanged. May add frame
  directives later, but not required for v0.
- **Parse pass** (Lark parser): no changes.
- **Error infrastructure** (`errors.py`): structure unchanged, one new
  category added.
- **Graph structure** (`IRGraph`, `IREdge`, `IRRegion`): unchanged.
- **Pipeline architecture**: still 7 passes in the same order.
- **Resolve pass**: completely unchanged.
- **Built-in macros**: completely unchanged.
- **SM-related codegen**: SMConfig construction, SM init tokens, data
  defs -- all unchanged.
- **Name resolution and scoping**: unchanged.
- **Macro expansion core**: parameter substitution, variadic repetition,
  opcode parameters -- all unchanged. Only call wiring details change.

## 17. Risk Assessment

| Area | Risk | Mitigation |
|---|---|---|
| Frame layout allocation | New, complex algorithm | Start with simple sequential allocation, optimise later |
| Flit 1 packing | Bit-level encoding must match hardware spec | Unit tests against architecture-overview.md bit layouts |
| Instruction deduplication | Interaction with frame layouts is subtle | Defer dedup to optimisation pass; v0 allocates per-activation |
| Mode computation | 8 modes, context-dependent selection | Exhaustive test matrix against mode table |
| ABA distance | Act_id assignment must maintain distance | Sequential assignment (0,1,2,3) is safe for static programs |
| `ctx_override` removal | Affects expand pass call wiring | Can keep `ctx_override` as semantic annotation initially |
| Emulator dependency | Codegen must emit tokens emulator can consume | Phase emulator changes alongside codegen; can use adapter layer |
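
The sequential assignment named in the ABA distance row could look like the following sketch. It assumes IDs are assigned per PE, that the limit of 8 comes from the 3-bit `act_id`, and that running out is reported as an error rather than wrapped around. `assign_act_ids` and `ActIdExhausted` are hypothetical names, not the planned `_assign_act_ids()` implementation.

```python
class ActIdExhausted(Exception):
    """Raised when one PE needs more activation IDs than act_id can encode."""


def assign_act_ids(activations: list[str], max_act_ids: int = 8) -> dict[str, int]:
    """Hand out act IDs 0, 1, 2, ... in first-seen order, never reusing one."""
    # dict.fromkeys drops repeated names while preserving their order.
    unique = list(dict.fromkeys(activations))
    if len(unique) > max_act_ids:
        raise ActIdExhausted(
            f"{len(unique)} activations need IDs, only {max_act_ids} available"
        )
    return {name: act_id for act_id, name in enumerate(unique)}


print(assign_act_ids(["main", "f", "g", "f"]))  # {'main': 0, 'f': 1, 'g': 2}
```

Because no ID is reused within a static program, no two activations can be confused through wraparound. The cost is the hard cap, which is what the Act ID exhaustion error reports.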