docs/assembler-redesign-plan.md at main · nonbinary.computer/or1-design

Orual feat: rewrite ProcessingElement with frame-based matching, output routing, and unified instruction set 15d ago
65613978
# Assembler Redesign Plan: Frame-Based PE Model
  2
  3This document describes the changes needed in the OR1 assembler (`asm/`)
  4to match the frame-based PE redesign described in `pe-design.md` and
  5`architecture-overview.md`. It covers every assembler pass and supporting
  6module, identifying what changes, what stays, and the dependency order
  7for implementation.
  8
  9The assembler pipeline is: parse -> lower -> expand -> resolve -> place ->
 10allocate -> codegen. Most changes concentrate in allocate and codegen;
 11earlier passes are minimally affected.
 12
 13## 1. Overview
 14
 15The PE redesign replaces the context-slot matching model with a
 16frame-based model. Key architectural changes that affect the assembler:
 17
 18- **Context slots replaced by activation IDs and frames.** The old
 19  `ctx_slots` parameter (up to 16 slots) becomes `frame_count` (4
 20  concurrent frames) with 3-bit `act_id` (8 unique IDs). Each frame has
 21  64 addressable slots.
 22- **Instructions become templates.** IRAM entries are 16-bit words
 23  `[type:1][opcode:5][mode:3][wide:1][fref:6]`. Constants and
 24  destinations are NOT in the instruction -- they live in frame slots
 25  referenced by `fref`.
 26- **Instruction deduplication.** Multiple activations can share the same
 27  IRAM entry because per-activation data lives in frames, not
 28  instructions. The number of IRAM entries needed is the number of unique
 29  operation shapes, not the total operations executed.
 30- **8 matchable offsets per frame.** Dyadic instructions must be assigned
 31  offsets 0-7 within each activation. The assembler must enforce this
 32  constraint and split function bodies that exceed it.
 33- **Pre-formed flit 1 values.** Output destinations are 16-bit packed
 34  flit 1 values stored in frame slots. The PE reads the slot and puts it
 35  on the bus verbatim as flit 1. Token type is determined by the prefix
 36  bits in the stored flit value.
 37- **Frame setup via PE-local write tokens.** Constants, destinations, and
 38  other per-activation data are loaded into frame slots by a stream of
 39  PE-local write tokens before execution begins.
 40- **Frame lifecycle tokens.** ALLOC (frame control, prefix `011+00`,
 41  op=0) allocates a frame; FREE (frame control, op=1) or FREE_FRAME
 42  instruction releases it.
 43- **Mode field replaces output routing logic.** The 3-bit `mode` field
 44  in the instruction word encodes output routing (INHERIT/CHANGE_TAG/SINK)
 45  and constant presence. The assembler must compute the correct mode for
 46  each instruction based on its edge topology and constant usage.
 47
 48## 2. IR Type Changes (`asm/ir.py`)
 49
 50### 2.1 IRNode Changes
 51
 52Current fields affected:
 53
 54| Current Field | Change | Notes |
 55|---|---|---|
 56| `ctx_slot: Optional[Union[int, CtxSlotRef, CtxSlotRange]]` | Rename to `act_slot` or remove entirely | Macro parameter support (`CtxSlotRef`) may still be needed for parameterized activation grouping |
 57| `iram_offset: Optional[int]` | Keep | Semantics unchanged: index into PE's IRAM. Range 0-255. |
 58| `ctx: Optional[int]` | Rename to `act_id: Optional[int]` | 3-bit activation ID (0-7) instead of context slot |
 59
 60New fields to add to IRNode:
 61
| New Field | Type | Purpose |
|---|---|---|
| `mode` | `Optional[int]` | 3-bit instruction mode (0-7), computed during allocate |
| `fref` | `Optional[int]` | 6-bit frame slot base index (0-63), computed during allocate |
| `wide` | `bool` | 32-bit frame values flag, default False |
| `frame_layout` | `Optional[FrameSlotMap]` | Per-node frame slot assignments (see below) |
 68
 69### 2.2 New IR Types
 70
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameSlotMap:
    """Frame slot assignments for one instruction within an activation.

    Maps logical roles to physical frame slot indices within the 64-slot
    frame. Computed by the allocate pass.

    Attributes:
        match_slot: Frame slot for dyadic match operand storage (offsets 0-7)
        const_slot: Frame slot for the instruction's constant (if any)
        dest_slots: Frame slot(s) for pre-formed flit 1 destination values
        sink_slot: Frame slot for SINK mode write-back target
    """
    match_slot: Optional[int] = None
    const_slot: Optional[int] = None
    dest_slots: tuple[int, ...] = ()
    sink_slot: Optional[int] = None
```
 90
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameLayout:
    """Complete frame layout for one activation on one PE.

    Computed by the allocate pass. Used by codegen to generate PE-local
    write tokens for frame setup.

    Attributes:
        act_id: Activation ID (0-7)
        pe_id: PE where this frame lives
        slots: Dict mapping slot index to (role, value) pairs
        total_slots: Total slots used (must be <= 64)
    """
    act_id: int
    pe_id: int
    slots: dict[int, tuple[str, int]]
    total_slots: int
```
110
111### 2.3 SystemConfig Changes
112
113Current `SystemConfig`:
114
```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = DEFAULT_IRAM_CAPACITY  # 128
    ctx_slots: int = DEFAULT_CTX_SLOTS           # 16
```
123
124New `SystemConfig`:
125
```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 256              # 8-bit offset, 256 entries
    frame_count: int = 4                  # max concurrent frames per PE
    frame_slots: int = 64                 # slots per frame
    matchable_offsets: int = 8            # max dyadic instructions per activation per PE
```
136
137The defaults change: `DEFAULT_IRAM_CAPACITY` goes from 128 to 256
138(matching the 8-bit offset field). `DEFAULT_CTX_SLOTS` is removed
139entirely and replaced by `frame_count`, `frame_slots`, and
140`matchable_offsets`.
141
142### 2.4 What Stays
143
144- `IREdge` -- unchanged except `ctx_override` may need renaming or
145  semantic adjustment (see section 5)
146- `IRGraph` structure -- unchanged
147- `IRDataDef` -- unchanged (SM data definitions are orthogonal to PE frames)
148- `IRRegion`, `RegionKind` -- unchanged
149- `NameRef`, `ResolvedDest` -- unchanged
150- `MacroDef`, `IRMacroCall`, `MacroParam`, `ParamRef` -- unchanged
- `CallSite`, `CallSiteResult` -- structure unchanged; `free_ctx_nodes`
  is renamed to `free_frame_nodes`, `trampoline_nodes` stays (see section 5.2)
153- `SourceLoc`, `ConstExpr`, `IRRepetitionBlock` -- unchanged
154
155### 2.5 CtxSlotRef / CtxSlotRange
156
157`CtxSlotRef` and `CtxSlotRange` are used in macro templates for
158parameterized context slot assignment. Under the frame model, these
159become `ActSlotRef` / `ActSlotRange` (or are removed if activation ID
160assignment is always automatic). The expand pass resolves these to
161concrete values during macro expansion. The rename is straightforward
162but touches `ir.py`, `lower.py`, and `expand.py`.
163
164## 3. Opcode Changes (`asm/opcodes.py`)
165
166### 3.1 Renamed Opcodes
167
168| Current | New | Notes |
169|---|---|---|
170| `RoutingOp.FREE_CTX` | `RoutingOp.FREE_FRAME` | Deallocates a frame instead of a context slot |
171
172The mnemonic mapping changes: `"free_ctx"` -> `"free_frame"` in
173`MNEMONIC_TO_OP`. The `OP_TO_MNEMONIC` reverse mapping updates
174correspondingly.
175
176### 3.2 New Opcodes
177
178| Opcode | Type | Arity | Purpose |
179|---|---|---|---|
180| `RoutingOp.EXTRACT_TAG` | Monadic CM | Monadic | Captures executing token's identity as 16-bit packed flit 1 value (return continuation) |
181| `RoutingOp.ALLOC_REMOTE` | Monadic CM | Monadic | Triggers ALLOC frame control token to a target PE (may be handled at codegen level instead) |
182
183`EXTRACT_TAG` must be added to `MNEMONIC_TO_OP` (mnemonic: `"extract_tag"`),
184to `_MONADIC_OPS_TUPLES`, and to `cm_inst.py` as a new `RoutingOp` enum
185member.
186
187Whether `ALLOC_REMOTE` is an ALU opcode or purely a codegen-emitted
188frame control token depends on whether the assembler exposes it as a
189user-facing instruction or handles it internally during call wiring.
190For static calls, ALLOC is emitted by codegen as a frame control token,
191not as an ALU instruction. For dynamic calls, it may need to be an
192instruction.
193
194### 3.3 Mode-Dependent Arity
195
196The current `is_monadic(op, const)` and `is_dyadic(op, const)` functions
197remain correct in concept. The `mode` field is orthogonal to arity --
198mode determines output routing behaviour, not input operand count. No
199changes needed to arity classification.
200
201### 3.4 Monadic/Dyadic Classification
202
203Unchanged. The distinction between monadic and dyadic is fundamental to
204matching (dyadic tokens go through presence-based matching in the 670s;
205monadic tokens bypass matching). The assembler's arity classification
206drives IRAM offset assignment (dyadic at offsets 0-7, monadic at 8+).
207
208## 4. Lower Pass Changes (`asm/lower.py`)
209
210Minimal changes. The lower pass translates Lark CST to IR and is mostly
211opcode-agnostic.
212
213### 4.1 Opcode Mnemonic Updates
214
215The `MNEMONIC_TO_OP` lookup in `_resolve_opcode()` will pick up new
216opcodes automatically when they are added to `opcodes.py`. No lower.py
217code changes needed for new opcodes.
218
219The `free_ctx` mnemonic rename to `free_frame` will need a corresponding
220change in lower.py only if the mnemonic string is hardcoded anywhere
221(it is not -- lower.py uses `MNEMONIC_TO_OP` for all lookups).
222
223### 4.2 Context Slot Qualifiers
224
225The `[ctx_slot]` qualifier syntax in dfasm (`&node[3] <| add`) is
226parsed in lower.py and stored as `ctx_slot` on IRNode. This syntax and
227field will be renamed to reflect activation IDs if kept. If activation
228ID assignment is always automatic, the qualifier syntax may be removed
229entirely.
230
231References in lower.py:
232- `inst_def` transformer rule: parses `[N]` qualifier into `ctx_slot`
233- `ctx_slot_ref` transformer rule: creates `CtxSlotRef` for macro
234  parameter `[${param}]`
235- `ctx_slot_range` transformer rule: creates `CtxSlotRange` for `[N:M]`
236
237### 4.3 New Syntax (Optional, Deferred)
238
239Frame directives in dfasm syntax (e.g., `@frame_layout` pragmas) could
240be added later. Not needed for v0 -- the allocator computes frame
241layouts automatically.
242
243## 5. Expand Pass Changes (`asm/expand.py`)
244
245### 5.1 Function Call Wiring
246
247The expand pass currently generates cross-context call wiring:
248- Trampoline `PASS` nodes with `ctx_override` edges
249- `FREE_CTX` cleanup nodes
250- Per-call-site context slot allocation via `CallSite` metadata
251
252Under the frame model, the call wiring changes:
253
254- **`ctx_override` edges** become frame-boundary edges. The semantic
255  meaning is the same (data crosses activation boundaries), but the
256  mechanism changes: instead of packing a target context into the
257  instruction's `const` field with `ctx_mode=1`, the destination's
258  pre-formed flit 1 value in the frame slot already encodes the target
259  PE, offset, and act_id. Cross-activation routing is handled by frame
260  setup, not by instruction encoding.
261
262- **`FREE_CTX` nodes** become `FREE_FRAME` nodes. The opcode changes
263  from `RoutingOp.FREE_CTX` to `RoutingOp.FREE_FRAME`. The expand pass
264  references `RoutingOp.FREE_CTX` directly in `_wire_call_site()` at
265  line 1160 of expand.py.
266
267- **Trampoline nodes** may simplify. In the current design, trampolines
268  exist to bridge context boundaries (an edge cannot cross contexts
269  without a PASS node carrying `ctx_mode`). Under the frame model,
270  destinations in frame slots already encode the target activation, so
271  cross-activation edges are just edges whose destination flit 1 value
272  points to a different activation. Trampolines may still be useful for
273  fan-out or return routing but are no longer needed purely for context
274  bridging.
275
276- **`@ret` wiring** stays conceptually the same. Return routing in the
277  frame model is a pre-formed flit 1 value loaded into a frame slot.
278  The expand pass creates edges for return routing; the allocate pass
279  resolves those edges to flit 1 values in frame slots.
280
281### 5.2 CallSite Metadata
282
283`CallSite` fields to rename:
284- `free_ctx_nodes` -> `free_frame_nodes`
285
286The `trampoline_nodes` field stays (trampolines may still be generated
287for fan-out or return routing patterns).
288
289### 5.3 CtxSlotRef Resolution
290
291The expand pass resolves `CtxSlotRef` and `CtxSlotRange` during macro
292expansion. These will be renamed to `ActSlotRef` / `ActSlotRange` (or
293removed). The resolution logic in `_substitute_node()` and
294`_substitute_edge()` is straightforward renaming.
295
296### 5.4 Built-in Macros
297
298The built-in macros in `builtins.py` do NOT reference `ctx_slot`, `ctx`,
299`free_ctx`, or any context-specific concepts directly. They use generic
300opcodes (`add`, `brgt`, `gate`, `pass`, `inc`, `const`) and edge routing
301(`@ret`, `${param}`). No changes needed to built-in macros.
302
303## 6. Resolve Pass Changes (`asm/resolve.py`)
304
305No changes needed. The resolve pass validates that edge endpoints exist
306and detects scope violations. It operates on node names and graph
307structure, not on PE-level concepts like contexts or frames. The resolve
308pass does not reference `ctx`, `ctx_slot`, `ctx_override`, or any
309context-related fields.
310
311## 7. Place Pass Changes (`asm/place.py`)
312
313### 7.1 New Constraint: Matchable Offset Limit
314
315The placement pass must enforce the 8-matchable-offset constraint: at
316most 8 dyadic instructions per activation per PE. This is a new
317constraint that does not exist in the current codebase.
318
319Currently, `_count_iram_cost()` counts dyadic nodes as costing 2 IRAM
320slots and monadic as 1. Under the frame model:
- Dyadic nodes cost 1 IRAM slot (match operands live in frame slots,
  with presence bits in the 670s, not in IRAM)
- Monadic nodes cost 1 IRAM slot
- The 8-dyadic-per-activation limit is a separate constraint from IRAM
  capacity
326
327The placement pass needs to track dyadic instruction count per
328activation group per PE, in addition to total IRAM usage.
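
The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the `place.py` implementation: `check_pe_budgets` and the `(pe, activation_group, is_dyadic)` tuple shape are hypothetical, and IRAM is counted conservatively (no deduplication across activations).

```python
from collections import Counter


def check_pe_budgets(nodes, iram_capacity=256, matchable_offsets=8):
    """Check IRAM and matchable-offset budgets.

    nodes: iterable of (pe, activation_group, is_dyadic) tuples
           (hypothetical summary of placed IR nodes).
    Returns a list of human-readable violation strings.
    """
    iram_used = Counter()    # pe -> IRAM entries
    dyadic_used = Counter()  # (pe, group) -> dyadic instruction count
    for pe, group, is_dyadic in nodes:
        iram_used[pe] += 1   # every node costs exactly one IRAM slot
        if is_dyadic:
            dyadic_used[(pe, group)] += 1

    violations = []
    for pe, used in sorted(iram_used.items()):
        if used > iram_capacity:
            violations.append(f"PE{pe}: {used} IRAM entries > {iram_capacity}")
    for (pe, group), used in sorted(dyadic_used.items()):
        if used > matchable_offsets:
            violations.append(
                f"PE{pe} {group}: {used} dyadic instructions > {matchable_offsets}"
            )
    return violations
```

Keeping the two counters separate mirrors the point that the dyadic limit is independent of IRAM capacity: a PE can be far below 256 entries and still violate the 8-offset constraint.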
329
330### 7.2 IRAM Cost Recalculation
331
332`_count_iram_cost()` should return 1 for all node types (dyadic and
333monadic both use 1 IRAM slot in the frame model). The current cost of 2
334for dyadic nodes was because the old matching store occupied IRAM slots;
335in the frame model, match operands live in frame SRAM, not IRAM.
336
337### 7.3 Context Slots -> Frames
338
339`_auto_place_nodes()` currently tracks `ctx_used` per PE (context slots
340consumed). This becomes `frames_used` per PE (concurrent frames, max 4).
341The `ctx_scopes_per_pe` tracking of function scopes per PE maps directly
342to frame tracking: each function scope on a PE consumes one frame.
343
344`SystemConfig.ctx_slots` references in place.py become
345`SystemConfig.frame_count`.
346
347### 7.4 Instruction Deduplication Awareness
348
349Because IRAM entries are activation-independent templates, multiple
350activations of the same function on the same PE share IRAM entries. The
351placement pass could account for this when computing IRAM utilisation:
352if two activations of function `$foo` run on PE0, they share IRAM
353entries, so the IRAM cost is the unique instruction count, not the total.
354
355This is an optimisation, not a correctness requirement. The placement
356pass can conservatively count IRAM entries per unique function body
357without deduplication for v0.
358
359## 8. Allocate Pass Changes (`asm/allocate.py`) -- MAJOR
360
361This is the largest change. The current allocate pass assigns IRAM
362offsets and context slots, then resolves destinations to `Addr` values.
363The frame model requires fundamentally different allocation logic.
364
365### 8.1 IRAM Offset Assignment
366
367Current behaviour (`_assign_iram_offsets()`):
368- Dyadic nodes get offsets 0..D-1
369- Monadic nodes get offsets D..D+M-1
370- Total must fit in `iram_capacity`
371
372New behaviour:
373- Dyadic nodes get offsets 0-7 (within the matchable offset range).
374  At most 8 dyadic instructions per activation group per PE. Because
375  IRAM is activation-independent (shared templates), dyadic offset
376  assignment is per-PE, not per-activation.
377- Monadic nodes get offsets 8-255 (or wherever dyadic offsets end).
378- The `matchable_offsets` limit (default 8) constrains dyadic count.
379- With instruction deduplication, multiple activations of the same
380  function body share IRAM offsets. The allocator assigns offsets once
381  per unique instruction template, not per activation.
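
A minimal sketch of the new offset scheme for a single PE, assuming the input is already reduced to unique instruction templates. The function name and the `(name, is_dyadic)` input shape are hypothetical; the sketch takes the "wherever dyadic offsets end" option for the monadic base rather than always starting at 8.

```python
def assign_iram_offsets(templates, iram_capacity=256, matchable_offsets=8):
    """Assign IRAM offsets for one PE.

    templates: list of (name, is_dyadic) pairs, one per unique
               instruction template (hypothetical input shape).
    Returns a dict mapping template name to IRAM offset.
    """
    dyadic = [name for name, is_dyadic in templates if is_dyadic]
    monadic = [name for name, is_dyadic in templates if not is_dyadic]

    if len(dyadic) > matchable_offsets:
        raise ValueError(
            f"{len(dyadic)} dyadic templates exceed {matchable_offsets} matchable offsets"
        )
    if len(dyadic) + len(monadic) > iram_capacity:
        raise ValueError("IRAM capacity exceeded")

    # Dyadic templates occupy the low, matchable offsets.
    offsets = {name: i for i, name in enumerate(dyadic)}
    # Monadic templates follow immediately after the last dyadic offset.
    base = len(dyadic)
    for i, name in enumerate(monadic):
        offsets[name] = base + i
    return offsets
```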
382
383### 8.2 Activation ID Assignment
384
385Replaces `_assign_context_slots()`. The current function assigns context
386slot indices (0-15) per function scope per PE. The new function assigns
3873-bit activation IDs (0-7) with at most 4 concurrent activations per PE.
388
389Key differences:
390- **Smaller space:** 8 act_ids (3-bit) vs 16 context slots. But only 4
391  can be concurrently active (4 physical frames).
392- **ABA distance:** the allocator must maintain ABA distance between
393  act_ids to prevent stale token collisions. With 4 concurrent frames
394  out of 8 possible IDs, 4 IDs of ABA distance exist before wraparound.
395  For static programs, the allocator can assign act_ids sequentially
396  (0, 1, 2, 3 for 4 concurrent activations).
397- **Per-call-site allocation:** same concept as current
398  `call_site_to_ctx_on_pe` -- each call site that creates a new
399  activation gets a fresh act_id. But the budget is 4 concurrent frames
400  instead of 16 context slots.
401
402The function scope grouping logic (`_extract_function_scope()`) stays.
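
For static programs, sequential assignment could look roughly like the sketch below. It assumes every activation on a PE may be live at the same time, so the count is bounded by `frame_count`; `assign_act_ids` and the activation keys are hypothetical stand-ins for whatever `_assign_act_ids()` ends up using.

```python
def assign_act_ids(activations_by_pe, frame_count=4):
    """Assign sequential act_ids per PE.

    activations_by_pe: dict mapping pe -> list of activation keys
                       (e.g. function scope or call-site names; hypothetical).
    Returns a dict mapping (pe, key) -> act_id.
    """
    assigned = {}
    for pe, keys in activations_by_pe.items():
        unique = list(dict.fromkeys(keys))  # drop repeats, keep first-seen order
        if len(unique) > frame_count:
            raise ValueError(
                f"PE{pe}: {len(unique)} activations exceed {frame_count} frames"
            )
        for act_id, key in enumerate(unique):
            assigned[(pe, key)] = act_id
    return assigned
```

Because `frame_count` (4) is smaller than the 8-value act_id space, this scheme never wraps around, so no ABA handling is needed in the static case.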
403
404### 8.3 Frame Layout Allocation (NEW)
405
406This is entirely new functionality. After IRAM offset and act_id
407assignment, the allocator must compute the frame layout for each
408activation: which frame slots hold what data.
409
410**Frame slot roles:**
411
| Role | Slot Range | Count per Instruction | Notes |
|---|---|---|---|
| Match operand | 0-7 | 1 per dyadic instruction | Indexed by matchable offset. Presence bit in 670. |
| Constant | 8+ | 0-1 per instruction | `mode[0]` (has_const) selects whether const is read from frame[fref] |
| Destination | variable | 0-2 per instruction | Pre-formed flit 1 values. 1 for modes 0/1, 2 for modes 2/3. |
| Accumulator/sink | variable | 0-1 per instruction | For SINK modes (6/7). Write-back target. |
| SM parameters | variable | 0-2 per SM instruction | SM_id + addr, data or return routing. |
419
420**Slot assignment algorithm:**
421
4221. Reserve slots 0-7 for match operands (one per dyadic instruction,
423   indexed by the instruction's matchable offset).
4242. Assign constant slots starting at slot 8. Constants that are shared
425   across instructions within the same activation can be deduplicated
426   (same value -> same slot).
4273. Assign destination slots. Each destination is a pre-formed flit 1
428   value. Destinations shared across instructions can be deduplicated.
4294. Assign sink/accumulator slots for SINK mode instructions.
4305. Assign SM parameter slots for SM operations.
4316. Verify total slots <= 64. If exceeded, report a frame overflow error.
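
Steps 1-3 and 6 of the algorithm above can be sketched as below. This is a simplified illustration under stated assumptions: sink and SM parameter slots are omitted, match slots are initialised to 0, and the `fref` contiguity requirement is ignored (deduplication and contiguity pull in different directions, so a real allocator has to reconcile them). The function name and input shapes are hypothetical.

```python
def assign_frame_slots(dyadic_offsets, constants, dest_flits,
                       frame_slots=64, matchable_offsets=8):
    """Build a slot -> (role, value) map for one activation.

    dyadic_offsets: matchable offsets used by dyadic instructions
    constants:      constant values needed by the activation
    dest_flits:     pre-formed 16-bit flit 1 destination values
    """
    slots = {}
    # Step 1: match operand slots, indexed by matchable offset.
    for offset in dyadic_offsets:
        if not 0 <= offset < matchable_offsets:
            raise ValueError(f"offset {offset} is not matchable")
        slots[offset] = ("match", 0)

    # Steps 2-3: constants, then destinations, deduplicated by value.
    next_slot = matchable_offsets
    for role, values in (("const", constants), ("dest", dest_flits)):
        for value in dict.fromkeys(values):  # same value -> same slot
            # Step 6: frame overflow check.
            if next_slot >= frame_slots:
                raise OverflowError(f"frame overflow: more than {frame_slots} slots")
            slots[next_slot] = (role, value)
            next_slot += 1
    return slots
```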
432
433**fref computation:**
434
435The `fref` field in the instruction word points to the base of a
436contiguous group of frame slots used by that instruction. The slot count
437depends on the mode:
438
439| Mode | Slots at fref | Layout |
440|---|---|---|
441| 0 | 1 | [dest] |
442| 1 | 2 | [const, dest] |
443| 2 | 2 | [dest1, dest2] |
444| 3 | 3 | [const, dest1, dest2] |
445| 4 | 0 | (no frame access) |
446| 5 | 1 | [const] |
447| 6 | 1 | [sink_target] (write) |
448| 7 | 1 | [sink_target] (read-modify-write) |
449
450The allocator must arrange frame slots such that each instruction's
451constant and destination(s) are contiguous starting at `fref`. This
452may require careful slot packing or a simple sequential allocation
453strategy.
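
The "simple sequential allocation strategy" can be sketched as follows. The slot counts come from the mode table above; `assign_frefs`, the `first_free` parameter, and returning `None` for mode 4 are assumptions made for illustration.

```python
# Slots consumed at fref for each mode, taken from the table above.
SLOTS_PER_MODE = {0: 1, 1: 2, 2: 2, 3: 3, 4: 0, 5: 1, 6: 1, 7: 1}


def assign_frefs(modes, first_free=8, frame_slots=64):
    """Give each instruction a contiguous slot group, in order.

    modes: list of mode values (0-7), one per instruction.
    Returns a list of fref values; None where the mode uses no frame slots.
    """
    frefs = []
    cursor = first_free  # slots below this are reserved for match operands
    for mode in modes:
        width = SLOTS_PER_MODE[mode]
        if width == 0:
            frefs.append(None)  # mode 4: no frame access
            continue
        if cursor + width > frame_slots:
            raise OverflowError("frame overflow while assigning fref")
        frefs.append(cursor)
        cursor += width
    return frefs
```

For example, `assign_frefs([1, 0, 4, 3])` yields `[8, 10, None, 11]`. This never shares slots between instructions, so it trades frame space for trivially satisfying the contiguity requirement.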
454
455**Mode computation:**
456
457The allocator determines the mode for each instruction based on:
458
459| Condition | Mode |
460|---|---|
461| 0 dests, no const, has sink | 6 (SINK) |
462| 0 dests, const via frame accumulator (RMW) | 7 (SINK+CONST) |
463| 1 dest, no const | 0 (INHERIT, single output) |
464| 1 dest, has const | 1 (INHERIT, single output + const) |
465| 2 dests, no const | 2 (INHERIT, fan-out) |
466| 2 dests, has const | 3 (INHERIT, fan-out + const) |
467| CHANGE_TAG, no const | 4 |
468| CHANGE_TAG, has const | 5 |
469
470CHANGE_TAG is used when the instruction's left operand provides the
471output destination dynamically (e.g., dynamic return routing via
472`EXTRACT_TAG` + `CHANGE_TAG`).
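
The mode table can be expressed as a small decision function. This is a sketch only: `compute_mode` and its boolean inputs are hypothetical, and it treats "const via frame accumulator (RMW)" as `has_const=True` on a sink instruction, which is one reading of the table.

```python
def compute_mode(dest_count, has_const, change_tag=False, sink=False):
    """Return the 3-bit mode for an instruction.

    dest_count: number of static destinations (0, 1, or 2)
    has_const:  whether the instruction reads a constant from its frame
    change_tag: output destination comes from the left operand
    sink:       result is written back to a frame slot
    """
    if change_tag:
        return 5 if has_const else 4
    if sink:
        if dest_count != 0:
            raise ValueError("SINK instructions have no destinations")
        return 7 if has_const else 6
    if dest_count == 1:
        return 1 if has_const else 0
    if dest_count == 2:
        return 3 if has_const else 2
    raise ValueError(f"no mode for {dest_count} destinations without sink/change_tag")
```

In every case bit 0 of the result equals `has_const`, which matches the `mode[0]` (has_const) convention used in the frame slot roles table.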
473
474### 8.4 Pre-Formed Flit 1 Computation
475
476The allocator must convert resolved destination `Addr` values into
47716-bit packed flit 1 values for storage in frame destination slots.
478
479**Flit 1 formats** (from `architecture-overview.md`):
480
```
DYADIC WIDE:    [0][0][port:1][PE:2][offset:8][act_id:3]   = 16 bits
MONADIC NORM:   [0][1][0][PE:2][offset:8][act_id:3]         = 16 bits
MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2]      = 16 bits
```
486
487The packing function takes:
488- `dest_pe: int` (2 bits)
489- `dest_offset: int` (8 bits for dyadic/monadic normal, 7 for inline)
490- `dest_act_id: int` (3 bits, for dyadic and monadic normal)
491- `dest_port: Port` (1 bit, for dyadic wide only)
492- `dest_type: TokenType` (dyadic wide, monadic normal, or monadic inline)
493
494And produces a 16-bit packed flit 1 value.
495
496The `dest_type` is determined by whether the destination instruction is
497dyadic (-> dyadic wide flit) or monadic (-> monadic normal flit). For
498trigger-only destinations (e.g., switch not-taken path), monadic inline
499is used.
500
501This replaces the current `Addr(a, port, pe)` resolution. The `Addr`
502type may still be used as an intermediate representation, but the final
503output is a packed flit 1 value in a frame slot.
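
A minimal sketch of the packing function, assuming the three formats above are laid out MSB-first in a 16-bit word. `DestType` stands in for the `TokenType` mentioned above, `port` is a plain 0/1 integer rather than the assembler's `Port` type, and spare bits are written as zero; all of these are illustrative choices, not confirmed API.

```python
from enum import Enum


class DestType(Enum):
    DYADIC_WIDE = "dyadic_wide"
    MONADIC_NORMAL = "monadic_normal"
    MONADIC_INLINE = "monadic_inline"


def pack_flit1(dest_type, pe, offset, act_id=0, port=0):
    """Pack a destination into a 16-bit flit 1 value."""
    if not 0 <= pe < 4:
        raise ValueError("pe must fit in 2 bits")

    if dest_type is DestType.MONADIC_INLINE:
        if not 0 <= offset < 128:
            raise ValueError("inline offset must fit in 7 bits")
        # [0][1][1][PE:2][10][offset:7][spare:2]
        return (0b011 << 13) | (pe << 11) | (0b10 << 9) | (offset << 2)

    if not 0 <= offset < 256:
        raise ValueError("offset must fit in 8 bits")
    if not 0 <= act_id < 8:
        raise ValueError("act_id must fit in 3 bits")

    if dest_type is DestType.DYADIC_WIDE:
        if port not in (0, 1):
            raise ValueError("port must be 0 or 1")
        # [0][0][port:1][PE:2][offset:8][act_id:3]
        return (port << 13) | (pe << 11) | (offset << 3) | act_id

    # MONADIC_NORMAL: [0][1][0][PE:2][offset:8][act_id:3]
    return (0b010 << 13) | (pe << 11) | (offset << 3) | act_id
```

For example, a dyadic destination on PE 1, offset 2, act_id 3, port 1 packs to `0x2813`.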
504
505### 8.5 Destination Resolution Rework
506
507The current `_resolve_destinations()` function creates `ResolvedDest`
508objects containing `Addr(a=iram_offset, port=edge.port, pe=dest_pe)`.
509Under the frame model, destination resolution must additionally:
510
5111. Look up the destination node's `act_id` (not just `iram_offset` and
512   `pe`).
5132. Determine the destination token type (dyadic wide vs monadic normal
514   vs monadic inline).
5153. Compute the packed flit 1 value.
5164. Assign the flit 1 value to a frame slot.
517
518The `ResolvedDest` type may be extended to carry the packed flit 1
519value, or the flit 1 computation may happen in a separate sub-pass
520after destination resolution.
521
522### 8.6 ctx_mode / ctx_override Removal
523
524The current allocator and codegen handle `ctx_override` edges by setting
525`ctx_mode=1` on the source instruction and packing the target context
526and generation into the `const` field. This entire mechanism is removed.
527
528Under the frame model, cross-activation routing is handled by frame
529destination slots. The destination flit 1 value already encodes the
530target PE, offset, and act_id. No instruction-level `ctx_mode` is
531needed. The `ctx_override` flag on `IREdge` may be kept for semantic
532annotation (this edge crosses activation boundaries) but has no effect
533on instruction encoding.
534
535## 9. Codegen Changes (`asm/codegen.py`) -- MAJOR
536
537### 9.1 Instruction Generation
538
539Current codegen (`_build_iram_for_pe()`) generates `ALUInst` and
540`SMInst` objects. These are Python dataclasses that model the old
541instruction format with embedded destinations and constants:
542
```python
ALUInst(op, dest_l, dest_r, const, ctx_mode)
SMInst(op, sm_id, const, ret, ret_dyadic)
```
547
548New codegen generates 16-bit instruction words matching the hardware
549format:
550
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """16-bit instruction word for IRAM.

    [type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
    """
    type: int       # 0 = CM, 1 = SM
    opcode: int     # 5-bit opcode
    mode: int       # 3-bit mode (0-7)
    wide: bool      # 32-bit frame values
    fref: int       # 6-bit frame slot base index
```
564
565The `Instruction` type replaces both `ALUInst` and `SMInst`. Constants
566and destinations are NOT in the instruction -- they are in frame slots.
567
568The `PEConfig.iram` field type changes from
569`dict[int, ALUInst | SMInst]` to `dict[int, Instruction]` (or
570`dict[int, int]` if storing raw 16-bit words).
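
If raw 16-bit words are stored, the packing could look like the sketch below. It assumes the fields are laid out MSB-first in the order `[type:1][opcode:5][mode:3][wide:1][fref:6]`; `encode_instruction` and `decode_instruction` are hypothetical helper names, and the `Instruction` class is repeated only to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    type: int    # 0 = CM, 1 = SM
    opcode: int  # 5-bit opcode
    mode: int    # 3-bit mode (0-7)
    wide: bool   # 32-bit frame values
    fref: int    # 6-bit frame slot base index


def encode_instruction(inst: Instruction) -> int:
    """Pack an Instruction into a 16-bit word (type in bit 15, fref in bits 5-0)."""
    if not (0 <= inst.type <= 1 and 0 <= inst.opcode < 32
            and 0 <= inst.mode < 8 and 0 <= inst.fref < 64):
        raise ValueError(f"field out of range: {inst}")
    return (
        (inst.type << 15)
        | (inst.opcode << 10)
        | (inst.mode << 7)
        | (int(inst.wide) << 6)
        | inst.fref
    )


def decode_instruction(word: int) -> Instruction:
    """Inverse of encode_instruction."""
    return Instruction(
        type=(word >> 15) & 0x1,
        opcode=(word >> 10) & 0x1F,
        mode=(word >> 7) & 0x7,
        wide=bool((word >> 6) & 0x1),
        fref=word & 0x3F,
    )
```

Keeping a decode function alongside the encoder makes round-trip tests cheap, which is useful while the opcode numbering is still changing.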
571
572### 9.2 Frame Setup Sequence Generation (NEW)
573
574Codegen must generate the bootstrap sequence that loads frame contents
575before execution begins. This is a stream of tokens:
576
1. **FrameControlToken (ALLOC)** for each activation:
   ```
   flit 1: [0][1][1][PE:2][00][op=0][spare:3][act_id:3]
   flit 2: (return routing or unused)
   ```

2. **PELocalWriteToken** for each frame slot that needs initialization:
   ```
   flit 1: [0][1][1][PE:2][01][region=1][spare:1][slot:5][act_id:3]
   flit 2: [data:16]  (the frame slot value)
   ```

3. **PELocalWriteToken** for IRAM entries:
   ```
   flit 1: [0][1][1][PE:2][01][region=0][spare:1][slot:5][act_id:ignored]
   flit 2: [instruction:16]
   ```
594
595The ordering matters: IRAM writes before frame setup, frame setup
596(ALLOC + slot writes) before seed tokens. More specifically:
597
```
IRAM writes (all PEs)
  -> ALLOC frame control (all activations)
    -> Frame slot writes (constants, destinations per activation)
      -> Seed tokens (initial data tokens to start execution)
```
604
605### 9.3 New Token Types
606
607The emulator will need new token types (or the existing types must be
608adapted):
609
610- **FrameControlToken** -- ALLOC/FREE frame lifecycle. Currently not in
611  `tokens.py`. Codegen needs to emit these.
612- **PELocalWriteToken** -- writes to IRAM (region=0) or frame slots
613  (region=1). The current `IRAMWriteToken` is a special case of this
614  (region=0 only). It may be generalised or a new type added.
615
616These are emulator-level changes that codegen depends on. The codegen
617module must import and construct whatever token types the emulator
618provides.
619
620### 9.4 Seed Token Generation
621
622Current seed token generation creates `MonadToken` or `DyadToken` with
623`ctx` field. Changes:
624
625- `ctx` field -> `act_id` field in token constructors
626- `gen` field on `DyadToken` is removed (ABA protection is via the 670
627  valid bit, not generation counters)
628- Seed tokens target `(pe, offset, act_id)` triples, packed from the
629  destination's allocation data
630
631### 9.5 Direct Mode Output
632
633`generate_direct()` currently returns `AssemblyResult` with:
634- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of
635  `ALUInst/SMInst`
636- `sm_configs: list[SMConfig]`
637- `seed_tokens: list[MonadToken]`
638
639New `AssemblyResult`:
640- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of
641  `Instruction` (new type), plus `frame_layouts: dict[int, FrameLayout]`
642  mapping act_id to frame layout data
643- `sm_configs: list[SMConfig]` -- unchanged
644- `seed_tokens: list` -- may include `DyadToken` and `MonadToken` with
645  `act_id` instead of `ctx`
646- `setup_tokens: list` -- NEW: frame control and PE-local write tokens
647  for bootstrap
648
649### 9.6 Token Stream Mode Output
650
651`generate_tokens()` currently produces:
```
SM init tokens -> IRAM write tokens -> seed tokens
```
655
656New ordering:
```
SM init tokens -> IRAM write tokens -> ALLOC tokens -> frame slot write tokens -> seed tokens
```
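
One way to enforce this ordering is a stable sort on a phase rank, sketched below. The phase names and the `(phase, payload)` tuple shape are hypothetical; real codegen would sort actual token objects from the emulator's `tokens.py`.

```python
# Bootstrap phases in the order they must appear in the token stream.
PHASES = ("sm_init", "iram_write", "alloc", "frame_write", "seed")


def order_token_stream(tokens):
    """Order (phase, payload) pairs by bootstrap phase.

    sorted() is stable, so tokens within the same phase keep their
    original relative order.
    """
    rank = {phase: i for i, phase in enumerate(PHASES)}
    return sorted(tokens, key=lambda token: rank[token[0]])
```

Stability matters here: slot writes for one activation stay in emission order, so later passes can rely on per-activation write order without a secondary sort key.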
660
661### 9.7 ctx_mode Removal
662
663The entire `ctx_mode` / `ctx_override` handling in `_build_iram_for_pe()`
664(lines 96-132 of codegen.py) is removed. Cross-activation routing is
665handled by frame destination slots, not by instruction encoding.
666
667### 9.8 Route Restriction Computation
668
669`_compute_route_restrictions()` is unchanged in concept. It scans edges
670to determine which PEs and SMs a given PE needs to route to. The
671implementation stays the same.
672
673## 10. Serialize Pass Changes (`asm/serialize.py`)
674
675### 10.1 Field Renaming
676
677- `ctx` -> `act_id` in node serialization if activation ID is displayed
678- Any `ctx_slot` qualifier in dfasm output becomes activation-related
679
680### 10.2 New Fields
681
682If `mode`, `fref`, and `wide` are displayed in serialized output (for
683debugging allocated IR), `_serialize_node()` needs to format them.
684Format could be: `&node|pe0|act2|mode1|fref8 <| add`
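
A sketch of how such a line might be produced, assuming unset fields are simply omitted so that pre-allocation output carries no new qualifiers. `format_node` is a hypothetical helper, not the existing `_serialize_node()`, and the `wide` flag is left out because the format above does not show it.

```python
def format_node(name, opcode, pe=None, act_id=None, mode=None, fref=None):
    """Render a node line such as '&node|pe0|act2|mode1|fref8 <| add'."""
    parts = [f"&{name}"]
    # Only fields that have been populated (post-allocation) are emitted.
    for prefix, value in (("pe", pe), ("act", act_id), ("mode", mode), ("fref", fref)):
        if value is not None:
            parts.append(f"{prefix}{value}")
    return "|".join(parts) + f" <| {opcode}"
```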
685
686### 10.3 Round-Trip Support
687
688The serialize pass must be able to round-trip new IR fields. Since
689`mode`, `fref`, and frame layout are only populated after allocation,
690serialization before allocation produces the same output as today (no
691new fields to display). Post-allocation serialization adds the new
692fields.
693
694## 11. Built-in Macro Changes (`asm/builtins.py`)
695
696No changes needed. The built-in macros (`#loop_counted`, `#loop_while`,
697`#permit_inject`, `#reduce_2/3/4`) use generic opcodes and edge routing.
698They do not reference `ctx`, `ctx_slot`, `free_ctx`, or any
699context-specific concepts.
700
701The `free_ctx` opcode rename to `free_frame` does not affect builtins
702because none of them use `free_ctx`.
703
704## 12. Error Types (`asm/errors.py`)
705
706### 12.1 New Error Categories
707
708Add to `ErrorCategory`:
709
```python
from enum import Enum


class ErrorCategory(Enum):
    # ... existing ...
    FRAME = "frame"  # Frame layout overflow, slot conflicts
```
715
716### 12.2 New Error Conditions
717
718| Error | Category | Source Pass | Condition |
719|---|---|---|---|
720| Frame slot overflow | FRAME | allocate | Total slots > 64 for an activation |
721| Matchable offset overflow | RESOURCE | allocate/place | > 8 dyadic instructions per activation per PE |
722| Frame count overflow | RESOURCE | allocate | > 4 concurrent activations on one PE |
723| Act ID exhaustion | RESOURCE | allocate | > 8 activation IDs needed (wraparound) |
724
725## 13. Dfgraph Pipeline Impact (`dfgraph/`)
726
727### 13.1 `dfgraph/pipeline.py`
728
729The pipeline runner calls `allocate()` and uses `IRGraph` types. If
730`SystemConfig` fields change, `pipeline.py` may need minor updates for
731default values or error handling. The pipeline runner itself does not
732inspect allocation results in detail.
733
734### 13.2 `dfgraph/graph_json.py`
735
736Currently includes `ctx` field in node JSON output. This becomes
737`act_id`. The field rename is straightforward.
738
739### 13.3 `dfgraph/categories.py`
740
741References `RoutingOp.FREE_CTX` in the CONFIG category mapping. This
742becomes `RoutingOp.FREE_FRAME`. One line change.
743
744`EXTRACT_TAG` (if added) maps to the ROUTING or CONFIG category
745depending on its semantics.

## 14. Monitor Impact (`monitor/`)

### 14.1 `monitor/snapshot.py`

`PESnapshot` currently captures `matching_store` (2D array of
`MatchEntry`) and `gen_counters`. Under the frame model:
- `matching_store` becomes frame state (per-frame slot values, presence
  bits)
- `gen_counters` is removed (no generation counters in frame model)
- New: `frame_allocations` (act_id -> frame_id mapping), `frame_slots`
  (per-frame slot contents)
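
One possible shape for the frame-model snapshot fields is sketched below. The class name `FramePESnapshot`, the helper `occupied_slots`, and the use of `None` to stand in for a cleared presence bit are assumptions for illustration; only the `frame_allocations` and `frame_slots` field names come from the list above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FramePESnapshot:
    """Hypothetical frame-model replacement for the matching-store fields."""
    pe_id: int
    # act_id -> frame_id mapping
    frame_allocations: Dict[int, int]
    # frame_slots[frame_id][offset]; None models "presence bit clear" (assumption)
    frame_slots: List[List[Optional[int]]]


def occupied_slots(snapshot, act_id):
    """Return {offset: value} for the frame bound to act_id, or {} if unbound."""
    frame_id = snapshot.frame_allocations.get(act_id)
    if frame_id is None:
        return {}
    frame = snapshot.frame_slots[frame_id]
    return {offset: value for offset, value in enumerate(frame) if value is not None}
```

A helper like this is roughly what a REPL `pe` display would need: look up the frame for an activation, then print only the slots that hold data.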

### 14.2 `monitor/graph_json.py`

Node state overlay: `ctx` field -> `act_id`. Frame slot contents may be
added to the state overlay for debugging.

### 14.3 `monitor/repl.py`

The `pe` command displays PE state including matching store contents.
This needs updating to display frame state instead.

## 15. Dependency Order

Implementation order based on the dependency graph:

### Phase 1: Foundation Types
1. **`cm_inst.py`**: Add `RoutingOp.FREE_FRAME`, `RoutingOp.EXTRACT_TAG`.
   Add `Instruction` dataclass. Update `is_monadic_alu()`.
2. **`asm/ir.py`**: Add `FrameSlotMap`, `FrameLayout`. Rename
   `ctx_slot` -> `act_slot`, `ctx` -> `act_id` on IRNode. Update
   `SystemConfig` (remove `ctx_slots`, add `frame_count`,
   `frame_slots`, `matchable_offsets`). Rename `CtxSlotRef` ->
   `ActSlotRef`, `CtxSlotRange` -> `ActSlotRange`.
3. **`asm/opcodes.py`**: Add new opcodes to `MNEMONIC_TO_OP`,
   `_MONADIC_OPS_TUPLES`. Rename `free_ctx` -> `free_frame`.
4. **`asm/errors.py`**: Add `ErrorCategory.FRAME`.
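
As a rough illustration of the `SystemConfig` change in step 2, the new fields might look like this. It is a sketch only: the real class carries other fields that are omitted here, and the defaults (4 frames, 64 slots, 8 matchable offsets) are assumed from the limits stated in Section 1 rather than taken from the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemConfig:
    # Only the frame-related fields are shown; `ctx_slots` is gone.
    frame_count: int = 4        # concurrent frames per PE (assumed default)
    frame_slots: int = 64       # addressable slots per frame (assumed default)
    matchable_offsets: int = 8  # offsets usable by dyadic instructions (assumed default)

    def __post_init__(self):
        # The matchable region is part of the frame, so it cannot exceed it.
        if self.matchable_offsets > self.frame_slots:
            raise ValueError("matchable_offsets cannot exceed frame_slots")
        if self.frame_count < 1:
            raise ValueError("frame_count must be at least 1")
```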

### Phase 2: Allocation and Codegen (the big changes)
5. **`asm/allocate.py`**: Rewrite `_assign_context_slots()` ->
   `_assign_act_ids()`. Add `_compute_frame_layouts()`. Add
   `_compute_modes()`. Add `_pack_flit1()`. Update
   `_assign_iram_offsets()` for new offset scheme. Update
   `_resolve_destinations()` to produce flit 1 values. Remove
   `ctx_mode`/`ctx_override` handling.
6. **`asm/codegen.py`**: Replace `ALUInst`/`SMInst` generation with
   `Instruction` generation. Add frame setup token generation. Add
   `FrameControlToken` / `PELocalWriteToken` emission. Update seed
   token generation. Remove `ctx_mode` handling. Update `PEConfig`
   construction. Update `AssemblyResult`.
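
For the `Instruction` generation in step 6, Section 1 gives the 16-bit IRAM word layout as `[type:1][opcode:5][mode:3][wide:1][fref:6]`. A minimal packing sketch follows, assuming the fields are laid out most-significant-first in the order written (so `type` is bit 15 and `fref` occupies bits 5-0). That bit ordering is an assumption and should be checked against `architecture-overview.md`. The function names are hypothetical and unrelated to `_pack_flit1()`.

```python
# (name, width) pairs in the order given by the layout string.
IRAM_FIELDS = (("type", 1), ("opcode", 5), ("mode", 3), ("wide", 1), ("fref", 6))


def pack_iram_word(type_bit, opcode, mode, wide, fref):
    """Pack the five fields into one 16-bit word, MSB-first (assumed order)."""
    values = (type_bit, opcode, mode, wide, fref)
    word = 0
    for (name, width), value in zip(IRAM_FIELDS, values):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_iram_word(word):
    """Inverse of pack_iram_word; returns a dict keyed by field name."""
    if not 0 <= word < (1 << 16):
        raise ValueError("IRAM word must fit in 16 bits")
    fields = {}
    for name, width in reversed(IRAM_FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```

Range-checking each field at pack time turns an out-of-range `fref` or `mode` into an assembler error instead of a silently corrupted neighbouring field.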

### Phase 3: Upstream Pass Adjustments
7. **`asm/lower.py`**: Rename `ctx_slot` references. Update qualifier
   parsing if syntax changes.
8. **`asm/expand.py`**: Replace `FREE_CTX` with `FREE_FRAME` in call
   wiring. Rename `CtxSlotRef` -> `ActSlotRef`. Update `CallSite`
   field names. Simplify trampoline logic if `ctx_override` is removed.
9. **`asm/place.py`**: Update `_count_iram_cost()` (all nodes cost 1).
   Replace `ctx_slots` tracking with `frame_count`. Add
   matchable-offset-per-activation constraint.

### Phase 4: Output and Tooling
10. **`asm/serialize.py`**: Rename `ctx` -> `act_id` in output. Add
    mode/fref display for post-allocation IR.
11. **`asm/builtins.py`**: No changes (verified).
12. **`dfgraph/categories.py`**: `FREE_CTX` -> `FREE_FRAME`.
13. **`dfgraph/graph_json.py`**: `ctx` -> `act_id` in JSON output.
14. **`monitor/`**: Update snapshot, graph_json, and REPL for frame
    model.

### Phase 5: Emulator Updates (out of scope for this doc)
15. **`emu/types.py`**: `PEConfig` changes (`iram` type, remove
    `ctx_slots`, add frame parameters).
16. **`emu/pe.py`**: Frame-based matching instead of context-slot
    matching store.
17. **`tokens.py`**: `ctx` -> `act_id`, remove `gen`. Add
    `FrameControlToken`, `PELocalWriteToken`.

## 16. What Stays the Same

- **dfasm grammar** (`dfasm.lark`): mostly unchanged. May add frame
  directives later, but not required for v0.
- **Parse pass** (Lark parser): no changes.
- **Error infrastructure** (`errors.py`): structure unchanged, one new
  category added.
- **Graph structure** (`IRGraph`, `IREdge`, `IRRegion`): unchanged.
- **Pipeline architecture**: still 7 passes in the same order.
- **Resolve pass**: completely unchanged.
- **Built-in macros**: completely unchanged.
- **SM-related codegen**: SMConfig construction, SM init tokens, data
  defs -- all unchanged.
- **Name resolution and scoping**: unchanged.
- **Macro expansion core**: parameter substitution, variadic repetition,
  opcode parameters -- all unchanged. Only call wiring details change.

## 17. Risk Assessment

| Area | Risk | Mitigation |
|---|---|---|
| Frame layout allocation | New, complex algorithm | Start with simple sequential allocation, optimise later |
| Flit 1 packing | Bit-level encoding must match hardware spec | Unit tests against architecture-overview.md bit layouts |
| Instruction deduplication | Interaction with frame layouts is subtle | Defer dedup to optimisation pass; v0 allocates per-activation |
| Mode computation | 8 modes, context-dependent selection | Exhaustive test matrix against mode table |
| ABA distance | Act_id assignment must maintain distance | Sequential assignment (0,1,2,3) is safe for static programs |
| `ctx_override` removal | Affects expand pass call wiring | Can keep `ctx_override` as semantic annotation initially |
| Emulator dependency | Codegen must emit tokens emulator can consume | Phase emulator changes alongside codegen; can use adapter layer |
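
The "simple sequential allocation" mitigation for frame layouts could start from something like the sketch below. It rests on stated assumptions: dyadic (matchable) nodes take offsets 0-7 in encounter order, all other slots start at offset 8 even when some matchable offsets are unused, and slots are identified by plain names. The function name and input shape are hypothetical; the real `_compute_frame_layouts()` may differ on every one of these points.

```python
FRAME_SLOTS = 64        # addressable slots per frame
MATCHABLE_OFFSETS = 8   # offsets 0-7 are reserved for dyadic matching


def sequential_frame_layout(dyadic_names, other_names):
    """Assign frame offsets for one activation, in encounter order.

    Returns {name: offset}. Raises ValueError on either overflow condition.
    """
    if len(dyadic_names) > MATCHABLE_OFFSETS:
        raise ValueError(
            f"{len(dyadic_names)} dyadic instructions exceed "
            f"{MATCHABLE_OFFSETS} matchable offsets")
    if MATCHABLE_OFFSETS + len(other_names) > FRAME_SLOTS:
        raise ValueError(
            f"{len(other_names)} non-matchable slots do not fit in "
            f"{FRAME_SLOTS - MATCHABLE_OFFSETS} remaining frame slots")

    layout = {}
    # Matchable region: offsets 0..7.
    for offset, name in enumerate(dyadic_names):
        layout[name] = offset
    # Everything else follows the reserved matchable region (assumption).
    for offset, name in enumerate(other_names, start=MATCHABLE_OFFSETS):
        layout[name] = offset
    return layout
```

Because offsets depend only on encounter order, the result is deterministic. That keeps unit tests simple until an optimising layout pass replaces this.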