# Assembler Redesign Plan: Frame-Based PE Model

This document describes the changes needed in the OR1 assembler (`asm/`)
to match the frame-based PE redesign described in `pe-design.md` and
`architecture-overview.md`. It covers every assembler pass and supporting
module, identifying what changes, what stays, and the dependency order
for implementation.

The assembler pipeline is: parse -> lower -> expand -> resolve -> place ->
allocate -> codegen. Most changes concentrate in allocate and codegen;
earlier passes are minimally affected.

## 1. Overview

The PE redesign replaces the context-slot matching model with a
frame-based model. Key architectural changes that affect the assembler:

- **Context slots replaced by activation IDs and frames.** The old
  `ctx_slots` parameter (up to 16 slots) becomes `frame_count` (4
  concurrent frames) with 3-bit `act_id` (8 unique IDs). Each frame has
  64 addressable slots.
- **Instructions become templates.** IRAM entries are 16-bit words
  `[type:1][opcode:5][mode:3][wide:1][fref:6]`. Constants and
  destinations are NOT in the instruction -- they live in frame slots
  referenced by `fref`.
- **Instruction deduplication.** Multiple activations can share the same
  IRAM entry because per-activation data lives in frames, not
  instructions. The number of IRAM entries needed is the number of unique
  operation shapes, not the total operations executed.
- **8 matchable offsets per frame.** Dyadic instructions must be assigned
  offsets 0-7 within each activation. The assembler must enforce this
  constraint and split function bodies that exceed it.
- **Pre-formed flit 1 values.** Output destinations are 16-bit packed
  flit 1 values stored in frame slots. The PE reads the slot and puts it
  on the bus verbatim as flit 1. Token type is determined by the prefix
  bits in the stored flit value.
- **Frame setup via PE-local write tokens.** Constants, destinations, and
  other per-activation data are loaded into frame slots by a stream of
  PE-local write tokens before execution begins.
- **Frame lifecycle tokens.** ALLOC (frame control, prefix `011+00`,
  op=0) allocates a frame; FREE (frame control, op=1) or FREE_FRAME
  instruction releases it.
- **Mode field replaces output routing logic.** The 3-bit `mode` field
  in the instruction word encodes output routing (INHERIT/CHANGE_TAG/SINK)
  and constant presence. The assembler must compute the correct mode for
  each instruction based on its edge topology and constant usage.

## 2. IR Type Changes (`asm/ir.py`)

### 2.1 IRNode Changes

Current fields affected:

| Current Field | Change | Notes |
|---|---|---|
| `ctx_slot: Optional[Union[int, CtxSlotRef, CtxSlotRange]]` | Rename to `act_slot` or remove entirely | Macro parameter support (`CtxSlotRef`) may still be needed for parameterized activation grouping |
| `iram_offset: Optional[int]` | Keep | Semantics unchanged: index into PE's IRAM. Range 0-255. |
| `ctx: Optional[int]` | Rename to `act_id: Optional[int]` | 3-bit activation ID (0-7) instead of context slot |

New fields to add to IRNode:

| New Field | Type | Purpose |
|---|---|---|
| `mode` | `Optional[int]` | 3-bit instruction mode (0-7), computed during allocate |
| `fref` | `Optional[int]` | 6-bit frame slot base index (0-63), computed during allocate |
| `wide` | `bool` | 32-bit frame values flag, default False |
| `frame_layout` | `Optional[FrameSlotMap]` | Per-node frame slot assignments (see below) |

### 2.2 New IR Types

```python
@dataclass(frozen=True)
class FrameSlotMap:
    """Frame slot assignments for one instruction within an activation.

    Maps logical roles to physical frame slot indices within the 64-slot
    frame. Computed by the allocate pass.

    Attributes:
        match_slot: Frame slot for dyadic match operand storage (offsets 0-7)
        const_slot: Frame slot for the instruction's constant (if any)
        dest_slots: Frame slot(s) for pre-formed flit 1 destination values
        sink_slot: Frame slot for SINK mode write-back target
    """
    match_slot: Optional[int] = None
    const_slot: Optional[int] = None
    dest_slots: tuple[int, ...] = ()
    sink_slot: Optional[int] = None
```

```python
@dataclass(frozen=True)
class FrameLayout:
    """Complete frame layout for one activation on one PE.

    Computed by the allocate pass. Used by codegen to generate PE-local
    write tokens for frame setup.

    Attributes:
        act_id: Activation ID (0-7)
        pe_id: PE where this frame lives
        slots: Dict mapping slot index to (role, value) pairs
        total_slots: Total slots used (must be <= 64)
    """
    act_id: int
    pe_id: int
    slots: dict[int, tuple[str, int]]
    total_slots: int
```

### 2.3 SystemConfig Changes

Current `SystemConfig`:

```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = DEFAULT_IRAM_CAPACITY  # 128
    ctx_slots: int = DEFAULT_CTX_SLOTS           # 16
```

New `SystemConfig`:

```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 256              # 8-bit offset, 256 entries
    frame_count: int = 4                  # max concurrent frames per PE
    frame_slots: int = 64                 # slots per frame
    matchable_offsets: int = 8            # max dyadic instructions per activation per PE
```

The defaults change: `DEFAULT_IRAM_CAPACITY` goes from 128 to 256
(matching the 8-bit offset field). `DEFAULT_CTX_SLOTS` is removed
entirely and replaced by `frame_count`, `frame_slots`, and
`matchable_offsets`.
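
The new defaults are tied to field widths quoted elsewhere in this plan, so the config could reject values the hardware cannot address. A minimal sketch, assuming a `__post_init__` check is acceptable on `SystemConfig`; the specific limits and error messages below are illustrative assumptions, not part of the existing code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 256
    frame_count: int = 4
    frame_slots: int = 64
    matchable_offsets: int = 8

    def __post_init__(self) -> None:
        # Hypothetical sanity checks; the limits mirror the field widths
        # quoted in this document (8-bit offset, 6-bit fref, 3-bit act_id).
        if not 0 < self.iram_capacity <= 256:
            raise ValueError("iram_capacity must fit the 8-bit offset field")
        if not 0 < self.frame_slots <= 64:
            raise ValueError("frame_slots must fit the 6-bit fref field")
        if not 0 < self.frame_count <= 8:
            raise ValueError("frame_count cannot exceed the 8 act_id values")
        if not 0 <= self.matchable_offsets <= self.frame_slots:
            raise ValueError("matchable_offsets cannot exceed frame_slots")
```

Failing at construction time keeps later passes free of range checks on the config itself.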

### 2.4 What Stays

- `IREdge` -- unchanged except `ctx_override` may need renaming or
  semantic adjustment (see section 5)
- `IRGraph` structure -- unchanged
- `IRDataDef` -- unchanged (SM data definitions are orthogonal to PE frames)
- `IRRegion`, `RegionKind` -- unchanged
- `NameRef`, `ResolvedDest` -- unchanged
- `MacroDef`, `IRMacroCall`, `MacroParam`, `ParamRef` -- unchanged
- `CallSite`, `CallSiteResult` -- one field rename
  (`free_ctx_nodes` -> `free_frame_nodes`); `trampoline_nodes` stays
- `SourceLoc`, `ConstExpr`, `IRRepetitionBlock` -- unchanged

### 2.5 CtxSlotRef / CtxSlotRange

`CtxSlotRef` and `CtxSlotRange` are used in macro templates for
parameterized context slot assignment. Under the frame model, these
become `ActSlotRef` / `ActSlotRange` (or are removed if activation ID
assignment is always automatic). The expand pass resolves these to
concrete values during macro expansion. The rename is straightforward
but touches `ir.py`, `lower.py`, and `expand.py`.

## 3. Opcode Changes (`asm/opcodes.py`)

### 3.1 Renamed Opcodes

| Current | New | Notes |
|---|---|---|
| `RoutingOp.FREE_CTX` | `RoutingOp.FREE_FRAME` | Deallocates a frame instead of a context slot |

The mnemonic mapping changes: `"free_ctx"` -> `"free_frame"` in
`MNEMONIC_TO_OP`. The `OP_TO_MNEMONIC` reverse mapping updates
correspondingly.

### 3.2 New Opcodes

| Opcode | Type | Arity | Purpose |
|---|---|---|---|
| `RoutingOp.EXTRACT_TAG` | Monadic CM | Monadic | Captures executing token's identity as 16-bit packed flit 1 value (return continuation) |
| `RoutingOp.ALLOC_REMOTE` | Monadic CM | Monadic | Triggers ALLOC frame control token to a target PE (may be handled at codegen level instead) |

`EXTRACT_TAG` must be added to `MNEMONIC_TO_OP` (mnemonic: `"extract_tag"`),
to `_MONADIC_OPS_TUPLES`, and to `cm_inst.py` as a new `RoutingOp` enum
member.

Whether `ALLOC_REMOTE` is an ALU opcode or purely a codegen-emitted
frame control token depends on whether the assembler exposes it as a
user-facing instruction or handles it internally during call wiring.
For static calls, ALLOC is emitted by codegen as a frame control token,
not as an ALU instruction. For dynamic calls, it may need to be an
instruction.

### 3.3 Mode-Dependent Arity

The current `is_monadic(op, const)` and `is_dyadic(op, const)` functions
remain correct in concept. The `mode` field is orthogonal to arity --
mode determines output routing behaviour, not input operand count. No
changes needed to arity classification.

### 3.4 Monadic/Dyadic Classification

Unchanged. The distinction between monadic and dyadic is fundamental to
matching (dyadic tokens go through presence-based matching in the 670s;
monadic tokens bypass matching). The assembler's arity classification
drives IRAM offset assignment (dyadic at offsets 0-7, monadic at 8+).

## 4. Lower Pass Changes (`asm/lower.py`)

Minimal changes. The lower pass translates Lark CST to IR and is mostly
opcode-agnostic.

### 4.1 Opcode Mnemonic Updates

The `MNEMONIC_TO_OP` lookup in `_resolve_opcode()` will pick up new
opcodes automatically when they are added to `opcodes.py`. No lower.py
code changes needed for new opcodes.

The `free_ctx` mnemonic rename to `free_frame` will need a corresponding
change in lower.py only if the mnemonic string is hardcoded anywhere
(it is not -- lower.py uses `MNEMONIC_TO_OP` for all lookups).

### 4.2 Context Slot Qualifiers

The `[ctx_slot]` qualifier syntax in dfasm (`&node[3] <| add`) is
parsed in lower.py and stored as `ctx_slot` on IRNode. This syntax and
field will be renamed to reflect activation IDs if kept. If activation
ID assignment is always automatic, the qualifier syntax may be removed
entirely.

References in lower.py:
- `inst_def` transformer rule: parses `[N]` qualifier into `ctx_slot`
- `ctx_slot_ref` transformer rule: creates `CtxSlotRef` for macro
  parameter `[${param}]`
- `ctx_slot_range` transformer rule: creates `CtxSlotRange` for `[N:M]`

### 4.3 New Syntax (Optional, Deferred)

Frame directives in dfasm syntax (e.g., `@frame_layout` pragmas) could
be added later. Not needed for v0 -- the allocator computes frame
layouts automatically.

## 5. Expand Pass Changes (`asm/expand.py`)

### 5.1 Function Call Wiring

The expand pass currently generates cross-context call wiring:
- Trampoline `PASS` nodes with `ctx_override` edges
- `FREE_CTX` cleanup nodes
- Per-call-site context slot allocation via `CallSite` metadata

Under the frame model, the call wiring changes:

- **`ctx_override` edges** become frame-boundary edges. The semantic
  meaning is the same (data crosses activation boundaries), but the
  mechanism changes: instead of packing a target context into the
  instruction's `const` field with `ctx_mode=1`, the destination's
  pre-formed flit 1 value in the frame slot already encodes the target
  PE, offset, and act_id. Cross-activation routing is handled by frame
  setup, not by instruction encoding.

- **`FREE_CTX` nodes** become `FREE_FRAME` nodes. The opcode changes
  from `RoutingOp.FREE_CTX` to `RoutingOp.FREE_FRAME`. The expand pass
  references `RoutingOp.FREE_CTX` directly in `_wire_call_site()` at
  line 1160 of expand.py.

- **Trampoline nodes** may simplify. In the current design, trampolines
  exist to bridge context boundaries (an edge cannot cross contexts
  without a PASS node carrying `ctx_mode`). Under the frame model,
  destinations in frame slots already encode the target activation, so
  cross-activation edges are just edges whose destination flit 1 value
  points to a different activation. Trampolines may still be useful for
  fan-out or return routing but are no longer needed purely for context
  bridging.

- **`@ret` wiring** stays conceptually the same. Return routing in the
  frame model is a pre-formed flit 1 value loaded into a frame slot.
  The expand pass creates edges for return routing; the allocate pass
  resolves those edges to flit 1 values in frame slots.

### 5.2 CallSite Metadata

`CallSite` fields to rename:
- `free_ctx_nodes` -> `free_frame_nodes`

The `trampoline_nodes` field stays (trampolines may still be generated
for fan-out or return routing patterns).

### 5.3 CtxSlotRef Resolution

The expand pass resolves `CtxSlotRef` and `CtxSlotRange` during macro
expansion. These will be renamed to `ActSlotRef` / `ActSlotRange` (or
removed). The resolution logic in `_substitute_node()` and
`_substitute_edge()` is straightforward renaming.

### 5.4 Built-in Macros

The built-in macros in `builtins.py` do NOT reference `ctx_slot`, `ctx`,
`free_ctx`, or any context-specific concepts directly. They use generic
opcodes (`add`, `brgt`, `gate`, `pass`, `inc`, `const`) and edge routing
(`@ret`, `${param}`). No changes needed to built-in macros.

## 6. Resolve Pass Changes (`asm/resolve.py`)

No changes needed. The resolve pass validates that edge endpoints exist
and detects scope violations. It operates on node names and graph
structure, not on PE-level concepts like contexts or frames. The resolve
pass does not reference `ctx`, `ctx_slot`, `ctx_override`, or any
context-related fields.

## 7. Place Pass Changes (`asm/place.py`)

### 7.1 New Constraint: Matchable Offset Limit

The placement pass must enforce the 8-matchable-offset constraint: at
most 8 dyadic instructions per activation per PE. This is a new
constraint that does not exist in the current codebase.

Currently, `_count_iram_cost()` counts dyadic nodes as costing 2 IRAM
slots and monadic as 1. Under the frame model:
- Dyadic nodes cost 1 IRAM slot (match operands live in frame slots and
  presence bits in the 670s, not IRAM)
- Monadic nodes cost 1 IRAM slot
- The 8-dyadic-per-activation limit is a separate constraint from IRAM
  capacity

The placement pass needs to track dyadic instruction count per
activation group per PE, in addition to total IRAM usage.
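
The per-activation dyadic count can be tracked with a simple counter keyed on `(pe, activation)`. The sketch below is illustrative only: `PlacedNode` and `find_matchable_overflows` are hypothetical names standing in for whatever IRNode fields and helper the placement pass actually uses.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PlacedNode(NamedTuple):
    """Hypothetical stand-in for the IRNode fields this check needs."""
    name: str
    pe: int
    activation: str   # activation group key, e.g. a function scope name
    is_dyadic: bool


def find_matchable_overflows(
    nodes: Iterable[PlacedNode], matchable_offsets: int = 8
) -> dict[tuple[int, str], int]:
    """Return {(pe, activation): dyadic_count} for every group over the limit."""
    dyadic_counts: Counter[tuple[int, str]] = Counter()
    for node in nodes:
        if node.is_dyadic:
            dyadic_counts[(node.pe, node.activation)] += 1
    return {
        key: count
        for key, count in dyadic_counts.items()
        if count > matchable_offsets
    }
```

An empty result means every activation group fits in the matchable offset range; a non-empty result identifies the groups that need splitting or re-placement.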

### 7.2 IRAM Cost Recalculation

`_count_iram_cost()` should return 1 for all node types (dyadic and
monadic both use 1 IRAM slot in the frame model). The current cost of 2
for dyadic nodes was because the old matching store occupied IRAM slots;
in the frame model, match operands live in frame SRAM, not IRAM.

### 7.3 Context Slots -> Frames

`_auto_place_nodes()` currently tracks `ctx_used` per PE (context slots
consumed). This becomes `frames_used` per PE (concurrent frames, max 4).
The `ctx_scopes_per_pe` tracking of function scopes per PE maps directly
to frame tracking: each function scope on a PE consumes one frame.

`SystemConfig.ctx_slots` references in place.py become
`SystemConfig.frame_count`.
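
As a rough sketch of the `frames_used` bookkeeping, assuming one frame per distinct function scope on a PE as described above. The helper names and the `(pe_id, function_scope)` input shape are assumptions made for illustration, not the existing `_auto_place_nodes()` interface:

```python
from collections import defaultdict
from typing import Iterable


def frames_used_per_pe(
    scope_placements: Iterable[tuple[int, str]],
) -> dict[int, int]:
    """Count distinct function scopes per PE; each scope consumes one frame.

    `scope_placements` yields (pe_id, function_scope) pairs, one per placed
    node. Both the input shape and this helper's name are illustrative.
    """
    scopes: dict[int, set[str]] = defaultdict(set)
    for pe_id, scope in scope_placements:
        scopes[pe_id].add(scope)
    return {pe_id: len(names) for pe_id, names in scopes.items()}


def pes_over_frame_budget(
    scope_placements: Iterable[tuple[int, str]], frame_count: int = 4
) -> list[int]:
    """Return the PEs whose scope count exceeds the concurrent-frame budget."""
    used = frames_used_per_pe(scope_placements)
    return sorted(pe for pe, n in used.items() if n > frame_count)
```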

### 7.4 Instruction Deduplication Awareness

Because IRAM entries are activation-independent templates, multiple
activations of the same function on the same PE share IRAM entries. The
placement pass could account for this when computing IRAM utilisation:
if two activations of function `$foo` run on PE0, they share IRAM
entries, so the IRAM cost is the unique instruction count, not the total.

This is an optimisation, not a correctness requirement. The placement
pass can conservatively count IRAM entries per unique function body
without deduplication for v0.
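
The difference between the two accounting schemes can be illustrated as follows. This is a sketch under the assumption that every activation of a function carries the same instruction count; `iram_entries_needed` is a hypothetical helper, not an existing function in `place.py`:

```python
from typing import Iterable


def iram_entries_needed(
    activations: Iterable[tuple[str, int]], deduplicate: bool = True
) -> int:
    """Estimate IRAM entries on one PE.

    `activations` yields (function_name, instruction_count) pairs, one per
    activation placed on the PE. Every instruction costs one entry.
    """
    if not deduplicate:
        # Conservative upper bound: every activation pays for its own copy.
        return sum(count for _, count in activations)
    unique_bodies: dict[str, int] = {}
    for function_name, count in activations:
        # Activations of the same function share one set of templates.
        unique_bodies[function_name] = count
    return sum(unique_bodies.values())
```

With two activations of a 10-instruction `$foo` and one of a 4-instruction `$bar`, the deduplicated cost is 14 entries and the conservative cost is 24.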

## 8. Allocate Pass Changes (`asm/allocate.py`) -- MAJOR

This is the largest change. The current allocate pass assigns IRAM
offsets and context slots, then resolves destinations to `Addr` values.
The frame model requires fundamentally different allocation logic.

### 8.1 IRAM Offset Assignment

Current behaviour (`_assign_iram_offsets()`):
- Dyadic nodes get offsets 0..D-1
- Monadic nodes get offsets D..D+M-1
- Total must fit in `iram_capacity`

New behaviour:
- Dyadic nodes get offsets 0-7 (within the matchable offset range).
  At most 8 dyadic instructions per activation group per PE. Because
  IRAM is activation-independent (shared templates), dyadic offset
  assignment is per-PE, not per-activation.
- Monadic nodes get offsets 8-255 (or wherever dyadic offsets end).
- The `matchable_offsets` limit (default 8) constrains dyadic count.
- With instruction deduplication, multiple activations of the same
  function body share IRAM offsets. The allocator assigns offsets once
  per unique instruction template, not per activation.
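
A minimal sketch of the new offset scheme for one PE, assuming templates have already been deduplicated. `Template` and `assign_iram_offsets` are illustrative names, and the choice to start monadic offsets at the fixed boundary rather than directly after the last dyadic offset is an assumption; the text above allows either.

```python
from typing import NamedTuple, Sequence


class Template(NamedTuple):
    """Hypothetical stand-in for one unique instruction template on a PE."""
    name: str
    is_dyadic: bool


def assign_iram_offsets(
    templates: Sequence[Template],
    matchable_offsets: int = 8,
    iram_capacity: int = 256,
) -> dict[str, int]:
    """Give dyadic templates offsets 0..D-1 and monadic ones the rest."""
    dyadic = [t for t in templates if t.is_dyadic]
    monadic = [t for t in templates if not t.is_dyadic]
    if len(dyadic) > matchable_offsets:
        raise ValueError(
            f"{len(dyadic)} dyadic instructions exceed the "
            f"{matchable_offsets} matchable offsets"
        )
    # Assumption: monadic offsets start at the fixed boundary (8), leaving
    # any unused matchable offsets empty. Starting at len(dyadic) instead
    # is the other option the text allows.
    monadic_base = matchable_offsets
    if monadic_base + len(monadic) > iram_capacity:
        raise ValueError("IRAM capacity exceeded")
    offsets = {t.name: i for i, t in enumerate(dyadic)}
    offsets.update({t.name: monadic_base + i for i, t in enumerate(monadic)})
    return offsets
```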

### 8.2 Activation ID Assignment

Replaces `_assign_context_slots()`. The current function assigns context
slot indices (0-15) per function scope per PE. The new function assigns
3-bit activation IDs (0-7) with at most 4 concurrent activations per PE.

Key differences:
- **Smaller space:** 8 act_ids (3-bit) vs 16 context slots. But only 4
  can be concurrently active (4 physical frames).
- **ABA distance:** the allocator must maintain ABA distance between
  act_ids to prevent stale token collisions. With 4 concurrent frames
  out of 8 possible IDs, 4 IDs of ABA distance exist before wraparound.
  For static programs, the allocator can assign act_ids sequentially
  (0, 1, 2, 3 for 4 concurrent activations).
- **Per-call-site allocation:** same concept as current
  `call_site_to_ctx_on_pe` -- each call site that creates a new
  activation gets a fresh act_id. But the budget is 4 concurrent frames
  instead of 16 context slots.

The function scope grouping logic (`_extract_function_scope()`) stays.
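
For static programs, sequential assignment could look roughly like the sketch below. It assumes every listed call site may be live at the same time on one PE, and `assign_act_ids` is shown with a simplified signature rather than the one the real `_assign_act_ids()` will need:

```python
from typing import Sequence


def assign_act_ids(
    call_sites: Sequence[str], frame_count: int = 4, act_id_space: int = 8
) -> dict[str, int]:
    """Assign sequential act_ids to the call sites active on one PE.

    Assumes a static program in which every listed call site may be live
    at the same time, so the list length is bounded by `frame_count`.
    """
    if frame_count > act_id_space:
        raise ValueError("more frames than activation IDs")
    if len(call_sites) > frame_count:
        raise ValueError(
            f"{len(call_sites)} concurrent activations exceed "
            f"{frame_count} frames"
        )
    return {site: act_id for act_id, site in enumerate(call_sites)}
```

The call-site keys (`"main"`, `"$foo#0"`, and so on) are placeholders; any hashable per-call-site identifier would do.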

### 8.3 Frame Layout Allocation (NEW)

This is entirely new functionality. After IRAM offset and act_id
assignment, the allocator must compute the frame layout for each
activation: which frame slots hold what data.

**Frame slot roles:**

| Role | Slot Range | Count per Instruction | Notes |
|---|---|---|---|
| Match operand | 0-7 | 1 per dyadic instruction | Indexed by matchable offset. Presence bit in 670. |
| Constant | 8+ | 0-1 per instruction | `mode[0]` (has_const) selects whether const is read from frame[fref] |
| Destination | variable | 0-2 per instruction | Pre-formed flit 1 values. 1 for modes 0/1, 2 for modes 2/3. |
| Accumulator/sink | variable | 0-1 per instruction | For SINK modes (6/7). Write-back target. |
| SM parameters | variable | 0-2 per SM instruction | SM_id + addr, data or return routing. |

**Slot assignment algorithm:**

1. Reserve slots 0-7 for match operands (one per dyadic instruction,
   indexed by the instruction's matchable offset).
2. Assign constant slots starting at slot 8. Constants that are shared
   across instructions within the same activation can be deduplicated
   (same value -> same slot).
3. Assign destination slots. Each destination is a pre-formed flit 1
   value. Destinations shared across instructions can be deduplicated.
4. Assign sink/accumulator slots for SINK mode instructions.
5. Assign SM parameter slots for SM operations.
6. Verify total slots <= 64. If exceeded, report a frame overflow error.

**fref computation:**

The `fref` field in the instruction word points to the base of a
contiguous group of frame slots used by that instruction. The slot count
depends on the mode:

| Mode | Slots at fref | Layout |
|---|---|---|
| 0 | 1 | [dest] |
| 1 | 2 | [const, dest] |
| 2 | 2 | [dest1, dest2] |
| 3 | 3 | [const, dest1, dest2] |
| 4 | 0 | (no frame access) |
| 5 | 1 | [const] |
| 6 | 1 | [sink_target] (write) |
| 7 | 1 | [sink_target] (read-modify-write) |

The allocator must arrange frame slots such that each instruction's
constant and destination(s) are contiguous starting at `fref`. This
may require careful slot packing or a simple sequential allocation
strategy.
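
The simple sequential strategy can be sketched as below. It covers the match-operand reservation, the overflow check, and the contiguity requirement, and deliberately skips deduplication so that each instruction owns its group. `SLOTS_PER_MODE` is copied from the table above; `assign_frefs` and the placeholder `fref` of 0 for mode 4 are assumptions made for illustration.

```python
from typing import Sequence

# Slots consumed at fref for each mode, copied from the table above.
SLOTS_PER_MODE = {0: 1, 1: 2, 2: 2, 3: 3, 4: 0, 5: 1, 6: 1, 7: 1}


def assign_frefs(
    modes: Sequence[int], matchable_offsets: int = 8, frame_slots: int = 64
) -> list[int]:
    """Return one fref per instruction, packing groups back to back.

    Slots 0..matchable_offsets-1 stay reserved for match operands. No
    deduplication is attempted, so every instruction owns its group.
    """
    next_free = matchable_offsets
    frefs: list[int] = []
    for mode in modes:
        width = SLOTS_PER_MODE[mode]
        if next_free + width > frame_slots:
            raise ValueError(
                f"frame overflow: need slot {next_free + width - 1}, "
                f"frame has {frame_slots}"
            )
        # Assumption: mode 4 touches no frame slot, so its fref is treated
        # as a don't-care and 0 is used as an arbitrary placeholder.
        frefs.append(next_free if width else 0)
        next_free += width
    return frefs
```

Without deduplication, a frame holds at most 18 mode-3 instructions (8 reserved slots plus 18 groups of 3 is 62); the 19th triggers the overflow error.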

**Mode computation:**

The allocator determines the mode for each instruction based on:

| Condition | Mode |
|---|---|
| 0 dests, no const, has sink | 6 (SINK) |
| 0 dests, const via frame accumulator (RMW) | 7 (SINK+CONST) |
| 1 dest, no const | 0 (INHERIT, single output) |
| 1 dest, has const | 1 (INHERIT, single output + const) |
| 2 dests, no const | 2 (INHERIT, fan-out) |
| 2 dests, has const | 3 (INHERIT, fan-out + const) |
| CHANGE_TAG, no const | 4 |
| CHANGE_TAG, has const | 5 |

CHANGE_TAG is used when the instruction's left operand provides the
output destination dynamically (e.g., dynamic return routing via
`EXTRACT_TAG` + `CHANGE_TAG`).
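
The condition table maps onto a small pure function, which also makes the exhaustive test matrix easy to write. This is a sketch: the boolean parameters are an assumed interface, and treating "SINK with a constant" as mode 7 is one reading of the RMW row in the table.

```python
def compute_mode(
    dest_count: int,
    has_const: bool,
    change_tag: bool = False,
    sink: bool = False,
) -> int:
    """Pick the 3-bit mode for one instruction from the condition table."""
    if change_tag and sink:
        raise ValueError("CHANGE_TAG and SINK are mutually exclusive")
    if sink:
        if dest_count != 0:
            raise ValueError("SINK instructions have no destinations")
        # Assumption: the RMW row (mode 7) is the has_const variant of SINK.
        return 7 if has_const else 6
    if change_tag:
        return 5 if has_const else 4
    if dest_count not in (1, 2):
        raise ValueError(f"INHERIT needs 1 or 2 destinations, got {dest_count}")
    # INHERIT: bit 1 selects fan-out, bit 0 selects has_const.
    return ((dest_count - 1) << 1) | int(has_const)
```

In every branch the low bit equals `has_const`, matching the `mode[0]` convention in the frame slot roles table.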

### 8.4 Pre-Formed Flit 1 Computation

The allocator must convert resolved destination `Addr` values into
16-bit packed flit 1 values for storage in frame destination slots.

**Flit 1 formats** (from `architecture-overview.md`):

```
DYADIC WIDE:    [0][0][port:1][PE:2][offset:8][act_id:3]   = 16 bits
MONADIC NORM:   [0][1][0][PE:2][offset:8][act_id:3]         = 16 bits
MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2]      = 16 bits
```

The packing function takes:
- `dest_pe: int` (2 bits)
- `dest_offset: int` (8 bits for dyadic/monadic normal, 7 for inline)
- `dest_act_id: int` (3 bits, for dyadic and monadic normal)
- `dest_port: Port` (1 bit, for dyadic wide only)
- `dest_type: TokenType` (dyadic wide, monadic normal, or monadic inline)

And produces a 16-bit packed flit 1 value.

The `dest_type` is determined by whether the destination instruction is
dyadic (-> dyadic wide flit) or monadic (-> monadic normal flit). For
trigger-only destinations (e.g., switch not-taken path), monadic inline
is used.

This replaces the current `Addr(a, port, pe)` resolution. The `Addr`
type may still be used as an intermediate representation, but the final
output is a packed flit 1 value in a frame slot.
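
A sketch of the packing function for the three formats above, assuming fields are laid out most-significant-bit first in the order listed (so the leftmost bracket is bit 15). `DestType` stands in for the `TokenType` named in the text, and the port is taken as a plain 0/1 integer rather than the real `Port` enum; both are illustrative assumptions.

```python
from enum import Enum


class DestType(Enum):
    """Illustrative stand-in for the `TokenType` named in the text."""
    DYADIC_WIDE = "dyadic_wide"
    MONADIC_NORMAL = "monadic_normal"
    MONADIC_INLINE = "monadic_inline"


def _check(value: int, bits: int, name: str) -> int:
    """Return value unchanged if it fits in `bits` bits, else raise."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return value


def pack_flit1(
    dest_type: DestType, pe: int, offset: int, act_id: int = 0, port: int = 0
) -> int:
    """Pack a destination into a 16-bit flit 1 value (bit 15 first)."""
    pe = _check(pe, 2, "pe")
    if dest_type is DestType.DYADIC_WIDE:
        # [0][0][port:1][PE:2][offset:8][act_id:3]
        return (
            (_check(port, 1, "port") << 13) | (pe << 11)
            | (_check(offset, 8, "offset") << 3) | _check(act_id, 3, "act_id")
        )
    if dest_type is DestType.MONADIC_NORMAL:
        # [0][1][0][PE:2][offset:8][act_id:3]
        return (
            (0b010 << 13) | (pe << 11)
            | (_check(offset, 8, "offset") << 3) | _check(act_id, 3, "act_id")
        )
    # MONADIC_INLINE: [0][1][1][PE:2][10][offset:7][spare:2], spare bits zero
    return (
        (0b011 << 13) | (pe << 11) | (0b10 << 9)
        | (_check(offset, 7, "offset") << 2)
    )
```

Range-checking each field before shifting means an out-of-range offset or act_id fails loudly instead of silently corrupting a neighbouring field.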

### 8.5 Destination Resolution Rework

The current `_resolve_destinations()` function creates `ResolvedDest`
objects containing `Addr(a=iram_offset, port=edge.port, pe=dest_pe)`.
Under the frame model, destination resolution must additionally:

1. Look up the destination node's `act_id` (not just `iram_offset` and
   `pe`).
2. Determine the destination token type (dyadic wide vs monadic normal
   vs monadic inline).
3. Compute the packed flit 1 value.
4. Assign the flit 1 value to a frame slot.

The `ResolvedDest` type may be extended to carry the packed flit 1
value, or the flit 1 computation may happen in a separate sub-pass
after destination resolution.

### 8.6 ctx_mode / ctx_override Removal

The current allocator and codegen handle `ctx_override` edges by setting
`ctx_mode=1` on the source instruction and packing the target context
and generation into the `const` field. This entire mechanism is removed.

Under the frame model, cross-activation routing is handled by frame
destination slots. The destination flit 1 value already encodes the
target PE, offset, and act_id. No instruction-level `ctx_mode` is
needed. The `ctx_override` flag on `IREdge` may be kept for semantic
annotation (this edge crosses activation boundaries) but has no effect
on instruction encoding.

## 9. Codegen Changes (`asm/codegen.py`) -- MAJOR

### 9.1 Instruction Generation

Current codegen (`_build_iram_for_pe()`) generates `ALUInst` and
`SMInst` objects. These are Python dataclasses that model the old
instruction format with embedded destinations and constants:

```python
ALUInst(op, dest_l, dest_r, const, ctx_mode)
SMInst(op, sm_id, const, ret, ret_dyadic)
```

New codegen generates 16-bit instruction words matching the hardware
format:

```python
@dataclass(frozen=True)
class Instruction:
    """16-bit instruction word for IRAM.

    [type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
    """
    type: int       # 0 = CM, 1 = SM
    opcode: int     # 5-bit opcode
    mode: int       # 3-bit mode (0-7)
    wide: bool      # 32-bit frame values
    fref: int       # 6-bit frame slot base index
```

The `Instruction` type replaces both `ALUInst` and `SMInst`. Constants
and destinations are NOT in the instruction -- they are in frame slots.

The `PEConfig.iram` field type changes from
`dict[int, ALUInst | SMInst]` to `dict[int, Instruction]` (or
`dict[int, int]` if storing raw 16-bit words).
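
If raw words are stored, the `Instruction` dataclass needs an encoder. The sketch below assumes the fields are packed most-significant-bit first in the order shown in the docstring; the `encode()` and `decode()` method names are hypothetical additions, with `decode()` included only to support round-trip tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    type: int       # 0 = CM, 1 = SM
    opcode: int     # 5-bit opcode
    mode: int       # 3-bit mode (0-7)
    wide: bool      # 32-bit frame values
    fref: int       # 6-bit frame slot base index

    def encode(self) -> int:
        """Pack into [type:1][opcode:5][mode:3][wide:1][fref:6]."""
        for name, value, bits in (
            ("type", self.type, 1),
            ("opcode", self.opcode, 5),
            ("mode", self.mode, 3),
            ("fref", self.fref, 6),
        ):
            if not 0 <= value < (1 << bits):
                raise ValueError(f"{name}={value} does not fit in {bits} bits")
        return (
            (self.type << 15) | (self.opcode << 10) | (self.mode << 7)
            | (int(self.wide) << 6) | self.fref
        )

    @classmethod
    def decode(cls, word: int) -> "Instruction":
        """Inverse of encode(), for round-trip tests."""
        return cls(
            type=(word >> 15) & 0x1,
            opcode=(word >> 10) & 0x1F,
            mode=(word >> 7) & 0x7,
            wide=bool((word >> 6) & 0x1),
            fref=word & 0x3F,
        )
```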

### 9.2 Frame Setup Sequence Generation (NEW)

Codegen must generate the bootstrap sequence that loads frame contents
before execution begins. This is a stream of tokens:

1. **FrameControlToken (ALLOC)** for each activation:
   ```
   flit 1: [0][1][1][PE:2][00][op=0][spare:3][act_id:3]
   flit 2: (return routing or unused)
   ```

2. **PELocalWriteToken** for each frame slot that needs initialization:
   ```
   flit 1: [0][1][1][PE:2][01][region=1][spare:1][slot:5][act_id:3]
   flit 2: [data:16]  (the frame slot value)
   ```

3. **PELocalWriteToken** for IRAM entries:
   ```
   flit 1: [0][1][1][PE:2][01][region=0][spare:1][slot:5][act_id:ignored]
   flit 2: [instruction:16]
   ```

The numbered list above is grouped by token type, not by emission order.
Ordering matters: IRAM writes come before frame setup, and frame setup
(ALLOC + slot writes) comes before seed tokens. More specifically:

```
IRAM writes (all PEs)
  -> ALLOC frame control (all activations)
    -> Frame slot writes (constants, destinations per activation)
      -> Seed tokens (initial data tokens to start execution)
```

### 9.3 New Token Types

The emulator will need new token types (or the existing types must be
adapted):

- **FrameControlToken** -- ALLOC/FREE frame lifecycle. Currently not in
  `tokens.py`. Codegen needs to emit these.
- **PELocalWriteToken** -- writes to IRAM (region=0) or frame slots
  (region=1). The current `IRAMWriteToken` is a special case of this
  (region=0 only). It may be generalised or a new type added.

These are emulator-level changes that codegen depends on. The codegen
module must import and construct whatever token types the emulator
provides.

### 9.4 Seed Token Generation

Current seed token generation creates `MonadToken` or `DyadToken` with
`ctx` field. Changes:

- `ctx` field -> `act_id` field in token constructors
- `gen` field on `DyadToken` is removed (ABA protection is via the 670
  valid bit, not generation counters)
- Seed tokens target `(pe, offset, act_id)` triples, packed from the
  destination's allocation data

### 9.5 Direct Mode Output

`generate_direct()` currently returns `AssemblyResult` with:
- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of
  `ALUInst/SMInst`
- `sm_configs: list[SMConfig]`
- `seed_tokens: list[MonadToken]`

New `AssemblyResult`:
- `pe_configs: list[PEConfig]` -- PEConfig with `iram` dict of
  `Instruction` (new type), plus `frame_layouts: dict[int, FrameLayout]`
  mapping act_id to frame layout data
- `sm_configs: list[SMConfig]` -- unchanged
- `seed_tokens: list` -- may include `DyadToken` and `MonadToken` with
  `act_id` instead of `ctx`
- `setup_tokens: list` -- NEW: frame control and PE-local write tokens
  for bootstrap

### 9.6 Token Stream Mode Output

`generate_tokens()` currently produces:
```
SM init tokens -> IRAM write tokens -> seed tokens
```

New ordering:
```
SM init tokens -> IRAM write tokens -> ALLOC tokens -> frame slot write tokens -> seed tokens
```
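
One way to guarantee this ordering regardless of how codegen collects tokens is to tag each token with its phase and stable-sort. A sketch only: `Phase`, `Emitted`, and `order_token_stream` are hypothetical names, and the payload is left untyped because the emulator token types are not yet defined.

```python
from enum import IntEnum
from typing import Iterable, NamedTuple


class Phase(IntEnum):
    """Bootstrap phases in emission order (names are illustrative)."""
    SM_INIT = 0
    IRAM_WRITE = 1
    ALLOC = 2
    FRAME_SLOT_WRITE = 3
    SEED = 4


class Emitted(NamedTuple):
    phase: Phase
    payload: object   # whatever token type the emulator ends up providing


def order_token_stream(tokens: Iterable[Emitted]) -> list[Emitted]:
    """Stable-sort tokens by phase, keeping per-phase emission order."""
    return sorted(tokens, key=lambda token: token.phase)
```

Because `sorted` is stable, tokens within a phase keep the order in which codegen produced them, which matters if frame slot writes for one activation must stay grouped.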

### 9.7 ctx_mode Removal

The entire `ctx_mode` / `ctx_override` handling in `_build_iram_for_pe()`
(lines 96-132 of codegen.py) is removed. Cross-activation routing is
handled by frame destination slots, not by instruction encoding.

### 9.8 Route Restriction Computation

`_compute_route_restrictions()` is unchanged in concept. It scans edges
to determine which PEs and SMs a given PE needs to route to. The
implementation stays the same.

## 10. Serialize Pass Changes (`asm/serialize.py`)

### 10.1 Field Renaming

- `ctx` -> `act_id` in node serialization, wherever the activation ID is
  displayed
- Any `ctx_slot` qualifier in dfasm output becomes the renamed activation
  qualifier, if the qualifier syntax is kept (see section 4.2)

### 10.2 New Fields

If `mode`, `fref`, and `wide` are displayed in serialized output (for
debugging allocated IR), `_serialize_node()` needs to format them.
Format could be: `&node|pe0|act2|mode1|fref8 <| add`

### 10.3 Round-Trip Support

The serialize pass must be able to round-trip new IR fields. Since
`mode`, `fref`, and frame layout are only populated after allocation,
serialization before allocation produces the same output as today (no
new fields to display). Post-allocation serialization adds the new
fields.
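
The "only emit what is populated" behaviour can be sketched as a formatter that skips unset fields, so pre-allocation output is unchanged. `format_node_header` is a hypothetical helper and its parameter list is an assumption; the real `_serialize_node()` works on IRNode objects.

```python
from typing import Optional


def format_node_header(
    name: str,
    opcode: str,
    pe: Optional[int] = None,
    act_id: Optional[int] = None,
    mode: Optional[int] = None,
    fref: Optional[int] = None,
) -> str:
    """Render `&name|pe0|act2|mode1|fref8 <| op`, skipping unset fields."""
    parts = [f"&{name}"]
    fields = (("pe", pe), ("act", act_id), ("mode", mode), ("fref", fref))
    for prefix, value in fields:
        # Explicit None check: 0 is a valid value for every field.
        if value is not None:
            parts.append(f"{prefix}{value}")
    return f"{'|'.join(parts)} <| {opcode}"
```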

## 11. Built-in Macro Changes (`asm/builtins.py`)

No changes needed. The built-in macros (`#loop_counted`, `#loop_while`,
`#permit_inject`, `#reduce_2/3/4`) use generic opcodes and edge routing.
They do not reference `ctx`, `ctx_slot`, `free_ctx`, or any
context-specific concepts.

The `free_ctx` opcode rename to `free_frame` does not affect builtins
because none of them use `free_ctx`.

## 12. Error Types (`asm/errors.py`)

### 12.1 New Error Categories

Add to `ErrorCategory`:

```python
class ErrorCategory(Enum):
    # ... existing ...
    FRAME = "frame"  # Frame layout overflow, slot conflicts
```

### 12.2 New Error Conditions

| Error | Category | Source Pass | Condition |
|---|---|---|---|
| Frame slot overflow | FRAME | allocate | Total slots > 64 for an activation |
| Matchable offset overflow | RESOURCE | allocate/place | > 8 dyadic instructions per activation per PE |
| Frame count overflow | RESOURCE | allocate | > 4 concurrent activations on one PE |
| Act ID exhaustion | RESOURCE | allocate | > 8 activation IDs needed (wraparound) |

## 13. Dfgraph Pipeline Impact (`dfgraph/`)

### 13.1 `dfgraph/pipeline.py`

The pipeline runner calls `allocate()` and uses `IRGraph` types. If
`SystemConfig` fields change, `pipeline.py` may need minor updates for
default values or error handling. The pipeline runner itself does not
inspect allocation results in detail.

### 13.2 `dfgraph/graph_json.py`

Currently includes `ctx` field in node JSON output. This becomes
`act_id`. The field rename is straightforward.

### 13.3 `dfgraph/categories.py`

References `RoutingOp.FREE_CTX` in the CONFIG category mapping. This
becomes `RoutingOp.FREE_FRAME`. One line change.

`EXTRACT_TAG` (if added) maps to the ROUTING or CONFIG category
depending on its semantics.

## 14. Monitor Impact (`monitor/`)

### 14.1 `monitor/snapshot.py`

`PESnapshot` currently captures `matching_store` (2D array of
`MatchEntry`) and `gen_counters`. Under the frame model:
- `matching_store` becomes frame state (per-frame slot values, presence
  bits)
- `gen_counters` is removed (no generation counters in frame model)
- New: `frame_allocations` (act_id -> frame_id mapping), `frame_slots`
  (per-frame slot contents)

### 14.2 `monitor/graph_json.py`

Node state overlay: `ctx` field -> `act_id`. Frame slot contents may be
added to the state overlay for debugging.

### 14.3 `monitor/repl.py`

The `pe` command displays PE state including matching store contents.
This needs updating to display frame state instead.

## 15. Dependency Order

Implementation order based on the dependency graph:

### Phase 1: Foundation Types
1. **`cm_inst.py`**: Rename `RoutingOp.FREE_CTX` -> `RoutingOp.FREE_FRAME`.
   Add `RoutingOp.EXTRACT_TAG`. Add `Instruction` dataclass. Update
   `is_monadic_alu()`.
2. **`asm/ir.py`**: Add `FrameSlotMap`, `FrameLayout`. Rename
   `ctx_slot` -> `act_slot`, `ctx` -> `act_id` on IRNode. Update
   `SystemConfig` (remove `ctx_slots`, add `frame_count`,
   `frame_slots`, `matchable_offsets`). Rename `CtxSlotRef` ->
   `ActSlotRef`, `CtxSlotRange` -> `ActSlotRange`.
3. **`asm/opcodes.py`**: Add new opcodes to `MNEMONIC_TO_OP`,
   `_MONADIC_OPS_TUPLES`. Rename `free_ctx` -> `free_frame`.
4. **`asm/errors.py`**: Add `ErrorCategory.FRAME`.

### Phase 2: Allocation and Codegen (the big changes)
5. **`asm/allocate.py`**: Rewrite `_assign_context_slots()` ->
   `_assign_act_ids()`. Add `_compute_frame_layouts()`. Add
   `_compute_modes()`. Add `_pack_flit1()`. Update
   `_assign_iram_offsets()` for new offset scheme. Update
   `_resolve_destinations()` to produce flit 1 values. Remove
   `ctx_mode`/`ctx_override` handling.
6. **`asm/codegen.py`**: Replace `ALUInst`/`SMInst` generation with
   `Instruction` generation. Add frame setup token generation. Add
   `FrameControlToken` / `PELocalWriteToken` emission. Update seed
   token generation. Remove `ctx_mode` handling. Update `PEConfig`
   construction. Update `AssemblyResult`.

### Phase 3: Upstream Pass Adjustments
7. **`asm/lower.py`**: Rename `ctx_slot` references. Update qualifier
   parsing if syntax changes.
8. **`asm/expand.py`**: Replace `FREE_CTX` with `FREE_FRAME` in call
   wiring. Rename `CtxSlotRef` -> `ActSlotRef`. Update `CallSite`
   field names. Simplify trampoline logic if `ctx_override` is removed.
9. **`asm/place.py`**: Update `_count_iram_cost()` (all nodes cost 1).
   Replace `ctx_slots` tracking with `frame_count`. Add
   matchable-offset-per-activation constraint.

### Phase 4: Output and Tooling
10. **`asm/serialize.py`**: Rename `ctx` -> `act_id` in output. Add
    mode/fref display for post-allocation IR.
11. **`asm/builtins.py`**: No changes (verified).
12. **`dfgraph/categories.py`**: `FREE_CTX` -> `FREE_FRAME`.
13. **`dfgraph/graph_json.py`**: `ctx` -> `act_id` in JSON output.
14. **`monitor/`**: Update snapshot, graph_json, and REPL for frame
    model.

### Phase 5: Emulator Updates (out of scope for this doc)
15. **`emu/types.py`**: `PEConfig` changes (`iram` type, remove
    `ctx_slots`, add frame parameters).
16. **`emu/pe.py`**: Frame-based matching instead of context-slot
    matching store.
17. **`tokens.py`**: `ctx` -> `act_id`, remove `gen`. Add
    `FrameControlToken`, `PELocalWriteToken`.

## 16. What Stays the Same

- **dfasm grammar** (`dfasm.lark`): mostly unchanged. May add frame
  directives later, but not required for v0.
- **Parse pass** (Lark parser): no changes.
- **Error infrastructure** (`errors.py`): structure unchanged, one new
  category added.
- **Graph structure** (`IRGraph`, `IREdge`, `IRRegion`): unchanged.
- **Pipeline architecture**: still 7 passes in the same order.
- **Resolve pass**: completely unchanged.
- **Built-in macros**: completely unchanged.
- **SM-related codegen**: SMConfig construction, SM init tokens, data
  defs -- all unchanged.
- **Name resolution and scoping**: unchanged.
- **Macro expansion core**: parameter substitution, variadic repetition,
  opcode parameters -- all unchanged. Only call wiring details change.

## 17. Risk Assessment

| Area | Risk | Mitigation |
|---|---|---|
| Frame layout allocation | New, complex algorithm | Start with simple sequential allocation, optimise later |
| Flit 1 packing | Bit-level encoding must match hardware spec | Unit tests against architecture-overview.md bit layouts |
| Instruction deduplication | Interaction with frame layouts is subtle | Defer dedup to optimisation pass; v0 allocates per-activation |
| Mode computation | 8 modes, context-dependent selection | Exhaustive test matrix against mode table |
| ABA distance | Act_id assignment must maintain distance | Sequential assignment (0,1,2,3) is safe for static programs |
| `ctx_override` removal | Affects expand pass call wiring | Can keep `ctx_override` as semantic annotation initially |
| Emulator dependency | Codegen must emit tokens emulator can consume | Phase emulator changes alongside codegen; can use adapter layer |