OR-1 dataflow CPU sketch
# Dynamic Dataflow CPU — PE (Processing Element) Design

Covers the CM (Control Module) pipeline, frame-based matching, instruction
memory and encoding, activation lifecycle, per-PE identity, 670 subsystem
design, pipeline stall analysis, SM operation dispatch, SC blocks, and
execution modes.

See `architecture-overview.md` for token format, flit-1 bit allocation,
and module taxonomy. See `network-and-communication.md` for how tokens
enter/leave the PE. See `alu-and-output-design.md` for ALU operations,
output formatting, and SM flit assembly details. See
`bus-interconnect-design.md` for physical bus implementation. See
`sm-design.md` for SM internals and I-structure semantics. See
`io-and-bootstrap.md` for bootstrap loading and I/O subsystem design.

## 1. Design Philosophy: Static Assignment, Compiler-Driven Sizing

This design diverges significantly from both Manchester and Amamiya in how PEs are used. Understanding the difference is critical to understanding why the matching store can be so much smaller here.

**Amamiya DFM (1982/1984 papers):** every PE has ALL function bodies pre-loaded in instruction memory (8KW, 58 bits/word per PE, identical contents across all PEs). Function _instances_ are dynamically assigned to PEs at runtime by a CCU (Cluster Control Unit) that picks the least-loaded PE. The OM (operand matching memory) needs 1024 CAM blocks per PE because any function can run anywhere, and deep Lisp recursion means many simultaneous activations. The "semi-CAM" was their solution to making this affordable -- instance name directly addresses a block, then 4-way set-associative lookup within the block on instruction identifier.

**Manchester (Gurd 1985):** similar story but with hashing instead of semi-CAM. 16 parallel 64K-token memory banks per PE for set-associative hash lookup. 1M token capacity matching store. Plus an overflow unit (initially emulated on the host).
The matching unit alone was 16 memory boards per PE.

Both machines sized their matching stores for worst-case dynamic scheduling of arbitrary programs. The whole program lives in every PE (or in a single PE's matching unit), and any activation can land anywhere. That's why those matching stores are enormous.

**This design:** the compiler statically assigns function bodies (or chunks of them) to specific PEs. Different PEs have different instruction memory contents. The compiler knows at compile time which functions run where, and can calculate maximum concurrent activations per PE. This means:

- Instruction memory is NOT replicated -- each PE only holds its assigned function bodies. IM can be much smaller.
- The matching store only needs enough frames for the maximum concurrent activations the compiler predicts for that specific PE. Not 1024. Probably 4.
- No CCU is needed for dynamic PE allocation. Scheduling decisions are made at compile time.
- The tradeoff is scheduling flexibility -- you can't dynamically rebalance load at runtime. The compiler must get it roughly right.

### Function Splitting Across PEs

A "function" in the source language does NOT need to map 1:1 to a contiguous block on one PE. The compiler can split a function body at any data-dependency boundary. The token network doesn't know or care whether two instructions are "in the same function" -- it just sees tokens with destinations.

A 40-instruction function body could be split into three chunks of ~13 instructions across three PEs, each chunk fitting in a smaller frame. The "function" as the architecture sees it is really "a set of instructions that share a frame on this PE." The compiler defines what that grouping means.

This is a powerful lever for keeping frames small: if a function body is too big for the frame size, the compiler splits it. The split introduces inter-PE token traffic (extra network hops), but keeps per-PE hardware simple.
The compiler can optimise the split points to minimise cross-PE traffic.

**Implication for frame semantics:** a frame doesn't mean "one function activation." It means "one chunk of work sharing a local operand namespace on this PE." Multiple frames on different PEs might collectively represent one function activation. The token's `activation_id` scopes operand matching to a local frame, nothing more.

**Implication for the compiler:** this architecture actively wants either small functions or functions distributed across PEs. The compiler is free to treat any subgraph of the dataflow graph as a "chunk" and assign it to a PE, regardless of source-level function boundaries. Loop bodies, branch arms, pipeline stages: all valid chunk boundaries. The grain of scheduling is the subgraph, not the function.

## 2. PE Identity

Each PE has a unique ID used for routing. Two mechanisms, not mutually exclusive:

**EEPROM-based**: the instruction decoder EEPROM already contains per-PE truth tables. The PE ID can be encoded as additional input bits to the EEPROM, meaning the EEPROM contents are unique per PE but the circuit board is identical. The instruction decoder "knows" which PE it is because its EEPROM was burned with that ID.

**DIP switches**: 3-4 switches give 8-16 PE addresses. Better for early prototyping -- reconfigurable without reflashing. Can coexist with the EEPROM approach (the switches provide ID bits that feed into the EEPROM address lines).

The PE ID is needed in two places:

1. Input token filtering: "is this token addressed to me?"
2. Output token formatting: "set the source PE field" (if result tokens carry source info for return routing)

## 3. PE Pipeline (5-stage)

### Bus Interface: Serializer / Deserializer

The PE connects to the 16-bit external bus via ser/deser logic at the input and output boundaries.
This handles the width conversion between 16-bit flits on the bus and the wider internal token representation:

- **Input deserializer**: receives 2+ flits from the bus, reassembles them into a full token (routing fields from flit 1 + data from flit 2). Shift register + flit counter. Outputs a reassembled token to the input FIFO.
- **Output serializer**: takes a formed result token, splits it into flits (routing fields into flit 1, data into flit 2), and clocks them onto the bus. Shift register + toggle.
- Hardware cost: ~5-8 TTL chips per direction (shift registers, counters, muxes).
- Naturally integrates with the clock domain crossing FIFOs when running Mode B (2x bus clock). Under Mode A (shared clock), the ser/deser simply takes 2 clock cycles per token transfer.

### Pipeline Stages

The pipeline runs IFETCH before MATCH. The instruction word, decoded at
the end of stage 2, drives all subsequent pipeline behaviour: whether to
check the matching store, whether to read a constant, how many
destinations to read, whether to write back to the frame. The token's
`activation_id` drives associative lookup in parallel with the IRAM read,
hiding resolution latency.

**Why IFETCH before MATCH.** The instruction word determines
*how* matching works: whether the instruction is dyadic or monadic,
which frame slots to read for operands and constants, whether to
write back to the frame (sink modes), and how many destinations to
read at output. Fetching the instruction first gives the pipeline
controller all the information it needs to sequence stage 3's SRAM
accesses efficiently.

The token's dyadic/monadic prefix enables parallel work: when the
prefix indicates "dyadic," stage 2 starts act_id -> frame_id
resolution via the 670s simultaneously with the IRAM read.
By the time stage 3 begins, both the instruction word
and the frame_id / presence / port metadata are available, and the
only remaining SRAM work is reading or writing actual operand data
and constants.

```
Stage 1: INPUT
  - Receive reassembled token from input deserialiser
  - Classify by prefix: dyadic wide (00), monadic normal (010),
    misc bucket (011). Within misc bucket: frame control (sub=00),
    PE-local write (sub=01), monadic inline (sub=10)
  - Compute/data tokens -> pipeline FIFO
  - Frame control tokens -> tag store write/clear (side path)
  - PE-local write tokens -> SRAM write queue (side path, executes
    when frame SRAM is not busy with compute pipeline accesses)
  - Buffer in small FIFO (8-deep, storing reassembled tokens)
  - ~1K transistors (flip-flops) or use small SRAM

Stage 2: IFETCH
  Two parallel operations within a single cycle:

  (a) IRAM SRAM read at [bank_reg : token.offset]. Produces the
      16-bit instruction word: type, opcode, mode, wide, fref.
      Single read cycle (16-bit instruction, one chip pair).

  (b) Activation_id resolution. For Approach C (74LS670 lookup),
      this is combinational (~35 ns): present act_id on the 670
      address lines, get {valid, frame_id} back. Presence and port
      metadata also resolve combinationally in this stage (670 read
      at frame_id, ~70 ns total from act_id presentation). At 5 MHz
      (200 ns cycle), all metadata is available before the IRAM read
      completes.

  The dyadic/monadic prefix from flit 1 determines whether
  activation_id resolution starts in this stage (dyadic) or is
  deferred until the instruction confirms the need (monadic with
  frame access).

  IRAM valid-bit check occurs in parallel: if the page containing
  the target offset is marked invalid, the token is rejected (see
  IRAM Valid-Bit Protection below).

Stage 3: MATCH / FRAME
  Path depends on instruction type (from stage 2) and token type.
  Uses the instruction's mode field to determine frame SRAM accesses:

  - Dyadic hit (second operand): read stored operand from frame SRAM
    at [frame_id : match_offset]. If has_const, also read constant
    from frame[fref]. 1-2 SRAM cycles depending on mode.
  - Dyadic miss (first operand): write incoming operand to frame SRAM,
    set presence bit in 670. Token consumed. 1 SRAM cycle.
  - Monadic with constant: read constant from frame[fref]. 1 SRAM
    cycle.
  - Monadic mode 4 (CHANGE_TAG, no frame access): 0 cycles, pass
    through.
  - Mode 7 (SINK+CONST / RMW): read old value from frame[fref] for
    read-modify-write. 1 SRAM cycle.

  See Pipeline Stall Analysis (section 11) for full cycle-count
  tables across Approaches A, B, and C.

Stage 4: EXECUTE
  - Instruction type bit selects CM compute (0) or SM operation (1)
  - CM path: 16-bit ALU executes arithmetic/logic/comparison/routing
    on operand data + constant (if present). Purely combinational.
    No SRAM access. Result latched.
  - SM path: ALU computes effective address or passes through data;
    PE constructs SM flit fields from frame data and operands.
    See SM Operation Dispatch (section 7) for encoding and dispatch
    details. See `alu-and-output-design.md` for SM flit assembly.
  - ~1500-2000 transistors (ALU) + SM flit assembly mux (~4-6 chips)

Stage 5: OUTPUT
  - CM path: read destination(s) from frame SRAM. Destinations are
    pre-formed flit 1 values stored in frame slots during activation
    setup. The PE reads the slot and puts it directly on the bus as
    flit 1; the ALU result becomes flit 2. Near-zero token formation
    logic.
    - Mode 0/1 (single dest): read frame[fref] or frame[fref+1].
      1 SRAM cycle.
    - Mode 2/3 (fan-out): read dest1 and dest2 from consecutive
      frame slots. 2 SRAM cycles.
    - Mode 4/5 (CHANGE_TAG): left operand becomes flit 1 verbatim.
      0 SRAM cycles.
    - Mode 6/7 (SINK): write ALU result back to frame[fref]. No
      output token. 1 SRAM cycle.
  - SM path: emit SM token to target SM. SM flit 1 constructed from
    frame slot (SM_id + addr from frame[fref]), SM flit 2 source
    selected by instruction decoder (ALU out, R operand, or frame
    slot). See SM Operation Dispatch (section 7) for flit 2 source
    mux details.
  - Pass to output serialiser for flit encoding and bus injection.
```

### Concurrency Model

The pipeline overlaps work: multiple tokens can be in flight simultaneously at different stages. In the emulator, each token spawns a separate SimPy process that progresses through the pipeline independently. This models the hardware reality where stage 2 can be fetching a new instruction while stage 4 is executing a previous one.

Cycle counts per token type (Approach C, recommended v0):

- **Dyadic hit, mode 1** (common case): 6 cycles (input + ifetch + match/const + execute + output)
- **Dyadic hit, mode 0** (no const): 5 cycles
- **Dyadic miss**: 3 cycles (input + ifetch + store operand; no execute/output)
- **Monadic, mode 0**: 4 cycles (input + ifetch + execute + output; no match)
- **Monadic, mode 4** (CHANGE_TAG, no frame): 3 cycles
- **PE-local write**: side path, does not enter compute pipeline
- **Frame control**: side path, tag store write
- **Network delivery**: +1 cycle latency between emit and arrival at destination

**Unpipelined throughput:** 3-7 cycles per token depending on instruction
mode. This is the baseline against which pipeline overlap improvements
are measured (see Pipeline Stall Analysis, section 11).
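These counts can be sanity-checked against the emulator with a small lookup function. A sketch in plain Python (not the SimPy model; `token_cycles` is an illustrative name, and the assumption that fan-out modes 2/3 add one output SRAM cycle is mine, consistent with the 3-7 cycle range stated above):

```python
# Approximate per-token pipeline occupancy, Approach C, unpipelined.
# Mirrors the cycle counts listed above; fan-out cost is an assumption.

def token_cycles(kind: str, mode: int = 0) -> int:
    """Cycles a token occupies the compute pipeline (v0 estimate)."""
    if kind == "dyadic_miss":
        return 3                      # input + ifetch + store operand
    if kind == "monadic":
        return 3 if mode == 4 else 4  # mode 4 (CHANGE_TAG) skips frame SRAM
    if kind == "dyadic_hit":
        has_const = mode & 1                     # modes 1/3/5/7: constant read
        has_fanout = 1 if mode in (2, 3) else 0  # second destination read
        return 5 + has_const + has_fanout
    raise ValueError(f"unknown token kind: {kind}")
```

All values stay within the 3-7 cycle unpipelined range quoted above.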
### Pipeline Register Widths

Between instruction fetch and execute, the pipeline carries both operand data and instruction control information in parallel but logically separate paths:

**Data path (~32 bits between match/frame and ALU):**
```
data_L: 16 bits (from frame SRAM or direct from token for monadic)
data_R: 16 bits (from frame SRAM)
        |
   ALU (16-bit)
        |
result: 16 bits
```

**Control path (~16 bits from IRAM, plus frame reads in stage 5):**
```
type:   1 bit  (CM/SM select, from IRAM)
opcode: 5 bits (from IRAM, consumed by ALU / SM decoder)
mode:   3 bits (from IRAM, drives frame access pattern + output routing)
wide:   1 bit  (from IRAM, 16/32-bit frame access)
fref:   6 bits (from IRAM, frame slot base index)
```

Destinations are NOT in the pipeline control registers. They are read
from frame SRAM during stage 5 as pre-formed flit 1 values. This
simplifies the pipeline latch: only 16 bits of IRAM control + 32 bits
of operand data pass between stages.

Total pipeline register between fetch and execute: **~48 bits**. The
mode field (3 bits) encodes tag behaviour, constant presence, and
fan-out in a single dense field (see mode table in section 6).

Pipeline registers between stages: ~500 transistors
Control logic (state machine, handshaking): ~500-1000 transistors

**Per-PE total: ~32-43 chips** (Approach C).

## 4. Frame-Based Matching

### Why It Can Be Small

The matching store is the highest-risk component in any dataflow machine. Manchester needed 16 memory boards per PE. Amamiya needed 1024 CAM blocks (32KW at 43 bits/word) per PE. Both were sized for worst-case dynamic scheduling of arbitrary programs.

This design avoids that because:

1. **Static PE assignment**: the compiler knows which functions run on which PE and can calculate maximum concurrent activations per PE.
2. **Function splitting**: the compiler can split large function bodies across PEs so no single PE needs a huge frame.
3. **Compiler-controlled frame allocation**: the compiler assigns activation IDs at compile time for statically-known activations. Only genuinely dynamic activations (runtime-determined recursion depth) need runtime allocation.

The frame count is therefore a _compiler parameter_, not an architectural constant. The hardware provides 4 concurrent frames of 64 slots each. The compiler must generate code that fits within those limits, splitting and scheduling accordingly.

### Architecture: 74LS670 Register-File Lookup (Approach C, Recommended v0)

Matching uses the per-activation **frame** model. Pending match operands
live in the same SRAM address space as constants, destinations, and
accumulators. The 74LS670 (4-word x 4-bit register file with independent
read/write ports) provides the activation_id-to-frame_id mapping and
presence/port metadata, all combinationally.

**act_id-to-frame_id resolution:**

Two 74LS670s, addressed by `act_id[1:0]` with `act_id[2]` selecting
between chips. Output = `{valid:1, frame_id:2, spare:1}`. Combinational
read: present act_id on the address lines, frame_id appears at the
output in ~35 ns.

```
ALLOC:  write {valid=1, frame_id} at address act_id (670 write port)
FREE:   write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns
```

The 670's independent read and write ports allow ALLOC to proceed while
the pipeline reads -- zero conflict.

**Presence and port metadata:**

Presence and port bits live in additional 670s, addressed by
`[frame_id:2]`. The matchable offset range is constrained to offsets
0-7 (8 dyadic-capable slots per frame). The assembler packs dyadic
instructions at low offsets; offsets 8-255 are monadic-only.
Recommended layout: 4x 670, each covering 2 offsets across 4 frames:

```
670 chip 0 (offsets 0-1): word[frame_id] = {pres0:1, port0:1, pres1:1, port1:1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2:1, port2:1, pres3:1, port3:1}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4:1, port4:1, pres5:1, port5:1}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6:1, port6:1, pres7:1, port7:1}
```

`offset[2:1]` selects the chip, `offset[0]` selects which pair of bits
within the 4-bit output (a 2:1 mux -- one gate).

The 670's simultaneous read/write is critical: during stage 3, when
a first operand stores and sets presence, the write port updates the
presence 670 while the read port remains available for the next
pipeline stage's lookup. No read-modify-write sequencing needed.

All reads are combinational (~35 ns). All resolve during stage 2 in
parallel with the IRAM read. By the time stage 3 begins, the PE knows
frame_id, presence, and port -- the only SRAM access in stage 3 is
reading/writing the actual operand data.

**The matching operation:**

```
Stage 2 (parallel with IRAM read):
  act_id -> frame_id via 670 lookup (combinational)
  presence[frame_id][offset] via 670 read (combinational)
  port[frame_id][offset] via 670 read (combinational)

Stage 3 (driven by instruction word from stage 2):
  if instruction is dyadic AND presence bit set:
    -> match found: read stored operand from frame SRAM at
       [frame_id:2][match_offset]. Clear presence bit.
       Read constant from frame[fref] if has_const.
    -> proceed to stage 4 with both operands.
  if instruction is dyadic AND presence bit clear:
    -> first operand: write incoming operand to frame SRAM at
       [frame_id:2][match_offset]. Set presence bit in 670.
    -> token consumed, advance to next input token.
  if instruction is monadic:
    -> bypass matching. Read constant from frame[fref] if has_const.
326``` 327 328**Hardware cost:** 329 330| Component | Chips | Notes | 331| ------------------------- | ------ | --------------------------------------- | 332| act_id -> frame_id lookup | 2 | 74LS670, indexed by act_id | 333| Presence + port metadata | 4 | 74LS670, indexed by frame_id | 334| Bit select mux | 1-2 | offset-based selection of presence/port | 335| **Total match metadata** | **~8** | | 336 337**Constraint: 8 matchable offsets per frame.** The assembler enforces 338this. 8 dyadic instructions per function chunk per PE is reasonable -- 339the compiler splits larger function bodies across PEs. With 4 PEs, the 340system supports 32 simultaneous dyadic slots, which exceeds typical 341working-set utilisation for the target workloads. 342 343**Constraint: 8 unique activation_ids.** The 3-bit act_id supports 8 344entries in the lookup table. With 4 concurrent frames, 4 IDs of ABA 345distance exist before wraparound. 346 347### Alternative Approaches 348 349Two alternative matching implementations exist: 350 351- **Approach A (set-associative tags in frame SRAM):** tag words share 352 the frame SRAM chip. No extra chips, but matching consumes SRAM 353 cycles (2 cycles for a dyadic hit). Lowest chip count (~4-6 extra 354 TTL), highest pipeline stall rate. 355 356- **Approach B (full register-file match pool):** match entries live 357 entirely in a register file with parallel comparators. Matching is 358 fully combinational (~50 ns). Highest chip count (~16-18 extra TTL), 359 best pipeline throughput (eliminates all match-related SRAM 360 contention). 361 362Approach C (670 lookup, described above) sits between A and B: act_id 363resolution and presence checking are combinational (like B), but operand 364data lives in SRAM (like A). ~8 extra chips, good pipeline throughput. 
See the approach comparison table in Pipeline Stall Analysis (section 11)
for full cycle counts and throughput estimates across all three
approaches, plus 670-enhanced variants (B+670 indexed, B+670 semi-CAM)
and hybrid upgrade paths.

### Frame Sizing

6-bit `fref` addresses 64 slots per activation. With 4 concurrent
frames, the frame region occupies 4 x 64 x 2 bytes = 512 bytes of
SRAM. This fits trivially alongside the IRAM region in a 32Kx8 chip
pair.

A function body with 10 dyadic instructions, 5 constants, and fan-out
on 3 instructions might use ~30 frame slots (10 match + 5 const + 8
dest + 7 accumulator/spare). With slot dedup (shared destinations,
aliased constants), actual usage is typically 15-25 slots per
activation.

### What About Overflow?

If all frames are occupied or a function body exceeds 8 dyadic
instructions per frame:

**Compile-time prevention (primary strategy):**

- The compiler knows the frame count and matchable offset limit
- It splits functions and schedules activations to fit
- If a program genuinely can't fit (unbounded recursion deeper than 4 frames), the compiler inserts throttling code: a token that waits for a frame to free before allowing the next recursive call
- This is the Amamiya throttle idea, but implemented in software (compiler-inserted dataflow logic) rather than hardware

**Runtime overflow (safety net):**

- If a token arrives and the tag store has no valid entry for its act_id (shouldn't happen with correct compilation), the PE stalls the input FIFO until a frame frees. Simplest, safest, most debuggable. If it fires, something is wrong and stalling surfaces the bug.

**Upgrade path: hybrid with SRAM fallback:**

- If the 8-offset matchable range proves tight, the high bit of the offset can select between register (offset[3]=0, check the 670s) and SRAM (offset[3]=1, fall back to tag-in-SRAM from Approach A).
  The fast path stays combinational; the overflow path adds 1 SRAM cycle. The system degrades gracefully rather than hard-limiting at 8 dyadic offsets.

## 5. Frame Lifecycle

### Allocation

An ALLOC frame control token (prefix 011+00, op=0) arrives at the PE,
specifying an `activation_id`. The PE assigns the next free physical
frame and records the act_id-to-frame_id mapping in the 670 tag store.

Free frame tracking is a simple 2-bit counter or shift register (4
entries max). Hardware cost: ~2-3 TTL chips.

### Setup

PE-local write tokens (prefix 011+01, region=1) load constants and
destinations into the allocated frame's slots. The writer addresses
slots by (act_id, slot_index); the PE resolves act_id-to-frame_id
internally using the same 670 lookup as the compute pipeline. Setup
uses the same mechanism as IRAM loading -- a stream of write tokens,
precalculated by the assembler.

### Execution

Compute tokens arrive with `activation_id`. The PE resolves
act_id-to-frame_id (670 lookup, combinational), then uses frame_id to
address frame SRAM for matching, constant reads, destination reads, and
write-backs. See the pipeline stages (section 3) for the per-stage
access pattern.

### Deallocation

A `FREE_FRAME` instruction (opcode-driven, any mode) or a FREE frame
control token (prefix 011+00, op=1) releases the frame. The tag store
entry is cleared (`valid=0` written to the 670), presence/port metadata
for that frame is bulk-cleared across all 4 presence/port 670s, and the
frame_id returns to the free pool.

Multiple frees are idempotent / harmless. Freed frames are immediately
available for reallocation.
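The lifecycle above can be modelled in a few lines of Python -- a sketch with illustrative names (not the emulator's API) showing ALLOC, idempotent FREE, the throttle condition, and stale-token discard via the valid bit:

```python
# Frame lifecycle sketch: 4-entry free pool, 8-entry act_id tag store.
# resolve() returning None corresponds to discarding a stale token.

class FrameManager:
    def __init__(self, n_frames=4, n_act_ids=8):
        self.free = list(range(n_frames))    # free frame_id pool
        self.tag = [(False, 0)] * n_act_ids  # act_id -> (valid, frame_id)

    def alloc(self, act_id):
        """ALLOC token: returns False when all frames are active
        (throttle: NACK or stall until a FREE occurs)."""
        if not self.free:
            return False
        self.tag[act_id] = (True, self.free.pop(0))
        return True

    def free_frame(self, act_id):
        """FREE token / FREE_FRAME instruction. Repeated frees are no-ops."""
        valid, frame_id = self.tag[act_id]
        if valid:
            self.tag[act_id] = (False, 0)
            self.free.append(frame_id)
            # hardware would also bulk-clear presence/port 670s here

    def resolve(self, act_id):
        """670 lookup: frame_id, or None -> discard stale token (ABA guard)."""
        valid, frame_id = self.tag[act_id]
        return frame_id if valid else None
```

Note how the valid bit alone gives the ABA protection described below: a stale act_id resolves to `None` and the token is dropped.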
### ABA Protection

- 3-bit activation_id provides 8 unique IDs
- With at most 4 concurrent frames, 4 IDs of ABA distance exist before wraparound
- Stale tokens (from freed activations) carry an act_id whose 670 entry is now `valid=0` or maps to a different frame. The PE detects this via the valid bit and discards the stale token.
- 4 IDs of distance is sufficient because stale tokens drain within single-digit cycles. Wraparound collision is effectively impossible.
- The act_id validity check in the 670 provides ABA protection without dedicated hardware -- the valid bit serves as an implicit generation guard.

### Throttle

- The PE tracks the number of active frames (frames with valid=1 in the tag store)
- When all 4 frames are active, ALLOC tokens are NACKed or stall until a free occurs
- Prevents frame overflow
- Hardware cost: 2-bit counter + comparator + gate. ~3 TTL chips.
- With compiler-controlled scheduling, the throttle should rarely fire. It's a safety net, not a performance mechanism.

## 6. Instruction Memory

### Static Assignment, Per-PE Contents

Unlike Amamiya, where every PE has identical IM contents (the full program), each PE here holds only the function bodies (or function chunks) assigned to it by the compiler. This means:

- IM is smaller per PE (only assigned code, not the whole program)
- Different PEs have different IM contents (loaded at bootstrap)
- The compiler emits a per-PE instruction image as part of the program

### Runtime Writability

Instruction memory is **not** read-only. It is writable from the network via IRAM write (prefix 011+01) packets. This serves two purposes:

1. **Bootstrap**: loading programs before execution starts
2. **Runtime reprogramming**: loading new function bodies while other PEs continue executing (future capability, not needed for v0)

Runtime writability also means instruction memory size is not a hard architectural limit -- if a program needs more code than fits in one PE's IM, the runtime (or a management PE) could swap function bodies in and out. Very speculative, but the hardware path exists.

### Implementation

Instruction memory is PE-local SRAM, sharing a chip pair with the frame
region via address partitioning (see Per-PE Memory Map, section 10).
**IRAM width is completely independent of bus width.** It is sized for
encoding needs, not bus constraints.

#### IRAM Width: 16-bit Single-Half Format

Each IRAM slot is **16 bits**, read in a single cycle from one 8-bit-wide
SRAM chip pair. Instruction templates are activation-independent: all
per-activation data (constants, destinations, match operands,
accumulators) lives in the frame.

```
IRAM address = [offset:8] (v0, 8-bit address)
```

256 instruction slots per PE. Total IRAM SRAM usage: 512 bytes per PE.
For programs exceeding 256 instructions, see Future: Bank Switching
with 74LS610 (section 14).
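For reference, packing and unpacking a 16-bit IRAM slot using the field layout defined in the next subsection can be sketched in Python (helper names are illustrative, not part of the assembler):

```python
# 16-bit instruction word: [type:1][opcode:5][mode:3][wide:1][fref:6]
# (bit 15 down to bit 0). Illustrative helpers, not the assembler's API.

def pack_iword(type_, opcode, mode, wide, fref):
    assert type_ < 2 and opcode < 32 and mode < 8 and wide < 2 and fref < 64
    return (type_ << 15) | (opcode << 10) | (mode << 7) | (wide << 6) | fref

def unpack_iword(w):
    return {
        "type":   (w >> 15) & 0x1,   # 0 = CM compute, 1 = SM operation
        "opcode": (w >> 10) & 0x1F,  # 32-entry space per type
        "mode":   (w >> 7)  & 0x7,   # output routing / frame access mode
        "wide":   (w >> 6)  & 0x1,   # 16- vs 32-bit frame values
        "fref":    w        & 0x3F,  # frame slot base index
    }
```

A round trip through `pack_iword`/`unpack_iword` preserves every field, which is a cheap check for an assembler emitting per-PE images.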
#### Instruction Word Format

```
[type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
   15     14-10    9-7      6      5-0
```

| Field  | Bits | Purpose                                                      |
| ------ | ---- | ------------------------------------------------------------ |
| type   | 1    | 0 = CM compute, 1 = SM operation                             |
| opcode | 5    | Operation code (CM and SM have independent 32-entry spaces)  |
| mode   | 3    | Combined tag/frame-reference mode (see mode table below)     |
| wide   | 1    | 0 = 16-bit frame values, 1 = 32-bit (consecutive slot pairs) |
| fref   | 6    | Frame slot base index (64 slots per activation)              |

**type:1** -- operation space select:
```
0 = CM compute operation (ALU)
1 = SM operation (structure memory bus command)
```

**opcode:5** -- 32 slots per type. CM and SM have independent opcode
spaces (32 CM opcodes + 32 SM opcodes). Decoded by EEPROM into control
signals. See `alu-and-output-design.md` for the CM operation set.

**mode:3** -- combined output routing and frame access mode. Controls
whether the instruction emits tokens, how it reads destinations and
constants from the frame, or whether it writes results back to the frame.
See the mode table below.

**wide:1** -- frame value width:
```
0 = 16-bit frame values (single slot per logical value)
1 = 32-bit frame values (consecutive slot pairs per logical value)
```

**fref:6** -- frame slot index (0-63). Base of a contiguous group of
1-3 slots, depending on mode. The instruction template references frame
data exclusively through this field; no per-activation data exists in
the instruction word itself.

Constants and destinations are NOT in the instruction word. They live in
frame slots, referenced by `fref`. The instruction template is pure
control flow: opcode, mode flags, and a frame slot reference.

**Instruction words are never serialised onto the external bus** during
normal execution.
They are only written via PE-local write packets
(prefix 011+01, region=0) during program loading.

#### Mode Table (3-bit `mode` field)

The 3-bit `mode` field encodes both output routing behaviour and frame
access pattern in a single field. Every combination is useful; there are
no wasted encodings.

```
mode  [2:0]  tag behaviour  frame reads at fref          use case
----  -----  -------------  ---------------------------  ----------------------------
 0    000    INHERIT        [dest]                       single output, no constant
 1    001    INHERIT        [const, dest]                single output with constant
 2    010    INHERIT        [dest1, dest2]               fan-out, no constant
 3    011    INHERIT        [const, dest1, dest2]        fan-out with constant
 4    100    CHANGE_TAG     (none)                       dynamic routing, no constant
 5    101    CHANGE_TAG     [const]                      dynamic routing with constant
 6    110    SINK           write result -> frame[fref]  store to frame, no output
 7    111    SINK+CONST     read frame[fref],            local accumulate / RMW
                            write result -> frame[fref]
```

**Bit-level decode equations:**

```
output_enable = NOT mode[2]              modes 0-3: read dest from frame, emit token
change_tag    = mode[2] AND NOT mode[1]  modes 4-5: routing from left operand
sink          = mode[2] AND mode[1]      modes 6-7: no output token, write to frame
has_const     = mode[0]                  modes 1, 3, 5, 7: read constant from frame
has_fanout    = mode[1] AND NOT mode[2]  modes 2-3: read two destinations
```

Frame slot count per mode = `1 + has_const + has_fanout`, read
sequentially starting from `fref`. For SINK modes, `fref` is the write
target. For SINK+CONST (mode 7, read-modify-write), the read and write
target is the same slot (`fref`), with the ALU result writing back
after computation.

**INHERIT (modes 0-3):** output tokens are routed to destinations stored
in frame slots.
Each destination slot holds a pre-formed flit 1 value:

```
frame dest slot: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3]
```

The PE reads the slot and puts it directly on the bus as flit 1. The ALU
result becomes flit 2. Almost zero token formation logic -- the frame
constant IS the output flit. This is the same format as the incoming
token's flit 1, enabling forwarding without repacking.

- **Mode 0** (single output, no constant): frame[fref] = dest. 1 slot.
- **Mode 1** (single output with constant): frame[fref] = const,
  frame[fref+1] = dest. 2 slots.
- **Mode 2** (fan-out, no constant): frame[fref] = dest1,
  frame[fref+1] = dest2. 2 slots.
- **Mode 3** (fan-out with constant): frame[fref] = const,
  frame[fref+1] = dest1, frame[fref+2] = dest2. 3 slots.

**CHANGE_TAG (modes 4-5):** the left operand replaces the frame
destination as flit 1. The entire output flit 1 comes from the left
operand data value (16 bits, verbatim). The right operand becomes flit 2
(payload data). This enables sending a value to any destination computed
at runtime -- the packed tag IS flit 1. No field extraction or assembly.

- **Mode 4** (no constant): no frame reads. Flit 1 = left operand,
  flit 2 = right operand (or ALU result).
- **Mode 5** (with constant): frame[fref] = const. Flit 1 = left
  operand, flit 2 = right operand. The constant feeds the ALU.

The output stage is a mux: frame dest vs left operand, selected by
mode[2]. Hardware: left operand bypass latch (~2 chips) preserves the
left operand value past the ALU. Stage 5 flit 1 mux (~2 chips) selects
between assembled flit and raw data.

**SINK (modes 6-7):** no output token is emitted. The ALU result is
written back to frame[fref]. Used for local accumulation, temporary
storage, and read-modify-write patterns.

- **Mode 6** (write only): ALU result written to frame[fref]. 1 SRAM
  write cycle.
- **Mode 7** (read-modify-write): frame[fref] is read as the constant
  input to the ALU, and the result is written back to frame[fref]. Enables
  in-place accumulation without consuming a separate constant slot.

**dest_type derivation:** the output token format (dyadic wide vs monadic
normal vs monadic inline) is determined by the destination frame slot
contents. Since destination slots hold pre-formed flit 1 values, the
token type is encoded in the prefix bits of the stored flit. The output
stage emits the frame slot verbatim as flit 1; the prefix bits in the
stored value determine the wire format. No runtime type derivation logic
is needed. For SWITCH not-taken paths, the output stage emits a monadic
inline token (hardwired prefix in the formatter, overriding the frame
destination).

#### IRAM Valid-Bit Protection

IRAM is written via PE-local write tokens (prefix 011+01, region=0) that
share the PE's input path with compute tokens. When IRAM contents are
replaced at runtime (swapping function fragments in and out), tokens in
flight may target IRAM addresses that have been or are being overwritten.
Because tokens do not carry instruction identity information, the PE
cannot distinguish "right instruction" from "wrong instruction" -- it
just sees an offset into IRAM.

This is an **instruction identity problem**, not a presence problem. The
dangerous case is not "IRAM is empty" but "IRAM contains a different
instruction than the token expects."

**The mechanism: per-page valid bits.** IRAM is logically divided into
pages (e.g. 8 pages of 32 instructions for a 256-entry IRAM, or 16 pages
of 16). Each page has a 1-bit valid flag, stored in a small TTL register
alongside IRAM. Total hardware cost: one register chip for all page
valid bits.

The valid bit is checked during the IFETCH pipeline stage, in parallel
with the IRAM SRAM read.
The top bits of the token's `offset` field
select the page; the valid bit for that page gates whether the token
proceeds or is rejected.

```
Token arrives at IFETCH:
  page = offset[high bits]
  if valid_bit[page] == 0:
    -> reject token (see rejection policy below)
  else:
    -> proceed with instruction fetch and act_id resolution
```

**IRAM swap protocol.** Because config writes and compute tokens share
the input path, the swap sequence is naturally ordered:

```
1. Loader sends drain signal (implementation TBD -- could be a PE-local
   write "quiesce" flag, or the PE back-pressures via handshake/ready
   signal)
2. PE processes remaining compute tokens in pipeline (natural drain)
3. PE-local write token (prefix 011+01, region=0) arrives:
   a. PE clears valid bit for target page
   b. PE writes instruction word to IRAM at specified address
   c. If more write tokens follow (burst), keep writing
4. Load-complete marker arrives (PE-local write with load-complete flag):
   a. PE sets valid bit for target page
   b. PE resumes accepting compute tokens for that page
```

During steps 3-4, any compute token that arrives targeting an invalid
page is rejected. The shared input path ordering guarantees that tokens
from the *new* code epoch cannot arrive until after the loader has sent
them, which is after the load completes. Rejected tokens are therefore
late arrivals from the old epoch -- work that is being abandoned.

**Presence-bit optimisation:** the frame's presence metadata (in the 670
register files) can be checked before step 1 to determine if any tokens
are pending for offsets in the target page. If all presence bits for
matchable offsets in that page are clear, the drain step can be skipped
entirely -- no tokens are waiting for those instructions. This enables
targeted IRAM replacement without stalling the entire PE.
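The valid-bit check and the swap ordering can be modelled behaviourally. The sketch below assumes the 8-pages-of-32 split for a 256-entry IRAM; the class and method names are illustrative, not taken from the design:

```python
PAGE_SHIFT = 5  # offset[7:5] selects one of 8 pages of 32 instructions

class Iram:
    """Behavioural sketch of per-page valid-bit protection (names are mine)."""
    def __init__(self):
        self.mem = [0] * 256
        self.valid = [False] * 8
        self.discard_led = False        # sticky diagnostic flag (v0 policy)

    def local_write(self, addr, word):
        """PE-local write token: invalidate the target page, then write."""
        self.valid[addr >> PAGE_SHIFT] = False
        self.mem[addr] = word

    def load_complete(self, page):
        """Load-complete marker: re-validate the page."""
        self.valid[page] = True

    def fetch(self, offset):
        """IFETCH-stage check: reject tokens targeting an invalid page."""
        if not self.valid[offset >> PAGE_SHIFT]:
            self.discard_led = True     # discard silently, light the LED
            return None
        return self.mem[offset]

iram = Iram()
iram.local_write(0x21, 0x1234)          # page 1 now invalid
assert iram.fetch(0x21) is None and iram.discard_led
iram.load_complete(1)                   # re-validate page 1
assert iram.fetch(0x21) == 0x1234
```

The sticky `discard_led` flag mirrors the v0 rejection policy: late tokens are dropped, and the only evidence is the diagnostic flag.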
**Rejection policy.** v0: discard silently + diagnostic. Late-arriving
tokens targeting an invalid IRAM page are dropped. The PE sets a sticky
flag (directly driving a diagnostic LED) to indicate that a discard
occurred. This is a "should never happen if the loader protocol is
correct" safety net. The LED makes it visible during debugging without
adding any pipeline complexity.

If the flag lights up, something is wrong with the drain timing.

**Future: NAK response.** The PE could form a NAK token from the rejected
compute token and emit it to a coordinator. The output stage already
exists (forms tokens from frame destinations + ALU results); a bypass
path from IFETCH to the output would enable this. Estimated cost: a mux
and some control logic, ~5-8 TTL chips. Deferred until the runtime is
sophisticated enough to act on NAKs.

**What this does NOT protect against.** The valid-bit mechanism catches
tokens that arrive **during** an IRAM swap (page is invalid). It does
**not** catch tokens that arrive after a swap completes and the page is
re-validated with different code. Preventing that case requires the drain
protocol to be correct -- all tokens from the old epoch must have been
processed or discarded before the new code is marked valid.

For v0, this is a software/loader invariant enforced by the drain
protocol. Future hardening options:

- **Per-page epoch counter (2-3 bits):** incremented on each page reload.
  Checked against an expected epoch stored per-activation or derived from
  spare token bits. Catches post-swap stale tokens at the cost of a
  comparator + epoch storage.
- **Fragment ID register per page:** similar to epoch but identifies the
  fragment by name rather than sequence number. More expensive (wider
  comparator) but more debuggable.

Both options fit in the spare bits reserved in the token formats.
The
valid-bit mechanism is forward-compatible with either.

## 7. SM Operation Dispatch

SM operations (type=1) use the same 16-bit instruction format. The 5-bit
opcode field selects from the SM opcode space (independent of CM opcodes).
`fref` points at frame slots containing SM-specific parameters. The PE
constructs SM tokens from frame data, operand data, and ALU results.

All SM addressing goes through frame slots. There is no separate
"pointer-addressed" vs "constant-addressed" distinction in the instruction
encoding -- the frame slot contents determine the target SM, address, and
any return routing. Both pointer-addressed operations (address from token
data) and constant-addressed operations (address from frame slot) use the
same frame-slot-based encoding.

### SM Bus Opcode Encoding

The SM bus opcode encoding is unchanged -- variable-width with tier 1
(3-bit opcode, 10-bit addr) and tier 2 (5-bit opcode, 8-bit payload).
See `sm-design.md` for the full opcode table:

```
Tier 1 (3-bit, 1024-cell addr range):
  read, write, alloc, free, exec, ext

Tier 2 (5-bit, 256-cell addr range or 8-bit payload):
  rd_inc, rd_dec, cas, raw_rd, clear, set_pg, write_im, (spare)
```

### SM Token Construction

The PE output stage builds SM tokens on the wire. The instruction's SM
opcode (in the PE's 5-bit opcode field) maps to the SM bus opcode via
the instruction decoder EEPROM. Frame slots provide addressing and
return routing parameters:

- **SM flit 1** is constructed from frame[fref] contents (SM_id, address)
  plus the SM bus opcode from the decoder.
- **SM flit 2** source depends on the operation (see flit 2 source mux
  below).
- **SM flit 3** (CAS and EXT only) carries additional data.
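One way to picture the decoder EEPROM's role in SM dispatch is as a lookup table from PE opcode to per-operation control signals. The entries below are invented placeholders (the document fixes neither EEPROM contents nor signal encodings), chosen to be consistent with the operation behaviour described in this section:

```python
# Illustrative decoder-EEPROM fragment for SM dispatch. Keys and values
# are placeholders; only the relationships (read -> frame return routing,
# write -> ALU data, CAS -> 3 flits with L operand) come from the design.
SM_DECODE = {
    "SM_READ":  {"tier": 1, "flit2_src": "frame",  "extra_flit": 0},
    "SM_WRITE": {"tier": 1, "flit2_src": "alu",    "extra_flit": 0},
    "SM_CAS":   {"tier": 2, "flit2_src": "l_oper", "extra_flit": 1},
}

def decode_sm(pe_opcode: str) -> dict:
    """Model the EEPROM lookup: opcode in, control signals out."""
    return SM_DECODE[pe_opcode]

assert decode_sm("SM_CAS")["extra_flit"] == 1   # CAS emits 3 flits
```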
### Frame Slot Packing for SM Parameters

A single 16-bit frame slot holds the SM target:

```
Tier 1 target slot: [SM_id:2][addr_high:2][addr_low:8][spare:4] = 16 bits
                    (addr = 10 bits, 1024-cell range)

Tier 2 target slot: [SM_id:2][addr:8][spare:6] = 16 bits
                    (addr = 8 bits, 256-cell range)
```

For operations needing return routing, the next consecutive frame slot
holds a pre-formed response token flit 1:

```
Return routing slot: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3] = 16 bits
```

This is the same format as a CM destination slot -- the SM response token
routes as a normal compute token back to the requesting PE. The SM treats
flit 2 of the request as an opaque 16-bit blob and echoes it as flit 1
of the response (see `sm-design.md` Result Format).

### Flit 2 Source Mux

Different SM operations need different data in flit 2. The source is
selected by a 2-bit signal derived from the instruction decoder (SM
opcode + mode):

```
source   select  use case
-------  ------  -----------------------------------------------
ALU out  00      SM write (operand is write data, passes through ALU),
                 SM write_im (immediate write). Default for CM compute.
R oper   01      SM scatter write (ALU computes addr from base + L operand;
                 R operand is write data, bypasses ALU to flit 2).
                 SM CAS flit 2 (expected value = L operand).
Frame    10      SM read / rd_inc / rd_dec / raw_rd / alloc return routing
                 (frame[fref+1] = pre-formed response token flit 1).
                 SM exec parameters.
(spare)  11      reserved
```

Hardware: cascaded 74LS157 (quad 2:1 mux) pairs, ~4-6 chips for 16-bit
width.
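As an executable cross-check of the target-slot layouts above, a packing sketch (function names are mine):

```python
def pack_tier1_target(sm_id: int, addr: int) -> int:
    """Pack [SM_id:2][addr_high:2][addr_low:8][spare:4] into one 16-bit slot."""
    assert 0 <= sm_id < 4 and 0 <= addr < 1024
    return (sm_id << 14) | ((addr >> 8) << 12) | ((addr & 0xFF) << 4)

def pack_tier2_target(sm_id: int, addr: int) -> int:
    """Pack [SM_id:2][addr:8][spare:6] into one 16-bit slot."""
    assert 0 <= sm_id < 4 and 0 <= addr < 256
    return (sm_id << 14) | (addr << 6)

slot = pack_tier1_target(sm_id=2, addr=0x3A7)
assert slot >> 14 == 2                # SM_id in the top 2 bits
assert (slot >> 4) & 0x3FF == 0x3A7   # contiguous 10-bit address in bits 13:4
```

Note that the 10-bit address lands contiguously in bits 13:4 of the slot, which is what lets the output stage later concatenate slot fields onto the SM bus without runtime muxing.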
### SM Operation Mapping Table

| SM bus op | PE opcode | frame slots | mode | operands | flits | flit 2 source | notes |
|-----------|-----------|-------------|------|----------|-------|---------------|-------|
| read | SM_READ | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | indexed variant: ALU adds base + index |
| write | SM_WRITE | 1: target | 0 | monadic (data) | 2 | ALU (write data) | |
| write (scatter) | SM_WRITE_IX | 1: target | 1 | dyadic (index, data) | 2 | R operand (write data) | ALU: base + index |
| alloc | SM_ALLOC | 2: params + return | 1 | monadic (trigger) | 2 | frame (return routing) | |
| free | SM_FREE | 1: target | 0 | monadic (trigger) | 2 | don't-care | |
| exec | SM_EXEC | 2: target + params | 1 | monadic (trigger) | 2 | frame (params/count) | |
| ext | SM_EXT | 1-2: varies | varies | varies | 3 | varies | 3-flit extended addressing |
| rd_inc | SM_RDINC | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | atomic read-and-increment |
| rd_dec | SM_RDDEC | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | atomic read-and-decrement |
| cas | SM_CAS | 1: target | 0 | dyadic (expected, new) | 3 | L operand (expected) | 3-flit; return via prior read |
| raw_rd | SM_RAWRD | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | non-blocking, no deferred read |
| clear | SM_CLEAR | 1: target | 0 | monadic (trigger) | 2 | don't-care | resets cell to EMPTY |
| set_pg | SM_SETPG | 1: target | 0 | monadic (page value) | 2 | ALU (page value) | SM-side bank switching |
| write_im | SM_WRIM | 1: target | 0 | monadic (data) | 2 | ALU (write data) | immediate write, tier 2 addr |

### SM Flit 1 Assembly

Stage 5 assembles SM flit 1 from frame[fref] and the wire opcode from
the decoder EEPROM:

```
Tier 1:
  flit 1
    = [1][frame[fref][15:14] (SM_id)][wire_opcode:3][frame[fref][13:4] (addr:10)]

Tier 2:
  flit 1
    = [1][frame[fref][15:14] (SM_id)][wire_opcode:5][frame[fref][13:6] (addr:8)]
```

Hardware: the `[1]` prefix is hardwired. SM_id comes from the top 2 bits
of the frame slot. The wire opcode comes from the EEPROM. The address
comes from the remaining frame slot bits. All fields are concatenated
on the wire with no runtime muxing -- the frame slot is pre-packed so
that the bit positions align with the SM bus format. ~1-2 chips for
output gating and serialisation.

### Indexed Address Computation

For indexed READ, scatter WRITE, and other address-computed operations,
the ALU computes the effective address: base address (extracted from
frame[fref]) + index (from the left operand). The computed address is
packed into SM flit 1 by the output stage. The ALU performs address
arithmetic without dedicated address-computation hardware.

### CAS: 3-Flit SM Token

Compare-and-swap requires address, expected value, and new value. This
exceeds the standard 2-flit SM packet.

```
CAS emission (3 flits):
  flit 1: SM header (SM_id + CAS opcode + addr, from frame[fref])
  flit 2: expected value (left operand)
  flit 3: new value (right operand)
```

The output serialiser emits 3 flits instead of the default 2. An
`extra_flit` signal from the instruction decoder (asserted for CAS and
EXT-mode ops) increments the serialiser's flit counter limit:
`flit_count = 2 + extra_flit`. One gate.

Return routing for CAS uses the prior-READ pattern: issue an SM READ to
the target cell first (plants return routing in the SM's deferred-read
register), receive the current value as the response (which also provides
the expected value), then issue the CAS with the known expected value
and the desired new value.
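The flit 1 layouts and the `extra_flit` rule can be combined into a small behavioural model. The numeric wire opcode used in the test is an assumed placeholder -- the document does not assign opcode values:

```python
def sm_flit1(frame_slot: int, wire_opcode: int, tier: int = 1) -> int:
    """Assemble SM flit 1 from a pre-packed frame slot and the EEPROM opcode.

    Tier 1: [1][SM_id:2][opcode:3][addr:10]; tier 2: [1][SM_id:2][opcode:5][addr:8].
    """
    sm_id = frame_slot >> 14
    if tier == 1:
        addr = (frame_slot >> 4) & 0x3FF   # frame[fref][13:4]
        return (1 << 15) | (sm_id << 13) | (wire_opcode << 10) | addr
    addr = (frame_slot >> 6) & 0xFF        # frame[fref][13:6]
    return (1 << 15) | (sm_id << 13) | (wire_opcode << 8) | addr

def serialise(flit1, flit2, flit3=None, extra_flit=0):
    """Output serialiser sketch: flit_count = 2 + extra_flit (CAS/EXT)."""
    return [flit1, flit2, flit3][:2 + extra_flit]

# Tier 1 slot: SM_id=2, addr=0x3A7, with an assumed wire opcode of 1.
slot = (2 << 14) | (3 << 12) | (0xA7 << 4)
f1 = sm_flit1(slot, wire_opcode=1, tier=1)
assert f1 >> 15 == 1 and (f1 >> 13) & 3 == 2 and f1 & 0x3FF == 0x3A7
# CAS-style 3-flit emission: expected value, then new value.
assert len(serialise(f1, 41, 42, extra_flit=1)) == 3
```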
### SM Flit Assembly Hardware Cost

| Component | Chips/PE | Purpose |
|-----------|----------|---------|
| Flit 2 source mux (16-bit, 4:1) | ~4-6 | ALU out / R oper / Frame / spare |
| SM flit 1 gating | ~1-2 | Frame target slot + EEPROM opcode to bus |
| Extra flit control | ~0.5 | CAS/EXT 3-flit counter |

## 8. EEPROM-Based Instruction Decoding

The instruction decoder can be implemented as an EEPROM acting like a
PLD. Input bits = instruction opcode fields + PE ID bits. Output bits =
control signals for the ALU, matching store, token output formatter, etc.

This gives significant flexibility:

- Instruction set can be changed by reflashing the EEPROM (no board changes)
- Per-PE customisation (different PEs could theoretically have different
  instruction subsets, though unlikely for v0)
- The PE ID is "free" -- it's just more EEPROM address bits

## 9. The 670 Subsystem: Act ID Lookup, Match Metadata, and SC Register File

### Role in the Frame-Based Architecture

The 74LS670s serve two critical functions:

1. **act_id -> frame_id lookup table.** Indexed by the token's 3-bit
   `activation_id`, outputs `{valid:1, frame_id:2, spare:1}` in
   ~35 ns (combinational). This replaces what would otherwise be an
   SRAM cycle for associative tag comparison.

2. **Presence and port metadata store.** Indexed by `frame_id`,
   stores presence and port bits for all 8 matchable offsets across
   all 4 frames. Combinational read (~35 ns after frame_id settles,
   ~70 ns total from act_id presentation).

Both functions complete within stage 2, in parallel with the IRAM
read. By the time stage 3 begins, the PE knows frame_id, presence,
and port -- the only remaining SRAM access is the actual operand
data.

### Hardware Configuration

**act_id -> frame_id (2x 74LS670):**

Addressed by `act_id[1:0]` with `act_id[2]` selecting between chips.
Each chip holds 4 words x 4 bits. Output: `{valid:1, frame_id:2,
spare:1}`.

```
ALLOC:  write {valid=1, frame_id} at address act_id (670 write port)
FREE:   write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns
```

The 670's independent read and write ports allow ALLOC to proceed
while the pipeline reads -- zero conflict.

**Presence + port metadata (4x 74LS670):**

Each 670 word (4 bits) holds presence+port for 2 offsets:
`{presence_N:1, port_N:1, presence_N+1:1, port_N+1:1}`.
Read address = `[frame_id:2]`. Output bits selected by
`offset[2:0]` via bit-select mux.

```
670 chip 0 (offsets 0-1): word[frame_id] = {pres0, port0, pres1, port1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2, port2, pres3, port3}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4, port4, pres5, port5}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6, port6, pres7, port7}
```

`offset[2:1]` selects chip, `offset[0]` selects which pair of bits
within the 4-bit output (a 2:1 mux -- one gate).

The 670's simultaneous read/write is critical: during stage 3, when
a first operand stores and sets presence, the write port updates the
presence 670 while the read port remains available for the next
pipeline stage's lookup. No read-modify-write sequencing needed.

**Bit select mux (1-2 chips):**

Offset-based selection of the relevant presence and port bits from
the 670 outputs.
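A behavioural model of both 670 functions (act_id lookup plus presence/port metadata), ignoring propagation timing; class and method names are illustrative:

```python
class MatchMetadata:
    """Sketch of the 670-based metadata path (timing and chip-level
    read/write ports are not modelled)."""
    def __init__(self):
        self.act_map = [(0, 0)] * 8                   # act_id -> (valid, frame_id)
        self.pp = [[(0, 0)] * 8 for _ in range(4)]    # [frame_id][offset] -> (presence, port)

    def alloc(self, act_id, frame_id):
        self.act_map[act_id] = (1, frame_id)          # 670 write port

    def free(self, act_id):
        self.act_map[act_id] = (0, 0)

    def set_presence(self, frame_id, offset, port):
        """First operand stored: record presence and arrival port."""
        self.pp[frame_id][offset] = (1, port)

    def lookup(self, act_id, offset):
        """Combinational path: act_id -> frame_id, then bit-select
        presence/port for this offset."""
        valid, frame_id = self.act_map[act_id]
        presence, port = self.pp[frame_id][offset]
        return valid, frame_id, presence, port

m = MatchMetadata()
m.alloc(5, frame_id=2)
m.set_presence(2, offset=3, port=1)
assert m.lookup(5, 3) == (1, 2, 1, 1)
```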
### Chip Budget

| Component                 | Chips  | Function                     |
|---------------------------|--------|------------------------------|
| act_id -> frame_id lookup | 2      | 74LS670, indexed by act_id   |
| Presence + port metadata  | 4      | 74LS670, indexed by frame_id |
| Bit select mux            | 1-2    | offset-based selection       |
| **Total match metadata**  | **~8** |                              |

### SC Register File (Mode-Switched)

During **dataflow mode**, the PE uses act_id resolution and presence
metadata constantly but the SC register file is idle (no SC block
executing). During **SC mode**, the PE uses the register file
constantly but act_id lookup and presence tracking are idle (the SC block
has exclusive PE access; no tokens enter matching).

Some of the 670s can be repurposed for register storage during SC
mode. The exact mapping depends on the SC block design:

- The 4 presence+port 670s (indexed by frame_id in dataflow mode) can
  be re-addressed by instruction register fields during SC mode,
  providing 4 chips x 4 words x 4 bits = 64 bits of register storage.
  Combined across chips, this gives **4 registers x 16 bits** (4 bits
  per chip, 4 chips for width).

- With additional mux logic, all 6 shared 670s (excluding the act_id
  lookup pair, which may need to remain active for frame lifecycle
  management) could provide **6 registers x 16 bits** during SC mode.

The act_id lookup 670s may need to remain in their dataflow role even
during SC mode if the PE must handle frame control tokens (ALLOC/FREE)
arriving during SC block execution. Whether to share them depends on
the SC block entry/exit protocol.
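The 4-registers-x-16-bits repurposing amounts to slicing each register into nibbles across the four chips. A behavioural sketch (not a wiring description; names are mine):

```python
class Shared670Bank:
    """Four 74LS670s (4 words x 4 bits each). In dataflow mode they hold
    presence/port metadata; in SC mode the same storage is addressed as
    4 registers x 16 bits, one nibble per chip."""
    def __init__(self):
        self.chips = [[0] * 4 for _ in range(4)]   # chips[c][word] = 4-bit nibble

    def sc_write(self, reg, value):
        """SC-mode register write: distribute one nibble to each chip."""
        for c in range(4):
            self.chips[c][reg] = (value >> (4 * c)) & 0xF

    def sc_read(self, reg):
        """SC-mode register read: reassemble 16 bits from the four chips."""
        return sum(self.chips[c][reg] << (4 * c) for c in range(4))

bank = Shared670Bank()
bank.sc_write(2, 0xBEEF)
assert bank.sc_read(2) == 0xBEEF
```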
### The Predicate Slice

One of the 670s can be **permanently dedicated as a predicate
register** rather than participating in the mode-switched pool:

- 4 entries x 4 bits = 16 predicate bits, always available
- Useful for: conditional token routing (SWITCH), loop termination
  flags, SC block branch conditions, I-structure status flags
- Does not reduce the metadata capacity significantly: the remaining
  3 presence+port 670s still cover 6 of the 8 matchable offsets;
  the 2 uncovered offsets can fall back to SRAM-based presence or
  simply constrain the assembler to 6 dyadic offsets per frame

The predicate register is always readable and writable regardless of
mode, since it's a dedicated chip with its own address/enable lines.
Instructions can test or set predicate bits without going through the
matching store or the ALU result path.

### Mode Switching

When transitioning from dataflow mode to SC mode:

1. **Save metadata** from the shared 670s to spill storage.
2. **Load initial SC register values** (matched operand pair that
   triggered the SC block) into the 670s.
3. **Switch address mux**: 670 address lines now driven by
   instruction register fields instead of frame_id / act_id.
4. **Switch IRAM to counter mode**: sequential fetch via incrementing
   counter rather than token-directed offset.

When transitioning back:

1. **Emit final SC result** as token (last instruction with OUT=1).
2. **Restore metadata** from spill storage to the 670s.
3. **Switch address mux back** to frame_id / act_id addressing.
4. **Resume token processing** from input FIFO.

### Spill Storage Options

Metadata from the shared 670s (~64-96 bits depending on how many
are shared) needs temporary storage during SC block execution.
**Option A: Shift registers.** 2x 74LS165 (parallel-in, serial-out)
for save + 2x 74LS595 (serial-in, parallel-out) for restore. Total:
4 chips. Save/restore takes ~12 clock cycles each.

**Option B: Dedicated spill 670.** One additional 74LS670 (4x4 bits)
holds 16 bits per save cycle; need ~4-6 write cycles to save all
shared chips' contents. Total: 1 chip, ~4-6 cycles per save/restore.

**Option C: Spill to frame SRAM.** During SC mode, the frame SRAM
has bandwidth available (no match operand reads). Write the 670
metadata contents into a reserved region of the frame SRAM address
space. No extra chips needed. ~4-6 SRAM write cycles to save, ~4-6
to restore. The SRAM is single-ported but there's no contention
because the pipeline is paused during mode switch.

**Recommended: Option C.** Zero additional chips. The save/restore
overhead of ~4-6 cycles per transition is negligible compared to the
SC block's execution savings (EM-4 data: 23 clocks pure dataflow vs
9 clocks SC for Fibonacci, so even with ~10 cycles of mode switch
overhead, you break even at ~5-7 SC instructions).

## 10. SRAM Configuration and Memory Map

### Unified SRAM Chip Pair

The PE uses a single 32Kx8 chip pair (2 chips for 16-bit data width)
for both IRAM and frame storage, with address partitioning via a
single decode bit. The recommended part is the AS6C62256 (55 ns,
32Kx8, DIP-28) or equivalent. 55 ns access time fits comfortably
within a 200 ns clock period at 5 MHz, with margin for address setup
and data hold.

The unified SRAM approach keeps chip count low: one chip pair per PE
serves both IRAM and frame storage, avoiding the chip proliferation
that separate matching store and IRAM memories would require.
### Address Map

```
v0 address space (simple decode):

  IRAM region:  [0][offset:8]           instruction templates
                offset from token
                capacity: 256 instructions (512 bytes)

  Frame region: [1][frame_id:2][slot:6] per-activation storage
                frame_id from tag store resolution
                capacity: 4 frames x 64 slots = 256 entries (512 bytes)
```

Total v0 SRAM utilisation: IRAM 512 bytes, frame 512 bytes. Under 1.5
KB used out of a 32Kx8 chip pair (64 KB). Ample room for future
expansion without changing chips. See Future: Bank Switching with
74LS610 (section 14) for the upgrade path when programs exceed 256
instructions per PE.

### Shared SRAM Arbitration

The unified SRAM chip pair is shared between three access patterns:

- Pipeline IRAM reads (stage 2, instruction fetch): high frequency,
  performance-critical
- Pipeline frame reads/writes (stages 3 and 5): high frequency,
  performance-critical
- PE-local write tokens (IRAM and frame loading): low frequency, can
  tolerate delay

**Arbitration approach**: PE-local writes execute when the frame SRAM is
not busy with compute pipeline accesses (natural gaps between pipeline
stages). When no gap is available, writes queue and execute during the
next idle cycle. Hardware cost: mux on SRAM address/data buses +
write-enable gating + stall signal to pipeline. Roughly 5-8 TTL chips.

**IRAM vs frame contention**: in v0, IRAM and frame share one SRAM chip
pair via address partitioning (region bit in the address). Stage 2
(IRAM read) and stage 3/5 (frame read/write) access different address
regions but contend for the same physical chip. The pipeline controller
ensures only one stage accesses the SRAM per cycle. With the natural
pipeline spacing, this rarely causes stalls -- see the frame SRAM
contention model in Pipeline Stall Analysis (section 11).
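The two-region decode behind this partitioning can be cross-checked executably; helper names are mine:

```python
def iram_addr(offset: int) -> int:
    """[0][offset:8] -- region bit 8 clear selects the IRAM region."""
    assert 0 <= offset < 256
    return offset

def frame_addr(frame_id: int, slot: int) -> int:
    """[1][frame_id:2][slot:6] -- region bit 8 set selects the frame region."""
    assert 0 <= frame_id < 4 and 0 <= slot < 64
    return (1 << 8) | (frame_id << 6) | slot

assert frame_addr(3, 5) == 0b1_11_000101   # = 453
assert frame_addr(0, 0) == 256             # frame region starts just above IRAM
assert iram_addr(255) == 255               # top of the IRAM region
```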
**Upgrade path**: separating IRAM and frame onto independent SRAM chip
pairs eliminates all inter-region contention. Stage 2 (IRAM) and
stage 3/5 (frame) can access their respective chips in the same cycle.

**Async-compatible arbitration**: defined as request/grant interface.
Synchronous implementation: priority mux resolved on clock edge. Async
implementation: mutual exclusion element (Seitz arbiter). Interface is
the same in both cases. See `network-and-communication.md` for clocking
discipline.

## 11. Pipeline Stall Analysis

### The Frame SRAM Contention Problem

With Approach C (670 lookup), act_id -> frame_id resolution is
combinational (~35 ns via 670 read port), and the presence/port
check is also combinational (~35 ns from a second set of 670s).
There is no read-modify-write on SRAM for metadata -- metadata
lives entirely in the 670 register files.

The primary bottleneck is **frame SRAM contention between stage 3 and
stage 5**. Both stages access the same single-ported SRAM chip pair:

- **Stage 3** reads/writes operand data (dyadic match) and reads
  constants (modes with has_const=1).
- **Stage 5** reads destinations (modes 0-3), or writes results back
  to the frame (sink modes 6-7).

When two pipelined tokens have stage 3 and stage 5 active in the
same cycle, the SRAM can serve only one. The other stalls.

### The Pipeline Hazard

The classic RAW hazard still exists but takes a different form. Two
consecutive tokens targeting the same frame slot (e.g., two mode 7
read-modify-write operations on the same accumulator slot) create a
data dependency: the second token's stage 3 read must see the first
token's stage 5 write.

Detection requires comparing (act_id, fref) of the incoming token
against in-flight pipeline latches at stages 3-5.
Hardware cost: ~2
chips (9-bit comparator + AND gate). Alternatively, the assembler
can guarantee this never happens by never emitting consecutive mode 7
tokens to the same slot on the same PE.

This hazard is **statistically uncommon** in dataflow execution. Two
operands arriving back-to-back at the exact same frame slot requires
coincidental timing. The bypass path is cheap insurance that fires
infrequently.

### SRAM Contention Model

The frame SRAM chip is single-ported (one access per clock cycle at
5 MHz with 55 ns SRAM). The primary stall source is contention
between stage 3 (frame reads for operand data and constants) and
stage 5 (frame reads for destinations, or frame writes for sink
modes).

**Contention arises only when:**

- Token A is at stage 5, needing a frame SRAM read (dest) or write
  (sink), AND
- Token B is at stage 3, needing a frame SRAM read (match operand,
  constant, or tag word).

**Contention does NOT arise when:**

- Token A's stage 5 is mode 4/5 (change_tag -- no SRAM access).
- Token B's stage 3 is zero-cycle (monadic no-const, or match data
  in register file with no const).
- Token A was a dyadic miss (terminated at stage 3, never reaches
  stage 5).

### Cycle Counts by Instruction Type

**Approach C (74LS670 lookup, recommended v0):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Stage 3 breakdown for Approach C:

- Dyadic hit: 1 SRAM cycle to read stored operand (frame_id and
  presence already known from 670).
  +1 cycle for constant if
  has_const=1.
- Dyadic miss: 1 SRAM cycle to write operand data. 670 write port
  sets presence bit combinationally in parallel.
- Monadic: 0 SRAM cycles (no match), +1 for constant if has_const=1.

**Approach B (register-file match pool):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Approaches B and C produce identical single-token cycle counts. The
difference emerges under pipelining: Approach B's match data never
touches the frame SRAM (operands stored in a dedicated register
file), so stage 3's only SRAM access is the constant read. This
reduces stage 3 vs stage 5 SRAM contention.

**Approach A (set-associative tags in SRAM, minimal chips):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     2     --    --    4
dyadic hit, mode 0               1     1     2     1     1     6
dyadic hit, mode 1               1     1     3     1     1     7
dyadic hit, mode 3 (fan+const)   1     1     3     1     2     8
```

Approach A adds 1 extra SRAM cycle per dyadic operation (tag word
read + associative compare) because act_id resolution is not
combinational.

### Pipeline Overlap Analysis

With single-port frame SRAM at 5 MHz, the pipeline controller must
arbitrate between stage 3 and stage 5. When both need SRAM in the
same cycle, stage 3 stalls.
**Approach B, two consecutive dyadic-hit mode 1 tokens:**

```
cycle 0: A.stg1
cycle 1: A.stg2 (IRAM)
cycle 2: A.stg3 match (reg file)                       -- frame SRAM FREE
cycle 3: A.stg3 const (SRAM)
cycle 4: A.stg4 (ALU)                                  -- frame SRAM FREE
cycle 5: A.stg5 dest (SRAM)   B.stg3 match (reg file)  -- NO CONFLICT
cycle 6: (A done)             B.stg3 const (SRAM)
cycle 7:                      B.stg4 (ALU)
cycle 8:                      B.stg5 dest (SRAM)       -- NO CONFLICT
```

Token spacing: 4 cycles. Approach A under the same conditions: ~6-7
cycles due to additional SRAM contention in stage 3.

### Throughput Summary

Per PE, at 5 MHz, single-port frame SRAM:

| Instruction mix profile | Approach A | Approach C | Approach B |
|-------------------------|------------|------------|------------|
| Monadic-heavy (mode 0/4/6) | ~1.25 MIPS | ~1.67 MIPS | ~1.67 MIPS |
| Mixed (40% dyadic mode 1, 30% monadic, 30% misc) | ~833 KIPS | ~1.25 MIPS | ~1.25 MIPS |
| Dyadic-heavy with constants | ~714 KIPS | ~1.00 MIPS | ~1.00 MIPS |
| Worst case (mode 3, const+fanout) | ~625 KIPS | ~714 KIPS | ~714 KIPS |

4-PE system: multiply by 4. Realistic mixed workload: ~3.3-5.0 MIPS
(A), ~5.0-6.7 MIPS (C), or ~5.0-6.7 MIPS (B). For reference: the
original Amamiya DFM prototype (TTL, 1982) achieved 1.8 MIPS per PE.
The EM-4 prototype (VLSI gate array, 1990) achieved 12.5 MIPS per PE.
This design sits between the two, closer to the DFM, which is
historically appropriate for a discrete TTL build.

### Pipeline Timing by Era

With the 670-based matching subsystem (Approach C), act_id
resolution and presence/port checking are combinational (~35-70 ns)
**regardless of era**. These never become the timing bottleneck.

The era-dependent part is **SRAM access time** for frame reads and
writes.
This determines how many SRAM operations fit per clock cycle
and thus how much stage 3 vs stage 5 contention exists.

**1979-1983 (5 MHz, 55 ns SRAM):**

```
670 metadata: combinational (~35-70 ns), well within 200 ns cycle
Frame SRAM:   one access per 200 ns cycle (55 ns access + setup/hold margin)
Bottleneck:   frame SRAM single-port, stage 3 vs stage 5 contention
SC block throughput: ~1 instruction per clock (670 dual-port)
Overall token throughput: ~1 token per 3-5 clocks (pipelined, mode-dependent)
```

**1984-1990 (5-10 MHz, dual-port SRAM):**

```
670 metadata: combinational (unchanged)
Frame SRAM:   dual-port (IDT7132 or similar), port A for stage 3, port B for stage 5
Bottleneck:   eliminated -- both stages access SRAM simultaneously
SC block throughput: ~1 instruction per clock
Overall token throughput: approaches 1 token per 3 clocks for most modes
```

Dual-port SRAM eliminates the primary stall source. The pipeline
becomes instruction-latency-limited rather than SRAM-contention-limited.

**Modern parts (5 MHz clock, 15 ns SRAM):**

```
670 metadata: combinational (unchanged)
Frame SRAM:   15 ns access, ~13 accesses fit in 200 ns cycle
Practical:    2-3 sub-cycle accesses via time-division multiplexing
Bottleneck:   none -- frame SRAM has excess bandwidth
Token throughput: 1 token per 3 clocks (pipeline-stage-limited, not SRAM-limited)
```

With 15 ns AS7C256B-15PIN (DIP, currently available at ~$3), two
sub-cycle accesses fit within a 200 ns clock period. This achieves
TDM-like parallelism without additional MUX logic, effectively
giving the pipeline a dual-port view of a single-port chip.
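The per-era budgets above come down to dividing the clock period by the effective access time. A minimal sketch (the 100 ns setup/hold margin is an illustrative figure; only the 55 ns / 15 ns access times and 5 MHz clock are from this section):

```python
def sram_accesses_per_cycle(clock_hz: float, access_ns: float,
                            margin_ns: float = 0.0) -> int:
    """How many single-port SRAM accesses fit in one clock period,
    allowing a per-access setup/hold margin."""
    period_ns = 1e9 / clock_hz
    return int(period_ns // (access_ns + margin_ns))

# 1979-1983: 55 ns parts at 5 MHz leave room for only one margined
# access per 200 ns cycle.
era_1979 = sram_accesses_per_cycle(5e6, 55, margin_ns=100)  # -> 1

# Modern: 15 ns parts at 5 MHz give raw bandwidth for ~13 accesses,
# of which the pipeline only needs two (stage 3 + stage 5 sub-cycles).
modern = sram_accesses_per_cycle(5e6, 15)                   # -> 13
```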
**Integrated (on-chip SRAM, sub-ns access):**

```
670 equivalent: on-chip multi-ported register file, ~200 transistors
Frame SRAM:     on-chip, sub-cycle access trivially
Token throughput: 1 per 3 clocks, potentially faster with deeper pipelining
```

## 12. SC Blocks and Execution Modes

### PE-to-PE Pipelining

When multiple PEs are chained for software-pipelined loops, the per-PE
pipeline throughput determines the overall chain throughput.

With the pipelined design (1 token per 3-5 clocks depending on
instruction mix and era), the inter-PE hop cost becomes the critical
path for chained execution:

| Interconnect | Hop latency | Viable? |
|-------------|-------------|---------|
| Shared bus (discrete build) | 5-8 cycles | Marginal -- chain overhead dominates |
| Dedicated FIFO between adjacent PEs | 2-3 cycles | Worthwhile for tight loops |
| On-chip wide parallel link (integrated) | 1-2 cycles | Competitive with intra-PE SC block |

For the discrete v0 build, dedicated inter-PE FIFOs (bypassing the
shared bus) would enable PE chaining at reasonable cost. This is a
low-chip-count addition (~2-4 chips per PE pair) that unlocks
software-pipelined loop execution.

**Loopback bypass.** When a PE emits a token destined for itself
(common in iterative computations), the token can be looped back
internally without traversing the bus at all. See
`bus-interconnect-design.md` for the loopback bypass design, which
eliminates the bus hop latency entirely for self-targeted tokens.
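Under the simplifying assumption that a chain stage cannot overlap its token processing with the outbound hop, the steady-state iteration interval is just their sum. A sketch using the hop latencies from the table above and an illustrative 4-cycle per-PE pipeline time:

```python
def chain_iteration_interval(pe_cycles: int, hop_cycles: int) -> int:
    """Cycles between successive iterations emerging from a
    software-pipelined PE chain, limited by its slowest stage
    (per-PE pipeline time + inter-PE hop), assuming no overlap
    between processing and the hop."""
    return pe_cycles + hop_cycles

shared_bus = chain_iteration_interval(4, 8)      # 12 cycles/iteration
dedicated_fifo = chain_iteration_interval(4, 2)  # 6 cycles/iteration
```

With the shared bus the hop is two-thirds of the interval, which is the "chain overhead dominates" verdict in the table; dedicated FIFOs bring the interval close to the PE's own pipeline time.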

### The Execution Mode Spectrum

The pipelined PE with frame-based storage, SC blocks, and predicate
register supports a spectrum of execution modes, selectable by the
compiler per-region:

| Mode | Pipeline behaviour | Throughput | When to use |
|------|-------------------|-----------|-------------|
| Pure dataflow | Token -> ifetch -> match/frame -> exec -> output | 1 token / 3-7 clocks (mode-dependent) | Parallel regions, independent ops |
| SC block (register) | Sequential IRAM fetch, 670 register file | ~1 instr / clock | Short sequential regions |
| SC block + predicate | As above, with conditional skip/branch via predicate bits | ~1 instr / clock | Conditional sequential regions |
| PE chain (software pipeline) | Tokens flow PE0->PE1->PE2, each PE handles one stage | 1 iteration / PE-pipeline-depth clocks | Loop bodies across PEs |
| SM-mediated sequential | Tokens to/from SM for memory-intensive work | SM-bandwidth-limited | Array/structure traversal |

The compiler partitions the program graph and selects the best mode
for each region. This spectrum is arguably more expressive than what a
modern OoO core offers (which has exactly one mode: "pretend to be
sequential, discover parallelism at runtime").

## 13. Instruction Residency and Code Loading

### Why This Matters

Unlike Manchester, Amamiya, or Monsoon -- which either replicated the
entire program into every PE's instruction memory or used very large
per-PE instruction stores -- this design has **small IRAM per bank**
(256 entries) with runtime-writable instruction memory. Without bank
switching, any program larger than a single PE's IRAM needs code loading
at runtime, even under fully static PE assignment.

**With bank switching** (see section 14), each PE
holds up to 4096 instructions across 16 banks using the same SRAM chips.
This substantially reduces the pressure on runtime code loading -- most
programs' full working set fits in the preloaded banks, and switching
between function fragments costs a single register write instead of
IRAM rewrite traffic. The code storage hierarchy and loader mechanisms
below remain relevant for programs that exceed the banked capacity, but
bank switching makes that the exception rather than the rule.

The 16-bit single-half instruction format provides good IRAM density:
one instruction per SRAM address. The effective capacity with bank
switching (4096 instructions) is substantial for the target workloads,
using only a single SRAM chip pair per PE.

The reference architectures largely avoid the residency problem by
throwing memory at it: Amamiya's 8KW/PE replicated instruction memory,
Manchester's large instruction store, Monsoon's 64K-instruction frames.
Bank switching gives us a comparable effective capacity (4K instructions)
with much less hardware than full replication.

### Proactive Loading (Primary Mechanism)

The primary approach is **software-managed prefetch**: the compiler
assigns a PE (typically the least-utilized one) to pull instruction
pages from storage and load them onto the bus in advance of when
they're needed. This is part of the program graph itself -- the loader
PE calls the `exec` SM instruction, which reads out pre-constructed
tokens onto the bus.

This fits naturally into the dataflow paradigm:
- The loader PE is just another participant in the token network
- Its "inputs" are load requests (tokens from other PEs or the scheduler)
- Its "outputs" are config write packets that load IRAM
- The compiler can schedule prefetches to overlap with computation on other PEs

The loader PE could be dedicated (always running loader code) or could
itself have its code swapped depending on system phase.
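The loader PE's request-to-config-write transformation is pure token processing. A behavioural sketch -- the record fields, page naming, and `code_store` structure here are illustrative, not the real flit layout or `exec` semantics:

```python
def loader_step(request, code_store):
    """One loader-PE 'firing': consume a load-request token naming a
    (target_pe, page) pair and produce config-write packets that fill
    the target's IRAM with pre-constructed instruction words."""
    target_pe, page = request
    return [
        {"dest_pe": target_pe, "kind": "config_write",
         "iram_offset": offset, "word": word}
        for offset, word in enumerate(code_store[page])
    ]
```

The point of the sketch is the shape: one input token in, a bounded burst of output packets out, with no state retained between firings, so the loader schedules like any other dataflow node.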

### The Identity Problem: Miss Detection

If code loading happens at runtime, the question arises: how does a PE
know the code in its IRAM is the *right* code for an arriving token?

A simple validity bitmap (like the matching store presence bit) is
**not sufficient**. It can tell you "something is loaded at offset 7"
but not "the right instruction is loaded at offset 7." If a different
function fragment has been loaded over a previous one, the IRAM slot is
occupied by a valid-looking but wrong instruction. The token indexes
directly into IRAM -- there is no tag comparison against the token.

Several detection mechanisms are possible:

**Option A: Fragment ID register.**
Each PE has a small tag register (or set of registers, one per IRAM
page/region) that records which function fragment is currently loaded.
Set by config writes during loading. Incoming tokens carry (or the
system derives from the token's address) the expected fragment ID. The
PE compares the token's expected fragment against the loaded fragment
register:
- Match -> proceed normally
- Mismatch -> miss, trigger fetch
- Hardware cost: one register + comparator per PE (or per IRAM region)
- Requires fragment ID bits in the token or a derivation mechanism
- Coarse-grained: one tag per PE or per page, not per instruction

**Option B: Entry gate instruction.**
The compiler inserts a special instruction at each function body's
entry point that verifies identity: "am I the function this activation
expects?" Tokens arriving at non-entry instructions are assumed correct
because they could only have reached that point by passing through a
verified entry gate.
- No per-instruction tags needed
- Software-managed, compiler responsibility
- Detection granularity is per-function-body, not per-instruction
- In dataflow terms: the entry gate is a dyadic instruction whose left
  input is the activation token and whose right input is a "function
  loaded" token. If the function isn't loaded, the gate blocks (no
  match) until loading completes and a "loaded" confirmation token
  arrives.

**Option C: Software-only invariant.**
No hardware miss detection. The loader protocol guarantees correctness:
code is never overwritten while tokens are in flight targeting it. The
throttle + drain approach (stall new activations, let existing ones
complete, then overwrite IRAM) ensures the invariant.
- Simplest hardware -- no detection circuitry at all
- Most compiler/loader burden
- Relies on correct coordination; bugs cause silent wrong execution
- Viable for v0 where programs are small and manually verified

These options are not mutually exclusive. v0 can start with Option C
(software guarantee) and add hardware detection (A or B) later as
programs grow beyond what manual verification can cover.

### Miss Handling

When a miss is detected (by whatever mechanism), the PE needs to handle
a token that targets unloaded code. Two approaches:

**Stall + fetch request:** PE emits an `exec` token and stalls its
input FIFO until the instruction arrives via config write. Simple,
deterministic, but blocks all traffic to that PE during the fetch.
Acceptable if misses are rare (proactive loading handles most cases)
and fetch latency is bounded.

**Recirculate + fetch request:** PE emits a fetch-request token, puts
the missed token back at the tail of its own input FIFO, and continues
processing other tokens. The missed token retries later, hopefully
after the instruction has been loaded. More complex but keeps the PE
productive.
Requires care to avoid FIFO fill-up with recirculated tokens.

v0 may not implement either; starting with the software-only invariant
(Option C above) means misses don't happen by construction. Hardware
miss handling is an evolutionary step as programs outgrow what static
loading can guarantee.

## 14. Future: Bank Switching with 74LS610

v0 supports 256 instructions per PE (8-bit offset) with simple address
decode. When programs exceed 256 instructions, the 74LS610 memory mapper
enables bank switching: 16 banks x 256 instructions = 4096 entries per
PE without changing SRAM chips.

### What the 610 Is and What It Enables

The 74LS610 (TI memory mapper, originally for the TMS9900 family) is an
ideal fit for IRAM bank switching. Key properties:

- 16 mapping registers, each 12 bits wide
- 4-bit logical address input selects register -> 12-bit physical address output
- **Latch control** (pin 28): outputs can be frozen while register contents change
- ~40-50 ns propagation delay (LS family), pipelineable with SRAM access
- One chip per PE. Writes to mapping registers via data bus during config/bootstrap.

The '610 is planned for both IRAM and SM banking (both are future
upgrades, not present in v0). Using it for IRAM banking is the same
chip, same wiring pattern, different address domain. One '610 per PE
for IRAM, one per SM.

### The Socket Strategy

The v0 board pre-wires SRAM address lines to a '610 socket with a
jumper wire in place of the chip. When bank switching is needed, the
'610 drops in with no board changes.
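The '610's datapath is small enough to model in a few lines. A behavioural sketch of the mapping path (register count and widths per the properties listed above; the class interface itself is illustrative):

```python
class Mapper610:
    """Behavioural model of the 74LS610 mapping path as used for IRAM
    banking: a 4-bit logical page selects one of 16 x 12-bit mapping
    registers, whose contents become the physical bank bits."""

    def __init__(self):
        self.regs = [0] * 16                 # 16 mapping registers

    def map_page(self, logical_page: int, phys_bank: int):
        """Model of a mapping-register write (config/bootstrap path)."""
        assert 0 <= logical_page < 16 and 0 <= phys_bank < 4096
        self.regs[logical_page] = phys_bank

    def translate(self, logical_page: int, offset: int) -> int:
        """[page:4][offset:8] -> [phys_bank:12][offset:8]."""
        assert 0 <= offset < 256
        return (self.regs[logical_page] << 8) | offset
```

In practice only as many register bits as the SRAM address width are wired out; the rest of the 12-bit output is unused.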

### Address Space with Bank Switching

With the '610 installed, the IRAM address becomes:

```
Logical:  [bank_select:4][offset:8]
               |
               v  (74LS610)
Physical: [phys_bank:12][offset:8]  = up to 20-bit SRAM address
```

In practice the physical address width is bounded by available SRAM chip
capacity. With 8Kx8 SRAMs (13-bit address): the '610's 12-bit output is
wider than needed -- only 5 bits of physical bank + 8-bit offset = 13
bits. This gives 32 physical banks of 256 instructions each (8192
instructions per PE, though address space constraints may limit this
further).

The SRAM address map with bank switching:

```
 IRAM region:  [0][bank:4][offset:8]    bank-switched templates
               bank from '610 mapper
               capacity: 16 banks x 256 instructions = 4096 entries

 Frame region: [1][frame_id:2][slot:6]  (unchanged)
```

### Banking Workflow: MAP_PAGE and SET_PAGE Instructions

Two instructions manage banking. Neither touches the token format.

- **`map_page`** (monadic): writes a logical-to-physical mapping into
  one of the '610's 16 mapping registers. The register index and
  physical bank address come from frame constants. Used during
  bootstrap or runtime to establish which physical SRAM regions back
  which logical pages.

- **`set_page`** (monadic): writes a 4-bit logical page selector into a
  PE-local latch. The latch feeds the '610's MA0-MA3 inputs. All
  subsequent IRAM fetches go through the selected logical page's mapped
  physical bank. One cycle to switch.

```
Banking workflow:
  1. Bootstrap: MAP_PAGE instructions establish mappings
     (logical page 0 -> physical region A, page 1 -> region B, etc.)
  2. Runtime: SET_PAGE selects the active logical page
  3. Latch -> '610 MA0-MA3 -> physical SRAM bank selection
  4.
     All IRAM reads now address the selected bank
```

Hardware cost: one 74LS175 (quad D flip-flop) as the page latch + the
'610 itself.

### Trade-offs and Costs

- A bank switch affects all in-flight tokens targeting this PE at
  offsets in the old bank. The compiler (or scheduler) must drain
  tokens for the old bank before switching -- the same
  throttle-and-drain protocol as code overwrite, but switching is
  instantaneous once drained (write latch, done).
- `set_page` is sequentially scoped: it affects all subsequent fetches,
  not just one activation. The compiler must ensure that concurrent
  activations on the same PE agree on the active page, or use
  `set_page` as a barrier between phases.
- Total capacity per PE is bounded by SRAM chip size, not the '610
  (which can address far more than any reasonable IRAM).
- Pages are a pure address-mapping primitive. The compiler decides what
  they mean -- per-function, per-phase, or any other grouping. The
  hardware doesn't enforce or assume any relationship between pages and
  function bodies.

## 15. Dynamic Scheduling: Future Capability

The architecture is **policy-agnostic** on whether PE assignment is
fully static (the compiler decides everything) or partially dynamic (a
scheduler places activations at runtime). The mechanism (tokens carry
destination PE + activation_id, PEs have writable IRAM, frames are
allocated and addressed by act_id) supports either policy.

### Static Assignment (v0)

The compiler decides everything at compile time. Each PE gets specific
function fragments loaded at bootstrap. No runtime decisions about
placement. Simplest: no scheduler hardware or firmware needed. For
programs that exceed IRAM capacity, the compiler schedules `exec`
instructions or similar.

### Dynamic Scheduling (future)

A CCU-like scheduler (could be firmware on a dedicated PE, a small
fixed-function unit, or distributed logic) decides at runtime where to
place new activations, based on PE load, IRAM contents, etc.

The tension: dynamic scheduling wants **wide IRAM** (so the target PE
already has the function body loaded), while cheap PEs want **narrow
IRAM**. Amamiya resolved this by replicating the entire program into
every PE's IRAM; that works, but it costs a lot of memory.

The middle ground is a **working set model**: keep hot function bodies
loaded, swap cold ones via PE-local write tokens (prefix 011+01) when
the scheduler wants to place an activation on a PE that doesn't have
the code yet.

- **Miss latency**: significant (a network round-trip to load code from
  SM or external storage) -- much worse than Amamiya's "already there."
- **Miss rate**: depends on the scheduler's affinity policy. If the
  scheduler prefers placing activations on PEs that already have the
  code, misses should be rare. A small "IRAM directory" (which PE has
  which function body loaded) lets the scheduler make this decision
  cheaply.
- **Coordination**: drain in-flight tokens for the old fragment before
  overwriting IRAM. Throttle stalls new activations for that fragment,
  existing ones complete, then overwrite. A coarse-grained context
  switch.

## 16. Open Design Questions

1. **Approach selection for v0.** Approach C (670 lookup) is
   recommended as the starting point: combinational metadata at ~8
   chips. Approach B (register-file match pool) eliminates the last
   SRAM cycle from matching at the cost of ~16-18 chips. Approach A
   (SRAM tags) is the fallback if 670 supply is a problem. The
   choice depends on whether chip count or pipeline throughput is
   the binding constraint for the initial build. See section 11 for
   the full approach comparison and cycle counts.

2.
**Frame SRAM contention under realistic workloads.** The pipeline
   stall analysis in section 11 uses worst-case consecutive tokens.
   Simulate representative dataflow programs in the behavioural
   emulator to measure actual stage 3 vs stage 5 contention rates
   and determine whether dual-port SRAM or faster SRAM is justified
   for v0.

3. **SC block register capacity.** With 4-6 registers available from
   repurposed 670s (depending on how many are shared), what is the
   longest SC block the compiler can generate before register
   pressure forces a spill? Evaluate empirically on target workloads.

4. **Predicate register encoding.** Document specific instruction
   encodings for predicate test/set/clear, and how SWITCH
   instructions interact with predicate bits. The predicate register
   may subsume some of the cancel-bit functionality planned for the
   token format.

5. **Mode switch latency measurement.** Build a cycle-accurate model
   of the save-to-SRAM / restore-from-SRAM path and determine the
   exact overhead. Target: <=10 cycles per transition.

6. **Assembler stall analysis.** The assembler can statically detect
   instruction pairs whose output tokens may cause frame SRAM
   contention on the same PE. For hot loops, the assembler can
   insert mode 4 NOP tokens (zero frame access) as pipeline padding.
   Validate static stall estimates against emulator simulation, since
   runtime arrival timing depends on network latency and SM response
   times.

7. **8-offset matchable constraint validation.** The 670-based
   presence metadata limits dyadic instructions to offsets 0-7 per
   frame. Evaluate whether this is sufficient for compiled programs.
   If tight, the hybrid upgrade path (offset[3]=0 checks 670s,
   offset[3]=1 falls back to SRAM tags) adds ~4-6 chips of SRAM tag
   logic for offsets 8-15+.

8.
**Exact opcode assignments**: 5-bit opcode space is sufficient (CM
   and SM independent). Need to assign FREE_FRAME, ALLOC_REMOTE, and
   verify that existing ALU operations fit with the revised mode
   semantics.

9. **SC arc execution details**: the frame model supports
   strongly-connected arc execution (latch frame_id across sequential
   blocks); the pipeline sequencing and block-entry detection logic
   need design work. Deferred past v0 but should not be precluded by
   any v0 decisions.

10. **IRAM bank switching interaction with frames**: switching IRAM
    banks changes the instruction templates but not the frame contents.
    Tokens in flight targeting the old bank's instructions will execute
    against new instructions after the switch. The drain-before-switch
    protocol applies unchanged.

11. **Frame slot count (fref width)**: 6-bit fref = 64 slots is the
    current proposal. Real compiled programs may show that 32 suffices
    (freeing 1 bit for other uses) or that 64 is tight (requiring
    creative aliasing or more aggressive function splitting).

12. **Function splitting heuristics**: how does the compiler decide
    where to split? Minimize cross-PE traffic? Balance frame usage
    across PEs? Hardware constraints (frame count, matchable offset
    count) drive it.

13. **Instruction identity detection**: how does the PE know loaded
    code matches what an arriving token expects? Fragment ID register
    vs entry gate instruction vs software-only guarantee. See the
    Instruction Residency section. v0 starts with the software-only
    invariant (Option C).

14. **Miss handling mechanism**: stall + fetch request vs recirculate
    + fetch request. v0 may not implement either, relying on the
    software-only invariant.

15.
**SM flit 1 bit alignment:** the frame slot packing for SM targets
    (`[SM_id:2][addr][spare]`) must align with the SM bus flit 1 format
    so that the output stage can concatenate frame bits + EEPROM opcode
    without field rearrangement. The exact spare bit positions depend on
    the final SM bus encoding; verify alignment after `sm-design.md`
    opcode assignments are frozen.

16. **PE-local write slot field width:** the current flit 1 format packs
    5 bits of slot index into the PE-local write token. With 64 frame
    slots (6-bit fref), one bit is missing. Options: (a) limit PE-local
    writes to slots 0-31, (b) steal a spare bit from act_id or elsewhere,
    (c) use flit 2 for extended addressing.

17. **Indexed SM ops: address overflow.** The ALU base + index
    computation may overflow the 10-bit (tier 1) or 8-bit (tier 2)
    address range. The PE does not check for overflow -- the SM receives
    the truncated address. The assembler should warn on
    statically-detectable overflow risks. Runtime overflow detection is
    an SM-side concern.

18. **EXTRACT_TAG offset source.** The return offset in the packed tag
    could come from (a) a frame constant (flexible, costs 1 frame slot),
    (b) a fixed hardware-derived value (e.g. current instruction
    offset + 1), or (c) an IRAM-encoded small immediate (not available
    in the current 16-bit format). Option (a) is consistent with the
    frame-everything philosophy; option (b) saves a slot but limits
    return point placement.

## 17. References

- `architecture-overview.md` - Token format, flit-1 bit allocation, module taxonomy, bus framing protocol, function call design overview.
- `alu-and-output-design.md` - ALU operation details, output routing modes, flit 2 source mux, and SM flit assembly.
- `sm-design.md` - SM opcode table, extended addressing, CAS handling, I-structure semantics.
- `bus-interconnect-design.md` - Physical bus implementation: shared and split AN/CN/DN topologies, node interfaces, arbitration, loopback, backpressure, chip counts.
- `network-and-communication.md` - Routing topology, clocking discipline.
- `io-and-bootstrap.md` - Bootstrap loading, I/O subsystem design.
- `sram-availability.md` - Component availability for period-appropriate SRAMs.
- `17407_17358.pdf` - DFM evaluation: OM structure (1024 CAM blocks, 32 words each, 8 entries of 4 words, 4-way set-associative within entry). Function activation via CCU requesting the least-loaded PE, then getting the instance name from the target PE's free instance table. IM is 8KW/PE, identical across all PEs. Critical for understanding why Amamiya's OM is so large and why ours can be much smaller.
- `gurd1985.pdf` - Manchester matching unit: 16 parallel hash banks, 64K tokens each, 54-bit comparators, 180 ns clock period. Overflow unit emulated in software. Shows the cost of general-purpose matching.
- `Dataflow_Machine_Architecture.pdf` - Veen survey: matching store analysis, tag space management, overflow handling across multiple architectures.
- `amamiya1982.pdf` - Original DFM paper: semi-CAM concept, IM/OM split, execution control mechanism with associative IM fetch. Partial function body execution (begin executing when the first argument arrives, don't wait for all arguments).
- EM-4 prototype papers - Direct matching, strongly-connected blocks, register-based advanced control pipeline. Informs the SC arc upgrade path.
- Iannucci (1988) - Frame-based matching, continuation model, suspension semantics. Historical precedent for per-activation frame storage.
- Monsoon / TTDA papers - Explicit token store, frame-based execution, I-structure semantics.