OR-1 dataflow CPU sketch
# Dynamic Dataflow CPU — PE (Processing Element) Design

Covers the CM (Control Module) pipeline, frame-based matching, instruction
memory and encoding, activation lifecycle, per-PE identity, 670 subsystem
design, pipeline stall analysis, SM operation dispatch, SC blocks, and
execution modes.

See `architecture-overview.md` for token format, flit-1 bit allocation,
and module taxonomy. See `network-and-communication.md` for how tokens
enter/leave the PE. See `alu-and-output-design.md` for ALU operations,
output formatting, and SM flit assembly details. See
`bus-interconnect-design.md` for physical bus implementation. See
`sm-design.md` for SM internals and I-structure semantics. See
`io-and-bootstrap.md` for bootstrap loading and I/O subsystem design.

## 1. Design Philosophy: Static Assignment, Compiler-Driven Sizing

This design diverges significantly from both Manchester and Amamiya in how PEs are used. Understanding the difference is critical to understanding why the matching store can be so much smaller here.

**Amamiya DFM (1982/1984 papers):** every PE has ALL function bodies pre-loaded in instruction memory (8KW, 58 bits/word per PE, identical contents across all PEs). Function _instances_ are dynamically assigned to PEs at runtime by a CCU (Cluster Control Unit) that picks the least-loaded PE. The OM (operand matching memory) needs 1024 CAM blocks per PE because any function can run anywhere, and deep Lisp recursion means many simultaneous activations. The "semi-CAM" was their solution to making this affordable -- instance name directly addresses a block, then 4-way set-associative lookup within the block on instruction identifier.

**Manchester (Gurd 1985):** similar story but with hashing instead of semi-CAM. 16 parallel 64K-token memory banks per PE for set-associative hash lookup. 1M token capacity matching store. Plus an overflow unit (initially emulated on the host).
The matching unit alone was 16 memory boards per PE.

Both machines sized their matching stores for worst-case dynamic scheduling of arbitrary programs. The whole program lives in every PE (or in a single PE's matching unit), and any activation can land anywhere. That's why those matching stores are enormous.

**This design:** the compiler statically assigns function bodies (or chunks of them) to specific PEs. Different PEs have different instruction memory contents. The compiler knows at compile time which functions run where, and can calculate maximum concurrent activations per PE. This means:

- Instruction memory is NOT replicated -- each PE only holds its assigned function bodies. IM can be much smaller.
- The matching store only needs enough frames for the maximum concurrent activations the compiler predicts for that specific PE. Not 1024. Probably 4.
- No CCU is needed for dynamic PE allocation. Scheduling decisions are made at compile time.
- The tradeoff is scheduling flexibility -- you can't dynamically rebalance load at runtime. The compiler must get it roughly right.

### Function Splitting Across PEs

A "function" in the source language does NOT need to map 1:1 to a contiguous block on one PE. The compiler can split a function body at any data-dependency boundary. The token network doesn't know or care whether two instructions are "in the same function" -- it just sees tokens with destinations.

A 40-instruction function body could be split into three chunks of ~13 instructions across three PEs, each chunk fitting in a smaller frame. The "function" as the architecture sees it is really "a set of instructions that share a frame on this PE." The compiler defines what that grouping means.

This is a powerful lever for keeping frames small: if a function body is too big for the frame size, the compiler splits it. The split introduces inter-PE token traffic (extra network hops), but keeps per-PE hardware simple.
The compiler can optimise the split points to minimise cross-PE traffic.

**Implication for frame semantics:** a frame doesn't mean "one function activation." It means "one chunk of work sharing a local operand namespace on this PE." Multiple frames on different PEs might collectively represent one function activation. The token's `activation_id` scopes operand matching to a local frame, nothing more.

**Implication for the compiler:** this architecture actively wants either small functions or functions distributed across PEs. The compiler is free to treat any subgraph of the dataflow graph as a "chunk" and assign it to a PE, regardless of source-level function boundaries. Loop bodies, branch arms, pipeline stages: all valid chunk boundaries. The grain of scheduling is the subgraph, not the function.

## 2. PE Identity

Each PE has a unique ID used for routing. Two mechanisms, not mutually exclusive:

**EEPROM-based**: the instruction decoder EEPROM already contains per-PE truth tables. The PE ID can be encoded as additional input bits to the EEPROM, meaning the EEPROM contents are unique per PE but the circuit board is identical. The instruction decoder "knows" which PE it is because its EEPROM was burned with that ID.

**DIP switches**: 3-4 switches give 8-16 PE addresses. Better for early prototyping -- reconfigurable without reflashing. Can coexist with the EEPROM approach (the switches provide ID bits that feed into the EEPROM address lines).

The PE ID is needed in two places:

1. Input token filtering: "is this token addressed to me?"
2. Output token formatting: "set the source PE field" (if result tokens carry source info for return routing)

## 3. PE Pipeline (5-stage)

### Bus Interface: Serializer / Deserializer

The PE connects to the 16-bit external bus via ser/deser logic at the input and output boundaries.
This handles the width conversion between 16-bit flits on the bus and the wider internal token representation:

- **Input deserializer**: receives 2+ flits from the bus, reassembles them into a full token (routing fields from flit 1 + data from flit 2). Shift register + flit counter. Outputs a reassembled token to the input FIFO.
- **Output serializer**: takes a formed result token, splits it into flits (routing fields into flit 1, data into flit 2), and clocks them onto the bus. Shift register + toggle.
- Hardware cost: ~5-8 TTL chips per direction (shift registers, counters, muxes).
- Naturally integrates with the clock domain crossing FIFOs when running Mode B (2x bus clock). Under Mode A (shared clock), the ser/deser simply takes 2 clock cycles per token transfer.

### Pipeline Stages

The pipeline runs IFETCH before MATCH. The instruction word, decoded at
the end of stage 2, drives all subsequent pipeline behaviour: whether to
check the matching store, whether to read a constant, how many
destinations to read, whether to write back to the frame. The token's
`activation_id` drives associative lookup in parallel with the IRAM read,
hiding resolution latency.

**Why IFETCH before MATCH.** The instruction word determines
*how* matching works: whether the instruction is dyadic or monadic,
which frame slots to read for operands and constants, whether to
write back to the frame (sink modes), and how many destinations to
read at output. Fetching the instruction first gives the pipeline
controller all the information it needs to sequence stage 3's SRAM
accesses efficiently.

The token's dyadic/monadic prefix enables parallel work: when the
prefix indicates "dyadic," stage 2 starts act_id -> frame_id
resolution via the 670s simultaneously with the IRAM read.
By the time stage 3 begins, both the instruction word
and the frame_id / presence / port metadata are available, and the
only remaining SRAM work is reading or writing actual operand data
and constants.

```
Stage 1: INPUT
  - Receive reassembled token from input deserialiser
  - Classify by prefix: dyadic wide (00), monadic normal (010),
    misc bucket (011). Within misc bucket: frame control (sub=00),
    PE-local write (sub=01), monadic inline (sub=10)
  - Compute/data tokens -> pipeline FIFO
  - Frame control tokens -> tag store write/clear (side path)
  - PE-local write tokens -> SRAM write queue (side path, executes
    when frame SRAM is not busy with compute pipeline accesses)
  - Buffer in small FIFO (8-deep, storing reassembled tokens)
  - ~1K transistors (flip-flops) or use small SRAM

Stage 2: IFETCH
  Two parallel operations within a single cycle:

  (a) IRAM SRAM read at [bank_reg : token.offset]. Produces the
      16-bit instruction word: type, opcode, mode, wide, fref.
      Single read cycle (16-bit instruction, one chip pair).

  (b) Activation_id resolution. For Approach C (74LS670 lookup),
      this is combinational (~35 ns): present act_id on the 670
      address lines, get {valid, frame_id} back. Presence and port
      metadata also resolve combinationally in this stage (670 read
      at frame_id, ~70 ns total from act_id presentation). At 5 MHz
      (200 ns cycle), all metadata is available before the IRAM read
      completes.

  The dyadic/monadic prefix from flit 1 determines whether
  activation_id resolution starts in this stage (dyadic) or is
  deferred until the instruction confirms the need (monadic with
  frame access).

  IRAM valid-bit check occurs in parallel: if the page containing
  the target offset is marked invalid, the token is rejected (see
  IRAM Valid-Bit Protection below).

Stage 3: MATCH / FRAME
  Path depends on instruction type (from stage 2) and token type.
  Uses the instruction's mode field to determine frame SRAM accesses:

  - Dyadic hit (second operand): read stored operand from frame SRAM
    at [frame_id : match_offset]. If has_const, also read constant
    from frame[fref]. 1-2 SRAM cycles depending on mode.
  - Dyadic miss (first operand): write incoming operand to frame SRAM,
    set presence bit in 670. Token consumed. 1 SRAM cycle.
  - Monadic with constant: read constant from frame[fref]. 1 SRAM
    cycle.
  - Monadic mode 4 (CHANGE_TAG, no frame access): 0 cycles, pass
    through.
  - Mode 7 (SINK+CONST / RMW): read old value from frame[fref] for
    read-modify-write. 1 SRAM cycle.

  See Pipeline Stall Analysis (section 11) for full cycle-count
  tables across Approaches A, B, and C.

Stage 4: EXECUTE
  - Instruction type bit selects CM compute (0) or SM operation (1)
  - CM path: 16-bit ALU executes arithmetic/logic/comparison/routing
    on operand data + constant (if present). Purely combinational.
    No SRAM access. Result latched.
  - SM path: ALU computes effective address or passes through data;
    PE constructs SM flit fields from frame data and operands.
    See SM Operation Dispatch (section 7) for encoding and dispatch
    details. See `alu-and-output-design.md` for SM flit assembly.
  - ~1500-2000 transistors (ALU) + SM flit assembly mux (~4-6 chips)

Stage 5: OUTPUT
  - CM path: read destination(s) from frame SRAM. Destinations are
    pre-formed flit 1 values stored in frame slots during activation
    setup. The PE reads the slot and puts it directly on the bus as
    flit 1; the ALU result becomes flit 2. Near-zero token formation
    logic.
    - Mode 0/1 (single dest): read frame[fref] or frame[fref+1].
      1 SRAM cycle.
    - Mode 2/3 (fan-out): read dest1 and dest2 from consecutive
      frame slots. 2 SRAM cycles.
    - Mode 4/5 (CHANGE_TAG): left operand becomes flit 1 verbatim.
      0 SRAM cycles.
    - Mode 6/7 (SINK): write ALU result back to frame[fref]. No
      output token. 1 SRAM cycle.
  - SM path: emit SM token to target SM. SM flit 1 constructed from
    frame slot (SM_id + addr from frame[fref]), SM flit 2 source
    selected by instruction decoder (ALU out, R operand, or frame
    slot). See SM Operation Dispatch (section 7) for flit 2 source
    mux details.
  - Pass to output serialiser for flit encoding and bus injection.
```

### Concurrency Model

The pipeline overlaps work: multiple tokens can be in flight simultaneously at different stages. In the emulator, each token spawns a separate SimPy process that progresses through the pipeline independently. This models the hardware reality where stage 2 can be fetching a new instruction while stage 4 is executing a previous one.

Cycle counts per token type (Approach C, recommended v0):

- **Dyadic hit, mode 1** (common case): 6 cycles (input + ifetch + match/const + execute + output)
- **Dyadic hit, mode 0** (no const): 5 cycles
- **Dyadic miss**: 3 cycles (input + ifetch + store operand; no execute/output)
- **Monadic, mode 0**: 4 cycles (input + ifetch + execute + output; no match)
- **Monadic, mode 4** (CHANGE_TAG, no frame): 3 cycles
- **PE-local write**: side path, does not enter compute pipeline
- **Frame control**: side path, tag store write
- **Network delivery**: +1 cycle latency between emit and arrival at destination

**Unpipelined throughput:** 3-7 cycles per token depending on instruction
mode. This is the baseline against which pipeline overlap improvements
are measured (see Pipeline Stall Analysis, section 11).
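These counts can be sanity-checked against the emulator with a small lookup function. A sketch in plain Python (not the SimPy model; `token_cycles` is an illustrative name, and the assumption that fan-out modes 2/3 add one output SRAM cycle is mine, consistent with the 3-7 cycle range stated above):

```python
# Approximate per-token pipeline occupancy, Approach C, unpipelined.
# Mirrors the cycle counts listed above; fan-out cost is an assumption.

def token_cycles(kind: str, mode: int = 0) -> int:
    """Cycles a token occupies the compute pipeline (v0 estimate)."""
    if kind == "dyadic_miss":
        return 3                      # input + ifetch + store operand
    if kind == "monadic":
        return 3 if mode == 4 else 4  # mode 4 (CHANGE_TAG) skips frame SRAM
    if kind == "dyadic_hit":
        has_const = mode & 1                     # modes 1/3/5/7: constant read
        has_fanout = 1 if mode in (2, 3) else 0  # second destination read
        return 5 + has_const + has_fanout
    raise ValueError(f"unknown token kind: {kind}")
```

All values stay within the 3-7 cycle unpipelined range quoted above.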
### Pipeline Register Widths

Between instruction fetch and execute, the pipeline carries both operand data and instruction control information in parallel but logically separate paths:

**Data path (~32 bits between match/frame and ALU):**
```
data_L: 16 bits (from frame SRAM or direct from token for monadic)
data_R: 16 bits (from frame SRAM)
        |
   ALU (16-bit)
        |
result: 16 bits
```

**Control path (~16 bits from IRAM, plus frame reads in stage 5):**
```
type:   1 bit  (CM/SM select, from IRAM)
opcode: 5 bits (from IRAM, consumed by ALU / SM decoder)
mode:   3 bits (from IRAM, drives frame access pattern + output routing)
wide:   1 bit  (from IRAM, 16/32-bit frame access)
fref:   6 bits (from IRAM, frame slot base index)
```

Destinations are NOT in the pipeline control registers. They are read
from frame SRAM during stage 5 as pre-formed flit 1 values. This
simplifies the pipeline latch: only 16 bits of IRAM control + 32 bits
of operand data pass between stages.

Total pipeline register between fetch and execute: **~48 bits**. The
mode field (3 bits) encodes tag behaviour, constant presence, and
fan-out in a single dense field (see mode table in section 6).

Pipeline registers between stages: ~500 transistors
Control logic (state machine, handshaking): ~500-1000 transistors

**Per-PE total: ~32-43 chips** (Approach C).

## 4. Frame-Based Matching

### Why It Can Be Small

The matching store is the highest-risk component in any dataflow machine. Manchester needed 16 memory boards per PE. Amamiya needed 1024 CAM blocks (32KW at 43 bits/word) per PE. Both were sized for worst-case dynamic scheduling of arbitrary programs.

This design avoids that because:

1. **Static PE assignment**: the compiler knows which functions run on which PE and can calculate maximum concurrent activations per PE.
2. **Function splitting**: the compiler can split large function bodies across PEs so no single PE needs a huge frame.
3. **Compiler-controlled frame allocation**: the compiler assigns activation IDs at compile time for statically-known activations. Only genuinely dynamic activations (runtime-determined recursion depth) need runtime allocation.

The frame count is therefore a _compiler parameter_, not an architectural constant. The hardware provides 4 concurrent frames of 64 slots each. The compiler must generate code that fits within those limits, splitting and scheduling accordingly.

### Architecture: 74LS670 Register-File Lookup (Approach C, Recommended v0)

Matching uses the per-activation **frame** model. Pending match operands
live in the same SRAM address space as constants, destinations, and
accumulators. The 74LS670 (4-word x 4-bit register file with independent
read/write ports) provides the activation_id-to-frame_id mapping and
presence/port metadata, all combinationally.

**act_id-to-frame_id resolution:**

Two 74LS670s, addressed by `act_id[1:0]` with `act_id[2]` selecting
between chips. Output = `{valid:1, frame_id:2, spare:1}`. Combinational
read: present act_id on the address lines, frame_id appears at the
output in ~35 ns.

```
ALLOC:  write {valid=1, frame_id} at address act_id (670 write port)
FREE:   write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns
```

The 670's independent read and write ports allow ALLOC to proceed while
the pipeline reads -- zero conflict.

**Presence and port metadata:**

Presence and port bits live in additional 670s, addressed by
`[frame_id:2]`. The matchable offset range is constrained to offsets
0-7 (8 dyadic-capable slots per frame). The assembler packs dyadic
instructions at low offsets; offsets 8-255 are monadic-only.
Recommended layout: 4x 670, each covering 2 offsets across 4 frames:

```
670 chip 0 (offsets 0-1): word[frame_id] = {pres0:1, port0:1, pres1:1, port1:1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2:1, port2:1, pres3:1, port3:1}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4:1, port4:1, pres5:1, port5:1}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6:1, port6:1, pres7:1, port7:1}
```

`offset[2:1]` selects the chip, `offset[0]` selects which pair of bits
within the 4-bit output (a 2:1 mux -- one gate).

The 670's simultaneous read/write is critical: during stage 3, when
a first operand stores and sets presence, the write port updates the
presence 670 while the read port remains available for the next
pipeline stage's lookup. No read-modify-write sequencing needed.

All reads are combinational (~35 ns). All resolve during stage 2 in
parallel with the IRAM read. By the time stage 3 begins, the PE knows
frame_id, presence, and port -- the only SRAM access in stage 3 is
reading/writing the actual operand data.

**The matching operation:**

```
Stage 2 (parallel with IRAM read):
  act_id -> frame_id via 670 lookup (combinational)
  presence[frame_id][offset] via 670 read (combinational)
  port[frame_id][offset] via 670 read (combinational)

Stage 3 (driven by instruction word from stage 2):
  if instruction is dyadic AND presence bit set:
    -> match found: read stored operand from frame SRAM at
       [frame_id:2][match_offset]. Clear presence bit.
       Read constant from frame[fref] if has_const.
    -> proceed to stage 4 with both operands.
  if instruction is dyadic AND presence bit clear:
    -> first operand: write incoming operand to frame SRAM at
       [frame_id:2][match_offset]. Set presence bit in 670.
    -> token consumed, advance to next input token.
  if instruction is monadic:
    -> bypass matching. Read constant from frame[fref] if has_const.
326``` 327 328**Hardware cost:** 329 330| Component | Chips | Notes | 331| ------------------------- | ------ | --------------------------------------- | 332| act_id -> frame_id lookup | 2 | 74LS670, indexed by act_id | 333| Presence + port metadata | 4 | 74LS670, indexed by frame_id | 334| Bit select mux | 1-2 | offset-based selection of presence/port | 335| **Total match metadata** | **~8** | | 336 337**Constraint: 8 matchable offsets per frame.** The assembler enforces 338this. 8 dyadic instructions per function chunk per PE is reasonable -- 339the compiler splits larger function bodies across PEs. With 4 PEs, the 340system supports 32 simultaneous dyadic slots, which exceeds typical 341working-set utilisation for the target workloads. 342 343**Constraint: 8 unique activation_ids.** The 3-bit act_id supports 8 344entries in the lookup table. With 4 concurrent frames, 4 IDs of ABA 345distance exist before wraparound. 346 347### Alternative Approaches 348 349Two alternative matching implementations exist: 350 351- **Approach A (set-associative tags in frame SRAM):** tag words share 352 the frame SRAM chip. No extra chips, but matching consumes SRAM 353 cycles (2 cycles for a dyadic hit). Lowest chip count (~4-6 extra 354 TTL), highest pipeline stall rate. 355 356- **Approach B (full register-file match pool):** match entries live 357 entirely in a register file with parallel comparators. Matching is 358 fully combinational (~50 ns). Highest chip count (~16-18 extra TTL), 359 best pipeline throughput (eliminates all match-related SRAM 360 contention). 361 362Approach C (670 lookup, described above) sits between A and B: act_id 363resolution and presence checking are combinational (like B), but operand 364data lives in SRAM (like A). ~8 extra chips, good pipeline throughput. 
See the approach comparison table in Pipeline Stall Analysis (section 11)
for full cycle counts and throughput estimates across all three
approaches, plus 670-enhanced variants (B+670 indexed, B+670 semi-CAM)
and hybrid upgrade paths.

### Frame Sizing

6-bit `fref` addresses 64 slots per activation. With 4 concurrent
frames, the frame region occupies 4 x 64 x 2 bytes = 512 bytes of
SRAM. This fits trivially alongside the IRAM region in a 32Kx8 chip
pair.

A function body with 10 dyadic instructions, 5 constants, and fan-out
on 3 instructions might use ~30 frame slots (10 match + 5 const + 8
dest + 7 accumulator/spare). With slot dedup (shared destinations,
aliased constants), actual usage is typically 15-25 slots per
activation.

### What About Overflow?

If all frames are occupied or a function body exceeds 8 dyadic
instructions per frame:

**Compile-time prevention (primary strategy):**

- The compiler knows the frame count and matchable offset limit
- It splits functions and schedules activations to fit
- If a program genuinely can't fit (unbounded recursion deeper than 4 frames), the compiler inserts throttling code: a token that waits for a frame to free before allowing the next recursive call
- This is the Amamiya throttle idea, but implemented in software (compiler-inserted dataflow logic) rather than hardware

**Runtime overflow (safety net):**

- If a token arrives and the tag store has no valid entry for its act_id (shouldn't happen with correct compilation), the PE stalls the input FIFO until a frame frees. Simplest, safest, most debuggable. If it fires, something is wrong and stalling surfaces the bug.

**Upgrade path: hybrid with SRAM fallback:**

- If the 8-offset matchable range proves tight, the high bit of the offset can select between register (offset[3]=0, check the 670s) and SRAM (offset[3]=1, fall back to tag-in-SRAM from Approach A).
  The fast path stays combinational; the overflow path adds 1 SRAM cycle. The system degrades gracefully rather than hard-limiting at 8 dyadic offsets.

## 5. Frame Lifecycle

### Allocation

An ALLOC frame control token (prefix 011+00, op=0) arrives at the PE,
specifying an `activation_id`. The PE assigns the next free physical
frame and records the act_id-to-frame_id mapping in the 670 tag store.

Free frame tracking is a simple 2-bit counter or shift register (4
entries max). Hardware cost: ~2-3 TTL chips.

### Setup

PE-local write tokens (prefix 011+01, region=1) load constants and
destinations into the allocated frame's slots. The writer addresses
slots by (act_id, slot_index); the PE resolves act_id-to-frame_id
internally using the same 670 lookup as the compute pipeline. Setup
uses the same mechanism as IRAM loading -- a stream of write tokens,
precalculated by the assembler.

### Execution

Compute tokens arrive with `activation_id`. The PE resolves
act_id-to-frame_id (670 lookup, combinational), then uses frame_id to
address frame SRAM for matching, constant reads, destination reads, and
write-backs. See the pipeline stages (section 3) for the per-stage
access pattern.

### Deallocation

A `FREE_FRAME` instruction (opcode-driven, any mode) or a FREE frame
control token (prefix 011+00, op=1) releases the frame. The tag store
entry is cleared (`valid=0` written to the 670), presence/port metadata
for that frame is bulk-cleared across all 4 presence/port 670s, and the
frame_id returns to the free pool.

Multiple frees are idempotent / harmless. Freed frames are immediately
available for reallocation.
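The lifecycle above can be modelled in a few lines of Python -- a sketch with illustrative names (not the emulator's API) showing ALLOC, idempotent FREE, the throttle condition, and stale-token discard via the valid bit:

```python
# Frame lifecycle sketch: 4-entry free pool, 8-entry act_id tag store.
# resolve() returning None corresponds to discarding a stale token.

class FrameManager:
    def __init__(self, n_frames=4, n_act_ids=8):
        self.free = list(range(n_frames))    # free frame_id pool
        self.tag = [(False, 0)] * n_act_ids  # act_id -> (valid, frame_id)

    def alloc(self, act_id):
        """ALLOC token: returns False when all frames are active
        (throttle: NACK or stall until a FREE occurs)."""
        if not self.free:
            return False
        self.tag[act_id] = (True, self.free.pop(0))
        return True

    def free_frame(self, act_id):
        """FREE token / FREE_FRAME instruction. Repeated frees are no-ops."""
        valid, frame_id = self.tag[act_id]
        if valid:
            self.tag[act_id] = (False, 0)
            self.free.append(frame_id)
            # hardware would also bulk-clear presence/port 670s here

    def resolve(self, act_id):
        """670 lookup: frame_id, or None -> discard stale token (ABA guard)."""
        valid, frame_id = self.tag[act_id]
        return frame_id if valid else None
```

Note how the valid bit alone gives the ABA protection described below: a stale act_id resolves to `None` and the token is dropped.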
### ABA Protection

- 3-bit activation_id provides 8 unique IDs
- With at most 4 concurrent frames, 4 IDs of ABA distance exist before wraparound
- Stale tokens (from freed activations) carry an act_id whose 670 entry is now `valid=0` or maps to a different frame. The PE detects this via the valid bit and discards the stale token.
- 4 IDs of distance is sufficient because stale tokens drain within single-digit cycles. Wraparound collision is effectively impossible.
- The act_id validity check in the 670 provides ABA protection without dedicated hardware -- the valid bit serves as an implicit generation guard.

### Throttle

- The PE tracks the number of active frames (frames with valid=1 in the tag store)
- When all 4 frames are active, ALLOC tokens are NACKed or stall until a free occurs
- Prevents frame overflow
- Hardware cost: 2-bit counter + comparator + gate. ~3 TTL chips.
- With compiler-controlled scheduling, the throttle should rarely fire. It's a safety net, not a performance mechanism.

## 6. Instruction Memory

### Static Assignment, Per-PE Contents

Unlike Amamiya, where every PE has identical IM contents (the full program), each PE here holds only the function bodies (or function chunks) assigned to it by the compiler. This means:

- IM is smaller per PE (only assigned code, not the whole program)
- Different PEs have different IM contents (loaded at bootstrap)
- The compiler emits a per-PE instruction image as part of the program

### Runtime Writability

Instruction memory is **not** read-only. It is writable from the network via IRAM write (prefix 011+01) packets. This serves two purposes:

1. **Bootstrap**: loading programs before execution starts
2. **Runtime reprogramming**: loading new function bodies while other PEs continue executing (future capability, not needed for v0)

Runtime writability also means instruction memory size is not a hard architectural limit -- if a program needs more code than fits in one PE's IM, the runtime (or a management PE) could swap function bodies in and out. Very speculative, but the hardware path exists.

### Implementation

Instruction memory is PE-local SRAM, sharing a chip pair with the frame
region via address partitioning (see Per-PE Memory Map, section 10).
**IRAM width is completely independent of bus width.** It is sized for
encoding needs, not bus constraints.

#### IRAM Width: 16-bit Single-Half Format

Each IRAM slot is **16 bits**, read in a single cycle from one 8-bit-wide
SRAM chip pair. Instruction templates are activation-independent: all
per-activation data (constants, destinations, match operands,
accumulators) lives in the frame.

```
IRAM address = [offset:8] (v0, 8-bit address)
```

256 instruction slots per PE. Total IRAM SRAM usage: 512 bytes per PE.
For programs exceeding 256 instructions, see Future: Bank Switching
with 74LS610 (section 14).
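For reference, packing and unpacking a 16-bit IRAM slot using the field layout defined in the next subsection can be sketched in Python (helper names are illustrative, not part of the assembler):

```python
# 16-bit instruction word: [type:1][opcode:5][mode:3][wide:1][fref:6]
# (bit 15 down to bit 0). Illustrative helpers, not the assembler's API.

def pack_iword(type_, opcode, mode, wide, fref):
    assert type_ < 2 and opcode < 32 and mode < 8 and wide < 2 and fref < 64
    return (type_ << 15) | (opcode << 10) | (mode << 7) | (wide << 6) | fref

def unpack_iword(w):
    return {
        "type":   (w >> 15) & 0x1,   # 0 = CM compute, 1 = SM operation
        "opcode": (w >> 10) & 0x1F,  # 32-entry space per type
        "mode":   (w >> 7)  & 0x7,   # output routing / frame access mode
        "wide":   (w >> 6)  & 0x1,   # 16- vs 32-bit frame values
        "fref":    w        & 0x3F,  # frame slot base index
    }
```

A round trip through `pack_iword`/`unpack_iword` preserves every field, which is a cheap check for an assembler emitting per-PE images.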
#### Instruction Word Format

```
[type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
   15     14-10    9-7      6      5-0
```

| Field  | Bits | Purpose                                                      |
| ------ | ---- | ------------------------------------------------------------ |
| type   | 1    | 0 = CM compute, 1 = SM operation                             |
| opcode | 5    | Operation code (CM and SM have independent 32-entry spaces)  |
| mode   | 3    | Combined tag/frame-reference mode (see mode table below)     |
| wide   | 1    | 0 = 16-bit frame values, 1 = 32-bit (consecutive slot pairs) |
| fref   | 6    | Frame slot base index (64 slots per activation)              |

**type:1** -- operation space select:
```
0 = CM compute operation (ALU)
1 = SM operation (structure memory bus command)
```

**opcode:5** -- 32 slots per type. CM and SM have independent opcode
spaces (32 CM opcodes + 32 SM opcodes). Decoded by EEPROM into control
signals. See `alu-and-output-design.md` for the CM operation set.

**mode:3** -- combined output routing and frame access mode. Controls
whether the instruction emits tokens, how it reads destinations and
constants from the frame, or whether it writes results back to the frame.
See the mode table below.

**wide:1** -- frame value width:
```
0 = 16-bit frame values (single slot per logical value)
1 = 32-bit frame values (consecutive slot pairs per logical value)
```

**fref:6** -- frame slot index (0-63). Base of a contiguous group of
1-3 slots, depending on mode. The instruction template references frame
data exclusively through this field; no per-activation data exists in
the instruction word itself.

Constants and destinations are NOT in the instruction word. They live in
frame slots, referenced by `fref`. The instruction template is pure
control flow: opcode, mode flags, and a frame slot reference.

**Instruction words are never serialised onto the external bus** during
normal execution.
They are only written via PE-local write packets
(prefix 011+01, region=0) during program loading.

#### Mode Table (3-bit `mode` field)

The 3-bit `mode` field encodes both output routing behaviour and frame
access pattern in a single field. Every combination is useful; there are
no wasted encodings.

```
mode  [2:0]  tag behaviour  frame reads at fref          use case
----  -----  -------------  ---------------------------  ----------------------------
 0    000    INHERIT        [dest]                       single output, no constant
 1    001    INHERIT        [const, dest]                single output with constant
 2    010    INHERIT        [dest1, dest2]               fan-out, no constant
 3    011    INHERIT        [const, dest1, dest2]        fan-out with constant
 4    100    CHANGE_TAG     (none)                       dynamic routing, no constant
 5    101    CHANGE_TAG     [const]                      dynamic routing with constant
 6    110    SINK           write result -> frame[fref]  store to frame, no output
 7    111    SINK+CONST     read frame[fref],            local accumulate / RMW
                            write result -> frame[fref]
```

**Bit-level decode equations:**

```
output_enable = NOT mode[2]              modes 0-3: read dest from frame, emit token
change_tag    = mode[2] AND NOT mode[1]  modes 4-5: routing from left operand
sink          = mode[2] AND mode[1]      modes 6-7: no output token, write to frame
has_const     = mode[0]                  modes 1, 3, 5, 7: read constant from frame
has_fanout    = mode[1] AND NOT mode[2]  modes 2-3: read two destinations
```

Frame slot count per mode = `1 + has_const + has_fanout`, read
sequentially starting from `fref`. For SINK modes, `fref` is the write
target. For SINK+CONST (mode 7, read-modify-write), the read and write
target is the same slot (`fref`), with the ALU result writing back
after computation.

**INHERIT (modes 0-3):** output tokens are routed to destinations stored
in frame slots.
Each destination slot holds a pre-formed flit 1 value:

```
frame dest slot: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3]
```

The PE reads the slot and puts it directly on the bus as flit 1. The ALU
result becomes flit 2. Almost zero token formation logic -- the frame
constant IS the output flit. This is the same format as the incoming
token's flit 1, enabling forwarding without repacking.

- **Mode 0** (single output, no constant): frame[fref] = dest. 1 slot.
- **Mode 1** (single output with constant): frame[fref] = const,
  frame[fref+1] = dest. 2 slots.
- **Mode 2** (fan-out, no constant): frame[fref] = dest1,
  frame[fref+1] = dest2. 2 slots.
- **Mode 3** (fan-out with constant): frame[fref] = const,
  frame[fref+1] = dest1, frame[fref+2] = dest2. 3 slots.

**CHANGE_TAG (modes 4-5):** the left operand replaces the frame
destination as flit 1. The entire output flit 1 comes from the left
operand data value (16 bits, verbatim). The right operand becomes flit 2
(payload data). This enables sending a value to any destination computed
at runtime -- the packed tag IS flit 1. No field extraction or assembly.

- **Mode 4** (no constant): no frame reads. Flit 1 = left operand,
  flit 2 = right operand (or ALU result).
- **Mode 5** (with constant): frame[fref] = const. Flit 1 = left
  operand, flit 2 = right operand. The constant feeds the ALU.

The output stage is a mux: frame dest vs left operand, selected by
mode[2]. Hardware: left operand bypass latch (~2 chips) preserves the
left operand value past the ALU. Stage 5 flit 1 mux (~2 chips) selects
between assembled flit and raw data.

**SINK (modes 6-7):** no output token is emitted. The ALU result is
written back to frame[fref]. Used for local accumulation, temporary
storage, and read-modify-write patterns.

- **Mode 6** (write only): ALU result written to frame[fref]. 1 SRAM
  write cycle.
- **Mode 7** (read-modify-write): frame[fref] is read as the constant
  input to the ALU, and the result is written back to frame[fref]. Enables
  in-place accumulation without consuming a separate constant slot.

**dest_type derivation:** the output token format (dyadic wide vs monadic
normal vs monadic inline) is determined by the destination frame slot
contents. Since destination slots hold pre-formed flit 1 values, the
token type is encoded in the prefix bits of the stored flit. The output
stage emits the frame slot verbatim as flit 1; the prefix bits in the
stored value determine the wire format. No runtime type derivation logic
is needed. For SWITCH not-taken paths, the output stage emits a monadic
inline token (hardwired prefix in the formatter, overriding the frame
destination).

#### IRAM Valid-Bit Protection

IRAM is written via PE-local write tokens (prefix 011+01, region=0) that
share the PE's input path with compute tokens. When IRAM contents are
replaced at runtime (swapping function fragments in and out), tokens in
flight may target IRAM addresses that have been or are being overwritten.
Because tokens do not carry instruction identity information, the PE
cannot distinguish "right instruction" from "wrong instruction" -- it
just sees an offset into IRAM.

This is an **instruction identity problem**, not a presence problem. The
dangerous case is not "IRAM is empty" but "IRAM contains a different
instruction than the token expects."

**The mechanism: per-page valid bits.** IRAM is logically divided into
pages (e.g. 8 pages of 32 instructions for a 256-entry IRAM, or 16 pages
of 16). Each page has a 1-bit valid flag, stored in a small TTL register
alongside IRAM. Total hardware cost: one register chip for all page
valid bits.

The valid bit is checked during the IFETCH pipeline stage, in parallel
with the IRAM SRAM read.
The top bits of the token's `offset` field
select the page; the valid bit for that page gates whether the token
proceeds or is rejected.

```
Token arrives at IFETCH:
  page = offset[high bits]
  if valid_bit[page] == 0:
    -> reject token (see rejection policy below)
  else:
    -> proceed with instruction fetch and act_id resolution
```

**IRAM swap protocol.** Because config writes and compute tokens share
the input path, the swap sequence is naturally ordered:

```
1. Loader sends drain signal (implementation TBD -- could be a PE-local
   write "quiesce" flag, or the PE back-pressures via handshake/ready
   signal)
2. PE processes remaining compute tokens in pipeline (natural drain)
3. PE-local write token (prefix 011+01, region=0) arrives:
   a. PE clears valid bit for target page
   b. PE writes instruction word to IRAM at specified address
   c. If more write tokens follow (burst), keep writing
4. Load-complete marker arrives (PE-local write with load-complete flag):
   a. PE sets valid bit for target page
   b. PE resumes accepting compute tokens for that page
```

During steps 3-4, any compute token that arrives targeting an invalid
page is rejected. The shared input path ordering guarantees that tokens
from the *new* code epoch cannot arrive until after the loader has sent
them, which is after the load completes. Rejected tokens are therefore
late arrivals from the old epoch -- work that is being abandoned.

**Presence-bit optimisation:** the frame's presence metadata (in the 670
register files) can be checked before step 1 to determine if any tokens
are pending for offsets in the target page. If all presence bits for
matchable offsets in that page are clear, the drain step can be skipped
entirely -- no tokens are waiting for those instructions. This enables
targeted IRAM replacement without stalling the entire PE.
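The valid-bit check and the swap ordering can be modelled behaviourally. The sketch below assumes the 8-pages-of-32 split for a 256-entry IRAM; the class and method names are illustrative, not taken from the design:

```python
PAGE_SHIFT = 5  # offset[7:5] selects one of 8 pages of 32 instructions

class Iram:
    """Behavioural sketch of per-page valid-bit protection (names are mine)."""
    def __init__(self):
        self.mem = [0] * 256
        self.valid = [False] * 8
        self.discard_led = False        # sticky diagnostic flag (v0 policy)

    def local_write(self, addr, word):
        """PE-local write token: invalidate the target page, then write."""
        self.valid[addr >> PAGE_SHIFT] = False
        self.mem[addr] = word

    def load_complete(self, page):
        """Load-complete marker: re-validate the page."""
        self.valid[page] = True

    def fetch(self, offset):
        """IFETCH-stage check: reject tokens targeting an invalid page."""
        if not self.valid[offset >> PAGE_SHIFT]:
            self.discard_led = True     # discard silently, light the LED
            return None
        return self.mem[offset]

iram = Iram()
iram.local_write(0x21, 0x1234)          # page 1 now invalid
assert iram.fetch(0x21) is None and iram.discard_led
iram.load_complete(1)                   # re-validate page 1
assert iram.fetch(0x21) == 0x1234
```

The sticky `discard_led` flag mirrors the v0 rejection policy: late tokens are dropped, and the only evidence is the diagnostic flag.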
**Rejection policy.** v0: discard silently + diagnostic. Late-arriving
tokens targeting an invalid IRAM page are dropped. The PE sets a sticky
flag (directly driving a diagnostic LED) to indicate that a discard
occurred. This is a "should never happen if the loader protocol is
correct" safety net. The LED makes it visible during debugging without
adding any pipeline complexity.

If the flag lights up, something is wrong with the drain timing.

**Future: NAK response.** The PE could form a NAK token from the rejected
compute token and emit it to a coordinator. The output stage already
exists (forms tokens from frame destinations + ALU results); a bypass
path from IFETCH to the output would enable this. Estimated cost: a mux
and some control logic, ~5-8 TTL chips. Deferred until the runtime is
sophisticated enough to act on NAKs.

**What this does NOT protect against.** The valid-bit mechanism catches
tokens that arrive **during** an IRAM swap (page is invalid). It does
**not** catch tokens that arrive after a swap completes and the page is
re-validated with different code. Preventing that case requires the drain
protocol to be correct -- all tokens from the old epoch must have been
processed or discarded before the new code is marked valid.

For v0, this is a software/loader invariant enforced by the drain
protocol. Future hardening options:

- **Per-page epoch counter (2-3 bits):** incremented on each page reload.
  Checked against an expected epoch stored per-activation or derived from
  spare token bits. Catches post-swap stale tokens at the cost of a
  comparator + epoch storage.
- **Fragment ID register per page:** similar to epoch but identifies the
  fragment by name rather than sequence number. More expensive (wider
  comparator) but more debuggable.

Both options fit in the spare bits reserved in the token formats.
The
valid-bit mechanism is forward-compatible with either.

## 7. SM Operation Dispatch

SM operations (type=1) use the same 16-bit instruction format. The 5-bit
opcode field selects from the SM opcode space (independent of CM opcodes).
`fref` points at frame slots containing SM-specific parameters. The PE
constructs SM tokens from frame data, operand data, and ALU results.

All SM addressing goes through frame slots. There is no separate
"pointer-addressed" vs "constant-addressed" distinction in the instruction
encoding -- the frame slot contents determine the target SM, address, and
any return routing. Both pointer-addressed operations (address from token
data) and constant-addressed operations (address from frame slot) use the
same frame-slot-based encoding.

### SM Bus Opcode Encoding

The SM bus opcode encoding is unchanged -- variable-width with tier 1
(3-bit opcode, 10-bit addr) and tier 2 (5-bit opcode, 8-bit payload).
See `sm-design.md` for the full opcode table:

```
Tier 1 (3-bit, 1024-cell addr range):
  read, write, alloc, free, exec, ext

Tier 2 (5-bit, 256-cell addr range or 8-bit payload):
  rd_inc, rd_dec, cas, raw_rd, clear, set_pg, write_im, (spare)
```

### SM Token Construction

The PE output stage builds SM tokens on the wire. The instruction's SM
opcode (in the PE's 5-bit opcode field) maps to the SM bus opcode via
the instruction decoder EEPROM. Frame slots provide addressing and
return routing parameters:

- **SM flit 1** is constructed from frame[fref] contents (SM_id, address)
  plus the SM bus opcode from the decoder.
- **SM flit 2** source depends on the operation (see flit 2 source mux
  below).
- **SM flit 3** (CAS and EXT only) carries additional data.
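One way to picture the decoder EEPROM's role in SM dispatch is as a lookup table from PE opcode to per-operation control signals. The entries below are invented placeholders (the document fixes neither EEPROM contents nor signal encodings), chosen to be consistent with the operation behaviour described in this section:

```python
# Illustrative decoder-EEPROM fragment for SM dispatch. Keys and values
# are placeholders; only the relationships (read -> frame return routing,
# write -> ALU data, CAS -> 3 flits with L operand) come from the design.
SM_DECODE = {
    "SM_READ":  {"tier": 1, "flit2_src": "frame",  "extra_flit": 0},
    "SM_WRITE": {"tier": 1, "flit2_src": "alu",    "extra_flit": 0},
    "SM_CAS":   {"tier": 2, "flit2_src": "l_oper", "extra_flit": 1},
}

def decode_sm(pe_opcode: str) -> dict:
    """Model the EEPROM lookup: opcode in, control signals out."""
    return SM_DECODE[pe_opcode]

assert decode_sm("SM_CAS")["extra_flit"] == 1   # CAS emits 3 flits
```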
### Frame Slot Packing for SM Parameters

A single 16-bit frame slot holds the SM target:

```
Tier 1 target slot: [SM_id:2][addr_high:2][addr_low:8][spare:4] = 16 bits
                    (addr = 10 bits, 1024-cell range)

Tier 2 target slot: [SM_id:2][addr:8][spare:6] = 16 bits
                    (addr = 8 bits, 256-cell range)
```

For operations needing return routing, the next consecutive frame slot
holds a pre-formed response token flit 1:

```
Return routing slot: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3] = 16 bits
```

This is the same format as a CM destination slot -- the SM response token
routes as a normal compute token back to the requesting PE. The SM treats
flit 2 of the request as an opaque 16-bit blob and echoes it as flit 1
of the response (see `sm-design.md` Result Format).

### Flit 2 Source Mux

Different SM operations need different data in flit 2. The source is
selected by a 2-bit signal derived from the instruction decoder (SM
opcode + mode):

```
source   select  use case
-------  ------  -----------------------------------------------
ALU out  00      SM write (operand is write data, passes through ALU),
                 SM write_im (immediate write). Default for CM compute.
R oper   01      SM scatter write (ALU computes addr from base + L operand;
                 R operand is write data, bypasses ALU to flit 2).
                 SM CAS flit 2 (expected value = L operand).
Frame    10      SM read / rd_inc / rd_dec / raw_rd / alloc return routing
                 (frame[fref+1] = pre-formed response token flit 1).
                 SM exec parameters.
(spare)  11      reserved
```

Hardware: cascaded 74LS157 (quad 2:1 mux) pairs, ~4-6 chips for 16-bit
width.
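As an executable cross-check of the target-slot layouts above, a packing sketch (function names are mine):

```python
def pack_tier1_target(sm_id: int, addr: int) -> int:
    """Pack [SM_id:2][addr_high:2][addr_low:8][spare:4] into one 16-bit slot."""
    assert 0 <= sm_id < 4 and 0 <= addr < 1024
    return (sm_id << 14) | ((addr >> 8) << 12) | ((addr & 0xFF) << 4)

def pack_tier2_target(sm_id: int, addr: int) -> int:
    """Pack [SM_id:2][addr:8][spare:6] into one 16-bit slot."""
    assert 0 <= sm_id < 4 and 0 <= addr < 256
    return (sm_id << 14) | (addr << 6)

slot = pack_tier1_target(sm_id=2, addr=0x3A7)
assert slot >> 14 == 2                # SM_id in the top 2 bits
assert (slot >> 4) & 0x3FF == 0x3A7   # contiguous 10-bit address in bits 13:4
```

Note that the 10-bit address lands contiguously in bits 13:4 of the slot, which is what lets the output stage later concatenate slot fields onto the SM bus without runtime muxing.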
### SM Operation Mapping Table

| SM bus op | PE opcode | frame slots | mode | operands | flits | flit 2 source | notes |
|-----------|-----------|-------------|------|----------|-------|---------------|-------|
| read | SM_READ | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | indexed variant: ALU adds base + index |
| write | SM_WRITE | 1: target | 0 | monadic (data) | 2 | ALU (write data) | |
| write (scatter) | SM_WRITE_IX | 1: target | 1 | dyadic (index, data) | 2 | R operand (write data) | ALU: base + index |
| alloc | SM_ALLOC | 2: params + return | 1 | monadic (trigger) | 2 | frame (return routing) | |
| free | SM_FREE | 1: target | 0 | monadic (trigger) | 2 | don't-care | |
| exec | SM_EXEC | 2: target + params | 1 | monadic (trigger) | 2 | frame (params/count) | |
| ext | SM_EXT | 1-2: varies | varies | varies | 3 | varies | 3-flit extended addressing |
| rd_inc | SM_RDINC | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | atomic read-and-increment |
| rd_dec | SM_RDDEC | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | atomic read-and-decrement |
| cas | SM_CAS | 1: target | 0 | dyadic (expected, new) | 3 | L operand (expected) | 3-flit; return via prior read |
| raw_rd | SM_RAWRD | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | non-blocking, no deferred read |
| clear | SM_CLEAR | 1: target | 0 | monadic (trigger) | 2 | don't-care | resets cell to EMPTY |
| set_pg | SM_SETPG | 1: target | 0 | monadic (page value) | 2 | ALU (page value) | SM-side bank switching |
| write_im | SM_WRIM | 1: target | 0 | monadic (data) | 2 | ALU (write data) | immediate write, tier 2 addr |

### SM Flit 1 Assembly

Stage 5 assembles SM flit 1 from frame[fref] and the wire opcode from
the decoder EEPROM:

```
Tier 1:
  flit 1
    = [1][frame[fref][15:14] (SM_id)][wire_opcode:3][frame[fref][13:4] (addr:10)]

Tier 2:
  flit 1
    = [1][frame[fref][15:14] (SM_id)][wire_opcode:5][frame[fref][13:6] (addr:8)]
```

Hardware: the `[1]` prefix is hardwired. SM_id comes from the top 2 bits
of the frame slot. The wire opcode comes from the EEPROM. The address
comes from the remaining frame slot bits. All fields are concatenated
on the wire with no runtime muxing -- the frame slot is pre-packed so
that the bit positions align with the SM bus format. ~1-2 chips for
output gating and serialisation.

### Indexed Address Computation

For indexed READ, scatter WRITE, and other address-computed operations,
the ALU computes the effective address: base address (extracted from
frame[fref]) + index (from the left operand). The computed address is
packed into SM flit 1 by the output stage. The ALU performs address
arithmetic without dedicated address-computation hardware.

### CAS: 3-Flit SM Token

Compare-and-swap requires address, expected value, and new value. This
exceeds the standard 2-flit SM packet.

```
CAS emission (3 flits):
  flit 1: SM header (SM_id + CAS opcode + addr, from frame[fref])
  flit 2: expected value (left operand)
  flit 3: new value (right operand)
```

The output serialiser emits 3 flits instead of the default 2. An
`extra_flit` signal from the instruction decoder (asserted for CAS and
EXT-mode ops) increments the serialiser's flit counter limit:
`flit_count = 2 + extra_flit`. One gate.

Return routing for CAS uses the prior-READ pattern: issue an SM READ to
the target cell first (plants return routing in the SM's deferred-read
register), receive the current value as the response (which also provides
the expected value), then issue the CAS with the known expected value
and the desired new value.
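The flit 1 layouts and the `extra_flit` rule can be combined into a small behavioural model. The numeric wire opcode used in the test is an assumed placeholder -- the document does not assign opcode values:

```python
def sm_flit1(frame_slot: int, wire_opcode: int, tier: int = 1) -> int:
    """Assemble SM flit 1 from a pre-packed frame slot and the EEPROM opcode.

    Tier 1: [1][SM_id:2][opcode:3][addr:10]; tier 2: [1][SM_id:2][opcode:5][addr:8].
    """
    sm_id = frame_slot >> 14
    if tier == 1:
        addr = (frame_slot >> 4) & 0x3FF   # frame[fref][13:4]
        return (1 << 15) | (sm_id << 13) | (wire_opcode << 10) | addr
    addr = (frame_slot >> 6) & 0xFF        # frame[fref][13:6]
    return (1 << 15) | (sm_id << 13) | (wire_opcode << 8) | addr

def serialise(flit1, flit2, flit3=None, extra_flit=0):
    """Output serialiser sketch: flit_count = 2 + extra_flit (CAS/EXT)."""
    return [flit1, flit2, flit3][:2 + extra_flit]

# Tier 1 slot: SM_id=2, addr=0x3A7, with an assumed wire opcode of 1.
slot = (2 << 14) | (3 << 12) | (0xA7 << 4)
f1 = sm_flit1(slot, wire_opcode=1, tier=1)
assert f1 >> 15 == 1 and (f1 >> 13) & 3 == 2 and f1 & 0x3FF == 0x3A7
# CAS-style 3-flit emission: expected value, then new value.
assert len(serialise(f1, 41, 42, extra_flit=1)) == 3
```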
### SM Flit Assembly Hardware Cost

| Component | Chips/PE | Purpose |
|-----------|----------|---------|
| Flit 2 source mux (16-bit, 4:1) | ~4-6 | ALU out / R oper / Frame / spare |
| SM flit 1 gating | ~1-2 | Frame target slot + EEPROM opcode to bus |
| Extra flit control | ~0.5 | CAS/EXT 3-flit counter |

## 8. EEPROM-Based Instruction Decoding

The instruction decoder can be implemented as an EEPROM acting like a
PLD. Input bits = instruction opcode fields + PE ID bits. Output bits =
control signals for the ALU, matching store, token output formatter, etc.

This gives significant flexibility:

- Instruction set can be changed by reflashing the EEPROM (no board changes)
- Per-PE customisation (different PEs could theoretically have different
  instruction subsets, though unlikely for v0)
- The PE ID is "free" -- it's just more EEPROM address bits

## 9. The 670 Subsystem: Act ID Lookup, Match Metadata, and SC Register File

### Role in the Frame-Based Architecture

The 74LS670s serve two critical functions:

1. **act_id -> frame_id lookup table.** Indexed by the token's 3-bit
   `activation_id`, outputs `{valid:1, frame_id:2, spare:1}` in
   ~35 ns (combinational). This replaces what would otherwise be an
   SRAM cycle for associative tag comparison.

2. **Presence and port metadata store.** Indexed by `frame_id`,
   stores presence and port bits for all 8 matchable offsets across
   all 4 frames. Combinational read (~35 ns after frame_id settles,
   ~70 ns total from act_id presentation).

Both functions complete within stage 2, in parallel with the IRAM
read. By the time stage 3 begins, the PE knows frame_id, presence,
and port -- the only remaining SRAM access is the actual operand
data.

### Hardware Configuration

**act_id -> frame_id (2x 74LS670):**

Addressed by `act_id[1:0]` with `act_id[2]` selecting between chips.
Each chip holds 4 words x 4 bits. Output: `{valid:1, frame_id:2,
spare:1}`.

```
ALLOC:  write {valid=1, frame_id} at address act_id (670 write port)
FREE:   write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns
```

The 670's independent read and write ports allow ALLOC to proceed
while the pipeline reads -- zero conflict.

**Presence + port metadata (4x 74LS670):**

Each 670 word (4 bits) holds presence+port for 2 offsets:
`{presence_N:1, port_N:1, presence_N+1:1, port_N+1:1}`.
Read address = `[frame_id:2]`. Output bits selected by
`offset[2:0]` via bit-select mux.

```
670 chip 0 (offsets 0-1): word[frame_id] = {pres0, port0, pres1, port1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2, port2, pres3, port3}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4, port4, pres5, port5}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6, port6, pres7, port7}
```

`offset[2:1]` selects chip, `offset[0]` selects which pair of bits
within the 4-bit output (a 2:1 mux -- one gate).

The 670's simultaneous read/write is critical: during stage 3, when
a first operand stores and sets presence, the write port updates the
presence 670 while the read port remains available for the next
pipeline stage's lookup. No read-modify-write sequencing needed.

**Bit select mux (1-2 chips):**

Offset-based selection of the relevant presence and port bits from
the 670 outputs.
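A behavioural model of both 670 functions (act_id lookup plus presence/port metadata), ignoring propagation timing; class and method names are illustrative:

```python
class MatchMetadata:
    """Sketch of the 670-based metadata path (timing and chip-level
    read/write ports are not modelled)."""
    def __init__(self):
        self.act_map = [(0, 0)] * 8                   # act_id -> (valid, frame_id)
        self.pp = [[(0, 0)] * 8 for _ in range(4)]    # [frame_id][offset] -> (presence, port)

    def alloc(self, act_id, frame_id):
        self.act_map[act_id] = (1, frame_id)          # 670 write port

    def free(self, act_id):
        self.act_map[act_id] = (0, 0)

    def set_presence(self, frame_id, offset, port):
        """First operand stored: record presence and arrival port."""
        self.pp[frame_id][offset] = (1, port)

    def lookup(self, act_id, offset):
        """Combinational path: act_id -> frame_id, then bit-select
        presence/port for this offset."""
        valid, frame_id = self.act_map[act_id]
        presence, port = self.pp[frame_id][offset]
        return valid, frame_id, presence, port

m = MatchMetadata()
m.alloc(5, frame_id=2)
m.set_presence(2, offset=3, port=1)
assert m.lookup(5, 3) == (1, 2, 1, 1)
```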
### Chip Budget

| Component                 | Chips  | Function                     |
|---------------------------|--------|------------------------------|
| act_id -> frame_id lookup | 2      | 74LS670, indexed by act_id   |
| Presence + port metadata  | 4      | 74LS670, indexed by frame_id |
| Bit select mux            | 1-2    | offset-based selection       |
| **Total match metadata**  | **~8** |                              |

### SC Register File (Mode-Switched)

During **dataflow mode**, the PE uses act_id resolution and presence
metadata constantly but the SC register file is idle (no SC block
executing). During **SC mode**, the PE uses the register file
constantly but act_id lookup and presence tracking are idle (the SC block
has exclusive PE access; no tokens enter matching).

Some of the 670s can be repurposed for register storage during SC
mode. The exact mapping depends on the SC block design:

- The 4 presence+port 670s (indexed by frame_id in dataflow mode) can
  be re-addressed by instruction register fields during SC mode,
  providing 4 chips x 4 words x 4 bits = 64 bits of register storage.
  Combined across chips, this gives **4 registers x 16 bits** (4 bits
  per chip, 4 chips for width).

- With additional mux logic, all 6 shared 670s (excluding the act_id
  lookup pair, which may need to remain active for frame lifecycle
  management) could provide **6 registers x 16 bits** during SC mode.

The act_id lookup 670s may need to remain in their dataflow role even
during SC mode if the PE must handle frame control tokens (ALLOC/FREE)
arriving during SC block execution. Whether to share them depends on
the SC block entry/exit protocol.
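The 4-registers-x-16-bits repurposing amounts to slicing each register into nibbles across the four chips. A behavioural sketch (not a wiring description; names are mine):

```python
class Shared670Bank:
    """Four 74LS670s (4 words x 4 bits each). In dataflow mode they hold
    presence/port metadata; in SC mode the same storage is addressed as
    4 registers x 16 bits, one nibble per chip."""
    def __init__(self):
        self.chips = [[0] * 4 for _ in range(4)]   # chips[c][word] = 4-bit nibble

    def sc_write(self, reg, value):
        """SC-mode register write: distribute one nibble to each chip."""
        for c in range(4):
            self.chips[c][reg] = (value >> (4 * c)) & 0xF

    def sc_read(self, reg):
        """SC-mode register read: reassemble 16 bits from the four chips."""
        return sum(self.chips[c][reg] << (4 * c) for c in range(4))

bank = Shared670Bank()
bank.sc_write(2, 0xBEEF)
assert bank.sc_read(2) == 0xBEEF
```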
### The Predicate Slice

One of the 670s can be **permanently dedicated as a predicate
register** rather than participating in the mode-switched pool:

- 4 entries x 4 bits = 16 predicate bits, always available
- Useful for: conditional token routing (SWITCH), loop termination
  flags, SC block branch conditions, I-structure status flags
- Does not reduce the metadata capacity significantly: the remaining
  3 presence+port 670s still cover 6 of the 8 matchable offsets;
  the 2 uncovered offsets can fall back to SRAM-based presence or
  simply constrain the assembler to 6 dyadic offsets per frame

The predicate register is always readable and writable regardless of
mode, since it's a dedicated chip with its own address/enable lines.
Instructions can test or set predicate bits without going through the
matching store or the ALU result path.

### Mode Switching

When transitioning from dataflow mode to SC mode:

1. **Save metadata** from the shared 670s to spill storage.
2. **Load initial SC register values** (matched operand pair that
   triggered the SC block) into the 670s.
3. **Switch address mux**: 670 address lines now driven by
   instruction register fields instead of frame_id / act_id.
4. **Switch IRAM to counter mode**: sequential fetch via incrementing
   counter rather than token-directed offset.

When transitioning back:

1. **Emit final SC result** as token (last instruction with OUT=1).
2. **Restore metadata** from spill storage to the 670s.
3. **Switch address mux back** to frame_id / act_id addressing.
4. **Resume token processing** from input FIFO.

### Spill Storage Options

Metadata from the shared 670s (~64-96 bits depending on how many
are shared) needs temporary storage during SC block execution.
**Option A: Shift registers.** 2x 74LS165 (parallel-in, serial-out)
for save + 2x 74LS595 (serial-in, parallel-out) for restore. Total:
4 chips. Save/restore takes ~12 clock cycles each.

**Option B: Dedicated spill 670.** One additional 74LS670 (4x4 bits)
holds 16 bits per save cycle; need ~4-6 write cycles to save all
shared chips' contents. Total: 1 chip, ~4-6 cycles per save/restore.

**Option C: Spill to frame SRAM.** During SC mode, the frame SRAM
has bandwidth available (no match operand reads). Write the 670
metadata contents into a reserved region of the frame SRAM address
space. No extra chips needed. ~4-6 SRAM write cycles to save, ~4-6
to restore. The SRAM is single-ported but there's no contention
because the pipeline is paused during mode switch.

**Recommended: Option C.** Zero additional chips. The save/restore
overhead of ~4-6 cycles per transition is negligible compared to the
SC block's execution savings (EM-4 data: 23 clocks pure dataflow vs
9 clocks SC for Fibonacci, so even with ~10 cycles of mode switch
overhead, you break even at ~5-7 SC instructions).

## 10. SRAM Configuration and Memory Map

### Unified SRAM Chip Pair

The PE uses a single 32Kx8 chip pair (2 chips for 16-bit data width)
for both IRAM and frame storage, with address partitioning via a
single decode bit. The recommended part is the AS6C62256 (55 ns,
32Kx8, DIP-28) or equivalent. 55 ns access time fits comfortably
within a 200 ns clock period at 5 MHz, with margin for address setup
and data hold.

The unified SRAM approach keeps chip count low: one chip pair per PE
serves both IRAM and frame storage, avoiding the chip proliferation
that separate matching store and IRAM memories would require.
### Address Map

```
v0 address space (simple decode):

  IRAM region:  [0][offset:8]           instruction templates
                offset from token
                capacity: 256 instructions (512 bytes)

  Frame region: [1][frame_id:2][slot:6] per-activation storage
                frame_id from tag store resolution
                capacity: 4 frames x 64 slots = 256 entries (512 bytes)
```

Total v0 SRAM utilisation: IRAM 512 bytes, frame 512 bytes. Under 1.5
KB used out of a 32Kx8 chip pair (64 KB). Ample room for future
expansion without changing chips. See Future: Bank Switching with
74LS610 (section 14) for the upgrade path when programs exceed 256
instructions per PE.

### Shared SRAM Arbitration

The unified SRAM chip pair is shared between three access patterns:

- Pipeline IRAM reads (stage 2, instruction fetch): high frequency,
  performance-critical
- Pipeline frame reads/writes (stages 3 and 5): high frequency,
  performance-critical
- PE-local write tokens (IRAM and frame loading): low frequency, can
  tolerate delay

**Arbitration approach**: PE-local writes execute when the frame SRAM is
not busy with compute pipeline accesses (natural gaps between pipeline
stages). When no gap is available, writes queue and execute during the
next idle cycle. Hardware cost: mux on SRAM address/data buses +
write-enable gating + stall signal to pipeline. Roughly 5-8 TTL chips.

**IRAM vs frame contention**: in v0, IRAM and frame share one SRAM chip
pair via address partitioning (region bit in the address). Stage 2
(IRAM read) and stage 3/5 (frame read/write) access different address
regions but contend for the same physical chip. The pipeline controller
ensures only one stage accesses the SRAM per cycle. With the natural
pipeline spacing, this rarely causes stalls -- see the frame SRAM
contention model in Pipeline Stall Analysis (section 11).
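The two-region decode behind this partitioning can be cross-checked executably; helper names are mine:

```python
def iram_addr(offset: int) -> int:
    """[0][offset:8] -- region bit 8 clear selects the IRAM region."""
    assert 0 <= offset < 256
    return offset

def frame_addr(frame_id: int, slot: int) -> int:
    """[1][frame_id:2][slot:6] -- region bit 8 set selects the frame region."""
    assert 0 <= frame_id < 4 and 0 <= slot < 64
    return (1 << 8) | (frame_id << 6) | slot

assert frame_addr(3, 5) == 0b1_11_000101   # = 453
assert frame_addr(0, 0) == 256             # frame region starts just above IRAM
assert iram_addr(255) == 255               # top of the IRAM region
```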
**Upgrade path**: separating IRAM and frame onto independent SRAM chip
pairs eliminates all inter-region contention. Stage 2 (IRAM) and
stage 3/5 (frame) can access their respective chips in the same cycle.

**Async-compatible arbitration**: defined as request/grant interface.
Synchronous implementation: priority mux resolved on clock edge. Async
implementation: mutual exclusion element (Seitz arbiter). Interface is
the same in both cases. See `network-and-communication.md` for clocking
discipline.

## 11. Pipeline Stall Analysis

### The Frame SRAM Contention Problem

With Approach C (670 lookup), act_id -> frame_id resolution is
combinational (~35 ns via 670 read port), and the presence/port
check is also combinational (~35 ns from a second set of 670s).
There is no read-modify-write on SRAM for metadata -- metadata
lives entirely in the 670 register files.

The primary bottleneck is **frame SRAM contention between stage 3 and
stage 5**. Both stages access the same single-ported SRAM chip pair:

- **Stage 3** reads/writes operand data (dyadic match) and reads
  constants (modes with has_const=1).
- **Stage 5** reads destinations (modes 0-3), or writes results back
  to the frame (sink modes 6-7).

When two pipelined tokens have stage 3 and stage 5 active in the
same cycle, the SRAM can serve only one. The other stalls.

### The Pipeline Hazard

The classic RAW hazard still exists but takes a different form. Two
consecutive tokens targeting the same frame slot (e.g., two mode 7
read-modify-write operations on the same accumulator slot) create a
data dependency: the second token's stage 3 read must see the first
token's stage 5 write.

Detection requires comparing (act_id, fref) of the incoming token
against in-flight pipeline latches at stages 3-5.
Hardware cost: ~2
chips (9-bit comparator + AND gate). Alternatively, the assembler
can guarantee this never happens by never emitting consecutive mode 7
tokens to the same slot on the same PE.

This hazard is **statistically uncommon** in dataflow execution. Two
operands arriving back-to-back at the exact same frame slot requires
coincidental timing. The bypass path is cheap insurance that fires
infrequently.

### SRAM Contention Model

The frame SRAM chip is single-ported (one access per clock cycle at
5 MHz with 55 ns SRAM). The primary stall source is contention
between stage 3 (frame reads for operand data and constants) and
stage 5 (frame reads for destinations, or frame writes for sink
modes).

**Contention arises only when:**

- Token A is at stage 5, needing a frame SRAM read (dest) or write
  (sink), AND
- Token B is at stage 3, needing a frame SRAM read (match operand,
  constant, or tag word).

**Contention does NOT arise when:**

- Token A's stage 5 is mode 4/5 (change_tag -- no SRAM access).
- Token B's stage 3 is zero-cycle (monadic no-const, or match data
  in register file with no const).
- Token A was a dyadic miss (terminated at stage 3, never reaches
  stage 5).

### Cycle Counts by Instruction Type

**Approach C (74LS670 lookup, recommended v0):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Stage 3 breakdown for Approach C:

- Dyadic hit: 1 SRAM cycle to read stored operand (frame_id and
  presence already known from 670).
  +1 cycle for constant if
  has_const=1.
- Dyadic miss: 1 SRAM cycle to write operand data. 670 write port
  sets presence bit combinationally in parallel.
- Monadic: 0 SRAM cycles (no match), +1 for constant if has_const=1.

**Approach B (register-file match pool):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Approaches B and C produce identical single-token cycle counts. The
difference emerges under pipelining: Approach B's match data never
touches the frame SRAM (operands stored in a dedicated register
file), so stage 3's only SRAM access is the constant read. This
reduces stage 3 vs stage 5 SRAM contention.

**Approach A (set-associative tags in SRAM, minimal chips):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     2     --    --    4
dyadic hit, mode 0               1     1     2     1     1     6
dyadic hit, mode 1               1     1     3     1     1     7
dyadic hit, mode 3 (fan+const)   1     1     3     1     2     8
```

Approach A adds 1 extra SRAM cycle per dyadic operation (tag word
read + associative compare) because act_id resolution is not
combinational.

### Pipeline Overlap Analysis

With single-port frame SRAM at 5 MHz, the pipeline controller must
arbitrate between stage 3 and stage 5. When both need SRAM in the
same cycle, stage 3 stalls.
**Approach B, two consecutive dyadic-hit mode 1 tokens:**

```
cycle 0: A.stg1
cycle 1: A.stg2 (IRAM)
cycle 2: A.stg3 match (reg file)                       -- frame SRAM FREE
cycle 3: A.stg3 const (SRAM)
cycle 4: A.stg4 (ALU)                                  -- frame SRAM FREE
cycle 5: A.stg5 dest (SRAM)   B.stg3 match (reg file)  -- NO CONFLICT
cycle 6: (A done)             B.stg3 const (SRAM)
cycle 7:                      B.stg4 (ALU)
cycle 8:                      B.stg5 dest (SRAM)       -- NO CONFLICT
```

Token spacing: 4 cycles. Approach A under the same conditions: ~6-7
cycles due to additional SRAM contention in stage 3.

### Throughput Summary

Per PE, at 5 MHz, single-port frame SRAM:

| Instruction mix profile | Approach A | Approach C | Approach B |
|-------------------------|------------|------------|------------|
| Monadic-heavy (mode 0/4/6) | ~1.25 MIPS | ~1.67 MIPS | ~1.67 MIPS |
| Mixed (40% dyadic mode 1, 30% monadic, 30% misc) | ~833 KIPS | ~1.25 MIPS | ~1.25 MIPS |
| Dyadic-heavy with constants | ~714 KIPS | ~1.00 MIPS | ~1.00 MIPS |
| Worst case (mode 3, const+fanout) | ~625 KIPS | ~714 KIPS | ~714 KIPS |

4-PE system: multiply by 4. Realistic mixed workload: ~3.3-5.0 MIPS
(A), ~5.0-6.7 MIPS (C), or ~5.0-6.7 MIPS (B). For reference: the
original Amamiya DFM prototype (TTL, 1982) achieved 1.8 MIPS per PE.
The EM-4 prototype (VLSI gate array, 1990) achieved 12.5 MIPS per PE.
This design sits between the two, closer to the DFM, which is
historically appropriate for a discrete TTL build.

### Pipeline Timing by Era

With the 670-based matching subsystem (Approach C), act_id
resolution and presence/port checking are combinational (~35-70 ns)
**regardless of era**. These never become the timing bottleneck.

The era-dependent part is **SRAM access time** for frame reads and
writes.
This determines how many SRAM operations fit per clock cycle
and thus how much stage 3 vs stage 5 contention exists.

**1979-1983 (5 MHz, 55 ns SRAM):**

```
670 metadata: combinational (~35-70 ns), well within 200 ns cycle
Frame SRAM:   one access per 200 ns cycle (55 ns access + setup/hold margin)
Bottleneck:   frame SRAM single-port, stage 3 vs stage 5 contention
SC block throughput: ~1 instruction per clock (670 dual-port)
Overall token throughput: ~1 token per 3-5 clocks (pipelined, mode-dependent)
```

**1984-1990 (5-10 MHz, dual-port SRAM):**

```
670 metadata: combinational (unchanged)
Frame SRAM:   dual-port (IDT7132 or similar), port A for stage 3, port B for stage 5
Bottleneck:   eliminated -- both stages access SRAM simultaneously
SC block throughput: ~1 instruction per clock
Overall token throughput: approaches 1 token per 3 clocks for most modes
```

Dual-port SRAM eliminates the primary stall source. The pipeline
becomes instruction-latency-limited rather than SRAM-contention-limited.

**Modern parts (5 MHz clock, 15 ns SRAM):**

```
670 metadata: combinational (unchanged)
Frame SRAM:   15 ns access, ~13 accesses fit in 200 ns cycle
Practical:    2-3 sub-cycle accesses via time-division multiplexing
Bottleneck:   none -- frame SRAM has excess bandwidth
Token throughput: 1 token per 3 clocks (pipeline-stage-limited, not SRAM-limited)
```

With 15 ns AS7C256B-15PIN (DIP, currently available at ~$3), two
sub-cycle accesses fit within a 200 ns clock period. This achieves
TDM-like parallelism without additional MUX logic, effectively
giving the pipeline a dual-port view of a single-port chip.
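The per-era budgets above come down to dividing the clock period by the effective access time. A minimal sketch (the 100 ns setup/hold margin is an illustrative figure; only the 55 ns / 15 ns access times and 5 MHz clock are from this section):

```python
def sram_accesses_per_cycle(clock_hz: float, access_ns: float,
                            margin_ns: float = 0.0) -> int:
    """How many single-port SRAM accesses fit in one clock period,
    allowing a per-access setup/hold margin."""
    period_ns = 1e9 / clock_hz
    return int(period_ns // (access_ns + margin_ns))

# 1979-1983: 55 ns parts at 5 MHz leave room for only one margined
# access per 200 ns cycle.
era_1979 = sram_accesses_per_cycle(5e6, 55, margin_ns=100)  # -> 1

# Modern: 15 ns parts at 5 MHz give raw bandwidth for ~13 accesses,
# of which the pipeline only needs two (stage 3 + stage 5 sub-cycles).
modern = sram_accesses_per_cycle(5e6, 15)                   # -> 13
```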
**Integrated (on-chip SRAM, sub-ns access):**

```
670 equivalent: on-chip multi-ported register file, ~200 transistors
Frame SRAM:     on-chip, sub-cycle access trivially
Token throughput: 1 per 3 clocks, potentially faster with deeper pipelining
```

## 12. SC Blocks and Execution Modes

### PE-to-PE Pipelining

When multiple PEs are chained for software-pipelined loops, the per-PE
pipeline throughput determines the overall chain throughput.

With the pipelined design (1 token per 3-5 clocks depending on
instruction mix and era), the inter-PE hop cost becomes the critical
path for chained execution:

| Interconnect | Hop latency | Viable? |
|-------------|-------------|---------|
| Shared bus (discrete build) | 5-8 cycles | Marginal -- chain overhead dominates |
| Dedicated FIFO between adjacent PEs | 2-3 cycles | Worthwhile for tight loops |
| On-chip wide parallel link (integrated) | 1-2 cycles | Competitive with intra-PE SC block |

For the discrete v0 build, dedicated inter-PE FIFOs (bypassing the
shared bus) would enable PE chaining at reasonable cost. This is a
low-chip-count addition (~2-4 chips per PE pair) that unlocks
software-pipelined loop execution.

**Loopback bypass.** When a PE emits a token destined for itself
(common in iterative computations), the token can be looped back
internally without traversing the bus at all. See
`bus-interconnect-design.md` for the loopback bypass design, which
eliminates the bus hop latency entirely for self-targeted tokens.
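Under the simplifying assumption that a chain stage cannot overlap its token processing with the outbound hop, the steady-state iteration interval is just their sum. A sketch using the hop latencies from the table above and an illustrative 4-cycle per-PE pipeline time:

```python
def chain_iteration_interval(pe_cycles: int, hop_cycles: int) -> int:
    """Cycles between successive iterations emerging from a
    software-pipelined PE chain, limited by its slowest stage
    (per-PE pipeline time + inter-PE hop), assuming no overlap
    between processing and the hop."""
    return pe_cycles + hop_cycles

shared_bus = chain_iteration_interval(4, 8)      # 12 cycles/iteration
dedicated_fifo = chain_iteration_interval(4, 2)  # 6 cycles/iteration
```

With the shared bus the hop is two-thirds of the interval, which is the "chain overhead dominates" verdict in the table; dedicated FIFOs bring the interval close to the PE's own pipeline time.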

### The Execution Mode Spectrum

The pipelined PE with frame-based storage, SC blocks, and predicate
register supports a spectrum of execution modes, selectable by the
compiler per-region:

| Mode | Pipeline behaviour | Throughput | When to use |
|------|-------------------|-----------|-------------|
| Pure dataflow | Token -> ifetch -> match/frame -> exec -> output | 1 token / 3-7 clocks (mode-dependent) | Parallel regions, independent ops |
| SC block (register) | Sequential IRAM fetch, 670 register file | ~1 instr / clock | Short sequential regions |
| SC block + predicate | As above, with conditional skip/branch via predicate bits | ~1 instr / clock | Conditional sequential regions |
| PE chain (software pipeline) | Tokens flow PE0->PE1->PE2, each PE handles one stage | 1 iteration / PE-pipeline-depth clocks | Loop bodies across PEs |
| SM-mediated sequential | Tokens to/from SM for memory-intensive work | SM-bandwidth-limited | Array/structure traversal |

The compiler partitions the program graph and selects the best mode
for each region. This spectrum is arguably more expressive than what a
modern OoO core offers (which has exactly one mode: "pretend to be
sequential, discover parallelism at runtime").

## 13. Instruction Residency and Code Loading

### Why This Matters

Unlike Manchester, Amamiya, or Monsoon -- which either replicated the
entire program into every PE's instruction memory or used very large
per-PE instruction stores -- this design has **small IRAM per bank**
(256 entries) with runtime-writable instruction memory. Without bank
switching, any program larger than a single PE's IRAM needs code loading
at runtime, even under fully static PE assignment.

**With bank switching** (see section 14), each PE
holds up to 4096 instructions across 16 banks using the same SRAM chips.
This substantially reduces the pressure on runtime code loading -- most
programs' full working set fits in the preloaded banks, and switching
between function fragments costs a single register write instead of
IRAM rewrite traffic. The code storage hierarchy and loader mechanisms
below remain relevant for programs that exceed the banked capacity, but
bank switching makes that the exception rather than the rule.

The 16-bit single-half instruction format provides good IRAM density:
one instruction per SRAM address. The effective capacity with bank
switching (4096 instructions) is substantial for the target workloads,
using only a single SRAM chip pair per PE.

The reference architectures largely avoid the residency problem by
throwing memory at it: Amamiya's 8KW/PE replicated instruction memory,
Manchester's large instruction store, Monsoon's 64K-instruction frames.
Bank switching gives us a comparable effective capacity (4K instructions)
with much less hardware than full replication.

### Proactive Loading (Primary Mechanism)

The primary approach is **software-managed prefetch**: the compiler
assigns a PE (typically the least-utilized one) to pull instruction
pages from storage and load them onto the bus in advance of when
they're needed. This is part of the program graph itself -- the loader
PE calls the `exec` SM instruction, which reads out pre-constructed
tokens onto the bus.

This fits naturally into the dataflow paradigm:
- The loader PE is just another participant in the token network
- Its "inputs" are load requests (tokens from other PEs or the scheduler)
- Its "outputs" are config write packets that load IRAM
- The compiler can schedule prefetches to overlap with computation on other PEs

The loader PE could be dedicated (always running loader code) or could
itself have its code swapped depending on system phase.
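The loader PE's request-to-config-write transformation is pure token processing. A behavioural sketch -- the record fields, page naming, and `code_store` structure here are illustrative, not the real flit layout or `exec` semantics:

```python
def loader_step(request, code_store):
    """One loader-PE 'firing': consume a load-request token naming a
    (target_pe, page) pair and produce config-write packets that fill
    the target's IRAM with pre-constructed instruction words."""
    target_pe, page = request
    return [
        {"dest_pe": target_pe, "kind": "config_write",
         "iram_offset": offset, "word": word}
        for offset, word in enumerate(code_store[page])
    ]
```

The point of the sketch is the shape: one input token in, a bounded burst of output packets out, with no state retained between firings, so the loader schedules like any other dataflow node.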

### The Identity Problem: Miss Detection

If code loading happens at runtime, the question arises: how does a PE
know the code in its IRAM is the *right* code for an arriving token?

A simple validity bitmap (like the matching store presence bit) is
**not sufficient**. It can tell you "something is loaded at offset 7"
but not "the right instruction is loaded at offset 7." If a different
function fragment has been loaded over a previous one, the IRAM slot is
occupied by a valid-looking but wrong instruction. The token indexes
directly into IRAM -- there is no tag comparison against the token.

Several detection mechanisms are possible:

**Option A: Fragment ID register.**
Each PE has a small tag register (or set of registers, one per IRAM
page/region) that records which function fragment is currently loaded.
Set by config writes during loading. Incoming tokens carry (or the
system derives from the token's address) the expected fragment ID. The
PE compares the token's expected fragment against the loaded fragment
register:
- Match -> proceed normally
- Mismatch -> miss, trigger fetch
- Hardware cost: one register + comparator per PE (or per IRAM region)
- Requires fragment ID bits in the token or a derivation mechanism
- Coarse-grained: one tag per PE or per page, not per instruction

**Option B: Entry gate instruction.**
The compiler inserts a special instruction at each function body's
entry point that verifies identity: "am I the function this activation
expects?" Tokens arriving at non-entry instructions are assumed correct
because they could only have reached that point by passing through a
verified entry gate.
- No per-instruction tags needed
- Software-managed, compiler responsibility
- Detection granularity is per-function-body, not per-instruction
- In dataflow terms: the entry gate is a dyadic instruction whose left
  input is the activation token and whose right input is a "function
  loaded" token. If the function isn't loaded, the gate blocks (no
  match) until loading completes and a "loaded" confirmation token
  arrives.

**Option C: Software-only invariant.**
No hardware miss detection. The loader protocol guarantees correctness:
code is never overwritten while tokens are in flight targeting it. The
throttle + drain approach (stall new activations, let existing ones
complete, then overwrite IRAM) ensures the invariant.
- Simplest hardware -- no detection circuitry at all
- Most compiler/loader burden
- Relies on correct coordination; bugs cause silent wrong execution
- Viable for v0 where programs are small and manually verified

These options are not mutually exclusive. v0 can start with Option C
(software guarantee) and add hardware detection (A or B) later as
programs grow beyond what manual verification can cover.

### Miss Handling

When a miss is detected (by whatever mechanism), the PE needs to handle
a token that targets unloaded code. Two approaches:

**Stall + fetch request:** PE emits an `exec` token and stalls its
input FIFO until the instruction arrives via config write. Simple,
deterministic, but blocks all traffic to that PE during the fetch.
Acceptable if misses are rare (proactive loading handles most cases)
and fetch latency is bounded.

**Recirculate + fetch request:** PE emits a fetch-request token, puts
the missed token back at the tail of its own input FIFO, and continues
processing other tokens. The missed token retries later, hopefully
after the instruction has been loaded. More complex but keeps the PE
productive.
Requires care to avoid FIFO fill-up with recirculated tokens.

v0 may not implement either; starting with the software-only invariant
(Option C above) means misses don't happen by construction. Hardware
miss handling is an evolutionary step as programs outgrow what static
loading can guarantee.

## 14. Future: Bank Switching with 74LS610

v0 supports 256 instructions per PE (8-bit offset) with simple address
decode. When programs exceed 256 instructions, the 74LS610 memory mapper
enables bank switching: 16 banks x 256 instructions = 4096 entries per
PE without changing SRAM chips.

### What the 610 Is and What It Enables

The 74LS610 (TI memory mapper, originally for the TMS9900 family) is an
ideal fit for IRAM bank switching. Key properties:

- 16 mapping registers, each 12 bits wide
- 4-bit logical address input selects register -> 12-bit physical address output
- **Latch control** (pin 28): outputs can be frozen while register contents change
- ~40-50 ns propagation delay (LS family), pipelineable with SRAM access
- One chip per PE. Writes to mapping registers via data bus during config/bootstrap.

The '610 is planned for both IRAM and SM banking (both are future
upgrades, not present in v0). Using it for IRAM banking is the same
chip, same wiring pattern, different address domain. One '610 per PE
for IRAM, one per SM.

### The Socket Strategy

The v0 board pre-wires SRAM address lines to a '610 socket with a
jumper wire in place of the chip. When bank switching is needed, the
'610 drops in with no board changes.
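The '610's datapath is small enough to model in a few lines. A behavioural sketch of the mapping path (register count and widths per the properties listed above; the class interface itself is illustrative):

```python
class Mapper610:
    """Behavioural model of the 74LS610 mapping path as used for IRAM
    banking: a 4-bit logical page selects one of 16 x 12-bit mapping
    registers, whose contents become the physical bank bits."""

    def __init__(self):
        self.regs = [0] * 16                 # 16 mapping registers

    def map_page(self, logical_page: int, phys_bank: int):
        """Model of a mapping-register write (config/bootstrap path)."""
        assert 0 <= logical_page < 16 and 0 <= phys_bank < 4096
        self.regs[logical_page] = phys_bank

    def translate(self, logical_page: int, offset: int) -> int:
        """[page:4][offset:8] -> [phys_bank:12][offset:8]."""
        assert 0 <= offset < 256
        return (self.regs[logical_page] << 8) | offset
```

In practice only as many register bits as the SRAM address width are wired out; the rest of the 12-bit output is unused.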

### Address Space with Bank Switching

With the '610 installed, the IRAM address becomes:

```
Logical:  [bank_select:4][offset:8]
               |
               v  (74LS610)
Physical: [phys_bank:12][offset:8]  = up to 20-bit SRAM address
```

In practice the physical address width is bounded by available SRAM chip
capacity. With 8Kx8 SRAMs (13-bit address): the '610's 12-bit output is
wider than needed -- only 5 bits of physical bank + 8-bit offset = 13
bits. This gives 32 physical banks of 256 instructions each (8192
instructions per PE, though address space constraints may limit this
further).

The SRAM address map with bank switching:

```
 IRAM region:  [0][bank:4][offset:8]    bank-switched templates
               bank from '610 mapper
               capacity: 16 banks x 256 instructions = 4096 entries

 Frame region: [1][frame_id:2][slot:6]  (unchanged)
```

### Banking Workflow: MAP_PAGE and SET_PAGE Instructions

Two instructions manage banking. Neither touches the token format.

- **`map_page`** (monadic): writes a logical-to-physical mapping into
  one of the '610's 16 mapping registers. The register index and
  physical bank address come from frame constants. Used during
  bootstrap or runtime to establish which physical SRAM regions back
  which logical pages.

- **`set_page`** (monadic): writes a 4-bit logical page selector into a
  PE-local latch. The latch feeds the '610's MA0-MA3 inputs. All
  subsequent IRAM fetches go through the selected logical page's mapped
  physical bank. One cycle to switch.

```
Banking workflow:
  1. Bootstrap: MAP_PAGE instructions establish mappings
     (logical page 0 -> physical region A, page 1 -> region B, etc.)
  2. Runtime: SET_PAGE selects the active logical page
  3. Latch -> '610 MA0-MA3 -> physical SRAM bank selection
  4.
     All IRAM reads now address the selected bank
```

Hardware cost: one 74LS175 (quad D flip-flop) as the page latch + the
'610 itself.

### Trade-offs and Costs

- A bank switch affects all in-flight tokens targeting this PE at
  offsets in the old bank. The compiler (or scheduler) must drain
  tokens for the old bank before switching -- the same
  throttle-and-drain protocol as code overwrite, but switching is
  instantaneous once drained (write latch, done).
- `set_page` is sequentially scoped: it affects all subsequent fetches,
  not just one activation. The compiler must ensure that concurrent
  activations on the same PE agree on the active page, or use
  `set_page` as a barrier between phases.
- Total capacity per PE is bounded by SRAM chip size, not the '610
  (which can address far more than any reasonable IRAM).
- Pages are a pure address-mapping primitive. The compiler decides what
  they mean -- per-function, per-phase, or any other grouping. The
  hardware doesn't enforce or assume any relationship between pages and
  function bodies.

## 15. Dynamic Scheduling: Future Capability

The architecture is **policy-agnostic** on whether PE assignment is
fully static (the compiler decides everything) or partially dynamic (a
scheduler places activations at runtime). The mechanism (tokens carry
destination PE + activation_id, PEs have writable IRAM, frames are
allocated and addressed by act_id) supports either policy.

### Static Assignment (v0)

The compiler decides everything at compile time. Each PE gets specific
function fragments loaded at bootstrap. No runtime decisions about
placement. Simplest: no scheduler hardware or firmware needed. For
programs that exceed IRAM capacity, the compiler schedules `exec`
instructions or similar.

### Dynamic Scheduling (future)

A CCU-like scheduler (could be firmware on a dedicated PE, a small
fixed-function unit, or distributed logic) decides at runtime where to
place new activations, based on PE load, IRAM contents, etc.

The tension: dynamic scheduling wants **wide IRAM** (so the target PE
already has the function body loaded), while cheap PEs want **narrow
IRAM**. Amamiya resolved this by replicating the entire program into
every PE's IRAM; that works, but it costs a lot of memory.

The middle ground is a **working set model**: keep hot function bodies
loaded, swap cold ones via PE-local write tokens (prefix 011+01) when
the scheduler wants to place an activation on a PE that doesn't have
the code yet.

- **Miss latency**: significant (a network round-trip to load code from
  SM or external storage) -- much worse than Amamiya's "already there."
- **Miss rate**: depends on the scheduler's affinity policy. If the
  scheduler prefers placing activations on PEs that already have the
  code, misses should be rare. A small "IRAM directory" (which PE has
  which function body loaded) lets the scheduler make this decision
  cheaply.
- **Coordination**: drain in-flight tokens for the old fragment before
  overwriting IRAM. Throttle stalls new activations for that fragment,
  existing ones complete, then overwrite. A coarse-grained context
  switch.

## 16. Open Design Questions

1. **Approach selection for v0.** Approach C (670 lookup) is
   recommended as the starting point: combinational metadata at ~8
   chips. Approach B (register-file match pool) eliminates the last
   SRAM cycle from matching at the cost of ~16-18 chips. Approach A
   (SRAM tags) is the fallback if 670 supply is a problem. The
   choice depends on whether chip count or pipeline throughput is
   the binding constraint for the initial build. See section 11 for
   the full approach comparison and cycle counts.

2.
**Frame SRAM contention under realistic workloads.** The pipeline
   stall analysis in section 11 uses worst-case consecutive tokens.
   Simulate representative dataflow programs in the behavioural
   emulator to measure actual stage 3 vs stage 5 contention rates
   and determine whether dual-port SRAM or faster SRAM is justified
   for v0.

3. **SC block register capacity.** With 4-6 registers available from
   repurposed 670s (depending on how many are shared), what is the
   longest SC block the compiler can generate before register
   pressure forces a spill? Evaluate empirically on target workloads.

4. **Predicate register encoding.** Document specific instruction
   encodings for predicate test/set/clear, and how SWITCH
   instructions interact with predicate bits. The predicate register
   may subsume some of the cancel-bit functionality planned for the
   token format.

5. **Mode switch latency measurement.** Build a cycle-accurate model
   of the save-to-SRAM / restore-from-SRAM path and determine the
   exact overhead. Target: <=10 cycles per transition.

6. **Assembler stall analysis.** The assembler can statically detect
   instruction pairs whose output tokens may cause frame SRAM
   contention on the same PE. For hot loops, the assembler can
   insert mode 4 NOP tokens (zero frame access) as pipeline padding.
   Validate static stall estimates against emulator simulation, since
   runtime arrival timing depends on network latency and SM response
   times.

7. **8-offset matchable constraint validation.** The 670-based
   presence metadata limits dyadic instructions to offsets 0-7 per
   frame. Evaluate whether this is sufficient for compiled programs.
   If tight, the hybrid upgrade path (offset[3]=0 checks 670s,
   offset[3]=1 falls back to SRAM tags) adds ~4-6 chips of SRAM tag
   logic for offsets 8-15+.

8.
**Exact opcode assignments**: 5-bit opcode space is sufficient (CM
   and SM independent). Need to assign FREE_FRAME, ALLOC_REMOTE, and
   verify that existing ALU operations fit with the revised mode
   semantics.

9. **SC arc execution details**: the frame model supports
   strongly-connected arc execution (latch frame_id across sequential
   blocks); the pipeline sequencing and block-entry detection logic
   need design work. Deferred past v0 but should not be precluded by
   any v0 decisions.

10. **IRAM bank switching interaction with frames**: switching IRAM
    banks changes the instruction templates but not the frame contents.
    Tokens in flight targeting the old bank's instructions will execute
    against new instructions after the switch. The drain-before-switch
    protocol applies unchanged.

11. **Frame slot count (fref width)**: 6-bit fref = 64 slots is the
    current proposal. Real compiled programs may show that 32 suffices
    (freeing 1 bit for other uses) or that 64 is tight (requiring
    creative aliasing or more aggressive function splitting).

12. **Function splitting heuristics**: how does the compiler decide
    where to split? Minimize cross-PE traffic? Balance frame usage
    across PEs? Hardware constraints (frame count, matchable offset
    count) drive it.

13. **Instruction identity detection**: how does the PE know loaded
    code matches what an arriving token expects? Fragment ID register
    vs entry gate instruction vs software-only guarantee. See the
    Instruction Residency section. v0 starts with the software-only
    invariant (Option C).

14. **Miss handling mechanism**: stall + fetch request vs recirculate
    + fetch request. v0 may not implement either, relying on the
    software-only invariant.

15.
**SM flit 1 bit alignment:** the frame slot packing for SM targets
    (`[SM_id:2][addr][spare]`) must align with the SM bus flit 1 format
    so that the output stage can concatenate frame bits + EEPROM opcode
    without field rearrangement. The exact spare bit positions depend on
    the final SM bus encoding; verify alignment after `sm-design.md`
    opcode assignments are frozen.

16. **PE-local write slot field width:** the current flit 1 format packs
    5 bits of slot index into the PE-local write token. With 64 frame
    slots (6-bit fref), one bit is missing. Options: (a) limit PE-local
    writes to slots 0-31, (b) steal a spare bit from act_id or elsewhere,
    (c) use flit 2 for extended addressing.

17. **Indexed SM ops: address overflow.** The ALU base + index
    computation may overflow the 10-bit (tier 1) or 8-bit (tier 2)
    address range. The PE does not check for overflow -- the SM receives
    the truncated address. The assembler should warn on
    statically-detectable overflow risks. Runtime overflow detection is
    an SM-side concern.

18. **EXTRACT_TAG offset source.** The return offset in the packed tag
    could come from (a) a frame constant (flexible, costs 1 frame slot),
    (b) a fixed hardware-derived value (e.g. current instruction
    offset + 1), or (c) an IRAM-encoded small immediate (not available
    in the current 16-bit format). Option (a) is consistent with the
    frame-everything philosophy; option (b) saves a slot but limits
    return point placement.

## 17. References

- `architecture-overview.md` - Token format, flit-1 bit allocation, module taxonomy, bus framing protocol, function call design overview.
- `alu-and-output-design.md` - ALU operation details, output routing modes, flit 2 source mux, and SM flit assembly.
- `sm-design.md` - SM opcode table, extended addressing, CAS handling, I-structure semantics.
- `bus-interconnect-design.md` - Physical bus implementation: shared and split AN/CN/DN topologies, node interfaces, arbitration, loopback, backpressure, chip counts.
- `network-and-communication.md` - Routing topology, clocking discipline.
- `io-and-bootstrap.md` - Bootstrap loading, I/O subsystem design.
- `sram-availability.md` - Component availability for period-appropriate SRAMs.
- `17407_17358.pdf` - DFM evaluation: OM structure (1024 CAM blocks, 32 words each, 8 entries of 4 words, 4-way set-associative within entry). Function activation via CCU requesting the least-loaded PE, then getting the instance name from the target PE's free instance table. IM is 8KW/PE, identical across all PEs. Critical for understanding why Amamiya's OM is so large and why ours can be much smaller.
- `gurd1985.pdf` - Manchester matching unit: 16 parallel hash banks, 64K tokens each, 54-bit comparators, 180 ns clock period. Overflow unit emulated in software. Shows the cost of general-purpose matching.
- `Dataflow_Machine_Architecture.pdf` - Veen survey: matching store analysis, tag space management, overflow handling across multiple architectures.
- `amamiya1982.pdf` - Original DFM paper: semi-CAM concept, IM/OM split, execution control mechanism with associative IM fetch. Partial function body execution (begin executing when the first argument arrives, don't wait for all arguments).
- EM-4 prototype papers - Direct matching, strongly-connected blocks, register-based advanced control pipeline. Informs the SC arc upgrade path.
- Iannucci (1988) - Frame-based matching, continuation model, suspension semantics. Historical precedent for per-activation frame storage.
- Monsoon / TTDA papers - Explicit token store, frame-based execution, I-structure semantics.