OR-1 dataflow CPU sketch
# Dynamic Dataflow CPU — PE (Processing Element) Design

Covers the CM (Control Module) pipeline, frame-based matching, instruction
memory and encoding, activation lifecycle, per-PE identity, 670 subsystem
design, pipeline stall analysis, SM operation dispatch, SC blocks, and
execution modes.

See `architecture-overview.md` for token format, flit-1 bit allocation,
and module taxonomy. See `network-and-communication.md` for how tokens
enter/leave the PE. See `alu-and-output-design.md` for ALU operations,
output formatting, and SM flit assembly details. See
`bus-interconnect-design.md` for physical bus implementation. See
`sm-design.md` for SM internals and I-structure semantics. See
`io-and-bootstrap.md` for bootstrap loading and I/O subsystem design.

## 1. Design Philosophy: Static Assignment, Compiler-Driven Sizing

This design diverges significantly from both Manchester and Amamiya in how PEs are used. Understanding the difference is key to seeing why the matching store can be so much smaller here.

**Amamiya DFM (1982/17407 papers):** every PE has ALL function bodies pre-loaded in instruction memory (8KW, 58 bits/word per PE, identical contents across all PEs). Function _instances_ are dynamically assigned to PEs at runtime by a CCU (Cluster Control Unit) that picks the least-loaded PE. The OM (operand matching memory) needs 1024 CAM blocks per PE because any function can run anywhere, and deep Lisp recursion means many simultaneous activations. The "semi-CAM" was their solution to making this affordable -- instance name directly addresses a block, then 4-way set-associative lookup within the block on instruction identifier.

**Manchester (Gurd 1985):** a similar story, but with hashing instead of semi-CAM: 16 parallel 64K-token memory banks per PE for set-associative hash lookup, a 1M-token-capacity matching store, plus an overflow unit (initially emulated on the host). The matching unit alone was 16 memory boards per PE.

Both machines sized their matching stores for worst-case dynamic scheduling of arbitrary programs. The whole program lives in every PE (or in a single PE's matching unit), and any activation can land anywhere. That's why those matching stores are enormous.

**This design:** the compiler statically assigns function bodies (or chunks of them) to specific PEs. Different PEs have different instruction memory contents. The compiler knows at compile time which functions run where, and can calculate maximum concurrent activations per PE. This means:

- Instruction memory is NOT replicated -- each PE only holds its assigned function bodies. IM can be much smaller.
- The matching store only needs enough frames for the maximum concurrent activations the compiler predicts for that specific PE. Not 1024. Probably 4.
- No CCU needed for dynamic PE allocation. Scheduling decisions are made at compile time.
- The tradeoff is scheduling flexibility -- you can't dynamically rebalance load at runtime. The compiler must get it roughly right.

### Function Splitting Across PEs

A "function" in the source language does NOT need to map 1:1 to a contiguous block on one PE. The compiler can split a function body at any data-dependency boundary. The token network doesn't know or care whether two instructions are "in the same function" -- it just sees tokens with destinations.

A 40-instruction function body could be split into three chunks of ~13 instructions across three PEs, each chunk fitting in a smaller frame. The "function" as the architecture sees it is really "a set of instructions that share a frame on this PE." The compiler defines what that grouping means.

This is a powerful lever for keeping frames small: if a function body is too big for the frame size, the compiler splits it. The split introduces inter-PE token traffic (extra network hops), but keeps per-PE hardware simple. The compiler can optimise the split points to minimise cross-PE traffic.

**Implication for frame semantics:** a frame doesn't mean "one function activation." It means "one chunk of work sharing a local operand namespace on this PE." Multiple frames on different PEs might collectively represent one function activation. The token's `activation_id` scopes operand matching to a local frame, nothing more.

**Implication for the compiler:** this architecture actively wants either small functions or functions distributed across PEs. The compiler is free to treat any subgraph of the dataflow graph as a "chunk" and assign it to a PE, regardless of source-level function boundaries. Loop bodies, branch arms, pipeline stages: all valid chunk boundaries. The grain of scheduling is the subgraph, not the function.

## 2. PE Identity

Each PE has a unique ID used for routing. Two mechanisms, not mutually exclusive:

**EEPROM-based**: the instruction decoder EEPROM already contains per-PE truth tables. The PE ID can be encoded as additional input bits to the EEPROM, meaning the EEPROM contents are unique per PE but the circuit board is identical. The instruction decoder "knows" which PE it is because its EEPROM was burned with that ID.

**DIP switches**: 3-4 switches give 8-16 PE addresses. Better for early prototyping -- reconfigurable without reflashing. Can coexist with the EEPROM approach (switches provide ID bits that feed into the EEPROM address lines).

The PE ID is needed in two places:

1. Input token filtering: "is this token addressed to me?"
2. Output token formatting: "set the source PE field" (if result tokens carry source info for return routing)
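
A behavioural sketch of the first use, in Python (the emulator's language). The position of the PE field within flit 1 is a placeholder here -- the real bit allocation is defined in `architecture-overview.md`:

```python
# Input token filtering sketch. PE_SHIFT is a HYPOTHETICAL bit position
# for the 2-bit PE field in flit 1; the real allocation lives in
# architecture-overview.md.

PE_SHIFT = 11   # assumed position, illustrative only
PE_MASK = 0b11

def pe_id_from_switches(dip_bits):
    """Combine DIP switch readings (LSB first) into a PE ID."""
    pe_id = 0
    for i, bit in enumerate(dip_bits):
        pe_id |= (bit & 1) << i
    return pe_id

def token_is_for_me(flit1, my_pe_id):
    """Stage-1 filter: compare the flit-1 PE field to our own ID."""
    return (flit1 >> PE_SHIFT) & PE_MASK == my_pe_id

my_id = pe_id_from_switches([1, 0])          # two switches read as PE 1
assert token_is_for_me(1 << PE_SHIFT, my_id)
```

The same ID register (switches or EEPROM-derived) would feed the output formatter for the source PE field.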

## 3. PE Pipeline (5-stage)

### Bus Interface: Serialiser / Deserialiser

The PE connects to the 16-bit external bus via ser/deser logic at the input and output boundaries. This handles the width conversion between 16-bit flits on the bus and the wider internal token representation:

- **Input deserialiser**: receives 2+ flits from the bus, reassembles them into a full token (routing fields from flit 1 + data from flit 2). Shift register + flit counter. Outputs a reassembled token to the input FIFO.
- **Output serialiser**: takes a formed result token, splits it into flits (routing fields into flit 1, data into flit 2), and clocks them onto the bus. Shift register + toggle.
- Hardware cost: ~5-8 TTL chips per direction (shift registers, counters, muxes).
- Naturally integrates with the clock domain crossing FIFOs when running Mode B (2x bus clock). Under Mode A (shared clock), the ser/deser simply takes 2 clock cycles per token transfer.
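
A behavioural sketch of the deserialiser for the 2-flit case, in Python (the emulator's language). In hardware this is a shift register plus a flit counter; the class and names here are illustrative:

```python
# Input deserialiser sketch: two 16-bit flits become one token.
# Flit 1 carries routing fields, flit 2 carries data. A token with
# more than two flits would extend the accumulator to a list.

class Deserialiser:
    def __init__(self):
        self.pending = None          # flit 1 held while awaiting flit 2

    def push_flit(self, flit):
        """Accept one 16-bit flit. Return a (routing, data) token when
        the second flit completes it, else None."""
        if self.pending is None:
            self.pending = flit      # first flit: latch routing fields
            return None
        token = (self.pending, flit) # second flit: emit reassembled token
        self.pending = None
        return token

d = Deserialiser()
assert d.push_flit(0x1234) is None            # flit 1 latched
assert d.push_flit(0xBEEF) == (0x1234, 0xBEEF)
```

The output serialiser is the mirror image: split a token into two flits and clock them out in successive bus cycles.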

### Pipeline Stages

The pipeline runs IFETCH before MATCH. The instruction word, decoded at
the end of stage 2, drives all subsequent pipeline behaviour: whether to
check the matching store, whether to read a constant, how many
destinations to read, whether to write back to the frame. The token's
`activation_id` drives associative lookup in parallel with the IRAM read,
hiding resolution latency.

**Why IFETCH before MATCH.** The instruction word determines
*how* matching works: whether the instruction is dyadic or monadic,
which frame slots to read for operands and constants, whether to
write back to the frame (sink modes), and how many destinations to
read at output. Fetching the instruction first gives the pipeline
controller all the information it needs to sequence stage 3's SRAM
accesses efficiently.

The token's dyadic/monadic prefix enables parallel work: when the
prefix indicates "dyadic," stage 2 starts act_id -> frame_id
resolution via the 670s simultaneously with the IRAM read. By the
time stage 3 begins, both the instruction word and the frame_id /
presence / port metadata are available, and the only remaining SRAM
work is reading or writing actual operand data and constants.

```
Stage 1: INPUT
  - Receive reassembled token from input deserialiser
  - Classify by prefix: dyadic wide (00), monadic normal (010),
    misc bucket (011). Within misc bucket: frame control (sub=00),
    PE-local write (sub=01), monadic inline (sub=10)
  - Compute/data tokens -> pipeline FIFO
  - Frame control tokens -> tag store write/clear (side path)
  - PE-local write tokens -> SRAM write queue (side path, executes
    when frame SRAM is not busy with compute pipeline accesses)
  - Buffer in small FIFO (8-deep, storing reassembled tokens)
  - ~1K transistors (flip-flops) or use small SRAM

Stage 2: IFETCH
  Two parallel operations within a single cycle:

  (a) IRAM SRAM read at [bank_reg : token.offset]. Produces the
      16-bit instruction word: type, opcode, mode, wide, fref.
      Single read cycle (16-bit instruction, one chip pair).

  (b) Activation_id resolution. For Approach C (74LS670 lookup),
      this is combinational (~35 ns): present act_id on the 670
      address lines, get {valid, frame_id} back. Presence and port
      metadata also resolve combinationally in this stage (670 read
      at frame_id, ~70 ns total from act_id presentation). At 5 MHz
      (200 ns cycle), all metadata is available before the IRAM read
      completes.

  The dyadic/monadic prefix from flit 1 determines whether
  activation_id resolution starts in this stage (dyadic) or is
  deferred until the instruction confirms the need (monadic with
  frame access).

  IRAM valid-bit check occurs in parallel: if the page containing
  the target offset is marked invalid, the token is rejected (see
  IRAM Valid-Bit Protection below).

Stage 3: MATCH / FRAME
  Path depends on instruction type (from stage 2) and token type.
  Uses the instruction's mode field to determine frame SRAM accesses:

  - Dyadic hit (second operand): read stored operand from frame SRAM
    at [frame_id : match_offset]. If has_const, also read constant
    from frame[fref]. 1-2 SRAM cycles depending on mode.
  - Dyadic miss (first operand): write incoming operand to frame SRAM,
    set presence bit in 670. Token consumed. 1 SRAM cycle.
  - Monadic with constant: read constant from frame[fref]. 1 SRAM
    cycle.
  - Monadic mode 4 (CHANGE_TAG, no frame access): 0 cycles, pass
    through.
  - Mode 7 (SINK+CONST / RMW): read old value from frame[fref] for
    read-modify-write. 1 SRAM cycle.

  See Pipeline Stall Analysis (section 11) for full cycle-count
  tables across Approaches A, B, and C.

Stage 4: EXECUTE
  - Instruction type bit selects CM compute (0) or SM operation (1)
  - CM path: 16-bit ALU executes arithmetic/logic/comparison/routing
    on operand data + constant (if present). Purely combinational.
    No SRAM access. Result latched.
  - SM path: ALU computes effective address or passes through data;
    PE constructs SM flit fields from frame data and operands.
    See SM Operation Dispatch (section 7) for encoding and dispatch
    details. See `alu-and-output-design.md` for SM flit assembly.
  - ~1500-2000 transistors (ALU) + SM flit assembly mux (~4-6 chips)

Stage 5: OUTPUT
  - CM path: read destination(s) from frame SRAM. Destinations are
    pre-formed flit 1 values stored in frame slots during activation
    setup. The PE reads the slot and puts it directly on the bus as
    flit 1; the ALU result becomes flit 2. Near-zero token formation
    logic.
      - Mode 0/1 (single dest): read frame[fref] or frame[fref+1].
        1 SRAM cycle.
      - Mode 2/3 (fan-out): read dest1 and dest2 from consecutive
        frame slots. 2 SRAM cycles.
      - Mode 4/5 (CHANGE_TAG): left operand becomes flit 1 verbatim.
        0 SRAM cycles.
      - Mode 6/7 (SINK): write ALU result back to frame[fref]. No
        output token. 1 SRAM cycle.
  - SM path: emit SM token to target SM. SM flit 1 constructed from
    frame slot (SM_id + addr from frame[fref]), SM flit 2 source
    selected by instruction decoder (ALU out, R operand, or frame
    slot). See SM Operation Dispatch (section 7) for flit 2 source
    mux details.
  - Pass to output serialiser for flit encoding and bus injection.
```

### Concurrency Model

The pipeline is fully overlapped: multiple tokens can be in flight simultaneously at different stages. In the emulator, each token spawns a separate SimPy process that progresses through the pipeline independently. This models the hardware reality where Stage 2 can be fetching a new instruction while Stage 4 is executing a previous one.

Cycle counts per token type (Approach C, recommended v0):
- **Dyadic hit, mode 1** (common case): 6 cycles (input + ifetch + match/const + execute + output)
- **Dyadic hit, mode 0** (no const): 5 cycles
- **Dyadic miss**: 3 cycles (input + ifetch + store operand; no execute/output)
- **Monadic, mode 0**: 4 cycles (input + ifetch + execute + output; no match)
- **Monadic, mode 4** (CHANGE_TAG, no frame): 3 cycles
- **PE-local write**: side path, does not enter compute pipeline
- **Frame control**: side path, tag store write
- **Network delivery**: +1 cycle latency between emit and arrival at destination

**Unpipelined throughput:** 3-7 cycles per token depending on instruction
mode. This is the baseline against which pipeline overlap improvements
are measured (see Pipeline Stall Analysis, section 11).
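
The per-type counts above can be captured as a simple lookup of the kind the emulator could use to charge latency. A sketch; key names are illustrative:

```python
# Cycle counts per token type (Approach C), transcribed from the
# list above. Network delivery adds +1 cycle between emit and arrival.

CYCLES = {
    "dyadic_hit_mode1": 6,   # input + ifetch + match/const + execute + output
    "dyadic_hit_mode0": 5,
    "dyadic_miss":      3,   # operand stored; no execute/output
    "monadic_mode0":    4,
    "monadic_mode4":    3,   # CHANGE_TAG, no frame access
}

def latency(kind, hops=1):
    """Pipeline cycles plus +1 network delivery cycle per hop."""
    return CYCLES[kind] + hops

assert latency("dyadic_hit_mode1") == 7
```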

### Pipeline Register Widths

Between instruction fetch and execute, the pipeline carries operand data and instruction control information along parallel but logically separate paths:

**Data path (~32 bits between match/frame and ALU):**
```
data_L: 16 bits (from frame SRAM or direct from token for monadic)
data_R: 16 bits (from frame SRAM)
            |
       ALU (16-bit)
            |
result: 16 bits
```

**Control path (~16 bits from IRAM, plus frame reads in stage 5):**
```
type:   1 bit  (CM/SM select, from IRAM)
opcode: 5 bits (from IRAM, consumed by ALU / SM decoder)
mode:   3 bits (from IRAM, drives frame access pattern + output routing)
wide:   1 bit  (from IRAM, 16/32-bit frame access)
fref:   6 bits (from IRAM, frame slot base index)
```

Destinations are NOT in the pipeline control registers. They are read
from frame SRAM during stage 5 as pre-formed flit 1 values. This
simplifies the pipeline latch: only 16 bits of IRAM control + 32 bits
of operand data pass between stages.

Total pipeline register between fetch and execute: **~48 bits**. The
mode field (3 bits) encodes tag behaviour, constant presence, and
fan-out in a single dense field (see mode table in section 6).

Pipeline registers between stages: ~500 transistors
Control logic (state machine, handshaking): ~500-1000 transistors

**Per-PE total: ~32-43 chips** (Approach C).

## 4. Frame-Based Matching

### Why It Can Be Small

The matching store is the highest-risk component in any dataflow machine. Manchester needed 16 memory boards per PE. Amamiya needed 1024 CAM blocks (32KW at 43 bits/word) per PE. Both were sized for worst-case dynamic scheduling of arbitrary programs.

This design avoids that because:

1. **Static PE assignment**: the compiler knows which functions run on which PE and can calculate maximum concurrent activations per PE.
2. **Function splitting**: the compiler can split large function bodies across PEs so no single PE needs a huge frame.
3. **Compiler-controlled frame allocation**: the compiler assigns activation IDs at compile time for statically-known activations. Only genuinely dynamic activations (runtime-determined recursion depth) need runtime allocation.

The frame count is therefore a _compiler parameter_, not an architectural constant. The hardware provides 4 concurrent frames of 64 slots each. The compiler must generate code that fits within those limits, splitting and scheduling accordingly.

### Architecture: 74LS670 Register-File Lookup (Approach C, Recommended v0)

Matching uses the per-activation **frame** model. Pending match operands
live in the same SRAM address space as constants, destinations, and
accumulators. The 74LS670 (4-word x 4-bit register file with independent
read/write ports) provides the activation_id-to-frame_id mapping and
presence/port metadata, all combinationally.

**act_id-to-frame_id resolution:**

Two 74LS670s, addressed by `act_id[1:0]` with `act_id[2]` selecting
between chips. Output = `{valid:1, frame_id:2, spare:1}`. Combinational
read: present act_id on the address lines, frame_id appears at the
output in ~35 ns.

```
ALLOC:  write {valid=1, frame_id} at address act_id (670 write port)
FREE:   write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns
```

The 670's independent read and write ports allow ALLOC to proceed while
the pipeline reads -- zero conflict.
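
A behavioural Python model of this tag store (ALLOC / FREE / LOOKUP as above; the combinational timing is not modelled). The `lookup` path also shows how a cleared valid bit causes a stale token to be discarded, as described under ABA Protection in section 5:

```python
# act_id -> frame_id tag store: 8 entries of {valid, frame_id},
# i.e. two 74LS670s. In hardware LOOKUP is a combinational read
# (~35 ns) on the read port while ALLOC/FREE use the write port.

class TagStore:
    def __init__(self):
        self.entries = [{"valid": 0, "frame_id": 0} for _ in range(8)]

    def alloc(self, act_id, frame_id):
        """ALLOC: bind an activation to a physical frame."""
        self.entries[act_id] = {"valid": 1, "frame_id": frame_id}

    def free(self, act_id):
        """FREE: clear the valid bit; frame returns to the pool."""
        self.entries[act_id]["valid"] = 0

    def lookup(self, act_id):
        """LOOKUP: frame_id, or None for a stale/unknown activation
        (the PE would discard such a token)."""
        e = self.entries[act_id]
        return e["frame_id"] if e["valid"] else None

ts = TagStore()
ts.alloc(5, 2)
assert ts.lookup(5) == 2
ts.free(5)
assert ts.lookup(5) is None   # stale token -> discarded
```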

**Presence and port metadata:**

Presence and port bits live in additional 670s, addressed by
`[frame_id:2]`. The matchable offset range is constrained to offsets
0-7 (8 dyadic-capable slots per frame). The assembler packs dyadic
instructions at low offsets; offsets 8-255 are monadic-only.

Recommended layout: 4x 670, each covering 2 offsets across 4 frames:

```
670 chip 0 (offsets 0-1): word[frame_id] = {pres0:1, port0:1, pres1:1, port1:1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2:1, port2:1, pres3:1, port3:1}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4:1, port4:1, pres5:1, port5:1}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6:1, port6:1, pres7:1, port7:1}
```

`offset[2:1]` selects the chip, `offset[0]` selects which pair of bits
within the 4-bit output (a 2:1 mux -- one gate).

The 670's simultaneous read/write is critical: during stage 3, when
a first operand stores and sets presence, the write port updates the
presence 670 while the read port remains available for the next
pipeline stage's lookup. No read-modify-write sequencing is needed.

All reads are combinational (~35 ns). All resolve during stage 2 in
parallel with the IRAM read. By the time stage 3 begins, the PE knows
frame_id, presence, and port -- the only SRAM access in stage 3 is
reading/writing the actual operand data.
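
The chip/bit selection can be written out explicitly. A Python sketch; the LSB-first ordering of the {presence, port} pairs within the 4-bit word is an illustrative convention, not a committed wiring:

```python
# Address decode for the presence/port 670 bank: offset[2:1] selects
# one of four chips, frame_id addresses the word within the chip, and
# offset[0] selects which {presence, port} bit pair of the 4-bit output
# (the one-gate 2:1 mux in the text). Bit positions are LSB-first,
# an assumed convention.

def presence_port_select(frame_id, offset):
    assert 0 <= offset <= 7, "only offsets 0-7 are matchable"
    chip = (offset >> 1) & 0b11      # offset[2:1] -> chip 0-3
    word = frame_id & 0b11           # 670 word address, 4 frames
    pair = offset & 1                # offset[0] -> low/high bit pair
    pres_bit = 2 * pair              # presN position within the nibble
    port_bit = 2 * pair + 1          # portN position within the nibble
    return chip, word, pres_bit, port_bit

assert presence_port_select(3, 5) == (2, 3, 2, 3)   # offset 5 -> chip 2
```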

**The matching operation:**

```
Stage 2 (parallel with IRAM read):
  act_id -> frame_id via 670 lookup (combinational)
  presence[frame_id][offset] via 670 read (combinational)
  port[frame_id][offset] via 670 read (combinational)

Stage 3 (driven by instruction word from stage 2):
  if instruction is dyadic AND presence bit set:
    -> match found: read stored operand from frame SRAM at
       [frame_id:2][match_offset]. Clear presence bit.
       Read constant from frame[fref] if has_const.
    -> proceed to stage 4 with both operands.
  if instruction is dyadic AND presence bit clear:
    -> first operand: write incoming operand to frame SRAM at
       [frame_id:2][match_offset]. Set presence bit in 670.
    -> token consumed, advance to next input token.
  if instruction is monadic:
    -> bypass matching. Read constant from frame[fref] if has_const.
```

**Hardware cost:**

| Component                 | Chips  | Notes                                   |
| ------------------------- | ------ | --------------------------------------- |
| act_id -> frame_id lookup | 2      | 74LS670, indexed by act_id              |
| Presence + port metadata  | 4      | 74LS670, indexed by frame_id            |
| Bit select mux            | 1-2    | offset-based selection of presence/port |
| **Total match metadata**  | **~8** |                                         |

**Constraint: 8 matchable offsets per frame.** The assembler enforces
this. 8 dyadic instructions per function chunk per PE is reasonable --
the compiler splits larger function bodies across PEs. With 4 PEs, the
system supports 32 simultaneous dyadic slots, which exceeds typical
working-set utilisation for the target workloads.

**Constraint: 8 unique activation_ids.** The 3-bit act_id supports 8
entries in the lookup table. With 4 concurrent frames, 4 IDs of ABA
distance exist before wraparound.

### Alternative Approaches

Two alternative matching implementations exist:

- **Approach A (set-associative tags in frame SRAM):** tag words share
  the frame SRAM chip. No extra chips, but matching consumes SRAM
  cycles (2 cycles for a dyadic hit). Lowest chip count (~4-6 extra
  TTL), highest pipeline stall rate.

- **Approach B (full register-file match pool):** match entries live
  entirely in a register file with parallel comparators. Matching is
  fully combinational (~50 ns). Highest chip count (~16-18 extra TTL),
  best pipeline throughput (eliminates all match-related SRAM
  contention).

Approach C (670 lookup, described above) sits between A and B: act_id
resolution and presence checking are combinational (like B), but operand
data lives in SRAM (like A). ~8 extra chips, good pipeline throughput.

See the approach comparison table in Pipeline Stall Analysis (section 11)
for full cycle counts and throughput estimates across all three
approaches, plus 670-enhanced variants (B+670 indexed, B+670 semi-CAM)
and hybrid upgrade paths.

### Frame Sizing

The 6-bit `fref` addresses 64 slots per activation. With 4 concurrent
frames, the frame region occupies 4 x 64 x 2 bytes = 512 bytes of
SRAM. This fits trivially alongside the IRAM region in a 32Kx8 chip
pair.

A function body with 10 dyadic instructions, 5 constants, and fan-out
on 3 instructions might use ~30 frame slots (10 match + 5 const + 8
dest + 7 accumulator/spare). With slot dedup (shared destinations,
aliased constants), actual usage is typically 15-25 slots per
activation.
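
The sizing arithmetic, checked in Python:

```python
# Frame region size: 4 frames x 64 slots x 2 bytes/slot, plus the
# example activation's slot budget from the paragraph above.

FRAMES, SLOTS, BYTES_PER_SLOT = 4, 64, 2
frame_region_bytes = FRAMES * SLOTS * BYTES_PER_SLOT
assert frame_region_bytes == 512     # fits alongside IRAM in a 32Kx8 pair

# Example body: 10 match + 5 const + 8 dest + 7 accumulator/spare
assert 10 + 5 + 8 + 7 == 30          # ~30 slots before dedup
```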

### What About Overflow?

If all frames are occupied or a function body exceeds 8 dyadic
instructions per frame:

**Compile-time prevention (primary strategy):**

- The compiler knows the frame count and matchable offset limit
- It splits functions and schedules activations to fit
- If a program genuinely can't fit (unbounded recursion deeper than 4 frames), the compiler inserts throttling code: a token that waits for a frame to free before allowing the next recursive call
- This is the Amamiya throttle idea, but implemented in software (compiler-inserted dataflow logic) rather than hardware

**Runtime overflow (safety net):**

- If a token arrives and the tag store has no valid entry for its act_id (shouldn't happen with correct compilation), the PE stalls the input FIFO until a frame frees. Simplest, safest, most debuggable. If it fires, something is wrong and stalling surfaces the bug.

**Upgrade path: hybrid with SRAM fallback:**

- If the 8-offset matchable range proves tight, the high bit of the offset can select between register (offset[3]=0, check 670s) and SRAM (offset[3]=1, fall back to tag-in-SRAM from Approach A). The fast path stays combinational; the overflow path adds 1 SRAM cycle. The system degrades gracefully rather than hard-limiting at 8 dyadic offsets.

## 5. Frame Lifecycle

### Allocation

An ALLOC frame control token (prefix 011+00, op=0) arrives at the PE,
specifying an `activation_id`. The PE assigns the next free physical
frame and records the act_id-to-frame_id mapping in the 670 tag store.

Free frame tracking is a simple 2-bit counter or shift register (4
entries max). Hardware cost: ~2-3 TTL chips.

### Setup

PE-local write tokens (prefix 011+01, region=1) load constants and
destinations into the allocated frame's slots. The writer addresses
slots by (act_id, slot_index); the PE resolves act_id-to-frame_id
internally using the same 670 lookup as the compute pipeline. Setup
uses the same mechanism as IRAM loading -- a stream of write tokens,
precalculated by the assembler.

### Execution

Compute tokens arrive with `activation_id`. The PE resolves
act_id-to-frame_id (670 lookup, combinational), then uses frame_id to
address frame SRAM for matching, constant reads, destination reads, and
write-backs. See the pipeline stages (section 3) for the per-stage
access pattern.

### Deallocation

A `FREE_FRAME` instruction (opcode-driven, any mode) or a FREE frame
control token (prefix 011+00, op=1) releases the frame. The tag store
entry is cleared (`valid=0` written to the 670), presence/port metadata
for that frame is bulk-cleared across all 4 presence/port 670s, and the
frame_id returns to the free pool.

Multiple frees are idempotent and harmless. Freed frames are immediately
available for reallocation.

### ABA Protection

- The 3-bit activation_id provides 8 unique IDs
- With at most 4 concurrent frames, 4 IDs of ABA distance exist before wraparound
- Stale tokens (from freed activations) carry an act_id whose 670 entry is now `valid=0` or maps to a different frame. The PE detects this via the valid bit and discards the stale token.
- 4 IDs of distance is sufficient because stale tokens drain within single-digit cycles. Wraparound collision is effectively impossible.
- The act_id validity check in the 670 provides ABA protection without dedicated hardware -- the valid bit serves as an implicit generation guard.

### Throttle

- The PE tracks the number of active frames (frames with valid=1 in the tag store)
- When all 4 frames are active, ALLOC tokens are NACKed or stalled until a free occurs
- Prevents frame overflow
- Hardware cost: 2-bit counter + comparator + gate. ~3 TTL chips.
- With compiler-controlled scheduling, the throttle should rarely fire. It's a safety net, not a performance mechanism.

## 6. Instruction Memory

### Static Assignment, Per-PE Contents

Unlike Amamiya, where every PE has identical IM contents (the full program), each PE here holds only the function bodies (or function chunks) assigned to it by the compiler. This means:

- IM is smaller per PE (only assigned code, not the whole program)
- Different PEs have different IM contents (loaded at bootstrap)
- The compiler emits a per-PE instruction image as part of the program

### Runtime Writability

Instruction memory is **not** read-only. It is writable from the network via IRAM write (prefix 011+01) packets. This serves two purposes:

1. **Bootstrap**: loading programs before execution starts
2. **Runtime reprogramming**: loading new function bodies while other PEs continue executing (future capability, not needed for v0)

Runtime writability also means instruction memory size is not a hard architectural limit -- if a program needs more code than fits in one PE's IM, the runtime (or a management PE) could swap function bodies in and out. Very speculative, but the hardware path exists.

### Implementation

Instruction memory is PE-local SRAM, sharing a chip pair with the frame
region via address partitioning (see Per-PE Memory Map, section 10).
**IRAM width is completely independent of bus width**. It is sized for
encoding needs, not bus constraints.

#### IRAM Width: 16-bit Single-Half Format

Each IRAM slot is **16 bits**, read in a single cycle from one 8-bit-wide
SRAM chip pair. Instruction templates are activation-independent: all
per-activation data (constants, destinations, match operands,
accumulators) lives in the frame.

```
IRAM address = [offset:8] (v0, 8-bit address)
```

256 instruction slots per PE. Total IRAM SRAM usage: 512 bytes per PE.
For programs exceeding 256 instructions, see Future: Bank Switching
with 74LS610 (section 14).

#### Instruction Word Format

```
[type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
  15     14-10    9-7     6       5-0
```

| Field  | Bits | Purpose                                                      |
| ------ | ---- | ------------------------------------------------------------ |
| type   | 1    | 0 = CM compute, 1 = SM operation                             |
| opcode | 5    | Operation code (CM and SM have independent 32-entry spaces)  |
| mode   | 3    | Combined tag/frame-reference mode (see mode table below)     |
| wide   | 1    | 0 = 16-bit frame values, 1 = 32-bit (consecutive slot pairs) |
| fref   | 6    | Frame slot base index (64 slots per activation)              |

**type:1** -- operation space select:
```
0 = CM compute operation (ALU)
1 = SM operation (structure memory bus command)
```

**opcode:5** -- 32 slots per type. CM and SM have independent opcode
spaces (32 CM opcodes + 32 SM opcodes). Decoded by EEPROM into control
signals. See `alu-and-output-design.md` for the CM operation set.

**mode:3** -- combined output routing and frame access mode. Controls
whether the instruction emits tokens, how it reads destinations and
constants from the frame, or whether it writes results back to the frame.
See the mode table below.

**wide:1** -- frame value width:
```
0 = 16-bit frame values (single slot per logical value)
1 = 32-bit frame values (consecutive slot pairs per logical value)
```

**fref:6** -- frame slot index (0-63). Base of a contiguous group of
1-3 slots, depending on mode. The instruction template references frame
data exclusively through this field; no per-activation data exists in
the instruction word itself.

Constants and destinations are NOT in the instruction word. They live in
frame slots, referenced by `fref`. The instruction template is pure
control flow: opcode, mode flags, and a frame slot reference.

**Instruction words are never serialised onto the external bus** during
normal execution. They are only written via PE-local write packets
(prefix 011+01, region=0) during program loading.
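
A Python pack/unpack pair for this word format (field positions exactly as in the layout above), of the sort the assembler and emulator would share:

```python
# 16-bit instruction word: [type:1][opcode:5][mode:3][wide:1][fref:6]
# with type at bit 15 and fref at bits 5-0.

def pack_insn(type_, opcode, mode, wide, fref):
    assert opcode < 32 and mode < 8 and fref < 64
    return (type_ << 15) | (opcode << 10) | (mode << 7) | (wide << 6) | fref

def unpack_insn(word):
    return {
        "type":   (word >> 15) & 0b1,
        "opcode": (word >> 10) & 0b11111,
        "mode":   (word >> 7)  & 0b111,
        "wide":   (word >> 6)  & 0b1,
        "fref":   word         & 0b111111,
    }

w = pack_insn(type_=0, opcode=7, mode=1, wide=0, fref=12)
assert unpack_insn(w) == {"type": 0, "opcode": 7, "mode": 1,
                          "wide": 0, "fref": 12}
```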

#### Mode Table (3-bit `mode` field)

The 3-bit `mode` field encodes both output routing behaviour and frame
access pattern in a single field. Every combination is useful; there are
no wasted encodings.

```
mode [2:0] tag behaviour  frame reads at fref          use case
---- ----- -------------- ---------------------------- ----------------------------
 0   000   INHERIT        [dest]                       single output, no constant
 1   001   INHERIT        [const, dest]                single output with constant
 2   010   INHERIT        [dest1, dest2]               fan-out, no constant
 3   011   INHERIT        [const, dest1, dest2]        fan-out with constant
 4   100   CHANGE_TAG     (none)                       dynamic routing, no constant
 5   101   CHANGE_TAG     [const]                      dynamic routing with constant
 6   110   SINK           write result -> frame[fref]  store to frame, no output
 7   111   SINK+CONST     read frame[fref],            local accumulate / RMW
                          write result -> frame[fref]
```

**Bit-level decode equations:**

```
output_enable = NOT mode[2]               modes 0-3: read dest from frame, emit token
change_tag    = mode[2] AND NOT mode[1]   modes 4-5: routing from left operand
sink          = mode[2] AND mode[1]       modes 6-7: no output token, write to frame
has_const     = mode[0]                   modes 1, 3, 5, 7: read constant from frame
has_fanout    = mode[1] AND NOT mode[2]   modes 2-3: read two destinations
```

Frame slot count per mode = `1 + has_const + has_fanout`, read
sequentially starting from `fref`. For CHANGE_TAG modes the destination
comes from the left operand rather than a frame slot, so mode 4 reads
no slots and mode 5 reads only the constant. For SINK modes, `fref` is
the write target. For SINK+CONST (mode 7, read-modify-write), the read
and write target is the same slot (`fref`), with the ALU result writing
back after computation.
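
The decode equations, transcribed into Python together with the per-mode frame read count. Note that the CHANGE_TAG rows (modes 4-5) take their destination from the left operand, so they deviate from the generic `1 + has_const + has_fanout` count:

```python
# Mode decode: a direct transcription of the bit-level equations
# and the mode table's frame-read column.

def decode_mode(mode):
    m2, m1, m0 = (mode >> 2) & 1, (mode >> 1) & 1, mode & 1
    d = {
        "output_enable": not m2,             # modes 0-3: emit token
        "change_tag":    bool(m2 and not m1),# modes 4-5: tag from left operand
        "sink":          bool(m2 and m1),    # modes 6-7: write to frame
        "has_const":     bool(m0),           # modes 1,3,5,7: constant read
        "has_fanout":    bool(m1 and not m2),# modes 2-3: two destinations
    }
    if d["change_tag"]:                      # dest not read from SRAM:
        d["frame_reads"] = 1 if m0 else 0    # mode 4 = 0, mode 5 = const only
    else:                                    # modes 0-3, 6-7
        d["frame_reads"] = 1 + d["has_const"] + d["has_fanout"]
    return d

assert decode_mode(3)["frame_reads"] == 3    # [const, dest1, dest2]
assert decode_mode(4)["frame_reads"] == 0    # CHANGE_TAG, no frame access
assert decode_mode(7)["sink"] and decode_mode(7)["has_const"]
```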

**INHERIT (modes 0-3):** output tokens are routed to destinations stored
in frame slots. Each destination slot holds a pre-formed flit 1 value:

```
frame dest slot: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3]
```

The PE reads the slot and puts it directly on the bus as flit 1. The ALU
result becomes flit 2. Almost zero token formation logic is needed -- the
frame constant IS the output flit. This is the same format as the incoming
token's flit 1, enabling forwarding without repacking.

- **Mode 0** (single output, no constant): frame[fref] = dest. 1 slot.
- **Mode 1** (single output with constant): frame[fref] = const,
  frame[fref+1] = dest. 2 slots.
- **Mode 2** (fan-out, no constant): frame[fref] = dest1,
  frame[fref+1] = dest2. 2 slots.
- **Mode 3** (fan-out with constant): frame[fref] = const,
  frame[fref+1] = dest1, frame[fref+2] = dest2. 3 slots.

**CHANGE_TAG (modes 4-5):** the left operand replaces the frame
destination as flit 1. The entire output flit 1 comes from the left
operand data value (16 bits, verbatim). The right operand becomes flit 2
(payload data). This enables sending a value to any destination computed
at runtime -- the packed tag IS flit 1. No field extraction or assembly.

- **Mode 4** (no constant): no frame reads. Flit 1 = left operand,
  flit 2 = right operand (or ALU result).
- **Mode 5** (with constant): frame[fref] = const. Flit 1 = left
  operand, flit 2 = right operand. The constant feeds the ALU.

The output stage is a mux: frame dest vs left operand, selected by
mode[2]. Hardware: a left operand bypass latch (~2 chips) preserves the
left operand value past the ALU, and a stage 5 flit 1 mux (~2 chips)
selects between the assembled flit and the raw data.

**SINK (modes 6-7):** no output token is emitted. The ALU result is
written back to frame[fref]. Used for local accumulation, temporary
storage, and read-modify-write patterns.

- **Mode 6** (write only): ALU result written to frame[fref]. 1 SRAM
  write cycle.
- **Mode 7** (read-modify-write): frame[fref] is read as the constant
  input to the ALU, the result is written back to frame[fref]. Enables
  in-place accumulation without consuming a separate constant slot.

**dest_type derivation:** the output token format (dyadic wide vs monadic
normal vs monadic inline) is determined by the destination frame slot
contents. Since destination slots hold pre-formed flit 1 values, the
token type is encoded in the prefix bits of the stored flit. The output
stage emits the frame slot verbatim as flit 1; the prefix bits in the
stored value determine the wire format. No runtime type derivation logic
is needed. For SWITCH not-taken paths, the output stage emits a monadic
inline token (hardwired prefix in the formatter, overriding the frame
destination).

#### IRAM Valid-Bit Protection

IRAM is written via PE-local write tokens (prefix 011+01, region=0) that
share the PE's input path with compute tokens. When IRAM contents are
replaced at runtime (swapping function fragments in and out), tokens in
flight may target IRAM addresses that have been or are being overwritten.
Because tokens do not carry instruction identity information, the PE
cannot distinguish "right instruction" from "wrong instruction" -- it
just sees an offset into IRAM.

This is an **instruction identity problem**, not a presence problem. The
dangerous case is not "IRAM is empty" but "IRAM contains a different
instruction than the token expects."

**The mechanism: per-page valid bits.** IRAM is logically divided into
pages (e.g. 8 pages of 32 instructions for a 256-entry IRAM, or 16 pages
of 16). Each page has a 1-bit valid flag, stored in a small TTL register
alongside IRAM. Total hardware cost: one register chip for all page
valid bits.

The valid bit is checked during the IFETCH pipeline stage, in parallel
with the IRAM SRAM read. The top bits of the token's `offset` field
select the page; the valid bit for that page gates whether the token
proceeds or is rejected.

```
Token arrives at IFETCH:
  page = offset[high bits]
  if valid_bit[page] == 0:
    -> reject token (see rejection policy below)
  else:
    -> proceed with instruction fetch and act_id resolution
```
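
As a behavioural sketch (assuming the 8-pages-of-32 split, so the page
index is the top 3 bits of the 8-bit offset), the IFETCH gate is:

```python
PAGE_SHIFT = 5  # 32 instructions per page; page = offset[7:5]

def ifetch_gate(offset: int, valid_bits: list[bool]) -> bool:
    """Return True if the token may proceed to instruction fetch,
    False if it must be rejected (its page is invalid mid-swap)."""
    page = (offset >> PAGE_SHIFT) & 0x7
    return valid_bits[page]
```

In hardware this is just the page-select bits addressing the valid-bit
register and one gate on the pipeline-advance signal.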

**IRAM swap protocol.** Because config writes and compute tokens share
the input path, the swap sequence is naturally ordered:

```
1. Loader sends drain signal (implementation TBD -- could be a PE-local
   write "quiesce" flag, or the PE back-pressures via handshake/ready
   signal)
2. PE processes remaining compute tokens in pipeline (natural drain)
3. PE-local write token (prefix 011+01, region=0) arrives:
   a. PE clears valid bit for target page
   b. PE writes instruction word to IRAM at specified address
   c. If more write tokens follow (burst), keep writing
4. Load-complete marker arrives (PE-local write with load-complete flag):
   a. PE sets valid bit for target page
   b. PE resumes accepting compute tokens for that page
```

During steps 3-4, any compute token that arrives targeting an invalid
page is rejected. The shared input path ordering guarantees that tokens
from the *new* code epoch cannot arrive until after the loader has sent
them, which is after the load completes. Rejected tokens are therefore
late arrivals from the old epoch -- work that is being abandoned.

**Presence-bit optimisation:** the frame's presence metadata (in the 670
register files) can be checked before step 1 to determine if any tokens
are pending for offsets in the target page. If all presence bits for
matchable offsets in that page are clear, the drain step can be skipped
entirely -- no tokens are waiting for those instructions. This enables
targeted IRAM replacement without stalling the entire PE.

**Rejection policy.** v0: discard silently + diagnostic. Late-arriving
tokens targeting an invalid IRAM page are dropped. The PE sets a sticky
flag (directly driving a diagnostic LED) to indicate that a discard
occurred. This is a "should never happen if the loader protocol is
correct" safety net. The LED makes it visible during debugging without
adding any pipeline complexity.

If the flag lights up, something is wrong with the drain timing.

Future: NAK response. The PE could form a NAK token from the rejected
compute token and emit it to a coordinator. The output stage already
exists (it forms tokens from frame destinations + ALU results); a bypass
path from IFETCH to the output would enable this. Estimated cost: a mux
and some control logic, ~5-8 TTL chips. Deferred until the runtime is
sophisticated enough to act on NAKs.

**What this does NOT protect against.** The valid-bit mechanism catches
tokens that arrive **during** an IRAM swap (page is invalid). It does
**not** catch tokens that arrive after a swap completes and the page is
re-validated with different code. Preventing that case requires the drain
protocol to be correct -- all tokens from the old epoch must have been
processed or discarded before the new code is marked valid.

For v0, this is a software/loader invariant enforced by the drain
protocol. Future hardening options:

- **Per-page epoch counter (2-3 bits):** incremented on each page reload.
  Checked against an expected epoch stored per-activation or derived from
  spare token bits. Catches post-swap stale tokens at the cost of a
  comparator + epoch storage.
- **Fragment ID register per page:** similar to the epoch counter but
  identifies the fragment by name rather than sequence number. More
  expensive (wider comparator) but more debuggable.

Both options fit in the spare bits reserved in the token formats. The
valid-bit mechanism is forward-compatible with either.

## 7. SM Operation Dispatch

SM operations (type=1) use the same 16-bit instruction format. The 5-bit
opcode field selects from the SM opcode space (independent of CM opcodes).
`fref` points at frame slots containing SM-specific parameters. The PE
constructs SM tokens from frame data, operand data, and ALU results.

All SM addressing goes through frame slots. There is no separate
"pointer-addressed" vs "constant-addressed" distinction in the instruction
encoding -- the frame slot contents determine the target SM, address, and
any return routing. Both pointer-addressed operations (address from token
data) and constant-addressed operations (address from frame slot) use the
same frame-slot-based encoding.

### SM Bus Opcode Encoding

The SM bus opcode encoding is unchanged -- variable-width with tier 1
(3-bit opcode, 10-bit addr) and tier 2 (5-bit opcode, 8-bit payload).
See `sm-design.md` for the full opcode table:

```
Tier 1 (3-bit, 1024-cell addr range):
  read, write, alloc, free, exec, ext

Tier 2 (5-bit, 256-cell addr range or 8-bit payload):
  rd_inc, rd_dec, cas, raw_rd, clear, set_pg, write_im, (spare)
```

### SM Token Construction

The PE output stage builds SM tokens on the wire. The instruction's SM
opcode (in the PE's 5-bit opcode field) maps to the SM bus opcode via
the instruction decoder EEPROM. Frame slots provide addressing and
return routing parameters:

- **SM flit 1** is constructed from frame[fref] contents (SM_id, address)
  plus the SM bus opcode from the decoder.
- **SM flit 2** source depends on the operation (see flit 2 source mux
  below).
- **SM flit 3** (CAS and EXT only) carries additional data.

### Frame Slot Packing for SM Parameters

A single 16-bit frame slot holds the SM target:

```
Tier 1 target slot: [SM_id:2][addr_high:2][addr_low:8][spare:4] = 16 bits
                    (addr = 10 bits, 1024-cell range)

Tier 2 target slot: [SM_id:2][addr:8][spare:6] = 16 bits
                    (addr = 8 bits, 256-cell range)
```
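
The two packings can be expressed as bit-shift sketches (behavioural
models of the assembler's slot-packing job, following the layouts above):

```python
def pack_tier1_target(sm_id: int, addr: int) -> int:
    """Pack a tier-1 SM target into a 16-bit frame slot:
    [SM_id:2][addr_high:2][addr_low:8][spare:4]."""
    assert 0 <= sm_id < 4 and 0 <= addr < 1024
    return (sm_id << 14) | ((addr >> 8) << 12) | ((addr & 0xFF) << 4)

def pack_tier2_target(sm_id: int, addr: int) -> int:
    """Pack a tier-2 SM target: [SM_id:2][addr:8][spare:6]."""
    assert 0 <= sm_id < 4 and 0 <= addr < 256
    return (sm_id << 14) | (addr << 6)
```

Note the tier-1 layout leaves the full 10-bit address contiguous in
bits 13:4, which is what lets the flit 1 assembly stage slice it out
without muxing.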

For operations needing return routing, the next consecutive frame slot
holds a pre-formed response token flit 1:

```
Return routing slot: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3] = 16 bits
```

This is the same format as a CM destination slot -- the SM response token
routes as a normal compute token back to the requesting PE. The SM treats
flit 2 of the request as an opaque 16-bit blob and echoes it as flit 1
of the response (see `sm-design.md` Result Format).

### Flit 2 Source Mux

Different SM operations need different data in flit 2. The source is
selected by a 2-bit signal derived from the instruction decoder (SM
opcode + mode):

```
source   select  use case
-------  ------  -----------------------------------------------
ALU out    00    SM write (operand is write data, passes through ALU),
                 SM write_im (immediate write). Default for CM compute.
R oper     01    SM scatter write (ALU computes addr from base + L operand;
                 R operand is write data, bypasses ALU to flit 2).
                 SM CAS flit 2 (expected value = L operand).
Frame      10    SM read / rd_inc / rd_dec / raw_rd / alloc return routing
                 (frame[fref+1] = pre-formed response token flit 1).
                 SM exec parameters.
(spare)    11    reserved
```

Hardware: cascaded 74LS157 (quad 2:1 mux) pairs, ~4-6 chips for 16-bit
width.
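
Behaviourally the mux is trivial; this sketch (constant names are
illustrative, not the decoder EEPROM's actual output encoding) is the
functional equivalent of the cascaded 74LS157 pairs:

```python
# 2-bit flit-2 source select values (illustrative encoding)
ALU_OUT, R_OPERAND, FRAME, SPARE = 0b00, 0b01, 0b10, 0b11

def flit2_mux(select: int, alu_out: int, r_operand: int, frame_word: int) -> int:
    """4:1 selection over the three real flit-2 sources; the spare
    encoding is reserved and deliberately unmapped here."""
    return {ALU_OUT: alu_out, R_OPERAND: r_operand, FRAME: frame_word}[select]
```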

### SM Operation Mapping Table

| SM bus op | PE opcode | frame slots | mode | operands | flits | flit 2 source | notes |
|-----------|-----------|-------------|------|----------|-------|---------------|-------|
| read | SM_READ | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | indexed variant: ALU adds base + index |
| write | SM_WRITE | 1: target | 0 | monadic (data) | 2 | ALU (write data) | |
| write (scatter) | SM_WRITE_IX | 1: target | 1 | dyadic (index, data) | 2 | R operand (write data) | ALU: base + index |
| alloc | SM_ALLOC | 2: params + return | 1 | monadic (trigger) | 2 | frame (return routing) | |
| free | SM_FREE | 1: target | 0 | monadic (trigger) | 2 | don't-care | |
| exec | SM_EXEC | 2: target + params | 1 | monadic (trigger) | 2 | frame (params/count) | |
| ext | SM_EXT | 1-2: varies | varies | varies | 3 | varies | 3-flit extended addressing |
| rd_inc | SM_RDINC | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | atomic read-and-increment |
| rd_dec | SM_RDDEC | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | atomic read-and-decrement |
| cas | SM_CAS | 1: target | 0 | dyadic (expected, new) | 3 | L operand (expected) | 3-flit; return via prior read |
| raw_rd | SM_RAWRD | 2: target + return | 1 | monadic (trigger) | 2 | frame (return routing) | non-blocking, no deferred read |
| clear | SM_CLEAR | 1: target | 0 | monadic (trigger) | 2 | don't-care | resets cell to EMPTY |
| set_pg | SM_SETPG | 1: target | 0 | monadic (page value) | 2 | ALU (page value) | SM-side bank switching |
| write_im | SM_WRIM | 1: target | 0 | monadic (data) | 2 | ALU (write data) | immediate write, tier 2 addr |

### SM Flit 1 Assembly

Stage 5 assembles SM flit 1 from frame[fref] and the wire opcode from
the decoder EEPROM:

```
Tier 1:
  flit 1 = [1][frame[fref][15:14] (SM_id)][wire_opcode:3][frame[fref][13:4] (addr:10)]

Tier 2:
  flit 1 = [1][frame[fref][15:14] (SM_id)][wire_opcode:5][frame[fref][13:6] (addr:8)]
```

Hardware: the `[1]` prefix is hardwired. SM_id comes from the top 2 bits
of the frame slot. The wire opcode comes from the EEPROM. The address
comes from the remaining frame slot bits. All fields are concatenated
on the wire with no runtime muxing -- the frame slot is pre-packed so
that the bit positions align with the SM bus format. ~1-2 chips for
output gating and serialisation.
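
A bit-level model of the concatenation (behavioural only -- the real
hardware is pure wiring plus output gating) confirms the field widths
sum to 16 in both tiers:

```python
def assemble_sm_flit1(frame_slot: int, wire_opcode: int, tier: int) -> int:
    """Concatenate the hardwired [1] prefix, SM_id, wire opcode and
    address into a 16-bit SM flit 1, per the tier 1 / tier 2 layouts."""
    sm_id = (frame_slot >> 14) & 0x3
    if tier == 1:   # [1][SM_id:2][opcode:3][addr:10]
        addr = (frame_slot >> 4) & 0x3FF
        return (1 << 15) | (sm_id << 13) | (wire_opcode << 10) | addr
    else:           # [1][SM_id:2][opcode:5][addr:8]
        addr = (frame_slot >> 6) & 0xFF
        return (1 << 15) | (sm_id << 13) | (wire_opcode << 8) | addr
```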

### Indexed Address Computation

For indexed READ, scatter WRITE, and other address-computed operations,
the ALU computes the effective address: base address (extracted from
frame[fref]) + index (from the left operand). The computed address is
packed into SM flit 1 by the output stage. The ALU performs address
arithmetic without dedicated address-computation hardware.

### CAS: 3-Flit SM Token

Compare-and-swap requires an address, an expected value, and a new
value. This exceeds the standard 2-flit SM packet.

```
CAS emission (3 flits):
  flit 1: SM header (SM_id + CAS opcode + addr, from frame[fref])
  flit 2: expected value (left operand)
  flit 3: new value (right operand)
```

The output serialiser emits 3 flits instead of the default 2. An
`extra_flit` signal from the instruction decoder (asserted for CAS and
EXT-mode ops) increments the serialiser's flit counter limit:
`flit_count = 2 + extra_flit`. One gate.

Return routing for CAS uses the prior-READ pattern: issue an SM READ to
the target cell first (plants return routing in the SM's deferred-read
register), receive the current value as the response (which also provides
the expected value), then issue the CAS with the known expected value
and the desired new value.

### SM Flit Assembly Hardware Cost

| Component | Chips/PE | Purpose |
|-----------|----------|---------|
| Flit 2 source mux (16-bit, 4:1) | ~4-6 | ALU out / R oper / Frame / spare |
| SM flit 1 gating | ~1-2 | Frame target slot + EEPROM opcode to bus |
| Extra flit control | ~0.5 | CAS/EXT 3-flit counter |

## 8. EEPROM-Based Instruction Decoding

The instruction decoder can be implemented as an EEPROM acting as a PLD.
Input bits = instruction opcode fields + PE ID bits. Output bits =
control signals for the ALU, matching store, token output formatter, etc.

This gives significant flexibility:

- The instruction set can be changed by reflashing the EEPROM (no board changes)
- Per-PE customisation (different PEs could theoretically have different instruction subsets, though unlikely for v0)
- The PE ID is "free" -- it's just more EEPROM address bits

## 9. The 670 Subsystem: Act ID Lookup, Match Metadata, and SC Register File

### Role in the Frame-Based Architecture

The 74LS670s serve two critical functions:

1. **act_id -> frame_id lookup table.** Indexed by the token's 3-bit
   `activation_id`, outputs `{valid:1, frame_id:2, spare:1}` in
   ~35 ns (combinational). This replaces what would otherwise be an
   SRAM cycle for associative tag comparison.

2. **Presence and port metadata store.** Indexed by `frame_id`,
   stores presence and port bits for all 8 matchable offsets across
   all 4 frames. Combinational read (~35 ns after frame_id settles,
   ~70 ns total from act_id presentation).

Both functions complete within stage 2, in parallel with the IRAM
read. By the time stage 3 begins, the PE knows frame_id, presence,
and port -- the only remaining SRAM access is the actual operand
data.

### Hardware Configuration

**act_id -> frame_id (2x 74LS670):**

Addressed by `act_id[1:0]` with `act_id[2]` selecting between chips.
Each chip holds 4 words x 4 bits. Output: `{valid:1, frame_id:2,
spare:1}`.

```
ALLOC:  write {valid=1, frame_id} at address act_id (670 write port)
FREE:   write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns
```

The 670's independent read and write ports allow ALLOC to proceed
while the pipeline reads -- zero conflict.

**Presence + port metadata (4x 74LS670):**

Each 670 word (4 bits) holds presence+port for 2 offsets:
`{presence_N:1, port_N:1, presence_N+1:1, port_N+1:1}`.
Read address = `[frame_id:2]`. Output bits selected by
`offset[2:0]` via a bit-select mux.

```
670 chip 0 (offsets 0-1): word[frame_id] = {pres0, port0, pres1, port1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2, port2, pres3, port3}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4, port4, pres5, port5}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6, port6, pres7, port7}
```

`offset[2:1]` selects the chip, `offset[0]` selects which pair of bits
within the 4-bit output (a 2:1 mux -- one gate).
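
The chip/bit-pair selection can be modelled as follows (a behavioural
sketch; the bit ordering within the 4-bit word -- pres_N as the high
bit -- is an assumption consistent with the listing above):

```python
def read_presence_port(chips, frame_id: int, offset: int):
    """chips[c][frame_id] is a 4-bit word {pres_N, port_N, pres_N+1,
    port_N+1}, pres_N in the MSB. Returns (presence, port) for the
    requested matchable offset 0-7."""
    word = chips[(offset >> 1) & 0x3][frame_id]   # offset[2:1] picks the chip
    if offset & 1:                                 # odd offset: low bit pair
        return (word >> 1) & 1, word & 1
    return (word >> 3) & 1, (word >> 2) & 1        # even offset: high bit pair
```

The `offset & 1` branch is the one-gate 2:1 mux; everything else is
address wiring.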

The 670's simultaneous read/write is critical: during stage 3, when
a first operand stores and sets presence, the write port updates the
presence 670 while the read port remains available for the next
pipeline stage's lookup. No read-modify-write sequencing is needed.

**Bit select mux (1-2 chips):**

Offset-based selection of the relevant presence and port bits from
the 670 outputs.

### Chip Budget

| Component | Chips | Function |
|---------------------------|--------|--------------------------------------|
| act_id -> frame_id lookup | 2 | 74LS670, indexed by act_id |
| Presence + port metadata | 4 | 74LS670, indexed by frame_id |
| Bit select mux | 1-2 | offset-based selection |
| **Total match metadata** | **~8** | |

### SC Register File (Mode-Switched)

During **dataflow mode**, the PE uses act_id resolution and presence
metadata constantly but the SC register file is idle (no SC block
executing). During **SC mode**, the PE uses the register file
constantly but act_id lookup and presence tracking are idle (the SC
block has exclusive PE access; no tokens enter matching).

Some of the 670s can be repurposed for register storage during SC
mode. The exact mapping depends on the SC block design:

- The 4 presence+port 670s (indexed by frame_id in dataflow mode) can
  be re-addressed by instruction register fields during SC mode,
  providing 4 chips x 4 words x 4 bits = 64 bits of register storage.
  Combined across chips, this gives **4 registers x 16 bits** (4 bits
  per chip, 4 chips for width).

- With additional mux logic, all 6 670s (including the act_id lookup
  pair, if it need not remain active for frame lifecycle management)
  could provide **6 registers x 16 bits** during SC mode.

The act_id lookup 670s may need to remain in their dataflow role even
during SC mode if the PE must handle frame control tokens (ALLOC/FREE)
arriving during SC block execution. Whether to share them depends on
the SC block entry/exit protocol.

### The Predicate Slice

One of the 670s can be **permanently dedicated as a predicate
register** rather than participating in the mode-switched pool:

- 4 entries x 4 bits = 16 predicate bits, always available
- Useful for: conditional token routing (SWITCH), loop termination
  flags, SC block branch conditions, I-structure status flags
- Does not reduce the metadata capacity significantly: the remaining
  3 presence+port 670s still cover 6 of the 8 matchable offsets;
  the 2 uncovered offsets can fall back to SRAM-based presence, or
  the assembler can simply be constrained to 6 dyadic offsets per
  frame

The predicate register is always readable and writable regardless of
mode, since it is a dedicated chip with its own address/enable lines.
Instructions can test or set predicate bits without going through the
matching store or the ALU result path.

### Mode Switching

When transitioning from dataflow mode to SC mode:

1. **Save metadata** from the shared 670s to spill storage.
2. **Load initial SC register values** (the matched operand pair that
   triggered the SC block) into the 670s.
3. **Switch the address mux**: 670 address lines are now driven by
   instruction register fields instead of frame_id / act_id.
4. **Switch IRAM to counter mode**: sequential fetch via an
   incrementing counter rather than token-directed offset.

When transitioning back:

1. **Emit the final SC result** as a token (last instruction with OUT=1).
2. **Restore metadata** from spill storage to the 670s.
3. **Switch the address mux back** to frame_id / act_id addressing.
4. **Resume token processing** from the input FIFO.

### Spill Storage Options

Metadata from the shared 670s (~64-96 bits depending on how many
are shared) needs temporary storage during SC block execution.

**Option A: Shift registers.** 2x 74LS165 (parallel-in, serial-out)
for save + 2x 74LS595 (serial-in, parallel-out) for restore. Total:
4 chips. Save/restore takes ~12 clock cycles each.

**Option B: Dedicated spill 670.** One additional 74LS670 (4x4 bits)
holds 16 bits per save cycle; ~4-6 write cycles are needed to save all
shared chips' contents. Total: 1 chip, ~4-6 cycles per save/restore.

**Option C: Spill to frame SRAM.** During SC mode, the frame SRAM
has bandwidth available (no match operand reads). Write the 670
metadata contents into a reserved region of the frame SRAM address
space. No extra chips needed. ~4-6 SRAM write cycles to save, ~4-6
to restore. The SRAM is single-ported, but there is no contention
because the pipeline is paused during the mode switch.

**Recommended: Option C.** Zero additional chips. The save/restore
overhead of ~4-6 cycles per transition is negligible compared to the
SC block's execution savings (EM-4 data: 23 clocks pure dataflow vs
9 clocks SC for Fibonacci, so even with ~10 cycles of mode-switch
overhead, you break even at ~5-7 SC instructions).

## 10. SRAM Configuration and Memory Map

### Unified SRAM Chip Pair

The PE uses a single 32Kx8 chip pair (2 chips for 16-bit data width)
for both IRAM and frame storage, with address partitioning via a
single decode bit. The recommended part is the AS6C62256 (55 ns,
32Kx8, DIP-28) or equivalent. 55 ns access time fits comfortably
within a 200 ns clock period at 5 MHz, with margin for address setup
and data hold.

The unified SRAM approach keeps chip count low: one chip pair per PE
serves both IRAM and frame storage, avoiding the chip proliferation
that separate matching store and IRAM memories would require.

### Address Map

```
v0 address space (simple decode):

  IRAM region:  [0][offset:8]            instruction templates
                offset from token
                capacity: 256 instructions (512 bytes)

  Frame region: [1][frame_id:2][slot:6]  per-activation storage
                frame_id from tag store resolution
                capacity: 4 frames x 64 slots = 256 entries (512 bytes)
```
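
The simple decode amounts to one region bit above a 9-bit word
address; a behavioural sketch of the address formation:

```python
def sram_address(region_frame: bool, frame_id: int = 0,
                 slot: int = 0, offset: int = 0) -> int:
    """v0 unified-SRAM word address: bit 8 is the region decode bit,
    selecting [1][frame_id:2][slot:6] (frame) vs [0][offset:8] (IRAM)."""
    if region_frame:
        assert 0 <= frame_id < 4 and 0 <= slot < 64
        return (1 << 8) | (frame_id << 6) | slot
    return offset & 0xFF
```

In hardware the "decode" is literally one address line; no decoder
chip is needed.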

Total v0 SRAM utilisation: IRAM 512 bytes, frame 512 bytes. Only 1 KB
used out of a 32Kx8 chip pair (64 KB). Ample room for future
expansion without changing chips. See Future: Bank Switching with
74LS610 (section 14) for the upgrade path when programs exceed 256
instructions per PE.

### Shared SRAM Arbitration

The unified SRAM chip pair is shared between three access patterns:

- Pipeline IRAM reads (stage 2, instruction fetch): high frequency, performance-critical
- Pipeline frame reads/writes (stages 3 and 5): high frequency, performance-critical
- PE-local write tokens (IRAM and frame loading): low frequency, can tolerate delay

**Arbitration approach**: PE-local writes execute when the frame SRAM is
not busy with compute pipeline accesses (natural gaps between pipeline
stages). When no gap is available, writes queue and execute during the
next idle cycle. Hardware cost: mux on SRAM address/data buses +
write-enable gating + stall signal to pipeline. Roughly 5-8 TTL chips.

**IRAM vs frame contention**: in v0, IRAM and frame share one SRAM chip
pair via address partitioning (region bit in the address). Stage 2
(IRAM read) and stages 3/5 (frame read/write) access different address
regions but contend for the same physical chip. The pipeline controller
ensures only one stage accesses the SRAM per cycle. With the natural
pipeline spacing, this rarely causes stalls -- see the frame SRAM
contention model in Pipeline Stall Analysis (section 11).

**Upgrade path**: separating IRAM and frame onto independent SRAM chip
pairs eliminates all inter-region contention. Stage 2 (IRAM) and
stages 3/5 (frame) can access their respective chips in the same cycle.

**Async-compatible arbitration**: defined as a request/grant interface.
Synchronous implementation: priority mux resolved on the clock edge.
Async implementation: mutual exclusion element (Seitz arbiter). The
interface is the same in both cases. See `network-and-communication.md`
for clocking discipline.

## 11. Pipeline Stall Analysis

### The Frame SRAM Contention Problem

With Approach C (670 lookup), act_id -> frame_id resolution is
combinational (~35 ns via the 670 read port), and the presence/port
check is also combinational (~35 ns from a second set of 670s).
There is no read-modify-write on SRAM for metadata -- metadata
lives entirely in the 670 register files.

The primary bottleneck is **frame SRAM contention between stage 3 and
stage 5**. Both stages access the same single-ported SRAM chip pair:

- **Stage 3** reads/writes operand data (dyadic match) and reads
  constants (modes with has_const=1).
- **Stage 5** reads destinations (modes 0-3), or writes results back
  to the frame (sink modes 6-7).

When two pipelined tokens have stage 3 and stage 5 active in the
same cycle, the SRAM can serve only one. The other stalls.

### The Pipeline Hazard

The classic RAW hazard still exists but takes a different form. Two
consecutive tokens targeting the same frame slot (e.g., two mode 7
read-modify-write operations on the same accumulator slot) create a
data dependency: the second token's stage 3 read must see the first
token's stage 5 write.

Detection requires comparing the (act_id, fref) of the incoming token
against the in-flight pipeline latches at stages 3-5. Hardware cost: ~2
chips (9-bit comparator + AND gate). Alternatively, the assembler
can guarantee this never happens by never emitting consecutive mode 7
tokens to the same slot on the same PE.

This hazard is **statistically uncommon** in dataflow execution: it
takes coincidental timing for two operands to arrive back-to-back at
the exact same frame slot. The bypass path is cheap insurance that
fires infrequently.
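
The comparator's behaviour is a simple equality scan (a model of the
9-bit comparator + AND gate, not its TTL wiring):

```python
def raw_hazard(incoming, in_flight):
    """incoming is the (act_id, fref) pair of the token entering
    stage 3; in_flight holds the (act_id, fref) latched at stages
    3-5, with None for empty stages. Any match means the stage-3
    read must wait for (or bypass from) the earlier write."""
    return any(tok is not None and tok == incoming for tok in in_flight)
```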

### SRAM Contention Model

The frame SRAM chip is single-ported (one access per clock cycle at
5 MHz with 55 ns SRAM). The primary stall source is contention
between stage 3 (frame reads for operand data and constants) and
stage 5 (frame reads for destinations, or frame writes for sink
modes).

**Contention arises only when:**
- Token A is at stage 5, needing a frame SRAM read (dest) or write
  (sink), AND
- Token B is at stage 3, needing a frame SRAM read (match operand,
  constant, or tag word).

**Contention does NOT arise when:**
- Token A's stage 5 is mode 4/5 (change_tag -- no SRAM access).
- Token B's stage 3 is zero-cycle (monadic no-const, or match data
  in register file with no const).
- Token A was a dyadic miss (terminated at stage 3, never reaches
  stage 5).

### Cycle Counts by Instruction Type

**Approach C (74LS670 lookup, recommended v0):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Stage 3 breakdown for Approach C:
- Dyadic hit: 1 SRAM cycle to read the stored operand (frame_id and
  presence already known from the 670). +1 cycle for the constant if
  has_const=1.
- Dyadic miss: 1 SRAM cycle to write operand data. The 670 write port
  sets the presence bit combinationally in parallel.
- Monadic: 0 SRAM cycles (no match), +1 for the constant if has_const=1.

**Approach B (register-file match pool):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     1     --    --    3
dyadic hit, mode 0               1     1     1     1     1     5
dyadic hit, mode 1               1     1     2     1     1     6
dyadic hit, mode 3 (fan+const)   1     1     2     1     2     7
```

Approaches B and C produce identical single-token cycle counts. The
difference emerges under pipelining: Approach B's match data never
touches the frame SRAM (operands are stored in a dedicated register
file), so stage 3's only SRAM access is the constant read. This
reduces stage 3 vs stage 5 SRAM contention.

**Approach A (set-associative tags in SRAM, minimal chips):**

```
                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)        1     1     0     1     0     3
monadic mode 0 (dest only)       1     1     0     1     1     4
monadic mode 6 (sink)            1     1     0     1     1     4
monadic mode 1 (const+dest)      1     1     1     1     1     5
monadic mode 7 (RMW)             1     1     1     1     1     5
dyadic miss                      1     1     2     --    --    4
dyadic hit, mode 0               1     1     2     1     1     6
dyadic hit, mode 1               1     1     3     1     1     7
dyadic hit, mode 3 (fan+const)   1     1     3     1     2     8
```

Approach A adds 1 extra SRAM cycle per dyadic operation (tag word
read + associative compare) because act_id resolution is not
combinational.
1272
1273### Pipeline Overlap Analysis
1274
1275With single-port frame SRAM at 5 MHz, the pipeline controller must
1276arbitrate between stage 3 and stage 5. When both need SRAM in the
1277same cycle, stage 3 stalls.
1278
1279**Approach B, two consecutive dyadic-hit mode 1 tokens:**
1280
1281```
1282cycle 0: A.stg1
1283cycle 1: A.stg2 (IRAM)
1284cycle 2: A.stg3 match (reg file) -- frame SRAM FREE
1285cycle 3: A.stg3 const (SRAM)
1286cycle 4: A.stg4 (ALU) -- frame SRAM FREE
1287cycle 5: A.stg5 dest (SRAM) B.stg3 match (reg file) -- NO CONFLICT
1288cycle 6: (A done) B.stg3 const (SRAM)
1289cycle 7: B.stg4 (ALU)
1290cycle 8: B.stg5 dest (SRAM) -- NO CONFLICT
1291```
1292
1293Token spacing: 4 cycles. Approach A under the same conditions: ~6-7
1294cycles due to additional SRAM contention in stage 3.
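The "NO CONFLICT" annotations in the trace can be checked mechanically. A
minimal sketch, transcribing the frame-SRAM cycle assignments from the trace
above (labels like `A.const` are shorthand, not design identifiers):

```python
# Frame-SRAM cycle assignments transcribed from the trace above.
# In Approach B, stage 3 constant reads and stage 5 destination writes
# are the only frame-SRAM accesses; a single-port SRAM requires each to
# land on a distinct cycle.
sram_cycles = {
    "A.const": 3,  # A.stg3 const read
    "A.dest":  5,  # A.stg5 dest write
    "B.const": 6,  # B.stg3 const read
    "B.dest":  8,  # B.stg5 dest write
}
# no two accesses share a cycle -> no stage 3 vs stage 5 conflict
assert len(set(sram_cycles.values())) == len(sram_cycles)
```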
1295
1296### Throughput Summary
1297
1298Per PE, at 5 MHz, single-port frame SRAM:
1299
1300| Instruction mix profile | Approach A | Approach C | Approach B |
1301|------------------------|------------|------------|------------|
1302| Monadic-heavy (mode 0/4/6) | ~1.25 MIPS | ~1.67 MIPS | ~1.67 MIPS |
1303| Mixed (40% dyadic mode 1, 30% monadic, 30% misc) | ~833 KIPS | ~1.25 MIPS | ~1.25 MIPS |
1304| Dyadic-heavy with constants | ~714 KIPS | ~1.00 MIPS | ~1.00 MIPS |
1305| Worst case (mode 3, const+fanout) | ~625 KIPS | ~714 KIPS | ~714 KIPS |
1306
13074-PE system: multiply by 4. Realistic mixed workload: ~3.3-5.0 MIPS
1308(A), ~5.0-6.7 MIPS (C), or ~5.0-6.7 MIPS (B). For reference: the
1309original Amamiya DFM prototype (TTL, 1982) achieved 1.8 MIPS per PE.
1310EM-4 prototype (VLSI gate array, 1990) achieved 12.5 MIPS per PE.
1311This design sits between the two, closer to the DFM, which is
1312historically appropriate for a discrete TTL build.
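The table entries follow from dividing the 5 MHz clock by the average cycles
per token. A quick arithmetic sketch; the cycle averages below are the
approximate figures implied by the table, not measured values:

```python
# MIPS = clock / cycles-per-token, at the 5 MHz discrete-build clock.
CLOCK_HZ = 5_000_000

def mips(cycles_per_token):
    return CLOCK_HZ / cycles_per_token / 1e6

assert abs(mips(4) - 1.25) < 0.01    # monadic-heavy, Approach A
assert abs(mips(3) - 1.67) < 0.01    # monadic-heavy, Approaches B/C
assert abs(mips(6) - 0.833) < 0.001  # mixed profile, Approach A (~833 KIPS)
assert abs(mips(8) - 0.625) < 0.001  # worst case, Approach A (~625 KIPS)
```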
1313
1314### Pipeline Timing by Era
1315
1316With the 670-based matching subsystem (Approach C), act_id
1317resolution and presence/port checking are combinational (~35-70 ns)
1318**regardless of era**. These never become the timing bottleneck.
1319
1320The era-dependent part is **SRAM access time** for frame reads and
1321writes. This determines how many SRAM operations fit per clock cycle
1322and thus how much stage 3 vs stage 5 contention exists.
1323
1324**1979-1983 (5 MHz, 55 ns SRAM):**
1325
1326```
1327670 metadata: combinational (~35-70 ns), well within 200 ns cycle
1328Frame SRAM: one access per 200 ns cycle (55 ns access + setup/hold margin)
1329Bottleneck: frame SRAM single-port, stage 3 vs stage 5 contention
1330SC block throughput: ~1 instruction per clock (670 dual-port)
1331Overall token throughput: ~1 token per 3-5 clocks (pipelined, mode-dependent)
1332```
1333
1334**1984-1990 (5-10 MHz, dual-port SRAM):**
1335
1336```
1337670 metadata: combinational (unchanged)
1338Frame SRAM: dual-port (IDT7132 or similar), port A for stage 3, port B for stage 5
1339Bottleneck: eliminated -- both stages access SRAM simultaneously
1340SC block throughput: ~1 instruction per clock
1341Overall token throughput: approaches 1 token per 3 clocks for most modes
1342```
1343
1344Dual-port SRAM eliminates the primary stall source. The pipeline
1345becomes instruction-latency-limited rather than SRAM-contention-limited.
1346
1347**Modern parts (5 MHz clock, 15 ns SRAM):**
1348
1349```
1350670 metadata: combinational (unchanged)
1351Frame SRAM: 15 ns access, ~13 accesses fit in 200 ns cycle
1352Practical: 2-3 sub-cycle accesses via time-division multiplexing
1353Bottleneck: none -- frame SRAM has excess bandwidth
1354Token throughput: 1 token per 3 clocks (pipeline-stage-limited, not SRAM-limited)
1355```
1356
With 15 ns AS7C256B-15 SRAMs (DIP package, currently available at
~$3), two sub-cycle accesses fit within a 200 ns clock period. This
achieves TDM-like parallelism without additional MUX logic,
effectively giving the pipeline a dual-port view of a single-port chip.
1361
1362**Integrated (on-chip SRAM, sub-ns access):**
1363
1364```
1365670 equivalent: on-chip multi-ported register file, ~200 transistors
1366Frame SRAM: on-chip, sub-cycle access trivially
1367Token throughput: 1 per 3 clocks, potentially faster with deeper pipelining
1368```
1369
1370## 12. SC Blocks and Execution Modes
1371
1372### PE-to-PE Pipelining
1373
1374When multiple PEs are chained for software-pipelined loops, the per-PE
1375pipeline throughput determines the overall chain throughput.
1376
1377With the pipelined design (1 token per 3-5 clocks depending on
1378instruction mix and era), the inter-PE hop cost becomes the critical
1379path for chained execution:
1380
1381| Interconnect | Hop latency | Viable? |
1382|-------------|-------------|---------|
1383| Shared bus (discrete build) | 5-8 cycles | Marginal -- chain overhead dominates |
1384| Dedicated FIFO between adjacent PEs | 2-3 cycles | Worthwhile for tight loops |
1385| On-chip wide parallel link (integrated) | 1-2 cycles | Competitive with intra-PE SC block |
1386
1387For the discrete v0 build, dedicated inter-PE FIFOs (bypassing the
1388shared bus) would enable PE chaining at reasonable cost. This is a
1389low-chip-count addition (~2-4 chips per PE pair) that unlocks
1390software-pipelined loop execution.
1391
1392**Loopback bypass.** When a PE emits a token destined for itself
1393(common in iterative computations), the token can be looped back
1394internally without traversing the bus at all. See
1395`bus-interconnect-design.md` for the loopback bypass design, which
1396eliminates the bus hop latency entirely for self-targeted tokens.
1397
1398### The Execution Mode Spectrum
1399
1400The pipelined PE with frame-based storage, SC blocks, and predicate
1401register supports a spectrum of execution modes, selectable by the
1402compiler per-region:
1403
1404| Mode | Pipeline behaviour | Throughput | When to use |
1405|------|-------------------|-----------|-------------|
1406| Pure dataflow | Token -> ifetch -> match/frame -> exec -> output | 1 token / 3-7 clocks (mode-dependent) | Parallel regions, independent ops |
1407| SC block (register) | Sequential IRAM fetch, 670 register file | ~1 instr / clock | Short sequential regions |
1408| SC block + predicate | As above, with conditional skip/branch via predicate bits | ~1 instr / clock | Conditional sequential regions |
1409| PE chain (software pipeline) | Tokens flow PE0->PE1->PE2, each PE handles one stage | 1 iteration / PE-pipeline-depth clocks | Loop bodies across PEs |
1410| SM-mediated sequential | Tokens to/from SM for memory-intensive work | SM-bandwidth-limited | Array/structure traversal |
1411
1412The compiler partitions the program graph and selects the best mode
1413for each region. This spectrum is arguably more expressive than what a
1414modern OoO core offers (which has exactly one mode: "pretend to be
1415sequential, discover parallelism at runtime").
1416
1417## 13. Instruction Residency and Code Loading
1418
1419### Why This Matters
1420
1421Unlike Manchester, Amamiya, or Monsoon -- which either replicated the
1422entire program into every PE's instruction memory or used very large
1423per-PE instruction stores -- this design has **small IRAM per bank**
1424(256 entries) with runtime-writable instruction memory. Without bank
1425switching, any program larger than a single PE's IRAM needs code loading
1426at runtime, even under fully static PE assignment.
1427
1428**With bank switching** (see section 14), each PE
1429holds up to 4096 instructions across 16 banks using the same SRAM chips.
1430This substantially reduces the pressure on runtime code loading -- most
1431programs' full working set fits in the preloaded banks, and switching
1432between function fragments costs a single register write instead of
1433IRAM rewrite traffic. The code storage hierarchy and loader mechanisms
1434below remain relevant for programs that exceed the banked capacity, but
1435bank switching makes that the exception rather than the rule.
1436
1437The 16-bit single-half instruction format provides good IRAM density:
1438one instruction per SRAM address. The effective capacity with bank
1439switching (4096 instructions) is substantial for the target workloads,
1440using only a single SRAM chip pair per PE.
1441
1442The reference architectures largely avoid the residency problem by
1443throwing memory at it: Amamiya's 8KW/PE replicated instruction memory,
1444Manchester's large instruction store, Monsoon's 64K-instruction frames.
1445Bank switching gives us a comparable effective capacity (4K instructions)
1446with much less hardware than full replication.
1447
1448### Proactive Loading (Primary Mechanism)
1449
The primary approach is **software-managed prefetch**: the compiler assigns a PE (typically the least-utilized one) to pull instruction pages from storage and load them onto the bus before they are needed. This is part of the program graph itself -- the loader PE calls the `exec` SM instruction, which reads out pre-constructed tokens onto the bus.
1451
1452This fits naturally into the dataflow paradigm:
1453- The loader PE is just another participant in the token network
1454- Its "inputs" are load requests (tokens from other PEs or the scheduler)
1455- Its "outputs" are config write packets that load IRAM
1456- The compiler can schedule prefetches to overlap with computation on other PEs
1457
1458The loader PE could be dedicated (always running loader code) or could itself have its code swapped depending on system phase.
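A toy model of the loader PE's role under the description above. The token
and packet field names (`fragment`, `config_write`, etc.) are illustrative
placeholders, not the real flit encoding:

```python
# The loader PE is just another dataflow participant: it consumes
# load-request tokens and emits config-write packets that fill the
# target PE's IRAM with pre-constructed instruction words.
def loader_step(request, code_store):
    """One loader activation: turn a load request into config writes."""
    page = code_store[request["fragment"]]  # pre-constructed tokens
    return [
        {"kind": "config_write", "pe": request["pe"],
         "offset": off, "instr": instr}
        for off, instr in enumerate(page)
    ]

code_store = {"fragA": [0x1234, 0x5678]}
out = loader_step({"fragment": "fragA", "pe": 2}, code_store)
assert len(out) == 2 and out[0]["offset"] == 0
```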
1459
1460### The Identity Problem: Miss Detection
1461
1462If code loading happens at runtime, the question arises: how does a PE know the code in its IRAM is the *right* code for an arriving token?
1463
1464A simple validity bitmap (like the matching store presence bit) is **not sufficient**. It can tell you "something is loaded at offset 7" but not "the right instruction is loaded at offset 7." If a different
function fragment has been loaded over a previous one, the IRAM slot is occupied by a valid-looking but wrong instruction. The token indexes directly into IRAM -- there is no tag comparison against the token.
1466
1467Several detection mechanisms are possible:
1468
1469**Option A: Fragment ID register.**
1470Each PE has a small tag register (or set of registers, one per IRAM page/region) that records which function fragment is currently loaded. Set by config writes during loading. Incoming tokens carry (or the system derives from the token's address) the expected fragment ID. The PE compares the token's expected fragment against the loaded fragment register:
1471- Match -> proceed normally
1472- Mismatch -> miss, trigger fetch
1473- Hardware cost: one register + comparator per PE (or per IRAM region)
1474- Requires fragment ID bits in the token or a derivation mechanism
1475- Coarse-grained: one tag per PE or per page, not per instruction
1476
1477**Option B: Entry gate instruction.**
1478The compiler inserts a special instruction at each function body's entry point that verifies identity: "am I the function this activation expects?" Tokens arriving at non-entry instructions are assumed correct because they could only have reached that point by passing through a verified entry gate.
1479- No per-instruction tags needed
1480- Software-managed, compiler responsibility
1481- Detection granularity is per-function-body, not per-instruction
1482- In dataflow terms: the entry gate is a dyadic instruction whose left input is the activation token and whose right input is a "function loaded" token. If the function isn't loaded, the gate blocks (no match) until loading completes and a "loaded" confirmation token arrives.
1483
1484**Option C: Software-only invariant.**
1485No hardware miss detection. The loader protocol guarantees correctness: code is never overwritten while tokens are in flight targeting it. The throttle + drain approach (stall new activations, let existing ones complete, then overwrite IRAM) ensures the invariant.
1486- Simplest hardware -- no detection circuitry at all
1487- Most compiler/loader burden
1488- Relies on correct coordination; bugs cause silent wrong execution
1489- Viable for v0 where programs are small and manually verified
1490
1491These options are not mutually exclusive. v0 can start with Option C (software guarantee) and add hardware detection (A or B) later as programs grow beyond what manual verification can cover.
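A behavioural sketch of Option A, assuming one tag register per IRAM page.
The class and method names are hypothetical, chosen only to mirror the
register + comparator structure described above:

```python
# Option A miss detection: one fragment-ID tag register per IRAM page,
# written by config writes during loading, compared against the
# fragment an incoming token expects.
class FragmentTag:
    def __init__(self, pages):
        self.loaded = [None] * pages       # fragment ID per IRAM page

    def config_write(self, page, frag_id):
        self.loaded[page] = frag_id        # set by loader config writes

    def check(self, page, expected_frag):
        # comparator output: True -> proceed, False -> miss, fetch
        return self.loaded[page] == expected_frag

tags = FragmentTag(pages=4)
tags.config_write(page=0, frag_id=17)
assert tags.check(0, 17)        # match -> proceed normally
assert not tags.check(0, 23)    # mismatch -> miss, trigger fetch
```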
1492
1493### Miss Handling
1494
1495When a miss is detected (by whatever mechanism), the PE needs to handle a token that targets unloaded code. Two approaches:
1496
1497**Stall + fetch request:** PE emits an `exec` token and stalls its input FIFO until the instruction arrives via config write. Simple, deterministic, but blocks all traffic to that PE during the fetch. Acceptable if misses are rare (proactive loading handles most cases) and fetch latency is bounded.
1498
1499**Recirculate + fetch request:** PE emits a fetch-request token, puts the missed token back at the tail of its own input FIFO, and continues processing other tokens. The missed token retries later, hopefully after the instruction has been loaded. More complex but keeps the PE productive. Requires care to avoid FIFO fill-up with recirculated tokens.
1500
1501v0 may not implement either; starting with the software-only invariant (Option C above) means misses don't happen by construction. Hardware miss handling is an evolutionary step as programs outgrow what static loading can guarantee.
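A software sketch of the recirculate policy, modelling the input FIFO as a
`collections.deque`. The fetch is modelled as completing before the retry,
which is optimistic; the retry budget stands in for the FIFO fill-up concern
noted above. All names are illustrative:

```python
from collections import deque

def drain(fifo, iram_loaded, max_retries=8):
    """Process tokens; recirculate misses until their code is loaded."""
    retries = {}
    done = []
    while fifo:
        tok = fifo.popleft()
        if tok["frag"] in iram_loaded:
            done.append(tok)              # hit: normal execution
            continue
        retries[tok["id"]] = retries.get(tok["id"], 0) + 1
        if retries[tok["id"]] > max_retries:
            raise RuntimeError("fetch never completed")  # fill-up guard
        iram_loaded.add(tok["frag"])      # model: fetch request completes
        fifo.append(tok)                  # missed token retries at tail

    return done

fifo = deque([{"id": 1, "frag": "f0"}, {"id": 2, "frag": "f1"}])
assert [t["id"] for t in drain(fifo, iram_loaded={"f0"})] == [1, 2]
```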
1502
1503## 14. Future: Bank Switching with 74LS610
1504
1505v0 supports 256 instructions per PE (8-bit offset) with simple address
1506decode. When programs exceed 256 instructions, the 74LS610 memory mapper
1507enables bank switching: 16 banks x 256 instructions = 4096 entries per
1508PE without changing SRAM chips.
1509
1510### What the 610 Is and What It Enables
1511
1512The 74LS610 (TI memory mapper, originally for TMS9900 family) is an
1513ideal fit for IRAM bank switching. Key properties:
1514
1515- 16 mapping registers, each 12 bits wide
1516- 4-bit logical address input selects register -> 12-bit physical address output
1517- **Latch control** (pin 28): outputs can be frozen while register contents change
1518- ~40-50ns propagation delay (LS family), pipelineable with SRAM access
1519- One chip per PE. Writes to mapping registers via data bus during config/bootstrap.
1520
1521The '610 is planned for both IRAM and SM banking (both are future
1522upgrades, not present in v0). Using it for IRAM banking is the same
1523chip, same wiring pattern, different address domain. One '610 per PE
1524for IRAM, one per SM.
1525
1526### The Socket Strategy
1527
1528The v0 board pre-wires SRAM address lines to a '610 socket with a
1529jumper wire in place of the chip. When bank switching is needed, the
1530'610 drops in with no board changes.
1531
1532### Address Space with Bank Switching
1533
1534With the '610 installed, the IRAM address becomes:
1535
1536```
1537Logical: [bank_select:4][offset:8]
1538 |
1539 v (74LS610)
1540Physical: [phys_bank:12][offset:8] = up to 20-bit SRAM address
1541```
1542
In practice the physical address width is bounded by available SRAM chip
capacity. With 8Kx8 SRAMs (13-bit address), the '610's 12-bit output is
wider than needed: 5 bits of physical bank + 8-bit offset = 13 bits.
This gives 32 physical banks of 256 instructions each (8192
instructions per PE, though address space constraints may limit this
further).
1549
1550The SRAM address map with bank switching:
1551
1552```
1553 IRAM region: [0][bank:4][offset:8] bank-switched templates
1554 bank from '610 mapper
1555 capacity: 16 banks x 256 instructions = 4096 entries
1556
1557 Frame region: [1][frame_id:2][slot:6] (unchanged)
1558```
1559
1560### Banking Workflow: MAP_PAGE and SET_PAGE Instructions
1561
1562Two instructions manage banking. Neither touches the token format.
1563
1564- **`map_page`** (monadic): writes a logical-to-physical mapping into one of the '610's 16 mapping registers. The register index and physical bank address come from frame constants. Used during bootstrap or runtime to establish which physical SRAM regions back which logical pages.
1565
1566- **`set_page`** (monadic): writes a 4-bit logical page selector into a PE-local latch. The latch feeds the '610's MA0-MA3 inputs. All subsequent IRAM fetches go through the selected logical page's mapped physical bank. One cycle to switch.
1567
1568```
1569Banking workflow:
1570 1. Bootstrap: MAP_PAGE instructions establish mappings
1571 (logical page 0 -> physical region A, page 1 -> region B, etc.)
1572 2. Runtime: SET_PAGE selects the active logical page
1573 3. Latch -> '610 MA0-MA3 -> physical SRAM bank selection
1574 4. All IRAM reads now address the selected bank
1575```
1576
1577Hardware cost: one 74LS175 (quad D flip-flop) as the page latch + the '610 itself.
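A behavioural model of the banking path, assuming the 16 x 12-bit mapping
registers and 4-bit page latch described above. `map_page`/`set_page` mirror
the two instructions; the Python class itself is only a sketch:

```python
# '610-based IRAM banking: 16 mapping registers (12 bits each), a 4-bit
# page latch ('LS175) feeding MA0-MA3, physical address = mapped bank
# concatenated with the 8-bit instruction offset.
class Ls610:
    def __init__(self):
        self.regs = [0] * 16       # 16 mapping registers, 12 bits each
        self.page = 0              # page latch -> '610 MA0-MA3

    def map_page(self, logical_page, phys_bank):
        self.regs[logical_page] = phys_bank & 0xFFF   # MAP_PAGE

    def set_page(self, logical_page):
        self.page = logical_page & 0xF                # SET_PAGE, 1 cycle

    def iram_addr(self, offset):
        # physical address = [mapped bank : offset(8)]
        return (self.regs[self.page] << 8) | (offset & 0xFF)

m = Ls610()
m.map_page(0, 0x02)        # logical page 0 -> physical bank 2
m.set_page(0)
assert m.iram_addr(0x15) == 0x215
```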
1578
1579### Trade-offs and Costs
1580
1581- Bank switch affects all in-flight tokens targeting this PE at offsets in the old bank. The compiler (or scheduler) must drain tokens for the old bank before switching -- same throttle-and-drain protocol as code overwrite, but switching is instantaneous once drained (write latch, done).
1582- `set_page` is sequentially scoped: it affects all subsequent fetches, not just one activation. The compiler must ensure that concurrent activations on the same PE agree on the active page, or use `set_page` as a barrier between phases.
1583- Total capacity per PE is bounded by SRAM chip size, not the '610 (which can address far more than any reasonable IRAM).
1584- Pages are a pure address-mapping primitive. The compiler decides what they mean -- per-function, per-phase, or any other grouping. The hardware doesn't enforce or assume any relationship between pages and function bodies.
1585
1586## 15. Dynamic Scheduling: Future Capability
1587
1588The architecture is **policy-agnostic** on whether PE assignment is fully static (compiler decides everything) or partially dynamic (a scheduler places activations at runtime). The mechanism (tokens carry destination PE + activation_id, PEs have writable IRAM, frames are allocated and addressed by act_id) supports either policy.
1589
1590### Static Assignment (v0)
1591
The compiler decides everything at compile time: each PE gets specific function fragments loaded at bootstrap, with no runtime decisions about placement. This is the simplest policy -- no scheduler hardware or firmware needed. For programs that exceed IRAM capacity, the compiler schedules `exec` instructions or similar.
1593
1594### Dynamic Scheduling (future)
1595
1596A CCU-like scheduler (could be firmware on a dedicated PE, a small fixed-function unit, or distributed logic) decides at runtime where to place new activations, based on PE load, IRAM contents, etc.
1597
The tension: dynamic scheduling wants **wide IRAM** (so the target PE already has the function body loaded), while cheap PEs want **narrow IRAM**. Amamiya resolved this by replicating the entire program into every PE's IRAM; that works, but at a substantial memory cost.
1599
1600The middle ground is a **working set model**: keep hot function bodies loaded, swap cold ones via PE-local write tokens (prefix 011+01) when the scheduler wants to place an activation on a PE that doesn't have the code yet.
1601
- **miss latency**: significant (network round-trip to load code from SM or external storage) -- much worse than Amamiya's "already there."
- **miss rate**: depends on scheduler affinity policy. If the scheduler prefers placing activations on PEs that already have the code, misses should be rare. A small "IRAM directory" (which PE has which function body loaded) lets the scheduler make this decision cheaply.
- **coordination**: drain in-flight tokens for the old fragment before overwriting IRAM. Throttle stalls new activations for that fragment, existing activations complete, then the loader overwrites -- a coarse-grained context switch.
1605
1606## 16. Open Design Questions
1607
16081. **Approach selection for v0.** Approach C (670 lookup) is
1609 recommended as the starting point: combinational metadata at ~8
1610 chips. Approach B (register-file match pool) eliminates the last
1611 SRAM cycle from matching at the cost of ~16-18 chips. Approach A
1612 (SRAM tags) is the fallback if 670 supply is a problem. The
1613 choice depends on whether chip count or pipeline throughput is
1614 the binding constraint for the initial build. See section 11 for
1615 the full approach comparison and cycle counts.
1616
16172. **Frame SRAM contention under realistic workloads.** The pipeline
1618 stall analysis in section 11 uses worst-case consecutive tokens.
1619 Simulate representative dataflow programs in the behavioural
1620 emulator to measure actual stage 3 vs stage 5 contention rates
1621 and determine whether dual-port SRAM or faster SRAM is justified
1622 for v0.
1623
16243. **SC block register capacity.** With 4-6 registers available from
1625 repurposed 670s (depending on how many are shared), what is the
1626 longest SC block the compiler can generate before register
1627 pressure forces a spill? Evaluate empirically on target workloads.
1628
16294. **Predicate register encoding.** Document specific instruction
1630 encodings for predicate test/set/clear, and how SWITCH
1631 instructions interact with predicate bits. The predicate register
1632 may subsume some of the cancel-bit functionality planned for
1633 token format.
1634
16355. **Mode switch latency measurement.** Build a cycle-accurate model
1636 of the save-to-SRAM / restore-from-SRAM path and determine exact
1637 overhead. Target: <=10 cycles per transition.
1638
16396. **Assembler stall analysis.** The assembler can statically detect
1640 instruction pairs whose output tokens may cause frame SRAM
1641 contention on the same PE. For hot loops, the assembler can
1642 insert mode 4 NOP tokens (zero frame access) as pipeline padding.
1643 Validate static stall estimates against emulator simulation, since
1644 runtime arrival timing depends on network latency and SM response
1645 times.
1646
16477. **8-offset matchable constraint validation.** The 670-based
1648 presence metadata limits dyadic instructions to offsets 0-7 per
1649 frame. Evaluate whether this is sufficient for compiled programs.
1650 If tight, the hybrid upgrade path (offset[3]=0 checks 670s,
1651 offset[3]=1 falls back to SRAM tags) adds ~4-6 chips of SRAM tag
1652 logic for offsets 8-15+.
1653
16548. **Exact opcode assignments**: 5-bit opcode space is sufficient (CM
1655 and SM independent). Need to assign FREE_FRAME, ALLOC_REMOTE, and
1656 verify that existing ALU operations fit with the revised mode
1657 semantics.
1658
16599. **SC arc execution details**: the frame model supports
1660 strongly-connected arc execution (latch frame_id across sequential
1661 blocks); the pipeline sequencing and block-entry detection logic
1662 need design work. Deferred past v0 but should not be precluded by
1663 any v0 decisions.
1664
166510. **IRAM bank switching interaction with frames**: switching IRAM
1666 banks changes the instruction templates but not the frame contents.
1667 Tokens in flight targeting the old bank's instructions will execute
1668 against new instructions after the switch. The drain-before-switch
1669 protocol applies unchanged.
1670
167111. **Frame slot count (fref width)**: 6-bit fref = 64 slots is the
1672 current proposal. Real compiled programs may show that 32 suffices
1673 (freeing 1 bit for other uses) or that 64 is tight (requiring
1674 creative aliasing or more aggressive function splitting).
1675
167612. **Function splitting heuristics**: how does the compiler decide
1677 where to split? Minimize cross-PE traffic? Balance frame usage
1678 across PEs? Hardware constraints (frame count, matchable offset
1679 count) drive it.
1680
168113. **Instruction identity detection**: how does the PE know loaded
1682 code matches what an arriving token expects? Fragment ID register
1683 vs entry gate instruction vs software-only guarantee. See
1684 Instruction Residency section. v0 starts with software-only
1685 invariant (Option C).
1686
168714. **Miss handling mechanism**: stall + fetch request vs recirculate
1688 + fetch request. v0 may not implement either, relying on the
1689 software-only invariant.
1690
169115. **SM flit 1 bit alignment:** the frame slot packing for SM targets
1692 (`[SM_id:2][addr][spare]`) must align with the SM bus flit 1 format
1693 so that the output stage can concatenate frame bits + EEPROM opcode
1694 without field rearrangement. The exact spare bit positions depend on
1695 the final SM bus encoding; verify alignment after `sm-design.md`
1696 opcode assignments are frozen.
1697
169816. **PE-local write slot field width:** the current flit 1 format packs
1699 5 bits of slot index into the PE-local write token. With 64 frame
1700 slots (6-bit fref), one bit is missing. Options: (a) limit PE-local
1701 writes to slots 0-31, (b) steal a spare bit from act_id or elsewhere,
1702 (c) use flit 2 for extended addressing.
1703
170417. **Indexed SM ops: address overflow.** ALU base + index computation
1705 may overflow the 10-bit (tier 1) or 8-bit (tier 2) address range.
1706 The PE does not check for overflow -- the SM receives the truncated
1707 address. The assembler should warn on statically-detectable overflow
1708 risks. Runtime overflow detection is an SM-side concern.
1709
171018. **EXTRACT_TAG offset source.** The return offset in the packed tag
1711 could come from (a) a frame constant (flexible, costs 1 frame slot),
1712 (b) a fixed hardware-derived value (e.g. current instruction
1713 offset + 1), or (c) an IRAM-encoded small immediate (not available
1714 in the current 16-bit format). Option (a) is consistent with the
1715 frame-everything philosophy; option (b) saves a slot but limits
1716 return point placement.
1717
1718## 17. References
1719
1720- `architecture-overview.md` - Token format, flit-1 bit allocation, module taxonomy, bus framing protocol, function call design overview.
1721- `alu-and-output-design.md` - ALU operation details, output routing modes, flit 2 source mux, and SM flit assembly.
1722- `sm-design.md` - SM opcode table, extended addressing, CAS handling, I-structure semantics.
1723- `bus-interconnect-design.md` - Physical bus implementation: shared and split AN/CN/DN topologies, node interfaces, arbitration, loopback, backpressure, chip counts.
1724- `network-and-communication.md` - Routing topology, clocking discipline.
1725- `io-and-bootstrap.md` - Bootstrap loading, I/O subsystem design.
1726- `sram-availability.md` - Component availability for period-appropriate SRAMs.
1727- `17407_17358.pdf` - DFM evaluation: OM structure (1024 CAM blocks, 32 words each, 8 entries of 4 words, 4-way set-associative within entry). Function activation via CCU requesting least-loaded PE, then getting instance name from target PE's free instance table. IM is 8KW/PE, identical across all PEs. Critical for understanding why Amamiya's OM is so large and why ours can be much smaller.
1728- `gurd1985.pdf` - Manchester matching unit: 16 parallel hash banks, 64K tokens each, 54-bit comparators, 180ns clock period. Overflow unit emulated in software. Shows the cost of general-purpose matching.
1729- `Dataflow_Machine_Architecture.pdf` - Veen survey: matching store analysis, tag space management, overflow handling across multiple architectures.
1730- `amamiya1982.pdf` - Original DFM paper: semi-CAM concept, IM/OM split, execution control mechanism with associative IM fetch. Partial function body execution (begin executing when first argument arrives, don't wait for all arguments).
1731- EM-4 prototype papers - Direct matching, strongly-connected blocks, register-based advanced control pipeline. Informs SC arc upgrade path.
1732- Iannucci (1988) - Frame-based matching, continuation model, suspension semantics. Historical precedent for per-activation frame storage.
1733- Monsoon / TTDA papers - Explicit token store, frame-based execution, I-structure semantics.