# EM-4 Architecture Analysis

Deep-dive reference on the EM-4 (Electrotechnical Laboratory, Japan) and
its single-chip processor EMC-R, based on the ISCA '89 and IPPS '91
papers. Captures architectural details for reference even where our
design diverges.

Source papers:

- Sakai, Yamaguchi, Hiraki, Kodama, Yuba. "An Architecture of a Dataflow
  Single Chip Processor." Proc. ISCA '89, pp. 46–53, 1989.
- Sakai, Kodama, Yamaguchi. "Prototype Implementation of a Highly
  Parallel Dataflow Machine EM-4." Proc. IPPS '91, pp. 278–286, 1991.

See `Prior_Art_Reference_Guide_for_a_Discrete-Logic_Dynamic_Dataflow_CPU.md`
for full bibliography including the network paper (Parallel Computing
1993), synchronisation paper (IEICE 1991), successor chip EMC-Y, and
EM-X system papers.

---

## 1. Project Context and Design Objectives

The EM-4 is a successor to SIGMA-1 (128-PE dataflow supercomputer,
ETL, operational 1988, >100 MFLOPS). SIGMA-1 PEs were multi-chip
(several gate arrays + large memory); direct scaling to 1000+ PEs was
impractical due to hardware volume and architectural complexity.

EM-4 design objectives:

1. 1000+ PE machine for general use (numerical, symbolic, simulation)
2. Single-chip PE (the EMC-R) to make scaling feasible
3. O(N) interconnection network (vs O(N log N) for multi-stage networks)
4. Improve dataflow execution efficiency via a new computation model

The 80-PE prototype was the first step toward a 1024-PE target. The
EMC-R chip was designed with a 1024-PE global address space even though
the prototype only used 80 PEs.

---

## 2. Identified Defects of Prior Dataflow Architectures

The EM-4 papers explicitly list six defects of "conventional" (i.e.
Manchester-style) dataflow machines. These are worth documenting because
several of them drove design choices in our architecture too.

**D1. Circular pipeline underutilisation at low parallelism.** When fewer
tokens are in flight than N×S (N = PEs, S = pipeline stages), the
circular pipeline has bubbles. A single token going round the loop
achieves throughput < 1 instruction per pipeline circulation time. No
advanced control mechanism exists to fill the gap.

**D2. Packet-based architecture cannot exploit registers.** If every
intermediate result becomes a packet and re-enters the PE through the
input queue, there is no way to keep frequently-used values in fast
local storage. This also prevents fine-grained pipelining within a
single computation thread.

**D3. High matching overhead (colored token style).** Associative memory
or hashing for color matching requires complex control logic and
significant time per match operation.

**D4. Excessive packet traffic.** Every inter-node communication is a
packet, even between nodes that always execute on the same PE. The
network becomes the bottleneck.

**D5. No resource management primitives.** Mutual exclusion (test-and-set,
compare-and-swap) requires serialisation, which is difficult in a pure
dataflow model where any node can fire at any time.

**D6. Garbage token cleanup overhead.** Conditional branches (switches)
produce garbage tokens on the not-taken path. Collecting these at
program end is expensive.

### Relevance to Our Design

- D1: partially addressed by our monadic bypass (skip matching for
  single-input ops). Strongly connected blocks (deferred) would address
  it more completely.
- D2: we currently have no register file. The matching store SRAM serves
  as temporary operand storage. SM serves as shared scratch. A register
  file is deferred pending strongly connected block implementation.
- D3: solved — our direct-indexed matching with wire concatenation
  addressing has zero associative lookup overhead, same as the EM-4's
  direct matching.
- D4: partially addressed by static PE assignment (compiler places
  communicating nodes on the same PE to avoid network traffic). Strongly
  connected blocks would further reduce intra-PE packet overhead.
- D5: SM with RMW operations provides atomic primitives. I-structure
  semantics (deferred) will add synchronising memory.
- D6: generation counters handle stale token detection. The cancel bit
  approach (EM-4's C field in the data part) is noted as a potential
  addition. See open items.

---

## 3. Strongly Connected Arc Model

The EM-4's most significant contribution. Arcs in the dataflow graph
are classified into two types:

- **Normal arcs**: standard dataflow — token matching, packet formation,
  full pipeline traversal.
- **Strongly connected arcs**: local to a PE — sequential register-based
  execution without packet formation or matching.

A **strongly connected block** is a subgraph whose internal arcs are all
strongly connected. The execution rule: once any node in a strongly
connected block fires, the entire block executes to completion on that
PE, exclusively. No other block can interrupt or interleave with it on
that PE during execution.
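
As a minimal illustration of this run-to-completion rule, here is a
behavioural Python sketch (our own toy model, not code from the papers):
tokens arrive through a queue, but once a token enters a block, the whole
block runs on registers before the next token is dequeued.

```python
# Toy model of the EM-4 firing discipline (illustrative only).
# A PE accepts one token at a time; if that token enters a strongly
# connected block, the entire block executes sequentially via a register
# before the PE looks at its input queue again.

from collections import deque

def run_pe(input_tokens, blocks):
    """input_tokens: iterable of (block_id, value) pairs.
    blocks: block_id -> list of unary ops executed sequentially.
    Returns the trace of executed (block_id, op_index) pairs."""
    trace = []
    queue = deque(input_tokens)
    while queue:
        block_id, value = queue.popleft()
        reg = value                      # register stands in for packets
        for i, op in enumerate(blocks[block_id]):
            reg = op(reg)                # no packet formation in the block
            trace.append((block_id, i))  # the block is never interleaved
    return trace

# Two 3-op blocks: their executions never interleave in the trace.
blocks = {"f": [lambda x: x + 1] * 3, "g": [lambda x: x * 2] * 3}
trace = run_pe([("f", 0), ("g", 1)], blocks)
```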

### Hardware Mechanism

The EMC-R instruction format contains:

- **M (mode) field** (1 bit): if zero, the current strongly connected
  block ends with the next instruction.
- **NF (next flag)** (1 bit): controls continuation within the block.
- **R0, R1** (4 bits each): strongly connected register references for
  the next instruction's operands. These index into a 16-entry register
  file in the EXU.
- **OUT** (1 bit): whether this instruction generates an output packet
  (1) or stores its result in a register (0, with the result going to R2).

When executing within a strongly connected block:

1. Stage 3 (instruction fetch + decode) and stage 4 (execute) repeat
   in an overlapped loop.
2. Stage 4's execution overlaps with stage 3's fetch of the next
   instruction in the block.
3. Operands come from the register file (R0, R1) or from the matching
   store (first instruction only).
4. Results go to the register file (R2) or out as packets.
5. When the block ends (M=0), the last instruction's execution overlaps
   with stages 1-2 of the *next incoming packet*.

The register-based pipeline throughput is up to **6×** the packet-based
circular pipeline throughput.

### Performance Example: Fibonacci

The papers give a concrete comparison on a Fibonacci subgraph:

- Pure dataflow program: 23 clocks for the recursive-call subgraph
- Strongly connected program: 9 clocks for the same computation

The strongly connected version packs type checking, branching, subtraction,
and MKPKT operations into a single block that executes sequentially using
registers, rather than creating separate tokens for each intermediate.

### Implications for Compiler

The compiler must:

1. Identify valid strongly connected blocks (subgraphs that can execute
   sequentially without violating data dependencies).
2. Annotate instructions with M, NF, R0, R1, R2, OUT fields.
3. Construct blocks automatically (the papers mention a "block
   constructor" in their compiler).

This is non-trivial compiler work but is amenable to standard
optimisation techniques (instruction scheduling, register allocation
within blocks).

### Advantages Summarised (from the paper)

- A1: enables advanced control pipeline (deterministic execution within
  blocks)
- A2: register file for intra-block operands (no matching needed)
- A3: intra-block matching simplified (no color matching if block is
  within a single function instance)
- A4: no packet transfers within blocks (reduces network traffic)
- A5: indivisible instruction sequences enable resource management
  (test-and-set, etc.)
- A6: garbage token cleanup simplified (just reset register file flags)

---

## 4. Direct Matching Scheme

The EM-4's matching mechanism is conceptually very close to our design
and to Monsoon's Explicit Token Store.

### Mechanism

At function invocation, two memory regions are bound:

- **Operand segment**: storage for waiting/matching operands. One word per
  dyadic instruction in the function body.
- **Template segment**: compiled instruction codes for the function.

The address of an instruction in the template segment has a **1:1 simple
correspondence** with the matching location in the operand segment. No
hashing, no associative lookup.

Matching operation:

1. Read the operand segment at the address corresponding to this
   instruction.
2. If partner data is present: match succeeds, clear the presence flag,
   proceed to instruction fetch.
3. If partner data is absent: write incoming data, set presence flag,
   token consumed.

The read-modify-write is completed in a single clock cycle (read in the
first half-clock, write/eliminate in the second half-clock).
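
The three-step operation can be sketched as a behavioural Python model
(names and structure are ours, not the ETL implementation):

```python
# Behavioural sketch of EM-4 direct matching: the operand segment is
# plain RAM indexed by the instruction's offset, with one presence flag
# per word saying whether the partner operand is already waiting.

def match(operand_segment, present, offset, value):
    """Returns (fired, left, right). On a miss the token is stored and
    consumed; on a hit the flag is cleared and the instruction fires."""
    if present[offset]:                      # partner already waiting
        present[offset] = False              # eliminate (clear flag)
        return True, operand_segment[offset], value
    operand_segment[offset] = value          # first arrival: store, wait
    present[offset] = True
    return False, None, None

seg, flags = [0] * 8, [False] * 8
first = match(seg, flags, 3, 10)    # first operand arrives: stored
second = match(seg, flags, 3, 32)   # partner arrives: instruction fires
```

The point is that no associative search is involved: one indexed read,
one conditional write.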

### Comparison with Monsoon ETS

The 1991 paper explicitly compares with Monsoon:

> "Although the Explicit Token-Store scheme in the Monsoon Machine also
> uses the frame-based matching, the direct matching does not need an
> extra instruction field for data synchronization since a displacement
> of the matching place is in common with the instruction. This scheme
> realizes a much more efficient synchronization than Monsoon does and
> instructions and packets are fairly smaller, at the cost of memory
> space."

Key difference: EM-4 shares the address between instruction fetch and
operand matching (they're at the same offset in their respective
segments), so no additional "frame offset" field is needed in the
instruction or token. Monsoon needs a separate frame-pointer + slot-offset
in each token.

### Comparison with Our Design

Our matching store uses the same principle:
`SRAM_address = ctx_slot : match_entry`, with match_entry derived from
the token's addressing fields, giving 1:1 correspondence to instruction
addresses.
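
As a concrete illustration of the wire-concatenation addressing (the
field width below is a placeholder, not our finalised encoding):

```python
# Matching-store address by pure bit concatenation: no adder, no hash.
# ENTRY_BITS is an illustrative placeholder width.
ENTRY_BITS = 6

def matching_store_address(ctx_slot, match_entry):
    assert 0 <= match_entry < (1 << ENTRY_BITS)
    return (ctx_slot << ENTRY_BITS) | match_entry   # ctx_slot : match_entry

addr = matching_store_address(ctx_slot=5, match_entry=0b001011)
```

In hardware this is just wiring: the context-slot lines drive the high
SRAM address bits and the match-entry lines drive the low bits.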

The key differences:

| Aspect            | EM-4                                                         | Our design                                |
| ----------------- | ------------------------------------------------------------ | ----------------------------------------- |
| Segment binding   | Dynamic (at function invocation)                             | Static (compiler assigns at compile time) |
| Segment size      | Sized per function body                                      | Sized per PE (fixed N slots × M entries)  |
| Presence tracking | Flags in operand segment words                               | Separate occupied bitmap                  |
| Multi-activation  | Multiple operand segments per function                       | Context slots with generation counters    |
| Template fetch    | Extra pipeline stage (TNF) to look up which template segment | Not needed (IRAM content is fixed per PE) |

---

## 5. EMC-R Chip Architecture

### Block Diagram (5 functional units + maintenance)

```
              ┌──────────────┐
Network In ──►│  Switching   │──► Network Out (port A)
Network In ──►│  Unit (SU)   │──► Network Out (port B)
              └──────┬───────┘
                     │ to local PE
              ┌──────▼───────┐
              │ Input Buffer │
              │  Unit (IBU)  │
              └──────┬───────┘
              ┌──────▼───────┐     ┌──────────┐
              │  Fetch and   │◄───►│ Off-chip │
              │   Matching   │     │  Memory  │
              │  Unit (FMU)  │◄──┐ │ (≤5 MB)  │
              └──────┬───────┘   │ └──────────┘
              ┌──────▼───────┐   │      ▲
              │  Execution   │   │      │
              │  Unit (EXU)  │───┘ ┌────┴────┐
              └──────┬───────┘     │ Memory  │
                     │             │ Control │
                     ▼             │  Unit   │
               Packet output       └─────────┘
               (back to SU)
```

### Gate Counts and Pin Usage

| Unit | Gates | Pins | Notes |
|------|-------|------|-------|
| Switching Unit (SU) | 10,112 (1989) / 9,179 (1991) | 176 | 3×3 packet switch, three-bank buffers, PRC |
| Input Buffer Unit (IBU) | 9,238 (1989) / 9,295 (1991) | - | 32-word FIFO (dual-port RAM on chip) |
| Fetch and Matching Unit (FMU) | 3,504 (1989) / 3,610 (1991) | - | Direct matching, sequencing, pipeline control |
| Execution Unit (EXU) | 19,692 (1989) / 20,620 (1991) | - | ALU, multiplier, barrel shifter, register file, packet gen |
| Memory Control Unit (MCU) | 1,518 / 1,664 | 67 | Address/data mux, arbitration |
| Maintenance Circuits | 1,589 / 1,420 | 12 | Init, error handling, dynamic monitor |
| **Total** | **45,653 / 45,788** | **255** | 1.5 μm CMOS gate array, 299-pin package |

Notable: the FMU (matching + sequencing) is only ~3.5K gates — the
direct matching scheme keeps it small. The SU (network switch) is ~10K
gates because of the three-bank deadlock prevention buffers. The EXU
dominates at ~20K gates due to the multiplier, barrel shifter, and
16-register file.

### Chip Implementation

- 1.5 μm CMOS gate array
- Inverter gate delay: 0.7 ns
- Clock: 80 ns (12.5 MHz)
- Peak performance: 12.5 MIPS per PE
- Package: 299-pin ceramic (43 power/ground, 255 signal, 1 unused)
- Fabricated from October 1989; no major bugs found in testing

---

## 6. Pipeline Organisation

Four stages, some with sub-stages. Two pipeline modes are integrated:

### Packet-Based Circular Pipeline (thin path)

The standard dataflow pipeline for tokens arriving from the network:

```
Stage 1: TNF    Template segment # fetch from off-chip memory
                (bypassed for non-normal packets)

Stage 2: Match  - For matching with immediate: IMF (immediate fetch)
                - For matching with stored data:
                    RD (first half-clock): read matching store
                    EL/WR (second half-clock): eliminate flag if match,
                    or write data if no match
                - Bypassed for single-operand (monadic) instructions

Stage 3: IF/DC  Instruction fetch (half-clock) + decode (half-clock)

Stage 4: EX     Execution + packet output (overlapped)
```

### Register-Based Advanced Control Pipeline (thick path)

For strongly connected block execution:

```
Stage 3 ◄──► Stage 4   (loop: fetch next instruction while executing
                        current instruction)
```

Stages 3 and 4 repeat until the strongly connected block ends. During
this loop, stages 1 and 2 can process the next incoming packet
concurrently.

### Overlap Properties

- Stage 4 (execution) overlaps with stage 3 (fetch) of the next
  instruction in a strongly connected block.
- When a block ends, stage 4 of the last block instruction overlaps
  with stages 1-2 of the next packet.
- The SU operates independently and concurrently with all PE stages.

### Throughput

- Packet-based circular pipeline: 1 instruction per 4 clocks (at best,
  with a full pipeline)
- Register-based advanced control: 1 instruction per clock (within a
  strongly connected block)
- Ratio: up to 6× throughput improvement for strongly connected blocks
  (4 stages × pipeline effects, empirical)

---

## 7. Instruction Set Architecture

### RISC Characteristics

- 26 instructions
- 4 instruction formats (2 base + 2 immediate variants)
- 2 memory addressing modes
- 16-entry register file
- No microcode
- Fixed packet size
- Few packet types
- Simple synchronisation (direct matching + register sequencing)

### Instruction Set (26 instructions)

**Arithmetic and Logic (14):**
ADD, SUB, MUL, DIV0-2, DIVR, DIVQ (division pipeline),
SHF (shift), AND, OR, EOR, NOT, ALUTST

**Branch (6):**
BEQ (equal), BGT (greater), BGE (greater/equal),
BTYPE (by data type), BTYPE2 (by 2 data types),
BOVF (overflow). All implemented as delayed branches.

**Memory or Register (4):**
L (load), S (store), LS (load-and-store), LDR (load from register)

**Others (2):**
GET (remote operation — sends return address to operand destination),
MKPKT (make packet — explicit packet construction from two operands)

### Instruction Format

38-bit instruction word (stored in off-chip memory).

Without packet output (OUT=0):
```
| OP | AUX | TRC | M | NF | R0 | R1 | OUT | R2 | BC | DPL |
|  5 |  7  |  1  | 1 |  1 |  4 |  4 |  1  |  4 |  2 |  8  |
```

With packet output (OUT=1):
```
| OP | AUX | TRC | M | NF | R0 | R1 | OUT | WCF | M2 | CA | DPL |
|  5 |  7  |  1  | 1 |  1 |  4 |  4 |  1  |  2  |  1 |  3 |  8  |
```

Field descriptions:

- **OP** (5 bits): opcode (26 instructions)
- **AUX** (7 bits): secondary opcode field (e.g. shift direction/amount)
- **TRC** (1 bit): trace control
- **M** (1 bit): mode — 0 = strongly connected block ends after next
  instruction
- **NF** (1 bit): next instruction flag (continuation within block)
- **R0, R1** (4 bits each): register file addresses for operands
- **OUT** (1 bit): generate output packet (1) or store in register (0)
- **R2** (4 bits): destination register (when OUT=0)
- **BC** (2 bits): branch condition (when OUT=0, branch instructions only)
- **DPL** (8 bits): displacement (branch target or packet address offset)
- **WCF** (2 bits): waiting condition flag for output packet matching type
- **M2** (1 bit): output packet arc type (normal vs strongly connected)
- **CA** (3 bits): column address within PE group
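
A quick consistency check of the OUT=0 format, packing the fields
MSB-first (the packing helper is illustrative; only the field order and
widths come from the tables above):

```python
# Pack the OUT=0 instruction format MSB-first and check it fills 38 bits.
FIELDS = [("OP", 5), ("AUX", 7), ("TRC", 1), ("M", 1), ("NF", 1),
          ("R0", 4), ("R1", 4), ("OUT", 1), ("R2", 4), ("BC", 2), ("DPL", 8)]

def pack(values):
    word = 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert 0 <= v < (1 << width), name   # value must fit its field
        word = (word << width) | v
    return word

total_bits = sum(w for _, w in FIELDS)       # should be 38
word = pack({"OP": 0b00001, "R0": 3, "R1": 7, "DPL": 0x2A})
dpl = word & 0xFF                            # DPL occupies the low 8 bits
r1 = (word >> 15) & 0xF                      # above OUT+R2+BC+DPL = 15 bits
```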

### Packet Format

78 bits total: 39-bit address part + 39-bit data part (fixed size).

Address part (39 bits):
```
| HST | PT | WCF | M | GA | CA | MA |
|  1  |  5 |  2  | 1 |  7 |  3 | 20 |
```

- **HST** (1 bit): bound for host (vs normal destination)
- **PT** (5 bits): packet type
- **WCF** (2 bits): waiting condition flag (matching type)
- **M** (1 bit): arc type (normal=0, strongly connected=1)
- **GA** (7 bits): destination PE group address (supports 128 groups)
- **CA** (3 bits): column/member address within group (supports 8 per group;
  prototype uses 5 per group)
- **MA** (20 bits): memory address (matching address for normal data packets)

Data part (39 bits):
```
| C | * | DT | D  |
| 1 | 3 |  3 | 32 |
```

- **C** (1 bit): cancel bit (packet is nonsense/garbage — discard on arrival)
- **\*** (3 bits): reserved
- **DT** (3 bits): data type tag
- **D** (32 bits): data value
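
The packet field widths can be cross-checked the same way; note that
GA × CA gives exactly the 1024-PE global address space mentioned in
section 1. A sketch:

```python
# Width check for the EM-4 packet: 39-bit address part + 39-bit data part.
ADDRESS_PART = [("HST", 1), ("PT", 5), ("WCF", 2), ("M", 1),
                ("GA", 7), ("CA", 3), ("MA", 20)]
DATA_PART = [("C", 1), ("RSV", 3), ("DT", 3), ("D", 32)]

addr_bits = sum(w for _, w in ADDRESS_PART)   # should be 39
data_bits = sum(w for _, w in DATA_PART)      # should be 39
pe_address_space = (1 << 7) * (1 << 3)        # 128 groups x 8 members
```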

### Special Packets

Non-data packets (function control, structure access, remote memory
access) have special PT (packet type) values and are "colorless" (no
operand segment number). Generated by MKPKT or GET instructions.
Processed by a per-PE **SP Monitor** — a special strongly connected block
that the system manager configures at initialisation. This makes special
packet handling completely programmable.

### Macro Instructions

Complex operations (integer division, function call, complex structure
ops) are implemented as strongly connected blocks containing simpler
instructions, rather than as single complex instructions. The division
pipeline (DIV0, DIV1, DIV2, DIVR, DIVQ) is a good example: five simple
instructions that chain via registers to implement non-restoring division.

---

## 8. Interconnection Network

### Processor Connected Omega Network

Topology: each PE contains a 3×3 packet switch (the SU) as an element
of the network. PEs are directly interconnected; there are no separate
switching nodes.

Properties:

- Average distance from any PE to any other: O(log N)
- Connection links per PE: small constant (3 input, 3 output in the SU)
- Total switching elements: O(N) — smaller than O(N log N) for
  multi-stage omega networks
- Self-routing based on the destination address

### Deadlock Prevention

Store-and-forward deadlock is prevented by a **three-bank buffer** at each
SU input port. Packets entering the network start in the lowest bank.
When a packet reaches stage 0 of the network (one hop), it is promoted
to the next bank level. Because the circular omega topology guarantees
no packet ever needs more than two rounds through the network, three
bank levels prevent circular buffer dependencies.
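
A toy model of our reading of the promotion rule: a packet's bank index
equals the number of network rounds it has completed, and since the
topology bounds rounds at two, three banks cover every packet.

```python
# Toy model of three-bank deadlock avoidance: bank index = rounds
# completed so far. The circular omega topology bounds rounds at 2,
# so bank indices 0..2 cover every packet in flight.
MAX_ROUNDS = 2                  # guaranteed by the topology
NUM_BANKS = MAX_ROUNDS + 1

def bank_for(rounds_completed):
    assert rounds_completed <= MAX_ROUNDS
    return rounds_completed     # promoted once per round through the net

banks = [bank_for(r) for r in range(MAX_ROUNDS + 1)]
```

A packet in bank k can only wait on buffer space in bank k or higher, so
no cycle of buffer dependencies can form.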

### Load Balancing

The SU contains a **Packet Rewriting Controller (PRC)** that implements
function-level dynamic load balancing. Special **MLPE (Minimum Load PE)**
packets circulate through the network. The PRC in each SU rewrites
these packets so they always contain the address and load of the
least-loaded PE encountered so far. PEs preparing to make function calls
read MLPE packets from a local FIFO to determine where to place new
function instances.

Rewriting happens within normal packet transfer time (no overhead): the
PRC operates in parallel with normal routing.
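
The PRC rewrite rule amounts to a running minimum over the PEs an MLPE
packet passes; a sketch (the field layout is ours):

```python
# Model of MLPE packet rewriting: each SU's PRC replaces the packet's
# contents whenever the local PE is less loaded than the advertised
# minimum, so the packet always carries the best PE seen so far.

def prc_rewrite(mlpe, pe_addr, pe_load):
    """mlpe is (best_addr, best_load); returns the possibly rewritten packet."""
    best_addr, best_load = mlpe
    if pe_load < best_load:
        return (pe_addr, pe_load)   # this PE becomes the advertised minimum
    return mlpe

# An MLPE packet circulating past PEs 0..3 with these loads:
loads = {0: 7, 1: 2, 2: 9, 3: 4}
mlpe = (0, loads[0])
for pe in (1, 2, 3):
    mlpe = prc_rewrite(mlpe, pe, loads[pe])
```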

### Prototype Network Implementation

- Intra-group (5 PEs per group): traces on the PE group board
- Inter-group (same rack half): mother board connections
- Inter-rack-half: shielded cables
- Each network path: 44 signal lines (5 control + 39 data)
- Transfer rate: 60.9 MB/s per port; 14.63 GB/s total
- Transfer delay: 80 ns between adjacent nodes; ~410 ns average

---

## 9. EM-4 Prototype Details

### Organisation

- 80 PEs in 16 groups of 5 (group size determined by omega network
  topology)
- Each PE: EMC-R + Memory Address Register (MAR) + off-chip SRAM
- Off-chip memory per PE: 1.31 MB SRAM (max 5.25 MB)
- Host computer: SUN3/260 + VME bus
- Packet interface processor: controls system clock, power, packet I/O
- Packet interface switch: connects host to PE groups

### Physical Implementation

- Single rack: 60.0 cm × 92.0 cm × 140.5 cm
- 16 PE group boards (4-layer multi-wire, 50.8 cm × 47.0 cm each)
- 8 boards in upper rack, 8 in lower rack + interface switch
- Single synchronisation clock: 12.5 MHz
- Power: 3-phase 200 V supply; ~2.6 kW total (PEs ~2.42 kW)

### Memory Organisation (per PE)

The off-chip memory is a unified address space partitioned into:

- Secondary packet buffer (overflow from 32-word on-chip FIFO)
- Matching store (operand segments)
- Instruction store (template segments)
- SP Monitor area
- Structure store
- Working area

All accessible from the EXU as a single address space via the MCU.

### Program Execution Flow

1. **Initialisation**: host + maintenance circuits reset system, set PE
   numbers, configure memory segment linkages, load monitor routines.
2. **Program loading**: assembled code sent via packets using "broadcasting"
   — each PE loads function codes into its memory and forwards to
   uninitialised PEs.
3. **Execution start**: host sends input packets to appropriate matching
   addresses via the packet interface processor.
4. **Execution**: function instances invoke other functions. Results flow
   back to parent functions or to the packet interface.
5. **Termination**: resultant packets sent to packet interface processor.

### Performance Results (80 PEs)

| Program | EM-4 time (s) | MIPS | vs SPARC330 | vs VAX8800 | vs CRAY/X-MP |
|---------|---------------|------|-------------|------------|--------------|
| FIB(23) | 0.00453 | 223 | 12× | 30× | 19× |
| PRIME(65536) | 0.506 | 508 | 33× | 75× | 21× |
| PI(4000) | 0.369 | 824 | 94× | 101× | 13× |
| SUM(65535) | 0.00042 | 780 | 25× | 142× | 6× |
| MTRX(80×80) | 0.00888 | 815 | 79× | 225× | 0.4× |

Note: EM-4 programs were in assembly language; comparison programs were
compiled C (SPARC, VAX) or FORTRAN (CRAY) with maximum optimisation.

The CRAY/X-MP outperforms the EM-4 only on matrix multiplication, where
the CRAY's vector pipeline dominates (2-3 instructions per inner loop
vs EM-4's 12). The EM-4 is 6-21× faster on everything else.

PI calculation achieves the highest MIPS (824, vs 1000 theoretical peak)
because all operations can be statically scheduled with local-only
communication. Fibonacci achieves the lowest (223) due to heavy packet
traffic causing IBU FIFO overflow.

---

## 10. Analysis: Relevance to Our Architecture

### Validated Design Choices

1. **Direct matching with ordinary SRAM.** The EM-4 confirms this works
   in real hardware. Their FMU is only ~3.5K gates with direct matching.
   Our matching store should be similarly simple.

2. **Static PE assignment reduces matching hardware.** Though the EM-4
   still does dynamic function-to-group allocation, their acknowledgment
   that compiler-directed allocation reduces matching overhead supports
   our more aggressive static assignment approach.

3. **RISC instruction philosophy.** 26 instructions, no microcode, simple
   formats. Validates our EEPROM-decoded instruction approach.

4. **Single-cycle matching via half-clock read-modify-write.** The EM-4
   achieves this in their 80 ns clock. We should be able to do the same
   with modern async SRAM (~15-25 ns access) within our target clock
   period.

### Features We Should Adopt or Plan For

1. **Instruction format continuation bit.** Reserve 1 bit in the
   instruction encoding for "next instruction follows sequentially"
   (EM-4's NF flag). Even without strongly connected block execution
   in v0, this preserves the option. Zero hardware cost — it's just
   an IRAM bit that the v0 pipeline ignores.

2. **Cancel bit in token format.** The EM-4's C bit in the data part
   marks garbage tokens for discard. Consider adding this to our token
   format (1 bit in flit 2 or flit 3) to handle not-taken branch
   cleanup without relying solely on generation counters. Generation
   counters catch stale matches; the cancel bit catches tokens still
   in flight.

3. **IBU overflow to off-chip SRAM.** The EM-4's two-level buffering
   (32-word on-chip FIFO + 8K-word off-chip SRAM) is pragmatic. We
   need an explicit overflow/backpressure strategy for PE input FIFOs.

4. **SP Monitor concept.** Special packet handling as a programmable
   strongly connected block rather than fixed-function hardware.
   Interesting for our I/O and system packet handling — rather than a
   fixed I/O controller state machine, a configurable handler built
   from normal instructions.

### Features Evaluated and Deferred

1. **Strongly connected blocks.** The performance benefit is clear (up
   to 6× throughput within blocks, 2.5× on the Fibonacci subgraph). The
   hardware cost is moderate (16-register file + sequencing logic).
   But it requires significant compiler infrastructure (block
   identification, register allocation within blocks). **Deferred to
   post-v0.** The instruction format should reserve the continuation
   bit to enable future implementation.

2. **PE-embedded network switch.** The EMC-R's SU (~10K gates) operates
   independently alongside the PE. Relevant only if we move from a
   shared bus to a point-to-point topology. Not applicable to the v0
   shared-bus design.

3. **Dynamic load balancing (MLPE).** Makes sense at 80-1000 PEs, not
   at 4. Static compiler assignment is sufficient for our scale.

4. **16-register file.** Closely tied to strongly connected blocks. The
   EM-4 uses it for intra-block operand passing (R0, R1, R2 fields).
   Without strongly connected execution, there's no register file
   traffic pattern. The matching store SRAM and SM handle all operand
   staging for v0. If/when strongly connected blocks are added, a small
   register file (using SRAM with dedicated address lines, or a
   74670-style register file chip) should be added simultaneously.

### Features Not Applicable

1. **Processor connected omega network.** O(N) hardware, O(log N)
   distance. Irrelevant for a 4-PE shared bus. Worth revisiting if
   scaling to 16+ PEs.

2. **Template segment number fetch (pipeline stage 1).** The EM-4 needs
   this because operand segments are dynamically bound to template
   segments at function invocation. Our static PE assignment with
   compiler-fixed IRAM contents eliminates this stage entirely, saving
   one pipeline stage.

3. **78-bit fixed-size packets.** Our variable-length multi-flit tokens
   on a 16-bit bus are better suited to discrete logic, where wide buses
   are physically expensive. The EM-4 could use fixed-size packets
   because everything was on-chip or on a custom bus.

4. **Macro instructions via strongly connected blocks.** Complex
   operations (division, function call) are decomposed into simple
   instruction sequences within strongly connected blocks. Without
   strongly connected execution in v0, we handle complex operations
   differently (multi-token graph fragments or SM operations).

---

## 11. Open Items Arising from EM-4 Analysis

1. **Cancel bit in token format**: evaluate adding 1 bit to type-00/01
   tokens for garbage token discard. Requires: format change in flit 2,
   discard logic in PE input (check cancel bit before entering matching
   pipeline). Low hardware cost, potentially useful for conditional
   branch cleanup.
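
   A behavioural sketch of the proposed input-stage check, combining the
   cancel bit with the existing generation-counter test (field names are
   placeholders for whatever the flit encoding ends up using):

   ```python
   # Sketch of PE input filtering: drop cancelled (garbage) tokens before
   # the matching pipeline, and stale-generation tokens as today.
   # "cancel" and "gen" are placeholder field names, not a final format.

   def accept_token(token, current_gen):
       if token["cancel"]:              # EM-4-style C bit: garbage, drop
           return False
       if token["gen"] != current_gen:  # stale activation: drop
           return False
       return True

   live = accept_token({"cancel": False, "gen": 4, "data": 17}, current_gen=4)
   dead = accept_token({"cancel": True, "gen": 4, "data": 17}, current_gen=4)
   ```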

2. **IBU overflow strategy**: decide between backpressure (stall the
   bus — simple but blocks all traffic) vs local overflow SRAM (more
   hardware but non-blocking). For v0 shared bus at 4 PEs, backpressure
   may be acceptable.

3. **I-structure semantics in SM**: confirmed as wanted. The EM-4 puts
   structure storage in per-PE off-chip memory; we're keeping it in a
   dedicated SM module with synchronising memory semantics. Design the
   deferred-read queue.

4. **Instruction format bit reservation**: when finalising IRAM encoding,
   reserve 1 bit for the continuation flag (NF equivalent) and 1 bit for
   the output-mode flag (OUT equivalent). These cost nothing in v0 (the
   pipeline ignores them) and enable strongly connected blocks later.

5. **Strongly connected block feasibility study**: before implementing,
   compile some representative programs (Fibonacci, simple parallel
   computations) and measure what percentage of instructions could be
   placed in strongly connected blocks. This determines the practical
   performance benefit and justifies (or not) the compiler complexity.