OR-1 dataflow CPU sketch
# EM-4 Architecture Analysis

Deep-dive reference on the EM-4 (Electrotechnical Laboratory, Japan) and
its single-chip processor EMC-R, based on the ISCA '89 and IPPS '91
papers. Captures architectural details for reference even where our
design diverges.

Source papers:

- Sakai, Yamaguchi, Hiraki, Kodama, Yuba. "An Architecture of a Dataflow
  Single Chip Processor." Proc. ISCA '89, pp. 46–53, 1989.
- Sakai, Kodama, Yamaguchi. "Prototype Implementation of a Highly
  Parallel Dataflow Machine EM-4." Proc. IPPS '91, pp. 278–286, 1991.

See `Prior_Art_Reference_Guide_for_a_Discrete-Logic_Dynamic_Dataflow_CPU.md`
for the full bibliography, including the network paper (Parallel Computing
1993), the synchronisation paper (IEICE 1991), the successor chip EMC-Y,
and the EM-X system papers.

---

## 1. Project Context and Design Objectives

The EM-4 is a successor to SIGMA-1 (128-PE dataflow supercomputer,
ETL, operational 1988, >100 MFLOPS). SIGMA-1 PEs were multi-chip
(several gate arrays + large memory); direct scaling to 1000+ PEs was
impractical due to hardware volume and architectural complexity.

EM-4 design objectives:

1. 1000+ PE machine for general use (numerical, symbolic, simulation)
2. Single-chip PE (the EMC-R) to make scaling feasible
3. O(N) interconnection network (vs O(N log N) for multi-stage networks)
4. Improved dataflow execution efficiency via a new computation model

The 80-PE prototype was the first step toward a 1024-PE target. The
EMC-R chip was designed with a 1024-PE global address space even though
the prototype only used 80 PEs.

---

## 2. Identified Defects of Prior Dataflow Architectures

The EM-4 papers explicitly list six defects of "conventional" (i.e.
Manchester-style) dataflow machines. These are worth documenting because
several of them drove design choices in our architecture too.

**D1.
Circular pipeline underutilisation at low parallelism.** When fewer
tokens are in flight than N×S (N = PEs, S = pipeline stages), the
circular pipeline has bubbles. A single token going round the loop
achieves throughput < 1 instruction per pipeline circulation time. No
advanced control mechanism exists to fill the gap.

**D2. Packet-based architecture cannot exploit registers.** If every
intermediate result becomes a packet and re-enters the PE through the
input queue, there is no way to keep frequently-used values in fast
local storage. This also prevents fine-grained pipelining within a
single computation thread.

**D3. High matching overhead (colored token style).** Associative memory
or hashing for color matching requires complex control logic and
significant time per match operation.

**D4. Excessive packet traffic.** Every inter-node communication is a
packet, even between nodes that always execute on the same PE. The
network becomes the bottleneck.

**D5. No resource management primitives.** Mutual exclusion (test-and-set,
compare-and-swap) requires serialisation, which is difficult in a pure
dataflow model where any node can fire at any time.

**D6. Garbage token cleanup overhead.** Conditional branches (switches)
produce garbage tokens on the not-taken path. Collecting these at
program end is expensive.

### Relevance to Our Design

- D1: partially addressed by our monadic bypass (skip matching for
  single-input ops). Strongly connected blocks (deferred) would address
  it more completely.
- D2: we currently have no register file. The matching store SRAM serves
  as temporary operand storage. SM serves as shared scratch. A register
  file is deferred pending strongly connected block implementation.
- D3: solved — our direct-indexed matching with wire concatenation
  addressing has zero associative lookup overhead, same as the EM-4's
  direct matching.
- D4: partially addressed by static PE assignment (the compiler places
  communicating nodes on the same PE to avoid network traffic). Strongly
  connected blocks would further reduce intra-PE packet overhead.
- D5: SM with RMW operations provides atomic primitives. I-structure
  semantics (deferred) will add synchronising memory.
- D6: generation counters handle stale token detection. The cancel bit
  approach (EM-4's C field in the data part) is noted as a potential
  addition. See open items.

---

## 3. Strongly Connected Arc Model

The EM-4's most significant contribution. Arcs in the dataflow graph
are classified into two types:

- **Normal arcs**: standard dataflow — token matching, packet formation,
  full pipeline traversal.
- **Strongly connected arcs**: local to a PE — sequential register-based
  execution without packet formation or matching.

A **strongly connected block** is a subgraph whose internal arcs are all
strongly connected. The execution rule: once any node in a strongly
connected block fires, the entire block executes to completion on that
PE, exclusively. No other block can interrupt or interleave with it on
that PE during execution.

### Hardware Mechanism

The EMC-R instruction format contains:

- **M (mode) field** (1 bit): if zero, the current strongly connected
  block ends with the next instruction.
- **NF (next flag)** (1 bit): controls continuation within the block.
- **R0, R1** (4 bits each): strongly connected register references for
  the next instruction's operands. These index into a 16-entry register
  file in the EXU.
- **OUT** (1 bit): whether this instruction generates an output packet
  (1) or stores its result in a register (0, with the result going to R2).

When executing within a strongly connected block:

1. Pipeline stage 3 (instruction fetch + decode) and stage 4 (execute)
   repeat in an overlapped loop.
2. Stage 4's execution overlaps with stage 3's fetch of the next
   instruction in the block.
3. Operands come from the register file (R0, R1) or from the matching
   store (first instruction only).
4. Results go to the register file (R2) or out as packets.
5. When the block ends (M=0), the last instruction's execution overlaps
   with stages 1-2 of the *next incoming packet*.

The register-based pipeline throughput is up to **6×** the packet-based
circular pipeline throughput.

### Performance Example: Fibonacci

The papers give a concrete comparison on a Fibonacci subgraph:

- Pure dataflow program: 23 clocks for the recursive-call subgraph
- Strongly connected program: 9 clocks for the same computation

The strongly connected version packs type checking, branching, subtraction,
and MKPKT operations into a single block that executes sequentially using
registers, rather than creating separate tokens for each intermediate.

### Implications for Compiler

The compiler must:

1. Identify valid strongly connected blocks (subgraphs that can execute
   sequentially without violating data dependencies).
2. Annotate instructions with M, NF, R0, R1, R2, OUT fields.
3. Construct blocks automatically (the papers mention a "block
   constructor" in their compiler).

This is non-trivial compiler work but is amenable to standard
optimisation techniques (instruction scheduling, register allocation
within blocks).

### Advantages Summarised (from the paper)

- A1: enables the advanced control pipeline (deterministic execution
  within blocks)
- A2: register file for intra-block operands (no matching needed)
- A3: intra-block matching simplified (no color matching if the block is
  within a single function instance)
- A4: no packet transfers within blocks (reduces network traffic)
- A5: indivisible instruction sequences enable resource management
  (test-and-set, etc.)
- A6: garbage token cleanup simplified (just reset register file flags)

---

## 4. Direct Matching Scheme

The EM-4's matching mechanism, which is conceptually very close to our
design and to Monsoon's Explicit Token Store.

### Mechanism

At function invocation, two memory regions are bound:

- **Operand segment**: storage for waiting/matching operands. One word per
  dyadic instruction in the function body.
- **Template segment**: compiled instruction codes for the function.

The address of an instruction in the template segment has a **1:1 simple
correspondence** with the matching location in the operand segment. No
hashing, no associative lookup.

Matching operation:

1. Read the operand segment at the address corresponding to this
   instruction.
2. If partner data is present: match succeeds, clear the presence flag,
   proceed to instruction fetch.
3. If partner data is absent: write the incoming data, set the presence
   flag, token consumed.

The read-modify-write is completed in a single clock cycle (read in the
first half-clock, write/eliminate in the second half-clock).

### Comparison with Monsoon ETS

The 1991 paper explicitly compares with Monsoon:

> "Although the Explicit Token-Store scheme in the Monsoon Machine also
> uses the frame-based matching, the direct matching does not need an
> extra instruction field for data synchronization since a displacement
> of the matching place is in common with the instruction. This scheme
> realizes a much more efficient synchronization than Monsoon does and
> instructions and packets are fairly smaller, at the cost of memory
> space."

Key difference: the EM-4 shares the address between instruction fetch and
operand matching (they're at the same offset in their respective
segments), so no additional "frame offset" field is needed in the
instruction or token.
Monsoon needs a separate frame-pointer + slot-offset in each token.

### Comparison with Our Design

Our matching store uses the same principle:
`SRAM_address = ctx_slot : match_entry`, with `match_entry` derived from
the token's addressing fields, giving a 1:1 correspondence to
instruction addresses.

The key differences:

| Aspect            | EM-4                                                         | Our design                                |
| ----------------- | ------------------------------------------------------------ | ----------------------------------------- |
| Segment binding   | Dynamic (at function invocation)                             | Static (compiler assigns at compile time) |
| Segment size      | Sized per function body                                      | Sized per PE (fixed N slots × M entries)  |
| Presence tracking | Flags in operand segment words                               | Separate occupied bitmap                  |
| Multi-activation  | Multiple operand segments per function                       | Context slots with generation counters    |
| Template fetch    | Extra pipeline stage (TNF) to look up which template segment | Not needed (IRAM content is fixed per PE) |

---

## 5. EMC-R Chip Architecture

### Block Diagram (5 functional units + maintenance)

```
              ┌──────────────┐
Network In ──►│  Switching   │──► Network Out (port A)
Network In ──►│  Unit (SU)   │──► Network Out (port B)
              └──────┬───────┘
                     │ to local PE
              ┌──────▼───────┐
              │ Input Buffer │
              │  Unit (IBU)  │
              └──────┬───────┘
              ┌──────▼───────┐     ┌──────────┐
              │  Fetch and   │◄───►│ Off-chip │
              │  Matching    │     │  Memory  │
              │  Unit (FMU)  │◄──┐ │ (≤5 MB)  │
              └──────┬───────┘   │ └──────────┘
              ┌──────▼───────┐   │      ▲
              │  Execution   │   │      │
              │  Unit (EXU)  │───┘ ┌────┴────┐
              └──────┬───────┘     │ Memory  │
                     │             │ Control │
                     ▼             │  Unit   │
               Packet output       └─────────┘
              (back to SU)
```

### Gate Counts and Pin Usage

| Unit | Gates | Pins | Notes |
|------|-------|------|-------|
| Switching Unit (SU) | 10,112 (1989) / 9,179 (1991) | 176 | 3×3 packet switch, three-bank buffers, PRC |
| Input Buffer Unit (IBU) | 9,238 (1989) / 9,295 (1991) | - | 32-word FIFO (dual-port RAM on chip) |
| Fetch and Matching Unit (FMU) | 3,504 (1989) / 3,610 (1991) | - | Direct matching, sequencing, pipeline control |
| Execution Unit (EXU) | 19,692 (1989) / 20,620 (1991) | - | ALU, multiplier, barrel shifter, register file, packet gen |
| Memory Control Unit (MCU) | 1,518 / 1,664 | 67 | Address/data mux, arbitration |
| Maintenance Circuits | 1,589 / 1,420 | 12 | Init, error handling, dynamic monitor |
| **Total** | **45,653 / 45,788** | **255** | 1.5 μm CMOS gate array, 299-pin package |

Notable: the FMU (matching + sequencing) is only ~3.5K gates — the
direct matching scheme keeps it small. The SU (network switch) is ~10K
gates because of the three-bank deadlock prevention buffers. The EXU
dominates at ~20K gates due to the multiplier, barrel shifter, and
16-register file.
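The direct matching behaviour that keeps the FMU this small can be sketched in a few lines. This is an illustrative software model of the read-modify-write described in section 4, not EM-4 code; the class and method names are ours:

```python
# Minimal sketch of the FMU's direct-matching step (hypothetical names).
# The operand segment is ordinary RAM; the instruction's address doubles
# as the matching address, so no hashing or associative lookup is needed.

class OperandSegment:
    """One word + presence flag per dyadic instruction in the function."""

    def __init__(self, size):
        self.present = [False] * size
        self.word = [None] * size

    def match(self, addr, incoming):
        """One read-modify-write: return the waiting partner on a
        successful match, or None if the incoming token was stored."""
        if self.present[addr]:          # RD half-clock: partner waiting
            self.present[addr] = False  # EL half-clock: clear the flag
            return self.word[addr]      # instruction can now fire
        self.present[addr] = True       # WR half-clock: store and wait
        self.word[addr] = incoming
        return None

seg = OperandSegment(16)
assert seg.match(3, 7) is None   # first operand arrives: stored, waits
assert seg.match(3, 5) == 7      # second operand arrives: match fires
```

The whole matcher is a RAM access plus a flag test, which is why no associative hardware is needed.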
### Chip Implementation

- 1.5 μm CMOS gate array
- Inverter gate delay: 0.7 ns
- Clock: 80 ns (12.5 MHz)
- Peak performance: 12.5 MIPS per PE
- Package: 299-pin ceramic (43 power/ground, 255 signal, 1 unused)
- Fabricated in October 1989; no major bugs found in testing

---

## 6. Pipeline Organisation

Four stages, some with sub-stages. Two pipeline modes are integrated:

### Packet-Based Circular Pipeline (thin path)

The standard dataflow pipeline for tokens arriving from the network:

```
Stage 1: TNF    Template segment # fetch from off-chip memory
                (bypassed for non-normal packets)

Stage 2: Match  - For matching with immediate: IMF (immediate fetch)
                - For matching with stored data:
                    RD (first half-clock): read matching store
                    EL/WR (second half-clock): eliminate flag if match,
                    or write data if no match
                - Bypassed for single-operand (monadic) instructions

Stage 3: IF/DC  Instruction fetch (half-clock) + decode (half-clock)

Stage 4: EX     Execution + packet output (overlapped)
```

### Register-Based Advanced Control Pipeline (thick path)

For strongly connected block execution:

```
Stage 3 ◄──► Stage 4   (loop: fetch the next instruction while
                        executing the current instruction)
```

Stages 3 and 4 repeat until the strongly connected block ends. During
this loop, stages 1 and 2 can process the next incoming packet
concurrently.

### Overlap Properties

- Stage 4 (execution) overlaps with stage 3 (fetch) of the next
  instruction in a strongly connected block.
- When a block ends, stage 4 of the last block instruction overlaps
  with stages 1-2 of the next packet.
- The SU operates independently and concurrently with all PE stages.
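These overlap rules translate directly into clock counts. The toy model below is our simplification (one clock per stage, no memory contention, one token circulating alone), not the papers' cycle-accurate figures; it only illustrates the 4-clocks-per-instruction vs 1-clock-per-instruction contrast:

```python
# Toy clock-count model of the two pipeline modes (illustrative only;
# assumes one clock per stage and ignores memory contention).

def packet_pipeline_clocks(n):
    # Circular packet pipeline: a single circulating token pays the
    # full 4-stage traversal for every instruction (defect D1).
    return 4 * n

def strongly_connected_clocks(n):
    # Advanced control pipeline: after the first instruction fills the
    # pipeline, the stage 3-4 loop retires one instruction per clock.
    return 4 + (n - 1)

print(packet_pipeline_clocks(8))     # 32 clocks for an 8-node chain
print(strongly_connected_clocks(8))  # 11 clocks for the same chain
```

The papers report up to 6× in practice because block execution also removes matching and packet-formation work that this model does not count.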
### Throughput

- Packet-based circular pipeline: 1 instruction per 4 clocks (at best,
  with a full pipeline)
- Register-based advanced control: 1 instruction per clock (within a
  strongly connected block)
- Ratio: up to 6× throughput improvement for strongly connected blocks
  (4 stages × pipeline effects, empirical)

---

## 7. Instruction Set Architecture

### RISC Characteristics

- 26 instructions
- 4 instruction formats (2 base + 2 immediate variants)
- 2 memory addressing modes
- 16-entry register file
- No microcode
- Fixed packet size
- Few packet types
- Simple synchronisation (direct matching + register sequencing)

### Instruction Set (26 instructions)

**Arithmetic and Logic (14):**
ADD, SUB, MUL, DIV0-2, DIVR, DIVQ (division pipeline),
SHF (shift), AND, OR, EOR, NOT, ALUTST

**Branch (6):**
BEQ (equal), BGT (greater), BGE (greater/equal),
BTYPE (by data type), BTYPE2 (by 2 data types),
BOVF (overflow). All implemented as delayed branches.

**Memory or Register (4):**
L (load), S (store), LS (load-and-store), LDR (load from register)

**Others (2):**
GET (remote operation — sends return address to operand destination),
MKPKT (make packet — explicit packet construction from two operands)

### Instruction Format

38-bit instruction word (stored in off-chip memory).

Without packet output (OUT=0):
```
| OP | AUX | TRC | M | NF | R0 | R1 | OUT | R2 | BC | DPL |
|  5 |  7  |  1  | 1 |  1 |  4 |  4 |  1  |  4 |  2 |  8  |
```

With packet output (OUT=1):
```
| OP | AUX | TRC | M | NF | R0 | R1 | OUT | WCF | M2 | CA | DPL |
|  5 |  7  |  1  | 1 |  1 |  4 |  4 |  1  |  2  |  1 |  3 |  8  |
```

Field descriptions:

- **OP** (5 bits): opcode (26 instructions)
- **AUX** (7 bits): secondary opcode field (e.g.
  shift direction/amount)
- **TRC** (1 bit): trace control
- **M** (1 bit): mode — 0 = strongly connected block ends after the next
  instruction
- **NF** (1 bit): next instruction flag (continuation within block)
- **R0, R1** (4 bits each): register file addresses for operands
- **OUT** (1 bit): generate output packet (1) or store in register (0)
- **R2** (4 bits): destination register (when OUT=0)
- **BC** (2 bits): branch condition (when OUT=0, branch instructions only)
- **DPL** (8 bits): displacement (branch target or packet address offset)
- **WCF** (2 bits): waiting condition flag for the output packet's
  matching type
- **M2** (1 bit): output packet arc type (normal vs strongly connected)
- **CA** (3 bits): column address within PE group

### Packet Format

78 bits total: 39-bit address part + 39-bit data part (fixed size).

Address part (39 bits):
```
| HST | PT | WCF | M | GA | CA | MA |
|  1  |  5 |  2  | 1 |  7 |  3 | 20 |
```

- **HST** (1 bit): bound for host (vs normal destination)
- **PT** (5 bits): packet type
- **WCF** (2 bits): waiting condition flag (matching type)
- **M** (1 bit): arc type (normal=0, strongly connected=1)
- **GA** (7 bits): destination PE group address (supports 128 groups)
- **CA** (3 bits): column/member address within group (supports 8 per
  group; prototype uses 5 per group)
- **MA** (20 bits): memory address (matching address for normal data
  packets)

Data part (39 bits):
```
| C | * | DT | D  |
| 1 | 3 |  3 | 32 |
```

- **C** (1 bit): cancel bit (packet is nonsense/garbage — discard on
  arrival)
- **\*** (3 bits): reserved
- **DT** (3 bits): data type tag
- **D** (32 bits): data value

### Special Packets

Non-data packets (function control, structure access, remote memory
access) have special PT (packet type) values and are "colorless" (no
operand segment number). Generated by MKPKT or GET instructions.
Processed by a per-PE **SP Monitor** — a special strongly connected block
that the system manager configures at initialisation. This makes special
packet handling completely programmable.

### Macro Instructions

Complex operations (integer division, function call, complex structure
ops) are implemented as strongly connected blocks containing simpler
instructions, rather than as single complex instructions. The division
pipeline (DIV0, DIV1, DIV2, DIVR, DIVQ) is a good example: five simple
instructions that chain via registers to implement non-restoring division.

---

## 8. Interconnection Network

### Processor Connected Omega Network

Topology: each PE contains a 3×3 packet switch (the SU) as an element
of the network. PEs are directly interconnected; there are no separate
switching nodes.

Properties:

- Average distance from any PE to any other: O(log N)
- Connection links per PE: small constant (3 input, 3 output in the SU)
- Total switching elements: O(N) — smaller than O(N log N) for
  multi-stage omega networks
- Self-routing based on the destination address

### Deadlock Prevention

Store-and-forward deadlock is prevented by a **three-bank buffer** at each
SU input port. Packets entering the network start in the lowest bank.
When a packet reaches stage 0 of the network (one hop), it is promoted
to the next bank level. Because the circular omega topology guarantees
no packet ever needs more than two rounds through the network, three
bank levels prevent circular buffer dependencies.

### Load Balancing

The SU contains a **Packet Rewriting Controller (PRC)** that implements
function-level dynamic load balancing. Special **MLPE (Minimum Load PE)**
packets circulate through the network. The PRC in each SU rewrites
these packets so they always contain the address and load of the
least-loaded PE encountered so far.
PEs preparing to make function calls read MLPE packets from a local FIFO
to determine where to place new function instances.

This is done within normal packet transfer time (no overhead) using
the PRC, which operates in parallel with normal routing.

### Prototype Network Implementation

- Intra-group (5 PEs per group): traces on the PE group board
- Inter-group (same rack half): mother board connections
- Inter-rack-half: shielded cables
- Each network path: 44 signal lines (5 control + 39 data)
- Transfer rate: 60.9 MB/s per port; 14.63 GB/s total
- Transfer delay: 80 ns between adjacent nodes; ~410 ns average

---

## 9. EM-4 Prototype Details

### Organisation

- 80 PEs in 16 groups of 5 (group size determined by omega network
  topology)
- Each PE: EMC-R + Memory Address Register (MAR) + off-chip SRAM
- Off-chip memory per PE: 1.31 MB SRAM (max 5.25 MB)
- Host computer: SUN3/260 + VME bus
- Packet interface processor: controls system clock, power, packet I/O
- Packet interface switch: connects host to PE groups

### Physical Implementation

- Single rack: 60.0 cm × 92.0 cm × 140.5 cm
- 16 PE group boards (4-layer multi-wire, 50.8 cm × 47.0 cm each)
- 8 boards in the upper rack half, 8 in the lower half + interface switch
- Single synchronisation clock: 12.5 MHz
- Power: 3-phase 200 V supply; ~2.6 kW total (PEs ~2.42 kW)

### Memory Organisation (per PE)

The off-chip memory is a unified address space partitioned into:

- Secondary packet buffer (overflow from the 32-word on-chip FIFO)
- Matching store (operand segments)
- Instruction store (template segments)
- SP Monitor area
- Structure store
- Working area

All accessible from the EXU as a single address space via the MCU.

### Program Execution Flow

1. **Initialisation**: host + maintenance circuits reset the system, set
   PE numbers, configure memory segment linkages, load monitor routines.
2. **Program loading**: assembled code is sent via packets using
   "broadcasting" — each PE loads function codes into its memory and
   forwards them to uninitialised PEs.
3. **Execution start**: the host sends input packets to the appropriate
   matching addresses via the packet interface processor.
4. **Execution**: function instances invoke other functions. Results flow
   back to parent functions or to the packet interface.
5. **Termination**: resultant packets are sent to the packet interface
   processor.

### Performance Results (80 PEs)

| Program | EM-4 time (s) | MIPS | vs SPARC330 | vs VAX8800 | vs CRAY/X-MP |
|---------|---------------|------|-------------|------------|--------------|
| FIB(23) | 0.00453 | 223 | 12× | 30× | 19× |
| PRIME(65536) | 0.506 | 508 | 33× | 75× | 21× |
| PI(4000) | 0.369 | 824 | 94× | 101× | 13× |
| SUM(65535) | 0.00042 | 780 | 25× | 142× | 6× |
| MTRX(80×80) | 0.00888 | 815 | 79× | 225× | 0.4× |

Note: EM-4 programs were in assembly language; comparison programs were
compiled C (SPARC, VAX) or FORTRAN (CRAY) with maximum optimisation.

The CRAY/X-MP outperforms the EM-4 only on matrix multiplication, where
the CRAY's vector pipeline dominates (2-3 instructions per inner loop
vs the EM-4's 12). The EM-4 is 6-21× faster on everything else.

PI calculation achieves the highest MIPS (824, vs a theoretical peak of
1000) because all operations can be statically scheduled with local-only
communication. Fibonacci achieves the lowest (223) due to heavy packet
traffic causing IBU FIFO overflow.

---

## 10. Analysis: Relevance to Our Architecture

### Validated Design Choices

1. **Direct matching with ordinary SRAM.** The EM-4 confirms this works
   in real hardware. Their FMU is only ~3.5K gates with direct matching.
   Our matching store should be similarly simple.

2. **Static PE assignment reduces matching hardware.** Though the EM-4
   still does dynamic function-to-group allocation, their acknowledgment
   that compiler-directed allocation reduces matching overhead supports
   our more aggressive static assignment approach.

3. **RISC instruction philosophy.** 26 instructions, no microcode, simple
   formats. Validates our EEPROM-decoded instruction approach.

4. **Single-cycle matching via half-clock read-modify-write.** The EM-4
   achieves this within their 80 ns clock. We should be able to do the
   same with modern async SRAM (~15-25 ns access) within our target
   clock period.

### Features We Should Adopt or Plan For

1. **Instruction format continuation bit.** Reserve 1 bit in the
   instruction encoding for "next instruction follows sequentially"
   (EM-4's NF flag). Even without strongly connected block execution
   in v0, this preserves the option. Zero hardware cost — it's just
   an IRAM bit that the v0 pipeline ignores.

2. **Cancel bit in token format.** The EM-4's C bit in the data part
   marks garbage tokens for discard. Consider adding this to our token
   format (1 bit in flit 2 or flit 3) to handle not-taken branch
   cleanup without relying solely on generation counters. Generation
   counters catch stale matches; the cancel bit catches tokens still
   in flight.

3. **IBU overflow to off-chip SRAM.** The EM-4's two-level buffering
   (32-word on-chip FIFO + 8K-word off-chip SRAM) is pragmatic. We
   need an explicit overflow/backpressure strategy for PE input FIFOs.

4. **SP Monitor concept.** Special packet handling as a programmable
   strongly connected block rather than fixed-function hardware.
   Interesting for our I/O and system packet handling — rather than a
   fixed I/O controller state machine, a configurable handler built
   from normal instructions.
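The cancel-bit check from item 2 would sit at the very front of the PE input stage, before any matching work. A minimal sketch, assuming a hypothetical layout in which the C bit is the top bit of a 39-bit data part (mirroring the EM-4's placement; our flit assignment is not yet fixed):

```python
# Sketch of cancel-bit filtering at the PE input stage. The bit
# position is a hypothetical layout choice mirroring the EM-4's data
# part (C = bit 38 of a 39-bit word), not our finalised token format.

C_BIT = 1 << 38

def accept(data_part):
    """Return True if the token should enter the matching pipeline;
    garbage tokens (C=1) are discarded on arrival."""
    return not (data_part & C_BIT)

live = 0x2A            # C=0: normal token carrying the value 42
dead = C_BIT | 0x2A    # C=1: not-taken branch output, discard
assert accept(live)
assert not accept(dead)
```

In hardware this is one gate on the FIFO read path, which is why the cost is described as low.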
### Features Evaluated and Deferred

1. **Strongly connected blocks.** The performance benefit is clear (up
   to 6× throughput within blocks, 2.5× on the Fibonacci subgraph). The
   hardware cost is moderate (16-register file + sequencing logic).
   But it requires significant compiler infrastructure (block
   identification, register allocation within blocks). **Deferred to
   post-v0.** The instruction format should reserve the continuation
   bit to enable future implementation.

2. **PE-embedded network switch.** The EMC-R's SU (~10K gates) operates
   independently alongside the PE. Relevant only if we move from a
   shared bus to a point-to-point topology. Not applicable to the v0
   shared-bus design.

3. **Dynamic load balancing (MLPE).** Makes sense at 80-1000 PEs, not
   at 4. Static compiler assignment is sufficient for our scale.

4. **16-register file.** Closely tied to strongly connected blocks. The
   EM-4 uses it for intra-block operand passing (R0, R1, R2 fields).
   Without strongly connected execution, there's no register file
   traffic pattern. The matching store SRAM and SM handle all operand
   staging for v0. If/when strongly connected blocks are added, a small
   register file (using SRAM with dedicated address lines, or a
   74670-style register file chip) should be added simultaneously.

### Features Not Applicable

1. **Processor connected omega network.** O(N) hardware, O(log N)
   distance. Irrelevant for a 4-PE shared bus. Worth revisiting if
   scaling to 16+ PEs.

2. **Template segment number fetch (pipeline stage 1).** The EM-4 needs
   this because operand segments are dynamically bound to template
   segments at function invocation. Our static PE assignment with
   compiler-fixed IRAM contents eliminates this stage entirely, saving
   one pipeline stage.

3. **78-bit fixed-size packets.** Our variable-length multi-flit tokens
   on a 16-bit bus are better suited to discrete logic, where wide buses
   are physically expensive. The EM-4 could use fixed-size packets
   because everything was on-chip or on a custom bus.

4. **Macro instructions via strongly connected blocks.** Complex
   operations (division, function call) are decomposed into simple
   instruction sequences within strongly connected blocks. Without
   strongly connected execution in v0, we handle complex operations
   differently (multi-token graph fragments or SM operations).

---

## 11. Open Items Arising from EM-4 Analysis

1. **Cancel bit in token format**: evaluate adding 1 bit to type-00/01
   tokens for garbage token discard. Requires: a format change in flit 2
   and discard logic in the PE input (check the cancel bit before
   entering the matching pipeline). Low hardware cost, potentially
   useful for conditional branch cleanup.

2. **IBU overflow strategy**: decide between backpressure (stall the
   bus — simple but blocks all traffic) and local overflow SRAM (more
   hardware but non-blocking). For the v0 shared bus at 4 PEs,
   backpressure may be acceptable.

3. **I-structure semantics in SM**: confirmed as wanted. The EM-4 puts
   structure storage in per-PE off-chip memory; we're keeping it in a
   dedicated SM module with synchronising memory semantics. Design the
   deferred-read queue.

4. **Instruction format bit reservation**: when finalising the IRAM
   encoding, reserve 1 bit for a continuation flag (NF equivalent) and
   1 bit for an output-mode flag (OUT equivalent). These cost nothing
   in v0 (the pipeline ignores them) and enable strongly connected
   blocks later.

5. **Strongly connected block feasibility study**: before implementing,
   compile some representative programs (Fibonacci, simple parallel
   computations) and measure what percentage of instructions could be
   placed in strongly connected blocks. This determines the practical
   performance benefit and justifies (or not) the compiler complexity.
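A first pass at the feasibility measurement in item 5 could be very small. The sketch below counts the fraction of nodes that could be chained to their consumer; the graph representation and the chaining criterion (exactly one output arc, both ends on the same PE) are our assumptions, not the papers' block constructor, which would also have to check side effects and register pressure:

```python
# Sketch: estimate what fraction of a compiled dataflow graph's
# instructions could be absorbed into strongly connected blocks.
# Criterion (assumed): a node chains to its consumer when it has
# exactly one output arc and both endpoints share a PE assignment.

from collections import defaultdict

def block_coverage(nodes, arcs, pe_of):
    """nodes: node names; arcs: (src, dst) pairs; pe_of: node -> PE."""
    outs = defaultdict(list)
    for src, dst in arcs:
        outs[src].append(dst)
    chained = {n for n in nodes
               if len(outs[n]) == 1 and pe_of[n] == pe_of[outs[n][0]]}
    return len(chained) / len(nodes)

# Tiny example: a 4-node chain whose last arc crosses a PE boundary.
nodes = ["sub", "cmp", "call", "add"]
arcs = [("sub", "cmp"), ("cmp", "call"), ("call", "add")]
pe_of = {"sub": 0, "cmp": 0, "call": 0, "add": 1}
print(block_coverage(nodes, arcs, pe_of))   # 0.5
```

Running this over the compiled Fibonacci and parallel test programs would give the percentage the open item asks for before any hardware is committed.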