OR-1 dataflow CPU sketch
# EM-4 Architecture Analysis

Deep-dive reference on the EM-4 (Electrotechnical Laboratory, Japan) and
its single-chip processor EMC-R, based on the ISCA '89 and IPPS '91
papers. Captures architectural details for reference even where our
design diverges.

Source papers:

- Sakai, Yamaguchi, Hiraki, Kodama, Yuba. "An Architecture of a Dataflow
  Single Chip Processor." Proc. ISCA '89, pp. 46–53, 1989.
- Sakai, Kodama, Yamaguchi. "Prototype Implementation of a Highly
  Parallel Dataflow Machine EM-4." Proc. IPPS '91, pp. 278–286, 1991.

See `Prior_Art_Reference_Guide_for_a_Discrete-Logic_Dynamic_Dataflow_CPU.md`
for the full bibliography, including the network paper (Parallel Computing
1993), the synchronisation paper (IEICE 1991), the successor chip EMC-Y,
and the EM-X system papers.

---

## 1. Project Context and Design Objectives

The EM-4 is a successor to SIGMA-1 (128-PE dataflow supercomputer,
ETL, operational 1988, >100 MFLOPS). SIGMA-1 PEs were multi-chip
(several gate arrays + large memory); direct scaling to 1000+ PEs was
impractical due to hardware volume and architectural complexity.

EM-4 design objectives:

1. 1000+ PE machine for general use (numerical, symbolic, simulation)
2. Single-chip PE (the EMC-R) to make scaling feasible
3. O(N) interconnection network (vs O(N log N) for multi-stage networks)
4. Improved dataflow execution efficiency via a new computation model

The 80-PE prototype was the first step toward a 1024-PE target. The
EMC-R chip was designed with a 1024-PE global address space even though
the prototype only used 80 PEs.

---

## 2. Identified Defects of Prior Dataflow Architectures

The EM-4 papers explicitly list six defects of "conventional" (i.e.
Manchester-style) dataflow machines. These are worth documenting because
several of them drove design choices in our architecture too.

**D1.
Circular pipeline underutilisation at low parallelism.** When fewer
tokens are in flight than N×S (N = PEs, S = pipeline stages), the
circular pipeline has bubbles. A single token going round the loop
achieves throughput < 1 instruction per pipeline circulation time. No
advanced control mechanism exists to fill the gap.

**D2. Packet-based architecture cannot exploit registers.** If every
intermediate result becomes a packet and re-enters the PE through the
input queue, there is no way to keep frequently-used values in fast
local storage. This also prevents fine-grained pipelining within a
single computation thread.

**D3. High matching overhead (colored token style).** Associative memory
or hashing for color matching requires complex control logic and
significant time per match operation.

**D4. Excessive packet traffic.** Every inter-node communication is a
packet, even between nodes that always execute on the same PE. The
network becomes the bottleneck.

**D5. No resource management primitives.** Mutual exclusion (test-and-set,
compare-and-swap) requires serialisation, which is difficult in a pure
dataflow model where any node can fire at any time.

**D6. Garbage token cleanup overhead.** Conditional branches (switches)
produce garbage tokens on the not-taken path. Collecting these at
program end is expensive.

### Relevance to Our Design

- D1: partially addressed by our monadic bypass (skip matching for
  single-input ops). Strongly connected blocks (deferred) would address
  it more completely.
- D2: we currently have no register file. The matching store SRAM serves
  as temporary operand storage. SM serves as shared scratch. A register
  file is deferred pending strongly connected block implementation.
- D3: solved — our direct-indexed matching with wire concatenation
  addressing has zero associative lookup overhead, same as the EM-4's
  direct matching.
- D4: partially addressed by static PE assignment (the compiler places
  communicating nodes on the same PE to avoid network traffic). Strongly
  connected blocks would further reduce intra-PE packet overhead.
- D5: SM with RMW operations provides atomic primitives. I-structure
  semantics (deferred) will add synchronising memory.
- D6: generation counters handle stale token detection. The cancel bit
  approach (EM-4's C field in the data part) is noted as a potential
  addition. See open items.

---

## 3. Strongly Connected Arc Model

The EM-4's most significant contribution. Arcs in the dataflow graph
are classified into two types:

- **Normal arcs**: standard dataflow — token matching, packet formation,
  full pipeline traversal.
- **Strongly connected arcs**: local to a PE — sequential register-based
  execution without packet formation or matching.

A **strongly connected block** is a subgraph whose internal arcs are all
strongly connected. The execution rule: once any node in a strongly
connected block fires, the entire block executes to completion on that
PE, exclusively. No other block can interrupt or interleave with it on
that PE during execution.

### Hardware Mechanism

The EMC-R instruction format contains:

- **M (mode) field** (1 bit): if zero, the current strongly connected
  block ends with the next instruction.
- **NF (next flag)** (1 bit): controls continuation within the block.
- **R0, R1** (4 bits each): strongly connected register references for
  the next instruction's operands. These index into a 16-entry register
  file in the EXU.
- **OUT** (1 bit): whether this instruction generates an output packet
  (1) or stores its result in a register (0, with the result going to R2).

When executing within a strongly connected block:

1. Pipeline stage 3 (instruction fetch + decode) and stage 4 (execute)
   repeat in an overlapped loop.
2. Stage 4's execution overlaps with stage 3's fetch of the next
   instruction in the block.
3. Operands come from the register file (R0, R1) or from the matching
   store (first instruction only).
4. Results go to the register file (R2) or out as packets.
5. When the block ends (M=0), the last instruction's execution overlaps
   with stages 1-2 of the *next incoming packet*.

The register-based pipeline throughput is up to **6×** the packet-based
circular pipeline throughput.

### Performance Example: Fibonacci

The papers give a concrete comparison on a Fibonacci subgraph:

- Pure dataflow program: 23 clocks for the recursive-call subgraph
- Strongly connected program: 9 clocks for the same computation

The strongly connected version packs type checking, branching, subtraction,
and MKPKT operations into a single block that executes sequentially using
registers, rather than creating separate tokens for each intermediate.

### Implications for Compiler

The compiler must:

1. Identify valid strongly connected blocks (subgraphs that can execute
   sequentially without violating data dependencies).
2. Annotate instructions with M, NF, R0, R1, R2, OUT fields.
3. Construct blocks automatically (the papers mention a "block
   constructor" in their compiler).

This is non-trivial compiler work but is amenable to standard
optimisation techniques (instruction scheduling, register allocation
within blocks).

### Advantages Summarised (from the paper)

- A1: enables the advanced control pipeline (deterministic execution
  within blocks)
- A2: register file for intra-block operands (no matching needed)
- A3: intra-block matching simplified (no color matching if the block is
  within a single function instance)
- A4: no packet transfers within blocks (reduces network traffic)
- A5: indivisible instruction sequences enable resource management
  (test-and-set, etc.)
- A6: garbage token cleanup simplified (just reset register file flags)

---

## 4. Direct Matching Scheme

The EM-4's matching mechanism, which is conceptually very close to our
design and to Monsoon's Explicit Token Store.

### Mechanism

At function invocation, two memory regions are bound:

- **Operand segment**: storage for waiting/matching operands. One word per
  dyadic instruction in the function body.
- **Template segment**: compiled instruction codes for the function.

The address of an instruction in the template segment has a **1:1 simple
correspondence** with the matching location in the operand segment. No
hashing, no associative lookup.

Matching operation:

1. Read the operand segment at the address corresponding to this
   instruction.
2. If partner data is present: match succeeds, clear the presence flag,
   proceed to instruction fetch.
3. If partner data is absent: write the incoming data, set the presence
   flag, token consumed.

The read-modify-write is completed in a single clock cycle (read in the
first half-clock, write/eliminate in the second half-clock).

### Comparison with Monsoon ETS

The 1991 paper explicitly compares with Monsoon:

> "Although the Explicit Token-Store scheme in the Monsoon Machine also
> uses the frame-based matching, the direct matching does not need an
> extra instruction field for data synchronization since a displacement
> of the matching place is in common with the instruction. This scheme
> realizes a much more efficient synchronization than Monsoon does and
> instructions and packets are fairly smaller, at the cost of memory
> space."

Key difference: the EM-4 shares the address between instruction fetch and
operand matching (they're at the same offset in their respective
segments), so no additional "frame offset" field is needed in the
instruction or token.
Monsoon needs a separate frame-pointer + slot-offset in each token.

### Comparison with Our Design

Our matching store uses the same principle:
`SRAM_address = ctx_slot : match_entry`, with `match_entry` derived from
the token's addressing fields, giving a 1:1 correspondence to
instruction addresses.

The key differences:

| Aspect            | EM-4                                                         | Our design                                |
| ----------------- | ------------------------------------------------------------ | ----------------------------------------- |
| Segment binding   | Dynamic (at function invocation)                             | Static (compiler assigns at compile time) |
| Segment size      | Sized per function body                                      | Sized per PE (fixed N slots × M entries)  |
| Presence tracking | Flags in operand segment words                               | Separate occupied bitmap                  |
| Multi-activation  | Multiple operand segments per function                       | Context slots with generation counters    |
| Template fetch    | Extra pipeline stage (TNF) to look up which template segment | Not needed (IRAM content is fixed per PE) |

---

## 5. EMC-R Chip Architecture

### Block Diagram (5 functional units + maintenance)

```
              ┌──────────────┐
Network In ──►│  Switching   │──► Network Out (port A)
Network In ──►│  Unit (SU)   │──► Network Out (port B)
              └──────┬───────┘
                     │ to local PE
              ┌──────▼───────┐
              │ Input Buffer │
              │  Unit (IBU)  │
              └──────┬───────┘
              ┌──────▼───────┐     ┌──────────┐
              │  Fetch and   │◄───►│ Off-chip │
              │  Matching    │     │  Memory  │
              │  Unit (FMU)  │◄──┐ │ (≤5 MB)  │
              └──────┬───────┘   │ └──────────┘
              ┌──────▼───────┐   │      ▲
              │  Execution   │   │      │
              │  Unit (EXU)  │───┘ ┌────┴────┐
              └──────┬───────┘     │ Memory  │
                     │             │ Control │
                     ▼             │  Unit   │
               Packet output       └─────────┘
              (back to SU)
```

### Gate Counts and Pin Usage

| Unit | Gates | Pins | Notes |
|------|-------|------|-------|
| Switching Unit (SU) | 10,112 (1989) / 9,179 (1991) | 176 | 3×3 packet switch, three-bank buffers, PRC |
| Input Buffer Unit (IBU) | 9,238 (1989) / 9,295 (1991) | - | 32-word FIFO (dual-port RAM on chip) |
| Fetch and Matching Unit (FMU) | 3,504 (1989) / 3,610 (1991) | - | Direct matching, sequencing, pipeline control |
| Execution Unit (EXU) | 19,692 (1989) / 20,620 (1991) | - | ALU, multiplier, barrel shifter, register file, packet gen |
| Memory Control Unit (MCU) | 1,518 / 1,664 | 67 | Address/data mux, arbitration |
| Maintenance Circuits | 1,589 / 1,420 | 12 | Init, error handling, dynamic monitor |
| **Total** | **45,653 / 45,788** | **255** | 1.5 μm CMOS gate array, 299-pin package |

Notable: the FMU (matching + sequencing) is only ~3.5K gates — the
direct matching scheme keeps it small. The SU (network switch) is ~10K
gates because of the three-bank deadlock prevention buffers. The EXU
dominates at ~20K gates due to the multiplier, barrel shifter, and
16-register file.
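The direct matching behaviour that keeps the FMU this small can be sketched in a few lines. This is an illustrative software model of the read-modify-write described in section 4, not EM-4 code; the class and method names are ours:

```python
# Minimal sketch of the FMU's direct-matching step (hypothetical names).
# The operand segment is ordinary RAM; the instruction's address doubles
# as the matching address, so no hashing or associative lookup is needed.

class OperandSegment:
    """One word + presence flag per dyadic instruction in the function."""

    def __init__(self, size):
        self.present = [False] * size
        self.word = [None] * size

    def match(self, addr, incoming):
        """One read-modify-write: return the waiting partner on a
        successful match, or None if the incoming token was stored."""
        if self.present[addr]:          # RD half-clock: partner waiting
            self.present[addr] = False  # EL half-clock: clear the flag
            return self.word[addr]      # instruction can now fire
        self.present[addr] = True       # WR half-clock: store and wait
        self.word[addr] = incoming
        return None

seg = OperandSegment(16)
assert seg.match(3, 7) is None   # first operand arrives: stored, waits
assert seg.match(3, 5) == 7      # second operand arrives: match fires
```

The whole matcher is a RAM access plus a flag test, which is why no associative hardware is needed.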
### Chip Implementation

- 1.5 μm CMOS gate array
- Inverter gate delay: 0.7 ns
- Clock: 80 ns (12.5 MHz)
- Peak performance: 12.5 MIPS per PE
- Package: 299-pin ceramic (43 power/ground, 255 signal, 1 unused)
- Fabricated in October 1989; no major bugs found in testing

---

## 6. Pipeline Organisation

Four stages, some with sub-stages. Two pipeline modes are integrated:

### Packet-Based Circular Pipeline (thin path)

The standard dataflow pipeline for tokens arriving from the network:

```
Stage 1: TNF    Template segment # fetch from off-chip memory
                (bypassed for non-normal packets)

Stage 2: Match  - For matching with immediate: IMF (immediate fetch)
                - For matching with stored data:
                    RD (first half-clock): read matching store
                    EL/WR (second half-clock): eliminate flag if match,
                    or write data if no match
                - Bypassed for single-operand (monadic) instructions

Stage 3: IF/DC  Instruction fetch (half-clock) + decode (half-clock)

Stage 4: EX     Execution + packet output (overlapped)
```

### Register-Based Advanced Control Pipeline (thick path)

For strongly connected block execution:

```
Stage 3 ◄──► Stage 4   (loop: fetch the next instruction while
                        executing the current instruction)
```

Stages 3 and 4 repeat until the strongly connected block ends. During
this loop, stages 1 and 2 can process the next incoming packet
concurrently.

### Overlap Properties

- Stage 4 (execution) overlaps with stage 3 (fetch) of the next
  instruction in a strongly connected block.
- When a block ends, stage 4 of the last block instruction overlaps
  with stages 1-2 of the next packet.
- The SU operates independently and concurrently with all PE stages.
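These overlap rules translate directly into clock counts. The toy model below is our simplification (one clock per stage, no memory contention, one token circulating alone), not the papers' cycle-accurate figures; it only illustrates the 4-clocks-per-instruction vs 1-clock-per-instruction contrast:

```python
# Toy clock-count model of the two pipeline modes (illustrative only;
# assumes one clock per stage and ignores memory contention).

def packet_pipeline_clocks(n):
    # Circular packet pipeline: a single circulating token pays the
    # full 4-stage traversal for every instruction (defect D1).
    return 4 * n

def strongly_connected_clocks(n):
    # Advanced control pipeline: after the first instruction fills the
    # pipeline, the stage 3-4 loop retires one instruction per clock.
    return 4 + (n - 1)

print(packet_pipeline_clocks(8))     # 32 clocks for an 8-node chain
print(strongly_connected_clocks(8))  # 11 clocks for the same chain
```

The papers report up to 6× in practice because block execution also removes matching and packet-formation work that this model does not count.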
### Throughput

- Packet-based circular pipeline: 1 instruction per 4 clocks (at best,
  with a full pipeline)
- Register-based advanced control: 1 instruction per clock (within a
  strongly connected block)
- Ratio: up to 6× throughput improvement for strongly connected blocks
  (4 stages × pipeline effects, empirical)

---

## 7. Instruction Set Architecture

### RISC Characteristics

- 26 instructions
- 4 instruction formats (2 base + 2 immediate variants)
- 2 memory addressing modes
- 16-entry register file
- No microcode
- Fixed packet size
- Few packet types
- Simple synchronisation (direct matching + register sequencing)

### Instruction Set (26 instructions)

**Arithmetic and Logic (14):**
ADD, SUB, MUL, DIV0-2, DIVR, DIVQ (division pipeline),
SHF (shift), AND, OR, EOR, NOT, ALUTST

**Branch (6):**
BEQ (equal), BGT (greater), BGE (greater/equal),
BTYPE (by data type), BTYPE2 (by 2 data types),
BOVF (overflow). All implemented as delayed branches.

**Memory or Register (4):**
L (load), S (store), LS (load-and-store), LDR (load from register)

**Others (2):**
GET (remote operation — sends return address to operand destination),
MKPKT (make packet — explicit packet construction from two operands)

### Instruction Format

38-bit instruction word (stored in off-chip memory).

Without packet output (OUT=0):
```
| OP | AUX | TRC | M | NF | R0 | R1 | OUT | R2 | BC | DPL |
|  5 |  7  |  1  | 1 |  1 |  4 |  4 |  1  |  4 |  2 |  8  |
```

With packet output (OUT=1):
```
| OP | AUX | TRC | M | NF | R0 | R1 | OUT | WCF | M2 | CA | DPL |
|  5 |  7  |  1  | 1 |  1 |  4 |  4 |  1  |  2  |  1 |  3 |  8  |
```

Field descriptions:

- **OP** (5 bits): opcode (26 instructions)
- **AUX** (7 bits): secondary opcode field (e.g.
  shift direction/amount)
- **TRC** (1 bit): trace control
- **M** (1 bit): mode — 0 = strongly connected block ends after the next
  instruction
- **NF** (1 bit): next instruction flag (continuation within block)
- **R0, R1** (4 bits each): register file addresses for operands
- **OUT** (1 bit): generate output packet (1) or store in register (0)
- **R2** (4 bits): destination register (when OUT=0)
- **BC** (2 bits): branch condition (when OUT=0, branch instructions only)
- **DPL** (8 bits): displacement (branch target or packet address offset)
- **WCF** (2 bits): waiting condition flag for the output packet's
  matching type
- **M2** (1 bit): output packet arc type (normal vs strongly connected)
- **CA** (3 bits): column address within PE group

### Packet Format

78 bits total: 39-bit address part + 39-bit data part (fixed size).

Address part (39 bits):
```
| HST | PT | WCF | M | GA | CA | MA |
|  1  |  5 |  2  | 1 |  7 |  3 | 20 |
```

- **HST** (1 bit): bound for host (vs normal destination)
- **PT** (5 bits): packet type
- **WCF** (2 bits): waiting condition flag (matching type)
- **M** (1 bit): arc type (normal=0, strongly connected=1)
- **GA** (7 bits): destination PE group address (supports 128 groups)
- **CA** (3 bits): column/member address within group (supports 8 per
  group; prototype uses 5 per group)
- **MA** (20 bits): memory address (matching address for normal data
  packets)

Data part (39 bits):
```
| C | * | DT | D  |
| 1 | 3 |  3 | 32 |
```

- **C** (1 bit): cancel bit (packet is nonsense/garbage — discard on
  arrival)
- **\*** (3 bits): reserved
- **DT** (3 bits): data type tag
- **D** (32 bits): data value

### Special Packets

Non-data packets (function control, structure access, remote memory
access) have special PT (packet type) values and are "colorless" (no
operand segment number). Generated by MKPKT or GET instructions.
Processed by a per-PE **SP Monitor** — a special strongly connected block
that the system manager configures at initialisation. This makes special
packet handling completely programmable.

### Macro Instructions

Complex operations (integer division, function call, complex structure
ops) are implemented as strongly connected blocks containing simpler
instructions, rather than as single complex instructions. The division
pipeline (DIV0, DIV1, DIV2, DIVR, DIVQ) is a good example: five simple
instructions that chain via registers to implement non-restoring division.

---

## 8. Interconnection Network

### Processor Connected Omega Network

Topology: each PE contains a 3×3 packet switch (the SU) as an element
of the network. PEs are directly interconnected; there are no separate
switching nodes.

Properties:

- Average distance from any PE to any other: O(log N)
- Connection links per PE: small constant (3 input, 3 output in the SU)
- Total switching elements: O(N) — smaller than O(N log N) for
  multi-stage omega networks
- Self-routing based on the destination address

### Deadlock Prevention

Store-and-forward deadlock is prevented by a **three-bank buffer** at each
SU input port. Packets entering the network start in the lowest bank.
When a packet reaches stage 0 of the network (one hop), it is promoted
to the next bank level. Because the circular omega topology guarantees
no packet ever needs more than two rounds through the network, three
bank levels prevent circular buffer dependencies.

### Load Balancing

The SU contains a **Packet Rewriting Controller (PRC)** that implements
function-level dynamic load balancing. Special **MLPE (Minimum Load PE)**
packets circulate through the network. The PRC in each SU rewrites
these packets so they always contain the address and load of the
least-loaded PE encountered so far.
PEs preparing to make function calls read MLPE packets from a local FIFO
to determine where to place new function instances.

This is done within normal packet transfer time (no overhead) using
the PRC, which operates in parallel with normal routing.

### Prototype Network Implementation

- Intra-group (5 PEs per group): traces on the PE group board
- Inter-group (same rack half): mother board connections
- Inter-rack-half: shielded cables
- Each network path: 44 signal lines (5 control + 39 data)
- Transfer rate: 60.9 MB/s per port; 14.63 GB/s total
- Transfer delay: 80 ns between adjacent nodes; ~410 ns average

---

## 9. EM-4 Prototype Details

### Organisation

- 80 PEs in 16 groups of 5 (group size determined by omega network
  topology)
- Each PE: EMC-R + Memory Address Register (MAR) + off-chip SRAM
- Off-chip memory per PE: 1.31 MB SRAM (max 5.25 MB)
- Host computer: SUN3/260 + VME bus
- Packet interface processor: controls system clock, power, packet I/O
- Packet interface switch: connects host to PE groups

### Physical Implementation

- Single rack: 60.0 cm × 92.0 cm × 140.5 cm
- 16 PE group boards (4-layer multi-wire, 50.8 cm × 47.0 cm each)
- 8 boards in the upper rack half, 8 in the lower half + interface switch
- Single synchronisation clock: 12.5 MHz
- Power: 3-phase 200 V supply; ~2.6 kW total (PEs ~2.42 kW)

### Memory Organisation (per PE)

The off-chip memory is a unified address space partitioned into:

- Secondary packet buffer (overflow from the 32-word on-chip FIFO)
- Matching store (operand segments)
- Instruction store (template segments)
- SP Monitor area
- Structure store
- Working area

All accessible from the EXU as a single address space via the MCU.

### Program Execution Flow

1. **Initialisation**: host + maintenance circuits reset the system, set
   PE numbers, configure memory segment linkages, load monitor routines.
2. **Program loading**: assembled code is sent via packets using
   "broadcasting" — each PE loads function codes into its memory and
   forwards them to uninitialised PEs.
3. **Execution start**: the host sends input packets to the appropriate
   matching addresses via the packet interface processor.
4. **Execution**: function instances invoke other functions. Results flow
   back to parent functions or to the packet interface.
5. **Termination**: resultant packets are sent to the packet interface
   processor.

### Performance Results (80 PEs)

| Program | EM-4 time (s) | MIPS | vs SPARC330 | vs VAX8800 | vs CRAY/X-MP |
|---------|---------------|------|-------------|------------|--------------|
| FIB(23) | 0.00453 | 223 | 12× | 30× | 19× |
| PRIME(65536) | 0.506 | 508 | 33× | 75× | 21× |
| PI(4000) | 0.369 | 824 | 94× | 101× | 13× |
| SUM(65535) | 0.00042 | 780 | 25× | 142× | 6× |
| MTRX(80×80) | 0.00888 | 815 | 79× | 225× | 0.4× |

Note: EM-4 programs were in assembly language; comparison programs were
compiled C (SPARC, VAX) or FORTRAN (CRAY) with maximum optimisation.

The CRAY/X-MP outperforms the EM-4 only on matrix multiplication, where
the CRAY's vector pipeline dominates (2-3 instructions per inner loop
vs the EM-4's 12). The EM-4 is 6-21× faster on everything else.

PI calculation achieves the highest MIPS (824, vs a theoretical peak of
1000) because all operations can be statically scheduled with local-only
communication. Fibonacci achieves the lowest (223) due to heavy packet
traffic causing IBU FIFO overflow.

---

## 10. Analysis: Relevance to Our Architecture

### Validated Design Choices

1. **Direct matching with ordinary SRAM.** The EM-4 confirms this works
   in real hardware. Their FMU is only ~3.5K gates with direct matching.
   Our matching store should be similarly simple.

2. **Static PE assignment reduces matching hardware.** Though the EM-4
   still does dynamic function-to-group allocation, their acknowledgment
   that compiler-directed allocation reduces matching overhead supports
   our more aggressive static assignment approach.

3. **RISC instruction philosophy.** 26 instructions, no microcode, simple
   formats. Validates our EEPROM-decoded instruction approach.

4. **Single-cycle matching via half-clock read-modify-write.** The EM-4
   achieves this within their 80 ns clock. We should be able to do the
   same with modern async SRAM (~15-25 ns access) within our target
   clock period.

### Features We Should Adopt or Plan For

1. **Instruction format continuation bit.** Reserve 1 bit in the
   instruction encoding for "next instruction follows sequentially"
   (EM-4's NF flag). Even without strongly connected block execution
   in v0, this preserves the option. Zero hardware cost — it's just
   an IRAM bit that the v0 pipeline ignores.

2. **Cancel bit in token format.** The EM-4's C bit in the data part
   marks garbage tokens for discard. Consider adding this to our token
   format (1 bit in flit 2 or flit 3) to handle not-taken branch
   cleanup without relying solely on generation counters. Generation
   counters catch stale matches; the cancel bit catches tokens still
   in flight.

3. **IBU overflow to off-chip SRAM.** The EM-4's two-level buffering
   (32-word on-chip FIFO + 8K-word off-chip SRAM) is pragmatic. We
   need an explicit overflow/backpressure strategy for PE input FIFOs.

4. **SP Monitor concept.** Special packet handling as a programmable
   strongly connected block rather than fixed-function hardware.
   Interesting for our I/O and system packet handling — rather than a
   fixed I/O controller state machine, a configurable handler built
   from normal instructions.
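The cancel-bit check from item 2 would sit at the very front of the PE input stage, before any matching work. A minimal sketch, assuming a hypothetical layout in which the C bit is the top bit of a 39-bit data part (mirroring the EM-4's placement; our flit assignment is not yet fixed):

```python
# Sketch of cancel-bit filtering at the PE input stage. The bit
# position is a hypothetical layout choice mirroring the EM-4's data
# part (C = bit 38 of a 39-bit word), not our finalised token format.

C_BIT = 1 << 38

def accept(data_part):
    """Return True if the token should enter the matching pipeline;
    garbage tokens (C=1) are discarded on arrival."""
    return not (data_part & C_BIT)

live = 0x2A            # C=0: normal token carrying the value 42
dead = C_BIT | 0x2A    # C=1: not-taken branch output, discard
assert accept(live)
assert not accept(dead)
```

In hardware this is one gate on the FIFO read path, which is why the cost is described as low.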
### Features Evaluated and Deferred

1. **Strongly connected blocks.** The performance benefit is clear (up
   to 6× throughput within blocks, 2.5× on the Fibonacci subgraph). The
   hardware cost is moderate (16-register file + sequencing logic).
   But it requires significant compiler infrastructure (block
   identification, register allocation within blocks). **Deferred to
   post-v0.** The instruction format should reserve the continuation
   bit to enable future implementation.

2. **PE-embedded network switch.** The EMC-R's SU (~10K gates) operates
   independently alongside the PE. Relevant only if we move from a
   shared bus to a point-to-point topology. Not applicable to the v0
   shared-bus design.

3. **Dynamic load balancing (MLPE).** Makes sense at 80-1000 PEs, not
   at 4. Static compiler assignment is sufficient for our scale.

4. **16-register file.** Closely tied to strongly connected blocks. The
   EM-4 uses it for intra-block operand passing (R0, R1, R2 fields).
   Without strongly connected execution, there's no register file
   traffic pattern. The matching store SRAM and SM handle all operand
   staging for v0. If/when strongly connected blocks are added, a small
   register file (using SRAM with dedicated address lines, or a
   74670-style register file chip) should be added simultaneously.

### Features Not Applicable

1. **Processor connected omega network.** O(N) hardware, O(log N)
   distance. Irrelevant for a 4-PE shared bus. Worth revisiting if
   scaling to 16+ PEs.

2. **Template segment number fetch (pipeline stage 1).** The EM-4 needs
   this because operand segments are dynamically bound to template
   segments at function invocation. Our static PE assignment with
   compiler-fixed IRAM contents eliminates this stage entirely, saving
   one pipeline stage.

3. **78-bit fixed-size packets.** Our variable-length multi-flit tokens
   on a 16-bit bus are better suited to discrete logic, where wide buses
   are physically expensive. The EM-4 could use fixed-size packets
   because everything was on-chip or on a custom bus.

4. **Macro instructions via strongly connected blocks.** Complex
   operations (division, function call) are decomposed into simple
   instruction sequences within strongly connected blocks. Without
   strongly connected execution in v0, we handle complex operations
   differently (multi-token graph fragments or SM operations).

---

## 11. Open Items Arising from EM-4 Analysis

1. **Cancel bit in token format**: evaluate adding 1 bit to type-00/01
   tokens for garbage token discard. Requires: a format change in flit 2
   and discard logic in the PE input (check the cancel bit before
   entering the matching pipeline). Low hardware cost, potentially
   useful for conditional branch cleanup.

2. **IBU overflow strategy**: decide between backpressure (stall the
   bus — simple but blocks all traffic) and local overflow SRAM (more
   hardware but non-blocking). For the v0 shared bus at 4 PEs,
   backpressure may be acceptable.

3. **I-structure semantics in SM**: confirmed as wanted. The EM-4 puts
   structure storage in per-PE off-chip memory; we're keeping it in a
   dedicated SM module with synchronising memory semantics. Design the
   deferred-read queue.

4. **Instruction format bit reservation**: when finalising the IRAM
   encoding, reserve 1 bit for a continuation flag (NF equivalent) and
   1 bit for an output-mode flag (OUT equivalent). These cost nothing
   in v0 (the pipeline ignores them) and enable strongly connected
   blocks later.

5. **Strongly connected block feasibility study**: before implementing,
   compile some representative programs (Fibonacci, simple parallel
   computations) and measure what percentage of instructions could be
   placed in strongly connected blocks. This determines the practical
   performance benefit and justifies (or not) the compiler complexity.
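A first pass at the feasibility measurement in item 5 could be very small. The sketch below counts the fraction of nodes that could be chained to their consumer; the graph representation and the chaining criterion (exactly one output arc, both ends on the same PE) are our assumptions, not the papers' block constructor, which would also have to check side effects and register pressure:

```python
# Sketch: estimate what fraction of a compiled dataflow graph's
# instructions could be absorbed into strongly connected blocks.
# Criterion (assumed): a node chains to its consumer when it has
# exactly one output arc and both endpoints share a PE assignment.

from collections import defaultdict

def block_coverage(nodes, arcs, pe_of):
    """nodes: node names; arcs: (src, dst) pairs; pe_of: node -> PE."""
    outs = defaultdict(list)
    for src, dst in arcs:
        outs[src].append(dst)
    chained = {n for n in nodes
               if len(outs[n]) == 1 and pe_of[n] == pe_of[outs[n][0]]}
    return len(chained) / len(nodes)

# Tiny example: a 4-node chain whose last arc crosses a PE boundary.
nodes = ["sub", "cmp", "call", "add"]
arcs = [("sub", "cmp"), ("cmp", "call"), ("call", "add")]
pe_of = {"sub": 0, "cmp": 0, "call": 0, "add": 1}
print(block_coverage(nodes, arcs, pe_of))   # 0.5
```

Running this over the compiled Fibonacci and parallel test programs would give the percentage the open item asks for before any hardware is committed.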