EM-4 Architecture Analysis#
Deep-dive reference on the EM-4 (Electrotechnical Laboratory, Japan) and its single-chip processor EMC-R, based on the ISCA '89 and IPPS '91 papers. Captures architectural details for reference even where our design diverges.
Source papers:
- Sakai, Yamaguchi, Hiraki, Kodama, Yuba. "An Architecture of a Dataflow Single Chip Processor." Proc. ISCA '89, pp. 46–53, 1989.
- Sakai, Kodama, Yamaguchi. "Prototype Implementation of a Highly Parallel Dataflow Machine EM-4." Proc. IPPS '91, pp. 278–286, 1991.
See Prior_Art_Reference_Guide_for_a_Discrete-Logic_Dynamic_Dataflow_CPU.md
for full bibliography including the network paper (Parallel Computing
1993), synchronisation paper (IEICE 1991), successor chip EMC-Y, and
EM-X system papers.
1. Project Context and Design Objectives#
The EM-4 is a successor to SIGMA-1 (128-PE dataflow supercomputer, ETL, operational 1988, >100 MFLOPS). SIGMA-1 PEs were multi-chip (several gate arrays + large memory); direct scaling to 1000+ PEs was impractical due to hardware volume and architectural complexity.
EM-4 design objectives:
- 1000+ PE machine for general use (numerical, symbolic, simulation)
- Single-chip PE (the EMC-R) to make scaling feasible
- O(N) interconnection network (vs O(N log N) for multi-stage networks)
- Improve dataflow execution efficiency via new computation model
The 80-PE prototype was the first step toward a 1024-PE target. The EMC-R chip was designed with a 1024-PE global address space even though the prototype only used 80 PEs.
2. Identified Defects of Prior Dataflow Architectures#
The EM-4 papers explicitly list six defects of "conventional" (i.e. Manchester-style) dataflow machines. These are worth documenting because several of them drove design choices in our architecture too.
D1. Circular pipeline underutilisation at low parallelism. When fewer tokens are in flight than N×S (N = PEs, S = pipeline stages), the circular pipeline has bubbles. A single token going round the loop completes only one instruction per full pipeline circulation (S clocks), far below one instruction per clock. No advanced control mechanism exists to fill the gap.
D2. Packet-based architecture cannot exploit registers. If every intermediate result becomes a packet and re-enters the PE through the input queue, there is no way to keep frequently-used values in fast local storage. This also prevents fine-grained pipelining within a single computation thread.
D3. High matching overhead (colored token style). Associative memory or hashing for color matching requires complex control logic and significant time per match operation.
D4. Excessive packet traffic. Every inter-node communication is a packet, even between nodes that always execute on the same PE. The network becomes the bottleneck.
D5. No resource management primitives. Mutual exclusion (test-and-set, compare-and-swap) requires serialisation, which is difficult in a pure dataflow model where any node can fire at any time.
D6. Garbage token cleanup overhead. Conditional branches (switches) produce garbage tokens on the not-taken path. Collecting these at program end is expensive.
Relevance to Our Design#
- D1: partially addressed by our monadic bypass (skip matching for single-input ops). Strongly connected blocks (deferred) would address it more completely.
- D2: we currently have no register file. The matching store SRAM serves as temporary operand storage. SM serves as shared scratch. A register file is deferred pending strongly connected block implementation.
- D3: solved — our direct-indexed matching with wire concatenation addressing has zero associative lookup overhead, same as the EM-4's direct matching.
- D4: partially addressed by static PE assignment (compiler places communicating nodes on the same PE to avoid network traffic). Strongly connected blocks would further reduce intra-PE packet overhead.
- D5: SM with RMW operations provides atomic primitives. I-structure semantics (deferred) will add synchronising memory.
- D6: generation counters handle stale token detection. The cancel bit approach (EM-4's C field in the data part) is noted as a potential addition. See open items.
3. Strongly Connected Arc Model#
The EM-4's most significant contribution. Arcs in the dataflow graph are classified into two types:
- Normal arcs: standard dataflow — token matching, packet formation, full pipeline traversal.
- Strongly connected arcs: local to a PE — sequential register-based execution without packet formation or matching.
A strongly connected block is a subgraph whose internal arcs are all strongly connected. The execution rule: once any node in a strongly connected block fires, the entire block executes to completion on that PE, exclusively. No other block can interrupt or interleave with it on that PE during execution.
Hardware Mechanism#
The EMC-R instruction format contains:
- M (mode) field (1 bit): if zero, the current strongly connected block ends with the next instruction.
- NF (next flag) (1 bit): controls continuation within the block.
- R0, R1 (4 bits each): strongly connected register references for the next instruction's operands. These index into a 16-entry register file in the EXU.
- OUT (1 bit): whether this instruction generates an output packet (1) or stores its result in a register (0, with result going to R2).
When executing within a strongly connected block:
- Pipeline stage 3 (instruction fetch + decode) and stage 4 (execute) repeat in an overlapped loop.
- Stage 4's execution overlaps with stage 3's fetch of the next instruction in the block.
- Operands come from the register file (R0, R1) or from the matching store (first instruction only).
- Results go to the register file (R2) or out as packets.
- When the block ends (M=0), the last instruction's execution overlaps with stages 1-2 of the next incoming packet.
The register-based pipeline throughput is up to 6x the packet-based circular pipeline throughput.
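The block execution rule can be sketched behaviourally. This is an illustrative Python model, not the chip's logic: the dict encoding and opcode set are invented, the field names follow the M/R0/R1/OUT descriptions above, and the one-instruction-ahead prefetch of R0/R1 is modelled simply by reading the next operands after each execute.

```python
def run_block(template, start, regs, first_operands, emit_packet):
    """Execute a strongly connected block sequentially from `start`.

    `template` is a list of dicts with keys op, m, r0, r1, out, r2
    (a hypothetical encoding, not the real 38-bit format). The first
    instruction takes its operands from the matching store
    (`first_operands`); later instructions read the register file.
    """
    ops = {"ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b,
           "MUL": lambda a, b: a * b}
    pc = start
    left, right = first_operands           # from the matching store
    while True:
        inst = template[pc]
        result = ops[inst["op"]](left, right)
        if inst["out"]:                    # OUT=1: result leaves as a packet
            emit_packet(result)
        else:                              # OUT=0: result stays in a register
            regs[inst["r2"]] = result
        if inst["m"] == 0:                 # simplification: M=0 marks the
            return regs                    # last instruction of the block
        pc += 1                            # no matching, no packet round trip
        left, right = regs[inst["r0"]], regs[inst["r1"]]
```

Note the key property the model preserves: once entered, the block runs to completion with operands staged in registers, so no packet formation or matching occurs between its instructions.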
Performance Example: Fibonacci#
The papers give a concrete comparison on a Fibonacci subgraph:
- Pure dataflow program: 23 clocks for the recursive-call subgraph
- Strongly connected program: 9 clocks for the same computation
The strongly connected version packs type checking, branching, subtraction, and MKPKT operations into a single block that executes sequentially using registers, rather than creating separate tokens for each intermediate.
Implications for Compiler#
The compiler must:
- Identify valid strongly connected blocks (subgraphs that can execute sequentially without violating data dependencies).
- Annotate instructions with M, NF, R0, R1, R2, OUT fields.
- Construct blocks automatically (the papers mention a "block constructor" in their compiler).
This is non-trivial compiler work but is amenable to standard optimisation techniques (instruction scheduling, register allocation within blocks).
Advantages Summarised (from the paper)#
- A1: enables advanced control pipeline (deterministic execution within blocks)
- A2: register file for intra-block operands (no matching needed)
- A3: intra-block matching simplified (no color matching if block is within a single function instance)
- A4: no packet transfers within blocks (reduces network traffic)
- A5: indivisible instruction sequences enable resource management (test-and-set, etc.)
- A6: garbage token cleanup simplified (just reset register file flags)
4. Direct Matching Scheme#
The EM-4's matching mechanism, which is conceptually very close to our design and to Monsoon's Explicit Token Store.
Mechanism#
At function invocation, two memory regions are bound:
- Operand segment: storage for waiting/matching operands. One word per dyadic instruction in the function body.
- Template segment: compiled instruction codes for the function.
The address of an instruction in the template segment has a simple 1:1 correspondence with the matching location in the operand segment. No hashing, no associative lookup.
Matching operation:
- Read the operand segment at the address corresponding to this instruction.
- If partner data is present: match succeeds, clear the presence flag, proceed to instruction fetch.
- If partner data is absent: write incoming data, set presence flag, token consumed.
The read-modify-write is completed in a single clock cycle (read in first half-clock, write/eliminate in second half-clock).
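A behavioural sketch of the matching operation (class and method names are illustrative, not from the papers):

```python
# One operand-segment word per dyadic instruction; the word holds a
# presence flag and, when set, the waiting partner operand.
class OperandSegment:
    def __init__(self, size):
        self.present = [False] * size   # presence flags
        self.data = [None] * size       # waiting operands

    def arrive(self, addr, value):
        """Token arrives for the dyadic instruction at `addr`.

        Returns the matched operand pair if the partner was waiting,
        or None if this token was stored (consumed) to wait. Models the
        single-cycle read-modify-write: read in the first half-clock,
        write or eliminate in the second.
        """
        if self.present[addr]:          # partner waiting: match succeeds
            self.present[addr] = False  # eliminate (clear presence flag)
            return (self.data[addr], value)
        self.present[addr] = True       # no partner: store and wait
        self.data[addr] = value
        return None
```

The first token to an address always stores; the second always matches, so each dyadic instruction fires exactly once per pair of arrivals.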
Comparison with Monsoon ETS#
The 1991 paper explicitly compares with Monsoon:
"Although the Explicit Token-Store scheme in the Monsoon Machine also uses the frame-based matching, the direct matching does not need an extra instruction field for data synchronization since a displacement of the matching place is in common with the instruction. This scheme realizes a much more efficient synchronization than Monsoon does and instructions and packets are fairly smaller, at the cost of memory space."
Key difference: EM-4 shares the address between instruction fetch and operand matching (they're at the same offset in their respective segments), so no additional "frame offset" field is needed in the instruction or token. Monsoon needs a separate frame-pointer + slot-offset in each token.
Comparison with Our Design#
Our matching store uses the same principle: SRAM_address = ctx_slot : match_entry, with match_entry derived from the token's addressing fields, giving 1:1 correspondence to instruction addresses.
The key differences:
| Aspect | EM-4 | Our design |
|---|---|---|
| Segment binding | Dynamic (at function invocation) | Static (compiler assigns at compile time) |
| Segment size | Sized per function body | Sized per PE (fixed N slots × M entries) |
| Presence tracking | Flags in operand segment words | Separate occupied bitmap |
| Multi-activation | Multiple operand segments per function | Context slots with generation counters |
| Template fetch | Extra pipeline stage (TNF) to look up which template segment | Not needed (IRAM content is fixed per PE) |
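The address formation on our side (SRAM_address = ctx_slot : match_entry) is plain wire concatenation. As a sketch, with a placeholder entry width (the actual field widths are not fixed by this document):

```python
def matching_address(ctx_slot, match_entry, entry_bits=6):
    # Wire-concatenation addressing: the context slot forms the high
    # address bits, the match entry the low bits. `entry_bits` is a
    # placeholder, not a value fixed anywhere in this document.
    return (ctx_slot << entry_bits) | match_entry
```

In hardware this costs nothing: the two fields simply drive disjoint SRAM address lines, which is why the scheme has zero associative-lookup overhead.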
5. EMC-R Chip Architecture#
Block Diagram (5 functional units + maintenance)#
┌──────────────┐
Network In ──►│ Switching │──► Network Out (port A)
Network In ──►│ Unit (SU) │──► Network Out (port B)
└──────┬───────┘
│ to local PE
┌──────▼───────┐
│ Input Buffer │
│ Unit (IBU) │
└──────┬───────┘
┌──────▼───────┐ ┌──────────┐
│ Fetch and │◄───►│ Off-chip │
│ Matching │ │ Memory │
│ Unit (FMU) │◄──┐ │ (≤5 MB) │
└──────┬───────┘ │ └──────────┘
┌──────▼───────┐ │ ▲
│ Execution │ │ │
│ Unit (EXU) │───┘ ┌────┴────┐
└──────┬───────┘ │ Memory │
│ │ Control │
▼ │ Unit │
Packet output └─────────┘
(back to SU)
Gate Counts and Pin Usage#
| Unit | Gates | Pins | Notes |
|---|---|---|---|
| Switching Unit (SU) | 10,112 (1989) / 9,179 (1991) | 176 | 3×3 packet switch, three-bank buffers, PRC |
| Input Buffer Unit (IBU) | 9,238 (1989) / 9,295 (1991) | - | 32-word FIFO (dual-port RAM on chip) |
| Fetch and Matching Unit (FMU) | 3,504 (1989) / 3,610 (1991) | - | Direct matching, sequencing, pipeline control |
| Execution Unit (EXU) | 19,692 (1989) / 20,620 (1991) | - | ALU, multiplier, barrel shifter, register file, packet gen |
| Memory Control Unit (MCU) | 1,518 / 1,664 | 67 | Address/data mux, arbitration |
| Maintenance Circuits | 1,589 / 1,420 | 12 | Init, error handling, dynamic monitor |
| Total | 45,653 / 45,788 | 255 | 1.5μm CMOS gate array, 299-pin package |
Notable: the FMU (matching + sequencing) is only ~3.5K gates — the direct matching scheme keeps it small. The SU (network switch) is ~10K gates because of the three-bank deadlock prevention buffers. The EXU dominates at ~20K gates due to the multiplier, barrel shifter, and 16-entry register file.
Chip Implementation#
- 1.5 μm CMOS gate array
- Inverter gate delay: 0.7 ns
- Clock: 80 ns (12.5 MHz)
- Peak performance: 12.5 MIPS per PE
- Package: 299-pin ceramic (43 power/ground, 255 signal, 1 unused)
- Fabricated in October 1989; no major bugs found in testing
6. Pipeline Organisation#
Four stages, some split into half-clock sub-stages. Two pipeline modes integrate into one structure:
Packet-Based Circular Pipeline (thin path)#
The standard dataflow pipeline for tokens arriving from the network:
- Stage 1 (TNF): template segment number fetch from off-chip memory. Bypassed for non-normal packets.
- Stage 2 (Match): bypassed for single-operand (monadic) instructions.
  - Matching with an immediate: IMF (immediate fetch).
  - Matching with stored data: RD (first half-clock) reads the matching store; EL/WR (second half-clock) eliminates the presence flag on a match, or writes the incoming data on no match.
- Stage 3 (IF/DC): instruction fetch (half-clock) + decode (half-clock).
- Stage 4 (EX): execution + packet output (overlapped).
Register-Based Advanced Control Pipeline (thick path)#
For strongly connected block execution:
Stage 3 ◄──► Stage 4 (loop: fetch next instruction while executing
current instruction)
Stages 3 and 4 repeat until the strongly connected block ends. During this loop, stages 1 and 2 can process the next incoming packet concurrently.
Overlap Properties#
- Stage 4 (execution) overlaps with stage 3 (fetch) of the next instruction in a strongly connected block.
- When a block ends, stage 4 of the last block instruction overlaps with stages 1-2 of the next packet.
- The SU operates independently and concurrently with all PE stages.
Throughput#
- Packet-based circular pipeline: 1 instruction per 4 clocks (at best, with full pipeline)
- Register-based advanced control: 1 instruction per clock (within a strongly connected block)
- Ratio: nominally 4× (one per clock vs one per 4 clocks); the papers report up to 6× in practice, because the circular pipeline rarely sustains its best case (see defect D1).
7. Instruction Set Architecture#
RISC Characteristics#
- 26 instructions
- 4 instruction formats (2 base + 2 immediate variants)
- 2 memory addressing modes
- 16-entry register file
- No microcode
- Fixed packet size
- Few packet types
- Simple synchronisation (direct matching + register sequencing)
Instruction Set (26 instructions)#
Arithmetic and Logic (14): ADD, SUB, MUL, DIV0-2, DIVR, DIVQ (division pipeline), SHF (shift), AND, OR, EOR, NOT, ALUTST
Branch (6): BEQ (equal), BGT (greater), BGE (greater/equal), BTYPE (by data type), BTYPE2 (by 2 data types), BOVF (overflow). All implemented as delayed branches.
Memory or Register (4): L (load), S (store), LS (load-and-store), LDR (load from register)
Others (2): GET (remote operation — sends return address to operand destination), MKPKT (make packet — explicit packet construction from two operands)
Instruction Format#
38-bit instruction word (stored in off-chip memory).
Without packet output (OUT=0):
| OP | AUX | TRC | M | NF | R0 | R1 | OUT | R2 | BC | DPL |
| 5 | 7 | 1 | 1 | 1 | 4 | 4 | 1 | 4 | 2 | 8 |
With packet output (OUT=1):
| OP | AUX | TRC | M | NF | R0 | R1 | OUT | WCF | M2 | CA | DPL |
| 5 | 7 | 1 | 1 | 1 | 4 | 4 | 1 | 2 | 1 | 3 | 8 |
Field descriptions:
- OP (5 bits): opcode (26 instructions)
- AUX (7 bits): secondary opcode field (e.g. shift direction/amount)
- TRC (1 bit): trace control
- M (1 bit): mode — 0 = strongly connected block ends after next instruction
- NF (1 bit): next instruction flag (continuation within block)
- R0, R1 (4 bits each): register file addresses for operands
- OUT (1 bit): generate output packet (1) or store in register (0)
- R2 (4 bits): destination register (when OUT=0)
- BC (2 bits): branch condition (when OUT=0, branch instructions only)
- DPL (8 bits): displacement (branch target or packet address offset)
- WCF (2 bits): waiting condition flag for output packet matching type
- M2 (1 bit): output packet arc type (normal vs strongly connected)
- CA (3 bits): column address within PE group
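A field-extraction sketch for the OUT=0 layout. The widths and field order are as tabulated above; the placement of OP in the most significant bits is our assumption, since the papers give only the field order:

```python
# Field (name, width) pairs for the 38-bit instruction word, OUT=0 layout.
OUT0_FIELDS = [("OP", 5), ("AUX", 7), ("TRC", 1), ("M", 1), ("NF", 1),
               ("R0", 4), ("R1", 4), ("OUT", 1), ("R2", 4), ("BC", 2),
               ("DPL", 8)]

def decode(word, fields=OUT0_FIELDS):
    """Slice a packed instruction word into named fields, MSB-first."""
    out, shift = {}, sum(w for _, w in fields)   # total = 38 bits
    for name, width in fields:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out
```

The same routine decodes the OUT=1 layout by substituting its field list (WCF, M2, CA, DPL replacing R2, BC, DPL).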
Packet Format#
78 bits total: 39-bit address part + 39-bit data part (fixed size).
Address part (39 bits):
| HST | PT | WCF | M | GA | CA | MA |
| 1 | 5 | 2 | 1 | 7 | 3 | 20 |
- HST (1 bit): bound for host (vs normal destination)
- PT (5 bits): packet type
- WCF (2 bits): waiting condition flag (matching type)
- M (1 bit): arc type (normal=0, strongly connected=1)
- GA (7 bits): destination PE group address (supports 128 groups)
- CA (3 bits): column/member address within group (supports 8 per group; prototype uses 5 per group)
- MA (20 bits): memory address (matching address for normal data packets)
Data part (39 bits):
| C | * | DT | D |
| 1 | 3 | 3 | 32 |
- C (1 bit): cancel bit (packet is nonsense/garbage — discard on arrival)
- * (3 bits): reserved
- DT (3 bits): data type tag
- D (32 bits): data value
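The two 39-bit halves can be packed the same way (widths as tabulated; bit ordering within each half is our assumption). Note that GA:CA gives 7 + 3 = 10 bits of PE address, exactly the 1024-PE target space:

```python
# (name, width) pairs for the two 39-bit packet halves.
ADDR_FIELDS = [("HST", 1), ("PT", 5), ("WCF", 2), ("M", 1),
               ("GA", 7), ("CA", 3), ("MA", 20)]
DATA_FIELDS = [("C", 1), ("RSV", 3), ("DT", 3), ("D", 32)]

def pack(fields, values):
    """Pack named field values into one word, MSB-first; missing fields
    default to zero."""
    word = 0
    for name, width in fields:
        v = values.get(name, 0)
        assert v < (1 << width), f"{name} overflows {width} bits"
        word = (word << width) | v
    return word
```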
Special Packets#
Non-data packets (function control, structure access, remote memory access) have special PT (packet type) values and are "colorless" (no operand segment number). Generated by MKPKT or GET instructions. Processed by a per-PE SP Monitor — a special strongly connected block that the system manager configures at initialisation. This makes special packet handling completely programmable.
Macro Instructions#
Complex operations (integer division, function call, complex structure ops) are implemented as strongly connected blocks containing simpler instructions, rather than as single complex instructions. The division pipeline (DIV0, DIV1, DIV2, DIVR, DIVQ) is a good example: five simple instructions that chain via registers to implement non-restoring division.
8. Interconnection Network#
Processor Connected Omega Network#
Topology: each PE contains a 3×3 packet switch (the SU) as an element of the network. PEs are directly interconnected; there are no separate switching nodes.
Properties:
- Average distance from any PE to any other: O(log N)
- Connection links per PE: small constant (3 input, 3 output in the SU)
- Total switching elements: O(N), smaller than the O(N log N) of multi-stage omega networks
- Routing is self-routing based on destination address
Deadlock Prevention#
Store-and-forward deadlock is prevented by a three-bank buffer at each SU input port. Packets entering the network start in the lowest bank. When a packet reaches stage 0 of the network (one hop), it is promoted to the next bank level. Because the circular omega topology guarantees no packet ever needs more than two rounds through the network, three bank levels prevent circular buffer dependencies.
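A toy model of the bank rule (depths and interface are placeholders). The point it illustrates: each bank back-pressures only its own level, so a full entry-level bank can never block a packet that has already progressed through the network:

```python
class ThreeBankBuffer:
    """Per-port input buffer with three priority banks.

    Bank 0 holds packets entering the network; a packet is promoted to a
    higher bank as it progresses (one promotion per round, per the
    description above). Depth is a placeholder, not a documented value.
    """
    def __init__(self, depth=4):
        self.banks = [[], [], []]
        self.depth = depth

    def accept(self, packet, bank):
        """Try to buffer `packet` at its hop-derived bank level."""
        if len(self.banks[bank]) < self.depth:
            self.banks[bank].append(packet)
            return True
        return False   # back-pressure applies to this level only
```

Because a packet needs at most two rounds, a level-2 packet depends on nothing below it, level 1 depends only on level 2, and the dependency chain is acyclic, which is what breaks store-and-forward deadlock.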
Load Balancing#
The SU contains a Packet Rewriting Controller (PRC) that implements function-level dynamic load balancing. Special MLPE (Minimum Load PE) packets circulate through the network. The PRC in each SU rewrites these packets so they always contain the address and load of the least-loaded PE encountered so far. PEs preparing to make function calls read MLPE packets from a local FIFO to determine where to place new function instances.
This is done within normal packet transfer time (no overhead) using the PRC, which operates in parallel with normal routing.
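The rewrite rule itself is just a running minimum over load values (sketch; the papers do not detail the MLPE packet layout, so the tuple representation is illustrative):

```python
def prc_rewrite(mlpe_packet, my_pe, my_load):
    """PRC rule applied as an MLPE packet transits this SU.

    `mlpe_packet` is (pe_address, load) of the least-loaded PE seen so
    far; the SU substitutes its own PE if it is less loaded.
    """
    pe, load = mlpe_packet
    if my_load < load:
        return (my_pe, my_load)
    return (pe, load)
```

Applied at every hop, this makes each circulating MLPE packet converge on the network-wide minimum without any dedicated query traffic.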
Prototype Network Implementation#
- Intra-group (5 PEs per group): traces on the PE group board
- Inter-group (same rack half): mother board connections
- Inter-rack-half: shielded cables
- Each network path: 44 signal lines (5 control + 39 data)
- Transfer rate: 60.9 MB/s per port; 14.63 GB/s total
- Transfer delay: 80 ns between adjacent nodes; ~410 ns average
9. EM-4 Prototype Details#
Organisation#
- 80 PEs in 16 groups of 5 (group size determined by omega network topology)
- Each PE: EMC-R + Memory Address Register (MAR) + off-chip SRAM
- Off-chip memory per PE: 1.31 MB SRAM (max 5.25 MB)
- Host computer: SUN3/260 + VME bus
- Packet interface processor: controls system clock, power, packet I/O
- Packet interface switch: connects host to PE groups
Physical Implementation#
- Single rack: 60.0 cm × 92.0 cm × 140.5 cm
- 16 PE group boards (4-layer multi-wire, 50.8 cm × 47.0 cm each)
- 8 boards in upper rack, 8 in lower rack + interface switch
- Single synchronisation clock: 12.5 MHz
- Power: 3-phase 200 V supply; ~2.6 kW total (PEs ~2.42 kW)
Memory Organisation (per PE)#
The off-chip memory is a unified address space partitioned into:
- Secondary packet buffer (overflow from 32-word on-chip FIFO)
- Matching store (operand segments)
- Instruction store (template segments)
- SP Monitor area
- Structure store
- Working area
All accessible from the EXU as a single address space via the MCU.
Program Execution Flow#
- Initialisation: host + maintenance circuits reset system, set PE numbers, configure memory segment linkages, load monitor routines.
- Program loading: assembled code sent via packets using "broadcasting" — each PE loads function codes into its memory and forwards to uninitialised PEs.
- Execution start: host sends input packets to appropriate matching addresses via the packet interface processor.
- Execution: function instances invoke other functions. Results flow back to parent functions or to the packet interface.
- Termination: resultant packets sent to packet interface processor.
Performance Results (80 PEs)#
| Program | EM-4 time (s) | MIPS | vs SPARC330 | vs VAX8800 | vs CRAY/X-MP |
|---|---|---|---|---|---|
| FIB(23) | 0.00453 | 223 | 12× | 30× | 19× |
| PRIME(65536) | 0.506 | 508 | 33× | 75× | 21× |
| PI(4000) | 0.369 | 824 | 94× | 101× | 13× |
| SUM(65535) | 0.00042 | 780 | 25× | 142× | 6× |
| MTRX(80×80) | 0.00888 | 815 | 79× | 225× | 0.4× |
Note: EM-4 programs were in assembly language; comparison programs were compiled C (SPARC, VAX) or FORTRAN (CRAY) with maximum optimisation.
The CRAY/X-MP outperforms the EM-4 only on matrix multiplication, where the CRAY's vector pipeline dominates (2-3 instructions per inner loop vs EM-4's 12). The EM-4 is 6-21× faster on everything else.
PI calculation achieves the highest MIPS (824, vs 1000 theoretical peak) because all operations can be statically scheduled with local-only communication. Fibonacci achieves the lowest (223) due to heavy packet traffic causing IBU FIFO overflow.
10. Analysis: Relevance to Our Architecture#
Validated Design Choices#
- Direct matching with ordinary SRAM. The EM-4 confirms this works in real hardware. Their FMU is only ~3.5K gates with direct matching. Our matching store should be similarly simple.
- Static PE assignment reduces matching hardware. Though the EM-4 still does dynamic function-to-group allocation, their acknowledgment that compiler-directed allocation reduces matching overhead supports our more aggressive static assignment approach.
- RISC instruction philosophy. 26 instructions, no microcode, simple formats. Validates our EEPROM-decoded instruction approach.
- Single-cycle matching via half-clock read-modify-write. The EM-4 achieves this in their 80 ns clock. We should be able to do the same with modern async SRAM (~15-25 ns access) within our target clock period.
Features We Should Adopt or Plan For#
- Instruction format continuation bit. Reserve 1 bit in the instruction encoding for "next instruction follows sequentially" (EM-4's NF flag). Even without strongly connected block execution in v0, this preserves the option. Zero hardware cost — it's just an IRAM bit that the v0 pipeline ignores.
- Cancel bit in token format. The EM-4's C bit in the data part marks garbage tokens for discard. Consider adding this to our token format (1 bit in flit 2 or flit 3) to handle not-taken branch cleanup without relying solely on generation counters. Generation counters catch stale matches; the cancel bit catches tokens still in flight.
- IBU overflow to off-chip SRAM. The EM-4's two-level buffering (32-word on-chip FIFO + 8K-word off-chip SRAM) is pragmatic. We need an explicit overflow/backpressure strategy for PE input FIFOs.
- SP Monitor concept. Special packet handling as a programmable strongly connected block rather than fixed-function hardware. Interesting for our I/O and system packet handling — rather than a fixed I/O controller state machine, a configurable handler built from normal instructions.
Features Evaluated and Deferred#
- Strongly connected blocks. The performance benefit is clear (up to 6× throughput within blocks, 2.5× on Fibonacci subgraph). The hardware cost is moderate (16-register file + sequencing logic). But it requires significant compiler infrastructure (block identification, register allocation within blocks). Deferred to post-v0. The instruction format should reserve the continuation bit to enable future implementation.
- PE-embedded network switch. The EMC-R's SU (~10K gates) operates independently alongside the PE. Relevant only if we move from shared bus to point-to-point topology. Not applicable to v0 shared-bus design.
- Dynamic load balancing (MLPE). Makes sense at 80-1000 PEs, not at 4. Static compiler assignment is sufficient for our scale.
- 16-register file. Closely tied to strongly connected blocks. The EM-4 uses it for intra-block operand passing (R0, R1, R2 fields). Without strongly connected execution, there's no register file traffic pattern. The matching store SRAM and SM handle all operand staging for v0. If/when strongly connected blocks are added, a small register file (using SRAM with dedicated address lines, or a 74670-style register file chip) should be added simultaneously.
Features Not Applicable#
- Processor connected omega network. O(N) hardware, O(log N) distance. Irrelevant for a 4-PE shared bus. Worth revisiting if scaling to 16+ PEs.
- Template segment number fetch (pipeline stage 1). The EM-4 needs this because operand segments are dynamically bound to template segments at function invocation. Our static PE assignment with compiler-fixed IRAM contents eliminates this stage entirely, saving one pipeline stage.
- 78-bit fixed-size packets. Our variable-length multi-flit tokens on a 16-bit bus are better suited to discrete logic where wide buses are physically expensive. The EM-4 could use fixed-size packets because everything was on-chip or on a custom bus.
- Macro instructions via strongly connected blocks. Complex operations (division, function call) are decomposed into simple instruction sequences within strongly connected blocks. Without strongly connected execution in v0, we handle complex operations differently (multi-token graph fragments or SM operations).
11. Open Items Arising from EM-4 Analysis#
- Cancel bit in token format: evaluate adding 1 bit to type-00/01 tokens for garbage token discard. Requires: format change in flit 2, discard logic in PE input (check cancel bit before entering matching pipeline). Low hardware cost, potentially useful for conditional branch cleanup.
- IBU overflow strategy: decide between backpressure (stall the bus — simple but blocks all traffic) vs local overflow SRAM (more hardware but non-blocking). For v0 shared bus at 4 PEs, backpressure may be acceptable.
- I-structure semantics in SM: confirmed as wanted. The EM-4 puts structure storage in per-PE off-chip memory; we're keeping it in a dedicated SM module with synchronising memory semantics. Design the deferred-read queue.
- Instruction format bit reservation: when finalising IRAM encoding, reserve 1 bit for continuation flag (NF equivalent) and 1 bit for output-mode flag (OUT equivalent). These cost nothing in v0 (pipeline ignores them) and enable strongly connected blocks later.
- Strongly connected block feasibility study: before implementing, compile some representative programs (Fibonacci, simple parallel computations) and measure what percentage of instructions could be placed in strongly connected blocks. This determines the practical performance benefit and justifies (or not) the compiler complexity.