Dynamic Dataflow CPU — Architectural Positioning & Research Notes#
Working notes capturing design philosophy decisions, research insights, and prior art observations. Not a design spec — a reference for "why we chose this direction" and "what we learned from the literature."
Companion Documents#
- `architecture-overview.md` — master architecture reference
- `pe-design.md` — PE pipeline, matching store, context slots
- `design-alternatives.md` — rejected/deferred approaches
- `Prior_Art_Reference_Guide_for_a_Discrete-Logic_Dynamic_Dataflow_CPU.md` — comprehensive bibliography
1. Core Architectural Commitment: Pure Dataflow, Not Hybrid#
This project is a dynamic dataflow machine, not a multithreaded RISC core with dataflow-style synchronisation primitives.
The lineage (Manchester as precursor, then MIT) went: Manchester → TTDA → Monsoon → *T → Sparcle, with each step making the PE more like a conventional CPU and reducing the dataflow aspects to synchronisation mechanisms (presence bits on memory, fast context switching, message passing). By *T (1992), the "dataflow" part is essentially hardware semaphores bolted onto a modified SPARC.
We are not going down that road. The point of this project is the different execution model — where synchronisation is implicit in the data flow, not explicit in the program. A PE that needs a program counter, register file, bypass network, and branch prediction is solving a different problem than we're solving.
Specific non-goals:
- Sequential instruction streams within a PE (*T, Sparcle, EARTH)
- Register files as primary operand storage
- Program counter / sequential fetch logic
- Branch prediction hardware
- Hardware semaphores / presence-bit memory traps (*T model)
- "Make each PE a full CPU" — this blows the transistor budget from 4 PEs down to 1-2, losing the parallelism that's the whole point
Where We Sit on the Spectrum#
```
Pure dataflow ←———————————————————————————————————————→ Pure von Neumann

Manchester       Monsoon           *T/Sparcle        OoO superscalar
    |               |                  |                   |
    |        [this project]           |                   |
    |               |                  |                   |
hash matching    ETS/direct        RISC+sync         register renaming
no PC            no PC             has PC            has PC
no registers     no regs           register file     register file
```
We're roughly at the Monsoon point on this spectrum: direct-indexed matching with presence bits (independently derived, see §2), token-driven execution, no program counter. But with a smaller/simpler PE than Monsoon (fewer pipeline stages, smaller frames, generation counters for ABA protection instead of Monsoon's tighter deallocation control).
What IS Worth Mining from the Hybrid Work#
The Papadopoulos & Traub 1991 paper ("Multithreading: A Revisionist View of Dataflow Architectures") contains one microarchitectural optimisation that's relevant without changing the architecture:
Sequential scheduling of monadic chains. If A feeds B feeds C and all are monadic (single-input), the tokens cycle through the full pipeline for each hop: token-in → match-bypass → fetch → execute → token-out → back to token-in. If the PE could recognise this pattern and keep the result in-pipeline for the next instruction (skipping token formatting and input FIFO), that's a significant latency win on sequential chains.
This is a microarchitectural shortcut, not an architectural change. The token semantics don't change. The compiler doesn't need to know. It's just an optimisation where the PE notices "output goes to me, monadic, next instruction" and short-circuits the pipeline. Worth considering post-v0 if sequential throughput is a problem.
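The short-circuit condition can be made concrete with a behavioral sketch (not RTL). All names here are illustrative, not from pe-design.md; the point is only the check "output goes to me, monadic, next instruction" that keeps the result in-pipeline instead of routing it through token formatting and the input FIFO.

```python
# Behavioral sketch of the monadic-chain shortcut from Papadopoulos &
# Traub 1991. Field and function names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str
    monadic: bool       # single-input: needs no matching
    dest_pe: int        # PE hosting the destination instruction
    dest_addr: int      # instruction-memory address of the destination

def alu(op, x):
    return {"inc": x + 1, "dbl": x * 2, "neg": -x}[op]

def execute_chain(pe_id, imem, addr, operand):
    """Run instructions, short-circuiting while the result stays on
    this PE and the consumer is monadic. Returns (result, executes)."""
    executes = 0
    while True:
        instr = imem[addr]
        operand = alu(instr.opcode, operand)
        executes += 1
        nxt = imem.get(instr.dest_addr)
        if instr.dest_pe == pe_id and nxt is not None and nxt.monadic:
            addr = instr.dest_addr    # keep result in-pipeline:
            continue                  # skip token-out / FIFO / matching
        return operand, executes      # fall back to the normal token path
```

On this model a three-op monadic chain costs three execute steps rather than three full pipeline traversals; the token semantics are unchanged, only the transport between consecutive instructions is elided.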
2. Convergence with Monsoon's Explicit Token Store#
The matching store design in pe-design.md — direct-indexed context slots with occupied bits, compiler-assigned slot IDs, bump allocator — was derived independently from first principles:
- Manchester's hash table has terrible utilisation (<20%) and enormous hardware cost (16 SRAM banks + comparators per PE)
- Amamiya's semi-CAM is better but CAM chips are tiny at discrete scale
- If the compiler assigns context IDs statically, you can use them as direct SRAM addresses → no associative lookup at all
- Occupied bit = 1-bit presence flag per matching entry
- Generation counter handles ABA on slot reuse
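The bullets above amount to a small behavioral sketch: one SRAM entry per compiler-assigned slot, an occupied bit instead of any associative lookup, and a 2-bit generation counter that catches stale tokens on slot reuse. Names are illustrative, not the actual pe-design.md field names.

```python
# Behavioral sketch of the direct-indexed matching store.
class StaleTokenError(Exception):
    """Raised when a token's generation tag mismatches the slot (ABA)."""

class MatchingStore:
    def __init__(self, slots=32):
        self.occupied = [False] * slots   # 1-bit presence flag per entry
        self.gen      = [0] * slots       # 2-bit generation per slot
        self.operand  = [None] * slots

    def token_arrives(self, slot, gen, value):
        """Direct SRAM index by slot ID. Returns None if this is the
        first operand (parked), or the matched pair if the partner
        was already waiting."""
        if gen != self.gen[slot]:
            raise StaleTokenError(f"slot {slot}: token gen {gen} != {self.gen[slot]}")
        if not self.occupied[slot]:
            self.occupied[slot] = True    # first operand: park it
            self.operand[slot] = value
            return None
        self.occupied[slot] = False       # second operand: fire
        left, self.operand[slot] = self.operand[slot], None
        return (left, value)

    def free(self, slot):
        """Deallocate: bump the generation so stale tokens are detected."""
        self.occupied[slot] = False
        self.gen[slot] = (self.gen[slot] + 1) & 0b11
```

The key property is that `token_arrives` is a single indexed read-modify-write, which is what makes single-cycle matching feasible in discrete logic.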
This turns out to be essentially the same thing Papadopoulos and Culler arrived at with the Explicit Token Store (ETS) for Monsoon (1990), via a similar line of reasoning from the TTDA's frame-based matching (Arvind & Nikhil 1987/1990).
Key differences from Monsoon ETS:
| Aspect | Monsoon ETS | This design |
|---|---|---|
| Frame size | 128 words (fixed) | 16-32 entries (configurable) |
| Allocation | Shared free-list of frame pointers | Bump allocator + bitmap/FIFO |
| ABA protection | Tight dealloc control, no reuse until drained | 2-bit generation counter per slot |
| Pipeline depth | 8 stages | 5 stages (target) |
| Matching entry ID | Compiler-assigned slot offset in frame | Compiler-assigned match_entry in context slot |
| Overflow | Not handled (compiler must fit) | Stall + optional future CAM buffer |
The generation counter is more defensive than Monsoon's approach, which is appropriate for a first build where catching bugs matters more than saving 2 bits per slot. Monsoon's free-list is cleaner in theory but the bump allocator is simpler hardware (counter vs. FIFO management).
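The counter-vs-free-list contrast can be sketched as follows; the bitmap scan shown here is one plausible reclamation scheme (the section mentions bitmap/FIFO as options), and all names are illustrative.

```python
# Sketch of the bump allocator with a free bitmap, as contrasted with
# Monsoon's shared free-list of frame pointers. In hardware this is a
# counter plus a bitmap register, not a software loop.
class SlotAllocator:
    def __init__(self, nslots=32):
        self.nslots = nslots
        self.next = 0                   # bump counter
        self.free = [True] * nslots     # bitmap of reclaimable slots

    def alloc(self):
        """Scan from the bump pointer for a free slot; return None
        (i.e. stall the PE) if context space is exhausted."""
        for i in range(self.nslots):
            s = (self.next + i) % self.nslots
            if self.free[s]:
                self.free[s] = False
                self.next = (s + 1) % self.nslots
                return s
        return None                     # overflow → stall

    def release(self, s):
        self.free[s] = True
```

A free-list gives O(1) allocation unconditionally but needs FIFO management; the bump counter is a simpler datapath and degenerates to a scan only when the frame space is nearly full.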
Actionable insight from this convergence: the ETS papers (especially Papadopoulos's 1988 PhD thesis) contain detailed pipeline timing, hazard analysis, and state-bit logic that's directly applicable to our matching store, even though the designs were arrived at independently. Worth reading for implementation details, not just architecture.
3. Clock Efficiency as Primary PE Constraint#
In discrete logic, clock speed is the hard constraint. Pipeline stage delays run roughly 10-30ns depending on the logic family, and pipeline depth multiplies that into total token latency. At realistic clock speeds (5-20 MHz for well-designed discrete logic), every wasted cycle is expensive.
This means:
- Single-cycle matching is non-negotiable (achieved via direct indexing)
- Pipeline depth should be minimised — each stage adds a clock period of latency. Monsoon's 8 stages would give 400-1600ns per token at discrete-logic speeds. Our target of 5 stages is aggressive but achievable.
- Monadic bypass matters — monadic tokens skipping the matching stage saves a full cycle per monadic operation. At 50% monadic ops (typical for many programs), this is significant.
- Network latency is the enemy — every hop between PEs adds pipeline latency. Compiler-assigned locality (keeping communicating nodes on the same PE) is critical. This is why we care about static PE assignment even though matching is dynamic.
- The sequential scheduling shortcut (§1) becomes more valuable the slower the clock is — if a monadic chain of 5 ops takes 25 cycles through the full 5-stage pipeline but could take 5 cycles with short-circuit execution, the 20 saved cycles are worth 1-4µs at discrete clock periods (50-200ns).
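The latency arithmetic in the bullets above, made explicit (clock periods and stage counts are the figures quoted in this section):

```python
# Worked latency budget: stages * clock period.
def token_latency_ns(stages, clock_ns):
    return stages * clock_ns

# 5-20 MHz clock => 200-50 ns periods.
assert token_latency_ns(8, 50) == 400     # Monsoon depth, fast clock
assert token_latency_ns(8, 200) == 1600   # Monsoon depth, slow clock
assert token_latency_ns(5, 50) == 250     # our 5-stage target, fast clock

# Monadic-chain shortcut: 5 ops * 5 stages = 25 cycles normally,
# 5 cycles short-circuited, so 20 clock periods saved per chain.
saved_cycles = 5 * 5 - 5
assert saved_cycles * 50 == 1000          # 1 us saved at 20 MHz
assert saved_cycles * 200 == 4000         # 4 us saved at 5 MHz
```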
Implications for PE Design#
Every pipeline stage must justify its existence in terms of critical path. If a stage can be merged with an adjacent stage without extending the critical path beyond the target clock period, merge it.
Stages that are "free" (can overlap with SRAM access time): address generation, mux selection, comparator setup.
Stages that set the clock period: SRAM read (15-25ns for fast async SRAM), ALU operation (depends on width — 8-bit add ~15ns in 74HC, 16-bit ~25ns with carry lookahead).
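The merge rule can be expressed as a greedy pass over the stage list: fold a stage into its predecessor whenever the combined delay still fits the target clock period. The delay figures below are the examples quoted in this section; the stage names and the greedy strategy are illustrative assumptions, not the actual pipeline plan.

```python
# Sketch of the stage-merging rule: merge adjacent stages whose
# combined delay fits within the target clock period.
def merge_stages(stages, clock_ns):
    """stages: list of (name, delay_ns) in pipeline order.
    Returns the merged pipeline as (name, delay_ns) tuples."""
    merged = [[stages[0][0], stages[0][1]]]
    for name, delay in stages[1:]:
        if merged[-1][1] + delay <= clock_ns:
            merged[-1][0] += "+" + name   # fold into the previous stage
            merged[-1][1] += delay
        else:
            merged.append([name, delay])
    return [(n, d) for n, d in merged]

pipeline = [("addr_gen", 5), ("sram_read", 25), ("mux", 5),
            ("alu", 25), ("token_out", 10)]
# At a 35 ns clock target, addr_gen and mux fold around the SRAM read,
# and token formatting folds into the ALU stage: five candidate stages
# collapse to two, each exactly filling the 35 ns period.
```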
4. Technology Notes#
Not Strictly TTL#
The project is described as "74-series TTL + SRAM" but the actual target technology is more nuanced:
- 74HC / 74HCT CMOS is the likely primary logic family, not original 74-series TTL. HC/HCT gives lower power, better noise margins, and similar or better speed at the gate level. HCT is input-compatible with TTL levels.
- 74AC / 74ACT (Advanced CMOS) for critical-path stages where the extra speed matters. ~5ns propagation vs ~10ns for HC.
- 74F (FAST TTL) is an option for specific high-speed paths but draws more power and is less available.
- Async SRAM (IS61C256, AS6C4008, etc.) for all bulk storage: instruction memory, matching store, token FIFOs, structure memory. 15-25ns access times are the pipeline clock floor.
- EEPROMs (AT28C256 or similar) for instruction memory where runtime reprogramming via type-11 is acceptable with higher write latency. Or SRAM with battery backup / external loading.
The key constraint is no large-scale integration beyond commodity SRAM and EEPROM. No FPGAs in the final build (though FPGA prototyping is encouraged). No custom ASICs. No microcontrollers in the datapath (though a microcontroller as external test fixture / bootstrap host is fine for development).
The "period-plausible" framing refers to the transistor budget being comparable to processors from the late 1970s / early 1980s (68000-class), not to the specific technology used. Modern CMOS 74-series parts are faster and lower power than original TTL but the logic complexity and design methodology are the same.
5. Priority Reading List (from Prior Art Survey)#
Based on the reference guide and current design state, prioritised for maximum impact on near-term design decisions:
Must-Read (directly affects current design choices)#
- Papadopoulos & Culler, "Monsoon: An Explicit Token-Store Architecture" (ISCA 1990) — ETS mechanics, 8-stage pipeline, frame memory organisation. Closest prior art to our matching store.
- Culler & Papadopoulos, "The Explicit Token Store" (JPDC 1990) — Extended journal version with more detail on the state-bit mechanism and pipeline stages.
- Papadopoulos PhD thesis (MIT, 1988) — THE most detailed Monsoon hardware source. Board-level design, chip selection, pipeline timing. Hard to find but worth the effort.
- Sakai et al., "An Architecture of a Dataflow Single Chip Processor" (ISCA 1989) — EM-4 core paper. 50K-gate PE, circular pipeline, direct matching, strongly connected arc model. Most sophisticated pipeline design in the literature.
- da Silva & Watson, "A Pseudo-Associative Matching Store with Hardware Hashing" (IEE 1983) — Even though we're not using hash matching, understanding WHY Manchester went this way and what the tradeoffs were informs our design.
- Culler, "Resource Management for the Tagged Token Dataflow Architecture" (MIT TR-332, 1985) — Token store overflow, deadlock, frame-space management. Essential for understanding throttling and resource constraints.
Should-Read (informs broader design context)#
- Dennis, "Building Blocks for Data Flow Prototypes" (ISCA 1980) — Modular hardware building blocks for discrete-logic dataflow. May directly influence our board-level module decomposition.
- Sakai et al., EM-4 network paper (Parallel Computing 1993) — Circular omega network design and deadlock prevention. Relevant when we scale past the shared bus.
- Arvind & Nikhil, "Executing a Program on the MIT TTDA" (IEEE TC 1990) — TTDA PE organisation, tag format, I-structure memory. Foundational context for understanding Monsoon.
- Papadopoulos & Traub, "Multithreading: A Revisionist View" (ISCA 1991) — Sequential scheduling optimisation. Not for the architecture, but for the microarchitectural shortcut idea.
Background (fills in the picture)#
- Lee & Hurson, "Dataflow Architectures and Multithreading" (IEEE Computer 1994) — Survey bridging pure dataflow to the multithreading era.
- Arvind & Culler, "Dataflow Architectures" (Annual Review 1986) — MIT perspective, good overview of the design space.
- Grafe et al., "The Epsilon Dataflow Processor" (ISCA 1989) — Hybrid approach (Sandia), interesting as a contrast to show where we DON'T want to go.
- Watson & Gurd, "A Practical Data Flow Computer" (IEEE Computer 1982) — Board-level Manchester hardware aimed at hardware engineers.
6. Key Open Questions Informed by Research#
Things the prior art survey surfaced that we should think about:
- Monsoon's 128-word frames vs our 16-32 entry slots: are we too small? Monsoon's larger frames reduce allocation frequency but waste space on small activations. Our smaller slots are more efficient but may cause more allocation churn. Need to compile some test programs and measure.
- EM-4's circular pipeline: their PE reuses pipeline stages for different phases of token processing, reducing total hardware per PE. Worth investigating whether our 5-stage pipeline could benefit from a similar trick.
- EM-4's strongly connected arc model: a different take on monadic chains where consecutive operations within a thread execute without re-entering the network. Related to the sequential scheduling idea but architecturally distinct. Need to read the papers to understand the hardware implications.
- I-structure memory (Arvind): presence bits on structure memory words for synchronisation. Our SM doesn't currently have this — SM operations are simple read/write/RMW. I-structures enable deferred reads (a read of an empty word blocks until the write arrives). This is a significant capability for certain parallel patterns. Worth evaluating whether SM should support it.
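For evaluation purposes, I-structure semantics reduce to a per-word presence state plus a deferred-reader list. The sketch below models the standard Arvind-style behavior; our SM does not implement this, and the names and reply-tag scheme are illustrative assumptions.

```python
# Behavioral sketch of I-structure memory: each word carries a
# presence state, and a read of an EMPTY word is deferred (queued)
# until the single-assignment write arrives.
EMPTY, DEFERRED, FULL = 0, 1, 2

class IStructure:
    def __init__(self, size):
        self.state   = [EMPTY] * size
        self.value   = [None] * size
        self.readers = [[] for _ in range(size)]   # deferred read tags

    def i_read(self, addr, reply_to):
        """Return [(tag, value)] replies to send now; defers the read
        (queues the continuation tag) if the word is not yet written."""
        if self.state[addr] == FULL:
            return [(reply_to, self.value[addr])]
        self.state[addr] = DEFERRED
        self.readers[addr].append(reply_to)
        return []                                  # nothing to send yet

    def i_write(self, addr, value):
        """Single-assignment write; wakes all deferred readers."""
        assert self.state[addr] != FULL, "I-structure words are write-once"
        self.value[addr] = value
        self.state[addr] = FULL
        woken = [(tag, value) for tag in self.readers[addr]]
        self.readers[addr].clear()
        return woken
```

The hardware cost beyond plain SM is the presence state per word and somewhere to park the deferred-reader queue, which is the main thing to weigh in the evaluation.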