# Dynamic Dataflow CPU — Architectural Positioning & Research Notes

Working notes capturing design philosophy decisions, research insights,
and prior art observations. Not a design spec — a reference for "why we
chose this direction" and "what we learned from the literature."

## Companion Documents

- `architecture-overview.md` — master architecture reference
- `pe-design.md` — PE pipeline, matching store, context slots
- `design-alternatives.md` — rejected/deferred approaches
- `Prior_Art_Reference_Guide_for_a_Discrete-Logic_Dynamic_Dataflow_CPU.md`
  — comprehensive bibliography

---

## 1. Core Architectural Commitment: Pure Dataflow, Not Hybrid

This project is a **dynamic dataflow machine**, not a multithreaded RISC
core with dataflow-style synchronisation primitives.

The lineage ran: Manchester → MIT TTDA → Monsoon → *T → Sparcle, with
each step making the PE more like a conventional CPU and reducing the
dataflow aspects to synchronisation mechanisms (presence bits on memory,
fast context switching, message passing). By *T (1992), the "dataflow"
part is essentially hardware semaphores bolted onto a modified SPARC.

**We are not going down that road.** The point of this project is the
different execution model — one where synchronisation is implicit in the
data flow, not explicit in the program. A PE that needs a program counter,
register file, bypass network, and branch prediction is solving a
different problem from the one we're solving.

Specific non-goals:
- Sequential instruction streams within a PE (*T, Sparcle, EARTH)
- Register files as primary operand storage
- Program counter / sequential fetch logic
- Branch prediction hardware
- Hardware semaphores / presence-bit memory traps (*T model)
- "Make each PE a full CPU" — this blows the transistor budget from
  4 PEs down to 1-2, losing the parallelism that is the whole point

### Where We Sit on the Spectrum

```
Pure dataflow ←————————————————————————————→ Pure von Neumann
Manchester      Monsoon        *T/Sparcle     OoO superscalar
     |             |                |                |
     |       [this project]        |                |
     |             |                |                |
hash matching   ETS/direct     RISC+sync      register renaming
no PC           no PC          has PC         has PC
no registers    no regs        register file  register file
```

We're roughly at the Monsoon point on this spectrum: direct-indexed
matching with presence bits (independently derived, see §2), token-driven
execution, no program counter. But with a smaller/simpler PE than Monsoon
(fewer pipeline stages, smaller frames, generation counters for ABA
protection instead of Monsoon's tighter deallocation control).

### What IS Worth Mining from the Hybrid Work

The Papadopoulos & Traub 1991 paper ("Multithreading: A Revisionist View
of Dataflow Architectures") contains one microarchitectural optimisation
that's relevant without changing the architecture:

**Sequential scheduling of monadic chains.** If A feeds B feeds C and all
are monadic (single-input), the tokens cycle through the full pipeline
for each hop: token-in → match-bypass → fetch → execute → token-out →
back to token-in. If the PE could recognise this pattern and keep the
result in-pipeline for the next instruction (skipping token formatting
and the input FIFO), that's a significant latency win on sequential chains.

This is a **microarchitectural shortcut**, not an architectural change.
The token semantics don't change. The compiler doesn't need to know. It's
just an optimisation where the PE notices "output goes to me, monadic,
next instruction" and short-circuits the pipeline. Worth considering
post-v0 if sequential throughput is a problem.
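
The latency win can be made concrete with a toy cycle-count model. The
5-stage depth is the project's stated target; treating the short-circuited
chain as one execute cycle per op (ignoring pipeline fill) is an optimistic
simplification, not a measured figure:

```python
# Toy latency model for a chain of N monadic ops resident on one PE.
# Assumes the 5-stage pipeline targeted in pe-design.md; the 1-cycle-per-op
# short-circuit figure is an optimistic bound, not a measured number.

PIPELINE_STAGES = 5  # token-in, match/bypass, fetch, execute, token-out

def full_pipeline_cycles(n_ops: int) -> int:
    """Each result is re-formatted as a token and re-traverses all stages."""
    return n_ops * PIPELINE_STAGES

def short_circuit_cycles(n_ops: int) -> int:
    """Result stays in-pipeline: one execute cycle per op in the chain."""
    return n_ops

for n in (2, 5, 10):
    print(f"{n:2d}-op chain: {full_pipeline_cycles(n)} cycles full, "
          f"{short_circuit_cycles(n)} cycles short-circuited")
```

For the 5-op chain discussed in §3, this gives the 25-vs-5-cycle figure
quoted there.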

---

## 2. Convergence with Monsoon's Explicit Token Store

The matching store design in `pe-design.md` — direct-indexed context slots
with occupied bits, compiler-assigned slot IDs, bump allocator — was
derived independently from first principles:

1. Manchester's hash table has terrible utilisation (<20%) and enormous
   hardware cost (16 SRAM banks + comparators per PE)
2. Amamiya's semi-CAM is better, but CAM chips are tiny at discrete scale
3. If the compiler assigns context IDs statically, you can use them as
   direct SRAM addresses → no associative lookup at all
4. Occupied bit = 1-bit presence flag per matching entry
5. Generation counter handles ABA on slot reuse

This turns out to be essentially the same thing Papadopoulos and Culler
arrived at with the Explicit Token Store (ETS) for Monsoon (1990), via
a similar line of reasoning from the TTDA's frame-based matching (Arvind
& Nikhil 1987/1990).

**Key differences from Monsoon ETS:**

| Aspect | Monsoon ETS | This design |
|--------|-------------|-------------|
| Frame size | 128 words (fixed) | 16-32 entries (configurable) |
| Allocation | Shared free-list of frame pointers | Bump allocator + bitmap/FIFO |
| ABA protection | Tight dealloc control, no reuse until drained | 2-bit generation counter per slot |
| Pipeline depth | 8 stages | 5 stages (target) |
| Matching entry ID | Compiler-assigned slot offset in frame | Compiler-assigned match_entry in context slot |
| Overflow | Not handled (compiler must fit) | Stall + optional future CAM buffer |

The generation counter is more defensive than Monsoon's approach, which
is appropriate for a first build where catching bugs matters more than
saving 2 bits per slot. Monsoon's free-list is cleaner in theory, but the
bump allocator is simpler hardware (a counter vs. FIFO management).
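
A behavioural sketch of the direct-indexed matching step may help. The
slot count, dyadic-only framing, and method names are illustrative
assumptions; only the occupied bit and the 2-bit generation counter come
from the design above:

```python
# Behavioural model of a direct-indexed matching store: occupied bit +
# 2-bit generation counter per entry. Entry count and interface are
# illustrative; the real parameters live in pe-design.md.

GEN_BITS = 2  # generation counter width, per the comparison table

class MatchingStore:
    def __init__(self, n_entries=32):
        self.occupied = [False] * n_entries
        self.gen      = [0] * n_entries      # generation at last allocation
        self.value    = [None] * n_entries   # stored first operand

    def arrive(self, entry, gen, operand):
        """Token arrives for a dyadic instruction.
        Returns None (waiting) or the matched operand pair (node fires)."""
        # Stale token: its generation differs from the slot's current one,
        # so the slot was reused -- the ABA case the counter catches.
        if gen != self.gen[entry]:
            raise RuntimeError(f"stale token: entry {entry}, gen {gen}")
        if not self.occupied[entry]:
            self.occupied[entry] = True      # first operand: deposit + wait
            self.value[entry] = operand
            return None
        partner = self.value[entry]          # second operand: match fires
        self.occupied[entry] = False
        self.value[entry] = None
        return (partner, operand)

    def reallocate(self, entry):
        """Slot reuse bumps the generation (mod 2**GEN_BITS)."""
        self.gen[entry] = (self.gen[entry] + 1) % (1 << GEN_BITS)

ms = MatchingStore()
assert ms.arrive(3, 0, 10) is None        # left operand waits
assert ms.arrive(3, 0, 32) == (10, 32)    # right operand fires the node
ms.reallocate(3)                          # slot reused: generation bumped
try:
    ms.arrive(3, 0, 99)                   # late token with old generation
except RuntimeError:
    pass                                  # caught as stale
```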

**Actionable insight from this convergence:** the ETS papers (especially
Papadopoulos's 1988 PhD thesis) contain detailed pipeline timing, hazard
analysis, and state-bit logic that are directly applicable to our matching
store, even though the designs were arrived at independently. Worth
reading for implementation details, not just architecture.

---

## 3. Clock Efficiency as Primary PE Constraint

In discrete logic, clock speed is the hard constraint. Per-stage logic
delays are ~10-30ns depending on technology, and pipeline depth directly
multiplies total latency. At realistic clock speeds (5-20 MHz for
well-designed discrete logic), every wasted cycle is expensive.

This means:
- **Single-cycle matching is non-negotiable** (achieved via direct indexing)
- **Pipeline depth should be minimised** — each stage adds a clock period
  of latency. Monsoon's 8 stages would give 400-1600ns per token at
  discrete-logic speeds. Our target of 5 stages is aggressive but
  achievable.
- **Monadic bypass matters** — monadic tokens skip the matching stage,
  saving a full cycle per monadic operation. At 50% monadic ops (typical
  for many programs), this is significant.
- **Network latency is the enemy** — every hop between PEs adds pipeline
  latency. Compiler-assigned locality (keeping communicating nodes on
  the same PE) is critical. This is why we care about static PE
  assignment even though matching is dynamic.
- **The sequential scheduling shortcut (§1) becomes more valuable** the
  slower the clock is — if a monadic chain of 5 ops takes 25 cycles
  through the full pipeline but could take 5 cycles with short-circuit
  execution, that's 1000-4000ns saved at 50-200ns clock periods.
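
The bullet arithmetic can be sanity-checked in a few lines; the stage
counts and the 5-20 MHz clock range are the figures quoted above:

```python
# Sanity-check the cycle arithmetic above: latency = stages x clock period,
# with the 5-20 MHz clock range quoted in the text (50-200 ns period).

def token_latency_ns(stages: int, clock_mhz: float) -> float:
    return stages * (1000.0 / clock_mhz)

# Monsoon-depth pipeline across the discrete-logic clock range:
assert token_latency_ns(8, 20) == 400.0   # 8 stages @ 20 MHz
assert token_latency_ns(8, 5) == 1600.0   # 8 stages @ 5 MHz

# 5-op monadic chain: 25 cycles full pipeline vs 5 cycles short-circuited.
saved = 25 - 5
print(f"{saved} cycles saved = {saved * 50}-{saved * 200} ns at 5-20 MHz")
```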

### Implications for PE Design

Every pipeline stage must justify its existence in terms of critical path.
If a stage can be merged with an adjacent stage without extending the
critical path beyond the target clock period, merge it.

Stages that are "free" (can overlap with SRAM access time): address
generation, mux selection, comparator setup.

Stages that set the clock period: SRAM read (15-25ns for fast async SRAM),
ALU operation (depends on width — an 8-bit add is ~15ns in 74HC, 16-bit
~25ns with carry lookahead).
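
To make the merge-or-split decision concrete, a back-of-envelope period
budget. The SRAM and ALU delays are the ones quoted above; the
register-setup and margin figures are assumptions for illustration:

```python
# Rough clock-period budget for the clock-setting stages named above.
# SRAM/ALU delays come from the text; REG_SETUP_NS and MARGIN are
# illustrative assumptions, not measured values.

SRAM_READ_NS = 25    # fast async SRAM, worst case of the 15-25 ns range
ALU_ADD_NS   = 25    # 16-bit add with carry lookahead
REG_SETUP_NS = 5     # assumed flip-flop setup + clock skew allowance
MARGIN       = 1.25  # assumed 25% headroom for wiring and temperature

# Separate stages: the slower of SRAM read / ALU sets the clock period.
split_ns = (max(SRAM_READ_NS, ALU_ADD_NS) + REG_SETUP_NS) * MARGIN
# Merged into one stage: the delays add and the usable clock nearly halves.
merged_ns = (SRAM_READ_NS + ALU_ADD_NS + REG_SETUP_NS) * MARGIN

print(f"split:  period >= {split_ns:.1f} ns (~{1000/split_ns:.1f} MHz)")
print(f"merged: period >= {merged_ns:.1f} ns (~{1000/merged_ns:.1f} MHz)")
```

Under these assumptions, merging the SRAM-read and ALU stages would pull
the clock well below the 20 MHz end of the target range — which is the
kind of tradeoff the rule above is meant to expose.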

---

## 4. Technology Notes

### Not Strictly TTL

The project is described as "74-series TTL + SRAM" but the actual target
technology is more nuanced:

- **74HC / 74HCT CMOS** is the likely primary logic family, not original
  74-series TTL. HC/HCT gives lower power, better noise margins, and
  similar or better speed at the gate level. HCT is input-compatible
  with TTL levels.
- **74AC / 74ACT** (Advanced CMOS) for critical-path stages where the
  extra speed matters. ~5ns propagation vs ~10ns for HC.
- **74F** (FAST TTL) is an option for specific high-speed paths but draws
  more power and is less available.
- **Async SRAM** (IS61C256, AS6C4008, etc.) for all bulk storage:
  instruction memory, matching store, token FIFOs, structure memory.
  15-25ns access times are the pipeline clock floor.
- **EEPROMs** (AT28C256 or similar) for instruction memory where runtime
  reprogramming via type-11 is acceptable with higher write latency.
  Or SRAM with battery backup / external loading.

The key constraint is **no large-scale integration beyond commodity SRAM
and EEPROM**. No FPGAs in the final build (though FPGA prototyping is
encouraged). No custom ASICs. No microcontrollers in the datapath (though
a microcontroller as an external test fixture / bootstrap host is fine
for development).

The "period-plausible" framing refers to the transistor budget being
comparable to processors from the late 1970s / early 1980s (68000-class),
not to the specific technology used. Modern CMOS 74-series parts are
faster and lower power than original TTL, but the logic complexity and
design methodology are the same.

---

## 5. Priority Reading List (from Prior Art Survey)

Based on the reference guide and current design state, prioritised for
maximum impact on near-term design decisions:

### Must-Read (directly affects current design choices)

1. **Papadopoulos & Culler, "Monsoon: An Explicit Token-Store
   Architecture" (ISCA 1990)** — ETS mechanics, 8-stage pipeline,
   frame memory organisation. Closest prior art to our matching store.

2. **Culler & Papadopoulos, "The Explicit Token Store" (JPDC 1990)** —
   Extended journal version with more detail on the state-bit mechanism
   and pipeline stages.

3. **Papadopoulos PhD thesis (MIT, 1988)** — THE most detailed Monsoon
   hardware source. Board-level design, chip selection, pipeline timing.
   Hard to find but worth the effort.

4. **Sakai et al., "An Architecture of a Dataflow Single Chip Processor"
   (ISCA 1989)** — EM-4 core paper. 50K-gate PE, circular pipeline,
   direct matching, strongly connected arc model. Most sophisticated
   pipeline design in the literature.

5. **da Silva & Watson, "A Pseudo-Associative Matching Store with Hardware
   Hashing" (IEE 1983)** — Even though we're not using hash matching,
   understanding WHY Manchester went this way and what the tradeoffs
   were informs our design.

6. **Culler, "Resource Management for the Tagged Token Dataflow
   Architecture" (MIT TR-332, 1985)** — Token store overflow, deadlock,
   frame-space management. Essential for understanding throttling and
   resource constraints.

### Should-Read (informs broader design context)

7. **Dennis, "Building Blocks for Data Flow Prototypes" (ISCA 1980)** —
   Modular hardware building blocks for discrete-logic dataflow. May
   directly influence our board-level module decomposition.

8. **Sakai et al., EM-4 network paper (Parallel Computing 1993)** —
   Circular omega network design and deadlock prevention. Relevant
   when we scale past a shared bus.

9. **Arvind & Nikhil, "Executing a Program on the MIT TTDA" (IEEE TC
   1990)** — TTDA PE organisation, tag format, I-structure memory.
   Foundational context for understanding Monsoon.

10. **Papadopoulos & Traub, "Multithreading: A Revisionist View" (ISCA
    1991)** — Sequential scheduling optimisation. Not for the
    architecture, but for the microarchitectural shortcut idea.

### Background (fills in the picture)

11. **Lee & Hurson, "Dataflow Architectures and Multithreading" (IEEE
    Computer 1994)** — Survey bridging pure dataflow to the
    multithreading era.

12. **Arvind & Culler, "Dataflow Architectures" (Annual Review 1986)** —
    MIT perspective, good overview of the design space.

13. **Grafe et al., "The Epsilon Dataflow Processor" (ISCA 1989)** —
    Hybrid approach (Sandia), interesting as a contrast to show where
    we DON'T want to go.

14. **Watson & Gurd, "A Practical Data Flow Computer" (IEEE Computer
    1982)** — Board-level Manchester hardware aimed at hardware engineers.

---

## 6. Key Open Questions Informed by Research

Things the prior art survey surfaced that we should think about:

1. **Monsoon's 128-word frames vs our 16-32 entry slots**: are we too
   small? Monsoon's larger frames reduce allocation frequency but waste
   space on small activations. Our smaller slots are more efficient but
   may cause more allocation churn. Need to compile some test programs
   and measure.
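
The shape of the measurement is simple even before we have compiled
programs. In this sketch the one-frame-per-activation discipline follows
the ETS "compiler must fit" model from §2, but the activation-size
distribution is entirely invented for illustration:

```python
# Toy fragmentation model for the frame-size question. One frame per
# activation (the ETS "compiler must fit" discipline); the activation-size
# distribution below is invented purely for illustration -- real answers
# need compiled test programs.

activation_sizes = [3, 5, 8, 8, 12, 20, 40, 60]  # matching entries needed

def frame_stats(frame_size, sizes):
    fits = [s for s in sizes if s <= frame_size]
    wasted = sum(frame_size - s for s in fits)   # internal fragmentation
    return len(fits), len(sizes) - len(fits), wasted

for frame in (16, 32, 128):
    ok, too_big, wasted = frame_stats(frame, activation_sizes)
    print(f"frame={frame:3d}: {ok} fit, {too_big} too large, "
          f"{wasted} entries wasted")
```

Small frames waste little but reject large activations (forcing splitting
or churn); 128-word frames hold everything at a large fragmentation cost.
The real distribution is what decides it.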

2. **EM-4's circular pipeline**: their PE reuses pipeline stages for
   different phases of token processing, reducing total hardware per PE.
   Worth investigating whether our 5-stage pipeline could benefit from
   a similar trick.

3. **EM-4's strongly connected arc model**: a different take on monadic
   chains where consecutive operations within a thread execute without
   re-entering the network. Related to the sequential scheduling idea
   but architecturally distinct. Need to read the papers to understand
   the hardware implications.

4. **I-structure memory (Arvind)**: presence bits on structure memory
   words for synchronisation. Our SM doesn't currently have this — SM
   operations are simple read/write/RMW. I-structures enable deferred
   reads (a read of an empty word blocks until the write arrives). This
   is a significant capability for certain parallel patterns. Worth
   evaluating whether SM should support it.
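
The semantics we would be buying are easy to state as a behavioural
sketch. The EMPTY/DEFERRED/FULL states and deferred-read queue are
standard I-structure semantics from the literature; the class and method
names are illustrative, not our SM's current operation set:

```python
# Behavioural sketch of one I-structure cell: each word carries a presence
# state (EMPTY / DEFERRED / FULL). A read of an EMPTY word is parked and
# answered when the write arrives; writing a FULL word is an error
# (single-assignment). Names are illustrative.

EMPTY, DEFERRED, FULL = "empty", "deferred", "full"

class IStructureCell:
    def __init__(self):
        self.state = EMPTY
        self.value = None
        self.waiting = []          # deferred read continuations

    def read(self, consumer):
        """consumer: callback receiving the value (models a result token)."""
        if self.state == FULL:
            consumer(self.value)   # data present: answer immediately
        else:
            self.state = DEFERRED  # data absent: park the read
            self.waiting.append(consumer)

    def write(self, value):
        if self.state == FULL:
            raise RuntimeError("I-structure single-assignment violated")
        self.value, self.state = value, FULL
        for consumer in self.waiting:   # wake all deferred readers
            consumer(value)
        self.waiting.clear()

cell = IStructureCell()
results = []
cell.read(results.append)   # read before write: deferred, no result yet
assert results == []
cell.write(7)               # the write releases the deferred read
cell.read(results.append)   # read after write: answered immediately
assert results == [7, 7]
```

In hardware, the state field is the presence bits on each SM word and the
deferred list is some form of pending-read storage — which is exactly the
cost to evaluate.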