OR-1 dataflow CPU sketch
# Dynamic Dataflow CPU — Architectural Positioning & Research Notes

Working notes capturing design philosophy decisions, research insights,
and prior-art observations. Not a design spec — a reference for "why we
chose this direction" and "what we learned from the literature."

## Companion Documents

- `architecture-overview.md` — master architecture reference
- `pe-design.md` — PE pipeline, matching store, context slots
- `design-alternatives.md` — rejected/deferred approaches
- `Prior_Art_Reference_Guide_for_a_Discrete-Logic_Dynamic_Dataflow_CPU.md`
  — comprehensive bibliography

---

## 1. Core Architectural Commitment: Pure Dataflow, Not Hybrid

This project is a **dynamic dataflow machine**, not a multithreaded RISC
core with dataflow-style synchronisation primitives.

The research lineage ran Manchester → TTDA → Monsoon → \*T → Sparcle, with
each step making the PE more like a conventional CPU and reducing the
dataflow aspects to synchronisation mechanisms (presence bits on memory,
fast context switching, message passing). By \*T (1992), the "dataflow"
part is essentially hardware semaphores bolted onto a modified SPARC.

**We are not going down that road.** The point of this project is the
different execution model — where synchronisation is implicit in the
data flow, not explicit in the program. A PE that needs a program counter,
register file, bypass network, and branch prediction is solving a
different problem from the one we're solving.


Specific non-goals:

- Sequential instruction streams within a PE (\*T, Sparcle, EARTH)
- Register files as primary operand storage
- Program counter / sequential fetch logic
- Branch prediction hardware
- Hardware semaphores / presence-bit memory traps (\*T model)
- "Make each PE a full CPU" — this blows the transistor budget from
  4 PEs down to 1-2, losing the parallelism that's the whole point

### Where We Sit on the Spectrum

```
Pure dataflow ←————————————————————————→ Pure von Neumann
Manchester      Monsoon        *T/Sparcle      OoO superscalar
    |              |                |                 |
    |        [this project]         |                 |
    |              |                |                 |
hash matching   ETS/direct      RISC+sync       register renaming
no PC           no PC           has PC          has PC
no registers    no regs         register file   register file
```

We're roughly at the Monsoon point on this spectrum: direct-indexed
matching with presence bits (independently derived, see §2), token-driven
execution, no program counter. But with a smaller/simpler PE than Monsoon
(fewer pipeline stages, smaller frames, generation counters for ABA
protection instead of Monsoon's tighter deallocation control).

### What IS Worth Mining from the Hybrid Work

The Papadopoulos & Traub 1991 paper ("Multithreading: A Revisionist View
of Dataflow Architectures") contains one microarchitectural optimisation
that's relevant without changing the architecture:

**Sequential scheduling of monadic chains.** If A feeds B feeds C and all
are monadic (single-input), the tokens cycle through the full pipeline
for each hop: token-in → match-bypass → fetch → execute → token-out →
back to token-in. If the PE could recognise this pattern and keep the
result in-pipeline for the next instruction (skipping token formatting
and the input FIFO), that's a significant latency win on sequential
chains.

This is a **microarchitectural shortcut**, not an architectural change.
The token semantics don't change. The compiler doesn't need to know.
It's just an optimisation where the PE notices "output goes to me,
monadic, next instruction" and short-circuits the pipeline. Worth
considering post-v0 if sequential throughput is a problem.

---

## 2. Convergence with Monsoon's Explicit Token Store

The matching store design in `pe-design.md` — direct-indexed context slots
with occupied bits, compiler-assigned slot IDs, bump allocator — was
derived independently from first principles:

1. Manchester's hash table has terrible utilisation (<20%) and enormous
   hardware cost (16 SRAM banks + comparators per PE)
2. Amamiya's semi-CAM is better, but CAM chips are tiny at discrete scale
3. If the compiler assigns context IDs statically, you can use them as
   direct SRAM addresses → no associative lookup at all
4. Occupied bit = 1-bit presence flag per matching entry
5. Generation counter handles ABA on slot reuse

This turns out to be essentially the same thing Papadopoulos and Culler
arrived at with the Explicit Token Store (ETS) for Monsoon (1990), via
a similar line of reasoning from the TTDA's frame-based matching (Arvind
& Nikhil 1987/1990).

**Key differences from Monsoon ETS:**

| Aspect | Monsoon ETS | This design |
|--------|-------------|-------------|
| Frame size | 128 words (fixed) | 16-32 entries (configurable) |
| Allocation | Shared free-list of frame pointers | Bump allocator + bitmap/FIFO |
| ABA protection | Tight dealloc control, no reuse until drained | 2-bit generation counter per slot |
| Pipeline depth | 8 stages | 5 stages (target) |
| Matching entry ID | Compiler-assigned slot offset in frame | Compiler-assigned match_entry in context slot |
| Overflow | Not handled (compiler must fit) | Stall + optional future CAM buffer |

The generation counter is more defensive than Monsoon's approach, which
is appropriate for a first build where catching bugs matters more than
saving 2 bits per slot.
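
The mechanism described above (occupied bit, parked operand, generation
counter, bump allocation) can be pinned down with a small behavioural
model. This is a sketch only: Python, with invented names
(`MatchingStore`, `alloc`, `arrive`); the actual structure lives in
`pe-design.md`.

```python
# Behavioural model of the direct-indexed matching store: occupied bit,
# waiting operand, and a 2-bit generation counter per slot. Names and
# sizes here are illustrative, not taken from the design docs.

GEN_BITS = 2  # 2-bit generation counter, per the comparison table

class MatchingStore:
    def __init__(self, num_slots=32):
        # Per-slot state: occupied flag, parked operand, generation
        self.occupied = [False] * num_slots
        self.operand = [None] * num_slots
        self.generation = [0] * num_slots
        self.next_slot = 0  # bump allocator (real design adds bitmap/FIFO)

    def alloc(self):
        """Bump-allocate a slot; bumping the generation on reuse
        invalidates any stale in-flight tokens (ABA protection)."""
        slot = self.next_slot
        self.next_slot = (self.next_slot + 1) % len(self.occupied)
        self.generation[slot] = (self.generation[slot] + 1) % (1 << GEN_BITS)
        self.occupied[slot] = False
        return slot, self.generation[slot]

    def arrive(self, slot, gen, value):
        """One token arrives at a dyadic node. Returns the operand pair
        when the match completes, else None. The slot ID is a direct
        SRAM index: no associative search, single-cycle in hardware."""
        if gen != self.generation[slot]:
            return None  # stale token from a recycled slot: drop it
        if not self.occupied[slot]:
            self.occupied[slot] = True   # first operand: park and wait
            self.operand[slot] = value
            return None
        self.occupied[slot] = False      # second operand: fire the node
        return (self.operand[slot], value)
```

The point of the sketch is how little state a match touches: one read,
one compare, one write per token, which is what makes single-cycle
matching plausible in discrete logic.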
Monsoon's free-list is cleaner in theory, but the
bump allocator is simpler hardware (a counter vs. FIFO management).

**Actionable insight from this convergence:** the ETS papers (especially
Papadopoulos's 1988 PhD thesis) contain detailed pipeline timing, hazard
analysis, and state-bit logic that's directly applicable to our matching
store, even though the designs were arrived at independently. Worth
reading for implementation details, not just architecture.

---

## 3. Clock Efficiency as Primary PE Constraint

In discrete logic, clock speed is the hard constraint. Cascaded gate
delays add up to ~10-30ns per pipeline stage depending on logic family,
and pipeline depth directly multiplies total latency. At realistic clock
speeds (5-20 MHz for well-designed discrete logic), every wasted cycle is
expensive.

This means:

- **Single-cycle matching is non-negotiable** (achieved via direct indexing)
- **Pipeline depth should be minimised** — each stage adds a clock period
  of latency. Monsoon's 8 stages would give 400-1600ns per token at
  discrete-logic speeds. Our target of 5 stages is aggressive but
  achievable.
- **Monadic bypass matters** — monadic tokens skipping the matching stage
  save a full cycle per monadic operation. At 50% monadic ops (typical
  for many programs), this is significant.
- **Network latency is the enemy** — every hop between PEs adds pipeline
  latency. Compiler-assigned locality (keeping communicating nodes on
  the same PE) is critical. This is why we care about static PE
  assignment even though matching is dynamic.
- **The sequential scheduling shortcut (§1) becomes more valuable** the
  slower the clock is — if a monadic chain of 5 ops takes 25 cycles
  through the full pipeline but only 5 cycles with short-circuit
  execution, that's 20 cycles — roughly 1-4 µs — saved per chain at
  discrete-logic clock periods of 50-200ns.

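
The cycle arithmetic in these bullets is easy to sanity-check. A
back-of-envelope calculation in Python; the stage counts and the
5-20 MHz clock range are the figures quoted above, everything else is
plain arithmetic:

```python
# Sanity-check the latency figures: stage count times clock period,
# across the quoted 5-20 MHz discrete-logic clock range.

def ns_per_token(stages, clock_mhz):
    """Latency of one token through the pipeline, in nanoseconds."""
    return stages * (1000 / clock_mhz)

# Monsoon's 8 stages vs the 5-stage target
for mhz in (5, 20):
    print(f"{mhz:2d} MHz: 8-stage = {ns_per_token(8, mhz):.0f}ns, "
          f"5-stage = {ns_per_token(5, mhz):.0f}ns")

# Dependent monadic chain of 5 ops: 25 cycles through the full pipeline
# vs 5 cycles with the sequential-scheduling shortcut
full_cycles, shortcut_cycles = 5 * 5, 5
for mhz in (5, 20):
    saved_ns = (full_cycles - shortcut_cycles) * (1000 / mhz)
    print(f"shortcut saves {saved_ns:.0f}ns per chain at {mhz} MHz")
```

At 5 MHz the 8-stage figure is 1600ns per token and the chain shortcut
recovers 4000ns; at 20 MHz those drop to 400ns and 1000ns, which is why
the shortcut matters most at the slow end of the clock range.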

### Implications for PE Design

Every pipeline stage must justify its existence in terms of critical path.
If a stage can be merged with an adjacent stage without extending the
critical path beyond the target clock period, merge it.

Stages that are "free" (can overlap with SRAM access time): address
generation, mux selection, comparator setup.

Stages that set the clock period: SRAM read (15-25ns for fast async SRAM),
ALU operation (depends on width — an 8-bit add is ~15ns in 74HC, a 16-bit
add ~25ns with carry lookahead).

---

## 4. Technology Notes

### Not Strictly TTL

The project is described as "74-series TTL + SRAM", but the actual target
technology is more nuanced:

- **74HC / 74HCT CMOS** is the likely primary logic family, not original
  74-series TTL. HC/HCT gives lower power, better noise margins, and
  similar or better speed at the gate level. HCT is input-compatible
  with TTL levels.
- **74AC / 74ACT** (Advanced CMOS) for critical-path stages where the
  extra speed matters: ~5ns propagation vs ~10ns for HC.
- **74F** (FAST TTL) is an option for specific high-speed paths but draws
  more power and is less available.
- **Async SRAM** (IS61C256, AS6C4008, etc.) for all bulk storage:
  instruction memory, matching store, token FIFOs, structure memory.
  15-25ns access times set the pipeline clock floor.
- **EEPROMs** (AT28C256 or similar) for instruction memory where runtime
  reprogramming via type-11 is acceptable despite higher write latency.
  Or SRAM with battery backup / external loading.

The key constraint is **no large-scale integration beyond commodity SRAM
and EEPROM**. No FPGAs in the final build (though FPGA prototyping is
encouraged). No custom ASICs. No microcontrollers in the datapath (though
a microcontroller as an external test fixture / bootstrap host is fine
for development).

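
The stage-merge rule in §3 reduces to a one-line predicate over these
part delays. A sketch under stated assumptions: the five-stage split and
the 20% timing margin are mine, not spec values; the delay figures are
the ones quoted above.

```python
# Stage-merge check: merge two adjacent stages only when their combined
# combinational path still fits the target clock period with margin.
# The stage list and the 20% margin are illustrative assumptions.

TARGET_PERIOD_NS = 50   # 20 MHz target clock, per the quoted range
MARGIN = 0.8            # leave 20% timing slack (assumption)

# (stage name, worst-case combinational delay in ns, from the quoted figures)
stages = [
    ("token-in",  10),
    ("match",     25),   # async SRAM read, 15-25ns parts
    ("fetch",     25),   # instruction SRAM read
    ("execute",   25),   # 16-bit add with carry lookahead in 74HC
    ("token-out", 10),
]

def mergeable(delay_a, delay_b):
    """True if two adjacent stages could share one clock period."""
    return delay_a + delay_b <= TARGET_PERIOD_NS * MARGIN

for (name_a, d_a), (name_b, d_b) in zip(stages, stages[1:]):
    verdict = "could merge" if mergeable(d_a, d_b) else "keep split"
    print(f"{name_a} + {name_b}: {d_a + d_b}ns -> {verdict}")
```

Under these numbers the two SRAM-bound stages refuse to merge with each
other but the cheap token-handling stages fold into their neighbours,
which is the shape of argument each proposed stage merge has to survive.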

The "period-plausible" framing refers to the transistor budget being
comparable to processors from the late 1970s / early 1980s (68000-class),
not to the specific technology used. Modern CMOS 74-series parts are
faster and lower power than original TTL, but the logic complexity and
design methodology are the same.

---

## 5. Priority Reading List (from Prior Art Survey)

Based on the reference guide and current design state, prioritised for
maximum impact on near-term design decisions:

### Must-Read (directly affects current design choices)

1. **Papadopoulos & Culler, "Monsoon: An Explicit Token-Store
   Architecture" (ISCA 1990)** — ETS mechanics, 8-stage pipeline,
   frame memory organisation. Closest prior art to our matching store.

2. **Culler & Papadopoulos, "The Explicit Token Store" (JPDC 1990)** —
   Extended journal version with more detail on the state-bit mechanism
   and pipeline stages.

3. **Papadopoulos PhD thesis (MIT, 1988)** — THE most detailed Monsoon
   hardware source. Board-level design, chip selection, pipeline timing.
   Hard to find but worth the effort.

4. **Sakai et al., "An Architecture of a Dataflow Single Chip Processor"
   (ISCA 1989)** — EM-4 core paper. 50K-gate PE, circular pipeline,
   direct matching, strongly connected arc model. Most sophisticated
   pipeline design in the literature.

5. **da Silva & Watson, "A Pseudo-Associative Matching Store with Hardware
   Hashing" (IEE 1983)** — Even though we're not using hash matching,
   understanding WHY Manchester went this way and what the tradeoffs
   were informs our design.

6. **Culler, "Resource Management for the Tagged Token Dataflow
   Architecture" (MIT TR-332, 1985)** — Token store overflow, deadlock,
   frame-space management. Essential for understanding throttling and
   resource constraints.

### Should-Read (informs broader design context)

7. **Dennis, "Building Blocks for Data Flow Prototypes" (ISCA 1980)** —
   Modular hardware building blocks for discrete-logic dataflow. May
   directly influence our board-level module decomposition.

8. **Sakai et al., EM-4 network paper (Parallel Computing 1993)** —
   Circular omega network design and deadlock prevention. Relevant
   when we scale past a shared bus.

9. **Arvind & Nikhil, "Executing a Program on the MIT TTDA" (IEEE TC
   1990)** — TTDA PE organisation, tag format, I-structure memory.
   Foundational context for understanding Monsoon.

10. **Papadopoulos & Traub, "Multithreading: A Revisionist View" (ISCA
    1991)** — Sequential scheduling optimisation. Not for the
    architecture, but for the microarchitectural shortcut idea.

### Background (fills in the picture)

11. **Lee & Hurson, "Dataflow Architectures and Multithreading" (IEEE
    Computer 1994)** — Survey bridging pure dataflow to the
    multithreading era.

12. **Arvind & Culler, "Dataflow Architectures" (Annual Review 1986)** —
    MIT perspective, good overview of the design space.

13. **Grafe et al., "The Epsilon Dataflow Processor" (ISCA 1989)** —
    Hybrid approach (Sandia), interesting as a contrast to show where
    we DON'T want to go.

14. **Watson & Gurd, "A Practical Data Flow Computer" (IEEE Computer
    1982)** — Board-level Manchester hardware aimed at hardware engineers.

---

## 6. Key Open Questions Informed by Research

Things the prior-art survey surfaced that we should think about:

1. **Monsoon's 128-word frames vs our 16-32 entry slots**: are we too
   small? Monsoon's larger frames reduce allocation frequency but waste
   space on small activations. Our smaller slots are more efficient but
   may cause more allocation churn. Need to compile some test programs
   and measure.

2. **EM-4's circular pipeline**: their PE reuses pipeline stages for
   different phases of token processing, reducing total hardware per PE.
   Worth investigating whether our 5-stage pipeline could benefit from
   a similar trick.

3. **EM-4's strongly connected arc model**: a different take on monadic
   chains, where consecutive operations within a thread execute without
   re-entering the network. Related to the sequential scheduling idea
   but architecturally distinct. Need to read the papers to understand
   the hardware implications.

4. **I-structure memory (Arvind)**: presence bits on structure memory
   words for synchronisation. Our SM doesn't currently have this — SM
   operations are simple read/write/RMW. I-structures enable deferred
   reads (a read of an empty word blocks until the write arrives). This
   is a significant capability for certain parallel patterns. Worth
   evaluating whether SM should support it.
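
For open question 4, the deferred-read semantics can be pinned down in a
few lines. A behavioural sketch only: the names (`IStructureMemory`,
`i_read`, `i_write`) and the tag representation are invented here, and a
hardware SM would hold the waiter list as a deferred-read chain in
memory rather than a Python list.

```python
# Sketch of Arvind-style I-structure semantics: each word carries
# presence state; a read of an EMPTY word is deferred (queued on the
# word) and released when the single-assignment write arrives.

EMPTY, DEFERRED, FULL = 0, 1, 2  # presence bits per word

class IStructureMemory:
    def __init__(self, size):
        self.state = [EMPTY] * size
        self.value = [None] * size
        self.waiters = [[] for _ in range(size)]  # deferred reader tags

    def i_read(self, addr, reader_tag):
        """Returns (tag, value) result tokens to emit; defers if EMPTY."""
        if self.state[addr] == FULL:
            return [(reader_tag, self.value[addr])]
        self.state[addr] = DEFERRED        # park the reader on the word
        self.waiters[addr].append(reader_tag)
        return []

    def i_write(self, addr, value):
        """Single-assignment write; wakes every deferred reader."""
        assert self.state[addr] != FULL, "I-structure words are write-once"
        self.value[addr] = value
        self.state[addr] = FULL
        released = [(tag, value) for tag in self.waiters[addr]]
        self.waiters[addr].clear()
        return released
```

The hardware cost the question is really about is visible in the sketch:
per-word state bits plus somewhere to queue an unbounded number of
deferred readers, which is what makes this more than a simple RMW port.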