OR-1 dataflow CPU sketch
# Dynamic Dataflow CPU — Architectural Positioning & Research Notes

Working notes capturing design philosophy decisions, research insights,
and prior-art observations. Not a design spec — a reference for "why we
chose this direction" and "what we learned from the literature."

## Companion Documents

- `architecture-overview.md` — master architecture reference
- `pe-design.md` — PE pipeline, matching store, context slots
- `design-alternatives.md` — rejected/deferred approaches
- `Prior_Art_Reference_Guide_for_a_Discrete-Logic_Dynamic_Dataflow_CPU.md`
  — comprehensive bibliography

---

## 1. Core Architectural Commitment: Pure Dataflow, Not Hybrid

This project is a **dynamic dataflow machine**, not a multithreaded RISC
core with dataflow-style synchronisation primitives.

The research lineage ran Manchester → TTDA → Monsoon → \*T → Sparcle, with
each step making the PE more like a conventional CPU and reducing the
dataflow aspects to synchronisation mechanisms (presence bits on memory,
fast context switching, message passing). By \*T (1992), the "dataflow"
part is essentially hardware semaphores bolted onto a modified SPARC.

**We are not going down that road.** The point of this project is the
different execution model — where synchronisation is implicit in the
data flow, not explicit in the program. A PE that needs a program counter,
register file, bypass network, and branch prediction is solving a
different problem from the one we're solving.


Specific non-goals:

- Sequential instruction streams within a PE (\*T, Sparcle, EARTH)
- Register files as primary operand storage
- Program counter / sequential fetch logic
- Branch prediction hardware
- Hardware semaphores / presence-bit memory traps (\*T model)
- "Make each PE a full CPU" — this blows the transistor budget from
  4 PEs down to 1-2, losing the parallelism that's the whole point

### Where We Sit on the Spectrum

```
Pure dataflow ←————————————————————————→ Pure von Neumann
Manchester      Monsoon        *T/Sparcle      OoO superscalar
    |              |                |                 |
    |        [this project]         |                 |
    |              |                |                 |
hash matching   ETS/direct      RISC+sync       register renaming
no PC           no PC           has PC          has PC
no registers    no regs         register file   register file
```

We're roughly at the Monsoon point on this spectrum: direct-indexed
matching with presence bits (independently derived, see §2), token-driven
execution, no program counter. But with a smaller/simpler PE than Monsoon
(fewer pipeline stages, smaller frames, generation counters for ABA
protection instead of Monsoon's tighter deallocation control).

### What IS Worth Mining from the Hybrid Work

The Papadopoulos & Traub 1991 paper ("Multithreading: A Revisionist View
of Dataflow Architectures") contains one microarchitectural optimisation
that's relevant without changing the architecture:

**Sequential scheduling of monadic chains.** If A feeds B feeds C and all
are monadic (single-input), the tokens cycle through the full pipeline
for each hop: token-in → match-bypass → fetch → execute → token-out →
back to token-in. If the PE could recognise this pattern and keep the
result in-pipeline for the next instruction (skipping token formatting
and the input FIFO), that's a significant latency win on sequential
chains.

This is a **microarchitectural shortcut**, not an architectural change.
The token semantics don't change. The compiler doesn't need to know.
It's just an optimisation where the PE notices "output goes to me,
monadic, next instruction" and short-circuits the pipeline. Worth
considering post-v0 if sequential throughput is a problem.

---

## 2. Convergence with Monsoon's Explicit Token Store

The matching store design in `pe-design.md` — direct-indexed context slots
with occupied bits, compiler-assigned slot IDs, bump allocator — was
derived independently from first principles:

1. Manchester's hash table has terrible utilisation (<20%) and enormous
   hardware cost (16 SRAM banks + comparators per PE)
2. Amamiya's semi-CAM is better, but CAM chips are tiny at discrete scale
3. If the compiler assigns context IDs statically, you can use them as
   direct SRAM addresses → no associative lookup at all
4. Occupied bit = 1-bit presence flag per matching entry
5. Generation counter handles ABA on slot reuse

This turns out to be essentially the same thing Papadopoulos and Culler
arrived at with the Explicit Token Store (ETS) for Monsoon (1990), via
a similar line of reasoning from the TTDA's frame-based matching (Arvind
& Nikhil 1987/1990).

**Key differences from Monsoon ETS:**

| Aspect | Monsoon ETS | This design |
|--------|-------------|-------------|
| Frame size | 128 words (fixed) | 16-32 entries (configurable) |
| Allocation | Shared free-list of frame pointers | Bump allocator + bitmap/FIFO |
| ABA protection | Tight dealloc control, no reuse until drained | 2-bit generation counter per slot |
| Pipeline depth | 8 stages | 5 stages (target) |
| Matching entry ID | Compiler-assigned slot offset in frame | Compiler-assigned match_entry in context slot |
| Overflow | Not handled (compiler must fit) | Stall + optional future CAM buffer |

The generation counter is more defensive than Monsoon's approach, which
is appropriate for a first build where catching bugs matters more than
saving 2 bits per slot.
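
The mechanism described above (occupied bit, parked operand, generation
counter, bump allocation) can be pinned down with a small behavioural
model. This is a sketch only: Python, with invented names
(`MatchingStore`, `alloc`, `arrive`); the actual structure lives in
`pe-design.md`.

```python
# Behavioural model of the direct-indexed matching store: occupied bit,
# waiting operand, and a 2-bit generation counter per slot. Names and
# sizes here are illustrative, not taken from the design docs.

GEN_BITS = 2  # 2-bit generation counter, per the comparison table

class MatchingStore:
    def __init__(self, num_slots=32):
        # Per-slot state: occupied flag, parked operand, generation
        self.occupied = [False] * num_slots
        self.operand = [None] * num_slots
        self.generation = [0] * num_slots
        self.next_slot = 0  # bump allocator (real design adds bitmap/FIFO)

    def alloc(self):
        """Bump-allocate a slot; bumping the generation on reuse
        invalidates any stale in-flight tokens (ABA protection)."""
        slot = self.next_slot
        self.next_slot = (self.next_slot + 1) % len(self.occupied)
        self.generation[slot] = (self.generation[slot] + 1) % (1 << GEN_BITS)
        self.occupied[slot] = False
        return slot, self.generation[slot]

    def arrive(self, slot, gen, value):
        """One token arrives at a dyadic node. Returns the operand pair
        when the match completes, else None. The slot ID is a direct
        SRAM index: no associative search, single-cycle in hardware."""
        if gen != self.generation[slot]:
            return None  # stale token from a recycled slot: drop it
        if not self.occupied[slot]:
            self.occupied[slot] = True   # first operand: park and wait
            self.operand[slot] = value
            return None
        self.occupied[slot] = False      # second operand: fire the node
        return (self.operand[slot], value)
```

The point of the sketch is how little state a match touches: one read,
one compare, one write per token, which is what makes single-cycle
matching plausible in discrete logic.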
Monsoon's free-list is cleaner in theory, but the
bump allocator is simpler hardware (a counter vs. FIFO management).

**Actionable insight from this convergence:** the ETS papers (especially
Papadopoulos's 1988 PhD thesis) contain detailed pipeline timing, hazard
analysis, and state-bit logic that's directly applicable to our matching
store, even though the designs were arrived at independently. Worth
reading for implementation details, not just architecture.

---

## 3. Clock Efficiency as Primary PE Constraint

In discrete logic, clock speed is the hard constraint. Cascaded gate
delays add up to ~10-30ns per pipeline stage depending on logic family,
and pipeline depth directly multiplies total latency. At realistic clock
speeds (5-20 MHz for well-designed discrete logic), every wasted cycle is
expensive.

This means:

- **Single-cycle matching is non-negotiable** (achieved via direct indexing)
- **Pipeline depth should be minimised** — each stage adds a clock period
  of latency. Monsoon's 8 stages would give 400-1600ns per token at
  discrete-logic speeds. Our target of 5 stages is aggressive but
  achievable.
- **Monadic bypass matters** — monadic tokens skipping the matching stage
  save a full cycle per monadic operation. At 50% monadic ops (typical
  for many programs), this is significant.
- **Network latency is the enemy** — every hop between PEs adds pipeline
  latency. Compiler-assigned locality (keeping communicating nodes on
  the same PE) is critical. This is why we care about static PE
  assignment even though matching is dynamic.
- **The sequential scheduling shortcut (§1) becomes more valuable** the
  slower the clock is — if a monadic chain of 5 ops takes 25 cycles
  through the full pipeline but only 5 cycles with short-circuit
  execution, that's 20 cycles — roughly 1-4 µs — saved per chain at
  discrete-logic clock periods of 50-200ns.

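
The cycle arithmetic in these bullets is easy to sanity-check. A
back-of-envelope calculation in Python; the stage counts and the
5-20 MHz clock range are the figures quoted above, everything else is
plain arithmetic:

```python
# Sanity-check the latency figures: stage count times clock period,
# across the quoted 5-20 MHz discrete-logic clock range.

def ns_per_token(stages, clock_mhz):
    """Latency of one token through the pipeline, in nanoseconds."""
    return stages * (1000 / clock_mhz)

# Monsoon's 8 stages vs the 5-stage target
for mhz in (5, 20):
    print(f"{mhz:2d} MHz: 8-stage = {ns_per_token(8, mhz):.0f}ns, "
          f"5-stage = {ns_per_token(5, mhz):.0f}ns")

# Dependent monadic chain of 5 ops: 25 cycles through the full pipeline
# vs 5 cycles with the sequential-scheduling shortcut
full_cycles, shortcut_cycles = 5 * 5, 5
for mhz in (5, 20):
    saved_ns = (full_cycles - shortcut_cycles) * (1000 / mhz)
    print(f"shortcut saves {saved_ns:.0f}ns per chain at {mhz} MHz")
```

At 5 MHz the 8-stage figure is 1600ns per token and the chain shortcut
recovers 4000ns; at 20 MHz those drop to 400ns and 1000ns, which is why
the shortcut matters most at the slow end of the clock range.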

### Implications for PE Design

Every pipeline stage must justify its existence in terms of critical path.
If a stage can be merged with an adjacent stage without extending the
critical path beyond the target clock period, merge it.

Stages that are "free" (can overlap with SRAM access time): address
generation, mux selection, comparator setup.

Stages that set the clock period: SRAM read (15-25ns for fast async SRAM),
ALU operation (depends on width — an 8-bit add is ~15ns in 74HC, a 16-bit
add ~25ns with carry lookahead).

---

## 4. Technology Notes

### Not Strictly TTL

The project is described as "74-series TTL + SRAM", but the actual target
technology is more nuanced:

- **74HC / 74HCT CMOS** is the likely primary logic family, not original
  74-series TTL. HC/HCT gives lower power, better noise margins, and
  similar or better speed at the gate level. HCT is input-compatible
  with TTL levels.
- **74AC / 74ACT** (Advanced CMOS) for critical-path stages where the
  extra speed matters: ~5ns propagation vs ~10ns for HC.
- **74F** (FAST TTL) is an option for specific high-speed paths but draws
  more power and is less available.
- **Async SRAM** (IS61C256, AS6C4008, etc.) for all bulk storage:
  instruction memory, matching store, token FIFOs, structure memory.
  15-25ns access times set the pipeline clock floor.
- **EEPROMs** (AT28C256 or similar) for instruction memory where runtime
  reprogramming via type-11 is acceptable despite higher write latency.
  Or SRAM with battery backup / external loading.

The key constraint is **no large-scale integration beyond commodity SRAM
and EEPROM**. No FPGAs in the final build (though FPGA prototyping is
encouraged). No custom ASICs. No microcontrollers in the datapath (though
a microcontroller as an external test fixture / bootstrap host is fine
for development).

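
The stage-merge rule in §3 reduces to a one-line predicate over these
part delays. A sketch under stated assumptions: the five-stage split and
the 20% timing margin are mine, not spec values; the delay figures are
the ones quoted above.

```python
# Stage-merge check: merge two adjacent stages only when their combined
# combinational path still fits the target clock period with margin.
# The stage list and the 20% margin are illustrative assumptions.

TARGET_PERIOD_NS = 50   # 20 MHz target clock, per the quoted range
MARGIN = 0.8            # leave 20% timing slack (assumption)

# (stage name, worst-case combinational delay in ns, from the quoted figures)
stages = [
    ("token-in",  10),
    ("match",     25),   # async SRAM read, 15-25ns parts
    ("fetch",     25),   # instruction SRAM read
    ("execute",   25),   # 16-bit add with carry lookahead in 74HC
    ("token-out", 10),
]

def mergeable(delay_a, delay_b):
    """True if two adjacent stages could share one clock period."""
    return delay_a + delay_b <= TARGET_PERIOD_NS * MARGIN

for (name_a, d_a), (name_b, d_b) in zip(stages, stages[1:]):
    verdict = "could merge" if mergeable(d_a, d_b) else "keep split"
    print(f"{name_a} + {name_b}: {d_a + d_b}ns -> {verdict}")
```

Under these numbers the two SRAM-bound stages refuse to merge with each
other but the cheap token-handling stages fold into their neighbours,
which is the shape of argument each proposed stage merge has to survive.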

The "period-plausible" framing refers to the transistor budget being
comparable to processors from the late 1970s / early 1980s (68000-class),
not to the specific technology used. Modern CMOS 74-series parts are
faster and lower power than original TTL, but the logic complexity and
design methodology are the same.

---

## 5. Priority Reading List (from Prior Art Survey)

Based on the reference guide and current design state, prioritised for
maximum impact on near-term design decisions:

### Must-Read (directly affects current design choices)

1. **Papadopoulos & Culler, "Monsoon: An Explicit Token-Store
   Architecture" (ISCA 1990)** — ETS mechanics, 8-stage pipeline,
   frame memory organisation. Closest prior art to our matching store.

2. **Culler & Papadopoulos, "The Explicit Token Store" (JPDC 1990)** —
   Extended journal version with more detail on the state-bit mechanism
   and pipeline stages.

3. **Papadopoulos PhD thesis (MIT, 1988)** — THE most detailed Monsoon
   hardware source. Board-level design, chip selection, pipeline timing.
   Hard to find but worth the effort.

4. **Sakai et al., "An Architecture of a Dataflow Single Chip Processor"
   (ISCA 1989)** — EM-4 core paper. 50K-gate PE, circular pipeline,
   direct matching, strongly connected arc model. Most sophisticated
   pipeline design in the literature.

5. **da Silva & Watson, "A Pseudo-Associative Matching Store with Hardware
   Hashing" (IEE 1983)** — Even though we're not using hash matching,
   understanding WHY Manchester went this way and what the tradeoffs
   were informs our design.

6. **Culler, "Resource Management for the Tagged Token Dataflow
   Architecture" (MIT TR-332, 1985)** — Token store overflow, deadlock,
   frame-space management. Essential for understanding throttling and
   resource constraints.

### Should-Read (informs broader design context)

7. **Dennis, "Building Blocks for Data Flow Prototypes" (ISCA 1980)** —
   Modular hardware building blocks for discrete-logic dataflow. May
   directly influence our board-level module decomposition.

8. **Sakai et al., EM-4 network paper (Parallel Computing 1993)** —
   Circular omega network design and deadlock prevention. Relevant
   when we scale past a shared bus.

9. **Arvind & Nikhil, "Executing a Program on the MIT TTDA" (IEEE TC
   1990)** — TTDA PE organisation, tag format, I-structure memory.
   Foundational context for understanding Monsoon.

10. **Papadopoulos & Traub, "Multithreading: A Revisionist View" (ISCA
    1991)** — Sequential scheduling optimisation. Not for the
    architecture, but for the microarchitectural shortcut idea.

### Background (fills in the picture)

11. **Lee & Hurson, "Dataflow Architectures and Multithreading" (IEEE
    Computer 1994)** — Survey bridging pure dataflow to the
    multithreading era.

12. **Arvind & Culler, "Dataflow Architectures" (Annual Review 1986)** —
    MIT perspective, good overview of the design space.

13. **Grafe et al., "The Epsilon Dataflow Processor" (ISCA 1989)** —
    Hybrid approach (Sandia), interesting as a contrast to show where
    we DON'T want to go.

14. **Watson & Gurd, "A Practical Data Flow Computer" (IEEE Computer
    1982)** — Board-level Manchester hardware aimed at hardware engineers.

---

## 6. Key Open Questions Informed by Research

Things the prior-art survey surfaced that we should think about:

1. **Monsoon's 128-word frames vs our 16-32 entry slots**: are we too
   small? Monsoon's larger frames reduce allocation frequency but waste
   space on small activations. Our smaller slots are more efficient but
   may cause more allocation churn. Need to compile some test programs
   and measure.

2. **EM-4's circular pipeline**: their PE reuses pipeline stages for
   different phases of token processing, reducing total hardware per PE.
   Worth investigating whether our 5-stage pipeline could benefit from
   a similar trick.

3. **EM-4's strongly connected arc model**: a different take on monadic
   chains, where consecutive operations within a thread execute without
   re-entering the network. Related to the sequential scheduling idea
   but architecturally distinct. Need to read the papers to understand
   the hardware implications.

4. **I-structure memory (Arvind)**: presence bits on structure memory
   words for synchronisation. Our SM doesn't currently have this — SM
   operations are simple read/write/RMW. I-structures enable deferred
   reads (a read of an empty word blocks until the write arrives). This
   is a significant capability for certain parallel patterns. Worth
   evaluating whether SM should support it.
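
For open question 4, the deferred-read semantics can be pinned down in a
few lines. A behavioural sketch only: the names (`IStructureMemory`,
`i_read`, `i_write`) and the tag representation are invented here, and a
hardware SM would hold the waiter list as a deferred-read chain in
memory rather than a Python list.

```python
# Sketch of Arvind-style I-structure semantics: each word carries
# presence state; a read of an EMPTY word is deferred (queued on the
# word) and released when the single-assignment write arrives.

EMPTY, DEFERRED, FULL = 0, 1, 2  # presence bits per word

class IStructureMemory:
    def __init__(self, size):
        self.state = [EMPTY] * size
        self.value = [None] * size
        self.waiters = [[] for _ in range(size)]  # deferred reader tags

    def i_read(self, addr, reader_tag):
        """Returns (tag, value) result tokens to emit; defers if EMPTY."""
        if self.state[addr] == FULL:
            return [(reader_tag, self.value[addr])]
        self.state[addr] = DEFERRED        # park the reader on the word
        self.waiters[addr].append(reader_tag)
        return []

    def i_write(self, addr, value):
        """Single-assignment write; wakes every deferred reader."""
        assert self.state[addr] != FULL, "I-structure words are write-once"
        self.value[addr] = value
        self.state[addr] = FULL
        released = [(tag, value) for tag in self.waiters[addr]]
        self.waiters[addr].clear()
        return released
```

The hardware cost the question is really about is visible in the sketch:
per-word state bits plus somewhere to queue an unbounded number of
deferred readers, which is what makes this more than a simple RMW port.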