# Prior art reference guide for a discrete-logic dynamic dataflow CPU

**The dataflow architecture literature from 1975–2005 contains roughly 80–100 papers with significant hardware implementation detail**, spanning dynamic tagged-token machines, static dataflow, hybrid architectures, systolic arrays, transport-triggered designs, and wavefront processors. This guide organizes the most relevant references for someone building a 74-series TTL + SRAM dynamic dataflow processor, with emphasis on matching store design, token formats, pipeline organization, and board-level prototypes. Papers the user already holds are marked with ★ and cross-referenced rather than re-described.

The single most directly relevant hardware reference remains the Manchester Dataflow Machine, built entirely from standard TTL and SRAM, but the Monsoon Explicit Token Store papers are equally critical because they demonstrate how to eliminate CAM-based matching entirely using presence bits, a design vastly more tractable in discrete logic. The EM-4 single-chip processor papers from ETL Japan offer the most sophisticated pipeline design in the literature, while Dennis's "Building Blocks for Data Flow Prototypes" is a guide to modular discrete-logic dataflow hardware construction.

---

## 1. Dynamic (tagged-token) dataflow machines

This is the primary category. Dynamic dataflow uses tagged tokens to distinguish multiple activations of the same graph node, enabling full reentrancy. The core hardware challenge is the **matching store**: the mechanism that pairs tokens destined for the same instruction.

### 1.1 Manchester Dataflow Machine

The Manchester machine is the most directly relevant prior art: it was built from **standard 74-series TTL** with SRAM-based memories, operational from October 1981. Its ring pipeline (token queue → matching unit → instruction store → processing unit) is the canonical dynamic dataflow architecture.

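The ring described above can be sketched behaviorally. This is a minimal Python model under stated assumptions, not the Manchester hardware: the `Token` fields, the two-node `PROGRAM`, and the use of a dictionary as the matching store are all hypothetical choices made for illustration (the real machine hashed tag/destination bits into SRAM banks and sent collisions to an overflow unit). It shows how tagging lets two activations of the same graph share one instruction store without their operands mixing.

```python
from collections import deque, namedtuple

# Hypothetical token layout: data value, activation tag, destination node, operand port.
Token = namedtuple("Token", "value tag dest port")

# Instruction store (illustrative): node -> (dyadic operation, list of (dest node, port)).
# A destination of None means "program output".
PROGRAM = {
    "add": (lambda a, b: a + b, [("mul", 0)]),
    "mul": (lambda a, b: a * b, [(None, 0)]),
}

def run(initial_tokens):
    queue = deque(initial_tokens)   # token queue
    waiting = {}                    # matching store: (tag, dest) -> first-arrived token
    outputs = {}                    # tag -> final result
    while queue:
        tok = queue.popleft()
        if tok.dest is None:        # token leaves the ring
            outputs[tok.tag] = tok.value
            continue
        key = (tok.tag, tok.dest)
        partner = waiting.pop(key, None)
        if partner is None:         # matching unit: no partner yet, so wait
            waiting[key] = tok
            continue
        # Matched pair: order operands by port, fetch the instruction, execute.
        left, right = sorted((tok, partner), key=lambda t: t.port)
        op, dests = PROGRAM[tok.dest]
        result = op(left.value, right.value)
        for dest, port in dests:    # result tokens re-enter the ring with the same tag
            queue.append(Token(result, tok.tag, dest, port))
    return outputs

# Two interleaved activations of (a + b) * c, distinguished only by their tags.
demo_tokens = [
    Token(2, 0, "add", 0), Token(10, 1, "add", 0), Token(4, 0, "mul", 1),
    Token(3, 0, "add", 1), Token(1, 1, "add", 1), Token(2, 1, "mul", 1),
]
result = run(demo_tokens)
print(result)  # {0: 20, 1: 22}
```

The dictionary lookup hides exactly the part that is expensive in discrete logic: finding the partner token by its full tag. The Manchester references below are largely about how to do that lookup with SRAM and comparators.
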
- ★ **Gurd, J.R., Kirkham, C.C., and Watson, I.** "The Manchester Prototype Dataflow Computer." *Communications of the ACM*, 28(1):34–52, January 1985. DOI: 10.1145/2465.2468. *User already has this. The canonical reference: ring pipeline, hash-based matching unit with 16 boards of 64K-token memory + 54-bit comparators, overflow unit for collisions. All standard TTL.*
- **Watson, I. and Gurd, J.R.** "A Practical Data Flow Computer." *IEEE Computer*, 15(2):51–57, February 1982. https://ieeexplore.ieee.org/document/1654598/. *First detailed description of the working prototype. Predates and complements the 1985 CACM paper. Describes the ring pipeline at the board level.*
- **da Silva, J.G.D. and Watson, I.** "A Pseudo-Associative Matching Store with Hardware Hashing." *IEE Proceedings Part E: Computers and Digital Techniques*, 130(1):19–24, January 1983. https://ieeexplore.ieee.org/document/5009424. **CRITICAL for matching store design.** Describes the hash-based replacement for the original CAM. Parallel hash table with 16-bit hash function on tag/destination fields, 64K-token memory per board + 54-bit comparator, collision handling via overflow unit. This is the key paper on practical hardware hashing for token matching.
- **Gurd, J.R., Kirkham, C.C., and Böhm, A.P.W.** "The Manchester Dataflow Computing System." In *Experimental Parallel Computing Architecture*, J. Dongarra (ed.), North-Holland, pp. 177–219, 1987. *Extended 43-page treatment. The most detailed hardware reference for the Manchester machine: board-level organization, chip counts, implementation trade-offs, multi-ring interconnect scaling.*
- **Gurd, J.R.** "The Manchester Dataflow Machine." *Computer Physics Communications*, 37(1):49–62, July 1985. https://www.sciencedirect.com/science/article/abs/pii/0010465585901353. *Hardware structure with emphasis on enhancements made since the 1981 first operation.*
- **Gurd, J.R.** "The Manchester Dataflow Machine."
*Future Generation Computer Systems*, 1(1):35–47, 1985. *Tag-field overhead analysis and hardware cost discussion.*
- **Böhm, A.P.W., Gurd, J.R., and Sargeant, J.** "Hardware and Software Enhancement of the Manchester Dataflow Machine." *COMPCON Spring '85*, IEEE, pp. 420–423, 1985. *Documents physical evolution of the hardware: matching unit changes, token queue sizing, PE improvements.*
- **Gurd, J.R. et al.** "Fine-Grain Parallel Computing: The Dataflow Approach." *Future Parallel Computers*, Springer LNCS 295, pp. 82–152, 1986. DOI: 10.1007/3-540-18203-9_3. *Comprehensive treatment of pipeline organization, matching unit design, and chip-level TTL implementation.*
- **Gurd, J. and Watson, I.** "Data Driven System for High Speed Parallel Computing - Part 2: Hardware Design." *Computer Design*, 19(7):97–106, 1980. *Aimed at hardware engineers. Board-level design details of the Manchester machine.*
- **Watson, I. and Gurd, J.** "A Prototype Data Flow Computer with Token Labelling." *Proc. AFIPS NCC*, pp. 623–628, 1979. *Early description of the pilot-project hardware that was later scaled up to the full prototype.*
- **Kawakami, K. and Gurd, J.R.** "A Scalable Dataflow Structure Store." *Proc. ISCA '86*, IEEE, June 1986. *Hardware organization of the structure store (heap memory) added for array/data-structure handling.*
- **Inagami, Y. and Foley, J.F.** "The Specification of a New Manchester Dataflow Machine." *Proc. ICS '89*, pp. 371–380, June 1989. *Redesign using supercomputer technology: new matching unit design and pipeline organization for high performance.*
- **Barahona, P. and Gurd, J.R.** "Simulated Performance of the Manchester Multi-Ring Dataflow Machine." *Parallel Computing 85*, North-Holland, pp. 419–424, 1985. *Performance modeling of multi-ring interconnect for scaling.*
- **Kirkham, C.C. and Watson, I.** "Iterative Instructions in the Manchester Dataflow Computer." *IEEE Transactions on Computers*, ~1989.
https://ieeexplore.ieee.org/document/80141/. *Buffer restructuring and function unit array modifications for iterative workloads. Hardware-level performance measurements on the TTL prototype.*
- **Patnaik, L.M., Govindarajan, R., and Ramadoss, N.S.** "Design and Performance Evaluation of EXMAN: An Extended Manchester Dataflow Computer." *IEEE Trans. Comput.*, 35(3):229–243, 1986. *Extended Manchester architecture with concurrent matching operations in the matching unit.*

### 1.2 MIT Tagged-Token Dataflow Architecture (TTDA)

The TTDA defined the **tagged-token execution model** that the Explicit Token Store later simplified. Tags encode ⟨instruction-address, activity-name⟩, and the waiting-matching unit is conceptually an associative store keyed on the full tag. Frame-based direct matching with presence bits arrived with Monsoon (Section 1.3), which reinterprets the activity name as a frame pointer. The TTDA is therefore the conceptual foundation for the Explicit Token Store rather than an implementation of it.

- **Arvind and Nikhil, R.S.** "Executing a Program on the MIT Tagged-Token Dataflow Architecture." *IEEE Trans. Comput.*, 39(3):300–318, March 1990. DOI: 10.1109/12.48862. *THE canonical TTDA paper. Processing element organization: waiting-matching unit, instruction fetch, ALU, token formation. Tag format, I-structure memory nodes, interconnection network. Detailed PE pipeline figures.*
- **Arvind and Nikhil, R.S.** "Executing a Program on the MIT Tagged-Token Dataflow Architecture." *Proc. PARLE*, Springer LNCS 259, pp. 1–29, June 1987. DOI: 10.1007/3-540-17945-3_1. *Earlier conference version. TTDA multiprocessor organization with PE array, I-structure memory, and interconnection network.*
- **Arvind and Iannucci, R.A.** "Two Fundamental Issues in Multiprocessing." *Proc. DFVLR Conference*, Springer LNCS 295, 1987. Also CSG Memo 226-6, MIT LCS, May 1987. *Why von Neumann architectures fail at fine-grained parallelism and how dataflow hardware addresses synchronization and latency tolerance.*
- **Arvind, Nikhil, R.S., and Pingali, K.K.** "I-Structures: Data Structures for Parallel Computing."
*ACM TOPLAS*, 11(4):598–632, October 1989. DOI: 10.1145/69558.69562. *Defines the I-structure hardware semantics: presence bits (empty/full/waiting) on each memory word, deferred-read queuing. The state machine for I-structure operations must be implemented in hardware.*
- **Steele, K.** "An I-Structure Memory Controller." M.S. Thesis, MIT, December 1989. *Detailed hardware design of the I-structure memory controller: state machine, presence bit logic, deferred-read queue management, PE network interface.*
- **Culler, D.E.** "Resource Management for the Tagged Token Dataflow Architecture." Tech. Rep. TR-332, MIT LCS, 1985. https://dspace.mit.edu/bitstream/handle/1721.1/149603/MIT-LCS-TR-332.pdf. *Token store overflow/deadlock and frame-space management. How fixed-size tag hardware constrains iteration depth. Essential for understanding hardware resource constraints and throttling.*

### 1.3 Monsoon (Explicit Token Store)

Monsoon is the **most important architecture for a TTL builder to study** after Manchester. The Explicit Token Store (ETS) eliminates CAM and hash matching entirely. Operand matching uses **presence bits at compiler-assigned frame offsets**: the first token to arrive sets the bit and stores its data; the second token reads the stored value, clears the bit, and proceeds to execution. The processor is an **8-stage pipeline** processing one token per cycle.

- **Papadopoulos, G.M. and Culler, D.E.** "Monsoon: An Explicit Token-Store Architecture." *Proc. ISCA '90*, Seattle, pp. 82–91, 1990. DOI: 10.1145/325164.325117. *THE key Monsoon paper. ETS model, 8-stage pipeline, instruction memory, frame memory with presence bits, tokens tagged ⟨instruction-pointer, frame-pointer⟩ with the slot offset taken from the instruction, hazard handling, 128-word activation frames with shared free-list. Revolutionary simplification over Manchester's hash-based matching.*
- **Culler, D.E. and Papadopoulos, G.M.** "The Explicit Token Store." *J. Parallel and Distributed Computing*, 10:289–308, December 1990. *Extended journal version.
More detailed ETS state-bit mechanism, frame memory organization, pipeline stages, token format, and state transition logic.*
- **Papadopoulos, G.M.** "Implementation of a General Purpose Dataflow Multiprocessor." PhD Thesis, MIT EECS, September 1988. MIT/LCS/TR-432. Published by MIT Press, 1991 (ISBN: 0-262-66069-5). *THE most detailed Monsoon hardware source. Complete board-level design, chip selection, pipeline timing, instruction decoding, I-structure memory controller interface, interconnection network. 16-processor systems deployed at Los Alamos and MIT.*
- **Papadopoulos, G.M. and Culler, D.E.** "Retrospective: Monsoon: An Explicit Token-Store Architecture." In *25 Years of ISCA*, ACM, 1998, pp. 74–76. https://people.eecs.berkeley.edu/~culler/courses/cs252-s05/papers/p74-papadopoulos.pdf. *Confirms 5–10 million messages/second per PE. 16-processor systems built with Motorola Cambridge Research Lab.*
- **Hicks, J., Chiou, D., Ang, B.S., and Arvind.** "Performance Studies of Id on the Monsoon Dataflow System." *J. Parallel and Distributed Computing*, 18(3):273–300, July 1993. *Execution cycle counts and speedup curves on physical hardware. Compares single-PE Monsoon against MIPS R3000.*
- **Papadopoulos, G.M. and Traub, K.R.** "Multithreading: A Revisionist View of Dataflow Architectures." *Proc. ISCA '91*, Toronto, pp. 342–351, May 1991. *Extends Monsoon's ETS with sequential scheduling within threads. Instructions within a thread use implicit sequencing, reducing synchronization overhead. Bridges toward \*T.*

### 1.4 \*T (StarT)

\*T unifies dataflow with von Neumann by splitting "complex" dataflow instructions into separate synchronization, arithmetic, and fork/control instructions on a RISC-like PE. Frame memory with presence bits is unified with conventional memory.

- **Nikhil, R.S., Papadopoulos, G.M., and Arvind.** "\*T: A Multithreaded Massively Parallel Architecture." *Proc. ISCA '92*, Gold Coast, Australia, pp. 156–167, May 1992.
DOI: 10.1109/ISCA.1992.753313. *Core \*T paper. PE organization: instruction fetch, register file, ALU, with synchronization primitives. Eliminates the distinction between processor and I-structure nodes.*
- **Beckerle, M.J.** "Overview of the START (\*T) Multithreaded Computer." *COMPCON Spring '93*, IEEE, pp. 148–156, February 1993. *Physical \*T hardware design: processing node architecture, interconnect, memory system. Board-level and system-level details.*
- **Arvind and Brobst, S.A.** "The Evolution of Dataflow Architectures from Static Dataflow to P-RISC." *Intl. J. High Speed Computing*, 5(2), 1993. *Traces hardware evolution: static → dynamic (tagged-token) → ETS (Monsoon) → P-RISC → \*T. Explains at each step what hardware was added or removed.*
- **Agarwal, A., Kubiatowicz, J., et al.** "Sparcle: An Evolutionary Processor Design for Multiprocessors." *IEEE Micro*, 13(3):48–61, June 1993. *Same design philosophy as \*T: SPARC core modified for multithreading with fast context switching and presence-bit memory. Actual chip modifications described.*

### 1.5 Japanese dynamic dataflow machines

Japan invested heavily in dataflow hardware during the 1980s–90s through ETL (now AIST), NTT, and the Fifth Generation project. The EM-4 single-chip processor is particularly notable for its **circular pipeline** and **strongly connected arc** execution model.

**Sigma-1 (ETL/MITI)**, a 128-PE dataflow supercomputer, operational 1988, measured >100 MFLOPS:

- **Hiraki, K., Shimada, T., and Nishida, K.** "A Hardware Design of the SIGMA-1, A Data Flow Computer for Scientific Computations." *Proc. ICPP '84*, pp. 524–531, 1984. *Original hardware architecture.*
- **Yuba, T., Shimada, T., Hiraki, K., and Kashiwagi, H.** "SIGMA-1: A Dataflow Computer for Scientific Computations." *Computer Physics Communications*, 37:141–148, 1985. DOI: 10.1016/0010-4655(85)90146-8.
*PE organization, matching memory unit, structure memory, communication network for ~200 PEs.*
- **Shimada, T., Hiraki, K., Nishida, K., and Sekiguchi, S.** "Evaluation of a Prototype Data Flow Processor of the SIGMA-1." *Proc. ISCA '86*, Tokyo, pp. 226–234, 1986. DOI: 10.1145/17356.17383. *Operational PE evaluation, sticky-token mechanism for loop invariants.*
- **Hiraki, K., Nishida, K., Sekiguchi, S., and Shimada, T.** "Maintenance Architecture and its LSI Implementation of a Dataflow Computer." *Proc. ICPP '86*, pp. 584–591, 1986. *LSI implementation details for the 128-PE system.*
- **Hiraki, K., Sekiguchi, S., and Shimada, T.** "The SIGMA-1 Dataflow Supercomputer: A Challenge for New Generation Supercomputing Systems." *J. Information Processing (IPS Japan)*, 10(4):219–226, 1987. *Comprehensive description of the complete 128-PE system.*
- **Hiraki, K., Sekiguchi, S., and Shimada, T.** "Status Report of SIGMA-1: A Data-Flow Supercomputer." In Gaudiot and Bic (eds.), *Advanced Topics in Data-Flow Computing*, Prentice Hall, Chapter 7, pp. 207–223, 1991. *Status report of the completed system with hardware implementation details.*

**EM-4 / EM-X (ETL)**, a single-chip dataflow PE (EMC-R) on a 50,000-gate array, with an 80-PE prototype operational April 1990:

- **Sakai, S., Yamaguchi, Y., Hiraki, K., Kodama, Y., and Yuba, T.** "An Architecture of a Dataflow Single Chip Processor." *Proc. ISCA '89*, pp. 46–53, 1989. DOI: 10.1145/74925.74931. *THE core EM-4 chip paper. EMC-R on 50K-gate array. Strongly connected arc model, direct matching, RISC-based design, deadlock-free on-chip packet switch, circular pipeline.*
- **Sakai, S., Kodama, Y., and Yamaguchi, Y.** "Prototype Implementation of a Highly Parallel Dataflow Machine EM-4." *Proc. IPPS '91*, pp. 278–286, 1991. DOI: 10.1109/IPPS.1991.153792. *80-PE prototype implementation, peak 1 GIPS.
Multiple-RISC concept, versatile interconnection network.*
- **Sakai, S., Kodama, Y., and Yamaguchi, Y.** "Design and Implementation of a Circular Omega Network in the EM-4." *Parallel Computing*, 19(2):125–142, 1993. DOI: 10.1016/0167-8191(93)90043-K. *THE network paper for EM-4. Circular omega topology, self-routing, deadlock prevention, 14.63 GB/s.*
- **Yamaguchi, Y., Sakai, S., and Kodama, Y.** "Synchronization Mechanisms of a Highly Parallel Dataflow Machine EM-4." *IEICE Trans.*, E74(1):204–213, 1991. *Two synchronization mechanisms: direct matching and register-based.*
- **Kodama, Y. et al.** "EMC-Y: Parallel Processing Element Optimizing Communication and Computation." *Proc. ICS '93*, pp. 167–174, 1993. *Successor chip to EMC-R for the EM-X system.*
- **Kodama, Y. et al.** "The EM-X Parallel Computer: Architecture and Basic Performance." *Proc. ISCA '95*, pp. 14–23, 1995. DOI: 10.1145/223982.223987. *Full EM-X system architecture with EMC-Y PE.*
- **Sakai, S. et al.** "Reduced Interprocessor-Communication Architecture and Its Implementation on EM-4." *Parallel Computing*, 21(5):753–769, 1995. *RICA design: fused pipeline performing message handling + instruction execution + packet output in 2 RISC clocks.*

**DDDP (OKI Electric)**, four PEs on a ring bus with hardware hashing:

- **Kishi, M., Yasuhara, H., and Kawamura, Y.** "DDDP - A Distributed Data Driven Processor." *Proc. ISCA '83*, Stockholm, pp. 236–242, 1983. DOI: 10.1145/800046.801661. *Four PEs connected by ring bus + structured data memory. Hardware hashing for token coloring. 0.7 MIPS measured.*

**Other Japanese dataflow machines:**

- ★ **Amamiya, M. et al.** DFM architecture and evaluation papers. *User already has amamiya1982.pdf and the evaluation paper.*
- **Amamiya, M. et al.** "Implementation and Evaluation of a List-Processing-Oriented Data Flow Machine." *Proc. ISCA '86*, Tokyo, pp. 10–19, 1986. *Register-transfer-level implementation, ~5× speedup over von Neumann equivalent.
Closely related to user's existing papers.*
- **Ito, N. et al.** "The Architecture and Preliminary Evaluation Results of the Experimental Parallel Inference Machine PIM-D." *Proc. ISCA '86*, pp. 149–156, 1986. DOI: 10.1145/17356.17373. *Dataflow-based parallel inference machine from the Fifth Generation project.*
- **Yuba, T. et al.** "Dataflow Computer Development in Japan." *Proc. ICS '90*, ACM SIGARCH, 18(3b):140–147, 1990. DOI: 10.1145/255129.255151. *Survey covering SIGMA-1 and EM-4 with comparison of design goals.*
- **Yuba, T.** "Research and Development Efforts on Data-flow Computer Architecture in Japan." *J. Information Processing*, 9(2):51–60, 1986. *Overview of the entire Japanese dataflow program.*

**Note on DDMP and TOPSTAR:** The term "DDMP" (Data-Driven Media Processor) was not confirmed in English-language literature; the Takahashi/Amamiya NTT machine is consistently called a "Dataflow Processor Array System." TOPSTAR was a function-level (coarse-grain) dataflow machine at University of Tokyo under Prof. Moto-Oka, not an ETL successor to EM-4. The actual EM-4 successor at ETL was **EM-X** (EMC-Y chip).

---

## 2. Static dataflow machines

Static dataflow uses acknowledgment signals rather than tagged tokens: one token per arc, with presence bits in instruction cells. Simpler hardware but no reentrancy without code copying.

### 2.1 Dennis MIT static dataflow

- **Dennis, J.B. and Misunas, D.P.** "A Preliminary Architecture for a Basic Data Flow Processor." *Proc. ISCA '75*, pp. 126–132, January 1975. DOI: 10.1145/642089.642111. *THE seminal static dataflow hardware paper. Cell-based PE with instruction cells containing operand slots and presence bits. Circular pipeline. Four key networks: distribution, control, arbitration, and processing section.*
- **Dennis, J.B., Boughton, G.A., and Leung, C.K.C.** "Building Blocks for Data Flow Prototypes." *Proc. ISCA '80*, La Baule, France, pp. 1–8, May 1980.
**EXTREMELY VALUABLE for TTL builders.** Describes modular hardware building blocks (instruction cell modules, operation units, arbitration networks) specifically designed for constructing dataflow processor prototypes from discrete logic.
- **Dennis, J.B.** "Data Flow Supercomputers." *IEEE Computer*, 13(11):48–56, November 1980. DOI: 10.1109/C-M.1980.220135. *Architectural vision for scaling static dataflow. PE pipeline, multiprocessor interconnect, memory system.*
- **Dennis, J.B., Lim, W.Y.P., and Ackerman, W.B.** "The MIT Data Flow Engineering Model." *Proc. IFIP 9th World Computer Congress*, Paris, pp. 553–560, 1983. *Engineering model of the MIT static dataflow machine: pipelined PE and interconnection network.*
- **Dennis, J.B.** "The Evolution of 'Static' Data-Flow Architecture." In *Advanced Topics in Data-Flow Computing*, Gaudiot and Bic (eds.), Prentice Hall, Chapter 2, 1991. *Dennis's own comprehensive account of how the static architecture evolved through hardware prototypes.*
- **Rumbaugh, J.E.** "A Data Flow Multiprocessor." *IEEE Trans. Comput.*, C-26(2):138–146, February 1977. DOI: 10.1109/TC.1977.5009292. Also: MIT Project MAC TR-150, 1975. *Early hardware dataflow multiprocessor. Hierarchically constructed network of asynchronous modules. Pipelined activation processors. Important for discrete-logic design: modular, asynchronous approach.*

### 2.2 TI Distributed Data Processor

- **Cornish, M., Hogan, D.W., and Jensen, J.C.** "The Texas Instruments Distributed Data Processor." *Proc. Louisiana Computer Exposition*, pp. 189–193, March 1979. *First public paper on TI DDP. Board-level static dataflow processor for real-time avionics. Register-free instruction format.*
- **Cornish, M. et al.** "The TI Data Flow Architectures: The Power of Concurrency for Avionics." *Proc. 3rd Digital Avionics Systems Conf.*, Fort Worth, TX, pp. 19–25, November 1979. *Hardware architecture for avionics.
Data-driven controllers find all parallel instructions simultaneously.*

### 2.3 LAU System (CERT/ONERA Toulouse)

First operational dataflow multiprocessor (32 processors, 1979).

- **Plas, A., Comte, D., Gelly, O., and Syre, J.C.** "LAU System Architecture: A Parallel Data-Driven Processor Based on Single Assignment." *Proc. ICPP '76*, pp. 293–302, 1976. *Primary architecture paper. PE hardware, acknowledgment-based token management, communication network.*
- **Syre, J.C., Comte, D., and Hifdi, N.** "Pipelining, Parallelism and Asynchronism in the LAU System." *Proc. ICPP '77*, Detroit, pp. 87–92, August 1977. *Pipeline design within LAU PEs and asynchronous inter-PE communication hardware.*
- **Comte, D. and Hifdi, N.** "LAU Multiprocessor: Microfunctional Description and Technological Choices." *Proc. Workshop: Data Driven Languages and Machines*, Toulouse, 1979, pp. I.1–I.8. *Detailed hardware microfunctional units and board-level technology choices.*
- **Comte, D., Hifdi, N., and Syre, J.C.** "The Data Driven LAU Multiprocessor System: Results and Perspectives." *Proc. IFIP Congress 80*, Tokyo, pp. 175–180, 1980. *Performance results from the operational 32-processor system.*

### 2.4 Other static machines

- **Davis, A.L.** "The Architecture and System Method of DDM1: A Recursively Structured Data Driven Machine." *Proc. ISCA '78*, pp. 210–215, April 1978. DOI: 10.1145/800094.803050. *DDM1 at University of Utah: first operational dataflow processor (July 1976). Tree-structured, fully distributed, asynchronous. Each module is self-timed; no global clock. Board-level hardware.*
- **Vedder, R. and Finn, D.** "The Hughes Data Flow Multiprocessor: Architecture for Efficient Signal and Data Processing." *Proc. ISCA '85*, Boston, pp. 324–332, June 1985. DOI: 10.1145/327070.327290. *Hughes Aircraft Company.
Hierarchical network of asynchronous modules with pipelined activation processors.*
- **NEC μPD7281 Image Pipeliner**: one of the few commercially sold dataflow chips (December 1985). Static dataflow for image processing, 5 MIPS, unidirectional pipeline bus, multiple chips cascade without extra circuitry. See: Kurokawa, K. et al., "The Architecture and Performance of Image Pipeline Processor," *VLSI '83*, pp. 275–284. Also: NEC μPD7281 data sheet.
- **Hartimo, I., Kronlof, K., Simula, O., and Skytta, J.** "DFSP: A Data Flow Signal Processor." *IEEE Trans. Comput.*, C-35(1):23–33, June 1986. *Board-level medium-grain static dataflow processor for signal processing, from Helsinki.*

---

## 3. Hybrid dataflow/von Neumann architectures

These combine dataflow's latency-tolerant fine-grained synchronization with conventional sequential execution efficiency.

- **Iannucci, R.A.** "Toward a Dataflow/von Neumann Hybrid Architecture." *Proc. ISCA '88*, pp. 131–140, June 1988. DOI: 10.1145/633625.52416. Also MIT/LCS/TR-418. *Foundational hybrid paper. RISC instruction set extended with three dataflow-style synchronization instructions. Scheduling quanta bound at compile time. ~240 citations.*
- **Nikhil, R.S. and Arvind.** "Can Dataflow Subsume von Neumann Computing?" *Proc. ISCA '89*, pp. 262–272, 1989. DOI: 10.1109/ISCA.1989.714561. *Introduces the P-RISC (Parallel RISC) concept: modified von Neumann PE with split-phase memory and presence-bit memory for dataflow-style synchronization. Directly influenced \*T design.*
- **Grafe, V.G., Davidson, G.S., Hoch, J.E., and Holmes, V.P.** "The Epsilon Dataflow Processor." *Proc. ISCA '89*, pp. 36–45, June 1989. https://ieeexplore.ieee.org/document/714522/. *Directly matches ready operands, eliminating associative matching stores entirely. 10 MFLOPS prototype at Sandia National Labs.*
- **Grafe, V.G. and Hoch, J.E.** "The Epsilon-2 Multiprocessor System." *J. Parallel and Distributed Computing*, 10(4):309–318, 1990.
Also SAND-89-2622C. https://www.osti.gov/servlets/purl/5213672. *Combines fine-grain dataflow parallelism with von Neumann sequential efficiency. Instruction-level synchronization, single-cycle context switches, RISC-like execution, tree of activation frames.*
- **Gao, G.R., Hum, H.H.J., and Monti, J.-M.** "Towards an Efficient Hybrid Dataflow Architecture Model." *Proc. PARLE '91*, Springer LNCS 505, pp. 355–371, June 1991. DOI: 10.1007/978-3-662-25209-3_24. *McGill Dataflow Architecture (MDFA). Argument-fetching dataflow principle, conventional pipelined execution, dataflow software pipelining. Predecessor to EARTH.*
- **Hum, H.H.J. et al.** "A Design Study of the EARTH Multiprocessor." *Proc. PACT '95*, pp. 59–68, 1995. *Core EARTH architecture: off-the-shelf Execution Unit + ASIC Synchronization Unit for dataflow-like thread scheduling.*
- **Hum, H.H.J. et al.** "A Study of the EARTH-MANNA Multithreaded System." *Intl. J. Parallel Programming*, 24(4):319–348, 1996. DOI: 10.1007/BF03356753. *Key paper showing how EARTH combines dataflow signaling with RISC cores. Each processor has an SU handling synchronization + remote requests.*
- **Culler, D.E. et al.** "Fine-grain Parallelism with Minimal Hardware Support: A Compiler-Controlled Threaded Abstract Machine." *Proc. ASPLOS-IV*, pp. 164–175, April 1991. *TAM: shows synchronization on conventional multiprocessors can approach dedicated hardware performance.*
- **Nowatzki, T., Gangadhar, V., and Sankaralingam, K.** "Heterogeneous Von Neumann/Dataflow Microprocessors." *Communications of the ACM*, 62(6):83–91, 2019. *Modern perspective on hybrid architectures. Shows continued relevance of dataflow ideas in contemporary microprocessor design.*

---

## 4. Systolic arrays and wavefront processors

### 4.1 CMU systolic arrays

- **Kung, H.T. and Leiserson, C.E.** "Systolic Arrays (for VLSI)." *Sparse Matrix Proceedings 1978*, SIAM, pp. 256–282, 1979.
Also in Mead and Conway, *Introduction to VLSI Systems*, Addison-Wesley, 1980, Section 8.3. *Originated the term "systolic." Locally connected processor networks with rhythmic data flow. Band matrix algorithms.*
- **Kung, H.T.** "Why Systolic Architectures?" *IEEE Computer*, 15(1):37–46, January 1982. DOI: 10.1109/MC.1982.1653825. *THE definitive systolic paper (~2,400 citations). I/O bandwidth problem, systolic solution, multiple array designs for convolution and matrix multiplication.*
- **Fisher, A.L., Kung, H.T., Monier, L.M., and Dohi, Y.** "Architecture of the PSC: A Programmable Systolic Chip." *Proc. ISCA '83*, Stockholm, pp. 48–53, June 1983. DOI: 10.1145/800046.801637. Extended: *J. VLSI and Computer Systems*, 1(2):153–169, 1984. *CMU PSC chip: ~25,000 transistors, 74 pins, 4μm nMOS via MOSIS. First microprogrammable chip for systolic arrays.*

### 4.2 Warp and iWarp

- **Annaratone, M. et al.** "The Warp Computer: Architecture, Implementation, and Performance." *IEEE Trans. Comput.*, C-36(12):1523–1538, December 1987. DOI: 10.1109/TC.1987.5009502. *Linear systolic array of 10 programmable cells at 10 MFLOPS each (100 MFLOPS peak). Board-level implementation using off-the-shelf components. Wire-wrap prototype (WW-Warp) completed June 1985; production PC-Warp delivered by GE April 1987.*
- **Borkar, S. et al.** "iWarp: An Integrated Solution to High-Speed Parallel Computing." *Proc. Supercomputing '88*, pp. 330–339, 1988. https://www.eecs.harvard.edu/~htk/publication/1988-supercomputing-borkar-etc.pdf. *iWarp chip: 1.2cm × 1.2cm, ~700K transistors. 32-bit RISC core + 96-bit LIW decoder, 64-bit FPU at 20 MHz (20 MFLOPS), communication agent (320 MB/s). Typical system: 8×8 torus (64 cells).*
- **Borkar, S. et al.** "Supporting Systolic and Memory Communication in iWarp." *Proc. ISCA '90*, Seattle, pp. 70–81, May 1990. *Hardware support for both systolic (streaming) and memory-based (message passing) communication on one chip.
Network connections mapped to on-chip gate registers.*
- **Gross, T. and O'Hallaron, D.R.** *iWarp: Anatomy of a Parallel Computing System.* MIT Press, 1998, 488 pp. *The definitive iWarp reference: node architecture, communication agent, wormhole routing, system integration.*

### 4.3 Wavefront array processors

- **Kung, S.Y. et al.** "Wavefront Array Processor: Language, Architecture, and Applications." *IEEE Trans. Comput.*, C-31(11):1054–1066, November 1982. *Introduces WAP: unlike systolic (globally clocked), WAPs use data-driven, self-timed processing. Each PE has bidirectional buffers with independent status flags. Same geometry as systolic arrays but asynchronous local control.*
- **Kung, S.Y.** *VLSI Array Processors.* Prentice Hall, 1988. ISBN: 0-13-942749-X. *Definitive book on array processor architectures covering both systolic and wavefront design methodology.*
- **Vlontzos, J.A. and Kung, S.Y.** "A Wavefront Array Processor Using Dataflow Processing Elements." *Supercomputing '87*, Springer LNCS 297, pp. 608–620, 1988. DOI: 10.1007/3-540-18991-2_42. *Prototype WAP built from NEC μPD7281 dataflow chips with reconfigurable interconnections. Demonstrates physical hardware.*

---

## 5. Transport-triggered architectures (TTA)

TTAs expose the processor's internal transport buses as the programming model. Operations are triggered as side effects of data movement, which is conceptually related to dataflow's data-driven execution.

- **Corporaal, H.** "Design of Transport Triggered Architectures." *Proc. 4th Great Lakes Symp. on VLSI*, pp. 130–135, March 1994. https://www.cecs.uci.edu/~papers/compendium94-03/papers/1994/glsvlsi94/pdffiles/glsvlsi94_130.pdf. *Introduces TTA and the MOVE32INT prototype: 32-bit pipelined TTA processor. Function units triggered by data transport to trigger registers.*
- **Corporaal, H.** *Microprocessor Architectures: From VLIW to TTA.* John Wiley & Sons, 1997. *The definitive TTA reference.
MOVE framework for processor generation, hardware complexity analysis, MOVE32INT implementation. Shows how TTA reduces register file complexity vs. VLIW.*
- **Corporaal, H.** "TTAs: Missing the ILP Complexity Wall." *J. Systems Architecture*, 44(9-10):619–650, 1998. *Analyzes VLIW data path complexity and demonstrates TTA's hardware reduction in register file ports, bypass connectivity, and overall area.*
- **TCE/OpenASIP**: open-source TTA codesign environment from Tampere University. https://openasip.org. *Complete toolchain: Architecture Definition File → compiler → simulator → synthesizable VHDL. Enables automatic design-space exploration for custom TTA processors.*
- **Maxim MAXQ**: the only commercially manufactured TTA-based microcontroller. 16-bit core where a single MOVE instruction encodes source and destination modules on an internal transport bus. Computation occurs as a side effect of transport. Originally Dallas Semiconductor; products shipped ~2004–2020s.
- **8-bit TTL TTA**: a transport-triggered CPU built from discrete TTL chips on stripboard, documented on Hackaday. *Directly relevant as a discrete-logic TTA precedent.*

---

## 6. Modern dataflow-inspired architectures

Lower priority but included for significant hardware implementation detail.

- **Swanson, S. et al.** "WaveScalar." *Proc. MICRO-36*, pp. 291–302, 2003. DOI: 10.1109/MICRO.2003.1253203. https://homes.cs.washington.edu/~oskin/wavescalar.pdf. *WaveScalar ISA and WaveCache: grid of ALU-in-cache nodes. Completely distributed matching (no central CAM), direct inter-PE communication. 2–7× over superscalar.*
- **Swanson, S. et al.** "The WaveScalar Architecture." *ACM Trans. Computer Systems*, 25(2), Article 4, May 2007. DOI: 10.1145/1233307.1233308. *Comprehensive journal paper: full ISA, WaveCache microarchitecture, area/performance evaluation.*
- **Swanson, S. et al.** "Area-Performance Trade-offs in Tiled Dataflow Architectures." *Proc. ISCA '06*, pp. 314–326, 2006.
*RTL-based synthesis study. Pareto analysis of >200 WaveScalar designs. Crucial for understanding hardware cost of dataflow tiles.*
- **Sankaralingam, K. et al.** "Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture." *Proc. ISCA '03*, pp. 422–433, 2003. DOI: 10.1109/ISCA.2003.1207019. *EDGE ISA: 4×4 grid of execution units, dataflow scheduling within blocks, control-flow between blocks.*
- **Sankaralingam, K. et al.** "Distributed Microarchitectural Protocols in the TRIPS Prototype Processor." *Proc. MICRO '06*, pp. 480–491, 2006. DOI: 10.1109/MICRO.2006.19. *Actual TRIPS prototype chip: two 16-wide cores, up to 1,024 instructions in flight. Fabricated in silicon.*
- **Parashar, A. et al.** "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures." *Proc. ISCA '13*, pp. 142–153, 2013. DOI: 10.1145/2485922.2485935. https://people.csail.mit.edu/emer/media/papers/2013.06.isca.triggeredinstructions.pdf. *Eliminates the program counter entirely. PEs react to input channel arrivals. Reports 8× area-normalized performance vs. a general-purpose processor. Very relevant to dataflow-style hardware control.*
- **Jeffrey, M.C. et al.** "A Scalable Architecture for Ordered Parallelism." *Proc. MICRO '15*, pp. 228–241, 2015. *Swarm: tiled CMP with task units for speculative ordered parallelism. Hardware timestamp-ordered commit, Bloom filter conflict detection.*

---

## 7. Major surveys and meta-sources

These provide the best starting points for understanding the field as a whole and serve as indexes to the primary literature.

| Year | Authors | Title | Venue | Key value |
|------|---------|-------|-------|-----------|
| 1980 | Dennis | "Data Flow Supercomputers" | *IEEE Computer* 13(11) | Founding vision, static dataflow scaling |
| 1982 | Treleaven, Brownbridge, Hopkins | "Data-Driven and Demand-Driven Computer Architecture" | *ACM Comp. Surveys* 14(1) | Broadest scope: dataflow + reduction architectures |
| 1982 | Gajski, Padua, Kuck, Kuhn | "A Second Opinion on Data Flow Machines and Languages" | *IEEE Computer* 15(2) | Critical counterpoint from Illinois group |
| 1986 | ★ Veen | "Dataflow Machine Architecture" | *ACM Comp. Surveys* 18(4) | **User has this.** Most focused survey on hardware |
| 1986 | Arvind and Culler | "Dataflow Architectures" | *Annual Review of CS* 1:225–253 | MIT perspective; PDF at https://apps.dtic.mil/sti/pdfs/ADA166235.pdf |
| 1986 | Srini | "An Architectural Comparison of Dataflow Systems" | *IEEE Computer* 19(3) | Compares 7 architectures on 16 criteria |
| 1994 | Lee and Hurson | "Dataflow Architectures and Multithreading" | *IEEE Computer* 27(8) | Bridges pure dataflow to multithreading era |
| 1999 | Šilc, Robič, Ungerer | *Processor Architecture: From Dataflow to Superscalar and Beyond* | Springer (textbook) | Most comprehensive textbook; 389 pp. |
| 1999 | Najjar, Lee, Gao | "Advances in the Dataflow Computational Model" | *Parallel Computing* 25(13-14) | State of the art circa 1999 |
| 2005 | Arvind | "Dataflow: Passing the Token" | ISCA Keynote, https://csg.csail.mit.edu/Users/arvind/ISCAfinal.pdf | Visual retrospective of entire arc from static → \*T |

---

## 8. Design-critical cross-cutting references

These papers address specific hardware subsystem challenges most relevant to a TTL-based dynamic dataflow build.

**Matching store tradeoffs:** The field explored three fundamental approaches: (1) **CAM-based**: Manchester's original design, replaced due to cost and scaling; (2) **hash-based**: Manchester's production design (da Silva & Watson 1983), practical with SRAM but requires overflow handling; (3) **presence-bit / direct-indexed**: Monsoon's ETS (Papadopoulos & Culler 1990), which eliminates associative lookup entirely by moving slot assignment to the compiler.
For a TTL build, the ETS approach is dramatically simpler in hardware: in the basic two-operand case, each frame slot needs only a 1-bit presence flag plus a data-width storage word. The hash approach is also tractable with 74-series logic (hash function → SRAM address → comparator → overflow chain).

**Token format considerations:** Manchester tokens carry ⟨tag, data⟩ where the tag encodes ⟨destination-instruction, iteration-level, activation-name⟩. The tag width directly determines matching store addressing and dominates interconnect bandwidth. Monsoon simplified this to ⟨frame-pointer, instruction-pointer, port, data⟩; the slot offset within the frame comes from the instruction, not the token. The EM-4 uses a strongly connected arc model with direct matching and register-based synchronization on a 50K-gate budget.

**Interconnect choices:** Manchester uses a **ring pipeline** (simplest; all units connected in sequence). EM-4 uses a **circular omega network** with store-and-forward routing. Dennis's original architecture defines four separate networks (distribution, control, arbitration, processing). For a small-scale TTL prototype, the ring topology used by Manchester and DDDP is most tractable.

**Throttling and resource management:** Culler's MIT TR-332 (1985) is the essential reference on token store overflow, deadlock prevention, and termination detection, all of which are critical for a finite-resource hardware implementation.

---

## Conclusion

The most actionable references for a 74-series TTL dynamic dataflow build form a clear reading path. Start with the **da Silva & Watson (1983)** paper on hash-based matching and the **Papadopoulos & Culler (1990)** ETS papers to make the fundamental matching store architecture decision: hash vs. presence-bit. Study the **Manchester ring pipeline** (Watson & Gurd 1982, Gurd et al. 1985) as the proven TTL implementation template. Read **Dennis's "Building Blocks" (1980)** for modular discrete-logic dataflow design philosophy. The **EM-4 papers** (Sakai et al.
1989, 1991, 1993) offer the most sophisticated pipeline and interconnect designs if scaling beyond a single PE. For throttling, **Culler's TR-332** is essential. The major surveys — especially **Arvind & Culler (1986)** and **Lee & Hurson (1994)** — efficiently index the rest of the literature. The Japanese machines (Sigma-1, EM-4) are underappreciated in English-language literature but contain some of the most detailed hardware implementation work in the field.
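
As a concrete illustration of the presence-bit matching discussed in Section 8, the following is a minimal behavioral sketch in Python, not a description of Monsoon's actual hardware. It assumes two-operand instructions only, a flat frame memory, and a compiler-assigned table that maps each instruction pointer to a slot offset. The class name `PresenceBitMatcher`, the `slot_of` table, and the token field names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    fp: int     # frame pointer: base address of the activation frame
    ip: int     # instruction pointer: identifies the destination instruction
    port: str   # "L" or "R": which operand of the instruction this is
    value: int  # the data word


class PresenceBitMatcher:
    """Behavioral model of direct-indexed matching (hypothetical names)."""

    def __init__(self, frame_words: int, slot_of: Dict[int, int]) -> None:
        # slot_of stands in for the offset field a compiler would place
        # in each instruction; it is an assumption of this sketch.
        self.slot_of = slot_of
        self.present: List[bool] = [False] * frame_words
        self.stored: List[Optional[Tuple[str, int]]] = [None] * frame_words

    def match(self, tok: Token) -> Optional[Tuple[int, int]]:
        # The slot address is computed, never searched for: no CAM, no hash.
        addr = tok.fp + self.slot_of[tok.ip]
        if not self.present[addr]:
            # First operand to arrive: park it and set the presence bit.
            self.present[addr] = True
            self.stored[addr] = (tok.port, tok.value)
            return None
        # Second operand: read the partner, clear the bit, fire the instruction.
        port, value = self.stored[addr]
        self.present[addr] = False
        self.stored[addr] = None
        # Return operands in (left, right) order regardless of arrival order.
        return (value, tok.value) if port == "L" else (tok.value, value)


matcher = PresenceBitMatcher(frame_words=16, slot_of={0x10: 2})
first = matcher.match(Token(fp=4, ip=0x10, port="R", value=7))
second = matcher.match(Token(fp=4, ip=0x10, port="L", value=3))
print(first, second)
```

In a discrete-logic build, the `present` list would correspond to one extra SRAM bit per frame word and the address computation to a single adder, which is why this approach needs no comparator or overflow path. The sketch stores the arriving port alongside the value to restore operand order; a real design might instead fix which port is allowed to wait.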