OR-1 dataflow CPU sketch

Prior art reference guide for a discrete-logic dynamic dataflow CPU#

The dataflow architecture literature from 1975–2005 contains roughly 80–100 papers with significant hardware implementation detail, spanning dynamic tagged-token machines, static dataflow, hybrid architectures, systolic arrays, transport-triggered designs, and wavefront processors. This guide organizes the most relevant references for someone building a 74-series TTL + SRAM dynamic dataflow processor, with emphasis on matching store design, token formats, pipeline organization, and board-level prototypes. Papers the user already holds are marked with ★ and cross-referenced rather than re-described.

The single most directly relevant hardware reference remains the Manchester Dataflow Machine — built entirely from standard TTL and SRAM — but the Monsoon Explicit Token Store papers are equally critical because they demonstrate how to eliminate CAM-based matching entirely using presence bits, a design vastly more tractable in discrete logic. The EM-4 single-chip processor papers from ETL Japan offer the most sophisticated pipeline design in the literature, while Dennis's "Building Blocks for Data Flow Prototypes" is literally a guide to modular discrete-logic dataflow hardware construction.


1. Dynamic (tagged-token) dataflow machines#

This is the primary category. Dynamic dataflow uses tagged tokens to distinguish multiple activations of the same graph node, enabling full reentrancy. The core hardware challenge is the matching store — the mechanism that pairs tokens destined for the same instruction.
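The matching operation itself can be stated in a few lines. A minimal software model (the Python dict and field names are illustrative only; real hardware such as the Manchester machine uses a hashed SRAM table plus wide comparators, with an overflow unit for collisions):

```python
# Minimal model of a dynamic-dataflow matching store. A token is keyed by
# its tag (activation context) and destination instruction; two tokens with
# the same key form the operand pair for one two-input instruction.

class MatchingStore:
    """Pairs tokens destined for the same two-input instruction activation."""

    def __init__(self):
        # key: (tag, destination instruction) -> value of the waiting token
        self.waiting = {}

    def insert(self, tag, dest, value):
        """Return an operand pair if the partner token already waits,
        otherwise park this token and return None."""
        key = (tag, dest)
        partner = self.waiting.pop(key, None)
        if partner is None:
            self.waiting[key] = value    # first arrival: wait
            return None
        return (partner, value)          # second arrival: fire

store = MatchingStore()
assert store.insert(tag=(7, 0), dest=100, value=2.0) is None        # waits
assert store.insert(tag=(7, 0), dest=100, value=3.0) == (2.0, 3.0)  # fires
```

The hardware question is entirely about how that dictionary lookup is realized: CAM, hashed SRAM, or (as in Monsoon) compile-time slot assignment.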

1.1 Manchester Dataflow Machine#

The Manchester machine is the most directly relevant prior art: it was built from standard 74-series TTL with SRAM-based memories, operational from October 1981. Its ring pipeline (token queue → matching unit → instruction store → processing unit) is the canonical dynamic dataflow architecture.

  • Gurd, J.R., Kirkham, C.C., and Watson, I. "The Manchester Prototype Dataflow Computer." Communications of the ACM, 28(1):34–52, January 1985. DOI: 10.1145/2465.2468. User already has this. The canonical reference: ring pipeline, hash-based matching unit with 16 boards of 64K-token memory + 54-bit comparators, overflow unit for collisions. All standard TTL.

  • Watson, I. and Gurd, J.R. "A Practical Data Flow Computer." IEEE Computer, 15(2):51–57, February 1982. https://ieeexplore.ieee.org/document/1654598/. First detailed description of the working prototype. Predates and complements the 1985 CACM paper. Describes the ring pipeline at the board level.

  • da Silva, J.G.D. and Watson, I. "A Pseudo-Associative Matching Store with Hardware Hashing." IEE Proceedings Part E: Computers and Digital Techniques, 130(1):19–24, January 1983. https://ieeexplore.ieee.org/document/5009424. CRITICAL for matching store design. Describes the hash-based replacement for the original CAM. Parallel hash table with 16-bit hash function on tag/destination fields, 64K-token memory per board + 54-bit comparator, collision handling via overflow unit. This is the key paper on practical hardware hashing for token matching.

  • Gurd, J.R., Kirkham, C.C., and Böhm, A.P.W. "The Manchester Dataflow Computing System." In Experimental Parallel Computing Architecture, J. Dongarra (ed.), North-Holland, pp. 177–219, 1987. Extended 43-page treatment. The most detailed hardware reference for the Manchester machine: board-level organization, chip counts, implementation trade-offs, multi-ring interconnect scaling.

  • Gurd, J.R. "The Manchester Dataflow Machine." Computer Physics Communications, 37(1):49–62, July 1985. https://www.sciencedirect.com/science/article/abs/pii/0010465585901353. Hardware structure with emphasis on enhancements made since the 1981 first operation.

  • Gurd, J.R. "The Manchester Dataflow Machine." Future Generation Computer Systems, 1(1):35–47, 1985. Tag-field overhead analysis and hardware cost discussion.

  • Böhm, A.P.W., Gurd, J.R., and Sargeant, J. "Hardware and Software Enhancement of the Manchester Dataflow Machine." COMPCON Spring '85, IEEE, pp. 420–423, 1985. Documents physical evolution of the hardware: matching unit changes, token queue sizing, PE improvements.

  • Gurd, J.R. et al. "Fine-Grain Parallel Computing: The Dataflow Approach." Future Parallel Computers, Springer LNCS 295, pp. 82–152, 1986. DOI: 10.1007/3-540-18203-9_3. Comprehensive treatment of pipeline organization, matching unit design, and chip-level TTL implementation.

  • Gurd, J. and Watson, I. "Data Driven System for High Speed Parallel Computing — Part 2: Hardware Design." Computer Design, 19(7):97–106, 1980. Aimed at hardware engineers. Board-level design details of the Manchester machine.

  • Watson, I. and Gurd, J. "A Prototype Data Flow Computer with Token Labelling." Proc. AFIPS NCC, pp. 623–628, 1979. Early description of the pilot-project hardware that was later scaled up to the full prototype.

  • Kawakami, K. and Gurd, J.R. "A Scalable Dataflow Structure Store." Proc. ISCA '86, IEEE, June 1986. Hardware organization of the structure store (heap memory) added for array/data-structure handling.

  • Inagami, Y. and Foley, J.F. "The Specification of a New Manchester Dataflow Machine." Proc. ICS '89, pp. 371–380, June 1989. Redesign using supercomputer technology: new matching unit design and pipeline organization for high performance.

  • Barahona, P. and Gurd, J.R. "Simulated Performance of the Manchester Multi-Ring Dataflow Machine." Parallel Computing 85, North-Holland, pp. 419–424, 1985. Performance modeling of multi-ring interconnect for scaling.

  • Kirkham, C.C. and Watson, I. "Iterative Instructions in the Manchester Dataflow Computer." IEEE Transactions on Computers, ~1989. https://ieeexplore.ieee.org/document/80141/. Buffer restructuring and function unit array modifications for iterative workloads. Hardware-level performance measurements on the TTL prototype.

  • Patnaik, L.M., Govindarajan, R., and Ramadoss, N.S. "Design and Performance Evaluation of EXMAN: An Extended Manchester Dataflow Computer." IEEE Trans. Comput., 35(3):229–243, 1986. Extended Manchester architecture with concurrent matching operations in the matching unit.

1.2 MIT Tagged-Token Dataflow Architecture (TTDA)#

The TTDA introduced frame-based direct matching with presence bits, replacing associative lookup. Tags encode ⟨instruction-address, activity-name⟩ where the activity name serves as a frame pointer + context ID. This is the conceptual foundation for the Explicit Token Store.

  • Arvind and Nikhil, R.S. "Executing a Program on the MIT Tagged-Token Dataflow Architecture." IEEE Trans. Comput., 39(3):300–318, March 1990. DOI: 10.1109/12.48862. THE canonical TTDA paper. Processing element organization: instruction fetch, operand matching via presence bits, ALU, token formation. Tag format, direct matching via frame memory, I-structure memory nodes, interconnection network. Detailed PE pipeline figures.

  • Arvind and Nikhil, R.S. "Executing a Program on the MIT Tagged-Token Dataflow Architecture." Proc. PARLE, Springer LNCS 259, pp. 1–29, June 1987. DOI: 10.1007/3-540-17945-3_1. Earlier conference version. TTDA multiprocessor organization with PE array, I-structure memory, and interconnection network.

  • Arvind and Iannucci, R.A. "Two Fundamental Issues in Multiprocessing." Proc. DFVLR Conference, Springer LNCS 295, 1987. Also CSG Memo 226-6, MIT LCS, May 1987. Why von Neumann architectures fail at fine-grained parallelism and how dataflow hardware addresses synchronization and latency tolerance.

  • Arvind, Nikhil, R.S., and Pingali, K.K. "I-Structures: Data Structures for Parallel Computing." ACM TOPLAS, 11(4):598–632, October 1989. DOI: 10.1145/69558.69562. Defines the I-structure hardware semantics: presence bits (empty/full/waiting) on each memory word, deferred-read queuing. The state machine for I-structure operations must be implemented in hardware.

  • Steele, K. "An I-Structure Memory Controller." M.S. Thesis, MIT, December 1989. Detailed hardware design of the I-structure memory controller: state machine, presence bit logic, deferred-read queue management, PE network interface.

  • Culler, D.E. "Resource Management for the Tagged Token Dataflow Architecture." Tech. Rep. TR-332, MIT LCS, 1985. https://dspace.mit.edu/bitstream/handle/1721.1/149603/MIT-LCS-TR-332.pdf. Token store overflow/deadlock and frame-space management. How fixed-size tag hardware constrains iteration depth. Essential for understanding hardware resource constraints and throttling.
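The I-structure word semantics above (empty/full/deferred presence states with deferred-read queuing) reduce to a small state machine. A sketch, with a Python callback standing in for the hardware's queued read continuation:

```python
# Sketch of I-structure word semantics (states from the Arvind/Nikhil/
# Pingali paper: empty, full, or deferred with queued readers). A callback
# models the deferred-read continuation the hardware would queue.

EMPTY, DEFERRED, FULL = "empty", "deferred", "full"

class IStructureWord:
    def __init__(self):
        self.state = EMPTY
        self.value = None
        self.readers = []          # reads that arrived before the write

    def read(self, resume):
        """Call resume(value) now if full, else queue it (deferred read)."""
        if self.state == FULL:
            resume(self.value)
        else:
            self.state = DEFERRED
            self.readers.append(resume)

    def write(self, value):
        """Single-assignment write: satisfies all deferred reads."""
        assert self.state != FULL, "I-structure written twice"
        self.value, self.state = value, FULL
        for resume in self.readers:
            resume(value)
        self.readers.clear()

w = IStructureWord()
seen = []
w.read(seen.append)        # read before write: deferred
w.write(42)                # write drains the deferred queue
w.read(seen.append)        # read after write: immediate
assert seen == [42, 42]
```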

1.3 Monsoon (Explicit Token Store)#

Monsoon is the most important architecture for a TTL builder to study after Manchester. The Explicit Token Store (ETS) eliminates CAM and hash matching entirely. Operand matching uses presence bits at compiler-assigned frame offsets — first token sets the bit and stores data; second token reads stored value and proceeds. The processor is an 8-stage pipeline processing one token per cycle.
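The ETS matching rule is simple enough to state as code. A sketch assuming one presence bit and one value word per frame slot (slot offsets are compiler-assigned, as in the ETS papers; the encoding details here are invented):

```python
# Sketch of Monsoon-style explicit-token-store (ETS) direct matching.
# The real machine folds this read-modify-write into dedicated pipeline
# stages; this model shows only the presence-bit protocol.

FRAME_WORDS = 128          # Monsoon used 128-word activation frames

class Frame:
    def __init__(self):
        self.present = [False] * FRAME_WORDS
        self.value = [None] * FRAME_WORDS

def ets_match(frame, slot, token_value):
    """Direct matching at a compiler-assigned frame offset.
    First token: set presence bit, store value, emit nothing.
    Second token: clear the bit, return both operands for execution."""
    if not frame.present[slot]:
        frame.present[slot] = True
        frame.value[slot] = token_value
        return None
    frame.present[slot] = False
    return (frame.value[slot], token_value)

f = Frame()
assert ets_match(f, 3, 5) is None       # first token parks in the frame
assert ets_match(f, 3, 7) == (5, 7)     # second token fires
assert f.present[3] is False            # slot recycled for reuse
```

Note what is absent: no hash, no comparator, no CAM — only a RAM read, a bit test, and a RAM write, which is exactly why ETS is attractive in discrete logic.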

  • Papadopoulos, G.M. and Culler, D.E. "Monsoon: An Explicit Token-Store Architecture." Proc. ISCA '90, Seattle, pp. 82–91, 1990. DOI: 10.1145/325164.325117. THE key Monsoon paper. ETS model, 8-stage pipeline, instruction memory, frame memory with presence bits, token format ⟨frame-pointer, slot-offset⟩, hazard handling, 128-word activation frames with shared free-list. Revolutionary simplification over Manchester's hash-based matching.

  • Culler, D.E. and Papadopoulos, G.M. "The Explicit Token Store." J. Parallel and Distributed Computing, 10:289–308, December 1990. Extended journal version. More detailed ETS state-bit mechanism, frame memory organization, pipeline stages, token format, and state transition logic.

  • Papadopoulos, G.M. "Implementation of a General Purpose Dataflow Multiprocessor." PhD Thesis, MIT EECS, September 1988. MIT/LCS/TR-432. Published by MIT Press, 1991 (ISBN: 0-262-66069-5). THE most detailed Monsoon hardware source. Complete board-level design, chip selection, pipeline timing, instruction decoding, I-structure memory controller interface, interconnection network. 16-processor systems deployed at Los Alamos and MIT.

  • Papadopoulos, G.M. and Culler, D.E. "Retrospective: Monsoon: An Explicit Token-Store Architecture." In 25 Years of ISCA, ACM, 1998, pp. 74–76. https://people.eecs.berkeley.edu/~culler/courses/cs252-s05/papers/p74-papadopoulos.pdf. Confirms 5–10 million messages/second per PE. 16-processor systems built with Motorola Cambridge Research Lab.

  • Hicks, J., Chiou, D., Ang, B.S., and Arvind. "Performance Studies of Id on the Monsoon Dataflow System." J. Parallel and Distributed Computing, 18(3):273–300, July 1993. Execution cycle counts and speedup curves on physical hardware. Compares single-PE Monsoon against MIPS R3000.

  • Papadopoulos, G.M. and Traub, K.R. "Multithreading: A Revisionist View of Dataflow Architectures." Proc. ISCA '91, Toronto, pp. 342–351, May 1991. Extends Monsoon's ETS with sequential scheduling within threads. Instructions within a thread use implicit sequencing, reducing synchronization overhead. Bridges toward *T.

1.4 *T (StarT)#

*T unifies dataflow with von Neumann by splitting "complex" dataflow instructions into separate synchronization, arithmetic, and fork/control instructions on a RISC-like PE. Frame memory with presence bits is unified with conventional memory.

  • Nikhil, R.S., Papadopoulos, G.M., and Arvind. "*T: A Multithreaded Massively Parallel Architecture." Proc. ISCA '92, Gold Coast, Australia, pp. 156–167, May 1992. DOI: 10.1109/ISCA.1992.753313. Core *T paper. PE organization: instruction fetch, register file, ALU, with synchronization primitives. Eliminates the distinction between processor and I-structure nodes.

  • Beckerle, M.J. "Overview of the START (*T) Multithreaded Computer." COMPCON Spring '93, IEEE, pp. 148–156, February 1993. Physical *T hardware design: processing node architecture, interconnect, memory system. Board-level and system-level details.

  • Arvind and Brobst, S.A. "The Evolution of Dataflow Architectures from Static Dataflow to P-RISC." Intl. J. High Speed Computing, 5(2), 1993. Traces hardware evolution: static → dynamic (tagged-token) → ETS (Monsoon) → P-RISC → *T. Explains at each step what hardware was added or removed.

  • Agarwal, A., Kubiatowicz, J., et al. "Sparcle: An Evolutionary Processor Design for Multiprocessors." IEEE Micro, 13(3):48–61, June 1993. Same design philosophy as *T: SPARC core modified for multithreading with fast context switching and presence-bit memory. Actual chip modifications described.

1.5 Japanese dynamic dataflow machines#

Japan invested heavily in dataflow hardware during the 1980s–90s through ETL (now AIST), NTT, and the Fifth Generation project. The EM-4 single-chip processor is particularly notable for its circular pipeline and strongly connected arc execution model.

Sigma-1 (ETL/MITI) — 128-PE dataflow supercomputer, operational 1988, measured >100 MFLOPS:

  • Hiraki, K., Shimada, T., and Nishida, K. "A Hardware Design of the SIGMA-1, A Data Flow Computer for Scientific Computations." Proc. ICPP '84, pp. 524–531, 1984. Original hardware architecture.

  • Yuba, T., Shimada, T., Hiraki, K., and Kashiwagi, H. "SIGMA-1: A Dataflow Computer for Scientific Computations." Computer Physics Communications, 37:141–148, 1985. DOI: 10.1016/0010-4655(85)90146-8. PE organization, matching memory unit, structure memory, communication network for ~200 PEs.

  • Shimada, T., Hiraki, K., Nishida, K., and Sekiguchi, S. "Evaluation of a Prototype Data Flow Processor of the SIGMA-1." Proc. ISCA '86, Tokyo, pp. 226–234, 1986. DOI: 10.1145/17356.17383. Operational PE evaluation, sticky-token mechanism for loop invariants.

  • Hiraki, K., Nishida, K., Sekiguchi, S., and Shimada, T. "Maintenance Architecture and its LSI Implementation of a Dataflow Computer." Proc. ICPP '86, pp. 584–591, 1986. LSI implementation details for the 128-PE system.

  • Hiraki, K., Sekiguchi, S., and Shimada, T. "The SIGMA-1 Dataflow Supercomputer: A Challenge for New Generation Supercomputing Systems." J. Information Processing (IPS Japan), 10(4):219–226, 1987. Comprehensive description of the complete 128-PE system.

  • Hiraki, K., Sekiguchi, S., and Shimada, T. "Status Report of SIGMA-1: A Data-Flow Supercomputer." In Gaudiot and Bic (eds.), Advanced Topics in Data-Flow Computing, Prentice Hall, Chapter 7, pp. 207–223, 1991. Status report of the completed system with hardware implementation details.

EM-4 / EM-X (ETL) — Single-chip dataflow PE (EMC-R) on 50,000-gate array, 80-PE prototype operational April 1990:

  • Sakai, S., Yamaguchi, Y., Hiraki, K., Kodama, Y., and Yuba, T. "An Architecture of a Dataflow Single Chip Processor." Proc. ISCA '89, pp. 46–53, 1989. DOI: 10.1145/74925.74931. THE core EM-4 chip paper. EMC-R on 50K-gate array. Strongly connected arc model, direct matching, RISC-based design, deadlock-free on-chip packet switch, circular pipeline.

  • Sakai, S., Kodama, Y., and Yamaguchi, Y. "Prototype Implementation of a Highly Parallel Dataflow Machine EM-4." Proc. IPPS '91, pp. 278–286, 1991. DOI: 10.1109/IPPS.1991.153792. 80-PE prototype implementation, peak 1 GIPS. Multiple-RISC concept, versatile interconnection network.

  • Sakai, S., Kodama, Y., and Yamaguchi, Y. "Design and Implementation of a Circular Omega Network in the EM-4." Parallel Computing, 19(2):125–142, 1993. DOI: 10.1016/0167-8191(93)90043-K. THE network paper for EM-4. Circular omega topology, self-routing, deadlock prevention, 14.63 GB/s.

  • Yamaguchi, Y., Sakai, S., and Kodama, Y. "Synchronization Mechanisms of a Highly Parallel Dataflow Machine EM-4." IEICE Trans., E74(1):204–213, 1991. Two synchronization mechanisms: direct matching and register-based.

  • Kodama, Y. et al. "EMC-Y: Parallel Processing Element Optimizing Communication and Computation." Proc. ICS '93, pp. 167–174, 1993. Successor chip to EMC-R for the EM-X system.

  • Kodama, Y. et al. "The EM-X Parallel Computer: Architecture and Basic Performance." Proc. ISCA '95, pp. 14–23, 1995. DOI: 10.1145/223982.223987. Full EM-X system architecture with EMC-Y PE.

  • Sakai, S. et al. "Reduced Interprocessor-Communication Architecture and Its Implementation on EM-4." Parallel Computing, 21(5):753–769, 1995. RICA design: fused pipeline performing message handling + instruction execution + packet output in 2 RISC clocks.

DDDP (OKI Electric) — Four PEs on ring bus with hardware hashing:

  • Kishi, M., Yasuhara, H., and Kawamura, Y. "DDDP — A Distributed Data Driven Processor." Proc. ISCA '83, Stockholm, pp. 236–242, 1983. DOI: 10.1145/800046.801661. Four PEs connected by ring bus + structured data memory. Hardware hashing for token coloring. 0.7 MIPS measured.

Other Japanese dataflow machines:

  • Amamiya, M. et al. DFM architecture and evaluation papers. User already has amamiya1982.pdf and the evaluation paper.

  • Amamiya, M. et al. "Implementation and Evaluation of a List-Processing-Oriented Data Flow Machine." Proc. ISCA '86, Tokyo, pp. 10–19, 1986. Register-transfer-level implementation, ~5× speedup over von Neumann equivalent. Closely related to user's existing papers.

  • Ito, N. et al. "The Architecture and Preliminary Evaluation Results of the Experimental Parallel Inference Machine PIM-D." Proc. ISCA '86, pp. 149–156, 1986. DOI: 10.1145/17356.17373. Dataflow-based parallel inference machine from the Fifth Generation project.

  • Yuba, T. et al. "Dataflow Computer Development in Japan." Proc. ICS '90, ACM SIGARCH, 18(3b):140–147, 1990. DOI: 10.1145/255129.255151. Survey covering SIGMA-1 and EM-4 with comparison of design goals.

  • Yuba, T. "Research and Development Efforts on Data-flow Computer Architecture in Japan." J. Information Processing, 9(2):51–60, 1986. Overview of the entire Japanese dataflow program.

Note on DDMP and TOPSTAR: The term "DDMP" (Data-Driven Media Processor) was not confirmed in English-language literature; the Takahashi/Amamiya NTT machine is consistently called a "Dataflow Processor Array System." TOPSTAR was a function-level (coarse-grain) dataflow machine at University of Tokyo under Prof. Moto-Oka, not an ETL successor to EM-4. The actual EM-4 successor at ETL was EM-X (EMC-Y chip).


2. Static dataflow machines#

Static dataflow uses acknowledgment signals rather than tagged tokens: one token per arc, with presence bits in instruction cells. Simpler hardware but no reentrancy without code copying.
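The firing rule of a static instruction cell can be sketched directly; a toy model in the Dennis/Misunas style (the acknowledgment signals that enforce one-token-per-arc back at the producers are elided here):

```python
# Sketch of a static-dataflow instruction cell: fixed operand slots with
# presence flags; the cell fires when all slots fill, then clears them.
# Acknowledgment signalling to upstream cells is omitted for brevity.

class InstructionCell:
    def __init__(self, op, n_operands):
        self.op = op
        self.present = [False] * n_operands
        self.operands = [None] * n_operands

    def deliver(self, slot, value):
        """Accept a token; return the result if the cell fires, else None."""
        assert not self.present[slot], "one token per arc violated"
        self.present[slot] = True
        self.operands[slot] = value
        if all(self.present):
            result = self.op(*self.operands)
            self.present = [False] * len(self.present)   # ready to refire
            return result
        return None

add = InstructionCell(lambda a, b: a + b, 2)
assert add.deliver(0, 2) is None    # waiting for the second operand
assert add.deliver(1, 3) == 5       # both present: fire
```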

2.1 Dennis MIT static dataflow#

  • Dennis, J.B. and Misunas, D.P. "A Preliminary Architecture for a Basic Data Flow Processor." Proc. ISCA '75, pp. 126–132, January 1975. DOI: 10.1145/642089.642111. THE seminal static dataflow hardware paper. Cell-based PE with instruction cells containing operand slots and presence bits. Circular pipeline. Four key networks: distribution, control, arbitration, and processing section.

  • Dennis, J.B., Boughton, G.A., and Leung, C.K.C. "Building Blocks for Data Flow Prototypes." Proc. ISCA '80, La Baule, France, pp. 1–8, May 1980. EXTREMELY VALUABLE for TTL builders. Describes modular hardware building blocks (instruction cell modules, operation units, arbitration networks) specifically designed for constructing dataflow processor prototypes from discrete logic.

  • Dennis, J.B. "Data Flow Supercomputers." IEEE Computer, 13(11):48–56, November 1980. DOI: 10.1109/C-M.1980.220135. Architectural vision for scaling static dataflow. PE pipeline, multiprocessor interconnect, memory system.

  • Dennis, J.B., Lim, W.Y.P., and Ackerman, W.B. "The MIT Data Flow Engineering Model." Proc. IFIP 9th World Computer Congress, Paris, pp. 553–560, 1983. Engineering model of the MIT static dataflow machine: pipelined PE and interconnection network.

  • Dennis, J.B. "The Evolution of 'Static' Data-Flow Architecture." In Advanced Topics in Data-Flow Computing, Gaudiot and Bic (eds.), Prentice Hall, Chapter 2, 1991. Dennis's own comprehensive account of how the static architecture evolved through hardware prototypes.

  • Rumbaugh, J.E. "A Data Flow Multiprocessor." IEEE Trans. Comput., C-26(2):138–146, February 1977. DOI: 10.1109/TC.1977.5009292. Also: MIT Project MAC TR-150, 1975. Early hardware dataflow multiprocessor. Hierarchically constructed network of asynchronous modules. Pipelined activation processors. Important for discrete-logic design: modular, asynchronous approach.

2.2 TI Distributed Data Processor#

  • Cornish, M., Hogan, D.W., and Jensen, J.C. "The Texas Instruments Distributed Data Processor." Proc. Louisiana Computer Exposition, pp. 189–193, March 1979. First public paper on TI DDP. Board-level static dataflow processor for real-time avionics. Register-free instruction format.

  • Cornish, M. et al. "The TI Data Flow Architectures: The Power of Concurrency for Avionics." Proc. 3rd Digital Avionics Systems Conf., Fort Worth, TX, pp. 19–25, November 1979. Hardware architecture for avionics. Data-driven controllers find all parallel instructions simultaneously.

2.3 LAU System (CERT/ONERA Toulouse)#

First operational dataflow multiprocessor (32 processors, 1979).

  • Plas, A., Comte, D., Gelly, O., and Syre, J.C. "LAU System Architecture: A Parallel Data-Driven Processor Based on Single Assignment." Proc. ICPP '76, pp. 293–302, 1976. Primary architecture paper. PE hardware, acknowledgment-based token management, communication network.

  • Syre, J.C., Comte, D., and Hifdi, N. "Pipelining, Parallelism and Asynchronism in the LAU System." Proc. ICPP '77, Detroit, pp. 87–92, August 1977. Pipeline design within LAU PEs and asynchronous inter-PE communication hardware.

  • Comte, D. and Hifdi, N. "LAU Multiprocessor: Microfunctional Description and Technological Choices." Proc. Workshop: Data Driven Languages and Machines, Toulouse, 1979, pp. I.1–I.8. Detailed hardware microfunctional units and board-level technology choices.

  • Comte, D., Hifdi, N., and Syre, J.C. "The Data Driven LAU Multiprocessor System: Results and Perspectives." Proc. IFIP Congress 80, Tokyo, pp. 175–180, 1980. Performance results from the operational 32-processor system.

2.4 Other static machines#

  • Davis, A.L. "The Architecture and System Method of DDM1: A Recursively Structured Data Driven Machine." Proc. ISCA '78, pp. 210–215, April 1978. DOI: 10.1145/800094.803050. DDM1 at University of Utah — first operational dataflow processor (July 1976). Tree-structured, fully distributed, asynchronous. Each module is self-timed; no global clock. Board-level hardware.

  • Vedder, R. and Finn, D. "The Hughes Data Flow Multiprocessor: Architecture for Efficient Signal and Data Processing." Proc. ISCA '85, Boston, pp. 324–332, June 1985. DOI: 10.1145/327070.327290. Hughes Aircraft Company. Hierarchical network of asynchronous modules with pipelined activation processors.

  • NEC μPD7281 Image Pipeliner — one of the few commercially sold dataflow chips (released December 1985). Static dataflow for image processing, 5 MIPS, unidirectional pipeline bus; multiple chips cascade without extra circuitry. See: Kurokawa, K. et al., "The Architecture and Performance of Image Pipeline Processor," VLSI '83, pp. 275–284. Also: NEC μPD7281 data sheet.

  • Hartimo, I., Kronlof, K., Simula, O., and Skytta, J. "DFSP: A Data Flow Signal Processor." IEEE Trans. Comput., C-35(1):23–33, June 1986. Board-level medium-grain static dataflow processor for signal processing, from Helsinki.


3. Hybrid dataflow/von Neumann architectures#

These combine dataflow's latency-tolerant fine-grained synchronization with conventional sequential execution efficiency.

  • Iannucci, R.A. "Toward a Dataflow/von Neumann Hybrid Architecture." Proc. ISCA '88, pp. 131–140, June 1988. DOI: 10.1145/633625.52416. Also MIT/LCS/TR-418. Foundational hybrid paper. P-RISC (Parallel RISC): RISC instruction set extended with three dataflow-style synchronization instructions. Scheduling quanta bound at compile time. ~240 citations.

  • Nikhil, R.S. and Arvind. "Can Dataflow Subsume von Neumann Computing?" Proc. ISCA '89, pp. 262–272, 1989. DOI: 10.1109/ISCA.1989.714561. Introduces P-RISC concept: modified von Neumann PE with split-phase memory and presence-bit memory for dataflow-style synchronization. Directly influenced *T design.

  • Grafe, V.G., Davidson, G.S., Hoch, J.E., and Holmes, V.P. "The Epsilon Dataflow Processor." Proc. ISCA '89, pp. 36–45, June 1989. https://ieeexplore.ieee.org/document/714522/. Directly matches ready operands, eliminating associative matching stores entirely. 10 MFLOPS prototype at Sandia National Labs.

  • Grafe, V.G. and Hoch, J.E. "The Epsilon-2 Multiprocessor System." J. Parallel and Distributed Computing, 10(4):309–318, 1990. Also SAND-89-2622C. https://www.osti.gov/servlets/purl/5213672. Combines fine-grain dataflow parallelism with von Neumann sequential efficiency. Instruction-level synchronization, single-cycle context switches, RISC-like execution, tree of activation frames.

  • Gao, G.R., Hum, H.H.J., and Monti, J.-M. "Towards an Efficient Hybrid Dataflow Architecture Model." Proc. PARLE '91, Springer LNCS 505, pp. 355–371, June 1991. DOI: 10.1007/978-3-662-25209-3_24. McGill Dataflow Architecture (MDFA). Argument-fetching dataflow principle, conventional pipelined execution, dataflow software pipelining. Predecessor to EARTH.

  • Hum, H.H.J. et al. "A Design Study of the EARTH Multiprocessor." Proc. PACT '95, pp. 59–68, 1995. Core EARTH architecture: off-the-shelf Execution Unit + ASIC Synchronization Unit for dataflow-like thread scheduling.

  • Hum, H.H.J. et al. "A Study of the EARTH-MANNA Multithreaded System." Intl. J. Parallel Programming, 24(4):319–348, 1996. DOI: 10.1007/BF03356753. Key paper showing how EARTH combines dataflow signaling with RISC cores. Each processor has an SU handling synchronization + remote requests.

  • Culler, D.E. et al. "Fine-grain Parallelism with Minimal Hardware Support: A Compiler-Controlled Threaded Abstract Machine." Proc. ASPLOS-IV, pp. 164–175, April 1991. TAM: shows synchronization on conventional multiprocessors can approach dedicated hardware performance.

  • Nowatzki, T., Gangadhar, V., and Sankaralingam, K. "Heterogeneous Von Neumann/Dataflow Microprocessors." Communications of the ACM, 62(6):83–91, 2019. Modern perspective on hybrid architectures. Shows continued relevance of dataflow ideas in contemporary microprocessor design.


4. Systolic arrays and wavefront processors#

4.1 CMU systolic arrays#

  • Kung, H.T. and Leiserson, C.E. "Systolic Arrays (for VLSI)." Sparse Matrix Proceedings 1978, SIAM, pp. 256–282, 1979. Also in Mead and Conway, Introduction to VLSI Systems, Addison-Wesley, 1980, Section 8.3. Originated the term "systolic." Locally connected processor networks with rhythmic data flow. Band matrix algorithms.

  • Kung, H.T. "Why Systolic Architectures?" IEEE Computer, 15(1):37–46, January 1982. DOI: 10.1109/MC.1982.1653825. THE definitive systolic paper (~2,400 citations). I/O bandwidth problem, systolic solution, multiple array designs for convolution and matrix multiplication.

  • Fisher, A.L., Kung, H.T., Monier, L.M., and Dohi, Y. "Architecture of the PSC: A Programmable Systolic Chip." Proc. ISCA '83, Stockholm, pp. 48–53, June 1983. DOI: 10.1145/800046.801637. Extended: J. VLSI and Computer Systems, 1(2):153–169, 1984. CMU PSC chip: ~25,000 transistors, 74 pins, 4μm nMOS via MOSIS. First microprogrammable chip for systolic arrays.
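The rhythmic data movement that defines a systolic array can be simulated in a few lines. A toy 1-D convolution array (not any specific published design): weights stay resident in cells, the input is broadcast each beat, and partial sums march one cell per beat:

```python
# Toy beat-level simulation of a 1-D systolic convolution array:
# stationary weights, broadcast input, partial sums moving systolically.

def systolic_convolve(x, w):
    """Computes y[i] = sum_k w[k] * x[i + k], one beat per input sample."""
    K = len(w)
    y = [0] * K                   # partial-sum register in each cell
    out = []
    for t, xt in enumerate(x):
        # One beat: each cell adds its term to its left neighbour's old sum.
        y = [(y[k - 1] if k else 0) + w[k] * xt for k in range(K)]
        if t >= K - 1:
            out.append(y[-1])     # a fully formed result leaves the array
    return out

assert systolic_convolve([1, 2, 3, 4], [1, 1]) == [3, 5, 7]
```

The point of the exercise: each cell does one multiply-accumulate per beat with only neighbour communication, so throughput scales with array length while I/O bandwidth stays constant.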

4.2 Warp and iWarp#

  • Annaratone, M. et al. "The Warp Computer: Architecture, Implementation, and Performance." IEEE Trans. Comput., C-36(12):1523–1538, December 1987. DOI: 10.1109/TC.1987.5009502. Linear systolic array of 10 programmable cells at 10 MFLOPS each (100 MFLOPS peak). Board-level implementation using off-the-shelf components. Wire-wrap prototype (WW-Warp) completed June 1985; production PC-Warp delivered by GE April 1987.

  • Borkar, S. et al. "iWarp: An Integrated Solution to High-Speed Parallel Computing." Proc. Supercomputing '88, pp. 330–339, 1988. https://www.eecs.harvard.edu/~htk/publication/1988-supercomputing-borkar-etc.pdf. iWarp chip: 1.2cm × 1.2cm, ~700K transistors. 32-bit RISC core + 96-bit LIW decoder, 64-bit FPU at 20 MHz (20 MFLOPS), communication agent (320 MB/s). Typical system: 8×8 torus (64 cells).

  • Borkar, S. et al. "Supporting Systolic and Memory Communication in iWarp." Proc. ISCA '90, Seattle, pp. 70–81, May 1990. Hardware support for both systolic (streaming) and memory-based (message passing) communication on one chip. Network connections mapped to on-chip gate registers.

  • Gross, T. and O'Hallaron, D.R. iWarp: Anatomy of a Parallel Computing System. MIT Press, 1998, 488 pp. The definitive iWarp reference: node architecture, communication agent, wormhole routing, system integration.

4.3 Wavefront array processors#

  • Kung, S.Y. et al. "Wavefront Array Processor: Language, Architecture, and Applications." IEEE Trans. Comput., C-31(11):1054–1066, November 1982. Introduces WAP: unlike systolic (globally clocked), WAPs use data-driven, self-timed processing. Each PE has bidirectional buffers with independent status flags. Same geometry as systolic arrays but asynchronous local control.

  • Kung, S.Y. VLSI Array Processors. Prentice Hall, 1988. ISBN: 0-13-942749-X. Definitive book on array processor architectures covering both systolic and wavefront design methodology.

  • Vlontzos, J.A. and Kung, S.Y. "A Wavefront Array Processor Using Dataflow Processing Elements." Supercomputing '87, Springer LNCS 297, pp. 608–620, 1988. DOI: 10.1007/3-540-18991-2_42. Prototype WAP built from NEC μPD7281 dataflow chips with reconfigurable interconnections. Demonstrates physical hardware.


5. Transport-triggered architectures (TTA)#

TTAs expose the processor's internal transport buses as the programming model. Operations are triggered as side effects of data movement — conceptually related to dataflow's data-driven execution.
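The "operation as a side effect of a move" idea can be illustrated with a toy function unit (port names are hypothetical; in a real TTA such as MOVE32INT the only instruction is MOVE between bus-addressable ports):

```python
# Toy illustration of transport-triggered execution: the program consists
# only of moves, and writing a function unit's trigger port fires its
# operation as a side effect.

class AdderFU:
    """Function unit with an operand port, a trigger port, and a result."""
    def __init__(self):
        self.operand = 0
        self.result = 0

    def trigger(self, value):
        # Writing the trigger port *is* the operation request.
        self.result = self.operand + value

fu = AdderFU()
fu.operand = 2         # MOVE 2 -> add.operand
fu.trigger(3)          # MOVE 3 -> add.trigger   (fires the add)
assert fu.result == 5  # MOVE add.result -> wherever it is needed next
```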

  • Corporaal, H. "Design of Transport Triggered Architectures." Proc. 4th Great Lakes Symp. on VLSI, pp. 130–135, March 1994. https://www.cecs.uci.edu/~papers/compendium94-03/papers/1994/glsvlsi94/pdffiles/glsvlsi94_130.pdf. Introduces TTA and the MOVE32INT prototype: 32-bit pipelined TTA processor. Function units triggered by data transport to trigger registers.

  • Corporaal, H. Microprocessor Architectures: From VLIW to TTA. John Wiley & Sons, 1997. The definitive TTA reference. MOVE framework for processor generation, hardware complexity analysis, MOVE32INT implementation. Shows how TTA reduces register file complexity vs. VLIW.

  • Corporaal, H. "TTAs: Missing the ILP Complexity Wall." J. Systems Architecture, 44(9-10):619–650, 1998. Analyzes VLIW data path complexity and demonstrates TTA's hardware reduction in register file ports, bypass connectivity, and overall area.

  • TCE/OpenASIP — open-source TTA codesign environment from Tampere University. https://openasip.org. Complete toolchain: Architecture Definition File → compiler → simulator → synthesizable VHDL. Enables automatic design-space exploration for custom TTA processors.

  • Maxim MAXQ — the only commercially manufactured TTA-based microcontroller. 16-bit core in which a single MOVE instruction encodes source and destination modules on an internal transport bus; computation occurs as a side effect of the transport. Originally a Dallas Semiconductor design; products shipped from roughly 2004 into the 2020s.

  • 8-bit TTL TTA — a transport-triggered CPU built from discrete TTL chips on stripboard, documented on Hackaday. Directly relevant as a discrete-logic TTA precedent.


6. Modern dataflow-inspired architectures#

Lower priority but included for significant hardware implementation detail.

  • Swanson, S. et al. "WaveScalar." Proc. MICRO-36, pp. 291–302, 2003. DOI: 10.1109/MICRO.2003.1253203. https://homes.cs.washington.edu/~oskin/wavescalar.pdf. WaveScalar ISA and WaveCache: grid of ALU-in-cache nodes. Completely distributed matching (no central CAM), direct inter-PE communication. 2–7× over superscalar.

  • Swanson, S. et al. "The WaveScalar Architecture." ACM Trans. Computer Systems, 25(2), Article 4, May 2007. DOI: 10.1145/1233307.1233308. Comprehensive journal paper: full ISA, WaveCache microarchitecture, area/performance evaluation.

  • Swanson, S. et al. "Area-Performance Trade-offs in Tiled Dataflow Architectures." Proc. ISCA '06, pp. 314–326, 2006. RTL-based synthesis study. Pareto analysis of >200 WaveScalar designs. Crucial for understanding hardware cost of dataflow tiles.

  • Sankaralingam, K. et al. "Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture." Proc. ISCA '03, pp. 422–433, 2003. DOI: 10.1109/ISCA.2003.1207019. EDGE ISA: 4×4 grid of execution units, dataflow scheduling within blocks, control-flow between blocks.

  • Sankaralingam, K. et al. "Distributed Microarchitectural Protocols in the TRIPS Prototype Processor." Proc. MICRO '06, pp. 480–491, 2006. DOI: 10.1109/MICRO.2006.19. Actual TRIPS prototype chip: two 16-wide cores, up to 1,024 instructions in flight. Fabricated in silicon.

  • Parashar, A. et al. "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures." Proc. ISCA '13, pp. 142–153, 2013. DOI: 10.1145/2485922.2485935. https://people.csail.mit.edu/emer/media/papers/2013.06.isca.triggeredinstructions.pdf. Eliminates the program counter entirely. PEs react to input channel arrivals. 8× area-normalized performance vs. GP processor. Very relevant to dataflow-style hardware control.

  • Jeffrey, M.C. et al. "A Scalable Architecture for Ordered Parallelism." Proc. MICRO '15, pp. 228–241, 2015. Swarm: tiled CMP with task units for speculative ordered parallelism. Hardware timestamp-ordered commit, Bloom filter conflict detection.


7. Major surveys and meta-sources#

These provide the best starting points for understanding the field as a whole and serve as indexes to the primary literature.

| Year | Authors | Title | Venue | Key value |
|------|---------|-------|-------|-----------|
| 1980 | Dennis | "Data Flow Supercomputers" | IEEE Computer 13(11) | Founding vision, static dataflow scaling |
| 1982 | Treleaven, Brownbridge, Hopkins | "Data-Driven and Demand-Driven Computer Architecture" | ACM Comp. Surveys 14(1) | Broadest scope: dataflow + reduction architectures |
| 1982 | Gajski, Padua, Kuck, Kuhn | "A Second Opinion on Data Flow Machines and Languages" | IEEE Computer 15(2) | Critical counterpoint from Illinois group |
| 1986 | ★ Veen | "Dataflow Machine Architecture" | ACM Comp. Surveys 18(4) | User has this. Most focused survey on hardware |
| 1986 | Arvind and Culler | "Dataflow Architectures" | Annual Review of CS 1:225–253 | MIT perspective; PDF at https://apps.dtic.mil/sti/pdfs/ADA166235.pdf |
| 1986 | Srini | "An Architectural Comparison of Dataflow Systems" | IEEE Computer 19(3) | Compares 7 architectures on 16 criteria |
| 1994 | Lee and Hurson | "Dataflow Architectures and Multithreading" | IEEE Computer 27(8) | Bridges pure dataflow to multithreading era |
| 1999 | Šilc, Robič, Ungerer | Processor Architecture: From Dataflow to Superscalar and Beyond | Springer (textbook) | Most comprehensive textbook; 389 pp. |
| 1999 | Najjar, Lee, Gao | "Advances in the Dataflow Computational Model" | Parallel Computing 25(13-14) | State of the art circa 1999 |
| 2005 | Arvind | "Dataflow: Passing the Token" | ISCA keynote; https://csg.csail.mit.edu/Users/arvind/ISCAfinal.pdf | Visual retrospective of entire arc from static → *T |

8. Design-critical cross-cutting references#

These papers address specific hardware subsystem challenges most relevant to a TTL-based dynamic dataflow build.

Matching store tradeoffs: The field explored three fundamental approaches: (1) CAM-based — Manchester's original design, replaced due to cost and scaling; (2) hash-based — Manchester's production design (da Silva & Watson 1983), practical with SRAM but requires overflow handling; (3) presence-bit / direct-indexed — Monsoon's ETS (Papadopoulos & Culler 1990), eliminates associative lookup entirely by moving slot assignment to the compiler. For a TTL build, the ETS approach is dramatically simpler in hardware — each frame slot needs only a 1-bit presence flag plus a data-width storage word. The hash approach is also tractable with 74-series logic (hash function → SRAM address → comparator → overflow chain).
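The presence-bit mechanism can be sketched in a few lines, assuming dyadic instructions and a compiler that has already assigned each activation a frame slot (the slot count and field names below are illustrative, not Monsoon's actual parameters):

```python
# Explicit-token-store matching sketch: one presence bit plus one
# data-width word per frame slot. No CAM, no hashing - the token's
# slot address indexes the store directly.

FRAME_SLOTS = 64
presence = [False] * FRAME_SLOTS  # 1 bit per slot in hardware
store = [0] * FRAME_SLOTS         # one data word per slot (SRAM)

def ets_match(slot, value):
    """Return (left, right) when both operands have arrived, else None."""
    if not presence[slot]:
        # First operand: park it in the slot and set the presence bit.
        presence[slot] = True
        store[slot] = value
        return None
    # Second operand: read the parked partner, clear the bit, fire.
    presence[slot] = False
    return (store[slot], value)

print(ets_match(3, 10))   # first token parked, nothing fires
print(ets_match(3, 32))   # partner arrives, pair fires: (10, 32)
```

In TTL terms this is one SRAM read-modify-write per token plus a single flip-flop's worth of state per slot, which is why the ETS approach is so much more tractable than an associative search.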

Token format considerations: Manchester tokens carry ⟨tag, data⟩ where the tag encodes ⟨destination-instruction, iteration-level, activation-name⟩. The tag width directly determines matching store addressing and dominates interconnect bandwidth. Monsoon simplified this to ⟨frame-pointer, slot-offset, instruction-pointer, data⟩. The EM-4 uses a strongly connected arc model with direct matching and register-based synchronization on a 50K-gate budget.
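The bandwidth point can be made concrete by packing a Monsoon-style token into fixed-width fields. The widths below (16/8/16/32 bits) are arbitrary assumptions for illustration, not the actual Monsoon encoding; the exercise shows how tag width translates directly into backplane wires and SRAM address bits.

```python
# Pack a Monsoon-style token <frame-pointer, slot-offset,
# instruction-pointer, data> into a single word. The field widths
# chosen here are illustrative assumptions, not Monsoon's.

FP_BITS, SLOT_BITS, IP_BITS, DATA_BITS = 16, 8, 16, 32

def pack(fp, slot, ip, data):
    word = fp
    word = (word << SLOT_BITS) | slot
    word = (word << IP_BITS) | ip
    word = (word << DATA_BITS) | data
    return word            # 72 bits total under these assumptions

def unpack(word):
    data = word & ((1 << DATA_BITS) - 1); word >>= DATA_BITS
    ip = word & ((1 << IP_BITS) - 1); word >>= IP_BITS
    slot = word & ((1 << SLOT_BITS) - 1); word >>= SLOT_BITS
    return word, slot, ip, data

token = pack(0x1234, 0x0A, 0x0100, 0xDEADBEEF)
assert unpack(token) == (0x1234, 0x0A, 0x0100, 0xDEADBEEF)
```

Every bit of tag is a wire on every inter-unit path and a pin on every latch, so trimming the tag (as Monsoon did relative to Manchester's full ⟨iteration-level, activation-name⟩ tag) pays off across the whole machine.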

Interconnect choices: Manchester uses a ring pipeline (simplest; all units connected in sequence). EM-4 uses a circular omega network with store-and-forward routing. Dennis's original architecture defines four separate networks (distribution, control, arbitration, processing). For a small-scale TTL prototype, the ring topology used by Manchester and DDDP is most tractable.
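A toy model of the Manchester-style ring makes the control flow explicit: tokens circulate through token queue → matching → instruction store → processing unit, with results re-entering the queue. The two-instruction "program" and the bare `(dest, value)` token shape are invented for illustration; real tokens carry full tags, and the matching step here stands in for the hash or presence-bit store.

```python
# Toy Manchester-style ring: token queue -> matching -> instruction
# store -> processing unit, results re-entering the queue. The
# program and token shape are invented placeholders.

from collections import deque

# instruction store: dest -> (opcode, next_dest or None)
istore = {0: ("add", 1), 1: ("double", None)}
waiting = {}          # matching store: dest -> parked operand
queue = deque()       # token queue; tokens are (dest, value)

def run(tokens):
    queue.extend(tokens)
    results = []
    while queue:
        dest, value = queue.popleft()           # token queue
        op, nxt = istore[dest]                  # instruction store fetch
        if op == "add":                         # dyadic: needs a partner
            if dest not in waiting:
                waiting[dest] = value           # park first operand
                continue
            value = waiting.pop(dest) + value   # pair found: fire
        elif op == "double":                    # monadic: bypasses matching
            value = 2 * value
        if nxt is None:
            results.append(value)               # token leaves the ring
        else:
            queue.append((nxt, value))          # back around the ring
    return results

print(run([(0, 3), (0, 4)]))  # add 3+4, then double -> [14]
```

The appeal for a TTL build is visible even in the toy: each unit talks to exactly one upstream and one downstream neighbor, so the interconnect is just point-to-point latched buses with handshaking, and pipeline balance is tuned per unit rather than across a routed network.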

Throttling and resource management: Culler's MIT TR-332 (1985) is the essential reference on token store overflow, deadlock prevention, and termination detection — all critical for a finite-resource hardware implementation.


Conclusion#

The most actionable references for a 74-series TTL dynamic dataflow build form a clear reading path. Start with the da Silva & Watson (1983) paper on hash-based matching and the Papadopoulos & Culler (1990) ETS papers to make the fundamental matching store architecture decision — hash vs. presence-bit. Study the Manchester ring pipeline (Watson & Gurd 1982, Gurd et al. 1985) as the proven TTL implementation template. Read Dennis's "Building Blocks" (1980) for modular discrete-logic dataflow design philosophy. The EM-4 papers (Sakai et al. 1989, 1991, 1993) offer the most sophisticated pipeline and interconnect designs if scaling beyond a single PE. For throttling, Culler's TR-332 is essential. The major surveys — especially Arvind & Culler (1986) and Lee & Hurson (1994) — efficiently index the rest of the literature. The Japanese machines (Sigma-1, EM-4) are underappreciated in English-language literature but contain some of the most detailed hardware implementation work in the field.