OR-1 dataflow CPU sketch
0
fork

Configure Feed

Select the types of activity you want to include in your feed.

OR1 Dataflow CPU Behavioural Emulator Design#

Summary#

This design describes a behavioural emulator for the OR1 dataflow CPU architecture, built using Python's SimPy discrete event simulation framework. The emulator models three core hardware components: Processing Elements (PEs) that execute dataflow instructions using a matching store for operand pairing and generation counters for dataflow synchronization; Structure Memory (SM) modules that implement I-structure semantics with blocking reads and a depth-1 deferred read register; and a token-routing network that connects these components via bounded FIFOs with backpressure.

The implementation prioritizes behavioural correctness over cycle-accurate timing. Tokens flow as complete Python objects between modules connected through SimPy Stores that represent hardware FIFOs, with backpressure emerging naturally when stores reach capacity. The system provides a direct Python initialization API for configuring IRAM contents, SM state, and routing tables without requiring configuration token processing. A property-based test suite using pytest and hypothesis validates matching store invariants, I-structure state transitions, ALU correctness across all v0 operations, output formatter modes, and end-to-end token flow through multi-PE programs.

Definition of Done#

  1. SimPy-based emulator with behaviourally correct implementations of: Processing Elements (matching store with generation counters and presence bits, ALU execution for all v0 operations, output formatter with SUPPRESS/SINGLE/DUAL/SWITCH modes), Structure Memory (I-structure semantics with depth-1 deferred read register, 2-bit presence state machine for all transitions, all v0 memory operations), and a token-routing network (type-based routing, backpressure via bounded SimPy Stores).

  2. Direct initialization API — Python-native setup of IRAM contents, SM initial state, and routing configuration. CfgToken processing as abstract IRAM/route mutation is a stretch goal.

  3. Property-based test suite (pytest + hypothesis) validating: matching store invariants (generation counter ABA prevention, presence bit state machine correctness), SM state transitions and I-structure semantics (deferred read depth-1 constraint, blocking read/write ordering), ALU operation correctness across the full v0 opcode set, output formatter mode semantics (token count and routing per mode), and end-to-end token flow through a minimal multi-PE path.

  4. End-to-end demonstration: the emulator can run a small hand-constructed program (a few instructions across 1-2 PEs) end-to-end, demonstrating correct token flow, matching, execution, and output routing.

Acceptance Criteria#

or1-emu.AC1: Processing Element Behaviour#

  • or1-emu.AC1.1 Success: Monadic token bypasses matching store and executes immediately
  • or1-emu.AC1.2 Success: First dyadic token for an offset/ctx stores in matching store, does not fire
  • or1-emu.AC1.3 Success: Second dyadic token for same offset/ctx retrieves partner, fires instruction
  • or1-emu.AC1.4 Success: Generation counter mismatch causes stale token discard (no match, no fire)
  • or1-emu.AC1.5 Success: Output formatter SINGLE mode emits exactly one token to dest_l
  • or1-emu.AC1.6 Success: Output formatter DUAL mode emits two tokens with same data to dest_l and dest_r
  • or1-emu.AC1.7 Success: Output formatter SWITCH mode routes data to taken side, inline trigger to not-taken side
  • or1-emu.AC1.8 Success: Output formatter SUPPRESS mode emits zero tokens
  • or1-emu.AC1.9 Failure: Token targeting non-existent IRAM offset is handled without crash

or1-emu.AC2: ALU Correctness#

  • or1-emu.AC2.1 Success: Arithmetic ops (ADD, SUB) produce correct 16-bit wrapping results for all uint16 pairs
  • or1-emu.AC2.2 Success: INC/DEC monadic ops correctly increment/decrement
  • or1-emu.AC2.3 Success: Shift ops use const as shift amount; ASHIFTR sign-extends from bit 15
  • or1-emu.AC2.4 Success: Logic ops (AND, OR, XOR, NOT) produce correct bitwise results
  • or1-emu.AC2.5 Success: Comparison ops interpret operands as signed 2's complement (0xFFFF < 0x0001)
  • or1-emu.AC2.6 Success: Comparison ops produce 0x0001/0x0000 result and correct bool_out
  • or1-emu.AC2.7 Success: Routing ops (BR*, SW*, GATE) compute boolean and pass data through unchanged
  • or1-emu.AC2.8 Success: PASS returns left operand, CONST returns const field
  • or1-emu.AC2.9 Edge: Signed boundary: GT(0x7FFF, 0x8000) is true (32767 > -32768)

or1-emu.AC3: Structure Memory Behaviour#

  • or1-emu.AC3.1 Success: READ on FULL cell returns data immediately via result token
  • or1-emu.AC3.2 Success: READ on EMPTY cell with empty deferred register stashes return route, sets WAITING
  • or1-emu.AC3.3 Success: WRITE on WAITING cell satisfies deferred read — emits result token to stashed return route
  • or1-emu.AC3.4 Success: WRITE on EMPTY/RESERVED sets cell to FULL
  • or1-emu.AC3.5 Success: CLEAR sets cell to EMPTY, cancels deferred read if targeting that cell
  • or1-emu.AC3.6 Success: READ_INC/READ_DEC atomically modify and return value (lower 256 cells only)
  • or1-emu.AC3.7 Failure: Depth-1 constraint: second blocking READ on different empty cell stalls until first deferred read is satisfied
  • or1-emu.AC3.8 Edge: WRITE on FULL cell overwrites data (diagnostic flag set if modelled)

or1-emu.AC4: Network and Routing#

  • or1-emu.AC4.1 Success: Token with dest PE_id N arrives at PE N's input Store
  • or1-emu.AC4.2 Success: SM token (type 10) routes to correct SM by SM_id
  • or1-emu.AC4.3 Success: Backpressure: PE blocks on put() when destination Store is at capacity
  • or1-emu.AC4.4 Success: Backpressure releases when consumer drains destination Store

or1-emu.AC5: Direct Initialization API#

  • or1-emu.AC5.1 Success: System constructed from PEConfig with IRAM contents — PE has expected instructions at expected offsets
  • or1-emu.AC5.2 Success: System constructed from SMConfig with initial cell data — SM cells match config
  • or1-emu.AC5.3 Success: inject(token) delivers seed token to correct module's input Store

or1-emu.AC6: End-to-End Execution#

  • or1-emu.AC6.1 Success: CONST on PE0 emits token that arrives at PE1, triggers ADD, produces correct result
  • or1-emu.AC6.2 Success: PE writes to SM, different PE reads from SM, receives correct data
  • or1-emu.AC6.3 Success: DUAL mode fan-out delivers same result to two different consumers
  • or1-emu.AC6.4 Success: SWITCH mode routes data and trigger to correct destinations based on comparison result

Glossary#

  • ALU (Arithmetic Logic Unit): The execution unit within a Processing Element that performs arithmetic, logical, comparison, and routing operations on operands.
  • Backpressure: Flow control mechanism where a downstream component that cannot accept more data signals the upstream component to pause, implemented via blocking on full SimPy Stores.
  • CfgToken: Configuration token type used to mutate IRAM contents or routing tables at runtime (stretch goal in this design).
  • Context slot: An index dimension in the matching store corresponding to a dataflow context, used to isolate concurrent activations of the same instruction.
  • Dataflow architecture: A computation model where instruction execution is triggered by operand availability rather than a program counter, enabling natural parallelism.
  • Deferred read: I-structure pattern where a READ on an EMPTY cell blocks by storing the return route for later satisfaction when a WRITE arrives.
  • Dyadic operation: An instruction requiring two operands (left and right), contrasted with monadic operations requiring only one.
  • Generation counter: A 2-bit counter per context slot that increments on reallocation, used to detect and discard stale tokens from previous activations (ABA prevention).
  • I-structure: A synchronization primitive from dataflow computing where cells have blocking read-when-empty semantics, with writes satisfying pending reads.
  • IRAM (Instruction RAM): Instruction memory within a Processing Element, indexed by token offset to fetch the operation and destination routing.
  • Matching store: Hardware structure within a PE that holds the first operand of a dyadic instruction until its partner arrives, enabling dataflow firing when both are present.
  • Monadic operation: An instruction requiring only one operand, which bypasses the matching store and executes immediately.
  • Output formatter: PE stage that determines how many result tokens to emit and where to route them based on instruction mode (SUPPRESS/SINGLE/DUAL/SWITCH).
  • Presence state: Per-entry flag in the matching store (occupied/empty) or per-cell 2-bit state in Structure Memory (EMPTY/RESERVED/FULL/WAITING).
  • Property-based testing: Testing approach using hypothesis to generate randomized test cases validating invariants rather than testing specific examples.
  • SimPy: Python discrete event simulation framework providing processes, resources, and stores.
  • SimPy Store: SimPy's queue primitive with optional capacity bounds; get() and put() block when empty/full respectively.
  • Token: The fundamental unit of data flow in the OR1 architecture, carrying a value, destination routing, context information, and port assignment.
  • Type-based routing: Routing mechanism where tokens are directed to destinations based on their type field (types 00/01 for PE, type 10 for SM, type 11 for system) combined with a destination ID.

Architecture#

Behavioural emulator built on SimPy's discrete event simulation. Three module types — Processing Element (PE), Structure Memory (SM), and Network — communicate via bounded SimPy Stores representing hardware FIFOs. Tokens flow as whole Python objects (not flit-level serialization). Backpressure emerges naturally: yield store.put(token) blocks when a destination FIFO is full.

Route table + direct put topology. Each module holds a route_table: dict[int, simpy.Store] mapping destination IDs to other modules' input Stores. A top-level build_topology() function wires these up at initialization. No central router process — modules resolve destinations and put directly. This mirrors the hardware's type-based routing without modelling bus arbitration.

System object is the top-level handle returned by build_topology(). It provides inject(token) for seed token insertion, pes[id] / sms[id] for direct module access, and env for running the simulation. Tests and the initialization API interact through System.

Module Layout#

or1-design/
  token.py              # Token hierarchy (shared with assembler)
  cm_inst.py            # ALU ops, instruction types (shared with assembler)
  sm_mod.py             # SM data types: Presence, SMCell (shared with assembler)
  emu/
    __init__.py
    alu.py              # Pure function: execute(op, left, right, const) -> (result, bool_out)
    pe.py               # ProcessingElement: SimPy process, matching store, output formatter
    sm.py               # StructureMemory: SimPy process, I-structure cells, deferred read
    network.py          # build_topology(), System class
    types.py            # Emulator-specific types (MatchEntry, DeferredRead, PEConfig, SMConfig)
  tests/
    conftest.py         # Shared fixtures, hypothesis strategies
    test_alu.py         # ALU property-based tests
    test_pe.py          # PE matching + output formatter tests
    test_sm.py          # SM state machine + deferred read tests
    test_network.py     # Routing + backpressure tests
    test_integration.py # End-to-end multi-module programs

Root-level modules (token.py, cm_inst.py, sm_mod.py) contain shared data representations used by both the emulator and the future assembler. The emu/ package imports from them and adds simulation behaviour.

Processing Element#

Single SimPy process per PE. Five pipeline stages implemented as separable methods (can later become independent SimPy processes connected by inter-stage Stores).

State:

  • iram: dict[int, ALUInst] — instruction memory keyed by offset. Each ALUInst holds opcode, dest_l, dest_r, const. Mutable via CfgToken.
  • matching_store: list[list[MatchEntry]] — 2D indexed by [ctx_slot][offset]. Each MatchEntry has occupied: bool, data: Optional[int], port: Port.
  • gen_counters: list[int] — 2-bit generation counter per context slot. Incremented on slot reallocation.
  • input_store: simpy.Store — bounded input FIFO.
  • route_table: dict[int, simpy.Store] — destination PE/SM ID to output Store.

Process loop:

  1. Token Input: yield input_store.get() — blocks until token available.
  2. Cfg Check: If CfgToken, apply abstract mutation to IRAM or route_table. Continue.
  3. Match/Bypass: Monadic tokens bypass matching, returning (data, None). Dyadic tokens check generation counter (discard if stale), then check presence bit at [ctx][offset] — store partial operand and wait, or retrieve partner and fire.
  4. Instruction Fetch: Read iram[offset] to get opcode and destination fields.
  5. Execute: Call alu.execute(op, left, right, const)(result, bool_out).
  6. Token Output: Output formatter inspects mode (derived from opcode + has_dest2):
    • SUPPRESS: no tokens emitted. Used by FREE, GATE-when-false.
    • SINGLE: one token to dest_l. yield route_table[dest_l.pe].put(token).
    • DUAL: two tokens with same data to dest_l and dest_r.
    • SWITCH: data token to bool_out ? dest_l : dest_r, inline monadic trigger to the other.

Each yield store.put() in the output stage blocks on backpressure.

Structure Memory#

Single SimPy process. Implements I-structure semantics with a depth-1 deferred read register.

State:

  • cells: list[SMCell] — 512 cells, each with presence: Presence and data: Optional[int].
  • deferred_read: Optional[DeferredRead] — single register: (cell_addr, return_route). Only one pending deferred read at a time.
  • input_store: simpy.Store — bounded input FIFO.
  • route_table: dict[int, simpy.Store] — for sending result tokens back to PEs.

Operation dispatch:

  • READ on FULL: immediate result token via return route.
  • READ on EMPTY/RESERVED: if deferred register empty, stash return route, set cell to WAITING. If deferred register occupied (depth-1 constraint), SM process yields on an internal simpy.Event until the existing deferred read is satisfied, then retries.
  • WRITE on WAITING: satisfy deferred read — emit result token to stashed return route, set cell to FULL.
  • WRITE on EMPTY/RESERVED: set to FULL, store data.
  • CLEAR: set to EMPTY, cancel deferred read if targeting this cell.
  • READ_INC / READ_DEC / CAS: atomic operations restricted to lower 256 cells, immediate-result pattern.

ALU#

Pure function in emu/alu.py. No state, no SimPy involvement.

Contract: execute(op: ALUOp, left: int, right: Optional[int], const: Optional[int]) -> tuple[int, bool]

Returns (result & 0xFFFF, bool_out). All values stored as unsigned 16-bit. Comparisons and overflow detection interpret operands as signed 2's complement via to_signed() helper.

Opcode groups:

  • Arithmetic (ADD, SUB, INC, DEC): 16-bit wrapping. INC/DEC monadic.
  • Shifts (SHIFTL, SHIFTR, ASHIFTR): monadic, amount from const. ASHIFTR sign-extends.
  • Logic (AND, OR, XOR, NOT): bitwise. NOT monadic.
  • Comparison (EQ, LT, LTE, GT, GTE): signed 2's complement. Result is 0x0001/0x0000, sets bool_out.
  • Routing (BR*, SW*, GATE, SEL, MERGE): compute boolean condition, data passes through. Output formatter uses bool_out for routing decisions. GATE and SEL are dyadic: left=data, right=boolean. bool_out comes from right operand (port R bit 0 in hardware).
  • Data (PASS, CONST, FREE): PASS = identity, CONST = return const, FREE = no output (SUPPRESS).

Network and Topology#

build_topology(env, pe_configs, sm_configs, fifo_capacity=8) -> System

Creates bounded SimPy Stores (one per module input), instantiates PE and SM objects, populates route_tables bidirectionally (PE→PE, PE→SM, SM→PE). Returns System object.

PEConfig: PE ID, IRAM contents, matching store dimensions (ctx_slots, offsets), initial generation counters.

SMConfig: SM ID, cell count, initial cell contents.

Existing Patterns#

Existing code in token.py, cm_inst.py, and sm_mod.py establishes:

  • @dataclass(frozen=True) for immutable value types (tokens, instructions, addresses)
  • Mutable @dataclass for stateful containers (SMCell)
  • IntEnum for opcodes and state enums (ALUOp, Presence, MemOp, Port)
  • Type annotations throughout
  • SimPy Resource as base class for SM (sm_mod.py)

The emulator follows these patterns. New emulator-specific types (MatchEntry, DeferredRead, PEConfig, SMConfig) use the same conventions. The PE class does not extend a SimPy resource — it owns a SimPy Store and runs as a process, which is the canonical SimPy pattern for active components (per Graphcore's "SimPy for Chips" approach).

Implementation Phases#

Phase 1: Emulator Scaffold and ALU#

Goal: Project structure, emulator-specific types, and a fully tested ALU.

Components:

  • emu/__init__.py — package init
  • emu/types.py — MatchEntry, DeferredRead, PEConfig, SMConfig dataclasses
  • emu/alu.pyexecute() pure function with full v0 opcode dispatch
  • tests/conftest.py — hypothesis strategies: uint16(), alu_ops(), int16()
  • tests/test_alu.py — property-based tests for all opcode groups

Dependencies: None (first phase)

Done when: ALU passes property-based tests for all v0 opcodes including signed comparison edge cases (0x7FFF vs 0x8000, overflow wrapping). or1-emu.AC2 criteria covered.

Phase 2: Processing Element Core#

Goal: PE with matching store, instruction fetch, and output formatter — testable in isolation.

Components:

  • emu/pe.py — ProcessingElement class: matching store state, generation counters, process loop, stage methods (match, fetch, execute, emit)
  • tests/test_pe.py — matching store invariants (hypothesis), output formatter mode tests (SimPy)
  • tests/conftest.py — additional strategies: token_sequences(), PE fixtures with minimal IRAM

Dependencies: Phase 1 (ALU, types)

Done when: PE correctly processes monadic tokens (bypass matching), dyadic tokens (store partial, fire on complete), discards stale tokens (generation mismatch), and emits correct token count per output mode. or1-emu.AC1 criteria covered.

Phase 3: Structure Memory#

Goal: SM with I-structure semantics, depth-1 deferred read, and all v0 operations.

Components:

  • emu/sm.py — StructureMemory class: cell state machine, deferred read register, process loop, operation dispatch
  • tests/test_sm.py — presence state machine (hypothesis), deferred read stall (SimPy), atomic ops
  • tests/conftest.py — additional strategies: sm_op_sequences()

Dependencies: Phase 1 (types)

Done when: SM handles all state transitions (EMPTY→FULL, EMPTY→WAITING→FULL with deferred satisfaction), enforces depth-1 deferred read constraint, and atomic ops work on lower 256 cells. or1-emu.AC3 criteria covered.

Phase 4: Network and Topology#

Goal: Route table wiring, System object, token routing, and backpressure.

Components:

  • emu/network.pybuild_topology(), System class with inject(), module accessors
  • tests/test_network.py — routing correctness (token reaches right destination by type+ID), backpressure (PE stalls when output FIFO full)

Dependencies: Phase 2 (PE), Phase 3 (SM)

Done when: Tokens route correctly by type and destination ID. Backpressure propagates: filling a destination Store causes the sending PE to block on put(). or1-emu.AC4 criteria covered.

Phase 5: Direct Initialization API#

Goal: Python-native setup of IRAM, SM contents, and routes without CfgTokens.

Components:

  • Initialization helpers in emu/network.py or emu/__init__.py — accept PEConfig/SMConfig dicts, populate IRAM and SM cells at construction time
  • tests/test_integration.py — initialization smoke tests: create system, verify IRAM and SM state match config

Dependencies: Phase 4 (network/System)

Done when: A System can be constructed from Python config objects with pre-populated IRAM and SM state, ready to accept seed tokens. or1-emu.AC5 criteria covered.

Phase 6: End-to-End Integration#

Goal: Run a small hand-constructed program across 1-2 PEs and verify results.

Components:

  • tests/test_integration.py — multi-PE programs: PE0 CONST feeds PE1 ADD, SM round-trip (PE writes, PE reads), fan-out via DUAL mode, conditional routing via SWITCH
  • Stretch: CfgToken abstract mutation test (load IRAM via token injection)

Dependencies: Phase 5 (initialization API)

Done when: A hand-constructed program executes end-to-end across 2 PEs with correct token flow, matching, execution, and output routing. SM read/write round-trip produces correct data. or1-emu.AC6 criteria covered.

Additional Considerations#

Split-ready PE stages. The PE process loop calls stage methods sequentially. To split into independent SimPy processes later, replace method calls with inter-stage SimPy Stores. The method signatures are designed for this: each takes explicit inputs and returns explicit outputs, no implicit shared state between stages.

CfgToken stretch goal. Abstract mutation (direct IRAM/route_table write from a CfgToken) can be added in Phase 6 without architectural changes — the PE process loop already has a cfg check branch. Full hardware-faithful cfg processing (PE locking, multi-cycle writes) is out of scope.

Shared mutable list footgun. sm_mod.py line 29 uses [SMCell(...)] * size which creates a list of references to the same object. The emulator's SM will need to create independent cell instances.