OR1 Dataflow CPU Behavioural Emulator Design#
Summary#
This design describes a behavioural emulator for the OR1 dataflow CPU architecture, built using Python's SimPy discrete event simulation framework. The emulator models three core hardware components: Processing Elements (PEs) that execute dataflow instructions using a matching store for operand pairing and generation counters for dataflow synchronization; Structure Memory (SM) modules that implement I-structure semantics with blocking reads and a depth-1 deferred read register; and a token-routing network that connects these components via bounded FIFOs with backpressure.
The implementation prioritizes behavioural correctness over cycle-accurate timing. Tokens flow as complete Python objects between modules connected through SimPy Stores that represent hardware FIFOs, with backpressure emerging naturally when stores reach capacity. The system provides a direct Python initialization API for configuring IRAM contents, SM state, and routing tables without requiring configuration token processing. A property-based test suite using pytest and hypothesis validates matching store invariants, I-structure state transitions, ALU correctness across all v0 operations, output formatter modes, and end-to-end token flow through multi-PE programs.
Definition of Done#
-
SimPy-based emulator with behaviourally correct implementations of: Processing Elements (matching store with generation counters and presence bits, ALU execution for all v0 operations, output formatter with SUPPRESS/SINGLE/DUAL/SWITCH modes), Structure Memory (I-structure semantics with depth-1 deferred read register, 2-bit presence state machine for all transitions, all v0 memory operations), and a token-routing network (type-based routing, backpressure via bounded SimPy Stores).
-
Direct initialization API — Python-native setup of IRAM contents, SM initial state, and routing configuration. CfgToken processing as abstract IRAM/route mutation is a stretch goal.
-
Property-based test suite (pytest + hypothesis) validating: matching store invariants (generation counter ABA prevention, presence bit state machine correctness), SM state transitions and I-structure semantics (deferred read depth-1 constraint, blocking read/write ordering), ALU operation correctness across the full v0 opcode set, output formatter mode semantics (token count and routing per mode), and end-to-end token flow through a minimal multi-PE path.
-
End-to-end demonstration: the emulator can run a small hand-constructed program (a few instructions across 1-2 PEs) end-to-end, demonstrating correct token flow, matching, execution, and output routing.
Acceptance Criteria#
or1-emu.AC1: Processing Element Behaviour#
- or1-emu.AC1.1 Success: Monadic token bypasses matching store and executes immediately
- or1-emu.AC1.2 Success: First dyadic token for an offset/ctx stores in matching store, does not fire
- or1-emu.AC1.3 Success: Second dyadic token for same offset/ctx retrieves partner, fires instruction
- or1-emu.AC1.4 Success: Generation counter mismatch causes stale token discard (no match, no fire)
- or1-emu.AC1.5 Success: Output formatter SINGLE mode emits exactly one token to dest_l
- or1-emu.AC1.6 Success: Output formatter DUAL mode emits two tokens with same data to dest_l and dest_r
- or1-emu.AC1.7 Success: Output formatter SWITCH mode routes data to taken side, inline trigger to not-taken side
- or1-emu.AC1.8 Success: Output formatter SUPPRESS mode emits zero tokens
- or1-emu.AC1.9 Failure: Token targeting non-existent IRAM offset is handled without crash
or1-emu.AC2: ALU Correctness#
- or1-emu.AC2.1 Success: Arithmetic ops (ADD, SUB) produce correct 16-bit wrapping results for all uint16 pairs
- or1-emu.AC2.2 Success: INC/DEC monadic ops correctly increment/decrement
- or1-emu.AC2.3 Success: Shift ops use const as shift amount; ASHIFTR sign-extends from bit 15
- or1-emu.AC2.4 Success: Logic ops (AND, OR, XOR, NOT) produce correct bitwise results
- or1-emu.AC2.5 Success: Comparison ops interpret operands as signed 2's complement (0xFFFF < 0x0001)
- or1-emu.AC2.6 Success: Comparison ops produce 0x0001/0x0000 result and correct bool_out
- or1-emu.AC2.7 Success: Routing ops (BR*, SW*, GATE) compute boolean and pass data through unchanged
- or1-emu.AC2.8 Success: PASS returns left operand, CONST returns const field
- or1-emu.AC2.9 Edge: Signed boundary: GT(0x7FFF, 0x8000) is true (32767 > -32768)
or1-emu.AC3: Structure Memory Behaviour#
- or1-emu.AC3.1 Success: READ on FULL cell returns data immediately via result token
- or1-emu.AC3.2 Success: READ on EMPTY cell with empty deferred register stashes return route, sets WAITING
- or1-emu.AC3.3 Success: WRITE on WAITING cell satisfies deferred read — emits result token to stashed return route
- or1-emu.AC3.4 Success: WRITE on EMPTY/RESERVED sets cell to FULL
- or1-emu.AC3.5 Success: CLEAR sets cell to EMPTY, cancels deferred read if targeting that cell
- or1-emu.AC3.6 Success: READ_INC/READ_DEC atomically modify and return value (lower 256 cells only)
- or1-emu.AC3.7 Failure: Depth-1 constraint: second blocking READ on different empty cell stalls until first deferred read is satisfied
- or1-emu.AC3.8 Edge: WRITE on FULL cell overwrites data (diagnostic flag set if modelled)
or1-emu.AC4: Network and Routing#
- or1-emu.AC4.1 Success: Token with dest PE_id N arrives at PE N's input Store
- or1-emu.AC4.2 Success: SM token (type 10) routes to correct SM by SM_id
- or1-emu.AC4.3 Success: Backpressure: PE blocks on put() when destination Store is at capacity
- or1-emu.AC4.4 Success: Backpressure releases when consumer drains destination Store
or1-emu.AC5: Direct Initialization API#
- or1-emu.AC5.1 Success: System constructed from PEConfig with IRAM contents — PE has expected instructions at expected offsets
- or1-emu.AC5.2 Success: System constructed from SMConfig with initial cell data — SM cells match config
- or1-emu.AC5.3 Success: inject(token) delivers seed token to correct module's input Store
or1-emu.AC6: End-to-End Execution#
- or1-emu.AC6.1 Success: CONST on PE0 emits token that arrives at PE1, triggers ADD, produces correct result
- or1-emu.AC6.2 Success: PE writes to SM, different PE reads from SM, receives correct data
- or1-emu.AC6.3 Success: DUAL mode fan-out delivers same result to two different consumers
- or1-emu.AC6.4 Success: SWITCH mode routes data and trigger to correct destinations based on comparison result
Glossary#
- ALU (Arithmetic Logic Unit): The execution unit within a Processing Element that performs arithmetic, logical, comparison, and routing operations on operands.
- Backpressure: Flow control mechanism where a downstream component that cannot accept more data signals the upstream component to pause, implemented via blocking on full SimPy Stores.
- CfgToken: Configuration token type used to mutate IRAM contents or routing tables at runtime (stretch goal in this design).
- Context slot: An index dimension in the matching store corresponding to a dataflow context, used to isolate concurrent activations of the same instruction.
- Dataflow architecture: A computation model where instruction execution is triggered by operand availability rather than a program counter, enabling natural parallelism.
- Deferred read: I-structure pattern where a READ on an EMPTY cell blocks by storing the return route for later satisfaction when a WRITE arrives.
- Dyadic operation: An instruction requiring two operands (left and right), contrasted with monadic operations requiring only one.
- Generation counter: A 2-bit counter per context slot that increments on reallocation, used to detect and discard stale tokens from previous activations (ABA prevention).
- I-structure: A synchronization primitive from dataflow computing where cells have blocking read-when-empty semantics, with writes satisfying pending reads.
- IRAM (Instruction RAM): Instruction memory within a Processing Element, indexed by token offset to fetch the operation and destination routing.
- Matching store: Hardware structure within a PE that holds the first operand of a dyadic instruction until its partner arrives, enabling dataflow firing when both are present.
- Monadic operation: An instruction requiring only one operand, which bypasses the matching store and executes immediately.
- Output formatter: PE stage that determines how many result tokens to emit and where to route them based on instruction mode (SUPPRESS/SINGLE/DUAL/SWITCH).
- Presence state: Per-entry flag in the matching store (occupied/empty) or per-cell 2-bit state in Structure Memory (EMPTY/RESERVED/FULL/WAITING).
- Property-based testing: Testing approach using hypothesis to generate randomized test cases validating invariants rather than testing specific examples.
- SimPy: Python discrete event simulation framework providing processes, resources, and stores.
- SimPy Store: SimPy's queue primitive with optional capacity bounds;
get()andput()block when empty/full respectively. - Token: The fundamental unit of data flow in the OR1 architecture, carrying a value, destination routing, context information, and port assignment.
- Type-based routing: Routing mechanism where tokens are directed to destinations based on their type field (types 00/01 for PE, type 10 for SM, type 11 for system) combined with a destination ID.
Architecture#
Behavioural emulator built on SimPy's discrete event simulation. Three module types — Processing Element (PE), Structure Memory (SM), and Network — communicate via bounded SimPy Stores representing hardware FIFOs. Tokens flow as whole Python objects (not flit-level serialization). Backpressure emerges naturally: yield store.put(token) blocks when a destination FIFO is full.
Route table + direct put topology. Each module holds a route_table: dict[int, simpy.Store] mapping destination IDs to other modules' input Stores. A top-level build_topology() function wires these up at initialization. No central router process — modules resolve destinations and put directly. This mirrors the hardware's type-based routing without modelling bus arbitration.
System object is the top-level handle returned by build_topology(). It provides inject(token) for seed token insertion, pes[id] / sms[id] for direct module access, and env for running the simulation. Tests and the initialization API interact through System.
Module Layout#
or1-design/
token.py # Token hierarchy (shared with assembler)
cm_inst.py # ALU ops, instruction types (shared with assembler)
sm_mod.py # SM data types: Presence, SMCell (shared with assembler)
emu/
__init__.py
alu.py # Pure function: execute(op, left, right, const) -> (result, bool_out)
pe.py # ProcessingElement: SimPy process, matching store, output formatter
sm.py # StructureMemory: SimPy process, I-structure cells, deferred read
network.py # build_topology(), System class
types.py # Emulator-specific types (MatchEntry, DeferredRead, PEConfig, SMConfig)
tests/
conftest.py # Shared fixtures, hypothesis strategies
test_alu.py # ALU property-based tests
test_pe.py # PE matching + output formatter tests
test_sm.py # SM state machine + deferred read tests
test_network.py # Routing + backpressure tests
test_integration.py # End-to-end multi-module programs
Root-level modules (token.py, cm_inst.py, sm_mod.py) contain shared data representations used by both the emulator and the future assembler. The emu/ package imports from them and adds simulation behaviour.
Processing Element#
Single SimPy process per PE. Five pipeline stages implemented as separable methods (can later become independent SimPy processes connected by inter-stage Stores).
State:
iram: dict[int, ALUInst]— instruction memory keyed by offset. Each ALUInst holds opcode, dest_l, dest_r, const. Mutable via CfgToken.matching_store: list[list[MatchEntry]]— 2D indexed by[ctx_slot][offset]. Each MatchEntry hasoccupied: bool,data: Optional[int],port: Port.gen_counters: list[int]— 2-bit generation counter per context slot. Incremented on slot reallocation.input_store: simpy.Store— bounded input FIFO.route_table: dict[int, simpy.Store]— destination PE/SM ID to output Store.
Process loop:
- Token Input:
yield input_store.get()— blocks until token available. - Cfg Check: If CfgToken, apply abstract mutation to IRAM or route_table. Continue.
- Match/Bypass: Monadic tokens bypass matching, returning
(data, None). Dyadic tokens check generation counter (discard if stale), then check presence bit at[ctx][offset]— store partial operand and wait, or retrieve partner and fire. - Instruction Fetch: Read
iram[offset]to get opcode and destination fields. - Execute: Call
alu.execute(op, left, right, const)→(result, bool_out). - Token Output: Output formatter inspects mode (derived from opcode + has_dest2):
- SUPPRESS: no tokens emitted. Used by FREE, GATE-when-false.
- SINGLE: one token to dest_l.
yield route_table[dest_l.pe].put(token). - DUAL: two tokens with same data to dest_l and dest_r.
- SWITCH: data token to
bool_out ? dest_l : dest_r, inline monadic trigger to the other.
Each yield store.put() in the output stage blocks on backpressure.
Structure Memory#
Single SimPy process. Implements I-structure semantics with a depth-1 deferred read register.
State:
cells: list[SMCell]— 512 cells, each withpresence: Presenceanddata: Optional[int].deferred_read: Optional[DeferredRead]— single register:(cell_addr, return_route). Only one pending deferred read at a time.input_store: simpy.Store— bounded input FIFO.route_table: dict[int, simpy.Store]— for sending result tokens back to PEs.
Operation dispatch:
- READ on FULL: immediate result token via return route.
- READ on EMPTY/RESERVED: if deferred register empty, stash return route, set cell to WAITING. If deferred register occupied (depth-1 constraint), SM process yields on an internal
simpy.Eventuntil the existing deferred read is satisfied, then retries. - WRITE on WAITING: satisfy deferred read — emit result token to stashed return route, set cell to FULL.
- WRITE on EMPTY/RESERVED: set to FULL, store data.
- CLEAR: set to EMPTY, cancel deferred read if targeting this cell.
- READ_INC / READ_DEC / CAS: atomic operations restricted to lower 256 cells, immediate-result pattern.
ALU#
Pure function in emu/alu.py. No state, no SimPy involvement.
Contract: execute(op: ALUOp, left: int, right: Optional[int], const: Optional[int]) -> tuple[int, bool]
Returns (result & 0xFFFF, bool_out). All values stored as unsigned 16-bit. Comparisons and overflow detection interpret operands as signed 2's complement via to_signed() helper.
Opcode groups:
- Arithmetic (ADD, SUB, INC, DEC): 16-bit wrapping. INC/DEC monadic.
- Shifts (SHIFTL, SHIFTR, ASHIFTR): monadic, amount from const. ASHIFTR sign-extends.
- Logic (AND, OR, XOR, NOT): bitwise. NOT monadic.
- Comparison (EQ, LT, LTE, GT, GTE): signed 2's complement. Result is 0x0001/0x0000, sets bool_out.
- Routing (BR*, SW*, GATE, SEL, MERGE): compute boolean condition, data passes through. Output formatter uses bool_out for routing decisions. GATE and SEL are dyadic: left=data, right=boolean. bool_out comes from right operand (port R bit 0 in hardware).
- Data (PASS, CONST, FREE): PASS = identity, CONST = return const, FREE = no output (SUPPRESS).
Network and Topology#
build_topology(env, pe_configs, sm_configs, fifo_capacity=8) -> System
Creates bounded SimPy Stores (one per module input), instantiates PE and SM objects, populates route_tables bidirectionally (PE→PE, PE→SM, SM→PE). Returns System object.
PEConfig: PE ID, IRAM contents, matching store dimensions (ctx_slots, offsets), initial generation counters.
SMConfig: SM ID, cell count, initial cell contents.
Existing Patterns#
Existing code in token.py, cm_inst.py, and sm_mod.py establishes:
@dataclass(frozen=True)for immutable value types (tokens, instructions, addresses)- Mutable
@dataclassfor stateful containers (SMCell) IntEnumfor opcodes and state enums (ALUOp, Presence, MemOp, Port)- Type annotations throughout
- SimPy Resource as base class for SM (
sm_mod.py)
The emulator follows these patterns. New emulator-specific types (MatchEntry, DeferredRead, PEConfig, SMConfig) use the same conventions. The PE class does not extend a SimPy resource — it owns a SimPy Store and runs as a process, which is the canonical SimPy pattern for active components (per Graphcore's "SimPy for Chips" approach).
Implementation Phases#
Phase 1: Emulator Scaffold and ALU#
Goal: Project structure, emulator-specific types, and a fully tested ALU.
Components:
emu/__init__.py— package initemu/types.py— MatchEntry, DeferredRead, PEConfig, SMConfig dataclassesemu/alu.py—execute()pure function with full v0 opcode dispatchtests/conftest.py— hypothesis strategies:uint16(),alu_ops(),int16()tests/test_alu.py— property-based tests for all opcode groups
Dependencies: None (first phase)
Done when: ALU passes property-based tests for all v0 opcodes including signed comparison edge cases (0x7FFF vs 0x8000, overflow wrapping). or1-emu.AC2 criteria covered.
Phase 2: Processing Element Core#
Goal: PE with matching store, instruction fetch, and output formatter — testable in isolation.
Components:
emu/pe.py— ProcessingElement class: matching store state, generation counters, process loop, stage methods (match, fetch, execute, emit)tests/test_pe.py— matching store invariants (hypothesis), output formatter mode tests (SimPy)tests/conftest.py— additional strategies:token_sequences(), PE fixtures with minimal IRAM
Dependencies: Phase 1 (ALU, types)
Done when: PE correctly processes monadic tokens (bypass matching), dyadic tokens (store partial, fire on complete), discards stale tokens (generation mismatch), and emits correct token count per output mode. or1-emu.AC1 criteria covered.
Phase 3: Structure Memory#
Goal: SM with I-structure semantics, depth-1 deferred read, and all v0 operations.
Components:
emu/sm.py— StructureMemory class: cell state machine, deferred read register, process loop, operation dispatchtests/test_sm.py— presence state machine (hypothesis), deferred read stall (SimPy), atomic opstests/conftest.py— additional strategies:sm_op_sequences()
Dependencies: Phase 1 (types)
Done when: SM handles all state transitions (EMPTY→FULL, EMPTY→WAITING→FULL with deferred satisfaction), enforces depth-1 deferred read constraint, and atomic ops work on lower 256 cells. or1-emu.AC3 criteria covered.
Phase 4: Network and Topology#
Goal: Route table wiring, System object, token routing, and backpressure.
Components:
emu/network.py—build_topology(), System class withinject(), module accessorstests/test_network.py— routing correctness (token reaches right destination by type+ID), backpressure (PE stalls when output FIFO full)
Dependencies: Phase 2 (PE), Phase 3 (SM)
Done when: Tokens route correctly by type and destination ID. Backpressure propagates: filling a destination Store causes the sending PE to block on put(). or1-emu.AC4 criteria covered.
Phase 5: Direct Initialization API#
Goal: Python-native setup of IRAM, SM contents, and routes without CfgTokens.
Components:
- Initialization helpers in
emu/network.pyoremu/__init__.py— accept PEConfig/SMConfig dicts, populate IRAM and SM cells at construction time tests/test_integration.py— initialization smoke tests: create system, verify IRAM and SM state match config
Dependencies: Phase 4 (network/System)
Done when: A System can be constructed from Python config objects with pre-populated IRAM and SM state, ready to accept seed tokens. or1-emu.AC5 criteria covered.
Phase 6: End-to-End Integration#
Goal: Run a small hand-constructed program across 1-2 PEs and verify results.
Components:
tests/test_integration.py— multi-PE programs: PE0 CONST feeds PE1 ADD, SM round-trip (PE writes, PE reads), fan-out via DUAL mode, conditional routing via SWITCH- Stretch: CfgToken abstract mutation test (load IRAM via token injection)
Dependencies: Phase 5 (initialization API)
Done when: A hand-constructed program executes end-to-end across 2 PEs with correct token flow, matching, execution, and output routing. SM read/write round-trip produces correct data. or1-emu.AC6 criteria covered.
Additional Considerations#
Split-ready PE stages. The PE process loop calls stage methods sequentially. To split into independent SimPy processes later, replace method calls with inter-stage SimPy Stores. The method signatures are designed for this: each takes explicit inputs and returns explicit outputs, no implicit shared state between stages.
CfgToken stretch goal. Abstract mutation (direct IRAM/route_table write from a CfgToken) can be added in Phase 6 without architectural changes — the PE process loop already has a cfg check branch. Full hardware-faithful cfg processing (PE locking, multi-cycle writes) is out of scope.
Shared mutable list footgun. sm_mod.py line 29 uses [SMCell(...)] * size which creates a list of references to the same object. The emulator's SM will need to create independent cell instances.