OR-1 dataflow CPU sketch

Dynamic Dataflow CPU — PE (Processing Element) Design#

Covers the CM (Control Module) pipeline, frame-based matching, instruction memory and encoding, activation lifecycle, per-PE identity, 670 subsystem design, pipeline stall analysis, SM operation dispatch, SC blocks, and execution modes.

See architecture-overview.md for token format, flit-1 bit allocation, and module taxonomy. See network-and-communication.md for how tokens enter/leave the PE. See alu-and-output-design.md for ALU operations, output formatting, and SM flit assembly details. See bus-interconnect-design.md for physical bus implementation. See sm-design.md for SM internals and I-structure semantics. See io-and-bootstrap.md for bootstrap loading and I/O subsystem design.

1. Design Philosophy: Static Assignment, Compiler-Driven Sizing#

This design diverges significantly from both Manchester and Amamiya in how PEs are used. Understanding the difference is critical to understanding why the matching store can be so much smaller here.

Amamiya DFM (1982/17407 papers): every PE has ALL function bodies pre-loaded in instruction memory (8KW, 58 bits/word per PE, identical contents across all PEs). Function instances are dynamically assigned to PEs at runtime by a CCU (Cluster Control Unit) that picks the least-loaded PE. The OM (operand matching memory) needs 1024 CAM blocks per PE because any function can run anywhere, and deep Lisp recursion means many simultaneous activations. The "semi-CAM" was their solution to making this affordable -- instance name directly addresses a block, then 4-way set-associative lookup within the block on instruction identifier.

Manchester (Gurd 1985): similar story but with hashing instead of semi-CAM. 16 parallel 64K-token memory banks per PE for set-associative hash lookup. 1M token capacity matching store. Plus an overflow unit (initially emulated on the host). The matching unit alone was 16 memory boards per PE.

Both machines sized their matching stores for worst-case dynamic scheduling of arbitrary programs. The whole program lives in every PE (or in a single PE's matching unit), and any activation can land anywhere. That's why those matching stores are enormous.

This design: the compiler statically assigns function bodies (or chunks of them) to specific PEs. Different PEs have different instruction memory contents. The compiler knows at compile time which functions run where, and can calculate maximum concurrent activations per PE. This means:

  • Instruction memory is NOT replicated -- each PE only holds its assigned function bodies. IM can be much smaller.
  • The matching store only needs enough frames for the maximum concurrent activations the compiler predicts for that specific PE. Not 1024. Probably 4.
  • No CCU needed for dynamic PE allocation. Scheduling decisions are made at compile time.
  • The tradeoff is scheduling flexibility -- you can't dynamically rebalance load at runtime. The compiler must get it roughly right.

Function Splitting Across PEs#

A "function" in the source language does NOT need to map 1:1 to a contiguous block on one PE. The compiler can split a function body at any data-dependency boundary. The token network doesn't know or care whether two instructions are "in the same function" -- it just sees tokens with destinations.

A 40-instruction function body could be split into three chunks of ~13 instructions across three PEs, each chunk fitting in a smaller frame. The "function" as the architecture sees it is really "a set of instructions that share a frame on this PE." The compiler defines what that grouping means.

This is a powerful lever for keeping frames small: if a function body is too big for the frame size, the compiler splits it. The split introduces inter-PE token traffic (extra network hops), but keeps per-PE hardware simple. The compiler can optimise the split points to minimise cross-PE traffic.

Implication for frame semantics: a frame doesn't mean "one function activation." It means "one chunk of work sharing a local operand namespace on this PE." Multiple frames on different PEs might collectively represent one function activation. The token's activation_id scopes operand matching to a local frame, nothing more.

Implication for the compiler: this architecture actively wants either small functions or functions distributed across PEs. The compiler is free to treat any subgraph of the dataflow graph as a "chunk" and assign it to a PE, regardless of source-level function boundaries. Loop bodies, branch arms, pipeline stages: all valid chunk boundaries. The grain of scheduling is the subgraph, not the function.
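The chunking idea above can be sketched in a few lines. This is an illustrative toy, not the real compiler: the graph representation (nodes in topological order, a dependency dict) and the greedy placement strategy are assumptions, and a real splitter would optimise cross-PE traffic rather than just packing greedily.

```python
# Hypothetical sketch: partition a dataflow graph into per-PE chunks.
# Nodes are assumed to arrive in topological order; deps maps each
# node to the nodes it consumes from.

def split_into_chunks(nodes, deps, max_chunk):
    """Greedily pack nodes into chunks of at most max_chunk instructions,
    co-locating a node with one of its dependencies when there is room."""
    chunk_of, chunks = {}, []
    for node in nodes:
        placed = False
        for d in deps.get(node, ()):
            c = chunk_of.get(d)
            if c is not None and len(chunks[c]) < max_chunk:
                chunks[c].append(node)
                chunk_of[node] = c
                placed = True
                break
        if not placed:
            chunks.append([node])
            chunk_of[node] = len(chunks) - 1
    return chunks, chunk_of

def cross_chunk_edges(deps, chunk_of):
    """Count edges crossing chunk boundaries -- each one becomes an
    inter-PE token at runtime, the cost the compiler should minimise."""
    return sum(1 for n, ds in deps.items()
               for d in ds if chunk_of[d] != chunk_of[n])
```

Each resulting chunk is "a set of instructions that share a frame on this PE"; the cross-chunk edge count is the extra network traffic the split introduces.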

2. PE Identity#

Each PE has a unique ID used for routing. Two mechanisms, not mutually exclusive:

EEPROM-based: the instruction decoder EEPROM already contains per-PE truth tables. The PE ID can be encoded as additional input bits to the EEPROM, meaning the EEPROM contents are unique per PE but the circuit board is identical. The instruction decoder "knows" which PE it is because its EEPROM was burned with that ID.

DIP switches: 3-4 switches give 8-16 PE addresses. Better for early prototyping -- reconfigurable without reflashing. Can coexist with the EEPROM approach (switches provide ID bits that feed into the EEPROM address lines).

The PE ID is needed in two places:

  1. Input token filtering: "is this token addressed to me?"
  2. Output token formatting: "set the source PE field" (if result tokens carry source info for return routing)
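The input-side use of the PE ID is a single field compare. A minimal sketch, with the caveat that the PE field position within flit 1 is an assumption for illustration (see architecture-overview.md for the real bit allocation):

```python
# Assumed flit-1 layout for illustration only: 2-bit PE field at bits [10:9].
PE_FIELD_SHIFT = 9
PE_FIELD_MASK = 0b11

def addressed_to_me(flit1, my_pe_id):
    """'Is this token addressed to me?' -- my_pe_id comes from DIP
    switches or the EEPROM-burned identity."""
    return (flit1 >> PE_FIELD_SHIFT) & PE_FIELD_MASK == my_pe_id
```

In hardware this is a 2-bit comparator gating the input deserialiser.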

3. PE Pipeline (5-stage)#

Bus Interface: Serializer / Deserializer#

The PE connects to the 16-bit external bus via ser/deser logic at the input and output boundaries. This handles the width conversion between 16-bit flits on the bus and the wider internal token representation:

  • Input deserializer: receives 2+ flits from the bus, reassembles into a full token (routing fields from flit 1 + data from flit 2). Shift register + flit counter. Outputs a reassembled token to the input FIFO.
  • Output serializer: takes a formed result token, splits it into flits (routing fields into flit 1, data into flit 2), and clocks them onto the bus. Shift register + toggle.
  • Hardware cost: ~5-8 TTL chips per direction (shift registers, counters, muxes).
  • Naturally integrates with the clock domain crossing FIFOs when running Mode B (2x bus clock). Under Mode A (shared clock), the ser/deser simply takes 2 clock cycles per token transfer.
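Behaviourally, the input deserialiser is just a flit counter plus a holding register. A minimal sketch assuming the common two-flit token (tokens with more flits would extend the counter):

```python
class Deserialiser:
    """Reassembles two 16-bit bus flits into one token.
    Models the shift register + flit counter described above."""

    def __init__(self):
        self.flit1 = None  # routing fields, held until flit 2 arrives

    def push(self, flit):
        """Clock one flit in. Returns a reassembled (flit1, flit2)
        token when complete, else None."""
        if self.flit1 is None:
            self.flit1 = flit & 0xFFFF
            return None
        token = (self.flit1, flit & 0xFFFF)
        self.flit1 = None  # ready for the next token
        return token
```

The output serialiser is the mirror image: latch the token, emit flit 1, then flit 2, toggling a one-bit counter.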

Pipeline Stages#

The pipeline runs IFETCH before MATCH. The instruction word, decoded at the end of stage 2, drives all subsequent pipeline behaviour: whether to check the matching store, whether to read a constant, how many destinations to read, whether to write back to the frame. The token's activation_id drives associative lookup in parallel with the IRAM read, hiding resolution latency.

Why IFETCH before MATCH. The instruction word determines how matching works: whether the instruction is dyadic or monadic, which frame slots to read for operands and constants, whether to write back to the frame (sink modes), and how many destinations to read at output. Fetching the instruction first gives the pipeline controller all the information it needs to sequence stage 3's SRAM accesses efficiently.

The token's dyadic/monadic prefix enables parallel work: when the prefix indicates "dyadic," stage 2 starts act_id -> frame_id resolution via the 670s simultaneously with the IRAM read. By the time stage 3 begins, both the instruction word and the frame_id / presence / port metadata are available, and the only remaining SRAM work is reading or writing actual operand data and constants.

Stage 1: INPUT
  - Receive reassembled token from input deserialiser
  - Classify by prefix: dyadic wide (00), monadic normal (010),
    misc bucket (011). Within misc bucket: frame control (sub=00),
    PE-local write (sub=01), monadic inline (sub=10)
  - Compute/data tokens -> pipeline FIFO
  - Frame control tokens -> tag store write/clear (side path)
  - PE-local write tokens -> SRAM write queue (side path, executes
    when frame SRAM is not busy with compute pipeline accesses)
  - Buffer in small FIFO (8-deep, storing reassembled tokens)
  - ~1K transistors (flip-flops) or use small SRAM

Stage 2: IFETCH
  Two parallel operations within a single cycle:

  (a) IRAM SRAM read at [bank_reg : token.offset]. Produces the
      16-bit instruction word: type, opcode, mode, wide, fref.
      Single read cycle (16-bit instruction, one chip pair).

  (b) Activation_id resolution. For Approach C (74LS670 lookup),
      this is combinational (~35 ns): present act_id on the 670
      address lines, get {valid, frame_id} back. Presence and port
      metadata also resolve combinationally in this stage (670 read
      at frame_id, ~70 ns total from act_id presentation). At 5 MHz
      (200 ns cycle), all metadata is available before the IRAM read
      completes.

  The dyadic/monadic prefix from flit 1 determines whether
  activation_id resolution starts in this stage (dyadic) or is
  deferred until the instruction confirms the need (monadic with
  frame access).

  IRAM valid-bit check occurs in parallel: if the page containing
  the target offset is marked invalid, the token is rejected (see
  IRAM Valid-Bit Protection below).

Stage 3: MATCH / FRAME
  Path depends on instruction type (from stage 2) and token type.
  Uses the instruction's mode field to determine frame SRAM accesses:

  - Dyadic hit (second operand): read stored operand from frame SRAM
    at [frame_id : match_offset]. If has_const, also read constant
    from frame[fref]. 1-2 SRAM cycles depending on mode.
  - Dyadic miss (first operand): write incoming operand to frame SRAM,
    set presence bit in 670. Token consumed. 1 SRAM cycle.
  - Monadic with constant: read constant from frame[fref]. 1 SRAM
    cycle.
  - Monadic mode 4 (CHANGE_TAG, no frame access): 0 cycles, pass
    through.
  - Mode 7 (SINK+CONST / RMW): read old value from frame[fref] for
    read-modify-write. 1 SRAM cycle.

  See Pipeline Stall Analysis (section 11) for full cycle-count
  tables across Approaches A, B, and C.

Stage 4: EXECUTE
  - Instruction type bit selects CM compute (0) or SM operation (1)
  - CM path: 16-bit ALU executes arithmetic/logic/comparison/routing
    on operand data + constant (if present). Purely combinational.
    No SRAM access. Result latched.
  - SM path: ALU computes effective address or passes through data;
    PE constructs SM flit fields from frame data and operands.
    See SM Operation Dispatch (section 7) for encoding and dispatch
    details. See `alu-and-output-design.md` for SM flit assembly.
  - ~1500-2000 transistors (ALU) + SM flit assembly mux (~4-6 chips)

Stage 5: OUTPUT
  - CM path: read destination(s) from frame SRAM. Destinations are
    pre-formed flit 1 values stored in frame slots during activation
    setup. The PE reads the slot and puts it directly on the bus as
    flit 1; the ALU result becomes flit 2. Near-zero token formation
    logic.
    - Mode 0/1 (single dest): read frame[fref] or frame[fref+1].
      1 SRAM cycle.
    - Mode 2/3 (fan-out): read dest1 and dest2 from consecutive
      frame slots. 2 SRAM cycles.
    - Mode 4/5 (CHANGE_TAG): left operand becomes flit 1 verbatim.
      0 SRAM cycles.
    - Mode 6/7 (SINK): write ALU result back to frame[fref]. No
      output token. 1 SRAM cycle.
  - SM path: emit SM token to target SM. SM flit 1 constructed from
    frame slot (SM_id + addr from frame[fref]), SM flit 2 source
    selected by instruction decoder (ALU out, R operand, or frame
    slot). See SM Operation Dispatch (section 7) for flit 2 source
    mux details.
  - Pass to output serialiser for flit encoding and bus injection.

Concurrency Model#

The pipeline is fully overlapped: multiple tokens can be in flight simultaneously at different stages. In the emulator, each token spawns a separate SimPy process that progresses through the pipeline independently. This models the hardware reality where Stage 2 can be fetching a new instruction while Stage 4 is executing a previous one.

Cycle counts per token type (Approach C, recommended v0):

  • Dyadic hit, mode 1 (common case): 6 cycles (input + ifetch + match read + constant read + execute + output)
  • Dyadic hit, mode 0 (no const): 5 cycles
  • Dyadic miss: 3 cycles (input + ifetch + store operand; no execute/output)
  • Monadic, mode 0: 4 cycles (input + ifetch + execute + output; no match)
  • Monadic, mode 4 (CHANGE_TAG, no frame): 3 cycles
  • PE-local write: side path, does not enter compute pipeline
  • Frame control: side path, tag store write
  • Network delivery: +1 cycle latency between emit and arrival at destination

Unpipelined throughput: 3-7 cycles per token depending on instruction mode. This is the baseline against which pipeline overlap improvements are measured (see Pipeline Stall Analysis, section 11).
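The cycle counts above, expressed as a lookup table plus a throughput helper. The case labels are illustrative names for the token/mode combinations; the 5 MHz clock is the figure assumed earlier in the IFETCH timing discussion.

```python
# Unpipelined per-token cycle counts, Approach C (from the list above).
CYCLES = {
    "dyadic_hit_mode1": 6,   # input + ifetch + match + const + execute + output
    "dyadic_hit_mode0": 5,   # no constant read
    "dyadic_miss":      3,   # input + ifetch + store operand
    "monadic_mode0":    4,   # no match stage
    "monadic_mode4":    3,   # CHANGE_TAG, no frame access
}

def tokens_per_second(case, clock_hz=5_000_000):
    """Unpipelined throughput for a stream of identical tokens at the
    assumed 5 MHz pipeline clock."""
    return clock_hz / CYCLES[case]
```

Pipeline overlap raises these numbers; the stall analysis in section 11 quantifies by how much.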

Pipeline Register Widths#

Between instruction fetch and execute, the pipeline carries both operand data and instruction control information in parallel but logically separate paths:

Data path (~32 bits between match/frame and ALU):

data_L: 16 bits  (from frame SRAM or direct from token for monadic)
data_R: 16 bits  (from frame SRAM)
                  |
             ALU (16-bit)
                  |
           result: 16 bits

Control path (~16 bits from IRAM, plus frame reads in stage 5):

type:      1 bit    (CM/SM select, from IRAM)
opcode:    5 bits   (from IRAM, consumed by ALU / SM decoder)
mode:      3 bits   (from IRAM, drives frame access pattern + output routing)
wide:      1 bit    (from IRAM, 16/32-bit frame access)
fref:      6 bits   (from IRAM, frame slot base index)

Destinations are NOT in the pipeline control registers. They are read from frame SRAM during stage 5 as pre-formed flit 1 values. This simplifies the pipeline latch: only 16 bits of IRAM control + 32 bits of operand data pass between stages.

Total pipeline register between fetch and execute: ~48 bits. The mode field (3 bits) encodes tag behaviour, constant presence, and fan-out in a single dense field (see mode table in section 6).

Pipeline registers between stages: ~500 transistors
Control logic (state machine, handshaking): ~500-1000 transistors

Per-PE total: ~32-43 chips (Approach C).

4. Frame-Based Matching#

Why It Can Be Small#

The matching store is the highest-risk component in any dataflow machine. Manchester needed 16 memory boards per PE. Amamiya needed 1024 CAM blocks (32KW at 43 bits/word) per PE. Both were sized for worst-case dynamic scheduling of arbitrary programs.

This design avoids that because:

  1. Static PE assignment: the compiler knows which functions run on which PE and can calculate maximum concurrent activations per PE.
  2. Function splitting: the compiler can split large function bodies across PEs so no single PE needs a huge frame.
  3. Compiler-controlled frame allocation: the compiler assigns activation IDs at compile time for statically-known activations. Only genuinely dynamic activations (runtime-determined recursion depth) need runtime allocation.

The frame count is therefore a compiler parameter, not an architectural constant. The hardware provides 4 concurrent frames of 64 slots each. The compiler must generate code that fits within those limits, splitting and scheduling accordingly.

Matching uses the per-activation frame model. Pending match operands live in the same SRAM address space as constants, destinations, and accumulators. The 74LS670 (4-word x 4-bit register file with independent read/write ports) provides the activation_id-to-frame_id mapping and presence/port metadata, all combinationally.

act_id-to-frame_id resolution:

Two 74LS670s, addressed by act_id[1:0] with act_id[2] selecting between chips. Output = {valid:1, frame_id:2, spare:1}. Combinational read: present act_id on the address lines, frame_id appears at the output in ~35 ns.

ALLOC: write {valid=1, frame_id} at address act_id (670 write port)
FREE:  write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns

The 670's independent read and write ports allow ALLOC to proceed while the pipeline reads -- zero conflict.
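A behavioural model of the tag store makes the ALLOC/FREE/LOOKUP protocol concrete. This is a software sketch of the two-chip 670 arrangement, not a timing model; in hardware the lookup is a combinational read (~35 ns).

```python
class TagStore:
    """act_id -> {valid, frame_id} mapping, as held in two 74LS670s:
    8 entries (3-bit act_id), 2-bit frame_id, 1 valid bit each."""

    def __init__(self):
        self.entries = [(0, 0)] * 8   # (valid, frame_id) per act_id

    def alloc(self, act_id, frame_id):
        """ALLOC: write {valid=1, frame_id} via the 670 write port."""
        self.entries[act_id & 0x7] = (1, frame_id & 0x3)

    def free(self, act_id):
        """FREE: clear the valid bit."""
        self.entries[act_id & 0x7] = (0, 0)

    def lookup(self, act_id):
        """LOOKUP: combinational read port -> (valid, frame_id)."""
        return self.entries[act_id & 0x7]
```

Because the 670's read and write ports are independent, `alloc` and `lookup` can happen in the same cycle in hardware; the sequential model above does not capture that, but the state transitions are the same.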

Presence and port metadata:

Presence and port bits live in additional 670s, addressed by [frame_id:2]. The matchable offset range is constrained to offsets 0-7 (8 dyadic-capable slots per frame). The assembler packs dyadic instructions at low offsets; offsets 8-255 are monadic-only.

Recommended layout: 4x 670, each covering 2 offsets across 4 frames:

670 chip 0 (offsets 0-1): word[frame_id] = {pres0:1, port0:1, pres1:1, port1:1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2:1, port2:1, pres3:1, port3:1}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4:1, port4:1, pres5:1, port5:1}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6:1, port6:1, pres7:1, port7:1}

offset[2:1] selects chip, offset[0] selects which pair of bits within the 4-bit output (a 2:1 mux -- one gate).

The 670's simultaneous read/write is critical: during stage 3, when a first operand stores and sets presence, the write port updates the presence 670 while the read port remains available for the next pipeline stage's lookup. No read-modify-write sequencing needed.

All reads are combinational (~35 ns). All resolve during stage 2 in parallel with the IRAM read. By the time stage 3 begins, the PE knows frame_id, presence, and port -- the only SRAM access in stage 3 is reading/writing the actual operand data.
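The chip/bit selection above can be checked with a small model. `words[chip][frame_id]` stands in for the 4-bit 670 output, laid out MSB-to-LSB as {pres_even, port_even, pres_odd, port_odd}; the bit layout is an assumption consistent with the table above.

```python
def select_presence_port(words, frame_id, offset):
    """offset[2:1] picks the 670 chip, offset[0] picks the bit pair
    within that chip's 4-bit word (the one-gate 2:1 mux)."""
    chip = (offset >> 1) & 0x3
    word = words[chip][frame_id]
    if offset & 1:   # odd offset: low pair of the 4-bit word
        pres, port = (word >> 1) & 1, word & 1
    else:            # even offset: high pair
        pres, port = (word >> 3) & 1, (word >> 2) & 1
    return pres, port
```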

The matching operation:

Stage 2 (parallel with IRAM read):
  act_id -> frame_id via 670 lookup (combinational)
  presence[frame_id][offset] via 670 read (combinational)
  port[frame_id][offset] via 670 read (combinational)

Stage 3 (driven by instruction word from stage 2):
  if instruction is dyadic AND presence bit set:
    -> match found: read stored operand from frame SRAM at
       [frame_id:2][match_offset]. Clear presence bit.
       Read constant from frame[fref] if has_const.
    -> proceed to stage 4 with both operands.
  if instruction is dyadic AND presence bit clear:
    -> first operand: write incoming operand to frame SRAM at
       [frame_id:2][match_offset]. Set presence bit in 670.
    -> token consumed, advance to next input token.
  if instruction is monadic:
    -> bypass matching. Read constant from frame[fref] if has_const.

Hardware cost:

Component                    Chips  Notes
act_id -> frame_id lookup    2      74LS670, indexed by act_id
Presence + port metadata     4      74LS670, indexed by frame_id
Bit select mux               1-2    offset-based selection of presence/port
Total match metadata         ~8

Constraint: 8 matchable offsets per frame. The assembler enforces this. 8 dyadic instructions per function chunk per PE is reasonable -- the compiler splits larger function bodies across PEs. With 4 PEs, a function split across all of them can still use up to 32 dyadic instructions per activation, which exceeds typical working-set needs for the target workloads.

Constraint: 8 unique activation_ids. The 3-bit act_id supports 8 entries in the lookup table. With 4 concurrent frames, 4 IDs of ABA distance exist before wraparound.

Alternative Approaches#

Two alternative matching implementations exist:

  • Approach A (set-associative tags in frame SRAM): tag words share the frame SRAM chip. No extra chips, but matching consumes SRAM cycles (2 cycles for a dyadic hit). Lowest chip count (~4-6 extra TTL), highest pipeline stall rate.

  • Approach B (full register-file match pool): match entries live entirely in a register file with parallel comparators. Matching is fully combinational (~50 ns). Highest chip count (~16-18 extra TTL), best pipeline throughput (eliminates all match-related SRAM contention).

Approach C (670 lookup, described above) sits between A and B: act_id resolution and presence checking are combinational (like B), but operand data lives in SRAM (like A). ~8 extra chips, good pipeline throughput.

See the approach comparison table in Pipeline Stall Analysis (section 11) for full cycle counts and throughput estimates across all three approaches, plus 670-enhanced variants (B+670 indexed, B+670 semi-CAM) and hybrid upgrade paths.

Frame Sizing#

6-bit fref addresses 64 slots per activation. With 4 concurrent frames, the frame region occupies 4 x 64 x 2 bytes = 512 bytes of SRAM. This fits trivially alongside the IRAM region in a 32Kx8 chip pair.

A function body with 10 dyadic instructions, 5 constants, and fan-out on 3 instructions might use ~30 frame slots (10 match + 5 const + 8 dest + 7 accumulator/spare). With slot dedup (shared destinations, aliased constants), actual usage is typically 15-25 slots per activation.

What About Overflow?#

If all frames are occupied or a function body exceeds 8 dyadic instructions per frame:

Compile-time prevention (primary strategy):

  • The compiler knows the frame count and matchable offset limit
  • It splits functions and schedules activations to fit
  • If a program genuinely can't fit (unbounded recursion deeper than 4 frames), the compiler inserts throttling code: a token that waits for a frame to free before allowing the next recursive call
  • This is the Amamiya throttle idea, but implemented in software (compiler-inserted dataflow logic) rather than hardware

Runtime overflow (safety net):

  • If a token arrives and the tag store has no valid entry for its act_id (shouldn't happen with correct compilation), the PE stalls the input FIFO until a frame frees. Simplest, safest, most debuggable. If it fires, something is wrong and stalling surfaces the bug.

Upgrade path: hybrid with SRAM fallback:

  • If the 8-offset matchable range proves tight, the high bit of the offset can select between register (offset[3]=0, check 670s) and SRAM (offset[3]=1, fall back to tag-in-SRAM from Approach A). The fast path stays combinational; the overflow path adds 1 SRAM cycle. The system degrades gracefully rather than hard-limiting at 8 dyadic offsets.

5. Frame Lifecycle#

Allocation#

An ALLOC frame control token (prefix 011+00, op=0) arrives at the PE, specifying an activation_id. The PE assigns the next free physical frame and records the act_id-to-frame_id mapping in the 670 tag store.

Free frame tracking is a simple 2-bit counter or shift register (4 entries max). Hardware cost: ~2-3 TTL chips.

Setup#

PE-local write tokens (prefix 011+01, region=1) load constants and destinations into the allocated frame's slots. The writer addresses slots by (act_id, slot_index); the PE resolves act_id-to-frame_id internally using the same 670 lookup as the compute pipeline. Setup uses the same mechanism as IRAM loading -- a stream of write tokens, precalculated by the assembler.

Execution#

Compute tokens arrive with activation_id. The PE resolves act_id-to-frame_id (670 lookup, combinational), then uses frame_id to address frame SRAM for matching, constant reads, destination reads, and write-backs. See the pipeline stages (section 3) for the per-stage access pattern.

Deallocation#

A FREE_FRAME instruction (opcode-driven, any mode) or a FREE frame control token (prefix 011+00, op=1) releases the frame. The tag store entry is cleared (valid=0 written to the 670), presence/port metadata for that frame is bulk-cleared across all 4 presence/port 670s, and the frame_id returns to the free pool.

Multiple frees are idempotent / harmless. Freed frames are immediately available for reallocation.

ABA Protection#

  • 3-bit activation_id provides 8 unique IDs
  • With at most 4 concurrent frames, 4 IDs of ABA distance exist before wraparound
  • Stale tokens (from freed activations) carry an act_id whose 670 entry is now valid=0 or maps to a different frame. The PE detects this via the valid bit and discards the stale token.
  • 4 IDs of distance is sufficient because stale tokens drain within single-digit cycles. Wraparound collision is effectively impossible.
  • The act_id validity check in the 670 provides ABA protection without dedicated hardware -- the valid bit serves as an implicit generation guard.
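The guard reduces to one check at the front of the pipeline. A minimal sketch, where `lookup` is any callable returning `(valid, frame_id)` in the shape of the tag store described in section 4:

```python
def classify_token(lookup, act_id):
    """Return the frame_id to use, or None to discard a stale token.
    The 670 valid bit is the entire ABA guard -- no generation counter."""
    valid, frame_id = lookup(act_id)
    if not valid:
        return None   # activation was freed: token is stale, discard
    return frame_id
```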

Throttle#

  • The PE tracks the number of active frames (frames with valid=1 in the tag store)
  • When all 4 frames are active, ALLOC tokens are NACKed or stall until a free occurs
  • Prevents frame overflow
  • Hardware cost: 2-bit counter + comparator + gate. ~3 TTL chips.
  • With compiler-controlled scheduling, the throttle should rarely fire. It's a safety net, not a performance mechanism.

6. Instruction Memory#

Static Assignment, Per-PE Contents#

Unlike Amamiya where every PE has identical IM contents (full program), each PE here holds only the function bodies (or function chunks) assigned to it by the compiler. This means:

  • IM is smaller per PE (only assigned code, not the whole program)
  • Different PEs have different IM contents (loaded at bootstrap)
  • The compiler emits a per-PE instruction image as part of the program

Runtime Writability#

Instruction memory is not read-only. It is writable from the network via IRAM write (prefix 011+01) packets. This serves two purposes:

  1. Bootstrap: loading programs before execution starts
  2. Runtime reprogramming: loading new function bodies while other PEs continue executing (future capability, not needed for v0)

Runtime writability also means instruction memory size is not a hard architectural limit -- if a program needs more code than fits in one PE's IM, the runtime (or a management PE) could swap function bodies in and out. Very speculative, but the hardware path exists.

Implementation#

Instruction memory is PE-local SRAM, sharing a chip pair with the frame region via address partitioning (see Per-PE Memory Map, section 10). IRAM width is completely independent of bus width. It is sized for encoding needs, not bus constraints.

IRAM Width: 16-bit Single-Half Format#

Each IRAM slot is 16 bits, read in a single cycle from one 8-bit-wide SRAM chip pair. Instruction templates are activation-independent: all per-activation data (constants, destinations, match operands, accumulators) lives in the frame.

IRAM address = [offset:8]             (v0, 8-bit address)

256 instruction slots per PE. Total IRAM SRAM usage: 512 bytes per PE. For programs exceeding 256 instructions, see Future: Bank Switching with 74LS610 (section 14).

Instruction Word Format#

[type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
  15       14-10     9-7     6       5-0
Field   Bits  Purpose
type    1     0 = CM compute, 1 = SM operation
opcode  5     Operation code (CM and SM have independent 32-entry spaces)
mode    3     Combined tag/frame-reference mode (see mode table below)
wide    1     0 = 16-bit frame values, 1 = 32-bit (consecutive slot pairs)
fref    6     Frame slot base index (64 slots per activation)

type:1 -- operation space select:

0 = CM compute operation (ALU)
1 = SM operation (structure memory bus command)

opcode:5 -- 32 slots per type. CM and SM have independent opcode spaces (32 CM opcodes + 32 SM opcodes). Decoded by EEPROM into control signals. See alu-and-output-design.md for the CM operation set.

mode:3 -- combined output routing and frame access mode. Controls whether the instruction emits tokens, how it reads destinations and constants from the frame, or whether it writes results back to the frame. See the mode table below.

wide:1 -- frame value width:

0 = 16-bit frame values (single slot per logical value)
1 = 32-bit frame values (consecutive slot pairs per logical value)

fref:6 -- frame slot index (0-63). Base of a contiguous group of 1-3 slots, depending on mode. The instruction template references frame data exclusively through this field; no per-activation data exists in the instruction word itself.

Constants and destinations are NOT in the instruction word. They live in frame slots, referenced by fref. The instruction template is pure control flow: opcode, mode flags, and a frame slot reference.

Instruction words are never serialised onto the external bus during normal execution. They are only written via PE-local write packets (prefix 011+01, region=0) during program loading.

Mode Table (3-bit mode field)#

The 3-bit mode field encodes both output routing behaviour and frame access pattern in a single field. Every combination is useful; there are no wasted encodings.

mode  [2:0]  tag behaviour   frame reads at fref         use case
----  -----  --------------  --------------------------  --------------------------
  0   000    INHERIT         [dest]                      single output, no constant
  1   001    INHERIT         [const, dest]               single output with constant
  2   010    INHERIT         [dest1, dest2]              fan-out, no constant
  3   011    INHERIT         [const, dest1, dest2]       fan-out with constant
  4   100    CHANGE_TAG      (none)                      dynamic routing, no constant
  5   101    CHANGE_TAG      [const]                     dynamic routing with constant
  6   110    SINK            write result -> frame[fref]  store to frame, no output
  7   111    SINK+CONST      read frame[fref],           local accumulate / RMW
                             write result -> frame[fref]

Bit-level decode equations:

output_enable  = NOT mode[2]              modes 0-3: read dest from frame, emit token
change_tag     = mode[2] AND NOT mode[1]  modes 4-5: routing from left operand
sink           = mode[2] AND mode[1]      modes 6-7: no output token, write to frame
has_const      = mode[0]                  modes 1, 3, 5, 7: read constant from frame
has_fanout     = mode[1] AND NOT mode[2]  modes 2-3: read two destinations

Frame slot count per mode = 1 + has_const + has_fanout, read sequentially starting from fref. For SINK modes, fref is the write target. For SINK+CONST (mode 7, read-modify-write), the read and write target is the same slot (fref), with the ALU result writing back after computation.
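The decode equations and slot-count rule above, as executable checks (the dict keys mirror the signal names in the equations):

```python
def decode_mode(mode):
    """Decode the 3-bit mode field into the control signals above."""
    m2, m1, m0 = (mode >> 2) & 1, (mode >> 1) & 1, mode & 1
    return {
        "output_enable": not m2,               # modes 0-3: emit token
        "change_tag":    bool(m2 and not m1),  # modes 4-5: routing from left op
        "sink":          bool(m2 and m1),      # modes 6-7: write to frame
        "has_const":     bool(m0),             # modes 1,3,5,7
        "has_fanout":    bool(m1 and not m2),  # modes 2-3: two destinations
    }

def frame_slot_count(mode):
    """Slots read sequentially from fref: 1 + has_const + has_fanout."""
    d = decode_mode(mode)
    return 1 + d["has_const"] + d["has_fanout"]
```

Running all eight modes through `decode_mode` confirms every encoding maps to a distinct, useful behaviour, as the table claims.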

INHERIT (modes 0-3): output tokens are routed to destinations stored in frame slots. Each destination slot holds a pre-formed flit 1 value:

frame dest slot: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3]

The PE reads the slot and puts it directly on the bus as flit 1. The ALU result becomes flit 2. Almost zero token formation logic -- the frame constant IS the output flit. This is the same format as the incoming token's flit 1, enabling forwarding without repacking.

  • Mode 0 (single output, no constant): frame[fref] = dest. 1 slot.
  • Mode 1 (single output with constant): frame[fref] = const, frame[fref+1] = dest. 2 slots.
  • Mode 2 (fan-out, no constant): frame[fref] = dest1, frame[fref+1] = dest2. 2 slots.
  • Mode 3 (fan-out with constant): frame[fref] = const, frame[fref+1] = dest1, frame[fref+2] = dest2. 3 slots.

CHANGE_TAG (modes 4-5): the left operand replaces the frame destination as flit 1. The entire output flit 1 comes from the left operand data value (16 bits, verbatim). The right operand becomes flit 2 (payload data). This enables sending a value to any destination computed at runtime -- the packed tag IS flit 1. No field extraction or assembly.

  • Mode 4 (no constant): no frame reads. Flit 1 = left operand, flit 2 = right operand (or ALU result).
  • Mode 5 (with constant): frame[fref] = const. Flit 1 = left operand, flit 2 = right operand. The constant feeds the ALU.

The output stage is a mux: frame dest vs left operand, selected by mode[2]. Hardware: left operand bypass latch (~2 chips) preserves the left operand value past the ALU. Stage 5 flit 1 mux (~2 chips) selects between assembled flit and raw data.

SINK (modes 6-7): no output token is emitted. The ALU result is written back to frame[fref]. Used for local accumulation, temporary storage, and read-modify-write patterns.

  • Mode 6 (write only): ALU result written to frame[fref]. 1 SRAM write cycle.
  • Mode 7 (read-modify-write): frame[fref] is read as the constant input to the ALU, the result is written back to frame[fref]. Enables in-place accumulation without consuming a separate constant slot.

dest_type derivation: the output token format (dyadic wide vs monadic normal vs monadic inline) is determined by the destination frame slot contents. Since destination slots hold pre-formed flit 1 values, the token type is encoded in the prefix bits of the stored flit. The output stage emits the frame slot verbatim as flit 1; the prefix bits in the stored value determine the wire format. No runtime type derivation logic is needed. For SWITCH not-taken paths, the output stage emits a monadic inline token (hardwired prefix in the formatter, overriding the frame destination).

IRAM Valid-Bit Protection#

IRAM is written via PE-local write tokens (prefix 011+01, region=0) that share the PE's input path with compute tokens. When IRAM contents are replaced at runtime (swapping function fragments in and out), tokens in flight may target IRAM addresses that have been or are being overwritten. Because tokens do not carry instruction identity information, the PE cannot distinguish "right instruction" from "wrong instruction" -- it just sees an offset into IRAM.

This is an instruction identity problem, not a presence problem. The dangerous case is not "IRAM is empty" but "IRAM contains a different instruction than the token expects."

The mechanism: per-page valid bits. IRAM is logically divided into pages (e.g. 8 pages of 32 instructions for 256-entry IRAM, or 16 pages of 16). Each page has a 1-bit valid flag, stored in a small TTL register alongside IRAM. Total hardware cost: one register chip for all page valid bits.

The valid bit is checked during the IFETCH pipeline stage, in parallel with the IRAM SRAM read. The top bits of the token's offset field select the page; the valid bit for that page gates whether the token proceeds or is rejected.

Token arrives at IFETCH:
  page = offset[high bits]
  if valid_bit[page] == 0:
    -> reject token (see rejection policy below)
  else:
    -> proceed with instruction fetch and act_id resolution
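
As an executable sketch of the gate (assuming the 8-pages-of-32 configuration from the example above):

```python
PAGE_SHIFT = 5        # 32 instructions/page -> 8 pages for 256-entry IRAM
valid_bit = [0] * 8   # one TTL register bit per page, cleared at reset

def ifetch_gate(offset: int) -> bool:
    """Checked in parallel with the IRAM SRAM read: the high bits of
    the token's offset select the page; a clear valid bit rejects."""
    return valid_bit[offset >> PAGE_SHIFT] == 1

# Swap sequence for page 3 (protocol steps 3a and 4a below):
valid_bit[3] = 0      # clear before overwriting the page
# ... PE-local write tokens load IRAM addresses 96..127 ...
valid_bit[3] = 1      # load-complete marker re-validates
```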

IRAM swap protocol. Because config writes and compute tokens share the input path, the swap sequence is naturally ordered:

1. Loader sends drain signal (implementation TBD -- could be a PE-local
   write "quiesce" flag, or the PE back-pressures via handshake/ready
   signal)
2. PE processes remaining compute tokens in pipeline (natural drain)
3. PE-local write token (prefix 011+01, region=0) arrives:
   a. PE clears valid bit for target page
   b. PE writes instruction word to IRAM at specified address
   c. If more write tokens follow (burst), keep writing
4. Load-complete marker arrives (PE-local write with load-complete flag):
   a. PE sets valid bit for target page
   b. PE resumes accepting compute tokens for that page

During steps 3-4, any compute token that arrives targeting an invalid page is rejected. The shared input path ordering guarantees that tokens from the new code epoch cannot arrive until after the loader has sent them, which is after the load completes. Rejected tokens are therefore late arrivals from the old epoch -- work that is being abandoned.

Presence-bit optimisation: the frame's presence metadata (in the 670 register files) can be checked before step 1 to determine if any tokens are pending for offsets in the target page. If all presence bits for matchable offsets in that page are clear, the drain step can be skipped entirely -- no tokens are waiting for those instructions. This enables targeted IRAM replacement without stalling the entire PE.

Rejection policy. v0: discard silently + diagnostic. Late-arriving tokens targeting an invalid IRAM page are dropped. The PE sets a sticky flag (directly driving a diagnostic LED) to indicate that a discard occurred. This is a "should never happen if the loader protocol is correct" safety net. The LED makes it visible during debugging without adding any pipeline complexity.

If the flag lights up, something is wrong with the drain timing.

Future: NAK response. The PE could form a NAK token from the rejected compute token and emit it to a coordinator. The output stage already exists (forms tokens from frame destinations + ALU results); a bypass path from IFETCH to the output would enable this. Estimated cost: a mux and some control logic, ~5-8 TTL chips. Deferred until the runtime is sophisticated enough to act on NAKs.

What this does NOT protect against. The valid-bit mechanism catches tokens that arrive during an IRAM swap (page is invalid). It does not catch tokens that arrive after a swap completes and the page is re-validated with different code. Preventing that case requires the drain protocol to be correct -- all tokens from the old epoch must have been processed or discarded before the new code is marked valid.

For v0, this is a software/loader invariant enforced by the drain protocol. Future hardening options:

  • Per-page epoch counter (2-3 bits): incremented on each page reload. Checked against an expected epoch stored per-activation or derived from spare token bits. Catches post-swap stale tokens at the cost of a comparator + epoch storage.
  • Fragment ID register per page: similar to epoch but identifies the fragment by name rather than sequence number. More expensive (wider comparator) but more debuggable.

Both options fit in the spare bits reserved in the token formats. The valid-bit mechanism is forward-compatible with either.

7. SM Operation Dispatch#

SM operations (type=1) use the same 16-bit instruction format. The 5-bit opcode field selects from the SM opcode space (independent of CM opcodes). fref points at frame slots containing SM-specific parameters. The PE constructs SM tokens from frame data, operand data, and ALU results.

All SM addressing goes through frame slots. There is no separate "pointer-addressed" vs "constant-addressed" distinction in the instruction encoding -- the frame slot contents determine the target SM, address, and any return routing. Both pointer-addressed operations (address from token data) and constant-addressed operations (address from frame slot) use the same frame-slot-based encoding.

SM Bus Opcode Encoding#

The SM bus opcode encoding follows sm-design.md unchanged: variable-width, with tier 1 (3-bit opcode, 10-bit addr) and tier 2 (5-bit opcode, 8-bit payload). The full opcode table is in sm-design.md:

Tier 1 (3-bit, 1024-cell addr range):
  read, write, alloc, free, exec, ext

Tier 2 (5-bit, 256-cell addr range or 8-bit payload):
  rd_inc, rd_dec, cas, raw_rd, clear, set_pg, write_im, (spare)

SM Token Construction#

The PE output stage builds SM tokens on the wire. The instruction's SM opcode (in the PE's 5-bit opcode field) maps to the SM bus opcode via the instruction decoder EEPROM. Frame slots provide addressing and return routing parameters:

  • SM flit 1 is constructed from frame[fref] contents (SM_id, address) plus the SM bus opcode from the decoder.
  • SM flit 2 source depends on the operation (see flit 2 source mux below).
  • SM flit 3 (CAS and EXT only) carries additional data.

Frame Slot Packing for SM Parameters#

A single 16-bit frame slot holds the SM target:

Tier 1 target slot: [SM_id:2][addr_high:2][addr_low:8][spare:4]  = 16 bits
                     (addr = 10 bits, 1024-cell range)

Tier 2 target slot: [SM_id:2][addr:8][spare:6]                    = 16 bits
                     (addr = 8 bits, 256-cell range)

For operations needing return routing, the next consecutive frame slot holds a pre-formed response token flit 1:

Return routing slot: [prefix:2-3][port:0-1][PE:2][offset:8][act_id:3] = 16 bits

This is the same format as a CM destination slot -- the SM response token routes as a normal compute token back to the requesting PE. The SM treats flit 2 of the request as an opaque 16-bit blob and echoes it as flit 1 of the response (see sm-design.md Result Format).
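
A bit-level sketch of the tier-1 packing (Python; the function names are illustrative, the field positions are as drawn above):

```python
def pack_tier1_target(sm_id: int, addr: int) -> int:
    """Tier-1 target slot: [SM_id:2][addr:10][spare:4] --
    SM_id in bits 15:14, the 10-bit address in bits 13:4."""
    assert 0 <= sm_id < 4 and 0 <= addr < 1024
    return (sm_id << 14) | (addr << 4)

def unpack_tier1_target(slot: int) -> tuple:
    """Recover (SM_id, addr) from a packed slot."""
    return (slot >> 14) & 0x3, (slot >> 4) & 0x3FF
```

The spare low 4 bits stay zero; the pre-packed layout is what lets stage 5 lift SM_id and address straight out of the slot with no runtime muxing.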

Flit 2 Source Mux#

Different SM operations need different data in flit 2. The source is selected by a 2-bit signal derived from the instruction decoder (SM opcode + mode):

source   select  use case
-------  ------  -----------------------------------------------
ALU out  00      SM write (operand is write data, passes through ALU),
                 SM write_im (immediate write). Default for CM compute.
R oper   01      SM scatter write (ALU computes addr from base + L operand;
                 R operand is write data, bypasses ALU to flit 2).
                 SM CAS flit 2 (expected value = L operand).
Frame    10      SM read / rd_inc / rd_dec / raw_rd / alloc return routing
                 (frame[fref+1] = pre-formed response token flit 1).
                 SM exec parameters.
(spare)  11      reserved

Hardware: cascaded 74LS157 (quad 2:1 mux) pairs, ~4-6 chips for 16-bit width.

SM Operation Mapping Table#

SM bus op        PE opcode    frame slots         mode    operands                flits  flit 2 source           notes
---------------  -----------  ------------------  ------  ----------------------  -----  ----------------------  ------------------------------
read             SM_READ      2: target + return  1       monadic (trigger)       2      frame (return routing)  indexed variant: ALU adds base + index
write            SM_WRITE     1: target           0       monadic (data)          2      ALU (write data)
write (scatter)  SM_WRITE_IX  1: target           1       dyadic (index, data)    2      R operand (write data)  ALU: base + index
alloc            SM_ALLOC     2: params + return  1       monadic (trigger)       2      frame (return routing)
free             SM_FREE      1: target           0       monadic (trigger)       2      don't-care
exec             SM_EXEC      2: target + params  1       monadic (trigger)       2      frame (params/count)
ext              SM_EXT       1-2: varies         varies  varies                  3      varies                  3-flit extended addressing
rd_inc           SM_RDINC     2: target + return  1       monadic (trigger)       2      frame (return routing)  atomic read-and-increment
rd_dec           SM_RDDEC     2: target + return  1       monadic (trigger)       2      frame (return routing)  atomic read-and-decrement
cas              SM_CAS       1: target           0       dyadic (expected, new)  3      L operand (expected)    3-flit; return via prior read
raw_rd           SM_RAWRD     2: target + return  1       monadic (trigger)       2      frame (return routing)  non-blocking, no deferred read
clear            SM_CLEAR     1: target           0       monadic (trigger)       2      don't-care              resets cell to EMPTY
set_pg           SM_SETPG     1: target           0       monadic (page value)    2      ALU (page value)        SM-side bank switching
write_im         SM_WRIM      1: target           0       monadic (data)          2      ALU (write data)        immediate write, tier 2 addr

SM Flit 1 Assembly#

Stage 5 assembles SM flit 1 from frame[fref] and the wire opcode from the decoder EEPROM:

Tier 1:
  flit 1 = [1][frame[fref][15:14] (SM_id)][wire_opcode:3][frame[fref][13:4] (addr:10)]

Tier 2:
  flit 1 = [1][frame[fref][15:14] (SM_id)][wire_opcode:5][frame[fref][13:6] (addr:8)]

Hardware: the [1] prefix is hardwired. SM_id comes from the top 2 bits of the frame slot. The wire opcode comes from the EEPROM. The address comes from the remaining frame slot bits. All fields are concatenated on the wire with no runtime muxing -- the frame slot is pre-packed so that the bit positions align with the SM bus format. ~1-2 chips for output gating and serialisation.
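
A behavioural sketch of both assemblies (Python; slot layouts as in the frame-slot packing section above, function names illustrative):

```python
def assemble_sm_flit1_tier1(frame_slot: int, wire_opcode: int) -> int:
    """Tier-1 SM flit 1: [1][SM_id:2][opcode:3][addr:10].
    SM_id and addr are lifted straight from the pre-packed frame slot
    (SM_id in slot[15:14], addr in slot[13:4]); the opcode comes from
    the decoder EEPROM; the leading 1 is hardwired."""
    sm_id = (frame_slot >> 14) & 0x3
    addr = (frame_slot >> 4) & 0x3FF
    return (1 << 15) | (sm_id << 13) | ((wire_opcode & 0x7) << 10) | addr

def assemble_sm_flit1_tier2(frame_slot: int, wire_opcode: int) -> int:
    """Tier-2 SM flit 1: [1][SM_id:2][opcode:5][addr:8],
    with the 8-bit address lifted from slot[13:6]."""
    sm_id = (frame_slot >> 14) & 0x3
    addr = (frame_slot >> 6) & 0xFF
    return (1 << 15) | (sm_id << 13) | ((wire_opcode & 0x1F) << 8) | addr
```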

Indexed Address Computation#

For indexed READ, scatter WRITE, and other address-computed operations, the ALU computes the effective address: base address (extracted from frame[fref]) + index (from the left operand). The computed address is packed into SM flit 1 by the output stage. The ALU performs address arithmetic without dedicated address-computation hardware.

CAS: 3-Flit SM Token#

Compare-and-swap requires address, expected value, and new value. This exceeds the standard 2-flit SM packet.

CAS emission (3 flits):
  flit 1: SM header (SM_id + CAS opcode + addr, from frame[fref])
  flit 2: expected value (left operand)
  flit 3: new value (right operand)

The output serialiser emits 3 flits instead of the default 2. An extra_flit signal from the instruction decoder (asserted for CAS and EXT-mode ops) increments the serialiser's flit counter limit: flit_count = 2 + extra_flit. One gate.

Return routing for CAS uses the prior-READ pattern: issue an SM READ to the target cell first (plants return routing in the SM's deferred-read register), receive the current value as the response (which also provides the expected value), then issue the CAS with the known expected value and the desired new value.

SM Flit Assembly Hardware Cost#

Component                        Chips/PE  Purpose
-------------------------------  --------  ----------------------------------------
Flit 2 source mux (16-bit, 4:1)  ~4-6      ALU out / R oper / Frame / spare
SM flit 1 gating                 ~1-2      Frame target slot + EEPROM opcode to bus
Extra flit control               ~0.5      CAS/EXT 3-flit counter

8. EEPROM-Based Instruction Decoding#

The instruction decoder can be implemented as an EEPROM acting like a PLD. Input bits = instruction opcode fields + PE ID bits. Output bits = control signals for the ALU, matching store, token output formatter, etc.

This gives significant flexibility:

  • Instruction set can be changed by reflashing the EEPROM (no board changes)
  • Per-PE customisation (different PEs could theoretically have different instruction subsets, though unlikely for v0)
  • The PE ID is "free" -- it's just more EEPROM address bits
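
A sketch of building such an EEPROM image offline (Python; the opcode-to-control-bit mapping here is a placeholder, not the real signal assignment):

```python
# Address = opcode (5 bits) + PE ID (2 bits); data = one control word.
OPCODE_BITS, PEID_BITS = 5, 2

def control_word(opcode: int, pe_id: int) -> int:
    """Placeholder control-signal layout -- illustrative only."""
    alu_op = opcode & 0x0F          # e.g. low bits drive the ALU function
    is_sm = (opcode >> 4) & 1       # e.g. top bit selects SM dispatch
    return (is_sm << 4) | alu_op

image = bytearray(1 << (OPCODE_BITS + PEID_BITS))
for addr in range(len(image)):
    opcode = addr & ((1 << OPCODE_BITS) - 1)
    pe_id = addr >> OPCODE_BITS
    image[addr] = control_word(opcode, pe_id)
```

Reflashing the instruction set is regenerating `image` and burning it; per-PE customisation is just making `control_word` depend on `pe_id`.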

9. The 670 Subsystem: Act ID Lookup, Match Metadata, and SC Register File#

Role in the Frame-Based Architecture#

The 74LS670s serve two critical functions:

  1. act_id -> frame_id lookup table. Indexed by the token's 3-bit activation_id, outputs {valid:1, frame_id:2, spare:1} in ~35 ns (combinational). This replaces what would otherwise be an SRAM cycle for associative tag comparison.

  2. Presence and port metadata store. Indexed by frame_id, stores presence and port bits for all 8 matchable offsets across all 4 frames. Combinational read (~35 ns after frame_id settles, ~70 ns total from act_id presentation).

Both functions complete within stage 2, in parallel with the IRAM read. By the time stage 3 begins, the PE knows frame_id, presence, and port -- the only remaining SRAM access is the actual operand data.

Hardware Configuration#

act_id -> frame_id (2x 74LS670):

Addressed by act_id[1:0] with act_id[2] selecting between chips. Each chip holds 4 words x 4 bits. Output: {valid:1, frame_id:2, spare:1}.

ALLOC: write {valid=1, frame_id} at address act_id (670 write port)
FREE:  write {valid=0, ...} at address act_id
LOOKUP: read port, address = act_id -> {valid, frame_id} in ~35 ns

The 670's independent read and write ports allow ALLOC to proceed while the pipeline reads -- zero conflict.
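
A behavioural model of the lookup pair (Python; the bit layout within the 4-bit word is an assumption):

```python
# act_id -> frame_id lookup: 8 words (2 chips x 4 words), each holding
# {valid:1, frame_id:2, spare:1}. Assumed layout: valid in bit 3,
# frame_id in bits 2:1.
lookup = [0] * 8                                 # valid bits clear at reset

def alloc(act_id: int, frame_id: int) -> None:
    lookup[act_id] = 0b1000 | (frame_id << 1)    # write {valid=1, frame_id}

def free(act_id: int) -> None:
    lookup[act_id] = 0                           # write {valid=0}

def resolve(act_id: int):
    """Combinational read (~35 ns): returns (valid, frame_id)."""
    word = lookup[act_id]
    return (word >> 3) & 1, (word >> 1) & 0b11
```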

Presence + port metadata (4x 74LS670):

Each 670 word (4 bits) holds presence+port for 2 offsets: {presence_N:1, port_N:1, presence_N+1:1, port_N+1:1}. Read address = [frame_id:2]. Output bits selected by offset[2:0] via bit-select mux.

670 chip 0 (offsets 0-1): word[frame_id] = {pres0, port0, pres1, port1}
670 chip 1 (offsets 2-3): word[frame_id] = {pres2, port2, pres3, port3}
670 chip 2 (offsets 4-5): word[frame_id] = {pres4, port4, pres5, port5}
670 chip 3 (offsets 6-7): word[frame_id] = {pres6, port6, pres7, port7}

offset[2:1] selects chip, offset[0] selects which pair of bits within the 4-bit output (a 2:1 mux -- one gate).

The 670's simultaneous read/write is critical: during stage 3, when a first operand stores and sets presence, the write port updates the presence 670 while the read port remains available for the next pipeline stage's lookup. No read-modify-write sequencing needed.

Bit select mux (1-2 chips):

Offset-based selection of the relevant presence and port bits from the 670 outputs.
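
The chip-select and bit-select path, as a behavioural sketch (Python; bit positions within the 4-bit word are as listed in the chip map above):

```python
# Presence+port metadata: 4 chips x 4 words x 4 bits. Word layout per
# the chip map: {pres_N:1, port_N:1, pres_N+1:1, port_N+1:1}.
meta = [[0] * 4 for _ in range(4)]    # meta[chip][frame_id] = 4-bit word

def read_meta(frame_id: int, offset: int):
    """offset[2:1] selects the chip; offset[0] is the 2:1 bit-pair mux."""
    word = meta[(offset >> 1) & 0b11][frame_id]
    if offset & 1:
        return (word >> 1) & 1, word & 1          # {pres_N+1, port_N+1}
    return (word >> 3) & 1, (word >> 2) & 1       # {pres_N, port_N}

def set_presence(frame_id: int, offset: int, port: int) -> None:
    """First-operand store: the 670 write port sets presence while the
    read port stays available to the next pipeline stage."""
    chip = (offset >> 1) & 0b11
    shift = 0 if (offset & 1) else 2              # position of the pair
    meta[chip][frame_id] |= (0b10 | port) << shift
```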

Chip Budget#

Component                  Chips  Function
-------------------------  -----  ----------------------------
act_id -> frame_id lookup  2      74LS670, indexed by act_id
Presence + port metadata   4      74LS670, indexed by frame_id
Bit select mux             1-2    offset-based selection
Total match metadata       ~8

SC Register File (Mode-Switched)#

During dataflow mode, the PE uses act_id resolution and presence metadata constantly but the SC register file is idle (no SC block executing). During SC mode, the PE uses the register file constantly but act_id lookup and presence tracking are idle (SC block has exclusive PE access; no tokens enter matching).

Some of the 670s can be repurposed for register storage during SC mode. The exact mapping depends on the SC block design:

  • The 4 presence+port 670s (indexed by frame_id in dataflow mode) can be re-addressed by instruction register fields during SC mode, providing 4 chips x 4 words x 4 bits = 64 bits of register storage. Combined across chips, this gives 4 registers x 16 bits (4 bits per chip, 4 chips for width).

  • With additional mux logic, all 6 670s (the act_id lookup pair plus the 4 presence+port chips) could provide 6 registers x 16 bits during SC mode -- provided the act_id pair does not need to remain active for frame lifecycle management.

The act_id lookup 670s may need to remain in their dataflow role even during SC mode if the PE must handle frame control tokens (ALLOC/FREE) arriving during SC block execution. Whether to share them depends on the SC block entry/exit protocol.

The Predicate Slice#

One of the 670s can be permanently dedicated as a predicate register rather than participating in the mode-switched pool:

  • 4 entries x 4 bits = 16 predicate bits, always available
  • Useful for: conditional token routing (SWITCH), loop termination flags, SC block branch conditions, I-structure status flags
  • Does not reduce the metadata capacity significantly: the remaining 3 presence+port 670s still cover 6 of the 8 matchable offsets; the 2 uncovered offsets can fall back to SRAM-based presence or simply constrain the assembler to 6 dyadic offsets per frame

The predicate register is always readable and writable regardless of mode, since it's a dedicated chip with its own address/enable lines. Instructions can test or set predicate bits without going through the matching store or the ALU result path.

Mode Switching#

When transitioning from dataflow mode to SC mode:

  1. Save metadata from the shared 670s to spill storage.
  2. Load initial SC register values (matched operand pair that triggered the SC block) into the 670s.
  3. Switch address mux: 670 address lines now driven by instruction register fields instead of frame_id / act_id.
  4. Switch IRAM to counter mode: sequential fetch via incrementing counter rather than token-directed offset.

When transitioning back:

  1. Emit final SC result as token (last instruction with OUT=1).
  2. Restore metadata from spill storage to the 670s.
  3. Switch address mux back to frame_id / act_id addressing.
  4. Resume token processing from input FIFO.

Spill Storage Options#

Metadata from the shared 670s (~64-96 bits depending on how many are shared) needs temporary storage during SC block execution.

Option A: Shift registers. 2x 74LS165 (parallel-in, serial-out) for save + 2x 74LS595 (serial-in, parallel-out) for restore. Total: 4 chips. Save/restore takes ~12 clock cycles each.

Option B: Dedicated spill 670. One additional 74LS670 (4x4 bits) holds 16 bits per save cycle; need ~4-6 write cycles to save all shared chips' contents. Total: 1 chip, ~4-6 cycles per save/restore.

Option C: Spill to frame SRAM. During SC mode, the frame SRAM has bandwidth available (no match operand reads). Write the 670 metadata contents into a reserved region of the frame SRAM address space. No extra chips needed. ~4-6 SRAM write cycles to save, ~4-6 to restore. The SRAM is single-ported but there's no contention because the pipeline is paused during mode switch.

Recommended: Option C. Zero additional chips. The save/restore overhead of ~4-6 cycles per transition is negligible compared to the SC block's execution savings (EM-4 data: 23 clocks pure dataflow vs 9 clocks SC for Fibonacci, so even with ~10 cycles of mode switch overhead, you break even at ~5-7 SC instructions).

10. SRAM Configuration and Memory Map#

Unified SRAM Chip Pair#

The PE uses a single 32Kx8 chip pair (2 chips for 16-bit data width) for both IRAM and frame storage, with address partitioning via a single decode bit. The recommended part is the AS6C62256 (55 ns, 32Kx8, DIP-28) or equivalent. 55 ns access time fits comfortably within a 200 ns clock period at 5 MHz, with margin for address setup and data hold.

The unified SRAM approach keeps chip count low: one chip pair per PE serves both IRAM and frame storage, avoiding the chip proliferation that separate matching store and IRAM memories would require.

Address Map#

v0 address space (simple decode):

  IRAM region:   [0][offset:8]              instruction templates
                  offset from token
                  capacity: 256 instructions (512 bytes)

  Frame region:  [1][frame_id:2][slot:6]    per-activation storage
                  frame_id from tag store resolution
                  capacity: 4 frames x 64 slots = 256 entries (512 bytes)

Total v0 SRAM utilisation: IRAM 512 bytes, frame 512 bytes -- 1 KB used out of a 32Kx8 chip pair (64 KB). Ample room for future expansion without changing chips. See Future: Bank Switching with 74LS610 (section 14) for the upgrade path when programs exceed 256 instructions per PE.
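
The decode, as a one-function sketch (Python; the keyword names are illustrative):

```python
def sram_address(region: int, *, offset: int = 0,
                 frame_id: int = 0, slot: int = 0) -> int:
    """v0 unified-SRAM decode. Bit 8 is the region bit:
    IRAM = [0][offset:8], frame = [1][frame_id:2][slot:6]."""
    if region == 0:
        assert 0 <= offset < 256
        return offset
    assert 0 <= frame_id < 4 and 0 <= slot < 64
    return (1 << 8) | (frame_id << 6) | slot
```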

Shared SRAM Arbitration#

The unified SRAM chip pair is shared between three access patterns:

  • Pipeline IRAM reads (stage 2, instruction fetch): high frequency, performance-critical
  • Pipeline frame reads/writes (stages 3 and 5): high frequency, performance-critical
  • PE-local write tokens (IRAM and frame loading): low frequency, can tolerate delay

Arbitration approach: PE-local writes execute when the frame SRAM is not busy with compute pipeline accesses (natural gaps between pipeline stages). When no gap is available, writes queue and execute during the next idle cycle. Hardware cost: mux on SRAM address/data buses + write-enable gating + stall signal to pipeline. Roughly 5-8 TTL chips.

IRAM vs frame contention: in v0, IRAM and frame share one SRAM chip pair via address partitioning (region bit in the address). Stage 2 (IRAM read) and stage 3/5 (frame read/write) access different address regions but contend for the same physical chip. The pipeline controller ensures only one stage accesses the SRAM per cycle. With the natural pipeline spacing, this rarely causes stalls -- see the frame SRAM contention model in Pipeline Stall Analysis (section 11).

Upgrade path: separating IRAM and frame onto independent SRAM chip pairs eliminates all inter-region contention. Stage 2 (IRAM) and stage 3/5 (frame) can access their respective chips in the same cycle.

Async-compatible arbitration: defined as request/grant interface. Synchronous implementation: priority mux resolved on clock edge. Async implementation: mutual exclusion element (Seitz arbiter). Interface is the same in both cases. See network-and-communication.md for clocking discipline.

11. Pipeline Stall Analysis#

The Frame SRAM Contention Problem#

With Approach C (670 lookup), act_id -> frame_id resolution is combinational (~35 ns via 670 read port), and the presence/port check is also combinational (~35 ns from a second set of 670s). There is no read-modify-write on SRAM for metadata -- metadata lives entirely in the 670 register files.

The primary bottleneck is frame SRAM contention between stage 3 and stage 5. Both stages access the same single-ported SRAM chip pair:

  • Stage 3 reads/writes operand data (dyadic match) and reads constants (modes with has_const=1).
  • Stage 5 reads destinations (modes 0-3), or writes results back to the frame (sink modes 6-7).

When two pipelined tokens have stage 3 and stage 5 active in the same cycle, the SRAM can serve only one. The other stalls.

The Pipeline Hazard#

The classic RAW hazard still exists but takes a different form. Two consecutive tokens targeting the same frame slot (e.g., two mode 7 read-modify-write operations on the same accumulator slot) create a data dependency: the second token's stage 3 read must see the first token's stage 5 write.

Detection requires comparing (act_id, fref) of the incoming token against in-flight pipeline latches at stages 3-5. Hardware cost: ~2 chips (9-bit comparator + AND gate). Alternatively, the assembler can guarantee this never happens by never emitting consecutive mode 7 tokens to the same slot on the same PE.

This hazard is statistically uncommon in dataflow execution. Two operands arriving back-to-back at the exact same frame slot requires coincidental timing. The bypass path is cheap insurance that fires infrequently.

SRAM Contention Model#

The frame SRAM chip is single-ported (one access per clock cycle at 5 MHz with 55 ns SRAM). The primary stall source is contention between stage 3 (frame reads for operand data and constants) and stage 5 (frame reads for destinations, or frame writes for sink modes).

Contention arises only when:

  • Token A is at stage 5, needing a frame SRAM read (dest) or write (sink), AND
  • Token B is at stage 3, needing a frame SRAM read (match operand, constant, or tag word).

Contention does NOT arise when:

  • Token A's stage 5 is mode 4/5 (change_tag -- no SRAM access).
  • Token B's stage 3 is zero-cycle (monadic no-const, or match data in register file with no const).
  • Token A was a dyadic miss (terminated at stage 3, never reaches stage 5).

Cycle Counts by Instruction Type#

Approach C (74LS670 lookup, recommended v0):

                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)      1     1     0     1     0     3
monadic mode 0 (dest only)     1     1     0     1     1     4
monadic mode 6 (sink)          1     1     0     1     1     4
monadic mode 1 (const+dest)    1     1     1     1     1     5
monadic mode 7 (RMW)           1     1     1     1     1     5
dyadic miss                    1     1     1     --    --    3
dyadic hit, mode 0             1     1     1     1     1     5
dyadic hit, mode 1             1     1     2     1     1     6
dyadic hit, mode 3 (fan+const) 1     1     2     1     2     7

Stage 3 breakdown for Approach C:

  • Dyadic hit: 1 SRAM cycle to read stored operand (frame_id and presence already known from 670). +1 cycle for constant if has_const=1.
  • Dyadic miss: 1 SRAM cycle to write operand data. 670 write port sets presence bit combinationally in parallel.
  • Monadic: 0 SRAM cycles (no match), +1 for constant if has_const=1.
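
The Approach C totals can be reproduced by a small cycle model (Python; a sketch that mirrors the table above, not a timing simulator):

```python
def cycles_c(kind: str, mode: int) -> int:
    """Per-token cycles under Approach C: stg1 + stg2 are always 1 each;
    stg3 and stg5 depend on match outcome and output mode."""
    if kind == "dyadic_miss":
        return 3                                  # stg1 + stg2 + operand write
    stg3 = 1 if kind == "dyadic_hit" else 0       # stored-operand read
    stg3 += 1 if mode in (1, 3, 5, 7) else 0      # constant read
    stg5 = {0: 1, 1: 1, 2: 2, 3: 2, 4: 0, 5: 0, 6: 1, 7: 1}[mode]
    return 2 + stg3 + 1 + stg5                    # + stg4 (ALU) + output
```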

Approach B (register-file match pool):

                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)      1     1     0     1     0     3
monadic mode 0 (dest only)     1     1     0     1     1     4
monadic mode 6 (sink)          1     1     0     1     1     4
monadic mode 1 (const+dest)    1     1     1     1     1     5
monadic mode 7 (RMW)           1     1     1     1     1     5
dyadic miss                    1     1     1     --    --    3
dyadic hit, mode 0             1     1     1     1     1     5
dyadic hit, mode 1             1     1     2     1     1     6
dyadic hit, mode 3 (fan+const) 1     1     2     1     2     7

Approaches B and C produce identical single-token cycle counts. The difference emerges under pipelining: Approach B's match data never touches the frame SRAM (operands stored in a dedicated register file), so stage 3's only SRAM access is the constant read. This reduces stage 3 vs stage 5 SRAM contention.

Approach A (set-associative tags in SRAM, minimal chips):

                                stg1  stg2  stg3  stg4  stg5  total
monadic mode 4 (no frame)      1     1     0     1     0     3
monadic mode 0 (dest only)     1     1     0     1     1     4
monadic mode 6 (sink)          1     1     0     1     1     4
monadic mode 1 (const+dest)    1     1     1     1     1     5
monadic mode 7 (RMW)           1     1     1     1     1     5
dyadic miss                    1     1     2     --    --    4
dyadic hit, mode 0             1     1     2     1     1     6
dyadic hit, mode 1             1     1     3     1     1     7
dyadic hit, mode 3 (fan+const) 1     1     3     1     2     8

Approach A adds 1 extra SRAM cycle per dyadic operation (tag word read + associative compare) because act_id resolution is not combinational.

Pipeline Overlap Analysis#

With single-port frame SRAM at 5 MHz, the pipeline controller must arbitrate between stage 3 and stage 5. When both need SRAM in the same cycle, stage 3 stalls.

Approach B, two consecutive dyadic-hit mode 1 tokens:

cycle 0:  A.stg1
cycle 1:  A.stg2 (IRAM)
cycle 2:  A.stg3 match (reg file)  -- frame SRAM FREE
cycle 3:  A.stg3 const (SRAM)
cycle 4:  A.stg4 (ALU)             -- frame SRAM FREE
cycle 5:  A.stg5 dest (SRAM)       B.stg3 match (reg file) -- NO CONFLICT
cycle 6:  (A done)                  B.stg3 const (SRAM)
cycle 7:                            B.stg4 (ALU)
cycle 8:                            B.stg5 dest (SRAM)      -- NO CONFLICT

Token spacing: 4 cycles. Approach A under the same conditions: ~6-7 cycles due to additional SRAM contention in stage 3.

Throughput Summary#

Per PE, at 5 MHz, single-port frame SRAM:

Instruction mix profile                           Approach A  Approach C  Approach B
------------------------------------------------  ----------  ----------  ----------
Monadic-heavy (mode 0/4/6)                        ~1.25 MIPS  ~1.67 MIPS  ~1.67 MIPS
Mixed (40% dyadic mode 1, 30% monadic, 30% misc)  ~833 KIPS   ~1.25 MIPS  ~1.25 MIPS
Dyadic-heavy with constants                       ~714 KIPS   ~1.00 MIPS  ~1.00 MIPS
Worst case (mode 3, const+fanout)                 ~625 KIPS   ~714 KIPS   ~714 KIPS

4-PE system: multiply by 4. Realistic mixed workload: ~3.3-5.0 MIPS (A), ~5.0-6.7 MIPS (C), or ~5.0-6.7 MIPS (B). For reference: the original Amamiya DFM prototype (TTL, 1982) achieved 1.8 MIPS per PE. EM-4 prototype (VLSI gate array, 1990) achieved 12.5 MIPS per PE. This design sits between the two, closer to the DFM, which is historically appropriate for a discrete TTL build.
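
The throughput figures follow directly from the cycle counts: MIPS = f_clk / average cycles per token. A quick check (Python):

```python
F_CLK = 5_000_000   # 5 MHz PE clock

def mips(avg_cycles_per_token: float) -> float:
    """Per-PE throughput in MIPS for a given average token cost."""
    return F_CLK / avg_cycles_per_token / 1e6

# Approach C: mixed profile averages ~4 cycles/token -> ~1.25 MIPS;
# monadic-heavy ~3 cycles/token -> ~1.67 MIPS.
```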

Pipeline Timing by Era#

With the 670-based matching subsystem (Approach C), act_id resolution and presence/port checking are combinational (~35-70 ns) regardless of era. These never become the timing bottleneck.

The era-dependent part is SRAM access time for frame reads and writes. This determines how many SRAM operations fit per clock cycle and thus how much stage 3 vs stage 5 contention exists.

1979-1983 (5 MHz, 55 ns SRAM):

670 metadata: combinational (~35-70 ns), well within 200 ns cycle
Frame SRAM: one access per 200 ns cycle (55 ns access + setup/hold margin)
Bottleneck: frame SRAM single-port, stage 3 vs stage 5 contention
SC block throughput: ~1 instruction per clock (670 dual-port)
Overall token throughput: ~1 token per 3-5 clocks (pipelined, mode-dependent)

1984-1990 (5-10 MHz, dual-port SRAM):

670 metadata: combinational (unchanged)
Frame SRAM: dual-port (IDT7132 or similar), port A for stage 3, port B for stage 5
Bottleneck: eliminated -- both stages access SRAM simultaneously
SC block throughput: ~1 instruction per clock
Overall token throughput: approaches 1 token per 3 clocks for most modes

Dual-port SRAM eliminates the primary stall source. The pipeline becomes instruction-latency-limited rather than SRAM-contention-limited.

Modern parts (5 MHz clock, 15 ns SRAM):

670 metadata: combinational (unchanged)
Frame SRAM: 15 ns access, ~13 accesses fit in 200 ns cycle
Practical: 2-3 sub-cycle accesses via time-division multiplexing
Bottleneck: none -- frame SRAM has excess bandwidth
Token throughput: 1 token per 3 clocks (pipeline-stage-limited, not SRAM-limited)

With 15 ns AS7C256B-15PIN (DIP, currently available at ~$3), two sub-cycle accesses fit within a 200 ns clock period. This achieves TDM-like parallelism without additional MUX logic, effectively giving the pipeline a dual-port view of a single-port chip.

Integrated (on-chip SRAM, sub-ns access):

670 equivalent: on-chip multi-ported register file, ~200 transistors
Frame SRAM: on-chip, sub-cycle access trivially
Token throughput: 1 per 3 clocks, potentially faster with deeper pipelining

12. SC Blocks and Execution Modes#

PE-to-PE Pipelining#

When multiple PEs are chained for software-pipelined loops, the per-PE pipeline throughput determines the overall chain throughput.

With the pipelined design (1 token per 3-5 clocks depending on instruction mix and era), the inter-PE hop cost becomes the critical path for chained execution:

Interconnect                             Hop latency  Viable?
---------------------------------------  -----------  ------------------------------------
Shared bus (discrete build)              5-8 cycles   Marginal -- chain overhead dominates
Dedicated FIFO between adjacent PEs      2-3 cycles   Worthwhile for tight loops
On-chip wide parallel link (integrated)  1-2 cycles   Competitive with intra-PE SC block

For the discrete v0 build, dedicated inter-PE FIFOs (bypassing the shared bus) would enable PE chaining at reasonable cost. This is a low-chip-count addition (~2-4 chips per PE pair) that unlocks software-pipelined loop execution.

Loopback bypass. When a PE emits a token destined for itself (common in iterative computations), the token can be looped back internally without traversing the bus at all. See bus-interconnect-design.md for the loopback bypass design, which eliminates the bus hop latency entirely for self-targeted tokens.

The Execution Mode Spectrum#

The pipelined PE with frame-based storage, SC blocks, and predicate register supports a spectrum of execution modes, selectable by the compiler per-region:

| Mode | Pipeline behaviour | Throughput | When to use |
| --- | --- | --- | --- |
| Pure dataflow | Token -> ifetch -> match/frame -> exec -> output | 1 token / 3-7 clocks (mode-dependent) | Parallel regions, independent ops |
| SC block (register) | Sequential IRAM fetch, 670 register file | ~1 instr / clock | Short sequential regions |
| SC block + predicate | As above, with conditional skip/branch via predicate bits | ~1 instr / clock | Conditional sequential regions |
| PE chain (software pipeline) | Tokens flow PE0->PE1->PE2, each PE handles one stage | 1 iteration / PE-pipeline-depth clocks | Loop bodies across PEs |
| SM-mediated sequential | Tokens to/from SM for memory-intensive work | SM-bandwidth-limited | Array/structure traversal |

The compiler partitions the program graph and selects the best mode for each region. This spectrum is arguably more expressive than what a modern OoO core offers (which has exactly one mode: "pretend to be sequential, discover parallelism at runtime").

13. Instruction Residency and Code Loading#

Why This Matters#

Unlike Manchester, Amamiya, or Monsoon -- which either replicated the entire program into every PE's instruction memory or used very large per-PE instruction stores -- this design has a small, runtime-writable IRAM (256 entries per bank). Without bank switching, any program larger than a single PE's IRAM needs code loading at runtime, even under fully static PE assignment.

With bank switching (see section 14), each PE holds up to 4096 instructions across 16 banks using the same SRAM chips. This substantially reduces the pressure on runtime code loading -- most programs' full working set fits in the preloaded banks, and switching between function fragments costs a single register write instead of IRAM rewrite traffic. The code storage hierarchy and loader mechanisms below remain relevant for programs that exceed the banked capacity, but bank switching makes that the exception rather than the rule.

The 16-bit single-half instruction format provides good IRAM density: one instruction per SRAM address. The effective capacity with bank switching (4096 instructions) is substantial for the target workloads, using only a single SRAM chip pair per PE.

The reference architectures largely avoid the residency problem by throwing memory at it: Amamiya's 8KW/PE replicated instruction memory, Manchester's large instruction store, Monsoon's 64K-instruction frames. Bank switching gives us a comparable effective capacity (4K instructions) with much less hardware than full replication.

Proactive Loading (Primary Mechanism)#

The primary approach is software-managed prefetch: the compiler assigns a PE (typically the least-utilized one) to pull instruction pages from storage and load them onto the bus in advance of when they're needed. This is part of the program graph itself - the loader PE calls the exec SM instruction, which reads out pre-constructed tokens onto the bus.

This fits naturally into the dataflow paradigm:

  • The loader PE is just another participant in the token network
  • Its "inputs" are load requests (tokens from other PEs or the scheduler)
  • Its "outputs" are config write packets that load IRAM
  • The compiler can schedule prefetches to overlap with computation on other PEs
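
The loader PE's role can be sketched as a simple request-to-packets transformation. All field names (`target_pe`, `fragment`, `base_offset`, the packet dict shape) are illustrative inventions, not the real flit encoding:

```python
# Behavioural sketch of the loader PE: consume a load-request token,
# emit the config-write packets that fill the target PE's IRAM.
def loader_step(request, storage):
    """Turn one load request into a list of IRAM config-write packets.

    request: dict with 'target_pe', 'fragment', 'base_offset'
    storage: dict mapping fragment name -> list of instruction words
    """
    words = storage[request['fragment']]
    return [
        {'kind': 'config_write',
         'pe': request['target_pe'],
         'offset': request['base_offset'] + i,
         'word': w}
        for i, w in enumerate(words)]
```

In the real machine these packets would be the pre-constructed tokens that the exec SM instruction reads out onto the bus; here they are plain dicts for clarity.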

The loader PE could be dedicated (always running loader code) or could itself have its code swapped depending on system phase.

The Identity Problem: Miss Detection#

If code loading happens at runtime, the question arises: how does a PE know the code in its IRAM is the right code for an arriving token?

A simple validity bitmap (like the matching store presence bit) is not sufficient. It can tell you "something is loaded at offset 7" but not "the right instruction is loaded at offset 7." If a different function fragment has been loaded over a previous one, the IRAM slot is occupied by a valid-looking but wrong instruction. The token indexes directly into IRAM - there is no tag comparison against the token.

Several detection mechanisms are possible:

Option A: Fragment ID register. Each PE has a small tag register (or set of registers, one per IRAM page/region) that records which function fragment is currently loaded. Set by config writes during loading. Incoming tokens carry (or the system derives from the token's address) the expected fragment ID. The PE compares the token's expected fragment against the loaded fragment register:

  • Match -> proceed normally
  • Mismatch -> miss, trigger fetch
  • Hardware cost: one register + comparator per PE (or per IRAM region)
  • Requires fragment ID bits in the token or a derivation mechanism
  • Coarse-grained: one tag per PE or per page, not per instruction
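
The Option A check reduces to one comparison per token. This sketch assumes 64-entry IRAM regions and dict-shaped tokens purely for illustration; the real granularity (per PE, per page, per region) is one of the open choices above:

```python
# Option A sketch: fragment-ID registers, one per IRAM region.
# The 64-entry region size (offset >> 6) is an ASSUMPTION for the
# example, not a decided parameter.
def check_fragment(token, loaded_regions):
    """loaded_regions: list indexed by IRAM region, holding fragment IDs."""
    region = token['offset'] >> 6
    if loaded_regions[region] == token['fragment']:
        return 'hit'    # proceed into the pipeline
    return 'miss'       # trigger a fetch request
```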

Option B: Entry gate instruction. The compiler inserts a special instruction at each function body's entry point that verifies identity: "am I the function this activation expects?" Tokens arriving at non-entry instructions are assumed correct because they could only have reached that point by passing through a verified entry gate.

  • No per-instruction tags needed
  • Software-managed, compiler responsibility
  • Detection granularity is per-function-body, not per-instruction
  • In dataflow terms: the entry gate is a dyadic instruction whose left input is the activation token and whose right input is a "function loaded" token. If the function isn't loaded, the gate blocks (no match) until loading completes and a "loaded" confirmation token arrives.

Option C: Software-only invariant. No hardware miss detection. The loader protocol guarantees correctness: code is never overwritten while tokens are in flight targeting it. The throttle + drain approach (stall new activations, let existing ones complete, then overwrite IRAM) ensures the invariant.

  • Simplest hardware -- no detection circuitry at all
  • Most compiler/loader burden
  • Relies on correct coordination; bugs cause silent wrong execution
  • Viable for v0 where programs are small and manually verified
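
The throttle-and-drain invariant behind Option C can be stated as a small bookkeeping protocol. The counters here are an abstraction of whatever tracking the loader actually uses; class and method names are inventions:

```python
# Sketch of the throttle + drain invariant: code is only overwritten
# once no in-flight activation still targets it.
class FragmentLoader:
    def __init__(self):
        self.in_flight = {}      # fragment -> live activation count
        self.throttled = set()   # fragments refusing new activations

    def start_activation(self, frag):
        if frag in self.throttled:
            return False         # stalled: loader wants to replace this code
        self.in_flight[frag] = self.in_flight.get(frag, 0) + 1
        return True

    def finish_activation(self, frag):
        self.in_flight[frag] -= 1

    def safe_to_overwrite(self, frag):
        """Throttle new activations, then report whether drained."""
        self.throttled.add(frag)
        return self.in_flight.get(frag, 0) == 0
```

The invariant this encodes is exactly the software guarantee: once `safe_to_overwrite` returns true, no token in flight can target the old code, so the IRAM rewrite cannot cause silent wrong execution.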

These options are not mutually exclusive. v0 can start with Option C (software guarantee) and add hardware detection (A or B) later as programs grow beyond what manual verification can cover.

Miss Handling#

When a miss is detected (by whatever mechanism), the PE needs to handle a token that targets unloaded code. Two approaches:

Stall + fetch request: PE emits an exec token and stalls its input FIFO until the instruction arrives via config write. Simple, deterministic, but blocks all traffic to that PE during the fetch. Acceptable if misses are rare (proactive loading handles most cases) and fetch latency is bounded.

Recirculate + fetch request: PE emits a fetch-request token, puts the missed token back at the tail of its own input FIFO, and continues processing other tokens. The missed token retries later, hopefully after the instruction has been loaded. More complex but keeps the PE productive. Requires care to avoid FIFO fill-up with recirculated tokens.
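
The recirculate path can be sketched with a deque standing in for the input FIFO. The retry cap is an assumed safeguard against FIFO fill-up, not a mechanism from the design text:

```python
# Sketch of recirculate-on-miss: a missed token returns to the tail of
# the PE's own input FIFO while a fetch request is emitted, so other
# tokens keep flowing. max_retries is an ASSUMED safeguard.
from collections import deque

def process(fifo, is_loaded, emit_fetch, max_retries=8):
    """Drain one token from the FIFO; recirculate it on an IRAM miss."""
    token = fifo.popleft()
    if is_loaded(token):
        return token                        # dispatch into the pipeline
    if token.get('retries', 0) >= max_retries:
        raise RuntimeError('token starved waiting for code load')
    emit_fetch(token)
    token['retries'] = token.get('retries', 0) + 1
    fifo.append(token)                      # retry later
    return None
```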

v0 may not implement either; starting with the software-only invariant (Option C above) means misses don't happen by construction. Hardware miss handling is an evolutionary step as programs outgrow what static loading can guarantee.

14. Future: Bank Switching with 74LS610#

v0 supports 256 instructions per PE (8-bit offset) with simple address decode. When programs exceed 256 instructions, the 74LS610 memory mapper enables bank switching: 16 banks x 256 instructions = 4096 entries per PE without changing SRAM chips.

What the 610 Is and What It Enables#

The 74LS610 (TI memory mapper, originally for TMS9900 family) is an ideal fit for IRAM bank switching. Key properties:

  • 16 mapping registers, each 12 bits wide
  • 4-bit logical address input selects register -> 12-bit physical address output
  • Latch control (pin 28): outputs can be frozen while register contents change
  • ~40-50ns propagation delay (LS family), pipelineable with SRAM access
  • One chip per PE. Writes to mapping registers via data bus during config/bootstrap.

The '610 is planned for both IRAM and SM banking (both are future upgrades, not present in v0). Using it for IRAM banking is the same chip, same wiring pattern, different address domain. One '610 per PE for IRAM, one per SM.

The Socket Strategy#

The v0 board pre-wires SRAM address lines to a '610 socket with a jumper wire in place of the chip. When bank switching is needed, the '610 drops in with no board changes.

Address Space with Bank Switching#

With the '610 installed, the IRAM address becomes:

Logical:  [bank_select:4][offset:8]
                |
                v (74LS610)
Physical: [phys_bank:12][offset:8] = up to 20-bit SRAM address

In practice the physical address width is bounded by available SRAM chip capacity. With 8Kx8 SRAMs (13-bit address): the '610's 12-bit output is wider than needed -- only 5 bits of physical bank + 8-bit offset = 13 bits. This gives 32 physical banks of 256 instructions each (8192 instructions per PE, though address space constraints may limit this further).

The SRAM address map with bank switching:

  IRAM region:   [0][bank:4][offset:8]      bank-switched templates
                  bank from '610 mapper
                  capacity: 16 banks x 256 instructions = 4096 entries

  Frame region:  [1][frame_id:2][slot:6]    (unchanged)

Banking Workflow: MAP_PAGE and SET_PAGE Instructions#

Two instructions manage banking. Neither touches the token format.

  • map_page (monadic): writes a logical-to-physical mapping into one of the '610's 16 mapping registers. The register index and physical bank address come from frame constants. Used during bootstrap or runtime to establish which physical SRAM regions back which logical pages.

  • set_page (monadic): writes a 4-bit logical page selector into a PE-local latch. The latch feeds the '610's MA0-MA3 inputs. All subsequent IRAM fetches go through the selected logical page's mapped physical bank. One cycle to switch.

Banking workflow:
  1. Bootstrap: MAP_PAGE instructions establish mappings
     (logical page 0 -> physical region A, page 1 -> region B, etc.)
  2. Runtime: SET_PAGE selects the active logical page
  3. Latch -> '610 MA0-MA3 -> physical SRAM bank selection
  4. All IRAM reads now address the selected bank

Hardware cost: one 74LS175 (quad D flip-flop) as the page latch + the '610 itself.
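
The MAP_PAGE/SET_PAGE workflow above can be captured as a behavioural model. The widths follow the text (sixteen 12-bit mapping registers, a 4-bit page latch, 8-bit offsets); the class and method names are illustrative:

```python
# Behavioural model of the '610-based IRAM banking path: map_page
# writes a mapping register, set_page loads the page latch, and each
# fetch translates [page][offset] -> physical SRAM address.
class BankMapper:
    def __init__(self):
        self.regs = [0] * 16     # 74LS610 mapping registers (12-bit)
        self.page = 0            # 74LS175 page latch (4-bit)

    def map_page(self, logical_page, phys_bank):
        self.regs[logical_page & 0xF] = phys_bank & 0xFFF

    def set_page(self, logical_page):
        self.page = logical_page & 0xF

    def iram_addr(self, offset):
        """Physical SRAM address for an 8-bit instruction offset."""
        return (self.regs[self.page] << 8) | (offset & 0xFF)
```

Note that a bank switch really is one latch write: `set_page` changes no memory contents, only which mapping register subsequent fetches route through.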

Trade-offs and Costs#

  • Bank switch affects all in-flight tokens targeting this PE at offsets in the old bank. The compiler (or scheduler) must drain tokens for the old bank before switching -- same throttle-and-drain protocol as code overwrite, but switching is instantaneous once drained (write latch, done).
  • set_page is sequentially scoped: it affects all subsequent fetches, not just one activation. The compiler must ensure that concurrent activations on the same PE agree on the active page, or use set_page as a barrier between phases.
  • Total capacity per PE is bounded by SRAM chip size, not the '610 (which can address far more than any reasonable IRAM).
  • Pages are a pure address-mapping primitive. The compiler decides what they mean -- per-function, per-phase, or any other grouping. The hardware doesn't enforce or assume any relationship between pages and function bodies.

15. Dynamic Scheduling: Future Capability#

The architecture is policy-agnostic on whether PE assignment is fully static (compiler decides everything) or partially dynamic (a scheduler places activations at runtime). The mechanism (tokens carry destination PE + activation_id, PEs have writable IRAM, frames are allocated and addressed by act_id) supports either policy.

Static Assignment (v0)#

The compiler decides everything at compile time: each PE gets specific function fragments loaded at bootstrap, with no runtime decisions about placement. This is the simplest policy -- no scheduler hardware or firmware is needed. For programs that exceed IRAM capacity, the compiler schedules exec instructions or similar.

Dynamic Scheduling (future)#

A CCU-like scheduler (could be firmware on a dedicated PE, a small fixed-function unit, or distributed logic) decides at runtime where to place new activations, based on PE load, IRAM contents, etc.

The tension: dynamic scheduling wants large IRAM (so the target PE already has the function body loaded), while cheap PEs want small IRAM. Amamiya resolved this by replicating the entire program into every PE's IRAM -- one approach, but one that costs a lot of memory.

The middle ground is a working set model: keep hot function bodies loaded, swap cold ones via PE-local write tokens (prefix 011+01) when the scheduler wants to place an activation on a PE that doesn't have the code yet.

  • Miss latency: significant (a network round-trip to load code from SM or external storage) -- much worse than Amamiya's "already there."
  • Miss rate: depends on the scheduler's affinity policy. If the scheduler prefers placing activations on PEs that already have the code, misses should be rare. A small "IRAM directory" (which PE has which function body loaded) lets the scheduler make this decision cheaply.
  • Coordination: drain in-flight tokens for the old fragment before overwriting IRAM. Throttle stalls new activations for that fragment, existing ones complete, then IRAM is overwritten -- effectively a coarse-grained context switch.
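
The affinity policy and IRAM directory combine into a simple placement rule, sketched below. The directory and load-metric shapes are assumptions for illustration; the real scheduler could be firmware, fixed-function, or distributed:

```python
# Sketch of scheduler affinity: prefer a PE whose IRAM directory
# already lists the function body (no load cost); otherwise fall back
# to the least-loaded PE and pay the code-load miss.
def place_activation(fragment, directory, load):
    """directory: PE -> set of loaded fragments; load: PE -> queue depth.

    Returns (chosen PE, True if the code is already resident).
    """
    have_code = [pe for pe in directory if fragment in directory[pe]]
    candidates = have_code or list(directory)
    return min(candidates, key=lambda pe: load[pe]), bool(have_code)
```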

16. Open Design Questions#

  1. Approach selection for v0. Approach C (670 lookup) is recommended as the starting point: combinational metadata at ~8 chips. Approach B (register-file match pool) eliminates the last SRAM cycle from matching at the cost of ~16-18 chips. Approach A (SRAM tags) is the fallback if 670 supply is a problem. The choice depends on whether chip count or pipeline throughput is the binding constraint for the initial build. See section 11 for the full approach comparison and cycle counts.

  2. Frame SRAM contention under realistic workloads. The pipeline stall analysis in section 11 uses worst-case consecutive tokens. Simulate representative dataflow programs in the behavioural emulator to measure actual stage 3 vs stage 5 contention rates and determine whether dual-port SRAM or faster SRAM is justified for v0.

  3. SC block register capacity. With 4-6 registers available from repurposed 670s (depending on how many are shared), what is the longest SC block the compiler can generate before register pressure forces a spill? Evaluate empirically on target workloads.

  4. Predicate register encoding. Document specific instruction encodings for predicate test/set/clear, and how SWITCH instructions interact with predicate bits. The predicate register may subsume some of the cancel-bit functionality planned for token format.

  5. Mode switch latency measurement. Build a cycle-accurate model of the save-to-SRAM / restore-from-SRAM path and determine exact overhead. Target: <=10 cycles per transition.

  6. Assembler stall analysis. The assembler can statically detect instruction pairs whose output tokens may cause frame SRAM contention on the same PE. For hot loops, the assembler can insert mode 4 NOP tokens (zero frame access) as pipeline padding. Validate static stall estimates against emulator simulation, since runtime arrival timing depends on network latency and SM response times.

  7. 8-offset matchable constraint validation. The 670-based presence metadata limits dyadic instructions to offsets 0-7 per frame. Evaluate whether this is sufficient for compiled programs. If tight, the hybrid upgrade path (offset[3]=0 checks 670s, offset[3]=1 falls back to SRAM tags) adds ~4-6 chips of SRAM tag logic for offsets 8-15+.

  8. Exact opcode assignments: 5-bit opcode space is sufficient (CM and SM independent). Need to assign FREE_FRAME, ALLOC_REMOTE, and verify that existing ALU operations fit with the revised mode semantics.

  9. SC arc execution details: the frame model supports strongly-connected arc execution (latch frame_id across sequential blocks); the pipeline sequencing and block-entry detection logic need design work. Deferred past v0 but should not be precluded by any v0 decisions.

  10. IRAM bank switching interaction with frames: switching IRAM banks changes the instruction templates but not the frame contents. Tokens in flight targeting the old bank's instructions will execute against new instructions after the switch. The drain-before-switch protocol applies unchanged.

  11. Frame slot count (fref width): 6-bit fref = 64 slots is the current proposal. Real compiled programs may show that 32 suffices (freeing 1 bit for other uses) or that 64 is tight (requiring creative aliasing or more aggressive function splitting).

  12. Function splitting heuristics: how does the compiler decide where to split? Minimize cross-PE traffic? Balance frame usage across PEs? Hardware constraints (frame count, matchable offset count) drive it.

  13. Instruction identity detection: how does the PE know loaded code matches what an arriving token expects? Fragment ID register vs entry gate instruction vs software-only guarantee. See Instruction Residency section. v0 starts with software-only invariant (Option C).

  14. Miss handling mechanism: stall + fetch request vs recirculate + fetch request. v0 may not implement either, relying on the software-only invariant (Option C).

  15. SM flit 1 bit alignment: the frame slot packing for SM targets ([SM_id:2][addr][spare]) must align with the SM bus flit 1 format so that the output stage can concatenate frame bits + EEPROM opcode without field rearrangement. The exact spare bit positions depend on the final SM bus encoding; verify alignment after sm-design.md opcode assignments are frozen.

  16. PE-local write slot field width: the current flit 1 format packs 5 bits of slot index into the PE-local write token. With 64 frame slots (6-bit fref), one bit is missing. Options: (a) limit PE-local writes to slots 0-31, (b) steal a spare bit from act_id or elsewhere, (c) use flit 2 for extended addressing.

  17. Indexed SM ops: address overflow. ALU base + index computation may overflow the 10-bit (tier 1) or 8-bit (tier 2) address range. The PE does not check for overflow -- the SM receives the truncated address. The assembler should warn on statically-detectable overflow risks. Runtime overflow detection is an SM-side concern.

  18. EXTRACT_TAG offset source. The return offset in the packed tag could come from (a) a frame constant (flexible, costs 1 frame slot), (b) a fixed hardware-derived value (e.g. current instruction offset + 1), or (c) an IRAM-encoded small immediate (not available in the current 16-bit format). Option (a) is consistent with the frame-everything philosophy; option (b) saves a slot but limits return point placement.

17. References#

  • architecture-overview.md - Token format, flit-1 bit allocation, module taxonomy, bus framing protocol, function call design overview.
  • alu-and-output-design.md - ALU operation details, output routing modes, flit 2 source mux, and SM flit assembly.
  • sm-design.md - SM opcode table, extended addressing, CAS handling, I-structure semantics.
  • bus-interconnect-design.md - Physical bus implementation: shared and split AN/CN/DN topologies, node interfaces, arbitration, loopback, backpressure, chip counts.
  • network-and-communication.md - Routing topology, clocking discipline.
  • io-and-bootstrap.md - Bootstrap loading, I/O subsystem design.
  • sram-availability.md - Component availability for period-appropriate SRAMs.
  • 17407_17358.pdf - DFM evaluation: OM structure (1024 CAM blocks, 32 words each, 8 entries of 4 words, 4-way set-associative within entry). Function activation via CCU requesting least-loaded PE, then getting instance name from target PE's free instance table. IM is 8KW/PE, identical across all PEs. Critical for understanding why Amamiya's OM is so large and why ours can be much smaller.
  • gurd1985.pdf - Manchester matching unit: 16 parallel hash banks, 64K tokens each, 54-bit comparators, 180ns clock period. Overflow unit emulated in software. Shows the cost of general-purpose matching.
  • Dataflow_Machine_Architecture.pdf - Veen survey: matching store analysis, tag space management, overflow handling across multiple architectures.
  • amamiya1982.pdf - Original DFM paper: semi-CAM concept, IM/OM split, execution control mechanism with associative IM fetch. Partial function body execution (begin executing when first argument arrives, don't wait for all arguments).
  • EM-4 prototype papers - Direct matching, strongly-connected blocks, register-based advanced control pipeline. Informs SC arc upgrade path.
  • Iannucci (1988) - Frame-based matching, continuation model, suspension semantics. Historical precedent for per-activation frame storage.
  • Monsoon / TTDA papers - Explicit token store, frame-based execution, I-structure semantics.