Dynamic Dataflow CPU: SM (Structure Memory) Design#

Covers the SM interface protocol, operation set, banking scheme, address space extension, memory tiers, wide pointers, bulk operations, EXEC/bootstrap, and hardware architecture.

See architecture-overview.md for module taxonomy and token format. See network-and-communication.md for how SM connects to the bus. See bus-architecture-and-width-decoupling.md for bus width rationale. See sm-and-token-format-discussion.md for extended discussion of design decisions including DRAM latency context, bootstrap unification, and address space distribution.

Role#

SM stores structured data (arrays, lists, heap), performs operations on it, and provides memory-mapped IO.

SM is synchronising memory — not just a data store. Tier 1 cells have presence state (empty/full), and reads to empty cells are deferred until a write arrives. This gives implicit producer-consumer synchronization without locks or explicit message-passing: the write is the signal. This is the dataflow architecture's answer to shared mutable state.

IO is memory-mapped into SM address space. An SM (typically SM00 at v0) maps IO devices into its address range. I-structure semantics provide natural interrupt-free IO: a READ from an IO device that has no data defers until data arrives, triggering the receiving node in the dataflow graph.

From a CM's perspective: send a bit[15]=1 request, get a CM result token back eventually. Split-phase, asynchronous relative to the requesting CM. A READ may return immediately (cell is full) or later (cell is empty, deferred until written).

See em4-analysis.md for context on why dedicated synchronizing memory matters vs flattening structure storage into PE-local SRAM.

Internal data path: 16-bit, matching the SRAM word size. Tokens arrive as 2 flits on the 16-bit external bus, are deserialized at the SM input FIFO into a reassembled logical token, processed, and result tokens are serialized back into flits at the output. The SM never needs an internal data path wider than 16 bits for data operations.

Interface Protocol#

Stateless request handling: the request token carries its own return routing info in the bits that are unused by that operation type. SM does not maintain persistent per-request state, outside of caching the return token template when a read is deferred.

Request Format (bit[15]=1, 2-flit standard)#

All SM requests arrive as 2 flits on the 16-bit bus and are reassembled at the SM input FIFO before processing.

Flit 1 (common to all SM ops):
  [1][SM_id:2][op_base:3][addr/op_ext:2][addr_low:8]              = 16 bits

  15 bits available after the SM discriminator bit. SM_id (2 bits) selects
  the target SM. The remaining 13 bits encode opcode and address using
  variable-width encoding:

  When op[2:1] != 11: 3-bit opcode, 10-bit addr (1024 cells)
  When op[2:1] == 11: extends to 5-bit opcode, 8-bit payload (256 cells)

  Decode signal: op[2] AND op[1] — one gate.
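
The flit 1 layout above can be sketched in software as a pack/unpack pair. This is a minimal illustration, assuming the fields sit MSB-first exactly as drawn; the function names are hypothetical and not part of the design.

```python
SM_FLAG = 1 << 15  # bit[15]=1 marks an SM request


def pack_tier1(sm_id: int, op3: int, addr10: int) -> int:
    """Flit 1 for a 3-bit opcode with a 10-bit address (op[2:1] != 11)."""
    if not (0 <= sm_id < 4 and 0 <= op3 < 0b110 and 0 <= addr10 < 1024):
        raise ValueError("field out of range for the 3-bit tier")
    return SM_FLAG | (sm_id << 13) | (op3 << 10) | addr10


def pack_tier2(sm_id: int, op5: int, payload8: int) -> int:
    """Flit 1 for a 5-bit opcode with an 8-bit payload (op[2:1] == 11)."""
    if not (0 <= sm_id < 4 and 0b11000 <= op5 <= 0b11111 and 0 <= payload8 < 256):
        raise ValueError("field out of range for the 5-bit tier")
    return SM_FLAG | (sm_id << 13) | (op5 << 8) | payload8


def unpack_flit1(flit: int) -> tuple[int, int, int]:
    """Return (sm_id, bus_opcode, addr_or_payload)."""
    if not flit & SM_FLAG:
        raise ValueError("not an SM request (bit 15 clear)")
    sm_id = (flit >> 13) & 0b11
    op_base = (flit >> 10) & 0b111
    if (op_base & 0b110) == 0b110:  # op[2] AND op[1]: the one-gate decode
        return sm_id, (flit >> 8) & 0b11111, flit & 0xFF
    return sm_id, op_base, flit & 0x3FF
```

A microcontroller test harness could use the same packing to build request flits before clocking them into the SM.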

Flit 2 varies by operation:

  WRITE: [data:16]
    Write data to address. No DN response unless cell was WAITING
    (deferred read satisfied — result token emitted using saved routing).

  READ / CLEAR / ALLOC / FREE: [return_routing:16]
    Flit 2 carries a **pre-formed CM token template**. The SM's result
    formatter latches this template, prepends it as the result's flit 1,
    and appends read data as flit 2. No bit-shuffling — the requesting
    CM does all format work upfront. The SM treats this as an opaque
    16-bit blob.
    (CLEAR/ALLOC/FREE may not need return routing — if no result token
    is emitted, flit 2 could be omitted or carry flags instead. TBD.)

  READ_INC / READ_DEC: [return_routing:16]
    Same as READ. Atomic ops always return the old value.

  CAS — compare-and-swap:
    Requires 3-flit extended format to carry expected_value, new_value,
    AND return routing:
      flit 2: [expected_value:16]
      flit 3: [new_value:8][return_routing:8] (or similar split — TBD)
    Alternatively, CAS could use two sequential 2-flit packets.
    Design TBD — CAS is the most complex SM operation.

3-flit extended addressing mode: for access to external RAM or memory-mapped IO address spaces, a 3-flit SM token provides wider addresses at the cost of one extra flit cycle. The EXT opcode (one of the 3-bit tier) signals 3-flit format.

Result Format (on DN, pre-formed CM token)#

SM uses the pre-formed CM token template from the request's flit 2 as the result's flit 1, and appends the read data as flit 2:

Result (2 flits):
  flit 1: [return_routing:16]   (opaque template from original request)
  flit 2: [fetched_data:16]

The template can encode any CM token format whose routing fits in 16 bits: dyadic wide, monadic normal, or monadic inline. This means SM read results can land directly in a matching store slot as one operand of a dyadic instruction.

Cell State Model (I-Structure Semantics)#

Each SM cell has a 2-bit state field in addition to its 16-bit data value.

State    Bits   Meaning
───────────────────────────────────────────────────────────
EMPTY      00   Never written, or cleared. READ defers.
RESERVED   01   Allocated but not yet written. READ defers.
FULL       10   Written. READ returns value immediately.
WAITING    11   Empty + a deferred READ is pending for this cell.

State Transitions#

                    WRITE
  EMPTY ──────────────────────────► FULL
    │                                 │
    │ READ                            │ READ
    ▼                                 ▼
  WAITING ────────────────────────► FULL   (WRITE satisfies deferred reader,
    (deferred read registered)             sends result, cell becomes FULL)
                                      │
                                      │ CLEAR
                                      ▼
                                    EMPTY

  EMPTY ──── ALLOC ──► RESERVED ──── WRITE ──► FULL
  FULL  ──── FREE  ──► EMPTY
  any   ──── CLEAR ──► EMPTY

Detailed transitions:

READ on EMPTY or RESERVED: cell transitions to WAITING. return routing is saved in the deferred read register. no result token emitted yet. if the deferred read register is already occupied (by a different cell), SM stalls and the request stays in the input FIFO until the pending deferred read is satisfied.
READ on FULL: result token emitted immediately. Cell stays FULL. Multiple reads of the same full cell are fine: all get the value.
READ on WAITING: error condition (second deferred read to same cell while first is pending). SM behaviour TBD. Options: stall, return error token, or drop silently. stall is safest for v0.
WRITE on EMPTY or RESERVED: data stored, cell transitions to FULL. no result token.
WRITE on WAITING: data stored, cell transitions to FULL. SM immediately uses the saved return routing from the deferred read register to emit a result token with the written data. deferred read register is freed.
WRITE on FULL: depends on design choice. options:
- overwrite silently: data replaced, cell stays FULL. simple, but hides bugs.
- error: SM emits an error/diagnostic signal. data not written. safe but requires error handling.
- overwrite + diagnostic flag: data replaced, diagnostic LED or counter incremented. best of both for v0.
CLEAR on any state: cell transitions to EMPTY. if cell was WAITING, the pending deferred read is cancelled (return routing discarded from deferred read register, no result token emitted). CLEAR never emits a result token.
ALLOC: transitions EMPTY->RESERVED. no effect on non-EMPTY cells.
FREE: transitions any->EMPTY. if cell was WAITING, cancels the deferred read (same as CLEAR).
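
The transitions above can be modelled as a small behavioural sketch. It assumes the v0 choices suggested in the text (stall on a second deferred read, overwrite plus diagnostic counter on WRITE to FULL) and represents a stall as an exception; the class and method names are hypothetical.

```python
from enum import IntEnum


class State(IntEnum):
    EMPTY = 0b00
    RESERVED = 0b01
    FULL = 0b10
    WAITING = 0b11


class Stall(Exception):
    """Request cannot proceed yet; in hardware it would wait in the input FIFO."""


class SMModel:
    def __init__(self, cells: int = 1024):
        self.state = [State.EMPTY] * cells
        self.data = [0] * cells
        self.deferred = None   # (cell_addr, return_routing) or None (depth 1)
        self.overwrites = 0    # diagnostic counter for WRITE on FULL

    def read(self, addr, routing):
        """Return (routing, data) if FULL, None if the read was deferred."""
        if self.state[addr] == State.FULL:
            return (routing, self.data[addr])
        if self.state[addr] == State.WAITING or self.deferred is not None:
            raise Stall("deferred read register occupied")
        self.state[addr] = State.WAITING
        self.deferred = (addr, routing)
        return None

    def write(self, addr, value):
        """Store data; return a result token if a deferred read was satisfied."""
        result = None
        if self.state[addr] == State.FULL:
            self.overwrites += 1
        elif self.state[addr] == State.WAITING:
            _, routing = self.deferred
            self.deferred = None
            result = (routing, value & 0xFFFF)
        self.data[addr] = value & 0xFFFF
        self.state[addr] = State.FULL
        return result

    def clear(self, addr):
        """CLEAR (and FREE, identical for now): cancel any deferred read."""
        if self.state[addr] == State.WAITING:
            self.deferred = None
        self.state[addr] = State.EMPTY

    def alloc(self, addr):
        if self.state[addr] == State.EMPTY:
            self.state[addr] = State.RESERVED
```

A result token is modelled as the pair (flit 1 = saved template, flit 2 = data), matching the result format described earlier.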

Presence Metadata Hardware#

4 bits per cell: [presence:2][is_wide:1][spare:1]

presence:2 - EMPTY/RESERVED/FULL/WAITING (checked on every operation)
is_wide:1 - tags this cell as part of a wide pointer pair. The SM checks this before deciding whether to also read the next cell for length metadata. See "Wide Pointers" section below.
spare:1 - reserved. Candidates: write-once flag, type tag, owner ID.

At 1024 cells = 4096 bits = 512 bytes. Using a byte-wide SRAM chip means bits 4-7 are physically present whether used or not. Committing 4 bits with 4 spare avoids needing to change the presence SRAM layout when additional per-cell metadata is added during testing.

Implementation:

Small SRAM alongside the data SRAM. Addressed in parallel, read/written on every operation. One 8-bit-wide SRAM chip covers 2 cells per byte (4 bits each) -> 512 bytes for 1024 cells, which easily fits.
Must support single-cycle read-modify-write (read state, decide action, write new state) within one clock. Achievable if the state SRAM access time is less than half the clock period (read in first half, write in second half; same half-clock RMW technique as the EM-4 matching stage).
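
As a rough illustration of the metadata layout, the sketch below packs [presence:2][is_wide:1][spare:1] into a nibble and stores two cells per byte. Which nibble holds the even cell is an assumption made here, and the helper names are hypothetical.

```python
CELLS = 1024
meta_sram = bytearray(CELLS // 2)  # two 4-bit entries per byte


def pack_meta(presence: int, is_wide: int, spare: int = 0) -> int:
    """Build the 4-bit entry [presence:2][is_wide:1][spare:1]."""
    return ((presence & 0b11) << 2) | ((is_wide & 1) << 1) | (spare & 1)


def _locate(cell_addr: int) -> tuple[int, int]:
    # Assumption: even cells use the low nibble, odd cells the high nibble.
    return cell_addr >> 1, 4 * (cell_addr & 1)


def read_meta(sram: bytearray, cell_addr: int) -> int:
    byte, shift = _locate(cell_addr)
    return (sram[byte] >> shift) & 0xF


def write_meta(sram: bytearray, cell_addr: int, nibble: int) -> None:
    """Read-modify-write of one nibble, leaving the neighbouring cell intact."""
    byte, shift = _locate(cell_addr)
    keep = sram[byte] & ~(0xF << shift) & 0xFF
    sram[byte] = keep | ((nibble & 0xF) << shift)
```

Note that with two cells per byte, every metadata update is a read-modify-write of the shared byte, which fits the half-clock RMW requirement described above.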

Deferred Read Register#

Single register per SM. holds the return routing for one pending deferred read.

Deferred Read Register:
  [valid:1][cell_addr:10][return_routing:16]  = 27 bits

valid: set when a READ hits an EMPTY/RESERVED cell. Cleared when the deferred read is satisfied (WRITE to the target cell) or cancelled (CLEAR/FREE to the target cell).
cell_addr: which cell this deferred read is waiting on (10-bit address for 1024-cell range). Compared against incoming WRITE addresses to detect satisfaction.
return_routing: the 16 bits from the original READ's flit 2 (the pre-formed CM token template). Used as flit 1 of the result token when the deferred read is satisfied.

Deferred Read Satisfaction#

On every WRITE, the SM checks: valid == 1 AND write_addr == cell_addr. If true:

Store the write data in the cell (normal WRITE behaviour).
Transition cell state to FULL.
Emit result token: flit 1 = saved return_routing, flit 2 = written data.
Clear the deferred read register (valid = 0).

Hardware cost: one 10-bit comparator (cell_addr vs write_addr), one AND gate (valid AND addr_match), one 27-bit register. Trivial - maybe 3-4 TTL chips total.
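
The register layout and the match check can be sketched as follows. Field order follows the [valid:1][cell_addr:10][return_routing:16] layout above, MSB-first; the function names are hypothetical.

```python
def pack_drr(valid: int, cell_addr: int, return_routing: int) -> int:
    """Pack the 27-bit deferred read register."""
    return ((valid & 1) << 26) | ((cell_addr & 0x3FF) << 16) | (return_routing & 0xFFFF)


def write_hits(drr: int, write_addr: int) -> bool:
    """valid AND (write_addr == cell_addr): the comparator plus one AND gate."""
    valid = (drr >> 26) & 1
    cell_addr = (drr >> 16) & 0x3FF
    return bool(valid) and cell_addr == (write_addr & 0x3FF)


def saved_routing(drr: int) -> int:
    """The template that becomes flit 1 of the result token."""
    return drr & 0xFFFF
```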

Depth-1 Limitation and Backpressure#

With one deferred read register per SM, only one cell can have a pending deferred read at a time. if a second READ arrives for a different empty cell while the register is occupied, the SM cannot service it.

in practice at v0 scale (4 CMs, low contention), this should be rare. the compiler can also help by ensuring that reads and writes to the same cell are ordered appropriately in the program graph.

Multi-Slot Deferred Read CAM (Candidate Enhancement)#

If depth-1 proves too restrictive, expanding to a multi-entry deferred read store using a small CAM (content-addressable memory) is a natural fit. A CAM does associative lookup in one cycle - present the WRITE address, the CAM match line fires if any entry holds that address. No sequential scan, no priority logic changes to the SM pipeline.

Deferred Read CAM (e.g., 4 entries):
  Entry 0: [valid:1][cell_addr:10][return_routing:16]
  Entry 1: [valid:1][cell_addr:10][return_routing:16]
  Entry 2: [valid:1][cell_addr:10][return_routing:16]
  Entry 3: [valid:1][cell_addr:10][return_routing:16]

On READ hitting EMPTY cell:
  - Find first invalid entry (priority encoder), write cell_addr +
    return_routing, set valid
  - If all entries valid: stall (same as depth-1 overflow)

On WRITE:
  - Present write_addr to CAM match lines
  - If any entry matches: satisfy that deferred read, clear entry
  - Normal WRITE proceeds regardless

Why CAM is ideal here: the deferred read lookup is inherently associative - "does any pending read match this write address?" A register file would require sequential comparison against each entry. A CAM answers in one cycle regardless of entry count. Small CAMs (4-16 entries) existed as discrete TTL/CMOS parts and are also trivial to build from comparators + registers.
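
A behavioural sketch of the multi-slot store, for illustration only. The Python loops stand in for what would be a priority encoder (allocation) and parallel match lines (lookup) in hardware; the class name and the 4-entry default are assumptions taken from the example above.

```python
class DeferredReadCAM:
    def __init__(self, entries: int = 4):
        # Each slot is None (invalid) or (cell_addr, return_routing).
        self.slots = [None] * entries

    def allocate(self, cell_addr: int, return_routing: int) -> bool:
        """READ hit an EMPTY cell: take the first free slot. False means stall."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = (cell_addr, return_routing)
                return True
        return False

    def match(self, write_addr: int):
        """WRITE arrived: return the saved routing and free the slot, or None."""
        for i, slot in enumerate(self.slots):
            if slot is not None and slot[0] == write_addr:
                self.slots[i] = None
                return slot[1]
        return None
```

For the IO case described below, one slot would simply stay allocated to the IO-mapped address while the remaining slots serve ordinary deferred reads.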

Hardware cost: 4 entries × (10-bit comparator + 27-bit register) + priority encoder for allocation + match OR for satisfaction detection. Estimated 8-12 TTL chips for a 4-entry CAM, roughly two to three times the single-register cost. Alternatively, the National Semiconductor 100142 (4-word × 4-bit, ECL, see datasheets/NATLS21982-1.pdf) is a discrete CAM chip that provides 4-entry associative lookup in a single package. Two 100142s cascade to 4 words × 8 bits; covering the full 10-bit cell address takes a third device. The return routing storage still requires separate registers, but the address-match portion shrinks to 2-3 chips instead of a comparator tree.

IO motivation: the strongest argument for multiple deferred read slots comes from IO on SM00 (see io-and-bootstrap.md). The always-pending deferred read pattern for IO permanently occupies a slot. With a single slot, SM00 can either service IO or do normal I-structure deferred reads, but not both. Even 2 slots (one for IO, one for memory operations) resolves this. 4 slots covers multiple IO sources (UART + SPI + timer) without needing SM00 specialization or spontaneous token emission hardware.

Uniform vs SM00-only: giving all SMs multi-slot deferred reads keeps the architecture uniform (no special cases). The per-SM cost increase is small. Alternatively, only SM00 gets the CAM and other SMs keep depth-1 - saves chips but adds a special case. The uniform approach is preferred unless chip budget is extremely tight.

Operation Set#

Internal Representation#

The SM internally uses a 4-bit opcode + 12-bit address command format, regardless of how the command arrived:

Internal command: [op:4][addr:12][data:16][return_routing:16]

This is the canonical form that the op decoder, state machine, and ALU all work with. input interfaces (bus adapter, future direct path) translate their respective wire formats into this internal representation.

Bus Encoding (flit 1: variable-width opcode)#

On the 16-bit bus, flit 1 of an SM token is:

  [1][SM_id:2][op_base:3][op_ext/addr:2][addr:8]           = 16 bits

15 bits available after the SM discriminator. SM_id (2 bits) selects the target SM. The remaining 13 bits encode opcode and address using variable-width encoding.

The interpretation of the 2 bits after op_base depends on op_base[2:1]:

op[2:1] != 11:
  3-bit opcode, 10-bit addr (1024 cells).
  The 2 extension bits are the high address bits.

op[2:1] == 11:
  Extends to 5-bit opcode, 8-bit payload (256 cells or inline data).
  The 2 extension bits are opcode extension.

Decode signal: op[2] AND op[1] — one gate.

Opcode Table (bus encoding)#

op_base  ext   bus opcode  internal op  addr bits  name
─────────────────────────────────────────────────────────────────
  000     aa      000       0000        10 (1024)  read
  001     aa      001       0001        10         write
  010     aa      010       0010        10         alloc
  011     aa      011       0011        10         free
  100     aa      100       0100        10         exec
  101     aa      101       0101        10         ext (3-flit mode)
  110     00      11000     0110         8 (256)   rd_inc
  110     01      11001     0111         8         rd_dec
  110     10      11010     1000         8         cas
  110     11      11011     1001         8         raw_rd
  111     00      11100     1010         8         clear
  111     01      11101     1011         8         set_pg
  111     10      11110     1100         8         write_im
  111     11      11111     1101         8         (spare)

'aa' = address bits (part of 10-bit address).
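
The table implies a simple translation from bus opcode to internal opcode: 3-bit opcodes map straight across, and 5-bit opcodes land at 0110 plus their low three bits. A small sketch of that mapping, with hypothetical names:

```python
INTERNAL_NAMES = [
    "read", "write", "alloc", "free", "exec", "ext",      # 0000-0101
    "rd_inc", "rd_dec", "cas", "raw_rd",                  # 0110-1001
    "clear", "set_pg", "write_im", "spare",               # 1010-1101
]


def bus_to_internal(bus_opcode: int) -> int:
    """Translate a 3-bit or 5-bit bus opcode to the 4-bit internal opcode."""
    if 0b000 <= bus_opcode <= 0b101:
        return bus_opcode
    if 0b11000 <= bus_opcode <= 0b11111:
        return 0b0110 + (bus_opcode & 0b111)
    raise ValueError("not a valid bus opcode")
```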

Tier 1 ops (3-bit, full address range): read, write, alloc, exec, free reach the full 1024-cell address space. EXT signals a 3-flit token for extended addressing (external RAM, wide addresses).

Tier 2 ops (5-bit, restricted address/payload): atomic operations (rd_inc, rd_dec, cas) as well as clear are restricted to 256 cells - the compiler places atomic-access cells in the lower range. set_pg and write_im use the 8-bit payload field for non-address data (length, page register value, small immediate).

Key insight for variable-width encoding: not all SM ops are op(address). Some are op(config_value) or op() with no cell operand. The 8-bit payload in the restricted tier can be inline data, config values, or range counts depending on the opcode:

set_pg: payload = page register value
write_im: 8-bit addr, flit 2 carries immediate data
raw_rd: non-blocking read, returns data or empty indicator without registering a deferred read. Useful for polling and diagnostics.

Direct Path Encoding (future)#

A dedicated CM-SM link bypasses the bus. on the direct path, the type and SM_id fields are redundant (the wires go to exactly one SM), so the full 16 bits of the command word are available:

direct path: [op:4][addr:12]  = 16 bits

4-bit opcode (all 16 internal ops directly addressable) + 12-bit address (4096 cells). the SM's internal command representation matches this directly — no translation needed.
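
For illustration, the direct path word and the internal command share one packing. The sketch below assumes the bus adapter zero-extends its 10-bit (or 8-bit) address into the 12-bit field; the function names are hypothetical.

```python
def pack_command(op4: int, addr12: int) -> int:
    """[op:4][addr:12] as a 16-bit word."""
    if not (0 <= op4 < 16 and 0 <= addr12 < 4096):
        raise ValueError("field out of range")
    return (op4 << 12) | addr12


def unpack_command(word: int) -> tuple[int, int]:
    return (word >> 12) & 0xF, word & 0xFFF


def from_bus(internal_op: int, bus_addr: int) -> int:
    # Assumption: narrower bus addresses are zero-extended to 12 bits.
    return pack_command(internal_op, bus_addr & 0x3FF)
```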

v0 implementation: direct path not built. the extra input lines on the SM's internal command bus are disabled (active-high for addr lines, active-low for unused op bits) — physically present but unused. when/if a direct path is added, the bus adapter and direct path adapter both feed into the same internal command bus via a mux.

Extended Bus Encoding (3-flit SM token via `ext` opcode)#

For operations that need wider addresses or additional data (cas with both expected and new values), 3-flit SM tokens (via EXT opcode) carry the full 4-bit opcode in the extra flit:

flit 1: [type:2=10][SM_id:2][flags/op_hint:3][addr_low:9]   = 16 bits
flit 2: [extended_op:4][addr_high:4][data:8]                  = 16 bits
         or [expected_value:16] for CAS
flit 3: [new_value:16] or [return_routing:16]                 = 16 bits

Exact 3-flit format TBD. the point is that extended tokens can carry the full internal 4-bit opcode, giving access to all 16 internal operations including any that are only available via direct path or extended format.

Operation Details (updated for I-structure semantics)#

read (internal 0000): I-structure blocking read. if cell is FULL, returns data immediately. if cell is EMPTY or RESERVED, registers a deferred read (cell → WAITING) and returns the data later when a WRITE arrives. see state transition table above.

write (internal 0001): stores data. transitions EMPTY/RESERVED → FULL. if cell is WAITING, satisfies the deferred reader (emits result token using saved return routing). if cell is already FULL, overwrites + diagnostic flag (v0 behaviour, see state transition discussion). WRITE to a FULL cell does NOT emit a result token — the overwrite diagnostic is a local indicator (LED/counter), not a token.

alloc (internal 0010): transitions EMPTY → RESERVED. for heap management. deferred to post-v0 but opcode and state transition defined.

free (internal 0011): transitions any → EMPTY. cancels any pending deferred read on that cell. deferred to post-v0.

clear (internal 1010, restricted to lower 256 cells): transitions any → EMPTY. cancels any pending deferred read. unlike FREE, CLEAR is intended for explicit cell lifecycle management by the program (reclaim a cell for reuse). FREE is for heap management. semantically identical for now but distinct opcodes allow future differentiation (e.g. FREE updates a free-list, CLEAR doesn't).

(spare) (internal 0101): reserved. candidate: RAW_READ (non-blocking read that returns data or an empty indicator without deferring). useful for polling patterns and diagnostics. not committed.

rd_inc (internal 0110, restricted to lower 256 cells): atomic fetch-and-add(+1). reads current value, increments, writes back, returns old value. operates on FULL cells only — READ_INC on an EMPTY cell is an error (returns error indicator or stalls). does not interact with I-structure state transitions (cell stays FULL).

rd_dec (internal 0111, restricted to lower 256 cells): atomic fetch-and-add(-1). same semantics as READ_INC but decrements. CM checks returned value for zero to detect refcount exhaustion.

cas (internal 1000, restricted to lower 256 cells): compare-and-swap. requires extended format (3-flit or sequential 2-flit) to carry both expected_value and new_value + return routing. operates on FULL cells only. returns old value regardless of success/failure; CM infers success by comparing returned value with expected.

(spare) (internal 1001): reserved for future atomic op.
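
The data-side behaviour of the atomic operations above can be sketched as below. All three return the old value. 16-bit wraparound on overflow and underflow is an assumption made here (the document leaves underflow behaviour to the test plan), and presence checks are omitted.

```python
MASK16 = 0xFFFF


def rd_inc(cells: list, addr: int) -> int:
    """Fetch-and-add(+1): write back old+1, return old."""
    old = cells[addr]
    cells[addr] = (old + 1) & MASK16  # assumption: wraps at 16 bits
    return old


def rd_dec(cells: list, addr: int) -> int:
    """Fetch-and-add(-1): write back old-1, return old."""
    old = cells[addr]
    cells[addr] = (old - 1) & MASK16  # assumption: wraps at 16 bits
    return old


def cas(cells: list, addr: int, expected: int, new: int) -> int:
    """Compare-and-swap: store new only if the cell equals expected."""
    old = cells[addr]
    if old == expected:
        cells[addr] = new & MASK16
    return old  # the CM infers success from old == expected
```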

Hardware Architecture#

16-bit Bus                                           16-bit Bus
     |                                                    ^
     v                                                    |
[Input Deserialiser]                           [Output Serialiser]
  (reassemble 2+ flits,                         (split token into flits)
   variable opcode decode,                            ^
   translate to internal                              |
   4-bit op + 12-bit addr)                     [Result FIFO]
     |                                           (formed result tokens)
     v                                                ^
[Request FIFO]                                        |
  (internal-format commands)                   [Result Formatter]
     |                                                ^
     v                                                |
[Op Decoder / State Machine]──────────────────►[Deferred Read Reg]
     |          |          |                    (1 entry: valid,
     v          v          v                     cell_addr, ret_routing)
[Addr Decode]  [ALU]  [Presence State SRAM]
     |          |      (2 bits per cell,
     v          v       read/write parallel
[Data SRAM     |       with data SRAM)
 Bank 0/1]─────┘

Banking#

Start with 2 banks (1 address bit selects bank) for v0
10-bit address = 1024 cells per SM = 2KB at 16-bit data width
Each bank is one SRAM chip with room to spare
Banking allows pipelining: one bank can be reading while another is being written (for RMW ops, or overlapping independent requests)

Internal Components#

Input deserializer / Request FIFO: receives 16-bit flits from the bus, reassembles into logical tokens (2 flits standard, 3 for cas or extended addressing). performs variable-width opcode decode and translates to internal 4-bit op + 12-bit addr format. buffers reassembled requests in internal format. depth TBD (4-8 deep probably sufficient for v0). handles bursty traffic from multiple CMs.

Op decoder / State machine: the core control logic. for each request:

Read presence state for the target cell (from presence SRAM).
Determine action based on opcode × cell state (see state transition table).
Issue data SRAM read/write as needed.
Update presence state.
If result token needed: send to result formatter.
If deferred read: save return routing in deferred read register.
On every WRITE: check deferred read register for match on cell_addr.

This is the most complex piece of the SM - roughly a 10-15 state FSM plus the deferred read comparator. estimated 15-25 TTL chips.

Address decode: selects SRAM bank from address bits.

ALU: minimal — increment, decrement, compare. NOT a full ALU. just enough for the atomic operations. hardware cost: 16-bit incrementer + 16-bit comparator + mux. ~10-15 TTL chips.

Presence metadata SRAM: 4 bits per cell (presence:2 + is_wide:1 + spare:1). 1024 cells = 512 bytes. One small SRAM chip addressed in parallel with the data SRAM. Must support half-clock read-modify-write (read state in first half, write new state in second half) for single-cycle operation.

Deferred read register: one 27-bit register (valid:1 + cell_addr:10 + return_routing:16). One 10-bit comparator for addr matching against incoming writes. ~3-4 TTL chips total.

Result formatter: latches the pre-formed CM token template from the original request's flit 2 (or from the deferred read register for deferred reads), emits it as flit 1 of the result, and appends read data as flit 2. The SM does not parse or modify the template - it is an opaque blob. Output serializer splits the formed token into flits for bus transmission.

Address Space Extension#

The 10-bit address in the standard SM token gives 1024 cells per SM for tier-1 ops (3-bit opcode). Three mechanisms extend this further:

1. Page Register#

SM has a writable config register set via SET_PAGE opcode
10-bit token address is treated as offset, added to page base
Gives up to 64K+ addressable cells per SM
CM sets the page before issuing a burst of reads/writes to a region
Hardware cost: ~3 chips (latch for page register + adder)
Programming model: familiar bank-switching, like 8-bit micros
Trade-off: page switch costs one extra token; compiler batches accesses to same page to amortize
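
A sketch of the page-register address computation. How the 8-bit set_pg payload becomes a page base is not specified here, so the code assumes it supplies the upper 8 bits of a 16-bit base (256-cell granularity) and that the sum wraps at 16 bits; both are assumptions, and the names are hypothetical.

```python
class PageRegister:
    def __init__(self):
        self.page_base = 0

    def set_pg(self, payload8: int) -> None:
        # Assumption: payload is the high byte of a 16-bit page base.
        self.page_base = (payload8 & 0xFF) << 8

    def effective(self, offset10: int) -> int:
        """Page base plus 10-bit token offset, wrapped to 16 bits (assumed)."""
        return (self.page_base + (offset10 & 0x3FF)) & 0xFFFF
```

Because the offset (10 bits) is wider than the assumed page granularity (8 bits), adjacent pages overlap, which is why an adder is needed rather than simple bit concatenation.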

2. Banking as Implicit Address Bits#

SM_id field (2 bits) gives 4 SMs = 4 x 1024 = 4K cells system-wide
Not contiguous from a programming perspective, but compiler can distribute data structures across SMs for both capacity and parallelism
Essentially free — already in the token format
Combine with page registers for 4 x 64K = 256K cells system-wide

3. Extended SM Tokens (3-flit via EXT opcode)#

The EXT opcode (3-bit tier) signals a 3-flit SM token with full 16-bit address in flit 2, data/return-routing in flit 3
Full 16-bit address space per SM, at the cost of one extra flit cycle
Use for: large heap, external RAM, memory-mapped IO address ranges
Standard 2-flit tokens remain the fast path for common/local accesses

Practical Address Space with All Three Combined#

Fast path (standard + page register): 64K per SM, 2-flit token
Medium path (across SMs): 4 x 64K = 256K, 2-flit token
Slow path (3-flit EXT): up to 64K per SM with wide addresses, 3-flit

V0 Test Plan#

Drive input with microcontroller (RP2040 / Arduino)
Microcontroller formats 2-flit (16-bit each) request packets, clocks flits into input deserializer / request FIFO
Read 2-flit result packets from output FIFO
Test suite:
- Sequential read/write to FULL cells
- Random access patterns
- READ_INC sequences (verify atomicity, verify returned old value)
- READ_DEC to zero (verify underflow behaviour)
- CAS success and failure cases
- Bank contention (same bank back-to-back)
- Page register set + offset access
- Boundary conditions (address 0, address 1023, page wraparound)
- I-structure tests:
  - WRITE then READ: immediate result
  - READ then WRITE: deferred read — verify result arrives after WRITE
  - READ on EMPTY, WRITE to different cell, WRITE to target cell: verify deferred read resolves only on correct address
  - Two deferred READs (different cells): verify stall on second, then resolution when first is satisfied, then second proceeds
  - CLEAR on FULL cell, then READ: verify deferral
  - CLEAR on WAITING cell: verify deferred read is cancelled (no result token emitted)
  - WRITE on FULL cell: verify overwrite + diagnostic indicator
- Variable opcode decode tests:
  - Verify READ/WRITE/ALLOC/FREE/EXEC reach full 1024-cell range
  - Verify READ_INC/READ_DEC/CAS/CLEAR etc. restricted to lower 256 cells
  - Verify op[2:1]=11 decode correctly separates opcode extension from address

Memory Tiers#

SM address space supports regions with different semantics, selectable by address range. The tier is not encoded in the token - it is determined by the target address within the SM.

Tier 0: Raw Memory#

No presence bits. SRAM read/write only. Suitable for bulk data that does not need synchronization (image buffers, DMA staging areas, ROM).

READ always returns immediately (no deferral)
WRITE always succeeds
Presence metadata bits are ignored / not maintained
Allocated by compiler or loader, not by ALLOC/FREE
Address range: configurable, typically the top of the SM address space

Tier 1: I-Structure Memory#

Standard I-structure semantics with presence tracking. This is the default operating mode described throughout this document.

READ on EMPTY/RESERVED cell defers until WRITE
WRITE transitions cell to FULL
ALLOC/FREE manage the free list
Full presence metadata (4 bits per cell)
Address range: configurable, typically the bulk of SM address space

Tier 2: Wide/Bulk Memory#

Extends tier 1 with wide pointer support. Cells tagged with is_wide=1 in presence metadata are treated as the base of a (pointer, length) pair: the cell itself holds the data pointer, the next cell holds the length.

READ checks is_wide and returns either 1 cell (normal) or 2 cells (wide pointer pair — requires 3-flit result token or two result tokens)
WRITE to a wide cell writes both pointer and length
Enables ITERATE, COPY_RANGE, and EXEC to take wide pointers as arguments
Address range: overlaps with tier 1 (any tier 1 cell can be marked wide)

Tier Boundary Configuration#

Boundaries are set by config registers (SET_PAGE or a dedicated config mechanism) during bootstrap
The SM's address decoder checks the incoming address against tier boundaries to select behaviour
Hardware cost: 1-2 comparators + a mux on the presence metadata path
v0 can use a fixed split (e.g., lower 768 = tier 1, upper 256 = tier 0) and defer runtime-configurable boundaries
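
With the fixed v0 split given as an example above, tier selection reduces to one comparison. A minimal sketch, assuming the lower 768 / upper 256 split; the names are hypothetical.

```python
TIER0_BASE = 768  # cells 0-767 are tier 1, cells 768-1023 are tier 0


def tier_of(addr: int) -> int:
    """Select the memory tier from the cell address alone."""
    if not 0 <= addr < 1024:
        raise ValueError("address outside the 10-bit range")
    return 0 if addr >= TIER0_BASE else 1


def read_defers(addr: int, cell_is_full: bool) -> bool:
    """Tier 0 never defers; tier 1 defers a READ on a non-FULL cell."""
    return tier_of(addr) == 1 and not cell_is_full
```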

Design status: tiers are directionally decided. Exact address range splits, tier 2 wide-read mechanics, and interaction between tiers and page registers are still being refined.

Wide Pointers and Bulk Operations#

Wide Pointer Format#

A wide pointer occupies 2 consecutive SM cells:

Cell N:     [data_pointer:16]   (base address in SM or external memory)
Cell N+1:  [length:16]          (element count)

Cell N has is_wide=1 in its presence metadata. The SM knows to read both cells when servicing a READ on a wide cell.

Wide pointers are the parameter format for bulk operations. A CM does not iterate over SM contents directly — it sends an SM operation with a wide pointer, and the SM's sequencer engine handles the iteration internally.
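
The wide-cell read can be sketched as a check of is_wide followed by an optional second read. Whether the two values travel as one 3-flit result or two result tokens is still open, so this sketch only returns the values; the function name is hypothetical.

```python
def read_wide_aware(data: list, is_wide: list, addr: int) -> list:
    """Return [value] for a normal cell, [pointer, length] for a wide cell."""
    if is_wide[addr]:
        return [data[addr], data[addr + 1]]  # cell N = pointer, N+1 = length
    return [data[addr]]
```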

`exec`#

exec reads pre-formed tokens from a contiguous region of SM and pushes them onto the bus. The SM becomes a token source - effectively an autonomous injector.

exec request:
  flit 1: [...exec opcode...][count:8]
  flit 2: [base_addr from config register or wide pointer]

The SM's sequencer reads count 2-cell entries starting at base_addr. Each entry is a pre-formed 2-flit token (flit 1 in cell N, flit 2 in cell N+1). The sequencer emits them onto the bus in order.
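
The sequencer walk can be sketched as a simple loop over 2-cell entries. This models only the read order, not bus arbitration or backpressure; the function name is hypothetical.

```python
def exec_sequence(cells: list, base_addr: int, count: int) -> list:
    """Return the (flit 1, flit 2) tokens the sequencer would emit, in order."""
    tokens = []
    addr = base_addr
    for _ in range(count):
        tokens.append((cells[addr], cells[addr + 1]))  # flit 1 in N, flit 2 in N+1
        addr += 2
    return tokens
```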

Bootstrap use: on system reset, SM00 is wired to execute exec on a predetermined ROM address. The ROM contents are pre-formed IRAM write tokens and seed tokens: everything needed to load the system. No external microcontroller needed for self-hosted boot. See "SM00 Bootstrap" below.

Hardware reuse: the exec sequencer (address counter, limit comparator, increment logic, output path to bus) is the same hardware needed for iter and cp. Building exec for bootstrap gives bulk operations nearly for free.

`iter`#

Reads each cell in a range and emits a result token for each. Takes a wide pointer (base + length). The SM's sequencer walks the range, constructing result tokens using a pre-loaded return routing template.

`cp`#

Copies a contiguous range of cells from one SM region to another (or to a different SM). Takes source wide pointer and destination base. Useful for structure copying, GC compaction.

Design status: iter and cp are directionally committed. Exact token format for range parameters, interaction with deferred reads in the target range, and atomicity guarantees are still being refined.

SM00 Bootstrap#

Reset Behaviour#

SM00 has dedicated wiring to the system reset signal. On reset:

SM00's sequencer triggers exec on a predetermined ROM base address
The ROM region contains pre-formed tokens: IRAM write tokens to load PE instruction memories, followed by seed tokens to start execution
SM00 emits these tokens onto the bus in order
PEs receive IRAM writes and load their instruction memories
Seed tokens fire and execution begins

The reset vector wiring is the only hardware specialization of SM00. At runtime, SM00 behaves as a standard SM with standard opcodes.

ROM Mapping#

The bootstrap ROM is mapped into SM00's address space (tier 0 - raw memory, no presence bits). It can be:

Physical ROM/EEPROM on SM00's address bus
A region of SM00's SRAM pre-loaded by an external microcontroller (development/prototyping)
Flash memory accessed via page register for larger images

Future Specialization (not committed)#

SM00 could be further specialized for IO:

Atomic/alloc opcodes could be repurposed for IO-specific operations (e.g., rd_inc becomes "read UART with auto-acknowledge")
Memory-mapped IO devices occupy a reserved address range within SM00
SM00 could have additional interrupt-sensing hardware that triggers token emission on external events

This is documented as a design option but not committed for v0. The standard SM opcodes are sufficient for basic IO via I-structure semantics: read from an IO-mapped address defers until the IO device writes data.

Presence-Bit Guided IRAM Writes#

The matching store's presence bitmap provides information about whether any dyadic instruction slot has a pending (half-matched) operand. During IRAM writes, the PE can check the presence bits for the IRAM page being overwritten:

If all presence bits for slots in that page are clear, no tokens are pending and the page can be safely overwritten without drain delay
If any presence bit is set, the PE knows tokens are in flight for that instruction and can either wait or discard the stale operand

This enables more targeted IRAM replacement than the blanket drain protocol. Instead of draining the entire PE, only the affected page needs attention. The valid-bit page protection mechanism (bus-architecture-and-width-decoupling.md) remains the safety net, but presence-bit checking can eliminate unnecessary stalls.
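
The page check can be sketched as a scan of the presence bits belonging to one IRAM page. The page size (16 slots) and the list-of-bits representation are assumptions for illustration only; the function names are hypothetical.

```python
def pending_slots(presence: list, page: int, slots_per_page: int = 16) -> list:
    """Slot indices in the page that hold a half-matched operand."""
    start = page * slots_per_page
    window = presence[start:start + slots_per_page]
    return [start + i for i, bit in enumerate(window) if bit]


def page_safe_to_overwrite(presence: list, page: int, slots_per_page: int = 16) -> bool:
    """True when no presence bit in the page is set, so no drain is needed."""
    return not pending_slots(presence, page, slots_per_page)
```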

Open Design Questions#

Page register per-CM or global? - if multiple CMs access the same SM, do they share a page register (contention) or each have their own (more hardware, more config)? Probably global for v0.
Banking vs pipeline depth - with 2 banks, can we overlap a read to bank 0 with a write to bank 1? Worth the control complexity for v0? Presence state SRAM complicates this - is presence per-bank or shared? If shared, it serializes cross-bank operations. If per-bank, each bank needs its own presence SRAM. Probably shared for v0 (simpler).
Atomic ops on non-FULL cells - rd_inc/rd_dec/cas on EMPTY or WAITING cells is currently undefined. Options: error, stall, or treat as zero. Error is safest for v0.
Direct path input mux - when direct path is added, the SM needs a mux between bus input and direct input feeding the internal command bus. Arbitration policy TBD (direct path priority? round-robin?).
Wide pointer read format - does a READ on a wide cell return two separate result tokens or one 3-flit result token? Two tokens is simpler (reuse existing result formatter), 3-flit is more atomic.
Tier boundary mechanism - fixed at build time, config register, or address-range comparator? Fixed is simplest for v0.
iter return template - how is the return routing template supplied for iterated results? Preloaded config register? Part of the iter request? Template per element or shared?
SM00 ROM size and mapping - how large is the bootstrap ROM? What address range? How does it interact with the page register?