Dynamic Dataflow CPU: SM (Structure Memory) Design#
Covers the SM interface protocol, operation set, banking scheme, address space extension, memory tiers, wide pointers, bulk operations, EXEC/bootstrap, and hardware architecture.
See architecture-overview.md for module taxonomy and token format.
See network-and-communication.md for how SM connects to the bus.
See bus-architecture-and-width-decoupling.md for bus width rationale.
See sm-and-token-format-discussion.md for extended discussion of design decisions including DRAM latency context, bootstrap unification, and address space distribution.
Role#
SM stores structured data (arrays, lists, heap), performs operations on it, and provides memory-mapped IO.
SM is synchronising memory — not just a data store. Tier 1 cells have presence state (empty/full), and reads to empty cells are deferred until a write arrives. This gives implicit producer-consumer synchronization without locks or explicit message-passing: the write is the signal. This is the dataflow architecture's answer to shared mutable state.
IO is memory-mapped into SM address space. An SM (typically SM00 at v0) maps IO devices into its address range. I-structure semantics provide natural interrupt-free IO: a READ from an IO device that has no data defers until data arrives, triggering the receiving node in the dataflow graph.
From a CM's perspective: send a bit[15]=1 request, get a CM result token back eventually. Split-phase, asynchronous relative to the requesting CM. A READ may return immediately (cell is full) or later (cell is empty, deferred until written).
See em4-analysis.md for context on why dedicated synchronizing memory matters vs flattening structure storage into PE-local SRAM.
Internal data path: 16-bit, matching the SRAM word size. Tokens arrive as 2 flits on the 16-bit external bus, are deserialized at the SM input FIFO into a reassembled logical token, processed, and result tokens are serialized back into flits at the output. The SM never needs an internal data path wider than 16 bits for data operations.
Interface Protocol#
Stateless request handling: the request token carries its own return routing info in the bits that are unused by that operation type. SM does not maintain persistent per-request state, outside of caching the return token template when a read is deferred.
Request Format (bit[15]=1, 2-flit standard)#
All SM requests arrive as 2 flits on the 16-bit bus and are reassembled at the SM input FIFO before processing.
Flit 1 (common to all SM ops):
[1][SM_id:2][op_base:3][addr/op_ext:2][addr_low:8] = 16 bits
15 bits available after the SM discriminator bit. SM_id (2 bits) selects
the target SM. The remaining 13 bits encode opcode and address using
variable-width encoding:
When op[2:1] != 11: 3-bit opcode, 10-bit addr (1024 cells)
When op[2:1] == 11: extends to 5-bit opcode, 8-bit payload (256 cells)
Decode signal: op[2] AND op[1] — one gate.
Flit 2 varies by operation:
WRITE: [data:16]
Write data to address. No DN response unless cell was WAITING
(deferred read satisfied — result token emitted using saved routing).
READ / CLEAR / ALLOC / FREE: [return_routing:16]
Flit 2 carries a **pre-formed CM token template**. The SM's result
formatter latches this template, prepends it as the result's flit 1,
and appends read data as flit 2. No bit-shuffling — the requesting
CM does all format work upfront. The SM treats this as an opaque
16-bit blob.
(CLEAR/ALLOC/FREE may not need return routing — if no result token
is emitted, flit 2 could be omitted or carry flags instead. TBD.)
READ_INC / READ_DEC: [return_routing:16]
Same as READ. Atomic ops always return the old value.
CAS — compare-and-swap:
Requires 3-flit extended format to carry expected_value, new_value,
AND return routing:
flit 2: [expected_value:16]
flit 3: [new_value:8][return_routing:8] (or similar split — TBD)
Alternatively, CAS could use two sequential 2-flit packets.
Design TBD — CAS is the most complex SM operation.
3-flit extended addressing mode: for access to external RAM or memory-mapped IO address spaces, a 3-flit SM token provides wider addresses at the cost of one extra flit cycle. The EXT opcode (one of the 3-bit tier) signals 3-flit format.
Result Format (on DN, pre-formed CM token)#
SM uses the pre-formed CM token template from the request's flit 2 as the result's flit 1, and appends the read data as flit 2:
Result (2 flits):
flit 1: [return_routing:16] (opaque template from original request)
flit 2: [fetched_data:16]
The template can encode any CM token format whose routing fits in 16 bits: dyadic wide, monadic normal, or monadic inline. This means SM read results can land directly in a matching store slot as one operand of a dyadic instruction.
Cell State Model (I-Structure Semantics)#
Each SM cell has a 2-bit state field in addition to its 16-bit data value.
State Bits Meaning
───────────────────────────────────────────────────────────
EMPTY 00 Never written, or cleared. READ defers.
RESERVED 01 Allocated but not yet written. READ defers.
FULL 10 Written. READ returns value immediately.
WAITING 11 Empty + a deferred READ is pending for this cell.
State Transitions#
WRITE
EMPTY ──────────────────────────► FULL
│ │
│ READ │ READ
▼ ▼
WAITING ────────────────────────► FULL (WRITE satisfies deferred reader,
(deferred read registered) sends result, cell becomes FULL)
│
│ CLEAR
▼
EMPTY
EMPTY ──── ALLOC ──► RESERVED ──── WRITE ──► FULL
FULL ──── FREE ──► EMPTY
any ──── CLEAR ──► EMPTY
Detailed transitions:
- READ on EMPTY or RESERVED: cell transitions to WAITING. return routing is saved in the deferred read register. no result token emitted yet. if the deferred read register is already occupied (by a different cell), SM stalls and the request stays in the input FIFO until the pending deferred read is satisfied.
- READ on FULL: result token emitted immediately. Cell stays FULL. Multiple reads of the same full cell are fine: all get the value.
- READ on WAITING: error condition (second deferred read to same cell while first is pending). SM behaviour TBD. Options: stall, return error token, or drop silently. stall is safest for v0.
- WRITE on EMPTY or RESERVED: data stored, cell transitions to FULL. no result token.
- WRITE on WAITING: data stored, cell transitions to FULL. SM immediately uses the saved return routing from the deferred read register to emit a result token with the written data. deferred read register is freed.
- WRITE on FULL: depends on design choice. options:
- overwrite silently: data replaced, cell stays FULL. simple, but hides bugs.
- error: SM emits an error/diagnostic signal. data not written. safe but requires error handling.
- overwrite + diagnostic flag: data replaced, diagnostic LED or counter incremented. best of both for v0.
- CLEAR on any state: cell transitions to EMPTY. if cell was WAITING, the pending deferred read is cancelled (return routing discarded from deferred read register, no result token emitted). CLEAR never emits a result token.
- ALLOC: transitions EMPTY->RESERVED. no effect on non-EMPTY cells.
- FREE: transitions any->EMPTY. if cell was WAITING, cancels the deferred read (same as CLEAR).
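The transitions listed above can be sketched as a small behavioural model. This is a minimal sketch, not the SM's actual control logic: the class name `SMModel`, the `"STALL"` return value, and the `overwrites` counter are illustrative assumptions, and it picks the "stall" option for READ on WAITING and the "overwrite + diagnostic flag" option for WRITE on FULL.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class State(IntEnum):
    EMPTY = 0b00
    RESERVED = 0b01
    FULL = 0b10
    WAITING = 0b11


@dataclass
class SMModel:
    """Behavioural model of tier 1 cells with one deferred read register."""
    size: int = 1024
    state: list = field(default_factory=list)
    data: list = field(default_factory=list)
    deferred: Optional[tuple] = None   # (cell_addr, return_routing) or None
    overwrites: int = 0                # stand-in for the diagnostic LED/counter

    def __post_init__(self):
        self.state = [State.EMPTY] * self.size
        self.data = [0] * self.size

    def read(self, addr, return_routing):
        """Return a (flit1, flit2) result, None if deferred, or "STALL"."""
        st = self.state[addr]
        if st == State.FULL:
            return (return_routing, self.data[addr])
        if st == State.WAITING or self.deferred is not None:
            return "STALL"             # request stays in the input FIFO
        self.state[addr] = State.WAITING
        self.deferred = (addr, return_routing)
        return None

    def write(self, addr, value):
        """Store value; emit a result token only if a reader was waiting."""
        st = self.state[addr]
        result = None
        if st == State.FULL:
            self.overwrites += 1       # overwrite + diagnostic flag option
        elif st == State.WAITING:
            _, routing = self.deferred
            self.deferred = None
            result = (routing, value & 0xFFFF)
        self.data[addr] = value & 0xFFFF
        self.state[addr] = State.FULL
        return result

    def clear(self, addr):
        """CLEAR/FREE: any state -> EMPTY, cancelling a pending deferred read."""
        if self.state[addr] == State.WAITING:
            self.deferred = None
        self.state[addr] = State.EMPTY

    free = clear                       # semantically identical for now

    def alloc(self, addr):
        """ALLOC: EMPTY -> RESERVED, no effect on other states."""
        if self.state[addr] == State.EMPTY:
            self.state[addr] = State.RESERVED
```

In this model, `sm.read(5, 0x1234)` on a fresh `SMModel()` returns `None` (deferred), and a following `sm.write(5, 42)` returns the result token `(0x1234, 42)`.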
Presence Metadata Hardware#
4 bits per cell: [presence:2][is_wide:1][spare:1]
- presence:2 - EMPTY/RESERVED/FULL/WAITING (checked on every operation)
- is_wide:1 - tags this cell as part of a wide pointer pair. The SM checks this before deciding whether to also read the next cell for length metadata. See "Wide Pointers" section below.
- spare:1 - reserved. Candidates: write-once flag, type tag, owner ID.
At 1024 cells = 4096 bits = 512 bytes. Using a byte-wide SRAM chip means bits 4-7 are physically present whether used or not. Committing 4 bits with 4 spare avoids needing to change the presence SRAM layout when additional per-cell metadata is added during testing.
Implementation:
- Small SRAM alongside the data SRAM. Addressed in parallel, read/written on every operation. Packed at 2 cells per byte (4 bits each), 1024 cells need 512 bytes, which easily fits in one 8-bit-wide SRAM chip.
- Must support single-cycle read-modify-write (read state, decide action, write new state) within one clock. Achievable if the state SRAM access time is less than half the clock period (read in first half, write in second half; same half-clock RMW technique as the EM-4 matching stage).
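As a rough illustration of the metadata layout, the sketch below packs and unpacks one 4-bit entry and reproduces the size arithmetic. It assumes the field order `[presence:2][is_wide:1][spare:1]` is written MSB first (presence in the top two bits of the nibble); the function names are hypothetical.

```python
EMPTY, RESERVED, FULL, WAITING = 0b00, 0b01, 0b10, 0b11


def pack_meta(presence, is_wide=0, spare=0):
    """Pack [presence:2][is_wide:1][spare:1] into a 4-bit nibble (MSB first)."""
    return ((presence & 0b11) << 2) | ((is_wide & 1) << 1) | (spare & 1)


def unpack_meta(nibble):
    """Return (presence, is_wide, spare) from a 4-bit nibble."""
    return (nibble >> 2) & 0b11, (nibble >> 1) & 1, nibble & 1


CELLS = 1024
META_BITS = 4
total_bits = CELLS * META_BITS      # 4096 bits
total_bytes = total_bits // 8       # 512 bytes
```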
Deferred Read Register#
Single register per SM. holds the return routing for one pending deferred read.
Deferred Read Register:
[valid:1][cell_addr:10][return_routing:16] = 27 bits
- valid: set when a READ hits an EMPTY/RESERVED cell. Cleared when the deferred read is satisfied (WRITE to the target cell) or cancelled (CLEAR/FREE to the target cell).
- cell_addr: which cell this deferred read is waiting on (10-bit address for 1024-cell range). Compared against incoming WRITE addresses to detect satisfaction.
- return_routing: the 16 bits from the original READ's flit 2 (the pre-formed CM token template). Used as flit 1 of the result token when the deferred read is satisfied.
Deferred Read Satisfaction#
On every WRITE, the SM checks: valid == 1 AND write_addr == cell_addr. If true:
- Store the write data in the cell (normal WRITE behaviour).
- Transition cell state to FULL.
- Emit result token: flit 1 = saved return_routing, flit 2 = written data.
- Clear the deferred read register (valid = 0).
Hardware cost: one 10-bit comparator (cell_addr vs write_addr), one AND gate (valid AND addr_match), one 27-bit register. Trivial - maybe 3-4 TTL chips total.
Depth-1 Limitation and Backpressure#
With one deferred read register per SM, only one cell can have a pending deferred read at a time. if a second READ arrives for a different empty cell while the register is occupied, the SM cannot service it.
in practice at v0 scale (4 CMs, low contention), this should be rare. the compiler can also help by ensuring that reads and writes to the same cell are ordered appropriately in the program graph.
Multi-Slot Deferred Read CAM (Candidate Enhancement)#
If depth-1 proves too restrictive, expanding to a multi-entry deferred read store using a small CAM (content-addressable memory) is a natural fit. A CAM does associative lookup in one cycle - present the WRITE address, the CAM match line fires if any entry holds that address. No sequential scan, no priority logic changes to the SM pipeline.
Deferred Read CAM (e.g., 4 entries):
Entry 0: [valid:1][cell_addr:10][return_routing:16]
Entry 1: [valid:1][cell_addr:10][return_routing:16]
Entry 2: [valid:1][cell_addr:10][return_routing:16]
Entry 3: [valid:1][cell_addr:10][return_routing:16]
On READ hitting EMPTY cell:
- Find first invalid entry (priority encoder), write cell_addr +
return_routing, set valid
- If all entries valid: stall (same as depth-1 overflow)
On WRITE:
- Present write_addr to CAM match lines
- If any entry matches: satisfy that deferred read, clear entry
- Normal WRITE proceeds regardless
Why CAM is ideal here: the deferred read lookup is inherently associative - "does any pending read match this write address?" A register file would require sequential comparison against each entry. A CAM answers in one cycle regardless of entry count. Small CAMs (4-16 entries) existed as discrete TTL/CMOS parts and are also trivial to build from comparators + registers.
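The allocate and match behaviour described above can be modelled in a few lines. This is a software sketch only: the loops stand in for the priority encoder and the parallel match lines, and the class and method names are assumptions made for illustration.

```python
class DeferredReadCAM:
    """Software model of a small deferred read CAM (default 4 entries)."""

    def __init__(self, entries=4):
        # Each slot is None (invalid) or (cell_addr, return_routing).
        self.slots = [None] * entries

    def allocate(self, cell_addr, return_routing):
        """On READ hitting an EMPTY cell: fill the first invalid entry."""
        for i, slot in enumerate(self.slots):    # models the priority encoder
            if slot is None:
                self.slots[i] = (cell_addr, return_routing)
                return i
        return None                              # all entries valid: stall

    def match(self, write_addr, data):
        """On WRITE: satisfy and clear the matching entry, if any."""
        for i, slot in enumerate(self.slots):    # hardware compares in parallel
            if slot is not None and slot[0] == write_addr:
                self.slots[i] = None
                return (slot[1], data)           # result token (flit 1, flit 2)
        return None
```

Because READ on a WAITING cell is treated as an error, the model assumes at most one entry per cell address, so a single match per WRITE is enough.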
Hardware cost: 4 entries × (10-bit comparator + 27-bit register) + priority encoder for allocation + match OR for satisfaction detection. Estimated 8-12 TTL chips for a 4-entry CAM, roughly two to three times the single-register cost. Alternatively, the National Semiconductor 100142 (4-word × 4-bit, ECL, see datasheets/NATLS21982-1.pdf) is a discrete CAM chip that provides 4-entry associative lookup in a single package. Two 100142s cascade to 4 words × 8 bits, which covers 8 of the 10 cell address bits; the remaining 2 bits need a third chip or a small comparator. The return routing storage still requires separate registers, but the address-match portion shrinks to 2-3 chips instead of a comparator tree.
IO motivation: the strongest argument for multiple deferred read slots comes from IO on SM00 (see io-and-bootstrap.md). The always-pending deferred read pattern for IO permanently occupies a slot. With a single slot, SM00 can either service IO or do normal I-structure deferred reads, but not both. Even 2 slots (one for IO, one for memory operations) resolves this. 4 slots covers multiple IO sources (UART + SPI + timer) without needing SM00 specialization or spontaneous token emission hardware.
Uniform vs SM00-only: giving all SMs multi-slot deferred reads keeps the architecture uniform (no special cases). The per-SM cost increase is small. Alternatively, only SM00 gets the CAM and other SMs keep depth-1 - saves chips but adds a special case. The uniform approach is preferred unless chip budget is extremely tight.
Operation Set#
Internal Representation#
The SM internally uses a 4-bit opcode + 12-bit address command format, regardless of how the command arrived:
Internal command: [op:4][addr:12][data:16][return_routing:16]
This is the canonical form that the op decoder, state machine, and ALU all work with. input interfaces (bus adapter, future direct path) translate their respective wire formats into this internal representation.
Bus Encoding (flit 1: variable-width opcode)#
On the 16-bit bus, flit 1 of an SM token is:
[1][SM_id:2][op_base:3][op_ext/addr:2][addr:8] = 16 bits
15 bits available after the SM discriminator. SM_id (2 bits) selects the target SM. The remaining 13 bits encode opcode and address using variable-width encoding.
The interpretation of the 2 bits after op_base depends on op_base[2:1]:
op[2:1] != 11:
3-bit opcode, 10-bit addr (1024 cells).
The 2 extension bits are the high address bits.
op[2:1] == 11:
Extends to 5-bit opcode, 8-bit payload (256 cells or inline data).
The 2 extension bits are opcode extension.
Decode signal: op[2] AND op[1] — one gate.
Opcode Table (bus encoding)#
op_base ext bus opcode internal op addr bits name
─────────────────────────────────────────────────────────────────
000 aa 000 0000 10 (1024) read
001 aa 001 0001 10 write
010 aa 010 0010 10 alloc
011 aa 011 0011 10 free
100 aa 100 0100 10 exec
101 aa 101 0101 10 ext (3-flit mode)
110 00 11000 0110 8 (256) rd_inc
110 01 11001 0111 8 rd_dec
110 10 11010 1000 8 cas
110 11 11011 1001 8 raw_rd
111 00 11100 1010 8 clear
111 01 11101 1011 8 set_pg
111 10 11110 1100 8 write_im
111 11 11111 1101 8 (spare)
'aa' = address bits (part of 10-bit address).
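The flit 1 layout and the opcode table can be exercised with a small encoder/decoder pair. This is a sketch under the bit positions given above (bit 15 discriminator, bits 14-13 SM_id, bits 12-10 op_base, bits 9-8 ext/addr, bits 7-0 addr). The function names are hypothetical, and the observation that the 5-bit opcodes map to internal ops as `0110 + low three bits` is read off the table rather than stated by the design.

```python
NAMES = ["read", "write", "alloc", "free", "exec", "ext", "rd_inc", "rd_dec",
         "cas", "raw_rd", "clear", "set_pg", "write_im", "(spare)"]


def encode_flit1(sm_id, bus_opcode, payload):
    """Build flit 1 from a 3-bit (000..101) or 5-bit (11000..11111) bus opcode."""
    if 0b11000 <= bus_opcode <= 0b11111:      # 5-bit opcode, 8-bit payload
        body = (bus_opcode << 8) | (payload & 0xFF)
    elif 0 <= bus_opcode <= 0b101:            # 3-bit opcode, 10-bit address
        body = (bus_opcode << 10) | (payload & 0x3FF)
    else:
        raise ValueError("not a valid bus opcode")
    return 0x8000 | ((sm_id & 0b11) << 13) | body


def decode_flit1(flit):
    """Return (sm_id, internal_op, name, payload) for an SM flit 1."""
    if not flit & 0x8000:
        raise ValueError("bit[15]=0: not an SM request")
    sm_id = (flit >> 13) & 0b11
    op_base = (flit >> 10) & 0b111
    if (op_base >> 2) & (op_base >> 1) & 1:   # op[2] AND op[1]: one gate
        ext = (flit >> 8) & 0b11
        internal = 0b0110 + (((op_base & 1) << 2) | ext)
        payload = flit & 0xFF                 # 8-bit address or inline payload
    else:
        internal = op_base                    # 3-bit ops map straight through
        payload = flit & 0x3FF                # full 10-bit address
    return sm_id, internal, NAMES[internal], payload
```

For example, `decode_flit1(encode_flit1(2, 0b11010, 0x10))` yields `(2, 0b1000, "cas", 0x10)`.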
Tier 1 ops (3-bit, full address range): read, write, alloc, exec, free reach the full 1024-cell address space. EXT signals a 3-flit token for extended addressing (external RAM, wide addresses).
Tier 2 ops (5-bit, restricted address/payload): atomic operations (rd_inc, rd_dec, cas) as well as clear are restricted to 256 cells - the compiler places atomic-access cells in the lower range. set_pg and write_im use the 8-bit payload field for non-address data (length, page register value, small immediate).
Key insight for variable-width encoding: not all SM ops are
op(address). Some are op(config_value) or op() with no cell
operand. The 8-bit payload in the restricted tier can be inline data,
config values, or range counts depending on the opcode:
- set_pg: payload = page register value
- write_im: 8-bit addr, flit 2 carries immediate data
- raw_rd: non-blocking read, returns data or empty indicator without registering a deferred read. Useful for polling and diagnostics.
Direct Path Encoding (future)#
A dedicated CM-SM link bypasses the bus. on the direct path, the type and SM_id fields are redundant (the wires go to exactly one SM), so the full 16 bits of the command word are available:
direct path: [op:4][addr:12] = 16 bits
4-bit opcode (all 16 internal ops directly addressable) + 12-bit address (4096 cells). the SM's internal command representation matches this directly — no translation needed.
v0 implementation: direct path not built. the extra input lines on the SM's internal command bus are disabled (active-high for addr lines, active-low for unused op bits) — physically present but unused. when/if a direct path is added, the bus adapter and direct path adapter both feed into the same internal command bus via a mux.
Extended Bus Encoding (3-flit SM token via ext opcode)#
For operations that need wider addresses or additional data (cas with both expected and new values), 3-flit SM tokens (via EXT opcode) carry the full 4-bit opcode in the extra flit:
flit 1: [type:2=10][SM_id:2][flags/op_hint:3][addr_low:9] = 16 bits
flit 2: [extended_op:4][addr_high:4][data:8] = 16 bits
or [expected_value:16] for CAS
flit 3: [new_value:16] or [return_routing:16] = 16 bits
Exact 3-flit format TBD. the point is that extended tokens can carry the full internal 4-bit opcode, giving access to all 16 internal operations including any that are only available via direct path or extended format.
Operation Details (updated for I-structure semantics)#
read (internal 0000): I-structure blocking read. if cell is FULL,
returns data immediately. if cell is EMPTY or RESERVED, registers a
deferred read (cell → WAITING) and returns the data later when a WRITE
arrives. see state transition table above.
write (internal 0001): stores data. transitions EMPTY/RESERVED → FULL.
if cell is WAITING, satisfies the deferred reader (emits result token
using saved return routing). if cell is already FULL, overwrites +
diagnostic flag (v0 behaviour, see state transition discussion).
WRITE to a FULL cell does NOT emit a result token — the overwrite
diagnostic is a local indicator (LED/counter), not a token.
alloc (internal 0010): transitions EMPTY → RESERVED. for heap
management. deferred to post-v0 but opcode and state transition defined.
free (internal 0011): transitions any → EMPTY. cancels any pending
deferred read on that cell. deferred to post-v0.
clear (internal 1010, restricted to lower 256 cells): transitions any → EMPTY. cancels any pending
deferred read. unlike FREE, CLEAR is intended for explicit cell lifecycle
management by the program (reclaim a cell for reuse). FREE is for heap
management. semantically identical for now but distinct opcodes allow
future differentiation (e.g. FREE updates a free-list, CLEAR doesn't).
ext (internal 0101): signals a 3-flit token for extended addressing (external RAM, memory-mapped IO, wide addresses). see "Extended Bus Encoding" above.
rd_inc (internal 0110, restricted to lower 256 cells): atomic
fetch-and-add(+1). reads current value, increments, writes back, returns
old value. operates on FULL cells only — READ_INC on an EMPTY cell is
an error (returns error indicator or stalls). does not interact with
I-structure state transitions (cell stays FULL).
rd_dec (internal 0111, restricted to lower 256 cells): atomic
fetch-and-add(-1). same semantics as READ_INC but decrements. CM checks
returned value for zero to detect refcount exhaustion.
cas (internal 1000, restricted to lower 256 cells): compare-and-swap.
requires extended format (3-flit or sequential 2-flit) to carry both
expected_value and new_value + return routing. operates on FULL cells
only. returns old value regardless of success/failure; CM infers
success by comparing returned value with expected.
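raw_rd (internal 1001, restricted to lower 256 cells): non-blocking read. returns data or an empty indicator without registering a deferred read. useful for polling and diagnostics.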
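The "returns old value" convention shared by the atomic ops can be shown with a short model over a plain list of cells. It is a sketch only: it assumes 16-bit wraparound on increment and decrement (the document leaves underflow behaviour to the test plan) and does not model the FULL-only check.

```python
MASK16 = 0xFFFF


def rd_inc(cells, addr):
    """Atomic fetch-and-add(+1): write back old+1, return the old value."""
    old = cells[addr]
    cells[addr] = (old + 1) & MASK16   # assumption: wraps at 16 bits
    return old


def rd_dec(cells, addr):
    """Atomic fetch-and-add(-1): write back old-1, return the old value."""
    old = cells[addr]
    cells[addr] = (old - 1) & MASK16   # assumption: wraps at 16 bits
    return old


def cas(cells, addr, expected, new):
    """Compare-and-swap: always returns the old value."""
    old = cells[addr]
    if old == expected:
        cells[addr] = new & MASK16
    return old
```

A caller infers CAS success the way the text describes: the swap happened exactly when the returned value equals `expected`.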
Hardware Architecture#
16-bit Bus 16-bit Bus
| ^
v |
[Input Deserialiser] [Output Serialiser]
(reassemble 2+ flits, (split token into flits)
variable opcode decode, ^
translate to internal |
4-bit op + 12-bit addr) [Result FIFO]
| (formed result tokens)
v ^
[Request FIFO] |
(internal-format commands) [Result Formatter]
| ^
v |
[Op Decoder / State Machine]──────────────────►[Deferred Read Reg]
| | | (1 entry: valid,
v v v cell_addr, ret_routing)
[Addr Decode] [ALU] [Presence State SRAM]
| | (2 bits per cell,
v v read/write parallel
[Data SRAM | with data SRAM)
Bank 0/1]─────┘
Banking#
- Start with 2 banks (1 address bit selects bank) for v0
- 10-bit address = 1024 cells per SM = 2KB at 16-bit data width
- Each bank is one SRAM chip with room to spare
- Banking allows pipelining: one bank can be reading while another is being written (for RMW ops, or overlapping independent requests)
Internal Components#
Input deserializer / Request FIFO: receives 16-bit flits from the bus, reassembles into logical tokens (2 flits standard, 3 for cas or extended addressing). performs variable-width opcode decode and translates to internal 4-bit op + 12-bit addr format. buffers reassembled requests in internal format. depth TBD (4-8 deep probably sufficient for v0). handles bursty traffic from multiple CMs.
Op decoder / State machine: the core control logic. for each request:
- Read presence state for the target cell (from presence SRAM).
- Determine action based on opcode × cell state (see state transition table).
- Issue data SRAM read/write as needed.
- Update presence state.
- If result token needed: send to result formatter.
- If deferred read: save return routing in deferred read register.
- On every WRITE: check deferred read register for match on cell_addr.
This is the most complex piece of the SM - roughly a 10-15 state FSM plus the deferred read comparator. estimated 15-25 TTL chips.
Address decode: selects SRAM bank from address bits.
ALU: minimal — increment, decrement, compare. NOT a full ALU. just enough for the atomic operations. hardware cost: 16-bit incrementer + 16-bit comparator + mux. ~10-15 TTL chips.
Presence metadata SRAM: 4 bits per cell (presence:2 + is_wide:1 + spare:1). 1024 cells = 512 bytes. One small SRAM chip addressed in parallel with the data SRAM. Must support half-clock read-modify-write (read state in first half, write new state in second half) for single-cycle operation.
Deferred read register: one 27-bit register (valid:1 + cell_addr:10 + return_routing:16). One 10-bit comparator for addr matching against incoming writes. ~3-4 TTL chips total.
Result formatter: latches the pre-formed CM token template from the original request's flit 2 (or from the deferred read register for deferred reads), emits it as flit 1 of the result, and appends read data as flit 2. The SM does not parse or modify the template - it is an opaque blob. Output serializer splits the formed token into flits for bus transmission.
Address Space Extension#
The 10-bit address in the standard SM token gives 1024 cells per SM for tier-1 ops (3-bit opcode). Three mechanisms extend this further:
1. Page Register#
- SM has a writable config register set via SET_PAGE opcode
- 10-bit token address is treated as offset, added to page base
- Gives up to 64K+ addressable cells per SM
- CM sets the page before issuing a burst of reads/writes to a region
- Hardware cost: ~3 chips (latch for page register + adder)
- Programming model: familiar bank-switching, like 8-bit micros
- Trade-off: page switch costs one extra token; compiler batches accesses to same page to amortize
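The page register's address arithmetic can be sketched as follows. The register width is not fixed by the text, so this assumes a 16-bit page base holding a cell address and a sum that wraps at 16 bits; `effective_address` is a hypothetical name.

```python
def effective_address(page_base, offset, width=16):
    """Add the 10-bit token offset to the page base, wrapping at `width` bits."""
    return (page_base + (offset & 0x3FF)) & ((1 << width) - 1)


# A CM would send one SET_PAGE token, then a burst of 2-flit accesses
# whose 10-bit addresses are offsets into that page.
page = 0x4000
first = effective_address(page, 0x000)
last = effective_address(page, 0x3FF)
```

With these assumptions, `first` is `0x4000` and `last` is `0x43FF`; a base near the top of the range wraps, which is the case the page wraparound test would need to pin down.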
2. Banking as Implicit Address Bits#
- SM_id field (2 bits) gives 4 SMs = 4 x 1024 = 4K cells system-wide
- Not contiguous from a programming perspective, but compiler can distribute data structures across SMs for both capacity and parallelism
- Essentially free — already in the token format
- Combine with page registers for 4 x 64K = 256K cells system-wide
3. Extended SM Tokens (3-flit via EXT opcode)#
- The EXT opcode (3-bit tier) signals a 3-flit SM token with full 16-bit address in flit 2, data/return-routing in flit 3
- Full 16-bit address space per SM, at the cost of one extra flit cycle
- Use for: large heap, external RAM, memory-mapped IO address ranges
- Standard 2-flit tokens remain the fast path for common/local accesses
Practical Address Space with All Three Combined#
- Fast path (standard + page register): 64K per SM, 2-flit token
- Medium path (across SMs): 4 x 64K = 256K, 2-flit token
- Slow path (3-flit EXT): up to 64K per SM with wide addresses, 3-flit
V0 Test Plan#
- Drive input with microcontroller (RP2040 / Arduino)
- Microcontroller formats 2-flit (16-bit each) request packets, clocks flits into input deserializer / request FIFO
- Read 2-flit result packets from output FIFO
- Test suite:
- Sequential read/write to FULL cells
- Random access patterns
- READ_INC sequences (verify atomicity, verify returned old value)
- READ_DEC to zero (verify underflow behaviour)
- CAS success and failure cases
- Bank contention (same bank back-to-back)
- Page register set + offset access
- Boundary conditions (address 0, address 1023, page wraparound)
- I-structure tests:
- WRITE then READ: immediate result
- READ then WRITE: deferred read — verify result arrives after WRITE
- READ on EMPTY, WRITE to different cell, WRITE to target cell: verify deferred read resolves only on correct address
- Two deferred READs (different cells): verify stall on second, then resolution when first is satisfied, then second proceeds
- CLEAR on FULL cell, then READ: verify deferral
- CLEAR on WAITING cell: verify deferred read is cancelled (no result token emitted)
- WRITE on FULL cell: verify overwrite + diagnostic indicator
- Variable opcode decode tests:
- Verify READ/WRITE/ALLOC/FREE/EXEC reach full 1024-cell range
- Verify READ_INC/READ_DEC/CAS/CLEAR etc. restricted to lower 256 cells
- Verify op[2:1]=11 decode correctly separates opcode extension from address
Memory Tiers#
SM address space supports regions with different semantics, selectable by address range. The tier is not encoded in the token - it is determined by the target address within the SM.
Tier 0: Raw Memory#
No presence bits. SRAM read/write only. Suitable for bulk data that does not need synchronization (image buffers, DMA staging areas, ROM).
- READ always returns immediately (no deferral)
- WRITE always succeeds
- Presence metadata bits are ignored / not maintained
- Allocated by compiler or loader, not by ALLOC/FREE
- Address range: configurable, typically the top of the SM address space
Tier 1: I-Structure Memory#
Standard I-structure semantics with presence tracking. This is the default operating mode described throughout this document.
- READ on EMPTY/RESERVED cell defers until WRITE
- WRITE transitions cell to FULL
- ALLOC/FREE manage the free list
- Full presence metadata (4 bits per cell)
- Address range: configurable, typically the bulk of SM address space
Tier 2: Wide/Bulk Memory#
Extends tier 1 with wide pointer support. Cells tagged with is_wide=1 in presence metadata are treated as the base of a (pointer, length) pair: the cell itself holds the data pointer, the next cell holds the length.
- READ checks is_wide and returns either 1 cell (normal) or 2 cells (wide pointer pair, which requires a 3-flit result token or two result tokens)
- WRITE to a wide cell writes both pointer and length
- Enables ITERATE, COPY_RANGE, and EXEC to take wide pointers as arguments
- Address range: overlaps with tier 1 (any tier 1 cell can be marked wide)
Tier Boundary Configuration#
- Boundaries are set by config registers (SET_PAGE or a dedicated config mechanism) during bootstrap
- The SM's address decoder checks the incoming address against tier boundaries to select behaviour
- Hardware cost: 1-2 comparators + a mux on the presence metadata path
- v0 can use a fixed split (e.g., lower 768 = tier 1, upper 256 = tier 0) and defer runtime-configurable boundaries
Design status: tiers are directionally decided. Exact address range splits, tier 2 wide-read mechanics, and interaction between tiers and page registers are still being refined.
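For the fixed v0 split mentioned above (lower 768 cells tier 1, upper 256 cells tier 0), the address decoder's tier check reduces to one comparison. The sketch below is illustrative only; the constant and function names are assumptions, and tier 2 is not modelled because it overlaps tier 1 and is selected by is_wide rather than by address.

```python
TIER0_BASE = 768   # example fixed split from the text: cells 768..1023 are tier 0


def tier_of(addr):
    """Return 0 for raw memory, 1 for I-structure memory."""
    return 0 if addr >= TIER0_BASE else 1


def read_defers(addr, cell_is_full):
    """Tier 0 reads never defer; tier 1 reads defer unless the cell is FULL."""
    return tier_of(addr) == 1 and not cell_is_full
```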
Wide Pointers and Bulk Operations#
Wide Pointer Format#
A wide pointer occupies 2 consecutive SM cells:
Cell N: [data_pointer:16] (base address in SM or external memory)
Cell N+1: [length:16] (element count)
Cell N has is_wide=1 in its presence metadata. The SM knows to read both cells when servicing a READ on a wide cell.
Wide pointers are the parameter format for bulk operations. A CM does not iterate over SM contents directly — it sends an SM operation with a wide pointer, and the SM's sequencer engine handles the iteration internally.
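A wide-cell read can be pictured as below. This is a sketch, not the committed result format: it returns a Python tuple and leaves open whether the hardware sends one 3-flit token or two tokens. The `data` and `is_wide` lists stand in for the data SRAM and the presence metadata.

```python
def read_cell(data, is_wide, addr):
    """Return (value,) for a normal cell or (pointer, length) for a wide cell."""
    if is_wide[addr]:
        return (data[addr], data[addr + 1])   # cell N = pointer, N+1 = length
    return (data[addr],)


data = [0] * 8
is_wide = [False] * 8
data[2], data[3] = 0x0100, 16                 # wide pointer: base 0x0100, 16 elements
is_wide[2] = True
pointer, length = read_cell(data, is_wide, 2)
```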
exec#
exec reads pre-formed tokens from a contiguous region of SM and pushes them onto the bus. The SM becomes a token source - effectively an autonomous injector.
exec request:
flit 1: [...exec opcode...][count:8]
flit 2: [base_addr from config register or wide pointer]
The SM's sequencer reads count 2-cell entries starting at base_addr. Each entry is a pre-formed 2-flit token (flit 1 in cell N, flit 2 in cell N+1). The sequencer emits them onto the bus in order.
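The sequencer walk described above amounts to a counted loop over cell pairs. The generator below is a minimal sketch of that behaviour, assuming `memory` is a flat list of 16-bit cells; `exec_tokens` is a hypothetical name and the bus output path is reduced to yielding tuples.

```python
def exec_tokens(memory, base_addr, count):
    """Yield `count` pre-formed 2-flit tokens stored as consecutive cell pairs."""
    for i in range(count):
        n = base_addr + 2 * i
        yield (memory[n], memory[n + 1])      # flit 1 in cell N, flit 2 in cell N+1


memory = [0] * 16
memory[4:8] = [0x1111, 0x2222, 0x3333, 0x4444]
emitted = list(exec_tokens(memory, base_addr=4, count=2))
```

The same address counter and limit check would serve iter and cp, which is the hardware reuse the text points to.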
Bootstrap use: on system reset, SM00 is wired to execute exec on a predetermined ROM address. The ROM contents are pre-formed IRAM write tokens and seed tokens; everything needed to load the system. No external microcontroller needed for self-hosted boot. See "SM00 Bootstrap" below.
Hardware reuse: the exec sequencer (address counter, limit comparator, increment logic, output path to bus) is the same hardware needed for iter and cp. Building exec for bootstrap gives bulk operations nearly for free.
iter#
Reads each cell in a range and emits a result token for each. Takes a wide pointer (base + length). The SM's sequencer walks the range, constructing result tokens using a pre-loaded return routing template.
cp#
Copies a contiguous range of cells from one SM region to another (or to a different SM). Takes source wide pointer and destination base. Useful for structure copying, GC compaction.
Design status: iter and cp are directionally committed. Exact token format for range parameters, interaction with deferred reads in the target range, and atomicity guarantees are still being refined.
SM00 Bootstrap#
Reset Behaviour#
SM00 has dedicated wiring to the system reset signal. On reset:
- SM00's sequencer triggers exec on a predetermined ROM base address
- The ROM region contains pre-formed tokens: IRAM write tokens to load PE instruction memories, followed by seed tokens to start execution
- SM00 emits these tokens onto the bus in order
- PEs receive IRAM writes and load their instruction memories
- Seed tokens fire and execution begins
The reset vector wiring is the only hardware specialization of SM00. At runtime, SM00 behaves as a standard SM with standard opcodes.
ROM Mapping#
The bootstrap ROM is mapped into SM00's address space (tier 0 - raw memory, no presence bits). It can be:
- Physical ROM/EEPROM on SM00's address bus
- A region of SM00's SRAM pre-loaded by an external microcontroller (development/prototyping)
- Flash memory accessed via page register for larger images
Future Specialization (not committed)#
SM00 could be further specialized for IO:
- Atomic/alloc opcodes could be repurposed for IO-specific operations (e.g., rd_inc becomes "read UART with auto-acknowledge")
- Memory-mapped IO devices occupy a reserved address range within SM00
- SM00 could have additional interrupt-sensing hardware that triggers token emission on external events
This is documented as a design option but not committed for v0. The standard SM opcodes are sufficient for basic IO via I-structure semantics: read from an IO-mapped address defers until the IO device writes data.
Presence-Bit Guided IRAM Writes#
The matching store's presence bitmap provides information about whether any dyadic instruction slot has a pending (half-matched) operand. During IRAM writes, the PE can check the presence bits for the IRAM page being overwritten:
- If all presence bits for slots in that page are clear, no tokens are pending and the page can be safely overwritten without drain delay
- If any presence bit is set, the PE knows tokens are in flight for that instruction and can either wait or discard the stale operand
This enables more targeted IRAM replacement than the blanket drain protocol. Instead of draining the entire PE, only the affected page needs attention. The valid-bit page protection mechanism (bus-architecture-and-width-decoupling.md) remains the safety net, but presence-bit checking can eliminate unnecessary stalls.
Open Design Questions#
- Page register per-CM or global? - if multiple CMs access the same SM, do they share a page register (contention) or each have their own (more hardware, more config)? Probably global for v0.
- Banking vs pipeline depth - with 2 banks, can we overlap a read to bank 0 with a write to bank 1? Worth the control complexity for v0? Presence state SRAM complicates this - is presence per-bank or shared? If shared, it serializes cross-bank operations. If per-bank, each bank needs its own presence SRAM. Probably shared for v0 (simpler).
- Atomic ops on non-FULL cells - rd_inc/rd_dec/cas on EMPTY or WAITING cells is currently undefined. Options: error, stall, or treat as zero. Error is safest for v0.
- Direct path input mux - when direct path is added, the SM needs a mux between bus input and direct input feeding the internal command bus. Arbitration policy TBD (direct path priority? round-robin?).
- Wide pointer read format - does a READ on a wide cell return two separate result tokens or one 3-flit result token? Two tokens is simpler (reuse existing result formatter), 3-flit is more atomic.
- Tier boundary mechanism - fixed at build time, config register, or address-range comparator? Fixed is simplest for v0.
- iter return template - how is the return routing template supplied for iterated results? Preloaded config register? Part of the iter request? Template per element or shared?
- SM00 ROM size and mapping - how large is the bootstrap ROM? What address range? How does it interact with the page register?