design-notes/sm-design.md at main · nonbinary.computer/or1-design

OR-1 dataflow CPU sketch
Orual feat: rewrite ProcessingElement with frame-based matching, output routing, and unified instruction set 18d ago
65613978
  1# Dynamic Dataflow CPU: SM (Structure Memory) Design
  2
  3Covers the SM interface protocol, operation set, banking scheme, address space extension, memory tiers, wide pointers, bulk operations, EXEC/bootstrap, and hardware architecture.
  4
  5See `architecture-overview.md` for module taxonomy and token format.
  6See `network-and-communication.md` for how SM connects to the bus.
See `bus-architecture-and-width-decoupling.md` for bus width rationale.
  8See `sm-and-token-format-discussion.md` for extended discussion of design decisions including DRAM latency context, bootstrap unification, and address space distribution.
  9
 10## Role
 11
 12SM stores structured data (arrays, lists, heap), performs operations on it, and provides memory-mapped IO.
 13
 14SM is **synchronising memory** — not just a data store. Tier 1 cells have presence state (empty/full), and reads to empty cells are deferred until a write arrives. This gives implicit producer-consumer synchronization without locks or explicit message-passing: the write *is* the signal. This is the dataflow architecture's answer to shared mutable state.
 15
 16**IO is memory-mapped into SM address space.** An SM (typically SM00 at v0) maps IO devices into its address range. I-structure semantics provide natural interrupt-free IO: a READ from an IO device that has no data defers until data arrives, triggering the receiving node in the dataflow graph.
 17
 18From a CM's perspective: send a bit[15]=1 request, get a CM result token back eventually. Split-phase, asynchronous relative to the requesting CM. A READ may return immediately (cell is full) or later (cell is empty, deferred until written).
 19
 20See `em4-analysis.md` for context on why dedicated synchronizing memory matters vs flattening structure storage into PE-local SRAM.
 21
**Internal data path: 16-bit**, matching the SRAM word size. Tokens arrive as 2 flits on the 16-bit external bus and are deserialized at the SM input FIFO into a reassembled logical token. After processing, result tokens are serialized back into flits at the output. The SM never needs an internal data path wider than 16 bits for data operations.
 24
 25## Interface Protocol
 26
 27Stateless request handling: the request token carries its own return routing info in the bits that are unused by that operation type. SM does not maintain persistent per-request state, outside of caching the return token template when a read is deferred.
 28
 29### Request Format (bit[15]=1, 2-flit standard)
 30
 31All SM requests arrive as 2 flits on the 16-bit bus and are reassembled at the SM input FIFO before processing.
 32
 33```
 34Flit 1 (common to all SM ops):
 35  [1][SM_id:2][op_base:3][addr/op_ext:2][addr_low:8]              = 16 bits
 36
 37  15 bits available after the SM discriminator bit. SM_id (2 bits) selects
 38  the target SM. The remaining 13 bits encode opcode and address using
 39  variable-width encoding:
 40
 41  When op[2:1] != 11: 3-bit opcode, 10-bit addr (1024 cells)
 42  When op[2:1] == 11: extends to 5-bit opcode, 8-bit payload (256 cells)
 43
 44  Decode signal: op[2] AND op[1] — one gate.
 45
 46Flit 2 varies by operation:
 47
 48  WRITE: [data:16]
 49    Write data to address. No DN response unless cell was WAITING
 50    (deferred read satisfied — result token emitted using saved routing).
 51
 52  READ / CLEAR / ALLOC / FREE: [return_routing:16]
 53    Flit 2 carries a **pre-formed CM token template**. The SM's result
 54    formatter latches this template, prepends it as the result's flit 1,
 55    and appends read data as flit 2. No bit-shuffling — the requesting
 56    CM does all format work upfront. The SM treats this as an opaque
 57    16-bit blob.
 58    (CLEAR/ALLOC/FREE may not need return routing — if no result token
 59    is emitted, flit 2 could be omitted or carry flags instead. TBD.)
 60
 61  READ_INC / READ_DEC: [return_routing:16]
 62    Same as READ. Atomic ops always return the old value.
 63
 64  CAS — compare-and-swap:
 65    Requires 3-flit extended format to carry expected_value, new_value,
 66    AND return routing:
 67      flit 2: [expected_value:16]
 68      flit 3: [new_value:8][return_routing:8] (or similar split — TBD)
 69    Alternatively, CAS could use two sequential 2-flit packets.
 70    Design TBD — CAS is the most complex SM operation.
 71```
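
A minimal Python sketch of how a CM might pack the two flits of a READ or WRITE request, assuming the flit 1 layout above. The function names (`encode_flit1`, `read_request`, `write_request`) are hypothetical and only serve the illustration; the opcode values are the bus opcodes listed in the opcode table further down.

```python
READ, WRITE = 0b000, 0b001   # 3-bit bus opcodes (see opcode table)

def encode_flit1(sm_id: int, opcode: int, operand: int) -> int:
    """Pack flit 1 as [1][SM_id:2][op_base:3][addr/op_ext:2][addr_low:8].

    `opcode` is either a 3-bit bus opcode (0b000..0b101, 10-bit address)
    or a 5-bit bus opcode (0b11000..0b11111, 8-bit payload).
    """
    if not 0 <= sm_id < 4:
        raise ValueError("SM_id is a 2-bit field")
    if 0 <= opcode <= 0b101:                 # full-range tier
        if not 0 <= operand < 1024:
            raise ValueError("address must fit in 10 bits")
        body = (opcode << 10) | operand
    elif 0b11000 <= opcode <= 0b11111:       # restricted tier
        if not 0 <= operand < 256:
            raise ValueError("payload must fit in 8 bits")
        body = (opcode << 8) | operand
    else:
        raise ValueError("not a valid bus opcode")
    return (1 << 15) | (sm_id << 13) | body  # bit[15]=1 marks an SM request

def read_request(sm_id: int, addr: int, return_routing: int) -> tuple[int, int]:
    """Flit 2 of a READ is the pre-formed CM token template."""
    return encode_flit1(sm_id, READ, addr), return_routing & 0xFFFF

def write_request(sm_id: int, addr: int, data: int) -> tuple[int, int]:
    """Flit 2 of a WRITE is the 16-bit data word."""
    return encode_flit1(sm_id, WRITE, addr), data & 0xFFFF
```

For example, `write_request(1, 0x3FF, 0xBEEF)` yields the flit pair `(0xA7FF, 0xBEEF)` under this layout.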
 72
 73**3-flit extended addressing mode**: for access to external RAM or memory-mapped IO address spaces, a 3-flit SM token provides wider addresses at the cost of one extra flit cycle. The EXT opcode (one of the 3-bit tier) signals 3-flit format.
 74
 75### Result Format (on DN, pre-formed CM token)
 76
 77SM uses the pre-formed CM token template from the request's flit 2 as the result's flit 1, and appends the read data as flit 2:
 78
 79```
 80Result (2 flits):
 81  flit 1: [return_routing:16]   (opaque template from original request)
 82  flit 2: [fetched_data:16]
 83```
 84
 85The template can encode any CM token format whose routing fits in 16 bits: dyadic wide, monadic normal, or monadic inline. This means SM read results can land directly in a matching store slot as one operand of a dyadic instruction.

## Cell State Model (I-Structure Semantics)
 87
 88Each SM cell has a 2-bit state field in addition to its 16-bit data value.
 89
 90```
 91State    Bits   Meaning
 92───────────────────────────────────────────────────────────
 93EMPTY      00   Never written, or cleared. READ defers.
 94RESERVED   01   Allocated but not yet written. READ defers.
 95FULL       10   Written. READ returns value immediately.
 96WAITING    11   Empty + a deferred READ is pending for this cell.
 97```
 98
 99### State Transitions
100
101```
102                    WRITE
103  EMPTY ──────────────────────────► FULL
104    │                                 │
105    │ READ                            │ READ
106    ▼                                 ▼
107  WAITING ────────────────────────► FULL   (WRITE satisfies deferred reader,
108    (deferred read registered)             sends result, cell becomes FULL)
109                                      │
110                                      │ CLEAR
111                                      ▼
112                                    EMPTY
113
114  EMPTY ──── ALLOC ──► RESERVED ──── WRITE ──► FULL
115  FULL  ──── FREE  ──► EMPTY
116  any   ──── CLEAR ──► EMPTY
117```
118
119Detailed transitions:
120
121- **READ on EMPTY or RESERVED**: cell transitions to WAITING. return routing is saved in the deferred read register. no result token emitted yet. if the deferred read register is already occupied (by a different cell), SM stalls and the request stays in the input FIFO until the
122  pending deferred read is satisfied.
123- **READ on FULL**: result token emitted immediately. Cell stays FULL. Multiple reads of the same full cell are fine: all get the value.
124- **READ on WAITING**: error condition (second deferred read to same cell while first is pending). SM behaviour TBD. Options: stall, return error token, or drop silently. stall is safest for v0.
125- **WRITE on EMPTY or RESERVED**: data stored, cell transitions to FULL. no result token.
126- **WRITE on WAITING**: data stored, cell transitions to FULL. SM immediately uses the saved return routing from the deferred read register to emit a result token with the written data. deferred read register is freed.
127- **WRITE on FULL**: depends on design choice. options:
128  -  **overwrite silently**: data replaced, cell stays FULL. simple, but hides bugs.
129  - **error**: SM emits an error/diagnostic signal. data not written. safe but requires error handling.
130  - **overwrite + diagnostic flag**: data replaced, diagnostic LED or counter incremented. best of both for v0.
131- **CLEAR on any state**: cell transitions to EMPTY. if cell was WAITING, the pending deferred read is cancelled (return routing discarded from deferred read register, no result token emitted). CLEAR never emits a result token.
132- **ALLOC**: transitions EMPTY->RESERVED. no effect on non-EMPTY cells.
133- **FREE**: transitions any->EMPTY. if cell was WAITING, cancels the deferred read (same as CLEAR).
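
The transitions above can be sketched as a small behavioural model in Python. This is an illustration under stated assumptions, not the hardware design: it picks the v0 options the text leans towards (stall on READ of a WAITING cell, overwrite plus a diagnostic counter on WRITE to a FULL cell), and the class name `SMModel` and the `"STALL"` return value are hypothetical conveniences.

```python
from enum import IntEnum

class State(IntEnum):
    EMPTY = 0b00
    RESERVED = 0b01
    FULL = 0b10
    WAITING = 0b11

class SMModel:
    """One SM with a depth-1 deferred read register (assumed v0 choices)."""

    def __init__(self, cells: int = 1024):
        self.state = [State.EMPTY] * cells
        self.data = [0] * cells
        self.deferred = None      # (cell_addr, return_routing) or None
        self.overwrites = 0       # diagnostic counter for WRITE on FULL

    def read(self, addr, return_routing):
        """Return a (flit1, flit2) result, None if deferred, or "STALL"."""
        if self.state[addr] == State.FULL:
            return (return_routing, self.data[addr])
        if self.state[addr] == State.WAITING or self.deferred is not None:
            return "STALL"        # request would stay in the input FIFO
        self.state[addr] = State.WAITING
        self.deferred = (addr, return_routing)
        return None

    def write(self, addr, value):
        """Store value; return a result token if a deferred read is satisfied."""
        if self.state[addr] == State.FULL:
            self.overwrites += 1  # overwrite + diagnostic, no token emitted
        result = None
        if self.deferred is not None and self.deferred[0] == addr:
            result = (self.deferred[1], value)
            self.deferred = None
        self.data[addr] = value
        self.state[addr] = State.FULL
        return result

    def clear(self, addr):
        """any -> EMPTY; cancels a pending deferred read, never emits a token."""
        if self.deferred is not None and self.deferred[0] == addr:
            self.deferred = None
        self.state[addr] = State.EMPTY

    free = clear                  # semantically identical for now

    def alloc(self, addr):
        """EMPTY -> RESERVED; no effect on non-EMPTY cells."""
        if self.state[addr] == State.EMPTY:
            self.state[addr] = State.RESERVED
```

A READ followed by a WRITE to the same cell returns `None` from `read` and the result token from `write`, which mirrors the deferred-read path described above.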
134
135### Presence Metadata Hardware
136
1374 bits per cell: `[presence:2][is_wide:1][spare:1]`
138
139- **presence:2** - EMPTY/RESERVED/FULL/WAITING (checked on every operation)
140- **is_wide:1** - tags this cell as part of a wide pointer pair. The SM checks this before deciding whether to also read the next cell for length metadata. See "Wide Pointers" section below.
141- **spare:1** - reserved. Candidates: write-once flag, type tag, owner ID.
142
143At 1024 cells = 4096 bits = 512 bytes. Using a byte-wide SRAM chip means bits 4-7 are physically present whether used or not. Committing 4 bits with 4 spare avoids needing to change the presence SRAM layout when additional per-cell metadata is added during testing.
144
145Implementation:
146
- **Small SRAM** alongside the data SRAM. Addressed in parallel, read/written on every operation. One 8-bit-wide SRAM chip covers 2 cells per byte (4 bits each), so 1024 cells need 512 bytes, which easily fits.
148- Must support single-cycle read-modify-write (read state, decide action, write new state) within one clock. Achievable if the state SRAM access time is less than half the clock period (read in first half, write in second half; same half-clock RMW technique as the EM-4 matching stage).
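
As a rough illustration of the packed layout, the following Python sketch stores two 4-bit metadata entries per byte. The nibble ordering (even cell in the low nibble) and the names `pack_meta` and `PresenceSRAM` are assumptions made for the example; the text does not fix them.

```python
def pack_meta(presence: int, is_wide: bool, spare: int = 0) -> int:
    """Build one 4-bit entry: [presence:2][is_wide:1][spare:1]."""
    return ((presence & 0b11) << 2) | (int(is_wide) << 1) | (spare & 1)

class PresenceSRAM:
    """Byte-wide metadata store, two cells per byte (assumed ordering)."""

    def __init__(self, cells: int = 1024):
        self.mem = bytearray(cells // 2)

    def read(self, cell: int) -> int:
        byte = self.mem[cell >> 1]
        return (byte >> 4) if cell & 1 else (byte & 0x0F)

    def write(self, cell: int, nibble: int) -> None:
        i = cell >> 1
        if cell & 1:   # odd cell: high nibble
            self.mem[i] = (self.mem[i] & 0x0F) | ((nibble & 0x0F) << 4)
        else:          # even cell: low nibble
            self.mem[i] = (self.mem[i] & 0xF0) | (nibble & 0x0F)
```

In hardware the read and write halves would happen within one clock (the half-clock RMW described above); the model simply performs them as two calls.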
149
150## Deferred Read Register
151
152Single register per SM. holds the return routing for one pending deferred read.
153
154```
155Deferred Read Register:
156  [valid:1][cell_addr:10][return_routing:16]  = 27 bits
157```
158
159- **valid**: set when a READ hits an EMPTY/RESERVED cell. Cleared when the deferred read is satisfied (WRITE to the target cell) or cancelled (CLEAR/FREE to the target cell).
160- **cell_addr**: which cell this deferred read is waiting on (10-bit address for 1024-cell range). Compared against incoming WRITE addresses to detect satisfaction.
161- **return_routing**: the 16 bits from the original READ's flit 2 (the pre-formed CM token template). Used as flit 1 of the result token when the deferred read is satisfied.
162
163### Deferred Read Satisfaction
164
165On every WRITE, the SM checks: `valid == 1 AND write_addr == cell_addr`. If true:
166
1671. Store the write data in the cell (normal WRITE behaviour).
1682. Transition cell state to FULL.
1693. Emit result token: flit 1 = saved return_routing, flit 2 = written data.
1704. Clear the deferred read register (valid = 0).
171
172Hardware cost: one 10-bit comparator (cell_addr vs write_addr), one AND gate (valid AND addr_match), one 27-bit register. Trivial - maybe 3-4 TTL chips total.
173
174### Depth-1 Limitation and Backpressure
175
176With one deferred read register per SM, only one cell can have a pending deferred read at a time. if a second READ arrives for a different empty cell while the register is occupied, the SM cannot service it.
177
178in practice at v0 scale (4 CMs, low contention), this should be rare.
179the compiler can also help by ensuring that reads and writes to the same
180cell are ordered appropriately in the program graph.
181
182### Multi-Slot Deferred Read CAM (Candidate Enhancement)
183
184If depth-1 proves too restrictive, expanding to a multi-entry deferred read store using a small CAM (content-addressable memory) is a natural fit. A CAM does associative lookup in one cycle - present the WRITE address, the CAM match line fires if any entry holds that address. No sequential scan, no priority logic changes to the SM pipeline.
185
186```
187Deferred Read CAM (e.g., 4 entries):
188  Entry 0: [valid:1][cell_addr:10][return_routing:16]
189  Entry 1: [valid:1][cell_addr:10][return_routing:16]
190  Entry 2: [valid:1][cell_addr:10][return_routing:16]
191  Entry 3: [valid:1][cell_addr:10][return_routing:16]
192
193On READ hitting EMPTY cell:
194  - Find first invalid entry (priority encoder), write cell_addr +
195    return_routing, set valid
196  - If all entries valid: stall (same as depth-1 overflow)
197
198On WRITE:
199  - Present write_addr to CAM match lines
200  - If any entry matches: satisfy that deferred read, clear entry
201  - Normal WRITE proceeds regardless
202```
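
A software stand-in for the 4-entry CAM, written in Python. The loops below model what the hardware would do in parallel (priority encoder for allocation, match lines for lookup); the class and method names are hypothetical.

```python
class DeferredReadCAM:
    """Multi-slot deferred read store; each slot is (cell_addr, return_routing)."""

    def __init__(self, entries: int = 4):
        self.slots = [None] * entries      # None models valid=0

    def allocate(self, cell_addr: int, return_routing: int) -> bool:
        """Register a deferred read. False means all entries are valid: stall."""
        for i, slot in enumerate(self.slots):   # first invalid entry wins
            if slot is None:
                self.slots[i] = (cell_addr, return_routing)
                return True
        return False

    def match_write(self, write_addr: int, data: int):
        """On WRITE: return the result token for a matching entry, else None."""
        for i, slot in enumerate(self.slots):
            if slot is not None and slot[0] == write_addr:
                self.slots[i] = None            # clear the satisfied entry
                return (slot[1], data)
        return None
```

The normal WRITE path (store data, set FULL) is outside this sketch and proceeds regardless of whether `match_write` finds an entry.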
203
204**Why CAM is ideal here:** the deferred read lookup is inherently associative - "does any pending read match this write address?" A register file would require sequential comparison against each entry. A CAM answers in one cycle regardless of entry count. Small CAMs (4-16 entries) existed as discrete TTL/CMOS parts and are also trivial to build from comparators + registers.
205
206**Hardware cost:** 4 entries × (10-bit comparator + 27-bit register) + priority encoder for allocation + match OR for satisfaction detection. Estimated 8-12 TTL chips for a 4-entry CAM — roughly double the single-register cost. Alternatively, the National Semiconductor 100142
207(4-word × 4-bit, ECL, see `datasheets/NATLS21982-1.pdf`) is a discrete CAM chip that provides 4-entry associative lookup in a single package. Two 100142s cascade to 4 words × 8 bits, covering the cell address match width. The return routing storage still requires separate
208registers, but the address-match portion shrinks to 2-3 chips instead of a comparator tree.
209
210**IO motivation:** the strongest argument for multiple deferred read slots comes from IO on SM00 (see `io-and-bootstrap.md`). The always-pending deferred read pattern for IO permanently occupies a slot. With a single slot, SM00 can either service IO *or* do normal I-structure deferred reads, but not both. Even 2 slots (one for IO, one for memory operations) resolves this. 4 slots covers multiple IO sources (UART + SPI + timer) without needing SM00 specialization or
211spontaneous token emission hardware.
212
213**Uniform vs SM00-only:** giving all SMs multi-slot deferred reads keeps the architecture uniform (no special cases). The per-SM cost increase is small. Alternatively, only SM00 gets the CAM and other SMs keep depth-1 - saves chips but adds a special case. The uniform approach is preferred unless chip budget is extremely tight.

## Operation Set
215
216### Internal Representation
217
218The SM internally uses a **4-bit opcode + 12-bit address** command format, regardless of how the command arrived:
219
220```
221Internal command: [op:4][addr:12][data:16][return_routing:16]
222```
223
224This is the canonical form that the op decoder, state machine, and ALU all work with. input interfaces (bus adapter, future direct path) translate their respective wire formats into this internal representation.
225
226### Bus Encoding (flit 1: variable-width opcode)
227
228On the 16-bit bus, flit 1 of an SM token is:
229
230```
231  [1][SM_id:2][op_base:3][op_ext/addr:2][addr:8]           = 16 bits
232```
233
23415 bits available after the SM discriminator. SM_id (2 bits) selects the target SM. The remaining 13 bits encode opcode and address using variable-width encoding.
235
236The interpretation of the 2 bits after op_base depends on op_base[2:1]:
237
238```
239op[2:1] != 11:
240  3-bit opcode, 10-bit addr (1024 cells).
241  The 2 extension bits are the high address bits.
242
243op[2:1] == 11:
244  Extends to 5-bit opcode, 8-bit payload (256 cells or inline data).
245  The 2 extension bits are opcode extension.
246```
247
248Decode signal: `op[2] AND op[1]` — one gate.
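
The variable-width decode can be sketched in Python as follows. The `DecodedFlit1` tuple and `decode_flit1` function are illustrative names, not part of the design; the bit positions follow the flit 1 layout given above.

```python
from typing import NamedTuple

class DecodedFlit1(NamedTuple):
    sm_id: int
    opcode: int        # 3-bit or 5-bit bus opcode
    operand: int       # 10-bit address or 8-bit payload
    restricted: bool   # True when the 5-bit (256-cell) tier was selected

def decode_flit1(flit: int) -> DecodedFlit1:
    if not (flit >> 15) & 1:
        raise ValueError("bit[15]=0: not an SM request")
    sm_id = (flit >> 13) & 0b11
    op_base = (flit >> 10) & 0b111
    ext = (flit >> 8) & 0b11
    low = flit & 0xFF
    # The one-gate decode signal: op[2] AND op[1].
    restricted = bool((op_base >> 2) & (op_base >> 1) & 1)
    if restricted:
        # Extension bits join the opcode; the operand is the 8-bit payload.
        return DecodedFlit1(sm_id, (op_base << 2) | ext, low, True)
    # Extension bits are the high address bits of a 10-bit address.
    return DecodedFlit1(sm_id, op_base, (ext << 8) | low, False)
```

For instance, `decode_flit1(0xF812)` selects SM 3 with 5-bit opcode `0b11000` and payload `0x12`.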
249
250### Opcode Table (bus encoding)
251
252```
253op_base  ext   bus opcode  internal op  addr bits  name
254─────────────────────────────────────────────────────────────────
255  000     aa      000       0000        10 (1024)  read
256  001     aa      001       0001        10         write
257  010     aa      010       0010        10         alloc
258  011     aa      011       0011        10         free
259  100     aa      100       0100        10         exec
260  101     aa      101       0101        10         ext (3-flit mode)
261  110     00      11000     0110         8 (256)   rd_inc
262  110     01      11001     0111         8         rd_dec
263  110     10      11010     1000         8         cas
264  110     11      11011     1001         8         raw_rd
265  111     00      11100     1010         8         clear
266  111     01      11101     1011         8         set_pg
267  111     10      11110     1100         8         write_im
268  111     11      11111     1101         8         (spare)
269```
270
271'aa' = address bits (part of 10-bit address).
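
A possible rendering of the table as a lookup, in Python, producing the 16-bit `[op:4][addr:12]` command word used internally. The dictionary mirrors the table rows; zero-extending the 8-bit payload into the 12-bit address field is an assumption of this sketch, and `internal_command` is a hypothetical helper name.

```python
# bus opcode -> (internal 4-bit op, name), copied from the opcode table
BUS_TO_INTERNAL = {
    0b000: (0b0000, "read"),    0b001: (0b0001, "write"),
    0b010: (0b0010, "alloc"),   0b011: (0b0011, "free"),
    0b100: (0b0100, "exec"),    0b101: (0b0101, "ext"),
    0b11000: (0b0110, "rd_inc"),   0b11001: (0b0111, "rd_dec"),
    0b11010: (0b1000, "cas"),      0b11011: (0b1001, "raw_rd"),
    0b11100: (0b1010, "clear"),    0b11101: (0b1011, "set_pg"),
    0b11110: (0b1100, "write_im"), 0b11111: (0b1101, "spare"),
}

def internal_command(bus_opcode: int, operand: int) -> int:
    """Translate a decoded bus opcode + operand into [op:4][addr:12]."""
    op, _name = BUS_TO_INTERNAL[bus_opcode]
    return (op << 12) | (operand & 0xFFF)   # assumed zero-extension of operand
```

With this mapping, `internal_command(0b11010, 0x20)` gives `0x8020`, that is, internal op `1000` (`cas`) on cell `0x020`.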
272
273**Tier 1 ops (3-bit, full address range):** `read`, `write`, `alloc`, `exec`, `free` reach the full 1024-cell address space. EXT signals a 3-flit token for extended addressing (external RAM, wide addresses).
274
**Tier 2 ops (5-bit, restricted address/payload):** atomic operations (`rd_inc`, `rd_dec`, `cas`) as well as `clear` are restricted to 256 cells - the compiler places atomic-access cells in the lower range. `set_pg` and `write_im` use the 8-bit payload field for non-address data (length, page register value, small immediate).
277
278**Key insight for variable-width encoding:** not all SM ops are
279`op(address)`. Some are `op(config_value)` or `op()` with no cell
280operand. The 8-bit payload in the restricted tier can be inline data,
281config values, or range counts depending on the opcode:
282
283- `set_pg`: payload = page register value
284- `write_im`: 8-bit addr, flit 2 carries immediate data
285- `raw_rd`: non-blocking read, returns data or empty indicator without registering a deferred read. Useful for polling and diagnostics.
286
287### Direct Path Encoding (future)
288
289A dedicated CM-SM link bypasses the bus. on the direct path, the type
290and SM_id fields are redundant (the wires go to exactly one SM), so the
291full 16 bits of the command word are available:
292
293```
294direct path: [op:4][addr:12]  = 16 bits
295```
296
2974-bit opcode (all 16 internal ops directly addressable) + 12-bit address
298(4096 cells). the SM's internal command representation matches this
299directly — no translation needed.
300
301**v0 implementation**: direct path not built. the extra input lines on
302the SM's internal command bus are disabled (active-high for addr lines,
303active-low for unused op bits) — physically present but unused. when/if
304a direct path is added, the bus adapter and direct path adapter both
305feed into the same internal command bus via a mux.
306
307### Extended Bus Encoding (3-flit SM token via `ext` opcode)
308
309For operations that need wider addresses or additional data (`cas` with both expected and new values), 3-flit SM tokens (via EXT opcode) carry the full 4-bit opcode in the extra flit:
310
311```
312flit 1: [type:2=10][SM_id:2][flags/op_hint:3][addr_low:9]   = 16 bits
313flit 2: [extended_op:4][addr_high:4][data:8]                  = 16 bits
314         or [expected_value:16] for CAS
315flit 3: [new_value:16] or [return_routing:16]                 = 16 bits
316```
317
318Exact 3-flit format TBD. the point is that extended tokens can carry the
319full internal 4-bit opcode, giving access to all 16 internal operations
320including any that are only available via direct path or extended format.
321
322### Operation Details (updated for I-structure semantics)
323
324**`read`** (internal 0000): I-structure blocking read. if cell is FULL,
325returns data immediately. if cell is EMPTY or RESERVED, registers a
326deferred read (cell → WAITING) and returns the data later when a WRITE
327arrives. see state transition table above.
328
329**`write`** (internal 0001): stores data. transitions EMPTY/RESERVED → FULL.
330if cell is WAITING, satisfies the deferred reader (emits result token
331using saved return routing). if cell is already FULL, overwrites +
332diagnostic flag (v0 behaviour, see state transition discussion).
333WRITE to a FULL cell does NOT emit a result token — the overwrite
334diagnostic is a local indicator (LED/counter), not a token.
335
336**`alloc`** (internal 0010): transitions EMPTY → RESERVED. for heap
337management. deferred to post-v0 but opcode and state transition defined.
338
339**`free`** (internal 0011): transitions any → EMPTY. cancels any pending
340deferred read on that cell. deferred to post-v0.
341
**`exec`** (internal 0100): reads pre-formed tokens from a contiguous region of SM and pushes them onto the bus. see the `exec` section under "Wide Pointers and Bulk Operations".

**`ext`** (internal 0101): signals a 3-flit SM token for extended addressing. see "Extended Bus Encoding" above.

**`rd_inc`** (internal 0110, restricted to lower 256 cells): atomic fetch-and-add(+1). reads current value, increments, writes back, returns old value. operates on FULL cells only: READ_INC on an EMPTY cell is an error (returns error indicator or stalls). does not interact with I-structure state transitions (cell stays FULL).

**`rd_dec`** (internal 0111, restricted to lower 256 cells): atomic fetch-and-add(-1). same semantics as READ_INC but decrements. CM checks returned value for zero to detect refcount exhaustion.

**`cas`** (internal 1000, restricted to lower 256 cells): compare-and-swap. requires extended format (3-flit or sequential 2-flit) to carry both expected_value and new_value + return routing. operates on FULL cells only. returns old value regardless of success/failure; CM infers success by comparing returned value with expected.

**`raw_rd`** (internal 1001, restricted to lower 256 cells): non-blocking read. returns data or an empty indicator without registering a deferred read. useful for polling patterns and diagnostics.

**`clear`** (internal 1010, restricted to lower 256 cells): transitions any → EMPTY. cancels any pending deferred read. unlike FREE, CLEAR is intended for explicit cell lifecycle management by the program (reclaim a cell for reuse). FREE is for heap management. semantically identical for now but distinct opcodes allow future differentiation (e.g. FREE updates a free-list, CLEAR doesn't).
369
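
The atomic operations described above (`rd_inc`, `rd_dec`, `cas`) all return the old value. A minimal Python sketch of those semantics follows; 16-bit wraparound on overflow and underflow is an assumption of the sketch (the test plan lists underflow behaviour as something still to verify), and the FULL-cell precondition is not modelled.

```python
MASK16 = 0xFFFF

def rd_inc(cells: list, addr: int) -> int:
    """Atomic fetch-and-add(+1): returns the value before the increment."""
    old = cells[addr]
    cells[addr] = (old + 1) & MASK16   # assumed wraparound
    return old

def rd_dec(cells: list, addr: int) -> int:
    """Atomic fetch-and-add(-1): returns the value before the decrement."""
    old = cells[addr]
    cells[addr] = (old - 1) & MASK16   # assumed wraparound
    return old

def cas(cells: list, addr: int, expected: int, new: int) -> int:
    """Compare-and-swap: always returns the old value."""
    old = cells[addr]
    if old == expected:
        cells[addr] = new & MASK16
    return old

def cas_succeeded(returned: int, expected: int) -> bool:
    """The CM infers success by comparing the returned value with expected."""
    return returned == expected
```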
370## Hardware Architecture
371
372```
37316-bit Bus                                           16-bit Bus
374     |                                                    ^
375     v                                                    |
376[Input Deserialiser]                           [Output Serialiser]
377  (reassemble 2+ flits,                         (split token into flits)
378   variable opcode decode,                            ^
379   translate to internal                              |
380   4-bit op + 12-bit addr)                     [Result FIFO]
381     |                                           (formed result tokens)
382     v                                                ^
383[Request FIFO]                                        |
384  (internal-format commands)                   [Result Formatter]
385     |                                                ^
386     v                                                |
387[Op Decoder / State Machine]──────────────────►[Deferred Read Reg]
388     |          |          |                    (1 entry: valid,
389     v          v          v                     cell_addr, ret_routing)
390[Addr Decode]  [ALU]  [Presence State SRAM]
391     |          |      (2 bits per cell,
392     v          v       read/write parallel
393[Data SRAM     |       with data SRAM)
394 Bank 0/1]─────┘
395```
396
397### Banking
398
399- Start with 2 banks (1 address bit selects bank) for v0
400- 10-bit address = 1024 cells per SM = 2KB at 16-bit data width
401- Each bank is one SRAM chip with room to spare
402- Banking allows pipelining: one bank can be reading while another is being written (for RMW ops, or overlapping independent requests)
403
404### Internal Components
405
406**Input deserializer / Request FIFO**: receives 16-bit flits from the bus, reassembles into logical tokens (2 flits standard, 3 for `cas` or extended addressing). performs variable-width opcode decode and translates to internal 4-bit op + 12-bit addr format. buffers reassembled requests in
407internal format. depth TBD (4-8 deep probably sufficient for v0). handles bursty traffic from multiple CMs.
408
409**Op decoder / State machine**: the core control logic. for each request:
410
4111. Read presence state for the target cell (from presence SRAM).
4122. Determine action based on opcode × cell state (see state transition table).
4133. Issue data SRAM read/write as needed.
4144. Update presence state.
4155. If result token needed: send to result formatter.
4166. If deferred read: save return routing in deferred read register.
4177. On every WRITE: check deferred read register for match on cell_addr.
418
419This is the most complex piece of the SM - roughly a 10-15 state FSM plus the deferred read comparator. estimated 15-25 TTL chips.
420
421**Address decode**: selects SRAM bank from address bits.
422
423**ALU**: minimal — increment, decrement, compare. NOT a full ALU. just enough for the atomic operations. hardware cost: 16-bit incrementer + 16-bit comparator + mux. ~10-15 TTL chips.
424
425**Presence metadata SRAM**: 4 bits per cell (presence:2 + is_wide:1 + spare:1). 1024 cells = 512 bytes. One small SRAM chip addressed in parallel with the data SRAM. Must support half-clock read-modify-write (read state in first half, write new state in second half) for single-cycle operation.
426
427**Deferred read register**: one 27-bit register (valid:1 + cell_addr:10 + return_routing:16). One 10-bit comparator for addr matching against incoming `write`s. ~3-4 TTL chips total.
428
429**Result formatter**: latches the pre-formed CM token template from the original request's flit 2 (or from the deferred read register for deferred reads), emits it as flit 1 of the result, and appends read data as flit 2. The SM does not parse or modify the template - it is an opaque blob. Output
430serializer splits the formed token into flits for bus transmission.
431
432## Address Space Extension
433
434The 10-bit address in the standard SM token gives 1024 cells per SM for tier-1 ops (3-bit opcode). Three mechanisms extend this further:
435
436### 1. Page Register
437
438- SM has a writable config register set via SET_PAGE opcode
439- 10-bit token address is treated as offset, added to page base
440- Gives up to 64K+ addressable cells per SM
441- CM sets the page before issuing a burst of reads/writes to a region
442- Hardware cost: ~3 chips (latch for page register + adder)
443- Programming model: familiar bank-switching, like 8-bit micros
444- Trade-off: page switch costs one extra token; compiler batches accesses to same page to amortize
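
One way the page register arithmetic could work, sketched in Python. The 256-cell page granularity (`PAGE_SHIFT = 8`) is purely an assumption chosen so that an 8-bit `set_pg` payload spans roughly 64K cells; the document does not specify the granularity, and the class name is hypothetical.

```python
class PageRegister:
    PAGE_SHIFT = 8   # ASSUMPTION: page base = payload * 256 cells

    def __init__(self):
        self.page = 0

    def set_pg(self, payload: int) -> None:
        """Latch the 8-bit payload carried by a set_pg token."""
        self.page = payload & 0xFF

    def effective(self, offset: int) -> int:
        """Add the 10-bit token address (offset) to the page base."""
        return (self.page << self.PAGE_SHIFT) + (offset & 0x3FF)
```

Under this assumed granularity the highest reachable cell is `255 * 256 + 1023 = 66303`, slightly above 64K, and consecutive pages overlap, so a 1024-cell window can be positioned on any 256-cell boundary.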
445
446### 2. Banking as Implicit Address Bits
447
448- SM_id field (2 bits) gives 4 SMs = 4 x 1024 = 4K cells system-wide
449- Not contiguous from a programming perspective, but compiler can distribute data structures across SMs for both capacity and parallelism
450- Essentially free — already in the token format
451- Combine with page registers for 4 x 64K = 256K cells system-wide
452
453### 3. Extended SM Tokens (3-flit via EXT opcode)
454
455- The EXT opcode (3-bit tier) signals a 3-flit SM token with full 16-bit address in flit 2, data/return-routing in flit 3
456- Full 16-bit address space per SM, at the cost of one extra flit cycle
457- Use for: large heap, external RAM, memory-mapped IO address ranges
458- Standard 2-flit tokens remain the fast path for common/local accesses
459
460### Practical Address Space with All Three Combined
461
462- Fast path (standard + page register): 64K per SM, 2-flit token
463- Medium path (across SMs): 4 x 64K = 256K, 2-flit token
464- Slow path (3-flit EXT): up to 64K per SM with wide addresses, 3-flit
465
466## V0 Test Plan
467
468- Drive input with microcontroller (RP2040 / Arduino)
469- Microcontroller formats 2-flit (16-bit each) request packets, clocks flits into input deserializer / request FIFO
470- Read 2-flit result packets from output FIFO
471- Test suite:
472  - Sequential read/write to FULL cells
473  - Random access patterns
474  - READ_INC sequences (verify atomicity, verify returned old value)
475  - READ_DEC to zero (verify underflow behaviour)
476  - CAS success and failure cases
477  - Bank contention (same bank back-to-back)
478  - Page register set + offset access
479  - Boundary conditions (address 0, address 511, page wraparound)
480  - **I-structure tests**:
481    - WRITE then READ: immediate result
482    - READ then WRITE: deferred read — verify result arrives after WRITE
483    - READ on EMPTY, WRITE to different cell, WRITE to target cell: verify deferred read resolves only on correct address
484    - Two deferred READs (different cells): verify stall on second, then resolution when first is satisfied, then second proceeds
485    - CLEAR on FULL cell, then READ: verify deferral
486    - CLEAR on WAITING cell: verify deferred read is cancelled (no result token emitted)
487    - WRITE on FULL cell: verify overwrite + diagnostic indicator
488  - **Variable opcode decode tests**:
    - Verify READ/WRITE/ALLOC/FREE/EXEC reach full 1024-cell range
    - Verify READ_INC/READ_DEC/CAS/CLEAR etc. restricted to lower 256 cells
491    - Verify op[2:1]=11 decode correctly separates opcode extension from address
492
493---
494
495## Memory Tiers
496
497SM address space supports regions with different semantics, selectable by address range. The tier is not encoded in the token - it is determined by the target address within the SM.
498
499### Tier 0: Raw Memory
500
501No presence bits. SRAM read/write only. Suitable for bulk data that does not need synchronization (image buffers, DMA staging areas, ROM).
502
503- READ always returns immediately (no deferral)
504- WRITE always succeeds
505- Presence metadata bits are ignored / not maintained
506- Allocated by compiler or loader, not by ALLOC/FREE
507- Address range: configurable, typically the top of the SM address space
508
509### Tier 1: I-Structure Memory
510
511Standard I-structure semantics with presence tracking. This is the default operating mode described throughout this document.
512
513- READ on EMPTY/RESERVED cell defers until WRITE
514- WRITE transitions cell to FULL
515- ALLOC/FREE manage the free list
516- Full presence metadata (4 bits per cell)
517- Address range: configurable, typically the bulk of SM address space
518
519### Tier 2: Wide/Bulk Memory
520
521Extends tier 1 with wide pointer support. Cells tagged with `is_wide=1` in presence metadata are treated as the base of a (pointer, length) pair: the cell itself holds the data pointer, the next cell holds the length.
522
523- READ checks `is_wide` and returns either 1 cell (normal) or 2 cells (wide pointer pair — requires 3-flit result token or two result tokens)
524- WRITE to a wide cell writes both pointer and length
525- Enables ITERATE, COPY_RANGE, and EXEC to take wide pointers as arguments
526- Address range: overlaps with tier 1 (any tier 1 cell can be marked wide)
527
528### Tier Boundary Configuration
529
530- Boundaries are set by config registers (SET_PAGE or a dedicated config mechanism) during bootstrap
531- The SM's address decoder checks the incoming address against tier boundaries to select behaviour
532- Hardware cost: 1-2 comparators + a mux on the presence metadata path
533- v0 can use a fixed split (e.g., lower 768 = tier 1, upper 256 = tier 0) and defer runtime-configurable boundaries
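
A small Python sketch of the fixed v0 split mentioned above (lower 768 cells tier 1, upper 256 cells tier 0). The names `TIER0_BASE`, `tier_for`, and `read_defers` are illustrative; in hardware this would be a comparator on the address plus a mux on the presence path.

```python
TIER0_BASE = 768   # example split from the text: cells 768..1023 are tier 0

def tier_for(addr: int, tier0_base: int = TIER0_BASE) -> int:
    """Select the memory tier from the address alone (not from the token)."""
    if not 0 <= addr < 1024:
        raise ValueError("address outside the 10-bit cell range")
    return 0 if addr >= tier0_base else 1

def read_defers(addr: int, cell_is_full: bool) -> bool:
    """True when a READ would be deferred instead of answered immediately.

    Tier 0 ignores presence state, so it never defers.
    """
    return tier_for(addr) == 1 and not cell_is_full
```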

> **Design status:** tiers are directionally decided. Exact address range splits, tier 2 wide-read mechanics, and interaction between tiers and page registers are still being refined.

---

## Wide Pointers and Bulk Operations

### Wide Pointer Format

A wide pointer occupies 2 consecutive SM cells:

```
Cell N:     [data_pointer:16]   (base address in SM or external memory)
Cell N+1:   [length:16]         (element count)
```

Cell N has `is_wide=1` in its presence metadata. The SM knows to read both cells when servicing a READ on a wide cell.

Wide pointers are the parameter format for bulk operations. A CM does not iterate over SM contents directly; it sends an SM operation with a wide pointer, and the SM's sequencer engine handles the iteration internally.

### `exec`

`exec` reads pre-formed tokens from a contiguous region of SM and pushes them onto the bus. The SM becomes a token source, effectively an autonomous injector.

```
exec request:
  flit 1: [...exec opcode...][count:8]
  flit 2: [base_addr from config register or wide pointer]
```

The SM's sequencer reads `count` 2-cell entries starting at `base_addr`. Each entry is a pre-formed 2-flit token (flit 1 in cell N, flit 2 in cell N+1). The sequencer emits them onto the bus in order.
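
The sequencer walk can be sketched as a generator. This is a software model under stated assumptions: `exec_tokens` is a hypothetical name, `cells` is a plain list standing in for SM storage, and "emitting onto the bus" is represented by yielding a `(flit1, flit2)` pair.

```python
def exec_tokens(cells, base_addr, count):
    """Yield (flit1, flit2) pairs for `count` 2-cell entries, in order."""
    if not 0 <= count < 256:
        raise ValueError("count is an 8-bit field")
    addr = base_addr                  # address counter
    limit = base_addr + 2 * count     # value fed to the limit comparator
    while addr < limit:
        yield cells[addr], cells[addr + 1]   # flit 1 in cell N, flit 2 in N+1
        addr += 2                     # increment logic
```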

**Bootstrap use:** on system reset, SM00 is wired to execute `exec` on a predetermined ROM address. The ROM contents are pre-formed IRAM write tokens and seed tokens: everything needed to load the system. No external microcontroller is needed for self-hosted boot. See "SM00 Bootstrap" below.

**Hardware reuse:** the `exec` sequencer (address counter, limit comparator, increment logic, output path to bus) is the same hardware needed for `iter` and `cp`. Building `exec` for bootstrap gives bulk operations nearly for free.

### `iter`

Reads each cell in a range and emits a result token for each. Takes a wide pointer (base + length). The SM's sequencer walks the range, constructing result tokens using a pre-loaded return routing template.
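
A minimal model of the range walk, assuming a single return routing template shared by every element (whether the template is shared or per element is still an open question). `iter_range` and the `(template, value)` result shape are hypothetical.

```python
def iter_range(cells, base, length, template):
    """Yield one (template, value) result per cell in [base, base + length)."""
    for addr in range(base, base + length):
        yield template, cells[addr]   # same routing template for every element
```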

### `cp`

Copies a contiguous range of cells from one SM region to another (or to a different SM). Takes source wide pointer and destination base. Useful for structure copying and GC compaction.
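
A sketch of the same-SM case, copying one cell per sequencer step in ascending address order. `cp_range` is a hypothetical name; overlapping ranges, presence state, and copies to a different SM are not modelled, since those details are not yet settled.

```python
def cp_range(cells, src_base, length, dst_base):
    """Copy `length` cells from src_base to dst_base, one cell per step."""
    for offset in range(length):
        cells[dst_base + offset] = cells[src_base + offset]
```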

> **Design status:** `iter` and `cp` are directionally committed. Exact token format for range parameters, interaction with deferred reads in the target range, and atomicity guarantees are still being refined.

---

## SM00 Bootstrap

### Reset Behaviour

SM00 has dedicated wiring to the system reset signal. On reset:

1. SM00's sequencer triggers `exec` on a predetermined ROM base address
2. The ROM region contains pre-formed tokens: IRAM write tokens to load PE instruction memories, followed by seed tokens to start execution
3. SM00 emits these tokens onto the bus in order
4. PEs receive IRAM writes and load their instruction memories
5. Seed tokens fire and execution begins

The reset vector wiring is the only hardware specialization of SM00. At runtime, SM00 behaves as a standard SM with standard opcodes.
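
On the tooling side, a boot image is just the token list flattened into 2-cell entries, IRAM writes first. The helper below is a hypothetical sketch: `build_boot_rom` is not part of any existing toolchain, tokens are treated as opaque `(flit1, flit2)` integer pairs, and the 16-bit flit width and 255-token limit are assumptions taken from the 16-bit cell fields and the 8-bit `count` field.

```python
def build_boot_rom(iram_writes, seeds):
    """Flatten tokens into ROM cells; return (cells, count) for `exec`."""
    tokens = list(iram_writes) + list(seeds)   # IRAM writes precede seed tokens
    if len(tokens) > 255:
        raise ValueError("more tokens than an 8-bit exec count can cover")
    cells = []
    for flit1, flit2 in tokens:
        for flit in (flit1, flit2):
            if not 0 <= flit <= 0xFFFF:        # assumed 16-bit flits
                raise ValueError(f"flit {flit:#x} does not fit in a cell")
        cells += [flit1, flit2]                # flit 1 in cell N, flit 2 in N+1
    return cells, len(tokens)
```

The returned `count` is what the reset-time `exec` would need alongside the ROM base address.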

### ROM Mapping

The bootstrap ROM is mapped into SM00's address space (tier 0: raw memory, no presence bits). It can be:

- Physical ROM/EEPROM on SM00's address bus
- A region of SM00's SRAM pre-loaded by an external microcontroller (development/prototyping)
- Flash memory accessed via page register for larger images

### Future Specialization (not committed)

SM00 could be further specialized for IO:

- Atomic/alloc opcodes could be repurposed for IO-specific operations (e.g., `rd_inc` becomes "read UART with auto-acknowledge")
- Memory-mapped IO devices occupy a reserved address range within SM00
- SM00 could have additional interrupt-sensing hardware that triggers token emission on external events

This is documented as a design option but **not committed for v0**. The standard SM opcodes are sufficient for basic IO via I-structure semantics: `read` from an IO-mapped address defers until the IO device writes data.

---

## Presence-Bit Guided IRAM Writes

The matching store's presence bitmap indicates whether any dyadic instruction slot has a pending (half-matched) operand. During IRAM writes, the PE can check the presence bits for the IRAM page being overwritten:

- If all presence bits for slots in that page are clear, no tokens are pending and the page can be safely overwritten without drain delay
- If any presence bit is set, the PE knows tokens are in flight for that instruction and can either wait or discard the stale operand
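
The page check amounts to an OR-reduction over one page's slice of the presence bitmap. In this sketch `page_is_quiescent` and `slots_per_page` are hypothetical, and the bitmap is modelled as a flat list of booleans indexed by instruction slot.

```python
def page_is_quiescent(presence_bits, page, slots_per_page):
    """True if no slot in the given IRAM page has a half-matched operand."""
    start = page * slots_per_page
    return not any(presence_bits[start:start + slots_per_page])
```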

This enables more targeted IRAM replacement than the blanket drain protocol. Instead of draining the entire PE, only the affected page needs attention. The valid-bit page protection mechanism (`bus-architecture-and-width-decoupling.md`) remains the safety net, but presence-bit checking can eliminate unnecessary stalls.

---

## Open Design Questions

1. **Page register per-CM or global?** - if multiple CMs access the same SM, do they share a page register (contention) or each have their own (more hardware, more config)? Probably global for v0.
2. **Banking vs pipeline depth** - with 2 banks, can we overlap a read to bank 0 with a write to bank 1? Worth the control complexity for v0? Presence state SRAM complicates this - is presence per-bank or shared? If shared, it serializes cross-bank operations. If per-bank, each bank needs its own presence SRAM. Probably shared for v0 (simpler).
3. **Atomic ops on non-FULL cells** - `rd_inc`/`rd_dec`/`cas` on EMPTY or WAITING cells is currently undefined. Options: error, stall, or treat as zero. Error is safest for v0.
4. **Direct path input mux** - when direct path is added, the SM needs a mux between bus input and direct input feeding the internal command bus. Arbitration policy TBD (direct path priority? round-robin?).
5. **Wide pointer read format** - does a READ on a wide cell return two separate result tokens or one 3-flit result token? Two tokens is simpler (reuse existing result formatter), 3-flit is more atomic.
6. **Tier boundary mechanism** - fixed at build time, config register, or address-range comparator? Fixed is simplest for v0.
7. **`iter` return template** - how is the return routing template supplied for iterated results? Preloaded config register? Part of the `iter` request? Template per element or shared?
8. **SM00 ROM size and mapping** - how large is the bootstrap ROM? What address range? How does it interact with the page register?