OR-1 dataflow CPU sketch
# Dynamic Dataflow CPU: SM (Structure Memory) Design

Covers the SM interface protocol, operation set, banking scheme, address space extension, memory tiers, wide pointers, bulk operations, EXEC/bootstrap, and hardware architecture.

See `architecture-overview.md` for module taxonomy and token format.
See `network-and-communication.md` for how SM connects to the bus.
See `bus-architecture-and-width-decoupling.md` for bus width rationale.
See `sm-and-token-format-discussion.md` for extended discussion of design decisions including DRAM latency context, bootstrap unification, and address space distribution.

## Role

SM stores structured data (arrays, lists, heap), performs operations on it, and provides memory-mapped IO.

SM is **synchronising memory**, not just a data store. Tier 1 cells have presence state (empty/full), and reads to empty cells are deferred until a write arrives. This gives implicit producer-consumer synchronization without locks or explicit message-passing: the write *is* the signal. This is the dataflow architecture's answer to shared mutable state.

**IO is memory-mapped into SM address space.** An SM (typically SM00 at v0) maps IO devices into its address range. I-structure semantics provide natural interrupt-free IO: a READ from an IO device that has no data defers until data arrives, triggering the receiving node in the dataflow graph.

From a CM's perspective: send a bit[15]=1 request, get a CM result token back eventually. Split-phase, asynchronous relative to the requesting CM. A READ may return immediately (cell is full) or later (cell is empty, deferred until written).

See `em4-analysis.md` for context on why dedicated synchronizing memory matters vs flattening structure storage into PE-local SRAM.

**Internal data path: 16-bit**, matching the SRAM word size.
Tokens arrive as 2 flits on the 16-bit external bus, are reassembled at the SM input FIFO into a logical token, processed, and result tokens are serialized back into flits at the output. The SM never needs an internal data path wider than 16 bits for data operations.

## Interface Protocol

Stateless request handling: the request token carries its own return routing info in the bits that are unused by that operation type. SM does not maintain persistent per-request state, outside of caching the return token template when a read is deferred.

### Request Format (bit[15]=1, 2-flit standard)

All SM requests arrive as 2 flits on the 16-bit bus and are reassembled at the SM input FIFO before processing.

```
Flit 1 (common to all SM ops):
  [1][SM_id:2][op_base:3][addr/op_ext:2][addr_low:8]  = 16 bits

  15 bits available after the SM discriminator bit. SM_id (2 bits) selects
  the target SM. The remaining 13 bits encode opcode and address using
  variable-width encoding:

    When op[2:1] != 11: 3-bit opcode, 10-bit addr (1024 cells)
    When op[2:1] == 11: extends to 5-bit opcode, 8-bit payload (256 cells)

  Decode signal: op[2] AND op[1] (one gate).

Flit 2 varies by operation:

  WRITE: [data:16]
    Write data to address. No DN response unless cell was WAITING
    (deferred read satisfied: result token emitted using saved routing).

  READ / CLEAR / ALLOC / FREE: [return_routing:16]
    Flit 2 carries a **pre-formed CM token template**. The SM's result
    formatter latches this template, prepends it as the result's flit 1,
    and appends read data as flit 2. No bit-shuffling: the requesting
    CM does all format work upfront. The SM treats this as an opaque
    16-bit blob.
    (CLEAR/ALLOC/FREE may not need return routing: if no result token
    is emitted, flit 2 could be omitted or carry flags instead. TBD.)

  READ_INC / READ_DEC: [return_routing:16]
    Same as READ. Atomic ops always return the old value.

  CAS (compare-and-swap):
    Requires 3-flit extended format to carry expected_value, new_value,
    AND return routing:
      flit 2: [expected_value:16]
      flit 3: [new_value:8][return_routing:8]  (or similar split, TBD)
    Alternatively, CAS could use two sequential 2-flit packets.
    Design TBD: CAS is the most complex SM operation.
```

**3-flit extended addressing mode**: for access to external RAM or memory-mapped IO address spaces, a 3-flit SM token provides wider addresses at the cost of one extra flit cycle. The EXT opcode (one of the 3-bit tier opcodes) signals 3-flit format.

### Result Format (on DN, pre-formed CM token)

SM uses the pre-formed CM token template from the request's flit 2 as the result's flit 1, and appends the read data as flit 2:

```
Result (2 flits):
  flit 1: [return_routing:16]   (opaque template from original request)
  flit 2: [fetched_data:16]
```

The template can encode any CM token format whose routing fits in 16 bits: dyadic wide, monadic normal, or monadic inline. This means SM read results can land directly in a matching store slot as one operand of a dyadic instruction.

## Cell State Model (I-Structure Semantics)

Each SM cell has a 2-bit state field in addition to its 16-bit data value.

```
State     Bits  Meaning
───────────────────────────────────────────────────────────
EMPTY     00    Never written, or cleared. READ defers.
RESERVED  01    Allocated but not yet written. READ defers.
FULL      10    Written. READ returns value immediately.
WAITING   11    Empty + a deferred READ is pending for this cell.
```

### State Transitions

```
                 WRITE
  EMPTY ──────────────────────────► FULL
    │                                 │
    │ READ                            │ READ
    ▼                                 ▼
  WAITING ────────────────────────► FULL   (WRITE satisfies deferred reader,
  (deferred read registered)                sends result, cell becomes FULL)
    │
    │ CLEAR
    ▼
  EMPTY

  EMPTY ──── ALLOC ──► RESERVED ──── WRITE ──► FULL
  FULL  ──── FREE  ──► EMPTY
  any   ──── CLEAR ──► EMPTY
```

Detailed transitions:

- **READ on EMPTY or RESERVED**: cell transitions to WAITING. Return routing is saved in the deferred read register. No result token emitted yet. If the deferred read register is already occupied (by a different cell), SM stalls and the request stays in the input FIFO until the pending deferred read is satisfied.
- **READ on FULL**: result token emitted immediately. Cell stays FULL. Multiple reads of the same full cell are fine: all get the value.
- **READ on WAITING**: error condition (second deferred read to same cell while first is pending). SM behaviour TBD. Options: stall, return error token, or drop silently. Stall is safest for v0.
- **WRITE on EMPTY or RESERVED**: data stored, cell transitions to FULL. No result token.
- **WRITE on WAITING**: data stored, cell transitions to FULL. SM immediately uses the saved return routing from the deferred read register to emit a result token with the written data. Deferred read register is freed.
- **WRITE on FULL**: depends on design choice. Options:
  - **overwrite silently**: data replaced, cell stays FULL. Simple, but hides bugs.
  - **error**: SM emits an error/diagnostic signal. Data not written. Safe but requires error handling.
  - **overwrite + diagnostic flag**: data replaced, diagnostic LED or counter incremented. Best of both for v0.
- **CLEAR on any state**: cell transitions to EMPTY. If cell was WAITING, the pending deferred read is cancelled (return routing discarded from deferred read register, no result token emitted).
  CLEAR never emits a result token.
- **ALLOC**: transitions EMPTY->RESERVED. No effect on non-EMPTY cells.
- **FREE**: transitions any->EMPTY. If cell was WAITING, cancels the deferred read (same as CLEAR).

### Presence Metadata Hardware

4 bits per cell: `[presence:2][is_wide:1][spare:1]`

- **presence:2** - EMPTY/RESERVED/FULL/WAITING (checked on every operation)
- **is_wide:1** - tags this cell as part of a wide pointer pair. The SM checks this before deciding whether to also read the next cell for length metadata. See "Wide Pointers" section below.
- **spare:1** - reserved. Candidates: write-once flag, type tag, owner ID.

At 1024 cells this is 4096 bits = 512 bytes. Using a byte-wide SRAM chip means bits 4-7 are physically present whether used or not. Committing 4 bits with 4 spare avoids needing to change the presence SRAM layout when additional per-cell metadata is added during testing.

Implementation:

- **Small SRAM** alongside the data SRAM. Addressed in parallel, read/written on every operation. One 8-bit-wide SRAM chip covers 2 cells per byte (4 bits each) -> 512 bytes for 1024 cells, which easily fits.
- Must support single-cycle read-modify-write (read state, decide action, write new state) within one clock. Achievable if the state SRAM access time is less than half the clock period (read in first half, write in second half; same half-clock RMW technique as the EM-4 matching stage).

## Deferred Read Register

Single register per SM. Holds the return routing for one pending deferred read.

```
Deferred Read Register:
  [valid:1][cell_addr:10][return_routing:16]  = 27 bits
```

- **valid**: set when a READ hits an EMPTY/RESERVED cell. Cleared when the deferred read is satisfied (WRITE to the target cell) or cancelled (CLEAR/FREE to the target cell).
- **cell_addr**: which cell this deferred read is waiting on (10-bit address for 1024-cell range). Compared against incoming WRITE addresses to detect satisfaction.
- **return_routing**: the 16 bits from the original READ's flit 2 (the pre-formed CM token template). Used as flit 1 of the result token when the deferred read is satisfied.

### Deferred Read Satisfaction

On every WRITE, the SM checks: `valid == 1 AND write_addr == cell_addr`. If true:

1. Store the write data in the cell (normal WRITE behaviour).
2. Transition cell state to FULL.
3. Emit result token: flit 1 = saved return_routing, flit 2 = written data.
4. Clear the deferred read register (valid = 0).

Hardware cost: one 10-bit comparator (cell_addr vs write_addr), one AND gate (valid AND addr_match), one 27-bit register. Trivial - maybe 3-4 TTL chips total.

### Depth-1 Limitation and Backpressure

With one deferred read register per SM, only one cell can have a pending deferred read at a time. If a second READ arrives for a different empty cell while the register is occupied, the SM cannot service it.

In practice at v0 scale (4 CMs, low contention), this should be rare. The compiler can also help by ensuring that reads and writes to the same cell are ordered appropriately in the program graph.

### Multi-Slot Deferred Read CAM (Candidate Enhancement)

If depth-1 proves too restrictive, expanding to a multi-entry deferred read store using a small CAM (content-addressable memory) is a natural fit. A CAM does associative lookup in one cycle - present the WRITE address, the CAM match line fires if any entry holds that address. No sequential scan, no priority logic changes to the SM pipeline.
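
As a behavioural illustration only, the deferred read store can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the hardware design: `SMModel` and `Stall` are hypothetical names, a dict stands in for the associative match, READ on a WAITING cell stalls (the document leaves that case TBD), and WRITE on FULL overwrites without modelling the diagnostic flag. With `slots=1` the model behaves like the single deferred read register.

```python
from enum import Enum


class State(Enum):
    EMPTY = 0b00
    RESERVED = 0b01
    FULL = 0b10
    WAITING = 0b11


class Stall(Exception):
    """The request cannot be serviced yet and stays in the input FIFO."""


class SMModel:
    """Hypothetical behavioural model of one SM with N deferred read slots."""

    def __init__(self, slots=1, cells=1024):
        self.slots = slots
        self.state = [State.EMPTY] * cells
        self.data = [0] * cells
        self.deferred = {}  # cell_addr -> return_routing (stands in for the CAM)

    def read(self, addr, return_routing):
        """Return a (flit1, flit2) result token, or None if the read is deferred."""
        st = self.state[addr]
        if st is State.FULL:
            return (return_routing, self.data[addr])
        # Assumption: a second read of a WAITING cell stalls, as does slot exhaustion.
        if st is State.WAITING or len(self.deferred) >= self.slots:
            raise Stall(addr)
        self.deferred[addr] = return_routing
        self.state[addr] = State.WAITING
        return None

    def write(self, addr, value):
        """Store value; return a result token if this satisfies a deferred read."""
        self.data[addr] = value
        self.state[addr] = State.FULL
        routing = self.deferred.pop(addr, None)  # match on the write address
        return None if routing is None else (routing, value)

    def clear(self, addr):
        """Cell becomes EMPTY; any pending deferred read is cancelled silently."""
        self.deferred.pop(addr, None)
        self.state[addr] = State.EMPTY
```

For example, with `slots=2` two reads of different empty cells are both deferred, a third raises `Stall`, and a later `write` to one of the waiting cells returns the saved routing paired with the written data.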

```
Deferred Read CAM (e.g., 4 entries):
  Entry 0: [valid:1][cell_addr:10][return_routing:16]
  Entry 1: [valid:1][cell_addr:10][return_routing:16]
  Entry 2: [valid:1][cell_addr:10][return_routing:16]
  Entry 3: [valid:1][cell_addr:10][return_routing:16]

On READ hitting EMPTY cell:
  - Find first invalid entry (priority encoder), write cell_addr +
    return_routing, set valid
  - If all entries valid: stall (same as depth-1 overflow)

On WRITE:
  - Present write_addr to CAM match lines
  - If any entry matches: satisfy that deferred read, clear entry
  - Normal WRITE proceeds regardless
```

**Why CAM is ideal here:** the deferred read lookup is inherently associative - "does any pending read match this write address?" A register file would require sequential comparison against each entry. A CAM answers in one cycle regardless of entry count. Small CAMs (4-16 entries) existed as discrete TTL/CMOS parts and are also trivial to build from comparators + registers.

**Hardware cost:** 4 entries × (10-bit comparator + 27-bit register) + priority encoder for allocation + match OR for satisfaction detection. Estimated 8-12 TTL chips for a 4-entry CAM, roughly double the single-register cost. Alternatively, the National Semiconductor 100142 (4-word × 4-bit, ECL, see `datasheets/NATLS21982-1.pdf`) is a discrete CAM chip that provides 4-entry associative lookup in a single package. Two 100142s cascade to 4 words × 8 bits, covering the cell address match width. The return routing storage still requires separate registers, but the address-match portion shrinks to 2-3 chips instead of a comparator tree.

**IO motivation:** the strongest argument for multiple deferred read slots comes from IO on SM00 (see `io-and-bootstrap.md`). The always-pending deferred read pattern for IO permanently occupies a slot. With a single slot, SM00 can either service IO *or* do normal I-structure deferred reads, but not both.
Even 2 slots (one for IO, one for memory operations) resolves this. 4 slots covers multiple IO sources (UART + SPI + timer) without needing SM00 specialization or spontaneous token emission hardware.

**Uniform vs SM00-only:** giving all SMs multi-slot deferred reads keeps the architecture uniform (no special cases). The per-SM cost increase is small. Alternatively, only SM00 gets the CAM and other SMs keep depth-1 - saves chips but adds a special case. The uniform approach is preferred unless chip budget is extremely tight.

## Operation Set

### Internal Representation

The SM internally uses a **4-bit opcode + 12-bit address** command format, regardless of how the command arrived:

```
Internal command: [op:4][addr:12][data:16][return_routing:16]
```

This is the canonical form that the op decoder, state machine, and ALU all work with. Input interfaces (bus adapter, future direct path) translate their respective wire formats into this internal representation.

### Bus Encoding (flit 1: variable-width opcode)

On the 16-bit bus, flit 1 of an SM token is:

```
  [1][SM_id:2][op_base:3][op_ext/addr:2][addr:8]  = 16 bits
```

15 bits available after the SM discriminator. SM_id (2 bits) selects the target SM. The remaining 13 bits encode opcode and address using variable-width encoding.

The interpretation of the 2 bits after op_base depends on op_base[2:1]:

```
op[2:1] != 11:
  3-bit opcode, 10-bit addr (1024 cells).
  The 2 extension bits are the high address bits.

op[2:1] == 11:
  Extends to 5-bit opcode, 8-bit payload (256 cells or inline data).
  The 2 extension bits are opcode extension.
```

Decode signal: `op[2] AND op[1]` (one gate).
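
The variable-width decode of flit 1 can be sketched in Python. This is a minimal illustration that assumes the fields are packed MSB-first in the order listed (bit 15 discriminator, bits 14-13 SM_id, bits 12-10 op_base, bits 9-8 extension, bits 7-0 low address); `Flit1` and `decode_flit1` are hypothetical names, not part of the design.

```python
from typing import NamedTuple


class Flit1(NamedTuple):
    sm_id: int        # 2-bit target SM
    bus_opcode: int   # 3-bit or 5-bit bus opcode value
    opcode_bits: int  # 3 (full-range tier) or 5 (restricted tier)
    payload: int      # 10-bit address, or 8-bit address/payload


def decode_flit1(flit: int) -> Flit1:
    """Split a 16-bit SM flit 1 into its fields (assumed MSB-first packing)."""
    if not (flit >> 15) & 1:
        raise ValueError("not an SM token: bit 15 is 0")
    sm_id = (flit >> 13) & 0b11
    op_base = (flit >> 10) & 0b111
    ext = (flit >> 8) & 0b11
    low = flit & 0xFF
    # The one-gate decode signal: op[2] AND op[1].
    extended = (op_base & 0b110) == 0b110
    if extended:
        # The 2 extension bits join the opcode; 8 bits remain as payload.
        return Flit1(sm_id, (op_base << 2) | ext, 5, low)
    # The 2 extension bits are the high address bits of a 10-bit address.
    return Flit1(sm_id, op_base, 3, (ext << 8) | low)
```

Under these assumptions, `decode_flit1(0xA2A5)` yields a `read` (bus opcode `000`) to SM 1 at address `0x2A5`, and `decode_flit1(0x9910)` yields `rd_dec` (bus opcode `11001`) to SM 0 with payload `0x10`.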

### Opcode Table (bus encoding)

```
op_base  ext  bus opcode  internal op  addr bits  name
─────────────────────────────────────────────────────────────────
  000    aa   000         0000         10 (1024)  read
  001    aa   001         0001         10         write
  010    aa   010         0010         10         alloc
  011    aa   011         0011         10         free
  100    aa   100         0100         10         exec
  101    aa   101         0101         10         ext (3-flit mode)
  110    00   11000       0110         8 (256)    rd_inc
  110    01   11001       0111         8          rd_dec
  110    10   11010       1000         8          cas
  110    11   11011       1001         8          raw_rd
  111    00   11100       1010         8          clear
  111    01   11101       1011         8          set_pg
  111    10   11110       1100         8          write_im
  111    11   11111       1101         8          (spare)
```

'aa' = address bits (part of 10-bit address).

**Tier 1 ops (3-bit, full address range):** `read`, `write`, `alloc`, `exec`, `free` reach the full 1024-cell address space. EXT signals a 3-flit token for extended addressing (external RAM, wide addresses).

**Tier 2 ops (5-bit, restricted address/payload):** atomic operations (`rd_inc`, `rd_dec`, `cas`) as well as `clear` are restricted to 256 cells - the compiler places atomic-access cells in the lower range. `set_pg` and `write_im` use the 8-bit payload field for non-address data (length, page register value, small immediate).

**Key insight for variable-width encoding:** not all SM ops are `op(address)`. Some are `op(config_value)` or `op()` with no cell operand. The 8-bit payload in the restricted tier can be inline data, config values, or range counts depending on the opcode:

- `set_pg`: payload = page register value
- `write_im`: 8-bit addr, flit 2 carries immediate data
- `raw_rd`: non-blocking read, returns data or empty indicator without registering a deferred read. Useful for polling and diagnostics.

### Direct Path Encoding (future)

A dedicated CM-SM link bypasses the bus.
On the direct path, the type and SM_id fields are redundant (the wires go to exactly one SM), so the full 16 bits of the command word are available:

```
direct path: [op:4][addr:12]  = 16 bits
```

4-bit opcode (all 16 internal ops directly addressable) + 12-bit address (4096 cells). The SM's internal command representation matches this directly, with no translation needed.

**v0 implementation**: direct path not built. The extra input lines on the SM's internal command bus are disabled (active-high for addr lines, active-low for unused op bits); physically present but unused. When/if a direct path is added, the bus adapter and direct path adapter both feed into the same internal command bus via a mux.

### Extended Bus Encoding (3-flit SM token via `ext` opcode)

For operations that need wider addresses or additional data (`cas` with both expected and new values), 3-flit SM tokens (via EXT opcode) carry the full 4-bit opcode in the extra flit:

```
flit 1: [type:2=10][SM_id:2][flags/op_hint:3][addr_low:9]  = 16 bits
flit 2: [extended_op:4][addr_high:4][data:8]               = 16 bits
        or [expected_value:16] for CAS
flit 3: [new_value:16] or [return_routing:16]              = 16 bits
```

Exact 3-flit format TBD. The point is that extended tokens can carry the full internal 4-bit opcode, giving access to all 16 internal operations including any that are only available via direct path or extended format.

### Operation Details (updated for I-structure semantics)

**`read`** (internal 0000): I-structure blocking read. If cell is FULL, returns data immediately. If cell is EMPTY or RESERVED, registers a deferred read (cell → WAITING) and returns the data later when a WRITE arrives. See state transition table above.

**`write`** (internal 0001): stores data. Transitions EMPTY/RESERVED → FULL.
If cell is WAITING, satisfies the deferred reader (emits result token using saved return routing). If cell is already FULL, overwrites + diagnostic flag (v0 behaviour, see state transition discussion). WRITE to a FULL cell does NOT emit a result token: the overwrite diagnostic is a local indicator (LED/counter), not a token.

**`alloc`** (internal 0010): transitions EMPTY → RESERVED. For heap management. Deferred to post-v0 but opcode and state transition defined.

**`free`** (internal 0011): transitions any → EMPTY. Cancels any pending deferred read on that cell. Deferred to post-v0.

**`clear`** (internal 1010 in the opcode table): transitions any → EMPTY. Cancels any pending deferred read. Unlike FREE, CLEAR is intended for explicit cell lifecycle management by the program (reclaim a cell for reuse). FREE is for heap management. Semantically identical for now but distinct opcodes allow future differentiation (e.g. FREE updates a free-list, CLEAR doesn't).

**(spare)** (internal 0101): reserved. Candidate: RAW_READ (non-blocking read that returns data or an empty indicator without deferring). Useful for polling patterns and diagnostics. Not committed.

**`rd_inc`** (internal 0110, restricted to lower 256 cells): atomic fetch-and-add(+1). Reads current value, increments, writes back, returns old value. Operates on FULL cells only; READ_INC on an EMPTY cell is an error (returns error indicator or stalls). Does not interact with I-structure state transitions (cell stays FULL).

**`rd_dec`** (internal 0111, restricted to lower 256 cells): atomic fetch-and-add(-1). Same semantics as READ_INC but decrements. CM checks returned value for zero to detect refcount exhaustion.

**`cas`** (internal 1000, restricted to lower 256 cells): compare-and-swap. Requires extended format (3-flit or sequential 2-flit) to carry both expected_value and new_value + return routing.
Operates on FULL cells only. Returns old value regardless of success/failure; CM infers success by comparing returned value with expected.

**(spare)** (internal 1001): reserved for future atomic op.

## Hardware Architecture

```
16-bit Bus                                  16-bit Bus
    |                                           ^
    v                                           |
[Input Deserialiser]                    [Output Serialiser]
  (reassemble 2+ flits,                   (split token into flits)
   variable opcode decode,                      ^
   translate to internal                        |
   4-bit op + 12-bit addr)                [Result FIFO]
    |                                     (formed result tokens)
    v                                           ^
[Request FIFO]                                  |
  (internal-format commands)              [Result Formatter]
    |                                           ^
    v                                           |
[Op Decoder / State Machine]──────────────────►[Deferred Read Reg]
    |           |          |                    (1 entry: valid,
    v           v          v                     cell_addr, ret_routing)
[Addr Decode] [ALU]  [Presence State SRAM]
    |           |      (2 bits per cell,
    v           v       read/write parallel
[Data SRAM      |       with data SRAM)
 Bank 0/1]──────┘
```

### Banking

- Start with 2 banks (1 address bit selects bank) for v0
- 10-bit address = 1024 cells per SM = 2KB at 16-bit data width
- Each bank is one SRAM chip with room to spare
- Banking allows pipelining: one bank can be reading while another is being written (for RMW ops, or overlapping independent requests)

### Internal Components

**Input deserializer / Request FIFO**: receives 16-bit flits from the bus, reassembles into logical tokens (2 flits standard, 3 for `cas` or extended addressing). Performs variable-width opcode decode and translates to internal 4-bit op + 12-bit addr format. Buffers reassembled requests in internal format. Depth TBD (4-8 deep probably sufficient for v0). Handles bursty traffic from multiple CMs.

**Op decoder / State machine**: the core control logic. For each request:

1. Read presence state for the target cell (from presence SRAM).
2. Determine action based on opcode × cell state (see state transition table).
3. Issue data SRAM read/write as needed.
4. Update presence state.
5. If result token needed: send to result formatter.
6. If deferred read: save return routing in deferred read register.
7. On every WRITE: check deferred read register for match on cell_addr.

This is the most complex piece of the SM - roughly a 10-15 state FSM plus the deferred read comparator. Estimated 15-25 TTL chips.

**Address decode**: selects SRAM bank from address bits.

**ALU**: minimal: increment, decrement, compare. NOT a full ALU. Just enough for the atomic operations. Hardware cost: 16-bit incrementer + 16-bit comparator + mux. ~10-15 TTL chips.

**Presence metadata SRAM**: 4 bits per cell (presence:2 + is_wide:1 + spare:1). 1024 cells = 512 bytes. One small SRAM chip addressed in parallel with the data SRAM. Must support half-clock read-modify-write (read state in first half, write new state in second half) for single-cycle operation.

**Deferred read register**: one 27-bit register (valid:1 + cell_addr:10 + return_routing:16). One 10-bit comparator for addr matching against incoming `write`s. ~3-4 TTL chips total.

**Result formatter**: latches the pre-formed CM token template from the original request's flit 2 (or from the deferred read register for deferred reads), emits it as flit 1 of the result, and appends read data as flit 2. The SM does not parse or modify the template - it is an opaque blob. Output serializer splits the formed token into flits for bus transmission.

## Address Space Extension

The 10-bit address in the standard SM token gives 1024 cells per SM for tier-1 ops (3-bit opcode). Three mechanisms extend this further:

### 1. Page Register

- SM has a writable config register set via SET_PAGE opcode
- 10-bit token address is treated as offset, added to page base
- Gives up to 64K+ addressable cells per SM
- CM sets the page before issuing a burst of reads/writes to a region
- Hardware cost: ~3 chips (latch for page register + adder)
- Programming model: familiar bank-switching, like 8-bit micros
- Trade-off: page switch costs one extra token; compiler batches accesses to same page to amortize

### 2. Banking as Implicit Address Bits

- SM_id field (2 bits) gives 4 SMs = 4 x 1024 = 4K cells system-wide
- Not contiguous from a programming perspective, but compiler can distribute data structures across SMs for both capacity and parallelism
- Essentially free: already in the token format
- Combine with page registers for 4 x 64K = 256K cells system-wide

### 3. Extended SM Tokens (3-flit via EXT opcode)

- The EXT opcode (3-bit tier) signals a 3-flit SM token with full 16-bit address in flit 2, data/return-routing in flit 3
- Full 16-bit address space per SM, at the cost of one extra flit cycle
- Use for: large heap, external RAM, memory-mapped IO address ranges
- Standard 2-flit tokens remain the fast path for common/local accesses

### Practical Address Space with All Three Combined

- Fast path (standard + page register): 64K per SM, 2-flit token
- Medium path (across SMs): 4 x 64K = 256K, 2-flit token
- Slow path (3-flit EXT): up to 64K per SM with wide addresses, 3-flit

## V0 Test Plan

- Drive input with microcontroller (RP2040 / Arduino)
- Microcontroller formats 2-flit (16-bit each) request packets, clocks flits into input deserializer / request FIFO
- Read 2-flit result packets from output FIFO
- Test suite:
  - Sequential read/write to FULL cells
  - Random access patterns
  - READ_INC sequences (verify atomicity, verify returned old value)
  - READ_DEC to zero
    (verify underflow behaviour)
  - CAS success and failure cases
  - Bank contention (same bank back-to-back)
  - Page register set + offset access
  - Boundary conditions (address 0, address 511, page wraparound)
  - **I-structure tests**:
    - WRITE then READ: immediate result
    - READ then WRITE: deferred read, verify result arrives after WRITE
    - READ on EMPTY, WRITE to different cell, WRITE to target cell: verify deferred read resolves only on correct address
    - Two deferred READs (different cells): verify stall on second, then resolution when first is satisfied, then second proceeds
    - CLEAR on FULL cell, then READ: verify deferral
    - CLEAR on WAITING cell: verify deferred read is cancelled (no result token emitted)
    - WRITE on FULL cell: verify overwrite + diagnostic indicator
  - **Variable opcode decode tests**:
    - Verify READ/WRITE/ALLOC/FREE/EXEC reach full 1024-cell range
    - Verify READ_INC/READ_DEC/CAS/CLEAR etc. restricted to lower 256 cells
    - Verify op[2:1]=11 decode correctly separates opcode extension from address

---

## Memory Tiers

SM address space supports regions with different semantics, selectable by address range. The tier is not encoded in the token - it is determined by the target address within the SM.

### Tier 0: Raw Memory

No presence bits. SRAM read/write only. Suitable for bulk data that does not need synchronization (image buffers, DMA staging areas, ROM).

- READ always returns immediately (no deferral)
- WRITE always succeeds
- Presence metadata bits are ignored / not maintained
- Allocated by compiler or loader, not by ALLOC/FREE
- Address range: configurable, typically the top of the SM address space

### Tier 1: I-Structure Memory

Standard I-structure semantics with presence tracking. This is the default operating mode described throughout this document.

- READ on EMPTY/RESERVED cell defers until WRITE
- WRITE transitions cell to FULL
- ALLOC/FREE manage the free list
- Full presence metadata (4 bits per cell)
- Address range: configurable, typically the bulk of SM address space

### Tier 2: Wide/Bulk Memory

Extends tier 1 with wide pointer support. Cells tagged with `is_wide=1` in presence metadata are treated as the base of a (pointer, length) pair: the cell itself holds the data pointer, the next cell holds the length.

- READ checks `is_wide` and returns either 1 cell (normal) or 2 cells (wide pointer pair; requires 3-flit result token or two result tokens)
- WRITE to a wide cell writes both pointer and length
- Enables ITERATE, COPY_RANGE, and EXEC to take wide pointers as arguments
- Address range: overlaps with tier 1 (any tier 1 cell can be marked wide)

### Tier Boundary Configuration

- Boundaries are set by config registers (SET_PAGE or a dedicated config mechanism) during bootstrap
- The SM's address decoder checks the incoming address against tier boundaries to select behaviour
- Hardware cost: 1-2 comparators + a mux on the presence metadata path
- v0 can use a fixed split (e.g., lower 768 = tier 1, upper 256 = tier 0) and defer runtime-configurable boundaries

> **Design status:** tiers are directionally decided. Exact address range splits, tier 2 wide-read mechanics, and interaction between tiers and page registers are still being refined.

---

## Wide Pointers and Bulk Operations

### Wide Pointer Format

A wide pointer occupies 2 consecutive SM cells:

```
Cell N:   [data_pointer:16]  (base address in SM or external memory)
Cell N+1: [length:16]        (element count)
```

Cell N has `is_wide=1` in its presence metadata. The SM knows to read both cells when servicing a READ on a wide cell.

Wide pointers are the parameter format for bulk operations.
A CM does not iterate over SM contents directly; it sends an SM operation with a wide pointer, and the SM's sequencer engine handles the iteration internally.

### `exec`

`exec` reads pre-formed tokens from a contiguous region of SM and pushes them onto the bus. The SM becomes a token source - effectively an autonomous injector.

```
exec request:
  flit 1: [...exec opcode...][count:8]
  flit 2: [base_addr from config register or wide pointer]
```

The SM's sequencer reads `count` 2-cell entries starting at `base_addr`. Each entry is a pre-formed 2-flit token (flit 1 in cell N, flit 2 in cell N+1). The sequencer emits them onto the bus in order.

**Bootstrap use:** on system reset, SM00 is wired to execute `exec` on a predetermined ROM address. The ROM contents are pre-formed IRAM write tokens and seed tokens; everything needed to load the system. No external microcontroller needed for self-hosted boot. See "SM00 Bootstrap" below.

**Hardware reuse:** the `exec` sequencer (address counter, limit comparator, increment logic, output path to bus) is the same hardware needed for `iter` and `cp`. Building `exec` for bootstrap gives bulk operations nearly for free.

### `iter`

Reads each cell in a range and emits a result token for each. Takes a wide pointer (base + length). The SM's sequencer walks the range, constructing result tokens using a pre-loaded return routing template.

### `cp`

Copies a contiguous range of cells from one SM region to another (or to a different SM). Takes source wide pointer and destination base. Useful for structure copying, GC compaction.

> **Design status:** `iter` and `cp` are directionally committed.
> Exact token format for range parameters, interaction with deferred reads
> in the target range, and atomicity guarantees are still being refined.
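
To make the shared sequencer behaviour concrete, the `exec` and `iter` walks can be sketched as Python generators. This is an illustration only: `exec_tokens` and `iter_tokens` are hypothetical names, SM contents are modelled as a flat list of 16-bit cell values, and `iter` result tokens are assumed to use the same (return routing, data) layout as ordinary READ results, which the document has not yet fixed.

```python
def exec_tokens(cells, base_addr, count):
    """Yield `count` pre-formed 2-flit tokens stored as consecutive cell pairs."""
    addr = base_addr                # address counter
    limit = base_addr + 2 * count   # value held by the limit comparator
    while addr != limit:
        # Flit 1 lives in cell N, flit 2 in cell N+1.
        yield (cells[addr], cells[addr + 1])
        addr += 2                   # increment logic steps one entry


def iter_tokens(cells, base_addr, length, return_routing):
    """Yield one result token per cell in the range named by a wide pointer."""
    for addr in range(base_addr, base_addr + length):
        # Assumed layout: flit 1 = routing template, flit 2 = cell data.
        yield (return_routing, cells[addr])
```

Both generators use the same counter-and-limit walk, which mirrors the hardware reuse argument made for `exec`.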

---

## SM00 Bootstrap

### Reset Behaviour

SM00 has dedicated wiring to the system reset signal. On reset:

1. SM00's sequencer triggers `exec` on a predetermined ROM base address
2. The ROM region contains pre-formed tokens: IRAM write tokens to load PE instruction memories, followed by seed tokens to start execution
3. SM00 emits these tokens onto the bus in order
4. PEs receive IRAM writes and load their instruction memories
5. Seed tokens fire and execution begins

The reset vector wiring is the only hardware specialization of SM00. At runtime, SM00 behaves as a standard SM with standard opcodes.

### ROM Mapping

The bootstrap ROM is mapped into SM00's address space (tier 0: raw memory, no presence bits). It can be:

- Physical ROM/EEPROM on SM00's address bus
- A region of SM00's SRAM pre-loaded by an external microcontroller (development/prototyping)
- Flash memory accessed via page register for larger images

### Future Specialization (not committed)

SM00 could be further specialized for IO:

- Atomic/alloc opcodes could be repurposed for IO-specific operations (e.g., `rd_inc` becomes "read UART with auto-acknowledge")
- Memory-mapped IO devices occupy a reserved address range within SM00
- SM00 could have additional interrupt-sensing hardware that triggers token emission on external events

This is documented as a design option but **not committed for v0**. The standard SM opcodes are sufficient for basic IO via I-structure semantics: `read` from an IO-mapped address defers until the IO device writes data.

---

## Presence-Bit Guided IRAM Writes

The matching store's presence bitmap indicates whether any dyadic instruction slot has a pending (half-matched) operand.
During IRAM writes, the PE can check the presence bits for the IRAM page being overwritten:

- If all presence bits for slots in that page are clear, no tokens are pending and the page can be safely overwritten without drain delay
- If any presence bit is set, the PE knows tokens are in flight for that instruction and can either wait or discard the stale operand

This enables more targeted IRAM replacement than the blanket drain protocol. Instead of draining the entire PE, only the affected page needs attention. The valid-bit page protection mechanism (`bus-architecture-and-width-decoupling.md`) remains the safety net, but presence-bit checking can eliminate unnecessary stalls.

---

## Open Design Questions

1. **Page register per-CM or global?** - if multiple CMs access the same SM, do they share a page register (contention) or each have their own (more hardware, more config)? Probably global for v0.
2. **Banking vs pipeline depth** - with 2 banks, can we overlap a read to bank 0 with a write to bank 1? Is that worth the control complexity for v0? Presence state SRAM complicates this: is presence per-bank or shared? If shared, it serializes cross-bank operations. If per-bank, each bank needs its own presence SRAM. Probably shared for v0 (simpler).
3. **Atomic ops on non-FULL cells** - the behaviour of `rd_inc`/`rd_dec`/`cas` on EMPTY or WAITING cells is currently undefined. Options: error, stall, or treat as zero. Error is safest for v0.
4. **Direct path input mux** - when direct path is added, the SM needs a mux between bus input and direct input feeding the internal command bus. Arbitration policy TBD (direct path priority? round-robin?).
5. **Wide pointer read format** - does a READ on a wide cell return two separate result tokens or one 3-flit result token? Two tokens is simpler (reuses the existing result formatter); 3-flit is more atomic.
6. **Tier boundary mechanism** - fixed at build time, config register, or address-range comparator? Fixed is simplest for v0.
7. **`iter` return template** - how is the return routing template supplied for iterated results? Preloaded config register? Part of the `iter` request? Template per element or shared?
8. **SM00 ROM size and mapping** - how large is the bootstrap ROM? What address range? How does it interact with the page register?
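
For question 6, the fixed-split option can be sketched as follows, using the example split from "Tier Boundary Configuration" (lower 768 cells = tier 1, upper 256 = tier 0) and a 10-bit cell address. The function names are hypothetical and the split is only the document's example, not a committed value. Under that particular split the comparator degenerates to a single AND of the top two address bits, because 768 is `0b11_0000_0000`; the second function checks that observation against the plain comparison.

```python
ADDR_BITS = 10       # 10-bit cell address (1024 cells), per the request format
TIER1_CELLS = 768    # example split from the text; not a committed value


def tier_of(addr: int) -> int:
    """Return the tier (0 or 1) of a cell address using a plain comparator."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address out of range")
    return 1 if addr < TIER1_CELLS else 0


def tier_of_gate(addr: int) -> int:
    """Same decision using only addr[9] AND addr[8] (valid for this split only)."""
    top_two_set = ((addr >> 8) & 0b11) == 0b11
    return 0 if top_two_set else 1


# Exhaustive check that the gate form matches the comparator form.
gate_matches = all(tier_of(a) == tier_of_gate(a) for a in range(1 << ADDR_BITS))
```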