OR-1 dataflow CPU sketch

docs: add token format migration design plan

Completed brainstorming session. Design includes:
- 1-bit SM/CM token split replacing old 2-bit type field
- IRAMWriteToken replacing SysToken/CfgToken hierarchy
- SM T0/T1 memory tier split with shared T0 storage
- EXEC opcode for bootstrap and bulk token injection
- 5 implementation phases

Orual dea94049 00e8aa1d

+3130 -743
+8
CLAUDE.md
··· 102 102 103 103 ## Architecture Contracts 104 104 105 + > **Note:** The design documents in `design-notes/` have been updated to 106 + > reflect the 1-bit SM/CM token split (eliminating SysToken/type-11), 107 + > IO as memory-mapped SM, bootstrap via SM00 EXEC, and variable 3/5 SM 108 + > opcode encoding. The emulator and assembler code below still uses the 109 + > old token hierarchy (SysToken, CfgToken, IOToken). An implementation 110 + > update is pending. See `design-notes/sm-and-token-format-discussion.md` 111 + > for the design rationale. 112 + 105 113 ### Token Hierarchy (tokens.py) 106 114 107 115 All tokens inherit from `Token(target: int)`. The hierarchy:
+17 -24
design-notes/alu-and-output-design.md
··· 415 415 416 416 Cycle 1 — trigger token (not-taken side): 417 417 destination: bool_out ? dest2 : dest1 418 - format: ALWAYS inline monadic (type=01, inline=1), regardless 418 + format: ALWAYS monadic inline (prefix 011+10), regardless 419 419 of what the destination's dest_type field says 420 420 data: none (1-flit token) 421 421 semantics: controlled by the not-taken destination's not_taken_op field: ··· 495 495 ### Token Packing Mux 496 496 497 497 The packing mux assembles flit 1 (and flit 2 if needed) from the 498 - destination fields and data. The dest_type selector (2 bits) determines 499 - the flit layout. 498 + destination fields and data. The dest_type selector determines the flit 499 + layout based on the prefix encoding. 500 500 501 501 ``` 502 - Flit 1 assembly — all four compute token formats: 502 + Flit 1 assembly — CM token formats by prefix: 503 503 504 - Common: 505 - bits [15:14] = type derived from dest_type 506 - bits [13:12] = dest_PE from IRAM dest field 507 - bit [11] = mode_flag derived from dest_type (width or inline bit) 504 + Dyadic wide (prefix 00): 505 + [0][0][PE:2][offset:5][ctx:4][port:1][gen:2] 508 506 509 - Format-dependent (bits [10:0]): 507 + Monadic normal (prefix 010): 508 + [0][1][0][PE:2][offset:7][ctx:4] 510 509 511 - Dyadic narrow (type=00, W=0): 512 - [offset:5][ctx:4][spare:2] 510 + Dyadic narrow (prefix 011+00): 511 + [0][1][1][PE:2][00][offset:5][ctx:4] 513 512 514 - Dyadic wide (type=00, W=1): 515 - [offset:4][ctx:4][port:1][gen:2] 516 - 517 - Monadic normal (type=01, I=0): 518 - [offset:7][ctx:4] 519 - 520 - Monadic inline (type=01, I=1): 521 - [offset:4][ctx:4][spare:3] 513 + Monadic inline (prefix 011+10): 514 + [0][1][1][PE:2][10][offset:4][ctx:4][spare:1] 522 515 523 516 Flit 2 assembly — format-dependent: 524 517 525 - Dyadic narrow: [data:8][port:1][gen:2][spare:5] 526 518 Dyadic wide: [data:16] 527 519 Monadic normal: [data:16] 520 + Dyadic narrow: [data:8][port:1][gen:2][spare:5] 528 521 Inline: (no flit 
2 emitted) 529 522 ``` 530 523 ··· 532 525 selecting which source bits connect to which flit positions, controlled 533 526 by the dest_type field. Estimate: 4-6 TTL chips. 534 527 535 - For SWITCH cycle 1 (not-taken trigger), the mux is forced to inline 536 - monadic format regardless of the destination's stated dest_type. This 537 - override is a single AND gate on the dest_type control lines during 538 - the EMIT_SECOND state when in SWITCH mode. 528 + For SWITCH cycle 1 (not-taken trigger), the mux is forced to monadic 529 + inline format (prefix 011+10) regardless of the destination's stated 530 + dest_type. This override is a single AND gate on the format control lines 531 + during the EMIT_SECOND state when in SWITCH mode. 539 532 540 533 ### Source of ctx and gen in Output Tokens 541 534
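The packing-mux layouts above can be cross-checked with a small Python model. Bit positions are read off the flit diagrams in this diff; the function names are illustrative placeholders, not emulator API:

```python
# Flit-1 packing for two of the CM formats described above — a sketch.
# Field positions follow the diagrams: dyadic wide is
# [0][0][PE:2][offset:5][ctx:4][port:1][gen:2], monadic inline is
# [0][1][1][PE:2][10][offset:4][ctx:4][spare:1].

def pack_dyadic_wide(pe, offset, ctx, port, gen):
    """Assemble flit 1 for a dyadic wide token (prefix 00)."""
    assert pe < 4 and offset < 32 and ctx < 16 and port < 2 and gen < 4
    return (pe << 12) | (offset << 7) | (ctx << 3) | (port << 2) | gen

def pack_monadic_inline(pe, offset, ctx):
    """Assemble flit 1 for a monadic inline trigger (prefix 011+10)."""
    assert pe < 4 and offset < 16 and ctx < 16
    return (0b011 << 13) | (pe << 11) | (0b10 << 9) | (offset << 5) | (ctx << 1)
```

Packing the maximal dyadic-wide field values fills every bit below the 00 prefix, which is a quick sanity check that no fields overlap.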
+137 -113
design-notes/architecture-overview.md
··· 11 11 - `pe-design.md` — PE pipeline, matching store, instruction memory, context slots 12 12 - `sm-design.md` — structure memory interface, operations, banking, address space 13 13 - `network-and-communication.md` — interconnect, routing, clocking, handshaking 14 - - `io-and-bootstrap.md` — I/O subsystem, bootstrap sequence, type-11 protocol 14 + - `io-and-bootstrap.md` — IO as memory-mapped SM, bootstrap via SM00 EXEC 15 15 - `design-alternatives.md` — rejected/deferred approaches with rationale 16 16 17 17 ## Project Goals ··· 82 82 This decouples token width from instruction width, and means IRAM width is 83 83 completely independent of bus width. 84 84 85 - ### Type Field Semantics 85 + ### Top-Level Discriminator: 1-Bit SM/CM Split 86 + 87 + ``` 88 + BIT[15] = 1: SM TOKEN — destination is an SM bank. carries a memory 89 + operation request (read, write, atomic RMW, bulk ops, etc.). 90 + BIT[15] = 0: CM TOKEN — destination is a CM (PE). carries operand data 91 + for compute instructions, or IRAM write commands. 92 + ``` 93 + 94 + IO is memory-mapped into SM address space (typically SM00 at v0). IRAM 95 + writes are a CM misc-bucket subtype. Bootstrap uses EXEC on SM00, reading 96 + pre-formed tokens from ROM. Debug/trace can use a reserved SM address 97 + range or a spare misc-bucket subtype. 98 + 99 + ### CM Token Prefix Encoding 86 100 87 101 ``` 88 - Type 00 — DYADIC: destination is a CM. token carries operand for a dyadic 89 - (two-input) instruction. requires matching store lookup. 90 - Type 01 — MONADIC: destination is a CM. token carries operand for a monadic 91 - (single-input) instruction. bypasses matching store. 92 - Type 10 — STRUCTURE: destination is an SM bank. carries a memory operation 93 - request (read, write, atomic RMW, etc.). 94 - Type 11 — SYSTEM: destination is the I/O subsystem, OR carries an extended/ 95 - config operation. 
subtype field discriminates: 96 - 11 + 00: I/O operation (routed to I/O controller) 97 - 11 + 01: extended address / config write (e.g., remote instruction 98 - memory write, routing table config) 99 - 11 + 10: reserved (future: debug/trace, DMA) 100 - 11 + 11: reserved 102 + BIT[15:14] = 00: DYADIC WIDE — hot path. 2 flits, 16-bit data. 103 + offset:5 = 32 dyadic slots per context. 104 + BIT[15:13] = 010: MONADIC NORMAL — 2 flits, 16-bit data. 105 + offset:7 = 128 IRAM slots. 106 + BIT[15:13] = 011: MISC BUCKET — infrequent CM formats. 107 + sub:2 discriminates: 108 + 011+00: DYADIC NARROW (2 flits, 8-bit data) 109 + 011+01: IRAM WRITE (2-3 flits, instruction loading) 110 + 011+10: MONADIC INLINE (1 flit, trigger only) 111 + 011+11: SPARE (reserved) 112 + ``` 113 + 114 + Two bits determine the fast-path pipeline: bit[15] splits SM/CM (one gate), 115 + bit[14] splits dyadic-wide from everything else (one gate). The PE can 116 + start matching store SRAM read the instant flit 1 is latched for dyadic-wide 117 + tokens. The misc bucket is three gates deep but nothing there is 118 + latency-critical. 119 + 120 + ### SM Token Format 121 + 101 122 ``` 123 + BIT[15] = 1: SM TOKEN 124 + flit 1: [1][SM_id:2][op:variable][addr:variable] = 16 bits 125 + flit 2: [data:16] or [return_routing:16] 102 126 103 - Types 00/01 hit CMs only. Type 10 hits SM banks only. Type 11 can hit the 104 - I/O controller, target PEs (for config writes), or future system 105 - infrastructure depending on subtype. 127 + Variable-width opcode: common ops (READ, WRITE, ALLOC, FREE, CLEAR, EXT) 128 + get 3-bit opcode + 10-bit addr (1024 cells). Rare/special ops (atomics, 129 + EXEC, SET_PAGE, etc.) get 5-bit opcode + 8-bit payload. One decode gate 130 + on op[2:1] discriminates the two tiers. See `sm-design.md` for the full 131 + opcode table. 132 + ``` 106 133 107 - ### Flit-Based Packet Formats (Preliminary) 134 + SM tokens are always bit[15]=1. 
The network routes them to the SM identified 135 + by SM_id. IO is memory-mapped into SM address space (typically SM00 at v0). 108 136 109 - Standard tokens are 2 flits (32 bits logical). Extended operations 110 - (type-11) may be 3-4 flits. The number of flits is determined by the 111 - type/subtype field in flit 1 — routing nodes and receivers can predict 112 - packet length from the first flit alone. 137 + ### Flit-Based Packet Formats 113 138 114 139 ``` 115 - Standard compute token (types 00/01, 2 flits): 116 - flit 1: [type:2][PE_id:2][...12 bits: addr, ctx, port, gen...] = 16 bits 140 + Dyadic wide (prefix 00, 2 flits): 141 + flit 1: [0][0][PE:2][offset:5][ctx:4][port:1][gen:2] = 16 bits 117 142 flit 2: [data:16] = 16 bits 118 143 119 - The 12 bits after type + PE_id encode instruction address, context slot, 120 - port (L/R), and generation counter. Exact allocation TBD — key tradeoffs: 121 - - Wider instr_addr vs wider ctx_slot vs gen bits 122 - - IRAM will be small (hundreds of entries), so a smaller instr_addr 123 - field is reasonable 124 - - Generation counter is only needed for dyadic tokens (type 00); 125 - monadic tokens (type 01) can repurpose those bits 126 - - Multi-flit extended compute tokens (3 flits) provide an escape hatch 127 - for wider addressing at the cost of one extra flit cycle 144 + Monadic normal (prefix 010, 2 flits): 145 + flit 1: [0][1][0][PE:2][offset:7][ctx:4] = 16 bits 146 + flit 2: [data:16] = 16 bits 128 147 129 - Structure token (type 10, 2 flits): 130 - flit 1: [type:2][SM_id:2][operation:3][addr:9] = 16 bits 131 - flit 2: [data:16] (or return routing for READ operations) = 16 bits 148 + Dyadic narrow (prefix 011+00, 2 flits): 149 + flit 1: [0][1][1][PE:2][00][offset:5][ctx:4] = 16 bits 150 + flit 2: [data:8][port:1][gen:2][spare:5] = 16 bits 132 151 133 - System/extended token (type 11, 3+ flits): 134 - flit 1: [type:2][subtype:2][routing:12] = 16 bits 135 - flit 2: [payload / extended addr] = 16 bits 136 - flit 3: 
[payload / data] = 16 bits 137 - (flit 4 if needed for full instruction word writes, etc.) 152 + IRAM write (prefix 011+01, 2-3 flits): 153 + flit 1: [0][1][1][PE:2][01][iram_addr:7][flags:2] = 16 bits 154 + flit 2: [instruction_word_low:16] = 16 bits 155 + (flit 3: [instruction_word_high:8][spare:8] if needed) 156 + 157 + Monadic inline (prefix 011+10, 1 flit): 158 + flit 1: [0][1][1][PE:2][10][offset:4][ctx:4][spare:1] = 16 bits 159 + 160 + SM standard (prefix 1, 2 flits): 161 + flit 1: [1][SM_id:2][op:3-5][addr:8-10] = 16 bits 162 + flit 2: [data:16] or [return_routing:16] = 16 bits 138 163 ``` 139 164 165 + See `bus-architecture-and-width-decoupling.md` for detailed field analysis 166 + and `sm-design.md` for SM opcode encoding. 167 + 140 168 ### Variable-Length Token Summary 141 169 142 - | Token Type | Flits | Total Bits | Contents | 143 - |------------|-------|------------|----------| 144 - | Standard compute (monadic/dyadic) | 2 | 32 | addr + ctx + port + 16-bit data | 145 - | Structure operation | 2 | 32 | SM addr + op + 16-bit data or ret routing | 146 - | System / I/O | 3 | 48 | routing + subtype + payload | 147 - | Config write (IRAM load) | 3-4 | 48-64 | target PE + IRAM addr + instr word | 148 - | Extended data (32-bit value) | 3 | 48 | addr + ctx + 32-bit data | 170 + | Token Type | Prefix | Flits | Data | Offset | Ctx | Port | Gen | 171 + |------------|--------|-------|------|--------|-----|------|-----| 172 + | Dyadic wide | 00 | 2 | 16-bit | 5 (32) | 4 | 1 | 2 | 173 + | Monadic normal | 010 | 2 | 16-bit | 7 (128) | 4 | — | — | 174 + | Dyadic narrow | 011+00 | 2 | 8-bit | 5 (32) | 4 | 1 | 2 | 175 + | IRAM write | 011+01 | 2-3 | instr word | 7 (128) | — | — | — | 176 + | Monadic inline | 011+10 | 1 | none | 4 (16) | 4 | — | — | 177 + | SM standard | 1 | 2 | 16-bit | 8-10 | — | — | — | 149 178 150 179 ### Key Design Rationale 151 180 152 181 - **Opcodes don't travel**: tokens carry destination addresses, not opcodes. 
153 182 IRAM is fetched PE-locally after matching. Instruction width is completely 154 183 independent of bus width. 155 - - **16-bit data in all standard tokens**: every standard compute token 156 - carries a full 16-bit data word in flit 2, regardless of monadic/dyadic 157 - type. The old 14-bit dyadic data limitation is eliminated. 158 - - **Flit 1 is self-describing**: type + PE_id (or SM_id) enable routing 159 - after receiving just one flit. Routing nodes inspect flit 1 only; 160 - subsequent flits follow the same path. 161 - - **Variable-length is the escape hatch**: the common case (2 flits) is 162 - fast. Extended operations and wider addressing pay extra flits only 163 - when needed. 184 + - **1-bit top-level split**: bit[15] discriminates SM from CM traffic. One 185 + gate. The network routes on bit[15] + the 2-bit destination ID (PE_id or 186 + SM_id). Everything below that is endpoint-decoded. 187 + - **Hot path decode is shallow**: dyadic wide (the dominant format) is 188 + identified by two bits (bit[15]=0, bit[14]=0). The PE can begin matching 189 + store SRAM read immediately on flit 1 latch. 190 + - **IRAM writes are CM-local**: the PE manages its own instruction memory 191 + via the misc-bucket IRAM write format. No separate system token type. 164 192 - **Generation counter only on dyadic tokens**: prevents ABA problem when 165 - context slots are reused. Monadic and structure tokens don't need it. 193 + context slots are reused. Monadic and SM tokens don't need it. 166 194 - **Width domains are independent**: bus width (16-bit), token format 167 195 (variable flit count), IRAM width (32-48 bits), and PE pipeline width 168 196 (wider, decomposed) are each sized for their own constraints. 
See ··· 174 202 - Instruction memory (IM / IRAM): stores dataflow program (function bodies) 175 203 - Width decoupled from bus: 32-48 bits, sized for opcode + destination 176 204 encoding (see `bus-architecture-and-width-decoupling.md`) 177 - - **Runtime-writable** via type-11 config packets from the network 205 + - **Runtime-writable** via IRAM write tokens (prefix `011+01`) 178 206 - Write from network stalls the pipeline (acceptable for config operations) 179 207 - Enables runtime reprogramming and eliminates need for separate config bus 180 208 - Operand memory (OM) / matching store: buffers arriving operands, performs 181 209 matching. Entries are 16-bit data + 1-bit presence per slot. 182 - - Receives tokens from CN (types 00/01) and DN (SM results repackaged as 183 - type 00/01), produces tokens to CN and AN 210 + - Receives CM tokens (bit[15]=0) from CN and DN (SM results repackaged as 211 + CM tokens), produces tokens to CN and AN 184 212 - Contains the bump allocator, throttle, and generation counter logic 185 213 - Each PE has a unique ID, set via EEPROM (instruction decoder doubles as 186 214 ID store) or DIP switches during prototyping 187 215 - See `pe-design.md` for pipeline details 188 216 189 - ### SM (Structure Memory) — data storage and structure operations 217 + ### SM (Structure Memory) — data storage, structure operations, and IO 190 218 - Banked data memory (cells) for arrays, lists, heap data 191 219 - Embedded functional units for structure operations (read, write, atomic 192 - RMW, etc.) 
193 - - Receives operation requests via AN (type 10), returns results via DN 194 - (repackaged as type 00/01 tokens) 220 + RMW, bulk ops via EXEC/ITERATE/COPY_RANGE) 221 + - Receives operation requests via AN (bit[15]=1 tokens), returns results 222 + via DN (repackaged as CM tokens) 195 223 - Operates asynchronously from CMs — split-phase memory access 196 - - Pure data storage — no I/O mapping (I/O lives in the type-11 subsystem) 197 - - See `sm-design.md` for interface and banking details 224 + - **IO is memory-mapped into SM address space.** An SM (typically SM00 at 225 + v0) maps IO devices into its address range. I-structure semantics provide 226 + natural interrupt-free IO: a READ from an IO device that has no data 227 + defers until data arrives. 228 + - **SM00 has bootstrap responsibility:** wired to the system reset signal, 229 + it calls EXEC on a predetermined ROM address to load the system. At 230 + runtime, SM00 behaves as a standard SM; only the reset-vector wiring is 231 + special. See `sm-design.md` for details. 232 + - **Memory tiers:** SM address space supports regions with different 233 + semantics — tier 0 (raw, no presence bits), tier 1 (I-structure), and 234 + tier 2 (wide/bulk with is_wide tag). Tier selection is by address range. 
235 + - See `sm-design.md` for interface, banking, and tier details 198 236 199 - ### I/O Controller — peripheral interface 200 - - Fixed-function device on the network, NOT a full PE 201 - - Receives type-11 subtype-00 packets, interprets as I/O commands 202 - - Returns results as type 00/01 tokens to the requesting CM 203 - - Can spontaneously emit tokens (unsolicited I/O: UART RX, interrupt 204 - equivalent) — the only network participant that generates tokens 205 - without receiving one first 206 - - Also handles type-11 subtype-01 during bootstrap (reading from UART/flash, 207 - formatting config writes to load programs into PEs) 208 - - See `io-and-bootstrap.md` for design details 209 - 210 - ### Three Logical Interconnects (shared physical bus for v0) 237 + ### Two Logical Interconnects (shared physical bus for v0) 211 238 212 239 ``` 213 - CN (Communication Network): CM <-> CM, types 00/01 214 - AN (Arbitration Network): CM -> SM, type 10 215 - DN (Distribution Network): SM -> CM, type 10 results repackaged as 00/01 216 - System channel: any <-> I/O controller, type 11 240 + CN (Communication Network): CM <-> CM, bit[15]=0 241 + AN (Arbitration Network): CM -> SM, bit[15]=1 242 + DN (Distribution Network): SM -> CM, SM results repackaged as bit[15]=0 217 243 ``` 218 244 219 - For v0 (4 PEs + 1-2 SMs + I/O controller), all traffic shares a single 220 - physical 16-bit bus with type-based routing. Routing nodes inspect flit 1 221 - (type field + destination ID) and forward the entire multi-flit packet to 222 - the appropriate destination. Multiple packets can be in flight 245 + For v0 (4 PEs + 2-4 SMs), all traffic shares a single physical 16-bit bus 246 + with bit[15]-based routing. Routing nodes inspect flit 1 (bit[15] + 247 + destination ID in bits[14:13], [13:12], or [12:11] depending on prefix) and 248 + forward the entire multi-flit packet to the appropriate destination. Multiple packets can be in flight 223 249 simultaneously if the bus is pipelined with latches at each stage. 
224 250 225 251 The AN/DN can be split onto separate physical paths later if SM access 226 - contention becomes a bottleneck. The type-field-based routing means this 252 + contention becomes a bottleneck. The bit[15]-based routing means this 227 253 is a topology change, not a protocol change — no module interfaces need 228 254 to change. 229 255 ··· 235 261 |-----------|------------| 236 262 | 4x PE logic | 20-32K | 237 263 | Routing network (4 PEs) | 2-3K | 238 - | I/O controller | ~1-2K | 264 + | SM bootstrap/EXEC sequencer | ~1-2K | 239 265 | **Total logic** | **~25-35K** | 240 266 | SRAM chips (instruction mem, matching stores, token queues) | 8-16 chips | 241 267 242 - Bootstrap is handled by the I/O controller via type-11 config writes, 243 - or by an external microcontroller during early prototyping. No dedicated 244 - bootstrap hardware in the architecture. 268 + Bootstrap is handled by SM00's EXEC sequencer reading pre-formed tokens 269 + from ROM, or by an external microcontroller during early prototyping. 
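The SM00 EXEC bootstrap path above can be sketched as a token-streaming loop. The ROM layout (pre-formed flits back to back) and the all-zero end-of-image sentinel are illustrative assumptions, not specified behaviour; `inject` stands in for whatever pushes a packet onto the network:

```python
# Sketch of SM00's EXEC bootstrap sequencer: stream pre-formed tokens
# from ROM onto the network. ROM format and halt sentinel are assumed.

def flit_count(flit1):
    """Predict packet length from flit-1 prefix bits, per the format tables."""
    if flit1 & 0x8000:                # bit[15]=1: SM token, 2 flits
        return 2
    if (flit1 >> 13) == 0b011:        # misc bucket: sub field at bits [10:9]
        sub = (flit1 >> 9) & 0b11
        if sub == 0b10:               # monadic inline: trigger only, 1 flit
            return 1
        if sub == 0b01:               # IRAM write: 2-3 flits, flag-dependent;
            return 3                  # assume the 3-flit form here
        return 2
    return 2                          # dyadic wide / monadic normal

def exec_bootstrap(rom, inject):
    """Walk the ROM image, emitting one whole packet per iteration."""
    i = 0
    while i < len(rom):
        if rom[i] == 0x0000:          # assumed end-of-image sentinel
            break
        n = flit_count(rom[i])
        inject(rom[i:i + n])          # routing nodes see flit 1 first
        i += n
```

Because packet length is derivable from flit 1 alone, the sequencer needs no per-token metadata in ROM — the same property the routing nodes rely on.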
245 270 246 271 ## IPC / Performance Expectations 247 272 ··· 250 275 - With 4 PEs and single-cycle matching (common case), peak is 4 ops/clock 251 276 - Realistic sustained throughput depends on: 252 277 - Network crossing frequency (adds routing latency) 253 - - Hash path hits vs direct index (matching latency) 278 + - Direct index matching latency (single-cycle common case) 254 279 - Available parallelism in the program 255 280 - Network contention (shared bus at v0 scale) 256 281 - Parallel workloads (matrix multiply, FFT): near peak ··· 282 307 - Test with microcontroller injecting tokens, verify matching + execution 283 308 284 309 ### Phase 2: CM + SM pair 285 - - Connect via shared bus with type routing 286 - - Load a program using microcontroller (external, via type-11 config writes 310 + - Connect via shared bus with bit[15]-based routing 311 + - Load a program using microcontroller (external, via IRAM write tokens 287 312 or direct SRAM programming) 288 313 - Execute a dataflow graph that uses structure memory 289 314 - First real program: fibonacci, small FFT, or similar ··· 294 319 - Demonstrate actual parallel execution speedup 295 320 296 321 ### Phase 4: System 297 - - Expand to 4 CMs + 1-2 SMs 298 - - I/O controller (type-11 subsystem) with UART 299 - - Bootstrap via I/O controller reading from flash/serial 300 - - ISR support (compiler-assigned PE with interrupt token injection from 301 - I/O controller) 322 + - Expand to 4 CMs + 2-4 SMs 323 + - SM00 bootstrap via EXEC from ROM 324 + - IO memory-mapped into SM00 address space (UART, etc.) 325 + - ISR equivalent via I-structure semantics: READ from IO device defers 326 + until data arrives, triggering the receiving node in the dataflow graph 302 327 - Performance benchmarking vs period-equivalent CPUs 303 328 304 329 ## Open Questions / Next Steps 305 330 306 - 1. **SM internal design** — banking scheme, operation set, 307 - interface protocol (partially specified, see `sm-design.md`) 331 + 1. 
**SM internal design** — banking scheme, bulk op sequencer, tier 332 + boundary configuration, wide pointer cell format (partially specified, 333 + see `sm-design.md`) 308 334 2. **Context slot count per CM** — 4 bits = 16 slots vs 5 bits = 32 slots; 309 - drives flit-1 bit allocation tradeoffs 310 - 3. **Flit-1 bit allocation** — how to divide the 12 available bits (after 311 - type + PE_id) among instr_addr, ctx_slot, port, and gen. Smaller 312 - instr_addr is reasonable given small IRAM; gen bits only needed for 313 - dyadic. Multi-flit extended tokens provide escape hatch for wider 314 - addressing. See `bus-architecture-and-width-decoupling.md` open questions. 315 - 4. **Instruction encoding** — operation set, format, IRAM width (32 or 48 335 + the new dyadic wide format gives 5-bit offset with 4-bit ctx. See 336 + `bus-architecture-and-width-decoupling.md` for field allocation analysis. 337 + 3. **Instruction encoding** — operation set, format, IRAM width (32 or 48 316 338 bits). Decoupled from bus width, driven by opcode + destination fields. 317 339 > **Partially resolved:** the emulator and assembler use Python IntEnum 318 340 > values as opcode placeholders. These do NOT represent final hardware 319 341 > bit encodings — a hardware encoding pass is still needed. The 5-bit 320 342 > opcode table in `alu-and-output-design.md` is a preliminary draft. 321 - 5. **I/O controller internal design** — state machine, UART bridge, 322 - unsolicited token generation, flit-based I/O token format 323 - 6. ~~**Compiler / assembler**~~ — **Partially Resolved.** The `asm/` package 343 + 4. **IO address space allocation** — which SM_id is reserved for IO? How 344 + much of SM00's address space is mapped to IO vs general-purpose storage? 345 + SM00 is special only at boot for now; further specialisation deferred 346 + until profiling shows the standard opcodes are insufficient. 347 + 5. 
~~**Compiler / assembler**~~ — **Partially Resolved.** The `asm/` package 324 348 implements a 6-stage assembler pipeline (parse → lower → resolve → 325 349 place → allocate → codegen). Produces PEConfig/SMConfig + seed tokens 326 350 or a bootstrap token stream. See `assembler-architecture.md` for architecture.
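The prefix table above implies a short decode chain — bit[15], then bit[14], then bit[13], then the 2-bit sub field. A Python sketch of that classifier (format names are descriptive strings, not emulator identifiers):

```python
# Token-format classifier matching the prefix encoding table above.
# Each test mirrors one "gate" in the hardware decode chain.

def classify(flit1):
    if flit1 & 0x8000:                # bit[15]=1: SM token (one gate)
        return "sm"
    if not (flit1 & 0x4000):          # bit[14]=0: hot path (second gate)
        return "dyadic_wide"
    if not (flit1 & 0x2000):          # prefix 010
        return "monadic_normal"
    sub = (flit1 >> 9) & 0b11         # prefix 011: sub field at bits [10:9]
    return ["dyadic_narrow", "iram_write", "monadic_inline", "spare"][sub]
```

Note that the dominant format falls out after only two tests, matching the claim that nothing latency-critical lives more than two gates deep.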
+182 -173
design-notes/bus-architecture-and-width-decoupling.md
··· 49 49 enabling routing nodes to make forwarding decisions after receiving only 50 50 the first flit. 51 51 52 - Tokens are 1-2 flits for compute traffic, 2 flits for structure operations, 53 - and 3-4 flits for system/config operations. The number of flits is 54 - determined by the type field (and, for compute tokens, the subtype/mode 55 - bits) in flit 1 — routing nodes and receivers can predict packet length 56 - from the first flit alone. 52 + Tokens are 1-2 flits for compute traffic, 2 flits for SM operations, and 53 + 2-3 flits for IRAM writes. The number of flits is determined by the prefix 54 + bits in flit 1 — routing nodes and receivers can predict packet length from 55 + the first flit alone. 57 56 58 57 ### Impact on Throughput 59 58 ··· 95 94 destination, no instruction metadata. 96 95 - **IRAM width is completely independent of bus width.** Instruction words 97 96 are fetched from PE-local SRAM, never serialised onto the external bus 98 - (except during program loading via type-11 config writes, which is a 97 + (except during program loading via IRAM write tokens, which is a 99 98 slow-path operation). 100 99 - **The opcode is an implementation detail of the PE.** It gets loaded 101 100 into IRAM during program initialisation and sits there until replaced. ··· 116 115 117 116 Loading a program means: 118 117 119 - 1. Write IRAM entries via type-11 config packets (opcode + destinations 120 - per slot, multi-flit — 3 or 4 flits depending on instruction width) 118 + 1. Write IRAM entries via IRAM write tokens (prefix 011+01), 2-3 flits 119 + depending on instruction width 121 120 2. Inject initial tokens (constants, inputs) via I/O 122 121 3. The graph executes itself 123 122 124 - Replacing instructions at runtime is a type-11 config write to the target 125 - PE's IRAM. See **IRAM Valid-Bit Protection** below for safe swap protocol. 123 + Replacing instructions at runtime is an IRAM write token to the target 124 + PE. 
See **IRAM Valid-Bit Protection** below for safe swap protocol. 126 125 127 126 --- 128 127 129 - ## Flit-1 Bit Allocation: Routing vs PE Decode 128 + ## Flit-1 Bit Allocation: Routing vs Endpoint Decode 130 129 131 130 Flit 1 is 16 bits. The hard routing requirement is minimal: 132 131 133 - **Network routing nodes** inspect only `type:2` + `PE_id:2` (or `SM_id:2` 134 - for structure tokens). These 4 bits determine where the flit goes. The 135 - remaining 12 bits are opaque payload — the network forwards them blindly. 132 + **Network routing nodes** inspect only `bit[15]` (SM/CM split) plus the 133 + 2-bit destination ID. These 3 bits determine where the flit goes. The 134 + remaining bits are opaque payload — the network forwards them blindly. 136 135 137 - **The destination PE** decodes the full 16 bits to determine token format, 138 - how many further flits to clock in, and how to begin pipeline work. 136 + **The destination endpoint** (PE or SM) decodes the full 16 bits to 137 + determine token format, how many further flits to clock in, and how to 138 + begin processing. 
139 139 140 140 ### Routing Field 141 141 142 142 ``` 143 - Flit 1 bits [15:14] — type:2 144 - 0x = compute token → route to CM identified by PE_id in bits [13:12] 145 - 10 = structure → route to SM identified by SM_id in bits [13:12] 146 - 11 = system → route by subtype in bits [13:12] 143 + Flit 1 bit [15] — SM/CM discriminator: 144 + 0 = CM token → route to PE identified by PE_id 145 + 1 = SM token → route to SM identified by SM_id 146 + 147 + For CM tokens (bit[15]=0): 148 + bits [13:12] = PE_id for dyadic wide (prefix 00) 149 + bits [12:11] = PE_id for monadic normal and misc bucket (prefix 01x) 147 150 148 - Flit 1 bits [13:12] — destination ID:2 149 - For types 0x: PE_id (0-3, expandable via hierarchical routing) 150 - For type 10: SM_id (0-3) 151 - For type 11: subtype (I/O, config, reserved, reserved) 151 + For SM tokens (bit[15]=1): 152 + bits [14:13] = SM_id (0-3) 152 153 ``` 153 154 154 - Everything below bit 12 is destination-decoded, not network-decoded. 155 + Everything below the destination ID is endpoint-decoded, not 156 + network-decoded. The routing node's job is to extract the destination ID 157 + from the appropriate bit position based on bit[15]. 155 158 156 159 --- 157 160 158 - ## Compute Token Formats (types 00 and 01) 159 - 160 - The type field's LSB distinguishes dyadic (00) from monadic (01). Both 161 - route identically (network only looks at the MSB to see `0x` = compute). 162 - The PE decodes type[0] to determine matching behaviour and flit-2 layout. 
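The routing-node view described above — bit[15] plus a prefix-dependent 2-bit destination ID, everything else opaque — can be modelled in a few lines of Python (`route` is a placeholder name; bit positions follow the Routing Field block):

```python
# What a routing node extracts from flit 1 — nothing more. A sketch.

def route(flit1):
    """Return (destination kind, destination ID) from flit 1 alone."""
    if flit1 & 0x8000:                      # SM token
        return "SM", (flit1 >> 13) & 0b11   # SM_id at bits [14:13]
    if not (flit1 & 0x4000):                # dyadic wide (prefix 00)
        return "PE", (flit1 >> 12) & 0b11   # PE_id at bits [13:12]
    return "PE", (flit1 >> 11) & 0b11       # prefix 01x: PE_id at bits [12:11]
```

The prefix-dependent PE_id position is the cost of the variable-length encoding: the router needs one extra mux on the ID extraction, but still never looks below the destination ID.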
161 + ## CM Token Formats (bit[15]=0) 163 162 164 - Bit [11] is a **mode flag** whose interpretation depends on the type: 165 - 166 - - For dyadic (00): **width flag** — selects 8-bit or 16-bit data mode 167 - - For monadic (01): **inline flag** — selects 1-flit or 2-flit format 168 - 169 - This gives four compute token formats from the 2-bit type + 1-bit mode: 163 + The prefix encoding uses variable-length bit patterns to give the hot path 164 + (dyadic wide) the shallowest decode and the most generous field allocation. 170 165 171 - ### Format 1: Dyadic Narrow (type=00, width=0) — 2 flits, 8-bit data 166 + ### Dyadic Wide (prefix 00) — 2 flits, 16-bit data, HOT PATH 172 167 173 168 ``` 174 - Flit 1: [type:2=00][PE:2][W:1=0][offset:5][ctx:4][spare:2] = 16 bits 175 - Flit 2: [data:8][port:1][gen:2][spare:5] = 16 bits 169 + Flit 1: [0][0][PE:2][offset:5][ctx:4][port:1][gen:2] = 16 bits 170 + Flit 2: [data:16] = 16 bits 176 171 ``` 177 172 178 - 8-bit data mode. Port and generation counter ride in flit 2 alongside the 179 - narrow data word, freeing flit 1 for a wider offset field and 2 spare bits. 173 + The dominant token format. Port and gen live in flit 1 alongside routing. 180 174 181 175 - **offset:5** → 32 dyadic instruction slots per context (matching store 182 176 address = `[ctx:4][offset:5]` = 9 bits = 512 cells) 183 177 - **ctx:4** → 16 concurrent activations per PE 184 - - **port:1** → L/R operand discriminator (in flit 2) 185 - - **gen:2** → generation counter for ABA protection (in flit 2) 186 - - **spare:2** (flit 1) — reserved for future use (epoch, extended 187 - addressing flag, multicast, etc.) 188 - - **spare:5** (flit 2) — reserved. 
candidates: extended context bits, 189 - fragment tag, priority, debug flags 178 + - **port:1** → L/R operand discriminator 179 + - **gen:2** → generation counter for ABA protection 180 + - Full 16-bit data in flit 2 190 181 191 - The PE begins the matching store SRAM read on `[ctx:4][offset:5]` as soon 192 - as flit 1 is latched (bus cycle 0). Port and gen arrive in flit 2 (bus 193 - cycle 1) and are available by the time the SRAM read completes. 182 + The PE begins the matching store SRAM read on `[ctx:4][offset:5]` the 183 + instant flit 1 is latched. Decode: bit[15]=0 AND bit[14]=0 — two gates. 194 184 195 - ### Format 2: Dyadic Wide (type=00, width=1) — 2 flits, 16-bit data 185 + ### Monadic Normal (prefix 010) — 2 flits, 16-bit data 196 186 197 187 ``` 198 - Flit 1: [type:2=00][PE:2][W:1=1][offset:4][ctx:4][port:1][gen:2] = 16 bits 188 + Flit 1: [0][1][0][PE:2][offset:7][ctx:4] = 16 bits 199 189 Flit 2: [data:16] = 16 bits 200 190 ``` 201 191 202 - 16-bit data mode. Port and gen must live in flit 1, consuming bits that 203 - reduce the offset field. 192 + Standard monadic token. No matching store access — the arriving token IS 193 + the sole input. No port, no gen needed. 204 194 205 - - **offset:4** → 16 dyadic instruction slots per context (matching store 206 - address = `[ctx:4][offset:4]` = 8 bits = 256 cells) 195 + - **offset:7** → 128 IRAM entries addressable by monadic tokens 207 196 - **ctx:4** → 16 concurrent activations 208 197 - Full 16-bit data in flit 2 209 198 210 - Tighter than narrow mode — 16 dyadic entries per chunk vs 32. Adequate 211 - for most function fragments given instruction deduplication (see above). 212 - Programs needing more dyadic instructions per chunk split across PEs. 199 + Decode adds one more gate at bit[13]. 
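Field extraction for the two hot formats above can be sketched directly from the layouts (helper names are illustrative):

```python
# Flit-1 field extraction for the hot CM formats — a sketch, with bit
# positions taken from the layout diagrams above.

def dyadic_match_addr(flit1):
    """Dyadic wide: concatenate [ctx:4][offset:5] into the 9-bit
    matching-store address (512 cells), available the cycle flit 1 lands."""
    offset = (flit1 >> 7) & 0x1F      # bits [11:7]
    ctx = (flit1 >> 3) & 0xF          # bits [6:3]
    return (ctx << 5) | offset

def monadic_iram_offset(flit1):
    """Monadic normal: 7-bit IRAM offset at bits [10:4]."""
    return (flit1 >> 4) & 0x7F
```

Both extractions are pure wiring plus a concatenation — consistent with starting the matching-store SRAM read on the same cycle flit 1 is latched.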
200 + 201 + ### Misc Bucket (prefix 011) — infrequent formats 202 + 203 + The 2-bit sub field after the 011 prefix discriminates four formats: 213 204 214 - ### Format 3: Monadic Normal (type=01, inline=0) — 2 flits, 16-bit data 205 + **Dyadic Narrow (sub=00) — 2 flits, 8-bit data:** 215 206 216 207 ``` 217 - Flit 1: [type:2=01][PE:2][I:1=0][offset:7][ctx:4] = 16 bits 218 - Flit 2: [data:16] = 16 bits 208 + Flit 1: [0][1][1][PE:2][00][offset:5][ctx:4] = 16 bits 209 + Flit 2: [data:8][port:1][gen:2][spare:5] = 16 bits 219 210 ``` 220 211 221 - Standard monadic token. No matching store access (monadic instructions 222 - have one input — the arriving token IS the input). No port, no gen needed. 212 + 8-bit data mode. Port and gen ride in flit 2 alongside the narrow data, 213 + freeing flit 1 for a wider offset field. 223 214 224 - - **offset:7** → 128 IRAM entries addressable by monadic tokens 225 - - **ctx:4** → 16 concurrent activations (same context slot space) 226 - - Full 16-bit data in flit 2 215 + - **offset:5** → 32 dyadic slots per context (same as dyadic wide) 216 + - **spare:5** (flit 2) — reserved: extended context, fragment tag, etc. 227 217 228 - The 7-bit offset gives monadic tokens access to the full IRAM address 229 - space (128 entries). Dyadic instructions occupy the low offsets (0-15 or 230 - 0-31 depending on width mode); monadic instructions can be placed anywhere 231 - in 0-127. The compiler controls layout. 218 + **IRAM Write (sub=01) — 2-3 flits, instruction loading:** 219 + 220 + ``` 221 + Flit 1: [0][1][1][PE:2][01][iram_addr:7][flags:2] = 16 bits 222 + Flit 2: [instruction_word_low:16] = 16 bits 223 + (Flit 3: [instruction_word_high:8][spare:8] if IRAM > 16 bits) 224 + ``` 225 + 226 + Loads instruction words into PE IRAM. No ctx/port/gen needed — the PE 227 + manages its own instruction memory. 7-bit address = 128 IRAM slots, full 228 + coverage. 
The 2-bit flags field can signal 2-flit vs 3-flit mode and 229 + other load control (e.g., page valid-bit set/clear). 232 230 233 - ### Format 4: Monadic Inline (type=01, inline=1) — 1 flit, no data 231 + **Monadic Inline (sub=10) — 1 flit, trigger only:** 234 232 235 233 ``` 236 - Flit 1: [type:2=01][PE:2][I:1=1][offset:4][ctx:4][spare:3] = 16 bits 234 + Flit 1: [0][1][1][PE:2][10][offset:4][ctx:4][spare:1] = 16 bits 237 235 No flit 2. 238 236 ``` 239 237 240 - Single-flit token. Carries no data payload — it is a **trigger**. The 238 + Single-flit token. Carries no data payload — it is a trigger. The 241 239 instruction's IRAM word supplies any needed constant via its immediate 242 - field. The token's purpose is to signal "this input is ready" to fire a 243 - monadic instruction. 240 + field. 244 241 245 - - **offset:4** → 16 IRAM entries addressable by inline monadic tokens 246 - - **ctx:4** → 16 concurrent activations 247 - - **spare:3** — reserved. candidates: small immediate (0-7), flags, 248 - extended context, priority 242 + - **offset:4** → 16 IRAM entries addressable by inline tokens 243 + - **spare:1** — reserved 249 244 250 245 Use cases: synchronisation signals, barrier tokens, loop iteration 251 - triggers, boolean control flow edges where the value is implicit in the 252 - graph structure. These are common in dataflow programs and currently cost 253 - 2 bus cycles to deliver what is effectively 0-1 bits of useful data. 254 - Inline monadic halves the bus cost. 246 + triggers, SWITCH not-taken triggers. 247 + 248 + **Spare (sub=11) — reserved:** 249 + 250 + Candidates: extended monadic with wider offset, broadcast/multicast, 251 + debug/trace injection. 252 + 253 + ### Monadic Offset Relative Addressing 254 + 255 + Dyadic instructions pack into the lowest IRAM offsets (0-31). Monadic 256 + tokens never target those slots. 
Making monadic offsets **relative to the 257 + dyadic ceiling** avoids wasting encodings: 258 + 259 + ``` 260 + dyadic ceiling = 32 → monadic base = 32 261 + 7-bit relative offset → addresses 32-159 (128 monadic slots) 262 + ``` 263 + 264 + Hardware cost: since the dyadic ceiling is a power of 2 (32), the base 265 + can be OR'd onto the high address bits. One gate plus a config bit. 266 + 267 + SC blocks need contiguous IRAM space that does not respect the 268 + dyadic-below-monadic packing rule. The compiler packs SC blocks into a 269 + separate IRAM region, addressed via a base register set on entering SC 270 + mode. 255 271 256 - The PE knows to expect no flit 2 because `type=01, inline=1` is decoded 257 - on flit 1 arrival. The bus is free for the next token immediately. 272 + ### CM Token Summary 258 273 259 - ### Compute Token Summary 274 + | Format | Prefix | Flits | Data | Offset | Ctx | Port | Gen | 275 + |--------|--------|-------|------|--------|-----|------|-----| 276 + | Dyadic wide | 00 | 2 | 16-bit | 5 (32) | 4 | flit1 | flit1 | 277 + | Monadic normal | 010 | 2 | 16-bit | 7 (128) | 4 | — | — | 278 + | Dyadic narrow | 011+00 | 2 | 8-bit | 5 (32) | 4 | flit2 | flit2 | 279 + | IRAM write | 011+01 | 2-3 | instr word | 7 (128) | — | — | — | 280 + | Monadic inline | 011+10 | 1 | none | 4 (16) | 4 | — | — | 281 + | (spare) | 011+11 | ? | ? | ? | ? | ? | ? 
| 260 282 261 - | Format | Type | Mode | Flits | Data | Offset | Ctx | Port | Gen | 262 - |--------|------|------|-------|------|--------|-----|------|-----| 263 - | Dyadic narrow | 00 | W=0 | 2 | 8-bit | 5 (32) | 4 | flit2 | flit2 | 264 - | Dyadic wide | 00 | W=1 | 2 | 16-bit | 4 (16) | 4 | flit1 | flit1 | 265 - | Monadic normal | 01 | I=0 | 2 | 16-bit | 7 (128) | 4 | — | — | 266 - | Monadic inline | 01 | I=1 | 1 | none | 4 (16) | 4 | — | — | 283 + ### Hot Path Decode Summary 267 284 268 - All four formats share `[type:2][PE:2]` at the top and `[ctx:4]` in the 269 - same bit position, simplifying PE input decode. The mode flag at bit [11] 270 - tells the PE everything it needs to know about flit count and field layout. 285 + - bit[15] splits SM/CM: one gate 286 + - bit[14] splits dyadic-wide from everything else: one gate 287 + - bit[13] splits monadic normal from misc bucket: one more gate 288 + - the misc bucket is three gates deep, but nothing there is 289 + latency-critical 271 290 272 291 ### Spare Bits and Future Use 273 292 274 293 The spare bits are explicitly reserved, not accidentally unused: 275 294 276 - - **Dyadic narrow flit 1, spare:2** — future candidates: IRAM page epoch 277 - (for identity checking beyond the valid-bit gate), extended addressing 278 - flag (signals a 3-flit token with wider address in flit 3), or multicast. 279 295 - **Dyadic narrow flit 2, spare:5** — future candidates: extended context 280 296 (up to 9 bits total ctx), fragment tag, debug/trace ID, priority level. 281 - - **Monadic inline spare:3** — future candidates: small immediate value 282 - (0-7 for booleans, enum tags, iteration counters), extended context bits, 283 - or an "extended interpretation" flag that changes the meaning of the 284 - remaining bits for special-purpose tokens. 297 + - **Monadic inline spare:1** — future candidate: extended context bit or 298 + flag for extended interpretation. 299 + - **Misc bucket sub=11** — entire format reserved for future use. 
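The monadic relative-addressing scheme described above can be sketched as follows, assuming the 32-entry dyadic ceiling. One caveat worth encoding: the pure-OR shortcut is only exact while the relative offset stays below the ceiling (so the base bit and offset bits never overlap); this sketch therefore uses an add for the full 7-bit range, with the OR variant shown for comparison.

```python
DYADIC_CEILING = 32   # power of 2, selected by a config bit

def monadic_iram_addr(rel_offset: int) -> int:
    """Map a 7-bit monadic offset, relative to the dyadic ceiling, onto
    the physical IRAM address space (slots 32-159 in this configuration)."""
    assert 0 <= rel_offset < 128
    return DYADIC_CEILING + rel_offset

def monadic_iram_addr_or(rel_offset: int) -> int:
    """The OR shortcut: matches the add only while rel_offset < 32,
    i.e. while the base bit and offset bits do not overlap."""
    return DYADIC_CEILING | rel_offset
```

For offsets below the ceiling the two forms agree; above it the high address bits need a carry, so the single-gate OR covers only the low portion of the monadic range.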
285 300 286 301 The spare bits provide escape hatches for architectural evolution without 287 302 changing the base format. v0 should treat them as must-be-zero on ··· 289 304 290 305 --- 291 306 292 - ## Structure Token Format (type 10) — unchanged 307 + ## SM Token Format (bit[15]=1) 293 308 294 309 ``` 295 - Structure token (2 flits): 296 - flit 1: [type:2=10][SM_id:2][operation:3][addr:9] = 16 bits 297 - flit 2: [data:16] (WRITE) or [return routing:16] (READ variants) = 16 bits 310 + SM token (2 flits, standard): 311 + flit 1: [1][SM_id:2][op:3-5][addr:8-10] = 16 bits 312 + flit 2: [data:16] or [return_routing:16] = 16 bits 298 313 ``` 299 314 300 - See `sm-design.md` for operation set, extended addressing, and CAS handling. 315 + 15 bits available after the SM discriminator. SM_id (2 bits) selects one of 316 + 4 SMs. The remaining 13 bits are split between opcode and address using 317 + variable-width encoding: 301 318 302 - 3-flit extended addressing mode for wide address spaces: 303 319 ``` 304 - flit 1: [type:2=10][SM_id:2][op:3][flags:9] = 16 bits 305 - flit 2: [extended_addr:16] = 16 bits 306 - flit 3: [data:16] or [return routing:16] = 16 bits 320 + op[2:1] != 11: 3-bit opcode, 10-bit addr (1024 cells) 321 + READ, WRITE, ALLOC, FREE, CLEAR, EXT 322 + 323 + op[2:1] == 11: extends to 5-bit opcode, 8-bit payload (256 cells or inline data) 324 + READ_INC, READ_DEC, CAS, RAW_READ, EXEC, SET_PAGE, WRITE_IMM, (spare) 307 325 ``` 308 326 309 - --- 327 + One decode gate on op[2:1] discriminates the two tiers. 310 328 311 - ## System Token Format (type 11) — 3+ flits 329 + See `sm-design.md` for the full opcode table, extended addressing, and 330 + CAS handling. 312 331 313 - ``` 314 - System/extended token (3+ flits): 315 - flit 1: [type:2=11][subtype:2][routing:12] = 16 bits 316 - flit 2: [payload / extended addr] = 16 bits 317 - flit 3: [payload / data] = 16 bits 318 - (flit 4 if needed for full instruction word writes, etc.) 
332 + **Return routing in flit 2:** for READ and other result-producing ops, 333 + flit 2 carries a **pre-formed CM token template**. The SM's result 334 + formatter latches the template, prepends it as flit 1, and appends the 335 + read data as flit 2. No bit-shuffling — the requesting CM does all the 336 + format work upfront. This means SM results can land directly in a matching 337 + store slot as one operand of a dyadic instruction without an intermediate 338 + forwarding step. 319 339 320 - Subtypes: 321 - 11 + 00: I/O operation (routed to I/O controller) 322 - 11 + 01: config write (IRAM load, routing table, etc.) 323 - 11 + 10: reserved (debug/trace, DMA) 324 - 11 + 11: reserved 325 - ``` 326 - 327 - Config writes (subtype 01) include IRAM loading. The target PE recognises 328 - its own address in the routing field and activates the instruction bus for 329 - writing. This shares the same input FIFO as compute tokens — see IRAM 330 - Valid-Bit Protection below. 340 + **IO is memory-mapped SM:** IO devices are mapped into SM address space 341 + (typically SM00 at v0). IO operations use the standard SM token format. 342 + I-structure semantics provide natural interrupt-free IO — a READ from an 343 + IO device that has no data defers until data arrives. 
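The SM-side handling above can be sketched in Python (helper names are hypothetical; the opcode numbering is not fixed here, only the 3/5-bit tier split and the template pass-through):

```python
def decode_sm_flit1(flit: int) -> dict:
    """Split the 13 bits after [1][SM_id:2] into opcode and address using
    the variable-width rule: op[2:1] == 11 selects the extended 5-bit tier."""
    assert flit & 0x8000, "not an SM token"
    sm_id = (flit >> 13) & 0b11
    op3 = (flit >> 10) & 0b111
    if (op3 >> 1) != 0b11:                     # 3-bit opcode, 10-bit addr
        return {"sm_id": sm_id, "op": op3, "addr": flit & 0x3FF}
    return {"sm_id": sm_id,                    # 5-bit opcode, 8-bit payload
            "op": (flit >> 8) & 0b11111,
            "payload": flit & 0xFF}

def sm_format_result(template_flit1: int, data: int) -> tuple:
    """Result formatter: the request's flit 2 (a pre-formed CM token
    template) becomes flit 1, and the read data is appended as flit 2.
    The SM never parses the template."""
    return (template_flit1, data & 0xFFFF)
```

The requesting CM builds `template_flit1` with whatever CM format it wants the result delivered in; the SM's formatter is pure concatenation.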
331 344 332 345 --- 333 346 334 347 ## Variable-Length Token Summary 335 348 336 - | Token Type | Flits | Total Bits | Contents | 337 - |------------|-------|------------|----------| 338 - | Monadic inline (trigger) | 1 | 16 | addr + ctx only, no data | 339 - | Dyadic narrow (8-bit data) | 2 | 32 | addr + ctx + 8-bit data + port + gen | 340 - | Dyadic wide (16-bit data) | 2 | 32 | addr + ctx + port + gen + 16-bit data | 341 - | Monadic normal (16-bit data) | 2 | 32 | addr + ctx + 16-bit data | 342 - | Structure operation | 2 | 32 | SM addr + op + 16-bit data or ret routing | 343 - | Structure extended addr | 3 | 48 | SM addr + op + wide addr + data | 344 - | System / I/O | 3 | 48 | routing + subtype + payload | 345 - | Config write (IRAM load) | 3-4 | 48-64 | target PE + IRAM addr + instruction word | 349 + | Token Type | Prefix | Flits | Total Bits | Contents | 350 + |------------|--------|-------|------------|----------| 351 + | Dyadic wide | 00 | 2 | 32 | offset + ctx + port + gen + 16-bit data | 352 + | Monadic normal | 010 | 2 | 32 | offset + ctx + 16-bit data | 353 + | Dyadic narrow | 011+00 | 2 | 32 | offset + ctx + 8-bit data + port + gen | 354 + | IRAM write | 011+01 | 2-3 | 32-48 | iram_addr + instruction word | 355 + | Monadic inline | 011+10 | 1 | 16 | offset + ctx only, no data | 356 + | SM standard | 1 | 2 | 32 | SM_id + op + addr + 16-bit data or ret routing | 346 357 347 358 The common case is 2 flits. Inline monadic (1 flit) is the fast path for 348 - control-flow tokens. System and config operations pay 3-4 flits but are 349 - infrequent during execution. 359 + control-flow tokens. IRAM writes pay 2-3 flits but are infrequent during 360 + execution. 350 361 351 362 --- 352 363 ··· 354 365 355 366 ### The Problem 356 367 357 - IRAM is written via type-11 config packets that share the PE's input FIFO 368 + IRAM is written via IRAM write tokens (prefix 011+01) that share the PE's input FIFO 358 369 with compute tokens. 
When IRAM contents are replaced at runtime (swapping 359 370 function fragments in and out), tokens in flight may target IRAM addresses 360 371 that have been or are being overwritten. Because tokens do not carry ··· 392 403 sequence is naturally ordered: 393 404 394 405 ``` 395 - 1. Loader sends drain signal (implementation TBD — could be a type-11 396 - "quiesce" subop, or the PE back-pressures via handshake/ready signal) 406 + 1. Loader sends drain signal (implementation TBD — could be an IRAM write 407 + "quiesce" flag, or the PE back-pressures via handshake/ready signal) 397 408 2. PE processes remaining compute tokens in FIFO (natural drain) 398 - 3. Type-11 ld_inst config write arrives in FIFO: 409 + 3. IRAM write token (prefix 011+01) arrives in FIFO: 399 410 a. PE clears valid bit for target page 400 411 b. PE writes instruction word to IRAM at specified address 401 412 c. If more ld_inst packets follow (burst), keep writing 402 - 4. Load-complete marker arrives (type-11 subop or last-write flag): 413 + 4. Load-complete marker arrives (IRAM write with load-complete flag): 403 414 a. PE sets valid bit for target page 404 415 b. PE resumes accepting compute tokens for that page 405 416 ``` ··· 410 421 is after the load completes. Rejected tokens are therefore late arrivals 411 422 from the old epoch — work that is being abandoned. 412 423 424 + **Presence-bit optimisation:** the matching store's presence bitmap can 425 + be checked before step 1 to determine if any tokens are pending for the 426 + target page. If all presence bits for slots in that page are clear, the 427 + drain step can be skipped entirely — no tokens are waiting for those 428 + instructions. This enables targeted IRAM replacement without stalling 429 + the entire PE. See `sm-design.md` "Presence-Bit Guided IRAM Writes". 
430 + 413 431 ### Rejection Policy 414 432 415 433 **v0: discard silently + diagnostic.** Late-arriving tokens targeting an ··· 421 439 No token format changes, no NAK generation, no recirculation. If the flag 422 440 lights up, something is wrong with the drain timing. 423 441 424 - **Future: NAK response.** The PE could form a type-11 NAK token from the 442 + **Future: NAK response.** The PE could form a NAK token from the 425 443 rejected compute token and emit it to a coordinator. The output serialiser 426 444 already exists (stage 4 forms tokens); a bypass path from stage 2 to the 427 445 output would enable this. Estimated cost: a mux and some control logic, ··· 492 510 - Width: 2-3 parallel 8-bit SRAM chips for 16-48 bit instruction words, 493 511 all addressed by the same address lines 494 512 - Read in one pipeline stage (instruction fetch, after matching completes) 495 - - Written only during program loading (type-11 config) with valid-bit 513 + - Written only during program loading (IRAM write tokens) with valid-bit 496 514 protection 497 515 498 516 Because IRAM read is a single pipeline stage using parallel SRAM chips, ··· 512 530 ## Matching Store: Addressing and Layout 513 531 514 532 The matching store is indexed by `[ctx_slot : offset]` where both fields 515 - come from flit 1. The offset width depends on the token's width mode: 533 + come from flit 1. Both dyadic formats now have 5-bit offset: 516 534 517 535 | Token format | ctx bits | offset bits | Match store cells | SRAM address bits | 518 536 |--------------|----------|-------------|-------------------|-------------------| 519 - | Dyadic narrow (W=0) | 4 | 5 | 512 | 9 | 520 - | Dyadic wide (W=1) | 4 | 4 | 256 | 8 | 537 + | Dyadic wide (prefix 00) | 4 | 5 | 512 | 9 | 538 + | Dyadic narrow (prefix 011+00) | 4 | 5 | 512 | 9 | 521 539 522 - Both configurations fit in a single SRAM chip. 
If the PE only supports 523 - one width mode (configuration-time choice), the matching store address 524 - generation is just wire concatenation with the unused address line tied 525 - low. If both modes are supported simultaneously, the W bit from flit 1 526 - gates the top address line. 540 + Both formats address the same 512-cell matching store. Address generation 541 + is simple wire concatenation: `[ctx:4][offset:5]`. 527 542 528 543 ### Entry Format 529 544 ··· 532 547 [presence:1][data:8 or 16] = 9 or 17 bits per entry 533 548 ``` 534 549 535 - **8-bit data mode:** matching store entries are 8-bit data + presence. 536 - Physically: one 8-bit-wide SRAM chip for data, separate presence bitmap 537 - in a register file. 512 entries x 8 bits = 512 bytes = one standard SRAM. 538 - 539 - **16-bit data mode:** matching store entries are 16-bit data + presence. 540 - Physically: one 16-bit-wide SRAM (or two 8-bit SRAMs) for data, separate 541 - presence bitmap. 256 entries x 16 bits = 512 bytes = one standard SRAM. 542 - 543 - Both modes use the same physical SRAM capacity. The tradeoff is data 544 - width vs entry count, controlled by token format. 550 + Matching store entries are 16-bit data + presence. Physically: one 551 + 16-bit-wide SRAM (or two 8-bit SRAMs) for data, separate presence bitmap. 552 + 512 entries x 16 bits = 1024 bytes. 545 553 546 554 Metadata (presence bitmap, port indicators, generation counters) stored 547 555 separately in dedicated fast-access registers: ··· 690 698 for control logic, 16-bit PE for arithmetic). Hardwired is simpler. 691 699 4. **Inline monadic spare bits** — small immediate (0-7) vs reserved? 692 700 Immediate is useful for boolean/enum signals but adds decode logic. 693 - 5. **SM result token width** — does SM always return 16-bit (wide) results? 694 - Or can it return 8-bit (narrow) based on the request? Affects return 695 - routing format in SM request tokens. 701 + 5. 
**SM result token format** — SM results use the pre-formed token template 702 + from the request's flit 2 as their flit 1. The template can encode any 703 + CM format whose routing fits in 16 bits (dyadic wide, monadic normal, 704 + monadic inline). SM does not parse the template. 696 705 6. **Network handling of 1-flit packets** — routing nodes need to know 697 706 packet length from flit 1 to avoid waiting for a flit 2 that won't 698 - come. The `type=01, bit[11]=1` decode is straightforward but must be 707 + come. The `prefix=011, sub=10` (monadic inline) decode must be 699 708 implemented in every routing node, not just destination PEs. 700 709 7. **Valid-bit page granularity** — 4 pages of 32 entries, or 8 pages of 701 710 16? Finer granularity allows partial IRAM updates without invalidating 702 711 unrelated code, at the cost of more valid-bit storage (still trivial). 703 - 8. **Drain protocol specifics** — quiesce via type-11 subop, or via 712 + 8. **Drain protocol specifics** — quiesce via IRAM write flags, or via 704 713 handshake backpressure, or both? Needs to be defined before IRAM 705 714 swap can be implemented.
+382
design-notes/historical-plausibility.md
··· 1 + # Historical Plausibility: A Dataflow Microcomputer circa 1979–1984 2 + 3 + Research notes on transistor budgets, memory technology, and the 4 + counterfactual case for a multi-PE dataflow system built with 5 + period-appropriate technology. 6 + 7 + --- 8 + 9 + ## 1. Contemporary Processor Reference Points 10 + 11 + | Chip | Year | Transistors | Process | On-chip storage | Notes | 12 + |------|------|------------|---------|----------------|-------| 13 + | MOS 6502 | 1975 | ~3,510 (logic only) | 8 µm NMOS | A, X, Y internal regs only | Minimal transistor budget | 14 + | Z80 | 1976 | ~8,500 | 4 µm NMOS | ~20 registers (two banks + specials) | | 15 + | TMS9900 | 1976 | ~8,000 est. | NMOS | 3 internal regs; **16 GP regs in external RAM** | Minicomputer heritage | 16 + | Am2901 | 1975 | ~1,000 | Bipolar (TTL/ECL) | 16 × 4-bit register file | Bit-slice, 16 MHz | 17 + | Intel 8086 | 1978 | ~29,000 | 3.2 µm NMOS | 14 registers; large microcode ROM | | 18 + | Motorola 68000 | 1979 | ~68,000 | 3.5 µm NMOS | 8 data + 7 addr regs (32-bit) | Clean 32-bit ISA | 19 + | Intel 80286 | 1982 | ~134,000 | 1.5 µm | No cache | | 20 + | ARM1 | 1985 | ~25,000 | 3 µm CMOS | 25 × 32-bit registers | RISC; 50 mm² die | 21 + | Motorola 68020 | 1984 | ~190,000 | 2 µm CMOS | **256-byte instruction cache** | First µP with on-chip cache | 22 + | INMOS T414 | 1985 | ~200,000* | 1.5 µm CMOS | **2 KB on-chip SRAM** | Transputer; 4 serial links | 23 + | INMOS T800 | 1987 | ~250,000+ | CMOS | **4 KB on-chip SRAM** + FPU | | 24 + | EM-4 (EMC-R) | 1990 | ~45,788 gates | 1.5 µm CMOS | 1.31 MB SRAM per PE | 80-PE prototype built | 25 + 26 + \*The T414's ~200K figure likely includes SRAM cell transistors. The logic 27 + core was probably 50–80K transistors, with on-chip SRAM accounting for a 28 + large fraction of the total. 29 + 30 + **Our PE target: ~3–5K transistors of logic + external SRAM chips.** 31 + 32 + This places each PE squarely in 6502-to-Z80 territory for logic 33 + complexity. 
The PE is not a complete general-purpose processor — it 34 + trades away the program counter, complex instruction decoder, microcode 35 + ROM, and sequential control flow machinery in exchange for matching 36 + store logic and token handling. 37 + 38 + --- 39 + 40 + ## 2. The TMS9900 Precedent 41 + 42 + The TMS9900 (1976) is the strongest historical precedent for the 43 + external-storage PE model. It had only 3 internal registers (PC, WP, SR) 44 + and accessed its 16 "general purpose registers" through external SRAM via 45 + a workspace pointer. Every register access was a memory access. 46 + 47 + At 3 MHz with ~300 ns SRAM, register accesses were single-cycle. The 48 + penalty was real — a register-to-register ADD took 14 clock cycles due to 49 + multiple memory round-trips — but the design worked, and context-switch 50 + speed was unmatched (change one pointer, entire register set swaps). 51 + 52 + Our PE has the same structural relationship to its SRAM: matching store, 53 + instruction memory, and context slots all live in external SRAM, accessed 54 + via direct address concatenation. The key difference is that our PE 55 + doesn't need multiple memory round-trips per instruction — a token 56 + arrives, we index into the matching store (one SRAM access), check the 57 + occupied bit, and if matched, fetch the instruction (one SRAM access) and 58 + execute. Arguably *fewer* memory accesses per useful operation than the 59 + TMS9900. 60 + 61 + The TMS9900 also demonstrates that the "registers in external memory" 62 + approach was commercially viable and accepted by the market in the 63 + mid-1970s. The 40-pin implementations (TMS9995 etc.) later included 64 + 128–256 bytes of fast on-chip RAM for registers, validating the 65 + evolutionary path from external to on-chip storage. 66 + 67 + --- 68 + 69 + ## 3. SRAM Technology and Access Times 70 + 71 + ### Available Parts by Year 72 + 73 + | Part | Organisation | Access Time | Approx. 
Availability | 74 + |------|-------------|------------|---------------------| 75 + | Intel 2102 | 1K × 1 | 500–850 ns | 1972 | 76 + | Intel 2114 | 1K × 4 | 200–450 ns | ~1977 | 77 + | Intel 2147 | 4K × 1 | 55–70 ns | 1979 (bipolar, expensive) | 78 + | HM6116 | 2K × 8 | 120–200 ns | ~1981 | 79 + | HM6264 | 8K × 8 | 70–200 ns | ~1983–84 | 80 + | HM62256 | 32K × 8 | 70–150 ns | mid-1980s | 81 + 82 + ### Clock Speed vs SRAM Access 83 + 84 + | Clock | Period | Single-cycle SRAM needed | Available by | 85 + |-------|--------|-------------------------|-------------| 86 + | 4 MHz | 250 ns | 250 ns (2114 comfortably) | 1977 | 87 + | 5 MHz | 200 ns | 200 ns (2114, tight) | 1978 | 88 + | 8 MHz | 125 ns | 125 ns (6116 fast grades) | 1982 | 89 + | 10 MHz | 100 ns | 100 ns (6264 fast grades) | 1984 | 90 + 91 + At 5 MHz with 200 ns 2114s, single-cycle *read or write* is achievable. 92 + Single-cycle read-modify-write (required for matching store) is not — 93 + the 2114 is single-ported and 200 ns access fills the entire clock 94 + period. This constrains matching store pipeline throughput to one 95 + operation per 2 clock cycles in the 1979 scenario. See the companion 96 + document on pipelining for approaches to mitigating this. 97 + 98 + By 1981–82 with 150 ns 6116 parts at 5 MHz, a half-clock 99 + read/write split becomes feasible (100 ns per half-cycle, with margin). 100 + By 1984 with fast 6264 parts, 10 MHz pipelined operation is practical. 
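The clock-versus-SRAM arithmetic in the tables above reduces to a one-line feasibility check (a sketch; part grades and dates as tabulated):

```python
def single_cycle_ok(clock_mhz, access_ns):
    """A single read or write fits in one cycle when the SRAM access time
    does not exceed the clock period (1000 / f_MHz nanoseconds)."""
    return access_ns <= 1000.0 / clock_mhz

# 1979 scenario: 5 MHz with 200 ns 2114s. One access per cycle fits
# exactly, but the matching store's read-modify-write needs two accesses,
# hence one operation per 2 clock cycles:
assert single_cycle_ok(5, 200)
assert not single_cycle_ok(5, 2 * 200)
# 1984: fast 6264 grades make 10 MHz single-cycle access practical:
assert single_cycle_ok(10, 100)
```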
101 + 102 + ### The 74LS670 Register File 103 + 104 + The SN74LS670 (4 × 4 register file with 3-state outputs) provides a 105 + critical capability: **true simultaneous read and write to different 106 + addresses**, with: 107 + 108 + - Read access time: ~20–24 ns typical 109 + - Write time: ~27 ns typical 110 + - Separate read and write address/enable inputs (dual-ported) 111 + - 16-pin DIP, ~98 gate equivalents, ~125 mW 112 + 113 + This part was available in the LS family by the late 1970s (the LS 114 + subfamily was comprehensively available by 1977–78). At $2–4 per chip in 115 + volume, it's affordable for targeted use in pipeline bypass logic and 116 + small register files. 117 + 118 + The 670's 4-bit word width is an exact match for per-entry matching 119 + store metadata (1 occupied bit + 1 port bit + 2 generation counter 120 + bits), making it ideal for a write-through metadata cache. See the 121 + companion pipelining document for the full design. 122 + 123 + --- 124 + 125 + ## 4. Per-PE Chip Count Analysis (1979 Scenario) 126 + 127 + ### Configuration: 8 context slots × 32 entries = 256 cells 128 + 129 + Using 2114 (1K × 4) SRAM for bulk storage and 74LS670 for fast-path 130 + and register file functions. 131 + 132 + **Matching store data (16-bit operand values):** 133 + 256 entries × 16 bits = 512 bytes. 134 + 4 × 2114 in parallel (each contributes 4 bits of the 16-bit word, 135 + using 256 of 1024 available locations). 136 + 137 + **Matching store metadata:** 138 + Handled by shared 670-based cache / SC register file. 139 + See pipelining companion document. 140 + 141 + **Instruction RAM (128 entries × 24-bit):** 142 + 128 × 24 bits = 384 bytes. 143 + 4 × 2114 (using 128 of 1024 locations × 4 bits each, paralleled for 144 + width). Alternatively 6 × 2114 for 24-bit width without bit waste, 145 + depending on encoding. 146 + 147 + **Shared metadata cache / SC register file / predicate register:** 148 + 8 × 74LS670 (see companion document). 
149 + 150 + **ALU + control logic:** 151 + ~15–20 TTL chips (adder, logic unit, comparators, muxes, shifter, 152 + EEPROM decoder, pipeline state machine, bus serialiser/deserialiser). 153 + 154 + **Per-PE total: ~31–36 chips (1979 parts)** 155 + 156 + ### 4-PE System Total 157 + 158 + | Subsystem | Chips | 159 + |-----------|-------| 160 + | 4 × PE logic + SRAM | 124–144 | 161 + | Interconnect (shared bus, arbitration) | ~10–15 | 162 + | SM (structure memory, 4–8 banks) | ~20–40 | 163 + | I/O + bootstrap | ~15–25 | 164 + | **System total** | **~170–225 chips** | 165 + 166 + This is comparable to a late-1970s minicomputer CPU board, or roughly 167 + two S-100 boards' worth of components. Well within the engineering 168 + capability and cost envelope of a minicomputer product. 169 + 170 + --- 171 + 172 + ## 5. The 68000 Comparison 173 + 174 + The 68000 (1979) is the most apt contemporary comparison: 175 + 176 + - **Instruction width**: 68000 uses 16-bit instruction words encoding a 177 + 32-bit ISA. Our IRAM uses ~24-bit instruction words encoding dataflow 178 + operations. Comparable. 179 + - **Data path**: 68000 has 16-bit external bus, 32-bit internal paths. 180 + Our design has 16-bit external bus, wider internal pipeline registers 181 + (~64–68 bits). Structurally similar. 182 + - **Logic budget**: 68000 uses ~68,000 transistors, of which a huge 183 + fraction is microcode ROM for complex instruction decode. Our 4-PE 184 + system at ~3–5K logic transistors each = 12–20K transistors of PE 185 + logic. With interconnect and I/O, maybe 25–35K total. Roughly half a 186 + 68000 in logic, or about one-third when counting the 68000's internal 187 + register file transistors. 188 + - **SRAM dependency**: 68000 has on-chip registers (expensive in 189 + transistors). Our design uses external SRAM (cheap in silicon, more 190 + board space). The TMS9900 proved this trade-off was commercially 191 + viable three years earlier. 
192 + 193 + At 1979's 3.5 µm NMOS process, 25K transistors of logic fits in 194 + ~15–25 mm² of die area. The 68000 die was ~44 mm². A single-die 195 + integration of 4 PEs (logic only, SRAM external) would be significantly 196 + smaller and cheaper than a 68000. 197 + 198 + --- 199 + 200 + ## 6. The Transputer Comparison (1985) 201 + 202 + The INMOS T414 transputer (1985) is the closest historical analogue to 203 + what we're proposing, but approached from a different direction: 204 + 205 + | | T414 Transputer | Our 4-PE Design | 206 + |---|---|---| 207 + | Architecture | Single complex PE | 4 simple PEs | 208 + | Parallelism model | CSP message passing (explicit) | Dataflow (implicit) | 209 + | On-chip storage | 2 KB SRAM | External SRAM | 210 + | Transistors | ~200,000 | ~25–35K logic | 211 + | Process | 1.5 µm CMOS | 3.5 µm NMOS (1979 target) | 212 + | Inter-PE communication | Serial links (20 Mbit/s) | Shared bus or dedicated links | 213 + | Programming model | occam (explicit distribution) | Compiler-managed graph | 214 + 215 + The Transputer took the "one big smart PE with built-in message passing" 216 + path. Our architecture takes the "many small dumb PEs with implicit 217 + synchronisation" path. The Transputer's 200K transistors could fund 218 + ~40–60 of our PEs in raw logic. Even accounting for SRAM overhead, 219 + an integrated version at Transputer-class process technology could pack 220 + 8–16 PEs on a die, which is qualitatively different from a single 221 + Transputer — you're getting genuine fine-grained parallelism rather than 222 + coarse-grained task parallelism. 223 + 224 + --- 225 + 226 + ## 7. Why Multi-Processor Microcomputers Didn't Happen (And Why Dataflow Changes This) 227 + 228 + ### The Historical Blockers 229 + 230 + 1. **Cache coherence**: von Neumann processors sharing memory need 231 + coherence protocols. These are complex and were not well understood 232 + until the mid-1980s. 233 + 234 + 2. 
**Software parallelism**: writing parallel software for shared-memory 235 + von Neumann machines was (and remains) brutally difficult. The 236 + installed base of sequential FORTRAN and C code was enormous. 237 + 238 + 3. **Instruction set compatibility**: the IBM 360 lesson — ISA 239 + compatibility wins markets. A parallel machine that can't run 240 + existing binaries starts with zero software. 241 + 242 + 4. **Single-thread performance**: for inherently sequential code, one 243 + big fast core beats multiple small slow cores. In 1979, most 244 + programs were deeply sequential. 245 + 246 + ### What Dataflow Changes 247 + 248 + - **No cache coherence needed**: each PE has its own local IRAM and 249 + matching store. Data moves as tokens. There is no shared mutable 250 + state at the PE level (SM handles shared data with its own 251 + synchronisation protocol via I-structure semantics). 252 + 253 + - **Implicit parallelism**: the compiler decomposes the program into a 254 + dataflow graph. Parallelism is inherent in the graph structure. The 255 + hardware handles synchronisation through token matching. No 256 + programmer effort required beyond writing the source code. 257 + 258 + - **Software compatibility via compiler**: an LLVM backend targeting the 259 + dataflow ISA could compile standard C/Rust. The gap between LLVM's 260 + SSA-form IR and a dataflow graph is much smaller than the gap 261 + between 1979-era C compilers and dataflow assembly. 262 + 263 + - **Latency tolerance**: the PE processes whatever tokens are ready. 264 + If one token is waiting on a slow SM access, the PE works on other 265 + tokens. This is inherent in the execution model — no special 266 + hardware needed. 267 + 268 + ### The Remaining Hard Problem: Compiler Technology 269 + 270 + The biggest genuine blocker in 1979 was compiler technology. Dataflow 271 + compilers need to partition programs into graphs, assign nodes to PEs, 272 + manage context slots, and schedule token routes. 
In 1979, compilers 273 + could barely optimise sequential code. By the mid-1980s, the Manchester 274 + and Monsoon teams had working dataflow compilers, but these were 275 + research efforts, not production tools. 276 + 277 + Today, this is a solvable problem. LLVM already performs sophisticated 278 + dependency analysis, loop vectorisation, and graph-based intermediate 279 + representations. A dataflow backend is substantial but not unreasonable. 280 + 281 + --- 282 + 283 + ## 8. The "Road Not Taken" Argument 284 + 285 + Modern out-of-order superscalar processors are, at their core, dataflow 286 + engines trapped inside a von Neumann straitjacket: 287 + 288 + - **Register renaming** creates unique names for each value — this is 289 + exactly what tagged tokens do in a dataflow machine. 290 + - **Reservation stations** (Tomasulo, 1967) are matching stores: an 291 + instruction waits for its operands to arrive, then fires. 292 + - **The reorder buffer** exists solely to reconstruct sequential 293 + semantics from what is internally dataflow execution. It is the 294 + tax paid for pretending to be von Neumann. 295 + - **Branch prediction** attempts to speculate about the dataflow graph's 296 + structure, because the sequential ISA doesn't encode it. A dataflow 297 + graph has no branches to predict — conditional execution is a SWITCH 298 + node that routes tokens deterministically. 299 + - **Out-of-order execution** discovers at runtime the parallelism that 300 + was always present in the program but obscured by the sequential 301 + instruction stream. A dataflow compiler encodes this parallelism 302 + explicitly. 303 + 304 + A modern high-performance core dedicates ~80–90% of its transistor 305 + budget to the cache hierarchy and the OoO/speculation engine. The 306 + actual ALU is a tiny fraction of the die. A dataflow PE is almost 307 + entirely ALU and matching, because the execution model eliminates the 308 + need for the translation layer. 
309 + 310 + As the memory wall has worsened (DRAM latency ~100–300 cycles on modern 311 + systems vs 1–2 cycles in 1979), the overhead of the von Neumann 312 + translation layer has grown proportionally. The dataflow model's 313 + inherent latency tolerance — process whatever token is ready — becomes 314 + more valuable as memory gets relatively slower. 315 + 316 + This suggests that a dataflow architecture, while perhaps premature in 317 + 1979 due to compiler limitations, might actually age *better* than 318 + von Neumann as the memory wall gets worse. The matching store never 319 + misses. The data arrives when it arrives. The PE does useful work in the 320 + meantime. No cache hierarchy needed in the PE pipeline — just fast 321 + local SRAM for matching and instruction storage, with SM handling the 322 + shared data. 323 + 324 + --- 325 + 326 + ## 9. Scaling Considerations for Integration 327 + 328 + ### 1979–1982: Discrete Logic (Prototype / Low-Volume Minicomputer) 329 + 330 + - 4 PEs on 1–2 large PCBs 331 + - ~170–225 TTL + SRAM chips total 332 + - 5 MHz clock, 2-cycle matching store access 333 + - Shared bus interconnect 334 + - Competitive with a 68000 on parallel workloads 335 + 336 + ### 1983–1985: Single-Chip PE (1.5–2 µm CMOS) 337 + 338 + - PE logic on-chip (~3–5K transistors) 339 + - Matching store metadata on-chip (670-equivalent register file) 340 + - Bulk SRAM external 341 + - 4–8 PEs per board with external SRAM 342 + - 8–10 MHz, single-cycle matching via half-clock or on-chip dual-port 343 + 344 + ### 1986–1988: Multi-PE Chip (1–1.5 µm CMOS) 345 + 346 + - 4–8 PE cores on one die with shared on-chip SRAM 347 + - Wide parallel local interconnect between adjacent PEs (~1 cycle hop) 348 + - External SM SRAM 349 + - 15–20 MHz 350 + - Competitive with Transputer systems at lower per-chip cost 351 + 352 + ### Modern: Many-PE Tile (Sub-100 nm) 353 + 354 + - 64–256+ PEs per die with on-chip SRAM hierarchy 355 + - Network-on-chip interconnect 356 + - 
I-structure SM cache with simplified coherence protocol 357 + (write-once semantics reduce coherence to fill notifications) 358 + - 1+ GHz 359 + - LLVM-based compiler toolchain 360 + 361 + --- 362 + 363 + ## 10. Open Questions 364 + 365 + 1. **SM cache architecture for modern integration**: does the 366 + I-structure write-once semantics enable a dramatically simpler 367 + coherence protocol than MESI? What does the cache hierarchy look 368 + like for SM in a many-PE chip? 369 + 370 + 2. **Compiler partitioning strategy**: how does matching store size 371 + (context slots × entries) interact with compiler code generation? 372 + What's the minimum matching store that supports "most" programs 373 + without excessive function splitting? 374 + 375 + 3. **Sequential performance floor**: what is the minimum acceptable 376 + single-thread performance for a dataflow PE, and how does the 377 + strongly connected block mechanism close the gap with conventional 378 + cores? See companion pipelining document. 379 + 380 + 4. **Network topology at scale**: at what PE count does the shared bus 381 + become inadequate, and what's the right topology for 16–64 PEs? 382 + Ring, mesh, omega network, or hierarchical?
+162 -228
design-notes/io-and-bootstrap.md
··· 1 1 # Dynamic Dataflow CPU — I/O & Bootstrap 2 2 3 - Covers the type-11 subsystem: I/O controller design, peripheral interface, 4 - bootstrap sequence, and the path from microcontroller-assisted bring-up to 5 - self-hosted boot. 3 + Covers IO as memory-mapped SM, the bootstrap sequence via SM00 EXEC, and 4 + the path from microcontroller-assisted bring-up to self-hosted boot. 5 + 6 + See `architecture-overview.md` for module taxonomy and token format. 7 + See `sm-design.md` for SM interface protocol, EXEC operation, and SM00 8 + bootstrap details. 9 + See `bus-architecture-and-width-decoupling.md` for bus width rationale 10 + and IRAM write token format. 11 + 12 + ## IO Model: Memory-Mapped SM 6 13 7 - See `architecture-overview.md` for type-11 packet semantics. 8 - See `network-and-communication.md` for how the I/O controller connects to 9 - the bus. See `bus-architecture-and-width-decoupling.md` for bus width 10 - rationale and flit structure. 14 + IO devices are mapped into SM address space. There is no separate IO 15 + controller or dedicated IO token type — IO operations use standard SM 16 + tokens (bit[15]=1). 11 17 12 - ## Type 11 Subtypes 18 + ### How It Works 13 19 14 - Type 11 is the "system management" channel. the 2-bit subtype field 15 - immediately after the type field discriminates traffic classes: 20 + An SM (typically SM00 at v0) maps IO devices into a reserved address 21 + range. From the CM's perspective, reading from a UART is identical to 22 + reading from any other SM cell: 16 23 17 24 ``` 18 - 11 + 00: I/O operation — routed to the I/O controller 19 - 11 + 01: Extended address / config write — target PE instruction memory, 20 - routing table, or other config registers 21 - 11 + 10: Reserved (future: debug/trace, DMA, performance counters) 22 - 11 + 11: Reserved 25 + READ from IO-mapped address: 26 + 1. CM sends SM READ token to SM00, IO-mapped address range 27 + 2. If IO device has data: cell is FULL, SM returns result immediately 28 + 3. 
If IO device has no data: cell is EMPTY, SM defers read 29 + 4. When IO device receives data: cell transitions to FULL, 30 + deferred read is satisfied, result token emitted 31 + 32 + WRITE to IO-mapped address: 33 + 1. CM sends SM WRITE token to SM00, IO-mapped address range 34 + 2. SM forwards write data to the IO device 23 35 ``` 24 36 25 - All type-11 traffic is low frequency relative to types 00/01/10. it is 26 - acceptable for decode and handling to take extra cycles. 37 + I-structure semantics provide natural interrupt-free IO. The CM does not 38 + poll or busy-wait — it issues a READ, and the result arrives when data is 39 + available. This is the dataflow-native interrupt model: external events 40 + are just writes to SM cells that satisfy deferred reads. 27 41 28 - ## I/O Controller 42 + ### IO Address Mapping 29 43 30 - ### What It Is 44 + IO devices occupy a contiguous range within SM00's address space. The 45 + mapping is configured at bootstrap (or hardwired for v0): 31 46 32 - A fixed-function device on the network (not a PE): it has no matching 33 - store, instruction memory, or ALU. it receives type-11 subtype-00 34 - packets, interprets them as I/O commands, and responds. 47 + ``` 48 + Example SM00 address map: 49 + 0x000 - 0x0FF: IO devices (tier 0, raw memory semantics) 50 + 0x100 - 0x1FF: Bootstrap ROM (tier 0, read-only) 51 + 0x200 - 0x3FF: General-purpose I-structure cells (tier 1) 52 + ``` 35 53 36 - it is also the only network participant that can **spontaneously generate 37 - tokens** without first receiving one. this is how external events (UART 38 - RX, sensor interrupts, timer ticks) enter the dataflow graph. 
54 + Within the IO range, specific addresses map to device registers: 39 55 40 - ### Token Format for I/O (subtype 00, 3-flit) 56 + ``` 57 + Example IO register map (v0, UART only): 58 + 0x000: UART TX data (write) 59 + 0x001: UART RX data (read, defers if no byte available) 60 + 0x002: UART status (read, raw — no deferral) 61 + 0x003: UART config (write) 62 + ``` 41 63 42 - I/O operations use type-11 3-flit tokens. flit 1 carries type + subtype + 43 - routing, flits 2-3 carry the I/O command and data/return routing. 64 + ### Unsolicited Events (Interrupt Equivalent) 44 65 45 - ``` 46 - I/O Request (CM -> I/O controller, 3 flits): 47 - flit 1: [type:2=11][subtype:2=00][device:3][register:4][R/W:1][pad:4] 48 - = 16 bits 49 - flit 2: [data:16] (for writes) or [return routing:16] (for reads) 50 - = 16 bits 51 - flit 3: [extended payload / additional return routing / pad] = 16 bits 66 + When an external event occurs (e.g., UART receives a byte), the IO 67 + hardware writes the data into the corresponding SM cell. If a deferred 68 + read is pending on that cell, the SM satisfies it immediately — the 69 + requesting CM receives a result token with the data. No interrupt 70 + controller, no vector table, no mode switch. 52 71 53 - I/O Response (I/O controller -> CM): 54 - Repackaged as a standard 2-flit type 00 or 01 token addressed to the 55 - requesting CM, using return routing from the request. 56 - ``` 72 + If no deferred read is pending, the data sits in the cell (FULL state) 73 + until a CM reads it. This is natural flow control: the IO device produces, 74 + the CM consumes, with I-structure semantics as the synchronisation 75 + primitive. 57 76 58 - **Open question**: return routing for I/O reads. options: 59 - - (a) I/O read requests carry return routing in flit 2 (same pattern as 60 - SM READ requests — data field is unused on reads). flit 3 provides 61 - additional space if needed. 
62 - - (b) I/O controller has a preconfigured "return to" address per device 63 - (simpler requests, but less flexible) 64 - - (c) I/O controller always returns to a fixed "I/O result handler" node 65 - in a designated PE (simplest, but rigid) 66 - 67 - option (a) is most consistent with how SM works. likely the right call. 68 - the extra flit (3 vs SM's 2) gives more room for return routing and 69 - I/O command fields. 77 + Implications: 78 + - IO devices are **write sources** into SM cells 79 + - Backpressure: if the CM hasn't issued a READ, the IO device can still 80 + write (cell becomes FULL). Subsequent writes before a READ overwrite — 81 + the SM cell is a 1-deep buffer. For deeper buffering, the IO hardware 82 + can use a range of cells as a circular buffer. 83 + - The SM does not spontaneously generate tokens — it only satisfies 84 + pending deferred reads. The IO device's write *triggers* the deferred 85 + read satisfaction, which is when the result token is emitted. 70 86 71 87 ### Hardware 72 88 89 + IO hardware sits on SM00's internal data bus alongside the SRAM banks. 90 + The address decoder routes IO-range addresses to IO device registers 91 + instead of SRAM: 92 + 73 93 ``` 74 - 16-bit Bus 94 + SM00 Internal Bus 75 95 | 76 - v 77 - [Input Deserialiser] 78 - (reassemble 3+ flits into logical token) 79 - | 80 - v 81 - [Input FIFO] 82 - | 83 - v 84 - [Subtype Check] -- not subtype 00? --> discard or forward 85 - | 86 - v 87 - [Device/Register Decode] --- EEPROM or small logic 96 + [Address Decoder] 88 97 | 89 - +---> [UART chip (6850/16550/etc.)] 90 - | - TX data register 91 - | - RX data register 92 - | - Status register 93 - | - Baud/config registers 98 + +---> addr < 0x100? --> [IO Device Registers] 99 + | | 100 + | +---> [UART chip (6850/16550/etc.)] 101 + | +---> [future: SPI, GPIO, timer] 94 102 | 95 - +---> [future: SPI, GPIO, timer, etc.] 
96 - | 97 - v 98 - [Result Formatter] --- constructs 2-flit type 00/01 return token 99 - | 100 - v 101 - [Output Serialiser] 102 - (split token into 16-bit flits) 103 - | 104 - v 105 - [Output FIFO] 106 - | 107 - v 108 - 16-bit Bus (token injected as type 00/01) 103 + +---> addr >= 0x100? --> [SRAM Banks] (normal SM operation) 109 104 ``` 110 105 111 - Estimated hardware: ~15-25 TTL chips + UART chip. comparable to the 112 - microsequencer it replaces, but architecturally integrated. 113 - 114 - ### Unsolicited Token Generation (Interrupt Equivalent) 115 - 116 - When an external event occurs (e.g., UART receives a byte), the I/O 117 - controller generates a token and injects it onto the network. from the 118 - receiving CM's perspective, data just arrived — exactly like any other 119 - token. no interrupt hardware needed on the CM side. 120 - 121 - The destination for unsolicited tokens is preconfigured: either hardcoded 122 - in the I/O controller's EEPROM, or set at bootstrap via a type-11 config 123 - write to the I/O controller itself. "when UART RX fires, send the byte 124 - to PE 2, offset 0x10, context slot 3, port 0." 125 - 126 - This is the **dataflow-native interrupt model**: external events are 127 - token sources. they feed into the dataflow graph at designated entry 128 - points. the receiving PE doesn't need to do anything special — it just 129 - sees a token arrive and processes it like any other. 130 - 131 - Implications: 132 - - the I/O controller is a **source node** in the dataflow graph 133 - - it breaks the invariant that "tokens are only produced in response to 134 - other tokens" — external reality leaks in here 135 - - the network must accept tokens from the I/O controller even when no 136 - request was sent (the I/O controller's output FIFO can fill independently) 137 - - if the destination PE's input FIFO is full, backpressure propagates 138 - to the I/O controller. UART RX bytes could be lost if the system can't 139 - keep up. 
the I/O controller should have a small internal buffer 140 - (or the UART chip's built-in FIFO handles this). 141 - 142 - ## Config Writes (subtype 01) 106 + The IO device registers behave like SM cells from the SM controller's 107 + perspective: they have presence state, support READ/WRITE, and can 108 + satisfy deferred reads. The difference is that the data source is an 109 + external device rather than SRAM. 143 110 144 - ### Purpose 111 + Estimated additional hardware for IO: ~8-12 TTL chips (address decode, 112 + IO device interface, presence state for IO cells) + UART chip. 145 113 146 - Type-11 subtype-01 packets write to PE instruction memory, routing tables, 147 - or other configuration state. they are the mechanism for: 114 + --- 148 115 149 - 1. Bootstrap program loading 150 - 2. Runtime reprogramming (future) 151 - 3. Routing table configuration 116 + ## IRAM Writes 152 117 153 - ### Packet Format (3-4 flits) 118 + IRAM writes use CM misc-bucket tokens (prefix 011+01). They are addressed 119 + to a specific PE and carry instruction word data. See 120 + `bus-architecture-and-width-decoupling.md` for the IRAM write token format 121 + and valid-bit protection protocol. 154 122 155 - Config writes are inherently multi-flit: they must carry a target PE, 156 - IRAM address, and an instruction word that is 32-48 bits wide (decoupled 157 - from bus width). A 48-bit instruction word requires 3 data flits plus 158 - the routing flit. 
123 + IRAM writes can originate from: 124 + - An external microcontroller (development/prototyping) 125 + - SM00's EXEC sequencer during bootstrap (reading pre-formed IRAM write 126 + tokens from ROM) 127 + - A CM running a loader program (runtime code loading) 159 128 160 - ``` 161 - Config Write (IRAM load, 3-4 flits): 162 - flit 1: [type:2=11][subtype:2=01][target_PE:2][flags:2][addr:8] = 16 bits 163 - flit 2: [instruction_word_hi:16] = 16 bits 164 - flit 3: [instruction_word_lo:16] = 16 bits 165 - flit 4: [instruction_word_ext:16] (if IRAM > 32 bits) = 16 bits 129 + The network routes IRAM writes like any other CM token — bit[15]=0, PE_id 130 + in the appropriate flit 1 position. The target PE recognises the 011+01 131 + prefix and routes the token to the instruction memory write port. 166 132 167 - Routing table config write (3 flits): 168 - flit 1: [type:2=11][subtype:2=01][target_node:4][config_type:8] = 16 bits 169 - flit 2: [table_entry / prefix data] = 16 bits 170 - flit 3: [routing data] = 16 bits 171 - ``` 133 + --- 172 134 173 - The exact bit allocation depends on IRAM width (see 174 - `bus-architecture-and-width-decoupling.md`). 48-bit IRAM = 4 flits per 175 - config write; 32-bit IRAM = 3 flits. Config writes are rare (bootstrap 176 - and hot-patching only), so the extra flits are acceptable. 135 + ## Bootstrap Sequence 177 136 178 - ### Routing 137 + ### Self-Hosted Bootstrap (via SM00 EXEC) 179 138 180 - Config writes are addressed to a specific PE by the target_PE field. the 181 - routing network delivers them like any other token — type 11 is inspected 182 - by routing nodes only to the extent of "this is not type 00/01/10, forward 183 - toward the target." the target PE recognises the subtype-01 packet and 184 - routes it to the instruction memory write port (see `pe-design.md`). 139 + On system reset: 185 140 186 - Routing tables themselves can be written via config writes. the target is 187 - a routing node, not a PE. 
routing nodes need a small amount of config 188 - write handling: recognise "this config write is for me" (based on node ID) 189 - and update the local prefix table. during bootstrap, routing nodes are in 190 - default mode (fixed-address routing), so config writes reach them reliably 191 - without needing configured routing. 141 + 1. SM00 is wired to the reset signal 142 + 2. SM00's sequencer triggers EXEC on a predetermined ROM base address 143 + 3. The ROM region contains pre-formed tokens stored as 2-cell entries: 144 + - IRAM write tokens (prefix 011+01) to load PE instruction memories 145 + - Seed tokens (dyadic wide, monadic normal) to start execution 146 + 4. SM00 reads each 2-cell entry and emits it as a 2-flit token on the bus 147 + 5. PEs receive IRAM writes and load their instruction memories 148 + 6. Seed tokens fire and execution begins 192 149 193 - ## Bootstrap Sequence 150 + The program image in ROM is a flat sequence of pre-formed token pairs. 151 + The compiler outputs this format directly — each entry is (flit 1, flit 2) 152 + as stored in two consecutive SM cells. SM00's EXEC sequencer is just an 153 + address counter with a limit comparator; it does not interpret the tokens 154 + it emits. 194 155 195 156 ### Development / Early Prototyping 196 157 197 158 For Phase 0-2, an external microcontroller (RP2040, Arduino) acts as the 198 - bootstrap source. it is NOT part of the architecture — it's a test fixture. 159 + bootstrap source. It is NOT part of the architecture — it is a test 160 + fixture. 199 161 200 162 The microcontroller: 201 - 1. Formats type-11 subtype-01 multi-flit packets (config writes: 3-4 202 - flits per instruction word depending on IRAM width) 163 + 1. Formats IRAM write tokens (prefix 011+01, 2-3 flits per instruction) 203 164 2. Injects flits into the 16-bit bus (via a dedicated injection port or 204 165 by bit-banging the bus interface) 205 166 3. Writes instruction words to each PE's instruction memory 206 - 4. 
Optionally writes routing table entries to routing nodes 207 - 5. Optionally writes initial SM contents via 2-flit type-10 packets 208 - 6. Injects seed token(s) — 2-flit type 00/01 packets that kick off 209 - execution 210 - 7. Releases the bus (goes high-impedance or disconnects) 167 + 4. Optionally writes initial SM contents via SM tokens (bit[15]=1) 168 + 5. Injects seed tokens to start execution 169 + 6. Releases the bus (goes high-impedance or disconnects) 211 170 212 - This lets PE and SM hardware be tested without any of the I/O controller 213 - or bootstrap logic existing. the microcontroller is the bootstrap, the 214 - debug interface, and the test harness all in one. 171 + This lets PE and SM hardware be tested without bootstrap ROM or EXEC 172 + sequencer existing. The microcontroller is the bootstrap, the debug 173 + interface, and the test harness all in one. 215 174 216 - ### Self-Hosted Bootstrap (Phase 4+) 175 + ### Routing During Bootstrap 217 176 218 - The I/O controller replaces the microcontroller as the bootstrap source: 219 - 220 - 1. On reset, the I/O controller enters bootstrap mode 221 - 2. It reads program data from a connected flash/EEPROM (via SPI or 222 - parallel interface) or receives it over UART from an external host 223 - 3. It formats config write packets (type-11 subtype-01) and injects them 224 - onto the network 225 - 4. Each PE receives config writes and loads its instruction memory 226 - 5. Routing tables are configured via config writes to routing nodes 227 - 6. I/O controller injects seed token(s) to start execution 228 - 7. I/O controller transitions to normal mode (handling I/O requests) 229 - 230 - The I/O controller's bootstrap logic is a state machine, likely driven 231 - by a small ROM or EEPROM. it doesn't need to be a general-purpose 232 - processor — it just sequences reads from storage and formats them as 233 - config writes. 
234 - 235 - ### Chicken-and-Egg: Routing During Bootstrap 236 - 237 - During bootstrap, routing tables are not yet configured. the network uses 177 + During bootstrap, routing tables are not yet configured. The network uses 238 178 fixed-address default routing (see `network-and-communication.md`): 239 179 240 180 - Each PE has a unique ID (EEPROM / DIP switches) 241 - - Routing nodes forward by PE_id without consulting tables 181 + - Routing nodes forward by destination ID without consulting tables 242 182 - At v0 scale (shared bus), this is trivially true — everything sees 243 183 everything 244 - - At larger scale, default routing must be sufficient to reach all PEs 245 - from the bootstrap source. this constrains the physical topology 246 - (bootstrap source must be topologically reachable from all PEs via 247 - default forwarding). 248 - 249 - The I/O controller's own ID and the default routing path to it are 250 - hardwired or EEPROM-configured. it doesn't depend on routing tables. 184 + - SM00's tokens reach all PEs via default routing since SM00 is on the 185 + shared bus 251 186 252 187 ### Seed Token Injection 253 188 254 - After program loading, the I/O controller (or microcontroller) injects 255 - one or more seed tokens to start execution. these are normal type 00/01 256 - tokens addressed to the entry point(s) of the loaded program. 189 + The bootstrap ROM includes seed tokens after the IRAM write tokens. These 190 + are standard CM tokens addressed to the entry point(s) of the loaded 191 + program. SM00's EXEC sequencer emits them in order after the IRAM writes — 192 + no special handling needed. 257 193 258 194 For a simple program with one entry point: one seed token to one PE. 259 195 For a program with multiple independent entry points (e.g., main program 260 - + I/O handler): multiple seed tokens to different PEs. 196 + + IO handler): multiple seed tokens to different PEs. 
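The EXEC behaviour described above can be sketched in a few lines of Python (illustrative — the function name and ROM contents are assumptions, but the structure follows the "address counter with a limit comparator" description):

```python
# Sketch of SM00's EXEC sequencer: walk `count` two-cell ROM entries
# starting at `base` and emit each as a 2-flit token, uninterpreted.
# IRAM write tokens and seed tokens look identical at this level.
def exec_sequencer(rom, base, count):
    for i in range(count):
        yield (rom[base + 2 * i], rom[base + 2 * i + 1])

# ROM holds pre-formed (flit 1, flit 2) pairs; values are placeholders.
rom = [0x6C01, 0x1234,   # e.g. an IRAM write token pair
       0x0123, 0x0001]   # e.g. a seed token pair
assert list(exec_sequencer(rom, base=0, count=2)) == [
    (0x6C01, 0x1234), (0x0123, 0x0001)]
```

Because the sequencer never interprets what it emits, the same loop covers program loading and seed injection — the ordering in ROM is the whole bootstrap policy.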
261 197 262 - The seed tokens are part of the program image — the compiler specifies 263 - "to start this program, inject these tokens." the bootstrap loader reads 264 - them from the program image and injects them after loading is complete. 198 + --- 265 199 266 200 ## Layering Summary 267 201 268 - The I/O and bootstrap design is explicitly layered for incremental 269 - development: 270 - 271 - | Phase | Bootstrap Source | I/O | Network Config | 202 + | Phase | Bootstrap Source | IO | Network Config | 272 203 |-------|-----------------|-----|----------------| 273 204 | 0-1 | Microcontroller (external) | None | Direct SRAM programming | 274 - | 2 | Microcontroller via type-11 | None | Config writes on bus | 275 - | 3 | Microcontroller via type-11 | Polling via SM (optional) | Config writes | 276 - | 4 | I/O controller (self-hosted) | I/O controller (type 11) | Config writes from I/O controller | 205 + | 2 | Microcontroller via IRAM writes | None | IRAM writes on bus | 206 + | 3 | Microcontroller via IRAM writes | Polling via SM (optional) | IRAM writes | 207 + | 4 | SM00 EXEC (self-hosted) | Memory-mapped SM00 | IRAM writes from EXEC | 277 208 278 - Each phase adds capability without redesigning previous work. the key 279 - enabler is that config writes (type-11 subtype-01) work the same whether 280 - they come from a microcontroller or the I/O controller. the network 281 - doesn't know or care about the source. 209 + Each phase adds capability without redesigning previous work. The key 210 + enabler is that IRAM write tokens work the same whether they come from a 211 + microcontroller or SM00's EXEC sequencer. The network doesn't know or care 212 + about the source. 213 + 214 + --- 282 215 283 216 ## Open Design Questions 284 217 285 - 1. **I/O return routing** — option (a), (b), or (c) from above? 286 - 2. **Unsolicited token destination config** — hardcoded or runtime- 287 - configurable? if configurable, via what mechanism? 
(probably a 288 - type-11 config write to the I/O controller itself) 289 - 3. **I/O controller bootstrap ROM** — how big? what's in it? just a 290 - state machine for "read flash, emit config writes" or something more? 291 - 4. **Flash/EEPROM interface** — SPI? parallel? what storage device? 292 - 5. **Program image format** — what does the compiler output? a stream of 293 - (target_PE, address, instruction_word) tuples? plus seed tokens at 294 - the end? 295 - 6. **Multiple I/O devices** — how does the device field in the I/O token 296 - scale? 3 bits = 8 devices. enough? 297 - 7. **I/O controller as bootstrap PE** — at what point (if ever) does it 298 - make sense to make the I/O controller a full PE with a boot ROM 299 - instead of fixed-function? probably not for v0-v4, but worth keeping 300 - in mind architecturally. 218 + 1. **IO address range size** — how many cells reserved for IO in SM00? 219 + Depends on device count and register depth per device. 220 + 2. **IO cell buffering depth** — single cell (1-deep) per IO register, or 221 + a range of cells for circular buffering? 1-deep is simplest; circular 222 + buffer needs SM-side write pointer management. 223 + 3. **IO write-overwrite semantics** — if an IO device writes to a FULL 224 + cell (previous data not yet consumed), overwrite or error? Overwrite 225 + is simpler but loses data. Error needs a signalling mechanism. 226 + 4. **Flash/EEPROM interface for ROM** — SPI? Parallel? What storage device? 227 + ROM could be physical ROM chips on SM00's address bus, or flash accessed 228 + via page register. 229 + 5. **Program image format** — flat sequence of (flit1, flit2) token pairs 230 + in ROM. Needs a terminator or length prefix so EXEC knows when to stop. 231 + Length is the EXEC count parameter. 232 + 6. **SM00 further specialisation** — documented as an option (see 233 + `sm-design.md`). Not committed for v0. Standard SM opcodes are 234 + sufficient for basic IO via I-structure semantics.
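The deferred-read protocol described in this file can be modelled as a tiny Python sketch of one I-structure cell (illustrative only — class and method names are not the emulator's; this assumes the 1-deep-buffer overwrite semantics discussed in question 3):

```python
# Minimal model of a memory-mapped IO cell with EMPTY/FULL presence
# state and deferred reads. Names are illustrative, not emulator API.
class IStructureCell:
    def __init__(self):
        self.full = False
        self.value = None
        self.deferred = []          # pending read return targets

    def read(self, return_target):
        """SM READ: return a result token if FULL, otherwise defer."""
        if self.full:
            return [(return_target, self.value)]
        self.deferred.append(return_target)
        return []                   # no token yet; CM waits, no polling

    def write(self, value):
        """Write from an IO device (or a CM): satisfy deferred reads.
        A write to a FULL cell overwrites — the cell is a 1-deep buffer."""
        self.value = value
        self.full = True
        out = [(t, value) for t in self.deferred]
        self.deferred.clear()
        return out

uart_rx = IStructureCell()
assert uart_rx.read("pe2:ctx3") == []                # defers: no byte yet
assert uart_rx.write(0x41) == [("pe2:ctx3", 0x41)]   # RX byte satisfies it
```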
+38 -38
design-notes/pe-design.md
··· 75 75 ``` 76 76 Stage 1: TOKEN INPUT 77 77 - Receive reassembled token from input deserialiser 78 - - Classify: type 00/01 (normal), type 11 subtype 01 (config write) 79 - - Normal tokens -> pipeline FIFO 80 - - Config writes -> instruction memory write port (stalls pipeline) 78 + - Classify by prefix: dyadic wide (00), monadic normal (010), 79 + misc bucket (011). Within misc bucket: dyadic narrow (sub=00), 80 + IRAM write (sub=01), monadic inline (sub=10) 81 + - Compute/data tokens -> pipeline FIFO 82 + - IRAM writes -> instruction memory write port (stalls pipeline) 81 83 - Buffer in small FIFO (8-deep, storing reassembled tokens) 82 84 - ~1K transistors (flip-flops) or use small SRAM 83 85 84 86 Stage 2: MATCH / BYPASS 85 - - Type 00 (dyadic): direct-index into context slot array 87 + - Dyadic (prefix 00 or 011+00): direct-index into context slot array 86 88 - Check generation counter: mismatch = stale, discard 87 89 - First operand: store in slot, advance to wait state 88 90 - Second operand: read partner from slot, both proceed 89 - - Type 01 (monadic): bypass matching entirely, proceed directly 91 + - Monadic (prefix 010 or 011+10): bypass matching, proceed directly 90 92 - Single cycle for all cases (no hash path, no CAM search — 91 93 direct indexing only, see matching store section below) 92 94 - Estimated: ~200-300 transistors + SRAM ··· 165 167 166 168 ### Runtime Writability 167 169 168 - Instruction memory is **not** read-only. It is writable from the network via type-11 subtype-01 (config/extended address) packets. This serves two purposes: 170 + Instruction memory is **not** read-only. It is writable from the network via IRAM write (prefix 011+01) packets. This serves two purposes: 169 171 170 172 1. **Bootstrap**: loading programs before execution starts 171 173 2. 
**Runtime reprogramming**: loading new function bodies while other PEs continue executing (future capability, not needed for v0) ··· 186 188 187 189 #### IRAM Width 188 190 189 - | Field | Purpose | Bits (est.) | 190 - |-------|---------|-------------| 191 - | Opcode | ALU/control operation | 5-8 | 192 - | Dest 1 addr | first output instruction address | 10-12 | 193 - | Dest 1 port | L/R input on destination | 1 | 194 - | Dest 2 addr | second output instruction address (fan-out) | 10-12 | 195 - | Dest 2 port | L/R input on destination | 1 | 196 - | Dest 2 PE | remote PE flag + PE_id for cross-PE outputs | 0-3 | 197 - | Arity | monadic/dyadic (or encoded in opcode) | 0-1 | 198 - | Flags | immediate mode, structure op, etc. | 2-4 | 199 - | Immediate | small constant for immediate-mode ops | 0-8 | 191 + | Field | Purpose | Bits (est.) | 192 + | ----------- | ------------------------------------------- | ----------- | 193 + | Opcode | ALU/control operation | 5-8 | 194 + | Dest 1 addr | first output instruction address | 10-12 | 195 + | Dest 1 port | L/R input on destination | 1 | 196 + | Dest 2 addr | second output instruction address (fan-out) | 10-12 | 197 + | Dest 2 port | L/R input on destination | 1 | 198 + | Dest 2 PE | remote PE flag + PE_id for cross-PE outputs | 0-3 | 199 + | Arity | monadic/dyadic (or encoded in opcode) | 0-1 | 200 + | Flags | immediate mode, structure op, etc. | 2-4 | 201 + | Immediate | small constant for immediate-mode ops | 0-8 | 200 202 201 203 This sums to roughly **32-48 bits** depending on address space size and 202 204 how aggressively fields are packed. 48 bits (3 x 16-bit SRAM words) is a ··· 207 209 lines. Width costs physical chips but does NOT add pipeline latency. 208 210 209 211 **Instruction words are never serialised onto the external bus** during 210 - normal execution. They are only written via type-11 config packets during 212 + normal execution. 
They are only written via IRAM write packets during 211 213 program loading (slow-path, multi-flit operation). The opcode is an 212 214 implementation detail of the PE — from the network's perspective, it 213 215 doesn't exist. ··· 313 315 **Candidate configurations (targeting clean SRAM utilisation):** 314 316 315 317 ``` 316 - Config A: 32 slots x 16 entries = 512 cells 317 - - 9-bit SRAM address (5 ctx + 4 offset) 318 - - 16-bit values: 1KB exactly in one 8Kbit SRAM chip 319 - - Good concurrency headroom, smaller function chunks 320 - 321 318 Config B: 16 slots x 32 entries = 512 cells 322 319 - 9-bit SRAM address (4 ctx + 5 offset) 323 320 - 16-bit values: 1KB exactly 324 - - Matches current 4-bit ctx_slot token format 325 - - Fewer concurrent activations, bigger function chunks 321 + - Matches the 4-bit ctx and 5-bit offset in dyadic wide tokens 322 + - 32 dyadic instructions per context chunk 326 323 327 324 Config C: 32 slots x 32 entries = 1024 cells 328 325 - 10-bit SRAM address (5 ctx + 5 offset) 329 326 - 16-bit values: 2KB, fits in one 16Kbit SRAM chip 327 + - Requires 5-bit ctx (would need wider token format in future) 330 328 - Most headroom in both dimensions 331 - - Probably the sweet spot for SRAM utilisation vs headroom 332 - 333 - Config D: 64 slots x 16 entries = 1024 cells 334 - - 10-bit SRAM address (6 ctx + 4 offset) 335 - - 16-bit values: 2KB 336 - - Favours concurrency over function chunk size 337 - - 64 concurrent activations likely overkill for v0 but future-proof 338 329 ``` 339 330 340 331 **Entry format (per context slot, per instruction offset):** ··· 369 360 > measuring actual concurrent activation counts and dyadic instruction 370 361 > density per PE. 371 362 372 - **Recommendation for v0**: start with Config B (16 slots x 32 entries = 1KB) to match the current 4-bit ctx_slot token field. Upgrade to Config C (32 x 32 = 2KB, needs 5-bit ctx_slot) if 16 concurrent activations proves too tight. 
The physical SRAM chip doesn't change between these configs — just the address generation logic. 363 + **Recommendation for v0**: Config B (16 slots x 32 entries = 1KB). This 364 + matches the token format directly: the 4-bit ctx and 5-bit offset both 365 + come from flit 1. Matching store address = `[ctx:4][offset:5]` = 9 bits. The 366 + SRAM chip doesn't change if upgraded to Config C later — only the address 367 + generation logic changes, and the token format needs a wider ctx field. 373 368 374 369 ### Instruction Address vs Matching Store Address 375 370 ··· 381 376 The matching store is dense with respect to dyadic instructions — no 382 377 gaps for monadic instructions. A function chunk with 20 instructions, 383 378 8 of which are dyadic, uses 8 matching store entries, not 20. 379 + 380 + Dyadic instructions pack into IRAM offsets 0-31 (matching the 5-bit 381 + offset field). Monadic instructions use offsets above the dyadic ceiling, 382 + addressed by the 7-bit monadic offset field relative to the ceiling. 383 + See `bus-architecture-and-width-decoupling.md` for the relative addressing 384 + scheme. 384 385 385 386 ### Unified Offset: Instruction Address = Matching Store Entry 386 387 ··· 529 530 SM can hold code or data — there is no architectural distinction. The 530 531 compiler or loader writes code pages into SM addresses during bootstrap 531 532 or at runtime. A PE (or the I/O controller) reads code from SM and 532 - reformats it as type-11 config writes to load into the target PE's 533 - IRAM. 533 + reformats it as IRAM write tokens to load into the target PE's IRAM. 534 534 535 535 SM capacity is not enough for more than a small program's code, so 536 536 external addressable code storage is required for anything nontrivial. ··· 543 543 assigns a PE (typically the least-utilized one) to pull instruction 544 544 pages from storage and load them onto the bus in advance of when they're 545 545 needed. 
This is part of the program graph itself — the loader PE runs 546 - dataflow code that reads from SM or external storage and emits type-11 546 + dataflow code that reads from SM or external storage and emits IRAM write 547 547 config writes. 548 548 549 549 This fits naturally into the dataflow paradigm: ··· 664 664 every PE's IRAM. that's one approach but costs a lot of memory. 665 665 666 666 The middle ground is a **working set model**: keep hot function bodies 667 - loaded, swap cold ones via type-11 config writes when the scheduler 667 + loaded, swap cold ones via IRAM write tokens (prefix 011+01) when the scheduler 668 668 wants to place an activation on a PE that doesn't have the code yet. 669 669 this is demand paging for instruction memory, and the code storage 670 670 hierarchy (external → SM → IRAM) provides the backing store. ··· 679 679 overwriting IRAM. throttle stalls new activations for that fragment, 680 680 existing ones complete, then overwrite. coarse-grained context switch. 681 681 682 - The hardware path is already there — writable IRAM + type-11 config 682 + The hardware path is already there — writable IRAM + IRAM write 683 683 writes + throttle + code storage hierarchy. the missing piece is the 684 684 scheduler, which is a software/firmware problem. nothing in the v0 685 685 hardware prevents adding this later.
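The Config B address arithmetic recommended above can be sanity-checked in a few lines. A minimal sketch (Python, in the spirit of the behavioural emulator; `match_addr` is an illustrative name, not part of the emulator's API):

```python
# Sanity check of the Config B matching store address arithmetic.
CTX_BITS = 4      # ctx field from flit 1
OFFSET_BITS = 5   # dyadic offset field from flit 1

def match_addr(ctx: int, offset: int) -> int:
    """Pack [ctx:4][offset:5] into the 9-bit matching store address."""
    assert 0 <= ctx < (1 << CTX_BITS)
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (ctx << OFFSET_BITS) | offset

cells = 1 << (CTX_BITS + OFFSET_BITS)
assert cells == 512                        # 16 slots x 32 entries
assert cells * 2 == 1024                   # 16-bit values -> exactly 1 KB
assert match_addr(0b1111, 0b11111) == 511  # top cell of the store
```

Upgrading to Config C is, as noted, just a wider ctx field: change `CTX_BITS` to 5 and the same packing yields a 10-bit address.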
+411
design-notes/pe-pipelining-and-multiplexing.md
··· 1 + # PE Pipelining and Metadata/SC Register Multiplexing 2 + 3 + Design notes on pipelining the PE's token processing path, the matching 4 + store read-modify-write timing problem, and a dual-use 74LS670-based 5 + subsystem that serves as both a matching store metadata cache (dataflow 6 + mode) and an SC block register file (sequential mode). 7 + 8 + --- 9 + 10 + ## 1. Current Pipeline Stages (Unpipelined Baseline) 11 + 12 + The PE processes one token to completion before accepting the next: 13 + 14 + ``` 15 + Stage 1: INPUT Deserialise flits from bus → internal registers 16 + Stage 2: MATCH Matching store R/M/W (read occupied, store/retrieve operand) 17 + Stage 3: IFETCH IRAM read (node address → opcode + destinations) 18 + Stage 4: EXECUTE ALU operation (data from match + control from IRAM) 19 + Stage 5: OUTPUT Form result token, serialise to flits, inject onto bus 20 + ``` 21 + 22 + Unpipelined throughput: ~8–10 clocks per token (all stages sequential). 23 + This is the baseline against which pipeline improvements are measured. 24 + 25 + --- 26 + 27 + ## 2. The Matching Store Timing Problem 28 + 29 + ### The Read-Modify-Write Requirement 30 + 31 + Matching store access (stage 2) requires: 32 + 33 + 1. Read the cell at `[ctx_slot : match_entry]` 34 + 2. Check metadata: occupied bit, generation counter, port flag 35 + 3. **If occupied** (second operand arriving): read stored value, clear 36 + occupied bit → proceed to instruction fetch with both operands 37 + 4. **If not occupied** (first operand): write incoming value, set 38 + occupied bit, record port → consume token, advance to next input 39 + 40 + This is an atomic read-modify-write. With single-ported SRAM (the 2114 41 + at 200 ns access time in a 200 ns clock period at 5 MHz), only one 42 + operation — read OR write — fits in a single cycle. 
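The stage-2 decision can be sketched behaviourally (Python, in the style of the existing emulator; the class and method names here are illustrative assumptions, not the emulator's actual API):

```python
from typing import Dict, Optional, Tuple

class MatchingStore:
    """Behavioural model of the stage-2 read-modify-write decision."""

    def __init__(self) -> None:
        # addr -> (value, port, gen); a key being present = occupied bit set
        self.cells: Dict[int, Tuple[int, int, int]] = {}

    def present(self, addr: int, value: int, port: int, gen: int) -> Optional[Tuple[int, int]]:
        """Present one operand at a matching store address.

        Returns (left, right) when the pair completes, or None when the
        token is absorbed as a waiting first operand.
        """
        if addr in self.cells:                           # occupied: second operand
            stored, s_port, s_gen = self.cells.pop(addr) # read + clear occupied
            assert s_gen == gen, "generation mismatch"
            assert s_port != port, "same port twice"
            return (stored, value) if s_port == 0 else (value, stored)
        self.cells[addr] = (value, port, gen)            # empty: store first operand
        return None

ms = MatchingStore()
assert ms.present(0x042, 7, port=0, gen=1) is None    # first operand waits
assert ms.present(0x042, 5, port=1, gen=1) == (7, 5)  # second completes the pair
```

In hardware the `if addr in self.cells` branch is the single-cycle read and the two arms are the write-back, which is exactly the atomicity the single-ported 2114 cannot provide in one cycle.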
43 + 44 + ### The Pipeline Hazard 45 + 46 + If two consecutive tokens target the same matching store cell (both 47 + operands for the same dyadic instruction arriving back-to-back), the 48 + second token's read must see the first token's write. With a pipelined 49 + matching stage, this is a classic RAW (read-after-write) hazard. 50 + 51 + Note: this hazard is **statistically uncommon** in dataflow execution. 52 + Two operands arriving back-to-back means both sides of a computation 53 + completed near-simultaneously, which requires coincidental timing. The 54 + two operands typically originate from different parts of the graph with 55 + different latencies. The bypass path is cheap insurance that fires 56 + infrequently. 57 + 58 + --- 59 + 60 + ## 3. Pipeline Solutions by Aggressiveness 61 + 62 + ### Option 1: 2-Cycle Matching (Conservative, 1979-Safe) 63 + 64 + Accept that matching takes 2 clock cycles: read in cycle 1, decision + 65 + write in cycle 2. Pipeline everything else around it. 66 + 67 + ``` 68 + Clock: 1 2 3 4 5 6 7 8 69 + Token A: IN MA1 MA2 IF EX OUT 70 + Token B: IN --- MA1 MA2 IF EX OUT 71 + ``` 72 + 73 + Throughput: one token per ~3 clocks (match is the bottleneck at 2 74 + cycles; other stages overlap). This is roughly a 3× improvement over 75 + the unpipelined baseline. 76 + 77 + Hardware cost: minimal — just pipeline registers between stages. 78 + ~500 extra transistors. 79 + 80 + **This is the recommended v0 approach for the discrete build at 5 MHz 81 + with 2114 SRAM.** 82 + 83 + ### Option 2: Half-Clock Split (1981+ with Faster SRAM) 84 + 85 + Split the matching store cycle into two half-clock phases: 86 + - Rising edge: address stable, read data appears (~100 ns) 87 + - Mid-cycle: decision logic resolves 88 + - Falling edge: write-back commits 89 + 90 + Requires SRAM access time < half the clock period. 
At 5 MHz (200 ns 91 + period), this needs <100 ns SRAM — beyond the first 6116 grades 92 + (120–150 ns, ~1981) unless the clock is relaxed slightly; comfortable with 6264 (70–100 ns) from 1983. 93 + 94 + Not feasible with 200 ns 2114s at 5 MHz. 95 + 96 + Throughput: one token per ~1–2 clocks through matching. With overlap, 97 + ~5 stages at 1 clock each = **one token per clock** in steady state. 98 + 99 + ### Option 3: Forwarding / Bypass Path 100 + 101 + Detect when the incoming token targets the same matching store address 102 + as a token currently in the pipeline. Forward the in-flight data 103 + directly instead of reading stale SRAM. 104 + 105 + Hardware: address comparator (2 × 74LS85 or 1 × 74LS688) + data mux 106 + (74LS157). ~4–6 chips per PE. 107 + 108 + Useful in combination with options 1 or 2 to handle the rare 109 + back-to-back hazard case. Not a standalone solution. 110 + 111 + ### Option 4: Metadata Cache with 74LS670 (Recommended Enhancement) 112 + 113 + Separate the metadata (occupied, port, generation — 4 bits per entry) 114 + from the bulk operand data (16 bits per entry). Store metadata for 115 + recently-accessed entries in 74LS670 register file chips, which support 116 + **simultaneous read and write to different addresses** at ~20–24 ns. 117 + 118 + The SRAM holds the authoritative copy of all metadata. The 670s act as 119 + a write-through cache for the hot working set. On a cache hit, the 120 + dual-port 670 provides single-cycle read-modify-write with no timing 121 + gymnastics. On a cache miss, fall back to 2-cycle SRAM access. 122 + 123 + See section 5 for the full design, which combines this metadata cache 124 + with the SC block register file. 125 + 126 + --- 127 + 128 + ## 4. SRAM Width and the 2114 Constraint 129 + 130 + ### The Width Problem 131 + 132 + The 2114 is 1K × 4 bits. For 16-bit operand data, 4 chips must be 133 + paralleled (same address lines, each contributing 4 bits). This is 134 + straightforward — it's just wiring.
135 + 136 + The issue is that each chip has 1024 locations but we may use only 256 137 + (for the 8 ctx × 32 entry configuration). The unused capacity is 138 + wasted. However, the chip count is determined by **width**, not depth — 139 + you need 4 chips regardless of whether you use 256 or 1024 locations. 140 + 141 + ### Matching Store Sizing 142 + 143 + | Config | Cells | Address bits | Data SRAM (16-bit) | Dyadic limit | 144 + |--------|-------|--------------|--------------------|-------------| 145 + | 16 ctx × 32 entries | 512 | 9 (4+5) | 4 × 2114 | 32 per chunk | 146 + | 8 ctx × 32 entries | 256 | 8 (3+5) | 4 × 2114 | 32 per chunk | 147 + | 8 ctx × 16 entries | 128 | 7 (3+4) | 4 × 2114 | 16 per chunk | 148 + 149 + All three use the same 4 × 2114 for data — the chip count doesn't 150 + change. The 8 × 32 configuration (256 cells, 8-bit address) is 151 + recommended as the v0 starting point: sufficient concurrency for 152 + static PE assignment with 4 PEs, and 32 dyadic instructions per 153 + function chunk avoids forced mid-computation splits. 154 + 155 + ### Implications for IRAM 156 + 157 + 32 dyadic entries implies ~55–80 total instructions per function chunk 158 + (assuming 40–60% dyadic ratio). IRAM should be sized for 128–256 159 + entries at 24-bit width: 160 + 161 + - 128 entries × 24 bits: 6 × 2114 (width sets the chip count: six 4-bit chips, using only 128 of 1024 locations) 162 + - 256 entries × 24 bits: 6 × 2114 (or reorganise encoding to 32-bit 163 + and use 4 × 2114 with two 16-bit accesses per entry) 164 + 165 + --- 166 + 167 + ## 5. The Shared 670 Subsystem: Metadata Cache + SC Register File 168 + 169 + ### Core Insight 170 + 171 + During **dataflow mode**, the PE uses matching store metadata constantly 172 + but the SC register file is idle (no SC block executing). During **SC 173 + mode**, the PE uses the register file constantly but the matching store 174 + is idle (SC block has exclusive PE access; no tokens enter matching).
175 + 176 + These two functions can share the same physical 74LS670 chips, 177 + repurposed by a mode switch. 178 + 179 + ### Metadata Format 180 + 181 + Per matching store entry (4 bits — exact 670 word width): 182 + 183 + ``` 184 + Bit 3: occupied (has a first operand been stored?) 185 + Bit 2: port (was it left or right operand?) 186 + Bit 1: gen_hi (generation counter high bit) 187 + Bit 0: gen_lo (generation counter low bit) 188 + ``` 189 + 190 + Per context slot, the generation counter can optionally be stored 191 + separately (in the per-slot portion of the 670 address space) or 192 + replicated per-entry for finer-grained invalidation. 193 + 194 + ### Configuration: 8 × 74LS670 195 + 196 + 8 chips in parallel, each providing 4 words × 4 bits. Addressing: 197 + 198 + - 2 address bits select the word within each chip (4 words) 199 + - chip-select logic maps the entry address across the 8 chips 200 + 201 + Total capacity: 8 chips × 4 words = 32 entries of 4-bit metadata. 202 + This is exactly one full context slot (32 entries), or a 32-entry 203 + write-through cache window into the 256-entry matching store. 204 + 205 + In **SC mode**, the same 8 chips provide: 206 + 207 + - 32 words × 4 bits = 32 × 4 = 128 bits total 208 + - Interpreted as: **8 registers × 16 bits** (4 chips per register 209 + for width) 210 + 211 + This gives 8 SC registers of 16 bits each — sufficient for most 212 + short SC blocks (the EM-4 had 16 but targeted longer blocks; 8 213 + covers the common case of 4–8 instruction sequential regions). 214 + 215 + ### The Predicate Slice 216 + 217 + One of the 8 chips can be **permanently dedicated as a predicate 218 + register** rather than participating in the cache/register-file 219 + multiplexing. 
This provides: 220 + 221 + - 4 entries × 4 bits = 16 predicate bits, always available 222 + - Useful for: conditional token routing (SWITCH), loop termination 223 + flags, SC block branch conditions, I-structure status flags 224 + - Does not reduce the register file capacity significantly: 225 + 7 remaining chips still provide 7 registers × 16 bits in SC mode, 226 + or 28 metadata cache entries in dataflow mode 227 + 228 + The predicate register is always readable and writable regardless of 229 + mode, since it's a dedicated chip with its own address/enable lines. 230 + Instructions can test or set predicate bits without going through the 231 + matching store or the ALU result path. 232 + 233 + ### Mode Switching 234 + 235 + When transitioning from dataflow mode to SC mode: 236 + 237 + 1. **Save cached metadata** from the 7 shared 670s (7 × 4 words × 4 238 + bits = 112 bits) to spill storage. 239 + 2. **Load initial SC register values** (matched operand pair that 240 + triggered the SC block) into the 670s. 241 + 3. **Switch address mux**: 670 address lines now driven by instruction 242 + R0/R1/R2 fields instead of matching store entry address. 243 + 4. **Switch IRAM to counter mode**: sequential fetch via incrementing 244 + counter rather than token-directed node address. 245 + 246 + When transitioning back: 247 + 248 + 1. **Emit final SC result** as token (last instruction with OUT=1). 249 + 2. **Restore cached metadata** from spill storage to the 670s. 250 + 3. **Switch address mux back** to matching store entry addressing. 251 + 4. **Resume token processing** from input FIFO. 252 + 253 + ### Spill Storage Options 254 + 255 + 112 bits (or 128 bits if including the predicate slice) need temporary 256 + storage during SC block execution. 257 + 258 + **Option A: Shift registers.** 2 × 74LS165 (parallel-in, serial-out) 259 + for save + 2 × 74LS595 (serial-in, parallel-out) for restore. Total: 260 + 4 chips. 
Save/restore takes ~16 clock cycles each (shifting 112+ bits 261 + serially at 8 bits per chip per cycle). 262 + 263 + **Option B: Dedicated spill 670.** One additional 74LS670 (4 × 4 bits) 264 + holds 16 bits per save cycle; need ~7 write cycles to save all 7 265 + chips' contents. Total: 1 chip, ~7 cycles per save/restore. 266 + 267 + **Option C: Spill to matching store SRAM.** During SC mode, the 268 + matching store data SRAM is idle. Write the 670 metadata contents into 269 + unused matching store SRAM locations (e.g., a reserved region at the 270 + top of the address space). No extra chips needed. ~7 SRAM write cycles 271 + to save, ~7 to restore. The SRAM is single-ported but there's no 272 + contention because nothing else accesses it during mode switch. 273 + 274 + **Recommended: Option C.** Zero additional chips. The matching store 275 + SRAM is guaranteed idle during SC mode, and the metadata only needs 276 + to survive for the duration of the SC block (typically 4–20 277 + instructions = a few microseconds). The save/restore overhead of 278 + ~7 cycles per transition is negligible compared to the SC block's 279 + execution savings (EM-4 data: 23 clocks pure dataflow vs 9 clocks SC 280 + for Fibonacci, so even with 14 cycles of mode switch overhead, you 281 + break even at ~7 SC instructions). 
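The break-even figure can be checked against the document's own throughput numbers (~3 clocks per instruction for pure dataflow under the Option 1 pipeline, ~1 clock per instruction in SC mode). A quick arithmetic sketch:

```python
# Break-even length for an SC block, using the figures quoted above.
dataflow_clocks_per_instr = 3    # ~1 token per 3 clocks (Option 1 pipeline)
sc_clocks_per_instr = 1          # ~1 instruction per clock via the 670s
mode_switch_overhead = 14        # ~7 cycles save + ~7 cycles restore (Option C)

saved_per_instr = dataflow_clocks_per_instr - sc_clocks_per_instr
break_even = mode_switch_overhead / saved_per_instr
assert break_even == 7           # matches the ~7-instruction figure above
```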
282 + 283 + ### Chip Budget Summary 284 + 285 + | Component | Chips | Function | 286 + |-----------|-------|----------| 287 + | 7 × 74LS670 | 7 | Shared metadata cache / SC register file | 288 + | 1 × 74LS670 | 1 | Dedicated predicate register (4 × 4-bit = 16 predicate bits) | 289 + | ~3 chips | 3 | Address mux, mode control, chip-select decode | 290 + | 0 chips | 0 | Spill storage (reuse matching store SRAM) | 291 + | **Total** | **~11** | | 292 + 293 + This replaces what would otherwise require separate implementations: 294 + - Metadata cache alone: ~7 chips (670s) + ~5 chips (tag comparators) 295 + - SC register file alone: ~4–8 chips (670s) 296 + - Predicate register alone: ~1 chip 297 + 298 + The shared design costs fewer chips than implementing the three 299 + subsystems separately, and provides all three functions. 300 + 301 + --- 302 + 303 + ## 6. Pipeline Timing by Era 304 + 305 + ### 1979 (5 MHz, 2114 SRAM) 306 + 307 + ``` 308 + Matching: 2 cycles (SRAM single-port constraint) 309 + 670 metadata cache: single-cycle on hit, 2-cycle on miss 310 + SC block throughput: ~1 instruction per clock (670 dual-port) 311 + Overall token throughput: ~1 token per 3 clocks (pipelined) 312 + ``` 313 + 314 + ### 1982 (5–8 MHz, 6116 SRAM) 315 + 316 + ``` 317 + Matching: 1 cycle (half-clock split feasible at 5 MHz) 318 + 670 metadata cache: single-cycle (rarely needed, SRAM is fast enough) 319 + SC block throughput: ~1 instruction per clock 320 + Overall token throughput: ~1 token per clock (fully pipelined) 321 + ``` 322 + 323 + ### 1984+ (10 MHz, 6264 SRAM) 324 + 325 + ``` 326 + Matching: 1 cycle (half-clock comfortable) 327 + SC block throughput: ~1 instruction per clock 328 + Overall token throughput: ~1 token per clock 329 + 670s primarily useful for SC register file and predicate register 330 + ``` 331 + 332 + ### Integrated (on-chip SRAM, sub-ns access) 333 + 334 + ``` 335 + Matching: 1 cycle (trivially fast) 336 + SC block throughput: 1 instruction per clock 337 + Token throughput: 1 per
clock, potentially 2 with banked match store 338 + 670 equivalent: on-chip multi-ported register file, ~200 transistors 339 + ``` 340 + 341 + --- 342 + 343 + ## 7. Interaction with PE-to-PE Pipelining 344 + 345 + When multiple PEs are chained for software-pipelined loops (see 346 + architecture overview), the per-PE pipeline throughput determines the 347 + overall chain throughput. 348 + 349 + With the pipelined design (1 token per 1–3 clocks depending on era), 350 + the inter-PE hop cost becomes the critical path for chained execution: 351 + 352 + | Interconnect | Hop latency | Viable? | 353 + |-------------|-------------|---------| 354 + | Shared bus (discrete build) | 5–8 cycles | Marginal — chain overhead dominates | 355 + | Dedicated FIFO between adjacent PEs | 2–3 cycles | Worthwhile for tight loops | 356 + | On-chip wide parallel link (integrated) | 1–2 cycles | Competitive with intra-PE SC block | 357 + 358 + For the discrete v0 build, dedicated inter-PE FIFOs (a small 74LS224 359 + or similar between adjacent PEs, bypassing the shared bus) would enable 360 + PE chaining at reasonable cost. This is a low-chip-count addition (~2–4 361 + chips per PE pair) that unlocks software-pipelined loop execution. 362 + 363 + --- 364 + 365 + ## 8. 
The Execution Mode Spectrum 366 + 367 + The pipelined PE with SC blocks and predicate register supports a 368 + spectrum of execution modes, selectable by the compiler per-region: 369 + 370 + | Mode | Pipeline behaviour | Throughput | When to use | 371 + |------|-------------------|-----------|-------------| 372 + | Pure dataflow | Token → match → fetch → exec → output | 1 token / 1–3 clocks | Parallel regions, independent ops | 373 + | SC block (register) | Sequential IRAM fetch, 670 register file | ~1 instr / clock | Short sequential regions (≤8 instrs) | 374 + | SC block + predicate | As above, with conditional skip/branch via predicate bits | ~1 instr / clock | Conditional sequential regions | 375 + | PE chain (software pipeline) | Tokens flow PE₀→PE₁→PE₂, each PE handles one stage | 1 iteration / PE-pipeline-depth clocks | Loop bodies across PEs | 376 + | SM-mediated sequential | Tokens to/from SM for memory-intensive work | SM-bandwidth-limited | Array/structure traversal | 377 + 378 + The compiler partitions the program graph and selects the best mode for 379 + each region. This spectrum is arguably more expressive than what a 380 + modern OoO core offers (which has exactly one mode: "pretend to be 381 + sequential, discover parallelism at runtime"). 382 + 383 + --- 384 + 385 + ## 9. Open Items 386 + 387 + 1. **670 cache tag comparison**: the current design assumes a 388 + single-slot cache window (all 32 entries of the active context 389 + slot). If the working set frequently spans multiple slots, a proper 390 + tag-compared cache may be needed. Evaluate hit rates via the 391 + behavioural emulator before committing to hardware. 392 + 393 + 2. **SC block maximum length**: with 7 registers (after dedicating one 394 + 670 to predicates), what is the longest SC block the compiler can 395 + generate before register pressure forces a spill? Standard graph 396 + colouring gives a rough answer; evaluate empirically on target 397 + workloads. 398 + 399 + 3. 
**Predicate register uses**: document specific instruction 400 + encodings for predicate test/set/clear, and how SWITCH instructions 401 + interact with predicate bits. The predicate register may subsume 402 + some of the cancel-bit functionality planned for token format. 403 + 404 + 4. **Mode switch latency measurement**: build a cycle-accurate model of 405 + the save-to-SRAM / restore-from-SRAM path and determine exact 406 + overhead. Target: ≤10 cycles per transition. 407 + 408 + 5. **Dual-use validation**: simulate the metadata cache hit rate for 409 + representative dataflow programs using the behavioural emulator. 410 + If hit rates are below ~70%, the metadata cache may not justify its 411 + complexity vs simply accepting 2-cycle matching everywhere.
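Open item 5 is cheap to prototype in the behavioural emulator. A minimal single-slot-window model of the metadata cache (names are illustrative; hit/miss costs taken from the 1979 row of the timing tables above):

```python
from typing import Optional

class MetadataCache:
    """Single-slot-window model: the 670s cache one context slot's metadata.
    SRAM stays authoritative (write-through), so a miss only costs latency."""

    def __init__(self) -> None:
        self.cached_slot: Optional[int] = None
        self.hits = 0
        self.misses = 0

    def access(self, ctx_slot: int) -> int:
        """Return access cost in cycles: 1 on a 670 hit, 2 on SRAM fallback."""
        if ctx_slot == self.cached_slot:
            self.hits += 1
            return 1
        self.misses += 1
        self.cached_slot = ctx_slot      # refill the window with the new slot
        return 2

    def hit_rate(self) -> float:
        return self.hits / (self.hits + self.misses)

cache = MetadataCache()
cycles = sum(cache.access(slot) for slot in [3, 3, 3, 5, 5, 3])
assert (cache.hits, cache.misses, cycles) == (3, 3, 9)
assert cache.hit_rate() < 0.70   # below threshold -> cache may not be justified
```

Feeding real token traces through `access` and reading off `hit_rate` is all the evaluation the open item asks for.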
+996
design-notes/sm-and-token-format-discussion.md
··· 1 + # Extended SM Design Discussion & Token Format Rework 2 + 3 + Covers SM enhancements, token format rearrangement, DRAM latency context, 4 + wide pointers, bulk SM operations, bootstrap/EXEC unification, bootstrap 5 + SM ownership, and presence metadata design. 6 + 7 + See `sm-design.md` for the base SM design this builds on. 8 + See `architecture-overview.md` for module taxonomy and existing token format. 9 + 10 + --- 11 + 12 + ## DRAM vs SRAM Latency in the Target Era 13 + 14 + Historical context for SM backing store decisions. 15 + 16 + ### 4116 DRAM (the workhorse of 1979) 17 + 18 + The MK4116 (16Kx1) was ubiquitous: ZX Spectrum, Apple II, IBM PC. 19 + 20 + ``` 21 + Speed grade Access time (RAS) Cycle time Page mode access 22 + ───────────────────────────────────────────────────────────────── 23 + 4116-2 150 ns 320 ns 100 ns 24 + 4116-3 200 ns 375 ns 135 ns 25 + 4116-4 250 ns 410 ns 165 ns 26 + ``` 27 + 28 + **Access time** = when valid data appears after RAS goes low. 29 + **Cycle time** = minimum time between the start of one access and start of 30 + the next. Includes RAS precharge recovery. A new random access cannot 31 + begin until the full cycle time elapses. 32 + 33 + Comparison: 2114 SRAM at 200ns has access time AND cycle time of ~200ns. 34 + No precharge, no refresh. DRAM is roughly **half the random-access 35 + throughput** of SRAM at the same clock speed due to cycle time overhead. 36 + 37 + The 4116 also needs 128 refresh cycles every 2ms, stealing ~2-3% of 38 + available bandwidth. 39 + 40 + **Page mode** is noteworthy: when accessing multiple locations in the same 41 + 128-bit row, page mode on the 4116-2 drops to 100ns — faster than the 42 + 2114. The row stays open; the controller just strobes new column 43 + addresses. This is how the C64 and BBC Micro shared DRAM between CPU and 44 + video on alternate half-cycles. 
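The throughput and refresh claims above are pure arithmetic; a quick verification using the 4116-3 row and the 2114 baseline (no new data, just checking the stated figures):

```python
# Quick check of the throughput and refresh-overhead claims.
dram_cycle_ns = 375              # 4116-3 cycle time (includes RAS precharge)
sram_cycle_ns = 200              # 2114: access time == cycle time

# Random-access throughput: DRAM is roughly half of SRAM at the same clock.
assert round(dram_cycle_ns / sram_cycle_ns, 3) == 1.875

# 128 refresh cycles every 2 ms, each stealing one full DRAM cycle.
refresh_fraction = (128 * dram_cycle_ns) / 2_000_000
assert round(refresh_fraction * 100, 1) == 2.4   # ~2-3% of bandwidth, as stated
```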
45 + 46 + ### The SRAM/DRAM Ratio Over Time 47 + 48 + ``` 49 + Era SRAM access DRAM access DRAM cycle Ratio (DRAM/SRAM) 50 + ───────────────────────────────────────────────────────────────────── 51 + 1979 200 ns 200 ns 375 ns ~1-2x 52 + 1985 70 ns 100 ns 230 ns ~1.5-3x 53 + 1995 15 ns 60 ns ~120 ns ~4-8x 54 + 2005 2 ns ~50 ns (CAS) ~100 ns ~25-50x 55 + 2024 0.5 ns ~15 ns (CAS) ~50 ns ~30-100x 56 + ``` 57 + 58 + In 1979, DRAM was at most **2x slower** than SRAM for random access. 59 + Today it's **50-100x slower** relative to on-chip SRAM. The memory wall 60 + barely existed in 1979. 61 + 62 + **Implication for SM:** SM backed by DRAM in 1979 incurs only a 2x 63 + latency penalty vs SRAM — viable without caching. In a modern version, 64 + the same architecture is better positioned for the memory wall than 65 + conventional designs, because dataflow provides natural latency tolerance: 66 + the PE processes other tokens while waiting for SM responses. 67 + 68 + ### C64 Memory Architecture Note 69 + 70 + Main 64K: 8× 4116 DRAM. Colour RAM: single 2114 SRAM (1K×4, low nybble 71 + only, 16 colours). The colour RAM was SRAM specifically because it needed 72 + independent access without fighting the VIC-II for DRAM bus cycles. 73 + 74 + --- 75 + 76 + ## 74LS610 Memory Mapper for SM Banking 77 + 78 + The 74LS610 is a TI memory mapper chip (originally for TMS9900 family): 79 + 80 + - 16 mapping registers, each 12 bits wide 81 + - 4-bit logical address input (MA0-MA3) selects register 82 + - 12-bit physical address output (MO0-MO11) 83 + - The '610 has a **latch control** pin (pin 28) absent from the '612 — 84 + outputs can be frozen while register contents change 85 + - Appeared in PC-AT boards and Nintendo cartridges 86 + - Internally a multiplexed register file, not dual-port 87 + 88 + ### SM Address Translation Path 89 + 90 + 1. Token arrives with structure address 91 + 2. 
High bits go to '610, select mapping register → 12-bit physical bank 92 + (~40-50ns propagation delay, LS family) 93 + 3. Bank bits + low address bits drive DRAM row/column 94 + 4. DRAM access (~200ns access + 375ns cycle for 4116) 95 + 96 + The '610 propagation delay is pipelineable — it's a combinational lookup 97 + that overlaps with DRAM RAS setup. The real cost comes from *changing* a 98 + mapping register mid-operation (a write cycle to the '610's internal 99 + registers via the data bus). That is the bank-switch overhead. 100 + 101 + ### SM Usage Model 102 + 103 + Mapping registers are set at load time and mostly left alone. SM 104 + addresses are global — every PE sees the same SM address space. The 105 + mapping registers expand physical capacity rather than providing 106 + per-context isolation. 107 + 108 + Bank switching hurts when more live structures exist than fit in the 109 + directly-mapped physical space. The compiler can hint which structures are 110 + hot vs cold, and the SM controller can manage the mapping registers 111 + accordingly: fast path (bank already mapped → straight through) vs slow 112 + path (bank miss → remap → re-access). 113 + 114 + --- 115 + 116 + ## SM Return Routing: Pre-Formed Token Templates 117 + 118 + The return routing field in SM READ requests (flit 2) is conceptualized as a 119 + **pre-formed CM token template**. The SM's result formatter simply latches the 120 + template, concatenates the read data, and pushes the result to the output. 121 + No bit-shuffling, no field packing — the requesting CM does all that work 122 + upfront when constructing the request. 123 + 124 + Because the template is a full 16-bit flit carried in flit 2 (or flit 3 125 + for extended formats), it has enough room to encode any token format whose 126 + routing information fits in a single flit. 
In practice this means any 127 + format OTHER than dyadic narrow can serve as a return route: dyadic wide, 128 + monadic normal, and monadic inline all fit their routing fields into 129 + flit 1. The SM does not need to understand the format — it just prepends 130 + the template as flit 1 and appends the read data as flit 2. 131 + 132 + This means SM read results can land directly in a matching store slot as 133 + one operand of a dyadic instruction. The result does not need to pass 134 + through an intermediate monadic forwarding step — it can match against 135 + another token that arrived independently, enabling patterns like "fetch 136 + value from SM, combine with a locally-computed operand, produce result" 137 + in a single matching store cycle. 138 + 139 + **Implication:** bits cannot be stolen from the return routing field for 140 + page register selection or other SM-side metadata, because the SM would 141 + need to parse its own return routing — defeating the purpose of making it 142 + an opaque blob. 143 + 144 + --- 145 + 146 + ## Token Format Rework: 1-Bit SM/CM Split 147 + 148 + ### Motivation 149 + 150 + Eliminating type-11 (system) tokens and moving to a 1-bit SM/CM 151 + discriminator reclaims 1 bit in the SM flit. This bit can be spent on 152 + addressing, opcodes, or SM_id width. 153 + 154 + ### Why Type-11 Can Be Eliminated 155 + 156 + Each type-11 subtype can be absorbed into existing types: 157 + 158 + - `11+00` (IO): IO module becomes a specialised SM or memory-mapped 159 + region within SM address space 160 + - `11+01` (config/IRAM load): becomes a CM opcode (see IRAM Write below) 161 + - `11+10` (debug/trace): reserved SM address range or special monadic 162 + opcode 163 + - `11+11` (reserved): not committed 164 + 165 + Bootstrap/initial loading uses a dedicated hardware path (see Bootstrap 166 + section below), not runtime tokens. 
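A sketch of the first-flit decode this split implies, using the prefixes laid out under "New Token Encoding" below (`classify_flit1` is an illustrative name, not emulator API):

```python
def classify_flit1(flit: int) -> str:
    """Decode the format prefix of a 16-bit flit 1 (bit 15 is the MSB)."""
    if flit & 0x8000:              # bit[15] = 1 -> SM token
        return "sm"
    if not (flit & 0x4000):        # 00 -> dyadic wide (hot path)
        return "dyadic_wide"
    if not (flit & 0x2000):        # 010 -> monadic normal
        return "monadic_normal"
    sub = (flit >> 9) & 0b11       # 011 misc bucket: sub field follows PE:2
    return ("dyadic_narrow", "iram_write", "monadic_inline", "spare")[sub]

assert classify_flit1(0b1000_0000_0000_0000) == "sm"
assert classify_flit1(0b0000_0000_0000_0000) == "dyadic_wide"
assert classify_flit1(0b0100_0000_0000_0000) == "monadic_normal"
assert classify_flit1(0b0110_0010_0000_0000) == "iram_write"
```

Note the gate-depth ordering falls out directly: SM and dyadic wide each resolve on one bit test, and only the infrequent misc bucket needs the deeper `sub` extraction.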
167 + 168 + ### IO as Memory-Mapped SM 169 + 170 + The IO module is interacted with like an SM: reads/writes use the 171 + standard SM token format. I-structure semantics help naturally — a READ 172 + from an IO device that doesn't have data ready defers, and the response 173 + arrives when the device is ready. This provides interrupt-driven IO 174 + without interrupts. 175 + 176 + At v0 scale, dedicating one SM_id to IO leaves 3 SMs × 512 = 1536 177 + structure cells. See "Bootstrap SM Ownership" below for how this interacts 178 + with the bootstrap path. 179 + 180 + ### IRAM Write as CM Opcode 181 + 182 + From the CM's perspective, an IRAM write is "put this instruction word at 183 + this IRAM address." It needs no ctx, port, gen, or matching store access. 184 + Those bits become extra address or data bits. 185 + 186 + The CM is also in the best position to know whether it can safely swap out 187 + a given instruction or whether tokens are in flight for it, enabling 188 + finer-grained hot-reload at runtime. 189 + 190 + ### New Token Encoding 191 + 192 + ``` 193 + ═══════════════════════════════════════════════════════════════════ 194 + BIT[15] = 1: SM TOKEN 195 + ═══════════════════════════════════════════════════════════════════ 196 + 197 + Standard (2 flit): 198 + flit 1: [1][SM_id:2][op:3-5][addr:8-10] = 16 199 + flit 2: [data:16] or [return_routing:16] 200 + 201 + 15 bits available. See "SM Opcode Width Options" below. 
202 + 203 + ═══════════════════════════════════════════════════════════════════ 204 + BIT[15:14] = 00: DYADIC WIDE (hot path) 205 + ═══════════════════════════════════════════════════════════════════ 206 + 207 + flit 1: [0][0][PE:2][offset:5][ctx:4][port:1][gen:2] = 16 208 + flit 2: [data:16] 209 + 210 + offset:5 = 32 dyadic slots per context (doubled from 16) 211 + matching store addr = [ctx:4][offset:5] = 9 bits = 512 cells 212 + decode: bit[15]=0 AND bit[14]=0 → two gates 213 + 214 + ═══════════════════════════════════════════════════════════════════ 215 + BIT[15:13] = 010: MONADIC NORMAL (2 flit) 216 + ═══════════════════════════════════════════════════════════════════ 217 + 218 + flit 1: [0][1][0][PE:2][offset:7][ctx:4] = 16 219 + flit 2: [data:16] 220 + 221 + offset:7 = 128 IRAM slots, unchanged 222 + No port, no gen 223 + 224 + ═══════════════════════════════════════════════════════════════════ 225 + BIT[15:13] = 011: MISC BUCKET (infrequent formats) 226 + ═══════════════════════════════════════════════════════════════════ 227 + 228 + flit 1: [0][1][1][PE:2][sub:2][...9 bits...] = 16 229 + 230 + sub=00: DYADIC NARROW (2 flit, 8-bit data) 231 + flit 1: [011][PE:2][00][offset:5][ctx:4] = 16 232 + flit 2: [data:8][port:1][gen:2][spare:5] = 16 233 + 234 + sub=01: IRAM WRITE (2+ flit) 235 + flit 1: [011][PE:2][01][iram_addr:7][flags:2] = 16 236 + flit 2: [instruction_word_low:16] = 16 237 + (flit 3: [instruction_word_high:8][spare:8] if needed) 238 + 7-bit addr = 128 IRAM slots, full coverage. 239 + No ctx/port/gen needed. 240 + 241 + sub=10: MONADIC INLINE (1 flit, trigger) 242 + flit 1: [011][PE:2][10][offset:4][ctx:4][spare:1] = 16 243 + No flit 2. 244 + 245 + sub=11: SPARE 246 + Reserved. Candidates: extended monadic with wider offset, 247 + broadcast/multicast, debug/trace injection. 
248 + ``` 249 + 250 + ### Summary Table 251 + 252 + ``` 253 + prefix format flits offset ctx port gen vs current 254 + ────────────────────────────────────────────────────────────────────── 255 + 1 SM standard 2 9-10 — — — +1 bit (addr or op) 256 + 00 dyadic wide 2 5 (32) 4 1 2 +1 offset (was 4) 257 + 010 monadic normal 2 7 (128) 4 — — unchanged 258 + 011+00 dyadic narrow 2 5 (32) 4 1 2 unchanged 259 + 011+01 IRAM write 2-3 7 (128) — — — NEW 260 + 011+10 monadic inline 1 4 (16) 4 — — unchanged 261 + 011+11 (spare) ? ? ? ? ? reserved 262 + ``` 263 + 264 + ### Hot Path Decode 265 + 266 + Two bits determine the fast-path pipeline: 267 + 268 + - bit[15] splits SM/CM: one gate 269 + - bit[14] splits dyadic-wide from everything else: one gate 270 + - the PE can start matching store SRAM read on [ctx:4][offset:5] the 271 + instant flit 1 is latched for dyadic-wide tokens (the dominant format) 272 + - monadic decode adds one more gate at bit[13] 273 + - the misc bucket is three gates deep, but nothing there is 274 + latency-critical 275 + 276 + ### Dyadic Narrow as Demoted Format 277 + 278 + Dyadic narrow carries 8-bit data but still costs two flits. Its main 279 + advantage is a wider offset field plus spare bits in flit 2. It is less 280 + broadly useful than dyadic wide (16-bit data, the common case) and 281 + comfortably shares the misc bucket with IRAM write and monadic inline — 282 + all three are infrequent relative to the two hot-path formats. 283 + 284 + ### Monadic Offset Relative Addressing 285 + 286 + Dyadic instructions pack into the lowest IRAM offsets (0-15 wide, 0-31 287 + narrow). Monadic tokens never target those slots. 
Making monadic offsets 288 + **relative to the dyadic ceiling** avoids wasting encodings: 289 + 290 + ``` 291 + wide mode: 16 dyadic slots → monadic base = 16 292 + 6-bit relative offset → addresses 16-79 (64 monadic slots, all valid) 293 + 294 + narrow mode: 32 dyadic slots → monadic base = 32 295 + 6-bit relative offset → addresses 32-95 (64 monadic slots, all valid) 296 + ``` 297 + 298 + Hardware cost: since the dyadic ceiling is always a power of 2 (16 or 299 + 32), the base can be OR'd onto the high address bits. One gate plus a 300 + config bit. 301 + 302 + **SC block interaction:** SC blocks need contiguous IRAM space that does 303 + not respect the dyadic-below-monadic packing rule. The compiler packs SC 304 + blocks into a separate IRAM region above the monadic ceiling, addressed 305 + via a base register set on entering SC mode. 306 + 307 + ### SM Opcode Width Options 308 + 309 + With 15 bits available after the SM discriminator bit, the SM_id (2 bits) 310 + leaves 13 bits split between opcode and address. Three alternatives: 311 + 312 + **3-bit fixed:** 8 ops, 10-bit addr (1024 cells). READ, WRITE, ALLOC, 313 + FREE, CLEAR, READ_INC, READ_DEC, CAS fills it exactly. No headroom, no 314 + decode complexity. 315 + 316 + **4-bit fixed:** 16 ops, 9-bit addr (512 cells). Room for EXT, RAW_READ, 317 + EXEC, and 5 spare slots. Trivial decode. 11 defined operations with 5 318 + spare gives comfortable expansion room. 319 + 320 + **Variable 3/5:** one decode gate. Common ops get 3-bit opcode + 10-bit 321 + addr (1024 cells). Rare/special ops get 5-bit opcode + 8-bit payload 322 + (256 addresses or inline data). 
323 + 324 + ``` 325 + op[2:1] != 11: 6 opcodes × 10-bit addr (1024 cells) 326 + READ, WRITE, ALLOC, FREE, CLEAR, EXT 327 + 328 + op[2:1] == 11: extends to 5-bit → 8 opcodes × 8-bit payload 329 + READ_INC, READ_DEC, CAS, RAW_READ, EXEC, SET_PAGE, WRITE_IMM, (spare) 330 + ``` 331 + 332 + **Key insight for variable-width encoding:** not all SM ops are 333 + `op(address)`. Some are `op(config_value)` or `op()` with no cell 334 + operand. The 8-bit payload in the restricted tier can be inline data, 335 + config values, or range counts depending on the opcode: 336 + 337 + - EXEC: payload = length/count (base addr in config register) 338 + - SET_PAGE: payload = page register value 339 + - WRITE_IMM: 8-bit addr + 8-bit immediate (single-flit small constant 340 + writes, feasible if flit 2 is repurposed or omitted) 341 + - CLEAR_RANGE: base addr + count 342 + 343 + --- 344 + 345 + ## Bootstrap and EXEC Unification 346 + 347 + ### The EXEC Concept 348 + 349 + An SM operation that reads from a region of SM address space, **bypasses 350 + the result token formatter**, and pushes raw data onto the token bus as 351 + pre-formed tokens. 352 + 353 + ### Bootstrap via EXEC 354 + 355 + At power-on, a small state machine (counter + comparator + a few gates) 356 + begins reading from a ROM base address. The ROM is mapped into the SM's 357 + extended address space (directly wired or via '610 mapper). Its contents 358 + are pre-assembled tokens: IRAM writes, routing config, and everything 359 + else needed to bring the system alive. 360 + 361 + ``` 362 + Bootstrap sequence: 363 + 1. Power on, reset 364 + 2. Bootstrap state machine starts clocking reads from ROM base 365 + 3. SM reads ROM, pushes raw bytes onto token bus 366 + 4. Those bytes ARE tokens — CM IRAM writes, SM config, routing setup 367 + 5. CMs and SMs see valid tokens on the bus, process normally 368 + 6. Bootstrap hits stop sentinel, halts 369 + 7. System is loaded, bootstrap state machine goes idle 370 + 8. 
Final token in ROM sequence could be a "go" trigger 371 + ``` 372 + 373 + Bootstrap hardware: ~5-8 chips (12-bit counter, stop comparator, bus 374 + driver, trigger logic, mux between bootstrap and runtime address sources). 375 + 376 + ### Bootstrap Bus Arbitration 377 + 378 + During bootstrap, nothing else transmits (nothing is loaded yet). The 379 + bootstrap SM gets unconditional bus access until completion. A single 380 + flip-flop ("bootstrap complete," active-low) gates other bus requesters. 381 + Normal arbitration activates after bootstrap signals done. 382 + 383 + ### ROM Image Format 384 + 385 + The toolchain compiles the program, generates a token stream for loading, 386 + and packs it into a ROM image. No special bootstrap format or separate 387 + loader protocol — the ROM contains the same tokens that would flow on the 388 + bus during normal operation, pre-baked. 389 + 390 + **Token ordering in the ROM is critical:** routing config must precede 391 + compute tokens (so the network knows where to route), IRAM loads must 392 + precede trigger tokens (so instructions exist before anything fires). 393 + This is a constraint on the assembler/linker output ordering. 394 + 395 + ### Runtime EXEC 396 + 397 + At runtime, EXEC reuses the same address counter, bus output path, and 398 + sequencing logic as bootstrap. 
The differences: 399 + 400 + - Triggered by a token instead of a power-on reset signal 401 + - Address + length come from the token (or from a wide pointer) rather 402 + than "start at 0, go until sentinel" 403 + - Additional hardware: trigger latch, length register (~2 chips) 404 + 405 + Runtime EXEC enables: 406 + 407 + - **Checkpoint/restore:** snapshot tokens into SM region, reload via EXEC 408 + - **Code migration:** EXEC a sequence that writes new IRAM to a target 409 + PE, updates routing, and re-injects pending tokens 410 + - **Computed token streams:** a CM builds a token sequence in SM via 411 + WRITEs, then triggers EXEC to emit them all at once 412 + - **Bulk scatter:** pre-formed token templates in SM, EXEC distributes 413 + data to PEs (faster than a CM emit loop) 414 + 415 + ### Bus Bandwidth During EXEC 416 + 417 + EXEC monopolises the bus while clocking out tokens. At v0 scale this is 418 + acceptable. At scale, EXEC should be backpressure-aware — the SM only 419 + emits when the bus grants access, naturally interleaving with other 420 + traffic. 421 + 422 + ### Security Consideration 423 + 424 + EXEC allows an SM to inject tokens that look like they came from any 425 + source, targeting any PE, with any opcode. At v0 this is a non-issue 426 + (same threat model as a 6502 — all code is trusted). At scale, EXEC 427 + should be gated behind privilege levels or a config bit controlling which 428 + SMs can use it. 429 + 430 + --- 431 + 432 + ## Bootstrap SM Ownership 433 + 434 + ### Dedicated Bootstrap SM 435 + 436 + One SM (likely SM00) is wired to the system reset signal. On coming out 437 + of reset, it calls EXEC on a predetermined address in program storage 438 + (ROM or flash mapped into its extended address space). 439 + 440 + The bootstrap program at the reset vector is responsible for: 441 + 442 + 1. Commanding all other CMs and SMs to clear existing state 443 + 2. Loading the program (IRAM writes, routing config, initial data) 444 + 3. 
Emitting the initial trigger tokens to start execution 445 + 446 + Any program loaded by the bootstrap must **not** issue commands to the 447 + bootstrap SM as part of its own loading sequence — doing so could 448 + interfere with the bootstrap process itself. 449 + 450 + At runtime, the same EXEC behaviour can be triggered on any SM targeting 451 + any part of memory. The only thing special about SM00 is the reset-vector 452 + wiring. 453 + 454 + ### Should the Bootstrap SM Be Further Specialised? 455 + 456 + This is an open question with trade-offs in both directions. 457 + 458 + **Arguments for specialisation (SM00 as "IO SM"):** 459 + 460 + Some SM opcodes (atomics, alloc/free) don't have meaningful semantics on 461 + memory-mapped IO addresses. On SM00 (aliased `io` in the assembler), the 462 + atomic and allocation opcodes could be repurposed for specialised IO 463 + operations. This avoids wasting opcode space on operations that don't 464 + apply to IO, and gives the IO subsystem its own instruction vocabulary 465 + within the existing token format. 466 + 467 + SM00 would also be the natural home for program storage (ROM/flash) and 468 + the serial port, since the bootstrap hardware already requires access to 469 + external storage. 470 + 471 + **Arguments against specialisation:** 472 + 473 + Making SM00 special outside the boot process creates a bottleneck. If 474 + SM00 is the only path to program storage, serial IO, and other 475 + peripherals, it concentrates traffic. In a system with complex IO 476 + requirements or significant address space demand, a single special-cased 477 + SM limits scalability. 478 + 479 + The more devices or address space mapped through SM00, the worse the 480 + contention. Other SMs cannot help share the load because they lack the 481 + specialised decode logic. 482 + 483 + **Middle ground (recommended for v0):** 484 + 485 + SM00 is special only at boot (reset vector wiring). 
At runtime, IO is 486 + memory-mapped into SM00's address space using the standard SM opcode set. 487 + Opcode reuse for IO-specific operations is deferred until profiling shows 488 + the standard opcodes are insufficient. Any SM can perform EXEC at runtime, 489 + so program loading and code migration are not locked to SM00. 490 + 491 + This avoids committing to a specialised instruction set before knowing 492 + what IO operations actually need hardware support vs what the standard 493 + read/write/presence-bit model can handle. 494 + 495 + --- 496 + 497 + ## Memory Tier Model 498 + 499 + Not all addressable storage needs synchronising memory semantics. Treating 500 + every cell as an I-structure is expensive (presence SRAM, FSM complexity, 501 + deferred read logic) and unnecessary for many use cases. The SM address 502 + space should support regions with different semantics tiers, selected by 503 + address range. 504 + 505 + ### Tier Definitions 506 + 507 + **Tier 0: Raw memory.** No presence bits, no metadata, no I-structure 508 + semantics. Reads always return whatever is stored; writes always go 509 + through. No per-cell state means no consistency concerns — this tier can 510 + be trivially shared across SMs because there is nothing to keep 511 + synchronised. The SM hardware path for tier 0 is minimal: address decode, 512 + SRAM/DRAM read/write, done. No FSM, no deferred read register, no 513 + presence SRAM access. 514 + 515 + Use cases: framebuffers, ROM, DMA buffers, lookup tables, program 516 + storage, any memory region where the programmer accepts "last writer wins" 517 + semantics or read-only access. 518 + 519 + **Tier 1: I-structure memory.** The full presence-bit model. 520 + EMPTY/RESERVED/FULL/WAITING state per cell, deferred reads, write-once 521 + semantics with diagnostic on overwrite. Each cell has metadata in the 522 + presence SRAM. 
This is the synchronising memory that makes the dataflow 523 + architecture work — producer-consumer coordination without locks. Must be 524 + owned by a single SM because the metadata is authoritative. 525 + 526 + **Tier 2: Wide/bulk memory.** Tier 1 plus the is_wide tag, sequencer 527 + support for ITERATE/COPY_RANGE, and bounds checking via wide pointer 528 + metadata. A superset of tier 1 that enables SM-local bulk operations. 529 + Same ownership constraints as tier 1. 530 + 531 + ### Tier Selection by Address Range 532 + 533 + For v0, the tier boundary is a fixed address convention. Addresses below 534 + a threshold are tier 1 (I-structure); addresses above it are tier 0 535 + (raw). The boundary may be a hardwired constant or a config register. The 536 + decode logic is one comparator. 537 + 538 + The compiler and assembler know the layout and enforce placement: 539 + I-structure cells go in the tier 1 range, framebuffers and lookup tables 540 + go in tier 0. The ROM base address (and therefore the EXEC reset vector) 541 + is always in tier 0 at a fixed location, because the bootstrap hardware 542 + must know where to start reading without any configuration having 543 + occurred. 544 + 545 + Future upgrade path: **per-page tier tags.** If the '610 mapper is 546 + present, each mapping register can carry 2 extra bits indicating the tier 547 + for that page. The mapper already translates logical→physical addresses; 548 + it additionally outputs "this page is tier 0/1/2" alongside the physical 549 + address. The SM FSM uses this signal to skip the presence check and 550 + deferred-read logic for tier 0 pages. Hardware cost: 2 extra bits per 551 + mapping register and one mux on the presence SRAM read-enable line. 552 + 553 + ### Implications for Shared Access 554 + 555 + Tier 0 regions do not need to be owned by a single SM. If there are no 556 + presence bits, there is no per-cell state to be inconsistent. 
Multiple 557 + SMs can map the same physical DRAM or ROM as tier 0, and concurrent 558 + access works without coordination — reads are instantaneous, writes are 559 + last-writer-wins. The hardware cost per SM for tier 0 access is 560 + essentially zero beyond address decode and the bus interface. 561 + 562 + This directly solves the "program storage bottleneck on SM00" problem 563 + from the bootstrap discussion. If ROM is tier 0 and any SM can map it, 564 + then any SM can EXEC from it or issue READs to it. SM00 has the reset 565 + vector wiring for initial bootstrap, but at runtime program storage is 566 + accessible from any SM. No bottleneck. 567 + 568 + For ROM specifically, the path is even simpler: tier 0 read-only. The SM 569 + does not need write logic for the ROM region. EXEC on ROM is the 570 + bootstrap path. Regular READs on ROM are just reads — no presence check, 571 + no state transition, the fastest possible SM path. 572 + 573 + --- 574 + 575 + ## Address Space Distribution Across SMs 576 + 577 + ### The Core Tension 578 + 579 + Each SM manages presence bits and metadata for its own cells. This 580 + requires per-SM ownership of tier 1/2 cells, which inherently segments 581 + the memory space. The question is how to present this to the programmer 582 + and how to handle structures that don't fit neatly into one SM's address 583 + range. 584 + 585 + ### Option Analysis 586 + 587 + **Hard partitioned (current design).** SM_id in the token selects the SM; 588 + address bits are local to that SM. Each SM owns its cells, its presence 589 + metadata, everything. The compiler/programmer must know which SM holds 590 + what. 591 + 592 + Advantages: zero contention on metadata, SMs are completely independent, 593 + simplest hardware. Four SMs running in parallel can service four requests 594 + simultaneously. Wide pointers and bulk ops work trivially because all 595 + cells in a structure are SM-local. 
596 + 597 + Disadvantages: the programmer sees 4 small address spaces instead of one 598 + large one. A 200-element array must either fit in one SM (consuming nearly 599 + half its capacity at 512 cells) or be explicitly split across SMs by the 600 + compiler. No parallelism on accesses within one SM's address range. 601 + 602 + **Interleaved.** Low address bits select the SM; high bits select the cell 603 + within it. The programmer sees a flat address space. 604 + 605 + Advantages: flat addressing, automatic load balancing for sequential 606 + access patterns. 607 + 608 + Disadvantages: locality is destroyed. Consecutive elements are on 609 + different SMs, which breaks wide pointers (cell[N] on SM0, cell[N+1] on 610 + SM1) and prevents SM-local bulk ops (ITERATE, COPY_RANGE). Wide pointers 611 + effectively rule out pure interleaving. 612 + 613 + **Block interleaved.** Blocks of N cells (16-32) assigned to SMs 614 + round-robin. Consecutive cells within a block are on the same SM. 615 + 616 + Advantages: flat-ish addressing, wide pointers work within blocks, some 617 + load balancing for large structures. Bulk ops work within a block. 618 + 619 + Disadvantages: block boundaries create edge cases for structures that 620 + straddle two SMs. The compiler must be block-aware for optimal placement. 621 + 622 + **Segmented with toolchain abstraction (recommended for v0).** Keep hard 623 + partitioning in hardware, but provide a toolchain abstraction layer. The 624 + assembler provides named memory regions mapping to SM_id + local address. 625 + The compiler handles placement. At runtime, the SM_id bits in the token 626 + explicitly select the SM, but the programmer writes `STORE array[i]` and 627 + the toolchain resolves the target. 628 + 629 + Advantages: hardware stays simple, programmer does not manually manage SM 630 + placement, wide pointers and bulk ops work trivially (everything within 631 + a structure is on one SM). 
The "address space" is as large as the 632 + toolchain can represent. 633 + 634 + Disadvantages: parallelism requires the compiler to deliberately 635 + distribute structures across SMs. A hot array on one SM cannot benefit 636 + from idle SMs. Essentially hard partitioning with better ergonomics. 637 + 638 + ### Recommended Approach 639 + 640 + **v0: segmented with toolchain abstraction.** Hard partitioning is the 641 + correct hardware choice because independent metadata, no contention, and 642 + no multi-port reads are prerequisites for the wide pointer and bulk 643 + operation features. The programmer experience problem is real but solvable 644 + in the toolchain. The compiler can be smart about placement: arrays that 645 + will be iterated go on one SM (for ITERATE), independent structures go on 646 + different SMs (for parallelism), hot accumulators go on SMs not busy with 647 + bulk ops. 648 + 649 + **Scale: block interleaving.** At 8+ SMs with a dedicated CM×SM fabric, 650 + the network can route based on address bits without needing an explicit 651 + SM_id in the token. The SM_id effectively becomes the high address bits. 652 + This is a clean evolution from "explicit SM_id in the token" to "implicit 653 + SM selection from address bits" without changing the token format — the 654 + 2-3 SM_id bits are simply reinterpreted. 655 + 656 + ### Cross-SM Wide Pointers 657 + 658 + A wide pointer can reference cells on a *different* SM if the base 659 + address includes an SM_id (or maps to a different SM through the address 660 + translation layer). When an SM processes an ITERATE on such a pointer, it 661 + cannot read the target cells locally. Instead, it emits READ tokens to 662 + the target SM for each element, acting as a "list controller" that 663 + orchestrates access across the system. 
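The local-vs-remote dispatch can be sketched in a few lines (Python, in the spirit of the emulator; the function names and the 10-bit local address width below the SM_id bits are assumptions for illustration, not the implemented format):

```python
# Hypothetical sketch: an SM servicing ITERATE on a wide pointer whose
# base address carries the SM_id in the bits above a 10-bit local address.
SM_ID_SHIFT = 10                      # assumed local address width
LOCAL_MASK = (1 << SM_ID_SHIFT) - 1

def service_iterate(my_sm_id, base, length, read_local, emit_read):
    """read_local(addr): internal SRAM read, no bus traffic.
    emit_read(sm_id, addr): push a 2-flit READ token onto the bus.
    Returns locally read values, or [] when acting as list controller."""
    target = base >> SM_ID_SHIFT
    addr = base & LOCAL_MASK
    if target == my_sm_id:
        # Local path: internal reads; responses go out as result tokens.
        return [read_local(addr + i) for i in range(length)]
    # Remote path: one READ token per element to the owning SM.
    for i in range(length):
        emit_read(target, addr + i)
    return []
```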
664 + 665 + This is slower than SM-local iteration (bus traffic per element instead of 666 + internal reads) but papers over the distribution from the programmer's 667 + perspective. The programmer writes ITERATE; the hardware determines 668 + whether it can run locally or must go remote. Amamiya's SM list operations 669 + worked on a similar principle. 670 + 671 + The resulting access patterns: 672 + 673 + ``` 674 + Local ITERATE (target cells on same SM): 675 + 1 ITERATE token + N internal reads + N response tokens = 2 + 2N flits 676 + 677 + Remote ITERATE (target cells on different SM): 678 + 1 ITERATE token + N remote READs + N responses back = 2 + 4N flits 679 + ``` 680 + 681 + Remote iteration is 2× the bus traffic of local iteration, but still 682 + better than the CM orchestrating each access individually (which would 683 + also be 4N flits, plus the CM compute overhead per element). 684 + 685 + ### Tier 0 Regions and Distribution 686 + 687 + Tier 0 (raw) memory sidesteps the distribution problem entirely. Because 688 + there are no presence bits or metadata, tier 0 regions can be shared 689 + across SMs without ownership constraints. A framebuffer mapped as tier 0 690 + can be written by any SM, read by any SM, with no coordination overhead. 691 + The only concern is last-writer-wins semantics, which is acceptable for 692 + use cases like display buffers, shared lookup tables, and ROM. 693 + 694 + This creates a natural split: tier 1/2 memory is partitioned across SMs 695 + for correctness (metadata ownership), while tier 0 memory is shared for 696 + convenience and throughput. 697 + 698 + --- 699 + 700 + ## Wide Pointers and Bulk SM Operations 701 + 702 + ### Motivation 703 + 704 + Historical SM designs (including Amamiya's) treated SM as dumb storage: 705 + cells with presence bits, every operation one-cell-at-a-time, CM doing 706 + all orchestration. Amamiya's SM had CAR/CDR fields per cell for LISP cons 707 + cells. 
708 + 709 + The approach here is analogous to Rust's wide pointers: carry length + 710 + address metadata alongside the cell data. The SM becomes a **smart memory 711 + controller** that can operate on structures without per-element CM 712 + round-trips. 713 + 714 + Inspired in part by near-data-processing research (e.g. SFU 2022 paper 715 + on smarter caches, DOI: 10.1145/3470496.3527380). 716 + 717 + ### Wide Pointer = (address, length) 718 + 719 + An SM cell tagged as a "wide pointer" implicitly pairs with the next cell 720 + to form a `(base_address, length)` tuple. The SM can act on this metadata 721 + without CM involvement. 722 + 723 + ### Capabilities Enabled 724 + 725 + **Bounded slices:** the SM knows the extent of a structure. Hardware 726 + bounds checking on READ rejects out-of-range accesses without a CM 727 + round-trip. 728 + 729 + **SM-local iteration (READ_NEXT / ITERATE):** the SM maintains an 730 + internal cursor. The CM sends "give me the next element," the SM 731 + increments its pointer, and returns the value. One token in, one token 732 + out, repeated until exhaustion. The CM does not track the index. 733 + 734 + ``` 735 + Bus traffic for 64-element array iteration: 736 + without bulk ops: 64 READs + 64 responses = 256 flits 737 + with ITERATE: 1 ITERATE + 64 responses = 130 flits 738 + ``` 739 + 740 + **SM-local memcpy/memmove (COPY_RANGE):** source and destination slices 741 + provided, SM handles the copy internally. No per-element tokens on the 742 + bus. 743 + 744 + ``` 745 + without: 64 READs + 64 responses + 64 WRITEs = 384 flits 746 + with: 1 COPY_RANGE + 1 completion token = 4 flits 747 + ``` 748 + 749 + **SM-local string ops:** a length-aware SM can perform compare, search, 750 + and concat on packed byte data (2 chars per 16-bit cell) without per-byte 751 + CM involvement. 
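The traffic comparisons above reduce to a few lines of arithmetic (assuming every bus token, request or response, is 2 flits and SM-internal reads/writes cost no bus flits):

```python
TOKEN = 2  # flits per bus token (request or response)

def iterate_flits(n, bulk):
    # without bulk ops: n READ requests + n responses
    # with ITERATE:     1 ITERATE request + n responses
    return (TOKEN + n * TOKEN) if bulk else (n * TOKEN + n * TOKEN)

def copy_flits(n, bulk):
    # without bulk ops: n READs + n responses + n WRITEs
    # with COPY_RANGE:  1 command + 1 completion; the copy is SM-internal
    return 2 * TOKEN if bulk else 3 * n * TOKEN

assert iterate_flits(64, bulk=False) == 256
assert iterate_flits(64, bulk=True) == 130
assert copy_flits(64, bulk=False) == 384
assert copy_flits(64, bulk=True) == 4
```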
752 + 753 + ### Cell Type Tag Implementation 754 + 755 + **Recommended: tagged cells.** Widen per-cell metadata from 2-bit 756 + presence to 3+ bits: presence:2 + is_wide:1. 757 + 758 + - Cells that are not wide pointers pay zero overhead (is_wide = 0) 759 + - Wide pointer cells consume 2 cells (the pointer cell + the length/ 760 + metadata cell) 761 + - The SM knows to treat them as a unit because of the tag 762 + - Presence SRAM goes from 2 bits/cell to 3 bits/cell, still trivially 763 + fits the same small SRAM chip 764 + 765 + Alternatives considered: 766 + 767 + - **Paired cells:** even addresses hold data, odd hold metadata. Simple, 768 + but halves effective cell count for all cells regardless of use. 769 + - **Separate metadata SRAM:** a third SRAM plane alongside data and 770 + presence. Does not eat cell count but costs additional chips. 771 + 772 + Tagged cells are the most flexible option — only wide-pointer cells pay 773 + the cost of consuming two cells. 774 + 775 + ### Hardware Reuse: EXEC Sequencer = Bulk Op Engine 776 + 777 + The bootstrap/EXEC hardware and the bulk operation engine share nearly 778 + all their components: 779 + 780 + ``` 781 + Already present for EXEC: 782 + - Address counter 783 + - Limit comparator (stop sentinel) 784 + - Increment logic (READ_INC/DEC already requires 16-bit incrementer) 785 + - Output path to bus (result formatter bypass) 786 + - Internal read/write sequencing (FSM) 787 + 788 + Additional for bulk ops: 789 + - Second address register (destination for COPY_RANGE) 790 + - Mode selector (determines per-element action) 791 + ~5-6 extra chips on top of the EXEC sequencer 792 + ``` 793 + 794 + All bulk operations are modes of the same sequencer: 795 + 796 + - **EXEC:** read cells, push raw to bus 797 + - **ITERATE:** read cells, push as formatted response tokens 798 + - **COPY_RANGE:** read cells, write to other cells internally 799 + - **CLEAR_RANGE:** write zeros across a cell range 800 + 801 + ### Wide Pointer as 
Sequencer Parameter Block 802 + 803 + Without wide pointers, every bulk op requires the CM to send base+length 804 + as part of the command (3-flit tokens minimum; SM latches parameters from 805 + the token). 806 + 807 + With wide pointers, the command is just "ITERATE cell[N]." The SM reads 808 + cell[N], sees the is_wide tag, reads cell[N+1] for the length, loads the 809 + sequencer, and runs. A single 2-flit token kicks off an arbitrarily long 810 + bulk operation. 811 + 812 + The limit register IS the length from the wide pointer. The base address 813 + IS the address from the wide pointer. The wide pointer serves directly as 814 + the parameter block for SM-internal bulk operations. 815 + 816 + ### Bus Contention During Bulk Ops 817 + 818 + **Backpressure-based (recommended for v0):** the SM only emits when it 819 + receives bus access, naturally interleaving with other traffic. The 820 + sequencer pauses when it cannot transmit. This falls out of normal bus 821 + arbitration and degrades throughput gracefully rather than starving other 822 + devices. 823 + 824 + An alternative is interruptible sequencing with burst length limits (the 825 + SM processes N elements, yields the bus, resumes later). This adds a 826 + resume-state register but is probably unnecessary at v0 scale. 827 + 828 + ### Complexity Boundary 829 + 830 + The SM should perform **address arithmetic and counting**, not 831 + **data-dependent branching on cell values**. Once the SM starts evaluating 832 + data content (e.g. "if value > threshold, do X"), it has crossed from 833 + smart memory controller into coprocessor territory and gate count 834 + escalates. 
835 + 836 + Acceptable SM-internal decisions: 837 + 838 + - Presence state checks (metadata the SM owns) 839 + - Wide pointer length/bounds checks (structural metadata) 840 + - Counter increment/compare (sequencer logic) 841 + 842 + Belongs in the CM / dataflow graph, not the SM: 843 + 844 + - Conditional operations based on data values 845 + - Arbitrary arithmetic on cell contents 846 + - Data-dependent control flow 847 + 848 + The existing atomic ops (READ_INC, READ_DEC, CAS) do modify data, but 849 + they are fixed-function single-cell operations, not sequenced bulk 850 + compute. They remain below the complexity line. 851 + 852 + ### Road Not Taken: SM Evolution Trajectory 853 + 854 + Hypothetical trajectory if someone had cracked the compiler problem circa 855 + 1975: 856 + 857 + ``` 858 + 1975-78: SM = dumb cells + presence bits (Amamiya baseline) 859 + 1979-82: SM gains atomic ops, wide pointers (structure-aware memory) 860 + 1983-86: SM gains bulk sequencer (EXEC/ITERATE/COPY_RANGE) 861 + 1987-90: SM gets microcode ROM for near-data programs (filter, reduce) 862 + 1992+: SM becomes a specialised vector unit next to memory 863 + (essentially what GPUs independently reinvented) 864 + ``` 865 + 866 + Compare to the actual historical trajectory: make the CPU faster, add 867 + cache layers to hide memory latency, add OoO to hide cache miss latency, 868 + add prefetchers to hide OoO limits — each layer compensating for the 869 + previous layer's failure to solve data-far-from-compute. 870 + 871 + --- 872 + 873 + ## Presence Metadata SRAM Design 874 + 875 + ### Constraint 876 + 877 + Presence metadata must have **single-cycle immediate access**. In the 878 + historical scenario this means fast SRAM regardless of whether the data 879 + backing store is DRAM. The presence check determines the FSM's action 880 + (defer, satisfy, error) and cannot wait for DRAM latency. 
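A minimal model of that single-cycle decision, assuming the four presence states and a single deferred-read register per cell (so a second READ on a WAITING cell errors rather than chaining — one of the open FSM details, not a settled choice):

```python
EMPTY, RESERVED, FULL, WAITING = range(4)

def presence_decision(state, op):
    """Return (action, next_state) for a READ or WRITE against one cell."""
    if op == "READ":
        if state == FULL:
            return "satisfy", FULL          # data latch already holds the value
        if state == WAITING:
            return "error", WAITING         # deferred-read register occupied
        return "defer", WAITING             # stash return routing, await WRITE
    if op == "WRITE":
        if state == WAITING:
            return "store_and_wake", FULL   # satisfy the deferred READ
        if state == FULL:
            return "store_diag", FULL       # overwrite permitted, but flagged
        return "store", FULL
    raise ValueError(f"unknown op {op!r}")
```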
881 + 882 + ### Parallel Access 883 + 884 + Presence SRAM and data SRAM share address lines (same address driven to 885 + both chips simultaneously). The presence result feeds the FSM decision 886 + logic while the data result waits in a latch. If the FSM decides the data 887 + is needed (cell is FULL, op is READ), it is already available. If not 888 + (cell is EMPTY, defer), the data latch is simply ignored. 889 + 890 + For historical DRAM backing: the DRAM data read can be speculative — 891 + kicked off in parallel with the presence check and cancelled or ignored if 892 + the cell turns out to be EMPTY. 893 + 894 + ### Per-Cell Metadata Candidates 895 + 896 + **Checked on every operation (fast path):** 897 + 898 + - Presence state: 2 bits (EMPTY/RESERVED/FULL/WAITING) 899 + - is_wide tag: 1 bit (SM needs this before deciding whether to also read 900 + the next cell) 901 + 902 + **Checked on most operations (likely fast path):** 903 + 904 + - Type tag: 1-2 bits (scalar, wide pointer, packed bytes — avoids wasted 905 + reads or misinterpretation of cell contents) 906 + - Write-once flag: 1 bit (hard error on overwrite, distinct from FULL 907 + which permits overwrite with a diagnostic indicator) 908 + 909 + **Checked selectively (possibly fast path):** 910 + 911 + - Owner/source: 2 bits (which CM "owns" this cell — for future access 912 + control) 913 + - Refcount indicator: 1 bit (flag that this cell uses atomic semantics, 914 + allowing ref-counted and non-ref-counted cells to coexist) 915 + 916 + ### Recommended: 4 Bits Per Cell 917 + 918 + `presence:2 + is_wide:1 + spare:1` 919 + 920 + At 512 cells this is 256 bytes — trivially fits any small SRAM chip. 921 + Using a byte-wide SRAM chip means bits 4-7 are physically present whether 922 + used or not. Committing 4 bits with 4 spare avoids needing to change the 923 + presence SRAM layout when write-once enforcement or type tags are added 924 + during testing. 
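The layout in a byte-wide presence SRAM can be sketched as pack/unpack helpers (bit positions assumed for illustration: presence in bits [1:0], is_wide in bit 2, bit 3 spare, bits [7:4] physically present but unused):

```python
def pack_meta(presence, is_wide, spare=0):
    assert 0 <= presence <= 3 and is_wide in (0, 1) and spare in (0, 1)
    return (spare << 3) | (is_wide << 2) | presence

def unpack_meta(byte):
    return byte & 0b11, (byte >> 2) & 1, (byte >> 3) & 1

FULL = 2                                   # example presence encoding
cell = pack_meta(presence=FULL, is_wide=1) # a FULL wide-pointer cell
assert unpack_meta(cell) == (FULL, 1, 0)

# 512 cells at 4 used bits each is the 256-byte figure above
assert 512 * 4 // 8 == 256
```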
925 + 926 + --- 927 + 928 + ## Open Design Questions 929 + 930 + 1. **SM opcode width** — 3-bit fixed (maximum address range), 4-bit fixed 931 + (maximum opcodes, simplest decode), or variable 3/5 (both benefits, one 932 + extra decode gate)? Depends on how much the 512 vs 1024 cell 933 + difference matters in practice. 934 + 935 + 2. **Wide pointer cell format** — is (base_addr, length) in two 936 + consecutive cells sufficient, or should the metadata cell also carry a 937 + type/tag byte? e.g. cell[N] = base_addr, cell[N+1] = length | 938 + element_type. 939 + 940 + 3. **EXEC stop condition** — sentinel value in the data stream, or length 941 + register preloaded from a wide pointer or command? Sentinel is simpler 942 + for bootstrap (no length needed); a length register is safer for 943 + runtime EXEC (no risk of data accidentally matching the sentinel). 944 + 945 + 4. **Bulk op completion signalling** — when ITERATE or COPY_RANGE 946 + finishes, does the SM emit a completion token? Could reuse the return 947 + routing from the original request to send a "done" signal that fires 948 + the next node in the dataflow graph. 949 + 950 + 5. **Bulk ops on non-FULL cells** — if ITERATE reads a cell in WAITING 951 + state (an I-structure not yet fulfilled), should the sequencer stall 952 + per-cell, skip the cell, or error? Per-cell stalling risks blocking 953 + the sequencer indefinitely. 954 + 955 + 6. **IRAM write multi-flit handling** — if instruction words are wider 956 + than 16 bits, IRAM write needs 3 flits. The misc bucket sub=01 format 957 + has 2-bit flags that could signal 2-flit vs 3-flit mode. Depends on 958 + final instruction word width. 959 + 960 + 7. **IO address space allocation** — if IO becomes memory-mapped SM, 961 + which SM_id is reserved? Is IO mapped into a specific address range 962 + within SM00, or distributed across SMs? 963 + 964 + 8. **Relative monadic offset + SC blocks** — SC blocks need their own 965 + IRAM region. 
Does the relative offset mechanism require a third base 966 + register for SC mode, or does SC bypass offset translation entirely 967 + (using raw IRAM addresses from a sequential counter)? 968 + 969 + 9. **Bootstrap SM specialisation** — should SM00 repurpose unused opcodes 970 + (atomics, alloc/free) for IO-specific operations, or should it use the 971 + standard SM opcode set with IO handled purely through read/write and 972 + presence-bit semantics? Specialisation provides richer IO control; 973 + uniformity avoids creating a traffic bottleneck and simplifies the 974 + hardware. 975 + 976 + 10. **Tier boundary configuration** — hardwired address split between 977 + tier 0 (raw) and tier 1 (I-structure), or config register settable at 978 + load time? Hardwired is simplest; a config register allows different 979 + programs to allocate different amounts of raw vs synchronised memory. 980 + The ROM base / reset vector address is always fixed regardless. 981 + 982 + 11. **Tier 0 write semantics** — should tier 0 cells support writes from 983 + any SM (true shared memory), or only from the "home" SM with other SMs 984 + issuing remote write requests? True sharing is simpler but 985 + last-writer-wins can cause subtle bugs if two CMs write the same cell 986 + concurrently. Remote writes add a serialisation point. 987 + 988 + 12. **Cross-SM ITERATE** — remote iteration is 2× the bus traffic of 989 + local iteration. Is this acceptable, or should the compiler be 990 + required to keep iterable structures SM-local? If required, the 991 + cross-SM path could be omitted from v0 and treated as an error. 992 + 993 + 13. **Per-page tier tags with '610** — the '610 mapper has 12-bit output 994 + registers. Are spare bits available for tier tags, or does the full 995 + 12-bit output go to physical addressing? If the physical address space 996 + does not need all 12 bits, the high bits could encode tier.
+376 -166
design-notes/sm-design.md
··· 1 1 # Dynamic Dataflow CPU — SM (Structure Memory) Design 2 2 3 3 Covers the SM interface protocol, operation set, banking scheme, address 4 - space extension, and hardware architecture. 4 + space extension, memory tiers, wide pointers, bulk operations, 5 + EXEC/bootstrap, and hardware architecture. 5 6 6 7 See `architecture-overview.md` for module taxonomy and token format. 7 8 See `network-and-communication.md` for how SM connects to the bus. 8 9 See `bus-architecture-and-width-decoupling.md` for bus width rationale 9 10 and SM internal datapath confirmation. 11 + See `sm-and-token-format-discussion.md` for extended discussion of design 12 + decisions including DRAM latency context, bootstrap unification, and 13 + address space distribution. 10 14 11 15 ## Role 12 16 13 - SM stores structured data (arrays, lists, heap) and performs operations on 14 - it. it is NOT used for I/O mapping — I/O lives in the type-11 subsystem 15 - (see `io-and-bootstrap.md`). 17 + SM stores structured data (arrays, lists, heap), performs operations on it, 18 + and provides memory-mapped IO. 16 19 17 - SM is **synchronising memory** — not just a data store. each cell has 20 + SM is **synchronising memory** — not just a data store. Tier 1 cells have 18 21 presence state (empty/full), and reads to empty cells are deferred until 19 - a write arrives. this gives implicit producer-consumer synchronisation 22 + a write arrives. This gives implicit producer-consumer synchronisation 20 23 without locks or explicit message-passing: the write *is* the signal. 21 - this is the dataflow architecture's answer to shared mutable state. 24 + This is the dataflow architecture's answer to shared mutable state. 22 25 23 - from a CM's perspective: send a type-10 request, get a result token back 24 - eventually. split-phase, asynchronous relative to the requesting CM. 
a 25 - READ may return immediately (cell is full) or later (cell is empty, 26 + **IO is memory-mapped into SM address space.** An SM (typically SM00 at v0) 27 + maps IO devices into its address range. I-structure semantics provide 28 + natural interrupt-free IO: a READ from an IO device that has no data defers 29 + until data arrives, triggering the receiving node in the dataflow graph. 30 + 31 + From a CM's perspective: send a bit[15]=1 request, get a CM result token 32 + back eventually. Split-phase, asynchronous relative to the requesting CM. 33 + A READ may return immediately (cell is full) or later (cell is empty, 26 34 deferred until written). 27 35 28 36 See `em4-analysis.md` for context on why dedicated synchronising memory ··· 40 48 info in the bits that are unused by that operation type. SM does not 41 49 maintain persistent per-request state — result packets are self-addressed. 42 50 43 - **Exception**: deferred reads. when a READ hits an empty cell, the SM 51 + **Exception**: deferred reads. When a READ hits an empty cell, the SM 44 52 stashes the return routing in a deferred read register and sends the 45 - result later when a WRITE arrives. this is the one case where the SM 46 - holds pending state. see "Deferred Read Register" section below. 53 + result later when a WRITE arrives. This is the one case where the SM 54 + holds pending state. See "Deferred Read Register" section below. 47 55 48 - ### Request Formats (type 10, 2-flit, received on AN) 56 + ### Request Format (bit[15]=1, 2-flit standard) 49 57 50 58 All SM requests arrive as 2 flits on the 16-bit bus and are reassembled 51 59 at the SM input FIFO before processing. 52 60 53 61 ``` 54 62 Flit 1 (common to all SM ops): 55 - [type:2=10][SM_id:2][op_base:3][addr/op_ext:1][addr_low:8] = 16 bits 63 + [1][SM_id:2][op_base:3][addr/op_ext:2][addr_low:8] = 16 bits 64 + 65 + 15 bits available after the SM discriminator bit. SM_id (2 bits) selects 66 + the target SM. 
The remaining 13 bits encode opcode and address using 67 + variable-width encoding: 68 + 69 + When op[2:1] != 11: 3-bit opcode, 10-bit addr (1024 cells) 70 + When op[2:1] == 11: extends to 5-bit opcode, 8-bit payload (256 cells) 56 71 57 - When op_base[2:1] != 11: bit[8] is addr MSB → 3-bit op, 9-bit addr 58 - When op_base[2:1] == 11: bit[8] is opcode extension → 4-bit op, 8-bit addr 59 - See "Bus Encoding" in Operation Set section for full decode table. 72 + Decode signal: op[2] AND op[1] — one gate. 60 73 61 74 Flit 2 varies by operation: 62 75 ··· 64 77 Write data to address. No DN response unless cell was WAITING 65 78 (deferred read satisfied — result token emitted using saved routing). 66 79 67 - READ / CLEAR / ALLOC / FREE: [return routing:16] 68 - Flit 2 carries return routing (destination CM, offset, ctx, port) 69 - instead of data. Exact bit packing TBD — 16 bits available for 70 - ret_CM + ret_offset + ret_ctx + ret_port. 80 + READ / CLEAR / ALLOC / FREE: [return_routing:16] 81 + Flit 2 carries a **pre-formed CM token template**. The SM's result 82 + formatter latches this template, prepends it as the result's flit 1, 83 + and appends read data as flit 2. No bit-shuffling — the requesting 84 + CM does all format work upfront. The SM treats this as an opaque 85 + 16-bit blob. 71 86 (CLEAR/ALLOC/FREE may not need return routing — if no result token 72 87 is emitted, flit 2 could be omitted or carry flags instead. TBD.) 73 88 74 - READ_INC / READ_DEC: [return routing:16] 75 - Same as READ. atomic ops always return the old value. 89 + READ_INC / READ_DEC: [return_routing:16] 90 + Same as READ. Atomic ops always return the old value. 
76 91 77 92 CAS — compare-and-swap: 78 93 Requires 3-flit extended format to carry expected_value, new_value, ··· 84 99 ``` 85 100 86 101 **3-flit extended addressing mode**: for access to external RAM or 87 - memory-mapped I/O address spaces, a 3-flit structure token provides 88 - wider addresses at the cost of one extra flit cycle: 89 - ``` 90 - flit 1: [type:2=10][SM_id:2][op:3][flags:9] = 16 bits 91 - flit 2: [extended_addr:16] = 16 bits 92 - flit 3: [data:16] or [return routing:16] = 16 bits 93 - ``` 94 - This is distinct from the type-11 extended structure path — it stays 95 - within the type-10 traffic class but uses a flag or reserved opcode to 96 - signal 3-flit format. 102 + memory-mapped IO address spaces, a 3-flit SM token provides wider 103 + addresses at the cost of one extra flit cycle. The EXT opcode (one of 104 + the 3-bit tier) signals 3-flit format. 97 105 98 - ### Result Format (on DN, repackaged as type 00 or 01) 106 + ### Result Format (on DN, pre-formed CM token) 99 107 100 - SM extracts return routing from the request's flit 2 and constructs a 101 - standard 2-flit compute token: 108 + SM uses the pre-formed CM token template from the request's flit 2 as the 109 + result's flit 1, and appends the read data as flit 2: 102 110 103 111 ``` 104 - Result (2 flits, type 00 or 01): 105 - flit 1: [type:2][ret_CM:2][...12 bits: ret addr, ctx, port...] = 16 bits 106 - flit 2: [fetched_data:16] = 16 bits 112 + Result (2 flits): 113 + flit 1: [return_routing:16] (opaque template from original request) 114 + flit 2: [fetched_data:16] 107 115 ``` 108 116 109 - The requesting CM specified where this result should land (which PE, 110 - offset, context slot, port). SM just repackages. the result looks like 111 - any other token arriving at the CM — the CM doesn't know or care that 112 - it came from SM. 117 + The template can encode any CM token format whose routing fits in 16 bits: 118 + dyadic wide, monadic normal, or monadic inline. 
This means SM read results 119 + can land directly in a matching store slot as one operand of a dyadic 120 + instruction — no intermediate monadic forwarding step needed. 113 121 114 - **Open question**: the return routing in READ requests needs to include 115 - enough information to form a complete flit-1 for the result token. If 116 - the result is type 01 (monadic), no generation counter is needed and 16 117 - bits of return routing may suffice. If type 00 (dyadic), gen bits are 118 - needed. Options: 119 - - (a) SM results are always monadic (type 01) — simplest, SM results 120 - bypass matching and go straight to instruction fetch 121 - - (b) Return routing in flit 2 includes gen bits, reducing space for 122 - other return fields 123 - - (c) The requesting CM stores the gen locally and the SM result matches 124 - without it 125 - Option (a) remains simplest — SM results always feed monadic instruction 126 - inputs. 122 + **Implication:** bits cannot be stolen from the return routing field for 123 + SM-side metadata, because the SM would need to parse its own return 124 + routing — defeating the purpose of keeping it opaque. 127 125 128 126 ## Cell State Model (I-Structure Semantics) 129 127 ··· 194 192 - **FREE**: transitions any→EMPTY. if cell was WAITING, cancels the 195 193 deferred read (same as CLEAR). deferred to post-v0. 196 194 197 - ### Presence State Hardware 195 + ### Presence Metadata Hardware 198 196 199 - 2 bits per cell. at 512 cells = 1024 bits = 128 bytes. implementation 200 - options: 197 + 4 bits per cell: `[presence:2][is_wide:1][spare:1]` 201 198 202 - - **Small SRAM** alongside the data SRAM. addressed in parallel, read/ 203 - written on every operation. one 8-bit-wide SRAM chip covers 4 cells 204 - per byte → 128 bytes easily fits. 205 - - **Register file** if cell count is small enough. 512 × 2 bits is too 206 - big for discrete flip-flops, but a small SRAM works fine. 
199 + - **presence:2** — EMPTY/RESERVED/FULL/WAITING (checked on every operation) 200 + - **is_wide:1** — tags this cell as part of a wide pointer pair. The SM 201 + checks this before deciding whether to also read the next cell for 202 + length metadata. See "Wide Pointers" section below. 203 + - **spare:1** — reserved. Candidates: write-once flag, type tag, owner ID. 204 + 205 + At 1024 cells = 4096 bits = 512 bytes. A byte-wide SRAM chip packs two 206 + 4-bit entries per byte; the spare bits are physically present whether 207 + used or not, so committing 4 bits per cell now avoids changing the presence 208 + SRAM layout when additional per-cell metadata is added during testing. 209 + 210 + Implementation: 211 + 212 + - **Small SRAM** alongside the data SRAM. Addressed in parallel, read/ 213 + written on every operation. One 8-bit-wide SRAM chip covers 2 cells 214 + per byte (4 bits each) → 512 bytes easily fits. 207 215 - Must support single-cycle read-modify-write (read state, decide action, 208 216 - write new state) within one clock. Achievable if the state SRAM access 209 217 time is less than half the clock period (read in first half, write in 210 218 second half — same half-clock RMW technique as the EM-4 matching stage). 211 219 ··· 216 224 217 225 ``` 218 226 Deferred Read Register: 219 - [valid:1][cell_addr:9][return_routing:16] = 26 bits 227 + [valid:1][cell_addr:10][return_routing:16] = 27 bits 220 228 ``` 221 229 222 - - **valid**: set when a READ hits an EMPTY/RESERVED cell. cleared when the 230 + - **valid**: set when a READ hits an EMPTY/RESERVED cell. Cleared when the 223 231 deferred read is satisfied (WRITE to the target cell) or cancelled 224 232 (CLEAR/FREE to the target cell). 225 - - **cell_addr**: which cell this deferred read is waiting on. compared 226 - against incoming WRITE addresses to detect satisfaction. 227 - - **return_routing**: the 16 bits from the original READ's flit 2.
used 228 - to construct the result token when the deferred read is satisfied. 233 + - **cell_addr**: which cell this deferred read is waiting on (10-bit 234 + address for 1024-cell range). Compared against incoming WRITE addresses 235 + to detect satisfaction. 236 + - **return_routing**: the 16 bits from the original READ's flit 2 (the 237 + pre-formed CM token template). Used as flit 1 of the result token 238 + when the deferred read is satisfied. 229 239 230 240 ### Deferred Read Satisfaction 231 241 232 242 On every WRITE, the SM checks: `valid == 1 AND write_addr == cell_addr`. 233 - if true: 243 + If true: 234 244 235 245 1. Store the write data in the cell (normal WRITE behaviour). 236 246 2. Transition cell state to FULL. 237 - 3. Construct result token using saved return_routing + written data. 238 - 4. Emit result token via DN. 239 - 5. Clear the deferred read register (valid = 0). 247 + 3. Emit result token: flit 1 = saved return_routing, flit 2 = written data. 248 + 4. Clear the deferred read register (valid = 0). 240 249 241 - Hardware cost: one 9-bit comparator (cell_addr vs write_addr), one AND 242 - gate (valid AND addr_match), one 26-bit register. trivial — maybe 3-4 250 + Hardware cost: one 10-bit comparator (cell_addr vs write_addr), one AND 251 + gate (valid AND addr_match), one 27-bit register. Trivial — maybe 3-4 243 252 TTL chips total. 244 253 245 254 ### Depth-1 Limitation and Backpressure ··· 285 294 286 295 ### Bus Encoding (flit 1: variable-width opcode) 287 296 288 - On the 16-bit bus, flit 1 of a type-10 token is: 297 + On the 16-bit bus, flit 1 of an SM token is: 289 298 290 299 ``` 291 - [type:2=10][SM_id:2][op_base:3][addr/op_ext:1][addr:8] = 16 bits 300 + [1][SM_id:2][op_base:3][op_ext/addr:2][addr:8] = 16 bits 292 301 ``` 293 302 294 - The interpretation of bit[8] depends on op_base[2:1] (the top two bits 295 - of the 3-bit base opcode): 303 + 15 bits available after the SM discriminator. SM_id (2 bits) selects the 304 + target SM. 
The remaining 13 bits encode opcode and address using 305 + variable-width encoding. 306 + 307 + The interpretation of the 2 bits after op_base depends on op_base[2:1]: 296 308 297 309 ``` 298 - op_base[2:1] = 00, 01, or 10: 299 - bit[8] is part of the address. opcode = 3 bits, address = 9 bits (512 cells). 310 + op[2:1] != 11: 311 + 3-bit opcode, 10-bit addr (1024 cells). 312 + The 2 extension bits are the high address bits. 300 313 301 - op_base[2:1] = 11: 302 - bit[8] is an opcode extension. opcode = 4 bits, address = 8 bits (256 cells). 314 + op[2:1] == 11: 315 + Extends to 5-bit opcode, 8-bit payload (256 cells or inline data). 316 + The 2 extension bits are opcode extension. 303 317 ``` 304 318 305 - Decode signal: `op_base[2] AND op_base[1]` — one gate. 319 + Decode signal: `op[2] AND op[1]` — one gate. 306 320 307 321 ### Opcode Table (bus encoding) 308 322 309 323 ``` 310 - op_base bit[8] bus opcode internal op addr bits name 311 - ───────────────────────────────────────────────────────────── 312 - 000 a 000 0000 9 (512) READ 313 - 001 a 001 0001 9 WRITE 314 - 010 a 010 0010 9 ALLOC 315 - 011 a 011 0011 9 FREE 316 - 100 a 100 0100 9 CLEAR 317 - 101 a 101 0101 9 (spare) 318 - 110 0 1100 0110 8 (256) READ_INC 319 - 110 1 1101 0111 8 READ_DEC 320 - 111 0 1110 1000 8 CAS 321 - 111 1 1111 1001 8 (spare) 324 + op_base ext bus opcode internal op addr bits name 325 + ───────────────────────────────────────────────────────────────── 326 + 000 aa 000 0000 10 (1024) READ 327 + 001 aa 001 0001 10 WRITE 328 + 010 aa 010 0010 10 ALLOC 329 + 011 aa 011 0011 10 FREE 330 + 100 aa 100 0100 10 CLEAR 331 + 101 aa 101 0101 10 EXT (3-flit mode) 332 + 110 00 11000 0110 8 (256) READ_INC 333 + 110 01 11001 0111 8 READ_DEC 334 + 110 10 11010 1000 8 CAS 335 + 110 11 11011 1001 8 RAW_READ 336 + 111 00 11100 1010 8 EXEC 337 + 111 01 11101 1011 8 SET_PAGE 338 + 111 10 11110 1100 8 WRITE_IMM 339 + 111 11 11111 1101 8 (spare) 322 340 ``` 323 341 324 - 'a' = address bit (part of 9-bit 
address). 342 + 'aa' = address bits (part of 10-bit address). 325 343 326 - **Rationale for restricted-address ops**: READ_INC, READ_DEC, and CAS are 327 - atomic operations typically used on specific cells (refcounts, semaphores, 328 - locks) that the compiler places deliberately. restricting them to the 329 - lower 256 cells per SM is acceptable — the compiler knows this and 330 - allocates atomic-access cells in the lower range. 344 + **Tier 1 ops (3-bit, full address range):** READ, WRITE, ALLOC, FREE, 345 + CLEAR reach the full 1024-cell address space. EXT signals a 3-flit 346 + token for extended addressing (external RAM, wide addresses). 347 + 348 + **Tier 2 ops (5-bit, restricted address/payload):** atomic operations 349 + (READ_INC, READ_DEC, CAS) are restricted to 256 cells — the compiler 350 + places atomic-access cells in the lower range. EXEC, SET_PAGE, and 351 + WRITE_IMM use the 8-bit payload field for non-address data (length, 352 + page register value, small immediate). 331 353 332 - Full-address ops (READ, WRITE, ALLOC, FREE, CLEAR) need to reach the 333 - entire address space since they manage all cells. 354 + **Key insight for variable-width encoding:** not all SM ops are 355 + `op(address)`. Some are `op(config_value)` or `op()` with no cell 356 + operand. The 8-bit payload in the restricted tier can be inline data, 357 + config values, or range counts depending on the opcode: 334 358 335 - One spare slot in each range for future expansion. the full-address spare 336 - (101) could become RAW_READ (non-blocking, non-deferring read — returns 337 - data or empty indicator without registering a deferred read) if needed. 338 - not committed for v0 but the slot is reserved. 
359 + - EXEC: payload = length/count (base addr in config register) 360 + - SET_PAGE: payload = page register value 361 + - WRITE_IMM: 8-bit addr, flit 2 carries immediate data 362 + - RAW_READ: non-blocking read, returns data or empty indicator without 363 + registering a deferred read. Useful for polling and diagnostics. 339 364 340 365 ### Direct Path Encoding (future) 341 366 ··· 357 382 a direct path is added, the bus adapter and direct path adapter both 358 383 feed into the same internal command bus via a mux. 359 384 360 - ### Extended Bus Encoding (3-flit type-10) 385 + ### Extended Bus Encoding (3-flit SM token via EXT opcode) 361 386 362 387 For operations that need wider addresses or additional data (CAS with 363 - both expected and new values), 3-flit type-10 tokens carry the full 388 + both expected and new values), 3-flit SM tokens (via EXT opcode) carry the full 364 389 4-bit opcode in the extra flit: 365 390 366 391 ``` ··· 452 477 ### Banking 453 478 454 479 - Start with 2 banks (1 address bit selects bank) for v0 455 - - 9-bit address = 512 cells per SM = 1KB at 16-bit data width 480 + - 10-bit address = 1024 cells per SM = 2KB at 16-bit data width 456 481 - Each bank is one SRAM chip with room to spare 457 482 - Banking allows pipelining: one bank can be reading while another is 458 483 being written (for RMW ops, or overlapping independent requests) ··· 486 511 enough for the atomic operations. hardware cost: 16-bit incrementer + 487 512 16-bit comparator + mux. ~10-15 TTL chips. 488 513 489 - **Presence state SRAM**: 2 bits per cell. 512 cells = 128 bytes. one small 490 - SRAM chip or a section of a larger SRAM, addressed in parallel with the 491 - data SRAM. must support half-clock read-modify-write (read state in first 492 - half, write new state in second half) for single-cycle operation. 514 + **Presence metadata SRAM**: 4 bits per cell (presence:2 + is_wide:1 + 515 + spare:1). 1024 cells = 512 bytes. 
One small SRAM chip addressed in 516 + parallel with the data SRAM. Must support half-clock read-modify-write 517 + (read state in first half, write new state in second half) for 518 + single-cycle operation. 493 519 494 - **Deferred read register**: one 26-bit register (valid:1 + cell_addr:9 + 495 - return_routing:16). one 9-bit comparator for addr matching against 520 + **Deferred read register**: one 27-bit register (valid:1 + cell_addr:10 + 521 + return_routing:16). One 10-bit comparator for addr matching against 496 522 incoming WRITEs. ~3-4 TTL chips total. 497 523 498 - **Result formatter**: extracts return routing from the original request's 499 - flit 2 (or from the deferred read register for deferred reads), combines 500 - with read data, constructs a 2-flit type 00/01 result token. this is 501 - where the SM-to-DN format conversion happens. Output serialiser splits 502 - the formed token into flits for bus transmission. 524 + **Result formatter**: latches the pre-formed CM token template from the 525 + original request's flit 2 (or from the deferred read register for deferred 526 + reads), emits it as flit 1 of the result, and appends read data as flit 2. 527 + The SM does not parse or modify the template — it is an opaque blob. Output 528 + serialiser splits the formed token into flits for bus transmission. 503 529 504 530 ## Address Space Extension 505 531 506 - The 9-bit address in the compact structure token (type 10) gives only 507 - 512 cells per SM. three mechanisms to extend it: 532 + The 10-bit address in the standard SM token gives 1024 cells per SM for 533 + tier-1 ops (3-bit opcode). Three mechanisms extend this further: 508 534 509 - ### 1. Page Register (recommended for v0) 535 + ### 1. 
Page Register 510 536 511 - - SM has a writable config register: "page base" (8-16 bits) 512 - - 9-bit token address is treated as offset, added to page base 537 + - SM has a writable config register set via SET_PAGE opcode 538 + - 10-bit token address is treated as offset, added to page base 513 539 - Gives up to 64K+ addressable cells per SM 514 - - CM sets the page with a WRITE to a reserved config address before 515 - issuing a burst of reads/writes to a region 540 + - CM sets the page before issuing a burst of reads/writes to a region 516 541 - Hardware cost: ~3 chips (latch for page register + adder) 517 542 - Programming model: familiar bank-switching, like 8-bit micros 518 543 - Tradeoff: page switch costs one extra token; compiler batches accesses ··· 520 545 521 546 ### 2. Banking as Implicit Address Bits 522 547 523 - - SM_id field (2 bits) gives 4 SMs = 4 x 512 = 2K cells system-wide 548 + - SM_id field (2 bits) gives 4 SMs = 4 x 1024 = 4K cells system-wide 524 549 - Not contiguous from a programming perspective, but compiler can 525 550 distribute data structures across SMs for both capacity and parallelism 526 551 - Essentially free — already in the token format 527 552 - Combine with page registers for 4 x 64K = 256K cells system-wide 528 553 529 - ### 3. Extended Structure Tokens (via type 10 3-flit or type 11) 554 + ### 3. 
Extended SM Tokens (3-flit via EXT opcode) 530 555 531 - - Use 3-flit type-10 tokens (see request format above) or type-11 (system) 532 - packets with a structure-extended subtype for structure ops needing wide 533 - addresses 534 - - Full 16-24 bit address space, at the cost of one extra flit cycle 535 - - Use for: large heap, external RAM chip, memory-mapped I/O space 536 - - Compact 2-flit type-10 tokens remain the fast path for common/local 537 - accesses 556 + - The EXT opcode (3-bit tier) signals a 3-flit SM token with full 16-bit 557 + address in flit 2, data/return-routing in flit 3 558 + - Full 16-bit address space per SM, at the cost of one extra flit cycle 559 + - Use for: large heap, external RAM, memory-mapped IO address ranges 560 + - Standard 2-flit tokens remain the fast path for common/local accesses 538 561 539 562 ### Practical Address Space with All Three Combined 540 563 541 - - Fast path (type 10 + page register): 64K per SM, 2-flit token 542 - - Medium path (type 10 across SMs): 4 x 64K = 256K, 2-flit token 543 - - Slow path (3-flit type 10 or type 11 extended): up to 16M+ with wide 544 - addresses, 3-flit token 564 + - Fast path (standard + page register): 64K per SM, 2-flit token 565 + - Medium path (across SMs): 4 x 64K = 256K, 2-flit token 566 + - Slow path (3-flit EXT): up to 64K per SM with wide addresses, 3-flit 545 567 546 568 ## V0 Test Plan 547 569 ··· 570 592 result token emitted) 571 593 - WRITE on FULL cell: verify overwrite + diagnostic indicator 572 594 - **Variable opcode decode tests**: 573 - - Verify READ/WRITE/ALLOC/FREE/CLEAR reach full 512-cell range 574 - - Verify READ_INC/READ_DEC/CAS restricted to lower 256 cells 575 - - Verify op_base[2:1]=11 decode correctly separates opcode 576 - extension from address 595 + - Verify READ/WRITE/ALLOC/FREE/CLEAR reach full 1024-cell range 596 + - Verify READ_INC/READ_DEC/CAS/EXEC etc. 
restricted to lower 256 cells 597 + - Verify op[2:1]=11 decode correctly separates opcode extension 598 + from address 599 + 600 + --- 601 + 602 + ## Memory Tiers 603 + 604 + SM address space supports regions with different semantics, selectable by 605 + address range. The tier is not encoded in the token — it is determined by 606 + the target address within the SM. 607 + 608 + ### Tier 0: Raw Memory 609 + 610 + No presence bits. SRAM read/write only. Suitable for bulk data that does 611 + not need synchronisation (image buffers, DMA staging areas, ROM). 612 + 613 + - READ always returns immediately (no deferral) 614 + - WRITE always succeeds 615 + - Presence metadata bits are ignored / not maintained 616 + - Allocated by compiler or loader, not by ALLOC/FREE 617 + - Address range: configurable, typically the top of the SM address space 618 + 619 + ### Tier 1: I-Structure Memory 620 + 621 + Standard I-structure semantics with presence tracking. This is the default 622 + operating mode described throughout this document. 623 + 624 + - READ on EMPTY/RESERVED cell defers until WRITE 625 + - WRITE transitions cell to FULL 626 + - ALLOC/FREE manage the free list 627 + - Full presence metadata (4 bits per cell) 628 + - Address range: configurable, typically the bulk of SM address space 629 + 630 + ### Tier 2: Wide/Bulk Memory 631 + 632 + Extends tier 1 with wide pointer support. Cells tagged with `is_wide=1` 633 + in presence metadata are treated as the base of a (pointer, length) pair: 634 + the cell itself holds the data pointer, the next cell holds the length. 
635 + 636 + - READ checks `is_wide` and returns either 1 cell (normal) or 2 cells 637 + (wide pointer pair — requires 3-flit result token or two result tokens) 638 + - WRITE to a wide cell writes both pointer and length 639 + - Enables ITERATE, COPY_RANGE, and EXEC to take wide pointers as arguments 640 + - Address range: overlaps with tier 1 (any tier 1 cell can be marked wide) 641 + 642 + ### Tier Boundary Configuration 643 + 644 + - Boundaries are set by config registers (SET_PAGE or a dedicated config 645 + mechanism) during bootstrap 646 + - The SM's address decoder checks the incoming address against tier 647 + boundaries to select behaviour 648 + - Hardware cost: 1-2 comparators + a mux on the presence metadata path 649 + - v0 can use a fixed split (e.g., lower 768 = tier 1, upper 256 = tier 0) 650 + and defer runtime-configurable boundaries 651 + 652 + > **Design status:** tiers are directionally decided. Exact address range 653 + > splits, tier 2 wide-read mechanics, and interaction between tiers and 654 + > page registers are still being refined. 655 + 656 + --- 657 + 658 + ## Wide Pointers and Bulk Operations 659 + 660 + ### Wide Pointer Format 661 + 662 + A wide pointer occupies 2 consecutive SM cells: 663 + 664 + ``` 665 + Cell N: [data_pointer:16] (base address in SM or external memory) 666 + Cell N+1: [length:16] (element count) 667 + ``` 668 + 669 + Cell N has `is_wide=1` in its presence metadata. The SM knows to read 670 + both cells when servicing a READ on a wide cell. 671 + 672 + Wide pointers are the parameter format for bulk operations. A CM does not 673 + iterate over SM contents directly — it sends an SM operation with a wide 674 + pointer, and the SM's sequencer engine handles the iteration internally. 675 + 676 + ### EXEC 677 + 678 + EXEC reads pre-formed tokens from a contiguous region of SM and pushes 679 + them onto the bus. The SM becomes a token source — effectively an 680 + autonomous injector. 
681 + 682 + ``` 683 + EXEC request: 684 + flit 1: [...EXEC opcode...][count:8] 685 + flit 2: [base_addr from config register or wide pointer] 686 + ``` 687 + 688 + The SM's sequencer reads `count` 2-cell entries starting at `base_addr`. 689 + Each entry is a pre-formed 2-flit token (flit 1 in cell N, flit 2 in 690 + cell N+1). The sequencer emits them onto the bus in order. 691 + 692 + **Bootstrap use:** on system reset, SM00 is wired to execute EXEC on a 693 + predetermined ROM address. The ROM contents are pre-formed IRAM write 694 + tokens and seed tokens — everything needed to load the system. No external 695 + microcontroller needed for self-hosted boot. See "SM00 Bootstrap" below. 696 + 697 + **Hardware reuse:** the EXEC sequencer (address counter, limit comparator, 698 + increment logic, output path to bus) is the same hardware needed for 699 + ITERATE and COPY_RANGE. Building EXEC for bootstrap gives bulk operations 700 + nearly for free. 701 + 702 + ### ITERATE 703 + 704 + Reads each cell in a range and emits a result token for each. Takes a 705 + wide pointer (base + length). The SM's sequencer walks the range, 706 + constructing result tokens using a pre-loaded return routing template. 707 + 708 + ### COPY_RANGE 709 + 710 + Copies a contiguous range of cells from one SM region to another (or to 711 + a different SM). Takes source wide pointer and destination base. Useful 712 + for structure copying, GC compaction. 713 + 714 + > **Design status:** ITERATE and COPY_RANGE are directionally committed. 715 + > Exact token format for range parameters, interaction with deferred reads 716 + > in the target range, and atomicity guarantees are still being refined. 717 + 718 + --- 719 + 720 + ## SM00 Bootstrap 721 + 722 + ### Reset Behaviour 723 + 724 + SM00 has dedicated wiring to the system reset signal. On reset: 725 + 726 + 1. SM00's sequencer triggers EXEC on a predetermined ROM base address 727 + 2. 
The ROM region contains pre-formed tokens: IRAM write tokens to load 728 + PE instruction memories, followed by seed tokens to start execution 729 + 3. SM00 emits these tokens onto the bus in order 730 + 4. PEs receive IRAM writes and load their instruction memories 731 + 5. Seed tokens fire and execution begins 732 + 733 + This is the only hardware specialisation of SM00 — the reset vector 734 + wiring. At runtime, SM00 behaves as a standard SM with standard opcodes. 735 + 736 + ### ROM Mapping 737 + 738 + The bootstrap ROM is mapped into SM00's address space (tier 0 — raw 739 + memory, no presence bits). It can be: 740 + 741 + - Physical ROM/EEPROM on SM00's address bus 742 + - A region of SM00's SRAM pre-loaded by an external microcontroller 743 + (development/prototyping) 744 + - Flash memory accessed via page register for larger images 745 + 746 + ### Future Specialisation (not committed) 747 + 748 + SM00 could be further specialised for IO: 749 + 750 + - Atomic/alloc opcodes could be repurposed for IO-specific operations 751 + (e.g., READ_INC becomes "read UART with auto-acknowledge") 752 + - Memory-mapped IO devices occupy a reserved address range within SM00 753 + - SM00 could have additional interrupt-sensing hardware that triggers 754 + token emission on external events 755 + 756 + This is documented as a design option but **not committed for v0**. The 757 + standard SM opcodes are sufficient for basic IO via I-structure semantics: 758 + READ from an IO-mapped address defers until the IO device writes data. 759 + 760 + --- 761 + 762 + ## Presence-Bit Guided IRAM Writes 763 + 764 + The matching store's presence bitmap provides information about whether 765 + any dyadic instruction slot has a pending (half-matched) operand. 
During 766 + IRAM writes, the PE can check the presence bits for the IRAM page being 767 + overwritten: 768 + 769 + - If all presence bits for slots in that page are clear, no tokens are 770 + pending — the page can be safely overwritten without drain delay 771 + - If any presence bit is set, the PE knows tokens are in flight for 772 + that instruction and can either wait or discard the stale operand 773 + 774 + This enables more targeted IRAM replacement than the blanket drain 775 + protocol. Instead of draining the entire PE, only the affected page needs 776 + attention. The valid-bit page protection mechanism 777 + (`bus-architecture-and-width-decoupling.md`) remains the safety net, but 778 + presence-bit checking can eliminate unnecessary stalls. 779 + 780 + --- 577 781 578 782 ## Open Design Questions 579 783 580 784 1. **CAS multi-flit handling** — how does the request FIFO handle 3-flit 581 - ops? does it buffer all flits before dispatching, or pipeline them? 785 + ops? Does it buffer all flits before dispatching, or pipeline them? 582 786 CAS needs expected_value, new_value, AND return routing — that's 48 583 787 bits of payload beyond the opcode/address. 3 flits minimum. 584 788 2. **Page register per-CM or global?** — if multiple CMs access the same 585 789 SM, do they share a page register (contention) or each have their own 586 - (more hardware, more config)? probably global for v0. 790 + (more hardware, more config)? Probably global for v0. 587 791 3. **Banking vs pipeline depth** — with 2 banks, can we overlap a read to 588 - bank 0 with a write to bank 1? worth the control complexity for v0? 589 - presence state SRAM complicates this — is presence per-bank or shared? 590 - if shared, it serialises cross-bank operations. if per-bank, each bank 591 - needs its own presence SRAM. probably shared for v0 (simpler). 792 + bank 0 with a write to bank 1? Worth the control complexity for v0? 
793 + Presence state SRAM complicates this — is presence per-bank or shared? 794 + If shared, it serialises cross-bank operations. If per-bank, each bank 795 + needs its own presence SRAM. Probably shared for v0 (simpler). 592 796 4. **SRAM chip selection** — specific part numbers, speed grades, package. 593 - needs to match the target clock frequency. presence state SRAM needs 797 + Needs to match the target clock frequency. Presence metadata SRAM needs 594 798 to be at least as fast as data SRAM (same access path). 595 - 5. **RAW_READ** — non-blocking read that returns data or empty indicator 596 - without registering a deferred read. useful for polling and diagnostics. 597 - opcode slot reserved (internal 0101) but not committed. implement if 598 - needed during testing. 599 - 6. **Atomic ops on non-FULL cells** — READ_INC/READ_DEC/CAS on EMPTY or 600 - WAITING cells is currently undefined. options: error, stall, or treat 601 - as zero. error is safest for v0. 602 - 7. **Direct path input mux** — when direct path is added, the SM needs a 603 - mux between bus input and direct input feeding the internal command 604 - bus. arbitration policy TBD (direct path priority? round-robin?). 799 + 5. **Atomic ops on non-FULL cells** — READ_INC/READ_DEC/CAS on EMPTY or 800 + WAITING cells is currently undefined. Options: error, stall, or treat 801 + as zero. Error is safest for v0. 802 + 6. **Direct path input mux** — when direct path is added, the SM needs a 803 + mux between bus input and direct input feeding the internal command 804 + bus. Arbitration policy TBD (direct path priority? round-robin?). 805 + 7. **Wide pointer read format** — does a READ on a wide cell return two 806 + separate result tokens or one 3-flit result token? Two tokens is 807 + simpler (reuse existing result formatter), 3-flit is more atomic. 808 + 8. **Tier boundary mechanism** — fixed at build time, config register, 809 + or address-range comparator? Fixed is simplest for v0. 810 + 9. 
**ITERATE return template** — how is the return routing template 811 + supplied for iterated results? Pre-loaded config register? Part of 812 + the ITERATE request? Template per element or shared? 813 + 10. **SM00 ROM size and mapping** — how large is the bootstrap ROM? What 814 + address range? How does it interact with the page register?
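The page check described under "Presence-Bit Guided IRAM Writes" above amounts to a masked scan of the matching-store presence bitmap. A minimal sketch — the 16-slots-per-page mapping and all names here are illustrative assumptions, not committed hardware detail:

```python
# Illustrative sketch of the presence-bit page check. The 16 dyadic
# slots per IRAM page and the function names are assumptions.

SLOTS_PER_PAGE = 16  # assumed: dyadic slots covered by one IRAM page

def page_is_quiescent(presence_bitmap: int, page: int) -> bool:
    """True if no dyadic slot in this IRAM page holds a half-matched operand."""
    mask = ((1 << SLOTS_PER_PAGE) - 1) << (page * SLOTS_PER_PAGE)
    return (presence_bitmap & mask) == 0

def safe_to_overwrite(presence_bitmap: int, page: int) -> bool:
    # All bits clear -> no tokens in flight -> skip the blanket drain protocol.
    return page_is_quiescent(presence_bitmap, page)
```

A set bit anywhere in the page's mask range means a token is in flight for that instruction, so the PE must wait or discard the stale operand, as described above.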
+79
design-notes/token-format-migration.md
··· 1 + # Token Format Migration Reference 2 + 3 + Reference for migrating the emulator and assembler from the old 2-bit type 4 + field token format to the 1-bit SM/CM split with prefix encoding. 5 + 6 + ## Old → New Token Type Mapping 7 + 8 + | Old Type | Old Format | New Prefix | New Format | 9 + |----------|-----------|------------|------------| 10 + | type 00, W=1 (dyadic wide) | `[00][PE:2][W:1=1][offset:4][ctx:4][port:1][gen:2]` | 00 | `[00][PE:2][offset:5][ctx:4][port:1][gen:2]` | 11 + | type 00, W=0 (dyadic narrow) | `[00][PE:2][W:1=0][offset:5][ctx:4][spare:2]` | 011+00 | `[011][PE:2][00][offset:5][ctx:4]` | 12 + | type 01, I=0 (monadic normal) | `[01][PE:2][I:1=0][offset:7][ctx:4]` | 010 | `[010][PE:2][offset:7][ctx:4]` | 13 + | type 01, I=1 (monadic inline) | `[01][PE:2][I:1=1][offset:4][ctx:4][spare:3]` | 011+10 | `[011][PE:2][10][offset:4][ctx:4][spare:1]` | 14 + | type 10 (structure) | `[10][SM_id:2][op:3][addr:9]` | 1 | `[1][SM_id:2][op:3-5][addr:8-10]` | 15 + | type 11, sub 00 (IO) | `[11][00][device:3][register:4][R/W:1][pad:4]` | 1 (SM token to IO-mapped SM) | standard SM token to SM00 IO-mapped address | 16 + | type 11, sub 01 (config write / IRAM load) | `[11][01][target_PE:2][flags:2][addr:8]` | 011+01 | `[011][PE:2][01][iram_addr:7][flags:2]` | 17 + 18 + ## Key Changes 19 + 20 + ### Dyadic Wide: offset widened from 4 to 5 bits 21 + - Old: 16 dyadic slots per context 22 + - New: 32 dyadic slots per context 23 + - Matching store grows from 256 cells (8-bit addr) to 512 cells (9-bit addr) 24 + 25 + ### SM Tokens: 1-bit prefix, variable opcode width 26 + - Old: type=10 (2-bit), 3-bit op, 9-bit addr 27 + - New: bit[15]=1, 3-bit op + 10-bit addr (common) or 5-bit op + 8-bit addr (rare) 28 + - Extra opcode space gained: 13 bits for op+addr vs old 12 29 + 30 + ### SysToken / CfgToken / IOToken: eliminated 31 + - `SysToken` → removed entirely 32 + - `CfgToken` → IRAM write functionality moves to CM misc-bucket (prefix 011+01) 33 + - `LoadInstToken` → 
absorbed into IRAM write token handling in PE 34 + - `RouteSetToken` → absorbed into IRAM write token handling (or deferred) 35 + - `IOToken` → IO operations become standard SM tokens to SM00 36 + 37 + ### SM Opcode Table: variable 3/5 encoding 38 + - Old: 3-bit fixed with 1-bit addr/ext split → 4-bit max, 8 or 9-bit addr 39 + - New: 3-bit (10-bit addr) or 5-bit (8-bit addr), discriminated by op[2:1] 40 + - New opcodes available: EXEC, SET_PAGE, WRITE_IMM, RAW_READ 41 + 42 + ### SM Presence Metadata: widened to 4 bits 43 + - Old: 2-bit (presence only) 44 + - New: 4-bit (presence:2 + is_wide:1 + spare:1) 45 + 46 + ## Emulator Changes Required 47 + 48 + ### tokens.py 49 + - Remove `SysToken`, `CfgToken`, `LoadInstToken`, `RouteSetToken`, `IOToken` 50 + - Update `SMToken` to support variable opcode width 51 + - Consider adding `IRAMWriteToken` as a `CMToken` subclass 52 + 53 + ### emu/pe.py 54 + - Handle IRAM write tokens as a CM token subclass instead of CfgToken 55 + - Update matching store sizing if testing with 5-bit offset 56 + 57 + ### emu/sm.py 58 + - Add new opcodes (EXEC, SET_PAGE, WRITE_IMM, RAW_READ) 59 + - Widen presence metadata to 4 bits 60 + - Add is_wide support 61 + 62 + ### emu/network.py 63 + - Update `System.inject()` / `System.send()` routing to use 1-bit SM/CM 64 + split instead of isinstance checks on SysToken/CfgToken 65 + - Remove IOToken routing 66 + 67 + ### cm_inst.py 68 + - Add new SM opcodes to `MemOp` enum 69 + - Remove `CfgOp` enum if CfgToken is eliminated 70 + 71 + ## Assembler Changes Required 72 + 73 + ### asm/codegen.py 74 + - Update token stream generation to use new prefix encoding 75 + - Replace CfgToken generation with IRAM write token generation 76 + - Update SM token generation for variable opcode width 77 + 78 + ### asm/ir.py 79 + - May need to represent IRAM write operations differently
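The CM-side rows of the mapping table above can be sanity-checked with a small packing sketch. This is a hypothetical helper, not emulator code: the function names are invented, and the layouts are transcribed directly from the New Format column (16-bit flits, bit 15 as MSB, bit[15]=0 for CM and bit[15]=1 for SM).

```python
# Hypothetical flit-1 packers for the new CM prefixes, transcribed from
# the New Format column above (16-bit flits, bit 15 = MSB).

def pack_dyadic_wide(pe: int, offset: int, ctx: int, port: int, gen: int) -> int:
    # [00][PE:2][offset:5][ctx:4][port:1][gen:2]
    assert pe < 4 and offset < 32 and ctx < 16 and port < 2 and gen < 4
    return (0b00 << 14) | (pe << 12) | (offset << 7) | (ctx << 3) | (port << 2) | gen

def pack_monadic_normal(pe: int, offset: int, ctx: int) -> int:
    # [010][PE:2][offset:7][ctx:4]
    assert pe < 4 and offset < 128 and ctx < 16
    return (0b010 << 13) | (pe << 11) | (offset << 4) | ctx

def pack_monadic_inline(pe: int, offset: int, ctx: int) -> int:
    # [011][PE:2][10][offset:4][ctx:4][spare:1]
    assert pe < 4 and offset < 16 and ctx < 16
    return (0b011 << 13) | (pe << 11) | (0b10 << 9) | (offset << 5) | (ctx << 1)

def is_sm_token(flit: int) -> bool:
    # The point of the migration: routing needs only bit 15.
    return bool(flit >> 15)
```

All three packers leave bit 15 clear, so the router's 1-bit discriminator works without inspecting the rest of the prefix.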
+341
docs/design-plans/2026-02-26-token-migration.md
··· 1 + # Token Format Migration Design 2 + 3 + ## Summary 4 + 5 + This design plan migrates the OR1 dataflow CPU emulator and assembler from an older token encoding scheme to a cleaner one. In the old scheme, the system used a 2-bit type field to distinguish four categories of token: CM (computation), SM (structure memory), IO, and a "system" catch-all that handled IRAM loading, routing configuration, and other control operations. That system category is being eliminated. IO is folded into the SM address space (the IO module becomes a memory-mapped region on SM00), and IRAM loading becomes a new CM token subtype called `IRAMWriteToken`. The result is a simpler 1-bit discriminator: a token is either headed for a Structure Memory unit or a Computation Module, with no special cases in the network router. 6 + 7 + Alongside the routing simplification, the Structure Memory gains a two-tier address model. Addresses below a configurable boundary (default 256) retain the existing I-structure semantics — presence tracking, deferred reads, write-once enforcement, and atomic operations. Addresses at or above that boundary become raw shared storage (T0): no presence checking, no deferral, and a single store shared across all SM instances. A new EXEC opcode reads pre-formed tokens from T0 and injects them directly into the network, providing both the system bootstrap path (loading programs from ROM at reset) and a runtime mechanism for bulk token emission. The migration is implemented bottom-up — type definitions first, then emulator core, then assembler and tools, then test updates — with each phase building on the previous. 8 + 9 + ## Definition of Done 10 + 11 + Migrate the OR1 emulator, assembler, and supporting tools from the old 2-bit type field token format to the 1-bit SM/CM split with prefix encoding, as specified in the updated design notes. The old token hierarchy (SysToken, CfgToken, LoadInstToken, RouteSetToken, IOToken) is eliminated. 
SM gains memory tier support (T0 shared raw storage, T1 per-SM I-structure) and new opcodes. All existing tests pass (updated as needed) and new tests cover the changed functionality. 12 + 13 + Specifically: 14 + 15 + - **tokens.py**: Remove SysToken, CfgToken, IOToken, LoadInstToken, RouteSetToken. Add IRAMWriteToken as a CMToken subclass. DyadToken retains `wide: bool` field. 16 + - **cm_inst.py**: Remove CfgOp enum. Extend MemOp with new opcodes (EXEC, SET_PAGE, WRITE_IMM, RAW_READ, EXT) as a flat internal enum (no encoding tier enforcement). 17 + - **emu/sm.py**: T0/T1 memory tier split. T0 is a shared `list[Token]` across all SMs, with a configurable boundary (default: address 256). New SM opcodes implemented (at minimum EXEC for bootstrap, RAW_READ). Presence metadata widened to 4 bits (presence:2 + is_wide:1 + spare:1). 18 + - **emu/pe.py**: Handle IRAMWriteToken instead of CfgToken/LoadInstToken. Remove RouteSetToken handling. 19 + - **emu/network.py**: Simplify routing to 1-bit SM/CM isinstance check. Remove IOToken/CfgToken routing. 20 + - **asm/**: Update codegen to emit IRAMWriteToken. Remove all CfgOp references. Add T0 boundary awareness for SM address allocation. 21 + - **dfgraph/**: Update CfgOp category handling. 
22 + - **All tests pass.** 23 + 24 + ## Acceptance Criteria 25 + 26 + ### token-migration.AC1: Old token types removed 27 + - **token-migration.AC1.1 Success:** `tokens.py` has no SysToken, CfgToken, IOToken, LoadInstToken, or RouteSetToken classes 28 + - **token-migration.AC1.2 Success:** `cm_inst.py` has no CfgOp enum 29 + - **token-migration.AC1.3 Success:** No module in the codebase imports any deleted type 30 + 31 + ### token-migration.AC2: IRAMWriteToken works 32 + - **token-migration.AC2.1 Success:** IRAMWriteToken routes to target PE via network (isinstance CMToken) 33 + - **token-migration.AC2.2 Success:** PE receives IRAMWriteToken and writes instructions to IRAM at the specified offset 34 + - **token-migration.AC2.3 Success:** PE executes instructions loaded via IRAMWriteToken correctly 35 + - **token-migration.AC2.4 Failure:** IRAMWriteToken with invalid target PE raises or is dropped 36 + 37 + ### token-migration.AC3: MemOp enum updated 38 + - **token-migration.AC3.1 Success:** MemOp contains EXEC, EXT, SET_PAGE, WRITE_IMM, RAW_READ, CLEAR with correct tier grouping 39 + - **token-migration.AC3.2 Success:** ALLOC, FREE remain in tier 1 (3-bit); CLEAR in tier 2 (5-bit) 40 + - **token-migration.AC3.3 Success:** Assembler mnemonic mapping includes all new opcodes 41 + 42 + ### token-migration.AC4: SM T0/T1 tier split 43 + - **token-migration.AC4.1 Success:** SM operations on addresses below tier_boundary use I-structure semantics (presence tracking, deferred reads) 44 + - **token-migration.AC4.2 Success:** SM WRITE to T0 address stores data without presence checking 45 + - **token-migration.AC4.3 Success:** SM READ on T0 address returns immediately (no deferral) 46 + - **token-migration.AC4.4 Success:** T0 storage is shared — all SMs reference the same T0 store 47 + - **token-migration.AC4.5 Failure:** I-structure ops (CLEAR, ALLOC, FREE, atomics) on T0 address produce error 48 + - **token-migration.AC4.6 Edge:** Tier boundary is configurable via 
SMConfig; default is 256 49 + 50 + ### token-migration.AC5: EXEC opcode 51 + - **token-migration.AC5.1 Success:** EXEC reads Token objects from T0 starting at given address and injects them into the network 52 + - **token-migration.AC5.2 Success:** Injected tokens are processed normally by target PEs/SMs 53 + - **token-migration.AC5.3 Success:** EXEC can load a program (IRAM writes + seed tokens) from T0 that executes correctly 54 + - **token-migration.AC5.4 Edge:** EXEC on empty T0 region is a no-op 55 + 56 + ### token-migration.AC6: Presence metadata widened 57 + - **token-migration.AC6.1 Success:** SMCell has is_wide field (default False) 58 + - **token-migration.AC6.2 Success:** Existing I-structure behaviour unchanged (is_wide=False path) 59 + 60 + ### token-migration.AC7: Assembler updated 61 + - **token-migration.AC7.1 Success:** Token stream mode emits IRAMWriteToken (not LoadInstToken) 62 + - **token-migration.AC7.2 Success:** Token stream mode does not emit RouteSetToken 63 + - **token-migration.AC7.3 Success:** Direct mode (PEConfig/SMConfig) still works 64 + - **token-migration.AC7.4 Success:** Assembler round-trip (serialize -> parse -> assemble) works with updated types 65 + 66 + ### token-migration.AC8: All tests pass 67 + - **token-migration.AC8.1 Success:** `python -m pytest tests/ -v` exits with zero failures 68 + 69 + ## Glossary 70 + 71 + - **Token**: The fundamental unit of data and control in OR1. Carries a value and a destination address; computation fires when all required tokens for an instruction arrive. 72 + - **CM (Computation Module)**: A processing element in the dataflow array. Receives tokens, matches operand pairs, executes IRAM instructions, and emits result tokens. 73 + - **SM (Structure Memory)**: A memory controller with I-structure semantics. Manages presence state per cell, handles deferred reads, and coordinates producer-consumer synchronisation without locks. 
74 + - **DyadToken / MonadToken**: The two standard CM token subtypes. Dyadic instructions require two operands (matched before firing); monadic require one. 75 + - **IRAMWriteToken**: New CM token subtype introduced by this migration. Carries a block of instructions and a target IRAM address. Replaces LoadInstToken. 76 + - **IRAM**: Instruction RAM. Each PE has a small local memory storing the instructions it executes when a matching pair fires. Loaded at startup or patched at runtime via IRAMWriteToken. 77 + - **SysToken / CfgToken / LoadInstToken / RouteSetToken / IOToken**: The old token types being deleted. Together they formed the "system" category (type-11 in the old 2-bit encoding). 78 + - **I-structure**: A single-assignment memory cell with four states: EMPTY, RESERVED, FULL, WAITING. A READ on an empty cell defers until a WRITE arrives. 79 + - **T0 (tier 0)**: Raw shared storage in the SM address space. No presence bits, no deferred reads. Shared across all SM instances. Used for program storage and the EXEC bootstrap region. 80 + - **T1 (tier 1)**: The per-SM I-structure region of SM address space, below the tier boundary. Full presence tracking and deferred-read semantics. 81 + - **EXEC opcode**: A new SM operation that reads Token objects from a T0 region and injects them into the network. Used for system bootstrap and runtime bulk token emission. 82 + - **Tier boundary**: A configurable address (default: 256) that divides T0 from T1 within an SM's address space. 83 + - **MemOp**: The enum in `cm_inst.py` listing all SM opcodes. This migration adds EXEC, SET_PAGE, WRITE_IMM, RAW_READ, and EXT. 84 + - **CfgOp**: The enum being deleted. Listed configuration opcodes (LOAD_INST, ROUTE_SET) that backed the old CfgToken mechanism. 85 + - **Matching store**: The per-PE 2D array that holds one half of a dyadic token pair while waiting for the other. Indexed by context slot and IRAM offset. 86 + - **SimPy**: A Python discrete-event simulation library. 
OR1's emulator is implemented as SimPy processes communicating via SimPy Stores. 87 + - **Frozen dataclass**: A Python dataclass with `frozen=True`, making instances immutable and hashable. All token types use this pattern. 88 + - **Lark**: A Python parsing toolkit used to implement the dfasm grammar. 89 + - **dfasm**: The assembly language for OR1 dataflow graphs. Translated by the assembler into emulator configuration and token streams. 90 + - **dfgraph**: The interactive graph renderer for dfasm files. FastAPI server with TypeScript/Cytoscape.js frontend. 91 + 92 + ## Architecture 93 + 94 + Bottom-up migration: change type definitions first (`tokens.py`, `cm_inst.py`), 95 + then fix all consumers (emulator, assembler, graph renderer, tests). 96 + 97 + ### Token Hierarchy (new) 98 + 99 + ``` 100 + Token(target: int) 101 + ├── CMToken(Token) — offset, ctx, data 102 + │ ├── DyadToken(CMToken) — port, gen, wide 103 + │ ├── MonadToken(CMToken) — inline 104 + │ └── IRAMWriteToken(CMToken) — instructions: tuple[ALUInst | SMInst, ...] 105 + └── SMToken(Token) — addr, op, flags, data, ret 106 + ``` 107 + 108 + `IRAMWriteToken` inherits `CMToken` so network routing (`isinstance(token, CMToken)`) 109 + sends it to PEs without special cases. The `offset` field serves as `iram_addr`; 110 + `ctx` and `data` are set to 0 (unused but satisfying the frozen dataclass contract). 111 + 112 + `SysToken`, `CfgToken`, `LoadInstToken`, `RouteSetToken`, `IOToken` are deleted. 113 + `DyadToken` retains `wide: bool` — the prefix distinction (dyadic wide vs narrow) 114 + is an encoding detail, not a semantic one at the emulator level. 115 + 116 + ### MemOp Enum (new) 117 + 118 + Flat sequential enum. 
The tier 1 / tier 2 grouping documents the intended hardware 119 + encoding but is not enforced in the emulator: 120 + 121 + ``` 122 + Tier 1 (3-bit opcode, 10-bit addr — full 1024-cell range): 123 + READ=0, WRITE=1, EXEC=2, ALLOC=3, FREE=4, EXT=5 124 + 125 + Tier 2 (5-bit opcode, 8-bit payload — 256-cell range): 126 + CLEAR=6, RD_INC=7, RD_DEC=8, CMP_SW=9, RAW_READ=10, 127 + SET_PAGE=11, WRITE_IMM=12 128 + ``` 129 + 130 + EXEC is in the 3-bit tier because it needs full address range to reach T0 at high 131 + addresses. CLEAR is in the 5-bit tier because it is purely an I-structure operation 132 + (presence state reset) that only applies to T1 cells in lower address space. 133 + 134 + `CfgOp` enum is deleted entirely. IRAM writes are identified by 135 + `isinstance(token, IRAMWriteToken)`. 136 + 137 + ### SM Memory Tier Model 138 + 139 + SM address space splits into two tiers at a configurable boundary (default: 256): 140 + 141 + - **T1 (below boundary):** Per-SM I-structure cells with presence tracking, deferred 142 + reads, atomic ops. SM_id is conceptually part of the address — each SM owns its 143 + own T1 space independently. Current `list[SMCell]` model, unchanged. 144 + 145 + - **T0 (at/above boundary):** Shared raw storage across all SMs. No presence tracking, 146 + no deferred reads. Modelled as a single `list[Token]` referenced by all SM instances. 147 + T0 is the same physical address space regardless of which SM receives the request. 148 + 149 + T0 is initially a list of Token objects (not bytes/words) to avoid encoding/decoding 150 + in the emulator. EXEC iterates T0 and injects tokens into the network. Future work 151 + will swap T0 to `list[int]` (16-bit words) when modelling programs that access T0 152 + as normal data. 153 + 154 + **T0 operations:** WRITE stores into T0 (no presence check). EXEC reads from T0 and 155 + injects tokens. All I-structure-specific operations (CLEAR, ALLOC, FREE, atomics) 156 + on T0 addresses are errors. 
READ on T0 returns immediately (no deferral). 157 + 158 + **EXEC flow:** SM receives `SMToken(op=EXEC, addr=start)`, iterates `t0_store[start:]`, 159 + calls `system.send(token)` for each entry. SM holds a reference to the `System` object 160 + (set by `build_topology()`). 161 + 162 + ### Presence Metadata 163 + 164 + `SMCell` gains `is_wide: bool = False` for the widened 4-bit presence metadata 165 + (presence:2 + is_wide:1 + spare:1). The spare bit is unused. `Presence` enum values 166 + are unchanged (EMPTY/RESERVED/FULL/WAITING). 167 + 168 + ### Network Routing 169 + 170 + `System._target_store()` simplifies to: 171 + 172 + ``` 173 + SMToken → sms[token.target].input_store 174 + CMToken → pes[token.target].input_store 175 + ``` 176 + 177 + No special cases for CfgToken, IOToken, or any other subtype. `IRAMWriteToken` 178 + routes to PEs automatically as a CMToken subclass. 179 + 180 + ### PE Token Handling 181 + 182 + The PE's `_run()` loop replaces: 183 + ``` 184 + isinstance(token, CfgToken) → _handle_cfg() 185 + ``` 186 + with: 187 + ``` 188 + isinstance(token, IRAMWriteToken) → _handle_iram_write() 189 + ``` 190 + 191 + `_handle_iram_write` writes `token.instructions` into IRAM at `token.offset`. 192 + `RouteSetToken` handling is deleted (route restriction dropped; full mesh is the 193 + v0 default). Route restriction will return later via a different mechanism when 194 + multi-cluster topologies are needed. 195 + 196 + ### Assembler Token Emission 197 + 198 + `asm/codegen.py` token stream mode currently emits: 199 + `SM init → ROUTE_SET → LOAD_INST → seeds` 200 + 201 + New ordering: 202 + `SM init → IRAM writes → seeds` 203 + 204 + `RouteSetToken` emission is removed. `LoadInstToken` emission becomes 205 + `IRAMWriteToken` emission with `target=pe_id`, `offset=iram_addr`, `ctx=0`, 206 + `data=0`, `instructions=(...)`. 207 + 208 + All `CfgOp` references removed from `asm/opcodes.py`, `asm/lower.py`, and 209 + `asm/codegen.py`. 
New MemOp members added to mnemonic mapping in `opcodes.py`. 210 + 211 + ### dfgraph Category Handling 212 + 213 + `dfgraph/categories.py` removes `isinstance(op, CfgOp)` branch from `categorise()`. 214 + New MemOp values (EXEC, SET_PAGE, etc.) either fall into the existing SM category 215 + or a new "control" subcategory. 216 + 217 + ## Existing Patterns 218 + 219 + Investigation found the following patterns in the codebase: 220 + 221 + - **Token hierarchy** follows frozen dataclass inheritance (`tokens.py`). New 222 + `IRAMWriteToken` follows the same pattern. 223 + - **SM operations** use `match token.op:` dispatch in `_run()` loop (`emu/sm.py`). 224 + New opcodes follow this pattern. 225 + - **Network routing** uses `isinstance` dispatch (`emu/network.py`). Simplified 226 + routing follows the same pattern with fewer branches. 227 + - **PE config handling** uses `isinstance` dispatch for token subtypes (`emu/pe.py`). 228 + `IRAMWriteToken` replaces `CfgToken` in this dispatch. 229 + - **Assembler codegen** constructs token objects directly (`asm/codegen.py`). 230 + `IRAMWriteToken` replaces `LoadInstToken`/`RouteSetToken` construction. 231 + - **Config types** use frozen dataclasses (`emu/types.py`). `SMConfig` gains 232 + `tier_boundary` field following the same pattern. 233 + 234 + No new patterns are introduced. All changes follow existing conventions. 235 + 236 + ## Implementation Phases 237 + 238 + <!-- START_PHASE_1 --> 239 + ### Phase 1: Type Definitions 240 + 241 + **Goal:** Update token hierarchy and instruction set enums to the new format. 242 + 243 + **Components:** 244 + - `tokens.py` — delete SysToken/CfgToken/IOToken/LoadInstToken/RouteSetToken; add IRAMWriteToken 245 + - `cm_inst.py` — delete CfgOp enum; update MemOp with new opcodes and sequential values 246 + - `sm_mod.py` — add `is_wide: bool` field to SMCell 247 + 248 + **Dependencies:** None (first phase). 249 + 250 + **Done when:** Type definitions compile. 
Importing `tokens` and `cm_inst` does not 251 + error. All downstream breakage is expected and addressed in subsequent phases. 252 + <!-- END_PHASE_1 --> 253 + 254 + <!-- START_PHASE_2 --> 255 + ### Phase 2: Emulator Core 256 + 257 + **Goal:** Update emulator to use new token types and add SM memory tier support. 258 + 259 + **Components:** 260 + - `emu/types.py` — add `tier_boundary: int = 256` to SMConfig 261 + - `emu/sm.py` — T0/T1 dispatch based on address vs tier_boundary; T0 operations (WRITE, READ, EXEC); accept `t0_store` and `system` references; add EXEC implementation; unimplemented new opcodes raise error 262 + - `emu/pe.py` — replace CfgToken/LoadInstToken/RouteSetToken handling with IRAMWriteToken; remove `_handle_cfg`; add `_handle_iram_write` 263 + - `emu/network.py` — simplify `_target_store` to SMToken/CMToken isinstance; create shared T0 store in `build_topology`; pass T0 store and System reference to all SMs 264 + - `emu/__init__.py` — update exports if needed 265 + 266 + **Dependencies:** Phase 1 (type definitions). 267 + 268 + **Done when:** Emulator processes IRAMWriteToken correctly. SM handles T0/T1 split. 269 + EXEC reads from T0 and injects tokens. Existing emulator behaviour for T1 operations 270 + is preserved. Tests covering these behaviours pass. 271 + <!-- END_PHASE_2 --> 272 + 273 + <!-- START_PHASE_3 --> 274 + ### Phase 3: Assembler and Tools 275 + 276 + **Goal:** Update assembler and dfgraph to use new types. 
277 + 278 + **Components:** 279 + - `asm/codegen.py` — emit IRAMWriteToken instead of LoadInstToken/RouteSetToken; remove ROUTE_SET from token stream ordering; remove CfgOp import 280 + - `asm/opcodes.py` — remove CfgOp from MNEMONIC_TO_OP and type-aware containers; add new MemOp mnemonics (exec, set_page, write_imm, raw_read, ext, clear updated value) 281 + - `asm/lower.py` — remove CfgOp from opcode() return type 282 + - `dfgraph/categories.py` — remove CfgOp isinstance branch; handle new MemOp values 283 + 284 + **Dependencies:** Phase 1 (type definitions). 285 + 286 + **Done when:** Assembler produces IRAMWriteToken in token stream mode. All assembler 287 + pipeline stages work with updated types. dfgraph categorises new opcodes. Tests pass. 288 + <!-- END_PHASE_3 --> 289 + 290 + <!-- START_PHASE_4 --> 291 + ### Phase 4: Test Updates and New Coverage 292 + 293 + **Goal:** Update all broken tests and add coverage for new functionality. 294 + 295 + **Components:** 296 + - `tests/test_pe.py` — delete RouteSetToken tests; update LoadInstToken tests to IRAMWriteToken 297 + - `tests/test_codegen.py` — update CfgToken/RouteSetToken/LoadInstToken assertions to IRAMWriteToken 298 + - `tests/test_integration.py` — update TestTask5CfgTokenLoadInst to use IRAMWriteToken 299 + - `tests/test_e2e.py` — update CfgToken isinstance check to IRAMWriteToken 300 + - `tests/test_opcodes.py` — remove CfgOp assertions; add new MemOp assertions 301 + - `tests/test_dfgraph_categories.py` — remove TestCategoriseCfgOp; add new MemOp tests 302 + - `tests/conftest.py` — update Hypothesis strategies if they generate old token types 303 + - New tests for: T0/T1 tier split, EXEC opcode, IRAMWriteToken routing, T0 boundary enforcement 304 + 305 + **Dependencies:** Phases 2 and 3 (emulator and assembler updated). 306 + 307 + **Done when:** `python -m pytest tests/ -v` passes with zero failures. New tests 308 + cover T0/T1 split, EXEC, and IRAMWriteToken routing. 
309 + <!-- END_PHASE_4 --> 310 + 311 + <!-- START_PHASE_5 --> 312 + ### Phase 5: Documentation 313 + 314 + **Goal:** Update project documentation to reflect new token format. 315 + 316 + **Components:** 317 + - `CLAUDE.md` — update Token Hierarchy, Architecture Contracts, and Module Dependency Graph sections 318 + - `asm/CLAUDE.md` — update Dependencies and Invariants sections (remove CfgOp/RouteSetToken/LoadInstToken references, add IRAMWriteToken) 319 + - `dfgraph/CLAUDE.md` — update category references if present 320 + 321 + **Dependencies:** Phase 4 (all code changes complete and tested). 322 + 323 + **Done when:** CLAUDE.md files accurately describe the current codebase state. No 324 + references to deleted types remain in documentation. 325 + <!-- END_PHASE_5 --> 326 + 327 + ## Additional Considerations 328 + 329 + **Route restriction (deferred):** RouteSetToken and route restriction are removed in 330 + this migration. Route restriction will return in a future design when multi-cluster 331 + (4-CM cluster) topologies are needed, likely as a misc-bucket subtype (011+11) or a 332 + PE config register mechanism. 333 + 334 + **T0 evolution:** T0 is initially modelled as `list[Token]` for simplicity. When 335 + programs need to access T0 as normal data (framebuffers, lookup tables), T0 will 336 + migrate to `list[int]` (16-bit words) with token serialisation/deserialisation. 337 + This is a separate future design. 338 + 339 + **Unimplemented opcodes:** SET_PAGE, WRITE_IMM, RAW_READ, and EXT exist as MemOp 340 + enum members but raise NotImplementedError if they reach the SM's run loop. They are 341 + placeholders for future SM features and ensure the assembler can parse them.
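The T0/T1 dispatch and EXEC flow described in the Architecture section can be condensed into a toy model. This is a sketch under the plan's stated assumptions — `tier_boundary`, the shared `list[Token]` T0 store, and injection via `system.send` come from the design above, but the class and method names are invented, and the real SM is a SimPy process with write-once enforcement and deferred reads on T1:

```python
# Toy model of the SM tier split and EXEC flow. Names are illustrative;
# the real emulator SM is a SimPy process and routes via system.send().
from dataclasses import dataclass, field

@dataclass
class MiniSM:
    tier_boundary: int = 256                              # T1 below, T0 at/above
    t1: dict[int, object] = field(default_factory=dict)   # per-SM I-structure cells
    t0: list = field(default_factory=list)                # shared raw token store
    injected: list = field(default_factory=list)          # stand-in for network send

    def write(self, addr: int, value) -> None:
        if addr >= self.tier_boundary:
            # T0: raw store, no presence checking
            idx = addr - self.tier_boundary
            self.t0.extend([None] * (idx + 1 - len(self.t0)))
            self.t0[idx] = value
        else:
            self.t1[addr] = value          # real SM enforces write-once here

    def read(self, addr: int):
        if addr >= self.tier_boundary:
            return self.t0[addr - self.tier_boundary]   # immediate, never deferred
        return self.t1.get(addr)           # real SM defers on EMPTY

    def exec_(self, start: int) -> None:
        # EXEC: inject every pre-formed token from T0 into the network.
        # Empty slots (a sketch simplification) are skipped; an empty
        # region makes EXEC a no-op, matching AC5.4.
        for tok in self.t0[start - self.tier_boundary:]:
            if tok is not None:
                self.injected.append(tok)  # real SM calls system.send(tok)
```

The bootstrap path is then just: an external loader (or ROM mapping) writes IRAM-write and seed tokens into T0, and a single `EXEC` at the start address replays them into the network.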
+1 -1
tokens.py
··· 1 1 from dataclasses import dataclass 2 - +from typing import Optional 2 + from typing import Optional 3 3 4 4 from cm_inst import ALUInst, CfgOp, MemOp, Port, SMInst 5 5