OR-1 dataflow CPU sketch
# Bus Architecture and Interconnect Design

Covers the physical bus implementation, node transceivers, arbitration,
flit reception, and the scaling path from shared bus to split AN/CN/DN
networks. Companion to `pe-redesign-frames-and-pipeline.md` (PE pipeline
and token format) and `network-and-communication.md` (logical network
model and clocking discipline).

---

## Design Context

The system has up to 4 PEs (CMs) and up to 4 SMs, communicating via 16-bit
flit-based tokens. All tokens are 1–3 flits. Flit 1 is self-describing:
the prefix bits (bit[15] for SM/CM, bits[13:12] for destination ID) tell
any node on the bus who the packet is for and how long it is.

Two physical bus configurations are documented: shared (simpler, fewer
chips, more contention) and split AN/CN/DN (more chips, more bandwidth,
closer to historical dataflow machine practice). Both use the same node
interface design and token format. The split is a wiring and arbiter
change, not a protocol change.

---

## Shared Bus (v0 / v0.5 starting point)

### Topology

Single 16-bit bus connecting all nodes. Every node sees every flit.
Receivers filter by destination address. One bus master at a time;
arbitration determines who drives.

```
┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐
│ PE 0 │   │ PE 1 │   │ PE 2 │   │ PE 3 │
└──┬───┘   └──┬───┘   └──┬───┘   └──┬───┘
   │          │          │          │
═══╪══════════╪══════════╪══════════╪══════════╗
   │          │          │          │  16-bit  ║ shared bus
═══╪══════════╪══════════╪══════════╪══════════╝
   │          │          │          │
┌──┴───┐   ┌──┴───┐   ┌──┴───┐   ┌──┴───┐
│ SM 0 │   │ SM 1 │   │ SM 2 │   │ SM 3 │
└──────┘   └──────┘   └──────┘   └──────┘
              ┌─────────┐
              │ ARBITER │
              └─────────┘
```

### Bus Signals

```
DATA[15:0]    16-bit data bus. driven by current bus master, hi-Z otherwise.
FLIT_VALID    asserted by bus master. indicates DATA is a valid flit.
BUS_HOLD      asserted by bus master. "more flits coming, don't re-arbitrate."
BUS_REQ[7:0]  one per node.
              active = "i have a packet to send."
BUS_GRANT     active for current bus master. directly enables output drivers.
```

FLIT_VALID and BUS_HOLD are active-high, driven only by the granted node.
since only one node drives at a time, no wired-OR or open-collector needed.

### Packet Transmission Protocol

```
IDLE:
    nodes with pending output assert BUS_REQ.
    arbiter selects winner, asserts BUS_GRANT for that node.

FLIT 1:
    winner's output latch OE goes active (drives DATA bus).
    winner puts flit 1 on DATA, asserts FLIT_VALID.
    winner asserts BUS_HOLD (if more flits follow).
    all receivers inspect DATA for destination match.
    matching receiver captures flit 1.

FLIT 2:
    winner puts flit 2 on DATA. FLIT_VALID stays high.
    matching receiver captures flit 2.
    if packet complete (2-flit): winner deasserts BUS_HOLD.

FLIT 3 (CAS, EXT only):
    winner puts flit 3 on DATA. FLIT_VALID stays high.
    matching receiver captures flit 3.
    winner deasserts BUS_HOLD.

RELEASE:
    BUS_HOLD drops. arbiter deasserts GRANT.
    arbiter immediately checks BUS_REQ for next winner.
    (zero idle cycles between back-to-back packets if another REQ is pending.)
```

Once granted, the bus master holds the bus for the full packet duration
(2–3 flit cycles). no preemption, no interleaving. packet atomicity is
guaranteed by BUS_HOLD.

### Chip Count (Shared Bus)

```
per PE:   ~9 chips (bus interface)
          +5 chips (loopback: mux + comparator + control)
          = ~14 chips per PE
per SM:   ~9 chips (bus interface, no loopback needed)
4 PEs + 4 SMs: 4×14 + 4×9 = 92 chips
arbiter:         ~6 chips
BUSY subsystem:  ~8-12 chips
bus control:     ~2 chips
                 ─────────
total:           ~108-112 chips
```

---

## Split AN/CN/DN (v0.5+ upgrade)

### Topology

Three separate 16-bit buses, each carrying one logical traffic class.
Follows Amamiya's DFM architecture, where AN and DN are unidirectional.
```
CN bus (bidirectional, CM↔CM):
    PE 0 ←→ PE 1 ←→ PE 2 ←→ PE 3          compute tokens between PEs.
                                           4 nodes, bidirectional.

AN bus (unidirectional, CM→SM):
    PE 0 → ┐
    PE 1 → ┤                               SM operation requests.
    PE 2 → ├→ SM 0, SM 1, SM 2, SM 3       4 senders (PEs), 4 receivers (SMs).
    PE 3 → ┘

DN bus (unidirectional, SM→CM):
    SM 0 → ┐
    SM 1 → ┤                               SM operation responses.
    SM 2 → ├→ PE 0, PE 1, PE 2, PE 3       4 senders (SMs), 4 receivers (PEs).
    SM 3 → ┘
```

### What the Split Buys

**Bandwidth.** PE-to-PE compute traffic and SM request/response traffic are
fully decoupled. PE0 can send an SM READ on AN while PE1 sends a compute
token to PE2 on CN, simultaneously. effective bandwidth is roughly 3× the
shared bus for mixed workloads.

**SM round-trip latency.** request goes out on AN, response comes back on DN.
the two phases never compete for the same bus. on the shared bus, the
response must wait for a bus grant that competes with outgoing requests and
inter-PE traffic.

**Reduced contention per bus.** CN has 4 nodes (PEs only). AN has 4 senders
(PEs). DN has 4 senders (SMs). each bus sees half (or less) of the traffic
the shared bus would see.

**Cleaner scaling.** adding SMs doesn't increase CN contention. adding PEs
doesn't increase DN contention. traffic classes scale independently.

### What the Split Costs

**Wiring.** three buses × (16 data + control) vs one. ribbon cables make
this manageable: each bus is a 20-pin IDC ribbon (16 data + FLIT_VALID +
BUS_HOLD + REQ_line + GND). three ribbons. first PCB candidate: a bus
backplane with three sets of IDC connectors and the three arbiters.

**Chips.** each PE needs ports on CN (bidirectional) + AN (output only) +
DN (input only). each SM needs ports on AN (input only) + DN (output only).
see chip count below.

**SM→SM path.** on the shared bus, SM0's `exec` can emit SM tokens directly.
on the split bus, SM0 only has DN output (→ CMs). SM-bound tokens from
exec must route through a loader PE: SM0 emits CM tokens onto DN, the
loader PE receives them, executes instructions that construct SM tokens,
and emits those onto AN. one extra hop, only during bootstrap, no runtime
cost. alternatively, the spare CM token subtype (011+11) could serve as a
bus-level "forward to AN" hint, but the loader PE approach is cleaner and
doesn't burn encoding space.

### Node Bus Ports (Split Configuration)

```
PE:   CN out + CN in   +  AN out      +  DN in
      bidirectional       send only      receive only

SM:   AN in         +  DN out
      receive only     send only

SM0:  AN in + DN out   (same as other SMs — exec routes through loader PE)
```

### Chip Count (Split AN/CN/DN)

```
per PE:  CN bidi (4) + AN out (2) + DN in (2) + decode (2) = ~10 chips
         + loopback (5)                                    = ~15 chips
per SM:  AN in (2) + DN out (2) + decode (1)               = ~5 chips
4 PEs:   60 chips
4 SMs:   20 chips
CN arbiter:      ~4 chips (4-node)
AN arbiter:      ~4 chips (4-sender)
DN arbiter:      ~4 chips (4-sender)
BUSY subsystem:  ~6-8 chips (fewer nodes per bus, smaller muxes)
bus control:     ~2 chips
                 ─────────
total:           ~100-102 chips
```

Delta vs shared bus: ~8-10 fewer chips (loopback cost is the same, but
per-bus node counts are lower and BUSY muxes are smaller). buys ~3×
bandwidth and decoupled SM latency. the split is strictly better on both
chip count and performance once loopback is included — the shared bus
node count was inflated by every node needing bidirectional capability.

### PE Input Merge (DN + CN)

A PE receives from two buses: CN (inter-PE compute tokens) and DN (SM
responses). both feed into the same PE pipeline.
the PE's input stage has a 2:1 priority mux:

```
if DN_in has data:       select DN_in → pipeline   (SM responses unblock waiting ops)
else if CN_in has data:  select CN_in → pipeline
else:                    idle
```

hardware: one priority signal (DN_NOT_EMPTY), a 2:1 mux on the input latch
LE lines. ~1-2 chips. SM responses get priority because they unblock dyadic
instructions waiting on SM data — the matching store has a first operand
parked, and the SM response is the second operand that lets it fire.

### PE Output Split (CN vs AN)

A PE emits to CN (bit[15]=0) or AN (bit[15]=1). the output stage has
latches on both buses. bit[15] from the token determines which latch gets
loaded and which BUS_REQ fires:

```
if output_token.bit[15] == 0: load CN_out latch, assert CN_REQ
if output_token.bit[15] == 1: load AN_out latch, assert AN_REQ
```

both output latches share the same data input lines from the PE pipeline.
the type bit gates the LE (latch enable) to the correct pair. one gate.
output enable is controlled independently by each bus's GRANT signal.

---

## Node Interface (Common to Both Configurations)

### Output Path

```
     PE pipeline output
            │
       ┌────┴────┐
       │ 2× 373  │   output latch (16-bit)
       │ OE=GRANT│   tri-state: drives bus only when granted
       └────┬────┘
            │
      ══════╪══════  bus DATA[15:0]
```

2× 74LS373 per output port. LE (latch enable) pulsed by the PE/SM output
stage when a flit is ready. OE (output enable) tied to BUS_GRANT for this
node — only drives the bus when the arbiter says so. when not granted,
outputs are hi-Z.

the PE's output stage loads flit 1, asserts BUS_REQ, and waits. when
GRANT arrives, OE goes active, flit 1 appears on the bus, FLIT_VALID
asserts. next cycle, the PE loads flit 2 into the same latch (overwrites
flit 1), FLIT_VALID stays high. for 3-flit packets, repeat once more.
BUS_HOLD deasserts on the last flit.

### Input Path

```
      ══════╪══════  bus DATA[15:0]
            │
       ┌────┴────┐
       │ 2× 373  │   bus register (always captures on DEST_MATCH)
       └────┬────┘
            │
       ┌────┴────┐
       │ 2× 373  │   flit 2 holding register (3-flit packets only)
       └────┬────┘
            │
      PE pipeline input
```

**Bus register** (2× 373): D inputs connected directly to bus DATA. LE
gated by `DEST_MATCH AND FLIT_VALID`. captures every flit addressed to
this node.

**Flit 2 holding register** (2× 373): captures bus_reg contents when a
3-flit packet's flit 2 needs to be preserved before flit 3 arrives.
for 2-flit packets (the common case), this register is unused.

### Destination Decode

Each node compares the incoming flit 1 against its own ID:

```
CM node (PE): match = NOT bit[15] AND (bit[13:12] == my_PE_id)
SM node:      match = bit[15] AND (bit[13:12] == my_SM_id)
```

`my_PE_id` / `my_SM_id` set by DIP switches or hardwired. the comparison
is 2 bits (a 74LS85 is overkill — two XNOR gates + an AND suffice). total
decode: ~1 chip per node (a few gates from a 74LS00/74LS86 package).

For the misc bucket CM subtypes (011+xx), the PE accepts all of them if
the PE_id matches — subtype decode happens inside the PE pipeline, not at
the bus interface.

For the split bus configuration, destination decode simplifies further:
on the CN bus, only PEs exist, so bit[15] is always 0 and only PE_id
matters. on the AN bus, only SMs are receivers, so bit[15] is always 1 and
only SM_id matters. the type-bit check becomes redundant per-bus (but
costs nothing to keep for robustness).

### Flit Reception (The "No Deserializer" Insight)

The revised token format puts all routing and control information in flit 1.
The pipeline needs offset, act_id, port, and prefix to begin processing
(stage 1 INPUT and stage 2 IFETCH). All of these are in flit 1.
Flit 2 carries the data operand. The pipeline doesn't need it until stage 3
(MATCH), which is 2 cycles after stage 1 latches flit 1's fields. By that
time flit 2 has arrived in the bus register and is waiting.

This means **no dedicated deserializer is needed.** the pipeline's own
stage 1 register latch IS the flit 1 holding register:

```
cycle N:   flit 1 on bus → bus_reg captures.
           stage 1 latches: offset, act_id, port, prefix from bus_reg.
cycle N+1: flit 2 on bus → bus_reg captures (overwrites flit 1 — fine,
           stage 1 already grabbed everything it needs).
cycle N+2: stage 3 reads bus_reg for operand data (flit 2 still there).
```

for 2-flit tokens: zero additional registers beyond the bus 373 pair.
the bus_reg holds flit 2 until the pipeline consumes it.

for 3-flit tokens (CAS, EXT): flit 3 arrives on cycle N+2 and would
overwrite flit 2 in bus_reg. the pipeline needs flit 2 for the left
operand and flit 3 for the right operand. the **flit 2 holding register**
captures flit 2 from bus_reg before flit 3 arrives:

```
cycle N:   flit 1 → bus_reg → stage 1 latches fields
cycle N+1: flit 2 → bus_reg. simultaneously, the previous bus_reg
           contents (flit 1) were already consumed by stage 1. flit 2 now
           in bus_reg.
cycle N+2: flit2_hold captures bus_reg (flit 2).
           flit 3 → bus_reg.
           pipeline has: flit 1 fields in stage regs, flit 2 in
           flit2_hold, flit 3 in bus_reg.
```

the flit 2 holding register is clocked by a "3-flit packet, flit 2 is
about to be overwritten" signal, derived from the flit counter and the
packet length (decoded from flit 1 prefix in stage 1). for 2-flit packets,
this signal never fires and the holding register sits idle.
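The cycle-by-cycle tables above can be condensed into a small behavioral model. This is an illustrative Python sketch, not hardware: the register names (`bus_reg`, `flit2_hold`) follow the text, the function name and flit values are invented for the example.

```python
def receive_packet(flits):
    """Model bus_reg / flit2_hold across a 2- or 3-flit packet arrival,
    following the no-deserializer scheme: stage 1 consumes flit 1's
    fields before flit 2 overwrites it in bus_reg."""
    assert len(flits) in (2, 3)

    # cycle N: flit 1 lands in bus_reg; stage 1 latches its fields
    bus_reg = flits[0]
    stage1_fields = bus_reg          # offset, act_id, port, prefix

    # cycle N+1: flit 2 overwrites bus_reg (flit 1 already consumed)
    bus_reg = flits[1]

    if len(flits) == 3:
        # cycle N+2: flit2_hold captures flit 2 just before flit 3 lands
        flit2_hold = bus_reg
        bus_reg = flits[2]
        return stage1_fields, flit2_hold, bus_reg   # fields, left, right
    # 2-flit case: stage 3 (MATCH) reads the operand straight from bus_reg
    return stage1_fields, bus_reg

# 2-flit token: operand waits in bus_reg until stage 3 reads it
fields, operand = receive_packet([0x1234, 0xBEEF])
assert operand == 0xBEEF

# 3-flit CAS token: flit 2 preserved in flit2_hold, flit 3 in bus_reg
fields, left, right = receive_packet([0x1234, 0xAAAA, 0xBBBB])
assert (left, right) == (0xAAAA, 0xBBBB)
```

The point the model makes explicit: no extra register is touched on the 2-flit path, and the holding register fires exactly once, on cycle N+2 of a 3-flit packet.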
### Input Hardware Cost Per Port

```
bus_reg:       2× 74LS373 (always present)
flit2_hold:    2× 74LS373 (for 3-flit packets)
dest decode:   ~1 chip (a few gates)
flit counter:  a 74LS74 (2-bit counter)
control logic: ~1 chip (gates for LE timing, TOKEN_READY)
               ────────────
total:         ~6 chips per input port
```

### Output Hardware Cost Per Port

```
out_latch:     2× 74LS373 (drives bus when granted)
BUS_REQ logic: ~half a chip (set on pipeline output ready, clear on grant)
               ────────────
total:         ~3 chips per output port
```

### Per-Node Totals

**Shared bus (bidirectional):**

```
output:    ~3 chips
input:     ~6 chips
           ────────
per node:  ~9 chips (could reduce to ~6 if 3-flit packets handled in pipeline)
```

**Split bus, PE (CN bidi + AN out + DN in):**

```
CN out:      ~3 chips
CN in:       ~6 chips
AN out:      ~3 chips
DN in:       ~6 chips
output mux:  ~1 chip (bit[15] selects CN vs AN output)
input mux:   ~1 chip (DN vs CN priority select)
             ────────
per PE:      ~20 chips
```

**Split bus, SM (AN in + DN out):**

```
AN in:    ~6 chips
DN out:   ~3 chips
          ────────
per SM:   ~9 chips
```

Note: these counts are slightly higher than earlier estimates because they
include the flit2 holding register. if 3-flit tokens are rare enough, the
holding register can be omitted from nodes that will never receive them
(e.g., SMs receive 3-flit CAS tokens, but PEs only receive 2-flit compute
tokens and 2-flit SM responses — PEs can skip the holding register).

---

## PE Loopback (Self-Addressed Token Bypass)

When a PE produces a token addressed to itself, there is no reason for that
token to traverse the external bus.
self-addressed tokens are the dominant
traffic pattern in well-compiled programs (the assembler/compiler maximises
self-PE placement), so bypassing the bus for self-sends eliminates a large
fraction of bus contention.

### Hardware

The PE's output stage already knows the destination PE_id — it's in the
flit 1 routing word read from the frame. The PE already knows its own ID
(DIP switches / EEPROM). A comparator on those 2-3 bits determines whether
the token is self-addressed:

```
self_send = (dest_PE_id == my_PE_id)
```

A 2:1 mux on the PE's input path selects between "from bus" and "from own
output":

```
          PE pipeline output
                 │
          ┌──────┴──────┐
          │ self_send?  │   comparator on PE_id (a few gates)
          └──┬───────┬──┘
             │       │
     self=1  │       │  self=0
             │       │
             │  ┌────┴────┐
             │  │ bus out │   2× 373, OE=GRANT
             │  │ latch   │   assert BUS_REQ
             │  └─────────┘
             │
   ┌───┐     │     ┌─────────┐
   │mux│◄────┴─────│ bus in  │   from bus input path
   │2:1│           │ latch   │
   └─┬─┘           └─────────┘
     │
   PE pipeline input (stage 1)
```

When `self_send=1`:
- output does NOT load the bus output latch. BUS_REQ is not asserted.
- output feeds directly to the input mux, which selects the loopback path.
- the pipeline sees the token arrive at stage 1 as if it came from the bus.

When `self_send=0`:
- output loads the bus output latch, asserts BUS_REQ. normal bus path.
- input mux selects bus input. normal receive path.
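The steering logic above is small enough to state as code. A minimal Python sketch, assuming a node id set at build time; the function name and the two-element return value are inventions for the example, the signal names mirror the text.

```python
MY_PE_ID = 2   # set by DIP switches / EEPROM in the real hardware

def route_output(dest_pe_id, flit1, my_pe_id=MY_PE_ID):
    """Decide the path for one outgoing token: 'loopback' for
    self-addressed tokens, 'bus' otherwise."""
    self_send = (dest_pe_id == my_pe_id)   # the 2-3 bit comparator
    if self_send:
        # bus output latch not loaded, BUS_REQ not asserted;
        # the token feeds the input mux directly
        return ("loopback", flit1)
    # normal path: load out_latch, assert BUS_REQ, wait for GRANT
    return ("bus", flit1)

assert route_output(2, 0xC0DE)[0] == "loopback"
assert route_output(1, 0xC0DE)[0] == "bus"
```

Note that the decision uses only bits already present in the formed flit 1, which is why it costs a few gates and no extra state.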
### Hardware Cost

```
PE_id comparator:  ~half a chip (XNOR + AND on 2-3 bits)
input mux:         4× 74LS157 (quad 2:1 mux, 16-bit data path)
loopback control:  ~half a chip (gates for mux select, BUS_REQ inhibit)
                   ─────────────
total:             ~5 chips per PE
```

### Timing

Bus path (self-addressed, without loopback): output stage → wait for bus
grant (1-4 cycles depending on contention) → flit 1 on bus (1 cycle) →
flit 2 on bus (1 cycle) → dest decode + input capture → stage 1. minimum
~4-6 cycles.

Loopback path: output stage → mux → stage 1. with a register in the
loopback path for timing safety, 2 cycles (output latches flit 1, next
cycle it's at stage 1 input; flit 2 follows). without the register
(combinational loopback), potentially 1 cycle.

savings: 2-4 cycles per self-addressed token, plus zero bus contention
impact on other nodes.

### Multi-Flit Loopback

the loopback path handles multi-flit tokens the same way the bus input
does. flit 1 feeds through the mux, stage 1 latches the routing fields.
flit 2 feeds through on the next cycle, sits in the input register until
stage 3 reads it. for 3-flit tokens (rare on loopback — a PE would have
to CAS its own local SM? unclear when this happens), the flit 2 holding
register captures flit 2 before flit 3 arrives, same as bus reception.

### Throughput Tradeoff: Self-Route vs Cross-PE Ping-Pong

Self-loopback is fast but serialises the pipeline: the PE processes its
own output token, so it can't work on another node's token simultaneously.
Cross-PE routing (PE0 → PE1 → PE0) costs bus latency per hop but lets both
PEs work in parallel — PE1 processes while PE0's result is in transit.

For a linear dependency chain with no parallelism, self-route wins: fewer
cycles per token, no bus overhead.
For wide graphs with independent
branches, spreading work across PEs wins: the bus latency is paid once per
hop but multiple PEs execute concurrently.

The compiler/assembler can model this tradeoff per-subgraph:

- **self-route**: saves ~3 cycles per token, serialises the PE.
  best for: tight dependent chains, loop bodies, accumulations.
- **cross-PE**: costs ~4 cycles bus latency per hop, enables parallelism.
  best for: independent branches, wide fan-out, producer-consumer pipelines.

Until strongly-connected arc execution is implemented (which makes
sequential self-execution much more efficient), there are cases where
ping-ponging between PEs genuinely outperforms self-routing due to
pipeline fill. the assembler's placement algorithm should consider bus
contention, pipeline depth, and dependency structure when deciding.

### Interaction with BUSY and Concurrent Reception

Self-addressed loopback tokens bypass the bus entirely. crucially, the
bus_reg remains available to capture externally-arriving tokens while the
loopback path feeds the pipeline. this is possible because BUSY is tied
to bus_reg occupancy, not pipeline input occupancy:

```
BUSY = bus_reg_captured AND NOT bus_reg_consumed
```

one AND gate. BUSY asserts when the bus_reg holds an unconsumed flit, and
clears when the pipeline drains it. the loopback path is invisible to BUSY.

this enables concurrent reception:

```
cycle N:   loopback token → mux → pipeline stage 1 (self-send, fast path)
           simultaneously: external token on bus → bus_reg captures
           BUSY asserts (bus_reg now occupied)
cycle N+1: bus_reg token → mux → pipeline stage 1 (external, from bus)
           BUSY clears
```

the input mux priority: loopback wins if both arrive simultaneously.
the PE's own output is already committed and can't stall; the bus token
waits one cycle in bus_reg.
BUSY holds other senders off for that one
beat. one priority gate on two "data available" signals.

without this decoupling, a self-send would block the bus_reg (the
pipeline input is occupied by the loopback token), forcing external
senders to wait even though the bus_reg is physically empty. the
decoupling means the bus_reg serves as a 1-token buffer that absorbs
one external arrival while the pipeline processes a loopback token.

**future upgrade: deeper input buffer.** v0 operates with the bus_reg as
the sole external input buffer (1 token deep). once pipelining becomes
more aggressive post-v0 — particularly with strongly-connected arc
execution keeping the PE busy on sequential chains — a small register
FIFO (2-4 entries) behind the bus_reg would absorb bursts of incoming
tokens without asserting BUSY. this is the same upgrade path noted in
the open questions: NOS IDT7201 FIFOs, register FIFOs built from 373s,
or SRAM-based. the BUSY signal transitions from "bus_reg full" to "FIFO
full" with no protocol change. the loopback path remains independent
of the FIFO — it feeds the pipeline directly, not through the buffer.

---

## Arbiter

### Shared Bus Arbiter

8 requesters, one winner per arbitration cycle. SM0 gets hard priority
(bootstrap). all others round-robin.

```
inputs:  BUS_REQ[7:0], BUS_HOLD
outputs: BUS_GRANT[7:0] (one-hot), or WINNER_ID[2:0] + external decoder

state:
    IDLE:    if any BUS_REQ asserted, select winner, go to GRANTED.
    GRANTED: hold GRANT while BUS_HOLD asserted.
             when BUS_HOLD drops, go to IDLE (or fast-path: immediately
             select next winner if any REQ pending).
```

**Winner selection:** if BUS_REQ[SM0] is asserted, SM0 wins (hard priority
for bootstrap and IO). otherwise, round-robin starting from
`(last_winner + 1) mod 8`.

hardware: 74LS148 (8-to-3 priority encoder) with input masking for
round-robin.
a 3-bit counter (`last_winner`) advances after each grant.
the REQ lines are rotated by `last_winner` before hitting the priority
encoder (using a barrel shift or a mux tree). SM0 priority override:
OR the SM0 REQ into the highest-priority encoder input, bypassing the
rotation.

```
round-robin mux:      2× 74LS151 (8:1 mux) or equivalent  = 2 chips
priority encoder:     1× 74LS148                          = 1 chip
grant decoder:        1× 74LS138 (3-to-8)                 = 1 chip
last_winner counter:  half a 74LS163 (4-bit counter)      = 1 chip
SM0 override + FSM:   ~1 chip (gates)                     = 1 chip
                                                    total: ~6 chips
```

### Split Bus Arbiters

Each bus has its own arbiter. CN arbiter: 4 nodes (PEs), bidirectional.
AN arbiter: 4 senders (PEs only). DN arbiter: 4 senders (SMs only).

4-node arbiter is simpler than 8-node:

```
priority encoder:        1× 74LS148 (using 4 of 8 inputs) = 1 chip
grant decoder:           1× 74LS139 (dual 2-to-4)         = 1 chip
round-robin counter:     half a 74LS74                    = shared
SM0 priority (DN only):  1 gate                           = shared
FSM:                     ~1 chip (gates)                  = 1 chip
                                                 per bus:  ~3-4 chips
                                                 3 buses:  ~10-12 chips
```

SM0 priority applies to the DN arbiter (SM0's responses and exec output
get priority on the return path). CN and AN arbiters are pure round-robin
among PEs.

---

## Backpressure

### BUSY Mechanism (Mandatory)

There is no input FIFO. The bus register is the buffer — one token deep.
If the pipeline hasn't consumed the current token when the next one arrives
for this node, the token is lost. Therefore sender-side destination-busy
checking is **mandatory**, not optional.

Each node has a `BUSY` output: active whenever its bus register holds an
unconsumed token. Senders check `BUSY[dest]` before asserting `BUS_REQ`.
If the destination is busy, the sender holds. Its output latch stays full,
which stalls its pipeline at the output stage. Backpressure propagates
naturally through the pipeline.
```
BUSY[7:0]   one wire per node (active-low, open-collector).
            set when bus_reg captures a flit (DEST_MATCH AND FLIT_VALID).
            cleared when pipeline latches the token into stage 1.
```

**Per-node BUSY generation:** one flip-flop (half a 74LS74). set on capture,
cleared on pipeline consumption. active-low open-collector output allows
wired-AND if needed, and maps directly to an async ACK signal for future
Mode C operation.

**Per-sender BUSY lookup:** the outgoing token's destination ID (2-3 bits
from the formed flit 1, known before BUS_REQ asserts) indexes into
`BUSY[7:0]` via a 74LS151 (8:1 mux). the mux output gates BUS_REQ: if
`BUSY[dest]` is active, BUS_REQ is suppressed.

```
sender logic:
    dest_id   = output_token.flit1[13:12]  (PE_id or SM_id)
                + output_token.flit1[15]   (SM/CM select, for full 3-bit node addr)
    dest_busy = BUSY_MUX[dest_id]
    BUS_REQ   = output_ready AND NOT dest_busy
```

**Hardware cost:**

```
BUSY flip-flops:        4× 74LS74 (8 flip-flops, one per node)  = 4 chips
BUSY mux (per sender):  1× 74LS151 (8:1) each                   = 8 chips max
                        (senders can share mux chips on split bus
                        where each bus has only 4 destinations)
BUS_REQ gate:           included in existing output logic       = 0 extra
                                                          total: ~8-12 chips
```

For the split bus configuration, each bus has only 4 possible destinations.
BUSY lines are per-bus: `CN_BUSY[3:0]`, `AN_BUSY[3:0]`, `DN_BUSY[3:0]`.
the mux shrinks to a 74LS153 (dual 4:1) or equivalent. fewer chips.

**Async upgrade path:** the BUSY signal is the embryonic form of an ACK in
a request/acknowledge handshake protocol. In Mode C (fully async), BUSY
becomes the ACK that completes the flit transfer. the wiring is unchanged —
only the timing discipline around it evolves from "check before requesting"
to "wait for acknowledge after sending."
designing BUSY in from the start
means the async transition doesn't require re-wiring the status lines.

---

## Physical Wiring

### Shared Bus

```
1 ribbon cable, 20-pin IDC:
    DATA[15:0]   16 pins
    FLIT_VALID    1 pin
    BUS_HOLD      1 pin
    GND           2 pins
```

BUS_REQ and BUS_GRANT run as separate point-to-point wires between each
node and the arbiter (8 REQ + 8 GRANT = 16 wires, or 8 REQ + 3-bit
WINNER_ID = 11 wires). these can be a second ribbon or direct jumper wires.

### Split AN/CN/DN

```
3 ribbon cables, 20-pin IDC each:
    CN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2
    AN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2
    DN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2

REQ/GRANT per bus: separate wires or a 4th ribbon.
```

first PCB candidate: a bus backplane board with three rows of IDC
connectors (one row per bus), the three arbiters, and the BUSY/status
lines. PEs and SMs plug in via ribbon cables. testable with 1 PE + 1 SM
before populating the full system.

---

## Scaling Path

| scale | topology | notes |
|---|---|---|
| 1 PE + 1 SM | single shared bus | initial bring-up, no arbiter needed |
| 2–4 PE + 1–2 SM | single shared bus | v0, arbiter added |
| 4 PE + 4 SM | shared or split | v0.5, split if contention measured |
| 8+ PE | split + ring or prefix | CN becomes a ring bus between PEs |
| 16+ PE | hierarchical prefix | full multi-level routing network |

the token format and node interfaces are unchanged across all scales.
the bus interface (373 latches + dest decode + FIFO) is the same whether
it's on a shared bus, a split segment, or a ring. what changes is wiring
and arbiters.

---

## Open Questions

1. **Input FIFO as upgrade.** v0 operates with no input FIFO (bus reg is the
   sole buffer, BUSY mechanism prevents overflow).
   adding a small FIFO
   (4–8 entries) behind the bus reg absorbs bursts and reduces the frequency
   of BUSY stalls. NOS IDT7201 FIFOs (512×9, $3–8 on eBay) are ideal if
   available. otherwise, a register FIFO from 373s (8 chips for 4 entries)
   or SRAM + counter. this is a pure performance upgrade — the BUSY
   mechanism remains the ultimate backpressure signal even with a FIFO
   (BUSY asserts when the FIFO fills instead of when the bus reg is
   occupied).

2. **SM0 bootstrap priority.** hard priority on the shared bus / DN arbiter
   is simple. does it need to be SM0 specifically, or should priority be
   configurable (e.g., a DIP switch selects which node gets priority)?
   hardwired SM0 is fine for v0.

3. **Arbiter latency.** the round-robin mux + priority encoder + decoder
   chain is ~3 gate delays (~30–45 ns). at 5 MHz (200 ns cycle), this
   easily fits. at higher clock rates, the arbiter might need pipelining
   (one cycle to decide, next cycle to grant). for v0 at 5 MHz,
   single-cycle arbitration is fine.

4. **3-flit reception.** the flit 2 holding register adds 2 chips per input
   port. if 3-flit tokens (CAS, EXT) are rare and only received by SMs,
   only SM input ports need the holding register. PE input ports can skip
   it (PEs never receive 3-flit tokens in normal operation). saves 2 chips
   per PE input port.

5. **Split bus trigger.** at what contention level should the shared bus be
   split into AN/CN/DN? diagnostic counters on the arbiter (grant-wait
   cycles per node) provide the measurement. if any node averages >20%
   wait cycles, the split is likely worthwhile. the compiler can also
   estimate bus utilisation statically from the dataflow graph structure.
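The winner-selection rule from the Arbiter section (SM0 hard priority, then round-robin from `last_winner + 1`) is easy to model behaviorally, which is useful for checking fairness before committing to wiring. A Python sketch; the node numbering (SM0 as node 4) is an assumption for illustration, since the real REQ-line-to-node mapping is a wiring choice.

```python
SM0 = 4   # assumed position of SM0 on the 8 REQ lines

def select_winner(bus_req, last_winner):
    """bus_req: list of 8 booleans, one per BUS_REQ line.
    Returns the granted node id, or None if no request is pending."""
    if bus_req[SM0]:
        return SM0                       # hard priority: bootstrap and IO
    # scan starting at (last_winner + 1) mod 8, mirroring the
    # mux-rotated 74LS148 priority encoder
    for i in range(1, 9):
        node = (last_winner + i) % 8
        if bus_req[node]:
            return node
    return None

req = [False] * 8
req[1] = req[6] = True
assert select_winner(req, last_winner=1) == 6   # last winner scanned last
req[SM0] = True
assert select_winner(req, last_winner=1) == SM0 # SM0 overrides rotation
```

Note the fairness property the rotation buys: the previous winner is scanned last, so a node cannot starve its neighbours by re-requesting immediately.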