OR-1 dataflow CPU sketch
# Bus Architecture and Interconnect Design

Covers the physical bus implementation, node transceivers, arbitration,
flit reception, and the scaling path from shared bus to split AN/CN/DN
networks. Companion to `pe-redesign-frames-and-pipeline.md` (PE pipeline
and token format) and `network-and-communication.md` (logical network
model and clocking discipline).

---

## Design Context

The system has up to 4 PEs (CMs) and up to 4 SMs, communicating via 16-bit
flit-based tokens. All tokens are 1–3 flits. Flit 1 is self-describing:
the prefix bits (bit[15] for SM/CM, bits[13:12] for destination ID) tell
any node on the bus who the packet is for and how long it is.

Two physical bus configurations are documented: shared (simpler, fewer
chips, more contention) and split AN/CN/DN (more chips, more bandwidth,
closer to historical dataflow machine practice). Both use the same node
interface design and token format. The split is a wiring and arbiter
change, not a protocol change.

---

## Shared Bus (v0 / v0.5 starting point)

### Topology

Single 16-bit bus connecting all nodes. Every node sees every flit.
Receivers filter by destination address. One bus master at a time;
arbitration determines who drives.

```
┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐
│ PE 0 │   │ PE 1 │   │ PE 2 │   │ PE 3 │
└──┬───┘   └──┬───┘   └──┬───┘   └──┬───┘
   │          │          │          │
═══╪══════════╪══════════╪══════════╪══════════╗
   │          │          │          │  16-bit  ║ shared bus
═══╪══════════╪══════════╪══════════╪══════════╝
   │          │          │          │
┌──┴───┐   ┌──┴───┐   ┌──┴───┐   ┌──┴───┐
│ SM 0 │   │ SM 1 │   │ SM 2 │   │ SM 3 │
└──────┘   └──────┘   └──────┘   └──────┘
              ┌─────────┐
              │ ARBITER │
              └─────────┘
```

### Bus Signals

```
DATA[15:0]    16-bit data bus. driven by current bus master, hi-Z otherwise.
FLIT_VALID    asserted by bus master. indicates DATA is a valid flit.
BUS_HOLD      asserted by bus master. "more flits coming, don't re-arbitrate."
BUS_REQ[7:0]  one per node.
              active = "i have a packet to send."
BUS_GRANT     active for current bus master. directly enables output drivers.
```

FLIT_VALID and BUS_HOLD are active-high, driven only by the granted node.
since only one node drives at a time, no wired-OR or open-collector needed.

### Packet Transmission Protocol

```
IDLE:
    nodes with pending output assert BUS_REQ.
    arbiter selects winner, asserts BUS_GRANT for that node.

FLIT 1:
    winner's output latch OE goes active (drives DATA bus).
    winner puts flit 1 on DATA, asserts FLIT_VALID.
    winner asserts BUS_HOLD (if more flits follow).
    all receivers inspect DATA for destination match.
    matching receiver captures flit 1.

FLIT 2:
    winner puts flit 2 on DATA. FLIT_VALID stays high.
    matching receiver captures flit 2.
    if packet complete (2-flit): winner deasserts BUS_HOLD.

FLIT 3 (CAS, EXT only):
    winner puts flit 3 on DATA. FLIT_VALID stays high.
    matching receiver captures flit 3.
    winner deasserts BUS_HOLD.

RELEASE:
    BUS_HOLD drops. arbiter deasserts GRANT.
    arbiter immediately checks BUS_REQ for next winner.
    (zero idle cycles between back-to-back packets if another REQ is pending.)
```

Once granted, the bus master holds the bus for the full packet duration
(2–3 flit cycles). no preemption, no interleaving. packet atomicity is
guaranteed by BUS_HOLD.

### Chip Count (Shared Bus)

```
per PE:   ~9 chips (bus interface)
          +5 chips (loopback: mux + comparator + control)
          = ~14 chips per PE
per SM:   ~9 chips (bus interface, no loopback needed)
4 PEs + 4 SMs: 4×14 + 4×9 = 92 chips
arbiter:         ~6 chips
BUSY subsystem:  ~8-12 chips
bus control:     ~2 chips
                 ─────────
total:           ~108-112 chips
```

---

## Split AN/CN/DN (v0.5+ upgrade)

### Topology

Three separate 16-bit buses, each carrying one logical traffic class.
Follows Amamiya's DFM architecture, where AN and DN are unidirectional.
```
CN bus (bidirectional, CM↔CM):
    PE 0 ←→ PE 1 ←→ PE 2 ←→ PE 3          compute tokens between PEs.
                                           4 nodes, bidirectional.

AN bus (unidirectional, CM→SM):
    PE 0 → ┐
    PE 1 → ┤                               SM operation requests.
    PE 2 → ├→ SM 0, SM 1, SM 2, SM 3       4 senders (PEs), 4 receivers (SMs).
    PE 3 → ┘

DN bus (unidirectional, SM→CM):
    SM 0 → ┐
    SM 1 → ┤                               SM operation responses.
    SM 2 → ├→ PE 0, PE 1, PE 2, PE 3       4 senders (SMs), 4 receivers (PEs).
    SM 3 → ┘
```

### What the Split Buys

**Bandwidth.** PE-to-PE compute traffic and SM request/response traffic are
fully decoupled. PE0 can send an SM READ on AN while PE1 sends a compute
token to PE2 on CN, simultaneously. effective bandwidth is roughly 3× the
shared bus for mixed workloads.

**SM round-trip latency.** request goes out on AN, response comes back on DN.
the two phases never compete for the same bus. on the shared bus, the
response must wait for a bus grant that competes with outgoing requests and
inter-PE traffic.

**Reduced contention per bus.** CN has 4 nodes (PEs only). AN has 4 senders
(PEs). DN has 4 senders (SMs). each bus sees half (or less) of the traffic
the shared bus would see.

**Cleaner scaling.** adding SMs doesn't increase CN contention. adding PEs
doesn't increase DN contention. traffic classes scale independently.

### What the Split Costs

**Wiring.** three buses × (16 data + control) vs one. ribbon cables make
this manageable: each bus is a 20-pin IDC ribbon (16 data + FLIT_VALID +
BUS_HOLD + REQ_line + GND). three ribbons. first PCB candidate: a bus
backplane with three sets of IDC connectors and the three arbiters.

**Chips.** each PE needs ports on CN (bidirectional) + AN (output only) +
DN (input only). each SM needs ports on AN (input only) + DN (output only).
see chip count below.

**SM→SM path.** on the shared bus, SM0's `exec` can emit SM tokens directly.
on the split bus, SM0 only has DN output (→ CMs). SM-bound tokens from
exec must route through a loader PE: SM0 emits CM tokens onto DN, the
loader PE receives them, executes instructions that construct SM tokens,
and emits those onto AN. one extra hop, only during bootstrap, no runtime
cost. alternatively, the spare CM token subtype (011+11) could serve as a
bus-level "forward to AN" hint, but the loader PE approach is cleaner and
doesn't burn encoding space.

### Node Bus Ports (Split Configuration)

```
PE:   CN out + CN in   +  AN out      +  DN in
      bidirectional       send only      receive only

SM:   AN in         +  DN out
      receive only     send only

SM0:  AN in + DN out   (same as other SMs — exec routes through loader PE)
```

### Chip Count (Split AN/CN/DN)

```
per PE:  CN bidi (4) + AN out (2) + DN in (2) + decode (2) = ~10 chips
         + loopback (5)                                    = ~15 chips
per SM:  AN in (2) + DN out (2) + decode (1)               = ~5 chips
4 PEs:   60 chips
4 SMs:   20 chips
CN arbiter:      ~4 chips (4-node)
AN arbiter:      ~4 chips (4-sender)
DN arbiter:      ~4 chips (4-sender)
BUSY subsystem:  ~6-8 chips (fewer nodes per bus, smaller muxes)
bus control:     ~2 chips
                 ─────────
total:           ~100-102 chips
```

Delta vs shared bus: ~8-10 fewer chips (loopback cost is the same, but
per-bus node counts are lower and BUSY muxes are smaller). buys ~3×
bandwidth and decoupled SM latency. the split is strictly better on both
chip count and performance once loopback is included — the shared bus
node count was inflated by every node needing bidirectional capability.

### PE Input Merge (DN + CN)

A PE receives from two buses: CN (inter-PE compute tokens) and DN (SM
responses). both feed into the same PE pipeline.
the PE's input stage has a 2:1 priority mux:

```
if DN_in has data:       select DN_in → pipeline   (SM responses unblock waiting ops)
else if CN_in has data:  select CN_in → pipeline
else:                    idle
```

hardware: one priority signal (DN_NOT_EMPTY), a 2:1 mux on the input latch
LE lines. ~1-2 chips. SM responses get priority because they unblock dyadic
instructions waiting on SM data — the matching store has a first operand
parked, and the SM response is the second operand that lets it fire.

### PE Output Split (CN vs AN)

A PE emits to CN (bit[15]=0) or AN (bit[15]=1). the output stage has
latches on both buses. bit[15] from the token determines which latch gets
loaded and which BUS_REQ fires:

```
if output_token.bit[15] == 0: load CN_out latch, assert CN_REQ
if output_token.bit[15] == 1: load AN_out latch, assert AN_REQ
```

both output latches share the same data input lines from the PE pipeline.
the type bit gates the LE (latch enable) to the correct pair. one gate.
output enable is controlled independently by each bus's GRANT signal.

---

## Node Interface (Common to Both Configurations)

### Output Path

```
     PE pipeline output
            │
       ┌────┴────┐
       │ 2× 373  │   output latch (16-bit)
       │ OE=GRANT│   tri-state: drives bus only when granted
       └────┬────┘
            │
      ══════╪══════  bus DATA[15:0]
```

2× 74LS373 per output port. LE (latch enable) pulsed by the PE/SM output
stage when a flit is ready. OE (output enable) tied to BUS_GRANT for this
node — only drives the bus when the arbiter says so. when not granted,
outputs are hi-Z.

the PE's output stage loads flit 1, asserts BUS_REQ, and waits. when
GRANT arrives, OE goes active, flit 1 appears on the bus, FLIT_VALID
asserts. next cycle, the PE loads flit 2 into the same latch (overwrites
flit 1), FLIT_VALID stays high. for 3-flit packets, repeat once more.
BUS_HOLD deasserts on the last flit.

### Input Path

```
      ══════╪══════  bus DATA[15:0]
            │
       ┌────┴────┐
       │ 2× 373  │   bus register (always captures on DEST_MATCH)
       └────┬────┘
            │
       ┌────┴────┐
       │ 2× 373  │   flit 2 holding register (3-flit packets only)
       └────┬────┘
            │
      PE pipeline input
```

**Bus register** (2× 373): D inputs connected directly to bus DATA. LE
gated by `DEST_MATCH AND FLIT_VALID`. captures every flit addressed to
this node.

**Flit 2 holding register** (2× 373): captures bus_reg contents when a
3-flit packet's flit 2 needs to be preserved before flit 3 arrives.
for 2-flit packets (the common case), this register is unused.

### Destination Decode

Each node compares the incoming flit 1 against its own ID:

```
CM node (PE): match = NOT bit[15] AND (bit[13:12] == my_PE_id)
SM node:      match = bit[15] AND (bit[13:12] == my_SM_id)
```

`my_PE_id` / `my_SM_id` set by DIP switches or hardwired. the comparison
is 2 bits (a 74LS85 is overkill — two XNOR gates + an AND suffice). total
decode: ~1 chip per node (a few gates from a 74LS00/74LS86 package).

For the misc bucket CM subtypes (011+xx), the PE accepts all of them if
the PE_id matches — subtype decode happens inside the PE pipeline, not at
the bus interface.

For the split bus configuration, destination decode simplifies further:
on the CN bus, only PEs exist, so bit[15] is always 0 and only PE_id
matters. on the AN bus, only SMs are receivers, so bit[15] is always 1 and
only SM_id matters. the type-bit check becomes redundant per-bus (but
costs nothing to keep for robustness).

### Flit Reception (The "No Deserializer" Insight)

The revised token format puts all routing and control information in flit 1.
The pipeline needs offset, act_id, port, and prefix to begin processing
(stage 1 INPUT and stage 2 IFETCH). All of these are in flit 1.
Flit 2 carries the data operand. The pipeline doesn't need it until stage 3
(MATCH), which is 2 cycles after stage 1 latches flit 1's fields. By that
time flit 2 has arrived in the bus register and is waiting.

This means **no dedicated deserializer is needed.** the pipeline's own
stage 1 register latch IS the flit 1 holding register:

```
cycle N:   flit 1 on bus → bus_reg captures.
           stage 1 latches: offset, act_id, port, prefix from bus_reg.
cycle N+1: flit 2 on bus → bus_reg captures (overwrites flit 1 — fine,
           stage 1 already grabbed everything it needs).
cycle N+2: stage 3 reads bus_reg for operand data (flit 2 still there).
```

for 2-flit tokens: zero additional registers beyond the bus 373 pair.
the bus_reg holds flit 2 until the pipeline consumes it.

for 3-flit tokens (CAS, EXT): flit 3 arrives on cycle N+2 and would
overwrite flit 2 in bus_reg. the pipeline needs flit 2 for the left
operand and flit 3 for the right operand. the **flit 2 holding register**
captures flit 2 from bus_reg before flit 3 arrives:

```
cycle N:   flit 1 → bus_reg → stage 1 latches fields
cycle N+1: flit 2 → bus_reg. simultaneously, the previous bus_reg
           contents (flit 1) were already consumed by stage 1. flit 2 now
           in bus_reg.
cycle N+2: flit2_hold captures bus_reg (flit 2).
           flit 3 → bus_reg.
           pipeline has: flit 1 fields in stage regs, flit 2 in
           flit2_hold, flit 3 in bus_reg.
```

the flit 2 holding register is clocked by a "3-flit packet, flit 2 is
about to be overwritten" signal, derived from the flit counter and the
packet length (decoded from flit 1 prefix in stage 1). for 2-flit packets,
this signal never fires and the holding register sits idle.
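The cycle-by-cycle tables above can be condensed into a small behavioral model. This is an illustrative Python sketch, not hardware: the register names (`bus_reg`, `flit2_hold`) follow the text, the function name and flit values are invented for the example.

```python
def receive_packet(flits):
    """Model bus_reg / flit2_hold across a 2- or 3-flit packet arrival,
    following the no-deserializer scheme: stage 1 consumes flit 1's
    fields before flit 2 overwrites it in bus_reg."""
    assert len(flits) in (2, 3)

    # cycle N: flit 1 lands in bus_reg; stage 1 latches its fields
    bus_reg = flits[0]
    stage1_fields = bus_reg          # offset, act_id, port, prefix

    # cycle N+1: flit 2 overwrites bus_reg (flit 1 already consumed)
    bus_reg = flits[1]

    if len(flits) == 3:
        # cycle N+2: flit2_hold captures flit 2 just before flit 3 lands
        flit2_hold = bus_reg
        bus_reg = flits[2]
        return stage1_fields, flit2_hold, bus_reg   # fields, left, right
    # 2-flit case: stage 3 (MATCH) reads the operand straight from bus_reg
    return stage1_fields, bus_reg

# 2-flit token: operand waits in bus_reg until stage 3 reads it
fields, operand = receive_packet([0x1234, 0xBEEF])
assert operand == 0xBEEF

# 3-flit CAS token: flit 2 preserved in flit2_hold, flit 3 in bus_reg
fields, left, right = receive_packet([0x1234, 0xAAAA, 0xBBBB])
assert (left, right) == (0xAAAA, 0xBBBB)
```

The point the model makes explicit: no extra register is touched on the 2-flit path, and the holding register fires exactly once, on cycle N+2 of a 3-flit packet.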
### Input Hardware Cost Per Port

```
bus_reg:       2× 74LS373 (always present)
flit2_hold:    2× 74LS373 (for 3-flit packets)
dest decode:   ~1 chip (a few gates)
flit counter:  a 74LS74 (2-bit counter)
control logic: ~1 chip (gates for LE timing, TOKEN_READY)
               ────────────
total:         ~6 chips per input port
```

### Output Hardware Cost Per Port

```
out_latch:     2× 74LS373 (drives bus when granted)
BUS_REQ logic: ~half a chip (set on pipeline output ready, clear on grant)
               ────────────
total:         ~3 chips per output port
```

### Per-Node Totals

**Shared bus (bidirectional):**

```
output:    ~3 chips
input:     ~6 chips
           ────────
per node:  ~9 chips (could reduce to ~6 if 3-flit packets handled in pipeline)
```

**Split bus, PE (CN bidi + AN out + DN in):**

```
CN out:      ~3 chips
CN in:       ~6 chips
AN out:      ~3 chips
DN in:       ~6 chips
output mux:  ~1 chip (bit[15] selects CN vs AN output)
input mux:   ~1 chip (DN vs CN priority select)
             ────────
per PE:      ~20 chips
```

**Split bus, SM (AN in + DN out):**

```
AN in:    ~6 chips
DN out:   ~3 chips
          ────────
per SM:   ~9 chips
```

Note: these counts are slightly higher than earlier estimates because they
include the flit2 holding register. if 3-flit tokens are rare enough, the
holding register can be omitted from nodes that will never receive them
(e.g., SMs receive 3-flit CAS tokens, but PEs only receive 2-flit compute
tokens and 2-flit SM responses — PEs can skip the holding register).

---

## PE Loopback (Self-Addressed Token Bypass)

When a PE produces a token addressed to itself, there is no reason for that
token to traverse the external bus.
self-addressed tokens are the dominant
traffic pattern in well-compiled programs (the assembler/compiler maximises
self-PE placement), so bypassing the bus for self-sends eliminates a large
fraction of bus contention.

### Hardware

The PE's output stage already knows the destination PE_id — it's in the
flit 1 routing word read from the frame. The PE already knows its own ID
(DIP switches / EEPROM). A comparator on those 2-3 bits determines whether
the token is self-addressed:

```
self_send = (dest_PE_id == my_PE_id)
```

A 2:1 mux on the PE's input path selects between "from bus" and "from own
output":

```
          PE pipeline output
                 │
          ┌──────┴──────┐
          │ self_send?  │   comparator on PE_id (a few gates)
          └──┬───────┬──┘
             │       │
     self=1  │       │  self=0
             │       │
             │  ┌────┴────┐
             │  │ bus out │   2× 373, OE=GRANT
             │  │ latch   │   assert BUS_REQ
             │  └─────────┘
             │
   ┌───┐     │     ┌─────────┐
   │mux│◄────┴─────│ bus in  │   from bus input path
   │2:1│           │ latch   │
   └─┬─┘           └─────────┘
     │
   PE pipeline input (stage 1)
```

When `self_send=1`:
- output does NOT load the bus output latch. BUS_REQ is not asserted.
- output feeds directly to the input mux, which selects the loopback path.
- the pipeline sees the token arrive at stage 1 as if it came from the bus.

When `self_send=0`:
- output loads the bus output latch, asserts BUS_REQ. normal bus path.
- input mux selects bus input. normal receive path.
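The steering logic above is small enough to state as code. A minimal Python sketch, assuming a node id set at build time; the function name and the two-element return value are inventions for the example, the signal names mirror the text.

```python
MY_PE_ID = 2   # set by DIP switches / EEPROM in the real hardware

def route_output(dest_pe_id, flit1, my_pe_id=MY_PE_ID):
    """Decide the path for one outgoing token: 'loopback' for
    self-addressed tokens, 'bus' otherwise."""
    self_send = (dest_pe_id == my_pe_id)   # the 2-3 bit comparator
    if self_send:
        # bus output latch not loaded, BUS_REQ not asserted;
        # the token feeds the input mux directly
        return ("loopback", flit1)
    # normal path: load out_latch, assert BUS_REQ, wait for GRANT
    return ("bus", flit1)

assert route_output(2, 0xC0DE)[0] == "loopback"
assert route_output(1, 0xC0DE)[0] == "bus"
```

Note that the decision uses only bits already present in the formed flit 1, which is why it costs a few gates and no extra state.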
### Hardware Cost

```
PE_id comparator:  ~half a chip (XNOR + AND on 2-3 bits)
input mux:         4× 74LS157 (quad 2:1 mux, 16-bit data path)
loopback control:  ~half a chip (gates for mux select, BUS_REQ inhibit)
                   ─────────────
total:             ~5 chips per PE
```

### Timing

Bus path (self-addressed, without loopback): output stage → wait for bus
grant (1-4 cycles depending on contention) → flit 1 on bus (1 cycle) →
flit 2 on bus (1 cycle) → dest decode + input capture → stage 1. minimum
~4-6 cycles.

Loopback path: output stage → mux → stage 1. with a register in the
loopback path for timing safety, 2 cycles (output latches flit 1, next
cycle it's at stage 1 input; flit 2 follows). without the register
(combinational loopback), potentially 1 cycle.

savings: 2-4 cycles per self-addressed token, plus zero bus contention
impact on other nodes.

### Multi-Flit Loopback

the loopback path handles multi-flit tokens the same way the bus input
does. flit 1 feeds through the mux, stage 1 latches the routing fields.
flit 2 feeds through on the next cycle, sits in the input register until
stage 3 reads it. for 3-flit tokens (rare on loopback — a PE would have
to CAS its own local SM? unclear when this happens), the flit 2 holding
register captures flit 2 before flit 3 arrives, same as bus reception.

### Throughput Tradeoff: Self-Route vs Cross-PE Ping-Pong

Self-loopback is fast but serialises the pipeline: the PE processes its
own output token, so it can't work on another node's token simultaneously.
Cross-PE routing (PE0 → PE1 → PE0) costs bus latency per hop but lets both
PEs work in parallel — PE1 processes while PE0's result is in transit.

For a linear dependency chain with no parallelism, self-route wins: fewer
cycles per token, no bus overhead.
For wide graphs with independent
branches, spreading work across PEs wins: the bus latency is paid once per
hop but multiple PEs execute concurrently.

The compiler/assembler can model this tradeoff per-subgraph:

- **self-route**: saves ~3 cycles per token, serialises the PE.
  best for: tight dependent chains, loop bodies, accumulations.
- **cross-PE**: costs ~4 cycles bus latency per hop, enables parallelism.
  best for: independent branches, wide fan-out, producer-consumer pipelines.

Until strongly-connected arc execution is implemented (which makes
sequential self-execution much more efficient), there are cases where
ping-ponging between PEs genuinely outperforms self-routing due to
pipeline fill. the assembler's placement algorithm should consider bus
contention, pipeline depth, and dependency structure when deciding.

### Interaction with BUSY and Concurrent Reception

Self-addressed loopback tokens bypass the bus entirely. crucially, the
bus_reg remains available to capture externally-arriving tokens while the
loopback path feeds the pipeline. this is possible because BUSY is tied
to bus_reg occupancy, not pipeline input occupancy:

```
BUSY = bus_reg_captured AND NOT bus_reg_consumed
```

one AND gate. BUSY asserts when the bus_reg holds an unconsumed flit, and
clears when the pipeline drains it. the loopback path is invisible to BUSY.

this enables concurrent reception:

```
cycle N:   loopback token → mux → pipeline stage 1 (self-send, fast path)
           simultaneously: external token on bus → bus_reg captures
           BUSY asserts (bus_reg now occupied)
cycle N+1: bus_reg token → mux → pipeline stage 1 (external, from bus)
           BUSY clears
```

the input mux priority: loopback wins if both arrive simultaneously.
the PE's own output is already committed and can't stall; the bus token
waits one cycle in bus_reg.
BUSY holds other senders off for that one
beat. one priority gate on two "data available" signals.

without this decoupling, a self-send would block the bus_reg (the
pipeline input is occupied by the loopback token), forcing external
senders to wait even though the bus_reg is physically empty. the
decoupling means the bus_reg serves as a 1-token buffer that absorbs
one external arrival while the pipeline processes a loopback token.

**future upgrade: deeper input buffer.** v0 operates with the bus_reg as
the sole external input buffer (1 token deep). once pipelining becomes
more aggressive post-v0 — particularly with strongly-connected arc
execution keeping the PE busy on sequential chains — a small register
FIFO (2-4 entries) behind the bus_reg would absorb bursts of incoming
tokens without asserting BUSY. this is the same upgrade path noted in
the open questions: NOS IDT7201 FIFOs, register FIFOs built from 373s,
or SRAM-based. the BUSY signal transitions from "bus_reg full" to "FIFO
full" with no protocol change. the loopback path remains independent
of the FIFO — it feeds the pipeline directly, not through the buffer.

---

## Arbiter

### Shared Bus Arbiter

8 requesters, one winner per arbitration cycle. SM0 gets hard priority
(bootstrap). all others round-robin.

```
inputs:  BUS_REQ[7:0], BUS_HOLD
outputs: BUS_GRANT[7:0] (one-hot), or WINNER_ID[2:0] + external decoder

state:
    IDLE:    if any BUS_REQ asserted, select winner, go to GRANTED.
    GRANTED: hold GRANT while BUS_HOLD asserted.
             when BUS_HOLD drops, go to IDLE (or fast-path: immediately
             select next winner if any REQ pending).
```

**Winner selection:** if BUS_REQ[SM0] is asserted, SM0 wins (hard priority
for bootstrap and IO). otherwise, round-robin starting from
`(last_winner + 1) mod 8`.

hardware: 74LS148 (8-to-3 priority encoder) with input masking for
round-robin.
a 3-bit counter (`last_winner`) advances after each grant.
the REQ lines are rotated by `last_winner` before hitting the priority
encoder (using a barrel shift or a mux tree). SM0 priority override:
OR the SM0 REQ into the highest-priority encoder input, bypassing the
rotation.

```
round-robin mux:      2× 74LS151 (8:1 mux) or equivalent  = 2 chips
priority encoder:     1× 74LS148                          = 1 chip
grant decoder:        1× 74LS138 (3-to-8)                 = 1 chip
last_winner counter:  half a 74LS163 (4-bit counter)      = 1 chip
SM0 override + FSM:   ~1 chip (gates)                     = 1 chip
                                                    total: ~6 chips
```

### Split Bus Arbiters

Each bus has its own arbiter. CN arbiter: 4 nodes (PEs), bidirectional.
AN arbiter: 4 senders (PEs only). DN arbiter: 4 senders (SMs only).

4-node arbiter is simpler than 8-node:

```
priority encoder:        1× 74LS148 (using 4 of 8 inputs) = 1 chip
grant decoder:           1× 74LS139 (dual 2-to-4)         = 1 chip
round-robin counter:     half a 74LS74                    = shared
SM0 priority (DN only):  1 gate                           = shared
FSM:                     ~1 chip (gates)                  = 1 chip
                                                 per bus:  ~3-4 chips
                                                 3 buses:  ~10-12 chips
```

SM0 priority applies to the DN arbiter (SM0's responses and exec output
get priority on the return path). CN and AN arbiters are pure round-robin
among PEs.

---

## Backpressure

### BUSY Mechanism (Mandatory)

There is no input FIFO. The bus register is the buffer — one token deep.
If the pipeline hasn't consumed the current token when the next one arrives
for this node, the token is lost. Therefore sender-side destination-busy
checking is **mandatory**, not optional.

Each node has a `BUSY` output: active whenever its bus register holds an
unconsumed token. Senders check `BUSY[dest]` before asserting `BUS_REQ`.
If the destination is busy, the sender holds. Its output latch stays full,
which stalls its pipeline at the output stage. Backpressure propagates
naturally through the pipeline.
```
BUSY[7:0]   one wire per node (active-low, open-collector).
            set when bus_reg captures a flit (DEST_MATCH AND FLIT_VALID).
            cleared when pipeline latches the token into stage 1.
```

**Per-node BUSY generation:** one flip-flop (half a 74LS74). set on capture,
cleared on pipeline consumption. active-low open-collector output allows
wired-AND if needed, and maps directly to an async ACK signal for future
Mode C operation.

**Per-sender BUSY lookup:** the outgoing token's destination ID (2-3 bits
from the formed flit 1, known before BUS_REQ asserts) indexes into
`BUSY[7:0]` via a 74LS151 (8:1 mux). the mux output gates BUS_REQ: if
`BUSY[dest]` is active, BUS_REQ is suppressed.

```
sender logic:
    dest_id   = output_token.flit1[13:12]  (PE_id or SM_id)
                + output_token.flit1[15]   (SM/CM select, for full 3-bit node addr)
    dest_busy = BUSY_MUX[dest_id]
    BUS_REQ   = output_ready AND NOT dest_busy
```

**Hardware cost:**

```
BUSY flip-flops:        4× 74LS74 (8 flip-flops, one per node)  = 4 chips
BUSY mux (per sender):  1× 74LS151 (8:1) each                   = 8 chips max
                        (senders can share mux chips on split bus
                        where each bus has only 4 destinations)
BUS_REQ gate:           included in existing output logic       = 0 extra
                                                          total: ~8-12 chips
```

For the split bus configuration, each bus has only 4 possible destinations.
BUSY lines are per-bus: `CN_BUSY[3:0]`, `AN_BUSY[3:0]`, `DN_BUSY[3:0]`.
the mux shrinks to a 74LS153 (dual 4:1) or equivalent. fewer chips.

**Async upgrade path:** the BUSY signal is the embryonic form of an ACK in
a request/acknowledge handshake protocol. In Mode C (fully async), BUSY
becomes the ACK that completes the flit transfer. the wiring is unchanged —
only the timing discipline around it evolves from "check before requesting"
to "wait for acknowledge after sending."
designing BUSY in from the start
means the async transition doesn't require re-wiring the status lines.

---

## Physical Wiring

### Shared Bus

```
1 ribbon cable, 20-pin IDC:
    DATA[15:0]   16 pins
    FLIT_VALID    1 pin
    BUS_HOLD      1 pin
    GND           2 pins
```

BUS_REQ and BUS_GRANT run as separate point-to-point wires between each
node and the arbiter (8 REQ + 8 GRANT = 16 wires, or 8 REQ + 3-bit
WINNER_ID = 11 wires). these can be a second ribbon or direct jumper wires.

### Split AN/CN/DN

```
3 ribbon cables, 20-pin IDC each:
    CN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2
    AN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2
    DN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2

REQ/GRANT per bus: separate wires or a 4th ribbon.
```

first PCB candidate: a bus backplane board with three rows of IDC
connectors (one row per bus), the three arbiters, and the BUSY/status
lines. PEs and SMs plug in via ribbon cables. testable with 1 PE + 1 SM
before populating the full system.

---

## Scaling Path

| scale | topology | notes |
|---|---|---|
| 1 PE + 1 SM | single shared bus | initial bring-up, no arbiter needed |
| 2–4 PE + 1–2 SM | single shared bus | v0, arbiter added |
| 4 PE + 4 SM | shared or split | v0.5, split if contention measured |
| 8+ PE | split + ring or prefix | CN becomes a ring bus between PEs |
| 16+ PE | hierarchical prefix | full multi-level routing network |

the token format and node interfaces are unchanged across all scales.
the bus interface (373 latches + dest decode + FIFO) is the same whether
it's on a shared bus, a split segment, or a ring. what changes is wiring
and arbiters.

---

## Open Questions

1. **Input FIFO as upgrade.** v0 operates with no input FIFO (bus reg is the
   sole buffer, BUSY mechanism prevents overflow).
   adding a small FIFO
   (4–8 entries) behind the bus reg absorbs bursts and reduces the frequency
   of BUSY stalls. NOS IDT7201 FIFOs (512×9, $3–8 on eBay) are ideal if
   available. otherwise, a register FIFO from 373s (8 chips for 4 entries)
   or SRAM + counter. this is a pure performance upgrade — the BUSY
   mechanism remains the ultimate backpressure signal even with a FIFO
   (BUSY asserts when the FIFO fills instead of when the bus reg is
   occupied).

2. **SM0 bootstrap priority.** hard priority on the shared bus / DN arbiter
   is simple. does it need to be SM0 specifically, or should priority be
   configurable (e.g., a DIP switch selects which node gets priority)?
   hardwired SM0 is fine for v0.

3. **Arbiter latency.** the round-robin mux + priority encoder + decoder
   chain is ~3 gate delays (~30–45 ns). at 5 MHz (200 ns cycle), this
   easily fits. at higher clock rates, the arbiter might need pipelining
   (one cycle to decide, next cycle to grant). for v0 at 5 MHz,
   single-cycle arbitration is fine.

4. **3-flit reception.** the flit 2 holding register adds 2 chips per input
   port. if 3-flit tokens (CAS, EXT) are rare and only received by SMs,
   only SM input ports need the holding register. PE input ports can skip
   it (PEs never receive 3-flit tokens in normal operation). saves 2 chips
   per PE input port.

5. **Split bus trigger.** at what contention level should the shared bus be
   split into AN/CN/DN? diagnostic counters on the arbiter (grant-wait
   cycles per node) provide the measurement. if any node averages >20%
   wait cycles, the split is likely worthwhile. the compiler can also
   estimate bus utilisation statically from the dataflow graph structure.
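The winner-selection rule from the Arbiter section (SM0 hard priority, then round-robin from `last_winner + 1`) is easy to model behaviorally, which is useful for checking fairness before committing to wiring. A Python sketch; the node numbering (SM0 as node 4) is an assumption for illustration, since the real REQ-line-to-node mapping is a wiring choice.

```python
SM0 = 4   # assumed position of SM0 on the 8 REQ lines

def select_winner(bus_req, last_winner):
    """bus_req: list of 8 booleans, one per BUS_REQ line.
    Returns the granted node id, or None if no request is pending."""
    if bus_req[SM0]:
        return SM0                       # hard priority: bootstrap and IO
    # scan starting at (last_winner + 1) mod 8, mirroring the
    # mux-rotated 74LS148 priority encoder
    for i in range(1, 9):
        node = (last_winner + i) % 8
        if bus_req[node]:
            return node
    return None

req = [False] * 8
req[1] = req[6] = True
assert select_winner(req, last_winner=1) == 6   # last winner scanned last
req[SM0] = True
assert select_winner(req, last_winner=1) == SM0 # SM0 overrides rotation
```

Note the fairness property the rotation buys: the previous winner is scanned last, so a node cannot starve its neighbours by re-requesting immediately.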