# Bus Architecture and Interconnect Design Covers the physical bus implementation, node transceivers, arbitration, flit reception, and the scaling path from shared bus to split AN/CN/DN networks. Companion to `pe-redesign-frames-and-pipeline.md` (PE pipeline and token format) and `network-and-communication.md` (logical network model and clocking discipline). --- ## Design Context The system has up to 4 PEs (CMs) and up to 4 SMs, communicating via 16-bit flit-based tokens. All tokens are 1–3 flits. Flit 1 is self-describing: the prefix bits (bit[15] for SM/CM, bits[13:12] for destination ID) tell any node on the bus who the packet is for and how long it is. Two physical bus configurations are documented: shared (simpler, fewer chips, more contention) and split AN/CN/DN (more chips, more bandwidth, closer to historical dataflow machine practice). Both use the same node interface design and token format. The split is a wiring and arbiter change, not a protocol change. --- ## Shared Bus (v0 / v0.5 starting point) ### Topology Single 16-bit bus connecting all nodes. Every node sees every flit. Receivers filter by destination address. One bus master at a time; arbitration determines who drives. ``` ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ PE 0 │ │ PE 1 │ │ PE 2 │ │ PE 3 │ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │ │ │ │ ═══╪═════════╪═════════╪═════════╪═══════════╗ │ │ │ │ 16-bit ║ shared bus ═══╪═════════╪═════════╪═════════╪═══════════╝ │ │ │ │ ┌──┴───┐ ┌──┴───┐ ┌──┴───┐ ┌──┴───┐ │ SM 0 │ │ SM 1 │ │ SM 2 │ │ SM 3 │ └──────┘ └──────┘ └──────┘ └──────┘ ┌─────────┐ │ ARBITER │ └─────────┘ ``` ### Bus Signals ``` DATA[15:0] 16-bit data bus. driven by current bus master, hi-Z otherwise. FLIT_VALID asserted by bus master. indicates DATA is a valid flit. BUS_HOLD asserted by bus master. "more flits coming, don't re-arbitrate." BUS_REQ[7:0] one per node. active = "i have a packet to send." BUS_GRANT active for current bus master. directly enables output drivers. 
``` FLIT_VALID and BUS_HOLD are active-high, driven only by the granted node. since only one node drives at a time, no wired-OR or open-collector needed. ### Packet Transmission Protocol ``` IDLE: nodes with pending output assert BUS_REQ. arbiter selects winner, asserts BUS_GRANT for that node. FLIT 1: winner's output latch OE goes active (drives DATA bus). winner puts flit 1 on DATA, asserts FLIT_VALID. winner asserts BUS_HOLD (if more flits follow). all receivers inspect DATA for destination match. matching receiver captures flit 1. FLIT 2: winner puts flit 2 on DATA. FLIT_VALID stays high. matching receiver captures flit 2. if packet complete (2-flit): winner deasserts BUS_HOLD. FLIT 3 (CAS, EXT only): winner puts flit 3 on DATA. FLIT_VALID stays high. matching receiver captures flit 3. winner deasserts BUS_HOLD. RELEASE: BUS_HOLD drops. arbiter deasserts GRANT. arbiter immediately checks BUS_REQ for next winner. (zero idle cycles between back-to-back packets if another REQ is pending.) ``` Once granted, the bus master holds the bus for the full packet duration (2–3 flit cycles). no preemption, no interleaving. packet atomicity is guaranteed by BUS_HOLD. ### Chip Count (Shared Bus) ``` per PE: ~9 chips (bus interface) +5 chips (loopback: mux + comparator + control) = ~14 chips per PE per SM: ~9 chips (bus interface, no loopback needed) 4 PEs + 4 SMs: 4×14 + 4×9 = 92 chips arbiter: ~6 chips BUSY subsystem: ~8-12 chips bus control: ~2 chips ───────── total: ~108-112 chips ``` --- ## Split AN/CN/DN (v0.5+ upgrade) ### Topology Three separate 16-bit buses, each carrying one logical traffic class. Follows Amamiya's DFM architecture where AN and DN are unidirectional. ``` CN bus (bidirectional, CM↔CM): PE 0 ←→ PE 1 ←→ PE 2 ←→ PE 3 compute tokens between PEs. 4 nodes, bidirectional. AN bus (unidirectional, CM→SM): PE 0 → ┐ PE 1 → ┤ SM operation requests. PE 2 → ├→ SM 0, SM 1, SM 2, SM 3 4 senders (PEs), 4 receivers (SMs). 
PE 3 → ┘ DN bus (unidirectional, SM→CM): SM 0 → ┐ SM 1 → ┤ SM operation responses. SM 2 → ├→ PE 0, PE 1, PE 2, PE 3 4 senders (SMs), 4 receivers (PEs). SM 3 → ┘ ``` ### What the Split Buys **Bandwidth.** PE-to-PE compute traffic and SM request/response traffic are fully decoupled. PE0 sending an SM READ (AN) while PE1 sends a compute token to PE2 (CN) proceed simultaneously. effective bandwidth roughly 3× the shared bus for mixed workloads. **SM round-trip latency.** request goes out on AN, response comes back on DN. the two phases never compete for the same bus. on the shared bus, the response must wait for a bus grant that competes with outgoing requests and inter-PE traffic. **Reduced contention per bus.** CN has 4 nodes (PEs only). AN has 4 senders (PEs). DN has 4 senders (SMs). each bus sees half (or less) of the traffic the shared bus would see. **Cleaner scaling.** adding SMs doesn't increase CN contention. adding PEs doesn't increase DN contention. traffic classes scale independently. ### What the Split Costs **Wiring.** three buses × (16 data + control) vs one. ribbon cables make this manageable: each bus is a 20-pin IDC ribbon (16 data + FLIT_VALID + BUS_HOLD + REQ_line + GND). three ribbons. first PCB candidate: a bus backplane with three sets of IDC connectors and the three arbiters. **Chips.** each PE needs ports on CN (bidirectional) + AN (output only) + DN (input only). each SM needs ports on AN (input only) + DN (output only). see chip count below. **SM→SM path.** on the shared bus, SM0's `exec` can emit SM tokens directly. on the split bus, SM0 only has DN output (→ CMs). SM-bound tokens from exec must route through a loader PE: SM0 emits CM tokens onto DN, the loader PE receives them, executes instructions that construct SM tokens, and emits those onto AN. one extra hop, only during bootstrap, no runtime cost. 
alternatively, the spare CM token subtype (011+11) could serve as a bus-level "forward to AN" hint, but the loader PE approach is cleaner and doesn't burn encoding space. ### Node Bus Ports (Split Configuration) ``` PE: CN out + CN in + AN out + DN in bidirectional send only receive only SM: AN in + DN out receive only send only SM0: AN in + DN out (same as other SMs — exec routes through loader PE) ``` ### Chip Count (Split AN/CN/DN) ``` per PE: CN bidi (4) + AN out (2) + DN in (2) + decode (2) = ~10 chips + loopback (5) = ~15 chips per SM: AN in (2) + DN out (2) + decode (1) = ~5 chips 4 PEs: 60 chips 4 SMs: 20 chips CN arbiter: ~4 chips (4-node) AN arbiter: ~4 chips (4-sender) DN arbiter: ~4 chips (4-sender) BUSY subsystem: ~6-8 chips (fewer nodes per bus, smaller muxes) bus control: ~2 chips ───────── total: ~100-102 chips ``` Delta vs shared bus: ~8-10 fewer chips (loopback cost is the same, but per-bus node counts are lower and BUSY muxes are smaller). buys ~3× bandwidth and decoupled SM latency. the split is strictly better on both chip count and performance once loopback is included — the shared bus node count was inflated by every node needing bidirectional capability. ### PE Input Merge (DN + CN) A PE receives from two buses: CN (inter-PE compute tokens) and DN (SM responses). both feed into the same PE pipeline. the PE's input stage has a 2:1 priority mux: ``` if DN_in has data: select DN_in → pipeline (SM responses unblock waiting ops) else if CN_in has data: select CN_in → pipeline else: idle ``` hardware: one priority signal (DN_NOT_EMPTY), a 2:1 mux on the input latch LE lines. ~1-2 chips. SM responses get priority because they unblock dyadic instructions waiting on SM data — the matching store has a first operand parked, the SM response is the second operand that lets it fire. ### PE Output Split (CN vs AN) A PE emits to CN (bit[15]=0) or AN (bit[15]=1). the output stage has latches on both buses. 
bit[15] from the token determines which latch gets loaded and which BUS_REQ fires: ``` if output_token.bit[15] == 0: load CN_out latch, assert CN_REQ if output_token.bit[15] == 1: load AN_out latch, assert AN_REQ ``` both output latches share the same data input lines from the PE pipeline. the type bit gates the LE (latch enable) to the correct pair. one gate. output enable is controlled independently by each bus's GRANT signal. --- ## Node Interface (Common to Both Configurations) ### Output Path ``` PE pipeline output │ ┌────┴────┐ │ 2× 373 │ output latch (16-bit) │ OE=GRANT│ tri-state: drives bus only when granted └────┬────┘ │ ══════╪══════ bus DATA[15:0] ``` 2× 74LS373 per output port. LE (latch enable) pulsed by the PE/SM output stage when a flit is ready. OE (output enable) tied to BUS_GRANT for this node — only drives the bus when the arbiter says so. when not granted, outputs are hi-Z. the PE's output stage loads flit 1, asserts BUS_REQ, and waits. when GRANT arrives, OE goes active, flit 1 appears on the bus, FLIT_VALID asserts. next cycle, the PE loads flit 2 into the same latch (overwrites flit 1), FLIT_VALID stays high. for 3-flit packets, repeat once more. BUS_HOLD deasserts on the last flit. ### Input Path ``` ══════╪══════ bus DATA[15:0] │ ┌────┴────┐ │ 2× 373 │ bus register (always captures on DEST_MATCH) └────┬────┘ │ ┌────┴────┐ │ 2× 373 │ flit 2 holding register (3-flit packets only) └────┬────┘ │ PE pipeline input ``` **Bus register** (2× 373): D inputs connected directly to bus DATA. LE gated by `DEST_MATCH AND FLIT_VALID`. captures every flit addressed to this node. **Flit 2 holding register** (2× 373): captures bus_reg contents when a 3-flit packet's flit 2 needs to be preserved before flit 3 arrives. for 2-flit packets (the common case), this register is unused. 
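The capture gating described above (LE = `DEST_MATCH AND FLIT_VALID`, every node watching the same DATA lines) can be sketched as a small cycle-level model. This is an illustrative simulation, not the netlist: the class and function names are invented here, and the bit-field positions follow the flit 1 prefix description (bit[15] SM/CM, bits[13:12] destination ID).

```python
# Cycle-level sketch of shared-bus input capture: every node sees every
# flit, but only the addressed node's bus register latch-enable fires.
# Names (BusNode, dest_match) are illustrative, not from the design.

def dest_match(flit1: int, is_sm: bool, my_id: int) -> bool:
    """Decode flit 1: bit[15] selects SM vs CM, bits[13:12] carry the ID."""
    sm_bit = (flit1 >> 15) & 1
    dest_id = (flit1 >> 12) & 0b11
    return sm_bit == (1 if is_sm else 0) and dest_id == my_id

class BusNode:
    def __init__(self, is_sm: bool, my_id: int):
        self.is_sm, self.my_id = is_sm, my_id
        self.bus_reg = None      # 2x 74LS373: captures flits addressed here
        self.matched = False     # high while a packet for this node is in flight

    def bus_cycle(self, data: int, flit_valid: bool, first_flit: bool):
        if not flit_valid:
            self.matched = False
            return
        if first_flit:
            self.matched = dest_match(data, self.is_sm, self.my_id)
        if self.matched:
            self.bus_reg = data  # LE = DEST_MATCH AND FLIT_VALID

# PE 2 and SM 1 both watch the bus; only the addressed node captures.
pe2 = BusNode(is_sm=False, my_id=2)
sm1 = BusNode(is_sm=True, my_id=1)
flit1 = (0 << 15) | (2 << 12) | 0x123    # CM token addressed to PE 2
for node in (pe2, sm1):
    node.bus_cycle(flit1, flit_valid=True, first_flit=True)
assert pe2.bus_reg == flit1 and sm1.bus_reg is None
```

Subsequent flits of the same packet reuse the match latched on flit 1, which is why only flit 1 needs to be self-describing.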
### Destination Decode

Each node compares the incoming flit 1 against its own ID:

```
CM node (PE):  match = NOT bit[15] AND (bit[13:12] == my_PE_id)
SM node:       match = bit[15] AND (bit[13:12] == my_SM_id)
```

`my_PE_id` / `my_SM_id` set by DIP switches or hardwired. the comparison is 2 bits (a 74LS85 is overkill — two XNOR gates + an AND suffice). total decode: ~1 chip per node (a few gates from a 74LS00/74LS86 package).

For the misc bucket CM subtypes (011+xx), the PE accepts all of them if the PE_id matches — subtype decode happens inside the PE pipeline, not at the bus interface.

For the split bus configuration, destination decode simplifies further: on the CN bus, only PEs exist, so bit[15] is always 0 and only PE_id matters. on the AN bus, only SMs are receivers, so bit[15] is always 1 and only SM_id matters. the type-bit check becomes redundant per-bus (but costs nothing to keep for robustness).

### Flit Reception (The "No Deserializer" Insight)

The revised token format puts all routing and control information in flit 1. The pipeline needs offset, act_id, port, and prefix to begin processing (stage 1 INPUT and stage 2 IFETCH). All of these are in flit 1.

Flit 2 carries the data operand. The pipeline doesn't need it until stage 3 (MATCH), which is 2 cycles after stage 1 latches flit 1's fields. By that time flit 2 has arrived in the bus register and is waiting.

This means **no dedicated deserializer is needed.** the pipeline's own stage 1 register latch IS the flit 1 holding register:

```
cycle N:   flit 1 on bus → bus_reg captures.
           stage 1 latches: offset, act_id, port, prefix from bus_reg.
cycle N+1: flit 2 on bus → bus_reg captures (overwrites flit 1 — fine,
           stage 1 already grabbed everything it needs).
cycle N+2: stage 3 reads bus_reg for operand data (flit 2 still there).
```

for 2-flit tokens: zero additional registers beyond the bus 373 pair. the bus_reg holds flit 2 until the pipeline consumes it.
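The 2-flit timeline above can be checked with a tiny straight-line simulation: stage 1 consumes flit 1's fields before flit 2 overwrites bus_reg, and flit 2 is still sitting in bus_reg when stage 3 reads it. Field widths and values here are illustrative assumptions, not the actual token encoding.

```python
# "No deserializer" timing check for a 2-flit token. One shared bus_reg,
# overwritten each cycle, is enough because each consumer reads it in time.

FLIT1 = (0 << 15) | (3 << 12) | 0x0AB    # routing/control word (illustrative)
FLIT2 = 0xBEEF                           # data operand

trace = []

# cycle N: flit 1 arrives; stage 1 latches its fields the same cycle
bus_reg = FLIT1
stage1_fields = {"prefix": bus_reg >> 12, "rest": bus_reg & 0xFFF}
trace.append(("N", bus_reg))

# cycle N+1: flit 2 overwrites bus_reg — harmless, stage 1 is done with flit 1
bus_reg = FLIT2
trace.append(("N+1", bus_reg))

# cycle N+2: stage 3 reads the operand straight out of bus_reg
operand = bus_reg
assert operand == FLIT2
assert stage1_fields["rest"] == 0x0AB    # flit 1 fields survived the overwrite
```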
for 3-flit tokens (CAS, EXT): flit 3 arrives on cycle N+2 and would overwrite flit 2 in bus_reg. the pipeline needs flit 2 for the left operand and flit 3 for the right operand. the **flit 2 holding register** captures flit 2 from bus_reg before flit 3 arrives:

```
cycle N:   flit 1 → bus_reg → stage 1 latches fields
cycle N+1: flit 2 → bus_reg. simultaneously, previous bus_reg (flit 1)
           was already consumed by stage 1. flit 2 now in bus_reg.
cycle N+2: flit2_hold captures bus_reg (flit 2). flit 3 → bus_reg.
           pipeline has: flit 1 fields in stage regs, flit 2 in
           flit2_hold, flit 3 in bus_reg.
```

the flit 2 holding register is clocked by a "3-flit packet, flit 2 is about to be overwritten" signal, derived from the flit counter and the packet length (decoded from flit 1 prefix in stage 1). for 2-flit packets, this signal never fires and the holding register sits idle.

### Input Hardware Cost Per Port

```
bus_reg:       2× 74LS373 (always present)
flit2_hold:    2× 74LS373 (for 3-flit packets)
dest decode:   ~1 chip (a few gates)
flit counter:  1× 74LS74 (dual flip-flop as a 2-bit counter)
control logic: ~1 chip (gates for LE timing, TOKEN_READY)
────────────
total: ~6-7 chips per input port
```

### Output Hardware Cost Per Port

```
out_latch:     2× 74LS373 (drives bus when granted)
BUS_REQ logic: ~half a chip (set on pipeline output ready, clear on grant)
────────────
total: ~3 chips per output port
```

### Per-Node Totals

**Shared bus (bidirectional):**

```
output:   ~3 chips
input:    ~6 chips
────────
per node: ~9 chips (could reduce to ~6 if 3-flit packets handled in pipeline)
```

**Split bus, PE (CN bidi + AN out + DN in):**

```
CN out:     ~3 chips
CN in:      ~6 chips
AN out:     ~3 chips
DN in:      ~6 chips
output mux: ~1 chip (bit[15] selects CN vs AN output)
input mux:  ~1 chip (DN vs CN priority select)
────────
per PE: ~20 chips
```

**Split bus, SM (AN in + DN out):**

```
AN in:  ~6 chips
DN out: ~3 chips
────────
per SM: ~9 chips
```

Note: these counts are slightly higher than earlier estimates because they
include the flit2 holding register. if 3-flit tokens are rare enough, the holding register can be omitted from nodes that will never receive them (e.g., SMs receive 3-flit CAS tokens, but PEs only receive 2-flit compute tokens and 2-flit SM responses — PEs can skip the holding register). --- ## PE Loopback (Self-Addressed Token Bypass) When a PE produces a token addressed to itself, there is no reason for that token to traverse the external bus. self-addressed tokens are the dominant traffic pattern in well-compiled programs (the assembler/compiler maximises self-PE placement), so bypassing the bus for self-sends eliminates a large fraction of bus contention. ### Hardware The PE's output stage already knows the destination PE_id — it's in the flit 1 routing word read from the frame. The PE already knows its own ID (DIP switches / EEPROM). A comparator on those 2-3 bits determines whether the token is self-addressed: ``` self_send = (dest_PE_id == my_PE_id) ``` A 2:1 mux on the PE's input path selects between "from bus" and "from own output": ``` PE pipeline output │ ┌──────┴──────┐ │ self_send? │ comparator on PE_id (a few gates) └──┬───────┬──┘ │ │ self=1 │ │ self=0 │ │ │ ┌──┴──────┐ │ │ bus out │ 2× 373, OE=GRANT │ │ latch │ assert BUS_REQ │ └─────────┘ │ ┌───┐ │ ┌─────────┐ │mux│◄─┴──│ bus in │ from bus input path │2:1│ │ latch │ └─┬─┘ └─────────┘ │ PE pipeline input (stage 1) ``` When `self_send=1`: - output does NOT load the bus output latch. BUS_REQ is not asserted. - output feeds directly to the input mux, which selects the loopback path. - the pipeline sees the token arrive at stage 1 as if it came from the bus. When `self_send=0`: - output loads the bus output latch, asserts BUS_REQ. normal bus path. - input mux selects bus input. normal receive path. 
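The routing decision above collapses to a couple of gates; a behavioural sketch makes the two cases explicit. The function name and return encoding are illustrative, not from the design.

```python
# Sketch of the self-send bypass: compare dest_PE_id against my_PE_id,
# then either feed the PE's own input mux (loopback) or load the bus
# output latch and assert BUS_REQ (normal bus path).

def route_output(dest_pe_id: int, my_pe_id: int):
    """Return (path, bus_req) mirroring the comparator + mux + REQ logic."""
    self_send = dest_pe_id == my_pe_id   # XNOR + AND on 2-3 ID bits
    if self_send:
        return "loopback", False         # bus latch not loaded, BUS_REQ inhibited
    return "bus", True                   # load out latch, assert BUS_REQ

path, bus_req = route_output(dest_pe_id=1, my_pe_id=1)
assert path == "loopback" and bus_req is False

path, bus_req = route_output(dest_pe_id=3, my_pe_id=1)
assert path == "bus" and bus_req is True
```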
### Hardware Cost ``` PE_id comparator: ~half a chip (XNOR + AND on 2-3 bits) input mux: 4× 74LS157 (quad 2:1 mux, 16-bit data path) loopback control: ~half a chip (gates for mux select, BUS_REQ inhibit) ───────────── total: ~5 chips per PE ``` ### Timing Bus path (self-addressed, without loopback): output stage → wait for bus grant (1-4 cycles depending on contention) → flit 1 on bus (1 cycle) → flit 2 on bus (1 cycle) → dest decode + input capture → stage 1. minimum ~4-6 cycles. Loopback path: output stage → mux → stage 1. with a register in the loopback path for timing safety, 2 cycles (output latches flit 1, next cycle it's at stage 1 input; flit 2 follows). without the register (combinational loopback), potentially 1 cycle. savings: 2-4 cycles per self-addressed token, plus zero bus contention impact on other nodes. ### Multi-Flit Loopback the loopback path handles multi-flit tokens the same way the bus input does. flit 1 feeds through the mux, stage 1 latches the routing fields. flit 2 feeds through on the next cycle, sits in the input register until stage 3 reads it. for 3-flit tokens (rare on loopback — a PE would have to CAS its own local SM? unclear when this happens), the flit 2 holding register captures flit 2 before flit 3 arrives, same as bus reception. ### Throughput Tradeoff: Self-Route vs Cross-PE Ping-Pong Self-loopback is fast but serialises the pipeline: the PE processes its own output token, so it can't work on another node's token simultaneously. Cross-PE routing (PE0 → PE1 → PE0) costs bus latency per hop but lets both PEs work in parallel — PE1 processes while PE0's result is in transit. For a linear dependency chain with no parallelism, self-route wins: fewer cycles per token, no bus overhead. For wide graphs with independent branches, spreading work across PEs wins: the bus latency is paid once per hop but multiple PEs execute concurrently. 
The compiler/assembler can model this tradeoff per-subgraph: - **self-route**: saves ~3 cycles per token, serialises the PE. best for: tight dependent chains, loop bodies, accumulations. - **cross-PE**: costs ~4 cycles bus latency per hop, enables parallelism. best for: independent branches, wide fan-out, producer-consumer pipelines. Until strongly-connected arc execution is implemented (which makes sequential self-execution much more efficient), there are cases where ping-ponging between PEs genuinely outperforms self-routing due to pipeline fill. the assembler's placement algorithm should consider bus contention, pipeline depth, and dependency structure when deciding. ### Interaction with BUSY and Concurrent Reception Self-addressed loopback tokens bypass the bus entirely. crucially, the bus_reg remains available to capture externally-arriving tokens while the loopback path feeds the pipeline. this is possible because BUSY is tied to bus_reg occupancy, not pipeline input occupancy: ``` BUSY = bus_reg_captured AND NOT bus_reg_consumed ``` one AND gate. BUSY asserts when the bus_reg holds an unconsumed flit, and clears when the pipeline drains it. the loopback path is invisible to BUSY. this enables concurrent reception: ``` cycle N: loopback token → mux → pipeline stage 1 (self-send, fast path) simultaneously: external token on bus → bus_reg captures BUSY asserts (bus_reg now occupied) cycle N+1: bus_reg token → mux → pipeline stage 1 (external, from bus) BUSY clears ``` the input mux priority: loopback wins if both arrive simultaneously. the PE's own output is already committed and can't stall; the bus token waits one cycle in bus_reg. BUSY holds other senders off for that one beat. one priority gate on two "data available" signals. without this decoupling, a self-send would block the bus_reg (the pipeline input is occupied by the loopback token), forcing external senders to wait even though the bus_reg is physically empty. 
the decoupling means the bus_reg serves as a 1-token buffer that absorbs one external arrival while the pipeline processes a loopback token. **future upgrade: deeper input buffer.** v0 operates with the bus_reg as the sole external input buffer (1 token deep). once pipelining becomes more aggressive post-v0 — particularly with strongly-connected arc execution keeping the PE busy on sequential chains — a small register FIFO (2-4 entries) behind the bus_reg would absorb bursts of incoming tokens without asserting BUSY. this is the same upgrade path noted in the open questions: NOS IDT7201 FIFOs, register FIFOs from 373s, or SRAM-based. the BUSY signal transitions from "bus_reg full" to "FIFO full" with no protocol change. the loopback path remains independent of the FIFO — it feeds the pipeline directly, not through the buffer. --- ## Arbiter ### Shared Bus Arbiter 8 requesters, one winner per arbitration cycle. SM0 gets hard priority (bootstrap). all others round-robin. ``` inputs: BUS_REQ[7:0], BUS_HOLD outputs: BUS_GRANT[7:0] (one-hot), or WINNER_ID[2:0] + external decoder state: IDLE: if any BUS_REQ asserted, select winner, go to GRANTED. GRANTED: hold GRANT while BUS_HOLD asserted. when BUS_HOLD drops, go to IDLE (or fast-path: immediately select next winner if any REQ pending). ``` **Winner selection:** if BUS_REQ[SM0] is asserted, SM0 wins (hard priority for bootstrap and IO). otherwise, round-robin starting from `(last_winner + 1) mod 8`. hardware: 74LS148 (8-to-3 priority encoder) with input masking for round-robin. a 3-bit counter (`last_winner`) advances after each grant. the REQ lines are rotated by `last_winner` before hitting the priority encoder (using a barrel shift or a mux tree). SM0 priority override: OR the SM0 REQ into the highest-priority encoder input, bypassing the rotation. 
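The winner-selection scheme just described — rotate the REQ lines past `last_winner`, priority-encode, SM0 overrides the rotation — can be sketched behaviourally. The node ordering (indices 0-3 for PEs, 4-7 for SMs, SM0 at index 4) is an assumption for illustration; the document does not pin down the BUS_REQ line assignment.

```python
# Behavioural model of the round-robin arbiter with SM0 hard priority.
# SM0's assumed position on BUS_REQ[7:0] is illustrative.

SM0 = 4

def select_winner(req, last_winner):
    """req: 8 bools (BUS_REQ lines). Returns winning node index, or None."""
    if req[SM0]:
        return SM0                        # hard priority (bootstrap / IO)
    # round-robin: scan from last_winner+1 upward, wrapping around
    for offset in range(1, 9):
        node = (last_winner + offset) % 8
        if req[node]:
            return node
    return None

req = [False] * 8
req[1] = req[6] = True
assert select_winner(req, last_winner=1) == 6   # rotation skips the last winner
assert select_winner(req, last_winner=6) == 1   # then comes back around
req[SM0] = True
assert select_winner(req, last_winner=6) == SM0  # SM0 override wins
```

The `offset` scan plays the role of the rotated priority encoder; in hardware the rotation is the barrel shift / mux tree in front of the 74LS148.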
```
round-robin mux:     2× 74LS151 (8:1 mux) or equivalent = 2 chips
priority encoder:    1× 74LS148 = 1 chip
grant decoder:       1× 74LS138 (3-to-8) = 1 chip
last_winner counter: 1× 74LS163 (3 of 4 bits used) = 1 chip
SM0 override + FSM:  ~1 chip (gates) = 1 chip

total: ~6 chips
```

### Split Bus Arbiters

Each bus has its own arbiter. CN arbiter: 4 nodes (PEs), bidirectional. AN arbiter: 4 senders (PEs only). DN arbiter: 4 senders (SMs only).

4-node arbiter is simpler than 8-node:

```
priority encoder:        1× 74LS148 (using 4 of 8 inputs) = 1 chip
grant decoder:           1× 74LS139 (dual 2-to-4) = 1 chip
round-robin counter:     1× 74LS74 (2-bit) = shared
SM0 priority (DN only):  1 gate = shared
FSM:                     ~1 chip (gates) = 1 chip

per bus: ~3-4 chips
3 buses: ~10-12 chips
```

SM0 priority applies to the DN arbiter (SM0's responses and exec output get priority on the return path). CN and AN arbiters are pure round-robin among PEs.

---

## Backpressure

### BUSY Mechanism (Mandatory)

There is no input FIFO. The bus register is the buffer — one token deep. If the pipeline hasn't consumed the current token when the next one arrives for this node, the token is lost. Therefore sender-side destination-busy checking is **mandatory**, not optional.

Each node has a `BUSY` output: active whenever its bus register holds an unconsumed token. Senders check `BUSY[dest]` before asserting `BUS_REQ`. If the destination is busy, the sender holds. Its output latch stays full, which stalls its pipeline at the output stage. Backpressure propagates naturally through the pipeline.

```
BUSY[7:0]  one wire per node (active-low, open-collector).
           set when bus_reg captures a flit (DEST_MATCH AND FLIT_VALID).
           cleared when pipeline latches the token into stage 1.
```

**Per-node BUSY generation:** one flip-flop (half a 74LS74). set on capture, cleared on pipeline consumption. active-low open-collector output allows wired-AND if needed, and maps directly to an async ACK signal for future Mode C operation.
**Per-sender BUSY lookup:** the outgoing token's destination ID (2-3 bits from the formed flit 1, known before BUS_REQ asserts) indexes into `BUSY[7:0]` via a 74LS151 (8:1 mux). output gates BUS_REQ: if `BUSY[dest]` is active, BUS_REQ is suppressed.

```
sender logic:
  dest_id   = output_token.flit1[13:12] (PE_id or SM_id)
            + output_token.flit1[15]    (SM/CM select, for full 3-bit node addr)
  dest_busy = BUSY_MUX[dest_id]
  BUS_REQ   = output_ready AND NOT dest_busy
```

**Hardware cost:**

```
BUSY flip-flops:       4× 74LS74 (8 flip-flops, one per node) = 4 chips
BUSY mux (per sender): 1× 74LS151 (8:1) each = 8 chips max
                       (senders can share mux chips on split bus where
                        each bus has only 4 destinations)
BUS_REQ gate:          included in existing output logic = 0 extra

total: ~8-12 chips
```

For the split bus configuration, each bus has only 4 possible destinations. BUSY lines are per-bus: `CN_BUSY[3:0]`, `AN_BUSY[3:0]`, `DN_BUSY[3:0]`. the mux shrinks to a 74LS153 (dual 4:1) or equivalent. fewer chips.

**Async upgrade path:** the BUSY signal is the embryonic form of an ACK in a request/acknowledge handshake protocol. In Mode C (fully async), BUSY becomes the ACK that completes the flit transfer. the wiring is unchanged — only the timing discipline around it evolves from "check before requesting" to "wait for acknowledge after sending." designing BUSY in from the start means the async transition doesn't require re-wiring the status lines.

---

## Physical Wiring

### Shared Bus

```
1 ribbon cable, 20-pin IDC:
  DATA[15:0]  16 pins
  FLIT_VALID   1 pin
  BUS_HOLD     1 pin
  GND          2 pins
```

BUS_REQ and BUS_GRANT run as separate point-to-point wires between each node and the arbiter (8 REQ + 8 GRANT = 16 wires, or 8 REQ + 3-bit WINNER_ID = 11 wires). these can be a second ribbon or direct jumper wires.
### Split AN/CN/DN

```
3 ribbon cables, 20-pin IDC each:
  CN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2
  AN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2
  DN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2

REQ/GRANT per bus: separate wires or a 4th ribbon.
```

first PCB candidate: a bus backplane board with three rows of IDC connectors (one row per bus), the three arbiters, and the BUSY/status lines. PEs and SMs plug in via ribbon cables. testable with 1 PE + 1 SM before populating the full system.

---

## Scaling Path

| scale | topology | notes |
|---|---|---|
| 1 PE + 1 SM | single shared bus | initial bring-up, no arbiter needed |
| 2–4 PE + 1–2 SM | single shared bus | v0, arbiter added |
| 4 PE + 4 SM | shared or split | v0.5, split if contention measured |
| 8+ PE | split + ring or prefix | CN becomes a ring bus between PEs |
| 16+ PE | hierarchical prefix | full multi-level routing network |

the token format and node interfaces are unchanged across all scales. the bus interface (373 latches + dest decode + optional input FIFO) is the same whether it's on a shared bus, a split segment, or a ring. what changes is wiring and arbiters.

---

## Open Questions

1. **Input FIFO as upgrade.** v0 operates with no input FIFO (bus reg is the sole buffer, BUSY mechanism prevents overflow). adding a small FIFO (4–8 entries) behind the bus reg absorbs bursts and reduces the frequency of BUSY stalls. NOS IDT7201 FIFOs (512×9, $3–8 on eBay) are ideal if available. otherwise, a register FIFO from 373s (8 chips for 4 entries) or SRAM + counter. this is a pure performance upgrade — the BUSY mechanism remains as the ultimate backpressure signal even with a FIFO (BUSY asserts when the FIFO fills instead of when the bus reg is occupied).

2. **SM0 bootstrap priority.** hard priority on the shared bus / DN arbiter is simple. does it need to be SM0 specifically, or should priority be configurable (e.g., DIP switch selects which node gets priority)? hardwired SM0 is fine for v0.

3.
**Arbiter latency.** the round-robin mux + priority encoder + decoder chain is ~3 gate delays (~30–45 ns). at 5 MHz (200 ns cycle), this easily fits. at higher clock rates, the arbiter might need pipelining (one cycle to decide, next cycle to grant). for v0 at 5 MHz, single-cycle arbitration is fine.

4. **3-flit reception.** the flit 2 holding register adds 2 chips per input port. if 3-flit tokens (CAS, EXT) are rare and only received by SMs, only SM input ports need the holding register. PE input ports can skip it (PEs never receive 3-flit tokens in normal operation). saves 2 chips per PE input port.

5. **Split bus trigger.** at what contention level should the shared bus be split into AN/CN/DN? diagnostic counters on the arbiter (grant-wait cycles per node) provide the measurement. if any node averages >20% wait cycles, the split is likely worthwhile. the compiler can also estimate bus utilisation statically from the dataflow graph structure.
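The static estimate mentioned in question 5 could look something like the following. Everything here is a hypothetical sketch — the graph encoding, the assumption that tokens average 2 flits, and the utilisation formula are illustrative, not the compiler's actual model. It relies only on the facts above: self-placed arcs go over loopback (zero bus flits), cross-PE arcs put one token on the shared bus.

```python
# Hypothetical static bus-utilisation estimate from a placed dataflow graph.
# arcs: (src, dst) node pairs; placement: node -> PE id.

def shared_bus_flits(arcs, placement, flits_per_token=2):
    """Count flits the arcs would put on the shared bus.
    Self-placed arcs use the loopback path and cost zero bus flits."""
    return sum(flits_per_token
               for src, dst in arcs
               if placement[src] != placement[dst])

def utilisation(arcs, placement, cycles):
    """Fraction of bus cycles occupied, for an estimated execution length."""
    return shared_bus_flits(arcs, placement) / cycles

arcs = [("a", "b"), ("b", "c"), ("c", "d")]
tight = {"a": 0, "b": 0, "c": 0, "d": 0}    # all self-routed: no bus traffic
spread = {"a": 0, "b": 1, "c": 2, "d": 3}   # every arc crosses PEs

assert shared_bus_flits(arcs, tight) == 0
assert shared_bus_flits(arcs, spread) == 6
assert utilisation(arcs, spread, cycles=30) == 0.2   # at the ~20% threshold
```

A real estimate would also weight contention (two senders wanting the same cycle), but even this crude flit count separates "split now" workloads from "shared bus is fine" ones.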