# Bus Architecture and Interconnect Design Covers the physical bus implementation, node transceivers, arbitration, flit reception, and the scaling path from shared bus to split AN/CN/DN networks. Companion to `pe-redesign-frames-and-pipeline.md` (PE pipeline and token format) and `network-and-communication.md` (logical network model and clocking discipline). --- ## Design Context The system has up to 4 PEs (CMs) and up to 4 SMs, communicating via 16-bit flit-based tokens. All tokens are 1–3 flits. Flit 1 is self-describing: the prefix bits (bit[15] for SM/CM, bits[13:12] for destination ID) tell any node on the bus who the packet is for and how long it is. Two physical bus configurations are documented: shared (simpler, fewer chips, more contention) and split AN/CN/DN (more chips, more bandwidth, closer to historical dataflow machine practice). Both use the same node interface design and token format. The split is a wiring and arbiter change, not a protocol change. --- ## Shared Bus (v0 / v0.5 starting point) ### Topology Single 16-bit bus connecting all nodes. Every node sees every flit. Receivers filter by destination address. One bus master at a time; arbitration determines who drives. ``` ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ PE 0 │ │ PE 1 │ │ PE 2 │ │ PE 3 │ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │ │ │ │ ═══╪═════════╪═════════╪═════════╪═══════════╗ │ │ │ │ 16-bit ║ shared bus ═══╪═════════╪═════════╪═════════╪═══════════╝ │ │ │ │ ┌──┴───┐ ┌──┴───┐ ┌──┴───┐ ┌──┴───┐ │ SM 0 │ │ SM 1 │ │ SM 2 │ │ SM 3 │ └──────┘ └──────┘ └──────┘ └──────┘ ┌─────────┐ │ ARBITER │ └─────────┘ ``` ### Bus Signals ``` DATA[15:0] 16-bit data bus. driven by current bus master, hi-Z otherwise. FLIT_VALID asserted by bus master. indicates DATA is a valid flit. BUS_HOLD asserted by bus master. "more flits coming, don't re-arbitrate." BUS_REQ[7:0] one per node. active = "i have a packet to send." BUS_GRANT active for current bus master. directly enables output drivers. 
``` FLIT_VALID and BUS_HOLD are active-high, driven only by the granted node. since only one node drives at a time, no wired-OR or open-collector needed. ### Packet Transmission Protocol ``` IDLE: nodes with pending output assert BUS_REQ. arbiter selects winner, asserts BUS_GRANT for that node. FLIT 1: winner's output latch OE goes active (drives DATA bus). winner puts flit 1 on DATA, asserts FLIT_VALID. winner asserts BUS_HOLD (if more flits follow). all receivers inspect DATA for destination match. matching receiver captures flit 1. FLIT 2: winner puts flit 2 on DATA. FLIT_VALID stays high. matching receiver captures flit 2. if packet complete (2-flit): winner deasserts BUS_HOLD. FLIT 3 (CAS, EXT only): winner puts flit 3 on DATA. FLIT_VALID stays high. matching receiver captures flit 3. winner deasserts BUS_HOLD. RELEASE: BUS_HOLD drops. arbiter deasserts GRANT. arbiter immediately checks BUS_REQ for next winner. (zero idle cycles between back-to-back packets if another REQ is pending.) ``` Once granted, the bus master holds the bus for the full packet duration (2–3 flit cycles). no preemption, no interleaving. packet atomicity is guaranteed by BUS_HOLD. ### Chip Count (Shared Bus) ``` per PE: ~9 chips (bus interface) +5 chips (loopback: mux + comparator + control) = ~14 chips per PE per SM: ~9 chips (bus interface, no loopback needed) 4 PEs + 4 SMs: 4×14 + 4×9 = 92 chips arbiter: ~6 chips BUSY subsystem: ~8-12 chips bus control: ~2 chips ───────── total: ~108-112 chips ``` --- ## Split AN/CN/DN (v0.5+ upgrade) ### Topology Three separate 16-bit buses, each carrying one logical traffic class. Follows Amamiya's DFM architecture where AN and DN are unidirectional. ``` CN bus (bidirectional, CM↔CM): PE 0 ←→ PE 1 ←→ PE 2 ←→ PE 3 compute tokens between PEs. 4 nodes, bidirectional. AN bus (unidirectional, CM→SM): PE 0 → ┐ PE 1 → ┤ SM operation requests. PE 2 → ├→ SM 0, SM 1, SM 2, SM 3 4 senders (PEs), 4 receivers (SMs). 
PE 3 → ┘ DN bus (unidirectional, SM→CM): SM 0 → ┐ SM 1 → ┤ SM operation responses. SM 2 → ├→ PE 0, PE 1, PE 2, PE 3 4 senders (SMs), 4 receivers (PEs). SM 3 → ┘ ``` ### What the Split Buys **Bandwidth.** PE-to-PE compute traffic and SM request/response traffic are fully decoupled. PE0 sending an SM READ (AN) while PE1 sends a compute token to PE2 (CN) proceed simultaneously. effective bandwidth roughly 3× the shared bus for mixed workloads. **SM round-trip latency.** request goes out on AN, response comes back on DN. the two phases never compete for the same bus. on the shared bus, the response must wait for a bus grant that competes with outgoing requests and inter-PE traffic. **Reduced contention per bus.** CN has 4 nodes (PEs only). AN has 4 senders (PEs). DN has 4 senders (SMs). each bus sees half (or less) of the traffic the shared bus would see. **Cleaner scaling.** adding SMs doesn't increase CN contention. adding PEs doesn't increase DN contention. traffic classes scale independently. ### What the Split Costs **Wiring.** three buses × (16 data + control) vs one. ribbon cables make this manageable: each bus is a 20-pin IDC ribbon (16 data + FLIT_VALID + BUS_HOLD + REQ_line + GND). three ribbons. first PCB candidate: a bus backplane with three sets of IDC connectors and the three arbiters. **Chips.** each PE needs ports on CN (bidirectional) + AN (output only) + DN (input only). each SM needs ports on AN (input only) + DN (output only). see chip count below. **SM→SM path.** on the shared bus, SM0's `exec` can emit SM tokens directly. on the split bus, SM0 only has DN output (→ CMs). SM-bound tokens from exec must route through a loader PE: SM0 emits CM tokens onto DN, the loader PE receives them, executes instructions that construct SM tokens, and emits those onto AN. one extra hop, only during bootstrap, no runtime cost. 
alternatively, the spare CM token subtype (011+11) could serve as a bus-level "forward to AN" hint, but the loader PE approach is cleaner and doesn't burn encoding space. ### Node Bus Ports (Split Configuration) ``` PE: CN out + CN in + AN out + DN in bidirectional send only receive only SM: AN in + DN out receive only send only SM0: AN in + DN out (same as other SMs — exec routes through loader PE) ``` ### Chip Count (Split AN/CN/DN) ``` per PE: CN bidi (4) + AN out (2) + DN in (2) + decode (2) = ~10 chips + loopback (5) = ~15 chips per SM: AN in (2) + DN out (2) + decode (1) = ~5 chips 4 PEs: 60 chips 4 SMs: 20 chips CN arbiter: ~4 chips (4-node) AN arbiter: ~4 chips (4-sender) DN arbiter: ~4 chips (4-sender) BUSY subsystem: ~6-8 chips (fewer nodes per bus, smaller muxes) bus control: ~2 chips ───────── total: ~100-102 chips ``` Delta vs shared bus: ~8-10 fewer chips (loopback cost is the same, but per-bus node counts are lower and BUSY muxes are smaller). buys ~3× bandwidth and decoupled SM latency. the split is strictly better on both chip count and performance once loopback is included — the shared bus node count was inflated by every node needing bidirectional capability. ### PE Input Merge (DN + CN) A PE receives from two buses: CN (inter-PE compute tokens) and DN (SM responses). both feed into the same PE pipeline. the PE's input stage has a 2:1 priority mux: ``` if DN_in has data: select DN_in → pipeline (SM responses unblock waiting ops) else if CN_in has data: select CN_in → pipeline else: idle ``` hardware: one priority signal (DN_NOT_EMPTY), a 2:1 mux on the input latch LE lines. ~1-2 chips. SM responses get priority because they unblock dyadic instructions waiting on SM data — the matching store has a first operand parked, the SM response is the second operand that lets it fire. ### PE Output Split (CN vs AN) A PE emits to CN (bit[15]=0) or AN (bit[15]=1). the output stage has latches on both buses. 
bit[15] from the token determines which latch gets loaded and which BUS_REQ fires: ``` if output_token.bit[15] == 0: load CN_out latch, assert CN_REQ if output_token.bit[15] == 1: load AN_out latch, assert AN_REQ ``` both output latches share the same data input lines from the PE pipeline. the type bit gates the LE (latch enable) to the correct pair. one gate. output enable is controlled independently by each bus's GRANT signal. --- ## Node Interface (Common to Both Configurations) ### Output Path ``` PE pipeline output │ ┌────┴────┐ │ 2× 373 │ output latch (16-bit) │ OE=GRANT│ tri-state: drives bus only when granted └────┬────┘ │ ══════╪══════ bus DATA[15:0] ``` 2× 74LS373 per output port. LE (latch enable) pulsed by the PE/SM output stage when a flit is ready. OE (output enable) tied to BUS_GRANT for this node — only drives the bus when the arbiter says so. when not granted, outputs are hi-Z. the PE's output stage loads flit 1, asserts BUS_REQ, and waits. when GRANT arrives, OE goes active, flit 1 appears on the bus, FLIT_VALID asserts. next cycle, the PE loads flit 2 into the same latch (overwrites flit 1), FLIT_VALID stays high. for 3-flit packets, repeat once more. BUS_HOLD deasserts on the last flit. ### Input Path ``` ══════╪══════ bus DATA[15:0] │ ┌────┴────┐ │ 2× 373 │ bus register (always captures on DEST_MATCH) └────┬────┘ │ ┌────┴────┐ │ 2× 373 │ flit 2 holding register (3-flit packets only) └────┬────┘ │ PE pipeline input ``` **Bus register** (2× 373): D inputs connected directly to bus DATA. LE gated by `DEST_MATCH AND FLIT_VALID`. captures every flit addressed to this node. **Flit 2 holding register** (2× 373): captures bus_reg contents when a 3-flit packet's flit 2 needs to be preserved before flit 3 arrives. for 2-flit packets (the common case), this register is unused. 
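The capture gating described above (LE = `DEST_MATCH AND FLIT_VALID`, every node watching the same DATA lines) can be sketched as a small cycle-level model. This is an illustrative simulation, not the netlist: the class and function names are invented here, and the bit-field positions follow the flit 1 prefix description (bit[15] SM/CM, bits[13:12] destination ID).

```python
# Cycle-level sketch of shared-bus input capture: every node sees every
# flit, but only the addressed node's bus register latch-enable fires.
# Names (BusNode, dest_match) are illustrative, not from the design.

def dest_match(flit1: int, is_sm: bool, my_id: int) -> bool:
    """Decode flit 1: bit[15] selects SM vs CM, bits[13:12] carry the ID."""
    sm_bit = (flit1 >> 15) & 1
    dest_id = (flit1 >> 12) & 0b11
    return sm_bit == (1 if is_sm else 0) and dest_id == my_id

class BusNode:
    def __init__(self, is_sm: bool, my_id: int):
        self.is_sm, self.my_id = is_sm, my_id
        self.bus_reg = None      # 2x 74LS373: captures flits addressed here
        self.matched = False     # high while a packet for this node is in flight

    def bus_cycle(self, data: int, flit_valid: bool, first_flit: bool):
        if not flit_valid:
            self.matched = False
            return
        if first_flit:
            self.matched = dest_match(data, self.is_sm, self.my_id)
        if self.matched:
            self.bus_reg = data  # LE = DEST_MATCH AND FLIT_VALID

# PE 2 and SM 1 both watch the bus; only the addressed node captures.
pe2 = BusNode(is_sm=False, my_id=2)
sm1 = BusNode(is_sm=True, my_id=1)
flit1 = (0 << 15) | (2 << 12) | 0x123    # CM token addressed to PE 2
for node in (pe2, sm1):
    node.bus_cycle(flit1, flit_valid=True, first_flit=True)
assert pe2.bus_reg == flit1 and sm1.bus_reg is None
```

Subsequent flits of the same packet reuse the match latched on flit 1, which is why only flit 1 needs to be self-describing.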
### Destination Decode

Each node compares the incoming flit 1 against its own ID:

```
CM node (PE):  match = NOT bit[15] AND (bit[13:12] == my_PE_id)
SM node:       match = bit[15] AND (bit[13:12] == my_SM_id)
```

`my_PE_id` / `my_SM_id` set by DIP switches or hardwired. the comparison is 2 bits (a 74LS85 is overkill — two XNOR gates + an AND suffice). total decode: ~1 chip per node (a few gates from a 74LS00/74LS86 package).

For the misc bucket CM subtypes (011+xx), the PE accepts all of them if the PE_id matches — subtype decode happens inside the PE pipeline, not at the bus interface.

For the split bus configuration, destination decode simplifies further: on the CN bus, only PEs exist, so bit[15] is always 0 and only PE_id matters. on the AN bus, only SMs are receivers, so bit[15] is always 1 and only SM_id matters. the type-bit check becomes redundant per-bus (but costs nothing to keep for robustness).

### Flit Reception (The "No Deserializer" Insight)

The revised token format puts all routing and control information in flit 1. The pipeline needs offset, act_id, port, and prefix to begin processing (stage 1 INPUT and stage 2 IFETCH). All of these are in flit 1.

Flit 2 carries the data operand. The pipeline doesn't need it until stage 3 (MATCH), which is 2 cycles after stage 1 latches flit 1's fields. By that time flit 2 has arrived in the bus register and is waiting.

This means **no dedicated deserializer is needed.** the pipeline's own stage 1 register latch IS the flit 1 holding register:

```
cycle N:   flit 1 on bus → bus_reg captures.
           stage 1 latches: offset, act_id, port, prefix from bus_reg.
cycle N+1: flit 2 on bus → bus_reg captures (overwrites flit 1 — fine,
           stage 1 already grabbed everything it needs).
cycle N+2: stage 3 reads bus_reg for operand data (flit 2 still there).
```

for 2-flit tokens: zero additional registers beyond the bus 373 pair. the bus_reg holds flit 2 until the pipeline consumes it.
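The 2-flit timeline above can be checked with a tiny straight-line simulation: stage 1 consumes flit 1's fields before flit 2 overwrites bus_reg, and flit 2 is still sitting in bus_reg when stage 3 reads it. Field widths and values here are illustrative assumptions, not the actual token encoding.

```python
# "No deserializer" timing check for a 2-flit token. One shared bus_reg,
# overwritten each cycle, is enough because each consumer reads it in time.

FLIT1 = (0 << 15) | (3 << 12) | 0x0AB    # routing/control word (illustrative)
FLIT2 = 0xBEEF                           # data operand

trace = []

# cycle N: flit 1 arrives; stage 1 latches its fields the same cycle
bus_reg = FLIT1
stage1_fields = {"prefix": bus_reg >> 12, "rest": bus_reg & 0xFFF}
trace.append(("N", bus_reg))

# cycle N+1: flit 2 overwrites bus_reg — harmless, stage 1 is done with flit 1
bus_reg = FLIT2
trace.append(("N+1", bus_reg))

# cycle N+2: stage 3 reads the operand straight out of bus_reg
operand = bus_reg
assert operand == FLIT2
assert stage1_fields["rest"] == 0x0AB    # flit 1 fields survived the overwrite
```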
for 3-flit tokens (CAS, EXT): flit 3 arrives on cycle N+2 and would overwrite flit 2 in bus_reg. the pipeline needs flit 2 for the left operand and flit 3 for the right operand. the **flit 2 holding register** captures flit 2 from bus_reg before flit 3 arrives:

```
cycle N:   flit 1 → bus_reg → stage 1 latches fields
cycle N+1: flit 2 → bus_reg. simultaneously, previous bus_reg (flit 1)
           was already consumed by stage 1. flit 2 now in bus_reg.
cycle N+2: flit2_hold captures bus_reg (flit 2). flit 3 → bus_reg.
           pipeline has: flit 1 fields in stage regs, flit 2 in
           flit2_hold, flit 3 in bus_reg.
```

the flit 2 holding register is clocked by a "3-flit packet, flit 2 is about to be overwritten" signal, derived from the flit counter and the packet length (decoded from flit 1 prefix in stage 1). for 2-flit packets, this signal never fires and the holding register sits idle.

### Input Hardware Cost Per Port

```
bus_reg:       2× 74LS373 (always present)
flit2_hold:    2× 74LS373 (for 3-flit packets)
dest decode:   ~1 chip (a few gates)
flit counter:  1× 74LS74 (dual flip-flop as a 2-bit counter)
control logic: ~1 chip (gates for LE timing, TOKEN_READY)
────────────
total: ~6-7 chips per input port
```

### Output Hardware Cost Per Port

```
out_latch:     2× 74LS373 (drives bus when granted)
BUS_REQ logic: ~half a chip (set on pipeline output ready, clear on grant)
────────────
total: ~3 chips per output port
```

### Per-Node Totals

**Shared bus (bidirectional):**

```
output:   ~3 chips
input:    ~6 chips
────────
per node: ~9 chips (could reduce to ~6 if 3-flit packets handled in pipeline)
```

**Split bus, PE (CN bidi + AN out + DN in):**

```
CN out:     ~3 chips
CN in:      ~6 chips
AN out:     ~3 chips
DN in:      ~6 chips
output mux: ~1 chip (bit[15] selects CN vs AN output)
input mux:  ~1 chip (DN vs CN priority select)
────────
per PE: ~20 chips
```

**Split bus, SM (AN in + DN out):**

```
AN in:  ~6 chips
DN out: ~3 chips
────────
per SM: ~9 chips
```

Note: these counts are slightly higher than earlier estimates because they
include the flit2 holding register. if 3-flit tokens are rare enough, the holding register can be omitted from nodes that will never receive them (e.g., SMs receive 3-flit CAS tokens, but PEs only receive 2-flit compute tokens and 2-flit SM responses — PEs can skip the holding register). --- ## PE Loopback (Self-Addressed Token Bypass) When a PE produces a token addressed to itself, there is no reason for that token to traverse the external bus. self-addressed tokens are the dominant traffic pattern in well-compiled programs (the assembler/compiler maximises self-PE placement), so bypassing the bus for self-sends eliminates a large fraction of bus contention. ### Hardware The PE's output stage already knows the destination PE_id — it's in the flit 1 routing word read from the frame. The PE already knows its own ID (DIP switches / EEPROM). A comparator on those 2-3 bits determines whether the token is self-addressed: ``` self_send = (dest_PE_id == my_PE_id) ``` A 2:1 mux on the PE's input path selects between "from bus" and "from own output": ``` PE pipeline output │ ┌──────┴──────┐ │ self_send? │ comparator on PE_id (a few gates) └──┬───────┬──┘ │ │ self=1 │ │ self=0 │ │ │ ┌──┴──────┐ │ │ bus out │ 2× 373, OE=GRANT │ │ latch │ assert BUS_REQ │ └─────────┘ │ ┌───┐ │ ┌─────────┐ │mux│◄─┴──│ bus in │ from bus input path │2:1│ │ latch │ └─┬─┘ └─────────┘ │ PE pipeline input (stage 1) ``` When `self_send=1`: - output does NOT load the bus output latch. BUS_REQ is not asserted. - output feeds directly to the input mux, which selects the loopback path. - the pipeline sees the token arrive at stage 1 as if it came from the bus. When `self_send=0`: - output loads the bus output latch, asserts BUS_REQ. normal bus path. - input mux selects bus input. normal receive path. 
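The routing decision above collapses to a couple of gates; a behavioural sketch makes the two cases explicit. The function name and return encoding are illustrative, not from the design.

```python
# Sketch of the self-send bypass: compare dest_PE_id against my_PE_id,
# then either feed the PE's own input mux (loopback) or load the bus
# output latch and assert BUS_REQ (normal bus path).

def route_output(dest_pe_id: int, my_pe_id: int):
    """Return (path, bus_req) mirroring the comparator + mux + REQ logic."""
    self_send = dest_pe_id == my_pe_id   # XNOR + AND on 2-3 ID bits
    if self_send:
        return "loopback", False         # bus latch not loaded, BUS_REQ inhibited
    return "bus", True                   # load out latch, assert BUS_REQ

path, bus_req = route_output(dest_pe_id=1, my_pe_id=1)
assert path == "loopback" and bus_req is False

path, bus_req = route_output(dest_pe_id=3, my_pe_id=1)
assert path == "bus" and bus_req is True
```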
### Hardware Cost ``` PE_id comparator: ~half a chip (XNOR + AND on 2-3 bits) input mux: 4× 74LS157 (quad 2:1 mux, 16-bit data path) loopback control: ~half a chip (gates for mux select, BUS_REQ inhibit) ───────────── total: ~5 chips per PE ``` ### Timing Bus path (self-addressed, without loopback): output stage → wait for bus grant (1-4 cycles depending on contention) → flit 1 on bus (1 cycle) → flit 2 on bus (1 cycle) → dest decode + input capture → stage 1. minimum ~4-6 cycles. Loopback path: output stage → mux → stage 1. with a register in the loopback path for timing safety, 2 cycles (output latches flit 1, next cycle it's at stage 1 input; flit 2 follows). without the register (combinational loopback), potentially 1 cycle. savings: 2-4 cycles per self-addressed token, plus zero bus contention impact on other nodes. ### Multi-Flit Loopback the loopback path handles multi-flit tokens the same way the bus input does. flit 1 feeds through the mux, stage 1 latches the routing fields. flit 2 feeds through on the next cycle, sits in the input register until stage 3 reads it. for 3-flit tokens (rare on loopback — a PE would have to CAS its own local SM? unclear when this happens), the flit 2 holding register captures flit 2 before flit 3 arrives, same as bus reception. ### Throughput Tradeoff: Self-Route vs Cross-PE Ping-Pong Self-loopback is fast but serialises the pipeline: the PE processes its own output token, so it can't work on another node's token simultaneously. Cross-PE routing (PE0 → PE1 → PE0) costs bus latency per hop but lets both PEs work in parallel — PE1 processes while PE0's result is in transit. For a linear dependency chain with no parallelism, self-route wins: fewer cycles per token, no bus overhead. For wide graphs with independent branches, spreading work across PEs wins: the bus latency is paid once per hop but multiple PEs execute concurrently. 
The compiler/assembler can model this tradeoff per-subgraph: - **self-route**: saves ~3 cycles per token, serialises the PE. best for: tight dependent chains, loop bodies, accumulations. - **cross-PE**: costs ~4 cycles bus latency per hop, enables parallelism. best for: independent branches, wide fan-out, producer-consumer pipelines. Until strongly-connected arc execution is implemented (which makes sequential self-execution much more efficient), there are cases where ping-ponging between PEs genuinely outperforms self-routing due to pipeline fill. the assembler's placement algorithm should consider bus contention, pipeline depth, and dependency structure when deciding. ### Interaction with BUSY and Concurrent Reception Self-addressed loopback tokens bypass the bus entirely. crucially, the bus_reg remains available to capture externally-arriving tokens while the loopback path feeds the pipeline. this is possible because BUSY is tied to bus_reg occupancy, not pipeline input occupancy: ``` BUSY = bus_reg_captured AND NOT bus_reg_consumed ``` one AND gate. BUSY asserts when the bus_reg holds an unconsumed flit, and clears when the pipeline drains it. the loopback path is invisible to BUSY. this enables concurrent reception: ``` cycle N: loopback token → mux → pipeline stage 1 (self-send, fast path) simultaneously: external token on bus → bus_reg captures BUSY asserts (bus_reg now occupied) cycle N+1: bus_reg token → mux → pipeline stage 1 (external, from bus) BUSY clears ``` the input mux priority: loopback wins if both arrive simultaneously. the PE's own output is already committed and can't stall; the bus token waits one cycle in bus_reg. BUSY holds other senders off for that one beat. one priority gate on two "data available" signals. without this decoupling, a self-send would block the bus_reg (the pipeline input is occupied by the loopback token), forcing external senders to wait even though the bus_reg is physically empty. 
the decoupling means the bus_reg serves as a 1-token buffer that absorbs one external arrival while the pipeline processes a loopback token. **future upgrade: deeper input buffer.** v0 operates with the bus_reg as the sole external input buffer (1 token deep). once pipelining becomes more aggressive post-v0 — particularly with strongly-connected arc execution keeping the PE busy on sequential chains — a small register FIFO (2-4 entries) behind the bus_reg would absorb bursts of incoming tokens without asserting BUSY. this is the same upgrade path noted in the open questions: NOS IDT7201 FIFOs, register FIFOs from 373s, or SRAM-based. the BUSY signal transitions from "bus_reg full" to "FIFO full" with no protocol change. the loopback path remains independent of the FIFO — it feeds the pipeline directly, not through the buffer. --- ## Arbiter ### Shared Bus Arbiter 8 requesters, one winner per arbitration cycle. SM0 gets hard priority (bootstrap). all others round-robin. ``` inputs: BUS_REQ[7:0], BUS_HOLD outputs: BUS_GRANT[7:0] (one-hot), or WINNER_ID[2:0] + external decoder state: IDLE: if any BUS_REQ asserted, select winner, go to GRANTED. GRANTED: hold GRANT while BUS_HOLD asserted. when BUS_HOLD drops, go to IDLE (or fast-path: immediately select next winner if any REQ pending). ``` **Winner selection:** if BUS_REQ[SM0] is asserted, SM0 wins (hard priority for bootstrap and IO). otherwise, round-robin starting from `(last_winner + 1) mod 8`. hardware: 74LS148 (8-to-3 priority encoder) with input masking for round-robin. a 3-bit counter (`last_winner`) advances after each grant. the REQ lines are rotated by `last_winner` before hitting the priority encoder (using a barrel shift or a mux tree). SM0 priority override: OR the SM0 REQ into the highest-priority encoder input, bypassing the rotation. 
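The winner-selection scheme just described — rotate the REQ lines past `last_winner`, priority-encode, SM0 overrides the rotation — can be sketched behaviourally. The node ordering (indices 0-3 for PEs, 4-7 for SMs, SM0 at index 4) is an assumption for illustration; the document does not pin down the BUS_REQ line assignment.

```python
# Behavioural model of the round-robin arbiter with SM0 hard priority.
# SM0's assumed position on BUS_REQ[7:0] is illustrative.

SM0 = 4

def select_winner(req, last_winner):
    """req: 8 bools (BUS_REQ lines). Returns winning node index, or None."""
    if req[SM0]:
        return SM0                        # hard priority (bootstrap / IO)
    # round-robin: scan from last_winner+1 upward, wrapping around
    for offset in range(1, 9):
        node = (last_winner + offset) % 8
        if req[node]:
            return node
    return None

req = [False] * 8
req[1] = req[6] = True
assert select_winner(req, last_winner=1) == 6   # rotation skips the last winner
assert select_winner(req, last_winner=6) == 1   # then comes back around
req[SM0] = True
assert select_winner(req, last_winner=6) == SM0  # SM0 override wins
```

The `offset` scan plays the role of the rotated priority encoder; in hardware the rotation is the barrel shift / mux tree in front of the 74LS148.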
```
round-robin mux:     2× 74LS151 (8:1 mux) or equivalent = 2 chips
priority encoder:    1× 74LS148 = 1 chip
grant decoder:       1× 74LS138 (3-to-8) = 1 chip
last_winner counter: 1× 74LS163 (3 of 4 bits used) = 1 chip
SM0 override + FSM:  ~1 chip (gates) = 1 chip

total: ~6 chips
```

### Split Bus Arbiters

Each bus has its own arbiter. CN arbiter: 4 nodes (PEs), bidirectional. AN arbiter: 4 senders (PEs only). DN arbiter: 4 senders (SMs only).

4-node arbiter is simpler than 8-node:

```
priority encoder:        1× 74LS148 (using 4 of 8 inputs) = 1 chip
grant decoder:           1× 74LS139 (dual 2-to-4) = 1 chip
round-robin counter:     1× 74LS74 (2-bit) = shared
SM0 priority (DN only):  1 gate = shared
FSM:                     ~1 chip (gates) = 1 chip

per bus: ~3-4 chips
3 buses: ~10-12 chips
```

SM0 priority applies to the DN arbiter (SM0's responses and exec output get priority on the return path). CN and AN arbiters are pure round-robin among PEs.

---

## Backpressure

### BUSY Mechanism (Mandatory)

There is no input FIFO. The bus register is the buffer — one token deep. If the pipeline hasn't consumed the current token when the next one arrives for this node, the token is lost. Therefore sender-side destination-busy checking is **mandatory**, not optional.

Each node has a `BUSY` output: active whenever its bus register holds an unconsumed token. Senders check `BUSY[dest]` before asserting `BUS_REQ`. If the destination is busy, the sender holds. Its output latch stays full, which stalls its pipeline at the output stage. Backpressure propagates naturally through the pipeline.

```
BUSY[7:0]  one wire per node (active-low, open-collector).
           set when bus_reg captures a flit (DEST_MATCH AND FLIT_VALID).
           cleared when pipeline latches the token into stage 1.
```

**Per-node BUSY generation:** one flip-flop (half a 74LS74). set on capture, cleared on pipeline consumption. active-low open-collector output allows wired-AND if needed, and maps directly to an async ACK signal for future Mode C operation.
**Per-sender BUSY lookup:** the outgoing token's destination ID (2-3 bits from the formed flit 1, known before BUS_REQ asserts) indexes into `BUSY[7:0]` via a 74LS151 (8:1 mux). output gates BUS_REQ: if `BUSY[dest]` is active, BUS_REQ is suppressed.

```
sender logic:
  dest_id   = output_token.flit1[13:12] (PE_id or SM_id)
            + output_token.flit1[15]    (SM/CM select, for full 3-bit node addr)
  dest_busy = BUSY_MUX[dest_id]
  BUS_REQ   = output_ready AND NOT dest_busy
```

**Hardware cost:**

```
BUSY flip-flops:       4× 74LS74 (8 flip-flops, one per node) = 4 chips
BUSY mux (per sender): 1× 74LS151 (8:1) each = 8 chips max
                       (senders can share mux chips on split bus where
                        each bus has only 4 destinations)
BUS_REQ gate:          included in existing output logic = 0 extra

total: ~8-12 chips
```

For the split bus configuration, each bus has only 4 possible destinations. BUSY lines are per-bus: `CN_BUSY[3:0]`, `AN_BUSY[3:0]`, `DN_BUSY[3:0]`. the mux shrinks to a 74LS153 (dual 4:1) or equivalent. fewer chips.

**Async upgrade path:** the BUSY signal is the embryonic form of an ACK in a request/acknowledge handshake protocol. In Mode C (fully async), BUSY becomes the ACK that completes the flit transfer. the wiring is unchanged — only the timing discipline around it evolves from "check before requesting" to "wait for acknowledge after sending." designing BUSY in from the start means the async transition doesn't require re-wiring the status lines.

---

## Physical Wiring

### Shared Bus

```
1 ribbon cable, 20-pin IDC:
  DATA[15:0]  16 pins
  FLIT_VALID   1 pin
  BUS_HOLD     1 pin
  GND          2 pins
```

BUS_REQ and BUS_GRANT run as separate point-to-point wires between each node and the arbiter (8 REQ + 8 GRANT = 16 wires, or 8 REQ + 3-bit WINNER_ID = 11 wires). these can be a second ribbon or direct jumper wires.
### Split AN/CN/DN

```
3 ribbon cables, 20-pin IDC each:
  CN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2
  AN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2
  DN: DATA[15:0] + FLIT_VALID + BUS_HOLD + GND×2

REQ/GRANT per bus: separate wires or a 4th ribbon.
```

first PCB candidate: a bus backplane board with three rows of IDC connectors (one row per bus), the three arbiters, and the BUSY/status lines. PEs and SMs plug in via ribbon cables. testable with 1 PE + 1 SM before populating the full system.

---

## Scaling Path

| scale | topology | notes |
|---|---|---|
| 1 PE + 1 SM | single shared bus | initial bring-up, no arbiter needed |
| 2–4 PE + 1–2 SM | single shared bus | v0, arbiter added |
| 4 PE + 4 SM | shared or split | v0.5, split if contention measured |
| 8+ PE | split + ring or prefix | CN becomes a ring bus between PEs |
| 16+ PE | hierarchical prefix | full multi-level routing network |

the token format and node interfaces are unchanged across all scales. the bus interface (373 latches + dest decode + optional input FIFO) is the same whether it's on a shared bus, a split segment, or a ring. what changes is wiring and arbiters.

---

## Open Questions

1. **Input FIFO as upgrade.** v0 operates with no input FIFO (bus reg is the sole buffer, BUSY mechanism prevents overflow). adding a small FIFO (4–8 entries) behind the bus reg absorbs bursts and reduces the frequency of BUSY stalls. NOS IDT7201 FIFOs (512×9, $3–8 on eBay) are ideal if available. otherwise, a register FIFO from 373s (8 chips for 4 entries) or SRAM + counter. this is a pure performance upgrade — the BUSY mechanism remains as the ultimate backpressure signal even with a FIFO (BUSY asserts when the FIFO fills instead of when the bus reg is occupied).

2. **SM0 bootstrap priority.** hard priority on the shared bus / DN arbiter is simple. does it need to be SM0 specifically, or should priority be configurable (e.g., DIP switch selects which node gets priority)? hardwired SM0 is fine for v0.

3.
**Arbiter latency.** the round-robin mux + priority encoder + decoder chain is ~3 gate delays (~30–45 ns). at 5 MHz (200 ns cycle), this easily fits. at higher clock rates, the arbiter might need pipelining (one cycle to decide, next cycle to grant). for v0 at 5 MHz, single-cycle arbitration is fine.

4. **3-flit reception.** the flit 2 holding register adds 2 chips per input port. if 3-flit tokens (CAS, EXT) are rare and only received by SMs, only SM input ports need the holding register. PE input ports can skip it (PEs never receive 3-flit tokens in normal operation). saves 2 chips per PE input port.

5. **Split bus trigger.** at what contention level should the shared bus be split into AN/CN/DN? diagnostic counters on the arbiter (grant-wait cycles per node) provide the measurement. if any node averages >20% wait cycles, the split is likely worthwhile. the compiler can also estimate bus utilisation statically from the dataflow graph structure.
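The static estimate mentioned in question 5 could look something like the following. Everything here is a hypothetical sketch — the graph encoding, the assumption that tokens average 2 flits, and the utilisation formula are illustrative, not the compiler's actual model. It relies only on the facts above: self-placed arcs go over loopback (zero bus flits), cross-PE arcs put one token on the shared bus.

```python
# Hypothetical static bus-utilisation estimate from a placed dataflow graph.
# arcs: (src, dst) node pairs; placement: node -> PE id.

def shared_bus_flits(arcs, placement, flits_per_token=2):
    """Count flits the arcs would put on the shared bus.
    Self-placed arcs use the loopback path and cost zero bus flits."""
    return sum(flits_per_token
               for src, dst in arcs
               if placement[src] != placement[dst])

def utilisation(arcs, placement, cycles):
    """Fraction of bus cycles occupied, for an estimated execution length."""
    return shared_bus_flits(arcs, placement) / cycles

arcs = [("a", "b"), ("b", "c"), ("c", "d")]
tight = {"a": 0, "b": 0, "c": 0, "d": 0}    # all self-routed: no bus traffic
spread = {"a": 0, "b": 1, "c": 2, "d": 3}   # every arc crosses PEs

assert shared_bus_flits(arcs, tight) == 0
assert shared_bus_flits(arcs, spread) == 6
assert utilisation(arcs, spread, cycles=30) == 0.2   # at the ~20% threshold
```

A real estimate would also weight contention (two senders wanting the same cycle), but even this crude flit count separates "split now" workloads from "shared bus is fine" ones.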