OR-1 dataflow CPU sketch
# Loop Patterns and Flow Control Idioms

Execution patterns for loops, reductions, and flow control in the
dataflow architecture. These are software/compiler conventions built
from existing hardware primitives — no dedicated loop hardware exists.

Most patterns described here are candidates for **assembler macros**:
reusable expansions that emit the underlying instruction sequences.
The programmer writes `#loop_counted counter, limit, body_label` and
the assembler expands it into the token feedback arcs, SWITCH routing,
and permit structures described below.

See `iram-and-function-calls.md` for IRAM format and ctx_mode details.
See `alu-and-output-design.md` for SWITCH, GATE, and output modes.
See `sm-design.md` for SM operations referenced by some patterns.

---

## Core Loop Mechanism

There is no program counter, no branch instruction, and no loop
construct in hardware. Loops are **token feedback arcs**: an
instruction's output token is routed back to an input of an earlier
instruction (or itself) in the dataflow graph. The loop "iterates"
each time a token completes the feedback circuit.
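The token's trip around a feedback arc can be sketched as a few lines
of ordinary sequential code — a hypothetical Python model, not
toolchain code (`run_counted_loop` and its structure are invented for
illustration):

```python
# Minimal model of a token feedback arc: a counter token recirculates
# through INC -> LT -> SWITCH until the comparison routes it to exit.
# Purely illustrative; these are not real ISA mnemonics.

def run_counted_loop(limit):
    body_dispatches = []          # tokens routed to the loop body
    token = 0                     # seed token from CONST(0)
    while True:
        token = token + 1         # INC
        taken = token < limit     # LT limit
        if taken:                 # SWITCH: true -> body + feedback arc
            body_dispatches.append(token)
        else:                     # SWITCH: false -> exit
            return body_dispatches

print(run_counted_loop(4))        # → [1, 2, 3]
```

The `while True` stands in for the feedback arc itself: the loop
"runs" only because the token keeps re-entering the same three
instructions.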
### Minimal Counted Loop

```
graph:
   CONST(0) ──────────────────────────────────┐
                                              │
   ┌──────────────────────────────────────────┤
   │                                          V
   │   ┌─────┐     ┌──────────┐     ┌────────────────┐
   └──►│ INC │────►│ LT limit │────►│     SWITCH     │
       └─────┘     └──────────┘     │  true  → body  │
          ▲ i+1        bool         │  false → exit  │
          │                         └────────┬───────┘
          │        data (i+1) to body ◄──────┤
          │                                  │
          └── i+1 fed back (same token) ─────┘
```

Instructions (all on same PE, same context):

```dfasm
; Counted loop: increment from 0, dispatch to body, exit when done

&counter <| const, 0     ; initial counter value (seed token starts loop)
&step    <| inc          ; increment counter
&cmp     <| lt           ; compare counter < limit
&route   <| sweq         ; route by comparison result

const <limit> |> &cmp:R  ; limit value (or SM read if > 255)

&counter |> &step        ; seed → first increment
&step    |> &cmp:L       ; counter → comparison left
&step    |> &route:L     ; counter → switch data input (fan-out)
&cmp     |> &route:R     ; bool → switch control

&route:L |> &body:L      ; taken (true) → dispatch to body
&route:R |> &exit:L      ; not-taken (false) → loop done
&route:L |> &step        ; feedback arc: counter recirculates
```

The feedback arc is just a destination field in the SWITCH instruction
(or a PASS trampoline if SWITCH can't dual-route to both body dispatch
and the INC feedback simultaneously). The loop "runs" as long as tokens
keep flowing through the feedback arc.

### Timing

Each iteration traverses: INC → LT → SWITCH → bus → INC. With the v0
pipeline (no local bypass, all tokens go through external bus), expect
roughly 6-10 cycles per loop control iteration depending on bus
contention and pipeline depth.

The loop body executes concurrently with loop control — the dispatched
body token enters the body subgraph immediately while the counter
continues to the next iteration.
If the body takes longer than one
control iteration, multiple body invocations can be in flight
simultaneously (with appropriate flow control — see Permit Tokens).

---

## Permit-Token Flow Control

When loop body iterations can execute concurrently, a throttling
mechanism prevents context slot exhaustion. **Permit tokens** are the
standard dataflow idiom for this.

### Concept

K permit tokens circulate through the system. Each dispatch to a body
context consumes one permit. Each body completion produces one permit.
At most K body iterations are in flight simultaneously. If no permits
are available, the dispatch GATE stalls — the loop control token waits
in the matching store until a permit arrives.

```
 permits (K tokens, initially injected at boot)


 ┌────────┐
 │  GATE  │◄──── loop control produces (counter, body_data)
 │  L: permit
 │  R: dispatch_data
 └───┬────┘
     │ (fires only when BOTH permit AND data are ready)

 dispatch to body context (CHANGE_TAG or CTX_OVRD)


 body executes ... body completes


 emit permit token back to GATE (port L)
```

### Implementation

The GATE instruction is dyadic. Left port receives the permit token.
Right port receives the loop's dispatch data (counter value, array
pointer, whatever the body needs). GATE fires only when both are
present — this IS the backpressure mechanism. No special hardware
flow control.

```dfasm
; Permit-gated dispatch
&gate <| gate             ; dyadic: L=permit, R=dispatch data
&loop_output |> &gate:R   ; loop control feeds data to gate

; Body completion recycles the permit
&body_done |> &gate:L     ; body's final instruction returns permit
```

The body's final instruction emits a token to `&gate:L` as one of its
destinations. This token is the recycled permit. Its data value is
irrelevant (just a trigger); what matters is its presence in the
matching store.
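The backpressure behaviour is easy to model. A hypothetical Python
sketch (`PermitGate` is an invented name; the deques stand in for
matching store cells):

```python
from collections import deque

# Model of a permit-gated dispatch: GATE fires only when both a permit
# (port L) and dispatch data (port R) are present. K permits bound the
# number of body iterations in flight. Illustrative only.

class PermitGate:
    def __init__(self, k):
        self.permits = deque([None] * k)   # K permits injected at boot
        self.data = deque()
        self.in_flight = 0
        self.max_in_flight = 0

    def offer_data(self, value):           # loop control → port R
        self.data.append(value)
        self._try_fire()

    def body_done(self):                   # body completion recycles permit
        self.in_flight -= 1
        self.permits.append(None)
        self._try_fire()

    def _try_fire(self):
        while self.permits and self.data:  # fires only when BOTH present
            self.permits.popleft()
            self.data.popleft()
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)

gate = PermitGate(k=2)
for i in range(5):                         # loop control offers 5 dispatches
    gate.offer_data(i)
while gate.in_flight:                      # bodies complete one by one
    gate.body_done()
print(gate.max_in_flight)                  # → 2
```

Note that nothing in the model "stalls" explicitly — un-matched data
tokens simply sit in the queue, exactly as loop control tokens sit in
the matching store.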
K is chosen by the compiler:

- K = 1: fully sequential, one body at a time. safe default.
- K = number of reserved body context slots: maximum parallelism.
- K = pipeline depth / body latency: optimal for throughput.

### Initial Permit Injection

At boot (or function entry), K permit tokens must be injected into
the GATE. Options:

- **CONST chain:** K CONST instructions at sequential offsets, each
  with dest targeting the GATE's left port. Triggered by the function
  entry token via fan-out. Burns K IRAM slots but is simple.
- **SM EXEC:** pre-load K permit tokens in SM, EXEC emits them.
  Uses one IRAM slot for the EXEC trigger. Better for large K.
- **Assembler macro:** `#permit_inject *args` expands to the
  appropriate injection sequence.

### Assembler Macro Sketch

```
; PERMIT_LOOP(K, limit, body_label, exit_label)
; Expands to:
;   - K CONST instructions emitting initial permits
;   - GATE (permit, dispatch_data) guarding body dispatch
;   - INC/LT/SWITCH loop control chain
;   - feedback arc from SWITCH to INC
;   - body return path emitting permit on completion
```

The macro assigns IRAM offsets for the control structure and reserves
K context slots for body iterations.

---

## Parallel Reduction

A common pattern following parallel loop iterations: combine K partial
results into a single value.

### Binary Reduction Tree

```
K=4 partial sums:  s0  s1  s2  s3
                    \  /    \  /
                    ADD      ADD
                      \      /
                        ADD

                       total
```

Each ADD is a dyadic instruction in its own right. The partial results
arrive as tokens, match in the ADD's matching store entry, fire, and
produce the next level's input. The tree structure is pure dataflow —
no special reduction hardware.

For K iterations, the tree has log2(K) levels and K-1 ADD instructions.
With K=8, that's 7 ADDs across 3 levels.
All ADDs at the same level
can fire in parallel (they're on different matching store entries or
different PEs).

### Assembler Macro Sketch

```
; REDUCE(op, inputs[], output)
; Expands to:
;   - ceil(log2(N)) levels of binary op instructions
;   - routing from each level's outputs to next level's inputs
;   - final output routed to specified destination
```

---

## Loop-Carried Accumulators (Self-Loop Pattern)

A value that updates every iteration and feeds back to itself. The
canonical example: `sum += a[i]`.

### Matching-Store-as-Register

A dyadic instruction whose output routes back to its own left port:

```dfasm
; Self-loop accumulator: sum += each incoming value
&acc <| add         ; dyadic: L=accumulated sum, R=new element
&acc |> &acc:L      ; feedback: result → own left port

; Initialise: deposit starting value before first element arrives
&init <| const, 0
&init |> &acc:L     ; seed the accumulator with 0
```

The matching store cell at `&acc`'s (ctx, offset, port L) holds the
accumulator value between iterations. Each new element arriving
on port R triggers the ADD, which deposits the updated sum back
into port L's cell.

**Timing:** Each accumulation step is a full round-trip: ALU →
output formatter → bus → input FIFO → matching store → ALU. Roughly
6-10 cycles at v0. This is the sequential bottleneck — the accumulator
feedback arc is inherently serial.

**Extracting the final value:** After the last element is accumulated,
the sum sits in the matching store cell at port L. It needs a "drain"
event to extract it. Options:

- A sentinel token on port R triggers one final ADD (or PASS), and
  dest2 routes the result to the downstream consumer.
- A GATE controlled by a "loop done" boolean from the loop control.
  When the loop completes, the GATE opens and the accumulated value
  flows out.
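A hypothetical Python model of the cell-plus-drain behaviour (the
held variable stands in for the port-L matching store cell; `SENTINEL`
plays the drain trigger from the first option above):

```python
# Model of matching-store-as-register: a held value plays the cell at
# (ctx, offset, port L); each port-R token fires the ADD and writes
# the result back to port L. A sentinel token drains the final value.
# Illustrative model, not toolchain code.

SENTINEL = object()

def self_loop_accumulate(tokens, seed=0):
    port_l = seed                  # &init deposits the seed into port L
    for tok in tokens:             # tokens arriving on port R, in order
        if tok is SENTINEL:        # drain: PASS routes the cell downstream
            return port_l
        port_l = port_l + tok      # ADD fires, result -> own port L

print(self_loop_accumulate([3, 4, 5, SENTINEL]))   # → 12
```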
### With Parallel Iterations

Each body context has its own matching store entries (different ctx →
different SRAM address). K parallel accumulators at the same IRAM
offset but different context slots operate independently.

```dfasm
; All iterations share the same IRAM instruction for &acc.
; Different context slots → different matching store cells.
; No interference between iterations.

; ctx=1: sum_1 accumulator (self-loop)
; ctx=2: sum_2 accumulator (self-loop)
; ctx=3: sum_3 accumulator (self-loop)
; ... all at the same &acc instruction, different contexts
```

After all iterations complete, a reduction tree combines partial sums.
The permit-token mechanism guarantees that partial sums are ready before
the reduction begins (the permits themselves can be chained to trigger
the reduction).

---

## Predicate Register Optimisation (Future, ~1 Chip)

A single shared 1-bit register (or small multi-bit register) that
stores a comparison result locally, bypassing the token network for
the boolean path.

### Benefit

In the standard loop control pattern, the comparison boolean travels
as a token: LT produces a bool token → bus → matching store → SWITCH
consumes it. The predicate register short-circuits this:

```
without predicate register:
  LT → [bool token] → bus → matching → SWITCH
  cost: full token round-trip for the boolean

with predicate register:
  LT writes bool to predicate register (side effect, no token)
  SWITCH reads predicate register (local wire, no matching needed)
  cost: zero additional cycles for the boolean path
```

The counter feedback arc still goes through the bus. But the boolean
path — typically half the loop control overhead — becomes free.

### Constraints

- **Not per-context.** Single shared register.
  The compiler must
  guarantee only one activation uses the predicate at a time.
- **Not suitable for parallel iterations.** Each iteration would need
  its own predicate state. Use the token-based boolean path for
  parallel loop control.
- **IRAM encoding:** 1-2 bits per instruction (pred_write, pred_read).
  Can be folded into opcode space as dedicated variants (LT_P, SWITCH_P)
  or drawn from spare bits in half 1.

### Hardware

```
1-bit predicate register:     1 flip-flop
write path (from comparator): 1 gate (write enable)
read mux (to SWITCH):         1 gate (bool_out source select)
Total: ~1 chip (fraction of a chip, really)
```

---

## Accumulator Register Optimisation (Future, ~3 Chips)

A single shared 16-bit register writable by the ALU and readable as
an ALU input source. Eliminates the bus round-trip for tight
accumulation loops.

### Benefit

```
without accumulator register:
  ADD(acc, new) → output → bus → input → matching → ADD
  cost: ~6-10 cycles per accumulation

with accumulator register:
  ACC_ADD: reads acc from register, adds new from token, writes result
  back to register. monadic (only new element token needed).
  cost: 1 pipeline pass per accumulation (~3-5 cycles)
```

### Constraints

- **Not per-context.** Same single-activation restriction as predicate
  register.
- **Monadic operation.** ACC_ADD/ACC_SUB are monadic — the accumulator
  is an implicit operand from the register, the explicit operand comes
  from the arriving token. No matching store entry consumed.
- **No matching store write conflict.** The register is a separate
  storage element from the matching store. ALU writes to the register
  at Stage 4; matching store writes happen at Stage 2. No port
  conflict, no stall logic needed.
- **Stepping stone to SC blocks.** The accumulator register is
  effectively the first register of a future local register file.
  Adding a second register + sequential instruction counter yields a
  minimal strongly-connected (SC) block capability.

### Hardware

```
16-bit register (2x 74LS374):       2 chips
ALU source mux (add acc_reg input): 1 chip
write enable gating:                ~0 chips (1 gate)
Total: ~3 chips per PE
```

---

## Assembler Macro and Function Call Strategy

The patterns above are mechanical enough to be assembler macros.
Macros are expected to be simple text/token substitution (C-style
`#define` territory, not Zig comptime or C++ templates). This means:

- No conditional logic within macros. Different strategies get
  different macros, and the programmer picks which to use.
- No offset allocation intelligence. The assembler tracks placement
  and validates the expansion (offset collisions, missing labels),
  but the macro itself is dumb substitution.
- No type checking or context-slot tracking. The programmer is
  responsible for not blowing the slot budget.

### Macro Syntax

A macro call uses `#macro_name` and follows the same syntax as any
other operation (arguments, edge wiring, port qualifiers). Macro
definitions follow a function-block-like structure.

### Function Calls as Syntax

Using a `$func` label as an instruction generates the appropriate
routing for static calls. Named arguments match against the
function's internal labels:

```dfasm
; Function definition:
$add_pair |> {
    add &a, &b |> @ret
}

; Static call — named args wire to internal labels:
$add_pair a=&x, b=&y |> @output
```

`@ret` is a built-in pseudo-node that marks the function's return point.
The assembler resolves `a=&x` to mean "wire `&x`'s output to
`$add_pair.&a`, port L, with ctx_mode=01 and the allocator-assigned
context slot." The `|> @output` wires the return point to `@output`.

For static calls (non-recursive, known call graph), this generates
only routing annotations on existing instructions — no extra IRAM
entries, no CHANGE_TAG, just destination fields set with ctx_mode=01.

### Dynamic and Recursive Calls (v1, Manual / Macro-Assisted)

The assembler does NOT handle recursive or indirect calls
automatically — that's compiler territory. Recursive calls require
runtime context allocation (SM READ_INC), CHANGE_TAG sequences, and
EXTRACT_TAG for return continuations. The assembler provides macros
to reduce boilerplate for the mechanical parts; the programmer
manages descriptor tables, context budgets, and flow control.

#### The Problem

A dynamic call to a function with N arguments requires:

1. **Allocate context** — SM READ_INC on an allocator cell → new_ctx
2. **Build return continuation** — EXTRACT_TAG captures caller's
   (PE, ctx, offset, gen) as a 16-bit packed tag value
3. **Fetch N+1 tag templates** — SM_READ_C from a descriptor table
   (one per argument destination + one for return destination)
4. **Patch ctx into each tag** — OR new_ctx into bits [3:0] of
   each template (templates are pre-built at boot with ctx=0)
5. **Send return continuation** — CHANGE_TAG with patched return tag
6. **Send each argument** — CHANGE_TAG with patched arg tag + value

Done naively (one IRAM slot per step), this burns 4N+8 IRAM slots
per call site. Two recursive calls = 24+ slots for N=1. With 128
slots per PE, that's unsustainable.

#### The EXEC-Based Call Stub

EXEC is a token cannon — it reads a sequence of pre-formed tokens
from SM/ROM and fires them onto the bus. The tokens can be anything:
SM read requests, CM tokens, triggers.
One IRAM slot (the EXEC
trigger) replaces an arbitrary number of pre-staged operations.

The key insight: steps 3-4 above (fetch tag templates, deliver them
to patching logic) are **identical for every call to the same
function**. The tag templates, their SM addresses, and where to
deliver them are all compile-time constants. Only the allocated
ctx and argument values change per call.

This splits the call machinery into two parts:

**Call stub (shared, loaded once per function):**

IRAM instructions that receive runtime values (ctx, args, return
continuation) and perform the patching + dispatch. These live in
IRAM and are shared across all call sites for the same function.
Different call sites invoke the stub in different context slots,
so their matching store entries don't collide.

**EXEC sequence (in SM/ROM, per function):**

Pre-formed tokens that read tag templates from the descriptor table
and deliver them to the stub's OR instructions. Triggered by a
single EXEC instruction. Stored once, fired per call.
```
Per-function (loaded once):
  call stub in IRAM:   ctx fan-out + OR patches + CHANGE_TAGs
  EXEC sequence in SM: SM_READ_C tokens targeting the stub
  descriptor table:    pre-formed flit 1 templates (ctx=0)

Per-call-site (tiny):
  3 IRAM slots: rd_inc (allocate) + exec (trigger) + extract_tag (return)
  wiring: feed ctx, return cont, and arg values into the stub
```

#### Call Stub Structure (Example: N=1 Argument)

```dfasm
; ── call stub for $fib, loaded once, shared across call sites ──
; runs in caller's allocated ctx (different per call → no collision)

; ctx fan-out: new_ctx needs to reach 2 OR instructions (ret + arg)
&__fib_ctx_fan <| pass
&__fib_ctx_fan |> &__fib_or_ret:R, &__fib_or_n:R

; tag patching: template (from EXEC'd SM_READ_C) OR'd with new_ctx
&__fib_or_ret <| or          ; L: ret tag template, R: new_ctx
&__fib_or_n   <| or          ; L: arg tag template, R: new_ctx

; dispatch: patched tag + data → output token
&__fib_ct_ret <| change_tag  ; L: patched ret tag, R: return continuation
&__fib_ct_n   <| change_tag  ; L: patched arg tag, R: argument value

; internal wiring
&__fib_or_ret |> &__fib_ct_ret:L
&__fib_or_n   |> &__fib_ct_n:L
```

Stub cost: 1 (PASS fan-out) + 2 (OR) + 2 (CHANGE_TAG) = 5 IRAM slots
for N=1. For N=2: add 1 more PASS in fan-out chain + 1 OR + 1
CHANGE_TAG = 8 slots. General: 3N + 2 slots (fan-out chain + per-arg
OR and CHANGE_TAG).

Note: the OR and CHANGE_TAG instructions are dyadic, consuming IRAM
slots in the low-offset range (0-31). The PASS fan-out chain is
monadic and can live in the monadic range (offsets 32+), where IRAM
space is more abundant — monadic instructions don't consume matching
store entries, so the 7-bit offset space is available.
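The slot count generalises mechanically. A quick arithmetic check
(Python, assuming one fan-out PASS per argument — the assumption
that reproduces the N=1 and N=2 counts above):

```python
# Stub IRAM cost for an N-argument function: N PASSes in the ctx
# fan-out chain (reaching N+1 consumers), plus one OR and one
# CHANGE_TAG per destination (N args + 1 return). Illustrative only.

def stub_slots(n_args):
    fanout_passes = n_args           # chain reaching n_args + 1 consumers
    ors = n_args + 1                 # per-arg + return tag patch
    change_tags = n_args + 1         # per-arg + return dispatch
    return fanout_passes + ors + change_tags   # = 3N + 2

print(stub_slots(1), stub_slots(2))  # → 5 8
```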
#### Per-Call-Site Expansion

```dfasm
; ── per call site: 3 IRAM slots + wiring ──

&__alloc <| rd_inc, @ctx_alloc         ; allocate callee context
&__exec  <| exec, @fib_call_seq        ; fire tag-fetch sequence
&__extag <| extract_tag, <ret_offset>  ; capture return continuation

; wire runtime values into stub (these are edge declarations, not IRAM)
&__alloc |> &__fib_ctx_fan             ; new ctx → stub fan-out
&__extag |> &__fib_ct_ret:R            ; return cont → stub
&arg_val |> &__fib_ct_n:R              ; argument → stub
```

rd_inc, exec, and extract_tag are all monadic. The per-call-site
cost is 3 monadic IRAM slots — they sit in the monadic offset range
and don't consume matching store entries.

#### EXEC Sequence Contents (in SM/ROM)

Pre-formed tokens, stored at boot, fired by exec:

```
Token 0: SM_READ_C(@fib_desc + 0) → deliver to &__fib_or_ret:L
Token 1: SM_READ_C(@fib_desc + 1) → deliver to &__fib_or_n:L
```

Each token is a fully-formed 2-flit packet: flit 1 = SM read command,
flit 2 = return routing pointing at the stub's OR instruction. The
EXEC sequencer reads these from consecutive SM cells and emits them
onto the bus. The SM processes each read and returns the tag template
to the specified OR instruction.
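The patch the stub then performs is a single OR into the ctx field.
A numeric sketch (Python; the template value below is invented — only
the ctx position in bits [3:0] comes from The Problem, step 4):

```python
# A tag template is a 16-bit flit 1 value pre-built with ctx=0; the
# stub ORs the allocated ctx into bits [3:0]. The template constant
# here is a made-up example, not a real field encoding.

def patch_ctx(template, new_ctx):
    assert template & 0xF == 0    # templates are built with ctx=0
    assert 0 <= new_ctx <= 0xF    # ctx fits in bits [3:0]
    return template | new_ctx     # the OR the stub performs

template = 0b0010_0101_0110_0000  # hypothetical ret-destination template
print(hex(patch_ctx(template, 0x7)))  # → 0x2567
```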
#### Fibonacci: Two Recursive Calls

```dfasm
$fib |> {
    ; ── function body ──
    &n <| pass
    lt &n, 2 |> &test
    &test <| sweq
    &n |> &test:L

    ; base case
    &test:L |> @ret

    ; recursive case
    sub &n, 1 |> &n1
    sub &n, 2 |> &n2

    ; two calls, each 3 monadic IRAM slots + shared stub
    &__alloc1 <| rd_inc, @ctx_alloc
    &__exec1  <| exec, @fib_call_seq
    &__extag1 <| extract_tag, 20     ; results arrive at offset 20

    &__alloc1 |> &__fib_ctx_fan
    &__extag1 |> &__fib_ct_ret:R
    &n1 |> &__fib_ct_n:R

    &__alloc2 <| rd_inc, @ctx_alloc
    &__exec2  <| exec, @fib_call_seq
    &__extag2 <| extract_tag, 21     ; results arrive at offset 21

    &__alloc2 |> &__fib_ctx_fan
    &__extag2 |> &__fib_ct_ret:R
    &n2 |> &__fib_ct_n:R

    ; reduction
    add &r1, &r2 |> @ret             ; r1 at offset 20, r2 at offset 21
}
```

**Important:** The two calls share the same stub IRAM instructions
but run in different contexts (ctx allocated by rd_inc). The matching
store entries for `&__fib_or_ret` etc. are indexed by (ctx, offset),
so different ctx values → different cells → no collision.

The only sequencing between the calls is data dependency — the second
call can't fire its CHANGE_TAG until its own rd_inc and exec complete,
and those are independent of the first call. Both calls can be in
flight simultaneously.

#### IRAM Budget

```
                      dyadic slots      monadic slots
                      (0-31 range)      (32-127 range)
─────────────────────────────────────────────────────────
function body (fib):      ~6                ~4
call stub (shared):        4                 1
per call site (×2):        0                 6
result reduction:          1                 0
─────────────────────────────────────────────────────────
total:                   ~11               ~11
```

~22 IRAM slots total for recursive fibonacci. Well within 128. And
if fib is called from external sites, they pay only 3 monadic slots
each — the stub and body are already loaded.
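The column sums can be checked mechanically (Python, with the table's
approximate figures treated as exact):

```python
# IRAM budget for recursive fib, (dyadic, monadic) slot counts per
# component, copied from the budget table. "~" figures treated as
# exact for the arithmetic.
budget = {
    "function body (fib)": (6, 4),
    "call stub (shared)":  (4, 1),
    "per call site (x2)":  (0, 6),
    "result reduction":    (1, 0),
}

dyadic = sum(d for d, _ in budget.values())
monadic = sum(m for _, m in budget.values())
print(dyadic, monadic, dyadic + monadic)   # → 11 11 22
```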
#### Stub Sharing Across Mutual Recursion

Two-layer recursion (A calls B calls A) can share ctx allocation and
EXEC infrastructure. If A and B are on the same PE:

- Each function has its own call stub (different tag templates)
- They share the same `@ctx_alloc` SM cell
- Their EXEC sequences are independent but stored in the same SM
- Both stubs live in IRAM simultaneously at different offsets

If A and B have the same argument count and shape, a future
optimisation is a *generic* call stub parameterised only by which
EXEC sequence to fire. The tag templates in the descriptor table
encode all the per-function differences. The stub just patches ctx
and dispatches — it doesn't know or care which function it's calling.
This is essentially a vtable dispatch and emerges naturally from the
architecture.

#### Descriptor Table Layout (in SM, initialised at boot)

```
@fib_desc + 0:  return destination tag template (ctx=0)
                [0][0][port][PE][gen][offset][0000]
@fib_desc + 1:  arg 'n' destination tag template (ctx=0)
                [0][0][port][PE][gen][offset][0000]

; for N=2 function:
@func_desc + 0: return tag template
@func_desc + 1: arg 0 tag template
@func_desc + 2: arg 1 tag template
```

Templates are full 16-bit flit 1 values with ctx field set to 0.
The stub's OR instruction patches bits [3:0] with the allocated ctx.
Templates are written to SM during bootstrap (via EXEC from ROM or
explicit SM_WRITE_C during init).

#### SM-Based Argument Passing (Large Functions)

For functions with many arguments (N > 3), the IRAM cost of the
OR + CHANGE_TAG stub becomes prohibitive — each arg burns 2 dyadic
IRAM slots. An alternative: **stage arguments in SM cells and let
the EXEC sequence deliver them.**

The caller writes argument values to a block of SM "call frame"
cells using SM_WRITE_C (monadic, const-addressed).
The EXEC
sequence's tail end includes SM_READ_C tokens that read those cells
back out and deliver them as tokens to the callee's entry points.
The callee is oblivious — it just sees tokens arriving normally.

```
Caller writes args to SM call frame:
  SM_WRITE_C(@frame + 0, arg0_value)    ; monadic, 1 IRAM slot
  SM_WRITE_C(@frame + 1, arg1_value)    ; monadic
  ...
  SM_WRITE_C(@frame + N-1, argN_value)  ; monadic
  SM_WRITE_C(@frame + N, ret_cont)      ; return continuation
  EXEC @call_seq                        ; fire it all

EXEC sequence (in SM/ROM):
  SM_READ_C(@frame + 0)   → deliver to callee &arg0:L
  SM_READ_C(@frame + 1)   → deliver to callee &arg1:L
  ...
  SM_READ_C(@frame + N-1) → deliver to callee &argN:L
  SM_READ_C(@frame + N)   → deliver to callee &ret_cont:L
```

**Costs:**

```
                     stub approach        SM call frame
                     (per-arg OR+CT)      (SM staging)
───────────────────────────────────────────────────────────
caller IRAM:         3 monadic            N+2 monadic (writes + exec + extag)
stub IRAM:           2N dyadic + fan-out  0
EXEC sequence:       N+1 tokens           N+1 tokens
SM cells used:       0 (runtime)          N+1 (call frame)
───────────────────────────────────────────────────────────
dyadic IRAM slots:   2N+                  0
monadic IRAM slots:  3                    N+2
```

The SM call frame approach uses zero dyadic IRAM for call overhead.
All caller instructions are monadic (SM_WRITE_C, EXEC, EXTRACT_TAG),
living in the abundant 32-127 offset range. The callee's IRAM is
pure function body — no call machinery whatsoever.

**Tradeoffs:**

- **Latency:** two SM round-trips per argument (write then read) vs
  one CHANGE_TAG. Adds ~2× SM access latency to call setup. For
  large N where the stub approach would serialise through a fan-out
  chain anyway, the SM approach may not be worse.
- **SM cell pressure:** N+1 cells per call frame. Concurrent calls
  need separate frames (different base addresses).
  The caller manages
  this — either static allocation for known call depth, or an SM
  frame pointer bumped via READ_INC.
- **No ctx patching needed:** the EXEC sequence tokens already have
  the correct routing baked in (they target the callee's entry points
  directly). Context allocation is still needed, but the patching
  step (OR new_ctx into tag templates) is eliminated because the
  EXEC sequence can be **rebuilt per call** from a template +
  allocated ctx. Or, if the callee always runs in a fixed ctx
  (static allocation), the EXEC sequence is truly static.

**When to use which:**

- N=1-2 args: stub approach. the OR + CHANGE_TAG chain is small,
  latency is minimal, no SM cells consumed.
- N=3+ args: SM call frame starts winning on IRAM pressure.
- N=6+ args: SM call frame is clearly better. the stub would need
  12+ dyadic slots just for call overhead.
- one-shot EXEC'd functions (loaded on demand, run once): SM call
  frame is natural — the EXEC that loads the code can also deliver
  the arguments in one sequence.

A function intended to be called via EXEC one-shot (loaded from ROM
into IRAM, executed, then discarded) can have its entire call
convention baked into the EXEC block: code loading tokens first,
then SM_READ tokens that deliver arguments from pre-staged cells.
The caller just writes args to SM and fires EXEC. The function
loads, receives its arguments, runs, sends results, done.
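The crossover reduces to a few lines of arithmetic. A Python sketch
of where each convention spends its budget (counts taken from the
Costs table, with "2N+" treated as 2N):

```python
# Per-call resource use of the two conventions. Scarce dyadic IRAM
# (offsets 0-31) is the pressured resource; the SM call frame trades
# it for monadic slots and N+1 SM cells. Illustrative counts only.

def stub_cost(n_args):
    return {"dyadic": 2 * n_args, "monadic": 3, "sm_cells": 0}

def sm_frame_cost(n_args):
    return {"dyadic": 0, "monadic": n_args + 2, "sm_cells": n_args + 1}

for n in (1, 3, 6):
    print(n, stub_cost(n)["dyadic"], sm_frame_cost(n)["sm_cells"])
# → 1 2 2
# → 3 6 4
# → 6 12 7
```

At N=6 the stub's 12 dyadic slots match the "12+ dyadic slots"
figure in the list above, while the frame costs 7 SM cells and no
dyadic IRAM at all.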
---

## Built-in Macros

The assembler ships built-in macros (prepended to every program) that
implement common patterns from this document:

| Macro | Parameters | Outputs (@ret) | Pattern |
|-------|------------|----------------|---------|
| `#loop_counted` | `init, limit` | `body`, `exit` | Counted loop (counter + compare + increment feedback) |
| `#loop_while` | `test` | `body`, `exit` | Condition-tested loop (gate node) |
| `#permit_inject` | `*targets` | (routes to targets) | Permit injection (variadic, one const(1) per target) |
| `#reduce_2`..`_4` | `op` | (positional) | Binary reduction tree (parameterized opcode, per-arity) |

All built-in macros use `@ret` output wiring. The `#reduce_*` family
accepts any opcode as a parameter (e.g., `#reduce_4 add |> &result`).
See `asm/builtins.py` for definitions.

#### Example Macro Definitions

```dfasm
; ── call stub for a 1-argument function ──
; emitted once per function, provides shared call infrastructure
#call_stub_1 func, desc |> {
    ; ctx fan-out (1 arg + 1 return = 2 consumers, one PASS suffices)
    &__${func}_ctx_fan <| pass
    &__${func}_ctx_fan |> &__${func}_or_ret:R, &__${func}_or_arg0:R

    ; tag patching
    &__${func}_or_ret <| or
    &__${func}_or_arg0 <| or

    ; dispatch
    &__${func}_ct_ret <| change_tag
    &__${func}_ct_arg0 <| change_tag

    ; internal wiring
    &__${func}_or_ret |> &__${func}_ct_ret:L
    &__${func}_or_arg0 |> &__${func}_ct_arg0:L
}

; ── call stub for a 2-argument function ──
#call_stub_2 func, desc |> {
    ; ctx fan-out chain (3 consumers)
    &__${func}_ctx_fan0 <| pass
    &__${func}_ctx_fan1 <| pass
    &__${func}_ctx_fan0 |> &__${func}_or_ret:R, &__${func}_ctx_fan1
    &__${func}_ctx_fan1 |> &__${func}_or_arg0:R, &__${func}_or_arg1:R

    ; tag patching
    &__${func}_or_ret <| or
    &__${func}_or_arg0 <| or
    &__${func}_or_arg1 <| or

    ; dispatch
    &__${func}_ct_ret <| change_tag
    &__${func}_ct_arg0 <| change_tag
    &__${func}_ct_arg1 <| change_tag

    ; internal wiring
    &__${func}_or_ret |> &__${func}_ct_ret:L
    &__${func}_or_arg0 |> &__${func}_ct_arg0:L
    &__${func}_or_arg1 |> &__${func}_ct_arg1:L
}

; ── per-call-site (works for any arity) ──
#call_dyn func, alloc, call_seq, ret_offset, *args |> {
    ; allocate + trigger + return continuation
    &__call_alloc_${func} <| rd_inc, ${alloc}
    &__call_exec_${func} <| exec, ${call_seq}
    &__call_extag_${func} <| extract_tag, ${ret_offset}

    ; wire into stub
    &__call_alloc_${func} |> &__${func}_ctx_fan
    &__call_extag_${func} |> &__${func}_ct_ret:R
    $(
        ${args} |> &__${func}_ct_${args}:R
    ),*
}
```

Usage:

```dfasm
; one-time setup
#call_stub_1 fib, @fib_desc

; at each call site
#call_dyn fib, @ctx_alloc, @fib_call_seq, 20, &my_arg
```

The `call_stub_N` per-arity approach is admittedly clunky. Macros can
invoke other macros (nested expansion up to depth 32), so a unified
variadic call_stub using a helper macro is possible. Per-arity variants
are used here for clarity. N=1 through N=3 covers the vast majority
of functions, and anything beyond N=3 can be hand-written — it's the
same pattern, just more of it.

### Permit Injection — Two Approaches

For small K (roughly K <= 4), inline CONST injection.
The built-in
`#permit_inject` macro is variadic — pass the target nodes directly:

```dfasm
; Built-in definition (from asm/builtins.py):
#permit_inject *targets |> {
    $(
        &p <| const, 1
        &p |> ${targets}
    ),*
}

; Usage: pass gate nodes as targets
#permit_inject &dispatch_gate:L, &dispatch_gate:L
```

For large K, use SM EXEC to batch-emit permits:

```dfasm
; Custom macro for EXEC-based injection:
#permit_inject_exec count, sm_base |> {
    ; a single SM EXEC reading ${count} pre-formed permit
    ; tokens from SM starting at ${sm_base}
    &exec <| exec, ${sm_base}
}

; Usage: inject 8 permits via EXEC
#permit_inject_exec 8, @permit_store
```

Programmer chooses based on K. No magic.

### Loop Control Macro

The built-in `#loop_counted` provides the core loop infrastructure.
It accepts `init` and `limit` as input parameters and exposes `body`
and `exit` as `@ret` outputs:

```dfasm
; Built-in definition (from asm/builtins.py):
#loop_counted init, limit |> {
    &counter <| add
    &compare <| brgt
    &counter |> &compare:L
    &body_fan <| pass
    &compare |> &body_fan:L
    &inc <| inc
    &body_fan |> &inc:L
    &inc |> &counter:R
    ${init} |> &counter:L
    ${limit} |> &compare:R
    &body_fan |> @ret_body
    &compare |> @ret_exit:R
}

; Usage: pass init/limit as args, wire body/exit via |>
&c_init <| const, 0
&c_limit <| const, 64
#loop_counted &c_init, &c_limit |> body=&body_entry, exit=&done
```

### Reduction Tree Macro

```dfasm
#reduce_tree &op, &inputs[], &output |> {
    ; expands to ceil(log2(N)) levels of binary &op instructions
    ; N inferred from length of &inputs[]
    ; &output receives the final reduced value
}

; Usage:
#reduce_tree add, [&s0, &s1, &s2, &s3], @total
```

### Parallel Loop (Composition)

A parallel loop is manual composition of macros and function
calls.
No single macro tries to handle the full topology — each handles the
repetitive part it's good at.

```dfasm
@system pe=4, sm=1, ctx=8

; The body as a function — self-loop accumulator
$body |> {
    &acc <| add
    &acc |> &acc:L      ; feedback: acc recirculates
    ; &i arrives as input, feeds &acc:R
    ; &acc drains to @ret on completion
}

; Loop control
&c_init <| const, 0
&c_limit <| const, 64
#loop_counted &c_init, &c_limit |> body=&dispatch, exit=&done

; Permit injection (4 permits to throttle body launches)
#permit_inject &gate:L, &gate:L, &gate:L, &gate:L

; Gated dispatch — permits throttle body launches
&gate <| gate
&dispatch |> &gate:R    ; loop data → gate right port
; permits arrive at &gate:L from injection + body completion

; Body invocations via function call syntax
$body i=&gate |> &partial

; Reduction of partial results
#reduce_tree add, [&p0, &p1, &p2, &p3], @final_sum
```

---

## Pattern Cost Summary

| Pattern | HW cost | IRAM slots | Iterations/cycle | Parallel? |
| --------------------- | -------- | ----------------------- | -------------------- | --------------- |
| Self-loop accumulator | 0 | 1 (the ADD) | ~1/8 (bus RT) | yes (per-ctx) |
| Permit-token throttle | 0 | K+2 (permits + GATE) | K in flight | yes |
| Counted loop control | 0 | 4 (CONST+INC+LT+SWITCH) | ~1/8 (bus RT) | no (sequential) |
| Binary reduction tree | 0 | K-1 (one per ADD) | log2(K) levels | yes |
| Predicate register | ~1 chip | +1 bit/instr | saves ~4 cycles/iter | no (shared) |
| Accumulator register | ~3 chips | 1 (ACC_ADD) | ~1/4 (no bus RT) | no (shared) |

All zero-hardware patterns work with v0. Predicate and accumulator
registers are independent future additions that compose with the
existing patterns.