OR-1 dataflow CPU sketch
# Loop Patterns and Flow Control Idioms

Execution patterns for loops, reductions, and flow control in the
dataflow architecture. These are software/compiler conventions built
from existing hardware primitives — no dedicated loop hardware exists.

Most patterns described here are candidates for **assembler macros**:
reusable expansions that emit the underlying instruction sequences.
The programmer writes `#loop_counted counter, limit, body_label` and
the assembler expands it into the token feedback arcs, SWITCH routing,
and permit structures described below.

See `iram-and-function-calls.md` for IRAM format and ctx_mode details.
See `alu-and-output-design.md` for SWITCH, GATE, and output modes.
See `sm-design.md` for SM operations referenced by some patterns.

---

## Core Loop Mechanism

There is no program counter, no branch instruction, and no loop
construct in hardware. Loops are **token feedback arcs**: an
instruction's output token is routed back to an input of an earlier
instruction (or itself) in the dataflow graph. The loop "iterates"
each time a token completes the feedback circuit.
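The token's trip around a feedback arc can be sketched as a few lines
of ordinary sequential code — a hypothetical Python model, not
toolchain code (`run_counted_loop` and its structure are invented for
illustration):

```python
# Minimal model of a token feedback arc: a counter token recirculates
# through INC -> LT -> SWITCH until the comparison routes it to exit.
# Purely illustrative; these are not real ISA mnemonics.

def run_counted_loop(limit):
    body_dispatches = []          # tokens routed to the loop body
    token = 0                     # seed token from CONST(0)
    while True:
        token = token + 1         # INC
        taken = token < limit     # LT limit
        if taken:                 # SWITCH: true -> body + feedback arc
            body_dispatches.append(token)
        else:                     # SWITCH: false -> exit
            return body_dispatches

print(run_counted_loop(4))        # → [1, 2, 3]
```

The `while True` stands in for the feedback arc itself: the loop
"runs" only because the token keeps re-entering the same three
instructions.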
### Minimal Counted Loop

```
graph:
   CONST(0) ──────────────────────────────────┐
                                              │
   ┌──────────────────────────────────────────┤
   │                                          V
   │   ┌─────┐     ┌──────────┐     ┌────────────────┐
   └──►│ INC │────►│ LT limit │────►│     SWITCH     │
       └─────┘     └──────────┘     │  true  → body  │
          ▲ i+1        bool         │  false → exit  │
          │                         └────────┬───────┘
          │        data (i+1) to body ◄──────┤
          │                                  │
          └── i+1 fed back (same token) ─────┘
```

Instructions (all on same PE, same context):

```dfasm
; Counted loop: increment from 0, dispatch to body, exit when done

&counter <| const, 0     ; initial counter value (seed token starts loop)
&step    <| inc          ; increment counter
&cmp     <| lt           ; compare counter < limit
&route   <| sweq         ; route by comparison result

const <limit> |> &cmp:R  ; limit value (or SM read if > 255)

&counter |> &step        ; seed → first increment
&step    |> &cmp:L       ; counter → comparison left
&step    |> &route:L     ; counter → switch data input (fan-out)
&cmp     |> &route:R     ; bool → switch control

&route:L |> &body:L      ; taken (true) → dispatch to body
&route:R |> &exit:L      ; not-taken (false) → loop done
&route:L |> &step        ; feedback arc: counter recirculates
```

The feedback arc is just a destination field in the SWITCH instruction
(or a PASS trampoline if SWITCH can't dual-route to both body dispatch
and the INC feedback simultaneously). The loop "runs" as long as tokens
keep flowing through the feedback arc.

### Timing

Each iteration traverses: INC → LT → SWITCH → bus → INC. With the v0
pipeline (no local bypass, all tokens go through external bus), expect
roughly 6-10 cycles per loop control iteration depending on bus
contention and pipeline depth.

The loop body executes concurrently with loop control — the dispatched
body token enters the body subgraph immediately while the counter
continues to the next iteration.
If the body takes longer than one
control iteration, multiple body invocations can be in flight
simultaneously (with appropriate flow control — see Permit Tokens).

---

## Permit-Token Flow Control

When loop body iterations can execute concurrently, a throttling
mechanism prevents context slot exhaustion. **Permit tokens** are the
standard dataflow idiom for this.

### Concept

K permit tokens circulate through the system. Each dispatch to a body
context consumes one permit. Each body completion produces one permit.
At most K body iterations are in flight simultaneously. If no permits
are available, the dispatch GATE stalls — the loop control token waits
in the matching store until a permit arrives.

```
 permits (K tokens, initially injected at boot)


 ┌────────┐
 │  GATE  │◄──── loop control produces (counter, body_data)
 │  L: permit
 │  R: dispatch_data
 └───┬────┘
     │ (fires only when BOTH permit AND data are ready)

 dispatch to body context (CHANGE_TAG or CTX_OVRD)


 body executes ... body completes


 emit permit token back to GATE (port L)
```

### Implementation

The GATE instruction is dyadic. Left port receives the permit token.
Right port receives the loop's dispatch data (counter value, array
pointer, whatever the body needs). GATE fires only when both are
present — this IS the backpressure mechanism. No special hardware
flow control.

```dfasm
; Permit-gated dispatch
&gate <| gate             ; dyadic: L=permit, R=dispatch data
&loop_output |> &gate:R   ; loop control feeds data to gate

; Body completion recycles the permit
&body_done |> &gate:L     ; body's final instruction returns permit
```

The body's final instruction emits a token to `&gate:L` as one of its
destinations. This token is the recycled permit. Its data value is
irrelevant (just a trigger); what matters is its presence in the
matching store.
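The backpressure behaviour is easy to model. A hypothetical Python
sketch (`PermitGate` is an invented name; the deques stand in for
matching store cells):

```python
from collections import deque

# Model of a permit-gated dispatch: GATE fires only when both a permit
# (port L) and dispatch data (port R) are present. K permits bound the
# number of body iterations in flight. Illustrative only.

class PermitGate:
    def __init__(self, k):
        self.permits = deque([None] * k)   # K permits injected at boot
        self.data = deque()
        self.in_flight = 0
        self.max_in_flight = 0

    def offer_data(self, value):           # loop control → port R
        self.data.append(value)
        self._try_fire()

    def body_done(self):                   # body completion recycles permit
        self.in_flight -= 1
        self.permits.append(None)
        self._try_fire()

    def _try_fire(self):
        while self.permits and self.data:  # fires only when BOTH present
            self.permits.popleft()
            self.data.popleft()
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)

gate = PermitGate(k=2)
for i in range(5):                         # loop control offers 5 dispatches
    gate.offer_data(i)
while gate.in_flight:                      # bodies complete one by one
    gate.body_done()
print(gate.max_in_flight)                  # → 2
```

Note that nothing in the model "stalls" explicitly — un-matched data
tokens simply sit in the queue, exactly as loop control tokens sit in
the matching store.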
K is chosen by the compiler:

- K = 1: fully sequential, one body at a time. safe default.
- K = number of reserved body context slots: maximum parallelism.
- K = pipeline depth / body latency: optimal for throughput.

### Initial Permit Injection

At boot (or function entry), K permit tokens must be injected into
the GATE. Options:

- **CONST chain:** K CONST instructions at sequential offsets, each
  with dest targeting the GATE's left port. Triggered by the function
  entry token via fan-out. Burns K IRAM slots but is simple.
- **SM EXEC:** pre-load K permit tokens in SM, EXEC emits them.
  Uses one IRAM slot for the EXEC trigger. Better for large K.
- **Assembler macro:** `#permit_inject *args` expands to the
  appropriate injection sequence.

### Assembler Macro Sketch

```
; PERMIT_LOOP(K, limit, body_label, exit_label)
; Expands to:
;   - K CONST instructions emitting initial permits
;   - GATE (permit, dispatch_data) guarding body dispatch
;   - INC/LT/SWITCH loop control chain
;   - feedback arc from SWITCH to INC
;   - body return path emitting permit on completion
```

The macro assigns IRAM offsets for the control structure and reserves
K context slots for body iterations.

---

## Parallel Reduction

A common pattern following parallel loop iterations: combine K partial
results into a single value.

### Binary Reduction Tree

```
K=4 partial sums:  s0  s1  s2  s3
                    \  /    \  /
                    ADD      ADD
                      \      /
                        ADD

                       total
```

Each ADD is a dyadic instruction in its own right. The partial results
arrive as tokens, match in the ADD's matching store entry, fire, and
produce the next level's input. The tree structure is pure dataflow —
no special reduction hardware.

For K iterations, the tree has log2(K) levels and K-1 ADD instructions.
With K=8, that's 7 ADDs across 3 levels.
All ADDs at the same level
can fire in parallel (they're on different matching store entries or
different PEs).

### Assembler Macro Sketch

```
; REDUCE(op, inputs[], output)
; Expands to:
;   - ceil(log2(N)) levels of binary op instructions
;   - routing from each level's outputs to next level's inputs
;   - final output routed to specified destination
```

---

## Loop-Carried Accumulators (Self-Loop Pattern)

A value that updates every iteration and feeds back to itself. The
canonical example: `sum += a[i]`.

### Matching-Store-as-Register

A dyadic instruction whose output routes back to its own left port:

```dfasm
; Self-loop accumulator: sum += each incoming value
&acc <| add         ; dyadic: L=accumulated sum, R=new element
&acc |> &acc:L      ; feedback: result → own left port

; Initialise: deposit starting value before first element arrives
&init <| const, 0
&init |> &acc:L     ; seed the accumulator with 0
```

The matching store cell at `&acc`'s (ctx, offset, port L) holds the
accumulator value between iterations. Each new element arriving
on port R triggers the ADD, which deposits the updated sum back
into port L's cell.

**Timing:** Each accumulation step is a full round-trip: ALU →
output formatter → bus → input FIFO → matching store → ALU. Roughly
6-10 cycles at v0. This is the sequential bottleneck — the accumulator
feedback arc is inherently serial.

**Extracting the final value:** After the last element is accumulated,
the sum sits in the matching store cell at port L. It needs a "drain"
event to extract it. Options:

- A sentinel token on port R triggers one final ADD (or PASS), and
  dest2 routes the result to the downstream consumer.
- A GATE controlled by a "loop done" boolean from the loop control.
  When the loop completes, the GATE opens and the accumulated value
  flows out.
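A hypothetical Python model of the cell-plus-drain behaviour (the
held variable stands in for the port-L matching store cell; `SENTINEL`
plays the drain trigger from the first option above):

```python
# Model of matching-store-as-register: a held value plays the cell at
# (ctx, offset, port L); each port-R token fires the ADD and writes
# the result back to port L. A sentinel token drains the final value.
# Illustrative model, not toolchain code.

SENTINEL = object()

def self_loop_accumulate(tokens, seed=0):
    port_l = seed                  # &init deposits the seed into port L
    for tok in tokens:             # tokens arriving on port R, in order
        if tok is SENTINEL:        # drain: PASS routes the cell downstream
            return port_l
        port_l = port_l + tok      # ADD fires, result -> own port L

print(self_loop_accumulate([3, 4, 5, SENTINEL]))   # → 12
```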
### With Parallel Iterations

Each body context has its own matching store entries (different ctx →
different SRAM address). K parallel accumulators at the same IRAM
offset but different context slots operate independently.

```dfasm
; All iterations share the same IRAM instruction for &acc.
; Different context slots → different matching store cells.
; No interference between iterations.

; ctx=1: sum_1 accumulator (self-loop)
; ctx=2: sum_2 accumulator (self-loop)
; ctx=3: sum_3 accumulator (self-loop)
; ... all at the same &acc instruction, different contexts
```

After all iterations complete, a reduction tree combines partial sums.
The permit-token mechanism guarantees that partial sums are ready before
the reduction begins (the permits themselves can be chained to trigger
the reduction).

---

## Predicate Register Optimisation (Future, ~1 Chip)

A single shared 1-bit register (or small multi-bit register) that
stores a comparison result locally, bypassing the token network for
the boolean path.

### Benefit

In the standard loop control pattern, the comparison boolean travels
as a token: LT produces a bool token → bus → matching store → SWITCH
consumes it. The predicate register short-circuits this:

```
without predicate register:
  LT → [bool token] → bus → matching → SWITCH
  cost: full token round-trip for the boolean

with predicate register:
  LT writes bool to predicate register (side effect, no token)
  SWITCH reads predicate register (local wire, no matching needed)
  cost: zero additional cycles for the boolean path
```

The counter feedback arc still goes through the bus. But the boolean
path — typically half the loop control overhead — becomes free.

### Constraints

- **Not per-context.** Single shared register.
  The compiler must
  guarantee only one activation uses the predicate at a time.
- **Not suitable for parallel iterations.** Each iteration would need
  its own predicate state. Use the token-based boolean path for
  parallel loop control.
- **IRAM encoding:** 1-2 bits per instruction (pred_write, pred_read).
  Can be folded into opcode space as dedicated variants (LT_P, SWITCH_P)
  or drawn from spare bits in half 1.

### Hardware

```
1-bit predicate register:     1 flip-flop
write path (from comparator): 1 gate (write enable)
read mux (to SWITCH):         1 gate (bool_out source select)
Total: ~1 chip (fraction of a chip, really)
```

---

## Accumulator Register Optimisation (Future, ~3 Chips)

A single shared 16-bit register writable by the ALU and readable as
an ALU input source. Eliminates the bus round-trip for tight
accumulation loops.

### Benefit

```
without accumulator register:
  ADD(acc, new) → output → bus → input → matching → ADD
  cost: ~6-10 cycles per accumulation

with accumulator register:
  ACC_ADD: reads acc from register, adds new from token, writes result
  back to register. monadic (only new element token needed).
  cost: 1 pipeline pass per accumulation (~3-5 cycles)
```

### Constraints

- **Not per-context.** Same single-activation restriction as predicate
  register.
- **Monadic operation.** ACC_ADD/ACC_SUB are monadic — the accumulator
  is an implicit operand from the register, the explicit operand comes
  from the arriving token. No matching store entry consumed.
- **No matching store write conflict.** The register is a separate
  storage element from the matching store. ALU writes to the register
  at Stage 4; matching store writes happen at Stage 2. No port
  conflict, no stall logic needed.
- **Stepping stone to SC blocks.** The accumulator register is
  effectively the first register of a future local register file.
  Adding a second register + sequential instruction counter yields a
  minimal strongly-connected (SC) block capability.

### Hardware

```
16-bit register (2x 74LS374):       2 chips
ALU source mux (add acc_reg input): 1 chip
write enable gating:                ~0 chips (1 gate)
Total: ~3 chips per PE
```

---

## Assembler Macro and Function Call Strategy

The patterns above are mechanical enough to be assembler macros.
Macros are expected to be simple text/token substitution (C-style
`#define` territory, not Zig comptime or C++ templates). This means:

- No conditional logic within macros. Different strategies get
  different macros, and the programmer picks which to use.
- No offset allocation intelligence. The assembler tracks placement
  and validates the expansion (offset collisions, missing labels),
  but the macro itself is dumb substitution.
- No type checking or context-slot tracking. The programmer is
  responsible for not blowing the slot budget.

### Macro Syntax

A macro call uses `#macro_name` and follows the same syntax as any
other operation (arguments, edge wiring, port qualifiers). Macro
definitions follow a function-block-like structure.

### Function Calls as Syntax

Using a `$func` label as an instruction generates the appropriate
routing for static calls. Named arguments match against the
function's internal labels:

```dfasm
; Function definition:
$add_pair |> {
    add &a, &b |> @ret
}

; Static call — named args wire to internal labels:
$add_pair a=&x, b=&y |> @output
```

`@ret` is a built-in pseudo-node that marks the function's return point.
The assembler resolves `a=&x` to mean "wire `&x`'s output to
`$add_pair.&a`, port L, with ctx_mode=01 and the allocator-assigned
context slot." The `|> @output` wires the return point to `@output`.

For static calls (non-recursive, known call graph), this generates
only routing annotations on existing instructions — no extra IRAM
entries, no CHANGE_TAG, just destination fields set with ctx_mode=01.

### Dynamic and Recursive Calls (v1, Manual / Macro-Assisted)

The assembler does NOT handle recursive or indirect calls
automatically — that's compiler territory. Recursive calls require
runtime context allocation (SM READ_INC), CHANGE_TAG sequences, and
EXTRACT_TAG for return continuations. The assembler provides macros
to reduce boilerplate for the mechanical parts; the programmer
manages descriptor tables, context budgets, and flow control.

#### The Problem

A dynamic call to a function with N arguments requires:

1. **Allocate context** — SM READ_INC on an allocator cell → new_ctx
2. **Build return continuation** — EXTRACT_TAG captures caller's
   (PE, ctx, offset, gen) as a 16-bit packed tag value
3. **Fetch N+1 tag templates** — SM_READ_C from a descriptor table
   (one per argument destination + one for return destination)
4. **Patch ctx into each tag** — OR new_ctx into bits [3:0] of
   each template (templates are pre-built at boot with ctx=0)
5. **Send return continuation** — CHANGE_TAG with patched return tag
6. **Send each argument** — CHANGE_TAG with patched arg tag + value

Done naively (one IRAM slot per step), this burns 4N+8 IRAM slots
per call site. Two recursive calls = 24+ slots for N=1. With 128
slots per PE, that's unsustainable.

#### The EXEC-Based Call Stub

EXEC is a token cannon — it reads a sequence of pre-formed tokens
from SM/ROM and fires them onto the bus. The tokens can be anything:
SM read requests, CM tokens, triggers.
One IRAM slot (the EXEC
trigger) replaces an arbitrary number of pre-staged operations.

The key insight: steps 3-4 above (fetch tag templates, deliver them
to patching logic) are **identical for every call to the same
function**. The tag templates, their SM addresses, and where to
deliver them are all compile-time constants. Only the allocated
ctx and argument values change per call.

This splits the call machinery into two parts:

**Call stub (shared, loaded once per function):**

IRAM instructions that receive runtime values (ctx, args, return
continuation) and perform the patching + dispatch. These live in
IRAM and are shared across all call sites for the same function.
Different call sites invoke the stub in different context slots,
so their matching store entries don't collide.

**EXEC sequence (in SM/ROM, per function):**

Pre-formed tokens that read tag templates from the descriptor table
and deliver them to the stub's OR instructions. Triggered by a
single EXEC instruction. Stored once, fired per call.
```
Per-function (loaded once):
  call stub in IRAM:   ctx fan-out + OR patches + CHANGE_TAGs
  EXEC sequence in SM: SM_READ_C tokens targeting the stub
  descriptor table:    pre-formed flit 1 templates (ctx=0)

Per-call-site (tiny):
  3 IRAM slots: rd_inc (allocate) + exec (trigger) + extract_tag (return)
  wiring: feed ctx, return cont, and arg values into the stub
```

#### Call Stub Structure (Example: N=1 Argument)

```dfasm
; ── call stub for $fib, loaded once, shared across call sites ──
; runs in caller's allocated ctx (different per call → no collision)

; ctx fan-out: new_ctx needs to reach 2 OR instructions (ret + arg)
&__fib_ctx_fan <| pass
&__fib_ctx_fan |> &__fib_or_ret:R, &__fib_or_n:R

; tag patching: template (from EXEC'd SM_READ_C) OR'd with new_ctx
&__fib_or_ret <| or          ; L: ret tag template, R: new_ctx
&__fib_or_n   <| or          ; L: arg tag template, R: new_ctx

; dispatch: patched tag + data → output token
&__fib_ct_ret <| change_tag  ; L: patched ret tag, R: return continuation
&__fib_ct_n   <| change_tag  ; L: patched arg tag, R: argument value

; internal wiring
&__fib_or_ret |> &__fib_ct_ret:L
&__fib_or_n   |> &__fib_ct_n:L
```

Stub cost: 1 (PASS fan-out) + 2 (OR) + 2 (CHANGE_TAG) = 5 IRAM slots
for N=1. For N=2: add 1 more PASS in fan-out chain + 1 OR + 1
CHANGE_TAG = 8 slots. General: 3N + 2 slots (fan-out chain + per-arg
OR and CHANGE_TAG).

Note: the OR and CHANGE_TAG instructions are dyadic, consuming IRAM
slots in the low-offset range (0-31). The PASS fan-out chain is
monadic and can live in the monadic range (offsets 32+), where IRAM
space is more abundant — monadic instructions don't consume matching
store entries, so the 7-bit offset space is available.
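The slot count generalises mechanically. A quick arithmetic check
(Python, assuming one fan-out PASS per argument — the assumption
that reproduces the N=1 and N=2 counts above):

```python
# Stub IRAM cost for an N-argument function: N PASSes in the ctx
# fan-out chain (reaching N+1 consumers), plus one OR and one
# CHANGE_TAG per destination (N args + 1 return). Illustrative only.

def stub_slots(n_args):
    fanout_passes = n_args           # chain reaching n_args + 1 consumers
    ors = n_args + 1                 # per-arg + return tag patch
    change_tags = n_args + 1         # per-arg + return dispatch
    return fanout_passes + ors + change_tags   # = 3N + 2

print(stub_slots(1), stub_slots(2))  # → 5 8
```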
#### Per-Call-Site Expansion

```dfasm
; ── per call site: 3 IRAM slots + wiring ──

&__alloc <| rd_inc, @ctx_alloc         ; allocate callee context
&__exec  <| exec, @fib_call_seq        ; fire tag-fetch sequence
&__extag <| extract_tag, <ret_offset>  ; capture return continuation

; wire runtime values into stub (these are edge declarations, not IRAM)
&__alloc |> &__fib_ctx_fan             ; new ctx → stub fan-out
&__extag |> &__fib_ct_ret:R            ; return cont → stub
&arg_val |> &__fib_ct_n:R              ; argument → stub
```

rd_inc, exec, and extract_tag are all monadic. The per-call-site
cost is 3 monadic IRAM slots — they sit in the monadic offset range
and don't consume matching store entries.

#### EXEC Sequence Contents (in SM/ROM)

Pre-formed tokens, stored at boot, fired by exec:

```
Token 0: SM_READ_C(@fib_desc + 0) → deliver to &__fib_or_ret:L
Token 1: SM_READ_C(@fib_desc + 1) → deliver to &__fib_or_n:L
```

Each token is a fully-formed 2-flit packet: flit 1 = SM read command,
flit 2 = return routing pointing at the stub's OR instruction. The
EXEC sequencer reads these from consecutive SM cells and emits them
onto the bus. The SM processes each read and returns the tag template
to the specified OR instruction.
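The patch the stub then performs is a single OR into the ctx field.
A numeric sketch (Python; the template value below is invented — only
the ctx position in bits [3:0] comes from The Problem, step 4):

```python
# A tag template is a 16-bit flit 1 value pre-built with ctx=0; the
# stub ORs the allocated ctx into bits [3:0]. The template constant
# here is a made-up example, not a real field encoding.

def patch_ctx(template, new_ctx):
    assert template & 0xF == 0    # templates are built with ctx=0
    assert 0 <= new_ctx <= 0xF    # ctx fits in bits [3:0]
    return template | new_ctx     # the OR the stub performs

template = 0b0010_0101_0110_0000  # hypothetical ret-destination template
print(hex(patch_ctx(template, 0x7)))  # → 0x2567
```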
#### Fibonacci: Two Recursive Calls

```dfasm
$fib |> {
    ; ── function body ──
    &n <| pass
    lt &n, 2 |> &test
    &test <| sweq
    &n |> &test:L

    ; base case
    &test:L |> @ret

    ; recursive case
    sub &n, 1 |> &n1
    sub &n, 2 |> &n2

    ; two calls, each 3 monadic IRAM slots + shared stub
    &__alloc1 <| rd_inc, @ctx_alloc
    &__exec1  <| exec, @fib_call_seq
    &__extag1 <| extract_tag, 20     ; results arrive at offset 20

    &__alloc1 |> &__fib_ctx_fan
    &__extag1 |> &__fib_ct_ret:R
    &n1 |> &__fib_ct_n:R

    &__alloc2 <| rd_inc, @ctx_alloc
    &__exec2  <| exec, @fib_call_seq
    &__extag2 <| extract_tag, 21     ; results arrive at offset 21

    &__alloc2 |> &__fib_ctx_fan
    &__extag2 |> &__fib_ct_ret:R
    &n2 |> &__fib_ct_n:R

    ; reduction
    add &r1, &r2 |> @ret             ; r1 at offset 20, r2 at offset 21
}
```

**Important:** The two calls share the same stub IRAM instructions
but run in different contexts (ctx allocated by rd_inc). The matching
store entries for `&__fib_or_ret` etc. are indexed by (ctx, offset),
so different ctx values → different cells → no collision.

The only sequencing between the calls is data dependency — the second
call can't fire its CHANGE_TAG until its own rd_inc and exec complete,
and those are independent of the first call. Both calls can be in
flight simultaneously.

#### IRAM Budget

```
                      dyadic slots      monadic slots
                      (0-31 range)      (32-127 range)
─────────────────────────────────────────────────────────
function body (fib):      ~6                ~4
call stub (shared):        4                 1
per call site (×2):        0                 6
result reduction:          1                 0
─────────────────────────────────────────────────────────
total:                   ~11               ~11
```

~22 IRAM slots total for recursive fibonacci. Well within 128. And
if fib is called from external sites, they pay only 3 monadic slots
each — the stub and body are already loaded.
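The column sums can be checked mechanically (Python, with the table's
approximate figures treated as exact):

```python
# IRAM budget for recursive fib, (dyadic, monadic) slot counts per
# component, copied from the budget table. "~" figures treated as
# exact for the arithmetic.
budget = {
    "function body (fib)": (6, 4),
    "call stub (shared)":  (4, 1),
    "per call site (x2)":  (0, 6),
    "result reduction":    (1, 0),
}

dyadic = sum(d for d, _ in budget.values())
monadic = sum(m for _, m in budget.values())
print(dyadic, monadic, dyadic + monadic)   # → 11 11 22
```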
#### Stub Sharing Across Mutual Recursion

Two-layer recursion (A calls B calls A) can share ctx allocation and
EXEC infrastructure. If A and B are on the same PE:

- Each function has its own call stub (different tag templates)
- They share the same `@ctx_alloc` SM cell
- Their EXEC sequences are independent but stored in the same SM
- Both stubs live in IRAM simultaneously at different offsets

If A and B have the same argument count and shape, a future
optimisation is a *generic* call stub parameterised only by which
EXEC sequence to fire. The tag templates in the descriptor table
encode all the per-function differences. The stub just patches ctx
and dispatches — it doesn't know or care which function it's calling.
This is essentially a vtable dispatch and emerges naturally from the
architecture.

#### Descriptor Table Layout (in SM, initialised at boot)

```
@fib_desc + 0:  return destination tag template (ctx=0)
                [0][0][port][PE][gen][offset][0000]
@fib_desc + 1:  arg 'n' destination tag template (ctx=0)
                [0][0][port][PE][gen][offset][0000]

; for N=2 function:
@func_desc + 0: return tag template
@func_desc + 1: arg 0 tag template
@func_desc + 2: arg 1 tag template
```

Templates are full 16-bit flit 1 values with ctx field set to 0.
The stub's OR instruction patches bits [3:0] with the allocated ctx.
Templates are written to SM during bootstrap (via EXEC from ROM or
explicit SM_WRITE_C during init).

#### SM-Based Argument Passing (Large Functions)

For functions with many arguments (N > 3), the IRAM cost of the
OR + CHANGE_TAG stub becomes prohibitive — each arg burns 2 dyadic
IRAM slots. An alternative: **stage arguments in SM cells and let
the EXEC sequence deliver them.**

The caller writes argument values to a block of SM "call frame"
cells using SM_WRITE_C (monadic, const-addressed).
The EXEC
sequence's tail end includes SM_READ_C tokens that read those cells
back out and deliver them as tokens to the callee's entry points.
The callee is oblivious — it just sees tokens arriving normally.

```
Caller writes args to SM call frame:
  SM_WRITE_C(@frame + 0, arg0_value)    ; monadic, 1 IRAM slot
  SM_WRITE_C(@frame + 1, arg1_value)    ; monadic
  ...
  SM_WRITE_C(@frame + N-1, argN_value)  ; monadic
  SM_WRITE_C(@frame + N, ret_cont)      ; return continuation
  EXEC @call_seq                        ; fire it all

EXEC sequence (in SM/ROM):
  SM_READ_C(@frame + 0)   → deliver to callee &arg0:L
  SM_READ_C(@frame + 1)   → deliver to callee &arg1:L
  ...
  SM_READ_C(@frame + N-1) → deliver to callee &argN:L
  SM_READ_C(@frame + N)   → deliver to callee &ret_cont:L
```

**Costs:**

```
                     stub approach        SM call frame
                     (per-arg OR+CT)      (SM staging)
───────────────────────────────────────────────────────────
caller IRAM:         3 monadic            N+2 monadic (writes + exec + extag)
stub IRAM:           2N dyadic + fan-out  0
EXEC sequence:       N+1 tokens           N+1 tokens
SM cells used:       0 (runtime)          N+1 (call frame)
───────────────────────────────────────────────────────────
dyadic IRAM slots:   2N+                  0
monadic IRAM slots:  3                    N+2
```

The SM call frame approach uses zero dyadic IRAM for call overhead.
All caller instructions are monadic (SM_WRITE_C, EXEC, EXTRACT_TAG),
living in the abundant 32-127 offset range. The callee's IRAM is
pure function body — no call machinery whatsoever.

**Tradeoffs:**

- **Latency:** two SM round-trips per argument (write then read) vs
  one CHANGE_TAG. Adds ~2× SM access latency to call setup. For
  large N where the stub approach would serialise through a fan-out
  chain anyway, the SM approach may not be worse.
- **SM cell pressure:** N+1 cells per call frame. Concurrent calls
  need separate frames (different base addresses).
  The caller manages
  this — either static allocation for known call depth, or an SM
  frame pointer bumped via READ_INC.
- **No ctx patching needed:** the EXEC sequence tokens already have
  the correct routing baked in (they target the callee's entry points
  directly). Context allocation is still needed, but the patching
  step (OR new_ctx into tag templates) is eliminated because the
  EXEC sequence can be **rebuilt per call** from a template +
  allocated ctx. Or, if the callee always runs in a fixed ctx
  (static allocation), the EXEC sequence is truly static.

**When to use which:**

- N=1-2 args: stub approach. the OR + CHANGE_TAG chain is small,
  latency is minimal, no SM cells consumed.
- N=3+ args: SM call frame starts winning on IRAM pressure.
- N=6+ args: SM call frame is clearly better. the stub would need
  12+ dyadic slots just for call overhead.
- one-shot EXEC'd functions (loaded on demand, run once): SM call
  frame is natural — the EXEC that loads the code can also deliver
  the arguments in one sequence.

A function intended to be called via EXEC one-shot (loaded from ROM
into IRAM, executed, then discarded) can have its entire call
convention baked into the EXEC block: code loading tokens first,
then SM_READ tokens that deliver arguments from pre-staged cells.
The caller just writes args to SM and fires EXEC. The function
loads, receives its arguments, runs, sends results, done.
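The crossover reduces to a few lines of arithmetic. A Python sketch
of where each convention spends its budget (counts taken from the
Costs table, with "2N+" treated as 2N):

```python
# Per-call resource use of the two conventions. Scarce dyadic IRAM
# (offsets 0-31) is the pressured resource; the SM call frame trades
# it for monadic slots and N+1 SM cells. Illustrative counts only.

def stub_cost(n_args):
    return {"dyadic": 2 * n_args, "monadic": 3, "sm_cells": 0}

def sm_frame_cost(n_args):
    return {"dyadic": 0, "monadic": n_args + 2, "sm_cells": n_args + 1}

for n in (1, 3, 6):
    print(n, stub_cost(n)["dyadic"], sm_frame_cost(n)["sm_cells"])
# → 1 2 2
# → 3 6 4
# → 6 12 7
```

At N=6 the stub's 12 dyadic slots match the "12+ dyadic slots"
figure in the list above, while the frame costs 7 SM cells and no
dyadic IRAM at all.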
---

## Built-in Macros

The assembler ships built-in macros (prepended to every program) that
implement common patterns from this document:

| Macro | Parameters | Outputs (@ret) | Pattern |
|-------|------------|----------------|---------|
| `#loop_counted` | `init, limit` | `body`, `exit` | Counted loop (counter + compare + increment feedback) |
| `#loop_while` | `test` | `body`, `exit` | Condition-tested loop (gate node) |
| `#permit_inject` | `*targets` | (routes to targets) | Permit injection (variadic, one const(1) per target) |
| `#reduce_2`..`_4` | `op` | (positional) | Binary reduction tree (parameterized opcode, per-arity) |

All built-in macros use `@ret` output wiring. The `#reduce_*` family
accepts any opcode as a parameter (e.g., `#reduce_4 add |> &result`).
See `asm/builtins.py` for definitions.

#### Example Macro Definitions

```dfasm
; ── call stub for a 1-argument function ──
; emitted once per function, provides shared call infrastructure
#call_stub_1 func, desc |> {
    ; ctx fan-out (1 arg + 1 return = 2 consumers, one PASS suffices)
    &__${func}_ctx_fan <| pass
    &__${func}_ctx_fan |> &__${func}_or_ret:R, &__${func}_or_arg0:R

    ; tag patching
    &__${func}_or_ret <| or
    &__${func}_or_arg0 <| or

    ; dispatch
    &__${func}_ct_ret <| change_tag
    &__${func}_ct_arg0 <| change_tag

    ; internal wiring
    &__${func}_or_ret |> &__${func}_ct_ret:L
    &__${func}_or_arg0 |> &__${func}_ct_arg0:L
}

; ── call stub for a 2-argument function ──
#call_stub_2 func, desc |> {
    ; ctx fan-out chain (3 consumers)
    &__${func}_ctx_fan0 <| pass
    &__${func}_ctx_fan1 <| pass
    &__${func}_ctx_fan0 |> &__${func}_or_ret:R, &__${func}_ctx_fan1
    &__${func}_ctx_fan1 |> &__${func}_or_arg0:R, &__${func}_or_arg1:R

    ; tag patching
    &__${func}_or_ret <| or
    &__${func}_or_arg0 <| or
    &__${func}_or_arg1 <| or

    ; dispatch
    &__${func}_ct_ret <| change_tag
    &__${func}_ct_arg0 <| change_tag
    &__${func}_ct_arg1 <| change_tag

    ; internal wiring
    &__${func}_or_ret |> &__${func}_ct_ret:L
    &__${func}_or_arg0 |> &__${func}_ct_arg0:L
    &__${func}_or_arg1 |> &__${func}_ct_arg1:L
}

; ── per-call-site (works for any arity) ──
#call_dyn func, alloc, call_seq, ret_offset, *args |> {
    ; allocate + trigger + return continuation
    &__call_alloc_${func} <| rd_inc, ${alloc}
    &__call_exec_${func} <| exec, ${call_seq}
    &__call_extag_${func} <| extract_tag, ${ret_offset}

    ; wire into stub
    &__call_alloc_${func} |> &__${func}_ctx_fan
    &__call_extag_${func} |> &__${func}_ct_ret:R
    $(
        ${args} |> &__${func}_ct_${args}:R
    ),*
}
```

Usage:

```dfasm
; one-time setup
#call_stub_1 fib, @fib_desc

; at each call site
#call_dyn fib, @ctx_alloc, @fib_call_seq, 20, &my_arg
```

The `call_stub_N` per-arity approach is admittedly clunky. Macros can
invoke other macros (nested expansion up to depth 32), so a unified
variadic call_stub using a helper macro is possible. Per-arity variants
are used here for clarity. N=1 through N=3 covers the vast majority
of functions, and anything beyond N=3 can be hand-written — it's the
same pattern, just more of it.

### Permit Injection — Two Approaches

For small K (roughly K <= 4), inline CONST injection.
The built-in
`#permit_inject` macro is variadic — pass the target nodes directly:

```dfasm
; Built-in definition (from asm/builtins.py):
#permit_inject *targets |> {
    $(
        &p <| const, 1
        &p |> ${targets}
    ),*
}

; Usage: pass gate nodes as targets
#permit_inject &dispatch_gate:L, &dispatch_gate:L
```

For large K, use SM EXEC to batch-emit permits:

```dfasm
; Custom macro for EXEC-based injection:
#permit_inject_exec count, sm_base |> {
    ; a single SM EXEC reading ${count} pre-formed permit
    ; tokens from SM starting at ${sm_base}
    &exec <| exec, ${sm_base}
}

; Usage: inject 8 permits via EXEC
#permit_inject_exec 8, @permit_store
```

Programmer chooses based on K. No magic.

### Loop Control Macro

The built-in `#loop_counted` provides the core loop infrastructure.
It accepts `init` and `limit` as input parameters and exposes `body`
and `exit` as `@ret` outputs:

```dfasm
; Built-in definition (from asm/builtins.py):
#loop_counted init, limit |> {
    &counter <| add
    &compare <| brgt
    &counter |> &compare:L
    &body_fan <| pass
    &compare |> &body_fan:L
    &inc <| inc
    &body_fan |> &inc:L
    &inc |> &counter:R
    ${init} |> &counter:L
    ${limit} |> &compare:R
    &body_fan |> @ret_body
    &compare |> @ret_exit:R
}

; Usage: pass init/limit as args, wire body/exit via |>
&c_init <| const, 0
&c_limit <| const, 64
#loop_counted &c_init, &c_limit |> body=&body_entry, exit=&done
```

### Reduction Tree Macro

```dfasm
#reduce_tree &op, &inputs[], &output |> {
    ; expands to ceil(log2(N)) levels of binary &op instructions
    ; N inferred from length of &inputs[]
    ; &output receives the final reduced value
}

; Usage:
#reduce_tree add, [&s0, &s1, &s2, &s3], @total
```

### Parallel Loop (Composition)

A parallel loop is manual composition of macros and function
calls.
No single macro tries to handle the full topology — each handles the
repetitive part it's good at.

```dfasm
@system pe=4, sm=1, ctx=8

; The body as a function — self-loop accumulator
$body |> {
    &acc <| add
    &acc |> &acc:L      ; feedback: acc recirculates
    ; &i arrives as input, feeds &acc:R
    ; &acc drains to @ret on completion
}

; Loop control
&c_init <| const, 0
&c_limit <| const, 64
#loop_counted &c_init, &c_limit |> body=&dispatch, exit=&done

; Permit injection (4 permits to throttle body launches)
#permit_inject &gate:L, &gate:L, &gate:L, &gate:L

; Gated dispatch — permits throttle body launches
&gate <| gate
&dispatch |> &gate:R    ; loop data → gate right port
; permits arrive at &gate:L from injection + body completion

; Body invocations via function call syntax
$body i=&gate |> &partial

; Reduction of partial results
#reduce_tree add, [&p0, &p1, &p2, &p3], @final_sum
```

---

## Pattern Cost Summary

| Pattern | HW cost | IRAM slots | Iterations/cycle | Parallel? |
| --------------------- | -------- | ----------------------- | -------------------- | --------------- |
| Self-loop accumulator | 0 | 1 (the ADD) | ~1/8 (bus RT) | yes (per-ctx) |
| Permit-token throttle | 0 | K+2 (permits + GATE) | K in flight | yes |
| Counted loop control | 0 | 4 (CONST+INC+LT+SWITCH) | ~1/8 (bus RT) | no (sequential) |
| Binary reduction tree | 0 | K-1 (one per ADD) | log2(K) levels | yes |
| Predicate register | ~1 chip | +1 bit/instr | saves ~4 cycles/iter | no (shared) |
| Accumulator register | ~3 chips | 1 (ACC_ADD) | ~1/4 (no bus RT) | no (shared) |

All zero-hardware patterns work with v0. Predicate and accumulator
registers are independent future additions that compose with the
existing patterns.