OR-1 dataflow CPU sketch

Loop Patterns and Flow Control Idioms#

Execution patterns for loops, reductions, and flow control in the dataflow architecture. These are software/compiler conventions built from existing hardware primitives — no dedicated loop hardware exists.

Most patterns described here are candidates for assembler macros: reusable expansions that emit the underlying instruction sequences. The programmer writes #loop_counted init, limit |> body=&body_entry, exit=&done and the assembler expands it into the token feedback arcs, SWITCH routing, and permit structures described below.

See iram-and-function-calls.md for IRAM format and ctx_mode details. See alu-and-output-design.md for SWITCH, GATE, and output modes. See sm-design.md for SM operations referenced by some patterns.


Core Loop Mechanism#

There is no program counter, no branch instruction, and no loop construct in hardware. Loops are token feedback arcs: an instruction's output token is routed back to an input of an earlier instruction (or itself) in the dataflow graph. The loop "iterates" each time a token completes the feedback circuit.

Minimal Counted Loop#

graph:

  CONST(0) seed token
     │
     ▼
  ┌─────┐      ┌──────────┐      ┌────────────────┐
  │ INC │─────►│ LT limit │─────►│ SWITCH         │
  └─────┘      └──────────┘      │  true  → body  │
     ▲  i+1        bool          │  false → exit  │
     │                           └───────┬────────┘
     │        data (i+1) to body ◄───────┤
     │                                   │
     └──── i+1 fed back (same token) ────┘

Instructions (all on same PE, same context):

; Counted loop: increment from 0, dispatch to body, exit when done

&counter <| const, 0               ; initial counter value (seed token starts loop)
&step    <| inc                    ; increment counter
&cmp     <| lt                     ; compare counter < limit
&route   <| sweq                   ; route by comparison result

const <limit> |> &cmp:R            ; limit value (or SM read if > 255)

&counter |> &step                  ; seed → first increment
&step    |> &cmp:L                 ; counter → comparison left
&step    |> &route:L               ; counter → switch data input (fan-out)
&cmp     |> &route:R               ; bool → switch control

&route:L |> &body:L                ; taken (true) → dispatch to body
&route:R |> &exit:L                ; not-taken (false) → loop done
&route:L |> &step                  ; feedback arc: counter recirculates

The feedback arc is just a destination field in the SWITCH instruction (or a PASS trampoline if SWITCH can't dual-route to both body dispatch and the INC feedback simultaneously). The loop "runs" as long as tokens keep flowing through the feedback arc.

Timing#

Each iteration traverses: INC → LT → SWITCH → bus → INC. With the v0 pipeline (no local bypass, all tokens go through external bus), expect roughly 6-10 cycles per loop control iteration depending on bus contention and pipeline depth.

The loop body executes concurrently with loop control — the dispatched body token enters the body subgraph immediately while the counter continues to the next iteration. If the body takes longer than one control iteration, multiple body invocations can be in flight simultaneously (with appropriate flow control — see Permit Tokens).
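The control circuit above can be read as a single token recirculating through INC → LT → SWITCH. A toy Python model of that circulation (names and structure hypothetical — this only illustrates the token flow, not the ISA):

```python
# Toy model of the counted-loop control circuit: a single counter token
# recirculates through INC -> LT -> SWITCH until the limit is reached.
# Each trip around the feedback arc dispatches one token to the body;
# the false branch of the SWITCH is the exit arc.
def run_counted_loop(limit):
    body_dispatches = []          # tokens routed to the body subgraph
    token = 0                     # seed token from CONST(0)
    while True:
        token += 1                # INC
        in_range = token < limit  # LT limit -> bool
        if in_range:              # SWITCH true: dispatch + feedback arc
            body_dispatches.append(token)
        else:                     # SWITCH false: exit, loop dies
            return body_dispatches

print(run_counted_loop(4))        # -> [1, 2, 3]
```

Note that because INC fires before the compare, the body sees values 1 through limit-1 — seeding CONST with -1 would shift the range to start at 0.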


Permit-Token Flow Control#

When loop body iterations can execute concurrently, a throttling mechanism prevents context slot exhaustion. Permit tokens are the standard dataflow idiom for this.

Concept#

K permit tokens circulate through the system. Each dispatch to a body context consumes one permit. Each body completion produces one permit. At most K body iterations are in flight simultaneously. If no permits are available, the dispatch GATE stalls — the loop control token waits in the matching store until a permit arrives.

  permits (K tokens, initially injected at boot)
      │
      ▼
  ┌────────────────────┐
  │ GATE               │◄──── loop control produces (counter, body_data)
  │   L: permit        │
  │   R: dispatch_data │
  └─────────┬──────────┘
            │ (fires only when BOTH permit AND data are ready)
            ▼
  dispatch to body context (CHANGE_TAG or CTX_OVRD)
            │
            ▼
  body executes ... body completes
            │
            ▼
  emit permit token back to GATE (port L)

Implementation#

The GATE instruction is dyadic. Left port receives the permit token. Right port receives the loop's dispatch data (counter value, array pointer, whatever the body needs). GATE fires only when both are present — this IS the backpressure mechanism. No special hardware flow control.

; Permit-gated dispatch
&gate <| gate                      ; dyadic: L=permit, R=dispatch data
&loop_output |> &gate:R            ; loop control feeds data to gate

; Body completion recycles the permit
&body_done |> &gate:L              ; body's final instruction returns permit

The body's final instruction emits a token to &gate:L as one of its destinations. This token is the recycled permit. Its data value is irrelevant (just a trigger); what matters is its presence in the matching store.

K is chosen by the compiler:

  • K = 1: fully sequential, one body at a time. safe default.
  • K = number of reserved body context slots: maximum parallelism.
  • K = pipeline depth / body latency: optimal for throughput.
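The throttling behaviour can be sanity-checked with a toy discrete-time model — GATE fires only while a permit token is present, and each body completion recycles its permit (illustrative Python, not the hardware):

```python
# Toy model of permit-token throttling: dispatch proceeds only while a
# permit is available; body completions recycle permits after a fixed
# latency. Returns the peak number of body iterations in flight, which
# never exceeds K.
from collections import deque

def run_throttled(n_iterations, k_permits, body_latency):
    permits = k_permits
    in_flight = deque()            # (completion_time, iteration) pairs
    peak = 0
    time = 0
    dispatched = 0
    while dispatched < n_iterations or in_flight:
        # body completions emit their permit token back to the GATE
        while in_flight and in_flight[0][0] <= time:
            in_flight.popleft()
            permits += 1
        # GATE fires only when BOTH a permit and dispatch data are ready
        if dispatched < n_iterations and permits > 0:
            permits -= 1
            in_flight.append((time + body_latency, dispatched))
            dispatched += 1
        peak = max(peak, len(in_flight))
        time += 1
    return peak

print(run_throttled(16, k_permits=4, body_latency=10))   # -> 4
print(run_throttled(16, k_permits=1, body_latency=10))   # -> 1
```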

Initial Permit Injection#

At boot (or function entry), K permit tokens must be injected into the GATE. Options:

  • CONST chain: K CONST instructions at sequential offsets, each with dest targeting the GATE's left port. Triggered by the function entry token via fan-out. Burns K IRAM slots but is simple.
  • SM EXEC: pre-load K permit tokens in SM, EXEC emits them. Uses one IRAM slot for the EXEC trigger. Better for large K.
  • Assembler macro: #permit_inject *targets expands to the appropriate injection sequence.

Assembler Macro Sketch#

; PERMIT_LOOP(K, limit, body_label, exit_label)
; Expands to:
;   - K CONST instructions emitting initial permits
;   - GATE (permit, dispatch_data) guarding body dispatch
;   - INC/LT/SWITCH loop control chain
;   - feedback arc from SWITCH to INC
;   - body return path emitting permit on completion

The macro assigns IRAM offsets for the control structure and reserves K context slots for body iterations.


Parallel Reduction#

A common pattern following parallel loop iterations: combine K partial results into a single value.

Binary Reduction Tree#

K=4 partial sums:  s0   s1   s2   s3
                    \   /     \   /
                     ADD       ADD
                      \       /
                        ADD
                         │
                       total

Each ADD is a dyadic instruction in its own right. The partial results arrive as tokens, match in the ADD's matching store entry, fire, and produce the next level's input. The tree structure is pure dataflow — no special reduction hardware.

For K iterations, the tree has log2(K) levels and K-1 ADD instructions. With K=8, that's 7 ADDs across 3 levels. All ADDs at the same level can fire in parallel (they're on different matching store entries or different PEs).
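The K-1 / log2(K) claim follows directly from pairwise combination; a small Python sketch that performs the reduction and counts ops and levels (illustrative only):

```python
# Pairwise reduction tree: K inputs combine in ceil(log2(K)) levels
# using exactly K-1 dyadic ops. An odd leftover at any level passes up
# unchanged, which is what the compiler's routing would do too.
def reduce_tree(values, op):
    level = list(values)
    ops_used = 0
    levels = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))   # one ADD per pair
            ops_used += 1
        if len(level) % 2:                           # odd leftover
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], ops_used, levels

total, adds, depth = reduce_tree([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)
print(total, adds, depth)    # -> 36 7 3   (K=8: 7 ADDs, 3 levels)
```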

Assembler Macro Sketch#

; REDUCE(op, inputs[], output)
; Expands to:
;   - ceil(log2(N)) levels of binary op instructions
;   - routing from each level's outputs to next level's inputs
;   - final output routed to specified destination

Loop-Carried Accumulators (Self-Loop Pattern)#

A value that updates every iteration and feeds back to itself. The canonical example: sum += a[i].

Matching-Store-as-Register#

A dyadic instruction whose output routes back to its own left port:

; Self-loop accumulator: sum += each incoming value
&acc <| add                        ; dyadic: L=accumulated sum, R=new element
&acc |> &acc:L                     ; feedback: result → own left port

; Initialise: deposit starting value before first element arrives
&init <| const, 0
&init |> &acc:L                    ; seed the accumulator with 0

The matching store cell at &acc's (ctx, offset, port L) holds the accumulator value between iterations. Each new element arriving on port R triggers the ADD, which deposits the updated sum back into port L's cell.

Timing: Each accumulation step is a full round-trip: ALU → output formatter → bus → input FIFO → matching store → ALU. Roughly 6-10 cycles at v0. This is the sequential bottleneck — the accumulator feedback arc is inherently serial.

Extracting the final value: After the last element is accumulated, the sum sits in the matching store cell at port L. It needs a "drain" event to extract it. Options:

  • A sentinel token on port R triggers one final ADD (or PASS), and dest2 routes the result to the downstream consumer.
  • A GATE controlled by a "loop done" boolean from the loop control. When the loop completes, the GATE opens and the accumulated value flows out.
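The accumulate-then-drain behaviour can be modelled in a few lines — the port-L variable below plays the role of the matching store cell, and the sentinel plays the drain event (a hypothetical sketch, not the ISA):

```python
# Toy model of the matching-store-as-register accumulator: the cell at
# (ctx, offset, port L) holds the running sum between iterations. Each
# element arriving on port R fires the ADD, which re-deposits the result
# on its own port L. A sentinel on port R drains the final value.
SENTINEL = object()

def self_loop_accumulate(elements, seed=0):
    port_l = seed                      # cell seeded by the CONST init token
    for tok in elements + [SENTINEL]:
        if tok is SENTINEL:            # drain event: route sum downstream
            return port_l
        port_l = port_l + tok          # ADD fires; result -> own port L

print(self_loop_accumulate([3, 4, 5]))   # -> 12
```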

With Parallel Iterations#

Each body context has its own matching store entries (different ctx → different SRAM address). K parallel accumulators at the same IRAM offset but different context slots operate independently.

; All iterations share the same IRAM instruction for &acc.
; Different context slots → different matching store cells.
; No interference between iterations.

; ctx=1: sum_1 accumulator (self-loop)
; ctx=2: sum_2 accumulator (self-loop)
; ctx=3: sum_3 accumulator (self-loop)
; ... all at the same &acc instruction, different contexts

After all iterations complete, a reduction tree combines partial sums. The permit-token mechanism guarantees that partial sums are ready before the reduction begins (the permits themselves can be chained to trigger the reduction).


Predicate Register Optimisation (Future, ~1 Chip)#

A single shared 1-bit register (or small multi-bit register) that stores a comparison result locally, bypassing the token network for the boolean path.

Benefit#

In the standard loop control pattern, the comparison boolean travels as a token: LT produces a bool token → bus → matching store → SWITCH consumes it. The predicate register short-circuits this:

without predicate register:
  LT → [bool token] → bus → matching → SWITCH
  cost: full token round-trip for the boolean

with predicate register:
  LT writes bool to predicate register (side effect, no token)
  SWITCH reads predicate register (local wire, no matching needed)
  cost: zero additional cycles for the boolean path

The counter feedback arc still goes through the bus. But the boolean path — typically half the loop control overhead — becomes free.

Constraints#

  • Not per-context. Single shared register. The compiler must guarantee only one activation uses the predicate at a time.
  • Not suitable for parallel iterations. Each iteration would need its own predicate state. Use the token-based boolean path for parallel loop control.
  • IRAM encoding: 1-2 bits per instruction (pred_write, pred_read). Can be folded into opcode space as dedicated variants (LT_P, SWITCH_P) or drawn from spare bits in half 1.

Hardware#

1-bit predicate register:     1 flip-flop
write path (from comparator): 1 gate (write enable)
read mux (to SWITCH):         1 gate (bool_out source select)
Total:                        ~1 chip (fraction of a chip, really)

Accumulator Register Optimisation (Future, ~3 Chips)#

A single shared 16-bit register writable by the ALU and readable as an ALU input source. Eliminates the bus round-trip for tight accumulation loops.

Benefit#

without accumulator register:
  ADD(acc, new) → output → bus → input → matching → ADD
  cost: ~6-10 cycles per accumulation

with accumulator register:
  ACC_ADD: reads acc from register, adds new from token, writes result
           back to register. monadic (only new element token needed).
  cost: 1 pipeline pass per accumulation (~3-5 cycles)

Constraints#

  • Not per-context. Same single-activation restriction as predicate register.
  • Monadic operation. ACC_ADD/ACC_SUB are monadic — the accumulator is an implicit operand from the register, the explicit operand comes from the arriving token. No matching store entry consumed.
  • No matching store write conflict. The register is a separate storage element from the matching store. ALU writes to the register at Stage 4; matching store writes happen at Stage 2. No port conflict, no stall logic needed.
  • Stepping stone to SC blocks. The accumulator register is effectively the first register of a future local register file. Adding a second register + sequential instruction counter yields a minimal strongly-connected (SC) block capability.

Hardware#

16-bit register (2x 74LS374):           2 chips
ALU source mux (add acc_reg input):     1 chip
write enable gating:                    ~0 chips (1 gate)
Total:                                  ~3 chips per PE

Assembler Macro and Function Call Strategy#

The patterns above are mechanical enough to be assembler macros. Macros are expected to be simple text/token substitution (C-style #define territory, not Zig comptime or C++ templates). This means:

  • No conditional logic within macros. Different strategies get different macros, and the programmer picks which to use.
  • No offset allocation intelligence. The assembler tracks placement and validates the expansion (offset collisions, missing labels), but the macro itself is dumb substitution.
  • No type checking or context-slot tracking. The programmer is responsible for not blowing the slot budget.
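The "dumb substitution" model is roughly C-preprocessor-shaped; a minimal Python sketch of such an expander (hypothetical — not the actual assembler in asm/builtins.py) shows how little machinery it implies:

```python
import re

# Minimal text-substitution macro expander in the spirit described above:
# named parameters are spliced into a body template via ${name}. No
# conditionals, no offset allocation -- only an arity check, which is
# validation rather than logic.
def expand(params, body, args):
    if len(args) != len(params):
        raise ValueError("arity mismatch")
    mapping = dict(zip(params, args))
    return re.sub(r"\$\{(\w+)\}", lambda m: mapping[m.group(1)], body)

body = "&__call_alloc_${func} <| rd_inc, ${alloc}"
print(expand(["func", "alloc"], body, ["fib", "@ctx_alloc"]))
# -> &__call_alloc_fib <| rd_inc, @ctx_alloc
```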

Macro Syntax#

A macro call uses #macro_name and follows the same syntax as any other operation (arguments, edge wiring, port qualifiers). Macro definitions follow a function-block-like structure.

Function Calls as Syntax#

Using a $func label as an instruction generates the appropriate routing for static calls. Named arguments match against the function's internal labels:

; Function definition:
$add_pair |> {
  add &a, &b |> @ret
}

; Static call — named args wire to internal labels:
$add_pair a=&x, b=&y |> @output

@ret is a built-in pseudo-node that marks the function's return point. The assembler resolves a=&x to mean "wire &x's output to $add_pair.&a, port L, with ctx_mode=01 and the allocator-assigned context slot." The |> @output in the call then wires the function's return point to @output.

For static calls (non-recursive, known call graph), this generates only routing annotations on existing instructions — no extra IRAM entries, no CHANGE_TAG, just destination fields set with ctx_mode=01.

Dynamic and Recursive Calls (v1, Manual / Macro-Assisted)#

The assembler does NOT handle recursive or indirect calls automatically — that's compiler territory. Recursive calls require runtime context allocation (SM READ_INC), CHANGE_TAG sequences, and EXTRACT_TAG for return continuations. The assembler provides macros to reduce boilerplate for the mechanical parts; the programmer manages descriptor tables, context budgets, and flow control.

The Problem#

A dynamic call to a function with N arguments requires:

  1. Allocate context — SM READ_INC on an allocator cell → new_ctx
  2. Build return continuation — EXTRACT_TAG captures caller's (PE, ctx, offset, gen) as a 16-bit packed tag value
  3. Fetch N+1 tag templates — SM_READ_C from a descriptor table (one per argument destination + one for return destination)
  4. Patch ctx into each tag — OR new_ctx into bits [3:0] of each template (templates are pre-built at boot with ctx=0)
  5. Send return continuation — CHANGE_TAG with patched return tag
  6. Send each argument — CHANGE_TAG with patched arg tag + value

Done naively (one IRAM slot per step), this burns 4N+8 IRAM slots per call site. Two recursive calls = 24+ slots for N=1. With 128 slots per PE, that's unsustainable.

The EXEC-Based Call Stub#

EXEC is a token cannon — it reads a sequence of pre-formed tokens from SM/ROM and fires them onto the bus. The tokens can be anything: SM read requests, CM tokens, triggers. One IRAM slot (the EXEC trigger) replaces an arbitrary number of pre-staged operations.

The key insight: steps 3-4 above (fetch tag templates, deliver them to patching logic) are identical for every call to the same function. The tag templates, their SM addresses, and where to deliver them are all compile-time constants. Only the allocated ctx and argument values change per call.

This splits the call machinery into two parts:

Call stub (shared, loaded once per function):

IRAM instructions that receive runtime values (ctx, args, return continuation) and perform the patching + dispatch. These live in IRAM and are shared across all call sites for the same function. Different call sites invoke the stub in different context slots, so their matching store entries don't collide.

EXEC sequence (in SM/ROM, per function):

Pre-formed tokens that read tag templates from the descriptor table and deliver them to the stub's OR instructions. Triggered by a single EXEC instruction. Stored once, fired per call.

Per-function (loaded once):
  call stub in IRAM:   ctx fan-out + OR patches + CHANGE_TAGs
  EXEC sequence in SM: SM_READ_C tokens targeting the stub
  descriptor table:    pre-formed flit 1 templates (ctx=0)

Per-call-site (tiny):
  3 IRAM slots:  rd_inc (allocate) + exec (trigger) + extract_tag (return)
  wiring:        feed ctx, return cont, and arg values into the stub

Call Stub Structure (Example: N=1 Argument)#

; ── call stub for $fib, loaded once, shared across call sites ──
; runs in caller's allocated ctx (different per call → no collision)

; ctx fan-out: new_ctx needs to reach 2 OR instructions (ret + arg)
&__fib_ctx_fan <| pass
&__fib_ctx_fan |> &__fib_or_ret:R, &__fib_or_n:R

; tag patching: template (from EXEC'd SM_READ_C) OR'd with new_ctx
&__fib_or_ret  <| or            ; L: ret tag template, R: new_ctx
&__fib_or_n    <| or            ; L: arg tag template, R: new_ctx

; dispatch: patched tag + data → output token
&__fib_ct_ret  <| change_tag    ; L: patched ret tag, R: return continuation
&__fib_ct_n    <| change_tag    ; L: patched arg tag, R: argument value

; internal wiring
&__fib_or_ret  |> &__fib_ct_ret:L
&__fib_or_n    |> &__fib_ct_n:L

Stub cost: 1 (PASS fan-out) + 2 (OR) + 2 (CHANGE_TAG) = 5 IRAM slots for N=1. For N=2: add 1 more PASS to the fan-out chain + 1 OR + 1 CHANGE_TAG = 8 slots. General: 3N + 2 slots (N PASSes in the fan-out chain, plus an OR and a CHANGE_TAG for each argument and for the return).

Note: the OR and CHANGE_TAG instructions are dyadic, consuming IRAM slots in the low-offset range (0-31). The PASS fan-out chain is monadic and can live in the monadic range (offsets 32+), where IRAM space is more abundant — monadic instructions don't consume matching store entries, so the 7-bit offset space is available.
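The slot arithmetic above is easy to check mechanically. A small Python sketch comparing the naive inline sequence (4N+8 per call site) against the shared stub plus 3-slot call sites, using the component counts from this section:

```python
# IRAM-cost arithmetic for dynamic calls, per the counts above:
# naive inline sequence vs shared call stub + tiny per-call-site cost.
def naive_slots(n_args, call_sites):
    return (4 * n_args + 8) * call_sites     # 4N+8 burned at every site

def stub_slots(n_args, call_sites):
    fan_out = n_args                         # PASS chain for N+1 consumers
    patch_and_dispatch = 2 * (n_args + 1)    # OR + CHANGE_TAG per dest
    shared = fan_out + patch_and_dispatch    # loaded once per function
    return shared + 3 * call_sites           # rd_inc + exec + extract_tag

print(naive_slots(1, 2))   # -> 24  (two recursive calls, N=1)
print(stub_slots(1, 2))    # -> 11  (5-slot shared stub + 2x3 per site)
```

The stub approach amortises: every additional call site costs 3 slots instead of 12.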

Per-Call-Site Expansion#

; ── per call site: 3 IRAM slots + wiring ──

&__alloc  <| rd_inc, @ctx_alloc            ; allocate callee context
&__exec   <| exec, @fib_call_seq           ; fire tag-fetch sequence
&__extag  <| extract_tag, <ret_offset>     ; capture return continuation

; wire runtime values into stub (these are edge declarations, not IRAM)
&__alloc  |> &__fib_ctx_fan                ; new ctx → stub fan-out
&__extag  |> &__fib_ct_ret:R              ; return cont → stub
&arg_val  |> &__fib_ct_n:R               ; argument → stub

rd_inc, exec, and extract_tag are all monadic. The per-call-site cost is 3 monadic IRAM slots — they sit in the monadic offset range and don't consume matching store entries.

EXEC Sequence Contents (in SM/ROM)#

Pre-formed tokens, stored at boot, fired by exec:

Token 0:  SM_READ_C(@fib_desc + 0) → deliver to &__fib_or_ret:L
Token 1:  SM_READ_C(@fib_desc + 1) → deliver to &__fib_or_n:L

Each token is a fully-formed 2-flit packet: flit 1 = SM read command, flit 2 = return routing pointing at the stub's OR instruction. The EXEC sequencer reads these from consecutive SM cells and emits them onto the bus. The SM processes each read and returns the tag template to the specified OR instruction.

Fibonacci: Two Recursive Calls#

$fib |> {
  ; ── function body ──
  &n <| pass
  lt &n, 2 |> &test:R
  &test <| sweq
  &n |> &test:L

  ; base case
  &test:L |> @ret

  ; recursive case
  sub &n, 1 |> &n1
  sub &n, 2 |> &n2

  ; two calls, each 3 monadic IRAM slots + shared stub
  &__alloc1  <| rd_inc, @ctx_alloc
  &__exec1   <| exec, @fib_call_seq
  &__extag1  <| extract_tag, 20          ; results arrive at offset 20

  &__alloc1  |> &__fib_ctx_fan
  &__extag1  |> &__fib_ct_ret:R
  &n1        |> &__fib_ct_n:R

  &__alloc2  <| rd_inc, @ctx_alloc
  &__exec2   <| exec, @fib_call_seq
  &__extag2  <| extract_tag, 21          ; results arrive at offset 21

  &__alloc2  |> &__fib_ctx_fan
  &__extag2  |> &__fib_ct_ret:R
  &n2        |> &__fib_ct_n:R

  ; reduction
  add &r1, &r2 |> @ret                   ; r1 at offset 20, r2 at offset 21
}

Important: The two calls share the same stub IRAM instructions but run in different contexts (ctx allocated by rd_inc). The matching store entries for &__fib_or_ret etc. are indexed by (ctx, offset), so different ctx values → different cells → no collision.

Each call is sequenced internally by data dependencies — its CHANGE_TAGs can't fire until its own rd_inc and exec complete — but the two calls are independent of each other. Both can be in flight simultaneously.

IRAM Budget#

                          dyadic slots   monadic slots
                          (0-31 range)   (32-127 range)
─────────────────────────────────────────────────────────
function body (fib):          ~6             ~4
call stub (shared):           4              1
per call site (×2):           0              6
result reduction:             1              0
─────────────────────────────────────────────────────────
total:                       ~11            ~11

~22 IRAM slots total for recursive fibonacci. Well within 128. And if fib is called from external sites, they pay only 3 monadic slots each — the stub and body are already loaded.

Stub Sharing Across Mutual Recursion#

Two-layer recursion (A calls B calls A) can share ctx allocation and EXEC infrastructure. If A and B are on the same PE:

  • Each function has its own call stub (different tag templates)
  • They share the same @ctx_alloc SM cell
  • Their EXEC sequences are independent but stored in the same SM
  • Both stubs live in IRAM simultaneously at different offsets

If A and B have the same argument count and shape, a future optimisation is a generic call stub parameterised only by which EXEC sequence to fire. The tag templates in the descriptor table encode all the per-function differences. The stub just patches ctx and dispatches — it doesn't know or care which function it's calling. This is essentially a vtable dispatch and emerges naturally from the architecture.

Descriptor Table Layout (in SM, initialised at boot)#

@fib_desc + 0:  return destination tag template (ctx=0)
                [0][0][port][PE][gen][offset][0000]
@fib_desc + 1:  arg 'n' destination tag template (ctx=0)
                [0][0][port][PE][gen][offset][0000]

; for N=2 function:
@func_desc + 0: return tag template
@func_desc + 1: arg 0 tag template
@func_desc + 2: arg 1 tag template

Templates are full 16-bit flit 1 values with ctx field set to 0. The stub's OR instruction patches bits [3:0] with the allocated ctx. Templates are written to SM during bootstrap (via EXEC from ROM or explicit SM_WRITE_C during init).
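The patching step is a single OR over the low nibble. A Python sketch of the bit manipulation (the example template value is made up; only the 16-bit width and the ctx field in bits [3:0] come from the text):

```python
# Tag-template patching as described above: templates are stored at boot
# with ctx=0, and the stub ORs the allocated context slot into bits [3:0].
def patch_ctx(template, new_ctx):
    assert template & 0xF == 0       # boot-time templates carry ctx=0
    assert 0 <= new_ctx <= 0xF       # 4-bit context slot
    return template | new_ctx        # the single OR the stub performs

template = 0b0010_0101_0001_0000     # example 16-bit flit 1 value, ctx=0
print(hex(patch_ctx(template, 0xB)))  # -> 0x251b
```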

SM-Based Argument Passing (Large Functions)#

For functions with many arguments (N > 3), the IRAM cost of the OR + CHANGE_TAG stub becomes prohibitive — each arg burns 2 dyadic IRAM slots. An alternative: stage arguments in SM cells and let the EXEC sequence deliver them.

The caller writes argument values to a block of SM "call frame" cells using SM_WRITE_C (monadic, const-addressed). The EXEC sequence's tail end includes SM_READ_C tokens that read those cells back out and deliver them as tokens to the callee's entry points. The callee is oblivious — it just sees tokens arriving normally.

Caller writes args to SM call frame:
  SM_WRITE_C(@frame + 0, arg0_value)    ; monadic, 1 IRAM slot
  SM_WRITE_C(@frame + 1, arg1_value)    ; monadic
  ...
  SM_WRITE_C(@frame + N-1, argN_value)  ; monadic
  SM_WRITE_C(@frame + N, ret_cont)      ; return continuation
  EXEC @call_seq                         ; fire it all

EXEC sequence (in SM/ROM):
  SM_READ_C(@frame + 0) → deliver to callee &arg0:L
  SM_READ_C(@frame + 1) → deliver to callee &arg1:L
  ...
  SM_READ_C(@frame + N-1) → deliver to callee &argN:L
  SM_READ_C(@frame + N) → deliver to callee &ret_cont:L

Costs:

                     stub approach         SM call frame
                     (per-arg OR+CT)       (SM staging)
───────────────────────────────────────────────────────────
caller IRAM:         3 monadic             N+3 monadic (writes + exec + extag)
stub IRAM:           2N+2 dyadic + fan-out 0
EXEC sequence:       N+1 tokens            N+1 tokens
SM cells used:       0 (runtime)           N+1 (call frame)
───────────────────────────────────────────────────────────
dyadic IRAM slots:   2N+2                  0
monadic IRAM slots:  3                     N+3

The SM call frame approach uses zero dyadic IRAM for call overhead. All caller instructions are monadic (SM_WRITE_C, EXEC, EXTRACT_TAG), living in the abundant 32-127 offset range. The callee's IRAM is pure function body — no call machinery whatsoever.
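The pressure on the scarce dyadic range is what drives the crossover. A quick Python sketch of how the stub convention eats into the 32-slot dyadic range as argument count grows (the SM call frame uses zero dyadic slots regardless of N):

```python
# Dyadic-slot pressure of the stub calling convention as argument count
# grows: one OR + one CHANGE_TAG per argument and per return, all dyadic.
DYADIC_RANGE = 32                 # offsets 0-31 hold dyadic instructions

def stub_dyadic(n_args):
    return 2 * (n_args + 1)       # OR + CHANGE_TAG for each arg + return

for n in (1, 2, 3, 6):
    used = stub_dyadic(n)
    print(f"N={n}: {used} dyadic slots of call overhead")
```

At N=6 the stub burns 14 dyadic slots — nearly half the range — before the function body places a single instruction, which is why the SM call frame wins for wide argument lists.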

Tradeoffs:

  • Latency: two SM round-trips per argument (write then read) vs one CHANGE_TAG. Adds ~2× SM access latency to call setup. For large N where the stub approach would serialise through a fan-out chain anyway, the SM approach may not be worse.
  • SM cell pressure: N+1 cells per call frame. Concurrent calls need separate frames (different base addresses). The caller manages this — either static allocation for known call depth, or an SM frame pointer bumped via READ_INC.
  • No ctx patching needed: the EXEC sequence tokens already have the correct routing baked in (they target the callee's entry points directly). Context allocation is still needed, but the patching step (OR new_ctx into tag templates) is eliminated because the EXEC sequence can be rebuilt per call from a template + allocated ctx. Or, if the callee always runs in a fixed ctx (static allocation), the EXEC sequence is truly static.

When to use which:

  • N=1-2 args: stub approach. the OR + CHANGE_TAG chain is small, latency is minimal, no SM cells consumed.
  • N=3+ args: SM call frame starts winning on IRAM pressure.
  • N=6+ args: SM call frame is clearly better. the stub would need 12+ dyadic slots just for call overhead.
  • one-shot EXEC'd functions (loaded on demand, run once): SM call frame is natural — the EXEC that loads the code can also deliver the arguments in one sequence.

A function intended to be called via EXEC one-shot (loaded from ROM into IRAM, executed, then discarded) can have its entire call convention baked into the EXEC block: code loading tokens first, then SM_READ tokens that deliver arguments from pre-staged cells. The caller just writes args to SM and fires EXEC. The function loads, receives its arguments, runs, sends results, done.


Built-in Macros#

The assembler ships built-in macros (prepended to every program) that implement common patterns from this document:

Macro            Parameters     Outputs (@ret)        Pattern
────────────────────────────────────────────────────────────────
#loop_counted    init, limit    body, exit            Counted loop (counter + compare + increment feedback)
#loop_while      test           body, exit            Condition-tested loop (gate node)
#permit_inject   *targets       (routes to targets)   Permit injection (variadic, one const(1) per target)
#reduce_2.._4    op             (positional)          Binary reduction tree (parameterized opcode, per-arity)

All built-in macros use @ret output wiring. The #reduce_* family accepts any opcode as a parameter (e.g., #reduce_4 add |> &result). See asm/builtins.py for definitions.

Example Macro Definitions#

; ── call stub for a 1-argument function ──
; emitted once per function, provides shared call infrastructure
#call_stub_1 func, desc |> {
  ; ctx fan-out (1 arg + 1 return = 2 consumers, one PASS suffices)
  &__${func}_ctx_fan <| pass
  &__${func}_ctx_fan |> &__${func}_or_ret:R, &__${func}_or_arg0:R

  ; tag patching
  &__${func}_or_ret  <| or
  &__${func}_or_arg0 <| or

  ; dispatch
  &__${func}_ct_ret  <| change_tag
  &__${func}_ct_arg0 <| change_tag

  ; internal wiring
  &__${func}_or_ret  |> &__${func}_ct_ret:L
  &__${func}_or_arg0 |> &__${func}_ct_arg0:L
}

; ── call stub for a 2-argument function ──
#call_stub_2 func, desc |> {
  ; ctx fan-out chain (3 consumers)
  &__${func}_ctx_fan0 <| pass
  &__${func}_ctx_fan1 <| pass
  &__${func}_ctx_fan0 |> &__${func}_or_ret:R, &__${func}_ctx_fan1
  &__${func}_ctx_fan1 |> &__${func}_or_arg0:R, &__${func}_or_arg1:R

  ; tag patching
  &__${func}_or_ret  <| or
  &__${func}_or_arg0 <| or
  &__${func}_or_arg1 <| or

  ; dispatch
  &__${func}_ct_ret  <| change_tag
  &__${func}_ct_arg0 <| change_tag
  &__${func}_ct_arg1 <| change_tag

  ; internal wiring
  &__${func}_or_ret  |> &__${func}_ct_ret:L
  &__${func}_or_arg0 |> &__${func}_ct_arg0:L
  &__${func}_or_arg1 |> &__${func}_ct_arg1:L
}

; ── per-call-site (works for any arity) ──
#call_dyn func, alloc, call_seq, ret_offset, *args |> {
  ; allocate + trigger + return continuation
  &__call_alloc_${func}  <| rd_inc, ${alloc}
  &__call_exec_${func}   <| exec, ${call_seq}
  &__call_extag_${func}  <| extract_tag, ${ret_offset}

  ; wire into stub
  &__call_alloc_${func} |> &__${func}_ctx_fan
  &__call_extag_${func} |> &__${func}_ct_ret:R
  $(
    ${args} |> &__${func}_ct_${args}:R
  ),*
}

Usage:

; one-time setup
#call_stub_1 fib, @fib_desc

; at each call site
#call_dyn fib, @ctx_alloc, @fib_call_seq, 20, &my_arg

The call_stub_N per-arity approach is admittedly clunky. Macros can invoke other macros (nested expansion up to depth 32), so a unified variadic call_stub using a helper macro is possible. Per-arity variants are used here for clarity. N=1 through N=3 covers the vast majority of functions, and anything beyond N=3 can be hand-written — it's the same pattern, just more of it.

Permit Injection — Two Approaches#

For small K (roughly K <= 4), inline CONST injection. The built-in #permit_inject macro is variadic — pass the target nodes directly:

; Built-in definition (from asm/builtins.py):
#permit_inject *targets |> {
    $(
        &p <| const, 1
        &p |> ${targets}
    ),*
}

; Usage: pass gate nodes as targets
#permit_inject &dispatch_gate:L, &dispatch_gate:L

For large K, use SM EXEC to batch-emit permits:

; Custom macro for EXEC-based injection:
#permit_inject_exec count, sm_base |> {
  ; a single SM EXEC reading ${count} pre-formed permit
  ; tokens from SM starting at ${sm_base}
  &exec <| exec, ${sm_base}
}

; Usage: inject 8 permits via EXEC
#permit_inject_exec 8, @permit_store

Programmer chooses based on K. No magic.

Loop Control Macro#

The built-in #loop_counted provides the core loop infrastructure. It accepts init and limit as input parameters and exposes body and exit as @ret outputs:

; Built-in definition (from asm/builtins.py):
#loop_counted init, limit |> {
    &counter <| add
    &compare <| brgt
    &counter |> &compare:L
    &body_fan <| pass
    &compare |> &body_fan:L
    &inc <| inc
    &body_fan |> &inc:L
    &inc |> &counter:R
    ${init} |> &counter:L
    ${limit} |> &compare:R
    &body_fan |> @ret_body
    &compare |> @ret_exit:R
}

; Usage: pass init/limit as args, wire body/exit via |>
&c_init <| const, 0
&c_limit <| const, 64
#loop_counted &c_init, &c_limit |> body=&body_entry, exit=&done

Reduction Tree Macro#

#reduce_tree &op, &inputs[], &output |> {
  ; expands to ceil(log2(N)) levels of binary &op instructions
  ; N inferred from length of &inputs[]
  ; &output receives the final reduced value
}

; Usage:
#reduce_tree add, [&s0, &s1, &s2, &s3], @total

Parallel Loop (Composition)#

A parallel loop is manual composition of macros and function calls. No single macro tries to handle the full topology — each handles the repetitive part it's good at.

@system pe=4, sm=1, ctx=8

; The body as a function — self-loop accumulator
$body |> {
  &acc <| add
  &acc |> &acc:L                   ; feedback: acc recirculates
  ; &i arrives as input, feeds &acc:R
  ; &acc drains to @ret on completion
}

; Loop control
&c_init <| const, 0
&c_limit <| const, 64
#loop_counted &c_init, &c_limit |> body=&dispatch, exit=&done

; Permit injection (4 permits to throttle body launches)
#permit_inject &gate:L, &gate:L, &gate:L, &gate:L

; Gated dispatch — permits throttle body launches
&gate <| gate
&dispatch |> &gate:R               ; loop data → gate right port
; permits arrive at &gate:L from injection + body completion

; Body invocations via function call syntax
$body i=&gate |> &partial

; Reduction of partial results
#reduce_tree add, [&p0, &p1, &p2, &p3], @final_sum

Pattern Cost Summary#

Pattern                  HW cost     IRAM slots                  Iterations/cycle      Parallel?
────────────────────────────────────────────────────────────────────────────────────────────────
Self-loop accumulator    0           1 (the ADD)                 ~1/8 (bus RT)         yes (per-ctx)
Permit-token throttle    0           K+2 (permits + GATE)        K in flight           yes
Counted loop control     0           4 (CONST+INC+LT+SWITCH)     ~1/8 (bus RT)         no (sequential)
Binary reduction tree    0           K-1 (one per ADD)           log2(K) levels        yes
Predicate register       ~1 chip     +1 bit/instr                saves ~4 cycles/iter  no (shared)
Accumulator register     ~3 chips    1 (ACC_ADD)                 ~1/4 (no bus RT)      no (shared)

All zero-hardware patterns work with v0. Predicate and accumulator registers are independent future additions that compose with the existing patterns.