···33These macros are automatically available in all dfasm programs.
44The BUILTIN_MACROS string is prepended to user source before parsing.
5566-Note: The current grammar does not support referencing macro parameters
77-in edge endpoints (bare identifiers aren't valid qualified_ref). Therefore
88-all built-in macros are self-contained: they define their own internal
99-topology and expose well-known internal node names for the user to wire to.
66+Macro parameters can appear in edge endpoints via ${param} syntax.
77+The expand pass handles parameter refs in source/dest positions,
88+correctly skipping scope qualification for external references.
1091111-For example, #loop_counted expands to nodes named &counter (add), &compare
1212-(brgt), and &inc (inc) with the internal feedback loop pre-wired. The user
1313-wires init/limit/body/exit externally after invoking the macro.
1010+The opcode position cannot be parameterized (grammar constraint:
1111+opcode is a keyword terminal). Per-opcode variants are provided
1212+where needed (e.g., reduce_add_N).
1413"""
15141615BUILTIN_MACROS = """\
···7170}
72717372; --- Binary reduction trees (per-arity, per-opcode variants) ---
7474-; Note: The macro expansion system's ParamRef only handles const fields and
7575-; edge endpoints, not opcode positions. Generic opcode parameterization
7676-; (e.g., passing 'add' as a macro argument) is a future enhancement.
7777-; For now, per-opcode variants are provided.
7373+; Note: The opcode position cannot be parameterized (grammar constraint).
7474+; Per-opcode variants are provided instead.
7875#reduce_add_2 |> {
7976 &r <| add
8077}
+81-32
design-notes/alu-and-output-design.md
···100100```
101101EQ A == B. Result: 0x01 if true, 0x00 if false.
102102LT A < B (signed). Result: 0x01 / 0x00.
103103+LTE A ≤ B (signed). Result: 0x01 / 0x00.
103104GT A > B (signed). Result: 0x01 / 0x00.
104104-ULT A < B (unsigned). Result: 0x01 / 0x00.
105105-UGT A > B (unsigned). Result: 0x01 / 0x00.
105105+GTE A ≥ B (signed). Result: 0x01 / 0x00.
106106```
107107108108-All comparisons produce an 8-bit result token with value 0x00 or 0x01.
108108+All comparisons produce a result token with value 0x0000 or 0x0001.
109109This token is data — it enters the network and arrives at a downstream
110110SWITCH or GATE instruction as a normal operand. The bool_out signal
111111(internal to the PE, derived from the comparator output) is also available
112112to the output formatter for same-instruction conditional routing, but the
113113primary mechanism is to emit the boolean as a data token.
114114115115-Signed vs unsigned distinction: the comparator hardware (74LS85) natively
116116-does unsigned magnitude comparison. Signed comparison can be derived by
117117-XORing the sign bits of both inputs before feeding the comparator, or by
118118-using a dedicated signed-compare path. The opcode decoder selects which
119119-path via control lines.
115115+> **⚠ Note:** The emulator operates with 16-bit data throughout, so
116116+> comparison results are 16-bit values (0x0000 or 0x0001). The 8-bit
117117+> hardware datapath sketch below produces 8-bit results (0x00 or 0x01).
118118+> Both representations are semantically identical — only bit 0 carries
119119+> information.
120120+121121+All comparisons interpret operands as signed 2's complement. The
122122+comparator hardware (74LS85) natively does unsigned magnitude comparison.
123123+Signed comparison is derived by XORing the sign bits of both inputs
124124+before feeding the comparator. LTE and GTE are synthesised from the
125125+existing comparator outputs (LTE = LT OR EQ, GTE = GT OR EQ) via
126126+the decoder EEPROM's control signal selection.
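As a rough illustration of the signed-compare derivation described above, the following Python sketch models an unsigned-only magnitude comparator and derives the signed results from it. The function names and the exact sign-bit handling (XOR each operand with 0x80 before the compare) are assumptions made for illustration; the note does not specify the gate-level wiring.

```python
SIGN_BIT = 0x80  # assumes the 8-bit hardware datapath sketch


def unsigned_compare(a: int, b: int) -> tuple[bool, bool, bool]:
    """Model of an unsigned magnitude comparator: returns (lt, eq, gt)."""
    return a < b, a == b, a > b


def signed_compare(a: int, b: int) -> dict[str, int]:
    """Derive signed comparison results from the unsigned comparator.

    Flipping the sign bit of both operands maps the signed range
    [-128, 127] onto [0, 255] while preserving order.
    """
    lt, eq, gt = unsigned_compare((a & 0xFF) ^ SIGN_BIT, (b & 0xFF) ^ SIGN_BIT)
    return {
        "EQ": int(eq),
        "LT": int(lt),
        "LTE": int(lt or eq),  # LTE = LT OR EQ
        "GT": int(gt),
        "GTE": int(gt or eq),  # GTE = GT OR EQ
    }
```

Operands are passed as raw 8-bit patterns, so `signed_compare(0xFF, 0x01)` treats the first operand as -1.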
120127121128### Routing (dyadic)
122129130130+**Branch operations** (BREQ, BRGT, BRGE, BROF): compare L and R
131131+operands, then route the data to the taken or not-taken destination.
132132+Both destinations receive the data value; the branch condition selects
133133+which destination gets it first (taken) vs second (not-taken).
134134+135135+**Switch operations** (SWEQ, SWGT, SWGE, SWOF): compare L and R
136136+operands, then route data to the taken side and a trigger token
137137+(value 0) to the not-taken side. See the Output Formatter section
138138+for the not-taken trigger semantics (FREE vs NOOP).
139139+123140```
124124-SWITCH Conditional route. Takes data (port L) and boolean (port R).
125125- If bool == true: emit data to dest1, emit trigger to dest2.
126126- If bool == false: emit data to dest2, emit trigger to dest1.
127127- See Output Formatter section for trigger token semantics.
128128-129141GATE Conditional pass/suppress. Takes data (port L) and boolean (port R).
130142 If bool == true: emit data to dest1 (and optionally dest2).
131143 If bool == false: emit nothing. Both tokens consumed silently.
144144+SEL Select between inputs based on a condition.
145145+MRGE Merge two token streams (non-deterministic).
132146```
133147134134-SWITCH always emits exactly 2 tokens: one data token to the taken side,
135135-one inline monadic trigger to the not-taken side. See the Output Formatter
136136-section for the not-taken trigger semantics (FREE vs NOOP).
148148+Branch and switch ops always emit exactly 2 tokens: one data token to
149149+the taken side, one token to the not-taken side (data for branch, inline
150150+monadic trigger for switch). See the Output Formatter section for details.
137151138152GATE emits 0 or 1-2 tokens depending on the boolean. When suppressed
139153(bool false), both input operands are consumed with no output — the
···170184| ----------- | -------- | ------- | -------------- | ------------------------ |
171185| 00000 | ADD | dyadic | DUAL or SINGLE | A + B |
172186| 00001 | SUB | dyadic | DUAL or SINGLE | A - B |
173173-| 00000 | INC | monadic | DUAL or SINGLE | A + 1 (imm const) |
174174-| 00001 | DEC | monadic | DUAL or SINGLE | A - 1 (imm const) |
187187+| 00010 | INC | monadic | DUAL or SINGLE | A + 1 (imm const) |
188188+| 00011 | DEC | monadic | DUAL or SINGLE | A - 1 (imm const) |
175189| 00100 | AND | dyadic | DUAL or SINGLE | A & B |
176190| 00101 | OR | dyadic | DUAL or SINGLE | A \| B |
177191| 00110 | XOR | dyadic | DUAL or SINGLE | A ^ B |
···181195| 01010 | ASR | monadic | DUAL or SINGLE | A >> N (imm, arithmetic) |
182196| 01011 | EQ | dyadic | DUAL or SINGLE | A == B → bool |
183197| 01100 | LT | dyadic | DUAL or SINGLE | A < B signed → bool |
184184-| 01101 | GT | dyadic | DUAL or SINGLE | A > B signed → bool |
185185-| 01110 | ULT | dyadic | DUAL or SINGLE | A < B unsigned → bool |
186186-| 01111 | UGT | dyadic | DUAL or SINGLE | A > B unsigned → bool |
187187-| 10000 | SWITCH | dyadic | SWITCH | route data by bool |
188188-| 10001 | GATE | dyadic | GATE | pass or suppress by bool |
189189-| 10010 | PASS | monadic | DUAL or SINGLE | identity |
190190-| 10011 | CONST | monadic | DUAL or SINGLE | output = immediate |
191191-| 10100 | FREE_CTX | monadic | SUPPRESS | deallocate slot |
192192-| 10101-11111 | — | — | — | reserved for expansion |
198198+| 01101 | LTE | dyadic | DUAL or SINGLE | A ≤ B signed → bool |
199199+| 01110 | GT | dyadic | DUAL or SINGLE | A > B signed → bool |
200200+| 01111 | GTE | dyadic | DUAL or SINGLE | A ≥ B signed → bool |
201201+| 10000 | BREQ | dyadic | SWITCH | branch if L == R |
202202+| 10001 | BRGT | dyadic | SWITCH | branch if L > R |
203203+| 10010 | BRGE | dyadic | SWITCH | branch if L ≥ R |
204204+| 10011 | BROF | dyadic | SWITCH | branch on overflow |
205205+| 10100 | SWEQ | dyadic | SWITCH | switch if L == R |
206206+| 10101 | SWGT | dyadic | SWITCH | switch if L > R |
207207+| 10110 | SWGE | dyadic | SWITCH | switch if L ≥ R |
208208+| 10111 | SWOF | dyadic | SWITCH | switch on overflow |
209209+| 11000 | GATE | dyadic | GATE | pass or suppress by bool |
210210+| 11001 | SEL | dyadic | DUAL or SINGLE | select between inputs |
211211+| 11010 | MRGE | dyadic | DUAL or SINGLE | merge two token streams |
212212+| 11011 | PASS | monadic | DUAL or SINGLE | identity |
213213+| 11100 | CONST | monadic | DUAL or SINGLE | output = immediate |
214214+| 11101 | FREE_CTX | monadic | SUPPRESS | deallocate slot |
215215+| 11110-11111 | — | — | — | reserved for expansion |
216216+217217+> **⚠ Preliminary:** The binary opcode encodings above are a draft layout,
218218+> not a committed hardware encoding. The Python emulator and assembler use
219219+> IntEnum ordinal values that do NOT correspond to these bit patterns.
220220+> Final hardware encoding will be determined during physical build.
193221194222The output mode column indicates the default. DUAL vs SINGLE is
195223controlled by a flag in the IRAM instruction word (has_dest2), not by
196224the opcode — any arithmetic/logic/compare instruction can fan out to
197225two destinations or one.
198226199199-**Reserved opcode space (11 slots):** future candidates include hardware
200200-multiply, predicate store read/write, triadic operations, SM-directed
201201-operations, debug/trace instructions, and CALL/RETURN linkage primitives.
227227+**Branch vs Switch:** Branch operations (BR*) compare their two operands
228228+and route the data value to the taken or not-taken destination. Switch
229229+operations (SW*) are similar but emit a trigger token (value 0) to the
230230+not-taken side instead of the data value. See the Output Formatter section
231231+for the not-taken trigger semantics.
232232+233233+**Reserved opcode space (2 slots):** future candidates include hardware
234234+multiply, predicate store read/write, and debug/trace instructions.
235235+236236+### SM Instruction Dispatch
237237+238238+The operation set above covers CM compute instructions (IRAM half 0,
239239+bit 15 = 0). When IRAM half 0 bit 15 = 1, the PE dispatches an SM
240240+operation instead of an ALU computation. In this mode:
241241+242242+- The ALU is bypassed entirely
243243+- Stage 4 constructs an `SMToken` from the SM opcode, operand data,
244244+ and IRAM fields (SM_id, const address, return routing)
245245+- Stage 5 emits the SMToken to the target SM via the SM bus
246246+247247+The SM instruction encoding uses the same IRAM word width but with a
248248+different field layout. See `iram-and-function-calls.md` for the SM
249249+IRAM word format and operation table.
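The dispatch decision described in this section can be sketched in a few lines of Python. This is only an illustration: it assumes IRAM half 0 is available as a 16-bit integer, and the names `is_sm_instruction` and `dispatch` are hypothetical, not taken from the emulator.

```python
SM_DISPATCH_BIT = 15  # bit 15 of IRAM half 0, per the text above


def is_sm_instruction(iram_half0: int) -> bool:
    """Return True when the word selects SM dispatch instead of the ALU."""
    return bool((iram_half0 >> SM_DISPATCH_BIT) & 1)


def dispatch(iram_half0: int) -> str:
    # "sm": the ALU is bypassed and an SMToken is built in stage 4.
    # "alu": ordinary CM compute instruction.
    return "sm" if is_sm_instruction(iram_half0) else "alu"
```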
202250203251---
204252···343391```
344392result:8 computed data value (or passthrough for SWITCH/PASS)
345393bool_out:1 boolean/comparison result:
346346- - from comparator for EQ/LT/GT/ULT/UGT
347347- - from data_R bit 0 for SWITCH/GATE (boolean input)
394394+ - from comparator for EQ/LT/LTE/GT/GTE
395395+ - from comparator for BR*/SW* (inline comparison)
396396+ - from data_R bit 0 for GATE (boolean input)
348397 - undefined for arithmetic/logic ops
349398```
350399
+12-7
design-notes/architecture-overview.md
···344344 much of SM00's address space is mapped to IO vs general-purpose storage?
345345 SM00 is special only at boot for now; further specialisation deferred
346346 until profiling shows the standard opcodes are insufficient.
347347-5. ~~**Compiler / assembler**~~ — **Partially Resolved.** The `asm/` package
348348- implements a 6-stage assembler pipeline (parse → lower → resolve →
349349- place → allocate → codegen). Produces PEConfig/SMConfig + seed tokens
350350- or a bootstrap token stream. See `assembler-architecture.md` for architecture.
351351- Grammar is `dfasm.lark` (Lark/Earley parser). Auto-placement via
352352- greedy bin-packing with locality heuristic. Remaining work: further
353353- optimisation passes, macro expansion, binary output.
347347+5. ~~**Compiler / assembler**~~ — **Resolved.** The `asm/` package
348348+ implements a 7-stage assembler pipeline (parse → lower → expand →
349349+ resolve → place → allocate → codegen). Produces PEConfig/SMConfig +
350350+ seed tokens or a bootstrap token stream. The expand pass handles
351351+ macro expansion (`#macro` definitions with `${param}` substitution,
352352+ variadic repetition, constant arithmetic) and function call wiring
353353+ (cross-context edges, trampoline nodes, context teardown). Built-in
354354+ macros for common patterns (counted loops, permit injection, reduction
355355+ trees) are automatically available. See `assembler-architecture.md`
356356+ for architecture. Grammar is `dfasm.lark` (Lark/Earley parser).
357357+ Auto-placement via greedy bin-packing with locality heuristic.
358358+ Remaining work: optimisation passes, binary output.
3543597. **Mode B clock ratio** — exactly 2x, or design for arbitrary integer
355360 ratios? See `bus-architecture-and-width-decoupling.md`.
3563618. **Instruction residency** — small IRAM per PE means programs larger
+54-11
design-notes/assembler-architecture.md
···13131414## Pipeline Overview
15151616-Six stages, each a pure function from `IRGraph → IRGraph` (or `IRGraph → output`).
1616+Seven stages, each a pure function from `IRGraph → IRGraph` (or `IRGraph → output`).
17171818The pipeline is:
1919···2424 Parse Lark/Earley parser, dfasm.lark grammar
2525 │ → concrete syntax tree (CST)
2626 ▼
2727- Lower CST → IRGraph (nodes, edges, regions, data defs)
2727+ Lower CST → IRGraph (nodes, edges, regions, data defs,
2828+ │ macro defs, macro calls, call sites)
2829 │ → name qualification, scope creation
3030+ ▼
3131+ Expand macro expansion + function call wiring
3232+ │ → clone macro bodies, substitute params, evaluate
3333+ │ const expressions, wire cross-context call edges
2934 ▼
3035 Resolve validate edge endpoints, detect scope violations
3136 │ → "did you mean" suggestions via Levenshtein distance
···3742 │ → dyadic-first layout, per-PE context scoping
3843 ▼
3944 Codegen emit PEConfig/SMConfig + seeds (direct mode)
4040- or SM init → ROUTE_SET → LOAD_INST → seeds (token mode)
4545+ or SM init → IRAM writes → seeds (token mode)
4146```
42474348Each pass returns a new `IRGraph`. Graphs are never mutated after construction; each pass produces a fresh copy with the new information filled in. Errors accumulate in `IRGraph.errors` rather than failing fast, so the assembler reports all problems in a single run rather than forcing the programmer to fix them one at a time.
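A minimal Python sketch of this pass style, assuming a frozen dataclass as a stand-in for `IRGraph`. The `Graph` class, the `check_names` pass, and `run_pipeline` are hypothetical and much simpler than the real types; they only show the "return a fresh copy, accumulate errors" pattern.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Graph:
    """Hypothetical stand-in for IRGraph: immutable, with accumulated errors."""
    nodes: tuple[str, ...] = ()
    errors: tuple[str, ...] = ()


def check_names(graph: Graph) -> Graph:
    """Example pass: report every invalid node name instead of failing fast."""
    new_errors = tuple(
        f"invalid node name: {name!r}"
        for name in graph.nodes
        if not name.startswith("&")
    )
    # replace() builds a fresh Graph; the input graph is left untouched.
    return replace(graph, errors=graph.errors + new_errors)


def run_pipeline(graph: Graph, passes) -> Graph:
    """Thread the graph through each pass in order."""
    for p in passes:
        graph = p(graph)
    return graph
```

Because every pass returns a new object, the caller can still inspect the input graph after the pipeline has run.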
···57625863| Type | Fields | Purpose |
5964|------|--------|---------|
6060-| `IRNode` | name, opcode, dest_l, dest_r, const, pe, iram_offset, ctx, loc, args, sm_id | Single instruction in the dataflow graph |
6161-| `IREdge` | source, dest, port, source_port, loc | Connection between two nodes |
6262-| `IRGraph` | nodes, edges, regions, data_defs, system, errors | Complete program representation |
6363-| `IRRegion` | tag, kind, body (IRGraph), loc | Nested scope (FUNCTION or LOCATION) |
6565+| `IRNode` | name, opcode, dest_l, dest_r, const, pe, iram_offset, ctx, loc, args, sm_id, seed | Single instruction in the dataflow graph. `name` and `const` can be `ParamRef` or `ConstExpr` in macro templates. |
6666+| `IREdge` | source, dest, port, source_port, port_explicit, ctx_override, loc | Connection between two nodes. `ctx_override` marks cross-context call edges (ctx_mode=01). `port_explicit` tracks whether port was user-specified. |
6767+| `IRGraph` | nodes, edges, regions, data_defs, system, errors, macro_defs, macro_calls, raw_call_sites, call_sites, builtin_line_offset | Complete program representation. Macro-related fields are populated by lower and consumed by expand. |
6868+| `IRRegion` | tag, kind, body (IRGraph), loc | Nested scope (FUNCTION, LOCATION, or MACRO) |
6469| `IRDataDef` | name, sm_id, cell_addr, value, loc | SM cell initialisation |
6565-| `SystemConfig` | pe_count, sm_count, iram_capacity, ctx_slots, loc | Hardware configuration from `@system` pragma |
7070+| `SystemConfig` | pe_count, sm_count, iram_capacity, ctx_slots, loc | Hardware configuration from `@system` pragma (defaults: iram=128, ctx=16) |
7171+| `MacroDef` | name, params, body (IRGraph), repetition_blocks, loc | Macro definition: formal parameters + template body with `ParamRef` placeholders |
7272+| `MacroParam` | name, variadic | Formal macro parameter. `variadic=True` collects remaining args. |
7373+| `ParamRef` | param, prefix, suffix | Placeholder for a macro parameter. Supports token pasting via prefix/suffix. |
7474+| `ConstExpr` | expression, params, loc | Compile-time arithmetic expression (e.g., `base + _idx + 1`) |
7575+| `IRRepetitionBlock` | body (IRGraph), variadic_param, loc | Repetition block (`@each`) expanded once per variadic argument |
7676+| `IRMacroCall` | name, positional_args, named_args, loc | Macro invocation, consumed by expand pass |
7777+| `CallSiteResult` | func_name, input_args, output_dests, loc | Intermediate call data from lower, consumed by expand |
7878+| `CallSite` | func_name, call_id, input_edges, trampoline_nodes, free_ctx_nodes, loc | Processed call site metadata for per-call-site context allocation |
66796780### Destination Representation
6881···100113- `strong_edge` / `weak_edge` -> anonymous `IRNode` + input/output
101114 `IREdge` set (creates a `CompositeResult`)
102115- `func_def` → `IRRegion(kind=FUNCTION)` with a nested `IRGraph` body
116116+- `macro_def` → `MacroDef` with `ParamRef` placeholders in body template
117117+- `macro_call` → `IRMacroCall` with positional/named arguments
118118+- `call_stmt` → `CallSiteResult` with function name, input args, output dests
103119- `location_dir` -> `IRRegion(kind=LOCATION)`: subsequent statements
104120 are collected into its body during post-processing
105121- `data_def` -> `IRDataDef` with SM placement and cell address
···112128113129**Opcode mapping (`opcodes.py`):**
114130115115-Mnemonic strings from the grammar are mapped to `ALUOp`, `MemOp`, or `CfgOp` enum values via `MNEMONIC_TO_OP`. A complication: Python `IntEnum` sub-classes can share numeric values across types (`ArithOp.ADD == 0 == MemOp.READ`), so the reverse mapping and set membership tests use type-aware collections (`TypeAwareOpToMnemonicDict`, `TypeAwareMonadicOpsSet`) that key on `(type, value)` tuples internally.
131131+Mnemonic strings from the grammar are mapped to `ALUOp` or `MemOp` enum values via `MNEMONIC_TO_OP`. A complication: Python `IntEnum` sub-classes can share numeric values across types (`ArithOp.ADD == 0 == MemOp.READ`), so the reverse mapping and set membership tests use type-aware collections (`TypeAwareOpToMnemonicDict`, `TypeAwareMonadicOpsSet`) that key on `(type, value)` tuples internally.
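The collision is easy to reproduce. The sketch below uses two one-member enums as stand-ins for the real opcode enums, and a hypothetical `TypeAwareDict` that keys on `(type, value)` tuples; it illustrates the idea and is not the actual `TypeAwareOpToMnemonicDict` implementation.

```python
from enum import IntEnum


class ArithOp(IntEnum):  # minimal stand-in; the real enum has more members
    ADD = 0


class MemOp(IntEnum):  # minimal stand-in
    READ = 0


class TypeAwareDict:
    """Hypothetical sketch of a mapping keyed on (type, value) pairs."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, op, mnemonic):
        self._data[(type(op), int(op))] = mnemonic

    def __getitem__(self, op):
        return self._data[(type(op), int(op))]

    def __len__(self):
        return len(self._data)


# A plain dict conflates the two members: they compare and hash equal.
plain = {ArithOp.ADD: "add", MemOp.READ: "read"}

aware = TypeAwareDict()
aware[ArithOp.ADD] = "add"
aware[MemOp.READ] = "read"
```

Here `plain` ends up with a single entry, while `aware` keeps both mnemonics apart.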
132132+133133+### Expand (`expand.py`)
134134+135135+Processes macro definitions, macro invocations, and function call sites. This pass bridges the gap between the template-based IR from lowering and the concrete IR that resolve expects.
136136+137137+**Macro expansion:**
138138+139139+1. Collect all `MacroDef` entries from the graph (including built-in macros prepended to every program)
140140+2. For each `IRMacroCall`, clone the macro's body template
141141+3. Substitute `ParamRef` placeholders with actual argument values
142142+4. Evaluate `ConstExpr` arithmetic expressions (supports `+`, `-`, `*` on integers and `_idx`)
143143+5. Expand `IRRepetitionBlock` entries once per variadic argument, binding `_idx` to the iteration index
144144+6. Qualify expanded names with scope prefixes: `#macroname_N.&label` for top-level, `$func.#macro_N.&label` inside functions
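Steps 3 and 6 above can be sketched in Python as follows. The `ParamRef` here is a simplified stand-in with the three fields named in the type table; `substitute` and `qualify` are hypothetical helpers, and whether the sigil lives in the prefix or in the argument is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParamRef:
    """Simplified placeholder: param name plus optional pasted prefix/suffix."""
    param: str
    prefix: str = ""
    suffix: str = ""


def substitute(value, args: dict):
    """Replace a ParamRef with its argument; leave concrete values as-is."""
    if isinstance(value, ParamRef):
        return f"{value.prefix}{args[value.param]}{value.suffix}"
    return value


def qualify(label: str, macro: str, invocation: int,
            func: Optional[str] = None) -> str:
    """Build '#macro_N.&label', or '$func.#macro_N.&label' inside a function."""
    scoped = f"#{macro}_{invocation}.{label}"
    return f"${func}.{scoped}" if func else scoped
```

For example, a template name `&${name}_left` invoked with `name=foo` as the first call of `#make_pair` would come out as `#make_pair_0.&foo_left` under these assumptions.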
145145+146146+**Function call wiring:**
147147+148148+1. For each `CallSiteResult` from lowering, create a `CallSite` with a unique `call_id`
149149+2. Generate trampoline `PASS` nodes for return routing
150150+3. Create `IREdge` entries with `ctx_override=True` for cross-context argument passing (becomes `ctx_mode=01` in codegen)
151151+4. Generate `FREE_CTX` nodes for context teardown on call completion
152152+5. Wire `@ret` / `@ret_name` synthetic nodes for return paths
153153+154154+**Post-conditions:**
155155+156156+After expansion, the IR contains only concrete `IRNode`/`IREdge` entries. No `ParamRef` placeholders, no `MacroDef` regions, no `IRMacroCall` entries remain.
116157117158### Resolve (`resolve.py`)
118159···151192152193**System config inference:**
153194154154-If no `@system` pragma is provided, the placer infers `pe_count` from the highest explicit PE ID and uses defaults for IRAM capacity (64) and context slots (4).
195195+If no `@system` pragma is provided, the placer infers `pe_count` from the highest explicit PE ID and uses defaults for IRAM capacity (128) and context slots (16).
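A small Python sketch of this inference, using the default values stated above. `InferredConfig` and `infer_config` are hypothetical names, and the sketch assumes zero-based PE IDs (so the count is the highest ID plus one) and a single PE when no explicit IDs are present; the placer's actual behaviour in those edge cases is not stated here.

```python
from dataclasses import dataclass

DEFAULT_IRAM_CAPACITY = 128
DEFAULT_CTX_SLOTS = 16


@dataclass(frozen=True)
class InferredConfig:  # hypothetical stand-in for SystemConfig
    pe_count: int
    iram_capacity: int = DEFAULT_IRAM_CAPACITY
    ctx_slots: int = DEFAULT_CTX_SLOTS


def infer_config(explicit_pe_ids: list) -> InferredConfig:
    # Assumption: PE IDs are zero-based, so count = highest ID + 1,
    # and an empty list falls back to a single PE.
    pe_count = max(explicit_pe_ids, default=0) + 1
    return InferredConfig(pe_count=pe_count)
```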
155196156197### Allocate (`allocate.py`)
157198···216257| RESOURCE | allocate | IRAM overflow, context slot overflow, missing SM target |
217258| ARITY | lower | wrong operand count |
218259| PORT | allocate | port conflicts, missing destinations |
260260+| MACRO | expand | undefined macro, wrong argument count, expansion failure |
261261+| CALL | expand | undefined function, missing return path, call wiring error |
219262| UNREACHABLE | (future) | unused nodes |
220263| VALUE | lower | out-of-range literals |
221264···243286## Future Work
244287245288- **Optimization passes** between resolve and place: dead node elimination, constant folding, sub-graph deduplication
246246-- **Macro expansion**: the grammar already supports `#macro` syntax; the expansion pass is not yet implemented
247289- **Wider placement heuristics**: graph partitioning, min-cut algorithms, or profile-guided placement for larger programs
248290- **Incremental reassembly**: modify part of the graph and re-run only affected passes
249291- **Hardware encoding pass**: translate ALUInst/SMInst to bit-level instruction words for actual IRAM loading
292292+- **Conditional macro expansion**: the current macro system supports variadic repetition, constant arithmetic, and nested macro invocation (depth limit 32), but not conditionals within macros
+196-45
design-notes/dfasm-primer.md
···23232424### Names and Sigils
25252626-dfasm uses three sigil-prefixed naming conventions:
2626+dfasm uses four sigil-prefixed naming conventions:
2727+2828+| Sigil | Scope | Use |
2929+| -------- | --------------------------------- | --------------------------------- |
3030+| `@name` | Global (top-level) | Node references, data definitions |
3131+| `&name` | Local (within enclosing function) | Labels for instructions |
3232+| `$name` | Global | Function / subgraph definitions |
3333+| `#name` | Global | Macro definitions and invocations |
27342828-| Sigil | Scope | Use |
2929-| ------- | --------------------------------- | --------------------------------- |
3030-| `@name` | Global (top-level) | Node references, data definitions |
3131-| `&name` | Local (within enclosing function) | Labels for instructions |
3232-| `$name` | Global | Function / subgraph definitions |
3535+Additionally, `${name}` is used within macro bodies for parameter substitution (see Macros section).
33363437Names are composed of `[a-zA-Z_][a-zA-Z0-9_]*`.
3538···6770| --------- | -------- | ------- | ---------------------------------------- |
6871| `pe` | yes | — | Number of processing elements |
6972| `sm` | yes | — | Number of structure memory modules |
7070-| `iram` | no | 64 | IRAM capacity per PE (instruction slots) |
7171-| `ctx` | no | 4 | Context slots per PE |
7373+| `iram` | no | 128 | IRAM capacity per PE (instruction slots) |
7474+| `ctx` | no | 16 | Context slots per PE |
72757376At most one `@system` pragma per program.
7477···235238236239These operations route tokens based on a comparison result. They are all dyadic — they compare L and R, then route accordingly.
237240238238-**Branch operations** (`br*`): emit data to `dest_l` (taken) or `dest_r` (not taken) based on comparison:
241241+**Branch operations** (`br*`): compare L and R, then emit data to `dest_l` (taken) or `dest_r` (not taken). Both outputs carry the data value; the branch condition selects the destination:
239242240243| Mnemonic | Condition |
241244| -------- | ---------- |
···243246| `brgt` | L > R |
244247| `brge` | L ≥ R |
245248| `brof` | overflow |
246246-| `brty` | type match |
247249248248-> NOTE:
249249->`br*` ops use predicate register and internal-to-PE loopback route if supported by hardware.
250250+> NOTE:
251251+> `br*` ops use predicate register and internal-to-PE loopback route if supported by hardware. Future strongly-connected block execution will change the behaviour of `br*` ops to support pseudo-sequential execution within a PE.
250252251251-**Switch operations** (`sw*`): like branch, but when the condition is true, data goes to `dest_l` and a trigger token (value 0) goes to `dest_r`.
253253+**Switch operations** (`sw*`): like branch, but when the condition is true, data goes to `dest_l` and a trigger token (value 0) goes to `dest_r`.
252254When false, trigger goes to `dest_l` and data goes to `dest_r`:
253255254256| Mnemonic | Condition |
255257|----------|-----------|
256256-| `sweq` | L == R |
257257-| `swgt` | L > R |
258258-| `swge` | L ≥ R |
259259-| `swof` | overflow |
260260-| `swty` | type match |
258258+| `sweq` | L == R |
259259+| `swgt` | L > R |
260260+| `swge` | L ≥ R |
261261+| `swof` | overflow |
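The switch routing rule can be written as a two-way selection. This Python sketch only models which token lands on which destination; the function name is hypothetical, and how the data operand is chosen from L and R is deliberately left out because the text above does not specify it.

```python
TRIGGER = 0  # trigger token value, per the text above


def switch_outputs(condition: bool, data: int) -> tuple[int, int]:
    """Return (dest_l_token, dest_r_token) for a sw* operation."""
    if condition:
        return data, TRIGGER  # taken: data to dest_l, trigger to dest_r
    return TRIGGER, data  # not taken: trigger to dest_l, data to dest_r
```

Either way, exactly two tokens are produced per firing.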
261262262263**Other routing:**
263264264264-| Mnemonic | Arity | Description |
265265-| -------- | ------ | -------------------------------------------------------- |
266266-| `gate` | dyadic | pass data through if bool_out is true, suppress if false |
267267-| `sel` | dyadic | select between inputs |
268268-| `merge` | dyadic | merge two inputs |
265265+| Mnemonic | Arity | Description |
266266+| -------- | ------ | ------------------------------------------------------------------- |
267267+| `gate` | dyadic | pass data through if bool_out is true, suppress if false |
268268+| `sel` | dyadic | select between L and R based on a condition |
269269+| `merge` | dyadic | merge two token streams (non-deterministic: fires on either input) |
269270270271### Data
271272···274275| `pass` | monadic | pass data through unchanged |
275276| `const` | monadic | emit constant value (from const field) |
276277| `free_ctx` | monadic | deallocate context slot, no data output |
277277-| `call` | dyadic | |
278278279279-- `free_ctx` in particular is a special token used to handle function body and loop exits.
279279+- `free_ctx` is a special instruction used to handle function body and loop exits. It frees the context slot so it can be reused.
280280281281### Structure Memory
282282283283-| Mnemonic | Arity | Description |
284284-| -------- | ----------------- | --------------------------------------------------------------------------------------------------------------------- |
285285-| `read` | monadic | read from SM cell (const = cell address) |
286286-| `write` | context-dependent | write to SM cell — monadic if const is set (cell addr from const), dyadic if const is None (cell addr from L operand) |
287287-| `clear` | monadic | clear SM cell |
288288-| `alloc` | monadic | allocate SM cell |
289289-| `free` | monadic | free SM cell |
290290-| `rd_inc` | monadic | atomic read-and-increment |
291291-| `rd_dec` | monadic | atomic read-and-decrement |
292292-| `cmp_sw` | monadic | compare-and-swap |
293293-### Configuration / System
283283+| Mnemonic | Arity | Description |
284284+| ----------- | ----------------- | --------------------------------------------------------------------------------------------------------------------- |
285285+| `read` | monadic | read from SM cell (const = cell address) |
286286+| `write` | context-dependent | write to SM cell — monadic if const is set (cell addr from const), dyadic if const is None (cell addr from L operand) |
287287+| `clear` | monadic | clear SM cell (reset to EMPTY state) |
288288+| `alloc` | monadic | allocate SM cell |
289289+| `free` | monadic | free SM cell |
290290+| `rd_inc` | monadic | atomic read-and-increment |
291291+| `rd_dec` | monadic | atomic read-and-decrement |
292292+| `cmp_sw` | dyadic | compare-and-swap (L = expected, R = new value) |
293293+| `exec` | monadic | trigger EXEC on SM (inject tokens from T0 storage into network) |
294294+| `raw_read` | monadic | raw read from T0 storage (no I-structure semantics) |
295295+| `set_page` | monadic | set SM page register (T0 operation) |
296296+| `write_imm` | monadic | immediate write to SM cell (T0 operation) |
297297+| `ext` | monadic | extended SM operation |
294298295295-| Mnemonic | Description |
296296-|----------|-------------|
297297-| `load_inst` | load instruction into PE IRAM |
298298-| `route_set` | configure PE routing table |
299299-| `ior` | I/O read |
300300-| `iow` | I/O write |
301301-| `iorw` | I/O read-write |
299299+Note: `free` (SM cell deallocation) and `free_ctx` (PE context slot deallocation) are distinct operations targeting different resources.
302300303303-These are rarely written by hand — `load_inst` and `route_set` are generated by the assembler's token stream mode during bootstrap.
301301+SM opcodes use a variable-width bus encoding. See `sm-design.md` for the full opcode table and encoding tiers.
304302305303## Literals
306304···445443446444**Direct mode** produces `PEConfig` objects (IRAM contents, route restrictions, context slot count) and `SMConfig` objects (initial cell values), plus seed tokens. This is the fast path for the emulator. Configuration is applied directly.
447445448448-**Token stream mode** produces a bootstrap sequence: SM initialization writes, route configuration tokens, instruction load tokens, then seed tokens. This mirrors the bootstrap process, loading the code stored at the reset vector.
446446+**Token stream mode** produces a bootstrap sequence: SM initialization writes, IRAM write tokens, then seed tokens. This mirrors the bootstrap process, loading the code stored at the reset vector.
447447+448448+## Macros
449449+450450+Macros define reusable template subgraphs that are expanded inline at their call sites. The macro system supports parameterisation, variadic arguments, repetition blocks, constant arithmetic, and token pasting.
451451+452452+### Macro Definition
453453+454454+```dfasm
455455+#macro_name param1, param2, *variadic_param |> {
456456+ ; body — instructions and edges using ${param} substitution
457457+ &node <| add ${param1}
458458+ ${param1} |> &node:L
459459+ ${param2} |> &node:R
460460+}
461461+```
462462+463463+- Macro names use the `#` sigil
464464+- Parameters are declared before `|>`
465465+- Variadic parameters are prefixed with `*` and collect remaining arguments
466466+- The body contains standard dfasm statements with `${param}` placeholders
467467+468468+### Parameter Substitution
469469+470470+Within a macro body, `${name}` references are replaced with the actual argument values during expansion:
471471+472472+```dfasm
473473+#add_const val |> {
474474+ &adder <| add
475475+ &c <| const, ${val}
476476+ &c |> &adder:R
477477+}
478478+```
479479+480480+**Token pasting:** Parameters can be combined with literal text to synthesise unique names. The `${param}` reference within a label name produces a label that incorporates the argument value:
481481+482482+```dfasm
483483+#make_pair name |> {
484484+ &${name}_left <| pass
485485+ &${name}_right <| pass
486486+}
487487+```
488488+489489+### Repetition Blocks
490490+491491+The `$( ),*` syntax expands its body once per element of a variadic parameter. Within a repetition block, `${_idx}` provides the current iteration index (0-based):
492492+493493+```dfasm
494494+#fan_out *targets |> {
495495+ &src <| pass
496496+ $(
497497+ &src |> ${targets}
498498+ ),*
499499+}
500500+```
501501+502502+### Constant Arithmetic
503503+504504+Macro const fields support compile-time arithmetic with `+`, `-`, `*` on integer values and parameters:
505505+506506+```dfasm
507507+#indexed_read base, *cells |> {
508508+ $(
509509+ &r${_idx} <| read, ${base} + ${_idx}
510510+ ),*
511511+}
512512+```
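To make the arithmetic rules concrete, here is a hedged Python sketch of a restricted evaluator that accepts only `+`, `-`, `*`, integer literals, and bound names. It borrows Python's own `ast` parser for convenience and takes expressions with the `${...}` wrappers already stripped; the real expand pass works on dfasm's grammar, so treat `eval_const_expr` as a hypothetical illustration.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def eval_const_expr(expression: str, bindings: dict) -> int:
    """Evaluate +, -, * over integer literals and bound names."""

    def walk(node: ast.AST) -> int:
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name):
            return bindings[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        # Anything else (division, calls, unary ops, ...) is rejected.
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval").body)
```

With `base` bound to 10 and `_idx` to 2, the expression `base + _idx` evaluates to 12, which would be the cell address of the third read in the example above.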
513513+514514+### Macro Invocation
515515+516516+Macros are invoked as standalone statements:
517517+518518+```dfasm
519519+#loop_counted
520520+#fan_out &a:L, &b:R, &c:L
521521+#indexed_read 10, &dest1, &dest2, &dest3
522522+```
523523+524524+Arguments can be positional or named:
525525+526526+```dfasm
527527+#make_pair name=foo
528528+```
529529+530530+### Scoping
531531+532532+Expanded macro names are automatically qualified to prevent collisions between multiple invocations of the same macro:
533533+534534+- Top-level invocation: `#macro_N.&label` (N is the invocation counter)
535535+- Inside a function: `$func.#macro_N.&label`
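
The naming scheme above can be sketched as a small helper. This is an illustration only: the function name and the choice of starting value for the invocation counter are assumptions, not taken from `asm/expand.py`.

```python
from typing import Optional


def qualify_label(label: str, macro: str, invocation: int,
                  func: Optional[str] = None) -> str:
    """Build the scoped name for a label expanded from a macro invocation.

    Top level:         #macro_N.&label
    Inside a function: $func.#macro_N.&label
    """
    scoped = f"#{macro}_{invocation}.&{label}"
    return f"${func}.{scoped}" if func is not None else scoped


top = qualify_label("counter", "loop_counted", 0)
nested = qualify_label("counter", "loop_counted", 1, func="main")
```

Two invocations of the same macro therefore get distinct prefixes, which is what keeps their internal labels from colliding.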
536536+537537+### Built-in Macros
538538+539539+The following macros are automatically available in all programs:
540540+541541+| Macro | Purpose |
542542+|-------|---------|
543543+| `#loop_counted` | Counted loop: counter + compare + increment feedback loop |
544544+| `#loop_while` | Condition-tested loop: gate node for predicate-driven iteration |
545545+| `#permit_inject_N` | Inject N const(1) seed tokens (variants for N=1..4) |
546546+| `#reduce_add_N` | Binary reduction tree for addition (variants for N=2..4) |
547547+548548+Built-in macros expose well-known internal node names (e.g., `&counter`, `&compare`, `&gate`) that the user wires externally after invocation.
549549+550550+## Function Calls
551551+552552+Function calls wire argument values across context boundaries using the expand pass. The call syntax declares which arguments feed into the callee and where results flow back.
553553+554554+### Call Syntax
555555+556556+```dfasm
557557+$func_name arg1=&source1, arg2=&source2 |> @result
558558+```
559559+560560+The function must be defined as a `$name |> { ... }` region. Arguments are named (matching the function's parameter labels) or positional. Outputs after `|>` specify where results are routed.
561561+562562+Multiple outputs can be named:
563563+564564+```dfasm
565565+$divmod a=&dividend, b=&divisor |> @quotient, remainder=@remainder
566566+```
567567+568568+### What the Expand Pass Does
569569+570570+When processing a function call:
571571+572572+1. Allocates a fresh context slot for the callee activation
573573+2. Generates cross-context edges with `ctx_override=True` (becomes `ctx_mode=01` / CTX_OVRD in hardware)
574574+3. Creates trampoline `PASS` nodes for return routing
575575+4. Generates `FREE_CTX` nodes to clean up the callee's context on completion
576576+5. Synthesises `@ret` marker nodes for return paths
577577+578578+### Return Convention
579579+580580+The expand pass creates synthetic `@ret` (or `@ret_name` for named outputs) nodes as return markers. The callee's result edges are wired to these markers, which trampoline the results back to the caller's context.
581581+582582+### Example
583583+584584+```dfasm
585585+@system pe=2, sm=0
586586+587587+; Define a function that adds two values
588588+$add_pair |> {
589589+ &sum <| add
590590+}
591591+592592+; Call it
593593+&a <| const, 3
594594+&b <| const, 7
595595+$add_pair a=&a, b=&b |> @result
596596+@result <| pass
597597+```
598598+599599+After expansion, the assembler generates the cross-context wiring, trampoline nodes, and context cleanup automatically. The programmer does not need to manage context slots or return routing manually.
+27-9
design-notes/io-and-bootstrap.md
···4545mapping is configured at bootstrap (or hardwired for v0):
46464747```
4848-Example SM00 address map:
4949- 0x000 - 0x0FF: IO devices (tier 0, raw memory semantics)
5050- 0x100 - 0x1FF: Bootstrap ROM (tier 0, read-only)
5151- 0x200 - 0x3FF: General-purpose I-structure cells (tier 1)
4848+Example SM00 address map (indicative, not final):
4949+ 0x000 - 0x0FF: I-structure cells (tier 1, presence-tracked)
5050+ 0x100 - 0x1FF: IO devices (tier 0, raw memory semantics)
5151+ 0x200 - 0x3FF: Bootstrap ROM (tier 0, read-only)
5252```
5353+5454+**Tier boundary direction** is a hardware design decision, not yet
5555+finalised. The current design intent places I-structure cells at low
5656+addresses (below the boundary) where they are directly addressable by
5757+2-flit SM tokens and reachable by atomic operations (RD_INC, RD_DEC,
5858+CMP_SW — 5-bit opcode tier, 8-bit payload, 256-cell range). T0 raw
5959+storage sits above the boundary; it does not need atomics and can use
6060+extended addressing when necessary.
6161+6262+The emulator currently uses `tier_boundary=256` with T1 below and T0
6363+at/above, but the exact mapping may change during physical build.
6464+See `sm-design.md` for the full tier model and encoding details.
53655466Within the IO range, specific addresses map to device registers:
5567···95107 |
96108 [Address Decoder]
97109 |
9898- +---> addr < 0x100? --> [IO Device Registers]
9999- | |
100100- | +---> [UART chip (6850/16550/etc.)]
101101- | +---> [future: SPI, GPIO, timer]
110110+ +---> addr < tier_boundary? --> [SRAM Banks] (I-structure cells)
102111 |
103103- +---> addr >= 0x100? --> [SRAM Banks] (normal SM operation)
112112+ +---> addr >= tier_boundary?
113113+ |
114114+ +---> IO range? --> [IO Device Registers]
115115+ | |
116116+ | +---> [UART chip (6850/16550/etc.)]
117117+ | +---> [future: SPI, GPIO, timer]
118118+ |
119119+ +---> ROM range? --> [Bootstrap ROM]
120120+ |
121121+ +---> else --> [T0 raw SRAM]
104122```
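
The decode tree above can be sketched in a few lines. The range constants below are taken from the indicative address map earlier in this note together with the emulator's `tier_boundary=256`; they are placeholders rather than a final mapping, and the function and region names are hypothetical.

```python
TIER_BOUNDARY = 0x100            # emulator default: T1 below, T0 at/above
IO_RANGE = range(0x100, 0x200)   # indicative IO device window
ROM_RANGE = range(0x200, 0x400)  # indicative bootstrap ROM window


def decode(addr: int) -> str:
    """Classify an SM address following the decoder diagram."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    if addr < TIER_BOUNDARY:
        return "istructure"   # SRAM banks, presence-tracked cells
    if addr in IO_RANGE:
        return "io"           # IO device registers
    if addr in ROM_RANGE:
        return "rom"          # bootstrap ROM
    return "t0_raw"           # remaining T0 raw SRAM
```

Because the tier check comes first, moving the boundary only requires changing `TIER_BOUNDARY` and the two T0 windows.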
105123106124The IO device registers behave like SM cells from the SM controller's
+94-103
design-notes/loop-patterns-and-flow-control.md
···751751or a purpose-built assembler template system. No conditionals, no
752752recursion in the macro evaluator, no type system.
753753754754+> **Implementation status:** All five requirements below are now
755755+> implemented in the `asm/expand.py` pass. The implemented syntax
756756+> differs from the Rust-style examples in this section — see
757757+> `dfasm-primer.md` for the canonical syntax reference.
758758+754759#### 1. Named Variadic Repetition
755760756756-```
757757-$($arg = $src),*
758758-```
759759-760760-A comma-separated list of named pairs, expanded once per entry.
761761-Rust `macro_rules!` provides this directly.
761761+**Implemented.** Variadic parameters use `*name` in the macro
762762+definition. Repetition blocks use `$( ... ),*` syntax
763763+(`IRRepetitionBlock` in the IR). Each iteration binds the variadic
764764+parameter to the current element.
762765763766#### 2. Token Pasting (Label Synthesis)
764767765765-```
766766-&__${arg}_tag
767767-```
768768-769769-Concatenate a macro parameter into a label name. Produces unique
770770-labels per repetition entry. C has `##`, rust proc macros have
771771-`Ident::new()`, but `macro_rules!` does NOT have this natively —
772772-would need an assembler-specific extension or a `paste!`-style
773773-helper.
774774-775775-This is the single most important extension beyond stock
776776-`macro_rules!`. Without it, macros can't generate unique labels
777777-for per-arg instructions.
768768+**Implemented.** `ParamRef` supports `prefix` and `suffix` fields
769769+for token pasting. Within a macro body, `&${name}_tag` produces
770770+labels incorporating the parameter value. Unique labels per
771771+repetition entry are generated automatically.
778772779773#### 3. Implicit Repetition Index
780774781781-```
782782-${_idx}
783783-```
784784-785785-An auto-incrementing counter within a `$(...),*` expansion.
786786-Used for descriptor table offset arithmetic (`$desc + ${_idx} + 1`).
787787-Not available in rust `macro_rules!` — another assembler-specific
788788-extension. Alternative: require the programmer to pass explicit
789789-indices, which is ugly but functional.
775775+**Implemented.** `${_idx}` is bound to the current iteration index
776776+(0-based) within `$(...),*` expansion blocks. Used for descriptor
777777+table offset arithmetic and generating unique per-iteration names.
790778791779#### 4. Constant Arithmetic in Expressions
792780793793-```
794794-$desc + ${_idx} + 1
795795-```
796796-797797-Compile-time addition on constant/label expressions. The assembler
798798-already evaluates constant expressions for instruction operands, so
799799-the macro expander just needs to emit the expression text and let the
800800-normal evaluator handle it. No new evaluation capability needed.
781781+**Implemented.** `ConstExpr` in `ir.py` supports compile-time `+`,
782782+`-`, `*` on integer values and parameter references. Expressions like
783783+`${base} + ${_idx} + 1` are evaluated during macro expansion.
801784802785#### 5. Label Reference Across Macro Boundaries
803786804804-```
805805-&__alloc |> $func.__stub.ctx_in
806806-```
787787+**Implemented via scoped naming.** Expanded macro names are
788788+automatically qualified with scope prefixes (`#macroname_N.&label`
789789+for top-level, `$func.#macro_N.&label` inside functions). Cross-macro
790790+references use these qualified names. The expand pass also handles
791791+function call wiring via `CallSite` metadata, which generates
792792+cross-context edges and trampoline nodes automatically.
807793808808-The per-call-site macro needs to reference labels inside the
809809-per-function stub macro's expansion. This requires either:
810810-811811-- A naming convention that both macros agree on (fragile but simple)
812812-- The stub macro "exporting" label names via a known pattern
813813- (`$func.__stub.*`)
814814-- The assembler resolving qualified names across scopes
815815-816816-The naming convention approach is the most macro-friendly: the
817817-`call_stub` macro always emits labels named
818818-`&__${func}_ctx_fan`, `&__${func}_or_ret`, etc., and the
819819-`call_dyn` macro references them by constructing the same names.
820820-Both macros must agree. This is a social contract, not a type system.
821821-822822-#### What's NOT Needed
794794+#### What's NOT Needed (still true)
823795824796- **Conditional expansion** — different call shapes get different
825797 macros, not `if` inside a macro.
826826-- **Recursive macro expansion** — the fan-out PASS chain has a fixed
827827- structure per argument count. For N=1 it's one PASS with dual dest.
828828- For N=2 it's two PASSes. Rather than recursing, provide
829829- `call_stub_1`, `call_stub_2`, `call_stub_3` for common arities.
830830- Ugly, pragmatic, correct.
798798+- **Recursive macro expansion** — macros CAN invoke other macros
799799+ (nested expansion with depth limit of 32). Per-arity variants are
800800+ provided for the built-ins for simplicity, not due to a limitation.
831801- **Type checking** — the assembler validates after expansion (wrong
832802 arity, missing labels, offset overflow). The macro doesn't check.
833833-- **Hygiene** — label collisions between macro expansions ARE a risk.
834834- Mitigated by the `&__${func}_` prefix convention. If two functions
835835- have the same name, you have bigger problems.
803803+- **Hygiene** — scoped naming (`#macroname_N.&label`) prevents
804804+ collisions between multiple invocations of the same macro.
805805+806806+#### Built-in Macros
807807+808808+The assembler ships built-in macros (prepended to every program) that
809809+implement common patterns from this document:
810810+811811+| Macro | Pattern |
812812+|-------|---------|
813813+| `#loop_counted` | Counted loop (counter + compare + increment feedback) |
814814+| `#loop_while` | Condition-tested loop (gate node) |
815815+| `#permit_inject_N` | Permit injection (N=1..4 const seed tokens) |
816816+| `#reduce_add_N` | Binary reduction tree for addition (N=2..4) |
817817+818818+See `asm/builtins.py` for definitions.
836819837820#### Example Macro Definitions
838821839822```dfasm
840823; ── call stub for a 1-argument function ──
841824; emitted once per function, provides shared call infrastructure
842842-.macro call_stub_1 $func, $desc {
825825+#call_stub_1 func, desc |> {
843826 ; ctx fan-out (1 arg + 1 return = 2 consumers, one PASS suffices)
844827 &__${func}_ctx_fan <| pass
845828 &__${func}_ctx_fan |> &__${func}_or_ret:R, &__${func}_or_arg0:R
···858841}
859842860843; ── call stub for a 2-argument function ──
861861-.macro call_stub_2 $func, $desc {
844844+#call_stub_2 func, desc |> {
862845 ; ctx fan-out chain (3 consumers)
863846 &__${func}_ctx_fan0 <| pass
864847 &__${func}_ctx_fan1 <| pass
···882865}
883866884867; ── per-call-site (works for any arity) ──
885885-.macro call_dyn $func, $alloc, $call_seq, $ret_offset, $($arg = $src),* {
868868+#call_dyn func, alloc, call_seq, ret_offset, *args |> {
886869 ; allocate + trigger + return continuation
887887- &__call_alloc_${func} <| rd_inc, $alloc
888888- &__call_exec_${func} <| exec, $call_seq
889889- &__call_extag_${func} <| extract_tag, $ret_offset
870870+ &__call_alloc_${func} <| rd_inc, ${alloc}
871871+ &__call_exec_${func} <| exec, ${call_seq}
872872+ &__call_extag_${func} <| extract_tag, ${ret_offset}
890873891874 ; wire into stub
892875 &__call_alloc_${func} |> &__${func}_ctx_fan
893876 &__call_extag_${func} |> &__${func}_ct_ret:R
894877 $(
895895- $src |> &__${func}_ct_${arg}:R
878878+ ${args} |> &__${func}_ct_${args}:R
896879 ),*
897880}
898881```
···904887#call_stub_1 fib, @fib_desc
905888906889; at each call site
907907-#call_dyn fib, @ctx_alloc, @fib_call_seq, 20, n = &my_arg
890890+#call_dyn fib, @ctx_alloc, @fib_call_seq, 20, &my_arg
908891```
909892910910-The `call_stub_N` per-arity approach is admittedly clunky. A future
911911-macro system with proper counted repetition could unify them. For
912912-now, N=1 through N=3 covers the vast majority of functions, and
913913-anything beyond N=3 can be hand-written — it's the same pattern,
914914-just more of it.
893893+The `call_stub_N` per-arity approach is admittedly clunky. Macros can
894894+invoke other macros (nested expansion up to depth 32), so a unified
895895+variadic call_stub using a helper macro is possible. Per-arity variants
896896+are used here for clarity. N=1 through N=3 covers the vast majority
897897+of functions, and anything beyond N=3 can be hand-written — it's the
898898+same pattern, just more of it.
915899916916-### Permit Injection — Two Macros
900900+### Permit Injection — Two Approaches
917901918918-For small K (roughly K <= 4), inline CONST injection:
902902+For small K (roughly K <= 4), inline CONST injection. The built-in
903903+macros `#permit_inject_1` through `#permit_inject_4` provide this:
919904920905```dfasm
921921-; Macro definition:
922922-$permit_inject_inline K, &gate |> {
923923- ; expands to K const instructions, each targeting &gate:L
924924- ; each const needs its own trigger to fire
906906+; Built-in definition (from asm/builtins.py):
907907+#permit_inject_2 |> {
908908+ &p0 <| const, 1
909909+ &p1 <| const, 1
925910}
926911927927-; Usage: inject 3 permits into the gate
928928-#permit_inject_inline 3, &dispatch_gate
912912+; Usage: invoke the built-in, then wire outputs to the gate
913913+#permit_inject_2
914914+#permit_inject_2.&p0 |> &dispatch_gate:L
915915+#permit_inject_2.&p1 |> &dispatch_gate:L
929916```
930917931918For large K, use SM EXEC to batch-emit permits:
932919933920```dfasm
934934-; Macro definition:
935935-$permit_inject_exec K, &gate, @sm_base |> {
936936- ; expands to a single SM_EXEC reading K pre-formed permit
937937- ; tokens from SM starting at @sm_base, each addressed to &gate:L
921921+; Custom macro for EXEC-based injection:
922922+#permit_inject_exec count, sm_base |> {
923923+ ; a single SM EXEC reading ${count} pre-formed permit
924924+ ; tokens from SM starting at ${sm_base}
925925+ &exec <| exec, ${sm_base}
938926}
939927940928; Usage: inject 8 permits via EXEC
941941-#permit_inject_exec 8, &dispatch_gate, @permit_store
929929+#permit_inject_exec 8, @permit_store
942930```
943931944932Programmer chooses based on K. No magic.
945933946934### Loop Control Macro
947935936936+The built-in `#loop_counted` provides the core loop infrastructure:
937937+948938```dfasm
949949-$loop_counted &limit, &body, &exit |> {
950950- &counter <| const, 0
951951- &step <| inc
952952- &test <| lt
953953- &route <| sweq
954954-955955- &counter |> &step
956956- &step |> &test:L, &route:L ; fan-out: counter to both LT and SWITCH
957957- &limit |> &test:R
958958- &test |> &route:R ; bool from comparison → SWITCH control
959959- &route:L |> &body ; taken → body dispatch
960960- &route:R |> &exit ; not-taken → done
961961- &route:L |> &step ; feedback arc: counter recirculates
939939+; Built-in definition (from asm/builtins.py):
940940+#loop_counted |> {
941941+ &counter <| add
942942+ &compare <| brgt
943943+ &counter |> &compare:L
944944+ &inc <| inc
945945+ &compare |> &inc:L
946946+ &inc |> &counter:R
962947}
963948964964-; Usage:
965965-#loop_counted 64, &body_entry, &done
949949+; Usage: wire init, limit, body, and exit externally
950950+#loop_counted
951951+&c_init <| const, 0
952952+&c_limit <| const, 64
953953+&c_init |> #loop_counted.&counter:L ; initial counter value
954954+&c_limit |> #loop_counted.&compare:R ; loop bound
955955+#loop_counted.&compare:L |> &body_entry ; taken → body
956956+#loop_counted.&compare:R |> &done ; not-taken → exit
966957```
967958968959### Reduction Tree Macro
+43-9
design-notes/pe-design.md
···8989 - First operand: store in slot, advance to wait state
9090 - Second operand: read partner from slot, both proceed
9191 - Monadic (prefix 010 or 011+10): bypass matching, proceed directly
9292+ - **DyadToken at monadic instruction**: if a dyadic-format token
9393+ arrives at an instruction that is monadic (determined after IRAM
9494+ fetch), the matching store is bypassed and the token fires
9595+ immediately as a monadic operand (data from token, no partner).
9696+ This allows the compiler to send dyadic-format tokens to monadic
9797+ instructions without deadlocking the matching store.
9298 - Single cycle for all cases (no hash path, no CAM search —
9399 direct indexing only, see matching store section below)
94100 - Estimated: ~200-300 transistors + SRAM
···104110 network config writes — see "Instruction Memory" section below
105111106112Stage 4: EXECUTE
107107- - 16-bit ALU
108108- - ~1500-2000 transistors
113113+ - IRAM half 0 bit 15 selects CM compute (0) or SM operation (1)
114114+ - CM path: 16-bit ALU executes arithmetic/logic/comparison/routing
115115+ - SM path: ALU is bypassed; PE constructs an SMToken from the SM
116116+ opcode, operand data, and IRAM fields (SM_id, const address,
117117+ return routing). See `iram-and-function-calls.md` for SM IRAM
118118+ word format.
119119+ - ~1500-2000 transistors (ALU) + SM flit assembly mux (~2 chips)
109120110121Stage 5: TOKEN OUTPUT
111111- - Form result token with routing prefix (type, destination PE/SM,
112112- offset, context, etc.) using destination fields from IRAM
122122+ - CM path: form result CM token with routing prefix (type,
123123+ destination PE, offset, context, etc.) using destination fields
124124+ from IRAM. ctx_mode selects context source:
125125+ 00 = INHERIT (pipeline latches)
126126+ 01 = CTX_OVRD (const field overrides ctx/gen)
127127+ 10 = CHANGE_TAG (left operand becomes flit 1 verbatim)
128128+ See `iram-and-function-calls.md` for ctx_mode details.
129129+ - SM path: emit SMToken to target SM via SM bus routes
113130 - Pass to output serialiser for flit encoding and bus injection
114114- - ~300 transistors
131131+ - ~300 transistors + ctx/gen mux (~3 chips)
115132```
133133+134134+### Concurrency Model
136136+The PE is pipelined: multiple tokens can be in flight
137137+simultaneously at different stages. In the emulator, each token spawns
138138+a separate SimPy process that progresses through the pipeline
139139+independently. This models the hardware reality where Stage 2 can be
140140+matching a new token while Stage 4 is executing a previous one.
141141+142142+Cycle counts per token type:
143143+- **Dyadic**: 5 cycles (dequeue + match + fetch + execute + emit)
144144+- **Monadic**: 4 cycles (dequeue + fetch + execute + emit; match bypassed)
145145+- **IRAMWriteToken**: 2 cycles (dequeue + write; no match/fetch/execute/emit)
146146+- **Network delivery**: +1 cycle latency between emit and arrival at destination
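
As a back-of-envelope use of these numbers, the sketch below estimates the latency of a chain of dependent instructions, charging one network-delivery cycle between consecutive instructions. It ignores queueing and the wait for a dyadic partner operand, and the names are hypothetical rather than emulator API.

```python
# Per-token pipeline occupancy, from the cycle counts listed above.
PIPELINE_CYCLES = {"dyadic": 5, "monadic": 4, "iram_write": 2}
NETWORK_DELIVERY_CYCLES = 1  # emit -> arrival at the destination


def chain_latency(kinds: list[str]) -> int:
    """Cycles for a chain of dependent instructions, one hop between each."""
    if not kinds:
        return 0
    compute = sum(PIPELINE_CYCLES[kind] for kind in kinds)
    hops = (len(kinds) - 1) * NETWORK_DELIVERY_CYCLES
    return compute + hops


# A dyadic instruction feeding a monadic one: 5 + 1 + 4
latency = chain_latency(["dyadic", "monadic"])
```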
116147117148### Pipeline Register Widths
118149···355386just 32 bytes.
356387357388> **⚠ Preliminary:** These configurations are candidates, not commitments.
358358-> The emulator defaults to 16 context slots and 128 IRAM entries (matching
359359-> Config B sizing). Final sizing depends on compiling real programs and
360360-> measuring actual concurrent activation counts and dyadic instruction
361361-> density per PE.
389389+> The emulator defaults to 16 context slots and 128 IRAM entries. Note
390390+> that IRAM entries (128) and matching store entries per context (32 in
391391+> Config B) are distinct: the matching store only holds dyadic
392392+> instructions at offsets 0-31, while IRAM holds all instructions
393393+> (dyadic + monadic) across the full 128-slot range. Final sizing
394394+> depends on compiling real programs and measuring actual concurrent
395395+> activation counts and dyadic instruction density per PE.
362396363397**Recommendation for v0**: Config B (16 slots x 32 entries = 1KB). This
364398matches the token format directly: 4-bit ctx from flit 1, 5-bit offset
+3-4
design-notes/sm-design.md
···341341342342'aa' = address bits (part of 10-bit address).
343343344344-**Tier 1 ops (3-bit, full address range):** READ, WRITE, ALLOC, FREE,
344344+**Tier 1 ops (3-bit, full address range):** READ, WRITE, ALLOC, EXEC,
345345CLEAR reach the full 1024-cell address space. EXT signals a 3-flit
346346token for extended addressing (external RAM, wide addresses).
347347348348**Tier 2 ops (5-bit, restricted address/payload):** atomic operations
349349-(READ_INC, READ_DEC, CAS) are restricted to 256 cells — the compiler
350350-places atomic-access cells in the lower range. EXEC, SET_PAGE, and
349349+(READ_INC, READ_DEC, CAS) as well as CLEAR are restricted to 256 cells — the compiler
350350+places atomic-access cells in the lower range. SET_PAGE and
351351WRITE_IMM use the 8-bit payload field for non-address data (length,
352352page register value, small immediate).
353353···356356operand. The 8-bit payload in the restricted tier can be inline data,
357357config values, or range counts depending on the opcode:
358358359359-- EXEC: payload = length/count (base addr in config register)
360359- SET_PAGE: payload = page register value
361360- WRITE_IMM: 8-bit addr, flit 2 carries immediate data
362361- RAW_READ: non-blocking read, returns data or empty indicator without
+575
docs/macro-enhancements.md
···11+# Macro Enhancements: Opcode Parameters, Qualified Ref Parameters, and @ret Wiring
22+33+Extends the dfasm macro system with three capabilities that reduce the need for per-variant macro definitions and make macros composable in the same way as functions.
44+55+## Current State
66+77+The macro system (implemented in `asm/expand.py`, grammar in `dfasm.lark`) supports:
88+99+- Parameter substitution in node names via `${param}` (token pasting with prefix/suffix)
1010+- Parameter substitution in edge endpoints via `${param}` in `qualified_ref`
1111+- Parameter substitution in const fields
1212+- Compile-time arithmetic via `ConstExpr` (`${base} + ${_idx} + 1`)
1313+- Variadic parameters with `$( ... ),*` repetition blocks
1414+- Nested macro invocation (depth limit 32)
1515+1616+Three gaps remain:
1717+1818+1. **Opcode position is not parameterizable.** The grammar defines `opcode: OPCODE` as a keyword terminal. You cannot pass an opcode as a macro argument. This forces per-opcode variants: `#reduce_add_2`, `#reduce_add_3`, etc.
1919+2020+2. **Placement and port qualifiers are not parameterizable.** The grammar defines `placement: "|" IDENT` and `port: ":" PORT_SPEC` — neither accepts `param_ref`. You cannot write `&ref:${port}` or `&ref|${pe}` in a macro body to parameterize which port or PE a reference targets.
2121+2222+3. **Macros have no output wiring convention.** Functions use `@ret` / `@ret_name` markers in their body, and the call syntax `$func args |> outputs` auto-wires return paths. Macros have no equivalent — the user must manually wire to expanded internal node names after invocation.
2323+2424+## Enhancement 1: Opcode Parameters
2525+2626+### Goal
2727+2828+Allow macro parameters to appear in the opcode position of `inst_def`, `strong_edge`, and `weak_edge` rules.
2929+3030+### Grammar Change
3131+3232+```lark
3333+// Current:
3434+opcode: OPCODE
3535+3636+// Proposed:
3737+opcode: OPCODE | param_ref
3838+```
3939+4040+This is the only grammar change needed. `param_ref` (`${name}`) is already a valid production. Earley parsing handles the ambiguity.
4141+4242+### Lower Pass
4343+4444+The `inst_def` handler in `lower.py` currently calls `self._resolve_opcode()` which maps mnemonic strings to `ALUOp`/`MemOp` values. When the opcode is a `ParamRef`, lowering must defer resolution — store the `ParamRef` on the `IRNode` in a new field (or overload the `opcode` field's type to `Union[ALUOp, MemOp, ParamRef]`).
4545+4646+The `strong_edge` and `weak_edge` handlers need the same treatment: if the opcode token is a `ParamRef`, create the anonymous node with a deferred opcode.
4747+4848+### Expand Pass
4949+5050+During `_clone_and_substitute_node`, if `node.opcode` is a `ParamRef`:
5151+5252+1. Look up the parameter in the substitution map
5353+2. The argument value must be a string matching a known opcode mnemonic
5454+3. Resolve via `MNEMONIC_TO_OP` to get the concrete `ALUOp`/`MemOp`
5555+4. Replace the node's opcode with the resolved value
5656+5. Error if the argument is not a valid opcode mnemonic
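
The five steps above can be sketched as follows. The `ALUOp` enum, the two-entry `MNEMONIC_TO_OP` table, the single-field `ParamRef`, and `MacroError` are simplified stand-ins defined here so the example is self-contained; the real types live in the assembler and will differ.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ALUOp(Enum):  # stand-in for the real opcode enum
    ADD = auto()
    SUB = auto()


MNEMONIC_TO_OP = {"add": ALUOp.ADD, "sub": ALUOp.SUB}  # stand-in table


@dataclass(frozen=True)
class ParamRef:  # stand-in: only the parameter name is modelled
    name: str


class MacroError(Exception):
    """Hypothetical error type for expansion failures."""


def resolve_opcode(opcode, subst_map):
    """Return a concrete opcode, resolving a deferred ParamRef if present."""
    if not isinstance(opcode, ParamRef):
        return opcode  # already concrete, nothing to do
    if opcode.name not in subst_map:
        raise MacroError(f"unbound macro parameter: {opcode.name}")
    mnemonic = subst_map[opcode.name]
    if not isinstance(mnemonic, str) or mnemonic not in MNEMONIC_TO_OP:
        raise MacroError(f"not a valid opcode mnemonic: {mnemonic!r}")
    return MNEMONIC_TO_OP[mnemonic]
```

With this shape, `#reduce_2 add` would bind `op` to `"add"` and the node's deferred opcode would resolve to the concrete add operation during cloning.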
5757+5858+### Validation
5959+6060+Opcode validation (monadic/dyadic arity, valid argument combinations) already happens after expansion in the resolve and allocate passes. No additional validation needed at expansion time beyond confirming the mnemonic exists.
6161+6262+### Example
6363+6464+Before (current — per-opcode variants):
6565+6666+```
6767+#reduce_add_2 |> { &r <| add }
6868+#reduce_add_3 |> { &r0 <| add; &r1 <| add; &r0 |> &r1:L }
6969+#reduce_sub_2 |> { &r <| sub }
7070+; ... N variants per opcode
7171+```
7272+7373+After (parameterized):
7474+7575+```
7676+#reduce_2 op |> {
7777+ &r <| ${op}
7878+}
7979+8080+#reduce_3 op |> {
8181+ &r0 <| ${op}
8282+ &r1 <| ${op}
8383+ &r0 |> &r1:L
8484+}
8585+8686+; Usage:
8787+#reduce_2 add
8888+#reduce_3 sub
8989+```
9090+9191+### Argument Syntax
9292+9393+Opcode arguments are passed as bare identifiers in the macro call. The grammar for `macro_call_stmt` already accepts `argument`, which includes `qualified_ref`, but a bare opcode name does not parse as one: an unqualified `add` in argument position lexes as the `OPCODE` terminal (priority 2), not as `IDENT`, and `OPCODE` is not a valid `argument`.
9494+9595+Two options:
9696+9797+**Option A: Quote opcode arguments.** Pass as string literals: `#reduce_2 "add"`. Simple, unambiguous. Expand pass strips quotes and resolves. Slightly ugly.
9898+9999+**Option B: Accept OPCODE as a macro argument.** Add `OPCODE` as an alternative in `positional_arg`:
100100+101101+```lark
102102+// Current:
103103+?positional_arg: value | qualified_ref
104104+105105+// Proposed:
106106+?positional_arg: value | qualified_ref | OPCODE
107107+```
108108+109109+The lower pass wraps the bare opcode token as a string argument in the `IRMacroCall`. Expand resolves it against `MNEMONIC_TO_OP`. This reads naturally: `#reduce_2 add`.
110110+111111+Option B is cleaner. The only risk is if someone has an `IDENT` that collides with an opcode name as a label/node, but the priority system already handles that (opcodes win at lexer level), and this collision already exists in the language.
112112+113113+**Recommendation: Option B.**
114114+115115+116116+## Enhancement 2: Parameterized Placement and Port Qualifiers
117117+118118+### Goal
119119+120120+Allow `${param}` in the placement (`|pe0`) and port (`:L`) positions of a `qualified_ref`, so macros can parameterize which PE a node targets, which port an edge uses, and (when exposed) which context slot to use.
121121+122122+### Current State
123123+124124+`qualified_ref` is built from three parts:
125125+126126+```lark
127127+qualified_ref: (node_ref | label_ref | ... | param_ref) placement? port?
128128+placement: "|" IDENT
129129+port: ":" PORT_SPEC
130130+PORT_SPEC: IDENT | HEX_LIT | DEC_LIT
131131+```
132132+133133+`${param}` can already stand in for the entire ref part (the first element). But the `placement` and `port` suffixes only accept literal tokens. So `&node:${port}` and `&node|${pe}` don't parse.
134134+135135+In the lower pass, `qualified_ref` collects its children into a dict:
136136+- The ref part becomes `{"name": ...}`
137137+- `placement` returns a string (e.g., `"pe0"`)
138138+- `port` returns a `Port` enum (`Port.L`, `Port.R`) or raw `int`
139139+140140+In the IR, `IRNode.pe` stores placement as `Optional[int]`, and `IREdge.port`/`IREdge.source_port` store port as `Port`. Neither field currently accepts `ParamRef`.
141141+142142+The expand pass (`_clone_and_substitute_node`, `_clone_and_substitute_edge`) only substitutes `name`, `const`, `source`, and `dest`. It does not touch `pe`, `port`, or `source_port`.
143143+144144+### Grammar Changes
145145+146146+```lark
147147+// Current:
148148+placement: "|" IDENT
149149+port: ":" PORT_SPEC
150150+151151+// Proposed:
152152+placement: "|" (IDENT | param_ref)
153153+port: ":" (PORT_SPEC | param_ref)
154154+```
155155+156156+### Lower Pass
157157+158158+The `placement` handler currently does `return str(token)`. It needs to handle receiving a `ParamRef` from the parser and return it as-is:
159159+160160+```python
161161+def placement(self, *args):
162162+ for arg in args:
163163+ if isinstance(arg, ParamRef):
164164+ return arg
165165+ return str(args[-1])
166166+```
167167+168168+Similarly, the `port` handler needs to pass through `ParamRef` instead of resolving to `Port`:
169169+170170+```python
171171+def port(self, *args):
172172+ for arg in args:
173173+ if isinstance(arg, ParamRef):
174174+ return arg
175175+ # ... existing Port.L / Port.R / int resolution
176176+```
177177+178178+The `qualified_ref` handler already iterates over args by type. It needs a new branch to detect `ParamRef` in placement/port positions (currently it only detects `ParamRef` in the ref-name position). The disambiguation is based on ordering: the ref-name comes first, placement second (prefixed with `|`), port third (prefixed with `:`). Since Lark processes them through their respective rules before `qualified_ref` sees them, the parser distinguishes them. The `qualified_ref` handler just needs to accept `ParamRef` for placement and port:
179179+180180+```python
181181+def qualified_ref(self, *args):
182182+ ref_type = None
183183+ placement = None
184184+ port = None
185185+ for arg in args:
186186+ if isinstance(arg, ParamRef) and ref_type is None:
187187+ ref_type = {"name": arg}
188188+ elif isinstance(arg, ParamRef) and ref_type is not None:
189189+ # Second or third ParamRef — depends on position
190190+ # But Lark gives us placement/port through their handlers,
191191+ # so we get ParamRef from the placement() or port() handler.
192192+ # Need to distinguish: placement handler adds a marker or
193193+ # we rely on Lark's rule names.
194194+ ...
195195+```
196196+197197+Actually, this is simpler than it looks. Lark calls `placement()` and `port()` before `qualified_ref()`. So `qualified_ref` receives:
198198+- A dict or `ParamRef` (from the ref-name rules)
199199+- A string or `ParamRef` (from the `placement` handler)
200200+- A `Port`/`int` or `ParamRef` (from the `port` handler)
201201+202202+The existing type-based dispatch in `qualified_ref` needs one addition: if an arg is `ParamRef` and `ref_type` is already set, it's either placement or port. We can distinguish by wrapping them — the placement handler returns `("placement", ParamRef(...))` and port returns `("port", ParamRef(...))` when deferring. Or simpler: use a thin wrapper type.
203203+204204+Alternatively, Lark's `@v_args(inline=True)` on placement/port means the handler already knows which rule matched. The cleanest approach: return a `ParamRef` tagged with its role:
205205+206206+```python
207207+@dataclass(frozen=True)
208208+class PlacementRef:
209209+ """Deferred placement from macro parameter."""
210210+ param: ParamRef
211211+212212+@dataclass(frozen=True)
213213+class PortRef:
214214+ """Deferred port from macro parameter."""
215215+ param: ParamRef
216216+```
217217+218218+Then `qualified_ref` type-dispatches on `PlacementRef`/`PortRef` alongside `str`/`Port`/`int`.
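
A minimal sketch of that dispatch, assuming stand-in definitions for `ParamRef` and `Port` (the real classes live elsewhere in the project, and the `param: str` field on `ParamRef` is hypothetical). It only illustrates that the wrapper types let `qualified_ref` tell placement from port without looking at argument order:

```python
from dataclasses import dataclass
from enum import Enum


class Port(Enum):  # stand-in for the project's Port type (assumption)
    L = 0
    R = 1


@dataclass(frozen=True)
class ParamRef:  # stand-in; the field name is hypothetical
    param: str


@dataclass(frozen=True)
class PlacementRef:
    """Deferred placement from macro parameter."""
    param: ParamRef


@dataclass(frozen=True)
class PortRef:
    """Deferred port from macro parameter."""
    param: ParamRef


def qualified_ref(*args):
    """Collect ref name, placement, and port purely by argument type."""
    ref_type = None
    placement = None
    port = None
    for arg in args:
        if isinstance(arg, (dict, ParamRef)) and ref_type is None:
            # Ref-name position: a dict from the ref rules, or a bare ParamRef.
            ref_type = arg if isinstance(arg, dict) else {"name": arg}
        elif isinstance(arg, PlacementRef):
            placement = arg.param  # deferred; resolved during expansion
        elif isinstance(arg, str):
            placement = arg        # concrete placement such as "pe0"
        elif isinstance(arg, PortRef):
            port = arg.param       # deferred; resolved during expansion
        elif isinstance(arg, (Port, int)):
            port = arg             # concrete port
    return {"ref": ref_type, "placement": placement, "port": port}
```

Unwrapping to the inner `ParamRef` here keeps the IR types described in the next section unchanged; storing the wrappers themselves in the IR would be an equally plausible choice.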
219219+220220+### IR Changes
221221+222222+`IRNode.pe` type becomes `Optional[Union[int, ParamRef]]`.
223223+224224+`IREdge.port` type becomes `Union[Port, ParamRef]`.
225225+226226+`IREdge.source_port` type becomes `Optional[Union[Port, ParamRef]]`.
227227+228228+These wider types only appear in macro template bodies. After expansion, all `ParamRef` values are resolved to concrete types. The resolve, place, and allocate passes never see `ParamRef` — if one leaks through, it's a bug in expand.
229229+230230+### Expand Pass
231231+232232+`_clone_and_substitute_node` gains:
233233+234234+```python
235235+# Substitute PE placement if it's a ParamRef
236236+new_pe = node.pe
237237+if isinstance(new_pe, ParamRef):
238238+ resolved = _substitute_param(new_pe, subst_map)
239239+ # Must resolve to a PE identifier string like "pe0" or an int
240240+ new_pe = _resolve_pe_placement(resolved) # parse "pe0" -> 0, or int -> int
241241+```
242242+243243+`_clone_and_substitute_edge` gains:
244244+245245+```python
246246+# Substitute port if it's a ParamRef
247247+new_port = edge.port
248248+if isinstance(new_port, ParamRef):
249249+ resolved = _substitute_param(new_port, subst_map)
250250+ new_port = _resolve_port(resolved) # "L" -> Port.L, "R" -> Port.R, int -> int
251251+252252+new_source_port = edge.source_port
253253+if isinstance(new_source_port, ParamRef):
254254+ resolved = _substitute_param(new_source_port, subst_map)
255255+ new_source_port = _resolve_port(resolved)
256256+```
257257+258258+### Validation
259259+260260+Invalid port/placement values (e.g., passing `"banana"` as a port) produce a MACRO error during expansion. Post-expansion, the existing place and allocate passes validate that PE IDs are in range and ports are valid.
261261+262262+### Examples
263263+264264+Parameterized port selection:
265265+266266+```
267267+; Macro that wires to a caller-selected port
268268+#wire_to_port target, port |> {
269269+ &src <| pass
270270+ &src |> ${target}:${port}
271271+}
272272+273273+; Usage: wire to left port
274274+#wire_to_port &dest, L
275275+276276+; Usage: wire to right port
277277+#wire_to_port &dest, R
278278+```
279279+280280+Parameterized PE placement:
281281+282282+```
283283+; Macro that places its node on a specific PE
284284+#placed_const val, pe |> {
285285+ &c <| const, ${val} |${pe}
286286+ &c |> @ret
287287+}
288288+289289+; Usage: place on pe0
290290+#placed_const 42, pe0 |> &target
291291+292292+; Usage: place on pe1
293293+#placed_const 42, pe1 |> &target
294294+```
295295+296296+Combined — a macro that builds a cross-PE relay:
297297+298298+```
299299+; Route a value from one PE to another
300300+#cross_pe_relay src_pe, dst_pe |> {
301301+ &hop <| pass |${src_pe}
302302+ &hop |> @ret
303303+}
304304+305305+; Usage:
306306+#cross_pe_relay pe0, pe1 |> &destination
307307+```
308308+309309+### Context Slot Syntax
310310+311311+Context slots use bracket syntax `[N]`, distinct from all other qualifiers:
312312+313313+```
314314+&node|pe0[2] ; place on pe0, context slot 2
315315+&node[0] ; context slot 0, auto-placed PE
316316+&node|pe1[0..4] ; reserve context slots 0-4 for this instruction
317317+```
318318+319319+The bracket syntax avoids overloading `:` (which already carries port, cell address, and potentially IRAM address semantics). `[N]` is exclusively context slots.
320320+321321+#### Grammar
322322+323323+```lark
324324+// New production:
325325+ctx_slot: "[" (DEC_LIT | ctx_range | param_ref) "]"
326326+ctx_range: DEC_LIT ".." DEC_LIT
327327+328328+// Updated qualified_ref:
329329+qualified_ref: (node_ref | label_ref | ... | param_ref) placement? ctx_slot? port?
330330+```
331331+332332+`ctx_slot` appears between placement and port in the qualifier chain: `&node|pe0[2]:L`.
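
To make the qualifier ordering concrete, here is a small regex-based sketch. It is not the Lark grammar: it handles only `&node` references, ignores `param_ref`, and the function name `split_qualified_ref` is hypothetical. It accepts placement, then context slot, then port, and rejects any other order:

```python
import re

# ref, then optional |placement, then optional [N] or [N..M], then optional :port
_QUAL_RE = re.compile(
    r"(?P<ref>&\w+)"
    r"(?:\|(?P<placement>\w+))?"
    r"(?:\[(?P<lo>\d+)(?:\.\.(?P<hi>\d+))?\])?"
    r"(?::(?P<port>\w+))?"
)


def split_qualified_ref(text):
    """Split a qualified node ref into (ref, placement, ctx_slot, port)."""
    match = _QUAL_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a qualified ref: {text!r}")
    lo, hi = match.group("lo"), match.group("hi")
    if lo is None:
        ctx = None                 # no context slot given
    elif hi is None:
        ctx = int(lo)              # single slot: [N]
    else:
        ctx = (int(lo), int(hi))   # range bounds as written: [N..M]
    return match.group("ref"), match.group("placement"), ctx, match.group("port")
```

For instance, `split_qualified_ref("&node|pe0[2]:L")` yields the four parts in order, while `"&node:L[2]"` raises `ValueError` because the slot follows the port.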
333333+334334+#### Use Cases
335335+336336+- **Explicit context partitioning**: place parallel computations in distinct context slots to avoid matching store collisions
337337+- **Debugging**: force a known context layout for inspection
338338+- **Range reservation** (`[0..4]`): reserve a contiguous block of slots for an instruction that will be targeted by multiple parallel sources wired identically — not essential but a natural extension
339339+340340+#### Parameterization
341341+342342+Same mechanism as placement/port. `[${ctx}]` in a macro body, substituted to an integer during expansion:
343343+344344+```
345345+#placed_op op, pe, ctx |> {
346346+ &n <| ${op} |${pe}[${ctx}]
347347+ &n |> @ret
348348+}
349349+350350+; Usage:
351351+#placed_op add, pe0, 2 |> &target
352352+```
353353+354354+355355+## Enhancement 3: @ret Wiring for Macros
356356+357357+### Goal
358358+359359+Allow macros to define output points using `@ret` / `@ret_name` markers, and wire them to destinations at the call site using the `|>` syntax.
360360+361361+### Grammar Change
362362+363363+Add optional output list to `macro_call_stmt`:
364364+365365+```lark
366366+// Current:
367367+macro_call_stmt: "#" IDENT (argument ("," argument)*)?
368368+369369+// Proposed:
370370+macro_call_stmt: "#" IDENT (argument ("," argument)*)? (FLOW_OUT call_output_list)?
371371+```
372372+373373+This reuses the existing `call_output_list` and `call_output` productions from `call_stmt`. Same syntax: `#macro args |> &dest` or `#macro args |> name=&dest`.
374374+375375+### Macro Body Convention
376376+377377+Macro bodies use `@ret` and `@ret_name` in edge destinations, same as function bodies:
378378+379379+```
380380+#loop op, init_val |> {
381381+ &counter <| add
382382+ &compare <| ${op}
383383+ &counter |> &compare:L
384384+ &inc <| inc
385385+ &compare |> &inc:L
386386+ &inc |> &counter:R
387387+ ; Output edges use @ret convention
388388+ &compare |> @ret_body
389389+ &compare |> @ret_exit:R
390390+}
391391+```
392392+393393+### Lower Pass
394394+395395+When lowering `macro_call_stmt` with a `FLOW_OUT` and `call_output_list`:
396396+397397+1. Parse the output list the same way `call_stmt` does (named/positional outputs)
398398+2. Store output destinations on the `IRMacroCall` in a new field: `output_dests: tuple`
399399+400400+The `IRMacroCall` dataclass gains:
401401+402402+```python
403403+@dataclass(frozen=True)
404404+class IRMacroCall:
405405+ name: str
406406+ positional_args: tuple
407407+ named_args: tuple
408408+ output_dests: tuple = () # New: output wiring destinations
409409+ loc: Optional[SourceLoc] = None
410410+```
411411+412412+### Expand Pass
413413+414414+After cloning and substituting the macro body, process `@ret` markers:
415415+416416+1. Scan expanded edges for destinations starting with `@ret`
417417+2. For each `@ret` / `@ret_name` destination, look up the corresponding output from `IRMacroCall.output_dests`
418418+3. Replace the `@ret*` destination with the actual target node name
419419+4. If a `@ret*` marker has no matching output dest, report a MACRO error
420420+421421+This is simpler than function call wiring because macros don't need:
422422+- Trampoline nodes (no cross-context routing)
423423+- `ctx_override` edges (macros inline into the caller's context)
424424+- `FREE_CTX` nodes (no context allocation)
425425+- Synthetic PASS nodes (direct edge replacement suffices)
426426+427427+The `@ret` substitution in macros is purely edge rewriting — replace the symbolic `@ret_name` destination with the concrete node reference from the call site.
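
The edge rewriting could look roughly like the sketch below. Everything here is illustrative: `Edge` is a simplified stand-in for `IREdge`, `MacroError` is a hypothetical exception for MACRO-category errors, and `output_dests` is assumed to be a tuple of `(name_or_None, target)` pairs, which the design above does not specify.

```python
from dataclasses import dataclass, replace
from typing import Optional


class MacroError(Exception):
    """Hypothetical stand-in for an error in the MACRO category."""


@dataclass(frozen=True)
class Edge:  # simplified stand-in for IREdge
    source: str
    dest: str
    port: Optional[str] = None


def rewrite_ret_edges(edges, output_dests, macro_name):
    """Replace @ret / @ret_name destinations with call-site targets."""
    positional = [target for name, target in output_dests if name is None]
    named = {name: target for name, target in output_dests if name is not None}
    rewritten = []
    for edge in edges:
        if not edge.dest.startswith("@ret"):
            rewritten.append(edge)  # ordinary internal edge, left untouched
            continue
        suffix = edge.dest[len("@ret"):]
        if suffix == "":
            # Bare @ret maps to the first positional output.
            target = positional[0] if positional else None
        elif suffix.startswith("_"):
            target = named.get(suffix[1:])
        else:
            target = None
        if target is None:
            raise MacroError(
                f"macro '#{macro_name}': no output destination for {edge.dest}"
            )
        # Keep source and port; only the destination changes.
        rewritten.append(replace(edge, dest=target))
    return rewritten
```

Because the port is carried over unchanged, an edge such as `&compare |> @ret_exit:R` still lands on the right port of whatever node the call site names.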
428428+429429+### Positional @ret Mapping
430430+431431+Same convention as function calls:
432432+433433+- Bare `@ret` maps to the first (or only) positional output
434434+- `@ret_name` maps to the named output `name=&dest`
435435+- Multiple bare `@ret` edges to different ports on the same output are valid
436436+437437+### Example
438438+439439+```
440440+; Define macro with outputs
441441+#loop_counted |> {
442442+ &counter <| add
443443+ &compare <| brgt
444444+ &counter |> &compare:L
445445+ &inc <| inc
446446+ &compare |> &inc:L
447447+ &inc |> &counter:R
448448+ &compare |> @ret_body
449449+ &compare |> @ret_exit:R
450450+}
451451+452452+; Invoke with output wiring
453453+#loop_counted |> body=&process, exit=&done
454454+&init |> #loop_counted_0.&counter:L
455455+&limit |> #loop_counted_0.&compare:R
456456+```
457457+458458+Or positionally:
459459+460460+```
461461+#simple_gate |> {
462462+ &g <| gate
463463+ &g |> @ret
464464+ &g |> @ret:R ; second output port
465465+}
466466+467467+; Invoke — positional @ret maps to first output
468468+#simple_gate |> &body, &exit
469469+```
470470+471471+472472+## Impact on Built-in Macros
473473+474474+With these enhancements, the built-in library collapses significantly:
475475+476476+### Current (11 macros)
477477+478478+```
479479+#loop_counted, #loop_while
480480+#permit_inject_1, #permit_inject_2, #permit_inject_3, #permit_inject_4
481481+#reduce_add_2, #reduce_add_3, #reduce_add_4
482482+```
483483+484484+### Proposed (6 macros, more capable)
485485+486486+```
487487+; Counted loop with output wiring
488488+#loop_counted |> {
489489+ &counter <| add
490490+ &compare <| brgt
491491+ &counter |> &compare:L
492492+ &inc <| inc
493493+ &compare |> &inc:L
494494+ &inc |> &counter:R
495495+ &compare |> @ret_body
496496+ &compare |> @ret_exit:R
497497+}
498498+499499+; Condition-tested loop
500500+#loop_while |> {
501501+ &gate <| gate
502502+ &gate |> @ret_body
503503+ &gate |> @ret_exit:R
504504+}
505505+506506+; Permit injection — variadic, outputs via @ret
507507+#permit_inject *nodes |> {
508508+ $(
509509+ &p_${_idx} <| const, 1
510510+ &p_${_idx} |> @ret
511511+ ),*
512512+}
513513+514514+; Binary reduction tree: parameterized opcode, one macro per arity
515515+#reduce_2 op |> {
516516+ &r <| ${op}
517517+}
518518+519519+#reduce_3 op |> {
520520+ &r0 <| ${op}
521521+ &r1 <| ${op}
522522+ &r0 |> &r1:L
523523+}
524524+525525+#reduce_4 op |> {
526526+ &r0 <| ${op}
527527+ &r1 <| ${op}
528528+ &r2 <| ${op}
529529+ &r0 |> &r2:L
530530+ &r1 |> &r2:R
531531+}
532532+```
533533+534534+Usage:
535535+536536+```
537537+; Old:
538538+!#loop_counted
539539+&init |> #loop_counted_0.&counter:L
540540+&limit |> #loop_counted_0.&compare:R
541541+#loop_counted_0.&compare |> &body:L
542542+#loop_counted_0.&compare |> &exit:R
543543+544544+; New:
545545+#loop_counted |> body=&process, exit=&done
546546+&init |> #loop_counted_0.&counter:L
547547+&limit |> #loop_counted_0.&compare:R
548548+549549+; Old:
550550+!#reduce_add_4
551551+552552+; New:
553553+#reduce_4 add
554554+```
555555+556556+Note: the `#permit_inject` example with variadic `@ret` is aspirational — it requires `@ret` to work inside repetition blocks, which means the `@ret` substitution must happen after repetition expansion. This ordering is already correct since repetition expansion happens before edge rewriting in the expand pass.
557557+558558+559559+## Implementation Order
560560+561561+1. **Opcode parameters** — grammar change (`opcode: OPCODE | param_ref`), argument syntax (`positional_arg: ... | OPCODE`), expand pass substitution. Smallest diff, immediately useful.
562562+563563+2. **Qualified ref parameters** — grammar changes to `placement` and `port`, `PlacementRef`/`PortRef` wrapper types, IR type widening, expand pass substitution. Mechanically similar to opcode params, builds on the same `_substitute_param` infrastructure.
564564+565565+3. **@ret wiring for macros** — grammar change (output list on `macro_call_stmt`), `IRMacroCall.output_dests`, expand pass edge rewriting. Builds on existing `@ret` patterns from function calls.
566566+567567+4. **Built-in macro rewrite** — collapse per-variant macros using the new features. Backwards-incompatible (old macro names removed), but since the built-ins are bundled and the system is pre-1.0, this is acceptable.
568568+569569+## Open Questions
570570+571571+1. **Should macros with `@ret` also support `|>` on inputs?** Function calls use `$func a=&x |> @output`. Currently macro calls use `#macro arg1, arg2` for inputs. Adding `|>` for outputs is proposed above. Should inputs also support named wiring? Probably not needed — macros already have `${param}` for inputs, and the input wiring is fundamentally different (parameter substitution vs edge creation).
572572+573573+2. **Error messages for mismatched @ret counts.** If a macro body has `@ret_body` and `@ret_exit` but the call site only provides one output, what error? Probably MACRO category: "macro '#loop_counted' defines outputs @ret_body, @ret_exit but call provides 1 output".
574574+575575+3. **Interaction with nested macros.** If macro A calls macro B which has `@ret`, and A also has `@ret`, the scoping should work naturally — B's `@ret` resolves at B's call site (inside A's body), A's `@ret` resolves at A's call site. The existing scope qualification prevents name collisions.