OR-1 dataflow CPU sketch

updating docs

Orual 281148e3 c68f3ea8

+1094 -231
.deciduous/deciduous.db

This is a binary file and will not be displayed.

+1
.gitignore
··· 20 20 # ExoMonad - track config, ignore runtime artifacts 21 21 .exo/* 22 22 !.exo/config.toml 23 + .deciduous
+8 -11
asm/builtins.py
··· 3 3 These macros are automatically available in all dfasm programs. 4 4 The BUILTIN_MACROS string is prepended to user source before parsing. 5 5 6 - Note: The current grammar does not support referencing macro parameters 7 - in edge endpoints (bare identifiers aren't valid qualified_ref). Therefore 8 - all built-in macros are self-contained: they define their own internal 9 - topology and expose well-known internal node names for the user to wire to. 6 + Macro parameters can appear in edge endpoints via ${param} syntax. 7 + The expand pass handles parameter refs in source/dest positions, 8 + correctly skipping scope qualification for external references. 10 9 11 - For example, #loop_counted expands to nodes named &counter (add), &compare 12 - (brgt), and &inc (inc) with the internal feedback loop pre-wired. The user 13 - wires init/limit/body/exit externally after invoking the macro. 10 + The opcode position cannot be parameterized (grammar constraint: 11 + opcode is a keyword terminal). Per-opcode variants are provided 12 + where needed (e.g., reduce_add_N). 14 13 """ 15 14 16 15 BUILTIN_MACROS = """\ ··· 71 70 } 72 71 73 72 ; --- Binary reduction trees (per-arity, per-opcode variants) --- 74 - ; Note: The macro expansion system's ParamRef only handles const fields and 75 - ; edge endpoints, not opcode positions. Generic opcode parameterization 76 - ; (e.g., passing 'add' as a macro argument) is a future enhancement. 77 - ; For now, per-opcode variants are provided. 73 + ; Note: The opcode position cannot be parameterized (grammar constraint). 74 + ; Per-opcode variants are provided instead. 78 75 #reduce_add_2 |> { 79 76 &r <| add 80 77 }
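A minimal sketch of what prepending a macro prelude implies for line numbers in diagnostics, assuming the parser reports 1-based lines over the combined text. `BUILTIN_PRELUDE`, `with_builtins`, and `user_line` are hypothetical names for illustration; they are not the contents of `asm/builtins.py`, and the prelude text here is a shortened stand-in for `BUILTIN_MACROS`.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for the real BUILTIN_MACROS string.
BUILTIN_PRELUDE = """\
; built-in macros (illustrative only)
#reduce_add_2 |> {
    &r <| add
}
"""


def with_builtins(user_source: str) -> Tuple[str, int]:
    """Prepend the prelude; return the combined text and the line offset."""
    offset = BUILTIN_PRELUDE.count("\n")  # number of prelude lines
    return BUILTIN_PRELUDE + user_source, offset


def user_line(combined_line: int, offset: int) -> Optional[int]:
    """Map a 1-based line of the combined text back to the user's file.

    Returns None when the line falls inside the prelude itself.
    """
    if combined_line <= offset:
        return None
    return combined_line - offset


combined, offset = with_builtins("&a <| const, 3\n&b <| const, 7\n")
```

Tracking the offset once, at the point where the prelude is attached, keeps later passes free of any knowledge about how long the built-in text is.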
+81 -32
design-notes/alu-and-output-design.md
··· 100 100 ``` 101 101 EQ A == B. Result: 0x01 if true, 0x00 if false. 102 102 LT A < B (signed). Result: 0x01 / 0x00. 103 + LTE A ≤ B (signed). Result: 0x01 / 0x00. 103 104 GT A > B (signed). Result: 0x01 / 0x00. 104 - ULT A < B (unsigned). Result: 0x01 / 0x00. 105 - UGT A > B (unsigned). Result: 0x01 / 0x00. 105 + GTE A ≥ B (signed). Result: 0x01 / 0x00. 106 106 ``` 107 107 108 - All comparisons produce an 8-bit result token with value 0x00 or 0x01. 108 + All comparisons produce a result token with value 0x0000 or 0x0001. 109 109 This token is data — it enters the network and arrives at a downstream 110 110 SWITCH or GATE instruction as a normal operand. The bool_out signal 111 111 (internal to the PE, derived from the comparator output) is also available 112 112 to the output formatter for same-instruction conditional routing, but the 113 113 primary mechanism is to emit the boolean as a data token. 114 114 115 - Signed vs unsigned distinction: the comparator hardware (74LS85) natively 116 - does unsigned magnitude comparison. Signed comparison can be derived by 117 - XORing the sign bits of both inputs before feeding the comparator, or by 118 - using a dedicated signed-compare path. The opcode decoder selects which 119 - path via control lines. 115 + > **⚠ Note:** The emulator operates with 16-bit data throughout, so 116 + > comparison results are 16-bit values (0x0000 or 0x0001). The 8-bit 117 + > hardware datapath sketch below produces 8-bit results (0x00 or 0x01). 118 + > Both representations are semantically identical — only bit 0 carries 119 + > information. 120 + 121 + All comparisons interpret operands as signed 2's complement. The 122 + comparator hardware (74LS85) natively does unsigned magnitude comparison. 123 + Signed comparison is derived by XORing the sign bits of both inputs 124 + before feeding the comparator. 
LTE and GTE are synthesised from the 125 + existing comparator outputs (LTE = LT OR EQ, GTE = GT OR EQ) via 126 + the decoder EEPROM's control signal selection. 120 127 121 128 ### Routing (dyadic) 122 129 130 + **Branch operations** (BREQ, BRGT, BRGE, BROF): compare L and R 131 + operands, then route the data to the taken or not-taken destination. 132 + Both destinations receive the data value; the branch condition selects 133 + which destination gets it first (taken) vs second (not-taken). 134 + 135 + **Switch operations** (SWEQ, SWGT, SWGE, SWOF): compare L and R 136 + operands, then route data to the taken side and a trigger token 137 + (value 0) to the not-taken side. See the Output Formatter section 138 + for the not-taken trigger semantics (FREE vs NOOP). 139 + 123 140 ``` 124 - SWITCH Conditional route. Takes data (port L) and boolean (port R). 125 - If bool == true: emit data to dest1, emit trigger to dest2. 126 - If bool == false: emit data to dest2, emit trigger to dest1. 127 - See Output Formatter section for trigger token semantics. 128 - 129 141 GATE Conditional pass/suppress. Takes data (port L) and boolean (port R). 130 142 If bool == true: emit data to dest1 (and optionally dest2). 131 143 If bool == false: emit nothing. Both tokens consumed silently. 144 + SEL Select between inputs based on a condition. 145 + MRGE Merge two token streams (non-deterministic). 132 146 ``` 133 147 134 - SWITCH always emits exactly 2 tokens: one data token to the taken side, 135 - one inline monadic trigger to the not-taken side. See the Output Formatter 136 - section for the not-taken trigger semantics (FREE vs NOOP). 148 + Branch and switch ops always emit exactly 2 tokens: one data token to 149 + the taken side, one token to the not-taken side (data for branch, inline 150 + monadic trigger for switch). See the Output Formatter section for details. 137 151 138 152 GATE emits 0 or 1-2 tokens depending on the boolean. 
When suppressed 139 153 (bool false), both input operands are consumed with no output — the ··· 170 184 | ----------- | -------- | ------- | -------------- | ------------------------ | 171 185 | 00000 | ADD | dyadic | DUAL or SINGLE | A + B | 172 186 | 00001 | SUB | dyadic | DUAL or SINGLE | A - B | 173 - | 00000 | INC | monadic | DUAL or SINGLE | A + 1 (imm const) | 174 - | 00001 | DEC | monadic | DUAL or SINGLE | A - 1 (imm const) | 187 + | 00010 | INC | monadic | DUAL or SINGLE | A + 1 (imm const) | 188 + | 00011 | DEC | monadic | DUAL or SINGLE | A - 1 (imm const) | 175 189 | 00100 | AND | dyadic | DUAL or SINGLE | A & B | 176 190 | 00101 | OR | dyadic | DUAL or SINGLE | A \| B | 177 191 | 00110 | XOR | dyadic | DUAL or SINGLE | A ^ B | ··· 181 195 | 01010 | ASR | monadic | DUAL or SINGLE | A >> N (imm, arithmetic) | 182 196 | 01011 | EQ | dyadic | DUAL or SINGLE | A == B → bool | 183 197 | 01100 | LT | dyadic | DUAL or SINGLE | A < B signed → bool | 184 - | 01101 | GT | dyadic | DUAL or SINGLE | A > B signed → bool | 185 - | 01110 | ULT | dyadic | DUAL or SINGLE | A < B unsigned → bool | 186 - | 01111 | UGT | dyadic | DUAL or SINGLE | A > B unsigned → bool | 187 - | 10000 | SWITCH | dyadic | SWITCH | route data by bool | 188 - | 10001 | GATE | dyadic | GATE | pass or suppress by bool | 189 - | 10010 | PASS | monadic | DUAL or SINGLE | identity | 190 - | 10011 | CONST | monadic | DUAL or SINGLE | output = immediate | 191 - | 10100 | FREE_CTX | monadic | SUPPRESS | deallocate slot | 192 - | 10101-11111 | — | — | — | reserved for expansion | 198 + | 01101 | LTE | dyadic | DUAL or SINGLE | A ≤ B signed → bool | 199 + | 01110 | GT | dyadic | DUAL or SINGLE | A > B signed → bool | 200 + | 01111 | GTE | dyadic | DUAL or SINGLE | A ≥ B signed → bool | 201 + | 10000 | BREQ | dyadic | SWITCH | branch if L == R | 202 + | 10001 | BRGT | dyadic | SWITCH | branch if L > R | 203 + | 10010 | BRGE | dyadic | SWITCH | branch if L ≥ R | 204 + | 10011 | BROF | dyadic | SWITCH | 
branch on overflow | 205 + | 10100 | SWEQ | dyadic | SWITCH | switch if L == R | 206 + | 10101 | SWGT | dyadic | SWITCH | switch if L > R | 207 + | 10110 | SWGE | dyadic | SWITCH | switch if L ≥ R | 208 + | 10111 | SWOF | dyadic | SWITCH | switch on overflow | 209 + | 11000 | GATE | dyadic | GATE | pass or suppress by bool | 210 + | 11001 | SEL | dyadic | DUAL or SINGLE | select between inputs | 211 + | 11010 | MRGE | dyadic | DUAL or SINGLE | merge two token streams | 212 + | 11011 | PASS | monadic | DUAL or SINGLE | identity | 213 + | 11100 | CONST | monadic | DUAL or SINGLE | output = immediate | 214 + | 11101 | FREE_CTX | monadic | SUPPRESS | deallocate slot | 215 + | 11110-11111 | — | — | — | reserved for expansion | 216 + 217 + > **⚠ Preliminary:** The binary opcode encodings above are a draft layout, 218 + > not a committed hardware encoding. The Python emulator and assembler use 219 + > IntEnum ordinal values that do NOT correspond to these bit patterns. 220 + > Final hardware encoding will be determined during physical build. 193 221 194 222 The output mode column indicates the default. DUAL vs SINGLE is 195 223 controlled by a flag in the IRAM instruction word (has_dest2), not by 196 224 the opcode — any arithmetic/logic/compare instruction can fan out to 197 225 two destinations or one. 198 226 199 - **Reserved opcode space (11 slots):** future candidates include hardware 200 - multiply, predicate store read/write, triadic operations, SM-directed 201 - operations, debug/trace instructions, and CALL/RETURN linkage primitives. 227 + **Branch vs Switch:** Branch operations (BR*) compare their two operands 228 + and route the data value to the taken or not-taken destination. Switch 229 + operations (SW*) are similar but emit a trigger token (value 0) to the 230 + not-taken side instead of the data value. See the Output Formatter section 231 + for the not-taken trigger semantics. 
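The switch rule described above can be sketched as a small routing function. This is an illustration under stated assumptions, not the emulator's code: `switch_route` and `TRIGGER` are hypothetical names, the data value is passed in explicitly rather than taken from a particular operand port, and mapping the condition-true case to `dest_l` follows the wording in `dfasm-primer.md`.

```python
from typing import List, Tuple

TRIGGER = 0  # value carried by the not-taken trigger token, per the text


def switch_route(cond: bool, data: int, dest_l: str, dest_r: str) -> List[Tuple[str, int]]:
    """Return the two (destination, value) tokens emitted by a sw* op.

    Assumption: condition true sends data to dest_l and the trigger to
    dest_r; condition false swaps the two.
    """
    if cond:
        return [(dest_l, data), (dest_r, TRIGGER)]
    return [(dest_r, data), (dest_l, TRIGGER)]


# A sweq-style use: the comparison is done first, then routing.
taken = switch_route(5 == 5, 42, "&then", "&else")
not_taken = switch_route(5 == 6, 42, "&then", "&else")
```

Either way exactly two tokens leave the instruction, which matches the "always emit exactly 2 tokens" statement for switch ops.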
232 + 233 + **Reserved opcode space (2 slots):** future candidates include hardware 234 + multiply, predicate store read/write, and debug/trace instructions. 235 + 236 + ### SM Instruction Dispatch 237 + 238 + The operation set above covers CM compute instructions (IRAM half 0, 239 + bit 15 = 0). When IRAM half 0 bit 15 = 1, the PE dispatches an SM 240 + operation instead of an ALU computation. In this mode: 241 + 242 + - The ALU is bypassed entirely 243 + - Stage 4 constructs an `SMToken` from the SM opcode, operand data, 244 + and IRAM fields (SM_id, const address, return routing) 245 + - Stage 5 emits the SMToken to the target SM via the SM bus 246 + 247 + The SM instruction encoding uses the same IRAM word width but with a 248 + different field layout. See `iram-and-function-calls.md` for the SM 249 + IRAM word format and operation table. 202 250 203 251 --- 204 252 ··· 343 391 ``` 344 392 result:8 computed data value (or passthrough for SWITCH/PASS) 345 393 bool_out:1 boolean/comparison result: 346 - - from comparator for EQ/LT/GT/ULT/UGT 347 - - from data_R bit 0 for SWITCH/GATE (boolean input) 394 + - from comparator for EQ/LT/LTE/GT/GTE 395 + - from comparator for BR*/SW* (inline comparison) 396 + - from data_R bit 0 for GATE (boolean input) 348 397 - undefined for arithmetic/logic ops 349 398 ``` 350 399
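The signed-comparison derivation in the Comparison section can be checked exhaustively for an 8-bit datapath. The sketch below reads "XORing the sign bits" as inverting the most significant bit of each operand before an unsigned magnitude compare; that reading is an assumption. The 74LS85 is modelled only as a function returning (lt, eq, gt), and all names are hypothetical.

```python
WIDTH = 8
SIGN = 1 << (WIDTH - 1)   # 0x80
MASK = (1 << WIDTH) - 1   # 0xFF


def to_signed(x: int) -> int:
    """Interpret an 8-bit pattern as signed two's complement."""
    return x - (1 << WIDTH) if x & SIGN else x


def unsigned_compare(a: int, b: int):
    """Model of a magnitude comparator: (lt, eq, gt) on unsigned inputs."""
    return a < b, a == b, a > b


def signed_compare(a: int, b: int) -> dict:
    """Signed compare built from the unsigned comparator by flipping sign bits."""
    lt, eq, gt = unsigned_compare((a ^ SIGN) & MASK, (b ^ SIGN) & MASK)
    # LTE and GTE are synthesised from the existing outputs.
    return {"LT": lt, "EQ": eq, "GT": gt, "LTE": lt or eq, "GTE": gt or eq}


# Exhaustive check against Python's own signed comparison.
mismatches = 0
for a in range(256):
    for b in range(256):
        r = signed_compare(a, b)
        sa, sb = to_signed(a), to_signed(b)
        expected = (sa < sb, sa <= sb, sa > sb, sa >= sb, sa == sb)
        if (r["LT"], r["LTE"], r["GT"], r["GTE"], r["EQ"]) != expected:
            mismatches += 1
```

Flipping the sign bit adds 128 modulo 256, which maps the signed range -128..127 onto 0..255 without changing the ordering. That is why the unsigned comparator gives the signed answer.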
+12 -7
design-notes/architecture-overview.md
··· 344 344 much of SM00's address space is mapped to IO vs general-purpose storage? 345 345 SM00 is special only at boot for now; further specialisation deferred 346 346 until profiling shows the standard opcodes are insufficient. 347 - 5. ~~**Compiler / assembler**~~ — **Partially Resolved.** The `asm/` package 348 - implements a 6-stage assembler pipeline (parse → lower → resolve → 349 - place → allocate → codegen). Produces PEConfig/SMConfig + seed tokens 350 - or a bootstrap token stream. See `assembler-architecture.md` for architecture. 351 - Grammar is `dfasm.lark` (Lark/Earley parser). Auto-placement via 352 - greedy bin-packing with locality heuristic. Remaining work: further 353 - optimisation passes, macro expansion, binary output. 347 + 5. ~~**Compiler / assembler**~~ — **Resolved.** The `asm/` package 348 + implements a 7-stage assembler pipeline (parse → lower → expand → 349 + resolve → place → allocate → codegen). Produces PEConfig/SMConfig + 350 + seed tokens or a bootstrap token stream. The expand pass handles 351 + macro expansion (`#macro` definitions with `${param}` substitution, 352 + variadic repetition, constant arithmetic) and function call wiring 353 + (cross-context edges, trampoline nodes, context teardown). Built-in 354 + macros for common patterns (counted loops, permit injection, reduction 355 + trees) are automatically available. See `assembler-architecture.md` 356 + for architecture. Grammar is `dfasm.lark` (Lark/Earley parser). 357 + Auto-placement via greedy bin-packing with locality heuristic. 358 + Remaining work: optimisation passes, binary output. 354 359 7. **Mode B clock ratio** — exactly 2x, or design for arbitrary integer 355 360 ratios? See `bus-architecture-and-width-decoupling.md`. 356 361 8. **Instruction residency** — small IRAM per PE means programs larger
+54 -11
design-notes/assembler-architecture.md
··· 13 13 14 14 ## Pipeline Overview 15 15 16 - Six stages, each a pure function from `IRGraph → IRGraph` (or `IRGraph → output`). 16 + Seven stages, each a pure function from `IRGraph → IRGraph` (or `IRGraph → output`). 17 17 18 18 The pipeline is: 19 19 ··· 24 24 Parse Lark/Earley parser, dfasm.lark grammar 25 25 │ → concrete syntax tree (CST) 26 26 27 - Lower CST → IRGraph (nodes, edges, regions, data defs) 27 + Lower CST → IRGraph (nodes, edges, regions, data defs, 28 + │ macro defs, macro calls, call sites) 28 29 │ → name qualification, scope creation 30 + 31 + Expand macro expansion + function call wiring 32 + │ → clone macro bodies, substitute params, evaluate 33 + │ const expressions, wire cross-context call edges 29 34 30 35 Resolve validate edge endpoints, detect scope violations 31 36 │ → "did you mean" suggestions via Levenshtein distance ··· 37 42 │ → dyadic-first layout, per-PE context scoping 38 43 39 44 Codegen emit PEConfig/SMConfig + seeds (direct mode) 40 - or SM init → ROUTE_SET → LOAD_INST → seeds (token mode) 45 + or SM init → IRAM writes → seeds (token mode) 41 46 ``` 42 47 43 48 Each pass returns a new `IRGraph`. Graphs are never mutated after construction, each pass produces a fresh copy with the new information filled in. Errors accumulate in `IRGraph.errors` rather than failing fast, so the assembler reports all problems in a single pass rather than forcing the programmer to fix them one at a time. 
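The pass structure described above (pure functions, fresh graph per pass, errors accumulated rather than raised) can be sketched with a toy graph type. `Graph`, `check_labels`, `check_duplicates`, and `run_pipeline` are hypothetical stand-ins, not the real `IRGraph` or the real passes.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Graph:
    """Toy stand-in for IRGraph: immutable, with accumulated errors."""
    nodes: Tuple[str, ...] = ()
    errors: Tuple[str, ...] = ()


Pass = Callable[[Graph], Graph]


def check_labels(g: Graph) -> Graph:
    """Report every node name lacking the local-label sigil."""
    bad = tuple(f"bad label {n!r}" for n in g.nodes if not n.startswith("&"))
    return replace(g, errors=g.errors + bad)


def check_duplicates(g: Graph) -> Graph:
    """Report every repeated node name."""
    seen, dup = set(), []
    for n in g.nodes:
        if n in seen:
            dup.append(f"duplicate {n!r}")
        seen.add(n)
    return replace(g, errors=g.errors + tuple(dup))


def run_pipeline(g: Graph, passes: Iterable[Pass]) -> Graph:
    for p in passes:
        g = p(g)  # each pass returns a new Graph; the input is untouched
    return g


start = Graph(nodes=("&a", "b", "&a"))
final = run_pipeline(start, [check_labels, check_duplicates])
```

Because no pass stops at the first problem, `final.errors` holds one entry from each check, while `start` still has an empty error list.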
··· 57 62 58 63 | Type | Fields | Purpose | 59 64 |------|--------|---------| 60 - | `IRNode` | name, opcode, dest_l, dest_r, const, pe, iram_offset, ctx, loc, args, sm_id | Single instruction in the dataflow graph | 61 - | `IREdge` | source, dest, port, source_port, loc | Connection between two nodes | 62 - | `IRGraph` | nodes, edges, regions, data_defs, system, errors | Complete program representation | 63 - | `IRRegion` | tag, kind, body (IRGraph), loc | Nested scope (FUNCTION or LOCATION) | 65 + | `IRNode` | name, opcode, dest_l, dest_r, const, pe, iram_offset, ctx, loc, args, sm_id, seed | Single instruction in the dataflow graph. `name` and `const` can be `ParamRef` or `ConstExpr` in macro templates. | 66 + | `IREdge` | source, dest, port, source_port, port_explicit, ctx_override, loc | Connection between two nodes. `ctx_override` marks cross-context call edges (ctx_mode=01). `port_explicit` tracks whether port was user-specified. | 67 + | `IRGraph` | nodes, edges, regions, data_defs, system, errors, macro_defs, macro_calls, raw_call_sites, call_sites, builtin_line_offset | Complete program representation. Macro-related fields are populated by lower and consumed by expand. | 68 + | `IRRegion` | tag, kind, body (IRGraph), loc | Nested scope (FUNCTION, LOCATION, or MACRO) | 64 69 | `IRDataDef` | name, sm_id, cell_addr, value, loc | SM cell initialisation | 65 - | `SystemConfig` | pe_count, sm_count, iram_capacity, ctx_slots, loc | Hardware configuration from `@system` pragma | 70 + | `SystemConfig` | pe_count, sm_count, iram_capacity, ctx_slots, loc | Hardware configuration from `@system` pragma (defaults: iram=128, ctx=16) | 71 + | `MacroDef` | name, params, body (IRGraph), repetition_blocks, loc | Macro definition: formal parameters + template body with `ParamRef` placeholders | 72 + | `MacroParam` | name, variadic | Formal macro parameter. `variadic=True` collects remaining args. 
| 73 + | `ParamRef` | param, prefix, suffix | Placeholder for a macro parameter. Supports token pasting via prefix/suffix. | 74 + | `ConstExpr` | expression, params, loc | Compile-time arithmetic expression (e.g., `base + _idx + 1`) | 75 + | `IRRepetitionBlock` | body (IRGraph), variadic_param, loc | Repetition block (`@each`) expanded once per variadic argument | 76 + | `IRMacroCall` | name, positional_args, named_args, loc | Macro invocation, consumed by expand pass | 77 + | `CallSiteResult` | func_name, input_args, output_dests, loc | Intermediate call data from lower, consumed by expand | 78 + | `CallSite` | func_name, call_id, input_edges, trampoline_nodes, free_ctx_nodes, loc | Processed call site metadata for per-call-site context allocation | 66 79 67 80 ### Destination Representation 68 81 ··· 100 113 - `strong_edge` / `weak_edge` -> anonymous `IRNode` + input/output 101 114 `IREdge` set (creates a `CompositeResult`) 102 115 - `func_def` → `IRRegion(kind=FUNCTION)` with a nested `IRGraph` body 116 + - `macro_def` → `MacroDef` with `ParamRef` placeholders in body template 117 + - `macro_call` → `IRMacroCall` with positional/named arguments 118 + - `call_stmt` → `CallSiteResult` with function name, input args, output dests 103 119 - `location_dir` -> `IRRegion(kind=LOCATION)`: subsequent statements 104 120 are collected into its body during post-processing 105 121 - `data_def` -> `IRDataDef` with SM placement and cell address ··· 112 128 113 129 **Opcode mapping (`opcodes.py`):** 114 130 115 - Mnemonic strings from the grammar are mapped to `ALUOp`, `MemOp`, or `CfgOp` enum values via `MNEMONIC_TO_OP`. A complication: Python `IntEnum` sub-classes can share numeric values across types (`ArithOp.ADD == 0 == MemOp.READ`), so the reverse mapping and set membership tests use type-aware collections (`TypeAwareOpToMnemonicDict`, `TypeAwareMonadicOpsSet`) that key on `(type, value)` tuples internally. 
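The `IntEnum` collision mentioned above is easy to reproduce. The sketch below shows the failure with a plain dict and one possible fix keyed on `(type, value)`. `TypeAwareOpDict` is a hypothetical illustration, not the actual `TypeAwareOpToMnemonicDict`, and every enum value other than `ADD == 0 == READ` is assumed.

```python
from enum import IntEnum


class ArithOp(IntEnum):
    ADD = 0
    SUB = 1   # assumed value, for illustration


class MemOp(IntEnum):
    READ = 0
    WRITE = 1  # assumed value, for illustration


# IntEnum members compare and hash as plain ints, so these collide.
collision = (ArithOp.ADD == MemOp.READ)

naive = {ArithOp.ADD: "add"}
naive[MemOp.READ] = "read"  # same hash and equal: overwrites the entry


class TypeAwareOpDict:
    """Mapping that keys on (enum type, numeric value)."""

    def __init__(self):
        self._d = {}

    @staticmethod
    def _key(op):
        return (type(op), int(op))

    def __setitem__(self, op, mnemonic):
        self._d[self._key(op)] = mnemonic

    def __getitem__(self, op):
        return self._d[self._key(op)]

    def __contains__(self, op):
        return self._key(op) in self._d


table = TypeAwareOpDict()
table[ArithOp.ADD] = "add"
table[MemOp.READ] = "read"
```

With the type in the key, both opcodes keep their own mnemonic even though their numeric values are equal.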
131 + Mnemonic strings from the grammar are mapped to `ALUOp` or `MemOp` enum values via `MNEMONIC_TO_OP`. A complication: Python `IntEnum` sub-classes can share numeric values across types (`ArithOp.ADD == 0 == MemOp.READ`), so the reverse mapping and set membership tests use type-aware collections (`TypeAwareOpToMnemonicDict`, `TypeAwareMonadicOpsSet`) that key on `(type, value)` tuples internally. 132 + 133 + ### Expand (`expand.py`) 134 + 135 + Processes macro definitions, macro invocations, and function call sites. This pass bridges the gap between the template-based IR from lowering and the concrete IR that resolve expects. 136 + 137 + **Macro expansion:** 138 + 139 + 1. Collect all `MacroDef` entries from the graph (including built-in macros prepended to every program) 140 + 2. For each `IRMacroCall`, clone the macro's body template 141 + 3. Substitute `ParamRef` placeholders with actual argument values 142 + 4. Evaluate `ConstExpr` arithmetic expressions (supports `+`, `-`, `*` on integers and `_idx`) 143 + 5. Expand `IRRepetitionBlock` entries once per variadic argument, binding `_idx` to the iteration index 144 + 6. Qualify expanded names with scope prefixes: `#macroname_N.&label` for top-level, `$func.#macro_N.&label` inside functions 145 + 146 + **Function call wiring:** 147 + 148 + 1. For each `CallSiteResult` from lowering, create a `CallSite` with a unique `call_id` 149 + 2. Generate trampoline `PASS` nodes for return routing 150 + 3. Create `IREdge` entries with `ctx_override=True` for cross-context argument passing (becomes `ctx_mode=01` in codegen) 151 + 4. Generate `FREE_CTX` nodes for context teardown on call completion 152 + 5. Wire `@ret` / `@ret_name` synthetic nodes for return paths 153 + 154 + **Post-conditions:** 155 + 156 + After expansion, the IR contains only concrete `IRNode`/`IREdge` entries. No `ParamRef` placeholders, no `MacroDef` regions, no `IRMacroCall` entries remain. 
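Steps 3 to 5 of macro expansion can be sketched in a few lines. This is a simplified illustration, not `expand.py`: the `ParamRef` here carries only the three fields named in the table, `substitute` and `eval_const` are hypothetical helpers, and Python's `ast` module stands in for whatever expression parsing the assembler actually uses.

```python
import ast
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamRef:
    """Placeholder for a macro parameter, with optional token pasting."""
    param: str
    prefix: str = ""
    suffix: str = ""


def substitute(ref: ParamRef, args: dict) -> str:
    """Replace a placeholder with its argument, pasting prefix and suffix."""
    return f"{ref.prefix}{args[ref.param]}{ref.suffix}"


_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def eval_const(expression: str, bindings: dict) -> int:
    """Evaluate +, -, * over integer literals and bound names such as _idx."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name):
            return int(bindings[node.id])
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")
    return walk(ast.parse(expression, mode="eval").body)


label = substitute(ParamRef("name", prefix="&", suffix="_left"), {"name": "foo"})
value = eval_const("base + _idx + 1", {"base": 10, "_idx": 2})
# One evaluation per variadic argument, binding _idx to the iteration index.
addresses = [eval_const("base + _idx", {"base": 10, "_idx": i}) for i in range(3)]
```

Restricting the evaluator to an explicit whitelist of node types keeps anything outside the documented `+`, `-`, `*` set from being accepted silently.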
116 157 117 158 ### Resolve (`resolve.py`) 118 159 ··· 151 192 152 193 **System config inference:** 153 194 154 - If no `@system` pragma is provided, the placer infers `pe_count` from the highest explicit PE ID and uses defaults for IRAM capacity (64) and context slots (4). 195 + If no `@system` pragma is provided, the placer infers `pe_count` from the highest explicit PE ID and uses defaults for IRAM capacity (128) and context slots (16). 155 196 156 197 ### Allocate (`allocate.py`) 157 198 ··· 216 257 | RESOURCE | allocate | IRAM overflow, context slot overflow, missing SM target | 217 258 | ARITY | lower | wrong operand count | 218 259 | PORT | allocate | port conflicts, missing destinations | 260 + | MACRO | expand | undefined macro, wrong argument count, expansion failure | 261 + | CALL | expand | undefined function, missing return path, call wiring error | 219 262 | UNREACHABLE | (future) | unused nodes | 220 263 | VALUE | lower | out-of-range literals | 221 264 ··· 243 286 ## Future Work 244 287 245 288 - **Optimization passes** between resolve and place: dead node elimination, constant folding, sub-graph deduplication 246 - - **Macro expansion**: the grammar already supports `#macro` syntax; the expansion pass is not yet implemented 247 289 - **Wider placement heuristics**: graph partitioning, min-cut algorithms, or profile-guided placement for larger programs 248 290 - **Incremental reassembly**: modify part of the graph and re-run only affected passes 249 291 - **Hardware encoding pass**: translate ALUInst/SMInst to bit-level instruction words for actual IRAM loading 292 + - **Conditional macro expansion**: the current macro system supports variadic repetition, constant arithmetic, and nested macro invocation (depth limit 32), but not conditionals within macros
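The system config inference rule in the Place section can be sketched as follows. The defaults (128 IRAM slots, 16 context slots) come from the text; everything else is an assumption made for illustration: PE IDs are taken to be 0-based so that `pe_count` is the highest ID plus one, a program with no explicit placement is given one PE, and `infer_system` is a hypothetical helper rather than the placer's real entry point.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 128  # default stated in the text
    ctx_slots: int = 16       # default stated in the text


def infer_system(explicit_pe_ids: Iterable[Optional[int]], sm_count: int = 0) -> SystemConfig:
    """Infer a config when no @system pragma is present.

    None entries stand for nodes without an explicit PE placement.
    """
    ids = [pe for pe in explicit_pe_ids if pe is not None]
    pe_count = max(ids) + 1 if ids else 1  # assumption: 0-based IDs
    return SystemConfig(pe_count=pe_count, sm_count=sm_count)


inferred = infer_system([0, None, 2])
empty = infer_system([])
```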
+196 -45
design-notes/dfasm-primer.md
··· 23 23 24 24 ### Names and Sigils 25 25 26 - dfasm uses three sigil-prefixed naming conventions: 26 + dfasm uses four sigil-prefixed naming conventions: 27 + 28 + | Sigil | Scope | Use | 29 + | -------- | --------------------------------- | --------------------------------- | 30 + | `@name` | Global (top-level) | Node references, data definitions | 31 + | `&name` | Local (within enclosing function) | Labels for instructions | 32 + | `$name` | Global | Function / subgraph definitions | 33 + | `#name` | Global | Macro definitions and invocations | 27 34 28 - | Sigil | Scope | Use | 29 - | ------- | --------------------------------- | --------------------------------- | 30 - | `@name` | Global (top-level) | Node references, data definitions | 31 - | `&name` | Local (within enclosing function) | Labels for instructions | 32 - | `$name` | Global | Function / subgraph definitions | 35 + Additionally, `${name}` is used within macro bodies for parameter substitution (see Macros section). 33 36 34 37 Names are composed of `[a-zA-Z_][a-zA-Z0-9_]*`. 35 38 ··· 67 70 | --------- | -------- | ------- | ---------------------------------------- | 68 71 | `pe` | yes | — | Number of processing elements | 69 72 | `sm` | yes | — | Number of structure memory modules | 70 - | `iram` | no | 64 | IRAM capacity per PE (instruction slots) | 71 - | `ctx` | no | 4 | Context slots per PE | 73 + | `iram` | no | 128 | IRAM capacity per PE (instruction slots) | 74 + | `ctx` | no | 16 | Context slots per PE | 72 75 73 76 At most one `@system` pragma per program. 74 77 ··· 235 238 236 239 These operations route tokens based on a comparison result. They are all dyadic — they compare L and R, then route accordingly. 237 240 238 - **Branch operations** (`br*`): emit data to `dest_l` (taken) or `dest_r` (not taken) based on comparison: 241 + **Branch operations** (`br*`): compare L and R, then emit data to `dest_l` (taken) or `dest_r` (not taken). 
Both outputs carry the data value; the branch condition selects the destination: 239 242 240 243 | Mnemonic | Condition | 241 244 | -------- | ---------- | ··· 243 246 | `brgt` | L > R | 244 247 | `brge` | L ≥ R | 245 248 | `brof` | overflow | 246 - | `brty` | type match | 247 249 248 - > NOTE: 249 - >`br*` ops use predicate register and internal-to-PE loopback route if supported by hardware. 250 + > NOTE: 251 + > `br*` ops use predicate register and internal-to-PE loopback route if supported by hardware. Future strongly-connected block execution will change the behaviour of `br*` ops to support pseudo-sequential execution within a PE. 250 252 251 - **Switch operations** (`sw*`): like branch, but when the condition is true, data goes to `dest_l` and a trigger token (value 0) goes to `dest_r`. 253 + **Switch operations** (`sw*`): like branch, but when the condition is true, data goes to `dest_l` and a trigger token (value 0) goes to `dest_r`. 252 254 When false, trigger goes to `dest_l` and data goes to `dest_r`: 253 255 254 256 | Mnemonic | Condition | 255 257 |----------|-----------| 256 - | `sweq` | L == R | 257 - | `swgt` | L > R | 258 - | `swge` | L ≥ R | 259 - | `swof` | overflow | 260 - | `swty` | type match | 258 + | `sweq` | L == R | 259 + | `swgt` | L > R | 260 + | `swge` | L ≥ R | 261 + | `swof` | overflow | 261 262 262 263 **Other routing:** 263 264 264 - | Mnemonic | Arity | Description | 265 - | -------- | ------ | -------------------------------------------------------- | 266 - | `gate` | dyadic | pass data through if bool_out is true, suppress if false | 267 - | `sel` | dyadic | select between inputs | 268 - | `merge` | dyadic | merge two inputs | 265 + | Mnemonic | Arity | Description | 266 + | -------- | ------ | ------------------------------------------------------------------- | 267 + | `gate` | dyadic | pass data through if bool_out is true, suppress if false | 268 + | `sel` | dyadic | select between L and R based on a condition | 269 + | 
`merge` | dyadic | merge two token streams (non-deterministic: fires on either input) | 269 270 270 271 ### Data 271 272 ··· 274 275 | `pass` | monadic | pass data through unchanged | 275 276 | `const` | monadic | emit constant value (from const field) | 276 277 | `free_ctx` | monadic | deallocate context slot, no data output | 277 - | `call` | dyadic | | 278 278 279 - - `free_ctx` in particular is a special token used to handle function body and loop exits. 279 + - `free_ctx` is a special instruction used to handle function body and loop exits. It frees the context slot so it can be reused. 280 280 281 281 ### Structure Memory 282 282 283 - | Mnemonic | Arity | Description | 284 - | -------- | ----------------- | --------------------------------------------------------------------------------------------------------------------- | 285 - | `read` | monadic | read from SM cell (const = cell address) | 286 - | `write` | context-dependent | write to SM cell — monadic if const is set (cell addr from const), dyadic if const is None (cell addr from L operand) | 287 - | `clear` | monadic | clear SM cell | 288 - | `alloc` | monadic | allocate SM cell | 289 - | `free` | monadic | free SM cell | 290 - | `rd_inc` | monadic | atomic read-and-increment | 291 - | `rd_dec` | monadic | atomic read-and-decrement | 292 - | `cmp_sw` | monadic | compare-and-swap | 293 - ### Configuration / System 283 + | Mnemonic | Arity | Description | 284 + | ----------- | ----------------- | --------------------------------------------------------------------------------------------------------------------- | 285 + | `read` | monadic | read from SM cell (const = cell address) | 286 + | `write` | context-dependent | write to SM cell — monadic if const is set (cell addr from const), dyadic if const is None (cell addr from L operand) | 287 + | `clear` | monadic | clear SM cell (reset to EMPTY state) | 288 + | `alloc` | monadic | allocate SM cell | 289 + | `free` | monadic | free SM cell | 290 + | 
`rd_inc` | monadic | atomic read-and-increment | 291 + | `rd_dec` | monadic | atomic read-and-decrement | 292 + | `cmp_sw` | dyadic | compare-and-swap (L = expected, R = new value) | 293 + | `exec` | monadic | trigger EXEC on SM (inject tokens from T0 storage into network) | 294 + | `raw_read` | monadic | raw read from T0 storage (no I-structure semantics) | 295 + | `set_page` | monadic | set SM page register (T0 operation) | 296 + | `write_imm` | monadic | immediate write to SM cell (T0 operation) | 297 + | `ext` | monadic | extended SM operation | 294 298 295 - | Mnemonic | Description | 296 - |----------|-------------| 297 - | `load_inst` | load instruction into PE IRAM | 298 - | `route_set` | configure PE routing table | 299 - | `ior` | I/O read | 300 - | `iow` | I/O write | 301 - | `iorw` | I/O read-write | 299 + Note: `free` (SM cell deallocation) and `free_ctx` (PE context slot deallocation) are distinct operations targeting different resources. 302 300 303 - These are rarely written by hand — `load_inst` and `route_set` are generated by the assembler's token stream mode during bootstrap. 301 + SM opcodes use a variable-width bus encoding. See `sm-design.md` for the full opcode table and encoding tiers. 304 302 305 303 ## Literals 306 304 ··· 445 443 446 444 **Direct mode** produces `PEConfig` objects (IRAM contents, route restrictions, context slot count) and `SMConfig` objects (initial cell values), plus seed tokens. This is the fast path for the emulator. Configuration is applied directly. 447 445 448 - **Token stream mode** produces a bootstrap sequence: SM initialization writes, route configuration tokens, instruction load tokens, then seed tokens. This mirrors the bootstrap process, loading the code stored at the reset vector. 446 + **Token stream mode** produces a bootstrap sequence: SM initialization writes, IRAM write tokens, then seed tokens. This mirrors the bootstrap process, loading the code stored at the reset vector. 
447 + 448 + ## Macros 449 + 450 + Macros define reusable template subgraphs that are expanded inline at their call sites. The macro system supports parameterisation, variadic arguments, repetition blocks, constant arithmetic, and token pasting. 451 + 452 + ### Macro Definition 453 + 454 + ```dfasm 455 + #macro_name param1, param2, *variadic_param |> { 456 + ; body — instructions and edges using ${param} substitution 457 + &node <| add ${param1} 458 + ${param1} |> &node:L 459 + ${param2} |> &node:R 460 + } 461 + ``` 462 + 463 + - Macro names use the `#` sigil 464 + - Parameters are declared before `|>` 465 + - Variadic parameters are prefixed with `*` and collect remaining arguments 466 + - The body contains standard dfasm statements with `${param}` placeholders 467 + 468 + ### Parameter Substitution 469 + 470 + Within a macro body, `${name}` references are replaced with the actual argument values during expansion: 471 + 472 + ```dfasm 473 + #add_const val |> { 474 + &adder <| add 475 + &c <| const, ${val} 476 + &c |> &adder:R 477 + } 478 + ``` 479 + 480 + **Token pasting:** Parameters can be combined with literal text to synthesise unique names. The `${param}` reference within a label name produces a label that incorporates the argument value: 481 + 482 + ```dfasm 483 + #make_pair name |> { 484 + &${name}_left <| pass 485 + &${name}_right <| pass 486 + } 487 + ``` 488 + 489 + ### Repetition Blocks 490 + 491 + The `$( ),*` syntax expands its body once per element of a variadic parameter. 
Within a repetition block, `${_idx}` provides the current iteration index (0-based): 492 + 493 + ```dfasm 494 + #fan_out *targets |> { 495 + &src <| pass 496 + $( 497 + &src |> ${targets} 498 + ),* 499 + } 500 + ``` 501 + 502 + ### Constant Arithmetic 503 + 504 + Macro const fields support compile-time arithmetic with `+`, `-`, `*` on integer values and parameters: 505 + 506 + ```dfasm 507 + #indexed_read base, *cells |> { 508 + $( 509 + &r${_idx} <| read, ${base} + ${_idx} 510 + ),* 511 + } 512 + ``` 513 + 514 + ### Macro Invocation 515 + 516 + Macros are invoked as standalone statements: 517 + 518 + ```dfasm 519 + #loop_counted 520 + #fan_out &a:L, &b:R, &c:L 521 + #indexed_read 10, &dest1, &dest2, &dest3 522 + ``` 523 + 524 + Arguments can be positional or named: 525 + 526 + ```dfasm 527 + #make_pair name=foo 528 + ``` 529 + 530 + ### Scoping 531 + 532 + Expanded macro names are automatically qualified to prevent collisions between multiple invocations of the same macro: 533 + 534 + - Top-level invocation: `#macro_N.&label` (N is the invocation counter) 535 + - Inside a function: `$func.#macro_N.&label` 536 + 537 + ### Built-in Macros 538 + 539 + The following macros are automatically available in all programs: 540 + 541 + | Macro | Purpose | 542 + |-------|---------| 543 + | `#loop_counted` | Counted loop: counter + compare + increment feedback loop | 544 + | `#loop_while` | Condition-tested loop: gate node for predicate-driven iteration | 545 + | `#permit_inject_N` | Inject N const(1) seed tokens (variants for N=1..4) | 546 + | `#reduce_add_N` | Binary reduction tree for addition (variants for N=2..4) | 547 + 548 + Built-in macros expose well-known internal node names (e.g., `&counter`, `&compare`, `&gate`) that the user wires externally after invocation. 549 + 550 + ## Function Calls 551 + 552 + Function calls wire argument values across context boundaries using the expand pass. 
The call syntax declares which arguments feed into the callee and where results flow back. 553 + 554 + ### Call Syntax 555 + 556 + ```dfasm 557 + $func_name arg1=&source1, arg2=&source2 |> @result 558 + ``` 559 + 560 + The function must be defined as a `$name |> { ... }` region. Arguments are named (matching the function's parameter labels) or positional. Outputs after `|>` specify where results are routed. 561 + 562 + Multiple outputs can be named: 563 + 564 + ```dfasm 565 + $divmod a=&dividend, b=&divisor |> @quotient, remainder=@remainder 566 + ``` 567 + 568 + ### What the Expand Pass Does 569 + 570 + When processing a function call: 571 + 572 + 1. Allocates a fresh context slot for the callee activation 573 + 2. Generates cross-context edges with `ctx_override=True` (becomes `ctx_mode=01` / CTX_OVRD in hardware) 574 + 3. Creates trampoline `PASS` nodes for return routing 575 + 4. Generates `FREE_CTX` nodes to clean up the callee's context on completion 576 + 5. Synthesises `@ret` marker nodes for return paths 577 + 578 + ### Return Convention 579 + 580 + The expand pass creates synthetic `@ret` (or `@ret_name` for named outputs) nodes as return markers. The callee's result edges are wired to these markers, which trampoline the results back to the caller's context. 581 + 582 + ### Example 583 + 584 + ```dfasm 585 + @system pe=2, sm=0 586 + 587 + ; Define a function that adds two values 588 + $add_pair |> { 589 + &sum <| add 590 + } 591 + 592 + ; Call it 593 + &a <| const, 3 594 + &b <| const, 7 595 + $add_pair a=&a, b=&b |> @result 596 + @result <| pass 597 + ``` 598 + 599 + After expansion, the assembler generates the cross-context wiring, trampoline nodes, and context cleanup automatically. The programmer does not need to manage context slots or return routing manually.
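The scope qualification described under "Scoping" above can be sketched in Python. This is a minimal illustration, not the code in `asm/expand.py`: the function name `qualify` and its parameters are hypothetical, and how the invocation counter N is allocated (per macro or globally) is an assumption the primer does not settle.

```python
from typing import Optional


def qualify(label: str, macro: str, n: int, func: Optional[str] = None) -> str:
    """Build a scoped name for a label produced by a macro expansion.

    label -- the label as written in the macro body, sigil included ("&counter")
    macro -- the macro name without the '#' sigil
    n     -- invocation counter for this expansion (allocation policy assumed)
    func  -- enclosing function name without the '$' sigil, or None at top level
    """
    scope = f"#{macro}_{n}"
    if func is not None:
        # Inside a function the macro scope nests under the function scope.
        scope = f"${func}.{scope}"
    return f"{scope}.{label}"


top_level = qualify("&counter", "loop_counted", 0)
in_function = qualify("&counter", "loop_counted", 1, func="main")
```

Two invocations of the same macro receive different values of N, so their `&counter` labels cannot collide even though the macro body names them identically.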
+27 -9
design-notes/io-and-bootstrap.md
··· 45 45 mapping is configured at bootstrap (or hardwired for v0): 46 46 47 47 ``` 48 - Example SM00 address map: 49 - 0x000 - 0x0FF: IO devices (tier 0, raw memory semantics) 50 - 0x100 - 0x1FF: Bootstrap ROM (tier 0, read-only) 51 - 0x200 - 0x3FF: General-purpose I-structure cells (tier 1) 48 + Example SM00 address map (indicative, not final): 49 + 0x000 - 0x0FF: I-structure cells (tier 1, presence-tracked) 50 + 0x100 - 0x1FF: IO devices (tier 0, raw memory semantics) 51 + 0x200 - 0x3FF: Bootstrap ROM (tier 0, read-only) 52 52 ``` 53 + 54 + **Tier boundary direction** is a hardware design decision, not yet 55 + finalised. The current design intent places I-structure cells at low 56 + addresses (below the boundary) where they are directly addressable by 57 + 2-flit SM tokens and reachable by atomic operations (RD_INC, RD_DEC, 58 + CMP_SW — 5-bit opcode tier, 8-bit payload, 256-cell range). T0 raw 59 + storage sits above the boundary; it does not need atomics and can use 60 + extended addressing when necessary. 61 + 62 + The emulator currently uses `tier_boundary=256` with T1 below and T0 63 + at/above, but the exact mapping may change during physical build. 64 + See `sm-design.md` for the full tier model and encoding details. 53 65 54 66 Within the IO range, specific addresses map to device registers: 55 67 ··· 95 107 | 96 108 [Address Decoder] 97 109 | 98 - +---> addr < 0x100? --> [IO Device Registers] 99 - | | 100 - | +---> [UART chip (6850/16550/etc.)] 101 - | +---> [future: SPI, GPIO, timer] 110 + +---> addr < tier_boundary? --> [SRAM Banks] (I-structure cells) 102 111 | 103 - +---> addr >= 0x100? --> [SRAM Banks] (normal SM operation) 112 + +---> addr >= tier_boundary? 113 + | 114 + +---> IO range? --> [IO Device Registers] 115 + | | 116 + | +---> [UART chip (6850/16550/etc.)] 117 + | +---> [future: SPI, GPIO, timer] 118 + | 119 + +---> ROM range? 
--> [Bootstrap ROM] 120 + | 121 + +---> else --> [T0 raw SRAM] 104 122 ``` 105 123 106 124 The IO device registers behave like SM cells from the SM controller's
+94 -103
design-notes/loop-patterns-and-flow-control.md
··· 751 751 or a purpose-built assembler template system. No conditionals, no 752 752 recursion in the macro evaluator, no type system. 753 753 754 + > **Implementation status:** All five requirements below are now 755 + > implemented in the `asm/expand.py` pass. The implemented syntax 756 + > differs from the Rust-style examples in this section — see 757 + > `dfasm-primer.md` for the canonical syntax reference. 758 + 754 759 #### 1. Named Variadic Repetition 755 760 756 - ``` 757 - $($arg = $src),* 758 - ``` 759 - 760 - A comma-separated list of named pairs, expanded once per entry. 761 - Rust `macro_rules!` provides this directly. 761 + **Implemented.** Variadic parameters use `*name` in the macro 762 + definition. Repetition blocks use `$( ... ),*` syntax 763 + (`IRRepetitionBlock` in the IR). Each iteration binds the variadic 764 + parameter to the current element. 762 765 763 766 #### 2. Token Pasting (Label Synthesis) 764 767 765 - ``` 766 - &__${arg}_tag 767 - ``` 768 - 769 - Concatenate a macro parameter into a label name. Produces unique 770 - labels per repetition entry. C has `##`, rust proc macros have 771 - `Ident::new()`, but `macro_rules!` does NOT have this natively — 772 - would need an assembler-specific extension or a `paste!`-style 773 - helper. 774 - 775 - This is the single most important extension beyond stock 776 - `macro_rules!`. Without it, macros can't generate unique labels 777 - for per-arg instructions. 768 + **Implemented.** `ParamRef` supports `prefix` and `suffix` fields 769 + for token pasting. Within a macro body, `&${name}_tag` produces 770 + labels incorporating the parameter value. Unique labels per 771 + repetition entry are generated automatically. 778 772 779 773 #### 3. Implicit Repetition Index 780 774 781 - ``` 782 - ${_idx} 783 - ``` 784 - 785 - An auto-incrementing counter within a `$(...),*` expansion. 786 - Used for descriptor table offset arithmetic (`$desc + ${_idx} + 1`). 
787 - Not available in rust `macro_rules!` — another assembler-specific 788 - extension. Alternative: require the programmer to pass explicit 789 - indices, which is ugly but functional. 775 + **Implemented.** `${_idx}` is bound to the current iteration index 776 + (0-based) within `$(...),*` expansion blocks. Used for descriptor 777 + table offset arithmetic and generating unique per-iteration names. 790 778 791 779 #### 4. Constant Arithmetic in Expressions 792 780 793 - ``` 794 - $desc + ${_idx} + 1 795 - ``` 796 - 797 - Compile-time addition on constant/label expressions. The assembler 798 - already evaluates constant expressions for instruction operands, so 799 - the macro expander just needs to emit the expression text and let the 800 - normal evaluator handle it. No new evaluation capability needed. 781 + **Implemented.** `ConstExpr` in `ir.py` supports compile-time `+`, 782 + `-`, `*` on integer values and parameter references. Expressions like 783 + `${base} + ${_idx} + 1` are evaluated during macro expansion. 801 784 802 785 #### 5. Label Reference Across Macro Boundaries 803 786 804 - ``` 805 - &__alloc |> $func.__stub.ctx_in 806 - ``` 787 + **Implemented via scoped naming.** Expanded macro names are 788 + automatically qualified with scope prefixes (`#macroname_N.&label` 789 + for top-level, `$func.#macro_N.&label` inside functions). Cross-macro 790 + references use these qualified names. The expand pass also handles 791 + function call wiring via `CallSite` metadata, which generates 792 + cross-context edges and trampoline nodes automatically. 807 793 808 - The per-call-site macro needs to reference labels inside the 809 - per-function stub macro's expansion. 
This requires either: 810 - 811 - - A naming convention that both macros agree on (fragile but simple) 812 - - The stub macro "exporting" label names via a known pattern 813 - (`$func.__stub.*`) 814 - - The assembler resolving qualified names across scopes 815 - 816 - The naming convention approach is the most macro-friendly: the 817 - `call_stub` macro always emits labels named 818 - `&__${func}_ctx_fan`, `&__${func}_or_ret`, etc., and the 819 - `call_dyn` macro references them by constructing the same names. 820 - Both macros must agree. This is a social contract, not a type system. 821 - 822 - #### What's NOT Needed 794 + #### What's NOT Needed (still true) 823 795 824 796 - **Conditional expansion** — different call shapes get different 825 797 macros, not `if` inside a macro. 826 - - **Recursive macro expansion** — the fan-out PASS chain has a fixed 827 - structure per argument count. For N=1 it's one PASS with dual dest. 828 - For N=2 it's two PASSes. Rather than recursing, provide 829 - `call_stub_1`, `call_stub_2`, `call_stub_3` for common arities. 830 - Ugly, pragmatic, correct. 798 + - **Recursive macro expansion** — macros CAN invoke other macros 799 + (nested expansion with depth limit of 32). Per-arity variants are 800 + provided for the built-ins for simplicity, not due to a limitation. 831 801 - **Type checking** — the assembler validates after expansion (wrong 832 802 arity, missing labels, offset overflow). The macro doesn't check. 833 - - **Hygiene** — label collisions between macro expansions ARE a risk. 834 - Mitigated by the `&__${func}_` prefix convention. If two functions 835 - have the same name, you have bigger problems. 803 + - **Hygiene** — scoped naming (`#macroname_N.&label`) prevents 804 + collisions between multiple invocations of the same macro. 
805 + 806 + #### Built-in Macros 807 + 808 + The assembler ships built-in macros (prepended to every program) that 809 + implement common patterns from this document: 810 + 811 + | Macro | Pattern | 812 + |-------|---------| 813 + | `#loop_counted` | Counted loop (counter + compare + increment feedback) | 814 + | `#loop_while` | Condition-tested loop (gate node) | 815 + | `#permit_inject_N` | Permit injection (N=1..4 const seed tokens) | 816 + | `#reduce_add_N` | Binary reduction tree for addition (N=2..4) | 817 + 818 + See `asm/builtins.py` for definitions. 836 819 837 820 #### Example Macro Definitions 838 821 839 822 ```dfasm 840 823 ; ── call stub for a 1-argument function ── 841 824 ; emitted once per function, provides shared call infrastructure 842 - .macro call_stub_1 $func, $desc { 825 + #call_stub_1 func, desc |> { 843 826 ; ctx fan-out (1 arg + 1 return = 2 consumers, one PASS suffices) 844 827 &__${func}_ctx_fan <| pass 845 828 &__${func}_ctx_fan |> &__${func}_or_ret:R, &__${func}_or_arg0:R ··· 858 841 } 859 842 860 843 ; ── call stub for a 2-argument function ── 861 - .macro call_stub_2 $func, $desc { 844 + #call_stub_2 func, desc |> { 862 845 ; ctx fan-out chain (3 consumers) 863 846 &__${func}_ctx_fan0 <| pass 864 847 &__${func}_ctx_fan1 <| pass ··· 882 865 } 883 866 884 867 ; ── per-call-site (works for any arity) ── 885 - .macro call_dyn $func, $alloc, $call_seq, $ret_offset, $($arg = $src),* { 868 + #call_dyn func, alloc, call_seq, ret_offset, *args |> { 886 869 ; allocate + trigger + return continuation 887 - &__call_alloc_${func} <| rd_inc, $alloc 888 - &__call_exec_${func} <| exec, $call_seq 889 - &__call_extag_${func} <| extract_tag, $ret_offset 870 + &__call_alloc_${func} <| rd_inc, ${alloc} 871 + &__call_exec_${func} <| exec, ${call_seq} 872 + &__call_extag_${func} <| extract_tag, ${ret_offset} 890 873 891 874 ; wire into stub 892 875 &__call_alloc_${func} |> &__${func}_ctx_fan 893 876 &__call_extag_${func} |> &__${func}_ct_ret:R 894 877 $( 
895 - $src |> &__${func}_ct_${arg}:R 878 + ${args} |> &__${func}_ct_${args}:R 896 879 ),* 897 880 } 898 881 ``` ··· 904 887 #call_stub_1 fib, @fib_desc 905 888 906 889 ; at each call site 907 - #call_dyn fib, @ctx_alloc, @fib_call_seq, 20, n = &my_arg 890 + #call_dyn fib, @ctx_alloc, @fib_call_seq, 20, &my_arg 908 891 ``` 909 892 910 - The `call_stub_N` per-arity approach is admittedly clunky. A future 911 - macro system with proper counted repetition could unify them. For 912 - now, N=1 through N=3 covers the vast majority of functions, and 913 - anything beyond N=3 can be hand-written — it's the same pattern, 914 - just more of it. 893 + The `call_stub_N` per-arity approach is admittedly clunky. Macros can 894 + invoke other macros (nested expansion up to depth 32), so a unified 895 + variadic call_stub using a helper macro is possible. Per-arity variants 896 + are used here for clarity. N=1 through N=3 covers the vast majority 897 + of functions, and anything beyond N=3 can be hand-written — it's the 898 + same pattern, just more of it. 915 899 916 - ### Permit Injection — Two Macros 900 + ### Permit Injection — Two Approaches 917 901 918 - For small K (roughly K <= 4), inline CONST injection: 902 + For small K (roughly K <= 4), inline CONST injection. 
The built-in 903 + macros `#permit_inject_1` through `#permit_inject_4` provide this: 919 904 920 905 ```dfasm 921 - ; Macro definition: 922 - $permit_inject_inline K, &gate |> { 923 - ; expands to K const instructions, each targeting &gate:L 924 - ; each const needs its own trigger to fire 906 + ; Built-in definition (from asm/builtins.py): 907 + #permit_inject_2 |> { 908 + &p0 <| const, 1 909 + &p1 <| const, 1 925 910 } 926 911 927 - ; Usage: inject 3 permits into the gate 928 - #permit_inject_inline 3, &dispatch_gate 912 + ; Usage: invoke the built-in, then wire outputs to the gate 913 + #permit_inject_2 914 + #permit_inject_2.&p0 |> &dispatch_gate:L 915 + #permit_inject_2.&p1 |> &dispatch_gate:L 929 916 ``` 930 917 931 918 For large K, use SM EXEC to batch-emit permits: 932 919 933 920 ```dfasm 934 - ; Macro definition: 935 - $permit_inject_exec K, &gate, @sm_base |> { 936 - ; expands to a single SM_EXEC reading K pre-formed permit 937 - ; tokens from SM starting at @sm_base, each addressed to &gate:L 921 + ; Custom macro for EXEC-based injection: 922 + #permit_inject_exec count, sm_base |> { 923 + ; a single SM EXEC reading ${count} pre-formed permit 924 + ; tokens from SM starting at ${sm_base} 925 + &exec <| exec, ${sm_base} 938 926 } 939 927 940 928 ; Usage: inject 8 permits via EXEC 941 - #permit_inject_exec 8, &dispatch_gate, @permit_store 929 + #permit_inject_exec 8, @permit_store 942 930 ``` 943 931 944 932 Programmer chooses based on K. No magic. 
945 933 946 934 ### Loop Control Macro 947 935 936 + The built-in `#loop_counted` provides the core loop infrastructure: 937 + 948 938 ```dfasm 949 - $loop_counted &limit, &body, &exit |> { 950 - &counter <| const, 0 951 - &step <| inc 952 - &test <| lt 953 - &route <| sweq 954 - 955 - &counter |> &step 956 - &step |> &test:L, &route:L ; fan-out: counter to both LT and SWITCH 957 - &limit |> &test:R 958 - &test |> &route:R ; bool from comparison → SWITCH control 959 - &route:L |> &body ; taken → body dispatch 960 - &route:R |> &exit ; not-taken → done 961 - &route:L |> &step ; feedback arc: counter recirculates 939 + ; Built-in definition (from asm/builtins.py): 940 + #loop_counted |> { 941 + &counter <| add 942 + &compare <| brgt 943 + &counter |> &compare:L 944 + &inc <| inc 945 + &compare |> &inc:L 946 + &inc |> &counter:R 962 947 } 963 948 964 - ; Usage: 965 - #loop_counted 64, &body_entry, &done 949 + ; Usage: wire init, limit, body, and exit externally 950 + #loop_counted 951 + &c_init <| const, 0 952 + &c_limit <| const, 64 953 + &c_init |> #loop_counted.&counter:L ; initial counter value 954 + &c_limit |> #loop_counted.&compare:R ; loop bound 955 + #loop_counted.&compare:L |> &body_entry ; taken → body 956 + #loop_counted.&compare:R |> &done ; not-taken → exit 966 957 ``` 967 958 968 959 ### Reduction Tree Macro
+43 -9
design-notes/pe-design.md
··· 89 89 - First operand: store in slot, advance to wait state 90 90 - Second operand: read partner from slot, both proceed 91 91 - Monadic (prefix 010 or 011+10): bypass matching, proceed directly 92 + - **DyadToken at monadic instruction**: if a dyadic-format token 93 + arrives at an instruction that is monadic (determined after IRAM 94 + fetch), the matching store is bypassed and the token fires 95 + immediately as a monadic operand (data from token, no partner). 96 + This allows the compiler to send dyadic-format tokens to monadic 97 + instructions without deadlocking the matching store. 92 98 - Single cycle for all cases (no hash path, no CAM search — 93 99 direct indexing only, see matching store section below) 94 100 - Estimated: ~200-300 transistors + SRAM ··· 104 110 network config writes — see "Instruction Memory" section below 105 111 106 112 Stage 4: EXECUTE 107 - - 16-bit ALU 108 - - ~1500-2000 transistors 113 + - IRAM half 0 bit 15 selects CM compute (0) or SM operation (1) 114 + - CM path: 16-bit ALU executes arithmetic/logic/comparison/routing 115 + - SM path: ALU is bypassed; PE constructs an SMToken from the SM 116 + opcode, operand data, and IRAM fields (SM_id, const address, 117 + return routing). See `iram-and-function-calls.md` for SM IRAM 118 + word format. 119 + - ~1500-2000 transistors (ALU) + SM flit assembly mux (~2 chips) 109 120 110 121 Stage 5: TOKEN OUTPUT 111 - - Form result token with routing prefix (type, destination PE/SM, 112 - offset, context, etc.) using destination fields from IRAM 122 + - CM path: form result CM token with routing prefix (type, 123 + destination PE, offset, context, etc.) using destination fields 124 + from IRAM. ctx_mode selects context source: 125 + 00 = INHERIT (pipeline latches) 126 + 01 = CTX_OVRD (const field overrides ctx/gen) 127 + 10 = CHANGE_TAG (left operand becomes flit 1 verbatim) 128 + See `iram-and-function-calls.md` for ctx_mode details. 
129 + - SM path: emit SMToken to target SM via SM bus routes 113 130 - Pass to output serialiser for flit encoding and bus injection 114 - - ~300 transistors 131 + - ~300 transistors + ctx/gen mux (~3 chips) 115 132 ``` 133 + 134 + ### Concurrency Model 135 + 136 + The PE is pipelined: multiple tokens can be in flight 137 + simultaneously at different stages. In the emulator, each token spawns 138 + a separate SimPy process that progresses through the pipeline 139 + independently. This models the hardware reality where Stage 2 can be 140 + matching a new token while Stage 4 is executing a previous one. 141 + 142 + Cycle counts per token type: 143 + - **Dyadic**: 5 cycles (dequeue + match + fetch + execute + emit) 144 + - **Monadic**: 4 cycles (dequeue + fetch + execute + emit; match bypassed) 145 + - **IRAMWriteToken**: 2 cycles (dequeue + write; no match/fetch/execute/emit) 146 + - **Network delivery**: +1 cycle latency between emit and arrival at destination 116 147 117 148 ### Pipeline Register Widths 118 149 ··· 355 386 just 32 bytes. 356 387 357 388 > **⚠ Preliminary:** These configurations are candidates, not commitments. 358 - > The emulator defaults to 16 context slots and 128 IRAM entries (matching 359 - > Config B sizing). Final sizing depends on compiling real programs and 360 - > measuring actual concurrent activation counts and dyadic instruction 361 - > density per PE. 389 + > The emulator defaults to 16 context slots and 128 IRAM entries. Note 390 + > that IRAM entries (128) and matching store entries per context (32 in 391 + > Config B) are distinct: the matching store only holds dyadic 392 + > instructions at offsets 0-31, while IRAM holds all instructions 393 + > (dyadic + monadic) across the full 128-slot range. Final sizing 394 + > depends on compiling real programs and measuring actual concurrent 395 + > activation counts and dyadic instruction density per PE.
362 396 363 397 **Recommendation for v0**: Config B (16 slots x 32 entries = 1KB). This 364 398 matches the token format directly: 4-bit ctx from flit 1, 5-bit offset
+3 -4
design-notes/sm-design.md
··· 341 341 342 342 'aa' = address bits (part of 10-bit address). 343 343 344 - **Tier 1 ops (3-bit, full address range):** READ, WRITE, ALLOC, FREE, 344 + **Tier 1 ops (3-bit, full address range):** READ, WRITE, ALLOC, EXEC, 345 345 CLEAR reach the full 1024-cell address space. EXT signals a 3-flit 346 346 token for extended addressing (external RAM, wide addresses). 347 347 348 348 **Tier 2 ops (5-bit, restricted address/payload):** atomic operations 349 - (READ_INC, READ_DEC, CAS) are restricted to 256 cells — the compiler 350 - places atomic-access cells in the lower range. EXEC, SET_PAGE, and 349 + (READ_INC, READ_DEC, CAS) as well as CLEAR are restricted to 256 cells — the compiler 350 + places atomic-access cells in the lower range. SET_PAGE and 351 351 WRITE_IMM use the 8-bit payload field for non-address data (length, 352 352 page register value, small immediate). 353 353 ··· 356 356 operand. The 8-bit payload in the restricted tier can be inline data, 357 357 config values, or range counts depending on the opcode: 358 358 359 - - EXEC: payload = length/count (base addr in config register) 360 359 - SET_PAGE: payload = page register value 361 360 - WRITE_IMM: 8-bit addr, flit 2 carries immediate data 362 361 - RAW_READ: non-blocking read, returns data or empty indicator without
+575
docs/macro-enhancements.md
··· 1 + # Macro Enhancements: Opcode Parameters, Qualified Ref Parameters, and @ret Wiring 2 + 3 + Extends the dfasm macro system with three capabilities that reduce the need for per-variant macro definitions and make macros composable in the same way as functions. 4 + 5 + ## Current State 6 + 7 + The macro system (implemented in `asm/expand.py`, grammar in `dfasm.lark`) supports: 8 + 9 + - Parameter substitution in node names via `${param}` (token pasting with prefix/suffix) 10 + - Parameter substitution in edge endpoints via `${param}` in `qualified_ref` 11 + - Parameter substitution in const fields 12 + - Compile-time arithmetic via `ConstExpr` (`${base} + ${_idx} + 1`) 13 + - Variadic parameters with `$( ... ),*` repetition blocks 14 + - Nested macro invocation (depth limit 32) 15 + 16 + Three gaps remain: 17 + 18 + 1. **Opcode position is not parameterizable.** The grammar defines `opcode: OPCODE` as a keyword terminal. You cannot pass an opcode as a macro argument. This forces per-opcode variants: `#reduce_add_2`, `#reduce_add_3`, etc. 19 + 20 + 2. **Placement and port qualifiers are not parameterizable.** The grammar defines `placement: "|" IDENT` and `port: ":" PORT_SPEC` — neither accepts `param_ref`. You cannot write `&ref:${port}` or `&ref|${pe}` in a macro body to parameterize which port or PE a reference targets. 21 + 22 + 3. **Macros have no output wiring convention.** Functions use `@ret` / `@ret_name` markers in their body, and the call syntax `$func args |> outputs` auto-wires return paths. Macros have no equivalent — the user must manually wire to expanded internal node names after invocation. 23 + 24 + ## Enhancement 1: Opcode Parameters 25 + 26 + ### Goal 27 + 28 + Allow macro parameters to appear in the opcode position of `inst_def`, `strong_edge`, and `weak_edge` rules.
29 + 30 + ### Grammar Change 31 + 32 + ```lark 33 + // Current: 34 + opcode: OPCODE 35 + 36 + // Proposed: 37 + opcode: OPCODE | param_ref 38 + ``` 39 + 40 + This is the only grammar change needed. `param_ref` (`${name}`) is already a valid production. Earley parsing handles the ambiguity. 41 + 42 + ### Lower Pass 43 + 44 + The `inst_def` handler in `lower.py` currently calls `self._resolve_opcode()` which maps mnemonic strings to `ALUOp`/`MemOp` values. When the opcode is a `ParamRef`, lowering must defer resolution — store the `ParamRef` on the `IRNode` in a new field (or overload the `opcode` field's type to `Union[ALUOp, MemOp, ParamRef]`). 45 + 46 + The `strong_edge` and `weak_edge` handlers need the same treatment: if the opcode token is a `ParamRef`, create the anonymous node with a deferred opcode. 47 + 48 + ### Expand Pass 49 + 50 + During `_clone_and_substitute_node`, if `node.opcode` is a `ParamRef`: 51 + 52 + 1. Look up the parameter in the substitution map 53 + 2. The argument value must be a string matching a known opcode mnemonic 54 + 3. Resolve via `MNEMONIC_TO_OP` to get the concrete `ALUOp`/`MemOp` 55 + 4. Replace the node's opcode with the resolved value 56 + 5. Error if the argument is not a valid opcode mnemonic 57 + 58 + ### Validation 59 + 60 + Opcode validation (monadic/dyadic arity, valid argument combinations) already happens after expansion in the resolve and allocate passes. No additional validation needed at expansion time beyond confirming the mnemonic exists. 61 + 62 + ### Example 63 + 64 + Before (current — per-opcode variants): 65 + 66 + ``` 67 + #reduce_add_2 |> { &r <| add } 68 + #reduce_add_3 |> { &r0 <| add; &r1 <| add; &r0 |> &r1:L } 69 + #reduce_sub_2 |> { &r <| sub } 70 + ; ... 
N variants per opcode 71 + ``` 72 + 73 + After (parameterized): 74 + 75 + ``` 76 + #reduce_2 op |> { 77 + &r <| ${op} 78 + } 79 + 80 + #reduce_3 op |> { 81 + &r0 <| ${op} 82 + &r1 <| ${op} 83 + &r0 |> &r1:L 84 + } 85 + 86 + ; Usage: 87 + #reduce_2 add 88 + #reduce_3 sub 89 + ``` 90 + 91 + ### Argument Syntax 92 + 93 + Opcode arguments are meant to be passed as bare identifiers in the macro call, but the current grammar does not allow this. `macro_call_stmt` accepts `argument`, which includes `qualified_ref`, yet an unqualified `add` in argument position lexes as the `OPCODE` terminal (priority 2), not as `IDENT`, and `OPCODE` is not a valid `argument`. 94 + 95 + Two options: 96 + 97 + **Option A: Quote opcode arguments.** Pass as string literals: `#reduce_2 "add"`. Simple, unambiguous. Expand pass strips quotes and resolves. Slightly ugly. 98 + 99 + **Option B: Accept OPCODE as a macro argument.** Add `OPCODE` as an alternative in `positional_arg`: 100 + 101 + ```lark 102 + // Current: 103 + ?positional_arg: value | qualified_ref 104 + 105 + // Proposed: 106 + ?positional_arg: value | qualified_ref | OPCODE 107 + ``` 108 + 109 + The lower pass wraps the bare opcode token as a string argument in the `IRMacroCall`. Expand resolves it against `MNEMONIC_TO_OP`. This reads naturally: `#reduce_2 add`. 110 + 111 + Option B is cleaner. The only risk is if someone has an `IDENT` that collides with an opcode name as a label/node, but the priority system already handles that (opcodes win at lexer level), and this collision already exists in the language. 112 + 113 + **Recommendation: Option B.** 114 + 115 + 116 + ## Enhancement 2: Parameterized Placement and Port Qualifiers 117 + 118 + ### Goal 119 + 120 + Allow `${param}` in the placement (`|pe0`) and port (`:L`) positions of a `qualified_ref`, so macros can parameterize which PE a node targets, which port an edge uses, and (when exposed) which context slot to use.
121 + 122 + ### Current State 123 + 124 + `qualified_ref` is built from three parts: 125 + 126 + ```lark 127 + qualified_ref: (node_ref | label_ref | ... | param_ref) placement? port? 128 + placement: "|" IDENT 129 + port: ":" PORT_SPEC 130 + PORT_SPEC: IDENT | HEX_LIT | DEC_LIT 131 + ``` 132 + 133 + `${param}` can already stand in for the entire ref part (the first element). But the `placement` and `port` suffixes only accept literal tokens. So `&node:${port}` and `&node|${pe}` don't parse. 134 + 135 + In the lower pass, `qualified_ref` collects its children into a dict: 136 + - The ref part becomes `{"name": ...}` 137 + - `placement` returns a string (e.g., `"pe0"`) 138 + - `port` returns a `Port` enum (`Port.L`, `Port.R`) or raw `int` 139 + 140 + In the IR, `IRNode.pe` stores placement as `Optional[int]`, and `IREdge.port`/`IREdge.source_port` store port as `Port`. Neither field currently accepts `ParamRef`. 141 + 142 + The expand pass (`_clone_and_substitute_node`, `_clone_and_substitute_edge`) only substitutes `name`, `const`, `source`, and `dest`. It does not touch `pe`, `port`, or `source_port`. 143 + 144 + ### Grammar Changes 145 + 146 + ```lark 147 + // Current: 148 + placement: "|" IDENT 149 + port: ":" PORT_SPEC 150 + 151 + // Proposed: 152 + placement: "|" (IDENT | param_ref) 153 + port: ":" (PORT_SPEC | param_ref) 154 + ``` 155 + 156 + ### Lower Pass 157 + 158 + The `placement` handler currently does `return str(token)`. It needs to handle receiving a `ParamRef` from the parser and return it as-is: 159 + 160 + ```python 161 + def placement(self, *args): 162 + for arg in args: 163 + if isinstance(arg, ParamRef): 164 + return arg 165 + return str(args[-1]) 166 + ``` 167 + 168 + Similarly, the `port` handler needs to pass through `ParamRef` instead of resolving to `Port`: 169 + 170 + ```python 171 + def port(self, *args): 172 + for arg in args: 173 + if isinstance(arg, ParamRef): 174 + return arg 175 + # ... 
existing Port.L / Port.R / int resolution 176 + ``` 177 + 178 + The `qualified_ref` handler already iterates over args by type. It needs a new branch to detect `ParamRef` in placement/port positions (currently it only detects `ParamRef` in the ref-name position). The disambiguation is based on ordering: the ref-name comes first, placement second (prefixed with `|`), port third (prefixed with `:`). Since Lark processes them through their respective rules before `qualified_ref` sees them, the parser distinguishes them. The `qualified_ref` handler just needs to accept `ParamRef` for placement and port: 179 + 180 + ```python 181 + def qualified_ref(self, *args): 182 + ref_type = None 183 + placement = None 184 + port = None 185 + for arg in args: 186 + if isinstance(arg, ParamRef) and ref_type is None: 187 + ref_type = {"name": arg} 188 + elif isinstance(arg, ParamRef) and ref_type is not None: 189 + # Second or third ParamRef — depends on position 190 + # But Lark gives us placement/port through their handlers, 191 + # so we get ParamRef from the placement() or port() handler. 192 + # Need to distinguish: placement handler adds a marker or 193 + # we rely on Lark's rule names. 194 + ... 195 + ``` 196 + 197 + Actually, this is simpler than it looks. Lark calls `placement()` and `port()` before `qualified_ref()`. So `qualified_ref` receives: 198 + - A dict or `ParamRef` (from the ref-name rules) 199 + - A string or `ParamRef` (from the `placement` handler) 200 + - A `Port`/`int` or `ParamRef` (from the `port` handler) 201 + 202 + The existing type-based dispatch in `qualified_ref` needs one addition: if an arg is `ParamRef` and `ref_type` is already set, it's either placement or port. We can distinguish by wrapping them — the placement handler returns `("placement", ParamRef(...))` and port returns `("port", ParamRef(...))` when deferring. Or simpler: use a thin wrapper type. 
203 + 204 + Alternatively, Lark's `@v_args(inline=True)` on placement/port means the handler already knows which rule matched. The cleanest approach: return a `ParamRef` tagged with its role: 205 + 206 + ```python 207 + @dataclass(frozen=True) 208 + class PlacementRef: 209 + """Deferred placement from macro parameter.""" 210 + param: ParamRef 211 + 212 + @dataclass(frozen=True) 213 + class PortRef: 214 + """Deferred port from macro parameter.""" 215 + param: ParamRef 216 + ``` 217 + 218 + Then `qualified_ref` type-dispatches on `PlacementRef`/`PortRef` alongside `str`/`Port`/`int`. 219 + 220 + ### IR Changes 221 + 222 + `IRNode.pe` type becomes `Optional[Union[int, ParamRef]]`. 223 + 224 + `IREdge.port` type becomes `Union[Port, ParamRef]`. 225 + 226 + `IREdge.source_port` type becomes `Optional[Union[Port, ParamRef]]`. 227 + 228 + These wider types only appear in macro template bodies. After expansion, all `ParamRef` values are resolved to concrete types. The resolve, place, and allocate passes never see `ParamRef` — if one leaks through, it's a bug in expand. 
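The invariant stated above, that no `ParamRef` survives expansion, could be checked with a small post-expansion scan. The sketch below is illustrative only: `ParamRef`, `Node`, and `Edge` are simplified stand-ins for the real types in `ir.py`, and `leaked_param_refs` is a hypothetical helper, not an existing function in the assembler.

```python
from dataclasses import dataclass, fields
from typing import Any, Iterable, List


@dataclass(frozen=True)
class ParamRef:
    """Stand-in for the real ParamRef; only the parameter name is modelled."""
    name: str


@dataclass
class Node:
    """Simplified stand-in for IRNode (assumed fields)."""
    name: Any
    pe: Any = None


@dataclass
class Edge:
    """Simplified stand-in for IREdge (assumed fields)."""
    source: Any
    dest: Any
    port: Any = None
    source_port: Any = None


def leaked_param_refs(items: Iterable[Any]) -> List[str]:
    """Report every dataclass field that still holds an unresolved ParamRef."""
    leaks = []
    for item in items:
        for f in fields(item):
            value = getattr(item, f.name)
            if isinstance(value, ParamRef):
                leaks.append(f"{type(item).__name__}.{f.name} = ${{{value.name}}}")
    return leaks


expanded = [Node("&c", pe=ParamRef("pe")), Edge("&a", "&b", port="L")]
leaks = leaked_param_refs(expanded)
```

Running such a scan at the end of the expand pass would turn a leaked `ParamRef` into an explicit expand-stage error instead of a confusing type failure in resolve, place, or allocate.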
229 + 230 + ### Expand Pass 231 + 232 + `_clone_and_substitute_node` gains: 233 + 234 + ```python 235 + # Substitute PE placement if it's a ParamRef 236 + new_pe = node.pe 237 + if isinstance(new_pe, ParamRef): 238 + resolved = _substitute_param(new_pe, subst_map) 239 + # Must resolve to a PE identifier string like "pe0" or an int 240 + new_pe = _resolve_pe_placement(resolved) # parse "pe0" -> 0, or int -> int 241 + ``` 242 + 243 + `_clone_and_substitute_edge` gains: 244 + 245 + ```python 246 + # Substitute port if it's a ParamRef 247 + new_port = edge.port 248 + if isinstance(new_port, ParamRef): 249 + resolved = _substitute_param(new_port, subst_map) 250 + new_port = _resolve_port(resolved) # "L" -> Port.L, "R" -> Port.R, int -> int 251 + 252 + new_source_port = edge.source_port 253 + if isinstance(new_source_port, ParamRef): 254 + resolved = _substitute_param(new_source_port, subst_map) 255 + new_source_port = _resolve_port(resolved) 256 + ``` 257 + 258 + ### Validation 259 + 260 + Invalid port/placement values (e.g., passing `"banana"` as a port) produce a MACRO error during expansion. Post-expansion, the existing place and allocate passes validate that PE IDs are in range and ports are valid. 
261 + 262 + ### Examples 263 + 264 + Parameterized port selection: 265 + 266 + ``` 267 + ; Macro that wires to a caller-selected port 268 + #wire_to_port target, port |> { 269 + &src <| pass 270 + &src |> ${target}:${port} 271 + } 272 + 273 + ; Usage: wire to left port 274 + #wire_to_port &dest, L 275 + 276 + ; Usage: wire to right port 277 + #wire_to_port &dest, R 278 + ``` 279 + 280 + Parameterized PE placement: 281 + 282 + ``` 283 + ; Macro that places its node on a specific PE 284 + #placed_const val, pe |> { 285 + &c <| const, ${val} |${pe} 286 + &c |> @ret 287 + } 288 + 289 + ; Usage: place on pe0 290 + #placed_const 42, pe0 |> &target 291 + 292 + ; Usage: place on pe1 293 + #placed_const 42, pe1 |> &target 294 + ``` 295 + 296 + Combined — a macro that builds a cross-PE relay: 297 + 298 + ``` 299 + ; Route a value from one PE to another 300 + #cross_pe_relay src_pe, dst_pe |> { 301 + &hop <| pass |${src_pe} 302 + &hop |> @ret 303 + } 304 + 305 + ; Usage: 306 + #cross_pe_relay pe0, pe1 |> &destination 307 + ``` 308 + 309 + ### Context Slot Syntax 310 + 311 + Context slots use bracket syntax `[N]`, distinct from all other qualifiers: 312 + 313 + ``` 314 + &node|pe0[2] ; place on pe0, context slot 2 315 + &node[0] ; context slot 0, auto-placed PE 316 + &node|pe1[0..4] ; reserve context slots 0-4 for this instruction 317 + ``` 318 + 319 + The bracket syntax avoids overloading `:` (which already carries port, cell address, and potentially IRAM address semantics). `[N]` is exclusively context slots. 320 + 321 + #### Grammar 322 + 323 + ```lark 324 + // New production: 325 + ctx_slot: "[" (DEC_LIT | ctx_range | param_ref) "]" 326 + ctx_range: DEC_LIT ".." DEC_LIT 327 + 328 + // Updated qualified_ref: 329 + qualified_ref: (node_ref | label_ref | ... | param_ref) placement? ctx_slot? port? 330 + ``` 331 + 332 + `ctx_slot` appears between placement and port in the qualifier chain: `&node|pe0[2]:L`. 
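To make the qualifier ordering concrete, the sketch below recognises the chain `placement? ctx_slot? port?` with a regular expression. It only illustrates the ordering and is not the Lark grammar: it handles `&node` references with literal values only (no `${param}` forms), and `parse_qualified_ref` is a hypothetical name. Ranges are returned as the written `(low, high)` pair without deciding whether the upper bound is inclusive.

```python
import re
from typing import Dict, Optional, Tuple, Union

# Order matters: node, then |peN, then [N] or [N..M], then :port.
_QUALIFIED_REF = re.compile(
    r"(?P<node>&\w+)"
    r"(?:\|pe(?P<pe>\d+))?"
    r"(?:\[(?P<lo>\d+)(?:\.\.(?P<hi>\d+))?\])?"
    r"(?::(?P<port>[LR]|\d+))?"
)

Parsed = Dict[str, Union[str, int, Tuple[int, int], None]]


def parse_qualified_ref(text: str) -> Parsed:
    """Split a literal qualified reference into node, PE, context slot(s), and port."""
    match = _QUALIFIED_REF.fullmatch(text)
    if match is None:
        raise ValueError(f"not a qualified reference: {text!r}")
    ctx: Optional[Tuple[int, int]] = None
    if match.group("lo") is not None:
        low = int(match.group("lo"))
        # A single slot [N] is reported as the pair (N, N).
        high = int(match.group("hi")) if match.group("hi") is not None else low
        ctx = (low, high)
    pe = int(match.group("pe")) if match.group("pe") is not None else None
    return {"node": match.group("node"), "pe": pe, "ctx": ctx, "port": match.group("port")}
```

Because the pattern is anchored with `fullmatch`, a chain written in the wrong order, such as `&node:L[2]`, is rejected rather than partially accepted.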
333 + 334 + #### Use Cases 335 + 336 + - **Explicit context partitioning**: place parallel computations in distinct context slots to avoid matching store collisions 337 + - **Debugging**: force a known context layout for inspection 338 + - **Range reservation** (`[0..4]`): reserve a contiguous block of slots for an instruction that will be targeted by multiple parallel sources wired identically — not essential but a natural extension 339 + 340 + #### Parameterization 341 + 342 + Same mechanism as placement/port. `[${ctx}]` in a macro body, substituted to an integer during expansion: 343 + 344 + ``` 345 + #placed_op op, pe, ctx |> { 346 + &n <| ${op} |${pe}[${ctx}] 347 + &n |> @ret 348 + } 349 + 350 + ; Usage: 351 + #placed_op add, pe0, 2 |> &target 352 + ``` 353 + 354 + 355 + ## Enhancement 3: @ret Wiring for Macros 356 + 357 + ### Goal 358 + 359 + Allow macros to define output points using `@ret` / `@ret_name` markers, and wire them to destinations at the call site using the `|>` syntax. 360 + 361 + ### Grammar Change 362 + 363 + Add optional output list to `macro_call_stmt`: 364 + 365 + ```lark 366 + // Current: 367 + macro_call_stmt: "#" IDENT (argument ("," argument)*)? 368 + 369 + // Proposed: 370 + macro_call_stmt: "#" IDENT (argument ("," argument)*)? (FLOW_OUT call_output_list)? 371 + ``` 372 + 373 + This reuses the existing `call_output_list` and `call_output` productions from `call_stmt`. Same syntax: `#macro args |> &dest` or `#macro args |> name=&dest`. 
374 + 375 + ### Macro Body Convention 376 + 377 + Macro bodies use `@ret` and `@ret_name` in edge destinations, same as function bodies: 378 + 379 + ``` 380 + #loop op, init_val |> { 381 + &counter <| add 382 + &compare <| ${op} 383 + &counter |> &compare:L 384 + &inc <| inc 385 + &compare |> &inc:L 386 + &inc |> &counter:R 387 + ; Output edges use @ret convention 388 + &compare |> @ret_body 389 + &compare |> @ret_exit:R 390 + } 391 + ``` 392 + 393 + ### Lower Pass 394 + 395 + When lowering `macro_call_stmt` with a `FLOW_OUT` and `call_output_list`: 396 + 397 + 1. Parse the output list the same way `call_stmt` does (named/positional outputs) 398 + 2. Store output destinations on the `IRMacroCall` in a new field: `output_dests: tuple` 399 + 400 + The `IRMacroCall` dataclass gains: 401 + 402 + ```python 403 + @dataclass(frozen=True) 404 + class IRMacroCall: 405 + name: str 406 + positional_args: tuple 407 + named_args: tuple 408 + output_dests: tuple = () # New: output wiring destinations 409 + loc: Optional[SourceLoc] = None 410 + ``` 411 + 412 + ### Expand Pass 413 + 414 + After cloning and substituting the macro body, process `@ret` markers: 415 + 416 + 1. Scan expanded edges for destinations starting with `@ret` 417 + 2. For each `@ret` / `@ret_name` destination, look up the corresponding output from `IRMacroCall.output_dests` 418 + 3. Replace the `@ret*` destination with the actual target node name 419 + 4. If a `@ret*` marker has no matching output dest, report a MACRO error 420 + 421 + This is simpler than function call wiring because macros don't need: 422 + - Trampoline nodes (no cross-context routing) 423 + - `ctx_override` edges (macros inline into the caller's context) 424 + - `FREE_CTX` nodes (no context allocation) 425 + - Synthetic PASS nodes (direct edge replacement suffices) 426 + 427 + The `@ret` substitution in macros is purely edge rewriting — replace the symbolic `@ret_name` destination with the concrete node reference from the call site. 
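The four expand-pass steps above amount to a single pass over the cloned edges. The following is a minimal sketch under simplifying assumptions: `Edge` is a stand-in for `IREdge`, the call-site outputs are passed as a separate positional tuple and named dict rather than the real `output_dests` layout, and `MacroError` is a hypothetical exception for MACRO-category errors. The port written on the `@ret` edge is kept unchanged; how a port given at the call site would combine with it is not specified here.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Sequence


class MacroError(ValueError):
    """Hypothetical stand-in for an error reported in the MACRO category."""


@dataclass(frozen=True)
class Edge:
    """Simplified stand-in for IREdge: source node, destination name, optional port."""
    source: str
    dest: str
    port: Optional[str] = None


def rewrite_ret_edges(
    edges: Sequence[Edge],
    positional: Sequence[str],
    named: Dict[str, str],
    macro_name: str,
) -> List[Edge]:
    """Replace @ret / @ret_name destinations with the call site's target nodes."""
    rewritten = []
    for edge in edges:
        if edge.dest == "@ret":
            # Bare @ret maps to the first (or only) positional output.
            if not positional:
                raise MacroError(f"macro '#{macro_name}': @ret has no positional output")
            target = positional[0]
        elif edge.dest.startswith("@ret_"):
            name = edge.dest[len("@ret_"):]
            if name not in named:
                raise MacroError(f"macro '#{macro_name}': no output named '{name}'")
            target = named[name]
        else:
            # Ordinary internal edge: leave untouched.
            rewritten.append(edge)
            continue
        rewritten.append(replace(edge, dest=target))
    return rewritten
```

For the `#loop` body above, calling this with `named={"body": "&process", "exit": "&done"}` would turn `&compare |> @ret_exit:R` into an edge from `&compare` to `&done` on port `R`, with no extra nodes created.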
428 + 429 + ### Positional @ret Mapping 430 + 431 + Same convention as function calls: 432 + 433 + - Bare `@ret` maps to the first (or only) positional output 434 + - `@ret_name` maps to the named output `name=&dest` 435 + - Multiple bare `@ret` edges to different ports on the same output are valid 436 + 437 + ### Example 438 + 439 + ``` 440 + ; Define macro with outputs 441 + #loop_counted |> { 442 + &counter <| add 443 + &compare <| brgt 444 + &counter |> &compare:L 445 + &inc <| inc 446 + &compare |> &inc:L 447 + &inc |> &counter:R 448 + &compare |> @ret_body 449 + &compare |> @ret_exit:R 450 + } 451 + 452 + ; Invoke with output wiring 453 + #loop_counted |> body=&process, exit=&done 454 + &init |> #loop_counted_0.&counter:L 455 + &limit |> #loop_counted_0.&compare:R 456 + ``` 457 + 458 + Or positionally: 459 + 460 + ``` 461 + #simple_gate |> { 462 + &g <| gate 463 + &g |> @ret 464 + &g |> @ret:R ; second output port 465 + } 466 + 467 + ; Invoke — positional @ret maps to first output 468 + #simple_gate |> &body, &exit 469 + ``` 470 + 471 + 472 + ## Impact on Built-in Macros 473 + 474 + With these enhancements, the built-in library collapses significantly: 475 + 476 + ### Current (11 macros) 477 + 478 + ``` 479 + #loop_counted, #loop_while 480 + #permit_inject_1, #permit_inject_2, #permit_inject_3, #permit_inject_4 481 + #reduce_add_2, #reduce_add_3, #reduce_add_4 482 + ``` 483 + 484 + ### Proposed (4-5 macros, more capable) 485 + 486 + ``` 487 + ; Counted loop with output wiring 488 + #loop_counted |> { 489 + &counter <| add 490 + &compare <| brgt 491 + &counter |> &compare:L 492 + &inc <| inc 493 + &compare |> &inc:L 494 + &inc |> &counter:R 495 + &compare |> @ret_body 496 + &compare |> @ret_exit:R 497 + } 498 + 499 + ; Condition-tested loop 500 + #loop_while |> { 501 + &gate <| gate 502 + &gate |> @ret_body 503 + &gate |> @ret_exit:R 504 + } 505 + 506 + ; Permit injection — variadic, outputs via @ret 507 + #permit_inject *nodes |> { 508 + $( 509 + &p_${_idx} <| 
const, 1 510 + &p_${_idx} |> @ret 511 + ),* 512 + } 513 + 514 + ; Binary reduction tree — parameterized opcode + arity 515 + #reduce_2 op |> { 516 + &r <| ${op} 517 + } 518 + 519 + #reduce_3 op |> { 520 + &r0 <| ${op} 521 + &r1 <| ${op} 522 + &r0 |> &r1:L 523 + } 524 + 525 + #reduce_4 op |> { 526 + &r0 <| ${op} 527 + &r1 <| ${op} 528 + &r2 <| ${op} 529 + &r0 |> &r2:L 530 + &r1 |> &r2:R 531 + } 532 + ``` 533 + 534 + Usage: 535 + 536 + ``` 537 + ; Old: 538 + !#loop_counted 539 + &init |> #loop_counted_0.&counter:L 540 + &limit |> #loop_counted_0.&compare:R 541 + #loop_counted_0.&compare |> &body:L 542 + #loop_counted_0.&compare |> &exit:R 543 + 544 + ; New: 545 + #loop_counted |> body=&process, exit=&done 546 + &init |> #loop_counted_0.&counter:L 547 + &limit |> #loop_counted_0.&compare:R 548 + 549 + ; Old: 550 + !#reduce_add_4 551 + 552 + ; New: 553 + #reduce_4 add 554 + ``` 555 + 556 + Note: the `#permit_inject` example with variadic `@ret` is aspirational — it requires `@ret` to work inside repetition blocks, which means the `@ret` substitution must happen after repetition expansion. This ordering is already correct since repetition expansion happens before edge rewriting in the expand pass. 557 + 558 + 559 + ## Implementation Order 560 + 561 + 1. **Opcode parameters** — grammar change (`opcode: OPCODE | param_ref`), argument syntax (`positional_arg: ... | OPCODE`), expand pass substitution. Smallest diff, immediately useful. 562 + 563 + 2. **Qualified ref parameters** — grammar changes to `placement` and `port`, `PlacementRef`/`PortRef` wrapper types, IR type widening, expand pass substitution. Mechanically similar to opcode params, builds on the same `_substitute_param` infrastructure. 564 + 565 + 3. **@ret wiring for macros** — grammar change (output list on `macro_call_stmt`), `IRMacroCall.output_dests`, expand pass edge rewriting. Builds on existing `@ret` patterns from function calls. 566 + 567 + 4. 
**Built-in macro rewrite** — collapse per-variant macros using the new features. Backwards-incompatible (old macro names removed), but since the built-ins are bundled and the system is pre-1.0, this is acceptable. 568 + 569 + ## Open Questions 570 + 571 + 1. **Should macros with `@ret` also support `|>` on inputs?** Function calls use `$func a=&x |> @output`. Currently macro calls use `#macro arg1, arg2` for inputs. Adding `|>` for outputs is proposed above. Should inputs also support named wiring? Probably not needed — macros already have `${param}` for inputs, and the input wiring is fundamentally different (parameter substitution vs edge creation). 572 + 573 + 2. **Error messages for mismatched @ret counts.** If a macro body has `@ret_body` and `@ret_exit` but the call site only provides one output, what error? Probably MACRO category: "macro '#loop_counted' defines outputs @ret_body, @ret_exit but call provides 1 output". 574 + 575 + 3. **Interaction with nested macros.** If macro A calls macro B which has `@ret`, and A also has `@ret`, the scoping should work naturally — B's `@ret` resolves at B's call site (inside A's body), A's `@ret` resolves at A's call site. The existing scope qualification prevents name collisions.