OR-1 dataflow CPU sketch
# OR-1 CPU - dfasm Assembler Architecture

Covers the `asm/` package: pipeline structure, IR design, pass architecture, code generation modes, and key implementation decisions.

See `architecture-overview.md` for the target hardware model.
See `dfasm-primer.md` for the language itself.

## Role

The assembler translates dfasm source into emulator-ready configuration objects (`PEConfig`, `SMConfig`, seed tokens) or a hardware-faithful bootstrap token sequence. It bridges the gap between human-authored dataflow graph programs and the structures the emulator (and eventually hardware) consumes.

The assembler does NOT (currently) optimize. It does not reorder instructions for performance, fuse operations, or eliminate redundant sub-graphs. It is a faithful translator: the graph you write is the graph you get. Future optimization passes may be inserted between resolve and place, but the current pipeline is intentionally thin. Correctness first, cleverness later.

## Pipeline Overview

Seven stages, each a pure function from `IRGraph → IRGraph` (or `IRGraph → output`).
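A minimal sketch of that pass discipline, using hypothetical simplified types (the real `IRGraph` in `ir.py` has many more fields, and `resolve` here is a toy stand-in for the actual pass):

```python
from dataclasses import dataclass, replace

# Simplified stand-in for the real IRGraph; field set is illustrative only.
@dataclass(frozen=True)
class IRGraph:
    nodes: tuple = ()
    errors: tuple = ()

def resolve(graph: IRGraph) -> IRGraph:
    # A pass never mutates its input: it returns a fresh copy,
    # accumulating problems in `errors` instead of raising.
    if any(n is None for n in graph.nodes):
        return replace(graph, errors=graph.errors + ("NAME: undefined node",))
    return graph

def assemble(graph: IRGraph) -> IRGraph:
    # The real pipeline chains parse, lower, expand, resolve, place,
    # allocate, codegen; only the error-handling shape is shown here.
    for stage in (resolve,):
        graph = stage(graph)
    if graph.errors:
        raise ValueError("; ".join(graph.errors))
    return graph
```

Because passes only append to `errors`, one run surfaces every problem at once, and the public entry points raise a single `ValueError` at the end.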
The pipeline is:

```
dfasm source
    │
  Parse     Lark/Earley parser, dfasm.lark grammar
    │       → concrete syntax tree (CST)
    │
  Lower     CST → IRGraph (nodes, edges, regions, data defs,
    │       macro defs, macro calls, call sites)
    │       → name qualification, scope creation
    │
  Expand    macro expansion + function call wiring
    │       → clone macro bodies, substitute params, evaluate
    │       const expressions, wire cross-context call edges
    │
  Resolve   validate edge endpoints, detect scope violations
    │       → "did you mean" suggestions via Levenshtein distance
    │
  Place     assign PEs to unplaced nodes
    │       → greedy bin-packing with locality heuristic
    │
  Allocate  assign IRAM offsets, context slots, resolve addresses
    │       → dyadic-first layout, per-PE context scoping
    │
  Codegen   emit PEConfig/SMConfig + seeds (direct mode)
            or SM init → IRAM writes → seeds (token mode)
```

Each pass returns a new `IRGraph`. Graphs are never mutated after construction; each pass produces a fresh copy with the new information filled in. Errors accumulate in `IRGraph.errors` rather than failing fast, so the assembler reports all problems in a single run rather than forcing the programmer to fix them one at a time.

The public API orchestrates the pipeline and raises `ValueError` if any stage produces errors:

```python
assemble(source: str) -> AssemblyResult    # direct mode
assemble_to_tokens(source: str) -> list    # token stream mode
round_trip(source: str) -> str             # parse → lower → serialize
serialize_graph(graph: IRGraph) -> str     # IRGraph → dfasm at any stage
```

## IR Types (`ir.py`)

### Core Types

| Type | Fields | Purpose |
|------|--------|---------|
| `IRNode` | name, opcode, dest_l, dest_r, const, pe, iram_offset, ctx, loc, args, sm_id, seed | Single instruction in the dataflow graph. `name` and `const` can be `ParamRef` or `ConstExpr` in macro templates. |
| `IREdge` | source, dest, port, source_port, port_explicit, ctx_override, loc | Connection between two nodes. `ctx_override` marks cross-context call edges (ctx_mode=01). `port_explicit` tracks whether the port was user-specified. |
| `IRGraph` | nodes, edges, regions, data_defs, system, errors, macro_defs, macro_calls, raw_call_sites, call_sites, builtin_line_offset | Complete program representation. Macro-related fields are populated by lower and consumed by expand. |
| `IRRegion` | tag, kind, body (IRGraph), loc | Nested scope (FUNCTION, LOCATION, or MACRO) |
| `IRDataDef` | name, sm_id, cell_addr, value, loc | SM cell initialisation |
| `SystemConfig` | pe_count, sm_count, iram_capacity, ctx_slots, loc | Hardware configuration from `@system` pragma (defaults: iram=128, ctx=16) |
| `MacroDef` | name, params, body (IRGraph), repetition_blocks, loc | Macro definition: formal parameters + template body with `ParamRef` placeholders |
| `MacroParam` | name, variadic | Formal macro parameter. `variadic=True` collects remaining args. |
| `ParamRef` | param, prefix, suffix | Placeholder for a macro parameter. Supports token pasting via prefix/suffix. |
| `ConstExpr` | expression, params, loc | Compile-time arithmetic expression (e.g., `base + _idx + 1`) |
| `IRRepetitionBlock` | body (IRGraph), variadic_param, loc | Repetition block (`@each`) expanded once per variadic argument |
| `IRMacroCall` | name, positional_args, named_args, loc | Macro invocation, consumed by the expand pass |
| `CallSiteResult` | func_name, input_args, output_dests, loc | Intermediate call data from lower, consumed by expand |
| `CallSite` | func_name, call_id, input_edges, trampoline_nodes, free_ctx_nodes, loc | Processed call-site metadata for per-call-site context allocation |

### Destination Representation

Destinations evolve through the pipeline:

1. **After lower**: `NameRef(name, port)` - symbolic, unresolved
2. **After allocate**: `ResolvedDest(name, addr)` - concrete `Addr(a, port, pe)` with IRAM offset, port, and target PE

This two-stage resolution lets early passes work with symbolic names while later passes have concrete hardware addresses, and it keeps the names around for debug clarity.

### Graph Traversal Utilities

`IRGraph` provides recursive traversal functions for working with nested regions:

- `iter_all_subgraphs()`: depth-first traversal of the graph + all region bodies
- `collect_all_nodes()`: flatten nodes from the graph and all nested regions
- `collect_all_nodes_and_edges()`: flatten both
- `collect_all_data_defs()`: flatten data definitions
- `update_graph_nodes()`: recursively update nodes while preserving region structure

## Pass Details

### Parse

Uses the Lark library with the Earley parser algorithm. The formal grammar is defined in `dfasm.lark`. Earley is required because the grammar has ambiguities between `location_dir` (a bare qualified reference) and `weak_edge` (outputs before opcode) that require context-sensitive resolution.

The parser produces a concrete syntax tree (CST): a tree of Lark `Tree` objects with `Token` terminals.

### Lower (`lower.py`)

A Lark `Transformer` that walks the CST bottom-up, converting each grammar rule into IR types.
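The bottom-up rewriting pattern can be illustrated with a stdlib-only stand-in (the real code subclasses `lark.Transformer`; `MiniLower`, the tuple-based CST, and the stripped-down `IRNode` here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class IRNode:
    # Stripped-down stand-in for the real IRNode.
    name: str
    opcode: str

class MiniLower:
    """One method per grammar rule; children arrive already rewritten,
    which is what 'bottom-up' means for a Lark Transformer."""

    def transform(self, tree):
        rule, children = tree
        children = [self.transform(c) if isinstance(c, tuple) else c
                    for c in children]
        return getattr(self, rule)(children)

    def inst_def(self, children):
        # Grammar rule -> IR type, mirroring the real inst_def handler.
        name, opcode = children
        return IRNode(name=name, opcode=opcode)

    def start(self, children):
        return list(children)

# A toy CST: (rule_name, children) tuples standing in for lark.Tree.
cst = ("start", [("inst_def", ["&a", "add"]),
                 ("inst_def", ["&b", "sub"])])
print(MiniLower().transform(cst))
```

By the time `inst_def` runs, its children are plain values rather than subtrees, so each handler only has to assemble one IR object.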
**Key transformations:**

- `inst_def` → `IRNode` with opcode, const, PE placement, named args
- `plain_edge` → `IREdge` with source, dest, port qualifiers
- `strong_edge` / `weak_edge` → anonymous `IRNode` + input/output `IREdge` set (creates a `CompositeResult`)
- `func_def` → `IRRegion(kind=FUNCTION)` with a nested `IRGraph` body
- `macro_def` → `MacroDef` with `ParamRef` placeholders in the body template
- `macro_call` → `IRMacroCall` with positional/named arguments
- `call_stmt` → `CallSiteResult` with function name, input args, output dests
- `location_dir` → `IRRegion(kind=LOCATION)`: subsequent statements are collected into its body during post-processing
- `data_def` → `IRDataDef` with SM placement and cell address
- `system_pragma` → `SystemConfig` (stored on the transformer, attached to the final `IRGraph`)

**Name qualification:**

Labels (`&name`) inside function regions are qualified with the function scope: `&add` inside `$main` becomes `$main.&add`.

**Opcode mapping (`opcodes.py`):**

Mnemonic strings from the grammar are mapped to `ALUOp` or `MemOp` enum values via `MNEMONIC_TO_OP`. A complication: Python `IntEnum` sub-classes can share numeric values across types (`ArithOp.ADD == 0 == MemOp.READ`), so the reverse mapping and set membership tests use type-aware collections (`TypeAwareOpToMnemonicDict`, `TypeAwareMonadicOpsSet`) that key on `(type, value)` tuples internally.

### Expand (`expand.py`)

Processes macro definitions, macro invocations, and function call sites. This pass bridges the gap between the template-based IR from lowering and the concrete IR that resolve expects.

**Macro expansion:**

1. Collect all `MacroDef` entries from the graph (including built-in macros prepended to every program)
2. For each `IRMacroCall`, clone the macro's body template
3. Substitute `ParamRef` placeholders with actual argument values in all contexts:
   - **Const fields**: literal value substitution
   - **Edge endpoints**: node reference substitution
   - **Node names**: token pasting with prefix/suffix
   - **Opcode position**: resolve mnemonic string via `MNEMONIC_TO_OP` to a concrete `ALUOp`/`MemOp`
   - **Placement qualifiers**: resolve `"pe0"` → `0` (via `PlacementRef`)
   - **Port qualifiers**: resolve `"L"` → `Port.L` (via `PortRef`)
   - **Context slot qualifiers**: resolve integer values (via `CtxSlotRef`)
4. Evaluate `ConstExpr` arithmetic expressions (supports `+`, `-`, `*`, `//` on integers and `_idx`)
5. Expand `IRRepetitionBlock` entries once per variadic argument, binding `_idx` to the iteration index
6. Qualify expanded names with scope prefixes: `#macroname_N.&label` at top level, `$func.#macro_N.&label` inside functions

**Macro `@ret` wiring:**

After body expansion, `@ret` / `@ret_name` markers in macro edge destinations are replaced with concrete node references from the call site's output list (`IRMacroCall.output_dests`). This is pure edge rewriting: no trampolines, no cross-context routing, no `free_ctx`. Macros inline into the caller's context.

**Function call wiring:**

1. For each `CallSiteResult` from lowering, create a `CallSite` with a unique `call_id`
2. Generate trampoline `PASS` nodes for return routing
3. Create `IREdge` entries with `ctx_override=True` for cross-context argument passing (becomes `ctx_mode=01` in codegen)
4. Generate `FREE_CTX` nodes for context teardown on call completion
5. Wire `@ret` / `@ret_name` synthetic nodes for return paths (function `@ret` is distinct from macro `@ret`: functions create trampolines with context management)

**Post-conditions:**

After expansion, the IR contains only concrete `IRNode`/`IREdge` entries. No `ParamRef` placeholders, `MacroDef` regions, or `IRMacroCall` entries remain.
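The `ConstExpr` evaluation in step 4 above can be sketched with a small `ast`-based evaluator. This is a hypothetical helper, not the assembler's actual code; it shows only the supported operator set (`+`, `-`, `*`, `//`) and parameter binding, including `_idx`:

```python
import ast
import operator

# Only the four operators the macro system supports.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}

def eval_const_expr(expression: str, bindings: dict) -> int:
    """Evaluate a compile-time expression like 'base + _idx + 1'.
    `bindings` maps parameter names (and `_idx`) to integers."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name) and node.id in bindings:
            return bindings[node.id]
        raise ValueError(f"unsupported const expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)

print(eval_const_expr("base + _idx + 1", {"base": 8, "_idx": 2}))  # 11
```

Walking the `ast` nodes directly, rather than calling `eval`, keeps the expression language closed: anything outside the four operators and known names is rejected rather than silently executed.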
### Resolve (`resolve.py`)

Validates that all edge endpoints exist in the flattened namespace.

**Process:**

1. Flatten all nodes from the graph and all nested regions into a single namespace
2. Build a scope map: qualified name → defining scope
3. For each edge, check that both source and dest exist
4. Detect cross-function label references (a `&label` in `$main` cannot reference a `&label` in `$other`) and flag them as SCOPE errors
5. For undefined references, compute Levenshtein distance against all known names and suggest the closest match ("did you mean `&addr`?")

Resolve does not modify nodes; it only appends errors.

### Place (`place.py`)

Assigns PE IDs to nodes that don't have explicit placement.

**Explicit placements** (from `|pe0` qualifiers in source) are validated first: any `pe >= pe_count` is rejected.

**Auto-placement algorithm** for unplaced nodes (greedy, insertion order):

1. For each unplaced node, find its connected neighbours via edges
2. Count PE occurrences among placed neighbours (locality heuristic)
3. Sort candidate PEs by: most neighbours (descending), then most remaining IRAM capacity (tie-break)
4. Place on the first PE with room for both IRAM and context slots
5. If no PE fits, record a placement error with a per-PE utilization breakdown

**IRAM cost model:**

- Dyadic instruction: 2 slots (instruction + matching store entry)
- Monadic instruction: 1 slot

**System config inference:**

If no `@system` pragma is provided, the placer infers `pe_count` from the highest explicit PE ID and uses defaults for IRAM capacity (128) and context slots (16).

### Allocate (`allocate.py`)

Assigns three things per node: IRAM offset, context slot, and resolved destinations.

**IRAM layout (per PE):**

Dyadic instructions are packed at low offsets (0..D-1), monadic at higher offsets (D..D+M-1). This matches the hardware contract in `pe-design.md`: the token's offset field doubles as the matching store entry for dyadic instructions, so they must occupy the dense low range.

**Context slot assignment (per PE):**

Each function scope gets a distinct context slot. The root scope (top level) always gets slot 0. Additional scopes get slots 1, 2, ... in order of first appearance. Overflow beyond `ctx_slots` is a RESOURCE error.

**Destination resolution:**

For each node, outgoing edges are resolved to `ResolvedDest` objects containing a concrete `Addr(a=iram_offset, port=Port.L|R, pe=target_pe)`. The allocator handles:

- A single outgoing edge → `dest_l`
- Two outgoing edges → `dest_l` + `dest_r` (distinguished by `source_port` qualifier or positional order)
- Port conflicts (duplicate ports, mixed explicit/implicit) → PORT error

**SM ID assignment:**

For `MemOp` nodes, the allocator assigns the target SM ID. Single-SM systems default to `sm_id=0`. Multi-SM systems with ambiguous targets produce a RESOURCE error.

### Codegen (`codegen.py`)

Two output modes from the same allocated `IRGraph`:

**Direct mode** (`generate_direct() → AssemblyResult`):

Produces immediately usable emulator configuration:

- `pe_configs`: list of `PEConfig` with populated IRAM (ALUInst/SMInst by offset), context slot count, and route restrictions
- `sm_configs`: list of `SMConfig` with initial cell values from data definitions
- `seed_tokens`: a `CMToken` for each strongly-connected `CONST` node with no incoming edges

Route restrictions are computed by scanning all edges from each PE to determine which other PEs and SMs it can reach. Self-routes are always included.

**Token stream mode** (`generate_tokens() → list`):

Produces a hardware-faithful bootstrap sequence:

1. **SM init tokens**: `SMToken(op=WRITE)` for each data definition
2. **IRAM write tokens**: `IRAMWrite` tokens per PE with instructions
3. **Seed tokens**: the same `CMToken` list as direct mode

This ordering mirrors what the hardware bootstrap would do: initialize structure memory, load instruction memory, then inject the initial tokens that start execution.

Both modes reuse the same internal logic. Token stream mode calls `generate_direct()` internally and then re-formats the result.

## Error Handling (`errors.py`)

Errors are structured with category, source location, message, and optional suggestions.

| Category | Stage | Examples |
|----------|-------|----------|
| PARSE | lower | unknown opcode, malformed grammar |
| NAME | resolve | undefined node reference |
| SCOPE | resolve, lower | cross-function label reference, duplicate definition |
| PLACEMENT | place | PE ID out of range, capacity exceeded |
| RESOURCE | allocate | IRAM overflow, context slot overflow, missing SM target |
| ARITY | lower | wrong operand count |
| PORT | allocate | port conflicts, missing destinations |
| MACRO | expand | undefined macro, wrong argument count, expansion failure |
| CALL | expand | undefined function, missing return path, call wiring error |
| UNREACHABLE | (future) | unused nodes |
| VALUE | lower | out-of-range literals |

Error formatting follows a Rust-style pattern:

```
error[SCOPE]: Duplicate label '&add' in function '$main'
 --> line 5, column 3
  |
5 | &add <| sub
  | ^^^
  = help: First defined at line 2
```

## Serialization and Round-Tripping (`serialize.py`)

The serializer emits valid dfasm source from an `IRGraph` at any pipeline stage. This enables:

- **Round-trip testing**: `source → parse → lower → serialize → source'` verifies that the parser and lowering pass preserve the program
- **IR inspection**: dump the graph after any pass to see what the assembler is doing
- **Code generation from IR**: future tools could construct `IRGraph` objects programmatically and serialize them to dfasm

The serializer unqualifies names inside function regions (strips the `$func.` prefix), preserves port and placement qualifiers, and formats values as hex (if > 255) or decimal.

## Future Work

- **Optimization passes** between resolve and place: dead node elimination, constant folding, sub-graph deduplication
- **Wider placement heuristics**: graph partitioning, min-cut algorithms, or profile-guided placement for larger programs
- **Incremental reassembly**: modify part of the graph and re-run only the affected passes
- **Hardware encoding pass**: translate ALUInst/SMInst to bit-level instruction words for actual IRAM loading
- **Conditional macro expansion**: the current macro system supports variadic repetition, constant arithmetic, opcode parameters, parameterized qualifiers, `@ret` output wiring (including positional `@ret` in variadic repetition blocks), and nested macro invocation (depth limit 32), but not conditionals within macros
- **Generic variadic reduction**: a single `#reduce` macro that infers tree depth from variadic argument count (requires conditional expansion to handle non-power-of-2 inputs)