OR-1 dataflow CPU sketch

OR-1 CPU - dfasm Assembler Architecture#

Covers the asm/ package: pipeline structure, IR design, pass architecture, code generation modes, and key implementation decisions.

See architecture-overview.md for the target hardware model. See dfasm-primer.md for the language itself.

Role#

The assembler translates dfasm source into emulator-ready configuration objects (PEConfig, SMConfig, seed tokens) or a hardware-faithful bootstrap token sequence. It bridges the gap between human-authored dataflow graph programs and the structures the emulator (and eventually hardware) consumes.

The assembler does NOT (currently) optimize. It does not reorder instructions for performance, fuse operations, or eliminate redundant sub-graphs. It is a faithful translator: the graph you write is the graph you get. Future optimization passes may be inserted between resolve and place, but the current pipeline is intentionally thin. Correctness first, cleverness later.

Pipeline Overview#

Seven stages. Parse and Lower build the initial IRGraph; every subsequent pass is a pure function from IRGraph → IRGraph (codegen maps IRGraph → output).

The pipeline is:

dfasm source
    │
    ▼
  Parse       Lark/Earley parser, dfasm.lark grammar
    │          → concrete syntax tree (CST)
    ▼
  Lower       CST → IRGraph (nodes, edges, regions, data defs,
    │            macro defs, macro calls, call sites)
    │          → name qualification, scope creation
    ▼
  Expand      macro expansion + function call wiring
    │          → clone macro bodies, substitute params, evaluate
    │            const expressions, wire cross-context call edges
    ▼
  Resolve     validate edge endpoints, detect scope violations
    │          → "did you mean" suggestions via Levenshtein distance
    ▼
  Place       assign PEs to unplaced nodes
    │          → greedy bin-packing with locality heuristic
    ▼
  Allocate    assign IRAM offsets, context slots, resolve addresses
    │          → dyadic-first layout, per-PE context scoping
    ▼
  Codegen     emit PEConfig/SMConfig + seeds (direct mode)
              or SM init → IRAM writes → seeds (token mode)

Each pass returns a new IRGraph. Graphs are never mutated after construction; each pass produces a fresh copy with the new information filled in. Errors accumulate in IRGraph.errors rather than failing fast, so the assembler reports all problems in a single run rather than forcing the programmer to fix them one at a time.
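The conventions above — immutable graphs, fresh copies per pass, accumulated errors surfaced at the end — can be sketched as follows. This is a minimal illustration: `IRGraph` here is a stand-in, not the real class from ir.py, and the toy `resolve` pass is hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class IRGraph:                      # stand-in for the real ir.py type
    nodes: tuple = ()
    errors: tuple = ()              # errors accumulate; passes never raise

def resolve(graph: IRGraph) -> IRGraph:
    # A pass returns a *new* graph rather than mutating its input.
    missing = tuple(n for n in graph.nodes if n.startswith("&undef"))
    new_errors = tuple(f"NAME: undefined reference {m}" for m in missing)
    return replace(graph, errors=graph.errors + new_errors)

def run_pipeline(graph: IRGraph, passes) -> IRGraph:
    for p in passes:
        graph = p(graph)            # each stage is IRGraph -> IRGraph
    if graph.errors:
        # surface every accumulated problem at once, like assemble()
        raise ValueError("\n".join(graph.errors))
    return graph

g = IRGraph(nodes=("&add", "&undef_x"))
assert resolve(g).errors == ("NAME: undefined reference &undef_x",)
assert g.errors == ()               # the input graph is untouched
```

The frozen dataclass plus `dataclasses.replace` is one idiomatic way to get the "fresh copy with new information filled in" behaviour without defensive copying in every pass.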

The public API orchestrates the pipeline and raises ValueError if any stage produces errors:

assemble(source: str) -> AssemblyResult       # direct mode
assemble_to_tokens(source: str) -> list       # token stream mode
round_trip(source: str) -> str                # parse → lower → serialize
serialize_graph(graph: IRGraph) -> str        # IRGraph → dfasm at any stage

IR Types (ir.py)#

Core Types#

| Type | Fields | Purpose |
| --- | --- | --- |
| IRNode | name, opcode, dest_l, dest_r, const, pe, iram_offset, ctx, loc, args, sm_id, seed | Single instruction in the dataflow graph. name and const can be ParamRef or ConstExpr in macro templates. |
| IREdge | source, dest, port, source_port, port_explicit, ctx_override, loc | Connection between two nodes. ctx_override marks cross-context call edges (ctx_mode=01). port_explicit tracks whether port was user-specified. |
| IRGraph | nodes, edges, regions, data_defs, system, errors, macro_defs, macro_calls, raw_call_sites, call_sites, builtin_line_offset | Complete program representation. Macro-related fields are populated by lower and consumed by expand. |
| IRRegion | tag, kind, body (IRGraph), loc | Nested scope (FUNCTION, LOCATION, or MACRO) |
| IRDataDef | name, sm_id, cell_addr, value, loc | SM cell initialisation |
| SystemConfig | pe_count, sm_count, iram_capacity, ctx_slots, loc | Hardware configuration from @system pragma (defaults: iram=128, ctx=16) |
| MacroDef | name, params, body (IRGraph), repetition_blocks, loc | Macro definition: formal parameters + template body with ParamRef placeholders |
| MacroParam | name, variadic | Formal macro parameter. variadic=True collects remaining args. |
| ParamRef | param, prefix, suffix | Placeholder for a macro parameter. Supports token pasting via prefix/suffix. |
| ConstExpr | expression, params, loc | Compile-time arithmetic expression (e.g., base + _idx + 1) |
| IRRepetitionBlock | body (IRGraph), variadic_param, loc | Repetition block (@each) expanded once per variadic argument |
| IRMacroCall | name, positional_args, named_args, loc | Macro invocation, consumed by expand pass |
| CallSiteResult | func_name, input_args, output_dests, loc | Intermediate call data from lower, consumed by expand |
| CallSite | func_name, call_id, input_edges, trampoline_nodes, free_ctx_nodes, loc | Processed call site metadata for per-call-site context allocation |

Destination Representation#

Destinations evolve through the pipeline:

  1. After lower: NameRef(name, port) - symbolic, unresolved
  2. After allocate: ResolvedDest(name, addr) - concrete Addr(a, port, pe) with IRAM offset, port, and target PE

This two-stage resolution lets early passes work with symbolic names while later passes see concrete hardware addresses, and the original names are retained for debug clarity.
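The two representations can be sketched as simple value types. Field names follow the document (NameRef, ResolvedDest, Addr), but the exact shapes here are illustrative, not the definitions in ir.py.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NameRef:                  # after lower: symbolic, unresolved
    name: str
    port: Optional[str] = None

@dataclass(frozen=True)
class Addr:                     # concrete hardware address
    a: int                      # IRAM offset on the target PE
    port: str                   # "L" or "R"
    pe: int                     # target PE id

@dataclass(frozen=True)
class ResolvedDest:             # after allocate: keeps the name for debugging
    name: str
    addr: Addr

sym = NameRef("&add", port="L")
# allocate swaps the symbolic reference for a concrete address while
# preserving the name, so diagnostics stay readable:
conc = ResolvedDest(sym.name, Addr(a=3, port="L", pe=0))
```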

Graph Traversal Utilities#

IRGraph provides recursive traversal functions for working with nested regions:

  • iter_all_subgraphs(): depth-first traversal of graph + all region bodies
  • collect_all_nodes(): flatten nodes from graph and all nested regions
  • collect_all_nodes_and_edges(): flatten both
  • collect_all_data_defs(): flatten data definitions
  • update_graph_nodes(): recursively update nodes while preserving region structure
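A plausible shape for the first two helpers, shown as a sketch rather than the actual ir.py code (only nodes and regions are modelled here):

```python
from dataclasses import dataclass, field
from typing import Iterator, List

@dataclass
class IRRegion:
    tag: str
    body: "IRGraph"

@dataclass
class IRGraph:
    nodes: List[str] = field(default_factory=list)
    regions: List[IRRegion] = field(default_factory=list)

def iter_all_subgraphs(graph: IRGraph) -> Iterator[IRGraph]:
    # Depth-first: yield the graph itself, then every nested region body.
    yield graph
    for region in graph.regions:
        yield from iter_all_subgraphs(region.body)

def collect_all_nodes(graph: IRGraph) -> List[str]:
    return [n for g in iter_all_subgraphs(graph) for n in g.nodes]

inner = IRGraph(nodes=["$main.&add"])
outer = IRGraph(nodes=["&seed"], regions=[IRRegion("$main", inner)])
assert collect_all_nodes(outer) == ["&seed", "$main.&add"]
```

Generators keep the traversal lazy, so passes that only need to visit subgraphs (rather than flatten them) pay no allocation cost.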

Pass Details#

Parse#

Uses the Lark library with the Earley parser algorithm. The formal grammar is defined in dfasm.lark. Earley is required because the grammar has ambiguities between location_dir (a bare qualified reference) and weak_edge (outputs before opcode) that require context-sensitive resolution.

The parser produces a concrete syntax tree (CST): a Lark Tree with Token terminals.

Lower (lower.py)#

A Lark Transformer that walks the CST bottom-up, converting each grammar rule into IR types.

Key transformations:

  • inst_def -> IRNode with opcode, const, PE placement, named args
  • plain_edge -> IREdge with source, dest, port qualifiers
  • strong_edge / weak_edge -> anonymous IRNode + input/output IREdge set (creates a CompositeResult)
  • func_def -> IRRegion(kind=FUNCTION) with a nested IRGraph body
  • macro_def -> MacroDef with ParamRef placeholders in body template
  • macro_call -> IRMacroCall with positional/named arguments
  • call_stmt -> CallSiteResult with function name, input args, output dests
  • location_dir -> IRRegion(kind=LOCATION): subsequent statements are collected into its body during post-processing
  • data_def -> IRDataDef with SM placement and cell address
  • system_pragma -> SystemConfig (stored on the transformer, attached to the final IRGraph)

Name qualification:

Labels (&name) inside function regions are qualified with the function scope: &add inside $main becomes $main.&add.

Opcode mapping (opcodes.py):

Mnemonic strings from the grammar are mapped to ALUOp or MemOp enum values via MNEMONIC_TO_OP. A complication: Python IntEnum sub-classes can share numeric values across types (ArithOp.ADD == 0 == MemOp.READ), so the reverse mapping and set membership tests use type-aware collections (TypeAwareOpToMnemonicDict, TypeAwareMonadicOpsSet) that key on (type, value) tuples internally.

Expand (expand.py)#

Processes macro definitions, macro invocations, and function call sites. This pass bridges the gap between the template-based IR from lowering and the concrete IR that resolve expects.

Macro expansion:

  1. Collect all MacroDef entries from the graph (including built-in macros prepended to every program)
  2. For each IRMacroCall, clone the macro's body template
  3. Substitute ParamRef placeholders with actual argument values in all contexts:
    • Const fields: literal value substitution
    • Edge endpoints: node reference substitution
    • Node names: token pasting with prefix/suffix
    • Opcode position: resolve mnemonic string via MNEMONIC_TO_OP to concrete ALUOp/MemOp
    • Placement qualifiers: resolve "pe0" -> 0 (via PlacementRef)
    • Port qualifiers: resolve "L" -> Port.L (via PortRef)
    • Context slot qualifiers: resolve integer values (via CtxSlotRef)
  4. Evaluate ConstExpr arithmetic expressions (supports +, -, *, // on integers and _idx)
  5. Expand IRRepetitionBlock entries once per variadic argument, binding _idx to the iteration index
  6. Qualify expanded names with scope prefixes: #macroname_N.&label for top-level, $func.#macro_N.&label inside functions
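Steps 3 and 4 can be sketched in isolation. The ast-based safe evaluator below (restricted to +, -, *, // over integers and bound names) is an assumption about how expand.py might restrict ConstExpr, not its actual code; `paste` illustrates ParamRef prefix/suffix token pasting.

```python
import ast
import operator

# only the operators the document says ConstExpr supports
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}

def eval_const_expr(expression: str, bindings: dict) -> int:
    """Evaluate e.g. 'base + _idx + 1' with macro params bound in bindings."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Name):
            return bindings[node.id]        # macro params and _idx
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        raise ValueError(f"disallowed expression: {expression}")
    return ev(ast.parse(expression, mode="eval"))

def paste(prefix: str, value: str, suffix: str) -> str:
    # ParamRef token pasting: prefix + argument + suffix
    return f"{prefix}{value}{suffix}"

assert eval_const_expr("base + _idx + 1", {"base": 8, "_idx": 2}) == 11
assert paste("&acc_", "3", "") == "&acc_3"
```

Walking the ast rather than calling `eval` keeps arbitrary Python (attribute access, calls) out of macro arguments by construction.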

Macro @ret wiring:

After body expansion, @ret / @ret_name markers in macro edge destinations are replaced with concrete node references from the call site's output list (IRMacroCall.output_dests). This is pure edge rewriting — no trampolines, no cross-context routing, no free_ctx. Macros inline into the caller's context.

Function call wiring:

  1. For each CallSiteResult from lowering, create a CallSite with a unique call_id
  2. Generate trampoline PASS nodes for return routing
  3. Create IREdge entries with ctx_override=True for cross-context argument passing (becomes ctx_mode=01 in codegen)
  4. Generate FREE_CTX nodes for context teardown on call completion
  5. Wire @ret / @ret_name synthetic nodes for return paths (function @ret is distinct from macro @ret — functions create trampolines with context management)

Post-conditions:

After expansion, the IR contains only concrete IRNode/IREdge entries. No ParamRef placeholders, no MacroDef regions, no IRMacroCall entries remain.

Resolve (resolve.py)#

Validates that all edge endpoints exist in the flattened namespace.

Process:

  1. Flatten all nodes from the graph and all nested regions into a single namespace
  2. Build a scope map: qualified name -> defining scope
  3. For each edge, check that both source and dest exist
  4. Detect cross-function label references (a &label in $main cannot reference a &label in $other) and flag as SCOPE errors
  5. For undefined references, compute Levenshtein distance against all known names and suggest the closest match ("did you mean &addr?")

Resolve does not modify nodes; it only appends errors.
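The suggestion step (5) is classic dynamic-programming Levenshtein distance over all known names. A sketch, with an illustrative closeness cutoff that resolve.py may choose differently:

```python
from typing import List, Optional

def levenshtein(a: str, b: str) -> int:
    # standard two-row DP edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(undefined: str, known: List[str]) -> Optional[str]:
    best = min(known, key=lambda n: levenshtein(undefined, n))
    # only suggest reasonably close matches (cutoff is illustrative)
    return best if levenshtein(undefined, best) <= 2 else None

assert suggest("&adr", ["&addr", "&sub", "&mul"]) == "&addr"
```

The resulting message then reads "did you mean &addr?", matching the example in the text.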

Place (place.py)#

Assigns PE IDs to nodes that don't have explicit placement.

Explicit placements (from |pe0 qualifiers in source) are validated first; any pe >= pe_count is rejected.

Auto-placement algorithm for unplaced nodes (greedy, insertion order):

  1. For each unplaced node, find its connected neighbours via edges
  2. Count PE occurrences among placed neighbours (locality heuristic)
  3. Sort candidate PEs by: most neighbours (descending), then most remaining IRAM capacity (tie-break)
  4. Place on the first PE with room for both IRAM and context slots
  5. If no PE fits, record a placement error with per-PE utilization breakdown

IRAM cost model:

  • Dyadic instruction: 2 slots (instruction + matching store entry)
  • Monadic instruction: 1 slot
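The greedy loop and the cost model together can be sketched as below. Node and edge shapes are illustrative stand-ins for the real IR types, and context-slot accounting is omitted for brevity.

```python
from collections import Counter

def cost(node) -> int:
    # stated IRAM cost model: dyadic = 2 slots, monadic = 1
    return 2 if node["dyadic"] else 1

def neighbours(name, edges):
    for a, b in edges:
        if a == name:
            yield b
        elif b == name:
            yield a

def place(nodes, edges, pe_count, iram_capacity=128):
    used = Counter()                        # IRAM slots consumed per PE
    placed = {}                             # node name -> PE id
    for n in nodes:                         # explicit |peN qualifiers first
        if n["pe"] is not None:
            placed[n["name"]] = n["pe"]
            used[n["pe"]] += cost(n)
    for n in nodes:                         # greedy, in insertion order
        if n["name"] in placed:
            continue
        score = Counter(placed[m] for m in neighbours(n["name"], edges)
                        if m in placed)
        # most placed neighbours first, then most free IRAM as tie-break
        for p in sorted(range(pe_count), key=lambda p: (-score[p], used[p])):
            if used[p] + cost(n) <= iram_capacity:
                placed[n["name"]] = p
                used[p] += cost(n)
                break
        else:
            raise ValueError(f"PLACEMENT: no PE has room for {n['name']}")
    return placed

nodes = [{"name": "&a", "pe": 0, "dyadic": True},
         {"name": "&b", "pe": None, "dyadic": False}]
assert place(nodes, [("&a", "&b")], pe_count=2) == {"&a": 0, "&b": 0}
```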

System config inference:

If no @system pragma is provided, the placer infers pe_count from the highest explicit PE ID and uses defaults for IRAM capacity (128) and context slots (16).

Allocate (allocate.py)#

Assigns three things per node: IRAM offset, context slot, and resolved destinations.

IRAM layout (per PE):

Dyadic instructions are packed at low offsets (0..D-1), monadic at higher offsets (D..D+M-1). This matches the hardware contract in pe-design.md: the token's offset field doubles as the matching store entry for dyadic instructions, so they must occupy the dense low range.
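Under this rule, offset assignment for a single PE might look like the sketch below (the real allocator works over IRNodes; this keeps only the dyadic/monadic split):

```python
def assign_offsets(nodes):
    """nodes: list of (name, is_dyadic) for one PE, in program order.
    Dyadics get the dense low range 0..D-1 so a token's offset field can
    double as the matching store index; monadics follow at D..D+M-1."""
    offsets = {}
    next_off = 0
    for name, dyadic in nodes:
        if dyadic:
            offsets[name] = next_off
            next_off += 1
    for name, dyadic in nodes:
        if not dyadic:
            offsets[name] = next_off
            next_off += 1
    return offsets

pe0 = [("&add", True), ("&pass", False), ("&mul", True)]
assert assign_offsets(pe0) == {"&add": 0, "&mul": 1, "&pass": 2}
```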

Context slot assignment (per PE):

Each function scope gets a distinct context slot. The root scope (top-level) always gets slot 0. Additional scopes get slots 1, 2, ... in order of first appearance. Overflow beyond ctx_slots is a RESOURCE error.
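This assignment is essentially first-appearance numbering with the root scope pinned to slot 0. A sketch (scope keys and the overflow message are illustrative):

```python
def assign_ctx_slots(scopes_in_order, ctx_slots=16):
    """scopes_in_order: scope tags as encountered; '' is the root scope."""
    slots = {"": 0}                    # root scope is always slot 0
    for scope in scopes_in_order:
        if scope not in slots:
            slots[scope] = len(slots)  # 1, 2, ... by first appearance
            if slots[scope] >= ctx_slots:
                raise ValueError(f"RESOURCE: context slots exhausted at {scope}")
    return slots

assert assign_ctx_slots(["", "$main", "$helper", "$main"]) == \
    {"": 0, "$main": 1, "$helper": 2}
```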

Destination resolution:

For each node, outgoing edges are resolved to ResolvedDest objects containing concrete Addr(a=iram_offset, port=Port.L|R, pe=target_pe). The allocator handles:

  • Single outgoing edge -> dest_l
  • Two outgoing edges -> dest_l + dest_r (distinguished by source_port qualifier or positional order)
  • Port conflicts (duplicate ports, mixed explicit/implicit) -> PORT error

SM ID assignment:

For MemOp nodes, the allocator assigns the target SM ID. Single-SM systems default to sm_id=0. Multi-SM systems with ambiguous targets produce a RESOURCE error.

Codegen (codegen.py)#

Two output modes from the same allocated IRGraph:

Direct mode (generate_direct() → AssemblyResult):

Produces immediately usable emulator configuration:

  • pe_configs: list of PEConfig with populated IRAM (ALUInst/SMInst by offset), context slot count, and route restrictions
  • sm_configs: list of SMConfig with initial cell values from data definitions
  • seed_tokens: CMToken for each strongly-connected CONST node with no incoming edges

Route restrictions are computed by scanning all edges from each PE to determine which other PEs and SMs it can reach. Self-routes are always included.

Token stream mode (generate_tokens() → list):

Produces a hardware-faithful bootstrap sequence:

  1. SM init tokens: SMToken(op=WRITE) for each data definition
  2. IRAM writes: IRAMWrite tokens for each PE that has instructions
  3. Seed tokens: same CMToken list as direct mode

This ordering mirrors what the hardware bootstrap would do: initialize structure memory, load instruction memory, then inject the initial tokens that start execution.

Both modes reuse the same internal logic. Token stream mode calls generate_direct() internally and then re-formats the result.

Error Handling (errors.py)#

Errors are structured with category, source location, message, and optional suggestions.

| Category | Stage | Examples |
| --- | --- | --- |
| PARSE | lower | unknown opcode, malformed grammar |
| NAME | resolve | undefined node reference |
| SCOPE | resolve, lower | cross-function label reference, duplicate definition |
| PLACEMENT | place | PE ID out of range, capacity exceeded |
| RESOURCE | allocate | IRAM overflow, context slot overflow, missing SM target |
| ARITY | lower | wrong operand count |
| PORT | allocate | port conflicts, missing destinations |
| MACRO | expand | undefined macro, wrong argument count, expansion failure |
| CALL | expand | undefined function, missing return path, call wiring error |
| UNREACHABLE | (future) | unused nodes |
| VALUE | lower | out-of-range literals |

Error formatting follows a Rust-style pattern:

error[SCOPE]: Duplicate label '&add' in function '$main'
 --> line 5, column 3
  |
5 |   &add <| sub
  |   ^^^
  = help: First defined at line 2
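A renderer for this shape is mostly string alignment on the line-number gutter. The sketch below mirrors the structured-error fields named earlier (category, location, message, suggestions), but the exact formatter in errors.py may differ:

```python
def render(category, message, line, column, source_line, span, help_text=None):
    """Render one diagnostic in the Rust-style layout shown above.
    span = number of caret characters under the offending token."""
    gutter = " " * len(str(line))          # aligns the pipes with "N |"
    out = [f"error[{category}]: {message}",
           f"{gutter}--> line {line}, column {column}",
           f"{gutter} |",
           f"{line} | {source_line}",
           f"{gutter} | {' ' * (column - 1)}{'^' * span}"]
    if help_text:
        out.append(f"{gutter} = help: {help_text}")
    return "\n".join(out)

print(render("SCOPE", "Duplicate label '&add' in function '$main'",
             line=5, column=3, source_line="  &add <| sub", span=4,
             help_text="First defined at line 2"))
```

Keeping the gutter width derived from the line number means multi-digit line numbers stay aligned without special-casing.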

Serialization and Round-Tripping (serialize.py)#

The serializer emits valid dfasm source from an IRGraph at any pipeline stage. This enables:

  • Round-trip testing: source → parse → lower → serialize → source' verifies that the parser and lowering pass preserve the program
  • IR inspection: dump the graph after any pass to see what the assembler is doing
  • Code generation from IR: future tools could construct IRGraph objects programmatically and serialize them to dfasm

The serializer unqualifies names inside function regions (strips the $func. prefix), preserves port and placement qualifiers, and formats values as hex (if > 255) or decimal.

Future Work#

  • Optimization passes between resolve and place: dead node elimination, constant folding, sub-graph deduplication
  • Wider placement heuristics: graph partitioning, min-cut algorithms, or profile-guided placement for larger programs
  • Incremental reassembly: modify part of the graph and re-run only affected passes
  • Hardware encoding pass: translate ALUInst/SMInst to bit-level instruction words for actual IRAM loading
  • Conditional macro expansion: the current macro system supports variadic repetition, constant arithmetic, opcode parameters, parameterized qualifiers, @ret output wiring (including positional @ret in variadic repetition blocks), and nested macro invocation (depth limit 32), but not conditionals within macros
  • Generic variadic reduction: a single #reduce macro that infers tree depth from variadic argument count (requires conditional expansion to handle non-power-of-2 inputs)