Assembler Redesign Plan: Frame-Based PE Model#
This document describes the changes needed in the OR1 assembler (asm/)
to match the frame-based PE redesign described in pe-design.md and
architecture-overview.md. It covers every assembler pass and supporting
module, identifying what changes, what stays, and the dependency order
for implementation.
The assembler pipeline is: parse -> lower -> expand -> resolve -> place -> allocate -> codegen. Most changes concentrate in allocate and codegen; earlier passes are minimally affected.
1. Overview#
The PE redesign replaces the context-slot matching model with a frame-based model. Key architectural changes that affect the assembler:
- Context slots replaced by activation IDs and frames. The old ctx_slots parameter (up to 16 slots) becomes frame_count (4 concurrent frames) with 3-bit act_id (8 unique IDs). Each frame has 64 addressable slots.
- Instructions become templates. IRAM entries are 16-bit words [type:1][opcode:5][mode:3][wide:1][fref:6]. Constants and destinations are NOT in the instruction -- they live in frame slots referenced by fref.
- Instruction deduplication. Multiple activations can share the same IRAM entry because per-activation data lives in frames, not instructions. The number of IRAM entries needed is the number of unique operation shapes, not the total operations executed.
- 8 matchable offsets per frame. Dyadic instructions must be assigned offsets 0-7 within each activation. The assembler must enforce this constraint and split function bodies that exceed it.
- Pre-formed flit 1 values. Output destinations are 16-bit packed flit 1 values stored in frame slots. The PE reads the slot and puts it on the bus verbatim as flit 1. Token type is determined by the prefix bits in the stored flit value.
- Frame setup via PE-local write tokens. Constants, destinations, and other per-activation data are loaded into frame slots by a stream of PE-local write tokens before execution begins.
- Frame lifecycle tokens. ALLOC (frame control, prefix 011+00, op=0) allocates a frame; FREE (frame control, op=1) or FREE_FRAME instruction releases it.
- Mode field replaces output routing logic. The 3-bit mode field in the instruction word encodes output routing (INHERIT/CHANGE_TAG/SINK) and constant presence. The assembler must compute the correct mode for each instruction based on its edge topology and constant usage.
2. IR Type Changes (asm/ir.py)#
2.1 IRNode Changes#
Current fields affected:
| Current Field | Change | Notes |
|---|---|---|
| ctx_slot: Optional[Union[int, CtxSlotRef, CtxSlotRange]] | Rename to act_slot or remove entirely | Macro parameter support (CtxSlotRef) may still be needed for parameterized activation grouping |
| iram_offset: Optional[int] | Keep | Semantics unchanged: index into PE's IRAM. Range 0-255. |
| ctx: Optional[int] | Rename to act_id: Optional[int] | 3-bit activation ID (0-7) instead of context slot |
New fields to add to IRNode:
| New Field | Type | Purpose |
|---|---|---|
| mode: Optional[int] | Optional[int] | 3-bit instruction mode (0-7), computed during allocate |
| fref: Optional[int] | Optional[int] | 6-bit frame slot base index (0-63), computed during allocate |
| wide: bool | bool | 32-bit frame values flag, default False |
| frame_layout: Optional[FrameSlotMap] | Optional[FrameSlotMap] | Per-node frame slot assignments (see below) |
2.2 New IR Types#
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameSlotMap:
    """Frame slot assignments for one instruction within an activation.

    Maps logical roles to physical frame slot indices within the 64-slot
    frame. Computed by the allocate pass.

    Attributes:
        match_slot: Frame slot for dyadic match operand storage (offsets 0-7)
        const_slot: Frame slot for IRAM constant (if any)
        dest_slots: Frame slot(s) for pre-formed flit 1 destination values
        sink_slot: Frame slot for SINK mode write-back target
    """

    match_slot: Optional[int] = None
    const_slot: Optional[int] = None
    dest_slots: tuple[int, ...] = ()
    sink_slot: Optional[int] = None


@dataclass(frozen=True)
class FrameLayout:
    """Complete frame layout for one activation on one PE.

    Computed by the allocate pass. Used by codegen to generate PE-local
    write tokens for frame setup.

    Attributes:
        act_id: Activation ID (0-7)
        pe_id: PE where this frame lives
        slots: Dict mapping slot index to (role, value) pairs
        total_slots: Total slots used (must be <= 64)
    """

    act_id: int
    pe_id: int
    slots: dict[int, tuple[str, int]]
    total_slots: int
```
2.3 SystemConfig Changes#
Current SystemConfig:
```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = DEFAULT_IRAM_CAPACITY  # 128
    ctx_slots: int = DEFAULT_CTX_SLOTS  # 16
```
New SystemConfig:
```python
@dataclass(frozen=True)
class SystemConfig:
    pe_count: int
    sm_count: int
    iram_capacity: int = 256  # 8-bit offset, 256 entries
    frame_count: int = 4  # max concurrent frames per PE
    frame_slots: int = 64  # slots per frame
    matchable_offsets: int = 8  # max dyadic instructions per activation per PE
```
The defaults change: DEFAULT_IRAM_CAPACITY goes from 128 to 256
(matching the 8-bit offset field). DEFAULT_CTX_SLOTS is removed
entirely and replaced by frame_count, frame_slots, and
matchable_offsets.
2.4 What Stays#
- IREdge -- unchanged except ctx_override may need renaming or semantic adjustment (see section 5)
- IRGraph structure -- unchanged
- IRDataDef -- unchanged (SM data definitions are orthogonal to PE frames)
- IRRegion, RegionKind -- unchanged
- NameRef, ResolvedDest -- unchanged
- MacroDef, IRMacroCall, MacroParam, ParamRef -- unchanged
- CallSite, CallSiteResult -- fields may need renaming (trampoline_nodes, free_ctx_nodes -> free_frame_nodes)
- SourceLoc, ConstExpr, IRRepetitionBlock -- unchanged
2.5 CtxSlotRef / CtxSlotRange#
CtxSlotRef and CtxSlotRange are used in macro templates for
parameterized context slot assignment. Under the frame model, these
become ActSlotRef / ActSlotRange (or are removed if activation ID
assignment is always automatic). The expand pass resolves these to
concrete values during macro expansion. The rename is straightforward
but touches ir.py, lower.py, and expand.py.
3. Opcode Changes (asm/opcodes.py)#
3.1 Renamed Opcodes#
| Current | New | Notes |
|---|---|---|
| RoutingOp.FREE_CTX | RoutingOp.FREE_FRAME | Deallocates a frame instead of a context slot |
The mnemonic mapping changes: "free_ctx" -> "free_frame" in
MNEMONIC_TO_OP. The OP_TO_MNEMONIC reverse mapping updates
correspondingly.
3.2 New Opcodes#
| Opcode | Type | Arity | Purpose |
|---|---|---|---|
| RoutingOp.EXTRACT_TAG | Monadic CM | Monadic | Captures executing token's identity as 16-bit packed flit 1 value (return continuation) |
| RoutingOp.ALLOC_REMOTE | Monadic CM | Monadic | Triggers ALLOC frame control token to a target PE (may be handled at codegen level instead) |
EXTRACT_TAG must be added to MNEMONIC_TO_OP (mnemonic: "extract_tag"),
to _MONADIC_OPS_TUPLES, and to cm_inst.py as a new RoutingOp enum
member.
Whether ALLOC_REMOTE is an ALU opcode or purely a codegen-emitted
frame control token depends on whether the assembler exposes it as a
user-facing instruction or handles it internally during call wiring.
For static calls, ALLOC is emitted by codegen as a frame control token,
not as an ALU instruction. For dynamic calls, it may need to be an
instruction.
3.3 Mode-Dependent Arity#
The current is_monadic(op, const) and is_dyadic(op, const) functions
remain correct in concept. The mode field is orthogonal to arity --
mode determines output routing behaviour, not input operand count. No
changes needed to arity classification.
3.4 Monadic/Dyadic Classification#
Unchanged. The distinction between monadic and dyadic is fundamental to matching (dyadic tokens go through presence-based matching in the 670s; monadic tokens bypass matching). The assembler's arity classification drives IRAM offset assignment (dyadic at offsets 0-7, monadic at 8+).
4. Lower Pass Changes (asm/lower.py)#
Minimal changes. The lower pass translates Lark CST to IR and is mostly opcode-agnostic.
4.1 Opcode Mnemonic Updates#
The MNEMONIC_TO_OP lookup in _resolve_opcode() will pick up new
opcodes automatically when they are added to opcodes.py. No lower.py
code changes needed for new opcodes.
The free_ctx mnemonic rename to free_frame will need a corresponding
change in lower.py only if the mnemonic string is hardcoded anywhere
(it is not -- lower.py uses MNEMONIC_TO_OP for all lookups).
4.2 Context Slot Qualifiers#
The [ctx_slot] qualifier syntax in dfasm (&node[3] <| add) is
parsed in lower.py and stored as ctx_slot on IRNode. This syntax and
field will be renamed to reflect activation IDs if kept. If activation
ID assignment is always automatic, the qualifier syntax may be removed
entirely.
References in lower.py:
- inst_def transformer rule: parses [N] qualifier into ctx_slot
- ctx_slot_ref transformer rule: creates CtxSlotRef for macro parameter [${param}]
- ctx_slot_range transformer rule: creates CtxSlotRange for [N:M]
4.3 New Syntax (Optional, Deferred)#
Frame directives in dfasm syntax (e.g., @frame_layout pragmas) could
be added later. Not needed for v0 -- the allocator computes frame
layouts automatically.
5. Expand Pass Changes (asm/expand.py)#
5.1 Function Call Wiring#
The expand pass currently generates cross-context call wiring:
- Trampoline PASS nodes with ctx_override edges
- FREE_CTX cleanup nodes
- Per-call-site context slot allocation via CallSite metadata
Under the frame model, the call wiring changes:
- ctx_override edges become frame-boundary edges. The semantic meaning is the same (data crosses activation boundaries), but the mechanism changes: instead of packing a target context into the instruction's const field with ctx_mode=1, the destination's pre-formed flit 1 value in the frame slot already encodes the target PE, offset, and act_id. Cross-activation routing is handled by frame setup, not by instruction encoding.
- FREE_CTX nodes become FREE_FRAME nodes. The opcode changes from RoutingOp.FREE_CTX to RoutingOp.FREE_FRAME. The expand pass references RoutingOp.FREE_CTX directly in _wire_call_site() at line 1160 of expand.py.
- Trampoline nodes may simplify. In the current design, trampolines exist to bridge context boundaries (an edge cannot cross contexts without a PASS node carrying ctx_mode). Under the frame model, destinations in frame slots already encode the target activation, so cross-activation edges are just edges whose destination flit 1 value points to a different activation. Trampolines may still be useful for fan-out or return routing but are no longer needed purely for context bridging.
- @ret wiring stays conceptually the same. Return routing in the frame model is a pre-formed flit 1 value loaded into a frame slot. The expand pass creates edges for return routing; the allocate pass resolves those edges to flit 1 values in frame slots.
5.2 CallSite Metadata#
CallSite fields to rename:
- free_ctx_nodes -> free_frame_nodes
The trampoline_nodes field stays (trampolines may still be generated
for fan-out or return routing patterns).
5.3 CtxSlotRef Resolution#
The expand pass resolves CtxSlotRef and CtxSlotRange during macro
expansion. These will be renamed to ActSlotRef / ActSlotRange (or
removed). The resolution logic in _substitute_node() and
_substitute_edge() is straightforward renaming.
5.4 Built-in Macros#
The built-in macros in builtins.py do NOT reference ctx_slot, ctx,
free_ctx, or any context-specific concepts directly. They use generic
opcodes (add, brgt, gate, pass, inc, const) and edge routing
(@ret, ${param}). No changes needed to built-in macros.
6. Resolve Pass Changes (asm/resolve.py)#
No changes needed. The resolve pass validates that edge endpoints exist
and detects scope violations. It operates on node names and graph
structure, not on PE-level concepts like contexts or frames. The resolve
pass does not reference ctx, ctx_slot, ctx_override, or any
context-related fields.
7. Place Pass Changes (asm/place.py)#
7.1 New Constraint: Matchable Offset Limit#
The placement pass must enforce the 8-matchable-offset constraint: at most 8 dyadic instructions per activation per PE. This is a new constraint that does not exist in the current codebase.
Currently, _count_iram_cost() counts dyadic nodes as costing 2 IRAM
slots and monadic as 1. Under the frame model:
- Dyadic nodes cost 1 IRAM slot (the matching store entry is in the 670s, not IRAM)
- Monadic nodes cost 1 IRAM slot
- The 8-dyadic-per-activation limit is a separate constraint from IRAM capacity
The placement pass needs to track dyadic instruction count per activation group per PE, in addition to total IRAM usage.
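The per-activation dyadic count described above can be sketched as a small check. This is a minimal illustration, not the place.py implementation: the function name `find_matchable_overflows` and the flat `(pe, scope, is_dyadic)` input tuples are hypothetical stand-ins for whatever node representation the placement pass actually uses.

```python
from collections import Counter


def find_matchable_overflows(
    nodes: list[tuple[int, str, bool]],
    matchable_offsets: int = 8,
) -> list[tuple[int, str, int]]:
    """Return (pe, scope, dyadic_count) for every group over the limit.

    Each input tuple is (pe, function_scope, is_dyadic). The tuple shape
    is an assumption made for this sketch only.
    """
    # Count dyadic nodes per (PE, activation group); monadic nodes are ignored.
    counts = Counter((pe, scope) for pe, scope, is_dyadic in nodes if is_dyadic)
    return sorted(
        (pe, scope, n) for (pe, scope), n in counts.items() if n > matchable_offsets
    )
```

An empty result means every activation group fits within the matchable offset range; this check is independent of the total IRAM capacity check.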
7.2 IRAM Cost Recalculation#
_count_iram_cost() should return 1 for all node types (dyadic and
monadic both use 1 IRAM slot in the frame model). The current cost of 2
for dyadic nodes was because the old matching store occupied IRAM slots;
in the frame model, match operands live in frame SRAM, not IRAM.
7.3 Context Slots -> Frames#
_auto_place_nodes() currently tracks ctx_used per PE (context slots
consumed). This becomes frames_used per PE (concurrent frames, max 4).
The ctx_scopes_per_pe tracking of function scopes per PE maps directly
to frame tracking: each function scope on a PE consumes one frame.
SystemConfig.ctx_slots references in place.py become
SystemConfig.frame_count.
7.4 Instruction Deduplication Awareness#
Because IRAM entries are activation-independent templates, multiple
activations of the same function on the same PE share IRAM entries. The
placement pass could account for this when computing IRAM utilisation:
if two activations of function $foo run on PE0, they share IRAM
entries, so the IRAM cost is the unique instruction count, not the total.
This is an optimisation, not a correctness requirement. The placement pass can conservatively count IRAM entries per unique function body without deduplication for v0.
8. Allocate Pass Changes (asm/allocate.py) -- MAJOR#
This is the largest change. The current allocate pass assigns IRAM
offsets and context slots, then resolves destinations to Addr values.
The frame model requires fundamentally different allocation logic.
8.1 IRAM Offset Assignment#
Current behaviour (_assign_iram_offsets()):
- Dyadic nodes get offsets 0..D-1
- Monadic nodes get offsets D..D+M-1
- Total must fit in iram_capacity
New behaviour:
- Dyadic nodes get offsets 0-7 (within the matchable offset range). At most 8 dyadic instructions per activation group per PE. Because IRAM is activation-independent (shared templates), dyadic offset assignment is per-PE, not per-activation.
- Monadic nodes get offsets 8-255 (or wherever dyadic offsets end).
- The matchable_offsets limit (default 8) constrains dyadic count.
- With instruction deduplication, multiple activations of the same function body share IRAM offsets. The allocator assigns offsets once per unique instruction template, not per activation.
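A minimal sketch of the new offset scheme for one PE follows. It assumes monadic offsets start immediately after the last dyadic offset (the "wherever dyadic offsets end" option); the function name `assign_iram_offsets` and the `(name, is_dyadic)` input pairs are illustrative, not the signature of `_assign_iram_offsets()`.

```python
def assign_iram_offsets(
    nodes: list[tuple[str, bool]],
    matchable_offsets: int = 8,
    iram_capacity: int = 256,
) -> dict[str, int]:
    """Assign IRAM offsets for the unique instruction templates on one PE.

    nodes is a list of (name, is_dyadic) pairs, one per unique template.
    Dyadic templates get 0..D-1, monadic templates follow from D upward.
    """
    dyadic = [name for name, is_dyadic in nodes if is_dyadic]
    monadic = [name for name, is_dyadic in nodes if not is_dyadic]

    if len(dyadic) > matchable_offsets:
        raise ValueError(
            f"{len(dyadic)} dyadic instructions exceed {matchable_offsets} matchable offsets"
        )
    if len(nodes) > iram_capacity:
        raise ValueError(f"{len(nodes)} instructions exceed IRAM capacity {iram_capacity}")

    # Dyadic first so they land inside the matchable range.
    offsets = {name: i for i, name in enumerate(dyadic)}
    for i, name in enumerate(monadic, start=len(dyadic)):
        offsets[name] = i
    return offsets
```

If monadic instructions should instead always begin at offset 8, only the `start=` argument changes.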
8.2 Activation ID Assignment#
Replaces _assign_context_slots(). The current function assigns context
slot indices (0-15) per function scope per PE. The new function assigns
3-bit activation IDs (0-7) with at most 4 concurrent activations per PE.
Key differences:
- Smaller space: 8 act_ids (3-bit) vs 16 context slots. But only 4 can be concurrently active (4 physical frames).
- ABA distance: the allocator must maintain ABA distance between act_ids to prevent stale token collisions. With 4 concurrent frames out of 8 possible IDs, 4 IDs of ABA distance exist before wraparound. For static programs, the allocator can assign act_ids sequentially (0, 1, 2, 3 for 4 concurrent activations).
- Per-call-site allocation: same concept as current call_site_to_ctx_on_pe -- each call site that creates a new activation gets a fresh act_id. But the budget is 4 concurrent frames instead of 16 context slots.
The function scope grouping logic (_extract_function_scope()) stays.
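For static programs, the sequential assignment mentioned above might look like the sketch below. The function name `assign_act_ids` and the `scopes_by_pe` input shape are assumptions for illustration; the real pass would derive scopes from `_extract_function_scope()` and call-site metadata.

```python
def assign_act_ids(
    scopes_by_pe: dict[int, list[str]],
    frame_count: int = 4,
) -> dict[tuple[int, str], int]:
    """Assign act_ids 0, 1, 2, ... to the activation scopes on each PE.

    scopes_by_pe maps a PE id to the activation scopes (function scopes
    or call sites) that need a frame there. Raises if a PE would need
    more concurrent frames than it physically has.
    """
    assignment: dict[tuple[int, str], int] = {}
    for pe, scopes in sorted(scopes_by_pe.items()):
        if len(scopes) > frame_count:
            raise ValueError(
                f"PE{pe} needs {len(scopes)} concurrent frames, only {frame_count} available"
            )
        for act_id, scope in enumerate(scopes):
            assignment[(pe, scope)] = act_id
    return assignment
```

Because at most `frame_count` IDs are handed out per PE, the result always stays inside the 3-bit act_id space with the default of 4.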
8.3 Frame Layout Allocation (NEW)#
This is entirely new functionality. After IRAM offset and act_id assignment, the allocator must compute the frame layout for each activation: which frame slots hold what data.
Frame slot roles:
| Role | Slot Range | Count per Instruction | Notes |
|---|---|---|---|
| Match operand | 0-7 | 1 per dyadic instruction | Indexed by matchable offset. Presence bit in 670. |
| Constant | 8+ | 0-1 per instruction | mode[0] (has_const) selects whether const is read from frame[fref] |
| Destination | variable | 0-2 per instruction | Pre-formed flit 1 values. 1 for modes 0/1, 2 for modes 2/3. |
| Accumulator/sink | variable | 0-1 per instruction | For SINK modes (6/7). Write-back target. |
| SM parameters | variable | 0-2 per SM instruction | SM_id + addr, data or return routing. |
Slot assignment algorithm:
- Reserve slots 0-7 for match operands (one per dyadic instruction, indexed by the instruction's matchable offset).
- Assign constant slots starting at slot 8. Constants that are shared across instructions within the same activation can be deduplicated (same value -> same slot).
- Assign destination slots. Each destination is a pre-formed flit 1 value. Destinations shared across instructions can be deduplicated.
- Assign sink/accumulator slots for SINK mode instructions.
- Assign SM parameter slots for SM operations.
- Verify total slots <= 64. If exceeded, report a frame overflow error.
fref computation:
The fref field in the instruction word points to the base of a
contiguous group of frame slots used by that instruction. The slot count
depends on the mode:
| Mode | Slots at fref | Layout |
|---|---|---|
| 0 | 1 | [dest] |
| 1 | 2 | [const, dest] |
| 2 | 2 | [dest1, dest2] |
| 3 | 3 | [const, dest1, dest2] |
| 4 | 0 | (no frame access) |
| 5 | 1 | [const] |
| 6 | 1 | [sink_target] (write) |
| 7 | 1 | [sink_target] (read-modify-write) |
The allocator must arrange frame slots such that each instruction's
constant and destination(s) are contiguous starting at fref. This
may require careful slot packing or a simple sequential allocation
strategy.
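The simple sequential strategy could be sketched as follows. It takes the per-mode slot counts from the table above, assumes slots 0-7 are reserved for match operands so allocation starts at slot 8, and performs no deduplication of constants or destinations. The names `SLOTS_PER_MODE` and `assign_frefs` are hypothetical.

```python
from typing import Optional

# Slots consumed at fref for each mode, taken from the table above.
SLOTS_PER_MODE = {0: 1, 1: 2, 2: 2, 3: 3, 4: 0, 5: 1, 6: 1, 7: 1}


def assign_frefs(
    modes: list[int],
    first_free: int = 8,
    frame_slots: int = 64,
) -> list[Optional[int]]:
    """Give each instruction a contiguous slot group, in input order.

    Returns one fref per instruction, or None for mode 4 (no frame access).
    Raises OverflowError when the frame would exceed frame_slots.
    """
    frefs: list[Optional[int]] = []
    next_slot = first_free
    for mode in modes:
        width = SLOTS_PER_MODE[mode]
        if width == 0:
            frefs.append(None)
            continue
        if next_slot + width > frame_slots:
            raise OverflowError(f"frame overflow: need slot {next_slot + width - 1}")
        frefs.append(next_slot)
        next_slot += width
    return frefs
```

For example, modes `[1, 0, 4, 3]` yield frefs `[8, 10, None, 11]`: the mode 1 instruction owns slots 8-9, the mode 0 instruction slot 10, and the mode 3 instruction slots 11-13.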
Mode computation:
The allocator determines the mode for each instruction based on:
| Condition | Mode |
|---|---|
| 0 dests, no const, has sink | 6 (SINK) |
| 0 dests, const via frame accumulator (RMW) | 7 (SINK+CONST) |
| 1 dest, no const | 0 (INHERIT, single output) |
| 1 dest, has const | 1 (INHERIT, single output + const) |
| 2 dests, no const | 2 (INHERIT, fan-out) |
| 2 dests, has const | 3 (INHERIT, fan-out + const) |
| CHANGE_TAG, no const | 4 |
| CHANGE_TAG, has const | 5 |
CHANGE_TAG is used when the instruction's left operand provides the
output destination dynamically (e.g., dynamic return routing via
EXTRACT_TAG + CHANGE_TAG).
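The mode table can be read as a small decision function. The sketch below is one possible reading, not the final `_compute_modes()` logic: it assumes CHANGE_TAG takes precedence over the other conditions, and it treats "sink with a constant" as the read-modify-write case (mode 7). The function name and parameters are illustrative.

```python
def compute_mode(
    n_dests: int,
    has_const: bool,
    change_tag: bool = False,
    sink: bool = False,
) -> int:
    """Map an instruction's edge topology and constant usage to a 3-bit mode."""
    if change_tag:
        # Destination comes from the left operand at run time.
        return 5 if has_const else 4
    if sink:
        if n_dests != 0:
            raise ValueError("SINK instructions must have no destinations")
        # Assumption: a constant on a sink instruction means RMW (mode 7).
        return 7 if has_const else 6
    if n_dests == 1:
        return 1 if has_const else 0
    if n_dests == 2:
        return 3 if has_const else 2
    raise ValueError(f"no mode for {n_dests} destinations without sink or CHANGE_TAG")
```

In every branch the low bit of the result equals `has_const`, which matches the mode[0] (has_const) convention used in the frame slot roles table.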
8.4 Pre-Formed Flit 1 Computation#
The allocator must convert resolved destination Addr values into
16-bit packed flit 1 values for storage in frame destination slots.
Flit 1 formats (from architecture-overview.md):
DYADIC WIDE: [0][0][port:1][PE:2][offset:8][act_id:3] = 16 bits
MONADIC NORM: [0][1][0][PE:2][offset:8][act_id:3] = 16 bits
MONADIC INLINE: [0][1][1][PE:2][10][offset:7][spare:2] = 16 bits
The packing function takes:
- dest_pe: int (2 bits)
- dest_offset: int (8 bits for dyadic/monadic normal, 7 for inline)
- dest_act_id: int (3 bits, for dyadic and monadic normal)
- dest_port: Port (1 bit, for dyadic wide only)
- dest_type: TokenType (dyadic wide, monadic normal, or monadic inline)
And produces a 16-bit packed flit 1 value.
The dest_type is determined by whether the destination instruction is
dyadic (-> dyadic wide flit) or monadic (-> monadic normal flit). For
trigger-only destinations (e.g., switch not-taken path), monadic inline
is used.
This replaces the current Addr(a, port, pe) resolution. The Addr
type may still be used as an intermediate representation, but the final
output is a packed flit 1 value in a frame slot.
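A sketch of the packing function, following the three bit layouts listed above and assuming fields are laid out most-significant bit first. The `FlitKind` enum and plain-int `port` argument are stand-ins for the `TokenType` and `Port` types named above, whose definitions are not shown here.

```python
from enum import Enum


class FlitKind(Enum):
    """Hypothetical stand-in for the dest_type argument."""
    DYADIC_WIDE = "dyadic_wide"
    MONADIC_NORM = "monadic_norm"
    MONADIC_INLINE = "monadic_inline"


def _check(name: str, value: int, bits: int) -> None:
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")


def pack_flit1(kind: FlitKind, pe: int, offset: int, act_id: int = 0, port: int = 0) -> int:
    """Pack a destination into a 16-bit flit 1 value."""
    _check("pe", pe, 2)
    if kind is FlitKind.MONADIC_INLINE:
        # [0][1][1][PE:2][10][offset:7][spare:2]
        _check("offset", offset, 7)
        return (0b011 << 13) | (pe << 11) | (0b10 << 9) | (offset << 2)
    _check("offset", offset, 8)
    _check("act_id", act_id, 3)
    if kind is FlitKind.DYADIC_WIDE:
        # [0][0][port:1][PE:2][offset:8][act_id:3]
        _check("port", port, 1)
        return (port << 13) | (pe << 11) | (offset << 3) | act_id
    # MONADIC_NORM: [0][1][0][PE:2][offset:8][act_id:3]
    return (0b010 << 13) | (pe << 11) | (offset << 3) | act_id
```

For instance, a dyadic destination on PE 1, offset 5, act_id 2, right port (1) packs to `0x282A`. Range checks are done up front so that an out-of-range field cannot silently corrupt the prefix bits.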
8.5 Destination Resolution Rework#
The current _resolve_destinations() function creates ResolvedDest
objects containing Addr(a=iram_offset, port=edge.port, pe=dest_pe).
Under the frame model, destination resolution must additionally:
- Look up the destination node's act_id (not just iram_offset and pe).
- Determine the destination token type (dyadic wide vs monadic normal vs monadic inline).
- Compute the packed flit 1 value.
- Assign the flit 1 value to a frame slot.
The ResolvedDest type may be extended to carry the packed flit 1
value, or the flit 1 computation may happen in a separate sub-pass
after destination resolution.
8.6 ctx_mode / ctx_override Removal#
The current allocator and codegen handle ctx_override edges by setting
ctx_mode=1 on the source instruction and packing the target context
and generation into the const field. This entire mechanism is removed.
Under the frame model, cross-activation routing is handled by frame
destination slots. The destination flit 1 value already encodes the
target PE, offset, and act_id. No instruction-level ctx_mode is
needed. The ctx_override flag on IREdge may be kept for semantic
annotation (this edge crosses activation boundaries) but has no effect
on instruction encoding.
9. Codegen Changes (asm/codegen.py) -- MAJOR#
9.1 Instruction Generation#
Current codegen (_build_iram_for_pe()) generates ALUInst and
SMInst objects. These are Python dataclasses that model the old
instruction format with embedded destinations and constants:
ALUInst(op, dest_l, dest_r, const, ctx_mode)
SMInst(op, sm_id, const, ret, ret_dyadic)
New codegen generates 16-bit instruction words matching the hardware format:
```python
@dataclass(frozen=True)
class Instruction:
    """16-bit instruction word for IRAM.

    [type:1][opcode:5][mode:3][wide:1][fref:6] = 16 bits
    """

    type: int  # 0 = CM, 1 = SM
    opcode: int  # 5-bit opcode
    mode: int  # 3-bit mode (0-7)
    wide: bool  # 32-bit frame values
    fref: int  # 6-bit frame slot base index
```
The Instruction type replaces both ALUInst and SMInst. Constants
and destinations are NOT in the instruction -- they are in frame slots.
The PEConfig.iram field type changes from
dict[int, ALUInst | SMInst] to dict[int, Instruction] (or
dict[int, int] if storing raw 16-bit words).
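If IRAM stores raw 16-bit words, the packing could look like the sketch below. It assumes the fields appear most-significant bit first in the order given in the format string; `encode_instruction` and `decode_instruction` are hypothetical helper names.

```python
def encode_instruction(type_bit: int, opcode: int, mode: int, wide: bool, fref: int) -> int:
    """Pack fields into [type:1][opcode:5][mode:3][wide:1][fref:6]."""
    for name, value, bits in (
        ("type", type_bit, 1),
        ("opcode", opcode, 5),
        ("mode", mode, 3),
        ("fref", fref, 6),
    ):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return (type_bit << 15) | (opcode << 10) | (mode << 7) | (int(wide) << 6) | fref


def decode_instruction(word: int) -> tuple[int, int, int, bool, int]:
    """Inverse of encode_instruction: (type, opcode, mode, wide, fref)."""
    return (
        (word >> 15) & 0x1,
        (word >> 10) & 0x1F,
        (word >> 7) & 0x7,
        bool((word >> 6) & 0x1),
        word & 0x3F,
    )
```

A CM instruction with opcode 1, mode 1, narrow values, and fref 8 encodes to `0x0488`. Keeping a decoder next to the encoder makes round-trip unit tests straightforward.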
9.2 Frame Setup Sequence Generation (NEW)#
Codegen must generate the bootstrap sequence that loads frame contents before execution begins. This is a stream of tokens:
- FrameControlToken (ALLOC) for each activation:

      flit 1: [0][1][1][PE:2][00][op=0][spare:3][act_id:3]
      flit 2: (return routing or unused)

- PELocalWriteToken for each frame slot that needs initialization:

      flit 1: [0][1][1][PE:2][01][region=1][spare:1][slot:5][act_id:3]
      flit 2: [data:16] (the frame slot value)

- PELocalWriteToken for IRAM entries:

      flit 1: [0][1][1][PE:2][01][region=0][spare:1][slot:5][act_id:ignored]
      flit 2: [instruction:16]
The ordering matters: IRAM writes before frame setup, frame setup (ALLOC + slot writes) before seed tokens. More specifically:
IRAM writes (all PEs)
-> ALLOC frame control (all activations)
-> Frame slot writes (constants, destinations per activation)
-> Seed tokens (initial data tokens to start execution)
9.3 New Token Types#
The emulator will need new token types (or the existing types must be adapted):
- FrameControlToken -- ALLOC/FREE frame lifecycle. Currently not in tokens.py. Codegen needs to emit these.
- PELocalWriteToken -- writes to IRAM (region=0) or frame slots (region=1). The current IRAMWriteToken is a special case of this (region=0 only). It may be generalised or a new type added.
These are emulator-level changes that codegen depends on. The codegen module must import and construct whatever token types the emulator provides.
9.4 Seed Token Generation#
Current seed token generation creates MonadToken or DyadToken with
ctx field. Changes:
- ctx field -> act_id field in token constructors
- gen field on DyadToken is removed (ABA protection is via the 670 valid bit, not generation counters)
- Seed tokens target (pe, offset, act_id) triples, packed from the destination's allocation data
9.5 Direct Mode Output#
generate_direct() currently returns AssemblyResult with:
- pe_configs: list[PEConfig] -- PEConfig with iram dict of ALUInst/SMInst
- sm_configs: list[SMConfig]
- seed_tokens: list[MonadToken]
New AssemblyResult:
- pe_configs: list[PEConfig] -- PEConfig with iram dict of Instruction (new type), plus frame_layouts: dict[int, FrameLayout] mapping act_id to frame layout data
- sm_configs: list[SMConfig] -- unchanged
- seed_tokens: list -- may include DyadToken and MonadToken with act_id instead of ctx
- setup_tokens: list -- NEW: frame control and PE-local write tokens for bootstrap
9.6 Token Stream Mode Output#
generate_tokens() currently produces:
SM init tokens -> IRAM write tokens -> seed tokens
New ordering:
SM init tokens -> IRAM write tokens -> ALLOC tokens -> frame slot write tokens -> seed tokens
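One way to enforce this ordering is to tag each token with its phase and sort on the phase rank. The sketch below is illustrative only: the phase names and the `(phase, token)` pairs are assumptions, and real tokens would be the emulator's token objects instead of strings.

```python
# Phase names are hypothetical labels for the five stages listed above.
PHASE_ORDER = ("sm_init", "iram_write", "alloc", "frame_write", "seed")


def order_token_stream(tagged_tokens: list[tuple[str, object]]) -> list[object]:
    """Order (phase, token) pairs by bootstrap phase.

    sorted() is stable, so tokens within the same phase keep the
    relative order in which they were generated.
    """
    rank = {phase: i for i, phase in enumerate(PHASE_ORDER)}
    ordered = sorted(tagged_tokens, key=lambda item: rank[item[0]])
    return [token for _, token in ordered]
```

Relying on a stable sort means per-phase generators can append tokens in any interleaving without needing their own ordering logic.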
9.7 ctx_mode Removal#
The entire ctx_mode / ctx_override handling in _build_iram_for_pe()
(lines 96-132 of codegen.py) is removed. Cross-activation routing is
handled by frame destination slots, not by instruction encoding.
9.8 Route Restriction Computation#
_compute_route_restrictions() is unchanged in concept. It scans edges
to determine which PEs and SMs a given PE needs to route to. The
implementation stays the same.
10. Serialize Pass Changes (asm/serialize.py)#
10.1 Field Renaming#
- ctx -> act_id in node serialization if activation ID is displayed
- Any ctx_slot qualifier in dfasm output becomes activation-related
10.2 New Fields#
If mode, fref, and wide are displayed in serialized output (for
debugging allocated IR), _serialize_node() needs to format them.
Format could be: &node|pe0|act2|mode1|fref8 <| add
10.3 Round-Trip Support#
The serialize pass must be able to round-trip new IR fields. Since
mode, fref, and frame layout are only populated after allocation,
serialization before allocation produces the same output as today (no
new fields to display). Post-allocation serialization adds the new
fields.
11. Built-in Macro Changes (asm/builtins.py)#
No changes needed. The built-in macros (#loop_counted, #loop_while,
#permit_inject, #reduce_2/3/4) use generic opcodes and edge routing.
They do not reference ctx, ctx_slot, free_ctx, or any
context-specific concepts.
The free_ctx opcode rename to free_frame does not affect builtins
because none of them use free_ctx.
12. Error Types (asm/errors.py)#
12.1 New Error Categories#
Add to ErrorCategory:
```python
class ErrorCategory(Enum):
    # ... existing ...
    FRAME = "frame"  # Frame layout overflow, slot conflicts
```
12.2 New Error Conditions#
| Error | Category | Source Pass | Condition |
|---|---|---|---|
| Frame slot overflow | FRAME | allocate | Total slots > 64 for an activation |
| Matchable offset overflow | RESOURCE | allocate/place | > 8 dyadic instructions per activation per PE |
| Frame count overflow | RESOURCE | allocate | > 4 concurrent activations on one PE |
| Act ID exhaustion | RESOURCE | allocate | > 8 activation IDs needed (wraparound) |
13. Dfgraph Pipeline Impact (dfgraph/)#
13.1 dfgraph/pipeline.py#
The pipeline runner calls allocate() and uses IRGraph types. If
SystemConfig fields change, pipeline.py may need minor updates for
default values or error handling. The pipeline runner itself does not
inspect allocation results in detail.
13.2 dfgraph/graph_json.py#
Currently includes ctx field in node JSON output. This becomes
act_id. The field rename is straightforward.
13.3 dfgraph/categories.py#
References RoutingOp.FREE_CTX in the CONFIG category mapping. This
becomes RoutingOp.FREE_FRAME. One line change.
EXTRACT_TAG (if added) maps to the ROUTING or CONFIG category
depending on its semantics.
14. Monitor Impact (monitor/)#
14.1 monitor/snapshot.py#
PESnapshot currently captures matching_store (2D array of
MatchEntry) and gen_counters. Under the frame model:
- matching_store becomes frame state (per-frame slot values, presence bits)
- gen_counters is removed (no generation counters in frame model)
- New: frame_allocations (act_id -> frame_id mapping), frame_slots (per-frame slot contents)
14.2 monitor/graph_json.py#
Node state overlay: ctx field -> act_id. Frame slot contents may be
added to the state overlay for debugging.
14.3 monitor/repl.py#
The pe command displays PE state including matching store contents.
This needs updating to display frame state instead.
15. Dependency Order#
Implementation order based on the dependency graph:
Phase 1: Foundation Types#
- cm_inst.py: Add RoutingOp.FREE_FRAME, RoutingOp.EXTRACT_TAG. Add Instruction dataclass. Update is_monadic_alu().
- asm/ir.py: Add FrameSlotMap, FrameLayout. Rename ctx_slot -> act_slot, ctx -> act_id on IRNode. Update SystemConfig (remove ctx_slots, add frame_count, frame_slots, matchable_offsets). Rename CtxSlotRef -> ActSlotRef, CtxSlotRange -> ActSlotRange.
- asm/opcodes.py: Add new opcodes to MNEMONIC_TO_OP, _MONADIC_OPS_TUPLES. Rename free_ctx -> free_frame.
- asm/errors.py: Add ErrorCategory.FRAME.
Phase 2: Allocation and Codegen (the big changes)#
- asm/allocate.py: Rewrite _assign_context_slots() -> _assign_act_ids(). Add _compute_frame_layouts(). Add _compute_modes(). Add _pack_flit1(). Update _assign_iram_offsets() for new offset scheme. Update _resolve_destinations() to produce flit 1 values. Remove ctx_mode / ctx_override handling.
- asm/codegen.py: Replace ALUInst / SMInst generation with Instruction generation. Add frame setup token generation. Add FrameControlToken / PELocalWriteToken emission. Update seed token generation. Remove ctx_mode handling. Update PEConfig construction. Update AssemblyResult.
Phase 3: Upstream Pass Adjustments#
- asm/lower.py: Rename ctx_slot references. Update qualifier parsing if syntax changes.
- asm/expand.py: Replace FREE_CTX with FREE_FRAME in call wiring. Rename CtxSlotRef -> ActSlotRef. Update CallSite field names. Simplify trampoline logic if ctx_override is removed.
- asm/place.py: Update _count_iram_cost() (all nodes cost 1). Replace ctx_slots tracking with frame_count. Add matchable-offset-per-activation constraint.
Phase 4: Output and Tooling#
- asm/serialize.py: Rename ctx -> act_id in output. Add mode/fref display for post-allocation IR.
- asm/builtins.py: No changes (verified).
- dfgraph/categories.py: FREE_CTX -> FREE_FRAME.
- dfgraph/graph_json.py: ctx -> act_id in JSON output.
- monitor/: Update snapshot, graph_json, and REPL for frame model.
Phase 5: Emulator Updates (out of scope for this doc)#
- emu/types.py: PEConfig changes (iram type, remove ctx_slots, add frame parameters).
- emu/pe.py: Frame-based matching instead of context-slot matching store.
- tokens.py: ctx -> act_id, remove gen. Add FrameControlToken, PELocalWriteToken.
16. What Stays the Same#
- dfasm grammar (dfasm.lark): mostly unchanged. May add frame directives later, but not required for v0.
- Parse pass (Lark parser): no changes.
- Error infrastructure (errors.py): structure unchanged, one new category added.
- Graph structure (IRGraph, IREdge, IRRegion): unchanged.
- Pipeline architecture: still 7 passes in the same order.
- Resolve pass: completely unchanged.
- Built-in macros: completely unchanged.
- SM-related codegen: SMConfig construction, SM init tokens, data defs -- all unchanged.
- Name resolution and scoping: unchanged.
- Macro expansion core: parameter substitution, variadic repetition, opcode parameters -- all unchanged. Only call wiring details change.
17. Risk Assessment#
| Area | Risk | Mitigation |
|---|---|---|
| Frame layout allocation | New, complex algorithm | Start with simple sequential allocation, optimise later |
| Flit 1 packing | Bit-level encoding must match hardware spec | Unit tests against architecture-overview.md bit layouts |
| Instruction deduplication | Interaction with frame layouts is subtle | Defer dedup to optimisation pass; v0 allocates per-activation |
| Mode computation | 8 modes, context-dependent selection | Exhaustive test matrix against mode table |
| ABA distance | Act_id assignment must maintain distance | Sequential assignment (0,1,2,3) is safe for static programs |
| ctx_override removal | Affects expand pass call wiring | Can keep ctx_override as semantic annotation initially |
| Emulator dependency | Codegen must emit tokens emulator can consume | Phase emulator changes alongside codegen; can use adapter layer |