Reference implementation for the Phoenix Architecture. Work in progress.
# Phase A: Clause Extraction & Semantic Hashing

## Overview

Phase A is the foundation layer. It parses spec documents (Markdown) into discrete clauses and computes semantic hashes for change detection.

## Components

### 1. Spec Parser (`src/spec-parser.ts`)

Parses Markdown spec documents into structured clauses.

**Input:** Markdown file content + document ID
**Output:** Array of `Clause` objects

**Parsing Rules:**
- Split on heading boundaries (any level: #, ##, ###, etc.)
- Each heading + its body content = one clause
- Track section hierarchy (e.g., `["1. Adoption Scope", "v1 Scope"]`)
- Record source line ranges
- Preserve raw text, compute normalized text

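The parsing rules above can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/spec-parser.ts`: `ParsedClause` and `parseSpec` are hypothetical names, only ATX-style headings (`#` through `######`) are recognized, fenced code blocks are not special-cased (a `#` line inside a fence would be treated as a heading), and any text before the first heading is dropped.

```typescript
// Hypothetical shape: the subset of Clause that exists before hashing.
interface ParsedClause {
  source_doc_id: string;
  source_line_range: [number, number]; // 1-indexed, inclusive
  raw_text: string;
  section_path: string[];
}

function parseSpec(content: string, docId: string): ParsedClause[] {
  const lines = content.split(/\r?\n/);
  const clauses: ParsedClause[] = [];
  const stack: { level: number; title: string }[] = []; // open headings
  let current: { start: number; path: string[] } | null = null;

  // Close the open clause so that it ends at line `end` (1-indexed, inclusive).
  const flush = (end: number): void => {
    if (current === null) return;
    clauses.push({
      source_doc_id: docId,
      source_line_range: [current.start, end],
      raw_text: lines.slice(current.start - 1, end).join("\n"),
      section_path: current.path,
    });
    current = null;
  };

  lines.forEach((line, i) => {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (m === null) return; // body line: belongs to the open clause
    flush(i); // the previous clause ends on the line before this heading
    const level = m[1].length;
    // Pop headings at the same or a deeper level to keep the hierarchy correct.
    while (stack.length > 0 && stack[stack.length - 1].level >= level) stack.pop();
    stack.push({ level, title: m[2].trim() });
    current = { start: i + 1, path: stack.map((h) => h.title) };
  });
  flush(lines.length);
  return clauses;
}
```

With this sketch, a `## v1 Scope` heading nested under `# 1. Adoption Scope` yields the `section_path` `["1. Adoption Scope", "v1 Scope"]`, matching the example in the rules.
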
**Normalization:**
- Lowercase
- Collapse whitespace (multiple spaces/tabs → single space)
- Strip leading/trailing whitespace per line
- Remove markdown formatting characters (**, *, `, #)
- Remove empty lines
- Sort list items within a list block (for order-invariant hashing)

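A minimal sketch of these normalization steps, under some assumptions that the rules do not spell out: `normalizeText` is a hypothetical name, list markers (`-`, `*`, `+`, `1.`) are all rewritten to `- ` before sorting, blank lines between items do not end a list block, and nested or multi-line list items are not handled. Note that this also sorts ordered lists, which may or may not be intended.

```typescript
// Hypothetical helper; the real `normalizer.ts` may be structured differently.
function normalizeText(raw: string): string {
  const listMarker = /^\s*(?:[-*+]|\d+\.)\s+/;
  const out: string[] = [];
  let listBlock: string[] = []; // consecutive list items waiting to be sorted

  const flushList = (): void => {
    // Default sort compares UTF-16 code units, so it is locale-independent.
    out.push(...listBlock.sort());
    listBlock = [];
  };

  for (const rawLine of raw.split(/\r?\n/)) {
    const isListItem = listMarker.test(rawLine);
    const text = rawLine
      .toLowerCase()
      .replace(listMarker, "")   // drop the list marker itself
      .replace(/[*`#]/g, "")     // strip **, *, ` and #
      .replace(/[ \t]+/g, " ")   // collapse runs of spaces/tabs
      .trim();                   // strip leading/trailing whitespace
    if (text === "") continue;   // remove empty lines
    if (isListItem) {
      listBlock.push("- " + text);
    } else {
      flushList();               // a non-list line closes the current block
      out.push(text);
    }
  }
  flushList();
  return out.join("\n");
}
```

Detecting list items before stripping `*` matters here: otherwise a `* item` bullet would lose its marker and no longer be recognized as part of a list.
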
### 2. Clause Model (`src/models/clause.ts`)

```typescript
interface Clause {
  clause_id: string;                    // content-addressed hash
  source_doc_id: string;                // document identifier
  source_line_range: [number, number];  // [start, end] 1-indexed
  raw_text: string;                     // original text
  normalized_text: string;              // after normalization
  section_path: string[];               // heading hierarchy
  clause_semhash: string;               // SHA-256 of normalized_text
  context_semhash_cold: string;         // SHA-256 of normalized_text + section_path + adjacent clause hashes
}
```

### 3. Semantic Hasher (`src/semhash.ts`)

**clause_semhash:** `SHA-256(normalized_text)`

**context_semhash_cold:** `SHA-256(normalized_text + section_path.join('/') + prev_clause_semhash + next_clause_semhash)`

This captures local context without requiring the canonical graph (cold start).

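The two formulas above could be implemented along these lines with Node's built-in `crypto` module. The function names are hypothetical, and two details are assumptions rather than part of the spec: digests are hex-encoded, and a missing neighbor (first or last clause) contributes an empty string.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of a UTF-8 string, hex-encoded (encoding is an assumption).
const sha256 = (s: string): string =>
  createHash("sha256").update(s, "utf8").digest("hex");

function clauseSemhash(normalizedText: string): string {
  return sha256(normalizedText);
}

function contextSemhashCold(
  normalizedText: string,
  sectionPath: string[],
  prevSemhash: string, // "" when there is no previous clause (assumption)
  nextSemhash: string, // "" when there is no next clause (assumption)
): string {
  // Plain concatenation, exactly as written in the formula above.
  return sha256(normalizedText + sectionPath.join("/") + prevSemhash + nextSemhash);
}
```

Because the fields are concatenated without a separator, different inputs can in principle produce the same hash input (for example, text `a` with path `b` versus text `ab` with an empty path). Adding a delimiter between fields would avoid that, at the cost of deviating from the formula as stated.
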
### 4. Spec Graph Store (`src/store/spec-store.ts`)

Persists clauses to the content-addressed store and maintains the spec graph index.

**Operations:**
- `ingestDocument(docPath: string): IngestResult`
- `getClauses(docId: string): Clause[]`
- `getClause(clauseId: string): Clause | null`
- `diffDocument(docPath: string): ClauseDiff[]`

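To make the split between the content-addressed store and the per-document index concrete, here is an in-memory sketch of the two read operations. `InMemorySpecStore`, `StoredClause`, and `save` are hypothetical; the real store persists data and exposes `ingestDocument` and `diffDocument` on top, which are not shown.

```typescript
// Only the fields this sketch needs; the full Clause has more.
interface StoredClause {
  clause_id: string;
  source_doc_id: string;
  normalized_text: string;
}

class InMemorySpecStore {
  // Content-addressed objects: clause_id -> clause.
  private objects = new Map<string, StoredClause>();
  // Spec graph index: document id -> clause ids in document order.
  private index = new Map<string, string[]>();

  // Store the clauses and replace the document's index entry.
  save(docId: string, clauses: StoredClause[]): void {
    for (const c of clauses) this.objects.set(c.clause_id, c);
    this.index.set(docId, clauses.map((c) => c.clause_id));
  }

  getClause(clauseId: string): StoredClause | null {
    return this.objects.get(clauseId) ?? null;
  }

  getClauses(docId: string): StoredClause[] {
    const result: StoredClause[] = [];
    for (const id of this.index.get(docId) ?? []) {
      const c = this.objects.get(id);
      if (c !== undefined) result.push(c);
    }
    return result;
  }
}
```

In this sketch, re-saving a document only rewrites its index entry; clauses that are no longer referenced stay in `objects`, which is typical of content-addressed stores but would need a cleanup strategy in practice.
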
### 5. Diff Engine (`src/diff.ts`)

Compares previous vs. current clauses for a document.

**Diff types:**
- `ADDED`: new clause
- `REMOVED`: clause deleted
- `MODIFIED`: clause_semhash changed
- `MOVED`: section_path changed but content same
- `UNCHANGED`: identical

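One way to classify clauses into these five types is sketched below. The spec does not say how a previous clause is paired with a current one, so the matching strategy here is an assumption: clauses are first paired by `clause_semhash` (giving `UNCHANGED` or `MOVED`), then leftovers are paired by `section_path` (giving `MODIFIED`), and anything still unpaired is `ADDED` or `REMOVED`. `diffClauses`, `HashedClause`, and the `ClauseDiff` shape are hypothetical, and clauses with duplicate content are paired greedily.

```typescript
type DiffType = "ADDED" | "REMOVED" | "MODIFIED" | "MOVED" | "UNCHANGED";

// Only the fields the classification needs.
interface HashedClause {
  clause_semhash: string;
  section_path: string[];
}

interface ClauseDiff {
  type: DiffType;
  before?: HashedClause;
  after?: HashedClause;
}

const pathKey = (c: HashedClause): string => c.section_path.join("/");

function diffClauses(prev: HashedClause[], curr: HashedClause[]): ClauseDiff[] {
  const diffs: ClauseDiff[] = [];
  const remaining = [...prev]; // previous clauses not yet paired
  // Remove and return the first remaining clause that satisfies `pred`.
  const take = (pred: (c: HashedClause) => boolean): HashedClause | undefined => {
    const i = remaining.findIndex(pred);
    return i === -1 ? undefined : remaining.splice(i, 1)[0];
  };

  // Pass 1: same content hash means UNCHANGED (same path) or MOVED.
  const unmatched: HashedClause[] = [];
  for (const after of curr) {
    const sameHash = (p: HashedClause): boolean => p.clause_semhash === after.clause_semhash;
    const before =
      take((p) => sameHash(p) && pathKey(p) === pathKey(after)) ?? take(sameHash);
    if (before === undefined) {
      unmatched.push(after);
    } else {
      const type = pathKey(before) === pathKey(after) ? "UNCHANGED" : "MOVED";
      diffs.push({ type, before, after });
    }
  }

  // Pass 2: same section path but different hash means MODIFIED, else ADDED.
  for (const after of unmatched) {
    const before = take((p) => pathKey(p) === pathKey(after));
    diffs.push(before === undefined ? { type: "ADDED", after } : { type: "MODIFIED", before, after });
  }

  // Pass 3: previous clauses that were never paired were removed.
  for (const before of remaining) diffs.push({ type: "REMOVED", before });
  return diffs;
}
```

Under this strategy, renaming a heading changes both the hash (the heading is part of the clause text) and the path, so it shows up as a `REMOVED` plus an `ADDED` rather than a `MODIFIED`.
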
## Data Flow

```
spec/*.md → SpecParser.parse() → Clause[] → SemHasher.hash() → Clause[] (with hashes) → SpecStore.save()
```

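One detail the hashing stage of this flow implies: `context_semhash_cold` depends on the neighboring clauses' `clause_semhash`, so hashing a document takes two passes over the clause list. A sketch of that stage follows; `hashClauses` and the `Unhashed`/`Hashed` types are hypothetical, and as an assumption the first and last clause use an empty string for the missing neighbor.

```typescript
import { createHash } from "node:crypto";

interface Unhashed {
  normalized_text: string;
  section_path: string[];
}

interface Hashed extends Unhashed {
  clause_semhash: string;
  context_semhash_cold: string;
}

const sha256Hex = (s: string): string =>
  createHash("sha256").update(s, "utf8").digest("hex");

function hashClauses(clauses: Unhashed[]): Hashed[] {
  // Pass 1: content hashes, which do not depend on neighbors.
  const semhashes = clauses.map((c) => sha256Hex(c.normalized_text));

  // Pass 2: context hashes, which need the neighbors' content hashes.
  return clauses.map((c, i) => {
    const prev = i > 0 ? semhashes[i - 1] : "";
    const next = i < clauses.length - 1 ? semhashes[i + 1] : "";
    return {
      ...c,
      clause_semhash: semhashes[i],
      context_semhash_cold: sha256Hex(
        c.normalized_text + c.section_path.join("/") + prev + next,
      ),
    };
  });
}
```

A consequence worth noting: editing one clause changes the `context_semhash_cold` of its immediate neighbors as well, while their `clause_semhash` stays the same.
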
## File Layout

```
src/
  models/
    clause.ts          # Clause interface + types
  spec-parser.ts       # Markdown → Clause[] parser
  semhash.ts           # Semantic hashing functions
  normalizer.ts        # Text normalization
  diff.ts              # Clause diff engine
  store/
    spec-store.ts      # Spec graph persistence
    content-store.ts   # Content-addressed object store
  index.ts             # Public API exports
```

## Success Criteria

1. Parse a Markdown spec into correct clauses with accurate line ranges
2. Normalized text is deterministic and order-invariant for lists
3. clause_semhash is stable across formatting-only changes
4. context_semhash_cold captures local structure
5. Diff engine correctly classifies all change types
6. Store persists and retrieves clauses by ID and document