Reference implementation for the Phoenix Architecture. Work in progress. aicoding.leaflet.pub/

Phase A — Clause Extraction & Semantic Hashing

Overview

Phase A is the foundation layer. It parses spec documents (Markdown) into discrete clauses and computes semantic hashes for change detection.

Components

1. Spec Parser (src/spec-parser.ts)

Parses Markdown spec documents into structured clauses.

Input: Markdown file content + document ID
Output: Array of Clause objects

Parsing Rules:

  • Split on heading boundaries (any level: #, ##, ###, etc.)
  • Each heading + its body content = one clause
  • Track section hierarchy (e.g., ["1. Adoption Scope", "v1 Scope"])
  • Record source line ranges
  • Preserve raw text, compute normalized text
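
The parsing rules above can be sketched roughly as follows. This is a minimal illustration, not the actual src/spec-parser.ts: the names parseSpec and ParsedClause are hypothetical, only ATX-style headings (# through ######) are recognized, text before the first heading is ignored, and headings inside fenced code blocks are not special-cased.

```typescript
// Hypothetical sketch of heading-based clause splitting (assumed names throughout).
interface ParsedClause {
  source_doc_id: string;
  source_line_range: [number, number]; // [start, end], 1-indexed and inclusive
  raw_text: string;
  section_path: string[];
}

function parseSpec(content: string, docId: string): ParsedClause[] {
  const lines = content.split("\n");

  // Pass 1: locate every heading line and record its level and title.
  const headings: { line: number; level: number; title: string }[] = [];
  lines.forEach((text, i) => {
    const m = /^(#{1,6})\s+(.+)$/.exec(text);
    if (m) headings.push({ line: i + 1, level: m[1].length, title: m[2].trim() });
  });

  // Pass 2: each heading owns the lines up to (not including) the next heading.
  // The stack holds the currently open ancestor headings, which gives section_path.
  const stack: { level: number; title: string }[] = [];
  return headings.map((h, idx) => {
    while (stack.length > 0 && stack[stack.length - 1].level >= h.level) stack.pop();
    stack.push({ level: h.level, title: h.title });
    const end = idx + 1 < headings.length ? headings[idx + 1].line - 1 : lines.length;
    return {
      source_doc_id: docId,
      source_line_range: [h.line, end] as [number, number],
      raw_text: lines.slice(h.line - 1, end).join("\n"),
      section_path: stack.map((s) => s.title),
    };
  });
}
```

In this sketch, a "## v1 Scope" heading nested under "# 1. Adoption Scope" yields the section path ["1. Adoption Scope", "v1 Scope"], matching the hierarchy example in the rules.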

Normalization:

  • Lowercase
  • Collapse whitespace (multiple spaces/tabs → single space)
  • Strip leading/trailing whitespace per line
  • Remove markdown formatting characters (**, *, `, #)
  • Remove empty lines
  • Sort list items within a list block (for order-invariant hashing)
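
One possible reading of these normalization rules is sketched below. The order of operations is an assumption: list items are detected before formatting characters are stripped (otherwise a leading * bullet would be lost), every list marker is rewritten to "- ", and a blank or non-list line ends the current list block. The function name normalize is hypothetical, not the actual src/normalizer.ts API.

```typescript
// Hypothetical normalizer sketch; rule ordering is an assumption, not the spec.
function normalize(raw: string): string {
  const listMarker = /^([-*+]|\d+[.)])\s+/; // bullet or numbered list prefix
  const out: string[] = [];
  let listBlock: string[] = [];

  // Emit the buffered list items in sorted order so item order does not affect the hash.
  const flushList = (): void => {
    out.push(...listBlock.sort());
    listBlock = [];
  };

  for (const rawLine of raw.split("\n")) {
    const trimmed = rawLine.trim();
    const isItem = listMarker.test(trimmed);
    const cleaned = trimmed
      .replace(listMarker, "")  // drop the list marker (re-added below as "- ")
      .replace(/[*`#]/g, "")    // remove markdown formatting characters
      .replace(/\s+/g, " ")     // collapse runs of spaces and tabs
      .trim()
      .toLowerCase();

    if (isItem) {
      if (cleaned !== "") listBlock.push("- " + cleaned);
      continue;
    }
    flushList();                // any non-list line closes the list block
    if (cleaned !== "") out.push(cleaned); // empty lines are dropped
  }
  flushList();
  return out.join("\n");
}
```

With this sketch, normalize("- b\n- a") and normalize("- a\n- b") produce the same string, which is the property the order-invariant hashing relies on.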

2. Clause Model (src/models/clause.ts)

interface Clause {
  clause_id: string;           // content-addressed hash
  source_doc_id: string;       // document identifier
  source_line_range: [number, number]; // [start, end] 1-indexed
  raw_text: string;            // original text
  normalized_text: string;     // after normalization
  section_path: string[];      // heading hierarchy
  clause_semhash: string;      // SHA-256 of normalized_text
  context_semhash_cold: string; // SHA-256 of normalized_text + section_path + adjacent clause hashes
}

3. Semantic Hasher (src/semhash.ts)

clause_semhash: SHA-256(normalized_text)

context_semhash_cold: SHA-256(normalized_text + section_path.join('/') + prev_clause_semhash + next_clause_semhash)

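The cold variant needs only the clause's own text, its section path, and the hashes of its immediate neighbors, so it captures local context without requiring the canonical graph and can be computed on a cold start.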
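
The two formulas can be sketched with Node's built-in crypto module. The helper names (sha256Hex, hashClauses, HashInput, Hashes) are hypothetical, and two details are assumptions the text does not settle: hashes are hex-encoded, and a missing neighbor (first or last clause) contributes an empty string.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of a UTF-8 string.
function sha256Hex(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

interface HashInput {
  normalized_text: string;
  section_path: string[];
}

interface Hashes {
  clause_semhash: string;
  context_semhash_cold: string;
}

// Computes both hashes for an ordered list of clauses from one document.
function hashClauses(clauses: HashInput[]): Hashes[] {
  // clause_semhash depends only on the clause's own normalized text.
  const own = clauses.map((c) => sha256Hex(c.normalized_text));

  return clauses.map((c, i) => {
    // Assumption: absent neighbors are treated as the empty string.
    const prev = i > 0 ? own[i - 1] : "";
    const next = i + 1 < own.length ? own[i + 1] : "";
    return {
      clause_semhash: own[i],
      context_semhash_cold: sha256Hex(
        c.normalized_text + c.section_path.join("/") + prev + next
      ),
    };
  });
}
```

Editing one clause in this sketch changes that clause's clause_semhash and the context_semhash_cold of its immediate neighbors, while the neighbors' own clause_semhash values stay the same.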

4. Spec Graph Store (src/store/spec-store.ts)

Persists clauses to the content-addressed store and maintains the spec graph index.

Operations:

  • ingestDocument(docPath: string): IngestResult
  • getClauses(docId: string): Clause[]
  • getClause(clauseId: string): Clause | null
  • diffDocument(docPath: string): ClauseDiff[]
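
To illustrate the lookup side of this interface, here is a minimal in-memory sketch. It is not the persistent store described above: InMemorySpecStore, StoredClause, and saveDocument are hypothetical names, and ingestDocument and diffDocument are left out because they depend on file I/O and on the IngestResult and ClauseDiff types, which are not defined in this section.

```typescript
// Trimmed-down clause shape for the sketch (the real Clause has more fields).
interface StoredClause {
  clause_id: string;
  source_doc_id: string;
  normalized_text: string;
}

class InMemorySpecStore {
  // Content-addressed objects: clause_id -> clause.
  private byId = new Map<string, StoredClause>();
  // Spec graph index: document ID -> ordered clause IDs.
  private byDoc = new Map<string, string[]>();

  // Hypothetical helper: records the clauses of one document, replacing its index entry.
  saveDocument(docId: string, clauses: StoredClause[]): void {
    for (const c of clauses) this.byId.set(c.clause_id, c);
    this.byDoc.set(docId, clauses.map((c) => c.clause_id));
  }

  getClause(clauseId: string): StoredClause | null {
    return this.byId.get(clauseId) ?? null;
  }

  getClauses(docId: string): StoredClause[] {
    const ids = this.byDoc.get(docId) ?? [];
    return ids
      .map((id) => this.byId.get(id))
      .filter((c): c is StoredClause => c !== undefined);
  }
}
```

Keeping the object map separate from the per-document index means identical clauses shared by several documents are stored once, which is the usual benefit of content addressing.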

5. Diff Engine (src/diff.ts)

Compares previous vs. current clauses for a document.

Diff types:

  • ADDED — new clause
  • REMOVED — clause deleted
  • MODIFIED — clause_semhash changed
  • MOVED — section_path changed but clause_semhash unchanged
  • UNCHANGED — identical
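
The section does not say how previous and current clauses are paired, so the sketch below assumes one plausible strategy: match on hash and path first (UNCHANGED), then on hash alone (MOVED), then on path alone (MODIFIED), and treat whatever remains as ADDED or REMOVED. All names here (diffClauses, DiffClause, ClauseDiffSketch) are hypothetical.

```typescript
type DiffType = "ADDED" | "REMOVED" | "MODIFIED" | "MOVED" | "UNCHANGED";

interface DiffClause {
  clause_semhash: string;
  section_path: string[];
}

interface ClauseDiffSketch {
  type: DiffType;
  before?: DiffClause; // absent for ADDED
  after?: DiffClause;  // absent for REMOVED
}

function diffClauses(prev: DiffClause[], curr: DiffClause[]): ClauseDiffSketch[] {
  const key = (c: DiffClause): string => c.section_path.join("/");
  const oldLeft = prev.slice(); // previous clauses not yet paired
  let newLeft = curr.slice();   // current clauses not yet paired
  const result: ClauseDiffSketch[] = [];

  // Pairs each unpaired current clause with the first unpaired previous clause
  // that satisfies `same`, and labels the pair with `type`.
  const matchPass = (
    type: DiffType,
    same: (before: DiffClause, after: DiffClause) => boolean
  ): void => {
    const stillNew: DiffClause[] = [];
    for (const after of newLeft) {
      const i = oldLeft.findIndex((before) => same(before, after));
      if (i === -1) {
        stillNew.push(after);
        continue;
      }
      const [before] = oldLeft.splice(i, 1);
      result.push({ type, before, after });
    }
    newLeft = stillNew;
  };

  matchPass("UNCHANGED", (a, b) => a.clause_semhash === b.clause_semhash && key(a) === key(b));
  matchPass("MOVED", (a, b) => a.clause_semhash === b.clause_semhash);
  matchPass("MODIFIED", (a, b) => key(a) === key(b));

  for (const after of newLeft) result.push({ type: "ADDED", after });
  for (const before of oldLeft) result.push({ type: "REMOVED", before });
  return result;
}
```

Running the exact-match pass before the hash-only pass keeps a clause that stayed in place from being reported as MOVED when a duplicate of its text appears elsewhere in the document.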

Data Flow

spec/*.md → SpecParser.parse() → Clause[] → SemHasher.hash() → Clause[] (with hashes) → SpecStore.save()

File Layout

src/
  models/
    clause.ts          # Clause interface + types
  spec-parser.ts       # Markdown → Clause[] parser
  semhash.ts           # Semantic hashing functions
  normalizer.ts        # Text normalization
  diff.ts              # Clause diff engine
  store/
    spec-store.ts      # Spec graph persistence
    content-store.ts   # Content-addressed object store
  index.ts             # Public API exports

Success Criteria

  1. Parse a Markdown spec into correct clauses with accurate line ranges
  2. Normalized text is deterministic and order-invariant for lists
  3. clause_semhash is stable across formatting-only changes
  4. context_semhash_cold changes when the clause text, its section path, or an adjacent clause changes
  5. Diff engine correctly classifies all change types
  6. Store persists and retrieves clauses by ID and document