# Canonicalization: Technical Deep-Dive & Open Problems

**Version:** 1.0
**Status:** Research review document
**Audience:** Research team — evaluate alternative approaches to canonicalization
**Goal:** Explain exactly what canonicalization does, how it's currently implemented, what works, what doesn't, and where we need better ideas.

---

## 1. What Canonicalization Does

Canonicalization is the **central transformation** in Phoenix. It converts raw text extracted from specification documents (clauses) into a typed, linked graph of canonical nodes — structured statements that the rest of the system can reason about.

```
Clause (raw text block)           Canonical Node (structured)
─────────────────────────         ──────────────────────────
"Tasks must support status    →   type: REQUIREMENT
 transitions: open →              statement: "tasks must support status
 in_progress → review → done"       transitions: open → in_progress →
                                    review → done"
                                  tags: [tasks, support, status,
                                    transitions, open, ...]
                                  source_clause_ids: [<clause_id>]
                                  linked_canon_ids: [<related_nodes>]
```

### What it must produce

Each canonical node has:

| Field | Purpose |
|-------|---------|
| `canon_id` | Content-addressed identity: `SHA-256(type + statement + source_clause_id)` |
| `type` | One of: REQUIREMENT, CONSTRAINT, INVARIANT, DEFINITION |
| `statement` | Normalized, unambiguous English sentence expressing one idea |
| `source_clause_ids` | Provenance: which clause(s) this node was extracted from |
| `linked_canon_ids` | Cross-references: other canon nodes this node relates to |
| `tags` | Extracted domain terms for search, linking, and IU grouping |

### Why it matters

Canonicalization is the **bottleneck** of the entire pipeline. Every downstream system depends on its output quality:

- **IU Planner** groups canon nodes into implementation units — if nodes are too coarse, IUs are too broad to selectively invalidate. If nodes are too fine, IUs proliferate.
- **Change Classification** uses canon node identity to determine what changed — if canonicalization is unstable (same input produces different nodes across runs), the classifier sees phantom changes.
- **Selective Invalidation** traces from changed clauses → affected canon nodes → affected IUs. If a clause maps to too many canon nodes, invalidation loses selectivity.
- **Provenance** must be accurate: every canon node must trace back to the specific clause(s) that justify it. Broken provenance means `phoenix inspect` lies.

---

## 2. Current Implementation: Rule-Based Canonicalizer

**File:** `src/canonicalizer.ts` (155 lines)

### 2.1 Algorithm

```
Input: Clause[]
Output: CanonicalNode[]

For each clause:
  Split clause.raw_text into lines
  For each non-empty, non-heading line:
    Strip list markers (-, *, •, 1.)
    Skip lines shorter than 5 characters
    Classify line type using regex patterns
    If classified:
      Normalize text (lowercase, strip formatting)
      Extract tags (non-stopword tokens > 2 chars)
      Generate canon_id = SHA-256(type + statement + clause_id)
      Create node with [clause_id] as source
    Else:
      Skip (line produces no canonical node)

After all nodes extracted:
  Link nodes by shared terms (≥2 common tags → bidirectional link)
```
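
Stitched together, the loop above can be sketched in TypeScript. Everything here is illustrative — the type shapes, helper names, and the (truncated) stopword list are assumptions, not the actual `src/canonicalizer.ts` API:

```typescript
import { createHash } from "crypto";

// Illustrative shapes -- the real types live in src/models/canonical.ts.
interface Clause { clause_id: string; raw_text: string; }
interface CanonNode {
  canon_id: string; type: string; statement: string;
  tags: string[]; source_clause_ids: string[];
}

// Truncated stand-in for the curated ~55-word stopword list.
const STOPWORDS = new Set(["the", "a", "an", "to", "of", "and", "or", "in", "for", "be"]);

// Ordered patterns: most specific first, broad REQUIREMENT last.
function classifyLine(line: string): string | null {
  if (/\b(?:must not|shall not|forbidden|prohibited|cannot|disallowed)\b/i.test(line)) return "CONSTRAINT";
  if (/\b(?:always|never|invariant|at all times|guaranteed)\b/i.test(line)) return "INVARIANT";
  if (/\b(?:is defined as|means|refers to)\b/i.test(line)) return "DEFINITION";
  if (/\b(?:must|shall|required|requires?)\b/i.test(line)) return "REQUIREMENT";
  return null;
}

function canonicalize(clauses: Clause[]): CanonNode[] {
  const nodes: CanonNode[] = [];
  for (const clause of clauses) {
    for (const raw of clause.raw_text.split("\n")) {
      const line = raw.replace(/^\s*(?:[-*•]|\d+\.)\s*/, "").trim(); // strip list markers
      if (line.length < 5 || line.startsWith("#")) continue;          // skip short lines and headings
      const type = classifyLine(line);
      if (type === null) continue;                                    // unmatched lines are dropped silently
      const statement = line.toLowerCase();                           // simplified normalization
      const tags = statement.split(/\W+/).filter(t => t.length > 2 && !STOPWORDS.has(t));
      const canon_id = createHash("sha256").update(type + statement + clause.clause_id).digest("hex");
      nodes.push({ canon_id, type, statement, tags, source_clause_ids: [clause.clause_id] });
    }
  }
  return nodes;
}
```

The linking pass (shared-tag comparison) would then run over the full `nodes` array after this loop.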

### 2.2 Type Classification

The classifier uses ordered regex pattern matching. The most specific patterns (constraints, invariants) are checked first; the requirement patterns, which are the broadest, are checked last.

**Constraint patterns** (checked first):
```
/\b(?:must not|shall not|forbidden|prohibited|cannot|disallowed)\b/i
/\b(?:limited to|maximum|minimum|at most|at least|no more than)\b/i
```

**Invariant patterns**:
```
/\b(?:always|never|invariant|at all times|guaranteed)\b/i
```

**Definition patterns**:
```
/\b(?:is defined as|means|refers to)\b/i
/:\s+\S/   (colon followed by text)
```

**Requirement patterns** (broadest, checked last):
```
/\b(?:must|shall|required|requires?)\b/i
/\b(?:needs? to|has to|will)\b/i
```

**Heading context fallback:** If no pattern matches a line, the classifier checks the clause's `section_path` (heading hierarchy) for keywords like "constraint", "requirement", "definition", "invariant". This allows lines under a "Security Constraints" heading to be classified as constraints even without explicit keywords.

**If nothing matches:** The line is dropped — it produces no canonical node.
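
The heading-context fallback can be sketched as follows, assuming `section_path` is the clause's heading chain (the function name and keyword checks are illustrative, not the real code):

```typescript
// Fallback classification from the clause's heading hierarchy, e.g.
// ["Spec", "Security Constraints"] -> CONSTRAINT. Sketch only.
function classifyByHeading(sectionPath: string[]): string | null {
  const path = sectionPath.join(" ").toLowerCase();
  if (path.includes("constraint")) return "CONSTRAINT";
  if (path.includes("invariant")) return "INVARIANT";
  if (path.includes("definition")) return "DEFINITION";
  if (path.includes("requirement")) return "REQUIREMENT";
  return null; // no heading context either -- the line is dropped
}
```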

### 2.3 Text Normalization

Before hashing, text is normalized (`src/normalizer.ts`):

- Markdown formatting stripped (bold, italic, links, code fences)
- Headings removed
- Lowercased
- List items sorted alphabetically (so reordering lists doesn't change the hash)
- Whitespace collapsed

This ensures formatting-only changes produce identical normalized output and thus identical `clause_semhash` values.
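
One plausible implementation of those five steps — a sketch under assumptions; the real rules are in `src/normalizer.ts` and certainly differ in detail:

```typescript
// Normalize markdown-ish text so formatting-only edits hash identically. Sketch.
function normalize(text: string): string {
  const t = text
    .replace(/```[\s\S]*?```/g, "")           // drop code fences
    .replace(/^#+\s.*$/gm, "")                // remove headings
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")  // links -> link text
    .replace(/\*\*|\*|__|_/g, "")             // bold/italic markers
    .toLowerCase();
  const lines = t.split("\n").map(l => l.trim()).filter(l => l.length > 0);
  // Sort list items alphabetically so reordering bullets doesn't change the hash
  const listItems = lines.filter(l => /^[-*•]/.test(l)).sort();
  const rest = lines.filter(l => !/^[-*•]/.test(l));
  return [...rest, ...listItems].join(" ").replace(/\s+/g, " ").trim();
}
```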

### 2.4 Term Extraction & Linking

Tags are extracted by tokenizing, removing stopwords (a curated list of ~55 English function words), and keeping tokens > 2 characters.

Linking: O(n²) pairwise comparison. Two nodes are linked if they share ≥ 2 tags. Links are bidirectional.
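
In sketch form (the `Linkable` shape and `minShared` default are illustrative; the real pass lives in `src/canonicalizer.ts`):

```typescript
interface Linkable { canon_id: string; tags: string[]; linked_canon_ids: string[]; }

// O(n^2) pairwise pass: link any two nodes sharing at least `minShared` tags.
function linkNodes(nodes: Linkable[], minShared = 2): void {
  for (let i = 0; i < nodes.length; i++) {
    const tagsI = new Set(nodes[i].tags);
    for (let j = i + 1; j < nodes.length; j++) {
      const shared = nodes[j].tags.filter(t => tagsI.has(t)).length;
      if (shared >= minShared) {
        nodes[i].linked_canon_ids.push(nodes[j].canon_id); // bidirectional
        nodes[j].linked_canon_ids.push(nodes[i].canon_id);
      }
    }
  }
}
```

With a small shared vocabulary ("tasks", "status", ...) this quickly links most nodes to most others — the noise problem discussed in §5.4.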

### 2.5 Concrete Output (TaskFlow Example)

Input: `spec/tasks.md` (34 lines, 4 sections, 18 list items)

| Stage | Count |
|-------|-------|
| Clauses extracted | 5 (one per heading) |
| Canonical nodes | 18 (one per list item) |
| Types | 18 REQUIREMENT, 0 CONSTRAINT, 0 INVARIANT, 0 DEFINITION |
| Linked pairs | 12 bidirectional links |

Input: `tests/fixtures/spec-auth-v1.md` (29 lines, 4 sections)

| Stage | Count |
|-------|-------|
| Clauses extracted | ~6 |
| Canonical nodes | 8 |
| Types | 6 REQUIREMENT, 2 CONSTRAINT, 0 INVARIANT, 0 DEFINITION |

Input: `tests/fixtures/spec-notifications.md`

| Stage | Count |
|-------|-------|
| Canonical nodes | 14 |
| Types | 12 REQUIREMENT, 1 CONSTRAINT, 1 INVARIANT, 0 DEFINITION |

---

## 3. Current Implementation: LLM-Enhanced Canonicalizer

**File:** `src/canonicalizer-llm.ts` (195 lines)

### 3.1 Algorithm

```
Input: Clause[], LLMProvider | null
Output: CanonicalNode[]

If no LLM provider → fall back to rule-based

Batch clauses into groups of 20
For each batch:
  Build prompt:
    System: "You are a requirements engineer extracting structured canonical nodes..."
    User: For each clause, output section header + raw text
    Request: JSON array of {type, statement, tags}

  Send to LLM (temperature: 0.1, max_tokens: 4096)
  Parse response:
    Strip markdown fences
    Find JSON array
    Validate each element has type (string) and statement (string)

  For each parsed node:
    Match to best source clause by term overlap
    Generate canon_id = SHA-256(type + statement + source_clause_id)
    Create node

After all batches:
  Link nodes by shared terms (same O(n²) algorithm)

On any failure → fall back to rule-based
```

### 3.2 Source Clause Attribution

The LLM returns flat JSON with no explicit clause references. To establish provenance, the system uses a **best-match heuristic**: for each LLM-returned node, it finds the clause whose text has the most word overlap with the node's statement + tags.

```typescript
function findBestSourceClause(node: LLMCanonNode, clauses: Clause[]): Clause | null {
  // Tokenize node statement + tags → nodeTerms
  const nodeTerms = [node.statement, ...node.tags].join(" ").toLowerCase().split(/\W+/).filter(t => t.length > 2);
  // For each clause: count overlap between clause tokens and nodeTerms
  const overlap = (c: Clause) => {
    const tokens = new Set(c.raw_text.toLowerCase().split(/\W+/));
    return nodeTerms.filter(t => tokens.has(t)).length;
  };
  // Return clause with highest overlap (null when nothing overlaps)
  return clauses.reduce<Clause | null>((best, c) => (overlap(c) > (best ? overlap(best) : 0) ? c : best), null);
}
```

If no good match, it falls back to positional assignment (node index → clause index, clamped).

### 3.3 LLM Prompt

```
System:
You are a requirements engineer extracting structured canonical nodes
from specification text.

For each meaningful statement, extract a JSON object with:
- type: one of REQUIREMENT, CONSTRAINT, INVARIANT, DEFINITION
- statement: the normalized canonical statement
- tags: array of key domain terms (lowercase, no stop words)

Rules:
- REQUIREMENT: something the system must do
- CONSTRAINT: something the system must NOT do, or limits/bounds
- INVARIANT: something that must ALWAYS or NEVER hold
- DEFINITION: defines a term or concept

Output a JSON array. No markdown fences, no explanation.
Only extract nodes where there is a clear, actionable statement.
Skip headings, meta-text, and filler.
```

---

## 4. What Works

### 4.1 Content-Addressed Identity Is Sound

The `canon_id = SHA-256(type + statement + source_clause_id)` scheme means identical extraction from identical input always produces the same node ID. This is critical for change detection: if a clause doesn't change, its canon nodes keep their IDs, and no downstream invalidation fires.
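
In code, the scheme is a few lines (the exact field concatenation is an assumption; the real helpers live in `src/semhash.ts`):

```typescript
import { createHash } from "crypto";

// Content-addressed identity: SHA-256 over type + statement + source clause id.
function canonId(type: string, statement: string, sourceClauseId: string): string {
  return createHash("sha256").update(type + statement + sourceClauseId).digest("hex");
}
```

Same inputs always yield the same `canon_id`; changing any of the three fields yields a different one — which is also why LLM rephrasing (§5.6) breaks identity.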

### 4.2 Provenance Tracking Is Correct (For Rule-Based)

In the rule-based path, every canon node is created directly from a specific clause's text. The `source_clause_ids` array is always correct because the mapping is syntactic — line N of clause C produces node N of clause C.

### 4.3 Fallback Strategy Is Robust

The LLM-enhanced path falls back to rule-based on any failure (no provider, parse error, empty result, timeout). This means canonicalization never blocks on external dependencies.

### 4.4 Normalization Produces Stable Hashes

List sorting, whitespace collapse, and format stripping mean that most cosmetic edits (reindenting, reordering bullets, changing bold to italic) produce identical `clause_semhash` values and thus don't trigger re-canonicalization.

---

## 5. What's Wrong: Known Shortcomings

### 5.1 Rule-Based: Everything Is a REQUIREMENT

**The problem:** The task management spec has 18 canon nodes. All 18 are typed as REQUIREMENT. Zero constraints, zero invariants, zero definitions.

This is clearly wrong. "Tasks must support status transitions: open → in_progress → review → done" is a requirement, but it also implicitly defines "task" and the valid statuses. "Invalid status transitions must be rejected" is a constraint. The rule-based classifier can't see these semantic distinctions — it matches "must" and calls everything a REQUIREMENT.

**Impact:** Type information is used to derive risk tiers, evidence policies, and invariant lists on IU contracts. If everything is REQUIREMENT, risk assessment is degraded and invariants are empty. In the TaskFlow example, zero invariants are extracted, so IU contracts have empty invariant lists.

**Root cause:** Regex patterns are too blunt. "Must" appears in requirements, constraints, and invariants. The patterns need semantic understanding that regex can't provide.

### 5.2 Rule-Based: Line-Level Granularity Is Too Rigid

**The problem:** The canonicalizer operates line-by-line. Each line that matches a pattern becomes one canonical node. This means:

- A multi-line statement split across lines produces multiple incomplete nodes
- A compound statement ("X must do A and must do B") becomes one node instead of two
- Paragraph-style specs (not bulleted lists) often produce zero nodes because no single line matches a pattern strongly enough

**Example failure:** Consider this spec text:
```
Tasks support three assignment modes. In single mode, one person owns
the task. In team mode, the task is shared. The assignee must accept
the assignment before it takes effect.
```

The rule-based canonicalizer would:
- Skip line 1 (no keyword match for "support three assignment modes")
- Skip line 2 (no "must/shall" keyword)
- Extract only line 3 as a REQUIREMENT ("the assignee must accept...")
- Miss the definition of assignment modes entirely

**Impact:** Specs that use flowing prose instead of bulleted lists get significantly fewer canonical nodes extracted. The system penalizes a writing style.

### 5.3 Rule-Based: Dropped Lines Are Silent

**The problem:** When a line doesn't match any pattern and there's no heading context, it's silently dropped. There is no diagnostic, no coverage metric, no way to know that 30% of your spec text was ignored.

**Impact:** Users don't know their spec has uncovered requirements. A clause might have 8 lines but produce only 3 canon nodes. The other 5 lines — which may contain important context, definitions, or implicit constraints — are invisible to the rest of the pipeline.

### 5.4 Term-Based Linking Is Noisy

**The problem:** Nodes are linked if they share ≥ 2 non-stopword tags. With extracted tags like `[tasks, status, transitions, open, ...]`, the word "tasks" appears in nearly every node in a task management spec. This means most nodes end up linked to most other nodes.

**Concrete example:** In the TaskFlow spec, node [15] ("overdue tasks must be flagged automatically") is linked to **5 other nodes** — nearly a third of all nodes. Node [2] ("tasks must support status transitions") is linked to **4 nodes**.

When everything is linked to everything, linking provides no useful information. It's noise, not signal.

**Root cause:** The linking threshold (≥ 2 shared tags) is too low for domains with small vocabularies. And tag extraction is just tokenization + stopword removal — there's no concept of term importance or domain specificity.

### 5.5 LLM Path: Provenance Attribution Is Heuristic

**The problem:** When the LLM extracts canonical nodes, the system doesn't know which clause each node came from. It guesses using word overlap: "which clause's text overlaps most with this node's statement and tags?"

This heuristic breaks when:
- The LLM synthesizes a node from multiple clauses (the node is a composite)
- The LLM rephrases heavily (low word overlap with any single clause)
- Two clauses cover similar vocabulary (ambiguous attribution)

**Impact:** `source_clause_ids` may be wrong for LLM-extracted nodes. This means the provenance graph lies — you trace a canon node back to a clause, but it was actually derived from a different clause (or multiple clauses). This undermines the core promise of Phoenix: "you can trace any generated file back to the spec sentence that caused it."

### 5.6 LLM Path: Instability Across Runs

**The problem:** Even with `temperature: 0.1`, the LLM may produce slightly different statements across runs. "Tasks must support status transitions" might become "The system shall allow task status transitions" on a second run. These produce different normalized text → different `canon_id` → the system sees phantom changes.

**Impact:** Re-running `phoenix canonicalize` on an unchanged spec may produce different canon_ids, triggering unnecessary downstream invalidation. This defeats the purpose of content-addressed identity.

**Root cause:** LLMs are not deterministic functions. Temperature 0.1 is low but not zero, and even at temperature 0, implementation details (batching, floating point, etc.) cause variation.

### 5.7 LLM Path: No Structural Awareness

**The problem:** The LLM receives clause text with section headers, but the prompt doesn't convey structural relationships: "this clause is in the same document as these other clauses," "this section is nested under that section," "these three clauses are sequential."

**Impact:** The LLM can't extract cross-clause relationships. If clause 1 defines "task" and clause 2 references "task" without re-defining it, the LLM extracts from clause 2 without the context that "task" was defined in clause 1. This limits the LLM's ability to produce accurate types (DEFINITION vs. REQUIREMENT) and cross-references.

### 5.8 One-to-One Clause→Node Assumption

**The problem:** Both the rule-based and LLM paths assume each canon node comes from exactly one clause (`source_clause_ids` is always a single-element array in practice). But real requirements often span multiple clauses:

- "Tasks have statuses" (clause in section 1) + "Statuses must follow the transition graph" (clause in section 2) = one canonical requirement that needs both clauses as provenance
- "Users are authenticated" (auth spec) + "Authenticated users can create tasks" (task spec) = cross-document dependency

**Impact:** Canon nodes can't express multi-clause provenance, which means cross-cutting requirements (security constraints that apply to multiple features, definitions used across sections) are either duplicated or attributed to only one source.

### 5.9 No Merge/Dedup Across Clauses

**The problem:** Two clauses in different sections might express the same requirement. The canonicalizer creates two separate nodes with different `canon_id`s (because `source_clause_id` is part of the hash). There is no deduplication.

**Example:**
- Clause in "Task Lifecycle": "Tasks must have a status"
- Clause in "Assignment": "Each task has a status that affects assignment eligibility"

These should arguably be one canonical node with two source clauses. Instead they're two nodes that happen to share some tags.

### 5.10 O(n²) Linking Doesn't Scale

**The problem:** The linking algorithm compares every pair of nodes. For the TaskFlow example (54 total nodes across 3 specs), this is 1,431 comparisons — fine. For a real project with 500 canon nodes, it's 124,750 comparisons. For 5,000 nodes, it's 12.5 million.

The comparison itself is also naive (array intersection of string tags), not just the iteration pattern.

---

## 6. Deeper Structural Problems

### 6.1 No Notion of "Coverage"

The system doesn't track what percentage of a clause's content was extracted into canonical nodes. A clause with 10 sentences might produce 3 canon nodes, and the other 7 sentences are silently ignored. There's no metric for this.

**What we need:** A coverage score per clause: `nodes_extracted / extractable_statements`. This would let `phoenix status` warn: "Clause at spec/tasks.md L14-20 has 35% coverage — 4 statements were not canonicalized."

### 6.2 No Confidence Scoring

Canon nodes have no confidence score. A node extracted by a perfect regex match on "shall not" has the same weight as a node extracted from heading context fallback on an ambiguous line. The downstream systems (IU planner, classifier) can't distinguish high-confidence extractions from low-confidence ones.

### 6.3 Canonicalization Is Not Idempotent Under Composition

If you canonicalize clauses 1–5, then later canonicalize clauses 6–10, and then canonicalize all 10 together, you get different linking results. The pairwise linking step is global — adding new nodes creates new links between existing nodes. This means the order of canonicalization matters for the link graph, even though the node extraction per clause is independent.

### 6.4 No Hierarchy in Canonical Graph

The canonical graph is flat — all nodes are peers connected by undirected links. But requirements naturally form hierarchies: "The system supports task management" decomposes into "Tasks have lifecycle states," which decomposes into "Status transitions follow the allowed graph." This parent-child structure is lost.

### 6.5 The Statement Is the Identity

Because `canon_id = SHA-256(type + statement + clause_id)`, the statement is the identity. If the LLM slightly rephrases a statement, it's a completely new node. There is no "soft identity" or similarity threshold — you're either identical or you're different.

This creates a tension: we want statements to be normalized (so identity is stable) but also want them to be meaningful (so humans can read them). Heavy normalization helps stability but hurts readability. Light normalization helps readability but hurts stability.

---

## 7. What Better Approaches Might Look Like

These are directions for the research team to evaluate. We're not prescribing solutions — we're naming the design space.

### 7.1 Semantic Chunking Instead of Line Splitting

Instead of splitting on line boundaries, identify **semantic units** within clause text — statements that express a single requirement or constraint, regardless of how many lines they span.

Possible approaches:
- Sentence boundary detection + classification per sentence
- Dependency parsing to identify clause-level semantic units
- LLM-based extraction with explicit sentence boundary identification

### 7.2 Multi-Pass Extraction

```
Pass 1: Extract DEFINITION nodes (terms, concepts)
Pass 2: Extract REQUIREMENT nodes, resolving references to definitions
Pass 3: Extract CONSTRAINT/INVARIANT nodes, linking to requirements they constrain
Pass 4: Cross-document resolution (same term used in different specs)
```

Multi-pass could solve the typing problem (pass 1 establishes vocabulary, later passes use it) and the cross-clause provenance problem (pass 4 links across documents).

### 7.3 Embedding-Based Linking Instead of Keyword Matching

Replace term-overlap linking with embedding similarity. Compute a vector embedding for each canon node's statement, then link nodes whose embeddings are within a threshold.

Advantages: captures semantic similarity ("rate limiting" and "throttling" would link). Disadvantages: requires an embedding model and threshold tuning, and introduces non-determinism.
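
The linking rule itself is easy to sketch. Real statement embeddings would come from a model; the vectors below are stand-ins that show only the threshold logic:

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Link node pairs whose embeddings are at least `threshold` similar.
function linkByEmbedding(vecs: number[][], threshold: number): Array<[number, number]> {
  const links: Array<[number, number]> = [];
  for (let i = 0; i < vecs.length; i++)
    for (let j = i + 1; j < vecs.length; j++)
      if (cosine(vecs[i], vecs[j]) >= threshold) links.push([i, j]);
  return links;
}
```

Note this is still O(n²) pairwise; at scale an approximate nearest-neighbor index would replace the inner loop.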

### 7.4 Hierarchical Canonical Graph

Add a `parent_canon_id` field. Top-level nodes represent high-level capabilities. Children represent specific requirements. Leaves represent constraints/invariants.

This would enable hierarchical invalidation: changing a leaf only invalidates its subtree, not the whole cluster connected by term overlap.

### 7.5 Stable Canonical Identity (Soft Matching)

Instead of exact hash identity, use a **two-layer identity**:
- `canon_id` (exact): current SHA-256 scheme, changes on any rewording
- `canon_anchor`: a stable identity based on semantic meaning, survives minor rephrasing

The anchor could be:
- SHA-256 of sorted tags (survives statement rewording if tags are stable)
- Embedding hash (locality-sensitive hash of the statement embedding)
- Clause-anchored: `SHA-256(source_clause_id + type)` (stable as long as source clause and type don't change)

When comparing old and new canonical graphs, match first by `canon_anchor`, then by `canon_id`. This separates "this node changed its wording" from "this is a completely new node."
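
For example, the sorted-tags variant is nearly a one-liner (a sketch; the separator and field choice are assumptions):

```typescript
import { createHash } from "crypto";

// Anchor identity from sorted tags: survives statement rewording as long as
// the extracted tags stay stable. Sketch of one candidate scheme.
function canonAnchor(tags: string[]): string {
  return createHash("sha256").update([...tags].sort().join("|")).digest("hex");
}
```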

### 7.6 LLM With Clause References in Output

Modify the LLM prompt to require explicit clause attribution:

```json
{
  "type": "REQUIREMENT",
  "statement": "Tasks must support status transitions: open → in_progress → review → done",
  "source_clauses": ["Task Lifecycle"],
  "tags": ["task", "status", "transitions"]
}
```

The LLM identifies which section/clause each node came from. This replaces the term-overlap heuristic for provenance.
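
On the consuming side, the attributed output would still need validation before it replaces the heuristic. A sketch (field names mirror the example above; the checks themselves are illustrative):

```typescript
interface AttributedNode { type: string; statement: string; source_clauses: string[]; tags: string[]; }

const VALID_TYPES = new Set(["REQUIREMENT", "CONSTRAINT", "INVARIANT", "DEFINITION"]);

// Keep only nodes with a valid type, a statement, and at least one cited clause.
function parseAttributed(json: string): AttributedNode[] {
  const raw = JSON.parse(json);
  if (!Array.isArray(raw)) throw new Error("expected a JSON array");
  return raw.filter((n: any) =>
    typeof n?.statement === "string" &&
    VALID_TYPES.has(n?.type) &&
    Array.isArray(n?.source_clauses) && n.source_clauses.length > 0
  );
}
```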

### 7.7 Coverage-Aware Extraction

After extraction, compute a coverage map: for each sentence/line in the original clause, did any canon node claim it? Report uncovered content as a diagnostic.

This could be combined with a "residual" extraction pass: after the first pass, feed uncovered sentences back to the extractor with a prompt like "these statements were not classified — are they requirements, constraints, definitions, or truly irrelevant?"
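
A sketch of such a coverage map, treating a sentence as covered when enough of its content words appear in some extracted statement (the 0.5 threshold and tokenization are assumptions, not a proposal):

```typescript
// Fraction of clause sentences claimed by at least one canonical statement.
function coverage(clauseSentences: string[], statements: string[], minOverlap = 0.5): number {
  const tokens = (s: string) => s.toLowerCase().split(/\W+/).filter(w => w.length > 2);
  let covered = 0;
  for (const sentence of clauseSentences) {
    const sTok = tokens(sentence);
    const hit = statements.some(st => {
      const stSet = new Set(tokens(st));
      return sTok.length > 0 && sTok.filter(w => stSet.has(w)).length / sTok.length >= minOverlap;
    });
    if (hit) covered++;
  }
  return clauseSentences.length === 0 ? 1 : covered / clauseSentences.length;
}
```

Anything below 1.0 could then be surfaced by `phoenix status` as uncovered spec content.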

### 7.8 Deterministic LLM Normalization

Instead of using the LLM for extraction, use it only for **normalization**: the rule-based extractor identifies candidate nodes, then the LLM normalizes each statement to a canonical form. This preserves the deterministic extraction (same input → same candidates) while improving statement quality.

```
Rule-based: extracts "Tasks must support status transitions: open → in_progress → review → done"
LLM normalizer: "The system shall support task status transitions following the graph: open → in_progress → review → done"
```

The normalized statement is hashed for identity. Because the LLM input is a single sentence (not a whole clause), output variance is much lower.

---

## 8. Evaluation Criteria for New Approaches

Any replacement or augmentation of the canonicalization system must be evaluated against:

| Criterion | Description | Current Baseline |
|-----------|-------------|------------------|
| **Extraction recall** | % of spec statements that produce at least one canon node | ~70% (prose paragraphs are missed) |
| **Type accuracy** | % of canon nodes with correct type classification | ~60% (everything tends toward REQUIREMENT) |
| **Provenance accuracy** | % of canon nodes with correct source_clause_ids | 100% rule-based, ~80% LLM (heuristic attribution) |
| **Identity stability** | Same input → same canon_ids across runs | 100% rule-based, ~90% LLM (temperature variance) |
| **Linking precision** | % of links that represent genuine semantic relationships | ~40% (keyword overlap is noisy) |
| **Linking recall** | % of genuine semantic relationships that are captured | ~30% (misses synonym and implication relationships) |
| **Cross-clause resolution** | Can a canon node cite multiple source clauses? | No (single clause only) |
| **Coverage visibility** | Does the system report what was NOT extracted? | No |
| **Scalability** | Performance at 1K, 10K, 100K nodes | O(n²) linking is the bottleneck |
| **Latency** | Time to canonicalize 100 clauses | <100ms rule-based, 5–15s LLM |
| **Determinism** | Does repeated execution produce identical output? | Yes rule-based, no LLM |
| **Graceful fallback** | Does the system degrade gracefully without LLM? | Yes |

---

## 9. Code Pointers

| File | Lines | What It Does |
|------|-------|--------------|
| `src/canonicalizer.ts` | 155 | Rule-based extraction: pattern matching, term extraction, linking |
| `src/canonicalizer-llm.ts` | 195 | LLM-enhanced extraction: batched prompt, JSON parsing, heuristic provenance |
| `src/normalizer.ts` | 70 | Text normalization: format stripping, list sorting, whitespace collapse |
| `src/semhash.ts` | 55 | SHA-256 hashing, clause_semhash, context_semhash_cold |
| `src/warm-hasher.ts` | 50 | context_semhash_warm incorporating canonical graph context |
| `src/spec-parser.ts` | 130 | Markdown → Clause[] (section boundary detection, heading hierarchy) |
| `src/models/canonical.ts` | 30 | CanonicalNode and CanonicalGraph type definitions |
| `src/store/canonical-store.ts` | 90 | Persistence layer for canonical graph |
| `tests/functional/canonicalization.test.ts` | 140 | Integration tests for the full canonicalization pipeline |

---

## 10. Summary of What We Need Help With

Ranked by impact:

1. **Better type classification.** The current system classifies almost everything as REQUIREMENT. We need an approach that reliably distinguishes requirements, constraints, invariants, and definitions — ideally without requiring an LLM for every extraction.

2. **Stable multi-clause provenance.** Canon nodes must be able to cite multiple source clauses, and LLM-extracted nodes must have accurate (not heuristic) provenance.

3. **Meaningful linking.** The current keyword-overlap linking produces too many false connections. We need linking that captures actual semantic relationships (constraint-constrains-requirement, definition-defines-term-used-in-requirement) without requiring O(n²) comparisons.

4. **Coverage visibility.** Users need to know what percentage of their spec was successfully canonicalized, and which statements were dropped.

5. **Identity stability under LLM extraction.** If we use LLMs, we need a way to produce stable canon_ids across runs. The current statement-is-identity model breaks with any LLM variance.

6. **Extraction from prose.** The system only handles bulleted lists well. Flowing prose, tables, and mixed-format specs need better support.

We're looking for approaches that maintain Phoenix's core properties — content-addressed identity, explicit provenance, deterministic fallback — while dramatically improving extraction quality, type accuracy, and linking precision.

---

*Generated from Phoenix VCS v0.1.0 codebase. All code is TypeScript, zero runtime dependencies, ~700 lines total for canonicalization.*