Reference implementation for the Phoenix Architecture. Work in progress. aicoding.leaflet.pub/
# Canonicalization: Technical Deep-Dive & Open Problems

**Version:** 1.0
**Status:** Research review document
**Audience:** Research team — evaluate alternative approaches to canonicalization
**Goal:** Explain exactly what canonicalization does, how it's currently implemented, what works, what doesn't, and where we need better ideas.

---

## 1. What Canonicalization Does

Canonicalization is the **central transformation** in Phoenix. It converts raw text extracted from specification documents (clauses) into a typed, linked graph of canonical nodes — structured statements that the rest of the system can reason about.

```
Clause (raw text block)            Canonical Node (structured)
─────────────────────────          ──────────────────────────
"Tasks must support status     →   type: REQUIREMENT
 transitions: open →               statement: "tasks must support status
 in_progress → review → done"        transitions: open → in_progress →
                                     review → done"
                                   tags: [tasks, support, status,
                                     transitions, open, ...]
                                   source_clause_ids: [<clause_id>]
                                   linked_canon_ids: [<related_nodes>]
```

### What it must produce

Each canonical node has:

| Field | Purpose |
|-------|---------|
| `canon_id` | Content-addressed identity: `SHA-256(type + statement + source_clause_id)` |
| `type` | One of: REQUIREMENT, CONSTRAINT, INVARIANT, DEFINITION |
| `statement` | Normalized, unambiguous English sentence expressing one idea |
| `source_clause_ids` | Provenance: which clause(s) this node was extracted from |
| `linked_canon_ids` | Cross-references: other canon nodes this node relates to |
| `tags` | Extracted domain terms for search, linking, and IU grouping |

### Why it matters

Canonicalization is the **bottleneck** of the entire pipeline. Every downstream system depends on its output quality:

- **IU Planner** groups canon nodes into implementation units — if nodes are too coarse, IUs are too broad to selectively invalidate. If nodes are too fine, IUs proliferate.
- **Change Classification** uses canon node identity to determine what changed — if canonicalization is unstable (same input produces different nodes across runs), the classifier sees phantom changes.
- **Selective Invalidation** traces from changed clauses → affected canon nodes → affected IUs. If a clause maps to too many canon nodes, invalidation loses selectivity.
- **Provenance** must be accurate: every canon node must trace back to the specific clause(s) that justify it. Broken provenance means `phoenix inspect` lies.

---

## 2. Current Implementation: Rule-Based Canonicalizer

**File:** `src/canonicalizer.ts` (155 lines)

### 2.1 Algorithm

```
Input: Clause[]
Output: CanonicalNode[]

For each clause:
  Split clause.raw_text into lines
  For each non-empty, non-heading line:
    Strip list markers (-, *, •, 1.)
    Skip lines shorter than 5 characters
    Classify line type using regex patterns
    If classified:
      Normalize text (lowercase, strip formatting)
      Extract tags (non-stopword tokens > 2 chars)
      Generate canon_id = SHA-256(type + statement + clause_id)
      Create node with [clause_id] as source
    Else:
      Skip (line produces no canonical node)

After all nodes extracted:
  Link nodes by shared terms (≥2 common tags → bidirectional link)
```

### 2.2 Type Classification

The classifier uses ordered regex pattern matching. The most specific patterns (constraints, invariants) are checked first.

**Constraint patterns** (checked first):
```
/\b(?:must not|shall not|forbidden|prohibited|cannot|disallowed)\b/i
/\b(?:limited to|maximum|minimum|at most|at least|no more than)\b/i
```

**Invariant patterns**:
```
/\b(?:always|never|invariant|at all times|guaranteed)\b/i
```

**Requirement patterns** (broadest, checked last):
```
/\b(?:must|shall|required|requires?)\b/i
/\b(?:needs? to|has to|will)\b/i
```

**Definition patterns**:
```
/\b(?:is defined as|means|refers to)\b/i
/:\s+\S/   (colon followed by text)
```

**Heading context fallback:** If no pattern matches a line, the classifier checks the clause's `section_path` (heading hierarchy) for keywords like "constraint", "requirement", "definition", "invariant". This allows lines under a "Security Constraints" heading to be classified as constraints even without explicit keywords.

**If nothing matches:** The line is dropped — it produces no canonical node.

### 2.3 Text Normalization

Before hashing, text is normalized (`src/normalizer.ts`):

- Markdown formatting stripped (bold, italic, links, code fences)
- Headings removed
- Lowercased
- List items sorted alphabetically (so reordering lists doesn't change the hash)
- Whitespace collapsed

This ensures formatting-only changes produce identical normalized output and thus identical `clause_semhash` values.

### 2.4 Term Extraction & Linking

Tags are extracted by tokenizing, removing stopwords (a curated list of ~55 English function words), and keeping tokens longer than 2 characters.

Linking is an O(n²) pairwise comparison: two nodes are linked if they share ≥ 2 tags. Links are bidirectional.
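The extraction-and-linking step above can be sketched in a few lines of TypeScript. This is an illustration of the described algorithm, not the real implementation: the names (`extractTags`, `linkByTags`) are invented, and the stopword list here is tiny where the real one has ~55 entries.

```typescript
// Illustrative stopword list — the real curated list has ~55 entries.
const STOPWORDS = new Set(["the", "a", "an", "and", "or", "to", "of", "in", "is", "are"]);

// Tokenize, drop stopwords, keep tokens longer than 2 characters (deduplicated).
function extractTags(text: string): string[] {
  return [...new Set(
    text.toLowerCase().split(/[^a-z0-9_]+/)
      .filter(t => t.length > 2 && !STOPWORDS.has(t)),
  )];
}

interface Node { id: string; tags: string[]; linked: string[] }

// O(n^2) pairwise comparison: link any two nodes sharing >= minShared tags.
function linkByTags(nodes: Node[], minShared = 2): void {
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const shared = nodes[i].tags.filter(t => nodes[j].tags.includes(t));
      if (shared.length >= minShared) {
        nodes[i].linked.push(nodes[j].id); // links are bidirectional
        nodes[j].linked.push(nodes[i].id);
      }
    }
  }
}
```

Note that with a small domain vocabulary, common terms like "tasks" trivially satisfy the two-tag threshold — which is exactly the noise problem discussed in §5.4.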
### 2.5 Concrete Output (TaskFlow Example)

Input: `spec/tasks.md` (34 lines, 4 sections, 18 list items)

| Stage | Count |
|-------|-------|
| Clauses extracted | 5 (one per heading) |
| Canonical nodes | 18 (one per list item) |
| Types | 18 REQUIREMENT, 0 CONSTRAINT, 0 INVARIANT, 0 DEFINITION |
| Linked pairs | 12 bidirectional links |

Input: `tests/fixtures/spec-auth-v1.md` (29 lines, 4 sections)

| Stage | Count |
|-------|-------|
| Clauses extracted | ~6 |
| Canonical nodes | 8 |
| Types | 6 REQUIREMENT, 2 CONSTRAINT, 0 INVARIANT, 0 DEFINITION |

Input: `tests/fixtures/spec-notifications.md`

| Stage | Count |
|-------|-------|
| Canonical nodes | 14 |
| Types | 12 REQUIREMENT, 1 CONSTRAINT, 1 INVARIANT, 0 DEFINITION |

---

## 3. Current Implementation: LLM-Enhanced Canonicalizer

**File:** `src/canonicalizer-llm.ts` (195 lines)

### 3.1 Algorithm

```
Input: Clause[], LLMProvider | null
Output: CanonicalNode[]

If no LLM provider → fall back to rule-based

Batch clauses into groups of 20
For each batch:
  Build prompt:
    System: "You are a requirements engineer extracting structured canonical nodes..."
    User: For each clause, output section header + raw text
    Request: JSON array of {type, statement, tags}

  Send to LLM (temperature: 0.1, max_tokens: 4096)
  Parse response:
    Strip markdown fences
    Find JSON array
    Validate each element has type (string) and statement (string)

  For each parsed node:
    Match to best source clause by term overlap
    Generate canon_id = SHA-256(type + statement + source_clause_id)
    Create node

After all batches:
  Link nodes by shared terms (same O(n²) algorithm)

On any failure → fall back to rule-based
```

### 3.2 Source Clause Attribution

The LLM returns flat JSON with no explicit clause references.
To establish provenance, the system uses a **best-match heuristic**: for each LLM-returned node, it finds the clause whose text has the most word overlap with the node's statement + tags.

```typescript
function findBestSourceClause(node: LLMCanonNode, clauses: Clause[]): Clause | null {
  // Tokenize node statement + tags → nodeTerms
  // For each clause: count overlap between clause tokens and nodeTerms
  // Return clause with highest overlap
}
```

If no good match is found, it falls back to positional assignment (node index → clause index, clamped).

### 3.3 LLM Prompt

```
System:
You are a requirements engineer extracting structured canonical nodes
from specification text.

For each meaningful statement, extract a JSON object with:
- type: one of REQUIREMENT, CONSTRAINT, INVARIANT, DEFINITION
- statement: the normalized canonical statement
- tags: array of key domain terms (lowercase, no stop words)

Rules:
- REQUIREMENT: something the system must do
- CONSTRAINT: something the system must NOT do, or limits/bounds
- INVARIANT: something that must ALWAYS or NEVER hold
- DEFINITION: defines a term or concept

Output a JSON array. No markdown fences, no explanation.
Only extract nodes where there is a clear, actionable statement.
Skip headings, meta-text, and filler.
```

---

## 4. What Works

### 4.1 Content-Addressed Identity Is Sound

The `canon_id = SHA-256(type + statement + source_clause_id)` scheme means identical extraction from identical input always produces the same node ID. This is critical for change detection: if a clause doesn't change, its canon nodes keep their IDs, and no downstream invalidation fires.

### 4.2 Provenance Tracking Is Correct (For Rule-Based)

In the rule-based path, every canon node is created directly from a specific clause's text.
The `source_clause_ids` array is always correct because the mapping is syntactic — line N of clause C produces node N of clause C.

### 4.3 Fallback Strategy Is Robust

The LLM-enhanced path falls back to rule-based on any failure (no provider, parse error, empty result, timeout). This means canonicalization never blocks on external dependencies.

### 4.4 Normalization Produces Stable Hashes

List sorting, whitespace collapse, and format stripping mean that most cosmetic edits (reindenting, reordering bullets, changing bold to italic) produce identical `clause_semhash` values and thus don't trigger re-canonicalization.

---

## 5. What's Wrong: Known Shortcomings

### 5.1 Rule-Based: Everything Is a REQUIREMENT

**The problem:** The task management spec has 18 canon nodes. All 18 are typed as REQUIREMENT. Zero constraints, zero invariants, zero definitions.

This is clearly wrong. "Tasks must support status transitions: open → in_progress → review → done" is a requirement, but it also implicitly defines "task" and the valid statuses. "Invalid status transitions must be rejected" is a constraint. The rule-based classifier can't see these semantic distinctions — it matches "must" and calls everything a REQUIREMENT.

**Impact:** Type information is used to derive risk tiers, evidence policies, and invariant lists on IU contracts. If everything is REQUIREMENT, risk assessment is degraded and invariants are empty. In the TaskFlow example, zero invariants are extracted, so IU contracts have empty invariant lists.

**Root cause:** Regex patterns are too blunt. "Must" appears in requirements, constraints, and invariants. The patterns need semantic understanding that regex can't provide.

### 5.2 Rule-Based: Line-Level Granularity Is Too Rigid

**The problem:** The canonicalizer operates line-by-line. Each line that matches a pattern becomes one canonical node.
This means:

- A multi-line statement split across lines produces multiple incomplete nodes
- A compound statement ("X must do A and must do B") becomes one node instead of two
- Paragraph-style specs (not bulleted lists) often produce zero nodes because no single line matches a pattern strongly enough

**Example failure:** Consider this spec text:
```
Tasks support three assignment modes. In single mode, one person owns
the task. In team mode, the task is shared. The assignee must accept
the assignment before it takes effect.
```

The rule-based canonicalizer would:
- Skip line 1 (no keyword match for "support three assignment modes")
- Skip line 2 (no "must/shall" keyword)
- Extract only line 3 as a REQUIREMENT ("the assignee must accept...")
- Miss the definition of assignment modes entirely

**Impact:** Specs that use flowing prose instead of bulleted lists get significantly fewer canonical nodes extracted. The system penalizes a writing style.

### 5.3 Rule-Based: Dropped Lines Are Silent

**The problem:** When a line doesn't match any pattern and there's no heading context, it's silently dropped. There is no diagnostic, no coverage metric, no way to know that 30% of your spec text was ignored.

**Impact:** Users don't know their spec has uncovered requirements. A clause might have 8 lines but produce only 3 canon nodes. The other 5 lines — which may contain important context, definitions, or implicit constraints — are invisible to the rest of the pipeline.

### 5.4 Term-Based Linking Is Noisy

**The problem:** Nodes are linked if they share ≥ 2 non-stopword tags. With extracted tags like `[tasks, status, transitions, open, ...]`, the word "tasks" appears in nearly every node in a task management spec. This means most nodes end up linked to most other nodes.
**Concrete example:** In the TaskFlow spec, node [15] ("overdue tasks must be flagged automatically") is linked to **5 other nodes** — nearly a third of all nodes. Node [2] ("tasks must support status transitions") is linked to **4 nodes**.

When everything is linked to everything, linking provides no useful information. It's noise, not signal.

**Root cause:** The linking threshold (≥ 2 shared tags) is too low for domains with small vocabularies. And tag extraction is just tokenization + stopword removal — there's no concept of term importance or domain specificity.

### 5.5 LLM Path: Provenance Attribution Is Heuristic

**The problem:** When the LLM extracts canonical nodes, the system doesn't know which clause each node came from. It guesses using word overlap: "which clause's text overlaps most with this node's statement and tags?"

This heuristic breaks when:
- The LLM synthesizes a node from multiple clauses (the node is a composite)
- The LLM rephrases heavily (low word overlap with any single clause)
- Two clauses cover similar vocabulary (ambiguous attribution)

**Impact:** `source_clause_ids` may be wrong for LLM-extracted nodes. This means the provenance graph lies — you trace a canon node back to a clause, but it was actually derived from a different clause (or multiple clauses). This undermines the core promise of Phoenix: "you can trace any generated file back to the spec sentence that caused it."

### 5.6 LLM Path: Instability Across Runs

**The problem:** Even with `temperature: 0.1`, the LLM may produce slightly different statements across runs. "Tasks must support status transitions" might become "The system shall allow task status transitions" on a second run. These produce different normalized text → different `canon_id` → the system sees phantom changes.
**Impact:** Re-running `phoenix canonicalize` on an unchanged spec may produce different canon_ids, triggering unnecessary downstream invalidation. This defeats the purpose of content-addressed identity.

**Root cause:** LLMs are not deterministic functions. Temperature 0.1 is low but not zero, and even at temperature 0, implementation details (batching, floating point, etc.) cause variation.

### 5.7 LLM Path: No Structural Awareness

**The problem:** The LLM receives clause text with section headers, but the prompt doesn't convey structural relationships: "this clause is in the same document as these other clauses," "this section is nested under that section," "these three clauses are sequential."

**Impact:** The LLM can't extract cross-clause relationships. If clause 1 defines "task" and clause 2 references "task" without re-defining it, the LLM extracts from clause 2 without the context that "task" was defined in clause 1. This limits the LLM's ability to produce accurate types (DEFINITION vs. REQUIREMENT) and cross-references.

### 5.8 One-to-One Clause→Node Assumption

**The problem:** Both the rule-based and LLM paths assume each canon node comes from exactly one clause (`source_clause_ids` is always a single-element array in practice). But real requirements often span multiple clauses:

- "Tasks have statuses" (clause in section 1) + "Statuses must follow the transition graph" (clause in section 2) = one canonical requirement that needs both clauses as provenance
- "Users are authenticated" (auth spec) + "Authenticated users can create tasks" (task spec) = cross-document dependency

**Impact:** Canon nodes can't express multi-clause provenance, which means cross-cutting requirements (security constraints that apply to multiple features, definitions used across sections) are either duplicated or attributed to only one source.
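One possible shape for multi-clause provenance — sketched here with invented names, not the current API — is to extend the existing hash scheme by hashing the *sorted* list of source clause IDs, so a node citing several clauses still has a stable, order-independent identity:

```typescript
import { createHash } from "node:crypto";

// Hypothetical extension of the canon_id scheme to multiple sources.
interface MultiSourceNode {
  type: string;
  statement: string;
  source_clause_ids: string[]; // may cite more than one clause
}

function canonId(node: MultiSourceNode): string {
  // Sort so that citation order does not affect identity; the \x1f
  // separator (an assumption, not the current format) avoids
  // ambiguous concatenation of fields.
  const sources = [...node.source_clause_ids].sort().join(",");
  return createHash("sha256")
    .update(`${node.type}\x1f${node.statement}\x1f${sources}`)
    .digest("hex");
}
```

Note this changes every existing `canon_id` (single-element arrays hash differently than bare IDs), so migration would need a one-time re-keying pass.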
### 5.9 No Merge/Dedup Across Clauses

**The problem:** Two clauses in different sections might express the same requirement. The canonicalizer creates two separate nodes with different `canon_id`s (because `source_clause_id` is part of the hash). There is no deduplication.

**Example:**
- Clause in "Task Lifecycle": "Tasks must have a status"
- Clause in "Assignment": "Each task has a status that affects assignment eligibility"

These should arguably be one canonical node with two source clauses. Instead they're two nodes that happen to share some tags.

### 5.10 O(n²) Linking Doesn't Scale

**The problem:** The linking algorithm compares every pair of nodes. For the TaskFlow example (54 total nodes across 3 specs), this is 1,431 comparisons — fine. For a real project with 500 canon nodes, it's 124,750 comparisons. For 5,000 nodes, it's 12.5 million.

The comparison itself is also naive (array intersection of string tags), not just the iteration pattern.

---

## 6. Deeper Structural Problems

### 6.1 No Notion of "Coverage"

The system doesn't track what percentage of a clause's content was extracted into canonical nodes. A clause with 10 sentences might produce 3 canon nodes, and the other 7 sentences are silently ignored. There's no metric for this.

**What we need:** A coverage score per clause: `nodes_extracted / extractable_statements`. This would let `phoenix status` warn: "Clause at spec/tasks.md L14-20 has 35% coverage — 4 statements were not canonicalized."

### 6.2 No Confidence Scoring

Canon nodes have no confidence score. A node extracted by a perfect regex match on "shall not" has the same weight as a node extracted from heading context fallback on an ambiguous line. The downstream systems (IU planner, classifier) can't distinguish high-confidence extractions from low-confidence ones.
### 6.3 Canonicalization Is Not Idempotent Under Composition

If you canonicalize clauses 1–5, then later canonicalize clauses 6–10, and then canonicalize all 10 together, you get different linking results. The pairwise linking step is global — adding new nodes creates new links between existing nodes. This means the order of canonicalization matters for the link graph, even though per-clause node extraction is independent.

### 6.4 No Hierarchy in Canonical Graph

The canonical graph is flat — all nodes are peers connected by undirected links. But requirements naturally form hierarchies: "The system supports task management" decomposes into "Tasks have lifecycle states," which decomposes into "Status transitions follow the allowed graph." This parent-child structure is lost.

### 6.5 The Statement Is The Identity

Because `canon_id = SHA-256(type + statement + clause_id)`, the statement is the identity. If the LLM slightly rephrases a statement, it's a completely new node. There is no "soft identity" or similarity threshold — you're either identical or you're different.

This creates a tension: we want statements to be normalized (so identity is stable) but also want them to be meaningful (so humans can read them). Heavy normalization helps stability but hurts readability. Light normalization helps readability but hurts stability.

---

## 7. What Better Approaches Might Look Like

These are directions for the research team to evaluate. We're not prescribing solutions — we're naming the design space.

### 7.1 Semantic Chunking Instead of Line Splitting

Instead of splitting on line boundaries, identify **semantic units** within clause text — statements that express a single requirement or constraint, regardless of how many lines they span.
Possible approaches:
- Sentence boundary detection + classification per sentence
- Dependency parsing to identify clause-level semantic units
- LLM-based extraction with explicit sentence boundary identification

### 7.2 Multi-Pass Extraction

```
Pass 1: Extract DEFINITION nodes (terms, concepts)
Pass 2: Extract REQUIREMENT nodes, resolving references to definitions
Pass 3: Extract CONSTRAINT/INVARIANT nodes, linking to requirements they constrain
Pass 4: Cross-document resolution (same term used in different specs)
```

Multi-pass could solve the typing problem (pass 1 establishes vocabulary, later passes use it) and the cross-clause provenance problem (pass 4 links across documents).

### 7.3 Embedding-Based Linking Instead of Keyword Matching

Replace term-overlap linking with embedding similarity: compute a vector embedding for each canon node's statement, then link nodes whose embeddings are within a threshold.

Advantages: captures semantic similarity ("rate limiting" and "throttling" would link). Disadvantages: requires an embedding model, threshold tuning, and introduces non-determinism.

### 7.4 Hierarchical Canonical Graph

Add a `parent_canon_id` field. Top-level nodes represent high-level capabilities. Children represent specific requirements. Leaves represent constraints/invariants.

This would enable hierarchical invalidation: changing a leaf only invalidates its subtree, not the whole cluster connected by term overlap.
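A sketch of how subtree-scoped invalidation could work under this model — the field name `parent_canon_id` comes from the proposal above, but the function and its shape are hypothetical:

```typescript
// A node in the hierarchical graph: one optional parent pointer.
interface HierNode { canon_id: string; parent_canon_id: string | null }

// Collect a node and all its descendants — the invalidation set for a
// change to that node, instead of everything term-linked to it.
function subtree(nodes: HierNode[], rootId: string): string[] {
  // Index children by parent once, then walk down from the root.
  const children = new Map<string | null, string[]>();
  for (const n of nodes) {
    const list = children.get(n.parent_canon_id) ?? [];
    list.push(n.canon_id);
    children.set(n.parent_canon_id, list);
  }
  const out: string[] = [];
  const stack = [rootId];
  while (stack.length > 0) {
    const id = stack.pop()!;
    out.push(id);
    stack.push(...(children.get(id) ?? []));
  }
  return out;
}
```

Building the child index once makes each invalidation query linear in the subtree size rather than quadratic in the node count.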
### 7.5 Stable Canonical Identity (Soft Matching)

Instead of exact hash identity, use a **two-layer identity**:
- `canon_id` (exact): current SHA-256 scheme, changes on any rewording
- `canon_anchor`: a stable identity based on semantic meaning, survives minor rephrasing

The anchor could be:
- SHA-256 of sorted tags (survives statement rewording if tags are stable)
- Embedding hash (locality-sensitive hash of statement embedding)
- Clause-anchored: `SHA-256(source_clause_id + type)` (stable as long as source clause and type don't change)

When comparing old and new canonical graphs, match first by `canon_anchor`, then by `canon_id`. This separates "this node changed its wording" from "this is a completely new node."

### 7.6 LLM With Clause References in Output

Modify the LLM prompt to require explicit clause attribution:

```json
{
  "type": "REQUIREMENT",
  "statement": "Tasks must support status transitions: open → in_progress → review → done",
  "source_clauses": ["Task Lifecycle"],
  "tags": ["task", "status", "transitions"]
}
```

The LLM identifies which section/clause each node came from. This replaces the term-overlap heuristic for provenance.

### 7.7 Coverage-Aware Extraction

After extraction, compute a coverage map: for each sentence/line in the original clause, did any canon node claim it? Report uncovered content as a diagnostic.

This could be combined with a "residual" extraction pass: after the first pass, feed uncovered sentences back to the extractor with a prompt like "these statements were not classified — are they requirements, constraints, definitions, or truly irrelevant?"

### 7.8 Deterministic LLM Normalization

Instead of using the LLM for extraction, use it only for **normalization**: the rule-based extractor identifies candidate nodes, then the LLM normalizes each statement to a canonical form.
This preserves deterministic extraction (same input → same candidates) while improving statement quality.

```
Rule-based:     extracts "Tasks must support status transitions: open → in_progress → review → done"
LLM normalizer: "The system shall support task status transitions following the graph: open → in_progress → review → done"
```

The normalized statement is hashed for identity. Because the LLM input is a single sentence (not a whole clause), output variance is much lower.

---

## 8. Evaluation Criteria for New Approaches

Any replacement or augmentation of the canonicalization system must be evaluated against:

| Criterion | Description | Current Baseline |
|-----------|-------------|------------------|
| **Extraction recall** | % of spec statements that produce at least one canon node | ~70% (prose paragraphs are missed) |
| **Type accuracy** | % of canon nodes with correct type classification | ~60% (everything tends toward REQUIREMENT) |
| **Provenance accuracy** | % of canon nodes with correct source_clause_ids | 100% rule-based, ~80% LLM (heuristic attribution) |
| **Identity stability** | Same input → same canon_ids across runs | 100% rule-based, ~90% LLM (temperature variance) |
| **Linking precision** | % of links that represent genuine semantic relationships | ~40% (keyword overlap is noisy) |
| **Linking recall** | % of genuine semantic relationships that are captured | ~30% (misses synonym and implication relationships) |
| **Cross-clause resolution** | Can a canon node cite multiple source clauses? | No (single clause only) |
| **Coverage visibility** | Does the system report what was NOT extracted? | No |
| **Scalability** | Performance at 1K, 10K, 100K nodes | O(n²) linking is the bottleneck |
| **Latency** | Time to canonicalize 100 clauses | <100ms rule-based, 5–15s LLM |
| **Determinism** | Does repeated execution produce identical output? | Yes rule-based, no LLM |
| **Graceful fallback** | Does the system degrade gracefully without an LLM? | Yes |

---

## 9. Code Pointers

| File | Lines | What It Does |
|------|-------|--------------|
| `src/canonicalizer.ts` | 155 | Rule-based extraction: pattern matching, term extraction, linking |
| `src/canonicalizer-llm.ts` | 195 | LLM-enhanced extraction: batched prompt, JSON parsing, heuristic provenance |
| `src/normalizer.ts` | 70 | Text normalization: format stripping, list sorting, whitespace collapse |
| `src/semhash.ts` | 55 | SHA-256 hashing, clause_semhash, context_semhash_cold |
| `src/warm-hasher.ts` | 50 | context_semhash_warm incorporating canonical graph context |
| `src/spec-parser.ts` | 130 | Markdown → Clause[] (section boundary detection, heading hierarchy) |
| `src/models/canonical.ts` | 30 | CanonicalNode and CanonicalGraph type definitions |
| `src/store/canonical-store.ts` | 90 | Persistence layer for canonical graph |
| `tests/functional/canonicalization.test.ts` | 140 | Integration tests for the full canonicalization pipeline |

---

## 10. Summary of What We Need Help With

Ranked by impact:

1. **Better type classification.** The current system classifies almost everything as REQUIREMENT. We need an approach that reliably distinguishes requirements, constraints, invariants, and definitions — ideally without requiring an LLM for every extraction.

2. **Stable multi-clause provenance.** Canon nodes must be able to cite multiple source clauses, and LLM-extracted nodes must have accurate (not heuristic) provenance.

3. **Meaningful linking.** The current keyword-overlap linking produces too many false connections. We need linking that captures actual semantic relationships (constraint-constrains-requirement, definition-defines-term-used-in-requirement) without requiring O(n²) comparisons.

4. **Coverage visibility.** Users need to know what percentage of their spec was successfully canonicalized, and which statements were dropped.

5. **Identity stability under LLM extraction.** If we use LLMs, we need a way to produce stable canon_ids across runs. The current statement-is-identity model breaks with any LLM variance.

6. **Extraction from prose.** The system only handles bulleted lists well. Flowing prose, tables, and mixed-format specs need better support.

We're looking for approaches that maintain Phoenix's core properties — content-addressed identity, explicit provenance, deterministic fallback — while dramatically improving extraction quality, type accuracy, and linking precision.

---

*Generated from Phoenix VCS v0.1.0 codebase. All code is TypeScript, zero runtime dependencies, ~700 lines total for canonicalization.*