commits

SPRINT 3: LLM Stabilization + Anchors
--------------------------------------

Self-consistency (k=3 medoid):
- src/canonicalizer-llm.ts: LLMCanonOptions.selfConsistencyK parameter
- Generate k samples (first at temp=0, rest at temp=0.3)
- Select lexical medoid (most similar to all others by token Jaccard)
- Ties broken alphabetically for determinism
- Exported selectMedoid() for testing

Anchor-based diff in classifier:
- src/classifier.ts: computeAnchorOverlap() compares canon_anchor sets
between before/after clauses
- When anchors match (>50% overlap), high-edit-distance changes get
downgraded from D→B (same concept, different wording)
- Reduces phantom D-class from LLM rephrasing

SPRINT 4: Evaluation + Polish
------------------------------

Evaluation harness:
- tests/eval/gold-standard.ts: 6 annotated specs with expected nodes,
types, edges, coverage bounds, and node count ranges
- tests/eval/canonicalization-eval.test.ts: 40 tests measuring:
- Extraction recall (per-spec and aggregate)
- Type accuracy (per-spec and aggregate)
- Coverage (per-spec bounds)
- Linking precision (for specs with expected edges)
- Node count bounds
- Max degree enforcement
- Hierarchy coverage
- Baseline report table printed to stdout

Results (rule-based, no LLM):
┌──────────────────┬────────┬─────────┬───────┬───────┬───────┬───────┐
│ Spec │ Recall │ TypeAcc │ Cover │ ResD% │ Hier% │ Nodes │
├──────────────────┼────────┼─────────┼───────┼───────┼───────┼───────┤
│ Auth v1 │ 100% │ 100% │ 86% │ 50% │ 100% │ 11 │
│ Auth v2 │ 100% │ 67% │ 88% │ 50% │ 100% │ 14 │
│ Notifications │ 100% │ 100% │ 100% │ 60% │ 100% │ 15 │
│ Gateway │ 100% │ 100% │ 100% │ 78% │ 100% │ 21 │
│ TaskFlow: tasks │ 100% │ 100% │ 100% │ 100% │ 100% │ 19 │
│ TaskFlow: analyt │ 100% │ 100% │ 100% │ 57% │ 100% │ 11 │
├──────────────────┼────────┼─────────┼───────┤ │ │ │
│ AVERAGE │ 100% │ 94% │ 96% │ │ │ │
└──────────────────┴────────┴─────────┴───────┴───────┴───────┴───────┘

vs Targets: Recall ≥95% ✅, TypeAcc ≥90% ✅, Coverage ≥95% ✅

Phoenix status enhancements:
- Canon type breakdown (e.g., '18 REQUIREMENT, 3 CONSTRAINT, 1 CONTEXT')
- Resolution metrics: edge count, relates_to %, max degree, hierarchy %
- Extraction coverage % with per-clause warnings for <80%
- Low-coverage clauses appear as info diagnostics

Phoenix inspect enhancements:
- CanonNodeInfo: confidence, anchor, parentId, linkTypes, extractionMethod
- Edge type passed through for canon→canon edges
- Parent edges (canon→parent) for hierarchy visualization
- CONTEXT badge color (yellow) distinct from CONSTRAINT (red)
- Canon subtitle shows confidence score and extraction method

New files:
- tests/eval/gold-standard.ts (6 annotated specs)
- tests/eval/canonicalization-eval.test.ts (40 tests)
- tests/unit/self-consistency.test.ts (5 tests)
- tests/unit/anchor-diff.test.ts (3 tests)

305 tests passing across 33 files (48 new tests since Sprint 2)

5w ago

Chad Fowler

148a0d29

Implement canonicalization v2: two-phase extraction/resolution pipeline

ARCHITECTURE CHANGES:
- Split canonicalization into Phase 1 (Extraction) and Phase 2 (Resolution)
- Phase 1 is deterministic, per-clause, parallelizable
- Phase 2 is a versioned global graph pass

NEW FILES:
- src/sentence-segmenter.ts — sentence-level text segmentation
- src/resolution.ts — dedup, typed edges, hierarchy, anchors, IDF linking
- tests/unit/sentence-segmenter.test.ts — 9 tests
- tests/unit/resolution.test.ts — 13 tests

MODEL CHANGES (src/models/canonical.ts):
- Added CONTEXT as 5th CanonicalType (framing text, not actionable)
- Added CandidateNode interface (Phase 1 output)
- Added ExtractionCoverage interface
- Added EdgeType union: constrains | defines | refines | invariant_of | duplicates | relates_to
- Added optional fields to CanonicalNode: canon_anchor, confidence,
link_types, parent_canon_id, extraction_method

EXTRACTION (src/canonicalizer.ts rewrite):
- Sentence-level segmentation replaces line-level splitting
- Scoring rubric replaces binary regex matching (scores across all 5 types)
- CONTEXT type catches non-actionable text (previously dropped silently)
- Confidence scores: margin between winning and runner-up type
- Acronym whitelist: id, api, jwt, sso, otp, etc. no longer dropped
- Hyphenated compounds preserved as single tags (rate-limit, in-progress)
- extractCandidates() exposed as public API with coverage metrics

RESOLUTION (src/resolution.ts new):
- Deduplication: token-trigram fingerprinting + Jaccard similarity >0.7
- Typed edge inference: constrains, defines, refines, invariant_of
- IDF-weighted inverted index replaces O(n²) pairwise linking
- Hierarchy from heading structure (parent_canon_id)
- canon_anchor: SHA-256(type + sorted_tags + sorted_source_clause_ids)
- Max degree cap of 8 per node (enforced by IDF-scored pruning)

NORMALIZER FIX (src/normalizer.ts):
- Numbered lists no longer sorted (correctness bug — order matters)
- Bullet lists with sequence indicators (→, ->, ordinals) preserved

LLM CANONICALIZER (src/canonicalizer-llm.ts rewrite):
- Default mode: LLM-as-normalizer (rule extraction + LLM statement rewrite)
- Temperature 0, JSON schema enforced, per-sentence (not batch)
- CONTEXT nodes skipped (not worth LLM cost)
- Full extraction mode behind extractWithLLMFull() with explicit provenance
- Positional fallback removed — nodes without valid provenance are dropped

WARM HASHER (src/warm-hasher.ts):
- Uses only typed edges (excludes weak 'relates_to') in context hash
- Filters by confidence threshold (≥0.3)

IU PLANNER (src/iu-planner.ts):
- CONTEXT nodes filtered out (don't generate code)

RESULTS (before → after):
TaskFlow tasks.md:
Types: {REQ:18} → {CTX:1, REQ:18}
Coverage: unmeasured → 100%
Hierarchy: none → 18/19 nodes have parents
Edges: 24 untyped → 26 (all typed)

Auth v1:
Types: {REQ:6, CON:2} → {CTX:5, REQ:3, CON:3}
Edges: 2 untyped → 8 (4 refines, 4 relates_to)

Notifications:
Types: {REQ:12, CON:1, INV:1} → {CTX:1, REQ:10, CON:3, INV:1}
Edges: 14 untyped → 10 (2 constrains, 2 refines, 6 relates_to)

257 tests passing across 30 files (22 new tests).

5w ago

Chad Fowler

7d0b9931

Add CANONICALIZATION-PLAN.md — architecture plan for canonicalization v2

5w ago

Chad Fowler

3265fd9f

Add CANONICALIZATION.md — deep technical doc for research team review

5w ago

Chad Fowler

ef64efcd

Add DATA-MODEL.md — comprehensive taxonomy doc for research team review

5w ago

Chad Fowler

62bc6188

Redesigned phoenix inspect with focus mode, SVG lines, and graph overlay

5w ago

Chad Fowler

dd726f7f

phoenix inspect — interactive intent pipeline visualisation

5w ago

Chad Fowler

1bd3c2b2

Fix manifest stale entry eviction on IU ID changes

5w ago

Chad Fowler

7a45d1da

TaskFlow web dashboard — spec-driven UI generation

5w ago

Chad Fowler

822126e0

Taskflow example + scaffold naming collision fix

5w ago

Chad Fowler

f7b13024

LLM-enhanced canonicalization & classification + E2E success criteria tests

5w ago

Chad Fowler

795f8a9b

Pluggable LLM code generation with typecheck-and-retry

5w ago

Chad Fowler

030f51e3

Phoenix VCS v0.1.0 — initial commit

5w ago

feat: code gen reliability 5% → 89% — template assembly + SQL repair main

d0ea5dab

Chad Fowler

feat: template-based code generation — reliability 5% → 79%

85172555

Chad Fowler

refactor: split Architecture from Runtime Target

6a4f2b5b

Chad Fowler

feat: deep improvement report — HTML dashboard with all 5 categories

538b54e3

Chad Fowler

feat: deep improvement sweep — 5 categories, measurable outcomes

003eecc2

Chad Fowler

docs: add README for open source launch

575e16db

Chad Fowler

chore: clean up for open source — remove .DS_Store, update gitignore

0d9919d3

Chad Fowler

feat: phoenix ingest --verbose shows line-level text diff

9410856c

Chad Fowler

feat: interactive Spec Text view in Phoenix Inspect

3417b635

Chad Fowler

fix: phoenix ingest now shows diff before overwriting stored clauses

92c279f3

Chad Fowler

fix: null vs undefined in FK validation — systemic prompt rule

1c2d6c7f

Chad Fowler

fix: nullable FK fields in Zod schemas — systemic prompt rule

bc32f6e8

Chad Fowler

fix: three systemic pipeline issues found via user testing

f3d3c226

Chad Fowler

fix: systemic — web UI module now calls sibling API endpoints

ada916f7

Chad Fowler

feat: v2 user-centric todo spec — full Todoist-style app from behaviors

2ae724ca

Chad Fowler

test: confirm 19/19 (100%) with UI module — increase eval timeout

75670484

Chad Fowler

feat: generate web UI from spec + architecture stub fallback

bea0e462

Chad Fowler

experiment: snake_case enforcement + stats path — 32%→100% (19/19)

457ce230

Chad Fowler

fix: write shared files before codegen so typecheck resolves imports

ddfe1ec9

Chad Fowler

feat: harder multi-resource spec + improved architecture target

95e5d521

Chad Fowler

feat: add architecture eval runner — end-to-end CRUD verification

58dca5f7

Chad Fowler

fix: working end-to-end todo API from spec

954f3ac5

Chad Fowler

feat: architecture targets — compile specs to working apps

12396fa0

Chad Fowler

feat: add phoenix-self example — Phoenix specs itself (dog-food)

ceeaff42

Chad Fowler

feat: add LLM reclassifier mode, run 4 LLM experiments

1478747a

Chad Fowler

feat: add LLM experiment infrastructure

a934e659

Chad Fowler

fix: hierarchy inference, heading extraction, and gold standard accuracy

e5a28949

Chad Fowler

experiment: round 3 with 12 gold specs, final score 0.9021

f0b00a02

Chad Fowler

experiment: SAME_TYPE_REFINE_THRESHOLD 0.2→0.15 score=0.9021

4e325fec

Chad Fowler

eval: fix gold standard type annotations to match pipeline semantics

653949d4

Chad Fowler

eval: add 6 new gold-standard specs for broader evaluation

1ca17820

Chad Fowler

experiment: round 2 complete, SAME_TYPE_REFINE_THRESHOLD is the key lever

7a75354b

Chad Fowler

experiment: SAME_TYPE_REFINE_THRESHOLD 0.3→0.2 score=0.9640

1462844a

Chad Fowler

experiment: SAME_TYPE_REFINE_THRESHOLD 0.5→0.3 score=0.9265

666e4bee

Chad Fowler

fix: richer edge type inference reduces D-rate 61%→47%

36c380dd

Chad Fowler

experiment: log 15 experiments, final score 0.8861

f31a1c78

Chad Fowler

experiment: CONTEXT_NO_MODAL_WEIGHT 2→1 score=0.8861

081140f8

Chad Fowler

experiment: MAX_DEGREE 8→12 score=0.8824

aa1c0a6a

Chad Fowler

experiment: MIN_SHARED_TAGS 2→1 score=0.8821

76340677

Chad Fowler

experiment: DOC_FREQ_CUTOFF 0.4→0.5 score=0.8816

afa272b7

Chad Fowler

feat: add autoresearch-style experiment loop for canonicalization tuning

9470fa51

Chad Fowler

feat: add settle-up example — expense splitter for demos

1011b949

Chad Fowler

feat: implement Fowler gap analysis — evaluations, pace layers, conceptual mass, negative knowledge, replacement audit

f5187379

Chad Fowler

Add Pixel Wars example: real-time multiplayer territory game

35c27b83

Chad Fowler

Complete canonicalization v2 Sprints 3-4: LLM stability, anchor diff, eval harness

73c1a820

Chad Fowler

Implement canonicalization v2: two-phase extraction/resolution pipeline

148a0d29

Chad Fowler

Add CANONICALIZATION-PLAN.md — architecture plan for canonicalization v2

7d0b9931

Chad Fowler

Add CANONICALIZATION.md — deep technical doc for research team review

3265fd9f

Chad Fowler

Add DATA-MODEL.md — comprehensive taxonomy doc for research team review

ef64efcd

Chad Fowler

Redesigned phoenix inspect with focus mode, SVG lines, and graph overlay

62bc6188

Chad Fowler

phoenix inspect — interactive intent pipeline visualisation

dd726f7f

Chad Fowler

Fix manifest stale entry eviction on IU ID changes

1bd3c2b2

Chad Fowler

TaskFlow web dashboard — spec-driven UI generation

7a45d1da

Chad Fowler

Taskflow example + scaffold naming collision fix

822126e0

Chad Fowler

LLM-enhanced canonicalization & classification + E2E success criteria tests

f7b13024

Chad Fowler

Pluggable LLM code generation with typecheck-and-retry

795f8a9b

Chad Fowler

Phoenix VCS v0.1.0 — initial commit

030f51e3

Chad Fowler