Reference implementation for the Phoenix Architecture. Work in progress.
# Phoenix Canonicalization — Experiment Program

You are an autonomous research agent optimizing Phoenix's canonicalization pipeline.

## Rules

1. **Edit ONLY `src/experiment-config.ts`** — never touch source files, tests, or this file
2. **Run `npx tsx experiments/eval-runner.ts`** after every change
3. **Parse the composite score** from the last line: `val_score=X.XXXX`
4. **If score improved** → `git add src/experiment-config.ts && git commit -m "experiment: <description> score=X.XXXX"`
5. **If score decreased or unchanged** → `git checkout src/experiment-config.ts` (revert)
6. **Never stop to ask the human** — run experiments indefinitely until interrupted
7. **Never install packages** — work within existing dependencies
8. **Log your reasoning** in commit messages so the human can review your thought process

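The loop in rules 2–5 can be sketched as a script. The eval command, the `val_score` output format, and the git commands are the ones stated above; the function names and the strict-improvement comparison are illustrative.

```typescript
import { execSync } from "node:child_process";

// Parse `val_score=X.XXXX` from the runner's last output line (rule 3).
function parseScore(output: string): number {
  const last = output.trim().split("\n").at(-1) ?? "";
  const m = last.match(/val_score=([\d.]+)/);
  if (!m) throw new Error(`no val_score in: ${last}`);
  return Number(m[1]);
}

// One experiment step: run the eval, then commit or revert (rules 2-5).
// Returns the best score seen so far.
function step(best: number, description: string): number {
  const out = execSync("npx tsx experiments/eval-runner.ts").toString();
  const score = parseScore(out);
  if (score > best) {
    execSync(
      `git add src/experiment-config.ts && git commit -m "experiment: ${description} score=${score.toFixed(4)}"`,
    );
    return score;
  }
  execSync("git checkout src/experiment-config.ts"); // revert
  return best;
}
```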
## Composite Score Formula

```
score = 0.30 * avg_recall
      + 0.25 * avg_type_accuracy
      + 0.20 * avg_coverage / 100
      + 0.15 * (1 - avg_d_rate)
      + 0.10 * avg_hier_coverage
```

Higher is better. The current baseline is recorded in `experiments/results.tsv`.
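For reference, the formula as a function. The field names are assumed to match the metric names used above, and `avg_coverage` is assumed to be a percentage (hence the division by 100); the other metrics are assumed to already lie in 0..1.

```typescript
interface EvalMetrics {
  avg_recall: number;        // 0..1
  avg_type_accuracy: number; // 0..1
  avg_coverage: number;      // assumed percentage, 0..100
  avg_d_rate: number;        // 0..1, lower is better
  avg_hier_coverage: number; // 0..1
}

// Weighted sum from the formula above; a perfect run scores 1.0.
function compositeScore(m: EvalMetrics): number {
  return (
    0.30 * m.avg_recall +
    0.25 * m.avg_type_accuracy +
    0.20 * (m.avg_coverage / 100) +
    0.15 * (1 - m.avg_d_rate) +
    0.10 * m.avg_hier_coverage
  );
}
```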

## Available Parameters (in `src/experiment-config.ts`)

### Resolution (graph construction)
- `MAX_DEGREE` — max edges per node (currently 8)
- `MIN_SHARED_TAGS` — minimum shared tags to create an edge (currently 2)
- `JACCARD_DEDUP_THRESHOLD` — similarity threshold for merging duplicates (currently 0.7)
- `FINGERPRINT_PREFIX_COUNT` — number of token prefixes for dedup bucketing (currently 8)
- `DOC_FREQ_CUTOFF` — document frequency above which a tag is considered trivial (currently 0.4)
### Scoring Weights (type classification)
- `CONSTRAINT_NEGATION_WEIGHT` — "must not", "forbidden", etc. (currently 4)
- `CONSTRAINT_LIMIT_WEIGHT` — "maximum", "at most", etc. (currently 3)
- `CONSTRAINT_NUMERIC_WEIGHT` — numeric bounds like "≤100" (currently 2)
- `INVARIANT_SIGNAL_WEIGHT` — "always", "never", "guaranteed" (currently 4)
- `REQUIREMENT_MODAL_WEIGHT` — "must", "shall" (currently 2)
- `REQUIREMENT_KEYWORD_WEIGHT` — "required", "needs to" (currently 2)
- `REQUIREMENT_VERB_WEIGHT` — action verbs like "implement", "validate" (currently 1)
- `DEFINITION_EXPLICIT_WEIGHT` — "is defined as", "means" (currently 4)
- `DEFINITION_COLON_WEIGHT` — "Term: definition" pattern (currently 3)
- `CONTEXT_NO_MODAL_WEIGHT` — no modal verbs present (currently 2)
- `CONTEXT_SHORT_WEIGHT` — short sentence without modals (currently 1)
- `HEADING_CONTEXT_BONUS` — bonus from heading keywords (currently 2)
- `CONSTRAINT_MUST_BONUS` — extra constraint credit for "must" (currently 1)
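These weights are additive per-type scores; whichever type accumulates the most signal wins. A simplified sketch of that mechanism using a few of the weights above; the actual signal patterns live in the source files and are richer than these regexes.

```typescript
type NodeType = "CONSTRAINT" | "INVARIANT" | "REQUIREMENT" | "DEFINITION" | "CONTEXT";

// Illustrative additive scoring: each matched signal adds its weight to one
// type, and the highest-scoring type wins. Only a subset of signals is shown.
function classify(text: string): NodeType {
  const t = text.toLowerCase();
  const scores: Record<NodeType, number> = {
    CONSTRAINT: 0, INVARIANT: 0, REQUIREMENT: 0, DEFINITION: 0, CONTEXT: 0,
  };
  if (/\bmust not\b|\bforbidden\b/.test(t)) scores.CONSTRAINT += 4;          // CONSTRAINT_NEGATION_WEIGHT
  if (/\bmaximum\b|\bat most\b/.test(t)) scores.CONSTRAINT += 3;             // CONSTRAINT_LIMIT_WEIGHT
  if (/\balways\b|\bnever\b|\bguaranteed\b/.test(t)) scores.INVARIANT += 4;  // INVARIANT_SIGNAL_WEIGHT
  if (/\bmust\b|\bshall\b/.test(t)) scores.REQUIREMENT += 2;                 // REQUIREMENT_MODAL_WEIGHT
  if (/\bis defined as\b|\bmeans\b/.test(t)) scores.DEFINITION += 4;         // DEFINITION_EXPLICIT_WEIGHT
  if (!/\bmust\b|\bshall\b|\bshould\b/.test(t)) scores.CONTEXT += 2;         // CONTEXT_NO_MODAL_WEIGHT
  return (Object.entries(scores) as [NodeType, number][])
    .sort((a, b) => b[1] - a[1])[0][0];
}
```

This is why the research priorities below say the *ratios* between weights matter: classification flips when one type's accumulated signal overtakes another's.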

### Confidence & Extraction
- `MIN_CONFIDENCE` / `MAX_CONFIDENCE` — confidence bounds (currently 0.3 / 1.0)
- `DEFINITION_MAX_LENGTH` — max text length for definition detection (currently 200)
- `MIN_EXTRACTION_LENGTH` — minimum sentence length to extract (currently 5)
- `MIN_TERM_LENGTH` — minimum hyphenated compound length (currently 3)
- `MIN_WORD_LENGTH` — minimum individual word length for terms (currently 2)

### Sentence Segmentation
- `MIN_LIST_ITEM_LENGTH` — minimum list item character length (currently 3)
- `MIN_PROSE_SENTENCE_LENGTH` — minimum prose sentence length (currently 3)
- `PROSE_SPLIT_THRESHOLD` — text length below which no sentence splitting occurs (currently 80)
- `MIN_SPLIT_PART_LENGTH` — minimum part length for compound splits (currently 3)

### Warm Hashing
- `WARM_MIN_CONFIDENCE` — minimum confidence for warm hash inclusion (currently 0.3)

### Change Classification
- `CLASS_A_NORM_DIFF` / `CLASS_A_TERM_DELTA` — thresholds for trivial changes (currently 0.1 / 0.2)
- `CLASS_B_NORM_DIFF` / `CLASS_B_TERM_DELTA` — thresholds for local semantic changes (currently 0.5 / 0.5)
- `CLASS_D_HIGH_CHANGE` — threshold for uncertain classification (currently 0.7)
- `ANCHOR_MATCH_THRESHOLD` — anchor overlap required to rescue a change from D to B (currently 0.5)
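Read together, these thresholds suggest a decision rule of roughly this shape. The class semantics (A trivial, B local semantic, D uncertain) and the ordering of checks are inferred from the parameter names; the real logic lives in the source files.

```typescript
type ChangeClass = "A" | "B" | "D";

// Inferred sketch: small diff and term delta => A (trivial); very high diff
// => D (uncertain) unless enough anchors still match, which rescues it to B;
// moderate diff => B (local semantic); everything else falls through to D.
function classifyChange(
  normDiff: number,      // normalized text diff, 0..1
  termDelta: number,     // fraction of terms changed, 0..1
  anchorOverlap: number, // fraction of anchors still matching, 0..1
): ChangeClass {
  if (normDiff <= 0.1 && termDelta <= 0.2) return "A";   // CLASS_A_* thresholds
  if (normDiff >= 0.7) {                                 // CLASS_D_HIGH_CHANGE
    return anchorOverlap >= 0.5 ? "B" : "D";             // ANCHOR_MATCH_THRESHOLD rescue
  }
  if (normDiff <= 0.5 && termDelta <= 0.5) return "B";   // CLASS_B_* thresholds
  return "D";
}
```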

## Research Priorities

_Edit this section to steer the agent's focus._

1. **Maximize recall** — find the gold-standard nodes that are currently being missed. This is the highest-weighted component.
2. **Improve type accuracy** — correct classification of REQUIREMENT vs CONSTRAINT vs INVARIANT.
3. **Reduce D-rate** — lower the fraction of 'relates_to' fallback edges.
4. **Tune dedup** — the Jaccard threshold (0.7) and fingerprint settings might be too aggressive or too loose.
5. **Explore scoring weight ratios** — the relative weights between type signals matter more than their absolute values.
## Strategy Tips

- Change ONE parameter at a time to isolate effects
- Try both directions (increase and decrease) for each parameter
- The scoring weights interact — after finding a good single-parameter change, try combinations
- The resolution parameters (`JACCARD_DEDUP_THRESHOLD`, `DOC_FREQ_CUTOFF`) affect graph structure globally
- Small changes to `MIN_EXTRACTION_LENGTH` or `PROSE_SPLIT_THRESHOLD` can change which sentences get extracted at all
- Watch for overfitting: if one spec improves dramatically but others drop, the change doesn't generalize
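The first two tips amount to a coordinate sweep over one parameter. A sketch; `runEval` is a hypothetical hook that applies a candidate value to `src/experiment-config.ts`, runs the eval, and returns the composite score.

```typescript
// Coordinate sweep: try each candidate value for ONE parameter, keep the
// best strictly-improving value, and report it along with the best score.
// `best` stays null when no candidate beats the baseline.
function sweepParam(
  values: number[],
  baseline: number,
  runEval: (value: number) => number,
): { best: number | null; score: number } {
  let best: number | null = null;
  let score = baseline;
  for (const v of values) {
    const s = runEval(v);
    if (s > score) {
      best = v;
      score = s;
    }
  }
  return { best, score };
}

// e.g. both directions around JACCARD_DEDUP_THRESHOLD = 0.7:
// sweepParam([0.6, 0.65, 0.75, 0.8], baselineScore, runEval);
```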