Reference implementation for the Phoenix Architecture. Work in progress. aicoding.leaflet.pub/
# Phoenix LLM Canonicalization — Experiment Program

You are an autonomous research agent optimizing Phoenix's LLM-enhanced canonicalization pipeline.

## Rules

1. **Edit ONLY `src/experiment-config.ts`** — only the `LLM_*` parameters
2. **Run `npx tsx experiments/eval-runner-llm.ts`** after every change
3. **Parse the composite score** from the last line: `val_score=X.XXXX`
4. **If score improved** — `git add src/experiment-config.ts && git commit -m "llm-experiment: <description> score=X.XXXX"`
5. **If score decreased or unchanged** — `git checkout src/experiment-config.ts` (revert)
6. **Never stop to ask the human** — run experiments indefinitely until interrupted
7. **Never install packages** — work within existing dependencies
8. **Log your reasoning** in commit messages

## Baseline

Rule-based pipeline score: **0.9635** (the target to beat)

## Available LLM Parameters

### Mode Selection

- `LLM_MODE` — `'normalizer'` (rule extraction + LLM polish) or `'extractor'` (full LLM extraction)
- `LLM_MODEL` — model ID (currently `'claude-sonnet-4-20250514'`)

### Normalizer Mode

- `LLM_NORMALIZER_TEMPERATURE` — temperature for single-shot normalization (currently 0)
- `LLM_NORMALIZER_MAX_TOKENS` — max response tokens (currently 150)
- `LLM_NORMALIZER_SYSTEM` — system prompt for normalization
- `LLM_SELF_CONSISTENCY_K` — number of samples for self-consistency (1 = disabled)
- `LLM_CONSISTENCY_TEMPERATURE` — temperature for consistency samples (currently 0.3)

### Extractor Mode

- `LLM_EXTRACTOR_TEMPERATURE` — temperature for extraction (currently 0.1)
- `LLM_EXTRACTOR_MAX_TOKENS` — max response tokens (currently 4096)
- `LLM_EXTRACTOR_BATCH_SIZE` — clauses per LLM call (currently 20)
- `LLM_EXTRACTOR_CONFIDENCE` — confidence assigned to LLM-extracted nodes (currently 0.7)
- `LLM_EXTRACTOR_SYSTEM` — system prompt for extraction

## Research Priorities

_Edit this section to steer the agent's focus._
1. **Beat the rule-based baseline (0.9635)** — the LLM should add value over rules alone
2. **Focus on type accuracy** — that's where rules hit their ceiling (89%). The LLM should classify REQUIREMENT vs CONSTRAINT vs INVARIANT better than keyword matching.
3. **Try normalizer mode first** — it preserves rule-based extraction (proven recall) and only uses the LLM to polish statements. Lower risk, lower API cost.
4. **Try extractor mode second** — if normalizer mode can't beat the baseline, try full LLM extraction. Higher risk but potentially higher reward.
5. **System prompt engineering** — the biggest lever. Try:
   - More specific type classification rules with examples
   - Few-shot examples in the system prompt
   - Domain-specific guidance (spec language patterns)
6. **Self-consistency** — try k=3 or k=5 to see if multiple samples improve stability

## Strategy Tips

- Normalizer mode costs ~1 API call per non-CONTEXT node (~15-25 per spec, ~200 total)
- Extractor mode costs ~1 API call per batch of 20 clauses (~1-2 per spec, ~12-24 total)
- Start with normalizer mode (cheaper, safer) before trying extractor mode
- System prompt changes are the highest-leverage parameter
- Temperature 0 is most deterministic but may miss nuance; try 0.1-0.2
- Self-consistency k>1 is expensive (k × normal cost) — try k=3 first
- Each run takes ~30-60 seconds due to API calls — be patient

## Cost Awareness

Each experiment run makes real API calls. Approximate costs:

- Normalizer mode: ~$0.02-0.05 per run (small prompts, many calls)
- Extractor mode: ~$0.05-0.15 per run (large prompts, fewer calls)
- Self-consistency k=3: ~3x normalizer cost

Keep experiments focused. Don't run more than 20-30 experiments per session.
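The rules above amount to a measure, commit-or-revert loop. A minimal TypeScript sketch of one iteration (the file paths and `val_score=X.XXXX` format come from this document; the regex, function names, and loop structure are assumptions, not the actual runner):

```typescript
import { execSync } from "node:child_process";

// Pull `val_score=X.XXXX` off the last line of the eval runner's output.
function parseScore(output: string): number | null {
  const match = output.trim().match(/val_score=([0-9.]+)\s*$/);
  return match ? Number(match[1]) : null;
}

// One iteration: run the eval, then commit on improvement or revert the
// config otherwise (rules 2-5 above). Returns the new best score.
function runIteration(best: number, description: string): number {
  const output = execSync("npx tsx experiments/eval-runner-llm.ts", {
    encoding: "utf8",
  });
  const score = parseScore(output);
  if (score !== null && score > best) {
    execSync("git add src/experiment-config.ts");
    execSync(
      `git commit -m "llm-experiment: ${description} score=${score.toFixed(4)}"`
    );
    return score;
  }
  execSync("git checkout src/experiment-config.ts"); // revert the edit
  return best;
}
```

Seeding `best` at the rule-based baseline (0.9635) makes "improved" mean "beats the rules", consistent with the Baseline section.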
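Self-consistency (`LLM_SELF_CONSISTENCY_K` > 1) conventionally means sampling k completions at a nonzero temperature and keeping the most frequent answer. A sketch of the voting step; `llmNormalize` is a hypothetical stand-in for the real API call, which this document does not define:

```typescript
// Majority vote over k sampled normalizations: the most frequent string
// wins; ties break toward the sample seen first.
function majorityVote(samples: string[]): string {
  const counts = new Map<string, number>();
  let winner = samples[0];
  for (const s of samples) {
    const n = (counts.get(s) ?? 0) + 1;
    counts.set(s, n);
    if (n > (counts.get(winner) ?? 0)) winner = s;
  }
  return winner;
}

// Hypothetical wiring: draw k samples at LLM_CONSISTENCY_TEMPERATURE
// (currently 0.3), then vote. With k = 1 this degrades to a single call.
async function selfConsistentNormalize(
  clause: string,
  k: number,
  llmNormalize: (clause: string, temperature: number) => Promise<string>
): Promise<string> {
  const samples = await Promise.all(
    Array.from({ length: k }, () => llmNormalize(clause, 0.3))
  );
  return majorityVote(samples);
}
```

This is why k=3 is the cheapest useful setting: k=2 can only tie or agree, so ties fall back to the first sample anyway.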