experiment: round 3 with 12 gold specs, final score 0.9021

Added 6 new gold specs (Pixel Wars, Settle Up, User Service, TicTacToe),
fixed gold type annotations, tuned SAME_TYPE_REFINE_THRESHOLD to 0.15.
Score history: 0.8785 → 0.8861 → 0.9061 → 0.9640 → 0.8298 (after adding
the new specs) → 0.8912 (gold fixes) → 0.9021 (threshold tuning).
Remaining gaps are hierarchy inference (needs CONTEXT parents) and
coverage for list-heavy specs.