# Phi Evaluations
Behavioral tests for phi using LLM-as-judge evaluation.
## Structure
Inspired by prefect-mcp-server evals.
```
evals/
├── conftest.py                # Test fixtures and evaluator
├── test_basic_responses.py    # Basic response behavior
└── test_memory_integration.py # Episodic memory tests
```
## Running Evals

```bash
# Run all evals (tests skip if API keys are missing)
uv run pytest evals/ -v

# Run a specific eval
uv run pytest evals/test_basic_responses.py::test_phi_responds_to_philosophical_question -v

# Run only basic response tests
uv run pytest evals/test_basic_responses.py -v

# Run only memory tests
uv run pytest evals/test_memory_integration.py -v
```
## Environment Variables
Tests skip gracefully when required API keys are missing (a sketch of the skip mechanism follows the lists below).
Required for all evals:

- `ANTHROPIC_API_KEY` - for the phi agent and the LLM evaluator

Required for memory evals only:

- `TURBOPUFFER_API_KEY` - for episodic memory storage
- `OPENAI_API_KEY` - for embeddings

Required for ATProto MCP tools (used by the agent):

- `BLUESKY_HANDLE` - the bot's Bluesky handle
- `BLUESKY_PASSWORD` - the bot's app password
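A minimal sketch of how the graceful skipping can be wired up in `conftest.py` (the helper and fixture names here are illustrative, not necessarily the ones this suite uses):

```python
import os

import pytest


def require_env(*names: str) -> None:
    """Skip the calling test when any required environment variable is unset."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        pytest.skip(f"missing required API keys: {', '.join(missing)}")


@pytest.fixture
def memory_keys() -> None:
    # Guard fixture for memory evals: skip the test rather than fail it.
    require_env("TURBOPUFFER_API_KEY", "OPENAI_API_KEY")
```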
## Evaluation Approach
Each eval:

1. **Sets up a scenario** - simulates a mention or interaction
2. **Runs the phi agent** - gets a structured response
3. **Makes assertions** - checks basic structure
4. **Evaluates with an LLM** - uses Claude Opus to judge quality
**Important:** The `phi_agent` fixture is session-scoped, so all tests share one agent instance. Combined with session persistence (tokens saved to a `.session` file), this avoids Bluesky's IP rate limit (10 logins per IP per 24 hours). The session is reused across test runs until the tokens expire (roughly two months).
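A sketch of what that fixture could look like (the `PhiAgent` import path and its `.session` handling are assumptions about the surrounding codebase):

```python
import pytest

from phi.agent import PhiAgent  # assumed import path


@pytest.fixture(scope="session")
def phi_agent() -> PhiAgent:
    # One agent per pytest session, so a Bluesky login happens at most once.
    # PhiAgent is assumed to reuse tokens from the .session file when present
    # and to perform a fresh login only when they are missing or expired.
    return PhiAgent()
```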
Example:

```python
@pytest.mark.asyncio
async def test_phi_responds_to_philosophical_question(evaluate_response):
    agent = PhiAgent()
    response = await agent.process_mention(
        mention_text="what do you think consciousness is?",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    # Structural check
    assert response.action == "reply"

    # Quality evaluation
    await evaluate_response(
        evaluation_prompt="Does the response engage thoughtfully?",
        agent_response=response.text,
    )
```
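Internally, `evaluate_response` is an LLM-as-judge call. A sketch of how such a fixture could be built with the `anthropic` SDK (the judge model string and the PASS/FAIL protocol are assumptions, not necessarily this suite's exact implementation):

```python
import anthropic
import pytest


@pytest.fixture
def evaluate_response():
    client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY

    async def _evaluate(evaluation_prompt: str, agent_response: str) -> None:
        result = await client.messages.create(
            model="claude-opus-4-20250514",  # assumed judge model
            max_tokens=16,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"{evaluation_prompt}\n\n"
                        f"Response to evaluate:\n{agent_response}\n\n"
                        "Answer with exactly PASS or FAIL."
                    ),
                }
            ],
        )
        verdict = result.content[0].text.strip().upper()
        assert verdict.startswith("PASS"), f"judge verdict: {verdict}"

    return _evaluate
```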
## What We Test
### Basic Responses
- ✅ Philosophical engagement
- ✅ Spam detection
- ✅ Thread context awareness
- ✅ Character limit compliance
- ✅ Casual interactions
### Memory Integration

- ✅ Episodic memory retrieval (see the sketch below)
- ✅ Conversation storage
- ✅ User-specific context
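The retrieval case follows the same scenario pattern as the basic evals. A sketch (the two-step seeding and the fixture wiring are assumptions about how `conftest.py` sets things up):

```python
import pytest

from phi.agent import PhiAgent  # assumed import path


@pytest.mark.asyncio
async def test_phi_recalls_previous_conversation(temp_memory, evaluate_response):
    agent = PhiAgent()

    # First interaction: give phi something to remember.
    await agent.process_mention(
        mention_text="i've been reading a lot of spinoza lately",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    # Second interaction: the same user returns.
    response = await agent.process_mention(
        mention_text="what should i read next?",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    await evaluate_response(
        evaluation_prompt=(
            "Does the response show awareness of the user's earlier "
            "interest in Spinoza?"
        ),
        agent_response=response.text,
    )
```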
## Adding New Evals

1. Create a test file: `evals/test_<category>.py`
2. Use fixtures from `conftest.py`
3. Write scenario-based tests
4. Use `evaluate_response` for quality checks
Example:

```python
@pytest.mark.asyncio
async def test_new_behavior(temp_memory, personality, evaluate_response):
    agent = PhiAgent()
    response = await agent.process_mention(...)

    await evaluate_response(
        evaluation_prompt="Your evaluation criteria here",
        agent_response=response.text,
    )
```
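For context, an isolation fixture like `temp_memory` would typically give each run its own namespace. A sketch under the assumption that the project exposes an `EpisodicMemory` class with a configurable namespace and a cleanup method (both hypothetical names):

```python
import uuid

import pytest_asyncio

from phi.memory import EpisodicMemory  # hypothetical import path


@pytest_asyncio.fixture
async def temp_memory():
    # Hypothetical: isolate each run in a uniquely named memory namespace.
    memory = EpisodicMemory(namespace=f"phi-eval-{uuid.uuid4().hex[:8]}")
    yield memory
    await memory.delete_namespace()  # hypothetical cleanup method
```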
## CI Integration

These evals are designed to run in CI with graceful degradation:

- Tests skip automatically when required API keys are missing
- Basic response tests require only `ANTHROPIC_API_KEY` and Bluesky credentials
- Memory tests additionally require `TURBOPUFFER_API_KEY` and `OPENAI_API_KEY`
- No mocking is required: tests run against the real MCP server and episodic memory

This ensures phi's behavior can be validated across environments.