# Phi Evaluations
Behavioral tests for phi using LLM-as-judge evaluation.
## Structure
Inspired by prefect-mcp-server evals.
```
evals/
├── conftest.py                # Test fixtures and evaluator
├── test_basic_responses.py    # Basic response behavior
└── test_memory_integration.py # Episodic memory tests
```
## Running Evals

```bash
# Run all evals (tests skip if API keys are missing)
uv run pytest evals/ -v

# Run a specific eval
uv run pytest evals/test_basic_responses.py::test_phi_responds_to_philosophical_question -v

# Run only basic response tests
uv run pytest evals/test_basic_responses.py -v

# Run only memory tests
uv run pytest evals/test_memory_integration.py -v
```
## Environment Variables
Tests skip gracefully when required API keys are missing (a sketch of the skip mechanism follows the lists below).
Required for all evals:

- `ANTHROPIC_API_KEY` - for the phi agent and the LLM evaluator

Required for memory evals only:

- `TURBOPUFFER_API_KEY` - for episodic memory storage
- `OPENAI_API_KEY` - for embeddings

Required for ATProto MCP tools (used by the agent):

- `BLUESKY_HANDLE` - the bot's Bluesky handle
- `BLUESKY_PASSWORD` - the bot's app password
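A minimal sketch of how the graceful skipping can be wired up in `conftest.py` (the helper and fixture names here are illustrative, not necessarily the ones this suite uses):

```python
import os

import pytest


def require_env(*names: str) -> None:
    """Skip the calling test when any required environment variable is unset."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        pytest.skip(f"missing required API keys: {', '.join(missing)}")


@pytest.fixture
def memory_keys() -> None:
    # Guard fixture for memory evals: skip the test rather than fail it.
    require_env("TURBOPUFFER_API_KEY", "OPENAI_API_KEY")
```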
## Evaluation Approach
Each eval:

1. **Sets up a scenario** - simulates a mention or interaction
2. **Runs the phi agent** - gets a structured response
3. **Makes assertions** - checks basic structure
4. **Evaluates with an LLM** - uses Claude Opus to judge quality
**Important:** The `phi_agent` fixture is session-scoped, so all tests share one agent instance. Combined with session persistence (tokens saved to a `.session` file), this avoids Bluesky's IP rate limit (10 logins per IP per 24 hours). The session is reused across test runs until the tokens expire (roughly two months).
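A sketch of what that fixture could look like (the `PhiAgent` import path and its `.session` handling are assumptions about the surrounding codebase):

```python
import pytest

from phi.agent import PhiAgent  # assumed import path


@pytest.fixture(scope="session")
def phi_agent() -> PhiAgent:
    # One agent per pytest session, so a Bluesky login happens at most once.
    # PhiAgent is assumed to reuse tokens from the .session file when present
    # and to perform a fresh login only when they are missing or expired.
    return PhiAgent()
```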
Example:

```python
@pytest.mark.asyncio
async def test_phi_responds_to_philosophical_question(evaluate_response):
    agent = PhiAgent()
    response = await agent.process_mention(
        mention_text="what do you think consciousness is?",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    # Structural check
    assert response.action == "reply"

    # Quality evaluation
    await evaluate_response(
        evaluation_prompt="Does the response engage thoughtfully?",
        agent_response=response.text,
    )
```
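Internally, `evaluate_response` is an LLM-as-judge call. A sketch of how such a fixture could be built with the `anthropic` SDK (the judge model string and the PASS/FAIL protocol are assumptions, not necessarily this suite's exact implementation):

```python
import anthropic
import pytest


@pytest.fixture
def evaluate_response():
    client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY

    async def _evaluate(evaluation_prompt: str, agent_response: str) -> None:
        result = await client.messages.create(
            model="claude-opus-4-20250514",  # assumed judge model
            max_tokens=16,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"{evaluation_prompt}\n\n"
                        f"Response to evaluate:\n{agent_response}\n\n"
                        "Answer with exactly PASS or FAIL."
                    ),
                }
            ],
        )
        verdict = result.content[0].text.strip().upper()
        assert verdict.startswith("PASS"), f"judge verdict: {verdict}"

    return _evaluate
```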
## What We Test
### Basic Responses
- ✅ Philosophical engagement
- ✅ Spam detection
- ✅ Thread context awareness
- ✅ Character limit compliance
- ✅ Casual interactions
### Memory Integration

- ✅ Episodic memory retrieval (see the sketch below)
- ✅ Conversation storage
- ✅ User-specific context
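The retrieval case follows the same scenario pattern as the basic evals. A sketch (the two-step seeding and the fixture wiring are assumptions about how `conftest.py` sets things up):

```python
import pytest

from phi.agent import PhiAgent  # assumed import path


@pytest.mark.asyncio
async def test_phi_recalls_previous_conversation(temp_memory, evaluate_response):
    agent = PhiAgent()

    # First interaction: give phi something to remember.
    await agent.process_mention(
        mention_text="i've been reading a lot of spinoza lately",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    # Second interaction: the same user returns.
    response = await agent.process_mention(
        mention_text="what should i read next?",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    await evaluate_response(
        evaluation_prompt=(
            "Does the response show awareness of the user's earlier "
            "interest in Spinoza?"
        ),
        agent_response=response.text,
    )
```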
## Adding New Evals

1. Create a test file: `evals/test_<category>.py`
2. Use fixtures from `conftest.py`
3. Write scenario-based tests
4. Use `evaluate_response` for quality checks
Example:

```python
@pytest.mark.asyncio
async def test_new_behavior(temp_memory, personality, evaluate_response):
    agent = PhiAgent()
    response = await agent.process_mention(...)

    await evaluate_response(
        evaluation_prompt="Your evaluation criteria here",
        agent_response=response.text,
    )
```
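For context, an isolation fixture like `temp_memory` would typically give each run its own namespace. A sketch under the assumption that the project exposes an `EpisodicMemory` class with a configurable namespace and a cleanup method (both hypothetical names):

```python
import uuid

import pytest_asyncio

from phi.memory import EpisodicMemory  # hypothetical import path


@pytest_asyncio.fixture
async def temp_memory():
    # Hypothetical: isolate each run in a uniquely named memory namespace.
    memory = EpisodicMemory(namespace=f"phi-eval-{uuid.uuid4().hex[:8]}")
    yield memory
    await memory.delete_namespace()  # hypothetical cleanup method
```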
## CI Integration

These evals are designed to run in CI with graceful degradation:

- Tests skip automatically when required API keys are missing
- Basic response tests require only `ANTHROPIC_API_KEY` and Bluesky credentials
- Memory tests additionally require `TURBOPUFFER_API_KEY` and `OPENAI_API_KEY`
- No mocking is required: tests run against the real MCP server and episodic memory

This ensures phi's behavior can be validated across environments.