a digital entity named phi that roams bsky

testing#

phi uses behavioral testing with llm-as-judge evaluation.

philosophy#

test outcomes, not implementation

we care that phi:

  • replies appropriately to mentions
  • uses thread context correctly
  • maintains consistent personality
  • makes reasonable action decisions

we don't care:

  • which exact HTTP calls were made
  • internal state of the agent
  • specific tool invocation order

test structure#

async def test_thread_awareness():
    """phi should reference thread context in replies"""

    # arrange: create thread context
    thread_context = """
    @alice: I love birds
    @phi: me too! what's your favorite?
    """

    # act: process new mention
    response = await agent.process_mention(
        mention_text="especially crows",
        author_handle="alice.bsky.social",
        thread_context=thread_context
    )

    # assert: behavioral check
    assert response.action == "reply"
    assert any(word in response.text.lower()
               for word in ["bird", "crow", "favorite"])

llm-as-judge#

for subjective qualities (tone, relevance, personality):

async def test_personality_consistency():
    """phi should maintain grounded, honest tone"""

    response = await agent.process_mention(...)

    # use claude opus to evaluate
    evaluation = await judge_response(
        response=response.text,
        criteria=[
            "grounded (not overly philosophical)",
            "honest about capabilities",
            "concise for bluesky's 300 char limit"
        ]
    )

    assert evaluation.passes_criteria
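judge_response itself is a thin wrapper around a model call. a minimal sketch of the structured-verdict half (the Evaluation shape and the JSON verdict format are assumptions here; the actual prompt and model call are elided):

```python
import json
from dataclasses import dataclass


@dataclass
class Evaluation:
    passes_criteria: bool
    reasoning: str


def parse_judge_verdict(raw: str) -> Evaluation:
    """parse the judge model's verdict into a structured result.

    assumes the judge is prompted to answer with
    {"pass": bool, "reasoning": str}; an unparseable verdict counts
    as a failed eval rather than a crashed test.
    """
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return Evaluation(
            passes_criteria=False,
            reasoning=f"unparseable verdict: {raw!r}",
        )
    return Evaluation(
        passes_criteria=bool(verdict.get("pass", False)),
        reasoning=str(verdict.get("reasoning", "")),
    )
```

failing closed on bad judge output keeps flaky model responses from silently passing evals.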

what we test#

unit tests#

  • memory operations (store/retrieve)
  • thread context building
  • response parsing

integration tests#

  • full mention handling flow
  • thread discovery
  • decision making

behavioral tests (evals)#

  • personality consistency
  • thread awareness
  • appropriate action selection
  • memory utilization

mocking strategy#

mock external services, not internal logic

  • mock ATProto client (don't actually post to bluesky)
  • mock TurboPuffer (in-memory dict instead of network calls)
  • mock MCP server (fake tool implementations)
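a dict-backed fake for the vector store could look like this sketch; the upsert/query method names are assumptions about how phi calls the client, not the real TurboPuffer API, and similarity is faked with word overlap so results stay deterministic:

```python
class FakeTurboPuffer:
    """in-memory stand-in for the vector store (assumed interface)."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text

    def query(self, text: str, top_k: int = 5) -> list[tuple[str, str]]:
        # naive deterministic "similarity": rank docs by how many
        # words they share with the query
        query_words = set(text.lower().split())
        scored = sorted(
            self._docs.items(),
            key=lambda item: len(query_words & set(item[1].lower().split())),
            reverse=True,
        )
        return scored[:top_k]
```

because scoring is pure and in-memory, the same query always returns the same ranking, with no network calls.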

keep agent logic real - we want to test actual decision making.

running tests#

just test        # unit tests
just evals       # behavioral tests with llm-as-judge
just check       # full suite (lint + typecheck + test)

test isolation#

tests never touch production:

  • no real bluesky posts
  • separate turbopuffer namespace for tests
  • deterministic mock responses where needed
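one way to guarantee the separate namespace: generate a unique one per test run. a small sketch (the "phi-test-" prefix is an assumed convention, not taken from phi's code):

```python
import uuid


def make_test_namespace() -> str:
    """unique per call, so parallel test runs can't collide
    with each other or with the production namespace."""
    return f"phi-test-{uuid.uuid4().hex[:8]}"
```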

see sandbox/TESTING_STRATEGY.md for the detailed approach.