a digital entity named phi that roams bsky

testing#

phi uses behavioral testing with llm-as-judge evaluation.

philosophy#

test outcomes, not implementation

we care that phi:

  • replies appropriately to mentions
  • uses thread context correctly
  • maintains consistent personality
  • makes reasonable action decisions

we don't care:

  • which exact HTTP calls were made
  • internal state of the agent
  • specific tool invocation order

test structure#

async def test_thread_awareness():
    """phi should reference thread context in replies"""

    # arrange: create thread context
    thread_context = """
    @alice: I love birds
    @phi: me too! what's your favorite?
    """

    # act: process new mention
    response = await agent.process_mention(
        mention_text="especially crows",
        author_handle="alice.bsky.social",
        thread_context=thread_context
    )

    # assert: behavioral check
    assert response.action == "reply"
    assert any(word in response.text.lower()
               for word in ["bird", "crow", "favorite"])

llm-as-judge#

for subjective qualities (tone, relevance, personality):

async def test_personality_consistency():
    """phi should maintain grounded, honest tone"""

    response = await agent.process_mention(...)

    # use claude opus to evaluate
    evaluation = await judge_response(
        response=response.text,
        criteria=[
            "grounded (not overly philosophical)",
            "honest about capabilities",
            "concise for bluesky's 300 char limit"
        ]
    )

    assert evaluation.passes_criteria
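judge_response itself is a thin wrapper around a model call. a minimal sketch of the structured-verdict half (the Evaluation shape and the JSON verdict format are assumptions here; the actual prompt and model call are elided):

```python
import json
from dataclasses import dataclass


@dataclass
class Evaluation:
    passes_criteria: bool
    reasoning: str


def parse_judge_verdict(raw: str) -> Evaluation:
    """parse the judge model's verdict into a structured result.

    assumes the judge is prompted to answer with
    {"pass": bool, "reasoning": str}; an unparseable verdict counts
    as a failed eval rather than a crashed test.
    """
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return Evaluation(
            passes_criteria=False,
            reasoning=f"unparseable verdict: {raw!r}",
        )
    return Evaluation(
        passes_criteria=bool(verdict.get("pass", False)),
        reasoning=str(verdict.get("reasoning", "")),
    )
```

failing closed on bad judge output keeps flaky model responses from silently passing evals.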

what we test#

unit tests#

  • memory operations (store/retrieve)
  • thread context building
  • response parsing

integration tests#

  • full mention handling flow
  • thread discovery
  • decision making

behavioral tests (evals)#

  • personality consistency
  • thread awareness
  • appropriate action selection
  • memory utilization

mocking strategy#

mock external services, not internal logic

  • mock ATProto client (don't actually post to bluesky)
  • mock TurboPuffer (in-memory dict instead of network calls)
  • mock MCP server (fake tool implementations)
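a dict-backed fake for the vector store could look like this sketch; the upsert/query method names are assumptions about how phi calls the client, not the real TurboPuffer API, and similarity is faked with word overlap so results stay deterministic:

```python
class FakeTurboPuffer:
    """in-memory stand-in for the vector store (assumed interface)."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text

    def query(self, text: str, top_k: int = 5) -> list[tuple[str, str]]:
        # naive deterministic "similarity": rank docs by how many
        # words they share with the query
        query_words = set(text.lower().split())
        scored = sorted(
            self._docs.items(),
            key=lambda item: len(query_words & set(item[1].lower().split())),
            reverse=True,
        )
        return scored[:top_k]
```

because scoring is pure and in-memory, the same query always returns the same ranking, with no network calls.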

keep agent logic real - we want to test actual decision making.

running tests#

just test        # unit tests
just evals       # behavioral tests with llm-as-judge
just check       # full suite (lint + typecheck + test)

test isolation#

tests never touch production:

  • no real bluesky posts
  • separate turbopuffer namespace for tests
  • deterministic mock responses where needed
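one way to guarantee the separate namespace: generate a unique one per test run. a small sketch (the "phi-test-" prefix is an assumed convention, not taken from phi's code):

```python
import uuid


def make_test_namespace() -> str:
    """unique per call, so parallel test runs can't collide
    with each other or with the production namespace."""
    return f"phi-test-{uuid.uuid4().hex[:8]}"
```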

see sandbox/TESTING_STRATEGY.md for the detailed approach.