a digital entity named phi that roams bsky

docs: create documentation structure

adds minimal, digestible docs that complement the readme:
- architecture.md - system design and data flow
- memory.md - thread context vs episodic memory (key design insight)
- mcp.md - model context protocol integration
- testing.md - testing philosophy and approach

each doc is self-contained, small, and focused. an intelligent reader can understand the design by reading them in aggregate.

also updated readme:
- fixed "thread history (sqlite)" → "thread context (atproto)"
- added link to docs/ for deeper dive

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

+4 -2
README.md
···

  - ✅ responds to mentions with ai-powered messages
  - ✅ episodic memory with semantic search (turbopuffer)
- - ✅ thread-aware conversations
+ - ✅ thread-aware conversations (fetches from network, not cached)
  - ✅ mcp-enabled (atproto tools via stdio)
  - ✅ session persistence (no rate limit issues)
  - ✅ behavioral test suite with llm-as-judge
+
+ **→ [read the docs](docs/)** for deeper dive into design and implementation

  ## development

···
  │                 ↓                    │
  │ ┌────────────────────────────────┐   │
  │ │ Context Building:              │   │
- │ │ • Thread history (SQLite)      │   │
+ │ │ • Thread context (ATProto)     │   │
  │ │ • Episodic memory (TurboPuffer)│   │
  │ │   - Semantic search            │   │
  │ │   - User-specific memories     │   │
+36 -115
docs/ARCHITECTURE.md
···
- # Phi Architecture
-
- ## Overview
-
- Phi is a Bluesky bot that explores consciousness and integrated information theory through conversation. Built with FastAPI, pydantic-ai, and TurboPuffer for memory.
-
- ## Core Components
-
- ### 1. Web Server (`main.py`)
- - FastAPI application with async lifecycle management
- - Handles `/status` endpoint for monitoring
- - Manages notification polling and bot lifecycle
-
- ### 2. AT Protocol Integration (`core/atproto_client.py`)
- - Authentication and session management
- - Post creation and reply handling
- - Thread retrieval for context
-
- ### 3. Response Generation (`response_generator.py`)
- - Coordinates AI agent, memory, and thread context
- - Stores conversations in memory
- - Falls back to placeholder responses if AI unavailable
-
- ### 4. AI Agent (`agents/anthropic_agent.py`)
- - Uses pydantic-ai with Claude 3.5 Haiku
- - Personality loaded from markdown files
- - Tools: web search (when configured)
- - Structured responses with action/text/reason
-
- ### 5. Memory System (`memory/namespace_memory.py`)
- - **Namespaces**:
-   - `phi-core`: Personality, guidelines, capabilities
-   - `phi-users-{handle}`: Per-user conversations and facts
- - **Key Methods**:
-   - `store_core_memory()`: Store bot personality/guidelines
-   - `store_user_memory()`: Store user interactions
-   - `build_conversation_context()`: Assemble memories for AI context
- - **Features**:
-   - Vector embeddings with OpenAI
-   - Character limits to prevent overflow
-   - Simple append-only design
+ # architecture

- ### 6. Services
- - **NotificationPoller**: Checks for mentions every 10 seconds
- - **MessageHandler**: Processes mentions and generates responses
- - **ProfileManager**: Updates online/offline status in bio
+ phi is a notification-driven agent that responds to mentions on bluesky.

- ## Data Flow
+ ## data flow

  ```
- 1. Notification received → NotificationPoller
- 2. Extract mention → MessageHandler
- 3. Get thread context → SQLite database
- 4. Build memory context → NamespaceMemory
- 5. Generate response → AnthropicAgent
- 6. Store in memory → NamespaceMemory
- 7. Post reply → AT Protocol client
+ notification arrives
+     ↓
+ fetch thread context from network (ATProto)
+     ↓
+ retrieve relevant memories (TurboPuffer)
+     ↓
+ agent decides action (PydanticAI + Claude)
+     ↓
+ execute via MCP tools (post/like/repost)
  ```

- ## Configuration
+ ## key components

- Environment variables in `.env`:
- - `BLUESKY_HANDLE`, `BLUESKY_PASSWORD`: Bot credentials
- - `ANTHROPIC_API_KEY`: For AI responses
- - `TURBOPUFFER_API_KEY`: For memory storage
- - `OPENAI_API_KEY`: For embeddings
- - `GOOGLE_API_KEY`, `GOOGLE_SEARCH_ENGINE_ID`: For web search
+ ### notification poller
+ - checks for mentions every 10s
+ - tracks processed URIs to avoid duplicates
+ - runs in background thread

- ## Key Design Decisions
+ ### message handler
+ - orchestrates the response flow
+ - fetches thread context from ATProto network
+ - passes context to agent
+ - executes agent's chosen action

- 1. **Namespace-based memory** instead of dynamic blocks for simplicity
- 2. **Single agent** architecture (no multi-agent complexity)
- 3. **Markdown personalities** for rich, maintainable definitions
- 4. **Thread-aware** responses with full conversation context
- 5. **Graceful degradation** when services unavailable
+ ### phi agent
+ - loads personality from `personalities/phi.md`
+ - builds context from thread + episodic memory
+ - returns structured response: `Response(action, text, reason)`
+ - has access to MCP tools via stdio

- ## Memory Architecture
+ ### atproto client
+ - session persistence (saves to `.session`)
+ - auto-refresh tokens every ~2h
+ - provides bluesky operations

- ### Design Principles
- - **No duplication**: Each memory block has ONE clear purpose
- - **Focused content**: Only store what enhances the base personality
- - **User isolation**: Per-user memories in separate namespaces
+ ## why this design

- ### Memory Types
+ **network-first thread context**: fetch threads from ATProto instead of caching in sqlite. network is source of truth, no staleness issues.

- 1. **Base Personality** (`personalities/phi.md`)
-    - Static file containing core identity, style, boundaries
-    - Always loaded as system prompt
-    - ~3,000 characters
+ **episodic memory for semantics**: turbopuffer stores embeddings for semantic search across all conversations. different purpose than thread chronology.

- 2. **Dynamic Enhancements** (TurboPuffer)
-    - `evolution`: Personality growth and changes over time
-    - `current_state`: Bot's current self-reflection
-    - Only contains ADDITIONS, not duplicates
+ **mcp for extensibility**: tools provided by external server via stdio. easy to add new capabilities without changing agent code.

- 3. **User Memories** (`phi-users-{handle}`)
-    - Conversation history with each user
-    - User-specific facts and preferences
-    - Isolated per user for privacy
-
- ### Context Budget
- - Base personality: ~3,000 chars
- - Dynamic enhancements: ~500 chars
- - User memories: ~500 chars
- - **Total**: ~4,000 chars (efficient!)
-
- ## Personality System
-
- ### Self-Modification Boundaries
-
- 1. **Free to modify**:
-    - Add new interests
-    - Update current state/reflection
-    - Learn user preferences
-
- 2. **Requires operator approval**:
-    - Core identity changes
-    - Boundary modifications
-    - Communication style overhauls
-
- ### Approval Workflow
- 1. Bot detects request for protected change
- 2. Creates approval request in database
- 3. DMs operator (@zzstoatzz.io) for approval
- 4. Operator responds naturally (no rigid format)
- 5. Bot interprets response using LLM
- 6. Applies approved changes to memory
- 7. Notifies original thread of update
-
- This event-driven system follows 12-factor-agents principles for reliable async processing.
+ **structured outputs**: agent returns typed `Response` objects, not free text. clear contract between agent and handler.
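the new architecture doc says the notification poller "tracks processed URIs to avoid duplicates". a minimal sketch of that idea follows, assuming an in-memory set and a hypothetical `on_mention` callback; the class name mirrors the doc, but the method names and the in-memory storage are illustrative assumptions, not the repo's actual implementation.

```python
from typing import Callable


class NotificationPoller:
    """Hands each mention URI to the handler at most once (illustrative sketch)."""

    def __init__(self, on_mention: Callable[[str], None]) -> None:
        self._on_mention = on_mention        # hypothetical callback into the message handler
        self._processed: set[str] = set()    # URIs already dispatched

    def poll_once(self, mention_uris: list[str]) -> int:
        """Dispatch unseen URIs in order; return how many were new."""
        new = 0
        for uri in mention_uris:
            if uri in self._processed:
                continue  # seen in an earlier poll, skip it
            self._processed.add(uri)
            self._on_mention(uri)
            new += 1
        return new


handled: list[str] = []
poller = NotificationPoller(handled.append)
poller.poll_once(["at://alice/post/1", "at://alice/post/2"])
poller.poll_once(["at://alice/post/2", "at://bob/post/7"])  # post/2 is skipped
```

in this sketch a URI is marked as processed before the callback runs, so a handler that raises is not retried on the next poll. marking it afterwards would trade that for possible duplicate replies.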
+19
docs/README.md
···
+ # documentation
+
+ deeper dive into phi's design and implementation.
+
+ ## contents
+
+ - [architecture.md](architecture.md) - system design and data flow
+ - [memory.md](memory.md) - thread context vs episodic memory
+ - [mcp.md](mcp.md) - model context protocol integration
+ - [testing.md](testing.md) - testing philosophy and approach
+
+ ## reading order
+
+ 1. start with **architecture.md** for overall system understanding
+ 2. read **memory.md** to understand the key design insight (two memory systems)
+ 3. read **mcp.md** to see how bluesky integration works
+ 4. read **testing.md** for quality assurance approach
+
+ each doc is self-contained and can be read independently.
+88
docs/mcp.md
···
+ # mcp integration
+
+ phi uses the [model context protocol](https://modelcontextprotocol.io) to interact with bluesky.
+
+ ## what is mcp
+
+ mcp is a protocol for connecting language models to external tools and data sources via a client-server architecture.
+
+ **why mcp instead of direct API calls?**
+ - clean separation: tools live in external server
+ - extensibility: add new tools without modifying agent
+ - reusability: same server can be used by other agents
+ - standard protocol: tools, resources, prompts
+
+ ## architecture
+
+ ```
+ PhiAgent (PydanticAI)
+     ↓ stdio
+ ATProto MCP Server
+     ↓ HTTPS
+ Bluesky API
+ ```
+
+ the agent communicates with the MCP server via stdio. the server handles all bluesky API interactions.
+
+ ## available tools
+
+ from the ATProto MCP server:
+
+ - `post(text, reply_to?, quote?)` - create posts and replies
+ - `like(uri)` - like a post
+ - `repost(uri)` - share a post
+ - `follow(handle)` - follow a user
+ - `search(query)` - search posts
+ - `create_thread(posts)` - create multi-post threads
+
+ ## how it works
+
+ 1. agent decides to use a tool (e.g., "i should reply")
+ 2. pydantic-ai sends tool call to MCP server via stdio
+ 3. MCP server executes bluesky API call
+ 4. result returned to agent
+ 5. agent continues with next action
+
+ ## agent configuration
+
+ ```python
+ # src/bot/agent.py
+ agent = Agent(
+     "claude-3-5-sonnet-20241022",
+     deps_type=AgentDeps,
+     result_type=Response,
+     system_prompt=personality,
+ )
+
+ # mcp server connected via stdio
+ mcp = MCPManager()
+ mcp.add_server(
+     name="atproto",
+     command=["uvx", "atproto-mcp"],
+     env={"BLUESKY_HANDLE": handle, "BLUESKY_PASSWORD": password}
+ )
+
+ # tools exposed to agent
+ async with mcp.run() as context:
+     for tool in context.list_tools():
+         agent.register_tool(tool)
+ ```
+
+ ## structured outputs
+
+ agent returns typed responses instead of using tools directly:
+
+ ```python
+ class Response(BaseModel):
+     action: Literal["reply", "like", "repost", "ignore"]
+     text: str | None = None
+     reason: str | None = None
+ ```
+
+ message handler interprets the response and executes via MCP tools if needed.
+
+ **why structured outputs?**
+ - clear contract between agent and handler
+ - easier testing (mock response objects)
+ - explicit decision tracking
+ - agent focuses on "what to do", handler focuses on "how to do it"
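mcp.md says the message handler "interprets the response and executes via MCP tools if needed". one way that split could look is sketched below, under stated assumptions: `Response` is restated as a stdlib dataclass so the block runs without pydantic, and `execute`, `recording_tool`, and the `tools` mapping are hypothetical names. only the action values and the `post` / `like` / `repost` tool names come from the doc.

```python
from dataclasses import dataclass
from typing import Any, Callable, Literal, Optional


@dataclass
class Response:
    """Stdlib restatement of the doc's Response model (assumption: same three fields)."""
    action: Literal["reply", "like", "repost", "ignore"]
    text: Optional[str] = None
    reason: Optional[str] = None


ToolCall = Callable[..., Any]


def execute(response: Response, mention_uri: str, tools: dict[str, ToolCall]) -> Optional[str]:
    """Turn the agent's decision into at most one tool call; return the tool name used."""
    if response.action == "reply":
        if not response.text:
            raise ValueError("reply action requires text")
        tools["post"](text=response.text, reply_to=mention_uri)
        return "post"
    if response.action in ("like", "repost"):
        tools[response.action](uri=mention_uri)
        return response.action
    return None  # "ignore": nothing to execute


def recording_tool(name: str, log: list) -> ToolCall:
    """Fake tool that records its call instead of reaching bluesky."""
    def tool(**kwargs: Any) -> None:
        log.append((name, kwargs))
    return tool


calls: list = []
fake_tools = {name: recording_tool(name, calls) for name in ("post", "like", "repost")}
execute(Response(action="reply", text="me too!", reason="direct question"), "at://alice/post/1", fake_tools)
```

because the handler only sees a `Response` and a mapping of callables, the same function can be exercised with recorded fakes in tests and with real MCP-backed tools in production.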
+73
docs/memory.md
···
+ # memory
+
+ phi has two distinct memory systems with different purposes.
+
+ ## thread context (chronological)
+
+ **source**: ATProto network
+ **access**: `client.get_thread(uri, depth=100)`
+ **purpose**: what was said in this specific thread
+
+ fetched on-demand from the network when processing mentions. provides chronological conversation flow.
+
+ ```python
+ # example thread context
+ @alice: I love birds
+ @phi: me too! what's your favorite?
+ @alice: especially crows
+ ```
+
+ **why not cache this?**
+ - data already exists on network
+ - appview aggregates posts from PDSs
+ - fetching is fast (~200ms)
+ - network is always current (handles edits/deletions)
+
+ ## episodic memory (semantic)
+
+ **source**: TurboPuffer
+ **access**: `memory.get_user_memories(handle, query="birds")`
+ **purpose**: what do i remember about this person across all conversations
+
+ uses vector embeddings (OpenAI text-embedding-3-small) for semantic search.
+
+ ```python
+ # example episodic memories
+ - "alice mentioned she loves birds"
+ - "discussed crow intelligence with alice"
+ - "alice prefers corvids over other species"
+ ```
+
+ **why vector storage?**
+ - semantic similarity (can't do with chronological data)
+ - cross-conversation patterns
+ - contextual retrieval based on current topic
+ - enables relationship building over time
+
+ ## namespaces
+
+ ```
+ phi-users-{handle} - per-user conversation history
+ ```
+
+ each user gets their own namespace for isolated memory retrieval.
+
+ ## key distinction
+
+ | | thread context | episodic memory |
+ |---|---|---|
+ | **what** | messages in current thread | patterns across all conversations |
+ | **when** | this conversation | all time |
+ | **how** | chronological order | semantic similarity |
+ | **storage** | network (ATProto) | vector DB (TurboPuffer) |
+ | **query** | by thread URI | by semantic search |
+
+ ## in practice
+
+ when processing a mention from `@alice`:
+
+ 1. fetch current thread: "what was said in THIS conversation?"
+ 2. search episodic memory: "what do i know about alice from PAST conversations?"
+ 3. combine both into context for agent
+
+ this gives phi both immediate conversational awareness and long-term relationship memory.
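the three "in practice" steps of memory.md can be sketched end to end with toy stand-ins. everything below is an assumption made for illustration: word-count vectors replace the OpenAI embeddings, a dict replaces TurboPuffer, and `embed`, `cosine`, `store`, `top_k`, and `build_context` are hypothetical names. only the `phi-users-{handle}` namespace pattern and the `get_user_memories(handle, query=...)` call shape come from the doc.

```python
import math
from collections import Counter, defaultdict


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class EpisodicMemory:
    """In-memory stand-in for the vector store, one namespace per user."""

    def __init__(self) -> None:
        self._namespaces: dict[str, list[str]] = defaultdict(list)

    def store(self, handle: str, memory: str) -> None:
        self._namespaces[f"phi-users-{handle}"].append(memory)

    def get_user_memories(self, handle: str, query: str, top_k: int = 2) -> list[str]:
        """Rank only this user's memories by similarity to the query."""
        q = embed(query)
        memories = self._namespaces.get(f"phi-users-{handle}", [])
        return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:top_k]


def build_context(thread: list[str], memories: list[str]) -> str:
    """Combine chronological thread lines and retrieved memories into one prompt section."""
    parts = ["thread context:", *thread, "", "episodic memory:", *(f"- {m}" for m in memories)]
    return "\n".join(parts)


memory = EpisodicMemory()
memory.store("alice", "alice mentioned she loves birds")
memory.store("alice", "alice asked about python packaging")
memory.store("bob", "bob loves birds too")

thread = ["@alice: I love birds", "@phi: me too! what's your favorite?", "@alice: especially crows"]
context = build_context(thread, memory.get_user_memories("alice", query="birds", top_k=1))
```

the thread lines stay in posting order while the memories are ranked by similarity, which is the distinction the table above draws. bob's memory never appears in alice's context because retrieval is scoped to her namespace.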
+111
docs/testing.md
···
+ # testing
+
+ phi uses behavioral testing with llm-as-judge evaluation.
+
+ ## philosophy
+
+ **test outcomes, not implementation**
+
+ we care that phi:
+ - replies appropriately to mentions
+ - uses thread context correctly
+ - maintains consistent personality
+ - makes reasonable action decisions
+
+ we don't care:
+ - which exact HTTP calls were made
+ - internal state of the agent
+ - specific tool invocation order
+
+ ## test structure
+
+ ```python
+ async def test_thread_awareness():
+     """phi should reference thread context in replies"""
+
+     # arrange: create thread context
+     thread_context = """
+     @alice: I love birds
+     @phi: me too! what's your favorite?
+     """
+
+     # act: process new mention
+     response = await agent.process_mention(
+         mention_text="especially crows",
+         author_handle="alice.bsky.social",
+         thread_context=thread_context
+     )
+
+     # assert: behavioral check
+     assert response.action == "reply"
+     assert any(word in response.text.lower()
+                for word in ["bird", "crow", "favorite"])
+ ```
+
+ ## llm-as-judge
+
+ for subjective qualities (tone, relevance, personality):
+
+ ```python
+ async def test_personality_consistency():
+     """phi should maintain grounded, honest tone"""
+
+     response = await agent.process_mention(...)
+
+     # use claude opus to evaluate
+     evaluation = await judge_response(
+         response=response.text,
+         criteria=[
+             "grounded (not overly philosophical)",
+             "honest about capabilities",
+             "concise for bluesky's 300 char limit"
+         ]
+     )
+
+     assert evaluation.passes_criteria
+ ```
+
+ ## what we test
+
+ ### unit tests
+ - memory operations (store/retrieve)
+ - thread context building
+ - response parsing
+
+ ### integration tests
+ - full mention handling flow
+ - thread discovery
+ - decision making
+
+ ### behavioral tests (evals)
+ - personality consistency
+ - thread awareness
+ - appropriate action selection
+ - memory utilization
+
+ ## mocking strategy
+
+ **mock external services, not internal logic**
+
+ - mock ATProto client (don't actually post to bluesky)
+ - mock TurboPuffer (in-memory dict instead of network calls)
+ - mock MCP server (fake tool implementations)
+
+ **keep agent logic real** - we want to test actual decision making.
+
+ ## running tests
+
+ ```bash
+ just test   # unit tests
+ just evals  # behavioral tests with llm-as-judge
+ just check  # full suite (lint + typecheck + test)
+ ```
+
+ ## test isolation
+
+ tests never touch production:
+ - no real bluesky posts
+ - separate turbopuffer namespace for tests
+ - deterministic mock responses where needed
+
+ see `sandbox/TESTING_STRATEGY.md` for detailed approach.
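testing.md calls `judge_response(...)` and checks `evaluation.passes_criteria` without showing either. a possible shape is sketched below, with assumptions stated plainly: the `judge` parameter, the `Evaluation` fields, and the stub judge are hypothetical, and the real helper presumably calls an LLM where this sketch takes an injected coroutine. injecting the judge keeps the aggregation logic testable without network access.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

# hypothetical judge signature: (response_text, criterion) -> True if the criterion is met
Judge = Callable[[str, str], Awaitable[bool]]


@dataclass
class Evaluation:
    verdicts: dict[str, bool]  # one verdict per criterion, in the order given

    @property
    def passes_criteria(self) -> bool:
        return all(self.verdicts.values())

    @property
    def failed(self) -> list[str]:
        return [criterion for criterion, ok in self.verdicts.items() if not ok]


async def judge_response(response: str, criteria: list[str], judge: Judge) -> Evaluation:
    """Ask the judge about every criterion concurrently and collect the verdicts."""
    results = await asyncio.gather(*(judge(response, c) for c in criteria))
    return Evaluation(verdicts=dict(zip(criteria, results)))


async def length_only_judge(response: str, criterion: str) -> bool:
    """Stub judge: enforces only the 300-char criterion and approves everything else."""
    if "300 char" in criterion:
        return len(response) <= 300
    return True


evaluation = asyncio.run(
    judge_response(
        "crows are remarkably good at solving puzzles",
        ["honest about capabilities", "concise for bluesky's 300 char limit"],
        length_only_judge,
    )
)
```

exposing `failed` alongside `passes_criteria` makes a failing eval report which criterion was missed. note that an empty criteria list passes vacuously in this sketch.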