# solstone Diagnostic Guide

Quick reference for debugging and diagnosing issues. For detailed specifications, see linked documentation.

## Quick Health Check

```bash
# Check if supervisor services are running
pgrep -af "sol:observer|sol:sense|sol:supervisor"

# Check Callosum socket exists
ls -la journal/health/callosum.sock

# Check for stuck agents (should be empty or short-lived)
ls journal/agents/*/*_active.jsonl 2>/dev/null
```

**Healthy state:**
- All three processes running
- `callosum.sock` exists
- `supervisor.status` events show no stale heartbeats
- No `_active.jsonl` files older than a few minutes

---

## Service Architecture

The supervisor (`sol supervisor`) manages these services:

| Service | Command | Purpose | Auto-restart |
|---------|---------|---------|--------------|
| Callosum | (in-process) | Message bus for inter-service events | No |
| Observer | `sol observer` | Screen/audio capture (platform-detected) | Yes |
| Sense | `sol sense` | File detection, processing dispatch | Yes |

Cortex (agent execution) connects to Callosum but runs independently via `sol cortex`.

See [CALLOSUM.md](CALLOSUM.md) for message protocol and [CORTEX.md](CORTEX.md) for agent system.

---

## Log Locations

| What | Where |
|------|-------|
| Current service logs | `journal/health/{service}.log` (symlinks) |
| Day's process logs | `journal/{YYYYMMDD}/health/{ref}_{name}.log` |
| Agent execution | `journal/agents/<name>/*.jsonl` |
| Journal task log | `journal/task_log.txt` |

**Symlink structure:** Journal-level symlinks point to current day's logs. Day-level symlinks point to current process instance (by ref).
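
To make the two-level symlink layout concrete, the sketch below follows a log symlink one hop at a time until it reaches a regular file. The helper name `resolve_chain` is hypothetical, and the example path only assumes the layout in the table above; none of this is part of the `sol` CLI.

```python
import os

def resolve_chain(path, max_hops=10):
    """Return [path, target, target-of-target, ...], following symlinks hop by hop."""
    chain = [path]
    current = path
    for _ in range(max_hops):
        if not os.path.islink(current):
            break
        target = os.readlink(current)
        # Relative targets are interpreted relative to the symlink's own directory.
        if not os.path.isabs(target):
            target = os.path.join(os.path.dirname(current), target)
        # normpath is a simplification: it collapses ".." textually.
        current = os.path.normpath(target)
        chain.append(current)
    return chain

if __name__ == "__main__":
    # Assumed example path, taken from the Log Locations table.
    for hop in resolve_chain("journal/health/observer.log"):
        print(hop)
```

Printing every hop, rather than calling `os.path.realpath` directly, shows both the day-level symlink and the per-process log it currently points to.
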

```bash
# Tail current observer log
tail -f journal/health/observer.log

# Find today's logs
ls -la journal/$(date +%Y%m%d)/health/
```

---

## Health Signals

Health uses a **fail-fast model**: observers exit if they detect problems, and the supervisor restarts them. Health is simply whether the observer is running and sending status events.

| Signal | Healthy when | Stale when |
|--------|--------------|------------|
| `hear` | Status received within threshold | No status for 60+ seconds |
| `see` | Status received within threshold | No status for 60+ seconds |

Both signals track the same thing: is the observer alive and communicating? If the observer has capture problems (e.g., screencast files not growing), it exits gracefully and the supervisor restarts it.

Staleness threshold: 60 seconds (configurable via `--threshold`).

### Callosum Status Events

Services emit periodic status to Callosum (every 5 seconds when active):

- `observe.status` - Capture state (screencast, audio, activity)
- `cortex.status` - Running agents list
- `supervisor.status` - Service health, stale heartbeats

The supervisor checks for `observe.status` event freshness and includes `stale_heartbeats` in its own status.

See [CALLOSUM.md](CALLOSUM.md) Tract Registry for event schemas.

---

## Reading Agent Files

**Location:** `journal/agents/`

**File states:**
- `{name}/{timestamp}_active.jsonl` - Agent currently running
- `{name}/{timestamp}.jsonl` - Agent completed

**Event sequence** (JSONL, one event per line):

1. `request` - Initial spawn request (prompt, provider, name)
2. `start` - Agent began execution (model info)
3. `tool_start`/`tool_end` - Tool calls (paired by `call_id`)
4. `thinking` - Model reasoning (if supported)
5. `finish` or `error` - Final result or failure

```bash
# View an agent's final result
jq -r 'select(.event=="finish") | .result' journal/agents/default/1234567890123.jsonl

# List today's agents with their prompts
for id in $(jq -r '.agent_id' journal/agents/$(date +%Y%m%d).jsonl 2>/dev/null); do
  f=$(find journal/agents -maxdepth 2 -path "*/${id}.jsonl" -print -quit)
  [ -n "$f" ] || continue
  echo "=== $(basename "$f") ==="
  head -1 "$f" | jq -r '.prompt[:80]'
done
```

See [CORTEX.md](CORTEX.md) for complete event schemas and agent configuration.

---

## Common Issues

### Observer not capturing

```bash
# Check observer log for errors
tail -50 journal/health/observer.log | grep -i error

# Check if observer is emitting status (supervisor.status will show stale_heartbeats)
# Health is derived from observe.status Callosum events
```

Causes: DBus issues, screencast permissions, audio device unavailable.

### Agent appears stuck

```bash
# Find active agents
ls -la journal/agents/*/*_active.jsonl

# Check the last event in each active agent
# (loop per file: with several files, tail prints "==> file <==" headers that jq cannot parse)
for f in journal/agents/*/*_active.jsonl; do
  [ -e "$f" ] || continue
  echo "=== $f ==="
  tail -n 1 "$f" | jq .
done
```

Causes: Backend timeout, tool hanging, network issues.

### No Callosum events

```bash
# Verify socket exists
ls -la journal/health/callosum.sock

# Check supervisor is running
pgrep -af sol:supervisor
```

Causes: Supervisor not started, socket path permissions.

### Processing backlog

```bash
# Check sense log for queue status
grep -i "queue" journal/health/sense.log | tail -10
```

Causes: Slow transcription, describe API rate limits.

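
The stuck-agent check under "Agent appears stuck" can also be scripted. The sketch below lists `_active.jsonl` files whose modification time is older than a cutoff and reports the last event in each. The function name `find_stale_agents`, the 300-second default, and the use of file mtime as a proxy for "last activity" are assumptions for illustration, not documented solstone behavior; the `event` field is the one used by the `jq` examples in Reading Agent Files.

```python
import glob
import json
import os
import time

def find_stale_agents(journal="journal", max_age_secs=300, now=None):
    """Return (path, age_secs, last_event) for active agent files older than max_age_secs."""
    now = time.time() if now is None else now
    stale = []
    pattern = os.path.join(journal, "agents", "*", "*_active.jsonl")
    for path in sorted(glob.glob(pattern)):
        age = now - os.path.getmtime(path)
        if age < max_age_secs:
            continue  # recently written, probably still making progress
        with open(path, "r", encoding="utf-8") as f:
            lines = [ln for ln in f if ln.strip()]
        last_event = None
        if lines:
            try:
                last_event = json.loads(lines[-1]).get("event")
            except (json.JSONDecodeError, AttributeError):
                last_event = "<unparseable>"
        stale.append((path, age, last_event))
    return stale

if __name__ == "__main__":
    for path, age, event in find_stale_agents():
        print(f"{path}: idle {age:.0f}s, last event: {event}")
```

A last event of `tool_start` with no matching `tool_end` would point at a hanging tool, while a last event of `start` or `thinking` is more consistent with a backend timeout.
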

---

## Useful Commands

```bash
# Watch all service logs
tail -f journal/health/*.log

# Count today's agents by status
echo "Completed: $([ -f journal/agents/$(date +%Y%m%d).jsonl ] && wc -l < journal/agents/$(date +%Y%m%d).jsonl || echo 0)"
echo "Running: $(ls journal/agents/*/*_active.jsonl 2>/dev/null | wc -l)"

# Find agents that errored today
jq -r 'select(.status=="error") | .agent_id' journal/agents/$(date +%Y%m%d).jsonl 2>/dev/null

# Check token usage for today
wc -l journal/tokens/$(date +%Y%m%d).jsonl

# Find errors in today's logs
grep -i error journal/$(date +%Y%m%d)/health/*.log

# Watch Callosum events in real-time
socat - UNIX-CONNECT:journal/health/callosum.sock
```

---

## Recovery Playbooks

### Unfinalized MOV Files (Missing moov Atom)

**Symptoms:** `sol describe` fails with `av.error.InvalidDataError: Invalid data found when processing input`. Sense logs show `describe failed ... exit code 1` and `Segment observed with errors ... ['describe exit 1']`.

**Diagnosis:** The `.mov` file has `ftyp` + `wide` + `mdat` atoms but is missing the `moov` atom. The `mdat` size is 0 (extends-to-EOF). This means the screen recorder (solstone-macos native app) never finalized the file: it wrote video frames but crashed or was interrupted before writing the metadata index.

Known trigger: screen sharing active during solstone-macos native app capture causes AVAssetWriter finalization to be skipped (missing `endSession()` call in `VideoWriter.swift`).

```bash
# Confirm the issue; should report "moov atom not found"
ffprobe -v error journal/YYYYMMDD/STREAM/SEGMENT/center_1_screen.mov

# Inspect atom structure (moov should be present but isn't)
python3 -c "
import struct, os, sys
path = sys.argv[1]
size = os.path.getsize(path)
pos = 0
with open(path, 'rb') as f:
    while pos < size:
        f.seek(pos)
        header = f.read(8)
        if len(header) < 8: break
        atom_size, atom_type = struct.unpack('>I4s', header)
        atom_type = atom_type.decode('ascii', errors='replace')
        flag = ' [extends-to-EOF]' if atom_size == 0 else ''
        if atom_size == 0: atom_size = size - pos
        print(f' {atom_type:6s} {atom_size:>12,} bytes{flag}')
        pos += atom_size
" /path/to/broken.mov
```

**Recovery:** Extract HEVC parameter sets (VPS/SPS/PPS) from a working sibling file's `hvcC` box, convert the broken file's length-prefixed NALUs to Annex B format, and remux with ffmpeg.

Prerequisites: a good `.mov` from the same stream/session (same codec settings), Python 3, ffmpeg.
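
The core of the recovery is the conversion from length-prefixed NAL units (each preceded by a 4-byte big-endian length, as stored in `mdat`) to Annex B (each preceded by the start code `00 00 00 01`). The sketch below shows that conversion on an in-memory buffer. The function name `length_prefixed_to_annex_b` is hypothetical, and it assumes 4-byte length prefixes, as the recovery steps in this playbook do.

```python
import struct

START_CODE = b"\x00\x00\x00\x01"

def length_prefixed_to_annex_b(payload):
    """Convert 4-byte length-prefixed NAL units to Annex B; return (stream, vcl_count)."""
    out = bytearray()
    vcl_count = 0
    pos = 0
    while pos + 4 <= len(payload):
        (nalu_len,) = struct.unpack(">I", payload[pos:pos + 4])
        end = pos + 4 + nalu_len
        # Stop at the first implausible length: a truncated file usually ends mid-NALU.
        if nalu_len == 0 or end > len(payload):
            break
        nalu = payload[pos + 4:end]
        # HEVC NAL header: the type is bits 1-6 of the first byte; types 0-31 are VCL (slice data).
        nal_type = (nalu[0] >> 1) & 0x3F
        if nal_type < 32:
            vcl_count += 1
        out += START_CODE + nalu
        pos = end
    return bytes(out), vcl_count
```

The VCL count is only an approximation of the frame count: it matches when each frame is encoded as a single slice, and overcounts otherwise.
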

```bash
# Step 1: Extract VPS/SPS/PPS from a good reference file
python3 -c "
import struct, os, sys

def find_atom(data, name, offset=0):
    pos = offset
    while pos < len(data) - 8:
        size = struct.unpack('>I', data[pos:pos+4])[0]
        atype = data[pos+4:pos+8]
        if size < 8: break
        if atype == name: return pos, size
        if atype in (b'moov', b'trak', b'mdia', b'minf', b'stbl'):
            result = find_atom(data, name, pos + 8)
            if result: return result
        pos += size
    return None

with open(sys.argv[1], 'rb') as f:
    data = f.read()
pos, size = find_atom(data, b'stsd')
stsd = data[pos:pos+size]
hvcc_off = stsd.find(b'hvcC')
hvcc_size = struct.unpack('>I', stsd[hvcc_off-4:hvcc_off])[0]
cfg = stsd[hvcc_off-4+8:hvcc_off-4+hvcc_size]
offset = 23
with open('/tmp/hevc_params.bin', 'wb') as pf:
    for i in range(cfg[22]):
        num = struct.unpack('>H', cfg[offset+1:offset+3])[0]
        offset += 3
        for j in range(num):
            nalu_len = struct.unpack('>H', cfg[offset:offset+2])[0]
            pf.write(b'\x00\x00\x00\x01')
            pf.write(cfg[offset+2:offset+2+nalu_len])
            offset += 2 + nalu_len
print('Wrote parameter sets to /tmp/hevc_params.bin')
" /path/to/good_reference.mov

# Step 2: Convert broken file to Annex B and remux
python3 -c "
import struct, os, subprocess, sys

src, dst, seg_duration = sys.argv[1], sys.argv[2], int(sys.argv[3])
fsize = os.path.getsize(src)
mdat_offset = 36  # ftyp(20) + wide(8) + mdat_header(8)

with open('/tmp/hevc_params.bin', 'rb') as pf:
    params = pf.read()

annex_b = '/tmp/recovery_raw.h265'
frame_count = 0
with open(src, 'rb') as fin, open(annex_b, 'wb') as fout:
    fout.write(params)
    fin.seek(mdat_offset)
    bytes_read = 0
    mdat_size = fsize - mdat_offset
    while bytes_read < mdat_size - 4:
        lb = fin.read(4)
        if len(lb) < 4: break
        nalu_len = struct.unpack('>I', lb)[0]
        if nalu_len <= 0 or nalu_len > mdat_size - bytes_read: break
        nalu_data = fin.read(nalu_len)
        if len(nalu_data) < nalu_len: break
        nal_type = (nalu_data[0] >> 1) & 0x3f
        if nal_type < 32: frame_count += 1
        fout.write(b'\x00\x00\x00\x01')
        fout.write(nalu_data)
        bytes_read += 4 + nalu_len

fps = f'{frame_count}/{seg_duration}'
print(f'{frame_count} frames, {fps} fps')
subprocess.run(['ffmpeg', '-y', '-v', 'warning', '-r', fps,
                '-f', 'hevc', '-i', annex_b, '-c', 'copy',
                '-movflags', '+faststart', '-tag:v', 'hvc1', dst], check=True)
os.unlink(annex_b)
print(f'Recovered: {dst}')
" /path/to/broken.mov /path/to/recovered.mov DURATION_SECS

# Step 3: Verify recovery
ffprobe -v error -show_streams /path/to/recovered.mov
# Should show codec_name=hevc, correct width/height/duration

# Step 4: Replace original and re-run describe
cp /path/to/recovered.mov /path/to/broken.mov
sol describe /path/to/broken.mov -v
```

**Notes:**
- The segment duration (DURATION_SECS) comes from the segment folder name (`HHMMSS_LEN`, where LEN is the duration in seconds)
- The reference file must be from the same stream/session so codec parameters match
- PyAV (used by `sol describe`) bundles its own HEVC decoder, so this works even if system ffmpeg lacks one
- After recovery, run `sol indexer` if you need the new screen extracts searchable

---

## See Also

- [JOURNAL.md](JOURNAL.md) - Directory structure and file formats
- [CORTEX.md](CORTEX.md) - Agent system, events, configuration
- [CALLOSUM.md](CALLOSUM.md) - Message bus protocol