# solstone Diagnostic Guide

Quick reference for debugging and diagnosing issues. For detailed specifications, see linked documentation.

## Quick Health Check

```bash
# Check if supervisor services are running
pgrep -af "sol:observer|sol:sense|sol:supervisor"

# Check Callosum socket exists
ls -la journal/health/callosum.sock

# Check for stuck agents (should be empty or short-lived)
ls journal/agents/*/*_active.jsonl 2>/dev/null
```

**Healthy state:**
- All three processes running
- `callosum.sock` exists
- `supervisor.status` events show no stale heartbeats
- No `_active.jsonl` files older than a few minutes
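
The last check can be scripted. The sketch below is a minimal illustration, not part of solstone: `stale_active_files` is a hypothetical helper, the 300-second default is an assumed reading of "a few minutes", and it assumes a running agent keeps appending to its `_active.jsonl` file, so that modification time is a usable proxy for activity.

```python
import time
from pathlib import Path


def stale_active_files(agents_dir, max_age_s=300.0, now=None):
    """Return (path, age_seconds) for each <name>/*_active.jsonl older than max_age_s.

    Hypothetical helper: age is measured from the file's modification time,
    which assumes agents append events to the file while they run.
    """
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(agents_dir).glob("*/*_active.jsonl")):
        age = now - path.stat().st_mtime
        if age > max_age_s:
            stale.append((path, age))
    return stale
```

Calling `stale_active_files("journal/agents")` from the directory that contains `journal/` returns an empty list in the healthy case.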

---

## Service Architecture

The supervisor (`sol supervisor`) manages these services:

| Service | Command | Purpose | Auto-restart |
|---------|---------|---------|--------------|
| Callosum | (in-process) | Message bus for inter-service events | No |
| Observer | `sol observer` | Screen/audio capture (platform-detected) | Yes |
| Sense | `sol sense` | File detection, processing dispatch | Yes |

Cortex (agent execution) connects to Callosum but runs independently via `sol cortex`.

See [CALLOSUM.md](CALLOSUM.md) for message protocol and [CORTEX.md](CORTEX.md) for agent system.

---

## Log Locations

| What | Where |
|------|-------|
| Current service logs | `journal/health/{service}.log` (symlinks) |
| Day's process logs | `journal/{YYYYMMDD}/health/{ref}_{name}.log` |
| Agent execution | `journal/agents/<name>/*.jsonl` |
| Journal task log | `journal/task_log.txt` |

**Symlink structure:** Journal-level symlinks point to current day's logs. Day-level symlinks point to current process instance (by ref).
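
When a log looks empty or outdated, it can help to see which file a symlink actually resolves to. The sketch below uses a hypothetical helper, `symlink_chain`, that follows links one hop at a time. It makes no assumption about how many hops solstone uses and simply reports what is on disk.

```python
import os


def symlink_chain(path, max_hops=16):
    """Return every path visited while following symlinks from path.

    Hypothetical helper: the last element is the first non-symlink reached,
    or the path reached after max_hops if the links form a loop.
    """
    chain = [path]
    current = path
    for _ in range(max_hops):
        if not os.path.islink(current):
            break
        target = os.readlink(current)
        # Relative targets are resolved against the directory holding the link.
        current = os.path.normpath(os.path.join(os.path.dirname(current), target))
        chain.append(current)
    return chain
```

For example, `print(*symlink_chain("journal/health/observer.log"), sep="\n  -> ")` prints each hop from the journal-level link to the log file it ends at.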

```bash
# Tail current observer log
tail -f journal/health/observer.log

# Find today's logs
ls -la journal/$(date +%Y%m%d)/health/
```

---

## Health Signals

Health uses a **fail-fast model**: observers exit if they detect problems, and the supervisor restarts them. Health is simply whether the observer is running and sending status events.

| Signal | Healthy when | Stale when |
|--------|--------------|------------|
| `hear` | Status received within threshold | No status for 60+ seconds |
| `see` | Status received within threshold | No status for 60+ seconds |

Both signals track the same thing: is the observer alive and communicating? If the observer has capture problems (e.g., screencast files not growing), it exits gracefully and the supervisor restarts it.

Staleness threshold: 60 seconds (configurable via `--threshold`).
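
The staleness rule in the table reduces to a timestamp comparison. The following is a sketch of that rule only, not the supervisor's actual code: `stale_signals` and the `last_seen` mapping (signal name to the Unix time of its last status) are hypothetical names chosen for illustration.

```python
import time

STALE_AFTER_S = 60.0  # default threshold stated in this guide


def stale_signals(last_seen, now=None, threshold_s=STALE_AFTER_S):
    """Return the sorted names of signals with no status for threshold_s or more.

    last_seen maps a signal name to the Unix time of its last status event,
    or None if no status has been received yet (hypothetical data shape).
    """
    now = time.time() if now is None else now
    return sorted(
        name
        for name, ts in last_seen.items()
        if ts is None or now - ts >= threshold_s
    )


# "hear" reported 30 s ago, "see" 70 s ago, so only "see" is stale.
print(stale_signals({"hear": 100.0, "see": 60.0}, now=130.0))  # ['see']
```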

### Callosum Status Events

Services emit periodic status to Callosum (every 5 seconds when active):

- `observe.status` - Capture state (screencast, audio, activity)
- `cortex.status` - Running agents list
- `supervisor.status` - Service health, stale heartbeats

The supervisor checks for `observe.status` event freshness and includes `stale_heartbeats` in its own status.

See [CALLOSUM.md](CALLOSUM.md) Tract Registry for event schemas.

---

## Reading Agent Files

**Location:** `journal/agents/`

**File states:**
- `{name}/{timestamp}_active.jsonl` - Agent currently running
- `{name}/{timestamp}.jsonl` - Agent completed

**Event sequence** (JSONL, one event per line):

1. `request` - Initial spawn request (prompt, provider, name)
2. `start` - Agent began execution (model info)
3. `tool_start`/`tool_end` - Tool calls (paired by `call_id`)
4. `thinking` - Model reasoning (if supported)
5. `finish` or `error` - Final result or failure
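
Because tool calls are paired by `call_id`, an agent log can be scanned for calls that started but never ended, which is often the quickest way to see where an agent stalled. The sketch below relies only on the `event` and `call_id` fields named above. `summarize_agent_log` is a hypothetical helper, and the full event schema lives in CORTEX.md.

```python
import json


def summarize_agent_log(lines):
    """Summarize an agent JSONL log given as an iterable of text lines.

    Hypothetical helper: reports the last event type, whether the log ended
    with finish/error, and which tool_start events have no matching tool_end.
    """
    open_calls = {}  # call_id -> tool_start event still waiting for tool_end
    last_event = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        kind = event.get("event")
        last_event = kind
        if kind == "tool_start":
            open_calls[event.get("call_id")] = event
        elif kind == "tool_end":
            open_calls.pop(event.get("call_id"), None)
    return {
        "finished": last_event in ("finish", "error"),
        "last_event": last_event,
        "open_call_ids": list(open_calls),
    }
```

A file object can be passed directly, for example `with open(path) as f: print(summarize_agent_log(f))`. A non-empty `open_call_ids` on an `_active.jsonl` file points at a tool call that has not returned.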

```bash
# View an agent's final result
jq -r 'select(.event=="finish") | .result' journal/agents/default/1234567890123.jsonl

# List today's agents with their prompts
for id in $(jq -r '.agent_id' journal/agents/$(date +%Y%m%d).jsonl 2>/dev/null); do
  f=$(find journal/agents -maxdepth 2 -path "*/${id}.jsonl" -print -quit)
  [ -n "$f" ] || continue
  echo "=== $(basename "$f") ==="
  head -1 "$f" | jq -r '.prompt[:80]'
done
```

See [CORTEX.md](CORTEX.md) for complete event schemas and agent configuration.

---

## Common Issues

### Observer not capturing

```bash
# Check observer log for errors
tail -50 journal/health/observer.log | grep -i error

# Check if observer is emitting status (supervisor.status will show stale_heartbeats)
# Health is derived from observe.status Callosum events
```

Causes: DBus issues, screencast permissions, audio device unavailable.

### Agent appears stuck

```bash
# Find active agents
ls -la journal/agents/*/*_active.jsonl

# Check last event in each active agent (a loop avoids tail's multi-file headers, which break jq)
for f in journal/agents/*/*_active.jsonl; do
  [ -e "$f" ] || continue
  echo "=== $f ==="
  tail -n 1 "$f" | jq .
done
```

Causes: Backend timeout, tool hanging, network issues.

### No Callosum events

```bash
# Verify socket exists
ls -la journal/health/callosum.sock

# Check supervisor is running
pgrep -af sol:supervisor
```

Causes: Supervisor not started, socket path permissions.

### Processing backlog

```bash
# Check sense log for queue status
grep -i "queue" journal/health/sense.log | tail -10
```

Causes: Slow transcription, describe API rate limits.

---

## Useful Commands

```bash
# Watch all service logs
tail -f journal/health/*.log

# Count today's agents by status
echo "Completed: $([ -f journal/agents/$(date +%Y%m%d).jsonl ] && wc -l < journal/agents/$(date +%Y%m%d).jsonl || echo 0)"
echo "Running: $(ls journal/agents/*/*_active.jsonl 2>/dev/null | wc -l)"

# Find agents that errored today
jq -r 'select(.status=="error") | .agent_id' journal/agents/$(date +%Y%m%d).jsonl 2>/dev/null

# Check token usage for today
wc -l journal/tokens/$(date +%Y%m%d).jsonl

# Find errors in today's logs
grep -i error journal/$(date +%Y%m%d)/health/*.log

# Watch Callosum events in real-time
socat - UNIX-CONNECT:journal/health/callosum.sock
```

---

## Recovery Playbooks

### Unfinalized MOV Files (Missing moov Atom)

**Symptoms:** `sol describe` fails with `av.error.InvalidDataError: Invalid data found when processing input`. Sense logs show `describe failed ... exit code 1` and `Segment observed with errors ... ['describe exit 1']`.

**Diagnosis:** The `.mov` file has `ftyp` + `wide` + `mdat` atoms but is missing the `moov` atom. The `mdat` size is 0 (extends-to-EOF). This means the screen recorder (solstone-macos native app) never finalized the file: it wrote video frames but crashed or was interrupted before writing the metadata index.

Known trigger: screen sharing active during solstone-macos native app capture causes AVAssetWriter finalization to be skipped (missing `endSession()` call in `VideoWriter.swift`).

```bash
# Confirm the issue: should report "moov atom not found"
ffprobe -v error journal/YYYYMMDD/STREAM/SEGMENT/center_1_screen.mov

# Inspect atom structure (moov should be present but isn't)
python3 -c "
import struct, os, sys
path = sys.argv[1]
size = os.path.getsize(path)
pos = 0
with open(path, 'rb') as f:
    while pos < size:
        f.seek(pos)
        header = f.read(8)
        if len(header) < 8: break
        atom_size, atom_type = struct.unpack('>I4s', header)
        atom_type = atom_type.decode('ascii', errors='replace')
        # Size 1 means a 64-bit size follows the header; size 0 means the atom runs to EOF
        if atom_size == 1: atom_size = struct.unpack('>Q', f.read(8))[0]
        flag = ' [extends-to-EOF]' if atom_size == 0 else ''
        if atom_size == 0: atom_size = size - pos
        if atom_size < 8: break
        print(f' {atom_type:6s} {atom_size:>12,} bytes{flag}')
        pos += atom_size
" /path/to/broken.mov
```

**Recovery:** Extract HEVC parameter sets (VPS/SPS/PPS) from a working sibling file's `hvcC` box, convert the broken file's length-prefixed NALUs to Annex B format, and remux with ffmpeg.

Prerequisites: a good `.mov` from the same stream/session (same codec settings), Python 3, ffmpeg.

```bash
# Step 1: Extract VPS/SPS/PPS from a good reference file
python3 -c "
import struct, sys

def find_atom(data, name, offset=0):
    # Walk sibling atoms from offset, descending into the containers that lead to stsd
    pos = offset
    while pos < len(data) - 8:
        size = struct.unpack('>I', data[pos:pos+4])[0]
        atype = data[pos+4:pos+8]
        if size < 8: break
        if atype == name: return pos, size
        if atype in (b'moov', b'trak', b'mdia', b'minf', b'stbl'):
            result = find_atom(data, name, pos + 8)
            if result: return result
        pos += size
    return None

with open(sys.argv[1], 'rb') as f:
    data = f.read()
found = find_atom(data, b'stsd')
if not found: sys.exit('no stsd atom found in reference file')
pos, size = found
stsd = data[pos:pos+size]
hvcc_off = stsd.find(b'hvcC')
if hvcc_off < 0: sys.exit('no hvcC box found in stsd (reference is not HEVC?)')
hvcc_size = struct.unpack('>I', stsd[hvcc_off-4:hvcc_off])[0]
cfg = stsd[hvcc_off-4+8:hvcc_off-4+hvcc_size]
# Byte 22 of the hvcC payload is the parameter-set array count; arrays start at byte 23
offset = 23
with open('/tmp/hevc_params.bin', 'wb') as pf:
    for i in range(cfg[22]):
        num = struct.unpack('>H', cfg[offset+1:offset+3])[0]
        offset += 3
        for j in range(num):
            nalu_len = struct.unpack('>H', cfg[offset:offset+2])[0]
            pf.write(b'\x00\x00\x00\x01')
            pf.write(cfg[offset+2:offset+2+nalu_len])
            offset += 2 + nalu_len
print('Wrote parameter sets to /tmp/hevc_params.bin')
" /path/to/good_reference.mov

# Step 2: Convert broken file to Annex B and remux
python3 -c "
import struct, os, subprocess, sys

src, dst, seg_duration = sys.argv[1], sys.argv[2], int(sys.argv[3])
fsize = os.path.getsize(src)
mdat_offset = 36  # ftyp(20) + wide(8) + mdat_header(8)

with open('/tmp/hevc_params.bin', 'rb') as pf:
    params = pf.read()

annex_b = '/tmp/recovery_raw.h265'
frame_count = 0
with open(src, 'rb') as fin, open(annex_b, 'wb') as fout:
    fout.write(params)
    fin.seek(mdat_offset)
    bytes_read = 0
    mdat_size = fsize - mdat_offset
    while bytes_read < mdat_size - 4:
        lb = fin.read(4)
        if len(lb) < 4: break
        nalu_len = struct.unpack('>I', lb)[0]
        if nalu_len == 0 or nalu_len > mdat_size - bytes_read: break
        nalu_data = fin.read(nalu_len)
        if len(nalu_data) < nalu_len: break
        # NAL types below 32 carry picture data; count each as a frame (assumes one slice per frame)
        nal_type = (nalu_data[0] >> 1) & 0x3f
        if nal_type < 32: frame_count += 1
        fout.write(b'\x00\x00\x00\x01')
        fout.write(nalu_data)
        bytes_read += 4 + nalu_len

if frame_count == 0:
    os.unlink(annex_b)
    sys.exit('no video frames found; check that mdat data starts at byte 36')
fps = f'{frame_count}/{seg_duration}'
print(f'{frame_count} frames, {fps} fps')
subprocess.run(['ffmpeg', '-y', '-v', 'warning', '-r', fps,
    '-f', 'hevc', '-i', annex_b, '-c', 'copy',
    '-movflags', '+faststart', '-tag:v', 'hvc1', dst], check=True)
os.unlink(annex_b)
print(f'Recovered: {dst}')
" /path/to/broken.mov /path/to/recovered.mov DURATION_SECS

# Step 3: Verify recovery
ffprobe -v error -show_streams /path/to/recovered.mov
# Should show codec_name=hevc, correct width/height/duration

# Step 4: Replace original and re-run describe
cp /path/to/recovered.mov /path/to/broken.mov
sol describe /path/to/broken.mov -v
```

**Notes:**
- The segment duration (DURATION_SECS) comes from the segment folder name (`HHMMSS_LEN`, where LEN is the duration in seconds)
- The reference file must be from the same stream/session so codec parameters match
- PyAV (used by `sol describe`) bundles its own HEVC decoder, so this works even if system ffmpeg lacks one
- After recovery, run `sol indexer` if you need the new screen extracts searchable
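
Reading DURATION_SECS off the folder name can be automated. The sketch below assumes the `.mov` sits directly inside a segment folder named `HHMMSS_LEN`, as in the `journal/YYYYMMDD/STREAM/SEGMENT/center_1_screen.mov` path above. `segment_duration` is a hypothetical helper, not a solstone command.

```python
import re
from pathlib import Path

# Six digits of start time, an underscore, then the length in seconds.
SEGMENT_RE = re.compile(r"^(\d{2})(\d{2})(\d{2})_(\d+)$")


def segment_duration(mov_path):
    """Return (start time as 'HH:MM:SS', duration in seconds) from the parent folder name.

    Hypothetical helper: raises ValueError if the folder is not named HHMMSS_LEN.
    """
    name = Path(mov_path).parent.name
    match = SEGMENT_RE.match(name)
    if not match:
        raise ValueError(f"not an HHMMSS_LEN segment folder: {name!r}")
    hh, mm, ss, length = match.groups()
    return f"{hh}:{mm}:{ss}", int(length)


# Example with an illustrative path (date, stream name, and segment are made up).
print(segment_duration("journal/20250101/STREAM/143000_300/center_1_screen.mov"))
# ('14:30:00', 300)
```

The second element of the returned tuple is the value to pass as DURATION_SECS in Step 2.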

---

## See Also

- [JOURNAL.md](JOURNAL.md) - Directory structure and file formats
- [CORTEX.md](CORTEX.md) - Agent system, events, configuration
- [CALLOSUM.md](CALLOSUM.md) - Message bus protocol