solstone Diagnostic Guide#
Quick reference for debugging and diagnosing issues. For detailed specifications, see linked documentation.
Quick Health Check#
# Check if supervisor services are running
pgrep -af "sol:observer|sol:sense|sol:supervisor"
# Check Callosum socket exists
ls -la journal/health/callosum.sock
# Check for stuck agents (should be empty or short-lived)
ls journal/agents/*/*_active.jsonl 2>/dev/null
Healthy state:
- All three processes running
- callosum.sock exists
- supervisor.status events show no stale heartbeats
- No _active.jsonl files older than a few minutes
Service Architecture#
The supervisor (sol supervisor) manages these services:
| Service | Command | Purpose | Auto-restart |
|---|---|---|---|
| Callosum | (in-process) | Message bus for inter-service events | No |
| Observer | sol observer | Screen/audio capture (platform-detected) | Yes |
| Sense | sol sense | File detection, processing dispatch | Yes |
Cortex (agent execution) connects to Callosum but runs independently via sol cortex.
See CALLOSUM.md for message protocol and CORTEX.md for agent system.
Log Locations#
| What | Where |
|---|---|
| Current service logs | journal/health/{service}.log (symlinks) |
| Day's process logs | journal/{YYYYMMDD}/health/{ref}_{name}.log |
| Agent execution | journal/agents/<name>/*.jsonl |
| Journal task log | journal/task_log.txt |
Symlink structure: Journal-level symlinks point to current day's logs. Day-level symlinks point to current process instance (by ref).
# Tail current observer log
tail -f journal/health/observer.log
# Find today's logs
ls -la journal/$(date +%Y%m%d)/health/
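To see which process instance a journal-level log currently points at, the two-level symlink structure can be followed one hop at a time. This is a minimal sketch; symlink_chain is a hypothetical helper, not a solstone command, and it assumes the links are ordinary filesystem symlinks.

```python
import os


def symlink_chain(path, max_hops=10):
    """Return [path, target1, target2, ...], following symlinks hop by hop.

    Stops at the first non-symlink or after max_hops (guards against loops).
    """
    chain = [path]
    current = path
    for _ in range(max_hops):
        if not os.path.islink(current):
            break
        target = os.readlink(current)
        if not os.path.isabs(target):
            # Relative targets are resolved against the link's own directory.
            target = os.path.join(os.path.dirname(current), target)
        current = os.path.normpath(target)
        chain.append(current)
    return chain
```

For example, symlink_chain("journal/health/observer.log") would be expected to list the journal-level link, then the day-level link, then the concrete {ref}_{name}.log file.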
Health Signals#
Health uses a fail-fast model: observers exit if they detect problems, and the supervisor restarts them. A signal is healthy as long as the observer is running and sending status events.
| Signal | Healthy when | Stale when |
|---|---|---|
| hear | Status received within threshold | No status for 60+ seconds |
| see | Status received within threshold | No status for 60+ seconds |
Both signals track the same thing: is the observer alive and communicating? If the observer has capture problems (e.g., screencast files not growing), it exits gracefully and supervisor restarts it.
Staleness threshold: 60 seconds (configurable via --threshold).
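The staleness rule reduces to a timestamp comparison. The sketch below illustrates it under the assumption that the time of the last status event is tracked per signal; is_stale and stale_signals are hypothetical names, not solstone APIs.

```python
import time

DEFAULT_THRESHOLD = 60.0  # seconds, matching the documented default


def is_stale(last_status_ts, now=None, threshold=DEFAULT_THRESHOLD):
    """Return True when no status arrived within `threshold` seconds.

    last_status_ts is a Unix timestamp, or None if no status was ever seen.
    """
    if last_status_ts is None:
        return True
    if now is None:
        now = time.time()
    return (now - last_status_ts) >= threshold


def stale_signals(last_seen, now=None, threshold=DEFAULT_THRESHOLD):
    """Return the sorted names of signals (e.g. 'hear', 'see') that are stale."""
    return sorted(name for name, ts in last_seen.items()
                  if is_stale(ts, now, threshold))
```

Because both signals are derived from the same observer status, they would normally go stale together; a mismatch is worth a closer look at the observer log.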
Callosum Status Events#
Services emit periodic status to Callosum (every 5 seconds when active):
- observe.status - Capture state (screencast, audio, activity)
- cortex.status - Running agents list
- supervisor.status - Service health, stale heartbeats
The supervisor checks for observe.status event freshness and includes stale_heartbeats in its own status.
See CALLOSUM.md Tract Registry for event schemas.
Reading Agent Files#
Location: journal/agents/
File states:
- {name}/{timestamp}_active.jsonl - Agent currently running
- {name}/{timestamp}.jsonl - Agent completed
Event sequence (JSONL, one event per line):
- request - Initial spawn request (prompt, provider, name)
- start - Agent began execution (model info)
- tool_start / tool_end - Tool calls (paired by call_id)
- thinking - Model reasoning (if supported)
- finish or error - Final result or failure
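Since tool calls are paired by call_id, an unpaired tool_start is a quick hint that a tool never returned. The sketch below is illustrative only: summarize_agent_log is a hypothetical helper, and it relies solely on the event and call_id fields named above (see CORTEX.md for the authoritative schema).

```python
import json


def summarize_agent_log(lines):
    """Summarize one agent's JSONL events.

    Returns (final_event, pending_call_ids): the name of the last
    'finish'/'error' event (or None if the agent has not ended), and the
    call_ids that have a tool_start but no matching tool_end.
    """
    pending = {}  # call_id -> tool_start event, kept in arrival order
    final = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ev = json.loads(line)
        kind = ev.get("event")
        if kind == "tool_start":
            pending[ev.get("call_id")] = ev
        elif kind == "tool_end":
            pending.pop(ev.get("call_id"), None)
        elif kind in ("finish", "error"):
            final = kind
    return final, list(pending)
```

Typical use would be to open an _active.jsonl file and pass the file object directly, since iterating a file yields one line at a time.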
# View an agent's final result
jq -r 'select(.event=="finish") | .result' journal/agents/default/1234567890123.jsonl
# List today's agents with their prompts
for id in $(jq -r '.agent_id' journal/agents/$(date +%Y%m%d).jsonl 2>/dev/null); do
  f=$(find journal/agents -maxdepth 2 -path "*/${id}.jsonl" -print -quit)
  [ -n "$f" ] || continue
  echo "=== $(basename "$f") ==="
  head -1 "$f" | jq -r '.prompt[:80]'
done
See CORTEX.md for complete event schemas and agent configuration.
Common Issues#
Observer not capturing#
# Check observer log for errors
tail -50 journal/health/observer.log | grep -i error
# Check if observer is emitting status (supervisor.status will show stale_heartbeats)
# Health is derived from observe.status Callosum events
Causes: DBus issues, screencast permissions, audio device unavailable.
Agent appears stuck#
# Find active agents
ls -la journal/agents/*/*_active.jsonl
# Check last event in each active agent (-q suppresses the per-file headers that would break jq)
tail -q -n 1 journal/agents/*/*_active.jsonl | jq .
Causes: Backend timeout, tool hanging, network issues.
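To separate agents that are merely running from ones that have gone quiet, the active files can be filtered by age. The sketch below assumes that a file's modification time tracks the last event written to it; stale_active_agents and the 300-second default are illustrative choices, not solstone settings.

```python
import glob
import os
import time


def stale_active_agents(journal="journal", max_age_s=300, now=None):
    """Return (path, age_seconds) for *_active.jsonl files that have not
    been modified within max_age_s, oldest first.

    Assumption: mtime reflects the most recent event appended to the file.
    """
    if now is None:
        now = time.time()
    pattern = os.path.join(journal, "agents", "*", "*_active.jsonl")
    found = []
    for path in glob.glob(pattern):
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            continue  # file was renamed or removed between glob and stat
        if age > max_age_s:
            found.append((path, age))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

An empty result means every active agent has written an event recently, which matches the healthy state described in the quick health check.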
No Callosum events#
# Verify socket exists
ls -la journal/health/callosum.sock
# Check supervisor is running
pgrep -af sol:supervisor
Causes: Supervisor not started, socket path permissions.
Processing backlog#
# Check sense log for queue status
grep -i "queue" journal/health/sense.log | tail -10
Causes: Slow transcription, describe API rate limits.
Useful Commands#
# Watch all service logs
tail -f journal/health/*.log
# Count today's agents by status
echo "Completed: $([ -f journal/agents/$(date +%Y%m%d).jsonl ] && wc -l < journal/agents/$(date +%Y%m%d).jsonl || echo 0)"
echo "Running: $(ls journal/agents/*/*_active.jsonl 2>/dev/null | wc -l)"
# Find agents that errored today
jq -r 'select(.status=="error") | .agent_id' journal/agents/$(date +%Y%m%d).jsonl 2>/dev/null
# Check token usage for today
wc -l journal/tokens/$(date +%Y%m%d).jsonl
# Find errors in today's logs
grep -i error journal/$(date +%Y%m%d)/health/*.log
# Watch Callosum events in real-time
socat - UNIX-CONNECT:journal/health/callosum.sock
Recovery Playbooks#
Unfinalized MOV Files (Missing moov Atom)#
Symptoms: sol describe fails with av.error.InvalidDataError: Invalid data found when processing input. Sense logs show describe failed ... exit code 1 and Segment observed with errors ... ['describe exit 1'].
Diagnosis: The .mov file has ftyp + wide + mdat atoms but is missing the moov atom. The mdat size is 0 (extends-to-EOF). This means the screen recorder (solstone-macos native app) never finalized the file — it wrote video frames but crashed or was interrupted before writing the metadata index.
Known trigger: screen sharing active during solstone-macos native app capture causes AVAssetWriter finalization to be skipped (missing endSession() call in VideoWriter.swift).
# Confirm the issue — should report "moov atom not found"
ffprobe -v error journal/YYYYMMDD/STREAM/SEGMENT/center_1_screen.mov
# Inspect atom structure (moov should be present but isn't)
python3 -c "
import struct, os, sys
path = sys.argv[1]
size = os.path.getsize(path)
pos = 0
with open(path, 'rb') as f:
    while pos < size:
        f.seek(pos)
        header = f.read(8)
        if len(header) < 8: break
        atom_size, atom_type = struct.unpack('>I4s', header)
        atom_type = atom_type.decode('ascii', errors='replace')
        flag = ' [extends-to-EOF]' if atom_size == 0 else ''
        if atom_size == 0: atom_size = size - pos
        # Sizes 1-7 are a 64-bit extended size or a corrupt header; stop instead of looping forever
        if atom_size < 8: break
        print(f' {atom_type:6s} {atom_size:>12,} bytes{flag}')
        pos += atom_size
" /path/to/broken.mov
Recovery: Extract HEVC parameter sets (VPS/SPS/PPS) from a working sibling file's hvcC box, convert the broken file's length-prefixed NALUs to Annex B format, and remux with ffmpeg.
Prerequisites: a good .mov from the same stream/session (same codec settings), Python 3, ffmpeg.
# Step 1: Extract VPS/SPS/PPS from a good reference file
python3 -c "
import struct, os, sys
def find_atom(data, name, offset=0):
    pos = offset
    while pos < len(data) - 8:
        size = struct.unpack('>I', data[pos:pos+4])[0]
        atype = data[pos+4:pos+8]
        if size < 8: break
        if atype == name: return pos, size
        if atype in (b'moov', b'trak', b'mdia', b'minf', b'stbl'):
            result = find_atom(data, name, pos + 8)
            if result: return result
        pos += size
    return None
with open(sys.argv[1], 'rb') as f:
    data = f.read()
pos, size = find_atom(data, b'stsd')
stsd = data[pos:pos+size]
hvcc_off = stsd.find(b'hvcC')
hvcc_size = struct.unpack('>I', stsd[hvcc_off-4:hvcc_off])[0]
cfg = stsd[hvcc_off-4+8:hvcc_off-4+hvcc_size]
offset = 23
with open('/tmp/hevc_params.bin', 'wb') as pf:
    for i in range(cfg[22]):
        num = struct.unpack('>H', cfg[offset+1:offset+3])[0]
        offset += 3
        for j in range(num):
            nalu_len = struct.unpack('>H', cfg[offset:offset+2])[0]
            pf.write(b'\x00\x00\x00\x01')
            pf.write(cfg[offset+2:offset+2+nalu_len])
            offset += 2 + nalu_len
print('Wrote parameter sets to /tmp/hevc_params.bin')
" /path/to/good_reference.mov
# Step 2: Convert broken file to Annex B and remux
python3 -c "
import struct, os, subprocess, sys
src, dst, seg_duration = sys.argv[1], sys.argv[2], int(sys.argv[3])
fsize = os.path.getsize(src)
mdat_offset = 36  # ftyp(20) + wide(8) + mdat_header(8)
with open('/tmp/hevc_params.bin', 'rb') as pf:
    params = pf.read()
annex_b = '/tmp/recovery_raw.h265'
frame_count = 0
with open(src, 'rb') as fin, open(annex_b, 'wb') as fout:
    fout.write(params)
    fin.seek(mdat_offset)
    bytes_read = 0
    mdat_size = fsize - mdat_offset
    while bytes_read < mdat_size - 4:
        lb = fin.read(4)
        if len(lb) < 4: break
        nalu_len = struct.unpack('>I', lb)[0]
        if nalu_len <= 0 or nalu_len > mdat_size - bytes_read: break
        nalu_data = fin.read(nalu_len)
        if len(nalu_data) < nalu_len: break
        nal_type = (nalu_data[0] >> 1) & 0x3f
        if nal_type < 32: frame_count += 1
        fout.write(b'\x00\x00\x00\x01')
        fout.write(nalu_data)
        bytes_read += 4 + nalu_len
fps = f'{frame_count}/{seg_duration}'
print(f'{frame_count} frames, {fps} fps')
subprocess.run(['ffmpeg', '-y', '-v', 'warning', '-r', fps,
    '-f', 'hevc', '-i', annex_b, '-c', 'copy',
    '-movflags', '+faststart', '-tag:v', 'hvc1', dst], check=True)
os.unlink(annex_b)
print(f'Recovered: {dst}')
" /path/to/broken.mov /path/to/recovered.mov DURATION_SECS
# Step 3: Verify recovery
ffprobe -v error -show_streams /path/to/recovered.mov
# Should show codec_name=hevc, correct width/height/duration
# Step 4: Back up the original, replace it, and re-run describe
cp /path/to/broken.mov /path/to/broken.mov.bak
cp /path/to/recovered.mov /path/to/broken.mov
sol describe /path/to/broken.mov -v
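Step 2 hard-codes mdat_offset = 36 (ftyp 20 + wide 8 + mdat header 8). If a broken file carries a differently sized ftyp atom, that constant would point into the wrong place. The sketch below derives the payload offset by walking the top-level atoms instead; find_mdat_payload is a hypothetical helper, not part of solstone, and it has only been reasoned through against the layout described above.

```python
import struct


def find_mdat_payload(path):
    """Walk top-level atoms and return (payload_offset, payload_size) for
    the first 'mdat' atom, or None if no mdat atom is found."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        file_size = f.tell()
        pos = 0
        while pos + 8 <= file_size:
            f.seek(pos)
            size, atype = struct.unpack(">I4s", f.read(8))
            header = 8
            if size == 1:  # a 64-bit size follows the type field
                ext = f.read(8)
                if len(ext) < 8:
                    return None
                size = struct.unpack(">Q", ext)[0]
                header = 16
            elif size == 0:  # atom extends to end of file
                size = file_size - pos
            if size < header:
                return None  # corrupt header; stop rather than loop forever
            if atype == b"mdat":
                # Clamp to the file size in case the declared size overruns.
                return pos + header, min(size, file_size - pos) - header
            pos += size
    return None
```

For the ftyp + wide + mdat layout in the diagnosis, the returned offset should equal 36, which is a cheap sanity check before running the conversion.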
Notes:
- The segment duration (DURATION_SECS) comes from the segment folder name (HHMMSS_LEN, where LEN is the duration in seconds)
- The reference file must be from the same stream/session so codec parameters match
- PyAV (used by sol describe) bundles its own HEVC decoder, so this works even if system ffmpeg lacks one
- After recovery, run sol indexer if you need the new screen extracts searchable
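The DURATION_SECS argument can be read off the segment folder name rather than typed by hand. The sketch below assumes the name is exactly six zero-padded digits, an underscore, and a decimal length; parse_segment_name is a hypothetical helper, and JOURNAL.md remains the authority on the naming scheme.

```python
import re

# Assumed shape: HHMMSS_LEN, e.g. 143000_300
_SEGMENT_RE = re.compile(r"^(\d{2})(\d{2})(\d{2})_(\d+)$")


def parse_segment_name(name):
    """Split 'HHMMSS_LEN' into (start_seconds_since_midnight, duration_secs)."""
    m = _SEGMENT_RE.match(name)
    if not m:
        raise ValueError(f"not an HHMMSS_LEN segment name: {name!r}")
    hh, mm, ss, length = (int(g) for g in m.groups())
    if hh > 23 or mm > 59 or ss > 59 or length <= 0:
        raise ValueError(f"out-of-range field in segment name: {name!r}")
    return hh * 3600 + mm * 60 + ss, length
```

The second element of the result is the value to pass as DURATION_SECS in Step 2.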
See Also#
- JOURNAL.md - Directory structure and file formats
- CORTEX.md - Agent system, events, configuration
- CALLOSUM.md - Message bus protocol