solstone Diagnostic Guide#

Quick reference for debugging and diagnosing issues. For detailed specifications, see linked documentation.

Quick Health Check#

# Check if supervisor services are running
pgrep -af "sol:observer|sol:sense|sol:supervisor"

# Check Callosum socket exists
ls -la journal/health/callosum.sock

# Check for stuck agents (should be empty or short-lived)
ls journal/agents/*/*_active.jsonl 2>/dev/null

Healthy state:

All three processes running
callosum.sock exists
supervisor.status events show no stale heartbeats
No _active.jsonl files older than a few minutes
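
The last check can be scripted. The following is a minimal sketch, assuming the `journal/agents/<name>/<timestamp>_active.jsonl` layout described in this guide; the function name `find_stale_active` and the 300-second default are illustrative choices, not part of solstone.

```python
import time
from pathlib import Path


def find_stale_active(journal: Path, max_age_s: float = 300.0) -> list:
    """Return *_active.jsonl files not modified within max_age_s seconds."""
    now = time.time()
    stale = []
    # Layout taken from this guide: journal/agents/<name>/<timestamp>_active.jsonl
    for path in sorted(journal.glob("agents/*/*_active.jsonl")):
        # A file that has not been appended to recently suggests a stuck agent.
        if now - path.stat().st_mtime > max_age_s:
            stale.append(path)
    return stale
```

Calling `find_stale_active(Path("journal"))` from the directory that contains the journal returns an empty list in a healthy state.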

Service Architecture#

The supervisor (sol supervisor) manages these services:

Service	Command	Purpose	Auto-restart
Callosum	(in-process)	Message bus for inter-service events	No
Observer	`sol observer`	Screen/audio capture (platform-detected)	Yes
Sense	`sol sense`	File detection, processing dispatch	Yes

Cortex (agent execution) connects to Callosum but runs independently via sol cortex.

See CALLOSUM.md for message protocol and CORTEX.md for agent system.

Log Locations#

What	Where
Current service logs	`journal/health/{service}.log` (symlinks)
Day's process logs	`journal/{YYYYMMDD}/health/{ref}_{name}.log`
Agent execution	`journal/agents/<name>/*.jsonl`
Journal task log	`journal/task_log.txt`

Symlink structure: Journal-level symlinks point to the current day's logs; day-level symlinks point to the current process instance (by ref).

# Tail current observer log
tail -f journal/health/observer.log

# Find today's logs
ls -la journal/$(date +%Y%m%d)/health/
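
To see which concrete file a `journal/health/{service}.log` symlink currently resolves to, the chain can be followed one hop at a time. This is a small sketch using only the standard library; `symlink_chain` is a hypothetical helper, and the intermediate names it prints depend on how the links are laid out on your system.

```python
import os


def symlink_chain(path: str, max_hops: int = 16) -> list:
    """Follow symlinks one hop at a time and return every path visited."""
    chain = [path]
    while os.path.islink(path) and len(chain) <= max_hops:
        target = os.readlink(path)
        # Relative targets are resolved against the link's own directory.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        chain.append(path)
    return chain
```

For example, `symlink_chain("journal/health/observer.log")` returns the starting link, each intermediate link, and the final log file. The last element is the file for the current process instance.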

Health Signals#

Health uses a fail-fast model: observers exit if they detect problems, and the supervisor restarts them. Health is simply whether the observer is running and sending status events.

Signal	Healthy when	Stale when
`hear`	Status received within threshold	No status for 60+ seconds
`see`	Status received within threshold	No status for 60+ seconds

Both signals track the same thing: whether the observer is alive and communicating. If the observer has capture problems (e.g., screencast files not growing), it exits gracefully and the supervisor restarts it.

Staleness threshold: 60 seconds (configurable via --threshold).
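
The staleness rule amounts to comparing the time of the last received status against the threshold. The sketch below illustrates that comparison only; it is not the supervisor's implementation, and the `stale_signals` name and the `last_seen` mapping are assumptions made for the example.

```python
import time
from typing import Dict, List, Optional

STALE_AFTER_S = 60.0  # default threshold stated in this guide


def stale_signals(last_seen: Dict[str, float],
                  now: Optional[float] = None,
                  threshold: float = STALE_AFTER_S) -> List[str]:
    """Return signal names whose last status is at least `threshold` seconds old."""
    if now is None:
        now = time.time()
    # "No status for 60+ seconds" is treated as age >= threshold.
    return sorted(name for name, ts in last_seen.items() if now - ts >= threshold)
```

With `last_seen = {"hear": 100.0, "see": 155.0}` and `now=160.0`, only `hear` is reported, because its last status is exactly 60 seconds old.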

Callosum Status Events#

Services emit periodic status to Callosum (every 5 seconds when active):

observe.status - Capture state (screencast, audio, activity)
cortex.status - Running agents list
supervisor.status - Service health, stale heartbeats

The supervisor checks the freshness of observe.status events and reports any that have gone stale as stale_heartbeats in its own supervisor.status event.

See CALLOSUM.md Tract Registry for event schemas.

Reading Agent Files#

Location: journal/agents/

File states:

{name}/{timestamp}_active.jsonl - Agent currently running
{name}/{timestamp}.jsonl - Agent completed

Event sequence (JSONL, one event per line):

request - Initial spawn request (prompt, provider, name)
start - Agent began execution (model info)
tool_start/tool_end - Tool calls (paired by call_id)
thinking - Model reasoning (if supported)
finish or error - Final result or failure

# View an agent's final result
jq -r 'select(.event=="finish") | .result' journal/agents/default/1234567890123.jsonl

# List today's agents with their prompts
for id in $(jq -r '.agent_id' journal/agents/$(date +%Y%m%d).jsonl 2>/dev/null); do
  f=$(find journal/agents -maxdepth 2 -path "*/${id}.jsonl" -print -quit)
  [ -n "$f" ] || continue
  echo "=== $(basename "$f") ==="
  head -1 "$f" | jq -r '.prompt[:80]'
done
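
The event sequence above can also be checked programmatically, which helps when an agent log is long. This is a rough sketch that relies only on the `event` and `call_id` fields named in this section; `summarize_agent` is a hypothetical helper, and CORTEX.md remains the authority on the full schema.

```python
import json


def summarize_agent(lines):
    """Summarize one agent log given an iterable of JSONL lines."""
    open_calls = {}   # call_id -> tool_start event still waiting for its tool_end
    final = None      # "finish" or "error" once a terminal event is seen
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        count += 1
        kind = event.get("event")
        if kind == "tool_start":
            open_calls[event.get("call_id")] = event
        elif kind == "tool_end":
            open_calls.pop(event.get("call_id"), None)
        elif kind in ("finish", "error"):
            final = kind
    return {"events": count, "final": final,
            "pending_calls": sorted(open_calls, key=str)}
```

Pass it an open file, for example `summarize_agent(open(path))`. A log with no `final` value and a non-empty `pending_calls` list points at a tool call that never returned.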

See CORTEX.md for complete event schemas and agent configuration.

Common Issues#

Observer not capturing#

# Check observer log for errors
tail -50 journal/health/observer.log | grep -i error

# Check if observer is emitting status (supervisor.status will show stale_heartbeats)
# Health is derived from observe.status Callosum events

Causes: DBus issues, screencast permissions, audio device unavailable.

Agent appears stuck#

# Find active agents
ls -la journal/agents/*/*_active.jsonl

# Check last event in each active agent (-q suppresses tail's per-file headers, which would break jq)
tail -q -n 1 journal/agents/*/*_active.jsonl | jq .

Causes: Backend timeout, tool hanging, network issues.

No Callosum events#

# Verify socket exists
ls -la journal/health/callosum.sock

# Check supervisor is running
pgrep -af sol:supervisor

Causes: Supervisor not started, socket path permissions.

Processing backlog#

# Check sense log for queue status
grep -i "queue" journal/health/sense.log | tail -10

Causes: Slow transcription, describe API rate limits.

Useful Commands#

# Watch all service logs
tail -f journal/health/*.log

# Count today's agents by status
echo "Completed: $([ -f journal/agents/$(date +%Y%m%d).jsonl ] && wc -l < journal/agents/$(date +%Y%m%d).jsonl || echo 0)"
echo "Running: $(ls journal/agents/*/*_active.jsonl 2>/dev/null | wc -l)"

# Find agents that errored today
jq -r 'select(.status=="error") | .agent_id' journal/agents/$(date +%Y%m%d).jsonl 2>/dev/null

# Count today's token usage log entries
wc -l journal/tokens/$(date +%Y%m%d).jsonl

# Find errors in today's logs
grep -i error journal/$(date +%Y%m%d)/health/*.log

# Watch Callosum events in real-time
socat - UNIX-CONNECT:journal/health/callosum.sock
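
For a fuller breakdown than the counts above, the day index can be tallied by status. The sketch assumes each line of `journal/agents/YYYYMMDD.jsonl` is a JSON object with a `status` field, as the jq filter above implies. Status values other than `error` are not documented here, so the helper (a hypothetical `count_by_status`) simply reports whatever it finds.

```python
import json
from collections import Counter


def count_by_status(lines):
    """Count day-index entries by their status field."""
    counts = Counter()
    for line in lines:
        if line.strip():
            # Entries without a status are grouped under "unknown".
            counts[json.loads(line).get("status", "unknown")] += 1
    return dict(counts)
```

Usage: `count_by_status(open("journal/agents/20250101.jsonl"))`, substituting the day you are investigating.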

Recovery Playbooks#

Unfinalized MOV Files (Missing moov Atom)#

Symptoms: sol describe fails with av.error.InvalidDataError: Invalid data found when processing input. Sense logs show describe failed ... exit code 1 and Segment observed with errors ... ['describe exit 1'].

Diagnosis: The .mov file has ftyp + wide + mdat atoms but is missing the moov atom. The mdat size is 0 (extends-to-EOF). This means the screen recorder (solstone-macos native app) never finalized the file — it wrote video frames but crashed or was interrupted before writing the metadata index.

Known trigger: screen sharing active during solstone-macos native app capture causes AVAssetWriter finalization to be skipped (missing endSession() call in VideoWriter.swift).

# Confirm the issue — should report "moov atom not found"
ffprobe -v error journal/YYYYMMDD/STREAM/SEGMENT/center_1_screen.mov

# Inspect atom structure (moov should be present but isn't)
python3 -c "
import struct, os, sys
path = sys.argv[1]
size = os.path.getsize(path)
pos = 0
with open(path, 'rb') as f:
    while pos < size:
        f.seek(pos)
        header = f.read(8)
        if len(header) < 8: break
        atom_size, atom_type = struct.unpack('>I4s', header)
        atom_type = atom_type.decode('ascii', errors='replace')
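        flag = ''
        if atom_size == 0:
            # size 0: the atom extends to end of file
            atom_size, flag = size - pos, '  [extends-to-EOF]'
        elif atom_size == 1:
            # size 1: a 64-bit size follows the type field
            ext = f.read(8)
            if len(ext) < 8: break
            atom_size = struct.unpack('>Q', ext)[0]
        if atom_size < 8: break  # corrupt header; stop instead of walking garbage
        print(f'  {atom_type:6s} {atom_size:>12,} bytes{flag}')
        pos += atom_size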
" /path/to/broken.mov

Recovery: Extract HEVC parameter sets (VPS/SPS/PPS) from a working sibling file's hvcC box, convert the broken file's length-prefixed NALUs to Annex B format, and remux with ffmpeg.

Prerequisites: a good .mov from the same stream/session (same codec settings), Python 3, ffmpeg.
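
The core of the recovery is the framing change from length-prefixed NAL units to Annex B start codes. The function below is a minimal in-memory sketch of that conversion, assuming a 4-byte big-endian length prefix (the same assumption Step 2 makes); `length_prefixed_to_annex_b` is an illustrative name, and the steps below do the same work on the file.

```python
import struct

START_CODE = b"\x00\x00\x00\x01"


def length_prefixed_to_annex_b(payload: bytes) -> bytes:
    """Convert 4-byte length-prefixed NAL units to Annex B start-code framing."""
    out = bytearray()
    pos = 0
    while pos + 4 <= len(payload):
        (nalu_len,) = struct.unpack(">I", payload[pos:pos + 4])
        pos += 4
        # Stop at a zero or truncated length rather than emit garbage.
        if nalu_len == 0 or pos + nalu_len > len(payload):
            break
        out += START_CODE + payload[pos:pos + nalu_len]
        pos += nalu_len
    return bytes(out)
```

A truncated final NAL unit, which is likely in a file that was never finalized, is dropped rather than written out.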

# Step 1: Extract VPS/SPS/PPS from a good reference file
python3 -c "
import struct, os, sys

def find_atom(data, name, offset=0):
    pos = offset
    while pos < len(data) - 8:
        size = struct.unpack('>I', data[pos:pos+4])[0]
        atype = data[pos+4:pos+8]
        if size < 8: break
        if atype == name: return pos, size
        if atype in (b'moov', b'trak', b'mdia', b'minf', b'stbl'):
            result = find_atom(data, name, pos + 8)
            if result: return result
        pos += size
    return None

with open(sys.argv[1], 'rb') as f:
    data = f.read()
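found = find_atom(data, b'stsd')
if not found: sys.exit('stsd atom not found in reference file')
pos, size = found
stsd = data[pos:pos+size]
hvcc_off = stsd.find(b'hvcC')
if hvcc_off < 0: sys.exit('hvcC box not found (reference is not HEVC?)')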
hvcc_size = struct.unpack('>I', stsd[hvcc_off-4:hvcc_off])[0]
cfg = stsd[hvcc_off-4+8:hvcc_off-4+hvcc_size]
offset = 23
with open('/tmp/hevc_params.bin', 'wb') as pf:
    for i in range(cfg[22]):
        num = struct.unpack('>H', cfg[offset+1:offset+3])[0]
        offset += 3
        for j in range(num):
            nalu_len = struct.unpack('>H', cfg[offset:offset+2])[0]
            pf.write(b'\x00\x00\x00\x01')
            pf.write(cfg[offset+2:offset+2+nalu_len])
            offset += 2 + nalu_len
print('Wrote parameter sets to /tmp/hevc_params.bin')
" /path/to/good_reference.mov

# Step 2: Convert broken file to Annex B and remux
python3 -c "
import struct, os, subprocess, sys

src, dst, seg_duration = sys.argv[1], sys.argv[2], int(sys.argv[3])
fsize = os.path.getsize(src)
mdat_offset = 36  # ftyp(20) + wide(8) + mdat_header(8); confirm against the atom listing above

with open('/tmp/hevc_params.bin', 'rb') as pf:
    params = pf.read()

annex_b = '/tmp/recovery_raw.h265'
frame_count = 0
with open(src, 'rb') as fin, open(annex_b, 'wb') as fout:
    fout.write(params)
    fin.seek(mdat_offset)
    bytes_read = 0
    mdat_size = fsize - mdat_offset
    while bytes_read < mdat_size - 4:
        lb = fin.read(4)
        if len(lb) < 4: break
        nalu_len = struct.unpack('>I', lb)[0]
        if nalu_len <= 0 or nalu_len > mdat_size - bytes_read: break
        nalu_data = fin.read(nalu_len)
        if len(nalu_data) < nalu_len: break
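        nal_type = (nalu_data[0] >> 1) & 0x3f  # HEVC nal_unit_type: 6 bits after the forbidden zero bit
        if nal_type < 32: frame_count += 1  # types 0-31 are VCL (slice) NALUs; assumes one slice per frame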
        fout.write(b'\x00\x00\x00\x01')
        fout.write(nalu_data)
        bytes_read += 4 + nalu_len

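if frame_count == 0: sys.exit('No video NAL units found; check mdat_offset')
fps = f'{frame_count}/{seg_duration}'  # average frame rate as a rational for ffmpeg -r
print(f'{frame_count} frames, {fps} fps')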
subprocess.run(['ffmpeg', '-y', '-v', 'warning', '-r', fps,
    '-f', 'hevc', '-i', annex_b, '-c', 'copy',
    '-movflags', '+faststart', '-tag:v', 'hvc1', dst], check=True)
os.unlink(annex_b)
print(f'Recovered: {dst}')
" /path/to/broken.mov /path/to/recovered.mov DURATION_SECS

# Step 3: Verify recovery
ffprobe -v error -show_streams /path/to/recovered.mov
# Should show codec_name=hevc, correct width/height/duration

# Step 4: Back up the original outside the journal, replace it, and re-run describe
cp /path/to/broken.mov /tmp/broken.mov.bak
cp /path/to/recovered.mov /path/to/broken.mov
sol describe /path/to/broken.mov -v

Notes:

The segment duration (DURATION_SECS) comes from the segment folder name (HHMMSS_LEN — LEN is duration in seconds)
The reference file must be from the same stream/session so codec parameters match
PyAV (used by sol describe) bundles its own HEVC decoder, so this works even if system ffmpeg lacks one
After recovery, run sol indexer if you need the new screen extracts searchable
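
Reading DURATION_SECS off the segment folder name can be automated. This is a small sketch, assuming the `journal/YYYYMMDD/STREAM/SEGMENT/` layout shown earlier and the HHMMSS_LEN naming described in the notes; `segment_duration` is a hypothetical helper.

```python
import re
from pathlib import PurePath

SEGMENT_RE = re.compile(r"^(\d{6})_(\d+)$")  # HHMMSS_LEN


def segment_duration(path: str) -> int:
    """Return LEN (seconds) from the nearest HHMMSS_LEN component of path."""
    # Walk from the file name upward so the closest segment folder wins.
    for part in reversed(PurePath(path).parts):
        match = SEGMENT_RE.match(part)
        if match:
            return int(match.group(2))
    raise ValueError(f"no HHMMSS_LEN segment folder in {path!r}")
```

The returned value can be passed as the third argument to the Step 2 script.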