Script for easily configuring, using, switching between, and comparing local offline coding models

Add ask command, update README, default to gpt-oss

+143 -167
+54 -165
README.md
··· 1 - # Local AI Coding Environment 2 - 3 - A fully offline, privacy-first AI coding setup for macOS Apple Silicon. Uses **llama.cpp** to run **Qwen 2.5 Coder** models locally, with **Aider** and **OpenCode** as terminal-based coding agents — no API keys, no cloud, no costs. 4 - 5 - ## Hardware Requirements 6 - 7 - - **Mac with Apple Silicon** (M1/M2/M3/M4) 8 - - **32GB RAM** recommended (the 32B model uses ~20GB) 9 - - ~25GB free disk space for models 10 - 11 - ## What Gets Installed 12 - 13 - | Component | Purpose | 14 - |---|---| 15 - | **llama.cpp** | Local model inference with Metal GPU acceleration | 16 - | **Qwen 2.5 Coder 32B** (Q4_K_M) | Main chat/coding model (~20GB) | 17 - | **Qwen 2.5 Coder 1.5B** (Q4_K_M) | Fast autocomplete model (~1.2GB) | 18 - | **Aider** | Terminal coding agent (Claude Code alternative) | 19 - | **jq** | JSON processing for the pipe command | 1 + # localcode 20 2 21 - Both models are served via llama.cpp's built-in OpenAI-compatible API, making them work with any tool that supports the OpenAI API format. 3 + Fully offline AI coding environment for macOS Apple Silicon. Uses **Ollama** to serve local models with **10 terminal coding agents** — no API keys, no cloud, no costs. 22 4 23 - ## Installation 5 + ## Quick Start 24 6 25 7 ```bash 26 - chmod +x setup.sh 27 - ./setup.sh 8 + npm run build 9 + npm run dev -- setup # installs Ollama, pulls models, sets up configs 10 + localcode # launch with defaults 28 11 ``` 29 12 30 - The script is idempotent — safe to run multiple times. The first run downloads ~21GB of model weights from HuggingFace. 31 - 32 - After installation, restart your shell or run: 13 + ## Usage 33 14 34 15 ```bash 35 - source ~/.zshrc 36 - ``` 37 - 38 - ## Commands 39 - 40 - ### `llama-start` 41 - 42 - Starts both llama.cpp servers in the foreground. Press `Ctrl+C` to stop both. 
43 - 44 - - Chat model (32B): `http://127.0.0.1:8080` 45 - - Autocomplete model (1.5B): `http://127.0.0.1:8081` 46 - 47 - ### `llama-stop` 48 - 49 - Kills all running llama-server processes. 50 - 51 - ### `ai-code [directory]` 52 - 53 - The main coding agent. Auto-starts the chat server if it's not running. Initializes a git repo if one doesn't exist, then launches Aider with full file-editing capabilities. 16 + localcode # launch with last-used TUI + model 17 + localcode goose # launch with Goose 18 + localcode claude # launch with Claude Code 19 + localcode gpt-oss # launch with GPT-OSS model 20 + localcode goose gpt-oss # launch with both overrides 54 21 55 - ```bash 56 - cd ~/projects/my-app 57 - ai-code . 22 + localcode models # list available models 23 + localcode tuis # list available TUIs 24 + localcode status # show config + Ollama health 58 25 59 - # or from anywhere 60 - ai-code ~/projects/my-app 26 + localcode start # start Ollama + pull models 27 + localcode stop # stop Ollama 28 + localcode bench # benchmark the active chat model 29 + localcode pipe "add type hints" # pipe stdin through the model 61 30 ``` 62 31 63 - Inside Aider you can ask it to edit files, run commands, and refactor code across your project. Changes are auto-committed to git so you can always roll back. 32 + TUI and model names are auto-detected — just type the id directly. Last-used choices are saved for next time. 64 33 65 - ### `ai-ask "question"` 34 + ## TUIs 66 35 67 - Quick Q&A mode — no file editing, just chat. Useful for coding questions without modifying your project. 36 + TUIs with `ollama launch` support are installed and configured automatically by Ollama. 
68 37 69 - ```bash 70 - ai-ask "how do I handle errors in rust" 71 - ``` 38 + | TUI | Launch | Method | 39 + |-----|--------|--------| 40 + | Claude Code | `localcode claude` | ollama launch | 41 + | Codex CLI | `localcode codex` | ollama launch | 42 + | OpenCode | `localcode opencode` | ollama launch | 43 + | Pi | `localcode pi` | ollama launch | 44 + | Cline | `localcode cline` | ollama launch | 45 + | Droid | `localcode droid` | ollama launch | 46 + | OpenClaw | `localcode openclaw` | ollama launch | 47 + | Aider | `localcode aider` | direct (env vars) | 48 + | Goose | `localcode goose` | direct (env vars) | 49 + | gptme | `localcode gptme` | direct (env vars) | 72 50 73 - ### `ai-pipe "prompt"` 51 + ## Models 74 52 75 - Pipe code through the model via stdin. Useful for one-shot transforms in scripts. 53 + | Model | ID | Size | Tool Calling | Notes | 54 + |-------|----|------|-------------|-------| 55 + | Qwen3 Coder 30B-A3B | `qwen3-coder` | 19 GB | Yes | Best coding benchmarks (SWE-bench 69.6) | 56 + | GLM-4.7 Flash 30B-A3B | `glm-flash` | 19 GB | Yes | Strong coding, 198K context | 57 + | GPT-OSS 20B | `gpt-oss` | 14 GB | Yes | Lightest with tool support, good for 32GB machines | 58 + | Qwen 2.5 Coder 32B | `qwen-32b-chat` | 20 GB | No | Dense model, no structured tool calls | 59 + | Qwen 2.5 Coder 14B | `qwen-14b-chat` | 9 GB | No | Smallest chat model | 60 + | Qwen 2.5 Coder 7B | `qwen-7b-chat` | 5 GB | No | Lightest chat model | 61 + | Qwen 2.5 Coder 1.5B | `qwen-1.5b-autocomplete` | 1 GB | No | Autocomplete only | 76 62 77 - ```bash 78 - cat main.py | ai-pipe "add type hints" 79 - git diff | ai-pipe "write a commit message" 80 - cat api.go | ai-pipe "find bugs" 81 - ``` 63 + Models without tool calling will output raw tool-call text in agents that rely on structured tool use (Goose, Pi, etc.). Use gpt-oss, qwen3-coder, or glm-flash for those agents. 
82 64 83 - ## Using OpenCode Instead of Aider 65 + ## Hardware Requirements 84 66 85 - [OpenCode](https://opencode.ai) is another terminal coding agent with a polished TUI. It connects to the same llama.cpp backend. 86 - 87 - ### Install OpenCode 88 - 89 - ```bash 90 - brew install anomalyco/tap/opencode 91 - ``` 67 + - **Mac with Apple Silicon** (M1/M2/M3/M4) 68 + - **32 GB RAM** recommended 69 + - Disk space depends on models pulled 92 70 93 - ### Configure 71 + ## Configuration 94 72 95 - Create `~/.config/opencode/opencode.json`: 96 - 97 - ```json 98 - { 99 - "$schema": "https://opencode.ai/config.json", 100 - "model": "llama-cpp/qwen2.5-coder-32b", 101 - "provider": { 102 - "llama-cpp": { 103 - "npm": "@ai-sdk/openai-compatible", 104 - "name": "llama.cpp (local)", 105 - "options": { 106 - "baseURL": "http://127.0.0.1:8080/v1", 107 - "apiKey": "not-needed" 108 - }, 109 - "models": { 110 - "qwen2.5-coder-32b": { 111 - "name": "Qwen 2.5 Coder 32B", 112 - "tools": true 113 - } 114 - } 115 - } 116 - } 117 - } 118 - ``` 119 - 120 - ### Run 121 - 122 - ```bash 123 - llama-start # start the server (or ai-code auto-starts it) 124 - opencode # launch OpenCode in your project directory 125 - ``` 126 - 127 - Use `/models` inside OpenCode to select the Qwen model, and `Tab` to switch between Plan and Build modes. 
128 - 129 - ## Configuration Files 130 - 131 - | File | Purpose | 132 - |---|---| 133 - | `~/.aider/aider.conf.yml` | Aider settings (model, git, UI) | 134 - | `~/.aider/.env` | API base URL and key for Aider | 135 - | `~/.config/opencode/opencode.json` | OpenCode provider config | 136 - | `~/.local/share/llama-models/` | Downloaded GGUF model files | 137 - | `~/.local/bin/` | Launcher scripts | 138 - 139 - ## llama.cpp Server Flags 140 - 141 - The chat server launches with these defaults: 142 - 143 - | Flag | Value | Purpose | 144 - |---|---|---| 145 - | `--ctx-size` | 16384 | Context window (increase to 32768 if tools misbehave) | 146 - | `--n-gpu-layers` | 99 | Offload all layers to Metal GPU | 147 - | `--flash-attn` | — | Enable flash attention for speed | 148 - | `--mlock` | — | Lock model in RAM to prevent swapping | 149 - | `--threads` | auto | Uses performance core count | 150 - 151 - ## Troubleshooting 152 - 153 - **Model loading is slow on first run**: The first inference after starting the server takes 10–30 seconds while the model loads into memory. Subsequent requests are fast. 154 - 155 - **Running out of RAM / swapping**: The 32B Q4 model needs ~20GB. Close memory-heavy apps. You can also try the smaller `qwen2.5-coder-14b-instruct-q4_k_m.gguf` instead. 156 - 157 - **OpenCode tools not working**: Increase `--ctx-size` to 32768 in the `llama-chat-server` script. Tool-calling needs more context to work reliably. 158 - 159 - **Slow generation speed**: Expect ~15–25 tokens/sec on the 32B model with M4. This is normal for a model this size running locally. The 1.5B autocomplete model runs much faster. 160 - 161 - **Server won't start**: Check if another process is using port 8080 or 8081 with `lsof -i :8080`. Use `llama-stop` to kill stale processes. 
162 - 163 - ## Performance Expectations 164 - 165 - | Model | Speed | Use Case | 166 - |---|---|---| 167 - | Qwen 2.5 Coder 32B | ~15–25 tok/s | Chat, code generation, refactoring | 168 - | Qwen 2.5 Coder 1.5B | ~100+ tok/s | Autocomplete, quick suggestions | 169 - 170 - Both models run entirely on-device using Metal acceleration. No network connection required after initial setup. 73 + All settings are stored in `~/.config/localcode/config.json` — just the active model, autocomplete model, and TUI ids. Ollama manages model storage. 171 74 172 75 ## Uninstall 173 76 174 77 ```bash 175 - # Remove models (~21GB) 176 - rm -rf ~/.local/share/llama-models 177 - 178 - # Remove launcher scripts 179 - rm ~/.local/bin/{ai-code,ai-ask,ai-pipe,llama-start,llama-stop,llama-chat-server,llama-complete-server} 180 - 181 - # Remove configs 182 - rm -rf ~/.aider 183 - rm -f ~/.config/opencode/opencode.json 184 - 185 - # Remove Ollama auto-start (if set) 186 - launchctl unload ~/Library/LaunchAgents/com.ollama.serve.plist 187 - rm ~/Library/LaunchAgents/com.ollama.serve.plist 188 - 189 - # Uninstall packages 190 - pipx uninstall aider-chat 191 - brew uninstall llama.cpp jq 78 + brew uninstall ollama # remove the Ollama app and CLI 79 + rm -rf ~/.ollama # remove pulled models (brew does not delete these) 80 + rm ~/.local/bin/localcode # remove CLI wrapper 81 + rm -rf ~/.config/localcode # remove config 192 82 ```
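For reference, the README's Configuration section says the file holds just three ids. A config matching the new defaults would presumably look like the fragment below; the field names are taken from the `RuntimeConfig` defaults in `src/runtime-config.ts` later in this diff, and the exact on-disk schema is an assumption since the loader isn't shown.

```json
{
  "chatModel": "gpt-oss",
  "autocompleteModel": "qwen-1.5b-autocomplete",
  "tui": "aider"
}
```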
+71
src/commands/ask.ts
··· 1 + import { OLLAMA_URL } from "../config.js"; 2 + import { getActiveChatModel } from "../runtime-config.js"; 3 + import { err } from "../log.js"; 4 + import { ensureOllama } from "./server.js"; 5 + 6 + export async function runAsk(question: string): Promise<void> { 7 + await ensureOllama(); 8 + const model = getActiveChatModel(); 9 + 10 + const body = JSON.stringify({ 11 + model: model.ollamaTag, 12 + messages: [ 13 + { 14 + role: "system", 15 + content: 16 + "You are an expert programmer. Give concise, practical answers. Include code examples when helpful.", 17 + }, 18 + { role: "user", content: question }, 19 + ], 20 + stream: true, 21 + }); 22 + 23 + let res: Response; 24 + try { 25 + res = await fetch(`${OLLAMA_URL}/v1/chat/completions`, { 26 + method: "POST", 27 + headers: { "Content-Type": "application/json" }, 28 + body, 29 + }); 30 + } catch { 31 + err("Ollama not running. Start it with: localcode start"); 32 + return; 33 + } 34 + 35 + if (!res.ok) { 36 + err(`Server returned ${res.status}`); 37 + return; 38 + } 39 + 40 + // Stream the response 41 + const reader = res.body?.getReader(); 42 + if (!reader) { 43 + err("No response body"); 44 + return; 45 + } 46 + 47 + const decoder = new TextDecoder(); 48 + let buffer = ""; 49 + 50 + while (true) { 51 + const { done, value } = await reader.read(); 52 + if (done) break; 53 + 54 + buffer += decoder.decode(value, { stream: true }); 55 + const lines = buffer.split("\n"); 56 + buffer = lines.pop()!; 57 + 58 + for (const line of lines) { 59 + if (!line.startsWith("data: ")) continue; 60 + const data = line.slice(6); 61 + if (data === "[DONE]") continue; 62 + try { 63 + const json = JSON.parse(data) as { 64 + choices?: { delta?: { content?: string } }[]; 65 + }; 66 + const content = json.choices?.[0]?.delta?.content; 67 + if (content) process.stdout.write(content); 68 + } catch { 69 + // skip malformed chunks 70 + } 71 + } 72 + } 73 + process.stdout.write("\n"); 74 + }
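The read loop in `runAsk` interleaves SSE framing with output. As a sanity check, the same line-splitting logic can be expressed as a standalone pure function; `extractDeltas` is an invented name for illustration, and the real module keeps this logic inline.

```typescript
// Splits a buffered SSE chunk into completed "data: " payloads, returning the
// extracted delta strings plus any trailing partial line to carry forward.
function extractDeltas(buffer: string): { deltas: string[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // partial line waits for the next chunk
  const deltas: string[] = [];
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue; // SSE data frames only
    const data = line.slice(6);
    if (data === "[DONE]") continue; // end-of-stream sentinel
    try {
      const json = JSON.parse(data) as {
        choices?: { delta?: { content?: string } }[];
      };
      const content = json.choices?.[0]?.delta?.content;
      if (content) deltas.push(content);
    } catch {
      // skip malformed chunks, as runAsk does
    }
  }
  return { deltas, rest };
}
```

Keeping the trailing partial line in `rest` matters because fetch chunks do not align with SSE event boundaries.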
+2
src/commands/pipe.ts
··· 1 1 import { OLLAMA_URL } from "../config.js"; 2 2 import { getActiveChatModel } from "../runtime-config.js"; 3 3 import { err } from "../log.js"; 4 + import { ensureOllama } from "./server.js"; 4 5 5 6 export async function runPipe(prompt: string): Promise<void> { 7 + await ensureOllama(); 6 8 const model = getActiveChatModel(); 7 9 8 10 // Read stdin
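The body of `runPipe` is elided here apart from the `// Read stdin` comment. A minimal sketch of that step, written against any async iterable so it runs without a TTY; `collectInput` is an assumed helper name, not from the actual module.

```typescript
// Hypothetical helper mirroring pipe.ts's "// Read stdin" step. Accepting any
// AsyncIterable (process.stdin is one) keeps it exercisable in isolation.
async function collectInput(source: AsyncIterable<Buffer | string>): Promise<string> {
  let out = "";
  for await (const chunk of source) out += chunk.toString();
  return out;
}

// In runPipe this would presumably be: const input = await collectInput(process.stdin);
```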
+15 -1
src/main.ts
··· 5 5 import { showStatus } from "./commands/status.js"; 6 6 import { startServers, stopServers } from "./commands/server.js"; 7 7 import { runPipe } from "./commands/pipe.js"; 8 + import { runAsk } from "./commands/ask.js"; 8 9 import { getTuiById } from "./registry/tuis.js"; 9 10 import { getModelById } from "./registry/models.js"; 10 11 ··· 35 36 localcode bench Benchmark the running chat model 36 37 localcode bench history Show past benchmark results 37 38 39 + ${BOLD}Quick:${RESET} 40 + localcode ask "question" Quick coding Q&A (streamed) 41 + localcode pipe "prompt" Pipe stdin through the model 42 + 38 43 ${BOLD}Other:${RESET} 39 - localcode pipe "prompt" Pipe stdin through the model 40 44 localcode setup Full install (Ollama, models, tools) 41 45 `); 42 46 } ··· 92 96 } else { 93 97 await runBench(process.argv.slice(3)); 94 98 } 99 + break; 100 + } 101 + 102 + case "ask": { 103 + const question = process.argv.slice(3).join(" "); 104 + if (!question) { 105 + console.error("Usage: localcode ask \"question\""); 106 + process.exit(1); 107 + } 108 + await runAsk(question); 95 109 break; 96 110 } 97 111
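The new `ask` case joins everything after the subcommand, so multi-word questions also work without quotes. A small sketch of that behavior, with `argv` standing in for `process.argv`:

```typescript
// Stand-in for process.argv when invoked as: localcode ask how do I handle errors in rust
const argv = ["node", "localcode", "ask", "how", "do", "I", "handle", "errors", "in", "rust"];

// Same join as the "ask" case in src/main.ts
const question = argv.slice(3).join(" ");
console.log(question); // → "how do I handle errors in rust"
```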
+1 -1
src/runtime-config.ts
··· 15 15 const CONFIG_PATH = join(homedir(), ".config", "localcode", "config.json"); 16 16 17 17 const DEFAULTS: RuntimeConfig = { 18 - chatModel: "qwen3-coder", 18 + chatModel: "gpt-oss", 19 19 autocompleteModel: "qwen-1.5b-autocomplete", 20 20 tui: "aider", 21 21 };
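The default switch to `gpt-oss` affects fresh installs; whether existing configs pick it up depends on how the loader merges the saved file with `DEFAULTS`, which this diff doesn't show. A plausible sketch assuming a shallow spread merge (`withDefaults` is hypothetical):

```typescript
interface RuntimeConfig {
  chatModel: string;
  autocompleteModel: string;
  tui: string;
}

// DEFAULTS as of this change
const DEFAULTS: RuntimeConfig = {
  chatModel: "gpt-oss",
  autocompleteModel: "qwen-1.5b-autocomplete",
  tui: "aider",
};

// Hypothetical merge: keys present in the saved file win; missing keys fall back.
function withDefaults(saved: Partial<RuntimeConfig>): RuntimeConfig {
  return { ...DEFAULTS, ...saved };
}

const merged = withDefaults({ tui: "goose" });
console.log(merged.chatModel); // "gpt-oss" only because the saved file omitted it
```

Under this assumption, users whose saved config already names a chat model keep it; only unset keys pick up the new default.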