-# Local AI Coding Environment
-
-A fully offline, privacy-first AI coding setup for macOS Apple Silicon. Uses **llama.cpp** to run **Qwen 2.5 Coder** models locally, with **Aider** and **OpenCode** as terminal-based coding agents — no API keys, no cloud, no costs.
-
-## Hardware Requirements
-
-- **Mac with Apple Silicon** (M1/M2/M3/M4)
-- **32GB RAM** recommended (the 32B model uses ~20GB)
-- ~25GB free disk space for models
-
-## What Gets Installed
-
-| Component | Purpose |
-|---|---|
-| **llama.cpp** | Local model inference with Metal GPU acceleration |
-| **Qwen 2.5 Coder 32B** (Q4_K_M) | Main chat/coding model (~20GB) |
-| **Qwen 2.5 Coder 1.5B** (Q4_K_M) | Fast autocomplete model (~1.2GB) |
-| **Aider** | Terminal coding agent (Claude Code alternative) |
-| **jq** | JSON processing for the pipe command |
+# localcode
 
-Both models are served via llama.cpp's built-in OpenAI-compatible API, making them work with any tool that supports the OpenAI API format.
+Fully offline AI coding environment for macOS Apple Silicon. Uses **Ollama** to serve local models with **10 terminal coding agents** — no API keys, no cloud, no costs.
 
-## Installation
+## Quick Start
 
 ```
-chmod +x setup.sh
-./setup.sh
+npm run build
+npm run dev -- setup     # installs Ollama, pulls models, sets up configs
+localcode                # launch with defaults
 ```
 
-The script is idempotent — safe to run multiple times. The first run downloads ~21GB of model weights from HuggingFace.
-
-After installation, restart your shell or run:
+## Usage
 
 ```
-source ~/.zshrc
-```
-
-## Commands
-
-### `llama-start`
-
-Starts both llama.cpp servers in the foreground. Press `Ctrl+C` to stop both.
-
-- Chat model (32B): `http://127.0.0.1:8080`
-- Autocomplete model (1.5B): `http://127.0.0.1:8081`
-
-### `llama-stop`
-
-Kills all running llama-server processes.
-
-### `ai-code [directory]`
-
-The main coding agent. Auto-starts the chat server if it's not running. Initializes a git repo if one doesn't exist, then launches Aider with full file-editing capabilities.
+localcode                # launch with last-used TUI + model
+localcode goose          # launch with Goose
+localcode claude         # launch with Claude Code
+localcode gpt-oss        # launch with GPT-OSS model
+localcode goose gpt-oss  # launch with both overrides
 
-```bash
-cd ~/projects/my-app
-ai-code .
+localcode models         # list available models
+localcode tuis           # list available TUIs
+localcode status         # show config + Ollama health
 
-# or from anywhere
-ai-code ~/projects/my-app
+localcode start          # start Ollama + pull models
+localcode stop           # stop Ollama
+localcode bench          # benchmark the active chat model
+localcode pipe "add type hints"  # pipe stdin through the model
 ```
 
-Inside Aider you can ask it to edit files, run commands, and refactor code across your project. Changes are auto-committed to git so you can always roll back.
+TUI and model names are auto-detected — just type the id directly. Last-used choices are saved for next time.
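
The auto-detection described above can be sketched roughly as follows. The id lists and return shape here are illustrative assumptions — the real registries live in `src/registry/tuis.ts` and `src/registry/models.ts`:

```typescript
// Hypothetical id lists mirroring the TUI and Models tables in this README.
const TUI_IDS = [
  "claude", "codex", "opencode", "pi", "cline",
  "droid", "openclaw", "aider", "goose", "gptme",
];
const MODEL_IDS = [
  "qwen3-coder", "glm-flash", "gpt-oss",
  "qwen-32b-chat", "qwen-14b-chat", "qwen-7b-chat",
];

type Launch = { tui?: string; model?: string };

// Classify bare args: a known TUI id selects the TUI, a known model id
// selects the model; anything else falls through to subcommand handling.
function parseLaunchArgs(args: string[]): Launch {
  const launch: Launch = {};
  for (const arg of args) {
    if (TUI_IDS.includes(arg)) launch.tui = arg;
    else if (MODEL_IDS.includes(arg)) launch.model = arg;
  }
  return launch;
}
```

With this shape, `localcode goose gpt-oss` would resolve to `{ tui: "goose", model: "gpt-oss" }` regardless of argument order.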
 
-### `ai-ask "question"`
+## TUIs
 
-Quick Q&A mode — no file editing, just chat. Useful for coding questions without modifying your project.
+TUIs with `ollama launch` support are installed and configured automatically by Ollama.
 
-```bash
-ai-ask "how do I handle errors in rust"
-```
+| TUI | Launch | Method |
+|-----|--------|--------|
+| Claude Code | `localcode claude` | ollama launch |
+| Codex CLI | `localcode codex` | ollama launch |
+| OpenCode | `localcode opencode` | ollama launch |
+| Pi | `localcode pi` | ollama launch |
+| Cline | `localcode cline` | ollama launch |
+| Droid | `localcode droid` | ollama launch |
+| OpenClaw | `localcode openclaw` | ollama launch |
+| Aider | `localcode aider` | direct (env vars) |
+| Goose | `localcode goose` | direct (env vars) |
+| gptme | `localcode gptme` | direct (env vars) |
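
The "direct (env vars)" method amounts to spawning the TUI with its OpenAI-compatible endpoint pointed at the local Ollama server. A minimal sketch — the variable names (`OPENAI_API_BASE`, `OPENAI_API_KEY`) and the `--model` flag are assumptions, not verified per-TUI:

```typescript
import { spawn } from "node:child_process";

// Env the spawned TUI inherits; 11434 is Ollama's default port, and /v1 is
// its OpenAI-compatible API. The variable names here are illustrative.
function buildDirectEnv(): Record<string, string | undefined> {
  return {
    ...process.env,
    OPENAI_API_BASE: "http://127.0.0.1:11434/v1", // local Ollama endpoint
    OPENAI_API_KEY: "not-needed",                 // local server ignores the key
  };
}

// Launch the TUI binary in the foreground with the injected env.
function launchDirect(command: string, model: string): void {
  spawn(command, ["--model", model], { stdio: "inherit", env: buildDirectEnv() });
}
```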
 
-### `ai-pipe "prompt"`
+## Models
 
-Pipe code through the model via stdin. Useful for one-shot transforms in scripts.
+| Model | ID | Size | Tool Calling | Notes |
+|-------|----|------|--------------|-------|
+| Qwen3 Coder 30B-A3B | `qwen3-coder` | 19 GB | Yes | Best coding benchmarks (SWE-bench 69.6) |
+| GLM-4.7 Flash 30B-A3B | `glm-flash` | 19 GB | Yes | Strong coding, 198K context |
+| GPT-OSS 20B | `gpt-oss` | 14 GB | Yes | Lightest with tool support, good for 32GB machines |
+| Qwen 2.5 Coder 32B | `qwen-32b-chat` | 20 GB | No | Dense model, no structured tool calls |
+| Qwen 2.5 Coder 14B | `qwen-14b-chat` | 9 GB | No | Mid-size chat model |
+| Qwen 2.5 Coder 7B | `qwen-7b-chat` | 5 GB | No | Smallest chat model |
+| Qwen 2.5 Coder 1.5B | `qwen-1.5b-autocomplete` | 1 GB | No | Autocomplete only |
 
-```bash
-cat main.py | ai-pipe "add type hints"
-git diff | ai-pipe "write a commit message"
-cat api.go | ai-pipe "find bugs"
-```
+Models without tool calling will output raw tool-call text in agents that rely on structured tool use (Goose, Pi, etc.). Use `gpt-oss`, `qwen3-coder`, or `glm-flash` for those agents.
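
A client-side guard for this could filter the registry by tool support. A sketch — the record shape is assumed, mirroring the table above rather than the real `src/registry/models.ts`:

```typescript
// Hypothetical model records mirroring the Models table above.
interface ModelInfo {
  id: string;
  toolCalling: boolean;
}

const MODELS: ModelInfo[] = [
  { id: "qwen3-coder", toolCalling: true },
  { id: "glm-flash", toolCalling: true },
  { id: "gpt-oss", toolCalling: true },
  { id: "qwen-32b-chat", toolCalling: false },
  { id: "qwen-14b-chat", toolCalling: false },
  { id: "qwen-7b-chat", toolCalling: false },
];

// Agents that require structured tool use should only be offered
// tool-capable models.
function modelsForToolAgents(models: ModelInfo[]): string[] {
  return models.filter((m) => m.toolCalling).map((m) => m.id);
}
```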
 
-## Using OpenCode Instead of Aider
+## Hardware Requirements
 
-[OpenCode](https://opencode.ai) is another terminal coding agent with a polished TUI. It connects to the same llama.cpp backend.
-
-### Install OpenCode
-
-```bash
-brew install anomalyco/tap/opencode
-```
+- **Mac with Apple Silicon** (M1/M2/M3/M4)
+- **32 GB RAM** recommended
+- Disk space depends on models pulled
 
-### Configure
+## Configuration
 
-Create `~/.config/opencode/opencode.json`:
-
-```json
-{
-  "$schema": "https://opencode.ai/config.json",
-  "model": "llama-cpp/qwen2.5-coder-32b",
-  "provider": {
-    "llama-cpp": {
-      "npm": "@ai-sdk/openai-compatible",
-      "name": "llama.cpp (local)",
-      "options": {
-        "baseURL": "http://127.0.0.1:8080/v1",
-        "apiKey": "not-needed"
-      },
-      "models": {
-        "qwen2.5-coder-32b": {
-          "name": "Qwen 2.5 Coder 32B",
-          "tools": true
-        }
-      }
-    }
-  }
-}
-```
-
-### Run
-
-```bash
-llama-start   # start the server (or ai-code auto-starts it)
-opencode      # launch OpenCode in your project directory
-```
-
-Use `/models` inside OpenCode to select the Qwen model, and `Tab` to switch between Plan and Build modes.
-
-## Configuration Files
-
-| File | Purpose |
-|---|---|
-| `~/.aider/aider.conf.yml` | Aider settings (model, git, UI) |
-| `~/.aider/.env` | API base URL and key for Aider |
-| `~/.config/opencode/opencode.json` | OpenCode provider config |
-| `~/.local/share/llama-models/` | Downloaded GGUF model files |
-| `~/.local/bin/` | Launcher scripts |
-
-## llama.cpp Server Flags
-
-The chat server launches with these defaults:
-
-| Flag | Value | Purpose |
-|---|---|---|
-| `--ctx-size` | 16384 | Context window (increase to 32768 if tools misbehave) |
-| `--n-gpu-layers` | 99 | Offload all layers to Metal GPU |
-| `--flash-attn` | — | Enable flash attention for speed |
-| `--mlock` | — | Lock model in RAM to prevent swapping |
-| `--threads` | auto | Uses performance core count |
-
-## Troubleshooting
-
-**Model loading is slow on first run**: The first inference after starting the server takes 10–30 seconds while the model loads into memory. Subsequent requests are fast.
-
-**Running out of RAM / swapping**: The 32B Q4 model needs ~20GB. Close memory-heavy apps. You can also try the smaller `qwen2.5-coder-14b-instruct-q4_k_m.gguf` instead.
-
-**OpenCode tools not working**: Increase `--ctx-size` to 32768 in the `llama-chat-server` script. Tool-calling needs more context to work reliably.
-
-**Slow generation speed**: Expect ~15–25 tokens/sec on the 32B model with M4. This is normal for a model this size running locally. The 1.5B autocomplete model runs much faster.
-
-**Server won't start**: Check if another process is using port 8080 or 8081 with `lsof -i :8080`. Use `llama-stop` to kill stale processes.
-
-## Performance Expectations
-
-| Model | Speed | Use Case |
-|---|---|---|
-| Qwen 2.5 Coder 32B | ~15–25 tok/s | Chat, code generation, refactoring |
-| Qwen 2.5 Coder 1.5B | ~100+ tok/s | Autocomplete, quick suggestions |
-
-Both models run entirely on-device using Metal acceleration. No network connection required after initial setup.
+All stored in `~/.config/localcode/config.json` — just the active model, autocomplete model, and TUI ids. Ollama manages model storage.
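
The config shape described above can be written down as a type. The field names here are assumed from the description, not taken from the real source (see `src/runtime-config.ts`):

```typescript
// Hypothetical shape of ~/.config/localcode/config.json as described above:
// active chat model, autocomplete model, and last-used TUI — nothing else.
interface LocalcodeConfig {
  chatModel: string;         // e.g. "qwen3-coder"
  autocompleteModel: string; // e.g. "qwen-1.5b-autocomplete"
  tui: string;               // e.g. "goose"
}

const example: LocalcodeConfig = {
  chatModel: "qwen3-coder",
  autocompleteModel: "qwen-1.5b-autocomplete",
  tui: "goose",
};
```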
 
 ## Uninstall
 
 ```
-# Remove models (~21GB)
-rm -rf ~/.local/share/llama-models
-
-# Remove launcher scripts
-rm ~/.local/bin/{ai-code,ai-ask,ai-pipe,llama-start,llama-stop,llama-chat-server,llama-complete-server}
-
-# Remove configs
-rm -rf ~/.aider
-rm -f ~/.config/opencode/opencode.json
-
-# Remove Ollama auto-start (if set)
-launchctl unload ~/Library/LaunchAgents/com.ollama.serve.plist
-rm ~/Library/LaunchAgents/com.ollama.serve.plist
-
-# Uninstall packages
-pipx uninstall aider-chat
-brew uninstall llama.cpp jq
+brew uninstall ollama        # remove Ollama + all pulled models
+rm ~/.local/bin/localcode    # remove CLI wrapper
+rm -rf ~/.config/localcode   # remove config
 ```
src/commands/ask.ts (+71)
+import { OLLAMA_URL } from "../config.js";
+import { getActiveChatModel } from "../runtime-config.js";
+import { err } from "../log.js";
+import { ensureOllama } from "./server.js";
+
+export async function runAsk(question: string): Promise<void> {
+  await ensureOllama();
+  const model = getActiveChatModel();
+
+  const body = JSON.stringify({
+    model: model.ollamaTag,
+    messages: [
+      {
+        role: "system",
+        content:
+          "You are an expert programmer. Give concise, practical answers. Include code examples when helpful.",
+      },
+      { role: "user", content: question },
+    ],
+    stream: true,
+  });
+
+  let res: Response;
+  try {
+    res = await fetch(`${OLLAMA_URL}/v1/chat/completions`, {
+      method: "POST",
+      headers: { "Content-Type": "application/json" },
+      body,
+    });
+  } catch {
+    err("Ollama not running. Start it with: localcode start");
+    return; // bail out in case err() does not exit the process
+  }
+
+  if (!res!.ok) {
+    err(`Server returned ${res!.status}`);
+    return;
+  }
+
+  // Stream the response
+  const reader = res!.body?.getReader();
+  if (!reader) {
+    err("No response body");
+    return;
+  }
+
+  const decoder = new TextDecoder();
+  let buffer = "";
+
+  while (true) {
+    const { done, value } = await reader.read();
+    if (done) break;
+
+    buffer += decoder.decode(value, { stream: true });
+    const lines = buffer.split("\n");
+    buffer = lines.pop()!;
+
+    for (const line of lines) {
+      if (!line.startsWith("data: ")) continue;
+      const data = line.slice(6);
+      if (data === "[DONE]") continue;
+      try {
+        const json = JSON.parse(data) as {
+          choices?: { delta?: { content?: string } }[];
+        };
+        const content = json.choices?.[0]?.delta?.content;
+        if (content) process.stdout.write(content);
+      } catch {
+        // skip malformed chunks
+      }
+    }
+  }
+  process.stdout.write("\n");
+}
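
The line-buffered SSE parsing in `runAsk` can be exercised standalone. This sketch extracts delta content from the `data:` line format that OpenAI-compatible streaming endpoints emit, mirroring the loop above:

```typescript
// Standalone version of runAsk's SSE parsing: feed raw chunk text,
// get back the concatenated delta content.
function extractDeltas(chunk: string): string {
  let out = "";
  for (const line of chunk.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6);
    if (data === "[DONE]") continue;
    try {
      const json = JSON.parse(data) as {
        choices?: { delta?: { content?: string } }[];
      };
      out += json.choices?.[0]?.delta?.content ?? "";
    } catch {
      // skip malformed chunks, as runAsk does
    }
  }
  return out;
}

// Example: two data lines followed by the stream terminator.
const sample =
  'data: {"choices":[{"delta":{"content":"Hello"}}]}\n' +
  'data: {"choices":[{"delta":{"content":" world"}}]}\n' +
  "data: [DONE]\n";
// extractDeltas(sample) === "Hello world"
```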
src/commands/pipe.ts (+2)
 import { OLLAMA_URL } from "../config.js";
 import { getActiveChatModel } from "../runtime-config.js";
 import { err } from "../log.js";
+import { ensureOllama } from "./server.js";
 
 export async function runPipe(prompt: string): Promise<void> {
+  await ensureOllama();
   const model = getActiveChatModel();
 
   // Read stdin
src/main.ts (+15 -1)
 import { showStatus } from "./commands/status.js";
 import { startServers, stopServers } from "./commands/server.js";
 import { runPipe } from "./commands/pipe.js";
+import { runAsk } from "./commands/ask.js";
 import { getTuiById } from "./registry/tuis.js";
 import { getModelById } from "./registry/models.js";
···
   localcode bench            Benchmark the running chat model
   localcode bench history    Show past benchmark results
 
+${BOLD}Quick:${RESET}
+  localcode ask "question"   Quick coding Q&A (streamed)
+  localcode pipe "prompt"    Pipe stdin through the model
+
 ${BOLD}Other:${RESET}
-  localcode pipe "prompt"    Pipe stdin through the model
   localcode setup            Full install (Ollama, models, tools)
 `);
 }
···
       } else {
         await runBench(process.argv.slice(3));
       }
+      break;
+    }
+
+    case "ask": {
+      const question = process.argv.slice(3).join(" ");
+      if (!question) {
+        console.error("Usage: localcode ask \"question\"");
+        process.exit(1);
+      }
+      await runAsk(question);
       break;
     }