# deployment

zlay runs on a Hetzner CPX41 in Hillsboro, OR, managed via k3s. all deployment is orchestrated from the [relay repo](https://tangled.org/zzstoatzz.io/relay) using `just` recipes.

## build and deploy

the preferred method builds natively on the server (fast, no cross-compilation):

```bash
just zlay-publish-remote
```

this SSHs into the server and:

1. `git pull --ff-only` in `/opt/zlay`
2. `zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu`
3. `buildah bud -f Dockerfile.runtime .` — thin runtime image with SHA tag
4. pushes to k3s containerd via `buildah push` → `ctr images import`
5. `kubectl set image deployment/zlay -n zlay main=` + `kubectl rollout status`

the runtime image (`Dockerfile.runtime`) is minimal: debian bookworm-slim + ca-certificates + the binary.

### why not Docker build?

the full `Dockerfile` exists for CI/standalone builds but is slow on Mac (cross-compilation + QEMU). `zlay-publish-remote` skips all of that by building on the target architecture.

### build flags

- `-Dtarget=x86_64-linux-gnu` — **must use glibc**, not musl. zig 0.15's C++ codegen for musl produces illegal instructions in RocksDB's LRU cache.
- `-Dcpu=baseline` — required when building inside Docker/QEMU (not needed for `zlay-publish-remote`, since it builds natively).
- `-Doptimize=ReleaseSafe` — safety checks on, optimizations on. production default since 2026-03-05. previously caused OOM (see [incident-2026-03-04.md](incident-2026-03-04.md)) — resolved by the frame pool moving heavy work off reader threads.

## initial setup

```bash
just zlay-init        # terraform init
just zlay-infra       # create Hetzner server with k3s
just zlay-kubeconfig  # pull kubeconfig (~2 min after creation)
just zlay-deploy      # full deploy: cert-manager, postgres, relay, monitoring
```

point the DNS A record for `ZLAY_DOMAIN` at the server IP (`just zlay-server-ip`) before deploying.
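before running `just zlay-deploy`, it can save a failed rollout to check that the required `.env` variables (listed in the next section) are actually set. this helper is a hypothetical sketch, not a recipe in the relay repo:

```bash
# hypothetical pre-deploy sanity check: report any required .env variable
# that is unset or empty. variable names are the ones this deployment needs.
check_env() {
  missing=""
  for var in HCLOUD_TOKEN ZLAY_DOMAIN ZLAY_ADMIN_PASSWORD ZLAY_POSTGRES_PASSWORD LETSENCRYPT_EMAIL; do
    eval "val=\${$var:-}"               # indirect lookup of $var's value
    [ -n "$val" ] || missing="$missing $var"
  done
  if [ -z "$missing" ]; then
    echo "env OK"
  else
    echo "missing:$missing"
  fi
}

# demo: only two of the five variables set for this call
HCLOUD_TOKEN=token ZLAY_DOMAIN=zlay.example check_env
```

source the `.env` file first (e.g. `set -a; . ./.env; set +a`) so the variables are visible to the check.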
## environment variables

set in `.env` in the relay repo:

| variable | required | description |
|----------|----------|-------------|
| `HCLOUD_TOKEN` | yes | Hetzner Cloud API token |
| `ZLAY_DOMAIN` | yes | public domain (e.g. `zlay.waow.tech`) |
| `ZLAY_ADMIN_PASSWORD` | yes | bearer token for admin endpoints |
| `ZLAY_POSTGRES_PASSWORD` | yes | postgres password |
| `LETSENCRYPT_EMAIL` | yes | email for TLS certificates |

## operations

```bash
just zlay-status  # nodes, pods, health
just zlay-logs    # tail relay logs
just zlay-health  # curl public health endpoint
just zlay-ssh     # ssh into server
```

## infrastructure

- **server**: Hetzner CPX41 — 16 vCPU (AMD), 32 GB RAM, 240 GB NVMe
- **k3s**: single-node kubernetes with traefik ingress
- **cert-manager**: automatic TLS via Let's Encrypt
- **postgres**: bitnami/postgresql helm chart (relay state, backfill progress)
- **monitoring**: prometheus + grafana via kube-prometheus-stack
- **terraform**: `infra/zlay/` in the relay repo

## memory tuning

four changes brought steady-state memory from ~6.6 GiB down to ~1.1 GiB at ~2,250 connected hosts (ReleaseSafe):

**shared TLS CA bundle.** the biggest single win. websocket.zig's TLS client calls `Bundle.rescan()` per connection, loading the system CA certificates into a per-connection arena. with ~2,750 PDS connections, that's ~2,750 copies of the CA bundle in memory (~800 KB each = ~2.2 GiB). fix: load the bundle once in the slurper, pass it to all subscribers via `config.ca_bundle`. memory dropped from ~3.3 GiB to ~1.2 GiB (~65% reduction).

**thread stack sizes.** zig's default thread stack is 16 MB. with ~2,750 subscriber threads, that maps 44 GB of virtual memory. fix: all `Thread.spawn` calls now use `main.default_stack_size` (8 MB). this is virtual memory — only touched pages count as RSS. 8 MB supports ReleaseSafe's TLS handshake path (~134 KiB peak stack).
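the two fixes above can be sanity-checked with back-of-envelope arithmetic (all figures taken from this section, decimal units):

```bash
# rough totals implied by the memory-tuning numbers above
conns=2750                              # subscriber connections / reader threads
ca_total_mb=$(( conns * 800 / 1000 ))   # ~800 KB CA bundle copy per connection
old_stack_gb=$(( conns * 16 / 1000 ))   # 16 MB default zig thread stacks
new_stack_gb=$(( conns * 8 / 1000 ))    # 8 MB stacks after the fix
echo "CA bundles: ~${ca_total_mb} MB"                       # ~2200 MB, the ~2.2 GiB figure
echo "stacks: ~${old_stack_gb} GB -> ~${new_stack_gb} GB"   # 44 GB -> 22 GB virtual
```

the stack figure is virtual address space, not RSS, which is why the CA bundle fix (real resident copies) was the larger win.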
**c_allocator instead of GeneralPurposeAllocator.** GPA is a debug allocator — it tracks per-allocation metadata and never returns freed small allocations to the OS. since zlay links glibc (`build.zig:42`), `std.heap.c_allocator` gives us glibc malloc with per-thread arenas, madvise-based page return, and production-grade fragmentation mitigation.

**frame processing pool.** reader threads (one per PDS) now only do TLS read, header decode, cursor tracking, and rate limiting — then queue raw frames to a shared pool of 16 workers. this dramatically reduced per-thread RSS in ReleaseSafe (from ~3.9 MiB to ~0.45 MiB) by keeping crypto, DB, and broadcast off reader thread stacks.

## resource usage

| metric | value |
|--------|-------|
| memory | ~1.1 GiB at ~2,250 hosts (ReleaseSafe), projected ~1.3 GiB steady state |
| CPU | ~1.5 cores peak |
| requests | 1 GiB memory, 1000m CPU |
| limits | 8 GiB memory |
| PVC | 20 GiB (events + RocksDB collection index) |
| postgres | ~238 MiB |

## git push

the zlay repo is hosted on tangled. pushing requires the tangled SSH key:

```bash
GIT_SSH_COMMAND="ssh -i ~/.ssh/tangled_ed25519 -o IdentitiesOnly=yes" git push
```
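rather than exporting `GIT_SSH_COMMAND` on every push, the key can be persisted per-repo with `core.sshCommand` (a standard git setting). sketched here in a throwaway repo; in practice you'd run the `git config` line once inside the zlay checkout:

```bash
# persist the tangled SSH key for this repo so plain `git push` picks it up
tmp=$(mktemp -d)
git -C "$tmp" init -q
git -C "$tmp" config core.sshCommand "ssh -i ~/.ssh/tangled_ed25519 -o IdentitiesOnly=yes"
git -C "$tmp" config core.sshCommand   # prints the stored command
```

this keeps the key out of shell history and means collaborators without the key still get a clear auth error rather than silently using the wrong identity.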