# deployment

zlay runs on a Hetzner CPX41 in Hillsboro, OR, managed via k3s. all deployment is orchestrated from the [relay repo](https://tangled.org/zzstoatzz.io/relay) using `just` recipes.

## build and deploy

the preferred method builds natively on the server (fast, no cross-compilation):

```bash
just zlay-publish-remote
```

this SSHs into the server and:

1. `git pull --ff-only` in `/opt/zlay`
2. `zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu`
3. `buildah bud -f Dockerfile.runtime .` — thin runtime image with SHA tag
4. pushes to k3s containerd via `buildah push` → `ctr images import`
5. `kubectl set image deployment/zlay -n zlay main=` + `kubectl rollout status`

the runtime image (`Dockerfile.runtime`) is minimal: debian bookworm-slim + ca-certificates + the binary.

### why not Docker build?

the full `Dockerfile` exists for CI/standalone builds but is slow on Mac (cross-compilation + QEMU). `zlay-publish-remote` skips all of that by building on the target architecture.

### build flags

- `-Dtarget=x86_64-linux-gnu` — **must use glibc**, not musl. zig 0.15's C++ codegen for musl produces illegal instructions in RocksDB's LRU cache.
- `-Dcpu=baseline` — required when building inside Docker/QEMU (not needed for `zlay-publish-remote`, since it builds natively).
- `-Doptimize=ReleaseSafe` — safety checks on, optimizations on. production default since 2026-03-05. previously caused OOM (see [incident-2026-03-04.md](incident-2026-03-04.md)) — resolved by the frame pool moving heavy work off reader threads.

## initial setup

```bash
just zlay-init        # terraform init
just zlay-infra       # create Hetzner server with k3s
just zlay-kubeconfig  # pull kubeconfig (~2 min after creation)
just zlay-deploy      # full deploy: cert-manager, postgres, relay, monitoring
```

point the DNS A record for `ZLAY_DOMAIN` at the server IP (`just zlay-server-ip`) before deploying.
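before running `just zlay-deploy`, it can save a failed rollout to check that the required `.env` variables (listed in the next section) are actually set. this helper is a hypothetical sketch, not a recipe in the relay repo:

```bash
# hypothetical pre-deploy sanity check: report any required .env variable
# that is unset or empty. variable names are the ones this deployment needs.
check_env() {
  missing=""
  for var in HCLOUD_TOKEN ZLAY_DOMAIN ZLAY_ADMIN_PASSWORD ZLAY_POSTGRES_PASSWORD LETSENCRYPT_EMAIL; do
    eval "val=\${$var:-}"               # indirect lookup of $var's value
    [ -n "$val" ] || missing="$missing $var"
  done
  if [ -z "$missing" ]; then
    echo "env OK"
  else
    echo "missing:$missing"
  fi
}

# demo: only two of the five variables set for this call
HCLOUD_TOKEN=token ZLAY_DOMAIN=zlay.example check_env
```

source the `.env` file first (e.g. `set -a; . ./.env; set +a`) so the variables are visible to the check.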
## environment variables

set in `.env` in the relay repo:

| variable | required | description |
|----------|----------|-------------|
| `HCLOUD_TOKEN` | yes | Hetzner Cloud API token |
| `ZLAY_DOMAIN` | yes | public domain (e.g. `zlay.waow.tech`) |
| `ZLAY_ADMIN_PASSWORD` | yes | bearer token for admin endpoints |
| `ZLAY_POSTGRES_PASSWORD` | yes | postgres password |
| `LETSENCRYPT_EMAIL` | yes | email for TLS certificates |

## operations

```bash
just zlay-status  # nodes, pods, health
just zlay-logs    # tail relay logs
just zlay-health  # curl public health endpoint
just zlay-ssh     # ssh into server
```

## infrastructure

- **server**: Hetzner CPX41 — 16 vCPU (AMD), 32 GB RAM, 240 GB NVMe
- **k3s**: single-node kubernetes with traefik ingress
- **cert-manager**: automatic TLS via Let's Encrypt
- **postgres**: bitnami/postgresql helm chart (relay state, backfill progress)
- **monitoring**: prometheus + grafana via kube-prometheus-stack
- **terraform**: `infra/zlay/` in the relay repo

## memory tuning

four changes brought steady-state memory from ~6.6 GiB down to ~1.1 GiB at ~2,250 connected hosts (ReleaseSafe):

**shared TLS CA bundle.** the biggest single win. websocket.zig's TLS client calls `Bundle.rescan()` per connection, loading the system CA certificates into a per-connection arena. with ~2,750 PDS connections, that's ~2,750 copies of the CA bundle in memory (~800 KB each = ~2.2 GiB). fix: load the bundle once in the slurper, pass it to all subscribers via `config.ca_bundle`. memory dropped from ~3.3 GiB to ~1.2 GiB (~65% reduction).

**thread stack sizes.** zig's default thread stack is 16 MB. with ~2,750 subscriber threads, that maps 44 GB of virtual memory. fix: all `Thread.spawn` calls now use `main.default_stack_size` (8 MB). this is virtual memory — only touched pages count as RSS. 8 MB supports ReleaseSafe's TLS handshake path (~134 KiB peak stack).
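the two fixes above can be sanity-checked with back-of-envelope arithmetic (all figures taken from this section, decimal units):

```bash
# rough totals implied by the memory-tuning numbers above
conns=2750                              # subscriber connections / reader threads
ca_total_mb=$(( conns * 800 / 1000 ))   # ~800 KB CA bundle copy per connection
old_stack_gb=$(( conns * 16 / 1000 ))   # 16 MB default zig thread stacks
new_stack_gb=$(( conns * 8 / 1000 ))    # 8 MB stacks after the fix
echo "CA bundles: ~${ca_total_mb} MB"                       # ~2200 MB, the ~2.2 GiB figure
echo "stacks: ~${old_stack_gb} GB -> ~${new_stack_gb} GB"   # 44 GB -> 22 GB virtual
```

the stack figure is virtual address space, not RSS, which is why the CA bundle fix (real resident copies) was the larger win.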
**c_allocator instead of GeneralPurposeAllocator.** GPA is a debug allocator — it tracks per-allocation metadata and never returns freed small allocations to the OS. since zlay links glibc (`build.zig:42`), `std.heap.c_allocator` gives us glibc malloc with per-thread arenas, madvise-based page return, and production-grade fragmentation mitigation.

**frame processing pool.** reader threads (one per PDS) now only do TLS read, header decode, cursor tracking, and rate limiting — then queue raw frames to a shared pool of 16 workers. this dramatically reduced per-thread RSS in ReleaseSafe (from ~3.9 MiB to ~0.45 MiB) by keeping crypto, DB, and broadcast off reader thread stacks.

## resource usage

| metric | value |
|--------|-------|
| memory | ~1.1 GiB at ~2,250 hosts (ReleaseSafe), projected ~1.3 GiB steady state |
| CPU | ~1.5 cores peak |
| requests | 1 GiB memory, 1000m CPU |
| limits | 8 GiB memory |
| PVC | 20 GiB (events + RocksDB collection index) |
| postgres | ~238 MiB |

## git push

the zlay repo is hosted on tangled. pushing requires the tangled SSH key:

```bash
GIT_SSH_COMMAND="ssh -i ~/.ssh/tangled_ed25519 -o IdentitiesOnly=yes" git push
```
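rather than exporting `GIT_SSH_COMMAND` on every push, the key can be persisted per-repo with `core.sshCommand` (a standard git setting). sketched here in a throwaway repo; in practice you'd run the `git config` line once inside the zlay checkout:

```bash
# persist the tangled SSH key for this repo so plain `git push` picks it up
tmp=$(mktemp -d)
git -C "$tmp" init -q
git -C "$tmp" config core.sshCommand "ssh -i ~/.ssh/tangled_ed25519 -o IdentitiesOnly=yes"
git -C "$tmp" config core.sshCommand   # prints the stored command
```

this keeps the key out of shell history and means collaborators without the key still get a clear auth error rather than silently using the wrong identity.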