Kubernetes Operator for Tangled Spindles

Proposal: Runtime Interface for Spindle Engines#

Status: Draft
Author: @evanjarrett
Date: 2025-12-09

Summary#

Extract a Runtime interface from the nixery engine to enable:

  1. A new engine:docker that accepts user-specified images
  2. Support for multiple container runtimes (Docker, Podman)
  3. Downstream implementations (e.g., Kubernetes in Loom)

Motivation#

Currently, the spindle engine architecture tightly couples workflow execution logic with Docker-specific container management in the nixery engine. This creates several limitations:

  1. No user-specified images: Users must declare Nix dependencies; they can't use pre-built Docker images like node:20 or golang:1.22

  2. Single runtime: Only the Docker daemon is supported; there is no path to Podman or other OCI runtimes

  3. Downstream friction: Loom (Kubernetes-based spindle) must reimplement the entire Engine interface rather than reusing workflow parsing and step execution logic

Proposal#

New Runtime Interface#

Create spindle/models/runtime.go:

package models

import (
    "context"
    "io"
)

// Runtime abstracts container/job execution environments.
// Implementations: Docker, Podman, (downstream: Kubernetes)
type Runtime interface {
    // Setup creates the execution environment and returns a handle.
    // The environment should be ready for Exec calls after Setup returns.
    Setup(ctx context.Context, opts SetupOpts) (Handle, error)

    // Exec runs a command in the environment.
    // For container runtimes, this is typically `exec` into a running container.
    // For job-based runtimes (K8s), this may stream logs from a pre-defined job.
    Exec(ctx context.Context, h Handle, opts ExecOpts) (*ExecResult, error)

    // Destroy tears down the environment and releases resources.
    Destroy(ctx context.Context, h Handle) error
}

// SetupOpts configures the execution environment.
type SetupOpts struct {
    // Image is the container image to use (e.g., "node:20", "nixery.dev/shell/git")
    Image string

    // WorkflowID uniquely identifies this workflow run (used for labeling/naming)
    WorkflowID WorkflowId

    // WorkDir is the working directory inside the container (e.g., "/tangled/workspace")
    WorkDir string

    // Labels for the container/job (e.g., {"sh.tangled.pipeline/workflow_id": "..."})
    Labels map[string]string

    // Security options
    DropAllCaps bool
    AddCaps     []string // e.g., ["DAC_OVERRIDE", "CHOWN", "FOWNER", "SETUID", "SETGID"]

    // Architecture hint for multi-arch scheduling (used by K8s runtime)
    Architecture string
}

// ExecOpts configures a single command execution.
type ExecOpts struct {
    // Command to run (e.g., ["bash", "-c", "npm install"])
    Command []string

    // Environment variables
    Env []string

    // Output streams (nil = discard)
    Stdout io.Writer
    Stderr io.Writer
}

// ExecResult contains the outcome of an Exec call.
type ExecResult struct {
    ExitCode  int
    OOMKilled bool
}

// Handle is an opaque reference to an execution environment.
type Handle interface {
    // ID returns a unique identifier for this environment (container ID, job name, etc.)
    ID() string
}

// RuntimeMode indicates how the runtime executes steps.
type RuntimeMode int

const (
    // RuntimeModeExec means steps are executed one at a time via Exec calls.
    // Used by Docker/Podman where we exec into a running container.
    RuntimeModeExec RuntimeMode = iota

    // RuntimeModeBatch means all steps run in a single invocation.
    // Used by Kubernetes where a Job runs all steps and the engine streams logs.
    // In this mode, Exec() streams logs rather than executing commands.
    RuntimeModeBatch
)

// RuntimeInfo provides metadata about a runtime implementation.
type RuntimeInfo interface {
    Mode() RuntimeMode
}
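Because RuntimeInfo is a separate, optional interface, callers can probe for it with a type assertion and fall back to exec-style behavior when it is absent. A minimal self-contained sketch of that lookup (the local type copies and the `modeOf` helper are illustrative, not part of the proposed API):

```go
package main

import "fmt"

// Local copies of the proposed types, so this sketch compiles on its own.
type RuntimeMode int

const (
    RuntimeModeExec RuntimeMode = iota
    RuntimeModeBatch
)

type Runtime interface{} // stand-in for models.Runtime

type RuntimeInfo interface {
    Mode() RuntimeMode
}

// batchRuntime models a runtime (like the Kubernetes one) that opts in to batch mode.
type batchRuntime struct{}

func (batchRuntime) Mode() RuntimeMode { return RuntimeModeBatch }

// modeOf defaults to exec mode when a runtime does not implement RuntimeInfo,
// so existing Docker/Podman runtimes need no extra method.
func modeOf(rt Runtime) RuntimeMode {
    if info, ok := rt.(RuntimeInfo); ok {
        return info.Mode()
    }
    return RuntimeModeExec
}

func main() {
    fmt.Println(modeOf(batchRuntime{}) == RuntimeModeBatch) // true
    fmt.Println(modeOf(struct{}{}) == RuntimeModeExec)      // true
}
```

This keeps RuntimeInfo opt-in: only runtimes that deviate from exec-per-step semantics need to declare a mode.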

Refactored Engine Structure#

Engines become thin wrappers focused on image resolution and step generation:

// engines/base/engine.go - shared logic
package base

type Engine struct {
    Runtime models.Runtime
    Logger  *slog.Logger
    Handles map[string]models.Handle
    mu      sync.Mutex
}

func (e *Engine) SetupWorkflow(ctx context.Context, wid models.WorkflowId, wf *models.Workflow) error {
    data := wf.Data.(WorkflowData)

    handle, err := e.Runtime.Setup(ctx, models.SetupOpts{
        Image:       data.Image,
        WorkflowID:  wid,
        WorkDir:     "/tangled/workspace",
        Labels:      map[string]string{"sh.tangled.pipeline/workflow_id": wid.String()},
        DropAllCaps: true,
        AddCaps:     []string{"DAC_OVERRIDE", "CHOWN", "FOWNER", "SETUID", "SETGID"},
    })
    if err != nil {
        return err
    }

    e.mu.Lock()
    e.Handles[wid.String()] = handle
    e.mu.Unlock()

    // Create workspace directories
    _, err = e.Runtime.Exec(ctx, handle, models.ExecOpts{
        Command: []string{"mkdir", "-p", "/tangled/workspace", "/tangled/home"},
    })
    return err
}

func (e *Engine) RunStep(ctx context.Context, wid models.WorkflowId, w *models.Workflow, idx int, secrets []secrets.UnlockedSecret, wfLogger *models.WorkflowLogger) error {
    e.mu.Lock()
    handle := e.Handles[wid.String()]
    e.mu.Unlock()

    step := w.Steps[idx]
    env := buildEnvs(w.Environment, step, secrets)

    var stdout, stderr io.Writer
    if wfLogger != nil {
        stdout = wfLogger.DataWriter(idx, "stdout")
        stderr = wfLogger.DataWriter(idx, "stderr")
    }

    result, err := e.Runtime.Exec(ctx, handle, models.ExecOpts{
        Command: []string{"bash", "-c", step.Command()},
        Env:     env,
        Stdout:  stdout,
        Stderr:  stderr,
    })
    if err != nil {
        return err
    }
    if result.OOMKilled {
        return ErrOOMKilled
    }
    if result.ExitCode != 0 {
        return ErrWorkflowFailed
    }
    return nil
}

func (e *Engine) DestroyWorkflow(ctx context.Context, wid models.WorkflowId) error {
    e.mu.Lock()
    handle, exists := e.Handles[wid.String()]
    delete(e.Handles, wid.String())
    e.mu.Unlock()

    if !exists {
        return nil
    }
    return e.Runtime.Destroy(ctx, handle)
}

Engine Implementations#

Nixery Engine (image from dependencies):

// engines/nixery/engine.go
package nixery

type Engine struct {
    *base.Engine
    cfg *config.Config
}

func (e *Engine) InitWorkflow(twf tangled.Pipeline_Workflow, tpl tangled.Pipeline) (*models.Workflow, error) {
    var spec struct {
        Steps        []StepSpec          `yaml:"steps"`
        Dependencies map[string][]string `yaml:"dependencies"`
        Environment  map[string]string   `yaml:"environment"`
    }
    if err := yaml.Unmarshal([]byte(twf.Raw), &spec); err != nil {
        return nil, err
    }

    // NIXERY-SPECIFIC: Build image URL from dependencies
    image := workflowImage(spec.Dependencies, e.cfg.NixeryPipelines.Nixery)

    steps := []models.Step{}

    // Add nixery-specific setup steps
    steps = append(steps, nixConfStep())
    steps = append(steps, models.BuildCloneStep(twf, *tpl.TriggerMetadata, e.cfg.Server.Dev))
    if depStep := dependencyStep(spec.Dependencies); depStep != nil {
        steps = append(steps, *depStep)
    }

    // Add user steps
    for _, s := range spec.Steps {
        steps = append(steps, Step{name: s.Name, command: s.Command, ...})
    }

    return &models.Workflow{
        Name:        twf.Name,
        Steps:       steps,
        Environment: spec.Environment,
        Data:        base.WorkflowData{Image: image},
    }, nil
}

func (e *Engine) WorkflowTimeout() time.Duration {
    // ... existing config-based timeout
}

Docker Engine (user-specified image):

// engines/docker/engine.go
package docker

type Engine struct {
    *base.Engine
}

func (e *Engine) InitWorkflow(twf tangled.Pipeline_Workflow, tpl tangled.Pipeline) (*models.Workflow, error) {
    var spec struct {
        Image       string            `yaml:"image"`
        Steps       []StepSpec        `yaml:"steps"`
        Environment map[string]string `yaml:"environment"`
    }
    if err := yaml.Unmarshal([]byte(twf.Raw), &spec); err != nil {
        return nil, err
    }

    // DOCKER-SPECIFIC: Require explicit image
    if spec.Image == "" {
        return nil, fmt.Errorf("docker engine requires 'image' field in workflow")
    }

    steps := []models.Step{}

    // Add clone step (shared with nixery)
    steps = append(steps, models.BuildCloneStep(twf, *tpl.TriggerMetadata, false))

    // Add user steps
    for _, s := range spec.Steps {
        steps = append(steps, SimpleStep{Name: s.Name, Command: s.Command, ...})
    }

    return &models.Workflow{
        Name:        twf.Name,
        Steps:       steps,
        Environment: spec.Environment,
        Data:        base.WorkflowData{Image: spec.Image},
    }, nil
}

func (e *Engine) WorkflowTimeout() time.Duration {
    return 1 * time.Hour // default
}

Runtime Implementations#

Docker Runtime:

// runtime/docker/runtime.go
package docker

type Runtime struct {
    client client.APIClient
    logger *slog.Logger
}

type handle struct {
    containerID string
    networkID   string
}

func (h *handle) ID() string { return h.containerID }

func (r *Runtime) Setup(ctx context.Context, opts models.SetupOpts) (models.Handle, error) {
    // Create a per-workflow network
    netResp, err := r.client.NetworkCreate(ctx, networkName(opts.WorkflowID), ...)
    if err != nil {
        return nil, err
    }

    // Pull image and drain the progress stream
    reader, err := r.client.ImagePull(ctx, opts.Image, image.PullOptions{})
    if err != nil {
        return nil, err
    }
    io.Copy(io.Discard, reader)
    reader.Close()

    // Create container; `cat` with an open stdin keeps it alive for later execs
    resp, err := r.client.ContainerCreate(ctx, &container.Config{
        Image:      opts.Image,
        Cmd:        []string{"cat"},
        OpenStdin:  true,
        WorkingDir: opts.WorkDir,
        Labels:     opts.Labels,
    }, &container.HostConfig{
        CapDrop: capDrop(opts.DropAllCaps),
        CapAdd:  opts.AddCaps,
        // ... mounts, security opts
    }, nil, nil, "")
    if err != nil {
        return nil, err
    }

    if err := r.client.ContainerStart(ctx, resp.ID, container.StartOptions{}); err != nil {
        return nil, err
    }

    return &handle{containerID: resp.ID, networkID: netResp.ID}, nil
}

func (r *Runtime) Exec(ctx context.Context, h models.Handle, opts models.ExecOpts) (*models.ExecResult, error) {
    dh := h.(*handle)

    // nil writers mean "discard" per the ExecOpts contract
    stdout, stderr := opts.Stdout, opts.Stderr
    if stdout == nil {
        stdout = io.Discard
    }
    if stderr == nil {
        stderr = io.Discard
    }

    execResp, err := r.client.ContainerExecCreate(ctx, dh.containerID, container.ExecOptions{
        Cmd:          opts.Command,
        Env:          opts.Env,
        AttachStdout: true,
        AttachStderr: true,
    })
    if err != nil {
        return nil, err
    }

    attach, err := r.client.ContainerExecAttach(ctx, execResp.ID, container.ExecAttachOptions{})
    if err != nil {
        return nil, err
    }
    defer attach.Close()

    // Demultiplex the combined stream into stdout/stderr
    if _, err := stdcopy.StdCopy(stdout, stderr, attach.Reader); err != nil {
        return nil, err
    }

    inspect, err := r.client.ContainerExecInspect(ctx, execResp.ID)
    if err != nil {
        return nil, err
    }

    // OOMKilled is reported on the container, not the exec
    containerInspect, err := r.client.ContainerInspect(ctx, dh.containerID)
    if err != nil {
        return nil, err
    }

    return &models.ExecResult{
        ExitCode:  inspect.ExitCode,
        OOMKilled: containerInspect.State.OOMKilled,
    }, nil
}

func (r *Runtime) Destroy(ctx context.Context, h models.Handle) error {
    dh := h.(*handle)
    // Best-effort teardown: attempt every step and report any failures together
    return errors.Join(
        r.client.ContainerStop(ctx, dh.containerID, container.StopOptions{}),
        r.client.ContainerRemove(ctx, dh.containerID, container.RemoveOptions{RemoveVolumes: true}),
        r.client.NetworkRemove(ctx, dh.networkID),
    )
}

Podman Runtime:

// runtime/podman/runtime.go
package podman

// Podman is API-compatible with Docker for most operations.
// This runtime uses the Podman socket instead of Docker socket.

type Runtime struct {
    client *podman.APIClient // or use Docker client with Podman socket
    logger *slog.Logger
}

// Implementation nearly identical to Docker runtime.
// Key differences:
// - Socket path: /run/user/1000/podman/podman.sock (rootless) or /run/podman/podman.sock
// - Some API differences in network handling
// - Native support for rootless containers
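One way to keep the Podman runtime thin is to reuse the Docker client and swap only the endpoint, since Podman serves a Docker-compatible API on the sockets above. A sketch of endpoint resolution (`podmanSocket` is a hypothetical helper; the paths mirror Podman's documented defaults, and the resulting URL could be handed to the Docker Go client via `client.WithHost`):

```go
package main

import (
    "fmt"
    "os"
)

// podmanSocket returns the URL of Podman's Docker-compatible API socket.
// Root Podman listens at /run/podman/podman.sock; rootless Podman listens
// under the user's runtime dir (XDG_RUNTIME_DIR, usually /run/user/<uid>).
func podmanSocket(uid int, xdgRuntimeDir string) string {
    if uid == 0 {
        return "unix:///run/podman/podman.sock"
    }
    if xdgRuntimeDir == "" {
        xdgRuntimeDir = fmt.Sprintf("/run/user/%d", uid)
    }
    return "unix://" + xdgRuntimeDir + "/podman/podman.sock"
}

func main() {
    // Resolve the socket for the current user; root gets the system socket.
    fmt.Println(podmanSocket(os.Getuid(), os.Getenv("XDG_RUNTIME_DIR")))
}
```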

Kubernetes Runtime (Downstream - Loom)#

// In loom repo: runtime/kubernetes/runtime.go
package kubernetes

type Runtime struct {
    client    client.Client
    config    *rest.Config
    namespace string
    template  SpindleTemplate
}

func (r *Runtime) Mode() models.RuntimeMode {
    return models.RuntimeModeBatch // All steps run in single Job
}

func (r *Runtime) Setup(ctx context.Context, opts models.SetupOpts) (models.Handle, error) {
    // Build Job spec with:
    // - Init containers for setup (user namespace, clone, etc.)
    // - Main container running loom-runner with all steps
    // - Node affinity based on opts.Architecture

    job := jobbuilder.BuildJob(jobbuilder.WorkflowConfig{
        Image:        opts.Image,
        Architecture: opts.Architecture,
        // ... steps passed via ConfigMap
    })

    if err := r.client.Create(ctx, job); err != nil {
        return nil, err
    }

    // Wait for pod to be running
    pod := waitForPod(ctx, job)

    return &k8sHandle{
        jobName: job.Name,
        podName: pod.Name,
    }, nil
}

func (r *Runtime) Exec(ctx context.Context, h models.Handle, opts models.ExecOpts) (*models.ExecResult, error) {
    // In batch mode, Exec streams logs rather than executing commands.
    // The loom-runner binary inside the Job executes all steps.
    // This method reads log output and returns when the step completes.

    kh := h.(*k8sHandle)

    // Stream logs from pod, parse JSON log lines from loom-runner
    // Return when step end marker is seen

    return &models.ExecResult{ExitCode: 0}, nil
}

func (r *Runtime) Destroy(ctx context.Context, h models.Handle) error {
    kh := h.(*k8sHandle)
    // Delete Job (GC handles pod cleanup)
    return r.client.Delete(ctx, &batchv1.Job{ObjectMeta: metav1.ObjectMeta{
        Name:      kh.jobName,
        Namespace: r.namespace,
    }})
}

Workflow YAML Examples#

engine:nixery (current behavior):

engine: nixery
dependencies:
  nixpkgs:
    - nodejs
    - python3
steps:
  - name: build
    command: npm run build

engine:docker (new):

engine: docker
image: node:20-alpine
steps:
  - name: install
    command: npm ci
  - name: build
    command: npm run build
  - name: test
    command: npm test

engine:kubernetes (downstream, in Loom):

engine: kubernetes
image: golang:1.22
architecture: arm64
steps:
  - name: build
    command: go build ./...
  - name: test
    command: go test ./...

Server Wiring#

// server.go
func Run(ctx context.Context) error {
    cfg, err := config.Load(ctx)
    if err != nil {
        return err
    }

    // Create runtime based on config
    var rt models.Runtime
    switch cfg.Runtime.Type {
    case "docker":
        dockerClient, err := client.NewClientWithOpts(client.FromEnv)
        if err != nil {
            return err
        }
        rt = docker.NewRuntime(dockerClient, logger)
    case "podman":
        podmanClient, err := podman.NewClient(cfg.Runtime.Podman.Socket)
        if err != nil {
            return err
        }
        rt = podman.NewRuntime(podmanClient, logger)
    default:
        rt = docker.NewRuntime(...) // default
    }

    // Create engines with shared runtime
    nixeryEng := nixery.New(rt, cfg, logger)
    dockerEng := dockerengine.New(rt, logger)

    s, err := New(ctx, cfg, map[string]models.Engine{
        "nixery": nixeryEng,
        "docker": dockerEng,
    })
    if err != nil {
        return err
    }

    return s.Start(ctx)
}

Configuration#

# spindle.toml

[runtime]
type = "docker"  # or "podman"

[runtime.docker]
# Uses DOCKER_HOST env var by default

[runtime.podman]
socket = "/run/user/1000/podman/podman.sock"

Migration Path#

Phase 1: Extract Runtime Interface#

  • Add models/runtime.go with interface definition
  • Add runtime/docker/ implementation
  • Refactor nixery to use docker runtime internally
  • No breaking changes to existing workflows

Phase 2: Add Docker Engine#

  • Add engines/docker/ that uses same runtime
  • Register as "docker" in engine map
  • Users can now use engine: docker with explicit images

Phase 3: Add Podman Runtime#

  • Add runtime/podman/ implementation
  • Add config option to select runtime
  • Podman users can run nixery/docker engines without Docker daemon

Phase 4: Downstream Kubernetes (Loom)#

  • Loom implements runtime/kubernetes/
  • Can register "kubernetes" engine or reuse "docker"/"nixery" engines with K8s runtime
  • Maintains current Job + loom-runner architecture

Alternatives Considered#

1. Keep engines monolithic#

  • Pro: Simpler, less abstraction
  • Con: Code duplication, can't swap runtimes, harder for downstream

2. Docker-in-Docker for Kubernetes#

  • Pro: Identical behavior to local execution
  • Con: Security concerns, complexity, resource overhead

3. Runtime as engine parameter#

  • Pro: More flexible per-workflow
  • Con: Overcomplicates workflow YAML, runtime is deployment choice not user choice

Open Questions#

  1. Should runtime selection be per-engine or global?

    • Proposal: Global (deployment config), not per-workflow
  2. How to handle runtime-specific features?

    • E.g., K8s node affinity, Docker network modes
    • Proposal: SetupOpts has optional fields; runtimes ignore unsupported options
  3. Should we upstream the Kubernetes runtime?

    • Proposal: No, keep in Loom. Upstream provides interface, downstream implements.
  4. Podman rootless considerations?

    • User namespaces, different capability handling
    • Need testing matrix

References#