rob thijssen acf9f5332f
fix(controller): reap stuck-registered runners after 5min instead of 30min
A runner that registers with Gitea but never picks up a job (because the
queued jobs that triggered its spawn turned out to be waiting on
'needs:' / gated by 'if:' conditions, or any other claim-side stall)
holds capacity for the full window. With a 30-minute threshold, a burst
of over-eager spawns can block all real work for half an hour.

Drop the threshold to 5 minutes. False positives are self-healing: if a
job was about to be claimed, the next brew tick (5s) will see it still
queued and spawn a fresh runner — cost is one extra image pull (cached)
and a registration token round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:05:21 +03:00
2026-05-12 06:46:03 +03:00

gongfoo

Ephemeral, autoscaling Gitea Actions runners on Podman. Written in Rust.

gongfoo watches the Gitea job queue and spawns single-use containerised runners across a fleet of hosts on demand. Each runner picks up exactly one job, executes it, and exits. The controller maintains desired capacity per runner-image / label-set, balanced across hosts via a spread-placement strategy.

The name is a deliberate misspelling of gongfu — the tea ceremony style of many small, precise, repeated brews. That's what the runners are: small, precise, ephemeral.

Why

The previous setup ran multiple long-lived act_runner processes per host via templated systemd units on bare metal. That works but is rigid: capacity is fixed, distro/version variants require static provisioning, and there's shared state between jobs. gongfoo replaces that with on-demand containerised runners that scale with queue depth and isolate every job in a fresh container.

Architecture

┌─────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  Gitea API  │◄────│   controller     │────►│  agent (per host)   │
│  (queue)    │     │   (reconciler)   │     │                     │
└─────────────┘     └──────────────────┘     └─────────────────────┘
                            │                          │
                            ▼                          ▼
                    ┌──────────────────┐     ┌─────────────────────┐
                    │   PostgreSQL     │     │  Podman + ephemeral │
                    │  (runner state)  │     │  act_runner         │
                    └──────────────────┘     └─────────────────────┘
  • controller polls Gitea for queued jobs, reconciles desired vs actual runner count per (image, label-set), and either instructs agents to spawn runners or reports orphans for reaping. Postgres is the source of truth.
  • agent runs on each runner host, receives spawn RPCs, drives the local Podman socket, watches container events, and reports lifecycle transitions back to the controller.
  • runners are short-lived Podman containers running act_runner in --once / --ephemeral mode. Each container registers with Gitea, takes one job, exits. The container has the host's Podman socket mounted so it can spawn job containers for the actual workflow steps.
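
The reconcile step described for the controller can be sketched roughly as follows. This is a minimal illustration, not gongfoo's actual code: `PoolKey`, `plan_spawns`, the `waiting` map (runners already spawned but not yet busy) and the `max_per_pool` cap are all hypothetical names and assumptions introduced here.

```rust
use std::collections::HashMap;

/// One reconciliation bucket: a runner image plus its label set.
/// Hypothetical type; labels are assumed to be kept sorted so that
/// equal label sets compare and hash equally.
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct PoolKey {
    pub image: String,
    pub labels: Vec<String>,
}

/// Compare desired against actual capacity per pool and return how many
/// runners to spawn for each pool that is short.
///
/// `queued`  : queued jobs per pool, as seen in the Gitea queue.
/// `waiting` : runners already spawned for that pool that have not yet
///             picked up a job (an assumption about what "actual" means).
/// `max_per_pool` : assumed upper bound on desired runners per pool.
pub fn plan_spawns(
    queued: &HashMap<PoolKey, usize>,
    waiting: &HashMap<PoolKey, usize>,
    max_per_pool: usize,
) -> Vec<(PoolKey, usize)> {
    let mut plan: Vec<(PoolKey, usize)> = queued
        .iter()
        .filter_map(|(key, &jobs)| {
            let desired = jobs.min(max_per_pool);
            let actual = waiting.get(key).copied().unwrap_or(0);
            // Never go negative: surplus runners are not handled here.
            let deficit = desired.saturating_sub(actual);
            (deficit > 0).then(|| (key.clone(), deficit))
        })
        .collect();
    // HashMap iteration order is unspecified, so sort for a stable plan.
    plan.sort();
    plan
}
```

Keeping the planning step a pure function of two maps makes it easy to test without Gitea, Postgres, or an agent; the side effects (spawn RPCs, state writes) would sit around it.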

Workspace layout

gongfoo/
├── Cargo.toml                    # workspace root
├── crates/
│   ├── gongfoo-proto/            # shared types and RPC definitions
│   ├── gongfoo-controller/       # reconciler + Gitea/Postgres client
│   ├── gongfoo-agent/            # per-host daemon, Podman driver
│   └── gongfoo-migrations/       # sqlx migrations
├── images/
│   ├── runner-fedora-43/         # Fedora 43 base: act_runner + nodejs + CA
│   ├── runner-rust/              # Fedora + rust, cargo, clippy, rustfmt
│   ├── runner-rpm/               # Fedora + rpm-build, createrepo_c
│   ├── runner-ubuntu-24.04/      # Ubuntu 24.04 base: act_runner + nodejs + CA
│   └── runner-deb/               # Ubuntu + debhelper, devscripts, reprepro
├── asset/                        # deployment artifacts
│   ├── manifest.yml              # environments → components → hosts
│   ├── systemd/                  # service units, cert-reload .path units
│   ├── firewalld/                # named service definitions
│   ├── config/                   # config templates with {{SECRET}} placeholders
│   └── sql/                      # bootstrap SQL for Postgres role/db creation
└── script/
    └── deploy.sh                 # deploy postgres, controller, agent components

Crates

Crate               Binary              Purpose
gongfoo-proto       -                   Shared request/response types, error enums, state machine
gongfoo-controller  gongfoo-controller  Long-running daemon. Polls Gitea, reconciles, dispatches spawn RPCs
gongfoo-agent       gongfoo-agent       Long-running daemon on runner hosts. Receives RPCs, drives Podman
gongfoo-migrations  -                   sqlx migration files and an embed helper for the controller

Vocabulary

The codebase uses tea-ceremony terms where they're genuinely clearer than generic alternatives, and plain English elsewhere. Specifically:

  • brew loop — the reconciliation cycle (every 5s by default).

host, image, job, runner are kept as plain terms.

Placement strategy

Spread: pick the host with the fewest active runners that has capacity for the requested image, breaking ties randomly. CPU and memory budgets are checked against the host's reported totals.

This is intentionally simple. Once cichlid (a separate distributed orchestration project) is ready, placement will be delegated to its rules engine.
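
As a rough illustration of the spread rule above, the sketch below filters hosts by CPU and memory budget, keeps those with the fewest active runners, and breaks ties with a caller-supplied function. The `Host` and `Budget` types, their field names and units, and the `tie_break` parameter are assumptions made for this example; a real implementation would presumably pass a random number generator where the closure is.

```rust
/// A host as the placement step might see it. Field names and units
/// are illustrative, not taken from the gongfoo codebase.
#[derive(Debug, Clone)]
pub struct Host {
    pub name: String,
    pub active_runners: u32,
    pub cpu_total_millis: u64,
    pub cpu_used_millis: u64,
    pub mem_total_bytes: u64,
    pub mem_used_bytes: u64,
}

/// Resources assumed to be reserved for one runner of the requested image.
pub struct Budget {
    pub cpu_millis: u64,
    pub mem_bytes: u64,
}

impl Host {
    /// True if adding one runner stays within the host's reported totals.
    fn fits(&self, b: &Budget) -> bool {
        self.cpu_used_millis.saturating_add(b.cpu_millis) <= self.cpu_total_millis
            && self.mem_used_bytes.saturating_add(b.mem_bytes) <= self.mem_total_bytes
    }
}

/// Spread placement: among hosts with capacity, pick one with the fewest
/// active runners. `tie_break(n)` chooses among `n` tied hosts; its result
/// is reduced modulo `n`, so any value is safe.
pub fn place<'a>(
    hosts: &'a [Host],
    budget: &Budget,
    tie_break: impl FnOnce(usize) -> usize,
) -> Option<&'a Host> {
    let candidates: Vec<&Host> = hosts.iter().filter(|h| h.fits(budget)).collect();
    // `?` returns None when no host has capacity.
    let fewest = candidates.iter().map(|h| h.active_runners).min()?;
    let tied: Vec<&Host> = candidates
        .into_iter()
        .filter(|h| h.active_runners == fewest)
        .collect();
    let idx = tie_break(tied.len()) % tied.len();
    Some(tied[idx])
}
```

Injecting the tie-breaker keeps the function deterministic under test while leaving the random choice to the caller.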

Runner state machine

pending → spawning → started → registered → running → completed → reaped
                  ↘         ↘              ↘        ↘
                   failed ──────────────────────────→ reaped
  • pending: controller decided to spawn, placement chosen
  • spawning: spawn RPC sent to agent
  • started: container running, Gitea registration not yet confirmed
  • registered: confirmed in Gitea's runner list
  • running: runner is busy executing a job (detected via Gitea API)
  • completed/failed: runner exited
  • reaped: cleanup done, row awaiting deletion
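
The diagram above could be encoded as an enum with an explicit transition check, along the lines of the sketch below. It reflects one reading of the diagram, namely that the four diagonal arrows leave spawning, started, registered and running; the type and method names are hypothetical and the real state machine in gongfoo-proto may differ.

```rust
/// Runner lifecycle states, named after the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunnerState {
    Pending,
    Spawning,
    Started,
    Registered,
    Running,
    Completed,
    Failed,
    Reaped,
}

impl RunnerState {
    /// Returns true if `self -> next` is an edge in the diagram
    /// (as interpreted here; this is an assumption, not the project's code).
    pub fn can_transition_to(self, next: RunnerState) -> bool {
        use RunnerState::*;
        match (self, next) {
            // Happy path, left to right.
            (Pending, Spawning)
            | (Spawning, Started)
            | (Started, Registered)
            | (Registered, Running)
            | (Running, Completed) => true,
            // Any in-flight state may fail.
            (Spawning | Started | Registered | Running, Failed) => true,
            // Both terminal outcomes end in cleanup.
            (Completed | Failed, Reaped) => true,
            _ => false,
        }
    }
}
```

Rejecting unknown edges at this level means a stale or duplicated lifecycle report from an agent cannot move a row backwards, for example from reaped to pending.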

Deployment

./script/deploy.sh <environment> [component...] [--dry-run]
./script/deploy.sh prod controller agent
./script/deploy.sh prod postgres

Components: postgres (bootstrap DB, register hosts and images), controller, agent (deployed to all hosts in manifest).

The controller and agents communicate with each other over mTLS, using each host's step-ca-issued certificates.

Status

Working. Dogfooding — gongfoo's own CI runs on gongfoo-managed runners.

Repository

git.lair.cafe/gongfoo/gongfoo

Licence

TBD.
