chore: scaffold

2026-04-28 10:39:51 +03:00
commit 64bbf5a6a0
17 changed files with 516 additions and 0 deletions

1
.gitignore vendored Normal file
@@ -0,0 +1 @@
target/

225
CLAUDE.md Normal file
@@ -0,0 +1,225 @@
# CLAUDE.md
Guidance for Claude Code working in this repository. Read this before
proposing changes or writing an implementation plan.
## What gongfoo is
An autoscaling Gitea Actions runner system. The controller watches the
Gitea job queue and instructs per-host agents to spawn ephemeral Podman
containers running `act_runner` in `--once` mode. Each container takes one
job and exits.
Read `README.md` first for the high-level architecture diagram and the
crate map. This file is the engineering brief.
## Operator context
The operator runs ~40 servers on a WireGuard mesh, Fedora throughout,
SELinux enforcing, ZFS, Podman with quadlets, internal step-ca PKI, minio
for object storage. Preferences: Rust for backend, Postgres for data, Vite
(react-swc-ts) for any frontend, OPNsense for routing, quantum-safe SSH/TLS
where supported. Strong preference for bare-metal or minimal-container
approaches; avoids Docker Compose; favours bespoke tooling over external
SaaS.
These aren't decoration — they constrain implementation choices below.
## Core design decisions (already made, do not re-litigate)
1. **Two-layer containers.** The runner container holds `act_runner`. It
mounts the host's Podman socket and lets `act_runner` spawn job
containers for the actual workflow steps. We do not use Podman-in-Podman
and we do not use `--no-job-container` mode.
2. **Ephemeral runners only.** Every runner is registered with `--ephemeral`
and run with `--once`. One job, one container, then exit. No reuse.
3. **Postgres is the source of truth.** Controller is stateless across
restarts. It rebuilds in-memory state from Postgres on boot and
reconciles against Gitea + agents.
4. **Spread placement.** Pick the least-loaded host (by active runner
count) with capacity. Random tie-break. No bin-packing, no affinity in
v1.
5. **Caches are external.** sccache→minio, panamax for cargo, verdaccio
(minio-backed) for npm. Runner images carry env vars; credentials come
in via the spawn RPC.
6. **Reconciliation loop, not pure event-driven.** A 5-10s poll of Gitea's
queue plus orphan reaping. Webhooks may be added later but the loop
stays authoritative.
7. **Hand-off to cichlid eventually.** Placement logic is encapsulated so
it can be replaced when the operator's `cichlid` orchestration project
is ready. Don't over-engineer the placement layer in v1.
If a proposal contradicts any of these, surface it explicitly and ask
before proceeding.
## Crate responsibilities
### `gongfoo-proto`
Shared types only. No business logic, no I/O.
- RPC request/response structs (`SpawnRequest`, `SpawnResponse`,
`TerminateRequest`, `HeartbeatReport`, `RunnerEvent`, etc.)
- The `RunnerState` enum: `Pending | Spawning | Registered | Running |
Completed | Failed | Reaped`
- Error types shared across controller and agent
- `serde` + appropriate derives. No `tokio`, no `sqlx`, no `reqwest`
dependencies here
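A minimal sketch of how the `RunnerState` enum above could look, with two helpers the controller would likely want. The terminal-state set and the allowed transitions are assumptions made for illustration; this brief does not specify them. The real type would also carry `serde` derives, which are omitted so the sketch has no dependencies.

```rust
/// Lifecycle states as listed in this brief.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunnerState {
    Pending,
    Spawning,
    Registered,
    Running,
    Completed,
    Failed,
    Reaped,
}

impl RunnerState {
    /// Assumption: Completed, Failed and Reaped are the terminal states.
    pub fn is_terminal(self) -> bool {
        matches!(self, Self::Completed | Self::Failed | Self::Reaped)
    }

    /// Assumption: the happy path is strictly linear, and any non-terminal
    /// state may move straight to Failed or Reaped.
    pub fn can_transition_to(self, next: RunnerState) -> bool {
        use RunnerState::*;
        match (self, next) {
            (Pending, Spawning)
            | (Spawning, Registered)
            | (Registered, Running)
            | (Running, Completed) => true,
            (from, Failed | Reaped) => !from.is_terminal(),
            _ => false,
        }
    }
}
```

Keeping the transition check next to the enum means the controller and the agent would reject the same illegal state changes without sharing any business logic.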
### `gongfoo-controller`
Long-running daemon. Single binary, multiple internal subsystems:
- **Gitea client** — polls queued jobs, fetches runner registration
tokens via the admin API, reads runner status. Use `reqwest`. Token
comes from config / env, never logged.
- **Postgres layer** — `sqlx` with compile-time-checked queries.
Migrations live in `gongfoo-migrations` and are embedded.
- **Reconciler ("brew loop")** — runs every 5s. For each label-set:
compute desired = queued + small_buffer, compare to current
non-terminal runners, spawn or skip. See README's pseudocode for the
shape.
- **Placement** — `trait Placer` with one method, `pick_host(&self,
image: &Image, hosts: &[HostState]) -> Option<HostId>`. v1 impl is
`SpreadPlacer`. Easy to swap later.
- **Agent client** — outbound RPCs to agents. Authenticated. Use mTLS
with the operator's step-ca; the controller has a client cert, each
agent has a server cert.
- **Reaper** — sub-loop. Handles stuck-spawning, stuck-registered,
silent-host scenarios. See README pseudocode.
- **Hysteresis** — `queue_observations` table. Don't scale up on a
single observation; require N consecutive observations above
threshold. Same for scale-down (though scale-down is mostly passive,
since runners are ephemeral).
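The reconciler and hysteresis bullets above can be sketched together as one tick of the brew loop for a single label-set. This is an illustration under stated assumptions, not the intended implementation: "above threshold" is taken to mean "there is a shortfall", the streak lives in memory rather than in the `queue_observations` table, and the names `Hysteresis`, `runners_to_spawn` and `tick` are hypothetical.

```rust
/// Counts consecutive observations that were above the threshold.
pub struct Hysteresis {
    required: u32,
    streak: u32,
}

impl Hysteresis {
    pub fn new(required: u32) -> Self {
        Self { required, streak: 0 }
    }

    /// Records one observation. Returns true once `required` consecutive
    /// observations have been above the threshold; any observation below
    /// the threshold resets the streak.
    pub fn observe(&mut self, above_threshold: bool) -> bool {
        self.streak = if above_threshold {
            self.streak.saturating_add(1)
        } else {
            0
        };
        self.streak >= self.required
    }
}

/// desired = queued + buffer; the result is the shortfall against the
/// current non-terminal runners, never negative.
pub fn runners_to_spawn(queued: u32, buffer: u32, non_terminal: u32) -> u32 {
    queued.saturating_add(buffer).saturating_sub(non_terminal)
}

/// One brew-loop tick: returns how many runners to spawn right now.
pub fn tick(h: &mut Hysteresis, queued: u32, buffer: u32, non_terminal: u32) -> u32 {
    let shortfall = runners_to_spawn(queued, buffer, non_terminal);
    if h.observe(shortfall > 0) {
        shortfall
    } else {
        0
    }
}
```

With `Hysteresis::new(2)`, the first tick that sees a shortfall spawns nothing and the second consecutive one spawns the full shortfall. The downside is one extra poll interval of latency before scale-up.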
### `gongfoo-agent`
Long-running daemon, one per runner host. Privileged enough to drive
Podman.
- **RPC server** — receives `SpawnRequest`, `TerminateRequest`,
`HealthCheck` from controller. mTLS with step-ca-issued server cert.
- **Podman driver** — uses the `podman-api` crate against
`/run/podman/podman.sock` (rootful). Generates container specs per
spawn request, mounts the Podman socket into the container,
applies CPU/memory limits, sets env from the spawn request.
- **Event watcher** — subscribes to Podman events, translates them into
`RunnerEvent` messages (`Started`, `PickedUpJob`, `Exited`) sent back
to the controller.
- **Capacity reporter** — periodic heartbeat with current CPU/memory
usage and active runner count.
- **GC** — periodic sweep for exited containers and stale volumes (both
should be rare given `--rm`, but defensive cleanup matters).
The agent does *not* talk to Gitea directly. All Gitea interaction goes
through the controller.
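To make the Podman driver's job concrete, the sketch below maps a spawn request onto the equivalent `podman run` arguments. The real agent is meant to use the `podman-api` crate rather than the CLI, so treat this as a description of the container spec, not of the driver. `SpawnSpec` and its fields are hypothetical; the actual `SpawnRequest` shape is not defined in this brief.

```rust
/// Hypothetical subset of a spawn request.
pub struct SpawnSpec {
    pub image: String, // expected to be referenced by digest
    pub cpus: f32,
    pub memory_mb: u64,
    pub env: Vec<(String, String)>,
}

/// Rootful Podman socket path named in this brief.
pub const PODMAN_SOCK: &str = "/run/podman/podman.sock";

/// Builds the argument list for a CLI-equivalent `podman run`.
pub fn podman_run_args(spec: &SpawnSpec) -> Vec<String> {
    let mut args = vec![
        "run".to_string(),
        "--rm".to_string(),
        "--detach".to_string(),
        format!("--cpus={}", spec.cpus),
        format!("--memory={}m", spec.memory_mb),
        // Mount the host socket at the same path inside the container.
        format!("--volume={0}:{0}", PODMAN_SOCK),
    ];
    for (key, value) in &spec.env {
        args.push(format!("--env={key}={value}"));
    }
    // The image comes last; everything before it is an option.
    args.push(spec.image.clone());
    args
}
```

Two caveats if this were ever used literally: environment values passed on a command line are visible in the process list, which is one more reason to go through the API, and on SELinux-enforcing hosts the socket mount will probably need an explicit labelling decision.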
### `gongfoo-cli` (binary `gfoo`)
Operator tool. Talks to the controller's admin API.
Required commands for v1:
- `gfoo runners list [--state STATE] [--host HOST]`
- `gfoo hosts list`
- `gfoo hosts drain <hostname>` — stop placing new runners; existing
runners finish naturally
- `gfoo hosts undrain <hostname>`
- `gfoo images list`
- `gfoo spawn --image NAME --host HOSTNAME` — manual spawn for testing
- `gfoo tail <runner-id>` — stream logs from the agent
Use `clap` derive. Output defaults to a human-readable table; pass
`--json` for machine-readable output.
### `gongfoo-migrations`
Just sqlx migration `.sql` files plus a tiny `embed!` helper the
controller calls on startup. Schema is in the README/architecture notes
and should be implemented faithfully on first pass.
## Bootstrap problem
Building the runner images requires runners. Strategy: keep one or two
existing bare-metal `act_runner` instances alive, labelled
`runner-builder`, dedicated to image builds via Gitea Actions. Once
`gongfoo` is stable, migrate image builds onto `gongfoo` itself and retire
the legacy runners.
Image build pipeline lives in a separate repo (`runner-images` or
similar), not in this workspace. This workspace produces the controller,
agent, CLI, and the `Containerfile`s under `images/`.
## Implementation order (suggested)
1. `gongfoo-proto` — types and RPC contracts.
2. Migrations — get the schema written and reviewable.
3. Single-host happy path: agent that can spawn one runner from a
hard-coded `SpawnRequest`, registering against a real Gitea instance.
No controller yet; drive it from a test harness or CLI.
4. Controller skeleton: Postgres connection, Gitea poll, reconciler stub
that just logs decisions instead of acting.
5. Wire controller→agent: real spawn RPCs, mTLS.
6. Reaper.
7. CLI.
8. Multi-host: spread placement, capacity tracking.
9. Hysteresis.
10. Image `Containerfile`s and the bootstrap workflow.
Don't try to land everything at once. Each step above should be an
independently reviewable change.
## Things to actively avoid in v1
- Web UI of any kind (CLI only).
- Prometheus/metrics shipping (rely on `journalctl` + Postgres queries
for now).
- Per-job log shipping to a central store (agent tails to local journal;
CLI tails on demand).
- Multi-tenancy or per-org quotas.
- Runner pre-warming / pools (the whole point is on-demand ephemerals).
- Kubernetes-anything.
- Dynamic image building inside the controller. Images are built
out-of-band and referenced by digest.
- Anything that requires Docker Compose or Helm.
## Coding standards
- Edition 2021 or 2024 — use whatever's current at implementation time.
- `tokio` for async runtime. `tracing` for logging (structured, JSON
output in production, pretty in dev).
- `sqlx` with `query!`/`query_as!` macros for compile-time checking.
Offline mode (`sqlx prepare`) for CI.
- `reqwest` with `rustls-tls` (not `native-tls`).
- `clap` derive for CLI argument parsing.
- `thiserror` for library error types, `anyhow` only at binary
boundaries.
- `serde` with `serde_json` for RPC bodies; consider `prost` + `tonic`
later if RPC volume justifies it, but not for v1.
- One workspace `Cargo.lock`. `resolver = "2"` (or "3" if available)
in the workspace root.
- All warnings fixed; `clippy` clean on `--all-targets`.
## Security notes
- The agent is the only privileged component. The controller is
unprivileged.
- Gitea registration tokens are single-use, short-lived, generated per
spawn. Never logged. Never persisted beyond the spawn RPC.
- mTLS on controller↔agent traffic. Both sides verify the peer's cert
against the step-ca root.
- minio credentials for sccache are passed in spawn RPC env, not baked
into images. They're scoped to the sccache bucket only.
- The host Podman socket inside a runner container is a privileged
mount. The runner image must contain only `act_runner` and the
trusted toolchain — no user-controlled code at image-build time.
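One way to make "never logged" hard to violate by accident is a newtype whose `Debug` output is redacted. The sketch below is an assumption about how this could be done, not an existing type in the workspace; `RegistrationToken` and `SpawnRequestSketch` are hypothetical names.

```rust
use std::fmt;

/// Wraps a Gitea registration token so it cannot leak through `{:?}`.
/// There is deliberately no `Display` impl.
pub struct RegistrationToken(String);

impl RegistrationToken {
    pub fn new(raw: impl Into<String>) -> Self {
        Self(raw.into())
    }

    /// The only way to read the value; call sites are easy to grep for.
    pub fn expose(&self) -> &str {
        &self.0
    }
}

impl fmt::Debug for RegistrationToken {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("RegistrationToken([redacted])")
    }
}

/// Hypothetical request type: deriving `Debug` is safe because the token
/// field redacts itself.
#[derive(Debug)]
pub struct SpawnRequestSketch {
    pub image: String,
    pub token: RegistrationToken,
}
```

This only covers formatting. Serialising the request for the spawn RPC still has to emit the raw value, so anything that logs RPC bodies needs its own redaction.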
## Communication style
The operator wants direct technical responses. No hedged framing, no
unnecessary preamble. Critical warnings (data loss, security
implications) precede destructive commands. If a design choice has a
real downside, name it instead of glossing over.
When uncertain about an unstated decision, ask rather than guess. A
short `ask_user_input_v0`-style enumeration of options is preferred over
a long monologue exploring alternatives.

100
Cargo.lock generated Normal file
@@ -0,0 +1,100 @@
# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4
[[package]]
name = "gongfoo-agent"
version = "0.1.0"
dependencies = [
"gongfoo-proto",
]
[[package]]
name = "gongfoo-cli"
version = "0.1.0"
dependencies = [
"gongfoo-proto",
]
[[package]]
name = "gongfoo-controller"
version = "0.1.0"
dependencies = [
"gongfoo-proto",
]
[[package]]
name = "gongfoo-migrations"
version = "0.1.0"
[[package]]
name = "gongfoo-proto"
version = "0.1.0"
dependencies = [
"serde",
]
[[package]]
name = "proc-macro2"
version = "1.0.106"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934"
dependencies = [
"unicode-ident",
]
[[package]]
name = "quote"
version = "1.0.45"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924"
dependencies = [
"proc-macro2",
]
[[package]]
name = "serde"
version = "1.0.228"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e"
dependencies = [
"serde_core",
"serde_derive",
]
[[package]]
name = "serde_core"
version = "1.0.228"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad"
dependencies = [
"serde_derive",
]
[[package]]
name = "serde_derive"
version = "1.0.228"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "syn"
version = "2.0.117"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99"
dependencies = [
"proc-macro2",
"quote",
"unicode-ident",
]
[[package]]
name = "unicode-ident"
version = "1.0.24"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75"

17
Cargo.toml Normal file
@@ -0,0 +1,17 @@
[workspace]
resolver = "2"
members = [
"crates/gongfoo-proto",
"crates/gongfoo-controller",
"crates/gongfoo-agent",
"crates/gongfoo-cli",
"crates/gongfoo-migrations",
]
[workspace.package]
version = "0.1.0"
edition = "2024"
license = "MIT"
[workspace.dependencies]
gongfoo-proto = { path = "crates/gongfoo-proto" }

@@ -0,0 +1,7 @@
[package]
name = "gongfoo-agent"
version.workspace = true
edition.workspace = true
[dependencies]
gongfoo-proto.workspace = true

@@ -0,0 +1,3 @@
fn main() {
    println!("gongfoo-agent");
}

@@ -0,0 +1,11 @@
[package]
name = "gongfoo-cli"
version.workspace = true
edition.workspace = true
[[bin]]
name = "gfoo"
path = "src/main.rs"
[dependencies]
gongfoo-proto.workspace = true

@@ -0,0 +1,3 @@
fn main() {
    println!("gfoo");
}

@@ -0,0 +1,7 @@
[package]
name = "gongfoo-controller"
version.workspace = true
edition.workspace = true
[dependencies]
gongfoo-proto.workspace = true

@@ -0,0 +1,3 @@
fn main() {
    println!("gongfoo-controller");
}

@@ -0,0 +1,6 @@
[package]
name = "gongfoo-migrations"
version.workspace = true
edition.workspace = true
[dependencies]

@@ -0,0 +1,3 @@
pub fn hello() {
    println!("gongfoo-migrations");
}

@@ -0,0 +1,7 @@
[package]
name = "gongfoo-proto"
version.workspace = true
edition.workspace = true
[dependencies]
serde = { version = "1", features = ["derive"] }

@@ -0,0 +1,3 @@
pub fn hello() {
    println!("gongfoo-proto");
}

120
readme.md Normal file
@@ -0,0 +1,120 @@
# gongfoo
Ephemeral, autoscaling Gitea Actions runners on Podman. Written in Rust.
`gongfoo` watches the Gitea job queue and spawns single-use containerised
runners across a fleet of hosts on demand. Each runner picks up exactly one
job, executes it, and exits. The controller maintains desired capacity per
runner-image / label-set, balanced across hosts via a spread-placement
strategy.
The name is a deliberate misspelling of *gongfu* — the tea ceremony style of
many small, precise, repeated brews. That's what the runners are: small,
precise, ephemeral.
## Why
The previous setup ran multiple long-lived `act_runner` processes per host
via templated systemd units on bare metal. That works but is rigid: capacity
is fixed, distro/version variants require static provisioning, and there's
shared state between jobs. `gongfoo` replaces that with on-demand
containerised runners that scale with queue depth and isolate every job in
a fresh container.
## Architecture
```
┌─────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  Gitea API  │◄────│    controller    │────►│  agent (per host)   │
│   (queue)   │     │   (reconciler)   │     │                     │
└─────────────┘     └──────────────────┘     └─────────────────────┘
                             │                          │
                             ▼                          ▼
                    ┌──────────────────┐     ┌─────────────────────┐
                    │    PostgreSQL    │     │ Podman + ephemeral  │
                    │  (runner state)  │     │     act_runner      │
                    └──────────────────┘     └─────────────────────┘
```
- **controller** polls Gitea for queued jobs, reconciles desired vs actual
runner count per (image, label-set), and either instructs agents to spawn
runners or flags orphans for reaping. Postgres is the source of truth.
- **agent** runs on each runner host, receives spawn RPCs, drives the local
Podman socket, watches container events, and reports lifecycle
transitions back to the controller.
- **runners** are short-lived Podman containers running `act_runner` in
`--once` / `--ephemeral` mode. Each container registers with Gitea, takes
one job, exits. The container has the host's Podman socket mounted so it
can spawn job containers for the actual workflow steps.
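The orphan handling mentioned in the controller bullet can be sketched as a pure classification function. The three reasons mirror the stuck-spawning, stuck-registered and silent-host scenarios named in `CLAUDE.md`; the type names, the timeout fields and the rule that a silent host takes precedence are assumptions made for illustration.

```rust
use std::time::Duration;

/// Non-terminal phases a runner can be stuck in (hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Spawning,
    Registered,
    Running,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReapReason {
    StuckSpawning,
    StuckRegistered,
    SilentHost,
}

/// Hypothetical timeouts; real values would come from controller config.
pub struct ReapLimits {
    pub spawning: Duration,
    pub registered: Duration,
    pub host_silence: Duration,
}

/// Decides whether a runner should be reaped, and why.
pub fn reap_reason(
    phase: Phase,
    in_phase_for: Duration,
    host_silent_for: Duration,
    limits: &ReapLimits,
) -> Option<ReapReason> {
    // A host that stopped sending heartbeats orphans every runner on it.
    if host_silent_for > limits.host_silence {
        return Some(ReapReason::SilentHost);
    }
    match phase {
        Phase::Spawning if in_phase_for > limits.spawning => Some(ReapReason::StuckSpawning),
        Phase::Registered if in_phase_for > limits.registered => {
            Some(ReapReason::StuckRegistered)
        }
        _ => None,
    }
}
```

Keeping the decision free of I/O makes it testable without Postgres or an agent; the caller would load the durations from the runner and host rows.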
## Workspace layout
```
gongfoo/
├── Cargo.toml                  # workspace root
├── crates/
│   ├── gongfoo-proto/          # shared types and RPC definitions
│   ├── gongfoo-controller/     # reconciler + Gitea/Postgres client
│   ├── gongfoo-agent/          # per-host daemon, Podman driver
│   ├── gongfoo-cli/            # gfoo: inspect, force spawn, drain hosts
│   └── gongfoo-migrations/     # sqlx migrations
└── images/
    ├── runner-base/            # base image: act_runner + CA + minimal tooling
    └── runner-rust/            # rust toolchain + sccache config
```
### Crates
| Crate | Binary | Purpose |
|---|---|---|
| `gongfoo-proto` | — | Shared request/response types, error enums, RPC trait definitions |
| `gongfoo-controller` | `gongfoo-controller` | Long-running daemon. Polls Gitea, reconciles, dispatches spawn RPCs |
| `gongfoo-agent` | `gongfoo-agent` | Long-running daemon on runner hosts. Receives RPCs, drives Podman |
| `gongfoo-cli` | `gfoo` | Operator CLI: list runners/hosts, drain a host, force-spawn, tail logs |
| `gongfoo-migrations` | — | sqlx migration files and an embed helper for the controller |
### Vocabulary
The codebase uses tea-ceremony terms where they're genuinely clearer than
generic alternatives, and plain English elsewhere. Specifically:
- **steep** — one runner's lifecycle (a single job from spawn to exit). The
`Steep` struct is the in-memory representation of a runner row.
- **pour** — dispatch a spawn request to a host's agent.
- **brew loop** — the reconciliation cycle.
`host`, `image`, `job`, `runner` are kept as plain terms.
## Placement strategy
Spread: pick the host with the fewest active runners that has capacity for
the requested image, breaking ties randomly. CPU and memory budgets are
checked against the host's reported totals.
This is intentionally simple. Once `cichlid` (a separate distributed
orchestration project) is ready, placement will be delegated to its rules
engine.
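A minimal sketch of spread placement, using the `Placer` signature from `CLAUDE.md`. The struct fields are hypothetical, capacity is modelled as free CPU and memory per host, and the random tie-break is injected as a closure so the sketch needs no RNG crate; a real implementation would probably draw from `rand`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HostId(pub u32);

/// Hypothetical resource request carried by an image definition.
pub struct Image {
    pub cpu_millis: u32,
    pub memory_mb: u32,
}

/// Hypothetical view of a host built from agent heartbeats.
pub struct HostState {
    pub id: HostId,
    pub active_runners: u32,
    pub free_cpu_millis: u32,
    pub free_memory_mb: u32,
}

pub trait Placer {
    fn pick_host(&self, image: &Image, hosts: &[HostState]) -> Option<HostId>;
}

/// `tie_break` receives the number of tied hosts and returns an index;
/// out-of-range values are wrapped with a modulo.
pub struct SpreadPlacer<F: Fn(usize) -> usize> {
    pub tie_break: F,
}

impl<F: Fn(usize) -> usize> Placer for SpreadPlacer<F> {
    fn pick_host(&self, image: &Image, hosts: &[HostState]) -> Option<HostId> {
        // Keep only hosts with room for this image.
        let eligible: Vec<&HostState> = hosts
            .iter()
            .filter(|h| {
                h.free_cpu_millis >= image.cpu_millis && h.free_memory_mb >= image.memory_mb
            })
            .collect();
        // `min()` is None when nothing is eligible, which ends the search.
        let fewest = eligible.iter().map(|h| h.active_runners).min()?;
        let tied: Vec<&HostState> = eligible
            .into_iter()
            .filter(|h| h.active_runners == fewest)
            .collect();
        let idx = (self.tie_break)(tied.len()) % tied.len();
        Some(tied[idx].id)
    }
}
```

Because the tie-break is a parameter, tests can pin the choice while production code stays random, and the whole `Placer` can be swapped out when placement moves to `cichlid`.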
## Caching
Build caches are external to the runners themselves:
- **sccache** → minio (S3 API). Configured via env vars baked into runner
images.
- **cargo registry** → panamax pull-through proxy on a host with disk.
- **npm registry** → verdaccio with `verdaccio-aws-s3-storage` against
minio.
Credentials are injected by the agent at spawn time, not baked into images.
## Status
Early development. Not yet usable.
## Repository
`git.lair.cafe/gongfoo/gongfoo`
## Licence
TBD.