docs: add CI deployment and internal-TLS guidance, cross-reference from generic

Add two new guidance documents alongside generic.md:

- deployment-gitea-actions.md: CI-driven deployment via a Gitea Actions
  workflow as an alternative to deploy.sh + manifest.yml (§7), with the
  workflow as the source of infra truth and a scoped gitea_ci runner user.
- internal-tls.md: provisioning and renewing per-service internal TLS
  certs (<service>.internal) for mesh-only nginx vhosts, extending the
  PKI conventions in §11.

Cross-reference both from generic.md and list them in readme.md. Also
add a "never suppress errors" rule to the deploy-script conventions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:43:18 +03:00
parent 83652460ed
commit 200c41b4f1
4 changed files with 365 additions and 0 deletions

deployment-gitea-actions.md (new file, 199 lines)

# Deployment via Gitea Actions
An alternative to the local `deploy.sh` + `manifest.yml` flow in `generic.md` §6 and §7.
Use this when deployment should be **CI-driven** — triggered by a push or a manual
dispatch, run on a Gitea Actions runner, and auditable in the Actions log — rather
than run by an operator from a workstation.

Both models coexist; pick per project:
- **`deploy.sh` (generic.md §7)** — operator-driven, runs from a workstation with the
operator's own ssh + `pass` access. Good for tightly-held apps, one-off targets, or
when no runner can reach the target hosts.
- **Gitea Actions (this doc)** — runner-driven, secrets in Gitea, no operator in the
loop. Good for anything that should redeploy on merge to `main` and for fleets a
runner can already reach over the WireGuard mesh.

The defining principle: **the workflow is the source of infra truth.** Hosts, ports,
paths, and component→host mapping live in the workflow YAML, not a `manifest.yml`.
There is no separate manifest in this model.

---
## 1. The deploy user: `gitea_ci`
The runner SSHes into each target as a dedicated **`gitea_ci`** system user — never
root, never the operator's account. On each target host `gitea_ci` has:
- a home dir (`/var/lib/gitea_ci`) and `~/.ssh/authorized_keys` containing the
runner's public key,
- membership in `systemd-journal` (so the workflow can capture
`journalctl -u <unit>` after a service start without a sudoers entry),
- a **scoped** `/etc/sudoers.d/<app>_gitea_ci` drop-in granting `NOPASSWD` for
exactly the commands the deploy runs — nothing broader.

Name the sudoers file `<app>_gitea_ci`, not bare `gitea_ci`, so multiple apps can
drop their own files on a shared host without clobbering each other.
### Scoped sudoers
Whitelist exact commands. For file pushes the workflow uses
`rsync --rsync-path='sudo rsync'`, so the remote rsync runs as root; pin each line to
one destination with a trailing literal path:
```
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /usr/local/bin/<app>
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/<app>/config.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl restart <app>.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/sbin/restorecon -R /usr/local/bin/<app> /etc/<app> /var/lib/<app>
gitea_ci ALL=(root) NOPASSWD: /usr/sbin/semanage port -l
gitea_ci ALL=(root) NOPASSWD: /usr/sbin/semanage port -a -t http_port_t -p tcp 8081
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service=<app> --permanent
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service=<app>
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --query-service=<app>
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload
```
- The `*` in an rsync line matches rsync's `--server …` argument vector across spaces;
the trailing literal destination is what actually bounds the rule.
- sudoers treats `:` and `=` as reserved; escape them (`\:`, `\=`) inside command
arguments or `visudo` rejects the file — common when whitelisting
`dnf config-manager addrepo --from-repofile=https://…`.
- Verify every drop-in with `visudo -cf` at install time. sudo refuses to run while any
  sudoers file has a syntax error, so an unchecked typo can lock everyone out of `sudo` on the host.

---
## 2. One-time host provisioning: `script/infra-setup.sh`
Everything `gitea_ci` needs is provisioned **once per host** by a
`script/infra-setup.sh`, run by the operator from a workstation (full sudo, not the
scoped account). It is idempotent and skips unreachable hosts so one offline node
doesn't block the rest. It:
1. creates the `gitea_ci` user (if missing) and its `~/.ssh`,
2. installs the runner's pubkey into `authorized_keys`
   (`rsync --chown gitea_ci:gitea_ci --chmod 0600 --rsync-path 'sudo rsync'`),
3. adds `gitea_ci` to `systemd-journal`,
4. installs the host-appropriate `/etc/sudoers.d/<app>_gitea_ci` drop-in and
   `visudo`-verifies it,
5. provisions any per-service internal TLS cert the app's nginx vhost needs
   (see `internal-tls.md`).

The runner's keypair is generated once (`ssh-keygen -t ed25519 -f ~/.ssh/id_gitea_ci`).
The **private** key becomes the `RSYNC_SSH_KEY` Gitea secret, and the public key is what
`infra-setup.sh` distributes.

Application config is **not** shipped by `infra-setup.sh` in this model — the workflow
renders it from Gitea secrets on every deploy (see §4). (`deploy.sh`-style apps that
keep config in `pass` follow the generic.md §7 model instead.)

---
## 3. Artifact delivery: two variants
**(a) RPM channel.** A separate `build-prerelease` workflow builds, packages, signs, and
publishes RPMs to an internal repo (e.g. `rpm.lair.cafe/unstable`). The deploy workflow
then only has to `dnf install` or `dnf upgrade` them. Best when the app already ships as
an RPM and the build is heavy (CUDA, vendored deps). The deploy workflow triggers on
`workflow_run: [build-prerelease] completed`.

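
The sketch below shows a minimal variant (a) trigger. It assumes that Gitea's `workflow_run` event follows GitHub Actions' syntax and payload, in particular the `github.event.workflow_run.conclusion` field. Check this against your Gitea release before relying on it.

The job-level `if:` matters because `workflow_run` fires whenever the upstream run completes, including when it fails. A few parts are illustrative only:
- `APP` (the package name) is hypothetical.
- `API_HOST` is assumed to come from the workflow-level `env:`.
- The `dnf upgrade` call assumes the package is already installed, and the sudoers drop-in would need a line for that exact command.
- SSH key setup is omitted.

```yaml
on:
  workflow_run:
    workflows: [build-prerelease]   # the RPM build/sign/publish workflow
    types: [completed]
  workflow_dispatch:                # manual re-run from the UI

jobs:
  deploy:
    # workflow_run also fires for failed builds; only deploy a green one
    # (or a manual dispatch, which has no upstream run to check).
    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }}
    runs-on: fedora-43
    env:
      APP: myapp                    # hypothetical package name
    steps:
      # SSH key / known_hosts setup omitted here.
      - name: Pull the new build from the internal repo
        run: ssh gitea_ci@"$API_HOST" sudo dnf -y upgrade --refresh "$APP"
```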
**(b) Build-and-rsync.** The deploy workflow's own `build` job produces the artifact
(e.g. a static binary + a static frontend bundle) and `rsync`s it straight to the
targets. Best when there's no packaging step. Build for a **statically-linked target**
(e.g. musl) so a runner newer than the target host doesn't produce a binary the target's
older glibc can't load — see §6.

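
The sketch below shows a variant (b) `build` job. It assumes a Rust project, which the `fmt`/`clippy` gate in §4 suggests, built for `x86_64-unknown-linux-musl`; Rust binaries for that target are statically linked by default. The following are hypothetical:
- the runner label `rust-musl`,
- the binary name `myapp`,
- the artifact name `app-binary`.

Pin whichever major versions of `actions/checkout` and `actions/upload-artifact` your Gitea instance supports.

```yaml
jobs:
  build:
    runs-on: rust-musl             # hypothetical toolchain-runner label (see §5)
    env:
      APP: myapp                   # hypothetical binary name
      TARGET: x86_64-unknown-linux-musl
    steps:
      - uses: actions/checkout@v4
      # Lint/test gate: a commit that fails any of these never reaches a deploy job.
      - run: cargo fmt --check
      - run: cargo clippy --all-targets -- -D warnings
      - run: cargo test
      # musl build: no runtime dependency on the target host's glibc version.
      - run: cargo build --release --target "$TARGET"
      - uses: actions/upload-artifact@v3
        with:
          name: app-binary
          path: target/${{ env.TARGET }}/release/${{ env.APP }}
```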
---
## 4. The workflow
```yaml
on:
  push: { branches: [main] }   # redeploy on merge
  workflow_dispatch:           # manual re-run from the UI
concurrency:                   # serialize deploys; never half-apply two at once
  group: deploy
  cancel-in-progress: false
env:
  # --- infra truth: hosts, ports, paths live here, not in a manifest ---
  API_HOST: <host>.<site>.internal
  API_PORT: "8081"
  DEPLOY_KEY: |
    ${{ secrets.RSYNC_SSH_KEY }}
```
Jobs follow a **build → deploy-per-component** shape:
- **build** — runs the lint/test gate (`fmt`, `clippy -D warnings`, `test`) so a broken
  commit never deploys, then produces and uploads the artifact(s).
- **deploy-\<component\>** (`needs: build`) — one job per component/host. Each:
  1. writes the SSH key from `DEPLOY_KEY`, then runs `ssh gitea_ci@$HOST hostname -f` as a
     reachability/auth check (`StrictHostKeyChecking=accept-new`);
  2. renders config from secrets (use a literal substitution — `python3` `.replace()` or
     `envsubst` — so secrets with shell-special characters survive);
  3. `rsync`s the artifact, config, systemd unit, and any firewalld/SELinux assets
     (`--rsync-path='sudo rsync'`, `--chown`/`--chmod` to set ownership and mode in
     transit, `--mkpath` — see §6);
  4. applies system state over `ssh` with the scoped `sudo` commands (sysusers,
     `restorecon`, `semanage`, firewalld, `daemon-reload`, `restart`);
  5. health-probes (HTTP for an API, `systemctl is-active` otherwise);
  6. captures the unit's startup journal with `if: always()` so a failed start still
     leaves a usable record.

Secrets to expect in the repo settings: `RSYNC_SSH_KEY` plus the app's config secrets
(API keys, tokens, etc.). Non-secret values stay inline in `env:`.

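
The six deploy steps above can be sketched as a single job. This is an illustration under stated assumptions, not a drop-in workflow.

Values taken from the document and the workflow-level `env:` block shown earlier:
- `API_HOST`, `API_PORT`, and `DEPLOY_KEY` come from that block.
- `APP: myapp` stands in for `<app>`.

Hypothetical pieces:
- the `API_TOKEN` secret,
- the `asset/config.toml.in` template,
- the `app-binary` artifact,
- the `/health` endpoint,
- an existing `myapp` group on the target, for the config's ownership.

`envsubst` (from gettext) is assumed to be baked into the runner image. Pushing the systemd unit and firewalld/SELinux assets follows the same `rsync` pattern, with each destination needing its own sudoers line.

```yaml
jobs:
  # build: lint/test gate + upload-artifact (omitted here)
  deploy-api:
    needs: build
    runs-on: fedora-43
    env:
      APP: myapp                                  # hypothetical; stands in for <app>
    steps:
      - uses: actions/checkout@v4                 # for the asset/ templates
      - uses: actions/download-artifact@v3
        with:
          name: app-binary
          path: dist
      - name: SSH key and reachability check
        run: |
          install -d -m 0700 ~/.ssh
          printf '%s\n' "$DEPLOY_KEY" > ~/.ssh/id_deploy
          chmod 0600 ~/.ssh/id_deploy
          # A Host entry lets plain ssh and rsync pick up user, key and host-key policy.
          cat > ~/.ssh/config <<EOF
          Host $API_HOST
            User gitea_ci
            IdentityFile ~/.ssh/id_deploy
            StrictHostKeyChecking accept-new
          EOF
          ssh "$API_HOST" hostname -f
      - name: Render config from secrets
        env:
          API_TOKEN: ${{ secrets.API_TOKEN }}     # hypothetical secret name
        run: |
          # Only the listed variables are replaced, and their values are copied
          # verbatim, so quotes, $ or backslashes in a secret survive unchanged.
          envsubst '$API_TOKEN $API_PORT' < asset/config.toml.in > config.toml
      - name: Push binary and config
        run: |
          rsync -a --rsync-path='sudo rsync' --mkpath --chown=root:root --chmod=0755 \
            dist/"$APP" "$API_HOST":/usr/local/bin/"$APP"
          rsync -a --rsync-path='sudo rsync' --mkpath --chown=root:"$APP" --chmod=0640 \
            config.toml "$API_HOST":/etc/"$APP"/config.toml
      - name: Apply system state
        run: |
          # Each sudo command must match a line in the scoped drop-in exactly.
          ssh "$API_HOST" sudo restorecon -R /usr/local/bin/"$APP" /etc/"$APP" /var/lib/"$APP"
          ssh "$API_HOST" sudo systemctl daemon-reload
          ssh "$API_HOST" sudo systemctl restart "$APP".service
      - name: Health probe
        run: |
          curl -fsS --retry 10 --retry-connrefused --retry-delay 2 \
            "http://$API_HOST:$API_PORT/health"
      - name: Startup journal
        if: always()
        run: ssh "$API_HOST" journalctl -u "$APP".service -n 200 --no-pager
```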
---
## 5. Runner images
Deploy jobs run on a generic runner (e.g. `runs-on: fedora-43`); build jobs run on a
toolchain runner. **Bake build dependencies into the runner image, not into the
workflow** — a deploy workflow should never `dnf install` at run time (runners may run
unprivileged, and per-run installs are slow and flaky). If a build needs a tool the
image lacks, add it to the image (the `gongfoo/images/*` Containerfiles) and rebuild.

For static cross-compilation specifically, the runner's toolchain must include the
cross target's std. Where a distro's packaged compiler can't load a foreign std (e.g.
Fedora's distro `rustc` rejects any std it didn't build, and Fedora ships no musl std),
provision the toolchain via its own version manager (`rustup` + `rustup target add …`)
so compiler and std always match.

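
One way to enforce "bake it into the image" from the workflow side is a cheap preflight step. It fails loudly when the image is missing a piece, instead of installing anything.

This is a minimal sketch that assumes a `rustup`-managed toolchain and the musl target from §3; the runner label is hypothetical. The step uses `grep -x ... >/dev/null` rather than `grep -q` so that `grep` reads the whole input. That avoids a spurious SIGPIPE failure under `pipefail`.

```yaml
jobs:
  build:
    runs-on: rust-musl    # hypothetical label for the toolchain runner image
    steps:
      - name: Check the image provides the musl std
        run: |
          # No rustup target add or dnf install here: a miss means the runner
          # image needs a rebuild, not a per-run workaround.
          if ! rustup target list --installed | grep -x x86_64-unknown-linux-musl >/dev/null; then
            echo "runner image lacks x86_64-unknown-linux-musl; add it to the Containerfile" >&2
            exit 1
          fi
```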
---
## 6. Gotchas worth pre-empting
These cost a deploy round-trip each; encode them up front.
- **`rsync` won't create a missing destination directory** for a single-file copy. On
  Fedora, `/etc/sysusers.d` and `/etc/firewalld/services` don't exist by default (only
  the `/usr/lib` variants ship). Add `--mkpath` (rsync 3.2.3 or newer) to file pushes so
  the root-side rsync creates the parent.
- **firewalld only learns a freshly-shipped custom service after `--reload`.** Querying
  or adding it to the runtime before reloading fails `INVALID_SERVICE`. Order:
  rsync the XML → `firewall-cmd --reload` → `--query-service` → (if absent)
  `--add-service --permanent` → `--reload`.
- **nginx `sites-enabled` holds only symlinks.** rsync vhost configs to
  `sites-available/` and `ln -sfn` them into `sites-enabled/`. (The include is often
  one level removed: `nginx.conf` → `conf.d/*.conf` → a `sites-enabled.conf` that does
  `include …/sites-enabled/*.conf;`.) Run `nginx -t` before reloading. A bad config left
  on disk also breaks unrelated `systemctl reload nginx` calls (e.g. cert-renewal hooks).
- **glibc skew:** a binary built on a newer host won't run on an older target. Build
static (musl) or build on a runner no newer than the oldest target.
- **`semanage port -a` on an already-labelled port** prints "already defined, modifying
instead" and reassigns; guard with `semanage port -l | grep` so re-runs are no-ops.
- **`Type=notify` units** must actually send `READY=1` (`sd_notify`), or
  `systemctl restart` blocks until `TimeoutStartSec` expires and the start then fails.
  The daemon should send it only after it finishes binding, so an `active` unit is
  actually listening.

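
Several of these gotchas can be encoded directly in one deploy step. The sketch below makes these assumptions:
- The job has already set up SSH so that plain `ssh "$API_HOST"` and `rsync` reach `gitea_ci` with the deploy key, for example through a `~/.ssh/config` Host entry.
- `APP`, `API_HOST`, and `API_PORT` are set as env vars.
- `asset/myapp.xml` is a hypothetical firewalld service file.
- The vhost paths `/etc/nginx/sites-available` and `/etc/nginx/sites-enabled` are illustrative.

The `firewall-cmd` and `semanage` calls match the §1 drop-in verbatim. The firewalld XML push and the `ln`/`nginx` calls would each need sudoers lines of their own.

```yaml
- name: Firewalld, SELinux and nginx state (idempotent)
  run: |
    r() { ssh "$API_HOST" "$@"; }   # run one command on the target as gitea_ci

    # /etc/firewalld/services may not exist yet: --mkpath creates it.
    rsync -a --rsync-path='sudo rsync' --mkpath --chown=root:root --chmod=0644 \
      asset/"$APP".xml "$API_HOST":/etc/firewalld/services/"$APP".xml

    # firewalld only knows the new service after a reload; enable it once.
    r sudo firewall-cmd --reload
    if ! r sudo firewall-cmd --query-service="$APP"; then
      r sudo firewall-cmd --add-service="$APP" --permanent
      r sudo firewall-cmd --reload
    fi

    # semanage port -a re-labels an existing entry, so only add when missing.
    if ! r sudo semanage port -l \
        | grep -E '^http_port_t[[:space:]]+tcp[[:space:]]' | grep -qw "$API_PORT"; then
      r sudo semanage port -a -t http_port_t -p tcp "$API_PORT"
    fi

    # nginx: enable via symlink, validate, and only then reload.
    r sudo ln -sfn /etc/nginx/sites-available/"$APP".conf /etc/nginx/sites-enabled/"$APP".conf
    r sudo nginx -t
    r sudo systemctl reload nginx
```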
---
## 7. Relationship to generic.md
This model reuses generic.md §8 (service accounts, hardened units), §9 (named firewalld
services), §10 (SELinux), and §11 (PKI/cert paths) unchanged — it only swaps *how* the
assets get onto the host (CI + `gitea_ci` + scoped sudo, instead of `deploy.sh` +
operator sudo). The `asset/` layout from generic.md §6 is the same, minus `manifest.yml`.
Per-service TLS for mesh-only nginx vhosts is covered in `internal-tls.md`.