docs: add CI deployment and internal-TLS guidance, cross-reference from generic

Add two new guidance documents alongside generic.md:

- deployment-gitea-actions.md: CI-driven deployment via a Gitea Actions
  workflow as an alternative to deploy.sh + manifest.yml (§7), with the
  workflow as the source of infra truth and a scoped gitea_ci runner user.
- internal-tls.md: provisioning and renewing per-service internal TLS
  certs (<service>.internal) for mesh-only nginx vhosts, extending the
  PKI conventions in §11.

Cross-reference both from generic.md and list them in readme.md. Also
add a "never suppress errors" rule to the deploy-script conventions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:43:18 +03:00
parent 83652460ed
commit 200c41b4f1
4 changed files with 365 additions and 0 deletions

deployment-gitea-actions.md Normal file

@@ -0,0 +1,199 @@
# Deployment via Gitea Actions
An alternative to the local `deploy.sh` + `manifest.yml` flow in `generic.md` §6–§7.
Use this when deployment should be **CI-driven** — triggered by a push or a manual
dispatch, run on a Gitea Actions runner, and auditable in the Actions log — rather
than run by an operator from a workstation.
Both models coexist; pick per project:
- **`deploy.sh` (generic.md §7)** — operator-driven, runs from a workstation with the
operator's own ssh + `pass` access. Good for tightly-held apps, one-off targets, or
when no runner can reach the target hosts.
- **Gitea Actions (this doc)** — runner-driven, secrets in Gitea, no operator in the
loop. Good for anything that should redeploy on merge to `main` and for fleets a
runner can already reach over the WireGuard mesh.
The defining principle: **the workflow is the source of infra truth.** Hosts, ports,
paths, and component→host mapping live in the workflow YAML, not a `manifest.yml`.
There is no separate manifest in this model.
---
## 1. The deploy user: `gitea_ci`
The runner SSHes into each target as a dedicated **`gitea_ci`** system user — never
root, never the operator's account. On each target host `gitea_ci` has:
- a home dir (`/var/lib/gitea_ci`) and `~/.ssh/authorized_keys` containing the
runner's public key,
- membership in `systemd-journal` (so the workflow can capture
`journalctl -u <unit>` after a service start without a sudoers entry),
- a **scoped** `/etc/sudoers.d/<app>_gitea_ci` drop-in granting `NOPASSWD` for
exactly the commands the deploy runs — nothing broader.
Name the sudoers file `<app>_gitea_ci`, not bare `gitea_ci`, so multiple apps can
drop their own files on a shared host without clobbering each other.
### Scoped sudoers
Whitelist exact commands. For file pushes the workflow uses
`rsync --rsync-path='sudo rsync'`, so the remote rsync runs as root; pin each line to
one destination with a trailing literal path:
```
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /usr/local/bin/<app>
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/<app>/config.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl restart <app>.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/sbin/restorecon -R /usr/local/bin/<app> /etc/<app> /var/lib/<app>
gitea_ci ALL=(root) NOPASSWD: /usr/sbin/semanage port -l
gitea_ci ALL=(root) NOPASSWD: /usr/sbin/semanage port -a -t http_port_t -p tcp 8081
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service=<app> --permanent
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service=<app>
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --query-service=<app>
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload
```
- The `*` in an rsync line matches rsync's `--server …` argument vector across spaces;
the trailing literal destination is what actually bounds the rule.
- sudoers treats `:` and `=` as reserved; escape them (`\:`, `\=`) inside command
arguments or `visudo` rejects the file — common when whitelisting
`dnf config-manager addrepo --from-repofile=https://…`.
- Verify every drop-in with `visudo -cf` at install time so a typo can't lock the host.
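The `visudo -cf` check can be folded into the install step so an invalid drop-in never reaches `/etc/sudoers.d`. A minimal sketch, assuming it runs as root on the target host; the function name `install_sudoers_dropin` and its optional destination-directory argument are hypothetical, not part of any existing script:

```bash
#!/usr/bin/env bash
set -euo pipefail

# install_sudoers_dropin SRC APP [DEST_DIR]   (hypothetical helper)
# Validates SRC with `visudo -cf` and only then installs it as
# DEST_DIR/<APP>_gitea_ci with mode 0440. DEST_DIR defaults to /etc/sudoers.d.
install_sudoers_dropin() {
  local src=$1 app=$2 dest_dir=${3:-/etc/sudoers.d}
  local dest="$dest_dir/${app}_gitea_ci"

  # Check the candidate file before it can affect sudo on the host.
  if ! visudo -cf "$src"; then
    echo "refusing to install $dest: $src failed visudo -cf" >&2
    return 1
  fi
  install -m 0440 "$src" "$dest"
  echo "installed $dest"
}
```

Validating the source file rather than the installed one means a typo is rejected before anything under `/etc/sudoers.d` changes.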
---
## 2. One-time host provisioning: `script/infra-setup.sh`
Everything `gitea_ci` needs is provisioned **once per host** by a
`script/infra-setup.sh`, run by the operator from a workstation (full sudo, not the
scoped account). It is idempotent and skips past unreachable hosts so one offline node
doesn't block the rest. It:
1. creates the `gitea_ci` user (if missing) and its `~/.ssh`,
2. installs the runner's pubkey into `authorized_keys`
(`rsync --chown gitea_ci:gitea_ci --chmod 0600 --rsync-path 'sudo rsync'`),
3. adds `gitea_ci` to `systemd-journal`,
4. installs the host-appropriate `/etc/sudoers.d/<app>_gitea_ci` drop-in and
`visudo`-verifies it,
5. provisions any per-service internal TLS cert the app's nginx vhost needs
(see `internal-tls.md`).
The runner's keypair is generated once (`ssh-keygen -t ed25519 -f ~/.ssh/id_gitea_ci`);
the **private** key becomes the `RSYNC_SSH_KEY` Gitea secret, the public key is what
`infra-setup.sh` distributes.
Application config is **not** shipped by `infra-setup.sh` in this model; the workflow
renders it from Gitea secrets on every deploy (see §4). (`deploy.sh`-style apps that
keep config in `pass` follow the `generic.md` §7 model instead.)
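The "skip past unreachable hosts" behaviour might look like the loop below. This is a sketch under assumptions: `provision_host` is a hypothetical stand-in for the numbered steps above, and the `ssh` options are one reasonable way to fail fast, not necessarily what `infra-setup.sh` uses.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the per-host provisioning steps.
provision_host() {
  local host=$1
  echo "provisioning $host"
}

# Provisions every reachable host and reports, visibly, the ones it skipped.
provision_all() {
  local host skipped=()
  for host in "$@"; do
    # BatchMode avoids hanging on a password prompt; ConnectTimeout bounds the wait.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
      provision_host "$host" || return
    else
      echo "skipping $host: unreachable over ssh" >&2
      skipped+=("$host")
    fi
  done
  if [ "${#skipped[@]}" -gt 0 ]; then
    echo "skipped hosts: ${skipped[*]}" >&2
  fi
}
```

The skip is reported on stderr rather than silenced, and a failure inside `provision_host` on a reachable host still aborts the run.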
---
## 3. Artifact delivery: two variants
**(a) RPM channel.** A separate `build-prerelease` workflow builds, packages, signs, and
publishes RPMs to an internal repo (e.g. `rpm.lair.cafe/unstable`); the deploy workflow
just `dnf install/upgrade`s them. Best when the app already ships as an RPM and the
build is heavy (CUDA, vendored deps). The deploy workflow triggers on
`workflow_run: [build-prerelease] completed`.
**(b) Build-and-rsync.** The deploy workflow's own `build` job produces the artifact
(e.g. a static binary + a static frontend bundle) and `rsync`s it straight to the
targets. Best when there's no packaging step. Build for a **statically-linked target**
(e.g. musl) so a runner newer than the target host doesn't produce a binary the target's
older glibc can't load — see §6.
---
## 4. The workflow
```yaml
on:
  push: { branches: [main] }   # redeploy on merge
  workflow_dispatch:           # manual re-run from the UI
concurrency:                   # serialize deploys; never half-apply two at once
  group: deploy
  cancel-in-progress: false
env:
  # --- infra truth: hosts, ports, paths live here, not in a manifest ---
  API_HOST: <host>.<site>.internal
  API_PORT: "8081"
  DEPLOY_KEY: |
    ${{ secrets.RSYNC_SSH_KEY }}
```
Jobs follow a **build → deploy-per-component** shape:
- **build** — runs the lint/test gate (`fmt`, `clippy -D warnings`, `test`) so a broken
commit never deploys, then produces and uploads the artifact(s).
- **deploy-\<component\>** (`needs: build`) — one job per component/host. Each:
1. writes the SSH key from `DEPLOY_KEY`, then `ssh gitea_ci@$HOST hostname -f` as a
reachability/auth check (`StrictHostKeyChecking=accept-new`);
2. renders config from secrets (use a literal substitution — `python3` `.replace()` or
`envsubst` — so secrets with shell-special characters survive);
3. `rsync`s the artifact, config, systemd unit, and any firewalld/SELinux assets
(`--rsync-path='sudo rsync'`, `--chown`/`--chmod` to set ownership in transit,
`--mkpath` — see §6);
4. applies system state over `ssh` with the scoped `sudo` commands (sysusers,
`restorecon`, `semanage`, firewalld, `daemon-reload`, `restart`);
5. health-probes (HTTP for an API, `systemctl is-active` otherwise);
6. captures the unit's startup journal with `if: always()` so a failed start still
leaves a usable record.
Secrets to expect in the repo settings: `RSYNC_SSH_KEY` plus the app's config secrets
(API keys, tokens, etc.). Non-secret values stay inline in `env:`.
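Step 2's literal substitution can also be done in pure bash, sketched here as a swapped-in variant of the `python3` `.replace()` / `envsubst` options named above. It assumes the workflow's templates use the same `{{VAR_NAME}}` placeholders as `generic.md`; `render_template` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
set -euo pipefail

# render_template TEMPLATE_FILE NAME...   (hypothetical helper)
# Replaces each literal {{NAME}} in the template with the value of the shell
# variable NAME and prints the result. Values are never re-parsed by the shell.
render_template() {
  local template_file=$1
  shift
  local text name pattern value
  text=$(<"$template_file")
  for name in "$@"; do
    if [ -z "${!name+set}" ]; then
      echo "render_template: variable $name is not set" >&2
      return 1
    fi
    pattern="{{${name}}}"
    value=${!name}
    # Quoting pattern and replacement keeps both literal: no globbing in the
    # pattern, and no special meaning for '&' in the value.
    text=${text//"$pattern"/"$value"}
  done
  printf '%s\n' "$text"
}
```

A deploy step could then run something like `render_template asset/config.toml.tmpl API_KEY > config.toml` (paths hypothetical) with `API_KEY` mapped from a Gitea secret in the step's `env:`.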
---
## 5. Runner images
Deploy jobs run on a generic runner (e.g. `runs-on: fedora-43`); build jobs run on a
toolchain runner. **Bake build dependencies into the runner image, not into the
workflow** — a deploy workflow should never `dnf install` at run time (runners may run
unprivileged, and per-run installs are slow and flaky). If a build needs a tool the
image lacks, add it to the image (the `gongfoo/images/*` Containerfiles) and rebuild.
For static cross-compilation specifically, the runner's toolchain must include the
cross target's std. Where a distro's packaged compiler can't load a foreign std (e.g.
Fedora's distro `rustc` rejects any std it didn't build, and Fedora ships no musl std),
provision the toolchain via its own version manager (`rustup` + `rustup target add …`)
so compiler and std always match.
---
## 6. Gotchas worth pre-empting
These cost a deploy round-trip each; encode them up front.
- **`rsync` won't create a missing destination directory** for a single-file copy. On
Fedora, `/etc/sysusers.d` and `/etc/firewalld/services` don't exist by default (only
the `/usr/lib` variants ship). Add `--mkpath` to file pushes so the root-side rsync
creates the parent.
- **firewalld only learns a freshly-shipped custom service after `--reload`.** Querying
or adding it to the runtime before reloading fails `INVALID_SERVICE`. Order:
rsync the XML → `firewall-cmd --reload` → `--query-service` → (if absent)
`--add-service --permanent` → `--reload`.
- **nginx `sites-enabled` holds only symlinks.** rsync vhost configs to
`sites-available/` and `ln -sfn` them into `sites-enabled/`. (The include is often
one level removed: `nginx.conf` → `conf.d/*.conf` → a `sites-enabled.conf` that does
`include …/sites-enabled/*.conf;`.) `nginx -t` before reload; a bad config left on
disk also breaks unrelated `systemctl reload nginx` calls (e.g. cert-renewal hooks).
- **glibc skew:** a binary built on a newer host won't run on an older target. Build
static (musl) or build on a runner no newer than the oldest target.
- **`semanage port -a` on an already-labelled port** prints "already defined, modifying
instead" and reassigns; guard with `semanage port -l | grep` so re-runs are no-ops.
- **`Type=notify` units** must actually send `READY=1` (`sd_notify`), or `systemctl
restart` blocks until `TimeoutStartSec` expires. Send it once the daemon has finished
binding its listeners.
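Two of the guards above, sketched as shell functions meant to run on the target host. The function names are hypothetical, and the `awk` parser is a stricter stand-in for the `semanage port -l | grep` guard mentioned above; it assumes the usual three-column `semanage port -l` layout (type, protocol, comma-separated ports or ranges).

```bash
#!/usr/bin/env bash
set -euo pipefail

# port_has_label TYPE PROTO PORT
# Reads `semanage port -l` output on stdin; succeeds if PORT (alone or inside
# a range such as 8008-8010) is already listed for TYPE/PROTO.
port_has_label() {
  awk -v type="$1" -v proto="$2" -v port="$3" '
    $1 == type && $2 == proto {
      for (i = 3; i <= NF; i++) {
        entry = $i
        sub(/,$/, "", entry)
        n = split(entry, range, "-")
        if (n == 1 && entry == port) found = 1
        if (n == 2 && port + 0 >= range[1] + 0 && port + 0 <= range[2] + 0) found = 1
      }
    }
    END { exit (found ? 0 : 1) }
  '
}

# ensure_port_label TYPE PROTO PORT: add the label only when it is missing.
ensure_port_label() {
  if sudo semanage port -l | port_has_label "$@"; then
    echo "port $3/$2 already labelled $1"
  else
    sudo semanage port -a -t "$1" -p "$2" "$3"
  fi
}

# ensure_firewalld_service SERVICE: reload first so firewalld knows the freshly
# shipped XML, then add the service permanently only if it is absent.
ensure_firewalld_service() {
  local service=$1
  sudo firewall-cmd --reload
  if sudo firewall-cmd --query-service="$service"; then
    echo "firewalld service $service already enabled"
  else
    sudo firewall-cmd --add-service="$service" --permanent
    sudo firewall-cmd --reload
  fi
}
```

Both functions print a visible message on the no-op path, so a re-run leaves a record of why nothing changed.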
---
## 7. Relationship to generic.md
This model reuses §8 (service accounts, hardened units), §9 (named firewalld services),
§10 (SELinux), and §11 (PKI/cert paths) unchanged — it only swaps *how* the assets get
onto the host (CI + `gitea_ci` + scoped sudo, instead of `deploy.sh` + operator sudo).
The `asset/` layout from §6 is the same, minus `manifest.yml`. Per-service TLS for
mesh-only nginx vhosts is covered in `internal-tls.md`.

generic.md

@@ -279,6 +279,14 @@ Config file templates use a simple `{{VAR_NAME}}` syntax. `deploy.sh` substitute
## 7. Deployment Script (`script/deploy.sh`)
> **Alternative: CI-driven deployment.** When a project should redeploy from a Gitea
> Actions workflow instead of an operator running `deploy.sh`, see
> **`deployment-gitea-actions.md`**. In that model the workflow itself is the source of
> infra truth (no `manifest.yml`), the runner SSHes in as a dedicated `gitea_ci` user
> with a scoped sudoers drop-in, and one-time host prep lives in
> `script/infra-setup.sh`. The `asset/` layout (§6) and the on-host conventions
> (§8–§11) are otherwise identical. The two models coexist; pick per project.
A bash script with a stable CLI:
```
@@ -311,6 +319,7 @@ A bash script with a stable CLI:
- Quiet on success, loud on failure.
- Supports `--dry-run` to print what would happen.
- Never writes secrets to disk on the build host outside of the rendered template being rsynced.
- **Never suppress errors.** Do not use `2>/dev/null`, `|| true`, or any pattern that hides error output or swallows exit codes. If a command might fail legitimately (e.g. stopping a service that isn't installed yet on first deploy), handle the failure explicitly with a visible message (e.g. `cmd || info "service was not running"`). If a command shouldn't fail, let it fail loudly — `set -euo pipefail` will catch it. This rule applies to all shell scripts in the project, not just `deploy.sh`.
---
@@ -520,6 +529,12 @@ The service unit itself needs an `ExecReload=` that causes the daemon to re-read
Ship these `.path` and cert-reload `.service` units from `asset/systemd/` the same way as the main unit.
**Per-service internal certs.** The paths above are the host *identity* cert. A service
fronted by its own mesh name (e.g. an nginx vhost at `<service>.internal`, distinct from
the host FQDN) needs its own cert — minted via the `lair` provisioner and renewed by a
templated `step@<name>` unit. That pattern is documented separately in
**`internal-tls.md`**.
### Ingress
- Per-site nginx reverse proxy terminates all WAN inbound 443.
- Public DNS via Cloudflare, **unproxied by default** (CF's mTLS origin-pull has been unreliable). Revisit if/when that changes.

internal-tls.md Normal file

@@ -0,0 +1,149 @@
# Internal TLS: per-service certs for mesh services
Extends `generic.md` §11 (TLS / PKI). That section covers the **host identity cert**
every host carries (`/etc/pki/tls/{misc,private}/$(hostname -f).pem`, kept fresh by
`step.service`). This doc covers the other common case: a **per-service vanity cert**
for an internal service reached by its own name on the WireGuard mesh — typically an
nginx vhost like `gongfoo.internal` or `vlc-admin.internal`.
Use this whenever a service is fronted by a `*.internal` name that differs from the
host's FQDN. Serving the host cert for a `vlc-admin.internal` request fails client
verification (the host cert's SAN is the host's FQDN, not the service name), so the
service needs its own cert.
All of this rides on the existing internal PKI: Smallstep `step-ca` at
`https://ca.internal`, internal root already trusted fleet-wide at
`/etc/pki/ca-trust/source/anchors/root-internal.pem`.
---
## 1. Naming and DNS
- The service name is `<service>.internal`, resolved by split-horizon DNS on the mesh.
**Never give it a public / Cloudflare record** — these names are mesh-only.
- The renewal unit is a systemd template instance, and `%i` can't contain dots cleanly,
so the **instance label is the dot-free short name** and the unit appends `.internal`:
instance `vlc-admin` → serves/renews `vlc-admin.internal`. Choose service short names
without dots.
## 2. Paths
Follow the established convention (shared with nginx):
| Path | Contents | Mode and owner |
| --- | --- | --- |
| `/etc/nginx/tls/cert/<name>.internal.pem` | cert (chain) | `0644 root:root` |
| `/etc/nginx/tls/key/<name>.internal.pem` | private key | `0640 root:root`, `setfacl u:nginx:r` |
`setfacl -m u:nginx:r` on the key is required when nginx **workers** must read it
(e.g. a `proxy_ssl_certificate_key` for mTLS to an internal backend). For a plain
server cert the master (root) reads it at load time and the ACL is belt-and-suspenders —
set it anyway for consistency.
## 3. Renewal: the `step@` template
Renewal is autonomous via a templated unit pair, instantiated per service
(`step@<name>.timer`). The cert renews itself over mTLS (no provisioner needed once a
cert exists), and reloads nginx on success:
```ini
# /etc/systemd/system/step@.service
[Service]
Type=oneshot
ExecCondition=/usr/bin/step certificate needs-renewal /etc/nginx/tls/cert/%i.internal.pem
ExecStart=/usr/bin/step ca renew --force \
--ca-url https://ca.internal \
--root /etc/pki/ca-trust/source/anchors/root-internal.pem \
/etc/nginx/tls/cert/%i.internal.pem \
/etc/nginx/tls/key/%i.internal.pem
ExecStartPost=/usr/bin/systemctl reload nginx.service
```
```ini
# /etc/systemd/system/step@.timer
[Timer]
Persistent=true
# every 15 min; certs are short-lived (24h)
# (unit files have no inline comments, so this stays on its own line)
OnCalendar=*:1/15
RandomizedDelaySec=5m
[Install]
WantedBy=timers.target
```
Enable per service: `systemctl enable --now step@<name>.timer`. The `ExecCondition`
makes it a clean no-op until a cert exists, so enabling it before the first mint is
harmless.
## 4. Initial minting
The timer only **renews** an existing cert; the **first** cert is minted explicitly via
the JWK provisioner (`lair`). Mint it from the provisioning script (`infra-setup.sh`,
see `deployment-gitea-actions.md` §2) by shipping the provisioner password to the host
just long enough to issue the cert:
```sh
name=<service>
cert=/etc/nginx/tls/cert/${name}.internal.pem
key=/etc/nginx/tls/key/${name}.internal.pem
# Skip if already valid (verify checks chain/expiry, not the name).
state=$(ssh "$host" "[ -f $cert ] && step certificate verify $cert \
--roots /etc/pki/ca-trust/source/anchors/root-internal.pem >/dev/null 2>&1 \
&& echo valid || echo missing")
if [ "$state" != valid ]; then
  # provisioner password lives at ~/.step/secrets/provisioner on the operator box
  rsync -az --rsync-path='sudo rsync' --chmod=0600 \
    ~/.step/secrets/provisioner "$host:/tmp/${name}-provisioner"
  ssh "$host" "
    sudo mkdir -p /etc/nginx/tls/cert /etc/nginx/tls/key
    rc=0
    sudo step ca certificate --force \
      --provisioner lair \
      --provisioner-password-file /tmp/${name}-provisioner \
      --ca-url https://ca.internal \
      --root /etc/pki/ca-trust/source/anchors/root-internal.pem \
      --san ${name}.internal \
      ${name}.internal $cert $key || rc=\$?
    sudo rm -f /tmp/${name}-provisioner   # always remove the credential
    [ \$rc -eq 0 ] || { echo 'mint failed' >&2; exit \$rc; }
    sudo chown root:root $cert $key
    sudo chmod 644 $cert; sudo chmod 640 $key
    sudo setfacl -m u:nginx:r $key"
fi
# Enable the renewal timer on the host, not on the operator box.
ssh "$host" "sudo systemctl enable --now step@${name}.timer"
```
Rules that matter:
- **Always pass `--san <name>.internal`.** Modern TLS clients ignore CN and require a
matching SAN; a CN-only cert fails with *"no alternative certificate subject name
matches target hostname"*.
- **Remove the provisioner password even on failure** (capture the exit code, `rm`,
then propagate). Never leave the credential on the host.
- The password file convention is `~/.step/secrets/provisioner` on the operator
workstation — the same one `deploy.sh`-style scripts use.
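The capture-`rc`, `rm`, propagate pattern can also be expressed with an `EXIT` trap, which covers exit paths that are easy to forget. This is a swapped-in variant rather than what the script above does, shown as a local sketch with a hypothetical helper `with_temp_credential`:

```bash
#!/usr/bin/env bash
set -euo pipefail

# with_temp_credential SRC DEST CMD [ARGS...]   (hypothetical helper)
# Copies SRC to DEST with mode 0600, runs CMD, and removes DEST on every exit
# path. The function body is a subshell, so the EXIT trap fires when it ends
# and CMD's exit status is what the caller sees.
with_temp_credential() (
  src=$1 dest=$2
  shift 2
  trap 'rm -f -- "$dest"' EXIT
  install -m 0600 "$src" "$dest" || exit
  "$@"
)
```

For example, `with_temp_credential ~/.step/secrets/provisioner /tmp/provisioner some-mint-command` removes `/tmp/provisioner` whether `some-mint-command` (a placeholder) succeeds or fails.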
## 5. nginx wiring
```nginx
server {
    listen 443 ssl;
    server_name <name>.internal;
    ssl_certificate     /etc/nginx/tls/cert/<name>.internal.pem;
    ssl_certificate_key /etc/nginx/tls/key/<name>.internal.pem;
    # ... + proxy_ssl_certificate{,_key} with the same paths if the upstream wants mTLS
}
```
Clients verify against the internal root (`--cacert
/etc/pki/ca-trust/source/anchors/root-internal.pem`), which is already in the fleet
trust store, so browsers and `curl` on any mesh host trust it without extra flags.
## 6. Checklist for a new mesh service cert
1. Pick a dot-free short name; add split-horizon DNS `<name>.internal → <host>`
(no public record).
2. Mint the first cert with `--san <name>.internal` (§4), from `infra-setup.sh`.
3. `systemctl enable --now step@<name>.timer` for renewal.
4. Point the nginx vhost at the cert/key paths (§5); `nginx -t` && reload.

readme.md

@@ -11,6 +11,8 @@ The goal is boring consistency: the same crate layout, the same deploy flow, the
## What's here
- **`generic.md`** — the baseline. Applies to every project unless that project explicitly overrides a section. Covers workspace layout, separation of concerns, configuration, secrets, deployment, service accounts, firewalld, SELinux, and code quality.
- **`deployment-gitea-actions.md`** — CI-driven deployment via a Gitea Actions workflow, as an alternative to the `deploy.sh` + `manifest.yml` flow in `generic.md` §7. The workflow is the source of infra truth; the runner deploys as a scoped `gitea_ci` user.
- **`internal-tls.md`** — provisioning and renewing per-service internal TLS certs (`<service>.internal`) for mesh-only nginx vhosts, extending the PKI conventions in `generic.md` §11.
More files will appear here over time as guidance that's more specific than `generic.md` gets extracted — per-stack, per-deployment-target, or per-problem-domain documents. When a project needs guidance that isn't generic, it belongs in a new file here, not buried in one project's repo.