docs: add CI deployment and internal-TLS guidance, cross-reference from generic

Add two new guidance documents alongside generic.md:

- deployment-gitea-actions.md: CI-driven deployment via a Gitea Actions workflow as an alternative to deploy.sh + manifest.yml (§7), with the workflow as the source of infra truth and a scoped gitea_ci runner user.
- internal-tls.md: provisioning and renewing per-service internal TLS certs (<service>.internal) for mesh-only nginx vhosts, extending the PKI conventions in §11.

Cross-reference both from generic.md and list them in readme.md. Also add a "never suppress errors" rule to the deploy-script conventions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:43:18 +03:00
parent 83652460ed
commit 200c41b4f1
4 changed files with 365 additions and 0 deletions
--- a/deployment-gitea-actions.md
+++ b/deployment-gitea-actions.md
@@ -0,0 +1,199 @@
 # Deployment via Gitea Actions
 An alternative to the local `deploy.sh` + `manifest.yml` flow in `generic.md` §6–§7.
 Use this when deployment should be **CI-driven** — triggered by a push or a manual
 dispatch, run on a Gitea Actions runner, and auditable in the Actions log — rather
 than run by an operator from a workstation.
 Both models coexist; pick per project:
 - **`deploy.sh` (generic.md §7)** — operator-driven, runs from a workstation with the
  operator's own ssh + `pass` access. Good for tightly-held apps, one-off targets, or
  when no runner can reach the target hosts.
 - **Gitea Actions (this doc)** — runner-driven, secrets in Gitea, no operator in the
  loop. Good for anything that should redeploy on merge to `main` and for fleets a
  runner can already reach over the WireGuard mesh.
 The defining principle: **the workflow is the source of infra truth.** Hosts, ports,
 paths, and component→host mapping live in the workflow YAML, not a `manifest.yml`.
 There is no separate manifest in this model.
 ---
 ## 1. The deploy user: `gitea_ci`
 The runner SSHes into each target as a dedicated **`gitea_ci`** system user — never
 root, never the operator's account. On each target host `gitea_ci` has:
 - a home dir (`/var/lib/gitea_ci`) and `~/.ssh/authorized_keys` containing the
  runner's public key,
 - membership in `systemd-journal` (so the workflow can capture
  `journalctl -u <unit>` after a service start without a sudoers entry),
 - a **scoped** `/etc/sudoers.d/<app>_gitea_ci` drop-in granting `NOPASSWD` for
  exactly the commands the deploy runs — nothing broader.
 Name the sudoers file `<app>_gitea_ci`, not bare `gitea_ci`, so multiple apps can
 drop their own files on a shared host without clobbering each other.
 ### Scoped sudoers
 Whitelist exact commands. For file pushes the workflow uses
 `rsync --rsync-path='sudo rsync'`, so the remote rsync runs as root; pin each line to
 one destination with a trailing literal path:
 ```
 gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /usr/local/bin/<app>
 gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/<app>/config.toml
 gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl restart <app>.service
 gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
 gitea_ci ALL=(root) NOPASSWD: /usr/sbin/restorecon -R /usr/local/bin/<app> /etc/<app> /var/lib/<app>
 gitea_ci ALL=(root) NOPASSWD: /usr/sbin/semanage port -l
 gitea_ci ALL=(root) NOPASSWD: /usr/sbin/semanage port -a -t http_port_t -p tcp 8081
 gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service=<app> --permanent
 gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service=<app>
 gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --query-service=<app>
 gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload
 ```
 - The `*` in an rsync line matches rsync's `--server …` argument vector across spaces;
  the trailing literal destination is what actually bounds the rule.
 - sudoers treats `:` and `=` (along with `,` and `\`) as reserved; escape them (`\:`, `\=`)
  inside command arguments or `visudo` rejects the file. This is common when whitelisting
  `dnf config-manager addrepo --from-repofile=https://…`.
 - Verify every drop-in with `visudo -cf` at install time so a typo can't lock the host.
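As a small illustration of the escaping rule above, the helper below backslash-escapes the characters that sudoers(5) reserves inside command arguments (`,`, `:`, `=`, `\`). The function name `escape_sudoers_arg` and the example URL are hypothetical; treat this as a sketch for generating drop-in lines, not as part of the documented tooling.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Escape the characters sudoers reserves inside command arguments.
# Hypothetical helper: apply it per argument, never to the whole rule line
# (the ':' in 'NOPASSWD:' and the '=' in 'ALL=' must stay unescaped).
escape_sudoers_arg() {
    printf '%s\n' "$1" | sed 's/[\\,:=]/\\&/g'
}

# Example: build one whitelist line for a repo-file URL (placeholder URL).
arg=$(escape_sudoers_arg '--from-repofile=https://example.invalid/app.repo')
printf 'gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo %s\n' "$arg"
```

However the line is produced, the finished drop-in would still go through `visudo -cf` before it is installed.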
 ---
 ## 2. One-time host provisioning: `script/infra-setup.sh`
 Everything `gitea_ci` needs is provisioned **once per host** by a
 `script/infra-setup.sh`, run by the operator from a workstation (full sudo, not the
 scoped account). It is idempotent and skips past unreachable hosts so one offline node
 doesn't block the rest. It:
 1. creates the `gitea_ci` user (if missing) and its `~/.ssh`,
 2. installs the runner's pubkey into `authorized_keys`
   (`rsync --chown gitea_ci:gitea_ci --chmod 0600 --rsync-path 'sudo rsync'`),
 3. adds `gitea_ci` to `systemd-journal`,
 4. installs the host-appropriate `/etc/sudoers.d/<app>_gitea_ci` drop-in and
   `visudo`-verifies it,
 5. provisions any per-service internal TLS cert the app's nginx vhost needs
   (see `internal-tls.md`).
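The "idempotent, skips unreachable hosts" behaviour could be structured roughly as below. This is a sketch only: `reachable`, `provision_host`, and `provision_all` are hypothetical names, the `ssh` options are one possible reachability probe, and the body of `provision_host` is a placeholder for steps 1 to 5 above.

```bash
#!/usr/bin/env bash
set -euo pipefail

info() { printf '%s\n' "$*" >&2; }

# Probe a host as the operator. BatchMode makes ssh fail instead of prompting.
reachable() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Placeholder for steps 1-5; each real command should be safe to re-run.
provision_host() {
    info "provisioning $1"
}

# Provision every host passed as an argument. Unreachable hosts are reported
# visibly and skipped so one offline node does not block the rest.
provision_all() {
    local host skipped=()
    for host in "$@"; do
        if ! reachable "$host"; then
            info "SKIP $host: unreachable"
            skipped+=("$host")
            continue
        fi
        provision_host "$host"
    done
    if [ "${#skipped[@]}" -gt 0 ]; then
        info "skipped hosts: ${skipped[*]}"
    fi
}

# Usage (hypothetical host names):
#   provision_all host-a.site.internal host-b.site.internal
```

Reporting the skipped hosts at the end keeps the skip explicit, so a re-run can target only the nodes that were offline.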
 The runner's keypair is generated once (`ssh-keygen -t ed25519 -f ~/.ssh/id_gitea_ci`);
 the **private** key becomes the `RSYNC_SSH_KEY` Gitea secret, the public key is what
 `infra-setup.sh` distributes.
 Application config is **not** shipped by `infra-setup.sh` in this model: the workflow
 renders it from Gitea secrets on every deploy (see §4). (`deploy.sh`-style apps that
 keep config in `pass` follow the `generic.md` §7 model instead.)
 ---
 ## 3. Artifact delivery: two variants
 **(a) RPM channel.** A separate `build-prerelease` workflow builds, packages, signs, and
 publishes RPMs to an internal repo (e.g. `rpm.lair.cafe/unstable`); the deploy workflow
 just `dnf install/upgrade`s them. Best when the app already ships as an RPM and the
 build is heavy (CUDA, vendored deps). The deploy workflow triggers on
 `workflow_run: [build-prerelease] completed`.
 **(b) Build-and-rsync.** The deploy workflow's own `build` job produces the artifact
 (e.g. a static binary + a static frontend bundle) and `rsync`s it straight to the
 targets. Best when there's no packaging step. Build for a **statically-linked target**
 (e.g. musl) so a runner newer than the target host doesn't produce a binary the target's
 older glibc can't load — see §6.
 ---
 ## 4. The workflow
 ```yaml
 on:
  push: { branches: [main] }      # redeploy on merge
  workflow_dispatch:              # manual re-run from the UI
 concurrency:                      # serialize deploys; never half-apply two at once
  group: deploy
  cancel-in-progress: false
 env:
  # --- infra truth: hosts, ports, paths live here, not in a manifest ---
  API_HOST: <host>.<site>.internal
  API_PORT: "8081"
  DEPLOY_KEY: |
    ${{ secrets.RSYNC_SSH_KEY }}
 ```
 Jobs follow a **build → deploy-per-component** shape:
 - **build** — runs the lint/test gate (`fmt`, `clippy -D warnings`, `test`) so a broken
  commit never deploys, then produces and uploads the artifact(s).
 - **deploy-\<component\>** (`needs: build`) — one job per component/host. Each:
  1. writes the SSH key from `DEPLOY_KEY`, then `ssh gitea_ci@$HOST hostname -f` as a
     reachability/auth check (`StrictHostKeyChecking=accept-new`);
  2. renders config from secrets (use a literal substitution — `python3` `.replace()` or
     `envsubst` — so secrets with shell-special characters survive);
  3. `rsync`s the artifact, config, systemd unit, and any firewalld/SELinux assets
     (`--rsync-path='sudo rsync'`, `--chown`/`--chmod` to set ownership in transit,
     `--mkpath` — see §6);
  4. applies system state over `ssh` with the scoped `sudo` commands (sysusers,
     `restorecon`, `semanage`, firewalld, `daemon-reload`, `restart`);
  5. health-probes (HTTP for an API, `systemctl is-active` otherwise);
  6. captures the unit's startup journal with `if: always()` so a failed start still
     leaves a usable record.
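Step 2 above asks for literal substitution when rendering config. The source names `python3` `.replace()` or `envsubst`; the sketch below swaps in a pure-bash equivalent, assuming the `{{VAR_NAME}}` template syntax from `generic.md` and a hypothetical `render_template` helper. The point it illustrates is that the secret value is never re-parsed by the shell, so characters such as `&`, `$`, `/`, and `*` arrive unchanged.

```bash
#!/usr/bin/env bash
set -euo pipefail

# render_template TEMPLATE_FILE VAR...
# Replaces each {{VAR}} with the value of the shell variable VAR, literally.
# Hypothetical helper; an unset VAR aborts the script because of `set -u`.
render_template() {
    local template_file=$1
    shift
    local text var needle
    text=$(<"$template_file")          # note: trailing newlines are stripped
    for var in "$@"; do
        needle="{{${var}}}"
        # Quoting the pattern disables globbing; quoting the replacement keeps
        # '&' literal even where bash would otherwise expand it (bash 5.2+).
        text=${text//"$needle"/"${!var}"}
    done
    printf '%s\n' "$text"
}

# Usage (hypothetical paths and variable names):
#   render_template asset/config.toml.tmpl API_KEY API_PORT > config.toml
```

Because the variable names are passed explicitly, a placeholder that is not listed stays visible in the output instead of silently becoming empty.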
 Secrets to expect in the repo settings: `RSYNC_SSH_KEY` plus the app's config secrets
 (API keys, tokens, etc.). Non-secret values stay inline in `env:`.
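For the `RSYNC_SSH_KEY` secret, the first step of each deploy job (write the key, then check reachability) might look like the following. The function names and the key path are hypothetical; `DEPLOY_KEY` is the `env:` value from the workflow above, and the `ssh` options mirror the ones the step list names.

```bash
#!/usr/bin/env bash
set -euo pipefail

# write_deploy_key KEY_MATERIAL DEST
# Writes the private key with owner-only permissions.
write_deploy_key() {
    local key=$1 dest=$2 nl=$'\n'
    mkdir -p "$(dirname "$dest")"
    # Restrictive umask so the file is never world-readable, and exactly one
    # trailing newline regardless of how the secret was stored.
    ( umask 077; printf '%s\n' "${key%"$nl"}" > "$dest" )
    chmod 0600 "$dest"
}

# check_target HOST KEY_FILE
# Reachability/auth check as gitea_ci; prints the target's FQDN on success.
check_target() {
    local host=$1 key_file=$2
    ssh -i "$key_file" -o BatchMode=yes -o StrictHostKeyChecking=accept-new \
        "gitea_ci@${host}" hostname -f
}

# Usage inside a job step (hypothetical key path):
#   write_deploy_key "$DEPLOY_KEY" "$HOME/.ssh/id_deploy"
#   check_target "$API_HOST" "$HOME/.ssh/id_deploy"
```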
 ---
 ## 5. Runner images
 Deploy jobs run on a generic runner (e.g. `runs-on: fedora-43`); build jobs run on a
 toolchain runner. **Bake build dependencies into the runner image, not into the
 workflow.** A workflow should never `dnf install` on the runner at run time (runners may
 run unprivileged, and per-run installs are slow and flaky); installing the app's RPMs on
 the *target* in variant (a) of §3 is a separate matter. If a build needs a tool the
 image lacks, add it to the image (the `gongfoo/images/*` Containerfiles) and rebuild.
 For static cross-compilation specifically, the runner's toolchain must include the
 cross target's std. Where a distro's packaged compiler can't load a foreign std (e.g.
 Fedora's distro `rustc` rejects any std it didn't build, and Fedora ships no musl std),
 provision the toolchain via its own version manager (`rustup` + `rustup target add …`)
 so compiler and std always match.
 ---
 ## 6. Gotchas worth pre-empting
 These cost a deploy round-trip each; encode them up front.
 - **`rsync` won't create a missing destination directory** for a single-file copy. On
  Fedora, `/etc/sysusers.d` and `/etc/firewalld/services` don't exist by default (only
  the `/usr/lib` variants ship). Add `--mkpath` to file pushes so the root-side rsync
  creates the parent.
 - **firewalld only learns a freshly-shipped custom service after `--reload`.** Querying
  or adding it to the runtime before reloading fails `INVALID_SERVICE`. Order:
  rsync the XML → `firewall-cmd --reload` → `--query-service` → (if absent)
  `--add-service --permanent` → `--reload`.
 - **nginx `sites-enabled` holds only symlinks.** rsync vhost configs to
  `sites-available/` and `ln -sfn` them into `sites-enabled/`. (The include is often
  one level removed: `nginx.conf` → `conf.d/*.conf` → a `sites-enabled.conf` that does
  `include …/sites-enabled/*.conf;`.) `nginx -t` before reload; a bad config left on
  disk also breaks unrelated `systemctl reload nginx` calls (e.g. cert-renewal hooks).
 - **glibc skew:** a binary built on a newer host won't run on an older target. Build
  static (musl) or build on a runner no newer than the oldest target.
 - **`semanage port -a` on an already-labelled port** prints "already defined, modifying
  instead" and reassigns; guard with `semanage port -l | grep` so re-runs are no-ops.
 - **`Type=notify` units** must actually send `READY=1` (`sd_notify`), or `systemctl
  restart` blocks until `TimeoutStartSec` expires. The daemon should send it once it has
  finished binding.
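Two of the gotchas above (firewalld ordering and the `semanage` guard) can be encoded as small functions. This is a sketch under assumptions: the function names are hypothetical, the code is meant to run on the target as `gitea_ci` with the scoped sudoers lines from §1, and the `grep` pattern assumes the usual `semanage port -l` layout of `type proto port, port, …` (a port that only appears inside a range would not match).

```bash
#!/usr/bin/env bash
set -euo pipefail

info() { printf '%s\n' "$*" >&2; }

# Make firewalld learn a freshly rsynced service XML, then enable it if absent.
ensure_firewalld_service() {
    local svc=$1
    sudo firewall-cmd --reload                  # load the new XML first
    if sudo firewall-cmd --query-service="$svc"; then
        info "firewalld: $svc already enabled"
        return 0
    fi
    sudo firewall-cmd --add-service="$svc" --permanent
    sudo firewall-cmd --reload                  # permanent config -> runtime
}

# Label a port only when it is not already listed for the type, so re-runs
# are no-ops instead of "already defined, modifying instead".
ensure_port_label() {
    local type=$1 proto=$2 port=$3 labels
    # Capture first: piping into `grep -q` can trip pipefail when grep exits early.
    labels=$(sudo semanage port -l)
    if grep -Eq "^${type}[[:space:]]+${proto}[[:space:]]+(.*, )?${port}(,|\$)" <<<"$labels"; then
        info "selinux: ${proto}/${port} already labelled ${type}"
        return 0
    fi
    sudo semanage port -a -t "$type" -p "$proto" "$port"
}

# Usage (placeholders as in the sudoers example):
#   ensure_firewalld_service <app>
#   ensure_port_label http_port_t tcp 8081
```

The argument order in each `sudo` call matches the whitelist lines in §1 exactly, since sudoers compares the full command line.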
 ---
 ## 7. Relationship to generic.md
 This model reuses `generic.md` §8 (service accounts, hardened units), §9 (named firewalld
 services), §10 (SELinux), and §11 (PKI/cert paths) unchanged. It only swaps *how* the
 assets get onto the host (CI + `gitea_ci` + scoped sudo, instead of `deploy.sh` +
 operator sudo). The `asset/` layout from `generic.md` §6 is the same, minus
 `manifest.yml`. Per-service TLS for mesh-only nginx vhosts is covered in `internal-tls.md`.
--- a/generic.md
+++ b/generic.md
@@ -279,6 +279,14 @@ Config file templates use a simple `{{VAR_NAME}}` syntax. `deploy.sh` substitute
 ## 7. Deployment Script (`script/deploy.sh`)
 > **Alternative: CI-driven deployment.** When a project should redeploy from a Gitea
 > Actions workflow instead of an operator running `deploy.sh`, see
 > **`deployment-gitea-actions.md`**. In that model the workflow itself is the source of
 > infra truth (no `manifest.yml`), the runner SSHes in as a dedicated `gitea_ci` user
 > with a scoped sudoers drop-in, and one-time host prep lives in
 > `script/infra-setup.sh`. The `asset/` layout (§6) and the on-host conventions
 > (§8–§11) are otherwise identical. The two models coexist; pick per project.
 A bash script with a stable CLI:
 ```
@@ -311,6 +319,7 @@ A bash script with a stable CLI:
 - Quiet on success, loud on failure.
 - Supports `--dry-run` to print what would happen.
 - Never writes secrets to disk on the build host outside of the rendered template being rsynced.
 - **Never suppress errors.** Do not use `2>/dev/null`, `|| true`, or any pattern that hides error output or swallows exit codes. If a command might fail legitimately (e.g. stopping a service that isn't installed yet on first deploy), handle the failure explicitly with a visible message (e.g. `cmd || info "service was not running"`). If a command shouldn't fail, let it fail loudly — `set -euo pipefail` will catch it. This rule applies to all shell scripts in the project, not just `deploy.sh`.
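A minimal sketch of the "handle it explicitly" half of that rule, assuming a hypothetical `stop_if_running` helper and an `info` logger like the one the rule mentions. The expected failure (unit not active on a first deploy) gets its own visible branch; the command that should not fail is left unguarded.

```bash
#!/usr/bin/env bash
set -euo pipefail

info() { printf 'info: %s\n' "$*" >&2; }

# Stop a unit before replacing its binary. On a first deploy the unit may not
# be running (or installed) yet; that case is reported, not hidden.
stop_if_running() {
    local unit=$1
    if systemctl is-active --quiet "$unit"; then
        sudo systemctl stop "$unit"     # not expected to fail; let it fail loudly
    else
        info "$unit is not active; nothing to stop"
    fi
}

# Usage:
#   stop_if_running <app>.service
```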
 ---
@@ -520,6 +529,12 @@ The service unit itself needs an `ExecReload=` that causes the daemon to re-read
 Ship these `.path` and cert-reload `.service` units from `asset/systemd/` the same way as the main unit.
 **Per-service internal certs.** The paths above are the host *identity* cert. A service
 fronted by its own mesh name (e.g. an nginx vhost at `<service>.internal`, distinct from
 the host FQDN) needs its own cert — minted via the `lair` provisioner and renewed by a
 templated `step@<name>` unit. That pattern is documented separately in
 **`internal-tls.md`**.
 ### Ingress
 - Per-site nginx reverse proxy terminates all WAN inbound 443.
 - Public DNS via Cloudflare, **unproxied by default** (CF's mTLS origin-pull has been unreliable). Revisit if/when that changes.
--- a/internal-tls.md
+++ b/internal-tls.md
@@ -0,0 +1,149 @@
 # Internal TLS: per-service certs for mesh services
 Extends `generic.md` §11 (TLS / PKI). That section covers the **host identity cert**
 every host carries (`/etc/pki/tls/{misc,private}/$(hostname -f).pem`, kept fresh by
 `step.service`). This doc covers the other common case: a **per-service vanity cert**
 for an internal service reached by its own name on the WireGuard mesh — typically an
 nginx vhost like `gongfoo.internal` or `vlc-admin.internal`.
 Use this whenever a service is fronted by a `*.internal` name that differs from the
 host's FQDN. Serving the host cert for a `vlc-admin.internal` request fails client
 verification (the host cert's SAN is the host's FQDN, not the service name), so the
 service needs its own cert.
 All of this rides on the existing internal PKI: Smallstep `step-ca` at
 `https://ca.internal`, internal root already trusted fleet-wide at
 `/etc/pki/ca-trust/source/anchors/root-internal.pem`.
 ---
 ## 1. Naming and DNS
 - The service name is `<service>.internal`, resolved by split-horizon DNS on the mesh.
  **Never give it a public / Cloudflare record** — these names are mesh-only.
 - The renewal unit is a systemd template instance, and `%i` can't contain dots cleanly,
  so the **instance label is the dot-free short name** and the unit appends `.internal`:
  instance `vlc-admin` → serves/renews `vlc-admin.internal`. Choose service short names
  without dots.
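The label-to-name rule can be checked mechanically. The helper below is a hypothetical sketch (the name `service_fqdn` is not from the source): it rejects empty or dotted labels and appends `.internal`, mirroring what the templated unit does with `%i`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Map a step@ instance label to the name it serves; reject labels with dots.
service_fqdn() {
    local label=$1
    case $label in
        ''|*.*)
            printf 'invalid instance label: %s\n' "$label" >&2
            return 1
            ;;
    esac
    printf '%s.internal\n' "$label"
}

service_fqdn vlc-admin    # prints vlc-admin.internal
```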
 ## 2. Paths
 Follow the established convention (shared with nginx):
 | Path | Contents | Mode |
 | --- | --- | --- |
 | `/etc/nginx/tls/cert/<name>.internal.pem` | cert (chain) | `0644 root:root` |
 | `/etc/nginx/tls/key/<name>.internal.pem` | private key | `0640 root:root`, `setfacl u:nginx:r` |
 `setfacl -m u:nginx:r` on the key is required when nginx **workers** must read it
 (e.g. a `proxy_ssl_certificate_key` for mTLS to an internal backend). For a plain
 server cert the master (root) reads it at load time and the ACL is belt-and-suspenders —
 set it anyway for consistency.
 ## 3. Renewal: the `step@` template
 Renewal is autonomous via a templated unit pair, instantiated per service
 (`step@<name>.timer`). The unit renews the cert over mTLS (no provisioner needed once a
 cert exists) and reloads nginx on success:
 ```ini
 # /etc/systemd/system/step@.service
 [Service]
 Type=oneshot
 ExecCondition=/usr/bin/step certificate needs-renewal /etc/nginx/tls/cert/%i.internal.pem
 ExecStart=/usr/bin/step ca renew --force \
    --ca-url https://ca.internal \
    --root /etc/pki/ca-trust/source/anchors/root-internal.pem \
    /etc/nginx/tls/cert/%i.internal.pem \
    /etc/nginx/tls/key/%i.internal.pem
 ExecStartPost=/usr/bin/systemctl reload nginx.service
 ```
 ```ini
 # /etc/systemd/system/step@.timer
 [Timer]
 Persistent=true
 # every 15 min; certs are short-lived (24h)
 OnCalendar=*:1/15
 RandomizedDelaySec=5m
 [Install]
 WantedBy=timers.target
 ```
 Enable per service: `systemctl enable --now step@<name>.timer`. The `ExecCondition`
 makes it a clean no-op until a cert exists, so enabling it before the first mint is
 harmless.
 ## 4. Initial minting
 The timer only **renews** an existing cert; the **first** cert is minted explicitly via
 the JWK provisioner (`lair`). Mint it from the provisioning script (`infra-setup.sh`,
 see `deployment-gitea-actions.md` §2) by shipping the provisioner password to the host
 just long enough to issue the cert:
 ```sh
 name=<service>
 cert=/etc/nginx/tls/cert/${name}.internal.pem
 key=/etc/nginx/tls/key/${name}.internal.pem
 # Skip if already valid (verify checks chain/expiry, not the name).
 state=$(ssh "$host" "[ -f $cert ] && step certificate verify $cert \
    --roots /etc/pki/ca-trust/source/anchors/root-internal.pem >/dev/null 2>&1 \
    && echo valid || echo missing")
 if [ "$state" != valid ]; then
    # provisioner password lives at ~/.step/secrets/provisioner on the operator box
    rsync -az --rsync-path='sudo rsync' --chmod=0600 \
        ~/.step/secrets/provisioner "$host:/tmp/${name}-provisioner"
    ssh "$host" "
        sudo mkdir -p /etc/nginx/tls/cert /etc/nginx/tls/key
        rc=0
        sudo step ca certificate --force \
            --provisioner lair \
            --provisioner-password-file /tmp/${name}-provisioner \
            --ca-url https://ca.internal \
            --root /etc/pki/ca-trust/source/anchors/root-internal.pem \
            --san ${name}.internal \
            ${name}.internal $cert $key || rc=\$?
        sudo rm -f /tmp/${name}-provisioner          # always remove the credential
        [ \$rc -eq 0 ] || { echo 'mint failed' >&2; exit \$rc; }
        sudo chown root:root $cert $key
        sudo chmod 644 $cert; sudo chmod 640 $key
        sudo setfacl -m u:nginx:r $key"
 fi
 # enable the renewal timer on the host, not on the operator box
 ssh "$host" "sudo systemctl enable --now step@${name}.timer"
 ```
 Rules that matter:
 - **Always pass `--san <name>.internal`.** Modern TLS clients ignore CN and require a
  matching SAN; a CN-only cert fails with *"no alternative certificate subject name
  matches target hostname"*.
 - **Remove the provisioner password even on failure** (capture the exit code, `rm`,
  then propagate). Never leave the credential on the host.
 - The password file convention is `~/.step/secrets/provisioner` on the operator
  workstation — the same one `deploy.sh`-style scripts use.
 ## 5. nginx wiring
 ```nginx
 server {
    listen 443 ssl;
    server_name <name>.internal;
    ssl_certificate     /etc/nginx/tls/cert/<name>.internal.pem;
    ssl_certificate_key /etc/nginx/tls/key/<name>.internal.pem;
    # ... + proxy_ssl_certificate{,_key} with the same paths if the upstream wants mTLS
 }
 ```
 Clients verify against the internal root, which is already in the fleet trust store, so
 browsers and `curl` on any mesh host trust the cert without extra flags. A client that
 does not use the system store can pass the root explicitly (`--cacert
 /etc/pki/ca-trust/source/anchors/root-internal.pem`).
 ## 6. Checklist for a new mesh service cert
 1. Pick a dot-free short name; add split-horizon DNS `<name>.internal → <host>`
   (no public record).
 2. Mint the first cert with `--san <name>.internal` (§4), from `infra-setup.sh`.
 3. `systemctl enable --now step@<name>.timer` for renewal.
 4. Point the nginx vhost at the cert/key paths (§5); `nginx -t` && reload.
--- a/readme.md
+++ b/readme.md
@@ -11,6 +11,8 @@ The goal is boring consistency: the same crate layout, the same deploy flow, the
 ## What's here
 - **`generic.md`** — the baseline. Applies to every project unless that project explicitly overrides a section. Covers workspace layout, separation of concerns, configuration, secrets, deployment, service accounts, firewalld, SELinux, and code quality.
 - **`deployment-gitea-actions.md`** — CI-driven deployment via a Gitea Actions workflow, as an alternative to the `deploy.sh` + `manifest.yml` flow in `generic.md` §7. The workflow is the source of infra truth; the runner deploys as a scoped `gitea_ci` user.
 - **`internal-tls.md`** — provisioning and renewing per-service internal TLS certs (`<service>.internal`) for mesh-only nginx vhosts, extending the PKI conventions in `generic.md` §11.
 More files will appear here over time as guidance that's more specific than `generic.md` gets extracted — per-stack, per-deployment-target, or per-problem-domain documents. When a project needs guidance that isn't generic, it belongs in a new file here, not buried in one project's repo.