diff --git a/deployment-gitea-actions.md b/deployment-gitea-actions.md
new file mode 100644
index 0000000..80925a2
--- /dev/null
+++ b/deployment-gitea-actions.md
@@ -0,0 +1,199 @@
+# Deployment via Gitea Actions
+
+An alternative to the local `deploy.sh` + `manifest.yml` flow in `generic.md` §6–§7.
+Use this when deployment should be **CI-driven** — triggered by a push or a manual
+dispatch, run on a Gitea Actions runner, and auditable in the Actions log — rather
+than run by an operator from a workstation.
+
+Both models coexist; pick per project:
+
+- **`deploy.sh` (generic.md §7)** — operator-driven, runs from a workstation with the
+  operator's own ssh + `pass` access. Good for tightly-held apps, one-off targets, or
+  when no runner can reach the target hosts.
+- **Gitea Actions (this doc)** — runner-driven, secrets in Gitea, no operator in the
+  loop. Good for anything that should redeploy on merge to `main` and for fleets a
+  runner can already reach over the WireGuard mesh.
+
+The defining principle: **the workflow is the source of infra truth.** Hosts, ports,
+paths, and component→host mapping live in the workflow YAML, not a `manifest.yml`.
+There is no separate manifest in this model.
+
+---
+
+## 1. The deploy user: `gitea_ci`
+
+The runner SSHes into each target as a dedicated **`gitea_ci`** system user — never
+root, never the operator's account. On each target host `gitea_ci` has:
+
+- a home dir (`/var/lib/gitea_ci`) and `~/.ssh/authorized_keys` containing the
+  runner's public key,
+- membership in `systemd-journal` (so the workflow can capture
+  `journalctl -u ` after a service start without a sudoers entry),
+- a **scoped** `/etc/sudoers.d/_gitea_ci` drop-in granting `NOPASSWD` for
+  exactly the commands the deploy runs — nothing broader.
+
+Name the sudoers file `_gitea_ci`, not bare `gitea_ci`, so multiple apps can
+drop their own files on a shared host without clobbering each other.
+
+### Scoped sudoers
+
+Whitelist exact commands. For file pushes the workflow uses
+`rsync --rsync-path='sudo rsync'`, so the remote rsync runs as root; pin each line to
+one destination with a trailing literal path:
+
+```
+gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /usr/local/bin/
+gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc//config.toml
+gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl restart .service
+gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
+gitea_ci ALL=(root) NOPASSWD: /usr/sbin/restorecon -R /usr/local/bin/ /etc/ /var/lib/
+gitea_ci ALL=(root) NOPASSWD: /usr/sbin/semanage port -l
+gitea_ci ALL=(root) NOPASSWD: /usr/sbin/semanage port -a -t http_port_t -p tcp 8081
+gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service= --permanent
+gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service=
+gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --query-service=
+gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload
+```
+
+- The `*` in an rsync line matches rsync's `--server …` argument vector across spaces;
+  the trailing literal destination is what actually bounds the rule.
+- sudoers treats `:` and `=` as reserved; escape them (`\:`, `\=`) inside command
+  arguments or `visudo` rejects the file — common when whitelisting
+  `dnf config-manager addrepo --from-repofile=https://…`.
+- Verify every drop-in with `visudo -cf` at install time so a typo can't lock the host.
+
+---
+
+## 2. One-time host provisioning: `script/infra-setup.sh`
+
+Everything `gitea_ci` needs is provisioned **once per host** by a
+`script/infra-setup.sh`, run by the operator from a workstation (full sudo, not the
+scoped account). It is idempotent and skips past unreachable hosts so one offline node
+doesn't block the rest. It:
+
+1. creates the `gitea_ci` user (if missing) and its `~/.ssh`,
+2. installs the runner's pubkey into `authorized_keys`
+   (`rsync --chown gitea_ci:gitea_ci --chmod 0600 --rsync-path 'sudo rsync'`),
+3. adds `gitea_ci` to `systemd-journal`,
+4. installs the host-appropriate `/etc/sudoers.d/_gitea_ci` drop-in and
+   `visudo`-verifies it,
+5. provisions any per-service internal TLS cert the app's nginx vhost needs
+   (see `internal-tls.md`).
+
+The runner's keypair is generated once (`ssh-keygen -t ed25519 -f ~/.ssh/id_gitea_ci`);
+the **private** key becomes the `RSYNC_SSH_KEY` Gitea secret, the public key is what
+`infra-setup.sh` distributes.
+
+Application config is **not** shipped by `infra-setup.sh` in this model — the workflow
+renders it from Gitea secrets on every deploy (see §4). (`deploy.sh`-style apps that
+keep config in `pass` are the §7 model instead.)
+
+---
+
+## 3. Artifact delivery: two variants
+
+**(a) RPM channel.** A separate `build-prerelease` workflow builds, packages, signs, and
+publishes RPMs to an internal repo (e.g. `rpm.lair.cafe/unstable`); the deploy workflow
+just `dnf install/upgrade`s them. Best when the app already ships as an RPM and the
+build is heavy (CUDA, vendored deps). The deploy workflow triggers on
+`workflow_run: [build-prerelease] completed`.
+
+**(b) Build-and-rsync.** The deploy workflow's own `build` job produces the artifact
+(e.g. a static binary + a static frontend bundle) and `rsync`s it straight to the
+targets. Best when there's no packaging step. Build for a **statically-linked target**
+(e.g. musl) so a runner newer than the target host doesn't produce a binary the target's
+older glibc can't load — see §6.
+
+---
+
+## 4. The workflow
+
+```yaml
+on:
+  push: { branches: [main] }  # redeploy on merge
+  workflow_dispatch:  # manual re-run from the UI
+
+concurrency:  # serialize deploys; never half-apply two at once
+  group: deploy
+  cancel-in-progress: false
+
+env:
+  # --- infra truth: hosts, ports, paths live here, not in a manifest ---
+  API_HOST: ..internal
+  API_PORT: "8081"
+  DEPLOY_KEY: |
+    ${{ secrets.RSYNC_SSH_KEY }}
+```
+
+Jobs follow a **build → deploy-per-component** shape:
+
+- **build** — runs the lint/test gate (`fmt`, `clippy -D warnings`, `test`) so a broken
+  commit never deploys, then produces and uploads the artifact(s).
+- **deploy-\** (`needs: build`) — one job per component/host. Each:
+  1. writes the SSH key from `DEPLOY_KEY`, then `ssh gitea_ci@$HOST hostname -f` as a
+     reachability/auth check (`StrictHostKeyChecking=accept-new`);
+  2. renders config from secrets (use a literal substitution — `python3` `.replace()` or
+     `envsubst` — so secrets with shell-special characters survive);
+  3. `rsync`s the artifact, config, systemd unit, and any firewalld/SELinux assets
+     (`--rsync-path='sudo rsync'`, `--chown`/`--chmod` to set ownership in transit,
+     `--mkpath` — see §6);
+  4. applies system state over `ssh` with the scoped `sudo` commands (sysusers,
+     `restorecon`, `semanage`, firewalld, `daemon-reload`, `restart`);
+  5. health-probes (HTTP for an API, `systemctl is-active` otherwise);
+  6. captures the unit's startup journal with `if: always()` so a failed start still
+     leaves a usable record.
+
+Secrets to expect in the repo settings: `RSYNC_SSH_KEY` plus the app's config secrets
+(API keys, tokens, etc.). Non-secret values stay inline in `env:`.
+
+---
+
+## 5. Runner images
+
+Deploy jobs run on a generic runner (e.g. `runs-on: fedora-43`); build jobs run on a
+toolchain runner. **Bake build dependencies into the runner image, not into the
+workflow** — a deploy workflow should never `dnf install` at run time (runners may run
+unprivileged, and per-run installs are slow and flaky). If a build needs a tool the
+image lacks, add it to the image (the `gongfoo/images/*` Containerfiles) and rebuild.
+
+For static cross-compilation specifically, the runner's toolchain must include the
+cross target's std. Where a distro's packaged compiler can't load a foreign std (e.g.
+Fedora's distro `rustc` rejects any std it didn't build, and Fedora ships no musl std),
+provision the toolchain via its own version manager (`rustup` + `rustup target add …`)
+so compiler and std always match.
+
+---
+
+## 6. Gotchas worth pre-empting
+
+These cost a deploy round-trip each; encode them up front.
+
+- **`rsync` won't create a missing destination directory** for a single-file copy. On
+  Fedora, `/etc/sysusers.d` and `/etc/firewalld/services` don't exist by default (only
+  the `/usr/lib` variants ship). Add `--mkpath` to file pushes so the root-side rsync
+  creates the parent.
+- **firewalld only learns a freshly-shipped custom service after `--reload`.** Querying
+  or adding it to the runtime before reloading fails `INVALID_SERVICE`. Order:
+  rsync the XML → `firewall-cmd --reload` → `--query-service` → (if absent)
+  `--add-service --permanent` → `--reload`.
+- **nginx `sites-enabled` holds only symlinks.** rsync vhost configs to
+  `sites-available/` and `ln -sfn` them into `sites-enabled/`. (The include is often
+  one level removed: `nginx.conf` → `conf.d/*.conf` → a `sites-enabled.conf` that does
+  `include …/sites-enabled/*.conf;`.) `nginx -t` before reload; a bad config left on
+  disk also breaks unrelated `systemctl reload nginx` calls (e.g. cert-renewal hooks).
+- **glibc skew:** a binary built on a newer host won't run on an older target. Build
+  static (musl) or build on a runner no newer than the oldest target.
+- **`semanage port -a` on an already-labelled port** prints "already defined, modifying
+  instead" and reassigns; guard with `semanage port -l | grep` so re-runs are no-ops.
+- **`Type=notify` units** must actually send `READY=1` (`sd_notify`) or `systemctl
+  restart` blocks until `TimeoutStartSec`. The daemon sends it after it finishes binding.
+
+---
+
+## 7. Relationship to generic.md
+
+This model reuses §8 (service accounts, hardened units), §9 (named firewalld services),
+§10 (SELinux), and §11 (PKI/cert paths) unchanged — it only swaps *how* the assets get
+onto the host (CI + `gitea_ci` + scoped sudo, instead of `deploy.sh` + operator sudo).
+The `asset/` layout from §6 is the same, minus `manifest.yml`. Per-service TLS for
+mesh-only nginx vhosts is covered in `internal-tls.md`.
diff --git a/generic.md b/generic.md
index efed177..d2637d4 100644
--- a/generic.md
+++ b/generic.md
@@ -279,6 +279,14 @@ Config file templates use a simple `{{VAR_NAME}}` syntax. `deploy.sh` substitute

 ## 7. Deployment Script (`script/deploy.sh`)

+> **Alternative: CI-driven deployment.** When a project should redeploy from a Gitea
+> Actions workflow instead of an operator running `deploy.sh`, see
+> **`deployment-gitea-actions.md`**. In that model the workflow itself is the source of
+> infra truth (no `manifest.yml`), the runner SSHes in as a dedicated `gitea_ci` user
+> with a scoped sudoers drop-in, and one-time host prep lives in
+> `script/infra-setup.sh`. The `asset/` layout (§6) and the on-host conventions
+> (§8–§11) are otherwise identical. The two models coexist; pick per project.
+
 A bash script with a stable CLI:

 ```
@@ -311,6 +319,7 @@ A bash script with a stable CLI:
 - Quiet on success, loud on failure.
 - Supports `--dry-run` to print what would happen.
 - Never writes secrets to disk on the build host outside of the rendered template being rsynced.
+- **Never suppress errors.** Do not use `2>/dev/null`, `|| true`, or any pattern that hides error output or swallows exit codes. If a command might fail legitimately (e.g. stopping a service that isn't installed yet on first deploy), handle the failure explicitly with a visible message (e.g. `cmd || info "service was not running"`). If a command shouldn't fail, let it fail loudly — `set -euo pipefail` will catch it. This rule applies to all shell scripts in the project, not just `deploy.sh`.

 ---

@@ -520,6 +529,12 @@ The service unit itself needs an `ExecReload=` that causes the daemon to re-read

 Ship these `.path` and cert-reload `.service` units from `asset/systemd/` the same way as the main unit.

+**Per-service internal certs.** The paths above are the host *identity* cert. A service
+fronted by its own mesh name (e.g. an nginx vhost at `.internal`, distinct from
+the host FQDN) needs its own cert — minted via the `lair` provisioner and renewed by a
+templated `step@` unit. That pattern is documented separately in
+**`internal-tls.md`**.
+
 ### Ingress
 - Per-site nginx reverse proxy terminates all WAN inbound 443.
 - Public DNS via Cloudflare, **unproxied by default** (CF's mTLS origin-pull has been unreliable). Revisit if/when that changes.
diff --git a/internal-tls.md b/internal-tls.md
new file mode 100644
index 0000000..31c1935
--- /dev/null
+++ b/internal-tls.md
@@ -0,0 +1,149 @@
+# Internal TLS: per-service certs for mesh services
+
+Extends `generic.md` §11 (TLS / PKI). That section covers the **host identity cert**
+every host carries (`/etc/pki/tls/{misc,private}/$(hostname -f).pem`, kept fresh by
+`step.service`). This doc covers the other common case: a **per-service vanity cert**
+for an internal service reached by its own name on the WireGuard mesh — typically an
+nginx vhost like `gongfoo.internal` or `vlc-admin.internal`.
+
+Use this whenever a service is fronted by a `*.internal` name that differs from the
+host's FQDN. Serving the host cert for a `vlc-admin.internal` request fails client
+verification (the host cert's SAN is the host's FQDN, not the service name), so the
+service needs its own cert.
+
+All of this rides on the existing internal PKI: Smallstep `step-ca` at
+`https://ca.internal`, internal root already trusted fleet-wide at
+`/etc/pki/ca-trust/source/anchors/root-internal.pem`.
+
+---
+
+## 1. Naming and DNS
+
+- The service name is `.internal`, resolved by split-horizon DNS on the mesh.
+  **Never give it a public / Cloudflare record** — these names are mesh-only.
+- The renewal unit is a systemd template instance, and `%i` can't contain dots cleanly,
+  so the **instance label is the dot-free short name** and the unit appends `.internal`:
+  instance `vlc-admin` → serves/renews `vlc-admin.internal`. Choose service short names
+  without dots.
+
+## 2. Paths
+
+Follow the established convention (shared with nginx):
+
+| Path | Contents | Mode |
+| --- | --- | --- |
+| `/etc/nginx/tls/cert/.internal.pem` | cert (chain) | `0644 root:root` |
+| `/etc/nginx/tls/key/.internal.pem` | private key | `0640 root:root`, `setfacl u:nginx:r` |
+
+`setfacl -m u:nginx:r` on the key is required when nginx **workers** must read it
+(e.g. a `proxy_ssl_certificate_key` for mTLS to an internal backend). For a plain
+server cert the master (root) reads it at load time and the ACL is belt-and-suspenders —
+set it anyway for consistency.
+
+## 3. Renewal: the `step@` template
+
+Renewal is autonomous via a templated unit pair, instantiated per service
+(`step@.timer`). The cert renews itself over mTLS (no provisioner needed once a
+cert exists), and reloads nginx on success:
+
+```ini
+# /etc/systemd/system/step@.service
+[Service]
+Type=oneshot
+ExecCondition=/usr/bin/step certificate needs-renewal /etc/nginx/tls/cert/%i.internal.pem
+ExecStart=/usr/bin/step ca renew --force \
+  --ca-url https://ca.internal \
+  --root /etc/pki/ca-trust/source/anchors/root-internal.pem \
+  /etc/nginx/tls/cert/%i.internal.pem \
+  /etc/nginx/tls/key/%i.internal.pem
+ExecStartPost=/usr/bin/systemctl reload nginx.service
+```
+
+```ini
+# /etc/systemd/system/step@.timer (OnCalendar fires every 15 min; certs are short-lived, 24h)
+[Timer]
+Persistent=true
+OnCalendar=*:1/15
+RandomizedDelaySec=5m
+[Install]
+WantedBy=timers.target
+```
+
+Enable per service: `systemctl enable --now step@.timer`. The `ExecCondition`
+makes it a clean no-op until a cert exists, so enabling it before the first mint is
+harmless.
+
+## 4. Initial minting
+
+The timer only **renews** an existing cert; the **first** cert is minted explicitly via
+the JWK provisioner (`lair`). Mint it from the provisioning script (`infra-setup.sh`,
+see `deployment-gitea-actions.md` §2) by shipping the provisioner password to the host
+just long enough to issue the cert:
+
+```sh
+name=
+cert=/etc/nginx/tls/cert/${name}.internal.pem
+key=/etc/nginx/tls/key/${name}.internal.pem
+
+# Skip if already valid (verify checks chain/expiry, not the name).
+state=$(ssh "$host" "[ -f $cert ] && step certificate verify $cert \
+  --roots /etc/pki/ca-trust/source/anchors/root-internal.pem >/dev/null 2>&1 \
+  && echo valid || echo missing")
+
+if [ "$state" != valid ]; then
+  # provisioner password lives at ~/.step/secrets/provisioner on the operator box
+  rsync -az --rsync-path='sudo rsync' --chmod=0600 \
+    ~/.step/secrets/provisioner "$host:/tmp/${name}-provisioner"
+  ssh "$host" "
+    sudo mkdir -p /etc/nginx/tls/cert /etc/nginx/tls/key
+    rc=0
+    sudo step ca certificate --force \
+      --provisioner lair \
+      --provisioner-password-file /tmp/${name}-provisioner \
+      --ca-url https://ca.internal \
+      --root /etc/pki/ca-trust/source/anchors/root-internal.pem \
+      --san ${name}.internal \
+      ${name}.internal $cert $key || rc=\$?
+    sudo rm -f /tmp/${name}-provisioner  # always remove the credential
+    [ \$rc -eq 0 ] || { echo 'mint failed' >&2; exit \$rc; }
+    sudo chown root:root $cert $key
+    sudo chmod 644 $cert; sudo chmod 640 $key
+    sudo setfacl -m u:nginx:r $key"
+fi
+ssh "$host" "sudo systemctl enable --now step@${name}.timer"  # runs on the host
+```
+
+Rules that matter:
+
+- **Always pass `--san .internal`.** Modern TLS clients ignore CN and require a
+  matching SAN; a CN-only cert fails with *"no alternative certificate subject name
+  matches target hostname"*.
+- **Remove the provisioner password even on failure** (capture the exit code, `rm`,
+  then propagate). Never leave the credential on the host.
+- The password file convention is `~/.step/secrets/provisioner` on the operator
+  workstation — the same one `deploy.sh`-style scripts use.
+
+## 5. nginx wiring
+
+```nginx
+server {
+    listen 443 ssl;
+    server_name .internal;
+
+    ssl_certificate /etc/nginx/tls/cert/.internal.pem;
+    ssl_certificate_key /etc/nginx/tls/key/.internal.pem;
+    # ... + proxy_ssl_certificate{,_key} with the same paths if the upstream wants mTLS
+}
+```
+
+Clients verify against the internal root (`--cacert
+/etc/pki/ca-trust/source/anchors/root-internal.pem`), which is already in the fleet
+trust store, so browsers and `curl` on any mesh host trust it without extra flags.
+
+## 6. Checklist for a new mesh service cert
+
+1. Pick a dot-free short name; add split-horizon DNS `.internal → `
+   (no public record).
+2. Mint the first cert with `--san .internal` (§4), from `infra-setup.sh`.
+3. `systemctl enable --now step@.timer` for renewal.
+4. Point the nginx vhost at the cert/key paths (§5); `nginx -t` && reload.
diff --git a/readme.md b/readme.md
index a698dc6..1a796f9 100644
--- a/readme.md
+++ b/readme.md
@@ -11,6 +11,8 @@ The goal is boring consistency: the same crate layout, the same deploy flow, the
 ## What's here

 - **`generic.md`** — the baseline. Applies to every project unless that project explicitly overrides a section. Covers workspace layout, separation of concerns, configuration, secrets, deployment, service accounts, firewalld, SELinux, and code quality.
+- **`deployment-gitea-actions.md`** — CI-driven deployment via a Gitea Actions workflow, as an alternative to the `deploy.sh` + `manifest.yml` flow in `generic.md` §7. The workflow is the source of infra truth; the runner deploys as a scoped `gitea_ci` user.
+- **`internal-tls.md`** — provisioning and renewing per-service internal TLS certs (`.internal`) for mesh-only nginx vhosts, extending the PKI conventions in `generic.md` §11.

 More files will appear here over time as guidance that's more specific than `generic.md` gets extracted — per-stack, per-deployment-target, or per-problem-domain documents. When a project needs guidance that isn't generic, it belongs in a new file here, not buried in one project's repo.