architecture/internal-tls.md
rob thijssen 200c41b4f1 docs: add CI deployment and internal-TLS guidance, cross-reference from generic
Add two new guidance documents alongside generic.md:

- deployment-gitea-actions.md: CI-driven deployment via a Gitea Actions
  workflow as an alternative to deploy.sh + manifest.yml (§7), with the
  workflow as the source of infra truth and a scoped gitea_ci runner user.
- internal-tls.md: provisioning and renewing per-service internal TLS
  certs (<service>.internal) for mesh-only nginx vhosts, extending the
  PKI conventions in §11.

Cross-reference both from generic.md and list them in readme.md. Also
add a "never suppress errors" rule to the deploy-script conventions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:43:18 +03:00

Internal TLS: per-service certs for mesh services

Extends generic.md §11 (TLS / PKI). That section covers the host identity cert every host carries (/etc/pki/tls/{misc,private}/$(hostname -f).pem, kept fresh by step.service). This doc covers the other common case: a per-service vanity cert for an internal service reached by its own name on the WireGuard mesh — typically an nginx vhost like gongfoo.internal or vlc-admin.internal.

Use this whenever a service is fronted by a *.internal name that differs from the host's FQDN. Serving the host cert for a vlc-admin.internal request fails client verification (the host cert's SAN is the host's FQDN, not the service name), so the service needs its own cert.

All of this rides on the existing internal PKI: Smallstep step-ca at https://ca.internal, internal root already trusted fleet-wide at /etc/pki/ca-trust/source/anchors/root-internal.pem.


1. Naming and DNS

  • The service name is <service>.internal, resolved by split-horizon DNS on the mesh. Never give it a public / Cloudflare record — these names are mesh-only.
  • The renewal unit is a systemd template instance, and %i can't contain dots cleanly, so the instance label is the dot-free short name and the unit appends .internal: instance vlc-admin → serves/renews vlc-admin.internal. Choose service short names without dots.
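The naming rule above can be sketched as a small shell helper. This is illustrative only: `mesh_names` is a hypothetical function, not something the provisioning scripts are known to ship, and the allowed character set (lowercase letters, digits, hyphen) is an assumption that is stricter than the "no dots" rule stated here.

```shell
# Hypothetical helper: derive the mesh FQDN and timer instance from a short name.
# Assumption: short names are limited to [a-z0-9-]; the doc itself only requires "no dots".
mesh_names() {
    local short=$1
    case $short in
        # Empty, or contains any character outside the assumed set (dots included).
        ''|*[!a-z0-9-]*)
            echo "invalid short name: '$short'" >&2
            return 1
            ;;
    esac
    printf 'fqdn=%s.internal\ntimer=step@%s.timer\n' "$short" "$short"
}
```

Calling `mesh_names vlc-admin` prints the service name and the timer instance that renews it, while `mesh_names vlc.admin` is rejected before any unit is instantiated.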

2. Paths

Follow the established convention (shared with nginx):

| Path | Contents | Mode |
|---|---|---|
| /etc/nginx/tls/cert/<name>.internal.pem | cert (chain) | 0644 root:root |
| /etc/nginx/tls/key/<name>.internal.pem | private key | 0640 root:root, setfacl u:nginx:r |

setfacl -m u:nginx:r on the key is required when nginx workers must read it (e.g. a proxy_ssl_certificate_key for mTLS to an internal backend). For a plain server cert the master (root) reads it at load time and the ACL is belt-and-suspenders — set it anyway for consistency.
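A quick way to confirm the modes in the table is a small check run on the host. The sketch below assumes GNU `stat` (the `-c '%a'` format), and `check_mode` is a hypothetical helper name; it only checks the permission bits, so the ACL entry still needs a separate look with `getfacl`.

```shell
# Hypothetical helper: compare a file's octal mode with the expected one.
# Assumes GNU coreutils stat; returns 2 if the file cannot be read by stat.
check_mode() {
    local path=$1 want=$2 got
    got=$(stat -c '%a' "$path") || return 2
    if [ "$got" != "$want" ]; then
        echo "$path: mode $got, expected $want" >&2
        return 1
    fi
}

# Example usage on the host, with vlc-admin standing in for <name>:
#   check_mode /etc/nginx/tls/cert/vlc-admin.internal.pem 644
#   check_mode /etc/nginx/tls/key/vlc-admin.internal.pem  640
```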

3. Renewal: the step@ template

Renewal is autonomous via a templated unit pair, instantiated per service (step@<name>.timer). The service unit renews the cert over mTLS, authenticating with the existing cert and key (no provisioner needed once a cert exists), and reloads nginx on success:

# /etc/systemd/system/step@.service
[Service]
Type=oneshot
ExecCondition=/usr/bin/step certificate needs-renewal /etc/nginx/tls/cert/%i.internal.pem
ExecStart=/usr/bin/step ca renew --force \
    --ca-url https://ca.internal \
    --root /etc/pki/ca-trust/source/anchors/root-internal.pem \
    /etc/nginx/tls/cert/%i.internal.pem \
    /etc/nginx/tls/key/%i.internal.pem
ExecStartPost=/usr/bin/systemctl reload nginx.service

# /etc/systemd/system/step@.timer
[Timer]
Persistent=true
# every 15 min; certs are short-lived (24h)
OnCalendar=*:1/15
RandomizedDelaySec=5m
[Install]
WantedBy=timers.target

Enable per service: systemctl enable --now step@<name>.timer. The ExecCondition makes it a clean no-op until a cert exists, so enabling it before the first mint is harmless.
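Why the pre-mint case is a no-op can be sketched by mapping the condition command's exit status to what systemd does with the unit. The mapping below is an assumption to check against the installed versions: it relies on `step certificate needs-renewal` exiting 0 when renewal is due, 1 when the cert is still fresh, 2 when the file is missing, and 255 on other errors, and on systemd skipping a unit cleanly for ExecCondition statuses 1 through 254. `execcondition_outcome` is a hypothetical name used only for illustration.

```shell
# Hypothetical illustration: what happens to step@<name>.service for a given
# exit status of its ExecCondition command (assumed semantics, see lead-in).
execcondition_outcome() {
    case $1 in
        0)   echo run  ;;   # renewal is due: ExecStart proceeds
        255) echo fail ;;   # unexpected error: the unit is marked failed
        *)   echo skip ;;   # 1 = still fresh, 2 = cert file missing: skipped cleanly
    esac
}
```

Under these assumptions a missing cert (status 2) lands in the same "skip" branch as a fresh one, which is what makes enabling the timer early safe.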

4. Initial minting

The timer only renews an existing cert; the first cert is minted explicitly via the JWK provisioner (lair). Mint it from the provisioning script (infra-setup.sh, see deployment-gitea-actions.md §2) by shipping the provisioner password to the host just long enough to issue the cert:

name=<service>
cert=/etc/nginx/tls/cert/${name}.internal.pem
key=/etc/nginx/tls/key/${name}.internal.pem

# Skip if already valid (verify checks chain/expiry, not the name).
state=$(ssh "$host" "[ -f $cert ] && step certificate verify $cert \
    --roots /etc/pki/ca-trust/source/anchors/root-internal.pem >/dev/null 2>&1 \
    && echo valid || echo missing")

if [ "$state" != valid ]; then
    # provisioner password lives at ~/.step/secrets/provisioner on the operator box
    rsync -az --rsync-path='sudo rsync' --chmod=0600 \
        ~/.step/secrets/provisioner "$host:/tmp/${name}-provisioner"
    ssh "$host" "
        sudo mkdir -p /etc/nginx/tls/cert /etc/nginx/tls/key
        rc=0
        sudo step ca certificate --force \
            --provisioner lair \
            --provisioner-password-file /tmp/${name}-provisioner \
            --ca-url https://ca.internal \
            --root /etc/pki/ca-trust/source/anchors/root-internal.pem \
            --san ${name}.internal \
            ${name}.internal $cert $key || rc=\$?
        sudo rm -f /tmp/${name}-provisioner          # always remove the credential
        [ \$rc -eq 0 ] || { echo 'mint failed' >&2; exit \$rc; }
        sudo chown root:root $cert $key
        sudo chmod 644 $cert; sudo chmod 640 $key
        sudo setfacl -m u:nginx:r $key"
fi
ssh "$host" "sudo systemctl enable --now step@${name}.timer"

Rules that matter:

  • Always pass --san <name>.internal. Modern TLS clients ignore CN and require a matching SAN; a CN-only cert fails with "no alternative certificate subject name matches target hostname".
  • Remove the provisioner password even on failure (capture the exit code, rm, then propagate). Never leave the credential on the host.
  • The password file convention is ~/.step/secrets/provisioner on the operator workstation — the same one deploy.sh-style scripts use.
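The "capture the exit code, rm, then propagate" rule can be isolated into a small wrapper. This is a minimal sketch, not the script's actual structure: `with_temp_secret` is a hypothetical name, and the command it wraps stands in for the `step ca certificate` call.

```shell
# Hypothetical wrapper: run a command that needs a temporary credential file,
# always delete the credential afterwards, and return the command's exit status.
with_temp_secret() {
    local secret=$1
    shift
    local rc=0
    "$@" || rc=$?          # capture failure instead of aborting
    rm -f -- "$secret"     # runs on success and on failure alike
    return "$rc"           # propagate the original status to the caller
}

# Example usage (illustrative only; my_mint_command is a placeholder):
#   with_temp_secret /tmp/vlc-admin-provisioner my_mint_command
```

Keeping the removal between the capture and the return means there is no code path where the command has run and the credential is still on disk.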

5. nginx wiring

server {
    listen 443 ssl;
    server_name <name>.internal;

    ssl_certificate     /etc/nginx/tls/cert/<name>.internal.pem;
    ssl_certificate_key /etc/nginx/tls/key/<name>.internal.pem;
    # ... + proxy_ssl_certificate{,_key} with the same paths if the upstream wants mTLS
}

Clients verify against the internal root at /etc/pki/ca-trust/source/anchors/root-internal.pem. It is already in the fleet trust store, so browsers and curl on any mesh host trust the cert without extra flags; a client that does not use the system store can pass the root explicitly (for curl, --cacert with that path).

6. Checklist for a new mesh service cert

  1. Pick a dot-free short name; add split-horizon DNS <name>.internal → <host> (no public record).
  2. Mint the first cert with --san <name>.internal (§4), from infra-setup.sh.
  3. systemctl enable --now step@<name>.timer for renewal.
  4. Point the nginx vhost at the cert/key paths (§5), then validate and reload: nginx -t && systemctl reload nginx.