feat(deploy): gitea workflow for rolling RPM deploys + host bootstrap

Replace operator-run script/deploy.sh with a CI-driven rolling deploy:

- .gitea/workflows/deploy.yml fires on build-prerelease success (and is
  re-runnable via workflow_dispatch). Cortex upgrades first on
  hanzalova.internal; the three neuron hosts upgrade in parallel under
  fail-fast: false so one failing host doesn't sink the rest. Runs
  share a concurrency group, which serializes overlapping deploys
  without cancelling in-flight runs (a half-applied dnf transaction is
  worse than a stale deploy).
- asset/sudoers.d/{cortex,neuron}-host.conf are the canonical source for
  the scoped privileges gitea_ci needs on each host kind, installed as
  /etc/sudoers.d/helexa_gitea_ci. The : in URLs and the = signs in
  command arguments are backslash-escaped per the sudoers
  reserved-character rules.
- script/infra-setup.sh idempotently provisions the gitea_ci user,
  installs the runner pubkey, drops in the appropriate sudoers fragment
  with visudo verification, and syncs cortex.toml / models.toml /
  per-host asset/neuron/<short>.toml. Config still ships from operator
  workstations rather than CI because the first two are gitignored.

The CI-only secret is RSYNC_SSH_KEY (already configured for the repo);
the matching pubkey is ~/.ssh/id_gitea_ci.pub on the operator's box.

script/deploy.sh and asset/manifest.yml are left in place until the
first end-to-end deploy workflow run succeeds; they will be removed
after that.

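The backslash-escaping rule noted for the sudoers fragments can be sketched as a small helper. sudoers(5) lists comma, colon, equals sign, and backslash as characters that must be escaped inside command arguments; the function names below are hypothetical and only illustrate how a rule line like the ones in asset/sudoers.d/ could be generated rather than hand-escaped.

```python
import re

# Characters that sudoers(5) says must be backslash-escaped inside
# command arguments: comma, colon, equals sign, and backslash itself.
_SUDOERS_SPECIAL = re.compile(r"([\\,:=])")


def escape_sudoers_arg(arg: str) -> str:
    """Backslash-escape sudoers special characters in one argument."""
    return _SUDOERS_SPECIAL.sub(r"\\\1", arg)


def sudoers_rule(user: str, command: str, *args: str) -> str:
    """Build a NOPASSWD rule line (hypothetical helper, for illustration)."""
    escaped = " ".join(escape_sudoers_arg(a) for a in args)
    return f"{user} ALL=(root) NOPASSWD: {command} {escaped}".rstrip()


print(sudoers_rule(
    "gitea_ci",
    "/usr/bin/dnf",
    "config-manager",
    "setopt",
    "lair-cafe-unstable.enabled=1",
))
# prints: gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1
```

The helper does not touch shell-style wildcards such as `*`, which sudoers treats as patterns; any generated file would still need a `visudo` check before being installed.
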
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
20
asset/sudoers.d/cortex-host.conf
Normal file
@@ -0,0 +1,20 @@
# Install on the cortex gateway host as /etc/sudoers.d/helexa_gitea_ci
# (owner root:root, mode 0440). Required by .gitea/workflows/deploy.yml,
# which SSHes as gitea_ci@<gateway> to roll out cortex package upgrades
# and config changes.
#
# Filename convention `helexa_gitea_ci` (vs bare `gitea_ci`) so other
# helexa-org apps can drop their own sudoers files on the same host
# without overwriting this one.

gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/cortex/cortex.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/cortex/models.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y cortex
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y cortex
# sudoers reserves `:` and `=` and requires `\` escaping inside command
# arguments; without it visudo errors at the first `:` in `https://`.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://rpm.lair.cafe/lair-cafe-unstable.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1
33
asset/sudoers.d/neuron-host.conf
Normal file
@@ -0,0 +1,33 @@
# Install on every neuron host as /etc/sudoers.d/helexa_gitea_ci
# (owner root:root, mode 0440). Required by .gitea/workflows/deploy.yml,
# which SSHes as gitea_ci@<neuron-host> to roll out helexa-neuron-<flavour>
# package upgrades and config changes.
#
# Filename convention `helexa_gitea_ci` (vs bare `gitea_ci`) so other
# helexa-org apps can drop their own sudoers files on the same host
# without overwriting this one.
#
# All three CUDA flavours are listed because a host's flavour can change
# (e.g. GPU swap) and we don't want the sudoers file to need to change
# in lockstep. Only one flavour can be installed at a time (the packages
# Conflict: with each other), so the attack surface is bounded to "wrong
# flavour installed": vandalism, not privilege escalation.

gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/neuron/neuron.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-ampere
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-ampere
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-ada
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-ada
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-blackwell
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-blackwell
# sudoers reserves `:` and `=` and requires `\` escaping inside command
# arguments; without it visudo errors at the first `:` in `https://`.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://rpm.lair.cafe/lair-cafe-unstable.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install -y libcudnn9-cuda-13
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service\=helexa-neuron --permanent
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload