Files
cortex/data/neuron.service
rob thijssen f9f5fa41b6
Some checks failed
CI / Format (push) Successful in 30s
CI / Test (push) Failing after 49s
CI / Clippy (push) Successful in 2m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
fix(neuron): surface full anyhow chain + ensure $HOME exists at start
Two fixes uncovered by the live validation against beast/benjy/quadbrat:

1. api.rs swallowed everything beyond the outermost anyhow context.
   The validation script reported '{"error":"fetch GGUF ...gguf"}' but
   the actual underlying hf-hub failure (cache dir creation, network,
   auth, etc.) was hidden. Switching every error response to
   format!("{e:#}") expands the full cause chain via anyhow's
   alternate Display format.

2. The neuron systemd unit declared the service user but never ensured
   /var/lib/neuron (its $HOME) existed. hf-hub defaults its cache to
   ~/.cache/huggingface/hub — when $HOME is absent the cache dir
   creation fails and the download aborts. Adding `StateDirectory=neuron`
   makes systemd create + chown that directory at activation; no spec
   change needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:17:37 +03:00

32 lines
1.1 KiB
Desktop File

[Unit]
Description=Neuron — per-node GPU discovery and harness daemon for cortex
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/bin/neuron --config /etc/neuron/neuron.toml
Restart=on-failure
RestartSec=5
User=neuron
Group=neuron
# /var/lib/neuron is the neuron user's $HOME — hf-hub writes its
# default cache there (~/.cache/huggingface/hub). Without this directive
# systemd doesn't create the directory and hf-hub downloads fail with
# "fetch GGUF <file>: failed to create cache dir".
StateDirectory=neuron
StateDirectoryMode=0755
# Loading default_models from neuron.toml happens before the HTTP
# listener binds; large models can take many minutes to download and
# materialise on first activation. systemd's default TimeoutStartSec
# (90s) is far too short; allow 30 minutes.
TimeoutStartSec=1800s
# On stop, neuron drains in-flight requests then unloads every model
# to release CUDA contexts cleanly. Allow generous time for big-model
# unloads; systemd will SIGKILL after this bound.
TimeoutStopSec=120s
KillSignal=SIGTERM
[Install]
WantedBy=multi-user.target