Resolves the candle harness's HuggingFace cache directory with the
following precedence (first hit wins):

  1. Explicit `hf_cache` in `[harness.candle]` from neuron.toml.
  2. `HF_HUB_CACHE` env var, the Python `huggingface_hub` convention.
     The Rust hf-hub crate doesn't read this natively, so we bridge here.
  3. `HF_HOME` env var (`$HF_HOME/hub` per the canonical layout).
  4. None: falls through to hf-hub's own default.

Honouring HF_HUB_CACHE lets a neuron host reuse an existing cache
directory shared with Python tooling or other harnesses on the same
host without per-tool config. The canonical per-host setup is a
systemd drop-in:

    /etc/systemd/system/neuron.service.d/local.conf
    [Service]
    Environment=HF_HUB_CACHE=/archive/hf-cache

neuron.example.toml documents the resolution chain inline.
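
The precedence described above can be sketched as a small pure function. This is an illustration only, not the harness's actual code: the names `resolve_hf_cache` and `resolve_from_env` are hypothetical, and passing the environment values in as arguments is a choice made here so the logic can be exercised without touching the process environment.

```rust
use std::env;
use std::path::PathBuf;

/// Pick the cache directory using the documented order:
/// config value, then HF_HUB_CACHE, then $HF_HOME/hub.
/// Returns None when nothing is set, so the caller can leave
/// hf-hub's own default in place.
fn resolve_hf_cache(
    config_hf_cache: Option<&str>,
    hf_hub_cache: Option<&str>,
    hf_home: Option<&str>,
) -> Option<PathBuf> {
    config_hf_cache
        .map(PathBuf::from)
        .or_else(|| hf_hub_cache.map(PathBuf::from))
        // HF_HOME names the parent directory; the cache lives in `hub/`.
        .or_else(|| hf_home.map(|home| PathBuf::from(home).join("hub")))
}

/// Hypothetical wrapper that reads the two environment variables.
fn resolve_from_env(config_hf_cache: Option<&str>) -> Option<PathBuf> {
    let hub_cache = env::var("HF_HUB_CACHE").ok();
    let home = env::var("HF_HOME").ok();
    resolve_hf_cache(config_hf_cache, hub_cache.as_deref(), home.as_deref())
}
```

With the drop-in shown earlier and no `hf_cache` in neuron.toml, `resolve_from_env(None)` would yield `/archive/hf-cache`. Setting `hf_cache` in the config would take priority over it.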

script/validate-neuron.sh: bump LOAD_TIMEOUT from 600s to 3600s and
expose both the load and infer timeouts via env (NEURON_LOAD_TIMEOUT,
NEURON_INFER_TIMEOUT). A Qwen3.6-class dense model is ~54 GB and was
hitting the 10-min ceiling when cold-downloading over a residential link.
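
To see why the old ceiling was too tight, a rough back-of-the-envelope calculation helps. The sketch below assumes the ~54 GB figure is decimal gigabytes, that the whole timeout is spent downloading, and that protocol overhead is zero. All three are simplifications, and `required_mbit_per_sec` is a hypothetical helper, not part of the script.

```rust
/// Minimum sustained throughput, in megabits per second, needed to move
/// `size_gb` decimal gigabytes within `timeout_secs` seconds.
fn required_mbit_per_sec(size_gb: f64, timeout_secs: f64) -> f64 {
    let bits = size_gb * 1e9 * 8.0; // bytes -> bits
    bits / timeout_secs / 1e6 // bits per second -> Mbit/s
}
```

Under those assumptions, `required_mbit_per_sec(54.0, 600.0)` comes to about 720 Mbit/s, while `required_mbit_per_sec(54.0, 3600.0)` comes to about 120 Mbit/s. The first is out of reach for many residential links; the second is far more attainable.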
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
54 lines
2.0 KiB
TOML
# neuron.example.toml: example configuration
#
# Copy to /etc/neuron/neuron.toml and adjust for your environment.
#
# Environment variable overrides use NEURON_ prefix with __ separators:
#   NEURON_PORT=13131

port = 13131

# -- Harnesses ---------------------------------------------------------------
# Each [[harnesses]] entry enables an inference engine. Currently only
# "candle" is supported. It runs in-process and uses huggingface/candle
# for inference on local CUDA devices (or CPU when CUDA is unavailable).

[[harnesses]]
name = "candle"

# -- Candle harness settings -------------------------------------------------
# Optional tuning for the candle harness.

[harness.candle]
# HuggingFace cache directory for model weights.
#
# Resolution order (first hit wins):
#   1. `hf_cache` here in this file.
#   2. `HF_HUB_CACHE` env var, the same convention as the Python
#      `huggingface_hub` library, so an existing cache directory shared
#      with other tooling can be reused without per-tool config.
#   3. `HF_HOME` env var (cache appended as `$HF_HOME/hub`).
#   4. hf-hub's default (`~/.cache/huggingface/hub`).
#
# For per-host overrides (e.g. one neuron has an SSD with prefetched
# weights), prefer a systemd drop-in over editing this file:
#   /etc/systemd/system/neuron.service.d/local.conf:
#     [Service]
#     Environment=HF_HUB_CACHE=/archive/hf-cache
# hf_cache = "/var/lib/neuron/hf-cache"

# -- Default models ----------------------------------------------------------
# Models listed here are loaded automatically when the neuron service
# activates. Loading is sequential: a slow or failing entry doesn't
# block the rest of the fleet, but it does push out the time before
# neuron starts serving HTTP, so keep the list short. Operators can
# load additional models on demand via POST /models/load.
#
# Make sure data/neuron.service's TimeoutStartSec is generous enough to
# cover the slowest entry's first-time download + materialisation.

# [[default_models]]
# model_id = "Qwen/Qwen3-0.6B-GGUF"
# harness = "candle"
# quant = "Q4_K_M"
# devices = [0]
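
The header comment in this file says overrides use a `NEURON_` prefix with `__` separators, and gives `NEURON_PORT` as its only example. One plausible reading of that convention is sketched below. It is an assumption, not confirmed by the source, that nested keys are lowercased and joined with dots, and `env_to_key_path` and the variable `NEURON_HARNESS__CANDLE__HF_CACHE` are hypothetical names used only for illustration.

```rust
/// Map an override variable name to a dotted config key path.
/// Returns None when the name lacks the `NEURON_` prefix or has
/// nothing after it.
fn env_to_key_path(var: &str) -> Option<String> {
    let rest = var.strip_prefix("NEURON_")?;
    if rest.is_empty() {
        return None;
    }
    // A double underscore separates nesting levels; single underscores
    // stay inside a key name (e.g. `hf_cache`).
    let segments: Vec<String> = rest.split("__").map(|s| s.to_lowercase()).collect();
    Some(segments.join("."))
}
```

Under this reading, `NEURON_PORT` maps to `port` and the hypothetical `NEURON_HARNESS__CANDLE__HF_CACHE` maps to `harness.candle.hf_cache`. Whether neuron accepts the latter depends on its actual config loader.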