Stage 5 of the candle-native pivot. Adds first-class support for auto-loading a configured set of models when the neuron service activates.

Config:
- NeuronConfig.default_models: Vec<ModelSpec> (defaults to []).
- neuron.example.toml ships a commented [[default_models]] example.

Activation flow (crates/neuron/src/startup.rs::load_default_models):
- Sequential: VRAM contention makes parallel loads risky.
- Per-entry timing logged at info level on success.
- Failures logged as warnings; the next entry is still attempted.
- An empty list short-circuits without log noise.

Called from main.rs after the registry is built and before the axum listener binds, so /models reflects the loaded state from the very first request.

data/neuron.service gains TimeoutStartSec=1800s. With activation blocked on potentially slow first-time HF downloads + GGUF materialisation, systemd's default 90s would kill larger model loads mid-flight.

Two non-gated tests in tests/activation.rs cover the continues-past-failure and empty-list paths using a synthetically unknown harness name to fail loads fast without touching the network. The cuda-integration test from earlier stages still exercises the real load/unload lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
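
The activation flow described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in crates/neuron/src/startup.rs: the `ModelSpec` fields are guessed from the commented [[default_models]] example, `load_model` is a hypothetical stand-in for the real harness call, and `println!`/`eprintln!` stand in for whatever logging the crate actually uses.

```rust
use std::time::Instant;

/// Hypothetical stand-in for the real ModelSpec; the fields mirror the
/// keys in the commented [[default_models]] example and are assumptions.
#[derive(Debug, Clone)]
pub struct ModelSpec {
    pub model_id: String,
    pub harness: String,
    pub quant: Option<String>,
    pub devices: Vec<usize>,
}

/// Hypothetical loader used only for this sketch: it succeeds for the
/// "candle" harness and fails immediately for any other harness name.
pub fn load_model(spec: &ModelSpec) -> Result<(), String> {
    if spec.harness == "candle" {
        Ok(())
    } else {
        Err(format!("unknown harness: {}", spec.harness))
    }
}

/// Loads each spec in order, one at a time, and returns the ids that
/// loaded successfully. A failed entry is reported and skipped; later
/// entries are still attempted.
pub fn load_default_models<F>(specs: &[ModelSpec], mut load: F) -> Vec<String>
where
    F: FnMut(&ModelSpec) -> Result<(), String>,
{
    let mut loaded = Vec::new();

    // An empty list returns immediately without printing anything.
    if specs.is_empty() {
        return loaded;
    }

    for spec in specs {
        let started = Instant::now();
        match load(spec) {
            Ok(()) => {
                // Stand-in for an info-level log line with per-entry timing.
                println!("info: loaded {} in {:?}", spec.model_id, started.elapsed());
                loaded.push(spec.model_id.clone());
            }
            Err(err) => {
                // Stand-in for a warning; the loop continues with the next entry.
                eprintln!("warn: failed to load {}: {}", spec.model_id, err);
            }
        }
    }

    loaded
}
```

Passing the loader as a closure keeps the loop testable without a GPU or network: a test can hand in a spec with an unknown harness name and check that the following entry is still loaded, which is the same idea the commit message describes for tests/activation.rs.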
41 lines
1.5 KiB
TOML
# neuron.example.toml: example configuration
#
# Copy to /etc/neuron/neuron.toml and adjust for your environment.
#
# Environment variable overrides use NEURON_ prefix with __ separators:
#   NEURON_PORT=13131

port = 13131

# -- Harnesses ---------------------------------------------------------------
# Each [[harnesses]] entry enables an inference engine. Currently only
# "candle" is supported; it runs in-process and uses huggingface/candle
# for inference on local CUDA devices (or CPU when CUDA is unavailable).

[[harnesses]]
name = "candle"

# -- Candle harness settings -------------------------------------------------
# Optional tuning for the candle harness.

[harness.candle]
# HuggingFace cache directory for model weights. When unset, hf-hub's
# default (~/.cache/huggingface) is used.
# hf_cache = "/var/lib/neuron/hf-cache"

# -- Default models ----------------------------------------------------------
# Models listed here are loaded automatically when the neuron service
# activates. Loading is sequential: a slow or failing entry doesn't
# block the rest of the fleet, but it does push out the time before
# neuron starts serving HTTP, so keep the list short. Operators can
# load additional models on demand via POST /models/load.
#
# Make sure data/neuron.service's TimeoutStartSec is generous enough to
# cover the slowest entry's first-time download + materialisation.

# [[default_models]]
# model_id = "Qwen/Qwen3-0.6B-GGUF"
# harness = "candle"
# quant = "Q4_K_M"
# devices = [0]