feat(deploy): per-host neuron config + pre-warm headline models

Adds asset/neuron/{beast,benjy,quadbrat}.toml: per-host neuron.toml
files keyed by the first dot-component of the host. deploy.sh now
rsyncs the matching file to /etc/neuron/neuron.toml on each neuron and
stops+starts the service so default_models is re-read.

Headline model per host (drives /v1/models output immediately after a
clean deploy):

  beast     Qwen/Qwen3.6-27B  (q5k, tp=2, devices=[0,1])
  benjy     Qwen/Qwen3-8B     (bf16, devices=[0])
  quadbrat  Qwen/Qwen3-1.7B   (bf16, devices=[0])

Removes the need to follow deploy.sh with
`validate-neuron.sh beast Qwen/Qwen3.6-27B q5k 2` to surface the 27B in
the catalogue; the neuron loads it itself on activation.

The neuron loop now mirrors the cortex flow (stop → install/upgrade →
sync config → start) so config-only changes pick up on subsequent
deploys; previously a no-package-change deploy would silently leave the
host on the old default_models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
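
For illustration only, the host-to-config mapping and the step order described in the message can be sketched as a dry run. This is not the actual deploy.sh: the function names (short_host, neuron_config_for, plan_neuron_deploy) are hypothetical, and the sketch only prints the steps rather than running rsync or touching any service.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the per-host flow described above; NOT the real deploy.sh.
set -euo pipefail

# Short host = everything before the first dot of the hostname.
# A hostname without a dot is returned unchanged.
short_host() {
  printf '%s\n' "${1%%.*}"
}

# Repo-relative path of the per-host config file for a given hostname.
neuron_config_for() {
  printf 'asset/neuron/%s.toml\n' "$(short_host "$1")"
}

# Print the steps for one neuron host in order: stop, install/upgrade,
# sync config, start. Dry run only; nothing is executed remotely.
plan_neuron_deploy() {
  local host="$1"
  local cfg
  cfg="$(neuron_config_for "$host")"
  printf '%s\n' \
    "stop neuron on ${host}" \
    "install/upgrade package on ${host}" \
    "rsync ${cfg} -> ${host}:/etc/neuron/neuron.toml" \
    "start neuron on ${host}"
}
```

Calling `plan_neuron_deploy beast.hanzalova.internal` lists the four steps for beast. Syncing the config between stop and start, regardless of whether the package changed, is what lets a config-only deploy take effect.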
2026-05-26 14:05:54 +03:00
parent 2740e61a23
commit d3f2d50749
4 changed files with 116 additions and 26 deletions
--- a/asset/neuron/beast.toml
+++ b/asset/neuron/beast.toml
@@ -0,0 +1,24 @@
+# neuron.toml for beast.hanzalova.internal
+#
+# 2x RTX 5090 (32 GB each) — TP-2 capable. Pre-warms Qwen3.6-27B with
+# q5k ISQ across both GPUs at activation, matching the validate-neuron
+# invocation: `validate-neuron.sh beast.hanzalova.internal
+# Qwen/Qwen3.6-27B q5k 2`.
+#
+# Synced by script/deploy.sh from asset/neuron/<short-host>.toml. Edits
+# take effect on the next deploy.sh run (which stops + restarts the
+# service so default_models is re-read at activation).
+
+port = 13131
+
+[[harnesses]]
+name = "candle"
+
+[harness.candle]
+
+[[default_models]]
+model_id = "Qwen/Qwen3.6-27B"
+harness = "candle"
+quant = "q5k"
+tensor_parallel = 2
+devices = [0, 1]
--- a/asset/neuron/benjy.toml
+++ b/asset/neuron/benjy.toml
@@ -0,0 +1,19 @@
+# neuron.toml for benjy.hanzalova.internal
+#
+# 1x RTX 4090 (24 GB) — largest single-GPU host on the fleet. Pre-warms
+# Qwen3-8B (bf16, ~18 GB), leaving ~6 GB for KV cache + activations on
+# moderate-length contexts.
+#
+# Synced by script/deploy.sh from asset/neuron/<short-host>.toml.
+
+port = 13131
+
+[[harnesses]]
+name = "candle"
+
+[harness.candle]
+
+[[default_models]]
+model_id = "Qwen/Qwen3-8B"
+harness = "candle"
+devices = [0]
--- a/asset/neuron/quadbrat.toml
+++ b/asset/neuron/quadbrat.toml
@@ -0,0 +1,19 @@
+# neuron.toml for quadbrat.hanzalova.internal
+#
+# 1x RTX 3060 (12 GB) — small / quantised tier. Pre-warms Qwen3-1.7B
+# (bf16, ~4 GB), leaving ~7 GB for KV cache so long contexts on a small
+# model still have plenty of room.
+#
+# Synced by script/deploy.sh from asset/neuron/<short-host>.toml.
+
+port = 13131
+
+[[harnesses]]
+name = "candle"
+
+[harness.candle]
+
+[[default_models]]
+model_id = "Qwen/Qwen3-1.7B"
+harness = "candle"
+devices = [0]