cortex

Author	SHA1	Message	Date
rob thijssen	b400e8b704	feat(neuron): honour HF_HUB_CACHE / HF_HOME for the candle harness cache Some checks failed build-prerelease / Resolve version stamps (push) Successful in 31s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m39s Details build-prerelease / Build cortex binary (push) Successful in 4m17s Details build-prerelease / Package cortex RPM (push) Successful in 1m22s Details CI / Format (push) Successful in 32s Details CI / Test (push) Failing after 51s Details CI / Clippy (push) Successful in 2m17s Details build-prerelease / Build neuron-ampere (push) Successful in 4m58s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ada (push) Successful in 5m1s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m4s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m37s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s Details Resolves the candle harness's HuggingFace cache directory with the following precedence (first hit wins): 1. Explicit `hf_cache` in `[harness.candle]` from neuron.toml. 2. `HF_HUB_CACHE` env var — the Python `huggingface_hub` convention. The Rust hf-hub crate doesn't read this natively, so we bridge here. 3. `HF_HOME` env var (`$HF_HOME/hub` per the canonical layout). 4. None — falls through to hf-hub's own default. Honouring HF_HUB_CACHE lets a neuron host reuse an existing cache directory shared with Python tooling or other harnesses on the same host without per-tool config. The canonical per-host setup is a systemd drop-in: /etc/systemd/system/neuron.service.d/local.conf [Service] Environment=HF_HUB_CACHE=/archive/hf-cache neuron.example.toml documents the resolution chain inline. script/validate-neuron.sh: bump LOAD_TIMEOUT from 600s to 3600s and expose both load/infer timeouts via env (NEURON_LOAD_TIMEOUT, NEURON_INFER_TIMEOUT). A Qwen3.6-class dense model is ~54 GB and was hitting the 10-min ceiling cold-downloading on a residential link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:52:50 +03:00
rob thijssen	62ca125a68	chore: keep models.example.toml generic; deploy.sh sync's local models.toml Some checks failed build-prerelease / Resolve version stamps (push) Successful in 34s Details CI / Format (push) Successful in 40s Details CI / Clippy (push) Successful in 2m22s Details CI / Test (push) Successful in 4m31s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m28s Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package cortex RPM (push) Has started running Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-blackwell (push) Has been cancelled Details Reverts the previous commit's naming of specific helexa neuron hosts in the shipped example catalogue (`models.example.toml`) — the example is supposed to be a generic starting point that any operator copies and adapts, not a record of one particular fleet's layout. - `pinned_on` in the TP example uses the placeholder `"your-multi-gpu-neuron"`. Other entries keep the model ids (since those are HuggingFace-canonical, not fleet-specific). - New `models.toml` at repo root holds the helexa-fleet catalogue (beast / benjy / quadbrat). Added to `.gitignore` alongside `cortex.toml` — both are operator-owned, gitignored, RPM-marked `%config(noreplace)`, and synced by `deploy.sh`. - `deploy.sh` now rsync's `models.toml` to `/etc/cortex/models.toml` on the gateway host on the same lifecycle as `cortex.toml`. Skips cleanly when no local file exists, so users without a catalogue aren't surprised by silent overwrites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:47:08 +03:00
rob thijssen	d46d8d4f6c	feat(tp): Stage 7b-iv — RPC + orchestration for TP load/inference All checks were successful build-prerelease / Resolve version stamps (push) Successful in 38s Details CI / Format (push) Successful in 40s Details CI / Clippy (push) Successful in 2m20s Details build-prerelease / Build cortex binary (push) Successful in 4m25s Details build-prerelease / Package cortex RPM (push) Successful in 1m22s Details CI / Test (push) Successful in 4m34s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m57s Details build-prerelease / Build neuron-ampere (push) Successful in 4m51s Details build-prerelease / Build neuron-ada (push) Successful in 5m12s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m49s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m51s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details Wires the in-flight TP machinery (Stage 7a workers, 7b-iii sharded Qwen3) end to end so a non-streaming chat completion can run across multiple GPUs via NCCL. RPC additions (tp/rpc.rs): - LoadDenseShard{model_id, config_json, safetensors_paths} - GenerateStep{model_id, tokens, offset} - ClearKvCache{model_id} - UnloadModel{model_id} - LoadDenseShardOk / GenerateStepOk / KvCacheCleared / Unloaded Worker side (tp/worker.rs): - WorkerState gains a `models: HashMap<String, TpQwen3ForCausalLM>` keyed by model_id. LoadDenseShard mmaps safetensors via ShardedVarBuilder (only this rank's slice materialises), builds the TP model with the rank's NCCL Comm cloned from NcclState. - GenerateStep runs the rank-local forward; the resulting logits are dropped (only the leader's are used for sampling). The forward's value here is the NCCL collectives inside the row-parallel layers letting the leader's rank-0 forward make progress. Pool side (tp/mod.rs): - WorkerPool::load_dense_shard fans LoadDenseShard out to every worker, builds rank 0's shard on the leader via spawn_blocking with a fresh SendComm wrapper at the move boundary (Comm is !Send at the type level), collects per-rank LoadDenseShardOk. Returns the leader's Arc<Mutex<TpQwen3ForCausalLM>>. - WorkerPool::generate_step fans GenerateStep out, runs the leader's rank-0 forward in spawn_blocking (the AllReduce CustomOps inside row-parallel layers block until every worker issues the matching collective), returns the leader's last-position logits Tensor. - WorkerPool::clear_kv_cache + unload_model follow the same pattern. NcclState refactor (tp/nccl_state.rs): - comm field becomes Option<Arc<Comm>> (was Option<Comm>) so callers can share a clone with TpQwen3ForCausalLM::load. - new `comm()` accessor + `SendComm` wrapper for spawn_blocking moves. - single allow(clippy::arc_with_non_send_sync) at the canonical construction site (Comm is !Send by type but the runtime invariant is enforced by SendComm + the pool's Mutex). Harness side (candle.rs): - LoadedHandle enum (Single \| Tp) replaces the bare Arc<LoadedModel> in the harness's registry. list_models / unload_model / inference_endpoint walk the enum uniformly. - TpLoadedModel holds the pool + leader_model + tokenizer + devices. - load_model dispatches on `spec.tensor_parallel > 1` to a new cuda-gated load_tp path: resolve dense files via hf-hub, spawn the pool, init_nccl, load_dense_shard. - chat_completion branches on the handle variant. The TP path mirrors run_inference: clear_kv_cache, prefill, sample, decode loop, detokenize. Acquires the pool Mutex for the whole request. - Streaming through TP is deferred to Stage 7c (returns Other(err)). Script (script/validate-neuron.sh): - 4th positional arg `tp_size` (default 1). When >1, switches to the dense path (tp + GGUF is mutually exclusive — bails) and adds `tensor_parallel` + `devices` to the load payload. NEURON_DEVICES env overrides the default 0..N-1 device list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:38:33 +03:00
rob thijssen	9b8bd146f6	feat(tp): --tp-smoke CLI subcommand + remote validation script All checks were successful CI / Format (push) Successful in 36s Details build-prerelease / Resolve version stamps (push) Successful in 38s Details CI / Clippy (push) Successful in 2m19s Details CI / Test (push) Successful in 4m32s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m43s Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m16s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Successful in 4m56s Details build-prerelease / Build neuron-ada (push) Successful in 5m1s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s Details Adds a one-shot diagnostic that exercises the lower half of the TP stack — WorkerPool::spawn, init_nccl, nccl_sanity_check — in isolation from model load and inference. Runs N-1 worker subprocesses (rank 0 stays in this process), joins them in an NCCL communicator on the specified CUDA devices, all_reduces a sentinel 1u32 per rank, verifies the observed_sum equals world_size on every rank, then shuts down. Output is `status=ok` on stdout (plus key=value lines for tp_size and cuda_devices) when every check passes, non-zero exit + tracing on stderr otherwise. The smoke command is diagnostic-only and not exposed through the daemon HTTP API. script/tp-smoke.sh wraps it with an ssh invocation against a fleet host (default beast — the only host with 2 GPUs) and asserts the status line, mirroring the validate-neuron.sh ergonomics. This is step 1 of the TP test plan. A failure here means TP cannot work on the host at all; step 2 (Stage 7b-iv) wires real model load and inference through the same WorkerPool primitives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 19:40:25 +03:00
rob thijssen	5436af9c73	fix(neuron/candle): dense Qwen3 returns rank-3 logits, double-squeeze All checks were successful build-prerelease / Resolve version stamps (push) Successful in 33s Details CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 2m19s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m32s Details CI / Test (push) Successful in 4m34s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m16s Details build-prerelease / Package cortex RPM (push) Successful in 1m18s Details build-prerelease / Build neuron-ampere (push) Successful in 4m55s Details build-prerelease / Build neuron-ada (push) Successful in 5m11s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m35s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details Caught by live validation against Qwen/Qwen3-1.7B on beast: HTTP 500 "unexpected rank, expected: 1, got: 2 ([1, 151936])" Candle's qwen3::ModelForCausalLM::forward returns shape [B, 1, V] (no final squeeze) while quantized_qwen3::ModelWeights::forward returns [B, V] (with squeeze(1) at the end). My match arms applied a single squeeze(0) uniformly, which is correct for the quantized [1, V] → [V] but leaves the dense at [1, V] → which then trips apply_repeat_penalty::to_vec1() expecting rank 1. Dense match arms now strip both batch and seq dims: model.forward(&input, offset)?.squeeze(0)?.squeeze(0)? Also fixes validate-neuron.sh's `${3:-Q4_K_M}` → `${3-Q4_K_M}` (no colon) so passing an explicit empty third arg now drives the dense path instead of falling back to Q4_K_M. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:49:43 +03:00
rob thijssen	05e15f3597	Stage 7b-i: dense safetensors Qwen3 load path Some checks failed build-prerelease / Build cortex binary (push) Blocked by required conditions Details CI / Test (push) Waiting to run Details CI / Format (push) Successful in 43s Details build-prerelease / Resolve version stamps (push) Successful in 44s Details CI / Clippy (push) Successful in 2m4s Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package cortex RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details build-prerelease / Build neuron-blackwell (push) Has been cancelled Details Adds the bf16/fp16 safetensors path alongside the existing GGUF quantized one. The harness now dispatches by ModelSpec.quant: - Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged). - None → safetensors dense (new). The dense path uses candle-transformers::models::qwen3::ModelForCausalLM verbatim, fed via VarBuilder::from_mmaped_safetensors over the files listed in `model.safetensors.index.json` (sharded layout) or the single `model.safetensors` fallback. dtype is bf16 to match the canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from the same repo (no -GGUF suffix to strip). ModelArch gains a Qwen3Dense variant; the forward signature mirrors QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position logits), so run_inference / run_inference_streaming just add a parallel match arm — no shape changes downstream. This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on: because the source is dense safetensors that can be byte-sliced per rank, the TP work avoids the GGUF super-block alignment problem entirely. Vanilla GGUF inference keeps working unchanged. validate-neuron.sh learns the dense path: pass an empty third arg (quant) and the script omits the `quant` field from the load payload, triggering the dense dispatch. Example: script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B '' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:03:59 +03:00
rob thijssen	18ae3c30ee	post-validation cleanup: cuDNN runtime + repetition penalty All checks were successful CI / Format (push) Successful in 34s Details build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Clippy (push) Successful in 2m17s Details CI / Test (push) Successful in 4m16s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m28s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m42s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Build neuron-ampere (push) Successful in 4m27s Details build-prerelease / Build neuron-ada (push) Successful in 4m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 6m52s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 2m32s Details Two followups from the live single-GPU validation pass. 1. deploy.sh now ensures libcudnn.so.9 is available on each neuron host before installing/upgrading the package. Probes ldconfig first so hosts with a manual (tar/runfile) cuDNN install are untouched, then adds NVIDIA's RHEL9 CUDA repo (the Fedora 43 CUDA repo doesn't ship cuDNN; only the RHEL9 one does) and installs libcudnn9-cuda-13. benjy hit "cannot open shared object file: libcudnn.so.9" during validation; this prevents that recurring. 2. candle.rs applies a 1.1 repetition penalty over the last 64 generated tokens before sampling, in both the non-streaming chat_completion path and the streaming chat_completion_stream path. Without it small Q4_K_M models degenerate into "Wait, no, no..." loops once they hit a confident-but-wrong path; with it sampling stays coherent. Defaults match mistral.rs and llama.cpp; exposing the value via the OpenAI request (frequency/presence penalty mapping) is Stage 8 territory. Both routes through a new sample_with_penalty() helper so future sampling tweaks land in one place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:48:08 +03:00
rob thijssen	1a0400131e	fix(deploy): use dnf upgrade for stale installs, install only when absent All checks were successful CI / Format (push) Successful in 35s Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Clippy (push) Successful in 2m27s Details CI / Test (push) Successful in 4m30s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m29s Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-ampere (push) Successful in 5m15s Details build-prerelease / Build neuron-ada (push) Successful in 4m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m48s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m47s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 57s Details dnf5's `dnf install <pkg>` is a no-op when the package is already installed at ANY version — it does NOT auto-upgrade to the latest available. The deploy script's install branch was therefore silently leaving hosts on older builds even though needs_update correctly reported an upgrade was available. Add an is_installed() probe and an install_or_upgrade() helper that picks the right verb: `dnf install` when fresh, `dnf upgrade` when stale. Captured combined-stream output is exposed via __DNF_OUTPUT__ for the existing failure-diagnostic path. Verified end-to-end against the live fleet: hanzalova/beast/benjy/ quadbrat all upgraded cleanly from prior prerelease NVRs to 0.1.16-0.1.20260519134302.git1866b99.fc43, validation script returned "Paris" from all three neurons. Followup (not in this commit): all hosts running helexa-neuron-* need libcudnn.so.9 available at runtime. Currently: - quadbrat: libcudnn9-cuda-13 RPM (rhel9 CUDA repo) - beast: /usr/lib64/libcudnn.so.9 (manual install) - benjy: needed rhel9 CUDA repo added + libcudnn9-cuda-13 installed as part of this validation pass. The spec currently excludes cuDNN from auto-detected deps. Should add a Recommends:libcudnn9-cuda-13 (soft) and ensure the rhel9 CUDA repo is configured on each neuron host, similar to how ensure_lair_repo handles the unstable channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:10:48 +03:00
rob thijssen	1866b99a89	fix(validate-neuron): jq for JSON, say→stderr, sane max_tokens All checks were successful CI / Format (push) Successful in 35s Details build-prerelease / Resolve version stamps (push) Successful in 38s Details CI / Clippy (push) Successful in 2m13s Details CI / Test (push) Successful in 4m22s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m25s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m21s Details build-prerelease / Package cortex RPM (push) Successful in 1m17s Details build-prerelease / Build neuron-ampere (push) Successful in 4m39s Details build-prerelease / Build neuron-ada (push) Successful in 4m57s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m34s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s Details Three real bugs caught while exercising the script end-to-end against the live quadbrat node: 1. say() printed status to stdout. Inside run_probe(), the "POST /v1/chat/completions (probe: ...)" line was being captured by `raw=$(run_probe)` along with the JSON body, so jq saw "[host] POST..." as the first line and choked at column 29 with "Invalid numeric literal" (it tried to parse the `[` as the start of a JSON array). Redirect say() to stderr so command substitutions capture only the intended return value. 2. The pretty-print step `echo "${raw}" \| yq -r '.'` re-emitted the JSON as YAML, which fails on response content that looks like YAML markers (chatcmpl ids that parse as aliases, escaped quotes inside <think>...</think> blocks). Drop the pretty-print; just echo the raw JSON. 3. JSON response parsing now uses jq (always JSON) instead of yq (parses input as YAML by default). yq remains in use only for the genuinely-YAML asset/manifest.yml elsewhere. 4. max_tokens bumped 32 → 256. Qwen3 prepends a <think>...</think> reasoning block before its final answer when the chat template enables thinking mode, and that eats most of a small budget — the "Paris" answer was being truncated mid-thought. 256 leaves enough room for both. Verified pipeline end-to-end on quadbrat (RTX 3060, helexa-neuron-ampere git602e8e1): /health OK → /models/load (unsloth/Qwen3-0.6B-GGUF Q4_K_M) → /v1/chat/completions → response content contains "Paris". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:43:02 +03:00
rob thijssen	ed4d71db09	fix(validate-neuron): default to unsloth GGUF + capture curl errors Two reasons the previous run silently bailed after POST /models/load: 1. Default model was Qwen/Qwen3-0.6B-GGUF (official). That repo ships ONLY Q8_0 — no Q4_K_M, no Q4_0, nothing else. The GGUF filename matcher in CandleHarness::resolve_files returned "no GGUF file matching quant Q4_K_M" and the load endpoint returned an error, but the script used `curl --silent --fail` and swallowed it. 2. /models/load is synchronous (it awaits the full HF download + GGUF parse). curl --max-time 30 was way too short for a 400 MB fresh download. Fixes: - Default model is now unsloth/Qwen3-0.6B-GGUF, which mirrors the full Q-spectrum (Q2_K through Q8_0 plus BF16) so Q4_K_M actually exists. - trigger_load / run_probe now use --write-out to capture HTTP code and emit the response body on non-2xx, so failures surface a real diagnostic instead of an opaque set -e abort. - LOAD_TIMEOUT bumped to 600s; INFER_TIMEOUT to 120s. - Probe payload built via `yq -n` so JSON quoting is reliable regardless of the prompt text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 08:14:31 +03:00
rob thijssen	39010c779f	add script/validate-neuron.sh — end-to-end candle harness smoke test Loads a small public Qwen3 GGUF on a target neuron host, fires a deterministic reasoning probe ("What is the capital of France?"), and asserts the response contains 'Paris'. Used to validate the candle harness on a real GPU host before the Stage 7 TP work begins, and as a regression check after future neuron builds. Defaults to beast.hanzalova.internal + Qwen/Qwen3-1.7B-GGUF + Q4_K_M; all three are positional args so the same script tests any node / model combination. Polls /models after triggering the load since /models/load returns once the materialisation is queued, not finished. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 07:58:05 +03:00
rob thijssen	8a2334eacb	deploy: dnf-native version check + lair.cafe repo bootstrap Replaces the string compare of 'git describe --tags' vs the binary's self-reported --version (which lies about prereleases — every 0.1.16-* RPM reports just "0.1.16") with the dnf-native question of "is the installed package current against what the repo offers". Mechanism: - installed_nvr(): rpm -q --qf '%{version}-%{release}' for the resident package, falling back to "(not installed)". Capturing rpm's output through a variable keeps its "package X is not installed" stdout message out of the result on failure. - needs_update(): probes rpm -q first (treats absent as "needs work"), then asks dnf check-update --refresh -q. Other dnf failures collapse into "needs update" so the subsequent install surfaces a real error rather than this check swallowing one silently. - ensure_lair_repo(): probes for /etc/yum.repos.d/lair-cafe-unstable.repo and adds it with `dnf config-manager addrepo` when missing. The upstream .repo file ships enabled=0 (unstable channel doesn't auto-engage on fetch), so we then run `dnf config-manager setopt lair-cafe-unstable.enabled=1` every run — cheap, idempotent. - Cortex and neuron install branches now guard `systemctl stop` with `[ ! -f /usr/lib/systemd/system/...service ] \|\| sudo systemctl stop` so fresh installs (no unit file yet) don't short-circuit the install step under set -e. - dnf output is captured into a variable and only printed (with a [host] prefix per line) on failure, so success stays quiet and failures show the actual diagnostic instead of being eaten by &> /dev/null. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 18:55:02 +03:00
rob thijssen	249c9442e8	chore: track deployment script All checks were successful CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 2m2s Details CI / Test (push) Successful in 3m59s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details	2026-05-18 17:50:35 +03:00
rob thijssen	5c957d08ec	ci: add build-prerelease workflow for CUDA RPMs on rpm.lair.cafe Some checks failed CI / Format (push) Successful in 36s Details CI / Test (push) Failing after 53s Details CI / Clippy (push) Successful in 2m35s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Adds a manually-triggered workflow that builds CUDA-flavoured neuron binaries and a CPU cortex binary, packages them as Fedora RPMs, signs them, and rsyncs to the unstable channel at https://rpm.lair.cafe/fedora/43/x86_64/unstable/. Mirrors the build pipeline used by grenade/mistralrs-package. Pipeline: - prepare: derive {version,short_sha,commit_date} from the checkout; the prerelease Release stamp "0.1.YYYYMMDDgitSHORTSHA" sorts below the eventual "1" stable release. - build-cortex: cargo build --release -p cortex-cli on a rust runner. - build-neuron: matrix over ada (sm_89) and blackwell (sm_120) on cuda-13.0 runners; cargo build with features "cuda cudnn flash-attn" and CUDA_COMPUTE_CAP set per flavour. - package-{cortex,neuron}: rpmbuild on the rpm runner against the new prebuilt-binary specs in rpm/. - publish: import signing key, sign RPMs, rsync to oolon, createrepo_c --update, then regenerate packages.json for the UI. New specs are prebuilt-binary variants — they consume the artifact from the build job rather than running cargo at rpmbuild time. Each helexa-neuron-{flavour} package Conflicts with the other flavours and with helexa-neuron (the future source-build stable package) so one flavour is installed at a time on a given host. neuron crate gains cudnn and flash-attn feature flags forwarding to the corresponding candle features, so the CI build command compiles those kernels into the binary. sccache is intentionally NOT used in the prerelease jobs — CUDA compute cap isn't in its cache key, so flavours would mis-hit each other. Each prerelease build is a clean cargo build. Required Gitea secrets (already in place for cortex.spec / COPR workflow): - RPM_SIGNING_KEY, RPM_SIGNING_KEY_ID - RSYNC_SSH_KEY Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:01:35 +03:00

14 Commits