cortex

Author	SHA1	Message	Date
rob thijssen	c6022aa6b9	feat(stage-8b): Llama + Qwen3 MoE families on the candle harness All checks were successful CI / Format (push) Successful in 31s Details build-prerelease / Resolve version stamps (push) Successful in 36s Details CI / Clippy (push) Successful in 2m6s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m50s Details build-prerelease / Build cortex binary (push) Successful in 4m54s Details CI / Test (push) Successful in 4m58s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Successful in 4m43s Details build-prerelease / Build neuron-ada (push) Successful in 5m8s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details Broadens the single-GPU dense and quantized paths to cover three non-Qwen3 architectures already shipped by candle-transformers. TP for these is a separate stage (each family would need its own tp_*.rs mirroring tp_qwen3.rs). `ModelArch` gains four variants: - LlamaDense (boxed — wraps Llama + an inline Cache + the config it takes to rebuild the cache, since candle::llama::Cache has no reset) - LlamaQuantized (candle_transformers::models::quantized_llama) - Qwen3MoeDense (candle::models::qwen3_moe::ModelForCausalLM) - Qwen3MoeQuantized (candle::models::quantized_qwen3_moe::GGUFQWenMoE — takes an explicit compute dtype; F16 by default for best consumer-GPU throughput) The dispatch is method-based now: - `ModelArch::forward(&mut self, input, offset) -> Result<Tensor>` with a shared `squeeze_to_vocab` normalising shape differences (qwen3 returns [B,1,V]; quantized_qwen3 returns [B,V]; new families may differ again — the helper handles all of them). - `ModelArch::clear_kv_cache(&mut self) -> Result<()>`. Llama needs a Cache rebuild because its Cache has no in-place reset; the new `LlamaDense` wrapper holds the bits needed to do it. `run_inference` / `run_inference_streaming` collapse to a single dispatch path: no more per-variant match arms in the hot loop, and new architectures pick up streaming + non-streaming for free with zero changes outside `ModelArch`. DENSE_SUPPORTED_MODEL_TYPES is now ["llama", "qwen3", "qwen3_moe"]. GGUF arch switch grows "qwen3moe" + "llama" branches (qwen3moe with no underscore matches llama.cpp's general.architecture convention). Stage 8a's diagnostic auto-reports the new supported set. The `LlamaDense` variant is boxed because the wrapper's inline Cache + Config makes it 544 bytes vs ~300 for everything else (clippy::large_enum_variant). Verified: cargo test --workspace passes 66 tests; cargo clippy CPU and `--features cuda` both clean (the cuda check ran inside the locally-built `neuron-build-local` container with the math_functions.h patch applied). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:36:22 +03:00
rob thijssen	9e31d8deca	feat(stage-8a): pre-flight architecture check for dense model loads Some checks failed CI / Format (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 34s Details CI / Clippy (push) Successful in 2m21s Details CI / Test (push) Successful in 4m27s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m50s Details build-prerelease / Build cortex binary (push) Successful in 4m28s Details build-prerelease / Package cortex RPM (push) Successful in 1m24s Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details A request to load Qwen/Qwen3.6-27B (model_type "qwen3_5") on the dense path was failing deep inside serde with: missing field `vocab_size` at line 140 column 1 …because Qwen3.6 wraps its actual hyperparameters under `text_config`, so none of `qwen3::Config`'s expected top-level fields are present. The error gave no hint that the architecture was the problem. `check_dense_config_supported` parses `config.json` as an untyped JSON Value, inspects `model_type` (with `architectures` as bonus context), and bails cleanly when it's not in the supported set (currently `["qwen3"]`). The error names the rejected type, the supported set, and points at the files a contributor needs to touch to extend coverage — both the single-process `ModelArch` variants in `candle.rs` and the TP analogue in `tp_qwen3.rs`. Wired into both load paths: - `load_arch_dense` (single-GPU), before the typed deserialize. - `load_tp`, before spawning the worker pool — TP loads of an unsupported arch now fail before NCCL/init costs are paid. 4 unit tests cover the accept/reject/missing-field/malformed cases. Bonus: makes Stage 8b/8c work easier — adding a new architecture is now a `DENSE_SUPPORTED_MODEL_TYPES` edit + ModelArch variant + load branch, with the diagnostic auto-correctly listing the supported set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:27:29 +03:00
rob thijssen	b400e8b704	feat(neuron): honour HF_HUB_CACHE / HF_HOME for the candle harness cache Some checks failed build-prerelease / Resolve version stamps (push) Successful in 31s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m39s Details build-prerelease / Build cortex binary (push) Successful in 4m17s Details build-prerelease / Package cortex RPM (push) Successful in 1m22s Details CI / Format (push) Successful in 32s Details CI / Test (push) Failing after 51s Details CI / Clippy (push) Successful in 2m17s Details build-prerelease / Build neuron-ampere (push) Successful in 4m58s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ada (push) Successful in 5m1s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m4s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m37s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s Details Resolves the candle harness's HuggingFace cache directory with the following precedence (first hit wins): 1. Explicit `hf_cache` in `[harness.candle]` from neuron.toml. 2. `HF_HUB_CACHE` env var — the Python `huggingface_hub` convention. The Rust hf-hub crate doesn't read this natively, so we bridge here. 3. `HF_HOME` env var (`$HF_HOME/hub` per the canonical layout). 4. None — falls through to hf-hub's own default. Honouring HF_HUB_CACHE lets a neuron host reuse an existing cache directory shared with Python tooling or other harnesses on the same host without per-tool config. The canonical per-host setup is a systemd drop-in: /etc/systemd/system/neuron.service.d/local.conf [Service] Environment=HF_HUB_CACHE=/archive/hf-cache neuron.example.toml documents the resolution chain inline. script/validate-neuron.sh: bump LOAD_TIMEOUT from 600s to 3600s and expose both load/infer timeouts via env (NEURON_LOAD_TIMEOUT, NEURON_INFER_TIMEOUT). A Qwen3.6-class dense model is ~54 GB and was hitting the 10-min ceiling cold-downloading on a residential link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:52:50 +03:00
rob thijssen	f72dee094f	feat(tp): Stage 7c-i — streaming SSE through TP Some checks failed build-prerelease / Package cortex RPM (push) Blocked by required conditions Details build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 2m12s Details CI / Test (push) Successful in 5m3s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m39s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 5m7s Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details `chat_completion_stream` no longer returns an error for TP loads. The new `chat_completion_tp_stream` mirrors the non-streaming TP path (clear_kv_cache, prefill, sample, decode loop) but emits one `ChatCompletionChunk` per generated token over an mpsc channel so the handler can write a streaming SSE response. Unlike the single-GPU streaming path (which runs candle's forward inside `spawn_blocking` and uses `blocking_send`), the TP loop is itself async — every `pool.generate_step` already awaits the leader's own spawn_blocking forward plus every worker's recv_only. So the orchestration runs as a plain `tokio::spawn` task using `Sender::send`. The shared `emit_chunk` helper tracks the cumulative decoded prefix and emits the delta — same UTF-8-safe BPE boundary handling as the single-GPU streaming path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:32:46 +03:00
rob thijssen	d46d8d4f6c	feat(tp): Stage 7b-iv — RPC + orchestration for TP load/inference All checks were successful build-prerelease / Resolve version stamps (push) Successful in 38s Details CI / Format (push) Successful in 40s Details CI / Clippy (push) Successful in 2m20s Details build-prerelease / Build cortex binary (push) Successful in 4m25s Details build-prerelease / Package cortex RPM (push) Successful in 1m22s Details CI / Test (push) Successful in 4m34s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m57s Details build-prerelease / Build neuron-ampere (push) Successful in 4m51s Details build-prerelease / Build neuron-ada (push) Successful in 5m12s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m49s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m51s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details Wires the in-flight TP machinery (Stage 7a workers, 7b-iii sharded Qwen3) end to end so a non-streaming chat completion can run across multiple GPUs via NCCL. RPC additions (tp/rpc.rs): - LoadDenseShard{model_id, config_json, safetensors_paths} - GenerateStep{model_id, tokens, offset} - ClearKvCache{model_id} - UnloadModel{model_id} - LoadDenseShardOk / GenerateStepOk / KvCacheCleared / Unloaded Worker side (tp/worker.rs): - WorkerState gains a `models: HashMap<String, TpQwen3ForCausalLM>` keyed by model_id. LoadDenseShard mmaps safetensors via ShardedVarBuilder (only this rank's slice materialises), builds the TP model with the rank's NCCL Comm cloned from NcclState. - GenerateStep runs the rank-local forward; the resulting logits are dropped (only the leader's are used for sampling). The forward's value here is the NCCL collectives inside the row-parallel layers letting the leader's rank-0 forward make progress. Pool side (tp/mod.rs): - WorkerPool::load_dense_shard fans LoadDenseShard out to every worker, builds rank 0's shard on the leader via spawn_blocking with a fresh SendComm wrapper at the move boundary (Comm is !Send at the type level), collects per-rank LoadDenseShardOk. Returns the leader's Arc<Mutex<TpQwen3ForCausalLM>>. - WorkerPool::generate_step fans GenerateStep out, runs the leader's rank-0 forward in spawn_blocking (the AllReduce CustomOps inside row-parallel layers block until every worker issues the matching collective), returns the leader's last-position logits Tensor. - WorkerPool::clear_kv_cache + unload_model follow the same pattern. NcclState refactor (tp/nccl_state.rs): - comm field becomes Option<Arc<Comm>> (was Option<Comm>) so callers can share a clone with TpQwen3ForCausalLM::load. - new `comm()` accessor + `SendComm` wrapper for spawn_blocking moves. - single allow(clippy::arc_with_non_send_sync) at the canonical construction site (Comm is !Send by type but the runtime invariant is enforced by SendComm + the pool's Mutex). Harness side (candle.rs): - LoadedHandle enum (Single \| Tp) replaces the bare Arc<LoadedModel> in the harness's registry. list_models / unload_model / inference_endpoint walk the enum uniformly. - TpLoadedModel holds the pool + leader_model + tokenizer + devices. - load_model dispatches on `spec.tensor_parallel > 1` to a new cuda-gated load_tp path: resolve dense files via hf-hub, spawn the pool, init_nccl, load_dense_shard. - chat_completion branches on the handle variant. The TP path mirrors run_inference: clear_kv_cache, prefill, sample, decode loop, detokenize. Acquires the pool Mutex for the whole request. - Streaming through TP is deferred to Stage 7c (returns Other(err)). Script (script/validate-neuron.sh): - 4th positional arg `tp_size` (default 1). When >1, switches to the dense path (tp + GGUF is mutually exclusive — bails) and adds `tensor_parallel` + `devices` to the load payload. NEURON_DEVICES env overrides the default 0..N-1 device list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:38:33 +03:00
rob thijssen	5436af9c73	fix(neuron/candle): dense Qwen3 returns rank-3 logits, double-squeeze All checks were successful build-prerelease / Resolve version stamps (push) Successful in 33s Details CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 2m19s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m32s Details CI / Test (push) Successful in 4m34s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m16s Details build-prerelease / Package cortex RPM (push) Successful in 1m18s Details build-prerelease / Build neuron-ampere (push) Successful in 4m55s Details build-prerelease / Build neuron-ada (push) Successful in 5m11s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m35s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details Caught by live validation against Qwen/Qwen3-1.7B on beast: HTTP 500 "unexpected rank, expected: 1, got: 2 ([1, 151936])" Candle's qwen3::ModelForCausalLM::forward returns shape [B, 1, V] (no final squeeze) while quantized_qwen3::ModelWeights::forward returns [B, V] (with squeeze(1) at the end). My match arms applied a single squeeze(0) uniformly, which is correct for the quantized [1, V] → [V] but leaves the dense at [1, V] → which then trips apply_repeat_penalty::to_vec1() expecting rank 1. Dense match arms now strip both batch and seq dims: model.forward(&input, offset)?.squeeze(0)?.squeeze(0)? Also fixes validate-neuron.sh's `${3:-Q4_K_M}` → `${3-Q4_K_M}` (no colon) so passing an explicit empty third arg now drives the dense path instead of falling back to Q4_K_M. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:49:43 +03:00
rob thijssen	05e15f3597	Stage 7b-i: dense safetensors Qwen3 load path Some checks failed build-prerelease / Build cortex binary (push) Blocked by required conditions Details CI / Test (push) Waiting to run Details CI / Format (push) Successful in 43s Details build-prerelease / Resolve version stamps (push) Successful in 44s Details CI / Clippy (push) Successful in 2m4s Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package cortex RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details build-prerelease / Build neuron-blackwell (push) Has been cancelled Details Adds the bf16/fp16 safetensors path alongside the existing GGUF quantized one. The harness now dispatches by ModelSpec.quant: - Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged). - None → safetensors dense (new). The dense path uses candle-transformers::models::qwen3::ModelForCausalLM verbatim, fed via VarBuilder::from_mmaped_safetensors over the files listed in `model.safetensors.index.json` (sharded layout) or the single `model.safetensors` fallback. dtype is bf16 to match the canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from the same repo (no -GGUF suffix to strip). ModelArch gains a Qwen3Dense variant; the forward signature mirrors QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position logits), so run_inference / run_inference_streaming just add a parallel match arm — no shape changes downstream. This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on: because the source is dense safetensors that can be byte-sliced per rank, the TP work avoids the GGUF super-block alignment problem entirely. Vanilla GGUF inference keeps working unchanged. validate-neuron.sh learns the dense path: pass an empty third arg (quant) and the script omits the `quant` field from the load payload, triggering the dense dispatch. Example: script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B '' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:03:59 +03:00
rob thijssen	2a7ede0232	Stage 7a-i: TP worker lifecycle scaffolding All checks were successful CI / Format (push) Successful in 36s Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Clippy (push) Successful in 2m12s Details CI / Test (push) Successful in 4m25s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m49s Details build-prerelease / Build cortex binary (push) Successful in 4m22s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Successful in 5m9s Details build-prerelease / Build neuron-ada (push) Successful in 4m59s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s Details Leader → worker process plumbing for tensor parallelism. The neuron binary picks up two modes: default (the existing daemon, axum + HTTP) and `--worker` (a bare RPC loop driven over stdin/stdout). The leader spawns one worker per non-zero NCCL rank via tokio::process::Command on the same binary path (production: /proc/self/exe; tests: env!("CARGO_BIN_EXE_neuron")) and talks to each over newline- delimited JSON. Protocol (harness/tp/rpc.rs) is serde-tagged from the start — WorkerRequest::{Ping, Init, NcclSanityCheck, Shutdown} and WorkerResponse::{Pong, InitOk, NcclSanityResult, Bye, Error}, both `#[serde(tag = "op", rename_all = "snake_case")]`. Adding ops in 7b/7c is purely additive; unknown ops on the wire fail to parse (verified in unit tests). 7a-i scope: - WorkerPool::spawn(binary, world_size, devices) forks ranks 1..N as subprocesses, captures stdin/stdout, kills on drop. - ping_all() round-trips a Ping to every worker and validates the returned rank. - shutdown() sends Shutdown to each worker, awaits Bye, reaps. - Worker mode: parse Ping/Shutdown, return Pong/Bye; Init and NcclSanityCheck return Error{kind="not_implemented_7a_i"} so a 7a-ii binary speaking the same wire is a drop-in replacement (the kind field signals "real NCCL lands in the next commit"). - CandleHarness::load_model refuses tensor_parallel > 1 with a clear message until 7b is in. Three integration tests in tests/tp_worker_lifecycle.rs cover spawn/ ping/shutdown for 2- and 3-worker pools, plus the not_implemented_7a_i contract test for Init. Seven rpc serde unit tests assert the wire shape (op tags, field names, unknown-op rejection). All pass on the dev host; no CUDA required. Stage 7a-ii (next): the real NCCL Comm::from_rank wiring behind the existing Init/NcclSanityCheck op surface, CUDA-gated. Verifiable on beast's 2×5090. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 15:53:00 +03:00
rob thijssen	18ae3c30ee	post-validation cleanup: cuDNN runtime + repetition penalty All checks were successful CI / Format (push) Successful in 34s Details build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Clippy (push) Successful in 2m17s Details CI / Test (push) Successful in 4m16s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m28s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m42s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Build neuron-ampere (push) Successful in 4m27s Details build-prerelease / Build neuron-ada (push) Successful in 4m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 6m52s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 2m32s Details Two followups from the live single-GPU validation pass. 1. deploy.sh now ensures libcudnn.so.9 is available on each neuron host before installing/upgrading the package. Probes ldconfig first so hosts with a manual (tar/runfile) cuDNN install are untouched, then adds NVIDIA's RHEL9 CUDA repo (the Fedora 43 CUDA repo doesn't ship cuDNN; only the RHEL9 one does) and installs libcudnn9-cuda-13. benjy hit "cannot open shared object file: libcudnn.so.9" during validation; this prevents that recurring. 2. candle.rs applies a 1.1 repetition penalty over the last 64 generated tokens before sampling, in both the non-streaming chat_completion path and the streaming chat_completion_stream path. Without it small Q4_K_M models degenerate into "Wait, no, no..." loops once they hit a confident-but-wrong path; with it sampling stays coherent. Defaults match mistral.rs and llama.cpp; exposing the value via the OpenAI request (frequency/presence penalty mapping) is Stage 8 territory. Both routes through a new sample_with_penalty() helper so future sampling tweaks land in one place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:48:08 +03:00
rob thijssen	602e8e1471	fix(neuron/candle): source tokenizer.json from base repo when GGUF Some checks failed build-prerelease / Resolve version stamps (push) Successful in 31s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Failing after 50s Details CI / Test (push) Failing after 49s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m32s Details build-prerelease / Build cortex binary (push) Successful in 4m34s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Successful in 5m9s Details build-prerelease / Build neuron-ada (push) Successful in 4m52s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m36s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s Details GGUF-only HF repos (unsloth/Qwen3--GGUF, Qwen/Qwen3--GGUF) ship the .gguf file but not tokenizer.json — the tokenizer data is embedded in the GGUF metadata itself, and the standalone tokenizer.json lives in the base non-GGUF repo (unsloth/Qwen3-0.6B, Qwen/Qwen3-0.6B, etc.). Live validation against quadbrat hit: HTTP 400 fetch tokenizer.json from unsloth/Qwen3-0.6B-GGUF: HTTP status client error (404 Not Found) resolve_files now derives the tokenizer repo by stripping a `-GGUF` or `-gguf` suffix from the model_id; non-GGUF ids fall through to fetching from the same repo. The error message includes the attempted tokenizer repo id so the next failure (e.g. base repo doesn't exist) is unambiguous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:16:39 +03:00
rob thijssen	84f5662df1	feat(neuron): OpenAI-compatible SSE streaming chat completions Stage 4 of the candle-native pivot. /v1/chat/completions now switches to text/event-stream when the request sets stream: true, emitting one chat.completion.chunk per generated token followed by the OpenAI [DONE] terminator. Pipeline: - chat_completion_stream creates a bounded mpsc::channel<ChatCompletionChunk>(32), sends the leading role chunk, then spawns a blocking task that acquires the per-model arch lock and runs the streaming generation loop. - run_inference_streaming tracks a cumulative decoded prefix so each chunk's delta.content is the substring added since the last chunk — safe across BPE byte-fallback boundaries that would otherwise split multi-byte UTF-8 chars. - The blocking task aborts cleanly if blocking_send fails (client disconnected), so generation stops when the SSE consumer hangs up. - Final chunk carries finish_reason ("stop" on EOS, "length" on max_tokens). The handler appends data: [DONE] after the channel closes. The Stage 3 streaming 501 placeholder test is repurposed: with the streaming path live, an unloaded model now hits the same 404 surface as the non-streaming path (the model lookup happens first). cortex-gateway's existing proxy is unchanged — it already forwards SSE bytes verbatim from Phase 2 work, so the candle SSE format passes through unmodified. Neuron Cargo.toml gains futures + tokio-stream (both already in workspace deps) for ReceiverStream and stream combinators. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:53:14 +03:00
rob thijssen	729317d1ef	feat(neuron): OpenAI-compatible non-streaming chat completion Stage 3 of the candle-native pivot. neuron now serves POST /v1/chat/completions backed by candle's quantized_qwen3 forward pass on a per-model serialised generation loop, returning the standard OpenAI ChatCompletionResponse envelope. Pipeline per request: - Look up the LoadedModel by request.model (404 if absent). - Apply the Qwen3 chat template across all messages. - Tokenize, then spawn_blocking onto tokio's blocking pool to acquire the per-model arch lock and run prefill + greedy/temperature/top-p sampling via LogitsProcessor. - Stop on <\|im_end\|>/<\|endoftext\|> EOS or max_tokens (finish_reason "stop" vs "length"). - Decode with skip_special_tokens=true, build OpenAI response with prompt/completion/total usage counts. Supporting changes: - HarnessRegistry now stores Arc<dyn Harness> and caches a typed Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch. - LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard can be moved into spawn_blocking. - NeuronState gains an Option<Arc<CandleHarness>> field for the new inference route. - Typed InferenceError lets the handler map ModelNotLoaded → 404 and other failures → 500 without string-matching anyhow messages. - stream=true returns 501 until Stage 4 wires up SSE. - Two leftover mistral.rs string references in proxy.rs and cortex-cli (missed during the Stage 1 sweep) are corrected here. Three new default-feature tests cover the no-candle 503, model-not- loaded 404, and stream=true 501 paths. The cuda-integration test from Stage 2 still covers real load/unload; a streaming-feature gated test exercising actual generation will arrive with Stage 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:47:58 +03:00
rob thijssen	5c2bd1a1da	feat(neuron): wire candle harness load/unload via GGUF Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a LoadedModel registry keyed by model_id, hf-hub-backed GGUF download, and Qwen3 quantized weight construction via candle-transformers' quantized_qwen3 module. unload_model drops the entry; Drop on the candle ModelWeights frees device memory. Device selection prefers CUDA (gated behind the new `cuda` feature), falling back to CPU when CUDA is unavailable so default builds work on non-GPU hosts. The candle CUDA toolchain isn't pulled in unless `--features cuda` is passed, keeping CI green on CPU runners. Config gains a [harness.candle] block with an optional hf_cache path. HarnessRegistry::from_configs now takes HarnessSettings so per-harness config flows through. A gated tests/candle_lifecycle.rs exercises real load → list → unload → list-empty when run with `--features cuda-integration` against a host with HF network access. The default-feature test in tests/api.rs covers the wrong-harness rejection path without needing the network. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:02:49 +03:00
rob thijssen	3cccc2c56b	refactor(neuron): cut mistralrs/llamacpp, scaffold candle harness Stage 1 of the candle-native pivot. Replaces the external-process harness model (mistralrs over HTTP, llamacpp placeholder) with an in-process Harness trait whose sole implementation is candle. The trait keeps its shape so future engines slot in additively, but start/stop default to no-ops and HarnessConfig drops endpoint and systemd_unit since no harness needs external supervision. Behaviour is unchanged on the wire: load_model returns a "not implemented yet (Stage 2)" error and list_models is empty. The gateway-side proxy, poller, and router are untouched. CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are marked superseded; the staged plan lives in ~/.claude/plans/create-a-more-aggressive-calm-naur.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:53:04 +03:00

14 Commits