cortex

Author	SHA1	Message	Date
rob thijssen	495d3f7c05	fix(qwen3_5): promote beta to F32 alongside q/k/v in delta rule All checks were successful build-prerelease / Resolve version stamps (push) Successful in 40s Details CI / Format (push) Successful in 43s Details CI / Clippy (push) Successful in 2m20s Details CI / Test (push) Successful in 4m33s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m19s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m39s Details build-prerelease / Build neuron-ampere (push) Successful in 4m46s Details build-prerelease / Build neuron-ada (push) Successful in 5m9s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m58s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m9s Details The single-GPU dense load of Qwen/Qwen3.5-0.8B succeeded but the first inference forward bombed with `dtype mismatch in mul, lhs: F32, rhs: BF16`. Trace through the recurrent delta-rule loop: let q = (q.to_dtype(F32)? * scale)?; // F32 let k = k.to_dtype(F32)?; // F32 let v = v.to_dtype(F32)?; // F32 // g built from A_log/dt_bias // F32 // beta = sigmoid(b) // BF16 (sigmoid preserves dtype) ... let delta = (v_t - kv_mem)?.broadcast_mul(&beta_col)?; ^^^^^^^^^^^^^ ^^^^^^^^^ F32 BF16 ← mismatch `g` was already F32 because it was constructed from `a_log.to_dtype(F32)` + `dt_bias.to_dtype(F32)` earlier in the function. `beta` came from `sigmoid(b)` where `b` was the model dtype (BF16), so beta stayed BF16 and the multiplication tripped candle's dtype-mismatch check. Promote beta to F32 at the same point we promote q/k/v. Caught by the validate-neuron.sh probe against Qwen/Qwen3.5-0.8B on beast — load returned 200, then `POST /v1/chat/completions` returned the dtype error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 21:13:19 +03:00
rob thijssen	5c4c8e0eba	fix(qwen3_5): tensor names are under `model.language_model.`, not `model.` All checks were successful build-prerelease / Resolve version stamps (push) Successful in 33s Details CI / Format (push) Successful in 35s Details CI / Clippy (push) Successful in 2m12s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m49s Details CI / Test (push) Successful in 4m27s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ampere (push) Successful in 4m50s Details build-prerelease / Build neuron-ada (push) Successful in 5m12s Details build-prerelease / Build cortex binary (push) Successful in 4m14s Details build-prerelease / Package cortex RPM (push) Successful in 1m17s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s Details Qwen3-Next is a multimodal architecture whose text core sits under `model.language_model.` — sibling to `model.visual.` (vision tower) and to top-level `lm_head` / `mtp.`. Every text-side tensor in the safetensors files carries that prefix: model.language_model.embed_tokens.weight model.language_model.layers.{i}.{input,post_attention}_layernorm.weight model.language_model.layers.{i}.linear_attn.{in_proj_, conv1d.weight, A_log, dt_bias, norm.weight, out_proj.weight} model.language_model.layers.{i}.self_attn.{q,k,v,o}_proj.weight + {q,k}_norm.weight model.language_model.layers.{i}.mlp.{gate,up,down}_proj.weight model.language_model.norm.weight lm_head.weight (top-level; not under language_model) The single-pre-emptive fix is in Qwen3_5Model::load — derive a `text_vb = vb.pp("model.language_model")` once and walk embed_tokens / layers / norm from there. `lm_head` stays at the top-level VB; that path was already correct. The non-text tensors (`model.visual.`, `mtp.`) are ignored: we don't reference them, so the safetensors mmap is fine even though the bytes are loaded into the address space. After this, the load that was failing at "cannot find tensor model.embed_tokens.weight" should proceed to materialising the actual layer weights — where any further bugs will be substantive architecture issues rather than naming ones. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 16:48:16 +03:00
rob thijssen	07c44d5db1	fix(qwen3_5): nested rope_parameters + partial_rotary_factor=0.25 All checks were successful build-prerelease / Resolve version stamps (push) Successful in 34s Details CI / Format (push) Successful in 36s Details CI / Clippy (push) Successful in 2m16s Details CI / Test (push) Successful in 4m37s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m21s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m51s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Successful in 5m2s Details build-prerelease / Build neuron-ada (push) Successful in 5m8s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m11s Details Two interlocked bugs surfaced trying to load Qwen/Qwen3.5-0.8B (and the same applies to Qwen/Qwen3.6-27B): 1. Qwen3-Next config.json does NOT have a top-level `rope_theta`. It lives inside `rope_parameters: { rope_theta, partial_rotary_factor, rope_type, mrope_section, mrope_interleaved }`. Our TextConfig declared `rope_theta` as a non-optional top-level field, so the deserializer bailed with the misleading "missing field `rope_theta` at line 74 col 5". Replaced with a nested `RopeParameters` struct that mirrors the upstream shape. Defaults are conservative (rope_theta=10000, partial_rotary_factor=1.0) so a missing or partial block degrades to standard full-rotation RoPE rather than failing. 2. `partial_rotary_factor: 0.25` means only `head_dim * 0.25 = 64` of the 256 head_dim values get RoPE applied — the rest pass through unchanged. Our RotaryEmbedding was building the inv_freq table for the full head_dim and rotating everything. Silently wrong for every full-attention layer. `RotaryEmbedding` now derives `rotary_dim` from `head_dim * partial_rotary_factor`, builds its cos/sin tables at that smaller size, and in `apply()` splits q/k into (rotate, pass) on the last dim, only `rope_slow`-rotates the rotate half, and re-concatenates. Mirrors the reference Python's `apply_rotary_pos_emb` exactly for the non-trivial `partial_rotary_factor` case. Tests updated: config-deserialise fixture uses the real `rope_parameters` shape (matching the Qwen3.6-27B and Qwen3.5-0.8B configs). The linear-attention forward-smoke test was already using full rotation which still works; just shifted to the nested struct. After this, the load that previously failed at "parse Qwen3-Next (qwen3_5) config.json: missing field rope_theta" should reach the actual safetensors materialisation step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 16:18:52 +03:00
rob thijssen	e7eb3dab6a	feat(stage-8c): full-attention layer + decoder + Model + ForCausalLM for qwen3_5 All checks were successful build-prerelease / Resolve version stamps (push) Successful in 37s Details CI / Format (push) Successful in 39s Details CI / Clippy (push) Successful in 2m19s Details CI / Test (push) Successful in 4m50s Details build-prerelease / Build cortex binary (push) Successful in 4m21s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m41s Details build-prerelease / Package cortex RPM (push) Successful in 1m27s Details build-prerelease / Build neuron-ampere (push) Successful in 4m58s Details build-prerelease / Build neuron-ada (push) Successful in 5m8s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s Details Completes the single-GPU dense path for Qwen3-Next (Qwen3.6's architecture). The four new modules wrap the substantive `linear_attn.rs` (landed previously) with the rest of the transformer: - `arch/qwen3_5/rope.rs` — text-side rotary embedding. MRoPE is simplified to plain RoPE (the three position grids collapse to one for text-only inference); uses candle's `rope_slow` for the GLM-style rotate-half rotation. - `arch/qwen3_5/mlp.rs` — Qwen3_5MLP (SwiGLU: gate/up/down, bias=False). - `arch/qwen3_5/full_attn.rs` — Qwen3_5Attention with the two Qwen3-Next quirks: - `q_proj` widened to `2 * num_heads * head_dim`; second half sigmoid'd and multiplied into the attention output before `o_proj`. - q_norm/k_norm use the `(1+w)*x` RmsNorm variant. - `arch/qwen3_5/decoder.rs` — Qwen3_5DecoderLayer dispatching on `layer_types[i]` to either Full attention or GatedDeltaNet. `arch/qwen3_5/mod.rs` gets the real `Qwen3_5Model` (embedding + layer stack + final norm) and `Qwen3_5ForCausalLM` (model + lm_head). The forward returns `[B, 1, vocab]` to match `qwen3_dense`; the harness's `squeeze_to_vocab` handles either shape. Switch: `candle.rs::load_arch_dense` for `model_type=qwen3_5` now builds a `ShardedVarBuilder` instead of a plain VarBuilder. The sharded backend falls through to the unsharded path when `world_size=1`, so single-GPU load is zero-cost; this lets the forthcoming `tp_qwen3_5.rs` reuse the same load functions without a second copy. Verified: cargo build CPU + --features cuda inside the patched container; clippy clean on both; 32 lib tests still pass. The ForCausalLM forward no longer bails — but numerical correctness vs the Python reference hasn't been validated yet (that's the next step, with the Tbilisi probe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 15:52:33 +03:00
rob thijssen	180274548d	feat(stage-8c): linear-attention layer (Qwen3-Next GatedDeltaNet) All checks were successful build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 2m17s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m48s Details CI / Test (push) Successful in 5m1s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m36s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Successful in 5m13s Details build-prerelease / Build neuron-ada (push) Successful in 4m39s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s Details Implements the recurrent-path Gated DeltaNet block that occupies 48 of Qwen3.6's 64 decoder layers (`layer_types[i] == "linear_attention"`). Ported from `huggingface/transformers/models/qwen3_5/modeling_qwen3_5.py` (`Qwen3_5GatedDeltaNet`, `torch_recurrent_gated_delta_rule`, `Qwen3_5RMSNormGated`, `l2norm`). Layout: `arch/qwen3_5.rs` becomes `arch/qwen3_5/` with submodules - `mod.rs` — Config + (still-stub) ForCausalLM - `linear_attn.rs` — GatedDeltaNet + GatedDeltaNetState - `rmsnorm.rs` — Qwen3_5RmsNorm `(1+w)x`, Qwen3_5RmsNormGated, l2norm Architecture pieces in this commit: - Block: in_proj_qkv + in_proj_z + in_proj_b + in_proj_a + out_proj (all bias=False); depthwise causal Conv1d (k=4) with state-aware prepend; SiLU; per-head reshape; L2norm on q,k. - Discretisation: g = -exp(A_log) softplus(a + dt_bias); beta = σ(b). All computed in f32 to avoid the -inf underflow in fp16 that the reference notes. - Delta rule (recurrent, per-token): state = exp(g_t) kv_mem = state^T · k_t delta = (v_t - kv_mem) beta_t state += outer(k_t, delta) out_t = state^T · q_t - Output: RMSNormGated(core_attn_out, z) reshape out_proj. State (`GatedDeltaNetState`) lives inline on the layer: - conv_state: (B, conv_dim, conv_kernel_size) — left-padded tail. - recurrent_state: (B, num_v_heads, head_k_dim, head_v_dim) — the delta-rule outer-product memory. Cleared via `clear_kv_cache` at the start of every new request. Config extended with the qwen3_5-specific fields: - linear_num_value_heads (48 in Qwen3.6-27B) - linear_num_key_heads (16) - linear_key_head_dim (128) - linear_value_head_dim (128) - linear_conv_kernel_dim (4) - hidden_act ("silu") Performance note: this is the recurrent delta-rule (PyTorch's `torch_recurrent_gated_delta_rule`), correct for any seq_len but O(L) prefill. The chunked algorithm (`torch_chunk_gated_delta_rule`, chunk_size=64) is a follow-up perf optimisation; surface stays the same. 8 unit tests: - softplus small/large branches - l2norm hand-calc + zero-vector stability - repeat_interleave round-trip - forward_smoke on tiny dims (4-head fixture) — verifies shape + no NaN/Inf propagation through the f32-promotion pipeline. Doesn't validate numerical correctness against the Python reference; that requires a fixed-weight fixture and is the next step. cargo clippy CPU + --features cuda both clean; 32 lib tests pass. The ForCausalLM stub still bails on forward — wrapping attention/MLP/decoder layer + lm_head is the next sub-stage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 09:29:52 +03:00
rob thijssen	a70f317729	feat(stage-8c): scaffold qwen3_5 (Qwen3.6) — dispatch + stubs + TP gate All checks were successful build-prerelease / Resolve version stamps (push) Successful in 30s Details CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 2m14s Details CI / Test (push) Successful in 4m29s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m39s Details build-prerelease / Build cortex binary (push) Successful in 4m17s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m31s Details build-prerelease / Build neuron-ampere (push) Successful in 5m13s Details build-prerelease / Build neuron-ada (push) Successful in 5m1s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s Details Lays the wiring for the top-priority TP-2 target without doing the substantive architecture work yet. After this commit, attempting to load a Qwen3.6 (`model_type = "qwen3_5"`) model: - Passes config.json parse — the real upstream shape (text_config wrapper, layer_types, attn_output_gate, head_dim=256, etc.) round- trips through a typed Config (unit test included). - Constructs a placeholder Qwen3_5ForCausalLM, attaches it to a ModelArch::Qwen3_5Dense variant, registers it in the loaded set. - Fails on the first inference forward with a clear "Qwen3-Next forward not implemented yet (Stage 8c, TP-2 motivator)" — the point where the real architecture work begins. New layout: - `harness/arch/` for custom architectures candle-transformers doesn't ship. Each architecture is one module: Config + ForCausalLM + impl. - `harness/arch/qwen3_5.rs` — the scaffold. Heavy doc comments on the open work: layer_types dispatch (full_attention vs linear_attention, the latter being the hard part with no candle precedent), attn_output_gate, text_config nesting, recurrent state lifecycle. - DENSE_SUPPORTED_MODEL_TYPES adds "qwen3_5"; load_arch_dense gains a branch that constructs the stub. TP-side gate: - New `check_tp_arch_supported`: even though Llama / Qwen3 MoE pass the single-GPU dense check (DENSE_SUPPORTED_MODEL_TYPES), the worker pool's `load_dense_shard` reconstructs the config as Qwen3 on every rank — silently misrouting a non-Qwen3 dense load through it would surface as a cryptic per-rank deserialise error. - TP_SUPPORTED_MODEL_TYPES = ["qwen3"] (cuda-gated). Anything else bails before the worker pool spawns and NCCL handshake costs are paid, with a marker pointing at the `tp_<family>.rs` module a contributor would need to add. qwen3_5 specifically lands here until its architecture is real. The naming choice: keep "qwen3_5" from the model's own config.json rather than mistralrs's "qwen3_next" — the latter ages poorly the moment Qwen ship another architecture revision. Unit tests: 2 new for qwen3_5 (config deserialise + dispatch gate); the previously-rejecting test for qwen3_5 swapped to a fictional arch so it stays meaningful as the supported set grows. 26 lib tests pass; cargo clippy CPU + --features cuda both clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:58:01 +03:00
rob thijssen	c6022aa6b9	feat(stage-8b): Llama + Qwen3 MoE families on the candle harness All checks were successful CI / Format (push) Successful in 31s Details build-prerelease / Resolve version stamps (push) Successful in 36s Details CI / Clippy (push) Successful in 2m6s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m50s Details build-prerelease / Build cortex binary (push) Successful in 4m54s Details CI / Test (push) Successful in 4m58s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Successful in 4m43s Details build-prerelease / Build neuron-ada (push) Successful in 5m8s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details Broadens the single-GPU dense and quantized paths to cover three non-Qwen3 architectures already shipped by candle-transformers. TP for these is a separate stage (each family would need its own tp_*.rs mirroring tp_qwen3.rs). `ModelArch` gains four variants: - LlamaDense (boxed — wraps Llama + an inline Cache + the config it takes to rebuild the cache, since candle::llama::Cache has no reset) - LlamaQuantized (candle_transformers::models::quantized_llama) - Qwen3MoeDense (candle::models::qwen3_moe::ModelForCausalLM) - Qwen3MoeQuantized (candle::models::quantized_qwen3_moe::GGUFQWenMoE — takes an explicit compute dtype; F16 by default for best consumer-GPU throughput) The dispatch is method-based now: - `ModelArch::forward(&mut self, input, offset) -> Result<Tensor>` with a shared `squeeze_to_vocab` normalising shape differences (qwen3 returns [B,1,V]; quantized_qwen3 returns [B,V]; new families may differ again — the helper handles all of them). - `ModelArch::clear_kv_cache(&mut self) -> Result<()>`. Llama needs a Cache rebuild because its Cache has no in-place reset; the new `LlamaDense` wrapper holds the bits needed to do it. `run_inference` / `run_inference_streaming` collapse to a single dispatch path: no more per-variant match arms in the hot loop, and new architectures pick up streaming + non-streaming for free with zero changes outside `ModelArch`. DENSE_SUPPORTED_MODEL_TYPES is now ["llama", "qwen3", "qwen3_moe"]. GGUF arch switch grows "qwen3moe" + "llama" branches (qwen3moe with no underscore matches llama.cpp's general.architecture convention). Stage 8a's diagnostic auto-reports the new supported set. The `LlamaDense` variant is boxed because the wrapper's inline Cache + Config makes it 544 bytes vs ~300 for everything else (clippy::large_enum_variant). Verified: cargo test --workspace passes 66 tests; cargo clippy CPU and `--features cuda` both clean (the cuda check ran inside the locally-built `neuron-build-local` container with the math_functions.h patch applied). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:36:22 +03:00
rob thijssen	9e31d8deca	feat(stage-8a): pre-flight architecture check for dense model loads Some checks failed CI / Format (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 34s Details CI / Clippy (push) Successful in 2m21s Details CI / Test (push) Successful in 4m27s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m50s Details build-prerelease / Build cortex binary (push) Successful in 4m28s Details build-prerelease / Package cortex RPM (push) Successful in 1m24s Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details A request to load Qwen/Qwen3.6-27B (model_type "qwen3_5") on the dense path was failing deep inside serde with: missing field `vocab_size` at line 140 column 1 …because Qwen3.6 wraps its actual hyperparameters under `text_config`, so none of `qwen3::Config`'s expected top-level fields are present. The error gave no hint that the architecture was the problem. `check_dense_config_supported` parses `config.json` as an untyped JSON Value, inspects `model_type` (with `architectures` as bonus context), and bails cleanly when it's not in the supported set (currently `["qwen3"]`). The error names the rejected type, the supported set, and points at the files a contributor needs to touch to extend coverage — both the single-process `ModelArch` variants in `candle.rs` and the TP analogue in `tp_qwen3.rs`. Wired into both load paths: - `load_arch_dense` (single-GPU), before the typed deserialize. - `load_tp`, before spawning the worker pool — TP loads of an unsupported arch now fail before NCCL/init costs are paid. 4 unit tests cover the accept/reject/missing-field/malformed cases. Bonus: makes Stage 8b/8c work easier — adding a new architecture is now a `DENSE_SUPPORTED_MODEL_TYPES` edit + ModelArch variant + load branch, with the diagnostic auto-correctly listing the supported set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:27:29 +03:00
rob thijssen	b400e8b704	feat(neuron): honour HF_HUB_CACHE / HF_HOME for the candle harness cache Some checks failed build-prerelease / Resolve version stamps (push) Successful in 31s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m39s Details build-prerelease / Build cortex binary (push) Successful in 4m17s Details build-prerelease / Package cortex RPM (push) Successful in 1m22s Details CI / Format (push) Successful in 32s Details CI / Test (push) Failing after 51s Details CI / Clippy (push) Successful in 2m17s Details build-prerelease / Build neuron-ampere (push) Successful in 4m58s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ada (push) Successful in 5m1s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m4s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m37s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s Details Resolves the candle harness's HuggingFace cache directory with the following precedence (first hit wins): 1. Explicit `hf_cache` in `[harness.candle]` from neuron.toml. 2. `HF_HUB_CACHE` env var — the Python `huggingface_hub` convention. The Rust hf-hub crate doesn't read this natively, so we bridge here. 3. `HF_HOME` env var (`$HF_HOME/hub` per the canonical layout). 4. None — falls through to hf-hub's own default. Honouring HF_HUB_CACHE lets a neuron host reuse an existing cache directory shared with Python tooling or other harnesses on the same host without per-tool config. The canonical per-host setup is a systemd drop-in: /etc/systemd/system/neuron.service.d/local.conf [Service] Environment=HF_HUB_CACHE=/archive/hf-cache neuron.example.toml documents the resolution chain inline. script/validate-neuron.sh: bump LOAD_TIMEOUT from 600s to 3600s and expose both load/infer timeouts via env (NEURON_LOAD_TIMEOUT, NEURON_INFER_TIMEOUT). A Qwen3.6-class dense model is ~54 GB and was hitting the 10-min ceiling cold-downloading on a residential link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:52:50 +03:00
rob thijssen	f72dee094f	feat(tp): Stage 7c-i — streaming SSE through TP Some checks failed build-prerelease / Package cortex RPM (push) Blocked by required conditions Details build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 2m12s Details CI / Test (push) Successful in 5m3s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m39s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 5m7s Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details `chat_completion_stream` no longer returns an error for TP loads. The new `chat_completion_tp_stream` mirrors the non-streaming TP path (clear_kv_cache, prefill, sample, decode loop) but emits one `ChatCompletionChunk` per generated token over an mpsc channel so the handler can write a streaming SSE response. Unlike the single-GPU streaming path (which runs candle's forward inside `spawn_blocking` and uses `blocking_send`), the TP loop is itself async — every `pool.generate_step` already awaits the leader's own spawn_blocking forward plus every worker's recv_only. So the orchestration runs as a plain `tokio::spawn` task using `Sender::send`. The shared `emit_chunk` helper tracks the cumulative decoded prefix and emits the delta — same UTF-8-safe BPE boundary handling as the single-GPU streaming path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:32:46 +03:00
rob thijssen	d46d8d4f6c	feat(tp): Stage 7b-iv — RPC + orchestration for TP load/inference All checks were successful build-prerelease / Resolve version stamps (push) Successful in 38s Details CI / Format (push) Successful in 40s Details CI / Clippy (push) Successful in 2m20s Details build-prerelease / Build cortex binary (push) Successful in 4m25s Details build-prerelease / Package cortex RPM (push) Successful in 1m22s Details CI / Test (push) Successful in 4m34s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m57s Details build-prerelease / Build neuron-ampere (push) Successful in 4m51s Details build-prerelease / Build neuron-ada (push) Successful in 5m12s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m49s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m51s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details Wires the in-flight TP machinery (Stage 7a workers, 7b-iii sharded Qwen3) end to end so a non-streaming chat completion can run across multiple GPUs via NCCL. RPC additions (tp/rpc.rs): - LoadDenseShard{model_id, config_json, safetensors_paths} - GenerateStep{model_id, tokens, offset} - ClearKvCache{model_id} - UnloadModel{model_id} - LoadDenseShardOk / GenerateStepOk / KvCacheCleared / Unloaded Worker side (tp/worker.rs): - WorkerState gains a `models: HashMap<String, TpQwen3ForCausalLM>` keyed by model_id. LoadDenseShard mmaps safetensors via ShardedVarBuilder (only this rank's slice materialises), builds the TP model with the rank's NCCL Comm cloned from NcclState. - GenerateStep runs the rank-local forward; the resulting logits are dropped (only the leader's are used for sampling). The forward's value here is the NCCL collectives inside the row-parallel layers letting the leader's rank-0 forward make progress. Pool side (tp/mod.rs): - WorkerPool::load_dense_shard fans LoadDenseShard out to every worker, builds rank 0's shard on the leader via spawn_blocking with a fresh SendComm wrapper at the move boundary (Comm is !Send at the type level), collects per-rank LoadDenseShardOk. Returns the leader's Arc<Mutex<TpQwen3ForCausalLM>>. - WorkerPool::generate_step fans GenerateStep out, runs the leader's rank-0 forward in spawn_blocking (the AllReduce CustomOps inside row-parallel layers block until every worker issues the matching collective), returns the leader's last-position logits Tensor. - WorkerPool::clear_kv_cache + unload_model follow the same pattern. NcclState refactor (tp/nccl_state.rs): - comm field becomes Option<Arc<Comm>> (was Option<Comm>) so callers can share a clone with TpQwen3ForCausalLM::load. - new `comm()` accessor + `SendComm` wrapper for spawn_blocking moves. - single allow(clippy::arc_with_non_send_sync) at the canonical construction site (Comm is !Send by type but the runtime invariant is enforced by SendComm + the pool's Mutex). Harness side (candle.rs): - LoadedHandle enum (Single \| Tp) replaces the bare Arc<LoadedModel> in the harness's registry. list_models / unload_model / inference_endpoint walk the enum uniformly. - TpLoadedModel holds the pool + leader_model + tokenizer + devices. - load_model dispatches on `spec.tensor_parallel > 1` to a new cuda-gated load_tp path: resolve dense files via hf-hub, spawn the pool, init_nccl, load_dense_shard. - chat_completion branches on the handle variant. The TP path mirrors run_inference: clear_kv_cache, prefill, sample, decode loop, detokenize. Acquires the pool Mutex for the whole request. - Streaming through TP is deferred to Stage 7c (returns Other(err)). Script (script/validate-neuron.sh): - 4th positional arg `tp_size` (default 1). When >1, switches to the dense path (tp + GGUF is mutually exclusive — bails) and adds `tensor_parallel` + `devices` to the load payload. NEURON_DEVICES env overrides the default 0..N-1 device list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:38:33 +03:00
rob thijssen	9b8bd146f6	feat(tp): --tp-smoke CLI subcommand + remote validation script All checks were successful CI / Format (push) Successful in 36s Details build-prerelease / Resolve version stamps (push) Successful in 38s Details CI / Clippy (push) Successful in 2m19s Details CI / Test (push) Successful in 4m32s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m43s Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m16s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Successful in 4m56s Details build-prerelease / Build neuron-ada (push) Successful in 5m1s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s Details Adds a one-shot diagnostic that exercises the lower half of the TP stack — WorkerPool::spawn, init_nccl, nccl_sanity_check — in isolation from model load and inference. Runs N-1 worker subprocesses (rank 0 stays in this process), joins them in an NCCL communicator on the specified CUDA devices, all_reduces a sentinel 1u32 per rank, verifies the observed_sum equals world_size on every rank, then shuts down. Output is `status=ok` on stdout (plus key=value lines for tp_size and cuda_devices) when every check passes, non-zero exit + tracing on stderr otherwise. The smoke command is diagnostic-only and not exposed through the daemon HTTP API. script/tp-smoke.sh wraps it with an ssh invocation against a fleet host (default beast — the only host with 2 GPUs) and asserts the status line, mirroring the validate-neuron.sh ergonomics. This is step 1 of the TP test plan. A failure here means TP cannot work on the host at all; step 2 (Stage 7b-iv) wires real model load and inference through the same WorkerPool primitives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 19:40:25 +03:00
rob thijssen	96d8755245	fix(tp): add half dep + drop double-wrapped .w() on CudaDevice::alloc All checks were successful build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 2m17s Details CI / Test (push) Successful in 4m50s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m36s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Build neuron-ampere (push) Successful in 5m13s Details build-prerelease / Build neuron-ada (push) Successful in 4m42s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m12s Details Two follow-up cuda-only fixes surfaced by `cargo build --features cuda` inside the cuda-13.0 runner container: 1. `half::{bf16, f16}` was an undeclared dep. Added `half = "2.5"` (matching candle-core's pinned major) under the cuda feature flag. 2. `dev.alloc::<T>(n)` already returns `candle_core::Result` (it calls `.w()` internally on the cudarc error). Calling `.w()?` on top of that needs `From<candle_core::Error> for CudaError`, which doesn't exist — collapse to `?`. Removed the now-unused `cuda_backend::WrapErr` import. Verified by `cargo build -p neuron --features cuda` and `cargo clippy -p neuron --all-targets --features cuda -- -D warnings` inside `git.lair.cafe/gongfoo/runner-cuda-13.0` with the local glibc/CUDA-13.0 math_functions.h noexcept patch. CPU clippy/tests stay green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 19:11:59 +03:00
rob thijssen	12549c9aed	fix(tp): import BackendStorage trait for CudaStorage methods Some checks failed build-prerelease / Resolve version stamps (push) Successful in 32s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 3m9s Details CI / Test (push) Successful in 4m28s Details build-prerelease / Build neuron-blackwell (push) Failing after 3m41s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Failing after 4m45s Details build-prerelease / Build neuron-ada (push) Failing after 5m13s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Stage 7b-iii (1/2) introduced AllReduce with `s.device()` and `s.dtype()` calls on `&CudaStorage`. Both come from the `candle_core::backend::BackendStorage` trait, which wasn't imported — fine on CPU builds (the cuda_fwd block was cfg-gated out) but the prerelease cuda build hit E0599. Also drop the unused `cudarc::driver::DeviceSlice` import inside cuda_fwd — `CudaSlice::len()` is an inherent method on cudarc 0.19, not a trait method. Caught by run 2894 (build-neuron-{blackwell,ampere}); CPU clippy + tests stay green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 18:32:05 +03:00
rob thijssen	46527d7804	feat(tp): TP-aware Qwen3 dense model (Stage 7b-iii 2/2) Mirrors candle_transformers::models::qwen3 structurally with column- parallel q/k/v + gate/up projections, row-parallel o + down projections, and replicated embedding/norms/lm_head. Per-rank head counts come from dividing num_attention_heads / num_key_value_heads by world_size at load time; intermediate_size split likewise. Load bails on any non-divisible shape — the safetensors slice would lose data otherwise. KV cache holds the rank-local slice since K/V come out of column-parallel projections; no cache resharding across ranks. Causal mask is computed on rank 0 shape and broadcasts over the head dim so per-rank H differs without rework. Replicated tensors (embedding, all RmsNorms, untied lm_head) load via vb.get(shape, name), which uses the default Shard { world_size: 1 } and falls through to the unsharded backend path on ShardedSafeTensors. The cuda / non-cuda load splits track the existing tp_linear pattern: RowParallelLinear takes an Arc<Comm> only under cuda, and the higher- level composers (TpQwen3MLP, TpQwen3Attention, TpDecoderLayer, TpQwen3Model, TpQwen3ForCausalLM) thread it through accordingly. 7b-iv wires RPC + dispatch in CandleHarness::load_model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 18:24:20 +03:00
rob thijssen	8d3194f992	Stage 7b-iii (1/2): AllReduce CustomOp + ShardedVarBuilder-backed TP linears Some checks failed build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 2m16s Details build-prerelease / Build neuron-blackwell (push) Failing after 3m19s Details CI / Test (push) Successful in 4m26s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m22s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Failing after 4m58s Details build-prerelease / Build neuron-ada (push) Failing after 4m53s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Ports the canonical candle-examples/examples/llama_multiprocess/model.rs pattern into the harness. Two new files, one deletion: - harness/tp/all_reduce.rs — AllReduce wraps Arc<cudarc::nccl::Comm> and implements candle's CustomOp1 trait. cuda_fwd extracts the rank's CudaSlice<dtype> from a CudaStorage, asserts the input is contiguous (a strided activation hitting all_reduce is almost always a model construction bug), allocates an output CudaSlice on the same device, calls Comm::all_reduce(Sum), and wraps the result back as a CudaStorage. Handles BF16, F16, F32. NcclError surfaces via {e:?} (no Display impl in cudarc 0.19.x). Send/Sync hand-impl'd with the same NCCL-thread-safety caveat candle's example documents. - harness/tp/tp_linear.rs — ColumnParallelLinear and RowParallelLinear, both built on candle's ShardedVarBuilder + Shard hints. `vb.get_with_hints((), "weight", shard(dim, rank, ws))` reads JUST the rank's slice from the safetensors view; no full- tensor host materialisation. ColumnParallel.forward is a plain local matmul (output is naturally sharded). RowParallel.forward = local matmul + apply_op1_no_bwd(&self.all_reduce). On CPU / world_size == 1, the AllReduce is skipped and the partial output is returned as-is. Both layers are no-bias — every Qwen3-family target sets attention_bias=false; bias-aware sharding is a future-model concern. - Deletes harness/tp/sharded_linear.rs from 7b-ii. That commit's hand-rolled "load full + narrow" approach was useful exploration but candle's ShardedVarBuilder does the same work without materialising the full tensor on host. The 5 unit tests there verified the slicing math against an unsharded reference; that math now lives inside candle and is covered by candle's own tests. Next (7b-iii 2/2): TpQwen3Attention + TpQwen3MLP composing the column/row pair, then a TpQwen3Model that runs the full forward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 18:14:54 +03:00
rob thijssen	5436af9c73	fix(neuron/candle): dense Qwen3 returns rank-3 logits, double-squeeze All checks were successful build-prerelease / Resolve version stamps (push) Successful in 33s Details CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 2m19s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m32s Details CI / Test (push) Successful in 4m34s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m16s Details build-prerelease / Package cortex RPM (push) Successful in 1m18s Details build-prerelease / Build neuron-ampere (push) Successful in 4m55s Details build-prerelease / Build neuron-ada (push) Successful in 5m11s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m35s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details Caught by live validation against Qwen/Qwen3-1.7B on beast: HTTP 500 "unexpected rank, expected: 1, got: 2 ([1, 151936])" Candle's qwen3::ModelForCausalLM::forward returns shape [B, 1, V] (no final squeeze) while quantized_qwen3::ModelWeights::forward returns [B, V] (with squeeze(1) at the end). My match arms applied a single squeeze(0) uniformly, which is correct for the quantized [1, V] → [V] but leaves the dense at [1, V] → which then trips apply_repeat_penalty::to_vec1() expecting rank 1. Dense match arms now strip both batch and seq dims: model.forward(&input, offset)?.squeeze(0)?.squeeze(0)? Also fixes validate-neuron.sh's `${3:-Q4_K_M}` → `${3-Q4_K_M}` (no colon) so passing an explicit empty third arg now drives the dense path instead of falling back to Q4_K_M. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:49:43 +03:00
rob thijssen	8e882c0757	fix(neuron/tp): NcclError {e:?} + cudarc 0.19 deprecation cleanup All checks were successful CI / Format (push) Successful in 38s Details build-prerelease / Resolve version stamps (push) Successful in 40s Details CI / Clippy (push) Successful in 2m15s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m35s Details CI / Test (push) Successful in 5m0s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m51s Details build-prerelease / Package cortex RPM (push) Successful in 1m27s Details build-prerelease / Build neuron-ampere (push) Successful in 4m55s Details build-prerelease / Build neuron-ada (push) Successful in 4m57s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m37s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s Details Two cuda-feature-only build errors only the CI runner catches: 1. cudarc::nccl::NcclError doesn't impl Display in 0.19.x, so the `format!("...: {e}")` map_err calls fail to compile when the cuda feature actually wires them up. Switch every NcclError-typed `{e}` in nccl_state.rs to `{e:?}` — surfaces variant + ncclResult code in the same diagnostic shape just via Debug instead of Display. 2. cudarc::CudaStream::memcpy_stod / memcpy_dtov are deprecated in 0.19.7 in favour of clone_htod / clone_dtoh. The replacements take/return the same types, so the swap is mechanical. Dev box can't compile with --features cuda (no nvcc), so these only surface in the build-prerelease CUDA matrix jobs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:24:13 +03:00
rob thijssen	93421f48e2	Stage 7b-ii: ColumnParallel + RowParallel sharded linear primitives Some checks failed build-prerelease / Resolve version stamps (push) Successful in 30s Details CI / Format (push) Successful in 31s Details CI / Clippy (push) Failing after 49s Details build-prerelease / Build neuron-blackwell (push) Failing after 3m29s Details build-prerelease / Build cortex binary (push) Successful in 4m41s Details CI / Test (push) Successful in 5m6s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-ampere (push) Failing after 5m1s Details build-prerelease / Build neuron-ada (push) Failing after 4m53s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Adds harness/tp/sharded_linear.rs with ShardedLinear — a Megatron-LM style sharded wrapper over candle_nn::Linear. Two constructors: - load_column: splits the output dimension. Each rank holds rows [rout/N .. (r+1)out/N] of the weight, plus its slice of the bias. Forward = local matmul; output is naturally sharded; downstream consumer either accepts the shard (next layer is column-parallel) or merges via all-gather later. - load_row: splits the input dimension. Each rank holds cols [rin/N .. (r+1)in/N] of the weight; bias lives only on rank 0 so the post-all_reduce sum carries it exactly once. Forward produces a partial output that the caller reduces via NCCL. Both constructors bail with a clear error when divisibility doesn't hold — the precondition mistral.rs's first qwen3-next-tp commit made explicit. The path included in the error is the VarBuilder prefix, so the operator sees exactly which projection failed ("column-parallel 'model.layers.0.self_attn.q_proj': out_features=..."). 5 unit tests on CPU verify the math against an unsharded reference: - column shard produces the expected slice of the full matmul - row partials sum to the unsharded result - row bias appears only on rank 0 - divisibility violations bail (column + row) forward_with_comm() is stubbed for row-parallel (CUDA-only) — wiring the actual cudarc::nccl all_reduce against candle's Tensor lands in 7b-iii alongside the model assembly, where the model holds the Comm in scope. ColumnParallel's forward_with_comm just delegates to the local matmul (no collective needed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:07:19 +03:00
rob thijssen	05e15f3597	Stage 7b-i: dense safetensors Qwen3 load path Some checks failed build-prerelease / Build cortex binary (push) Blocked by required conditions Details CI / Test (push) Waiting to run Details CI / Format (push) Successful in 43s Details build-prerelease / Resolve version stamps (push) Successful in 44s Details CI / Clippy (push) Successful in 2m4s Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package cortex RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details build-prerelease / Build neuron-blackwell (push) Has been cancelled Details Adds the bf16/fp16 safetensors path alongside the existing GGUF quantized one. The harness now dispatches by ModelSpec.quant: - Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged). - None → safetensors dense (new). The dense path uses candle-transformers::models::qwen3::ModelForCausalLM verbatim, fed via VarBuilder::from_mmaped_safetensors over the files listed in `model.safetensors.index.json` (sharded layout) or the single `model.safetensors` fallback. dtype is bf16 to match the canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from the same repo (no -GGUF suffix to strip). ModelArch gains a Qwen3Dense variant; the forward signature mirrors QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position logits), so run_inference / run_inference_streaming just add a parallel match arm — no shape changes downstream. This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on: because the source is dense safetensors that can be byte-sliced per rank, the TP work avoids the GGUF super-block alignment problem entirely. Vanilla GGUF inference keeps working unchanged. validate-neuron.sh learns the dense path: pass an empty third arg (quant) and the script omits the `quant` field from the load payload, triggering the dense dispatch. Example: script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B '' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:03:59 +03:00
rob thijssen	da068ded6d	Stage 7a-ii: real NCCL handshake behind the worker pool Some checks failed CI / Format (push) Failing after 38s Details build-prerelease / Resolve version stamps (push) Successful in 42s Details CI / Clippy (push) Successful in 2m18s Details build-prerelease / Build neuron-blackwell (push) Failing after 3m33s Details CI / Test (push) Successful in 4m27s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m31s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Failing after 4m19s Details build-prerelease / Build neuron-ada (push) Failing after 4m56s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i. With --features cuda the leader and its workers now establish a live NCCL communicator end-to-end; without the feature the same code paths return Error{kind="cuda_feature_not_enabled"} so a misconfigured build is obvious instead of silently no-op. NCCL state machine (harness/tp/nccl_state.rs) is shared between the worker process and the leader's pool: - generate_comm_id_hex() mints an Id::new() on the leader. - NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit, opens a CudaContext on the configured device, calls Comm::from_rank with the supplied (rank, world_size, id). NCCL blocks until every rank has joined. - NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader asserts every rank reports observed_sum == world_size. - NCCL handles serialised under Mutex; unsafe impl Send/Sync gates the Comm across spawn_blocking boundaries (NCCL is move-safe; only concurrent op issuance is unsafe). WorkerPool::init_nccl orchestrates the rendezvous: 1. Write Init { comm_id } to every worker's stdin (no await yet). 2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking, concurrently with workers. 3. NCCL handshake completes for all ranks simultaneously. 4. Leader collects InitOk responses. WorkerPool::nccl_sanity_check follows the same pattern over all_reduce, validating world_size == observed_sum on every rank. Worker.send_only / Worker.recv_only split out from the previous monolithic Worker.request so the leader can interleave its own NCCL work with the worker calls — required because NCCL blocks during init. Tests: - 4 hex roundtrip unit tests for the wire encoding. - The 7a-i "not implemented" expectation now reads "cuda_feature_not_enabled" on the local dev box (no CUDA), or accepts InitOk on a cuda-built test binary. - New cuda-integration test in tp_worker_lifecycle_cuda.rs covers the real init + sanity round-trip; gated on the cuda-integration feature so default CI doesn't try to NCCL. Verifiable on beast (2× RTX 5090): cargo test -p neuron --features cuda-integration \ --test tp_worker_lifecycle_cuda Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 16:40:01 +03:00
rob thijssen	2a7ede0232	Stage 7a-i: TP worker lifecycle scaffolding All checks were successful CI / Format (push) Successful in 36s Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Clippy (push) Successful in 2m12s Details CI / Test (push) Successful in 4m25s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m49s Details build-prerelease / Build cortex binary (push) Successful in 4m22s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Successful in 5m9s Details build-prerelease / Build neuron-ada (push) Successful in 4m59s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s Details Leader → worker process plumbing for tensor parallelism. The neuron binary picks up two modes: default (the existing daemon, axum + HTTP) and `--worker` (a bare RPC loop driven over stdin/stdout). The leader spawns one worker per non-zero NCCL rank via tokio::process::Command on the same binary path (production: /proc/self/exe; tests: env!("CARGO_BIN_EXE_neuron")) and talks to each over newline- delimited JSON. Protocol (harness/tp/rpc.rs) is serde-tagged from the start — WorkerRequest::{Ping, Init, NcclSanityCheck, Shutdown} and WorkerResponse::{Pong, InitOk, NcclSanityResult, Bye, Error}, both `#[serde(tag = "op", rename_all = "snake_case")]`. Adding ops in 7b/7c is purely additive; unknown ops on the wire fail to parse (verified in unit tests). 7a-i scope: - WorkerPool::spawn(binary, world_size, devices) forks ranks 1..N as subprocesses, captures stdin/stdout, kills on drop. - ping_all() round-trips a Ping to every worker and validates the returned rank. - shutdown() sends Shutdown to each worker, awaits Bye, reaps. - Worker mode: parse Ping/Shutdown, return Pong/Bye; Init and NcclSanityCheck return Error{kind="not_implemented_7a_i"} so a 7a-ii binary speaking the same wire is a drop-in replacement (the kind field signals "real NCCL lands in the next commit"). - CandleHarness::load_model refuses tensor_parallel > 1 with a clear message until 7b is in. Three integration tests in tests/tp_worker_lifecycle.rs cover spawn/ ping/shutdown for 2- and 3-worker pools, plus the not_implemented_7a_i contract test for Init. Seven rpc serde unit tests assert the wire shape (op tags, field names, unknown-op rejection). All pass on the dev host; no CUDA required. Stage 7a-ii (next): the real NCCL Comm::from_rank wiring behind the existing Init/NcclSanityCheck op surface, CUDA-gated. Verifiable on beast's 2×5090. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 15:53:00 +03:00
rob thijssen	18ae3c30ee	post-validation cleanup: cuDNN runtime + repetition penalty All checks were successful CI / Format (push) Successful in 34s Details build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Clippy (push) Successful in 2m17s Details CI / Test (push) Successful in 4m16s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m28s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m42s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Build neuron-ampere (push) Successful in 4m27s Details build-prerelease / Build neuron-ada (push) Successful in 4m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 6m52s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 2m32s Details Two followups from the live single-GPU validation pass. 1. deploy.sh now ensures libcudnn.so.9 is available on each neuron host before installing/upgrading the package. Probes ldconfig first so hosts with a manual (tar/runfile) cuDNN install are untouched, then adds NVIDIA's RHEL9 CUDA repo (the Fedora 43 CUDA repo doesn't ship cuDNN; only the RHEL9 one does) and installs libcudnn9-cuda-13. benjy hit "cannot open shared object file: libcudnn.so.9" during validation; this prevents that recurring. 2. candle.rs applies a 1.1 repetition penalty over the last 64 generated tokens before sampling, in both the non-streaming chat_completion path and the streaming chat_completion_stream path. Without it small Q4_K_M models degenerate into "Wait, no, no..." loops once they hit a confident-but-wrong path; with it sampling stays coherent. Defaults match mistral.rs and llama.cpp; exposing the value via the OpenAI request (frequency/presence penalty mapping) is Stage 8 territory. Both routes through a new sample_with_penalty() helper so future sampling tweaks land in one place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:48:08 +03:00
rob thijssen	602e8e1471	fix(neuron/candle): source tokenizer.json from base repo when GGUF Some checks failed build-prerelease / Resolve version stamps (push) Successful in 31s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Failing after 50s Details CI / Test (push) Failing after 49s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m32s Details build-prerelease / Build cortex binary (push) Successful in 4m34s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Successful in 5m9s Details build-prerelease / Build neuron-ada (push) Successful in 4m52s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m36s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s Details GGUF-only HF repos (unsloth/Qwen3--GGUF, Qwen/Qwen3--GGUF) ship the .gguf file but not tokenizer.json — the tokenizer data is embedded in the GGUF metadata itself, and the standalone tokenizer.json lives in the base non-GGUF repo (unsloth/Qwen3-0.6B, Qwen/Qwen3-0.6B, etc.). Live validation against quadbrat hit: HTTP 400 fetch tokenizer.json from unsloth/Qwen3-0.6B-GGUF: HTTP status client error (404 Not Found) resolve_files now derives the tokenizer repo by stripping a `-GGUF` or `-gguf` suffix from the model_id; non-GGUF ids fall through to fetching from the same repo. The error message includes the attempted tokenizer repo id so the next failure (e.g. base repo doesn't exist) is unambiguous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:16:39 +03:00
rob thijssen	6cf87e328f	chore(neuron): log load_model failures server-side with full chain The HTTP handler now emits a tracing::warn on load_model failures with the expanded anyhow chain (format!("{e:#}")) before returning the 400. journalctl -u neuron will surface the underlying hf-hub / materialisation error without needing to capture the curl response body separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:08:54 +03:00
rob thijssen	f9f5fa41b6	fix(neuron): surface full anyhow chain + ensure $HOME exists at start Some checks failed CI / Format (push) Successful in 30s Details CI / Test (push) Failing after 49s Details CI / Clippy (push) Successful in 2m16s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Two fixes uncovered by the live validation against beast/benjy/quadbrat: 1. api.rs swallowed everything beyond the outermost anyhow context. The validation script reported '{"error":"fetch GGUF ...gguf"}' but the actual underlying hf-hub failure (cache dir creation, network, auth, etc.) was hidden. Switching every error response to format!("{e:#}") expands the full cause chain via anyhow's alternate Display format. 2. The neuron systemd unit declared the service user but never ensured /var/lib/neuron (its $HOME) existed. hf-hub defaults its cache to ~/.cache/huggingface/hub — when $HOME is absent the cache dir creation fails and the download aborts. Adding `StateDirectory=neuron` makes systemd create + chown that directory at activation; no spec change needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 08:17:37 +03:00
rob thijssen	aad314cdfa	feat(neuron): graceful unload-on-shutdown via SIGTERM/SIGINT Stage 6 of the candle-native pivot. Adds first-class deactivation: neuron now drains in-flight requests on SIGTERM (systemd stop) or SIGINT (Ctrl-C), then unloads every loaded model before the process exits — releasing CUDA contexts and VRAM cleanly rather than leaving the OS to reclaim them. Mechanism: - startup::shutdown_signal() resolves on either ctrl_c() or a SIGTERM listener. - axum::serve(...).with_graceful_shutdown(shutdown_signal()) stops accepting new connections, lets active requests finish, then returns control to main. - startup::unload_all_models(&registry) iterates list_all_models() and calls unload per entry. Per-model failures are logged warnings; cleanup continues. Empty registry is a fast no-op. - main holds an Arc<NeuronState> reference past axum's lifetime so the registry is still reachable for the unload sweep. data/neuron.service: - TimeoutStopSec=120s — generous bound for big-model unloads before systemd escalates to SIGKILL. - KillSignal=SIGTERM — explicit, matches the handler. Two non-gated tests cover the empty-registry no-op and the no-models- loaded path. Real load-then-unload-on-shutdown is exercised by the cuda-integration test from Stage 2 (which calls unload_model directly) and observable on a real GPU host by stopping the service and watching nvidia-smi. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:58:07 +03:00
rob thijssen	6779b7526a	feat(neuron): load default_models on service activation All checks were successful CI / Format (push) Successful in 34s Details CI / Clippy (push) Successful in 2m13s Details CI / Test (push) Successful in 4m6s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Stage 5 of the candle-native pivot. Adds first-class support for auto-loading a configured set of models when the neuron service activates. Config: - NeuronConfig.default_models: Vec<ModelSpec> (defaults to []). - neuron.example.toml ships a commented [[default_models]] example. Activation flow (crates/neuron/src/startup.rs::load_default_models): - Sequential — VRAM contention makes parallel loads risky. - Per-entry timing logged at info level on success. - Failures logged as warnings; the next entry is still attempted. - An empty list short-circuits without log noise. Called from main.rs after the registry is built and before the axum listener binds, so /models reflects the loaded state from the very first request. data/neuron.service gains TimeoutStartSec=1800s. With activation blocked on potentially slow first-time HF downloads + GGUF materialisation, systemd's default 90s would kill larger model loads mid-flight. Two non-gated tests in tests/activation.rs cover the continues-past-failure and empty-list paths using a synthetically unknown harness name to fail loads fast without touching the network. The cuda-integration test from earlier stages still exercises the real load/unload lifecycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:56:08 +03:00
rob thijssen	84f5662df1	feat(neuron): OpenAI-compatible SSE streaming chat completions Stage 4 of the candle-native pivot. /v1/chat/completions now switches to text/event-stream when the request sets stream: true, emitting one chat.completion.chunk per generated token followed by the OpenAI [DONE] terminator. Pipeline: - chat_completion_stream creates a bounded mpsc::channel<ChatCompletionChunk>(32), sends the leading role chunk, then spawns a blocking task that acquires the per-model arch lock and runs the streaming generation loop. - run_inference_streaming tracks a cumulative decoded prefix so each chunk's delta.content is the substring added since the last chunk — safe across BPE byte-fallback boundaries that would otherwise split multi-byte UTF-8 chars. - The blocking task aborts cleanly if blocking_send fails (client disconnected), so generation stops when the SSE consumer hangs up. - Final chunk carries finish_reason ("stop" on EOS, "length" on max_tokens). The handler appends data: [DONE] after the channel closes. The Stage 3 streaming 501 placeholder test is repurposed: with the streaming path live, an unloaded model now hits the same 404 surface as the non-streaming path (the model lookup happens first). cortex-gateway's existing proxy is unchanged — it already forwards SSE bytes verbatim from Phase 2 work, so the candle SSE format passes through unmodified. Neuron Cargo.toml gains futures + tokio-stream (both already in workspace deps) for ReceiverStream and stream combinators. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:53:14 +03:00
rob thijssen	729317d1ef	feat(neuron): OpenAI-compatible non-streaming chat completion Stage 3 of the candle-native pivot. neuron now serves POST /v1/chat/completions backed by candle's quantized_qwen3 forward pass on a per-model serialised generation loop, returning the standard OpenAI ChatCompletionResponse envelope. Pipeline per request: - Look up the LoadedModel by request.model (404 if absent). - Apply the Qwen3 chat template across all messages. - Tokenize, then spawn_blocking onto tokio's blocking pool to acquire the per-model arch lock and run prefill + greedy/temperature/top-p sampling via LogitsProcessor. - Stop on <\|im_end\|>/<\|endoftext\|> EOS or max_tokens (finish_reason "stop" vs "length"). - Decode with skip_special_tokens=true, build OpenAI response with prompt/completion/total usage counts. Supporting changes: - HarnessRegistry now stores Arc<dyn Harness> and caches a typed Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch. - LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard can be moved into spawn_blocking. - NeuronState gains an Option<Arc<CandleHarness>> field for the new inference route. - Typed InferenceError lets the handler map ModelNotLoaded → 404 and other failures → 500 without string-matching anyhow messages. - stream=true returns 501 until Stage 4 wires up SSE. - Two leftover mistral.rs string references in proxy.rs and cortex-cli (missed during the Stage 1 sweep) are corrected here. Three new default-feature tests cover the no-candle 503, model-not- loaded 404, and stream=true 501 paths. The cuda-integration test from Stage 2 still covers real load/unload; a streaming-feature gated test exercising actual generation will arrive with Stage 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:47:58 +03:00
rob thijssen	5c2bd1a1da	feat(neuron): wire candle harness load/unload via GGUF Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a LoadedModel registry keyed by model_id, hf-hub-backed GGUF download, and Qwen3 quantized weight construction via candle-transformers' quantized_qwen3 module. unload_model drops the entry; Drop on the candle ModelWeights frees device memory. Device selection prefers CUDA (gated behind the new `cuda` feature), falling back to CPU when CUDA is unavailable so default builds work on non-GPU hosts. The candle CUDA toolchain isn't pulled in unless `--features cuda` is passed, keeping CI green on CPU runners. Config gains a [harness.candle] block with an optional hf_cache path. HarnessRegistry::from_configs now takes HarnessSettings so per-harness config flows through. A gated tests/candle_lifecycle.rs exercises real load → list → unload → list-empty when run with `--features cuda-integration` against a host with HF network access. The default-feature test in tests/api.rs covers the wrong-harness rejection path without needing the network. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:02:49 +03:00
rob thijssen	3cccc2c56b	refactor(neuron): cut mistralrs/llamacpp, scaffold candle harness Stage 1 of the candle-native pivot. Replaces the external-process harness model (mistralrs over HTTP, llamacpp placeholder) with an in-process Harness trait whose sole implementation is candle. The trait keeps its shape so future engines slot in additively, but start/stop default to no-ops and HarnessConfig drops endpoint and systemd_unit since no harness needs external supervision. Behaviour is unchanged on the wire: load_model returns a "not implemented yet (Stage 2)" error and list_models is empty. The gateway-side proxy, poller, and router are untouched. CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are marked superseded; the staged plan lives in ~/.claude/plans/create-a-more-aggressive-calm-naur.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:53:04 +03:00
rob thijssen	3f94c50817	chore: move default ports out of common-collision ranges Previous defaults collided with well-trodden infra services and with the Linux ephemeral port range: - cortex API 8000 — common dev-server default (Django, minio UI) - cortex metrics 9100 — Prometheus node_exporter default - neuron API 9090 — Cockpit default on Fedora, Prometheus self Move to helexa-themed palindromic ports, all below Linux's 32768-60999 ephemeral range and not registered to any well-known service: - cortex API 31313 - cortex metrics 31314 - neuron API 13131 Updated places: - cortex.example.toml, neuron.example.toml defaults - default impls in cortex-core and neuron config - cortex-cli --endpoint default for the status subcommand - doc comments citing example URLs - README.md and CLAUDE.md snippets Consumers already on the old ports need a one-line edit in their /etc/cortex/cortex.toml or /etc/neuron/neuron.toml to match; firewall rules and prometheus scrape configs will also need updating. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 17:45:25 +03:00
rob thijssen	6c238f4557	refactor: rename cortex-neuron binary and crate to neuron All checks were successful CI / Format, lint, build, test (push) Successful in 2m28s Details CI / Build SRPM (push) Has been skipped Details CI / Publish to COPR (push) Has been skipped Details Package name, lib name, and binary all now just "neuron" without the cortex- prefix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 15:51:15 +03:00
rob thijssen	26e5e7ead8	feat: implement mistral.rs harness and neuron model API All checks were successful CI / Format, lint, build, test (push) Successful in 2m30s Details CI / Build SRPM (push) Has been skipped Details CI / Publish to COPR (push) Has been skipped Details - MistralRsHarness: Harness trait impl wrapping mistral.rs HTTP API (list/load/unload models, health check, start/stop via systemd) - HarnessRegistry: maps harness name -> Box<dyn Harness>, built from neuron.toml config - Neuron API endpoints: GET /models, POST /models/load, POST /models/unload, GET /models/:id/endpoint - NeuronConfig: figment-based config loading from neuron.toml - Integration test: full model lifecycle through mock mistral.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:29:42 +03:00
rob thijssen	6dc717ebcd	feat: add neuron daemon with GPU discovery and health endpoints All checks were successful CI / Format, lint, build, test (push) Successful in 2m29s Details CI / Build SRPM (push) Has been skipped Details CI / Publish to COPR (push) Has been skipped Details Replace cortex-agent stub with neuron (cortex-neuron binary). cortex-core additions: - discovery.rs: DeviceInfo, DiscoveryResponse, DeviceHealth, HealthResponse - harness.rs: Harness async trait, HarnessConfig, ModelSpec, ModelInfo neuron crate (crates/neuron/): - discovery.rs: nvidia-smi CSV parsing (pure functions) + system discovery via uname/nvidia-smi/nvcc - health.rs: cached GPU health polling every 5s - api.rs: GET /discovery and GET /health axum handlers - main.rs: CLI entrypoint with --port flag (default 9090) - harness stubs for mistralrs (Phase 8) and llamacpp (Phase 11) 12 new tests (9 unit + 3 integration), 35 total. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:23:42 +03:00

36 Commits