cortex

Author	SHA1	Message	Date
rob thijssen	96d8755245	fix(tp): add half dep + drop double-wrapped .w() on CudaDevice::alloc All checks were successful build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 2m17s Details CI / Test (push) Successful in 4m50s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m36s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Build neuron-ampere (push) Successful in 5m13s Details build-prerelease / Build neuron-ada (push) Successful in 4m42s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m12s Details Two follow-up cuda-only fixes surfaced by `cargo build --features cuda` inside the cuda-13.0 runner container: 1. `half::{bf16, f16}` was an undeclared dep. Added `half = "2.5"` (matching candle-core's pinned major) under the cuda feature flag. 2. `dev.alloc::<T>(n)` already returns `candle_core::Result` (it calls `.w()` internally on the cudarc error). Calling `.w()?` on top of that needs `From<candle_core::Error> for CudaError`, which doesn't exist — collapse to `?`. Removed the now-unused `cuda_backend::WrapErr` import. Verified by `cargo build -p neuron --features cuda` and `cargo clippy -p neuron --all-targets --features cuda -- -D warnings` inside `git.lair.cafe/gongfoo/runner-cuda-13.0` with the local glibc/CUDA-13.0 math_functions.h noexcept patch. CPU clippy/tests stay green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 19:11:59 +03:00
rob thijssen	12549c9aed	fix(tp): import BackendStorage trait for CudaStorage methods Some checks failed build-prerelease / Resolve version stamps (push) Successful in 32s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 3m9s Details CI / Test (push) Successful in 4m28s Details build-prerelease / Build neuron-blackwell (push) Failing after 3m41s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Failing after 4m45s Details build-prerelease / Build neuron-ada (push) Failing after 5m13s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Stage 7b-iii (1/2) introduced AllReduce with `s.device()` and `s.dtype()` calls on `&CudaStorage`. Both come from the `candle_core::backend::BackendStorage` trait, which wasn't imported — fine on CPU builds (the cuda_fwd block was cfg-gated out) but the prerelease cuda build hit E0599. Also drop the unused `cudarc::driver::DeviceSlice` import inside cuda_fwd — `CudaSlice::len()` is an inherent method on cudarc 0.19, not a trait method. Caught by run 2894 (build-neuron-{blackwell,ampere}); CPU clippy + tests stay green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 18:32:05 +03:00
rob thijssen	46527d7804	feat(tp): TP-aware Qwen3 dense model (Stage 7b-iii 2/2) Mirrors candle_transformers::models::qwen3 structurally with column- parallel q/k/v + gate/up projections, row-parallel o + down projections, and replicated embedding/norms/lm_head. Per-rank head counts come from dividing num_attention_heads / num_key_value_heads by world_size at load time; intermediate_size split likewise. Load bails on any non-divisible shape — the safetensors slice would lose data otherwise. KV cache holds the rank-local slice since K/V come out of column-parallel projections; no cache resharding across ranks. Causal mask is computed on rank 0 shape and broadcasts over the head dim so per-rank H differs without rework. Replicated tensors (embedding, all RmsNorms, untied lm_head) load via vb.get(shape, name), which uses the default Shard { world_size: 1 } and falls through to the unsharded backend path on ShardedSafeTensors. The cuda / non-cuda load splits track the existing tp_linear pattern: RowParallelLinear takes an Arc<Comm> only under cuda, and the higher- level composers (TpQwen3MLP, TpQwen3Attention, TpDecoderLayer, TpQwen3Model, TpQwen3ForCausalLM) thread it through accordingly. 7b-iv wires RPC + dispatch in CandleHarness::load_model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 18:24:20 +03:00
rob thijssen	8d3194f992	Stage 7b-iii (1/2): AllReduce CustomOp + ShardedVarBuilder-backed TP linears Some checks failed build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 2m16s Details build-prerelease / Build neuron-blackwell (push) Failing after 3m19s Details CI / Test (push) Successful in 4m26s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m22s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Failing after 4m58s Details build-prerelease / Build neuron-ada (push) Failing after 4m53s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Ports the canonical candle-examples/examples/llama_multiprocess/model.rs pattern into the harness. Two new files, one deletion: - harness/tp/all_reduce.rs — AllReduce wraps Arc<cudarc::nccl::Comm> and implements candle's CustomOp1 trait. cuda_fwd extracts the rank's CudaSlice<dtype> from a CudaStorage, asserts the input is contiguous (a strided activation hitting all_reduce is almost always a model construction bug), allocates an output CudaSlice on the same device, calls Comm::all_reduce(Sum), and wraps the result back as a CudaStorage. Handles BF16, F16, F32. NcclError surfaces via {e:?} (no Display impl in cudarc 0.19.x). Send/Sync hand-impl'd with the same NCCL-thread-safety caveat candle's example documents. - harness/tp/tp_linear.rs — ColumnParallelLinear and RowParallelLinear, both built on candle's ShardedVarBuilder + Shard hints. `vb.get_with_hints((), "weight", shard(dim, rank, ws))` reads JUST the rank's slice from the safetensors view; no full- tensor host materialisation. ColumnParallel.forward is a plain local matmul (output is naturally sharded). RowParallel.forward = local matmul + apply_op1_no_bwd(&self.all_reduce). On CPU / world_size == 1, the AllReduce is skipped and the partial output is returned as-is. Both layers are no-bias — every Qwen3-family target sets attention_bias=false; bias-aware sharding is a future-model concern. - Deletes harness/tp/sharded_linear.rs from 7b-ii. That commit's hand-rolled "load full + narrow" approach was useful exploration but candle's ShardedVarBuilder does the same work without materialising the full tensor on host. The 5 unit tests there verified the slicing math against an unsharded reference; that math now lives inside candle and is covered by candle's own tests. Next (7b-iii 2/2): TpQwen3Attention + TpQwen3MLP composing the column/row pair, then a TpQwen3Model that runs the full forward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 18:14:54 +03:00
rob thijssen	5436af9c73	fix(neuron/candle): dense Qwen3 returns rank-3 logits, double-squeeze All checks were successful build-prerelease / Resolve version stamps (push) Successful in 33s Details CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 2m19s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m32s Details CI / Test (push) Successful in 4m34s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m16s Details build-prerelease / Package cortex RPM (push) Successful in 1m18s Details build-prerelease / Build neuron-ampere (push) Successful in 4m55s Details build-prerelease / Build neuron-ada (push) Successful in 5m11s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m35s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details Caught by live validation against Qwen/Qwen3-1.7B on beast: HTTP 500 "unexpected rank, expected: 1, got: 2 ([1, 151936])" Candle's qwen3::ModelForCausalLM::forward returns shape [B, 1, V] (no final squeeze) while quantized_qwen3::ModelWeights::forward returns [B, V] (with squeeze(1) at the end). My match arms applied a single squeeze(0) uniformly, which is correct for the quantized [1, V] → [V] but leaves the dense at [1, V] → which then trips apply_repeat_penalty::to_vec1() expecting rank 1. Dense match arms now strip both batch and seq dims: model.forward(&input, offset)?.squeeze(0)?.squeeze(0)? Also fixes validate-neuron.sh's `${3:-Q4_K_M}` → `${3-Q4_K_M}` (no colon) so passing an explicit empty third arg now drives the dense path instead of falling back to Q4_K_M. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:49:43 +03:00
rob thijssen	8e882c0757	fix(neuron/tp): NcclError {e:?} + cudarc 0.19 deprecation cleanup All checks were successful CI / Format (push) Successful in 38s Details build-prerelease / Resolve version stamps (push) Successful in 40s Details CI / Clippy (push) Successful in 2m15s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m35s Details CI / Test (push) Successful in 5m0s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m51s Details build-prerelease / Package cortex RPM (push) Successful in 1m27s Details build-prerelease / Build neuron-ampere (push) Successful in 4m55s Details build-prerelease / Build neuron-ada (push) Successful in 4m57s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m37s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s Details Two cuda-feature-only build errors only the CI runner catches: 1. cudarc::nccl::NcclError doesn't impl Display in 0.19.x, so the `format!("...: {e}")` map_err calls fail to compile when the cuda feature actually wires them up. Switch every NcclError-typed `{e}` in nccl_state.rs to `{e:?}` — surfaces variant + ncclResult code in the same diagnostic shape just via Debug instead of Display. 2. cudarc::CudaStream::memcpy_stod / memcpy_dtov are deprecated in 0.19.7 in favour of clone_htod / clone_dtoh. The replacements take/return the same types, so the swap is mechanical. Dev box can't compile with --features cuda (no nvcc), so these only surface in the build-prerelease CUDA matrix jobs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:24:13 +03:00
rob thijssen	93421f48e2	Stage 7b-ii: ColumnParallel + RowParallel sharded linear primitives Some checks failed build-prerelease / Resolve version stamps (push) Successful in 30s Details CI / Format (push) Successful in 31s Details CI / Clippy (push) Failing after 49s Details build-prerelease / Build neuron-blackwell (push) Failing after 3m29s Details build-prerelease / Build cortex binary (push) Successful in 4m41s Details CI / Test (push) Successful in 5m6s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-ampere (push) Failing after 5m1s Details build-prerelease / Build neuron-ada (push) Failing after 4m53s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Adds harness/tp/sharded_linear.rs with ShardedLinear — a Megatron-LM style sharded wrapper over candle_nn::Linear. Two constructors: - load_column: splits the output dimension. Each rank holds rows [rout/N .. (r+1)out/N] of the weight, plus its slice of the bias. Forward = local matmul; output is naturally sharded; downstream consumer either accepts the shard (next layer is column-parallel) or merges via all-gather later. - load_row: splits the input dimension. Each rank holds cols [rin/N .. (r+1)in/N] of the weight; bias lives only on rank 0 so the post-all_reduce sum carries it exactly once. Forward produces a partial output that the caller reduces via NCCL. Both constructors bail with a clear error when divisibility doesn't hold — the precondition mistral.rs's first qwen3-next-tp commit made explicit. The path included in the error is the VarBuilder prefix, so the operator sees exactly which projection failed ("column-parallel 'model.layers.0.self_attn.q_proj': out_features=..."). 5 unit tests on CPU verify the math against an unsharded reference: - column shard produces the expected slice of the full matmul - row partials sum to the unsharded result - row bias appears only on rank 0 - divisibility violations bail (column + row) forward_with_comm() is stubbed for row-parallel (CUDA-only) — wiring the actual cudarc::nccl all_reduce against candle's Tensor lands in 7b-iii alongside the model assembly, where the model holds the Comm in scope. ColumnParallel's forward_with_comm just delegates to the local matmul (no collective needed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:07:19 +03:00
rob thijssen	05e15f3597	Stage 7b-i: dense safetensors Qwen3 load path Some checks failed build-prerelease / Build cortex binary (push) Blocked by required conditions Details CI / Test (push) Waiting to run Details CI / Format (push) Successful in 43s Details build-prerelease / Resolve version stamps (push) Successful in 44s Details CI / Clippy (push) Successful in 2m4s Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package cortex RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details build-prerelease / Build neuron-blackwell (push) Has been cancelled Details Adds the bf16/fp16 safetensors path alongside the existing GGUF quantized one. The harness now dispatches by ModelSpec.quant: - Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged). - None → safetensors dense (new). The dense path uses candle-transformers::models::qwen3::ModelForCausalLM verbatim, fed via VarBuilder::from_mmaped_safetensors over the files listed in `model.safetensors.index.json` (sharded layout) or the single `model.safetensors` fallback. dtype is bf16 to match the canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from the same repo (no -GGUF suffix to strip). ModelArch gains a Qwen3Dense variant; the forward signature mirrors QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position logits), so run_inference / run_inference_streaming just add a parallel match arm — no shape changes downstream. This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on: because the source is dense safetensors that can be byte-sliced per rank, the TP work avoids the GGUF super-block alignment problem entirely. Vanilla GGUF inference keeps working unchanged. validate-neuron.sh learns the dense path: pass an empty third arg (quant) and the script omits the `quant` field from the load payload, triggering the dense dispatch. Example: script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B '' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:03:59 +03:00
rob thijssen	da068ded6d	Stage 7a-ii: real NCCL handshake behind the worker pool Some checks failed CI / Format (push) Failing after 38s Details build-prerelease / Resolve version stamps (push) Successful in 42s Details CI / Clippy (push) Successful in 2m18s Details build-prerelease / Build neuron-blackwell (push) Failing after 3m33s Details CI / Test (push) Successful in 4m27s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m31s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Failing after 4m19s Details build-prerelease / Build neuron-ada (push) Failing after 4m56s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i. With --features cuda the leader and its workers now establish a live NCCL communicator end-to-end; without the feature the same code paths return Error{kind="cuda_feature_not_enabled"} so a misconfigured build is obvious instead of silently no-op. NCCL state machine (harness/tp/nccl_state.rs) is shared between the worker process and the leader's pool: - generate_comm_id_hex() mints an Id::new() on the leader. - NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit, opens a CudaContext on the configured device, calls Comm::from_rank with the supplied (rank, world_size, id). NCCL blocks until every rank has joined. - NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader asserts every rank reports observed_sum == world_size. - NCCL handles serialised under Mutex; unsafe impl Send/Sync gates the Comm across spawn_blocking boundaries (NCCL is move-safe; only concurrent op issuance is unsafe). WorkerPool::init_nccl orchestrates the rendezvous: 1. Write Init { comm_id } to every worker's stdin (no await yet). 2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking, concurrently with workers. 3. NCCL handshake completes for all ranks simultaneously. 4. Leader collects InitOk responses. WorkerPool::nccl_sanity_check follows the same pattern over all_reduce, validating world_size == observed_sum on every rank. Worker.send_only / Worker.recv_only split out from the previous monolithic Worker.request so the leader can interleave its own NCCL work with the worker calls — required because NCCL blocks during init. Tests: - 4 hex roundtrip unit tests for the wire encoding. - The 7a-i "not implemented" expectation now reads "cuda_feature_not_enabled" on the local dev box (no CUDA), or accepts InitOk on a cuda-built test binary. - New cuda-integration test in tp_worker_lifecycle_cuda.rs covers the real init + sanity round-trip; gated on the cuda-integration feature so default CI doesn't try to NCCL. Verifiable on beast (2× RTX 5090): cargo test -p neuron --features cuda-integration \ --test tp_worker_lifecycle_cuda Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 16:40:01 +03:00
rob thijssen	2a7ede0232	Stage 7a-i: TP worker lifecycle scaffolding All checks were successful CI / Format (push) Successful in 36s Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Clippy (push) Successful in 2m12s Details CI / Test (push) Successful in 4m25s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m49s Details build-prerelease / Build cortex binary (push) Successful in 4m22s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Successful in 5m9s Details build-prerelease / Build neuron-ada (push) Successful in 4m59s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s Details Leader → worker process plumbing for tensor parallelism. The neuron binary picks up two modes: default (the existing daemon, axum + HTTP) and `--worker` (a bare RPC loop driven over stdin/stdout). The leader spawns one worker per non-zero NCCL rank via tokio::process::Command on the same binary path (production: /proc/self/exe; tests: env!("CARGO_BIN_EXE_neuron")) and talks to each over newline- delimited JSON. Protocol (harness/tp/rpc.rs) is serde-tagged from the start — WorkerRequest::{Ping, Init, NcclSanityCheck, Shutdown} and WorkerResponse::{Pong, InitOk, NcclSanityResult, Bye, Error}, both `#[serde(tag = "op", rename_all = "snake_case")]`. Adding ops in 7b/7c is purely additive; unknown ops on the wire fail to parse (verified in unit tests). 7a-i scope: - WorkerPool::spawn(binary, world_size, devices) forks ranks 1..N as subprocesses, captures stdin/stdout, kills on drop. - ping_all() round-trips a Ping to every worker and validates the returned rank. - shutdown() sends Shutdown to each worker, awaits Bye, reaps. - Worker mode: parse Ping/Shutdown, return Pong/Bye; Init and NcclSanityCheck return Error{kind="not_implemented_7a_i"} so a 7a-ii binary speaking the same wire is a drop-in replacement (the kind field signals "real NCCL lands in the next commit"). - CandleHarness::load_model refuses tensor_parallel > 1 with a clear message until 7b is in. Three integration tests in tests/tp_worker_lifecycle.rs cover spawn/ ping/shutdown for 2- and 3-worker pools, plus the not_implemented_7a_i contract test for Init. Seven rpc serde unit tests assert the wire shape (op tags, field names, unknown-op rejection). All pass on the dev host; no CUDA required. Stage 7a-ii (next): the real NCCL Comm::from_rank wiring behind the existing Init/NcclSanityCheck op surface, CUDA-gated. Verifiable on beast's 2×5090. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 15:53:00 +03:00
rob thijssen	18ae3c30ee	post-validation cleanup: cuDNN runtime + repetition penalty All checks were successful CI / Format (push) Successful in 34s Details build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Clippy (push) Successful in 2m17s Details CI / Test (push) Successful in 4m16s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m28s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m42s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Build neuron-ampere (push) Successful in 4m27s Details build-prerelease / Build neuron-ada (push) Successful in 4m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 6m52s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 2m32s Details Two followups from the live single-GPU validation pass. 1. deploy.sh now ensures libcudnn.so.9 is available on each neuron host before installing/upgrading the package. Probes ldconfig first so hosts with a manual (tar/runfile) cuDNN install are untouched, then adds NVIDIA's RHEL9 CUDA repo (the Fedora 43 CUDA repo doesn't ship cuDNN; only the RHEL9 one does) and installs libcudnn9-cuda-13. benjy hit "cannot open shared object file: libcudnn.so.9" during validation; this prevents that recurring. 2. candle.rs applies a 1.1 repetition penalty over the last 64 generated tokens before sampling, in both the non-streaming chat_completion path and the streaming chat_completion_stream path. Without it small Q4_K_M models degenerate into "Wait, no, no..." loops once they hit a confident-but-wrong path; with it sampling stays coherent. Defaults match mistral.rs and llama.cpp; exposing the value via the OpenAI request (frequency/presence penalty mapping) is Stage 8 territory. Both routes through a new sample_with_penalty() helper so future sampling tweaks land in one place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:48:08 +03:00
rob thijssen	1a0400131e	fix(deploy): use dnf upgrade for stale installs, install only when absent All checks were successful CI / Format (push) Successful in 35s Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Clippy (push) Successful in 2m27s Details CI / Test (push) Successful in 4m30s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m29s Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-ampere (push) Successful in 5m15s Details build-prerelease / Build neuron-ada (push) Successful in 4m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m48s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m47s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 57s Details dnf5's `dnf install <pkg>` is a no-op when the package is already installed at ANY version — it does NOT auto-upgrade to the latest available. The deploy script's install branch was therefore silently leaving hosts on older builds even though needs_update correctly reported an upgrade was available. Add an is_installed() probe and an install_or_upgrade() helper that picks the right verb: `dnf install` when fresh, `dnf upgrade` when stale. Captured combined-stream output is exposed via __DNF_OUTPUT__ for the existing failure-diagnostic path. Verified end-to-end against the live fleet: hanzalova/beast/benjy/ quadbrat all upgraded cleanly from prior prerelease NVRs to 0.1.16-0.1.20260519134302.git1866b99.fc43, validation script returned "Paris" from all three neurons. Followup (not in this commit): all hosts running helexa-neuron-* need libcudnn.so.9 available at runtime. Currently: - quadbrat: libcudnn9-cuda-13 RPM (rhel9 CUDA repo) - beast: /usr/lib64/libcudnn.so.9 (manual install) - benjy: needed rhel9 CUDA repo added + libcudnn9-cuda-13 installed as part of this validation pass. The spec currently excludes cuDNN from auto-detected deps. Should add a Recommends:libcudnn9-cuda-13 (soft) and ensure the rhel9 CUDA repo is configured on each neuron host, similar to how ensure_lair_repo handles the unstable channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:10:48 +03:00
rob thijssen	1866b99a89	fix(validate-neuron): jq for JSON, say→stderr, sane max_tokens All checks were successful CI / Format (push) Successful in 35s Details build-prerelease / Resolve version stamps (push) Successful in 38s Details CI / Clippy (push) Successful in 2m13s Details CI / Test (push) Successful in 4m22s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m25s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m21s Details build-prerelease / Package cortex RPM (push) Successful in 1m17s Details build-prerelease / Build neuron-ampere (push) Successful in 4m39s Details build-prerelease / Build neuron-ada (push) Successful in 4m57s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m34s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s Details Three real bugs caught while exercising the script end-to-end against the live quadbrat node: 1. say() printed status to stdout. Inside run_probe(), the "POST /v1/chat/completions (probe: ...)" line was being captured by `raw=$(run_probe)` along with the JSON body, so jq saw "[host] POST..." as the first line and choked at column 29 with "Invalid numeric literal" (it tried to parse the `[` as the start of a JSON array). Redirect say() to stderr so command substitutions capture only the intended return value. 2. The pretty-print step `echo "${raw}" \| yq -r '.'` re-emitted the JSON as YAML, which fails on response content that looks like YAML markers (chatcmpl ids that parse as aliases, escaped quotes inside <think>...</think> blocks). Drop the pretty-print; just echo the raw JSON. 3. JSON response parsing now uses jq (always JSON) instead of yq (parses input as YAML by default). yq remains in use only for the genuinely-YAML asset/manifest.yml elsewhere. 4. max_tokens bumped 32 → 256. Qwen3 prepends a <think>...</think> reasoning block before its final answer when the chat template enables thinking mode, and that eats most of a small budget — the "Paris" answer was being truncated mid-thought. 256 leaves enough room for both. Verified pipeline end-to-end on quadbrat (RTX 3060, helexa-neuron-ampere git602e8e1): /health OK → /models/load (unsloth/Qwen3-0.6B-GGUF Q4_K_M) → /v1/chat/completions → response content contains "Paris". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:43:02 +03:00
rob thijssen	60176e7c2e	ci: monotonic prerelease versions + serialize CI on shared runner Two CI hygiene fixes uncovered while validating against the live fleet. 1. Same-day prerelease packages were being ordered by RPM-vercmp's alpha-vs-digit precedence on the git SHA fragment, not by commit chronology. With release stamps like "0.1.${YYYYMMDD}git${SHA}", two commits on the same day produce the same numeric prefix and rpmvercmp falls back to comparing the alphanumeric SHA suffixes, where digit-leading SHAs are ranked above alpha-leading ones — completely unrelated to which commit landed first. Verified with rpmdev-vercmp: gitabc1234 < gitdef5678 (old scheme — purely lexicographic) Bumping the timestamp prefix to second-precision (%Y%m%d%H%M%S) makes the numeric prefix strictly monotonic for any chronologically- ordered commits, so the SHA fragment becomes a debug identifier only — never participates in version ordering. 2. ci.yml and build-prerelease.yml both target the `rust` runner label and both auto-trigger on push to main. The act-based runner reuses /root/.cache/act/<hash>/hostexecutor/ across concurrent jobs, so ci.yml's clippy and build-prerelease.yml's build-cortex were racing each other's checkout/cleanup steps and corrupting in-flight compile artifacts. Real fix is in gongfoo; workflow-level workaround is a shared concurrency group with cancel-in-progress=false so the two workflows queue sequentially on the same ref. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:36:53 +03:00
rob thijssen	602e8e1471	fix(neuron/candle): source tokenizer.json from base repo when GGUF Some checks failed build-prerelease / Resolve version stamps (push) Successful in 31s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Failing after 50s Details CI / Test (push) Failing after 49s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m32s Details build-prerelease / Build cortex binary (push) Successful in 4m34s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Successful in 5m9s Details build-prerelease / Build neuron-ada (push) Successful in 4m52s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m36s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s Details GGUF-only HF repos (unsloth/Qwen3--GGUF, Qwen/Qwen3--GGUF) ship the .gguf file but not tokenizer.json — the tokenizer data is embedded in the GGUF metadata itself, and the standalone tokenizer.json lives in the base non-GGUF repo (unsloth/Qwen3-0.6B, Qwen/Qwen3-0.6B, etc.). Live validation against quadbrat hit: HTTP 400 fetch tokenizer.json from unsloth/Qwen3-0.6B-GGUF: HTTP status client error (404 Not Found) resolve_files now derives the tokenizer repo by stripping a `-GGUF` or `-gguf` suffix from the model_id; non-GGUF ids fall through to fetching from the same repo. The error message includes the attempted tokenizer repo id so the next failure (e.g. base repo doesn't exist) is unambiguous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:16:39 +03:00
rob thijssen	e9d0a75dd5	ci(prerelease): auto-build on every push to main Some checks failed build-prerelease / Build cortex binary (push) Blocked by required conditions Details CI / Clippy (push) Waiting to run Details CI / Test (push) Waiting to run Details build-prerelease / Resolve version stamps (push) Successful in 33s Details CI / Format (push) Successful in 36s Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package cortex RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details build-prerelease / Build neuron-blackwell (push) Has been cancelled Details The build-prerelease workflow was workflow_dispatch-only, which meant every commit needed a manual run dispatch before any host could upgrade. That left rolling fixes (e.g. f9f5fa4's StateDirectory fix) sitting on main with no published RPM behind them, so deploy.sh silently fell back to an older prerelease. Add 'push: branches: [main]' alongside the existing workflow_dispatch trigger; the unstable channel now tracks head automatically. The concurrency group is keyed on ${{ github.ref }} with cancel-in-progress so successive rapid-fire pushes coalesce to one build (latest wins) rather than queueing every intermediate commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:13:36 +03:00
rob thijssen	6cf87e328f	chore(neuron): log load_model failures server-side with full chain The HTTP handler now emits a tracing::warn on load_model failures with the expanded anyhow chain (format!("{e:#}")) before returning the 400. journalctl -u neuron will surface the underlying hf-hub / materialisation error without needing to capture the curl response body separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:08:54 +03:00
rob thijssen	f9f5fa41b6	fix(neuron): surface full anyhow chain + ensure $HOME exists at start Some checks failed CI / Format (push) Successful in 30s Details CI / Test (push) Failing after 49s Details CI / Clippy (push) Successful in 2m16s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Two fixes uncovered by the live validation against beast/benjy/quadbrat: 1. api.rs swallowed everything beyond the outermost anyhow context. The validation script reported '{"error":"fetch GGUF ...gguf"}' but the actual underlying hf-hub failure (cache dir creation, network, auth, etc.) was hidden. Switching every error response to format!("{e:#}") expands the full cause chain via anyhow's alternate Display format. 2. The neuron systemd unit declared the service user but never ensured /var/lib/neuron (its $HOME) existed. hf-hub defaults its cache to ~/.cache/huggingface/hub — when $HOME is absent the cache dir creation fails and the download aborts. Adding `StateDirectory=neuron` makes systemd create + chown that directory at activation; no spec change needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 08:17:37 +03:00
rob thijssen	ed4d71db09	fix(validate-neuron): default to unsloth GGUF + capture curl errors Two reasons the previous run silently bailed after POST /models/load: 1. Default model was Qwen/Qwen3-0.6B-GGUF (official). That repo ships ONLY Q8_0 — no Q4_K_M, no Q4_0, nothing else. The GGUF filename matcher in CandleHarness::resolve_files returned "no GGUF file matching quant Q4_K_M" and the load endpoint returned an error, but the script used `curl --silent --fail` and swallowed it. 2. /models/load is synchronous (it awaits the full HF download + GGUF parse). curl --max-time 30 was way too short for a 400 MB fresh download. Fixes: - Default model is now unsloth/Qwen3-0.6B-GGUF, which mirrors the full Q-spectrum (Q2_K through Q8_0 plus BF16) so Q4_K_M actually exists. - trigger_load / run_probe now use --write-out to capture HTTP code and emit the response body on non-2xx, so failures surface a real diagnostic instead of an opaque set -e abort. - LOAD_TIMEOUT bumped to 600s; INFER_TIMEOUT to 120s. - Probe payload built via `yq -n` so JSON quoting is reliable regardless of the prompt text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 08:14:31 +03:00
rob thijssen	39010c779f	add script/validate-neuron.sh — end-to-end candle harness smoke test Loads a small public Qwen3 GGUF on a target neuron host, fires a deterministic reasoning probe ("What is the capital of France?"), and asserts the response contains 'Paris'. Used to validate the candle harness on a real GPU host before the Stage 7 TP work begins, and as a regression check after future neuron builds. Defaults to beast.hanzalova.internal + Qwen/Qwen3-1.7B-GGUF + Q4_K_M; all three are positional args so the same script tests any node / model combination. Polls /models after triggering the load since /models/load returns once the materialisation is queued, not finished. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 07:58:05 +03:00
rob thijssen	57d7ef8d3c	chore: revert dnf. runner user has no system privs All checks were successful CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 2m20s Details CI / Test (push) Successful in 4m42s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details	2026-05-19 07:16:38 +03:00
rob thijssen	0e9671dd7d	fix(ci): drop sudo from dnf install (runner runs as root, no sudo) All checks were successful CI / Format (push) Successful in 36s Details CI / Clippy (push) Successful in 2m13s Details CI / Test (push) Successful in 4m17s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details The act runner container has no sudo binary; the runner user already runs as root inside the container. Existing steps (rpmbuild, gpg, etc) already invoke privileged commands directly without sudo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 07:06:52 +03:00
rob thijssen	e29c9e35f0	fix(ci): ensure rust toolchain present on cuda-13.0 runner The currently-published runner-cuda-13.0 image (gongfoo) is missing rust/cargo despite inheriting from runner-rust. Build-neuron fails immediately with 'cargo: command not found' even though build-cortex on the bare 'rust' runner builds fine. Add a defensive `dnf install rust cargo clippy` step at the top of build-neuron. Idempotent — on a properly-built runner image this is a fast no-op; on the current broken image it installs the toolchain in a few seconds. The runner image itself should be rebuilt in gongfoo so this step becomes redundant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 07:04:57 +03:00
rob thijssen	8a2334eacb	deploy: dnf-native version check + lair.cafe repo bootstrap Replaces the string compare of 'git describe --tags' vs the binary's self-reported --version (which lies about prereleases — every 0.1.16-* RPM reports just "0.1.16") with the dnf-native question of "is the installed package current against what the repo offers". Mechanism: - installed_nvr(): rpm -q --qf '%{version}-%{release}' for the resident package, falling back to "(not installed)". Capturing rpm's output through a variable keeps its "package X is not installed" stdout message out of the result on failure. - needs_update(): probes rpm -q first (treats absent as "needs work"), then asks dnf check-update --refresh -q. Other dnf failures collapse into "needs update" so the subsequent install surfaces a real error rather than this check swallowing one silently. - ensure_lair_repo(): probes for /etc/yum.repos.d/lair-cafe-unstable.repo and adds it with `dnf config-manager addrepo` when missing. The upstream .repo file ships enabled=0 (unstable channel doesn't auto-engage on fetch), so we then run `dnf config-manager setopt lair-cafe-unstable.enabled=1` every run — cheap, idempotent. - Cortex and neuron install branches now guard `systemctl stop` with `[ ! -f /usr/lib/systemd/system/...service ] \|\| sudo systemctl stop` so fresh installs (no unit file yet) don't short-circuit the install step under set -e. - dnf output is captured into a variable and only printed (with a [host] prefix per line) on failure, so success stays quiet and failures show the actual diagnostic instead of being eaten by &> /dev/null. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 18:55:02 +03:00
rob thijssen	aad314cdfa	feat(neuron): graceful unload-on-shutdown via SIGTERM/SIGINT Stage 6 of the candle-native pivot. Adds first-class deactivation: neuron now drains in-flight requests on SIGTERM (systemd stop) or SIGINT (Ctrl-C), then unloads every loaded model before the process exits — releasing CUDA contexts and VRAM cleanly rather than leaving the OS to reclaim them. Mechanism: - startup::shutdown_signal() resolves on either ctrl_c() or a SIGTERM listener. - axum::serve(...).with_graceful_shutdown(shutdown_signal()) stops accepting new connections, lets active requests finish, then returns control to main. - startup::unload_all_models(&registry) iterates list_all_models() and calls unload per entry. Per-model failures are logged warnings; cleanup continues. Empty registry is a fast no-op. - main holds an Arc<NeuronState> reference past axum's lifetime so the registry is still reachable for the unload sweep. data/neuron.service: - TimeoutStopSec=120s — generous bound for big-model unloads before systemd escalates to SIGKILL. - KillSignal=SIGTERM — explicit, matches the handler. Two non-gated tests cover the empty-registry no-op and the no-models- loaded path. Real load-then-unload-on-shutdown is exercised by the cuda-integration test from Stage 2 (which calls unload_model directly) and observable on a real GPU host by stopping the service and watching nvidia-smi. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:58:07 +03:00
rob thijssen	6779b7526a	feat(neuron): load default_models on service activation All checks were successful CI / Format (push) Successful in 34s Details CI / Clippy (push) Successful in 2m13s Details CI / Test (push) Successful in 4m6s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Stage 5 of the candle-native pivot. Adds first-class support for auto-loading a configured set of models when the neuron service activates. Config: - NeuronConfig.default_models: Vec<ModelSpec> (defaults to []). - neuron.example.toml ships a commented [[default_models]] example. Activation flow (crates/neuron/src/startup.rs::load_default_models): - Sequential — VRAM contention makes parallel loads risky. - Per-entry timing logged at info level on success. - Failures logged as warnings; the next entry is still attempted. - An empty list short-circuits without log noise. Called from main.rs after the registry is built and before the axum listener binds, so /models reflects the loaded state from the very first request. data/neuron.service gains TimeoutStartSec=1800s. With activation blocked on potentially slow first-time HF downloads + GGUF materialisation, systemd's default 90s would kill larger model loads mid-flight. Two non-gated tests in tests/activation.rs cover the continues-past-failure and empty-list paths using a synthetically unknown harness name to fail loads fast without touching the network. The cuda-integration test from earlier stages still exercises the real load/unload lifecycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:56:08 +03:00
rob thijssen	84f5662df1	feat(neuron): OpenAI-compatible SSE streaming chat completions Stage 4 of the candle-native pivot. /v1/chat/completions now switches to text/event-stream when the request sets stream: true, emitting one chat.completion.chunk per generated token followed by the OpenAI [DONE] terminator. Pipeline: - chat_completion_stream creates a bounded mpsc::channel<ChatCompletionChunk>(32), sends the leading role chunk, then spawns a blocking task that acquires the per-model arch lock and runs the streaming generation loop. - run_inference_streaming tracks a cumulative decoded prefix so each chunk's delta.content is the substring added since the last chunk — safe across BPE byte-fallback boundaries that would otherwise split multi-byte UTF-8 chars. - The blocking task aborts cleanly if blocking_send fails (client disconnected), so generation stops when the SSE consumer hangs up. - Final chunk carries finish_reason ("stop" on EOS, "length" on max_tokens). The handler appends data: [DONE] after the channel closes. The Stage 3 streaming 501 placeholder test is repurposed: with the streaming path live, an unloaded model now hits the same 404 surface as the non-streaming path (the model lookup happens first). cortex-gateway's existing proxy is unchanged — it already forwards SSE bytes verbatim from Phase 2 work, so the candle SSE format passes through unmodified. Neuron Cargo.toml gains futures + tokio-stream (both already in workspace deps) for ReceiverStream and stream combinators. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:53:14 +03:00
rob thijssen	249c9442e8	chore: track deployment script All checks were successful CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 2m2s Details CI / Test (push) Successful in 3m59s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details	2026-05-18 17:50:35 +03:00
rob thijssen	5e17081fb4	ci(prerelease): drop redundant rustup install step The build-cortex and build-neuron jobs were running a copied-from- mistralrs rustup install step. Both jobs use runner images that already provide rust via dnf: - runner-rust installs rust/cargo/clippy/rustfmt directly. - runner-cuda-13.0 extends runner-rust. Running 'rustup update stable' on top would install a parallel rustup-managed toolchain and shadow the dnf one — confusing and unnecessary. The existing ci.yml already trusts the dnf toolchain without any install step, so match that behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:47:29 +03:00
rob thijssen	03bed93fee	add asset/manifest.yml describing fleet hosts and neuron flavours All checks were successful CI / Format (push) Successful in 28s Details CI / Clippy (push) Successful in 2m54s Details CI / Test (push) Successful in 5m37s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Adds a single source of truth for which hosts run cortex vs neuron and which CUDA compute-capability flavour each neuron host needs: cortex : hanzalova.internal neurons : beast → helexa-neuron-blackwell (2x RTX 5090, sm_120) benjy → helexa-neuron-ada (RTX 4090, sm_89) quadbrat → helexa-neuron-ampere (RTX 3060, sm_86) script/deploy.sh (gitignored, local-only) is updated locally to read hosts and flavours from this manifest and dnf install the correct helexa-neuron-<flavour> package per host. Using 'dnf install --refresh --allowerasing' lets it swap out the previous bare helexa-neuron RPM or a different flavour without manual intervention; the spec Conflicts: clauses keep at most one flavour resident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:37:14 +03:00
rob thijssen	4a5211d830	ci(prerelease): add ampere flavour alongside ada and blackwell Adds ampere (CUDA compute capability sm_86) to both the build-neuron and package-neuron matrices, so helexa-neuron-ampere RPMs are built and published alongside helexa-neuron-ada and helexa-neuron-blackwell. The prerelease spec already lists ampere in its Conflicts: clause, so no spec change is needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:28:19 +03:00
rob thijssen	6d2dc5ff1a	fix(ci): give fmt/clippy/test distinct CARGO_TARGET_DIR to avoid races After the candle deps were added, cargo builds run long enough that the parallel fmt/clippy/test jobs (all on the `rust` runner label, which appears to use act in host-executor mode) start racing each other's intermediate temp files under /root/.cache/act/<hash>/hostexecutor/target/debug/deps/ Concretely the test job hit: error: No such file or directory at path "target/debug/deps/.tmprlicL7" Compiling unicode-ident because another job's cargo invocation cleaned up the temp file mid-compile. fmt and clippy happened to finish without their own target races landing fatally, so only test failed visibly. Set CARGO_TARGET_DIR=target-${{ github.job }} at the workflow level so each job writes to its own target directory. sccache still backs the actual rustc cache, so the rebuild penalty is just metadata not full recompiles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:26:29 +03:00
rob thijssen	b713dbe669	fix(ci): pass GPG secrets via env to avoid Gitea log leakage Some checks failed CI / Format (push) Successful in 28s Details CI / Test (push) Failing after 43s Details CI / Clippy (push) Successful in 2m9s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details The previous "Import signing key" step inlined ${{ secrets.RPM_SIGNING_KEY }} and ${{ secrets.RPM_SIGNING_KEY_ID }} directly into the run: block. Template expansion writes the literal secret value into the rendered shell script, and Gitea logs the rendered script — Gitea's masker may not reliably scrub multi-line keys, so values can leak. Move both secrets into the step's env: block (the same pattern the "Set up SSH" step already uses) and reference $VARs in the script. The script body now contains only variable names; the secret values live in the process environment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:13:52 +03:00
rob thijssen	5c957d08ec	ci: add build-prerelease workflow for CUDA RPMs on rpm.lair.cafe Some checks failed CI / Format (push) Successful in 36s Details CI / Test (push) Failing after 53s Details CI / Clippy (push) Successful in 2m35s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Adds a manually-triggered workflow that builds CUDA-flavoured neuron binaries and a CPU cortex binary, packages them as Fedora RPMs, signs them, and rsyncs to the unstable channel at https://rpm.lair.cafe/fedora/43/x86_64/unstable/. Mirrors the build pipeline used by grenade/mistralrs-package. Pipeline: - prepare: derive {version,short_sha,commit_date} from the checkout; the prerelease Release stamp "0.1.YYYYMMDDgitSHORTSHA" sorts below the eventual "1" stable release. - build-cortex: cargo build --release -p cortex-cli on a rust runner. - build-neuron: matrix over ada (sm_89) and blackwell (sm_120) on cuda-13.0 runners; cargo build with features "cuda cudnn flash-attn" and CUDA_COMPUTE_CAP set per flavour. - package-{cortex,neuron}: rpmbuild on the rpm runner against the new prebuilt-binary specs in rpm/. - publish: import signing key, sign RPMs, rsync to oolon, createrepo_c --update, then regenerate packages.json for the UI. New specs are prebuilt-binary variants — they consume the artifact from the build job rather than running cargo at rpmbuild time. Each helexa-neuron-{flavour} package Conflicts with the other flavours and with helexa-neuron (the future source-build stable package) so one flavour is installed at a time on a given host. neuron crate gains cudnn and flash-attn feature flags forwarding to the corresponding candle features, so the CI build command compiles those kernels into the binary. sccache is intentionally NOT used in the prerelease jobs — CUDA compute cap isn't in its cache key, so flavours would mis-hit each other. Each prerelease build is a clean cargo build. Required Gitea secrets (already in place for cortex.spec / COPR workflow): - RPM_SIGNING_KEY, RPM_SIGNING_KEY_ID - RSYNC_SSH_KEY Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:01:35 +03:00
rob thijssen	729317d1ef	feat(neuron): OpenAI-compatible non-streaming chat completion Stage 3 of the candle-native pivot. neuron now serves POST /v1/chat/completions backed by candle's quantized_qwen3 forward pass on a per-model serialised generation loop, returning the standard OpenAI ChatCompletionResponse envelope. Pipeline per request: - Look up the LoadedModel by request.model (404 if absent). - Apply the Qwen3 chat template across all messages. - Tokenize, then spawn_blocking onto tokio's blocking pool to acquire the per-model arch lock and run prefill + greedy/temperature/top-p sampling via LogitsProcessor. - Stop on <\|im_end\|>/<\|endoftext\|> EOS or max_tokens (finish_reason "stop" vs "length"). - Decode with skip_special_tokens=true, build OpenAI response with prompt/completion/total usage counts. Supporting changes: - HarnessRegistry now stores Arc<dyn Harness> and caches a typed Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch. - LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard can be moved into spawn_blocking. - NeuronState gains an Option<Arc<CandleHarness>> field for the new inference route. - Typed InferenceError lets the handler map ModelNotLoaded → 404 and other failures → 500 without string-matching anyhow messages. - stream=true returns 501 until Stage 4 wires up SSE. - Two leftover mistral.rs string references in proxy.rs and cortex-cli (missed during the Stage 1 sweep) are corrected here. Three new default-feature tests cover the no-candle 503, model-not- loaded 404, and stream=true 501 paths. The cuda-integration test from Stage 2 still covers real load/unload; a streaming-feature gated test exercising actual generation will arrive with Stage 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:47:58 +03:00
rob thijssen	5c2bd1a1da	feat(neuron): wire candle harness load/unload via GGUF Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a LoadedModel registry keyed by model_id, hf-hub-backed GGUF download, and Qwen3 quantized weight construction via candle-transformers' quantized_qwen3 module. unload_model drops the entry; Drop on the candle ModelWeights frees device memory. Device selection prefers CUDA (gated behind the new `cuda` feature), falling back to CPU when CUDA is unavailable so default builds work on non-GPU hosts. The candle CUDA toolchain isn't pulled in unless `--features cuda` is passed, keeping CI green on CPU runners. Config gains a [harness.candle] block with an optional hf_cache path. HarnessRegistry::from_configs now takes HarnessSettings so per-harness config flows through. A gated tests/candle_lifecycle.rs exercises real load → list → unload → list-empty when run with `--features cuda-integration` against a host with HF network access. The default-feature test in tests/api.rs covers the wrong-harness rejection path without needing the network. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:02:49 +03:00
rob thijssen	3cccc2c56b	refactor(neuron): cut mistralrs/llamacpp, scaffold candle harness Stage 1 of the candle-native pivot. Replaces the external-process harness model (mistralrs over HTTP, llamacpp placeholder) with an in-process Harness trait whose sole implementation is candle. The trait keeps its shape so future engines slot in additively, but start/stop default to no-ops and HarnessConfig drops endpoint and systemd_unit since no harness needs external supervision. Behaviour is unchanged on the wire: load_model returns a "not implemented yet (Stage 2)" error and list_models is empty. The gateway-side proxy, poller, and router are untouched. CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are marked superseded; the staged plan lives in ~/.claude/plans/create-a-more-aggressive-calm-naur.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:53:04 +03:00
rob thijssen	7f797b0265	ci: parallelise fmt/clippy/test and drop sccache install step All checks were successful CI / Format (push) Successful in 33s Details CI / Clippy (push) Successful in 1m31s Details CI / Test (push) Successful in 2m11s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-11 13:55:17 +03:00
rob thijssen	5a0360c1d5	ci: use container runner labels for CI jobs Some checks failed CI / Format, lint, build, test (push) Successful in 4m20s Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-11 13:29:42 +03:00
rob thijssen	472c0e8737	fix(rpm): ship firewalld service definitions with correct ports Some checks failed CI / Format, lint, build, test (push) Has been cancelled Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details cortex: opens 31313/tcp (API) and 31314/tcp (metrics) neuron: opens 13131/tcp Installs to /usr/lib/firewalld/services/ so firewall-cmd --add-service=cortex / --add-service=helexa-neuron works out of the box. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-11 12:52:20 +03:00
Gitea Actions	b9d8e30058	chore: bump version to 0.1.16	2026-04-16 15:04:21 +00:00
rob thijssen	25f75fe552	chore: ignore local deploy script All checks were successful CI / Format, lint, build, test (push) Successful in 1m15s Details CI / Build cortex SRPM (push) Successful in 43s Details CI / Build neuron SRPM (push) Successful in 44s Details CI / Publish cortex to COPR (push) Successful in 7m23s Details CI / Publish neuron to COPR (push) Successful in 15m58s Details CI / Bump version in source (push) Successful in 31s Details v0.1.16	2026-04-16 17:45:25 +03:00
rob thijssen	3f94c50817	chore: move default ports out of common-collision ranges Previous defaults collided with well-trodden infra services and with the Linux ephemeral port range: - cortex API 8000 — common dev-server default (Django, minio UI) - cortex metrics 9100 — Prometheus node_exporter default - neuron API 9090 — Cockpit default on Fedora, Prometheus self Move to helexa-themed palindromic ports, all below Linux's 32768-60999 ephemeral range and not registered to any well-known service: - cortex API 31313 - cortex metrics 31314 - neuron API 13131 Updated places: - cortex.example.toml, neuron.example.toml defaults - default impls in cortex-core and neuron config - cortex-cli --endpoint default for the status subcommand - doc comments citing example URLs - README.md and CLAUDE.md snippets Consumers already on the old ports need a one-line edit in their /etc/cortex/cortex.toml or /etc/neuron/neuron.toml to match; firewall rules and prometheus scrape configs will also need updating. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 17:45:25 +03:00
rob thijssen	3e1fb60076	ci: drop actions/cache for cargo registry and target The cache round-trip (download + unpack) was consistently taking around 6 minutes, noticeably longer than the ~3 minute cold build it was meant to accelerate. Net-negative on CI time — remove it. sccache with the S3 backend still provides dep-level caching at a much lower overhead, so we keep the majority of the cache benefit without paying the actions/cache tarball cost. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 17:45:25 +03:00
Gitea Actions	9bf987888c	chore: bump version to 0.1.14	2026-04-16 16:57:24 +03:00
rob thijssen	abe4ff7ccc	ci: publish both packages to a single helexa/helexa COPR project All checks were successful CI / Format, lint, build, test (push) Successful in 9m50s Details CI / Build neuron SRPM (push) Successful in 43s Details CI / Build cortex SRPM (push) Successful in 48s Details CI / Publish neuron to COPR (push) Successful in 6m14s Details CI / Publish cortex to COPR (push) Successful in 7m53s Details CI / Bump version in source (push) Successful in 31s Details Consolidates the previous helexa/cortex and helexa/helexa-neuron COPR projects into one shared project. Hosts enable a single repo and get access to both packages — cortex for gateway hosts and helexa-neuron for GPU nodes. Reduces the "which copr do I enable on this host" friction, and makes it clear the two packages are parts of the same helexa project suite. CI keeps two independent publish jobs (copr-cortex and copr-neuron) running in parallel; they now both target helexa/helexa with their respective SRPMs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> v0.1.14	2026-04-16 16:37:47 +03:00
rob thijssen	7c3390a4e1	fix(rpm): rename neuron package to helexa-neuron Fedora's official repos ship a package named `neuron` — the NEURON neural-simulation environment from Yale (see https://src.fedoraproject.org/rpms/neuron). Having our own `neuron` in the helexa COPR caused dnf5 to silently no-op `dnf install neuron` because of the name collision, even with the COPR repo enabled and keys imported. The only workarounds were full NEVRA (`dnf install neuron-0.1.12-1.fc43.x86_64`) or a local file install — neither acceptable for end-users. Rename the RPM package to `helexa-neuron`. Keep binary (/usr/bin/neuron), systemd unit (neuron.service), system user (neuron), and config dir (/etc/neuron) unchanged — those are project-local contexts where the short name is unambiguous. Follows Fedora subpackage-style naming except with a vendor prefix rather than a parent-package prefix, because neuron is an independent package from cortex (installed on different hosts) and neither depends on the other. Changes: - neuron.spec -> helexa-neuron.spec (git rename) - Name: neuron -> helexa-neuron (with comment explaining why) - CI: srpm-neuron job now builds helexa-neuron-VERSION.tar.gz with the matching top-level dir prefix, publishes to helexa/helexa-neuron COPR - CI: bump-version job references helexa-neuron.spec - CLAUDE.md: install instructions updated Old helexa/neuron COPR project can be deleted after the first helexa/helexa-neuron build lands. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:37:47 +03:00
rob thijssen	2ff062da0e	ci: commit generated %changelog entries back to main Previously the srpm-* jobs generated a fresh %changelog entry and shipped it to COPR, but the version-stamped spec pushed back to main by the bump-version job only updated the Version: line — not the %changelog section. The result: SRPM and in-tree spec diverged and a fresh clone of the repo showed a perpetually empty changelog. Run the rpm-changelog action in bump-version too. Now the committed specs track the SRPMs: each release leaves a dated %changelog entry in main covering commits since the previous tag, visible in git log and in the repo's spec browser. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:37:03 +03:00
Gitea Actions	357f858a29	chore: bump version to 0.1.12	2026-04-16 15:47:21 +03:00
rob thijssen	556e5293dc	fix(rpm): explicitly Provides user(name) to satisfy systemd unit Requires All checks were successful CI / Format, lint, build, test (push) Successful in 2m59s Details CI / Build cortex SRPM (push) Successful in 44s Details CI / Build neuron SRPM (push) Successful in 49s Details CI / Publish neuron to COPR (push) Successful in 8m17s Details CI / Publish cortex to COPR (push) Successful in 9m56s Details CI / Bump version in source (push) Successful in 30s Details Diagnosing the persistent "Nothing to do" on v0.1.10 surfaced that removing %attr(,,name) from %files wasn't enough. systemd-rpm-macros ships its own rpm dep generator (/usr/lib/rpm/systemd.req) that parses User=/Group= directives from every .service file the package ships and emits Requires: user(NAME)/group(NAME) accordingly. Rpmbuild log from v0.1.10 shows these Requires are still emitted even after the %attr removal. Meanwhile the sysusers provides-generator emits group(NAME) in both unversioned and versioned forms, but only a versioned user(NAME) = <base64> when the u-line has GECOS/home/shell fields. The asymmetry leaves Requires: user(NAME) unresolvable. Add explicit Provides: user(NAME) back to both specs, with a comment documenting the actual cause (systemd unit parsing, not file attrs) so the next person touching these specs doesn't repeat the mistake. Why monsoon didn't hit this: it creates its user in %pre via groupadd/useradd (not sysusers.d), so no Provides are generated at all — matching the Requires: user(monsoon) by luck of the rpm solver treating unknown symbols as soft-fails for that path. Ours went through the sysusers Provides code path and hit the asymmetry instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> v0.1.12	2026-04-16 15:32:51 +03:00

1 2

89 Commits