cortex

Author	SHA1	Message	Date
rob thijssen	70eb6af42b	feat(tp): cancellation-safe inference + structured tracing

Two changes addressing operator visibility into TP inference and the HTTP-cancellation poisoning chain:

1. `chat_completion_tp` now runs its body inside `tokio::spawn`. When the HTTP client disconnects (curl --max-time, browser nav, etc.) the future returned from `chat_completion_tp` gets dropped, but the spawned task keeps running to completion, finishing every `pool.generate_step` / `pool.clear_kv_cache` to drain the worker pipes. The next inference request then finds a clean pool.

   Previously: a dropped future left workers still processing the in-flight request, and the next call's `ClearKvCache` recv would read the stale `GenerateStepOk` from the abandoned step ("rank N expected KvCacheCleared, got GenerateStepOk"). The drain-on-leader-error fix from `d1a4aad` covered Rust-side leader failures but not HTTP-layer cancellation, which is what we actually hit on the user's Qwen3.6 test.

2. Tracing throughout the TP path so journalctl shows where an inference spends its time without needing to surface harness internals via the HTTP error body:
   - `chat_completion_tp_inner` (now a free fn so it can run inside spawn): `info` at request start (prompt_len, max_new, temp, top_p, eos_id), `info` per major phase (prefill complete with elapsed_ms, decode complete with elapsed_ms + token count), `info` at completion (total_ms, finish_reason). `debug` for pool-lock acquisition + kv-cache clear timing. `trace` per decode step (next_token, step_ms).
   - `WorkerPool::generate_step` (leader side): `debug` at fan-out, `debug` after leader forward returns with elapsed_ms + ok flag, `debug` after drain with errors count + total_ms.
   - `WorkerPool::clear_kv_cache`: matching `debug` at fan-out + drain.
   - `worker::handle_generate_step`: `debug` at forward start + done with elapsed_ms, `warn` on forward failure with the full error.

The default log filter is already `info,neuron=debug` so the operator gets every `info` and `debug` line by default; `trace` needs RUST_LOG=trace for per-step decode timing.

Stage 7c-ii crash-detection is still future work; this is the minimum that makes the "where did the 120s go" question answerable from the logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 08:22:00 +03:00
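A minimal, std-only analogue of the cancellation fix in 70eb6af42b, assuming OS threads in place of tokio tasks. The commit itself uses `tokio::spawn`; here `thread::spawn` plays that role so the sketch needs no external crates. `start_inference` and the `drained` channel are hypothetical names introduced for illustration only.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;

/// Start `steps` units of work on a detached thread (hypothetical helper).
/// The returned receiver is the caller's handle on the result; dropping it
/// does not stop the work, which mirrors dropping the HTTP-side future.
fn start_inference(steps: u32, drained: Sender<u32>) -> Receiver<u32> {
    let (result_tx, result_rx) = mpsc::channel();
    thread::spawn(move || {
        let mut completed = 0;
        for _ in 0..steps {
            // Stand-in for one generate_step round-trip with the workers.
            completed += 1;
        }
        // Stand-in for the final drain: reached even if the caller is gone.
        let _ = drained.send(completed);
        // The caller may have dropped `result_rx`; a failed send is ignored.
        let _ = result_tx.send(completed);
    });
    result_rx
}
```

Dropping the receiver returned by `start_inference` right away still lets every step finish, so whatever state the work touches is left clean for the next caller.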
rob thijssen	d1a4aad91d	fix(tp): always drain worker responses on leader failure

The TP-2 inference probe against Qwen3.6-27B surfaced:

    worker rank 1 ClearKvCache: expected KvCacheCleared, got GenerateStepOk

Caused by pipe poisoning. The previous shape of `generate_step`:

    for w in workers { w.send_only(GenerateStep) }    // 1. fan-out
    let logits = spawn_blocking(leader.forward)??;    // 2. early return on err
    for w in workers { w.recv_only() }                // 3. drain (skipped on 2's err)

When step 2 returned `Err` (e.g. a dtype mismatch we hadn't seen before, an OOM, a downstream squeeze that didn't match the shape), the function bailed before step 3, but workers had already written `GenerateStepOk` to their stdout pipes, since their forwards (and the NCCL collectives inside) completed independently of the leader's post-collective Rust-side work. The next call (typically `ClearKvCache` at the start of the next inference request) would then send a fresh request and read those stale replies as if they were the new operation's. Once a pipe is poisoned, every subsequent call surfaces the same shape of error even though nothing's actually broken.

Fix: introduce two helpers in `tp/mod.rs`:
- `drain_workers(workers, check)`: reads exactly one response from every worker regardless of individual outcomes. Returns `Vec<String>` of `rank N: detail` strings for any non-OK reply.
- `combine_leader_workers(leader, worker_errs, op)`: folds the leader's `Result<Result<T>>` (the spawn_blocking shape) with the worker drain into a single `Result<T>`. Leader failure takes precedence but worker errors get appended so both halves surface.

`generate_step` and `clear_kv_cache` now use this pattern. Worst case: both halves fail and the operator sees a combined error message; either way the pipes are always drained so the next call's recv matches the request it sent.

Note: the model is still poisoned in the current state. The operator needs to either `POST /models/unload` + reload, or `systemctl restart neuron`, to recover. The fix prevents future desync; it doesn't repair existing stale pipe state.

Stage 7c-ii crash detection was tracked as the canonical solution to this class of issue; this is the minimum-viable subset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 07:39:36 +03:00
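The always-drain pattern from d1a4aad91d can be sketched with in-memory queues standing in for the worker pipes. This is a simplified model under stated assumptions, not the code in `tp/mod.rs`: the `Worker` struct and its `pipe` queue are hypothetical, workers reply instantly, and the leader result is a single `Result<T, String>` rather than the nested spawn_blocking shape.

```rust
use std::collections::VecDeque;

/// Replies a worker can write to its pipe (names taken from the commit message).
#[derive(Debug, Clone, PartialEq)]
enum Reply {
    GenerateStepOk,
    KvCacheCleared,
}

/// Hypothetical stand-in for a worker's stdout pipe: replies queue up in order.
struct Worker {
    rank: usize,
    pipe: VecDeque<Reply>,
}

/// Read exactly one reply from every worker, whatever the individual outcomes.
/// Returns `rank N: detail` strings for any reply that is not `expected`.
fn drain_workers(workers: &mut [Worker], expected: &Reply) -> Vec<String> {
    let mut errs = Vec::new();
    for w in workers.iter_mut() {
        match w.pipe.pop_front() {
            Some(r) if r == *expected => {}
            Some(other) => errs.push(format!(
                "rank {}: expected {:?}, got {:?}",
                w.rank, expected, other
            )),
            None => errs.push(format!("rank {}: no reply", w.rank)),
        }
    }
    errs
}

/// Fold the leader's result with the worker drain. Leader failure takes
/// precedence, and worker errors are appended so both halves surface.
fn combine_leader_workers<T>(
    leader: Result<T, String>,
    worker_errs: Vec<String>,
    op: &str,
) -> Result<T, String> {
    let workers = worker_errs.join("; ");
    match (leader, worker_errs.is_empty()) {
        (Ok(v), true) => Ok(v),
        (Ok(_), false) => Err(format!("{op}: workers failed: {workers}")),
        (Err(e), true) => Err(format!("{op}: leader failed: {e}")),
        (Err(e), false) => Err(format!("{op}: leader failed: {e}; workers: {workers}")),
    }
}

/// One step: fan out, take the leader's outcome, then drain unconditionally.
fn generate_step(workers: &mut [Worker], leader: Result<u32, String>) -> Result<u32, String> {
    for w in workers.iter_mut() {
        // Toy fan-out: each worker replies as soon as it is asked.
        w.pipe.push_back(Reply::GenerateStepOk);
    }
    let errs = drain_workers(workers, &Reply::GenerateStepOk);
    combine_leader_workers(leader, errs, "generate_step")
}

/// Same pattern for the cache clear; the leader side cannot fail in this toy.
fn clear_kv_cache(workers: &mut [Worker]) -> Result<(), String> {
    for w in workers.iter_mut() {
        w.pipe.push_back(Reply::KvCacheCleared);
    }
    let errs = drain_workers(workers, &Reply::KvCacheCleared);
    combine_leader_workers(Ok(()), errs, "clear_kv_cache")
}
```

Because the drain runs before the leader's error is returned, a failed `generate_step` leaves every queue empty, and the following `clear_kv_cache` reads the reply that belongs to it.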
rob thijssen	95dc8745eb	feat(stage-8c): TP-aware Qwen3-Next (tp_qwen3_5)

Adds `harness/tp/tp_qwen3_5.rs`, the tensor-parallel variant of the Qwen3-Next architecture, plus the dispatch wiring needed to route a load through it on both the leader and the workers.

Architecture pieces (all per-rank, follow `tp_qwen3.rs` patterns for the full-attention layers + a new pattern for linear-attention):
- TpQwen3_5GatedDeltaNet: V-head-dim sharded. `num_v_heads / world_size` V-heads per rank, `num_k_heads / world_size` K-heads. `in_proj_z`, `in_proj_b`, `in_proj_a`, `A_log`, `dt_bias` shard uniformly along the V-head dim. `out_proj` is row-parallel + AllReduce (the only collective inside the block). The recurrent state shards 1:1 with V-heads; no cross-rank sync inside the delta-rule loop.

  `in_proj_qkv` and `conv1d.weight` are FUSED tensors with three regions along dim 0 (`[first key_dim, second key_dim, value_dim]`). Standard uniform-slicing doesn't align with the head boundaries: rank 0 would end up with `[first half of K_0, full K_1, first half of V]`. New `load_fused_qkv_slice_{2d,3d}` helpers load the full tensor, narrow per-region per-rank, and `Tensor::cat` the three slices into a per-rank fused weight. Transient peak of one full tensor per layer during construction; net memory is properly per-rank after the full tensor drops.
- TpQwen3_5Attention: column-parallel `q_proj` (the widened `2 * num_heads * head_dim` output, including the gate half; it shards along the head axis so both query AND gate halves stay consistent per rank), `k_proj`, `v_proj`; row-parallel `o_proj` with AllReduce. Otherwise mirrors `tp_qwen3.rs`'s attention.
- TpQwen3_5MLP, TpQwen3_5DecoderLayer (dispatches on layer_types), TpQwen3_5Model (with `model.language_model.` prefix), and TpQwen3_5ForCausalLM (with tied or separate `lm_head` at top level).

Dispatch wiring:
- New `tp::TpLeaderModel` enum holds either Qwen3 or Qwen3_5 variant. `WorkerPool::load_dense_shard` now dispatches on `model_type` from the config JSON and returns `Arc<Mutex<TpLeaderModel>>`. The two downstream methods (`generate_step`, `clear_kv_cache`) thread this enum through; the inner forward+clear_kv_cache dispatch happens via the enum's pub methods. Adding another TP architecture later is one more enum variant + match arms.
- Worker side gets a parallel `WorkerModel` enum + dispatch in `handle_load_dense_shard`, branching on the same `model_type`.
- Harness gate `TP_SUPPORTED_MODEL_TYPES` now `["qwen3", "qwen3_5"]`. `TpLoadedModel.leader_model` retyped to the enum.

Helpers in `arch/qwen3_5/linear_attn.rs`:
- `softplus` and `repeat_interleave` made `pub(crate)` so the TP module reuses them rather than duplicating.

Reuses unchanged: `Qwen3_5RmsNorm` (replicated weight), the gated `Qwen3_5RmsNormGated` tail, `l2norm`, the `RotaryEmbedding` (partial RoPE with `partial_rotary_factor` already correct).

CPU build + clippy + 32 lib tests pass; `cargo clippy --features cuda` also clean inside the patched runner container.

Single in-flight risk to call out: tensor names. For full-attention layers the per-layer prefix is `model.language_model.layers.<i>.self_attn.` and for linear-attention layers `model.language_model.layers.<i>.linear_attn.*`, the same as the single-GPU path. lm_head sits at the top level (not under `language_model`), consistent with the single-GPU path that validated against Qwen3.5-0.8B.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:02:42 +03:00
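The per-region slicing that 95dc8745eb describes for the fused `in_proj_qkv` / `conv1d.weight` tensors comes down to index arithmetic, sketched below on plain row ranges. The function names are hypothetical and no tensor library is involved; the real `load_fused_qkv_slice_{2d,3d}` helpers narrow and concatenate actual tensors.

```rust
use std::ops::Range;

/// Row ranges along dim 0 that `rank` takes from a fused tensor laid out as
/// `[first key_dim, second key_dim, value_dim]`: one slice per region.
fn fused_qkv_rank_rows(
    key_dim: usize,
    value_dim: usize,
    rank: usize,
    world_size: usize,
) -> Result<[Range<usize>; 3], String> {
    if world_size == 0 || rank >= world_size {
        return Err(format!("rank {rank} is out of range for world_size {world_size}"));
    }
    if key_dim % world_size != 0 || value_dim % world_size != 0 {
        return Err(format!(
            "key_dim={key_dim} / value_dim={value_dim} not divisible by world_size={world_size}"
        ));
    }
    let k = key_dim / world_size; // rows per rank in each key region
    let v = value_dim / world_size; // rows per rank in the value region
    let first = rank * k; // offset inside the first key region
    let second = key_dim + rank * k; // offset inside the second key region
    let value = 2 * key_dim + rank * v; // offset inside the value region
    Ok([first..first + k, second..second + k, value..value + v])
}

/// What uniform slicing of the whole fused tensor would hand the same rank:
/// one contiguous block that ignores the region boundaries.
fn uniform_rank_rows(key_dim: usize, value_dim: usize, rank: usize, world_size: usize) -> Range<usize> {
    let per_rank = (2 * key_dim + value_dim) / world_size;
    rank * per_rank..(rank + 1) * per_rank
}
```

With `key_dim = 4`, `value_dim = 8` and two ranks, rank 0 gets rows `0..2`, `4..6` and `8..12`, whereas the uniform slice `0..8` would cover both key regions entirely and none of the value region.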
rob thijssen	d46d8d4f6c	feat(tp): Stage 7b-iv: RPC + orchestration for TP load/inference

Wires the in-flight TP machinery (Stage 7a workers, 7b-iii sharded Qwen3) end to end so a non-streaming chat completion can run across multiple GPUs via NCCL.

RPC additions (tp/rpc.rs):
- LoadDenseShard{model_id, config_json, safetensors_paths}
- GenerateStep{model_id, tokens, offset}
- ClearKvCache{model_id}
- UnloadModel{model_id}
- LoadDenseShardOk / GenerateStepOk / KvCacheCleared / Unloaded

Worker side (tp/worker.rs):
- WorkerState gains a `models: HashMap<String, TpQwen3ForCausalLM>` keyed by model_id. LoadDenseShard mmaps safetensors via ShardedVarBuilder (only this rank's slice materialises), builds the TP model with the rank's NCCL Comm cloned from NcclState.
- GenerateStep runs the rank-local forward; the resulting logits are dropped (only the leader's are used for sampling). The forward's value here is the NCCL collectives inside the row-parallel layers letting the leader's rank-0 forward make progress.

Pool side (tp/mod.rs):
- WorkerPool::load_dense_shard fans LoadDenseShard out to every worker, builds rank 0's shard on the leader via spawn_blocking with a fresh SendComm wrapper at the move boundary (Comm is !Send at the type level), collects per-rank LoadDenseShardOk. Returns the leader's Arc<Mutex<TpQwen3ForCausalLM>>.
- WorkerPool::generate_step fans GenerateStep out, runs the leader's rank-0 forward in spawn_blocking (the AllReduce CustomOps inside row-parallel layers block until every worker issues the matching collective), returns the leader's last-position logits Tensor.
- WorkerPool::clear_kv_cache + unload_model follow the same pattern.

NcclState refactor (tp/nccl_state.rs):
- comm field becomes Option<Arc<Comm>> (was Option<Comm>) so callers can share a clone with TpQwen3ForCausalLM::load.
- new `comm()` accessor + `SendComm` wrapper for spawn_blocking moves.
- single allow(clippy::arc_with_non_send_sync) at the canonical construction site (Comm is !Send by type but the runtime invariant is enforced by SendComm + the pool's Mutex).

Harness side (candle.rs):
- LoadedHandle enum (Single | Tp) replaces the bare Arc<LoadedModel> in the harness's registry. list_models / unload_model / inference_endpoint walk the enum uniformly.
- TpLoadedModel holds the pool + leader_model + tokenizer + devices.
- load_model dispatches on `spec.tensor_parallel > 1` to a new cuda-gated load_tp path: resolve dense files via hf-hub, spawn the pool, init_nccl, load_dense_shard.
- chat_completion branches on the handle variant. The TP path mirrors run_inference: clear_kv_cache, prefill, sample, decode loop, detokenize. Acquires the pool Mutex for the whole request.
- Streaming through TP is deferred to Stage 7c (returns Other(err)).

Script (script/validate-neuron.sh):
- 4th positional arg `tp_size` (default 1). When >1, switches to the dense path (tp + GGUF is mutually exclusive, so it bails) and adds `tensor_parallel` + `devices` to the load payload. NEURON_DEVICES env overrides the default 0..N-1 device list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 06:38:33 +03:00
rob thijssen	46527d7804	feat(tp): TP-aware Qwen3 dense model (Stage 7b-iii 2/2)

Mirrors candle_transformers::models::qwen3 structurally with column-parallel q/k/v + gate/up projections, row-parallel o + down projections, and replicated embedding/norms/lm_head.

Per-rank head counts come from dividing num_attention_heads / num_key_value_heads by world_size at load time; intermediate_size split likewise. Load bails on any non-divisible shape, since the safetensors slice would lose data otherwise.

KV cache holds the rank-local slice since K/V come out of column-parallel projections; no cache resharding across ranks. Causal mask is computed on rank 0 shape and broadcasts over the head dim so per-rank H differs without rework.

Replicated tensors (embedding, all RmsNorms, untied lm_head) load via vb.get(shape, name), which uses the default Shard { world_size: 1 } and falls through to the unsharded backend path on ShardedSafeTensors.

The cuda / non-cuda load splits track the existing tp_linear pattern: RowParallelLinear takes an Arc<Comm> only under cuda, and the higher-level composers (TpQwen3MLP, TpQwen3Attention, TpDecoderLayer, TpQwen3Model, TpQwen3ForCausalLM) thread it through accordingly.

7b-iv wires RPC + dispatch in CandleHarness::load_model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 18:24:20 +03:00
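The load-time shape split in 46527d7804 (divide the head counts and intermediate size by world_size, bail if anything does not divide evenly) might look roughly like the following. `RankShape` and `per_rank_shape` are hypothetical names, and the error wording is an assumption.

```rust
/// Per-rank sizes after the tensor-parallel split (hypothetical struct).
#[derive(Debug, PartialEq)]
struct RankShape {
    heads: usize,
    kv_heads: usize,
    intermediate: usize,
}

/// Divide each sharded dimension by `world_size`, refusing any shape that
/// does not divide evenly, since a truncated slice would silently lose data.
fn per_rank_shape(
    num_attention_heads: usize,
    num_key_value_heads: usize,
    intermediate_size: usize,
    world_size: usize,
) -> Result<RankShape, String> {
    if world_size == 0 {
        return Err("world_size must be at least 1".to_string());
    }
    let split = |name: &str, value: usize| -> Result<usize, String> {
        if value % world_size == 0 {
            Ok(value / world_size)
        } else {
            Err(format!("{name}={value} is not divisible by world_size={world_size}"))
        }
    };
    Ok(RankShape {
        heads: split("num_attention_heads", num_attention_heads)?,
        kv_heads: split("num_key_value_heads", num_key_value_heads)?,
        intermediate: split("intermediate_size", intermediate_size)?,
    })
}
```

For illustrative values of 32 attention heads, 8 KV heads and an intermediate size of 12288, two ranks each get 16, 4 and 6144, while three ranks are rejected because 32 is not divisible by 3.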
rob thijssen	8d3194f992	Stage 7b-iii (1/2): AllReduce CustomOp + ShardedVarBuilder-backed TP linears

Ports the canonical candle-examples/examples/llama_multiprocess/model.rs pattern into the harness. Two new files, one deletion:

- harness/tp/all_reduce.rs: AllReduce wraps Arc<cudarc::nccl::Comm> and implements candle's CustomOp1 trait. cuda_fwd extracts the rank's CudaSlice<dtype> from a CudaStorage, asserts the input is contiguous (a strided activation hitting all_reduce is almost always a model construction bug), allocates an output CudaSlice on the same device, calls Comm::all_reduce(Sum), and wraps the result back as a CudaStorage. Handles BF16, F16, F32. NcclError surfaces via {e:?} (no Display impl in cudarc 0.19.x). Send/Sync hand-impl'd with the same NCCL-thread-safety caveat candle's example documents.
- harness/tp/tp_linear.rs: ColumnParallelLinear and RowParallelLinear, both built on candle's ShardedVarBuilder + Shard hints. `vb.get_with_hints((), "weight", shard(dim, rank, ws))` reads JUST the rank's slice from the safetensors view; no full-tensor host materialisation. ColumnParallel.forward is a plain local matmul (output is naturally sharded). RowParallel.forward = local matmul + apply_op1_no_bwd(&self.all_reduce). On CPU / world_size == 1, the AllReduce is skipped and the partial output is returned as-is. Both layers are no-bias: every Qwen3-family target sets attention_bias=false; bias-aware sharding is a future-model concern.
- Deletes harness/tp/sharded_linear.rs from 7b-ii. That commit's hand-rolled "load full + narrow" approach was useful exploration but candle's ShardedVarBuilder does the same work without materialising the full tensor on host. The 5 unit tests there verified the slicing math against an unsharded reference; that math now lives inside candle and is covered by candle's own tests.

Next (7b-iii 2/2): TpQwen3Attention + TpQwen3MLP composing the column/row pair, then a TpQwen3Model that runs the full forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 18:14:54 +03:00
rob thijssen	93421f48e2	Stage 7b-ii: ColumnParallel + RowParallel sharded linear primitives

Adds harness/tp/sharded_linear.rs with ShardedLinear, a Megatron-LM style sharded wrapper over candle_nn::Linear. Two constructors:

- load_column: splits the output dimension. Each rank holds rows [r*out/N .. (r+1)*out/N] of the weight, plus its slice of the bias. Forward = local matmul; output is naturally sharded; downstream consumer either accepts the shard (next layer is column-parallel) or merges via all-gather later.
- load_row: splits the input dimension. Each rank holds cols [r*in/N .. (r+1)*in/N] of the weight; bias lives only on rank 0 so the post-all_reduce sum carries it exactly once. Forward produces a partial output that the caller reduces via NCCL.

Both constructors bail with a clear error when divisibility doesn't hold, the precondition mistral.rs's first qwen3-next-tp commit made explicit. The path included in the error is the VarBuilder prefix, so the operator sees exactly which projection failed ("column-parallel 'model.layers.0.self_attn.q_proj': out_features=...").

5 unit tests on CPU verify the math against an unsharded reference:
- column shard produces the expected slice of the full matmul
- row partials sum to the unsharded result
- row bias appears only on rank 0
- divisibility violations bail (column + row)

forward_with_comm() is stubbed for row-parallel (CUDA-only); wiring the actual cudarc::nccl all_reduce against candle's Tensor lands in 7b-iii alongside the model assembly, where the model holds the Comm in scope. ColumnParallel's forward_with_comm just delegates to the local matmul (no collective needed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:07:19 +03:00
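The sharding math that the 93421f48e2 unit tests check can be reproduced on plain vectors. This is an illustrative sketch with hypothetical helper names and no bias term, not the ShardedLinear code: a column shard yields a slice of the full output, and row-shard partials sum to the full output (the sum standing in for the all-reduce).

```rust
/// y = W x for a row-major weight `w` of shape [out, in].
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Column-parallel shard: rank `r` of `n` holds weight rows [r*out/n .. (r+1)*out/n].
fn column_shard(w: &[Vec<f32>], rank: usize, n: usize) -> Result<Vec<Vec<f32>>, String> {
    let out = w.len();
    if n == 0 || rank >= n || out % n != 0 {
        return Err(format!("column-parallel: out_features={out}, rank={rank}, world_size={n}"));
    }
    let per = out / n;
    Ok(w[rank * per..(rank + 1) * per].to_vec())
}

/// Row-parallel shard: rank `r` of `n` holds weight columns [r*in/n .. (r+1)*in/n].
fn row_shard(w: &[Vec<f32>], rank: usize, n: usize) -> Result<Vec<Vec<f32>>, String> {
    let inp = w.first().map_or(0, |row| row.len());
    if n == 0 || rank >= n || inp % n != 0 {
        return Err(format!("row-parallel: in_features={inp}, rank={rank}, world_size={n}"));
    }
    let per = inp / n;
    Ok(w.iter()
        .map(|row| row[rank * per..(rank + 1) * per].to_vec())
        .collect())
}
```

A row-parallel rank multiplies its column slice by the matching slice of the input, so with two ranks the partials are `matvec(&row_shard(&w, 0, 2)?, &x[..half])` and `matvec(&row_shard(&w, 1, 2)?, &x[half..])`, added element-wise.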
rob thijssen	05e15f3597	Stage 7b-i: dense safetensors Qwen3 load path

Adds the bf16/fp16 safetensors path alongside the existing GGUF quantized one. The harness now dispatches by ModelSpec.quant:
- Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged).
- None → safetensors dense (new).

The dense path uses candle-transformers::models::qwen3::ModelForCausalLM verbatim, fed via VarBuilder::from_mmaped_safetensors over the files listed in `model.safetensors.index.json` (sharded layout) or the single `model.safetensors` fallback. dtype is bf16 to match the canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from the same repo (no -GGUF suffix to strip).

ModelArch gains a Qwen3Dense variant; the forward signature mirrors QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position logits), so run_inference / run_inference_streaming just add a parallel match arm, with no shape changes downstream.

This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on: because the source is dense safetensors that can be byte-sliced per rank, the TP work avoids the GGUF super-block alignment problem entirely. Vanilla GGUF inference keeps working unchanged.

validate-neuron.sh learns the dense path: pass an empty third arg (quant) and the script omits the `quant` field from the load payload, triggering the dense dispatch. Example:

    script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B ''

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:03:59 +03:00
rob thijssen	da068ded6d	Stage 7a-ii: real NCCL handshake behind the worker pool

Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i. With --features cuda the leader and its workers now establish a live NCCL communicator end-to-end; without the feature the same code paths return Error{kind="cuda_feature_not_enabled"} so a misconfigured build is obvious instead of silently no-op.

NCCL state machine (harness/tp/nccl_state.rs) is shared between the worker process and the leader's pool:
- generate_comm_id_hex() mints an Id::new() on the leader.
- NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit, opens a CudaContext on the configured device, calls Comm::from_rank with the supplied (rank, world_size, id). NCCL blocks until every rank has joined.
- NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader asserts every rank reports observed_sum == world_size.
- NCCL handles serialised under Mutex; unsafe impl Send/Sync gates the Comm across spawn_blocking boundaries (NCCL is move-safe; only concurrent op issuance is unsafe).

WorkerPool::init_nccl orchestrates the rendezvous:
1. Write Init { comm_id } to every worker's stdin (no await yet).
2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking, concurrently with workers.
3. NCCL handshake completes for all ranks simultaneously.
4. Leader collects InitOk responses.

WorkerPool::nccl_sanity_check follows the same pattern over all_reduce, validating world_size == observed_sum on every rank.

Worker.send_only / Worker.recv_only split out from the previous monolithic Worker.request so the leader can interleave its own NCCL work with the worker calls, which is required because NCCL blocks during init.

Tests:
- 4 hex roundtrip unit tests for the wire encoding.
- The 7a-i "not implemented" expectation now reads "cuda_feature_not_enabled" on the local dev box (no CUDA), or accepts InitOk on a cuda-built test binary.
- New cuda-integration test in tp_worker_lifecycle_cuda.rs covers the real init + sanity round-trip; gated on the cuda-integration feature so default CI doesn't try to NCCL.

Verifiable on beast (2× RTX 5090):

    cargo test -p neuron --features cuda-integration \
        --test tp_worker_lifecycle_cuda

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 16:40:01 +03:00
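The wire encoding mentioned in da068ded6d (a 128-byte communicator id carried as 256 hex characters) can be sketched without any NCCL dependency. This is an assumption-laden illustration: the function names are hypothetical, bytes are `u8` rather than the `c_char` the commit mentions, and lowercase hex is an arbitrary choice.

```rust
/// Encode a 128-byte id as 256 lowercase hex characters.
fn id_to_hex(id: &[u8; 128]) -> String {
    id.iter().map(|b| format!("{b:02x}")).collect()
}

/// Decode 256 hex characters back into a 128-byte id, rejecting anything else.
fn id_from_hex(hex: &str) -> Result<[u8; 128], String> {
    if hex.len() != 256 {
        return Err(format!("expected 256 hex chars, got {}", hex.len()));
    }
    // Checking every byte up front also guarantees the input is ASCII,
    // so the two-character slices below always fall on char boundaries.
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("comm id contains a non-hex character".to_string());
    }
    let mut id = [0u8; 128];
    for (i, slot) in id.iter_mut().enumerate() {
        let pair = &hex[2 * i..2 * i + 2];
        *slot = u8::from_str_radix(pair, 16)
            .map_err(|e| format!("bad hex at byte {i}: {e}"))?;
    }
    Ok(id)
}
```

Strict length and character checks mean a truncated or corrupted id fails on the receiving side before any rank attempts the handshake with it.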
rob thijssen	2a7ede0232	Stage 7a-i: TP worker lifecycle scaffolding

Leader → worker process plumbing for tensor parallelism. The neuron binary picks up two modes: default (the existing daemon, axum + HTTP) and `--worker` (a bare RPC loop driven over stdin/stdout). The leader spawns one worker per non-zero NCCL rank via tokio::process::Command on the same binary path (production: /proc/self/exe; tests: env!("CARGO_BIN_EXE_neuron")) and talks to each over newline-delimited JSON.

Protocol (harness/tp/rpc.rs) is serde-tagged from the start: WorkerRequest::{Ping, Init, NcclSanityCheck, Shutdown} and WorkerResponse::{Pong, InitOk, NcclSanityResult, Bye, Error}, both `#[serde(tag = "op", rename_all = "snake_case")]`. Adding ops in 7b/7c is purely additive; unknown ops on the wire fail to parse (verified in unit tests).

7a-i scope:
- WorkerPool::spawn(binary, world_size, devices) forks ranks 1..N as subprocesses, captures stdin/stdout, kills on drop.
- ping_all() round-trips a Ping to every worker and validates the returned rank.
- shutdown() sends Shutdown to each worker, awaits Bye, reaps.
- Worker mode: parse Ping/Shutdown, return Pong/Bye; Init and NcclSanityCheck return Error{kind="not_implemented_7a_i"} so a 7a-ii binary speaking the same wire is a drop-in replacement (the kind field signals "real NCCL lands in the next commit").
- CandleHarness::load_model refuses tensor_parallel > 1 with a clear message until 7b is in.

Three integration tests in tests/tp_worker_lifecycle.rs cover spawn/ping/shutdown for 2- and 3-worker pools, plus the not_implemented_7a_i contract test for Init. Seven rpc serde unit tests assert the wire shape (op tags, field names, unknown-op rejection). All pass on the dev host; no CUDA required.

Stage 7a-ii (next): the real NCCL Comm::from_rank wiring behind the existing Init/NcclSanityCheck op surface, CUDA-gated. Verifiable on beast's 2×5090.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 15:53:00 +03:00

10 Commits