Stage 7b-i: dense safetensors Qwen3 load path

Adds the bf16/fp16 safetensors path alongside the existing GGUF quantized one. The harness now dispatches by ModelSpec.quant: - Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged). - None → safetensors dense (new). The dense path uses candle-transformers::models::qwen3::ModelForCausalLM verbatim, fed via VarBuilder::from_mmaped_safetensors over the files listed in `model.safetensors.index.json` (sharded layout) or the single `model.safetensors` fallback. dtype is bf16 to match the canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from the same repo (no -GGUF suffix to strip). ModelArch gains a Qwen3Dense variant; the forward signature mirrors QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position logits), so run_inference / run_inference_streaming just add a parallel match arm — no shape changes downstream. This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on: because the source is dense safetensors that can be byte-sliced per rank, the TP work avoids the GGUF super-block alignment problem entirely. Vanilla GGUF inference keeps working unchanged. validate-neuron.sh learns the dense path: pass an empty third arg (quant) and the script omits the `quant` field from the load payload, triggering the dense dispatch. Example: script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B '' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:03:59 +03:00
parent da068ded6d
commit 05e15f3597
3 changed files with 236 additions and 61 deletions
--- a/crates/neuron/src/harness/tp/mod.rs
+++ b/crates/neuron/src/harness/tp/mod.rs
@@ -213,8 +213,7 @@ impl WorkerPool {
        // Swap out the leader's NcclState into a fresh empty one so we
        // can move it into spawn_blocking; restore after the task
        // returns. (NcclState isn't Clone — it owns a real NCCL Comm.)
-        let mut leader_state =
-            std::mem::take(&mut self.leader_nccl);
+        let mut leader_state = std::mem::take(&mut self.leader_nccl);
        let (returned_state, leader_resp) = tokio::task::spawn_blocking(move || {
            let resp = leader_state.init(leader_cfg, &comm_id_for_leader);
            (leader_state, resp)
@@ -269,8 +268,7 @@ impl WorkerPool {

        // 2. Leader's own all_reduce, in spawn_blocking. NCCL operations
        //    block until every rank participates.
-        let mut leader_state =
-            std::mem::take(&mut self.leader_nccl);
+        let mut leader_state = std::mem::take(&mut self.leader_nccl);
        let (returned_state, leader_resp) = tokio::task::spawn_blocking(move || {
            let resp = leader_state.sanity_check();
            (leader_state, resp)