fix(stage-8e-2b): allow quant on the TP load path

The pre-existing guard in candle.rs rejected any spec.quant on the TP path with "GGUF quantized models are not supported in the TP path", written when quant only ever meant GGUF. With 8e-1/8e-2 in, quant != None on the TP path triggers in-situ quantization of the loaded safetensors shards. resolve_dense_files only looks for safetensors so a GGUF-source-file model with TP still errors out cleanly downstream.

validate-neuron.sh: rebuild the load payload incrementally so tp_size > 1 + non-empty quant produces both fields. Same script now covers all four combos (single/TP × dense/ISQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
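For illustration only, the four combos named above (single/TP × dense/ISQ) can be sketched as a small decision function. This is not the code in candle.rs: `QuantKind`, `LoadPlan`, `parse_quant_name`, and `plan_load` are hypothetical names, the real worker uses its own parse_quant_string and candle's GgmlDType, and only the two quant names mentioned in this commit are handled.

```rust
/// Hypothetical stand-in for the quantization target
/// (the real code maps names to candle's GgmlDType).
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuantKind {
    Q5K,
    Q8_0,
}

/// How the loader treats the safetensors weights.
#[derive(Debug, PartialEq, Eq)]
enum LoadPlan {
    /// Load shards as-is.
    Dense,
    /// Load shards, then quantize each one in situ.
    Isq(QuantKind),
}

/// Hypothetical name parser; an assumption standing in for the
/// worker's parse_quant_string, whose signature is not shown here.
fn parse_quant_name(name: &str) -> Result<QuantKind, String> {
    match name.to_ascii_lowercase().as_str() {
        "q5k" => Ok(QuantKind::Q5K),
        "q8_0" => Ok(QuantKind::Q8_0),
        other => Err(format!("unknown quant name: {other}")),
    }
}

/// After this commit, a quant name no longer rejects the TP path:
/// the plan depends only on whether a quant name is present.
fn plan_load(tp_size: usize, quant: Option<&str>) -> Result<LoadPlan, String> {
    if tp_size == 0 {
        return Err("tp_size must be at least 1".to_string());
    }
    match quant {
        None => Ok(LoadPlan::Dense),
        Some(name) => parse_quant_name(name).map(LoadPlan::Isq),
    }
}

fn main() {
    // Walk the four combos: single/TP x dense/ISQ.
    for tp in [1usize, 2] {
        for quant in [None, Some("q5k")] {
            println!("tp_size={tp} quant={quant:?} -> {:?}", plan_load(tp, quant));
        }
    }
}
```

In this sketch tp_size does not influence the plan at all, which mirrors the point of the change: the TP path and the single-GPU path accept the same quant names.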
2026-05-21 19:17:14 +03:00
parent 4aa71902d0
commit 68a606a79c
2 changed files with 25 additions and 31 deletions
--- a/crates/neuron/src/harness/candle.rs
+++ b/crates/neuron/src/harness/candle.rs
@@ -1091,13 +1091,13 @@ impl CandleHarness {
                devices.len()
            );
        }
-        if spec.quant.is_some() {
-            anyhow::bail!(
-                "tensor_parallel={tp_size} with quant={:?}: GGUF quantized models \
-                 are not supported in the TP path; use a dense safetensors source",
-                spec.quant
-            );
-        }
+        // `quant` on the TP path now means in-situ quantization (ISQ):
+        // load safetensors, quantize the per-rank shard to the named
+        // GgmlDType at load time. The worker's parse_quant_string
+        // accepts the same names (q5k, q8_0, etc.) as the single-GPU
+        // path. GGUF-source-file models still aren't TP-loadable, but
+        // resolve_dense_files only looks for safetensors so that path
+        // errors out cleanly later if no safetensors are present.

        // 1. Resolve config + tokenizer + safetensors via hf-hub.
        let (config_path, tokenizer_path, safetensors_paths) =