fix(stage-8e-2b): allow quant on the TP load path

The pre-existing guard in candle.rs rejected any spec.quant on the TP
path with "GGUF quantized models are not supported in the TP path".
It was written when quant only ever meant GGUF. Now that 8e-1/8e-2
have landed, quant != None on the TP path triggers in-situ
quantization (ISQ) of the loaded safetensors shards, so the guard is
removed. resolve_dense_files only looks for safetensors, so a model
whose source file is GGUF still errors out cleanly downstream when
loaded with TP.
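The behaviour change can be summarized as a small decision function. This is a minimal sketch, not the code in candle.rs: `TpLoadPlan`, `plan_tp_load` and `select_safetensors` are hypothetical names, and `select_safetensors` only mimics the one property the message states about resolve_dense_files (it accepts safetensors and nothing else).

```rust
use std::path::PathBuf;

/// Hypothetical summary of what the TP loader does with `spec.quant`.
#[derive(Debug, PartialEq)]
enum TpLoadPlan {
    /// quant == None: load the per-rank safetensors shards as-is.
    Dense,
    /// quant == Some(name): load the shards, then quantize them in situ.
    Isq(String),
}

/// Hypothetical stand-in for resolve_dense_files: keep only .safetensors files.
fn select_safetensors(files: &[PathBuf]) -> Result<Vec<PathBuf>, String> {
    let shards: Vec<PathBuf> = files
        .iter()
        .filter(|p| p.extension().map_or(false, |ext| ext == "safetensors"))
        .cloned()
        .collect();
    if shards.is_empty() {
        // A GGUF-only model fails here, instead of at an up-front quant guard.
        Err("no safetensors shards found; GGUF sources are not TP-loadable".to_string())
    } else {
        Ok(shards)
    }
}

fn plan_tp_load(quant: Option<&str>, files: &[PathBuf]) -> Result<(TpLoadPlan, Vec<PathBuf>), String> {
    // No early bail on quant.is_some(): file resolution decides what is loadable.
    let shards = select_safetensors(files)?;
    let plan = match quant {
        None => TpLoadPlan::Dense,
        Some(name) => TpLoadPlan::Isq(name.to_string()),
    };
    Ok((plan, shards))
}

fn main() {
    let dense = vec![
        PathBuf::from("model-00001-of-00002.safetensors"),
        PathBuf::from("model-00002-of-00002.safetensors"),
    ];
    let gguf = vec![PathBuf::from("model.Q5_K_M.gguf")];
    // Safetensors + quant: ISQ plan. GGUF + quant: clean error from file resolution.
    println!("{:?}", plan_tp_load(Some("q5k"), &dense));
    println!("{:?}", plan_tp_load(Some("q5k"), &gguf));
}
```

The point of the design is that "is this model TP-loadable?" is answered by the file resolver alone, so the quant setting no longer needs its own rejection rule.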

validate-neuron.sh: rebuild the load payload incrementally so that
tp_size > 1 combined with a non-empty quant emits both fields. The
same script now covers all four combinations (single/TP × dense/ISQ).
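The incremental-payload idea is language-independent; it is sketched here in Rust rather than shell to keep one language across examples. Only the `tp_size` and `quant` field names come from this commit. The `model` field, the function name `build_load_payload`, and the choice to omit `tp_size` when it is 1 are assumptions, since the script's actual JSON shape is not shown.

```rust
/// Hypothetical load-payload builder: each optional field is appended
/// independently, so tp_size > 1 together with a quant yields both fields.
fn build_load_payload(model: &str, tp_size: u32, quant: Option<&str>) -> String {
    // Assumption: model ids contain no characters that need JSON escaping.
    let mut fields: Vec<String> = vec![format!("\"model\":\"{}\"", model)];
    if tp_size > 1 {
        fields.push(format!("\"tp_size\":{}", tp_size));
    }
    // An empty quant string is treated the same as no quant (dense load).
    if let Some(q) = quant.filter(|q| !q.is_empty()) {
        fields.push(format!("\"quant\":\"{}\"", q));
    }
    format!("{{{}}}", fields.join(","))
}

fn main() {
    // The four combinations: single/TP x dense/ISQ.
    for (tp, quant) in [(1, None), (1, Some("q5k")), (2, None), (2, Some("q5k"))] {
        println!("{}", build_load_payload("org/model", tp, quant));
    }
}
```

Building the payload field by field, instead of choosing one of several fixed templates, is what lets the TP + ISQ case carry both fields without a dedicated branch.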

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 19:17:14 +03:00
parent 4aa71902d0
commit 68a606a79c
2 changed files with 25 additions and 31 deletions


@@ -1091,13 +1091,13 @@ impl CandleHarness {
devices.len()
);
}
-if spec.quant.is_some() {
-anyhow::bail!(
-"tensor_parallel={tp_size} with quant={:?}: GGUF quantized models \
-are not supported in the TP path; use a dense safetensors source",
-spec.quant
-);
-}
+// `quant` on the TP path now means in-situ quantization (ISQ):
+// load safetensors, quantize the per-rank shard to the named
+// GgmlDType at load time. The worker's parse_quant_string
+// accepts the same names (q5k, q8_0, etc.) as the single-GPU
+// path. GGUF-source-file models still aren't TP-loadable, but
+// resolve_dense_files only looks for safetensors so that path
+// errors out cleanly later if no safetensors are present.
// 1. Resolve config + tokenizer + safetensors via hf-hub.
let (config_path, tokenizer_path, safetensors_paths) =
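The diff comment says the worker maps quant names such as q5k and q8_0 to a GgmlDType, the same way on both load paths. A minimal sketch of such a mapping follows, assuming a local enum that mirrors only a few of the variants of candle's `candle_core::quantized::GgmlDType`. `parse_quant_name` is a hypothetical name, not the worker's parse_quant_string, and the case-insensitive matching and the exact set of accepted spellings beyond q5k and q8_0 are assumptions.

```rust
/// Local stand-in for a subset of candle's GgmlDType variants (assumption:
/// the real enum in candle_core::quantized has more variants than this).
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GgmlDType {
    Q4_0,
    Q5_0,
    Q8_0,
    Q4K,
    Q5K,
    Q6K,
}

/// Hypothetical name parser: one table shared by the single-GPU and TP paths.
fn parse_quant_name(name: &str) -> Result<GgmlDType, String> {
    // Case-insensitive (an assumption), so "Q8_0" and "q8_0" are equivalent.
    match name.trim().to_ascii_lowercase().as_str() {
        "q4_0" => Ok(GgmlDType::Q4_0),
        "q5_0" => Ok(GgmlDType::Q5_0),
        "q8_0" => Ok(GgmlDType::Q8_0),
        "q4k" => Ok(GgmlDType::Q4K),
        "q5k" => Ok(GgmlDType::Q5K),
        "q6k" => Ok(GgmlDType::Q6K),
        _ => Err(format!("unknown quant type {name:?}")),
    }
}

fn main() {
    for name in ["q5k", "Q8_0", "gguf"] {
        println!("{name} -> {:?}", parse_quant_name(name));
    }
}
```

Keeping a single name table is what lets a spec that works single-GPU be reused unchanged with tensor_parallel > 1.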