fix(stage-8e-2b): allow quant on the TP load path

The pre-existing guard in candle.rs rejected any spec.quant on the TP
path with "GGUF quantized models are not supported in the TP path" —
written when quant only ever meant GGUF. With 8e-1/8e-2 in,
quant != None on the TP path triggers in-situ quantization of the
loaded safetensors shards. resolve_dense_files only looks for
safetensors so a GGUF-source-file model with TP still errors out
cleanly downstream.
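
The behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in candle.rs: `LoadPlan`, `plan_tp_load`, and the `has_safetensors` flag are hypothetical names, and the error text is invented. Only the behavior comes from this commit message: quant on the TP path selects in-situ quantization, and a source without safetensors fails at file resolution rather than at the old guard.

```rust
/// Hypothetical summary of what the TP load path does with a request.
#[derive(Debug, PartialEq)]
enum LoadPlan {
    /// Load the safetensors shards as-is.
    Dense,
    /// Load the safetensors shards, then quantize each per-rank shard
    /// to the named dtype at load time (ISQ).
    InSituQuant(String),
}

/// Sketch of the TP load decision after this change.
/// `has_safetensors` stands in for whatever file resolution discovers;
/// it is an assumption made for this example.
fn plan_tp_load(quant: Option<&str>, has_safetensors: bool) -> Result<LoadPlan, String> {
    if !has_safetensors {
        // A GGUF-only source fails here, at file resolution,
        // instead of being rejected up front because `quant` is set.
        return Err("no safetensors files found for TP load".to_string());
    }
    Ok(match quant {
        Some(q) => LoadPlan::InSituQuant(q.to_string()),
        None => LoadPlan::Dense,
    })
}

fn main() {
    println!("{:?}", plan_tp_load(Some("q8_0"), true));
    println!("{:?}", plan_tp_load(None, true));
    println!("{:?}", plan_tp_load(Some("q8_0"), false));
}
```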

validate-neuron.sh: rebuild the load payload incrementally so
tp_size > 1 + non-empty quant produces both fields. Same script now
covers all four combos (single/TP × dense/ISQ).
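
For readers who prefer to see the incremental build outside of jq, here is a minimal std-only Rust sketch of the same idea. `build_payload` is a hypothetical helper, the device list is assumed to be an array of integer indices, and "example/model" is a placeholder ID; the script itself does this with jq. The point is that each optional field is appended independently, so one code path yields all four combinations.

```rust
/// Build the /models/load JSON body, adding optional fields one at a time.
/// Hand-rolled formatting keeps the sketch std-only; it assumes the inputs
/// contain no characters that need JSON escaping.
fn build_payload(model_id: &str, devices: &[u32], quant: Option<&str>, tp_size: u32) -> String {
    let devices_json: Vec<String> = devices.iter().map(|d| d.to_string()).collect();
    // Base object: always present.
    let mut fields = vec![
        format!("\"model_id\":\"{}\"", model_id),
        "\"harness\":\"candle\"".to_string(),
        format!("\"devices\":[{}]", devices_json.join(",")),
    ];
    // `quant` is omitted when absent or empty.
    if let Some(q) = quant.filter(|q| !q.is_empty()) {
        fields.push(format!("\"quant\":\"{}\"", q));
    }
    // `tensor_parallel` is omitted when tp_size == 1.
    if tp_size > 1 {
        fields.push(format!("\"tensor_parallel\":{}", tp_size));
    }
    format!("{{{}}}", fields.join(","))
}

fn main() {
    // The four combinations: single/TP x dense/ISQ.
    println!("{}", build_payload("example/model", &[0], None, 1));
    println!("{}", build_payload("example/model", &[0], Some("q8_0"), 1));
    println!("{}", build_payload("example/model", &[0, 1], None, 2));
    println!("{}", build_payload("example/model", &[0, 1], Some("q8_0"), 2));
}
```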

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 19:17:14 +03:00
parent 4aa71902d0
commit 68a606a79c
2 changed files with 25 additions and 31 deletions


@@ -1091,13 +1091,13 @@ impl CandleHarness {
         devices.len()
     );
 }
-if spec.quant.is_some() {
-    anyhow::bail!(
-        "tensor_parallel={tp_size} with quant={:?}: GGUF quantized models \
-         are not supported in the TP path; use a dense safetensors source",
-        spec.quant
-    );
-}
+// `quant` on the TP path now means in-situ quantization (ISQ):
+// load safetensors, quantize the per-rank shard to the named
+// GgmlDType at load time. The worker's parse_quant_string
+// accepts the same names (q5k, q8_0, etc.) as the single-GPU
+// path. GGUF-source-file models still aren't TP-loadable, but
+// resolve_dense_files only looks for safetensors so that path
+// errors out cleanly later if no safetensors are present.
 // 1. Resolve config + tokenizer + safetensors via hf-hub.
 let (config_path, tokenizer_path, safetensors_paths) =


@@ -91,32 +91,26 @@ trigger_load() {
 fi
 say "POST /models/load ${MODEL_ID} (quant=${QUANT:-<dense>}, tp=${TP_SIZE}, devices=${devices_json})"
 say " (synchronous; may take a minute on first run while HF downloads)"
-if (( TP_SIZE > 1 )) && [[ -n "${QUANT}" ]]; then
-    die "tp_size>1 requires dense safetensors — pass quant='' as the 3rd argument"
-fi
-# Build the payload via jq so the optional `quant` and
-# `tensor_parallel` fields are omitted entirely when not in use —
-# that's how the harness tells dense from quantized and single-GPU
-# from TP.
+# Build the payload via jq so optional fields are omitted entirely
+# when not in use. `tensor_parallel` is dropped when tp_size == 1;
+# `quant` is dropped when empty. Both can coexist: tp_size > 1 +
+# ISQ quant (q5k/q8_0/etc.) loads safetensors and quantizes the
+# per-rank shard at load time. GGUF quants (Q4_K_M) are incompatible
+# with TP — but the harness rejects that combination at load time
+# rather than here.
 local payload
-if [[ -z "${QUANT}" ]] && (( TP_SIZE > 1 )); then
-    payload=$(jq -n -c \
-        --arg id "${MODEL_ID}" \
-        --argjson tp "${TP_SIZE}" \
-        --argjson devices "${devices_json}" \
-        '{model_id: $id, harness: "candle", tensor_parallel: $tp, devices: $devices}')
-elif [[ -z "${QUANT}" ]]; then
-    payload=$(jq -n -c \
-        --arg id "${MODEL_ID}" \
-        --argjson devices "${devices_json}" \
-        '{model_id: $id, harness: "candle", devices: $devices}')
-else
-    payload=$(jq -n -c \
-        --arg id "${MODEL_ID}" \
-        --arg q "${QUANT}" \
-        --argjson devices "${devices_json}" \
-        '{model_id: $id, harness: "candle", quant: $q, devices: $devices}')
+local base
+base=$(jq -n -c \
+    --arg id "${MODEL_ID}" \
+    --argjson devices "${devices_json}" \
+    '{model_id: $id, harness: "candle", devices: $devices}')
+if [[ -n "${QUANT}" ]]; then
+    base=$(echo "${base}" | jq -c --arg q "${QUANT}" '. + {quant: $q}')
 fi
+if (( TP_SIZE > 1 )); then
+    base=$(echo "${base}" | jq -c --argjson tp "${TP_SIZE}" '. + {tensor_parallel: $tp}')
+fi
+payload="${base}"
 # --write-out captures the response code on a separate line so we
 # can surface a real diagnostic instead of relying on --fail.
 local resp http_code body