fix(stage-8e-2b): allow quant on the TP load path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m50s
build-prerelease / Build cortex binary (push) Successful in 8m37s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ada (push) Successful in 4m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 12m35s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The pre-existing guard in candle.rs rejected any spec.quant on the TP path with "GGUF quantized models are not supported in the TP path". It was written when quant only ever meant GGUF. With 8e-1/8e-2 in, quant != None on the TP path triggers in-situ quantization of the loaded safetensors shards. resolve_dense_files only looks for safetensors, so a GGUF-source-file model with TP still errors out cleanly downstream.

validate-neuron.sh: rebuild the load payload incrementally so tp_size > 1 + non-empty quant produces both fields. The same script now covers all four combos (single/TP × dense/ISQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
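
The decision the TP load path makes after this change can be sketched roughly as follows. This is an illustrative sketch only, not the code in candle.rs: `TpLoadPlan`, `plan_tp_load`, the `has_safetensors` flag, the list of accepted names, and the order of the checks are all assumptions made for the example. The real name parsing lives in the worker's parse_quant_string, whose full set of accepted names is not shown in this commit.

```rust
/// Hypothetical outcome of planning a tensor-parallel load.
#[derive(Debug, PartialEq)]
enum TpLoadPlan {
    /// No quant requested: load the safetensors shards as-is.
    Dense,
    /// Quant requested: load safetensors, then quantize each per-rank shard.
    InSituQuant(String),
}

/// Hypothetical planner for the TP path. `quant` mirrors `spec.quant`;
/// `has_safetensors` stands in for whatever resolve_dense_files finds.
fn plan_tp_load(quant: Option<&str>, has_safetensors: bool) -> Result<TpLoadPlan, String> {
    // A GGUF-only source has no safetensors to shard, so it still fails,
    // just later than the old up-front guard did.
    if !has_safetensors {
        return Err("no safetensors files found for TP load".to_string());
    }
    match quant {
        None => Ok(TpLoadPlan::Dense),
        Some(name) => {
            let normalized = name.to_ascii_lowercase();
            // Assumed subset of ISQ names; the commit only mentions q5k and q8_0.
            const ISQ_NAMES: [&str; 2] = ["q5k", "q8_0"];
            if ISQ_NAMES.contains(&normalized.as_str()) {
                Ok(TpLoadPlan::InSituQuant(normalized))
            } else {
                Err(format!("unrecognized ISQ quant name: {name}"))
            }
        }
    }
}
```

Under these assumptions, `plan_tp_load(Some("q5k"), true)` yields an ISQ plan, while a GGUF-style name such as `Q4_K_M` or a source without safetensors yields an error instead of being rejected by a blanket guard.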
@@ -1091,13 +1091,13 @@ impl CandleHarness {
                 devices.len()
             );
         }
-        if spec.quant.is_some() {
-            anyhow::bail!(
-                "tensor_parallel={tp_size} with quant={:?}: GGUF quantized models \
-                 are not supported in the TP path; use a dense safetensors source",
-                spec.quant
-            );
-        }
+        // `quant` on the TP path now means in-situ quantization (ISQ):
+        // load safetensors, quantize the per-rank shard to the named
+        // GgmlDType at load time. The worker's parse_quant_string
+        // accepts the same names (q5k, q8_0, etc.) as the single-GPU
+        // path. GGUF-source-file models still aren't TP-loadable, but
+        // resolve_dense_files only looks for safetensors so that path
+        // errors out cleanly later if no safetensors are present.
 
         // 1. Resolve config + tokenizer + safetensors via hf-hub.
         let (config_path, tokenizer_path, safetensors_paths) =
@@ -91,32 +91,26 @@ trigger_load() {
   fi
   say "POST /models/load ${MODEL_ID} (quant=${QUANT:-<dense>}, tp=${TP_SIZE}, devices=${devices_json})"
   say " (synchronous; may take a minute on first run while HF downloads)"
-  if (( TP_SIZE > 1 )) && [[ -n "${QUANT}" ]]; then
-    die "tp_size>1 requires dense safetensors — pass quant='' as the 3rd argument"
-  fi
-  # Build the payload via jq so the optional `quant` and
-  # `tensor_parallel` fields are omitted entirely when not in use —
-  # that's how the harness tells dense from quantized and single-GPU
-  # from TP.
+  # Build the payload via jq so optional fields are omitted entirely
+  # when not in use. `tensor_parallel` is dropped when tp_size == 1;
+  # `quant` is dropped when empty. Both can coexist: tp_size > 1 +
+  # ISQ quant (q5k/q8_0/etc.) loads safetensors and quantizes the
+  # per-rank shard at load time. GGUF quants (Q4_K_M) are incompatible
+  # with TP — but the harness rejects that combination at load time
+  # rather than here.
   local payload
-  if [[ -z "${QUANT}" ]] && (( TP_SIZE > 1 )); then
-    payload=$(jq -n -c \
+  local base
+  base=$(jq -n -c \
       --arg id "${MODEL_ID}" \
-      --argjson tp "${TP_SIZE}" \
-      --argjson devices "${devices_json}" \
-      '{model_id: $id, harness: "candle", tensor_parallel: $tp, devices: $devices}')
-  elif [[ -z "${QUANT}" ]]; then
-    payload=$(jq -n -c \
-      --arg id "${MODEL_ID}" \
-      --argjson devices "${devices_json}" \
-      '{model_id: $id, harness: "candle", devices: $devices}')
-  else
-    payload=$(jq -n -c \
-      --arg id "${MODEL_ID}" \
-      --arg q "${QUANT}" \
-      --argjson devices "${devices_json}" \
-      '{model_id: $id, harness: "candle", quant: $q, devices: $devices}')
+      --argjson devices "${devices_json}" \
+      '{model_id: $id, harness: "candle", devices: $devices}')
+  if [[ -n "${QUANT}" ]]; then
+    base=$(echo "${base}" | jq -c --arg q "${QUANT}" '. + {quant: $q}')
   fi
+  if (( TP_SIZE > 1 )); then
+    base=$(echo "${base}" | jq -c --argjson tp "${TP_SIZE}" '. + {tensor_parallel: $tp}')
+  fi
+  payload="${base}"
   # --write-out captures the response code on a separate line so we
   # can surface a real diagnostic instead of relying on --fail.
   local resp http_code body
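
For readers who want to see the four payload shapes side by side, the incremental construction in trigger_load can be mirrored in a small Rust sketch. It is not part of the commit: `build_load_payload` and `json_string` are hypothetical helpers, `devices` is assumed to be a list of integer device indices (the script only shows it as pre-built JSON in `devices_json`), and the string escaping is deliberately minimal.

```rust
/// Minimal JSON string encoder (hypothetical helper): escapes quotes,
/// backslashes, and ASCII control characters only.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Build the load payload the way the script does: start from the base
/// object, then append each optional field only when it is in use.
fn build_load_payload(model_id: &str, quant: &str, tp_size: u32, devices: &[u32]) -> String {
    let devices_json = devices
        .iter()
        .map(|d| d.to_string())
        .collect::<Vec<_>>()
        .join(",");
    let mut fields = vec![
        format!("\"model_id\":{}", json_string(model_id)),
        "\"harness\":\"candle\"".to_string(),
        format!("\"devices\":[{}]", devices_json),
    ];
    // `quant` is omitted entirely when empty (dense load).
    if !quant.is_empty() {
        fields.push(format!("\"quant\":{}", json_string(quant)));
    }
    // `tensor_parallel` is omitted entirely when tp_size == 1 (single GPU).
    if tp_size > 1 {
        fields.push(format!("\"tensor_parallel\":{}", tp_size));
    }
    format!("{{{}}}", fields.join(","))
}
```

Because each optional field is appended independently, the four combinations (single/TP × dense/ISQ) fall out of two `if` checks rather than a three-way branch, which is why the old script had no way to emit both fields at once.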