Stage 7b-i: dense safetensors Qwen3 load path
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 43s
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Clippy (push) Successful in 2m4s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
Adds the bf16/fp16 safetensors path alongside the existing GGUF
quantized one. The harness now dispatches by ModelSpec.quant:

- Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged).
- None → safetensors dense (new).

The dense path uses candle-transformers::models::qwen3::ModelForCausalLM
verbatim, fed via VarBuilder::from_mmaped_safetensors over the files
listed in `model.safetensors.index.json` (sharded layout) or the single
`model.safetensors` fallback. dtype is bf16 to match the canonical
Qwen3 HF distribution dtype. tokenizer.json is fetched from the same
repo (no -GGUF suffix to strip).

ModelArch gains a Qwen3Dense variant; the forward signature mirrors
QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position
logits), so run_inference / run_inference_streaming just add a parallel
match arm, with no shape changes downstream.

This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on:
because the source is dense safetensors that can be byte-sliced per
rank, the TP work avoids the GGUF super-block alignment problem
entirely. Vanilla GGUF inference keeps working unchanged.

validate-neuron.sh learns the dense path: pass an empty third arg
(quant) and the script omits the `quant` field from the load payload,
triggering the dense dispatch. Example:

    script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B ''

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
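The file-resolution rule in the message (sharded index first, single file as fallback) can be sketched in shell for inspecting a locally downloaded model directory. This is an illustrative helper, not code from the commit: the name `list_safetensors_files` and its directory argument are assumptions, and the real resolution happens inside the Rust harness. It relies on the usual Hugging Face index layout, in which `weight_map` maps each tensor name to the shard file that stores it.

```shell
# Hypothetical helper: print the safetensors files a dense loader would
# need to open for the model directory given as $1.
list_safetensors_files() {
    local dir="$1"
    local index="${dir}/model.safetensors.index.json"
    if [ -f "${index}" ]; then
        # Sharded layout: many tensors share a shard, so collect the
        # weight_map values and de-duplicate them (unique also sorts).
        jq -r '.weight_map | [.[]] | unique | .[]' "${index}"
    elif [ -f "${dir}/model.safetensors" ]; then
        # Single-file fallback.
        echo "model.safetensors"
    else
        echo "no safetensors weights found in ${dir}" >&2
        return 1
    fi
}
```

Calling `list_safetensors_files /path/to/model-dir` prints one filename per line, which is convenient for checking that every shard named in the index was actually downloaded.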
@@ -29,8 +29,8 @@ BASE="http://${HOST}:${PORT}"
 # Reasoning probe — concrete, low-temperature answer that small models
 # can still get right. "Paris" is a strong signal of basic competence
 # beyond gibberish.
-PROBE_PROMPT='What is the capital of France? Respond with the city name only, no punctuation.'
-EXPECT_SUBSTR='Paris'
+PROBE_PROMPT='What is the capital of Georgia (Caucasus)? Respond with the city name only, no punctuation.'
+EXPECT_SUBSTR='Tbilisi'
 # Qwen3 prepends <think>...</think> reasoning before the answer when the
 # chat template enables thinking mode, which eats most of a small token
 # budget. 256 leaves enough room for thinking + final answer.
@@ -67,18 +67,22 @@ is_loaded() {
 }
 
 trigger_load() {
-    say "POST /models/load ${MODEL_ID} (quant=${QUANT}, device=[0])"
+    say "POST /models/load ${MODEL_ID} (quant=${QUANT:-<dense>}, device=[0])"
     say " (synchronous; may take a minute on first run while HF downloads)"
+    # Build the payload via jq so the optional `quant` field is
+    # omitted entirely when empty — that's the signal to the harness
+    # to take the dense safetensors load path rather than GGUF.
     local payload
-    payload=$(cat <<EOF
-{
-  "model_id": "${MODEL_ID}",
-  "harness": "candle",
-  "quant": "${QUANT}",
-  "devices": [0]
-}
-EOF
-)
+    if [[ -z "${QUANT}" ]]; then
+        payload=$(jq -n -c \
+            --arg id "${MODEL_ID}" \
+            '{model_id: $id, harness: "candle", devices: [0]}')
+    else
+        payload=$(jq -n -c \
+            --arg id "${MODEL_ID}" \
+            --arg q "${QUANT}" \
+            '{model_id: $id, harness: "candle", quant: $q, devices: [0]}')
+    fi
     # --write-out captures the response code on a separate line so we
     # can surface a real diagnostic instead of relying on --fail.
     local resp http_code body
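The two jq branches in `trigger_load` can also be folded into a single expression that merges the `quant` key only when it is non-empty. The following is a minimal sketch of that alternative, assuming the same payload fields as the diff; `build_load_payload` is a hypothetical name, not something the script defines.

```shell
# Hypothetical helper: emit the /models/load payload as compact JSON.
#   $1 = model id, $2 = quant (empty or omitted selects the dense path)
build_load_payload() {
    local model_id="$1"
    local quant="${2:-}"
    # Object addition merges keys, so adding {} leaves the base payload
    # untouched and the quant field is absent rather than "" or null.
    jq -n -c \
        --arg id "${model_id}" \
        --arg q "${quant}" \
        '{model_id: $id, harness: "candle", devices: [0]}
         + (if $q == "" then {} else {quant: $q} end)'
}
```

With this shape, `build_load_payload "Qwen/Qwen3-0.6B" ""` yields a payload without a `quant` key, while passing a quant label as the second argument adds it. Either form keeps the property the commit depends on: the field is omitted entirely, not sent as an empty string.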
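The script comment about `<think>...</think>` matters for how a probe response is checked: a substring match over the raw text could hit the expected word inside the reasoning rather than in the final answer. Below is a minimal sketch of a stricter check, assuming the response text has already been extracted into a shell variable. `probe_passes` is a hypothetical helper, and this is not necessarily how validate-neuron.sh performs the comparison.

```shell
# Hypothetical helper: succeed only if the final answer (the text after
# the last </think>) contains the expected substring.
#   $1 = full response text, $2 = expected substring
probe_passes() {
    local response="$1"
    local expect="$2"
    local answer
    case "${response}" in
        *"</think>"*)
            # Drop everything up to and including the last closing tag.
            answer="${response##*</think>}"
            ;;
        *"<think>"*)
            # Thinking started but never closed: the token budget ran
            # out before any final answer was produced.
            return 1
            ;;
        *)
            # Thinking mode disabled: the whole response is the answer.
            answer="${response}"
            ;;
    esac
    case "${answer}" in
        *"${expect}"*) return 0 ;;
        *)             return 1 ;;
    esac
}
```

For example, `probe_passes "${body}" "${EXPECT_SUBSTR}"` fails on a response that mentions the city only inside an unterminated thinking block, which separates a truncated generation from a genuinely correct answer.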