Stage 7b-i: dense safetensors Qwen3 load path

Adds the bf16/fp16 safetensors path alongside the existing GGUF
quantized one. The harness now dispatches by ModelSpec.quant:

- Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged).
- None → safetensors dense (new).

The dense path uses
candle-transformers::models::qwen3::ModelForCausalLM verbatim, fed via
VarBuilder::from_mmaped_safetensors over the files listed in
`model.safetensors.index.json` (sharded layout) or the single
`model.safetensors` fallback. dtype is bf16 to match the canonical
Qwen3 HF distribution dtype. tokenizer.json is fetched from the same
repo (no -GGUF suffix to strip).

ModelArch gains a Qwen3Dense variant; the forward signature mirrors
QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position
logits), so run_inference / run_inference_streaming just add a
parallel match arm; no shape changes downstream.

This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on:
because the source is dense safetensors that can be byte-sliced per
rank, the TP work avoids the GGUF super-block alignment problem
entirely. Vanilla GGUF inference keeps working unchanged.

validate-neuron.sh learns the dense path: pass an empty third arg
(quant) and the script omits the `quant` field from the load payload,
triggering the dense dispatch. Example:

    script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B ''

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
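
The dispatch rule described above can be sketched in isolation. This is a minimal illustration, not the harness code: the `ModelSpec` fields, the `LoadPath` enum, and `select_load_path` are hypothetical stand-ins, and the `Q4_K_M` value is only an example quant label. The real loaders (GGUF reader, `VarBuilder::from_mmaped_safetensors`) are deliberately left out.

```rust
// Hypothetical stand-in for the harness's load-path decision.
#[derive(Debug, PartialEq)]
enum LoadPath {
    /// Pre-quantized GGUF file; carries the requested quant label.
    Gguf { quant: String },
    /// Dense bf16/fp16 safetensors (sharded or single file).
    DenseSafetensors,
}

// Hypothetical, reduced ModelSpec: only the fields the dispatch needs.
struct ModelSpec {
    model_id: String,
    quant: Option<String>,
}

/// Some(_) selects GGUF, None selects the dense safetensors path.
fn select_load_path(spec: &ModelSpec) -> LoadPath {
    match &spec.quant {
        Some(q) => LoadPath::Gguf { quant: q.clone() },
        None => LoadPath::DenseSafetensors,
    }
}

fn main() {
    let dense = ModelSpec {
        model_id: "Qwen/Qwen3-0.6B".to_string(),
        quant: None,
    };
    let gguf = ModelSpec {
        model_id: "Qwen/Qwen3-0.6B".to_string(),
        quant: Some("Q4_K_M".to_string()), // example label only
    };
    println!("{} -> {:?}", dense.model_id, select_load_path(&dense));
    println!("{} -> {:?}", gguf.model_id, select_load_path(&gguf));
}
```

Modelling the absence of `quant` as `None`, rather than as an empty string, matches the validate script's choice to omit the field from the payload entirely when no quant is given.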
2026-05-19 17:03:59 +03:00
parent da068ded6d
commit 05e15f3597
3 changed files with 236 additions and 61 deletions
--- a/script/validate-neuron.sh
+++ b/script/validate-neuron.sh
@@ -29,8 +29,8 @@ BASE="http://${HOST}:${PORT}"
 # Reasoning probe — concrete, low-temperature answer that small models
 # can still get right. "Paris" is a strong signal of basic competence
 # beyond gibberish.
-PROBE_PROMPT='What is the capital of France? Respond with the city name only, no punctuation.'
-EXPECT_SUBSTR='Paris'
+PROBE_PROMPT='What is the capital of Georgia (Caucasus)? Respond with the city name only, no punctuation.'
+EXPECT_SUBSTR='Tbilisi'
 # Qwen3 prepends <think>...</think> reasoning before the answer when the
 # chat template enables thinking mode, which eats most of a small token
 # budget. 256 leaves enough room for thinking + final answer.
@@ -67,18 +67,22 @@ is_loaded() {
 }

 trigger_load() {
-    say "POST /models/load ${MODEL_ID} (quant=${QUANT}, device=[0])"
+    say "POST /models/load ${MODEL_ID} (quant=${QUANT:-<dense>}, device=[0])"
    say "  (synchronous; may take a minute on first run while HF downloads)"
+    # Build the payload via jq so the optional `quant` field is
+    # omitted entirely when empty — that's the signal to the harness
+    # to take the dense safetensors load path rather than GGUF.
    local payload
-    payload=$(cat <<EOF
-{
-    "model_id": "${MODEL_ID}",
-    "harness": "candle",
-    "quant": "${QUANT}",
-    "devices": [0]
-}
-EOF
-    )
+    if [[ -z "${QUANT}" ]]; then
+        payload=$(jq -n -c \
+            --arg id "${MODEL_ID}" \
+            '{model_id: $id, harness: "candle", devices: [0]}')
+    else
+        payload=$(jq -n -c \
+            --arg id "${MODEL_ID}" \
+            --arg q "${QUANT}" \
+            '{model_id: $id, harness: "candle", quant: $q, devices: [0]}')
+    fi
    # --write-out captures the response code on a separate line so we
    # can surface a real diagnostic instead of relying on --fail.
    local resp http_code body