Adds the bf16/fp16 safetensors path alongside the existing GGUF
quantized one. The harness now dispatches by ModelSpec.quant:
- Some(_) → GGUF (pre-quantized, single-GPU-only path, unchanged).
- None → safetensors dense (new).
The dense path uses candle_transformers::models::qwen3::ModelForCausalLM
verbatim, fed via VarBuilder::from_mmaped_safetensors over the files
listed in `model.safetensors.index.json` (sharded layout) or the
single `model.safetensors` fallback. dtype is bf16 to match the
canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from
the same repo (no -GGUF suffix to strip).
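The file resolution described above can be sketched in shell for illustration. The harness itself does this in Rust, so this is only an approximation. It assumes the standard Hugging Face index layout, where `weight_map` maps tensor names to shard filenames; `list_safetensors_files` is a hypothetical helper name.

```shell
# Hypothetical helper: print the safetensors files a dense load would need.
# $1 = local directory holding the downloaded repo files.
list_safetensors_files() {
  local dir="$1"
  local index="${dir}/model.safetensors.index.json"
  if [ -f "${index}" ]; then
    # Sharded layout: collect the distinct shard names from weight_map.
    jq -r '.weight_map | [.[]] | unique | .[]' "${index}"
  else
    # Single-file fallback when no index is present.
    printf '%s\n' 'model.safetensors'
  fi
}
```

Each shard appears once in the output even though many tensors map to it, which is the list a loader would hand to the mmap step.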
ModelArch gains a Qwen3Dense variant; the forward signature mirrors
QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position
logits), so run_inference / run_inference_streaming just add a parallel
match arm — no shape changes downstream.
This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on:
because the source is dense safetensors that can be byte-sliced per
rank, the TP work avoids the GGUF super-block alignment problem
entirely. Vanilla GGUF inference keeps working unchanged.
validate-neuron.sh learns the dense path: pass an empty third arg
(quant) and the script omits the `quant` field from the load
payload, triggering the dense dispatch. Example:
    script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B ''
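A minimal sketch of how the script could omit `quant` when the third argument is empty. The `model` field name and the `build_load_payload` helper are assumptions made for illustration; only the `quant` field is named in this commit.

```shell
# Hypothetical helper: build the /models/load payload.
# $1 = model repo id, $2 = quant (may be empty for the dense path).
build_load_payload() {
  local model="$1" quant="${2-}"
  if [ -n "${quant}" ]; then
    # Quantized path: include the quant field.
    jq -cn --arg model "${model}" --arg quant "${quant}" \
      '{model: $model, quant: $quant}'
  else
    # Dense path: leave quant out entirely rather than sending "".
    jq -cn --arg model "${model}" '{model: $model}'
  fi
}
```

Leaving the key out, instead of sending an empty string, is what lets the server side see a missing value and take the dense dispatch.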
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four real bugs caught while exercising the script end-to-end against
the live quadbrat node:
1. say() printed status to stdout. Inside run_probe(), the
"POST /v1/chat/completions (probe: ...)" line was being captured
by `raw=$(run_probe)` along with the JSON body, so jq saw
"[host] POST..." as the first line and choked at column 29 with
"Invalid numeric literal" (it tried to parse the `[` as the start
of a JSON array). Redirect say() to stderr so command
substitutions capture only the intended return value.
2. The pretty-print step `echo "${raw}" | yq -r '.'` re-emitted the
JSON as YAML, which fails on response content that looks like YAML
markers (chatcmpl ids that parse as aliases, escaped quotes inside
<think>...</think> blocks). Drop the pretty-print; just echo the
raw JSON.
3. JSON response parsing now uses jq (always JSON) instead of yq
(parses input as YAML by default). yq remains in use only for the
genuinely-YAML asset/manifest.yml elsewhere.
4. max_tokens bumped 32 → 256. Qwen3 prepends a <think>...</think>
reasoning block before its final answer when the chat template
enables thinking mode, and that eats most of a small budget — the
"Paris" answer was being truncated mid-thought. 256 leaves enough
room for both.
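The stdout/stderr fix from item 1 can be illustrated as follows. This is a simplified stand-in, not the script's actual code: `run_probe` here prints a fixed body instead of calling curl, and the `HOST` variable with its `localhost` default is an assumption.

```shell
# Status messages go to stderr so $(...) captures only the real output.
say() {
  printf '[%s] %s\n' "${HOST:-localhost}" "$*" >&2
}

# Stand-in for the real probe: logs a status line, then prints a JSON body.
run_probe() {
  say "POST /v1/chat/completions (probe)"
  printf '%s\n' '{"choices":[]}'
}

# The status line still reaches the terminal, but raw holds only the JSON.
raw=$(run_probe)
```

With say() writing to stdout instead, `raw` would begin with the bracketed status line and any JSON parser fed from it would fail on the first character.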
Verified pipeline end-to-end on quadbrat (RTX 3060, helexa-neuron-ampere
git602e8e1): /health OK → /models/load (unsloth/Qwen3-0.6B-GGUF Q4_K_M)
→ /v1/chat/completions → response content contains "Paris".
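The final assertion in that pipeline might look like the sketch below. It assumes the endpoint returns an OpenAI-compatible body with the text at `.choices[0].message.content`; `assert_contains_answer` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the response content contains the answer.
# $1 = raw JSON response body, $2 = expected substring.
assert_contains_answer() {
  local raw="$1" expected="$2" content
  # jq parses the body as JSON only, so YAML-looking content is harmless.
  content=$(printf '%s' "${raw}" | jq -r '.choices[0].message.content // empty') || return 1
  case "${content}" in
    *"${expected}"*) return 0 ;;
    *)
      printf 'expected "%s" in response content: %s\n' "${expected}" "${content}" >&2
      return 1
      ;;
  esac
}
```

A substring match keeps the check tolerant of a leading <think>...</think> block before the final answer.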
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two reasons the previous run silently bailed after POST /models/load:
1. Default model was Qwen/Qwen3-0.6B-GGUF (official). That repo ships
ONLY Q8_0 — no Q4_K_M, no Q4_0, nothing else. The GGUF filename
matcher in CandleHarness::resolve_files returned "no GGUF file
matching quant Q4_K_M" and the load endpoint returned an error,
but the script used `curl --silent --fail` and swallowed it.
2. /models/load is synchronous (it awaits the full HF download + GGUF
parse). curl --max-time 30 was way too short for a 400 MB fresh
download.
Fixes:
- Default model is now unsloth/Qwen3-0.6B-GGUF, which mirrors the
full Q-spectrum (Q2_K through Q8_0 plus BF16) so Q4_K_M actually
exists.
- trigger_load / run_probe now use --write-out to capture HTTP code
and emit the response body on non-2xx, so failures surface a real
diagnostic instead of an opaque set -e abort.
- LOAD_TIMEOUT bumped to 600s; INFER_TIMEOUT to 120s.
- Probe payload built via `yq -n` so JSON quoting is reliable
regardless of the prompt text.
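A sketch of the --write-out approach, under the assumption that the status code is appended on its own final line. `post_json` and `check_response` are hypothetical names; the script's trigger_load / run_probe may be structured differently.

```shell
# A literal newline, used to split the body from the trailing status code.
NL='
'

# $1 = curl output of the form "<body>\n<http_code>".
# Prints the body on 2xx; otherwise reports code and body on stderr.
check_response() {
  local out="$1"
  local code="${out##*"${NL}"}"   # text after the last newline
  local body="${out%"${NL}"*}"    # everything before the last newline
  case "${code}" in
    2??) printf '%s\n' "${body}" ;;
    *)
      printf 'HTTP %s: %s\n' "${code}" "${body}" >&2
      return 1
      ;;
  esac
}

# $1 = URL, $2 = JSON payload, $3 = timeout in seconds.
# No --fail here, so the error body is still available on non-2xx.
post_json() {
  local url="$1" payload="$2" timeout="$3" out
  out=$(curl --silent --show-error --max-time "${timeout}" \
    --header 'Content-Type: application/json' \
    --data "${payload}" \
    --write-out '\n%{http_code}' "${url}") || return 1
  check_response "${out}"
}
```

With this shape, a load failure such as the missing-quant error prints the server's message instead of aborting silently under set -e.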
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loads a small public Qwen3 GGUF on a target neuron host, fires a
deterministic reasoning probe ("What is the capital of France?"),
and asserts the response contains 'Paris'. Used to validate the
candle harness on a real GPU host before the Stage 7 TP work begins,
and as a regression check after future neuron builds.
Defaults to beast.hanzalova.internal + Qwen/Qwen3-1.7B-GGUF + Q4_K_M;
all three are positional args so the same script tests any node /
model combination. Polls /models after triggering the load since
/models/load returns once the materialisation is *queued*, not
finished.
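The polling described above can be sketched as a generic retry loop. `poll_until` is a hypothetical helper, and the command passed to it (for example a check that the model appears in the /models listing) is left to the caller.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: poll_until <attempts> <delay_seconds> <command> [args...]
poll_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=0
  while [ "${i}" -lt "${attempts}" ]; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "${delay}"
  done
  # Every attempt failed.
  return 1
}
```

A bounded attempt count keeps the script from hanging forever if the model never appears.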
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>