All checks were successful
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Clippy (push) Successful in 2m13s
CI / Test (push) Successful in 4m22s
build-prerelease / Build neuron-blackwell (push) Successful in 3m25s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ampere (push) Successful in 4m39s
build-prerelease / Build neuron-ada (push) Successful in 4m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m34s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Three real bugs caught while exercising the script end-to-end against
the live quadbrat node:
1. say() printed status to stdout. Inside run_probe(), the
"POST /v1/chat/completions (probe: ...)" line was being captured
by `raw=$(run_probe)` along with the JSON body, so jq saw
"[host] POST..." as the first line and choked at column 29 with
"Invalid numeric literal" (it tried to parse the `[` as the start
of a JSON array). Redirect say() to stderr so command
substitutions capture only the intended return value.
2. The pretty-print step `echo "${raw}" | yq -r '.'` re-emitted the
JSON as YAML, which fails on response content that looks like YAML
markers (chatcmpl ids that parse as aliases, escaped quotes inside
<think>...</think> blocks). Drop the pretty-print; just echo the
raw JSON.
3. JSON response parsing now uses jq (always JSON) instead of yq
(parses input as YAML by default). yq remains in use only for the
genuinely-YAML asset/manifest.yml elsewhere.
4. max_tokens bumped 32 → 256. Qwen3 prepends a <think>...</think>
reasoning block before its final answer when the chat template
enables thinking mode, and that eats most of a small budget — the
"Paris" answer was being truncated mid-thought. 256 leaves enough
room for both.
Verified pipeline end-to-end on quadbrat (RTX 3060, helexa-neuron-ampere
git602e8e1): /health OK → /models/load (unsloth/Qwen3-0.6B-GGUF Q4_K_M)
→ /v1/chat/completions → response content contains "Paris".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
157 lines
5.3 KiB
Bash
Executable File
157 lines
5.3 KiB
Bash
Executable File
#!/bin/env bash
|
|
#
|
|
# End-to-end smoke test for a deployed neuron.
|
|
#
|
|
# Confirms the daemon is reachable, loads a small public Qwen3 GGUF,
|
|
# fires a reasoning probe at /v1/chat/completions, and prints the
|
|
# answer. Used to validate the candle harness on a real GPU host
|
|
# before trusting it for production traffic, and as a regression test
|
|
# after pushing new neuron builds.
|
|
#
|
|
# Usage:
|
|
# script/validate-neuron.sh [host] [model_id] [quant]
|
|
#
|
|
# Defaults:
|
|
# host = beast.hanzalova.internal
|
|
# model_id = unsloth/Qwen3-0.6B-GGUF (official Qwen3-*-GGUF repos
|
|
# ship Q8_0 only; unsloth's mirror ships the full Q-spectrum
|
|
# including Q4_K_M)
|
|
# quant = Q4_K_M
|
|
|
|
set -euo pipefail
|
|
|
|
HOST="${1:-beast.hanzalova.internal}"
|
|
MODEL_ID="${2:-unsloth/Qwen3-0.6B-GGUF}"
|
|
QUANT="${3:-Q4_K_M}"
|
|
PORT="${NEURON_PORT:-13131}"
|
|
BASE="http://${HOST}:${PORT}"
|
|
|
|
# Reasoning probe — concrete, low-temperature answer that small models
|
|
# can still get right. "Paris" is a strong signal of basic competence
|
|
# beyond gibberish.
|
|
PROBE_PROMPT='What is the capital of France? Respond with the city name only, no punctuation.'
|
|
EXPECT_SUBSTR='Paris'
|
|
# Qwen3 prepends <think>...</think> reasoning before the answer when the
|
|
# chat template enables thinking mode, which eats most of a small token
|
|
# budget. 256 leaves enough room for thinking + final answer.
|
|
MAX_TOKENS=256
|
|
|
|
# /models/load is synchronous — neuron blocks the response until the
|
|
# hf-hub download + GGUF parse + tensor materialisation is done. A
|
|
# fresh 0.6B-Q4_K_M is ~400 MB; on a slow link or cold cache that's
|
|
# easily a minute. Pick a generous ceiling.
|
|
LOAD_TIMEOUT=600
|
|
INFER_TIMEOUT=120
|
|
|
|
# Status messages go to stderr so command substitutions like
|
|
# `raw=$(run_probe)` capture only the function's intended return value
|
|
# (an HTTP body), not the progress chatter.
|
|
say() { printf '[%s] %s\n' "${HOST}" "$*" >&2; }
|
|
die() { say "FAIL: $*"; exit 1; }
|
|
|
|
probe_health() {
|
|
curl --silent --fail --max-time 5 "${BASE}/health" >/dev/null \
|
|
|| die "neuron not reachable at ${BASE}/health"
|
|
}
|
|
|
|
list_loaded_ids() {
|
|
# The manifest is YAML and uses yq; HTTP responses are JSON and use
|
|
# jq directly. pip-yq parses input as YAML by default, which trips
|
|
# on JSON content that happens to look like YAML aliases (chatcmpl
|
|
# ids, escaped quotes inside `<think>...</think>` blocks, etc.).
|
|
curl --silent --fail "${BASE}/models" | jq -r '.[].id'
|
|
}
|
|
|
|
is_loaded() {
|
|
list_loaded_ids 2>/dev/null | grep -Fxq "${MODEL_ID}"
|
|
}
|
|
|
|
trigger_load() {
|
|
say "POST /models/load ${MODEL_ID} (quant=${QUANT}, device=[0])"
|
|
say " (synchronous; may take a minute on first run while HF downloads)"
|
|
local payload
|
|
payload=$(cat <<EOF
|
|
{
|
|
"model_id": "${MODEL_ID}",
|
|
"harness": "candle",
|
|
"quant": "${QUANT}",
|
|
"devices": [0]
|
|
}
|
|
EOF
|
|
)
|
|
# --write-out captures the response code on a separate line so we
|
|
# can surface a real diagnostic instead of relying on --fail.
|
|
local resp http_code body
|
|
resp=$(curl --silent --show-error --max-time "${LOAD_TIMEOUT}" \
|
|
--write-out '\n__HTTP__%{http_code}' \
|
|
-X POST "${BASE}/models/load" \
|
|
-H 'content-type: application/json' \
|
|
--data "${payload}") || die "curl /models/load failed: $?"
|
|
http_code=$(echo "${resp}" | grep -oP '(?<=__HTTP__)\d+$' | tail -1)
|
|
body=$(echo "${resp}" | sed '$ s/__HTTP__.*$//')
|
|
if [[ "${http_code}" != "200" ]]; then
|
|
die "load returned HTTP ${http_code}: ${body}"
|
|
fi
|
|
say "load returned ${http_code}: ${body}"
|
|
}
|
|
|
|
run_probe() {
|
|
say "POST /v1/chat/completions (probe: ${PROBE_PROMPT})"
|
|
local payload
|
|
payload=$(jq -n -c \
|
|
--arg model "${MODEL_ID}" \
|
|
--arg content "${PROBE_PROMPT}" \
|
|
--argjson tokens "${MAX_TOKENS}" \
|
|
'{
|
|
model: $model,
|
|
messages: [{role: "user", content: $content}],
|
|
temperature: 0.1,
|
|
max_tokens: $tokens
|
|
}')
|
|
local resp http_code body
|
|
resp=$(curl --silent --show-error --max-time "${INFER_TIMEOUT}" \
|
|
--write-out '\n__HTTP__%{http_code}' \
|
|
-X POST "${BASE}/v1/chat/completions" \
|
|
-H 'content-type: application/json' \
|
|
--data "${payload}") || die "curl /v1/chat/completions failed: $?"
|
|
http_code=$(echo "${resp}" | grep -oP '(?<=__HTTP__)\d+$' | tail -1)
|
|
body=$(echo "${resp}" | sed '$ s/__HTTP__.*$//')
|
|
if [[ "${http_code}" != "200" ]]; then
|
|
die "inference returned HTTP ${http_code}: ${body}"
|
|
fi
|
|
echo "${body}"
|
|
}
|
|
|
|
say "validating neuron at ${BASE}"
|
|
probe_health
|
|
say "/health OK"
|
|
|
|
if is_loaded; then
|
|
say "${MODEL_ID} already loaded"
|
|
else
|
|
trigger_load
|
|
fi
|
|
|
|
raw=$(run_probe)
|
|
echo "---"
|
|
# Dump the raw JSON. Don't pipe through `yq -r '.'` — yq's default
|
|
# YAML output mode chokes on JSON strings that contain `<` (and the
|
|
# `<think>` markers Qwen3 emits during reasoning are a perfect
|
|
# example). The targeted `yq -r '.path'` calls below work fine
|
|
# because jq's path filter mode bypasses the YAML re-emit.
|
|
echo "${raw}"
|
|
echo "---"
|
|
|
|
content=$(echo "${raw}" | jq -r '.choices[0].message.content // empty')
|
|
if [[ -z "${content}" ]]; then
|
|
die "no content in chat completion response"
|
|
fi
|
|
say "assistant said: ${content}"
|
|
|
|
if echo "${content}" | grep -qiF "${EXPECT_SUBSTR}"; then
|
|
say "PASS — response contains expected substring '${EXPECT_SUBSTR}'"
|
|
exit 0
|
|
else
|
|
die "response did not contain '${EXPECT_SUBSTR}'"
|
|
fi
|