96d87552452e5991b3cbf5573184db01cf5621d1
10 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
5436af9c73
|
fix(neuron/candle): dense Qwen3 returns rank-3 logits, double-squeeze
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m19s
build-prerelease / Build neuron-blackwell (push) Successful in 3m32s
CI / Test (push) Successful in 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-ampere (push) Successful in 4m55s
build-prerelease / Build neuron-ada (push) Successful in 5m11s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m35s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Caught by live validation against Qwen/Qwen3-1.7B on beast:
HTTP 500 "unexpected rank, expected: 1, got: 2 ([1, 151936])"
Candle's qwen3::ModelForCausalLM::forward returns shape [B, 1, V]
(no final squeeze) while quantized_qwen3::ModelWeights::forward
returns [B, V] (with squeeze(1) at the end). My match arms applied
a single squeeze(0) uniformly, which is correct for the quantized
[1, V] → [V] but leaves the dense at [1, V] → which then trips
apply_repeat_penalty::to_vec1() expecting rank 1.
Dense match arms now strip both batch and seq dims:
model.forward(&input, offset)?.squeeze(0)?.squeeze(0)?
Also fixes validate-neuron.sh's `${3:-Q4_K_M}` → `${3-Q4_K_M}`
(no colon) so passing an explicit empty third arg now drives the
dense path instead of falling back to Q4_K_M.
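The difference between the two expansions can be checked directly in a shell. This is a minimal sketch rather than the script itself; `pick_quant_colon` and `pick_quant_plain` are hypothetical helper names used only for illustration, and they take the same positional layout (host, model, quant) as the script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers contrasting the two default-value expansions.

# ${3:-Q4_K_M}: falls back when $3 is unset OR empty.
pick_quant_colon() { printf '%s\n' "${3:-Q4_K_M}"; }

# ${3-Q4_K_M}: falls back only when $3 is unset; an explicit '' survives.
pick_quant_plain() { printf '%s\n' "${3-Q4_K_M}"; }

pick_quant_colon host model ''   # prints Q4_K_M (the empty arg is overridden)
pick_quant_plain host model ''   # prints an empty line (dense path)
pick_quant_plain host model      # prints Q4_K_M (arg omitted entirely)
```

With the colon form, an explicit empty argument cannot be told apart from a missing one, which is why the dense path was unreachable from the command line.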
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
05e15f3597
|
Stage 7b-i: dense safetensors Qwen3 load path
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 43s
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Clippy (push) Successful in 2m4s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
Adds the bf16/fp16 safetensors path alongside the existing GGUF quantized one.
The harness now dispatches by ModelSpec.quant:
- Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged).
- None → safetensors dense (new).
The dense path uses candle-transformers::models::qwen3::ModelForCausalLM verbatim, fed via VarBuilder::from_mmaped_safetensors over the files listed in `model.safetensors.index.json` (sharded layout) or the single `model.safetensors` fallback. dtype is bf16 to match the canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from the same repo (no -GGUF suffix to strip).
ModelArch gains a Qwen3Dense variant; the forward signature mirrors QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position logits), so run_inference / run_inference_streaming just add a parallel match arm, with no shape changes downstream.
This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on: because the source is dense safetensors that can be byte-sliced per rank, the TP work avoids the GGUF super-block alignment problem entirely. Vanilla GGUF inference keeps working unchanged.
validate-neuron.sh learns the dense path: pass an empty third arg (quant) and the script omits the `quant` field from the load payload, triggering the dense dispatch. Example:
script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B ''
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
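How the script might drop the `quant` field for the dense path can be sketched as follows. This is an illustration under assumptions: `build_load_payload` is a hypothetical helper, and the `model` key is a guess at the payload shape; only the `quant` field name comes from the commit message.

```shell
#!/usr/bin/env bash
# build_load_payload MODEL [QUANT]
# Hypothetical helper: emits a JSON body for the load request. The "model"
# key is an assumption; "quant" is the field named in the commit message.
# printf-built JSON is only safe for values without quotes or backslashes.
build_load_payload() {
  if [ -n "${2-}" ]; then
    # Non-empty quant: request the GGUF (quantized) path.
    printf '{"model":"%s","quant":"%s"}\n' "$1" "$2"
  else
    # Empty or missing quant: omit the field so the dense path is selected.
    printf '{"model":"%s"}\n' "$1"
  fi
}

build_load_payload 'unsloth/Qwen3-0.6B-GGUF' 'Q4_K_M'
build_load_payload 'Qwen/Qwen3-0.6B' ''
```

Omitting the key, rather than sending an empty string, is what would let the server-side `Option` deserialize to None.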
|||
|
18ae3c30ee
|
post-validation cleanup: cuDNN runtime + repetition penalty
All checks were successful
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Build neuron-blackwell (push) Successful in 3m42s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 4m27s
build-prerelease / Build neuron-ada (push) Successful in 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 6m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 2m32s
Two followups from the live single-GPU validation pass.
1. deploy.sh now ensures libcudnn.so.9 is available on each neuron host before installing/upgrading the package. Probes ldconfig first so hosts with a manual (tar/runfile) cuDNN install are untouched, then adds NVIDIA's RHEL9 CUDA repo (the Fedora 43 CUDA repo doesn't ship cuDNN; only the RHEL9 one does) and installs libcudnn9-cuda-13. benjy hit "cannot open shared object file: libcudnn.so.9" during validation; this prevents that recurring.
2. candle.rs applies a 1.1 repetition penalty over the last 64 generated tokens before sampling, in both the non-streaming chat_completion path and the streaming chat_completion_stream path. Without it small Q4_K_M models degenerate into "Wait, no, no..." loops once they hit a confident-but-wrong path; with it sampling stays coherent. Defaults match mistral.rs and llama.cpp; exposing the value via the OpenAI request (frequency/presence penalty mapping) is Stage 8 territory.
Both paths route through a new sample_with_penalty() helper so future sampling tweaks land in one place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
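The ldconfig probe in the first item can be sketched roughly like this. It is a simplified illustration, not the deploy.sh code: `has_cudnn9` and `cudnn_action` are hypothetical names, and the sketch only reports the decision instead of touching repos or packages.

```shell
#!/usr/bin/env bash
# has_cudnn9: succeeds when the dynamic linker cache lists libcudnn.so.9,
# regardless of whether it came from an RPM or a manual install.
has_cudnn9() {
  ldconfig -p 2>/dev/null | grep -q 'libcudnn\.so\.9'
}

# cudnn_action: hypothetical helper that only reports what would happen.
cudnn_action() {
  if has_cudnn9; then
    echo skip      # existing install is left untouched
  else
    echo install   # the real script would add the repo and install libcudnn9-cuda-13
  fi
}

cudnn_action
```

Probing the linker cache instead of the RPM database is what keeps hosts with a tar or runfile install from being modified.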
|||
|
1a0400131e
|
fix(deploy): use dnf upgrade for stale installs, install only when absent
All checks were successful
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m27s
CI / Test (push) Successful in 4m30s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m29s
build-prerelease / Build cortex binary (push) Successful in 4m32s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m15s
build-prerelease / Build neuron-ada (push) Successful in 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m48s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m47s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 57s
dnf5's `dnf install <pkg>` is a no-op when the package is already
installed at ANY version — it does NOT auto-upgrade to the latest
available. The deploy script's install branch was therefore silently
leaving hosts on older builds even though needs_update correctly
reported an upgrade was available.
Add an is_installed() probe and an install_or_upgrade() helper that
picks the right verb: `dnf install` when fresh, `dnf upgrade` when
stale. Captured combined-stream output is exposed via __DNF_OUTPUT__
for the existing failure-diagnostic path.
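A minimal sketch of the verb selection described above, assuming the helpers look roughly like this. `dnf_verb` is a hypothetical name introduced for illustration; `is_installed`, `install_or_upgrade`, and `__DNF_OUTPUT__` are the names from this message, but their bodies here are guesses.

```shell
#!/usr/bin/env bash
# is_installed PKG: succeeds when rpm already knows the package at any version.
is_installed() { rpm -q "$1" >/dev/null 2>&1; }

# dnf_verb PKG: hypothetical helper that picks the verb.
dnf_verb() {
  if is_installed "$1"; then echo upgrade; else echo install; fi
}

# install_or_upgrade PKG: runs dnf with the chosen verb and keeps the combined
# stdout/stderr in __DNF_OUTPUT__ so a failure path can print it later.
# The function's exit status is dnf's exit status.
install_or_upgrade() {
  __DNF_OUTPUT__=$(sudo dnf -y "$(dnf_verb "$1")" "$1" 2>&1)
}
```

A caller could then write `install_or_upgrade helexa-neuron-ampere || printf '%s\n' "$__DNF_OUTPUT__" >&2` to stay quiet on success.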
Verified end-to-end against the live fleet: hanzalova/beast/benjy/
quadbrat all upgraded cleanly from prior prerelease NVRs to
0.1.16-0.1.20260519134302.git1866b99.fc43, validation script returned
"Paris" from all three neurons.
Followup (not in this commit): all hosts running helexa-neuron-*
need libcudnn.so.9 available at runtime. Currently:
- quadbrat: libcudnn9-cuda-13 RPM (rhel9 CUDA repo)
- beast: /usr/lib64/libcudnn.so.9 (manual install)
- benjy: needed rhel9 CUDA repo added + libcudnn9-cuda-13 installed
as part of this validation pass.
The spec currently excludes cuDNN from auto-detected deps. Should
add a Recommends:libcudnn9-cuda-13 (soft) and ensure the rhel9 CUDA
repo is configured on each neuron host, similar to how ensure_lair_repo
handles the unstable channel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
1866b99a89
|
fix(validate-neuron): jq for JSON, say→stderr, sane max_tokens
All checks were successful
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Clippy (push) Successful in 2m13s
CI / Test (push) Successful in 4m22s
build-prerelease / Build neuron-blackwell (push) Successful in 3m25s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ampere (push) Successful in 4m39s
build-prerelease / Build neuron-ada (push) Successful in 4m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m34s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Four issues caught while exercising the script end-to-end against
the live quadbrat node:
1. say() printed status to stdout. Inside run_probe(), the
"POST /v1/chat/completions (probe: ...)" line was being captured
by `raw=$(run_probe)` along with the JSON body, so jq saw
"[host] POST..." as the first line and choked at column 29 with
"Invalid numeric literal" (it tried to parse the `[` as the start
of a JSON array). Redirect say() to stderr so command
substitutions capture only the intended return value.
2. The pretty-print step `echo "${raw}" | yq -r '.'` re-emitted the
JSON as YAML, which fails on response content that looks like YAML
markers (chatcmpl ids that parse as aliases, escaped quotes inside
<think>...</think> blocks). Drop the pretty-print; just echo the
raw JSON.
3. JSON response parsing now uses jq (always JSON) instead of yq
(parses input as YAML by default). yq remains in use only for the
genuinely-YAML asset/manifest.yml elsewhere.
4. max_tokens bumped 32 → 256. Qwen3 prepends a <think>...</think>
reasoning block before its final answer when the chat template
enables thinking mode, and that eats most of a small budget — the
"Paris" answer was being truncated mid-thought. 256 leaves enough
room for both.
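The first item is easy to reproduce in isolation. The sketch below uses stand-in functions with the same names as the script's `say` and `run_probe`; the bodies and the `TARGET` variable are assumptions, and the JSON is a canned string rather than a real response.

```shell
#!/usr/bin/env bash
# say: status output goes to stderr so it never mixes with return values.
say() { printf '[%s] %s\n' "${TARGET:-host}" "$*" >&2; }

# run_probe: stand-in that logs a status line and then "returns" JSON on stdout.
run_probe() {
  say "POST /v1/chat/completions (probe: ...)"
  printf '%s\n' '{"choices":[{"message":{"content":"Paris"}}]}'
}

# Command substitution captures stdout only, so raw holds just the JSON body.
raw=$(run_probe)
printf '%s\n' "$raw"
```

Had `say` written to stdout, `raw` would begin with the `[host] POST ...` line and any JSON parser would reject it.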
Verified pipeline end-to-end on quadbrat (RTX 3060, helexa-neuron-ampere
git602e8e1): /health OK → /models/load (unsloth/Qwen3-0.6B-GGUF Q4_K_M)
→ /v1/chat/completions → response content contains "Paris".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
ed4d71db09
|
fix(validate-neuron): default to unsloth GGUF + capture curl errors
Two reasons the previous run silently bailed after POST /models/load:
1. Default model was Qwen/Qwen3-0.6B-GGUF (official). That repo ships ONLY Q8_0: no Q4_K_M, no Q4_0, nothing else. The GGUF filename matcher in CandleHarness::resolve_files returned "no GGUF file matching quant Q4_K_M" and the load endpoint returned an error, but the script used `curl --silent --fail` and swallowed it.
2. /models/load is synchronous (it awaits the full HF download + GGUF parse). curl --max-time 30 was way too short for a 400 MB fresh download.
Fixes:
- Default model is now unsloth/Qwen3-0.6B-GGUF, which mirrors the full Q-spectrum (Q2_K through Q8_0 plus BF16) so Q4_K_M actually exists.
- trigger_load / run_probe now use --write-out to capture HTTP code and emit the response body on non-2xx, so failures surface a real diagnostic instead of an opaque set -e abort.
- LOAD_TIMEOUT bumped to 600s; INFER_TIMEOUT to 120s.
- Probe payload built via `yq -n` so JSON quoting is reliable regardless of the prompt text.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
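One way to capture the status code alongside the body with `--write-out` is sketched below. The helper names (`post_json`, `http_code_of`, `body_of`, `report`) are hypothetical and the real trigger_load / run_probe may be structured differently; only the curl flags are taken as given.

```shell
#!/usr/bin/env bash
# post_json URL PAYLOAD TIMEOUT: prints the response body followed by a final
# line holding the HTTP status code. No --fail, so error bodies are kept.
post_json() {
  curl --silent --show-error --max-time "$3" \
    --header 'Content-Type: application/json' \
    --data "$2" --write-out '\n%{http_code}' "$1"
}

# Split the combined output produced by post_json.
http_code_of() { printf '%s\n' "$1" | tail -n 1; }
body_of() { printf '%s\n' "$1" | sed '$d'; }

# report RESPONSE: print the body on 2xx, otherwise surface it on stderr
# together with the status code and return non-zero.
report() {
  case "$(http_code_of "$1")" in
    2??) body_of "$1" ;;
    *) echo "HTTP $(http_code_of "$1"): $(body_of "$1")" >&2; return 1 ;;
  esac
}
```

Usage would look like `resp=$(post_json "$url" "$payload" 600) && report "$resp"`, where `url` and `payload` are whatever the caller has built.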
|||
|
39010c779f
|
add script/validate-neuron.sh — end-to-end candle harness smoke test
Loads a small public Qwen3 GGUF on a target neuron host, fires a
deterministic reasoning probe ("What is the capital of France?"),
and asserts the response contains 'Paris'. Used to validate the
candle harness on a real GPU host before the Stage 7 TP work begins,
and as a regression check after future neuron builds.
Defaults to beast.hanzalova.internal + Qwen/Qwen3-1.7B-GGUF + Q4_K_M;
all three are positional args so the same script tests any node /
model combination. Polls /models after triggering the load since
/models/load returns once the materialisation is *queued*, not
finished.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
8a2334eacb
|
deploy: dnf-native version check + lair.cafe repo bootstrap
Replaces the string compare of 'git describe --tags' vs the binary's
self-reported --version (which lies about prereleases — every
0.1.16-* RPM reports just "0.1.16") with the dnf-native question of
"is the installed package current against what the repo offers".
Mechanism:
- installed_nvr(): rpm -q --qf '%{version}-%{release}' for the
resident package, falling back to "(not installed)". Capturing rpm's
output through a variable keeps its "package X is not installed"
stdout message out of the result on failure.
- needs_update(): probes rpm -q first (treats absent as "needs work"),
then asks dnf check-update --refresh -q. Other dnf failures collapse
into "needs update" so the subsequent install surfaces a real error
rather than this check swallowing one silently.
- ensure_lair_repo(): probes for /etc/yum.repos.d/lair-cafe-unstable.repo
and adds it with `dnf config-manager addrepo` when missing. The
upstream .repo file ships enabled=0 (unstable channel doesn't
auto-engage on fetch), so we then run `dnf config-manager setopt
lair-cafe-unstable.enabled=1` every run — cheap, idempotent.
- Cortex and neuron install branches now guard `systemctl stop` with
`[ ! -f /usr/lib/systemd/system/...service ] || sudo systemctl stop`
so fresh installs (no unit file yet) don't short-circuit the install
step under set -e.
- dnf output is captured into a variable and only printed (with a
[host] prefix per line) on failure, so success stays quiet and
failures show the actual diagnostic instead of being eaten by
&> /dev/null.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
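The two probes at the top of the mechanism list can be approximated as follows. This is a sketch under assumptions: the function names match the message, but the bodies are reconstructed from the description, relying on `dnf check-update` exiting 0 when nothing is pending and 100 when an update is available.

```shell
#!/usr/bin/env bash
# installed_nvr PKG: version-release of the resident package, or a placeholder.
# rpm prints "package X is not installed" on stdout when it fails, so the
# output is captured first and only echoed when rpm succeeded.
installed_nvr() {
  if nvr=$(rpm -q --qf '%{version}-%{release}' "$1" 2>/dev/null); then
    printf '%s\n' "$nvr"
  else
    printf '%s\n' '(not installed)'
  fi
}

# needs_update PKG: succeeds (returns 0) when the host needs work.
needs_update() {
  # An absent package always needs work.
  rpm -q "$1" >/dev/null 2>&1 || return 0
  if dnf check-update --refresh -q "$1" >/dev/null 2>&1; then
    return 1   # exit 0: nothing newer in the enabled repos
  fi
  return 0     # exit 100 (update available) or any other dnf failure
}
```

Treating every non-zero dnf result as "needs update" means a broken repo shows up as a visible install failure instead of a silent skip.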
|
|||
|
249c9442e8
|
chore: track deployment script
All checks were successful
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m2s
CI / Test (push) Successful in 3m59s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
|
|||
|
5c957d08ec
|
ci: add build-prerelease workflow for CUDA RPMs on rpm.lair.cafe
Some checks failed
CI / Format (push) Successful in 36s
CI / Test (push) Failing after 53s
CI / Clippy (push) Successful in 2m35s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Adds a manually-triggered workflow that builds CUDA-flavoured neuron binaries and a CPU cortex binary, packages them as Fedora RPMs, signs them, and rsyncs to the unstable channel at https://rpm.lair.cafe/fedora/43/x86_64/unstable/. Mirrors the build pipeline used by grenade/mistralrs-package.
Pipeline:
- prepare: derive {version,short_sha,commit_date} from the checkout; the prerelease Release stamp "0.1.YYYYMMDDgitSHORTSHA" sorts below the eventual "1" stable release.
- build-cortex: cargo build --release -p cortex-cli on a rust runner.
- build-neuron: matrix over ada (sm_89) and blackwell (sm_120) on cuda-13.0 runners; cargo build with features "cuda cudnn flash-attn" and CUDA_COMPUTE_CAP set per flavour.
- package-{cortex,neuron}: rpmbuild on the rpm runner against the new prebuilt-binary specs in rpm/.
- publish: import signing key, sign RPMs, rsync to oolon, createrepo_c --update, then regenerate packages.json for the UI.
New specs are prebuilt-binary variants: they consume the artifact from the build job rather than running cargo at rpmbuild time. Each helexa-neuron-{flavour} package Conflicts with the other flavours and with helexa-neuron (the future source-build stable package) so one flavour is installed at a time on a given host.
neuron crate gains cudnn and flash-attn feature flags forwarding to the corresponding candle features, so the CI build command compiles those kernels into the binary.
sccache is intentionally NOT used in the prerelease jobs: CUDA compute cap isn't in its cache key, so flavours would mis-hit each other. Each prerelease build is a clean cargo build.
Required Gitea secrets (already in place for cortex.spec / COPR workflow):
- RPM_SIGNING_KEY, RPM_SIGNING_KEY_ID
- RSYNC_SSH_KEY
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
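The Release stamp from the prepare step could be derived along these lines. Both helpers are hypothetical and the workflow's actual commands are not shown in this message; the sketch only follows the "0.1.YYYYMMDDgitSHORTSHA" pattern quoted above.

```shell
#!/usr/bin/env bash
# release_stamp COMMIT_DATE SHORT_SHA: hypothetical helper that formats the
# prerelease Release field. The leading "0" segment is what makes it compare
# lower than a stable Release of "1".
release_stamp() { printf '0.1.%sgit%s\n' "$1" "$2"; }

# stamp_from_checkout: same stamp, read from the current git checkout.
# Defined for illustration only; it is not called here.
stamp_from_checkout() {
  release_stamp \
    "$(git log -1 --format=%cd --date=format:%Y%m%d)" \
    "$(git rev-parse --short HEAD)"
}

release_stamp 20260519 1866b99   # prints 0.1.20260519git1866b99
```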