Commit Graph

12 Commits

Author SHA1 Message Date
d46d8d4f6c feat(tp): Stage 7b-iv — RPC + orchestration for TP load/inference
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m20s
build-prerelease / Build cortex binary (push) Successful in 4m25s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
CI / Test (push) Successful in 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m57s
build-prerelease / Build neuron-ampere (push) Successful in 4m51s
build-prerelease / Build neuron-ada (push) Successful in 5m12s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m49s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m51s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Wires the in-flight TP machinery (Stage 7a workers, 7b-iii sharded
Qwen3) end to end so a non-streaming chat completion can run across
multiple GPUs via NCCL.

RPC additions (tp/rpc.rs):
- LoadDenseShard{model_id, config_json, safetensors_paths}
- GenerateStep{model_id, tokens, offset}
- ClearKvCache{model_id}
- UnloadModel{model_id}
- LoadDenseShardOk / GenerateStepOk / KvCacheCleared / Unloaded

Worker side (tp/worker.rs):
- WorkerState gains a `models: HashMap<String, TpQwen3ForCausalLM>`
  keyed by model_id. LoadDenseShard mmaps safetensors via
  ShardedVarBuilder (only this rank's slice materialises), builds the
  TP model with the rank's NCCL Comm cloned from NcclState.
- GenerateStep runs the rank-local forward; the resulting logits are
  dropped (only the leader's are used for sampling). The forward's
  value here is the NCCL collectives inside the row-parallel layers
  letting the leader's rank-0 forward make progress.

Pool side (tp/mod.rs):
- WorkerPool::load_dense_shard fans LoadDenseShard out to every worker,
  builds rank 0's shard on the leader via spawn_blocking with a fresh
  SendComm wrapper at the move boundary (Comm is !Send at the type
  level), collects per-rank LoadDenseShardOk. Returns the leader's
  Arc<Mutex<TpQwen3ForCausalLM>>.
- WorkerPool::generate_step fans GenerateStep out, runs the leader's
  rank-0 forward in spawn_blocking (the AllReduce CustomOps inside
  row-parallel layers block until every worker issues the matching
  collective), returns the leader's last-position logits Tensor.
- WorkerPool::clear_kv_cache + unload_model follow the same pattern.

NcclState refactor (tp/nccl_state.rs):
- comm field becomes Option<Arc<Comm>> (was Option<Comm>) so callers
  can share a clone with TpQwen3ForCausalLM::load.
- new `comm()` accessor + `SendComm` wrapper for spawn_blocking moves.
- single allow(clippy::arc_with_non_send_sync) at the canonical
  construction site (Comm is !Send by type but the runtime invariant
  is enforced by SendComm + the pool's Mutex).

Harness side (candle.rs):
- LoadedHandle enum (Single | Tp) replaces the bare Arc<LoadedModel>
  in the harness's registry. list_models / unload_model /
  inference_endpoint walk the enum uniformly.
- TpLoadedModel holds the pool + leader_model + tokenizer + devices.
- load_model dispatches on `spec.tensor_parallel > 1` to a new
  cuda-gated load_tp path: resolve dense files via hf-hub, spawn the
  pool, init_nccl, load_dense_shard.
- chat_completion branches on the handle variant. The TP path mirrors
  run_inference: clear_kv_cache, prefill, sample, decode loop,
  detokenize. Acquires the pool Mutex for the whole request.
- Streaming through TP is deferred to Stage 7c (returns Other(err)).

Script (script/validate-neuron.sh):
- 4th positional arg `tp_size` (default 1). When >1, switches to the
  dense path (tp + GGUF is mutually exclusive — bails) and adds
  `tensor_parallel` + `devices` to the load payload. NEURON_DEVICES
  env overrides the default 0..N-1 device list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:38:33 +03:00
9b8bd146f6 feat(tp): --tp-smoke CLI subcommand + remote validation script
All checks were successful
CI / Format (push) Successful in 36s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m32s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m43s
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 4m56s
build-prerelease / Build neuron-ada (push) Successful in 5m1s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
Adds a one-shot diagnostic that exercises the lower half of the TP
stack — WorkerPool::spawn, init_nccl, nccl_sanity_check — in isolation
from model load and inference. Runs N-1 worker subprocesses (rank 0
stays in this process), joins them in an NCCL communicator on the
specified CUDA devices, all_reduces a sentinel 1u32 per rank, verifies
the observed_sum equals world_size on every rank, then shuts down.

Output is `status=ok` on stdout (plus key=value lines for tp_size and
cuda_devices) when every check passes, non-zero exit + tracing on
stderr otherwise. The smoke command is diagnostic-only and not exposed
through the daemon HTTP API.

script/tp-smoke.sh wraps it with an ssh invocation against a fleet
host (default beast — the only host with 2 GPUs) and asserts the
status line, mirroring the validate-neuron.sh ergonomics.

This is step 1 of the TP test plan. A failure here means TP cannot
work on the host at all; step 2 (Stage 7b-iv) wires real model load
and inference through the same WorkerPool primitives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 19:40:25 +03:00
5436af9c73 fix(neuron/candle): dense Qwen3 returns rank-3 logits, double-squeeze
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m19s
build-prerelease / Build neuron-blackwell (push) Successful in 3m32s
CI / Test (push) Successful in 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-ampere (push) Successful in 4m55s
build-prerelease / Build neuron-ada (push) Successful in 5m11s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m35s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Caught by live validation against Qwen/Qwen3-1.7B on beast:
  HTTP 500 "unexpected rank, expected: 1, got: 2 ([1, 151936])"

Candle's qwen3::ModelForCausalLM::forward returns shape [B, 1, V]
(no final squeeze) while quantized_qwen3::ModelWeights::forward
returns [B, V] (with squeeze(1) at the end). My match arms applied
a single squeeze(0) uniformly, which is correct for the quantized
[1, V] → [V] but leaves the dense at [1, V] → which then trips
apply_repeat_penalty::to_vec1() expecting rank 1.

Dense match arms now strip both batch and seq dims:
  model.forward(&input, offset)?.squeeze(0)?.squeeze(0)?

Also fixes validate-neuron.sh's `${3:-Q4_K_M}` → `${3-Q4_K_M}`
(no colon) so passing an explicit empty third arg now drives the
dense path instead of falling back to Q4_K_M.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:49:43 +03:00
05e15f3597 Stage 7b-i: dense safetensors Qwen3 load path
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 43s
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Clippy (push) Successful in 2m4s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
Adds the bf16/fp16 safetensors path alongside the existing GGUF
quantized one. The harness now dispatches by ModelSpec.quant:
- Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged).
- None    → safetensors dense (new).

The dense path uses candle-transformers::models::qwen3::ModelForCausalLM
verbatim, fed via VarBuilder::from_mmaped_safetensors over the files
listed in `model.safetensors.index.json` (sharded layout) or the
single `model.safetensors` fallback. dtype is bf16 to match the
canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from
the same repo (no -GGUF suffix to strip).

ModelArch gains a Qwen3Dense variant; the forward signature mirrors
QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position
logits), so run_inference / run_inference_streaming just add a parallel
match arm — no shape changes downstream.

This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on:
because the source is dense safetensors that can be byte-sliced per
rank, the TP work avoids the GGUF super-block alignment problem
entirely. Vanilla GGUF inference keeps working unchanged.

validate-neuron.sh learns the dense path: pass an empty third arg
(quant) and the script omits the `quant` field from the load
payload, triggering the dense dispatch. Example:
  script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B ''

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:03:59 +03:00
18ae3c30ee post-validation cleanup: cuDNN runtime + repetition penalty
All checks were successful
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Build neuron-blackwell (push) Successful in 3m42s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 4m27s
build-prerelease / Build neuron-ada (push) Successful in 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 6m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 2m32s
Two followups from the live single-GPU validation pass.

1. deploy.sh now ensures libcudnn.so.9 is available on each neuron
   host before installing/upgrading the package. Probes ldconfig first
   so hosts with a manual (tar/runfile) cuDNN install are untouched,
   then adds NVIDIA's RHEL9 CUDA repo (the Fedora 43 CUDA repo doesn't
   ship cuDNN; only the RHEL9 one does) and installs libcudnn9-cuda-13.
   benjy hit "cannot open shared object file: libcudnn.so.9" during
   validation; this prevents that recurring.

2. candle.rs applies a 1.1 repetition penalty over the last 64
   generated tokens before sampling, in both the non-streaming
   chat_completion path and the streaming chat_completion_stream
   path. Without it small Q4_K_M models degenerate into "Wait, no,
   no..." loops once they hit a confident-but-wrong path; with it
   sampling stays coherent. Defaults match mistral.rs and llama.cpp;
   exposing the value via the OpenAI request (frequency/presence
   penalty mapping) is Stage 8 territory.

Both routes through a new sample_with_penalty() helper so future
sampling tweaks land in one place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:48:08 +03:00
1a0400131e fix(deploy): use dnf upgrade for stale installs, install only when absent
All checks were successful
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m27s
CI / Test (push) Successful in 4m30s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m29s
build-prerelease / Build cortex binary (push) Successful in 4m32s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m15s
build-prerelease / Build neuron-ada (push) Successful in 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m48s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m47s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 57s
dnf5's `dnf install <pkg>` is a no-op when the package is already
installed at ANY version — it does NOT auto-upgrade to the latest
available. The deploy script's install branch was therefore silently
leaving hosts on older builds even though needs_update correctly
reported an upgrade was available.

Add an is_installed() probe and an install_or_upgrade() helper that
picks the right verb: `dnf install` when fresh, `dnf upgrade` when
stale. Captured combined-stream output is exposed via __DNF_OUTPUT__
for the existing failure-diagnostic path.

Verified end-to-end against the live fleet: hanzalova/beast/benjy/
quadbrat all upgraded cleanly from prior prerelease NVRs to
0.1.16-0.1.20260519134302.git1866b99.fc43, validation script returned
"Paris" from all three neurons.

Followup (not in this commit): all hosts running helexa-neuron-*
need libcudnn.so.9 available at runtime. Currently:
  - quadbrat: libcudnn9-cuda-13 RPM (rhel9 CUDA repo)
  - beast:    /usr/lib64/libcudnn.so.9 (manual install)
  - benjy:    needed rhel9 CUDA repo added + libcudnn9-cuda-13 installed
              as part of this validation pass.
The spec currently excludes cuDNN from auto-detected deps. Should
add a Recommends:libcudnn9-cuda-13 (soft) and ensure the rhel9 CUDA
repo is configured on each neuron host, similar to how ensure_lair_repo
handles the unstable channel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:10:48 +03:00
1866b99a89 fix(validate-neuron): jq for JSON, say→stderr, sane max_tokens
All checks were successful
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Clippy (push) Successful in 2m13s
CI / Test (push) Successful in 4m22s
build-prerelease / Build neuron-blackwell (push) Successful in 3m25s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ampere (push) Successful in 4m39s
build-prerelease / Build neuron-ada (push) Successful in 4m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m34s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Three real bugs caught while exercising the script end-to-end against
the live quadbrat node:

1. say() printed status to stdout. Inside run_probe(), the
   "POST /v1/chat/completions (probe: ...)" line was being captured
   by `raw=$(run_probe)` along with the JSON body, so jq saw
   "[host] POST..." as the first line and choked at column 29 with
   "Invalid numeric literal" (it tried to parse the `[` as the start
   of a JSON array). Redirect say() to stderr so command
   substitutions capture only the intended return value.

2. The pretty-print step `echo "${raw}" | yq -r '.'` re-emitted the
   JSON as YAML, which fails on response content that looks like YAML
   markers (chatcmpl ids that parse as aliases, escaped quotes inside
   <think>...</think> blocks). Drop the pretty-print; just echo the
   raw JSON.

3. JSON response parsing now uses jq (always JSON) instead of yq
   (parses input as YAML by default). yq remains in use only for the
   genuinely-YAML asset/manifest.yml elsewhere.

4. max_tokens bumped 32 → 256. Qwen3 prepends a <think>...</think>
   reasoning block before its final answer when the chat template
   enables thinking mode, and that eats most of a small budget — the
   "Paris" answer was being truncated mid-thought. 256 leaves enough
   room for both.

Verified pipeline end-to-end on quadbrat (RTX 3060, helexa-neuron-ampere
git602e8e1): /health OK → /models/load (unsloth/Qwen3-0.6B-GGUF Q4_K_M)
→ /v1/chat/completions → response content contains "Paris".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 13:43:02 +03:00
ed4d71db09 fix(validate-neuron): default to unsloth GGUF + capture curl errors
Two reasons the previous run silently bailed after POST /models/load:

1. Default model was Qwen/Qwen3-0.6B-GGUF (official). That repo ships
   ONLY Q8_0 — no Q4_K_M, no Q4_0, nothing else. The GGUF filename
   matcher in CandleHarness::resolve_files returned "no GGUF file
   matching quant Q4_K_M" and the load endpoint returned an error,
   but the script used `curl --silent --fail` and swallowed it.

2. /models/load is synchronous (it awaits the full HF download + GGUF
   parse). curl --max-time 30 was way too short for a 400 MB fresh
   download.

Fixes:
- Default model is now unsloth/Qwen3-0.6B-GGUF, which mirrors the
  full Q-spectrum (Q2_K through Q8_0 plus BF16) so Q4_K_M actually
  exists.
- trigger_load / run_probe now use --write-out to capture HTTP code
  and emit the response body on non-2xx, so failures surface a real
  diagnostic instead of an opaque set -e abort.
- LOAD_TIMEOUT bumped to 600s; INFER_TIMEOUT to 120s.
- Probe payload built via `yq -n` so JSON quoting is reliable
  regardless of the prompt text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:14:31 +03:00
39010c779f add script/validate-neuron.sh — end-to-end candle harness smoke test
Loads a small public Qwen3 GGUF on a target neuron host, fires a
deterministic reasoning probe ("What is the capital of France?"),
and asserts the response contains 'Paris'. Used to validate the
candle harness on a real GPU host before the Stage 7 TP work begins,
and as a regression check after future neuron builds.

Defaults to beast.hanzalova.internal + Qwen/Qwen3-1.7B-GGUF + Q4_K_M;
all three are positional args so the same script tests any node /
model combination. Polls /models after triggering the load since
/models/load returns once the materialisation is *queued*, not
finished.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:58:05 +03:00
8a2334eacb deploy: dnf-native version check + lair.cafe repo bootstrap
Replaces the string compare of 'git describe --tags' vs the binary's
self-reported --version (which lies about prereleases — every
0.1.16-* RPM reports just "0.1.16") with the dnf-native question of
"is the installed package current against what the repo offers".

Mechanism:
- installed_nvr(): rpm -q --qf '%{version}-%{release}' for the
  resident package, falling back to "(not installed)". Capturing rpm's
  output through a variable keeps its "package X is not installed"
  stdout message out of the result on failure.
- needs_update(): probes rpm -q first (treats absent as "needs work"),
  then asks dnf check-update --refresh -q. Other dnf failures collapse
  into "needs update" so the subsequent install surfaces a real error
  rather than this check swallowing one silently.
- ensure_lair_repo(): probes for /etc/yum.repos.d/lair-cafe-unstable.repo
  and adds it with `dnf config-manager addrepo` when missing. The
  upstream .repo file ships enabled=0 (unstable channel doesn't
  auto-engage on fetch), so we then run `dnf config-manager setopt
  lair-cafe-unstable.enabled=1` every run — cheap, idempotent.
- Cortex and neuron install branches now guard `systemctl stop` with
  `[ ! -f /usr/lib/systemd/system/...service ] || sudo systemctl stop`
  so fresh installs (no unit file yet) don't short-circuit the install
  step under set -e.
- dnf output is captured into a variable and only printed (with a
  [host]   prefix per line) on failure, so success stays quiet and
  failures show the actual diagnostic instead of being eaten by
  &> /dev/null.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:55:02 +03:00
249c9442e8 chore: track deployment script
All checks were successful
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m2s
CI / Test (push) Successful in 3m59s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
2026-05-18 17:50:35 +03:00
5c957d08ec ci: add build-prerelease workflow for CUDA RPMs on rpm.lair.cafe
Some checks failed
CI / Format (push) Successful in 36s
CI / Test (push) Failing after 53s
CI / Clippy (push) Successful in 2m35s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Adds a manually-triggered workflow that builds CUDA-flavoured neuron
binaries and a CPU cortex binary, packages them as Fedora RPMs, signs
them, and rsyncs to the unstable channel at
https://rpm.lair.cafe/fedora/43/x86_64/unstable/. Mirrors the build
pipeline used by grenade/mistralrs-package.

Pipeline:
- prepare: derive {version,short_sha,commit_date} from the checkout;
  the prerelease Release stamp "0.1.YYYYMMDDgitSHORTSHA" sorts below
  the eventual "1" stable release.
- build-cortex: cargo build --release -p cortex-cli on a rust runner.
- build-neuron: matrix over ada (sm_89) and blackwell (sm_120) on
  cuda-13.0 runners; cargo build with features "cuda cudnn flash-attn"
  and CUDA_COMPUTE_CAP set per flavour.
- package-{cortex,neuron}: rpmbuild on the rpm runner against the new
  prebuilt-binary specs in rpm/.
- publish: import signing key, sign RPMs, rsync to oolon, createrepo_c
  --update, then regenerate packages.json for the UI.

New specs are prebuilt-binary variants — they consume the artifact
from the build job rather than running cargo at rpmbuild time. Each
helexa-neuron-{flavour} package Conflicts with the other flavours and
with helexa-neuron (the future source-build stable package) so one
flavour is installed at a time on a given host.

neuron crate gains cudnn and flash-attn feature flags forwarding to
the corresponding candle features, so the CI build command compiles
those kernels into the binary.

sccache is intentionally NOT used in the prerelease jobs — CUDA
compute cap isn't in its cache key, so flavours would mis-hit each
other. Each prerelease build is a clean cargo build.

Required Gitea secrets (already in place for cortex.spec / COPR
workflow):
- RPM_SIGNING_KEY, RPM_SIGNING_KEY_ID
- RSYNC_SSH_KEY

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:01:35 +03:00