helexa

Author	SHA1	Message	Date
rob thijssen	b3dc835375	ci: bound job runtime + stop dropping sccache on rustc signal-death A neuron-blackwell build hung ~90 min (siblings finished in 2) and there was no job timeout to kill it, so it sat burning a runner. Root cause of the hang: the inline retry loop treated every failure identically and, on its final attempt, rebuilt with sccache disabled. When the real failure is a rustc SIGSEGV or an OOM-kill, an uncached rebuild does more work under the same memory pressure — turning one transient compiler crash into a wedged job. Two fixes: 1. timeout-minutes on every job in build-prerelease.yml and ci.yml (builds 25, neuron CUDA build/cuda-check 35, packaging 20, COPR 60, fast jobs 10-15). A hang now dies in minutes, not hours. 2. New script/ci-cargo-escalate.sh replaces the five (prerelease) + three (ci) inline escalation loops.
It classifies the failure: - signal death (exit >=128, or cargo reporting `signal: N`/SIGSEGV/SIGKILL) → compiler crash, NOT an sccache fault: keep the cache, one warm retry, then fail fast. Never escalate to uncached. - sccache fault (recognisable sccache error) → restart the server, retry, then one final uncached attempt. - deterministic compile/test error → fail fast (no wasteful retry). It also folds in the CUDA-image sccache probe the neuron/cuda-check jobs did inline. Classification verified locally against success, plain failure, exit-139, and the cargo-wrapped `signal: 11` form. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 13:02:50 +03:00
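The classification rules in this commit can be sketched as follows. This is a minimal Python illustration, not the contents of script/ci-cargo-escalate.sh, which the message describes as a shell script. The function name `classify_failure`, the returned labels, the `PLAN` table, and in particular the pattern used to recognise an sccache fault are assumptions made for the example.

```python
import re

# Signal death as cargo reports it, e.g. "(signal: 11, SIGSEGV: ...)".
SIGNAL_RE = re.compile(r"signal: \d+|SIGSEGV|SIGKILL")
# Hypothetical marker for an sccache fault; the real script's pattern is not shown.
SCCACHE_RE = re.compile(r"sccache.*error", re.IGNORECASE)


def classify_failure(exit_code: int, output: str) -> str:
    """Map a cargo exit code and its captured output to a failure class."""
    if exit_code == 0:
        return "success"
    # Exit >= 128 or a reported signal means the compiler itself died.
    if exit_code >= 128 or SIGNAL_RE.search(output):
        return "signal"
    if SCCACHE_RE.search(output):
        return "sccache"
    # Anything else is treated as a real compile or test error.
    return "deterministic"


# Follow-up steps per class, paraphrasing the commit message.
PLAN = {
    "signal": ["one warm retry with the cache kept", "fail"],
    "sccache": ["restart the sccache server and retry", "one uncached attempt"],
    "deterministic": ["fail"],
}

if __name__ == "__main__":
    for code, text in [(139, ""), (101, "(signal: 11, SIGSEGV: invalid memory reference)"),
                       (101, "sccache: error: failed to execute compile"),
                       (101, "error[E0382]: use of moved value")]:
        kind = classify_failure(code, text)
        print(code, kind, PLAN[kind])
```

Checking the signal case first matters: a crashed compiler running under the wrapper may also mention sccache in its output, and the commit's point is that such a failure must never escalate to an uncached rebuild.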
rob thijssen	112e4e124a	fix(ci): export RUSTC_WRAPPER in the build step itself — GITHUB_ENV doesn't propagate Run 375 proved the CUDA image ships sccache (probe step printed "sccache enabled") but the wrapper never reached cargo: the runner does not propagate GITHUB_ENV across steps, so the builds ran unwrapped (server stats: 4 compile requests for a ~600-crate build, durations unchanged). Probe and export inside the build step's own shell instead, in both build-neuron and ci.yml's cuda-check. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 14:50:25 +03:00
rob thijssen	2dadea5d8d	ci: enable sccache on the build jobs (conditional on the CUDA image) The 3 CUDA flavour builds (10-14 min each, the critical path of every full run) and build-cortex compiled entirely uncached. With the gongfoo-side sccache hardening in place, wire them up: - build-cortex: full sccache env (rust image ships it) + the standard escalation loop (retry -> server restart -> uncached final attempt). - build-neuron: probe for sccache before enabling the wrapper — the CUDA image may not ship it, and a missing binary must degrade to an uncached build, not fail cargo at `sccache rustc -vV` (the original reason the wrapper was cleared here). rustc compilations are shared across all three flavours; candle-kernels' nvcc output stays uncached (build-script artifact). - ci.yml cuda-check: same probe pattern replaces the blanket env clear; also pins CUDA_COMPUTE_CAP=86 since the image no longer ships nvidia-smi for candle-kernels' fallback detection (mirrors `9bb9678` on the #20 branch).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 14:05:26 +03:00
rob thijssen	f5fa840dfb	ci: escalate sccache retries — restart server, then fall back uncached Run 361's Test job failed all 3 attempts with the sccache dead-server signature (sccache fatal error, ENOENT on its own tmp files under target/debug/deps). Retrying the same invocation only helps for transient races; against a wedged server every same-VM retry fails identically — and under the new pipeline that blocks publish and the deploy behind it. Escalate instead: attempt 1 plain, attempt 2 after an sccache server restart, attempt 3 with RUSTC_WRAPPER unset (uncached). A sick cache now costs build minutes, never the deploy. Applied to the lint/test jobs in build-prerelease.yml and ci.yml alike. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 13:24:02 +03:00
rob thijssen	7557c5e877	ci: cut iteration latency — change-aware builds, gated deploys, dev fast path Push-to-testable was ~20.5 min for every commit (measured on the 2026-06-08 green chain) plus a ~5 min 27B cold-load, regardless of what changed. Three structural fixes: - build-prerelease: a change-detection step in `prepare` diffs HEAD against the git sha embedded in the last published unstable RPM (per package, from packages.json) and skips builds whose inputs didn't change. Docs-only commits build nothing; gateway-only commits skip the 3 CUDA flavour builds. Detection failures fall open to a full build. - ci.yml no longer runs on pushes to main; fmt/clippy/test live in build-prerelease as parallel jobs gating publish. The two workflows previously queued against each other on the same runner labels, delaying the cortex build ~12 min. Branches, PRs, and tags keep the full ci.yml gate.
- deploy: each host self-gates with `dnf check-update` and leaves the service untouched when the installed package is already current — no more neuron restarts (and 27B cold-loads) for commits that didn't change neuron. - deploy-dev (new): manual single-host fast path — build one CUDA flavour, scp the binary, restart the service. Skips packaging, signing, publish, and dnf entirely. Backed by a new exact-form sudoers rule in asset/sudoers.d/neuron-host.conf (already applied to all three hosts). Expected loop times when runners behave: docs ≈ 1 min (nothing deploys), gateway-only ≈ 6-8 min, single-neuron dev ≈ 8-10 min, full fleet ≈ 13-15 min. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 13:17:22 +03:00
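The change-detection rule in this commit (skip a build when none of its inputs changed, and fall open to a full build when detection fails) can be sketched like this. It is a minimal illustration under stated assumptions: the path prefixes in `INPUTS` and the function name `packages_to_build` are hypothetical, and the real step obtains the changed paths by diffing HEAD against the sha recorded in packages.json, which is replaced here by a plain list.

```python
from typing import Iterable, Optional, Set

# Hypothetical mapping from package name to the path prefixes that feed its build.
INPUTS = {
    "cortex": ("crates/cortex/", "Cargo.lock"),
    "helexa-neuron": ("crates/neuron/", "Cargo.lock"),
}


def packages_to_build(changed: Optional[Iterable[str]]) -> Set[str]:
    """Return the packages whose inputs changed.

    `changed` is None when detection itself failed (for example, no previous
    sha could be resolved); in that case every package is rebuilt.
    """
    if changed is None:
        return set(INPUTS)  # fall open to a full build
    paths = list(changed)
    return {
        pkg
        for pkg, prefixes in INPUTS.items()
        if any(path.startswith(prefixes) for path in paths)
    }


if __name__ == "__main__":
    print(packages_to_build(["README.md"]))                  # docs-only
    print(packages_to_build(["crates/cortex/src/main.rs"]))  # gateway-only
    print(packages_to_build(None))                           # detection failed
```

Falling open rather than closed is the safer default here: a detection bug then costs build minutes instead of silently shipping a stale package.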
rob thijssen	1a74cb0c56	chore: rename repo cortex -> helexa helexa is the project; cortex (per-operator control plane / LLM proxy) and neuron (per-host LLM harness) are its components. The Gitea repo is now helexa/helexa. Update repository URLs in Cargo metadata, RPM specs, and docs; make the CI changelog push URL rename-proof via the github.repository context; reframe README.md and CLAUDE.md around the project name. Binary, package, service, and config-path names are unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 10:54:01 +03:00
rob thijssen	2f387f33f8	ci: export CUDA paths in cuda-check so cudarc build.rs finds nvcc act launches step shells without sourcing /etc/profile, so the gitea_runner user's PATH lacks /usr/local/cuda-13.0/bin. cudarc's build.rs panics with ENOENT on `nvcc --version` under the neuron crate's cuda-version-from-build-system feature. build-prerelease.yml already does this export — mirror it here. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:28:04 +03:00
rob thijssen	cad7552104	ci: clear sccache env on cuda-check so cargo doesn't try to wrap rustc CI run 255 job 3 (CUDA type-check) fails with: error: could not execute process `*** rustc -vV` (never executed) Caused by: No such file or directory (os error 2) The redacted `***` is `sccache`. The ci.yml workflow-level env block sets `RUSTC_WRAPPER: sccache` because the generic `rust` runner has sccache installed and routes the cache to caveman.kosherinata.internal. The new `cuda-check` job runs on `cuda-13.0` (where nvcc lives), and that runner doesn't carry sccache on PATH — so cargo's first action (`sccache rustc -vV` to probe the compiler version) fails before borrow-check even starts.
`build-prerelease.yml`, which uses the same `cuda-13.0` runner for the actual release neuron builds, deliberately does NOT set RUSTC_WRAPPER. That's the pattern this commit applies. Fix: override `RUSTC_WRAPPER` (plus the SCCACHE_* and AWS_* env) locally on the job. We lose caching on the cuda-check job (it's borrow-check-only and finishes in a couple of minutes anyway), but the gate runs. The job's purpose — fail fast on `#[cfg(feature = "cuda")]` borrowck errors that the default-feature gate misses — is what matters, and that purpose was undermined by the env inheritance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 13:55:18 +03:00
rob thijssen	957f704efa	feat(neuron): OpenAI Responses API + ci cuda-check runner label Step 2 of the Responses rollout: native `/v1/responses` endpoint on neuron that consumes the same InferenceEvent stream as `/v1/chat/completions` but emits it as the Responses API's named SSE event family. No gateway-side translation. ## Surface - `cortex-core::responses` envelope types: `ResponsesRequest`, `ResponsesInput` (text | items), `ResponsesInputItem` (message | function_call | function_call_output | reasoning), `ResponsesContentPart` (input_text | input_image | output_text), `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`.
Plus a `events::*` constant module so the projector and the wire shape stay in sync without string-typos. - `neuron::wire::openai_responses`: - `request_to_chat(req)` flattens Responses input + instructions into a `ChatCompletionRequest` the candle harness already understands. Text-only Parts collapse to a string; mixed text+image Parts go to chat's content-array shape; reasoning items drop; function_call / function_call_output round-trip via tool_calls / tool_call_id metadata so the surface is consistent for the day the harness emits tool calls. - `project_responses_stream(rx, meta)` reads InferenceEvents and emits the eight named events that compose a Responses stream: response.created → output_item.added → content_part.added → output_text.delta×N → output_text.done → content_part.done → output_item.done → response.completed. Synthesises start frames if the producer skips Start (poisoned model, early disconnect) so the stream stays coherent. - `build_response(meta, text, reason, usage)` for the non-streaming path. - `CandleHarness::inference_stream(req)` extracted from `chat_completion_stream`, returning a typed `InferenceStream` (event receiver + id/created/model_id metadata). Both `chat_completion_stream` and the new `responses_stream` are now thin wrappers that pick their wire projection. TP path got the same treatment (`chat_completion_tp_stream` → `inference_tp_stream`). - `POST /v1/responses` route on neuron. Non-streaming returns one buffered `ResponsesResponse`; streaming returns axum SSE with both event names and JSON data per frame (Responses, unlike chat completions, uses named `event:` lines). Reused `inference_error_response` helper hoisted out so the chat and responses handlers share the InferenceError → HTTP mapping. ## CI Also bundles the `cuda-check` runner-label fix from feedback on commit `1859777`: `runs-on: rpm` doesn't ship the CUDA toolkit so cudarc's nvcc-version build script blew up. 
Switched to `runs-on: cuda-13.0` per the existing labels. ## Scope cuts (documented in the modules) - `previous_response_id` rejected at translate time with 400 (`code: chained_conversation_not_supported`) — stateful chained conversations need a persistence layer we haven't built. - Reasoning items dropped (no Qwen3 `<think>` routing yet). - Single output item per response (one `"message"` carrying text); `function_call` items reserved but not synthesised. - Streaming events cover the core set; `response.in_progress` and the web_search / image_generation event families are out-of-scope. 22 new tests: 5 in cortex-core (envelope round-trips), 13 in neuron::wire (request translator + projector + non-streaming builder), 4 in neuron's tests/api.rs (route surface — 503 when no candle, 400 on previous_response_id, 404 on missing model for both stream and non-stream). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 11:13:44 +03:00
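The event ordering described for `project_responses_stream` can be illustrated with a small generator. This is a Python sketch of the ordering only, not the Rust implementation: the event names are the abbreviated forms used in the commit message, the `(kind, payload)` input tuples and the function name `project` are hypothetical, and synthesising start frames for a completely empty stream is an assumption.

```python
from typing import Iterable, Iterator, Tuple

START_FRAMES = ("response.created", "output_item.added", "content_part.added")
CLOSE_FRAMES = ("content_part.done", "output_item.done", "response.completed")

Frame = Tuple[str, str]


def project(events: Iterable[Tuple[str, str]]) -> Iterator[Frame]:
    """Turn (kind, payload) inference events into named stream frames."""
    started = False
    parts = []
    for kind, payload in events:
        if not started:
            # Emitted on Start, or synthesised when the producer skipped it.
            started = True
            yield from ((name, "") for name in START_FRAMES)
        if kind == "delta":
            parts.append(payload)
            yield "output_text.delta", payload
    if not started:
        # Assumed behaviour: an empty stream still gets its start frames.
        yield from ((name, "") for name in START_FRAMES)
    # The done frame carries the full accumulated text.
    yield "output_text.done", "".join(parts)
    yield from ((name, "") for name in CLOSE_FRAMES)


if __name__ == "__main__":
    stream = [("start", ""), ("delta", "he"), ("delta", "llo"), ("done", "stop")]
    for name, data in project(stream):
        print(name, repr(data))
```

Keeping the start and close frames in fixed tuples means the ordering lives in one place, which is the same motivation the commit gives for its `events::*` constant module.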
rob thijssen	1859777332	ci: add cuda type-check job so CUDA-only borrowck errors fail fast Run 244 caught a use-of-moved-value in a `#[cfg(feature = "cuda")]` block that the default-feature workspace clippy/test gate had no chance of seeing. The error appeared only when the RPM build workflow compiled with `--features cuda` — 30+ minutes after push. Add a `cuda-check` job to ci.yml that runs `cargo check -p neuron --features cuda --all-targets` on the rpm runner (where nvcc / cudarc build deps live; the generic `rust` runner doesn't have them). Borrow-check only — we never run tests here, the runner has no GPU. Same retry pattern as clippy/test.
Both SRPM jobs (`srpm-cortex`, `srpm-neuron`) now gate on `cuda-check` so a CUDA build break can't reach the release pipeline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 09:49:51 +03:00
rob thijssen	6e1c1dd0fc	ci: retry clippy + test up to 3 times on spurious sccache failures sccache occasionally fails mid-compile with race-condition errors that clear on a re-run without any code changes. Rather than tracking that down right now, wrap the two affected steps in a bash loop that retries up to three times with a 5-second pause. Real failures still surface; they just take ~10s longer to fail. fmt is left as a single invocation — it's a one-shot syntactic check, not a build, and isn't subject to the same sccache races. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:55:18 +03:00
rob thijssen	60176e7c2e	ci: monotonic prerelease versions + serialize CI on shared runner Two CI hygiene fixes uncovered while validating against the live fleet. 1. Same-day prerelease packages were being ordered by RPM-vercmp's alpha-vs-digit precedence on the git SHA fragment, not by commit chronology. With release stamps like "0.1.${YYYYMMDD}git${SHA}", two commits on the same day produce the same numeric prefix and rpmvercmp falls back to comparing the alphanumeric SHA suffixes, where digit-leading SHAs are ranked above alpha-leading ones — completely unrelated to which commit landed first. Verified with rpmdev-vercmp: gitabc1234 < gitdef5678 (old scheme — purely lexicographic) Bumping the timestamp prefix to second-precision (%Y%m%d%H%M%S) makes the numeric prefix strictly monotonic for any chronologically- ordered commits, so the SHA fragment becomes a debug identifier only — never participates in version ordering. 2. ci.yml and build-prerelease.yml both target the `rust` runner label and both auto-trigger on push to main. The act-based runner reuses /root/.cache/act/<hash>/hostexecutor/ across concurrent jobs, so ci.yml's clippy and build-prerelease.yml's build-cortex were racing each other's checkout/cleanup steps and corrupting in-flight compile artifacts. Real fix is in gongfoo; workflow-level workaround is a shared concurrency group with cancel-in-progress=false so the two workflows queue sequentially on the same ref. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:36:53 +03:00
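The ordering problem in point 1 can be reproduced with a simplified re-implementation of RPM's segment comparison. This is a sketch under stated assumptions: it omits tilde and caret handling and any epoch or release splitting, the function name `rpmvercmp` is reused only for readability, and the version strings are made-up examples in the `0.1.${YYYYMMDD}git${SHA}` shape.

```python
import re


def rpmvercmp(a: str, b: str) -> int:
    """Simplified RPM version comparison: -1, 0 or 1 (no ~ or ^ support)."""
    seg_a = re.findall(r"[0-9]+|[A-Za-z]+", a)
    seg_b = re.findall(r"[0-9]+|[A-Za-z]+", b)
    for x, y in zip(seg_a, seg_b):
        x_num, y_num = x.isdigit(), y.isdigit()
        if x_num != y_num:
            # A numeric segment is newer than an alphabetic one.
            return 1 if x_num else -1
        if x_num:
            # Compare numerically: drop leading zeros, longer wins.
            x, y = x.lstrip("0"), y.lstrip("0")
            if len(x) != len(y):
                return 1 if len(x) > len(y) else -1
        if x != y:
            return 1 if x > y else -1
    if len(seg_a) == len(seg_b):
        return 0
    # All shared segments equal: the version with segments left over is newer.
    return 1 if len(seg_a) > len(seg_b) else -1


if __name__ == "__main__":
    # Day-precision stamps: the earlier commit (def5678) sorts as newer.
    print(rpmvercmp("0.1.20260519gitdef5678", "0.1.20260519gitabc1234"))
    # Second-precision stamps: the timestamp decides before the SHA is reached.
    print(rpmvercmp("0.1.20260519101500gitdef5678", "0.1.20260519133653gitabc1234"))
```

With day precision the numeric prefixes tie and the result depends on the SHA text. With second precision the comparison is settled by the timestamp, so the SHA fragment never takes part in ordering.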
rob thijssen	6d2dc5ff1a	fix(ci): give fmt/clippy/test distinct CARGO_TARGET_DIR to avoid races After the candle deps were added, cargo builds run long enough that the parallel fmt/clippy/test jobs (all on the `rust` runner label, which appears to use act in host-executor mode) start racing each other's intermediate temp files under /root/.cache/act/<hash>/hostexecutor/target/debug/deps/ Concretely the test job hit: error: No such file or directory at path "target/debug/deps/.tmprlicL7" Compiling unicode-ident because another job's cargo invocation cleaned up the temp file mid-compile. fmt and clippy happened to finish without their own target races landing fatally, so only test failed visibly. Set CARGO_TARGET_DIR=target-${{ github.job }} at the workflow level so each job writes to its own target directory. sccache still backs the actual rustc cache, so the rebuild penalty is just metadata not full recompiles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:26:29 +03:00
rob thijssen	7f797b0265	ci: parallelise fmt/clippy/test and drop sccache install step Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-11 13:55:17 +03:00
rob thijssen	5a0360c1d5	ci: use container runner labels for CI jobs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-11 13:29:42 +03:00
rob thijssen	3e1fb60076	ci: drop actions/cache for cargo registry and target The cache round-trip (download + unpack) was consistently taking around 6 minutes, noticeably longer than the ~3 minute cold build it was meant to accelerate. Net-negative on CI time — remove it. sccache with the S3 backend still provides dep-level caching at a much lower overhead, so we keep the majority of the cache benefit without paying the actions/cache tarball cost. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 17:45:25 +03:00
rob thijssen	abe4ff7ccc	ci: publish both packages to a single helexa/helexa COPR project Consolidates the previous helexa/cortex and helexa/helexa-neuron COPR projects into one shared project. Hosts enable a single repo and get access to both packages — cortex for gateway hosts and helexa-neuron for GPU nodes. Reduces the "which copr do I enable on this host" friction, and makes it clear the two packages are parts of the same helexa project suite. CI keeps two independent publish jobs (copr-cortex and copr-neuron) running in parallel; they now both target helexa/helexa with their respective SRPMs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:37:47 +03:00
rob thijssen	7c3390a4e1	fix(rpm): rename neuron package to helexa-neuron Fedora's official repos ship a package named `neuron` — the NEURON neural-simulation environment from Yale (see https://src.fedoraproject.org/rpms/neuron). Having our own `neuron` in the helexa COPR caused dnf5 to silently no-op `dnf install neuron` because of the name collision, even with the COPR repo enabled and keys imported. The only workarounds were full NEVRA (`dnf install neuron-0.1.12-1.fc43.x86_64`) or a local file install — neither acceptable for end-users. Rename the RPM package to `helexa-neuron`. Keep binary (/usr/bin/neuron), systemd unit (neuron.service), system user (neuron), and config dir (/etc/neuron) unchanged — those are project-local contexts where the short name is unambiguous. Follows Fedora subpackage-style naming except with a vendor prefix rather than a parent-package prefix, because neuron is an independent package from cortex (installed on different hosts) and neither depends on the other. Changes: - neuron.spec -> helexa-neuron.spec (git rename) - Name: neuron -> helexa-neuron (with comment explaining why) - CI: srpm-neuron job now builds helexa-neuron-VERSION.tar.gz with the matching top-level dir prefix, publishes to helexa/helexa-neuron COPR - CI: bump-version job references helexa-neuron.spec - CLAUDE.md: install instructions updated Old helexa/neuron COPR project can be deleted after the first helexa/helexa-neuron build lands. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:37:47 +03:00
rob thijssen	2ff062da0e	ci: commit generated %changelog entries back to main Previously the srpm-* jobs generated a fresh %changelog entry and shipped it to COPR, but the version-stamped spec pushed back to main by the bump-version job only updated the Version: line — not the %changelog section. The result: SRPM and in-tree spec diverged and a fresh clone of the repo showed a perpetually empty changelog. Run the rpm-changelog action in bump-version too. Now the committed specs track the SRPMs: each release leaves a dated %changelog entry in main covering commits since the previous tag, visible in git log and in the repo's spec browser. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:37:03 +03:00
rob thijssen	1d90238b01	ci: migrate rpm changelog generation to reusable action Replace the local .gitea/scripts/generate-rpm-changelog.sh with the shared composite action at https://git.lair.cafe/actions/rpm-changelog@v1. Behaviour is identical — collect commits since the previous v* tag, filter bump-version and merge noise, prepend a dated entry to the spec — but the logic now lives in one place that other projects can consume. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:32:51 +03:00
rob thijssen	d99b25fb8a	ci: auto-generate rpm changelog entry per release On every tag push, build a %changelog entry from the git log since the previous v* tag and prepend it to each spec. Stops the initial entry from drifting further and catches bogus-date / stale-version warnings automatically since the generated date always matches the day the CI runs. The generator drops "chore: bump version" commits (bot-authored, noisy in user-facing changelogs) and merge commits. Author defaults to the gitea-actions identity but can be overridden via CHANGELOG_AUTHOR env var if a human release is desired. Requires fetch-depth: 0 on checkout so git describe can see prior tags and git log can reach them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:32:51 +03:00
rob thijssen	4a9a4fc775	ci: migrate copr publish to reusable action Replace the in-repo .gitea/scripts/copr-build.sh and per-job copr-cli configuration with the shared composite action at https://git.lair.cafe/actions/copr-publish@v1. Behaviour is identical (submit, watch, dump per-chroot logs), but the logic now lives in a single place that other projects can consume. Removes the actions/checkout step from both COPR jobs since the build script is no longer local to this repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 12:34:39 +03:00
rob thijssen	5c7d63c658	ci: dump COPR per-chroot build logs to CI output Previously the COPR publish steps only surfaced copr-cli's status updates (pending/importing/running). When a build failed, diagnosing required clicking through to the COPR web UI. Now we submit with --nowait, watch the build, then use copr-cli download-build to fetch each chroot's builder-live.log and cat them as collapsible ::group:: blocks in the CI output. Logic is factored into .gitea/scripts/copr-build.sh so cortex and neuron jobs share it. Both COPR jobs now check out the repo to access the script. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 12:06:05 +03:00
rob thijssen	15ded3a5bd	ci: cache target/, disable incremental, drop redundant build Three complementary tweaks to close the gap sccache alone can't: - CARGO_INCREMENTAL=0: reclaims the 17 incremental-mode cache misses per run and prevents cargo from writing incremental fingerprints that defeat sccache. Incremental mode is useless in CI anyway since each run starts from scratch. - actions/cache for ~/.cargo and target/: sidesteps sccache's structural limits (proc-macro non-cacheables, clippy-vs-rustc separate namespaces) by caching the whole build output keyed on Cargo.lock. Also caches ~/.cargo/bin so the installed sccache binary survives between runs. - Drop the separate 'cargo build' step: 'cargo test --workspace' builds everything anyway, so the standalone build was a full redundant workspace compile pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 09:44:45 +03:00
rob thijssen	7befa882d5	fix: yaml syntax	2026-04-16 09:25:02 +03:00
rob thijssen	d03fae960a	fix(ci): unset RUSTC_WRAPPER during sccache install The workflow-level env set RUSTC_WRAPPER=sccache for every step, including the install step itself. cargo install sccache then tried to invoke `sccache rustc -vV` to detect the toolchain before sccache existed on PATH, failing with "No such file or directory". Override RUSTC_WRAPPER to empty on the install step so cargo uses rustc directly; subsequent steps still inherit the wrapper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 08:31:26 +03:00
rob thijssen	7b2235d56b	fix(ci): install sccache with S3 feature if missing The distro sccache package lacks S3 support. Install from cargo with --features s3 if the existing binary can't connect to the S3 backend. Skips install if already present and working. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:44:21 +03:00
rob thijssen	54f9f3dc36	ci: add sccache with MinIO backend for build caching All Rust compilation steps now use sccache backed by MinIO S3 at caveman.kosherinata.internal:9000. Credentials via repo secrets SCCACHE_S3_ACCESS_KEY and SCCACHE_S3_SECRET_KEY. Cache is shared across all bare metal runners. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:38:13 +03:00
rob thijssen	caee8bba11	fix(ci): use GITEA_TOKEN env var for push, not checkout Token is only needed for the authenticated push, not the public checkout. Set remote URL with token inline before pushing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 16:31:13 +03:00
rob thijssen	324dfa05c5	ci: add RPM packaging for cortex and neuron - cortex.spec: gateway binary, cortex.service systemd unit, cortex.toml + models.toml config files - neuron.spec: neuron binary, neuron.service systemd unit, neuron.toml config file - Parallel CI: srpm-cortex and srpm-neuron jobs build SRPMs concurrently, then publish to separate COPR repos (helexa/cortex and helexa/neuron) - bump-version job: after both COPR publishes succeed, stamps tag version into Cargo.toml, specs, Cargo.lock and pushes to main via GITEA_TOKEN - Shared cortex user/group across both packages - Example configs: cortex.example.toml, neuron.example.toml, models.example.toml Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 16:28:31 +03:00
rob thijssen	c85d50066e	ci: add RPM packaging for cortex and neuron - cortex.spec: gateway binary, cortex.service systemd unit, cortex.toml + models.toml config files - neuron.spec: neuron binary, neuron.service systemd unit, neuron.toml config file - Parallel CI: srpm-cortex and srpm-neuron jobs build SRPMs concurrently, then publish to separate COPR repos (helexa/cortex and helexa/neuron) - Shared cortex user/group across both packages - Example configs: cortex.example.toml, neuron.example.toml, models.example.toml Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 16:09:04 +03:00
rob thijssen	6bb3004cfc	ci: add Gitea CI, RPM spec, license, and repo hygiene - Add .gitea/workflows/ci.yml with fmt/clippy/test on all branches and SRPM build + COPR publish on version tags - Add cortex.spec for Fedora RPM packaging - Add GPL-3.0-or-later LICENSE file - Add cortex.example.toml with generic hostnames; gitignore cortex.toml - Scrub infrastructure-specific hostnames from README.md, CLAUDE.md, and doc comments - Fix unused imports and clippy warnings to pass -D warnings - Fix missing deps (bytes, reqwest, serde_json) exposed during build - Run cargo fmt across workspace - Update SPDX license identifier to GPL-3.0-or-later Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:24:04 +03:00

32 Commits