helexa

Author	SHA1	Message	Date
rob thijssen	569c528c4b	feat(gateway): Anthropic streaming SSE translation (#24) The /v1/messages handler translated request envelopes but proxied raw OpenAI SSE frames back to streaming Anthropic clients; that was the gap between the README's "point your tooling at it once" contract and what Claude Code actually received. cortex-core gains AnthropicStreamTranslator, a pure per-stream state machine: OpenAI chunks in, ordered (event, payload) pairs out. The sequence is message_start → content_block_start/delta/stop (text and tool_use blocks, indexed; tool_calls map to input_json_delta) → message_delta (stop_reason mapped via the now-shared map_stop_reason, which also teaches the non-streaming path tool_calls→tool_use) → message_stop. Without an upstream usage frame the output count falls back to the delta count (engine-exact for neuron's one-chunk-per-token streams, #31); with one, input/output tokens ride message_delta.
cortex-gateway gains anthropic_sse: the wire pump that splits the upstream byte stream into SSE events, parses data: payloads (leniently, since engines omit fields on special frames), feeds the translator, and frames results as `event:`/`data:` pairs through a bounded channel (a slow client back-pressures the upstream read). Upstream truncation without [DONE] still closes the Anthropic event sequence. Nothing is buffered beyond the current event's bytes. Tests: 5 state-machine unit tests (text flow, stop-reason mapping + defaults, tool_use blocks, usage propagation, idempotent finish) and 2 gateway integration tests (full event sequence + text reassembly, usage propagation into message_delta). Validated end-to-end by running this branch's gateway against a production neuron and streaming a live Anthropic request. Closes #24 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 15:47:30 +03:00
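A minimal sketch of a per-stream translator of this kind, assuming the OpenAI chunks have already been parsed into a simplified, hypothetical `Chunk` enum (one piece per value; a real chunk can carry several) and reducing the Anthropic payloads to a hypothetical `Event` enum. Tool-use ids and input-token counts are omitted for brevity, and `Translator`, `Chunk`, and `Event` are illustrative names, not the actual `AnthropicStreamTranslator` API. Block indices grow as text and tool_use blocks open, switching blocks closes the previous one, and `finish()` is idempotent.

```rust
pub enum Chunk {
    Text(String),
    ToolCall { index: usize, name: Option<String>, args: String },
    Finish(String),
    Usage { completion_tokens: u64 },
}

#[derive(Debug, PartialEq)]
pub enum Event {
    MessageStart,
    TextBlockStart { index: usize },
    ToolUseBlockStart { index: usize, name: String },
    TextDelta { index: usize, text: String },
    InputJsonDelta { index: usize, partial_json: String },
    BlockStop { index: usize },
    MessageDelta { stop_reason: &'static str, output_tokens: u64 },
    MessageStop,
}

pub fn map_stop_reason(finish: Option<&str>) -> &'static str {
    match finish {
        Some("length") => "max_tokens",
        Some("tool_calls") => "tool_use",
        _ => "end_turn", // "stop" or missing; this default is an assumption
    }
}

#[derive(Clone, Copy, PartialEq)]
enum Block { Text, Tool(usize) } // Tool carries the upstream tool_call index
#[derive(Default)]
pub struct Translator {
    started: bool,
    finished: bool,
    open: Option<Block>,
    next_index: usize,         // next Anthropic content-block index
    deltas: u64,               // fallback output-token count
    usage_output: Option<u64>, // completion tokens from an upstream usage frame
    finish_reason: Option<String>,
}

impl Translator {
    // Emit message_start once; close the open block unless it is `next`.
    fn prelude(&mut self, next: Option<Block>, out: &mut Vec<Event>) {
        if !std::mem::replace(&mut self.started, true) {
            out.push(Event::MessageStart);
        }
        if self.open.is_some() && self.open != next {
            out.push(Event::BlockStop { index: self.next_index - 1 });
            self.open = None;
        }
    }
    // Make `block` the open block, count one delta, and return its index.
    fn open_block(&mut self, block: Block, out: &mut Vec<Event>, start: impl FnOnce(usize) -> Event) -> usize {
        self.prelude(Some(block), out);
        if self.open.is_none() {
            out.push(start(self.next_index));
            self.next_index += 1;
            self.open = Some(block);
        }
        self.deltas += 1;
        self.next_index - 1
    }
    // Feed one chunk piece; returns the events it produces, in order.
    pub fn push(&mut self, chunk: Chunk) -> Vec<Event> {
        let mut out = Vec::new();
        if self.finished {
            return out;
        }
        match chunk {
            Chunk::Finish(reason) => self.finish_reason = Some(reason),
            Chunk::Usage { completion_tokens } => self.usage_output = Some(completion_tokens),
            Chunk::Text(text) => {
                let index = self.open_block(Block::Text, &mut out, |i| Event::TextBlockStart { index: i });
                out.push(Event::TextDelta { index, text });
            }
            Chunk::ToolCall { index: call, name, args } => {
                let start = |i: usize| Event::ToolUseBlockStart { index: i, name: name.unwrap_or_default() };
                let index = self.open_block(Block::Tool(call), &mut out, start);
                out.push(Event::InputJsonDelta { index, partial_json: args });
            }
        }
        out
    }

    // Close the sequence (also on truncation without [DONE]); idempotent.
    pub fn finish(&mut self) -> Vec<Event> {
        let mut out = Vec::new();
        if std::mem::replace(&mut self.finished, true) {
            return out;
        }
        self.prelude(None, &mut out);
        let stop_reason = map_stop_reason(self.finish_reason.as_deref());
        let output_tokens = self.usage_output.unwrap_or(self.deltas);
        out.push(Event::MessageDelta { stop_reason, output_tokens });
        out.push(Event::MessageStop);
        out
    }
}
```

Because `finish()` does not depend on having seen `[DONE]`, a wire pump can call it on upstream truncation as well as on normal completion, and a second call emits nothing.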
rob thijssen	8f6f1d3205	feat(deploy): validate neuron capability after every deploy A deploy previously went green the moment systemd reported the service started, so a merge that broke model loading or inference itself would deploy "successfully" and only surface when a human noticed. Each neuron deploy now earns its green: 1. Wait for default models: poll /health until activation.state is ready, with per-host timeouts in the matrix (beast 900s for the 27B Q6K TP=2 cold-load, benjy/quadbrat 300s). Any entry in activation.failed fails the deploy with the per-model error, which is the structured equivalent of watching the journal for "loaded default model", plus failure detail the journal line can't carry. 2. LLM smoke probe: ask the first loaded model to reply with one specific word (max_tokens 512 so thinking models have room, temperature 0) and grep the response for it. Not a quality bar, just proof the deploy didn't lobotomize inference.
Hosts whose package is already current still skip everything; the validation cost is only paid when a restart actually happened. The probe was dry-run against benjy's production neuron before landing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 15:28:20 +03:00
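The two validation gates reduce to a small piece of control flow. This sketch assumes the /health response has already been parsed into a hypothetical `Activation` struct (with `None` standing for an endpoint that is not reachable yet while the service starts) and that the probe's response body is available as text; the deploy itself need not be written in Rust. It fails fast on any `activation.failed` entry, succeeds once ready, and gives up at the per-host timeout.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, already-parsed view of neuron's /health `activation` object.
pub struct Activation {
    pub ready: bool,                   // activation.state == "ready"
    pub failed: Vec<(String, String)>, // (model, error) pairs from activation.failed
}

#[derive(Debug, PartialEq)]
pub enum DeployCheck {
    Ready,
    ModelFailed(Vec<(String, String)>),
    TimedOut,
}

/// Poll until the default models are ready. `poll` returns None while the
/// endpoint is unreachable; any failed entry ends the wait immediately.
pub fn wait_for_default_models(
    timeout: Duration,
    interval: Duration,
    mut poll: impl FnMut() -> Option<Activation>,
) -> DeployCheck {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(activation) = poll() {
            if !activation.failed.is_empty() {
                return DeployCheck::ModelFailed(activation.failed);
            }
            if activation.ready {
                return DeployCheck::Ready;
            }
        }
        if Instant::now() >= deadline {
            return DeployCheck::TimedOut;
        }
        std::thread::sleep(interval);
    }
}

/// Smoke probe check: a plain substring match, like grepping the response.
pub fn smoke_probe_passes(response_body: &str, expected_word: &str) -> bool {
    response_body.contains(expected_word)
}
```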
grenade	b0d0b939af	Merge pull request 'feat(gateway): per-request token metrics, TTFT and tok/s (#21)' (#30) from feat/gateway-21-token-metrics into main	2026-06-12 12:25:32 +00:00
rob thijssen	6a36d15ef1	feat(gateway): per-request token metrics, TTFT and tok/s (#21) The deferred Phase 6b, and the unblock for the 7→8 milestone's benchmark work (#22): until cortex measures itself per request, nothing downstream can be benchmarked or graphed. The proxy wraps the upstream byte stream in a pass-through inspector (TokenMetricsStream): chunks are forwarded verbatim (never buffered or re-serialised) while the inspector records arrival times and keeps a bounded (64 KiB) tail of the body text. At stream end (or client disconnect, via Drop) it extracts the final OpenAI usage object, which is present on the last SSE chunk and in non-streaming JSON bodies alike, for engine-truth token counts.
Per request, labelled {model, node}: - cortex_time_to_first_token_seconds (histogram): first body chunk - cortex_tokens_per_second (histogram): completion tokens over the decode window (first→last chunk); falls back to total request duration for single-chunk non-streaming bodies - cortex_prompt_tokens_total / cortex_completion_tokens_total (counters) The extractor is pure and chunk-boundary-safe; quoted-needle matching keeps completion_tokens_details from shadowing completion_tokens, and the last usage object wins. Covers chat completions, completions, the Responses API, and the Anthropic streaming path (which currently proxies OpenAI SSE). Tests: 4 extractor unit tests; integration test with a streaming mock emitting a stream_options-style final usage chunk, asserting both histograms and exact-or-greater counter values (the test recorder is process-global and shared across the binary's tests). Closes #21 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 15:11:52 +03:00
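A rough sketch of the bounded tail, the quoted-needle extraction, and the tok/s arithmetic (tok/s = completion_tokens / (t_last - t_first), or over the total duration when first and last chunk coincide). `BodyTail`, `last_usage_value`, and `tokens_per_second` are hypothetical names, and the string scan is a simplification rather than a JSON parser. Searching for the key with its closing quote is what stops `completion_tokens_details` from matching, and scanning from the end makes the last usage object win.

```rust
use std::time::Duration;

const TAIL_CAP: usize = 64 * 1024; // bounded tail size from the commit (64 KiB)

/// Keeps the last TAIL_CAP bytes of a body that is forwarded elsewhere untouched.
#[derive(Default)]
pub struct BodyTail {
    buf: Vec<u8>,
}

impl BodyTail {
    pub fn observe(&mut self, chunk: &[u8]) {
        self.buf.extend_from_slice(chunk);
        if self.buf.len() > TAIL_CAP {
            let excess = self.buf.len() - TAIL_CAP;
            self.buf.drain(..excess);
        }
    }

    pub fn byte_len(&self) -> usize {
        self.buf.len()
    }

    /// Lossy, because the cut point may split a UTF-8 sequence at the front.
    pub fn text(&self) -> String {
        String::from_utf8_lossy(&self.buf).into_owned()
    }
}

/// Integer value of the last occurrence of the JSON key `key`, matched with
/// both quotes so "completion_tokens_details" never matches "completion_tokens".
pub fn last_usage_value(text: &str, key: &str) -> Option<u64> {
    let needle = format!("\"{key}\"");
    let after = &text[text.rfind(&needle)? + needle.len()..];
    let value = after.trim_start().strip_prefix(':')?.trim_start();
    let end = value.find(|c: char| !c.is_ascii_digit()).unwrap_or(value.len());
    value[..end].parse().ok()
}

/// Completion tokens over the decode window (first to last chunk), falling back
/// to the total request duration when the body arrived as a single chunk.
pub fn tokens_per_second(completion: u64, decode_window: Duration, total: Duration) -> Option<f64> {
    let window = if decode_window.is_zero() { total } else { decode_window };
    let secs = window.as_secs_f64();
    (secs > 0.0).then(|| completion as f64 / secs)
}
```

Because the tail is only parsed at stream end, a key split across two chunks is reassembled before matching.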
grenade	b463439416	Merge pull request 'feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)' (#29) from feat/neuron-19-driver-preflight into main	2026-06-12 12:08:20 +00:00
rob thijssen	716558c8ff	feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19) The un-rebooted driver update (userspace libs bumped, kernel module still old) kills every CUDA call on the host, including nvidia-smi, and neuron surfaced it only as `Comm::from_rank ... NcclError` deep inside the first model load; diagnosing it took 30 minutes of forensics on beast (2026-06-08). Make it instantly legible instead: - discovery distinguishes nvidia-smi absent (CPU-only, fine) from present-but-failing, classifies the "Driver/library version mismatch" signature, and pairs the userspace NVML version with the loaded kernel-module version from /proc/driver/nvidia/version. - DiscoveryResponse gains `cuda_unavailable_reason` (omitted when None, so it stays wire-compatible) so cortex can see why the node has no devices and route around it.
- startup logs one loud ERROR line with the actionable reason ("reboot the host to reload the kernel module") and skips default model loads entirely, marking each failed with that reason so /health activation shows the real cause. - POST /models/load fast-rejects with 503 + code=cuda_unavailable on a mismatch host instead of dying minutes later in cuInit/NCCL. No false positives: other nvidia-smi failures (no devices, perms) keep their existing behaviour, CPU-only hosts stay silent. Closes #19 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 15:00:00 +03:00
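A sketch of the discovery classification, assuming nvidia-smi's exit status and combined output have already been captured. The enum and function names are hypothetical, and both the `NVML library version:` label in nvidia-smi's mismatch output and the layout of /proc/driver/nvidia/version are assumptions about driver output formats that vary between releases, which is why each version lookup returns `Option`.

```rust
/// Outcome of GPU discovery (hypothetical names mirroring the commit's cases).
#[derive(Debug, PartialEq)]
pub enum CudaStatus {
    CpuOnly,   // nvidia-smi absent: fine, stay silent
    Available, // nvidia-smi ran successfully
    DriverMismatch { nvml: Option<String>, kernel_module: Option<String> },
    OtherFailure, // e.g. no devices or permissions: keep existing behaviour
}

/// `smi` is None when nvidia-smi is not installed, else (exited_ok, combined output).
/// `proc_version` is the contents of /proc/driver/nvidia/version, if readable.
pub fn classify(smi: Option<(bool, &str)>, proc_version: Option<&str>) -> CudaStatus {
    match smi {
        None => CudaStatus::CpuOnly,
        Some((true, _)) => CudaStatus::Available,
        Some((false, out)) if out.contains("Driver/library version mismatch") => {
            CudaStatus::DriverMismatch {
                nvml: token_after(out, "NVML library version:"),
                kernel_module: proc_version.and_then(first_version_token),
            }
        }
        Some((false, _)) => CudaStatus::OtherFailure,
    }
}

impl CudaStatus {
    /// Value for a `cuda_unavailable_reason`-style field; None means no new reason.
    pub fn unavailable_reason(&self) -> Option<String> {
        match self {
            CudaStatus::DriverMismatch { nvml, kernel_module } => Some(format!(
                "NVIDIA driver/library version mismatch (NVML {}, kernel module {}): reboot the host to reload the kernel module",
                nvml.as_deref().unwrap_or("unknown"),
                kernel_module.as_deref().unwrap_or("unknown"),
            )),
            _ => None,
        }
    }
}

fn token_after(text: &str, label: &str) -> Option<String> {
    let start = text.find(label)? + label.len();
    text[start..].split_whitespace().next().map(str::to_owned)
}

/// First whitespace-separated token made only of digits and dots, e.g. "550.54.14".
fn first_version_token(text: &str) -> Option<String> {
    text.split_whitespace()
        .find(|t| t.contains('.') && t.chars().all(|c| c.is_ascii_digit() || c == '.'))
        .map(str::to_owned)
}
```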
rob thijssen	112e4e124a	fix(ci): export RUSTC_WRAPPER in the build step itself, since GITHUB_ENV doesn't propagate Run 375 proved the CUDA image ships sccache (probe step printed "sccache enabled") but the wrapper never reached cargo: the runner does not propagate GITHUB_ENV across steps, so the builds ran unwrapped (server stats: 4 compile requests for a ~600-crate build, durations unchanged). Probe and export inside the build step's own shell instead, in both build-neuron and ci.yml's cuda-check. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 14:50:25 +03:00
rob thijssen	dc6feec6dc	fix(deploy): gate on the publish manifest, not unprivileged dnf check-update The `f5fa840` deploy exposed both failure modes of gating with `dnf check-update` as the gitea_ci user in one run: it hung indefinitely on quadbrat (blocked process, 0 CPU, killed manually), and on benjy/beast it silently reported "no updates" two minutes after new RPMs were published, so both hosts skipped a real (luckily binary-identical) update. Gate with data we own instead: fetch packages.json from rpm.lair.cafe (plain curl, no privileges, no dnf locks), take the newest release per package by buildTime, and skip the stop/upgrade/start cycle only when it exactly equals `rpm -q %{VERSION}-%{RELEASE}`. An unreachable or unparsable manifest fails open to a full deploy. The dnf transaction itself still runs under the scoped sudoers rules, unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 14:20:21 +03:00
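The gating decision fits in one pure function. This sketch assumes packages.json has already been fetched and parsed into a hypothetical `Release` struct (only `buildTime` is named in the commit; the other field names are assumptions) and that the installed version was read as a `%{VERSION}-%{RELEASE}` string. A missing manifest fails open to a deploy.

```rust
/// One packages.json entry, in a hypothetical parsed shape.
pub struct Release<'a> {
    pub package: &'a str,
    pub version: &'a str,
    pub release: &'a str,
    pub build_time: u64, // buildTime
}

#[derive(Debug, PartialEq)]
pub enum Action {
    Skip,   // installed package already equals the newest published build
    Deploy, // full stop/upgrade/start cycle
}

/// `manifest` is None when packages.json was unreachable or unparsable.
/// `installed` is the host's `%{VERSION}-%{RELEASE}`, or None if not installed.
pub fn gate(manifest: Option<&[Release<'_>]>, package: &str, installed: Option<&str>) -> Action {
    let Some(releases) = manifest else {
        return Action::Deploy; // fail open
    };
    let newest = releases
        .iter()
        .filter(|r| r.package == package)
        .max_by_key(|r| r.build_time);
    match (newest, installed) {
        (Some(r), Some(have)) if format!("{}-{}", r.version, r.release) == have => Action::Skip,
        _ => Action::Deploy,
    }
}
```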
grenade	02f20bc9e1	Merge pull request 'feat: keep auto-recovering models visible as recovering (#20)' (#28) from feat/neuron-20-recovering-status into main	2026-06-12 11:15:38 +00:00
rob thijssen	2a231e49de	merge main (sccache enablement supersedes branch cuda-check pin) # Conflicts: # .gitea/workflows/ci.yml	2026-06-12 14:05:55 +03:00
rob thijssen	2dadea5d8d	ci: enable sccache on the build jobs (conditional on the CUDA image) The 3 CUDA flavour builds (10-14 min each, the critical path of every full run) and build-cortex compiled entirely uncached. With the gongfoo-side sccache hardening in place, wire them up: - build-cortex: full sccache env (rust image ships it) + the standard escalation loop (retry -> server restart -> uncached final attempt). - build-neuron: probe for sccache before enabling the wrapper; the CUDA image may not ship it, and a missing binary must degrade to an uncached build, not fail cargo at `sccache rustc -vV` (the original reason the wrapper was cleared here). rustc compilations are shared across all three flavours; candle-kernels' nvcc output stays uncached (build-script artifact). - ci.yml cuda-check: same probe pattern replaces the blanket env clear; also pins CUDA_COMPUTE_CAP=86 since the image no longer ships nvidia-smi for candle-kernels' fallback detection (mirrors `9bb9678` on the #20 branch).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 14:05:26 +03:00
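The probe-before-wrap rule for build-neuron, sketched in Rust under the assumption that the decision is made from a PATH-style string (the real jobs do this in a shell step). `rustc_wrapper` and `configure` are hypothetical names; `std::env::split_paths`, `Command::env`, and `Command::env_remove` are standard library APIs. A missing sccache yields an uncached build rather than a failed `sccache rustc -vV`.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// Look for an `sccache` binary on `path_var`. `is_file` is injected for
/// testability; real use would pass `|p| p.is_file()`.
pub fn rustc_wrapper(path_var: &str, is_file: impl Fn(&Path) -> bool) -> Option<PathBuf> {
    std::env::split_paths(path_var)
        .map(|dir| dir.join("sccache"))
        .find(|candidate| is_file(candidate.as_path()))
}

/// Wrap rustc only when sccache was found; otherwise make sure no stale
/// RUSTC_WRAPPER leaks in, so cargo builds uncached instead of failing.
pub fn configure(cmd: &mut Command, wrapper: Option<&Path>) {
    match wrapper {
        Some(path) => {
            cmd.env("RUSTC_WRAPPER", path);
        }
        None => {
            cmd.env_remove("RUSTC_WRAPPER");
        }
    }
}
```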
rob thijssen	9bb9678f93	fix(ci): pin CUDA_COMPUTE_CAP in cuda-check, since the builder image has no nvidia-smi candle-kernels' build script shells out to nvidia-smi for compute-cap detection when CUDA_COMPUTE_CAP is unset; the current GPU-less builder image doesn't ship it, so the type-check died in the build script before borrow-checking anything. Pin an arbitrary valid cap: the check is feature-gate compilation only; real caps live in build-prerelease.yml's flavour matrix. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 13:55:23 +03:00
rob thijssen	df9c490614	feat(neuron+gateway): keep auto-recovering models visible as `recovering` (#20) During the #17 auto-recovery window (unload → reload, minutes for a large TP model) the model's registry slot is absent, so it vanished from neuron's /models. cortex, routing by /models presence, then answered "model not found on any node" while a direct request to neuron would have correctly said "recovering, retry shortly". neuron: the recovery set becomes a map carrying a devices/capabilities snapshot taken at trigger time (while the registry slot still exists). list_models reports `recovering` for models in the set, both while the poisoned slot is still present and during the reload gap, where the snapshot keeps the model listed.
gateway: ModelStatus grows a Recovering variant (parsed from the wire); the router holds the route (a new RouteError::ModelRecovering, mapped to 503 instead of 404) and deliberately does not fall through to the catalogue cold-load, which would race a second placement against the in-flight recovery. The evictor already ignores non-Loaded entries. Tests: neuron unit test (recovering model stays listed with snapshot), gateway integration tests (poller parses `recovering`; request gets 503 retry-shortly and the model stays on /v1/models). Closes #20 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 13:42:03 +03:00
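A sketch of the routing rule with hypothetical names: `route`, `Route`, `http_status`, and a `from_wire` parser whose `"loaded"` wire string is an assumption (the commit only names `ModelStatus`, its `Recovering` variant, and `RouteError::ModelRecovering`). A loaded replica anywhere still wins; otherwise a recovering one holds the route with a 503 instead of falling through to a catalogue cold-load.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ModelStatus {
    Loaded,
    Recovering,
}

impl ModelStatus {
    /// Parse a status string from a node's /models listing (wire strings assumed).
    pub fn from_wire(s: &str) -> Option<Self> {
        match s {
            "loaded" => Some(Self::Loaded),
            "recovering" => Some(Self::Recovering),
            _ => None,
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Route<'a> {
    Node(&'a str), // proxy to this node
    ColdLoad,      // place the model from the catalogue
}

#[derive(Debug, PartialEq)]
pub enum RouteError {
    ModelNotFound,
    ModelRecovering,
}

impl RouteError {
    pub fn http_status(&self) -> u16 {
        match self {
            RouteError::ModelNotFound => 404,
            RouteError::ModelRecovering => 503, // "recovering, retry shortly"
        }
    }
}

/// `nodes` lists (node, status) for every node reporting the requested model.
pub fn route<'a>(nodes: &[(&'a str, ModelStatus)], in_catalogue: bool) -> Result<Route<'a>, RouteError> {
    if let Some((node, _)) = nodes.iter().find(|(_, s)| *s == ModelStatus::Loaded) {
        return Ok(Route::Node(*node));
    }
    if nodes.iter().any(|(_, s)| *s == ModelStatus::Recovering) {
        // Hold the route: a cold-load here would race the in-flight recovery.
        return Err(RouteError::ModelRecovering);
    }
    if in_catalogue { Ok(Route::ColdLoad) } else { Err(RouteError::ModelNotFound) }
}
```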
rob thijssen	f5fa840dfb	ci: escalate sccache retries (restart server, then fall back uncached) Run 361's Test job failed all 3 attempts with the sccache dead-server signature (sccache fatal error, ENOENT on its own tmp files under target/debug/deps). Retrying the same invocation only helps for transient races; against a wedged server every same-VM retry fails identically, and under the new pipeline that blocks publish and the deploy behind it. Escalate instead: attempt 1 plain, attempt 2 after an sccache server restart, attempt 3 with RUSTC_WRAPPER unset (uncached). A sick cache now costs build minutes, never the deploy. Applied to the lint/test jobs in build-prerelease.yml and ci.yml alike. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 13:24:02 +03:00
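The escalation ladder as a control-flow sketch, with the server restart and the cargo invocation injected as closures so the logic can be exercised without a toolchain. `Attempt` and `escalate` are hypothetical names; in a shell step, the restart would typically be `sccache --stop-server` followed by `sccache --start-server`, and the uncached attempt simply runs with `RUSTC_WRAPPER` unset.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Attempt {
    Plain,
    AfterServerRestart,
    Uncached, // RUSTC_WRAPPER unset
}

/// Run `build` up to three times, escalating between attempts. Returns the
/// attempt that succeeded, or every attempt that was tried if all failed.
pub fn escalate(
    mut restart_server: impl FnMut(),
    mut build: impl FnMut(Attempt) -> bool,
) -> Result<Attempt, Vec<Attempt>> {
    let mut tried = Vec::new();
    for attempt in [Attempt::Plain, Attempt::AfterServerRestart, Attempt::Uncached] {
        if attempt == Attempt::AfterServerRestart {
            restart_server(); // a wedged server fails every same-VM retry otherwise
        }
        if build(attempt) {
            return Ok(attempt);
        }
        tried.push(attempt);
    }
    Err(tried)
}
```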
rob thijssen	7557c5e877	ci: cut iteration latency (change-aware builds, gated deploys, dev fast path) Push-to-testable was ~20.5 min for every commit (measured on the 2026-06-08 green chain) plus a ~5 min 27B cold-load, regardless of what changed. Three structural fixes: - build-prerelease: a change-detection step in `prepare` diffs HEAD against the git sha embedded in the last published unstable RPM (per package, from packages.json) and skips builds whose inputs didn't change. Docs-only commits build nothing; gateway-only commits skip the 3 CUDA flavour builds. Detection failures fall open to a full build. - ci.yml no longer runs on pushes to main; fmt/clippy/test live in build-prerelease as parallel jobs gating publish. The two workflows previously queued against each other on the same runner labels, delaying the cortex build by ~12 min. Branches, PRs, and tags keep the full ci.yml gate.
- deploy: each host self-gates with `dnf check-update` and leaves the service untouched when the installed package is already current, so there are no more neuron restarts (and 27B cold-loads) for commits that didn't change neuron. - deploy-dev (new): a manual single-host fast path that builds one CUDA flavour, scps the binary, and restarts the service. Skips packaging, signing, publish, and dnf entirely. Backed by a new exact-form sudoers rule in asset/sudoers.d/neuron-host.conf (already applied to all three hosts). Expected loop times when runners behave: docs ≈ 1 min (nothing deploys), gateway-only ≈ 6-8 min, single-neuron dev ≈ 8-10 min, full fleet ≈ 13-15 min. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 13:17:22 +03:00
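The change-detection rule, reduced to a sketch: given the paths changed since the sha embedded in the last published RPM, a package is rebuilt only if some changed path falls under one of its inputs, and any detection failure falls open to a build. Representing a package's inputs as path prefixes is an assumption, and `needs_build` is a hypothetical name; the diff itself and the packages.json lookup are left out.

```rust
/// `changed` is the list of paths from diffing HEAD against the sha embedded in
/// the last published RPM, or None if detection failed (missing sha, fetch or
/// diff error). `inputs` are path prefixes that feed the package.
pub fn needs_build(changed: Option<&[&str]>, inputs: &[&str]) -> bool {
    match changed {
        None => true, // fall open to a full build
        Some(paths) => paths
            .iter()
            .any(|p| inputs.iter().any(|i| p.starts_with(*i))),
    }
}
```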
rob thijssen	91e95ca979	docs: rewrite README around project positioning Lead with what helexa is for (near-frontier open-weight models on consumer hardware you own) instead of a feature list. Adds the scope section (intentional divergence from vLLM/SGLang; CUDA-only today as a test-coverage constraint, not a principle), an engine section covering the per-device worker threads and consumer-GPU tensor parallelism, the previously missing helexa-acp crate, and a status section pointing at git.lair.cafe as the source of truth with GitHub as a read-only mirror. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 11:37:00 +03:00
rob thijssen	1a74cb0c56	chore: rename repo cortex -> helexa helexa is the project; cortex (per-operator control plane / LLM proxy) and neuron (per-host LLM harness) are its components. The Gitea repo is now helexa/helexa. Update repository URLs in Cargo metadata, RPM specs, and docs; make the CI changelog push URL rename-proof via the github.repository context; reframe README.md and CLAUDE.md around the project name. Binary, package, service, and config-path names are unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 10:54:01 +03:00
rob thijssen	60f5598542	build(neuron): bump cudarc fork to 63327a2 (idempotent abort + Comm Send+Sync). The fork's new commit makes `Comm: Send + Sync` (asserting NCCL's thread-safety invariant upstream) and makes `Comm::abort` idempotent via an `aborted` flag (so abort-then-Drop can't double-free). This is strictly better than the previous Drop-no-panic workaround, and the `abort()` signature is unchanged, so the watchdog call site is unaffected. Because `Comm` is now `Send + Sync`, `Arc<Comm>` and the `SendComm` / `NcclState` wrappers auto-derive `Send`/`Sync`, which conflicts (E0119) with neuron's manual `unsafe impl`s. Remove the four now-redundant impls; the safety assertion lives upstream in cudarc, where it belongs. The conflict is in cuda-gated code, so only the CUDA type-check catches it (non-cuda build + clippy + tests stay green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:33:14 +03:00
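A minimal Rust sketch of the idempotent-abort pattern described above, assuming a stand-in `Comm` type rather than cudarc's real one: an `aborted` flag flipped with an atomic `swap` lets `abort()` and `Drop` share one code path, so the underlying abort (simulated here by a counter) runs at most once. Because every field is `Send + Sync`, the type, and any `Arc` or wrapper around it, derives both traits automatically.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in for an NCCL communicator (not cudarc's type). `raw_aborts`
/// counts how often the underlying, non-idempotent abort would have run.
pub struct Comm {
    aborted: AtomicBool,
    raw_aborts: Arc<AtomicUsize>,
}

impl Comm {
    pub fn new(raw_aborts: Arc<AtomicUsize>) -> Self {
        Comm {
            aborted: AtomicBool::new(false),
            raw_aborts,
        }
    }

    /// Idempotent abort: the first caller flips the flag and performs the
    /// raw abort; every later call (another thread, or Drop) is a no-op.
    /// Returns true only for the call that actually aborted.
    pub fn abort(&self) -> bool {
        if self.aborted.swap(true, Ordering::AcqRel) {
            return false;
        }
        // Placeholder for the real FFI call (ncclCommAbort in the commit).
        self.raw_aborts.fetch_add(1, Ordering::SeqCst);
        true
    }
}

impl Drop for Comm {
    /// Teardown goes through the same flag, so abort-then-Drop cannot
    /// release the communicator twice.
    fn drop(&mut self) {
        self.abort();
    }
}

/// Compiles only if `T: Send + Sync`; `Comm` qualifies automatically
/// because both of its fields do.
pub fn assert_send_sync<T: Send + Sync>() {}
```

In use, an abort from another thread wins, a second `abort()` returns `false`, and dropping the last `Arc` does not abort again.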
rob thijssen	7945240646	chore: re-trigger deploy (#17 Stage 2, attempt 3). No code change. On each deploy run, the degraded CI runner quickly kills a different single-arch build (blackwell, then ada), so the all-arch-gated packaging is skipped and nothing publishes. Every arch HAS built green across runs (blackwell ✅ in run 342, ampere ✅, ada ✅ in run 339), and the gate + CUDA type-check pass. Re-running to catch all three green in one run so the Stage-2 RPMs publish. Runner FS/cache health is the real fix (separate infra work). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 15:06:04 +03:00
rob thijssen	0c74d89d15	chore: re-trigger deploy (#17 Stage 2). No code change. The `c94a2ae` deploy's neuron-blackwell build died ~12min into the Blackwell kernel compile on the degraded runner, while neuron-ampere + neuron-ada built the identical Rust + patched cudarc cleanly and the CUDA type-check passed. Transient infra; re-running to get a healthy blackwell build so the RPMs publish and beast (Blackwell) picks it up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 14:45:16 +03:00
rob thijssen	c94a2ae755	fix(neuron): correct nccl_state path on WorkerPool.leader_comm (#17 S2). `super::nccl_state` from tp/mod.rs resolves to `crate::harness::nccl_state` (nonexistent); the module is the child `nccl_state` (cf. the existing `nccl_state::generate_comm_id_hex` call). The field is cuda-gated, so the non-cuda build couldn't catch it, and the branch CUDA type-check flaked on the runner before compiling. Self-audited fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 14:21:43 +03:00
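A small sketch of the path-resolution issue, using a hypothetical module tree shaped as the commit describes (`harness` containing `tp`, which declares the child module `nccl_state`). Inside `tp`, a bare `nccl_state::generate_comm_id_hex` path names the child module, while `super::nccl_state` would look for a sibling of `tp` under `harness`. The function body is a stub; its return value is a placeholder.

```rust
// Hypothetical layout: `harness` contains `tp`, and `tp` (tp/mod.rs in the
// real tree) declares the child module `nccl_state`. Bodies are stubs.
pub mod harness {
    pub mod tp {
        pub mod nccl_state {
            /// Placeholder; the real function's output is not shown in the commit.
            pub fn generate_comm_id_hex() -> String {
                String::from("placeholder-comm-id")
            }
        }

        pub fn leader_comm_id() -> String {
            // A relative path starts in the current module, so this resolves to
            // crate::harness::tp::nccl_state (equivalently `self::nccl_state`).
            // `super::nccl_state` would instead mean crate::harness::nccl_state,
            // which does not exist in this tree, so it would fail to compile.
            nccl_state::generate_comm_id_hex()
        }
    }
}
```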
rob thijssen	99920dd322	feat(neuron): TP step watchdog aborts wedged collectives (#17 Stage 2). Make a hung NCCL collective recoverable instead of a permanent brick. Today a wedged collective hangs the in-process leader thread forever, and even Stage 1's recovery can't help: its unload's DropTp queues behind the stuck thread and hangs too. - Cache the leader's NCCL Comm handle async-side at init (new cuda-gated Job::GetLeaderComm → DeviceWorkerHandle::get_leader_comm → stored on WorkerPool.leader_comm). It is fetched while the thread is responsive, because a wedged thread can't service the fetch; that is why it's cached up front. - Wrap the leader forward in both generate_step and generate_step_with_images in tokio::time::timeout (default 120s, set via NEURON_TP_STEP_TIMEOUT_S). On expiry the watchdog calls Comm::abort() (ncclCommAbort) on the cached handle from the async thread (the one NCCL op sanctioned concurrently with an in-flight collective), which unblocks the leader thread, then fails the step WITHOUT draining (workers are wedged too; recovery's unload kills them). The error is a device fault → poison → Stage 1 auto-recovery, which now completes because the leader thread is responsive again. - Bumps the cudarc patch to dbc425a (adds the Drop-must-not-panic fix so the post-abort comm teardown during recovery doesn't double-abort-panic). Logs the whole sequence at ERROR with greppable `tp watchdog:` / `ncclCommAbort` markers so a real-world hang leaves a forensic trail; verification is by inspecting journals after real hangs, not a synthetic harness. cuda-gated → validated by the blackwell build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 14:15:29 +03:00
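The watchdog pattern can be sketched with the standard library alone. This is not neuron's code: it swaps tokio's `timeout` for `std::sync::mpsc::Receiver::recv_timeout`, and `abort` is a caller-supplied closure standing in for `Comm::abort()` on the cached leader handle. On expiry it logs with the `tp watchdog:` marker, aborts, and fails the step without draining. Treating a missing, unparsable, or zero `NEURON_TP_STEP_TIMEOUT_S` as the 120 s default is an assumption.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Default step timeout named in the commit.
const DEFAULT_STEP_TIMEOUT_S: u64 = 120;

/// Parse a raw NEURON_TP_STEP_TIMEOUT_S value. Treating missing, unparsable,
/// or zero values as the default is an assumption of this sketch.
pub fn step_timeout_from(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_STEP_TIMEOUT_S);
    Duration::from_secs(secs)
}

/// Read the timeout from the process environment.
pub fn step_timeout() -> Duration {
    step_timeout_from(std::env::var("NEURON_TP_STEP_TIMEOUT_S").ok().as_deref())
}

#[derive(Debug, PartialEq)]
pub enum StepError {
    /// The watchdog fired and aborted the communicator.
    WatchdogFired,
    /// The leader dropped its sender without producing a result.
    LeaderGone,
}

/// Wait for the leader's step result. On timeout, call `abort` (standing in
/// for Comm::abort on the cached handle) and fail without draining workers.
pub fn await_step<T>(
    rx: &mpsc::Receiver<T>,
    timeout: Duration,
    abort: impl FnOnce(),
) -> Result<T, StepError> {
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(RecvTimeoutError::Timeout) => {
            eprintln!("tp watchdog: step exceeded {timeout:?}, calling ncclCommAbort");
            abort();
            Err(StepError::WatchdogFired)
        }
        Err(RecvTimeoutError::Disconnected) => Err(StepError::LeaderGone),
    }
}
```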
rob thijssen	c4f239ceb9	build(neuron): patch cudarc to expose Comm::abort/get_async_error (#17 Stage 2). #17 Stage 2 (TP hang-recovery) needs to call ncclCommAbort on a LIVE communicator from another thread, to unblock a collective wedged on a dead/hung peer so the ranks can resync. No cudarc release (nor main) exposes this: the safe Comm only aborts in Drop, which can't fire while a stuck thread holds an Arc<Comm> clone. Pin neuron's cudarc 0.19.7 to a fork (grenade/cudarc @ nccl-comm-abort, rev 4dff0be) that adds three thin methods (Comm::abort, get_async_error, and a raw comm() accessor), to be submitted upstream. The patch targets 0.19.x only; candle's transitive cudarc 0.17.8 stays on crates.io. Foundation only; the watchdog + abort + comm-rebuild that consume these land in follow-up commits (cuda-gated → validated by the blackwell build). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 13:49:59 +03:00
rob thijssen	ac445c1569	chore: re-trigger deploy (#17 Stage 1). No code change. The `abc6e60` deploy's neuron-ada build died on the degraded CI runner (container dropped mid-checkout), skipping the gated publish, even though neuron-blackwell + neuron-ampere compiled the Stage-1 fault-recovery code cleanly. Re-running to get a healthy ada build so the RPMs publish and beast picks up the build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 09:34:20 +03:00
rob thijssen	abc6e605b8	test(neuron): NEURON_DEBUG_POISON hook to verify auto-recovery (#17). One-shot, env-gated fault injector for beast verification: when NEURON_DEBUG_POISON names a model, the first request for it triggers the auto-recovery path as if a device fault had occurred, exercising unload→reload→healthy without corrupting the GPU. Latched so it fires exactly once (no recovery loop). No-op unless the env var is set; wired into both the single-GPU and TP chat poison gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 09:08:40 +03:00
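A sketch of a one-shot, env-gated latch of the kind described above, with hypothetical names (`DebugPoison`, `should_poison`). An atomic `swap` makes the first matching request the only one that fires, even if several arrive concurrently, and an unset or empty variable disables the hook.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// One-shot fault injector (hypothetical names). `target` is the model id
/// from NEURON_DEBUG_POISON; an unset or empty variable disables the hook.
pub struct DebugPoison {
    target: Option<String>,
    fired: AtomicBool,
}

impl DebugPoison {
    pub fn new(target: Option<String>) -> Self {
        DebugPoison {
            target: target.filter(|t| !t.is_empty()),
            fired: AtomicBool::new(false),
        }
    }

    pub fn from_env() -> Self {
        Self::new(std::env::var("NEURON_DEBUG_POISON").ok())
    }

    /// True exactly once: for the first request naming the target model.
    /// The swap latches the flag, so concurrent callers cannot both fire.
    pub fn should_poison(&self, model_id: &str) -> bool {
        match &self.target {
            Some(t) if t.as_str() == model_id => !self.fired.swap(true, Ordering::AcqRel),
            _ => false,
        }
    }
}
```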
rob thijssen	4f2957af9e	feat(neuron): auto-recover poisoned models (#17 Stage 1c). When an inference hit a device fault, the model was flagged poisoned and every subsequent request was rejected with "unload and reload the model to recover" until a human did exactly that. Now the harness rebuilds the context automatically. - Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+ `LoadedHandle::spec()`) so a poisoned model can be reloaded without an operator reconstructing the spec. - A background recovery task (held via `Weak<CandleHarness>`, spawned in `new()` when a runtime is present) drains poisoned model ids and runs `unload_model` → `load_model(spec)`. Unload drops the model → cudarc `Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL init + sanity inside the load path, so a successful reload yields a fresh, healthy model. A failed reload leaves it unloaded (the next load retries), never poisoned forever. - The request-entry poison gates now `trigger_recovery` (single-flight per model via a `recovering` set) and return a transient "recovering, retry shortly" error instead of the manual-reload message. Requests that arrive during the brief reload gap (model absent from the registry) also get "recovering" rather than a misleading "not loaded". `new()` now returns `Arc<Self>`. Recovery runs only on the background task, never inline on the request path, which holds `inference_lock` and would deadlock on the `models` write lock. Stage 1c of the #17 plan (verified-healthy auto-recovery). Watchdog (1b) + a fault-injection hook for beast verification follow. A context fault on the in-process rank-0 leader itself still needs a reload that can rebind it (Stage 3); comm-desync + worker faults recover here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 09:05:02 +03:00
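The single-flight gate can be sketched as a mutex-guarded set plus a channel to the background task. Apart from the `recovering` set, `trigger_recovery`, and the error wording taken from the commit, everything here (`Recovery`, `create`, `poison_gate`, `finish`) is hypothetical: the first poisoned request for a model queues one recovery, later ones only get the transient error, and the request path never reloads inline.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical single-flight recovery gate. The receiver returned by
/// `create` would be drained by a background task that runs unload -> load.
pub struct Recovery {
    recovering: Mutex<HashSet<String>>,
    queue: Sender<String>,
}

impl Recovery {
    pub fn create() -> (Self, Receiver<String>) {
        let (queue, rx) = channel();
        let gate = Recovery {
            recovering: Mutex::new(HashSet::new()),
            queue,
        };
        (gate, rx)
    }

    /// Queue a recovery unless one is already in flight for this model.
    /// Returns true if this call queued it.
    pub fn trigger_recovery(&self, model_id: &str) -> bool {
        let newly_added = self.recovering.lock().unwrap().insert(model_id.to_string());
        if newly_added {
            // A send error means the background task is gone; nothing to notify.
            let _ = self.queue.send(model_id.to_string());
        }
        newly_added
    }

    /// Request-path gate for a poisoned model: never reloads inline, just
    /// triggers recovery and returns a transient error.
    pub fn poison_gate(&self, model_id: &str) -> Result<(), String> {
        self.trigger_recovery(model_id);
        Err(format!("model {model_id} is recovering, retry shortly"))
    }

    /// Called by the background task when unload -> load finishes (either
    /// way), so a later fault can trigger a fresh recovery.
    pub fn finish(&self, model_id: &str) {
        self.recovering.lock().unwrap().remove(model_id);
    }
}
```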
rob thijssen	75cd088b61	fix(neuron): cap vision max_pixels to the pos_embed patch budget (#14). Beast testing surfaced a real regression in the dynamic-resolution default: a tall 808×1600 image resized (within the 1024² max_pixels) to a 90×44 patch grid = 3960 patches, exceeding the vision tower's hard `num_position_embeddings = 2304` pos-embed budget. The per-rank `patch count 3960 exceeds pos_embed budget 2304` error fired mid-TP-forward and poisoned the device context, bricking the model until reload. Hard-cap `max_pixels` to `2304 × 16² = 589_824` px (≤ 2304 patches → ≤ 576 LM tokens), clamping even the operator env override. `smart_resize` floors the pixel count under the cap, so no resized image can ever exceed the budget: the tower check never fires, and nothing is poisoned. The pos-embed grid (48×48) is the resolution Qwen3.6 was trained at, so the cap is principled, not just defensive. It is still ~3× the old fixed 196 tokens, and the book-cover OCR test (1176 patches) already reads the full title+subtitle. Test: a huge/tall/wide/extreme image battery stays within the 2304 patch budget. (Per-rank-error poison robustness itself remains issue #17.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 23:30:47 +03:00
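The cap arithmetic can be checked in a few lines. The sketch assumes 16×16-pixel patches and a 2×2 merge into LM tokens (consistent with the `2304 × 16²` cap and the factor-32 grid described in these commits); the 1440×704 size is the one implied by the reported 90×44 patch grid. Constant and function names are illustrative.

```rust
/// Assumed geometry: 16×16-pixel patches, a 48×48 = 2304-entry learned
/// position grid, and a 2×2 spatial merge into LM tokens.
pub const PATCH: u64 = 16;
pub const POS_EMBED_BUDGET: u64 = 2304;
pub const MERGE: u64 = 2;

/// Hard pixel cap: at most POS_EMBED_BUDGET patches of PATCH² pixels each.
pub const MAX_PIXELS_CAP: u64 = POS_EMBED_BUDGET * PATCH * PATCH;

/// Clamp any configured max_pixels (default or operator override) to the cap.
pub fn effective_max_pixels(configured: u64) -> u64 {
    configured.min(MAX_PIXELS_CAP)
}

/// Vision-tower patch count for a resized image (sides are multiples of PATCH).
pub fn patch_count(h: u64, w: u64) -> u64 {
    (h / PATCH) * (w / PATCH)
}

/// LM tokens after the 2×2 merge.
pub fn lm_tokens(h: u64, w: u64) -> u64 {
    patch_count(h, w) / (MERGE * MERGE)
}
```

For example, the regressing 1440×704 resize gives 90 × 44 = 3960 patches, over the 2304 budget, while a 768×768 image (exactly 48×48 patches) lands on the 576-token ceiling.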
rob thijssen	d311c8ca7a	feat(neuron): operator pixel-budget env override + doc cleanup (#14 C5). - PreprocessProfile::qwen3_6() reads NEURON_VISION_MIN_PIXELS / NEURON_VISION_MAX_PIXELS (clamped to factor² ≤ min ≤ max), matching the NEURON_VISION_LEGACY_* / NEURON_MROPE knob convention. Defaults remain 256²…1024² (64…1024 LM tokens/image). - Test: a max-resolution source caps within the token budget (can't exceed NEURON_MAX_PROMPT_TOKENS). - Strip stale fixed-resolution / "MRoPE gap (#15)" / 14×14 language from the preprocess, mod, and rope doc-comments now that resolution is dynamic and M-RoPE is implemented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 22:50:03 +03:00
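A sketch of the override clamp, with hypothetical helper names (`parse_or`, `pixel_budget`, `lm_tokens_for`): parse the two optional values, fall back to the 256² / 1024² defaults, then enforce factor² ≤ min ≤ max. Treating an unparsable value as unset is an assumption; with factor 32 the defaults map to 64…1024 LM tokens per image.

```rust
/// Hypothetical helper: parse an optional override, falling back to `default`
/// when the value is unset or not a non-negative integer (an assumption).
fn parse_or(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok()).unwrap_or(default)
}

/// Pixel budget clamped to factor² ≤ min ≤ max, from the raw values of
/// NEURON_VISION_MIN_PIXELS / NEURON_VISION_MAX_PIXELS.
pub fn pixel_budget(factor: u64, min_raw: Option<&str>, max_raw: Option<&str>) -> (u64, u64) {
    let min = parse_or(min_raw, 256 * 256).max(factor * factor);
    let max = parse_or(max_raw, 1024 * 1024).max(min);
    (min, max)
}

/// Upper bound on LM tokens per image: one token per factor × factor block.
pub fn lm_tokens_for(pixels: u64, factor: u64) -> u64 {
    pixels / (factor * factor)
}
```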
rob thijssen	c97a8654f5	feat(neuron): dynamic-resolution images via Qwen smart_resize (#14). Replace the fixed 448×448-square preprocess with native-aspect `smart_resize`, and thread the resulting per-image grid through the LM so spatial structure survives non-square images (documents, screenshots, charts, panoramas, OCR) instead of being squished into a square. - preprocess.rs: port Qwen `smart_resize` (factor = patch×merge = 32; pixel budget [min,max], default 256²–1024² → 64–1024 LM tokens). `PreprocessProfile` drops the fixed target dims for `factor`/`min_pixels`/`max_pixels`; `preprocess`/`preprocess_data_uri` now return the resized `(h, w)`; add `resized_dims_for_uri` (decode + resize, no normalize) for the TP leader's token count. - rope.rs: `compute_mrope_index`/`get_rope_index` take per-image `grids: &[(lm_gh, lm_gw)]` instead of assuming a square `isqrt(run)`. Walk image runs in order, validate `run == gh*gw`, emit row-major positions, and resume the shared counter at `base + max(gh,gw)`. Correct for multiple images of differing grids interleaved with text. - candle.rs: `VisionMeta`/`LoadedModel`/`TpLoadedModel` carry the `image_grid_factor` (patch×merge) instead of the constant 196; all four prompt-build sites compute per-image counts from each image's resized grid (single-GPU from the extracted `ImageInput.h/w`, TP from `resized_dims_for_uri`). `ModelArch` gains `vision_grid_factor`. - single-GPU (`mod.rs`, `dispatch.rs`) and TP (`tp_qwen3_5.rs::prefill_with_images_chunked`, `dispatch.rs`, `tp/worker.rs`) thread the grids into `get_rope_index`. Each TP rank recomputes grids from its own deterministic preprocess, so there is no rpc.rs change and a single source of truth. The vision tower itself was already grid-general (recent pos-embed interpolation + 2D rotary fix). No patch-count cap: pos-embed is interpolated to any grid; `max_pixels` bounds cost (O(patches²) ViT attention + prefill) instead. Tests: smart_resize (aspect/cap/floor/reject), `compute_mrope_index` non-square + two-image + mismatch cases, square-grid regression guard. Non-cuda build + clippy + full workspace tests green; TP load/dispatch paths are cuda-gated → Gitea CUDA type-check. Operator pixel-budget config + remaining doc cleanup follow in C5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 22:47:27 +03:00
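A Rust sketch of a `smart_resize` port that follows the upstream Qwen2-VL image processor's logic, not neuron's preprocess.rs: snap each side to a multiple of `factor`, then, if the area falls outside `[min_pixels, max_pixels]`, rescale by the aspect-preserving factor β and floor (or ceil) back onto the grid. The 200:1 aspect limit mirrors the Python reference and is an assumption here, and Rust's `round` breaks .5 ties away from zero where Python rounds to even.

```rust
/// Sketch of a Qwen-style smart_resize. Returns the resized (height, width),
/// both multiples of `factor`, or None for empty or extreme-aspect inputs.
pub fn smart_resize(
    height: u32,
    width: u32,
    factor: u32,
    min_pixels: u64,
    max_pixels: u64,
) -> Option<(u32, u32)> {
    if height == 0 || width == 0 {
        return None;
    }
    let (h, w, f) = (height as f64, width as f64, factor as f64);
    // Reject extreme aspect ratios (200:1 mirrors the Python reference).
    if h.max(w) / h.min(w) > 200.0 {
        return None;
    }
    // Snap each side to the nearest multiple of `factor` (at least one block).
    let snap = |x: f64| ((x / f).round() as u32).max(1) * factor;
    let (mut h_bar, mut w_bar) = (snap(h), snap(w));
    let area = h_bar as u64 * w_bar as u64;
    if area > max_pixels {
        // Shrink by beta, preserving aspect; flooring keeps the area <= max
        // (barring the one-block minimum).
        let beta = (h * w / max_pixels as f64).sqrt();
        h_bar = ((h / beta / f).floor() as u32).max(1) * factor;
        w_bar = ((w / beta / f).floor() as u32).max(1) * factor;
    } else if area < min_pixels {
        // Grow by beta; ceiling makes the area reach at least min.
        let beta = (min_pixels as f64 / (h * w)).sqrt();
        h_bar = (h * beta / f).ceil() as u32 * factor;
        w_bar = (w * beta / f).ceil() as u32 * factor;
    }
    Some((h_bar, w_bar))
}
```

With factor 32 and the 256²…1024² budget, a 1600×808 (height × width) input comes out at 1440×704, i.e. the 90×44 patch grid from the later regression report; with max_pixels lowered to 589_824 it comes out at 1056×544.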
rob thijssen	dc048ffcc9	fix(neuron): vision-tower 2D positions + M-RoPE default on. Two fixes to the spatial handling of images, validated against the HF transformers 4.57.1 qwen3_vl reference on beast. Vision tower (the real cause of poor spatial vision): the Stage-A tower encoded position wrong in two ways, so the model saw image content but not layout (a row of 5 people read as "a line of 23", sky inverted), regardless of the LM-side rope: - Learned pos-embed was a naive sequential lookup of the first `n_patches` rows of the 48×48 (`num_position_embeddings=2304`) grid, which is the wrong stride for a 28×28 patch grid. It now bilinearly interpolates the grid to `gh×gw` (port of HF `fast_pos_embed_interpolate`), row-major. - The 2D vision rotary was absent entirely. Added `VisionRotaryEmbedding` (θ=10000, dim=head_dim/2) applying per-patch `(row, col)` rotary to q/k in every ViT block via rope_slow, matching HF `apply_rotary_pos_emb_vision`. Both default on; `NEURON_VISION_LEGACY_POS=1` / `NEURON_VISION_LEGACY_ROPE=1` revert each for A/B (no rebuild). New unit tests: interpolation reduces to the sequential lookup at the native grid; rotary row/col structure. M-RoPE default on: the interleaved M-RoPE matches HF apply_interleaved_mrope / get_rope_index exactly and was strictly ≥ plain in A/B comparisons. `NEURON_MROPE` is now a kill switch (`=0` for plain), not opt-in; defaults should encode the model's trained behaviour, not freeze the broken state. The vision tower is plain candle (CPU-testable): built, clippy-clean, full workspace tests green locally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 20:53:07 +03:00
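A sketch of the bilinear pos-embed resampling described above, under stated assumptions: the learned table is an n×n row-major grid of d-dimensional vectors, and sample points are corner-aligned (`linspace(0, n-1, g)` per axis). At the native grid (g = n) every sample lands on an integer index, so the result reduces exactly to the sequential lookup, which is the property the commit's unit test checks. The function name is illustrative.

```rust
/// Resample an n×n row-major grid of d-dim embeddings to gh×gw by bilinear
/// interpolation, sampling at linspace(0, n-1, g) on each axis.
pub fn interpolate_pos_embed(table: &[Vec<f32>], n: usize, gh: usize, gw: usize) -> Vec<Vec<f32>> {
    assert!(n > 0 && table.len() == n * n, "table must hold n*n rows");
    // Corner-aligned sample coordinates along one axis.
    let coords = |g: usize| -> Vec<f32> {
        if g == 1 {
            vec![0.0]
        } else {
            (0..g).map(|i| (i * (n - 1)) as f32 / (g - 1) as f32).collect()
        }
    };
    let (ys, xs) = (coords(gh), coords(gw));
    let d = table.first().map_or(0, Vec::len);
    let mut out = Vec::with_capacity(gh * gw);
    for &y in &ys {
        let (y0, dy) = (y.floor() as usize, y - y.floor());
        let y1 = (y0 + 1).min(n - 1);
        for &x in &xs {
            let (x0, dx) = (x.floor() as usize, x - x.floor());
            let x1 = (x0 + 1).min(n - 1);
            // Four neighbours and their bilinear weights.
            let taps = [
                ((1.0 - dy) * (1.0 - dx), y0 * n + x0),
                ((1.0 - dy) * dx, y0 * n + x1),
                (dy * (1.0 - dx), y1 * n + x0),
                (dy * dx, y1 * n + x1),
            ];
            let mut v = vec![0.0f32; d];
            for (weight, idx) in taps {
                for (acc, &t) in v.iter_mut().zip(&table[idx]) {
                    *acc += weight * t;
                }
            }
            out.push(v);
        }
    }
    out
}
```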
rob thijssen	7ebcfba5ca	fix(neuron): gate M-RoPE behind NEURON_MROPE (default off). On beast the interleaved M-RoPE degraded image understanding rather than fixing it: the model misread spatial layout (a horizontal row of people described as a "diagonal receding line"), got attributes wrong, and rambled. A "how many people" follow-up generated 4459 tokens over 3.5 minutes, past agent-0's HTTP timeout (the "fails to respond without an error" symptom). The interleave is evidently not numerically correct, and it can't be validated remotely without a transformers reference. Gate it: `get_rope_index` now returns plain sequential identity positions unless NEURON_MROPE is truthy, so mrope_cos_sin reduces to plain RoPE and image tokens behave exactly as pre-M-RoPE (content recognition works; spatial layout approximate; no rambling). The real computation moves to `compute_mrope_index` (still unit-tested). Default off restores the working vision and unblocks agent-0; the M-RoPE code stays in place to debug + validate before flipping the default on. Pure non-cuda change (rope.rs); both single-GPU and TP forwards call the gated get_rope_index unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 19:32:59 +03:00
rob thijssen	825bf4e905	feat(neuron): M-RoPE Stage 4: wire interleaved M-RoPE into the TP path. Mirror Stage 3 into the tensor-parallel Qwen3.6 model: - TpQwen3_5Attention / DecoderLayer take (cos, sin) instead of a scalar offset and apply via apply_cos_sin. - TpQwen3_5Model gains the replicated rotary + rope_delta (reset in clear_kv_cache, settable). forward_inner builds the cos/sin once: interleaved M-RoPE from explicit position_ids (vision), or plain at offset+rope_delta (text/decode). forward() and forward_with_positions() delegate; the old single-shot forward_with_vision is gone. - prefill_with_images_chunked now computes get_rope_index over the whole prompt once, stores rope_delta on the base model, and slices the (3, prompt_len) position tensor per chunk, so every rank assigns image tokens their 14×14 grid coordinates and steps in lockstep (every chunk, text or image, carries the M-RoPE slice because the image shifts the surrounding text positions). Also build the position-id tensor as f32 directly (positions are small integers, exact in f32) to avoid an i64→f32 cast on the GPU. The TP forward is cuda-gated; the CI CUDA type-check is the compile gate. Non-cuda build + clippy + full workspace tests green; rope math + the plain-RoPE-reduction invariant covered by unit tests. Completes the interleaved-M-RoPE work for the vision spatial misread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:46:27 +03:00
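A sketch of the per-chunk slicing with host-side f32 conversion, using a hypothetical `position_chunks` helper on plain vectors rather than device tensors. Every integer up to 2^24 is exactly representable in f32, so small position ids survive the conversion unchanged; the first integer that does not is 2^24 + 1.

```rust
/// Split a (3, prompt_len) position table into per-chunk [t, h, w] slices,
/// converting ids to f32 on the host. Hypothetical helper, not neuron's API.
pub fn position_chunks(pos: &[Vec<u32>; 3], chunk: usize) -> Vec<[Vec<f32>; 3]> {
    assert!(chunk > 0, "chunk size must be positive");
    let len = pos[0].len();
    assert!(pos.iter().all(|axis| axis.len() == len), "axes must have equal length");
    (0..len)
        .step_by(chunk)
        .map(|start| {
            let end = (start + chunk).min(len);
            // Small non-negative integers (anything up to 2^24) convert to f32 exactly.
            [0usize, 1, 2].map(|axis| {
                pos[axis][start..end]
                    .iter()
                    .map(|&p| p as f32)
                    .collect::<Vec<f32>>()
            })
        })
        .collect()
}
```

Every chunk gets its own [t, h, w] slice, whether it covers text or image tokens, which matches the commit's point that text after an image is shifted too.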
rob thijssen	4c12c7e2f0	feat(neuron): M-RoPE Stage 3 — wire interleaved M-RoPE into single-GPU Qwen3_5Model now builds the rotary cos/sin once per forward and threads (cos, sin) through the decoder → full-attention → rope, replacing the scalar offset that reached RotaryEmbedding: - vision forward computes get_rope_index over the (single-shot) prompt, sets rope_delta, and builds interleaved-M-RoPE cos/sin so image tokens carry their 14×14 grid (height/width) positions; - text / decode take plain_cos_sin at offset + rope_delta — with rope_delta == 0 (no image) this is bit-for-bit the old plain RoPE, and the device→host id copy is skipped on the text decode hot path. rope_delta is stored on the model and reset in clear_kv_cache, so decode after a vision prefill resumes text positions from the image-compressed counter. decoder.rs / full_attn.rs take (cos, sin) instead of offset; linear-attention layers are unchanged (no RoPE). The TP path still uses the retained apply(offset) — wired in Stage 4. Full workspace tests green; the load-bearing invariant (M-RoPE == plain for equal axes) keeps text unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:39:52 +03:00
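The text/decode position rule can be sketched as a small state holder: offset plus rope_delta, with rope_delta reset in clear_kv_cache. `RopeState` and its methods are hypothetical names, not the model's actual fields.

```rust
/// Hypothetical holder for the per-sequence M-RoPE shift kept on the model.
#[derive(Debug, Default)]
pub struct RopeState {
    /// final_counter - seq_len from the last vision prefill; 0 for text-only.
    pub rope_delta: i64,
}

impl RopeState {
    /// Record the delta computed during a vision prefill.
    pub fn set_rope_delta(&mut self, delta: i64) {
        self.rope_delta = delta;
    }

    /// Plain-RoPE position for a text/decode step at KV offset `offset`.
    /// With rope_delta == 0 this is exactly the old plain-RoPE position.
    pub fn text_position(&self, offset: usize) -> i64 {
        offset as i64 + self.rope_delta
    }

    /// A new sequence must not inherit the previous image's shift.
    pub fn clear_kv_cache(&mut self) {
        self.rope_delta = 0;
    }
}
```

As a worked case, an image-only prefill of 196 tokens (a 14×14 grid) leaves the counter at 14. That gives rope_delta = 14 - 196 = -182, so the first decode step at offset 196 lands on position 14.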
rob thijssen	ba1b5ba408	feat(neuron): M-RoPE Stage 2 — get_rope_index position-id helper Pure function computing the interleaved-M-RoPE 3D position ids for a prompt with image-placeholder runs, plus the decode rope_delta: text tokens advance a single counter (all axes equal); each image run gets [base+t, base+h, base+w] row-major over a square grid_t=1, grid_h=grid_w=isqrt(run) (196 → 14×14); the counter resumes from base + max(grid). rope_delta = final_counter - seq_len lets decode resume text positions after the position-compressed image blocks. Plus mrope_position_tensor to build the (3, seq) tensor. Unit tests: text-only is sequential (delta 0); text+image+text matches hand-computed grid ids + resume + delta; 196 → 14×14; non-square run rejected; end-to-end through mrope_cos_sin tracks the height axis. #[allow(dead_code)] until Stage 3/4 wire it into the forward. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:34:28 +03:00
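The position-id rule described above can be sketched in std-only Rust. The function name follows the commit, but the signature, the `Vec<[i64; 3]>` output, and the `String` error are assumptions made for illustration. The real helper returns tensors and may differ.

```rust
/// Exact integer square root of `run`, or None when the run is not a perfect square.
fn square_side(run: usize) -> Option<usize> {
    let side = (run as f64).sqrt().round() as usize;
    (side * side == run).then_some(side)
}

/// Returns one [t, h, w] position per token plus the decode rope_delta.
pub fn get_rope_index(tokens: &[u32], image_token_id: u32) -> Result<(Vec<[i64; 3]>, i64), String> {
    let mut positions = Vec::with_capacity(tokens.len());
    let mut counter: i64 = 0;
    let mut i = 0;
    while i < tokens.len() {
        if tokens[i] != image_token_id {
            // Text token: a single counter drives all three axes.
            positions.push([counter; 3]);
            counter += 1;
            i += 1;
            continue;
        }
        // Measure the contiguous run of image placeholders starting at i.
        let start = i;
        while i < tokens.len() && tokens[i] == image_token_id {
            i += 1;
        }
        let run = i - start;
        let side = square_side(run)
            .ok_or_else(|| format!("image run of {run} tokens is not a square grid"))?;
        // grid_t = 1, grid_h = grid_w = side; ids are row-major over (h, w).
        let base = counter;
        for h in 0..side as i64 {
            for w in 0..side as i64 {
                positions.push([base, base + h, base + w]);
            }
        }
        // Resume from base + max(grid_t, grid_h, grid_w); side >= 1 = grid_t here.
        counter = base + side as i64;
    }
    // Negative when images compress positions; decode adds it to the KV offset.
    let rope_delta = counter - tokens.len() as i64;
    Ok((positions, rope_delta))
}
```

Consider the prompt `[1, 2, 9, 9, 9, 9, 3]` with image token 9. The two text tokens take positions 0 and 1. The 2×2 image starts at base 2, and the trailing text resumes at 4, which gives rope_delta = 5 - 7 = -2.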
rob thijssen	5731f4c318	feat(neuron): M-RoPE Stage 1 — interleaved rope machinery + config Parse + store mrope_section / mrope_interleaved in RopeParameters (previously accepted-but-ignored). RotaryEmbedding gains: - inv_freq + per-axis column masks (mask_t/h/w) built from mrope_section; - plain_cos_sin(pos, seq_len): narrow the precomputed tables (text/decode); - mrope_cos_sin(position_ids (3,seq)): per-axis freqs blended at the interleave columns (vision); - apply_cos_sin(q,k,cos,sin): the rope_slow application, factored out. The existing apply(q,k,offset) is retained (delegates to plain_cos_sin + apply_cos_sin) so current callers are unchanged; Stages 3–4 move cos/sin construction into the model forward and thread the 3D position ids for image tokens. Tests: masks partition the half-dim; interleave drives the right axis per column; and the load-bearing invariant — mrope_cos_sin reduces bit-for-bit to plain_cos_sin when the three axes are equal (so text inference is unchanged). Refs the MRoPE-gap diagnosis (vision spatial misread). Pure non-cuda; no behaviour change until wired. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:31:15 +03:00
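A std-only sketch of the column-to-axis assignment and the load-bearing invariant follows. It assumes a Qwen3-VL-style interleave: column i goes to height when i % 3 == 1 and i < 3 * section_h, to width when i % 3 == 2 and i < 3 * section_w, and to temporal otherwise. `axis_for_column`, `mrope_angles`, and the [24, 20, 20] section over a 128-wide head used in the test are assumptions, not values read from the helexa config.

```rust
/// Which position axis (0 = t, 1 = h, 2 = w) drives rotary column `i`
/// under an interleaved layout; `section` holds [t, h, w] column counts.
pub fn axis_for_column(i: usize, section: [usize; 3]) -> usize {
    if i % 3 == 1 && i < 3 * section[1] {
        1
    } else if i % 3 == 2 && i < 3 * section[2] {
        2
    } else {
        0
    }
}

/// Standard RoPE inverse frequencies: head_dim / 2 columns.
pub fn inv_freq(head_dim: usize, theta: f32) -> Vec<f32> {
    (0..head_dim / 2)
        .map(|i| 1.0 / theta.powf(2.0 * i as f32 / head_dim as f32))
        .collect()
}

/// Per-column rotation angle for one token with positions [t, h, w].
pub fn mrope_angles(pos: [f32; 3], inv: &[f32], section: [usize; 3]) -> Vec<f32> {
    inv.iter()
        .enumerate()
        .map(|(i, f)| pos[axis_for_column(i, section)] * f)
        .collect()
}

/// Plain RoPE angles for a single scalar position.
pub fn plain_angles(p: f32, inv: &[f32]) -> Vec<f32> {
    inv.iter().map(|f| p * f).collect()
}
```

The invariant falls out of the construction. Each column only selects which axis supplies the position, so when all three axes carry the same value every angle is computed by the identical multiplication, and the result is bit-for-bit plain RoPE.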
rob thijssen	fa013505d1	fix(neuron): chunked TP-vision prefill + pre-flight VRAM guard agent-0 sent a ~13k-token prompt + image; the TP vision prefill was single-shot, so it tried to materialise activations for all 12,960 positions at once and OOM'd rank 1 mid-forward. Rank 1 died before issuing its row-parallel AllReduce, stranding rank 0 on the collective (it hung holding the pool lock). The text path survives the same size because it chunks the prefill.
Chunk the vision prefill the same way: - TpQwen3_5ForCausalLM::prefill_with_images_chunked encodes the image(s) once, then walks the pre-expanded prompt in prefill_chunk_tokens() windows, splicing the patch-embedding rows into whichever chunk(s) carry <|image_pad|> positions (pure-text chunks take the plain forward). Activation is bounded by the chunk, not the prompt. - Every rank runs the identical chunk sequence (chunk_size threaded through GenerateStepWithImages / TpForwardLogitsWithImages / generate_step_with_images), so the per-chunk AllReduces stay paired across ranks with no extra sync — the KV cache accumulates via the growing offset, only the last chunk's logits are kept. Pre-flight guard (validate_vision_prefill): even chunked, a long prompt's KV cache can exhaust VRAM mid-forward, and on TP that hangs the collective. Reject up front with a clean InsufficientVram when the estimated footprint exceeds free VRAM, so a doomed request fails fast instead of hanging the daemon. Heuristic + tunable (NEURON_VISION_PREFILL_MB_PER_1K_TOKENS / _BASE_MB); default permissive so the now-working 12,960-token case still passes. Applied to every vision path (single-GPU + TP); single-GPU vision stays single-shot for now, so the guard is its protection until it's chunked too. Tests: pre-flight guard behaviour; RPC round-trip carries chunk_size. The chunked forward is cuda-gated — CI CUDA type-check validates it. Refs #16 / TP-vision. Operational note: a TP rank OOM still hangs the daemon (needs restart); making a worker failure abort the leader's collective is separate, broader TP hardening. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 17:21:36 +03:00
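The chunk walk and the pre-flight check can be sketched as follows. The linear footprint model is an assumed reading of the heuristic: a base cost plus a per-1k-token slope, compared against free VRAM. The helpers `chunk_windows` and `chunks_with_images`, the parameter list of `validate_vision_prefill`, and the error struct are all hypothetical. The real guard reads its tunables from the environment variables named above.

```rust
use std::ops::Range;

/// Half-open windows of at most `chunk` tokens covering the whole prompt.
pub fn chunk_windows(prompt_len: usize, chunk: usize) -> Vec<Range<usize>> {
    assert!(chunk > 0, "chunk size must be positive");
    (0..prompt_len)
        .step_by(chunk)
        .map(|s| s..(s + chunk).min(prompt_len))
        .collect()
}

/// Indices of windows containing at least one image-pad position; those need
/// patch-embedding rows spliced in, the rest take the plain forward.
pub fn chunks_with_images(tokens: &[u32], image_token_id: u32, chunk: usize) -> Vec<usize> {
    chunk_windows(tokens.len(), chunk)
        .into_iter()
        .enumerate()
        .filter(|(_, w)| tokens[w.clone()].contains(&image_token_id))
        .map(|(i, _)| i)
        .collect()
}

#[derive(Debug, PartialEq)]
pub struct InsufficientVram {
    pub needed_mb: u64,
    pub free_mb: u64,
}

/// Reject a prefill before any rank starts when the estimate exceeds free VRAM.
pub fn validate_vision_prefill(
    prompt_tokens: u64,
    base_mb: u64,
    mb_per_1k_tokens: u64,
    free_mb: u64,
) -> Result<(), InsufficientVram> {
    // Round up to whole thousands so a partial block is still charged.
    let needed_mb = base_mb + prompt_tokens.div_ceil(1000) * mb_per_1k_tokens;
    if needed_mb > free_mb {
        Err(InsufficientVram { needed_mb, free_mb })
    } else {
        Ok(())
    }
}
```

Every rank derives the same windows from the same prompt length and chunk size. That is what keeps the per-chunk collectives paired without extra synchronisation.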
rob thijssen	c8bcaabc38	fix(neuron): render HF chat templates via minijinja pycompat The Qwen3.6 chat_template.jinja (now loaded after the precedence fix) failed to render in minijinja: it uses Python str methods (content.startswith/endswith/split/rstrip/lstrip) and the raise_exception global that HF transformers patches into its Jinja env but minijinja doesn't provide. The render error tripped the text-only fallback, so image requests still produced zero <|image_pad|> tokens.
Wire the standard bridge into render_chat_template: - minijinja-contrib `pycompat::unknown_method_callback` supplies the Python string/list/dict methods; - a `raise_exception` global maps to a render error (so malformed inputs — e.g. an image in a system message — surface cleanly). Add the real Qwen3.6-27B chat_template.jinja (verbatim from beast's HF cache) as a test fixture and assert it renders one <|image_pad|> for a text+image turn — the end-to-end check that would have caught this before deploy. Refs #16 / TP-vision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 16:32:23 +03:00
rob thijssen	7ad56c6a86	fix(neuron): load chat_template.jinja (transformers precedence) The chat-template loader only read the `chat_template` field from tokenizer_config.json. Qwen3.6-27B ships its vision-aware template only in a standalone `chat_template.jinja` (and has no tokenizer_config.json at all), so the loader returned None and image requests fell back to the text-only format_qwen3_prompt — rendering zero `<|image_pad|>` tokens and tripping "expand_image_pad_tokens: prompt has 0 image_token_id occurrences". load_chat_template_alongside now follows HF transformers precedence: standalone chat_template.jinja → chat_template.json → the chat_template field in tokenizer_config.json. Tests cover the precedence, the text-only fallback, and that an OpenAI image_url content part renders `<|image_pad|>` through the real template condition (`'image_url' in item`). Refs #16 / TP-vision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 16:25:30 +03:00
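The lookup order can be sketched over an in-memory map from file name to contents. This is a hypothetical sketch: `pick_chat_template` and `TemplateSource` are illustrative names, and pulling the `chat_template` string out of the two `.json` sources is delegated to a caller-supplied `extract_field` closure rather than a real JSON parser.

```rust
use std::collections::HashMap;

/// Where a resolved chat template came from.
#[derive(Debug, PartialEq)]
pub enum TemplateSource {
    StandaloneJinja,
    ChatTemplateJson,
    TokenizerConfigField,
}

/// First template found, in transformers precedence order.
/// `files` maps a file name in the model directory to its contents.
pub fn pick_chat_template<F>(
    files: &HashMap<&str, String>,
    extract_field: F,
) -> Option<(TemplateSource, String)>
where
    F: Fn(&str) -> Option<String>,
{
    // 1. A standalone template file wins outright.
    if let Some(t) = files.get("chat_template.jinja") {
        return Some((TemplateSource::StandaloneJinja, t.clone()));
    }
    // 2. Then chat_template.json.
    if let Some(t) = files.get("chat_template.json").and_then(|j| extract_field(j)) {
        return Some((TemplateSource::ChatTemplateJson, t));
    }
    // 3. Finally the chat_template field in tokenizer_config.json.
    files
        .get("tokenizer_config.json")
        .and_then(|j| extract_field(j))
        .map(|t| (TemplateSource::TokenizerConfigField, t))
}
```

A `None` result corresponds to the text-only fallback path.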
rob thijssen	1b0e36c119	fix(neuron): cover TpForwardLogitsWithImages in drain_poisoned match The CUDA type-check caught a non-exhaustive match: drain_poisoned() must reply an error to every Job variant's reply channel, including the new cuda-gated TpForwardLogitsWithImages. The non-cuda build couldn't see it — the variant is #[cfg(feature = "cuda")], so the match is exhaustive without it on CPU. Refs TP-vision plan Stage 2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:26:46 +03:00
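The shape of this bug can be reproduced with std channels: a job enum with a feature-gated variant, and a drain that must answer every reply channel. `Job`, its variants, `poisoned`, and the `String` error are hypothetical stand-ins. Only the `#[cfg(feature = "cuda")]` gating mirrors the commit. Without the feature, both the variant and its match arm are compiled out, which is why only the CUDA build could report the missing arm.

```rust
use std::sync::mpsc::{Receiver, Sender};

pub type Reply<T> = Sender<Result<T, String>>;

/// Hypothetical worker jobs; each carries the channel its caller waits on.
pub enum Job {
    ForwardLogits { reply: Reply<Vec<f32>> },
    ClearCache { reply: Reply<()> },
    #[cfg(feature = "cuda")]
    TpForwardLogitsWithImages { reply: Reply<Vec<f32>> },
}

fn poisoned<T>() -> Result<T, String> {
    Err("device worker poisoned".to_string())
}

/// After the worker is poisoned, fail every queued job instead of dropping it,
/// so no caller blocks forever on a reply that never arrives.
pub fn drain_poisoned(queue: &Receiver<Job>) {
    while let Ok(job) = queue.try_recv() {
        match job {
            Job::ForwardLogits { reply } => {
                let _ = reply.send(poisoned());
            }
            Job::ClearCache { reply } => {
                let _ = reply.send(poisoned());
            }
            // Omitting this arm is an error only when `cuda` is enabled.
            #[cfg(feature = "cuda")]
            Job::TpForwardLogitsWithImages { reply } => {
                let _ = reply.send(poisoned());
            }
        }
    }
}
```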
rob thijssen	ed2d09864e	feat(neuron): TP-vision Stage 3 — wire TP chat + stream vision prefill End-to-end TP-vision: an image request to a TP-loaded Qwen3.6-27B now conditions on the image across both ranks. - TpLoadedModel carries has_vision / image_token_id / lm_tokens_per_image, populated at load via the shared VisionMeta::from_config_path (same config.json the shards loaded from; Stage 1 materialises the replicated tower on every rank). - LoadedHandle::capabilities() now advertises "vision" for TP loads with a tower (cortex-gateway already unions this into /v1/models via C3). - The TP rejection guards (chat_completion_tp + inference_tp_stream) are now conditional on !has_vision — text-only TP models still 400 cleanly, vision-capable ones fall through. - chat_completion_tp_inner and the streaming orchestration task detect images (request_has_images), expand <|image_pad|> to the per-image patch count, and run a single-shot generate_step_with_images prefill (every rank encodes + splices its replicated tower) before the unchanged decode loop. Text requests keep chunked_prefill_tp. - extract_image_data_uris ships the source data URIs to every rank for identical per-rank preprocessing. prompt_tokens now reflects the patch expansion, so usage accounting and KV offsets match the single-GPU baseline. TP entry points are cuda-gated (validated by CI's CUDA type-check); capabilities() + extract_image_data_uris + VisionMeta reuse compile on the non-cuda build.
Full workspace test green. Refs TP-vision plan Stage 3. Implements #12. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:14:44 +03:00
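The expansion step can be sketched as a token-level rewrite: each `<|image_pad|>` sentinel is replaced by its image's patch count. `expand_image_pad_tokens` is named elsewhere in this log, but the signature below, the per-image count slice, and the error text are assumptions.

```rust
/// Replace the i-th image-pad sentinel with `per_image[i]` copies of it, so the
/// expanded prompt has one position per patch-embedding row.
pub fn expand_image_pad_tokens(
    tokens: &[u32],
    image_token_id: u32,
    per_image: &[usize],
) -> Result<Vec<u32>, String> {
    let found = tokens.iter().filter(|&&t| t == image_token_id).count();
    if found != per_image.len() {
        return Err(format!(
            "prompt has {found} image_token_id occurrences, expected {}",
            per_image.len()
        ));
    }
    let extra = per_image.iter().sum::<usize>().saturating_sub(found);
    let mut out = Vec::with_capacity(tokens.len() + extra);
    let mut counts = per_image.iter();
    for &t in tokens {
        if t == image_token_id {
            let n = *counts.next().expect("count checked above");
            out.extend(std::iter::repeat(image_token_id).take(n));
        } else {
            out.push(t);
        }
    }
    Ok(out)
}
```

Because prompt_tokens is taken from the expanded length, usage accounting and KV offsets count every patch position.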
rob thijssen	4994b94c84	feat(neuron): TP-vision Stage 2 — per-rank image RPC + worker plumbing Carry image content through the TP forward path so every rank encodes and splices locally (replicated tower, no embedding broadcast). - rpc.rs: new WorkerRequest::GenerateStepWithImages carrying the source image data URIs + image_token_id for the single-shot vision prefill; worker still replies GenerateStepOk. Round-trip test added. - tp_qwen3_5.rs: TpQwen3_5ForCausalLM::forward_with_images — encode each preprocessed image through the rank's replicated tower, cat, splice, forward. Shared by leader and worker so every rank runs identical work. - tp/mod.rs: TpLeaderModel::forward_with_images and WorkerPool::generate_step_with_images (mirrors generate_step: fan out GenerateStepWithImages to subprocess ranks, run the leader's image forward on its device worker thread, drain, combine). - worker.rs: WorkerModel::forward_with_images + handle_generate_step_with_images — each subprocess rank preprocesses the same data URIs via the shared deterministic preprocess_data_uri, encodes, splices, forwards. - device_worker: Job::TpForwardLogitsWithImages + tp_forward_logits_with_images dispatch handler + DeviceWorkerHandle::tp_forward_logits_with_images. Determinism: every rank runs the same preprocess on the same source URIs through the same replicated tower, so the spliced hidden state matches across ranks — preserving the replicated-hidden-state invariant the row-parallel AllReduce relies on, with no NCCL broadcast. No caller yet — Stage 3 wires the TP chat/stream entry points to invoke generate_step_with_images for image prefill. cuda-gated plumbing covered by CI's CUDA type-check; rpc/route/forward_with_images compile on the non-cuda build. Refs TP-vision plan Stage 2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:08:08 +03:00
rob thijssen	9a24b05866	feat(neuron): TP-vision Stage 1 — replicated vision tower on the TP model Load the full, unsharded model.visual.* vision tower on every TP rank (leader + each subprocess worker mmaps the same local safetensors) when config.vision_config is present. VisionTower::load already takes a ShardedVarBuilder whose plain .get() returns the full replicated tensor, so the tower loads identically regardless of world_size — no sharding, no NCCL broadcast. - TpQwen3_5ForCausalLM gains vision: Option<VisionTower> + image_token_id, plus has_vision/image_token_id/encode_image/forward_with_vision, mirroring the single-GPU Qwen3_5ForCausalLM wrapper. - TpQwen3_5Model::forward_with_vision mirrors the single-GPU forward_inner splice: embed locally, replace rows at image_token_id positions, run the sharded decoder stack. Because every rank encodes the same pixels through its replicated tower, the spliced input embeddings are identical across ranks — preserving the TP replicated-hidden-state invariant the row-parallel AllReduce relies on. - splice_runs is now pub(crate) and shared with the TP model. No caller yet — Stage 2 wires the RPC/worker path that invokes encode_image + forward_with_vision per rank. Most of this compiles on the non-cuda build (only the cuda load variant's tower line is gated); CI's CUDA type-check covers the rest. Refs TP-vision plan Stage 1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:00:05 +03:00
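The splice can be sketched over plain row vectors. `splice_image_rows` is a hypothetical stand-in for `splice_runs`, whose real signature works on tensors. The sketch assumes the encoded rows arrive in prompt order, one per image-pad position, and validates everything before writing so a failed call leaves the embeddings untouched.

```rust
/// Overwrite the embedding row at every image-pad position with the next encoded
/// vision row, in order. Identical inputs on every rank give identical outputs,
/// which keeps the TP hidden state replicated.
pub fn splice_image_rows(
    embeds: &mut [Vec<f32>],
    tokens: &[u32],
    image_token_id: u32,
    image_rows: &[Vec<f32>],
) -> Result<(), String> {
    if embeds.len() != tokens.len() {
        return Err("embeddings and tokens differ in length".into());
    }
    let slots: Vec<usize> = tokens
        .iter()
        .enumerate()
        .filter(|(_, t)| **t == image_token_id)
        .map(|(i, _)| i)
        .collect();
    if slots.len() != image_rows.len() {
        return Err(format!("{} image positions but {} encoded rows", slots.len(), image_rows.len()));
    }
    if slots.iter().zip(image_rows).any(|(&s, r)| r.len() != embeds[s].len()) {
        return Err("hidden size mismatch".into());
    }
    for (slot, row) in slots.into_iter().zip(image_rows) {
        embeds[slot].clone_from(row);
    }
    Ok(())
}
```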
rob thijssen	7bb033b4ed	chore: untrack stray .claude/scheduled_tasks.lock and gitignore .claude/ A runtime scheduler lock was accidentally swept into the previous commit by `git add -A`. Remove it from tracking (file stays on disk) and ignore the whole `.claude/` dir so local agent runtime state never lands in the repo again. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 14:55:05 +03:00
rob thijssen	f8c0da0ebf	fix(neuron): TP-vision Stage 0 — reject image requests on the TP path The TP inference path has no vision tower, and the TP dispatch in chat_completion / inference_stream returns before the VisionUnsupported guard runs — so an image request to a TP-loaded model (e.g. beast's tp=2 Qwen3.6-27B) was silently dropped and answered from text alone, the exact issue-#3 confident-hallucination pattern Stage C killed for single-GPU. Add the request_has_images → VisionUnsupported guard to both chat_completion_tp and inference_tp_stream, before prefill / before the SSE stream opens, so beast returns a clean 400 vision_unsupported.
The guard is unconditional for now (TP has no tower); Stage 3 makes it conditional on the TP model's has_vision once real TP-vision lands. Detection is covered by the existing request_has_images unit test; the guard itself is cuda-gated (validated by CI's CUDA type-check). Refs TP-vision plan Stage 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 14:53:56 +03:00
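Detection and the guard can be sketched over a simplified message model. `ContentPart`, `Message`, and `VisionUnsupported` are hypothetical std-only stand-ins for the OpenAI request types, and `guard_vision` is an illustrative name. Its `has_vision` parameter models the later Stage 3 conditional: passing `false` reproduces this commit's unconditional rejection.

```rust
/// Simplified chat content: either text or an OpenAI-style `image_url` part.
pub enum ContentPart {
    Text(String),
    ImageUrl { url: String },
}

pub struct Message {
    pub role: String,
    pub content: Vec<ContentPart>,
}

#[derive(Debug, PartialEq)]
pub struct VisionUnsupported;

/// True if any message carries at least one image part.
pub fn request_has_images(messages: &[Message]) -> bool {
    messages
        .iter()
        .flat_map(|m| &m.content)
        .any(|p| matches!(p, ContentPart::ImageUrl { .. }))
}

/// Run before prefill and before any SSE frame, so the client gets a clean 400
/// instead of an answer generated from the text alone.
pub fn guard_vision(messages: &[Message], has_vision: bool) -> Result<(), VisionUnsupported> {
    if request_has_images(messages) && !has_vision {
        Err(VisionUnsupported)
    } else {
        Ok(())
    }
}
```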
rob thijssen	dd592d918d	test(neuron): C2 — guard Responses→chat image translation contract The Responses request translator already emits the chat `image_url` Parts array Stage B5's vision path consumes, and the non-streaming (`chat_completion`) and streaming (`responses_stream` → `inference_stream`, Stage C1) Responses paths both route image content to the vision-aware prefill — so vision works end-to-end through `/v1/responses` with no translator change required.
Add a multi-image test asserting order preservation and that the `detail` hint is tolerated (and dropped, since chat image_url has no analogue), locking the translator's output to the exact `image_url.url` shape `extract_images_from_request` walks. Closes part of #16 (Stage C2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 13:57:43 +03:00
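The contract this test locks down can be sketched as a one-to-one mapping from Responses-style input parts to chat parts. Both enums are hypothetical simplifications of the wire formats. The property illustrated is the one the test asserts: order is preserved, and `detail` is accepted on input but dropped.

```rust
/// Simplified Responses API input content part.
pub enum InputPart {
    InputText { text: String },
    InputImage { image_url: String, detail: Option<String> },
}

/// Simplified chat content part; this sketch's image variant has no `detail` field.
#[derive(Debug, PartialEq)]
pub enum ChatPart {
    Text { text: String },
    ImageUrl { url: String },
}

/// Map parts one-to-one, preserving order.
pub fn translate_parts(parts: Vec<InputPart>) -> Vec<ChatPart> {
    parts
        .into_iter()
        .map(|p| match p {
            InputPart::InputText { text } => ChatPart::Text { text },
            // `detail` is tolerated on input and intentionally discarded.
            InputPart::InputImage { image_url, detail: _ } => ChatPart::ImageUrl { url: image_url },
        })
        .collect()
}
```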
rob thijssen	766c20ba47	feat(neuron): C1 — streaming SSE chat completion with vision The streaming worker path now splices image embeddings on prefill, closing the silent text-only degrade for `stream=true` image requests. `inference_stream` gains the same vision-routing block as the non-streaming `chat_completion`: detect `image_url` content, reject it against text-only models with `VisionUnsupported` (before any SSE frame is sent), preprocess each image and expand its `<|image_pad|>` sentinel to the per-image patch count, then carry the payload through dispatch. Rather than duplicate the 75-line `route_token!` reasoning/tool-call state machine into a sibling streamer, `stream_inference_via_worker` takes an `Option<(Vec<ImageInput>, u32)>`: when `Some`, prefill is a single-shot `forward_logits_with_images` splice; when `None`, the original chunked text-only prefill. Image embeddings are prefill-only, so every decode step stays on the plain `forward_logits` path and the shared decode loop is untouched. This keeps exactly one copy of the tool-call/reasoning logic to maintain. The Responses API streaming path (`responses_stream`) inherits vision for free since it drives the same `inference_stream`. Unit test covers `request_has_images` (the shared routing gate); the real-weights SSE smoke is the manual curl on beast (cuda-integration). Closes part of #16 (Stage C1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 13:57:02 +03:00
rob thijssen	4972c7d1e7	feat(cortex-gateway): C3 — propagate vision capabilities through /v1/models ModelEntry and CortexModelEntry gain a `capabilities: Vec<String>` field (serde-default for back-compat). The poller copies it verbatim from each neuron's ModelInfo.capabilities; list_models computes the union across every node where a model is loaded so a checkpoint loaded text-only on one neuron and text+vision on another reports both to the fleet. Catalogue-only and mid-prewarm entries default to empty until the catalogue gains a capabilities declaration. Aliases inherit their target's capability union. New gateway test mocks two nodes with differing capability arrays and asserts the unioned /v1/models response. Closes part of #16 (Stage C3). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 13:49:54 +03:00
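The union rule can be sketched with ordered collections. The function names, the `(node, model, capabilities)` triples, and the alias lookup are assumptions for illustration and do not reflect the gateway's actual `ModelEntry` types.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Union each model's capabilities across every node where it is loaded.
pub fn union_capabilities(loaded: &[(&str, &str, Vec<&str>)]) -> BTreeMap<String, BTreeSet<String>> {
    let mut out: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for (_node, model, caps) in loaded {
        out.entry(model.to_string())
            .or_default()
            .extend(caps.iter().map(|c| c.to_string()));
    }
    out
}

/// An alias reports its target's unioned set; a target that is not loaded
/// anywhere (catalogue-only, mid-prewarm) reports an empty set.
pub fn alias_capabilities(union: &BTreeMap<String, BTreeSet<String>>, target: &str) -> BTreeSet<String> {
    union.get(target).cloned().unwrap_or_default()
}
```

A set rather than a list makes the union idempotent, so a capability reported by several nodes appears once.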
rob thijssen	a26bb9f04b	feat(deploy): capture service startup journal after each restart After both `Start cortex.service` and `Start neuron.service`, sleep 10s and run `journalctl --unit <unit> -I --no-pager` to record the latest invocation's log in the workflow output. Step is guarded by `if: always()` so a failed start still leaves a usable trace. infra-setup.sh now adds gitea_ci to the systemd-journal group during user provisioning, so `journalctl` works without a sudoers entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 16:48:56 +03:00
rob thijssen	ea1fdf8aa6	chore(deploy): drop deploy.sh and manifest.yml now that workflow runs First end-to-end run of the deploy workflow succeeded (gitea run #289), so the operator-run rolling-deploy script and its YAML manifest are no longer the source of truth — fleet topology lives in .gitea/workflows/deploy.yml and per-host config in script/infra-setup.sh. Per-host neuron config comments updated to point at the new sync path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 16:41:04 +03:00
rob thijssen	577781de8d	fix(neuron): derive Clone on ImageInput for the CUDA vision dispatch CUDA type-check in CI failed on commit `24968e9` with E0308:

    error[E0308]: mismatched types
      --> crates/neuron/src/harness/candle.rs:1707:33
    1707 | images.clone(),
         | ^^^^^^^^^^^^^^ expected `Vec<ImageInput>`, found `&Vec<ImageInput>`

In Stage B5 the cuda branch of `chat_completion` matches `&vision_route` to keep the `vision_route: Option<...>` alive for both arms, which makes `images` bind as `&Vec<ImageInput>`.
The subsequent `images.clone()` call doesn't deep-clone because `ImageInput` doesn't derive `Clone` — rustc falls back to cloning the `&Vec` reference, which has the wrong type for the worker job. The CPU build (non-cuda) compiled fine because that branch is behind `#[cfg(feature = "cuda")]`; the cuda-check job is what catches the regression. Fix: derive `Clone` on `ImageInput`. The clone cost is one pixel-buffer memcpy per image (~2.4 MiB at fixed 448×448), which is fine on the chat-completion hot path — vision requests are rare per second relative to text-only decode. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 15:51:57 +03:00
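A minimal reproduction of the fix follows; the fields of `ImageInput` and the helper `images_for_job` are hypothetical. Matching on `&Option<...>` binds `images` as `&Vec<ImageInput>`. `.clone()` yields an owned `Vec<ImageInput>` only when the element type is `Clone`. Without the derive, method resolution falls back to `Clone` for the reference itself and returns `&Vec<ImageInput>`, which is the E0308 above.

```rust
/// Hypothetical shape; only the derive matters here.
#[derive(Clone, Debug, PartialEq)]
pub struct ImageInput {
    pub pixels: Vec<f32>,
    pub width: u32,
    pub height: u32,
}

/// Build an owned payload for a worker job while keeping `vision_route` usable afterwards.
pub fn images_for_job(vision_route: &Option<(Vec<ImageInput>, u32)>) -> Option<(Vec<ImageInput>, u32)> {
    match vision_route {
        // Match ergonomics: `images` is `&Vec<ImageInput>`, `image_token_id` is `&u32`.
        Some((images, image_token_id)) => Some((images.clone(), *image_token_id)),
        None => None,
    }
}
```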

247 Commits