Commit Graph

247 Commits

Author SHA1 Message Date
569c528c4b feat(gateway): Anthropic streaming SSE translation (#24)
All checks were successful
CI / Format (push) Successful in 36s
CI / CUDA type-check (push) Successful in 2m25s
CI / Clippy (push) Successful in 2m25s
CI / Format (pull_request) Successful in 41s
CI / CUDA type-check (pull_request) Successful in 2m9s
CI / Clippy (pull_request) Successful in 2m45s
CI / Test (push) Successful in 5m3s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m29s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The /v1/messages handler translated request envelopes but proxied raw
OpenAI SSE frames back to streaming Anthropic clients — the gap
between the README's "point your tooling at it once" contract and
what Claude Code actually received.

cortex-core gains AnthropicStreamTranslator, a pure per-stream state
machine: OpenAI chunks in, ordered (event, payload) pairs out —
message_start → content_block_start/delta/stop (text and tool_use
blocks, indexed; tool_calls map to input_json_delta) → message_delta
(stop_reason mapped via the now-shared map_stop_reason, which also
teaches the non-streaming path tool_calls→tool_use) → message_stop.
Without an upstream usage frame the output count falls back to the
delta count (engine-exact for neuron's one-chunk-per-token streams,
#31); with one, input/output tokens ride message_delta.

cortex-gateway gains anthropic_sse: the wire pump that splits the
upstream byte stream into SSE events, parses data: payloads
(leniently — engines omit fields on special frames), feeds the
translator, and frames results as `event:`/`data:` pairs through a
bounded channel (slow client back-pressures the upstream read).
Upstream truncation without [DONE] still closes the Anthropic event
sequence. Nothing is buffered beyond the current event's bytes.

Tests: 5 state-machine unit tests (text flow, stop-reason mapping +
defaults, tool_use blocks, usage propagation, idempotent finish) and
2 gateway integration tests (full event sequence + text reassembly,
usage propagation into message_delta). Validated end-to-end by
running this branch's gateway against a production neuron and
streaming a live Anthropic request.

Closes #24

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:47:30 +03:00
8f6f1d3205 feat(deploy): validate neuron capability after every deploy
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Package cortex RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 10m36s
build-prerelease / Build cortex binary (push) Successful in 2m35s
build-prerelease / Test (push) Successful in 6m35s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
A deploy previously went green the moment systemd reported the
service started — a merge that broke model loading or inference
itself would deploy "successfully" and only surface when a human
noticed. Each neuron deploy now earns its green:

1. Wait for default models: poll /health until activation.state is
   ready, with per-host timeouts in the matrix (beast 900s for the
   27B Q6K TP=2 cold-load, benjy/quadbrat 300s). Any entry in
   activation.failed fails the deploy with the per-model error —
   the structured equivalent of watching the journal for
   "loaded default model", plus failure detail the journal line
   can't carry.
2. LLM smoke probe: ask the first loaded model to reply with one
   specific word (max_tokens 512 so thinking models have room,
   temperature 0) and grep the response for it. Not a quality bar —
   just proof the deploy didn't lobotomize inference.

Hosts whose package is already current still skip everything — the
validation cost is only paid when a restart actually happened. The
probe was dry-run against benjy's production neuron before landing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:28:20 +03:00
b0d0b939af Merge pull request 'feat(gateway): per-request token metrics — TTFT and tok/s (#21)' (#30) from feat/gateway-21-token-metrics into main
Some checks failed
build-prerelease / Lint (fmt + clippy) (push) Blocked by required conditions
build-prerelease / Test (push) Blocked by required conditions
build-prerelease / Build cortex binary (push) Blocked by required conditions
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:25:32 +00:00
6a36d15ef1 feat(gateway): per-request token metrics — TTFT and tok/s (#21)
All checks were successful
CI / Format (push) Successful in 45s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 2m25s
CI / Clippy (push) Successful in 2m37s
CI / Test (push) Successful in 4m22s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m23s
CI / Test (pull_request) Successful in 4m19s
CI / CUDA type-check (pull_request) Successful in 1m57s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The deferred Phase 6b, and the unblock for the 7→8 milestone's
benchmark work (#22): until cortex measures itself per request,
nothing downstream can be benchmarked or graphed.

The proxy wraps the upstream byte stream in a pass-through inspector
(TokenMetricsStream): chunks are forwarded verbatim — never buffered
or re-serialised — while the inspector records arrival times and
keeps a bounded (64 KiB) tail of the body text. At stream end (or
client disconnect, via Drop) it extracts the final OpenAI usage
object — present on the last SSE chunk and non-streaming JSON bodies
alike — for engine-truth token counts.

Per request, labelled {model, node}:
- cortex_time_to_first_token_seconds (histogram) — first body chunk
- cortex_tokens_per_second (histogram) — completion tokens over the
  decode window (first→last chunk); falls back to total request
  duration for single-chunk non-streaming bodies
- cortex_prompt_tokens_total / cortex_completion_tokens_total
  (counters)

The extractor is pure and chunk-boundary-safe; quoted-needle matching
keeps completion_tokens_details from shadowing completion_tokens,
and the last usage object wins. Covers chat completions, completions,
the Responses API, and the Anthropic streaming path (which currently
proxies OpenAI SSE).

Tests: 4 extractor unit tests; integration test with a streaming
mock emitting a stream_options-style final usage chunk, asserting
both histograms and exact-or-greater counter values (the test
recorder is process-global and shared across the binary's tests).

Closes #21

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:11:52 +03:00
b463439416 Merge pull request 'feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)' (#29) from feat/neuron-19-driver-preflight into main
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Build cortex binary (push) Successful in 2m33s
build-prerelease / Test (push) Successful in 4m24s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-blackwell (push) Successful in 10m18s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:08:20 +00:00
716558c8ff feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)
All checks were successful
CI / Format (push) Successful in 38s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 2m11s
CI / Clippy (push) Successful in 2m13s
CI / Clippy (pull_request) Successful in 2m37s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 3m56s
CI / CUDA type-check (pull_request) Successful in 1m44s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The un-rebooted driver update (userspace libs bumped, kernel module
still old) kills every CUDA call on the host including nvidia-smi,
and neuron surfaced it only as `Comm::from_rank ... NcclError` deep
inside the first model load — 30 minutes of forensics on beast
(2026-06-08) to diagnose. Make it instantly legible instead:

- discovery distinguishes nvidia-smi absent (CPU-only, fine) from
  present-but-failing, classifies the "Driver/library version
  mismatch" signature, and pairs the userspace NVML version with the
  loaded kernel-module version from /proc/driver/nvidia/version.
- DiscoveryResponse gains `cuda_unavailable_reason` (omitted when
  None — wire-compatible) so cortex can see why the node has no
  devices and route around it.
- startup logs one loud ERROR line with the actionable reason
  ("reboot the host to reload the kernel module") and skips default
  model loads entirely, marking each failed with that reason so
  /health activation shows the real cause.
- POST /models/load fast-rejects with 503 + code=cuda_unavailable on
  a mismatch host instead of dying minutes later in cuInit/NCCL.

No false positives: other nvidia-smi failures (no devices, perms)
keep their existing behaviour, CPU-only hosts stay silent.

Closes #19

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:00:00 +03:00
112e4e124a fix(ci): export RUSTC_WRAPPER in the build step itself — GITHUB_ENV doesn't propagate
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m22s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Test (push) Successful in 3m50s
build-prerelease / Build neuron-blackwell (push) Successful in 10m10s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ada (push) Successful in 14m29s
build-prerelease / Build neuron-ampere (push) Successful in 14m31s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Run 375 proved the CUDA image ships sccache (probe step printed
"sccache enabled") but the wrapper never reached cargo: the runner
does not propagate GITHUB_ENV across steps, so the builds ran
unwrapped (server stats: 4 compile requests for a ~600-crate build,
durations unchanged). Probe and export inside the build step's own
shell instead, in both build-neuron and ci.yml's cuda-check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:50:25 +03:00
dc6feec6dc fix(deploy): gate on the publish manifest, not unprivileged dnf check-update
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build cortex binary (push) Successful in 2m18s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m33s
build-prerelease / Test (push) Successful in 4m20s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-blackwell (push) Successful in 9m46s
build-prerelease / Build neuron-ampere (push) Successful in 13m57s
build-prerelease / Build neuron-ada (push) Successful in 15m29s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
The f5fa840 deploy exposed both failure modes of gating with
`dnf check-update` as the gitea_ci user in one run: it hung
indefinitely on quadbrat (blocked process, 0 CPU, killed manually),
and on benjy/beast it silently reported "no updates" two minutes
after new RPMs were published — both hosts skipped a real (luckily
binary-identical) update.

Gate with data we own instead: fetch packages.json from
rpm.lair.cafe (plain curl, no privileges, no dnf locks), take the
newest release per package by buildTime, and skip the
stop/upgrade/start cycle only when it exactly equals
`rpm -q %{VERSION}-%{RELEASE}`. Unreachable or unparsable manifest
fails open to a full deploy. The dnf transaction itself still runs
under the scoped sudoers rules, unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:20:21 +03:00
02f20bc9e1 Merge pull request 'feat: keep auto-recovering models visible as recovering (#20)' (#28) from feat/neuron-20-recovering-status into main
Some checks failed
build-prerelease / Test (push) Blocked by required conditions
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m39s
build-prerelease / Build cortex binary (push) Successful in 3m46s
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 11:15:38 +00:00
2a231e49de merge main (sccache enablement supersedes branch cuda-check pin)
All checks were successful
CI / Format (push) Successful in 40s
CI / Format (pull_request) Successful in 37s
CI / Clippy (push) Successful in 2m17s
CI / CUDA type-check (push) Successful in 2m39s
CI / CUDA type-check (pull_request) Successful in 2m30s
CI / Test (push) Successful in 4m51s
CI / Clippy (pull_request) Successful in 2m12s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m49s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
# Conflicts:
#	.gitea/workflows/ci.yml
2026-06-12 14:05:55 +03:00
2dadea5d8d ci: enable sccache on the build jobs (conditional on the CUDA image)
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 34s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m57s
build-prerelease / Test (push) Has been cancelled
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
The 3 CUDA flavour builds (10-14 min each, the critical path of every
full run) and build-cortex compiled entirely uncached. With the
gongfoo-side sccache hardening in place, wire them up:

- build-cortex: full sccache env (rust image ships it) + the standard
  escalation loop (retry -> server restart -> uncached final attempt).
- build-neuron: probe for sccache before enabling the wrapper — the
  CUDA image may not ship it, and a missing binary must degrade to an
  uncached build, not fail cargo at `sccache rustc -vV` (the original
  reason the wrapper was cleared here). rustc compilations are shared
  across all three flavours; candle-kernels' nvcc output stays
  uncached (build-script artifact).
- ci.yml cuda-check: same probe pattern replaces the blanket env
  clear; also pins CUDA_COMPUTE_CAP=86 since the image no longer
  ships nvidia-smi for candle-kernels' fallback detection (mirrors
  9bb9678 on the #20 branch).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:05:26 +03:00
9bb9678f93 fix(ci): pin CUDA_COMPUTE_CAP in cuda-check — builder image has no nvidia-smi
All checks were successful
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 1m45s
CI / Clippy (push) Successful in 2m24s
CI / Clippy (pull_request) Successful in 2m19s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m35s
CI / CUDA type-check (pull_request) Successful in 1m50s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
candle-kernels' build script shells out to nvidia-smi for compute-cap
detection when CUDA_COMPUTE_CAP is unset; the current GPU-less builder
image doesn't ship it, so the type-check died in the build script
before borrow-checking anything. Pin an arbitrary valid cap — the
check is feature-gate compilation only; real caps live in
build-prerelease.yml's flavour matrix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:55:23 +03:00
df9c490614 feat(neuron+gateway): keep auto-recovering models visible as recovering (#20)
Some checks failed
CI / Format (push) Successful in 37s
CI / CUDA type-check (pull_request) Failing after 28s
CI / Format (pull_request) Successful in 37s
CI / Clippy (push) Successful in 2m54s
CI / Clippy (pull_request) Successful in 3m36s
CI / Test (push) Successful in 4m37s
CI / Test (pull_request) Successful in 5m20s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / CUDA type-check (push) Failing after 31s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
During the #17 auto-recovery window (unload → reload, minutes for a
large TP model) the model's registry slot is absent, so it vanished
from neuron's /models — and cortex, routing by /models presence,
answered "model not found on any node" while a direct request to
neuron would have correctly said "recovering, retry shortly".

neuron: the recovery set becomes a map carrying a devices/capabilities
snapshot taken at trigger time (while the registry slot still exists).
list_models reports `recovering` for models in the set — both while
the poisoned slot is still present and during the reload gap, where
the snapshot keeps the model listed.

gateway: ModelStatus grows a Recovering variant (parsed from the
wire); the router holds the route — new RouteError::ModelRecovering
mapped to 503 instead of 404 — and deliberately does not fall through
to the catalogue cold-load, which would race a second placement
against the in-flight recovery. The evictor already ignores
non-Loaded entries.

Tests: neuron unit test (recovering model stays listed with snapshot),
gateway integration tests (poller parses `recovering`; request gets
503 retry-shortly and the model stays on /v1/models).

Closes #20

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:42:03 +03:00
f5fa840dfb ci: escalate sccache retries — restart server, then fall back uncached
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m6s
build-prerelease / Test (push) Successful in 4m50s
build-prerelease / Build cortex binary (push) Successful in 3m45s
build-prerelease / Build neuron-blackwell (push) Successful in 9m59s
build-prerelease / Build neuron-ada (push) Successful in 14m11s
build-prerelease / Build neuron-ampere (push) Successful in 14m13s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m28s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m54s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Run 361's Test job failed all 3 attempts with the sccache
dead-server signature (sccache fatal error, ENOENT on its own tmp
files under target/debug/deps). Retrying the same invocation only
helps for transient races; against a wedged server every same-VM
retry fails identically — and under the new pipeline that blocks
publish and the deploy behind it.

Escalate instead: attempt 1 plain, attempt 2 after an sccache server
restart, attempt 3 with RUSTC_WRAPPER unset (uncached). A sick cache
now costs build minutes, never the deploy. Applied to the lint/test
jobs in build-prerelease.yml and ci.yml alike.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:24:02 +03:00
7557c5e877 ci: cut iteration latency — change-aware builds, gated deploys, dev fast path
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 28s
build-prerelease / Test (push) Failing after 1m16s
build-prerelease / Lint (fmt + clippy) (push) Successful in 3m7s
build-prerelease / Build cortex binary (push) Successful in 3m57s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Push-to-testable was ~20.5 min for every commit (measured on the
2026-06-08 green chain) plus a ~5 min 27B cold-load, regardless of
what changed. Three structural fixes:

- build-prerelease: a change-detection step in `prepare` diffs HEAD
  against the git sha embedded in the last *published* unstable RPM
  (per package, from packages.json) and skips builds whose inputs
  didn't change. Docs-only commits build nothing; gateway-only
  commits skip the 3 CUDA flavour builds. Detection failures fall
  open to a full build.
- ci.yml no longer runs on pushes to main; fmt/clippy/test live in
  build-prerelease as parallel jobs gating publish. The two workflows
  previously queued against each other on the same runner labels,
  delaying the cortex build ~12 min. Branches, PRs, and tags keep the
  full ci.yml gate.
- deploy: each host self-gates with `dnf check-update` and leaves the
  service untouched when the installed package is already current —
  no more neuron restarts (and 27B cold-loads) for commits that
  didn't change neuron.
- deploy-dev (new): manual single-host fast path — build one CUDA
  flavour, scp the binary, restart the service. Skips packaging,
  signing, publish, and dnf entirely. Backed by a new exact-form
  sudoers rule in asset/sudoers.d/neuron-host.conf (already applied
  to all three hosts).

Expected loop times when runners behave: docs ≈ 1 min (nothing
deploys), gateway-only ≈ 6-8 min, single-neuron dev ≈ 8-10 min,
full fleet ≈ 13-15 min.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:17:22 +03:00
91e95ca979 docs: rewrite README around project positioning
Some checks failed
CI / CUDA type-check (push) Failing after 46s
CI / Format (push) Successful in 47s
CI / Clippy (push) Successful in 2m53s
CI / Test (push) Successful in 4m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 39s
build-prerelease / Build cortex binary (push) Successful in 3m52s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-blackwell (push) Successful in 11m34s
build-prerelease / Build neuron-ampere (push) Successful in 15m31s
build-prerelease / Build neuron-ada (push) Successful in 15m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Lead with what helexa is for — near-frontier open-weight models on
consumer hardware you own — instead of a feature list. Adds the scope
section (intentional divergence from vLLM/SGLang; CUDA-only today as a
test-coverage constraint, not a principle), an engine section covering
the per-device worker threads and consumer-GPU tensor parallelism, the
previously-missing helexa-acp crate, and a status section pointing at
git.lair.cafe as the source of truth with GitHub as read-only mirror.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 11:37:00 +03:00
1a74cb0c56 chore: rename repo cortex -> helexa
Some checks failed
CI / CUDA type-check (push) Failing after 30s
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Format (push) Successful in 32s
build-prerelease / Build neuron-blackwell (push) Failing after 31s
build-prerelease / Build neuron-ada (push) Failing after 34s
build-prerelease / Build neuron-ampere (push) Failing after 38s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
CI / Clippy (push) Failing after 1m11s
build-prerelease / Build cortex binary (push) Successful in 3m47s
CI / Test (push) Successful in 5m32s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
helexa is the project; cortex (per-operator control plane / LLM proxy)
and neuron (per-host LLM harness) are its components. The Gitea repo
is now helexa/helexa. Update repository URLs in Cargo metadata, RPM
specs, and docs; make the CI changelog push URL rename-proof via the
github.repository context; reframe README.md and CLAUDE.md around the
project name. Binary, package, service, and config-path names are
unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 10:54:01 +03:00
60f5598542 build(neuron): bump cudarc fork to 63327a2 (idempotent abort + Comm Send+Sync)
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / CUDA type-check (push) Successful in 31s
CI / Format (push) Successful in 35s
CI / Test (push) Failing after 1m9s
CI / Clippy (push) Successful in 2m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m10s
build-prerelease / Build neuron-ampere (push) Successful in 7m35s
build-prerelease / Build neuron-ada (push) Successful in 5m7s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m14s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Build cortex binary (push) Successful in 4m33s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The fork's new commit makes `Comm: Send + Sync` (asserting NCCL's
thread-safety invariant upstream) and makes `Comm::abort` idempotent via
an `aborted` flag (so abort-then-Drop can't double-free) — strictly
better than the previous Drop-no-panic workaround, and the `abort()`
signature is unchanged so the watchdog call site is unaffected.

Because `Comm` is now `Send + Sync`, `Arc<Comm>` and the `SendComm` /
`NcclState` wrappers auto-derive `Send`/`Sync`, which conflicts (E0119)
with neuron's manual `unsafe impl`s. Remove the four now-redundant impls
— the safety assertion lives upstream in cudarc where it belongs. The
conflict is in cuda-gated code, so only the CUDA type-check catches it
(non-cuda build + clippy + tests stay green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:33:14 +03:00
7945240646 chore: re-trigger deploy (#17 Stage 2, attempt 3)
All checks were successful
CI / CUDA type-check (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m45s
build-prerelease / Build neuron-blackwell (push) Successful in 5m50s
CI / Test (push) Successful in 6m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 8m38s
build-prerelease / Build neuron-ada (push) Successful in 5m36s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
No code change. Each deploy run, the degraded CI runner kills a different
single arch build (blackwell, then ada) ~fast, and the all-arch-gated
packaging skips → no publish. Every arch HAS built green across runs
(blackwell  in 342, ampere , ada  in 339) and the gate + CUDA
type-check pass. Re-running to catch all three green in one run so the
Stage-2 RPMs publish. Runner FS/cache health is the real fix (separate
infra work).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:06:04 +03:00
0c74d89d15 chore: re-trigger deploy (#17 Stage 2)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 30s
build-prerelease / Build neuron-ada (push) Failing after 51s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Build neuron-blackwell (push) Successful in 6m32s
build-prerelease / Build neuron-ampere (push) Successful in 7m42s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
CI / Test (push) Successful in 6m6s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
No code change. The c94a2ae deploy's neuron-blackwell build died ~12min
into the Blackwell kernel compile on the degraded runner, while
neuron-ampere + neuron-ada built the identical Rust + patched cudarc
cleanly and the CUDA type-check passed. Transient infra; re-running to
get a healthy blackwell build so the RPMs publish and beast (Blackwell)
picks it up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:45:16 +03:00
c94a2ae755 fix(neuron): correct nccl_state path on WorkerPool.leader_comm (#17 S2)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 44s
build-prerelease / Build cortex binary (push) Successful in 4m57s
build-prerelease / Package cortex RPM (push) Successful in 1m36s
CI / Test (push) Successful in 7m10s
CI / Clippy (push) Failing after 1m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m40s
build-prerelease / Build neuron-ada (push) Successful in 9m5s
build-prerelease / Build neuron-blackwell (push) Failing after 12m2s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
`super::nccl_state` from tp/mod.rs resolves to `crate::harness::nccl_state`
(nonexistent); the module is the child `nccl_state` (cf. the existing
`nccl_state::generate_comm_id_hex` call). The field is cuda-gated so the
non-cuda build couldn't catch it; the branch CUDA type-check flaked on the
runner before compiling. Self-audited fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:21:43 +03:00
99920dd322 feat(neuron): TP step watchdog aborts wedged collectives (#17 Stage 2)
Some checks failed
CI / CUDA type-check (push) Failing after 47s
CI / Format (push) Successful in 31s
CI / Test (push) Failing after 1m3s
CI / Clippy (push) Successful in 2m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Make a hung NCCL collective recoverable instead of a permanent brick.
Today a wedged collective hangs the in-process leader thread forever, and
even Stage 1's recovery can't help — its unload's DropTp queues behind the
stuck thread and hangs too.

- Cache the leader's NCCL Comm handle async-side at init (new cuda-gated
  Job::GetLeaderComm → DeviceWorkerHandle::get_leader_comm → stored on
  WorkerPool.leader_comm). Fetched while the thread is responsive — a
  wedged thread can't service the fetch, which is why it's cached up front.
- Wrap the leader forward in both generate_step and
  generate_step_with_images in tokio::time::timeout (default 120s,
  NEURON_TP_STEP_TIMEOUT_S). On expiry the watchdog calls
  Comm::abort() (ncclCommAbort) on the cached handle from the async
  thread — the one NCCL op sanctioned concurrently with an in-flight
  collective — which unblocks the leader thread, then fails the step
  WITHOUT draining (workers are wedged too; recovery's unload kills them).
  The error is a device fault → poison → Stage 1 auto-recovery, which now
  completes because the leader thread is responsive again.
- Bumps the cudarc patch to dbc425a (adds the Drop-must-not-panic fix so
  the post-abort comm teardown during recovery doesn't double-abort-panic).

Logs the whole sequence at ERROR with greppable `tp watchdog:` /
`ncclCommAbort` markers so a real-world hang leaves a forensic trail —
verification is by inspecting journals after real hangs, not a synthetic
harness. cuda-gated → validated by the blackwell build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:15:29 +03:00
c4f239ceb9 build(neuron): patch cudarc to expose Comm::abort/get_async_error (#17 Stage 2)
All checks were successful
CI / CUDA type-check (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 6m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
#17 Stage 2 (TP hang-recovery) needs to call ncclCommAbort on a LIVE
communicator from another thread — to unblock a collective wedged on a
dead/hung peer so the ranks can resync. No cudarc release (incl. main)
exposes this: the safe Comm only aborts in Drop, which can't fire while a
stuck thread holds an Arc<Comm> clone.

Pin neuron's cudarc 0.19.7 to a fork (grenade/cudarc @ nccl-comm-abort,
rev 4dff0be) adding three thin methods — Comm::abort, get_async_error,
and a raw comm() accessor — to be submitted upstream. The patch targets
0.19.x only; candle's transitive cudarc 0.17.8 stays on crates.io.

Foundation only; the watchdog + abort + comm-rebuild that consume these
land in follow-up commits (cuda-gated → validated by the blackwell build).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 13:49:59 +03:00
ac445c1569 chore: re-trigger deploy (#17 Stage 1)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
CI / Format (push) Successful in 37s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 3m54s
build-prerelease / Build cortex binary (push) Successful in 4m43s
CI / Test (push) Successful in 6m35s
build-prerelease / Build neuron-blackwell (push) Successful in 5m58s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 8m10s
build-prerelease / Build neuron-ada (push) Successful in 5m21s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
No code change. The abc6e60 deploy's neuron-ada build died on the
degraded CI runner (container dropped mid-checkout), skipping the
gated publish — even though neuron-blackwell + neuron-ampere compiled
the Stage-1 fault-recovery code cleanly. Re-running to get a healthy
ada build so the RPMs publish and beast picks up the build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:34:20 +03:00
abc6e605b8 test(neuron): NEURON_DEBUG_POISON hook to verify auto-recovery (#17)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
build-prerelease / Resolve version stamps (push) Successful in 43s
CI / Format (push) Successful in 50s
CI / Clippy (push) Failing after 57s
build-prerelease / Build neuron-ada (push) Failing after 48s
build-prerelease / Build cortex binary (push) Successful in 5m5s
build-prerelease / Build neuron-blackwell (push) Successful in 6m38s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 7m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Test (push) Successful in 10m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
One-shot, env-gated fault injector for beast verification: when
NEURON_DEBUG_POISON names a model, the first request for it triggers the
auto-recovery path as if a device fault had occurred — exercising
unload→reload→healthy without corrupting the GPU. Latched so it fires
exactly once (no recovery loop). No-op unless the env var is set; wired
into both the single-GPU and TP chat poison gates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:08:40 +03:00
4f2957af9e feat(neuron): auto-recover poisoned models (#17 Stage 1c)
When an inference hit a device fault, the model was flagged poisoned and
every subsequent request rejected with "unload and reload the model to
recover" — until a *human* did exactly that. Now the harness rebuilds the
context automatically.

- Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+
  `LoadedHandle::spec()`) so a poisoned model can be reloaded without an
  operator reconstructing the spec.
- A background recovery task (held via `Weak<CandleHarness>`, spawned in
  `new()` when a runtime is present) drains poisoned model ids and runs
  `unload_model` → `load_model(spec)`. Unload drops the model → cudarc
  `Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL
  init + sanity inside the load path, so a successful reload yields a
  fresh, healthy model. A failed reload leaves it unloaded (next load
  retries) — never poisoned forever.
- The request-entry poison gates now `trigger_recovery` (single-flight
  per model via a `recovering` set) and return a transient "recovering,
  retry shortly" error instead of the manual-reload message. Requests
  that arrive during the brief reload gap (model absent from the registry)
  also get "recovering" rather than a misleading "not loaded".

`new()` now returns `Arc<Self>`. Recovery runs only on the background
task — never inline on the request path, which holds `inference_lock`
and would deadlock on the `models` write lock.

Stage 1c of the #17 plan (verified-healthy auto-recovery). Watchdog
(1b) + a fault-injection hook for beast verification follow. The
in-process rank-0 leader's own context fault still needs a reload that
can't rebind it (Stage 3); comm-desync + worker faults recover here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:05:02 +03:00
75cd088b61 fix(neuron): cap vision max_pixels to the pos_embed patch budget (#14)
All checks were successful
CI / CUDA type-check (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m32s
build-prerelease / Build neuron-blackwell (push) Successful in 6m5s
CI / Test (push) Successful in 5m49s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m11s
build-prerelease / Build neuron-ada (push) Successful in 5m40s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m4s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m57s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m16s
Beast testing surfaced a real regression in the dynamic-resolution
default: a tall 808×1600 image resized (within the 1024² max_pixels) to a
90×44 patch grid = 3960 patches, exceeding the vision tower's hard
`num_position_embeddings = 2304` pos-embed budget. The per-rank
`patch count 3960 exceeds pos_embed budget 2304` error fired mid-TP-
forward and poisoned the device context, bricking the model until reload.

Hard-cap `max_pixels` to `2304 × 16² = 589_824` px (≤ 2304 patches →
≤ 576 LM tokens), clamping even the operator env override. `smart_resize`
floors the pixel count under the cap, so no resized image can ever exceed
the budget — the tower check never fires, no poison. The pos-embed grid
(48×48) is the resolution Qwen3.6 was trained at, so the cap is
principled, not just defensive. Still ~3× the old fixed 196 tokens, and
the book-cover OCR test (1176 patches) already reads full title+subtitle.

Test: a huge/tall/wide/extreme image battery stays within the 2304 patch
budget. (Per-rank-error poison robustness itself remains issue #17.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 23:30:47 +03:00
d311c8ca7a feat(neuron): operator pixel-budget env override + doc cleanup (#14 C5)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Format (push) Successful in 45s
CI / Test (push) Failing after 58s
CI / Clippy (push) Successful in 2m41s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m14s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-blackwell (push) Successful in 6m20s
build-prerelease / Build neuron-ampere (push) Successful in 7m18s
build-prerelease / Build neuron-ada (push) Successful in 5m10s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m7s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
- PreprocessProfile::qwen3_6() reads NEURON_VISION_MIN_PIXELS /
  NEURON_VISION_MAX_PIXELS (clamped to factor² ≤ min ≤ max), matching the
  NEURON_VISION_LEGACY_* / NEURON_MROPE knob convention. Defaults remain
  256²…1024² (64…1024 LM tokens/image).
- Test: a max-resolution source caps within the token budget (can't blow
  NEURON_MAX_PROMPT_TOKENS).
- Strip stale fixed-resolution / "MRoPE gap (#15)" / 14×14 language from
  the preprocess, mod, and rope doc-comments now that resolution is
  dynamic and M-RoPE is implemented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 22:50:03 +03:00
c97a8654f5 feat(neuron): dynamic-resolution images via Qwen smart_resize (#14)
Some checks failed
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 34s
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Replace the fixed 448×448-square preprocess with native-aspect
`smart_resize`, and thread the resulting per-image grid through the LM
so spatial structure survives non-square images (documents, screenshots,
charts, panoramas, OCR) instead of being squished into a square.

- preprocess.rs: port Qwen `smart_resize` (factor = patch×merge = 32;
  pixel budget [min,max], default 256²–1024² → 64–1024 LM tokens).
  `PreprocessProfile` drops the fixed target dims for `factor`/`min_pixels`/
  `max_pixels`; `preprocess`/`preprocess_data_uri` now return the resized
  `(h, w)`; add `resized_dims_for_uri` (decode + resize, no normalize) for
  the TP leader's token count.
- rope.rs: `compute_mrope_index`/`get_rope_index` take per-image
  `grids: &[(lm_gh, lm_gw)]` instead of assuming a square `isqrt(run)`.
  Walk image runs in order, validate `run == gh*gw`, emit row-major
  positions, resume the shared counter at `base + max(gh,gw)`. Correct
  for multiple images of differing grids interleaved with text.
- candle.rs: `VisionMeta`/`LoadedModel`/`TpLoadedModel` carry the
  `image_grid_factor` (patch×merge) instead of the constant 196; all four
  prompt-build sites compute per-image counts from each image's resized
  grid (single-GPU from the extracted `ImageInput.h/w`, TP from
  `resized_dims_for_uri`). `ModelArch` gains `vision_grid_factor`.
- single-GPU (`mod.rs`, `dispatch.rs`) and TP
  (`tp_qwen3_5.rs::prefill_with_images_chunked`, `dispatch.rs`,
  `tp/worker.rs`) thread the grids into `get_rope_index`. Each TP rank
  recomputes grids from its own deterministic preprocess — no rpc.rs
  change, single source of truth.

The vision tower itself was already grid-general (recent pos-embed
interpolation + 2D rotary fix). No patch-count cap: pos-embed is
interpolated to any grid; `max_pixels` bounds cost (O(patches²) ViT
attention + prefill) instead.

Tests: smart_resize (aspect/cap/floor/reject), `compute_mrope_index`
non-square + two-image + mismatch cases, square-grid regression guard.
Non-cuda build + clippy + full workspace tests green; TP load/dispatch
paths are cuda-gated → Gitea CUDA type-check. Operator pixel-budget
config + remaining doc cleanup follow in C5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 22:47:27 +03:00
dc048ffcc9 fix(neuron): vision-tower 2D positions + M-RoPE default on
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m36s
build-prerelease / Build cortex binary (push) Successful in 4m48s
build-prerelease / Build neuron-blackwell (push) Successful in 5m59s
CI / Test (push) Successful in 6m35s
build-prerelease / Build neuron-ampere (push) Successful in 7m51s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Successful in 5m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
Two fixes to the spatial handling of images, validated against the HF
transformers 4.57.1 qwen3_vl reference on beast.

**Vision tower (the real cause of poor spatial vision).** The Stage-A
tower encoded position two ways wrong, so the model saw image *content*
but not *layout* (a row of 5 people read as "a line of 23", sky
inverted), regardless of the LM-side rope:

- Learned pos-embed was a naive sequential lookup of the first
  `n_patches` rows of the 48×48 (`num_position_embeddings=2304`) grid —
  wrong stride for a 28×28 patch grid. Now bilinearly interpolates the
  grid to `gh×gw` (port of HF `fast_pos_embed_interpolate`), row-major.
- The 2D vision rotary was absent entirely. Added
  `VisionRotaryEmbedding` (θ=10000, dim=head_dim/2) applying per-patch
  `(row, col)` rotary to q/k in every ViT block via rope_slow, matching
  HF `apply_rotary_pos_emb_vision`.

Both default on; `NEURON_VISION_LEGACY_POS=1` / `NEURON_VISION_LEGACY_ROPE=1`
revert each for A/B (no rebuild). New unit tests: interpolation reduces
to the sequential lookup at the native grid; rotary row/col structure.

**M-RoPE default on.** The interleaved M-RoPE matches HF
apply_interleaved_mrope / get_rope_index exactly and A/B'd strictly ≥
plain. `NEURON_MROPE` is now a kill switch (`=0` for plain), not opt-in
— defaults should encode the model's trained behaviour, not freeze the
broken state.

Vision tower is plain candle (CPU-testable): built, clippy-clean, full
workspace tests green locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 20:53:07 +03:00
7ebcfba5ca fix(neuron): gate M-RoPE behind NEURON_MROPE (default off)
All checks were successful
CI / CUDA type-check (push) Successful in 33s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m34s
build-prerelease / Build cortex binary (push) Successful in 4m33s
build-prerelease / Build neuron-blackwell (push) Successful in 6m14s
CI / Test (push) Successful in 6m50s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m12s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ada (push) Successful in 5m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m3s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
On beast the interleaved M-RoPE degraded image understanding rather than
fixing it: the model misread spatial layout (a horizontal row of people
described as a "diagonal receding line"), got attributes wrong, and
rambled — a "how many people" follow-up generated 4459 tokens over 3.5
minutes, past agent-0's HTTP timeout (the "fails to respond without an
error"). The interleave is evidently not numerically correct, and it
can't be validated remotely without a transformers reference.

Gate it: `get_rope_index` now returns plain sequential identity
positions unless NEURON_MROPE is truthy, so mrope_cos_sin reduces to
plain RoPE and image tokens behave exactly as pre-M-RoPE (content
recognition works; spatial layout approximate; no rambling). The real
computation moves to `compute_mrope_index` (still unit-tested). Default
off restores the working vision and unblocks agent-0; the M-RoPE code
stays in place to debug + validate before flipping the default on.

Pure non-cuda change (rope.rs); both single-GPU and TP forwards call
the gated get_rope_index unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 19:32:59 +03:00
825bf4e905 feat(neuron): M-RoPE Stage 4 — wire interleaved M-RoPE into the TP path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / CUDA type-check (push) Successful in 31s
CI / Format (push) Successful in 42s
build-prerelease / Build cortex binary (push) Successful in 5m9s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
CI / Test (push) Successful in 7m19s
build-prerelease / Build neuron-ampere (push) Successful in 8m40s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m53s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s
CI / Clippy (push) Successful in 2m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Mirror Stage 3 into the tensor-parallel Qwen3.6 model:

- TpQwen3_5Attention / DecoderLayer take (cos, sin) instead of a scalar
  offset and apply via apply_cos_sin.
- TpQwen3_5Model gains the replicated rotary + rope_delta (reset in
  clear_kv_cache, settable). forward_inner builds the cos/sin once —
  interleaved M-RoPE from explicit position_ids (vision) or plain at
  offset+rope_delta (text/decode). forward() and forward_with_positions()
  delegate; the old single-shot forward_with_vision is gone.
- prefill_with_images_chunked now computes get_rope_index over the whole
  prompt once, stores rope_delta on the base model, and slices the
  (3, prompt_len) position tensor per chunk — so every rank assigns image
  tokens their 14×14 grid coordinates and steps in lockstep (every chunk,
  text or image, carries the M-RoPE slice because the image shifts the
  surrounding text positions).

Also build the position-id tensor as f32 directly (positions are small
integers, exact in f32) to avoid an i64→f32 cast on the GPU.

The TP forward is cuda-gated — CI CUDA type-check is the compile gate.
Non-cuda build + clippy + full workspace tests green; rope math + the
plain-RoPE-reduction invariant covered by unit tests.

Completes the interleaved-M-RoPE work for the vision spatial misread.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:46:27 +03:00
4c12c7e2f0 feat(neuron): M-RoPE Stage 3 — wire interleaved M-RoPE into single-GPU
Qwen3_5Model now builds the rotary cos/sin once per forward and threads
(cos, sin) through the decoder → full-attention → rope, replacing the
scalar offset that reached RotaryEmbedding:

- vision forward computes get_rope_index over the (single-shot) prompt,
  sets rope_delta, and builds interleaved-M-RoPE cos/sin so image tokens
  carry their 14×14 grid (height/width) positions;
- text / decode take plain_cos_sin at offset + rope_delta — with
  rope_delta == 0 (no image) this is bit-for-bit the old plain RoPE, and
  the device→host id copy is skipped on the text decode hot path.

rope_delta is stored on the model and reset in clear_kv_cache, so decode
after a vision prefill resumes text positions from the image-compressed
counter. decoder.rs / full_attn.rs take (cos, sin) instead of offset;
linear-attention layers are unchanged (no RoPE). The TP path still uses
the retained apply(offset) — wired in Stage 4.

Full workspace tests green; the load-bearing invariant (M-RoPE == plain
for equal axes) keeps text unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:39:52 +03:00
ba1b5ba408 feat(neuron): M-RoPE Stage 2 — get_rope_index position-id helper
Pure function computing the interleaved-M-RoPE 3D position ids for a
prompt with image-placeholder runs, plus the decode rope_delta:
text tokens advance a single counter (all axes equal); each image run
gets [base+t, base+h, base+w] row-major over a square grid_t=1,
grid_h=grid_w=isqrt(run) (196 → 14×14); the counter resumes from
base + max(grid). rope_delta = final_counter - seq_len lets decode
resume text positions after the position-compressed image blocks.
Plus mrope_position_tensor to build the (3, seq) tensor.

Unit tests: text-only is sequential (delta 0); text+image+text matches
hand-computed grid ids + resume + delta; 196 → 14×14; non-square run
rejected; end-to-end through mrope_cos_sin tracks the height axis.

#[allow(dead_code)] until Stage 3/4 wire it into the forward.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:34:28 +03:00
5731f4c318 feat(neuron): M-RoPE Stage 1 — interleaved rope machinery + config
Parse + store mrope_section / mrope_interleaved in RopeParameters
(previously accepted-but-ignored). RotaryEmbedding gains:
- inv_freq + per-axis column masks (mask_t/h/w) built from mrope_section;
- plain_cos_sin(pos, seq_len): narrow the precomputed tables (text/decode);
- mrope_cos_sin(position_ids (3,seq)): per-axis freqs blended at the
  interleave columns (vision);
- apply_cos_sin(q,k,cos,sin): the rope_slow application, factored out.

The existing apply(q,k,offset) is retained (delegates to
plain_cos_sin + apply_cos_sin) so current callers are unchanged; Stages
3–4 move cos/sin construction into the model forward and thread the 3D
position ids for image tokens.

Tests: masks partition the half-dim; interleave drives the right axis
per column; and the load-bearing invariant — mrope_cos_sin reduces
bit-for-bit to plain_cos_sin when the three axes are equal (so text
inference is unchanged).

Refs the MRoPE-gap diagnosis (vision spatial misread). Pure non-cuda;
no behaviour change until wired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:31:15 +03:00
fa013505d1 fix(neuron): chunked TP-vision prefill + pre-flight VRAM guard
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 29s
build-prerelease / Build cortex binary (push) Successful in 4m26s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-blackwell (push) Successful in 6m6s
build-prerelease / Build neuron-ampere (push) Successful in 8m30s
CI / Format (push) Successful in 38s
CI / CUDA type-check (push) Successful in 47s
CI / Clippy (push) Successful in 2m36s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
CI / Test (push) Successful in 6m3s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m32s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
agent-0 sent a ~13k-token prompt + image; the TP vision prefill was
single-shot, so it tried to materialise activations for all 12,960
positions at once and OOM'd rank 1 mid-forward. Rank 1 died before
issuing its row-parallel AllReduce, stranding rank 0 on the collective
(it hung holding the pool lock). The text path survives the same size
because it chunks the prefill.

Chunk the vision prefill the same way:

- TpQwen3_5ForCausalLM::prefill_with_images_chunked encodes the image(s)
  once, then walks the pre-expanded prompt in prefill_chunk_tokens()
  windows, splicing the patch-embedding rows into whichever chunk(s)
  carry <|image_pad|> positions (pure-text chunks take the plain
  forward). Activation is bounded by the chunk, not the prompt.
- Every rank runs the identical chunk sequence (chunk_size threaded
  through GenerateStepWithImages / TpForwardLogitsWithImages /
  generate_step_with_images), so the per-chunk AllReduces stay paired
  across ranks with no extra sync — the KV cache accumulates via the
  growing offset, only the last chunk's logits are kept.

Pre-flight guard (validate_vision_prefill): even chunked, a long
prompt's KV cache can exhaust VRAM mid-forward, and on TP that hangs
the collective. Reject up front with a clean InsufficientVram when the
estimated footprint exceeds free VRAM, so a doomed request fails fast
instead of hanging the daemon. Heuristic + tunable
(NEURON_VISION_PREFILL_MB_PER_1K_TOKENS / _BASE_MB); default permissive
so the now-working 12,960-token case still passes. Applied to every
vision path (single-GPU + TP); single-GPU vision stays single-shot for
now, so the guard is its protection until it's chunked too.

Tests: pre-flight guard behaviour; RPC round-trip carries chunk_size.
The chunked forward is cuda-gated — CI CUDA type-check validates it.

Refs #16 / TP-vision. Operational note: a TP rank OOM still hangs the
daemon (needs restart); making a worker failure abort the leader's
collective is separate, broader TP hardening.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 17:21:36 +03:00
c8bcaabc38 fix(neuron): render HF chat templates via minijinja pycompat
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 34s
CI / CUDA type-check (push) Successful in 39s
CI / Clippy (push) Successful in 2m35s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
CI / Test (push) Successful in 6m47s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 7m43s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Successful in 5m41s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
The Qwen3.6 chat_template.jinja (now loaded after the precedence fix)
failed to render in minijinja: it uses Python str methods
(content.startswith/endswith/split/rstrip/lstrip) and the raise_exception
global that HF transformers patches into its Jinja env but minijinja
doesn't provide. The render error tripped the text-only fallback, so
image requests still produced zero <|image_pad|> tokens.

Wire the standard bridge into render_chat_template:
- minijinja-contrib `pycompat::unknown_method_callback` supplies the
  Python string/list/dict methods;
- a `raise_exception` global maps to a render error (so malformed inputs
  — e.g. an image in a system message — surface cleanly).

Add the real Qwen3.6-27B chat_template.jinja (verbatim from beast's HF
cache) as a test fixture and assert it renders one <|image_pad|> for a
text+image turn — the end-to-end check that would have caught this
before deploy.

Refs #16 / TP-vision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 16:32:23 +03:00
7ad56c6a86 fix(neuron): load chat_template.jinja (transformers precedence)
The chat-template loader only read the `chat_template` field from
tokenizer_config.json. Qwen3.6-27B ships its vision-aware template
*only* in a standalone `chat_template.jinja` (and has no
tokenizer_config.json at all), so the loader returned None and image
requests fell back to the text-only format_qwen3_prompt — rendering
zero `<|image_pad|>` tokens and tripping
"expand_image_pad_tokens: prompt has 0 image_token_id occurrences".

load_chat_template_alongside now follows HF transformers precedence:
standalone chat_template.jinja → chat_template.json → the
chat_template field in tokenizer_config.json. Tests cover the
precedence, the text-only fallback, and that an OpenAI image_url
content part renders `<|image_pad|>` through the real template
condition (`'image_url' in item`).

Refs #16 / TP-vision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 16:25:30 +03:00
1b0e36c119 fix(neuron): cover TpForwardLogitsWithImages in drain_poisoned match
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m18s
build-prerelease / Build neuron-blackwell (push) Successful in 5m48s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
CI / Test (push) Successful in 6m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m26s
build-prerelease / Build neuron-ada (push) Successful in 5m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
The CUDA type-check caught a non-exhaustive match: drain_poisoned()
must reply an error to every Job variant's reply channel, including the
new cuda-gated TpForwardLogitsWithImages. The non-cuda build couldn't
see it — the variant is #[cfg(feature = "cuda")], so the match is
exhaustive without it on CPU.

Refs TP-vision plan Stage 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:26:46 +03:00
ed2d09864e feat(neuron): TP-vision Stage 3 — wire TP chat + stream vision prefill
Some checks failed
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m51s
CI / Test (push) Successful in 5m52s
CI / CUDA type-check (push) Failing after 50s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
End-to-end TP-vision: an image request to a TP-loaded Qwen3.6-27B now
conditions on the image across both ranks.

- TpLoadedModel carries has_vision / image_token_id / lm_tokens_per_image,
  populated at load via the shared VisionMeta::from_config_path (same
  config.json the shards loaded from; Stage 1 materialises the replicated
  tower on every rank).
- LoadedHandle::capabilities() now advertises "vision" for TP loads with
  a tower (cortex-gateway already unions this into /v1/models via C3).
- The TP rejection guards (chat_completion_tp + inference_tp_stream) are
  now conditional on !has_vision — text-only TP models still 400 cleanly,
  vision-capable ones fall through.
- chat_completion_tp_inner and the streaming orchestration task detect
  images (request_has_images), expand <|image_pad|> to the per-image
  patch count, and run a single-shot generate_step_with_images prefill
  (every rank encodes + splices its replicated tower) before the
  unchanged decode loop. Text requests keep chunked_prefill_tp.
- extract_image_data_uris ships the source data URIs to every rank for
  identical per-rank preprocessing.

prompt_tokens now reflects the patch expansion, so usage accounting and
KV offsets match the single-GPU baseline.

TP entry points are cuda-gated (validated by CI's CUDA type-check);
capabilities() + extract_image_data_uris + VisionMeta reuse compile on
the non-cuda build. Full workspace test green.

Refs TP-vision plan Stage 3. Implements #12.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:14:44 +03:00
4994b94c84 feat(neuron): TP-vision Stage 2 — per-rank image RPC + worker plumbing
Carry image content through the TP forward path so every rank encodes
and splices locally (replicated tower, no embedding broadcast).

- rpc.rs: new WorkerRequest::GenerateStepWithImages carrying the source
  image data URIs + image_token_id for the single-shot vision prefill;
  worker still replies GenerateStepOk. Round-trip test added.
- tp_qwen3_5.rs: TpQwen3_5ForCausalLM::forward_with_images — encode each
  preprocessed image through the rank's replicated tower, cat, splice,
  forward. Shared by leader and worker so every rank runs identical work.
- tp/mod.rs: TpLeaderModel::forward_with_images and
  WorkerPool::generate_step_with_images (mirrors generate_step: fan out
  GenerateStepWithImages to subprocess ranks, run the leader's image
  forward on its device worker thread, drain, combine).
- worker.rs: WorkerModel::forward_with_images + handle_generate_step_with_images
  — each subprocess rank preprocesses the same data URIs via the shared
  deterministic preprocess_data_uri, encodes, splices, forwards.
- device_worker: Job::TpForwardLogitsWithImages + tp_forward_logits_with_images
  dispatch handler + DeviceWorkerHandle::tp_forward_logits_with_images.

Determinism: every rank runs the same preprocess on the same source
URIs through the same replicated tower, so the spliced hidden state
matches across ranks — preserving the replicated-hidden-state invariant
the row-parallel AllReduce relies on, with no NCCL broadcast.

No caller yet — Stage 3 wires the TP chat/stream entry points to invoke
generate_step_with_images for image prefill. cuda-gated plumbing covered
by CI's CUDA type-check; rpc/route/forward_with_images compile on the
non-cuda build.

Refs TP-vision plan Stage 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:08:08 +03:00
9a24b05866 feat(neuron): TP-vision Stage 1 — replicated vision tower on the TP model
Load the full, unsharded model.visual.* vision tower on every TP rank
(leader + each subprocess worker mmaps the same local safetensors) when
config.vision_config is present. VisionTower::load already takes a
ShardedVarBuilder whose plain .get() returns the full replicated tensor,
so the tower loads identically regardless of world_size — no sharding,
no NCCL broadcast.

- TpQwen3_5ForCausalLM gains vision: Option<VisionTower> + image_token_id,
  plus has_vision/image_token_id/encode_image/forward_with_vision,
  mirroring the single-GPU Qwen3_5ForCausalLM wrapper.
- TpQwen3_5Model::forward_with_vision mirrors the single-GPU
  forward_inner splice: embed locally, replace rows at image_token_id
  positions, run the sharded decoder stack. Because every rank encodes
  the same pixels through its replicated tower, the spliced input
  embeddings are identical across ranks — preserving the TP
  replicated-hidden-state invariant the row-parallel AllReduce relies on.
- splice_runs is now pub(crate) and shared with the TP model.

No caller yet — Stage 2 wires the RPC/worker path that invokes
encode_image + forward_with_vision per rank. Most of this compiles on
the non-cuda build (only the cuda load variant's tower line is gated);
CI's CUDA type-check covers the rest.

Refs TP-vision plan Stage 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:00:05 +03:00
7bb033b4ed chore: untrack stray .claude/scheduled_tasks.lock and gitignore .claude/
All checks were successful
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 30s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Clippy (push) Successful in 2m45s
build-prerelease / Build cortex binary (push) Successful in 4m28s
CI / Test (push) Successful in 6m6s
build-prerelease / Build neuron-blackwell (push) Successful in 6m11s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m28s
build-prerelease / Build neuron-ampere (push) Successful in 8m1s
build-prerelease / Build neuron-ada (push) Successful in 8m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
A runtime scheduler lock was accidentally swept into the previous
commit by `git add -A`. Remove it from tracking (file stays on disk)
and ignore the whole `.claude/` dir so local agent runtime state never
lands in the repo again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:55:05 +03:00
f8c0da0ebf fix(neuron): TP-vision Stage 0 — reject image requests on the TP path
Some checks failed
build-prerelease / Resolve version stamps (push) Waiting to run
CI / Format (push) Waiting to run
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Clippy (push) Has been cancelled
CI / Test (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
The TP inference path has no vision tower, and the TP dispatch in
chat_completion / inference_stream returns before the VisionUnsupported
guard runs — so an image request to a TP-loaded model (e.g. beast's
tp=2 Qwen3.6-27B) was silently dropped and answered from text alone,
the exact issue-#3 confident-hallucination pattern Stage C killed for
single-GPU.

Add the request_has_images → VisionUnsupported guard to both
chat_completion_tp and inference_tp_stream, before prefill / before the
SSE stream opens, so beast returns a clean 400 vision_unsupported. The
guard is unconditional for now (TP has no tower); Stage 3 makes it
conditional on the TP model's has_vision once real TP-vision lands.

Detection is covered by the existing request_has_images unit test; the
guard itself is cuda-gated (validated by CI's CUDA type-check).

Refs TP-vision plan Stage 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:53:56 +03:00
dd592d918d test(neuron): C2 — guard Responses→chat image translation contract
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 44s
CI / Clippy (push) Successful in 2m51s
build-prerelease / Build cortex binary (push) Successful in 4m42s
build-prerelease / Build neuron-blackwell (push) Successful in 5m52s
CI / Test (push) Successful in 6m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m12s
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-ada (push) Successful in 5m34s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The Responses request translator already emits the chat `image_url`
Parts array Stage B5's vision path consumes, and the non-streaming
(`chat_completion`) and streaming (`responses_stream` → `inference_stream`,
Stage C1) Responses paths both route image content to the vision-aware
prefill — so vision works end-to-end through `/v1/responses` with no
translator change required.

Add a multi-image test asserting order preservation and that the
`detail` hint is tolerated (and dropped, since chat image_url has no
analogue), locking the translator's output to the exact
`image_url.url` shape `extract_images_from_request` walks.

Closes part of #16 (Stage C2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 13:57:43 +03:00
766c20ba47 feat(neuron): C1 — streaming SSE chat completion with vision
The streaming worker path now splices image embeddings on prefill,
closing the silent text-only degrade for `stream=true` image requests.

`inference_stream` gains the same vision-routing block as the
non-streaming `chat_completion`: detect `image_url` content, reject it
against text-only models with `VisionUnsupported` (before any SSE frame
is sent), preprocess each image and expand its `<|image_pad|>` sentinel
to the per-image patch count, then carry the payload through dispatch.

Rather than duplicate the 75-line `route_token!` reasoning/tool-call
state machine into a sibling streamer, `stream_inference_via_worker`
takes an `Option<(Vec<ImageInput>, u32)>`: when `Some`, prefill is a
single-shot `forward_logits_with_images` splice; when `None`, the
original chunked text-only prefill. Image embeddings are prefill-only,
so every decode step stays on the plain `forward_logits` path and the
shared decode loop is untouched. This keeps exactly one copy of the
tool-call/reasoning logic to maintain.

The Responses API streaming path (`responses_stream`) inherits vision
for free since it drives the same `inference_stream`.

Unit test covers `request_has_images` (the shared routing gate); the
real-weights SSE smoke is the manual curl on beast (cuda-integration).

Closes part of #16 (Stage C1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 13:57:02 +03:00
4972c7d1e7 feat(cortex-gateway): C3 — propagate vision capabilities through /v1/models
ModelEntry and CortexModelEntry gain a `capabilities: Vec<String>`
field (serde-default for back-compat). The poller copies it verbatim
from each neuron's ModelInfo.capabilities; list_models computes the
union across every node where a model is loaded so a checkpoint loaded
text-only on one neuron and text+vision on another reports both to the
fleet. Catalogue-only and mid-prewarm entries default to empty until
the catalogue gains a capabilities declaration.

Aliases inherit their target's capability union. New gateway test mocks
two nodes with differing capability arrays and asserts the unioned
/v1/models response.

Closes part of #16 (Stage C3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 13:49:54 +03:00
a26bb9f04b feat(deploy): capture service startup journal after each restart
After both `Start cortex.service` and `Start neuron.service`, sleep 10s
and run `journalctl --unit <unit> -I --no-pager` to record the latest
invocation's log in the workflow output. Step is guarded by
`if: always()` so a failed start still leaves a usable trace.

infra-setup.sh now adds gitea_ci to the systemd-journal group during
user provisioning, so `journalctl` works without a sudoers entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 16:48:56 +03:00
ea1fdf8aa6 chore(deploy): drop deploy.sh and manifest.yml now that workflow runs
First end-to-end run of the deploy workflow succeeded (gitea run #289),
so the operator-run rolling-deploy script and its YAML manifest are no
longer the source of truth — fleet topology lives in
.gitea/workflows/deploy.yml and per-host config in script/infra-setup.sh.

Per-host neuron config comments updated to point at the new sync path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 16:41:04 +03:00
577781de8d fix(neuron): derive Clone on ImageInput for the CUDA vision dispatch
All checks were successful
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m47s
build-prerelease / Build cortex binary (push) Successful in 4m34s
CI / Test (push) Successful in 6m14s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m58s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ampere (push) Successful in 8m5s
build-prerelease / Build neuron-ada (push) Successful in 8m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
CUDA type-check in CI failed on commit 24968e9 with E0308:

  error[E0308]: mismatched types
      --> crates/neuron/src/harness/candle.rs:1707:33
   1707 |                                 images.clone(),
        |                                 ^^^^^^^^^^^^^^ expected `Vec<ImageInput>`,
                                                          found `&Vec<ImageInput>`

In Stage B5 the cuda branch of `chat_completion` matches
`&vision_route` to keep the `vision_route: Option<...>` alive for
both arms, which makes `images` bind as `&Vec<ImageInput>`. The
subsequent `images.clone()` call doesn't deep-clone because
`ImageInput` doesn't derive `Clone` — rustc falls back to cloning
the `&Vec` reference, which has the wrong type for the worker job.

The CPU build (non-cuda) compiled fine because that branch is
behind `#[cfg(feature = "cuda")]`; the cuda-check job is what
catches the regression.

Fix: derive `Clone` on `ImageInput`. The clone cost is one
pixel-buffer memcpy per image (~2.4 MiB at fixed 448×448), which
is fine on the chat-completion hot path — vision requests are
rare per second relative to text-only decode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 15:51:57 +03:00