Compare commits

304 Commits

Author SHA1 Message Date
5c1623a817 fix(#49): allow-anonymous mode must ignore unrecognized keys, not 401
All checks were successful
Regression from #49: the auth middleware rejected ANY present-but-
unresolvable bearer token with 401 invalid_api_key, even when
require_auth=false. But OpenAI-compatible clients (opencode, Open WebUI,
Agent Zero, litellm) send a placeholder bearer by default, so deploying
the build broke every existing client even though the operator never
opted into auth. Pre-#49 the bearer was never inspected at all.

Fix: in allow-anonymous mode (require_auth=false, the default) an
unrecognized key is now ignored and the request is served anonymously,
restoring pre-#49 behaviour. A bad key only 401s when require_auth=true.
A valid key is still resolved + metered in both modes.
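
The decision table described above can be sketched as a small pure function. This is only an illustration, not the middleware itself: `AuthOutcome`, `decide`, and the `resolve` closure (standing in for the entitlement lookup) are hypothetical names.

```rust
/// Outcome of inspecting the bearer token (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum AuthOutcome {
    /// The key resolved: serve the request and meter it against this account.
    Authenticated(String),
    /// No usable key, but anonymous access is allowed.
    Anonymous,
    /// Reject with 401 invalid_api_key.
    Unauthorized,
}

/// `resolve` is an assumed stand-in for the key lookup; it returns the
/// account id for a recognized key and `None` otherwise.
pub fn decide<F>(bearer: Option<&str>, require_auth: bool, resolve: F) -> AuthOutcome
where
    F: Fn(&str) -> Option<String>,
{
    match bearer.and_then(|key| resolve(key)) {
        // A valid key is honoured in both modes.
        Some(account) => AuthOutcome::Authenticated(account),
        // Missing or unrecognized key only fails when auth is required.
        None if require_auth => AuthOutcome::Unauthorized,
        // Allow-anonymous mode: ignore the placeholder and serve anonymously.
        None => AuthOutcome::Anonymous,
    }
}
```

With `require_auth=false`, a placeholder such as `sk-placeholder` yields `Anonymous`; the same token under `require_auth=true` yields `Unauthorized`.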

Test renamed/split: unrecognized_key_is_ignored_when_auth_not_required
(now 200, served anonymously) + invalid_key_is_401_when_auth_required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 21:40:34 +03:00
3b60dd7a31 Merge #56 (phase 3): fail-fast prompt pre-validation + advisory hints
All checks were successful
2026-06-17 20:57:55 +03:00
4feaaf1cfb Merge #55 (phase 2d): cortex load-aware routing across replicas
Some checks failed
2026-06-17 20:51:26 +03:00
057bc71e80 feat(#47 #56 phase 3): fail-fast prompt pre-validation + advisory hints
All checks were successful
Stage 3 (DX): A0 burned an hour then failed deep in litellm with
prompt_too_long (35544 > 32768). cortex knows each model's real context
window (#62/#67) and can pre-empt that at the edge.

- Pre-validate the prompt against the model's advertised limit.context
  before dispatch (in proxy_with_metrics, covering chat/completions/
  responses). Over → 400 context_length_exceeded in the #60 envelope — the
  same shape neuron emits on overflow, just earlier and without burning a
  cold-load/queue slot. cortex has no tokenizer, so estimate_prompt_tokens
  under-counts (~4 chars/token over message text); neuron stays the exact
  wall and we only catch gross overages. Skipped when no limit is known.
- Advisory X-Helexa-Advice header: fingerprints User-Agent
  (litellm / Agent-Zero / Zed) and attaches client-specific guidance.
  Strictly advisory — header only, never in the error envelope, behaviour
  never depends on it; unknown clients get nothing.
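
A minimal sketch of the pre-validation step, assuming the estimate is simply character count divided by four. Only the name `estimate_prompt_tokens` comes from the message above; its body here, `PreCheck`, and `pre_validate` are assumptions for illustration.

```rust
/// Rough token estimate: about 4 characters per token over the message text.
/// The real function's exact arithmetic is not shown in the commit; this
/// body is an assumption.
pub fn estimate_prompt_tokens(messages: &[&str]) -> usize {
    let chars: usize = messages.iter().map(|m| m.chars().count()).sum();
    chars / 4
}

/// Result of the edge check (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum PreCheck {
    Pass,
    /// Would map to 400 context_length_exceeded.
    ContextLengthExceeded { estimated: usize, limit: usize },
}

/// Skips the check entirely when no context limit is known for the model.
pub fn pre_validate(messages: &[&str], context_limit: Option<usize>) -> PreCheck {
    match context_limit {
        None => PreCheck::Pass,
        Some(limit) => {
            let estimated = estimate_prompt_tokens(messages);
            if estimated > limit {
                PreCheck::ContextLengthExceeded { estimated, limit }
            } else {
                PreCheck::Pass
            }
        }
    }
}
```

Because the estimate is coarse, a prompt that passes here can still be refused by neuron, which remains the exact limit.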

3 integration tests: over-long prompt → 400 context_length_exceeded with
the advice header, refused before neuron is hit; within-context passes
through; unknown client gets a clean 400 with no advice header. cortex-side
(no CUDA); local fmt/clippy/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:50:38 +03:00
dd31c3cd49 feat(#47 #55 phase 2d): cortex load-aware routing across replicas
All checks were successful
Stage 2 completes: when a model is loaded on more than one healthy neuron,
the router picks the least-busy replica instead of always taking the first,
and neuron backpressure propagates to the client intact.

- NodeState.model_load: per-model admission load (in_flight + queue_depth),
  stashed by the poller from neuron's /health (#53/#2b).
- router::resolve collects all loaded replicas and picks the one with the
  lowest in_flight+queue_depth (ties break by node name for determinism),
  replacing the previous first-match-wins.
- Backpressure passthrough: the existing streaming proxy already forwards
  the upstream status + all headers verbatim, so a neuron 503/429 +
  Retry-After + #60 envelope reaches the client unmodified — now covered by
  a regression test so a future change can't silently unwrap it.
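
The selection rule (lowest in_flight+queue_depth, ties broken by node name) could look roughly like this. `Replica` and `pick_least_busy` are hypothetical names; the real `router::resolve` works on fleet state that is not shown here.

```rust
/// One healthy node that has the requested model loaded (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Replica {
    pub node: String,
    pub in_flight: u32,
    pub queue_depth: u32,
}

/// Picks the replica with the lowest combined load. Equal loads fall back to
/// the node name so the choice is deterministic.
pub fn pick_least_busy(replicas: &[Replica]) -> Option<&Replica> {
    replicas.iter().min_by(|a, b| {
        let load_a = a.in_flight + a.queue_depth;
        let load_b = b.in_flight + b.queue_depth;
        load_a.cmp(&load_b).then_with(|| a.node.cmp(&b.node))
    })
}
```

An empty slice returns `None`, which corresponds to the case where no healthy replica has the model loaded.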

Tests (tests/load_routing.rs): routes to the idle replica and follows the
lighter load when it flips; ties break by name; a saturated neuron's 503 +
Retry-After + envelope propagates through the gateway intact. All
cortex-side (no CUDA); local fmt/clippy/test green.

Retry-route-to-another-replica-on-backpressure (the issue's stretch goal)
is deferred — least-busy spread + honest passthrough is the substantive win.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:45:50 +03:00
c83f1eb98c feat(#47 #54 phase 2c): neuron per-principal in-flight cap (fair-share)
Some checks failed
Budget caps total spend over time (#52); this caps instantaneous
starvation so one principal's burst can't monopolize a model while others
wait.

- AdmissionController gains per-principal accounting (moved from a lone
  atomic to a Mutex<AdmissionState> holding the overall pending count + a
  per-principal map). enter(principal) now also fast-rejects when a
  principal already has max_per_principal requests in flight/queued →
  AdmissionRejection::PrincipalCap. Anonymous (None) requests are exempt.
- Config [harness.candle.admission].max_per_principal (default 2 = one
  running + one queued; 0 disables). A bursting principal's overflow is
  refused while a different principal still gets a queue slot.
- The principal (account/key) is reconstructed on the neuron side from the
  x-helexa-account-id/key-id headers cortex stamps (#49) — trusted over
  WireGuard, never from the request body — and threaded explicitly through
  all inference entry points (chat_completion, *_stream(_with),
  responses_stream, and the TP variants) to the admission gate.
- InferenceError::PerPrincipalLimit → 429 rate_limit_exceeded + Retry-After
  (distinct from load-shedding's 503 Overloaded); opencode/AI SDK self-pace.
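
A simplified sketch of the per-principal accounting, assuming a single Mutex around the overall pending count and a per-principal map as described above. `Admission`, `Rejection`, and the explicit `leave` call are illustrative; the real controller hands out a permit that releases on drop and also waits with a timeout, which this sketch omits.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum Rejection {
    QueueFull,
    PrincipalCap,
}

#[derive(Default)]
struct State {
    pending: usize,
    per_principal: HashMap<String, usize>,
}

pub struct Admission {
    max_pending: usize,       // running + queued slots overall
    max_per_principal: usize, // 0 disables the per-principal cap
    state: Mutex<State>,
}

impl Admission {
    pub fn new(max_pending: usize, max_per_principal: usize) -> Self {
        Self { max_pending, max_per_principal, state: Mutex::new(State::default()) }
    }

    /// Anonymous requests (`None`) are exempt from the per-principal cap.
    pub fn enter(&self, principal: Option<&str>) -> Result<(), Rejection> {
        let mut s = self.state.lock().unwrap();
        if let Some(p) = principal {
            let held = s.per_principal.get(p).copied().unwrap_or(0);
            if self.max_per_principal > 0 && held >= self.max_per_principal {
                return Err(Rejection::PrincipalCap);
            }
        }
        if s.pending >= self.max_pending {
            return Err(Rejection::QueueFull);
        }
        s.pending += 1;
        if let Some(p) = principal {
            *s.per_principal.entry(p.to_string()).or_insert(0) += 1;
        }
        Ok(())
    }

    /// Releases one slot; in this sketch the caller must pair it with `enter`.
    pub fn leave(&self, principal: Option<&str>) {
        let mut s = self.state.lock().unwrap();
        s.pending = s.pending.saturating_sub(1);
        if let Some(p) = principal {
            let drained = match s.per_principal.get_mut(p) {
                Some(n) => {
                    *n = n.saturating_sub(1);
                    *n == 0
                }
                None => false,
            };
            if drained {
                s.per_principal.remove(p);
            }
        }
    }
}
```

With `max_per_principal = 2`, a third concurrent request from the same principal is refused with `PrincipalCap` while a different principal can still take a slot.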

Tests: fair-share unit test (A floods → A's 2nd is PrincipalCap, B still
queues + is served) + the existing admission tests adapted to enter(None).
Non-CUDA build green locally; TP entry points (cuda-gated) validated by CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:40:25 +03:00
a60c9f1075 feat(#47 #53 phase 2b): expose per-model admission load in GET /health
All checks were successful
Completes #53: the bounded scheduler's lock-free counters are now visible
to the fleet, which is what cortex's load-aware router (#55) consumes to
spread traffic across replicas and propagate honest backpressure.

- cortex-core::discovery: HealthResponse gains `models: Vec<ModelLoad>`
  (#[serde(default)] — back-compatible; older gateways/neurons interop).
  ModelLoad { id, in_flight, queue_depth }.
- LoadedHandle::load() → (in_flight, queue_depth), lock-free for both
  single-GPU and TP; CandleHarness::load_snapshot() enumerates resident
  models; the /health handler overlays it from the candle harness.

Tests: /health always exposes a models array (api integration test); a
pre-#53 payload without `models` still deserializes, and ModelLoad
round-trips (cortex-core serde tests). Local fmt/clippy/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:13:07 +03:00
b2bd86bfa5 feat(#47 #53 phase 2a): neuron admission control — bounded queue + backpressure
All checks were successful
Replaces the per-model unbounded, untimed FIFO of inference-lock waiters
(a busy model made new requests hang ~300s until the client gave up with
an opaque error) with an explicit bounded scheduler.

- harness::admission::AdmissionController: batch-1 scheduler — max_in_flight
  running (1) + a bounded queue (max_queue_depth) with a max_wait. enter()
  fast-rejects when the queue is full (QueueFull) or the wait elapses
  (Timeout); the returned AdmissionPermit is held for the request and frees
  both slots on drop. Pure async (no CUDA), lock-free in_flight/queue_depth
  counters for future /health reporting. Configurable via
  [harness.candle.admission] (max_in_flight=1, max_queue_depth=8,
  max_wait_secs=30).
- Gated at all four inference entry points before the inference_lock/pool
  lock: single-GPU non-streaming + streaming, TP non-streaming + streaming.
  The streaming paths acquire the permit before opening the SSE (so a
  rejection is a clean error, not a half-open stream) and move it into the
  inference task.
- InferenceError::Overloaded { retry_after_secs } → 503 rate_limit_exceeded
  + Retry-After via the #60/#63 envelope: a fast, retryable "busy" signal
  opencode/AI SDK back off on, not a stall.
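
The fast-reject and release-on-drop behaviour can be illustrated with a lock-free counter and an RAII permit. This is a deliberately reduced sketch: `Gate`, `Permit`, and `try_enter` are hypothetical names, the running slot and the queue are collapsed into one capacity, and the max_wait timeout is left out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Bounded admission gate: capacity = max_in_flight + max_queue_depth.
pub struct Gate {
    capacity: usize,
    pending: AtomicUsize,
}

/// Held for the lifetime of a request; frees its slot when dropped.
pub struct Permit {
    gate: Arc<Gate>,
}

impl Gate {
    pub fn new(max_in_flight: usize, max_queue_depth: usize) -> Arc<Self> {
        Arc::new(Self {
            capacity: max_in_flight + max_queue_depth,
            pending: AtomicUsize::new(0),
        })
    }

    /// Returns `None` immediately when every slot is taken (the QueueFull case).
    pub fn try_enter(self: &Arc<Self>) -> Option<Permit> {
        let mut current = self.pending.load(Ordering::Acquire);
        loop {
            if current >= self.capacity {
                return None;
            }
            match self.pending.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit { gate: Arc::clone(self) }),
                Err(actual) => current = actual, // lost a race; retry with the fresh value
            }
        }
    }

    /// Lock-free read of the current load, suitable for health reporting.
    pub fn pending(&self) -> usize {
        self.pending.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.gate.pending.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Acquiring the permit before opening a stream, then moving it into the worker task, keeps a rejection a clean error rather than a half-open response.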

Scope: this branch is the admission *core* (the hang→backpressure fix).
Exposing in_flight/queue_depth in GET /health (consumed by cortex
load-aware routing #55) is the next focused branch under #53.

4 unit tests (admit/report load, queue-full reject, wait-timeout reject)
+ Overloaded envelope mapping test. Non-CUDA build green locally; the
CUDA + TP sites are validated by branch CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:03:07 +03:00
cdf87284af feat(#47 phase 1d): budget enforcement — hard caps, reserve→settle, 429
All checks were successful
Stage 1 complete: the A0 seatbelt (#52). Flips the metering-only reserve(0)
from #51 to the request's real upper-bound cost and refuses over-cap
requests *before* neuron is hit.

- metering::reservation_estimate: prompt estimate (~4 chars/token over the
  body — cortex has no tokenizer, so a conservative over-estimate; neuron
  stays the exact context wall) + max output. Max output comes from
  max_completion_tokens / legacy max_tokens, else the model's advertised
  limit.output (#62), else FALLBACK_MAX_OUTPUT. Over-reserving is safe —
  settle reconciles to actual.
- metering::reserve_or_reject: reserve the estimate; on BudgetError map to
  the #63 envelope and the caller refuses before dispatch — rolling window →
  429 rate_limit_exceeded + Retry-After (until reset); hard balance → 429
  insufficient_quota (no Retry-After). Never 402.
- Wired into both the OpenAI proxy path (proxy_with_metrics) and the
  Anthropic path (estimate from the translated body). advertised_output_limit
  reads the loaded model's limit.output from fleet state.
- Reservation prevents overshoot under concurrency: a successful reserve
  gates on spent+reserved+estimate ≤ cap, and settle records actual ≤
  reserved, so spend can never exceed the hard cap.
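
The overshoot guarantee follows from two rules: reserve only if spent+reserved+estimate ≤ cap, and settle at no more than what was reserved. A minimal single-threaded sketch, with hypothetical `Budget` and `OverCap` types (the real provider serializes this under a lock and distinguishes balance from rolling windows):

```rust
/// Returned when a reservation would push the account past its hard cap.
#[derive(Debug, PartialEq)]
pub struct OverCap;

#[derive(Debug, Default)]
pub struct Budget {
    pub cap: u64,
    pub spent: u64,
    pub reserved: u64,
}

impl Budget {
    /// Admits the request only if spent + reserved + estimate stays within cap.
    pub fn reserve(&mut self, estimate: u64) -> Result<u64, OverCap> {
        let committed = self
            .spent
            .saturating_add(self.reserved)
            .saturating_add(estimate);
        if committed > self.cap {
            return Err(OverCap);
        }
        self.reserved += estimate;
        Ok(estimate)
    }

    /// Releases the reservation and records the actual usage. Clamping
    /// `actual` to `reserved` is an assumption of this sketch.
    pub fn settle(&mut self, reserved: u64, actual: u64) {
        self.reserved = self.reserved.saturating_sub(reserved);
        self.spent += actual.min(reserved);
    }
}
```

Since every admitted request holds its full estimate until it settles, concurrent requests cannot jointly exceed the cap even when each one alone would fit.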

4 integration tests with a hit-counting mock neuron: balance over-cap →
429 insufficient_quota (no Retry-After, not dispatched); rolling over-cap →
429 rate_limit_exceeded + Retry-After (not dispatched); within-cap served;
**A0 repro** — a capped key's 20-request fan-out drains the cap, then is
refused, neuron only saw the served ones, and spend never exceeds the cap.
Plus 5 metering unit tests. Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:35:04 +03:00
4f16b8c541 feat(#47 phase 1c): per-request token metering + spend ledger
All checks were successful
Stage 1 accounting (#51): capture real per-request usage and feed it to
the spend ledger + per-principal metrics. Establishes the reserve→settle
lifecycle that budget enforcement (#52) will tighten.

- cortex-gateway::metering: ReservationGuard makes reservation leaks
  impossible — settle() records actual spend + releases the remainder;
  dropping an un-settled guard releases the whole reservation, so any
  early return / error / dropped stream resolves it. UsageSink is the
  completion hook; principal_from_headers reconstructs the principal from
  the middleware-stamped headers (uniform across all proxy paths, no
  handler-signature churn); record_spend emits per-principal counters.
- proxy::TokenMetrics gains an optional usage_sink, invoked exactly once
  in finish() with the observed (prompt, completion) — restructured so it
  always runs (even when no body/usage arrived → settle 0 → release),
  while preserving the existing per-model metric emissions unchanged.
- All proxy paths metered: chat/completions/responses via
  proxy_with_metrics (reserve 0 → forward_request → settle in finish);
  Anthropic non-streaming settles from the buffered body; Anthropic
  streaming (anthropic_sse) now scans the upstream frames for the usage
  object (#48) — it captured none before — and settles at pump end.
- This phase reserves 0 tokens (metering only, no enforcement); #52 flips
  the reserved amount to prompt+max_output and surfaces BudgetError. The
  settle/release plumbing is identical, so that change is localized.
- New Prometheus counters: cortex_spend_tokens_total (+ prompt/completion
  splits), labelled by account/key.
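
The leak-proofing idea behind ReservationGuard is ordinary RAII: settle consumes the guard, and Drop releases whatever was never settled. The sketch below assumes a shared in-memory `Ledger` behind a Mutex; that type and the `reserve` constructor are illustrative, not the crate's actual API.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Default)]
pub struct Ledger {
    pub spent: u64,
    pub reserved: u64,
}

pub struct ReservationGuard {
    ledger: Arc<Mutex<Ledger>>,
    amount: u64,
    settled: bool,
}

impl ReservationGuard {
    /// Records the reservation and returns a guard that owns it.
    pub fn reserve(ledger: Arc<Mutex<Ledger>>, amount: u64) -> Self {
        ledger.lock().unwrap().reserved += amount;
        Self { ledger, amount, settled: false }
    }

    /// Records actual spend and releases the whole reservation.
    pub fn settle(mut self, actual: u64) {
        {
            let mut ledger = self.ledger.lock().unwrap();
            ledger.reserved -= self.amount;
            ledger.spent += actual;
        }
        // Mark as settled so Drop does not release a second time.
        self.settled = true;
    }
}

impl Drop for ReservationGuard {
    fn drop(&mut self) {
        // Early return, error, or dropped stream: give the reservation back.
        if !self.settled {
            self.ledger.lock().unwrap().reserved -= self.amount;
        }
    }
}
```

Reserving 0, as this phase does, exercises the same path: the guard still settles or releases exactly once.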

2 integration tests: cumulative per-key spend after N requests with
reservations settled to zero outstanding; anonymous requests record no
spend. Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:29:51 +03:00
486d7e9a8f feat(#47 phase 1b): API-key auth + principal resolution
All checks were successful
Stage 1 identity (#49): cortex now knows who a request is for. Identity
rides standard bearer auth only (Authorization: Bearer <key>) — no custom
required headers or body fields — which is what keeps every tier
OpenAI-compatible by construction.

- cortex-gateway::auth: `require_principal` axum middleware
  (from_fn_with_state), wired in build_app outer-to-inner as
  trace → CORS → auth → handlers (CORS outer so preflight short-circuits).
  It resolves the bearer key via the EntitlementProvider, inserts the
  typed Principal into request extensions (for metering #51 / enforcement
  #52), and stamps internal x-helexa-account-id / x-helexa-key-id headers
  so the principal reaches neuron, which trusts cortex over WireGuard (#54).
- Anti-spoofing: client-supplied principal headers are stripped before the
  authoritative value is stamped — a client can never assert a principal
  it didn't authenticate as.
- Rejection contract (#63): missing key under require_auth, or any present
  but unresolvable key, → 401 invalid_api_key in the #60 envelope. /health
  and / stay public. require_auth=false (default) allows anonymous through
  but still 401s a present-but-invalid key.
- Header-name constants (HEADER_ACCOUNT_ID/KEY_ID) live in cortex-core so
  neuron (#54) shares them. The chat/completions/responses paths forward
  the stamped headers automatically via proxy::forward_request; the
  Anthropic streaming + non-streaming paths forward them explicitly via
  auth::forward_principal_headers (they build their own upstream requests).
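
The strip-then-stamp order is what makes the internal headers trustworthy. A sketch using a plain `HashMap` as a stand-in for the request header map (an assumption; the middleware operates on real HTTP headers), with the header names taken from the message above and `stamp_principal` as a hypothetical helper:

```rust
use std::collections::HashMap;

pub const HEADER_ACCOUNT_ID: &str = "x-helexa-account-id";
pub const HEADER_KEY_ID: &str = "x-helexa-key-id";

/// Removes any client-supplied principal headers, then stamps the
/// authenticated principal (account id, key id) if there is one.
pub fn stamp_principal(
    headers: &mut HashMap<String, String>,
    principal: Option<(&str, &str)>,
) {
    // Strip first, case-insensitively, so a spoofed value can never survive.
    headers.retain(|name, _| {
        let lower = name.to_ascii_lowercase();
        lower != HEADER_ACCOUNT_ID && lower != HEADER_KEY_ID
    });
    if let Some((account_id, key_id)) = principal {
        headers.insert(HEADER_ACCOUNT_ID.to_string(), account_id.to_string());
        headers.insert(HEADER_KEY_ID.to_string(), key_id.to_string());
    }
}
```

An anonymous request therefore reaches the upstream with no principal headers at all, regardless of what the client sent.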

5 integration tests: missing-key 401, invalid-key 401 (even when auth not
required, not dispatched), valid key reaches neuron with principal headers
+ spoofed header stripped, anonymous allowed when not required, /health
public. Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:07:10 +03:00
bc74e0e95f feat(#47 phase 1a): EntitlementProvider trait + local/static provider
Some checks failed
Stage 1's build seam (#50): the interface auth, metering, and budget
enforcement all hang off, with a local/static provider so the A0
amplification fix can land before any upstream clearing house exists.
The future helexa-upstream client (#57) is just another impl.

- cortex-core::entitlements: Principal {account_id, key_id}, CapWindow
  (Balance | Rolling{seconds}), Reservation handle, BudgetSnapshot,
  AuthError/BudgetError, and the async EntitlementProvider trait
  (resolve / reserve / settle / release / snapshot). BudgetError carries
  the window semantics so callers pick the #63 code (rate_limit_exceeded
  + Retry-After vs insufficient_quota) without the provider touching HTTP.
- cortex-core::config: [entitlements] section on GatewayConfig
  (require_auth + [[entitlements.keys]] with account_id, optional key_id,
  hard_cap, window). Additive + serde(default) — anonymous/uncapped when
  omitted, so existing setups are unaffected.
- cortex-gateway::entitlements_local: LocalEntitlementProvider. Budget
  math serialized under one Mutex so spent+reserved can never exceed a
  hard cap under concurrency (the #52 guarantee); rolling windows reset
  lazily; uncapped keys (no hard_cap) always reserve but still meter.
- CortexState gains Arc<dyn EntitlementProvider> + require_auth, built in
  from_config. Not yet consumed by the request path — auth middleware is
  1b (#49), enforcement is 1d (#52).
- cortex.example.toml documents the section; test GatewayConfig literals
  updated for the new field.

6 provider unit tests (resolve, unknown-key, round-trip, balance/rolling
over-cap codes, uncapped infra key). Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:00:05 +03:00
f22d83df14 feat(#47 phase 0): centralize OpenAI error envelope + add Retry-After
Some checks failed
The rejection contract (#63) requires every "no" path to speak the
OpenAI envelope with standard codes and, for retryable conditions, a
Retry-After header. Two gaps remained despite #63 being closed:
Retry-After was implemented nowhere, and the envelope was hand-built
inline in four places (gateway handlers/proxy/router, neuron api) with
no shared source of truth — exactly the inconsistency #63 set out to
prevent, and a foundation every Stage 1-2 rejection (401/429/503) needs.

- cortex-core: new `error_envelope::OpenAiError` — an axum-agnostic
  builder carrying status, type, code, message, param, optional
  retry_after, and diagnostic extras. Named constructors encode the #63
  codes (invalid_api_key, rate_limit_exceeded, insufficient_quota,
  context_length_exceeded, service_unavailable) and which carry
  Retry-After. cortex-core stays a pure types crate; each HTTP crate
  owns a thin `envelope_response` adapter that sets the header.
- cortex-gateway: route error_response, ProxyError, and RouteError
  through the shared builder; RouteError::retry_after_secs wires
  Retry-After on the transient NoHealthyNodes (5s) / ModelRecovering
  (2s) variants.
- neuron: route inference_error_response through the shared builder;
  InsufficientVram (transient 503) now advertises Retry-After: 5.

Behaviour for existing paths is unchanged (same status/type/code/extras);
only the new Retry-After headers are added. Tests cover the builder wire
shape and Retry-After presence/absence on both sides.
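
A minimal sketch of the shared-builder idea, assuming a std-only stand-in for
`error_envelope::OpenAiError`. The field names, the `headers()` helper, and the
choice of `invalid_request_error` for the 401 case are assumptions made for
illustration; the real type also carries `param` and diagnostic extras and is
turned into a response by each crate's `envelope_response` adapter.

```rust
/// Hypothetical, framework-agnostic error description (not the real cortex-core type).
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiError {
    pub status: u16,
    pub kind: &'static str,
    pub code: &'static str,
    pub message: String,
    pub retry_after_secs: Option<u64>,
}

impl OpenAiError {
    /// Not retryable, so it never carries Retry-After.
    pub fn invalid_api_key(message: impl Into<String>) -> Self {
        Self {
            status: 401,
            kind: "invalid_request_error", // assumption: type string for 401
            code: "invalid_api_key",
            message: message.into(),
            retry_after_secs: None,
        }
    }

    /// Transient condition: the caller supplies the retry hint (e.g. 5s or 2s).
    pub fn service_unavailable(message: impl Into<String>, retry_after_secs: Option<u64>) -> Self {
        Self {
            status: 503,
            kind: "api_error",
            code: "service_unavailable",
            message: message.into(),
            retry_after_secs,
        }
    }

    /// Headers an HTTP adapter would set; Retry-After only when a hint exists.
    pub fn headers(&self) -> Vec<(&'static str, String)> {
        let mut headers = vec![("content-type", "application/json".to_string())];
        if let Some(secs) = self.retry_after_secs {
            headers.push(("retry-after", secs.to_string()));
        }
        headers
    }
}
```

Keeping the builder free of any HTTP framework type is what lets both the
gateway and the neuron share it while each owns its own thin adapter.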

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 18:46:56 +03:00
4b28a64b34 feat(#67 phase 5b): enforce the derived input as the prompt cap
All checks were successful
CI / Format (push) Successful in 39s
CI / CUDA type-check (push) Successful in 1m38s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 1m42s
build-prerelease / Build neuron-ada (push) Successful in 2m15s
build-prerelease / Build neuron-ampere (push) Successful in 2m17s
build-prerelease / Build helexa-bench binary (push) Successful in 2m23s
build-prerelease / Build cortex binary (push) Successful in 2m29s
build-prerelease / Test (push) Successful in 4m28s
build-prerelease / Package cortex RPM (push) Successful in 1m15s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 51s
The request path now rejects prompts above the model's self-derived input
budget, not the static NEURON_MAX_PROMPT_TOKENS — so a VRAM-tight host
(where the VRAM ceiling binds below the static cap) rejects an
over-budget prompt up front instead of accepting it and OOMing
mid-prefill.

- derived_input_cap: AtomicUsize on LoadedModel + TpLoadedModel; refreshed
  by LoadedHandle::derived_limit (runs on every /models poll). 0 = not
  derived yet.
- effective_prompt_cap(): cached derived input when >0, else the static
  max_prompt_tokens() (cold-start / no-profile fallback).
- validate_request takes the cap as a param; all 4 call sites
  (chat_completion, inference_stream, inference_tp_stream, TP
  chat_completion) pass the in-scope model's effective_prompt_cap().
- doc/context-limits.md: enforcement note updated from "remaining" to
  "landed".

Reads the cap lock-free from the sync validate path (no per-request VRAM
query); the cap tracks live state via the poll-driven derivation. With
this, advertise and enforce agree and both track the resident model.
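
The cap selection described above can be sketched as follows. This is an
illustration under assumptions, not the neuron's code: `PromptCaps`,
`store_derived`, and the `PromptTooLong` struct are hypothetical names, and
`Ordering::Relaxed` is assumed to be sufficient for a single independent
counter.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical holder for the two caps a loaded model knows about.
pub struct PromptCaps {
    /// Refreshed by the poll-driven derivation; 0 means "not derived yet".
    derived_input_cap: AtomicUsize,
    /// Static fallback (cold start or no context profile).
    static_max_prompt_tokens: usize,
}

impl PromptCaps {
    pub fn new(static_max_prompt_tokens: usize) -> Self {
        Self {
            derived_input_cap: AtomicUsize::new(0),
            static_max_prompt_tokens,
        }
    }

    /// Called from the derivation path whenever a new limit is computed.
    pub fn store_derived(&self, cap: usize) {
        self.derived_input_cap.store(cap, Ordering::Relaxed);
    }

    /// Lock-free read used by the synchronous validate path.
    pub fn effective_prompt_cap(&self) -> usize {
        match self.derived_input_cap.load(Ordering::Relaxed) {
            0 => self.static_max_prompt_tokens,
            derived => derived,
        }
    }
}

#[derive(Debug, PartialEq)]
pub struct PromptTooLong {
    pub prompt_len: usize,
    pub max: usize,
}

/// The cap is a parameter, so the validator never queries VRAM itself.
pub fn validate_request(prompt_len: usize, cap: usize) -> Result<(), PromptTooLong> {
    if prompt_len > cap {
        Err(PromptTooLong { prompt_len, max: cap })
    } else {
        Ok(())
    }
}
```

Passing the cap in as a parameter keeps `validate_request` pure, which is what
makes it cheap to call from every request path.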

fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:26:37 +03:00
dd65eedb24 feat(#67 phase 5a): NEURON_MAX_PROMPT_TOKENS becomes a clamp-only backstop; docs
All checks were successful
CI / Format (push) Successful in 31s
CI / CUDA type-check (push) Successful in 1m49s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m24s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Demotes the static per-host prompt cap from authority to an optional
upper-bound clamp on the self-derived limit, and rewrites the
context-limits doc around the computed model.

- max_prompt_tokens_clamp(): reads NEURON_MAX_PROMPT_TOKENS directly so
  "explicitly set" is distinct from the 16384 default; returns None when
  unset (no clamp). Applied as derive_limit's hard_ceiling in
  LoadedHandle::derived_limit, so the advertised context is clamped only
  when an operator set a backstop — the derivation is otherwise
  authoritative and binds below it in practice.
- doc/context-limits.md: intro + "After #62" rewritten as "After #67 —
  the neuron computes its own limit" (formula, live signals, config
  block, opencode note, NEURON_MAX_PROMPT_TOKENS demotion).
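
A rough sketch of the "explicitly set versus default" distinction. The helper
names `parse_clamp` and `apply_clamp` are hypothetical, and treating an
unparseable or zero value as "unset" is an assumption, not something the commit
states.

```rust
/// Parse a raw env value into an optional clamp.
/// Assumption: empty, unparseable, or zero values count as "no clamp".
pub fn parse_clamp(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
}

/// Reads the variable directly, so an unset variable yields None
/// rather than a built-in default such as 16384.
pub fn max_prompt_tokens_clamp() -> Option<usize> {
    parse_clamp(std::env::var("NEURON_MAX_PROMPT_TOKENS").ok().as_deref())
}

/// Apply the clamp as an upper bound only; the derived value wins otherwise.
pub fn apply_clamp(derived: usize, clamp: Option<usize>) -> usize {
    clamp.map_or(derived, |c| derived.min(c))
}
```

Returning `Option<usize>` rather than a number is the point of the design: a
default baked into the getter would be indistinguishable from an operator
choice.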

Remaining (phase 5b, follow-up): enforce the *derived* input as the
prompt cap (reject above computed input, not the static
NEURON_MAX_PROMPT_TOKENS) so VRAM-tight hosts can't accept an
OOM-inducing prompt. Needs a per-model cached cap read from the sync
validate path; scoped separately. Until then the static cap remains the
enforced backstop (advertised <= enforced holds when the env is set).

fmt/clippy/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:14:34 +03:00
8b2e01a072 feat(#67 phase 4): advertise neuron-computed limit on /models; drop catalogue override
Some checks failed
CI / Test (push) Waiting to run
CI / Format (push) Successful in 35s
CI / CUDA type-check (push) Successful in 2m12s
CI / Clippy (push) Successful in 2m10s
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
The neuron now self-derives and advertises limit{context,input,output}
per loaded model; cortex forwards it and stops consulting the
operator-declared catalogue limit (which can't track hot-swapped models
or live capacity). Operator-set `cost` still flows from the catalogue.

neuron:
- CandleHarness gains context_limit_cfg (from [harness.candle.context_limit]).
- LoadedHandle::derived_limit(): profile + live tightest-card free VRAM
  (single: query_vram; TP: query_vram_tightest_free_mb) + prefill-rate
  EMA (bootstrap until first sample) → derive_limit. None for arches
  without a context profile. No operator clamp here (advertise the honest
  derived value; the clamp is an enforcement-side backstop).
- list_models() fills ModelInfo.limit from derived_limit (was None).
- derive_limit treats free_tightest_mb == 0 (unknown/CPU sentinel) as
  "no VRAM ceiling" instead of collapsing to zero.

cortex:
- ModelEntry gains `limit`, copied from ModelInfo.limit by the poller.
- /v1/models: catalogue `limit` no longer flows (Pass 1 sets None);
  Pass 2 adopts the neuron's limit, taking the tightest across neurons
  via tightest_limit(). cost unchanged.
- model_limits.rs rewritten: catalogue limit (999999) is ignored; the
  neuron's ModelEntry.limit is advertised; cost still from catalogue.
- All ModelEntry literals updated with the new field.
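
As an illustration of the Pass 2 merge, here is one way `tightest_limit()` could
work. The `Limit` shape and the rule "tightest means smallest `context`" are
assumptions; the commit does not spell out the comparison.

```rust
/// Hypothetical wire shape for limit{context,input,output}.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Limit {
    pub context: u32,
    pub input: Option<u32>,
    pub output: u32,
}

/// Pick the tightest limit among the neurons serving a model.
/// Neurons that report no limit (None) are skipped; if none report one,
/// the model advertises no limit at all.
pub fn tightest_limit<I>(limits: I) -> Option<Limit>
where
    I: IntoIterator<Item = Option<Limit>>,
{
    limits.into_iter().flatten().min_by_key(|l| l.context)
}
```

Taking the minimum means the advertised limit holds on whichever neuron the
request is eventually routed to.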

fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:10:20 +03:00
464b6b0db9 feat(neuron): self-measured prefill tok/s EMA on streaming paths (#67 phase 3)
All checks were successful
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m13s
CI / Test (push) Successful in 4m30s
CI / CUDA type-check (push) Successful in 1m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Refs #67. Feeds the throughput ceiling a live, per-model prefill rate
instead of only the configured bootstrap estimate, so the advertised
limit tracks real prefill speed and rises automatically as prefix
caching (#11) reduces effective prefill cost.

- context_limit::PrefillRateEma: lock-free f64-bits EMA (alpha 0.3),
  ignores degenerate samples, None before the first sample. Unit-tested.
- prefill_rate field on LoadedModel + TpLoadedModel.
- Recorded as total-prompt-tokens / prefill-elapsed in the two streaming
  serving paths (TP: inference_tp_stream via tp_for_task; single-GPU:
  stream_inference_via_worker via a new &prefill_rate param threaded from
  loaded_for_task). Measuring total prompt (not just the divergent
  suffix) means a prefix-cache hit shrinks elapsed while the prompt stays
  large, so the effective rate — and the ceiling — rises toward the VRAM
  ceiling, exactly the #11 payoff.
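
A sketch of a lock-free EMA over f64 bit patterns, in the spirit of
`PrefillRateEma`. The sentinel value, the memory orderings, and the exact
definition of a "degenerate" sample are assumptions for illustration; only the
alpha of 0.3 and the "None before the first sample" behaviour come from the
commit text.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sentinel bit pattern meaning "no sample yet" (a NaN encoding, assumed unused).
const EMPTY: u64 = u64::MAX;
const ALPHA: f64 = 0.3;

pub struct PrefillRateEma {
    bits: AtomicU64,
}

impl PrefillRateEma {
    pub fn new() -> Self {
        Self { bits: AtomicU64::new(EMPTY) }
    }

    /// Record total prompt tokens over prefill elapsed time.
    pub fn record(&self, prompt_tokens: usize, elapsed_secs: f64) {
        // Ignore degenerate samples (assumed: empty prompt, non-positive or non-finite time).
        if prompt_tokens == 0 || !elapsed_secs.is_finite() || elapsed_secs <= 0.0 {
            return;
        }
        let sample = prompt_tokens as f64 / elapsed_secs;
        if !sample.is_finite() {
            return;
        }
        // Retry loop: recompute from the latest value if another thread raced us.
        let _ = self.bits.fetch_update(Ordering::AcqRel, Ordering::Acquire, |old| {
            let next = if old == EMPTY {
                sample
            } else {
                ALPHA * sample + (1.0 - ALPHA) * f64::from_bits(old)
            };
            Some(next.to_bits())
        });
    }

    /// None until the first valid sample has been recorded.
    pub fn get(&self) -> Option<f64> {
        match self.bits.load(Ordering::Acquire) {
            EMPTY => None,
            bits => Some(f64::from_bits(bits)),
        }
    }
}
```

Because the numerator is the whole prompt, a prefix-cache hit that halves the
elapsed time doubles the recorded rate, which is the effect the bullet above
describes.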

Per the agreed scope, non-streaming + CPU paths fall back to the
bootstrap estimate (opencode streams; those paths rarely carry the
fleet). fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:02:02 +03:00
f2e05d96ec feat(neuron): capture ContextProfile at load + per-rank VRAM fan-out (#67 phase 2)
All checks were successful
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m14s
CI / Test (push) Successful in 4m38s
CI / CUDA type-check (push) Successful in 1m30s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Refs #67. Captures the per-model context physics at load and adds the
live free-VRAM signal the derivation needs — the tightest card across TP
ranks, not just the leader.

- ContextProfile captured at load:
  - single-GPU dense CUDA path (world_size 1) via
    context_limit::profile_from_qwen3_5_config(config_path, ..);
  - TP path (world_size = tp_size) at TpLoadedModel construction.
  GGUF/CPU/non-qwen3_5 → None (fall back to the static prompt cap).
  New `context_profile` field on LoadedModel + TpLoadedModel.
- profile_from_qwen3_5_config(): reads config.json (mirrors
  VisionMeta::from_config_path), counts full_attention layers
  (layer_types authoritative, full_attention_interval fallback), builds
  the per-card KV cost via the shared helper.
- Folded the inline per-rank KV-bytes math in tp_qwen3.rs (both
  cuda/non-cuda log_construction_complete) and tp_qwen3_5.rs onto
  context_limit::kv_bytes_per_token + KV_CACHE_DTYPE_BYTES.
- Per-rank VRAM fan-out (tightest card):
  - WorkerRequest::QueryVram + WorkerResponse::VramInfo { free_mb, total_mb };
  - worker.rs handle_query_vram (cuda: mem_get_info; non-cuda: error);
  - WorkerPool::query_vram_tightest_free_mb fans out to every rank
    (leader via its device worker, subprocess ranks via RPC) → min free;
  - TpLoadedModel::query_vram_tightest_free_mb convenience wrapper.
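
The per-card KV arithmetic and the "tightest card" reduction can be sketched
like this. Everything here is an assumption layered on the bullet points above:
the 2-byte value for `KV_CACHE_DTYPE_BYTES`, the `(i + 1) % interval == 0`
reading of `full_attention_interval`, even sharding across ranks, and the helper
names `count_full_attention_layers` and `tightest_free_mb`.

```rust
/// Assumption: a 16-bit KV cache (2 bytes per element).
pub const KV_CACHE_DTYPE_BYTES: u64 = 2;

/// layer_types is authoritative when present; otherwise fall back to the
/// interval, assuming every `interval`-th layer is a full-attention layer.
pub fn count_full_attention_layers(
    layer_types: Option<&[&str]>,
    num_layers: usize,
    full_attention_interval: Option<usize>,
) -> usize {
    match (layer_types, full_attention_interval) {
        (Some(types), _) => types.iter().filter(|t| **t == "full_attention").count(),
        (None, Some(interval)) if interval > 0 => num_layers / interval,
        // Assumption: with no hints, treat every layer as full attention.
        _ => num_layers,
    }
}

/// KV bytes one token costs on one card: K and V for each full-attention
/// layer, divided evenly across tensor-parallel ranks.
pub fn kv_bytes_per_token(
    full_attention_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
    world_size: usize,
) -> u64 {
    let per_layer = 2 * num_kv_heads as u64 * head_dim as u64 * KV_CACHE_DTYPE_BYTES;
    full_attention_layers as u64 * per_layer / world_size.max(1) as u64
}

/// The binding card is the one with the least free memory.
pub fn tightest_free_mb(per_rank_free_mb: &[u64]) -> Option<u64> {
    per_rank_free_mb.iter().copied().min()
}
```

Only full-attention layers are counted because those are the layers whose cache
grows with sequence length.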

No advertise/enforce yet (phases 4/5). fmt/clippy/test green; CUDA paths
type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 13:18:27 +03:00
4f05a87449 feat(neuron): self-derived context-limit core — physics + policy (#67 phase 1)
All checks were successful
CI / Format (push) Successful in 38s
CI / CUDA type-check (push) Successful in 1m49s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m28s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Refs #67. The correct limit{context,input,output} for a deployment is a
computed function of model architecture + live free VRAM + a
coherence/throughput trade-off, not an operator-declared static fact that
goes stale on model swap. This lands the arch-agnostic derivation core;
later phases capture per-model physics at load, measure throughput, and
advertise/enforce the computed limit.

- crates/neuron/src/harness/context_limit.rs (new):
  - kv_bytes_per_token(): shared per-card KV cost (counts only
    full-attention layers; sharded by TP world size). The TP load paths'
    inline math folds onto this in phase 2.
  - ContextProfile: per-model physics snapshot (max_position_embeddings,
    kv_bytes_per_token_per_card, world_size).
  - derive_limit(): context = min(max_pos, vram_ceiling,
    throughput_ceiling) clamped by an optional backstop; input = context −
    output; rounded to 1024. 6 unit tests.
- config.rs: [harness.candle.context_limit] block (mirrors prefix_cache):
  target_prefill_latency_secs, bootstrap_prefill_tok_per_sec,
  activation_headroom_mb, min_free_floor_mb, output_reserve_tokens.
- neuron.example.toml: documented the new block.
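
To make the formula concrete, here is a sketch of `derive_limit` under stated
assumptions. The `DeriveInputs` struct is hypothetical, as are the way the VRAM
ceiling subtracts headroom and floor, the throughput ceiling as rate times
target latency, and rounding down (rather than to nearest) to a multiple of
1024. The "zero free VRAM means no VRAM ceiling" rule comes from the phase 4
commit.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Limit {
    pub context: u64,
    pub input: u64,
    pub output: u64,
}

/// Hypothetical bundle of the profile, live signals, and config knobs.
pub struct DeriveInputs {
    pub max_position_embeddings: u64,
    pub kv_bytes_per_token_per_card: u64,
    pub free_tightest_mb: u64, // 0 = unknown / CPU sentinel
    pub activation_headroom_mb: u64,
    pub min_free_floor_mb: u64,
    pub prefill_tok_per_sec: f64,
    pub target_prefill_latency_secs: f64,
    pub output_reserve_tokens: u64,
    pub hard_ceiling: Option<u64>,
}

pub fn derive_limit(i: &DeriveInputs) -> Limit {
    // VRAM ceiling: tokens whose KV cache fits in the usable free memory.
    let vram_ceiling = if i.free_tightest_mb == 0 || i.kv_bytes_per_token_per_card == 0 {
        u64::MAX // unknown: do not let VRAM bind
    } else {
        let reserved = i.activation_headroom_mb + i.min_free_floor_mb;
        let usable_mb = i.free_tightest_mb.saturating_sub(reserved);
        usable_mb * 1024 * 1024 / i.kv_bytes_per_token_per_card
    };
    // Throughput ceiling: tokens that prefill within the latency target.
    let throughput_ceiling =
        (i.prefill_tok_per_sec * i.target_prefill_latency_secs).max(0.0) as u64;

    let mut context = i
        .max_position_embeddings
        .min(vram_ceiling)
        .min(throughput_ceiling);
    if let Some(ceiling) = i.hard_ceiling {
        context = context.min(ceiling);
    }
    context = context / 1024 * 1024; // round down to a multiple of 1024

    let output = i.output_reserve_tokens.min(context);
    Limit { context, input: context - output, output }
}
```

With 10000 MB free, 1536 MB reserved, 24576 KV bytes per token, 2000 tok/s, and
a 30 s target, the throughput ceiling (60000) binds and the context rounds down
to 59392.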

No runtime behaviour change yet. fmt/clippy/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 13:00:52 +03:00
2f67d17ec7 feat(neuron): emit reasoning_tokens usage details on streaming
All checks were successful
CI / CUDA type-check (push) Successful in 1m45s
CI / Format (push) Successful in 43s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m28s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Resolve version stamps + change detection (push) Successful in 34s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m3s
build-prerelease / Build cortex binary (push) Successful in 2m16s
build-prerelease / Build helexa-bench binary (push) Successful in 2m23s
build-prerelease / Build neuron-ampere (push) Successful in 2m50s
build-prerelease / Lint (fmt + clippy) (push) Successful in 3m3s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Test (push) Successful in 5m11s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m24s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 56s
Closes #64.

opencode meters reasoning tokens separately via the OpenAI-standard
detail objects, which neuron's usage structs didn't expose. Add them
additively so older clients ignore them.

- cortex-core: Usage gains completion_tokens_details/prompt_tokens_details;
  ResponsesUsage gains output_tokens_details/input_tokens_details. Optional
  + skip_serializing_if, so the wire shape is unchanged for non-reasoning
  models. cached_tokens fields are defined but always None until prompt
  caching lands (#11).
- candle.rs: count tokens generated while in_reasoning across all three
  streaming paths (TP, worker, CPU); carry the count on InferenceEvent::Finish.
- chat projector: populate completion_tokens_details.reasoning_tokens.
- responses projector: wire up base usage emission on the streaming path
  (it emitted none before) and add output_tokens_details.reasoning_tokens.
- non-streaming paths leave details None (they don't track in_reasoning).

reasoning_tokens is a sub-count of completion/output tokens (OpenAI
semantics) — not added into total_tokens.
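
The sub-count rule can be shown with a small std-only sketch. The struct and
function names are hypothetical, serialization attributes are omitted, and
capping `reasoning_tokens` at `completion_tokens` is a defensive assumption
rather than documented behaviour.

```rust
#[derive(Debug, PartialEq)]
pub struct CompletionTokensDetails {
    pub reasoning_tokens: u32,
}

#[derive(Debug, PartialEq)]
pub struct Usage {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
    /// None on paths that do not track reasoning (e.g. non-streaming).
    pub completion_tokens_details: Option<CompletionTokensDetails>,
}

/// Build a usage record; reasoning tokens are already part of
/// completion_tokens, so they never change total_tokens.
pub fn build_usage(
    prompt_tokens: u32,
    completion_tokens: u32,
    reasoning_tokens: Option<u32>,
) -> Usage {
    Usage {
        prompt_tokens,
        completion_tokens,
        total_tokens: prompt_tokens + completion_tokens,
        completion_tokens_details: reasoning_tokens.map(|r| CompletionTokensDetails {
            reasoning_tokens: r.min(completion_tokens),
        }),
    }
}
```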

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 12:04:05 +03:00
11b2e6f78c fix(cortex): default models_config to the packaged absolute path
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m24s
build-prerelease / Build neuron-blackwell (push) Successful in 1m42s
build-prerelease / Build neuron-ada (push) Successful in 2m7s
build-prerelease / Build helexa-bench binary (push) Successful in 2m7s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Build neuron-ampere (push) Successful in 2m49s
build-prerelease / Test (push) Successful in 4m26s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m23s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
cortex resolved the catalogue path "models.toml" relative to the service's
working directory, so the systemd-launched binary never found
/etc/cortex/models.toml and ran with an EMPTY catalogue in production —
limits, cost, pinning, aliases and feasibility were all silent no-ops,
with models surfacing only via the neuron poller. Tests never caught it
because they pass models_config explicitly; only the defaulted,
packaged path was broken.

Default to the absolute /etc/cortex/models.toml (where cortex.spec installs
it) and document the override in cortex.example.toml. Restores the #62
limit/cost advertisement (the catalogue is now actually read) along with
pinning/aliases/feasibility.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 10:04:29 +03:00
8a636c687f feat(cortex): per-model limit + cost on /v1/models; remove max_model_len
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 37s
build-prerelease / Build neuron-blackwell (push) Successful in 1m36s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m33s
build-prerelease / Build neuron-ada (push) Successful in 2m2s
build-prerelease / Build neuron-ampere (push) Successful in 2m47s
build-prerelease / Build helexa-bench binary (push) Successful in 2m8s
build-prerelease / Build cortex binary (push) Successful in 2m35s
build-prerelease / Test (push) Successful in 5m13s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s
Resolves #62. opencode's helexa provider discovers a model's serving
budget from /v1/models and uses it to size context, trigger compaction,
and show spend with no hand-configuration. Each model entry now carries:

  - limit { context, input?, output }  — operator-declared in models.toml
  - cost  { input, output, cache_read?, cache_write? }  — USD per 1M tokens
  - tool_call / reasoning  — runtime-detected by the candle harness and
    OR-ed in from each serving neuron

Composition: the catalogue profile supplies limit/cost (Pass 1); the
poller carries the neuron's detected tool_call/reasoning into ModelEntry,
which the gateway unions onto the entry (Pass 2); aliases propagate every
field (Pass 4). Wire types extend ModelInfo / ModelProfile /
CortexModelEntry additively (serde default + skip_serializing_if), so
older neurons and clients are unaffected. helexa-bench's ModelInfo
constructor and the gateway test fixtures are updated for the new fields.
Adds tests/model_limits.rs asserting /v1/models surfaces limit + cost
(catalogue) and tool_call + reasoning (runtime), and that max_model_len
is gone.

Removes max_model_len. It was write-only with no consumer — opencode's
source references it nowhere and it is not an OpenAI /v1/models field —
and doubly misleading: vLLM's max_model_len means total sequence length,
but cortex populated it from NEURON_MAX_PROMPT_TOKENS, a prompt-only cap.
The limit{} contract replaces it. The neuron's max_prompt_tokens remains
the enforced prompt cap (neuron-side); cortex just stops re-advertising a
derived, mis-named copy. Closes #66 — its stale-max_model_len premise is
moot once the field is gone.

limit/cost are operator-declared (catalogue) per #62's design; auto-
deriving the advertised budget from each neuron's reported cap is a
tracked follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 09:26:55 +03:00
6088830e7d feat(deploy): manage NEURON_MAX_PROMPT_TOKENS per host via model.conf drop-in
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Roll the per-model context cap into deploy.yml so it is deterministic per
host and rolled out (with a restart) alongside the rest of the service
config, rather than hand-edited in local.conf. The deploy now writes
/etc/systemd/system/neuron.service.d/model.conf from a new per-host
`max_prompt_tokens` matrix field, and restarts a neuron when the package
OR the drop-in changes — so a cap change applies even with no new RPM.

beast (Qwen3.6-27B, hybrid linear, 2x 32GB) -> 131072 (~128k); benjy and
quadbrat (dense, VRAM-bound) stay at 16384 but become deploy-managed.

Adds the scoped sudoers grant for the root-owned drop-in install, and
doc/context-limits.md documenting the knob relationships and KV/VRAM math
(refs #62 for the eventual /models-advertised source of truth, #65 for
the length-aware text VRAM guard that gates pushing beyond 128k).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 18:48:19 +03:00
04f798ec23 feat(cortex-gateway): enhance error responses with structured data
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m22s
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 2m26s
build-prerelease / Test (push) Successful in 4m23s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 47s
Fixes #63.
Standardize error messages by adding type, code, and param fields to
align with OpenAI API format. Updates include:
- Structured error envelopes with broad type categorization
  (invalid_request_error/api_error)
- Specific machine-readable codes (model_not_found/service_unavailable)
- Null param field as required by OpenAI specification
- Consistent error response formatting across handlers, proxy, and
  routing layers

New tests verify correct error envelope structure for various failure
scenarios.

Co-Authored-By: Helexa (Qwen3.6-27B, 48k context) <noreply@helexa.ai>
2026-06-16 17:51:04 +03:00
6f3e9276cd docs: add AGENTS.md with project architecture, build commands, and conventions
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 37s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m27s
build-prerelease / Test (push) Successful in 4m37s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
2026-06-16 14:15:32 +03:00
8f9e956d17 fix(neuron): emit OpenAI-standard nested error envelopes (#60)
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m44s
build-prerelease / Build neuron-ada (push) Successful in 2m14s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build neuron-ampere (push) Successful in 2m55s
build-prerelease / Test (push) Successful in 4m24s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 53s
InferenceError responses were a flat `{"error": "..."}` string. OpenAI
clients (opencode, the openai SDK) reach into `error.type`/`error.code`
to drive behaviour — most importantly `code == "context_length_exceeded"`
triggers auto-compaction + retry instead of a hard failure. A flat string
is invisible to that logic.

Rewrite `inference_error_response` to emit the nested envelope
`{"error": {"message","type","code","param", ...diagnostics}}` and map:

- ModelNotLoaded   → 404 invalid_request_error / model_not_found
- PromptTooLong    → 400 invalid_request_error / context_length_exceeded
  (message: "maximum context length is N tokens", + prompt_len/max)
- InsufficientVram → 503 api_error / insufficient_vram
- VisionUnsupported→ 400 invalid_request_error / vision_unsupported
- TemplateRenderFailed → 422 invalid_request_error / template_render_failed
- Other            → 500 api_error / null code

Diagnostic extras ride inside the error object so the envelope shape is
stable. Both inline match blocks in the chat-completions handler
(streaming + non-streaming) now defer to the shared helper, which the
responses handler already used — one source of truth.
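
The mapping above reads naturally as a single `match`. The sketch below is an
illustration only: the enum payloads, the `ErrorEnvelope` struct, and all
messages other than the PromptTooLong one are assumptions, and the diagnostic
extras are left out.

```rust
#[derive(Debug)]
pub enum InferenceError {
    ModelNotLoaded,
    PromptTooLong { prompt_len: usize, max: usize },
    InsufficientVram,
    VisionUnsupported,
    TemplateRenderFailed,
    Other(String),
}

/// Hypothetical flattened view of {"error": {"message","type","code",...}}.
#[derive(Debug, PartialEq)]
pub struct ErrorEnvelope {
    pub status: u16,
    pub kind: &'static str,
    pub code: Option<&'static str>,
    pub message: String,
}

pub fn envelope_for(err: &InferenceError) -> ErrorEnvelope {
    const INVALID: &str = "invalid_request_error";
    const API: &str = "api_error";
    let (status, kind, code, message) = match err {
        InferenceError::ModelNotLoaded => {
            (404, INVALID, Some("model_not_found"), "model is not loaded".to_string())
        }
        InferenceError::PromptTooLong { max, .. } => (
            400,
            INVALID,
            Some("context_length_exceeded"),
            format!("maximum context length is {max} tokens"),
        ),
        InferenceError::InsufficientVram => {
            (503, API, Some("insufficient_vram"), "insufficient VRAM".to_string())
        }
        InferenceError::VisionUnsupported => {
            (400, INVALID, Some("vision_unsupported"), "vision input is not supported".to_string())
        }
        InferenceError::TemplateRenderFailed => (
            422,
            INVALID,
            Some("template_render_failed"),
            "chat template failed to render".to_string(),
        ),
        // Unclassified failures carry a null code.
        InferenceError::Other(msg) => (500, API, None, msg.clone()),
    };
    ErrorEnvelope { status, kind, code, message }
}
```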

Adds 4 unit tests covering the envelope shape and codes. Also fixes a
pre-existing clippy lint (cloned_ref_to_slice_refs) in qwen3_5 snapshot
test surfaced by a newer clippy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 20:42:14 +03:00
cb758d4706 feat(neuron): emit usage on the streaming path so clients can track context
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m20s
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m46s
build-prerelease / Build neuron-ada (push) Successful in 2m9s
build-prerelease / Build cortex binary (push) Successful in 2m24s
build-prerelease / Build neuron-ampere (push) Successful in 2m52s
build-prerelease / Test (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 55s
The deeper reason opencode showed "Context: 0 tokens / 0% used" and flew
into a 400: streaming responses carried NO `usage`. Clients track context
(and trigger compaction) from the `usage` field; the legacy candle
streaming path set `usage: None` on every chunk, so a streaming client
had no token count at all — `max_model_len` alone is a denominator with
no numerator.

InferenceEvent::Finish now carries prompt_tokens + completion_tokens
(the streaming loops already have both: prompt_tokens.len() and the
generated all_tokens.len()). The openai_chat projector emits an
OpenAI-style trailing usage chunk (empty `choices`, populated `usage`)
after the finish chunk. cortex's Anthropic stream translator already
reads chunk.usage, so this fixes context tracking on BOTH the OpenAI
(opencode) and Anthropic (Claude Code) paths.
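
A simplified model of the projector change, assuming hypothetical `Chunk` and
`project` names and a fixed `"stop"` finish reason. The real projector emits
JSON chunks, with the usage chunk carrying an empty `choices` array.

```rust
#[derive(Debug, PartialEq)]
pub enum InferenceEvent {
    Token(String),
    Finish { prompt_tokens: u32, completion_tokens: u32 },
}

/// Hypothetical stand-in for the serialized stream chunks.
#[derive(Debug, PartialEq)]
pub enum Chunk {
    Delta(String),
    Finish { finish_reason: &'static str },
    /// Trailing chunk: no choices, only usage.
    Usage { prompt_tokens: u32, completion_tokens: u32, total_tokens: u32 },
}

/// Project one inference event into zero or more stream chunks.
/// A Finish event yields the finish chunk followed by the usage chunk.
pub fn project(event: InferenceEvent) -> Vec<Chunk> {
    match event {
        InferenceEvent::Token(text) => vec![Chunk::Delta(text)],
        InferenceEvent::Finish { prompt_tokens, completion_tokens } => vec![
            Chunk::Finish { finish_reason: "stop" },
            Chunk::Usage {
                prompt_tokens,
                completion_tokens,
                total_tokens: prompt_tokens + completion_tokens,
            },
        ],
    }
}
```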

Also harden a related piece of the max_model_len plumbing: cortex now
re-polls /discovery while a neuron's max_prompt_tokens is still 0
(unknown). If a rolling deploy lets cortex cache discovery before the
neuron reports the field, the value now self-heals instead of pinning
max_model_len to None until a manual cortex restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 19:43:59 +03:00
a2d2dbd006 feat: advertise max_model_len on /v1/models so clients can compact
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m15s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m2s
build-prerelease / Build helexa-bench binary (push) Successful in 2m0s
build-prerelease / Build cortex binary (push) Successful in 2m26s
build-prerelease / Build neuron-ampere (push) Successful in 2m55s
build-prerelease / Test (push) Successful in 4m28s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m22s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m37s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s
opencode (and any OpenAI/Anthropic client) couldn't size or compact its
context against helexa because /v1/models never advertised a context
window — opencode showed "0 tokens / 0% used" and flew straight into a
400 PromptTooLong once a conversation + a fetched 64KB log overflowed the
49152-token cap. Compaction is the client's job, but the client needs to
know the limit to do it.

neuron now reports its effective prompt cap (NEURON_MAX_PROMPT_TOKENS)
in GET /discovery (`max_prompt_tokens`). cortex surfaces it on
/v1/models as `max_model_len` (vLLM / OpenAI-compatible convention) per
model — the smallest cap among the neurons that can serve it
(feasible_on ∪ locations), so the advertised limit holds wherever the
request routes. A neuron reporting 0 predates the field and is treated
as unknown (skipped); models with no reporting neuron omit the field.
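
The selection rule above can be sketched as follows (Python for brevity; `advertised_max_model_len` is a hypothetical name, and the input is assumed to be the `max_prompt_tokens` values of every neuron that can serve the model):

```python
from typing import Iterable, Optional

def advertised_max_model_len(neuron_caps: Iterable[int]) -> Optional[int]:
    """Return the smallest known cap, or None when no neuron reports one."""
    # A cap of 0 means the neuron predates the field, so it is skipped.
    known = [cap for cap in neuron_caps if cap > 0]
    return min(known) if known else None
```

Taking the minimum rather than the maximum keeps the advertised limit valid no matter which neuron the request is routed to.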

helexa still rejects over-limit prompts with a clean 400 — this just
gives clients the number to compact *before* hitting it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 19:11:13 +03:00
544214d0f8 fix(neuron): normalize OpenAI string tool-call arguments before rendering
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m39s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 2m48s
build-prerelease / Test (push) Successful in 4m35s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
opencode (OpenAI path, /v1/chat/completions passthrough) hit the same
chat_template:120 failure Claude Code did — "cannot convert value into
pairs" — because the OpenAI wire format carries
tool_calls[].function.arguments as a JSON *string*, while Qwen3.6's
template iterates it as a dict (`arguments | items`). The Anthropic-side
fix (8880b2f) only covered cortex's translation; the OpenAI path reaches
neuron unchanged.

render_chat_template now normalizes string-form tool-call arguments to
objects across all messages before building the Jinja context, so OpenAI
and Anthropic clients both render. Object args (Anthropic path) pass
through untouched; a string that doesn't parse is left as-is and the
render fails loudly (422 TemplateRenderFailed, a94dd55) rather than
silently dropping tools.
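
A rough sketch of the normalization step, in Python for brevity. The function name and the in-place mutation are assumptions of this sketch; the real render_chat_template is Rust and may differ in detail.

```python
import json

def normalize_tool_call_arguments(messages: list) -> list:
    """Turn string-form tool-call arguments into objects, in place."""
    for message in messages:
        for call in message.get("tool_calls") or []:
            function = call.get("function", {})
            args = function.get("arguments")
            if not isinstance(args, str):
                continue  # already an object (Anthropic path): untouched
            try:
                parsed = json.loads(args)
            except json.JSONDecodeError:
                continue  # leave as-is so the render fails loudly later
            if isinstance(parsed, dict):
                function["arguments"] = parsed
    return messages
```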

The loud-fail change paid off immediately here: opencode got a clean
422 with the exact `chat_template:120` cause instead of a degraded
session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:13:36 +03:00
a94dd55ab8 feat(neuron): fail loud (422) when a tools-bearing request can't render
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m18s
build-prerelease / Test (push) Successful in 4m12s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m10s
build-prerelease / Build neuron-ampere (push) Successful in 2m49s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s
Three of this session's bugs (system-message position, tool_call argument
shape, and the original tool rendering) all hid behind the same silent
behaviour: chat_template render fails → neuron falls back to
format_qwen3_prompt, which drops every tool → the request still returns
200 with degraded, tool-less output. Each cost real debugging time
because the failure was invisible on the wire.

build_prompt_for_request now returns Result. On a render failure it
checks whether the request carried tools: if so it returns the new
InferenceError::TemplateRenderFailed (mapped to 422 with a
template_render_failed code and the underlying Jinja error), instead of
silently degrading. A render failure with no tools still falls back
quietly — there's nothing to lose, and `format_qwen3_prompt` is a
reasonable text-only prompt. The four prompt-build call sites propagate
with `?`.
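
The decision described above, sketched in Python for brevity. `build_prompt`, `render` and `fallback` are hypothetical stand-ins for build_prompt_for_request, the chat-template renderer and format_qwen3_prompt; only the 422 status and the `template_render_failed` code come from the description.

```python
class TemplateRenderFailed(Exception):
    """Raised when a tools-bearing request cannot be rendered."""
    status = 422
    code = "template_render_failed"

def build_prompt(request: dict, render, fallback) -> str:
    try:
        return render(request)
    except Exception as err:
        if request.get("tools"):
            # Falling back would drop every tool, so fail loudly instead.
            raise TemplateRenderFailed(str(err)) from err
        # No tools to lose: the text-only fallback is acceptable.
        return fallback(request)
```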

Now the next client/template incompatibility surfaces as a loud 422 the
operator sees immediately, not a mysteriously-degraded session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:48:31 +03:00
8880b2f8a6 fix(cortex): emit tool_call arguments as an object so Qwen3.6 can chain tools
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Build helexa-bench binary (push) Successful in 2m14s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m27s
build-prerelease / Build cortex binary (push) Successful in 2m37s
build-prerelease / Test (push) Successful in 4m32s
build-prerelease / Build neuron-blackwell (push) Successful in 1m41s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 2m50s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m23s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
Verified live via the rendered-prompt trace: once a tool call is in the
conversation history, the Qwen3.6 chat template fails to render —

  render chat_template: invalid operation: cannot convert value into
  pairs (in chat_template:120)

because line 120 iterates `tool_call.arguments | items` (treats arguments
as a dict), while cortex emitted the OpenAI-standard JSON *string*. On
that render error neuron silently falls back to a tool-less prompt, so
the model loses every tool the moment it makes one call — it can make the
first tool call, read the result, then can only narrate ("now let me
check the runs") and stop, because the next turn has no tools. That's the
"drops the ball a little later" symptom: the CC trace shows the get_me
turn rendering 42653 tokens (tools present) and every subsequent
tool-history turn falling back to ~6k tokens (tools gone).

anthropic_to_openai now passes `function.arguments` as the parsed object
rather than stringifying it. Tests updated to expect the object form.

This is the same silent-fallback failure class as the system-message
merge (295b10c) — which is why making neuron's template-render fallback
LOUD (4xx on a tools-bearing request instead of a degraded 200) is now
clearly worth doing: it would have surfaced both in seconds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 16:43:17 +03:00
4e8f4e0d04 fix(neuron): don't generate <think> reasoning when the client drops it
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 1m50s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 2m36s
build-prerelease / Build neuron-ada (push) Successful in 2m37s
build-prerelease / Test (push) Successful in 4m15s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m37s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 50s
Verified live: Qwen/Qwen3.6-27B with a simple prompt and max_tokens=400
generated 400 tokens, finish_reason=length, and 0 visible characters —
the model spent the ENTIRE budget on <think> reasoning, which we then
drop for OpenAI/Anthropic clients (include_thinking=false), starving the
visible answer. This is why Claude Code "dropped the ball": empty or
truncated responses. A/B confirms the cause — same prompt with
chat_template_kwargs.enable_thinking=false yields a full 545-char answer.

The earlier prompt_opens_reasoning fix stopped the reasoning *leaking* as
text but left it consuming the token budget. Couple the two: when the
caller isn't going to see the reasoning (include_thinking=false, the
default), default chat_template_kwargs.enable_thinking to false so the
model doesn't generate it. An explicit client enable_thinking wins;
thinking-aware clients (helexa-acp, x-include-thinking: true) keep
reasoning on. Tests cover the default (false), surfacing (true), explicit
override, and preservation of other kwargs.
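
The defaulting rule can be sketched like this (Python for brevity; `effective_template_kwargs` is a hypothetical name, and leaving the kwargs unchanged when reasoning is surfaced is an assumption of this sketch):

```python
from typing import Optional

def effective_template_kwargs(client_kwargs: Optional[dict],
                              include_thinking: bool) -> dict:
    """Default enable_thinking to False when reasoning would be dropped."""
    kwargs = dict(client_kwargs or {})  # copy: never mutate the request
    if not include_thinking:
        # setdefault keeps an explicit client value, so the client wins.
        kwargs.setdefault("enable_thinking", False)
    return kwargs
```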

Note: this covers only the /v1/chat/completions path (what Claude Code
uses via cortex /v1/messages); /v1/responses could get the same
defaulting as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 15:00:50 +03:00
295b10c103 fix(cortex): merge all system content into one leading system message
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m49s
build-prerelease / Build helexa-bench binary (push) Successful in 2m14s
build-prerelease / Build cortex binary (push) Successful in 2m54s
build-prerelease / Test (push) Successful in 5m21s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Successful in 2m3s
build-prerelease / Build neuron-ampere (push) Successful in 2m52s
build-prerelease / Package cortex RPM (push) Successful in 1m34s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s
Verified live via neuron trace: Claude Code's real requests carry a
top-level `system` AND a `role:"system"` turn inside `messages`. cortex
passed the latter through at a non-first position, and Qwen3.6's chat
template hard-rejects it:

  WARN chat_template render failed; falling back to format_qwen3_prompt
  error=... invalid operation: System message must be at the beginning.

On that render error neuron silently falls back to a template that
renders NO tools, so the model got zero tool-format guidance and
improvised an unparseable `<tool><name>…` syntax — tool calling broke
entirely for real CC traffic, even though synthetic single-system
probes (and the earlier translation/parse fixes) worked.

anthropic_to_openai now accumulates the top-level `system` plus every
`role:"system"` conversation turn and emits a single system message at
index 0, with the non-system turns following in order. Reproduced the
trigger (system-role message at index>0 → fallback) and the fix
(merged → template renders tools). Test covers the merge + ordering.
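
A minimal sketch of the merge, in Python for brevity. It assumes system content is already flattened to plain strings and joins the parts with a blank line; both are assumptions of this sketch, as is the function name.

```python
from typing import Optional

def merge_system_messages(top_level_system: Optional[str],
                          messages: list) -> list:
    """Collect all system content into one message at index 0."""
    parts = [top_level_system] if top_level_system else []
    rest = []
    for message in messages:
        if message.get("role") == "system":
            parts.append(message["content"])
        else:
            rest.append(message)  # non-system turns keep their order
    if not parts:
        return rest
    return [{"role": "system", "content": "\n\n".join(parts)}] + rest
```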

Secondary hardening worth a follow-up: neuron's silent template
fallback drops tools without surfacing it to the client — a render
failure on a tools-bearing request should arguably 4xx rather than
degrade invisibly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 14:09:08 +03:00
1c485aedce feat(neuron): trace the fully rendered chat-template prompt
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 27s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m31s
build-prerelease / Build neuron-ampere (push) Successful in 2m13s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m45s
build-prerelease / Build neuron-ada (push) Successful in 3m31s
build-prerelease / Test (push) Successful in 4m39s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m49s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
Debugging tool-call format drift (Qwen3.6-27B emitting wrapper-less
<tool><name>… markup under Claude Code's real system prompt + 120-tool list,
which neuron's <tool_call> detector can't parse) needs ground truth on
what the model actually sees. neuron logged nothing about the rendered
prompt. Add a trace! in build_prompt_for_request emitting the full
rendered prompt + char count + tool count, so we can see whether the
chat template's <tool_call> format instruction survives a large system
prompt and how the tools render. Gated at trace (the prompt can be tens
of KB): RUST_LOG=neuron::harness::candle=trace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:38:51 +03:00
b3dc835375 ci: bound job runtime + stop dropping sccache on rustc signal-death
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m23s
build-prerelease / Build cortex binary (push) Successful in 2m29s
build-prerelease / Build helexa-bench binary (push) Successful in 2m34s
build-prerelease / Test (push) Successful in 4m33s
build-prerelease / Build neuron-blackwell (push) Successful in 1m31s
build-prerelease / Build neuron-ada (push) Successful in 2m13s
build-prerelease / Build neuron-ampere (push) Successful in 2m50s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 55s
A neuron-blackwell build hung ~90 min (siblings finished in 2) and there
was no job timeout to kill it, so it sat burning a runner. Root cause of
the hang: the inline retry loop treated every failure identically and, on
its final attempt, rebuilt with sccache disabled. When the real failure
is a rustc SIGSEGV or an OOM-kill, an uncached rebuild does *more* work
under the same memory pressure — turning one transient compiler crash
into a wedged job.

Two fixes:

1. timeout-minutes on every job in build-prerelease.yml and ci.yml
   (builds 25, neuron CUDA build/cuda-check 35, packaging 20, COPR 60,
   fast jobs 10-15). A hang now dies in minutes, not hours.

2. New script/ci-cargo-escalate.sh replaces the five (prerelease) + three
   (ci) inline escalation loops. It classifies the failure:
     - signal death (exit >=128, or cargo reporting `signal: N`/SIGSEGV/
       SIGKILL) → compiler crash, NOT an sccache fault: keep the cache,
       one warm retry, then fail fast. Never escalate to uncached.
     - sccache fault (recognisable sccache error) → restart the server,
       retry, then one final uncached attempt.
     - deterministic compile/test error → fail fast (no wasteful retry).
   It also folds in the CUDA-image sccache probe the neuron/cuda-check
   jobs did inline. Classification verified locally against success,
   plain failure, exit-139, and the cargo-wrapped `signal: 11` form.
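
The classification in item 2 can be sketched as below. This is Python for illustration only; the real logic lives in a shell script, and the `sccache: error` marker is a hypothetical stand-in for whatever error text the script actually recognises.

```python
import re

# Matches cargo's wrapped form ("signal: 11") and the named signals.
SIGNAL_PATTERN = re.compile(r"signal: \d+|SIGSEGV|SIGKILL")

def classify_failure(exit_code: int, output: str) -> str:
    """Return 'success', 'signal', 'sccache' or 'deterministic'."""
    if exit_code == 0:
        return "success"
    if exit_code >= 128 or SIGNAL_PATTERN.search(output):
        return "signal"        # keep the cache, one warm retry, then fail
    if "sccache: error" in output:  # hypothetical marker
        return "sccache"       # restart server, retry, then uncached
    return "deterministic"     # fail fast, no retry
```

Checking for signal death before the sccache marker matters: a compiler crash must never escalate to an uncached rebuild.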

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:02:50 +03:00
746d84c0fb fix(neuron): seed in_reasoning from the prompt so Qwen3.6 thinking isn't leaked
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 2m3s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m18s
build-prerelease / Build neuron-ada (push) Successful in 2m15s
build-prerelease / Test (push) Successful in 4m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Qwen3.6's chat template injects the opening <think> into the generation
prompt, so generation begins mid-thought and the open marker is never
sampled. The streaming loops flipped in_reasoning to true only on a
*generated* open token, so they stayed in text mode and streamed the
model's reasoning out as visible text — verified live: a tool request
returned a 255-char text block of chain-of-thought ("The user wants to
know the weather… I will construct the function call now.") ahead of the
tool_use block, with the trailing </think> stripped (close token
recognised) but no opening <think>.

Each streaming loop now seeds in_reasoning by replaying the prompt's
reasoning markers (new `prompt_opens_reasoning`): if the prompt ends
inside an open <think>, the loop starts in reasoning mode, the thinking
routes to ReasoningDelta (dropped by the chat projector's default
include_thinking=false, which is what cortex uses), and the model's
</think> flips back to visible text for the answer/tool call. Template-
agnostic and self-correcting: a prompt that doesn't open reasoning (no
think injection, enable_thinking off, non-reasoning model) starts false,
preserving current behaviour. Thinking is hidden, not disabled, so answer
quality is unaffected.
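
The replay idea can be sketched as follows (Python for brevity; the marker strings are passed as parameters because the real implementation may work on tokens rather than text, which is an assumption of this sketch):

```python
import re

def prompt_opens_reasoning(prompt: str,
                           open_marker: str = "<think>",
                           close_marker: str = "</think>") -> bool:
    """Replay reasoning markers; True if the prompt ends inside an open one."""
    pattern = re.escape(open_marker) + "|" + re.escape(close_marker)
    in_reasoning = False
    for match in re.finditer(pattern, prompt):
        # Each marker sets the state, so the last one decides.
        in_reasoning = match.group() == open_marker
    return in_reasoning
```

A prompt with no markers never enters the loop and returns False, which is the pass-through case that preserves the previous behaviour.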

Applied to all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming). Test covers
open/close replay, multi-turn closed state, reopen-at-tail, and the
no-pair pass-through.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 11:03:26 +03:00
f15b9e2848 fix(neuron): parse Qwen-XML tool calls + emit tool_use stop_reason
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 2m2s
build-prerelease / Build neuron-ada (push) Successful in 2m7s
build-prerelease / Build neuron-ampere (push) Successful in 2m16s
build-prerelease / Test (push) Successful in 4m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m34s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
Verified live (commit d662fa2 logs): cortex now delivers OpenAI-shaped
tools to neuron correctly, but Qwen3.6-27B emits tool calls in the
Qwen-XML form inside the <tool_call> markers —

    <tool_call>
    <function=get_weather>
    <parameter=city>
    Brno
    </parameter>
    </function>
    </tool_call>

— while parse_tool_call_body only did serde_json::from_str expecting
{"name":…,"arguments":…}. It returned None, the dispatch re-emitted the
raw block as a text delta, and clients saw the markup as prose. cortex
logged upstream_tool_calls=false finish_reason="stop".

parse_tool_call_body is now format-tolerant: JSON first (Qwen3-Instruct
/ Hermes), then a Qwen-XML parser (Qwen3-Coder / Qwen3.6). Each
<parameter> value is coerced to its declared JSON type using a new
ToolSchemas map built from the request's tools (string stays string,
integer/number/boolean/object/array coerced, mistyped values fall back
to string so an argument is never dropped). build_tool_schemas is
threaded into all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming).
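
A simplified sketch of the two-stage parse, in Python for brevity. The regexes, the `schemas` shape (tool name to parameter-name-to-type map) and the helper names are assumptions of this sketch; it also omits the no-schema type sniffing and returns undeclared parameters as strings.

```python
import json
import re

FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)
PY_TYPES = {"integer": int, "number": (int, float), "boolean": bool,
            "object": dict, "array": list}

def coerce(raw: str, declared):
    """Coerce raw text to the declared JSON type, else keep the string."""
    expected = PY_TYPES.get(declared)
    if expected is None:  # "string" or undeclared
        return raw
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # mistyped value: fall back to string, never drop it
    if isinstance(value, bool) and declared != "boolean":
        return raw  # bool is an int subclass in Python; reject it here
    return value if isinstance(value, expected) else raw

def parse_tool_call_body(body: str, schemas: dict):
    try:  # 1) JSON form: {"name": ..., "arguments": ...}
        call = json.loads(body)
        if isinstance(call, dict) and "name" in call:
            return {"name": call["name"],
                    "arguments": call.get("arguments", {})}
    except json.JSONDecodeError:
        pass
    function = FUNCTION_RE.search(body)  # 2) Qwen-XML form
    if function is None:
        return None
    name, inner = function.group(1), function.group(2)
    types = schemas.get(name, {})
    arguments = {key: coerce(value.strip(), types.get(key))
                 for key, value in PARAMETER_RE.findall(inner)}
    return {"name": name, "arguments": arguments}
```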

Each loop also tracks emitted_tool_call and promotes the terminal
finish_reason from Stop to ToolCalls when a call parsed, so the OpenAI
chunk carries finish_reason:"tool_calls" and cortex maps it to Anthropic
stop_reason:"tool_use" — without which an Anthropic agent (Claude Code)
sees a tool_use block but stop_reason:end_turn and may not run the tool.
FinishReason::ToolCalls drops its dead_code allow.

Tests: JSON form still parses; Qwen-XML multi-param parse with
schema-driven string/integer/boolean coercion; no-schema type sniffing;
type-mismatch string fallback; unparseable body returns None.

Known gap (separate): the non-streaming run_inference paths have no
tool-call handling at all; Claude Code streams, so the streaming loops
are the ones that matter here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 10:39:38 +03:00
d662fa20ef fix(cortex): translate Anthropic tools to OpenAI shape + wire-debug logging
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m20s
build-prerelease / Build helexa-bench binary (push) Successful in 2m6s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Test (push) Successful in 4m12s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 4m44s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 53s
Claude Code (ANTHROPIC_BASE_URL -> cortex) hits POST /v1/messages, but
anthropic_to_openai forwarded the request's `tools` array verbatim via
the flattened `extra`. neuron feeds that straight into the HF chat
template, which iterates the OpenAI shape (tool.function.name/.parameters).
Anthropic-shaped tools ({name, description, input_schema}) rendered as
broken/empty definitions, the model improvised an unparseable
<tool_use_name>...</tool_use_name> tool-call format, neuron's
<tool_call>{json}</tool_call> detector missed it, and the markup fell
through as plain assistant text — so CC never received a structured
tool_use and the agent loop died.

Request-side translation now reshapes:
- tool definitions: {name, description, input_schema}
  -> {type:"function", function:{name, description, parameters}}
- tool_choice: auto->"auto", any->"required", none->"none",
  tool->{type:"function",function:{name}}
- assistant tool_use blocks -> OpenAI assistant.tool_calls
  (arguments JSON-stringified) — fixes multi-turn
- user tool_result blocks -> standalone role:"tool" messages keyed by
  tool_call_id
- system content blocks flatten to text instead of being JSON-serialised
  into the prompt; best-effort image-block -> image_url part
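
The first two reshaping rules above can be sketched as below (Python for brevity; the function names and the empty defaults for a missing `description` or `input_schema` are assumptions of this sketch, not the actual anthropic_to_openai code):

```python
def tool_to_openai(tool: dict) -> dict:
    """{name, description, input_schema} -> OpenAI function tool."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool.get("input_schema", {}),
        },
    }

def tool_choice_to_openai(choice: dict):
    """Map an Anthropic tool_choice object to its OpenAI equivalent."""
    kind = choice["type"]
    if kind == "tool":
        return {"type": "function", "function": {"name": choice["name"]}}
    return {"auto": "auto", "any": "required", "none": "none"}[kind]
```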

Wire-debug instrumentation (tracing levels only; cortex/neuron ship at
info, operator infra runs at debug):
- every handler emits a debug! "inbound request" line tagging the wire
  surface (anthropic | openai-chat | openai-responses | openai-completions)
  plus model/stream/tools and, for Anthropic, tool_history/system
- response side reports upstream_tool_calls + finish_reason, streaming
  and non-streaming
- full inbound + translated-upstream bodies at trace! (UTF-8-safe, capped)

Tests: 8 request-side unit tests + an end-to-end gateway test asserting
the upstream neuron receives OpenAI-shaped tools and a
user->assistant(+tool_calls)->tool->user history.

Also tighten script/infra-log-verbosity.sh: independent cortex/neuron
RUST_LOG args, cortex-only by default (neuron restart behind
--with-neuron so we don't needlessly cold-reload models), mkdir -p the
drop-in dir, symmetric RUST_LOG cleanup, and set -euo pipefail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 09:58:25 +03:00
d04f4ad704 feat(bench): show GPUs as the resource name instead of hostnames
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Successful in 2m34s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m54s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m15s
build-prerelease / Test (push) Successful in 5m11s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 56s
Public visitors don't know the hostnames, so surface each host's GPU(s)
as the resource name across the UI.

- store: gpu_label() turns the stored gpus_json into a compact label
  ("2× RTX 5090", "RTX 4090"); add `gpu` to ReportRow + RunRow and
  `host_gpus`/`model_gpus` maps to /api/dimensions (from each one's
  latest run). render_json gains gpu too.
- UI: Overview + Runs show a "GPU" column (gpu, fallback host); Runs'
  filter is now GPU-labelled (still filters by host underneath); Trends
  shows a "Measured on <gpu>" line for the selected model.
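
A rough sketch of the label logic, in Python for brevity. The layout of gpus_json is not shown in this message, so the sketch assumes a JSON array of already-short GPU name strings; that input shape is hypothetical.

```python
import json
from collections import Counter

def gpu_label(gpus_json: str) -> str:
    """Collapse a list of GPU names into a compact label."""
    names = json.loads(gpus_json)  # assumed shape: ["RTX 5090", ...]
    counts = Counter(names)        # keeps first-seen order
    parts = [f"{n}× {name}" if n > 1 else name
             for name, n in counts.items()]
    return ", ".join(parts)
```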

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:29:13 +03:00
e3879f093a feat(bench-ui): drop host selector from Trends; resolve host server-side
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m38s
build-prerelease / Test (push) Successful in 4m47s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Successful in 2m2s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m22s
Public visitors don't know the hostnames or per-host hardware, so the
host picker on Trends was confusing. Select by model + scenario only;
/api/series now takes host as optional and resolves it to the host
serving that (model, scenario) — coherent since each model maps to one
host today. Runs (drill-down) keeps its host filter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:19:09 +03:00
e4b9b88de0 feat(bench-ui): mark the baseline↔live regime boundary on Trends
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Add a dashed vertical ReferenceLine at the first live build (labelled
"bench.py → helexa-bench") so the intentional gap between the gateway
baseline and the direct-to-neuron series reads as a deliberate
measurement-regime change, not missing data. The two series stay
unconnected by design (different regimes, not directly comparable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:13:34 +03:00
21db334e37 feat(bench-ui): overlay pre-helexa-bench baseline on Trends
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Option C: a curated static baseline (bench/src/baseline.ts), transcribed
from doc/benchmarks.md (8f6f1d3 + a1952a4 post-#11), overlaid on the
Trends charts as a dashed, clearly-labelled historical series ahead of
the bench era. Host inferred from model via the doc's fleet table;
ordered by snapshot time so it anchors the timeline.

Kept deliberately separate from the live series (no DB/API change) — the
baseline is a different regime (bench.py through the cortex gateway,
medians only) so it's never merged into the direct-to-neuron line; a
caption spells out the distinction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:02:43 +03:00
7dd1ddcfba fix(infra-setup): stat LE live dir via sudo; rsync provisioner secret for bench.internal issuance
Some checks failed
build-prerelease / Resolve version stamps + change detection (push) Failing after 11m1s
build-prerelease / Lint (fmt + clippy) (push) Has been cancelled
build-prerelease / Test (push) Has been cancelled
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build helexa-bench binary (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-bench RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
- cert_present() must `sudo test -d /etc/letsencrypt/live/...` (root-only
  0700); without sudo it falsely reported "no cert" and downgraded the
  bench.helexa.ai vhost to the http-only bootstrap (dropping its 443
  server). Now correctly keeps the full TLS vhost.
- bench.internal initial cert: rsync the operator's JWK 'lair' provisioner
  password to the host transiently (root, 0600), issue via
  step ca certificate, then remove it (trap + belt-and-suspenders rm).

Verified: bench.helexa.ai (LE) and bench.internal (lair CA) both serve the
SPA + /api→bob; step@bench.timer renews; secret removed from host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:40:38 +03:00
4ee7da4f97 feat(bench-ui): internal vhost bench.internal + step@ cert renewal
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Inside the WireGuard mesh, bench.helexa.ai dead-ends at the OPNsense LAN
interface (only WAN :443 is port-forwarded), so add an internal path:

- asset/nginx/bench.internal.conf — server_name bench.internal, internal
  "lair" CA cert, same SPA + /api→bob proxy. Mirrors the *.internal vhost
  convention on oolon.kosherinata.internal.
- asset/systemd/step@.{service,timer} — replicate oolon's smallstep cert
  renewal (step ca renew via mTLS, every 15 min, reload nginx).
- infra-setup.sh: install the step@ units + /etc/nginx/tls/{cert,key},
  install the vhost + enable step@bench.timer once the cert exists; prints
  the one-time issuance command otherwise.

Initial cert issuance (JWK provisioner) and bench.internal DNS are
operator steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:34:38 +03:00
db3cb95cbf fix(infra-setup): provision bench.helexa.ai cert via Cloudflare DNS-01 (ecdsa)
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
The webroot/http-01 approach needed nginx serving :80, but the gateway's
nginx was dormant. Switch to the host's established convention —
certbot --dns-cloudflare --key-type ecdsa with /root/.certbot-internal —
which needs neither nginx nor :80, so the cert provisions independently
of the vhost being served. Also restorecon the webroot (SELinux
enforcing → nginx 403 without httpd_sys_content_t), and only ever
install the full TLS vhost once the cert exists (http-only bootstrap
otherwise) so `nginx -t` always passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:54:24 +03:00
37c19aa985 feat(bench-ui): public hosting at https://bench.helexa.ai via gateway nginx
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build neuron-blackwell (push) Successful in 1m32s
build-prerelease / Build neuron-ada (push) Successful in 2m15s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m29s
build-prerelease / Build helexa-bench binary (push) Successful in 2m25s
build-prerelease / Build cortex binary (push) Successful in 2m39s
build-prerelease / Build neuron-ampere (push) Successful in 2m48s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m30s
build-prerelease / Test (push) Successful in 4m38s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m37s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 51s
nginx on the gateway serves the bench SPA and reverse-proxies /api to the
bob bench API over WireGuard — public, auth-less, same-origin (no CORS),
internal API stays private.

- asset/nginx/bench.helexa.ai.conf (full TLS vhost: SPA + /api proxy) and
  a bootstrap http-only vhost for the initial ACME challenge.
- infra-setup.sh: one-time gateway setup — webroot, Let's Encrypt cert
  (certbot webroot, idempotent), install + enable the vhost.
- deploy.yml: deploy-bench-ui builds the SPA (setup-node) and rsyncs
  dist/ to /var/www/bench.helexa.ai every deploy; built same-origin so
  no VITE_API_BASE.
- cortex-host.conf: scoped gitea_ci rsync grant for the webroot.
- bench/README: production hosting notes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:40:29 +03:00
f50f5531cf feat(bench): read-only JSON API on bob + bench/ React visualisation app
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m21s
build-prerelease / Build cortex binary (push) Successful in 2m27s
build-prerelease / Build helexa-bench binary (push) Successful in 2m44s
build-prerelease / Test (push) Successful in 4m32s
build-prerelease / Build neuron-ampere (push) Successful in 2m7s
build-prerelease / Build neuron-ada (push) Successful in 2m28s
build-prerelease / Build neuron-blackwell (push) Successful in 2m59s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m42s
Part A — helexa-bench read API:
- [api] config (enabled, listen :13132); WAL on the store so API reads
  never block the sweep writer.
- store read methods: summary, series (chronological per-build medians),
  runs (filtered), dimensions, run_count.
- api.rs: axum /api/health|dimensions|summary|series|runs, permissive
  CORS (UI is a separate origin). The `run` daemon binds the API
  alongside the sweep; new `serve` subcommand serves API-only.
- listener plumbing (bench gains a port): data/helexa-bench-firewalld.xml,
  spec install, deploy-bench /api/health probe + firewalld step, sudoers
  firewall-cmd grants, [api] in example + bob.toml.
- 5 API tests + serve smoke.
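
As an illustration of what "chronological per-build medians" could mean for the series read method, here is a minimal sketch. The function names, the tuple layout, and the even-count rule (mean of the two middle samples) are assumptions for this example, not the store's actual implementation, which reads from SQLite.

```rust
use std::collections::HashMap;

/// Median of a sample set. For an even count this takes the mean of the two
/// middle values (an assumption; the real store may choose differently).
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}

/// Collapse runs (assumed already in chronological order) into one median
/// per build SHA, keeping builds in the order they first appear.
pub fn series(runs: &[(&str, f64)]) -> Vec<(String, f64)> {
    let mut order: Vec<String> = Vec::new();
    let mut groups: HashMap<String, Vec<f64>> = HashMap::new();
    for &(sha, value) in runs {
        if !groups.contains_key(sha) {
            order.push(sha.to_string());
        }
        groups.entry(sha.to_string()).or_default().push(value);
    }
    order
        .into_iter()
        .map(|sha| {
            // Every group holds at least one sample, so median() is Some.
            let m = median(&groups[&sha]).unwrap();
            (sha, m)
        })
        .collect()
}
```

Keeping first-seen order separate from the grouping map keeps the output chronological regardless of hash iteration order.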

Part B — bench/ Vite + React-SWC-TS app (router, react-bootstrap,
recharts): Overview (summary table), Trends (decode tok/s & TTFT across
build SHAs), Runs (filterable explorer). Typed API client with
VITE_API_BASE + dev proxy to bob. npm build/typecheck clean. Hosted
separately from the API (per design); .gitignore excludes node_modules/dist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:26:55 +03:00
5999c8a5a3 Merge branch 'feat/deploy-bench-on-bob' into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 36s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
ci(deploy): deploy helexa-bench to bob + enable all fleet services on boot

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:17:11 +03:00
66833890c0 ci(deploy): deploy helexa-bench to bob + enable all fleet services on boot
All checks were successful
CI / CUDA type-check (push) Successful in 2m9s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m8s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Adds a deploy-bench job to deploy.yml that rolls helexa-bench onto bob
(the bench host, also running Agent Zero), following the deploy-cortex
pattern: manifest-gated skip-when-current, light "service stays active"
validation (outbound-only, no listener/model to probe), journal capture.
Runs alongside the cortex→neurons chain (no deploy-ordering dependency —
the sweep loop is version-aware).

Boot persistence: all systemd deployments now `systemctl enable --now`
instead of bare `start`, so cortex / neuron / helexa-bench come back
after a host reboot. Covers deploy.yml (all three services) and
deploy-dev.yml (neuron fast path); sudoers gain the matching
`enable --now <svc>` grant.

infra-setup.sh handles bob: provisions gitea_ci, installs the
bench-host sudoers, enables the lair-cafe-unstable repo (bob is a client
host without it), pre-creates /etc/helexa-bench, and syncs
asset/helexa-bench/bob.toml. New assets: bench-host.conf sudoers and
bob.toml (three neuron targets).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:10:07 +03:00
7bb20241a6 Merge branch 'feat/version-metadata-and-bench' into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build neuron-ada (push) Successful in 2m13s
build-prerelease / Build neuron-ampere (push) Successful in 2m15s
build-prerelease / Build neuron-blackwell (push) Successful in 2m30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m34s
build-prerelease / Build cortex binary (push) Successful in 2m38s
build-prerelease / Build helexa-bench binary (push) Successful in 3m40s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m53s
build-prerelease / Test (push) Successful in 4m35s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m14s
build-prerelease / Package cortex RPM (push) Successful in 1m16s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m42s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 50s
feat(bench): version-aware benchmark harness + neuron build metadata

Adds GET /version build metadata to neuron and the helexa-bench crate — a continuous, version-aware harness that records fleet benchmarks into SQLite keyed by neuron build SHA, replacing manual bench.py runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 15:33:33 +03:00
42da25a37c feat(bench): version-aware benchmark harness + neuron build metadata
All checks were successful
CI / CUDA type-check (push) Successful in 1m36s
CI / Format (push) Successful in 31s
CI / Clippy (push) Successful in 2m47s
CI / Test (push) Successful in 4m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Adds automated, longitudinal performance tracking across neuron builds,
replacing manual script/bench.py runs and hand edits to benchmarks.md.

neuron build metadata + GET /version:
- cortex-core: shared BuildInfo type (build_info.rs).
- neuron build.rs captures git SHA (preferring injected HELEXA_BUILD_SHA,
  else git, else "unknown"), dirty flag, build timestamp, rustc version,
  profile, target, enabled cargo features, and best-effort candle-core
  version from Cargo.lock.
- New GET /version endpoint (version.rs) + clap --version long form.
- SHA injected in CI (build-neuron step) and helexa-neuron.spec
  (%{?helexa_commit}) so tarball RPMs report the real SHA. /version is
  now the canonical "which build is live" probe.
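
The SHA fallback order described above (injected HELEXA_BUILD_SHA, else git, else "unknown") can be sketched as a pure function. `resolve_build_sha` and `non_empty` are hypothetical names; the actual build.rs presumably reads the environment and shells out to git itself, which this sketch leaves to the caller.

```rust
/// Treat missing, empty, or whitespace-only values as absent.
fn non_empty(value: Option<&str>) -> Option<&str> {
    value.map(str::trim).filter(|s| !s.is_empty())
}

/// Pick the build SHA: the injected value wins, then whatever git reported,
/// then the literal "unknown". Both inputs are supplied by the caller, e.g.
/// from std::env::var("HELEXA_BUILD_SHA").ok() and a git invocation.
pub fn resolve_build_sha(injected: Option<&str>, git: Option<&str>) -> String {
    non_empty(injected)
        .or(non_empty(git))
        .unwrap_or("unknown")
        .to_string()
}
```

Treating an empty injected value as absent is a design choice of this sketch: an RPM macro such as %{?helexa_commit} expands to nothing when undefined, so an empty string is a plausible input.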

helexa-bench crate:
- Continuous daemon: hits each neuron directly on :13131, exercises each
  warm (status==loaded) model, records every run into a SQLite
  system-of-record stamped with the neuron's full BuildInfo.
- Version-aware: skips any (target, build SHA, model, scenario) cell
  already at samples_per_version, so a steady fleet costs only cheap
  /version + /models polls until a new SHA ships.
- Extensible Scenario trait; phase-1 chat-latency family ported verbatim
  from bench.py (synthetic 128/4096-tok prompts, /no_think, streamed
  TTFT + decode-window tok/s). `report` regenerates the benchmarks table.
- kind="openai" comparison targets scaffolded, not yet wired.
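
The version-aware skip above can be pictured as a per-cell sample counter. This is a minimal in-memory sketch: `Cell`, `SampleCounts`, and their methods are hypothetical names, and the real harness keeps these counts in SQLite rather than a HashMap.

```rust
use std::collections::HashMap;

/// One cell of the sweep matrix: (target, build SHA, model, scenario).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Cell {
    pub target: String,
    pub build_sha: String,
    pub model: String,
    pub scenario: String,
}

/// In-memory stand-in for the store's per-cell run counts.
#[derive(Default)]
pub struct SampleCounts {
    counts: HashMap<Cell, u32>,
}

impl SampleCounts {
    /// Note that one more run was recorded for this cell.
    pub fn record(&mut self, cell: &Cell) {
        *self.counts.entry(cell.clone()).or_insert(0) += 1;
    }

    /// A cell needs another run until it holds samples_per_version samples.
    pub fn needs_run(&self, cell: &Cell, samples_per_version: u32) -> bool {
        self.counts.get(cell).copied().unwrap_or(0) < samples_per_version
    }
}
```

Because the build SHA is part of the key, a newly shipped SHA starts from zero samples and the sweep resumes without any explicit invalidation step.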

Packaging: data/helexa-bench.service (+ sysusers), prebuilt-binary RPM
spec (outbound-only, no firewalld), and build/package/publish wiring in
build-prerelease.yml with change detection.

Tests: cortex-core BuildInfo round-trip, neuron GET /version integration,
helexa-bench unit (prompt/SSE/config/store) + end-to-end sweep
(record -> skip -> resume on new SHA). Docs updated (benchmarks.md,
CLAUDE.md addendum).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 15:26:02 +03:00
30d50d6215 Merge pull request 'fix(ci): drop the unused flash-attn feature from neuron builds (#42)' (#46) from fix/42-drop-flash-attn into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Build neuron-blackwell (push) Successful in 1m28s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m43s
build-prerelease / Build cortex binary (push) Successful in 2m45s
build-prerelease / Test (push) Successful in 4m32s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 2m5s
build-prerelease / Build neuron-ada (push) Successful in 3m29s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 53s
2026-06-13 07:15:43 +00:00
9a312098dd fix(ci): drop the unused flash-attn feature from neuron builds (#42)
All checks were successful
CI / Format (push) Successful in 37s
CI / CUDA type-check (push) Successful in 2m9s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m56s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Format (pull_request) Successful in 34s
CI / CUDA type-check (pull_request) Successful in 1m37s
CI / Clippy (pull_request) Successful in 2m31s
CI / Test (pull_request) Successful in 4m45s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The neuron fleet builds with `cuda cudnn flash-attn`, but nothing in
neuron uses flash-attn: the qwen3_5 (27B) arch is hand-rolled, the
candle-transformers qwen3 model has no flash path, llama is built with
use_flash_attn=false, and `grep flash crates/neuron/src` finds nothing. The
feature only pulls in candle-flash-attn's sm_80/sm_86 CUDA kernel
sweep — which is exactly where ptxas SIGSEGVs/hangs in #42 (3 hits in
one day, the last a ~4-hour hang that stalled the whole deploy behind
the ampere job).

Dropping the feature removes the #42 failure surface at the root (not
a mitigation) and cuts the longest, most fragile part of each flavour
build. No runtime change — nothing called those kernels. Removed from
all three flavour builds in build-prerelease.yml and from deploy-dev.yml;
ci.yml's cuda-check already used `--features cuda` only.

Closes #42

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:43:14 +03:00
98e9749f22 Merge pull request 'feat(neuron): speculative decoding — acceptance core + config (#25, phase 1)' (#45) from feat/25-speculative-decoding into main
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 5m22s
build-prerelease / Build neuron-blackwell (push) Successful in 10m12s
build-prerelease / Build neuron-ada (push) Successful in 14m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-13 06:39:49 +00:00
ec764a2cac feat(neuron): speculative decoding — acceptance core + config (#25, phase 1)
All checks were successful
CI / Format (push) Successful in 36s
CI / Format (pull_request) Successful in 31s
CI / CUDA type-check (push) Successful in 1m50s
CI / CUDA type-check (pull_request) Successful in 1m44s
CI / Clippy (push) Successful in 2m38s
CI / Test (push) Successful in 4m21s
CI / Clippy (pull_request) Successful in 2m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m37s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
First phase of speculative decoding: the pure, state-free acceptance
logic and per-target config, unit-tested in isolation before the
draft/verify loop and GDN-state rollback wire it into the generation
path.

greedy_accept walks the drafter's K proposed tokens against the
target's greedy token at each of the K+1 positions, accepting the
longest matching prefix and always committing one bonus token on top
(the target's correction at the first mismatch, or a free extra token
when the whole draft matched). So a round commits 1..=K+1 tokens —
never zero, guaranteeing forward progress even with a useless drafter.
Greedy is exact for temperature-0 (the fleet probe + #22 bench
regime); stochastic acceptance is a later phase.
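
The acceptance rule described above can be sketched as follows. The signature is an assumption for illustration (plain u32 token ids, slices in, committed tokens out); the function in neuron may be shaped differently.

```rust
/// Greedy acceptance for one speculative round.
///
/// `draft` holds the drafter's K proposed tokens; `target` holds the target
/// model's greedy token at each of the K+1 positions. Returns the tokens
/// committed this round: the longest matching prefix plus one bonus token.
pub fn greedy_accept(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(
        target.len(),
        draft.len() + 1,
        "target must supply a greedy token for all K+1 positions"
    );
    let mut committed = Vec::with_capacity(target.len());
    for (i, &proposed) in draft.iter().enumerate() {
        if proposed == target[i] {
            // Draft token agrees with the target: accept it and continue.
            committed.push(proposed);
        } else {
            // First mismatch: commit the target's correction and stop.
            committed.push(target[i]);
            return committed;
        }
    }
    // Whole draft matched: the target's token at position K comes for free.
    committed.push(target[draft.len()]);
    committed
}
```

Every path pushes at least one token from `target`, which is the forward-progress guarantee: even a drafter that never matches still yields one committed token per round.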

SpeculativeConfig carries the drafter id (must share the target's
tokenizer — Qwen3.5-0.8B for the Qwen3.6-27B target, both qwen3_5,
byte-identical tokenizer, confirmed on beast) and the draft length K.

6 unit tests: full accept, partial accept, zero accept (progress
guarantee), last-position mismatch, single-token draft, config
gating. Not yet wired into the decode path — phase 2 (single-GPU
draft/verify) follows. Design + phasing on the issue.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 08:30:21 +03:00
4c1bdba31d Merge pull request 'feat(neuron): chunk the single-GPU vision prefill (parity with TP) (#18)' (#44) from feat/18-single-gpu-vision-chunked into main
Some checks failed
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Test (push) Successful in 4m7s
build-prerelease / Build neuron-blackwell (push) Successful in 9m55s
build-prerelease / Build neuron-ada (push) Successful in 13m10s
build-prerelease / Build neuron-ampere (push) Has started running
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-13 05:23:07 +00:00
988ef5afc2 feat(neuron): chunk the single-GPU vision prefill (parity with TP) (#18)
All checks were successful
CI / Format (push) Successful in 31s
CI / Format (pull_request) Successful in 41s
CI / CUDA type-check (pull_request) Successful in 1m27s
CI / CUDA type-check (push) Successful in 2m11s
CI / Clippy (push) Successful in 2m38s
CI / Clippy (pull_request) Successful in 2m31s
CI / Test (pull_request) Successful in 4m13s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Test (push) Successful in 4m39s
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
The single-GPU vision path was still single-shot: a long vision-bearing
prompt to a single-GPU-loaded qwen3_5 had the OOM exposure the TP path
shed in fa01350 (such a prompt was only guard-rejected, never served).

Mirror TpQwen3_5ForCausalLM::prefill_with_images_chunked onto the
single-GPU Qwen3_5ForCausalLM: encode the image(s) once, walk the
pre-expanded prompt in prefill_chunk_tokens() windows splicing the
per-chunk <|image_pad|> rows, accumulate KV + GDN state across chunks
via the growing offset, keep the last chunk's logits. Interleaved
M-RoPE positions are computed once over the whole prompt and sliced
per chunk (an image compresses the position space, so per-chunk offset
arithmetic would be wrong) — so Qwen3_5Model::forward_inner gains an
explicit position_ids path alongside the internal-from-grids
(single-shot) and plain (text/decode) paths, plus a forward_with_positions
entry point. The device-worker ForwardLogitsWithImages handler now
calls the chunked method; chunk size comes from prefill_chunk_tokens()
on the worker thread, so the Job/handle surface and the callers are
unchanged.
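
A simplified sketch of the windowing and the compute-once, slice-per-chunk position handling. `chunk_ranges` and `slice_positions` are hypothetical helpers, and positions are flattened to one u32 per token here, whereas real M-RoPE positions carry several components per token.

```rust
use std::ops::Range;

/// Split a prompt of `prompt_len` tokens into consecutive windows of at most
/// `chunk_size` tokens; only the last window may be shorter.
pub fn chunk_ranges(prompt_len: usize, chunk_size: usize) -> Vec<Range<usize>> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    (0..prompt_len)
        .step_by(chunk_size)
        .map(|start| start..(start + chunk_size).min(prompt_len))
        .collect()
}

/// Positions are computed once over the whole prompt and then sliced, rather
/// than re-derived from each chunk's token offset.
pub fn slice_positions<'a>(positions: &'a [u32], chunk: &Range<usize>) -> &'a [u32] {
    &positions[chunk.clone()]
}
```

With a 217-token prompt and a chunk size of 64 this yields four windows, the last holding 25 tokens. Slicing a precomputed array matters because, once an image compresses the position space, a token's position no longer equals its index in the prompt.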

The shared validate_vision_prefill VRAM/KV backstop stays (TP keeps it
too) — chunking bounds activation memory, not the accumulating KV
cache, so the guard still does useful work.

Verified on real weights (Qwen3.5-0.8B): extended the #15 vision
reference test to also run the chunked path with chunk_size=64 over the
217-token prompt (4 chunks; the ~196-token image-pad run spans them).
Chunked vs single-shot logits: cosine 1.000000, max_abs 0.0001;
argmax matches the HF reference. The test covers all three
forward_inner branches (text plain / single-shot vision / chunked
vision) on a real single-GPU qwen3_5 load.

Closes #18

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 08:17:11 +03:00
a1450789d2 Merge pull request 'docs(learnings): source-control P1 + P2 sprint learnings' (#43) from docs/learnings-p1-p2 into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 35s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
2026-06-13 04:21:11 +00:00
2eaa776d85 docs(learnings): source-control P1 + P2 sprint learnings
All checks were successful
CI / Format (push) Successful in 35s
CI / Format (pull_request) Successful in 35s
CI / CUDA type-check (push) Successful in 1m32s
CI / CUDA type-check (pull_request) Successful in 1m37s
CI / Clippy (push) Successful in 2m30s
CI / Clippy (pull_request) Successful in 2m31s
CI / Test (push) Successful in 4m24s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m27s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
doc/plan/* is gitignored, so the P1 learnings briefing could never be
committed. Move it to doc/learnings/p1.md (verbatim) and add
doc/learnings/p2.md capturing the P2 sprint (#11/#23/#1/#15).

The P2 doc's headline: CI green != correct. Four correctness bugs
passed every CI gate and surfaced only on the live fleet (post-gen
snapshots never re-match reasoning models; full-prompt snapshots
break on BPE retokenization; the chunked delta-rule's nilpotent-
squaring shortcut NaNs on correlated keys; the 0.8B masked two of
these by luck). Plus the device-worker/TP state patterns, the
deploy-dev + systemd-drop-in A/B loop, the per-package change-
detection fleet-split failure mode (#42), and the f32-fixture
numerical-validation rig (#15).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 07:13:36 +03:00
7918995e5a chore(ci): retrigger build-prerelease — ampere ptxas segfault (flash-attn sm_86, runner-side) on 538cc87
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m12s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m57s
build-prerelease / Build neuron-blackwell (push) Successful in 9m5s
build-prerelease / Build neuron-ampere (push) Successful in 15m3s
build-prerelease / Build neuron-ada (push) Successful in 19m4s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m13s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m50s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m12s
2026-06-13 00:12:24 +03:00
538cc87572 Merge pull request 'feat(neuron): numerical validation against the transformers reference (#15)' (#41) from feat/15-numerical-reference into main
Some checks failed
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m13s
build-prerelease / Build cortex binary (push) Successful in 2m43s
build-prerelease / Test (push) Successful in 4m25s
build-prerelease / Package cortex RPM (push) Successful in 1m39s
build-prerelease / Build neuron-ampere (push) Failing after 8m54s
build-prerelease / Build neuron-blackwell (push) Successful in 10m11s
build-prerelease / Build neuron-ada (push) Successful in 14m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 56s
2026-06-12 20:43:37 +00:00
1c4b53cbf1 feat(neuron): numerical validation against the transformers reference (#15)
All checks were successful
CI / Format (push) Successful in 44s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 1m39s
CI / CUDA type-check (pull_request) Successful in 2m6s
CI / Clippy (push) Successful in 2m23s
CI / Clippy (pull_request) Successful in 2m24s
CI / Test (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m1s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
script/dump_reference.py captures fixtures from the HF qwen3_5
implementation (token ids + reference tensors, f32 by default so the
comparison pins math rather than dtype noise);
tests/numerical_reference.rs replays them through our arch and
asserts argmax equality, cosine similarity, and max-abs ceilings. The
tests self-skip without NEURON_REF_MODEL_PATH so CI stays green
without weights.
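
A minimal sketch of the three checks named above (argmax equality, cosine similarity, max-abs ceiling), assuming logits arrive as flat f32 slices. The function names are illustrative and are not taken from tests/numerical_reference.rs; NaN handling is left out.

```rust
/// Index of the largest element (first one wins on ties); None if empty.
fn argmax(xs: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in xs.iter().enumerate() {
        match best {
            Some((_, b)) if x <= b => {}
            _ => best = Some((i, x)),
        }
    }
    best.map(|(i, _)| i)
}

/// Cosine similarity, accumulated in f64 so the metric itself adds no noise.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Largest element-wise absolute difference.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0f32, f32::max)
}
```

A test in this style would assert that `argmax` agrees between the two tensors, that `cosine_similarity` stays above a floor, and that `max_abs_diff` stays below a ceiling.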

Measured on beast (f32-vs-f32): text logits max_abs 0.000 / cosine
1.000000 (the >64-token prompt routes through the chunked GDN
prefill, so the production prefill math is what's validated); vision
tower cosine 0.999998, end-to-end vision logits cosine 1.000000 with
identical argmax. Mutation sensitivity: NEURON_VISION_LEGACY_POS=1
collapses tower cosine to 0.75 and fails loudly.

One production fidelity fix the harness surfaced: the pos-embed
bilinear blend now accumulates in f32 and casts once at the end,
matching the reference (we previously rounded the weights to bf16
before blending).

Fixtures: 0.8B text + vision (f32), 27B text (bf16 — an f32 27B
forward needs ~108 GB; the automated comparison runs against the
0.8B, which executes the same arch modules). Regeneration documented
in tests/fixtures/numerical/README.md.

Closes #15

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:35:57 +03:00
49a8dbcd28 Merge pull request 'perf(neuron): parallel in-situ quantization + cold-load phase timing (#1)' (#40) from perf/1-parallel-isq into main
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Build cortex binary (push) Successful in 2m21s
build-prerelease / Test (push) Successful in 3m56s
build-prerelease / Build neuron-blackwell (push) Successful in 9m38s
build-prerelease / Build neuron-ada (push) Successful in 14m21s
build-prerelease / Build neuron-ampere (push) Successful in 19m0s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m24s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m12s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 4m29s
2026-06-12 20:12:44 +00:00
90e971dcf5 perf(neuron): parallel in-situ quantization + cold-load phase timing (#1)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 35s
CI / CUDA type-check (push) Successful in 1m50s
CI / CUDA type-check (pull_request) Successful in 2m7s
CI / Clippy (pull_request) Successful in 2m18s
CI / Clippy (push) Successful in 2m46s
CI / Test (push) Successful in 5m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 5m33s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
QTensor::quantize runs its per-block math strictly sequentially on
one core (CUDA storage round-trips through the same CPU path), which
made Q6K ISQ the dominant phase of the 27B TP cold load. Blocks are
independent, so quantize_parallel re-implements the same encoding
through candle's public per-block API (k_quants::GgmlType::from_float)
with rayon fanning blocks across the CPU pool — byte-identical output,
pinned by parity tests against QTensor::quantize for Q6K/Q5K/Q4K/Q8_0.
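
The idea can be sketched with the standard library alone. The block format below is a toy (one f32 scale plus 32 signed bytes) and is not candle's Q6K/Q5K/Q4K/Q8_0 layout; std::thread::scope stands in for rayon, and every name is hypothetical. The point is only that independent blocks written into disjoint output ranges give the same bytes as the sequential loop.

```rust
use std::thread;

const BLOCK: usize = 32; // values per block (toy format)
const BLOCK_BYTES: usize = 4 + BLOCK; // f32 scale + 32 signed bytes

/// Encode one block: a per-block scale followed by rounded i8 values.
fn toy_quantize_block(src: &[f32], dst: &mut [u8]) {
    let amax = src.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    dst[..4].copy_from_slice(&scale.to_le_bytes());
    for (d, &x) in dst[4..].iter_mut().zip(src) {
        *d = (x / scale).round().clamp(-127.0, 127.0) as i8 as u8;
    }
}

/// Reference path: one block after another on the calling thread.
fn toy_quantize_sequential(data: &[f32]) -> Vec<u8> {
    assert_eq!(data.len() % BLOCK, 0);
    let mut out = vec![0u8; data.len() / BLOCK * BLOCK_BYTES];
    for (s, d) in data.chunks(BLOCK).zip(out.chunks_mut(BLOCK_BYTES)) {
        toy_quantize_block(s, d);
    }
    out
}

/// Parallel path: each worker owns a contiguous run of blocks and the
/// matching, non-overlapping slice of the output buffer.
fn toy_quantize_parallel(data: &[f32], workers: usize) -> Vec<u8> {
    assert!(workers > 0);
    assert_eq!(data.len() % BLOCK, 0);
    let n_blocks = data.len() / BLOCK;
    let mut out = vec![0u8; n_blocks * BLOCK_BYTES];
    let per_worker = ((n_blocks + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let inputs = data.chunks(per_worker * BLOCK);
        let outputs = out.chunks_mut(per_worker * BLOCK_BYTES);
        for (src, dst) in inputs.zip(outputs) {
            s.spawn(move || {
                for (b_in, b_out) in src.chunks(BLOCK).zip(dst.chunks_mut(BLOCK_BYTES)) {
                    toy_quantize_block(b_in, b_out);
                }
            });
        }
    });
    out
}
```

A parity test in the same spirit compares `toy_quantize_parallel(&data, n)` against `toy_quantize_sequential(&data)` byte for byte.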

Threading discipline holds: the device-to-host read and the
QStorage::from_data upload stay on the calling thread (device worker /
subprocess main); rayon workers touch host memory only.

Also adds the per-phase timing the issue asked for first: per-layer
debug + layer-loop total + lm_head info lines, so the next cold load
shows where the time actually goes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:47:57 +03:00
92273eb936 chore(ci): retrigger build-prerelease — ampere/blackwell packaging skipped after transient build failure on 128b381
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m10s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m54s
build-prerelease / Build neuron-blackwell (push) Successful in 9m36s
build-prerelease / Build neuron-ada (push) Successful in 14m6s
build-prerelease / Build neuron-ampere (push) Successful in 19m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m8s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m8s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
2026-06-12 22:38:31 +03:00
128b3818cb Merge pull request 'perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23)' (#39) from perf/23-chunked-gdn-prefill into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m13s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m57s
build-prerelease / Build neuron-blackwell (push) Successful in 10m3s
build-prerelease / Build neuron-ada (push) Successful in 14m11s
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 14m3s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s
2026-06-12 18:44:22 +00:00
812d191e50 fix(neuron): UT transform by forward substitution, not nilpotent squaring
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 53s
CI / CUDA type-check (push) Successful in 1m52s
CI / CUDA type-check (pull_request) Successful in 2m12s
CI / Clippy (push) Successful in 2m18s
CI / Clippy (pull_request) Successful in 2m36s
CI / Test (push) Successful in 4m18s
CI / Test (pull_request) Successful in 4m22s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Live A/B on beast produced NaN logits ("!!!" replies) on real prompts:
the nilpotent-squaring form of (I - T)^-1 computes raw powers of T,
whose entries grow combinatorially (path counts ~ C(62,31)) before
nilpotency collapses them — fine on uncorrelated test data, f32
precision death on real prompts whose repetitive text makes keys
highly correlated. The reference's forward-substitution loop never
forms raw powers; its intermediates are the convergent M entries.

Port the reference loop faithfully (rows accumulate into a fresh
tensor). New adversarial parity test with near-identical keys and
beta ~= 1 diverges to 8e30 under the squaring form and passes under
forward substitution.
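
For illustration, both forms of (I - T)^-1 for a strictly lower-triangular T can be written on plain nested vectors. This is a sketch with hypothetical names, not the candle code or the HF reference loop. On a small, well-conditioned T the two agree; it does not reproduce the f32 blow-up described above, which needs correlated keys at chunk size 64.

```rust
type Mat = Vec<Vec<f64>>;

fn identity(n: usize) -> Mat {
    (0..n)
        .map(|i| (0..n).map(|j| if i == j { 1.0 } else { 0.0 }).collect())
        .collect()
}

fn matmul(a: &Mat, b: &Mat) -> Mat {
    let n = a.len();
    let mut c = vec![vec![0.0; n]; n];
    for i in 0..n {
        for k in 0..n {
            for j in 0..n {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn max_diff(a: &Mat, b: &Mat) -> f64 {
    let mut worst = 0.0f64;
    for (ra, rb) in a.iter().zip(b) {
        for (x, y) in ra.iter().zip(rb) {
            worst = worst.max((x - y).abs());
        }
    }
    worst
}

/// M = (I - T)^-1 from M = I + T*M, solved row by row. Row i only needs
/// rows k < i of M, so no raw power of T is ever formed.
fn inverse_forward_substitution(t: &Mat) -> Mat {
    let n = t.len();
    let mut m = identity(n);
    for i in 1..n {
        for k in 0..i {
            let w = t[i][k];
            for j in 0..=k {
                let v = m[k][j];
                m[i][j] += w * v;
            }
        }
    }
    m
}

/// Same inverse as prod_j (I + T^(2^j)); `power` holds the raw T^(2^j).
fn inverse_by_squaring(t: &Mat) -> Mat {
    let n = t.len();
    let mut result = identity(n); // equals sum of T^0 .. T^(covered-1)
    let mut power = t.clone();
    let mut covered = 1;
    while covered < n {
        let mut factor = identity(n);
        for i in 0..n {
            for j in 0..n {
                factor[i][j] += power[i][j];
            }
        }
        result = matmul(&result, &factor);
        power = matmul(&power, &power);
        covered *= 2;
    }
    result
}
```

For n = 64 the squaring loop runs six times, matching the six factors in the original formulation.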

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:18:32 +03:00
2a9def6d2d perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 24s
CI / CUDA type-check (push) Successful in 1m38s
CI / CUDA type-check (pull_request) Successful in 2m10s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m29s
CI / Test (pull_request) Successful in 4m21s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Prefill (seq_len >= 64) now runs the chunk-parallel gated delta rule
ported from the HF reference torch_chunk_gated_delta_rule
(chunk_size=64): identical math reorganised into per-chunk batched
matmuls (cuBLAS/tensor cores on CUDA, gemm on CPU) instead of the
O(L)-sequential per-token recurrence. Decode steps and short prompts
keep the recurrent paths (CUDA kernel / Rust loop) unchanged.

One deliberate deviation from the reference: its in-place row-by-row
UT-transform computes (I - T)^-1 - I by forward substitution; T is
strictly lower triangular and therefore nilpotent at chunk size 64,
so the same inverse is the product of six squarings
prod_{j=0..5}(I + T^(2^j)) — batched matmuls instead of 63 sequential
row updates, which suits candle's immutable tensors. Chunk-local math
runs rank-3 over a flattened B*H*N batch dim (candle matmul supports
at most two batch dims).

Initial-state continuation is supported, so chunked prefill composes
with #11's restored prefix snapshots. Both single-GPU and TP paths
pick this up through the shared run_delta_rule dispatch.
NEURON_GDN_CHUNKED=0 forces the recurrent paths for A/B measurement.

Parity tests pin chunked against recurrent (2e-4 abs) across padding
(L=130), exact multiples with non-zero initial state (L=128 after a
50-token prefix), and a single exact chunk.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:51:51 +03:00
ddb331e1a3 Merge pull request 'docs(bench): record post-#11 fleet numbers' (#38) from docs/benchmarks-post-11 into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m47s
build-prerelease / Test (push) Successful in 4m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
2026-06-12 17:14:00 +00:00
df0bf4c518 docs(bench): record post-#11 fleet numbers
All checks were successful
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 1m31s
CI / CUDA type-check (pull_request) Successful in 2m7s
CI / Clippy (push) Successful in 2m24s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m24s
CI / Test (pull_request) Successful in 3m57s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Appends the 2026-06-12 post-prefix-cache run: 27B @4k warm TTFT
7.07 s -> 1.43 s, no-cache control models unchanged, with a
methodology note that repeated-prompt cells now measure warm TTFT on
qwen3_5-arch models.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:06:53 +03:00
a1952a4522 Merge pull request 'fix(neuron): snapshot at the last special-token boundary (#11)' (#37) from fix/11-snapshot-cut-retokenization into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m37s
build-prerelease / Test (push) Successful in 4m21s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 10m22s
build-prerelease / Build neuron-ampere (push) Successful in 13m8s
build-prerelease / Build neuron-ada (push) Successful in 21m31s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
2026-06-12 16:24:15 +00:00
4f266dbd82 fix(neuron): snapshot at the last special-token boundary (#11)
All checks were successful
CI / Format (push) Successful in 42s
CI / Format (pull_request) Successful in 34s
CI / CUDA type-check (push) Successful in 1m31s
CI / Clippy (push) Successful in 2m19s
CI / CUDA type-check (pull_request) Successful in 2m10s
CI / Test (push) Successful in 4m13s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m9s
CI / Test (pull_request) Successful in 4m5s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Second finding from live 27B validation: prompt-covering snapshots
still never matched. The rendered prompt ends with
`<|im_start|>assistant\n`, and when the next turn re-tokenizes that
text followed by the assistant's reply, BPE merges the trailing
newline with the reply's first characters — the final token(s) of the
cached sequence differ from the next prompt's, so the exact-prefix
match never fires. (A reply starting with an atomic special token
like <think> masks this, which is why the 0.8B check passed.)

Snapshot one past the last <|im_start|> instead: special tokens are
hard segmentation points, so ids up to and including it are provably
identical across renders. Prefill pauses at that boundary to capture
the snapshot, then finishes the ~2-token `assistant\n` tail. Applied
to all six request paths; unit tests for the cut helper.
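
A sketch of what such a cut helper might look like, assuming token ids are u32 and the id of `<|im_start|>` is known. The name `snapshot_cut` and its signature are hypothetical, not the helper in the repository.

```rust
/// Number of leading token ids to snapshot: everything up to and including
/// the last occurrence of `boundary_id`. None if the boundary never appears.
fn snapshot_cut(ids: &[u32], boundary_id: u32) -> Option<usize> {
    ids.iter().rposition(|&id| id == boundary_id).map(|i| i + 1)
}
```

Tokens after the cut (the `assistant\n` tail) are prefilled normally and never enter the cached sequence, so a later re-tokenization of that tail cannot break the prefix match.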

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:16:45 +03:00
43a6d96d5f Merge pull request 'fix(neuron): snapshot prefix cache at the prefill boundary (#11)' (#36) from fix/11-prefix-snapshot-at-prefill into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 36s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 4m2s
build-prerelease / Build neuron-ampere (push) Successful in 13m22s
build-prerelease / Build neuron-blackwell (push) Successful in 13m31s
build-prerelease / Build neuron-ada (push) Successful in 14m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m12s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m50s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
2026-06-12 15:34:59 +00:00
3fd1989b2b fix(neuron): snapshot prefix cache at the prefill boundary (#11)
All checks were successful
CI / Format (push) Successful in 41s
CI / Format (pull_request) Successful in 42s
CI / CUDA type-check (push) Successful in 1m39s
CI / CUDA type-check (pull_request) Successful in 2m6s
CI / Clippy (push) Successful in 3m10s
CI / Clippy (pull_request) Successful in 3m3s
CI / Test (pull_request) Successful in 4m2s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Live validation on beast's Qwen3.6-27B showed reused=0 on every turn:
the post-generation snapshot includes reasoning tokens (<think>...)
that get stripped when the client echoes the assistant message back,
so the cached sequence is never a token-prefix of the next prompt.
quadbrat's 0.8B only matched because its think block round-tripped as
literal text.

Snapshot after prefill instead (covering exactly the prompt tokens) —
that is the state the next turn provably extends under a stable chat
template, regardless of how reasoning or tool-call content is
transformed on echo. Taken after the first healthy sample so
NaN-poisoned prefills never cache their state; this also retires the
forwarded-token bookkeeping and the consumer-hangup store sites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 18:29:00 +03:00
f7952547e7 Merge pull request 'feat(neuron): prefix KV caching for the TP path (#11)' (#35) from feat/11-prefix-kv-cache-tp into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m12s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m58s
build-prerelease / Build neuron-blackwell (push) Successful in 9m5s
build-prerelease / Build neuron-ada (push) Successful in 14m22s
build-prerelease / Build neuron-ampere (push) Successful in 19m0s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m51s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
2026-06-12 14:49:19 +00:00
7e66f77851 fix(neuron): CUDA type-check fixes for TP prefix cache
All checks were successful
CI / Format (push) Successful in 38s
CI / Format (pull_request) Successful in 39s
CI / CUDA type-check (pull_request) Successful in 1m26s
CI / CUDA type-check (push) Successful in 1m34s
CI / Clippy (push) Successful in 3m14s
CI / Clippy (pull_request) Successful in 3m18s
CI / Test (push) Successful in 5m15s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 3m56s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Two errors only the cuda config surfaces: the TpSnapshotKv dispatch
arms mixed candle and anyhow error types, and restore_or_clear_tp held
the registry MutexGuard across the cleanup await inside a let-chain
(making the TP request futures non-Send). Bind the removed ref before
awaiting, same discipline as the other lock sites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:39:32 +03:00
e629e1872c feat(neuron): prefix KV caching for the TP path (#11)
Some checks failed
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 31s
CI / CUDA type-check (push) Failing after 1m55s
CI / CUDA type-check (pull_request) Failing after 1m47s
CI / Clippy (push) Successful in 2m11s
CI / Test (push) Successful in 4m15s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m23s
CI / Test (pull_request) Successful in 4m0s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Extends the prefix cache to tensor-parallel models — Qwen3.6-27B on
beast, where the TTFT win is largest. Closes #11.

Every rank holds its shard's snapshot under one pool-minted id: the
leader's lives in the device worker beside the TP slab
(Job::TpSnapshotKv / TpRestoreKv / TpDropKvSnapshot), each subprocess
rank stores its own in-process via new WorkerRequest variants
(SnapshotKvCache / RestoreKvCache / DropKvSnapshot). Shard state has
the same shape as single-GPU (attention ConcatKvCache + GDN
conv/recurrent state + rope_delta), so the snapshot types are reused;
all ranks sit at the same token boundary because step fan-out is
synchronous.

Consistency on partial failure: a failed restore falls back to
clear-all-ranks + full prefill (and drops the entry); a failed
snapshot drops the id on every rank so nothing half-stored leaks.
DropTp / UnloadModel invalidate a model's snapshots with it, covering
auto-recovery. Vision requests bypass as on single-GPU. Budget
accounting uses leader bytes x world_size (shards are symmetric).

Wired into both TP request paths (non-streaming inner + streaming
orchestration task); chunked_prefill_tp gains the restored-offset
start.

Closes #11

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:34:49 +03:00
bb558451db Merge pull request 'feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)' (#34) from feat/11-prefix-kv-cache into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m15s
build-prerelease / Test (push) Successful in 4m0s
build-prerelease / Build neuron-blackwell (push) Successful in 9m44s
build-prerelease / Build neuron-ampere (push) Successful in 12m47s
build-prerelease / Build neuron-ada (push) Successful in 19m6s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 4m2s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 4m8s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
2026-06-12 14:20:24 +00:00
c5378d532d feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 34s
CI / Clippy (push) Successful in 2m29s
CI / CUDA type-check (pull_request) Successful in 1m31s
CI / CUDA type-check (push) Successful in 1m37s
CI / Clippy (pull_request) Successful in 2m32s
CI / Test (push) Successful in 4m24s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m23s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Stop discarding cache state between requests. When an incoming
prompt's token sequence starts with the exact tokens of a stored
snapshot, restore it and prefill only the divergent suffix.
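
The matching rule can be sketched as a longest-exact-prefix lookup. The `Snapshot` struct and `best_prefix_match` are illustrative names under the assumption that the async side keeps each snapshot's id and token sequence; the case where a snapshot covers the whole prompt is not treated specially here.

```rust
/// What the async side is assumed to hold per snapshot in this sketch.
struct Snapshot {
    id: u64,
    tokens: Vec<u32>,
}

/// Longest stored snapshot whose tokens are an exact prefix of `prompt`.
fn best_prefix_match<'a>(snapshots: &'a [Snapshot], prompt: &[u32]) -> Option<&'a Snapshot> {
    snapshots
        .iter()
        .filter(|s| !s.tokens.is_empty() && prompt.starts_with(&s.tokens))
        .max_by_key(|s| s.tokens.len())
}
```

On a hit, only `&prompt[hit.tokens.len()..]` would still need prefill; on a miss the request falls back to a full prefill.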

For the hybrid qwen3_5 arch a snapshot is attention ConcatKvCache k/v
+ GatedDeltaNet conv/recurrent state + the rope_delta counter, all at
one token boundary; the recurrent state cannot rewind, so matching is
exact-prefix only. GDN states are deep-copied both directions (the
CUDA delta-rule kernels mutate the state buffer in place); attention
k/v snapshots share storage safely (append-by-cat never mutates).

Snapshots live in the device worker's state next to the model slab
(Job::SnapshotKv / RestoreKv / DropKvSnapshot); the async side holds
only an opaque id + token sequence + byte size. DropArch drops a
model's snapshots with it, so unload and auto-recovery invalidate for
free. CPU loads hold snapshots inline on the legacy path.

Per-model LRU registry (harness/prefix_cache.rs) bounded by
[harness.candle.prefix_cache] budget_mb / max_entries, enabled by
default; inserting a snapshot drops entries it strictly extends.
Vision requests and candle-transformers archs bypass the cache
entirely (clear-every-request, unchanged).
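
A rough sketch of the insert rule, assuming entries are kept oldest-first and bounded by entry count only. The real registry is also bounded by budget_mb and tracks recency on hits, both omitted here; `PrefixRegistry` is a hypothetical name.

```rust
/// Toy registry: token sequences only, oldest entry first.
struct PrefixRegistry {
    max_entries: usize,
    entries: Vec<Vec<u32>>,
}

impl PrefixRegistry {
    fn insert(&mut self, tokens: Vec<u32>) {
        // Drop every stored sequence that the new one strictly extends:
        // the longer snapshot serves any prompt the shorter one could.
        self.entries
            .retain(|old| !(old.len() < tokens.len() && tokens.starts_with(old)));
        self.entries.push(tokens);
        // Evict from the old end until the count bound holds again.
        while self.entries.len() > self.max_entries {
            self.entries.remove(0);
        }
    }
}
```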

Covers the single-GPU worker path (streaming + non-streaming) and the
CPU-local path. The TP path (Qwen3.6-27B on beast) is a follow-up PR
that closes #11 with before/after bench numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:14:07 +03:00
9f383e7bc7 Merge pull request 'feat(gateway): Anthropic streaming SSE translation (#24)' (#33) from feat/gateway-24-anthropic-sse into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m13s
build-prerelease / Test (push) Successful in 3m59s
build-prerelease / Build cortex binary (push) Successful in 2m15s
build-prerelease / Build neuron-blackwell (push) Successful in 9m59s
build-prerelease / Build neuron-ada (push) Successful in 14m24s
build-prerelease / Build neuron-ampere (push) Successful in 19m3s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m59s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
2026-06-12 12:57:09 +00:00
569c528c4b feat(gateway): Anthropic streaming SSE translation (#24)
All checks were successful
CI / Format (push) Successful in 36s
CI / CUDA type-check (push) Successful in 2m25s
CI / Clippy (push) Successful in 2m25s
CI / Format (pull_request) Successful in 41s
CI / CUDA type-check (pull_request) Successful in 2m9s
CI / Clippy (pull_request) Successful in 2m45s
CI / Test (push) Successful in 5m3s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m29s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The /v1/messages handler translated request envelopes but proxied raw
OpenAI SSE frames back to streaming Anthropic clients — the gap
between the README's "point your tooling at it once" contract and
what Claude Code actually received.

cortex-core gains AnthropicStreamTranslator, a pure per-stream state
machine: OpenAI chunks in, ordered (event, payload) pairs out —
message_start → content_block_start/delta/stop (text and tool_use
blocks, indexed; tool_calls map to input_json_delta) → message_delta
(stop_reason mapped via the now-shared map_stop_reason, which also
teaches the non-streaming path tool_calls→tool_use) → message_stop.
Without an upstream usage frame the output count falls back to the
delta count (engine-exact for neuron's one-chunk-per-token streams,
#31); with one, input/output tokens ride message_delta.
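
As a sketch of the stop-reason step only: the tool_calls to tool_use arm is the one stated above; the length to max_tokens arm and the end_turn default are assumptions based on the public OpenAI and Anthropic field values, not on the actual map_stop_reason in cortex-core.

```rust
/// Hypothetical stand-in for the shared mapping; takes the OpenAI
/// `finish_reason` and returns an Anthropic `stop_reason`.
fn map_stop_reason_sketch(finish_reason: Option<&str>) -> &'static str {
    match finish_reason {
        Some("tool_calls") => "tool_use", // stated in the commit message
        Some("length") => "max_tokens",   // assumption
        _ => "end_turn",                  // assumption: default for "stop" or missing
    }
}
```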

cortex-gateway gains anthropic_sse: the wire pump that splits the
upstream byte stream into SSE events, parses data: payloads
(leniently — engines omit fields on special frames), feeds the
translator, and frames results as `event:`/`data:` pairs through a
bounded channel (slow client back-pressures the upstream read).
Upstream truncation without [DONE] still closes the Anthropic event
sequence. Nothing is buffered beyond the current event's bytes.
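
The splitting step might look roughly like the following. This is a simplified sketch, not the anthropic_sse module: it assumes LF-only line endings, returns each data: line as its own payload instead of joining multi-line data, and leaves out the channel and the translator. `SseSplitter` is a hypothetical name.

```rust
/// Accumulates upstream bytes and yields the `data:` payloads of every
/// complete SSE event (events end with a blank line).
struct SseSplitter {
    buf: Vec<u8>,
}

impl SseSplitter {
    fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(chunk);
        let mut payloads = Vec::new();
        // Consume complete events; a trailing partial event stays buffered.
        while let Some(pos) = self.buf.windows(2).position(|w| w == b"\n\n") {
            let event: Vec<u8> = self.buf.drain(..pos + 2).collect();
            let text = String::from_utf8_lossy(&event[..pos]);
            for line in text.lines() {
                if let Some(data) = line.strip_prefix("data:") {
                    payloads.push(data.trim_start().to_string());
                }
            }
        }
        payloads
    }
}
```

Because complete events are drained as soon as they are seen, the buffer never holds more than the bytes of the event currently in flight.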

Tests: 5 state-machine unit tests (text flow, stop-reason mapping +
defaults, tool_use blocks, usage propagation, idempotent finish) and
2 gateway integration tests (full event sequence + text reassembly,
usage propagation into message_delta). Validated end-to-end by
running this branch's gateway against a production neuron and
streaming a live Anthropic request.

Closes #24

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:47:30 +03:00
06e4ffc25c Merge pull request 'feat(bench): reproducible benchmark harness + first fleet numbers (#22)' (#32) from feat/22-benchmark-harness into main
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m27s
build-prerelease / Build cortex binary (push) Successful in 2m41s
build-prerelease / Package cortex RPM (push) Successful in 1m29s
build-prerelease / Test (push) Successful in 4m44s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:46:33 +00:00
a2e73a8907 feat(bench): reproducible batch-1 benchmark harness + first fleet numbers (#22)
All checks were successful
CI / Format (push) Successful in 40s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 2m8s
CI / CUDA type-check (pull_request) Successful in 2m8s
CI / Clippy (push) Successful in 2m23s
CI / Test (pull_request) Successful in 3m54s
CI / Test (push) Successful in 6m23s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 4m23s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
script/bench.py: stdlib-only, works against any OpenAI-compatible /v1
endpoint (helexa, llama.cpp, Ollama, vLLM) so cross-engine tables are
a concatenation via the --label column. Measures the operator-felt
trio per (model, prompt-size) cell: TTFT (first SSE content chunk),
decode tok/s (visible tokens over the first→last chunk window,
counting one token per chunk, an engine invariant we rely on because
streaming usage frames aren't emitted yet; see #31), total
wall-clock. Medians over N runs after one
warmup; append-only JSONL for longitudinal tracking.

Measurement traps found against the live fleet and handled:
- thinking models burn the budget invisibly (reasoning deltas are
  off-wire by default) — the prompt appends Qwen's /no_think soft
  switch
- short coalesced replies collapse the decode window to one TCP read
  — rates require a ≥200 ms window and the prompt demands ~300 words
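
The per-run arithmetic can be sketched as follows. This is an
illustration of the method described above, not an excerpt from
script/bench.py; the function names and the choice to report None
for a too-short window are assumptions.

```python
from statistics import median

MIN_WINDOW_S = 0.2  # from the commit: rates require a >=200 ms window


def measure(request_start, chunk_times):
    """chunk_times: monotonic arrival times of visible content chunks.

    Relies on the chunk-per-token invariant, so len(chunk_times) stands
    in for the visible token count.
    """
    if not chunk_times:
        return None, None
    ttft = chunk_times[0] - request_start
    window = chunk_times[-1] - chunk_times[0]
    if window < MIN_WINDOW_S:
        # A coalesced reply collapses the window; report no rate at all
        # rather than an inflated one.
        return ttft, None
    return ttft, len(chunk_times) / window


def summarize(values):
    """Median over the runs that produced a usable value."""
    usable = [v for v in values if v is not None]
    return median(usable) if usable else None
```

The warmup run is simply discarded before calling summarize.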

doc/benchmarks.md: method, fleet table, and the first published
numbers (2026-06-12, 8f6f1d3): 1.7B@3060 81 tok/s, 8B@4090 62 tok/s,
27B@2×5090 Q6K TP=2 35 tok/s with flat decode from 128→4k context —
and the 7.1 s 4k-prefill TTFT recorded as #23's before-number.

Refs #22 (competitor baselines still pending — the harness is ready
for them)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:39:13 +03:00
8f6f1d3205 feat(deploy): validate neuron capability after every deploy
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Package cortex RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 10m36s
build-prerelease / Build cortex binary (push) Successful in 2m35s
build-prerelease / Test (push) Successful in 6m35s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
A deploy previously went green the moment systemd reported the
service started — a merge that broke model loading or inference
itself would deploy "successfully" and only surface when a human
noticed. Each neuron deploy now earns its green:

1. Wait for default models: poll /health until activation.state is
   ready, with per-host timeouts in the matrix (beast 900s for the
   27B Q6K TP=2 cold-load, benjy/quadbrat 300s). Any entry in
   activation.failed fails the deploy with the per-model error —
   the structured equivalent of watching the journal for
   "loaded default model", plus failure detail the journal line
   can't carry.
2. LLM smoke probe: ask the first loaded model to reply with one
   specific word (max_tokens 512 so thinking models have room,
   temperature 0) and grep the response for it. Not a quality bar —
   just proof the deploy didn't lobotomize inference.
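
Step 1 can be sketched as a small poll loop. The /health field names
activation.state and activation.failed come from the description
above; the shape of each failed entry (model, error), the "loading"
state name used in the usage example, and the function names are
assumptions for illustration.

```python
import time


def evaluate(health):
    """Classify one /health payload as ready, failed, or waiting."""
    activation = health.get("activation") or {}
    failed = activation.get("failed") or []
    if failed:
        # Assumed entry shape: {"model": ..., "error": ...}
        detail = "; ".join(f"{f.get('model')}: {f.get('error')}" for f in failed)
        return "failed", detail
    if activation.get("state") == "ready":
        return "ready", ""
    return "waiting", ""


def wait_ready(fetch, timeout_s, interval_s=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll fetch() until ready, failed, or the per-host timeout expires."""
    deadline = clock() + timeout_s
    while True:
        try:
            status, detail = evaluate(fetch())
        except Exception:
            # The service may still be starting; treat errors as "not yet".
            status, detail = "waiting", ""
        if status != "waiting":
            return status, detail
        if clock() >= deadline:
            return "timeout", f"not ready after {timeout_s}s"
        sleep(interval_s)
```

Passing fetch, clock, and sleep in as parameters keeps the loop
testable without a live neuron; a real caller would pass a function
that GETs /health and decodes the JSON body.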

Hosts whose package is already current still skip everything — the
validation cost is only paid when a restart actually happened. The
probe was dry-run against benjy's production neuron before landing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:28:20 +03:00
b0d0b939af Merge pull request 'feat(gateway): per-request token metrics — TTFT and tok/s (#21)' (#30) from feat/gateway-21-token-metrics into main
Some checks failed
build-prerelease / Lint (fmt + clippy) (push) Blocked by required conditions
build-prerelease / Test (push) Blocked by required conditions
build-prerelease / Build cortex binary (push) Blocked by required conditions
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:25:32 +00:00
6a36d15ef1 feat(gateway): per-request token metrics — TTFT and tok/s (#21)
All checks were successful
CI / Format (push) Successful in 45s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 2m25s
CI / Clippy (push) Successful in 2m37s
CI / Test (push) Successful in 4m22s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m23s
CI / Test (pull_request) Successful in 4m19s
CI / CUDA type-check (pull_request) Successful in 1m57s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The deferred Phase 6b, and the unblock for the 7→8 milestone's
benchmark work (#22): until cortex measures itself per request,
nothing downstream can be benchmarked or graphed.

The proxy wraps the upstream byte stream in a pass-through inspector
(TokenMetricsStream): chunks are forwarded verbatim — never buffered
or re-serialised — while the inspector records arrival times and
keeps a bounded (64 KiB) tail of the body text. At stream end (or
client disconnect, via Drop) it extracts the final OpenAI usage
object — present on the last SSE chunk and non-streaming JSON bodies
alike — for engine-truth token counts.

Per request, labelled {model, node}:
- cortex_time_to_first_token_seconds (histogram) — first body chunk
- cortex_tokens_per_second (histogram) — completion tokens over the
  decode window (first→last chunk); falls back to total request
  duration for single-chunk non-streaming bodies
- cortex_prompt_tokens_total / cortex_completion_tokens_total
  (counters)

The extractor is pure and chunk-boundary-safe; quoted-needle matching
keeps completion_tokens_details from shadowing completion_tokens,
and the last usage object wins. Covers chat completions, completions,
the Responses API, and the Anthropic streaming path (which currently
proxies OpenAI SSE).
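
To illustrate the quoted-needle idea: a sketch of a bounded tail plus
an extractor, in Python rather than the gateway's Rust. Tail,
extract_usage, and the regex approach are assumptions; the actual
extractor may match differently.

```python
import re

TAIL_LIMIT = 64 * 1024  # bounded tail, as described above


class Tail:
    """Keeps only the last TAIL_LIMIT bytes of the forwarded body."""

    def __init__(self):
        self.data = b""

    def push(self, chunk):
        self.data = (self.data + chunk)[-TAIL_LIMIT:]

    def text(self):
        # Truncation can cut a multi-byte character; decode leniently.
        return self.data.decode("utf-8", "replace")


def _last_int(text, key):
    # The closing quote is part of the needle, so "completion_tokens"
    # can never match inside "completion_tokens_details".
    matches = re.findall(r'"%s"\s*:\s*(\d+)' % re.escape(key), text)
    return int(matches[-1]) if matches else None  # the last occurrence wins


def extract_usage(text):
    """Return the final prompt/completion token counts, or None if absent."""
    prompt = _last_int(text, "prompt_tokens")
    completion = _last_int(text, "completion_tokens")
    if prompt is None and completion is None:
        return None
    return {"prompt_tokens": prompt, "completion_tokens": completion}
```

Because extraction runs over the accumulated tail rather than over
individual chunks, a usage object split across two reads is still
found.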

Tests: 4 extractor unit tests; integration test with a streaming
mock emitting a stream_options-style final usage chunk, asserting
both histograms and exact-or-greater counter values (the test
recorder is process-global and shared across the binary's tests).

Closes #21

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:11:52 +03:00
b463439416 Merge pull request 'feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)' (#29) from feat/neuron-19-driver-preflight into main
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Build cortex binary (push) Successful in 2m33s
build-prerelease / Test (push) Successful in 4m24s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-blackwell (push) Successful in 10m18s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:08:20 +00:00
716558c8ff feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)
All checks were successful
CI / Format (push) Successful in 38s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 2m11s
CI / Clippy (push) Successful in 2m13s
CI / Clippy (pull_request) Successful in 2m37s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 3m56s
CI / CUDA type-check (pull_request) Successful in 1m44s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The un-rebooted driver update (userspace libs bumped, kernel module
still old) kills every CUDA call on the host including nvidia-smi,
and neuron surfaced it only as `Comm::from_rank ... NcclError` deep
inside the first model load — 30 minutes of forensics on beast
(2026-06-08) to diagnose. Make it instantly legible instead:

- discovery distinguishes nvidia-smi absent (CPU-only, fine) from
  present-but-failing, classifies the "Driver/library version
  mismatch" signature, and pairs the userspace NVML version with the
  loaded kernel-module version from /proc/driver/nvidia/version.
- DiscoveryResponse gains `cuda_unavailable_reason` (omitted when
  None — wire-compatible) so cortex can see why the node has no
  devices and route around it.
- startup logs one loud ERROR line with the actionable reason
  ("reboot the host to reload the kernel module") and skips default
  model loads entirely, marking each failed with that reason so
  /health activation shows the real cause.
- POST /models/load fast-rejects with 503 + code=cuda_unavailable on
  a mismatch host instead of dying minutes later in cuInit/NCCL.

No false positives: other nvidia-smi failures (no devices, perms)
keep their existing behaviour, CPU-only hosts stay silent.
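
The classification rule can be sketched like this, in Python rather
than neuron's Rust. The function name, its parameters, and the exact
reason wording are assumptions; the two regular expressions target
the usual nvidia-smi and /proc/driver/nvidia/version text and may
need adjusting for other driver builds.

```python
import re

MISMATCH = "Driver/library version mismatch"


def classify(smi_present, exit_code, output, proc_version_text=""):
    """Return a cuda_unavailable_reason string, or None when nothing applies.

    None covers: nvidia-smi absent (CPU-only host), nvidia-smi succeeding,
    and any other nvidia-smi failure, which keeps its existing handling.
    """
    if not smi_present or exit_code == 0 or MISMATCH not in output:
        return None
    nvml = re.search(r"NVML library version:\s*(\S+)", output)
    kmod = re.search(r"Kernel Module(?: for \S+)?\s+(\d+(?:\.\d+)+)",
                     proc_version_text)
    userspace = nvml.group(1) if nvml else "unknown"
    kernel = kmod.group(1) if kmod else "unknown"
    return (f"driver/library version mismatch (userspace NVML {userspace}, "
            f"kernel module {kernel}); reboot the host to reload the kernel module")
```

Returning None for everything except the mismatch signature is what
keeps the preflight free of false positives.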

Closes #19

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:00:00 +03:00
112e4e124a fix(ci): export RUSTC_WRAPPER in the build step itself — GITHUB_ENV doesn't propagate
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m22s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Test (push) Successful in 3m50s
build-prerelease / Build neuron-blackwell (push) Successful in 10m10s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ada (push) Successful in 14m29s
build-prerelease / Build neuron-ampere (push) Successful in 14m31s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Run 375 proved the CUDA image ships sccache (probe step printed
"sccache enabled") but the wrapper never reached cargo: the runner
does not propagate GITHUB_ENV across steps, so the builds ran
unwrapped (server stats: 4 compile requests for a ~600-crate build,
durations unchanged). Probe and export inside the build step's own
shell instead, in both build-neuron and ci.yml's cuda-check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:50:25 +03:00
dc6feec6dc fix(deploy): gate on the publish manifest, not unprivileged dnf check-update
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build cortex binary (push) Successful in 2m18s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m33s
build-prerelease / Test (push) Successful in 4m20s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-blackwell (push) Successful in 9m46s
build-prerelease / Build neuron-ampere (push) Successful in 13m57s
build-prerelease / Build neuron-ada (push) Successful in 15m29s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
The f5fa840 deploy exposed both failure modes of gating with
`dnf check-update` as the gitea_ci user in one run: it hung
indefinitely on quadbrat (blocked process, 0 CPU, killed manually),
and on benjy/beast it silently reported "no updates" two minutes
after new RPMs were published — both hosts skipped a real (luckily
binary-identical) update.

Gate with data we own instead: fetch packages.json from
rpm.lair.cafe (plain curl, no privileges, no dnf locks), take the
newest release per package by buildTime, and skip the
stop/upgrade/start cycle only when it exactly equals
`rpm -q %{VERSION}-%{RELEASE}`. Unreachable or unparsable manifest
fails open to a full deploy. The dnf transaction itself still runs
under the scoped sudoers rules, unchanged.
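
The gate's decision can be sketched as a pure function. The
packages.json layout assumed here (a list of objects with name,
version, release, and buildTime) is a guess for illustration; only
buildTime and the VERSION-RELEASE comparison come from the
description above.

```python
import json


def should_deploy(manifest_text, package, installed):
    """Decide whether to run the stop/upgrade/start cycle.

    installed: output of `rpm -q --qf '%{VERSION}-%{RELEASE}' <package>`.
    """
    try:
        entries = [e for e in json.loads(manifest_text) if e["name"] == package]
        newest = max(entries, key=lambda e: e["buildTime"])
        published = f'{newest["version"]}-{newest["release"]}'
    except (ValueError, KeyError, TypeError):
        # Unparsable manifest, unknown package, or unexpected shape:
        # fail open to a full deploy.
        return True
    # Skip only on an exact match with the installed package.
    return published != installed
```

An unreachable manifest is handled by the caller the same way: if the
curl fails, deploy.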

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:20:21 +03:00
02f20bc9e1 Merge pull request 'feat: keep auto-recovering models visible as recovering (#20)' (#28) from feat/neuron-20-recovering-status into main
Some checks failed
build-prerelease / Test (push) Blocked by required conditions
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m39s
build-prerelease / Build cortex binary (push) Successful in 3m46s
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 11:15:38 +00:00
2a231e49de merge main (sccache enablement supersedes branch cuda-check pin)
All checks were successful
CI / Format (push) Successful in 40s
CI / Format (pull_request) Successful in 37s
CI / Clippy (push) Successful in 2m17s
CI / CUDA type-check (push) Successful in 2m39s
CI / CUDA type-check (pull_request) Successful in 2m30s
CI / Test (push) Successful in 4m51s
CI / Clippy (pull_request) Successful in 2m12s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m49s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
# Conflicts:
#	.gitea/workflows/ci.yml
2026-06-12 14:05:55 +03:00
2dadea5d8d ci: enable sccache on the build jobs (conditional on the CUDA image)
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 34s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m57s
build-prerelease / Test (push) Has been cancelled
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
The 3 CUDA flavour builds (10-14 min each, the critical path of every
full run) and build-cortex compiled entirely uncached. With the
gongfoo-side sccache hardening in place, wire them up:

- build-cortex: full sccache env (rust image ships it) + the standard
  escalation loop (retry -> server restart -> uncached final attempt).
- build-neuron: probe for sccache before enabling the wrapper — the
  CUDA image may not ship it, and a missing binary must degrade to an
  uncached build, not fail cargo at `sccache rustc -vV` (the original
  reason the wrapper was cleared here). rustc compilations are shared
  across all three flavours; candle-kernels' nvcc output stays
  uncached (build-script artifact).
- ci.yml cuda-check: same probe pattern replaces the blanket env
  clear; also pins CUDA_COMPUTE_CAP=86 since the image no longer
  ships nvidia-smi for candle-kernels' fallback detection (mirrors
  9bb9678 on the #20 branch).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:05:26 +03:00
9bb9678f93 fix(ci): pin CUDA_COMPUTE_CAP in cuda-check — builder image has no nvidia-smi
All checks were successful
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 1m45s
CI / Clippy (push) Successful in 2m24s
CI / Clippy (pull_request) Successful in 2m19s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m35s
CI / CUDA type-check (pull_request) Successful in 1m50s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
candle-kernels' build script shells out to nvidia-smi for compute-cap
detection when CUDA_COMPUTE_CAP is unset; the current GPU-less builder
image doesn't ship it, so the type-check died in the build script
before borrow-checking anything. Pin an arbitrary valid cap — the
check is feature-gate compilation only; real caps live in
build-prerelease.yml's flavour matrix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:55:23 +03:00
df9c490614 feat(neuron+gateway): keep auto-recovering models visible as recovering (#20)
Some checks failed
CI / Format (push) Successful in 37s
CI / CUDA type-check (pull_request) Failing after 28s
CI / Format (pull_request) Successful in 37s
CI / Clippy (push) Successful in 2m54s
CI / Clippy (pull_request) Successful in 3m36s
CI / Test (push) Successful in 4m37s
CI / Test (pull_request) Successful in 5m20s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / CUDA type-check (push) Failing after 31s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
During the #17 auto-recovery window (unload → reload, minutes for a
large TP model) the model's registry slot is absent, so it vanished
from neuron's /models — and cortex, routing by /models presence,
answered "model not found on any node" while a direct request to
neuron would have correctly said "recovering, retry shortly".

neuron: the recovery set becomes a map carrying a devices/capabilities
snapshot taken at trigger time (while the registry slot still exists).
list_models reports `recovering` for models in the set — both while
the poisoned slot is still present and during the reload gap, where
the snapshot keeps the model listed.

gateway: ModelStatus grows a Recovering variant (parsed from the
wire); the router holds the route — new RouteError::ModelRecovering
mapped to 503 instead of 404 — and deliberately does not fall through
to the catalogue cold-load, which would race a second placement
against the in-flight recovery. The evictor already ignores
non-Loaded entries.
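
The routing decision can be sketched as below. This is an
illustrative Python outline, not the gateway's router; the function
name, the status strings other than `recovering`, and the returned
tuples are assumptions.

```python
def route(model, node_models, catalogue):
    """Pick an action for one request.

    node_models: {node: {model: status}} as last polled from each
    neuron's /models. catalogue: models that may be cold-loaded.
    """
    statuses = {node: models[model]
                for node, models in node_models.items() if model in models}
    for node, status in statuses.items():
        if status == "loaded":
            return "proxy", node
    if "recovering" in statuses.values():
        # Hold the route: a cold-load here would race a second placement
        # against the in-flight recovery.
        return "reject", 503
    if model in catalogue:
        return "cold_load", None
    return "reject", 404
```

The ordering matters: the recovering check sits before the catalogue
lookup, so a model that is both recovering and in the catalogue gets
a 503 instead of a second load.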

Tests: neuron unit test (recovering model stays listed with snapshot),
gateway integration tests (poller parses `recovering`; request gets
503 retry-shortly and the model stays on /v1/models).

Closes #20

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:42:03 +03:00
f5fa840dfb ci: escalate sccache retries — restart server, then fall back uncached
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m6s
build-prerelease / Test (push) Successful in 4m50s
build-prerelease / Build cortex binary (push) Successful in 3m45s
build-prerelease / Build neuron-blackwell (push) Successful in 9m59s
build-prerelease / Build neuron-ada (push) Successful in 14m11s
build-prerelease / Build neuron-ampere (push) Successful in 14m13s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m28s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m54s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Run 361's Test job failed all 3 attempts with the sccache
dead-server signature (sccache fatal error, ENOENT on its own tmp
files under target/debug/deps). Retrying the same invocation only
helps for transient races; against a wedged server every same-VM
retry fails identically — and under the new pipeline that blocks
publish and the deploy behind it.

Escalate instead: attempt 1 plain, attempt 2 after an sccache server
restart, attempt 3 with RUSTC_WRAPPER unset (uncached). A sick cache
now costs build minutes, never the deploy. Applied to the lint/test
jobs in build-prerelease.yml and ci.yml alike.
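
The escalation order, sketched in Python for clarity (the workflows
themselves do this in shell steps). run_with_escalation and its
callback parameters are hypothetical names.

```python
def run_with_escalation(run, restart_server, env):
    """Try a build up to three times, escalating between attempts.

    run(env) -> bool runs the build with the given environment.
    Returns the attempt number that succeeded, or 0 if all three failed.
    """
    # Attempt 1: plain, cache enabled.
    if run(env):
        return 1
    # Attempt 2: same invocation after restarting the cache server,
    # which clears a wedged server that a plain retry cannot fix.
    restart_server()
    if run(env):
        return 2
    # Attempt 3: drop the wrapper so cargo calls rustc directly (uncached).
    uncached = {k: v for k, v in env.items() if k != "RUSTC_WRAPPER"}
    if run(uncached):
        return 3
    return 0
```

With sccache, restart_server would typically run
`sccache --stop-server` followed by `sccache --start-server`.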

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:24:02 +03:00
7557c5e877 ci: cut iteration latency — change-aware builds, gated deploys, dev fast path
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 28s
build-prerelease / Test (push) Failing after 1m16s
build-prerelease / Lint (fmt + clippy) (push) Successful in 3m7s
build-prerelease / Build cortex binary (push) Successful in 3m57s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Push-to-testable was ~20.5 min for every commit (measured on the
2026-06-08 green chain) plus a ~5 min 27B cold-load, regardless of
what changed. Four structural fixes:

- build-prerelease: a change-detection step in `prepare` diffs HEAD
  against the git sha embedded in the last *published* unstable RPM
  (per package, from packages.json) and skips builds whose inputs
  didn't change. Docs-only commits build nothing; gateway-only
  commits skip the 3 CUDA flavour builds. Detection failures fall
  open to a full build.
- ci.yml no longer runs on pushes to main; fmt/clippy/test live in
  build-prerelease as parallel jobs gating publish. The two workflows
  previously queued against each other on the same runner labels,
  delaying the cortex build ~12 min. Branches, PRs, and tags keep the
  full ci.yml gate.
- deploy: each host self-gates with `dnf check-update` and leaves the
  service untouched when the installed package is already current —
  no more neuron restarts (and 27B cold-loads) for commits that
  didn't change neuron.
- deploy-dev (new): manual single-host fast path — build one CUDA
  flavour, scp the binary, restart the service. Skips packaging,
  signing, publish, and dnf entirely. Backed by a new exact-form
  sudoers rule in asset/sudoers.d/neuron-host.conf (already applied
  to all three hosts).

Expected loop times when runners behave: docs ≈ 1 min (nothing
deploys), gateway-only ≈ 6-8 min, single-neuron dev ≈ 8-10 min,
full fleet ≈ 13-15 min.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:17:22 +03:00
91e95ca979 docs: rewrite README around project positioning
Some checks failed
CI / CUDA type-check (push) Failing after 46s
CI / Format (push) Successful in 47s
CI / Clippy (push) Successful in 2m53s
CI / Test (push) Successful in 4m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 39s
build-prerelease / Build cortex binary (push) Successful in 3m52s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-blackwell (push) Successful in 11m34s
build-prerelease / Build neuron-ampere (push) Successful in 15m31s
build-prerelease / Build neuron-ada (push) Successful in 15m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Lead with what helexa is for — near-frontier open-weight models on
consumer hardware you own — instead of a feature list. Adds the scope
section (intentional divergence from vLLM/SGLang; CUDA-only today as a
test-coverage constraint, not a principle), an engine section covering
the per-device worker threads and consumer-GPU tensor parallelism, the
previously-missing helexa-acp crate, and a status section pointing at
git.lair.cafe as the source of truth with GitHub as read-only mirror.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 11:37:00 +03:00
1a74cb0c56 chore: rename repo cortex -> helexa
Some checks failed
CI / CUDA type-check (push) Failing after 30s
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Format (push) Successful in 32s
build-prerelease / Build neuron-blackwell (push) Failing after 31s
build-prerelease / Build neuron-ada (push) Failing after 34s
build-prerelease / Build neuron-ampere (push) Failing after 38s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
CI / Clippy (push) Failing after 1m11s
build-prerelease / Build cortex binary (push) Successful in 3m47s
CI / Test (push) Successful in 5m32s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
helexa is the project; cortex (per-operator control plane / LLM proxy)
and neuron (per-host LLM harness) are its components. The Gitea repo
is now helexa/helexa. Update repository URLs in Cargo metadata, RPM
specs, and docs; make the CI changelog push URL rename-proof via the
github.repository context; reframe README.md and CLAUDE.md around the
project name. Binary, package, service, and config-path names are
unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 10:54:01 +03:00
60f5598542 build(neuron): bump cudarc fork to 63327a2 (idempotent abort + Comm Send+Sync)
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / CUDA type-check (push) Successful in 31s
CI / Format (push) Successful in 35s
CI / Test (push) Failing after 1m9s
CI / Clippy (push) Successful in 2m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m10s
build-prerelease / Build neuron-ampere (push) Successful in 7m35s
build-prerelease / Build neuron-ada (push) Successful in 5m7s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m14s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Build cortex binary (push) Successful in 4m33s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The fork's new commit makes `Comm: Send + Sync` (asserting NCCL's
thread-safety invariant upstream) and makes `Comm::abort` idempotent via
an `aborted` flag (so abort-then-Drop can't double-free) — strictly
better than the previous Drop-no-panic workaround, and the `abort()`
signature is unchanged so the watchdog call site is unaffected.

Because `Comm` is now `Send + Sync`, `Arc<Comm>` and the `SendComm` /
`NcclState` wrappers auto-derive `Send`/`Sync`, which conflicts (E0119)
with neuron's manual `unsafe impl`s. Remove the four now-redundant impls
— the safety assertion lives upstream in cudarc where it belongs. The
conflict is in cuda-gated code, so only the CUDA type-check catches it
(non-cuda build + clippy + tests stay green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:33:14 +03:00
7945240646 chore: re-trigger deploy (#17 Stage 2, attempt 3)
All checks were successful
CI / CUDA type-check (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m45s
build-prerelease / Build neuron-blackwell (push) Successful in 5m50s
CI / Test (push) Successful in 6m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 8m38s
build-prerelease / Build neuron-ada (push) Successful in 5m36s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
No code change. On each deploy run, the degraded CI runner quickly kills a
different single arch build (blackwell, then ada), and the all-arch-gated
packaging skips → no publish. Every arch HAS built green across runs
(blackwell in 342, ampere, ada in 339) and the gate + CUDA
type-check pass. Re-running to catch all three green in one run so the
Stage-2 RPMs publish. Runner FS/cache health is the real fix (separate
infra work).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:06:04 +03:00
0c74d89d15 chore: re-trigger deploy (#17 Stage 2)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 30s
build-prerelease / Build neuron-ada (push) Failing after 51s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Build neuron-blackwell (push) Successful in 6m32s
build-prerelease / Build neuron-ampere (push) Successful in 7m42s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
CI / Test (push) Successful in 6m6s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
No code change. The c94a2ae deploy's neuron-blackwell build died ~12min
into the Blackwell kernel compile on the degraded runner, while
neuron-ampere + neuron-ada built the identical Rust + patched cudarc
cleanly and the CUDA type-check passed. Transient infra; re-running to
get a healthy blackwell build so the RPMs publish and beast (Blackwell)
picks it up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:45:16 +03:00
c94a2ae755 fix(neuron): correct nccl_state path on WorkerPool.leader_comm (#17 S2)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 44s
build-prerelease / Build cortex binary (push) Successful in 4m57s
build-prerelease / Package cortex RPM (push) Successful in 1m36s
CI / Test (push) Successful in 7m10s
CI / Clippy (push) Failing after 1m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m40s
build-prerelease / Build neuron-ada (push) Successful in 9m5s
build-prerelease / Build neuron-blackwell (push) Failing after 12m2s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
`super::nccl_state` from tp/mod.rs resolves to `crate::harness::nccl_state`
(nonexistent); the module is the child `nccl_state` (cf. the existing
`nccl_state::generate_comm_id_hex` call). The field is cuda-gated so the
non-cuda build couldn't catch it; the branch CUDA type-check flaked on the
runner before compiling. Self-audited fix.
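
A minimal sketch of the path mix-up, using an illustrative module tree. Only the `nccl_state::generate_comm_id_hex` name comes from this commit; the `comm_id` helper and the stub body are hypothetical.

```rust
// Illustrative module tree; the layout mirrors the commit text, the bodies are stubs.
mod harness {
    pub mod tp {
        // Child module of `tp`, i.e. `crate::harness::tp::nccl_state`.
        pub mod nccl_state {
            pub fn generate_comm_id_hex() -> String {
                // Stub value standing in for a real NCCL unique id.
                "00ff".to_string()
            }
        }

        // Hypothetical helper living in tp/mod.rs.
        pub fn comm_id() -> String {
            // From inside `tp`, `super::nccl_state` would name
            // `crate::harness::nccl_state`, which does not exist in this tree.
            // The child module is reached by its bare name (or via `self::`).
            nccl_state::generate_comm_id_hex()
        }
    }
}
```

Swapping the call to `super::nccl_state::generate_comm_id_hex()` in this tree should fail to resolve, which is the class of error the cuda-gated field hid from the non-cuda build.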

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:21:43 +03:00
99920dd322 feat(neuron): TP step watchdog aborts wedged collectives (#17 Stage 2)
Some checks failed
CI / CUDA type-check (push) Failing after 47s
CI / Format (push) Successful in 31s
CI / Test (push) Failing after 1m3s
CI / Clippy (push) Successful in 2m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Make a hung NCCL collective recoverable instead of a permanent brick.
Today a wedged collective hangs the in-process leader thread forever, and
even Stage 1's recovery can't help — its unload's DropTp queues behind the
stuck thread and hangs too.

- Cache the leader's NCCL Comm handle async-side at init (new cuda-gated
  Job::GetLeaderComm → DeviceWorkerHandle::get_leader_comm → stored on
  WorkerPool.leader_comm). Fetched while the thread is responsive — a
  wedged thread can't service the fetch, which is why it's cached up front.
- Wrap the leader forward in both generate_step and
  generate_step_with_images in tokio::time::timeout (default 120s,
  NEURON_TP_STEP_TIMEOUT_S). On expiry the watchdog calls
  Comm::abort() (ncclCommAbort) on the cached handle from the async
  thread — the one NCCL op sanctioned concurrently with an in-flight
  collective — which unblocks the leader thread, then fails the step
  WITHOUT draining (workers are wedged too; recovery's unload kills them).
  The error is a device fault → poison → Stage 1 auto-recovery, which now
  completes because the leader thread is responsive again.
- Bumps the cudarc patch to dbc425a (adds the Drop-must-not-panic fix so
  the post-abort comm teardown during recovery doesn't double-abort-panic).
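
The watchdog flow above can be sketched with the standard library alone. This is not the neuron code: it swaps `tokio::time::timeout` for `mpsc::recv_timeout`, and `FakeComm`, `StepError`, and `step_with_watchdog` are hypothetical stand-ins for the cached `Comm` handle and the real step path.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the communicator handle cached at init.
#[derive(Default)]
struct FakeComm {
    aborted: AtomicBool,
}

impl FakeComm {
    /// Plays the role of the abort call: safe to issue from another thread.
    fn abort(&self) {
        self.aborted.store(true, Ordering::SeqCst);
    }
    fn is_aborted(&self) -> bool {
        self.aborted.load(Ordering::SeqCst)
    }
}

#[derive(Debug, PartialEq)]
enum StepError {
    Timeout,
    LeaderGone,
}

/// Runs one simulated leader step under a deadline. `wedge` makes the
/// simulated collective block until the communicator is aborted.
fn step_with_watchdog(
    comm: Arc<FakeComm>,
    wedge: bool,
    timeout: Duration,
) -> Result<u32, StepError> {
    let (tx, rx) = mpsc::channel();
    let leader_comm = Arc::clone(&comm);
    let leader = thread::spawn(move || {
        // Simulated collective: stuck until someone aborts the communicator.
        while wedge && !leader_comm.is_aborted() {
            thread::sleep(Duration::from_millis(1));
        }
        if leader_comm.is_aborted() {
            return; // aborted: no result is produced
        }
        let _ = tx.send(42u32);
    });

    match rx.recv_timeout(timeout) {
        Ok(token) => {
            leader.join().ok();
            Ok(token)
        }
        Err(mpsc::RecvTimeoutError::Timeout) => {
            // Deadline passed: abort from the watchdog side, which unblocks
            // the leader thread, then fail the step without draining.
            comm.abort();
            leader.join().ok();
            Err(StepError::Timeout)
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(StepError::LeaderGone),
    }
}
```

The point the sketch tries to preserve is ordering: the handle is shared before the step starts, so the watchdog never has to ask the wedged thread for anything.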

Logs the whole sequence at ERROR with greppable `tp watchdog:` /
`ncclCommAbort` markers so a real-world hang leaves a forensic trail —
verification is by inspecting journals after real hangs, not a synthetic
harness. cuda-gated → validated by the blackwell build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:15:29 +03:00
c4f239ceb9 build(neuron): patch cudarc to expose Comm::abort/get_async_error (#17 Stage 2)
All checks were successful
CI / CUDA type-check (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 6m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
#17 Stage 2 (TP hang-recovery) needs to call ncclCommAbort on a LIVE
communicator from another thread — to unblock a collective wedged on a
dead/hung peer so the ranks can resync. No cudarc release (incl. main)
exposes this: the safe Comm only aborts in Drop, which can't fire while a
stuck thread holds an Arc<Comm> clone.

Pin neuron's cudarc 0.19.7 to a fork (grenade/cudarc @ nccl-comm-abort,
rev 4dff0be) adding three thin methods — Comm::abort, get_async_error,
and a raw comm() accessor — to be submitted upstream. The patch targets
0.19.x only; candle's transitive cudarc 0.17.8 stays on crates.io.

Foundation only; the watchdog + abort + comm-rebuild that consume these
land in follow-up commits (cuda-gated → validated by the blackwell build).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 13:49:59 +03:00
ac445c1569 chore: re-trigger deploy (#17 Stage 1)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
CI / Format (push) Successful in 37s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 3m54s
build-prerelease / Build cortex binary (push) Successful in 4m43s
CI / Test (push) Successful in 6m35s
build-prerelease / Build neuron-blackwell (push) Successful in 5m58s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 8m10s
build-prerelease / Build neuron-ada (push) Successful in 5m21s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
No code change. The abc6e60 deploy's neuron-ada build died on the
degraded CI runner (container dropped mid-checkout), skipping the
gated publish — even though neuron-blackwell + neuron-ampere compiled
the Stage-1 fault-recovery code cleanly. Re-running to get a healthy
ada build so the RPMs publish and beast picks up the build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:34:20 +03:00
abc6e605b8 test(neuron): NEURON_DEBUG_POISON hook to verify auto-recovery (#17)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
build-prerelease / Resolve version stamps (push) Successful in 43s
CI / Format (push) Successful in 50s
CI / Clippy (push) Failing after 57s
build-prerelease / Build neuron-ada (push) Failing after 48s
build-prerelease / Build cortex binary (push) Successful in 5m5s
build-prerelease / Build neuron-blackwell (push) Successful in 6m38s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 7m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Test (push) Successful in 10m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
One-shot, env-gated fault injector for beast verification: when
NEURON_DEBUG_POISON names a model, the first request for it triggers the
auto-recovery path as if a device fault had occurred — exercising
unload→reload→healthy without corrupting the GPU. Latched so it fires
exactly once (no recovery loop). No-op unless the env var is set; wired
into both the single-GPU and TP chat poison gates.
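
A minimal sketch of such a one-shot latch, assuming the hook is built once at startup. `DebugPoison` and `should_fire` are hypothetical names; the target is passed in rather than read from the environment so the sketch stays testable.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// One-shot fault injector. In a real binary `target` would presumably come
/// from `std::env::var("NEURON_DEBUG_POISON").ok()`; that wiring is assumed.
struct DebugPoison {
    target: Option<String>,
    fired: AtomicBool,
}

impl DebugPoison {
    fn new(target: Option<String>) -> Self {
        Self { target, fired: AtomicBool::new(false) }
    }

    /// Returns true exactly once, for the first request naming the target model.
    fn should_fire(&self, model_id: &str) -> bool {
        match self.target.as_deref() {
            // `swap` returns the previous value, so only the first caller sees `false`.
            Some(t) if t == model_id => !self.fired.swap(true, Ordering::SeqCst),
            // Unset variable or a different model: no-op.
            _ => false,
        }
    }
}
```

Because the latch is an atomic swap, concurrent first requests cannot both trigger the injected fault, which is what keeps the hook from starting a recovery loop.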

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:08:40 +03:00
4f2957af9e feat(neuron): auto-recover poisoned models (#17 Stage 1c)
When an inference hit a device fault, the model was flagged poisoned and
every subsequent request rejected with "unload and reload the model to
recover" — until a *human* did exactly that. Now the harness rebuilds the
context automatically.

- Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+
  `LoadedHandle::spec()`) so a poisoned model can be reloaded without an
  operator reconstructing the spec.
- A background recovery task (held via `Weak<CandleHarness>`, spawned in
  `new()` when a runtime is present) drains poisoned model ids and runs
  `unload_model` → `load_model(spec)`. Unload drops the model → cudarc
  `Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL
  init + sanity inside the load path, so a successful reload yields a
  fresh, healthy model. A failed reload leaves it unloaded (next load
  retries) — never poisoned forever.
- The request-entry poison gates now `trigger_recovery` (single-flight
  per model via a `recovering` set) and return a transient "recovering,
  retry shortly" error instead of the manual-reload message. Requests
  that arrive during the brief reload gap (model absent from the registry)
  also get "recovering" rather than a misleading "not loaded".
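
The single-flight gate can be sketched as a mutex-guarded set. `RecoveryGate` and `finish_recovery` are hypothetical names; only `trigger_recovery` and the `recovering` set are named in this commit, and their real signatures may differ.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Tracks which model ids currently have a recovery in flight.
#[derive(Default)]
struct RecoveryGate {
    recovering: Mutex<HashSet<String>>,
}

impl RecoveryGate {
    /// Returns true only for the call that starts a recovery; later callers
    /// for the same model get false and just report "recovering".
    fn trigger_recovery(&self, model_id: &str) -> bool {
        // `HashSet::insert` returns false when the id was already present.
        self.recovering.lock().unwrap().insert(model_id.to_string())
    }

    /// Called by the background task once unload + reload has finished.
    fn finish_recovery(&self, model_id: &str) {
        self.recovering.lock().unwrap().remove(model_id);
    }
}
```

The request path only touches the set and returns; the unload and reload themselves stay on the background task, consistent with the deadlock note below.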

`new()` now returns `Arc<Self>`. Recovery runs only on the background
task — never inline on the request path, which holds `inference_lock`
and would deadlock on the `models` write lock.

Stage 1c of the #17 plan (verified-healthy auto-recovery). Watchdog
(1b) + a fault-injection hook for beast verification follow. The
in-process rank-0 leader's own context fault still needs a reload that
can't rebind it (Stage 3); comm-desync + worker faults recover here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:05:02 +03:00
75cd088b61 fix(neuron): cap vision max_pixels to the pos_embed patch budget (#14)
All checks were successful
CI / CUDA type-check (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m32s
build-prerelease / Build neuron-blackwell (push) Successful in 6m5s
CI / Test (push) Successful in 5m49s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m11s
build-prerelease / Build neuron-ada (push) Successful in 5m40s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m4s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m57s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m16s
Beast testing surfaced a real regression in the dynamic-resolution
default: a tall 808×1600 image resized (within the 1024² max_pixels) to a
90×44 patch grid = 3960 patches, exceeding the vision tower's hard
`num_position_embeddings = 2304` pos-embed budget. The per-rank
`patch count 3960 exceeds pos_embed budget 2304` error fired mid-TP-
forward and poisoned the device context, bricking the model until reload.

Hard-cap `max_pixels` to `2304 × 16² = 589_824` px (≤ 2304 patches →
≤ 576 LM tokens), clamping even the operator env override. `smart_resize`
floors the pixel count under the cap, so no resized image can ever exceed
the budget — the tower check never fires, no poison. The pos-embed grid
(48×48) is the resolution Qwen3.6 was trained at, so the cap is
principled, not just defensive. Still ~3× the old fixed 196 tokens, and
the book-cover OCR test (1176 patches) already reads full title+subtitle.
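
The arithmetic can be checked with a small sketch modelled on the public Qwen `smart_resize` routine. It is not the neuron port; the function shape, `patch_count`, and the constant names are assumptions.

```rust
/// Vision tower position-embedding budget and patch size, as quoted above.
const POS_EMBED_BUDGET: u64 = 2304;
const PATCH: u64 = 16;
/// 2304 * 16^2 = 589_824 px.
const HARD_MAX_PIXELS: u64 = POS_EMBED_BUDGET * PATCH * PATCH;

/// Sketch of a Qwen-style smart_resize; returns (height, width) in pixels,
/// both multiples of `factor`.
fn smart_resize(height: u32, width: u32, factor: u32, min_pixels: u64, max_pixels: u64) -> (u32, u32) {
    let f = factor as f64;
    let (h, w) = (height as f64, width as f64);
    // Snap each side to the nearest multiple of `factor`.
    let mut h_bar = ((h / f).round() * f).max(f);
    let mut w_bar = ((w / f).round() * f).max(f);
    if h_bar * w_bar > max_pixels as f64 {
        // Too large: scale down and floor, so the result stays under the cap.
        let beta = (h * w / max_pixels as f64).sqrt();
        h_bar = ((h / beta / f).floor() * f).max(f);
        w_bar = ((w / beta / f).floor() * f).max(f);
    } else if h_bar * w_bar < min_pixels as f64 {
        // Too small: scale up and ceil, so the result reaches the floor.
        let beta = (min_pixels as f64 / (h * w)).sqrt();
        h_bar = (h * beta / f).ceil() * f;
        w_bar = (w * beta / f).ceil() * f;
    }
    (h_bar as u32, w_bar as u32)
}

/// Number of vision patches for a resized image.
fn patch_count(height: u32, width: u32, patch: u32) -> u32 {
    (height / patch) * (width / patch)
}
```

Under these assumptions the 808×1600 image resizes to 704×1440 with the old 1024² budget (44×90 = 3960 patches) and to 544×1056 with the cap (34×66 = 2244 patches, inside the 2304 budget).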

Test: a huge/tall/wide/extreme image battery stays within the 2304 patch
budget. (Per-rank-error poison robustness itself remains issue #17.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 23:30:47 +03:00
d311c8ca7a feat(neuron): operator pixel-budget env override + doc cleanup (#14 C5)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Format (push) Successful in 45s
CI / Test (push) Failing after 58s
CI / Clippy (push) Successful in 2m41s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m14s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-blackwell (push) Successful in 6m20s
build-prerelease / Build neuron-ampere (push) Successful in 7m18s
build-prerelease / Build neuron-ada (push) Successful in 5m10s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m7s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
- PreprocessProfile::qwen3_6() reads NEURON_VISION_MIN_PIXELS /
  NEURON_VISION_MAX_PIXELS (clamped to factor² ≤ min ≤ max), matching the
  NEURON_VISION_LEGACY_* / NEURON_MROPE knob convention. Defaults remain
  256²…1024² (64…1024 LM tokens/image).
- Test: a max-resolution source caps within the token budget (can't blow
  NEURON_MAX_PROMPT_TOKENS).
- Strip stale fixed-resolution / "MRoPE gap (#15)" / 14×14 language from
  the preprocess, mod, and rope doc-comments now that resolution is
  dynamic and M-RoPE is implemented.
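
A minimal sketch of the clamp, assuming unparsable values fall back to the defaults and that `max` is raised to `min` when the two are inverted. Neither policy is stated in this commit, and `pixel_budget` / `lm_tokens` are hypothetical names.

```rust
const FACTOR: u64 = 32; // patch x merge
const DEFAULT_MIN_PIXELS: u64 = 256 * 256;
const DEFAULT_MAX_PIXELS: u64 = 1024 * 1024;

/// `min_raw` / `max_raw` stand in for the values of NEURON_VISION_MIN_PIXELS
/// and NEURON_VISION_MAX_PIXELS (None when unset).
fn pixel_budget(min_raw: Option<&str>, max_raw: Option<&str>) -> (u64, u64) {
    // Assumed policy: ignore values that do not parse as an integer.
    let parse = |raw: Option<&str>, default: u64| {
        raw.and_then(|s| s.trim().parse::<u64>().ok()).unwrap_or(default)
    };
    // Enforce factor^2 <= min <= max.
    let min = parse(min_raw, DEFAULT_MIN_PIXELS).max(FACTOR * FACTOR);
    let max = parse(max_raw, DEFAULT_MAX_PIXELS).max(min);
    (min, max)
}

/// LM tokens for a pixel budget: one token per factor x factor block.
fn lm_tokens(pixels: u64) -> u64 {
    pixels / (FACTOR * FACTOR)
}
```

With the defaults this gives 64 and 1024 LM tokens per image, matching the range quoted above.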

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 22:50:03 +03:00
c97a8654f5 feat(neuron): dynamic-resolution images via Qwen smart_resize (#14)
Some checks failed
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 34s
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Replace the fixed 448×448-square preprocess with native-aspect
`smart_resize`, and thread the resulting per-image grid through the LM
so spatial structure survives non-square images (documents, screenshots,
charts, panoramas, OCR) instead of being squished into a square.

- preprocess.rs: port Qwen `smart_resize` (factor = patch×merge = 32;
  pixel budget [min,max], default 256²–1024² → 64–1024 LM tokens).
  `PreprocessProfile` drops the fixed target dims for `factor`/`min_pixels`/
  `max_pixels`; `preprocess`/`preprocess_data_uri` now return the resized
  `(h, w)`; add `resized_dims_for_uri` (decode + resize, no normalize) for
  the TP leader's token count.
- rope.rs: `compute_mrope_index`/`get_rope_index` take per-image
  `grids: &[(lm_gh, lm_gw)]` instead of assuming a square `isqrt(run)`.
  Walk image runs in order, validate `run == gh*gw`, emit row-major
  positions, resume the shared counter at `base + max(gh,gw)`. Correct
  for multiple images of differing grids interleaved with text.
- candle.rs: `VisionMeta`/`LoadedModel`/`TpLoadedModel` carry the
  `image_grid_factor` (patch×merge) instead of the constant 196; all four
  prompt-build sites compute per-image counts from each image's resized
  grid (single-GPU from the extracted `ImageInput.h/w`, TP from
  `resized_dims_for_uri`). `ModelArch` gains `vision_grid_factor`.
- single-GPU (`mod.rs`, `dispatch.rs`) and TP
  (`tp_qwen3_5.rs::prefill_with_images_chunked`, `dispatch.rs`,
  `tp/worker.rs`) thread the grids into `get_rope_index`. Each TP rank
  recomputes grids from its own deterministic preprocess — no rpc.rs
  change, single source of truth.
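
The per-image grid walk described for rope.rs can be sketched as a pure function. This is an illustration, not the real `compute_mrope_index`: the signature is assumed, image tokens are marked with a plain `bool` mask, and consecutive images are assumed to be separated by at least one non-image token.

```rust
/// Returns the three position rows (t, h, w) and the decode rope_delta,
/// or an error when an image run does not match its grid.
fn compute_mrope_index(
    is_image: &[bool],
    grids: &[(usize, usize)],
) -> Result<([Vec<i64>; 3], i64), String> {
    let mut pos: [Vec<i64>; 3] = [Vec::new(), Vec::new(), Vec::new()];
    let mut counter: i64 = 0;
    let mut next_grid = grids.iter();
    let mut i = 0;
    while i < is_image.len() {
        if !is_image[i] {
            // Text token: all three axes advance together.
            for axis in pos.iter_mut() {
                axis.push(counter);
            }
            counter += 1;
            i += 1;
            continue;
        }
        // Image run: consume the next grid and validate run == gh * gw.
        let run = is_image[i..].iter().take_while(|&&b| b).count();
        let &(gh, gw) = next_grid
            .next()
            .ok_or_else(|| "more image runs than grids".to_string())?;
        if run != gh * gw {
            return Err(format!("run of {run} tokens does not match grid {gh}x{gw}"));
        }
        let base = counter;
        for r in 0..gh {
            for c in 0..gw {
                pos[0].push(base); // single frame, so the t offset is 0
                pos[1].push(base + r as i64);
                pos[2].push(base + c as i64);
            }
        }
        // Resume the shared counter after the larger grid side.
        counter = base + gh.max(gw) as i64;
        i += run;
    }
    let rope_delta = counter - is_image.len() as i64;
    Ok((pos, rope_delta))
}
```

For two text tokens, a 2×3 image, and one trailing text token, the trailing token lands at position 5 and `rope_delta` is -3, since six image tokens advance the counter by only three.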

The vision tower itself was already grid-general (recent pos-embed
interpolation + 2D rotary fix). No patch-count cap: pos-embed is
interpolated to any grid; `max_pixels` bounds cost (O(patches²) ViT
attention + prefill) instead.

Tests: smart_resize (aspect/cap/floor/reject), `compute_mrope_index`
non-square + two-image + mismatch cases, square-grid regression guard.
Non-cuda build + clippy + full workspace tests green; TP load/dispatch
paths are cuda-gated → Gitea CUDA type-check. Operator pixel-budget
config + remaining doc cleanup follow in C5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 22:47:27 +03:00
dc048ffcc9 fix(neuron): vision-tower 2D positions + M-RoPE default on
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m36s
build-prerelease / Build cortex binary (push) Successful in 4m48s
build-prerelease / Build neuron-blackwell (push) Successful in 5m59s
CI / Test (push) Successful in 6m35s
build-prerelease / Build neuron-ampere (push) Successful in 7m51s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Successful in 5m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
Two fixes to the spatial handling of images, validated against the HF
transformers 4.57.1 qwen3_vl reference on beast.

**Vision tower (the real cause of poor spatial vision).** The Stage-A
tower encoded position wrongly in two ways, so the model saw image *content*
but not *layout* (a row of 5 people read as "a line of 23", sky
inverted), regardless of the LM-side rope:

- Learned pos-embed was a naive sequential lookup of the first
  `n_patches` rows of the 48×48 (`num_position_embeddings=2304`) grid —
  wrong stride for a 28×28 patch grid. Now bilinearly interpolates the
  grid to `gh×gw` (port of HF `fast_pos_embed_interpolate`), row-major.
- The 2D vision rotary was absent entirely. Added
  `VisionRotaryEmbedding` (θ=10000, dim=head_dim/2) applying per-patch
  `(row, col)` rotary to q/k in every ViT block via rope_slow, matching
  HF `apply_rotary_pos_emb_vision`.
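
The pos-embed fix can be illustrated with a plain bilinear resample over a flat `f32` table. This is a sketch of the idea, not the candle implementation or a line-by-line port of the HF routine; `interpolate_pos_embed` and its layout (row-major `n × n` rows of `dim` floats) are assumptions.

```rust
/// Bilinearly resample a row-major `n x n` table of `dim`-wide embeddings
/// to a `gh x gw` grid, emitted row-major.
fn interpolate_pos_embed(table: &[f32], n: usize, dim: usize, gh: usize, gw: usize) -> Vec<f32> {
    assert_eq!(table.len(), n * n * dim);
    // Evenly spaced sample coordinates over [0, n - 1].
    let coords = |g: usize| -> Vec<f32> {
        (0..g)
            .map(|i| if g == 1 { 0.0 } else { i as f32 * (n - 1) as f32 / (g - 1) as f32 })
            .collect()
    };
    let (ys, xs) = (coords(gh), coords(gw));
    let mut out = Vec::with_capacity(gh * gw * dim);
    for &y in &ys {
        let (y0, y1) = (y.floor() as usize, (y.ceil() as usize).min(n - 1));
        let fy = y - y0 as f32;
        for &x in &xs {
            let (x0, x1) = (x.floor() as usize, (x.ceil() as usize).min(n - 1));
            let fx = x - x0 as f32;
            for d in 0..dim {
                let at = |r: usize, c: usize| table[(r * n + c) * dim + d];
                // Blend the four neighbouring grid rows.
                let top = at(y0, x0) * (1.0 - fx) + at(y0, x1) * fx;
                let bottom = at(y1, x0) * (1.0 - fx) + at(y1, x1) * fx;
                out.push(top * (1.0 - fy) + bottom * fy);
            }
        }
    }
    out
}
```

At the native grid (`gh == gw == n`) every sample lands on a table row with zero fractional weight, so the output equals the sequential lookup, which is the property the new unit test checks.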

Both default on; `NEURON_VISION_LEGACY_POS=1` / `NEURON_VISION_LEGACY_ROPE=1`
revert each for A/B (no rebuild). New unit tests: interpolation reduces
to the sequential lookup at the native grid; rotary row/col structure.

**M-RoPE default on.** The interleaved M-RoPE matches HF
apply_interleaved_mrope / get_rope_index exactly and A/B'd strictly ≥
plain. `NEURON_MROPE` is now a kill switch (`=0` for plain), not opt-in
— defaults should encode the model's trained behaviour, not freeze the
broken state.

Vision tower is plain candle (CPU-testable): built, clippy-clean, full
workspace tests green locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 20:53:07 +03:00
7ebcfba5ca fix(neuron): gate M-RoPE behind NEURON_MROPE (default off)
All checks were successful
CI / CUDA type-check (push) Successful in 33s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m34s
build-prerelease / Build cortex binary (push) Successful in 4m33s
build-prerelease / Build neuron-blackwell (push) Successful in 6m14s
CI / Test (push) Successful in 6m50s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m12s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ada (push) Successful in 5m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m3s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
On beast the interleaved M-RoPE degraded image understanding rather than
fixing it: the model misread spatial layout (a horizontal row of people
described as a "diagonal receding line"), got attributes wrong, and
rambled: a "how many people" follow-up generated 4459 tokens over 3.5
minutes, past agent-0's HTTP timeout (the reported "fails to respond
without an error" symptom). The interleave is evidently not numerically
correct, and it
can't be validated remotely without a transformers reference.

Gate it: `get_rope_index` now returns plain sequential identity
positions unless NEURON_MROPE is truthy, so mrope_cos_sin reduces to
plain RoPE and image tokens behave exactly as pre-M-RoPE (content
recognition works; spatial layout approximate; no rambling). The real
computation moves to `compute_mrope_index` (still unit-tested). Default
off restores the working vision and unblocks agent-0; the M-RoPE code
stays in place to debug + validate before flipping the default on.

Pure non-cuda change (rope.rs); both single-GPU and TP forwards call
the gated get_rope_index unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 19:32:59 +03:00
825bf4e905 feat(neuron): M-RoPE Stage 4 — wire interleaved M-RoPE into the TP path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / CUDA type-check (push) Successful in 31s
CI / Format (push) Successful in 42s
build-prerelease / Build cortex binary (push) Successful in 5m9s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
CI / Test (push) Successful in 7m19s
build-prerelease / Build neuron-ampere (push) Successful in 8m40s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m53s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s
CI / Clippy (push) Successful in 2m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Mirror Stage 3 into the tensor-parallel Qwen3.6 model:

- TpQwen3_5Attention / DecoderLayer take (cos, sin) instead of a scalar
  offset and apply via apply_cos_sin.
- TpQwen3_5Model gains the replicated rotary + rope_delta (reset in
  clear_kv_cache, settable). forward_inner builds the cos/sin once —
  interleaved M-RoPE from explicit position_ids (vision) or plain at
  offset+rope_delta (text/decode). forward() and forward_with_positions()
  delegate; the old single-shot forward_with_vision is gone.
- prefill_with_images_chunked now computes get_rope_index over the whole
  prompt once, stores rope_delta on the base model, and slices the
  (3, prompt_len) position tensor per chunk — so every rank assigns image
  tokens their 14×14 grid coordinates and steps in lockstep (every chunk,
  text or image, carries the M-RoPE slice because the image shifts the
  surrounding text positions).

Also build the position-id tensor as f32 directly (positions are small
integers, exact in f32) to avoid an i64→f32 cast on the GPU.

The TP forward is cuda-gated — CI CUDA type-check is the compile gate.
Non-cuda build + clippy + full workspace tests green; rope math + the
plain-RoPE-reduction invariant covered by unit tests.

Completes the interleaved-M-RoPE work for the vision spatial misread.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:46:27 +03:00
4c12c7e2f0 feat(neuron): M-RoPE Stage 3 — wire interleaved M-RoPE into single-GPU
Qwen3_5Model now builds the rotary cos/sin once per forward and threads
(cos, sin) through the decoder → full-attention → rope, replacing the
scalar offset that reached RotaryEmbedding:

- vision forward computes get_rope_index over the (single-shot) prompt,
  sets rope_delta, and builds interleaved-M-RoPE cos/sin so image tokens
  carry their 14×14 grid (height/width) positions;
- text / decode take plain_cos_sin at offset + rope_delta — with
  rope_delta == 0 (no image) this is bit-for-bit the old plain RoPE, and
  the device→host id copy is skipped on the text decode hot path.

rope_delta is stored on the model and reset in clear_kv_cache, so decode
after a vision prefill resumes text positions from the image-compressed
counter. decoder.rs / full_attn.rs take (cos, sin) instead of offset;
linear-attention layers are unchanged (no RoPE). The TP path still uses
the retained apply(offset) — wired in Stage 4.

Full workspace tests green; the load-bearing invariant (M-RoPE == plain
for equal axes) keeps text unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:39:52 +03:00
ba1b5ba408 feat(neuron): M-RoPE Stage 2 — get_rope_index position-id helper
Pure function computing the interleaved-M-RoPE 3D position ids for a
prompt with image-placeholder runs, plus the decode rope_delta:
text tokens advance a single counter (all axes equal); each image run
gets [base+t, base+h, base+w] row-major over a square grid_t=1,
grid_h=grid_w=isqrt(run) (196 → 14×14); the counter resumes from
base + max(grid). rope_delta = final_counter - seq_len lets decode
resume text positions after the position-compressed image blocks.
Plus mrope_position_tensor to build the (3, seq) tensor.

Unit tests: text-only is sequential (delta 0); text+image+text matches
hand-computed grid ids + resume + delta; 196 → 14×14; non-square run
rejected; end-to-end through mrope_cos_sin tracks the height axis.

#[allow(dead_code)] until Stage 3/4 wire it into the forward.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:34:28 +03:00
5731f4c318 feat(neuron): M-RoPE Stage 1 — interleaved rope machinery + config
Parse + store mrope_section / mrope_interleaved in RopeParameters
(previously accepted-but-ignored). RotaryEmbedding gains:
- inv_freq + per-axis column masks (mask_t/h/w) built from mrope_section;
- plain_cos_sin(pos, seq_len): narrow the precomputed tables (text/decode);
- mrope_cos_sin(position_ids (3,seq)): per-axis freqs blended at the
  interleave columns (vision);
- apply_cos_sin(q,k,cos,sin): the rope_slow application, factored out.

The existing apply(q,k,offset) is retained (delegates to
plain_cos_sin + apply_cos_sin) so current callers are unchanged; Stages
3–4 move cos/sin construction into the model forward and thread the 3D
position ids for image tokens.

Tests: masks partition the half-dim; interleave drives the right axis
per column; and the load-bearing invariant — mrope_cos_sin reduces
bit-for-bit to plain_cos_sin when the three axes are equal (so text
inference is unchanged).
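
A minimal sketch of why the invariant holds, using plain vectors instead
of tensors. The function names are hypothetical, and the column layout
(h on columns 1, 4, 7, ... and w on columns 2, 5, 8, ..., each bounded by
three times its section size, everything else t) is an assumption about
the interleave rather than something this commit message spells out.

```rust
/// Inverse frequencies for one head (standard RoPE formula).
pub fn inv_freq(head_dim: usize, theta: f32) -> Vec<f32> {
    (0..head_dim / 2)
        .map(|i| 1.0 / theta.powf(2.0 * i as f32 / head_dim as f32))
        .collect()
}

/// Assumed interleave: which axis (0 = t, 1 = h, 2 = w) drives each
/// frequency column. Columns not claimed by h or w fall back to t.
pub fn axis_per_column(section: [usize; 3], half_dim: usize) -> Vec<usize> {
    let mut axis = vec![0usize; half_dim];
    for a in 1..=2usize {
        let end = (section[a] * 3).min(half_dim);
        let mut col = a; // h starts at column 1, w at column 2
        while col < end {
            axis[col] = a;
            col += 3;
        }
    }
    axis
}

/// Rotation angles for one token under M-RoPE: each column takes the
/// position of the axis that owns it.
pub fn mrope_angles(pos: [f32; 3], inv_freq: &[f32], axis: &[usize]) -> Vec<f32> {
    inv_freq.iter().zip(axis).map(|(f, &a)| pos[a] * f).collect()
}

/// Rotation angles for one token under plain RoPE.
pub fn plain_angles(pos: f32, inv_freq: &[f32]) -> Vec<f32> {
    inv_freq.iter().map(|f| pos * f).collect()
}
```

When the three axes are equal, every column multiplies the same position
by the same frequency as the plain path, so the angles (and hence cos and
sin) are identical rather than merely close.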

Refs the MRoPE-gap diagnosis (vision spatial misread). Pure non-cuda;
no behaviour change until wired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 18:31:15 +03:00
fa013505d1 fix(neuron): chunked TP-vision prefill + pre-flight VRAM guard
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 29s
build-prerelease / Build cortex binary (push) Successful in 4m26s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-blackwell (push) Successful in 6m6s
build-prerelease / Build neuron-ampere (push) Successful in 8m30s
CI / Format (push) Successful in 38s
CI / CUDA type-check (push) Successful in 47s
CI / Clippy (push) Successful in 2m36s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
CI / Test (push) Successful in 6m3s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m32s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
agent0 sent a ~13k-token prompt + image; the TP vision prefill was
single-shot, so it tried to materialise activations for all 12,960
positions at once and OOM'd rank 1 mid-forward. Rank 1 died before
issuing its row-parallel AllReduce, stranding rank 0 on the collective
(it hung holding the pool lock). The text path survives the same size
because it chunks the prefill.

Chunk the vision prefill the same way:

- TpQwen3_5ForCausalLM::prefill_with_images_chunked encodes the image(s)
  once, then walks the pre-expanded prompt in prefill_chunk_tokens()
  windows, splicing the patch-embedding rows into whichever chunk(s)
  carry <|image_pad|> positions (pure-text chunks take the plain
  forward). Activation is bounded by the chunk, not the prompt.
- Every rank runs the identical chunk sequence (chunk_size threaded
  through GenerateStepWithImages / TpForwardLogitsWithImages /
  generate_step_with_images), so the per-chunk AllReduces stay paired
  across ranks with no extra sync — the KV cache accumulates via the
  growing offset, only the last chunk's logits are kept.
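
The chunk walk can be sketched as a pure planning step. ChunkPlan and
plan_vision_chunks are hypothetical names for illustration; the sketch
only shows how each window maps to a slice of the once-encoded
patch-embedding rows, and it leaves out the forward pass and the
AllReduce.

```rust
/// Hypothetical plan for one prefill window.
#[derive(Debug, PartialEq)]
pub struct ChunkPlan {
    /// Token positions covered by this window.
    pub tokens: std::ops::Range<usize>,
    /// Rows of the image-embedding matrix spliced into this window
    /// (empty for a pure-text window).
    pub embed_rows: std::ops::Range<usize>,
}

/// Walk the pre-expanded prompt in fixed-size windows and record which
/// embedding rows each window consumes. Every rank that runs this with
/// the same inputs gets the same plan.
pub fn plan_vision_chunks(
    input_ids: &[u32],
    image_token_id: u32,
    chunk_size: usize,
) -> Vec<ChunkPlan> {
    let chunk_size = chunk_size.max(1); // guard against a zero window
    let mut plans = Vec::new();
    let mut rows_used = 0;
    let mut start = 0;
    for chunk in input_ids.chunks(chunk_size) {
        let n = chunk.iter().filter(|&&t| t == image_token_id).count();
        plans.push(ChunkPlan {
            tokens: start..start + chunk.len(),
            embed_rows: rows_used..rows_used + n,
        });
        rows_used += n;
        start += chunk.len();
    }
    plans
}
```

Because the plan depends only on the token ids and chunk_size, threading
chunk_size through the RPC is enough to keep the per-chunk collectives
paired.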

Pre-flight guard (validate_vision_prefill): even chunked, a long
prompt's KV cache can exhaust VRAM mid-forward, and on TP that hangs
the collective. Reject up front with a clean InsufficientVram when the
estimated footprint exceeds free VRAM, so a doomed request fails fast
instead of hanging the daemon. Heuristic + tunable
(NEURON_VISION_PREFILL_MB_PER_1K_TOKENS / _BASE_MB); default permissive
so the now-working 12,960-token case still passes. Applied to every
vision path (single-GPU + TP); single-GPU vision stays single-shot for
now, so the guard is its protection until it's chunked too.
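
One way such a guard could look, as a sketch only. The linear formula
(base cost plus a per-1k-token cost), the function name and the error
struct are assumptions; the commit message says the estimate is a tunable
heuristic but does not give its shape.

```rust
/// Hypothetical error mirroring the InsufficientVram rejection.
#[derive(Debug, PartialEq)]
pub struct InsufficientVram {
    pub needed_mb: u64,
    pub free_mb: u64,
}

/// Reject a vision prefill up front when the estimated footprint exceeds
/// free VRAM. The estimate is assumed linear in prompt length:
/// base_mb + ceil(prompt_tokens * mb_per_1k_tokens / 1000).
pub fn validate_vision_prefill_sketch(
    prompt_tokens: u64,
    free_mb: u64,
    base_mb: u64,
    mb_per_1k_tokens: u64,
) -> Result<(), InsufficientVram> {
    let needed_mb = base_mb + (prompt_tokens * mb_per_1k_tokens + 999) / 1000;
    if needed_mb > free_mb {
        Err(InsufficientVram { needed_mb, free_mb })
    } else {
        Ok(())
    }
}
```

The two tunables would be read from the environment by the caller, which
keeps the check itself a pure function that is easy to unit-test.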

Tests: pre-flight guard behaviour; RPC round-trip carries chunk_size.
The chunked forward is cuda-gated — CI CUDA type-check validates it.

Refs #16 / TP-vision. Operational note: a TP rank OOM still hangs the
daemon (needs restart); making a worker failure abort the leader's
collective is separate, broader TP hardening.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 17:21:36 +03:00
c8bcaabc38 fix(neuron): render HF chat templates via minijinja pycompat
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 34s
CI / CUDA type-check (push) Successful in 39s
CI / Clippy (push) Successful in 2m35s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
CI / Test (push) Successful in 6m47s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 7m43s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Successful in 5m41s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
The Qwen3.6 chat_template.jinja (now loaded after the precedence fix)
failed to render in minijinja: it uses Python str methods
(content.startswith/endswith/split/rstrip/lstrip) and the raise_exception
global that HF transformers patches into its Jinja env but minijinja
doesn't provide. The render error tripped the text-only fallback, so
image requests still produced zero <|image_pad|> tokens.

Wire the standard bridge into render_chat_template:
- minijinja-contrib `pycompat::unknown_method_callback` supplies the
  Python string/list/dict methods;
- a `raise_exception` global maps to a render error (so malformed inputs
  — e.g. an image in a system message — surface cleanly).

Add the real Qwen3.6-27B chat_template.jinja (verbatim from beast's HF
cache) as a test fixture and assert it renders one <|image_pad|> for a
text+image turn — the end-to-end check that would have caught this
before deploy.

Refs #16 / TP-vision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 16:32:23 +03:00
7ad56c6a86 fix(neuron): load chat_template.jinja (transformers precedence)
The chat-template loader only read the `chat_template` field from
tokenizer_config.json. Qwen3.6-27B ships its vision-aware template
*only* in a standalone `chat_template.jinja` (and has no
tokenizer_config.json at all), so the loader returned None and image
requests fell back to the text-only format_qwen3_prompt — rendering
zero `<|image_pad|>` tokens and tripping
"expand_image_pad_tokens: prompt has 0 image_token_id occurrences".

load_chat_template_alongside now follows HF transformers precedence:
standalone chat_template.jinja → chat_template.json → the
chat_template field in tokenizer_config.json. Tests cover the
precedence, the text-only fallback, and that an OpenAI image_url
content part renders `<|image_pad|>` through the real template
condition (`'image_url' in item`).
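
The precedence order can be sketched without touching the filesystem by
passing in an existence check. pick_template_file and TEMPLATE_PRECEDENCE
are hypothetical names; a real loader would also have to fall through
when tokenizer_config.json exists but carries no chat_template field,
which this sketch does not model.

```rust
/// Candidate files, highest precedence first.
pub const TEMPLATE_PRECEDENCE: [&str; 3] = [
    "chat_template.jinja",
    "chat_template.json",
    "tokenizer_config.json",
];

/// Return the first candidate that exists, or None so the caller can
/// take the text-only fallback.
pub fn pick_template_file(exists: impl Fn(&str) -> bool) -> Option<&'static str> {
    TEMPLATE_PRECEDENCE
        .iter()
        .copied()
        .find(|name| exists(*name))
}
```

In use, the closure would typically be something like
`|name| model_dir.join(name).is_file()` for a `model_dir: &Path`.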

Refs #16 / TP-vision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 16:25:30 +03:00
1b0e36c119 fix(neuron): cover TpForwardLogitsWithImages in drain_poisoned match
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m18s
build-prerelease / Build neuron-blackwell (push) Successful in 5m48s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
CI / Test (push) Successful in 6m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m26s
build-prerelease / Build neuron-ada (push) Successful in 5m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
The CUDA type-check caught a non-exhaustive match: drain_poisoned()
must send an error on every Job variant's reply channel, including the
new cuda-gated TpForwardLogitsWithImages. The non-cuda build couldn't
see it — the variant is #[cfg(feature = "cuda")], so the match is
exhaustive without it on CPU.
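
A reduced illustration of the pattern, with a simplified Job enum that is
not the real one (the variant payloads and the Reply alias are
assumptions). The gated variant and its match arm only exist when the
cuda feature is on, so a build without the feature cannot notice a
missing arm.

```rust
use std::sync::mpsc::Sender;

/// Simplified reply channel type for the sketch.
type Reply = Sender<Result<Vec<f32>, String>>;

pub enum Job {
    ForwardLogits { reply: Reply },
    // Compiled out entirely unless the `cuda` feature is enabled.
    #[cfg(feature = "cuda")]
    TpForwardLogitsWithImages { reply: Reply },
}

/// Answer every queued job with an error so no caller blocks forever.
/// Returns how many jobs were drained.
pub fn drain_poisoned(jobs: Vec<Job>) -> usize {
    let mut drained = 0;
    for job in jobs {
        let reply = match job {
            Job::ForwardLogits { reply } => reply,
            // Without this arm the match is still exhaustive on a
            // non-cuda build, and non-exhaustive on a cuda build.
            #[cfg(feature = "cuda")]
            Job::TpForwardLogitsWithImages { reply } => reply,
        };
        // The receiver may already be gone; ignoring that is fine here.
        let _ = reply.send(Err("device worker poisoned".to_string()));
        drained += 1;
    }
    drained
}
```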

Refs TP-vision plan Stage 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:26:46 +03:00
ed2d09864e feat(neuron): TP-vision Stage 3 — wire TP chat + stream vision prefill
Some checks failed
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m51s
CI / Test (push) Successful in 5m52s
CI / CUDA type-check (push) Failing after 50s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
End-to-end TP-vision: an image request to a TP-loaded Qwen3.6-27B now
conditions on the image across both ranks.

- TpLoadedModel carries has_vision / image_token_id / lm_tokens_per_image,
  populated at load via the shared VisionMeta::from_config_path (same
  config.json the shards loaded from; Stage 1 materialises the replicated
  tower on every rank).
- LoadedHandle::capabilities() now advertises "vision" for TP loads with
  a tower (cortex-gateway already unions this into /v1/models via C3).
- The TP rejection guards (chat_completion_tp + inference_tp_stream) are
  now conditional on !has_vision — text-only TP models still 400 cleanly,
  vision-capable ones fall through.
- chat_completion_tp_inner and the streaming orchestration task detect
  images (request_has_images), expand <|image_pad|> to the per-image
  patch count, and run a single-shot generate_step_with_images prefill
  (every rank encodes + splices its replicated tower) before the
  unchanged decode loop. Text requests keep chunked_prefill_tp.
- extract_image_data_uris ships the source data URIs to every rank for
  identical per-rank preprocessing.

prompt_tokens now reflects the patch expansion, so usage accounting and
KV offsets match the single-GPU baseline.

TP entry points are cuda-gated (validated by CI's CUDA type-check);
capabilities() + extract_image_data_uris + VisionMeta reuse compile on
the non-cuda build. Full workspace test green.

Refs TP-vision plan Stage 3. Implements #12.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:14:44 +03:00
4994b94c84 feat(neuron): TP-vision Stage 2 — per-rank image RPC + worker plumbing
Carry image content through the TP forward path so every rank encodes
and splices locally (replicated tower, no embedding broadcast).

- rpc.rs: new WorkerRequest::GenerateStepWithImages carrying the source
  image data URIs + image_token_id for the single-shot vision prefill;
  worker still replies GenerateStepOk. Round-trip test added.
- tp_qwen3_5.rs: TpQwen3_5ForCausalLM::forward_with_images — encode each
  preprocessed image through the rank's replicated tower, cat, splice,
  forward. Shared by leader and worker so every rank runs identical work.
- tp/mod.rs: TpLeaderModel::forward_with_images and
  WorkerPool::generate_step_with_images (mirrors generate_step: fan out
  GenerateStepWithImages to subprocess ranks, run the leader's image
  forward on its device worker thread, drain, combine).
- worker.rs: WorkerModel::forward_with_images + handle_generate_step_with_images
  — each subprocess rank preprocesses the same data URIs via the shared
  deterministic preprocess_data_uri, encodes, splices, forwards.
- device_worker: Job::TpForwardLogitsWithImages + tp_forward_logits_with_images
  dispatch handler + DeviceWorkerHandle::tp_forward_logits_with_images.

Determinism: every rank runs the same preprocess on the same source
URIs through the same replicated tower, so the spliced hidden state
matches across ranks — preserving the replicated-hidden-state invariant
the row-parallel AllReduce relies on, with no NCCL broadcast.

No caller yet — Stage 3 wires the TP chat/stream entry points to invoke
generate_step_with_images for image prefill. cuda-gated plumbing covered
by CI's CUDA type-check; rpc/route/forward_with_images compile on the
non-cuda build.

Refs TP-vision plan Stage 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:08:08 +03:00
9a24b05866 feat(neuron): TP-vision Stage 1 — replicated vision tower on the TP model
Load the full, unsharded model.visual.* vision tower on every TP rank
(leader + each subprocess worker mmaps the same local safetensors) when
config.vision_config is present. VisionTower::load already takes a
ShardedVarBuilder whose plain .get() returns the full replicated tensor,
so the tower loads identically regardless of world_size — no sharding,
no NCCL broadcast.

- TpQwen3_5ForCausalLM gains vision: Option<VisionTower> + image_token_id,
  plus has_vision/image_token_id/encode_image/forward_with_vision,
  mirroring the single-GPU Qwen3_5ForCausalLM wrapper.
- TpQwen3_5Model::forward_with_vision mirrors the single-GPU
  forward_inner splice: embed locally, replace rows at image_token_id
  positions, run the sharded decoder stack. Because every rank encodes
  the same pixels through its replicated tower, the spliced input
  embeddings are identical across ranks — preserving the TP
  replicated-hidden-state invariant the row-parallel AllReduce relies on.
- splice_runs is now pub(crate) and shared with the TP model.
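
The splice step can be sketched on plain vectors. find_runs and
splice_rows are hypothetical names and the row-of-f32 representation is a
simplification; the real helper operates on tensors.

```rust
/// Locate contiguous runs of the image placeholder as (start, len).
pub fn find_runs(input_ids: &[u32], image_token_id: u32) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < input_ids.len() {
        if input_ids[i] == image_token_id {
            let len = input_ids[i..]
                .iter()
                .take_while(|&&t| t == image_token_id)
                .count();
            runs.push((i, len));
            i += len;
        } else {
            i += 1;
        }
    }
    runs
}

/// Overwrite the embedding rows at placeholder positions with the image
/// embeddings, consumed in order. Text rows are left untouched.
pub fn splice_rows(
    embeds: &mut [Vec<f32>],
    input_ids: &[u32],
    image_token_id: u32,
    image_embeds: &[Vec<f32>],
) -> Result<(), String> {
    if embeds.len() != input_ids.len() {
        return Err("one embedding row per token is required".to_string());
    }
    let runs = find_runs(input_ids, image_token_id);
    let needed: usize = runs.iter().map(|&(_, len)| len).sum();
    if needed != image_embeds.len() {
        return Err(format!(
            "{needed} placeholder positions but {} image rows",
            image_embeds.len()
        ));
    }
    let mut next = 0;
    for (start, len) in runs {
        for k in 0..len {
            embeds[start + k] = image_embeds[next].clone();
            next += 1;
        }
    }
    Ok(())
}
```

Since the function is deterministic in its inputs, ranks that feed it the
same token ids and the same encoded rows end up with identical spliced
embeddings.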

No caller yet — Stage 2 wires the RPC/worker path that invokes
encode_image + forward_with_vision per rank. Most of this compiles on
the non-cuda build (only the cuda load variant's tower line is gated);
CI's CUDA type-check covers the rest.

Refs TP-vision plan Stage 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 15:00:05 +03:00
7bb033b4ed chore: untrack stray .claude/scheduled_tasks.lock and gitignore .claude/
All checks were successful
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 30s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Clippy (push) Successful in 2m45s
build-prerelease / Build cortex binary (push) Successful in 4m28s
CI / Test (push) Successful in 6m6s
build-prerelease / Build neuron-blackwell (push) Successful in 6m11s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m28s
build-prerelease / Build neuron-ampere (push) Successful in 8m1s
build-prerelease / Build neuron-ada (push) Successful in 8m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
A runtime scheduler lock was accidentally swept into the previous
commit by `git add -A`. Remove it from tracking (file stays on disk)
and ignore the whole `.claude/` dir so local agent runtime state never
lands in the repo again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:55:05 +03:00
f8c0da0ebf fix(neuron): TP-vision Stage 0 — reject image requests on the TP path
Some checks failed
build-prerelease / Resolve version stamps (push) Waiting to run
CI / Format (push) Waiting to run
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Clippy (push) Has been cancelled
CI / Test (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
The TP inference path has no vision tower, and the TP dispatch in
chat_completion / inference_stream returns before the VisionUnsupported
guard runs — so an image request to a TP-loaded model (e.g. beast's
tp=2 Qwen3.6-27B) was silently dropped and answered from text alone,
the exact issue-#3 confident-hallucination pattern Stage C killed for
single-GPU.

Add the request_has_images → VisionUnsupported guard to both
chat_completion_tp and inference_tp_stream, before prefill / before the
SSE stream opens, so beast returns a clean 400 vision_unsupported. The
guard is unconditional for now (TP has no tower); Stage 3 makes it
conditional on the TP model's has_vision once real TP-vision lands.

Detection is covered by the existing request_has_images unit test; the
guard itself is cuda-gated (validated by CI's CUDA type-check).

Refs TP-vision plan Stage 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:53:56 +03:00
dd592d918d test(neuron): C2 — guard Responses→chat image translation contract
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 44s
CI / Clippy (push) Successful in 2m51s
build-prerelease / Build cortex binary (push) Successful in 4m42s
build-prerelease / Build neuron-blackwell (push) Successful in 5m52s
CI / Test (push) Successful in 6m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m12s
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-ada (push) Successful in 5m34s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The Responses request translator already emits the chat `image_url`
Parts array Stage B5's vision path consumes, and the non-streaming
(`chat_completion`) and streaming (`responses_stream` → `inference_stream`,
Stage C1) Responses paths both route image content to the vision-aware
prefill — so vision works end-to-end through `/v1/responses` with no
translator change required.

Add a multi-image test asserting order preservation and that the
`detail` hint is tolerated (and dropped, since chat image_url has no
analogue), locking the translator's output to the exact
`image_url.url` shape `extract_images_from_request` walks.

Closes part of #16 (Stage C2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 13:57:43 +03:00
766c20ba47 feat(neuron): C1 — streaming SSE chat completion with vision
The streaming worker path now splices image embeddings on prefill,
closing the silent text-only degrade for `stream=true` image requests.

`inference_stream` gains the same vision-routing block as the
non-streaming `chat_completion`: detect `image_url` content, reject it
against text-only models with `VisionUnsupported` (before any SSE frame
is sent), preprocess each image and expand its `<|image_pad|>` sentinel
to the per-image patch count, then carry the payload through dispatch.

Rather than duplicate the 75-line `route_token!` reasoning/tool-call
state machine into a sibling streamer, `stream_inference_via_worker`
takes an `Option<(Vec<ImageInput>, u32)>`: when `Some`, prefill is a
single-shot `forward_logits_with_images` splice; when `None`, the
original chunked text-only prefill. Image embeddings are prefill-only,
so every decode step stays on the plain `forward_logits` path and the
shared decode loop is untouched. This keeps exactly one copy of the
tool-call/reasoning logic to maintain.

The Responses API streaming path (`responses_stream`) inherits vision
for free since it drives the same `inference_stream`.

Unit test covers `request_has_images` (the shared routing gate); the
real-weights SSE smoke is the manual curl on beast (cuda-integration).

Closes part of #16 (Stage C1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 13:57:02 +03:00
4972c7d1e7 feat(cortex-gateway): C3 — propagate vision capabilities through /v1/models
ModelEntry and CortexModelEntry gain a `capabilities: Vec<String>`
field (serde-default for back-compat). The poller copies it verbatim
from each neuron's ModelInfo.capabilities; list_models computes the
union across every node where a model is loaded so a checkpoint loaded
text-only on one neuron and text+vision on another reports both to the
fleet. Catalogue-only and mid-prewarm entries default to empty until
the catalogue gains a capabilities declaration.
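
The union itself is small. A sketch, assuming a hypothetical
union_capabilities helper and a sorted, de-duplicated output order (the
commit message does not say how the gateway orders the result):

```rust
use std::collections::BTreeSet;

/// Merge the capability lists reported by each node that has the model
/// loaded. Duplicates collapse; output is sorted for stable responses.
pub fn union_capabilities(per_node: &[Vec<String>]) -> Vec<String> {
    let set: BTreeSet<&str> = per_node
        .iter()
        .flatten()
        .map(String::as_str)
        .collect();
    set.into_iter().map(str::to_owned).collect()
}
```

An empty input yields an empty list, which matches the default described
for catalogue-only and mid-prewarm entries.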

Aliases inherit their target's capability union. New gateway test mocks
two nodes with differing capability arrays and asserts the unioned
/v1/models response.

Closes part of #16 (Stage C3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 13:49:54 +03:00
a26bb9f04b feat(deploy): capture service startup journal after each restart
After both `Start cortex.service` and `Start neuron.service`, sleep 10s
and run `journalctl --unit <unit> -I --no-pager` to record the latest
invocation's log in the workflow output. Step is guarded by
`if: always()` so a failed start still leaves a usable trace.

infra-setup.sh now adds gitea_ci to the systemd-journal group during
user provisioning, so `journalctl` works without a sudoers entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 16:48:56 +03:00
ea1fdf8aa6 chore(deploy): drop deploy.sh and manifest.yml now that workflow runs
First end-to-end run of the deploy workflow succeeded (gitea run #289),
so the operator-run rolling-deploy script and its YAML manifest are no
longer the source of truth — fleet topology lives in
.gitea/workflows/deploy.yml and per-host config in script/infra-setup.sh.

Per-host neuron config comments updated to point at the new sync path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 16:41:04 +03:00
577781de8d fix(neuron): derive Clone on ImageInput for the CUDA vision dispatch
All checks were successful
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m47s
build-prerelease / Build cortex binary (push) Successful in 4m34s
CI / Test (push) Successful in 6m14s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m58s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ampere (push) Successful in 8m5s
build-prerelease / Build neuron-ada (push) Successful in 8m9s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
CUDA type-check in CI failed on commit 24968e9 with E0308:

  error[E0308]: mismatched types
      --> crates/neuron/src/harness/candle.rs:1707:33
   1707 |                                 images.clone(),
        |                                 ^^^^^^^^^^^^^^ expected `Vec<ImageInput>`,
                                                          found `&Vec<ImageInput>`

In Stage B5 the cuda branch of `chat_completion` matches
`&vision_route` to keep the `vision_route: Option<...>` alive for
both arms, which makes `images` bind as `&Vec<ImageInput>`. The
subsequent `images.clone()` call doesn't deep-clone because
`ImageInput` doesn't derive `Clone` — rustc falls back to cloning
the `&Vec` reference, which has the wrong type for the worker job.

The CPU build (non-cuda) compiled fine because that branch is
behind `#[cfg(feature = "cuda")]`; the cuda-check job is what
catches the regression.

Fix: derive `Clone` on `ImageInput`. The clone cost is one
pixel-buffer memcpy per image (~2.4 MiB at fixed 448×448), which
is fine on the chat-completion hot path — vision requests are
far less frequent than text-only decode steps.
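
A reduced version of the shape that failed, shown here in its fixed form.
The struct fields and the route/dispatch functions are hypothetical; only
the borrow pattern matters.

```rust
/// With `Clone` derived, `Vec<ImageInput>` is `Clone` too.
#[derive(Clone, Debug, PartialEq)]
pub struct ImageInput {
    pub pixels: Vec<f32>,
    pub c: usize,
    pub h: usize,
    pub w: usize,
}

/// Stand-in for the worker job, which needs an owned Vec.
pub fn dispatch(images: Vec<ImageInput>) -> usize {
    images.len()
}

pub fn route(vision_route: &Option<(Vec<ImageInput>, u32)>) -> usize {
    // Matching on a reference binds `images` as `&Vec<ImageInput>`.
    match vision_route {
        // If ImageInput were not Clone, Vec<ImageInput> would not be
        // Clone either, so `images.clone()` would resolve to cloning the
        // reference and produce `&Vec<ImageInput>`: the E0308 above.
        Some((images, _image_token_id)) => dispatch(images.clone()),
        None => 0,
    }
}
```

Removing the derive from this sketch reproduces the mismatched-types
error at the `dispatch(images.clone())` call.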

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 15:51:57 +03:00
24968e9233 feat(neuron): Stage B — end-to-end text+image chat for Qwen3.6
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 33s
CI / CUDA type-check (push) Failing after 46s
CI / Clippy (push) Successful in 2m37s
build-prerelease / Build cortex binary (push) Successful in 4m32s
build-prerelease / Build neuron-blackwell (push) Failing after 5m35s
CI / Test (push) Successful in 6m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Failing after 7m46s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ada (push) Failing after 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Stage B of the vision plan (doc/vision-qwen3_6-spec.md). Wires
the vision tower from Stage A through to a complete non-streaming
chat completion: extract images from the request, preprocess,
encode on the worker thread, splice embeddings into the LM input
at `<|image_pad|>` positions, return a coherent text response with
`prompt_tokens` reflecting patch tokens.

Closes the silent-drop class of failures from issue #3 — vision
requests against Qwen3.6 now condition the model on the image
instead of producing confident text-only hallucinations.

Streaming for vision is Stage C. Deferred items tracked under
#12 (TP-vision), #13 (27B production), #14 (dynamic resolution),
#15 (numerical validation).

What landed:

- **B1 — `Qwen3_5Model::forward_with_vision`**: text-only `forward`
  unchanged; new method takes `(input_ids, offset, image_embeds,
  image_token_id)`, embeds tokens, locates `image_token_id`
  positions, splices via the new `splice_runs` helper. MRoPE
  applies text-positions to image tokens for Stage B (spatial
  MRoPE is the issue #15 numerical-validation follow-up). 2 unit
  tests for `splice_runs` covering contiguous + non-contiguous
  runs.

- **B2 — `ModelArch::forward_with_vision` dispatch**: routes
  Qwen3_5Dense to the new method; other arches return an error.
  Defence-in-depth — the HTTP layer (B6) already rejects image
  content for non-vision models.

- **B3 — `Job::ForwardLogitsWithImages`**: new worker variant
  carrying tokens + per-image `(pixels, c, h, w)` payloads. The
  dispatcher encodes each image (device-resident), concatenates
  the resulting embeddings, calls `arch.forward_with_vision`, and
  returns CPU logits. Image embeddings never copy back to CPU —
  the "tensors don't escape the worker" invariant from the
  per-device worker refactor still holds. Poisoned-worker drain
  path handles the new variant.

- **B4 — Prompt builder**:
  - `request_has_images` detects image content cheaply.
  - `extract_images_from_request(request, profile)` walks
    `MessageContent::Parts`, decodes data URIs, runs
    `harness::preprocess::preprocess` per image, returns
    `Vec<ImageInput>` in request order.
  - `expand_image_pad_tokens(input_ids, image_token_id,
    patches_per_image)` walks the tokenized prompt and replaces
    each `<|image_pad|>` (id 248056 for Qwen3.6) with N copies
    matching the per-image patch count. 4 unit tests.
  - `VisionMeta::from_config_path` peeks `config.json` at load
    time for `image_token_id`, vision_config patch/merge sizes,
    and derives `lm_tokens_per_image` for the Stage B fixed
    resolution.

- **B5 — `chat_completion` vision routing**: detects image
  content, validates the loaded model has vision, expands the
  prompt, and calls a new `run_inference_with_images_via_worker`
  helper that does single-shot prefill + standard decode loop
  (KV cache holds the post-splice hidden states from prefill, so
  decode steps don't re-splice). Stage B skips chunked prefill
  for vision — at 448×448 fixed resolution the budget stays well
  under the activation-memory threshold. Long-vision chunking is
  a Stage D follow-up.

- **B6 — `InferenceError::VisionUnsupported`**: structured 400
  with `code=vision_unsupported, model_id, suggestion` when an
  image request hits a non-vision model. Closes the agent0
  failure mode where vision requests degraded silently.

- **B7 — `ModelInfo.capabilities`**: per-model array (`["text"]`
  vs `["text", "vision"]`) in `/v1/models` and forwarded verbatim
  by cortex-gateway. Lets clients (litellm, agent0) gate
  image_url submission on the declared capability set. Optional
  in the wire format; defaults to empty for older clients.
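
The placeholder expansion from B4 can be sketched as follows.
expand_image_pad_sketch is a hypothetical name and the error text is
illustrative; it assumes one `<|image_pad|>` token per image in the
tokenized prompt before expansion.

```rust
/// Replace the i-th placeholder token with patches_per_image[i] copies.
pub fn expand_image_pad_sketch(
    input_ids: &[u32],
    image_token_id: u32,
    patches_per_image: &[usize],
) -> Result<Vec<u32>, String> {
    let found = input_ids.iter().filter(|&&t| t == image_token_id).count();
    if found != patches_per_image.len() {
        return Err(format!(
            "prompt has {found} image_token_id occurrences, expected {}",
            patches_per_image.len()
        ));
    }
    let extra: usize = patches_per_image.iter().sum();
    let mut out = Vec::with_capacity(input_ids.len() + extra);
    let mut next = 0;
    for &t in input_ids {
        if t == image_token_id {
            // One placeholder becomes N patch slots for this image.
            out.extend(std::iter::repeat(t).take(patches_per_image[next]));
            next += 1;
        } else {
            out.push(t);
        }
    }
    Ok(out)
}
```

With one 196-patch image, a prompt of k tokens grows to k + 195, which is
why prompt_tokens has to be counted after expansion.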

CI gate: cargo fmt --check, cargo clippy --workspace --all-targets
-- -D warnings, cargo test --workspace (all 28 test groups ok,
124 lib tests). New unit-test counts: +2 splice_runs, +4
expand_image_pad.

Manual verification (after RPMs deploy on beast):

  curl http://hanzalova.internal:31313/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"Qwen/Qwen3.6-27B\", \"messages\":[{\"role\":\"user\",\"content\":[
      {\"type\":\"text\",\"text\":\"What's in this image?\"},
      {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,...\"}}
    ]}], \"max_tokens\":120}" | jq

  Expect prompt_tokens > 196 (text + 196 patch tokens) and a
  response that references actual image content.
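
  The 196 figure follows from the Stage B fixed resolution and the vision config values quoted in the Stage A notes (`patch_size=16`, `spatial_merge_size=2`). A minimal sketch of the arithmetic, assuming square images and exact divisibility; the function name mirrors the `lm_tokens_per_image` field but the signature is an assumption:

```rust
/// LM-side tokens for one square image: patches per side, reduced by the
/// spatial merge window, then squared.
fn lm_tokens_per_image(image_size: usize, patch_size: usize, spatial_merge_size: usize) -> usize {
    let patches_per_side = image_size / patch_size; // 448 / 16 = 28
    let merged_per_side = patches_per_side / spatial_merge_size; // 28 / 2 = 14
    merged_per_side * merged_per_side // 14 * 14 = 196
}
```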

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 15:33:00 +03:00
7df84fed8f feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 28s
CI / Clippy (push) Successful in 2m35s
build-prerelease / Build cortex binary (push) Successful in 5m13s
build-prerelease / Build neuron-blackwell (push) Successful in 6m23s
build-prerelease / Build neuron-ampere (push) Successful in 7m56s
CI / Test (push) Successful in 7m11s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Successful in 5m30s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 4m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Stage A of the vision implementation plan
(doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding
needed to fix today's silent-drop failure mode (issue #3): the
Qwen3.6 ViT loads from `model.visual.*`, runs forward producing
post-merger LM-side image embeddings, and routes through the
device worker via a new `Job::EncodeImage`. No LM splice yet;
that's Stage B.

Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision),
#13 (27B production deploy), #14 (dynamic resolution), #15
(numerical validation).

What landed:

- **A0 — investigation**: pulled config.json, preprocessor_config.json,
  chat_template.jinja, and safetensors index from beast's local
  Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with
  exact tensor shapes for every `model.visual.*` weight. Confirms
  27-block ViT with `hidden_size=1152`, `patch_size=16`,
  `spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives
  in 2 of the 15 safetensors shards.

- **A1 — deps + scaffolding**: added `image = "0.25"`
  (default-features off, PNG/JPEG/WebP/BMP/GIF) and
  `base64 = "0.22"` to crates/neuron/Cargo.toml. Created
  `harness::preprocess` and `harness::arch::qwen3_5::vision` modules.

- **A2 — preprocess.rs**: `decode_data_uri` strips
  `data:image/...;base64,...` → image bytes → `image::DynamicImage`
  (rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess`
  resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448),
  normalises to `[-1, 1]` per the model's mean/std=0.5, emits
  row-major `(3, H, W)` f32. 9 unit tests covering data URI parse,
  decode failure paths, grayscale-to-RGB promotion, and the
  exact-value normalisation contract.

- **A3 — vision.rs**: `VisionTower` struct with `patch_embed: Conv2d`,
  learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN +
  multi-head self-attention with fused QKV + GELU-tanh MLP +
  residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat →
  linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size).
  Includes the Conv3d→Conv2d fold trick documented at the top of
  the file — the published patch_embed.proj.weight is 5D
  `(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static
  images we sum-collapse the temporal axis. Video would need real
  Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh`
  reference values from PyTorch.

- **A4 — wire vision into Qwen3_5ForCausalLM**: extended `Config`
  with optional `vision_config: Option<VisionConfig>` and
  `image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision
  tower when present, exposes `has_vision()` and `vision()` so the
  HTTP layer can advertise capability and so the encode path can
  reach it.

- **A5 — device worker `Job::EncodeImage`**: new job variant carrying
  CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the
  tensor on the worker's device, calls `arch.encode_image(image)`,
  copies the result back to CPU as flat `Vec<f32>`. Keeps the
  "tensors don't escape the worker" invariant. Poisoned-worker
  drain path handles the new variant.

- **A6 — dispatch round-trip test**:
  `encode_image_routes_to_dispatch_and_errors_on_unknown_handle`
  proves the channel/dispatch wiring
  works end-to-end via the CPU device worker (errors on unknown
  ArchHandle, which is the expected behaviour without a loaded
  model — real-weights validation happens in Stage B when the LM
  splice path exists).
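
The normalisation contract from A2 can be sketched as follows. This is an illustration only, assuming 8-bit RGB input in HWC order and the mean = std = 0.5 values quoted above; `normalize_chw` is a hypothetical name, not the function in `preprocess.rs`.

```rust
/// Map 8-bit RGB pixels in HWC order to row-major (3, H, W) f32,
/// normalised with mean = std = 0.5 so values land in [-1, 1].
fn normalize_chw(rgb_hwc: &[u8], height: usize, width: usize) -> Vec<f32> {
    assert_eq!(rgb_hwc.len(), height * width * 3);
    let mut out = vec![0.0f32; 3 * height * width];
    for y in 0..height {
        for x in 0..width {
            for c in 0..3 {
                // Scale to [0, 1] first, then apply (v - mean) / std.
                let v = rgb_hwc[(y * width + x) * 3 + c] as f32 / 255.0;
                out[c * height * width + y * width + x] = (v - 0.5) / 0.5;
            }
        }
    }
    out
}
```

A byte value of 0 maps to -1.0 and 255 maps to 1.0, which is the kind of exact-value check the normalisation tests describe.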

CI gate: cargo fmt --check, cargo clippy --workspace --all-targets
-- -D warnings, cargo test --workspace (all 28 test groups ok,
zero failures). New test counts: +9 in preprocess, +5 in vision,
+1 in device_worker.
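
The Conv3d→Conv2d fold described under A3 can be sketched on a flat weight buffer. The sketch assumes a static image is presented as the same frame repeated along the temporal axis, in which case convolving with the temporally summed kernel gives the same result as the 3D convolution. `fold_temporal` and its flat row-major layout are illustrative assumptions, not the code in `vision.rs`.

```rust
/// Collapse a row-major (out, in, t, kh, kw) kernel to (out, in, kh, kw)
/// by summing over the temporal axis `t`.
fn fold_temporal(
    weight: &[f32],
    out_c: usize,
    in_c: usize,
    t: usize,
    kh: usize,
    kw: usize,
) -> Vec<f32> {
    assert_eq!(weight.len(), out_c * in_c * t * kh * kw);
    let plane = kh * kw;
    let mut folded = vec![0.0f32; out_c * in_c * plane];
    // `oi` walks the fused (out, in) index; each pair owns `t` planes.
    for oi in 0..out_c * in_c {
        for ti in 0..t {
            for p in 0..plane {
                folded[oi * plane + p] += weight[(oi * t + ti) * plane + p];
            }
        }
    }
    folded
}
```

For the published `(1152, 3, 2, 16, 16)` weight this would yield a `(1152, 3, 16, 16)` kernel suitable for a 2D convolution.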

Out of scope (deferred):
- LM-side splice of image embeddings at `<|image_pad|>` positions
  → Stage B.
- Streaming SSE for vision-bearing chat completions → Stage C.
- Reject `image_url` with HTTP 400 for non-vision models /
  advertise `capabilities` in /v1/models → Stage C.
- TP-vision (#12), 27B production deploy (#13), dynamic resolution
  (#14), numerical validation (#15).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 11:40:47 +03:00
5c520c7e90 feat(deploy): gitea workflow for rolling RPM deploys + host bootstrap
Replace operator-run script/deploy.sh with a CI-driven rolling deploy:

- .gitea/workflows/deploy.yml fires on build-prerelease success (and is
  re-runnable via workflow_dispatch). Cortex upgrades first on
  hanzalova.internal; the three neuron hosts upgrade in parallel under
  fail-fast: false so one failing host doesn't sink the rest.
  Concurrency-grouped to serialize overlapping deploys, never cancelling
  in-flight runs (a half-applied dnf transaction is worse than a stale
  deploy).

- asset/sudoers.d/{cortex,neuron}-host.conf are the canonical source for
  the scoped privileges gitea_ci needs on each host kind, installed as
  /etc/sudoers.d/helexa_gitea_ci. URLs and = signs are backslash-escaped
  per sudoers reserved-character rules.

- script/infra-setup.sh idempotently provisions the gitea_ci user,
  installs the runner pubkey, drops in the appropriate sudoers fragment
  with visudo verification, and syncs cortex.toml / models.toml /
  per-host asset/neuron/<short>.toml — config still ships from operator
  workstations rather than CI because the first two are gitignored.

The CI-only secret is RSYNC_SSH_KEY (already configured for the repo);
the matching pubkey is ~/.ssh/id_gitea_ci.pub on the operator's box.

script/deploy.sh and asset/manifest.yml are left in place until the
first end-to-end deploy workflow run succeeds, then removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 14:58:23 +03:00
d0292ed377 feat(cortex): catalogue source field + scheme-qualified /models/load
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Format (push) Successful in 40s
CI / Test (push) Failing after 1m3s
CI / Clippy (push) Successful in 2m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m13s
build-prerelease / Build neuron-ampere (push) Successful in 7m31s
build-prerelease / Build neuron-ada (push) Successful in 8m16s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m21s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Build cortex binary (push) Successful in 4m5s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Phase 3 of plan-source-aware-loader-preflight. Adds an optional
`source` field to `ModelProfile` and threads it through the
router's cold-load path so a profile pointing at the helexa
registry forwards `helexa:<id>` to neuron's `/models/load`
instead of leaving neuron to substitute its `default_source`
(typically `huggingface`).

Without this, an operator who declares
`source = "helexa"` in models.toml would still see neuron fetch
from HuggingFace — the catalogue → ModelSpec translation in
`profile_to_spec` was dropping the scheme on the floor.

What lands:

- `cortex-core::catalogue::ModelProfile.source: Option<String>`.
  None is the default and preserves pre-Phase-3 behaviour.
- `cortex-gateway::router::qualified_model_id(profile)` —
  small pure helper, extracted from `profile_to_spec` so it can
  be unit-tested. Empty-string `source` is treated as None so
  operators who blank out a previously-set value don't trip a
  scheme-with-no-scheme failure mode in neuron.
- `models.example.toml` documents the new field with a
  commented-out helexa-scheme example pointing back at
  neuron.example.toml's matching sources block.
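
A minimal sketch of the `qualified_model_id` behaviour described above. The `ModelProfile` struct here is a cut-down stand-in, and the `id` field name is an assumption; only the three cases listed in the tests (None, Some, empty string) are taken from this commit.

```rust
/// Hypothetical stand-in for the catalogue profile; only the fields used here.
struct ModelProfile {
    id: String,
    source: Option<String>,
}

/// Prefix the model id with `scheme:` when a non-empty source is set.
/// An empty-string source is treated like `None`.
fn qualified_model_id(profile: &ModelProfile) -> String {
    match profile.source.as_deref() {
        Some(scheme) if !scheme.is_empty() => format!("{scheme}:{}", profile.id),
        _ => profile.id.clone(),
    }
}
```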

Tests:

- 2 new unit tests in `cortex-core::catalogue`: source-absent
  round-trip and source-present round-trip through TOML.
- 3 new unit tests in `cortex-gateway::router`: pass-through
  when None, prefix when Some, pass-through on empty-string
  source.
- ModelProfile literal in catalogue's existing test updated to
  carry `source: None`.

CI gate: cargo fmt --check, cargo clippy --workspace
--all-targets -- -D warnings, cargo test --workspace
(24 test groups ok, zero failures).

Completes Phase 3. With Phases 1+2+3 landed:
- neuron parses `scheme:org/name`, routes per-source hf-hub
  Api with disambiguated cache.
- preflight returns structured errors before any device
  allocation.
- cortex catalogue declares per-model source jurisdiction
  and forwards it to neuron.

The registry itself (registry.helexa.ai service, MinIO,
nginx, mirror fabric) is the next moving piece — landing
under a separate project per the design discussion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 14:53:58 +03:00
d4e1b05956 feat(neuron,cortex-core): source-aware loader (scheme:org/name)
All checks were successful
CI / CUDA type-check (push) Successful in 46s
CI / Format (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 2m40s
build-prerelease / Build cortex binary (push) Successful in 4m23s
CI / Test (push) Successful in 5m28s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m39s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ampere (push) Successful in 7m53s
build-prerelease / Build neuron-ada (push) Successful in 5m18s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
Phase 1 of plan-source-aware-loader-preflight. Makes neuron's
loader treat `huggingface:org/name` and `helexa:org/name` as
first-class distinct sources with per-source endpoint + cache,
while staying backwards-compatible with bare `org/name` ids.
Zero behavior change for existing operator configs.

Motivation: helexa is adding an EU-hosted registry
(`registry.helexa.ai`) alongside HF. Both speak HF-compatible
wire format, but the bytes, jurisdiction, trust root, and cache
namespace are distinct. The loader needs to disambiguate which
registry serves a given model id, and to keep their caches from
colliding on disk when both happen to host the same `org/name`.

What lands:

- `cortex-core::source` — new module. `ModelSourceId { scheme,
  org, name }` with `FromStr` accepting both `scheme:org/name`
  and bare `org/name`. `Display` round-trips. `repo_path()`
  emits the `org/name` half for the hf-hub `Api::model(...)`
  call regardless of which scheme/endpoint we're hitting.
  Rejects malformed input with typed `ParseError` variants
  (empty scheme, missing slash, scheme with `/`, name with
  `:`, etc.).

- `neuron::config::CandleHarnessConfig` gains
  `default_source: Option<String>` and
  `sources: HashMap<String, SourceConfig>`. `SourceConfig`
  mirrors what `hf_hub::ApiBuilder` consumes: endpoint URL,
  optional `auth_env` (env var name read at startup so secrets
  stay out of TOML), and optional cache_dir. Defaults
  synthesise a `huggingface` entry pointing at
  `https://huggingface.co` with the legacy `hf_cache` field as
  its cache_dir — so existing configs that only set `hf_cache`
  keep working unchanged.

- `CandleHarness::new(bind_url, &CandleHarnessConfig)` replaces
  `CandleHarness::new(bind_url, hf_cache)`. Resolves every
  configured source's auth env var and cache dir up front so
  `hf_api_for(scheme)` is a pure HashMap lookup on the hot
  load path. Only the `huggingface` scheme gets the legacy
  `HF_HUB_CACHE`/`HF_HOME` env-var fallback chain; other
  schemes resolve to whatever the operator typed.

- `hf_api()` -> `hf_api_for(scheme)`. Builds an
  `hf_hub::Api` with the source's endpoint, cache_dir, and
  auth token. Errors with a useful message naming the
  configured schemes when an unknown scheme is requested.

- `CandleHarness::load_model` parses `spec.model_id` into a
  `ModelSourceId`, substitutes `default_source` for bare ids,
  and threads the parsed source through `preflight`,
  `resolve_files`, `resolve_dense_files`, `load_arch_gguf`,
  `load_arch_dense`, and `load_tp`. The hf-hub `Api::model()`
  call now uses `source_id.repo_path()` so registry calls hit
  the right URL shape regardless of scheme.

- `preflight()` signature gains a `&ModelSourceId` parameter
  (it's the canonical id for log lines and error display);
  `RepoFetchFailed.model_id` etc. now carry the
  scheme-qualified form so operator-visible errors echo
  exactly what was configured.

- `neuron.example.toml` documents the new
  `[harness.candle.sources.*]` table with commented-out
  examples for `huggingface` (explicit override) and `helexa`.
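
The parsing rules for `ModelSourceId` can be sketched roughly as below. This is an assumption-laden illustration: the real type's field types, the exact `ParseError` variant names, and the order of checks are not shown in this commit message, so everything beyond "accept `scheme:org/name` and bare `org/name`, reject the listed malformed shapes" is hypothetical.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, PartialEq)]
struct ModelSourceId {
    scheme: Option<String>, // None for a bare `org/name` id (assumed representation)
    org: String,
    name: String,
}

#[derive(Debug, PartialEq)]
enum ParseError {
    EmptyScheme,
    SchemeContainsSlash,
    MissingSlash,
    EmptyOrg,
    EmptyName,
    NameContainsColon,
}

impl FromStr for ModelSourceId {
    type Err = ParseError;

    fn from_str(s: &str) -> Result<Self, ParseError> {
        // Split off an optional `scheme:` prefix at the first colon.
        let (scheme, rest) = match s.split_once(':') {
            Some(("", _)) => return Err(ParseError::EmptyScheme),
            Some((sch, _)) if sch.contains('/') => return Err(ParseError::SchemeContainsSlash),
            Some((sch, rest)) => (Some(sch.to_string()), rest),
            None => (None, s),
        };
        let (org, name) = rest.split_once('/').ok_or(ParseError::MissingSlash)?;
        if org.is_empty() {
            return Err(ParseError::EmptyOrg);
        }
        if name.is_empty() {
            return Err(ParseError::EmptyName);
        }
        if name.contains(':') {
            return Err(ParseError::NameContainsColon);
        }
        Ok(ModelSourceId { scheme, org: org.to_string(), name: name.to_string() })
    }
}

impl ModelSourceId {
    /// The `org/name` half, independent of scheme.
    fn repo_path(&self) -> String {
        format!("{}/{}", self.org, self.name)
    }
}

impl fmt::Display for ModelSourceId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.scheme {
            Some(s) => write!(f, "{s}:{}/{}", self.org, self.name),
            None => write!(f, "{}/{}", self.org, self.name),
        }
    }
}
```

In this sketch the caller substitutes `default_source` when `scheme` is `None`, which keeps parsing free of configuration.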

Tests:

- 13 new unit tests in `cortex-core::source` covering parse /
  display round-trip, default-scheme substitution semantics,
  and every `ParseError` variant.
- 6 new unit tests in `neuron::config` covering the
  `effective_sources` synth (legacy `hf_cache` carry-through,
  explicit override preservation, helexa-alongside-huggingface)
  and `effective_default_source` fallback.
- 2 new unit tests in `harness::candle::tests` covering
  multi-scheme `hf_api_for` routing, including the
  "unknown scheme" error path naming configured schemes.
- Preflight integration tests updated to construct
  `ModelSourceId` and assert against the scheme-qualified
  error form.

CI gate: cargo fmt --check, cargo clippy --workspace
--all-targets -- -D warnings, cargo test --workspace (all 24
test groups ok, zero failures).

Out of scope (Phase 3):
- Cortex catalogue `source` field — independent of Phase 1+2,
  ships when the registry comes online.
- `helexa` source endpoint itself — separate project; this
  PR adds the client-side rails only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 13:42:11 +03:00
61adff347a feat(neuron): preflight placement check with structured errors
Some checks failed
CI / CUDA type-check (push) Successful in 31s
CI / Format (push) Successful in 30s
build-prerelease / Resolve version stamps (push) Successful in 48s
CI / Test (push) Failing after 1m10s
CI / Clippy (push) Successful in 2m49s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m25s
build-prerelease / Build neuron-blackwell (push) Successful in 5m53s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 8m0s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
Phase 2 of plan-source-aware-loader-preflight. Adds a one-RTT
placement feasibility check that runs before any device allocation,
NCCL handshake, or weight fetch. Replaces today's opaque
"fetch config.json … 404" failure mode (when an operator points
`tensor_parallel = 2` at a GGUF-only repo) with a structured
error that names the failure class and points at the fix.

What lands:

- `crates/neuron/src/harness/preflight.rs` — new module. Classifies
  a repo's siblings listing into `SourceFormat` (Gguf | DenseSafetensors
  | Mixed | Empty), applies the tp/quant feasibility table, returns a
  `PlacementPlan` on success or a typed `PreflightError` on rejection.
  `PreflightError` is `serde::Serialize` so the HTTP layer can emit
  the structured shape verbatim; it's `thiserror::Error` so log lines
  get a single-line Display when downcasting from anyhow. Includes
  best-effort Levenshtein-nearest suggestion for malformed quant names
  (the second sharp edge the HauhauCS scenario surfaced — operator
  writes `q6k` against filenames containing `Q6_K_P`, and today's
  matcher just says "no GGUF file matching quant").
- `CandleHarness::load_model` — calls `preflight(...)` first thing
  after the "already loaded" guard, before any `ensure_device_worker`
  or `resolve_*`. Failure wraps the typed error in `anyhow::Error` so
  the existing trait surface is unchanged; the HTTP handler and the
  startup logger downcast to recover the structured form.
- `crates/neuron/src/api.rs::load_model` handler — maps `PreflightError`
  to 422 Unprocessable Entity with `{"error": {"kind": "...",
  "model_id": "...", "suggestion": "..." }}`. Other failures keep
  the existing 400 + free-form `format!("{e:#}")` shape.
- `crates/neuron/src/startup.rs::load_default_models` — when the
  failure is a preflight rejection, log as `reason=<kind> detail=<msg>`
  instead of the opaque `error=<chain>`, so journalctl on beast will
  now show `reason=tp_requires_safetensors detail="repo is GGUF-only
  (8 .gguf files); TP requires dense safetensors..."` instead of
  `error=fetch config.json from HauhauCS/...: 404 Not Found`.
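
The classification and feasibility check can be sketched as below. This is a simplified illustration, not the contents of `preflight.rs`: it classifies by file extension only, ignores quant matching, and uses a hypothetical `NoUsableWeights` variant for the empty case. `SourceFormat` and `TpRequiresSafetensors` are the names used in this commit.

```rust
#[derive(Debug, PartialEq)]
enum SourceFormat {
    Gguf,
    DenseSafetensors,
    Mixed,
    Empty,
}

/// Classify a repo's siblings listing by the weight file extensions present.
fn classify(siblings: &[&str]) -> SourceFormat {
    let has_gguf = siblings.iter().any(|f| f.ends_with(".gguf"));
    let has_safetensors = siblings.iter().any(|f| f.ends_with(".safetensors"));
    match (has_gguf, has_safetensors) {
        (true, true) => SourceFormat::Mixed,
        (true, false) => SourceFormat::Gguf,
        (false, true) => SourceFormat::DenseSafetensors,
        (false, false) => SourceFormat::Empty,
    }
}

#[derive(Debug, PartialEq)]
enum PreflightError {
    TpRequiresSafetensors,
    NoUsableWeights, // hypothetical name for the "empty rejected" case
}

/// Reject combinations the loader cannot serve before touching any device.
fn check_placement(format: &SourceFormat, tensor_parallel: usize) -> Result<(), PreflightError> {
    match format {
        SourceFormat::Empty => Err(PreflightError::NoUsableWeights),
        SourceFormat::Gguf if tensor_parallel > 1 => Err(PreflightError::TpRequiresSafetensors),
        // Dense and mixed repos can use safetensors, so TP is feasible.
        _ => Ok(()),
    }
}
```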

Tests:

- 18 unit tests in `harness/preflight.rs` covering classifier,
  quant matching, Levenshtein, error serialization, and the full
  feasibility table (gguf+tp rejected, gguf+bad-quant suggests
  nearest, gguf+good-quant ok, dense+tp ok, empty rejected, mixed
  prefers safetensors).
- 7 integration tests in `tests/preflight.rs` exercising the
  network path through an axum mock that serves hf-hub-compatible
  `/api/models/{org}/{name}/revision/main` payloads. Adds `tempfile`
  as a dev-dependency for per-test cache dirs.
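
The nearest-quant suggestion can be sketched with a textbook Levenshtein distance. The case-insensitive comparison and the `nearest_quant` name are assumptions for illustration; the commit only states that the suggestion is a best-effort Levenshtein-nearest match.

```rust
/// Classic two-row Levenshtein edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, deletion, insertion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Pick the available quant name closest to what the operator typed.
/// Ties resolve to the earliest candidate.
fn nearest_quant<'a>(wanted: &str, available: &[&'a str]) -> Option<&'a str> {
    let wanted = wanted.to_ascii_lowercase();
    available
        .iter()
        .copied()
        .min_by_key(|cand| levenshtein(&wanted, &cand.to_ascii_lowercase()))
}
```

Given an operator-typed `q6k` and candidates `Q4_K_M`, `Q6_K_P`, `Q5_K_S`, the sketch suggests `Q6_K_P` (distance 3 versus 4 for the others).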

Out of scope (deferred to subsequent phases):

- Phase 1 (source-aware loader plumbing — `scheme:org/name` parsing,
  per-scheme `SourceConfig`, cache disambiguation). Preflight runs
  against the single configured HuggingFace source today; the scheme
  threading lands cleanly when Phase 1 ships.
- Phase 3 (cortex catalogue source field).
- GGUF tensor-parallel loading. Preflight rejects this combination
  with `TpRequiresSafetensors`; the underlying loader gap is the
  separate `Helexa` curated-registry / heretic-rs conversation.

Refs #4-#9 architectural follow-up; no specific issue closed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 13:24:30 +03:00
0af8c8d6e7 chore(ci): enable colored logs for readability 2026-06-01 09:06:28 +03:00
435fd10902 fix(neuron): macro-ify CUDA single-GPU route_token so DecodeStream type stays inferred
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 29s
CI / Clippy (push) Successful in 2m47s
build-prerelease / Build cortex binary (push) Successful in 4m27s
CI / Test (push) Successful in 5m40s
build-prerelease / Build neuron-blackwell (push) Successful in 5m47s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 8m30s
build-prerelease / Build neuron-ada (push) Successful in 5m39s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m11s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 4m1s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
Prerelease build (run 270) failed on commit cb30383 with:

  error[E0107]: struct takes 5 generic arguments but 0 generic
    arguments were supplied
       --> crates/neuron/src/harness/candle.rs:3554:41
        |
   3554 |     decode_stream: &mut tokenizers::DecodeStream<'_>,
        |                                     ^^^^^^^^^^^^

The Step-2-era refactor for #6's tool-call extraction added a
nested `async fn route_token` inside `stream_inference_via_worker`
that named `tokenizers::DecodeStream<'_>` as a parameter type.
`DecodeStream` actually takes a lifetime plus five generic type
parameters (`'tok, M, N, PT, PP, D`), which makes naming it
explicitly painful. The working approach the CPU path uses is a
macro, where the body expands inline at the call site and the
decoder type stays inferred.

This commit replicates the CPU-side macro for the CUDA worker
path. Same shape, just with `.await` calls inside (macros tolerate
that since they expand inline into the enclosing async context).
Control flow uses a labelled-block + `consumer_alive` flag rather
than `return` so the macro stays generic over the surrounding
return type.
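
The pattern can be sketched in a synchronous, std-only form. Everything here is illustrative: `Decoder`, the `route_token!` argument list, and the channel type are hypothetical, and the real macro contains `.await` calls. The point is only that the macro never names the decoder's type (here a closure type that cannot be written out) and uses a labelled block plus a `consumer_alive` flag instead of `return`.

```rust
use std::sync::mpsc;

/// Stand-in for a decoder whose full type is awkward to spell out.
struct Decoder<F: FnMut(u32) -> Option<String>> {
    step_fn: F,
}

impl<F: FnMut(u32) -> Option<String>> Decoder<F> {
    fn step(&mut self, token: u32) -> Option<String> {
        (self.step_fn)(token)
    }
}

/// Expands inline, so the decoder type is inferred at the call site.
/// Early exit uses a labelled block; a dead consumer flips the flag.
macro_rules! route_token {
    ($decoder:expr, $token:expr, $tx:expr, $consumer_alive:ident) => {
        'route: {
            let Some(text) = $decoder.step($token) else {
                break 'route; // nothing decodable yet for this token
            };
            if $tx.send(text).is_err() {
                $consumer_alive = false; // receiver hung up
            }
        }
    };
}

/// Drive a token list through the macro; returns whether the consumer
/// was still listening at the end.
fn stream_tokens(tokens: &[u32], tx: mpsc::Sender<String>) -> bool {
    let mut decoder = Decoder {
        step_fn: |t: u32| {
            if t % 2 == 0 {
                Some(format!("tok{t}"))
            } else {
                None
            }
        },
    };
    let mut consumer_alive = true;
    for &token in tokens {
        route_token!(decoder, token, tx, consumer_alive);
        if !consumer_alive {
            break;
        }
    }
    consumer_alive
}
```

Because the macro body never returns, the same expansion works in functions with different return types; the caller decides what to do once `consumer_alive` turns false.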

The CPU build (default-feature workspace, what the `clippy` and
`test` jobs exercise) doesn't compile this `#[cfg(feature = "cuda")]`
branch, which is why local CI green-lit it. The cuda-check job
should catch this category of breakage now that cb30383 and the
CI fix have both landed; this commit just resolves the actual
breakage on the prerelease workflow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 08:59:56 +03:00
cb303832bc feat(neuron): render the model's chat_template with chat_template_kwargs
Some checks failed
CI / CUDA type-check (push) Failing after 58s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 40s
build-prerelease / Build neuron-ampere (push) Failing after 1s
CI / Clippy (push) Successful in 2m37s
build-prerelease / Build cortex binary (push) Successful in 4m47s
CI / Test (push) Successful in 6m13s
build-prerelease / Build neuron-blackwell (push) Failing after 5m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ada (push) Failing after 7m20s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Closes #9.

Replaces the hardcoded `format_qwen3_prompt` ChatML glue with
`minijinja`-driven rendering of the model's own `chat_template`
from `tokenizer_config.json`. The request's `chat_template_kwargs`
flow into the Jinja context so model-specific levers
(Qwen3's `enable_thinking: false`, etc.) actually take effect.

## Implementation

- New `harness::chat_template` module with three entry points:
  - `load_chat_template_alongside(tokenizer_json_path)` — probes
    `tokenizer_config.json` in the same hf-hub snapshot directory.
    Supports both the canonical string-form `chat_template` and
    the array-form some tokenizers ship (multi-template models).
  - `render_chat_template(template, messages, tools, kwargs)` —
    renders via `minijinja`. Messages flatten into the
    `[{role, content}]` shape HF templates iterate, with
    per-message extras (`tool_calls`, `tool_call_id`) preserved.
    `tools` and `kwargs` add into the Jinja context so templates
    that reference them work without us interpreting their shape.
  - `chat_templates_enabled()` reads `NEURON_USE_CHAT_TEMPLATE`
    (default true). Falsy values force the fallback path
    everywhere — a kill switch for emergency rollback without a
    rebuild.

- `LoadedModel.chat_template: Option<String>` and the TP
  equivalent are populated once at load time. `None` (no
  tokenizer_config.json, parse error, missing field) quietly
  routes to the fallback path; each condition is logged through
  `tracing::debug`/`warn`.

- New `build_prompt_for_request(chat_template, request)` wraps
  the decision: when both the template is present AND the kill
  switch is off, render with kwargs from `request.extra` (looks
  up `chat_template_kwargs` and `tools` lazily). On render error
  → warn + fallback to `format_qwen3_prompt`. Wired into all four
  current prompt-build sites (single-GPU stream + non-stream, TP
  stream + non-stream).
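
The kill switch and the prompt-build decision can be sketched as two small pure functions. This is an illustration under assumptions: the set of falsy spellings is a guess, and `build_prompt` takes the renderer and legacy formatter as closures rather than the real `build_prompt_for_request(chat_template, request)` signature.

```rust
/// Interpret the raw kill-switch value; unset means enabled.
/// The exact set of falsy spellings is an assumption.
fn chat_templates_enabled(raw: Option<&str>) -> bool {
    match raw.map(|v| v.trim().to_ascii_lowercase()) {
        Some(v) => !matches!(v.as_str(), "0" | "false" | "no" | "off"),
        None => true,
    }
}

/// Render with the model's template when one is present and templates are
/// enabled; otherwise, or on render error, use the legacy formatter.
fn build_prompt(
    template: Option<&str>,
    enabled: bool,
    render: impl Fn(&str) -> Result<String, String>,
    fallback: impl Fn() -> String,
) -> String {
    match template {
        Some(t) if enabled => render(t).unwrap_or_else(|_err| fallback()),
        _ => fallback(),
    }
}
```

In a service this would be fed from the environment, for example `chat_templates_enabled(std::env::var("NEURON_USE_CHAT_TEMPLATE").ok().as_deref())`. Keeping the parse separate from the environment read makes the truthy / falsy / unset matrix easy to unit-test.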

## Dependency

`minijinja = "2"` with the `builtins`, `json`, and `serde`
features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used
internally by HF's `tokenizers-rs` for its own chat templating;
the API surface we touch (`Environment::add_template` +
`Template::render(serde_value)`) is stable.

## Validation strategy

I can't byte-compare the new path's output against
`format_qwen3_prompt` for live models without a GPU (CI doesn't
have one). The fallback path and kill switch are the mitigations:
a deploy can set `NEURON_USE_CHAT_TEMPLATE=false` in the neuron
service env if the chat template renders surprisingly on
Qwen3-8B in production. The legacy formatter stays in place as
the fallback.

## Scope cuts (documented in module header)

- Tool-definition lifting from helexa-acp's system-prompt
  injection into the chat_template's native tools block is
  deferred. Today the request's `tools` array threads into the
  Jinja context, but helexa-acp continues to inject Hermes-format
  tool descriptions into the system prompt for backwards-compat
  with non-cortex endpoints.

## Tests

9 unit tests in `chat_template`: kill-switch matrix (truthy /
falsy / unset), template loading (string form, array form,
missing file, unparseable JSON, missing field), rendering
(basic conversation threading, kwargs forwarding, message-extras
threading for tool_calls).

215 workspace tests pass; clippy + fmt clean with the workspace's
default features.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 23:43:11 +03:00
44008358c5 feat(neuron): emit response.in_progress between created and output_item.added
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Format (push) Successful in 44s
CI / Test (push) Failing after 1m5s
CI / Clippy (push) Successful in 2m36s
CI / CUDA type-check (push) Failing after 52s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m32s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-blackwell (push) Failing after 5m42s
build-prerelease / Build neuron-ampere (push) Failing after 7m14s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
Refs #7.

OpenAI's Responses API spec emits `response.in_progress` between
`response.created` and the first output-item event to mark
"request validated, model is generating". Some Responses-API
clients distinguish loading-spinner vs streaming-spinner UI based
on which event arrived last; emitting both keeps the wire shape
matched.

Carries the same shell as `response.created` (status=in_progress,
empty output, no usage yet) — both events are payload-light
bookkeeping, distinguished only by the event name.

The hosted-tool event families remaining in #7 (web_search_call,
code_interpreter_call, file_search_call, image_generation_call)
stay deferred until the underlying tools exist in neuron.

Updated `full_stream_emits_expected_event_sequence` to assert the
new event lands in position 1; downstream indexing shifted by one
across the existing test assertions. CI green, fmt + clippy clean.
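
The opening of the event stream can be sketched as data. The `ResponseShell` struct and `opening_events` function are hypothetical and only model what the message above states: both events carry the same shell (status in_progress, empty output, no usage) and differ only by event name.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ResponseShell {
    status: &'static str,
    output: Vec<String>,
    usage: Option<u32>,
}

/// The two bookkeeping events emitted before the first output item,
/// in emission order.
fn opening_events() -> Vec<(&'static str, ResponseShell)> {
    let shell = ResponseShell {
        status: "in_progress",
        output: Vec::new(),
        usage: None,
    };
    vec![
        ("response.created", shell.clone()),
        ("response.in_progress", shell),
    ]
}
```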

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 23:30:34 +03:00
2f387f33f8 ci: export CUDA paths in cuda-check so cudarc build.rs finds nvcc
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 41s
CI / Clippy (push) Failing after 1m7s
CI / Test (push) Failing after 56s
build-prerelease / Build neuron-blackwell (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / CUDA type-check (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
act launches step shells without sourcing /etc/profile, so the
gitea_runner user's PATH lacks /usr/local/cuda-13.0/bin. cudarc's
build.rs panics with ENOENT on `nvcc --version` under the neuron
crate's cuda-version-from-build-system feature. build-prerelease.yml
already does this export — mirror it here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 23:28:04 +03:00
fc9a8c42a3 feat(neuron): extract <tool_call> blocks to structured tool_calls deltas
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
CI / CUDA type-check (push) Failing after 17s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 32s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Closes #6.

Same model-agnostic seam as #8 but for tool-call markers
(`<tool_call>` / `</tool_call>` on Qwen3-Coder, Hermes-format,
DeepSeek-Coder, gpt-oss, …). Lets Zed's tool-use feature and any
other vanilla OpenAI chat client get structured `tool_calls` deltas
out of cortex without having to parse markers themselves.

## Implementation

1. **Tokenizer probe at load time** (`detect_tool_call_token_pair`
   in `wire::event`) — same shape as the reasoning-marker probe
   from #8. Both open AND close must resolve to single token ids;
   non-tool-use models get `None` and pass through unchanged.
   Stored on `LoadedModel.tool_call_tokens` and the TP analogue.

2. **New `InferenceEvent::ToolCall` variant** — carries `index`
   (call slot, per-turn counter), generated `id` (`call_<hex>_<idx>`),
   `name`, and the complete `arguments` JSON string. One event per
   parsed call.

3. **Token-level state machine** in all three streaming paths
   (CPU `run_inference_streaming`, CUDA single-GPU
   `stream_inference_via_worker`, CUDA TP `chat_completion_tp_stream`)
   layered on top of #8's reasoning routing:
   - `<tool_call>` token → enter buffering state, clear buffer.
   - Tokens while buffering → accumulate into `tool_call_buf`
     via the decoder (so multi-byte UTF-8 still buffers correctly)
     without emitting anything visible.
   - `</tool_call>` token → take the buffer, parse with
     `parse_tool_call_body` (extract `name` + `arguments`),
     emit a structured `ToolCall` event with a fresh `call_<hex>`
     id and the parsed fields.
   - On parse failure → fall back to re-emitting the original
     `<tool_call>{buf}</tool_call>` block as plain text content
     so helexa-acp's existing `ToolCallParser` repair passes still
     have a chance to recover the call.

4. **OpenAI chat projector** emits the OpenAI streaming
   `tool_calls` delta shape on `InferenceEvent::ToolCall` —
   `{tool_calls: [{index, id, type:"function",
   function:{name, arguments}}]}`. One chunk per call slot.

5. **OpenAI Responses projector** drops `ToolCall` events for
   now (Responses-side function_call event family routing tracked
   under #7); the chat path is what unblocks Zed's tool use today.
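
The buffering state machine from step 3 can be sketched roughly as below. This is a simplified illustration, not the code in the commit: `ToolCallRouter`, `Event`, and `toy_parse` are hypothetical names, the `parse` argument stands in for `parse_tool_call_body`, and id generation (`call_<hex>_<idx>`) is left out.

```rust
/// Simplified stand-in for the relevant `InferenceEvent` variants.
#[derive(Debug, PartialEq)]
pub enum Event {
    Text(String),
    ToolCall { index: usize, name: String, arguments: String },
}

/// Hypothetical token-level router. `open` / `close` are the single
/// token ids resolved by the load-time tokenizer probe.
pub struct ToolCallRouter {
    open: u32,
    close: u32,
    buffering: bool,
    buf: String,
    next_index: usize,
}

impl ToolCallRouter {
    pub fn new(open: u32, close: u32) -> Self {
        Self { open, close, buffering: false, buf: String::new(), next_index: 0 }
    }

    /// Feed one sampled token together with its decoded text.
    pub fn push(
        &mut self,
        token: u32,
        text: &str,
        parse: fn(&str) -> Option<(String, String)>,
    ) -> Option<Event> {
        if token == self.open {
            // Open marker: start buffering, emit nothing visible.
            self.buffering = true;
            self.buf.clear();
            return None;
        }
        if token == self.close && self.buffering {
            self.buffering = false;
            let body = std::mem::take(&mut self.buf);
            return Some(match parse(&body) {
                Some((name, arguments)) => {
                    let index = self.next_index;
                    self.next_index += 1;
                    Event::ToolCall { index, name, arguments }
                }
                // Parse failure: re-emit the original block as plain text.
                None => Event::Text(format!("<tool_call>{body}</tool_call>")),
            });
        }
        if self.buffering {
            self.buf.push_str(text);
            None
        } else {
            Some(Event::Text(text.to_string()))
        }
    }
}

/// Toy parser for this example only: expects `name|arguments`.
pub fn toy_parse(body: &str) -> Option<(String, String)> {
    let (name, args) = body.split_once('|')?;
    Some((name.to_string(), args.to_string()))
}
```

The fallback branch is what keeps a downstream repair pass viable: a malformed body reaches the client exactly as the model produced it.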

## Acceptance

- Vanilla OpenAI chat clients (Zed's tool-use feature, any other
  OpenAI-compatible tool-call consumer) get structured tool_calls
  deltas against cortex+neuron without having to parse `<tool_call>`
  markers in content.
- helexa-acp continues to work — when neuron parses cleanly, it
  consumes the structured deltas through its existing decoder.
  When the model emits malformed JSON, neuron falls back to text
  pass-through and helexa-acp's `ToolCallParser` recovers via the
  same path it always did.
- Models without tool-call markers in their tokenizer pass through
  unchanged.
- No hardcoded model knowledge — entirely driven by tokenizer
  metadata.
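
For reference, the delta object that a vanilla OpenAI chat client would see for one call slot could be built along these lines. This is a std-only sketch under stated assumptions: `tool_call_delta` and `json_escape` are hypothetical helpers, and a real projector would more likely serialise typed structs than format strings by hand.

```rust
/// Escape a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the `tool_calls` delta object for a single call slot.
/// `arguments` is the complete arguments JSON, carried as a string.
pub fn tool_call_delta(index: usize, id: &str, name: &str, arguments: &str) -> String {
    format!(
        r#"{{"tool_calls":[{{"index":{},"id":"{}","type":"function","function":{{"name":"{}","arguments":"{}"}}}}]}}"#,
        index,
        json_escape(id),
        json_escape(name),
        json_escape(arguments)
    )
}
```

Note that `arguments` stays a JSON-encoded string inside the delta, which is why the inner quotes are escaped.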

## Tests

2 new detection tests in `wire::event` (Qwen3-style marker
detection, no-marker case). The streaming paths themselves stay
covered by the existing chat-completions integration tests; full
end-to-end exercise of the new path requires GPU-loaded models
and lives outside the CI test surface.

215 workspace tests pass; clippy + fmt clean across the
workspace.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 23:26:31 +03:00
7733eecba5 feat(neuron): strip reasoning from chat completions by default
Some checks failed
CI / CUDA type-check (push) Failing after 18s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 32s
CI / Clippy (push) Successful in 2m36s
build-prerelease / Build cortex binary (push) Successful in 4m29s
CI / Test (push) Successful in 5m19s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m56s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 7m45s
build-prerelease / Build neuron-ada (push) Successful in 5m24s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
Closes #8.

Reasoning-capable models (Qwen3, DeepSeek-R1, gpt-oss, Mistral
Magistral, …) emit `<think>...</think>` blocks inline in their
content stream. The chat-completions wire format has no slot for
reasoning, so until this change every consumer either parsed the
markers themselves (helexa-acp) or wrote the raw scratchpad
content into their UI (Zed's commit-message generator — visible
as the leaked reasoning block on every generated commit message
against benjy's Qwen3-8B).

## Implementation, model-agnostic by design

The neuron side now does token-level routing without any
hardcoded model knowledge:

1. **At load time** (`detect_reasoning_token_pair` in
   `wire::event`), probe the tokenizer's vocabulary for a known
   reasoning-marker pair: `<think>` / `</think>` (Qwen3,
   DeepSeek-R1, gpt-oss), `[THINK]` / `[/THINK]` (Mistral
   Magistral), and a couple of derivatives. Each marker must
   resolve to a single token id; if both open and close resolve,
   stash on `LoadedModel.reasoning_tokens` (similarly
   `TpLoadedModel`). Non-reasoning models get `None` and pass
   through unchanged.

2. **At inference time**, the three streaming paths
   (`run_inference_streaming` CPU, `stream_inference_via_worker`
   CUDA single-GPU, `chat_completion_tp_stream` CUDA TP) now
   check each sampled token against the pair via the new
   `handle_reasoning_marker` helper before feeding it to the
   detokeniser. Open marker → set `in_reasoning = true`, drop
   the marker. Close marker → unset, drop. Other tokens go
   through `emit_delta(_blocking)` which now picks
   `ReasoningDelta` or `TextDelta` based on state. Markers
   never appear in the streamed output.

3. **In `wire::openai_chat`**, the projector splits into:
   - `project_chat_stream` (unchanged signature; default
     behaviour — drops `ReasoningDelta`)
   - `project_chat_stream_with(rx, …, ChatProjectionConfig)` —
     when `include_thinking: true` and `reasoning_markers:
     Some(_)`, re-wraps reasoning content with the literal
     open/close marker text and emits as content deltas.
     Preserves the on-the-wire shape that helexa-acp's
     `ThinkParser` expects.

4. **HTTP handler** reads `x-include-thinking: true` (case-
   insensitive `1`/`true`/`yes`) from the request headers and
   threads it into the projection config. cortex-gateway already
   forwards arbitrary headers verbatim, so the opt-in works
   end-to-end without gateway changes.

5. **helexa-acp's `openai_chat` provider** sets
   `x-include-thinking: true` on every request so its existing
   `ThinkParser` keeps receiving the marked content stream.
   `ThinkParser` itself is unchanged — needed for endpoints that
   aren't reasoning-aware (OpenRouter, OpenAI directly, etc.).
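
A minimal sketch of the routing described in steps 2 and 3, assuming a simplified `Delta` type in place of `InferenceEvent`. `ReasoningRouter`, `route`, and `project_default` are hypothetical names; only the behaviour (markers dropped, reasoning routed separately, reasoning dropped by default) follows the description above.

```rust
#[derive(Debug, PartialEq)]
pub enum Delta {
    Reasoning(String),
    Text(String),
}

/// Hypothetical router; `pair` is the (open, close) token-id pair from
/// the load-time probe, or `None` for non-reasoning models.
pub struct ReasoningRouter {
    pair: Option<(u32, u32)>,
    in_reasoning: bool,
}

impl ReasoningRouter {
    pub fn new(pair: Option<(u32, u32)>) -> Self {
        Self { pair, in_reasoning: false }
    }

    /// Returns true when the token is a marker and must be dropped.
    pub fn handle_marker(&mut self, token: u32) -> bool {
        match self.pair {
            Some((open, _)) if token == open => {
                self.in_reasoning = true;
                true
            }
            Some((_, close)) if token == close => {
                self.in_reasoning = false;
                true
            }
            _ => false,
        }
    }

    /// Route one sampled token; markers never reach the output.
    pub fn route(&mut self, token: u32, text: &str) -> Option<Delta> {
        if self.handle_marker(token) {
            return None;
        }
        Some(if self.in_reasoning {
            Delta::Reasoning(text.to_string())
        } else {
            Delta::Text(text.to_string())
        })
    }
}

/// Default chat projection: keep text, drop reasoning.
pub fn project_default(deltas: Vec<Delta>) -> String {
    deltas
        .into_iter()
        .filter_map(|d| match d {
            Delta::Text(t) => Some(t),
            Delta::Reasoning(_) => None,
        })
        .collect()
}
```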

## Acceptance

- Zed's commit-message generator (vanilla chat-completions
  client, no `x-include-thinking`) gets clean commit messages
  with no `<think>` block.
- helexa-acp sessions continue to render thinking in Zed's
  thought UI via the opt-in path.
- Models without reasoning tokens declared in their tokenizer
  pass through unchanged.
- Implementation contains zero references to "qwen3" or any
  specific model — entirely driven by tokenizer metadata.
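
The opt-in check behind the `x-include-thinking` header might look like the following. The function name is hypothetical and trimming surrounding whitespace is an assumption; the accepted values (`1`, `true`, `yes`, case-insensitive) come from the description above.

```rust
/// Decide whether reasoning should be re-wrapped into the content stream.
/// `header_value` is the raw `x-include-thinking` value, if the header was sent.
pub fn include_thinking(header_value: Option<&str>) -> bool {
    match header_value {
        Some(v) => {
            let v = v.trim(); // assumption: tolerate surrounding whitespace
            v == "1" || v.eq_ignore_ascii_case("true") || v.eq_ignore_ascii_case("yes")
        }
        // Header absent: vanilla clients get the default (reasoning dropped).
        None => false,
    }
}
```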

## Tests

9 new tests in `wire::event` (token-pair detection across 4
marker conventions, edge cases) and `wire::openai_chat` (default
drop, opt-in re-wrap with multi-chunk reasoning, close-marker on
Finish, fallback when markers absent, off-switch with markers
present). All 213 workspace tests pass; fmt + clippy clean.
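
The detection logic these tests exercise can be approximated as below. This sketch assumes the tokenizer vocabulary is available as a plain map from token text to id, so a map hit stands in for "resolves to a single token id"; `detect_pair` and `CANDIDATES` are hypothetical names, and only two of the marker conventions are listed.

```rust
use std::collections::HashMap;

/// Known reasoning-marker conventions, in probe order (subset for illustration).
const CANDIDATES: &[(&str, &str)] = &[("<think>", "</think>"), ("[THINK]", "[/THINK]")];

/// Return the (open, close) token ids of the first convention for which
/// BOTH markers exist in the vocabulary; `None` for non-reasoning models.
pub fn detect_pair(vocab: &HashMap<String, u32>) -> Option<(u32, u32)> {
    CANDIDATES.iter().find_map(|(open, close)| {
        let o = vocab.get(*open)?;
        let c = vocab.get(*close)?;
        Some((*o, *c))
    })
}
```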

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 17:55:04 +03:00
fdc0adb738 docs(helexa-acp): README + example config for end-user onboarding
Some checks failed
CI / CUDA type-check (push) Failing after 18s
CI / Format (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Clippy (push) Successful in 2m36s
build-prerelease / Build cortex binary (push) Successful in 4m13s
CI / Test (push) Successful in 5m6s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m40s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ampere (push) Successful in 7m53s
build-prerelease / Build neuron-ada (push) Successful in 5m12s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m4s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Stage 7. Walks a new user from "never heard of helexa-acp" to
"chatting via Zed against helexa or a public API in 10 minutes":

- crates/helexa-acp/README.md — install (from source / COPR),
  quick-start env-var path, multi-endpoint TOML, full Zed setup,
  endpoint cookbook (cortex/neuron, OpenAI, Anthropic, OpenRouter,
  LM Studio, multi-cortex), three session modes (Default / Bypass /
  Plan) with their tool tables, tool surface + path-handling rules,
  session resume, context compaction, troubleshooting for the
  five failure modes a new user is likely to hit, and architecture
  reference for contributors.

- helexa-acp.example.toml — copy-paste-and-edit starter config at
  the repo root, mirroring the existing cortex.example.toml /
  neuron.example.toml pattern.

No code changes. fmt + clippy clean as a sanity check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 14:25:56 +03:00
8fa1d1962e feat(helexa-acp): anthropic-messages provider
Some checks failed
CI / CUDA type-check (push) Failing after 18s
CI / Format (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Test (push) Failing after 59s
CI / Clippy (push) Successful in 2m28s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m17s
build-prerelease / Build neuron-blackwell (push) Successful in 5m32s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 7m50s
build-prerelease / Build neuron-ada (push) Successful in 5m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
Stage 6b. Third provider impl, completing the wire-format trio
(openai-chat, openai-responses, anthropic-messages). Lets a
helexa-acp endpoint configured with `wire_api = "anthropic-messages"`
drive Claude models — either against Anthropic directly or via
cortex's /v1/messages translation surface.

## Encoder (CompletionRequest → Anthropic body)

- System messages flatten to the top-level `system` field
  (concatenated with blank lines when there are multiple).
- User text → `{role:"user", content:"..."}`.
- User MultiPart (text + images) → `content` array with Anthropic's
  distinct image shape: `{type:"image", source:{type:"base64",
  media_type, data}}` — structurally different from OpenAI's
  `image_url` data URI.
- Assistant text → `{role:"assistant", content:"..."}`.
- Assistant tool_calls → `content` array with optional `{type:"text"}`
  block plus one `{type:"tool_use", id, name, input:<parsed json>}`
  per call. The internal arguments JSON string is parsed back to a
  Value before encoding (Anthropic requires the parsed form);
  malformed JSON falls back to a String input so the request body
  still serialises.
- Tool result → `{role:"user", content:[{type:"tool_result",
  tool_use_id, content}]}` per Anthropic's convention (no separate
  `tool` role).
- `max_tokens` is required by Anthropic; defaults to 8192 when the
  request doesn't specify.
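
Two of the encoder rules above, system-message flattening and the `max_tokens` default, in sketch form. The `Message` and `Role` types here are simplified, hypothetical stand-ins for the crate's own request types.

```rust
pub enum Role {
    System,
    User,
    Assistant,
}

pub struct Message {
    pub role: Role,
    pub text: String,
}

/// Default applied when the request does not specify `max_tokens`.
pub const DEFAULT_MAX_TOKENS: u32 = 8192;

/// Split out system messages and join them with blank lines for the
/// top-level `system` field; all other messages keep their order.
pub fn flatten_system(messages: &[Message]) -> (Option<String>, Vec<&Message>) {
    let mut system = Vec::new();
    let mut rest = Vec::new();
    for m in messages {
        match m.role {
            Role::System => system.push(m.text.as_str()),
            _ => rest.push(m),
        }
    }
    let system = if system.is_empty() { None } else { Some(system.join("\n\n")) };
    (system, rest)
}

/// Anthropic requires `max_tokens`, so fall back to the default.
pub fn effective_max_tokens(requested: Option<u32>) -> u32 {
    requested.unwrap_or(DEFAULT_MAX_TOKENS)
}
```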

## Decoder (Anthropic SSE → CompletionEvent)

Named SSE events:

- `message_start` → captures input_tokens from `usage` for the
  eventual UsageStats.
- `content_block_start` (type=text) → TextDelta (initial text, if any).
- `content_block_start` (type=tool_use) → ToolCallStart; if a
  pre-buffered `input` is present, also emits a single
  ToolCallArgsDelta.
- `content_block_start` (type=thinking, for extended-thinking
  models) → ReasoningDelta.
- `content_block_delta` (text_delta) → TextDelta.
- `content_block_delta` (input_json_delta) → ToolCallArgsDelta,
  correlated by block index.
- `content_block_delta` (thinking_delta) → ReasoningDelta.
- `message_delta` → Usage (final output_tokens) + Finish with
  stop_reason mapped: end_turn/stop_sequence → "stop", max_tokens
  → "length", tool_use → "tool_calls".
- `message_stop` → stream terminates.
- `ping` ignored (Anthropic's keep-alive).
- `error` → yields Err and ends the stream.
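
The `stop_reason` mapping on `message_delta` is small enough to show in full. The function name is hypothetical, and passing unknown reasons through unchanged is an assumption not stated above.

```rust
/// Map an Anthropic `stop_reason` to the OpenAI-style finish reason.
pub fn map_stop_reason(stop_reason: &str) -> &str {
    match stop_reason {
        "end_turn" | "stop_sequence" => "stop",
        "max_tokens" => "length",
        "tool_use" => "tool_calls",
        // Assumption: unknown reasons are passed through as-is.
        other => other,
    }
}
```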

## Wiring

- Authentication: `x-api-key` + `anthropic-version: 2023-06-01`
  headers (not Bearer). Both ship when api_key is configured;
  servers that don't care (cortex) ignore them.
- `WireApi::AnthropicMessages` in build_provider now constructs
  the provider instead of erroring "reserved for future".
- `provider::mod.rs` registers the new module.

18 new unit tests: encoder (system collapse, multi-system concat,
default max_tokens, multipart with image, tool_use blocks, tool
results, malformed JSON arg fallback), decoder (text streaming,
tool_use lifecycle, max_tokens→length mapping, empty deltas, ping
events, error events, cancellation, malformed payload skip,
thinking blocks).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 14:01:59 +03:00
cad7552104 ci: clear sccache env on cuda-check so cargo doesn't try to wrap rustc
Some checks failed
CI / Test (push) Waiting to run
CI / CUDA type-check (push) Failing after 18s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 31s
CI / Clippy (push) Successful in 2m25s
build-prerelease / Build cortex binary (push) Successful in 5m19s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
CI run 255 job 3 (CUDA type-check) fails with:

  error: could not execute process `*** rustc -vV` (never executed)
    Caused by: No such file or directory (os error 2)

The redacted `***` is `sccache`. The ci.yml workflow-level env block
sets `RUSTC_WRAPPER: sccache` because the generic `rust` runner has
sccache installed and routes the cache to caveman.kosherinata.internal.
The new `cuda-check` job runs on `cuda-13.0` (where nvcc lives), and
that runner doesn't carry sccache on PATH — so cargo's first action
(`sccache rustc -vV` to probe the compiler version) fails before
borrow-check even starts.

`build-prerelease.yml`, which uses the same `cuda-13.0` runner for
the actual release neuron builds, deliberately does NOT set
RUSTC_WRAPPER. That's the pattern this commit applies.

Fix: override `RUSTC_WRAPPER` (plus the SCCACHE_* and AWS_* env vars)
locally on the job. We lose caching on the cuda-check job (it's
borrow-check-only and finishes in a couple of minutes anyway), but
the gate runs.

The job's purpose — fail fast on `#[cfg(feature = "cuda")]`
borrowck errors that the default-feature gate misses — is what
matters, and that purpose was undermined by the env inheritance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 13:55:18 +03:00
1818dfb337 feat(helexa-acp): openai-responses provider
Some checks failed
CI / Format (push) Successful in 38s
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Clippy (push) Successful in 2m35s
CI / CUDA type-check (push) Failing after 12s
CI / Test (push) Successful in 5m54s
build-prerelease / Build cortex binary (push) Successful in 5m9s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-blackwell (push) Successful in 4m36s
build-prerelease / Build neuron-ampere (push) Successful in 7m11s
build-prerelease / Build neuron-ada (push) Successful in 6m33s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
Stage 6a. Implements the `Provider` trait for OpenAI's Responses
API surface, parallel to the existing `OpenAIChatProvider`. Lets a
helexa-acp endpoint configured with `wire_api = "openai-responses"`
drive a `/v1/responses` server (today: neuron through cortex; later:
OpenAI directly) using the same agent-loop machinery the chat
provider already supports.

## Encoder (CompletionRequest → Responses body)

- System messages collapse into a single top-level `instructions`
  string. Multiple system messages concatenate with blank lines so
  ordering is preserved.
- User messages become `{type:"message", role:"user", content:…}`
  input items. Text content stays a bare string; MultiPart content
  (text + images, post-Stage 5) becomes a
  `[{type:"input_text"}, {type:"input_image"}]` array with images
  encoded as `data:{mime};base64,{data}` URIs — exactly the shape
  neuron's `wire::openai_responses::request_to_chat` accepts.
- Assistant text turns become an `output_text` content part inside
  a `message` item.
- Assistant tool-call turns become `function_call` input items.
- Tool result turns become `function_call_output` input items.
- `max_tokens` translates to `max_output_tokens`.
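
A sketch of the MultiPart encoding rule, assuming a simplified `Part` enum in place of the crate's own content type. `encode_part` is a hypothetical helper that returns the Responses content-part type alongside its payload.

```rust
/// Simplified stand-in for a MultiPart user content element.
pub enum Part {
    Text(String),
    Image { mime: String, base64_data: String },
}

/// Map one part to its Responses content-part type and payload.
/// Images become `data:{mime};base64,{data}` URIs.
pub fn encode_part(part: &Part) -> (&'static str, String) {
    match part {
        Part::Text(t) => ("input_text", t.clone()),
        Part::Image { mime, base64_data } => {
            ("input_image", format!("data:{mime};base64,{base64_data}"))
        }
    }
}
```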

## Decoder (Responses SSE → CompletionEvent)

Reads named events on the SSE `event:` line:

- `response.output_text.delta` → `CompletionEvent::TextDelta`
- `response.output_item.added` with `type:"function_call"` →
  `CompletionEvent::ToolCallStart` (and, when the upstream
  pre-buffers fully, a single `ToolCallArgsDelta`)
- `response.function_call_arguments.delta` →
  `CompletionEvent::ToolCallArgsDelta`, correlated back to the
  tool-call slot by output_index.
- `response.completed` → `CompletionEvent::Usage` (if present) +
  `CompletionEvent::Finish` with reason mapped from `status`:
  `"completed"` → `"stop"`, `"incomplete"` → `"length"`.
- Bookkeeping events (`response.created`, `response.in_progress`,
  `*.content_part.*`, `*.output_text.done`, `*.output_item.done`,
  `*.function_call_arguments.done`, reasoning_*) are skipped.
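
The dispatch on the SSE `event:` name can be pictured as a single match. `Action` and `classify` are hypothetical names; treating a missing `status` as `"stop"` is an assumption.

```rust
#[derive(Debug, PartialEq)]
pub enum Action {
    TextDelta,
    ToolCallStart,
    ToolCallArgsDelta,
    Finish(&'static str),
    Skip,
}

/// Classify one named SSE event. `item_type` is the `type` of the added
/// output item (if any); `status` is the response status on completion.
pub fn classify(event: &str, item_type: Option<&str>, status: Option<&str>) -> Action {
    match event {
        "response.output_text.delta" => Action::TextDelta,
        "response.output_item.added" if item_type == Some("function_call") => {
            Action::ToolCallStart
        }
        "response.function_call_arguments.delta" => Action::ToolCallArgsDelta,
        "response.completed" => Action::Finish(match status {
            Some("incomplete") => "length",
            // "completed" maps to "stop"; a missing status is assumed to as well.
            _ => "stop",
        }),
        // Bookkeeping events are skipped.
        _ => Action::Skip,
    }
}
```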

## Wiring

- `EndpointConfig::responses_url()` joins `{base_url}/responses`.
- `WireApi::OpenAiResponses` in `build_provider` constructs the new
  provider (was previously a "reserved for future" error).
- `provider::mod.rs` registers the new module.

## Cuts (carried over from neuron-side issues)

- The decoder's `ToolCall*` handling fires correctly when the
  upstream emits `function_call` items, but the neuron candle
  harness doesn't emit them yet (Refs #6). Real tool-call testing
  against cortex+neuron stays on the chat path until #6 lands.
- Reasoning events (`response.reasoning_*`) are deliberately
  dropped today; once neuron emits `InferenceEvent::ReasoningDelta`
  (Refs #5) the projector on the neuron side will start firing the
  reasoning event family and this decoder will need a matching
  case to route them to `CompletionEvent::ReasoningDelta`.

13 new unit tests cover encoder (system collapse, multipart user
input, assistant output_text encoding, tool-call round-trip via
function_call items) and decoder (text streaming, empty deltas
dropped, length finish, function_call lifecycle, inline-arguments
shape, cancellation, malformed payload skip).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 11:30:25 +03:00
5ed1140c97 feat(cortex-gateway): proxy /v1/responses to neuron
Some checks failed
CI / CUDA type-check (push) Failing after 12s
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 37s
CI / Clippy (push) Failing after 1m5s
build-prerelease / Build cortex binary (push) Successful in 4m26s
CI / Test (push) Successful in 5m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m39s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Step 3 of the Responses rollout: plain proxy route on the gateway,
no translation. Neuron speaks the Responses API natively after Step
2 (commit 957f704), so the gateway just needs the same routing
shape it uses for /v1/chat/completions — extract `model`, resolve
via router::resolve, forward verbatim.

- New `POST /v1/responses` handler in handlers.rs::responses.
- Mock neuron under tests/common/mod.rs gains a `/v1/responses`
  endpoint that mirrors the ResponsesResponse shape neuron emits.
- New integration test file `tests/responses.rs` exercises:
  - Happy path (200, body round-trips, ResponsesUsage shape).
  - Unknown model → 404 (matches chat-completions error shape).
  - Missing `model` field → 400 (same extract_model helper).
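
The status codes exercised by these tests follow from a two-step lookup, sketched here with a plain map standing in for the router's state. `resolve` and the backend map are hypothetical; the real path goes through `extract_model` and `router::resolve`.

```rust
use std::collections::HashMap;

/// Resolve the upstream for a request, or return the HTTP status to send:
/// 400 when `model` is missing, 404 when no backend serves it.
pub fn resolve<'a>(
    model: Option<&str>,
    backends: &'a HashMap<String, String>,
) -> Result<&'a str, u16> {
    let model = model.ok_or(400u16)?;
    backends.get(model).map(String::as_str).ok_or(404)
}
```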

Streaming proxy works through the same path as chat completions —
the upstream Content-Type (`text/event-stream` for stream:true,
`application/json` otherwise) propagates through proxy_with_metrics
unchanged. Live-stream integration tests against a streaming mock
are deferred until we exercise the path against a real neuron, since
the chat-completions streaming test already covers the proxy's
SSE forwarding mechanics.

Three new tests; clippy + fmt clean across the workspace.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 11:21:43 +03:00
957f704efa feat(neuron): OpenAI Responses API + ci cuda-check runner label
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
CI / CUDA type-check (push) Failing after 11s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 32s
CI / Clippy (push) Successful in 2m31s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
Step 2 of the Responses rollout: native `/v1/responses` endpoint on
neuron that consumes the same InferenceEvent stream as
`/v1/chat/completions` but emits it as the Responses API's named
SSE event family. No gateway-side translation.

## Surface

- `cortex-core::responses` envelope types: `ResponsesRequest`,
  `ResponsesInput` (text | items), `ResponsesInputItem` (message |
  function_call | function_call_output | reasoning),
  `ResponsesContentPart` (input_text | input_image | output_text),
  `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus
  a `events::*` constant module so the projector and the wire shape
  stay in sync without string-typos.

- `neuron::wire::openai_responses`:
  - `request_to_chat(req)` flattens Responses input + instructions
    into a `ChatCompletionRequest` the candle harness already
    understands. Text-only Parts collapse to a string; mixed
    text+image Parts go to chat's content-array shape; reasoning
    items drop; function_call / function_call_output round-trip
    via tool_calls / tool_call_id metadata so the surface is
    consistent for the day the harness emits tool calls.
  - `project_responses_stream(rx, meta)` reads InferenceEvents
    and emits the eight named events that compose a Responses
    stream: response.created → output_item.added → content_part.added
    → output_text.delta×N → output_text.done → content_part.done
    → output_item.done → response.completed. Synthesises start
    frames if the producer skips Start (poisoned model, early
    disconnect) so the stream stays coherent.
  - `build_response(meta, text, reason, usage)` for the
    non-streaming path.

- `CandleHarness::inference_stream(req)` extracted from
  `chat_completion_stream`, returning a typed `InferenceStream`
  (event receiver + id/created/model_id metadata). Both
  `chat_completion_stream` and the new `responses_stream` are now
  thin wrappers that pick their wire projection. TP path got the
  same treatment (`chat_completion_tp_stream` → `inference_tp_stream`).

- `POST /v1/responses` route on neuron. Non-streaming returns one
  buffered `ResponsesResponse`; streaming returns axum SSE with
  both event names and JSON data per frame (Responses, unlike
  chat completions, uses named `event:` lines). The
  `inference_error_response` helper is hoisted out so the chat and
  responses handlers share the InferenceError → HTTP mapping.
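
To make the projector's frame ordering concrete, here is a sketch that produces the event-name sequence for a text-only response with `delta_count` text chunks. `event_sequence` is a hypothetical helper; the full `response.`-prefixed names are assumed to match the Responses API event names, and payloads are omitted.

```rust
/// Event names, in order, for a single-message text response.
pub fn event_sequence(delta_count: usize) -> Vec<&'static str> {
    let mut events = vec![
        "response.created",
        "response.output_item.added",
        "response.content_part.added",
    ];
    // One delta frame per streamed text chunk.
    events.extend(std::iter::repeat("response.output_text.delta").take(delta_count));
    events.extend([
        "response.output_text.done",
        "response.content_part.done",
        "response.output_item.done",
        "response.completed",
    ]);
    events
}
```

With zero deltas the stream still carries the seven framing events, which is the shape a client sees when the start frames are synthesised for a producer that finishes early.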

## CI

Also bundles the `cuda-check` runner-label fix from feedback on
commit 1859777: the `rpm` runner doesn't ship the CUDA toolkit, so
cudarc's nvcc-version build script blew up. Switched to
`runs-on: cuda-13.0` per the existing labels.

## Scope cuts (documented in the modules)

- `previous_response_id` rejected at translate time with 400
  (`code: chained_conversation_not_supported`) — stateful chained
  conversations need a persistence layer we haven't built.
- Reasoning items dropped (no Qwen3 `<think>` routing yet).
- Single output item per response (one `"message"` carrying text);
  `function_call` items reserved but not synthesised.
- Streaming events cover the core set; `response.in_progress`
  and the web_search / image_generation event families are
  out-of-scope.
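
The `previous_response_id` cut amounts to a guard at translate time, roughly as follows. `ApiError` and `check_stateless` are hypothetical names; the status and error code are the ones listed above.

```rust
#[derive(Debug, PartialEq)]
pub struct ApiError {
    pub status: u16,
    pub code: &'static str,
}

/// Reject chained conversations: there is no persistence layer to
/// resolve a previous response against.
pub fn check_stateless(previous_response_id: Option<&str>) -> Result<(), ApiError> {
    match previous_response_id {
        Some(_) => Err(ApiError {
            status: 400,
            code: "chained_conversation_not_supported",
        }),
        None => Ok(()),
    }
}
```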

22 new tests: 5 in cortex-core (envelope round-trips), 13 in
neuron::wire (request translator + projector + non-streaming
builder), 4 in neuron's tests/api.rs (route surface — 503 when no
candle, 400 on previous_response_id, 404 on missing model for
both stream and non-stream).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 11:13:44 +03:00
1859777332 ci: add cuda type-check job so CUDA-only borrowck errors fail fast
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 37s
CI / CUDA type-check (push) Failing after 3m8s
CI / Clippy (push) Successful in 2m27s
build-prerelease / Build neuron-blackwell (push) Successful in 5m46s
build-prerelease / Build cortex binary (push) Successful in 5m0s
build-prerelease / Build neuron-ampere (push) Successful in 7m39s
CI / Test (push) Successful in 5m37s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m33s
build-prerelease / Build neuron-ada (push) Successful in 5m12s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m8s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m9s
Run 244 caught a use-of-moved-value in a `#[cfg(feature = "cuda")]`
block that the default-feature workspace clippy/test gate had no
chance of seeing. The error appeared only when the RPM build
workflow compiled with `--features cuda` — 30+ minutes after push.

Add a `cuda-check` job to ci.yml that runs `cargo check -p neuron
--features cuda --all-targets` on the rpm runner (where nvcc /
cudarc build deps live; the generic `rust` runner doesn't have
them). Borrow-check only — we never run tests here, the runner
has no GPU. Same retry pattern as clippy/test.

Both SRPM jobs (`srpm-cortex`, `srpm-neuron`) now gate on
`cuda-check` so a CUDA build break can't reach the release pipeline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 09:49:51 +03:00
6927286cab fix(neuron): clone id/model_id before TP spawn so wire projector can use them
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
CI / Format (push) Successful in 39s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 5m40s
build-prerelease / Build cortex binary (push) Successful in 5m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m49s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 7m38s
build-prerelease / Build neuron-ada (push) Successful in 5m34s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
The Step 1 refactor moved the InferenceEvent receiver wrap to *after*
the orchestration spawn in chat_completion_tp_stream, but the spawn
moves both `id` and `model_id` into its async closure (used heavily
by acquire_pool_lock, NCCL ops, and tracing). Result: borrowck
error E0382 use-of-moved-value on the wire_chat::project_chat_stream
call.

The non-CUDA build doesn't exercise this branch (it lives behind
`#[cfg(feature = "cuda")]`) which is why the workspace clippy/test
gate passed locally and on the regular CI workflow. The RPM build
workflow, which compiles with --features cuda, caught it (run 244
jobs 2/3/4 against beast / ampere / ada respectively, all the same
error).

Fix: snapshot `id` and `model_id` into `projector_id` /
`projector_model_id` before the spawn, use those at the projector
call site. The originals stay free to be moved into the closure.
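
A minimal sketch of the clone-before-spawn pattern described above. It uses a `std::thread` spawn instead of the real async spawn so that it stays self-contained; `run_stream` and its body are hypothetical, and only the `projector_id` / `projector_model_id` names come from this message.

```rust
use std::thread;

/// Hypothetical stand-in for the TP streaming entry point. The real code
/// spawns an async task; a std thread shows the same ownership rule.
fn run_stream(id: String, model_id: String) -> (String, String) {
    // Snapshot the values *before* the spawn moves the originals.
    let projector_id = id.clone();
    let projector_model_id = model_id.clone();

    // `move` transfers ownership of `id` and `model_id` into the closure.
    let worker = thread::spawn(move || format!("{id}/{model_id}"));
    let _orchestration_tag = worker.join().expect("worker panicked");

    // Touching `id` here would be E0382 (use of moved value); the clones
    // are still owned by this scope and can go to the projector.
    (projector_id, projector_model_id)
}
```

The clones cost two small string allocations per request, which is the price of keeping the spawn closure untouched.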

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 09:37:10 +03:00
302ccfb982 refactor(neuron): introduce InferenceEvent + wire projection layer
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 3m28s
build-prerelease / Build neuron-blackwell (push) Failing after 6m4s
build-prerelease / Build neuron-ampere (push) Failing after 7m20s
CI / Test (push) Successful in 7m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ada (push) Failing after 4m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Step 1 of the OpenAI Responses API rollout. Pure refactor — no new
endpoints, no behaviour change on the wire. Lays the seam for
emitting Responses-shaped streaming events from the same harness
output as chat completions in Step 2.

- New `neuron::wire` module tree:
  - `wire::event::InferenceEvent` — format-agnostic enum
    (Start, TextDelta, ReasoningDelta, Finish) the candle harness
    now emits as its native streaming currency.
  - `wire::event::FinishReason` — typed reason that maps cleanly
    onto OpenAI `finish_reason`, OpenAI Responses `status`, and
    Anthropic `stop_reason` strings.
  - `wire::openai_chat::project_chat_stream` — async task that
    consumes an InferenceEvent receiver and produces a
    ChatCompletionChunk receiver, stamping per-request metadata
    (id, created, model_id) onto every chunk. Output matches the
    pre-refactor wire shape bit-for-bit.

- candle.rs refactored to emit InferenceEvent on its internal
  channel through all three streaming paths (CPU
  run_inference_streaming, CUDA single-GPU stream_inference_via_worker,
  CUDA TP chat_completion_tp_stream). The streaming functions lost
  their id/created/model_id parameters since wire-format metadata
  now lives in the projector.

- emit_delta + emit_delta_blocking simplified to single-purpose
  TextDelta emitters with no wire-format coupling.

- chat_completion_stream wraps the InferenceEvent receiver in
  wire_chat::project_chat_stream before returning so the
  /v1/chat/completions HTTP handler keeps consuming
  ChatCompletionChunks unchanged. External signature preserved.
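
The projection seam can be sketched roughly as follows. This is an illustration under assumptions, not the code in `neuron::wire`: the variant names follow the list above, but the payload types, the simplified `ChatChunk` struct, the two-variant `FinishReason`, and the use of `std::sync::mpsc` with a thread (the real projector is an async task) are all hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Format-agnostic events. Variant names follow the commit message;
/// the payload types are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum InferenceEvent {
    Start,
    TextDelta(String),
    ReasoningDelta(String),
    Finish(FinishReason),
}

/// Assumed subset of finish reasons.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FinishReason {
    Stop,
    Length,
}

/// Hypothetical, heavily simplified stand-in for ChatCompletionChunk.
#[derive(Debug, Clone, PartialEq)]
struct ChatChunk {
    id: String,
    created: u64,
    model: String,
    content: Option<String>,
    reasoning: Option<String>,
    finish_reason: Option<&'static str>,
}

/// Consume events and stamp per-request metadata onto every chunk.
fn project_chat_stream(
    events: Receiver<InferenceEvent>,
    id: String,
    created: u64,
    model: String,
) -> Receiver<ChatChunk> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        for event in events {
            let mut chunk = ChatChunk {
                id: id.clone(),
                created,
                model: model.clone(),
                content: None,
                reasoning: None,
                finish_reason: None,
            };
            match event {
                // Assumption: Start becomes an empty metadata-only chunk.
                InferenceEvent::Start => {}
                InferenceEvent::TextDelta(t) => chunk.content = Some(t),
                InferenceEvent::ReasoningDelta(t) => chunk.reasoning = Some(t),
                InferenceEvent::Finish(FinishReason::Stop) => chunk.finish_reason = Some("stop"),
                InferenceEvent::Finish(FinishReason::Length) => chunk.finish_reason = Some("length"),
            }
            if tx.send(chunk).is_err() {
                break; // consumer hung up; stop projecting
            }
        }
    });
    rx
}
```

Because the harness only ever produces `InferenceEvent`, a second projector for another wire format could consume the same stream without touching the inference paths.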

Also fixes a pre-existing helexa-acp test race (three modules each
declared their own static LOCK for HOME mutation, so cross-module
parallelism flaked tests that read HOME at runtime). Consolidated
onto a single crate-wide path_util::ENV_LOCK.

122 helexa-acp tests + 44 neuron tests pass (5 new wire projection
tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp
suite 3x to confirm the env race is closed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:30:17 +03:00
df0abfe4d4 feat(helexa-acp): image input for vision-capable models
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m33s
CI / Test (push) Successful in 5m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m2s
build-prerelease / Build neuron-ampere (push) Successful in 7m49s
build-prerelease / Build neuron-ada (push) Successful in 5m27s
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
Stage 5. Zed clipboard/DnD images get forwarded as OpenAI
content-array messages on user turns.

- New MessageContent::MultiPart variant + MessagePart (Text|Image)
  + ImageData struct (mime_type, base64 data, optional uri).
- flatten_prompt now produces structured content: collapses to
  Text when every block is text (some upstreams treat array-form
  as vision-only and refuse on text-only models), otherwise
  produces MultiPart preserving block order.
- OpenAI encoder emits `[{type:"text",text:…}, {type:"image_url",
  image_url:{url:"data:{mime};base64,{data}"}}]` for MultiPart user
  messages. Data URIs are used over remote `uri` because they
  round-trip through every upstream we care about.
- prompt_capabilities.image = true at initialize so Zed actually
  sends image blocks.
- compaction estimates ~512 tokens per image (the middle of the
  Qwen3-VL / OpenAI detail range) so the budget tracker doesn't
  pretend images are free.
- session/load replays image-bearing user turns by surfacing the
  text parts verbatim and rendering each image as a "[image: {mime}
  ({n} bytes)]" placeholder chunk — Zed can show the prior text
  context even though re-uploading the bytes through ACP isn't
  meaningful for resume.
- 4 new tests: flatten produces MultiPart in block order, image-only
  prompts still flatten to MultiPart, encoder emits the correct
  array shape, text-only encoding stays as the string form.
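
A rough sketch of the flattening rule and the data-URI form described above. The types are simplified assumptions (the real `ImageData` also carries an optional `uri`, and the real `MessageContent::Text` has a different shape), and joining text blocks with a newline is a guess made for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ImageData {
    mime_type: String,
    data: String, // base64 payload
}

#[derive(Debug, Clone, PartialEq)]
enum MessagePart {
    Text(String),
    Image(ImageData),
}

#[derive(Debug, Clone, PartialEq)]
enum MessageContent {
    Text(String),
    MultiPart(Vec<MessagePart>),
}

/// Collapse to plain text when every block is text; otherwise keep the
/// blocks in their original order.
fn flatten_prompt(blocks: Vec<MessagePart>) -> MessageContent {
    let all_text = blocks.iter().all(|b| matches!(b, MessagePart::Text(_)));
    if !all_text {
        return MessageContent::MultiPart(blocks);
    }
    let joined = blocks
        .into_iter()
        .filter_map(|b| match b {
            MessagePart::Text(t) => Some(t),
            MessagePart::Image(_) => None,
        })
        .collect::<Vec<_>>()
        .join("\n"); // separator is an assumption
    MessageContent::Text(joined)
}

/// Build the `data:{mime};base64,{data}` URL used in `image_url.url`.
fn data_uri(image: &ImageData) -> String {
    format!("data:{};base64,{}", image.mime_type, image.data)
}
```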

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 09:43:00 +03:00
b9016571f6 feat(helexa-acp): expand ~ / $HOME and fall back to local fs on ACP read errors
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Format (push) Successful in 50s
CI / Clippy (push) Successful in 2m34s
build-prerelease / Build cortex binary (push) Successful in 4m29s
CI / Test (push) Successful in 5m13s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
build-prerelease / Build neuron-ampere (push) Successful in 8m15s
build-prerelease / Build neuron-ada (push) Successful in 5m23s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Two related polish fixes for daily use:

- New `path_util` module expands `~`, `~/…`, `$HOME`, and `$HOME/…`
  prefixes in every tool that takes a path (read_file, write_file,
  edit_file, list_dir, bash cwd). The expansion is also applied to
  the plan-mode write gate so `~/.local/share/helexa-acp/plans/…`
  comparisons behave correctly regardless of which form the model
  emits.
- `read_file` now falls back to `std::fs::read_to_string` when ACP's
  `fs/read_text_file` errors out. Zed's workspace-scoped read was
  the source of "model can't see ~/git/architecture/generic.md"
  whenever the session cwd was a different project; the fallback lets
  the agent pull in shared material that lives outside the active
  workspace, the same way `list_dir` already does via local
  `std::fs::read_dir`. Local fallback honours line/limit args.
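
The prefix expansion can be sketched as below. This is not the `path_util` implementation: the function name `expand_home` is hypothetical, and the home directory is passed in as a parameter (rather than read from the environment) purely to keep the example deterministic.

```rust
use std::path::PathBuf;

/// Expand `~`, `~/…`, `$HOME`, and `$HOME/…`; leave everything else alone.
fn expand_home(input: &str, home: &str) -> PathBuf {
    for prefix in ["~", "$HOME"] {
        if let Some(rest) = input.strip_prefix(prefix) {
            if rest.is_empty() {
                return PathBuf::from(home);
            }
            if let Some(tail) = rest.strip_prefix('/') {
                return PathBuf::from(home).join(tail);
            }
            // Something like "~alice" or "$HOMEDIR" is not one of the
            // four supported forms, so fall through untouched.
        }
    }
    PathBuf::from(input)
}
```

Applying the same function to both sides of the plan-mode write gate comparison is what makes `~/…` and absolute forms compare equal.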

The fallback also produces a combined error message when both ACP
and local-fs reads fail, so the model sees what actually broke
rather than just the ACP-side error.

14 new unit tests cover path_util's prefix matrix, fallback
success/failure paths, and the line/limit slicing in fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 09:28:58 +03:00
adbc52bfcd feat(helexa-acp): model picker + session/set_model handler
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m32s
build-prerelease / Build cortex binary (push) Successful in 4m45s
CI / Test (push) Successful in 5m52s
build-prerelease / Build neuron-blackwell (push) Successful in 5m59s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 7m21s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Successful in 4m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Stage 4. Zed's model dropdown now lists every model from every
configured endpoint, and switching it routes the next prompt to a
new endpoint+model.

- Enable `unstable_session_model` on the agent-client-protocol dep
  so SessionModelState / SetSessionModelRequest / ModelInfo are
  available.
- Agent::new becomes async and calls Provider::list_models on every
  provider at startup; per-endpoint failures warn-and-skip instead
  of aborting the agent.
- With a single endpoint configured, model ids appear bare; with
  multiple endpoints every id carries the `endpoint:` prefix so the
  picker is unambiguous and parse_model_selector routes correctly.
- NewSessionResponse and LoadSessionResponse attach SessionModelState
  with the session's current model id + the aggregated catalogue.
- session/set_model: validates the requested model id against
  resolve_provider, mutates session.model_id, and persists so the
  on-disk transcript reflects the new model.
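
A small sketch of the prefixing and warn-and-skip rules. The signature is an assumption made for illustration: each endpoint is modelled as a name plus the result of its model listing, and the warning goes to stderr rather than through the project's logging.

```rust
/// Aggregate model ids across endpoints. Ids stay bare with one configured
/// endpoint and get an `endpoint:` prefix with several.
fn aggregate_models(endpoints: &[(&str, Result<Vec<String>, String>)]) -> Vec<String> {
    // Decided by how many endpoints are configured, not how many respond.
    let multi = endpoints.len() > 1;
    let mut out = Vec::new();
    for (name, listing) in endpoints {
        match listing {
            Ok(models) => {
                for m in models {
                    out.push(if multi { format!("{name}:{m}") } else { m.clone() });
                }
            }
            // A failing endpoint is reported and skipped, never fatal.
            Err(err) => eprintln!("warn: skipping endpoint {name}: {err}"),
        }
    }
    out
}
```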

Three new aggregate_models tests cover the prefixing rule (bare vs
multi-endpoint) and warn-and-skip on a failing endpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 09:10:16 +03:00
537a0fe7f2 feat(helexa-acp): context compaction for small-context local models
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 26s
CI / Format (push) Successful in 29s
CI / Clippy (push) Successful in 2m26s
build-prerelease / Build cortex binary (push) Successful in 5m17s
build-prerelease / Build neuron-blackwell (push) Successful in 5m51s
CI / Test (push) Successful in 5m53s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 7m58s
build-prerelease / Build neuron-ada (push) Successful in 5m30s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m7s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
A new src/compaction.rs module projects rolling conversation history
into a token budget before each completion. Older tool results and
assistant prose get elided to one-line markers; system prompts, user
turns, and the last KEEP_TAIL=4 messages stay verbatim. tool_call_id
pairing is preserved so OpenAI strict-schema providers keep working.

Driven by a new per-endpoint `context_window` config field (also
HELEXA_ACP_CONTEXT_WINDOW for the env-only single-endpoint case).
When set, prompt budget = context_window - max_tokens - 512_safety;
when unset, behaviour is unchanged.
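
The budget arithmetic and the elision rule can be sketched like this. It is a simplification under assumptions: the types are hypothetical, saturating subtraction is a guess at how underflow is handled, and the sketch elides every old assistant/tool message unconditionally, whereas the real module works towards a token budget.

```rust
const KEEP_TAIL: usize = 4;
const SAFETY_MARGIN: u32 = 512;

/// prompt budget = context_window - max_tokens - 512 safety margin.
/// `None` means no context window is configured, so nothing changes.
fn prompt_budget(context_window: Option<u32>, max_tokens: u32) -> Option<u32> {
    context_window.map(|w| w.saturating_sub(max_tokens).saturating_sub(SAFETY_MARGIN))
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    System,
    User,
    Assistant,
    Tool,
}

#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: Role,
    text: String,
}

/// Replace older assistant prose and tool results with one-line markers.
/// System prompts, user turns, and the last KEEP_TAIL messages are kept
/// verbatim. Message count and order are preserved, so call/result
/// pairing by position survives.
fn elide_old(history: &[Message]) -> Vec<Message> {
    let tail_start = history.len().saturating_sub(KEEP_TAIL);
    history
        .iter()
        .enumerate()
        .map(|(i, m)| {
            let elidable = matches!(m.role, Role::Assistant | Role::Tool);
            if i < tail_start && elidable {
                Message {
                    role: m.role,
                    text: format!("[elided {} chars]", m.text.chars().count()),
                }
            } else {
                m.clone()
            }
        })
        .collect()
}
```

For a 32768-token window and `max_tokens` of 8192, the prompt budget works out to 32768 - 8192 - 512 = 24064 tokens.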

Without this, a 32K-context Qwen3 dies with `prompt_too_long` after the
first few read_file results pile up in history — the symptom seen
in plan-mode dogfooding on beat.

10 new unit tests cover the compaction strategy and the prompt
budget arithmetic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 08:22:01 +03:00
cbadfcf112 feat(helexa-acp): plan mode — third session mode for read-and-plan-only flows
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m44s
CI / Test (push) Successful in 5m3s
build-prerelease / Build cortex binary (push) Successful in 4m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-blackwell (push) Successful in 6m37s
build-prerelease / Build neuron-ampere (push) Successful in 8m12s
build-prerelease / Build neuron-ada (push) Successful in 5m32s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Plan mode is the most restrictive of the three session modes: bash is
disabled outright, writes are confined to a per-project plan directory
under $XDG_DATA_HOME/helexa-acp/plans/<basename>-<8hex>/, and reads /
list_dir are unrestricted. The system prompt is rebuilt at the top of
every round so a mid-turn switch into (or out of) plan mode takes
effect on the next streaming round, and plan mode appends a 3-option
menu instructing the model to stop and let the user pick how to
proceed once the plan is complete.

The project id is basename + FNV-1a-32 of the cwd so it stays stable
across runs (SipHash's DefaultHasher reseeds per process), while still
disambiguating multiple checkouts that share a final path component.
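
A sketch of how such a project id could be derived. The FNV-1a-32 constants are the standard offset basis and prime; the `project_id` function name, the `"root"` fallback for a path with no final component, and hashing the lossy UTF-8 form of the path are assumptions.

```rust
use std::path::Path;

/// 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime.
fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // offset basis
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// `<basename>-<8hex>`, stable across processes because FNV-1a is unseeded.
fn project_id(cwd: &Path) -> String {
    let base = cwd.file_name().and_then(|n| n.to_str()).unwrap_or("root");
    let digest = fnv1a_32(cwd.to_string_lossy().as_bytes());
    format!("{base}-{digest:08x}")
}
```

Two checkouts such as `/home/u/git/helexa` and `/srv/helexa` share the `helexa-` prefix but hash different full paths, so their suffixes will almost always differ.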

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 08:06:25 +03:00
3ecbb21ece fix(helexa-acp): persist per round, cancel previous prompt, log loop
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m32s
CI / Test (push) Successful in 5m8s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
build-prerelease / Build neuron-ampere (push) Successful in 8m13s
build-prerelease / Build neuron-ada (push) Successful in 5m18s
build-prerelease / Build cortex binary (push) Successful in 16m12s
build-prerelease / Package cortex RPM (push) Successful in 1m15s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Three changes addressing "session stops mid-turn and disk store
doesn't update":

1. Per-round persistence. drive_prompt previously called
   store::save() once at the very end of the turn. If the loop
   stalled in a later round (long-running bash, upstream SSE that
   never finished, wedged ACP roundtrip), earlier successful
   rounds lived only in the spawned task's `new_turns` and never
   reached disk. Move the extend-history + save into a helper
   (extend_and_persist) and call it at the end of every loop
   iteration. The post-loop save catches whatever the break paths
   leave behind. Failure is logged, not propagated.

2. Cancel previous in-flight prompt on new session/prompt. The
   handler used to overwrite SessionState.cancel with a fresh
   token *without firing the old one*. A wedged prior prompt would
   then live forever, holding session-state references and never
   persisting. Now we fire the existing cancel under the lock
   before installing the new token — the old task observes
   is_cancelled() on its next .await and unwinds.

3. Per-round and per-tool log lines. drive_prompt now emits:
   - INFO  prompt round: streaming { round, of, history_turns }
   - INFO  dispatch tool { tool, tool_call_id }
   - INFO  dispatch tool complete { tool_call_id, is_error }
   - INFO  prompt round complete; persisting { round, turns }
   - INFO  prompt complete { stop_reason }
   so the next hang shows up by line number in /tmp/helexa-acp.log
   instead of as silence.
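
Change 2 can be sketched as follows. The `CancelToken` type here is a hypothetical flag built on `AtomicBool` and the `begin_prompt` name is invented for illustration; the real handler presumably uses an async-aware cancellation token, which this does not reproduce.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical cancellation flag shared between handler and task.
#[derive(Clone, Default)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Default)]
struct SessionState {
    cancel: CancelToken,
}

/// Fire the previous prompt's token under the lock, then install and
/// return a fresh one for the new prompt.
fn begin_prompt(state: &Mutex<SessionState>) -> CancelToken {
    let mut guard = state.lock().expect("session lock poisoned");
    guard.cancel.cancel(); // the old task sees this at its next check
    let fresh = CancelToken::default();
    guard.cancel = fresh.clone();
    fresh
}
```

Doing both steps under one lock acquisition means no prompt can observe a state where the old token has been replaced but not fired.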

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 16:29:22 +03:00
0d841a4981 feat(helexa-acp): replay session history on session/load
Some checks failed
CI / Format (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 48s
CI / Test (push) Failing after 1m19s
CI / Clippy (push) Successful in 2m56s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m17s
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-blackwell (push) Successful in 5m52s
build-prerelease / Build neuron-ampere (push) Successful in 7m49s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
session/list and session/load were both implemented but clicking
a session in Zed's thread picker still left the agent panel
empty. Zed (and ACP clients in general) doesn't cache the
transcript for custom agent_servers entries — it only owns
conversation state for first-party agents. For custom agents the
expectation is that session/load returns successfully and the
agent then re-emits the conversation as a stream of session/update
notifications so the client can rebuild its view.

Implement that replay path:

- handle_load_session now returns (LoadSessionResponse, Vec<Message>)
  so the caller has the history available after the in-memory
  hydration finishes.
- The session/load closure responds to the request *first*, then
  spawns a task that calls replay_history off the dispatch loop.
- replay_history walks the persisted history and emits one
  session/update per turn:
    Role::User           → UserMessageChunk(text)
    Role::Assistant text → AgentMessageChunk(text)
    Role::Assistant tool → AgentMessageChunk for any accompanying
                           text + one ToolCall card per call (with
                           kind/title/raw_input rendered the same
                           way as the live dispatch path)
    Role::Tool result    → ToolCallUpdate matching the assistant's
                           call id, status: Completed, content set
                           to the result text
    Role::System         → skipped (system prompts aren't shown)
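
The mapping above can be sketched as a pure function from history to updates. All types here are simplified, hypothetical stand-ins for the real message and ACP notification types, and the tool-call title is reduced to the tool name.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ToolCall {
    id: String,
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
enum Turn {
    System(String),
    User(String),
    Assistant { text: Option<String>, tool_calls: Vec<ToolCall> },
    ToolResult { call_id: String, text: String },
}

#[derive(Debug, Clone, PartialEq)]
enum SessionUpdate {
    UserMessageChunk(String),
    AgentMessageChunk(String),
    ToolCall { id: String, title: String },
    ToolCallUpdate { id: String, status: &'static str, content: String },
}

/// Walk persisted history and produce the updates a client needs to
/// rebuild its view. System prompts are skipped.
fn replay_history(history: &[Turn]) -> Vec<SessionUpdate> {
    let mut updates = Vec::new();
    for turn in history {
        match turn {
            Turn::System(_) => {}
            Turn::User(text) => updates.push(SessionUpdate::UserMessageChunk(text.clone())),
            Turn::Assistant { text, tool_calls } => {
                if let Some(text) = text {
                    updates.push(SessionUpdate::AgentMessageChunk(text.clone()));
                }
                for call in tool_calls {
                    updates.push(SessionUpdate::ToolCall {
                        id: call.id.clone(),
                        title: call.name.clone(),
                    });
                }
            }
            Turn::ToolResult { call_id, text } => updates.push(SessionUpdate::ToolCallUpdate {
                id: call_id.clone(),
                status: "completed",
                content: text.clone(),
            }),
        }
    }
    updates
}
```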

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 16:02:00 +03:00
0bbb9b752d feat(helexa-acp): session/list so Zed can discover sessions to resume
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 28s
CI / Format (push) Successful in 28s
CI / Clippy (push) Successful in 2m45s
build-prerelease / Build cortex binary (push) Successful in 4m41s
CI / Test (push) Successful in 4m58s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 7m36s
build-prerelease / Build neuron-ada (push) Successful in 5m40s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m3s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Stage 3b only implemented the trailing half of resume: write
sessions to disk + handle session/load. But Zed (and any ACP
client) needs `session/list` to discover *which* session belongs
to the workspace it's reopening — without it, the client only
knows how to mint new sessions and resume never fires even
though the JSON sits ready on disk.

Add the missing pieces:

- store::list / list_in_dir — enumerate {id}.json under
  sessions_dir(), optionally filter by cwd, sort recent-first.
  Skips unparseable files with a warn rather than aborting.
- store::unix_to_iso8601 — RFC 3339 formatter for
  SessionInfo.updated_at; pulls chrono in directly (already in
  the dep tree transitively).
- agent::handle_list_sessions — wires the request to the store,
  builds SessionInfo entries with derived titles (first user
  turn, truncated to 60 chars).
- agent::initialize_response — advertise
  session_capabilities.list = {} alongside the existing
  load_session: true.
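
The filter, sort, and title rules can be sketched on in-memory metadata. `SessionMeta` and `list_sessions` are hypothetical names; reading and parsing the `{id}.json` files is left out, and truncating by `char` rather than by byte is an assumption made to stay UTF-8 safe.

```rust
#[derive(Debug, Clone, PartialEq)]
struct SessionMeta {
    id: String,
    cwd: String,
    updated_at: u64, // unix seconds
    first_user_turn: String,
}

/// Title is the first user turn, truncated to 60 characters.
fn derive_title(first_user_turn: &str) -> String {
    first_user_turn.chars().take(60).collect()
}

/// Optionally filter by cwd, sort recent-first, return (id, title) pairs.
fn list_sessions(mut all: Vec<SessionMeta>, cwd: Option<&str>) -> Vec<(String, String)> {
    all.retain(|s| cwd.map_or(true, |c| s.cwd == c));
    all.sort_by(|a, b| b.updated_at.cmp(&a.updated_at));
    all.into_iter()
        .map(|s| (s.id, derive_title(&s.first_user_turn)))
        .collect()
}
```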

Verified end-to-end against the user's real hxa-1.json
(60-turn beat conversation): `session/list` returns the entry
with cwd, derived title, and ISO 8601 timestamp.

4 new store unit tests for list filtering, missing-dir
handling, unparseable-file skipping, and ISO 8601 formatting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 14:34:41 +03:00
5aac1ffc59 feat(helexa-acp): session resume via session/load
All checks were successful
CI / Format (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Clippy (push) Successful in 2m37s
CI / Test (push) Successful in 4m59s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m35s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-blackwell (push) Successful in 6m4s
build-prerelease / Build neuron-ampere (push) Successful in 7m45s
build-prerelease / Build neuron-ada (push) Successful in 5m31s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Zed restarts (frequent during helexa-acp dogfooding) used to lose
every conversation because we'd ignore the load_session capability
and treat every project-reopen as a fresh session/new. Persist
sessions to disk and honour session/load so the agent panel comes
back where it left off.

Storage layout:
  $XDG_DATA_HOME/helexa-acp/sessions/{session_id}.json

Each file holds session_id, cwd, model_id, mode_id, full Message
history, plus created/updated timestamps. Atomic save via
tempfile+rename so a crash mid-write can't corrupt the store.
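
A minimal sketch of the write-then-rename save. The function name, the `.json.tmp` suffix, and taking the directory as an explicit argument are assumptions; the sketch also assumes `session_id` is already sanitised and does not fsync, which a stricter durability guarantee would need.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write the JSON to a temp file in the same directory, then rename it
/// over the final path. A crash mid-write leaves the old file intact.
fn save_atomic(dir: &Path, session_id: &str, json: &str) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let final_path = dir.join(format!("{session_id}.json"));
    // Same directory as the target so the rename stays on one filesystem.
    let tmp_path = dir.join(format!("{session_id}.json.tmp"));
    fs::write(&tmp_path, json)?;
    fs::rename(&tmp_path, &final_path)
}
```

Passing the directory explicitly mirrors the explicit-dir entry points mentioned below, which let tests avoid touching `XDG_DATA_HOME`.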

Touch points:

- src/store.rs (new) — sessions_dir() resolution, save/load via
  default and explicit-dir entry points (so unit tests don't have
  to race on XDG_DATA_HOME). 5 unit tests cover round-trip,
  not-found errors, atomic overwrite, tool-call/result preservation,
  and the filename sanitiser's path-traversal handling.
- src/provider/mod.rs — Serialize/Deserialize on Role, Message,
  MessageContent, ToolCall. MessageContent::Text turned into a
  struct variant ({text: ...}) so internally-tagged JSON works.
- src/agent.rs — initialize_response advertises load_session: true;
  handle_load_session reads the file, snapshots in-memory state,
  returns LoadSessionResponse with the persisted mode preselected;
  drive_prompt persists at the end of every prompt round, snapshotting
  state under the session lock and doing the I/O outside the lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 13:34:42 +03:00
ec2b6450b2 feat(helexa-acp): infer tool name from arg shape when model omits it
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m33s
build-prerelease / Build cortex binary (push) Successful in 4m20s
CI / Test (push) Successful in 5m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m40s
build-prerelease / Build neuron-ampere (push) Successful in 7m53s
build-prerelease / Build neuron-ada (push) Successful in 5m33s
build-prerelease / Package cortex RPM (push) Successful in 8m20s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
Qwen3.6-27B occasionally emits a <tool_call> body with the right
arguments but no top-level `name` field — observed in the field as
mkdir-style bash calls like
  {"arguments":{"command":"mkdir -p .../doc/plan/{01-discovery,...}"}}
with no `name`. The agent had no tool to dispatch and surfaced a
Failed card; the model would then hang or retry the same shape.

Add a shape-based inference layer:

- tools::infer_tool_name(arguments) — given an `arguments` object
  alone, return Some(name) when the key set uniquely identifies one
  tool: `{command}` or `{command,cwd}` → bash, `{path,content}` →
  write_file, `{path,old_text,new_text}` → edit_file. Ambiguous
  shapes (`{path}` alone — could be read_file or list_dir) return
  None so the agent still emits a Failed card rather than guessing.
- agent::try_repair_missing_name(raw) — parses a malformed body,
  applies infer_tool_name, returns (name, args_json) on success.
- drive_prompt sweeps malformed_calls through this repair before
  the Failed-card path. Recovered calls go into tool_buckets at
  the next free index and dispatch through the normal tool loop.
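
The inference table can be sketched as a match on the sorted key set. To stay dependency-free this version takes the argument keys as a slice of strings rather than a JSON object, which is an assumption about the shape of the real `tools::infer_tool_name`.

```rust
/// Return the tool name when the key set identifies exactly one tool.
fn infer_tool_name(keys: &[&str]) -> Option<&'static str> {
    // Sort so the match below is independent of key order.
    let mut sorted: Vec<&str> = keys.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    match sorted.as_slice() {
        ["command"] | ["command", "cwd"] => Some("bash"),
        ["content", "path"] => Some("write_file"),
        ["new_text", "old_text", "path"] => Some("edit_file"),
        // `{path}` alone could be read_file or list_dir: do not guess.
        _ => None,
    }
}
```

Returning `None` for ambiguous or unknown shapes keeps the Failed-card path as the default, so inference can only ever rescue a call, never misroute one.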

10 new unit tests in tools::tests cover the inference table plus
the verbatim mkdir failure from the field log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 13:14:50 +03:00
a494c8d43c feat(helexa-acp): repair malformed tool calls and render failures as cards
Some checks failed
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 28s
CI / Format (push) Successful in 4m7s
CI / Test (push) Failing after 1m2s
build-prerelease / Build neuron-blackwell (push) Successful in 6m10s
CI / Clippy (push) Successful in 2m37s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m24s
build-prerelease / Build neuron-ampere (push) Successful in 8m18s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ada (push) Successful in 5m23s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Two related fixes for cases where Qwen3 sometimes emits slightly-off
JSON inside <tool_call> blocks:

1. JSON repair pass in qwen3::parse_tool_call_body — strip up to
   three trailing extra `}` characters (model overshoots its closing
   braces), and hoist `name` out of `arguments` when it lands
   nested instead of as a sibling. Both observed in the field; both
   trivially repairable; both now dispatch as normal tool calls
   instead of falling back to the malformed path.

2. New CompletionEvent::MalformedToolCall variant for the cases
   repair can't fix. decode_stream now emits it instead of wrapping
   the raw body in a TextDelta, and agent.rs surfaces each one as
   a Failed SessionUpdate::ToolCall card (so Zed renders it as a
   structured failure UI element rather than dumping the body
   inline) plus a synthetic tool-call/tool-result history pair so
   the model gets clear feedback for self-correction on the next
   round.

Empty <tool_call></tool_call> blocks are now a no-op too (no
Malformed event), matching the existing empty-<think> behaviour.
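
A minimal sketch of the trailing-brace half of the repair pass, for
illustration only. The helper names are hypothetical and this is not
the code in qwen3::parse_tool_call_body; it assumes surplus braces can
be detected by counting unmatched `}` outside JSON string literals
rather than by re-parsing.

```rust
/// Hypothetical helper: drop up to `max_extra` surplus trailing `}`
/// characters from a tool-call body.
fn strip_extra_closing_braces(body: &str, max_extra: usize) -> &str {
    let mut trimmed = body.trim_end();
    for _ in 0..max_extra {
        if brace_surplus(trimmed) > 0 && trimmed.ends_with('}') {
            // `}` is a single byte, so slicing one byte off is safe.
            trimmed = trimmed[..trimmed.len() - 1].trim_end();
        } else {
            break;
        }
    }
    trimmed
}

/// Count `}` characters that have no matching `{`, ignoring anything
/// inside JSON string literals (including escaped quotes).
fn brace_surplus(s: &str) -> usize {
    let (mut depth, mut surplus) = (0usize, 0usize);
    let (mut in_string, mut escaped) = (false, false);
    for c in s.chars() {
        if in_string {
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => in_string = true,
            '{' => depth += 1,
            '}' => {
                if depth > 0 {
                    depth -= 1
                } else {
                    surplus += 1
                }
            }
            _ => {}
        }
    }
    surplus
}
```

Called with `max_extra = 3`, a body with four surplus braces still
comes back with one left over, so it would fall through to the
malformed path.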

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:58:51 +03:00
abbedf8d8a chore(neuron): bump default max_tokens from 512 to 8192
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Format (push) Successful in 45s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build neuron-blackwell (push) Successful in 5m35s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 8m6s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
512 is too low for any modern coding model — clients that don't
explicitly set max_tokens get clipped responses with no diagnostic.
Bump the fallback at all four inference call sites (single-GPU
streaming + non-streaming, TP leader + non-leader) to 8192, which
fits comfortably within Qwen3-class context windows after a
typical agent prompt and lines up with what helexa-acp / a0 / curl
clients reasonably expect.

Clients that explicitly set max_tokens (now including helexa-acp
via HELEXA_ACP_MAX_TOKENS / per-endpoint TOML) override this.
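
The precedence described above can be sketched as follows. The
constant and function names are hypothetical; only the 8192 fallback
and the "explicit value wins" rule come from this commit. Treating an
unparsable or zero override as "not set" is an assumption.

```rust
/// Fallback applied when the client sends no max_tokens.
const DEFAULT_MAX_TOKENS: u32 = 8192;

/// Client side: turn a raw override (for example the value of an env
/// var) into an optional limit. Unparsable or zero values are treated
/// as "not set", which is an assumption of this sketch.
fn parse_max_tokens(raw: Option<&str>) -> Option<u32> {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|n| *n > 0)
}

/// Server side: an explicit request value overrides the fallback.
fn effective_max_tokens(requested: Option<u32>) -> u32 {
    requested.unwrap_or(DEFAULT_MAX_TOKENS)
}
```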

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:38:28 +03:00
6cc14e925c feat(helexa-acp): per-endpoint max_tokens config
Some checks failed
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Clippy (push) Failing after 1m3s
CI / Test (push) Failing after 1m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
The agent was sending max_tokens: None, letting cortex/neuron pick
its own default — which trips Zed's "Output Limit Reached" on long
turns. Add a per-endpoint max_tokens option in EndpointConfig
(TOML key and HELEXA_ACP_MAX_TOKENS env var for the single-endpoint
fallback) that the agent threads into every CompletionRequest by
endpoint name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:34:23 +03:00
1c16732668 feat(helexa-acp): route Qwen3 inline <think> blocks to reasoning
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 26s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Clippy (push) Successful in 2m40s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Qwen3 emits chain-of-thought as literal <think>...</think> tags
inside delta.content rather than via the separate reasoning_content
field — so without parsing the markers, the thinking shows up in
the message pane as ordinary text. Add a small ThinkParser in
qwen3.rs (same chunk-boundary discipline as ToolCallParser) and
stage it after the tool-call parser in decode_stream: text events
from the tool-call parser are fed in and split into TextDelta /
ReasoningDelta. Zed now renders thinking in its dedicated thought
UI; visible answer text stays in the message pane.

The parking-lot entry from the plan is now closed.
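
For readers unfamiliar with the chunk-boundary problem, here is a
simplified, std-only sketch of the idea. `ThinkSplitter` and `Event`
are hypothetical stand-ins, not the ThinkParser in qwen3.rs. The key
point is that a partial tag at the end of a chunk is held back until
the next chunk shows whether it completes.

```rust
#[derive(Debug, PartialEq)]
enum Event {
    Text(String),
    Reasoning(String),
}

struct ThinkSplitter {
    buf: String,
    in_think: bool,
}

impl ThinkSplitter {
    fn new() -> Self {
        Self { buf: String::new(), in_think: false }
    }

    /// Feed one streamed chunk; returns whatever can be emitted safely.
    fn feed(&mut self, chunk: &str) -> Vec<Event> {
        self.buf.push_str(chunk);
        let mut out = Vec::new();
        loop {
            let tag = if self.in_think { "</think>" } else { "<think>" };
            match self.buf.find(tag) {
                Some(pos) => {
                    // Emit everything before the tag, then consume the tag.
                    let before: String = self.buf.drain(..pos).collect();
                    push_event(&mut out, self.in_think, before);
                    self.buf.replace_range(..tag.len(), "");
                    self.in_think = !self.in_think;
                }
                None => {
                    // Keep a trailing partial tag (e.g. "<thi") buffered.
                    let keep = held_back_len(&self.buf, tag);
                    let ready_len = self.buf.len() - keep;
                    let ready: String = self.buf.drain(..ready_len).collect();
                    push_event(&mut out, self.in_think, ready);
                    return out;
                }
            }
        }
    }
}

/// Length of the longest buffer suffix that is a proper prefix of `tag`.
fn held_back_len(buf: &str, tag: &str) -> usize {
    (1..tag.len())
        .rev()
        .find(|&k| buf.ends_with(&tag[..k]))
        .unwrap_or(0)
}

fn push_event(out: &mut Vec<Event>, in_think: bool, text: String) {
    if text.is_empty() {
        return;
    }
    out.push(if in_think { Event::Reasoning(text) } else { Event::Text(text) });
}
```

A real parser would also need an end-of-stream flush for text that is
still held back; the sketch omits it.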

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:30:25 +03:00
5a0861d639 fix(helexa-acp): forward Dispatch::Response to its awaiting router
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m31s
build-prerelease / Build cortex binary (push) Successful in 4m36s
CI / Test (push) Successful in 5m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m51s
build-prerelease / Package cortex RPM (push) Successful in 1m29s
build-prerelease / Build neuron-ampere (push) Successful in 7m18s
build-prerelease / Build neuron-ada (push) Successful in 5m6s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
The catch-all on_receive_dispatch handler was applying
respond_with_error to *every* Dispatch variant, including Response.
For Response variants, that call routes the error to the
ResponseRouter for the *outgoing* request — silently overwriting
the real reply from Zed with "Internal error: not implemented yet".

Every ACP roundtrip we issue (fs/read_text_file, fs/write_text_file,
session/request_permission, terminal/*) was therefore returning an
error to the tool runner regardless of what Zed actually responded.
The model saw uniformly-failing tools, gave up, and confabulated
plausible explanations.

Fix: pattern-match the Dispatch. Response → forward to its router
via respond_with_result. Request / Notification → keep the
"not implemented yet" error response as before.

Found via debug logs showing
  WARN helexa_acp::agent: unhandled ACP message method="fs/read_text_file"
right before every tool failure.
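
The shape of the fix, reduced to a self-contained sketch. `Dispatch`
and `Action` here are hypothetical stand-ins for the ACP library types,
which are not reproduced; the point is only that Response gets its own
match arm instead of sharing the catch-all.

```rust
#[derive(Debug, PartialEq)]
enum Dispatch {
    Request { method: String },
    Notification { method: String },
    Response { id: u64, result: Result<String, String> },
}

#[derive(Debug, PartialEq)]
enum Action {
    /// Hand the peer's reply to whoever is awaiting request `id`.
    ForwardToRouter { id: u64, result: Result<String, String> },
    /// Answer with the "not implemented yet" error.
    ReplyNotImplemented { method: String },
}

fn handle_unmatched(dispatch: Dispatch) -> Action {
    match dispatch {
        // Replies to our own outgoing requests pass through untouched.
        Dispatch::Response { id, result } => Action::ForwardToRouter { id, result },
        // Incoming messages we do not handle keep the old behaviour.
        Dispatch::Request { method } | Dispatch::Notification { method } => {
            Action::ReplyNotImplemented { method }
        }
    }
}
```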

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 12:16:21 +03:00
33652ac651 feat(helexa-acp): HELEXA_ACP_LOG_FILE env for editor-host logging
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m44s
CI / Test (push) Successful in 5m3s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m36s
build-prerelease / Build neuron-blackwell (push) Successful in 6m1s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ampere (push) Successful in 8m23s
build-prerelease / Build neuron-ada (push) Successful in 5m26s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 6m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
Editors that launch ACP agents (Zed today) don't reliably surface
the child's stderr — and `args` in an `agent_servers` config is
exec-args, not shell, so the usual `&>>` redirect trick doesn't
work. Add a HELEXA_ACP_LOG_FILE env var that, when set to an
absolute path, routes the tracing subscriber to append-write that
file (ANSI off) instead of stderr. RUST_LOG still controls levels.
Unopenable paths fall back to stderr with a warning so a typo
doesn't silence the agent.
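
A std-only sketch of the open-or-fall-back behaviour. The function
name is hypothetical and the tracing-subscriber wiring is omitted.
Treating a relative path the same as an unopenable one is an assumption
of this sketch; the commit only specifies the absolute-path case.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Returns the writer to log to, plus whether it is the requested file.
fn open_log_sink(log_file: Option<&str>) -> (Box<dyn Write + Send>, bool) {
    if let Some(raw) = log_file {
        if Path::new(raw).is_absolute() {
            // Append mode, creating the file if it does not exist yet.
            match OpenOptions::new().create(true).append(true).open(raw) {
                Ok(file) => return (Box::new(file), true),
                Err(err) => {
                    eprintln!("warning: cannot open log file {raw}: {err}; using stderr")
                }
            }
        } else {
            eprintln!("warning: log file {raw} is not an absolute path; using stderr");
        }
    }
    (Box::new(io::stderr()), false)
}
```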

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:47:28 +03:00
c297a54074 chore(helexa-acp): log raw bash output and tool result snippets
All checks were successful
CI / Format (push) Successful in 36s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m38s
build-prerelease / Build neuron-blackwell (push) Successful in 4m34s
build-prerelease / Build cortex binary (push) Successful in 4m49s
CI / Test (push) Successful in 5m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 7m46s
build-prerelease / Build neuron-ada (push) Successful in 7m38s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Diagnostic for "the tool ran but the model thinks it failed" cases.
Logs at debug level:

- exec_bash: terminal/create command + cwd, terminal/exit code/signal,
  terminal/output bytes + truncated flag + 200-char snippet.
- dispatch_tool_call: 200-char snippet of every successful result
  before it's folded back into history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:15:26 +03:00
0121a1930f feat(helexa-acp): inject and parse Qwen3 Hermes tool format
Some checks failed
CI / Format (push) Successful in 38s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 2m33s
CI / Test (push) Successful in 5m45s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m13s
build-prerelease / Build neuron-blackwell (push) Successful in 6m0s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 7m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
The OpenAI `tools` API field isn't load-bearing in this stack —
neuron's chat template renders only message.content, so tool
definitions sent that way never reach the model. Move both sides
of the tool conversation into the Qwen3 Hermes wire format the
model is actually trained on:

- Append a `# Tools` block to the system prompt describing every
  available function (qwen3::render_tool_block).
- Parse `<tool_call>{json}</tool_call>` markers out of the streamed
  content via a chunk-boundary-safe state machine (qwen3::ToolCallParser),
  surfacing them as the existing CompletionEvent::ToolCall* events
  so the agent loop doesn't change.
- Re-serialise assistant turns that called tools with inline
  `<tool_call>` blocks and tool results as user turns wrapped in
  `<tool_response>` (qwen3::render_assistant_with_tool_calls,
  render_tool_response).

Verified against cortex+Qwen3.6-27B: the model produces a
well-formed `<tool_call>{"name":"list_dir","arguments":{"path":"/tmp"}}</tool_call>`
in response to a Hermes-formatted prompt.
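
To make the wire format concrete, a small sketch of both directions.
These functions are illustrative and do not mirror the signatures in
qwen3.rs. The extractor works on a complete string, unlike the
streaming ToolCallParser, and the renderer assumes tool names need no
JSON escaping.

```rust
/// Render an assistant tool call as an inline block.
/// `arguments_json` must already be valid JSON.
fn render_tool_call(name: &str, arguments_json: &str) -> String {
    format!("<tool_call>\n{{\"name\": \"{name}\", \"arguments\": {arguments_json}}}\n</tool_call>")
}

/// Wrap a tool result for the follow-up user turn.
fn render_tool_response(body: &str) -> String {
    format!("<tool_response>\n{body}\n</tool_response>")
}

/// Pull the body of every complete <tool_call> block out of `text`.
/// An unterminated block is ignored.
fn extract_tool_call_bodies(text: &str) -> Vec<&str> {
    let (open, close) = ("<tool_call>", "</tool_call>");
    let mut bodies = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find(open) {
        let after_open = &rest[start + open.len()..];
        match after_open.find(close) {
            Some(end) => {
                bodies.push(after_open[..end].trim());
                rest = &after_open[end + close.len()..];
            }
            None => break,
        }
    }
    bodies
}
```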

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:06:38 +03:00
13f4c36aeb chore(helexa-acp): log outgoing chat-completion body at debug level
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 47s
CI / Clippy (push) Failing after 56s
CI / Test (push) Successful in 5m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m22s
build-prerelease / Build cortex binary (push) Successful in 6m51s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 7m14s
build-prerelease / Build neuron-ada (push) Successful in 5m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
Useful for diagnosing "the model isn't using tools" — confirming
that helexa-acp is in fact sending the `tools` array (and what
messages, system prompt, etc. accompany it) without having to
attach a packet capture upstream of cortex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:38:10 +03:00
4a51a54554 fix(helexa-acp): describe Stage 3 tools in the default system prompt
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 42s
CI / Clippy (push) Successful in 2m39s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
The Stage 2 prompt told the model it had no tools, which models
trained for caution then dutifully repeat back ("Stage 2 build: no
tools available — I can't read files…"). Stage 3 ships tools in the
CompletionRequest.tools array, but the system message was still
overriding that. Update the default prompt to list the five tools
and instruct the model to use them rather than asking the user to
paste contents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:33:17 +03:00
0609f1ac5d feat(helexa-acp): add tools, session modes, and permission gating
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m38s
CI / Test (push) Successful in 5m9s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m54s
build-prerelease / Build neuron-ampere (push) Successful in 7m54s
build-prerelease / Build neuron-ada (push) Successful in 4m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m14s
build-prerelease / Build cortex binary (push) Successful in 4m9s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 6m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 3m54s
Stage 3 introduces five tools (read_file, write_file, edit_file,
list_dir, bash) backed by ACP fs/* and terminal/* calls, a
ClientOps trait so the runner is mock-testable, two session modes
(default + bypassPermissions) with session/set_mode honouring them,
and a tool-call loop in the agent that streams the model, dispatches
each call, feeds results back into history, and re-enters until the
model finishes or MAX_TOOL_ROUNDS is hit.
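
The control flow of the tool-call loop can be sketched without any ACP
or provider types. Everything below is hypothetical scaffolding: the
model and the tool runner are plain closures, history is a list of
strings, and the value of MAX_TOOL_ROUNDS is made up for the example.

```rust
/// Illustrative cap; the real constant's value is not stated here.
const MAX_TOOL_ROUNDS: usize = 8;

enum ModelTurn {
    /// The model answered without requesting tools.
    Final(String),
    /// The model requested one or more tool calls.
    ToolCalls(Vec<String>),
}

/// Run the model, dispatch its tool calls, feed results back into
/// history, and repeat. Returns None if the round cap is reached.
fn run_tool_loop<M, T>(mut model: M, mut run_tool: T, history: &mut Vec<String>) -> Option<String>
where
    M: FnMut(&[String]) -> ModelTurn,
    T: FnMut(&str) -> String,
{
    for _ in 0..MAX_TOOL_ROUNDS {
        match model(history.as_slice()) {
            ModelTurn::Final(text) => return Some(text),
            ModelTurn::ToolCalls(calls) => {
                for call in calls {
                    let result = run_tool(&call);
                    history.push(format!("tool_call: {call}"));
                    history.push(format!("tool_result: {result}"));
                }
            }
        }
    }
    None
}
```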

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:01:32 +03:00
96fc379893 feat(helexa-acp): wire ACP agent loop for text-only conversations
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 41s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m35s
build-prerelease / Build cortex binary (push) Successful in 5m26s
CI / Test (push) Successful in 5m43s
build-prerelease / Build neuron-blackwell (push) Successful in 5m47s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 8m13s
build-prerelease / Build neuron-ada (push) Successful in 5m28s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Stage 2 lands the agent loop on top of the Stage 1 scaffold: session
state with per-session cancellation, a system-prompt builder honouring
HELEXA_ACP_SYSTEM_PROMPT_PATH / system_prompt_path TOML, and handlers
for initialize / session/new / session/prompt / session/cancel that
stream provider output back as session/update notifications. Verified
end-to-end against cortex from Zed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 09:46:22 +03:00
e267f583e1 chore(neuron): rustfmt drift in is_device_fault test
Some checks failed
CI / Format (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 58s
CI / Clippy (push) Failing after 3m43s
CI / Test (push) Successful in 5m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m48s
build-prerelease / Build neuron-blackwell (push) Successful in 6m10s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
build-prerelease / Build neuron-ampere (push) Successful in 7m41s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 9m18s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
One assert! call grew past the line limit after the previous commits;
cargo fmt --all picked it up. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:13:55 +03:00
e23d5011d0 feat(helexa-acp): scaffold ACP bridge with provider trait + OpenAI chat
Adds a new workspace crate `helexa-acp` (binary, Apache-2.0) — the
start of "the missing ACP binary" for multi-endpoint LLM setups
mixing public APIs, private LAN deployments, and various wire
formats. Today it speaks OpenAI /v1/chat/completions; the
Provider trait is the seam that lets OpenAI Responses, Anthropic
/v1/messages, and other wire formats slot in later without touching
the agent loop.

The crate is intentionally self-contained — no dependencies on the
other workspace crates (cortex-core, cortex-gateway, neuron) — so a
future migration to a dedicated GitHub repo is a Cargo.toml-only
change. All deps come from crates.io.

This commit lands:

  * `config.rs` — TOML config at $XDG_CONFIG_HOME/helexa-acp/config.toml
    with multi-endpoint support (each `[[endpoints]]` declares its
    name, base_url, wire_api, default_model, optional API key /
    api_key_env). Falls back to env-only single-endpoint config when
    no TOML exists (HELEXA_ACP_BASE_URL, HELEXA_ACP_MODEL, etc.). The
    `endpoint:model` selector syntax is validated and tested.

  * `provider/mod.rs` — `Provider` trait + provider-agnostic types
    (`CompletionRequest`, `CompletionEvent`, `Message`, `ToolCall`,
    `ToolSpec`, `Role`, `UsageStats`). Agent loop consumes these
    without knowing the wire format on the other side.

  * `provider/openai_chat.rs` — `OpenAIChatProvider` impl. Compatible
    with cortex, LM Studio, Ollama (compat mode), OpenRouter, OpenAI
    itself. Streams via reqwest + eventsource-stream + async-stream.
    Surfaces text deltas, reasoning deltas (for models that emit
    `reasoning_content`), tool-call lifecycle (start, args-delta,
    completion), usage, finish reason. Cancellation-token aware.

  * `main.rs` — tokio + stderr-only tracing-subscriber + Stdio
    transport. Builds a provider per configured endpoint at startup,
    surfacing config mistakes before the editor even initializes.
    Currently responds to `initialize`; everything else stubs to
    `not implemented yet` until the agent loop lands in the next
    commit.
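
As an illustration of the `endpoint:model` selector mentioned under
`config.rs`, a minimal parsing sketch. The function name and the exact
validation rules are assumptions; in particular, splitting on the
first colon only (so model names may themselves contain `:`) is a
choice made for this example.

```rust
/// Split "endpoint:model" into its two halves.
/// Returns None when there is no colon or either half is empty.
fn parse_selector(selector: &str) -> Option<(&str, &str)> {
    let (endpoint, model) = selector.split_once(':')?;
    if endpoint.is_empty() || model.is_empty() {
        return None;
    }
    Some((endpoint, model))
}
```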

12 unit tests pass — encoder shape, decoder shape (text-only,
tool-call progressive, cancellation, malformed-chunk recovery),
config parsing (multi-endpoint TOML, env fallback, validation).

The `#![allow(dead_code)]` on `provider/mod.rs` is temporary — the
agent loop in the next commit reads every field. It's noted in the
module-level docstring so the next reader knows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 08:13:47 +03:00
249b2e5c98 fix(neuron): only poison the model on actual device faults
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Clippy (push) Successful in 2m22s
CI / Test (push) Successful in 4m55s
build-prerelease / Build cortex binary (push) Successful in 4m24s
build-prerelease / Build neuron-blackwell (push) Successful in 5m49s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 8m7s
build-prerelease / Build neuron-ada (push) Successful in 5m0s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
CI / Format (push) Failing after 33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Previously every inference Err — shape mismatch, NaN logits, tokenizer
error, missing handle — marked the model poisoned and rejected every
subsequent request until an operator unloaded and reloaded it. The benjy
incident on 2026-05-27 showed how this misfires: a concurrency bug
produced a `broadcast_add: shape mismatch` error that had nothing to
do with CUDA, but the model was taken down anyway.

Add `is_device_fault(err_chain: &str)` — a conservative classifier
that returns false only for errors we know are pre-kernel / CPU-side
(shape mismatches, NaN logits, tokenize/detokenize, missing handle,
DecodeStream, empty prompt). Everything else defaults to true so a
genuine driver fault still poisons.
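
A sketch of what such a default-to-true classifier can look like. The
signature matches the one named above, but the substring patterns are
illustrative guesses, not the list the real function uses.

```rust
/// Returns false only for error chains that match a known
/// CPU-side / pre-kernel pattern; anything else counts as a device fault.
fn is_device_fault(err_chain: &str) -> bool {
    // Illustrative patterns only. "tokeniz" also covers "detokenize".
    const NON_DEVICE_PATTERNS: &[&str] = &[
        "shape mismatch",
        "nan logits",
        "tokeniz",
        "missing handle",
        "decodestream",
        "empty prompt",
    ];
    let lowered = err_chain.to_lowercase();
    !NON_DEVICE_PATTERNS.iter().any(|p| lowered.contains(p))
}
```

Erring toward true means an unrecognised error still takes the model
down, which is the conservative direction for genuine driver faults.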

Applied at all six poisoning sites:
  - chat_completion CUDA worker path
  - chat_completion CPU spawn_blocking path
  - chat_completion_stream CUDA worker path
  - chat_completion_stream CPU spawn_blocking path
  - chat_completion_tp non-streaming wrapper
  - chat_completion_tp_stream spawned task

Each site now logs either "model marked poisoned" (device fault) or
"model NOT marked poisoned" (non-device) so the journal makes the
classification visible. Tests cover the known non-device patterns and
a couple of real CUDA driver messages.

Pairs with the inference_lock commit (c59da83): together they
eliminate both the cause of the spurious poisoning we just observed
(the shape mismatch) AND the over-reaction to it (the unconditional
poison). Each fix is independently useful but the combination is
what makes the system actually robust to concurrent agent workloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:57:48 +03:00
c59da83636 fix(neuron): serialise single-GPU inference per loaded model
Two concurrent chat_completion requests against the same single-GPU
model could interleave their `clear_kv_cache → forward(chunk0) →
forward(chunk1) → ...` sequences. The device-worker channel serialises
individual jobs but not the sequence boundary, so the cache could end
up holding tokens from one request while another's mask was sized for
its own prompt — producing a shape mismatch mid-prefill.

Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories`
and `memorize solutions` extensions fired 4ms apart against
Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV
cache, and request a08b4a's chunk 0 forward produced scores of shape
[1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add
failed, both requests bubbled the error up, both flipped the model to
poisoned.

Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring
the TpLoadedModel.pool lock that the TP path already held. Acquire
it at the start of `chat_completion` and inside the spawned task of
`chat_completion_stream` (so the role chunk goes out immediately and
only the inference work queues behind the lock).

The CPU branch uses `blocking_lock` from inside spawn_blocking; the
CUDA branch uses async `.lock().await` inside tokio::spawn.

Throughput impact: zero. The GPU was already serialised at the
device-worker channel — multiple requests just produced corrupt KV
cache state instead of clean serial throughput. The lock makes the
existing serialisation honest.
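
The difference between per-job and per-sequence serialisation can be
shown with a std-only analogue. This sketch uses std::sync::Mutex and
threads in place of tokio::sync::Mutex and async tasks, and a Vec in
place of the KV cache; the struct layout is hypothetical. The inner
`kv_cache` mutex plays the role of the device-worker channel (each
individual job is atomic), while `inference_lock` covers the whole
clear-then-forward sequence.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

struct LoadedModel {
    /// Held for the full clear -> forward(chunk0) -> forward(chunk1) ... run.
    inference_lock: Mutex<()>,
    /// Stand-in for the KV cache; each access is atomic on its own.
    kv_cache: Mutex<Vec<u32>>,
}

impl LoadedModel {
    /// Returns true if the cache holds only this request's entries afterwards.
    fn chat_completion(&self, request_id: u32, chunks: usize) -> bool {
        let _guard = self.inference_lock.lock().unwrap();
        self.kv_cache.lock().unwrap().clear();
        for _ in 0..chunks {
            self.kv_cache.lock().unwrap().push(request_id);
            // Give the other thread every chance to interleave.
            thread::yield_now();
        }
        let cache = self.kv_cache.lock().unwrap();
        cache.len() == chunks && cache.iter().all(|id| *id == request_id)
    }
}

/// Fire two requests at the same model concurrently.
fn run_concurrently(model: Arc<LoadedModel>) -> Vec<bool> {
    let handles: Vec<_> = (1..=2u32)
        .map(|id| {
            let m = Arc::clone(&model);
            thread::spawn(move || m.chat_completion(id, 64))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Without the `_guard` line the two threads can interleave their pushes,
which is the analogue of the mixed-up cache described above.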

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:54:04 +03:00
f05882369d fix(neuron): don't poison the model on tokio JoinError panics
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 34s
CI / Clippy (push) Successful in 2m18s
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Package cortex RPM (push) Successful in 1m28s
build-prerelease / Build neuron-ampere (push) Successful in 8m25s
build-prerelease / Build neuron-ada (push) Successful in 8m54s
CI / Test (push) Successful in 4m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
CUDA driver failures propagate as Err through `?` and become
`Ok(Err(InferenceError::Other(_)))` from the spawned task — those are
real device faults and still poison the model. Tokio JoinError is
different: it fires on Rust-level panic (tokenizer bug, sampler bug,
serialisation, the UTF-8 slice that landed in commit bd04d7f before
the fix) or task cancellation. Those don't touch the device context,
so failing the one request without tearing down the model is correct.

Two sites changed:

  - chat_completion's CPU spawn_blocking handler — JoinError no longer
    sets loaded.poisoned.
  - chat_completion_tp's tokio::spawn wrapper — JoinError no longer
    sets tp_for_marker.poisoned. The inner-Err case still does.
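
The three outcomes can be illustrated with std threads, whose `join`
reports a panic as `Err` in much the same way a tokio JoinError does.
This is an analogue under that assumption, not the neuron code;
`Verdict` and `classify` are hypothetical names.

```rust
use std::thread;

#[derive(Debug, PartialEq)]
enum Verdict {
    /// Task finished and inference succeeded.
    Healthy,
    /// Task panicked: fail this request, leave the model loaded.
    FailRequestOnly,
    /// Task finished but inference returned Err: mark the model poisoned.
    Poison,
}

fn classify<F>(work: F) -> Verdict
where
    F: FnOnce() -> Result<String, String> + Send + 'static,
{
    match thread::spawn(work).join() {
        Ok(Ok(_)) => Verdict::Healthy,
        // Inner error propagated through `?`: treated as a device fault here.
        Ok(Err(_)) => Verdict::Poison,
        // The thread panicked; the device context was never involved.
        Err(_) => Verdict::FailRequestOnly,
    }
}
```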

Each path logs the cause (panicked / was cancelled / ended abnormally)
explicitly so the journal makes the new behaviour obvious — search for
"model NOT marked poisoned" to find these events.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:02:52 +03:00
bd04d7f580 fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic
When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji)
across multiple tokens, the previous "decode the cumulative token list,
byte-slice the delta against a stored prefix" pattern would panic with
'start byte index N is not a char boundary; it is inside <emoji>'.

The race: at step N the tokenizer renders the partial bytes as U+FFFD
(3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4
bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the
codepoint in step N+1's `full` string, and `&str[start..]` panics.

Replace with tokenizers' `DecodeStream::step(id)` which maintains an
internal byte buffer across token boundaries and only emits when a
clean codepoint completes. Applied at all three SSE emission sites:

  - stream_inference_via_worker (single-GPU CUDA stream)
  - chat_completion_tp_stream's spawned task (TP stream)
  - run_inference_streaming (CPU stream)

The shared emit helper splits into emit_delta (async, mpsc::send) and
emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its
existing send semantics. The old emit_chunk helper that did the
unsafe full-decode-and-slice is removed entirely.

Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call
response after a long agent-zero session; the spawned TP stream task
panicked at candle.rs:2648. The model itself stayed healthy (no CUDA
fault), only the one streaming request died.
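
The buffering idea can be illustrated with the standard library alone. This sketch is not the tokenizers `DecodeStream` implementation (which works on token ids); `Utf8Stream` is a hypothetical type that takes the raw bytes of each token and only emits text once complete codepoints are available.

```rust
/// Hypothetical incremental decoder: buffers raw bytes until they form
/// complete UTF-8 codepoints, so no caller ever slices inside a character.
#[derive(Default)]
pub struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    /// Feed the bytes of one token; returns text that is safe to emit, if any.
    pub fn step(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(token_bytes);
        let valid_len = match std::str::from_utf8(&self.pending) {
            Ok(s) => s.len(),
            // Incomplete trailing sequence: only the valid prefix is ready.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            // Genuinely invalid bytes: decode lossily and reset the buffer.
            Err(_) => {
                let text = String::from_utf8_lossy(&self.pending).into_owned();
                self.pending.clear();
                return Some(text);
            }
        };
        if valid_len == 0 {
            return None;
        }
        // Keep the incomplete tail buffered, hand back the validated prefix.
        let rest = self.pending.split_off(valid_len);
        let ready = std::mem::replace(&mut self.pending, rest);
        Some(String::from_utf8(ready).expect("prefix was validated as UTF-8"))
    }
}
```

Feeding the first two bytes of a four-byte emoji yields `None`; the emoji is emitted whole once the remaining two bytes arrive.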

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 18:01:24 +03:00
1e13889392 feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m15s
CI / Test (push) Successful in 5m9s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m1s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-blackwell (push) Successful in 11m7s
build-prerelease / Build neuron-ampere (push) Successful in 12m16s
build-prerelease / Build neuron-ada (push) Successful in 12m30s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload
cycle observed on beast under agent-zero workloads. Three changes,
all keyed off env-driven knobs so an operator can tune without a
rebuild:

1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512). The
   initial forward is split into N-token windows, each with a
   monotonically growing offset. KV cache accumulates across chunks
   exactly as it would under one big prefill; only the final chunk's
   logits are kept for sampling. Activation memory now scales with
   chunk size instead of prompt length, so a 13k-token prompt stops
   holding tens of GB of intermediate activations live at once.

   Wired into all six prefill call sites:
   - run_inference / run_inference_streaming (CPU path)
   - run_inference_via_worker / stream_inference_via_worker (CUDA
     single-GPU through device worker)
   - chat_completion_tp_inner / chat_completion_tp_stream (TP via
     WorkerPool)

   Three helpers — chunked_prefill_local, chunked_prefill_via_worker,
   chunked_prefill_tp — own the loop shape so the chunking semantics
   stay identical across paths. Per-chunk debug log shows progress.

2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384).
   Requests above the cap return a structured 400 with
   `code: prompt_too_long` rather than going through the prefill and
   discovering the limit by OOMing partway through. New
   InferenceError::PromptTooLong variant.

3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500).
   If `vram_free_mb` is below the threshold at request start (e.g.
   another concurrent request is mid-prefill), reject with a clean
   503 + `code: insufficient_vram` rather than starting work that
   will OOM. New InferenceError::InsufficientVram variant. CPU loads
   (vram=0 sentinel) skip this check.
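
The loop shape described in item 1 can be sketched as follows. The `forward` closure signature (token window plus position offset, returning logits for the window's last token) is an assumption made for illustration; the real helpers are chunked_prefill_local, chunked_prefill_via_worker and chunked_prefill_tp.

```rust
/// Sketch of a chunked prefill: feeds `tokens` to `forward` in windows of
/// `chunk_tokens`, passing the position offset of each window's first token.
/// Only the logits returned for the final chunk are kept.
pub fn chunked_prefill<F, E>(
    tokens: &[u32],
    chunk_tokens: usize,
    mut forward: F,
) -> Result<Option<Vec<f32>>, E>
where
    F: FnMut(&[u32], usize) -> Result<Vec<f32>, E>,
{
    let chunk_tokens = chunk_tokens.max(1); // guard against a zero-sized chunk
    let mut offset = 0;
    let mut last_logits = None;
    for chunk in tokens.chunks(chunk_tokens) {
        // Each call is expected to extend the KV cache; logits from earlier
        // chunks are overwritten and dropped here.
        last_logits = Some(forward(chunk, offset)?);
        offset += chunk.len();
    }
    Ok(last_logits)
}
```

With a 1200-token prompt and a chunk size of 512, `forward` is called three times with offsets 0, 512 and 1024; an empty prompt returns `Ok(None)`.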

All three gates fire BEFORE any device work, so a rejected request
costs ~one tokenisation pass and never touches the worker thread —
poison cascades from rejected work are now impossible.
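
A minimal sketch of the two rejection gates, assuming hypothetical field layouts for the new error variants and a hypothetical `Limits` struct; only the env var names, defaults, status codes and `code` strings come from the description above.

```rust
/// Field names on the variants are assumptions for illustration.
#[derive(Debug, PartialEq)]
pub enum InferenceError {
    PromptTooLong { tokens: usize, max: usize },
    InsufficientVram { free_mb: u64, min_mb: u64 },
}

impl InferenceError {
    /// HTTP status and machine-readable code for the structured response.
    pub fn status_and_code(&self) -> (u16, &'static str) {
        match self {
            InferenceError::PromptTooLong { .. } => (400, "prompt_too_long"),
            InferenceError::InsufficientVram { .. } => (503, "insufficient_vram"),
        }
    }
}

pub struct Limits {
    pub max_prompt_tokens: usize,
    pub min_free_vram_mb: u64,
}

/// Reads a knob from the environment, falling back to `default` when the
/// variable is unset or does not parse.
fn env_or<T: std::str::FromStr>(name: &str, default: T) -> T {
    std::env::var(name).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

impl Limits {
    pub fn from_env() -> Self {
        Limits {
            max_prompt_tokens: env_or("NEURON_MAX_PROMPT_TOKENS", 16384),
            min_free_vram_mb: env_or("NEURON_MIN_FREE_VRAM_MB", 1500),
        }
    }
}

/// Runs before any device work. `vram_free_mb == 0` is treated as the
/// CPU-load sentinel and skips the VRAM gate.
pub fn preflight(
    limits: &Limits,
    prompt_tokens: usize,
    vram_free_mb: u64,
) -> Result<(), InferenceError> {
    if prompt_tokens > limits.max_prompt_tokens {
        return Err(InferenceError::PromptTooLong {
            tokens: prompt_tokens,
            max: limits.max_prompt_tokens,
        });
    }
    if vram_free_mb != 0 && vram_free_mb < limits.min_free_vram_mb {
        return Err(InferenceError::InsufficientVram {
            free_mb: vram_free_mb,
            min_mb: limits.min_free_vram_mb,
        });
    }
    Ok(())
}
```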

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:46:54 +03:00
6e1c1dd0fc ci: retry clippy + test up to 3 times on spurious sccache failures
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m25s
CI / Test (push) Successful in 5m7s
build-prerelease / Build cortex binary (push) Successful in 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-blackwell (push) Successful in 11m2s
build-prerelease / Build neuron-ada (push) Successful in 12m23s
build-prerelease / Build neuron-ampere (push) Successful in 12m26s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
sccache occasionally fails mid-compile with race-condition errors that
clear on a re-run without any code changes. Rather than tracking that
down right now, wrap the two affected steps in a bash loop that retries
up to three times with a 5-second pause. Real failures still surface;
they just take ~10s longer to fail.

fmt is left as a single invocation — it's a one-shot syntactic check,
not a build, and isn't subject to the same sccache races.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:55:18 +03:00
35876954cd chore(neuron): default tracing filter to info (was info,neuron=debug)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m43s
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Build neuron-blackwell (push) Successful in 3m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m12s
build-prerelease / Build neuron-ada (push) Successful in 5m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m55s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s
Production deployments that want neuron-internal debug detail (e.g.
trim_device_pool's per-clear-kv line, slab inserts/drops) override
RUST_LOG explicitly via systemd. Defaulting to debug for the whole
neuron target produced a lot of journal volume that wasn't useful in
the common case.

beast already sets RUST_LOG=debug in
/etc/systemd/system/neuron.service.d/local.conf, so beast's verbosity
is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:47:30 +03:00
740299bd9d chore(neuron/beast): switch default-model quant from q5k to q6k
Some checks failed
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m22s
build-prerelease / Build neuron-blackwell (push) Successful in 3m35s
CI / Test (push) Successful in 5m8s
build-prerelease / Build cortex binary (push) Successful in 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m16s
build-prerelease / Build neuron-ampere (push) Successful in 5m12s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
q5k produced NaN logits on Qwen/Qwen3.6-27B under candle TP=2 (sampler
fell over with "logits unhealthy nan: 248320/248320"). q6k is the
quant that worked well in production under mistral.rs on the same
hardware, so it's the right baseline for verifying the mempool-trim
fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:36:18 +03:00
cdf0f4e66d fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM
cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync
returns memory to the device's default mempool, not to the OS), so
mem_get_info under-reports free VRAM between requests. With
Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw
~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into
CUBLAS_STATUS_INTERNAL_ERROR depending on quant.

Add a cuda-gated trim_device_pool helper that, after each successful
clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool,
0) against the device's default mempool. Failures (no async-alloc
support, transient driver errors) are non-fatal and log at debug. The
before/after free-VRAM delta is logged so an operator can correlate the
trim with the next request's prefill VRAM.

ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors
correctly; the leak was strictly at the cudarc pool layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:36:13 +03:00
c4954e0eed docs: per-device worker thread architecture (phase 5 of refactor)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m18s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
CI / Test (push) Successful in 5m10s
build-prerelease / Build cortex binary (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ampere (push) Successful in 5m16s
build-prerelease / Build neuron-ada (push) Successful in 4m58s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 10m36s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Closes the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md.

CLAUDE.md:
- New "Per-device worker thread (neuron)" section under Key design
  decisions, covering the three load-bearing properties (context
  locality, drop safety, poisoning blast radius), the CPU-fallback
  exception, and pointers to the canonical narrative in
  crates/neuron/src/harness/device_worker/mod.rs's module doc-comment.
- New 2026-05-27 addendum dating the migration and naming the four
  PR commits (Phase 1: 081b532, Phase 2: b179204, Phase 3: 76ab24d,
  Phase 4: b4f3576). Same convention as the 2026-04-15 and 2026-05-18
  addenda.

README.md:
- One paragraph in "Node setup" noting the per-device thread pattern
  with a pointer to CLAUDE.md and the device_worker module.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:15:43 +03:00
b4f3576d82 refactor(neuron): phase 4 — model loads move onto the device worker
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m25s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m7s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Final structural slice of the per-device CUDA context-ownership
refactor. The four remaining spawn_blocking sites that did CUDA work
on the leader are gone:

- Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) →
  `Job::LoadGguf` dispatched on the worker.
- Single-GPU dense load (`load_arch_dense` spawn_blocking) →
  `Job::LoadDense` on the worker.
- TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) →
  `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()`
  directly — no cross-thread `Arc<Comm>` transfer, no `SendComm`
  wrapper for this path.

The Phase 2 / Phase 3 bridges that moved freshly-built models across
the channel boundary (`Job::TransferIn`, `Job::TransferInTp`,
`Job::CloneLeaderComm`) are removed. Models are now constructed on
the worker thread directly; the slab gets populated by `insert_arch` /
the inline `tp_models.insert` in dispatch handlers.

What this phase preserves:

- CPU loads still use `tokio::task::spawn_blocking` against
  `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and
  channel overhead would only add latency. Four `spawn_blocking`
  references remain in `candle.rs` (load_arch_gguf, load_arch_dense,
  chat_completion, chat_completion_stream) and all are deliberate
  CPU-only fallback.
- Public API unchanged. `Harness::load_model`, `chat_completion`,
  HTTP routes all keep identical signatures.

What this phase removes:

- The `SendComm` wrapper is no longer used in the load path now that
  the Phase 3 bridge that justified it is gone. It stays in
  `nccl_state.rs` as a holdover from Phases 1–3 and for any future
  cross-thread Comm move; consider deleting it in a follow-up.
- `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and
  their handle convenience methods deleted.
- The leader_device parameter on `load_dense_shard` is now
  underscore-prefixed: it is unused since the worker has its own bound
  device. Removing the arg outright is a public-API change; keeping the
  underscore prefix preserves the signature and signals deadness
  without churn.

Helper relocation:

- `LlamaDense::from_parts` is a new pub(crate) constructor so the
  worker-thread loader can build a `LlamaDense` without going through
  the original `load_arch_dense` async function.
- `check_dense_config_supported` is bumped to `pub(crate)` for the
  same reason.

Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/`
returns only CPU-fallback hits in `candle.rs` + doc-comment references
to the old design. All four leader-side CUDA `spawn_blocking` sites
are gone.

fmt + clippy clean; 37 lib tests + all integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:24:38 +03:00
76ab24d98c refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker
Some checks failed
CI / Format (push) Successful in 29s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Test (push) Failing after 58s
CI / Clippy (push) Successful in 2m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m13s
build-prerelease / Build neuron-blackwell (push) Successful in 4m1s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Third slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The
leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers,
the leader-side KV cache reset, and the TP forward step itself now all
run on the per-device worker thread — the same OS thread that bound
the leader's `CudaContext` at startup.

What this phase changes:

- `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3
  bridge — Phase 4 removes), `TransferInTp`, `DropTp`, `TpClearKv`,
  `TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key.
- `DeviceWorkerState` gains `nccl: NcclState` and
  `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter).
- `WorkerPool` loses its `leader_nccl` field; gains a
  `leader_worker: Arc<DeviceWorkerHandle>` passed at construction.
  `init_nccl`, `nccl_sanity_check`, `load_dense_shard`,
  `generate_step`, `clear_kv_cache` all route their leader-side ops
  through `Job::Nccl*` / `Job::Tp*` instead of spawn_blocking against
  a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead
  of a device-resident `Tensor` — the worker copies logits to CPU
  before reply so the async caller can sample on a CPU candle
  tensor with zero device-context touch.
- `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque
  `leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the
  worker thread's slab; both the model's CUDA tensors and the
  embedded `Arc<Comm>` clones release on the same thread that
  allocated them (the Drop semantics constraint cudarc forces).
- `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still
  runs in spawn_blocking and needs the leader's `Arc<Comm>` to build
  the row-parallel layers' AllReduce ops. The Job clones the Comm
  out of the worker's NcclState and ships it back as `SendComm`.
  Phase 4 deletes this bridge when the load itself moves onto the
  worker.
- `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the
  no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`)
  still flow through the same channel uniformly; the cuda-only
  TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp)
  remain gated.

What this phase doesn't touch (yet):

- TP shard load itself — still spawn_blocking, bridged via
  `CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and
  reads `state.nccl.comm()` directly inside the worker.
- Single-GPU model loads — still spawn_blocking, transferred via
  `Job::TransferIn`. Phase 4 moves them.
- `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete`
  helpers — still present, used inside spawn_blocking load closures.
  Phase 4 cleanup folds them into `dispatch.rs`.

`tp/mod.rs::WorkerPool::spawn` gained a required
`leader_worker: Arc<DeviceWorkerHandle>` argument. Three external
callers were updated: `CandleHarness::load_tp` (passes the cached
device worker), `main.rs::tp_smoke` (spawns a fresh worker), and
the two `tp_worker_lifecycle*.rs` integration tests.

Public API unchanged. fmt + clippy clean; 37 lib tests + all
integration tests pass. CUDA-only TP integration smoke deferred to
the next deploy on beast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:16:02 +03:00
b179204fd3 refactor(neuron): phase 2 — single-GPU forward + clear_kv route through device worker
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
CI / Format (push) Successful in 34s
CI / Clippy (push) Successful in 2m12s
build-prerelease / Resolve version stamps (push) Successful in 3m41s
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m32s
build-prerelease / Build neuron-ampere (push) Successful in 5m20s
build-prerelease / Build cortex binary (push) Successful in 12m20s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Second slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two
spawn_blocking sites in `chat_completion` and `chat_completion_stream`
now route through the device worker thread on CUDA loads. CPU loads
keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's
no context to own and the channel hop would only add latency.

What this phase changes:

- `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`. The
  worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>`
  slab and a `next_handle` counter for minting opaque handles.
- `LoadedModel.arch: Arc<Mutex<ModelArch>>` → `Option<Arc<Mutex<ModelArch>>>`,
  plus a new `arch_handle: Option<ArchHandle>` field. The two are
  mutually exclusive: CUDA loads set `arch_handle = Some(_)` after
  transferring the boxed arch into the worker's slab; CPU loads keep
  `arch = Some(_)` for the legacy spawn_blocking path.
- New `run_inference_via_worker` and `stream_inference_via_worker`
  drive the prefill + decode loop by sending `Job::ForwardLogits` per
  step; the worker copies the resulting `[vocab]` logits to a
  CPU-side `Vec<f32>` before reply, so the async caller never holds a
  device-resident tensor. `apply_repeat_penalty` and
  `LogitsProcessor::sample` run on a CPU candle tensor; no context
  binding side-effects on tokio worker threads.
- `logits_health_slice(&[f32])` complements the existing
  `logits_health(&Tensor)` so the new worker paths can compute
  health stats directly from the CPU vec.
- `unload_model` for the single-GPU CUDA path now sends
  `Job::DropArch { handle }` to the worker so the `Box<ModelArch>`
  drops on the thread that allocated its CUDA tensors. `Drop` therefore
  runs with that thread's bound context, so the memory is freed on the
  context that allocated it.
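
The handle-keyed slab described above might look roughly like this. It is a sketch under assumptions: `ModelArch` is reduced to a placeholder struct, and `Slab`, `insert`, `get_mut` and `drop_arch` are hypothetical names, not the worker's actual dispatch state.

```rust
use std::collections::HashMap;

/// Opaque key handed to async callers instead of the model itself.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ArchHandle(u64);

/// Placeholder for the real model type, which owns device tensors.
pub struct ModelArch {
    pub name: String,
}

/// Worker-owned storage: lives on the worker thread and is never shared.
#[derive(Default)]
pub struct Slab {
    models: HashMap<ArchHandle, Box<ModelArch>>,
    next_handle: u64,
}

impl Slab {
    /// Stores a model and mints a fresh handle for it.
    pub fn insert(&mut self, arch: Box<ModelArch>) -> ArchHandle {
        let handle = ArchHandle(self.next_handle);
        self.next_handle += 1;
        self.models.insert(handle, arch);
        handle
    }

    /// Looks a model up for a forward or clear-KV job.
    pub fn get_mut(&mut self, handle: ArchHandle) -> Option<&mut ModelArch> {
        self.models.get_mut(&handle).map(|b| &mut **b)
    }

    /// Removes the entry; the Box is dropped on the calling thread, which
    /// is the worker thread when this runs inside the dispatch loop.
    pub fn drop_arch(&mut self, handle: ArchHandle) -> bool {
        self.models.remove(&handle).is_some()
    }
}
```

Because callers only ever hold an `ArchHandle`, the model's destructor cannot run on an arbitrary tokio thread.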

What this phase doesn't touch (yet):

- TP forward, TP load, NCCL bring-up — still on spawn_blocking. Phase 3.
- Single-GPU model load — still spawn_blocking, followed by a
  `Job::TransferIn` to move the freshly-built `ModelArch` into the
  worker slab. Phase 4 moves the load itself onto the worker thread
  and eliminates the bootstrap TransferIn.
- The `device_vram_mb` / `cuda_mem_mb` helpers — still present and
  used by the construction-time logs running inside spawn_blocking
  loads. Phase 4 cleanup folds them into `dispatch.rs`.

Public API unchanged. fmt + clippy clean; 37 lib tests + all
integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 09:55:08 +03:00
081b532387 refactor(neuron): phase 1 — per-device worker thread, VRAM queries route through it
Some checks failed
CI / Format (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Clippy (push) Failing after 59s
build-prerelease / Build neuron-blackwell (push) Successful in 3m30s
CI / Test (push) Successful in 4m47s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m17s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
build-prerelease / Build neuron-ampere (push) Successful in 5m16s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
First slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. Adds the
infrastructure for a dedicated OS thread per CUDA device that owns the
device's `CudaContext` for the daemon's lifetime, and routes the 8
async-context `device_vram_mb()` call sites in candle.rs through it.

What this phase changes:

- New module `harness/device_worker/` (mod.rs, jobs.rs, dispatch.rs).
  `DeviceWorkerHandle::spawn(idx)` creates a named OS thread
  (`cuda-dev-N`), binds `CudaContext::new(idx)` once at startup, and
  enters a dispatch loop reading `Job`s off a `std::sync::mpsc` channel.
  Replies cross back via `tokio::sync::oneshot::Sender` so async callers
  await without parking a tokio worker.
- Two Job variants: `QueryVram` and `Shutdown`. Phases 2–4 add Forward,
  ClearKv, NCCL init/sanity, and load variants.
- `LoadedModel` and `TpLoadedModel` gain a `worker` field populated at
  load time by a new `CandleHarness::ensure_device_worker(idx)` method
  that lazily spawns + caches one worker per device index.
- Per-model `query_vram()` convenience method on both struct types so
  the 8 call sites in chat_completion / chat_completion_stream /
  chat_completion_tp_inner / chat_completion_tp_stream become
  `loaded.query_vram().await` (or `tp.query_vram().await`) — same field
  values logged, just sourced from the owner thread instead of the
  caller thread.
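
The channel mechanics can be sketched with the standard library only. This is an illustration, not the device_worker module: the reply travels over a `std::sync::mpsc` channel instead of `tokio::sync::oneshot`, no CUDA context is bound, and the VRAM figure is a placeholder constant.

```rust
use std::sync::mpsc;
use std::thread;

pub enum Job {
    QueryVram { reply: mpsc::Sender<u64> },
    Shutdown,
}

/// Returned when the worker thread has already exited.
#[derive(Debug, PartialEq)]
pub struct Gone;

pub struct DeviceWorkerHandle {
    tx: mpsc::Sender<Job>,
    join: Option<thread::JoinHandle<()>>,
}

impl DeviceWorkerHandle {
    /// Spawns a named OS thread that owns the device for its lifetime.
    pub fn spawn(idx: usize) -> std::io::Result<Self> {
        let (tx, rx) = mpsc::channel::<Job>();
        let join = thread::Builder::new()
            .name(format!("cuda-dev-{idx}"))
            .spawn(move || {
                // The real worker would bind its CUDA context here, once.
                let fake_free_mb: u64 = 24_000; // placeholder, not a real query
                for job in rx {
                    match job {
                        Job::QueryVram { reply } => {
                            // The caller may have given up; ignore send errors.
                            let _ = reply.send(fake_free_mb);
                        }
                        Job::Shutdown => break,
                    }
                }
            })?;
        Ok(Self { tx, join: Some(join) })
    }

    /// Sends a job and blocks for the reply (the real code awaits instead).
    pub fn query_vram(&self) -> Result<u64, Gone> {
        let (reply, rx) = mpsc::channel();
        self.tx.send(Job::QueryVram { reply }).map_err(|_| Gone)?;
        rx.recv().map_err(|_| Gone)
    }

    /// Asks the worker to exit and waits for the thread to finish.
    pub fn shutdown(&mut self) {
        let _ = self.tx.send(Job::Shutdown);
        if let Some(join) = self.join.take() {
            let _ = join.join();
        }
    }
}
```

Once `shutdown` has joined the thread, the receiver is gone and any later `query_vram` call returns `Err(Gone)`.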

What this phase doesn't touch (yet):

- Forward, kv-cache clear, model load, NCCL — still on `spawn_blocking`.
  Phase 2 moves the single-GPU forward + clear; Phase 3 moves the TP
  forward + NCCL bring-up; Phase 4 moves the loads and deletes the now-
  unused `device_vram_mb` / `cuda_mem_mb` helpers.
- Public API — unchanged. `Harness::load_model`, `chat_completion`,
  HTTP routes all keep identical shapes.

Tests:

- 5 new unit tests in `device_worker/mod.rs::tests` cover spawn → query
  → shutdown round-trip, thread naming, post-shutdown submit returns
  `Gone`, poisoned flag fast-rejects, and concurrent jobs drain across
  a Shutdown. CPU build (the only one CI runs) is enough to exercise
  channel mechanics.
- All 37 lib tests + all integration tests pass; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 09:40:34 +03:00
7c19da9361 feat(neuron): construction-complete vram/config dump + logits health + per-step vram
All checks were successful
CI / Format (push) Successful in 40s
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Clippy (push) Successful in 2m27s
build-prerelease / Build cortex binary (push) Successful in 4m24s
build-prerelease / Build neuron-blackwell (push) Successful in 4m0s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-ampere (push) Successful in 5m10s
build-prerelease / Build neuron-ada (push) Successful in 4m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
CI / Test (push) Successful in 4m24s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Three additive diagnostics that turn the 2026-05-27 q5k Qwen3.6-27B
incident from "guess at KV cache / quant sizes" into "read the
journal":

1. Construction-complete summary in TpQwen3_5ForCausalLM::load and
   TpQwen3ForCausalLM::load. After the last "after layer N" log fires,
   each rank emits a single info line with: free_mb/total_mb (the
   number that drops by ~9 GB between per-layer and first-request on
   beast, with no inference traffic), every resolved config knob
   (vocab_size, hidden_size, num_layers, head_dim, num_kv_heads,
   max_position_embeddings), and a per-token KV-cache byte estimate.
   For Qwen3-Next also includes the linear/full-attention layer split
   so the hybrid architecture's cache cost is unambiguous.

2. Logits health snapshot on sample failure. Today the failure logs
   "A weight is negative, too large or not a valid number" with no
   context — was it a NaN cascade, an Inf, a negative weight?
   `logits_health(&logits)` computes nan/pos_inf/neg_inf/neg counts
   plus finite_min/max/mean on the failure path (zero cost on the
   success path) and emits a warn line just before the wrapper's
   terminal "failed, model marked poisoned" log. Wired into both the
   prefill and decode sample sites of the non-streaming AND streaming
   TP chat paths.

3. VRAM snapshot at prefill complete + every decode step. The
   "prefill complete" info line now carries vram_free_mb so the
   activations + KV growth from the prefill itself is visible. The
   per-step trace line gets vram_free_mb too, so an operator running
   with RUST_LOG=trace can watch headroom shrink token by token.
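
The health snapshot in item 2 amounts to one pass over the logits. The sketch below works on a plain `&[f32]` rather than a candle `Tensor`; the struct layout and the function name `logits_health_sketch` are hypothetical, and counting only finite values in `neg` is an assumption.

```rust
#[derive(Debug, Default, PartialEq)]
pub struct LogitsHealth {
    pub nan: usize,
    pub pos_inf: usize,
    pub neg_inf: usize,
    /// Finite values below zero (assumption: infinities are counted separately).
    pub neg: usize,
    pub finite_min: Option<f32>,
    pub finite_max: Option<f32>,
    pub finite_mean: Option<f32>,
}

/// Single pass over the logits; intended for the failure path only.
pub fn logits_health_sketch(logits: &[f32]) -> LogitsHealth {
    let mut h = LogitsHealth::default();
    let mut sum = 0.0f64;
    let mut finite = 0usize;
    for &x in logits {
        if x.is_nan() {
            h.nan += 1;
        } else if x == f32::INFINITY {
            h.pos_inf += 1;
        } else if x == f32::NEG_INFINITY {
            h.neg_inf += 1;
        } else {
            if x < 0.0 {
                h.neg += 1;
            }
            h.finite_min = Some(h.finite_min.map_or(x, |m| m.min(x)));
            h.finite_max = Some(h.finite_max.map_or(x, |m| m.max(x)));
            sum += f64::from(x);
            finite += 1;
        }
    }
    if finite > 0 {
        h.finite_mean = Some((sum / finite as f64) as f32);
    }
    h
}
```

An all-NaN vector, like the one in the q5k incident, reports `nan == len` and leaves the three finite statistics as `None`.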

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 09:04:55 +03:00
24e20dcb5c feat(catalogue,gateway): model aliases (helexa/small, helexa/balanced, helexa/large)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m21s
CI / Test (push) Successful in 4m40s
build-prerelease / Build neuron-blackwell (push) Successful in 3m38s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 5m20s
build-prerelease / Build neuron-ada (push) Successful in 4m45s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 9m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Operators can now define tier aliases in models.toml:

  [aliases]
  "helexa/small" = "Qwen/Qwen3-1.7B"
  "helexa/balanced" = "Qwen/Qwen3-8B"
  "helexa/large" = "Qwen/Qwen3.6-27B"

A client request for `model: "helexa/small"` is resolved to the concrete
model id at routing time. The gateway also rewrites the proxied body's
`model` field to the concrete id so neuron sees a name that matches its
loaded handle (otherwise the harness rejects the request).

Motivated by the finger-in-the-wind benchmark: same "what's the capital
of Georgia" probe runs in 2.5s on the 1.7B vs 6.7s on the 27B with
identical correctness. Aliases let clients pick a latency tier without
hardcoding model ids, and let operators swap targets without changing
client code.

Changes:
  * cortex-core: `ModelCatalogue` gains `aliases: HashMap<String, String>`
    + `resolve_alias(&str) -> &str`. Unit tests cover the basic
    resolution + TOML round-trip.
  * cortex-gateway:
    * `RouteDecision` gains `resolved_model_id: String`. `router::resolve`
      consumes aliases at entry and threads the concrete id through.
    * Handlers (chat_completions, completions, anthropic_messages
      streaming + non-streaming) rewrite the body's `model` field with
      `rewrite_model_in_body` before proxying, using the resolved id
      for metrics labels, LRU touch, and the body itself.
    * `/v1/models` (Pass 4) emits each alias as its own entry mirroring
      the target's `loaded` flag, feasible_on, and locations — clients
      browsing the endpoint see both names and can pick either.
  * `models.toml` declares the three tier aliases; `models.example.toml`
    documents the section as opt-in.
  * Integration tests verify: end-to-end alias→concrete request flow,
    alias surfacing in /v1/models, and no-op fall-through for
    non-alias model ids.
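
The alias lookup above can be sketched roughly as follows. This is a minimal illustration, not the cortex-core code: the struct carries only the `aliases` map, `example_catalogue` is a hypothetical helper, and the fall-through for unknown names is inferred from the "no-op fall-through" integration test. The lifetime is named so the function can return either the stored target or the caller's input.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down stand-in for cortex-core's `ModelCatalogue`.
pub struct ModelCatalogue {
    pub aliases: HashMap<String, String>,
}

impl ModelCatalogue {
    /// Return the concrete model id for an alias, or the input unchanged
    /// when it is not an alias. Resolution is single-level (assumption).
    pub fn resolve_alias<'a>(&'a self, model: &'a str) -> &'a str {
        self.aliases.get(model).map(String::as_str).unwrap_or(model)
    }
}

/// Build the three tier aliases listed in the commit message.
pub fn example_catalogue() -> ModelCatalogue {
    let mut aliases = HashMap::new();
    aliases.insert("helexa/small".to_string(), "Qwen/Qwen3-1.7B".to_string());
    aliases.insert("helexa/balanced".to_string(), "Qwen/Qwen3-8B".to_string());
    aliases.insert("helexa/large".to_string(), "Qwen/Qwen3.6-27B".to_string());
    ModelCatalogue { aliases }
}
```

A request for `helexa/small` resolves to `Qwen/Qwen3-1.7B`, while a concrete id such as `Qwen/Qwen3-8B` passes through untouched.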

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 16:10:41 +03:00
becf61b9c1 feat(script): validate-neuron.sh waits for /health activation=ready
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m12s
build-prerelease / Build neuron-blackwell (push) Successful in 3m48s
CI / Test (push) Successful in 5m2s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m11s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 5m25s
build-prerelease / Build neuron-ada (push) Successful in 4m58s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 6m50s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Adds wait_for_ready() that polls /health until activation.state flips
to "ready" (or the NEURON_LOAD_TIMEOUT deadline). Inserted between
probe_health and the is_loaded/trigger_load step.

Before this, running validate-neuron.sh right after deploy.sh raced
the background pre-warm and failed in ~9 ms with "neuron not reachable"
(the pre-2026-05-26 build) or with a partial-load error (the new
build, where the listener binds before default_models finishes).

The poll prints the in_progress model on each tick so an operator
watching the log can see which model is delaying readiness. Backs off
from 2s to 10s after the first few iterations so a long TP load
doesn't spam.
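
The poll-with-backoff shape can be sketched as below. The real `wait_for_ready()` is shell; this Rust version only illustrates the control flow. The cut-over after 5 ticks is an assumption (the message says only "the first few iterations"), and `probe` / `sleep` are hypothetical injected hooks standing in for the `/health` check and the real sleep.

```rust
use std::time::{Duration, Instant};

/// Delay before the next poll: 2s early on, 10s afterwards.
/// The threshold of 5 ticks is an assumption, not taken from the script.
pub fn poll_delay(tick: u32) -> Duration {
    if tick < 5 {
        Duration::from_secs(2)
    } else {
        Duration::from_secs(10)
    }
}

/// Poll `probe` until it reports ready or `timeout` has elapsed.
/// `sleep` is injected so the loop can be exercised without real waiting.
pub fn wait_for_ready(
    mut probe: impl FnMut() -> bool,
    timeout: Duration,
    mut sleep: impl FnMut(Duration),
) -> bool {
    let deadline = Instant::now() + timeout;
    let mut tick = 0;
    loop {
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(poll_delay(tick));
        tick += 1;
    }
}
```

In production use `sleep` would be `std::thread::sleep` and `timeout` would come from NEURON_LOAD_TIMEOUT.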

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 15:26:21 +03:00
b9e7a76a7a feat(gateway): surface mid-prewarm models as Loading on /v1/models
The poller now fetches /health alongside /models on each neuron and
stashes the activation snapshot on NodeState. The /v1/models handler
gains a Pass 3 that synthesises Loading locations from each neuron's
activation.in_progress and activation.pending lists, so a catalogued
model that's mid-prewarm surfaces as `status: "loading"` rather than
appearing absent (loaded=false, locations=[]).

Without this, a client polling /v1/models during a beast restart sees
Qwen3.6-27B disappear for the ~5 minutes the q5k load takes, then
reappear. Now it stays visible the whole time with a clear status.

Adds ModelStatus::Loading to cortex-core. The router's per-node priority
loop gets an explicit (no-op) arm: Loading models aren't routable yet,
and falling through to the catalogue cold-load path is the existing
race — no worse than before, but tagged as a known follow-up needing
neuron-side in-flight tracking on /models/load.

New test_poller_captures_activation_from_health exercises the full
round-trip: mock neuron with empty /models but a pre_warming /health
→ poller writes node.activation. Common test helpers gain
spawn_mock_neuron_with_models_and_health and default_health_response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 15:26:12 +03:00
800498f530 feat(neuron): bind listener before pre-warm, surface activation in /health
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m26s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Test (push) Successful in 4m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m29s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two coupled changes addressing the 2026-05-26 validate-neuron failure
where a fresh deploy of beast had /health unreachable for ~5 minutes
while Qwen3.6-27B q5k materialised, even though systemd reported the
unit as active.

1. main.rs no longer awaits load_default_models before binding axum.
   The listener binds first; pre-warm runs in a spawned background
   task that holds a read lock on the harness registry for the
   duration of its sequential load loop. Concurrent on-demand
   /models/load and /v1/chat/completions traffic still flows.

2. /health gains an `activation` field carrying:
     state         pre_warming | ready
     pending       model ids queued but not started
     in_progress   model id currently loading (Option)
     completed     model ids loaded successfully this activation
     failed        [{model_id, error}] for failed entries
   The field is `#[serde(default)]` so a pre-change cortex polling a
   new neuron — or vice versa — keeps working.

`ActivationTracker` (new module `neuron::activation`) owns the
RwLock-wrapped state; load_default_models takes a tracker reference
and updates it per-model. NeuronState holds an Arc clone for the
/health handler.

Tests updated to construct trackers and assert state transitions
(empty noop, two failures → ready with both in `failed`).
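
A rough sketch of the tracker's state transitions, using only the field names from the `activation` table above. The method names (`start`, `finish`, `snapshot`) and the rule that an empty model list starts out `ready` are assumptions; serde attributes are omitted to keep the sketch dependency-free.

```rust
use std::sync::RwLock;

#[derive(Debug, Clone, PartialEq)]
pub enum ActivationState {
    PreWarming,
    Ready,
}

#[derive(Debug, Clone)]
pub struct Activation {
    pub state: ActivationState,
    pub pending: Vec<String>,
    pub in_progress: Option<String>,
    pub completed: Vec<String>,
    pub failed: Vec<(String, String)>, // (model_id, error)
}

pub struct ActivationTracker {
    inner: RwLock<Activation>,
}

impl ActivationTracker {
    pub fn new(default_models: &[&str]) -> Self {
        // Nothing to pre-warm means the neuron is ready immediately (assumption).
        let state = if default_models.is_empty() {
            ActivationState::Ready
        } else {
            ActivationState::PreWarming
        };
        ActivationTracker {
            inner: RwLock::new(Activation {
                state,
                pending: default_models.iter().map(|m| m.to_string()).collect(),
                in_progress: None,
                completed: Vec::new(),
                failed: Vec::new(),
            }),
        }
    }

    /// Move a model from `pending` to `in_progress`.
    pub fn start(&self, model_id: &str) {
        let mut a = self.inner.write().unwrap();
        a.pending.retain(|m| m.as_str() != model_id);
        a.in_progress = Some(model_id.to_string());
    }

    /// Record the outcome; flip to `Ready` once nothing is left pending.
    pub fn finish(&self, model_id: &str, result: Result<(), String>) {
        let mut a = self.inner.write().unwrap();
        a.in_progress = None;
        match result {
            Ok(()) => a.completed.push(model_id.to_string()),
            Err(e) => a.failed.push((model_id.to_string(), e)),
        }
        if a.pending.is_empty() {
            a.state = ActivationState::Ready;
        }
    }

    /// Copy of the current state, as the /health handler would serialise it.
    pub fn snapshot(&self) -> Activation {
        self.inner.read().unwrap().clone()
    }
}
```

Note that failures still end in `Ready`: the state describes whether pre-warm has finished, not whether every model loaded.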

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 15:18:04 +03:00
d3f2d50749 feat(deploy): per-host neuron config + pre-warm headline models
All checks were successful
CI / Format (push) Successful in 39s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m57s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m50s
build-prerelease / Build cortex binary (push) Successful in 4m52s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 5m14s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Adds asset/neuron/{beast,benjy,quadbrat}.toml — per-host neuron.toml
files keyed by the first dot-component of the host. deploy.sh now
rsyncs the matching file to /etc/neuron/neuron.toml on each neuron and
stops+starts the service so default_models is re-read.
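
The host-to-file mapping is simple enough to state as code. deploy.sh does this in shell; the Rust below is only an illustration, and both function names and the example FQDNs are hypothetical.

```rust
/// First dot-component of a host name, e.g. the short name of an FQDN.
pub fn host_key(host: &str) -> &str {
    host.split('.').next().unwrap_or(host)
}

/// Repository path of the per-host neuron config for `host`.
pub fn config_path(host: &str) -> String {
    format!("asset/neuron/{}.toml", host_key(host))
}
```

A bare short name maps to itself, so `beast` and a fully qualified `beast.<domain>` select the same file.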

Headline model per host (drives /v1/models output immediately after a
clean deploy):

  beast     Qwen/Qwen3.6-27B  (q5k, tp=2, devices=[0,1])
  benjy     Qwen/Qwen3-8B     (bf16, devices=[0])
  quadbrat  Qwen/Qwen3-1.7B   (bf16, devices=[0])

Removes the need to follow deploy.sh with `validate-neuron.sh beast
Qwen/Qwen3.6-27B q5k 2` to surface the 27B in the catalogue — the
neuron loads it itself on activation.

The neuron loop now mirrors the cortex flow (stop → install/upgrade →
sync config → start) so config-only changes are picked up on the next
deploy; previously a deploy with no package change would silently
leave the host on the old default_models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 14:05:54 +03:00
2740e61a23 fix(neuron,candle): name lifetime on acquire_pool_lock
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 46s
CI / Format (push) Successful in 46s
CI / Clippy (push) Successful in 2m15s
CI / Test (push) Successful in 5m8s
build-prerelease / Build cortex binary (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 5m25s
build-prerelease / Build neuron-ada (push) Successful in 5m3s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 7m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Lifetime elision fails when a function has two reference parameters
and returns a borrow: rustc can't infer whether the MutexGuard's
lifetime ties to `pool` or `model_id`. The non-CUDA build skipped
this code path (cfg-gated), so the error only surfaced on the GPU
build at https://git.lair.cafe/helexa/cortex/actions/runs/162.

The guard borrows the pool, so name the lifetime on `pool` and the
return type. `model_id` keeps its independent (elided) lifetime.
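
For reference, the shape of the fix looks roughly like this. The sketch uses `std::sync::Mutex` and a placeholder `Pool` type, both assumptions; the real helper is async and its pool type is not shown in this commit. Dropping the `'a` annotations reproduces the "missing lifetime specifier" error, because elision cannot choose between the two reference parameters.

```rust
use std::sync::{Mutex, MutexGuard};

/// Placeholder for the TP worker pool; the real type is not shown here.
pub struct Pool {
    pub in_flight: u32,
}

/// The guard borrows from `pool`, so `'a` ties the return type to `pool`.
/// `model_id` is only read for diagnostics and keeps its own elided lifetime.
pub fn acquire_pool_lock<'a>(pool: &'a Mutex<Pool>, model_id: &str) -> MutexGuard<'a, Pool> {
    pool.lock().unwrap_or_else(|poisoned| {
        eprintln!("pool lock for {model_id} was poisoned; recovering");
        poisoned.into_inner()
    })
}
```

Because the guard is not tied to `model_id`, the caller may drop the id string while still holding the lock.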

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:37:32 +03:00
67f79c868f fix(neuron,shutdown): time-bound unloads, fast-exit past tokio drain
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Format (push) Successful in 43s
CI / Clippy (push) Successful in 2m46s
build-prerelease / Build neuron-blackwell (push) Failing after 3m32s
CI / Test (push) Successful in 4m25s
build-prerelease / Build cortex binary (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two failure modes from the 2026-05-26 beast incident:

1. `unload_all_models` looped through models calling `unload_model`,
   logging individual failures at warn. The cumulative effect was a
   single warn line for the failed unload then "shutdown complete" —
   no signal that the model was actually still loaded. Now each unload
   is bounded by a 20s timeout, failures escalate to error, and a
   summary "leaving N model(s) loaded" line fires when anything is
   stuck so the operator knows the OS will reclaim VRAM after exit.

2. Returning `Ok(())` from `main` after the unload sweep dropped the
   tokio runtime, which then waited indefinitely on a CUDA-stuck
   spawn_blocking thread (the journal's "Stack trace of thread
   2951308" — spinning on `cuCtxGetCurrent`). systemd's TimeoutStopSec
   fired 2 minutes later, SIGABRT, core dump. Replacing the return
   with `std::process::exit(0)` skips the runtime drain and hands the
   OS a clean exit code; stuck threads get reaped with the process.
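
The time-bounded sweep from point 1 can be sketched with plain threads and a channel. This is an illustration under assumptions: the real code is async, and `unload_with_timeout`, `unload_all` and the `Unload` alias are hypothetical names. Callers would pass 20s as the timeout.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// A boxed unload operation for one model.
pub type Unload = Box<dyn FnOnce() -> Result<(), String> + Send>;

/// Run `unload` on a helper thread and give up waiting after `timeout`.
/// A timed-out thread is left running; it is reaped when the process exits.
pub fn unload_with_timeout<F>(unload: F, timeout: Duration) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that.
        let _ = tx.send(unload());
    });
    match rx.recv_timeout(timeout) {
        Ok(result) => result,
        Err(RecvTimeoutError::Timeout) => Err(format!("timed out after {:?}", timeout)),
        Err(RecvTimeoutError::Disconnected) => Err("unload thread panicked".to_string()),
    }
}

/// Unload every model, returning the ids that are still loaded afterwards.
pub fn unload_all(models: Vec<(String, Unload)>, timeout: Duration) -> Vec<String> {
    let mut stuck = Vec::new();
    for (id, unload) in models {
        if let Err(e) = unload_with_timeout(unload, timeout) {
            eprintln!("error: unload of {id} failed: {e}");
            stuck.push(id);
        }
    }
    if !stuck.is_empty() {
        eprintln!("leaving {} model(s) loaded", stuck.len());
    }
    stuck
}
```

After the sweep, `main` would call `std::process::exit(0)` rather than return, so that a thread still stuck inside the driver cannot block shutdown.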

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:30:06 +03:00
fc6ef0ee0f feat(neuron,candle): detect CUDA context poisoning and refuse follow-ups
Once a CUDA driver error has hit a forward or kv-cache call, the
device's context is unrecoverable in-process — subsequent kernels can
hang (the failure mode seen on beast on 2026-05-26), return garbage,
or trip another illegal-address. The harness now marks the model
poisoned on any forward / spawn_blocking / TP-task failure, refuses
further inference against it with a clear "unload and reload" error,
and surfaces `status: "poisoned"` on `/models` so an operator running
`curl beast:13131/models` (or cortex polling) can see the bad state.

Without this, a single OOM on a too-large prefill quietly turned every
subsequent request into a stuck wait on the pool lock; with it, the
first request fails fast with the driver error in the journal and the
client gets a usable 5xx instead of a hung connection.
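
The fail-fast behaviour amounts to a sticky flag checked before every forward. A minimal sketch, assuming a hypothetical `LoadedModel` wrapper and a closure in place of the real forward pass:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical wrapper around a loaded model handle.
pub struct LoadedModel {
    poisoned: AtomicBool,
}

impl LoadedModel {
    pub fn new() -> Self {
        LoadedModel { poisoned: AtomicBool::new(false) }
    }

    /// Status string as it would appear on `/models`.
    pub fn status(&self) -> &'static str {
        if self.poisoned.load(Ordering::Acquire) {
            "poisoned"
        } else {
            "loaded"
        }
    }

    /// Run `forward` unless the model is already poisoned. Any failure
    /// marks the model poisoned so later requests are refused up front.
    pub fn infer<F>(&self, forward: F) -> Result<String, String>
    where
        F: FnOnce() -> Result<String, String>,
    {
        if self.poisoned.load(Ordering::Acquire) {
            return Err("model is poisoned; unload and reload".to_string());
        }
        forward().map_err(|e| {
            self.poisoned.store(true, Ordering::Release);
            e
        })
    }
}
```

The first failing request still returns the original driver error; only follow-up requests see the "unload and reload" message.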

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:28:42 +03:00
1385979e3d feat(neuron,candle): log per-device VRAM at chat_completion start
Every "starting" log line now carries vram_free_mb / vram_total_mb for
the request's serving device (the leader device on TP). On the 2026-05-26
incident this would have made the 14k-token prefill OOM diagnosable from
the first log line: with ~412 MB free, that prompt was never going to
fit, and the operator could have caught the imbalance before the CUDA
context got poisoned.

`device_vram_mb` mirrors the existing helper in tp_qwen3_5.rs and is
kept separate to avoid coupling the inference path to the TP module.
TpLoadedModel gains a `leader_device: Device` clone so the request
path reads the device without locking the leader model (which would
contend with an in-flight forward).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:26:23 +03:00
0a1cfcd4d0 feat(neuron,candle): req_id spans, terminal failure logs, pool-lock warnings
Every chat completion path (single-GPU + TP, streaming + non-streaming)
now opens an `info_span!("chat", req_id=…, model=…)`. The fmt subscriber
prefixes every event with that span so `grep req_id=…` over journalctl
reconstructs one request even when dozens overlap.

Every path also emits a terminal log line on both success ("done", with
prompt_tokens/completion_tokens/finish_reason/total_ms) and failure
("failed", with full anyhow chain + total_ms). Failures used to vanish
silently — a request that hit a CUDA OOM left "starting" in the journal
and no further trace.

New `acquire_pool_lock` helper replaces the bare `tp.pool.lock().await`
in both TP paths. It warns at 2s ("still waiting on pool lock") and
re-warns every 2s thereafter, so queued requests stuck behind a
deadlocked holder are visible immediately instead of looking like idle
silence.
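
The warn-while-waiting pattern can be illustrated with a blocking analogue. The real helper is async; this sketch polls `try_lock` on a `std::sync::Mutex` instead, which is a swapped-in technique. `lock_with_warnings` and the `warn` callback are hypothetical names, and callers would pass 2s as `warn_every`.

```rust
use std::sync::{Mutex, MutexGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Acquire `mutex`, calling `warn(waited)` each time another `warn_every`
/// has passed without success.
pub fn lock_with_warnings<'a, T>(
    mutex: &'a Mutex<T>,
    warn_every: Duration,
    mut warn: impl FnMut(Duration),
) -> MutexGuard<'a, T> {
    let start = Instant::now();
    let mut next_warn = warn_every;
    loop {
        match mutex.try_lock() {
            Ok(guard) => return guard,
            // A panicked holder poisons a std mutex; recover the guard.
            Err(TryLockError::Poisoned(p)) => return p.into_inner(),
            Err(TryLockError::WouldBlock) => {}
        }
        let waited = start.elapsed();
        if waited >= next_warn {
            warn(waited); // e.g. "still waiting on pool lock"
            next_warn += warn_every;
        }
        thread::sleep(Duration::from_millis(5));
    }
}
```

An uncontended lock returns on the first attempt and never warns.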

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:25:11 +03:00
ea0e0f7911 fix(neuron,tp): log leader forward errors with full context
Worker rank failures were already surfaced at WARN, but the leader's
own forward Result::Err was silently coerced to a `leader_ok=false`
bool. When the leader and a worker both fail together — the typical
shape of a CUDA OOM cascading into an illegal-address — the journal
showed only the worker side and an operator had to guess what hit
rank 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:22:30 +03:00
aa88d37509 fix(gateway): full observability + stop leaking upstream bodies
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 42s
CI / Clippy (push) Successful in 2m27s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
CI / Test (push) Successful in 4m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 4m53s
build-prerelease / Build neuron-ada (push) Successful in 5m7s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m3s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Comprehensive sweep across cortex-gateway's request handling. Every
failure path now emits exactly one structured warn (or error) event
on the cortex side with the wire-level detail an operator needs;
the API response carries only a generic message plus, where useful,
the upstream status code.

proxy.rs::forward_request:
- warn on network failure (network error, target URL).
- warn on upstream non-2xx (status, target URL). Streaming body still
  passes through to the client; we just can't snippet without
  breaking the stream.
- warn on response-build failure.
- ProxyError::into_response no longer interpolates the inner error
  into the API body — generic "upstream request failed" / "failed to
  build response" instead.
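
The split between what is logged and what is returned can be sketched like this. The variant names and both method names are hypothetical; only the two generic message strings come from the notes above.

```rust
/// Hypothetical error type: each variant keeps the inner detail for logs.
pub enum ProxyError {
    Upstream(String),
    ResponseBuild(String),
}

impl ProxyError {
    /// Generic text that is safe to put in the API response body.
    pub fn public_message(&self) -> &'static str {
        match self {
            ProxyError::Upstream(_) => "upstream request failed",
            ProxyError::ResponseBuild(_) => "failed to build response",
        }
    }

    /// Full detail, intended for the warn-level log line only.
    pub fn log_detail(&self) -> &str {
        match self {
            ProxyError::Upstream(d) | ProxyError::ResponseBuild(d) => d.as_str(),
        }
    }
}
```

Returning `&'static str` from `public_message` makes it impossible to interpolate the inner error into the response by accident.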

handlers.rs::chat_completions, handlers.rs::completions:
- warn on missing model field, with handler= label.
- warn on route resolve failure with model + error chain. The
  user-facing 404 keeps the RouteError Display string (which is
  short, informative, and contains no internal detail beyond the
  model id and config'd node names).

handlers.rs::anthropic_messages:
- warn on invalid Anthropic body, on translated-OpenAI serialise
  failure (which is internal), on route resolve, on upstream network
  error, on upstream non-2xx (with 512-char body snippet for parse
  errors), on upstream body read, on response parse.
- All warns share consistent field shape: handler, model, node, url,
  status / error / body as applicable.
- API response messages are now uniformly generic.
- Adds an info-level "proxying request" log on the non-streaming
  path so successful proxies are also visible.
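
One way to take the 512-char body snippet without panicking on multi-byte UTF-8 is shown below. `body_snippet` is a hypothetical name, and whether the real code counts characters or bytes is not stated; this sketch counts characters.

```rust
/// Return at most `max_chars` characters from the start of `body`,
/// always cutting on a character boundary.
pub fn body_snippet(body: &str, max_chars: usize) -> &str {
    match body.char_indices().nth(max_chars) {
        // `idx` is the byte offset of the first character to drop.
        Some((idx, _)) => &body[..idx],
        None => body,
    }
}
```

Slicing by a raw byte offset (`&body[..512]`) would panic if the offset fell inside a multi-byte character, which is why the sketch walks `char_indices` instead.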

handlers.rs::proxy_with_metrics:
- still calls e.into_response() but proxy::forward_request already
  warn'd at the wire layer, so no double-log here.

Tests:
- All 32 existing unit tests + 22 gateway integration tests + 4
  new router tests pass.
- Tests that asserted on the "no healthy nodes" / "not found"
  strings still match because RouteError messages are preserved
  in the 404 user-facing path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:17:26 +03:00
0f00f72b47 fix(router,handlers): strip trailing slash from rewritten URL + log upstream failures
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m20s
CI / Test (push) Successful in 4m41s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two coupled bugs surfaced after 9b0ed0b:

1. url::Url::parse("http://host:port").to_string() normalises the
   empty path to "/", so rewrite_loopback_host was returning
   "http://beast:13131/". Downstream callers then did
   format!("{endpoint}/v1/chat/completions") and produced a
   double-slash path that neuron's axum router 404'd with an empty
   body. Strip the trailing slash in the rewriter so the endpoint is
   a clean base string for concatenation.

2. The anthropic_messages handler returned the upstream's empty body
   to the API caller as `"upstream error: "` with no journal log on
   the cortex side. Operators had no way to see what happened. Add
   warn-level tracing on both upstream failure paths (network error
   and non-2xx) with model, node, target URL, status, and a 512-char
   body snippet. The API response now carries just `"upstream
   returned <status>"` — the implementation detail lives in the log.
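
The first bug above comes down to string concatenation on a base that ends in "/". A minimal sketch of the fix, with hypothetical helper names:

```rust
/// Strip any trailing slash so the endpoint is a clean base for joining.
pub fn clean_base(endpoint: &str) -> &str {
    endpoint.trim_end_matches('/')
}

/// Build the chat-completions URL from a base endpoint.
pub fn chat_url(endpoint: &str) -> String {
    format!("{}/v1/chat/completions", clean_base(endpoint))
}
```

With the normalised input `http://beast:13131/`, plain `format!("{endpoint}/v1/chat/completions")` yields a `//v1/...` path, while `chat_url` yields a single slash.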

Updates the two existing rewrite tests for the no-trailing-slash
output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:10:39 +03:00
9b0ed0b57f fix(router): rewrite loopback inference URLs to use neuron's host
Some checks failed
CI / Format (push) Successful in 30s
build-prerelease / Resolve version stamps (push) Successful in 41s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Clippy (push) Successful in 7m25s
build-prerelease / Build neuron-ampere (push) Successful in 4m57s
build-prerelease / Build cortex binary (push) Successful in 4m15s
build-prerelease / Build neuron-ada (push) Successful in 5m14s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
CI / Test (push) Failing after 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Neuron hardcodes its bind_url as `http://localhost:13131` (it can't
reliably know its own externally-resolvable name). When cortex runs
on a different host than the neuron it's routing to, blindly
proxying to that URL hits localhost on the cortex box instead of the
neuron.

Cortex already knows each neuron's reachable host from cortex.toml.
After fetching the inference URL from `/models/{id}/endpoint`, if
the host is a loopback name (localhost / 127.0.0.1 / 0.0.0.0 / ::1),
swap it for the configured neuron host. Preserve the port and path
from neuron's URL so a future harness serving inference on a
different port than the management API still works.

Adds `url` (already a transitive dep via reqwest) as a direct
dep for the URL parsing.

Tests cover: localhost rewrite, distinct inference port preservation,
non-loopback passthrough, malformed input.
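
A dependency-free sketch of the rewrite follows. The commit uses the `url` crate; this version swaps in a hand-rolled split so it compiles with std alone, and the signature shown for `rewrite_loopback_host` is an assumption. It ignores userinfo and, as a side effect of `is_unspecified`, also treats `[::]` as loopback.

```rust
use std::net::IpAddr;

/// True for localhost, 127.0.0.0/8, ::1 and the unspecified addresses.
pub fn is_loopback_host(host: &str) -> bool {
    let bare = host.trim_start_matches('[').trim_end_matches(']');
    if bare.eq_ignore_ascii_case("localhost") {
        return true;
    }
    match bare.parse::<IpAddr>() {
        Ok(ip) => ip.is_loopback() || ip.is_unspecified(),
        Err(_) => false,
    }
}

/// Swap a loopback host for `neuron_host`, keeping scheme, port and path.
/// Returns `None` when the input does not look like `scheme://host...`.
pub fn rewrite_loopback_host(url: &str, neuron_host: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let (authority, path) = match rest.find('/') {
        Some(i) => rest.split_at(i),
        None => (rest, ""),
    };
    // `port` keeps its leading ':' (or is empty) so it can be re-appended.
    let (host, port) = if authority.starts_with('[') {
        let end = authority.find(']')?;
        (&authority[..=end], &authority[end + 1..])
    } else {
        match authority.rfind(':') {
            Some(i) => authority.split_at(i),
            None => (authority, ""),
        }
    };
    if host.is_empty() {
        return None;
    }
    if !is_loopback_host(host) {
        return Some(url.to_string());
    }
    Some(format!("{scheme}://{neuron_host}{port}{path}"))
}
```

The cases mirror the tests listed above: loopback hosts are replaced, a distinct inference port survives, non-loopback URLs pass through, and malformed input yields `None`.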

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:23:47 +03:00
dc2a803266 fix(rpm): migrate legacy helexa-cortex firewalld service to cortex
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 1m1s
CI / Clippy (push) Successful in 3m12s
CI / Test (push) Successful in 4m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m52s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
Adds a %posttrans scriptlet to cortex.spec that:

- Removes the stale /etc/firewalld/services/helexa-cortex.xml left
  behind by an older packaging stream that named the service
  `helexa-cortex` and (in some build streams) carried wrong port
  numbers (9301/9302/9304).
- Walks every active firewalld zone; for any zone where the legacy
  helexa-cortex service was enabled, swaps it out for the new
  `cortex` service (which the RPM ships at
  /usr/lib/firewalld/services/cortex.xml with the right
  31313/31314 ports).
- Reloads firewalld so the change takes effect without operator
  intervention.

Affected hosts were silently dropping inbound connections to cortex
on 31313: the active zone advertised a helexa-cortex service that
listed unrelated ports, masking the correctly defined vendor cortex
service.

helexa-neuron is unaffected: that spec already ships the vendor
service as helexa-neuron.xml (namespaced from day one) and no
stale /etc override files exist in the fleet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:12:51 +03:00
e71181499e feat(stage-8e-3): quantize lm_head in TP Qwen3-Next
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 42s
build-prerelease / Build neuron-blackwell (push) Successful in 3m43s
build-prerelease / Build cortex binary (push) Successful in 4m25s
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-ampere (push) Successful in 5m23s
build-prerelease / Build neuron-ada (push) Successful in 4m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m42s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
TpQwen3_5ForCausalLM::lm_head is now a MaybeQuantLinear. When the
load spec has quant set and tie_word_embeddings is false, lm_head's
(vocab_size, hidden_size) weight is quantized in-situ at load time
along with all the per-layer linears. The non-tied case on
Qwen3.6-27B saves ~1.7 GB per rank vs bf16 (248320 x 5120 x 2
bytes = 2.42 GB -> ~700 MB at Q5K) and shaves a small amount of
decode latency from the per-token logits matmul.

Tied case (tie_word_embeddings=true) keeps the lm_head plain even
when quant is set — quantizing the shared tensor would corrupt the
embedding lookup, and the tied case already gets the memory win
from only holding one copy.

This is the last MaybeQuantLinear hookup in the Qwen3-Next TP path.
The dense Qwen3 path (tp_qwen3.rs) is unchanged — defer until it's
the bottleneck for a model that actually needs TP at consumer scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:53:14 +03:00
ee663e5e99 fix(stage-8e-2e): bump quant prefill threshold to M > 64
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Clippy (push) Successful in 2m20s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
The M > 8 threshold from 8e-2d activated forward_via_f16 on the test
case (M=30) and slightly regressed prefill (143 -> 133 T/s). The
dequant cost (~30 MB f16 per linear * ~480 calls per prefill = ~200 ms)
eats the cuBLAS GEMM speedup at small M.

Move the crossover to M > 64 so short prefills (typical for the
validate probe) stay on the GGUF GEMV kernel where per-call cost is
comparable but the dequant tax is zero. Long prefills still get the
dequant-then-cuBLAS-GEMM path where the GEMM scaling amortises the
fixed dequant cost.

Doesn't close the gap to mistralrs's 423 T/s on Q5K prefill — that
needs either a dequant cache (gives back the ISQ memory win) or a
fused dequant+gemm kernel. Both larger projects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:50:45 +03:00
34f9b77d9d feat(stage-8e-2d): route quantized matmul by M (prefill vs decode)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m20s
CI / Test (push) Successful in 4m40s
build-prerelease / Build cortex binary (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m58s
build-prerelease / Build neuron-ampere (push) Successful in 5m14s
build-prerelease / Package cortex RPM (push) Successful in 9m25s
build-prerelease / Build neuron-ada (push) Successful in 5m12s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
MaybeQuantLinear::forward picks between two QMatMul paths:

- M > 8 (prefill): QMatMul::forward_via_f16 dequantises the weight
  once into f16 and runs a real cuBLAS-backed GEMM. The dequant cost
  is fixed per call, so it's amortised across the M tokens.
- M <= 8 (decode): QMatMul::forward uses candle's GGUF GEMV kernel
  on the quantized blocks directly. Requires f32 inputs so we still
  cast in/out at the boundary in that arm.

Earlier 8e-2c sent everything through the GGUF GEMV kernel, which
is excellent at GEMV (decode) but doesn't have a real batched GEMM
path — prefill regressed ~4x. This restores prefill to roughly the
bf16 cuBLAS GEMM throughput while keeping the decode gain.
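
The routing rule above, reduced to a minimal sketch. `QuantPath` and
`pick_path` are illustrative names (the real dispatch lives inside
MaybeQuantLinear::forward and calls candle); the threshold is passed
in as a parameter rather than hard-coded.

```rust
/// Which matmul path a quantized linear takes for a given row count.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum QuantPath {
    /// GGUF GEMV kernel on the quantized blocks (decode).
    GgufGemv,
    /// Dequantise to f16 once, then a cuBLAS-backed GEMM (prefill).
    DequantF16Gemm,
}

/// `m` is the number of activation rows (tokens) in this call.
/// Strictly greater than the threshold takes the GEMM path.
pub fn pick_path(m: usize, threshold: usize) -> QuantPath {
    if m > threshold {
        QuantPath::DequantF16Gemm
    } else {
        QuantPath::GgufGemv
    }
}
```

With the threshold of 8 from this commit, a 30-token prefill takes the
dequant path and single-token decode stays on GEMV.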

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:15:32 +03:00
f084aaab8e fix(stage-8e-2c): cast bf16/f16 activations to f32 around QMatMul
All checks were successful
CI / Format (push) Successful in 33s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Clippy (push) Successful in 2m18s
CI / Test (push) Successful in 4m26s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m41s
build-prerelease / Build cortex binary (push) Successful in 4m22s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 5m12s
build-prerelease / Build neuron-ada (push) Successful in 4m41s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
candle's QTensor::cuda_fwd requires f32 inputs — its on-the-fly
GGUF dequantize accumulates in f32. The model dtype flowing into
MaybeQuantLinear::forward is bf16, so QMatMul::forward errored with
"unexpected dtype, expected: F32, got: BF16".

Wrap the Quant arm to cast the activation to f32 before the matmul
and cast the result back to the input dtype. The cast is a single
launch on the activation tensor (small relative to weight traffic);
it's the price of in-situ GGUF-style quantization, and what mistralrs
does inside its own Linear wrapper.

The Plain arm is unchanged.
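
A std-only sketch of the cast-in / cast-out boundary, with bf16 held
as raw u16 bits. This is not the candle code path: the weight here is
a plain f32 slice standing in for the dequantized blocks, and all
function names are hypothetical.

```rust
/// bf16 is the top 16 bits of an f32, so widening is a shift.
pub fn bf16_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Narrow f32 to bf16 with round-to-nearest-even.
pub fn f32_to_bf16(x: f32) -> u16 {
    if x.is_nan() {
        return 0x7FC0; // canonical quiet NaN
    }
    let bits = x.to_bits();
    let round = 0x7FFF + ((bits >> 16) & 1);
    ((bits + round) >> 16) as u16
}

/// y = W x with bf16 activations: cast in, accumulate in f32, cast out.
/// `weight` is row-major (rows x cols).
pub fn matvec_with_f32_boundary(
    weight: &[f32],
    rows: usize,
    cols: usize,
    x_bf16: &[u16],
) -> Vec<u16> {
    assert_eq!(weight.len(), rows * cols);
    assert_eq!(x_bf16.len(), cols);
    // Cast the activation to f32 once, before the matmul.
    let x: Vec<f32> = x_bf16.iter().map(|&b| bf16_to_f32(b)).collect();
    (0..rows)
        .map(|r| {
            let row = &weight[r * cols..(r + 1) * cols];
            let acc: f32 = row.iter().zip(&x).map(|(w, xi)| w * xi).sum();
            // Cast the result back to the input dtype.
            f32_to_bf16(acc)
        })
        .collect()
}
```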

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 20:05:19 +03:00
68a606a79c fix(stage-8e-2b): allow quant on the TP load path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m50s
build-prerelease / Build cortex binary (push) Successful in 8m37s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ada (push) Successful in 4m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 12m35s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The pre-existing guard in candle.rs rejected any spec.quant on the TP
path with "GGUF quantized models are not supported in the TP path" —
written when quant only ever meant GGUF. With 8e-1/8e-2 in,
quant != None on the TP path triggers in-situ quantization of the
loaded safetensors shards. resolve_dense_files only looks for
safetensors, so a model whose source file is GGUF still errors out
cleanly downstream when loaded with TP.

validate-neuron.sh: rebuild the load payload incrementally so
tp_size > 1 + non-empty quant produces both fields. Same script now
covers all four combos (single/TP × dense/ISQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 19:17:14 +03:00
4aa71902d0 feat(stage-8e-2): plumb quant config from ModelSpec to TP load path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m7s
CI / Test (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m47s
build-prerelease / Build neuron-ampere (push) Successful in 5m17s
build-prerelease / Build neuron-ada (push) Successful in 5m14s
build-prerelease / Build cortex binary (push) Successful in 18m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m7s
- LoadDenseShard RPC gains an optional `quant` string field.
- WorkerPool::load_dense_shard takes a `quant: Option<String>`,
  passes it via the RPC to workers and via parse_quant_string to
  the leader's local load.
- The Qwen3-Next TP load chain (ForCausalLM → Model → DecoderLayer
  → Attention / GatedDeltaNet / MLP) takes `quant: Option<GgmlDType>`
  end-to-end, calling Column/RowParallelLinear::load_with_quant.
- The fused in_proj_qkv inside TpQwen3_5GatedDeltaNet is now a
  MaybeQuantLinear so it also picks up quantization.
- parse_quant_string accepts q4_0/q4_1/q5_0/q5_1/q8_0/q8_1, q2k..q8k
  (with or without underscore), and f16/bf16/f32. Empty / None means
  no quantization.

Callers from candle.rs forward spec.quant through pool.load_dense_shard.
This means a `quant = "q5k"` in models.toml now flows end-to-end to a
QTensor-backed QMatMul for every per-rank linear in the Qwen3-Next
TP path. lm_head and the small replicated bias/log tensors stay in
their loaded dtype for now (deferred to Stage 8e-3).
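
A sketch of the accepted quant strings listed above. `QuantKind` is a
local stand-in for candle's GgmlDType and `parse_quant_sketch` is a
hypothetical name; the trimming and lowercasing are assumptions, and
the real parse_quant_string may differ in error type and details.

```rust
/// Local stand-in for the quantized dtype enum.
#[allow(non_camel_case_types)]
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum QuantKind {
    Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1,
    Q2K, Q3K, Q4K, Q5K, Q6K, Q8K,
    F16, BF16, F32,
}

/// None or an empty string means "no quantization".
pub fn parse_quant_sketch(s: Option<&str>) -> Result<Option<QuantKind>, String> {
    let raw = match s {
        None => return Ok(None),
        Some(r) => r.trim(),
    };
    if raw.is_empty() {
        return Ok(None);
    }
    let kind = match raw.to_ascii_lowercase().as_str() {
        "q4_0" => QuantKind::Q4_0,
        "q4_1" => QuantKind::Q4_1,
        "q5_0" => QuantKind::Q5_0,
        "q5_1" => QuantKind::Q5_1,
        "q8_0" => QuantKind::Q8_0,
        "q8_1" => QuantKind::Q8_1,
        // K-quants are accepted with or without the underscore.
        "q2k" | "q2_k" => QuantKind::Q2K,
        "q3k" | "q3_k" => QuantKind::Q3K,
        "q4k" | "q4_k" => QuantKind::Q4K,
        "q5k" | "q5_k" => QuantKind::Q5K,
        "q6k" | "q6_k" => QuantKind::Q6K,
        "q8k" | "q8_k" => QuantKind::Q8K,
        "f16" => QuantKind::F16,
        "bf16" => QuantKind::BF16,
        "f32" => QuantKind::F32,
        other => return Err(format!("unknown quant string: {other}")),
    };
    Ok(Some(kind))
}
```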

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 18:03:36 +03:00
bef159b21c feat(stage-8e-1): MaybeQuantLinear primitive + parallel-linear quant variants
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 37s
build-prerelease / Build cortex binary (push) Successful in 4m36s
build-prerelease / Build neuron-blackwell (push) Successful in 3m31s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
CI / Format (push) Waiting to run
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Introduces MaybeQuantLinear, which wraps either a plain candle Linear
or a candle QMatMul backed by a freshly-quantized QTensor. Forward
dispatches identically through the Module trait so downstream code
doesn't care which arm is active.
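
The two-arm shape, sketched without candle. Everything here is a
stand-in: the local `Module` trait takes plain slices, and the Quant
arm uses a simple symmetric int8 scheme instead of GGUF blocks. It
only illustrates that callers see one forward regardless of the arm.

```rust
/// Minimal stand-in for a forward-only module trait.
pub trait Module {
    fn forward(&self, x: &[f32]) -> Vec<f32>;
}

/// Either plain f32 weights or int8 weights with one shared scale.
/// Weights are row-major with `cols` entries per output row.
pub enum MaybeQuantLinearSketch {
    Plain { weight: Vec<f32>, cols: usize },
    Quant { q: Vec<i8>, scale: f32, cols: usize },
}

impl MaybeQuantLinearSketch {
    pub fn plain(weight: Vec<f32>, cols: usize) -> Self {
        Self::Plain { weight, cols }
    }

    /// Quantize at load time: scale so the largest weight maps to 127.
    pub fn quantized(weight: &[f32], cols: usize) -> Self {
        let max_abs = weight.iter().fold(0.0f32, |m, w| m.max(w.abs()));
        let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
        let q = weight.iter().map(|w| (w / scale).round() as i8).collect();
        Self::Quant { q, scale, cols }
    }
}

impl Module for MaybeQuantLinearSketch {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        match self {
            Self::Plain { weight, cols } => weight
                .chunks(*cols)
                .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>())
                .collect(),
            Self::Quant { q, scale, cols } => q
                .chunks(*cols)
                .map(|row| {
                    row.iter()
                        .zip(x)
                        .map(|(qi, xi)| *qi as f32 * scale * xi)
                        .sum::<f32>()
                })
                .collect(),
        }
    }
}
```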

ColumnParallelLinear and RowParallelLinear gain `load_with_quant`
methods. The existing `load` methods stay as backward-compatible
no-quantization wrappers — no churn at the 27 existing call sites.

This is the foundation for in-situ quantization at load time. Wiring
the user-facing quant config and switching call sites to
load_with_quant follow in stages 8e-2 / 8e-3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 17:55:26 +03:00
8d7b099b36 feat(stage-8d-7): direct safetensors fused-region loader
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m18s
CI / Test (push) Successful in 4m28s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Build cortex binary (push) Successful in 4m13s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Replaces load_fused_qkv_slice_2d/_3d with reads from a separate
MmapedSafetensors handle. Each per-rank fused tensor is built by
reading the three region byte-slices directly from the mmap,
concatenating them host-side, and uploading as one device
allocation — no full-fused-tensor device materialisation.
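
A host-side sketch of the region read + concatenate step. The layout
is an assumption for illustration: the fused tensor is row-major, each
region is a contiguous run of rows, and each region is split evenly
across ranks. `fused_rank_slice` and its parameters are hypothetical
names; the real loader reads from an MmapedSafetensors handle.

```rust
/// Build one rank's fused tensor bytes from a row-major byte buffer.
/// `regions` lists (start_row, rows) for each fused region, e.g. q/k/v.
pub fn fused_rank_slice(
    data: &[u8],
    row_bytes: usize,
    regions: &[(usize, usize)],
    rank: usize,
    world: usize,
) -> Vec<u8> {
    let mut out = Vec::new();
    for &(start_row, rows) in regions {
        assert_eq!(rows % world, 0, "region must split evenly across ranks");
        let per_rank = rows / world;
        let first = start_row + rank * per_rank;
        let lo = first * row_bytes;
        let hi = (first + per_rank) * row_bytes;
        // Copy only this rank's rows of this region; the full fused
        // tensor is never materialised.
        out.extend_from_slice(&data[lo..hi]);
    }
    out
}
```

The returned buffer is what would then be uploaded as a single device
allocation.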

The prior approach allocated a ~100 MB transient device tensor
per linear-attention layer; on Qwen3.6-27B with 48 linear-attn
layers that's ~4.8 GB of allocator churn during load — enough
to fragment the cuda caching allocator on a tight-VRAM 32 GB
consumer GPU, which is what triggered the layer-22 up_proj
OOM seen on beast.

Threading: MmapedSafetensors flows worker → ForCausalLM →
Model → DecoderLayer → GatedDeltaNet::load. Both leader (mod.rs)
and worker (worker.rs) construct their own mmap; Linux's page
cache shares the underlying pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 17:49:35 +03:00
89d98d1fb2 diag(stage-8d-6): per-layer VRAM logging in TP load path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 3m59s
CI / Test (push) Successful in 4m58s
build-prerelease / Build cortex binary (push) Successful in 4m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-ampere (push) Successful in 4m52s
build-prerelease / Build neuron-ada (push) Successful in 5m11s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Wraps each TpQwen3_5DecoderLayer::load in a with_context that captures
free/total VRAM on failure, plus an info-level log after every layer
that succeeds. Uses cudarc::driver::result::mem_get_info — same API
mistralrs uses.

Diagnostic only: forward path is unchanged. Helps distinguish true
VRAM exhaustion from allocator fragmentation when loading large
models at BF16 on 2x consumer GPUs.
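
The wrapping pattern, sketched with std only. `load_with_vram_context`
is a hypothetical helper; `mem_info` is an injected closure standing
in for the cudarc free/total query, and String stands in for the
real error type.

```rust
/// Run one layer load; on failure, attach free/total VRAM to the error.
/// On success, log the same numbers at info level (stderr here).
pub fn load_with_vram_context<T>(
    layer_idx: usize,
    load: impl FnOnce() -> Result<T, String>,
    mem_info: impl Fn() -> (usize, usize),
) -> Result<T, String> {
    match load() {
        Ok(v) => {
            let (free, total) = mem_info();
            eprintln!("layer {layer_idx} loaded (free {free} / total {total} bytes)");
            Ok(v)
        }
        Err(e) => {
            // Capture VRAM at the moment of failure, not at load start.
            let (free, total) = mem_info();
            Err(format!(
                "layer {layer_idx}: {e} (free {free} / total {total} bytes)"
            ))
        }
    }
}
```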

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:54:05 +03:00
cc95fe28d9 feat(stage-8d-5b): wire fused_gdn_gating CUDA kernel
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 1m45s
build-prerelease / Build neuron-blackwell (push) Successful in 3m40s
build-prerelease / Build cortex binary (push) Successful in 4m27s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Build neuron-ampere (push) Successful in 5m30s
build-prerelease / Build neuron-ada (push) Successful in 5m24s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
CI / Format (push) Successful in 35s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m7s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m37s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
run_fused_gating helper consolidates the per-layer gating math:
  beta = sigmoid(b)
  g    = -exp(a_log) * softplus(a + dt_bias)

CUDA path issues a single launch via fused_gdn_gating_cuda;
cpu path falls back to the original per-op Rust sequence. Replaces
~10 candle launches per linear-attention layer (sigmoid + 2× to_dtype
+ exp + neg + broadcast_add + softplus + 2× unsqueeze + broadcast_mul)
across both single-GPU and TP forward paths.
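
A scalar CPU reference for the two gating formulas above, as a sketch.
Function names are illustrative, and the softplus cutoff of 20 is an
assumption made here for numerical stability, not taken from the
kernel.

```rust
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// softplus(x) = ln(1 + e^x); for large x this is x to f32 precision.
pub fn softplus(x: f32) -> f32 {
    if x > 20.0 {
        x
    } else {
        x.exp().ln_1p()
    }
}

/// Returns (beta, g) for one element:
///   beta = sigmoid(b)
///   g    = -exp(a_log) * softplus(a + dt_bias)
pub fn gdn_gating(a: f32, b: f32, a_log: f32, dt_bias: f32) -> (f32, f32) {
    let beta = sigmoid(b);
    let g = -a_log.exp() * softplus(a + dt_bias);
    (beta, g)
}
```

g is always <= 0, so exp(g) acts as a decay factor in (0, 1].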

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:52:38 +03:00
09c945f81e feat(stage-8d-4): dispatch chunked_gated_delta_rule_recurrence at prefill
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 44s
CI / Clippy (push) Failing after 52s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
run_delta_rule_cuda now picks between the per-token kernel and the
BT=64 chunked variant based on seq_len. Threshold = 64 matches mistralrs.
Prefill on Qwen3.6-27B (typical seq_len in the hundreds) drops from
one block-launch per token to one per 64-token chunk.
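
The dispatch rule and its effect on step count, as a sketch. The names
are illustrative, and whether seq_len == 64 exactly takes the chunked
path is an assumption made here (>= 64).

```rust
/// Chunk length of the chunked kernel (BT in the text above).
pub const CHUNK: usize = 64;

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum DeltaRuleKernel {
    PerToken,
    Chunked,
}

/// Assumption: sequences of at least one full chunk use the chunked kernel.
pub fn pick_kernel(seq_len: usize) -> DeltaRuleKernel {
    if seq_len >= CHUNK {
        DeltaRuleKernel::Chunked
    } else {
        DeltaRuleKernel::PerToken
    }
}

/// Sequential steps along the sequence: one per token, or one per
/// 64-token chunk (the last chunk may be partial).
pub fn sequential_steps(seq_len: usize) -> usize {
    match pick_kernel(seq_len) {
        DeltaRuleKernel::PerToken => seq_len,
        DeltaRuleKernel::Chunked => (seq_len + CHUNK - 1) / CHUNK,
    }
}
```

A 300-token prefill goes from 300 sequential steps to 5.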

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:50:30 +03:00
05dc0bad18 feat(stage-8d-3): wire causal_conv1d_update/full CUDA kernels
Some checks failed
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 38s
build-prerelease / Build cortex binary (push) Has started running
build-prerelease / Build neuron-blackwell (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Replaces the per-layer conv1d + silu sequence in both single-GPU and
TP linear-attention forward paths with a shared run_causal_conv1d
helper that dispatches to:

- causal_conv1d_update for decode (seq_len=1 with existing conv_state)
- causal_conv1d_full for prefill / fresh start (zero-pads internally)

Both kernels fuse the depthwise conv + SiLU into a single launch — 4×
fewer cuda launches per linear-attention layer vs the candle conv1d +
candle_nn::ops::silu combo. Falls back to the original Rust path on
cpu.
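
A single-channel CPU reference for the fused conv + SiLU, as a sketch.
The weight ordering (oldest tap first) and the state layout (last K-1
inputs, oldest first) are assumptions made here; one function covers
both the prefill case (zeroed state) and the decode case (one token
with carried state).

```rust
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Causal depthwise conv over one channel, followed by SiLU.
/// `state` must hold the previous K-1 inputs (zeros for a fresh start)
/// and is updated in place to the last K-1 inputs seen.
pub fn causal_conv1d_silu(x: &[f32], w: &[f32], state: &mut Vec<f32>) -> Vec<f32> {
    let k = w.len();
    assert!(k >= 1);
    assert_eq!(state.len(), k - 1);
    // History followed by the new tokens; output t sees padded[t..t+k].
    let mut padded = state.clone();
    padded.extend_from_slice(x);
    let out: Vec<f32> = (0..x.len())
        .map(|t| {
            let acc: f32 = (0..k).map(|j| w[j] * padded[t + j]).sum();
            silu(acc)
        })
        .collect();
    let keep = padded.len() - (k - 1);
    *state = padded[keep..].to_vec();
    out
}
```

Feeding a sequence in one call or one token at a time yields the same
outputs and the same final state, which is the property the
update/full split relies on.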

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:49:41 +03:00
10c151efa5 feat(stage-8d-5): wire gated_delta_rule_recurrence kernel into tp_qwen3_5
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m21s
build-prerelease / Build neuron-blackwell (push) Successful in 3m36s
CI / Test (push) Successful in 4m39s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m34s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
TP per-token Rust loop replaced with shared run_delta_rule dispatch
from arch/qwen3_5/linear_attn.rs. Both single-GPU and TP variants now
use the cuda kernel when available, per-token Rust fallback otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:44:12 +03:00
44ae927e38 feat(stage-8d-2): wire gated_delta_rule_recurrence kernel into qwen3_5
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 38s
CI / Test (push) Failing after 45s
CI / Clippy (push) Successful in 2m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
Replaces the per-token Rust delta-rule loop in
`arch/qwen3_5/linear_attn.rs::GatedDeltaNet::forward` with a single
dispatch to the `gated_delta_rule_recurrence` kernel imported from
mistralrs in 1ebbe87.

The kernel is V-tiled with compile-time BK (one block per (V-tile,
batch*head), one thread per V-column, BK state floats in registers).
For Qwen3.6's per-rank `(B=1, H=24, D_k=128, D_v=128)` shape this
collapses ~6 candle tensor-op launches per token per layer (each
~50µs CUDA dispatch overhead, so ~300µs/token/layer × 48
linear-attention layers ≈ 14ms per token in launch overhead alone)
to a single kernel launch with full ILP / register residency.

New free function `run_delta_rule`:
- cuda branch (when q is on a CUDA device): flattens
  `(B, H, ...)` → `(BH, ...)`, dispatches the kernel via
  `crate::cuda::gdn::gated_delta_rule_recurrence_cuda`, reshapes
  outputs back to `(B, H, L, D_v)` and state to `(B, H, D_k, D_v)`.
- cpu fallback: the original per-token Rust loop, unchanged. Keeps
  cargo test --workspace passing on hosts without cuda.

Dispatch decision lives in the wrapper (`q.device().is_cuda()`).
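
For orientation, one step of a per-token gated delta rule for a single
head, as a CPU sketch. The commit does not spell out the update, so
the form below (decay, delta correction, rank-1 update, read-out) is
the commonly published one and is an assumption here, as are the
function name and the row-major `(D_k, D_v)` state layout.

```rust
/// One token of the recurrence for one head. `state` is row-major
/// (d_k x d_v) and is updated in place. Returns the d_v output vector.
#[allow(clippy::too_many_arguments)]
pub fn gated_delta_step(
    state: &mut [f32],
    d_k: usize,
    d_v: usize,
    q: &[f32],
    k: &[f32],
    v: &[f32],
    g: f32,
    beta: f32,
) -> Vec<f32> {
    assert_eq!(state.len(), d_k * d_v);
    // 1. Decay the whole state by exp(g), with g <= 0.
    let decay = g.exp();
    for s in state.iter_mut() {
        *s *= decay;
    }
    // 2. What the state currently predicts for key k, per V-column,
    //    and the beta-scaled correction towards v.
    let mut delta = vec![0.0f32; d_v];
    for j in 0..d_v {
        let kv_mem: f32 = (0..d_k).map(|i| state[i * d_v + j] * k[i]).sum();
        delta[j] = beta * (v[j] - kv_mem);
    }
    // 3. Rank-1 update: state += k (outer) delta.
    for i in 0..d_k {
        for j in 0..d_v {
            state[i * d_v + j] += k[i] * delta[j];
        }
    }
    // 4. Read out with the query.
    (0..d_v)
        .map(|j| (0..d_k).map(|i| state[i * d_v + j] * q[i]).sum::<f32>())
        .collect()
}
```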

Build: `cargo build -p neuron --features cuda` compiles + links;
clippy clean on both CPU and cuda paths. 32 lib tests still pass
(none of them exercise this code path on cuda; smoke test for the
TP variant is the deployed Tbilisi probe).

Stage 8d-3 wires the conv1d kernels; 8d-4 the chunked prefill;
8d-5 the same wiring for `tp/tp_qwen3_5.rs`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:39:30 +03:00
1ebbe87651 feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m23s
build-prerelease / Build neuron-blackwell (push) Has started running
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Stage 8d (new): port the Gated DeltaNet CUDA kernels from
EricLBuehler/mistral.rs to close the ~500x decode performance gap
we measured on Qwen3.6-27B TP-2 (~12s/token in our pure-candle path
vs ~37 T/s in mistralrs on the same hardware).

This commit lays the build infrastructure with zero behavioural
change. Subsequent commits (8d-2 .. 8d-5) wire each kernel into the
qwen3_5 architecture and TP variant.

Added:
- `crates/neuron/build.rs` — uses `cudaforge::KernelBuilder` to compile
  every `src/cuda/*.cu` file into `libneuroncuda.a` under the `cuda`
  feature, then links it + `cudart`. Mirrors mistralrs's
  `mistralrs-core/build.rs` setup verbatim (same NVCC flag set, same
  sm_<80 bf16 gate).
- `crates/neuron/src/cuda/gdn.cu` — five kernels ported verbatim from
  upstream:
    * `gated_delta_rule_recurrence` (V-tiled per-token decode)
    * `chunked_gated_delta_rule_recurrence` (BT=64 chunked prefill)
    * `causal_conv1d_update` (single-token conv decode)
    * `causal_conv1d_full` (multi-token conv prefill)
    * `fused_gdn_gating` (beta = sigmoid(b); g = -exp(A_log) *
      softplus(a + dt_bias))
- `crates/neuron/src/cuda/gdn.rs` — Rust wrappers around the kernels,
  cudarc::CudaSlice::device_ptr boilerplate identical to upstream.
- `crates/neuron/src/cuda/ffi.rs` — `extern "C"` decls (subset of
  upstream's ffi.rs covering only the five GDN kernels; MoE / SSM /
  top-k decls land here when we absorb those too).
- `crates/neuron/src/cuda/mod.rs` — re-exports + module docs.

Cargo wiring: `cudaforge` added as an optional build-dep, activated
by the `cuda` feature. CPU build is unchanged (the `cuda/` module is
fully `#[cfg(feature = "cuda")]`). The cuda feature build inside the
patched container compiles `gdn.cu` (1 of 1 kernel source files) and
links clean.

Licensing: upstream files preserve their MIT origin via per-file
comment banners pointing to the mistralrs path. No behaviour-relevant
edits to the .cu kernels — local diff against upstream is just the
banner. The `.rs` wrappers and `ffi.rs` subset are also from upstream;
their structure (module path `crate::cuda::ffi::*`) matches upstream
exactly, so future kernel imports drop in unchanged.

CPU clippy + 32 lib tests pass; `cargo clippy --features cuda` clean
inside the runner container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:34:11 +03:00
70eb6af42b feat(tp): cancellation-safe inference + structured tracing
All checks were successful
CI / Format (push) Successful in 30s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Clippy (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 3m44s
build-prerelease / Build cortex binary (push) Successful in 4m13s
CI / Test (push) Successful in 4m38s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 4m47s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Two changes addressing operator visibility into TP inference + the
HTTP-cancellation poisoning chain:

1. `chat_completion_tp` now runs its body inside `tokio::spawn`. When
   the HTTP client disconnects (curl --max-time, browser nav, etc.)
   the future returned from `chat_completion_tp` gets dropped, but
   the spawned task keeps running to completion — finishing every
   `pool.generate_step` / `pool.clear_kv_cache` to drain the worker
   pipes. The next inference request then finds a clean pool.

   Previously, a dropped future left workers still processing the
   in-flight request, and the next call's `ClearKvCache` recv would
   read the stale `GenerateStepOk` from the abandoned step ("rank N
   expected KvCacheCleared, got GenerateStepOk"). The drain-on-
   leader-error fix from d1a4aad covered Rust-side leader failures
   but not HTTP-layer cancellation, which is what we actually hit
   on the user's Qwen3.6 test.

2. Tracing throughout the TP path so journalctl shows where an
   inference spends its time without needing to surface harness
   internals via the HTTP error body:

   - `chat_completion_tp_inner` (now a free fn so it can run inside
     spawn): `info` at request start (prompt_len, max_new, temp,
     top_p, eos_id), `info` per major phase (prefill complete with
     elapsed_ms, decode complete with elapsed_ms + token count),
     `info` at completion (total_ms, finish_reason). `debug` for
     pool-lock acquisition + kv-cache clear timing. `trace` per
     decode step (next_token, step_ms).

   - `WorkerPool::generate_step` (leader side): `debug` at fan-out,
     `debug` after leader forward returns with elapsed_ms + ok flag,
     `debug` after drain with errors count + total_ms.

   - `WorkerPool::clear_kv_cache`: matching `debug` at fan-out + drain.

   - `worker::handle_generate_step`: `debug` at forward start + done
     with elapsed_ms, `warn` on forward failure with the full error.

The default log filter is already `info,neuron=debug` so the
operator gets every `info` and `debug` line by default; `trace`
needs RUST_LOG=trace for per-step decode timing.

Stage 7c-ii crash-detection is still future work; this is the
minimum that makes the "where did the 120s go" question answerable
from the logs.
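
A std-thread analogue of change 1, as a sketch. This is not tokio and
not the real handler: `run_detached` is a hypothetical helper, a
thread stands in for the spawned task, and dropping the receiver
stands in for the HTTP client disconnecting. The point is only that
the detached work runs every step even when nobody is listening.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Start `steps` units of work on a detached thread. Each step bumps
/// `drained` (standing in for a drained generate step) and then tries
/// to report progress to the caller.
pub fn run_detached(
    steps: usize,
    drained: Arc<AtomicUsize>,
) -> (mpsc::Receiver<usize>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        for step in 0..steps {
            drained.fetch_add(1, Ordering::SeqCst);
            // The caller may have gone away; a failed send must not
            // stop the remaining steps.
            let _ = tx.send(step);
        }
    });
    (rx, handle)
}
```

Dropping the returned receiver immediately and then joining the handle
still leaves `drained` equal to `steps`.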

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 08:22:00 +03:00
d1a4aad91d fix(tp): always drain worker responses on leader failure
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 1m6s
CI / Clippy (push) Successful in 2m56s
build-prerelease / Build neuron-blackwell (push) Successful in 3m40s
CI / Test (push) Successful in 5m1s
build-prerelease / Build cortex binary (push) Successful in 4m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ampere (push) Successful in 4m29s
build-prerelease / Build neuron-ada (push) Successful in 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m9s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
The TP-2 inference probe against Qwen3.6-27B surfaced:
    worker rank 1 ClearKvCache: expected KvCacheCleared, got
    GenerateStepOk

Caused by pipe poisoning. The previous shape of `generate_step`:

  for w in workers { w.send_only(GenerateStep) }   // 1. fan-out
  let logits = spawn_blocking(leader.forward)??;   // 2. early return on err
  for w in workers { w.recv_only() }               // 3. drain (skipped on 2's err)

When step 2 returned `Err` (e.g. a dtype mismatch we hadn't seen
before, an OOM, a downstream squeeze that didn't match the shape),
the function bailed before step 3 — but workers had already written
`GenerateStepOk` to their stdout pipes, since their forwards (and
the NCCL collectives inside) completed independently of the leader's
post-collective Rust-side work.

The next call (typically `ClearKvCache` at the start of the *next*
inference request) would then send a fresh request and read those
stale replies as if they were the new operation's. Once a pipe is
poisoned, every subsequent call surfaces the same shape of error
even though nothing's actually broken.

Fix: introduce two helpers in `tp/mod.rs`:

- `drain_workers(workers, check)` — reads exactly one response from
  every worker regardless of individual outcomes. Returns
  `Vec<String>` of `rank N: detail` strings for any non-OK reply.

- `combine_leader_workers(leader, worker_errs, op)` — folds the
  leader's `Result<Result<T>>` (the spawn_blocking shape) with the
  worker drain into a single `Result<T>`. Leader failure takes
  precedence but worker errors get appended so both halves surface.

`generate_step` and `clear_kv_cache` now use this pattern. Worst case:
both halves fail and the operator sees a combined error message;
either way the pipes are always drained so the next call's recv
matches the request it sent.
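
The drain-then-combine pattern described above could look roughly like the sketch below. It is not the code in `tp/mod.rs`: the `Worker` trait, the `Reply` enum, the `String` error type, and the flattened leader `Result` are all hypothetical stand-ins chosen to keep the example self-contained.

```rust
/// Hypothetical reply type; the real protocol has more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Reply {
    GenerateStepOk,
    KvCacheCleared,
    Error(String),
}

/// Hypothetical stand-in for the read side of a worker's pipe.
pub trait Worker {
    fn rank(&self) -> usize;
    fn recv_only(&mut self) -> Result<Reply, String>;
}

/// Read exactly one reply from every worker, even when earlier workers
/// failed, so no stale reply is left behind for the next call.
/// Returns one `rank N: detail` string per reply that fails `check`.
pub fn drain_workers<W: Worker>(
    workers: &mut [W],
    check: impl Fn(&Reply) -> bool,
) -> Vec<String> {
    let mut errs = Vec::new();
    for w in workers.iter_mut() {
        match w.recv_only() {
            Ok(reply) if check(&reply) => {}
            Ok(other) => errs.push(format!("rank {}: unexpected reply {:?}", w.rank(), other)),
            Err(e) => errs.push(format!("rank {}: {}", w.rank(), e)),
        }
    }
    errs
}

/// Fold the leader result and the worker drain into one result.
/// A leader failure is reported first; worker errors are appended.
pub fn combine_leader_workers<T>(
    leader: Result<T, String>,
    worker_errs: Vec<String>,
    op: &str,
) -> Result<T, String> {
    let workers = worker_errs.join("; ");
    match (leader, worker_errs.is_empty()) {
        (Ok(v), true) => Ok(v),
        (Ok(_), false) => Err(format!("{}: workers failed: {}", op, workers)),
        (Err(e), true) => Err(format!("{}: leader failed: {}", op, e)),
        (Err(e), false) => Err(format!(
            "{}: leader failed: {}; workers also failed: {}",
            op, e, workers
        )),
    }
}
```

The key property is ordering: the caller runs the leader, then always calls `drain_workers`, and only afterwards inspects either result, so an early `?` can no longer skip the drain.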

Note: the model is still poisoned in the current state — the
operator needs to either `POST /models/unload` + reload, or
`systemctl restart neuron`, to recover. The fix prevents *future*
desync; it doesn't repair existing stale pipe state.

Stage 7c-ii crash detection was tracked as the canonical solution to
this class of issue; this is the minimum-viable subset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 07:39:36 +03:00
95dc8745eb feat(stage-8c): TP-aware Qwen3-Next (tp_qwen3_5)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m13s
build-prerelease / Build neuron-blackwell (push) Successful in 3m37s
CI / Test (push) Successful in 4m49s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m26s
build-prerelease / Build neuron-ampere (push) Successful in 5m18s
build-prerelease / Package cortex RPM (push) Successful in 7m6s
build-prerelease / Build neuron-ada (push) Successful in 5m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 5m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Adds `harness/tp/tp_qwen3_5.rs` — the tensor-parallel variant of the
Qwen3-Next architecture — plus the dispatch wiring needed to route a
load through it on both the leader and the workers.

Architecture pieces (all per-rank, follow `tp_qwen3.rs` patterns for
the full-attention layers + a new pattern for linear-attention):

- TpQwen3_5GatedDeltaNet: V-head-dim sharded. `num_v_heads / world_size`
  V-heads per rank, `num_k_heads / world_size` K-heads. `in_proj_z`,
  `in_proj_b`, `in_proj_a`, `A_log`, `dt_bias` shard uniformly along
  the V-head dim. `out_proj` is row-parallel + AllReduce (the only
  collective inside the block). The recurrent state shards 1:1 with
  V-heads — no cross-rank sync inside the delta-rule loop.

  `in_proj_qkv` and `conv1d.weight` are FUSED tensors with three
  regions along dim 0 (`[first key_dim, second key_dim, value_dim]`).
  Standard uniform-slicing doesn't align with the head boundaries —
  rank 0 would end up with `[first half of K_0, full K_1, first half
  of V]`. New `load_fused_qkv_slice_{2d,3d}` helpers load the full
  tensor, narrow per-region per-rank, and `Tensor::cat` the three
  slices into a per-rank fused weight. Transient peak of one full
  tensor per layer during construction; net memory is properly per-
  rank after the full drops.

- TpQwen3_5Attention: column-parallel `q_proj` (the widened
  `2 * num_heads * head_dim` output, including the gate half — shards
  along the head axis so both query AND gate halves stay consistent
  per rank), `k_proj`, `v_proj`; row-parallel `o_proj` with AllReduce.
  Otherwise mirrors `tp_qwen3.rs`'s attention.

- TpQwen3_5MLP, TpQwen3_5DecoderLayer (dispatches on layer_types),
  TpQwen3_5Model (with `model.language_model.*` prefix), and
  TpQwen3_5ForCausalLM (with tied or separate `lm_head` at top level).
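
For the fused `in_proj_qkv` / `conv1d.weight` case above, the per-rank row ranges can be computed per region rather than over the whole tensor. The sketch below only does the index arithmetic; the function name and signature are hypothetical, and it assumes both `key_dim` and `value_dim` divide evenly by `world_size` (the real helpers additionally have to respect head boundaries and do the actual narrow + concatenate on tensors).

```rust
use std::ops::Range;

/// Row ranges along dim 0 of a fused `[key_dim, key_dim, value_dim]`
/// tensor that a given rank keeps: one range per region, in order.
pub fn fused_qkv_ranges(
    key_dim: usize,
    value_dim: usize,
    rank: usize,
    world_size: usize,
) -> Result<[Range<usize>; 3], String> {
    if world_size == 0 || rank >= world_size {
        return Err(format!("rank {} out of range for world_size {}", rank, world_size));
    }
    if key_dim % world_size != 0 || value_dim % world_size != 0 {
        return Err(format!(
            "key_dim {} / value_dim {} not divisible by world_size {}",
            key_dim, value_dim, world_size
        ));
    }
    let k = key_dim / world_size; // rows per rank in each key-sized region
    let v = value_dim / world_size; // rows per rank in the value region
    let first = rank * k; // offset inside region 0
    let second = key_dim + rank * k; // region 1 starts after region 0
    let third = 2 * key_dim + rank * v; // region 2 starts after both key regions
    Ok([first..first + k, second..second + k, third..third + v])
}
```

With the Qwen3.6-27B linear-attention dimensions quoted elsewhere in this log (16 key heads and 48 value heads, both with head dim 128, so `key_dim = 2048` and `value_dim = 6144`), rank 0 of 2 would keep rows `0..1024`, `2048..3072`, and `4096..7168`.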

Dispatch wiring:

- New `tp::TpLeaderModel` enum holds either Qwen3 or Qwen3_5 variant.
  `WorkerPool::load_dense_shard` now dispatches on `model_type` from
  the config JSON and returns `Arc<Mutex<TpLeaderModel>>`. The two
  downstream methods (`generate_step`, `clear_kv_cache`) thread this
  enum through — the inner forward+clear_kv_cache dispatch happens
  via the enum's pub methods. Adding another TP architecture later is
  one more enum variant + match arms.

- Worker side gets a parallel `WorkerModel` enum + dispatch in
  `handle_load_dense_shard`, branching on the same `model_type`.

- Harness gate `TP_SUPPORTED_MODEL_TYPES` now `["qwen3", "qwen3_5"]`.
  `TpLoadedModel.leader_model` retyped to the enum.
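
A rough illustration of the enum dispatch described above, under heavy simplification: the inner structs here are placeholders that only count cached tokens, `for_model_type` is a hypothetical constructor, and nothing below reflects the real `TpQwen3` / `TpQwen3_5` fields or forward signatures.

```rust
/// Placeholder models; the real ones hold sharded weights and KV state.
#[derive(Debug, Default)]
pub struct TpQwen3 {
    cached_tokens: usize,
}

#[derive(Debug, Default)]
pub struct TpQwen3_5 {
    cached_tokens: usize,
}

pub const TP_SUPPORTED_MODEL_TYPES: [&str; 2] = ["qwen3", "qwen3_5"];

#[derive(Debug)]
pub enum TpLeaderModel {
    Qwen3(TpQwen3),
    Qwen3_5(TpQwen3_5),
}

impl TpLeaderModel {
    /// Pick the variant from the `model_type` string in the config.
    pub fn for_model_type(model_type: &str) -> Result<Self, String> {
        match model_type {
            "qwen3" => Ok(Self::Qwen3(TpQwen3::default())),
            "qwen3_5" => Ok(Self::Qwen3_5(TpQwen3_5::default())),
            other => Err(format!(
                "model_type {:?} has no TP implementation (supported: {:?})",
                other, TP_SUPPORTED_MODEL_TYPES
            )),
        }
    }

    /// Placeholder forward: returns the number of tokens cached so far.
    pub fn forward(&mut self, tokens: &[u32]) -> usize {
        let cached = match self {
            Self::Qwen3(m) => &mut m.cached_tokens,
            Self::Qwen3_5(m) => &mut m.cached_tokens,
        };
        *cached += tokens.len();
        *cached
    }

    pub fn clear_kv_cache(&mut self) {
        match self {
            Self::Qwen3(m) => m.cached_tokens = 0,
            Self::Qwen3_5(m) => m.cached_tokens = 0,
        }
    }
}
```

In this shape, supporting another family means adding one variant and one arm per `match`; callers such as `generate_step` keep working against the enum's methods.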

Helpers in `arch/qwen3_5/linear_attn.rs`:
- `softplus` and `repeat_interleave` made `pub(crate)` so the TP
  module reuses them rather than duplicating.

Reuses unchanged: `Qwen3_5RmsNorm` (replicated weight), the gated
`Qwen3_5RmsNormGated` tail, `l2norm`, the `RotaryEmbedding` (partial
RoPE with `partial_rotary_factor` already correct).

CPU build + clippy + 32 lib tests pass; `cargo clippy --features cuda`
also clean inside the patched runner container.

One in-flight risk to call out: tensor names. For full-attention
layers the per-layer prefix is `model.language_model.layers.<i>.self_attn.*`
and for linear-attention layers `model.language_model.layers.<i>.linear_attn.*`
— the same as the single-GPU path. lm_head sits at the top level (not
under `language_model`) — consistent with the single-GPU path that
validated against Qwen3.5-0.8B.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:02:42 +03:00
495d3f7c05 fix(qwen3_5): promote beta to F32 alongside q/k/v in delta rule
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Format (push) Successful in 43s
CI / Clippy (push) Successful in 2m20s
CI / Test (push) Successful in 4m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Build neuron-ampere (push) Successful in 4m46s
build-prerelease / Build neuron-ada (push) Successful in 5m9s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m9s
The single-GPU dense load of Qwen/Qwen3.5-0.8B succeeded but the first
inference forward bombed with `dtype mismatch in mul, lhs: F32, rhs:
BF16`. Trace through the recurrent delta-rule loop:

  let q = (q.to_dtype(F32)? * scale)?;        // F32
  let k = k.to_dtype(F32)?;                    // F32
  let v = v.to_dtype(F32)?;                    // F32
  // g built from A_log/dt_bias                 // F32
  // beta = sigmoid(b)                          // BF16 (sigmoid preserves dtype)
  ...
  let delta = (v_t - kv_mem)?.broadcast_mul(&beta_col)?;
                ^^^^^^^^^^^^^                    ^^^^^^^^^
                F32                              BF16   ← mismatch

`g` was already F32 because it was constructed from `a_log.to_dtype(F32)`
+ `dt_bias.to_dtype(F32)` earlier in the function. `beta` came from
`sigmoid(b)` where `b` was the model dtype (BF16), so beta stayed BF16
and the multiplication tripped candle's dtype-mismatch check.

Promote beta to F32 at the same point we promote q/k/v.
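
The failure mode can be reproduced with a toy model of dtype-checked multiplication. `ToyTensor` and `DType` below are illustrative stand-ins and not candle's API; they only mimic two behaviours described above: `sigmoid` preserves the input dtype, and `mul` rejects mixed dtypes.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DType {
    F32,
    BF16,
}

/// Toy tensor: values are always stored as f32, `dtype` is just a tag.
#[derive(Debug, Clone)]
pub struct ToyTensor {
    pub dtype: DType,
    pub data: Vec<f32>,
}

impl ToyTensor {
    pub fn new(dtype: DType, data: Vec<f32>) -> Self {
        Self { dtype, data }
    }

    /// Re-tag the tensor with another dtype (no rounding is modelled).
    pub fn to_dtype(&self, dtype: DType) -> Self {
        Self { dtype, data: self.data.clone() }
    }

    /// Element-wise sigmoid; the output keeps the input's dtype.
    pub fn sigmoid(&self) -> Self {
        let data = self.data.iter().map(|x| 1.0 / (1.0 + (-x).exp())).collect();
        Self { dtype: self.dtype, data }
    }

    /// Element-wise multiply that refuses mixed dtypes.
    pub fn mul(&self, rhs: &Self) -> Result<Self, String> {
        if self.dtype != rhs.dtype {
            return Err(format!(
                "dtype mismatch in mul, lhs: {:?}, rhs: {:?}",
                self.dtype, rhs.dtype
            ));
        }
        if self.data.len() != rhs.data.len() {
            return Err("shape mismatch in mul".to_string());
        }
        let data = self.data.iter().zip(&rhs.data).map(|(a, b)| a * b).collect();
        Ok(Self { dtype: self.dtype, data })
    }
}
```

Multiplying an F32 `delta` by `b.sigmoid()` where `b` is tagged BF16 returns the mismatch error; calling `.to_dtype(DType::F32)` on `beta` first makes the same multiply succeed.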

Caught by the validate-neuron.sh probe against Qwen/Qwen3.5-0.8B on
beast — load returned 200, then `POST /v1/chat/completions` returned
the dtype error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 21:13:19 +03:00
5c4c8e0eba fix(qwen3_5): tensor names are under model.language_model.*, not model.*
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m12s
build-prerelease / Build neuron-blackwell (push) Successful in 3m49s
CI / Test (push) Successful in 4m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 4m50s
build-prerelease / Build neuron-ada (push) Successful in 5m12s
build-prerelease / Build cortex binary (push) Successful in 4m14s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
Qwen3-Next is a multimodal architecture whose text core sits under
`model.language_model.*` — sibling to `model.visual.*` (vision tower)
and to top-level `lm_head` / `mtp.*`. Every text-side tensor in the
safetensors files carries that prefix:

  model.language_model.embed_tokens.weight
  model.language_model.layers.{i}.{input,post_attention}_layernorm.weight
  model.language_model.layers.{i}.linear_attn.{in_proj_*, conv1d.weight, A_log, dt_bias, norm.weight, out_proj.weight}
  model.language_model.layers.{i}.self_attn.{q,k,v,o}_proj.weight + {q,k}_norm.weight
  model.language_model.layers.{i}.mlp.{gate,up,down}_proj.weight
  model.language_model.norm.weight
  lm_head.weight              (top-level; not under language_model)

The single pre-emptive fix is in `Qwen3_5Model::load`: derive a
`text_vb = vb.pp("model.language_model")` once and walk
embed_tokens / layers / norm from there. `lm_head` stays at the
top-level VB; that path was already correct.
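
To make the prefix walk concrete, here is a small sketch of how dotted tensor names compose when a builder is narrowed with `pp`. `NameBuilder` is a hypothetical string-only stand-in for a VarBuilder-like type; it resolves names but loads nothing.

```rust
/// String-only stand-in for a VarBuilder-style prefix stack.
#[derive(Debug, Clone, Default)]
pub struct NameBuilder {
    prefix: String,
}

impl NameBuilder {
    pub fn root() -> Self {
        Self::default()
    }

    /// Push one (possibly dotted) path segment onto the prefix.
    pub fn pp(&self, segment: &str) -> Self {
        let prefix = if self.prefix.is_empty() {
            segment.to_string()
        } else {
            format!("{}.{}", self.prefix, segment)
        };
        Self { prefix }
    }

    /// Fully qualified name of a tensor under the current prefix.
    pub fn name(&self, leaf: &str) -> String {
        self.pp(leaf).prefix
    }
}

/// Name of a linear-attention tensor in layer `layer`, given the
/// text-side builder (already narrowed to `model.language_model`).
pub fn linear_attn_tensor(text_vb: &NameBuilder, layer: usize, leaf: &str) -> String {
    text_vb
        .pp("layers")
        .pp(&layer.to_string())
        .pp("linear_attn")
        .name(leaf)
}
```

Deriving `text_vb` once from the root and keeping `lm_head` on the root builder yields `model.language_model.layers.3.linear_attn.A_log` for the text core and `lm_head.weight` for the head, matching the layout listed above.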

The non-text tensors (`model.visual.*`, `mtp.*`) are ignored: we
don't reference them, so the safetensors mmap is fine even though
the bytes are loaded into the address space.

After this, the load that was failing at
"cannot find tensor model.embed_tokens.weight" should proceed to
materialising the actual layer weights — where any further bugs
will be substantive architecture issues rather than naming ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 16:48:16 +03:00
07c44d5db1 fix(qwen3_5): nested rope_parameters + partial_rotary_factor=0.25
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m37s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 5m2s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m11s
Two interlocked bugs surfaced trying to load Qwen/Qwen3.5-0.8B (and
the same applies to Qwen/Qwen3.6-27B):

1. Qwen3-Next config.json does NOT have a top-level `rope_theta`.
   It lives inside `rope_parameters: { rope_theta, partial_rotary_factor,
   rope_type, mrope_section, mrope_interleaved }`. Our TextConfig
   declared `rope_theta` as a non-optional top-level field, so the
   deserializer bailed with the misleading "missing field
   `rope_theta` at line 74 col 5".

   Replaced with a nested `RopeParameters` struct that mirrors the
   upstream shape. Defaults are conservative (rope_theta=10000,
   partial_rotary_factor=1.0) so a missing or partial block degrades
   to standard full-rotation RoPE rather than failing.

2. `partial_rotary_factor: 0.25` means only `head_dim * 0.25 = 64` of
   the 256 head_dim values get RoPE applied — the rest pass through
   unchanged. Our RotaryEmbedding was building the inv_freq table
   for the full head_dim and rotating everything. Silently wrong
   for every full-attention layer.

   `RotaryEmbedding` now derives `rotary_dim` from
   `head_dim * partial_rotary_factor`, builds its cos/sin tables at
   that smaller size, and in `apply()` splits q/k into (rotate, pass)
   on the last dim, applies `rope_slow` only to the rotate part, and
   re-concatenates. Mirrors the reference Python's
   `apply_rotary_pos_emb` exactly for the non-trivial
   `partial_rotary_factor` case.
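
Both fixes can be sketched on a single head vector in plain Rust. This is an illustration under assumptions, not the harness code: `RopeParameters` here carries only two of the fields and has no serde attributes, the helper names are hypothetical, and the rotation uses the rotate-half pairing (element `i` paired with element `i + rotary_dim / 2`).

```rust
/// Subset of the nested `rope_parameters` block, with conservative
/// defaults that degrade to standard full-rotation RoPE.
#[derive(Debug, Clone, PartialEq)]
pub struct RopeParameters {
    pub rope_theta: f64,
    pub partial_rotary_factor: f64,
}

impl Default for RopeParameters {
    fn default() -> Self {
        Self { rope_theta: 10_000.0, partial_rotary_factor: 1.0 }
    }
}

/// Number of leading head dimensions that get rotated (kept even).
pub fn rotary_dim(head_dim: usize, partial_rotary_factor: f64) -> usize {
    let d = ((head_dim as f64 * partial_rotary_factor) as usize).min(head_dim);
    d - d % 2
}

/// Rotate the first `rotary_dim` values of one head vector for
/// position `pos`; the remaining values pass through unchanged.
pub fn apply_partial_rope(x: &[f32], pos: usize, p: &RopeParameters) -> Vec<f32> {
    let rd = rotary_dim(x.len(), p.partial_rotary_factor);
    let half = rd / 2;
    let mut out = x.to_vec();
    for i in 0..half {
        // inv_freq_i = theta^(-2i / rotary_dim), sized to the rotated slice.
        let inv_freq = p.rope_theta.powf(-(2.0 * i as f64) / rd as f64);
        let (sin, cos) = (pos as f64 * inv_freq).sin_cos();
        let (a, b) = (x[i] as f64, x[i + half] as f64);
        out[i] = (a * cos - b * sin) as f32;
        out[i + half] = (b * cos + a * sin) as f32;
    }
    out
}
```

With `head_dim = 256` and `partial_rotary_factor = 0.25`, `rotary_dim` is 64, so indices `64..256` of every query and key head come back untouched.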

Tests updated: config-deserialise fixture uses the real `rope_parameters`
shape (matching the Qwen3.6-27B and Qwen3.5-0.8B configs). The
linear-attention forward-smoke test was already using full rotation
which still works; just shifted to the nested struct.

After this, the load that previously failed at "parse Qwen3-Next
(qwen3_5) config.json: missing field rope_theta" should reach the
actual safetensors materialisation step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 16:18:52 +03:00
e7eb3dab6a feat(stage-8c): full-attention layer + decoder + Model + ForCausalLM for qwen3_5
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m50s
build-prerelease / Build cortex binary (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m41s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 4m58s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s
Completes the single-GPU dense path for Qwen3-Next (Qwen3.6's
architecture). The four new modules wrap the substantive
`linear_attn.rs` (landed previously) with the rest of the
transformer:

- `arch/qwen3_5/rope.rs` — text-side rotary embedding. MRoPE is
  simplified to plain RoPE (the three position grids collapse to one
  for text-only inference); uses candle's `rope_slow` for the
  GLM-style rotate-half rotation.
- `arch/qwen3_5/mlp.rs` — Qwen3_5MLP (SwiGLU: gate/up/down, bias=False).
- `arch/qwen3_5/full_attn.rs` — Qwen3_5Attention with the two
  Qwen3-Next quirks:
  - `q_proj` widened to `2 * num_heads * head_dim`; second half
    sigmoid'd and multiplied into the attention output before `o_proj`.
  - q_norm/k_norm use the `(1+w)*x` RmsNorm variant.
- `arch/qwen3_5/decoder.rs` — Qwen3_5DecoderLayer dispatching on
  `layer_types[i]` to either Full attention or GatedDeltaNet.

`arch/qwen3_5/mod.rs` gets the real `Qwen3_5Model` (embedding + layer
stack + final norm) and `Qwen3_5ForCausalLM` (model + lm_head). The
forward returns `[B, 1, vocab]` to match `qwen3_dense`; the harness's
`squeeze_to_vocab` handles either shape.

Switch: `candle.rs::load_arch_dense` for `model_type=qwen3_5` now
builds a `ShardedVarBuilder` instead of a plain VarBuilder. The
sharded backend falls through to the unsharded path when
`world_size=1`, so single-GPU load is zero-cost; this lets the
forthcoming `tp_qwen3_5.rs` reuse the same load functions without a
second copy.

Verified: cargo build CPU + --features cuda inside the patched
container; clippy clean on both; 32 lib tests still pass. The
ForCausalLM forward no longer bails — but numerical correctness vs
the Python reference hasn't been validated yet (that's the next
step, with the Tbilisi probe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 15:52:33 +03:00
180274548d feat(stage-8c): linear-attention layer (Qwen3-Next GatedDeltaNet)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m17s
build-prerelease / Build neuron-blackwell (push) Successful in 3m48s
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m36s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 4m39s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
Implements the recurrent-path Gated DeltaNet block that occupies 48 of
Qwen3.6's 64 decoder layers (`layer_types[i] == "linear_attention"`).
Ported from `huggingface/transformers/models/qwen3_5/modeling_qwen3_5.py`
(`Qwen3_5GatedDeltaNet`, `torch_recurrent_gated_delta_rule`,
`Qwen3_5RMSNormGated`, `l2norm`).

Layout: `arch/qwen3_5.rs` becomes `arch/qwen3_5/` with submodules
- `mod.rs`         — Config + (still-stub) ForCausalLM
- `linear_attn.rs` — GatedDeltaNet + GatedDeltaNetState
- `rmsnorm.rs`     — Qwen3_5RmsNorm `(1+w)*x`, Qwen3_5RmsNormGated, l2norm

Architecture pieces in this commit:
- Block: in_proj_qkv + in_proj_z + in_proj_b + in_proj_a + out_proj
  (all bias=False); depthwise causal Conv1d (k=4) with state-aware
  prepend; SiLU; per-head reshape; L2norm on q,k.
- Discretisation: g = -exp(A_log) * softplus(a + dt_bias); beta = σ(b).
  All computed in f32 to avoid the -inf underflow in fp16 that the
  reference notes.
- Delta rule (recurrent, per-token):
    state *= exp(g_t)
    kv_mem = state^T · k_t
    delta  = (v_t - kv_mem) * beta_t
    state += outer(k_t, delta)
    out_t  = state^T · q_t
- Output: RMSNormGated(core_attn_out, z), then reshape, then out_proj.

State (`GatedDeltaNetState`) lives inline on the layer:
- conv_state: (B, conv_dim, conv_kernel_size) — left-padded tail.
- recurrent_state: (B, num_v_heads, head_k_dim, head_v_dim) — the
  delta-rule outer-product memory.
Cleared via `clear_kv_cache` at the start of every new request.
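
The discretisation and the per-token delta-rule update listed above can be written out for a single head in plain Rust. This is a minimal sketch, not the `linear_attn.rs` implementation: it works on `Vec<f32>` rather than tensors, handles one head and one token per call, and the `DeltaState` type, the function names, and the softplus threshold of 20 are assumptions made for the example.

```rust
/// Softplus with a large-input shortcut (threshold is an assumption).
pub fn softplus(x: f32) -> f32 {
    if x > 20.0 { x } else { x.exp().ln_1p() }
}

pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// g = -exp(A_log) * softplus(a + dt_bias); g is never positive, so
/// exp(g) acts as a decay factor of at most 1.
pub fn decay_gate(a_log: f32, a: f32, dt_bias: f32) -> f32 {
    -a_log.exp() * softplus(a + dt_bias)
}

/// Recurrent state for one head: a row-major [d_k, d_v] matrix.
pub struct DeltaState {
    pub d_k: usize,
    pub d_v: usize,
    pub s: Vec<f32>,
}

impl DeltaState {
    pub fn new(d_k: usize, d_v: usize) -> Self {
        Self { d_k, d_v, s: vec![0.0; d_k * d_v] }
    }

    /// Reset the memory, as done at the start of a new request.
    pub fn clear(&mut self) {
        self.s.iter_mut().for_each(|x| *x = 0.0);
    }

    /// One token of the gated delta rule; returns out_t (length d_v).
    pub fn step(&mut self, q: &[f32], k: &[f32], v: &[f32], g: f32, beta: f32) -> Vec<f32> {
        assert_eq!(q.len(), self.d_k);
        assert_eq!(k.len(), self.d_k);
        assert_eq!(v.len(), self.d_v);
        let (d_k, d_v) = (self.d_k, self.d_v);
        // state *= exp(g_t)
        let decay = g.exp();
        self.s.iter_mut().for_each(|x| *x *= decay);
        // kv_mem = state^T . k_t
        let mut kv_mem = vec![0.0f32; d_v];
        for i in 0..d_k {
            for j in 0..d_v {
                kv_mem[j] += self.s[i * d_v + j] * k[i];
            }
        }
        // delta = (v_t - kv_mem) * beta_t
        let delta: Vec<f32> = v.iter().zip(&kv_mem).map(|(vj, mj)| (vj - mj) * beta).collect();
        // state += outer(k_t, delta); out_t = state^T . q_t
        let mut out = vec![0.0f32; d_v];
        for i in 0..d_k {
            for j in 0..d_v {
                self.s[i * d_v + j] += k[i] * delta[j];
                out[j] += self.s[i * d_v + j] * q[i];
            }
        }
        out
    }
}
```

A quick sanity check of the rule: writing value `v` under a unit key with `beta = 1` and `g = 0`, then querying with the same key, returns `v`; repeating the same write leaves the state unchanged because `delta` is zero the second time.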

Config extended with the qwen3_5-specific fields:
- linear_num_value_heads (48 in Qwen3.6-27B)
- linear_num_key_heads   (16)
- linear_key_head_dim    (128)
- linear_value_head_dim  (128)
- linear_conv_kernel_dim (4)
- hidden_act             ("silu")

Performance note: this is the **recurrent** delta-rule (PyTorch's
`torch_recurrent_gated_delta_rule`), correct for any seq_len but with
O(L) sequential steps during prefill. The chunked algorithm
(`torch_chunk_gated_delta_rule`,
chunk_size=64) is a follow-up perf optimisation; surface stays the
same.

8 unit tests:
- softplus small/large branches
- l2norm hand-calc + zero-vector stability
- repeat_interleave round-trip
- forward_smoke on tiny dims (4-head fixture) — verifies shape +
  no NaN/Inf propagation through the f32-promotion pipeline. Doesn't
  validate numerical correctness against the Python reference; that
  requires a fixed-weight fixture and is the next step.

cargo clippy CPU + --features cuda both clean; 32 lib tests pass.
The ForCausalLM stub still bails on forward — wrapping
attention/MLP/decoder layer + lm_head is the next sub-stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:29:52 +03:00
a70f317729 feat(stage-8c): scaffold qwen3_5 (Qwen3.6) — dispatch + stubs + TP gate
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m14s
CI / Test (push) Successful in 4m29s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Build cortex binary (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m31s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 5m1s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s
Lays the wiring for the top-priority TP-2 target without doing the
substantive architecture work yet. After this commit, attempting to
load a Qwen3.6 (`model_type = "qwen3_5"`) model:
- Passes config.json parse — the real upstream shape (text_config
  wrapper, layer_types, attn_output_gate, head_dim=256, etc.) round-
  trips through a typed Config (unit test included).
- Constructs a placeholder Qwen3_5ForCausalLM, attaches it to a
  ModelArch::Qwen3_5Dense variant, registers it in the loaded set.
- Fails on the first inference forward with a clear "Qwen3-Next
  forward not implemented yet (Stage 8c, TP-2 motivator)" — the
  point where the real architecture work begins.

New layout:
- `harness/arch/` for custom architectures candle-transformers doesn't
  ship. Each architecture is one module: Config + ForCausalLM + impl.
- `harness/arch/qwen3_5.rs` — the scaffold. Heavy doc comments on the
  open work: layer_types dispatch (full_attention vs linear_attention,
  the latter being the hard part with no candle precedent),
  attn_output_gate, text_config nesting, recurrent state lifecycle.
- DENSE_SUPPORTED_MODEL_TYPES adds "qwen3_5"; load_arch_dense gains a
  branch that constructs the stub.

TP-side gate:
- New `check_tp_arch_supported`: even though Llama / Qwen3 MoE pass
  the single-GPU dense check (DENSE_SUPPORTED_MODEL_TYPES), the
  worker pool's `load_dense_shard` reconstructs the config as Qwen3
  on every rank — silently misrouting a non-Qwen3 dense load through
  it would surface as a cryptic per-rank deserialise error.
- TP_SUPPORTED_MODEL_TYPES = ["qwen3"] (cuda-gated). Anything else
  bails *before* the worker pool spawns and NCCL handshake costs are
  paid, with a marker pointing at the `tp_<family>.rs` module a
  contributor would need to add. qwen3_5 specifically lands here
  until its architecture is real.

The naming choice: keep "qwen3_5" from the model's own config.json
rather than mistralrs's "qwen3_next" — the latter ages poorly the
moment Qwen ship another architecture revision.

Unit tests: 2 new for qwen3_5 (config deserialise + dispatch gate);
the previously-rejecting test for qwen3_5 swapped to a fictional
arch so it stays meaningful as the supported set grows. 26 lib tests
pass; cargo clippy CPU + --features cuda both clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:58:01 +03:00
c6022aa6b9 feat(stage-8b): Llama + Qwen3 MoE families on the candle harness
All checks were successful
CI / Format (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Clippy (push) Successful in 2m6s
build-prerelease / Build neuron-blackwell (push) Successful in 3m50s
build-prerelease / Build cortex binary (push) Successful in 4m54s
CI / Test (push) Successful in 4m58s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 4m43s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Broadens the single-GPU dense and quantized paths to cover three
non-Qwen3 architectures already shipped by candle-transformers. TP for
these is a separate stage (each family would need its own tp_*.rs
mirroring tp_qwen3.rs).

`ModelArch` gains four variants:
- LlamaDense (boxed — wraps Llama + an inline Cache + the config it
  takes to rebuild the cache, since candle::llama::Cache has no reset)
- LlamaQuantized (candle_transformers::models::quantized_llama)
- Qwen3MoeDense (candle::models::qwen3_moe::ModelForCausalLM)
- Qwen3MoeQuantized (candle::models::quantized_qwen3_moe::GGUFQWenMoE
  — takes an explicit compute dtype; F16 by default for best
  consumer-GPU throughput)

The dispatch is method-based now:
- `ModelArch::forward(&mut self, input, offset) -> Result<Tensor>`
  with a shared `squeeze_to_vocab` normalising shape differences
  (qwen3 returns [B,1,V]; quantized_qwen3 returns [B,V]; new families
  may differ again — the helper handles all of them).
- `ModelArch::clear_kv_cache(&mut self) -> Result<()>`. Llama needs
  a Cache rebuild because its Cache has no in-place reset; the new
  `LlamaDense` wrapper holds the bits needed to do it.

`run_inference` / `run_inference_streaming` collapse to a single
dispatch path: no more per-variant match arms in the hot loop, and
new architectures pick up streaming + non-streaming for free with
zero changes outside `ModelArch`.

DENSE_SUPPORTED_MODEL_TYPES is now ["llama", "qwen3", "qwen3_moe"].
GGUF arch switch grows "qwen3moe" + "llama" branches (qwen3moe with
no underscore matches llama.cpp's general.architecture convention).
Stage 8a's diagnostic auto-reports the new supported set.

The `LlamaDense` variant is boxed because the wrapper's inline Cache
+ Config makes it 544 bytes vs ~300 for everything else
(clippy::large_enum_variant).

Verified: cargo test --workspace passes 66 tests; cargo clippy CPU
and `--features cuda` both clean (the cuda check ran inside the
locally-built `neuron-build-local` container with the math_functions.h
patch applied).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:36:22 +03:00
9e31d8deca feat(stage-8a): pre-flight architecture check for dense model loads
Some checks failed
CI / Format (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Clippy (push) Successful in 2m21s
CI / Test (push) Successful in 4m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m50s
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
A request to load Qwen/Qwen3.6-27B (model_type "qwen3_5") on the
dense path was failing deep inside serde with:
    missing field `vocab_size` at line 140 column 1
…because Qwen3.6 wraps its actual hyperparameters under `text_config`,
so none of `qwen3::Config`'s expected top-level fields are present.
The error gave no hint that the *architecture* was the problem.

`check_dense_config_supported` parses `config.json` as an untyped
JSON Value, inspects `model_type` (with `architectures` as bonus
context), and bails cleanly when it's not in the supported set
(currently `["qwen3"]`). The error names the rejected type, the
supported set, and points at the files a contributor needs to touch
to extend coverage — both the single-process `ModelArch` variants in
`candle.rs` and the TP analogue in `tp_qwen3.rs`.
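
The core of such a pre-flight gate might look like the sketch below. It deliberately skips the JSON step: the caller is assumed to have already pulled `model_type` out of `config.json` as an optional string, and the function name, signature, and error wording are hypothetical rather than copied from `check_dense_config_supported`.

```rust
/// Supported set at the time of this commit, per the message above.
pub const DENSE_SUPPORTED_MODEL_TYPES: &[&str] = &["qwen3"];

/// Reject unsupported architectures before any typed deserialisation,
/// naming both the rejected type and the supported set.
pub fn check_model_type_supported(
    model_type: Option<&str>,
    supported: &[&str],
) -> Result<(), String> {
    match model_type {
        None => Err("config.json has no string `model_type` field".to_string()),
        Some(mt) if supported.contains(&mt) => Ok(()),
        Some(mt) => Err(format!(
            "unsupported model_type {:?}; supported dense architectures: {:?}",
            mt, supported
        )),
    }
}
```

Because the supported set is passed in (or read from the constant), the diagnostic stays accurate as entries are added, without touching the message itself.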

Wired into both load paths:
- `load_arch_dense` (single-GPU), before the typed deserialize.
- `load_tp`, before spawning the worker pool — TP loads of an
  unsupported arch now fail before NCCL/init costs are paid.

4 unit tests cover the accept/reject/missing-field/malformed cases.
Bonus: makes Stage 8b/8c work easier — adding a new architecture is
now a `DENSE_SUPPORTED_MODEL_TYPES` edit + ModelArch variant + load
branch, with the diagnostic automatically listing the updated supported set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:27:29 +03:00
b400e8b704 feat(neuron): honour HF_HUB_CACHE / HF_HOME for the candle harness cache
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 31s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Build cortex binary (push) Successful in 4m17s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
CI / Format (push) Successful in 32s
CI / Test (push) Failing after 51s
CI / Clippy (push) Successful in 2m17s
build-prerelease / Build neuron-ampere (push) Successful in 4m58s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ada (push) Successful in 5m1s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m4s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Resolves the candle harness's HuggingFace cache directory with the
following precedence (first hit wins):

1. Explicit `hf_cache` in `[harness.candle]` from neuron.toml.
2. `HF_HUB_CACHE` env var — the Python `huggingface_hub` convention.
   The Rust hf-hub crate doesn't read this natively, so we bridge here.
3. `HF_HOME` env var (`$HF_HOME/hub` per the canonical layout).
4. None — falls through to hf-hub's own default.
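
The precedence chain above reduces to a small pure function. In this sketch the three inputs are passed in explicitly so the logic can be tested without touching the process environment; the function name is hypothetical, and treating an empty string as a set value is an assumption the real code may not share.

```rust
use std::path::PathBuf;

/// Resolve the HuggingFace cache directory; first hit wins.
/// `None` means "fall through to hf-hub's own default".
pub fn resolve_hf_cache(
    explicit: Option<&str>,     // `hf_cache` from `[harness.candle]`
    hf_hub_cache: Option<&str>, // value of `HF_HUB_CACHE`
    hf_home: Option<&str>,      // value of `HF_HOME`
) -> Option<PathBuf> {
    if let Some(p) = explicit {
        return Some(PathBuf::from(p));
    }
    if let Some(p) = hf_hub_cache {
        return Some(PathBuf::from(p));
    }
    // `$HF_HOME/hub` per the canonical layout.
    hf_home.map(|home| PathBuf::from(home).join("hub"))
}
```

A caller would typically feed it `std::env::var("HF_HUB_CACHE").ok().as_deref()` and the `HF_HOME` equivalent, alongside the value parsed from `neuron.toml`.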

Honouring HF_HUB_CACHE lets a neuron host reuse an existing cache
directory shared with Python tooling or other harnesses on the same
host without per-tool config. The canonical per-host setup is a
systemd drop-in:

    /etc/systemd/system/neuron.service.d/local.conf
    [Service]
    Environment=HF_HUB_CACHE=/archive/hf-cache

neuron.example.toml documents the resolution chain inline.

script/validate-neuron.sh: bump LOAD_TIMEOUT from 600s to 3600s and
expose both load/infer timeouts via env (NEURON_LOAD_TIMEOUT,
NEURON_INFER_TIMEOUT). A Qwen3.6-class dense model is ~54 GB and was
hitting the 10-min ceiling cold-downloading on a residential link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:52:50 +03:00
62ca125a68 chore: keep models.example.toml generic; deploy.sh syncs local models.toml
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m22s
CI / Test (push) Successful in 4m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has started running
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
Reverts the previous commit's naming of specific helexa neuron hosts
in the shipped example catalogue (`models.example.toml`) — the example
is supposed to be a generic starting point that any operator copies
and adapts, not a record of one particular fleet's layout.

- `pinned_on` in the TP example uses the placeholder
  `"your-multi-gpu-neuron"`. Other entries keep the model ids
  (since those are HuggingFace-canonical, not fleet-specific).
- New `models.toml` at repo root holds the helexa-fleet catalogue
  (beast / benjy / quadbrat). Added to `.gitignore` alongside
  `cortex.toml` — both are operator-owned, gitignored, RPM-marked
  `%config(noreplace)`, and synced by `deploy.sh`.
- `deploy.sh` now rsyncs `models.toml` to `/etc/cortex/models.toml`
  on the gateway host on the same lifecycle as `cortex.toml`. Skips
  cleanly when no local file exists, so users without a catalogue
  aren't surprised by silent overwrites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:47:08 +03:00
735945ee81 feat(cortex): unified /v1/models — catalogue × topology feasibility + cold-load
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Format (push) Successful in 48s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m10s
build-prerelease / Build neuron-blackwell (push) Successful in 3m35s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads on the first request, and routes the request once the load
is done — without per-node configuration or client awareness of which
neuron hosts what.

cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>` — populated
  once per neuron on first successful poll, cached forever after
  (topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)` with the
  pinned_on / min_devices / min_device_vram_mb logic + 5 unit tests.
- CortexModelEntry expanded with OpenAI-compatible fields (`id`,
  `object`, `created`, `owned_by`) plus helexa-specific extension
  fields (`loaded`, `feasible_on`, `locations`).
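
A rough sketch of how a feasibility check over pinned_on / min_devices / min_device_vram_mb might look. The struct layouts and the exact semantics (an empty `pinned_on` meaning "not pinned", VRAM compared per device) are assumptions for illustration, not the real cortex-core types.

```rust
/// Hypothetical, simplified stand-in for a discovered device.
struct Device {
    vram_mb: u64,
}

/// Hypothetical, simplified stand-in for a catalogue profile.
struct ModelProfile {
    pinned_on: Vec<String>, // assumed: empty means "any neuron"
    min_devices: usize,
    min_device_vram_mb: u64,
}

impl ModelProfile {
    /// True when `neuron` is allowed by the pin list and has at least
    /// `min_devices` devices that each meet the VRAM floor.
    fn is_feasible_on(&self, neuron: &str, devices: &[Device]) -> bool {
        let pinned_elsewhere =
            !self.pinned_on.is_empty() && !self.pinned_on.iter().any(|n| n == neuron);
        if pinned_elsewhere {
            return false;
        }
        let big_enough = devices
            .iter()
            .filter(|d| d.vram_mb >= self.min_device_vram_mb)
            .count();
        big_enough >= self.min_devices
    }
}
```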

cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
  neuron and caches on NodeState.
- handlers.rs::list_models rewritten as union of (catalogue × topology
  feasibility) + (currently loaded somewhere). Catalogue-defined models
  surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
    1. loaded somewhere → route there
    2. unloaded somewhere → route + lazy load via neuron
    3. in catalogue → pick feasible neuron, POST /models/load, wait,
       route. Cache the new entry locally so subsequent requests skip
       the poll wait.
    4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
  feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
  by VRAM floor and setting tensor_parallel = min_devices for multi-
  device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
  requests racing the same cold-load is a benign outcome).
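
The resolve priority and the pinned-first pick can be sketched as pure decision logic. This is illustrative only: `Route`, `Candidate` and the function signatures are hypothetical, and the real step 3 also issues POST /models/load and waits, which is omitted here.

```rust
/// Hypothetical outcome of routing a request for one model id.
#[derive(Debug, PartialEq)]
enum Route {
    Loaded(String),   // 1. already loaded on this neuron
    LazyLoad(String), // 2. known but unloaded; the neuron loads lazily
    ColdLoad(String), // 3. in catalogue; load on this neuron, then route
    NotFound,         // 4. 404
}

/// A neuron on which the catalogue profile is feasible (assumed shape).
struct Candidate {
    name: String,
    pinned: bool,
}

/// Prefer pinned neurons, then fall back to any feasible one,
/// breaking ties by name so the choice is stable.
fn pick_feasible_neuron(feasible: &[Candidate]) -> Option<&str> {
    feasible
        .iter()
        .min_by(|a, b| b.pinned.cmp(&a.pinned).then_with(|| a.name.cmp(&b.name)))
        .map(|c| c.name.as_str())
}

/// `feasible` is empty when the model is not in the catalogue
/// or no neuron can host it.
fn resolve(loaded_on: Option<&str>, unloaded_on: Option<&str>, feasible: &[Candidate]) -> Route {
    if let Some(n) = loaded_on {
        return Route::Loaded(n.to_string());
    }
    if let Some(n) = unloaded_on {
        return Route::LazyLoad(n.to_string());
    }
    match pick_feasible_neuron(feasible) {
        Some(n) => Route::ColdLoad(n.to_string()),
        None => Route::NotFound,
    }
}
```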

models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working TP example (Qwen3.6-27B pinned on beast) plus single-GPU
profiles for the smaller models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:39:04 +03:00
f72dee094f feat(tp): Stage 7c-i — streaming SSE through TP
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 5m3s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m7s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
`chat_completion_stream` no longer returns an error for TP loads. The
new `chat_completion_tp_stream` mirrors the non-streaming TP path
(clear_kv_cache, prefill, sample, decode loop) but emits one
`ChatCompletionChunk` per generated token over an mpsc channel so the
handler can write a streaming SSE response.

Unlike the single-GPU streaming path (which runs candle's forward
inside `spawn_blocking` and uses `blocking_send`), the TP loop is
itself async — every `pool.generate_step` already awaits the leader's
own spawn_blocking forward plus every worker's recv_only. So the
orchestration runs as a plain `tokio::spawn` task using `Sender::send`.

The shared `emit_chunk` helper tracks the cumulative decoded prefix and
emits the delta — same UTF-8-safe BPE boundary handling as the
single-GPU streaming path.
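
A minimal sketch of the cumulative-prefix idea, assuming the tokenizer decodes an incomplete multi-byte sequence to U+FFFD. `DeltaEmitter` is a hypothetical stand-in; the real `emit_chunk` works on token ids and `ChatCompletionChunk`s rather than pre-decoded strings.

```rust
/// Remembers the text already sent to the client.
struct DeltaEmitter {
    sent: String,
}

impl DeltaEmitter {
    fn new() -> Self {
        Self { sent: String::new() }
    }

    /// `decoded` is the detokenized text of ALL tokens generated so far.
    /// Returns the new suffix to stream, or None when nothing should be
    /// sent yet.
    fn emit_chunk(&mut self, decoded: &str) -> Option<String> {
        // Assumption: a trailing U+FFFD means the last token ended in the
        // middle of a code point, so hold back until the next token.
        if decoded.ends_with('\u{FFFD}') {
            return None;
        }
        // Only the part beyond the already-sent prefix is new.
        let delta = decoded.strip_prefix(self.sent.as_str())?;
        if delta.is_empty() {
            return None;
        }
        let delta = delta.to_string();
        self.sent = decoded.to_string();
        Some(delta)
    }
}
```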

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:32:46 +03:00
d46d8d4f6c feat(tp): Stage 7b-iv — RPC + orchestration for TP load/inference
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m20s
build-prerelease / Build cortex binary (push) Successful in 4m25s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
CI / Test (push) Successful in 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m57s
build-prerelease / Build neuron-ampere (push) Successful in 4m51s
build-prerelease / Build neuron-ada (push) Successful in 5m12s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m49s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m51s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Wires the in-flight TP machinery (Stage 7a workers, 7b-iii sharded
Qwen3) end to end so a non-streaming chat completion can run across
multiple GPUs via NCCL.

RPC additions (tp/rpc.rs):
- LoadDenseShard{model_id, config_json, safetensors_paths}
- GenerateStep{model_id, tokens, offset}
- ClearKvCache{model_id}
- UnloadModel{model_id}
- LoadDenseShardOk / GenerateStepOk / KvCacheCleared / Unloaded

Worker side (tp/worker.rs):
- WorkerState gains a `models: HashMap<String, TpQwen3ForCausalLM>`
  keyed by model_id. LoadDenseShard mmaps safetensors via
  ShardedVarBuilder (only this rank's slice materialises), builds the
  TP model with the rank's NCCL Comm cloned from NcclState.
- GenerateStep runs the rank-local forward; the resulting logits are
  dropped (only the leader's are used for sampling). The forward's
  value here is the NCCL collectives inside the row-parallel layers
  letting the leader's rank-0 forward make progress.

Pool side (tp/mod.rs):
- WorkerPool::load_dense_shard fans LoadDenseShard out to every worker,
  builds rank 0's shard on the leader via spawn_blocking with a fresh
  SendComm wrapper at the move boundary (Comm is !Send at the type
  level), collects per-rank LoadDenseShardOk. Returns the leader's
  Arc<Mutex<TpQwen3ForCausalLM>>.
- WorkerPool::generate_step fans GenerateStep out, runs the leader's
  rank-0 forward in spawn_blocking (the AllReduce CustomOps inside
  row-parallel layers block until every worker issues the matching
  collective), returns the leader's last-position logits Tensor.
- WorkerPool::clear_kv_cache + unload_model follow the same pattern.

NcclState refactor (tp/nccl_state.rs):
- comm field becomes Option<Arc<Comm>> (was Option<Comm>) so callers
  can share a clone with TpQwen3ForCausalLM::load.
- new `comm()` accessor + `SendComm` wrapper for spawn_blocking moves.
- single allow(clippy::arc_with_non_send_sync) at the canonical
  construction site (Comm is !Send by type but the runtime invariant
  is enforced by SendComm + the pool's Mutex).

Harness side (candle.rs):
- LoadedHandle enum (Single | Tp) replaces the bare Arc<LoadedModel>
  in the harness's registry. list_models / unload_model /
  inference_endpoint walk the enum uniformly.
- TpLoadedModel holds the pool + leader_model + tokenizer + devices.
- load_model dispatches on `spec.tensor_parallel > 1` to a new
  cuda-gated load_tp path: resolve dense files via hf-hub, spawn the
  pool, init_nccl, load_dense_shard.
- chat_completion branches on the handle variant. The TP path mirrors
  run_inference: clear_kv_cache, prefill, sample, decode loop,
  detokenize. Acquires the pool Mutex for the whole request.
- Streaming through TP is deferred to Stage 7c (returns Other(err)).

Script (script/validate-neuron.sh):
- 4th positional arg `tp_size` (default 1). When >1, switches to the
  dense path (TP and GGUF are mutually exclusive, so it bails) and adds
  `tensor_parallel` + `devices` to the load payload. NEURON_DEVICES
  env overrides the default 0..N-1 device list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:38:33 +03:00
9b8bd146f6 feat(tp): --tp-smoke CLI subcommand + remote validation script
All checks were successful
CI / Format (push) Successful in 36s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m32s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m43s
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 4m56s
build-prerelease / Build neuron-ada (push) Successful in 5m1s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
Adds a one-shot diagnostic that exercises the lower half of the TP
stack — WorkerPool::spawn, init_nccl, nccl_sanity_check — in isolation
from model load and inference. Runs N-1 worker subprocesses (rank 0
stays in this process), joins them in an NCCL communicator on the
specified CUDA devices, all_reduces a sentinel 1u32 per rank, verifies
the observed_sum equals world_size on every rank, then shuts down.

Output is `status=ok` on stdout (plus key=value lines for tp_size and
cuda_devices) when every check passes, non-zero exit + tracing on
stderr otherwise. The smoke command is diagnostic-only and not exposed
through the daemon HTTP API.

script/tp-smoke.sh wraps it with an ssh invocation against a fleet
host (default beast — the only host with 2 GPUs) and asserts the
status line, mirroring the validate-neuron.sh ergonomics.

This is step 1 of the TP test plan. A failure here means TP cannot
work on the host at all; step 2 (Stage 7b-iv) wires real model load
and inference through the same WorkerPool primitives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 19:40:25 +03:00
96d8755245 fix(tp): add half dep + drop double-wrapped .w() on CudaDevice::alloc
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m50s
build-prerelease / Build neuron-blackwell (push) Successful in 3m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m32s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 4m42s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m12s
Two follow-up cuda-only fixes surfaced by `cargo build --features cuda`
inside the cuda-13.0 runner container:

1. `half::{bf16, f16}` was an undeclared dep. Added `half = "2.5"`
   (matching candle-core's pinned major) under the cuda feature flag.
2. `dev.alloc::<T>(n)` already returns `candle_core::Result` (it calls
   `.w()` internally on the cudarc error). Calling `.w()?` on top of
   that needs `From<candle_core::Error> for CudaError`, which doesn't
   exist — collapse to `?`. Removed the now-unused
   `cuda_backend::WrapErr` import.

Verified by `cargo build -p neuron --features cuda` and
`cargo clippy -p neuron --all-targets --features cuda -- -D warnings`
inside `git.lair.cafe/gongfoo/runner-cuda-13.0` with the local
glibc/CUDA-13.0 math_functions.h noexcept patch. CPU clippy/tests stay
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 19:11:59 +03:00
12549c9aed fix(tp): import BackendStorage trait for CudaStorage methods
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 3m9s
CI / Test (push) Successful in 4m28s
build-prerelease / Build neuron-blackwell (push) Failing after 3m41s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m32s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Failing after 4m45s
build-prerelease / Build neuron-ada (push) Failing after 5m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Stage 7b-iii (1/2) introduced AllReduce with `s.device()` and
`s.dtype()` calls on `&CudaStorage`. Both come from the
`candle_core::backend::BackendStorage` trait, which wasn't imported —
fine on CPU builds (the cuda_fwd block was cfg-gated out) but the
prerelease cuda build hit E0599.

Also drop the unused `cudarc::driver::DeviceSlice` import inside
cuda_fwd — `CudaSlice::len()` is an inherent method on cudarc 0.19,
not a trait method.

Caught by run 2894 (build-neuron-{blackwell,ampere}); CPU clippy +
tests stay green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 18:32:05 +03:00
46527d7804 feat(tp): TP-aware Qwen3 dense model (Stage 7b-iii 2/2)
Mirrors candle_transformers::models::qwen3 structurally with column-
parallel q/k/v + gate/up projections, row-parallel o + down projections,
and replicated embedding/norms/lm_head. Per-rank head counts come from
dividing num_attention_heads / num_key_value_heads by world_size at load
time; intermediate_size split likewise. Load bails on any non-divisible
shape — the safetensors slice would lose data otherwise.
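
The divisibility rule can be sketched as follows. `RankShape` and `per_rank_shape` are hypothetical names; only the divide-or-bail behaviour is taken from the description above.

```rust
/// Per-rank dimensions after splitting across `world_size` ranks.
#[derive(Debug, PartialEq)]
struct RankShape {
    num_heads: usize,
    num_kv_heads: usize,
    intermediate_size: usize,
}

/// Divide each sharded dimension by `world_size`, refusing any shape
/// that does not split evenly.
fn per_rank_shape(
    num_attention_heads: usize,
    num_key_value_heads: usize,
    intermediate_size: usize,
    world_size: usize,
) -> Result<RankShape, String> {
    if world_size == 0 {
        return Err("world_size must be at least 1".to_string());
    }
    let split = |name: &str, value: usize| -> Result<usize, String> {
        if value % world_size != 0 {
            return Err(format!(
                "{name}={value} is not divisible by world_size={world_size}"
            ));
        }
        Ok(value / world_size)
    };
    Ok(RankShape {
        num_heads: split("num_attention_heads", num_attention_heads)?,
        num_kv_heads: split("num_key_value_heads", num_key_value_heads)?,
        intermediate_size: split("intermediate_size", intermediate_size)?,
    })
}
```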

KV cache holds the rank-local slice since K/V come out of column-parallel
projections; no cache resharding across ranks. Causal mask is computed
on rank 0 shape and broadcasts over the head dim so per-rank H differs
without rework.

Replicated tensors (embedding, all RmsNorms, untied lm_head) load via
vb.get(shape, name), which uses the default Shard { world_size: 1 } and
falls through to the unsharded backend path on ShardedSafeTensors.

The cuda / non-cuda load splits track the existing tp_linear pattern:
RowParallelLinear takes an Arc<Comm> only under cuda, and the higher-
level composers (TpQwen3MLP, TpQwen3Attention, TpDecoderLayer,
TpQwen3Model, TpQwen3ForCausalLM) thread it through accordingly.

7b-iv wires RPC + dispatch in CandleHarness::load_model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 18:24:20 +03:00
8d3194f992 Stage 7b-iii (1/2): AllReduce CustomOp + ShardedVarBuilder-backed TP linears
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m16s
build-prerelease / Build neuron-blackwell (push) Failing after 3m19s
CI / Test (push) Successful in 4m26s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m22s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Failing after 4m58s
build-prerelease / Build neuron-ada (push) Failing after 4m53s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Ports the canonical
candle-examples/examples/llama_multiprocess/model.rs pattern into
the harness. Two new files, one deletion:

- harness/tp/all_reduce.rs — AllReduce wraps Arc<cudarc::nccl::Comm>
  and implements candle's CustomOp1 trait. cuda_fwd extracts the
  rank's CudaSlice<dtype> from a CudaStorage, asserts the input is
  contiguous (a strided activation hitting all_reduce is almost
  always a model construction bug), allocates an output CudaSlice
  on the same device, calls Comm::all_reduce(Sum), and wraps the
  result back as a CudaStorage. Handles BF16, F16, F32. NcclError
  surfaces via {e:?} (no Display impl in cudarc 0.19.x). Send/Sync
  hand-impl'd with the same NCCL-thread-safety caveat candle's
  example documents.

- harness/tp/tp_linear.rs — ColumnParallelLinear and
  RowParallelLinear, both built on candle's ShardedVarBuilder +
  Shard hints. `vb.get_with_hints((), "weight", shard(dim, rank, ws))`
  reads JUST the rank's slice from the safetensors view; no full-
  tensor host materialisation. ColumnParallel.forward is a plain
  local matmul (output is naturally sharded). RowParallel.forward =
  local matmul + apply_op1_no_bwd(&self.all_reduce). On CPU /
  world_size == 1, the AllReduce is skipped and the partial output
  is returned as-is. Both layers are no-bias — every Qwen3-family
  target sets attention_bias=false; bias-aware sharding is a
  future-model concern.

- Deletes harness/tp/sharded_linear.rs from 7b-ii. That commit's
  hand-rolled "load full + narrow" approach was useful exploration
  but candle's ShardedVarBuilder does the same work without
  materialising the full tensor on host. The 5 unit tests there
  verified the slicing math against an unsharded reference; that
  math now lives inside candle and is covered by candle's own tests.

Next (7b-iii 2/2): TpQwen3Attention + TpQwen3MLP composing the
column/row pair, then a TpQwen3Model that runs the full forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 18:14:54 +03:00
5436af9c73 fix(neuron/candle): dense Qwen3 returns rank-3 logits, double-squeeze
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m19s
build-prerelease / Build neuron-blackwell (push) Successful in 3m32s
CI / Test (push) Successful in 4m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-ampere (push) Successful in 4m55s
build-prerelease / Build neuron-ada (push) Successful in 5m11s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m35s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Caught by live validation against Qwen/Qwen3-1.7B on beast:
  HTTP 500 "unexpected rank, expected: 1, got: 2 ([1, 151936])"

Candle's qwen3::ModelForCausalLM::forward returns shape [B, 1, V]
(no final squeeze) while quantized_qwen3::ModelWeights::forward
returns [B, V] (with squeeze(1) at the end). My match arms applied
a single squeeze(0) uniformly, which is correct for the quantized
path ([1, V] → [V]) but takes the dense path only from [1, 1, V] to
[1, V], which then trips apply_repeat_penalty::to_vec1() expecting rank 1.

Dense match arms now strip both batch and seq dims:
  model.forward(&input, offset)?.squeeze(0)?.squeeze(0)?

Also fixes validate-neuron.sh's `${3:-Q4_K_M}` → `${3-Q4_K_M}`
(no colon) so passing an explicit empty third arg now drives the
dense path instead of falling back to Q4_K_M.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:49:43 +03:00
8e882c0757 fix(neuron/tp): NcclError {e:?} + cudarc 0.19 deprecation cleanup
All checks were successful
CI / Format (push) Successful in 38s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Clippy (push) Successful in 2m15s
build-prerelease / Build neuron-blackwell (push) Successful in 3m35s
CI / Test (push) Successful in 5m0s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m51s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 4m55s
build-prerelease / Build neuron-ada (push) Successful in 4m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
Two cuda-feature-only build errors only the CI runner catches:

1. cudarc::nccl::NcclError doesn't impl Display in 0.19.x, so the
   `format!("...: {e}")` map_err calls fail to compile when the cuda
   feature actually wires them up. Switch every NcclError-typed `{e}`
   in nccl_state.rs to `{e:?}` — surfaces variant + ncclResult code
   in the same diagnostic shape just via Debug instead of Display.
2. cudarc::CudaStream::memcpy_stod / memcpy_dtov are deprecated in
   0.19.7 in favour of clone_htod / clone_dtoh. The replacements
   take/return the same types, so the swap is mechanical.
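
For illustration, the Debug-versus-Display point in item 1 in isolation. `FakeNcclError` is a hypothetical stand-in for an error type that derives Debug but has no Display impl; it is not cudarc's actual type.

```rust
/// Stand-in error with Debug only (no Display impl).
#[derive(Debug)]
enum FakeNcclError {
    Code(i32),
}

/// `{e}` would fail to compile here because `FakeNcclError` does not
/// implement Display; `{e:?}` goes through Debug instead.
fn describe(e: &FakeNcclError) -> String {
    format!("nccl init failed: {e:?}")
}
```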

Dev box can't compile with --features cuda (no nvcc), so these only
surface in the build-prerelease CUDA matrix jobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:24:13 +03:00
93421f48e2 Stage 7b-ii: ColumnParallel + RowParallel sharded linear primitives
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 31s
CI / Clippy (push) Failing after 49s
build-prerelease / Build neuron-blackwell (push) Failing after 3m29s
build-prerelease / Build cortex binary (push) Successful in 4m41s
CI / Test (push) Successful in 5m6s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Failing after 5m1s
build-prerelease / Build neuron-ada (push) Failing after 4m53s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Adds harness/tp/sharded_linear.rs with ShardedLinear — a Megatron-LM
style sharded wrapper over candle_nn::Linear. Two constructors:

- load_column: splits the output dimension. Each rank holds rows
  [r*out/N .. (r+1)*out/N] of the weight, plus its slice of the bias.
  Forward = local matmul; output is naturally sharded; downstream
  consumer either accepts the shard (next layer is column-parallel)
  or merges via all-gather later.
- load_row: splits the input dimension. Each rank holds cols
  [r*in/N .. (r+1)*in/N] of the weight; bias lives only on rank 0
  so the post-all_reduce sum carries it exactly once. Forward
  produces a partial output that the caller reduces via NCCL.
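
The two sharding schemes can be checked on plain vectors. This sketch uses nested `Vec<f32>` instead of candle tensors and omits bias; `shard_range`, `column_shard` and `row_shard` are hypothetical helpers, not the ShardedLinear API.

```rust
use std::ops::Range;

/// Weight stored as [out_features][in_features].
type Matrix = Vec<Vec<f32>>;

/// y[o] = sum_i w[o][i] * x[i]
fn matvec(w: &Matrix, x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Index range owned by `rank`, or an error if `dim` does not split evenly.
fn shard_range(dim: usize, rank: usize, world: usize, what: &str) -> Result<Range<usize>, String> {
    if world == 0 || rank >= world {
        return Err(format!("{what}: invalid rank {rank} for world_size {world}"));
    }
    if dim % world != 0 {
        return Err(format!("{what}: {dim} is not divisible by world_size {world}"));
    }
    let per = dim / world;
    Ok(rank * per..(rank + 1) * per)
}

/// Column-parallel: split the output dimension (rows of the weight).
fn column_shard(w: &Matrix, rank: usize, world: usize) -> Result<Matrix, String> {
    let range = shard_range(w.len(), rank, world, "column-parallel")?;
    Ok(w[range].to_vec())
}

/// Row-parallel: split the input dimension (columns of the weight).
/// Each rank must be fed the matching slice of the input.
fn row_shard(w: &Matrix, rank: usize, world: usize) -> Result<Matrix, String> {
    let in_features = w.first().map_or(0, |row| row.len());
    let range = shard_range(in_features, rank, world, "row-parallel")?;
    Ok(w.iter().map(|row| row[range.clone()].to_vec()).collect())
}
```

Summing the row-parallel partials elementwise across ranks plays the role of the all_reduce.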

Both constructors bail with a clear error when divisibility doesn't
hold — the precondition mistral.rs's first qwen3-next-tp commit
made explicit. The path included in the error is the VarBuilder
prefix, so the operator sees exactly which projection failed
("column-parallel 'model.layers.0.self_attn.q_proj': out_features=...").

5 unit tests on CPU verify the math against an unsharded reference:
- column shard produces the expected slice of the full matmul
- row partials sum to the unsharded result
- row bias appears only on rank 0
- divisibility violations bail (column + row)

forward_with_comm() is stubbed for row-parallel (CUDA-only) — wiring
the actual cudarc::nccl all_reduce against candle's Tensor lands in
7b-iii alongside the model assembly, where the model holds the Comm
in scope. ColumnParallel's forward_with_comm just delegates to the
local matmul (no collective needed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:07:19 +03:00
05e15f3597 Stage 7b-i: dense safetensors Qwen3 load path
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 43s
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Clippy (push) Successful in 2m4s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
Adds the bf16/fp16 safetensors path alongside the existing GGUF
quantized one. The harness now dispatches by ModelSpec.quant:
- Some(_) → GGUF (pre-quantized, single-GPU only path, unchanged).
- None    → safetensors dense (new).
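
A sketch of the dispatch on `quant`, with a hypothetical `LoadPath` enum standing in for the harness's internal branches:

```rust
/// Which loader a spec selects (illustrative names only).
#[derive(Debug, PartialEq)]
enum LoadPath {
    /// Pre-quantized GGUF, single-GPU only.
    Gguf { quant: String },
    /// bf16/fp16 safetensors.
    DenseSafetensors,
}

/// `quant` mirrors `ModelSpec.quant`: Some(_) selects GGUF,
/// None selects the dense safetensors path.
fn select_load_path(quant: Option<&str>) -> LoadPath {
    match quant {
        Some(q) => LoadPath::Gguf { quant: q.to_string() },
        None => LoadPath::DenseSafetensors,
    }
}
```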

The dense path uses candle-transformers::models::qwen3::ModelForCausalLM
verbatim, fed via VarBuilder::from_mmaped_safetensors over the files
listed in `model.safetensors.index.json` (sharded layout) or the
single `model.safetensors` fallback. dtype is bf16 to match the
canonical Qwen3 HF distribution dtype. tokenizer.json is fetched from
the same repo (no -GGUF suffix to strip).

ModelArch gains a Qwen3Dense variant; the forward signature mirrors
QuantizedQwen3Weights (same `forward(&Tensor, offset)` → last-position
logits), so run_inference / run_inference_streaming just add a parallel
match arm — no shape changes downstream.

This is the foundation 7b-ii (ColumnParallel/RowParallel) builds on:
because the source is dense safetensors that can be byte-sliced per
rank, the TP work avoids the GGUF super-block alignment problem
entirely. Vanilla GGUF inference keeps working unchanged.

validate-neuron.sh learns the dense path: pass an empty third arg
(quant) and the script omits the `quant` field from the load
payload, triggering the dense dispatch. Example:
  script/validate-neuron.sh beast.hanzalova.internal Qwen/Qwen3-0.6B ''

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:03:59 +03:00
da068ded6d Stage 7a-ii: real NCCL handshake behind the worker pool
Some checks failed
CI / Format (push) Failing after 38s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 2m18s
build-prerelease / Build neuron-blackwell (push) Failing after 3m33s
CI / Test (push) Successful in 4m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Failing after 4m19s
build-prerelease / Build neuron-ada (push) Failing after 4m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i.
With --features cuda the leader and its workers now establish a live
NCCL communicator end-to-end; without the feature the same code paths
return Error{kind="cuda_feature_not_enabled"} so a misconfigured
build fails visibly instead of being a silent no-op.

NCCL state machine (harness/tp/nccl_state.rs) is shared between the
worker process and the leader's pool:
- generate_comm_id_hex() mints an Id::new() on the leader.
- NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit,
  opens a CudaContext on the configured device, calls Comm::from_rank
  with the supplied (rank, world_size, id). NCCL blocks until every
  rank has joined.
- NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader
  asserts every rank reports observed_sum == world_size.
- NCCL handles serialised under Mutex; unsafe impl Send/Sync gates
  the Comm across spawn_blocking boundaries (NCCL is move-safe; only
  concurrent op issuance is unsafe).
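
For reference, the wire encoding described above (128 id bytes carried
as 256 hex characters) can be sketched with std alone. This is a
minimal illustration, not the code in nccl_state.rs: the names
id_to_hex / hex_to_id are hypothetical, and the id is modelled as
[u8; 128] rather than [c_char; 128] to keep the sketch portable.

```rust
/// Encode a 128-byte unique id as 256 lowercase hex characters.
/// (Hypothetical helper name; the real encoder may differ.)
pub fn id_to_hex(id: &[u8; 128]) -> String {
    let mut out = String::with_capacity(256);
    for b in id {
        out.push_str(&format!("{b:02x}"));
    }
    out
}

/// Decode 256 hex characters back into the 128 id bytes.
/// Rejects wrong lengths and any non-hex character.
pub fn hex_to_id(hex: &str) -> Result<[u8; 128], String> {
    if hex.len() != 256 {
        return Err(format!("expected 256 hex chars, got {}", hex.len()));
    }
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("non-hex character in comm id".to_string());
    }
    let mut id = [0u8; 128];
    for (i, slot) in id.iter_mut().enumerate() {
        // Every character is ASCII at this point, so byte slicing is safe.
        let pair = &hex[2 * i..2 * i + 2];
        *slot = u8::from_str_radix(pair, 16)
            .map_err(|e| format!("bad hex at byte {i}: {e}"))?;
    }
    Ok(id)
}
```

Validating the length and character set up front means a truncated or
corrupted Init payload is rejected before any NCCL call is attempted.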

WorkerPool::init_nccl orchestrates the rendezvous:
1. Write Init { comm_id } to every worker's stdin (no await yet).
2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking,
   concurrently with workers.
3. NCCL handshake completes for all ranks simultaneously.
4. Leader collects InitOk responses.
WorkerPool::nccl_sanity_check follows the same pattern over
all_reduce, validating world_size == observed_sum on every rank.

Worker.send_only / Worker.recv_only split out from the previous
monolithic Worker.request so the leader can interleave its own NCCL
work with the worker calls — required because NCCL blocks during
init.

Tests:
- 4 hex roundtrip unit tests for the wire encoding.
- The 7a-i "not implemented" expectation now reads
  "cuda_feature_not_enabled" on the local dev box (no CUDA), or
  accepts InitOk on a cuda-built test binary.
- New cuda-integration test in tp_worker_lifecycle_cuda.rs covers
  the real init + sanity round-trip; gated on the cuda-integration
  feature so default CI doesn't try to initialise NCCL.

Verifiable on beast (2× RTX 5090):
  cargo test -p neuron --features cuda-integration \
        --test tp_worker_lifecycle_cuda

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 16:40:01 +03:00
2a7ede0232 Stage 7a-i: TP worker lifecycle scaffolding
All checks were successful
CI / Format (push) Successful in 36s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m25s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m49s
build-prerelease / Build cortex binary (push) Successful in 4m22s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 5m9s
build-prerelease / Build neuron-ada (push) Successful in 4m59s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
Leader → worker process plumbing for tensor parallelism. The neuron
binary picks up two modes: default (the existing daemon, axum + HTTP)
and `--worker` (a bare RPC loop driven over stdin/stdout). The leader
spawns one worker per non-zero NCCL rank via tokio::process::Command
on the same binary path (production: /proc/self/exe; tests:
env!("CARGO_BIN_EXE_neuron")) and talks to each over newline-
delimited JSON.

Protocol (harness/tp/rpc.rs) is serde-tagged from the start —
WorkerRequest::{Ping, Init, NcclSanityCheck, Shutdown} and
WorkerResponse::{Pong, InitOk, NcclSanityResult, Bye, Error}, both
`#[serde(tag = "op", rename_all = "snake_case")]`. Adding ops in 7b/7c
is purely additive; unknown ops on the wire fail to parse (verified
in unit tests).

7a-i scope:
- WorkerPool::spawn(binary, world_size, devices) forks ranks 1..N as
  subprocesses, captures stdin/stdout, kills on drop.
- ping_all() round-trips a Ping to every worker and validates the
  returned rank.
- shutdown() sends Shutdown to each worker, awaits Bye, reaps.
- Worker mode: parse Ping/Shutdown, return Pong/Bye; Init and
  NcclSanityCheck return Error{kind="not_implemented_7a_i"} so a 7a-ii
  binary speaking the same wire is a drop-in replacement (the kind
  field signals "real NCCL lands in the next commit").
- CandleHarness::load_model refuses tensor_parallel > 1 with a clear
  message until 7b is in.

Three integration tests in tests/tp_worker_lifecycle.rs cover spawn/
ping/shutdown for 2- and 3-worker pools, plus the
not_implemented_7a_i contract test for Init. Seven rpc serde unit
tests assert the wire shape (op tags, field names, unknown-op
rejection). All pass on the dev host; no CUDA required.

Stage 7a-ii (next): the real NCCL Comm::from_rank wiring behind the
existing Init/NcclSanityCheck op surface, CUDA-gated. Verifiable on
beast's 2×5090.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 15:53:00 +03:00
18ae3c30ee post-validation cleanup: cuDNN runtime + repetition penalty
All checks were successful
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Build neuron-blackwell (push) Successful in 3m42s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 4m27s
build-prerelease / Build neuron-ada (push) Successful in 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 6m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 2m32s
Two followups from the live single-GPU validation pass.

1. deploy.sh now ensures libcudnn.so.9 is available on each neuron
   host before installing/upgrading the package. Probes ldconfig first
   so hosts with a manual (tar/runfile) cuDNN install are untouched,
   then adds NVIDIA's RHEL9 CUDA repo (the Fedora 43 CUDA repo doesn't
   ship cuDNN; only the RHEL9 one does) and installs libcudnn9-cuda-13.
   benjy hit "cannot open shared object file: libcudnn.so.9" during
   validation; this prevents that recurring.

2. candle.rs applies a 1.1 repetition penalty over the last 64
   generated tokens before sampling, in both the non-streaming
   chat_completion path and the streaming chat_completion_stream
   path. Without it small Q4_K_M models degenerate into "Wait, no,
   no..." loops once they hit a confident-but-wrong path; with it
   sampling stays coherent. Defaults match mistral.rs and llama.cpp;
   exposing the value via the OpenAI request (frequency/presence
   penalty mapping) is Stage 8 territory.

Both paths route through a new sample_with_penalty() helper so future
sampling tweaks land in one place.
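
A rough sketch of what a windowed repetition penalty does to the
logits, assuming the common formulation (divide positive logits by the
penalty, multiply negative ones). The function name and signature
below are illustrative only; this is not the body of
sample_with_penalty().

```rust
use std::collections::HashSet;

/// Penalise every distinct token id found in the last `window` entries
/// of `generated`. Illustrative helper, not the project's real API.
pub fn apply_repeat_penalty_sketch(
    logits: &mut [f32],
    generated: &[u32],
    penalty: f32,
    window: usize,
) {
    let start = generated.len().saturating_sub(window);
    let mut seen = HashSet::new();
    for &tok in &generated[start..] {
        // Apply the penalty once per distinct token, however often it repeats.
        if !seen.insert(tok) {
            continue;
        }
        if let Some(logit) = logits.get_mut(tok as usize) {
            if *logit >= 0.0 {
                *logit /= penalty; // positive logits shrink toward zero
            } else {
                *logit *= penalty; // negative logits move further from zero
            }
        }
    }
}
```

With the values from this commit the call would be
apply_repeat_penalty_sketch(&mut logits, &generated, 1.1, 64), run
just before sampling.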

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:48:08 +03:00
1a0400131e fix(deploy): use dnf upgrade for stale installs, install only when absent
All checks were successful
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m27s
CI / Test (push) Successful in 4m30s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m29s
build-prerelease / Build cortex binary (push) Successful in 4m32s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m15s
build-prerelease / Build neuron-ada (push) Successful in 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m48s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m47s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m38s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 57s
dnf5's `dnf install <pkg>` is a no-op when the package is already
installed at ANY version — it does NOT auto-upgrade to the latest
available. The deploy script's install branch was therefore silently
leaving hosts on older builds even though needs_update correctly
reported an upgrade was available.

Add an is_installed() probe and an install_or_upgrade() helper that
picks the right verb: `dnf install` when fresh, `dnf upgrade` when
stale. Captured combined-stream output is exposed via __DNF_OUTPUT__
for the existing failure-diagnostic path.

Verified end-to-end against the live fleet: hanzalova/beast/benjy/
quadbrat all upgraded cleanly from prior prerelease NVRs to
0.1.16-0.1.20260519134302.git1866b99.fc43, validation script returned
"Paris" from all three neurons.

Followup (not in this commit): all hosts running helexa-neuron-*
need libcudnn.so.9 available at runtime. Currently:
  - quadbrat: libcudnn9-cuda-13 RPM (rhel9 CUDA repo)
  - beast:    /usr/lib64/libcudnn.so.9 (manual install)
  - benjy:    needed rhel9 CUDA repo added + libcudnn9-cuda-13 installed
              as part of this validation pass.
The spec currently excludes cuDNN from auto-detected deps. Should
add a Recommends:libcudnn9-cuda-13 (soft) and ensure the rhel9 CUDA
repo is configured on each neuron host, similar to how ensure_lair_repo
handles the unstable channel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:10:48 +03:00
1866b99a89 fix(validate-neuron): jq for JSON, say→stderr, sane max_tokens
All checks were successful
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Clippy (push) Successful in 2m13s
CI / Test (push) Successful in 4m22s
build-prerelease / Build neuron-blackwell (push) Successful in 3m25s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ampere (push) Successful in 4m39s
build-prerelease / Build neuron-ada (push) Successful in 4m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m34s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Four issues caught while exercising the script end-to-end against
the live quadbrat node:

1. say() printed status to stdout. Inside run_probe(), the
   "POST /v1/chat/completions (probe: ...)" line was being captured
   by `raw=$(run_probe)` along with the JSON body, so jq saw
   "[host] POST..." as the first line and choked at column 29 with
   "Invalid numeric literal" (it tried to parse the `[` as the start
   of a JSON array). Redirect say() to stderr so command
   substitutions capture only the intended return value.

2. The pretty-print step `echo "${raw}" | yq -r '.'` re-emitted the
   JSON as YAML, which fails on response content that looks like YAML
   markers (chatcmpl ids that parse as aliases, escaped quotes inside
   <think>...</think> blocks). Drop the pretty-print; just echo the
   raw JSON.

3. JSON response parsing now uses jq (always JSON) instead of yq
   (parses input as YAML by default). yq remains in use only for the
   genuinely-YAML asset/manifest.yml elsewhere.

4. max_tokens bumped 32 → 256. Qwen3 prepends a <think>...</think>
   reasoning block before its final answer when the chat template
   enables thinking mode, and that eats most of a small budget — the
   "Paris" answer was being truncated mid-thought. 256 leaves enough
   room for both.

Verified pipeline end-to-end on quadbrat (RTX 3060, helexa-neuron-ampere
git602e8e1): /health OK → /models/load (unsloth/Qwen3-0.6B-GGUF Q4_K_M)
→ /v1/chat/completions → response content contains "Paris".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 13:43:02 +03:00
60176e7c2e ci: monotonic prerelease versions + serialize CI on shared runner
Two CI hygiene fixes uncovered while validating against the live fleet.

1. Same-day prerelease packages were being ordered by RPM-vercmp's
   alpha-vs-digit precedence on the git SHA fragment, not by commit
   chronology. With release stamps like "0.1.${YYYYMMDD}git${SHA}",
   two commits on the same day produce the same numeric prefix and
   rpmvercmp falls back to comparing the alphanumeric SHA suffixes,
   where digit-leading SHAs are ranked above alpha-leading ones —
   completely unrelated to which commit landed first. Verified with
   rpmdev-vercmp:
     gitabc1234 < gitdef5678   (old scheme — purely lexicographic)
   Bumping the timestamp prefix to second-precision (%Y%m%d%H%M%S)
   makes the numeric prefix strictly monotonic for any chronologically-
   ordered commits, so the SHA fragment becomes a debug identifier
   only — never participates in version ordering.

2. ci.yml and build-prerelease.yml both target the `rust` runner label
   and both auto-trigger on push to main. The act-based runner reuses
   /root/.cache/act/<hash>/hostexecutor/ across concurrent jobs, so
   ci.yml's clippy and build-prerelease.yml's build-cortex were racing
   each other's checkout/cleanup steps and corrupting in-flight
   compile artifacts. Real fix is in gongfoo; workflow-level workaround
   is a shared concurrency group with cancel-in-progress=false so the
   two workflows queue sequentially on the same ref.
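
The ordering problem in item 1 can be reproduced with a simplified
segment comparison. The sketch below is an approximation of
rpmvercmp written for illustration: it omits RPM's `~` and `^`
handling, and the function name is hypothetical. The second-precision
stamp used for 60176e7 in the usage note is inferred from that
commit's timestamp, not taken from a published package.

```rust
use std::cmp::Ordering;

/// Simplified rpmvercmp: compare maximal digit / letter runs left to right.
/// Assumption: tilde and caret rules of real RPM are not implemented.
pub fn rpmvercmp_simple(a: &str, b: &str) -> Ordering {
    let (mut a, mut b) = (a.as_bytes(), b.as_bytes());
    loop {
        // Separators only delimit segments; they never compare.
        a = skip_separators(a);
        b = skip_separators(b);
        if a.is_empty() || b.is_empty() {
            // Whichever side still has segments left is newer.
            return a.len().cmp(&b.len());
        }
        let numeric = a[0].is_ascii_digit();
        let (seg_a, rest_a) = take_run(a, numeric);
        let (seg_b, rest_b) = take_run(b, numeric);
        if seg_b.is_empty() {
            // Mixed types: a numeric segment outranks an alphabetic one.
            return if numeric { Ordering::Greater } else { Ordering::Less };
        }
        let ord = if numeric {
            // Compare as integers without parsing: strip zeros, then length.
            let (x, y) = (strip_zeros(seg_a), strip_zeros(seg_b));
            x.len().cmp(&y.len()).then_with(|| x.cmp(y))
        } else {
            seg_a.cmp(seg_b)
        };
        if ord != Ordering::Equal {
            return ord;
        }
        a = rest_a;
        b = rest_b;
    }
}

fn skip_separators(s: &[u8]) -> &[u8] {
    let n = s.iter().take_while(|c| !c.is_ascii_alphanumeric()).count();
    &s[n..]
}

fn take_run(s: &[u8], numeric: bool) -> (&[u8], &[u8]) {
    let n = s
        .iter()
        .take_while(|c| if numeric { c.is_ascii_digit() } else { c.is_ascii_alphabetic() })
        .count();
    s.split_at(n)
}

fn strip_zeros(s: &[u8]) -> &[u8] {
    let n = s.iter().take_while(|c| **c == b'0').count();
    &s[n..]
}
```

Under the day-precision scheme, "0.1.20260519git60176e7" compares
Greater than "0.1.20260519git1866b99" (60176 > 1866 as a numeric
segment) even though 1866b99 is the later commit. With
second-precision stamps, "0.1.20260519133653.git60176e7" compares Less
than "0.1.20260519134302.git1866b99" because the timestamp segment
decides before the SHA is reached.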

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 13:36:53 +03:00
602e8e1471 fix(neuron/candle): source tokenizer.json from base repo when GGUF
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 37s
CI / Clippy (push) Failing after 50s
CI / Test (push) Failing after 49s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m32s
build-prerelease / Build cortex binary (push) Successful in 4m34s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 5m9s
build-prerelease / Build neuron-ada (push) Successful in 4m52s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m36s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
GGUF-only HF repos (unsloth/Qwen3-*-GGUF, Qwen/Qwen3-*-GGUF) ship the
.gguf file but not tokenizer.json — the tokenizer data is embedded in
the GGUF metadata itself, and the standalone tokenizer.json lives in
the base non-GGUF repo (unsloth/Qwen3-0.6B, Qwen/Qwen3-0.6B, etc.).

Live validation against quadbrat hit:
  HTTP 400 fetch tokenizer.json from unsloth/Qwen3-0.6B-GGUF:
  HTTP status client error (404 Not Found)

resolve_files now derives the tokenizer repo by stripping a `-GGUF`
or `-gguf` suffix from the model_id; non-GGUF ids fall through to
fetching from the same repo. The error message includes the
attempted tokenizer repo id so the next failure (e.g. base repo
doesn't exist) is unambiguous.
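
The suffix rule amounts to something like the following sketch. The
function name tokenizer_repo is hypothetical; only the stripping
behaviour is taken from the description above.

```rust
/// Derive the repo to fetch tokenizer.json from.
/// GGUF repos map to their base repo; anything else is returned unchanged.
pub fn tokenizer_repo(model_id: &str) -> &str {
    model_id
        .strip_suffix("-GGUF")
        .or_else(|| model_id.strip_suffix("-gguf"))
        .unwrap_or(model_id)
}
```

For example, tokenizer_repo("unsloth/Qwen3-0.6B-GGUF") yields
"unsloth/Qwen3-0.6B", while "Qwen/Qwen3-0.6B" passes through as-is.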

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 13:16:39 +03:00
e9d0a75dd5 ci(prerelease): auto-build on every push to main
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 36s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
The build-prerelease workflow was workflow_dispatch-only, which meant
every commit needed a manual run dispatch before any host could
upgrade. That left rolling fixes (e.g. f9f5fa4's StateDirectory fix)
sitting on main with no published RPM behind them, so deploy.sh
silently fell back to an older prerelease.

Add 'push: branches: [main]' alongside the existing workflow_dispatch
trigger; the unstable channel now tracks head automatically. The
concurrency group is keyed on ${{ github.ref }} with
cancel-in-progress so successive rapid-fire pushes coalesce to one
build (latest wins) rather than queueing every intermediate commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 13:13:36 +03:00
6cf87e328f chore(neuron): log load_model failures server-side with full chain
The HTTP handler now emits a tracing::warn on load_model failures with
the expanded anyhow chain (format!("{e:#}")) before returning the 400.
journalctl -u neuron will surface the underlying hf-hub /
materialisation error without needing to capture the curl response
body separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 13:08:54 +03:00
f9f5fa41b6 fix(neuron): surface full anyhow chain + ensure $HOME exists at start
Some checks failed
CI / Format (push) Successful in 30s
CI / Test (push) Failing after 49s
CI / Clippy (push) Successful in 2m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Two fixes uncovered by the live validation against beast/benjy/quadbrat:

1. api.rs swallowed everything beyond the outermost anyhow context.
   The validation script reported '{"error":"fetch GGUF ...gguf"}' but
   the actual underlying hf-hub failure (cache dir creation, network,
   auth, etc.) was hidden. Switching every error response to
   format!("{e:#}") expands the full cause chain via anyhow's
   alternate Display format.

2. The neuron systemd unit declared the service user but never ensured
   /var/lib/neuron (its $HOME) existed. hf-hub defaults its cache to
   ~/.cache/huggingface/hub — when $HOME is absent the cache dir
   creation fails and the download aborts. Adding `StateDirectory=neuron`
   makes systemd create + chown that directory at activation; no spec
   change needed.
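
To illustrate item 1: plain Display shows only the outermost context,
while anyhow's alternate format joins every cause with ": ". The
std-only sketch below imitates that by walking Error::source(); the
Context type and full_chain helper are stand-ins invented for this
example and are not part of anyhow or of api.rs.

```rust
use std::error::Error;
use std::fmt;

/// Minimal context wrapper: a message plus an optional underlying cause.
#[derive(Debug)]
pub struct Context {
    msg: String,
    source: Option<Box<dyn Error + 'static>>,
}

impl Context {
    pub fn new(msg: &str, source: Option<Box<dyn Error + 'static>>) -> Self {
        Self { msg: msg.to_string(), source }
    }
}

impl fmt::Display for Context {
    // Plain Display prints only this layer, which is what hid the root cause.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.msg)
    }
}

impl Error for Context {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}

/// Join the whole cause chain with ": ", outermost first.
pub fn full_chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut cur = err.source();
    while let Some(e) = cur {
        out.push_str(": ");
        out.push_str(&e.to_string());
        cur = e.source();
    }
    out
}
```

A three-layer error whose outer message is "fetch GGUF" would render
through full_chain as "fetch GGUF: <middle context>: <root cause>"
instead of just "fetch GGUF".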

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:17:37 +03:00
ed4d71db09 fix(validate-neuron): default to unsloth GGUF + capture curl errors
Two reasons the previous run silently bailed after POST /models/load:

1. Default model was Qwen/Qwen3-0.6B-GGUF (official). That repo ships
   ONLY Q8_0 — no Q4_K_M, no Q4_0, nothing else. The GGUF filename
   matcher in CandleHarness::resolve_files returned "no GGUF file
   matching quant Q4_K_M" and the load endpoint returned an error,
   but the script used `curl --silent --fail` and swallowed it.

2. /models/load is synchronous (it awaits the full HF download + GGUF
   parse). curl --max-time 30 was way too short for a 400 MB fresh
   download.

Fixes:
- Default model is now unsloth/Qwen3-0.6B-GGUF, which mirrors the
  full Q-spectrum (Q2_K through Q8_0 plus BF16) so Q4_K_M actually
  exists.
- trigger_load / run_probe now use --write-out to capture HTTP code
  and emit the response body on non-2xx, so failures surface a real
  diagnostic instead of an opaque set -e abort.
- LOAD_TIMEOUT bumped to 600s; INFER_TIMEOUT to 120s.
- Probe payload built via `yq -n` so JSON quoting is reliable
  regardless of the prompt text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:14:31 +03:00
39010c779f add script/validate-neuron.sh — end-to-end candle harness smoke test
Loads a small public Qwen3 GGUF on a target neuron host, fires a
deterministic reasoning probe ("What is the capital of France?"),
and asserts the response contains 'Paris'. Used to validate the
candle harness on a real GPU host before the Stage 7 TP work begins,
and as a regression check after future neuron builds.

Defaults to beast.hanzalova.internal + Qwen/Qwen3-1.7B-GGUF + Q4_K_M;
all three are positional args so the same script tests any node /
model combination. Polls /models after triggering the load since
/models/load returns once the materialisation is *queued*, not
finished.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:58:05 +03:00
57d7ef8d3c chore: revert dnf. runner user has no system privs
All checks were successful
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m20s
CI / Test (push) Successful in 4m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
2026-05-19 07:16:38 +03:00
0e9671dd7d fix(ci): drop sudo from dnf install (runner runs as root, no sudo)
All checks were successful
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m13s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
The act runner container has no sudo binary; the runner user already
runs as root inside the container. Existing steps (rpmbuild, gpg, etc)
already invoke privileged commands directly without sudo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:06:52 +03:00
e29c9e35f0 fix(ci): ensure rust toolchain present on cuda-13.0 runner
The currently-published runner-cuda-13.0 image (gongfoo) is missing
rust/cargo despite inheriting from runner-rust. Build-neuron fails
immediately with 'cargo: command not found' even though build-cortex
on the bare 'rust' runner builds fine.

Add a defensive `dnf install rust cargo clippy` step at the top of
build-neuron. Idempotent — on a properly-built runner image this is
a fast no-op; on the current broken image it installs the toolchain
in a few seconds. The runner image itself should be rebuilt in
gongfoo so this step becomes redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:04:57 +03:00
8a2334eacb deploy: dnf-native version check + lair.cafe repo bootstrap
Replaces the string compare of 'git describe --tags' vs the binary's
self-reported --version (which lies about prereleases — every
0.1.16-* RPM reports just "0.1.16") with the dnf-native question of
"is the installed package current against what the repo offers".

Mechanism:
- installed_nvr(): rpm -q --qf '%{version}-%{release}' for the
  resident package, falling back to "(not installed)". Capturing rpm's
  output through a variable keeps its "package X is not installed"
  stdout message out of the result on failure.
- needs_update(): probes rpm -q first (treats absent as "needs work"),
  then asks dnf check-update --refresh -q. Other dnf failures collapse
  into "needs update" so the subsequent install surfaces a real error
  rather than this check swallowing one silently.
- ensure_lair_repo(): probes for /etc/yum.repos.d/lair-cafe-unstable.repo
  and adds it with `dnf config-manager addrepo` when missing. The
  upstream .repo file ships enabled=0 (unstable channel doesn't
  auto-engage on fetch), so we then run `dnf config-manager setopt
  lair-cafe-unstable.enabled=1` every run — cheap, idempotent.
- Cortex and neuron install branches now guard `systemctl stop` with
  `[ ! -f /usr/lib/systemd/system/...service ] || sudo systemctl stop`
  so fresh installs (no unit file yet) don't short-circuit the install
  step under set -e.
- dnf output is captured into a variable and only printed (with a
  [host]   prefix per line) on failure, so success stays quiet and
  failures show the actual diagnostic instead of being eaten by
  &> /dev/null.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:55:02 +03:00
aad314cdfa feat(neuron): graceful unload-on-shutdown via SIGTERM/SIGINT
Stage 6 of the candle-native pivot. Adds first-class deactivation:
neuron now drains in-flight requests on SIGTERM (systemd stop) or
SIGINT (Ctrl-C), then unloads every loaded model before the process
exits — releasing CUDA contexts and VRAM cleanly rather than leaving
the OS to reclaim them.

Mechanism:
- startup::shutdown_signal() resolves on either ctrl_c() or a
  SIGTERM listener.
- axum::serve(...).with_graceful_shutdown(shutdown_signal()) stops
  accepting new connections, lets active requests finish, then
  returns control to main.
- startup::unload_all_models(&registry) iterates list_all_models()
  and calls unload per entry. Per-model failures are logged warnings;
  cleanup continues. Empty registry is a fast no-op.
- main holds an Arc<NeuronState> reference past axum's lifetime so
  the registry is still reachable for the unload sweep.

data/neuron.service:
- TimeoutStopSec=120s — generous bound for big-model unloads before
  systemd escalates to SIGKILL.
- KillSignal=SIGTERM — explicit, matches the handler.

Two non-gated tests cover the empty-registry no-op and the no-models-
loaded path. Real load-then-unload-on-shutdown is exercised by the
cuda-integration test from Stage 2 (which calls unload_model directly)
and observable on a real GPU host by stopping the service and
watching nvidia-smi.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:58:07 +03:00
6779b7526a feat(neuron): load default_models on service activation
All checks were successful
CI / Format (push) Successful in 34s
CI / Clippy (push) Successful in 2m13s
CI / Test (push) Successful in 4m6s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Stage 5 of the candle-native pivot. Adds first-class support for
auto-loading a configured set of models when the neuron service
activates.

Config:
- NeuronConfig.default_models: Vec<ModelSpec> (defaults to []).
- neuron.example.toml ships a commented [[default_models]] example.

Activation flow (crates/neuron/src/startup.rs::load_default_models):
- Sequential — VRAM contention makes parallel loads risky.
- Per-entry timing logged at info level on success.
- Failures logged as warnings; the next entry is still attempted.
- An empty list short-circuits without log noise.
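
The activation flow above boils down to a sequential loop that keeps
going after a failure. A minimal sketch, assuming model specs are
reduced to plain strings and the loader is passed in as a closure
(both simplifications; the real function takes ModelSpec values and a
registry, and logs via tracing rather than eprintln!):

```rust
use std::time::Instant;

/// Load each entry in order; return (loaded, failed) counts.
/// `load` stands in for the real per-model load call.
pub fn load_all<F>(specs: &[&str], mut load: F) -> (usize, usize)
where
    F: FnMut(&str) -> Result<(), String>,
{
    let (mut ok, mut failed) = (0, 0);
    // An empty list falls straight through without logging anything.
    for &spec in specs {
        let started = Instant::now();
        match load(spec) {
            Ok(()) => {
                ok += 1;
                eprintln!("loaded {spec} in {:?}", started.elapsed());
            }
            Err(e) => {
                // A failure is only a warning; the next entry is still tried.
                failed += 1;
                eprintln!("warning: failed to load {spec}: {e}");
            }
        }
    }
    (ok, failed)
}
```

Keeping the loop strictly sequential matches the VRAM-contention
concern: only one load is ever in flight.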

Called from main.rs after the registry is built and before the axum
listener binds, so /models reflects the loaded state from the very
first request.

data/neuron.service gains TimeoutStartSec=1800s. With activation
blocked on potentially slow first-time HF downloads + GGUF
materialisation, systemd's default 90s would kill larger model loads
mid-flight.

Two non-gated tests in tests/activation.rs cover the
continues-past-failure and empty-list paths using a synthetically
unknown harness name to fail loads fast without touching the network.
The cuda-integration test from earlier stages still exercises the
real load/unload lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:56:08 +03:00
84f5662df1 feat(neuron): OpenAI-compatible SSE streaming chat completions
Stage 4 of the candle-native pivot. /v1/chat/completions now switches
to text/event-stream when the request sets stream: true, emitting one
chat.completion.chunk per generated token followed by the OpenAI
[DONE] terminator.

Pipeline:
- chat_completion_stream creates a bounded mpsc::channel<ChatCompletionChunk>(32),
  sends the leading role chunk, then spawns a blocking task that
  acquires the per-model arch lock and runs the streaming generation
  loop.
- run_inference_streaming tracks a cumulative decoded prefix so each
  chunk's delta.content is the substring added since the last chunk —
  safe across BPE byte-fallback boundaries that would otherwise split
  multi-byte UTF-8 chars.
- The blocking task aborts cleanly if blocking_send fails (client
  disconnected), so generation stops when the SSE consumer hangs up.
- Final chunk carries finish_reason ("stop" on EOS, "length" on
  max_tokens). The handler appends data: [DONE] after the channel
  closes.
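
The cumulative-prefix idea can be sketched without a tokenizer by
treating each token as raw bytes: only the longest valid UTF-8 prefix
of the not-yet-sent bytes is emitted, so a character split across two
byte-fallback tokens is held back until it is complete. DeltaTracker
is a hypothetical name, and the real run_inference_streaming decodes
token ids rather than raw bytes.

```rust
/// Tracks how much of the decoded output has already been sent.
pub struct DeltaTracker {
    bytes: Vec<u8>, // every byte generated so far
    emitted: usize, // count of bytes already delivered as deltas
}

impl DeltaTracker {
    pub fn new() -> Self {
        Self { bytes: Vec::new(), emitted: 0 }
    }

    /// Append the bytes of one new token; return the text delta, if any.
    /// Simplification: a genuinely invalid byte (not merely an incomplete
    /// sequence) would stall output here; a real decoder would substitute it.
    pub fn push(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.bytes.extend_from_slice(token_bytes);
        let pending = &self.bytes[self.emitted..];
        let valid_len = match std::str::from_utf8(pending) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        if valid_len == 0 {
            return None; // nothing complete yet, wait for more bytes
        }
        let delta = String::from_utf8_lossy(&pending[..valid_len]).into_owned();
        self.emitted += valid_len;
        Some(delta)
    }
}
```

Feeding the two bytes of "é" (0xC3, 0xA9) as separate tokens yields
no delta for the first and "é" for the second, so the client never
sees a broken character.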

The Stage 3 streaming 501 placeholder test is repurposed: with the
streaming path live, an unloaded model now hits the same 404 surface
as the non-streaming path (the model lookup happens first).

cortex-gateway's existing proxy is unchanged — it already forwards
SSE bytes verbatim from Phase 2 work, so the candle SSE format passes
through unmodified.

Neuron Cargo.toml gains futures + tokio-stream (both already in
workspace deps) for ReceiverStream and stream combinators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:53:14 +03:00
249c9442e8 chore: track deployment script
All checks were successful
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m2s
CI / Test (push) Successful in 3m59s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
2026-05-18 17:50:35 +03:00
5e17081fb4 ci(prerelease): drop redundant rustup install step
The build-cortex and build-neuron jobs were running a copied-from-
mistralrs rustup install step. Both jobs use runner images that
already provide rust via dnf:

- runner-rust installs rust/cargo/clippy/rustfmt directly.
- runner-cuda-13.0 extends runner-rust.

Running 'rustup update stable' on top would install a parallel
rustup-managed toolchain and shadow the dnf one — confusing and
unnecessary. The existing ci.yml already trusts the dnf toolchain
without any install step, so match that behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:47:29 +03:00
03bed93fee add asset/manifest.yml describing fleet hosts and neuron flavours
All checks were successful
CI / Format (push) Successful in 28s
CI / Clippy (push) Successful in 2m54s
CI / Test (push) Successful in 5m37s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Adds a single source of truth for which hosts run cortex vs neuron
and which CUDA compute-capability flavour each neuron host needs:

  cortex   : hanzalova.internal
  neurons  :
    beast      → helexa-neuron-blackwell  (2x RTX 5090, sm_120)
    benjy      → helexa-neuron-ada        (RTX 4090,    sm_89)
    quadbrat   → helexa-neuron-ampere     (RTX 3060,    sm_86)

script/deploy.sh (gitignored, local-only) is updated locally to read
hosts and flavours from this manifest and dnf install the correct
helexa-neuron-<flavour> package per host. Using
'dnf install --refresh --allowerasing' lets it swap out the previous
bare helexa-neuron RPM or a different flavour without manual
intervention; the spec Conflicts: clauses keep at most one flavour
resident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:37:14 +03:00
4a5211d830 ci(prerelease): add ampere flavour alongside ada and blackwell
Adds ampere (CUDA compute capability sm_86) to both the build-neuron
and package-neuron matrices, so helexa-neuron-ampere RPMs are built
and published alongside helexa-neuron-ada and helexa-neuron-blackwell.

The prerelease spec already lists ampere in its Conflicts: clause, so
no spec change is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:28:19 +03:00
6d2dc5ff1a fix(ci): give fmt/clippy/test distinct CARGO_TARGET_DIR to avoid races
After the candle deps were added, cargo builds run long enough that
the parallel fmt/clippy/test jobs (all on the `rust` runner label,
which appears to use act in host-executor mode) start racing each
other's intermediate temp files under
  /root/.cache/act/<hash>/hostexecutor/target/debug/deps/

Concretely the test job hit:
  error: No such file or directory at path
  "target/debug/deps/.tmprlicL7"
  Compiling unicode-ident
because another job's cargo invocation cleaned up the temp file
mid-compile. fmt and clippy happened to finish without their own
target races landing fatally, so only test failed visibly.

Set CARGO_TARGET_DIR=target-${{ github.job }} at the workflow level
so each job writes to its own target directory. sccache still backs
the actual rustc cache, so the rebuild penalty is just metadata, not
full recompiles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:26:29 +03:00
b713dbe669 fix(ci): pass GPG secrets via env to avoid Gitea log leakage
Some checks failed
CI / Format (push) Successful in 28s
CI / Test (push) Failing after 43s
CI / Clippy (push) Successful in 2m9s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
The previous "Import signing key" step inlined ${{ secrets.RPM_SIGNING_KEY }}
and ${{ secrets.RPM_SIGNING_KEY_ID }} directly into the run: block.
Template expansion writes the literal secret value into the rendered
shell script, and Gitea logs the rendered script — Gitea's masker may
not reliably scrub multi-line keys, so values can leak.

Move both secrets into the step's env: block (the same pattern the
"Set up SSH" step already uses) and reference $VARs in the script.
The script body now contains only variable names; the secret values
live in the process environment.
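
Why the env: form is safer can be shown with a toy model of template
expansion. The render function below is a hypothetical stand-in for the
workflow engine's substitution step, and the key material and gpg command
line are placeholders for illustration only.

```python
# Placeholder key material; not a real key.
SECRET = (
    "-----BEGIN PGP PRIVATE KEY BLOCK-----\n"
    "abc123\n"
    "-----END PGP PRIVATE KEY BLOCK-----"
)


def render(script, secrets):
    """Toy template expansion: substitute ${{ secrets.NAME }} literally."""
    for name, value in secrets.items():
        script = script.replace("${{ secrets.%s }}" % name, value)
    return script


secrets = {"RPM_SIGNING_KEY": SECRET}

# Inlined: the rendered script text now contains the key itself.
inlined = render(
    'echo "${{ secrets.RPM_SIGNING_KEY }}" | gpg --batch --import', secrets
)

# Via env: the script only names a variable; the value stays out of it.
via_env = render('echo "$RPM_SIGNING_KEY" | gpg --batch --import', secrets)
```

Anything that logs the rendered script sees the key in the first case and
only the variable name in the second.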

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:13:52 +03:00
5c957d08ec ci: add build-prerelease workflow for CUDA RPMs on rpm.lair.cafe
Some checks failed
CI / Format (push) Successful in 36s
CI / Test (push) Failing after 53s
CI / Clippy (push) Successful in 2m35s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Adds a manually-triggered workflow that builds CUDA-flavoured neuron
binaries and a CPU cortex binary, packages them as Fedora RPMs, signs
them, and rsyncs to the unstable channel at
https://rpm.lair.cafe/fedora/43/x86_64/unstable/. Mirrors the build
pipeline used by grenade/mistralrs-package.

Pipeline:
- prepare: derive {version,short_sha,commit_date} from the checkout;
  the prerelease Release stamp "0.1.YYYYMMDDgitSHORTSHA" sorts below
  the eventual "1" stable release.
- build-cortex: cargo build --release -p cortex-cli on a rust runner.
- build-neuron: matrix over ada (sm_89) and blackwell (sm_120) on
  cuda-13.0 runners; cargo build with features "cuda cudnn flash-attn"
  and CUDA_COMPUTE_CAP set per flavour.
- package-{cortex,neuron}: rpmbuild on the rpm runner against the new
  prebuilt-binary specs in rpm/.
- publish: import signing key, sign RPMs, rsync to oolon, createrepo_c
  --update, then regenerate packages.json for the UI.
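
The claim that the prerelease Release stamp sorts below "1" follows from
RPM's segment-wise version comparison. Below is a simplified Python
sketch of that comparison, assuming no tilde or caret markers are
involved; it is an illustration, not RPM's own implementation, and the
sample stamps are made up for the example.

```python
import re


def rpmvercmp(a, b):
    """Simplified rpmvercmp (no ~ or ^ handling). Returns -1, 0 or 1."""
    # Split into maximal runs of digits or letters; separators are ignored.
    seg_a = re.findall(r"[0-9]+|[a-zA-Z]+", a)
    seg_b = re.findall(r"[0-9]+|[a-zA-Z]+", b)
    for x, y in zip(seg_a, seg_b):
        x_num, y_num = x.isdigit(), y.isdigit()
        if x_num and y_num:
            x_key, y_key = int(x), int(y)  # numeric segments compare as ints
        elif x_num:
            return 1   # a numeric segment is newer than an alphabetic one
        elif y_num:
            return -1
        else:
            x_key, y_key = x, y  # alphabetic segments compare as strings
        if x_key != y_key:
            return 1 if x_key > y_key else -1
    # All shared segments equal: the side with segments left over is newer.
    return (len(seg_a) > len(seg_b)) - (len(seg_a) < len(seg_b))
```

With this model, "0.1.20260518git5c957d0" compares lower than "1"
because the leading segment 0 is less than 1, so a later stable release
upgrades cleanly over any prerelease build.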

New specs are prebuilt-binary variants — they consume the artifact
from the build job rather than running cargo at rpmbuild time. Each
helexa-neuron-{flavour} package Conflicts with the other flavours and
with helexa-neuron (the future source-build stable package) so one
flavour is installed at a time on a given host.

neuron crate gains cudnn and flash-attn feature flags forwarding to
the corresponding candle features, so the CI build command compiles
those kernels into the binary.

sccache is intentionally NOT used in the prerelease jobs — CUDA
compute cap isn't in its cache key, so flavours would mis-hit each
other. Each prerelease build is a clean cargo build.

Required Gitea secrets (already in place for cortex.spec / COPR
workflow):
- RPM_SIGNING_KEY, RPM_SIGNING_KEY_ID
- RSYNC_SSH_KEY

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:01:35 +03:00
729317d1ef feat(neuron): OpenAI-compatible non-streaming chat completion
Stage 3 of the candle-native pivot. neuron now serves
POST /v1/chat/completions backed by candle's quantized_qwen3 forward
pass on a per-model serialised generation loop, returning the standard
OpenAI ChatCompletionResponse envelope.

Pipeline per request:
- Look up the LoadedModel by request.model (404 if absent).
- Apply the Qwen3 chat template across all messages.
- Tokenize, then spawn_blocking onto tokio's blocking pool to acquire
  the per-model arch lock and run prefill + greedy/temperature/top-p
  sampling via LogitsProcessor.
- Stop on <|im_end|>/<|endoftext|> EOS or max_tokens (finish_reason
  "stop" vs "length").
- Decode with skip_special_tokens=true, build OpenAI response with
  prompt/completion/total usage counts.
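
The stop conditions in the pipeline above amount to a loop like the
following. This is a minimal Python sketch that treats tokens as strings;
generate and sample_next are hypothetical names, and the real loop runs
candle's forward pass and LogitsProcessor rather than a callback.

```python
EOS_TOKENS = {"<|im_end|>", "<|endoftext|>"}


def generate(sample_next, max_tokens):
    """Collect tokens until an EOS token appears or max_tokens is reached.

    Returns (tokens, finish_reason), where finish_reason is "stop" when an
    EOS token ended generation and "length" when the budget ran out.
    """
    tokens = []
    while len(tokens) < max_tokens:
        token = sample_next()
        if token in EOS_TOKENS:
            return tokens, "stop"
        tokens.append(token)
    return tokens, "length"
```

The EOS token itself is not appended to the output, which mirrors the
effect of decoding with skip_special_tokens=true.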

Supporting changes:
- HarnessRegistry now stores Arc<dyn Harness> and caches a typed
  Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch.
- LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard
  can be moved into spawn_blocking.
- NeuronState gains an Option<Arc<CandleHarness>> field for the new
  inference route.
- Typed InferenceError lets the handler map ModelNotLoaded → 404 and
  other failures → 500 without string-matching anyhow messages.
- stream=true returns 501 until Stage 4 wires up SSE.
- Two leftover mistral.rs string references in proxy.rs and cortex-cli
  (missed during the Stage 1 sweep) are corrected here.

Three new default-feature tests cover the no-candle 503, model-not-
loaded 404, and stream=true 501 paths. The cuda-integration test from
Stage 2 still covers real load/unload; a streaming-feature gated test
exercising actual generation will arrive with Stage 4.
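
The status mapping exercised by those tests can be modelled as below.
This is a Python sketch of the idea of typed errors, not the Rust
handler: the exception classes mirror the names in this message, while
chat_completion_status, its parameters, and the relative order of the 503
and 501 checks are assumptions made for the example.

```python
class InferenceError(Exception):
    """Base class for inference failures."""


class ModelNotLoaded(InferenceError):
    """Raised when the requested model is not in the registry."""


def chat_completion_status(candle_available, stream, run):
    """Return the HTTP status the handler would answer with.

    run is a hypothetical callable performing the actual inference.
    """
    if not candle_available:
        return 503  # no candle harness configured
    if stream:
        return 501  # streaming not implemented until Stage 4
    try:
        run()
    except ModelNotLoaded:
        return 404
    except InferenceError:
        return 500
    return 200
```

Matching on the error type rather than on message text means a reworded
error string cannot silently change the status code.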

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:47:58 +03:00
5c2bd1a1da feat(neuron): wire candle harness load/unload via GGUF
Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a
LoadedModel registry keyed by model_id, hf-hub-backed GGUF download,
and Qwen3 quantized weight construction via candle-transformers'
quantized_qwen3 module. unload_model drops the entry; Drop on the
candle ModelWeights frees device memory.

Device selection prefers CUDA (gated behind the new `cuda` feature),
falling back to CPU when CUDA is unavailable so default builds work
on non-GPU hosts. The candle CUDA toolchain isn't pulled in unless
`--features cuda` is passed, keeping CI green on CPU runners.

Config gains a [harness.candle] block with an optional hf_cache path.
HarnessRegistry::from_configs now takes HarnessSettings so per-harness
config flows through.

A gated tests/candle_lifecycle.rs exercises real load → list → unload
→ list-empty when run with `--features cuda-integration` against a
host with HF network access. The default-feature test in tests/api.rs
covers the wrong-harness rejection path without needing the network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:02:49 +03:00
3cccc2c56b refactor(neuron): cut mistralrs/llamacpp, scaffold candle harness
Stage 1 of the candle-native pivot. Replaces the external-process
harness model (mistralrs over HTTP, llamacpp placeholder) with an
in-process Harness trait whose sole implementation is candle. The
trait keeps its shape so future engines slot in additively, but
start/stop default to no-ops and HarnessConfig drops endpoint and
systemd_unit since no harness needs external supervision.

Behaviour is unchanged on the wire: load_model returns a "not
implemented yet (Stage 2)" error and list_models is empty. The
gateway-side proxy, poller, and router are untouched.

CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are
marked superseded; the staged plan lives in
~/.claude/plans/create-a-more-aggressive-calm-naur.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:53:04 +03:00
7f797b0265 ci: parallelise fmt/clippy/test and drop sccache install step
All checks were successful
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 1m31s
CI / Test (push) Successful in 2m11s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 13:55:17 +03:00
5a0360c1d5 ci: use container runner labels for CI jobs
Some checks failed
CI / Format, lint, build, test (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 13:29:42 +03:00
472c0e8737 fix(rpm): ship firewalld service definitions with correct ports
Some checks failed
CI / Format, lint, build, test (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
cortex: opens 31313/tcp (API) and 31314/tcp (metrics)
neuron: opens 13131/tcp

Installs to /usr/lib/firewalld/services/ so firewall-cmd
--add-service=cortex / --add-service=helexa-neuron works
out of the box.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 12:52:20 +03:00
Gitea Actions
b9d8e30058 chore: bump version to 0.1.16 2026-04-16 15:04:21 +00:00
25f75fe552 chore: ignore local deploy script
All checks were successful
CI / Format, lint, build, test (push) Successful in 1m15s
CI / Build cortex SRPM (push) Successful in 43s
CI / Build neuron SRPM (push) Successful in 44s
CI / Publish cortex to COPR (push) Successful in 7m23s
CI / Publish neuron to COPR (push) Successful in 15m58s
CI / Bump version in source (push) Successful in 31s
2026-04-16 17:45:25 +03:00
3f94c50817 chore: move default ports out of common-collision ranges
Previous defaults collided with well-trodden infra services and with
the Linux ephemeral port range:

- cortex API     8000 — common dev-server default (Django, minio UI)
- cortex metrics 9100 — Prometheus node_exporter default
- neuron API     9090 — Cockpit default on Fedora, Prometheus self

Move to helexa-themed ports (palindromic 31313 and 13131, with 31314
adjacent for metrics), all below Linux's 32768-60999 ephemeral range
and not registered to any well-known service:

- cortex API     31313
- cortex metrics 31314
- neuron API     13131
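
The range claim is easy to check mechanically. A small Python sketch,
using only the port numbers listed in this message (the dictionary and
function names are illustrative):

```python
# Linux default ephemeral port range, 32768-60999 inclusive.
LINUX_EPHEMERAL = range(32768, 61000)

OLD_PORTS = {"cortex API": 8000, "cortex metrics": 9100, "neuron API": 9090}
NEW_PORTS = {"cortex API": 31313, "cortex metrics": 31314, "neuron API": 13131}


def in_ephemeral_range(port):
    """True if the port could be handed out as an ephemeral source port."""
    return port in LINUX_EPHEMERAL


# Names of any new defaults that would still fall in the ephemeral range.
colliding = [name for name, port in NEW_PORTS.items() if in_ephemeral_range(port)]
```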

Updated places:
- cortex.example.toml, neuron.example.toml defaults
- default impls in cortex-core and neuron config
- cortex-cli --endpoint default for the status subcommand
- doc comments citing example URLs
- README.md and CLAUDE.md snippets

Consumers already on the old ports need a one-line edit in their
/etc/cortex/cortex.toml or /etc/neuron/neuron.toml to match;
firewall rules and prometheus scrape configs will also need
updating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 17:45:25 +03:00
3e1fb60076 ci: drop actions/cache for cargo registry and target
The cache round-trip (download + unpack) was consistently taking
around 6 minutes, noticeably longer than the ~3 minute cold build
it was meant to accelerate. Net-negative on CI time — remove it.

sccache with the S3 backend still provides dep-level caching at a
much lower overhead, so we keep the majority of the cache benefit
without paying the actions/cache tarball cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 17:45:25 +03:00
Gitea Actions
9bf987888c chore: bump version to 0.1.14 2026-04-16 16:57:24 +03:00
abe4ff7ccc ci: publish both packages to a single helexa/helexa COPR project
All checks were successful
CI / Format, lint, build, test (push) Successful in 9m50s
CI / Build neuron SRPM (push) Successful in 43s
CI / Build cortex SRPM (push) Successful in 48s
CI / Publish neuron to COPR (push) Successful in 6m14s
CI / Publish cortex to COPR (push) Successful in 7m53s
CI / Bump version in source (push) Successful in 31s
Consolidates the previous helexa/cortex and helexa/helexa-neuron COPR
projects into one shared project. Hosts enable a single repo and get
access to both packages — cortex for gateway hosts and helexa-neuron
for GPU nodes. Reduces the "which copr do I enable on this host"
friction, and makes it clear the two packages are parts of the same
helexa project suite.

CI keeps two independent publish jobs (copr-cortex and copr-neuron)
running in parallel; they now both target helexa/helexa with their
respective SRPMs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:37:47 +03:00
7c3390a4e1 fix(rpm): rename neuron package to helexa-neuron
Fedora's official repos ship a package named `neuron` — the NEURON
neural-simulation environment from Yale (see
https://src.fedoraproject.org/rpms/neuron). Having our own `neuron`
in the helexa COPR caused dnf5 to silently no-op `dnf install neuron`
because of the name collision, even with the COPR repo enabled and
keys imported. The only workarounds were full NEVRA (`dnf install
neuron-0.1.12-1.fc43.x86_64`) or a local file install — neither
acceptable for end-users.

Rename the RPM package to `helexa-neuron`. Keep binary (/usr/bin/neuron),
systemd unit (neuron.service), system user (neuron), and config dir
(/etc/neuron) unchanged — those are project-local contexts where the
short name is unambiguous. Follows Fedora subpackage-style naming
except with a vendor prefix rather than a parent-package prefix,
because neuron is an independent package from cortex (installed on
different hosts) and neither depends on the other.

Changes:
- neuron.spec -> helexa-neuron.spec (git rename)
- Name: neuron -> helexa-neuron (with comment explaining why)
- CI: srpm-neuron job now builds helexa-neuron-VERSION.tar.gz with the
  matching top-level dir prefix, publishes to helexa/helexa-neuron COPR
- CI: bump-version job references helexa-neuron.spec
- CLAUDE.md: install instructions updated

Old helexa/neuron COPR project can be deleted after the first
helexa/helexa-neuron build lands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:37:47 +03:00
2ff062da0e ci: commit generated %changelog entries back to main
Previously the srpm-* jobs generated a fresh %changelog entry and
shipped it to COPR, but the version-stamped spec pushed back to main
by the bump-version job only updated the Version: line — not the
%changelog section. The result: SRPM and in-tree spec diverged and
a fresh clone of the repo showed a perpetually empty changelog.

Run the rpm-changelog action in bump-version too. Now the committed
specs track the SRPMs: each release leaves a dated %changelog entry
in main covering commits since the previous tag, visible in git log
and in the repo's spec browser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:37:03 +03:00
Gitea Actions
357f858a29 chore: bump version to 0.1.12 2026-04-16 15:47:21 +03:00
556e5293dc fix(rpm): explicitly Provides user(name) to satisfy systemd unit Requires
All checks were successful
CI / Format, lint, build, test (push) Successful in 2m59s
CI / Build cortex SRPM (push) Successful in 44s
CI / Build neuron SRPM (push) Successful in 49s
CI / Publish neuron to COPR (push) Successful in 8m17s
CI / Publish cortex to COPR (push) Successful in 9m56s
CI / Bump version in source (push) Successful in 30s
Diagnosing the persistent "Nothing to do" on v0.1.10 surfaced that
removing %attr(,,name) from %files wasn't enough. systemd-rpm-macros
ships its own rpm dep generator (/usr/lib/rpm/systemd.req) that parses
User=/Group= directives from every .service file the package ships
and emits Requires: user(NAME)/group(NAME) accordingly.

Rpmbuild log from v0.1.10 shows these Requires are still emitted even
after the %attr removal. Meanwhile the sysusers provides-generator
emits group(NAME) in both unversioned and versioned forms, but only
a versioned user(NAME) = <base64> when the u-line has GECOS/home/shell
fields. The asymmetry leaves Requires: user(NAME) unresolvable.

Add explicit Provides: user(NAME) back to both specs, with a comment
documenting the actual cause (systemd unit parsing, not file attrs)
so the next person touching these specs doesn't repeat the mistake.

Why monsoon didn't hit this: it creates its user in %pre via
groupadd/useradd (not sysusers.d), so no Provides are generated at
all — matching the Requires: user(monsoon) by luck of the rpm solver
treating unknown symbols as soft-fails for that path. Ours went through
the sysusers Provides code path and hit the asymmetry instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:32:51 +03:00
1d90238b01 ci: migrate rpm changelog generation to reusable action
Replace the local .gitea/scripts/generate-rpm-changelog.sh with the
shared composite action at https://git.lair.cafe/actions/rpm-changelog@v1.
Behaviour is identical — collect commits since the previous v* tag,
filter bump-version and merge noise, prepend a dated entry to the
spec — but the logic now lives in one place that other projects can
consume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:32:51 +03:00
d99b25fb8a ci: auto-generate rpm changelog entry per release
On every tag push, build a %changelog entry from the git log since
the previous v* tag and prepend it to each spec. Stops the initial
entry from drifting further and catches bogus-date / stale-version
warnings automatically since the generated date always matches the
day the CI runs.

The generator drops "chore: bump version" commits (bot-authored,
noisy in user-facing changelogs) and merge commits. Author defaults
to the gitea-actions identity but can be overridden via
CHANGELOG_AUTHOR env var if a human release is desired.
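
As a rough illustration of what the generator produces, here is a Python
sketch that builds one %changelog entry from a list of commit subjects.
It is not the script from this commit: changelog_entry is a hypothetical
name, and detecting merge commits by their "Merge " subject prefix is an
assumption made to keep the example self-contained.

```python
from datetime import date


def changelog_entry(day, author, version, subjects):
    """Format an RPM %changelog entry, dropping bump and merge commits."""
    kept = [
        s for s in subjects
        if not s.startswith("chore: bump version")
        and not s.startswith("Merge ")
    ]
    # Header form: "* Thu Apr 16 2026 Name <email> - 0.1.12"
    header = f"* {day.strftime('%a %b %d %Y')} {author} - {version}"
    return "\n".join([header] + [f"- {s}" for s in kept])
```

Because the header date is taken from the day the job runs, the weekday
and date always agree.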

Requires fetch-depth: 0 on checkout so git describe can see prior
tags and git log can reach them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:32:51 +03:00
034da319f1 fix(rpm): correct weekday in changelog entry
April 15 2026 was a Wednesday, not Tuesday. rpmbuild validates the
day-of-week against the date and warns on mismatch.
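
The same consistency check can be reproduced outside rpmbuild. A minimal
Python sketch, assuming the default C locale for weekday and month names;
weekday_matches is a hypothetical helper, not part of any RPM tooling.

```python
from datetime import datetime


def weekday_matches(header_date):
    """Check a changelog date such as 'Wed Apr 15 2026'.

    Returns True when the leading weekday name agrees with the date.
    """
    claimed, rest = header_date.split(" ", 1)
    actual = datetime.strptime(rest, "%b %d %Y").strftime("%a")
    return claimed == actual
```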

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:32:51 +03:00
Gitea Actions
7ece281617 chore: bump version to 0.1.10 2026-04-16 15:06:18 +03:00
3bb5b3c425 fix(rpm): drop %attr(,,user) on config files to avoid dnf silent filter
All checks were successful
CI / Format, lint, build, test (push) Successful in 1m11s
CI / Publish cortex to COPR (push) Successful in 11m3s
CI / Build cortex SRPM (push) Successful in 43s
CI / Build neuron SRPM (push) Successful in 43s
CI / Publish neuron to COPR (push) Successful in 8m56s
CI / Bump version in source (push) Successful in 30s
Using %attr(,,cortex) / %attr(,,neuron) on config files caused rpm's
auto-dep-generator to emit Requires: user(name) and group(name) on
each package. When those Requires couldn't be resolved — whether due
to sysusers Provides mismatches, missing GPG keys, or dnf5 cache
state — dnf5 silently filtered the package out of the candidate set
and reported "Nothing to do" rather than an unsatisfied-dep error.

Adopt the pattern that already works reliably across our infra
(grenade/monsoon): ship config files as default root:root with 0644
perms, don't declare user/group ownership in the rpm file list.
systemd-sysusers still creates the service user via the shipped
sysusers.d file; the service drops to that user at runtime via the
User= directive in the unit.

This removes the user(cortex)/user(neuron) Requires entirely, which
is the root cause of the dnf5 filtering. File permission tightening
can be reintroduced later — either via a separate secrets file with
different mode bits, or by moving secret material to /var/lib/<svc>/
where the service drop-privileges account already has write access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 14:50:17 +03:00
Gitea Actions
9fa51ad874 chore: bump version to 0.1.8 2026-04-16 10:56:07 +00:00
9697fbae73 fix(neuron): run service as neuron user, not cortex
All checks were successful
CI / Format, lint, build, test (push) Successful in 2m22s
CI / Build cortex SRPM (push) Successful in 43s
CI / Build neuron SRPM (push) Successful in 43s
CI / Publish neuron to COPR (push) Successful in 8m49s
CI / Publish cortex to COPR (push) Successful in 11m22s
CI / Bump version in source (push) Successful in 31s
neuron and cortex are independent packages installable on different
hosts. Having neuron run under a 'cortex' system user implied a
shared identity that doesn't exist. Give neuron its own user/group.

- New data/neuron-sysusers.conf declares the neuron user/group with
  home /var/lib/neuron.
- systemd unit User/Group changed to neuron.
- Spec file attrs, explicit Provides, and %sysusers_create_compat
  updated to reference the neuron user.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:32:36 +03:00
Gitea Actions
2ce1060cb8 chore: bump version to 0.1.7 2026-04-16 13:25:34 +03:00
142e91c3f7 fix(neuron): install config at /etc/neuron/, not /etc/cortex/
All checks were successful
CI / Format, lint, build, test (push) Successful in 4m45s
CI / Build neuron SRPM (push) Successful in 44s
CI / Build cortex SRPM (push) Successful in 45s
CI / Publish neuron to COPR (push) Successful in 8m52s
CI / Publish cortex to COPR (push) Successful in 11m17s
CI / Bump version in source (push) Successful in 30s
The neuron package was shipping its config at /etc/cortex/neuron.toml,
which implied a shared config directory between two independent
packages. Move to /etc/neuron/neuron.toml — neuron owns its own etc
dir, consistent with its own /usr/lib/sysusers.d/neuron.conf and
/usr/lib/systemd/system/neuron.service. Updated the systemd unit's
ExecStart path and the example toml header to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:07:06 +03:00
Gitea Actions
52c8b4c983 chore: bump version to 0.1.5 2026-04-16 13:01:42 +03:00
4a9a4fc775 ci: migrate copr publish to reusable action
All checks were successful
CI / Format, lint, build, test (push) Successful in 1m26s
CI / Build neuron SRPM (push) Successful in 45s
CI / Build cortex SRPM (push) Successful in 44s
CI / Publish neuron to COPR (push) Successful in 8m22s
CI / Publish cortex to COPR (push) Successful in 11m0s
CI / Bump version in source (push) Successful in 30s
Replace the in-repo .gitea/scripts/copr-build.sh and per-job
copr-cli configuration with the shared composite action at
https://git.lair.cafe/actions/copr-publish@v1. Behaviour is
identical — submit, watch, dump per-chroot logs — but the logic
now lives in a single place that other projects can consume.

Removes the actions/checkout step from both COPR jobs since the
build script is no longer local to this repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:34:39 +03:00
53a3c1e157 fix(rpm): explicitly Provides user(cortex)/group(cortex)
All checks were successful
CI / Format, lint, build, test (push) Successful in 57s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
dnf5 was silently rejecting neuron-0.1.3 with "Nothing to do" because
it had an unresolvable Requires. Inspection showed:

  Requires: user(cortex)               ← unversioned
  Provides: user(cortex) = <base64>    ← versioned only, no unversioned

rpm's sysusers provides-generator only emits the unversioned user()
provide when the u-line is minimal. Our sysusers.conf specifies GECOS,
home dir, and shell, which pushes the generator to versioned-only.
The matching Requires (auto-generated from %attr(,,cortex) on config
files) is unversioned, so resolution failed silently.

Explicitly declare Provides: user(cortex) and Provides: group(cortex)
to guarantee the unversioned forms exist. group(cortex) was already
emitted unversioned, but it is declared too for symmetry and to protect
against future generator changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:06:05 +03:00
5c7d63c658 ci: dump COPR per-chroot build logs to CI output
Previously the COPR publish steps only surfaced copr-cli's status
updates (pending/importing/running). When a build failed, diagnosing
required clicking through to the COPR web UI. Now we submit with
--nowait, watch the build, then use copr-cli download-build to fetch
each chroot's builder-live.log and cat them as collapsible ::group::
blocks in the CI output.

Logic is factored into .gitea/scripts/copr-build.sh so cortex and
neuron jobs share it. Both COPR jobs now check out the repo to access
the script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:06:05 +03:00
Gitea Actions
f161412f91 chore: bump version to 0.1.3 2026-04-16 11:41:11 +03:00
214 changed files with 62627 additions and 852 deletions

@@ -0,0 +1,618 @@
name: build-prerelease
# Builds CUDA-flavoured neuron binaries (and a single cortex binary),
# packages each as a Fedora RPM, signs them, and publishes to the
# `unstable` channel at rpm.lair.cafe.
#
# Change-aware: the `prepare` job diffs HEAD against the git sha
# embedded in the most recently *published* unstable RPM (per package)
# and skips builds whose inputs didn't change. Docs-only commits build
# nothing; gateway-only commits skip the 3 CUDA builds (and, via
# deploy.yml's own check-update gate, the neuron restarts + model
# cold-loads). Diffing against the published sha — not the previous
# push — means a failed run can never cause a change to be missed.
#
# Lint (fmt+clippy) and test run here as parallel jobs and gate
# `publish`; ci.yml no longer runs on pushes to main (see its trigger
# comment), so the two workflows stop competing for the same runners.
#
# The published packages are versioned as e.g.
# helexa-neuron-blackwell-0.1.16-0.1.20260518T140530.gitabcdef0.fc43.x86_64
#                                ^^^^^^^^^^^^^^^^^^ ^^^^^^^^
#                                commit time (s)    commit sha
# so they sort BELOW the eventual 0.1.16-1 stable release, and so two
# commits on the same day are still strictly ordered by their commit
# timestamps (rather than by RPM-vercmp's alpha-vs-digit precedence
# on the SHA fragment).
on:
  # Auto-build on every push to main so the unstable channel tracks
  # head without a manual dispatch step.
  push:
    branches: [main]
  # Manual dispatch still available to build from a non-main ref.
  # Dispatched runs skip change detection and build everything.
  workflow_dispatch:
    inputs:
      ref:
        description: "Git ref to build (branch / tag / commit). Defaults to the workflow's branch."
        required: false
        default: ""
# Coalesce same-ref pushes: a newer push cancels the older in-flight
# run — the newest commit is the one we want on the fleet. The publish
# job keeps its own `rpm-publish` group (cancel=false) so an in-flight
# repo update is never interrupted. Runners are ephemeral (one VM per
# job) so concurrent runs no longer race on a shared workspace; the
# old shared `cortex-runner-pool` group with ci.yml is gone.
concurrency:
  group: build-prerelease-${{ github.ref }}
  cancel-in-progress: true
env:
  CARGO_INCREMENTAL: "0"
  CARGO_TERM_COLOR: "always"
jobs:
  prepare:
    name: Resolve version stamps + change detection
    timeout-minutes: 10
    runs-on: rust
    outputs:
      version: ${{ steps.info.outputs.version }}
      release: ${{ steps.info.outputs.release }}
      short_sha: ${{ steps.info.outputs.short_sha }}
      commit_timestamp: ${{ steps.info.outputs.commit_timestamp }}
      build_cortex: ${{ steps.changes.outputs.build_cortex }}
      build_neuron: ${{ steps.changes.outputs.build_neuron }}
      build_bench: ${{ steps.changes.outputs.build_bench }}
      check_rust: ${{ steps.changes.outputs.check_rust }}
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref }}
          fetch-depth: 0
      - id: info
        run: |
          set -eux
          VERSION=$(awk -F\" '/^version[[:space:]]*=/ { print $2; exit }' Cargo.toml)
          SHORT_SHA=$(git rev-parse --short=7 HEAD)
          # Second-precise commit timestamp gives the release stamp a
          # strictly monotonic numeric prefix. The earlier %Y%m%d-only
          # form let same-day builds be ordered by RPM's rpmvercmp
          # rules over the SHA, which is non-chronological, e.g.
          # "git602e8e1" sorts newer than "gitf9f5fa4" purely because
          # rpmvercmp ranks digit-prefixed segments above alpha ones.
          # The SHA stays only as a debug identifier; sort order is
          # decided entirely by the timestamp.
          COMMIT_TIMESTAMP=$(git log -1 --format=%cd --date=format:%Y%m%d%H%M%S HEAD)
          RELEASE="0.1.${COMMIT_TIMESTAMP}.git${SHORT_SHA}"
          echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
          echo "release=${RELEASE}" >> "$GITHUB_OUTPUT"
          echo "short_sha=${SHORT_SHA}" >> "$GITHUB_OUTPUT"
          echo "commit_timestamp=${COMMIT_TIMESTAMP}" >> "$GITHUB_OUTPUT"
- id: changes
run: |
set -ux
# Default: build everything. Detection only ever narrows
# this, and any failure along the way (manifest unreachable,
# unparsable, sha not in history after a force-push) leaves
# the full build in place. Manual dispatches always build
# everything — predictable when building odd refs.
BUILD_CORTEX=true
BUILD_NEURON=true
BUILD_BENCH=true
CHECK_RUST=true
if [ "${GITHUB_EVENT_NAME}" = "push" ]; then
MANIFEST_URL="https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json"
if curl -fsS --max-time 20 -o /tmp/packages.json "$MANIFEST_URL"; then
# Latest published sha per package, by buildTime.
base_for() {
python3 - "$1" <<'PY'
import json, re, sys
name = sys.argv[1]
try:
with open("/tmp/packages.json") as f:
pkgs = json.load(f)["packages"]
cands = [p for p in pkgs if p.get("name") == name]
if cands:
latest = max(cands, key=lambda p: p.get("buildTime", 0))
m = re.search(r"git\.?([0-9a-f]{7,40})", latest.get("release", ""))
if m:
print(m.group(1))
except Exception:
pass
PY
}
# true if no usable base, else true iff the diff since
# the published sha touches the given path pattern.
decide() {
local base="$1" pattern="$2"
if [ -z "$base" ] \
|| ! git cat-file -e "${base}^{commit}" 2>/dev/null \
|| ! git merge-base --is-ancestor "$base" HEAD 2>/dev/null; then
echo true; return
fi
if git diff --name-only "${base}..HEAD" | grep -qE "$pattern"; then
echo true
else
echo false
fi
}
# cortex-core is shared by all three binaries; Cargo.{toml,lock}
# and this workflow file affect all of them.
NEURON_RE='^crates/neuron/|^crates/cortex-core/|^Cargo\.toml$|^Cargo\.lock$|^rpm/helexa-neuron-prerelease\.spec$|^data/neuron|^neuron\.example\.toml$|^\.gitea/workflows/build-prerelease\.yml$'
CORTEX_RE='^crates/cortex-gateway/|^crates/cortex-cli/|^crates/cortex-core/|^Cargo\.toml$|^Cargo\.lock$|^rpm/cortex-prerelease\.spec$|^data/cortex|^cortex\.example\.toml$|^models\.example\.toml$|^\.gitea/workflows/build-prerelease\.yml$'
BENCH_RE='^crates/helexa-bench/|^crates/cortex-core/|^Cargo\.toml$|^Cargo\.lock$|^rpm/helexa-bench-prerelease\.spec$|^data/helexa-bench|^helexa-bench\.example\.toml$|^\.gitea/workflows/build-prerelease\.yml$'
# Any Rust change (incl. crates not packaged here, e.g.
# helexa-acp) still needs lint+test on main.
RUST_RE='\.rs$|^crates/|Cargo\.toml$|^Cargo\.lock$'
CORTEX_BASE=$(base_for cortex)
NEURON_BASE=$(base_for helexa-neuron-blackwell)
BENCH_BASE=$(base_for helexa-bench)
BUILD_CORTEX=$(decide "$CORTEX_BASE" "$CORTEX_RE")
BUILD_NEURON=$(decide "$NEURON_BASE" "$NEURON_RE")
BUILD_BENCH=$(decide "$BENCH_BASE" "$BENCH_RE")
if [ "$BUILD_CORTEX" = "true" ] || [ "$BUILD_NEURON" = "true" ] || [ "$BUILD_BENCH" = "true" ]; then
CHECK_RUST=true
else
CHECK_RUST=$(decide "$CORTEX_BASE" "$RUST_RE")
fi
fi
fi
echo "build_cortex=${BUILD_CORTEX}" >> "$GITHUB_OUTPUT"
echo "build_neuron=${BUILD_NEURON}" >> "$GITHUB_OUTPUT"
echo "build_bench=${BUILD_BENCH}" >> "$GITHUB_OUTPUT"
echo "check_rust=${CHECK_RUST}" >> "$GITHUB_OUTPUT"
echo "### change detection: build_cortex=${BUILD_CORTEX} build_neuron=${BUILD_NEURON} build_bench=${BUILD_BENCH} check_rust=${CHECK_RUST}"
# fmt + clippy + test moved here from ci.yml for main pushes so the
# two workflows stop queueing against each other (ci.yml's checks
# used to delay build-cortex by ~12 minutes on the shared runner
# pool). They run in parallel with the builds and gate `publish`,
# not the builds themselves — a clippy warning still can't reach the
# fleet, but it also doesn't serialize the pipeline.
lint:
name: Lint (fmt + clippy)
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.check_rust == 'true'
runs-on: rust
env:
RUSTC_WRAPPER: sccache
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- run: cargo fmt --check --all
# Failure-aware sccache escalation lives in the shared script: a
# signal death (rustc SIGSEGV / OOM-kill) keeps the cache and fails
# fast instead of triggering a slower uncached rebuild; only a real
# sccache fault drops the cache. See script/ci-cargo-escalate.sh.
- name: Clippy (sccache escalation)
run: script/ci-cargo-escalate.sh cargo clippy --workspace -- -D warnings
test:
name: Test
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.check_rust == 'true'
runs-on: rust
env:
RUSTC_WRAPPER: sccache
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
# See script/ci-cargo-escalate.sh for the escalation rationale.
- name: Test (sccache escalation)
run: script/ci-cargo-escalate.sh cargo test --workspace
build-cortex:
name: Build cortex binary
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.build_cortex == 'true'
# runner-rust image already provides rust/cargo/clippy/rustfmt via
# dnf — no rustup install step needed.
runs-on: rust
env:
RUSTC_WRAPPER: sccache
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
# See script/ci-cargo-escalate.sh for the escalation rationale.
- name: Build cortex (release, sccache escalation)
run: script/ci-cargo-escalate.sh cargo build --release -p cortex-cli
- name: Stage binary
run: |
mkdir --parents artifacts
cp target/release/cortex artifacts/cortex
./artifacts/cortex --version || true
- uses: actions/upload-artifact@v3
with:
name: cortex-fc43
path: artifacts/cortex
retention-days: 1
build-bench:
name: Build helexa-bench binary
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.build_bench == 'true'
# Pure-Rust, non-CUDA binary — same runner as cortex.
runs-on: rust
env:
RUSTC_WRAPPER: sccache
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- name: Build helexa-bench (release, sccache escalation)
run: |
# Stamp the SHA helexa-bench records as bench_sha against every
# run (option_env! in sweep.rs reads it at compile time).
export HELEXA_BUILD_SHA="$(git rev-parse HEAD)"
script/ci-cargo-escalate.sh cargo build --release -p helexa-bench
- name: Stage binary
run: |
mkdir --parents artifacts
cp target/release/helexa-bench artifacts/helexa-bench
./artifacts/helexa-bench --version || true
- uses: actions/upload-artifact@v3
with:
name: bench-fc43
path: artifacts/helexa-bench
retention-days: 1
build-neuron:
name: Build neuron-${{ matrix.flavour }}
timeout-minutes: 35
needs: prepare
if: needs.prepare.outputs.build_neuron == 'true'
strategy:
fail-fast: false
matrix:
include:
- flavour: ampere
compute_cap: "86"
runner: cuda-13.0
cuda_home: /usr/local/cuda-13.0
build_jobs: 8
nvcc_threads: 4
cargo_features: "cuda cudnn"
- flavour: ada
compute_cap: "89"
runner: cuda-13.0
cuda_home: /usr/local/cuda-13.0
build_jobs: 8
nvcc_threads: 4
cargo_features: "cuda cudnn"
- flavour: blackwell
compute_cap: "120"
runner: cuda-13.0
cuda_home: /usr/local/cuda-13.0
build_jobs: 8
nvcc_threads: 4
cargo_features: "cuda cudnn"
runs-on: ${{ matrix.runner }}
env:
SCCACHE_BUCKET: sccache
SCCACHE_ENDPOINT: http://caveman.kosherinata.internal:9000
SCCACHE_REGION: auto
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
# sccache handling + failure classification lives in
# script/ci-cargo-escalate.sh: it probes for sccache (the CUDA
# image may not ship it — a missing binary degrades to an uncached
# build rather than failing at `sccache rustc -vV`), and a rustc
# SIGSEGV / OOM-kill keeps the cache and fails fast instead of
# escalating to a slower uncached rebuild. The cache covers the
# ~600-crate host-side dep tree (the bulk of the 10-14 min build),
# shared across all three flavours, so even one run seeds the next.
- name: Build neuron with CUDA (${{ matrix.flavour }})
run: |
export PATH="${{ matrix.cuda_home }}/bin:${PATH}"
export LD_LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LIBRARY_PATH:-}"
# Pin the build SHA neuron reports from GET /version. The git
# fallback in build.rs would also work on a full checkout, but
# injecting the exact checked-out commit is unambiguous under
# shallow/detached states and makes the artifact self-describing.
export HELEXA_BUILD_SHA="$(git rev-parse HEAD)"
script/ci-cargo-escalate.sh cargo build --release -p neuron --features "${{ matrix.cargo_features }}"
env:
CUDA_COMPUTE_CAP: ${{ matrix.compute_cap }}
CARGO_BUILD_JOBS: ${{ matrix.build_jobs }}
NVCC_THREADS: ${{ matrix.nvcc_threads }}
- name: Stage binary
run: |
mkdir --parents artifacts
cp target/release/neuron artifacts/neuron-${{ matrix.flavour }}
file "artifacts/neuron-${{ matrix.flavour }}"
- uses: actions/upload-artifact@v3
with:
name: neuron-${{ matrix.flavour }}-fc43
path: artifacts/neuron-${{ matrix.flavour }}
retention-days: 1
package-cortex:
name: Package cortex RPM
timeout-minutes: 20
needs: [prepare, build-cortex]
runs-on: rpm
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- uses: actions/download-artifact@v3
with:
name: cortex-fc43
path: artifacts/
- name: Build RPM
run: |
set -eux
rm -f ~/.rpmmacros
rpmdev-setuptree
cp artifacts/cortex ~/rpmbuild/SOURCES/
cp data/cortex.service ~/rpmbuild/SOURCES/
cp data/cortex-sysusers.conf ~/rpmbuild/SOURCES/
cp data/cortex-firewalld.xml ~/rpmbuild/SOURCES/
cp cortex.example.toml ~/rpmbuild/SOURCES/
cp models.example.toml ~/rpmbuild/SOURCES/
cp LICENSE ~/rpmbuild/SOURCES/
rpmbuild -bb rpm/cortex-prerelease.spec \
--define "cortex_version ${{ needs.prepare.outputs.version }}" \
--define "cortex_prerelease ${{ needs.prepare.outputs.release }}" \
--undefine dist \
--define "dist .fc43"
- uses: actions/upload-artifact@v3
with:
name: rpm-cortex-fc43
path: ~/rpmbuild/RPMS/x86_64/*.rpm
retention-days: 7
package-bench:
name: Package helexa-bench RPM
timeout-minutes: 20
needs: [prepare, build-bench]
runs-on: rpm
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- uses: actions/download-artifact@v3
with:
name: bench-fc43
path: artifacts/
- name: Build RPM
run: |
set -eux
rm -f ~/.rpmmacros
rpmdev-setuptree
cp artifacts/helexa-bench ~/rpmbuild/SOURCES/
cp data/helexa-bench.service ~/rpmbuild/SOURCES/
cp data/helexa-bench-sysusers.conf ~/rpmbuild/SOURCES/
cp data/helexa-bench-firewalld.xml ~/rpmbuild/SOURCES/
cp helexa-bench.example.toml ~/rpmbuild/SOURCES/
cp LICENSE ~/rpmbuild/SOURCES/
rpmbuild -bb rpm/helexa-bench-prerelease.spec \
--define "bench_version ${{ needs.prepare.outputs.version }}" \
--define "bench_prerelease ${{ needs.prepare.outputs.release }}" \
--undefine dist \
--define "dist .fc43"
- uses: actions/upload-artifact@v3
with:
name: rpm-bench-fc43
path: ~/rpmbuild/RPMS/x86_64/*.rpm
retention-days: 7
package-neuron:
name: Package helexa-neuron-${{ matrix.flavour }} RPM
timeout-minutes: 20
needs: [prepare, build-neuron]
runs-on: rpm
strategy:
fail-fast: false
matrix:
include:
- flavour: ampere
- flavour: ada
- flavour: blackwell
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
- uses: actions/download-artifact@v3
with:
name: neuron-${{ matrix.flavour }}-fc43
path: artifacts/
- name: Build RPM
run: |
set -eux
rm -f ~/.rpmmacros
rpmdev-setuptree
cp artifacts/neuron-${{ matrix.flavour }} ~/rpmbuild/SOURCES/
cp data/neuron.service ~/rpmbuild/SOURCES/
cp data/neuron-sysusers.conf ~/rpmbuild/SOURCES/
cp data/neuron-firewalld.xml ~/rpmbuild/SOURCES/
cp neuron.example.toml ~/rpmbuild/SOURCES/
cp LICENSE ~/rpmbuild/SOURCES/
rpmbuild -bb rpm/helexa-neuron-prerelease.spec \
--define "neuron_version ${{ needs.prepare.outputs.version }}" \
--define "neuron_flavour ${{ matrix.flavour }}" \
--define "neuron_prerelease ${{ needs.prepare.outputs.release }}" \
--undefine dist \
--define "dist .fc43"
- uses: actions/upload-artifact@v3
with:
name: rpm-neuron-${{ matrix.flavour }}-fc43
path: ~/rpmbuild/RPMS/x86_64/*.rpm
retention-days: 7
publish:
name: Publish to rpm.lair.cafe (unstable)
timeout-minutes: 25
needs: [lint, test, package-cortex, package-neuron, package-bench]
# Runs when at least one package was built and nothing failed.
# lint/test may be skipped (docs-only refs never get here because
# no packages build), but a real failure in any blocks the
# fleet from receiving the RPMs.
if: >-
${{
!cancelled()
&& (needs.lint.result == 'success' || needs.lint.result == 'skipped')
&& (needs.test.result == 'success' || needs.test.result == 'skipped')
&& (needs.package-cortex.result == 'success' || needs.package-neuron.result == 'success' || needs.package-bench.result == 'success')
&& needs.package-cortex.result != 'failure'
&& needs.package-neuron.result != 'failure'
&& needs.package-bench.result != 'failure'
}}
runs-on: rpm
concurrency:
group: rpm-publish
cancel-in-progress: false
env:
RPM_REPO_HOST: oolon.kosherinata.internal
FEDORA_VERSION: "43"
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
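- name: Download all built RPMs
uses: actions/download-artifact@v3
with:
path: rpms/
# NOTE: `pattern` is a download-artifact v4 input. Upstream v3 only
# takes `name` and `path`, so with no `name` it downloads every
# artifact of the run into rpms/<artifact-name>/. The flatten step
# below is what actually selects the RPMs.
pattern: rpm-*-fc43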
- name: Flatten RPM artifacts
run: |
set -eux
find rpms/ -name '*.rpm' -exec mv --target-directory=rpms/ {} +
find rpms/ -mindepth 1 -type d -empty -delete
ls -la rpms/
- name: Check for sequoia-sq
run: |
if ! command -v sq &> /dev/null; then
echo "ERROR: sequoia-sq is not installed. Install with: sudo dnf install sequoia-sq"
exit 1
fi
- name: Import signing key
env:
# Pass secrets via env so values stay out of the rendered shell
# script (which Gitea includes in step logs). Template
# expansion of ${{ secrets.X }} inside `run:` writes the literal
# value into the script and depends on Gitea's log masker to
# scrub it — fragile for multi-line keys.
RPM_SIGNING_KEY: ${{ secrets.RPM_SIGNING_KEY }}
RPM_SIGNING_KEY_ID: ${{ secrets.RPM_SIGNING_KEY_ID }}
run: |
echo "$RPM_SIGNING_KEY" | gpg --batch --import
fpr=$(gpg --batch --with-colons --list-keys "$RPM_SIGNING_KEY_ID" | awk -F: '/^fpr:/ { print $10; exit }')
echo "${fpr}:6:" | gpg --batch --import-ownertrust
sed "s/@GPG_NAME@/$RPM_SIGNING_KEY_ID/" rpm/rpmmacros > ~/.rpmmacros
- name: Sign RPMs
run: |
set -eux
for rpm in rpms/*.rpm; do
echo "signing ${rpm}..."
rpm --addsign "${rpm}"
done
- name: Set up SSH for rsync
run: |
install --directory --mode 700 ~/.ssh
echo "${RSYNC_SSH_KEY}" | install --mode 600 /dev/stdin ~/.ssh/id_ed25519
env:
RSYNC_SSH_KEY: ${{ secrets.RSYNC_SSH_KEY }}
- name: Test SSH connectivity
run: |
ssh -o StrictHostKeyChecking=accept-new "gitea_ci@${RPM_REPO_HOST}" exit
- name: Ensure unstable repo directory exists
run: |
ssh "gitea_ci@${RPM_REPO_HOST}" \
"mkdir --parents /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable"
- name: Sync RPMs to unstable repo
run: |
rsync \
--archive \
--verbose \
--chmod D755,F644 \
rpms/*.rpm \
"gitea_ci@${RPM_REPO_HOST}:/var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable/"
- name: Update unstable repo metadata
run: |
ssh "gitea_ci@${RPM_REPO_HOST}" \
"cd /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable && createrepo_c --update ."
- name: Generate packages.json manifest
run: |
scp script/generate-packages-json.py "gitea_ci@${RPM_REPO_HOST}:/tmp/"
ssh "gitea_ci@${RPM_REPO_HOST}" \
"python3 /tmp/generate-packages-json.py \
--repodata-dir /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable/repodata \
--output /var/www/rpm/fedora/${FEDORA_VERSION}/x86_64/unstable/packages.json \
--base-url https://rpm.lair.cafe/fedora/${FEDORA_VERSION}/x86_64/unstable"
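
Review note on the release stamp used by this workflow: the comments argue that a second-precise commit timestamp placed ahead of the SHA makes prerelease packages sort chronologically, and below the eventual stable release. That can be checked with a small model of RPM's segment comparison. This is a simplified sketch, not librpm: the name `rpmvercmp_sketch` is made up for illustration, the sample stamps are hypothetical, and `~` / `^` handling is left out on the assumption that the stamps never contain those characters.

```python
import re


def rpmvercmp_sketch(a: str, b: str) -> int:
    """Simplified model of RPM's segment comparison (assumes no ~ or ^).

    Returns 1 if a is newer, -1 if b is newer, 0 if they compare equal.
    """
    # Split into maximal runs of digits or letters; separators are dropped.
    seg_a = re.findall(r"[0-9]+|[A-Za-z]+", a)
    seg_b = re.findall(r"[0-9]+|[A-Za-z]+", b)
    for x, y in zip(seg_a, seg_b):
        if x.isdigit() and y.isdigit():
            # Numeric segments compare as integers (leading zeros ignored).
            if int(x) != int(y):
                return 1 if int(x) > int(y) else -1
        elif x.isdigit() != y.isdigit():
            # A numeric segment always outranks an alphabetic one.
            return 1 if x.isdigit() else -1
        elif x != y:
            # Alphabetic segments compare as plain strings.
            return 1 if x > y else -1
    # All shared segments equal: the string with segments left over is newer.
    return (len(seg_a) > len(seg_b)) - (len(seg_a) < len(seg_b))


# Two hypothetical same-day commits in the 0.1.<timestamp>.git<sha> form.
earlier = "0.1.20260518140530.gitf9f5fa4"
later = "0.1.20260518150000.git602e8e1"

print(rpmvercmp_sketch(later, earlier))        # 1: the timestamp decides
print(rpmvercmp_sketch("602e8e1", "f9f5fa4"))  # 1: digit-led fragment outranks alpha-led
print(rpmvercmp_sketch("0.1.20260518140530.gitabcdef0", "1"))  # -1: below a stable "1"
```

The second call shows why a SHA fragment is a poor sort key on its own: its order depends on whether it happens to start with a digit, not on commit time. Because the timestamp segment comes first and differs between any two commits made at least a second apart, the comparison never reaches the SHA.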

@@ -1,12 +1,26 @@
name: CI
# Pushes to main are deliberately excluded: build-prerelease.yml runs
# its own lint/test jobs there (gating publish), and running both
# workflows on the same push made them queue against each other on the
# same runner labels — ~12 minutes of added latency per deploy. Feature
# branches, PRs to main, and release tags keep the full gate here.
on:
push:
branches: ["**"]
branches-ignore: [main]
tags: ["v*"]
pull_request:
branches: [main]
# Coalesce same-ref pushes; a newer push supersedes the in-flight run.
# (The old shared `cortex-runner-pool` group with build-prerelease.yml
# is gone — the workflows no longer trigger on the same refs, and
# ephemeral one-VM-per-job runners removed the shared-workspace race
# that group existed to serialize.)
concurrency:
group: ci-${{ github.ref }}
cancel-in-progress: true
env:
CARGO_INCREMENTAL: "0"
RUSTC_WRAPPER: sccache
@@ -16,56 +30,103 @@ env:
SCCACHE_S3_USE_SSL: "false"
AWS_ACCESS_KEY_ID: ${{ secrets.SCCACHE_S3_ACCESS_KEY }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.SCCACHE_S3_SECRET_KEY }}
# fmt, clippy, and test all run in parallel on the same `rust` runner
# and would otherwise share /root/.cache/act/<hash>/hostexecutor/target/,
# racing each other's cargo temp files (.tmpXXXXXX) and failing builds
# mid-compile. Give each job its own target directory so the invocations
# don't collide. sccache still backs the actual rustc cache, so the
# rebuild penalty is small.
CARGO_TARGET_DIR: target-${{ github.job }}
jobs:
check:
name: Format, lint, build, test
runs-on: fedora
fmt:
name: Format
timeout-minutes: 15
runs-on: rust
steps:
- uses: actions/checkout@v4
- run: cargo fmt --check --all
- name: Cache cargo registry and target
uses: actions/cache@v4
with:
path: |
~/.cargo/bin
~/.cargo/registry/index
~/.cargo/registry/cache
~/.cargo/git/db
target
key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
restore-keys: |
${{ runner.os }}-cargo-
clippy:
name: Clippy
timeout-minutes: 25
runs-on: rust
steps:
- uses: actions/checkout@v4
# Failure-aware sccache escalation lives in the shared script (kept
# in sync with build-prerelease.yml): a signal death (rustc SIGSEGV
# / OOM-kill) keeps the cache and fails fast instead of an uncached
# rebuild; only a real sccache fault drops the cache.
- name: Clippy (sccache escalation)
run: script/ci-cargo-escalate.sh cargo clippy --workspace -- -D warnings
- name: Ensure sccache with S3 support
env:
RUSTC_WRAPPER: ""
test:
name: Test
timeout-minutes: 25
runs-on: rust
steps:
- uses: actions/checkout@v4
# See script/ci-cargo-escalate.sh for the escalation rationale.
- name: Test (sccache escalation)
run: script/ci-cargo-escalate.sh cargo test --workspace
# Type-check the CUDA-only code path. Borrow-check-only — we
# never run the tests here (the runner has no GPU). This catches
# the category of bug where a refactor compiles fine under the
# default feature set (which is what the `clippy` and `test` jobs
# exercise) but fails inside a `#[cfg(feature = "cuda")]` block.
# `runs-on: cuda-13.0` selects the runner that ships nvcc /
# cudarc's build prerequisites. The generic `rust` and `rpm`
# runners don't have them (the previous label `rpm` was tried
# first and tripped cudarc's `nvcc --version` build script —
# see commit history).
cuda-check:
name: CUDA type-check
timeout-minutes: 35
runs-on: cuda-13.0
# The workflow-level env sets `RUSTC_WRAPPER: sccache`
# unconditionally, which hard-fails cargo if the CUDA image
# doesn't ship sccache. Clear it at job level; the "Enable
# sccache when available" step opts back in only after probing
# for the binary. SCCACHE_*/AWS creds stay set — harmless when
# the wrapper is off, required when it's on.
env:
RUSTC_WRAPPER: ""
# candle-kernels' build script falls back to `nvidia-smi` for
# compute-cap detection when this is unset — and the GPU-less
# builder image doesn't ship nvidia-smi. Any valid cap works for
# a borrow-check; the real per-flavour caps live in
# build-prerelease.yml's matrix.
CUDA_COMPUTE_CAP: "86"
steps:
- uses: actions/checkout@v4
# sccache probing + failure classification lives in the shared
# script (see build-prerelease.yml's neuron build for the same
# pattern). It probes for sccache and, on a rustc SIGSEGV / OOM,
# keeps the cache and fails fast rather than rebuilding uncached.
- name: cargo check --features cuda (sccache escalation)
run: |
if sccache --version 2>/dev/null && sccache --show-stats 2>/dev/null; then
echo "sccache with S3 support already installed"
else
cargo install sccache --features s3 --locked
fi
- name: Check formatting
run: cargo fmt --check --all
- name: Clippy
run: cargo clippy --workspace -- -D warnings
- name: Test
run: cargo test --workspace
- name: Show sccache stats
run: sccache --show-stats
# act launches the step shell without /etc/profile, so the
# gitea_runner user's inherited PATH lacks /usr/local/cuda-13.0/bin.
# cudarc's build.rs shells out to `nvcc --version` (the neuron
# crate enables cuda-version-from-build-system) and panics with
# ENOENT if nvcc isn't resolvable — keep this export in sync
# with build-prerelease.yml.
export PATH="/usr/local/cuda-13.0/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LIBRARY_PATH:-}"
script/ci-cargo-escalate.sh cargo check -p neuron --features cuda --all-targets
srpm-cortex:
name: Build cortex SRPM
runs-on: fedora
needs: check
timeout-minutes: 25
runs-on: rpm
needs: [fmt, clippy, test, cuda-check]
if: startsWith(github.ref, 'refs/tags/v')
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Determine version
id: version
@@ -79,6 +140,12 @@ jobs:
sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
sed -i "s/^Version:.*/Version: ${VERSION}/" cortex.spec
- name: Generate changelog entry
uses: https://git.lair.cafe/actions/rpm-changelog@v1
with:
spec: cortex.spec
version: ${{ steps.version.outputs.VERSION }}
- name: Generate source tarball
run: |
set -ex
@@ -113,11 +180,14 @@ jobs:
srpm-neuron:
name: Build neuron SRPM
runs-on: fedora
needs: check
timeout-minutes: 25
runs-on: rpm
needs: [fmt, clippy, test, cuda-check]
if: startsWith(github.ref, 'refs/tags/v')
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Determine version
id: version
@@ -129,31 +199,37 @@ jobs:
run: |
VERSION="${{ steps.version.outputs.VERSION }}"
sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
sed -i "s/^Version:.*/Version: ${VERSION}/" neuron.spec
sed -i "s/^Version:.*/Version: ${VERSION}/" helexa-neuron.spec
- name: Generate changelog entry
uses: https://git.lair.cafe/actions/rpm-changelog@v1
with:
spec: helexa-neuron.spec
version: ${{ steps.version.outputs.VERSION }}
- name: Generate source tarball
run: |
set -ex
VERSION="${{ steps.version.outputs.VERSION }}"
tar czf /tmp/neuron-${VERSION}.tar.gz \
--transform "s,^\.,neuron-${VERSION}," \
tar czf /tmp/helexa-neuron-${VERSION}.tar.gz \
--transform "s,^\.,helexa-neuron-${VERSION}," \
--exclude='./target' \
--exclude='./.git' \
--exclude='*.tar.gz' \
--exclude='*.src.rpm' \
.
mv /tmp/neuron-${VERSION}.tar.gz .
mv /tmp/helexa-neuron-${VERSION}.tar.gz .
- name: Vendor Rust dependencies
run: |
VERSION="${{ steps.version.outputs.VERSION }}"
cargo vendor vendor/
tar czf neuron-${VERSION}-vendor.tar.gz vendor/
tar czf helexa-neuron-${VERSION}-vendor.tar.gz vendor/
rm -rf vendor/
- name: Build SRPM
run: |
rpmbuild -bs neuron.spec \
rpmbuild -bs helexa-neuron.spec \
--define "_sourcedir $(pwd)" \
--define "_srcrpmdir $(pwd)"
@@ -165,7 +241,8 @@ jobs:
copr-cortex:
name: Publish cortex to COPR
runs-on: fedora
timeout-minutes: 60
runs-on: fedora-43
needs: srpm-cortex
steps:
- name: Download SRPM
@@ -173,17 +250,17 @@ jobs:
with:
name: srpm-cortex
- name: Configure copr-cli
run: |
mkdir -p ~/.config
echo "${{ secrets.COPR_CONFIG }}" > ~/.config/copr
- name: Submit build to COPR
run: copr-cli build helexa/cortex *.src.rpm
- name: Publish to COPR
uses: https://git.lair.cafe/actions/copr-publish@v1
with:
project: helexa/helexa
srpm: "*.src.rpm"
copr-config: ${{ secrets.COPR_CONFIG }}
copr-neuron:
name: Publish neuron to COPR
runs-on: fedora
timeout-minutes: 60
runs-on: fedora-43
needs: srpm-neuron
steps:
- name: Download SRPM
@@ -191,37 +268,59 @@ jobs:
with:
name: srpm-neuron
- name: Configure copr-cli
run: |
mkdir -p ~/.config
echo "${{ secrets.COPR_CONFIG }}" > ~/.config/copr
- name: Submit build to COPR
run: copr-cli build helexa/neuron *.src.rpm
- name: Publish to COPR
uses: https://git.lair.cafe/actions/copr-publish@v1
with:
project: helexa/helexa
srpm: "*.src.rpm"
copr-config: ${{ secrets.COPR_CONFIG }}
bump-version:
name: Bump version in source
runs-on: fedora
timeout-minutes: 15
runs-on: rust
needs: [copr-cortex, copr-neuron]
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Stamp version and push
- name: Determine version
id: version
run: echo "VERSION=${GITHUB_REF#refs/tags/v}" >> "$GITHUB_OUTPUT"
- name: Stamp version
run: |
VERSION="${{ steps.version.outputs.VERSION }}"
sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
sed -i "s/^Version:.*/Version: ${VERSION}/" cortex.spec
sed -i "s/^Version:.*/Version: ${VERSION}/" helexa-neuron.spec
cargo check --workspace 2>/dev/null || true
- name: Generate cortex changelog entry
uses: https://git.lair.cafe/actions/rpm-changelog@v1
with:
spec: cortex.spec
version: ${{ steps.version.outputs.VERSION }}
- name: Generate helexa-neuron changelog entry
uses: https://git.lair.cafe/actions/rpm-changelog@v1
with:
spec: helexa-neuron.spec
version: ${{ steps.version.outputs.VERSION }}
- name: Commit and push
env:
GITEA_TOKEN: ${{ secrets.GITEA_TOKEN }}
run: |
VERSION="${GITHUB_REF#refs/tags/v}"
sed -i '/\[workspace\.package\]/,/\[/{ s/^version = ".*"/version = "'"${VERSION}"'"/ }' Cargo.toml
sed -i "s/^Version:.*/Version: ${VERSION}/" cortex.spec
sed -i "s/^Version:.*/Version: ${VERSION}/" neuron.spec
cargo check --workspace 2>/dev/null || true
VERSION="${{ steps.version.outputs.VERSION }}"
git config user.name "Gitea Actions"
git config user.email "actions@git.lair.cafe"
git add Cargo.toml Cargo.lock cortex.spec neuron.spec
git add Cargo.toml Cargo.lock cortex.spec helexa-neuron.spec
if git diff --cached --quiet; then
echo "Version already at ${VERSION}"
echo "Nothing to commit for ${VERSION}"
else
git commit -m "chore: bump version to ${VERSION}"
git remote set-url origin "https://gitea-actions:${GITEA_TOKEN}@git.lair.cafe/helexa/cortex.git"
git remote set-url origin "https://gitea-actions:${GITEA_TOKEN}@git.lair.cafe/${{ github.repository }}.git"
git push origin HEAD:main
fi
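
Review note on the version stamping in the SRPM and bump-version jobs: the tag-to-version step (`${GITHUB_REF#refs/tags/v}`) and the `sed` range that rewrites `version` only inside `[workspace.package]` can be modelled in a few lines. This is a sketch under assumptions, not a replacement for the `sed` call: the function names and the sample `Cargo.toml` text are invented for illustration, and the table is treated as ending at the next line that starts with `[`, which is slightly stricter than the `/\[/` range end used in the workflow (that one matches a `[` anywhere on a line).

```python
import re


def version_from_ref(ref: str) -> str:
    """Strip the refs/tags/v prefix, like ${GITHUB_REF#refs/tags/v}."""
    prefix = "refs/tags/v"
    return ref[len(prefix):] if ref.startswith(prefix) else ref


def stamp_workspace_version(cargo_toml: str, version: str) -> str:
    """Rewrite `version = "..."` only inside the [workspace.package] table."""
    out = []
    in_table = False
    for line in cargo_toml.splitlines(keepends=True):
        stripped = line.strip()
        if stripped == "[workspace.package]":
            in_table = True
        elif stripped.startswith("["):
            # Any other table header ends the range.
            in_table = False
        elif in_table:
            # Same anchored pattern as the sed expression; the replacement
            # is a function so the version string is inserted literally.
            line = re.sub(r'^version = ".*"',
                          lambda _m: f'version = "{version}"', line)
        out.append(line)
    return "".join(out)


# Made-up manifest: only the first `version` line should change.
sample = (
    '[workspace.package]\n'
    'version = "0.1.15"\n'
    'edition = "2021"\n'
    '\n'
    '[workspace.dependencies]\n'
    'serde = { version = "1" }\n'
    '\n'
    '[package]\n'
    'version = "9.9.9"\n'
)

stamped = stamp_workspace_version(sample, version_from_ref("refs/tags/v0.1.16"))
print(stamped)
```

One thing the sketch makes visible: if an array-valued key (for example an `authors = [...]` line) ever appears above `version` inside `[workspace.package]`, the workflow's `sed` range would close early and leave the version untouched, so key order in that table matters.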

@@ -0,0 +1,136 @@
name: deploy-dev
# Fast-path iteration deploy for a SINGLE neuron host: build one CUDA
# flavour, copy the raw binary to the host, restart neuron.service.
# Skips the other two flavours, all RPM packaging, signing, repo
# publish, and dnf — push-to-testable drops from ~20 min to roughly
# one CUDA build plus a service restart.
#
# This is a DEV convenience, not a release path:
# - the binary lands at /usr/bin/neuron *outside* RPM ownership;
# the next regular deploy.yml run reconciles the host back to the
# packaged binary (dnf sees the newer RPM and reinstalls). `rpm -V
# helexa-neuron-<flavour>` flagging a modified /usr/bin/neuron in
# the interim is expected.
# - nothing is published; other hosts are untouched.
# - requires the `install` sudoers rule from
# asset/sudoers.d/neuron-host.conf (re-run script/infra-setup.sh
# after updating it).
#
# Trigger from the Gitea UI: Actions → deploy-dev → Run workflow,
# pick the target host. Defaults to the ref you dispatch from, so it
# works from feature branches without touching main.
on:
workflow_dispatch:
inputs:
target:
description: "neuron host to deploy to"
required: true
type: choice
options: [beast, benjy, quadbrat]
default: beast
# One dev deploy per host at a time; a newer dispatch for the same host wins.
concurrency:
group: deploy-dev-${{ inputs.target }}
cancel-in-progress: true
env:
CARGO_INCREMENTAL: "0"
CARGO_TERM_COLOR: "always"
jobs:
build:
name: Build neuron (${{ inputs.target }})
runs-on: cuda-13.0
outputs:
flavour: ${{ steps.map.outputs.flavour }}
steps:
- uses: actions/checkout@v4
# host → flavour → compute cap. Keep in sync with the
# build-neuron matrix in build-prerelease.yml and the
# deploy-neurons matrix in deploy.yml.
- id: map
run: |
case "${{ inputs.target }}" in
beast) flavour=blackwell cap=120 ;;
benjy) flavour=ada cap=89 ;;
quadbrat) flavour=ampere cap=86 ;;
*) echo "unknown target ${{ inputs.target }}"; exit 1 ;;
esac
echo "flavour=${flavour}" >> "$GITHUB_OUTPUT"
echo "cap=${cap}" >> "$GITHUB_OUTPUT"
- name: Build neuron with CUDA
run: |
set -eux
export PATH="/usr/local/cuda-13.0/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LIBRARY_PATH:-}"
cargo build --release -p neuron --features "cuda cudnn"
env:
CUDA_COMPUTE_CAP: ${{ steps.map.outputs.cap }}
CARGO_BUILD_JOBS: "8"
NVCC_THREADS: "4"
- name: Stage binary
run: |
mkdir --parents artifacts
cp target/release/neuron artifacts/neuron-dev
file artifacts/neuron-dev
- uses: actions/upload-artifact@v3
with:
name: neuron-dev-${{ inputs.target }}
path: artifacts/neuron-dev
retention-days: 1
deploy:
name: Deploy to ${{ inputs.target }}
needs: build
runs-on: fedora-43
env:
DEPLOY_KEY: |
${{ secrets.RSYNC_SSH_KEY }}
TARGET_HOST: ${{ inputs.target }}.hanzalova.internal
steps:
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
"gitea_ci@${TARGET_HOST}" 'hostname -f'
- uses: actions/download-artifact@v3
with:
name: neuron-dev-${{ inputs.target }}
path: artifacts/
- name: Copy binary to host
run: |
scp artifacts/neuron-dev "gitea_ci@${TARGET_HOST}:/var/lib/gitea_ci/neuron-dev"
- name: Install binary and restart neuron.service
run: |
ssh "gitea_ci@${TARGET_HOST}" '
set -eu
if systemctl is-active --quiet neuron.service; then
sudo /usr/bin/systemctl stop neuron.service
fi
# Exact command form required by the sudoers rule in
# asset/sudoers.d/neuron-host.conf — change both together.
sudo /usr/bin/install -o root -g root -m 0755 /var/lib/gitea_ci/neuron-dev /usr/bin/neuron
# enable --now so a dev deploy also leaves the unit enabled
# for boot, consistent with deploy.yml.
sudo /usr/bin/systemctl enable --now neuron.service
rm -f /var/lib/gitea_ci/neuron-dev'
- name: Capture neuron.service startup journal
if: always()
run: |
sleep 10
ssh "gitea_ci@${TARGET_HOST}" \
'journalctl --unit neuron.service -I --no-pager'

448
.gitea/workflows/deploy.yml Normal file

@@ -0,0 +1,448 @@
name: deploy
# Roll the freshly-published unstable RPMs onto the helexa fleet:
# cortex on the gateway, helexa-neuron-<flavour> on each neuron host,
# and helexa-bench on bob (the bench host).
#
# Triggered automatically after `build-prerelease` succeeds (by which
# point the new RPMs are live on rpm.lair.cafe/unstable), and also
# re-runnable manually from the Gitea UI.
#
# Each host self-gates: if dnf sees no newer package than what is
# installed, the service is left alone — no stop, no restart, no model
# cold-load. Combined with build-prerelease's change detection this
# means a docs- or gateway-only push never restarts the neurons (a
# neuron restart costs ~5 min of 27B cold-load, see issue #1).
#
# Per-host one-time setup (gitea_ci user, authorized_keys, scoped
# sudoers drop-in) lives in script/infra-setup.sh — run that once per
# host before this workflow can succeed.
on:
workflow_run:
workflows: [build-prerelease]
types: [completed]
workflow_dispatch:
# Serialize deploys. Overlapping runs would race on dnf metadata
# refresh and service-restart timing; queueing keeps the fleet
# predictable. Don't cancel an in-flight deploy — a half-applied dnf
# transaction is worse than a slightly stale deploy.
concurrency:
group: deploy
cancel-in-progress: false
env:
DEPLOY_KEY: |
${{ secrets.RSYNC_SSH_KEY }}
jobs:
deploy-cortex:
runs-on: fedora-43
# Two trigger paths: manual dispatch always runs; workflow_run
# only runs if the upstream `build-prerelease` actually succeeded.
if: >-
${{
github.event_name == 'workflow_dispatch'
|| github.event.workflow_run.conclusion == 'success'
}}
steps:
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@hanzalova.internal 'hostname -f'
# Gating compares `rpm -q` against the packages.json manifest the
# publish job maintains — NOT unprivileged `dnf check-update`,
# which proved unreliable as the gitea_ci user (hung on metadata
# locks on one host, silently reported "no updates" on others).
# An unreadable/unparsable manifest fails open: deploy proceeds.
- name: Deploy cortex (skips when already current)
run: |
ssh gitea_ci@hanzalova.internal 'bash -s' <<'DEPLOY'
set -eu
pkg=cortex
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' "${pkg}" 2>/dev/null || echo "not-installed")
latest=$(curl -fsS --max-time 15 "https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json" 2>/dev/null \
| python3 -c '
import json, sys
name = sys.argv[1]
cands = [p for p in json.load(sys.stdin)["packages"] if p.get("name") == name]
if cands:
p = max(cands, key=lambda p: p.get("buildTime", 0))
print(p["version"] + "-" + p["release"])
' "${pkg}" 2>/dev/null || true)
if [ -n "${latest}" ] && [ "${latest}" = "${installed}" ]; then
echo "${pkg}-${installed} already current — leaving service untouched"
exit 0
fi
echo "installed=${installed} published=${latest:-unknown} — deploying"
if systemctl is-active --quiet cortex.service; then
sudo /usr/bin/systemctl stop cortex.service
fi
if rpm -q "${pkg}" >/dev/null 2>&1; then
sudo /usr/bin/dnf upgrade --refresh --allowerasing -y cortex
else
sudo /usr/bin/dnf install --refresh --allowerasing -y cortex
fi
sudo /usr/bin/systemctl daemon-reload
# enable --now: start the service AND enable it for boot so the
# fleet self-heals after a host reboot.
sudo /usr/bin/systemctl enable --now cortex.service
DEPLOY
# Wait for the service to either come up or wedge, then capture
# the latest-invocation journal. Runs even on prior failure so a
# failed start step still leaves a usable record in the deploy log.
- name: Capture cortex.service startup journal
if: always()
run: |
sleep 10
ssh gitea_ci@hanzalova.internal \
'journalctl --unit cortex.service -I --no-pager'
deploy-neurons:
needs: [deploy-cortex]
runs-on: fedora-43
strategy:
# One neuron failing must not cancel the others. Cortex is up
# already; a partial neuron deploy is strictly better than
# rolling back to zero.
fail-fast: false
matrix:
include:
# load_timeout: how long to wait for default_models to finish
# loading after a restart. beast cold-loads Qwen3.6-27B Q6K
# TP=2 (~5-6 min typical, see #1); benjy/quadbrat load small
# single-GPU models in well under a minute.
#
# max_prompt_tokens: per-model context cap, written to the
# neuron.service.d/model.conf drop-in (NEURON_MAX_PROMPT_TOKENS).
# A change here restarts the neuron even with no new RPM. Values
# are VRAM-safe ceilings derived per model — see
# doc/context-limits.md. beast (Qwen3.6-27B, hybrid linear, 2x
# 32GB) has ample KV headroom; benjy (Qwen3-8B dense, ~6GB free)
# is VRAM-bound and stays at the default; quadbrat (Qwen3-1.7B)
# likewise conservative.
- host: beast.hanzalova.internal
flavour: blackwell
load_timeout: 900
max_prompt_tokens: 131072
- host: benjy.hanzalova.internal
flavour: ada
load_timeout: 300
max_prompt_tokens: 16384
- host: quadbrat.hanzalova.internal
flavour: ampere
load_timeout: 300
max_prompt_tokens: 16384
steps:
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@${{ matrix.host }} 'hostname -f'
# See deploy-cortex for why gating uses the publish manifest and
# not unprivileged `dnf check-update`.
- name: Deploy helexa-neuron-${{ matrix.flavour }} (skips when already current)
run: |
ssh gitea_ci@${{ matrix.host }} 'bash -s' <<'DEPLOY'
set -eu
pkg=helexa-neuron-${{ matrix.flavour }}
max_prompt_tokens="${{ matrix.max_prompt_tokens }}"
# ── Desired per-model systemd drop-in ─────────────────────────
# model.conf carries NEURON_MAX_PROMPT_TOKENS so the context cap
# is deterministic per host and rolled out (with a restart) by
# this workflow, not hand-edited. It sorts after local.conf, so a
# deploy-managed value wins over any manual local override of the
# same variable. See doc/context-limits.md.
conf=/etc/systemd/system/neuron.service.d/model.conf
config_changed=0
if [ -n "${max_prompt_tokens}" ]; then
desired=$(printf '%s\n%s\n%s\n%s' \
"# Managed by .gitea/workflows/deploy.yml - do not edit by hand." \
"# Per-model context cap; see doc/context-limits.md." \
"[Service]" \
"Environment=NEURON_MAX_PROMPT_TOKENS=${max_prompt_tokens}")
[ "${desired}" = "$(cat "${conf}" 2>/dev/null || true)" ] || config_changed=1
fi
# ── Package version gate (manifest rationale: see deploy-cortex) ──
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' "${pkg}" 2>/dev/null || echo "not-installed")
latest=$(curl -fsS --max-time 15 "https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json" 2>/dev/null \
| python3 -c '
import json, sys
name = sys.argv[1]
cands = [p for p in json.load(sys.stdin)["packages"] if p.get("name") == name]
if cands:
p = max(cands, key=lambda p: p.get("buildTime", 0))
print(p["version"] + "-" + p["release"])
' "${pkg}" 2>/dev/null || true)
pkg_changed=1
if [ -n "${latest}" ] && [ "${latest}" = "${installed}" ]; then
pkg_changed=0
fi
# Skip only when BOTH the package and the drop-in are unchanged —
# a context-cap change must restart the neuron even with no new RPM.
if [ "${pkg_changed}" -eq 0 ] && [ "${config_changed}" -eq 0 ]; then
echo "${pkg}-${installed} current; NEURON_MAX_PROMPT_TOKENS=${max_prompt_tokens:-<unset>} unchanged — leaving service untouched"
exit 0
fi
echo "installed=${installed} published=${latest:-unknown} pkg_changed=${pkg_changed} config_changed=${config_changed} — deploying"
# Write the drop-in (staged in gitea_ci's dir, installed root-owned).
if [ "${config_changed}" -eq 1 ]; then
printf '%s\n' "${desired}" > /var/lib/gitea_ci/model.conf
sudo /usr/bin/install -o root -g root -m 0644 -D /var/lib/gitea_ci/model.conf "${conf}"
rm -f /var/lib/gitea_ci/model.conf
echo "applied ${conf}: NEURON_MAX_PROMPT_TOKENS=${max_prompt_tokens}"
fi
if systemctl is-active --quiet neuron.service; then
sudo /usr/bin/systemctl stop neuron.service
fi
if [ "${pkg_changed}" -eq 1 ]; then
if rpm -q "${pkg}" >/dev/null 2>&1; then
sudo /usr/bin/dnf upgrade --refresh --allowerasing -y "${pkg}"
else
sudo /usr/bin/dnf install --refresh --allowerasing -y "${pkg}"
fi
fi
# daemon-reload picks up both a new unit (dnf) and the drop-in.
sudo /usr/bin/systemctl daemon-reload
# enable --now: start the service AND enable it for boot so the
# fleet self-heals after a host reboot.
sudo /usr/bin/systemctl enable --now neuron.service
# ── Post-deploy validation ────────────────────────────────
# A deploy only goes green if the neuron (a) finishes loading
# its default models and (b) answers a trivial prompt like an
# LLM should. Catches the class of bug where the binary
# starts fine but model load or inference is broken — which
# previously surfaced only when a human noticed. The wait
# polls /health activation (the structured source of the
# "loaded default model" journal line, plus per-model failure
# detail); the journal-capture step below still runs for
# forensics either way.
load_timeout=${{ matrix.load_timeout }}
echo "waiting for default models (timeout ${load_timeout}s)"
deadline=$(( $(date +%s) + load_timeout ))
health=""
while :; do
health=$(curl -fsS --max-time 5 http://localhost:13131/health 2>/dev/null || true)
state=$(printf %s "${health}" | python3 -c '
import json, sys
try:
print(json.load(sys.stdin).get("activation", {}).get("state", ""))
except Exception:
print("")
')
if [ "${state}" = "ready" ]; then
break
fi
if [ "$(date +%s)" -ge "${deadline}" ]; then
echo "FAIL: activation not ready within ${load_timeout}s (last state: ${state:-unreachable})"
exit 1
fi
sleep 10
done
model=$(printf %s "${health}" | python3 -c '
import json, sys
a = json.load(sys.stdin).get("activation", {})
failed = a.get("failed", [])
if failed:
for f in failed:
msg = "FAILED " + str(f.get("model_id")) + ": " + str(f.get("error", ""))[:400]
sys.stderr.write(msg + chr(10))
sys.exit(1)
completed = a.get("completed", [])
print(completed[0] if completed else "")
')
if [ -z "${model}" ]; then
echo "no default models configured — skipping LLM probe"
exit 0
fi
echo "LLM probe against ${model}"
probe_body=$(printf '{"model":"%s","messages":[{"role":"user","content":"Reply with exactly one word: pineapple"}],"max_tokens":512,"temperature":0}' "${model}")
resp=$(curl -fsS --max-time 180 -H "content-type: application/json" \
-d "${probe_body}" http://localhost:13131/v1/chat/completions) || {
echo "FAIL: probe request errored"
exit 1
}
if printf %s "${resp}" | grep -qi pineapple; then
echo "LLM probe passed"
else
echo "FAIL: probe response missing expected token"
printf %s "${resp}" | head -c 2000
echo
exit 1
fi
DEPLOY
- name: Ensure firewalld allows helexa-neuron
run: |
ssh gitea_ci@${{ matrix.host }} '
if ! sudo /usr/bin/firewall-cmd --query-service=helexa-neuron --quiet 2>/dev/null; then
sudo /usr/bin/firewall-cmd --add-service=helexa-neuron --permanent
sudo /usr/bin/firewall-cmd --reload
fi'
# Wait for the service to either come up or wedge, then capture
# the latest-invocation journal. Runs even on prior failure so a
# failed start step still leaves a usable record in the deploy log.
- name: Capture neuron.service startup journal
if: always()
run: |
sleep 10
ssh gitea_ci@${{ matrix.host }} \
'journalctl --unit neuron.service -I --no-pager'
# helexa-bench is a separate package on a separate host (bob), and it
# only consumes the fleet's HTTP APIs — it has no deploy-ordering
# dependency on cortex or the neurons (the sweep loop is version-aware
# and picks up whatever each neuron reports whenever). So it runs
# alongside the cortex→neurons chain rather than after it.
deploy-bench:
runs-on: fedora-43
if: >-
${{
github.event_name == 'workflow_dispatch'
|| github.event.workflow_run.conclusion == 'success'
}}
steps:
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@bob.hanzalova.internal 'hostname -f'
# See deploy-cortex for why gating uses the publish manifest and
# not unprivileged `dnf check-update`.
- name: Deploy helexa-bench (skips when already current)
run: |
ssh gitea_ci@bob.hanzalova.internal 'bash -s' <<'DEPLOY'
set -eu
pkg=helexa-bench
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' "${pkg}" 2>/dev/null || echo "not-installed")
latest=$(curl -fsS --max-time 15 "https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json" 2>/dev/null \
| python3 -c '
import json, sys
name = sys.argv[1]
cands = [p for p in json.load(sys.stdin)["packages"] if p.get("name") == name]
if cands:
p = max(cands, key=lambda p: p.get("buildTime", 0))
print(p["version"] + "-" + p["release"])
' "${pkg}" 2>/dev/null || true)
if [ -n "${latest}" ] && [ "${latest}" = "${installed}" ]; then
echo "${pkg}-${installed} already current — leaving service untouched"
exit 0
fi
echo "installed=${installed} published=${latest:-unknown} — deploying"
if systemctl is-active --quiet helexa-bench.service; then
sudo /usr/bin/systemctl stop helexa-bench.service
fi
if rpm -q "${pkg}" >/dev/null 2>&1; then
sudo /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-bench
else
sudo /usr/bin/dnf install --refresh --allowerasing -y helexa-bench
fi
sudo /usr/bin/systemctl daemon-reload
# enable --now: start the service AND enable it for boot so the
# bench resumes collecting after a host reboot.
sudo /usr/bin/systemctl enable --now helexa-bench.service
# ── Post-deploy validation ────────────────────────────────
# The bench serves a read-only API on :13132 alongside the
# outbound sweep loop. Probe the API over localhost (bypasses
# firewalld) — catches a crash-on-start or a bad bind. Bail
# early if the unit drops out of active (Restart backoff).
echo "waiting for bench API on :13132"
deadline=$(( $(date +%s) + 30 ))
while :; do
if curl -fsS --max-time 5 http://localhost:13132/api/health >/dev/null 2>&1; then
echo "bench API healthy"
break
fi
if ! systemctl is-active --quiet helexa-bench.service; then
echo "FAIL: helexa-bench.service is not active"
systemctl --no-pager status helexa-bench.service | head -20 || true
exit 1
fi
if [ "$(date +%s)" -ge "${deadline}" ]; then
echo "FAIL: bench API not healthy within 30s"
exit 1
fi
sleep 3
done
DEPLOY
- name: Ensure firewalld allows helexa-bench
run: |
ssh gitea_ci@bob.hanzalova.internal '
if ! sudo /usr/bin/firewall-cmd --query-service=helexa-bench --quiet 2>/dev/null; then
sudo /usr/bin/firewall-cmd --add-service=helexa-bench --permanent
sudo /usr/bin/firewall-cmd --reload
fi'
# Wait for the service to either come up or wedge, then capture
# the latest-invocation journal. Runs even on prior failure so a
# failed start step still leaves a usable record in the deploy log.
- name: Capture helexa-bench.service startup journal
if: always()
run: |
sleep 10
ssh gitea_ci@bob.hanzalova.internal \
'journalctl --unit helexa-bench.service -I --no-pager'
# Build the bench UI and publish it to the public nginx vhost on the
# gateway (https://bench.helexa.ai). The vhost + Let's Encrypt cert are
# one-time host setup (script/infra-setup.sh); this job just refreshes
# the static assets. nginx reverse-proxies /api to the bob API, so the
# SPA is built same-origin (no VITE_API_BASE). Independent of the other
# deploy jobs.
deploy-bench-ui:
runs-on: fedora-43
if: >-
${{
github.event_name == 'workflow_dispatch'
|| github.event.workflow_run.conclusion == 'success'
}}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
- name: Build UI
run: |
cd bench
npm ci
npm run build
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@hanzalova.internal 'hostname -f'
- name: Rsync built UI to gateway webroot
run: |
rsync --archive --compress --delete \
--rsync-path 'sudo rsync' \
bench/dist/ \
gitea_ci@hanzalova.internal:/var/www/bench.helexa.ai/

5
.gitignore vendored

@@ -1,7 +1,12 @@
/target
/bench/node_modules
/bench/dist
*.swp
*.swo
.idea/
.vscode/
cortex.toml
models.toml
doc/plan/*
/target-cuda/
.claude/

268
AGENTS.md Normal file

@@ -0,0 +1,268 @@
# AGENTS.md — helexa/cortex
## Project Overview
helexa is a self-hosted LLM serving stack for multi-node GPU inference clusters. It has two components:
- **cortex** — the per-operator control plane and LLM proxy. A Rust reverse-proxy that sits in front of the fleet and presents a unified OpenAI + Anthropic compatible API surface. It handles model routing, lifecycle management (load/unload/evict), request translation, and metrics collection.
- **neuron** — the per-host LLM harness. One instance runs on every GPU host, serving candle-based in-process inference and managing local hardware discovery and model lifecycle.
## Repository Layout
```
cortex/
├── Cargo.toml # workspace root (Rust 2024 edition, GPL-3.0)
├── cortex.example.toml # example gateway config
├── models.example.toml # example model catalogue
├── neuron.example.toml # example neuron config
├── README.md # public-facing documentation
├── CLAUDE.md # detailed design rationale and implementation history
├── AGENTS.md # ← you are here
├── cortex.spec # RPM spec for cortex
├── helexa-neuron.spec # RPM spec for neuron (renamed to avoid Fedora collision)
├── rpm/ # prerelease RPM specs
│ ├── cortex-prerelease.spec
│ ├── helexa-neuron-prerelease.spec
│ └── helexa-bench-prerelease.spec
├── data/ # systemd units and example configs for packaging
│ ├── cortex.service
│ ├── neuron.service
│ ├── cortex.example.toml
│ ├── neuron.example.toml
│ └── models.example.toml
└── crates/
├── cortex-core/ # shared types, config, envelopes
│ └── src/
│ ├── lib.rs
│ ├── build_info.rs # BuildInfo type for /version endpoint
│ ├── config.rs # figment-based config structs
│ ├── catalogue.rs # ModelProfile, placement matching
│ ├── discovery.rs # DeviceInfo, DiscoveryResponse
│ ├── harness.rs # Harness trait, HarnessConfig, HarnessHealth
│ ├── node.rs # NodeState, ModelStatus
│ ├── openai.rs # OpenAI request/response types
│ ├── anthropic.rs # Anthropic request/response types
│ ├── translate.rs # OpenAI <-> Anthropic translation
│ └── metrics.rs # RequestMetrics, histogram helpers
├── cortex-gateway/ # the HTTP proxy server
│ └── src/
│ ├── lib.rs
│ ├── state.rs # CortexState: Arc<RwLock<...>>
│ ├── router.rs # model -> node routing logic
│ ├── proxy.rs # streaming HTTP proxy to backends
│ ├── evictor.rs # LRU/priority eviction logic
│ ├── poller.rs # background task polling neuron status
│ ├── handlers.rs # axum handlers (chat, completions, models, etc.)
│ └── metrics.rs # prometheus exporter endpoint
├── cortex-cli/ # CLI entrypoint
│ └── src/main.rs # binary: `cortex`
├── neuron/ # per-host LLM daemon (replaces cortex-agent)
│ ├── Cargo.toml # features: cuda, cudnn, flash-attn, cuda-integration
│ ├── build.rs # compiles CUDA kernels, emits build metadata
│ └── src/
│ ├── main.rs # binary: `neuron`
│ ├── discovery.rs # nvidia-smi parsing, device enumeration
│ ├── health.rs # runtime GPU polling
│ ├── api.rs # HTTP handlers for /discovery, /models, etc.
│ ├── version.rs # GET /version endpoint with BuildInfo
│ ├── models.rs # local model lifecycle orchestration
│ └── harness/ # in-process candle inference
│ ├── device_worker/ # per-device CUDA worker threads
│ │ ├── mod.rs # canonical narrative for worker architecture
│ │ ├── jobs.rs # Job enum, dispatch handlers
│ │ └── dispatch.rs # DeviceWorkerState struct
│ ├── candle.rs # candle model implementation
│ └── tp/ # tensor parallelism
│ └── worker.rs # TP worker subprocesses
├── helexa-acp/ # Agent Client Protocol bridge (Apache-2.0)
│ └── src/main.rs # binary: `helexa-acp`, self-contained (no workspace deps)
└── helexa-bench/ # benchmark harness
└── src/main.rs # binary: `helexa-bench`, SQLite-backed, version-aware
```
## Key Design Decisions
### Architecture
- **cortex** is the control plane. It exposes the unified API, routes requests, manages model lifecycle across the fleet, and collects metrics.
- **neuron** is the node plane. One instance runs on every GPU host. It discovers local hardware, manages in-process candle inference, handles NCCL tensor parallelism, and reports runtime state.
- cortex never shells out to `nvidia-smi`, never touches systemd units, and never talks directly to a harness. It talks only to neurons via HTTP API on port 13131.
### Per-device worker thread (neuron)
Every CUDA device gets one dedicated OS thread that owns its `CudaContext` for the daemon's lifetime. All CUDA operations route through this thread via a `std::sync::mpsc` job channel. Tensors never escape the worker thread alive. Inference replies carry `Vec<f32>` CPU-side logits; sampled tokens come back as `u32`. The opaque `ArchHandle(u64)` and `TpHandle(u64)` are indices into the worker's state slab, not pointers.
CPU loads (`Device::Cpu` fallback) keep the legacy `tokio::task::spawn_blocking + Arc<Mutex<ModelArch>>` path — there's no context to own and the channel hop would only add latency. Four `spawn_blocking` references in `harness/candle.rs` are deliberate CPU fallback.
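The worker pattern described above can be sketched with the standard library alone. This is a minimal illustration, not neuron's actual code: `FakeModel`, the `Job` variants, and `spawn_worker` are hypothetical names, the "model" just adds a bias on the CPU, and replies travel over `std::sync::mpsc` rather than an async oneshot channel so the sketch needs no external crates.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Opaque handle: an index into the worker's slab, never a pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ArchHandle(pub u64);

/// Hypothetical stand-in for a loaded model; a real one would hold device tensors.
struct FakeModel {
    bias: f32,
}

/// Hypothetical job set. Every job carries its own reply channel.
pub enum Job {
    Load { bias: f32, reply: mpsc::Sender<ArchHandle> },
    Forward {
        handle: ArchHandle,
        input: Vec<f32>,
        reply: mpsc::Sender<Result<Vec<f32>, String>>,
    },
    Unload { handle: ArchHandle, reply: mpsc::Sender<bool> },
}

/// Spawns one named worker thread and returns the sending half of its job channel.
pub fn spawn_worker(name: &str) -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    thread::Builder::new()
        .name(name.to_string())
        .spawn(move || {
            // The slab is created, used, and dropped on this thread only.
            let mut slab: HashMap<u64, FakeModel> = HashMap::new();
            let mut next_id = 0u64;
            // The loop ends when every sender has been dropped.
            for job in rx {
                match job {
                    Job::Load { bias, reply } => {
                        let id = next_id;
                        next_id += 1;
                        slab.insert(id, FakeModel { bias });
                        let _ = reply.send(ArchHandle(id));
                    }
                    Job::Forward { handle, input, reply } => {
                        // Only plain CPU-side data (Vec<f32>) leaves the thread.
                        let out = slab
                            .get(&handle.0)
                            .map(|m| input.iter().map(|x| x + m.bias).collect::<Vec<f32>>())
                            .ok_or_else(|| format!("unknown handle {}", handle.0));
                        let _ = reply.send(out);
                    }
                    Job::Unload { handle, reply } => {
                        let _ = reply.send(slab.remove(&handle.0).is_some());
                    }
                }
            }
        })
        .expect("failed to spawn device worker thread");
    tx
}
```

A caller sends a `Job` and blocks on the matching reply receiver. Because the slab never leaves the worker, every model drop happens on the thread that created it.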
### candle-native (not mistral.rs)
neuron builds directly on [candle](https://github.com/huggingface/candle). Every model architecture it serves is implemented in this repository, ported against the HuggingFace reference. No external inference server to babysit. The Harness trait remains as an internal seam for adding future engines (vision/audio/diffusion) but its only implementation is in-process candle.
### Streaming proxy
Chat completions are proxied as SSE streams. The gateway must:
1. Parse the inbound request to extract the model name
2. Route to the correct backend neuron
3. Stream the response back, capturing token timing for metrics
4. NOT buffer the full response — true streaming passthrough
### Anthropic translation
When a request arrives at `/v1/messages` (Anthropic format), the gateway translates it to OpenAI format before proxying to neuron, then translates the response back. This is a stateless envelope transformation. The non-streaming round-trip is implemented; streaming SSE translation is deferred.
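As a rough illustration of what a stateless envelope transformation involves, the sketch below covers two of the differences between the public API shapes: Anthropic carries the system prompt in a top-level `system` field while OpenAI expects it as a leading `system` message, and the stop-reason vocabularies differ. The struct and function names are hypothetical, text-only simplifications; the real types in `cortex-core` (`openai.rs`, `anthropic.rs`, `translate.rs`) may look quite different.

```rust
/// A single text-only chat turn.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Hypothetical, trimmed-down Anthropic `/v1/messages` request.
#[derive(Debug, Clone, PartialEq)]
pub struct AnthropicRequest {
    pub model: String,
    pub system: Option<String>,
    pub messages: Vec<Message>,
    pub max_tokens: u32,
}

/// Hypothetical, trimmed-down OpenAI chat-completions request.
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub max_tokens: u32,
}

/// Request direction: hoist the top-level system prompt into a leading message.
pub fn to_openai(req: &AnthropicRequest) -> OpenAiRequest {
    let mut messages = Vec::with_capacity(req.messages.len() + 1);
    if let Some(system) = &req.system {
        messages.push(Message {
            role: "system".to_string(),
            content: system.clone(),
        });
    }
    messages.extend(req.messages.iter().cloned());
    OpenAiRequest {
        model: req.model.clone(),
        messages,
        max_tokens: req.max_tokens,
    }
}

/// Response direction: map an OpenAI `finish_reason` to an Anthropic `stop_reason`.
pub fn to_anthropic_stop_reason(finish_reason: &str) -> &'static str {
    match finish_reason {
        "length" => "max_tokens",
        "tool_calls" => "tool_use",
        // "stop" and anything unrecognized are treated as a normal end of turn.
        _ => "end_turn",
    }
}
```

Both functions are pure: the output depends only on the input envelope, so no per-conversation state has to be kept in the gateway.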
### Eviction
The evictor runs as a background task. Before loading a model on a node where VRAM is tight:
1. Check if the model is already loaded elsewhere → route there instead
2. Find the LRU model on the target node (excluding pinned models)
3. Call `POST {neuron}/models/unload` on that model
4. The incoming request's lazy-load triggers the new model load
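The first two steps above can be sketched as a pure decision function. This is an illustrative sketch under stated assumptions: `LoadedModel`, `Node`, `Plan`, and the `last_used` tick are hypothetical names, and the caller is assumed to have already decided that VRAM on the target node is tight. The actual unload call and the lazy load (steps 3 and 4) are left out.

```rust
/// Hypothetical view of one model loaded on a node.
#[derive(Debug, Clone, PartialEq)]
pub struct LoadedModel {
    pub id: String,
    /// Monotonic tick of the last request served; smaller means older.
    pub last_used: u64,
    pub pinned: bool,
}

/// Hypothetical view of one neuron and its loaded models.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub name: String,
    pub models: Vec<LoadedModel>,
}

/// Outcome of the eviction decision.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// The model is already loaded on another node: route there instead.
    RouteTo(String),
    /// Unload this model id on the target node to make room.
    Evict(String),
    /// Every model on the target is pinned, or the target holds no models.
    NothingToEvict,
}

/// Least-recently-used model, skipping pinned ones.
pub fn pick_lru_victim(loaded: &[LoadedModel]) -> Option<&LoadedModel> {
    loaded.iter().filter(|m| !m.pinned).min_by_key(|m| m.last_used)
}

/// Decide what to do before loading `model_id` onto `target`.
pub fn plan(model_id: &str, target: &str, fleet: &[Node]) -> Plan {
    // Step 1: prefer a node that already has the model loaded.
    if let Some(node) = fleet
        .iter()
        .find(|n| n.name != target && n.models.iter().any(|m| m.id == model_id))
    {
        return Plan::RouteTo(node.name.clone());
    }
    // Step 2: otherwise pick the LRU unpinned model on the target node.
    let victim = fleet
        .iter()
        .find(|n| n.name == target)
        .and_then(|n| pick_lru_victim(&n.models));
    match victim {
        Some(m) => Plan::Evict(m.id.clone()),
        None => Plan::NothingToEvict,
    }
}
```

Keeping the decision separate from the HTTP calls makes it straightforward to unit-test pinning and ordering without a mock neuron.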
### Metrics
Per-request: model, node, prompt_tokens, completion_tokens, total_tokens, tok_per_sec, time_to_first_token_ms, total_latency_ms. Exposed as Prometheus histograms/counters on a separate port (31314).
## Tech Stack
- **Rust 2024 edition** — workspace with 6 crates
- **Axum 0.8** — HTTP framework
- **reqwest** — HTTP client for proxying to backends
- **figment** — config loading (TOML + env vars)
- **tokio** — async runtime
- **metrics + metrics-exporter-prometheus** — observability
- **tracing** — structured logging
- **candle** — in-process inference engine (neuron only, with CUDA support)
- **cudarc** — patched for neuron's needs (see workspace `[patch]`)
- **clap** — CLI parsing
- **rusqlite** (bundled) — helexa-bench SQLite system-of-record
## Build Commands
```sh
cargo build --release # build all crates
cargo run -p cortex-cli -- serve # run the gateway
cargo test # run all tests
cargo clippy --workspace # lint
```
### neuron Features
- `cuda`: Enables CUDA acceleration in candle and cudarc/nccl bindings. Without it, falls back to CPU.
- `cudnn`: Use cuDNN for convolution/attention kernels (requires `cuda`).
- `flash-attn`: FlashAttention kernels (requires `cuda`).
- `cuda-integration`: Reserved for GPU-only integration tests (requires multiple CUDA devices + libnccl).
### Build Scripts
- `neuron/build.rs`: Compiles CUDA kernels (`src/cuda/*.cu`) using `cudaforge::KernelBuilder` when the `cuda` feature is enabled. Handles compute capability checks (targets below sm_80 disable bf16 intrinsics). Also captures build metadata: git SHA, dirty flag, timestamp, rustc version, profile, features, candle-core version.
## CI
Gitea Actions runs on every push to any branch. All three checks must pass before merging:
```sh
cargo fmt --check --all # formatting
cargo clippy --workspace -- -D warnings # lint (warnings are errors)
cargo test --workspace # tests
```
Run these locally before pushing. `cargo fmt --all` fixes formatting automatically. Clippy warnings must be resolved, not suppressed with `#[allow(...)]` unless there is a clear rationale.
Tagged releases (`v*`) build SRPMs for `cortex`, `helexa-neuron`, and `helexa-bench` and publish to COPR (`helexa/helexa`). Build metadata SHA injection: CI sets `HELEXA_BUILD_SHA=$(git rev-parse HEAD)`.
## Environment
- Targets Fedora 43 (systemd, SELinux enforcing)
- Nodes communicate over a private network (e.g. WireGuard mesh)
- cortex listens on port 31313 (API) and 31314 (metrics)
- neuron listens on port 13131 on each GPU host
- TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
## Conventions
- Error handling: `anyhow` for binaries, `thiserror` for library crates
- No `unwrap()` in library code; `expect()` only with clear rationale
- All public types derive `Debug, Clone, Serialize, Deserialize` where sensible
- Config structs use `figment` with TOML as primary source, env vars as override
- Prefer `Arc<RwLock<...>>` for shared fleet state; minimize lock duration
- SSE streaming uses `tokio_stream` + `eventsource-stream` for parsing
- Log at `info` for request routing, `debug` for proxy details, `warn` for eviction and node health, `error` for proxy failures
## Testing
### Gateway tests
Use mock neurons spawned via axum in `crates/cortex-gateway/tests/common/mod.rs`. Helpers: `spawn_mock_backend()`, `spawn_gateway()`.
### neuron integration tests
- Numerical reference tests (`numerical_reference.rs`) require `NEURON_REF_MODEL_PATH` env var pointing to a HF snapshot directory. Fixtures are f32-based for precision validation against HuggingFace transformers.
- CUDA integration tests (`tp_worker_lifecycle_cuda.rs`) are gated behind the `cuda-integration` feature and require 2+ CUDA devices (e.g., 2x RTX 5090).
### Metrics testing
Use `install_test_recorder()` in test code to capture metrics without the HTTP listener.
## helexa-bench
A continuous, version-aware benchmark harness. Hits each neuron directly on `:13131`, exercises each warm model with a Scenario suite (chat-latency family), and records results into SQLite stamped with the neuron's full `BuildInfo`. The loop is version-aware: skips any (target, build SHA, model, scenario) cell already at `samples_per_version`.
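The version-aware skip can be pictured as a counter keyed by the (target, build SHA, model, scenario) cell. The sketch below is an in-memory illustration only: `Cell` and `SampleLedger` are hypothetical names, and the real harness keeps its records in SQLite rather than a `HashMap`.

```rust
use std::collections::HashMap;

/// One benchmark cell: the key the loop uses to decide whether to run again.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Cell {
    pub target: String,
    pub build_sha: String,
    pub model: String,
    pub scenario: String,
}

/// Hypothetical in-memory stand-in for the SQLite sample counts.
pub struct SampleLedger {
    counts: HashMap<Cell, u32>,
    samples_per_version: u32,
}

impl SampleLedger {
    pub fn new(samples_per_version: u32) -> Self {
        Self {
            counts: HashMap::new(),
            samples_per_version,
        }
    }

    /// True while the cell has fewer samples than the configured quota.
    pub fn needs_sample(&self, cell: &Cell) -> bool {
        self.counts.get(cell).copied().unwrap_or(0) < self.samples_per_version
    }

    /// Record one completed sample for the cell.
    pub fn record(&mut self, cell: Cell) {
        *self.counts.entry(cell).or_insert(0) += 1;
    }
}
```

Because the build SHA is part of the key, a newly deployed neuron build starts from zero samples and is benchmarked again, while an unchanged build is skipped once its quota is met.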
Packaged as `helexa-bench` RPM (prebuilt-binary spec). One systemd unit, typically on the metrics host.
## helexa-acp
Agent Client Protocol bridge — connects ACP editors (Zed, etc.) to any OpenAI-compatible endpoint, cortex by default. Intentionally self-contained: no workspace crate dependencies. Uses `agent-client-protocol` with `unstable_session_model` feature for Zed model picker support. Licensed Apache-2.0 (workspace is GPL-3.0).
## RPM Packaging
- `cortex.spec` — installs the `cortex` binary
- `helexa-neuron.spec` — installs the `neuron` binary under package name `helexa-neuron` (renamed to avoid Fedora's NEURON neural-simulation package collision)
- Systemd units in `data/cortex.service`, `data/neuron.service`
- Example configs: `cortex.example.toml`, `neuron.example.toml`, `models.example.toml`
Install:
```sh
dnf copr enable helexa/helexa
dnf install cortex # gateway host
dnf install helexa-neuron # GPU nodes
```
## Configuration Files
### cortex.toml (gateway)
```toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
```
### models.toml (catalogue)
```toml
[[models]]
id = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 19000
min_devices = 2
min_device_vram_mb = 10000
pinned_on = ["beast"] # optional: never evict from these neurons
```
### neuron.toml (per-host)
Configured via figment + env override. See `neuron.example.toml` for reference.
## neuron API Endpoints
```
GET /discovery → hardware discovery (hostname, OS, CUDA, devices, harnesses)
GET /health → runtime GPU stats (VRAM, utilization, temperature)
GET /models → loaded/unloaded models with VRAM usage
POST /models/load → load a model with spec (quant, TP, devices)
POST /models/unload → unload a model, freeing device memory
GET /models/{id}/endpoint → inference URL for a model
GET /version → build metadata (SHA, features, candle version, etc.)
```
## Sources of Truth
When prose documentation conflicts with code, trust:
1. Executable configuration (`*.toml`, `Cargo.toml` features)
2. Type definitions in `cortex-core/`
3. Test files in `crates/*/tests/` and `*/src/**/*_test.rs`
4. `CLAUDE.md` for historical design rationale

272
CLAUDE.md

@@ -1,16 +1,26 @@
# CLAUDE.md — cortex
# CLAUDE.md — helexa
## Project overview
cortex is a Rust reverse-proxy that sits in front of multiple
mistral.rs inference nodes and presents a unified OpenAI + Anthropic
compatible API surface. It handles model routing, lifecycle management
(load/unload/evict), request translation, and metrics collection.
helexa is a self-hosted LLM serving stack for multi-node GPU inference
clusters. It has two components:
- **cortex** — the per-operator control plane and LLM proxy. A Rust
reverse-proxy that sits in front of the fleet and presents a unified
OpenAI + Anthropic compatible API surface. It handles model routing,
lifecycle management (load/unload/evict), request translation, and
metrics collection.
- **neuron** — the per-host LLM harness. One instance runs on every GPU
host, serving candle-based in-process inference and managing local
hardware discovery and model lifecycle.
(Historical note: cortex originally proxied to mistral.rs nodes; neuron
replaced that — see the 2026-05-18 candle-native addendum below.)
## Repository layout
```
cortex/
helexa/
├── Cargo.toml # workspace root
├── cortex.toml # example gateway config
├── README.md
@@ -84,6 +94,63 @@ Per-request: model, node, prompt_tokens, completion_tokens, total_tokens,
tok_per_sec, time_to_first_token_ms, total_latency_ms.
Exposed as Prometheus histograms/counters on a separate port.
### Per-device worker thread (neuron)
The neuron daemon dedicates one OS thread per CUDA device it loads
onto. That thread binds the device's `CudaContext` once at startup and
owns it for the daemon's lifetime; every model load, forward step,
KV-cache reset, VRAM query, NCCL init/sanity, NCCL all_reduce, and
model drop on that device routes through this thread via a
`std::sync::mpsc` job channel. Replies cross back via
`tokio::sync::oneshot`.
Three properties this gives us, in order of weight:
1. **Context locality.** cudarc binds the CUDA context per OS thread
via `cuCtxSetCurrent`. Before this refactor, ad-hoc
`tokio::task::spawn_blocking` calls bound the context onto a
different thread per request — and `device_vram_mb()` from an
async task bound it onto whichever tokio worker happened to be
running. Pinning the context to one named thread ends that.
2. **Drop safety.** Every `CudaSlice` in a `Tensor`, every
`cudarc::nccl::Comm`, and the `CudaContext` itself call `cuMemFree` /
`ncclCommDestroy` / `cuCtxDestroy` during `Drop` — and require the
right context current. With the worker owning the model slab,
`Drop` always runs on the right thread. The cudarc Drop constraint
is structurally enforced.
3. **Poisoning blast radius.** When a CUDA driver error makes the
context unrecoverable, the poison flag lives on the
`DeviceWorkerHandle` itself. Subsequent `submit()` calls fast-reject
at the channel boundary with a clear "device worker is poisoned"
error before any further CUDA work is attempted. The thread doesn't
exit (dropping the slab would re-touch the broken context) — it
enters a drain-only mode and replies error to everything until the
daemon restarts.
Tensors never escape the worker thread alive. Inference replies carry
`Vec<f32>` CPU-side logits; the async caller wraps them in a CPU
candle tensor and runs `apply_repeat_penalty` + `LogitsProcessor::sample`
without ever rebinding the device context. Sampled tokens come back as
`u32`; VRAM queries as `(u64, u64)`. The opaque `ArchHandle(u64)` and
`TpHandle(u64)` are the only "references" callers hold to loaded
models — they're indices into the worker's state slab, not pointers.
The TP worker subprocesses in `harness/tp/worker.rs` are the same
pattern out-of-process — a dedicated context-owning process per
non-zero NCCL rank. The in-process worker in `harness/device_worker/`
brings the discipline to rank 0.
CPU loads (`Device::Cpu` fallback when CUDA is unavailable) keep the
legacy `tokio::task::spawn_blocking + Arc<Mutex<ModelArch>>` path —
there's no context to own and the channel hop would only add latency.
Four `spawn_blocking` references in `harness/candle.rs` are deliberate
CPU fallback.
Canonical narrative lives in
`crates/neuron/src/harness/device_worker/mod.rs`'s module
doc-comment; touch points (the `Job` enum, the dispatch handlers, the
`DeviceWorkerState` struct) are in the sibling `jobs.rs` and
`dispatch.rs`.
## Tech stack
- **Rust 2024 edition** — workspace with 4 crates
@@ -125,7 +192,8 @@ automatically. Clippy warnings must be resolved, not suppressed with
- One or more GPU nodes running mistral.rs on port 8080
- Optionally a metrics-only node (no GPU) for Prometheus/Grafana
- Each node runs `mistralrs serve` on port 8080
- Gateway listens on port 8000 (API) and 9100 (metrics)
- Gateway listens on port 31313 (API) and 31314 (metrics)
- neuron listens on port 13131 on each GPU host
- TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
## Conventions
@@ -380,7 +448,7 @@ processes (one process per loaded model, each on its own port).
## neuron API
neuron exposes an HTTP API on port 9090 that cortex polls and calls.
neuron exposes an HTTP API on port 13131 that cortex polls and calls.
```
GET /discovery
@@ -424,8 +492,8 @@ endpoint. cortex.toml shrinks to:
```toml
[gateway]
listen = "0.0.0.0:8000"
metrics_listen = "0.0.0.0:9100"
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru"
@@ -433,15 +501,15 @@ defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.hanzalova.internal:9090"
endpoint = "http://beast.hanzalova.internal:13131"
[[neurons]]
name = "benjy"
endpoint = "http://benjy.kosherinata.internal:9090"
endpoint = "http://benjy.hanzalova.internal:13131"
[[neurons]]
name = "quadbrat"
endpoint = "http://quadbrat.hanzalova.internal:9090"
endpoint = "http://quadbrat.hanzalova.internal:13131"
```
On startup and periodically, cortex calls `GET /discovery` and
@@ -490,7 +558,7 @@ and the hardcoded `vram_mb` per node.
## Revised repository layout
```
cortex/
helexa/
├── Cargo.toml
├── cortex.toml # gateway config (neurons only)
├── models.toml # model catalogue
@@ -521,7 +589,7 @@ cortex/
│ │ └── metrics.rs # prometheus exporter (unchanged)
│ ├── neuron/ # node plane (replaces cortex-agent)
│ │ └── src/
│ │ ├── main.rs # binary entrypoint, axum server on :9090
│ │ ├── main.rs # binary entrypoint, axum server on :13131
│ │ ├── discovery.rs # nvidia-smi, device enumeration
│ │ ├── health.rs # runtime GPU polling
│ │ ├── api.rs # HTTP handlers for /discovery, /models, etc.
@@ -595,70 +663,140 @@ placement matching can be added incrementally.
Completed. Both packages have RPM specs, systemd units, and example configs.
CI builds parallel SRPMs on tag push and publishes to separate COPR repos.
- `cortex.spec` → `helexa/cortex` COPR: binary, systemd unit, config files
- `neuron.spec` → `helexa/neuron` COPR: binary, systemd unit, config
- `cortex.spec` — installs the `cortex` binary. Package name keeps the
short `cortex` because no Fedora package collides with it.
- `helexa-neuron.spec` — installs the `neuron` binary under package name
`helexa-neuron`. Renamed from bare `neuron` to avoid collision with
Fedora's NEURON neural-simulation package
(https://src.fedoraproject.org/rpms/neuron); binary, systemd unit,
system user, and config dir all stay named `neuron` since those are
project-local contexts.
- `data/cortex.service`, `data/neuron.service` — systemd units
- `cortex.example.toml`, `neuron.example.toml`, `models.example.toml`
- CI: parallel `srpm-cortex` + `srpm-neuron` jobs, then parallel COPR publish
- CI: parallel `srpm-cortex` + `srpm-neuron` jobs, then parallel COPR
publish to a single project `helexa/helexa` hosting both packages.
Install:
```sh
dnf copr enable helexa/cortex && dnf install cortex # gateway host
dnf copr enable helexa/neuron && dnf install neuron # GPU nodes
dnf copr enable helexa/helexa
dnf install cortex # gateway host
dnf install helexa-neuron # GPU nodes
```
### Phase 11: llama.cpp harness stub
## 2026-05-18 addendum: candle-native pivot
**Goal:** Prove the harness abstraction works with a second engine.
Phases 11 (llama.cpp harness) and 12 (mistral.rs COPR) below are
**superseded**. The project no longer treats mistral.rs or llama.cpp as
dependencies — both are conceptually out of scope. neuron becomes a
candle-native inference daemon, with `Harness` retained as an
internal seam for adding future engines (vision/audio/diffusion) but
its only implementation being in-process candle.
**Steps:**
1. `crates/neuron/src/harness/llamacpp.rs` — implement the `Harness`
trait for llama.cpp's `llama-server`.
- `start()` — launch `llama-server` with the correct model path,
`--port`, `--n-gpu-layers`, `--tensor-split` args. Track the
child process.
- `stop()` — send SIGTERM to the child process.
- `list_models()` — llama-server serves one model per process, so
return a single-element list.
- `load_model()` — start a new llama-server process for this model.
- `unload_model()` — stop the process.
- `inference_endpoint()` — return `http://localhost:{assigned_port}`.
2. Port allocation: neuron assigns ports from a range (e.g. 8100-8199)
to llama-server instances.
3. Register in `HarnessRegistry` when configured:
```toml
[[harnesses]]
name = "llamacpp"
binary = "/usr/local/bin/llama-server"
port_range = [8100, 8199]
```
4. Tests: mock llama-server (simple HTTP server returning canned
responses), test load/unload/endpoint lifecycle.
The full staged plan for this pivot lives at
`~/.claude/plans/create-a-more-aggressive-calm-naur.md`. Summary:
**Done when:** A model with `harness = "llamacpp"` in `models.toml` can
be loaded and served through cortex. Tests pass with mock llama-server.
- **Stage 1 (this commit):** delete `mistralrs.rs` and `llamacpp.rs`,
scaffold inert `CandleHarness`, drop `endpoint`/`systemd_unit` from
`HarnessConfig`, default no-op `start`/`stop` on the `Harness` trait.
- **Stages 2-4:** wire up candle model load/unload (quantized Qwen3
first), add OpenAI-compatible inference endpoint in neuron, then SSE
streaming.
- **Stages 5-6:** load-on-activation (default models in config) and
unload-on-deactivation (graceful shutdown).
- **Stages 7-8:** multi-GPU tensor parallelism and broader model/quant
coverage.
### Phase 12 (lower priority): mistral.rs COPR packaging
Sections of this document that describe mistral.rs HTTP behaviour
("mistral.rs API gotchas") are retained as historical context for
Phases 1-10: they document what was true while the project depended
on mistral.rs. They do not describe current behaviour.
**Goal:** Fedora RPMs for mistral.rs built against specific CUDA versions.
---
**Steps:**
1. `mistralrs-cuda.spec` — RPM spec that clones a pinned mistral.rs git
tag, builds with `--features cuda`, links against the system CUDA
toolkit. Produces `mistralrs-cuda13-server` (CUDA 13.x / sm_120) and
`mistralrs-cuda12-server` (CUDA 12.x / sm_89). Install binary to
`/usr/local/bin/mistralrs`.
2. COPR build config: enable the NVIDIA CUDA repo as a build dependency.
Pin the CUDA toolkit version in `BuildRequires`.
3. Gitea Actions or manual workflow: bump the mistral.rs tag in the spec,
trigger COPR rebuild.
4. neuron's mistralrs harness config references which binary/package
provides the mistral.rs binary. neuron could warn at startup if the
installed mistral.rs CUDA version doesn't match the discovered driver.
### Phase 11 (superseded): llama.cpp harness stub
**Done when:** `dnf install mistralrs-cuda13-server` on beast provides a
working `mistralrs` binary built for Blackwell GPUs. `dnf install
mistralrs-cuda12-server` on benjy provides one built for Ada GPUs.
~~Originally planned as a second engine to prove the harness
abstraction.~~ Replaced by the candle harness work in the 2026-05-18
addendum above. llama.cpp's any-model/any-hardware breadth is no
longer in scope for helexa.
This is a separate repo/spec — not part of the cortex workspace — but
tightly coupled operationally. Track it as a sibling project.
### Phase 12 (superseded): mistral.rs COPR packaging
~~Originally planned to ship CUDA-versioned mistral.rs RPMs.~~ Replaced
by the candle harness work in the 2026-05-18 addendum above. With
mistral.rs out of the dependency tree, there is nothing to package.
## 2026-05-27 addendum: per-device worker thread
Replaced the ad-hoc `tokio::task::spawn_blocking` pattern that drove
every leader-side CUDA op with one dedicated OS thread per CUDA device,
permanently bound to that device's `CudaContext`. All leader-side
inference work (GGUF + dense + TP shard load, forward, kv-cache clear,
NCCL init/sanity, NCCL all_reduce, VRAM query, model drop) routes
through the worker via a `std::sync::mpsc` channel; tensors never
escape the worker thread alive. See "Per-device worker thread (neuron)"
above and `crates/neuron/src/harness/device_worker/mod.rs` for the
canonical narrative.
Motivated by the 2026-05-26 silent-hang on beast: a CUDA OOM cascade
poisoned the device context on whichever spawn_blocking thread caught
it, and subsequent requests stalled invisibly on the pool lock. After
the refactor, the same failure mode shows up in journalctl as
`prefill sample failed; logits unhealthy nan: 248320/248320` followed
by `failed, model marked poisoned`. The thread stays alive and rejects
subsequent requests at the channel boundary.
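
The shape of that pattern can be sketched with the standard library alone. This is an illustrative sketch, not the code in `device_worker/`: the `Job` variants, the `vram` and `inject_failure` helpers, the canned `(1024, 4096)` reply, and the use of `std::sync::mpsc` for replies (the real worker replies over `tokio::sync::oneshot`) are all assumptions made so the example compiles without CUDA or tokio.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Sender};
use std::sync::Arc;
use std::thread;

const POISONED: &str = "device worker is poisoned";

/// Reply channel carried inside every job (hypothetical; stands in for a oneshot).
pub type Reply = Sender<Result<(u64, u64), String>>;

/// Hypothetical job set: one healthy op and one that simulates a driver error.
pub enum Job {
    VramQuery(Reply),
    InjectFailure(Reply),
}

pub struct DeviceWorkerHandle {
    jobs: Sender<Job>,
    poisoned: Arc<AtomicBool>,
}

impl DeviceWorkerHandle {
    /// Spawn one named OS thread that owns all work for `device`.
    pub fn spawn(device: usize) -> Self {
        let (jobs, inbox) = mpsc::channel::<Job>();
        let poisoned = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&poisoned);
        thread::Builder::new()
            .name(format!("cuda-dev-{device}"))
            .spawn(move || {
                // A real worker would bind the device context here, exactly once.
                for job in inbox {
                    let (reply, result) = match job {
                        // Drain-only mode: keep the thread alive, answer with errors.
                        Job::VramQuery(reply) if flag.load(Ordering::SeqCst) => {
                            (reply, Err(POISONED.to_string()))
                        }
                        Job::VramQuery(reply) => (reply, Ok((1024, 4096))),
                        Job::InjectFailure(reply) => {
                            flag.store(true, Ordering::SeqCst);
                            (reply, Err("simulated driver error".to_string()))
                        }
                    };
                    let _ = reply.send(result);
                }
            })
            .expect("failed to spawn device worker thread");
        Self { jobs, poisoned }
    }

    /// Fast-reject at the channel boundary once the worker is poisoned.
    fn request(&self, make: fn(Reply) -> Job) -> Result<(u64, u64), String> {
        if self.poisoned.load(Ordering::SeqCst) {
            return Err(POISONED.to_string());
        }
        let (reply, answer) = mpsc::channel();
        self.jobs
            .send(make(reply))
            .map_err(|_| "device worker is gone".to_string())?;
        answer
            .recv()
            .map_err(|_| "device worker dropped the reply".to_string())?
    }

    pub fn vram(&self) -> Result<(u64, u64), String> {
        self.request(Job::VramQuery)
    }

    pub fn inject_failure(&self) -> Result<(u64, u64), String> {
        self.request(Job::InjectFailure)
    }
}
```

In this sketch the poison flag is stored before the failing job's reply is sent, so by the time a caller has observed the error, its next request is already rejected on the caller's side without touching the channel.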
Landed in four PRs:
- **Phase 1** (`081b532`) — device_worker module + 8 VRAM-query sites
route through the worker. CPU build only; smoke on beast confirmed
a persistent `cuda-dev-0` thread.
- **Phase 2** (`b179204`) — single-GPU forward + clear_kv + drop via
the worker. `LoadedModel.arch_handle: Option<ArchHandle>` replaces
`Arc<Mutex<ModelArch>>` for CUDA loads. CPU keeps the legacy path.
- **Phase 3** (`76ab24d`) — TP forward + NCCL init/sanity + leader
KV-clear routed through the worker. `WorkerPool.leader_nccl` moves
into the worker's state. `TpLoadedModel.leader_handle: TpHandle`
replaces `Arc<Mutex<TpLeaderModel>>`. CUDA-only TP smoke deferred to
next deploy.
- **Phase 4** (`b4f3576`) — GGUF + dense + TP shard loads move onto
the worker. The `Job::TransferIn` / `Job::CloneLeaderComm` bridges
from Phases 2/3 are deleted; the `SendComm` newtype is no longer needed in the
load path. `grep -rn spawn_blocking crates/neuron/src/harness/`
returns only deliberate CPU-fallback hits after this PR.
## 2026-06-13 addendum: build metadata + helexa-bench
Two coupled additions so fleet performance can be tracked automatically
across neuron updates instead of by hand-running `script/bench.py` and
editing `doc/benchmarks.md`.
**neuron build metadata + `GET /version`.** neuron's `build.rs` now also
captures build identity (`HELEXA_GIT_SHA` — preferring a CI/RPM-injected
`HELEXA_BUILD_SHA`, falling back to git, else `unknown` — plus dirty
flag, build timestamp, rustc version, profile, enabled cargo features,
and a best-effort `candle-core` version from `Cargo.lock`). These are
exposed as `cortex_core::build_info::BuildInfo` (new module) from a new
`GET /version` endpoint (`neuron/src/version.rs`, wired in `api.rs`) and
in clap's `--version` long form. The SHA is injected in CI
(`build-prerelease.yml` build-neuron step: `export HELEXA_BUILD_SHA=$(git
rev-parse HEAD)`) and via `--define helexa_commit` in the source-build
spec, so tarball-built RPMs report the real SHA. `/version` is now the
canonical "which build is live" probe (supersedes the per-host RPM-sha
check in the fleet-validation flow).
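
The SHA fallback order described above (injected value, then git, then `unknown`) might look roughly like the following in a build script. This is a sketch under assumptions: the function names are hypothetical, and treating an empty injected value as absent is a guess, not something the source states. Only the `HELEXA_BUILD_SHA` and `HELEXA_GIT_SHA` names come from the text.

```rust
use std::process::Command;

/// Pick the build SHA: injected value first, then git, else "unknown".
/// Empty or whitespace-only values are treated as absent (an assumption).
pub fn resolve_sha(injected: Option<&str>, from_git: Option<&str>) -> String {
    injected
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .or_else(|| from_git.map(str::trim).filter(|s| !s.is_empty()))
        .unwrap_or("unknown")
        .to_string()
}

/// Ask git for the current commit; None when git is missing or this is a
/// tarball build with no repository.
pub fn git_head() -> Option<String> {
    let out = Command::new("git")
        .args(["rev-parse", "HEAD"])
        .output()
        .ok()?;
    if !out.status.success() {
        return None;
    }
    String::from_utf8(out.stdout).ok()
}

/// What a build.rs `main` could call: export the result to the crate as a
/// compile-time environment variable.
pub fn emit_build_sha() {
    let injected = std::env::var("HELEXA_BUILD_SHA").ok();
    let git = git_head();
    let sha = resolve_sha(injected.as_deref(), git.as_deref());
    println!("cargo:rustc-env=HELEXA_GIT_SHA={sha}");
}
```

Keeping the precedence logic in a pure function separate from the git call makes the fallback order testable without a repository present.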
**`crates/helexa-bench`** — a new binary: a continuous, version-aware
benchmark harness (one systemd unit, typically on the metrics host). It
hits each neuron **directly** on `:13131`, exercises each **warm**
(`status == "loaded"`) model with an extensible `Scenario` suite (phase
1: the chat-latency family ported verbatim from `bench.py` — synthetic
128/4096-tok prompts, `/no_think`, streamed TTFT + decode-window
tok/s), and records each run into a SQLite system-of-record stamped with
the neuron's full `BuildInfo`. The loop is **version-aware**: it skips
any (target, build SHA, model, scenario) cell already at
`samples_per_version`, so a steady fleet costs only cheap `/version` +
`/models` polls until a new SHA ships. `helexa-bench report` regenerates
the `benchmarks.md`-style table from the DB. `kind = "openai"` targets
(mistral.rs/llama.cpp comparison) are scaffolded but not yet wired.
Packaged as the `helexa-bench` RPM (prebuilt-binary spec, outbound-only
so no firewalld service) via the same `build-prerelease.yml` pipeline.
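
The version-aware skip can be illustrated with an in-memory ledger. This is a sketch only: helexa-bench records runs in SQLite, and the `Cell` and `SampleLedger` types below are hypothetical stand-ins for whatever the real store queries.

```rust
use std::collections::HashMap;

/// One benchmark cell: the (target, build SHA, model, scenario) tuple.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Cell {
    pub target: String,
    pub build_sha: String,
    pub model: String,
    pub scenario: String,
}

/// In-memory stand-in for the sample counts the real store would hold.
pub struct SampleLedger {
    counts: HashMap<Cell, u32>,
    samples_per_version: u32,
}

impl SampleLedger {
    pub fn new(samples_per_version: u32) -> Self {
        Self { counts: HashMap::new(), samples_per_version }
    }

    /// True while the cell has fewer samples than the configured quota.
    pub fn needs_run(&self, cell: &Cell) -> bool {
        self.counts.get(cell).copied().unwrap_or(0) < self.samples_per_version
    }

    /// Count one completed run against the cell.
    pub fn record(&mut self, cell: Cell) {
        *self.counts.entry(cell).or_insert(0) += 1;
    }
}
```

Because the build SHA is part of the key, a new neuron build produces fresh cells with zero samples, which is what makes a sweep resume measuring only after a deploy.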

2630
Cargo.lock generated

File diff suppressed because it is too large


@@ -5,13 +5,15 @@ members = [
"crates/cortex-gateway",
"crates/cortex-cli",
"crates/neuron",
"crates/helexa-acp",
"crates/helexa-bench",
]
[workspace.package]
version = "0.1.2"
version = "0.1.16"
edition = "2024"
license = "GPL-3.0-or-later"
repository = "https://git.lair.cafe/helexa/cortex"
repository = "https://git.lair.cafe/helexa/helexa"
[workspace.dependencies]
# async runtime
@@ -27,7 +29,7 @@ serde = { version = "1", features = ["derive"] }
serde_json = "1"
toml = "0.8"
# http client (for proxying to mistralrs backends)
# http client (for proxying to neuron backends)
reqwest = { version = "0.12", features = ["json", "stream"] }
# observability
@@ -60,3 +62,12 @@ eventsource-stream = "0.2"
# workspace crates
cortex-core = { path = "crates/cortex-core" }
cortex-gateway = { path = "crates/cortex-gateway" }
# Patched cudarc (affects neuron's 0.19.x only; candle's 0.17.x is
# untouched since the fork is 0.19.7 and doesn't satisfy a 0.17 req). Adds
# Comm::abort / get_async_error / raw comm() — needed for #17 Stage 2 TP
# hang-recovery (abort a wedged collective from another thread, then
# rebuild the comm). Pinned to a fork revision pending upstream review
# (grenade/cudarc @ nccl-comm-abort).
[patch.crates-io]
cudarc = { git = "https://github.com/grenade/cudarc", rev = "63327a256059f8252641ae46c6bb9eefe707f382" }

227
README.md

@@ -1,24 +1,68 @@
# cortex
# helexa
A Rust reverse-proxy and fleet management layer for multi-node
[mistral.rs](https://github.com/EricLBuehler/mistral.rs) inference clusters.
**Near-frontier AI for mortals.**
## Problem
helexa is a self-hosted LLM serving stack, written in Rust, for people
who run open-weight models on their own consumer GPUs. It has two
components:
Running local LLMs across multiple GPU nodes (different VRAM tiers, different
model affinities) requires a unified API surface that:
- **cortex** — the per-operator control plane and LLM proxy. It sits in
front of your GPU fleet and presents a unified OpenAI + Anthropic
compatible API surface, handling model routing, lifecycle management
(load / unload / evict), request translation, and metrics.
- **neuron** — the per-host LLM harness. One instance runs on every GPU
host, serving candle-based in-process inference and managing local
hardware discovery and model lifecycle.
- Presents a **single `/v1/models` catalogue** merging every model across every
node.
- **Routes requests** to the correct node based on where a model is loaded (or
*can* be loaded).
- Manages **model lifecycle** — unload cold models, reload on demand, pin
critical ones — using the mistral.rs
`/v1/models/{unload,reload,status}` HTTP API (PR #1828+).
- Translates between **OpenAI and Anthropic** request/response envelopes so
every client in the homelab speaks whichever dialect it prefers.
- Captures **per-request metrics** (tokens, tok/s, TTFT, latency) and exposes
them as Prometheus counters/histograms.
## Why
Two principles constrain everything in this repository:
1. **Frontier or close to it.** helexa serves the open-weight models
that get nearest to frontier capability — not every architecture
ever published.
2. **Consumer hardware.** Everything must run on the cards mortals can
actually buy: a 3060 here, a 4090 there, a 5090 if you got lucky.
Mixed VRAM tiers across mismatched boxes are the expected topology,
not a degraded case.
GPU acquisition is harder than it was a year ago, and the gap between
what cloud providers charge and what your own silicon costs keeps
widening. The intersection of those two principles — near-frontier
models, squeezed onto hardware you own — is helexa's entire niche.
The secondary objective is **predictable consumption**. If you own the
hardware, your tooling shouldn't break because a cloud provider changed
billing, deprecated a model, or reshaped an API. cortex's OpenAI and
Anthropic surfaces are a stability contract: point your editor, agent,
or CLI at it once, and it keeps working.
## What helexa is not
This is an intentionally different path from vLLM, SGLang, and peers —
not a smaller version of them. Out of scope, permanently:
- Any-model breadth. Architectures are ported because they're at or
near the frontier, not to complete a compatibility matrix.
- Datacenter-class scheduling. No sophisticated continuous-batching /
paged-attention machinery — the workload is a handful of operators
and their agents, not 200 QPS.
- Wrapping external inference engines. neuron builds directly on
[candle](https://github.com/huggingface/candle); every model
architecture it serves is implemented in this repository, ported
against the HuggingFace reference.
One thing that is *not* a principle: CUDA exclusivity. All high-end
consumer hardware is in scope. helexa is CUDA-only today because
that's the hardware on the bench — nothing ships untested — and ROCm
or other consumer accelerators join as soon as there's real hardware
to build against.
In scope, and where the engineering effort goes: aggressive
quantization (GGUF Q4_K_M / Q6_K / Q8_0), NCCL tensor parallelism
across heterogeneous consumer GPUs, careful CUDA failure handling, and
single-request latency — the performance that one operator at a
keyboard actually feels.
## Architecture
@@ -28,102 +72,119 @@ model affinities) requires a unified API surface that:
└──────┬───────┘ └─────┬────┘ └──────┬─────┘ └──────┬─────┘
│ │ │ │
└────────────────┴──────┬───────┴───────────────┘
OpenAI + Anthropic APIs
┌──────────▼──────────┐
│ cortex │
(cortex-gateway)
cortex
│ (cortex-gateway) │
│ │
│ Router · Metrics │
│ Evictor · Translate│
└──┬──────┬────────┬──┘
│ │ │
┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐
gpu-large │ │gpu-med │ │ gpu-small
mistralrs │ │mistral │ │ mistralrs
serve │ │rs serve│ │ serve
│ :8080 │ │ :8080 │ │ :8080 │
neuron │ │ neuron │ │ neuron
:13131 │ │ :13131 │ │ :13131
candle │ │ candle │ │ candle
└───────────┘ └────────┘ └───────────┘
private network (.internal)
```
cortex discovers each neuron's hardware (devices, VRAM, compute
capability) at runtime and matches it against a model catalogue
(`models.toml`) to decide placement: which models fit where, what to
evict when VRAM is tight, where to route a request right now. Adding a
GPU host to the fleet is one `[[neurons]]` entry — no device specs in
config.
### Crates
| Crate | Purpose |
|---|---|
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic request/response envelopes |
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, metrics exporter |
| `cortex-agent` | Per-node sidecar: polls local mistralrs, reports to gateway, handles restart/defrag |
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
| `neuron` | Per-host daemon: GPU discovery, in-process candle inference, NCCL tensor parallelism, model lifecycle API |
| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
| `helexa-acp` | Agent Client Protocol bridge — connects ACP editors (Zed, etc.) to any OpenAI-compatible endpoint, cortex by default |
## Node setup
## The engine
Each GPU node runs `mistralrs serve` with a multi-model config. Models are
declared but start **unloaded** — mistral.rs lazy-loads on first request and
the gateway can explicitly unload/reload via the HTTP API.
neuron runs inference in-process on candle — there is no external
inference server to babysit. The parts that earn their keep:
Example node systemd unit:
- **Per-device worker threads.** Every CUDA device gets one dedicated
OS thread that owns its CUDA context for the daemon's lifetime. All
loads, forward passes, KV-cache resets, NCCL collectives, VRAM
queries, and unloads route through it; tensors never escape it
alive. Context binding is pinned to a known thread, the CUDA `Drop`
contract is structurally safe, and a driver error poisons one worker
— visibly — instead of hanging the whole process.
- **Tensor parallelism on consumer cards.** Megatron-style row/column
parallel layers with NCCL all-reduce, spanning the mismatched GPUs
you actually have. A step watchdog aborts wedged collectives instead
of letting a request hang forever.
- **Current model focus: the Qwen3 family** — dense and GGUF-quantized,
including the hybrid linear-attention (Gated DeltaNet) generation.
Vision support is in progress. Each architecture is ported against
its HuggingFace reference implementation.
```ini
# /etc/systemd/system/mistralrs.service
[Unit]
Description=mistral.rs inference server
After=network-online.target
Wants=network-online.target
See `CLAUDE.md` for design rationale and
`crates/neuron/src/harness/device_worker/` for the worker narrative.
[Service]
Type=simple
ExecStart=/usr/local/bin/mistralrs serve \
--from-config /etc/mistralrs/config.toml \
--port 8080
Restart=on-failure
RestartSec=5
Environment=CUDA_VISIBLE_DEVICES=0,1
## Install
[Install]
WantedBy=multi-user.target
Pre-built RPMs for Fedora:
```sh
dnf copr enable helexa/helexa
dnf install cortex # on the gateway host
dnf install helexa-neuron # on each GPU host
systemctl enable --now cortex # or neuron, respectively
```
## Gateway config
## Configure
```toml
# cortex.toml
# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:8000"
metrics_listen = "0.0.0.0:9100"
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[nodes]]
name = "gpu-large"
endpoint = "http://gpu-large.internal:8080"
vram_mb = 49_152 # e.g. 2x RTX 4090
pinned = ["your-org/large-model"]
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
[[nodes]]
name = "gpu-medium"
endpoint = "http://gpu-medium.internal:8080"
vram_mb = 24_576 # e.g. RTX 4090
pinned = ["your-org/medium-model"]
[[nodes]]
name = "gpu-small"
endpoint = "http://gpu-small.internal:8080"
vram_mb = 12_288 # e.g. RTX 3060
pinned = ["your-org/embedding-model"]
[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
```
## Building
Model placement profiles (VRAM requirements, quant, device minimums,
pinning) live in `models.toml` — see `models.example.toml`.
## Run
```sh
# start the gateway
cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# one catalogue across every node
curl http://localhost:31313/v1/models
```
## Build from source
```sh
cargo build --release
```
## CI
Every push triggers format, lint, and test checks. Ensure these pass
locally before pushing:
CI runs on every push; keep it green locally:
```sh
cargo fmt --check --all # must be clean
@@ -131,20 +192,18 @@ cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
```
Tagged releases (`v*`) additionally build an SRPM and publish to COPR.
Tagged releases (`v*`) build SRPMs for `cortex` and `helexa-neuron`
and publish to COPR.
## Running
## Status
```sh
# start the gateway
cortex serve --config cortex.toml
Pre-1.0 and moving fast. The gateway path (routing, eviction,
translation, metrics) is stable and tested; the candle-native engine
is under active development — expect the supported-model list to track
the open-weight frontier, deliberately narrowly.
# check fleet status
cortex status
# list all models across nodes
curl http://localhost:8000/v1/models
```
Development happens at <https://git.lair.cafe/helexa/helexa>;
<https://github.com/helexa-ai/helexa> is a read-only mirror.
## License


@@ -0,0 +1,38 @@
# helexa-bench config for bob.hanzalova.internal.
#
# Synced to /etc/helexa-bench/helexa-bench.toml by script/infra-setup.sh
# (the helexa-bench RPM ships helexa-bench.example.toml as a
# %config(noreplace) default; this per-host file overrides it).
#
# bob is a client host (it also runs Agent Zero); helexa-bench here hits
# every neuron on the fleet directly and records build-stamped results
# into the local SQLite store.
[bench]
sweep_interval_secs = 1800
samples_per_version = 5
iteration_pause_secs = 2
request_timeout_secs = 600
db_path = "/var/lib/helexa-bench/bench.sqlite"
[scenarios]
prompt_sizes = [128, 4096]
max_tokens = 256
# Read-only JSON API consumed by the bench UI (hosted separately) and for
# programmatic access. Served alongside the sweep loop.
[api]
enabled = true
listen = "0.0.0.0:13132"
[[targets]]
name = "beast"
endpoint = "http://beast.hanzalova.internal:13131"
[[targets]]
name = "benjy"
endpoint = "http://benjy.hanzalova.internal:13131"
[[targets]]
name = "quadbrat"
endpoint = "http://quadbrat.hanzalova.internal:13131"

24
asset/neuron/beast.toml Normal file

@@ -0,0 +1,24 @@
# neuron.toml for beast.hanzalova.internal
#
# 2x RTX 5090 (32 GB each) — TP-2 capable. Pre-warms Qwen3.6-27B with
# q5k ISQ across both GPUs at activation, matching the validate-neuron
# invocation: `validate-neuron.sh beast.hanzalova.internal
# Qwen/Qwen3.6-27B q5k 2`.
#
# Synced to /etc/neuron/neuron.toml by script/infra-setup.sh. Edits
# take effect after the next deploy workflow run restarts the service
# (default_models is read at activation).
port = 13131
[[harnesses]]
name = "candle"
[harness.candle]
[[default_models]]
model_id = "Qwen/Qwen3.6-27B"
harness = "candle"
quant = "q6k"
tensor_parallel = 2
devices = [0, 1]

19
asset/neuron/benjy.toml Normal file

@@ -0,0 +1,19 @@
# neuron.toml for benjy.hanzalova.internal
#
# 1x RTX 4090 (24 GB) — largest single-GPU host on the fleet. Pre-warms
# Qwen3-8B (bf16, ~18 GB), leaving ~6 GB for KV cache + activations on
# moderate-length contexts.
#
# Synced to /etc/neuron/neuron.toml by script/infra-setup.sh.
port = 13131
[[harnesses]]
name = "candle"
[harness.candle]
[[default_models]]
model_id = "Qwen/Qwen3-8B"
harness = "candle"
devices = [0]


@@ -0,0 +1,19 @@
# neuron.toml for quadbrat.hanzalova.internal
#
# 1x RTX 3060 (12 GB) — small / quantised tier. Pre-warms Qwen3-1.7B
# (bf16, ~4 GB), leaving ~7 GB for KV cache so long contexts on a small
# model still have plenty of room.
#
# Synced to /etc/neuron/neuron.toml by script/infra-setup.sh.
port = 13131
[[harnesses]]
name = "candle"
[harness.candle]
[[default_models]]
model_id = "Qwen/Qwen3-1.7B"
harness = "candle"
devices = [0]


@@ -0,0 +1,15 @@
# Bootstrap vhost for bench.helexa.ai — http-only, used ONLY to obtain
# the initial Let's Encrypt cert via the webroot challenge (the full TLS
# vhost can't load before the cert file exists). script/infra-setup.sh
# installs this, runs certbot, then swaps in bench.helexa.ai.conf.
server {
listen 80;
server_name bench.helexa.ai;
location /.well-known/acme-challenge/ {
root /var/www/bench.helexa.ai;
}
location / {
try_files $uri $uri/ =404;
}
}


@@ -0,0 +1,56 @@
# Public, auth-less bench UI at https://bench.helexa.ai.
#
# Serves the static SPA from /var/www/bench.helexa.ai (rsynced by
# .gitea/workflows/deploy.yml's deploy-bench-ui job) and reverse-proxies
# /api to the helexa-bench read API on bob over the WireGuard mesh — so
# the browser stays same-origin (no CORS) and the internal API never
# needs to be exposed publicly.
#
# TLS via Let's Encrypt; the cert is obtained/renewed by certbot
# (bootstrapped one-time in script/infra-setup.sh). Mirrors the
# dev.swym.hanzalova.internal vhost convention on this host.
server {
listen 80;
server_name bench.helexa.ai;
# Keep serving the ACME webroot so certbot can renew.
location /.well-known/acme-challenge/ {
root /var/www/bench.helexa.ai;
}
location / {
return 301 https://$host$request_uri;
}
}
server {
listen 443 ssl;
http2 on;
server_name bench.helexa.ai;
ssl_certificate /etc/letsencrypt/live/bench.helexa.ai/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/bench.helexa.ai/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
root /var/www/bench.helexa.ai;
index index.html;
# Bench read API on bob (internal WireGuard); browser stays same-origin.
location /api/ {
proxy_pass http://bob.hanzalova.internal:13132;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
}
# SPA fallback — client-side routes (/trends, /runs) resolve to index.html.
location / {
try_files $uri $uri/ /index.html;
}
}


@@ -0,0 +1,34 @@
# Internal bench UI vhost — https://bench.internal, reachable from inside
# the WireGuard mesh (the public bench.helexa.ai dead-ends at the OPNsense
# LAN interface, which only port-forwards :443 from the WAN). Same SPA +
# /api→bob proxy as bench.helexa.ai, but with an internal-CA cert
# (smallstep "lair", renewed by step@bench.timer). Mirrors the
# *.internal vhost convention on oolon.kosherinata.internal.
server {
server_name bench.internal;
listen 443 ssl;
http2 on;
ssl_certificate /etc/nginx/tls/cert/bench.internal.pem;
ssl_certificate_key /etc/nginx/tls/key/bench.internal.pem;
ssl_trusted_certificate /etc/pki/ca-trust/source/anchors/root-internal.pem;
ssl_protocols TLSv1.3;
# Shared webroot with the public vhost — same built SPA.
root /var/www/bench.helexa.ai;
index index.html;
location /api/ {
proxy_pass http://bob.hanzalova.internal:13132;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
}
location / {
try_files $uri $uri/ /index.html;
}
}


@@ -0,0 +1,25 @@
# Install on the bench host (bob) as /etc/sudoers.d/helexa_gitea_ci
# (owner root:root, mode 0440). Required by .gitea/workflows/deploy.yml,
# which SSHes as gitea_ci@bob to roll out helexa-bench package upgrades
# and config changes.
#
# Filename convention `helexa_gitea_ci` (vs bare `gitea_ci`) so other
# helexa-org apps can drop their own sudoers files on the same host
# without overwriting this one.
#
# helexa-bench polls the neuron fleet (outbound) and serves a read-only
# JSON API on tcp/13132 for the bench UI — hence the firewall-cmd grants.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/helexa-bench/helexa-bench.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start helexa-bench.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop helexa-bench.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl enable --now helexa-bench.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-bench
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-bench
# sudoers reserves `:` and `=` and requires `\` escaping inside command
# arguments — without it visudo errors at the first `:` in `https://`.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://rpm.lair.cafe/lair-cafe-unstable.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service\=helexa-bench --permanent
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload
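
The escaping rule in the comment above can be sketched as a small helper. This is a minimal illustration, assuming the character set listed in the sudoers manual for command arguments (`,`, `:`, `=`, `\`); `escapeSudoersArg` is a hypothetical name and is not part of this repository or of sudo.

```typescript
// Characters that the sudoers manual says need a backslash inside command arguments.
const SUDOERS_SPECIALS = /[\\,:=]/g;

/** Escape one command argument for a sudoers command spec (hypothetical helper). */
function escapeSudoersArg(arg: string): string {
  // Prefix every special character with a single backslash.
  return arg.replace(SUDOERS_SPECIALS, (ch) => "\\" + ch);
}

// The repo-file argument used by the dnf config-manager rule above.
const escapedRepoArg = escapeSudoersArg(
  "--from-repofile=https://rpm.lair.cafe/lair-cafe-unstable.repo",
);
```

Here `escapedRepoArg` holds the same escaped form that appears in the `dnf config-manager addrepo` rule. The command the workflow runs stays unescaped; only the sudoers entry carries the backslashes.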


@@ -0,0 +1,23 @@
# Install on the cortex gateway host as /etc/sudoers.d/helexa_gitea_ci
# (owner root:root, mode 0440). Required by .gitea/workflows/deploy.yml,
# which SSHes as gitea_ci@<gateway> to roll out cortex package upgrades
# and config changes.
#
# Filename convention `helexa_gitea_ci` (vs bare `gitea_ci`) so other
# helexa-org apps can drop their own sudoers files on the same host
# without overwriting this one.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/cortex/cortex.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/cortex/models.toml
# deploy-bench-ui rsyncs the built bench SPA into the nginx webroot.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /var/www/bench.helexa.ai/
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl enable --now cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y cortex
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y cortex
# sudoers reserves `:` and `=` and requires `\` escaping inside command
# arguments — without it visudo errors at the first `:` in `https://`.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://rpm.lair.cafe/lair-cafe-unstable.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1


@@ -0,0 +1,43 @@
# Install on every neuron host as /etc/sudoers.d/helexa_gitea_ci
# (owner root:root, mode 0440). Required by .gitea/workflows/deploy.yml,
# which SSHes as gitea_ci@<neuron-host> to roll out helexa-neuron-<flavour>
# package upgrades and config changes.
#
# Filename convention `helexa_gitea_ci` (vs bare `gitea_ci`) so other
# helexa-org apps can drop their own sudoers files on the same host
# without overwriting this one.
#
# All three CUDA flavours are listed because a host's flavour can change
# (e.g. GPU swap) and we don't want the sudoers file to need to change
# in lockstep. Only one flavour can be installed at a time (the packages
# Conflict: with each other), so the attack surface is bounded to "wrong
# flavour installed" — vandalism, not privilege escalation.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/neuron/neuron.toml
# deploy.yml writes the per-model systemd drop-in carrying
# NEURON_MAX_PROMPT_TOKENS: gitea_ci stages it in its own dir, then
# installs it root-owned. Exact source/dest paths; see doc/context-limits.md.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/install -o root -g root -m 0644 -D /var/lib/gitea_ci/model.conf /etc/systemd/system/neuron.service.d/model.conf
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl enable --now neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-ampere
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-ampere
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-ada
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-ada
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-blackwell
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-blackwell
# sudoers reserves `:` and `=` and requires `\` escaping inside command
# arguments — without it visudo errors at the first `:` in `https://`.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://rpm.lair.cafe/lair-cafe-unstable.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install -y libcudnn9-cuda-13
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service\=helexa-neuron --permanent
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload
# deploy-dev.yml fast path: install a freshly-built dev binary over the
# packaged one. Exact source path + args; the workflow must use this
# command form verbatim. The next deploy.yml run reconciles the host
# back to the RPM-owned binary.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/install -o root -g root -m 0755 /var/lib/gitea_ci/neuron-dev /usr/bin/neuron


@@ -0,0 +1,20 @@
# Internal-CA cert renewal for %i.internal, driven by step@%i.timer.
# Replicated from oolon.kosherinata.internal (the kosherinata DC proxy).
# Renews an EXISTING cert via mTLS (step ca renew) — the initial cert
# must be issued once with a provisioner (see script/infra-setup.sh).
# Installed to /etc/systemd/system/step@.service.
[Unit]
Description=step cert renew for %i.internal
Documentation=https://smallstep.com/docs/step-ca/renewal
[Service]
Type=oneshot
ExecCondition=/usr/bin/step certificate needs-renewal \
/etc/nginx/tls/cert/%i.internal.pem
ExecStart=/usr/bin/step ca renew \
--force \
--ca-url https://ca.internal \
--root /etc/pki/ca-trust/source/anchors/root-internal.pem \
/etc/nginx/tls/cert/%i.internal.pem \
/etc/nginx/tls/key/%i.internal.pem
ExecStartPost=/usr/bin/systemctl reload nginx.service

asset/systemd/step@.timer Normal file

@@ -0,0 +1,15 @@
# Periodic internal-cert renewal for %i.internal (every 15 min, jittered).
# Replicated from oolon.kosherinata.internal. Installed to
# /etc/systemd/system/step@.timer; enable per-cert with
# `systemctl enable --now step@bench.timer`.
[Unit]
Description=step cert renew timer for %i.internal
[Timer]
Persistent=true
OnCalendar=*:1/15
AccuracySec=1us
RandomizedDelaySec=5m
[Install]
WantedBy=timers.target
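
To make the schedule concrete: the minute field `1/15` in `OnCalendar=*:1/15` is a start value plus a repetition step. The sketch below expands such a field into the minutes of the hour it matches; `expandMinuteSpec` is a hypothetical helper written for illustration and does not model the rest of the systemd calendar syntax or the `RandomizedDelaySec` jitter.

```typescript
/** Expand a "start/step" minute field into the matching minutes of one hour. */
function expandMinuteSpec(spec: string): number[] {
  const parts = spec.split("/");
  if (parts.length !== 2) {
    throw new Error("expected a field of the form start/step");
  }
  const start = parseInt(parts[0], 10);
  const step = parseInt(parts[1], 10);
  if (isNaN(start) || isNaN(step) || start < 0 || start > 59 || step <= 0) {
    throw new Error("start must be 0..59 and step must be positive");
  }
  const minutes: number[] = [];
  // Begin at the start value and add the step until the hour is exhausted.
  for (let m = start; m < 60; m += step) {
    minutes.push(m);
  }
  return minutes;
}

const renewalMinutes = expandMinuteSpec("1/15"); // [1, 16, 31, 46]
```

So the timer fires four times an hour, offset by one minute from the quarter-hour marks, before the random delay of up to five minutes is applied.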

bench/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
node_modules
dist
*.local

bench/README.md Normal file

@@ -0,0 +1,45 @@
# helexa bench UI
A Vite + React (SWC, TypeScript) app that visualises the fleet benchmark
data collected by `helexa-bench`. It reads the read-only JSON API the
bench daemon serves (`crates/helexa-bench/src/api.rs`, default
`:13132` on bob).
Stack: React Router, react-bootstrap, Recharts.
## Pages
- **Overview** — latest median results per (host, model, scenario) cell.
- **Trends** — decode-tok/s and TTFT plotted across neuron build SHAs as
releases roll out (the headline view). Pick model / scenario; the host is resolved server-side.
- **Runs** — filterable raw-run explorer.
## Develop
```sh
cd bench
npm install
npm run dev # http://localhost:5173
```
`vite.config.ts` proxies `/api` to `http://bob.hanzalova.internal:13132`,
so the dev server talks to the live bench API with no CORS fuss. Point
the proxy elsewhere (or run a local `helexa-bench serve`) to develop
against other data.
## Production hosting
Public at **https://bench.helexa.ai** — nginx on the gateway
(`hanzalova.internal`) serves the static `dist/` and reverse-proxies
`/api` to the bench API on bob over WireGuard, so the SPA is same-origin
(no CORS) and the internal API stays off the public internet.
- `npm run build` is run with **no** `VITE_API_BASE` (the app calls
`/api/...` on its own origin; nginx proxies it to bob).
- `.gitea/workflows/deploy.yml` (`deploy-bench-ui`) builds and rsyncs
`dist/` to `/var/www/bench.helexa.ai` on every deploy.
- The nginx vhost (`asset/nginx/bench.helexa.ai.conf`) and the
Let's Encrypt cert are one-time host setup in `script/infra-setup.sh`.
To host elsewhere instead, build with
`VITE_API_BASE=<bob-api-origin>` and serve the static `dist/`.
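
The two hosting modes above differ only in what gets prefixed to each `/api/...` path. A minimal sketch of that rule, assuming the base is either unset (same origin or dev proxy) or a full origin; `apiUrl` is a hypothetical helper, not a function in the app:

```typescript
/** Join an optional API base origin with an absolute API path. */
function apiUrl(base: string | undefined, path: string): string {
  // An unset base yields a relative URL, so the browser uses the page's own origin.
  return (base ?? "") + path;
}

const sameOrigin = apiUrl(undefined, "/api/summary");
const separateHost = apiUrl("http://bob.hanzalova.internal:13132", "/api/summary");
```

With no base the request goes to the serving origin, where nginx (or the Vite dev proxy) forwards it. With a base set, the browser calls the bench API directly, so that deployment needs the API to be reachable from the client.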

bench/index.html Normal file

@@ -0,0 +1,12 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>helexa bench</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>

bench/package-lock.json generated Normal file

File diff suppressed because it is too large

bench/package.json Normal file

@@ -0,0 +1,28 @@
{
"name": "helexa-bench-ui",
"private": true,
"version": "0.1.0",
"type": "module",
"description": "Visualisation app for helexa-bench fleet benchmark data.",
"scripts": {
"dev": "vite",
"build": "tsc && vite build",
"preview": "vite preview"
},
"dependencies": {
"bootstrap": "^5.3.3",
"react": "^18.3.1",
"react-bootstrap": "^2.10.5",
"react-dom": "^18.3.1",
"react-router-dom": "^6.26.2",
"recharts": "^2.12.7"
},
"devDependencies": {
"@types/node": "^20.14.0",
"@types/react": "^18.3.5",
"@types/react-dom": "^18.3.0",
"@vitejs/plugin-react-swc": "^3.7.0",
"typescript": "^5.5.4",
"vite": "^5.4.0"
}
}

bench/src/App.tsx Normal file

@@ -0,0 +1,30 @@
import { Container, Nav, Navbar } from "react-bootstrap";
import { NavLink, Outlet } from "react-router-dom";
export default function App() {
return (
<>
<Navbar bg="dark" variant="dark" expand="md">
<Container>
<Navbar.Brand as={NavLink} to="/">
helexa&nbsp;bench
</Navbar.Brand>
<Nav className="me-auto">
<Nav.Link as={NavLink} to="/" end>
Overview
</Nav.Link>
<Nav.Link as={NavLink} to="/trends">
Trends
</Nav.Link>
<Nav.Link as={NavLink} to="/runs">
Runs
</Nav.Link>
</Nav>
</Container>
</Navbar>
<Container className="py-4">
<Outlet />
</Container>
</>
);
}

bench/src/api.ts Normal file

@@ -0,0 +1,45 @@
import type { Dimensions, ReportRow, RunRow, SeriesPoint } from "./types";
// Empty default → `fetch('/api/...')` hits the dev proxy (vite.config.ts)
// or the same origin. For a separately-hosted build, set VITE_API_BASE to
// the bob API origin (e.g. http://bob.hanzalova.internal:13132).
const BASE = import.meta.env.VITE_API_BASE ?? "";
async function getJson<T>(path: string): Promise<T> {
const res = await fetch(`${BASE}${path}`);
if (!res.ok) {
throw new Error(`${res.status} ${res.statusText}: ${await res.text()}`);
}
return res.json() as Promise<T>;
}
export const getDimensions = () => getJson<Dimensions>("/api/dimensions");
export const getSummary = () => getJson<ReportRow[]>("/api/summary");
// host is resolved server-side (each model maps to one host today), so the
// public UI selects by model + scenario alone.
export const getSeries = (model: string, scenario: string) =>
getJson<SeriesPoint[]>(
`/api/series?model=${encodeURIComponent(model)}&scenario=${encodeURIComponent(scenario)}`,
);
export interface RunsParams {
host?: string;
model?: string;
scenario?: string;
sha?: string;
ok?: boolean;
limit?: number;
}
export const getRuns = (p: RunsParams = {}) => {
const q = new URLSearchParams();
if (p.host) q.set("host", p.host);
if (p.model) q.set("model", p.model);
if (p.scenario) q.set("scenario", p.scenario);
if (p.sha) q.set("sha", p.sha);
if (p.ok !== undefined) q.set("ok", String(p.ok));
if (p.limit) q.set("limit", String(p.limit));
const qs = q.toString();
return getJson<RunRow[]>(`/api/runs${qs ? `?${qs}` : ""}`);
};
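
Model ids and scenario ids contain `/` and `:`, which is why `getSeries` encodes both query values. The standalone sketch below repeats that path construction so the encoded result can be inspected; `seriesPath` is a hypothetical name used only for this illustration.

```typescript
/** Build the /api/series path for one (model, scenario) cell. */
function seriesPath(model: string, scenario: string): string {
  // encodeURIComponent escapes "/" and ":" so they cannot be misread as URL syntax.
  return (
    "/api/series?model=" +
    encodeURIComponent(model) +
    "&scenario=" +
    encodeURIComponent(scenario)
  );
}

const examplePath = seriesPath("Qwen/Qwen3-8B", "chat:128");
// "/api/series?model=Qwen%2FQwen3-8B&scenario=chat%3A128"
```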

bench/src/baseline.ts Normal file

@@ -0,0 +1,52 @@
// Pre-helexa-bench baseline, transcribed verbatim from doc/benchmarks.md.
//
// IMPORTANT — different measurement regime. These were measured by
// script/bench.py *through the cortex gateway* (so TTFT/total include a
// proxy hop), reported as medians only, before helexa-bench existed.
// helexa-bench measures each neuron *directly*. So these points are an
// honest historical anchor, NOT apples-to-apples with the live series —
// the Trends view renders them dashed + labelled, never merged into the
// live line.
//
// Host is inferred from the model via the doc's Fleet table
// (beast=27B, benjy=8B, quadbrat=1.7B). Timestamps are the two 2026-06-12
// snapshots in the doc, ordered (08:00 = pre-#11, 16:00 = post-#11) so
// they sort before the bench era on the shared time axis.
export interface BaselinePoint {
host: string;
model: string;
scenario: string;
git_sha: string;
build_timestamp: string;
ttft_s: number;
decode_tps: number;
total_s: number;
}
/** Source: bench.py via cortex gateway — see doc/benchmarks.md. */
export const BASELINE_SOURCE = "bench.py · via cortex gateway";
export const BASELINE: BaselinePoint[] = [
// ── 8f6f1d3 — baseline (2026-06-12) ────────────────────────────────
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:128", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 1.658, decode_tps: 35.0, total_s: 8.981 },
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:4096", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 7.067, decode_tps: 33.7, total_s: 14.63 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:128", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 0.884, decode_tps: 62.4, total_s: 4.938 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:4096", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 1.818, decode_tps: 46.5, total_s: 7.27 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:128", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 0.685, decode_tps: 81.3, total_s: 3.741 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:4096", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 2.743, decode_tps: 35.4, total_s: 9.884 },
// ── a1952a4 — post prefix-KV-cache (#11, 2026-06-12) ───────────────
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:128", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 1.355, decode_tps: 45.8, total_s: 4.147 },
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:4096", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 1.431, decode_tps: 43.3, total_s: 4.387 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:128", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 0.886, decode_tps: 78.6, total_s: 2.478 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:4096", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 1.824, decode_tps: 58.3, total_s: 3.969 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:128", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 0.702, decode_tps: 104.8, total_s: 1.895 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:4096", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 2.749, decode_tps: 44.9, total_s: 5.534 },
];
/** Baseline points for one (model, scenario) cell, oldest first. */
export function baselineFor(model: string, scenario: string): BaselinePoint[] {
return BASELINE.filter(
(b) => b.model === model && b.scenario === scenario,
).sort((a, b) => a.build_timestamp.localeCompare(b.build_timestamp));
}
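
Because both baseline snapshots come from the same regime (bench.py through the gateway), they can be compared with each other even though they are not comparable with the live series. A small worked example, using the beast `chat:128` rows from the table above; `relativeChange` is a hypothetical helper, not part of the app:

```typescript
/** Fractional change from `before` to `after` (0.25 means +25%). */
function relativeChange(before: number, after: number): number {
  return (after - before) / before;
}

// beast, Qwen/Qwen3.6-27B, chat:128: 8f6f1d3 (before) vs a1952a4 (after).
const decodeChange = relativeChange(35.0, 45.8); // about +0.309
const totalChange = relativeChange(8.981, 4.147); // about -0.538

// Render as signed percentages with one decimal place.
const decodePct = (decodeChange * 100).toFixed(1); // "30.9"
const totalPct = (totalChange * 100).toFixed(1); // "-53.8"
```

For that cell, decode throughput rose by roughly 31% and total time fell by roughly 54% between the two snapshots.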

bench/src/main.tsx Normal file

@@ -0,0 +1,22 @@
import React from "react";
import ReactDOM from "react-dom/client";
import { BrowserRouter, Route, Routes } from "react-router-dom";
import "bootstrap/dist/css/bootstrap.min.css";
import App from "./App";
import Overview from "./pages/Overview";
import Trends from "./pages/Trends";
import Runs from "./pages/Runs";
ReactDOM.createRoot(document.getElementById("root")!).render(
<React.StrictMode>
<BrowserRouter>
<Routes>
<Route path="/" element={<App />}>
<Route index element={<Overview />} />
<Route path="trends" element={<Trends />} />
<Route path="runs" element={<Runs />} />
</Route>
</Routes>
</BrowserRouter>
</React.StrictMode>,
);


@@ -0,0 +1,64 @@
import { useEffect, useState } from "react";
import { Alert, Spinner, Table } from "react-bootstrap";
import { getSummary } from "../api";
import type { ReportRow } from "../types";
const f = (n: number | null, p = 2) => (n == null ? "—" : n.toFixed(p));
export default function Overview() {
const [rows, setRows] = useState<ReportRow[]>([]);
const [err, setErr] = useState<string | null>(null);
const [loading, setLoading] = useState(true);
useEffect(() => {
getSummary()
.then(setRows)
.catch((e) => setErr(String(e)))
.finally(() => setLoading(false));
}, []);
if (loading) return <Spinner animation="border" />;
if (err) return <Alert variant="danger">{err}</Alert>;
return (
<>
<h3 className="mb-3">Latest results per cell</h3>
<p className="text-muted">
Median of each cell's samples on the most recent build seen for that
(host, model, scenario).
</p>
<Table striped bordered hover responsive size="sm">
<thead>
<tr>
<th>GPU</th>
<th>model</th>
<th className="text-end">prompt tok</th>
<th className="text-end">TTFT (s)</th>
<th className="text-end">decode tok/s</th>
<th className="text-end">total (s)</th>
<th>build</th>
<th className="text-end">n</th>
</tr>
</thead>
<tbody>
{rows.map((r, i) => (
<tr key={i}>
<td>{r.gpu ?? r.target_name}</td>
<td>{r.model_id}</td>
<td className="text-end">
{r.prompt_tokens ?? `~${r.prompt_size_approx}`}
</td>
<td className="text-end">{f(r.ttft_s_median, 3)}</td>
<td className="text-end">{f(r.decode_tps_median, 1)}</td>
<td className="text-end">{f(r.total_s_median, 3)}</td>
<td>
<code>{r.git_sha}</code>
</td>
<td className="text-end">{r.samples}</td>
</tr>
))}
</tbody>
</Table>
</>
);
}
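
The table shows medians that the bench API has already computed. How the server computes them is not shown in this diff; the sketch below uses the conventional definition (middle value, or the mean of the two middle values for an even count) purely as an illustration of what the `*_median` columns mean.

```typescript
/** Conventional median; returns null when there are no samples. */
function median(samples: number[]): number | null {
  if (samples.length === 0) return null;
  // Sort a copy numerically so the caller's array is left untouched.
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

const oddCount = median([3, 1, 2]); // 2
const evenCount = median([4, 1, 3, 2]); // 2.5
const noSamples = median([]); // null
```

A `null` result lines up with the nullable `*_median` fields in `ReportRow`, which the page's formatter renders as a placeholder.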

bench/src/pages/Runs.tsx Normal file

@@ -0,0 +1,141 @@
import { useEffect, useState } from "react";
import { Alert, Badge, Col, Form, Row, Spinner, Table } from "react-bootstrap";
import { getDimensions, getRuns } from "../api";
import type { Dimensions, RunRow } from "../types";
const f = (n: number | null, p = 2) => (n == null ? "—" : n.toFixed(p));
function Picker({
label,
value,
set,
options,
}: {
label: string;
value: string;
set: (v: string) => void;
options: string[];
}) {
return (
<Form.Group as={Col}>
<Form.Label>{label}</Form.Label>
<Form.Select value={value} onChange={(e) => set(e.target.value)}>
<option value="">(all)</option>
{options.map((o) => (
<option key={o} value={o}>
{o}
</option>
))}
</Form.Select>
</Form.Group>
);
}
export default function Runs() {
const [dims, setDims] = useState<Dimensions | null>(null);
const [host, setHost] = useState("");
const [model, setModel] = useState("");
const [scenario, setScenario] = useState("");
const [rows, setRows] = useState<RunRow[]>([]);
const [err, setErr] = useState<string | null>(null);
const [loading, setLoading] = useState(false);
useEffect(() => {
getDimensions()
.then(setDims)
.catch((e) => setErr(String(e)));
}, []);
useEffect(() => {
setLoading(true);
getRuns({
host: host || undefined,
model: model || undefined,
scenario: scenario || undefined,
limit: 200,
})
.then(setRows)
.catch((e) => setErr(String(e)))
.finally(() => setLoading(false));
}, [host, model, scenario]);
if (err) return <Alert variant="danger">{err}</Alert>;
return (
<>
<h3 className="mb-3">Runs</h3>
{dims && (
<Row className="g-3 mb-3">
{/* GPU filter — labelled by GPU, but filters by the underlying host. */}
<Form.Group as={Col}>
<Form.Label>GPU</Form.Label>
<Form.Select value={host} onChange={(e) => setHost(e.target.value)}>
<option value="">(all)</option>
{dims.hosts.map((h) => (
<option key={h} value={h}>
{dims.host_gpus[h] ?? h}
</option>
))}
</Form.Select>
</Form.Group>
<Picker
label="Model"
value={model}
set={setModel}
options={dims.models}
/>
<Picker
label="Scenario"
value={scenario}
set={setScenario}
options={dims.scenarios}
/>
</Row>
)}
{loading ? (
<Spinner animation="border" />
) : (
<Table striped bordered hover responsive size="sm">
<thead>
<tr>
<th>ts</th>
<th>GPU</th>
<th>model</th>
<th>scenario</th>
<th>build</th>
<th className="text-end">TTFT</th>
<th className="text-end">tok/s</th>
<th className="text-end">total</th>
<th>ok</th>
</tr>
</thead>
<tbody>
{rows.map((r) => (
<tr key={r.id}>
<td>{r.ts}</td>
<td>{r.gpu ?? r.host}</td>
<td>{r.model_id}</td>
<td>{r.scenario_id}</td>
<td>
<code>{r.git_sha}</code>
</td>
<td className="text-end">{f(r.ttft_s, 3)}</td>
<td className="text-end">{f(r.decode_tps, 1)}</td>
<td className="text-end">{f(r.total_s, 3)}</td>
<td>
{r.ok ? (
<Badge bg="success">ok</Badge>
) : (
<Badge bg="danger" title={r.error ?? ""}>
fail
</Badge>
)}
</td>
</tr>
))}
</tbody>
</Table>
)}
</>
);
}

bench/src/pages/Trends.tsx Normal file

@@ -0,0 +1,221 @@
import { useEffect, useMemo, useState } from "react";
import { Alert, Col, Form, Row, Spinner } from "react-bootstrap";
import {
CartesianGrid,
Legend,
Line,
LineChart,
ReferenceLine,
ResponsiveContainer,
Tooltip,
XAxis,
YAxis,
} from "recharts";
import { getDimensions, getSeries } from "../api";
import type { Dimensions, SeriesPoint } from "../types";
import { BASELINE_SOURCE, baselineFor } from "../baseline";
function Picker({
label,
value,
set,
options,
}: {
label: string;
value: string;
set: (v: string) => void;
options: string[];
}) {
return (
<Form.Group as={Col}>
<Form.Label>{label}</Form.Label>
<Form.Select value={value} onChange={(e) => set(e.target.value)}>
{options.map((o) => (
<option key={o} value={o}>
{o}
</option>
))}
</Form.Select>
</Form.Group>
);
}
export default function Trends() {
const [dims, setDims] = useState<Dimensions | null>(null);
const [model, setModel] = useState("");
const [scenario, setScenario] = useState("");
const [series, setSeries] = useState<SeriesPoint[]>([]);
const [err, setErr] = useState<string | null>(null);
useEffect(() => {
getDimensions()
.then((d) => {
setDims(d);
if (d.models[0]) setModel(d.models[0]);
if (d.scenarios[0]) setScenario(d.scenarios[0]);
})
.catch((e) => setErr(String(e)));
}, []);
useEffect(() => {
if (model && scenario) {
getSeries(model, scenario)
.then(setSeries)
.catch((e) => setErr(String(e)));
}
}, [model, scenario]);
// Prepend the pre-helexa-bench baseline (dashed, separate keys) so it
// anchors the timeline without being merged into the live line. Different
// measurement regime — see baseline.ts / doc/benchmarks.md.
const base = useMemo(
() => baselineFor(model, scenario),
[model, scenario],
);
const data = useMemo(
() => [
...base.map((p) => ({
label: p.git_sha,
baseTtft: p.ttft_s,
baseDecode: p.decode_tps,
baseTotal: p.total_s,
})),
...series.map((p) => ({
label: p.git_sha,
ttft: p.ttft_s_median,
decode: p.decode_tps_median,
total: p.total_s_median,
})),
],
[series, base],
);
// Divider marking the boundary between the two regimes (drawn at the
// first live build, with baseline points to its left).
const firstLive = series[0]?.git_sha;
const showDivider = base.length > 0 && series.length > 0;
if (err) return <Alert variant="danger">{err}</Alert>;
if (!dims) return <Spinner animation="border" />;
return (
<>
<h3 className="mb-3">Trends over builds</h3>
<Row className="g-3 mb-4">
<Picker
label="Model"
value={model}
set={setModel}
options={dims.models}
/>
<Picker
label="Scenario"
value={scenario}
set={setScenario}
options={dims.scenarios}
/>
</Row>
{dims.model_gpus[model] && (
<p className="text-muted mb-3">
Measured on <strong>{dims.model_gpus[model]}</strong>.
</p>
)}
{data.length === 0 ? (
<Alert variant="info">No data for this selection yet.</Alert>
) : (
<>
{base.length > 0 && (
<p className="text-muted small mb-3">
Dashed = pre-helexa-bench baseline ({BASELINE_SOURCE}); solid =
helexa-bench (direct to neuron). Different measurement regimes;
see <code>doc/benchmarks.md</code>.
</p>
)}
<h5 className="mt-3">decode tok/s (higher is better)</h5>
<ResponsiveContainer width="100%" height={280}>
<LineChart data={data} margin={{ top: 8, right: 24, bottom: 8, left: 0 }}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="label" />
<YAxis />
<Tooltip />
<Legend />
{showDivider && firstLive && (
<ReferenceLine
x={firstLive}
stroke="#bbb"
strokeDasharray="3 3"
label={{
value: "bench.py → helexa-bench",
position: "top",
fill: "#999",
fontSize: 11,
}}
/>
)}
<Line
type="monotone"
dataKey="decode"
name="decode tok/s"
stroke="#0d6efd"
connectNulls
/>
{base.length > 0 && (
<Line
type="monotone"
dataKey="baseDecode"
name="baseline (bench.py · gateway)"
stroke="#888"
strokeDasharray="5 5"
connectNulls
/>
)}
</LineChart>
</ResponsiveContainer>
<h5 className="mt-4">TTFT seconds (lower is better)</h5>
<ResponsiveContainer width="100%" height={280}>
<LineChart data={data} margin={{ top: 8, right: 24, bottom: 8, left: 0 }}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="label" />
<YAxis />
<Tooltip />
<Legend />
{showDivider && firstLive && (
<ReferenceLine
x={firstLive}
stroke="#bbb"
strokeDasharray="3 3"
label={{
value: "bench.py → helexa-bench",
position: "top",
fill: "#999",
fontSize: 11,
}}
/>
)}
<Line
type="monotone"
dataKey="ttft"
name="TTFT (s)"
stroke="#dc3545"
connectNulls
/>
{base.length > 0 && (
<Line
type="monotone"
dataKey="baseTtft"
name="baseline (bench.py · gateway)"
stroke="#888"
strokeDasharray="5 5"
connectNulls
/>
)}
</LineChart>
</ResponsiveContainer>
</>
)}
</>
);
}
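
The chart keeps the two regimes apart by giving baseline and live points different data keys on one shared row list. The pure function below restates that merge outside React so the resulting row shape is easy to see. The interfaces are trimmed copies for illustration, `mergeForChart` is a hypothetical name, and the live point is made-up sample data.

```typescript
interface BasePoint { git_sha: string; ttft_s: number; decode_tps: number }
interface LivePoint {
  git_sha: string;
  ttft_s_median: number | null;
  decode_tps_median: number | null;
}
interface ChartRow {
  label: string;
  baseTtft?: number;
  baseDecode?: number;
  ttft?: number | null;
  decode?: number | null;
}

/** Baseline rows first, then live rows; each row only carries its own regime's keys. */
function mergeForChart(base: BasePoint[], live: LivePoint[]): ChartRow[] {
  const baseRows: ChartRow[] = base.map((p) => ({
    label: p.git_sha,
    baseTtft: p.ttft_s,
    baseDecode: p.decode_tps,
  }));
  const liveRows: ChartRow[] = live.map((p) => ({
    label: p.git_sha,
    ttft: p.ttft_s_median,
    decode: p.decode_tps_median,
  }));
  return baseRows.concat(liveRows);
}

const chartRows = mergeForChart(
  [
    { git_sha: "8f6f1d3", ttft_s: 0.884, decode_tps: 62.4 },
    { git_sha: "a1952a4", ttft_s: 0.886, decode_tps: 78.6 },
  ],
  // Hypothetical live point, not a real measurement.
  [{ git_sha: "0000000", ttft_s_median: 0.5, decode_tps_median: 50 }],
);
```

Since a baseline row has no `decode` key and a live row has no `baseDecode` key, each `Line` only finds values in its own rows, so the dashed and solid series are never joined.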

bench/src/types.ts Normal file

@@ -0,0 +1,69 @@
// Mirrors the JSON served by helexa-bench's read API (crates/helexa-bench/src/api.rs).
export interface BuildRef {
git_sha: string;
build_timestamp: string | null;
package_version: string | null;
}
export interface Dimensions {
hosts: string[];
models: string[];
scenarios: string[];
builds: BuildRef[];
/** host → GPU label, e.g. "2× RTX 5090". */
host_gpus: Record<string, string>;
/** model → GPU label (model maps to one host today). */
model_gpus: Record<string, string>;
}
/** Latest-SHA-per-cell medians (the report table). */
export interface ReportRow {
target_name: string;
model_id: string;
scenario_id: string;
prompt_size_approx: number;
git_sha: string;
prompt_tokens: number | null;
ttft_s_median: number | null;
decode_tps_median: number | null;
total_s_median: number | null;
samples: number;
/** Public-facing resource name (the host's GPU(s)). */
gpu: string | null;
}
/** One point in a per-build time-series for a (host, model, scenario) cell. */
export interface SeriesPoint {
git_sha: string;
build_timestamp: string | null;
package_version: string | null;
ttft_s_median: number | null;
decode_tps_median: number | null;
total_s_median: number | null;
samples: number;
}
export interface RunRow {
id: number;
ts: string;
host: string;
/** Public-facing resource name (the host's GPU(s)). */
gpu: string | null;
hostname: string | null;
git_sha: string;
build_timestamp: string | null;
package_version: string;
model_id: string;
harness: string;
scenario_id: string;
prompt_size_approx: number;
prompt_tokens_actual: number | null;
max_tokens: number;
ttft_s: number | null;
decode_tps: number | null;
total_s: number | null;
completion_tokens: number | null;
ok: boolean;
error: string | null;
}

bench/src/vite-env.d.ts vendored Normal file

@@ -0,0 +1,9 @@
/// <reference types="vite/client" />
interface ImportMetaEnv {
/** Base origin of the bench API. Empty → use the dev proxy / same origin. */
readonly VITE_API_BASE?: string;
}
interface ImportMeta {
readonly env: ImportMetaEnv;
}

bench/tsconfig.json Normal file

@@ -0,0 +1,22 @@
{
"compilerOptions": {
"target": "ES2022",
"useDefineForClassFields": true,
"lib": ["ES2022", "DOM", "DOM.Iterable"],
"module": "ESNext",
"skipLibCheck": true,
"moduleResolution": "bundler",
"allowImportingTsExtensions": true,
"resolveJsonModule": true,
"isolatedModules": true,
"moduleDetection": "force",
"noEmit": true,
"jsx": "react-jsx",
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"noFallthroughCasesInSwitch": true,
"types": ["node", "vite/client"]
},
"include": ["src", "vite.config.ts"]
}

bench/vite.config.ts Normal file

@@ -0,0 +1,18 @@
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react-swc";
// Dev server proxies /api to the bench API on bob so `fetch('/api/...')`
// works without CORS/mixed-origin fuss during local development.
// For a production build hosted elsewhere, set VITE_API_BASE to the bob
// API origin (e.g. http://bob.hanzalova.internal:13132) instead.
export default defineConfig({
plugins: [react()],
server: {
proxy: {
"/api": {
target: "http://bob.hanzalova.internal:13132",
changeOrigin: true,
},
},
},
});


@@ -3,22 +3,27 @@
# Copy to cortex.toml and adjust for your environment.
#
# Environment variable overrides use CORTEX_ prefix with __ separators:
-# CORTEX_GATEWAY__LISTEN=0.0.0.0:9000
+# CORTEX_GATEWAY__LISTEN=0.0.0.0:31313
+# Path to the model catalogue (limits, cost, pinning, aliases, feasibility).
+# Defaults to the packaged location below; uncomment to override for a
+# non-packaged / local run.
+# models_config = "/etc/cortex/models.toml"
[gateway]
-listen = "0.0.0.0:8000"
-metrics_listen = "0.0.0.0:9100"
+listen = "0.0.0.0:31313"
+metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru"
-# Restart mistralrs after this many load/unload cycles to defragment VRAM.
+# Restart neurons after this many load/unload cycles to defragment VRAM.
# Set to 0 to disable.
defrag_after_cycles = 50
# -- Nodes ---------------------------------------------------------------
-# Each [[nodes]] entry declares a mistral.rs instance in the fleet.
-# Models are discovered by polling the node's /v1/models endpoint.
-# Pinned models are never evicted.
+# Each [[nodes]] entry declares a neuron daemon in the fleet.
+# Models are discovered by polling the neuron's /models endpoint.
+# Pinned models (see models.toml) are never evicted.
[[nodes]]
name = "gpu-large"
@@ -43,3 +48,45 @@ vram_mb = 12288 # e.g. RTX 3060 (12 GB)
pinned = [
"your-org/embedding-model",
]
# -- Entitlements (multi-tenant governance, #47) -------------------------
# Identity + per-key token budgets. Omit this section entirely for the
# legacy single-operator behaviour: requests are anonymous and uncapped.
#
# The local/static provider below is the source of truth for accounts,
# keys, and hard caps until the upstream clearing house exists. Identity
# rides standard bearer auth only — clients send
# Authorization: Bearer <key>
# no custom headers or body fields.
[entitlements]
# Reject unauthenticated requests with 401 invalid_api_key. Leave false
# (allow-anonymous) during rollout; flip to true once keys are issued.
require_auth = false
# One entry per API key.
[[entitlements.keys]]
key = "sk-example-rolling" # the bearer token the client sends
account_id = "team-research" # billable account (keys may share one)
key_id = "research-ci" # stable label for ledger/metrics (optional)
hard_cap = 5_000_000 # hard token cap over the window
# Rolling window that resets — over-cap requests get 429 rate_limit_exceeded
# + Retry-After, so well-behaved clients (opencode/AI SDK) back off and retry.
window = { kind = "rolling", seconds = 3600 }
[[entitlements.keys]]
key = "sk-example-balance"
account_id = "team-research"
key_id = "research-prepaid"
hard_cap = 20_000_000
# Hard balance, no reset — exhaustion returns 429 insufficient_quota
# (the client surfaces and stops). This is the default when `window` is
# omitted. Never 402.
window = { kind = "balance" }
[[entitlements.keys]]
key = "sk-example-infra"
account_id = "operator"
key_id = "infra"
# No hard_cap → uncapped operator infra key (own fleet, own use). Still
# metered for visibility.
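
The two `window` kinds above differ only in how an over-cap request is rejected. A minimal sketch of that mapping follows, assuming a hypothetical `Rejection` struct and a caller-supplied `secs_until_reset`; neither name comes from the repository, and the clamp to the window length is an illustrative choice.

```rust
/// Hypothetical mirror of the `window` setting in the example config.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CapWindow {
    /// Hard balance, never resets.
    Balance,
    /// Rolling window of `seconds` that resets.
    Rolling { seconds: u64 },
}

/// Illustrative rejection data: status, OpenAI-style code, optional Retry-After.
#[derive(Debug, PartialEq, Eq)]
pub struct Rejection {
    pub status: u16,
    pub code: &'static str,
    pub retry_after_secs: Option<u64>,
}

/// Map an exhausted cap to the rejection the config comments describe.
/// `secs_until_reset` is an assumed input; it is clamped to the window length.
pub fn over_cap_rejection(window: &CapWindow, secs_until_reset: u64) -> Rejection {
    match window {
        // Transient: the client should back off and retry.
        CapWindow::Rolling { seconds } => Rejection {
            status: 429,
            code: "rate_limit_exceeded",
            retry_after_secs: Some(secs_until_reset.min(*seconds)),
        },
        // Permanent until topped up: no Retry-After, and never 402.
        CapWindow::Balance => Rejection {
            status: 429,
            code: "insufficient_quota",
            retry_after_secs: None,
        },
    }
}
```

Both branches return 429; only the code and the presence of a retry hint tell the client whether to retry or stop.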


@@ -1,10 +1,10 @@
Name: cortex
Version: 0.1.2
Version: 0.1.16
Release: 1%{?dist}
Summary: Inference gateway for multi-node GPU clusters
License: GPL-3.0-or-later
URL: https://git.lair.cafe/helexa/cortex
URL: https://git.lair.cafe/helexa/helexa
Source0: %{name}-%{version}.tar.gz
Source1: %{name}-%{version}-vendor.tar.gz
@@ -21,6 +21,16 @@ BuildRequires: systemd-rpm-macros
Requires(pre): shadow-utils
Requires: systemd
Requires: firewalld-filesystem
# systemd-rpm-macros ships a unit dep generator that parses User=/Group=
# from our .service file and emits Requires: user(cortex)/group(cortex).
# rpm's sysusers provides-generator emits the unversioned form for groups
# but only a versioned user(cortex) = <base64> for users with GECOS/home/
# shell. Provide the unversioned user(cortex) explicitly so dnf can resolve
# the auto-generated Requires. Without this, dnf5 silently filters the
# package and reports "Nothing to do".
Provides: user(cortex)
%description
Cortex is a Rust reverse-proxy that sits in front of multiple inference
@@ -47,9 +57,10 @@ cargo build --release -p cortex-cli
install -Dm755 target/release/cortex %{buildroot}%{_bindir}/cortex
install -Dm644 data/cortex.service %{buildroot}%{_unitdir}/cortex.service
install -Dm644 data/cortex-sysusers.conf %{buildroot}%{_sysusersdir}/cortex.conf
install -dm750 %{buildroot}%{_sysconfdir}/cortex
install -Dm640 cortex.example.toml %{buildroot}%{_sysconfdir}/cortex/cortex.toml
install -Dm640 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml
install -Dm644 data/cortex-firewalld.xml %{buildroot}%{_prefix}/lib/firewalld/services/cortex.xml
install -dm755 %{buildroot}%{_sysconfdir}/cortex
install -Dm644 cortex.example.toml %{buildroot}%{_sysconfdir}/cortex/cortex.toml
install -Dm644 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml
%pre
%sysusers_create_compat %{_builddir}/%{name}-%{version}/data/cortex-sysusers.conf
@@ -63,16 +74,53 @@ install -Dm640 models.example.toml %{buildroot}%{_sysconfdir}/cortex/models.toml
%postun
%systemd_postun_with_restart cortex.service
%posttrans
# Migration: older cortex packages shipped the firewalld service as
# `helexa-cortex` and (in some build streams) with wrong port numbers
# (9301/9302/9304). Operators who enabled that legacy service in their
# zone end up with the wrong-port override taking precedence over the
# vendor `cortex.xml` now in /usr/lib/firewalld/services/. Clean up the
# stale /etc/ override here and migrate any zone bindings to the new
# service name.
if [ -f /etc/firewalld/services/helexa-cortex.xml ]; then
    rm -f /etc/firewalld/services/helexa-cortex.xml
fi
if [ -x /usr/bin/firewall-cmd ] && /usr/bin/firewall-cmd --state >/dev/null 2>&1; then
    # Drop the legacy service name from every zone where it was enabled
    # and add the new `cortex` service in its place. Operators who never
    # ran firewall-cmd against either name see no zone change.
    for zone in $(/usr/bin/firewall-cmd --get-active-zones 2>/dev/null \
            | awk '!/^[[:space:]]/ {print $1}'); do
        if /usr/bin/firewall-cmd --permanent --zone="$zone" --query-service=helexa-cortex >/dev/null 2>&1; then
            /usr/bin/firewall-cmd --permanent --zone="$zone" --remove-service=helexa-cortex >/dev/null 2>&1 || :
            /usr/bin/firewall-cmd --permanent --zone="$zone" --add-service=cortex >/dev/null 2>&1 || :
        fi
    done
    /usr/bin/firewall-cmd --reload >/dev/null 2>&1 || :
fi
:
%files
%license LICENSE
%doc README.md
%{_bindir}/cortex
%{_unitdir}/cortex.service
%{_sysusersdir}/cortex.conf
%dir %attr(750,root,cortex) %{_sysconfdir}/cortex
%config(noreplace) %attr(640,root,cortex) %{_sysconfdir}/cortex/cortex.toml
%config(noreplace) %attr(640,root,cortex) %{_sysconfdir}/cortex/models.toml
%{_prefix}/lib/firewalld/services/cortex.xml
%dir %{_sysconfdir}/cortex
%config(noreplace) %{_sysconfdir}/cortex/cortex.toml
%config(noreplace) %{_sysconfdir}/cortex/models.toml
%changelog
* Tue Apr 15 2026 Rob Thijssen <grenade@rob.tn> - 0.1.0-1
* Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.16-1
- chore: ignore local deploy script
- chore: move default ports out of common-collision ranges
- ci: drop actions/cache for cargo registry and target
* Thu Apr 16 2026 Gitea Actions <actions@git.lair.cafe> - 0.1.14-1
- ci: publish both packages to a single helexa/helexa COPR project
- fix(rpm): rename neuron package to helexa-neuron
- ci: commit generated %changelog entries back to main
* Wed Apr 15 2026 Rob Thijssen <grenade@rob.tn> - 0.1.0-1
- Initial package


@@ -5,7 +5,7 @@ use tracing_subscriber::EnvFilter;
#[derive(Parser)]
#[command(name = "cortex")]
#[command(about = "Unified inference gateway for multi-node mistral.rs clusters")]
#[command(about = "Unified inference gateway for multi-node GPU clusters")]
#[command(version)]
struct Cli {
#[command(subcommand)]
@@ -23,7 +23,7 @@ enum Commands {
/// Print the fleet status (models, nodes, health).
Status {
/// Gateway API endpoint to query.
#[arg(short, long, default_value = "http://localhost:8000")]
#[arg(short, long, default_value = "http://localhost:31313")]
endpoint: String,
},
}


@@ -2,7 +2,7 @@
//!
//! These mirror the `/v1/messages` format used by the Anthropic API.
//! The gateway accepts these, translates to OpenAI format, proxies to
//! mistral.rs, then translates the response back.
//! the inference backend (neuron), then translates the response back.
use serde::{Deserialize, Serialize};
use serde_json::Value;


@@ -0,0 +1,119 @@
//! Build/version metadata shared between cortex and neuron.
//!
//! neuron captures these facts at compile time in its `build.rs`
//! (git SHA, enabled cargo features, rustc/candle versions, …) and
//! serves them from `GET /version`. cortex and `helexa-bench`
//! deserialize the same struct so a benchmark run can be attributed to
//! the exact daemon build that produced it — not just the host's CUDA
//! and driver versions that `/discovery` already reports.
//!
//! Every field beyond the always-present package version is
//! `#[serde(default)]` so a newer reader stays compatible with an
//! older neuron that omits a field (and vice versa) — the same
//! forward/backward-compat discipline as
//! [`crate::discovery::ActivationStatus`].
use serde::{Deserialize, Serialize};
/// Build-time identity of a neuron daemon.
///
/// Returned by `GET /version`. The `git_sha` is the canonical "which
/// build is live" key — benchmark records are bucketed by it, so a
/// regression can be pinned to a daemon change rather than a host
/// change. When neuron is built from a source tarball with no git
/// metadata available (and no `HELEXA_BUILD_SHA` injected by CI/RPM),
/// `git_sha` is the string `"unknown"`.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct BuildInfo {
/// Crate version from `CARGO_PKG_VERSION` (e.g. `"0.1.16"`).
pub package_version: String,
/// Short git SHA, or `"unknown"` when unavailable at build time.
#[serde(default = "unknown")]
pub git_sha: String,
/// Full 40-char git SHA when available.
#[serde(default)]
pub git_sha_long: Option<String>,
/// Whether the working tree had uncommitted changes at build time.
/// `false` when the SHA is unknown (tarball build).
#[serde(default)]
pub git_dirty: bool,
/// RFC3339 build timestamp.
#[serde(default)]
pub build_timestamp: Option<String>,
/// `rustc --version` output of the compiler used.
#[serde(default)]
pub rustc_version: Option<String>,
/// Cargo build profile: `"release"` or `"debug"`.
#[serde(default)]
pub profile: Option<String>,
/// Target triple the binary was compiled for.
#[serde(default)]
pub target: Option<String>,
/// Enabled cargo features (e.g. `["cuda", "cudnn"]`). These define
/// the performance envelope, so they are recorded against every
/// benchmark run.
#[serde(default)]
pub features: Vec<String>,
/// Locked `candle-core` version, best-effort from `Cargo.lock`.
#[serde(default)]
pub candle_version: Option<String>,
}
fn unknown() -> String {
"unknown".to_string()
}
impl BuildInfo {
/// A placeholder used by non-neuron benchmark targets (and tests)
/// that have no build metadata to report.
pub fn unknown() -> Self {
BuildInfo {
package_version: env!("CARGO_PKG_VERSION").to_string(),
git_sha: unknown(),
git_sha_long: None,
git_dirty: false,
build_timestamp: None,
rustc_version: None,
profile: None,
target: None,
features: Vec::new(),
candle_version: None,
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn round_trips_full() {
let info = BuildInfo {
package_version: "0.1.16".into(),
git_sha: "30d50d6".into(),
git_sha_long: Some("30d50d6abc123".into()),
git_dirty: true,
build_timestamp: Some("2026-06-13T10:00:00+00:00".into()),
rustc_version: Some("rustc 1.85.0".into()),
profile: Some("release".into()),
target: Some("x86_64-unknown-linux-gnu".into()),
features: vec!["cuda".into(), "cudnn".into()],
candle_version: Some("0.10.2".into()),
};
let json = serde_json::to_string(&info).unwrap();
let back: BuildInfo = serde_json::from_str(&json).unwrap();
assert_eq!(info, back);
}
#[test]
fn deserializes_minimal_payload() {
// An older neuron might send only the package version; every
// other field must default rather than fail.
let back: BuildInfo = serde_json::from_str(r#"{"package_version":"0.1.0"}"#).unwrap();
assert_eq!(back.package_version, "0.1.0");
assert_eq!(back.git_sha, "unknown");
assert!(!back.git_dirty);
assert!(back.features.is_empty());
assert!(back.candle_version.is_none());
}
}
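
The doc comments above say benchmark records are bucketed by `git_sha`. A minimal sketch of that bucketing, assuming a hypothetical `BenchRecord` type and a tokens-per-second metric (neither appears in this diff; only the `git_sha` field mirrors `BuildInfo`):

```rust
use std::collections::BTreeMap;

/// Hypothetical benchmark sample; only `git_sha` mirrors a `BuildInfo` field.
pub struct BenchRecord {
    pub git_sha: String,
    pub tokens_per_sec: f64,
}

/// Group samples by the build that produced them and return the mean
/// throughput per build. Tarball builds all land in the "unknown" bucket.
pub fn mean_by_build(records: &[BenchRecord]) -> BTreeMap<String, f64> {
    // Accumulate (sum, count) per git SHA.
    let mut sums: BTreeMap<String, (f64, u32)> = BTreeMap::new();
    for r in records {
        let entry = sums.entry(r.git_sha.clone()).or_insert((0.0, 0));
        entry.0 += r.tokens_per_sec;
        entry.1 += 1;
    }
    // Every entry has count >= 1, so the division is safe.
    sums.into_iter()
        .map(|(sha, (sum, n))| (sha, sum / f64::from(n)))
        .collect()
}
```

Keying on the SHA rather than the package version keeps two builds of the same version apart, which is what lets a regression be attributed to a daemon change.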


@@ -1,6 +1,9 @@
//! Model catalogue — profiles describing how to serve each model.
use crate::discovery::DeviceInfo;
use crate::harness::{ModelCost, ModelLimit};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::path::Path;
/// A model serving profile loaded from models.toml.
@@ -22,6 +25,32 @@ pub struct ModelProfile {
/// Neurons where this model should never be evicted.
#[serde(default)]
pub pinned_on: Vec<String>,
/// Source scheme this profile's weights come from. When set, the
/// router prefixes `id` with `scheme:` before forwarding the load
/// request to neuron, ensuring the daemon fetches from the right
/// registry regardless of which entry happens to match `id`.
///
/// `None` lets neuron substitute its own `default_source` (typically
/// `huggingface`). Set to `"helexa"` when the model is hosted in
/// the helexa registry — operator-procurement-grade audit relies
/// on this being explicit per model rather than implicit.
#[serde(default)]
pub source: Option<String>,
// ── Enrichment (issue #62) ────────────────────────────────
/// Per-model token budget. When present, advertised in `/v1/models`
/// so clients can size and compact their context automatically.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
/// Operator-set pricing (USD per 1M tokens). `0.0` for self-hosted.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cost: Option<ModelCost>,
/// Static capability flags the operator wants to advertise even
/// before the model is loaded on any neuron (e.g. `"reasoning"`,
/// `"tool_call"`). Runtime-detected capabilities from the harness
/// are unioned with this set in the gateway's `/v1/models` response.
#[serde(default)]
pub capabilities: Vec<String>,
}
fn default_min_devices() -> u32 {
@@ -33,6 +62,14 @@ fn default_min_devices() -> u32 {
pub struct ModelCatalogue {
#[serde(default)]
pub models: Vec<ModelProfile>,
/// Tier aliases — clients can send a request with `model: "helexa/small"`
/// and the gateway transparently rewrites + routes to the concrete
/// model id this maps to. Lets operators define latency/quality
/// tiers (`small`/`balanced`/`large`, `fast`/`thinking`, etc.)
/// without imposing knowledge of specific model ids on clients.
/// Loaded from the `[aliases]` table in models.toml.
#[serde(default)]
pub aliases: HashMap<String, String>,
}
impl ModelCatalogue {
@@ -64,4 +101,165 @@ impl ModelCatalogue {
.iter()
.any(|p| p.id == model_id && p.pinned_on.contains(&neuron_name.to_string()))
}
/// Find a profile by model id.
pub fn get(&self, model_id: &str) -> Option<&ModelProfile> {
self.models.iter().find(|p| p.id == model_id)
}
/// Resolve an alias to its concrete model id. Returns `id` verbatim
/// when it isn't an alias. Aliases never chain — operator config
/// is treated as flat — so this is a single lookup.
pub fn resolve_alias<'a>(&'a self, id: &'a str) -> &'a str {
self.aliases.get(id).map(String::as_str).unwrap_or(id)
}
}
impl ModelProfile {
    /// True iff this profile's placement constraints can be satisfied
    /// by the named neuron with the given device topology.
    ///
    /// Constraints checked:
    /// - `pinned_on`: non-empty → neuron must be on the list.
    /// - `min_devices`: neuron must have at least this many devices.
    /// - `min_device_vram_mb`: at least `min_devices` of the neuron's
    ///   devices must each meet this VRAM floor.
    pub fn is_feasible_on(&self, neuron_name: &str, devices: &[DeviceInfo]) -> bool {
        if !self.pinned_on.is_empty() && !self.pinned_on.iter().any(|n| n == neuron_name) {
            return false;
        }
        if (devices.len() as u32) < self.min_devices {
            return false;
        }
        if let Some(min_vram) = self.min_device_vram_mb {
            let big_enough = devices
                .iter()
                .filter(|d| d.vram_total_mb >= min_vram)
                .count() as u32;
            if big_enough < self.min_devices {
                return false;
            }
        }
        true
    }
}
#[cfg(test)]
mod tests {
use super::*;
use crate::discovery::DeviceInfo;
fn device(idx: u32, vram_mb: u64) -> DeviceInfo {
DeviceInfo {
index: idx,
name: format!("DEV-{idx}"),
vram_total_mb: vram_mb,
compute_capability: "8.6".into(),
}
}
fn profile() -> ModelProfile {
ModelProfile {
id: "Qwen/Qwen3.6-27B".into(),
harness: "candle".into(),
quant: None,
vram_mb: Some(45_000),
min_devices: 2,
min_device_vram_mb: Some(24_000),
pinned_on: vec![],
source: None,
limit: None,
cost: None,
capabilities: vec![],
}
}
#[test]
fn feasible_when_two_devices_meet_vram_floor() {
let p = profile();
let devices = [device(0, 32_000), device(1, 32_000)];
assert!(p.is_feasible_on("beast", &devices));
}
#[test]
fn infeasible_when_only_one_device() {
let p = profile();
let devices = [device(0, 64_000)];
assert!(!p.is_feasible_on("benjy", &devices));
}
#[test]
fn infeasible_when_one_device_underspec() {
let p = profile();
let devices = [device(0, 32_000), device(1, 12_000)];
assert!(!p.is_feasible_on("mixed", &devices));
}
#[test]
fn pinned_on_excludes_other_neurons() {
let mut p = profile();
p.pinned_on = vec!["beast".into()];
let devices = [device(0, 32_000), device(1, 32_000)];
assert!(p.is_feasible_on("beast", &devices));
assert!(!p.is_feasible_on("benjy", &devices));
}
#[test]
fn no_vram_floor_just_needs_min_devices() {
let mut p = profile();
p.min_device_vram_mb = None;
let devices = [device(0, 1_000), device(1, 1_000)];
assert!(p.is_feasible_on("anywhere", &devices));
}
#[test]
fn resolve_alias_returns_target_when_alias_present() {
let mut cat = ModelCatalogue::default();
cat.aliases
.insert("helexa/small".into(), "Qwen/Qwen3-1.7B".into());
assert_eq!(cat.resolve_alias("helexa/small"), "Qwen/Qwen3-1.7B");
}
#[test]
fn resolve_alias_passes_through_when_not_an_alias() {
let mut cat = ModelCatalogue::default();
cat.aliases
.insert("helexa/small".into(), "Qwen/Qwen3-1.7B".into());
assert_eq!(cat.resolve_alias("Qwen/Qwen3-8B"), "Qwen/Qwen3-8B");
}
#[test]
fn source_defaults_to_none_when_absent_from_toml() {
let src = r#"
[[models]]
id = "Qwen/Qwen3-30B"
harness = "candle"
"#;
let cat: ModelCatalogue = toml::from_str(src).expect("parse models table");
assert!(cat.models[0].source.is_none());
}
#[test]
fn source_round_trips_through_toml() {
let src = r#"
[[models]]
id = "Helexa/Qwen3.6-27B-Uncensored"
harness = "candle"
source = "helexa"
"#;
let cat: ModelCatalogue = toml::from_str(src).expect("parse models table");
assert_eq!(cat.models[0].source.as_deref(), Some("helexa"));
}
#[test]
fn aliases_table_round_trips_through_toml() {
let src = r#"
[aliases]
"helexa/small" = "Qwen/Qwen3-1.7B"
"helexa/large" = "Qwen/Qwen3.6-27B"
"#;
let cat: ModelCatalogue = toml::from_str(src).expect("parse aliases table");
assert_eq!(cat.resolve_alias("helexa/small"), "Qwen/Qwen3-1.7B");
assert_eq!(cat.resolve_alias("helexa/large"), "Qwen/Qwen3.6-27B");
}
}
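
The `aliases` and `source` doc comments describe two steps: a single, non-chaining alias lookup, then an optional `scheme:` prefix on the id sent to neuron. A minimal sketch of the two combined, assuming a hypothetical `upstream_model_id` helper and a plain map from model id to source in place of the real `ModelProfile` list:

```rust
use std::collections::HashMap;

/// Resolve a tier alias (one lookup, no chaining), then prefix the concrete
/// id with its source scheme when one is configured.
///
/// `sources` maps a concrete model id to its scheme; it is an illustrative
/// stand-in for looking up `ModelProfile::source`.
pub fn upstream_model_id(
    aliases: &HashMap<String, String>,
    sources: &HashMap<String, String>,
    requested: &str,
) -> String {
    // Non-aliases pass through verbatim.
    let concrete = aliases
        .get(requested)
        .map(String::as_str)
        .unwrap_or(requested);
    match sources.get(concrete) {
        // Explicit source: pin the registry with a `scheme:` prefix.
        Some(scheme) => format!("{scheme}:{concrete}"),
        // No source: leave the id bare so neuron applies its default.
        None => concrete.to_string(),
    }
}
```

Resolving the alias first means the source lookup always sees a concrete id, so an alias never needs its own `source` entry.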


@@ -1,3 +1,4 @@
use crate::entitlements::CapWindow;
use figment::{
Figment,
providers::{Env, Format, Toml},
@@ -11,20 +12,68 @@ pub struct GatewayConfig {
pub eviction: EvictionSettings,
/// Neuron endpoints (replaces old NodeConfig with static vram_mb/pinned).
pub neurons: Vec<NeuronEndpoint>,
/// Path to the model catalogue file (default: "models.toml").
/// Path to the model catalogue file. Defaults to the packaged
/// location (`/etc/cortex/models.toml`); set explicitly for
/// non-packaged / local runs.
#[serde(default = "default_models_path")]
pub models_config: String,
/// Multi-tenant governance: auth + per-key token budgets (#47). Empty
/// by default — anonymous, uncapped — so existing single-operator
/// setups keep working until keys are configured.
#[serde(default)]
pub entitlements: EntitlementsConfig,
}
/// `[entitlements]` — the local/static [`crate::entitlements::EntitlementProvider`]
/// source of truth (#50). Accounts, keys, and hard caps live here; the
/// future upstream client (#57) ignores this section.
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct EntitlementsConfig {
/// Reject unauthenticated requests with `401 invalid_api_key` when
/// true. Default `false` (allow-anonymous) for dev / single-operator
/// continuity.
#[serde(default)]
pub require_auth: bool,
/// Static API keys and their budgets, consumed by the local provider.
#[serde(default)]
pub keys: Vec<ApiKeyConfig>,
}
/// One configured API key: the bearer token, the account it bills to, and
/// its hard cap. `[[entitlements.keys]]` in TOML.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ApiKeyConfig {
/// The bearer token clients send in `Authorization: Bearer <key>`.
pub key: String,
/// Billable account. Multiple keys may share one account.
pub account_id: String,
/// Stable per-key identifier for ledger/metrics labels. Defaults to
/// `account_id` when omitted, so the secret is never used as a label.
#[serde(default)]
pub key_id: Option<String>,
/// Hard token cap. `None`/omitted = uncapped (e.g. operator infra key).
#[serde(default)]
pub hard_cap: Option<u64>,
/// Cap-window semantics. Default: a non-resetting [`CapWindow::Balance`].
#[serde(default)]
pub window: CapWindow,
}
fn default_models_path() -> String {
"models.toml".into()
// Absolute, so the systemd-launched binary finds the catalogue
// regardless of its working directory. The RPM installs the catalogue
// here (`cortex.spec`); a relative "models.toml" silently resolved to
// the service cwd and left the catalogue empty in production
// (pinning / aliases / limits all no-ops). Override via `models_config`
// in cortex.toml for local runs.
"/etc/cortex/models.toml".into()
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct GatewaySettings {
/// Address to listen on for API requests (e.g. "0.0.0.0:8000")
/// Address to listen on for API requests (e.g. "0.0.0.0:31313")
pub listen: String,
/// Address to listen on for Prometheus metrics (e.g. "0.0.0.0:9100")
/// Address to listen on for Prometheus metrics (e.g. "0.0.0.0:31314")
pub metrics_listen: String,
}
@@ -50,7 +99,7 @@ pub enum EvictionStrategy {
pub struct NeuronEndpoint {
/// Human-readable node name (e.g. "beast")
pub name: String,
/// Base URL of the neuron daemon (e.g. "http://beast.internal:9090")
/// Base URL of the neuron daemon (e.g. "http://beast.internal:13131")
pub endpoint: String,
}
@@ -70,8 +119,8 @@ impl Default for GatewayConfig {
fn default() -> Self {
Self {
gateway: GatewaySettings {
listen: "0.0.0.0:8000".into(),
metrics_listen: "0.0.0.0:9100".into(),
listen: "0.0.0.0:31313".into(),
metrics_listen: "0.0.0.0:31314".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
@@ -79,6 +128,7 @@ impl Default for GatewayConfig {
},
neurons: vec![],
models_config: default_models_path(),
entitlements: EntitlementsConfig::default(),
}
}
}
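
The `key_id` comment in `ApiKeyConfig` states that the label falls back to `account_id`, so the secret never becomes a metrics label. A minimal sketch of that fallback, using a trimmed-down copy of the struct without serde derives and a hypothetical `metrics_label` helper:

```rust
/// Trimmed-down, illustrative copy of `ApiKeyConfig` (no serde derives,
/// no cap fields).
pub struct ApiKeyConfig {
    /// The bearer token. Secret: must never be used as a label.
    pub key: String,
    pub account_id: String,
    pub key_id: Option<String>,
}

/// Label for ledger/metrics: the explicit `key_id` when set, otherwise the
/// `account_id`. The secret `key` is never returned.
pub fn metrics_label(cfg: &ApiKeyConfig) -> &str {
    cfg.key_id.as_deref().unwrap_or(cfg.account_id.as_str())
}
```

Note that two keys on the same account that both omit `key_id` would share one label under this rule, so setting `key_id` explicitly keeps per-key metrics distinct.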


@@ -22,6 +22,23 @@ pub struct DiscoveryResponse {
pub driver_version: Option<String>,
pub devices: Vec<DeviceInfo>,
pub harnesses: Vec<String>,
/// Set when the host has an NVIDIA stack that is currently
/// unusable — specifically the userspace↔kernel-module version
/// skew after an un-rebooted driver update ("Driver/library
/// version mismatch"), where every CUDA call including nvidia-smi
/// fails (#19). `None` on healthy hosts AND on hosts with no
/// NVIDIA stack at all (CPU-only is not an error). Carries an
/// operator-actionable description; cortex can read it to route
/// around the node instead of cold-loading into a guaranteed
/// failure.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cuda_unavailable_reason: Option<String>,
/// The neuron's effective maximum prompt size in tokens
/// (`NEURON_MAX_PROMPT_TOKENS`) — the enforced prompt cap on this
/// host. `#[serde(default)]` (→ 0) for forward-compat with neurons
/// that predate this field; cortex treats 0 as "unknown".
#[serde(default)]
pub max_prompt_tokens: u64,
}
/// Runtime health metrics for a single GPU device.
@@ -36,8 +53,123 @@ pub struct DeviceHealth {
/// Runtime health response from a neuron endpoint.
/// Returned by `GET /health`.
///
/// `activation` was added on 2026-05-26 to distinguish "process is up
/// and reachable" from "process is ready to serve traffic". A `Type=simple`
/// systemd unit reports `active` the moment the binary starts — but a
/// neuron whose `default_models` list takes minutes to materialise
/// won't bind its listener (or, in the new flow, won't have any models
/// loaded) until pre-warm completes. The new field is `#[serde(default)]`
/// so a pre-2026-05-26 gateway polling a new neuron — or vice versa —
/// keeps working.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct HealthResponse {
pub uptime_secs: u64,
pub devices: Vec<DeviceHealth>,
#[serde(default)]
pub activation: ActivationStatus,
/// Per-model admission load (#53): how many requests are running vs.
/// queued on each loaded model right now. Cortex's load-aware router
/// (#55) reads this to spread traffic across replicas and to propagate
/// honest backpressure. `#[serde(default)]` keeps older gateways/neurons
/// interoperable (absent → empty → treated as no load info).
#[serde(default)]
pub models: Vec<ModelLoad>,
}
/// Live admission load for one loaded model (#53).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelLoad {
pub id: String,
/// Requests currently running (batch-1 → 0 or 1).
pub in_flight: usize,
/// Requests waiting in the bounded admission queue.
pub queue_depth: usize,
}
#[cfg(test)]
mod health_load_tests {
use super::*;
#[test]
fn health_response_without_models_field_still_deserializes() {
// A pre-#53 neuron's /health payload omits `models`; the gateway
// must still parse it (serde default → empty).
let json = r#"{"uptime_secs":42,"devices":[]}"#;
let resp: HealthResponse = serde_json::from_str(json).expect("back-compat parse");
assert_eq!(resp.uptime_secs, 42);
assert!(resp.models.is_empty());
}
#[test]
fn health_response_round_trips_model_load() {
let resp = HealthResponse {
uptime_secs: 1,
devices: vec![],
activation: ActivationStatus::default(),
models: vec![ModelLoad {
id: "Qwen/Qwen3.6-27B".into(),
in_flight: 1,
queue_depth: 3,
}],
};
let s = serde_json::to_string(&resp).unwrap();
let back: HealthResponse = serde_json::from_str(&s).unwrap();
assert_eq!(back.models.len(), 1);
assert_eq!(back.models[0].in_flight, 1);
assert_eq!(back.models[0].queue_depth, 3);
}
}
/// High-level activation state of the neuron daemon. The HTTP listener
/// is bound during both states; what differs is whether the configured
/// `default_models` have finished loading.
#[derive(Debug, Clone, Copy, Serialize, Deserialize, Default, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum ActivationState {
/// At least one `default_models` entry is still loading. The
/// neuron's other endpoints work, but inference against
/// not-yet-loaded models will 404.
PreWarming,
/// Every `default_models` entry has either loaded or failed; the
/// neuron is steady-state. Subsequent on-demand loads via
/// `/models/load` don't flip back to PreWarming — that field
/// reflects the activation-time set only.
#[default]
Ready,
}
/// Per-model failure record surfaced in [`ActivationStatus::failed`].
/// The error string is the rendered anyhow chain at the time of the
/// failure; operators read it from `/health` to decide whether to
/// retry, edit the spec, or unload+reload.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PreWarmFailure {
pub model_id: String,
pub error: String,
}
/// Activation-time progress snapshot. The four progress fields are
/// populated by the neuron's pre-warm task and read by the `/health`
/// handler. The snapshot is consistent: a model id appears in exactly one
/// of `pending`, `in_progress` (an `Option<String>`), `completed`, or
/// `failed` at any point in time.
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct ActivationStatus {
pub state: ActivationState,
/// Model ids queued but not yet started. Empty in `Ready` state.
#[serde(default)]
pub pending: Vec<String>,
/// Model id currently materialising. None when between models or
/// in `Ready` state.
#[serde(default)]
pub in_progress: Option<String>,
/// Model ids that finished loading successfully during this
/// activation. Cleared on process restart.
#[serde(default)]
pub completed: Vec<String>,
/// Model ids that failed during this activation, with the rendered
/// error chain. Cleared on process restart.
#[serde(default)]
pub failed: Vec<PreWarmFailure>,
}
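
The comments above say the router reads per-model `in_flight` and `queue_depth` to spread traffic, and that `max_prompt_tokens == 0` means "unknown". A minimal sketch of both readings, assuming a hypothetical `ReplicaLoad` view and a simple "fewest running plus queued" rule; the actual router policy (#55) is not shown in this diff.

```rust
/// Hypothetical per-replica view of one model, assembled from each
/// neuron's `/health` `models` entry.
pub struct ReplicaLoad {
    pub neuron: String,
    pub in_flight: usize,
    pub queue_depth: usize,
}

/// Pick the replica with the fewest running + queued requests.
/// Ties go to the earliest candidate; an empty slice yields `None`.
pub fn pick_least_loaded(candidates: &[ReplicaLoad]) -> Option<&str> {
    candidates
        .iter()
        .min_by_key(|r| r.in_flight + r.queue_depth)
        .map(|r| r.neuron.as_str())
}

/// Interpret `DiscoveryResponse::max_prompt_tokens`: 0 is the serde
/// default from an older neuron and means "unknown", not "zero tokens".
pub fn prompt_cap(max_prompt_tokens: u64) -> Option<u64> {
    if max_prompt_tokens == 0 {
        None
    } else {
        Some(max_prompt_tokens)
    }
}
```

Converting the 0 sentinel to `Option` at the boundary keeps callers from comparing prompt sizes against a cap of zero.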


@@ -0,0 +1,145 @@
//! Identity and entitlement primitives for multi-tenant governance (#47).
//!
//! Identity is the shared substrate the whole epic hangs off:
//! `identity (principal) → accounting (spend) → policy → enforcement`. This
//! module defines the seam — the [`EntitlementProvider`] trait and its data
//! types — so the local/static provider (operator-config caps, in
//! cortex-gateway) can land the auth + per-key-cap + amplification fix
//! *before* any upstream clearing house exists. The future helexa-upstream
//! client (#57) is just another impl of this trait.
//!
//! The provider owns three jobs:
//! 1. **resolve** a bearer key to a [`Principal`] (drives auth, #49);
//! 2. **reserve → settle/release** token budget around a request so spend
//! can never overshoot a hard cap under concurrency (drives budget
//! enforcement, #52);
//! 3. expose a [`BudgetSnapshot`] for metering/metrics (#51).
//!
//! [`BudgetError`] carries the cap-window semantics so the caller can pick
//! the correct #63 rejection (`rate_limit_exceeded` + `Retry-After` for a
//! resetting window vs `insufficient_quota` for a hard balance) without the
//! provider knowing anything about HTTP.
use async_trait::async_trait;
use serde::{Deserialize, Serialize};
/// Internal header carrying the resolved account id from cortex to neuron.
/// neuron trusts these internal headers over the WireGuard link (#54); cortex **strips** any
/// client-supplied copy before stamping the authoritative value, so a client
/// can never assert a principal directly.
pub const HEADER_ACCOUNT_ID: &str = "x-helexa-account-id";
/// Internal header carrying the resolved key id from cortex to neuron.
pub const HEADER_KEY_ID: &str = "x-helexa-key-id";
/// Who a request is for. Resolved once at the edge from the bearer key and
/// carried through the request context. `account_id` is the billable owner
/// (spendable at any operator, by decision); `key_id` identifies the
/// specific API key for per-key hard caps and ledger/metrics labels.
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub struct Principal {
pub account_id: String,
pub key_id: String,
}
/// Cap-window semantics for a key's hard cap. Determines which #63 code an
/// over-cap reservation maps to.
#[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum CapWindow {
/// Hard balance — the cap never resets. Exhaustion is permanent
/// (`429 insufficient_quota`, no `Retry-After`).
#[default]
Balance,
/// Rolling window of `seconds` that resets. Exhaustion is transient
/// (`429 rate_limit_exceeded` + `Retry-After` until reset).
Rolling { seconds: u64 },
}
/// An outstanding budget reservation. The caller holds this opaque handle
/// between [`EntitlementProvider::reserve`] and exactly one of
/// [`EntitlementProvider::settle`] / [`EntitlementProvider::release`]. Not
/// `Clone` — a reservation is consumed once.
#[derive(Debug)]
pub struct Reservation {
/// Provider-local handle; opaque to the caller.
pub id: u64,
/// The principal this reservation belongs to.
pub principal: Principal,
/// Tokens reserved against the cap.
pub reserved: u64,
}
/// A point-in-time view of a key's budget, for metering and metrics (#51).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BudgetSnapshot {
/// Hard cap in tokens. `None` means uncapped (e.g. an operator infra
/// key, #58).
pub hard_cap: Option<u64>,
/// Settled spend in the current window.
pub spent: u64,
/// Sum of outstanding (un-settled) reservations.
pub reserved: u64,
}
/// Authentication failure — the bearer key could not be resolved. Maps to
/// `401 invalid_api_key` (#49/#63).
#[derive(Debug, thiserror::Error)]
pub enum AuthError {
#[error("invalid or unknown API key")]
InvalidKey,
}
/// Why a reservation was refused. Carries enough for the caller to build the
/// correct #63 envelope without the provider touching HTTP.
#[derive(Debug, thiserror::Error)]
pub enum BudgetError {
/// A resetting window is exhausted → `429 rate_limit_exceeded` +
/// `Retry-After: retry_after_secs`.
#[error(
"rolling-window budget exhausted ({requested} requested, {available} available); \
resets in {retry_after_secs}s"
)]
RateLimited {
requested: u64,
available: u64,
retry_after_secs: u64,
},
/// A hard balance is exhausted → `429 insufficient_quota` (no
/// `Retry-After`; the client surfaces and stops). Never `402`.
#[error("hard balance exhausted ({requested} requested, {available} available)")]
InsufficientQuota { requested: u64, available: u64 },
}
/// The seam between cortex's enforcement and whatever decides entitlement —
/// a local/static config provider today (#50), the helexa-upstream client
/// later (#57). All methods are async so the upstream impl can do network
/// I/O; the local impl resolves in-process.
#[async_trait]
pub trait EntitlementProvider: Send + Sync {
/// Resolve a bearer API key to its principal. `Err(InvalidKey)` for an
/// unknown/empty key.
async fn resolve(&self, api_key: &str) -> Result<Principal, AuthError>;
/// Reserve up to `max_tokens` against the principal's cap. Returns a
/// handle on success, or a [`BudgetError`] (which the caller maps to a
/// #63 `429`) if the reservation would exceed the cap. Reserving the
/// *maximum* a request could consume before dispatch is what prevents
/// overshoot under concurrency.
async fn reserve(
&self,
principal: &Principal,
max_tokens: u64,
) -> Result<Reservation, BudgetError>;
/// Settle a reservation with the tokens actually consumed, releasing the
/// unused remainder back to the cap.
async fn settle(&self, reservation: Reservation, actual_tokens: u64);
/// Release a reservation in full — e.g. dispatch failed before any
/// tokens were consumed.
async fn release(&self, reservation: Reservation);
/// Current budget snapshot for a principal, for metering/metrics.
/// `None` if the provider doesn't track this principal.
async fn snapshot(&self, principal: &Principal) -> Option<BudgetSnapshot>;
}
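
The reserve → settle/release cycle described by the trait can be sketched for a single key as below. This is a simplified, synchronous illustration under stated assumptions: `LocalBudget` is a hypothetical type, it models only a non-resetting balance, and it returns the reserved amount rather than a `Reservation` handle. The real local provider lives in cortex-gateway and is not part of this diff.

```rust
use std::sync::Mutex;

/// Hypothetical in-memory budget for one key (hard balance, no reset).
pub struct LocalBudget {
    hard_cap: Option<u64>,
    /// (spent, reserved), guarded together so a check-and-reserve is atomic.
    state: Mutex<(u64, u64)>,
}

impl LocalBudget {
    pub fn new(hard_cap: Option<u64>) -> Self {
        Self { hard_cap, state: Mutex::new((0, 0)) }
    }

    /// Reserve `max_tokens` up front. `Err(available)` when the reservation
    /// would push spent + reserved past the cap; uncapped keys always succeed.
    pub fn reserve(&self, max_tokens: u64) -> Result<u64, u64> {
        let mut s = self.state.lock().unwrap();
        if let Some(cap) = self.hard_cap {
            let available = cap.saturating_sub(s.0 + s.1);
            if max_tokens > available {
                return Err(available);
            }
        }
        s.1 += max_tokens;
        Ok(max_tokens)
    }

    /// Settle: drop the reservation and charge what was actually consumed.
    pub fn settle(&self, reserved: u64, actual_tokens: u64) {
        let mut s = self.state.lock().unwrap();
        s.1 = s.1.saturating_sub(reserved);
        s.0 += actual_tokens;
    }

    /// Release in full, e.g. dispatch failed before any tokens were used.
    pub fn release(&self, reserved: u64) {
        let mut s = self.state.lock().unwrap();
        s.1 = s.1.saturating_sub(reserved);
    }

    /// Returns (spent, reserved).
    pub fn snapshot(&self) -> (u64, u64) {
        *self.state.lock().unwrap()
    }
}
```

Because outstanding reservations count against the cap, two concurrent requests cannot both pass the check and then jointly overshoot; the unused remainder only returns to the budget at settle time.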


@@ -0,0 +1,257 @@
//! The OpenAI-standard error envelope (#60) and the rejection contract
//! that rides on it (#63).
//!
//! Every non-2xx response cortex and neuron emit uses the shape
//!
//! ```json
//! { "error": { "message": "...", "type": "...", "code": "...", "param": null } }
//! ```
//!
//! because OpenAI-compatible clients (opencode, the AI SDK, litellm, the
//! OpenAI SDKs) read `error.type` / `error.code` to decide what to do —
//! most importantly `code == "context_length_exceeded"` triggers
//! auto-compaction, and a `429` with `Retry-After` makes them back off and
//! retry rather than surfacing an opaque failure. A flat `{"error":"..."}`
//! string is invisible to that logic.
//!
//! This module is the single source of truth for that envelope. It is
//! deliberately **axum-agnostic** — cortex-core is a pure types crate — so
//! it carries the response as data (`status`, `body()`, `retry_after_secs`)
//! and each HTTP crate (cortex-gateway, neuron) owns a tiny adapter that
//! turns an [`OpenAiError`] into its framework's response type, setting the
//! `Retry-After` header when present.
//!
//! Retryable conditions **must** carry `Retry-After` (per #63). The named
//! constructors below encode that: [`OpenAiError::rate_limit_exceeded`] and
//! [`OpenAiError::service_unavailable`] take a retry hint;
//! [`OpenAiError::insufficient_quota`] (hard balance, no reset) and
//! [`OpenAiError::context_length_exceeded`] / [`OpenAiError::invalid_api_key`]
//! (permanent) do not. `402 Payment Required` is banned by the contract — use
//! `429 insufficient_quota` for hard budget exhaustion.
use serde_json::{Map, Value, json};
/// A rejection rendered in the OpenAI error envelope.
///
/// Build with [`OpenAiError::new`] (or a named constructor), refine with the
/// `with_*` builders, then hand to the consuming crate's adapter to turn into
/// an HTTP response.
#[derive(Debug, Clone)]
pub struct OpenAiError {
/// HTTP status code (e.g. `401`, `429`, `503`).
pub status: u16,
/// Broad OpenAI category — `"invalid_request_error"`, `"api_error"`,
/// `"rate_limit_error"`, …
pub error_type: String,
/// Specific machine-readable code clients key on (`"invalid_api_key"`,
/// `"rate_limit_exceeded"`, `"context_length_exceeded"`, …). `None`
/// renders as JSON `null`.
pub code: Option<String>,
/// Human-readable, actionable message.
pub message: String,
/// OpenAI's `param` field — the offending request parameter, if any.
pub param: Option<String>,
/// Seconds to advertise in the `Retry-After` header. Set only on
/// retryable conditions; `None` means no header.
pub retry_after_secs: Option<u64>,
/// Diagnostic fields merged *inside* the `error` object (e.g.
/// `prompt_len`, `max`, `free_mb`) so they don't break the envelope
/// shape. Clients ignore unknown keys.
pub extra: Map<String, Value>,
}
impl OpenAiError {
/// Construct an envelope with an explicit code. For a `null` code use
/// [`OpenAiError::without_code`].
pub fn new(
status: u16,
error_type: impl Into<String>,
code: impl Into<String>,
message: impl Into<String>,
) -> Self {
Self {
status,
error_type: error_type.into(),
code: Some(code.into()),
message: message.into(),
param: None,
retry_after_secs: None,
extra: Map::new(),
}
}
/// Construct an envelope whose `code` is `null` (e.g. an unclassified
/// internal error).
pub fn without_code(
status: u16,
error_type: impl Into<String>,
message: impl Into<String>,
) -> Self {
Self {
status,
error_type: error_type.into(),
code: None,
message: message.into(),
param: None,
retry_after_secs: None,
extra: Map::new(),
}
}
/// Advertise a `Retry-After` (seconds). Use on retryable rejections.
pub fn with_retry_after(mut self, secs: u64) -> Self {
self.retry_after_secs = Some(secs);
self
}
/// Set the OpenAI `param` field.
pub fn with_param(mut self, param: impl Into<String>) -> Self {
self.param = Some(param.into());
self
}
/// Merge one diagnostic field into the error object.
pub fn with_extra(mut self, key: impl Into<String>, value: Value) -> Self {
self.extra.insert(key.into(), value);
self
}
/// Merge a bag of diagnostic fields into the error object.
pub fn with_extras(mut self, extras: Map<String, Value>) -> Self {
for (k, v) in extras {
self.extra.insert(k, v);
}
self
}
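/// Render the `{ "error": { … } }` body. Field order is irrelevant to
/// clients (they parse JSON). Standard keys are inserted first, then any
/// diagnostic extras; `serde_json::Map` keeps that order only when the
/// `preserve_order` feature is enabled, and an extra that reuses a
/// standard key overwrites it.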
pub fn body(&self) -> Value {
let mut error = Map::new();
error.insert("message".into(), Value::String(self.message.clone()));
error.insert("type".into(), Value::String(self.error_type.clone()));
error.insert(
"code".into(),
self.code.clone().map(Value::String).unwrap_or(Value::Null),
);
error.insert(
"param".into(),
self.param.clone().map(Value::String).unwrap_or(Value::Null),
);
for (k, v) in &self.extra {
error.insert(k.clone(), v.clone());
}
json!({ "error": Value::Object(error) })
}
// ── Named constructors for the #63 standard codes ──────────────────
/// `401 invalid_api_key` — missing/invalid bearer token (#49). Permanent.
pub fn invalid_api_key(message: impl Into<String>) -> Self {
Self::new(401, "invalid_request_error", "invalid_api_key", message)
}
/// `429 rate_limit_exceeded` + `Retry-After` — transient overload,
/// fair-share/in-flight cap, admission rejection, or a rolling budget
/// window that resets (#52/#53/#54/#55). Clients back off and retry.
pub fn rate_limit_exceeded(message: impl Into<String>, retry_after_secs: u64) -> Self {
Self::new(429, "rate_limit_error", "rate_limit_exceeded", message)
.with_retry_after(retry_after_secs)
}
/// `429 insufficient_quota` — hard balance exhausted, no reset (#52).
/// No `Retry-After`; the client surfaces and stops. (Never `402`.)
pub fn insufficient_quota(message: impl Into<String>) -> Self {
Self::new(429, "insufficient_quota", "insufficient_quota", message)
}
/// `400 context_length_exceeded` — prompt exceeds the model's context
/// window (#56/#60). Permanent for this request; opencode auto-compacts.
pub fn context_length_exceeded(message: impl Into<String>) -> Self {
Self::new(
400,
"invalid_request_error",
"context_length_exceeded",
message,
)
}
/// `503 service_unavailable` + optional `Retry-After` — transient
/// backend unavailability (no healthy nodes, recovery, fail-closed
/// upstream). Retryable when a hint is given.
pub fn service_unavailable(message: impl Into<String>, retry_after_secs: Option<u64>) -> Self {
let mut err = Self::new(503, "api_error", "service_unavailable", message);
err.retry_after_secs = retry_after_secs;
err
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn body_has_standard_envelope_shape() {
let env = OpenAiError::new(429, "rate_limit_error", "rate_limit_exceeded", "slow down");
let body = env.body();
let error = body.get("error").and_then(Value::as_object).unwrap();
assert_eq!(error["message"], "slow down");
assert_eq!(error["type"], "rate_limit_error");
assert_eq!(error["code"], "rate_limit_exceeded");
assert_eq!(error["param"], Value::Null);
}
#[test]
fn without_code_renders_null_code() {
let env = OpenAiError::without_code(500, "api_error", "kaboom");
assert_eq!(env.body()["error"]["code"], Value::Null);
}
#[test]
fn extras_ride_inside_the_error_object() {
let env = OpenAiError::context_length_exceeded("too long")
.with_extra("prompt_len", json!(60_000))
.with_extra("max", json!(49_152));
let error = &env.body()["error"];
assert_eq!(error["prompt_len"], 60_000);
assert_eq!(error["max"], 49_152);
assert_eq!(error["code"], "context_length_exceeded");
}
#[test]
fn rolling_window_rejection_carries_retry_after() {
let env = OpenAiError::rate_limit_exceeded("budget window", 30);
assert_eq!(env.status, 429);
assert_eq!(env.retry_after_secs, Some(30));
}
#[test]
fn hard_balance_rejection_has_no_retry_after() {
let env = OpenAiError::insufficient_quota("out of credit");
assert_eq!(env.status, 429);
assert_eq!(env.code.as_deref(), Some("insufficient_quota"));
assert_eq!(env.retry_after_secs, None);
}
#[test]
fn permanent_rejections_have_no_retry_after() {
assert_eq!(OpenAiError::invalid_api_key("nope").retry_after_secs, None);
assert_eq!(
OpenAiError::context_length_exceeded("too long").retry_after_secs,
None
);
}
#[test]
fn service_unavailable_retry_after_is_optional() {
assert_eq!(
OpenAiError::service_unavailable("recovering", Some(5)).retry_after_secs,
Some(5)
);
assert_eq!(
OpenAiError::service_unavailable("gone", None).retry_after_secs,
None
);
}
}
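As an illustration of how the two budget rejections map onto this envelope, here is a minimal sketch. `Rejection`, `reject`, and `headers` are hypothetical names for this example only; the enum mirrors the two `BudgetError` variants shown in the diff, and the status, type, and code strings are the ones the named constructors use. The real adapter lives in each HTTP crate and is not shown in this diff.

```rust
/// Mirrors the two `BudgetError` variants visible in the diff.
pub enum BudgetError {
    RateLimited { requested: u64, available: u64, retry_after_secs: u64 },
    InsufficientQuota { requested: u64, available: u64 },
}

/// Hypothetical, trimmed-down stand-in for `OpenAiError`.
#[derive(Debug, PartialEq)]
pub struct Rejection {
    pub status: u16,
    pub error_type: &'static str,
    pub code: &'static str,
    pub retry_after_secs: Option<u64>,
}

/// Both variants answer 429; only the rolling-window one is retryable.
pub fn reject(err: &BudgetError) -> Rejection {
    match err {
        BudgetError::RateLimited { retry_after_secs, .. } => Rejection {
            status: 429,
            error_type: "rate_limit_error",
            code: "rate_limit_exceeded",
            retry_after_secs: Some(*retry_after_secs),
        },
        BudgetError::InsufficientQuota { .. } => Rejection {
            status: 429,
            error_type: "insufficient_quota",
            code: "insufficient_quota",
            retry_after_secs: None,
        },
    }
}

/// Headers an adapter would set: `Retry-After` only when a hint exists.
pub fn headers(rejection: &Rejection) -> Vec<(&'static str, String)> {
    let mut out = vec![("content-type", "application/json".to_string())];
    if let Some(secs) = rejection.retry_after_secs {
        out.push(("retry-after", secs.to_string()));
    }
    out
}
```

Keeping the header decision next to the `Option` means a permanent rejection cannot accidentally advertise a retry.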


@@ -9,13 +9,13 @@ use async_trait::async_trait;
use serde::{Deserialize, Serialize};
/// Configuration for a harness instance on a neuron.
///
/// All current harnesses are in-process (candle); per-harness tuning
/// (cache paths, device policies, etc.) lives in dedicated config
/// blocks rather than on this struct.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct HarnessConfig {
pub name: String,
/// Base URL of the harness (e.g. "http://localhost:8080" for mistral.rs).
pub endpoint: Option<String>,
/// Systemd unit name, if the harness is managed via systemd.
pub systemd_unit: Option<String>,
}
/// Health status of a harness process.
@@ -36,6 +36,44 @@ pub struct ModelSpec {
pub devices: Option<Vec<u32>>,
}
/// Per-model token budget advertised by the catalogue or neuron.
///
/// `context` is the hard wall (the served max-seq-len). `input` is the
/// compaction trigger — when set, opencode treats it as "usable context =
/// input - reserved". When omitted, clients fall back to `context - output`.
/// `output` is the maximum number of generation tokens.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelLimit {
/// Hard wall — served max-seq-len in tokens.
pub context: usize,
/// Compaction trigger / usable input budget. When absent, clients fall
/// back to `context - output`.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub input: Option<usize>,
/// Maximum number of generation tokens.
pub output: usize,
}
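A minimal sketch of the fallback rule described for `ModelLimit`, assuming the fallback is the context size minus the output budget. The struct is re-declared without serde so the snippet stands alone, and `usable_input` is a hypothetical helper name.

```rust
/// Local copy of the diff's `ModelLimit`, without the serde attributes.
pub struct ModelLimit {
    pub context: usize,
    pub input: Option<usize>,
    pub output: usize,
}

/// Usable input budget: the explicit `input` when set, otherwise the
/// context size minus the output budget (assumed fallback).
pub fn usable_input(limit: &ModelLimit) -> usize {
    limit
        .input
        .unwrap_or_else(|| limit.context.saturating_sub(limit.output))
}
```

`saturating_sub` keeps a misconfigured limit (output larger than context) from underflowing; it yields a zero budget instead.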
/// Operator-set pricing in USD per 1M tokens.
///
/// Self-hosted deployments typically leave both at `0.0`. Cache fields are
/// optional — set when the backend supports a prefix-cache discount tier.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelCost {
/// USD per 1M input (prompt) tokens.
#[serde(default)]
pub input: f64,
/// USD per 1M output (completion) tokens.
#[serde(default)]
pub output: f64,
/// USD per 1M cache-hit tokens (optional).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cache_read: Option<f64>,
/// USD per 1M cache-write tokens (optional).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cache_write: Option<f64>,
}
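To make the "USD per 1M tokens" unit concrete, the sketch below estimates the cost of one request. The billing rule is an assumption for illustration: cached prompt tokens are charged at `cache_read` (falling back to the `input` rate), and `cache_write` is left out. `estimate_usd` is a hypothetical helper, not part of the diff.

```rust
/// Local copy of the relevant `ModelCost` fields (serde attributes and
/// `cache_write` omitted for brevity).
pub struct ModelCost {
    pub input: f64,
    pub output: f64,
    pub cache_read: Option<f64>,
}

/// Estimated request cost in USD. All rates are per one million tokens.
pub fn estimate_usd(
    cost: &ModelCost,
    prompt_tokens: u64,
    cached_tokens: u64,
    completion_tokens: u64,
) -> f64 {
    const TOKENS_PER_UNIT: f64 = 1_000_000.0;
    // Cached tokens are a sub-count of the prompt, so clamp defensively.
    let cached = cached_tokens.min(prompt_tokens);
    let uncached = prompt_tokens - cached;
    // Assumption: without a cache tier, cached tokens cost the input rate.
    let cache_rate = cost.cache_read.unwrap_or(cost.input);
    (uncached as f64 * cost.input
        + cached as f64 * cache_rate
        + completion_tokens as f64 * cost.output)
        / TOKENS_PER_UNIT
}
```

With both rates at `0.0`, as the comment suggests for self-hosted deployments, the estimate is zero regardless of token counts.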
/// A model as reported by a harness.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelInfo {
@@ -44,19 +82,54 @@ pub struct ModelInfo {
pub status: String,
pub devices: Vec<u32>,
pub vram_used_mb: Option<u64>,
/// Modalities this loaded model supports. Today: `["text"]` for
/// text-only checkpoints, `["text", "vision"]` for vision-capable
/// ones (Stage B7). Clients like litellm / agent0 can gate
/// `image_url` submission on the advertised set.
///
/// Optional in the wire format so older clients that don't read
/// it stay compatible. Default-empty for absent/older data, which
/// callers can interpret as "text".
#[serde(default, skip_serializing_if = "Vec::is_empty")]
pub capabilities: Vec<String>,
// ── Enrichment (issue #62) ────────────────────────────────
/// Token budget advertised by the catalogue or discovered at load time.
/// `None` when neither the catalogue nor the loaded model can provide it.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
/// Operator-set pricing in USD per 1M tokens (0.0 = free/self-hosted).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cost: Option<ModelCost>,
/// `true` when the model's tokenizer contains recognised tool-call
/// marker tokens (`<tool_call>` / `</tool_call>` convention).
#[serde(default)]
pub tool_call: bool,
/// `true` when the model's tokenizer contains recognised reasoning
/// marker tokens (`<think>` / `</think>` or similar).
#[serde(default)]
pub reasoning: bool,
}
/// What an inference harness must do, from neuron's perspective.
///
/// All current harnesses are in-process — they share neuron's address
/// space and lifecycle. `start`/`stop` therefore default to no-ops; a
/// future process-supervising harness would override them.
#[async_trait]
pub trait Harness: Send + Sync {
/// Human-readable name (e.g. "mistralrs", "llamacpp", "comfyui").
/// Human-readable name (e.g. "candle").
fn name(&self) -> &str;
/// Start the harness process if it is not already running.
async fn start(&self, config: &HarnessConfig) -> Result<()>;
/// Start the harness. Default no-op for in-process harnesses.
async fn start(&self, _config: &HarnessConfig) -> Result<()> {
Ok(())
}
/// Stop the harness process gracefully.
async fn stop(&self) -> Result<()>;
/// Stop the harness. Default no-op for in-process harnesses.
async fn stop(&self) -> Result<()> {
Ok(())
}
/// Health check. Returns the harness process status.
async fn health(&self) -> HarnessHealth;


@@ -1,9 +1,14 @@
pub mod anthropic;
pub mod build_info;
pub mod catalogue;
pub mod config;
pub mod discovery;
pub mod entitlements;
pub mod error_envelope;
pub mod harness;
pub mod metrics;
pub mod node;
pub mod openai;
pub mod responses;
pub mod source;
pub mod translate;


@@ -1,3 +1,5 @@
use crate::discovery::{ActivationStatus, DiscoveryResponse, ModelLoad};
use crate::harness::{ModelCost, ModelLimit};
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
@@ -6,13 +8,30 @@ use std::collections::HashMap;
#[derive(Debug, Clone)]
pub struct NodeState {
pub name: String,
/// Base URL of the neuron daemon (e.g. "http://beast.internal:9090").
/// Base URL of the neuron daemon (e.g. "http://beast.internal:13131").
pub endpoint: String,
pub healthy: bool,
pub models: HashMap<String, ModelEntry>,
/// Number of load/unload cycles since last process restart.
pub lifecycle_cycles: u32,
pub last_poll: Option<DateTime<Utc>>,
/// Result of the most recent successful `GET /discovery` against
/// this neuron. Cached forever once obtained — device topology is
/// invariant for a given neuron process. `None` until the first
/// successful poll. Used by the router and `/v1/models` to do
/// catalogue × topology feasibility checks.
pub discovery: Option<DiscoveryResponse>,
/// Last-seen pre-warm progress from this neuron's `/health`
/// endpoint. `None` until the first /health poll succeeds. The
/// `/v1/models` handler reads `in_progress` + `pending` from here
/// to synthesize `Loading` locations so clients see a catalogued
/// model that's mid-prewarm as "loading", not "missing".
pub activation: Option<ActivationStatus>,
/// Last-seen per-model admission load from this neuron's `/health`
/// (#53), keyed by model id. The router (#55) reads it to pick the
/// least-busy replica when a model is loaded on more than one neuron.
/// Empty until the first /health poll reports load.
pub model_load: HashMap<String, ModelLoad>,
}
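The least-busy replica selection that `model_load` enables could look roughly like the sketch below. The fields of `ModelLoad` are not shown in this diff, so `in_flight` is a hypothetical field, and `Node` is a cut-down stand-in for `NodeState`. Skipping nodes that have not reported load for the model is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical shape: the real `ModelLoad` fields are not in this diff.
pub struct ModelLoad {
    pub in_flight: u32,
}

/// Cut-down stand-in for `NodeState`.
pub struct Node {
    pub name: String,
    pub healthy: bool,
    pub model_load: HashMap<String, ModelLoad>,
}

/// Pick the healthy node reporting the fewest in-flight requests for
/// `model`. Ties break on node name so the choice is deterministic.
pub fn least_busy<'a>(nodes: &'a [Node], model: &str) -> Option<&'a str> {
    nodes
        .iter()
        .filter(|n| n.healthy)
        .filter_map(|n| {
            n.model_load
                .get(model)
                .map(|load| (load.in_flight, n.name.as_str()))
        })
        .min()
        .map(|(_, name)| name)
}
```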
/// A model registered on a node, with its runtime status.
@@ -24,25 +43,102 @@ pub struct ModelEntry {
pub last_accessed: Option<DateTime<Utc>>,
/// Estimated VRAM usage in MB when loaded.
pub vram_estimate_mb: Option<u64>,
/// Modalities the loaded model advertises (e.g. `["text", "vision"]`),
/// copied verbatim from the neuron's `ModelInfo.capabilities` at poll
/// time. Empty when the neuron reports none. `#[serde(default)]` keeps
/// older persisted/serialised entries deserialisable.
#[serde(default)]
pub capabilities: Vec<String>,
/// Runtime-detected capability flags from the neuron's `/models`
/// response (`ModelInfo`). `false` when the neuron predates these
/// fields or hasn't reported them yet.
#[serde(default)]
pub tool_call: bool,
#[serde(default)]
pub reasoning: bool,
/// Self-derived token budget the neuron computed for this loaded
/// model (#67), copied from `ModelInfo.limit` at poll time. `None`
/// when the neuron doesn't compute one (arch without a context
/// profile, or derivation disabled). This is the authoritative
/// source the gateway advertises — operator-declared catalogue
/// limits are no longer consulted.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
}
/// Model lifecycle status.
///
/// `Loading` is a gateway-side synthetic status: neurons never emit it
/// on `/models` (that endpoint only knows about already-loaded handles).
/// The gateway populates it from a neuron's `/health` activation
/// snapshot so the unified `/v1/models` can distinguish "model is
/// catalogued but no one has it" from "model is materialising on
/// neuron N right now". Other status values are reported verbatim by
/// neurons.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum ModelStatus {
Loaded,
Unloaded,
Reloading,
Loading,
/// Reported by neuron while a poisoned model auto-recovers via
/// unload→reload (#17/#20). Temporarily unservable but NOT
/// evicted: the gateway holds the route, answers with a transient
/// retry error instead of 404, and must not race a second
/// placement elsewhere.
Recovering,
}
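A sketch of how a gateway might branch on these statuses. Only the `Recovering` arm is taken from the comments above (hold the route and answer with a transient retry error rather than a 404). How the other non-loaded statuses are handled is an assumption here, and `RouteDecision` and `decide` are hypothetical names.

```rust
/// Local copy of the diff's `ModelStatus` (serde attributes omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelStatus {
    Loaded,
    Unloaded,
    Reloading,
    Loading,
    Recovering,
}

/// Hypothetical routing outcome for one request.
#[derive(Debug, PartialEq, Eq)]
pub enum RouteDecision {
    /// Forward the request to the neuron.
    Dispatch,
    /// Keep the route and tell the client to retry shortly.
    RetryLater { retry_after_secs: u64 },
    /// Placeholder for every status the source does not specify.
    NotServable,
}

pub fn decide(status: ModelStatus, retry_after_secs: u64) -> RouteDecision {
    match status {
        ModelStatus::Loaded => RouteDecision::Dispatch,
        // Temporarily unservable but not evicted: never a 404.
        ModelStatus::Recovering => RouteDecision::RetryLater { retry_after_secs },
        // Assumption: treated alike in this sketch only.
        ModelStatus::Unloaded | ModelStatus::Reloading | ModelStatus::Loading => {
            RouteDecision::NotServable
        }
    }
}
```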
/// Unified model entry as exposed by the gateway's `/v1/models` endpoint.
/// Includes which node(s) host this model and their status.
///
/// The first four fields (`id`, `object`, `created`, `owned_by`) match
/// OpenAI's `/v1/models` shape verbatim, so existing OpenAI-aware
/// tooling deserialises this without custom code. The remaining fields
/// are helexa-specific extensions — OpenAI clients ignore unknown
/// fields and other consumers can read them for placement / debugging.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CortexModelEntry {
pub id: String,
/// Always `"model"` per OpenAI's contract.
pub object: String,
/// Which nodes have this model (and their status).
/// Unix-second timestamp; cortex stamps this at response time.
pub created: u64,
/// OpenAI's "publisher" field — `"helexa"` for everything we serve.
pub owned_by: String,
/// True if any neuron currently has this model loaded. False for
/// catalogue entries that are feasible but not yet loaded.
pub loaded: bool,
/// Neurons whose discovered topology can satisfy this model's
/// catalogue placement constraints. Empty for models that are
/// loaded somewhere but not present in the catalogue (cortex has
/// no feasibility opinion on those).
pub feasible_on: Vec<String>,
/// Where this model is actually loaded right now. Subset of (or
/// disjoint from) `feasible_on` depending on whether the catalogue
/// covers this model.
pub locations: Vec<ModelLocation>,
/// Union of the modalities advertised by every neuron that has this
/// model loaded (e.g. `["text", "vision"]`). Empty for catalogue-only
/// entries with no loaded location — filled from catalogue profile
/// capabilities when available, then unioned with runtime-detected
/// values from loaded neurons.
#[serde(default)]
pub capabilities: Vec<String>,
// ── Enrichment (issue #62) ────────────────────────────────
/// Per-model token budget from the catalogue profile or discovered
/// at load time. `None` when neither source provides it.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
/// Operator-set pricing in USD per 1M tokens (0.0 = free/self-hosted).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cost: Option<ModelCost>,
/// `true` when any neuron reports this model supports tool calls.
#[serde(default)]
pub tool_call: bool,
/// `true` when any neuron reports this model supports reasoning tokens.
#[serde(default)]
pub reasoning: bool,
}
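The "union of modalities" and "true when any neuron reports" rules above can be sketched as a small fold over per-neuron reports. `LoadedReport`, `Aggregate`, and `aggregate` are hypothetical names, and the sorted output order is a choice made for this example rather than something the diff specifies.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-neuron view of one loaded model.
pub struct LoadedReport {
    pub capabilities: Vec<String>,
    pub tool_call: bool,
    pub reasoning: bool,
}

/// Hypothetical gateway-level summary across all reporting neurons.
#[derive(Debug, PartialEq)]
pub struct Aggregate {
    pub capabilities: Vec<String>,
    pub tool_call: bool,
    pub reasoning: bool,
}

pub fn aggregate(reports: &[LoadedReport]) -> Aggregate {
    // A BTreeSet de-duplicates and yields a stable, sorted order.
    let caps: BTreeSet<&str> = reports
        .iter()
        .flat_map(|r| r.capabilities.iter().map(String::as_str))
        .collect();
    Aggregate {
        capabilities: caps.into_iter().map(str::to_owned).collect(),
        tool_call: reports.iter().any(|r| r.tool_call),
        reasoning: reports.iter().any(|r| r.reasoning),
    }
}
```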
#[derive(Debug, Clone, Serialize, Deserialize)]


@@ -3,7 +3,7 @@
//! These are a subset sufficient for chat completions (streaming + non-streaming).
//! Fields not relevant to proxying are captured as `serde_json::Value` via
//! `#[serde(flatten)]` so we forward them without needing to enumerate every
//! extension field mistral.rs supports.
//! extension field a backend might support.
use serde::{Deserialize, Serialize};
use serde_json::Value;
@@ -22,7 +22,7 @@ pub struct ChatCompletionRequest {
pub max_tokens: Option<u64>,
#[serde(skip_serializing_if = "Option::is_none")]
pub stream: Option<bool>,
/// All other fields (tools, response_format, mistral.rs extensions, etc.)
/// All other fields (tools, response_format, backend extensions, etc.)
#[serde(flatten)]
pub extra: Value,
}
@@ -71,10 +71,18 @@ pub struct ChatCompletionChoice {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ChatCompletionChunk {
#[serde(default)]
pub id: String,
#[serde(default)]
pub object: String,
#[serde(default)]
pub created: u64,
// Lenient deserialization throughout: the gateway parses chunks
// from arbitrary OpenAI-compatible upstreams, and some engines
// omit fields on special frames (e.g. usage-only final chunks).
#[serde(default)]
pub model: String,
#[serde(default)]
pub choices: Vec<ChunkChoice>,
#[serde(skip_serializing_if = "Option::is_none")]
pub usage: Option<Usage>,
@@ -98,6 +106,31 @@ pub struct Usage {
pub prompt_tokens: u64,
pub completion_tokens: u64,
pub total_tokens: u64,
/// OpenAI-standard breakdown of `completion_tokens`. Optional and
/// additive — clients that don't read it are unaffected. Carries
/// `reasoning_tokens` for reasoning models (a sub-count of
/// `completion_tokens`, never added into `total_tokens`).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub completion_tokens_details: Option<CompletionTokensDetails>,
/// OpenAI-standard breakdown of `prompt_tokens`. Populated once
/// prompt caching lands (#11); `None` until then.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub prompt_tokens_details: Option<PromptTokensDetails>,
}
/// Sub-counts of `Usage::completion_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CompletionTokensDetails {
/// Tokens generated inside the model's reasoning span.
pub reasoning_tokens: u64,
}
/// Sub-counts of `Usage::prompt_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PromptTokensDetails {
/// Prompt tokens served from cache (cache-read rate). Populated
/// once prompt caching lands (#11).
pub cached_tokens: u64,
}
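The sub-count rule for `reasoning_tokens` is easy to get wrong, so here is a minimal sketch. The flattened `Usage` struct and the helper names are hypothetical; the point is only that reasoning tokens are already included in `completion_tokens` and are never added to `total_tokens` a second time.

```rust
/// Flattened, hypothetical version of the usage block for illustration.
#[derive(Debug, PartialEq)]
pub struct Usage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
    pub reasoning_tokens: Option<u64>,
}

pub fn build_usage(
    prompt_tokens: u64,
    completion_tokens: u64,
    reasoning_tokens: Option<u64>,
) -> Usage {
    Usage {
        prompt_tokens,
        completion_tokens,
        // Reasoning tokens are part of completion_tokens already.
        total_tokens: prompt_tokens + completion_tokens,
        // Clamp so the sub-count can never exceed its parent.
        reasoning_tokens: reasoning_tokens.map(|r| r.min(completion_tokens)),
    }
}

/// Completion tokens that were not spent inside the reasoning span.
pub fn visible_tokens(usage: &Usage) -> u64 {
    usage.completion_tokens - usage.reasoning_tokens.unwrap_or(0)
}
```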
// ── Models list response ─────────────────────────────────────────────


@@ -0,0 +1,372 @@
//! OpenAI Responses API (`POST /v1/responses`) envelope types.
//!
//! This is OpenAI's newer chat surface, distinct from
//! `/v1/chat/completions` in three ways that matter for us:
//!
//! 1. **Input shape**. Instead of a `messages` array, the request
//! carries `input` — either a plain string (single user turn)
//! or an array of typed items (messages, function calls,
//! function-call outputs, reasoning blocks, …).
//! 2. **Output shape**. The response carries a single `output`
//! array of items, each typed. We always emit one
//! `OutputItem::Message` containing the assistant's reply (plus,
//! when we get there, separate `function_call` items).
//! 3. **Streaming events**. Where chat completions stream
//! structurally-identical `chat.completion.chunk` frames over
//! `data:` lines, Responses streams *named* events
//! (`response.created`, `response.output_text.delta`,
//! `response.completed`, …) over `event:` + `data:` SSE pairs.
//! The wire projector in `neuron::wire::openai_responses` builds
//! these from the same [`crate::openai`]-shaped
//! `InferenceEvent` stream the chat projector consumes.
//!
//! Scope cuts for this first pass:
//!
//! - **`previous_response_id` is rejected at parse time**. Stateful
//! chained conversations need a persistence layer we don't have.
//! - **Reasoning items are accepted-and-ignored** (no Qwen3
//! `<think>` routing yet). Audio and embedded resources are
//! rejected as unsupported.
//! - **Tool calls** (function_call / function_call_output) are
//! carried as round-trip types but the candle harness doesn't
//! emit them yet — wired so the surface is in place for the
//! day we add proper tool-call extraction.
use serde::{Deserialize, Serialize};
use serde_json::Value;
// ── Request ──────────────────────────────────────────────────────────
/// Body of a `POST /v1/responses` request.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ResponsesRequest {
pub model: String,
pub input: ResponsesInput,
/// System-prompt-style instructions. The Responses API
/// separates these from input so a caller doesn't have to
/// build a `system` message item by hand.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub instructions: Option<String>,
#[serde(default)]
pub stream: bool,
#[serde(default, skip_serializing_if = "Option::is_none")]
pub max_output_tokens: Option<u64>,
#[serde(default, skip_serializing_if = "Option::is_none")]
pub temperature: Option<f64>,
#[serde(default, skip_serializing_if = "Option::is_none")]
pub top_p: Option<f64>,
/// Chained-conversation identifier. We don't store responses
/// server-side yet; if this is `Some`, the handler returns 400.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub previous_response_id: Option<String>,
/// Catch-all for anything we don't model yet (tools, tool_choice,
/// reasoning, response_format, …). Lets a client send a
/// forward-compatible request without our parser rejecting it.
#[serde(flatten)]
pub extra: Value,
}
/// `input` is either a single string or an array of typed items.
/// `#[serde(untagged)]` so the wire shape `"input": "hi"` and
/// `"input": [{...}]` both deserialize.
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(untagged)]
pub enum ResponsesInput {
Text(String),
Items(Vec<ResponsesInputItem>),
}
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ResponsesInputItem {
/// A user / assistant / system turn.
Message {
role: String,
content: ResponsesMessageContent,
},
/// Assistant emitted a tool call. Round-trip only — neuron
/// doesn't synthesise these yet.
FunctionCall {
call_id: String,
name: String,
arguments: String,
},
/// User is feeding a tool result back into the model.
FunctionCallOutput { call_id: String, output: String },
/// Reasoning items emitted by o-series models. Accepted but
/// not forwarded to the model — neuron's candle path doesn't
/// surface reasoning separately yet.
Reasoning {
#[serde(default)]
content: Vec<Value>,
},
}
/// Inside a `Message` item, content is either a plain string or an
/// array of typed parts. Mirrors the chat-completions Parts shape.
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(untagged)]
pub enum ResponsesMessageContent {
Text(String),
Parts(Vec<ResponsesContentPart>),
}
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ResponsesContentPart {
/// Plain text inside a user / system turn.
InputText { text: String },
/// An image. `image_url` is either a remote URL or a
/// `data:image/png;base64,…` URI; the request translator just
/// forwards the string.
InputImage {
image_url: String,
#[serde(default, skip_serializing_if = "Option::is_none")]
detail: Option<String>,
},
/// Returned text inside an assistant turn — only relevant when
/// the caller is feeding an assistant turn back in to continue
/// a conversation manually (no `previous_response_id`).
OutputText {
text: String,
#[serde(default, skip_serializing_if = "Vec::is_empty")]
annotations: Vec<Value>,
},
}
// ── Response (non-streaming) ─────────────────────────────────────────
/// Body of a `POST /v1/responses` response.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ResponsesResponse {
pub id: String,
/// Always `"response"`.
pub object: String,
pub created_at: u64,
/// `"completed"`, `"incomplete"`, or — for the initial event of
/// a streaming response — `"in_progress"`.
pub status: String,
pub model: String,
pub output: Vec<ResponsesOutputItem>,
/// Populated on completion; `None` while streaming.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub usage: Option<ResponsesUsage>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ResponsesOutputItem {
Message {
id: String,
/// Always `"assistant"` for model output.
role: String,
/// Output content parts. We always emit a single
/// `OutputText` today; multi-part output would land here
/// once we have e.g. image generation.
content: Vec<ResponsesOutputContent>,
/// Item-level status. `"in_progress"` while streaming the
/// content parts, `"completed"` when done.
#[serde(default = "default_item_status")]
status: String,
},
/// Reserved for the day tool-call extraction lands. The wire
/// shape mirrors `ResponsesInputItem::FunctionCall`.
FunctionCall {
id: String,
call_id: String,
name: String,
arguments: String,
#[serde(default = "default_item_status")]
status: String,
},
}
fn default_item_status() -> String {
"completed".into()
}
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ResponsesOutputContent {
OutputText {
text: String,
/// Citations / inline annotations. Empty today; reserved
/// for the day we wire in web search / file search.
#[serde(default, skip_serializing_if = "Vec::is_empty")]
annotations: Vec<Value>,
},
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ResponsesUsage {
pub input_tokens: u64,
pub output_tokens: u64,
pub total_tokens: u64,
/// OpenAI-standard breakdown of `output_tokens`. Optional and
/// additive. Carries `reasoning_tokens` for reasoning models (a
/// sub-count of `output_tokens`, never added into `total_tokens`).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub output_tokens_details: Option<OutputTokensDetails>,
/// OpenAI-standard breakdown of `input_tokens`. Populated once
/// prompt caching lands (#11); `None` until then.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub input_tokens_details: Option<InputTokensDetails>,
}
/// Sub-counts of `ResponsesUsage::output_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OutputTokensDetails {
/// Tokens generated inside the model's reasoning span.
pub reasoning_tokens: u64,
}
/// Sub-counts of `ResponsesUsage::input_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct InputTokensDetails {
/// Input tokens served from cache (cache-read rate). Populated
/// once prompt caching lands (#11).
pub cached_tokens: u64,
}
// ── Streaming event names ────────────────────────────────────────────
/// Event names the SSE projector emits, hoisted as constants so
/// the projector and the wire shape stay in sync without
/// string-typos. The strings are dictated by OpenAI's published
/// Responses API.
pub mod events {
pub const CREATED: &str = "response.created";
/// Fired between `response.created` and the first output-item
/// event. Marks "request validated, model is generating" —
/// some clients use it to differentiate the "warming up" state
/// from "streaming tokens" in their UI.
pub const IN_PROGRESS: &str = "response.in_progress";
pub const OUTPUT_ITEM_ADDED: &str = "response.output_item.added";
pub const CONTENT_PART_ADDED: &str = "response.content_part.added";
pub const OUTPUT_TEXT_DELTA: &str = "response.output_text.delta";
pub const OUTPUT_TEXT_DONE: &str = "response.output_text.done";
pub const CONTENT_PART_DONE: &str = "response.content_part.done";
pub const OUTPUT_ITEM_DONE: &str = "response.output_item.done";
pub const COMPLETED: &str = "response.completed";
}
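For reference, the difference between the two streaming framings can be sketched as follows. Named Responses events go out as an `event:` line plus a `data:` line, while chat completions use `data:` lines only. `sse_frame` and `chat_frame` are hypothetical helpers, and both assume the JSON payload is already serialised onto a single line.

```rust
/// Frame one named Responses event. Assumes `data_json` has no newlines.
pub fn sse_frame(event: &str, data_json: &str) -> String {
    format!("event: {}\ndata: {}\n\n", event, data_json)
}

/// Frame one chat-completions chunk, which carries no event name.
pub fn chat_frame(data_json: &str) -> String {
    format!("data: {}\n\n", data_json)
}
```

The blank line that ends each frame is what tells an SSE client the event is complete.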
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn deserialises_input_string_form() {
let raw = r#"{"model": "m", "input": "hello"}"#;
let req: ResponsesRequest = serde_json::from_str(raw).unwrap();
match req.input {
ResponsesInput::Text(s) => assert_eq!(s, "hello"),
other => panic!("expected Text, got {other:?}"),
}
}
#[test]
fn deserialises_input_items_form() {
let raw = r#"{
"model": "m",
"input": [
{"type": "message", "role": "user", "content": "hi"}
]
}"#;
let req: ResponsesRequest = serde_json::from_str(raw).unwrap();
match req.input {
ResponsesInput::Items(items) => {
assert_eq!(items.len(), 1);
match &items[0] {
ResponsesInputItem::Message { role, content } => {
assert_eq!(role, "user");
match content {
ResponsesMessageContent::Text(t) => assert_eq!(t, "hi"),
other => panic!("expected Text content, got {other:?}"),
}
}
other => panic!("expected Message item, got {other:?}"),
}
}
other => panic!("expected Items, got {other:?}"),
}
}
#[test]
fn deserialises_input_with_image() {
let raw = r#"{
"model": "m",
"input": [
{"type": "message", "role": "user", "content": [
{"type": "input_text", "text": "what is this"},
{"type": "input_image", "image_url": "data:image/png;base64,AAA="}
]}
]
}"#;
let req: ResponsesRequest = serde_json::from_str(raw).unwrap();
let items = match req.input {
ResponsesInput::Items(i) => i,
other => panic!("expected Items, got {other:?}"),
};
let parts = match &items[0] {
ResponsesInputItem::Message {
content: ResponsesMessageContent::Parts(p),
..
} => p,
other => panic!("expected Parts, got {other:?}"),
};
assert_eq!(parts.len(), 2);
assert!(matches!(
&parts[0],
ResponsesContentPart::InputText { text } if text == "what is this"
));
assert!(matches!(
&parts[1],
ResponsesContentPart::InputImage { image_url, .. }
if image_url == "data:image/png;base64,AAA="
));
}
#[test]
fn unknown_fields_round_trip_via_extra() {
let raw = r#"{
"model": "m",
"input": "hi",
"tools": [{"type": "web_search"}],
"reasoning": {"effort": "medium"}
}"#;
let req: ResponsesRequest = serde_json::from_str(raw).unwrap();
assert!(req.extra.get("tools").is_some());
assert!(req.extra.get("reasoning").is_some());
}
#[test]
fn response_round_trips_through_serde() {
let r = ResponsesResponse {
id: "resp_1".into(),
object: "response".into(),
created_at: 1700,
status: "completed".into(),
model: "m".into(),
output: vec![ResponsesOutputItem::Message {
id: "msg_1".into(),
role: "assistant".into(),
content: vec![ResponsesOutputContent::OutputText {
text: "hi there".into(),
annotations: vec![],
}],
status: "completed".into(),
}],
usage: Some(ResponsesUsage {
input_tokens: 5,
output_tokens: 3,
total_tokens: 8,
output_tokens_details: None,
input_tokens_details: None,
}),
};
let json = serde_json::to_string(&r).unwrap();
let parsed: ResponsesResponse = serde_json::from_str(&json).unwrap();
assert_eq!(parsed.id, "resp_1");
assert_eq!(parsed.output.len(), 1);
}
}


@@ -0,0 +1,267 @@
//! Scheme-qualified model identifiers.
//!
//! cortex/neuron historically resolves every model id through hf-hub
//! against `https://huggingface.co`. Helexa is adding an EU-hosted
//! registry (`registry.helexa.ai`) alongside HF — both speak the same
//! HF-compatible wire format, but the bytes, jurisdiction, and trust
//! root differ. Model ids therefore need a scheme:
//!
//! - `huggingface:Qwen/Qwen3.6-27B` — HF-hosted bytes
//! - `helexa:Qwen/Qwen3.6-27B-Uncensored` — helexa registry bytes
//! - `helexa:SomeOperator/CustomFinetune` — operator publishing
//! under the helexa namespace; same scheme handles all `org/name`
//! pairs hosted in that registry.
//!
//! Bare `org/name` parses with an empty scheme; the caller (typically
//! a harness) substitutes its configured default scheme so existing
//! configs keep working through the transition.
use serde::{Deserialize, Serialize};
use std::fmt;
use std::str::FromStr;
/// Parsed `scheme:org/name`. Bare `org/name` produces an empty scheme
/// — call `with_default_scheme` (or check `is_scheme_unset`) to
/// resolve before using.
#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct ModelSourceId {
pub scheme: String,
pub org: String,
pub name: String,
}
/// Errors from `ModelSourceId::from_str`. Carries the offending input
/// so log lines / API errors can echo what the operator typed.
#[derive(Debug, Clone, PartialEq, Eq, thiserror::Error)]
pub enum ParseError {
#[error("empty model id")]
Empty,
#[error("model id '{0}' is missing the '/' between org and name")]
MissingSlash(String),
#[error("model id '{0}' has an empty scheme before ':'")]
EmptyScheme(String),
#[error("model id '{0}' has an empty org")]
EmptyOrg(String),
#[error("model id '{0}' has an empty name")]
EmptyName(String),
#[error("model id '{0}' has a scheme containing '/' which is reserved for org/name")]
SchemeContainsSlash(String),
#[error("model id '{0}' has a name containing ':' which is reserved for the scheme prefix")]
NameContainsColon(String),
}
impl ModelSourceId {
/// Construct directly from already-validated parts. Used by tests
/// and call sites that have the fields separately; the public API
/// for parsing user input is `FromStr`.
pub fn new(scheme: impl Into<String>, org: impl Into<String>, name: impl Into<String>) -> Self {
Self {
scheme: scheme.into(),
org: org.into(),
name: name.into(),
}
}
/// True when this id parsed from a bare `org/name` (no scheme
/// prefix). The harness substitutes its configured default in
/// `with_default_scheme` before resolving against a registry.
pub fn is_scheme_unset(&self) -> bool {
self.scheme.is_empty()
}
/// Substitute `default` for an empty scheme. No-op when the scheme
/// is already set. Returns self by value so it composes neatly:
/// `id.parse::<ModelSourceId>()?.with_default_scheme("huggingface")`.
pub fn with_default_scheme(mut self, default: &str) -> Self {
if self.scheme.is_empty() {
self.scheme = default.to_string();
}
self
}
/// The `org/name` half — what an hf-hub `Api::model(...)` call
/// expects regardless of which scheme/endpoint we're hitting.
pub fn repo_path(&self) -> String {
format!("{}/{}", self.org, self.name)
}
}
impl fmt::Display for ModelSourceId {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
if self.scheme.is_empty() {
write!(f, "{}/{}", self.org, self.name)
} else {
write!(f, "{}:{}/{}", self.scheme, self.org, self.name)
}
}
}
impl FromStr for ModelSourceId {
type Err = ParseError;
fn from_str(s: &str) -> Result<Self, Self::Err> {
if s.is_empty() {
return Err(ParseError::Empty);
}
// Scheme split. Only the *first* colon counts; anything after it
// belongs to org/name. A further `:` in the name is rejected below;
// note that the org half is not currently checked for `:`.
let (scheme, rest) = match s.split_once(':') {
Some((scheme, rest)) => {
if scheme.is_empty() {
return Err(ParseError::EmptyScheme(s.to_string()));
}
if scheme.contains('/') {
return Err(ParseError::SchemeContainsSlash(s.to_string()));
}
(scheme.to_string(), rest)
}
None => (String::new(), s),
};
let (org, name) = rest
.split_once('/')
.ok_or_else(|| ParseError::MissingSlash(s.to_string()))?;
if org.is_empty() {
return Err(ParseError::EmptyOrg(s.to_string()));
}
if name.is_empty() {
return Err(ParseError::EmptyName(s.to_string()));
}
if name.contains(':') {
return Err(ParseError::NameContainsColon(s.to_string()));
}
Ok(Self {
scheme,
org: org.to_string(),
name: name.to_string(),
})
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parses_qualified() {
let id: ModelSourceId = "huggingface:Qwen/Qwen3.6-27B".parse().unwrap();
assert_eq!(id.scheme, "huggingface");
assert_eq!(id.org, "Qwen");
assert_eq!(id.name, "Qwen3.6-27B");
assert_eq!(id.repo_path(), "Qwen/Qwen3.6-27B");
assert!(!id.is_scheme_unset());
}
#[test]
fn parses_helexa_scheme() {
let id: ModelSourceId = "helexa:SomeOperator/Qwen3.6-27B-Uncensored"
.parse()
.unwrap();
assert_eq!(id.scheme, "helexa");
assert_eq!(id.org, "SomeOperator");
assert_eq!(id.name, "Qwen3.6-27B-Uncensored");
}
#[test]
fn parses_bare_id_with_empty_scheme() {
let id: ModelSourceId = "Qwen/Qwen3-30B-A3B-Instruct".parse().unwrap();
assert_eq!(id.scheme, "");
assert_eq!(id.org, "Qwen");
assert_eq!(id.name, "Qwen3-30B-A3B-Instruct");
assert!(id.is_scheme_unset());
}
#[test]
fn substitutes_default_scheme_only_when_unset() {
let id: ModelSourceId = "Qwen/Q3".parse().unwrap();
assert_eq!(id.with_default_scheme("huggingface").scheme, "huggingface");
let id: ModelSourceId = "helexa:Qwen/Q3".parse().unwrap();
assert_eq!(
id.with_default_scheme("huggingface").scheme,
"helexa",
"default substitution must not override an explicit scheme"
);
}
#[test]
fn display_roundtrips_qualified_id() {
let s = "helexa:Helexa/Qwen3.6-27B";
let id: ModelSourceId = s.parse().unwrap();
assert_eq!(id.to_string(), s);
}
#[test]
fn display_roundtrips_bare_id() {
let s = "Qwen/Q3";
let id: ModelSourceId = s.parse().unwrap();
assert_eq!(id.to_string(), s);
}
#[test]
fn rejects_empty() {
assert_eq!("".parse::<ModelSourceId>().unwrap_err(), ParseError::Empty);
}
#[test]
fn rejects_missing_slash() {
match "Qwen".parse::<ModelSourceId>().unwrap_err() {
ParseError::MissingSlash(s) => assert_eq!(s, "Qwen"),
other => panic!("expected MissingSlash, got {other:?}"),
}
match "huggingface:Qwen".parse::<ModelSourceId>().unwrap_err() {
ParseError::MissingSlash(s) => assert_eq!(s, "huggingface:Qwen"),
other => panic!("expected MissingSlash, got {other:?}"),
}
}
#[test]
fn rejects_empty_scheme() {
match ":Qwen/Q3".parse::<ModelSourceId>().unwrap_err() {
ParseError::EmptyScheme(s) => assert_eq!(s, ":Qwen/Q3"),
other => panic!("expected EmptyScheme, got {other:?}"),
}
}
#[test]
fn rejects_scheme_with_slash() {
match "hugg/ingface:Q/N".parse::<ModelSourceId>().unwrap_err() {
ParseError::SchemeContainsSlash(s) => assert_eq!(s, "hugg/ingface:Q/N"),
other => panic!("expected SchemeContainsSlash, got {other:?}"),
}
}
#[test]
fn rejects_empty_org_or_name() {
match "huggingface:/N".parse::<ModelSourceId>().unwrap_err() {
ParseError::EmptyOrg(_) => {}
other => panic!("expected EmptyOrg, got {other:?}"),
}
match "huggingface:Q/".parse::<ModelSourceId>().unwrap_err() {
ParseError::EmptyName(_) => {}
other => panic!("expected EmptyName, got {other:?}"),
}
}
#[test]
fn rejects_name_with_colon() {
match "huggingface:Q/N:weird"
.parse::<ModelSourceId>()
.unwrap_err()
{
ParseError::NameContainsColon(s) => assert_eq!(s, "huggingface:Q/N:weird"),
other => panic!("expected NameContainsColon, got {other:?}"),
}
}
#[test]
fn serde_roundtrips_via_struct() {
// We serialize as a struct (scheme/org/name fields) so the
// shape is self-describing in API payloads. Callers that want
// the compact `scheme:org/name` string use `Display`/`FromStr`.
let id = ModelSourceId::new("helexa", "Helexa", "Qwen3.6-27B");
let json = serde_json::to_string(&id).unwrap();
let back: ModelSourceId = serde_json::from_str(&json).unwrap();
assert_eq!(back, id);
}
}
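
The parsing rules above can be sketched without `serde` or `thiserror`. This is a minimal illustration rather than the crate's implementation: `split_model_id` and `qualify` are hypothetical helpers that borrow from the input and collapse every `ParseError` variant into `None`.

```rust
/// Hypothetical helper (not part of the diff): split `scheme:org/name`
/// into borrowed parts. A bare `org/name` yields an empty scheme.
fn split_model_id(s: &str) -> Option<(&str, &str, &str)> {
    // Only the first ':' separates the scheme from the rest.
    let (scheme, rest) = match s.split_once(':') {
        Some((scheme, rest)) => {
            // An empty scheme, or one containing '/', is invalid.
            if scheme.is_empty() || scheme.contains('/') {
                return None;
            }
            (scheme, rest)
        }
        None => ("", s),
    };
    // The first '/' separates org from name.
    let (org, name) = rest.split_once('/')?;
    if org.is_empty() || name.is_empty() || name.contains(':') {
        return None;
    }
    Some((scheme, org, name))
}

/// Hypothetical helper: render the compact form, substituting
/// `default_scheme` only when the id carried no scheme of its own.
fn qualify(s: &str, default_scheme: &str) -> Option<String> {
    let (scheme, org, name) = split_model_id(s)?;
    let scheme = if scheme.is_empty() { default_scheme } else { scheme };
    Some(format!("{scheme}:{org}/{name}"))
}
```

Calling `qualify("Qwen/Q3", "huggingface")` fills in the default scheme, while an id that already names `helexa` is returned unchanged, which matches the behaviour the `substitutes_default_scheme_only_when_unset` test checks.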

File diff suppressed because it is too large


@@ -6,6 +6,7 @@ license.workspace = true
[dependencies]
cortex-core.workspace = true
async-trait.workspace = true
tokio.workspace = true
axum.workspace = true
tower.workspace = true
@@ -24,6 +25,7 @@ tokio-stream.workspace = true
eventsource-stream.workspace = true
bytes = "1"
urlencoding = "2"
url = "2"
[dev-dependencies]
tokio = { workspace = true, features = ["test-util"] }


@@ -0,0 +1,235 @@
//! Streaming Anthropic SSE translation (#24).
//!
//! The `/v1/messages` handler translates the request envelope to
//! OpenAI before proxying (see `cortex_core::translate`); this module
//! completes the round trip for `stream: true` — the upstream OpenAI
//! SSE stream is re-framed, event by event, into Anthropic's
//! `message_start` / `content_block_*` / `message_delta` /
//! `message_stop` sequence as it arrives. True streaming: each
//! upstream chunk is translated and forwarded immediately; nothing is
//! buffered beyond the current SSE event's bytes.
//!
//! The translation state machine itself is pure and lives in
//! [`cortex_core::translate::AnthropicStreamTranslator`]; this module
//! owns the wire concerns — splitting the upstream byte stream into
//! SSE events, parsing `data:` payloads, and framing the translated
//! events as `event: <name>\ndata: <json>\n\n`.
use axum::body::Body;
use axum::http::StatusCode;
use axum::response::Response;
use bytes::Bytes;
use cortex_core::openai::ChatCompletionChunk;
use cortex_core::translate::AnthropicStreamTranslator;
use futures::StreamExt;
use tokio_stream::wrappers::ReceiverStream;
/// Forward the translated OpenAI request to the upstream node and
/// return the response translated to Anthropic SSE framing.
pub async fn stream_translated(
client: &reqwest::Client,
endpoint: &str,
openai_body: axum::body::Bytes,
model_id: &str,
node_name: &str,
inbound_headers: &axum::http::HeaderMap,
usage_sink: Option<crate::metering::UsageSink>,
) -> Response {
let url = format!("{endpoint}/v1/chat/completions");
tracing::info!(
handler = "anthropic_messages",
model = %model_id,
node = %node_name,
url = %url,
"proxying streaming request (anthropic SSE translation)"
);
let request = crate::auth::forward_principal_headers(
client
.post(&url)
.header("content-type", "application/json")
.body(openai_body),
inbound_headers,
);
let upstream = match request.send().await {
Ok(r) => r,
Err(e) => {
tracing::warn!(
handler = "anthropic_messages",
node = %node_name,
url = %url,
error = %e,
"anthropic stream: upstream request failed"
);
return anthropic_error(StatusCode::BAD_GATEWAY, "upstream request failed");
}
};
let status = upstream.status();
if !status.is_success() {
tracing::warn!(
handler = "anthropic_messages",
node = %node_name,
url = %url,
status = status.as_u16(),
"anthropic stream: upstream returned non-2xx"
);
return anthropic_error(
StatusCode::from_u16(status.as_u16()).unwrap_or(StatusCode::BAD_GATEWAY),
"upstream returned an error",
);
}
// Bounded channel: a slow client back-pressures the pump task,
// which back-pressures the upstream read — same propagation
// discipline as neuron's own projectors.
let (tx, rx) = tokio::sync::mpsc::channel::<Result<Bytes, std::convert::Infallible>>(32);
let node = node_name.to_string();
let model = model_id.to_string();
tokio::spawn(async move {
let mut upstream = upstream.bytes_stream();
let mut translator = AnthropicStreamTranslator::new();
let mut buf: Vec<u8> = Vec::new();
let mut done = false;
// Wire-debug accounting for the stream summary emitted at the
// end: did the model emit a structured tool call, what was the
// final finish_reason, and how many upstream frames did we see.
let mut saw_tool_call = false;
let mut last_finish: Option<String> = None;
let mut frames = 0u64;
// Engine-truth usage for metering (#51), scanned from the upstream
// frames (neuron emits a final `usage` object on the stream, #48).
let mut usage_prompt = 0u64;
let mut usage_completion = 0u64;
'outer: while let Some(block) = upstream.next().await {
let block = match block {
Ok(b) => b,
Err(e) => {
tracing::warn!(node = %node, error = %e, "anthropic stream: upstream read failed mid-stream");
break;
}
};
buf.extend_from_slice(&block);
// SSE events are separated by a blank line.
while let Some(pos) = find_event_boundary(&buf) {
let event: Vec<u8> = buf.drain(..pos + 2).collect();
let text = String::from_utf8_lossy(&event);
for line in text.lines() {
let Some(data) = line.strip_prefix("data:") else {
continue;
};
let data = data.trim();
if data == "[DONE]" {
done = true;
if !send_frames(&tx, translator.finish()).await {
break 'outer;
}
continue;
}
tracing::trace!(node = %node, frame = %data, "anthropic stream: upstream frame");
// Capture usage for metering before translation — the
// usage object rides on a late frame (often after the
// last content delta).
if let Some(p) = crate::proxy::last_count_for(data, "prompt_tokens") {
usage_prompt = p;
}
if let Some(c) = crate::proxy::last_count_for(data, "completion_tokens") {
usage_completion = c;
}
let Ok(chunk) = serde_json::from_str::<ChatCompletionChunk>(data) else {
tracing::debug!(node = %node, "anthropic stream: unparsable upstream frame skipped");
continue;
};
frames += 1;
if chunk
.choices
.iter()
.any(|c| c.delta.get("tool_calls").is_some())
{
saw_tool_call = true;
}
if let Some(fr) = chunk.choices.iter().find_map(|c| c.finish_reason.clone()) {
last_finish = Some(fr);
}
if !send_frames(&tx, translator.on_chunk(&chunk)).await {
break 'outer;
}
}
}
}
// Upstream ended without [DONE] (error or truncation): still
// close the Anthropic event sequence so clients aren't left
// with an unterminated message.
if !done {
let _ = send_frames(&tx, translator.finish()).await;
}
// Stream summary: the streaming counterpart to the non-streaming
// handler's "upstream response" line. `upstream_tool_calls =
// false` on a tools-bearing request is the fingerprint of the
// model improvising an unparsed tool-call format.
tracing::debug!(
wire = "anthropic",
model = %model,
node = %node,
frames,
upstream_tool_calls = saw_tool_call,
finish_reason = ?last_finish,
terminated = done,
"anthropic stream complete"
);
// Settle metering with the observed usage (#51). Runs on every exit
// path of the pump — clean end, early break, or upstream error — so
// the reservation is always resolved. `(0, 0)` when no usage frame
// was seen, which releases without recording spend.
if let Some(sink) = usage_sink {
sink(usage_prompt, usage_completion);
}
});
Response::builder()
.status(StatusCode::OK)
.header("content-type", "text/event-stream")
.header("cache-control", "no-cache")
.body(Body::from_stream(ReceiverStream::new(rx)))
.unwrap_or_else(|_| {
anthropic_error(
StatusCode::INTERNAL_SERVER_ERROR,
"failed to build response",
)
})
}
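/// Byte offset of the `\n\n` that ends the first complete SSE event in
/// `buf`, if any. Only LF-delimited events are recognised; a `\r\n\r\n`
/// separator is not matched.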
fn find_event_boundary(buf: &[u8]) -> Option<usize> {
buf.windows(2).position(|w| w == b"\n\n")
}
/// Render translated events as SSE frames and send them. Returns
/// `false` when the client has gone away (receiver dropped).
async fn send_frames(
tx: &tokio::sync::mpsc::Sender<Result<Bytes, std::convert::Infallible>>,
events: Vec<(String, serde_json::Value)>,
) -> bool {
for (name, payload) in events {
let frame = format!("event: {name}\ndata: {payload}\n\n");
if tx.send(Ok(Bytes::from(frame))).await.is_err() {
return false;
}
}
true
}
/// Anthropic-shaped error body (`{"type":"error","error":{...}}`).
fn anthropic_error(status: StatusCode, message: &str) -> Response {
let body = serde_json::json!({
"type": "error",
"error": { "type": "api_error", "message": message }
});
Response::builder()
.status(status)
.header("content-type", "application/json")
.body(Body::from(body.to_string()))
.expect("static error response must build")
}
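
The event-splitting step of the pump loop above can be sketched on its own, using only the standard library. `drain_data_payloads` is a hypothetical helper, not a function from this diff; it assumes LF-delimited events, as `find_event_boundary` does, and leaves out translation, metering and channel sends.

```rust
/// Hypothetical helper (not part of the diff): drain every complete SSE
/// event from `buf` and return the trimmed `data:` payloads in order.
/// Bytes after the last blank-line boundary stay in `buf` so the next
/// upstream chunk can complete them.
fn drain_data_payloads(buf: &mut Vec<u8>) -> Vec<String> {
    let mut payloads = Vec::new();
    // An event is complete once its terminating blank line has arrived.
    while let Some(pos) = buf.windows(2).position(|w| w == b"\n\n") {
        // Remove the event, including both newline bytes, from the buffer.
        let event: Vec<u8> = buf.drain(..pos + 2).collect();
        let text = String::from_utf8_lossy(&event);
        for line in text.lines() {
            // Non-`data:` lines (comments, `event:`, `id:`) are skipped.
            if let Some(data) = line.strip_prefix("data:") {
                payloads.push(data.trim().to_string());
            }
        }
    }
    payloads
}
```

Feeding the buffer one network chunk at a time and calling the helper after each `extend_from_slice` yields each payload exactly once, even when an event is split across chunks.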


@@ -0,0 +1,133 @@
//! API-key authentication + principal resolution (#49).
//!
//! Identity rides standard bearer auth only — `Authorization: Bearer <key>`
//! — which is what keeps every tier OpenAI-compatible by construction (no
//! custom required headers or body fields, per #47). The middleware resolves
//! the key to a [`Principal`] via the [`EntitlementProvider`], carries it in
//! the request extensions for cortex-side metering/enforcement (#51/#52), and
//! stamps it as internal headers on the request so it reaches neuron, which
//! trusts cortex's assertion over WireGuard (#54).
//!
//! Anti-spoofing: any client-supplied principal header is **stripped** before
//! the authoritative value is stamped, so a client can never assert a
//! principal it didn't authenticate as.
//!
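//! Rejection contract (#63): under `require_auth`, a missing or unresolvable
//! key yields `401 invalid_api_key` in the #60 envelope. In allow-anonymous
//! mode (`require_auth = false`, the default) a missing or unrecognized key
//! is ignored and the request is served unauthenticated.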
use crate::error::envelope_response;
use crate::state::CortexState;
use axum::extract::{Request, State};
use axum::http::header::AUTHORIZATION;
use axum::http::{HeaderMap, HeaderValue};
use axum::middleware::Next;
use axum::response::Response;
use cortex_core::entitlements::{HEADER_ACCOUNT_ID, HEADER_KEY_ID};
use cortex_core::error_envelope::OpenAiError;
use std::sync::Arc;
/// Endpoints that never require auth: liveness/readiness probes. Everything
/// else flows through resolution.
fn is_public(path: &str) -> bool {
path == "/health" || path == "/"
}
/// Extract the bearer token from an `Authorization` header value, if present
/// and well-formed. Scheme match is case-insensitive per RFC 7235.
fn parse_bearer(headers: &HeaderMap) -> Option<String> {
let raw = headers.get(AUTHORIZATION)?.to_str().ok()?;
let (scheme, token) = raw.split_once(' ')?;
if scheme.eq_ignore_ascii_case("bearer") {
let token = token.trim();
(!token.is_empty()).then(|| token.to_string())
} else {
None
}
}
/// Axum middleware: resolve the bearer key, attach the principal, stamp the
/// internal headers. Wired in `build_app` via `from_fn_with_state`.
pub async fn require_principal(
State(fleet): State<Arc<CortexState>>,
mut req: Request,
next: Next,
) -> Response {
if is_public(req.uri().path()) {
return next.run(req).await;
}
// Anti-spoof: drop any client-supplied principal headers up front.
{
let headers = req.headers_mut();
headers.remove(HEADER_ACCOUNT_ID);
headers.remove(HEADER_KEY_ID);
}
match parse_bearer(req.headers()) {
Some(key) => match fleet.entitlements.resolve(&key).await {
Ok(principal) => {
// Stamp the authoritative principal for neuron. Account/key
// ids come from operator config, so they're valid header
// values; guard anyway and skip a malformed one rather than
// panic.
if let (Ok(account), Ok(key_id)) = (
HeaderValue::from_str(&principal.account_id),
HeaderValue::from_str(&principal.key_id),
) {
let headers = req.headers_mut();
headers.insert(HEADER_ACCOUNT_ID, account);
headers.insert(HEADER_KEY_ID, key_id);
}
// Carry the typed principal for cortex-side metering (#51)
// and budget enforcement (#52).
req.extensions_mut().insert(principal);
next.run(req).await
}
// An unrecognized key only hard-fails when auth is *required*.
// In allow-anonymous mode (the default) we must IGNORE it and
// serve the request unauthenticated — otherwise the placeholder
// keys that OpenAI-compatible clients send by default (opencode,
// Open WebUI, Agent Zero, litellm) would all break, even though
// the operator never opted into auth. Pre-#49 the bearer was
// never inspected at all; this preserves that for require_auth=false.
Err(_) => {
if fleet.require_auth {
unauthorized("invalid API key")
} else {
tracing::debug!(
"ignoring unrecognized bearer token (require_auth=false): serving anonymously"
);
next.run(req).await
}
}
},
None => {
if fleet.require_auth {
unauthorized("missing API key; supply 'Authorization: Bearer <key>'")
} else {
next.run(req).await
}
}
}
}
/// `401 invalid_api_key` in the standard envelope (#63).
fn unauthorized(message: &str) -> Response {
envelope_response(OpenAiError::invalid_api_key(message))
}
/// Copy the cortex-stamped principal headers from an inbound [`HeaderMap`]
/// onto an outbound reqwest builder. Used by the Anthropic proxy paths,
/// which construct their own upstream requests instead of going through
/// [`crate::proxy::forward_request`] (which forwards all headers verbatim).
pub fn forward_principal_headers(
mut builder: reqwest::RequestBuilder,
headers: &HeaderMap,
) -> reqwest::RequestBuilder {
for name in [HEADER_ACCOUNT_ID, HEADER_KEY_ID] {
if let Some(value) = headers.get(name) {
builder = builder.header(name, value);
}
}
builder
}
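
The branches of `require_principal` reduce to a small decision table, sketched below with plain strings in place of axum types. The `Decision` enum, `bearer_token`, `decide`, and the `known_keys` slice (standing in for the `EntitlementProvider` lookup) are all hypothetical names used only for illustration.

```rust
/// Hypothetical outcome type mirroring the middleware's three branches.
#[derive(Debug, PartialEq)]
enum Decision {
    Authenticated,
    Anonymous,
    Reject,
}

/// Extract a bearer token from a raw `Authorization` header value.
/// The scheme match is case-insensitive; an empty token counts as absent.
fn bearer_token(header: Option<&str>) -> Option<&str> {
    let (scheme, token) = header?.split_once(' ')?;
    let token = token.trim();
    (scheme.eq_ignore_ascii_case("bearer") && !token.is_empty()).then_some(token)
}

/// Decide how to treat a request. `known_keys` is an assumed stand-in
/// for key resolution; the real middleware awaits a provider instead.
fn decide(header: Option<&str>, known_keys: &[&str], require_auth: bool) -> Decision {
    match bearer_token(header) {
        // A key that resolves always authenticates.
        Some(key) if known_keys.contains(&key) => Decision::Authenticated,
        // Missing or unrecognized key: reject only when auth is required.
        _ if require_auth => Decision::Reject,
        _ => Decision::Anonymous,
    }
}
```

With `require_auth` set to `false`, a placeholder bearer such as the ones OpenAI-compatible clients send by default falls through to `Decision::Anonymous` instead of a 401.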


@@ -0,0 +1,317 @@
//! The local/static [`EntitlementProvider`] (#50).
//!
//! Accounts, keys, and hard caps come from operator config
//! ([`cortex_core::config::EntitlementsConfig`]); reservations and settled
//! spend are tracked in-process. This lands auth + per-key caps + the
//! amplification fix before any upstream clearing house exists; the future
//! helexa-upstream client (#57) implements the same trait.
//!
//! Budget math is serialized under a single [`std::sync::Mutex`] so
//! reserve/settle/release are atomic — a key's `spent + reserved` can never
//! exceed its hard cap even under concurrent requests (the #52 guarantee).
//! The lock is held only for the in-memory arithmetic, never across an
//! await.
use cortex_core::config::{ApiKeyConfig, EntitlementsConfig};
use cortex_core::entitlements::{
AuthError, BudgetError, BudgetSnapshot, CapWindow, EntitlementProvider, Principal, Reservation,
};
use std::collections::HashMap;
use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;
/// Per-key budget configuration (resolved from [`ApiKeyConfig`]).
struct Budget {
hard_cap: Option<u64>,
window: CapWindow,
}
/// Live, mutable accounting for one key over its current window.
#[derive(Default)]
struct Ledger {
/// Settled spend in the current window.
spent: u64,
/// Sum of outstanding (un-settled) reservations.
reserved: u64,
/// Start of the current rolling window; `None` until the first reserve.
/// Unused for [`CapWindow::Balance`].
window_start: Option<Instant>,
}
pub struct LocalEntitlementProvider {
/// Bearer token → principal.
keys: HashMap<String, Principal>,
/// `key_id` → budget config.
budgets: HashMap<String, Budget>,
/// `key_id` → live ledger.
ledgers: Mutex<HashMap<String, Ledger>>,
/// Monotonic source of opaque reservation handles.
next_id: AtomicU64,
}
impl LocalEntitlementProvider {
/// Build from the `[entitlements]` config. A key without an explicit
/// `key_id` is tracked at `account_id` granularity (its secret is never
/// used as a label).
pub fn from_config(config: &EntitlementsConfig) -> Self {
let mut keys = HashMap::new();
let mut budgets = HashMap::new();
for ApiKeyConfig {
key,
account_id,
key_id,
hard_cap,
window,
} in &config.keys
{
let key_id = key_id.clone().unwrap_or_else(|| account_id.clone());
keys.insert(
key.clone(),
Principal {
account_id: account_id.clone(),
key_id: key_id.clone(),
},
);
budgets.insert(
key_id,
Budget {
hard_cap: *hard_cap,
window: window.clone(),
},
);
}
Self {
keys,
budgets,
ledgers: Mutex::new(HashMap::new()),
next_id: AtomicU64::new(1),
}
}
}
/// Tokens still available under `cap` given current `spent`/`reserved`.
/// `None` cap = unlimited.
fn available(cap: Option<u64>, spent: u64, reserved: u64) -> Option<u64> {
cap.map(|c| c.saturating_sub(spent).saturating_sub(reserved))
}
#[async_trait::async_trait]
impl EntitlementProvider for LocalEntitlementProvider {
async fn resolve(&self, api_key: &str) -> Result<Principal, AuthError> {
self.keys.get(api_key).cloned().ok_or(AuthError::InvalidKey)
}
async fn reserve(
&self,
principal: &Principal,
max_tokens: u64,
) -> Result<Reservation, BudgetError> {
// A principal with no configured budget (or an uncapped one) always
// reserves; we still track spend for metrics.
let budget = self.budgets.get(&principal.key_id);
let (cap, window) = match budget {
Some(b) => (b.hard_cap, b.window.clone()),
None => (None, CapWindow::Balance),
};
let mut ledgers = self.ledgers.lock().expect("ledger mutex poisoned");
let ledger = ledgers.entry(principal.key_id.clone()).or_default();
// Lazily reset a rolling window that has elapsed before checking.
let mut retry_after_secs = 0;
if let CapWindow::Rolling { seconds } = window {
let now = Instant::now();
match ledger.window_start {
Some(start) if now.duration_since(start).as_secs() < seconds => {
retry_after_secs = seconds - now.duration_since(start).as_secs();
}
_ => {
// First reserve, or the window has fully elapsed: reset.
ledger.spent = 0;
ledger.window_start = Some(now);
retry_after_secs = seconds;
}
}
}
if let Some(avail) = available(cap, ledger.spent, ledger.reserved)
&& max_tokens > avail
{
return Err(match window {
CapWindow::Rolling { .. } => BudgetError::RateLimited {
requested: max_tokens,
available: avail,
// At least 1s so clients don't hot-loop on a sub-second
// remainder.
retry_after_secs: retry_after_secs.max(1),
},
CapWindow::Balance => BudgetError::InsufficientQuota {
requested: max_tokens,
available: avail,
},
});
}
ledger.reserved += max_tokens;
Ok(Reservation {
id: self.next_id.fetch_add(1, Ordering::Relaxed),
principal: principal.clone(),
reserved: max_tokens,
})
}
async fn settle(&self, reservation: Reservation, actual_tokens: u64) {
let mut ledgers = self.ledgers.lock().expect("ledger mutex poisoned");
if let Some(ledger) = ledgers.get_mut(&reservation.principal.key_id) {
ledger.reserved = ledger.reserved.saturating_sub(reservation.reserved);
ledger.spent += actual_tokens;
}
}
async fn release(&self, reservation: Reservation) {
let mut ledgers = self.ledgers.lock().expect("ledger mutex poisoned");
if let Some(ledger) = ledgers.get_mut(&reservation.principal.key_id) {
ledger.reserved = ledger.reserved.saturating_sub(reservation.reserved);
}
}
async fn snapshot(&self, principal: &Principal) -> Option<BudgetSnapshot> {
let ledgers = self.ledgers.lock().expect("ledger mutex poisoned");
let (spent, reserved) = ledgers
.get(&principal.key_id)
.map(|l| (l.spent, l.reserved))
.unwrap_or((0, 0));
let hard_cap = self.budgets.get(&principal.key_id).and_then(|b| b.hard_cap);
Some(BudgetSnapshot {
hard_cap,
spent,
reserved,
})
}
}
#[cfg(test)]
mod tests {
use super::*;
fn provider() -> LocalEntitlementProvider {
let config = EntitlementsConfig {
require_auth: true,
keys: vec![
ApiKeyConfig {
key: "sk-balance".into(),
account_id: "acct-a".into(),
key_id: Some("key-balance".into()),
hard_cap: Some(1_000),
window: CapWindow::Balance,
},
ApiKeyConfig {
key: "sk-rolling".into(),
account_id: "acct-b".into(),
key_id: Some("key-rolling".into()),
hard_cap: Some(500),
window: CapWindow::Rolling { seconds: 3_600 },
},
ApiKeyConfig {
key: "sk-infra".into(),
account_id: "operator".into(),
key_id: Some("key-infra".into()),
hard_cap: None,
window: CapWindow::Balance,
},
],
};
LocalEntitlementProvider::from_config(&config)
}
#[tokio::test]
async fn resolves_configured_key_to_principal() {
let p = provider();
let principal = p.resolve("sk-balance").await.expect("known key resolves");
assert_eq!(principal.account_id, "acct-a");
assert_eq!(principal.key_id, "key-balance");
}
#[tokio::test]
async fn unknown_key_is_invalid() {
let p = provider();
assert!(matches!(
p.resolve("sk-nope").await,
Err(AuthError::InvalidKey)
));
}
#[tokio::test]
async fn reserve_settle_release_round_trip() {
let p = provider();
let principal = p.resolve("sk-balance").await.unwrap();
let r = p.reserve(&principal, 400).await.expect("within cap");
// Reserved, not yet spent.
let snap = p.snapshot(&principal).await.unwrap();
assert_eq!(snap.hard_cap, Some(1_000));
assert_eq!(snap.reserved, 400);
assert_eq!(snap.spent, 0);
// Used fewer tokens than reserved → remainder released, spend exact.
p.settle(r, 250).await;
let snap = p.snapshot(&principal).await.unwrap();
assert_eq!(snap.reserved, 0);
assert_eq!(snap.spent, 250);
// A reservation that is released contributes no spend.
let r2 = p.reserve(&principal, 100).await.unwrap();
p.release(r2).await;
let snap = p.snapshot(&principal).await.unwrap();
assert_eq!(snap.reserved, 0);
assert_eq!(snap.spent, 250);
}
#[tokio::test]
async fn balance_over_cap_is_insufficient_quota() {
let p = provider();
let principal = p.resolve("sk-balance").await.unwrap();
// Reserve most of the cap, then ask for more than remains.
let _r = p.reserve(&principal, 900).await.unwrap();
let err = p.reserve(&principal, 200).await.expect_err("over cap");
match err {
BudgetError::InsufficientQuota {
requested,
available,
} => {
assert_eq!(requested, 200);
assert_eq!(available, 100);
}
other => panic!("expected InsufficientQuota, got {other:?}"),
}
}
#[tokio::test]
async fn rolling_over_cap_is_rate_limited_with_retry_after() {
let p = provider();
let principal = p.resolve("sk-rolling").await.unwrap();
let _r = p.reserve(&principal, 500).await.unwrap();
let err = p.reserve(&principal, 1).await.expect_err("over cap");
match err {
BudgetError::RateLimited {
retry_after_secs, ..
} => {
assert!(retry_after_secs >= 1, "must advertise a retry hint");
assert!(retry_after_secs <= 3_600);
}
other => panic!("expected RateLimited, got {other:?}"),
}
}
#[tokio::test]
async fn uncapped_infra_key_never_refuses() {
let p = provider();
let principal = p.resolve("sk-infra").await.unwrap();
let r = p.reserve(&principal, 10_000_000).await.expect("uncapped");
p.settle(r, 10_000_000).await;
let snap = p.snapshot(&principal).await.unwrap();
assert_eq!(snap.hard_cap, None);
assert_eq!(snap.spent, 10_000_000);
}
}
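
The reserve / settle / release arithmetic above can be followed more easily on a single key with no locking and no rolling window. `MiniLedger` is a hypothetical type written for this sketch; it returns the remaining headroom as the error value instead of a `BudgetError`.

```rust
/// Hypothetical single-key ledger (not part of the diff).
#[derive(Default)]
struct MiniLedger {
    cap: Option<u64>,
    spent: u64,
    reserved: u64,
}

impl MiniLedger {
    /// Tokens still available under the cap; `None` means unlimited.
    fn available(&self) -> Option<u64> {
        self.cap
            .map(|c| c.saturating_sub(self.spent).saturating_sub(self.reserved))
    }

    /// Hold `max_tokens` against the cap. On refusal the error carries
    /// the headroom that was actually available.
    fn reserve(&mut self, max_tokens: u64) -> Result<u64, u64> {
        if let Some(avail) = self.available() {
            if max_tokens > avail {
                return Err(avail);
            }
        }
        self.reserved += max_tokens;
        Ok(max_tokens)
    }

    /// Convert a reservation into settled spend at the observed usage.
    fn settle(&mut self, reserved: u64, actual: u64) {
        self.reserved = self.reserved.saturating_sub(reserved);
        self.spent += actual;
    }

    /// Drop a reservation without recording any spend.
    fn release(&mut self, reserved: u64) {
        self.reserved = self.reserved.saturating_sub(reserved);
    }
}
```

Reserving the request's `max_tokens` up front and settling at the observed usage means unused headroom returns to the budget as soon as the request finishes.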


@@ -0,0 +1,24 @@
//! Gateway adapter that turns the shared, axum-agnostic
//! [`cortex_core::error_envelope::OpenAiError`] into an axum [`Response`],
//! setting the `Retry-After` header when the envelope carries one.
//!
//! cortex-core owns the envelope shape and the rejection contract (#60/#63);
//! this is the only place the gateway crosses from that data into axum.
use axum::http::{HeaderValue, StatusCode, header};
use axum::response::{IntoResponse, Json, Response};
use cortex_core::error_envelope::OpenAiError;
/// Render an [`OpenAiError`] as an axum response (status + JSON envelope +
/// optional `Retry-After`).
pub fn envelope_response(err: OpenAiError) -> Response {
let status = StatusCode::from_u16(err.status).unwrap_or(StatusCode::INTERNAL_SERVER_ERROR);
let retry_after = err.retry_after_secs;
let mut response = (status, Json(err.body())).into_response();
if let Some(secs) = retry_after
&& let Ok(value) = HeaderValue::from_str(&secs.to_string())
{
response.headers_mut().insert(header::RETRY_AFTER, value);
}
response
}
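
The adapter above makes two decisions: fall back to 500 when the envelope carries a status that is not a valid HTTP status code, and attach `Retry-After` only when the envelope has a hint. A dependency-free sketch of the same decisions, assuming a hypothetical `Envelope` struct in place of `OpenAiError` and a plain header list in place of an axum `Response`:

```rust
/// Hypothetical stand-in for the two envelope fields the adapter reads.
struct Envelope {
    status: u16,
    retry_after_secs: Option<u64>,
}

/// Return the status to send and any headers to attach.
fn render(envelope: &Envelope) -> (u16, Vec<(&'static str, String)>) {
    // Mirrors `StatusCode::from_u16`, which rejects values outside 100..=999;
    // the adapter maps that rejection to 500.
    let status = if (100..=999).contains(&envelope.status) {
        envelope.status
    } else {
        500
    };
    let mut headers = Vec::new();
    // Only advertise a retry hint when the envelope actually carries one.
    if let Some(secs) = envelope.retry_after_secs {
        headers.push(("retry-after", secs.to_string()));
    }
    (status, headers)
}
```

A rate-limit envelope (`status: 429`, `retry_after_secs: Some(30)`) yields one `retry-after: 30` header; an envelope without a hint yields none.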

File diff suppressed because it is too large


@@ -1,5 +1,10 @@
pub mod anthropic_sse;
pub mod auth;
pub mod entitlements_local;
pub mod error;
pub mod evictor;
pub mod handlers;
pub mod metering;
pub mod metrics;
pub mod poller;
pub mod proxy;
@@ -8,15 +13,26 @@ pub mod state;
use anyhow::Result;
use axum::Router;
use axum::middleware::from_fn_with_state;
use cortex_core::config::GatewayConfig;
use std::sync::Arc;
use tower_http::cors::CorsLayer;
use tower_http::trace::TraceLayer;
/// Build the Axum application router with all routes wired up.
///
/// Layer order (outermost first): trace → CORS → auth → handlers. CORS is
/// outer to auth so preflight `OPTIONS` short-circuits before resolution;
/// auth (`require_principal`) resolves the bearer key, attaches the
/// principal, and stamps the internal principal headers before any handler
/// runs.
pub fn build_app(fleet: Arc<state::CortexState>) -> Router {
Router::new()
.merge(handlers::api_routes())
.layer(from_fn_with_state(
Arc::clone(&fleet),
auth::require_principal,
))
.layer(CorsLayer::permissive())
.layer(TraceLayer::new_for_http())
.with_state(fleet)
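
The layer order in the doc comment follows from how `.layer` composes: each call wraps everything added before it, so the last layer added is the outermost. The sketch below models that wrapping with plain closures rather than axum or tower; `Handler`, `layer`, `build`, and `call_order` are hypothetical names used only for illustration.

```rust
/// A request handler that records which stage it passed through.
type Handler = Box<dyn Fn(&mut Vec<&'static str>)>;

/// Wrap `inner` so that `name` runs first, the way a layer wraps a service.
fn layer(name: &'static str, inner: Handler) -> Handler {
    Box::new(move |log: &mut Vec<&'static str>| {
        log.push(name);
        inner(log);
    })
}

/// Apply the layers in the same order as `build_app`: auth, then CORS, then trace.
fn build() -> Handler {
    let handlers: Handler = Box::new(|log: &mut Vec<&'static str>| log.push("handlers"));
    let app = layer("auth", handlers);
    let app = layer("cors", app);
    layer("trace", app)
}

/// Run one request through the stack and return the order the stages ran in.
fn call_order() -> Vec<&'static str> {
    let mut log = Vec::new();
    build()(&mut log);
    log
}
```

Because trace is added last it sees the request first, and auth, added first, runs immediately before the handlers.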


@@ -0,0 +1,219 @@
//! Per-request token metering (#51).
//!
//! Captures the real `(prompt, completion)` usage of every request and feeds
//! it to two places: the [`EntitlementProvider`] spend ledger (via
//! reserve→settle) and per-principal Prometheus counters. The principal is
//! reconstructed from the internal headers the auth middleware stamped (#49),
//! so this works uniformly across every proxy path without threading the
//! typed principal through each handler.
//!
//! The reserve→settle lifecycle was established for metering alone (#51),
//! where the reservation is **zero** tokens. Budget enforcement (#52) builds
//! on it: [`reserve_or_reject`] reserves the [`reservation_estimate`] upper
//! bound (`prompt + max_output`) and turns a [`BudgetError`] into a
//! rejection; the settle/release plumbing is identical in both cases.
//!
//! [`ReservationGuard`] makes leaks impossible: settling records actual
//! spend and releases the unused remainder; dropping a guard that was never
//! settled releases the whole reservation. So an early return, error path,
//! or dropped stream can't strand a reservation.
use axum::http::HeaderMap;
use cortex_core::entitlements::{
BudgetError, EntitlementProvider, HEADER_ACCOUNT_ID, HEADER_KEY_ID, Principal,
};
use cortex_core::error_envelope::OpenAiError;
use std::sync::Arc;
/// Fallback output-token budget when neither the request nor the model's
/// advertised limit gives one. Bounds the reservation so a capped key is
/// still gated even on under-specified requests (#52).
pub const FALLBACK_MAX_OUTPUT: u64 = 4096;
/// Invoked at most once, at request completion, with best-effort
/// `(prompt_tokens, completion_tokens)`; a stream that ends or is dropped
/// without a usage object reports `(0, 0)`. If the request never reaches the
/// stream wrapper (e.g. a pre-dispatch failure) the sink is dropped unused,
/// which releases the held reservation via [`ReservationGuard`]'s `Drop`.
pub type UsageSink = Box<dyn FnOnce(u64, u64) + Send>;
/// Reconstruct the principal from the cortex-stamped internal headers. The
/// auth middleware strips any client copy and stamps the authoritative value,
/// so these headers are trustworthy within cortex. `None` for anonymous
/// (unauthenticated) requests.
pub fn principal_from_headers(headers: &HeaderMap) -> Option<Principal> {
let account_id = headers.get(HEADER_ACCOUNT_ID)?.to_str().ok()?.to_string();
let key_id = headers.get(HEADER_KEY_ID)?.to_str().ok()?.to_string();
Some(Principal { account_id, key_id })
}
/// Emit per-principal spend counters (#51). Labelled by account/key only —
/// both are operator-bounded, so cardinality is controlled.
pub fn record_spend(principal: &Principal, prompt: u64, completion: u64) {
let labels = [
("account", principal.account_id.clone()),
("key", principal.key_id.clone()),
];
metrics::counter!("cortex_spend_tokens_total", &labels).increment(prompt + completion);
metrics::counter!("cortex_spend_prompt_tokens_total", &labels).increment(prompt);
metrics::counter!("cortex_spend_completion_tokens_total", &labels).increment(completion);
}
/// Holds a budget reservation for the life of a request. [`settle`] records
/// actual spend and releases the remainder; an un-settled guard releases the
/// whole reservation when dropped. Anonymous requests carry an empty guard,
/// where every operation is a no-op.
///
/// [`settle`]: ReservationGuard::settle
pub struct ReservationGuard {
provider: Arc<dyn EntitlementProvider>,
reservation: Option<cortex_core::entitlements::Reservation>,
}
impl ReservationGuard {
/// An empty guard for an anonymous request — no reservation to resolve.
pub fn anonymous(provider: Arc<dyn EntitlementProvider>) -> Self {
Self {
provider,
reservation: None,
}
}
/// Wrap an already-acquired reservation.
fn held(
provider: Arc<dyn EntitlementProvider>,
reservation: cortex_core::entitlements::Reservation,
) -> Self {
Self {
provider,
reservation: Some(reservation),
}
}
/// Settle with the tokens actually consumed, disarming the drop-release.
/// Spawns the (fast, in-process for the local provider) settle so the
/// caller — which may be a sync stream-completion callback — needn't
/// await.
pub fn settle(mut self, actual_tokens: u64) {
if let Some(reservation) = self.reservation.take() {
let provider = Arc::clone(&self.provider);
tokio::spawn(async move {
provider.settle(reservation, actual_tokens).await;
});
}
}
}
impl Drop for ReservationGuard {
fn drop(&mut self) {
if let Some(reservation) = self.reservation.take() {
let provider = Arc::clone(&self.provider);
tokio::spawn(async move {
provider.release(reservation).await;
});
}
}
}
/// Build the completion sink for an authenticated request: record spend and
/// settle the reservation with the observed total. Dropping it unused (no
/// usage observed) releases the reservation via the guard.
pub fn usage_sink(principal: Principal, guard: ReservationGuard) -> UsageSink {
Box::new(move |prompt, completion| {
record_spend(&principal, prompt, completion);
guard.settle(prompt + completion);
})
}
/// Reserve the request's upper-bound token cost for the principal, refusing
/// *before* dispatch if it would exceed the hard cap (#52). On success
/// returns a guard the caller settles with actual usage; on refusal returns
/// the #63 envelope (`rate_limit_exceeded` + `Retry-After` for a resetting
/// window, `insufficient_quota` for a hard balance — never `402`).
pub async fn reserve_or_reject(
provider: Arc<dyn EntitlementProvider>,
principal: &Principal,
max_tokens: u64,
) -> Result<ReservationGuard, OpenAiError> {
match provider.reserve(principal, max_tokens).await {
Ok(reservation) => Ok(ReservationGuard::held(provider, reservation)),
Err(err) => Err(budget_error_to_envelope(err)),
}
}
/// Map a [`BudgetError`] to the #63 envelope. The provider chose the window
/// semantics; this only translates them to HTTP.
fn budget_error_to_envelope(err: BudgetError) -> OpenAiError {
match err {
BudgetError::RateLimited {
retry_after_secs, ..
} => OpenAiError::rate_limit_exceeded(err.to_string(), retry_after_secs),
BudgetError::InsufficientQuota { .. } => OpenAiError::insufficient_quota(err.to_string()),
}
}
/// Upper-bound tokens to reserve for a request (#52): an over-estimate of
/// the prompt plus the maximum output. `advertised_output` is the model's
/// `limit.output` (#62), used when the request omits `max_(completion_)tokens`.
/// Over-reserving is safe — settle corrects spend to the actual usage.
pub fn reservation_estimate(body: &[u8], advertised_output: Option<u64>) -> u64 {
let max_output = requested_max_output(body)
.or(advertised_output)
.unwrap_or(FALLBACK_MAX_OUTPUT);
estimate_prompt_tokens(body).saturating_add(max_output)
}
/// The client's requested output cap, from `max_completion_tokens` (or the
/// legacy `max_tokens`). `None` when unspecified.
fn requested_max_output(body: &[u8]) -> Option<u64> {
let v: serde_json::Value = serde_json::from_slice(body).ok()?;
v.get("max_completion_tokens")
.or_else(|| v.get("max_tokens"))
.and_then(serde_json::Value::as_u64)
}
/// Rough prompt-token estimate at ~4 chars/token over the whole body. cortex
/// has no tokenizer; JSON overhead makes this a conservative over-estimate,
/// and neuron remains the exact context wall (#56/#60). Settle reconciles to
/// the real usage afterward.
fn estimate_prompt_tokens(body: &[u8]) -> u64 {
(body.len() as u64 / 4).max(1)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn requested_max_output_prefers_max_completion_tokens() {
let body = br#"{"model":"m","max_completion_tokens":256,"max_tokens":99}"#;
assert_eq!(requested_max_output(body), Some(256));
}
#[test]
fn requested_max_output_falls_back_to_legacy_max_tokens() {
let body = br#"{"model":"m","max_tokens":128}"#;
assert_eq!(requested_max_output(body), Some(128));
}
#[test]
fn estimate_uses_requested_output_when_present() {
// Requested output dominates; prompt estimate is small for a tiny body.
let body = br#"{"model":"m","max_tokens":1000}"#;
let est = reservation_estimate(body, Some(8192));
assert!(est >= 1000 && est < 1100, "est was {est}");
}
#[test]
fn estimate_uses_advertised_output_when_request_omits_it() {
let body = br#"{"model":"m","messages":[]}"#;
let est = reservation_estimate(body, Some(8192));
assert!(est >= 8192, "est was {est}");
}
#[test]
fn estimate_falls_back_when_nothing_advertised() {
let body = br#"{"model":"m"}"#;
let est = reservation_estimate(body, None);
assert!(est >= FALLBACK_MAX_OUTPUT, "est was {est}");
}
}
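
The leak-freedom argument for `ReservationGuard` rests on one pattern: the reservation lives in an `Option`, `settle` takes it out, and `Drop` releases whatever is still there. A synchronous sketch of that pattern follows, assuming a hypothetical in-memory `Ledger` behind a mutex in place of the async `EntitlementProvider` (so no `tokio::spawn` is needed here).

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical in-memory ledger standing in for the entitlement provider.
#[derive(Debug, Default)]
struct Ledger {
    reserved: u64,
    spent: u64,
}

/// Holds a reservation; `held` is `None` once it has been resolved.
struct Guard {
    ledger: Arc<Mutex<Ledger>>,
    held: Option<u64>,
}

impl Guard {
    /// Reserve `tokens` and return a guard that owns the reservation.
    fn reserve(ledger: Arc<Mutex<Ledger>>, tokens: u64) -> Self {
        ledger.lock().unwrap().reserved += tokens;
        Guard { ledger, held: Some(tokens) }
    }

    /// Record actual spend and free the reservation, disarming the drop-release.
    fn settle(mut self, actual: u64) {
        if let Some(tokens) = self.held.take() {
            let mut ledger = self.ledger.lock().unwrap();
            ledger.reserved -= tokens;
            ledger.spent += actual;
        }
    }
}

impl Drop for Guard {
    /// An unsettled guard releases the whole reservation and records no spend.
    fn drop(&mut self) {
        if let Some(tokens) = self.held.take() {
            self.ledger.lock().unwrap().reserved -= tokens;
        }
    }
}

/// Read `(reserved, spent)` under a single lock.
fn snapshot(ledger: &Mutex<Ledger>) -> (u64, u64) {
    let ledger = ledger.lock().unwrap();
    (ledger.reserved, ledger.spent)
}
```

Settling a 100-token reservation with 40 actual tokens leaves `(reserved, spent) = (0, 40)`; dropping an unsettled guard returns `reserved` to its previous value and leaves `spent` untouched.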


@@ -46,6 +46,14 @@ fn describe_metrics() {
"Generation throughput in tokens per second"
);
metrics::describe_counter!("cortex_requests_total", "Total number of proxied requests");
metrics::describe_counter!(
"cortex_prompt_tokens_total",
"Total prompt tokens reported by upstream usage objects"
);
metrics::describe_counter!(
"cortex_completion_tokens_total",
"Total completion tokens reported by upstream usage objects"
);
metrics::describe_counter!(
"cortex_request_errors_total",
"Total number of failed proxy requests"
@@ -55,4 +63,16 @@ fn describe_metrics() {
"cortex_cold_starts_total",
"Total number of cold-start model loads"
);
metrics::describe_counter!(
"cortex_spend_tokens_total",
"Total metered tokens (prompt + completion) per principal, labelled by account/key (#51)"
);
metrics::describe_counter!(
"cortex_spend_prompt_tokens_total",
"Metered prompt tokens per principal, labelled by account/key (#51)"
);
metrics::describe_counter!(
"cortex_spend_completion_tokens_total",
"Metered completion tokens per principal, labelled by account/key (#51)"
);
}


@@ -3,6 +3,7 @@
use crate::state::CortexState;
use chrono::Utc;
use cortex_core::discovery::{DiscoveryResponse, HealthResponse};
use cortex_core::harness::ModelInfo;
use cortex_core::node::{ModelEntry, ModelStatus};
use std::sync::Arc;
@@ -25,7 +26,68 @@ pub async fn poll_once(fleet: &CortexState) {
}
}
/// Fetch `GET /discovery` and cache it on the NodeState — topology is
/// invariant for a given neuron process, so a successful fetch is kept.
/// Re-polled only while `max_prompt_tokens` is still unknown (0): on a
/// rolling deploy cortex can win the race and cache a neuron's discovery
/// before that neuron reports the field (it deserialises to 0). Re-polling
/// until a real cap arrives self-heals that without periodic polling.
async fn maybe_poll_discovery(fleet: &CortexState, name: &str, endpoint: &str) {
{
let nodes = fleet.nodes.read().await;
match nodes.get(name) {
Some(n)
if n.discovery
.as_ref()
.is_some_and(|d| d.max_prompt_tokens > 0) =>
{
return;
}
_ => {}
}
}
let url = format!("{endpoint}/discovery");
let resp = match fleet
.http_client
.get(&url)
.timeout(Duration::from_secs(5))
.send()
.await
{
Ok(r) if r.status().is_success() => r,
Ok(r) => {
tracing::debug!(node = name, status = %r.status(), "discovery probe non-success");
return;
}
Err(e) => {
tracing::debug!(node = name, error = %e, "discovery probe unreachable");
return;
}
};
match resp.json::<DiscoveryResponse>().await {
Ok(d) => {
let mut nodes = fleet.nodes.write().await;
if let Some(node) = nodes.get_mut(name) {
tracing::info!(
node = name,
hostname = %d.hostname,
devices = d.devices.len(),
"discovery cached"
);
node.discovery = Some(d);
}
}
Err(e) => {
tracing::warn!(node = name, error = %e, "failed to parse /discovery response");
}
}
}
async fn poll_neuron(fleet: &CortexState, name: &str, endpoint: &str) {
// Topology first — cheap once cached, and the router needs it to
// route requests against catalogue entries that aren't loaded yet.
maybe_poll_discovery(fleet, name, endpoint).await;
let url = format!("{endpoint}/models");
let result = fleet
@@ -54,12 +116,22 @@ async fn poll_neuron(fleet: &CortexState, name: &str, endpoint: &str) {
.and_modify(|e| {
e.status = status;
e.vram_estimate_mb = upstream.vram_used_mb;
e.capabilities = upstream.capabilities.clone();
e.tool_call = upstream.tool_call;
e.reasoning = upstream.reasoning;
// Neuron's self-derived limit (#67) — the
// authoritative source the gateway advertises.
e.limit = upstream.limit.clone();
})
.or_insert_with(|| ModelEntry {
id: upstream.id.clone(),
status,
last_accessed: None,
vram_estimate_mb: upstream.vram_used_mb,
capabilities: upstream.capabilities.clone(),
tool_call: upstream.tool_call,
reasoning: upstream.reasoning,
limit: upstream.limit.clone(),
});
}
@@ -89,6 +161,54 @@ async fn poll_neuron(fleet: &CortexState, name: &str, endpoint: &str) {
node.healthy = false;
}
}
// Release the write lock before the next HTTP call.
drop(nodes);
// Poll /health for the activation snapshot. We don't want this to
// flip the node to unhealthy on its own — a neuron that's serving
// /models fine is still operational even if /health is briefly
// unavailable — so failures are debug-level and leave the existing
// activation reading in place.
poll_health(fleet, name, endpoint).await;
}
/// Fetch `/health` and stash the activation snapshot on NodeState.
/// Decoupled from the /models poll so a /health glitch doesn't mark
/// the neuron unhealthy or evict the model list.
async fn poll_health(fleet: &CortexState, name: &str, endpoint: &str) {
let url = format!("{endpoint}/health");
let resp = match fleet
.http_client
.get(&url)
.timeout(Duration::from_secs(5))
.send()
.await
{
Ok(r) if r.status().is_success() => r,
Ok(r) => {
tracing::debug!(node = name, status = %r.status(), "/health probe non-success");
return;
}
Err(e) => {
tracing::debug!(node = name, error = %e, "/health probe failed");
return;
}
};
match resp.json::<HealthResponse>().await {
Ok(h) => {
let mut nodes = fleet.nodes.write().await;
if let Some(node) = nodes.get_mut(name) {
node.activation = Some(h.activation);
// Per-model admission load (#53) → keyed by id for the
// load-aware router (#55).
node.model_load = h.models.into_iter().map(|m| (m.id.clone(), m)).collect();
}
}
Err(e) => {
tracing::debug!(node = name, error = %e, "failed to parse /health response");
}
}
}
fn parse_status(s: &str) -> ModelStatus {
@@ -96,6 +216,8 @@ fn parse_status(s: &str) -> ModelStatus {
"loaded" => ModelStatus::Loaded,
"unloaded" => ModelStatus::Unloaded,
"reloading" => ModelStatus::Reloading,
"loading" => ModelStatus::Loading,
"recovering" => ModelStatus::Recovering,
_ => ModelStatus::Loaded,
}
}
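
The discovery re-poll rule in `maybe_poll_discovery` reduces to a single predicate: keep polling until a cached discovery reports a non-zero `max_prompt_tokens`. A small sketch of that predicate, assuming a hypothetical one-field `Discovery` struct in place of `DiscoveryResponse`:

```rust
/// Hypothetical slice of the cached discovery payload.
struct Discovery {
    max_prompt_tokens: u64,
}

/// True when discovery must be fetched (again): nothing is cached yet, or the
/// cached copy predates the neuron reporting a real prompt cap (field is 0).
fn needs_discovery_poll(cached: Option<&Discovery>) -> bool {
    !cached.is_some_and(|d| d.max_prompt_tokens > 0)
}
```

A node with no cached discovery, or one cached with a cap of 0, is polled again; once a real cap arrives the node is skipped on later poll cycles.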


@@ -1,4 +1,4 @@
//! Streaming HTTP reverse proxy to mistral.rs backends.
//! Streaming HTTP reverse proxy to neuron backends.
//!
//! For streaming requests, SSE chunks are forwarded as they arrive.
//! The proxy captures timing information for metrics but does not
@@ -9,16 +9,31 @@ use anyhow::Result;
use axum::body::Body;
use axum::http::{HeaderMap, StatusCode};
use axum::response::{IntoResponse, Response};
use futures::Stream;
use futures::stream::BoxStream;
use reqwest::Client;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Instant;
/// Proxy a request body to the resolved backend node and stream the response.
///
/// Logging contract: every call emits exactly one structured event at
/// info / warn level for operator visibility, regardless of outcome.
/// Network-level failures and non-2xx upstream statuses are warn'd here
/// (closest to the wire); the user-facing response carries only the
/// status code and a generic message — implementation detail (body,
/// error chain) lives in the log, never in the API surface.
pub async fn forward_request(
client: &Client,
route: &RouteDecision,
path: &str,
headers: HeaderMap,
body: bytes::Bytes,
model_id: &str,
usage_sink: Option<crate::metering::UsageSink>,
) -> Result<Response, ProxyError> {
let request_start = Instant::now();
let url = format!("{}{}", route.endpoint, path);
tracing::info!(
node = %route.node_name,
@@ -37,13 +52,39 @@ pub async fn forward_request(
req_builder = req_builder.header(key, value);
}
let upstream_resp = req_builder.send().await.map_err(ProxyError::Upstream)?;
let upstream_resp = match req_builder.send().await {
Ok(r) => r,
Err(e) => {
tracing::warn!(
node = %route.node_name,
url = %url,
error = %e,
"proxy: upstream request failed (network)"
);
return Err(ProxyError::Upstream(e));
}
};
let status =
StatusCode::from_u16(upstream_resp.status().as_u16()).unwrap_or(StatusCode::BAD_GATEWAY);
let upstream_status = upstream_resp.status();
if !upstream_status.is_success() {
// Streaming body — can't snippet without breaking the stream
// pass-through. Log status + URL; the client still gets the
// upstream status, just without the leaked body.
tracing::warn!(
node = %route.node_name,
url = %url,
status = upstream_status.as_u16(),
"proxy: upstream returned non-2xx"
);
}
let status = StatusCode::from_u16(upstream_status.as_u16()).unwrap_or(StatusCode::BAD_GATEWAY);
let resp_headers = upstream_resp.headers().clone();
let stream = upstream_resp.bytes_stream();
let stream = TokenMetricsStream::new(
Box::pin(upstream_resp.bytes_stream()),
TokenMetrics::new(model_id, &route.node_name, request_start, usage_sink),
);
let body = Body::from_stream(stream);
@@ -52,31 +93,284 @@ pub async fn forward_request(
response = response.header(key, value);
}
response
.body(body)
.map_err(|e| ProxyError::ResponseBuild(e.to_string()))
response.body(body).map_err(|e| {
tracing::warn!(
node = %route.node_name,
url = %url,
error = %e,
"proxy: failed to build response"
);
ProxyError::ResponseBuild(e.to_string())
})
}
#[derive(Debug, thiserror::Error)]
pub enum ProxyError {
#[error("upstream request failed: {0}")]
#[error("upstream request failed")]
Upstream(reqwest::Error),
#[error("failed to build response: {0}")]
#[error("failed to build response")]
ResponseBuild(String),
}
impl IntoResponse for ProxyError {
fn into_response(self) -> Response {
let status = match &self {
ProxyError::Upstream(_) => StatusCode::BAD_GATEWAY,
ProxyError::ResponseBuild(_) => StatusCode::INTERNAL_SERVER_ERROR,
let (status, code, message) = match &self {
ProxyError::Upstream(_) => (
StatusCode::BAD_GATEWAY,
"upstream_connection_error",
"upstream request failed",
),
ProxyError::ResponseBuild(_) => (
StatusCode::INTERNAL_SERVER_ERROR,
"internal_server_error",
"failed to build response",
),
};
let body = serde_json::json!({
"error": {
"message": self.to_string(),
"type": "proxy_error",
}
});
(status, axum::Json(body)).into_response()
crate::error::envelope_response(cortex_core::error_envelope::OpenAiError::new(
status.as_u16(),
"api_error",
code,
message,
))
}
}
// ── Per-request token metrics (#21) ─────────────────────────────────
//
// The proxy never buffers or re-serialises the upstream body — chunks
// are forwarded verbatim. For metrics it observes each chunk's arrival
// time and keeps a bounded tail of the body text, from which the final
// OpenAI `usage` object (present on the last SSE chunk and on
// non-streaming JSON bodies alike) yields engine-truth token counts.
//
// Emitted per request, labelled {model, node}:
// cortex_time_to_first_token_seconds (histogram) — first body chunk
// cortex_tokens_per_second (histogram) — completion tokens
// over the decode window (first→last chunk); falls back to the
// full request duration for single-chunk (non-streaming) bodies
// cortex_prompt_tokens_total / cortex_completion_tokens_total (counters)
/// Cap on the retained body tail. The usage object rides on the final
/// chunk, so a generous tail is plenty; the cap bounds memory on huge
/// non-streaming bodies.
const TAIL_CAP_BYTES: usize = 64 * 1024;
/// Find the value of the LAST `"key": <integer>` occurrence in `tail`.
/// Pure and chunk-boundary-safe (the tail is contiguous appended text).
/// The quoted-needle form means `completion_tokens` never matches
/// `completion_tokens_details`.
pub(crate) fn last_count_for(tail: &str, key: &str) -> Option<u64> {
let needle = format!("\"{key}\"");
let mut result = None;
for (idx, _) in tail.match_indices(&needle) {
let rest = tail[idx + needle.len()..].trim_start();
let Some(rest) = rest.strip_prefix(':') else {
continue;
};
let rest = rest.trim_start();
let digits: &str = &rest[..rest
.char_indices()
.find(|(_, c)| !c.is_ascii_digit())
.map(|(i, _)| i)
.unwrap_or(rest.len())];
if let Ok(v) = digits.parse::<u64>() {
result = Some(v);
}
}
result
}
struct TokenMetrics {
labels: [(&'static str, String); 2],
request_start: Instant,
first_chunk: Option<Instant>,
last_chunk: Option<Instant>,
tail: String,
finished: bool,
/// Per-principal metering hook (#51). Invoked exactly once in `finish`
/// with the observed `(prompt, completion)` so the reservation can be
/// settled and spend recorded. `None` for anonymous requests.
usage_sink: Option<crate::metering::UsageSink>,
}
impl TokenMetrics {
fn new(
model_id: &str,
node_name: &str,
request_start: Instant,
usage_sink: Option<crate::metering::UsageSink>,
) -> Self {
Self {
labels: [
("model", model_id.to_string()),
("node", node_name.to_string()),
],
request_start,
first_chunk: None,
last_chunk: None,
tail: String::new(),
finished: false,
usage_sink,
}
}
fn observe(&mut self, chunk: &[u8]) {
let now = Instant::now();
self.first_chunk.get_or_insert(now);
self.last_chunk = Some(now);
self.tail.push_str(&String::from_utf8_lossy(chunk));
if self.tail.len() > TAIL_CAP_BYTES {
// Keep the newest half; the usage object is always at the
// very end of the body. Split at a char boundary.
let mut cut = self.tail.len() - TAIL_CAP_BYTES / 2;
while !self.tail.is_char_boundary(cut) {
cut += 1;
}
self.tail.drain(..cut);
}
}
/// Emit the metrics exactly once — called on clean stream end and
/// from Drop (client disconnect mid-stream still records what we
/// saw).
fn finish(&mut self) {
if self.finished {
return;
}
self.finished = true;
let prompt = last_count_for(&self.tail, "prompt_tokens");
let completion = last_count_for(&self.tail, "completion_tokens");
// Per-model metrics — only when body chunks actually arrived.
if let Some(first) = self.first_chunk {
let ttft = first.duration_since(self.request_start).as_secs_f64();
metrics::histogram!("cortex_time_to_first_token_seconds", &self.labels).record(ttft);
if let Some(prompt) = prompt {
metrics::counter!("cortex_prompt_tokens_total", &self.labels).increment(prompt);
}
if let Some(completion) = completion.filter(|c| *c > 0) {
metrics::counter!("cortex_completion_tokens_total", &self.labels)
.increment(completion);
let last = self.last_chunk.unwrap_or(first);
let decode_window = last.duration_since(first).as_secs_f64();
// Streaming: rate over the decode window (first→last chunk).
// Non-streaming bodies arrive as ~one chunk (window ≈ 0),
// where the only honest denominator is the full request
// duration.
let secs = if decode_window >= 0.1 {
decode_window
} else {
last.duration_since(self.request_start).as_secs_f64()
};
if secs > 0.0 {
metrics::histogram!("cortex_tokens_per_second", &self.labels)
.record(completion as f64 / secs);
}
}
}
// Per-principal metering + reservation settle (#51). Always runs so
// the reservation is resolved even when no usage/body was observed
// (sink with (0, 0) → settle 0 → release).
if let Some(sink) = self.usage_sink.take() {
sink(prompt.unwrap_or(0), completion.unwrap_or(0));
}
}
}
/// Pass-through stream wrapper that feeds [`TokenMetrics`]. Emits on
/// clean end-of-stream; the Drop impl covers client disconnects.
struct TokenMetricsStream {
inner: BoxStream<'static, Result<bytes::Bytes, reqwest::Error>>,
metrics: TokenMetrics,
}
impl TokenMetricsStream {
fn new(
inner: BoxStream<'static, Result<bytes::Bytes, reqwest::Error>>,
metrics: TokenMetrics,
) -> Self {
Self { inner, metrics }
}
}
impl Stream for TokenMetricsStream {
type Item = Result<bytes::Bytes, reqwest::Error>;
fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
let this = self.get_mut();
match this.inner.as_mut().poll_next(cx) {
Poll::Ready(Some(Ok(chunk))) => {
this.metrics.observe(&chunk);
Poll::Ready(Some(Ok(chunk)))
}
Poll::Ready(Some(Err(e))) => Poll::Ready(Some(Err(e))),
Poll::Ready(None) => {
this.metrics.finish();
Poll::Ready(None)
}
Poll::Pending => Poll::Pending,
}
}
}
impl Drop for TokenMetricsStream {
fn drop(&mut self) {
self.metrics.finish();
}
}
#[cfg(test)]
mod tests {
use super::last_count_for;
#[test]
fn extracts_counts_from_final_sse_usage_chunk() {
let tail = concat!(
"data: {\"choices\":[{\"delta\":{\"content\":\"hi\"}}]}\n\n",
"data: {\"choices\":[],\"usage\":{\"prompt_tokens\":225,",
"\"completion_tokens\":42,\"total_tokens\":267}}\n\n",
"data: [DONE]\n\n"
);
assert_eq!(last_count_for(tail, "prompt_tokens"), Some(225));
assert_eq!(last_count_for(tail, "completion_tokens"), Some(42));
}
#[test]
fn extracts_counts_from_non_streaming_body() {
let tail = "{\"choices\":[{\"message\":{\"content\":\"hi\"}}],\
\"usage\":{\"prompt_tokens\": 12, \"completion_tokens\": 7}}";
assert_eq!(last_count_for(tail, "prompt_tokens"), Some(12));
assert_eq!(last_count_for(tail, "completion_tokens"), Some(7));
}
#[test]
fn ignores_details_variants_and_takes_last_occurrence() {
// completion_tokens_details must not shadow completion_tokens,
// and the LAST usage object wins (matters when content echoes
// a usage-shaped string earlier in the stream).
let tail = concat!(
"data: {\"usage\":{\"completion_tokens\":1}}\n\n",
"data: {\"usage\":{\"completion_tokens\":99,",
"\"completion_tokens_details\":{\"reasoning_tokens\":3}}}\n\n"
);
assert_eq!(last_count_for(tail, "completion_tokens"), Some(99));
}
#[test]
fn absent_keys_yield_none() {
assert_eq!(
last_count_for("data: [DONE]\n\n", "completion_tokens"),
None
);
assert_eq!(last_count_for("", "prompt_tokens"), None);
// key present but non-numeric value
assert_eq!(
last_count_for("\"completion_tokens\": null", "completion_tokens"),
None
);
}
}
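
The throughput denominator chosen in `TokenMetrics::finish` can be isolated as a pure function. The sketch below assumes chunk arrival times are given as offsets from the request start; `tokens_per_second` and `MIN_DECODE_WINDOW_SECS` are hypothetical names, with the 0.1 s threshold taken from the code above.

```rust
use std::time::Duration;

/// Decode windows shorter than this are treated as a single-chunk body.
const MIN_DECODE_WINDOW_SECS: f64 = 0.1;

/// Completion tokens per second, or `None` when there is nothing to report.
/// `first_chunk` and `last_chunk` are offsets from the request start.
fn tokens_per_second(
    completion_tokens: u64,
    first_chunk: Duration,
    last_chunk: Duration,
) -> Option<f64> {
    if completion_tokens == 0 {
        return None;
    }
    let window = last_chunk.saturating_sub(first_chunk).as_secs_f64();
    // Streaming: rate over the decode window. Non-streaming bodies arrive as
    // roughly one chunk, so fall back to the full request duration.
    let secs = if window >= MIN_DECODE_WINDOW_SECS {
        window
    } else {
        last_chunk.as_secs_f64()
    };
    (secs > 0.0).then(|| completion_tokens as f64 / secs)
}
```

A streamed response with 100 completion tokens, first chunk at 0.5 s and last at 2.5 s, is rated over the 2 s decode window; the same 100 tokens arriving as one chunk at 4 s are rated over the full 4 s.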


@@ -2,13 +2,21 @@
//!
//! Given a model ID from an inbound request, determine which node should
//! handle it. Priority:
//! 1. Node where the model is currently `Loaded`
//! 2. Node where the model is `Unloaded` (will lazy-load on request)
//! 3. Error: model not found on any node
//! 1. Node where the model is currently `Loaded` → use it.
//! 2. Node where the model is `Unloaded` → use it; neuron's existing
//! lazy-load behaviour will reload before serving the request.
//! 3. Model is in the catalogue → pick a feasible neuron, call
//! `POST /models/load`, wait for the load to complete, then
//! proxy. First-request cold-load latency is acceptable per the
//! unified-endpoint contract.
//! 4. Not in catalogue, not loaded anywhere → 404.
use crate::state::CortexState;
use cortex_core::catalogue::ModelProfile;
use cortex_core::harness::ModelSpec;
use cortex_core::node::ModelStatus;
use std::sync::Arc;
use std::time::Duration;
/// The routing decision: which node endpoint to proxy the request to.
#[derive(Debug, Clone)]
@@ -16,62 +24,400 @@ pub struct RouteDecision {
pub node_name: String,
/// The inference endpoint to proxy to (from neuron's /models/{id}/endpoint).
pub endpoint: String,
/// Whether the model will need to load (cold start).
/// Whether the model will need to load (cold start). Set to true
/// when we proxied to an `Unloaded` node (lazy load on neuron) or
/// when we just triggered an explicit cold-load via the catalogue
/// path.
pub cold_start: bool,
/// The concrete model id we actually routed to. Equal to the
/// caller's requested id unless an alias was resolved (e.g. caller
/// asked for `helexa/small`, this carries `Qwen/Qwen3-1.7B`). The
/// handler uses this to rewrite the request body's `model` field
/// before proxying — neurons reject requests where the body's
/// model name doesn't match a loaded model.
pub resolved_model_id: String,
}
#[derive(Debug, thiserror::Error)]
pub enum RouteError {
#[error("model '{0}' not found on any node")]
#[error("model '{0}' not found on any node and not in catalogue")]
ModelNotFound(String),
#[error("no healthy nodes available")]
NoHealthyNodes,
#[error("failed to resolve inference endpoint for model '{0}' on node '{1}'")]
EndpointResolveFailed(String, String),
#[error(
"model '{model_id}' is in the catalogue but no healthy neuron's topology satisfies its constraints"
)]
NoFeasibleNeuron { model_id: String },
#[error("cold-load of '{model_id}' on '{node}' failed: {message}")]
ColdLoadFailed {
model_id: String,
node: String,
message: String,
},
#[error(
"model '{model_id}' is recovering on node '{node}' (device context rebuild in progress) — retry shortly"
)]
ModelRecovering { model_id: String, node: String },
}
impl RouteError {
/// HTTP status the gateway should answer with. `NoHealthyNodes` and
/// `ModelRecovering` are the transient cases (503 service_unavailable,
/// safe to retry the same request); everything else is 404.
pub fn http_status(&self) -> u16 {
match self {
RouteError::NoHealthyNodes | RouteError::ModelRecovering { .. } => 503,
_ => 404,
}
}
/// Broad OpenAI error category for the JSON envelope.
pub fn broad_type(&self) -> &'static str {
match self {
RouteError::ModelNotFound(_) => "invalid_request_error",
RouteError::NoHealthyNodes
| RouteError::EndpointResolveFailed(_, _)
| RouteError::NoFeasibleNeuron { .. }
| RouteError::ColdLoadFailed { .. }
| RouteError::ModelRecovering { .. } => "api_error",
}
}
/// Specific machine-readable error code.
pub fn code(&self) -> &'static str {
match self {
RouteError::ModelNotFound(_) => "model_not_found",
RouteError::NoHealthyNodes => "service_unavailable",
RouteError::EndpointResolveFailed(_, _) => "service_unavailable",
RouteError::NoFeasibleNeuron { .. } => "service_unavailable",
RouteError::ColdLoadFailed { .. } => "service_unavailable",
RouteError::ModelRecovering { .. } => "service_unavailable",
}
}
/// Seconds to advertise in `Retry-After` for the transient variants
/// (#63). `NoHealthyNodes` may clear once the poller re-marks a node
/// healthy; `ModelRecovering` clears once the device context finishes
/// rebuilding — both are safe to retry. Everything else is permanent
/// for this request (404) and carries no hint.
pub fn retry_after_secs(&self) -> Option<u64> {
match self {
RouteError::ModelRecovering { .. } => Some(2),
RouteError::NoHealthyNodes => Some(5),
_ => None,
}
}
}
/// Resolve which node should serve a request for the given model.
/// Asks the neuron for the inference endpoint after selecting a node.
pub async fn resolve(
fleet: &Arc<CortexState>,
-model_id: &str,
requested_model_id: &str,
) -> Result<RouteDecision, RouteError> {
-let (node_name, neuron_endpoint, cold_start) = {
// Alias resolution first — swap `helexa/small` (etc.) for the
// concrete id before any node lookups so the rest of routing,
// loading, and metrics deal in concrete ids only. `resolve_alias`
// returns the input verbatim when it isn't an alias.
let model_id = fleet.catalogue.resolve_alias(requested_model_id);
if model_id != requested_model_id {
tracing::debug!(
requested = requested_model_id,
resolved = model_id,
"alias resolved"
);
}
// Snapshot loaded / unloaded / recovering state from the poller cache.
let (loaded_route, unloaded_route, recovering_node, any_healthy) = {
let nodes = fleet.nodes.read().await;
-let mut loaded_candidate = None;
-let mut unloaded_candidate = None;
// All healthy nodes with the model loaded, each with its current
// admission load (#53) so we can pick the least-busy replica (#55).
let mut loaded_candidates: Vec<(String, String, usize)> = Vec::new();
let mut unloaded_route = None;
let mut recovering_node = None;
let mut any_healthy = false;
for node in nodes.values() {
if !node.healthy {
continue;
}
any_healthy = true;
if let Some(entry) = node.models.get(model_id) {
match entry.status {
ModelStatus::Loaded | ModelStatus::Reloading => {
-loaded_candidate = Some((node.name.clone(), node.endpoint.clone(), false));
-break;
// Least-busy score: in-flight + queued from the
// neuron's last /health (#53). Unknown load (no poll
// yet) scores 0 so the replica stays eligible.
let score = node
.model_load
.get(model_id)
.map(|l| l.in_flight + l.queue_depth)
.unwrap_or(0);
loaded_candidates.push((node.name.clone(), node.endpoint.clone(), score));
}
ModelStatus::Unloaded => {
-if unloaded_candidate.is_none() {
-unloaded_candidate =
-Some((node.name.clone(), node.endpoint.clone(), true));
if unloaded_route.is_none() {
unloaded_route = Some((node.name.clone(), node.endpoint.clone(), true));
}
}
// Auto-recovering (#17/#20): the model is rebuilding
// its device context on this node. Hold the route —
// answer "retry shortly" rather than 404, and do NOT
// fall through to the catalogue cold-load, which
// would race a second placement (and a second copy's
// worth of VRAM) against the in-flight recovery.
ModelStatus::Recovering => {
if recovering_node.is_none() {
recovering_node = Some(node.name.clone());
}
}
// Loading is gateway-synthesised from neuron's
// activation snapshot; it never appears on the
// wire from neuron's `/models`. Skip — the model
// isn't actually servable yet. The pre-existing
// race (catalogue cold_load fires a parallel
// /models/load against the in-flight load) is no
// worse than before; fixing it needs neuron-side
// in-flight tracking on /models/load itself.
ModelStatus::Loading => {}
}
}
}
// Pick the least-busy loaded replica; ties break by node name for
// deterministic routing. `false` = not a cold start.
let loaded_route = loaded_candidates
.into_iter()
.min_by(|a, b| a.2.cmp(&b.2).then_with(|| a.0.cmp(&b.0)))
.map(|(name, endpoint, _score)| (name, endpoint, false));
(loaded_route, unloaded_route, recovering_node, any_healthy)
};
if !any_healthy {
return Err(RouteError::NoHealthyNodes);
}
// Priority 1: already loaded.
if let Some((node_name, neuron_endpoint, cold_start)) = loaded_route {
return finish(fleet, &node_name, &neuron_endpoint, model_id, cold_start).await;
}
// Priority 2: recovering somewhere — transient hold, not a reroute.
if let Some(node) = recovering_node {
return Err(RouteError::ModelRecovering {
model_id: model_id.to_string(),
node,
});
}
// Priority 3: known to neuron but unloaded (neuron's lazy load).
if let Some((node_name, neuron_endpoint, cold_start)) = unloaded_route {
return finish(fleet, &node_name, &neuron_endpoint, model_id, cold_start).await;
}
// Priority 4: catalogue × topology cold-load.
if let Some(profile) = fleet.catalogue.get(model_id) {
let (node_name, neuron_endpoint) = pick_feasible_neuron(fleet, profile).await?;
cold_load(fleet, &node_name, &neuron_endpoint, profile).await?;
return finish(fleet, &node_name, &neuron_endpoint, model_id, true).await;
}
Err(RouteError::ModelNotFound(model_id.to_string()))
}
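The least-busy selection inside `resolve` can be looked at in isolation. The sketch below reuses the same `min_by` comparison on a plain tuple list; `Candidate` and `pick_least_busy` are illustrative names, not part of the diff.

```rust
/// Illustrative stand-in for one healthy replica:
/// (node name, endpoint, load score = in-flight + queued).
type Candidate = (String, String, usize);

/// Pick the replica with the lowest score. Equal scores fall back to
/// lexicographic node name, so the choice is deterministic.
fn pick_least_busy(candidates: Vec<Candidate>) -> Option<(String, String)> {
    candidates
        .into_iter()
        .min_by(|a, b| a.2.cmp(&b.2).then_with(|| a.0.cmp(&b.0)))
        .map(|(name, endpoint, _score)| (name, endpoint))
}
```

Comparing on name as a secondary key matters because `nodes.values()` iterates a hash map, whose order is not stable between runs.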
/// Pick a healthy neuron whose discovered topology satisfies the
/// profile. Preference order:
/// 1. A neuron from `profile.pinned_on` that is healthy + feasible.
/// 2. Otherwise, any healthy + feasible neuron, stable by name.
async fn pick_feasible_neuron(
fleet: &Arc<CortexState>,
profile: &ModelProfile,
) -> Result<(String, String), RouteError> {
let nodes = fleet.nodes.read().await;
let mut candidates: Vec<(String, String, bool)> = Vec::new();
for node in nodes.values() {
if !node.healthy {
continue;
}
let Some(disc) = node.discovery.as_ref() else {
continue;
};
if !profile.is_feasible_on(&node.name, &disc.devices) {
continue;
}
let pinned = profile.pinned_on.iter().any(|n| n == &node.name);
candidates.push((node.name.clone(), node.endpoint.clone(), pinned));
}
candidates.sort_by(|a, b| {
b.2.cmp(&a.2) // pinned first (true > false)
.then(a.0.cmp(&b.0))
});
let pick = candidates.into_iter().next();
pick.map(|(n, e, _)| (n, e))
.ok_or_else(|| RouteError::NoFeasibleNeuron {
model_id: profile.id.clone(),
})
}
/// Issue `POST {endpoint}/models/load` for this profile on this neuron,
/// blocking until the load completes (neuron's load endpoint is
/// synchronous — it returns 200 once VRAM is materialised). On success
/// also inserts a `Loaded` entry into the local NodeState cache so the
/// caller's subsequent endpoint lookup sees the new model without
/// waiting for the next poll cycle.
async fn cold_load(
fleet: &Arc<CortexState>,
node_name: &str,
neuron_endpoint: &str,
profile: &ModelProfile,
) -> Result<(), RouteError> {
let spec = profile_to_spec(fleet, node_name, profile).await;
let url = format!("{neuron_endpoint}/models/load");
tracing::info!(model = %profile.id, node = node_name, "cold-loading via /models/load");
// Generous timeout: a fresh download + safetensors mmap + device
// copy for a 30B-class dense model can comfortably exceed 5 min on
// a slow link. The HTTP client's own default already covers most
// of this; pin a longer per-request bound just here.
let resp = match fleet
.http_client
.post(&url)
.timeout(Duration::from_secs(1800))
.json(&spec)
.send()
.await
{
Ok(r) => r,
Err(e) => {
return Err(RouteError::ColdLoadFailed {
model_id: profile.id.clone(),
node: node_name.to_string(),
message: format!("HTTP request failed: {e}"),
});
}
};
let status = resp.status();
if !status.is_success() {
let body = resp.text().await.unwrap_or_default();
// Neuron returns 400 "already loaded" when two concurrent
// requests race the same model. Treat that as success — both
// requests effectively achieved the same end state.
if body.contains("already loaded") {
tracing::info!(
model = %profile.id,
node = node_name,
"cold-load saw 'already loaded' — treating as success"
);
} else {
return Err(RouteError::ColdLoadFailed {
model_id: profile.id.clone(),
node: node_name.to_string(),
message: format!("HTTP {status}: {body}"),
});
}
} else {
tracing::info!(model = %profile.id, node = node_name, "cold-load returned 200");
}
// Warm the cache: insert a Loaded ModelEntry so the next
// resolve() finds the model without waiting for the poll loop.
{
let mut nodes = fleet.nodes.write().await;
if let Some(node) = nodes.get_mut(node_name) {
node.models.insert(
profile.id.clone(),
cortex_core::node::ModelEntry {
id: profile.id.clone(),
status: ModelStatus::Loaded,
last_accessed: Some(chrono::Utc::now()),
vram_estimate_mb: profile.vram_mb,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
}
Ok(())
}
/// Translate a `ModelProfile` to a `ModelSpec` neuron's /models/load
/// accepts. Devices are picked from the neuron's discovered topology —
/// the first `min_devices` indices that meet `min_device_vram_mb`.
async fn profile_to_spec(
fleet: &Arc<CortexState>,
node_name: &str,
profile: &ModelProfile,
) -> ModelSpec {
let devices = {
let nodes = fleet.nodes.read().await;
let mut picked: Vec<u32> = Vec::new();
if let Some(node) = nodes.get(node_name)
&& let Some(disc) = &node.discovery
{
let min_vram = profile.min_device_vram_mb.unwrap_or(0);
for d in &disc.devices {
if d.vram_total_mb >= min_vram {
picked.push(d.index);
if picked.len() as u32 >= profile.min_devices {
break;
}
}
}
}
-loaded_candidate.or(unloaded_candidate).ok_or_else(|| {
-if nodes.values().any(|n| n.healthy) {
-RouteError::ModelNotFound(model_id.to_string())
-} else {
-RouteError::NoHealthyNodes
-}
-})?
if picked.is_empty() {
// Fall back to a 0..min_devices default; pick_feasible_neuron
// already verified the topology satisfies the constraints,
// so this only fires if discovery raced or was lost.
(0..profile.min_devices).collect()
} else {
picked
}
};
-// Ask the neuron for the inference endpoint for this model.
let tensor_parallel = if profile.min_devices > 1 {
Some(profile.min_devices)
} else {
None
};
ModelSpec {
model_id: qualified_model_id(profile),
harness: profile.harness.clone(),
quant: profile.quant.clone(),
tensor_parallel,
devices: Some(devices),
}
}
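The device-picking rule in `profile_to_spec` (the first `min_devices` indices that meet the VRAM floor, else a `0..min_devices` fallback) can be sketched without the fleet state. `Device` here is a trimmed-down, hypothetical record, and the sketch assumes `min_devices` is at least 1.

```rust
/// Hypothetical, trimmed-down device record; the real discovery type
/// carries more fields than these two.
struct Device {
    index: u32,
    vram_total_mb: u64,
}

/// Take the first `min_devices` device indices whose total VRAM meets
/// `min_vram_mb`. If none qualify (for example, discovery was lost),
/// fall back to `0..min_devices`. Assumes `min_devices >= 1`.
fn pick_devices(devices: &[Device], min_devices: u32, min_vram_mb: u64) -> Vec<u32> {
    let picked: Vec<u32> = devices
        .iter()
        .filter(|d| d.vram_total_mb >= min_vram_mb)
        .map(|d| d.index)
        .take(min_devices as usize)
        .collect();
    if picked.is_empty() {
        (0..min_devices).collect()
    } else {
        picked
    }
}
```

As in the diff, the result may hold fewer than `min_devices` entries when only some devices qualify. The fallback fires only when nothing qualifies at all.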
/// Prefix the catalogue id with the scheme when one is declared, so
/// neuron resolves the load against the right registry. Without this,
/// a profile pointing at the helexa registry would resolve via
/// neuron's `default_source` (typically `huggingface`) and fetch
/// bytes from the wrong place. Profiles that omit `source` continue
/// to pass the bare id through, preserving the pre-Phase-3 contract.
///
/// Stays at module scope (not nested in `profile_to_spec`) so the unit
/// tests can exercise it without spinning up CortexState topology.
fn qualified_model_id(profile: &ModelProfile) -> String {
match profile.source.as_deref() {
Some(scheme) if !scheme.is_empty() => format!("{scheme}:{}", profile.id),
_ => profile.id.clone(),
}
}
/// Resolve neuron's `/models/{id}/endpoint` to its inference URL and
/// build the final `RouteDecision`. Shared by the three priority
/// branches above that yield a route (loaded, unloaded, cold-load).
async fn finish(
fleet: &Arc<CortexState>,
node_name: &str,
neuron_endpoint: &str,
model_id: &str,
cold_start: bool,
) -> Result<RouteDecision, RouteError> {
let endpoint_url = format!(
"{}/models/{}/endpoint",
neuron_endpoint,
@@ -89,13 +435,122 @@ pub async fn resolve(
_ => None,
};
-let endpoint = inference_endpoint.ok_or_else(|| {
-RouteError::EndpointResolveFailed(model_id.to_string(), node_name.clone())
let raw = inference_endpoint.ok_or_else(|| {
RouteError::EndpointResolveFailed(model_id.to_string(), node_name.to_string())
})?;
// Rewrite loopback inference URLs to use the configured neuron host.
// Neuron's default bind_url is `http://localhost:13131` (it can't
// reliably know its own externally-resolvable name). Cortex sees a
// URL that's only meaningful from the neuron host's own perspective;
// proxying directly to localhost from a different cortex host would
// hit nothing. Keep neuron's port and path (a future harness could
// serve inference on a different port than the management API), but
// swap the host for the one in cortex.toml.
let endpoint = rewrite_loopback_host(&raw, neuron_endpoint).unwrap_or(raw);
Ok(RouteDecision {
-node_name,
node_name: node_name.to_string(),
endpoint,
cold_start,
resolved_model_id: model_id.to_string(),
})
}
/// If `inference_url`'s host is a loopback name (localhost / 127.0.0.1 /
/// 0.0.0.0 / ::1), return a copy with the host replaced by
/// `neuron_endpoint`'s host. Otherwise return None and the caller falls
/// back to the inference URL as-is.
fn rewrite_loopback_host(inference_url: &str, neuron_endpoint: &str) -> Option<String> {
let inf = url::Url::parse(inference_url).ok()?;
let inf_host = inf.host_str()?;
let is_loopback = matches!(inf_host, "localhost" | "127.0.0.1" | "0.0.0.0" | "::1");
if !is_loopback {
return None;
}
let neuron = url::Url::parse(neuron_endpoint).ok()?;
let new_host = neuron.host_str()?;
let mut out = inf.clone();
out.set_host(Some(new_host)).ok()?;
// url::Url::to_string normalises an empty path to "/", which then
// breaks downstream callers that do format!("{endpoint}/v1/...")
// and produce a double slash. The proxy URL is treated as a base
// string that the caller appends paths to, so strip the trailing
// slash here.
let s = out.to_string();
Some(s.trim_end_matches('/').to_string())
}
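The trailing-slash comment in `rewrite_loopback_host` describes a plain string-joining hazard. The sketch below shows it with the standard library only; `join_base` is a hypothetical helper, not something the diff defines.

```rust
/// Hypothetical helper: join a base URL string and a path so that
/// exactly one slash separates them, whichever side already had one.
fn join_base(base: &str, path: &str) -> String {
    format!(
        "{}/{}",
        base.trim_end_matches('/'),
        path.trim_start_matches('/')
    )
}
```

The diff takes the narrower route of stripping the trailing slash once, at the point where the endpoint string is produced, so callers can keep using `format!("{endpoint}/v1/...")` unchanged.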
#[cfg(test)]
mod tests {
use super::{ModelProfile, qualified_model_id, rewrite_loopback_host};
fn bare_profile(id: &str, source: Option<&str>) -> ModelProfile {
ModelProfile {
id: id.into(),
harness: "candle".into(),
quant: None,
vram_mb: None,
min_devices: 1,
min_device_vram_mb: None,
pinned_on: vec![],
source: source.map(String::from),
limit: None,
cost: None,
capabilities: vec![],
}
}
#[test]
fn qualified_id_passes_through_when_source_absent() {
let p = bare_profile("Qwen/Qwen3-30B", None);
assert_eq!(qualified_model_id(&p), "Qwen/Qwen3-30B");
}
#[test]
fn qualified_id_prefixes_when_source_set() {
let p = bare_profile("Helexa/Qwen3.6-27B-Uncensored", Some("helexa"));
assert_eq!(
qualified_model_id(&p),
"helexa:Helexa/Qwen3.6-27B-Uncensored"
);
}
#[test]
fn qualified_id_passes_through_when_source_is_empty_string() {
// An empty scheme is treated as absent — neuron's default_source
// substitution kicks in.
let p = bare_profile("Qwen/Qwen3-30B", Some(""));
assert_eq!(qualified_model_id(&p), "Qwen/Qwen3-30B");
}
#[test]
fn rewrites_localhost_keeps_port_and_path() {
let out = rewrite_loopback_host(
"http://localhost:13131",
"http://beast.hanzalova.internal:13131",
);
assert_eq!(
out.as_deref(),
Some("http://beast.hanzalova.internal:13131")
);
}
#[test]
fn rewrites_loopback_with_distinct_inference_port() {
let out = rewrite_loopback_host("http://127.0.0.1:8080", "http://beast.lan:13131");
assert_eq!(out.as_deref(), Some("http://beast.lan:8080"));
}
#[test]
fn leaves_non_loopback_alone() {
let out = rewrite_loopback_host("http://other.host:1234", "http://beast.lan:13131");
assert_eq!(out, None);
}
#[test]
fn malformed_inference_url_returns_none() {
let out = rewrite_loopback_host("not a url", "http://beast.lan:13131");
assert_eq!(out, None);
}
}


@@ -1,7 +1,10 @@
use crate::entitlements_local::LocalEntitlementProvider;
use cortex_core::catalogue::ModelCatalogue;
use cortex_core::config::{EvictionSettings, GatewayConfig, NeuronEndpoint};
use cortex_core::entitlements::EntitlementProvider;
use cortex_core::node::NodeState;
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;
/// Shared fleet state, protected by a RwLock for concurrent reader access.
@@ -11,6 +14,12 @@ pub struct CortexState {
pub eviction: EvictionSettings,
pub catalogue: ModelCatalogue,
pub http_client: reqwest::Client,
/// Resolves bearer keys to principals and enforces token budgets (#47).
/// A local/static provider today (#50); the upstream client later (#57).
pub entitlements: Arc<dyn EntitlementProvider>,
/// Whether to reject unauthenticated requests (#49). Read by the auth
/// middleware once it lands.
pub require_auth: bool,
}
impl CortexState {
@@ -26,12 +35,18 @@ impl CortexState {
models: HashMap::new(),
lifecycle_cycles: 0,
last_poll: None,
discovery: None,
activation: None,
model_load: HashMap::new(),
},
);
}
let catalogue = ModelCatalogue::load(&config.models_config);
let entitlements: Arc<dyn EntitlementProvider> =
Arc::new(LocalEntitlementProvider::from_config(&config.entitlements));
Self {
nodes: RwLock::new(nodes),
neuron_configs: config.neurons.clone(),
@@ -41,6 +56,8 @@ impl CortexState {
.timeout(std::time::Duration::from_secs(300))
.build()
.expect("failed to build HTTP client"),
entitlements,
require_auth: config.entitlements.require_auth,
}
}
}
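How `require_auth` is meant to interact with key resolution can be sketched as a small decision function. This is an assumption-laden illustration based on the field's doc comment and the commit title at the top of this page; `AuthOutcome`, `decide`, and the `lookup` closure are hypothetical and are not the middleware's real types.

```rust
/// Hypothetical outcome of the auth check.
#[derive(Debug, PartialEq)]
enum AuthOutcome {
    /// The bearer key resolved to an account id.
    Principal(String),
    /// No usable key, but anonymous access is allowed.
    Anonymous,
    /// No usable key and auth is required: answer 401.
    Reject401,
}

/// `bearer` is the presented token, if any. `lookup` stands in for the
/// entitlement provider's key resolution.
fn decide(
    require_auth: bool,
    bearer: Option<&str>,
    lookup: impl Fn(&str) -> Option<String>,
) -> AuthOutcome {
    match bearer.and_then(|key| lookup(key)) {
        Some(account) => AuthOutcome::Principal(account),
        // Missing or unrecognised key: only an error when auth is required.
        None if require_auth => AuthOutcome::Reject401,
        None => AuthOutcome::Anonymous,
    }
}
```

Under this sketch a placeholder bearer sent by an OpenAI-compatible client is ignored when `require_auth` is false, which is the behaviour the commit title asks for.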

View File

@@ -0,0 +1,280 @@
//! Alias resolution: a client request with `model: "helexa/small"`
//! routes to the concrete model id (e.g. `Qwen/Qwen3-1.7B`), with the
//! proxied request body rewritten so the upstream neuron sees a model
//! name that matches its loaded handle.
mod common;
use cortex_core::config::{
EvictionSettings, EvictionStrategy, GatewayConfig, GatewaySettings, NeuronEndpoint,
};
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use serde_json::json;
use std::path::PathBuf;
use std::sync::Arc;
use tokio::net::TcpListener;
/// Write a `models.toml` with one alias to a unique temp path. Returns
/// the path; the test never deletes the file, so cleanup is left to
/// the OS's temp-dir housekeeping. `std::env::temp_dir()` honours
/// `TMPDIR`, so CI can keep the file off shared /tmp without pulling
/// in tempfile.
fn write_models_toml(alias: &str, target: &str) -> PathBuf {
let contents = format!(
r#"
[aliases]
"{alias}" = "{target}"
"#
);
let mut path = std::env::temp_dir();
let pid = std::process::id();
let now = std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_nanos();
path.push(format!("cortex-test-models-{pid}-{now}.toml"));
std::fs::write(&path, contents).expect("write temp models.toml");
path
}
#[tokio::test]
async fn test_alias_resolves_in_chat_completions() {
let mock_url = common::spawn_mock_neuron().await;
let models_path = write_models_toml("helexa/small", "test-model");
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: mock_url,
}],
models_config: models_path.to_string_lossy().to_string(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
// Seed the node as healthy with the concrete model loaded under
// the target id. The poller doesn't run in this test; we just
// populate state manually.
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").expect("node must exist");
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
// Sanity: the catalogue actually picked up the alias.
assert_eq!(
fleet.catalogue.resolve_alias("helexa/small"),
"test-model",
"alias should resolve to target id"
);
// Spawn the gateway against this fleet.
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let gateway_addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let gateway_url = format!("http://{gateway_addr}");
// Send a chat completion against the alias. The mock backend
// echoes back the `model` field it received — so a body whose
// model wasn't rewritten would come back as "helexa/small", and a
// properly-rewritten one as "test-model".
let client = reqwest::Client::new();
let resp = client
.post(format!("{gateway_url}/v1/chat/completions"))
.json(&json!({
"model": "helexa/small",
"messages": [{"role": "user", "content": "hi"}],
}))
.send()
.await
.expect("gateway should respond");
assert!(resp.status().is_success(), "gateway returned non-2xx");
let body: serde_json::Value = resp.json().await.expect("response is JSON");
assert_eq!(
body.get("model").and_then(|m| m.as_str()),
Some("test-model"),
"mock backend should have seen the resolved model id, not the alias"
);
}
#[tokio::test]
async fn test_aliases_surface_in_v1_models() {
let mock_url = common::spawn_mock_neuron().await;
let models_path = write_models_toml("helexa/small", "test-model");
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: mock_url,
}],
models_config: models_path.to_string_lossy().to_string(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
// Seed the target as loaded so the alias's mirrored entry shows
// loaded=true.
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").expect("node must exist");
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(2000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let gateway_addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let gateway_url = format!("http://{gateway_addr}");
let resp = reqwest::get(format!("{gateway_url}/v1/models"))
.await
.expect("gateway should respond");
let body: serde_json::Value = resp.json().await.unwrap();
let entries = body
.get("data")
.and_then(|d| d.as_array())
.expect("data array");
// Both the alias and the target should be present.
let ids: Vec<&str> = entries
.iter()
.filter_map(|e| e.get("id").and_then(|v| v.as_str()))
.collect();
assert!(ids.contains(&"test-model"), "target should be listed");
assert!(ids.contains(&"helexa/small"), "alias should be listed");
// The alias's `loaded` flag and locations should mirror the target.
let alias_entry = entries
.iter()
.find(|e| e.get("id").and_then(|v| v.as_str()) == Some("helexa/small"))
.expect("alias entry");
assert_eq!(alias_entry.get("loaded"), Some(&json!(true)));
let locations = alias_entry
.get("locations")
.and_then(|l| l.as_array())
.expect("locations array");
assert_eq!(locations.len(), 1);
assert_eq!(
locations[0].get("node").and_then(|n| n.as_str()),
Some("mock-node")
);
}
#[tokio::test]
async fn test_alias_falls_through_for_unmapped_model() {
// Catalogue has an alias for some-other-thing but the request
// model "test-model" isn't an alias; resolution should be a no-op.
let mock_url = common::spawn_mock_neuron().await;
let models_path = write_models_toml("helexa/large", "definitely-not-loaded");
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: mock_url,
}],
models_config: models_path.to_string_lossy().to_string(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").expect("node must exist");
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let gateway_addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let gateway_url = format!("http://{gateway_addr}");
let resp = reqwest::Client::new()
.post(format!("{gateway_url}/v1/chat/completions"))
.json(&json!({
"model": "test-model",
"messages": [{"role": "user", "content": "hi"}],
}))
.send()
.await
.unwrap();
assert!(resp.status().is_success());
let body: serde_json::Value = resp.json().await.unwrap();
assert_eq!(
body.get("model").and_then(|m| m.as_str()),
Some("test-model")
);
}
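The contract these tests rely on is that alias resolution maps an alias to its target and returns any other id verbatim. A minimal sketch under that assumption; `Aliases` is a hypothetical stand-in for the catalogue, which this diff does not show.

```rust
use std::collections::HashMap;

/// Hypothetical catalogue that holds only the `[aliases]` table.
struct Aliases {
    map: HashMap<String, String>,
}

impl Aliases {
    /// Return the concrete id for an alias, or the input unchanged
    /// when it is not an alias.
    fn resolve_alias<'a>(&'a self, requested: &'a str) -> &'a str {
        self.map
            .get(requested)
            .map(String::as_str)
            .unwrap_or(requested)
    }
}
```

Returning the input for non-aliases lets the router call the function unconditionally and compare the result with the request to decide whether to log a resolution.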


@@ -123,3 +123,212 @@ async fn test_anthropic_invalid_request() {
assert_eq!(resp.status(), 400);
}
/// Tool round-trip: an Anthropic `/v1/messages` request carrying tools
/// (the Claude Code shape: `{name, description, input_schema}`) must
/// reach the upstream neuron reshaped into OpenAI function-tool form,
/// and tool history (`tool_use` / `tool_result` blocks) must become
/// `tool_calls` / `role:"tool"` messages. This is the fix for the
/// failure where the model received malformed tool defs and improvised
/// an unparseable `<tool_use_name>` format.
#[tokio::test]
async fn test_anthropic_tools_reshaped_for_upstream() {
let (mock_url, captured) = common::spawn_capturing_mock_neuron().await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/messages"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"max_tokens": 100,
"tools": [{
"name": "Read",
"description": "Read a file from disk",
"input_schema": {
"type": "object",
"properties": {"path": {"type": "string"}},
"required": ["path"]
}
}],
"tool_choice": {"type": "auto"},
"messages": [
{"role": "user", "content": "read /etc/hosts"},
{"role": "assistant", "content": [
{"type": "text", "text": "Reading it."},
{"type": "tool_use", "id": "toolu_42", "name": "Read",
"input": {"path": "/etc/hosts"}}
]},
{"role": "user", "content": [
{"type": "tool_result", "tool_use_id": "toolu_42",
"content": "127.0.0.1 localhost"}
]}
]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 200);
let forwarded = {
let guard = captured.lock().unwrap();
guard.last().cloned().expect("upstream received a request")
};
// Tool definitions reshaped to OpenAI function form.
let tools = forwarded["tools"].as_array().expect("tools array");
assert_eq!(tools[0]["type"], "function");
assert_eq!(tools[0]["function"]["name"], "Read");
assert_eq!(
tools[0]["function"]["parameters"]["properties"]["path"]["type"],
"string"
);
assert!(tools[0]["function"].get("input_schema").is_none());
// tool_choice mapped.
assert_eq!(forwarded["tool_choice"], "auto");
// Message history: user, assistant(+tool_calls), tool, user.
let msgs = forwarded["messages"].as_array().expect("messages array");
let assistant = msgs
.iter()
.find(|m| m["role"] == "assistant")
.expect("assistant turn");
assert_eq!(assistant["tool_calls"][0]["id"], "toolu_42");
assert_eq!(assistant["tool_calls"][0]["function"]["name"], "Read");
// arguments is the parsed object, not a JSON string — the Qwen3.6
// chat template iterates `tool_call.arguments | items`.
assert_eq!(
assistant["tool_calls"][0]["function"]["arguments"],
json!({"path": "/etc/hosts"})
);
let tool_msg = msgs
.iter()
.find(|m| m["role"] == "tool")
.expect("tool result turn");
assert_eq!(tool_msg["tool_call_id"], "toolu_42");
assert_eq!(tool_msg["content"], "127.0.0.1 localhost");
}
/// #24: a streaming Anthropic request gets a translated Anthropic SSE
/// stream — not raw OpenAI frames. Verifies the full event sequence,
/// text reassembly, and the content type.
#[tokio::test]
async fn test_anthropic_streaming_sse_translation() {
let mock_url =
common::spawn_streaming_mock_neuron(4, std::time::Duration::from_millis(20)).await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/messages"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"max_tokens": 64,
"stream": true,
"messages": [{"role": "user", "content": "Hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 200);
assert!(
resp.headers()
.get("content-type")
.and_then(|v| v.to_str().ok())
.unwrap_or("")
.starts_with("text/event-stream"),
"anthropic stream must be SSE"
);
let body = resp.text().await.expect("stream should complete");
assert!(
!body.contains("chat.completion.chunk"),
"raw OpenAI frames must not leak through:\n{body}"
);
let event_names: Vec<&str> = body
.lines()
.filter_map(|l| l.strip_prefix("event: "))
.collect();
assert_eq!(
event_names,
vec![
"message_start",
"content_block_start",
"content_block_delta",
"content_block_delta",
"content_block_delta",
"content_block_delta",
"content_block_stop",
"message_delta",
"message_stop",
],
"unexpected event sequence:\n{body}"
);
// Reassemble the text deltas: the mock emits token0..token3.
let text: String = body
.lines()
.filter_map(|l| l.strip_prefix("data: "))
.filter_map(|d| serde_json::from_str::<serde_json::Value>(d).ok())
.filter(|v| v["type"] == "content_block_delta")
.filter_map(|v| v["delta"]["text"].as_str().map(String::from))
.collect();
assert_eq!(text, "token0token1token2token3");
// The mock sends no finish_reason — stop_reason defaults to
// end_turn, and output_tokens falls back to the delta count.
let message_delta = body
.lines()
.filter_map(|l| l.strip_prefix("data: "))
.filter_map(|d| serde_json::from_str::<serde_json::Value>(d).ok())
.find(|v| v["type"] == "message_delta")
.expect("message_delta event present");
assert_eq!(message_delta["delta"]["stop_reason"], "end_turn");
assert_eq!(message_delta["usage"]["output_tokens"], 4);
}
/// #24: an upstream usage frame (stream_options include_usage shape)
/// rides into message_delta as input/output token counts.
#[tokio::test]
async fn test_anthropic_streaming_usage_propagation() {
let mock_url = common::spawn_streaming_mock_neuron_with_usage(
3,
std::time::Duration::from_millis(10),
225,
42,
)
.await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let body = client
.post(format!("{gw_url}/v1/messages"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"max_tokens": 64,
"stream": true,
"messages": [{"role": "user", "content": "Hi"}]
}))
.send()
.await
.expect("request should succeed")
.text()
.await
.expect("stream should complete");
let message_delta = body
.lines()
.filter_map(|l| l.strip_prefix("data: "))
.filter_map(|d| serde_json::from_str::<serde_json::Value>(d).ok())
.find(|v| v["type"] == "message_delta")
.expect("message_delta event present");
assert_eq!(message_delta["usage"]["output_tokens"], 42);
assert_eq!(message_delta["usage"]["input_tokens"], 225);
}
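The event sequence asserted in the streaming test follows a fixed shape for a text-only reply: two opening events, one `content_block_delta` per upstream text chunk, and three closing events. The sketch below spells that shape out with the same line-based extraction the tests use; both function names are illustrative.

```rust
/// Event names a translated text-only stream is expected to carry,
/// given the number of upstream text deltas (as asserted in the test
/// above for four deltas).
fn expected_event_names(text_deltas: usize) -> Vec<&'static str> {
    let mut names = vec!["message_start", "content_block_start"];
    names.extend(std::iter::repeat("content_block_delta").take(text_deltas));
    names.extend(["content_block_stop", "message_delta", "message_stop"]);
    names
}

/// Extract the `event:` names from a raw SSE body, line by line.
fn event_names(body: &str) -> Vec<&str> {
    body.lines()
        .filter_map(|line| line.strip_prefix("event: "))
        .collect()
}
```

Comparing `event_names(&body)` against `expected_event_names(n)` keeps such a test readable when the mock's chunk count changes.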


@@ -0,0 +1,272 @@
//! Integration tests for API-key auth + principal resolution (#49).
//!
//! Verifies the #63 rejection contract (401 invalid_api_key via the #60
//! envelope) and that an authenticated request reaches neuron carrying the
//! internal principal headers — while a client-supplied principal header is
//! stripped (anti-spoofing).
use axum::Json;
use axum::extract::Path;
use axum::http::HeaderMap;
use axum::routing::{get, post};
use cortex_core::config::{
ApiKeyConfig, EntitlementsConfig, EvictionSettings, EvictionStrategy, GatewayConfig,
GatewaySettings, NeuronEndpoint,
};
use cortex_core::entitlements::{CapWindow, HEADER_ACCOUNT_ID, HEADER_KEY_ID};
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use serde_json::{Value, json};
use std::sync::{Arc, Mutex};
use tokio::net::TcpListener;
/// What the mock neuron observed on the inbound `/v1/chat/completions`
/// request: the principal headers cortex stamped (or didn't).
#[derive(Default)]
struct Seen {
account_id: Option<String>,
key_id: Option<String>,
}
/// Spawn a mock neuron that records the principal headers it receives and
/// returns a trivial chat completion. Returns (base_url, observed).
async fn spawn_capturing_neuron() -> (String, Arc<Mutex<Seen>>) {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let seen: Arc<Mutex<Seen>> = Arc::new(Mutex::new(Seen::default()));
let sink = Arc::clone(&seen);
let app = axum::Router::new()
.route(
"/models/{model_id}/endpoint",
get(move |Path(_): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({ "url": url })) }
}),
)
.route(
"/v1/chat/completions",
post(move |headers: HeaderMap, Json(body): Json<Value>| {
let sink = Arc::clone(&sink);
async move {
{
let mut s = sink.lock().unwrap();
s.account_id = headers
.get(HEADER_ACCOUNT_ID)
.and_then(|v| v.to_str().ok())
.map(str::to_string);
s.key_id = headers
.get(HEADER_KEY_ID)
.and_then(|v| v.to_str().ok())
.map(str::to_string);
}
let model = body.get("model").and_then(Value::as_str).unwrap_or("m");
Json(json!({
"id": "chatcmpl-auth-001",
"object": "chat.completion",
"created": 1700000000_u64,
"model": model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "ok"},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 3, "completion_tokens": 1, "total_tokens": 4}
}))
}
}),
)
.with_state(());
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(base_url, seen)
}
/// Spawn a gateway with the given entitlements config, a single neuron, and
/// `test-model` seeded as loaded (build_app spawns no poller).
async fn spawn_gateway(neuron_url: &str, entitlements: EntitlementsConfig) -> String {
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: neuron_url.to_string(),
}],
models_config: "/dev/null".into(),
entitlements,
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").unwrap();
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
format!("http://{addr}")
}
fn one_key_config(require_auth: bool) -> EntitlementsConfig {
EntitlementsConfig {
require_auth,
keys: vec![ApiKeyConfig {
key: "sk-good".into(),
account_id: "acct-1".into(),
key_id: Some("key-1".into()),
hard_cap: None,
window: CapWindow::Balance,
}],
}
}
fn chat_body() -> Value {
json!({
"model": "test-model",
"messages": [{"role": "user", "content": "hi"}]
})
}
#[tokio::test]
async fn missing_key_when_required_is_401_invalid_api_key() {
let (neuron, _seen) = spawn_capturing_neuron().await;
let gateway = spawn_gateway(&neuron, one_key_config(true)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.json(&chat_body())
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::UNAUTHORIZED);
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "invalid_api_key");
assert_eq!(body["error"]["type"], "invalid_request_error");
}
#[tokio::test]
async fn unrecognized_key_is_ignored_when_auth_not_required() {
let (neuron, seen) = spawn_capturing_neuron().await;
// allow-anonymous mode: a placeholder/unknown bearer (as opencode,
// Open WebUI, Agent Zero, litellm all send by default) must NOT be
// rejected — it's ignored and the request is served anonymously.
let gateway = spawn_gateway(&neuron, one_key_config(false)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-dummy-placeholder")
.json(&chat_body())
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
let _ = resp.bytes().await.unwrap();
// Served, but anonymous — no principal stamped from the bogus key.
assert!(seen.lock().unwrap().account_id.is_none());
}
#[tokio::test]
async fn invalid_key_is_401_when_auth_required() {
let (neuron, seen) = spawn_capturing_neuron().await;
// With auth required, a present-but-wrong credential is rejected.
let gateway = spawn_gateway(&neuron, one_key_config(true)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-wrong")
.json(&chat_body())
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::UNAUTHORIZED);
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "invalid_api_key");
// Rejected before dispatch: the neuron recorded no principal headers.
assert!(seen.lock().unwrap().account_id.is_none());
}
#[tokio::test]
async fn valid_key_reaches_neuron_with_principal_headers() {
let (neuron, seen) = spawn_capturing_neuron().await;
let gateway = spawn_gateway(&neuron, one_key_config(true)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-good")
// A spoofed principal header must be stripped, not forwarded.
.header(HEADER_ACCOUNT_ID, "attacker")
.json(&chat_body())
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
let s = seen.lock().unwrap();
assert_eq!(s.account_id.as_deref(), Some("acct-1"));
assert_eq!(s.key_id.as_deref(), Some("key-1"));
}
#[tokio::test]
async fn anonymous_allowed_when_auth_not_required() {
let (neuron, seen) = spawn_capturing_neuron().await;
let gateway = spawn_gateway(&neuron, EntitlementsConfig::default()).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.json(&chat_body())
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
// No principal resolved → no principal headers stamped.
let s = seen.lock().unwrap();
assert!(s.account_id.is_none());
assert!(s.key_id.is_none());
}
#[tokio::test]
async fn health_is_public_even_when_auth_required() {
let (neuron, _seen) = spawn_capturing_neuron().await;
let gateway = spawn_gateway(&neuron, one_key_config(true)).await;
let resp = reqwest::Client::new()
.get(format!("{gateway}/health"))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
}
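
The auth tests above pin down a three-way decision: a recognized key is authenticated, an unrecognized or missing key is served anonymously when `require_auth` is false, and the same request is refused with 401 `invalid_api_key` when `require_auth` is true. A minimal std-only sketch of that decision and of the anti-spoofing header handling follows. It is not the gateway's actual middleware: `AuthDecision`, `decide`, `outbound_headers` and both header-name constants are hypothetical names introduced for illustration (the real values of `HEADER_ACCOUNT_ID` and `HEADER_KEY_ID` are not shown in this diff).

```rust
use std::collections::HashMap;

/// Identity resolved from a recognized API key.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Principal {
    pub account_id: String,
    pub key_id: String,
}

/// Outcome of the (hypothetical) auth decision.
#[derive(Debug, PartialEq, Eq)]
pub enum AuthDecision {
    /// Recognized key: forward with principal headers stamped.
    Authenticated(Principal),
    /// No usable credential, but anonymous access is allowed.
    Anonymous,
    /// Refuse with 401 `invalid_api_key`.
    Rejected,
}

/// Decide how to treat a request given the configured keys.
/// An unknown bearer is ignored (not rejected) unless auth is required.
pub fn decide(
    require_auth: bool,
    bearer: Option<&str>,
    keys: &HashMap<String, Principal>,
) -> AuthDecision {
    match bearer.and_then(|token| keys.get(token)) {
        Some(principal) => AuthDecision::Authenticated(principal.clone()),
        None if require_auth => AuthDecision::Rejected,
        None => AuthDecision::Anonymous,
    }
}

// Placeholder header names (assumption): the real constants live in
// cortex_core::entitlements and their values are not visible here.
pub const ACCOUNT_HEADER: &str = "x-example-account-id";
pub const KEY_HEADER: &str = "x-example-key-id";

/// Build the header list forwarded upstream: always drop any
/// client-supplied principal headers, then stamp the resolved ones.
pub fn outbound_headers(
    inbound: &[(String, String)],
    decision: &AuthDecision,
) -> Vec<(String, String)> {
    let mut out: Vec<(String, String)> = inbound
        .iter()
        .filter(|(name, _)| {
            !name.eq_ignore_ascii_case(ACCOUNT_HEADER) && !name.eq_ignore_ascii_case(KEY_HEADER)
        })
        .cloned()
        .collect();
    if let AuthDecision::Authenticated(p) = decision {
        out.push((ACCOUNT_HEADER.to_string(), p.account_id.clone()));
        out.push((KEY_HEADER.to_string(), p.key_id.clone()));
    }
    out
}
```

Stripping before stamping means a spoofed header can never survive, whether or not the caller authenticated, which is the property `valid_key_reaches_neuron_with_principal_headers` checks.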

@@ -0,0 +1,253 @@
//! Integration tests for budget enforcement (#52) — the A0 seatbelt.
//!
//! A reservation over the key's hard cap is refused *before* neuron is hit,
//! with the #63 code matching the cap-window semantics: `rate_limit_exceeded`
//! plus `Retry-After` for a resetting window, `insufficient_quota` for a hard
//! balance. Spend never exceeds the cap. No 402, ever.
use axum::Json;
use axum::extract::Path;
use axum::routing::{get, post};
use cortex_core::config::{
ApiKeyConfig, EntitlementsConfig, EvictionSettings, EvictionStrategy, GatewayConfig,
GatewaySettings, NeuronEndpoint,
};
use cortex_core::entitlements::{CapWindow, Principal};
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use serde_json::{Value, json};
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::net::TcpListener;
/// Mock neuron with a hit counter on the inference path, so a test can prove
/// a request was (or wasn't) dispatched.
async fn spawn_counting_neuron() -> (String, Arc<AtomicU64>) {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let hits = Arc::new(AtomicU64::new(0));
let sink = Arc::clone(&hits);
let app = axum::Router::new()
.route(
"/models/{model_id}/endpoint",
get(move |Path(_): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({ "url": url })) }
}),
)
.route(
"/v1/chat/completions",
post(move |Json(body): Json<Value>| {
let sink = Arc::clone(&sink);
async move {
sink.fetch_add(1, Ordering::SeqCst);
let model = body.get("model").and_then(Value::as_str).unwrap_or("m");
Json(json!({
"id": "chatcmpl-budget",
"object": "chat.completion",
"created": 1700000000_u64,
"model": model,
"choices": [{"index": 0, "message": {"role": "assistant", "content": "ok"}, "finish_reason": "stop"}],
"usage": {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
}))
}
}),
);
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(base_url, hits)
}
async fn spawn_gateway(neuron_url: &str, key: ApiKeyConfig) -> (Arc<CortexState>, String) {
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: neuron_url.to_string(),
}],
models_config: "/dev/null".into(),
entitlements: EntitlementsConfig {
require_auth: true,
keys: vec![key],
},
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").unwrap();
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(fleet, format!("http://{addr}"))
}
fn key(window: CapWindow, hard_cap: u64) -> ApiKeyConfig {
ApiKeyConfig {
key: "sk-cap".into(),
account_id: "acct-cap".into(),
key_id: Some("key-cap".into()),
hard_cap: Some(hard_cap),
window,
}
}
fn chat(max_tokens: u64) -> Value {
json!({
"model": "test-model",
"max_tokens": max_tokens,
"messages": [{"role": "user", "content": "hi"}]
})
}
#[tokio::test]
async fn balance_over_cap_is_429_insufficient_quota_before_dispatch() {
let (neuron, hits) = spawn_counting_neuron().await;
// Cap far below a single request's reservation (max_tokens 1000).
let (_fleet, gateway) = spawn_gateway(&neuron, key(CapWindow::Balance, 10)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-cap")
.json(&chat(1000))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::TOO_MANY_REQUESTS);
// Hard balance → no Retry-After.
assert!(resp.headers().get(reqwest::header::RETRY_AFTER).is_none());
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "insufficient_quota");
// Refused before dispatch — neuron never saw it.
assert_eq!(hits.load(Ordering::SeqCst), 0);
}
#[tokio::test]
async fn rolling_over_cap_is_429_rate_limited_with_retry_after() {
let (neuron, hits) = spawn_counting_neuron().await;
let (_fleet, gateway) =
spawn_gateway(&neuron, key(CapWindow::Rolling { seconds: 3600 }, 10)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-cap")
.json(&chat(1000))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::TOO_MANY_REQUESTS);
let retry = resp
.headers()
.get(reqwest::header::RETRY_AFTER)
.expect("rolling-window rejection must carry Retry-After");
assert!(retry.to_str().unwrap().parse::<u64>().unwrap() >= 1);
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "rate_limit_exceeded");
assert_eq!(hits.load(Ordering::SeqCst), 0);
}
#[tokio::test]
async fn within_cap_is_served() {
let (neuron, hits) = spawn_counting_neuron().await;
let (_fleet, gateway) = spawn_gateway(&neuron, key(CapWindow::Balance, 1_000_000)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-cap")
.json(&chat(50))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
let _ = resp.bytes().await.unwrap();
assert_eq!(hits.load(Ordering::SeqCst), 1);
}
#[tokio::test]
async fn a0_seatbelt_caps_a_runaway_fan_out() {
// An Agent-Zero-style key with a modest cap: a burst of requests drains
// it, then further requests are refused — the account stops draining and
// spend never exceeds the cap.
let (neuron, hits) = spawn_counting_neuron().await;
let (fleet, gateway) = spawn_gateway(&neuron, key(CapWindow::Balance, 100)).await;
let client = reqwest::Client::new();
let mut ok = 0;
let mut refused = 0;
for _ in 0..20 {
let resp = client
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-cap")
.json(&chat(20))
.send()
.await
.unwrap();
match resp.status() {
reqwest::StatusCode::OK => {
ok += 1;
let _ = resp.bytes().await.unwrap();
}
reqwest::StatusCode::TOO_MANY_REQUESTS => {
refused += 1;
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "insufficient_quota");
}
other => panic!("unexpected status {other}"),
}
}
assert!(ok >= 1, "some requests should be served");
assert!(refused >= 1, "the cap must eventually refuse the fan-out");
assert_eq!(
hits.load(Ordering::SeqCst),
ok,
"refused requests never dispatched"
);
// Spend never exceeded the hard cap (reservation prevents overshoot).
// Poll briefly for in-flight settles to land.
let principal = Principal {
account_id: "acct-cap".into(),
key_id: "key-cap".into(),
};
for _ in 0..50 {
let snap = fleet.entitlements.snapshot(&principal).await.unwrap();
if snap.reserved == 0 {
break;
}
tokio::time::sleep(std::time::Duration::from_millis(20)).await;
}
let snap = fleet.entitlements.snapshot(&principal).await.unwrap();
assert!(snap.spent <= 100, "spent {} exceeded cap", snap.spent);
}
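
The budget tests above rely on a reserve-then-settle pattern: a request's worst-case cost is reserved before dispatch, refused if it would cross the hard cap, and replaced by the actual usage once the response completes. The sketch below illustrates that pattern with the standard library only. It is an assumption-laden illustration, not the `EntitlementProvider` implementation: `Ledger`, `Refusal`, `reserve` and `settle` are hypothetical names, the reservation is taken to be the request's `max_tokens`, settled spend is clamped to the reservation, and the rolling window's `Retry-After` is simply the full window length.

```rust
/// How a key's cap resets (the two variants the tests exercise).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CapWindow {
    Balance,
    Rolling { seconds: u64 },
}

/// Why a reservation was refused; both map to HTTP 429 in the tests.
#[derive(Debug, PartialEq, Eq)]
pub enum Refusal {
    /// Hard balance exhausted: `insufficient_quota`, no Retry-After.
    InsufficientQuota,
    /// Resetting window exhausted: `rate_limit_exceeded` + Retry-After.
    RateLimited { retry_after_secs: u64 },
}

/// Hypothetical per-key ledger tracking settled and in-flight tokens.
#[derive(Debug)]
pub struct Ledger {
    pub hard_cap: u64,
    pub window: CapWindow,
    pub spent: u64,
    pub reserved: u64,
}

impl Ledger {
    pub fn new(hard_cap: u64, window: CapWindow) -> Self {
        Ledger { hard_cap, window, spent: 0, reserved: 0 }
    }

    /// Reserve `estimate` tokens before dispatch, or refuse if
    /// spent + reserved + estimate would exceed the cap.
    pub fn reserve(&mut self, estimate: u64) -> Result<u64, Refusal> {
        let committed = self.spent.saturating_add(self.reserved);
        if committed.saturating_add(estimate) > self.hard_cap {
            return Err(match self.window {
                CapWindow::Balance => Refusal::InsufficientQuota,
                // Assumption: advise waiting one full window, at least 1s.
                CapWindow::Rolling { seconds } => Refusal::RateLimited {
                    retry_after_secs: seconds.max(1),
                },
            });
        }
        self.reserved += estimate;
        Ok(estimate)
    }

    /// Release the reservation and record the actual usage.
    /// Clamping to the reservation keeps `spent` at or under the cap.
    pub fn settle(&mut self, reservation: u64, actual: u64) {
        self.reserved = self.reserved.saturating_sub(reservation);
        self.spent = self.spent.saturating_add(actual.min(reservation));
    }
}
```

Because in-flight reservations count against the cap, a burst of concurrent requests cannot collectively overshoot it, which is the invariant `a0_seatbelt_caps_a_runaway_fan_out` asserts with `snap.spent <= 100`.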

@@ -22,6 +22,7 @@ use tokio::net::TcpListener;
/// - GET /models/:id/endpoint (returns the inference URL)
/// - POST /models/unload (accepts unload requests)
/// - GET /v1/chat/completions + POST /v1/chat/completions (inference)
///
/// Returns the neuron base URL.
pub async fn spawn_mock_neuron() -> String {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
@@ -43,6 +44,7 @@ pub async fn spawn_mock_neuron() -> String {
post(|Json(_body): Json<Value>| async { Json(json!({"status": "unloaded"})) }),
)
.route("/v1/chat/completions", post(mock_chat_completions))
.route("/v1/responses", post(mock_responses))
.route("/v1/models", get(mock_v1_models));
tokio::spawn(async move {
@@ -52,9 +54,64 @@ pub async fn spawn_mock_neuron() -> String {
base_url
}
/// Like [`spawn_mock_neuron`] but captures the JSON body of every
/// `POST /v1/chat/completions` it receives into the returned handle, so
/// a test can assert what the gateway *actually forwarded upstream*
/// (e.g. that Anthropic-shaped tools were reshaped to OpenAI form).
pub async fn spawn_capturing_mock_neuron() -> (String, Arc<std::sync::Mutex<Vec<Value>>>) {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let captured: Arc<std::sync::Mutex<Vec<Value>>> = Arc::new(std::sync::Mutex::new(Vec::new()));
let sink = captured.clone();
let app = Router::new()
.route("/models", get(mock_neuron_list_models))
.route(
"/models/{model_id}/endpoint",
get(move |Path(_): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({"url": url})) }
}),
)
.route(
"/v1/chat/completions",
post(move |Json(body): Json<Value>| {
let sink = sink.clone();
async move {
let model = body
.get("model")
.and_then(|v| v.as_str())
.unwrap_or("unknown");
let resp = json!({
"id": "chatcmpl-capture-001",
"object": "chat.completion",
"created": 1700000000_u64,
"model": model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Hello from mock backend"},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
});
sink.lock().unwrap().push(body);
Json(resp)
}
}),
);
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(base_url, captured)
}
async fn mock_neuron_list_models() -> Json<Value> {
Json(json!([
{"id": "test-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000, "capabilities": ["text"], "tool_call": false, "reasoning": false}
]))
}
@@ -92,6 +149,39 @@ async fn mock_chat_completions(Json(body): Json<Value>) -> Json<Value> {
}))
}
async fn mock_responses(Json(body): Json<Value>) -> Json<Value> {
let model = body
.get("model")
.and_then(|v| v.as_str())
.unwrap_or("unknown");
// Echo the model field back and synthesise a tiny ResponsesResponse.
// Mirrors the shape neuron's /v1/responses handler emits so the
// gateway test only needs to assert the proxy round-tripped it.
Json(json!({
"id": "resp-test-001",
"object": "response",
"created_at": 1700000000_u64,
"status": "completed",
"model": model,
"output": [{
"type": "message",
"id": "msg-test-001",
"role": "assistant",
"content": [{
"type": "output_text",
"text": "Hello from mock backend",
"annotations": []
}],
"status": "completed"
}],
"usage": {
"input_tokens": 5,
"output_tokens": 5,
"total_tokens": 10
}
}))
}
/// Spawns a mock neuron that returns SSE streaming responses for chat completions.
pub async fn spawn_streaming_mock_neuron(chunk_count: usize, chunk_delay: Duration) -> String {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
@@ -161,8 +251,120 @@ pub async fn spawn_streaming_mock_neuron(chunk_count: usize, chunk_delay: Durati
base_url
}
/// Like `spawn_streaming_mock_neuron`, but the stream ends with an
/// OpenAI `stream_options.include_usage`-style final chunk (empty
/// choices + usage object) before `[DONE]` — the shape the gateway's
/// token metrics (#21) extract counts from.
pub async fn spawn_streaming_mock_neuron_with_usage(
chunk_count: usize,
chunk_delay: Duration,
prompt_tokens: u64,
completion_tokens: u64,
) -> String {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let app = Router::new()
.route("/models", get(mock_neuron_list_models))
.route(
"/models/{model_id}/endpoint",
get(move |Path(_model_id): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({"url": url})) }
}),
)
.route(
"/v1/chat/completions",
post(move |Json(body): Json<Value>| async move {
let model = body
.get("model")
.and_then(|v| v.as_str())
.unwrap_or("unknown")
.to_string();
let mut chunks: Vec<String> = (0..chunk_count)
.map(|i| {
let chunk = json!({
"id": "chatcmpl-stream-002",
"object": "chat.completion.chunk",
"created": 1700000000_u64,
"model": model,
"choices": [{
"index": 0,
"delta": { "content": format!("token{i}") },
"finish_reason": null
}]
});
format!("data: {chunk}\n\n")
})
.collect();
let usage_chunk = json!({
"id": "chatcmpl-stream-002",
"object": "chat.completion.chunk",
"created": 1700000000_u64,
"model": model,
"choices": [],
"usage": {
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens
}
});
chunks.push(format!("data: {usage_chunk}\n\n"));
chunks.push("data: [DONE]\n\n".to_string());
let delay = chunk_delay;
let stream = stream::iter(chunks).then(move |chunk| async move {
tokio::time::sleep(delay).await;
Ok::<_, std::convert::Infallible>(chunk)
});
Response::builder()
.header(header::CONTENT_TYPE, "text/event-stream")
.header(header::CACHE_CONTROL, "no-cache")
.body(Body::from_stream(stream))
.unwrap()
}),
);
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
base_url
}
/// Spawns a mock neuron with a custom models list.
pub async fn spawn_mock_neuron_with_models(models_response: Value) -> String {
spawn_mock_neuron_with_models_and_health(models_response, default_health_response()).await
}
/// Default `/health` response used by mocks that don't care about the
/// activation field — empty devices, no in-flight pre-warm, state=ready.
pub fn default_health_response() -> Value {
json!({
"uptime_secs": 0,
"devices": [],
"activation": {
"state": "ready",
"pending": [],
"in_progress": null,
"completed": [],
"failed": []
}
})
}
/// Variant of `spawn_mock_neuron_with_models` that also serves a
/// `/health` body. Used by tests that drive the gateway's activation
/// surface (poller reading /health, /v1/models synthesising Loading
/// locations from in_progress / pending).
pub async fn spawn_mock_neuron_with_models_and_health(
models_response: Value,
health_response: Value,
) -> String {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
@@ -176,6 +378,13 @@ pub async fn spawn_mock_neuron_with_models(models_response: Value) -> String {
async move { Json(resp) }
}),
)
.route(
"/health",
get(move || {
let resp = health_response.clone();
async move { Json(resp) }
}),
)
.route(
"/models/{model_id}/endpoint",
get(move |Path(_model_id): Path<String>| {
@@ -220,6 +429,7 @@ pub async fn spawn_gateway_with_state(mock_url: &str) -> (Arc<CortexState>, Stri
endpoint: mock_url.to_string(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -236,6 +446,10 @@ pub async fn spawn_gateway_with_state(mock_url: &str) -> (Arc<CortexState>, Stri
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}

@@ -0,0 +1,140 @@
mod common;
use serde_json::json;
#[tokio::test]
async fn error_response_model_not_found() {
let neuron_url = common::spawn_mock_neuron().await;
let gateway_url = common::spawn_gateway(&neuron_url).await;
let client = reqwest::Client::new();
// Request a model that isn't loaded on the mock neuron.
let resp = client
.post(format!("{gateway_url}/v1/chat/completions"))
.header("Content-Type", "application/json")
.json(&json!({
"model": "nonexistent-model",
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), axum::http::StatusCode::NOT_FOUND);
let body: serde_json::Value = resp.json().await.expect("valid json");
let err = body.get("error").expect("response has error object");
// Broad type categorization
assert_eq!(err.get("type").unwrap(), "invalid_request_error");
// Specific machine-readable code
assert_eq!(
err.get("code").unwrap().as_str().unwrap(),
"model_not_found"
);
// param is always null
assert!(err.get("param").unwrap().is_null());
}
#[tokio::test]
async fn error_response_missing_model_field() {
let neuron_url = common::spawn_mock_neuron().await;
let gateway_url = common::spawn_gateway(&neuron_url).await;
let client = reqwest::Client::new();
// Request without the required `model` field.
let resp = client
.post(format!("{gateway_url}/v1/chat/completions"))
.header("Content-Type", "application/json")
.json(&json!({
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), axum::http::StatusCode::BAD_REQUEST);
let body: serde_json::Value = resp.json().await.expect("valid json");
let err = body.get("error").expect("response has error object");
assert_eq!(err.get("type").unwrap(), "invalid_request_error");
assert_eq!(
err.get("code").unwrap().as_str().unwrap(),
"missing_model_field"
);
assert!(err.get("param").unwrap().is_null());
}
#[tokio::test]
async fn error_response_no_healthy_nodes() {
use cortex_core::config::{EvictionSettings, GatewayConfig, GatewaySettings, NeuronEndpoint};
use std::sync::Arc;
// Create a gateway config with a neuron pointing at an unreachable port so no node is ever healthy.
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: cortex_core::config::EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "dead-node".into(),
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(cortex_gateway::state::CortexState::from_config(&config));
let app = cortex_gateway::build_app(fleet);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
// Brief pause before sending; the node behind the unreachable port is never marked healthy.
tokio::time::sleep(std::time::Duration::from_millis(200)).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("http://{addr}/v1/chat/completions"))
.header("Content-Type", "application/json")
.json(&json!({
"model": "any-model",
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), axum::http::StatusCode::SERVICE_UNAVAILABLE);
// Transient 503 — the gateway advertises Retry-After so OpenAI-compatible
// clients back off and retry rather than surfacing an opaque error (#63).
let retry_after = resp
.headers()
.get(reqwest::header::RETRY_AFTER)
.expect("transient 503 must carry Retry-After")
.to_str()
.unwrap()
.to_string();
assert_eq!(retry_after, "5");
let body: serde_json::Value = resp.json().await.expect("valid json");
let err = body.get("error").expect("response has error object");
assert_eq!(err.get("type").unwrap(), "api_error");
assert_eq!(
err.get("code").unwrap().as_str().unwrap(),
"service_unavailable"
);
assert!(err.get("param").unwrap().is_null());
}
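
The three error tests above fix the shape of the OpenAI-style envelope: an `error` object with `message`, `type`, `code` and a null `param`, plus a `Retry-After` header on the transient 503. One way such a mapping could be expressed is sketched below, using only the standard library. `GatewayError`, `Envelope` and `to_json` are hypothetical names, not the gateway's own types, and the hand-rolled JSON escaping covers only quotes and backslashes; a real implementation would more likely serialize with `serde_json`.

```rust
/// Hypothetical error kinds matching the three cases tested above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GatewayError {
    ModelNotFound,
    MissingModelField,
    NoHealthyNodes,
}

/// Status, envelope fields, and optional Retry-After for one error.
#[derive(Debug, PartialEq, Eq)]
pub struct Envelope {
    pub status: u16,
    pub error_type: &'static str,
    pub code: &'static str,
    pub retry_after_secs: Option<u64>,
}

impl GatewayError {
    /// Map an error kind to the values the tests assert on.
    pub fn envelope(self) -> Envelope {
        match self {
            GatewayError::ModelNotFound => Envelope {
                status: 404,
                error_type: "invalid_request_error",
                code: "model_not_found",
                retry_after_secs: None,
            },
            GatewayError::MissingModelField => Envelope {
                status: 400,
                error_type: "invalid_request_error",
                code: "missing_model_field",
                retry_after_secs: None,
            },
            // Transient: clients are told to back off and retry.
            GatewayError::NoHealthyNodes => Envelope {
                status: 503,
                error_type: "api_error",
                code: "service_unavailable",
                retry_after_secs: Some(5),
            },
        }
    }
}

impl Envelope {
    /// Render the body. Escaping is deliberately minimal (quotes and
    /// backslashes only); control characters are not handled.
    pub fn to_json(&self, message: &str) -> String {
        let escaped = message.replace('\\', "\\\\").replace('"', "\\\"");
        format!(
            "{{\"error\":{{\"message\":\"{}\",\"type\":\"{}\",\"code\":\"{}\",\"param\":null}}}}",
            escaped, self.error_type, self.code
        )
    }
}
```

Keeping the status, type, code and retry hint in one table makes it harder for a handler to emit, say, a 503 without the `Retry-After` that the test expects.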

@@ -71,6 +71,7 @@ fn make_fleet(endpoint: &str, defrag_after: u32) -> Arc<CortexState> {
endpoint: endpoint.to_string(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
Arc::new(CortexState::from_config(&config))
}
@@ -91,6 +92,10 @@ async fn test_evict_lru_model() {
status: ModelStatus::Loaded,
last_accessed: Some(Utc::now() - chrono::Duration::hours(2)),
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
node.models.insert(
@@ -100,6 +105,10 @@ async fn test_evict_lru_model() {
status: ModelStatus::Loaded,
last_accessed: Some(Utc::now()),
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -163,6 +172,10 @@ async fn test_eviction_increments_lifecycle_cycles() {
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}

@@ -0,0 +1,189 @@
//! Load-aware routing across replicas (#55).
//!
//! When a model is loaded on more than one healthy neuron, the router picks
//! the least-busy replica using the per-model admission load each neuron
//! reports on `GET /health` (#53), rather than always taking the first.
mod common;
use axum::Json;
use axum::extract::Path;
use axum::http::{StatusCode, header};
use axum::response::IntoResponse;
use axum::routing::{get, post};
use cortex_core::config::{
EvictionSettings, EvictionStrategy, GatewayConfig, GatewaySettings, NeuronEndpoint,
};
use cortex_core::discovery::ModelLoad;
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use serde_json::{Value, json};
use std::sync::Arc;
use tokio::net::TcpListener;
/// Seed a node as healthy with `test-model` loaded and a given admission load.
async fn seed_loaded(fleet: &CortexState, node: &str, in_flight: usize, queue_depth: usize) {
let mut nodes = fleet.nodes.write().await;
let n = nodes.get_mut(node).expect("node exists");
n.healthy = true;
n.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
n.model_load.insert(
"test-model".into(),
ModelLoad {
id: "test-model".into(),
in_flight,
queue_depth,
},
);
}
/// Build a gateway state over two mock neurons (no poller; we seed state).
async fn two_neuron_fleet(endpoint_a: &str, endpoint_b: &str) -> Arc<CortexState> {
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![
NeuronEndpoint {
name: "node-a".into(),
endpoint: endpoint_a.to_string(),
},
NeuronEndpoint {
name: "node-b".into(),
endpoint: endpoint_b.to_string(),
},
],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
Arc::new(CortexState::from_config(&config))
}
#[tokio::test]
async fn routes_to_least_busy_replica() {
let neuron_a = common::spawn_mock_neuron().await;
let neuron_b = common::spawn_mock_neuron().await;
let fleet = two_neuron_fleet(&neuron_a, &neuron_b).await;
// A is busy (1 running + 3 queued), B is idle.
seed_loaded(&fleet, "node-a", 1, 3).await;
seed_loaded(&fleet, "node-b", 0, 0).await;
let route = cortex_gateway::router::resolve(&fleet, "test-model")
.await
.expect("model is loaded on both nodes");
assert_eq!(route.node_name, "node-b", "should pick the idle replica");
// Flip the load: now B is the busy one.
seed_loaded(&fleet, "node-a", 0, 0).await;
seed_loaded(&fleet, "node-b", 1, 5).await;
let route = cortex_gateway::router::resolve(&fleet, "test-model")
.await
.expect("still loaded");
assert_eq!(route.node_name, "node-a", "should follow the lighter load");
}
/// Mock neuron whose inference endpoint always returns a #63 backpressure
/// envelope (503 + Retry-After) — simulating a saturated neuron.
async fn spawn_busy_neuron() -> String {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let app = axum::Router::new()
.route(
"/models/{model_id}/endpoint",
get(move |Path(_): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({ "url": url })) }
}),
)
.route(
"/v1/chat/completions",
post(|| async {
let body = json!({"error": {
"message": "model is busy (admission queue full); retry shortly",
"type": "rate_limit_error",
"code": "rate_limit_exceeded",
"param": null
}});
(
StatusCode::SERVICE_UNAVAILABLE,
[(header::RETRY_AFTER, "6")],
Json(body),
)
.into_response()
}),
);
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
base_url
}
#[tokio::test]
async fn neuron_backpressure_is_propagated_intact() {
// A saturated neuron's 503 + Retry-After + envelope must reach the client
// verbatim — not unwrapped, remapped, or stripped (#55 / #63).
let neuron = spawn_busy_neuron().await;
let fleet = two_neuron_fleet(&neuron, &neuron).await;
seed_loaded(&fleet, "node-a", 1, 8).await;
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let resp = reqwest::Client::new()
.post(format!("http://{addr}/v1/chat/completions"))
.json(&json!({"model": "test-model", "messages": [{"role": "user", "content": "hi"}]}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::SERVICE_UNAVAILABLE);
assert_eq!(
resp.headers()
.get(reqwest::header::RETRY_AFTER)
.and_then(|v| v.to_str().ok()),
Some("6"),
"Retry-After must survive the proxy"
);
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "rate_limit_exceeded");
}
#[tokio::test]
async fn ties_break_deterministically_by_name() {
let neuron_a = common::spawn_mock_neuron().await;
let neuron_b = common::spawn_mock_neuron().await;
let fleet = two_neuron_fleet(&neuron_a, &neuron_b).await;
// Equal load on both → stable pick (lowest node name).
seed_loaded(&fleet, "node-a", 0, 0).await;
seed_loaded(&fleet, "node-b", 0, 0).await;
let route = cortex_gateway::router::resolve(&fleet, "test-model")
.await
.expect("loaded");
assert_eq!(route.node_name, "node-a", "ties break by name");
}
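
The routing tests above describe the selection rule: among healthy replicas with the model loaded, pick the one with the lightest admission load, and break ties by the lowest node name. A minimal sketch of that rule follows. It is not the code behind `cortex_gateway::router::resolve`: `Replica` and `pick_least_busy` are hypothetical names, and treating load as `in_flight + queue_depth` is an assumption that happens to be consistent with the seeded values in the tests.

```rust
use std::cmp::Ordering;

/// Hypothetical view of one node's state for a given model.
#[derive(Debug, Clone)]
pub struct Replica {
    pub node_name: String,
    pub healthy: bool,
    pub in_flight: usize,
    pub queue_depth: usize,
}

impl Replica {
    /// Assumed load metric: running plus queued requests.
    fn load(&self) -> usize {
        self.in_flight + self.queue_depth
    }
}

/// Pick the healthy replica with the lowest load; on equal load the
/// lexicographically smallest node name wins, so the choice is stable.
pub fn pick_least_busy(replicas: &[Replica]) -> Option<&Replica> {
    replicas
        .iter()
        .filter(|r| r.healthy)
        .min_by(|a, b| -> Ordering {
            a.load()
                .cmp(&b.load())
                .then_with(|| a.node_name.cmp(&b.node_name))
        })
}
```

Comparing on the name as a secondary key avoids depending on map iteration order, which would otherwise make the tie case in `ties_break_deterministically_by_name` flaky.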

@@ -0,0 +1,207 @@
//! Integration tests for per-request token metering (#51).
//!
//! Drives authenticated requests through the gateway to a mock neuron that
//! reports a fixed `usage` object, then asserts the EntitlementProvider's
//! spend ledger reflects cumulative per-key spend and that reservations
//! settle to actual (no outstanding reserved tokens once requests complete).
mod common;
use cortex_core::config::{
ApiKeyConfig, EntitlementsConfig, EvictionSettings, EvictionStrategy, GatewayConfig,
GatewaySettings, NeuronEndpoint,
};
use cortex_core::entitlements::{CapWindow, Principal};
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use serde_json::json;
use std::sync::Arc;
use std::time::Duration;
use tokio::net::TcpListener;
const ACCOUNT: &str = "acct-meter";
const KEY_ID: &str = "key-meter";
const BEARER: &str = "sk-meter";
/// The mock neuron (common::spawn_mock_neuron) reports this fixed usage on
/// every chat completion.
const PROMPT_PER_REQ: u64 = 10;
const COMPLETION_PER_REQ: u64 = 5;
async fn spawn_metered_gateway(neuron_url: &str) -> (Arc<CortexState>, String) {
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: neuron_url.to_string(),
}],
models_config: "/dev/null".into(),
entitlements: EntitlementsConfig {
require_auth: true,
keys: vec![ApiKeyConfig {
key: BEARER.into(),
account_id: ACCOUNT.into(),
key_id: Some(KEY_ID.into()),
hard_cap: Some(1_000_000),
window: CapWindow::Balance,
}],
},
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").unwrap();
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(fleet, format!("http://{addr}"))
}
fn principal() -> Principal {
Principal {
account_id: ACCOUNT.into(),
key_id: KEY_ID.into(),
}
}
/// Poll the provider ledger until settled spend reaches `expected` (settle
/// runs in a spawned task after the response stream finishes) or time out.
async fn await_spent(fleet: &CortexState, expected: u64) -> u64 {
let principal = principal();
for _ in 0..100 {
let snap = fleet.entitlements.snapshot(&principal).await.unwrap();
if snap.spent >= expected {
return snap.spent;
}
tokio::time::sleep(Duration::from_millis(20)).await;
}
fleet.entitlements.snapshot(&principal).await.unwrap().spent
}
#[tokio::test]
async fn cumulative_spend_is_metered_per_key() {
let neuron = common::spawn_mock_neuron().await;
let (fleet, gateway) = spawn_metered_gateway(&neuron).await;
let client = reqwest::Client::new();
const N: u64 = 3;
for _ in 0..N {
let resp = client
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth(BEARER)
.json(&json!({"model": "test-model", "messages": [{"role": "user", "content": "hi"}]}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
// Drain the body so the response stream finishes and metering settles.
let _ = resp.bytes().await.unwrap();
}
let expected = N * (PROMPT_PER_REQ + COMPLETION_PER_REQ);
let spent = await_spent(&fleet, expected).await;
assert_eq!(
spent, expected,
"ledger must reflect cumulative per-key spend"
);
// Reservations settled to actual — nothing left outstanding.
let snap = fleet.entitlements.snapshot(&principal()).await.unwrap();
assert_eq!(snap.reserved, 0, "all reservations must settle/release");
assert_eq!(snap.hard_cap, Some(1_000_000));
}
#[tokio::test]
async fn anonymous_request_records_no_spend() {
// require_auth=false so the unauthenticated request is served, but with
// no principal it must not touch any ledger.
let neuron = common::spawn_mock_neuron().await;
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: neuron.clone(),
}],
models_config: "/dev/null".into(),
entitlements: EntitlementsConfig::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").unwrap();
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let resp = reqwest::Client::new()
.post(format!("http://{addr}/v1/chat/completions"))
.json(&json!({"model": "test-model", "messages": [{"role": "user", "content": "hi"}]}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
let _ = resp.bytes().await.unwrap();
// An unconfigured principal has a zeroed snapshot — nothing was metered.
let snap = fleet
.entitlements
.snapshot(&Principal {
account_id: "nobody".into(),
key_id: "nobody".into(),
})
.await
.unwrap();
assert_eq!(snap.spent, 0);
}
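
The reserve-then-settle behaviour these metering tests check (spend accumulates per key, and nothing stays reserved once requests complete) can be sketched as a small ledger. This is an illustration only: `Ledger`, `reserve`, and `settle` are hypothetical names, and the field names merely mirror the `spent`, `reserved`, and `hard_cap` snapshot fields read by the tests. The actual EntitlementProvider is not shown in this diff.

```rust
/// Hypothetical per-key ledger; fields mirror the snapshot fields the
/// tests read (`spent`, `reserved`, `hard_cap`).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Ledger {
    pub spent: u64,
    pub reserved: u64,
    pub hard_cap: Option<u64>,
}

impl Ledger {
    /// Hold `estimate` tokens before dispatch. The hold is refused if
    /// spent + reserved + estimate would exceed the hard cap.
    pub fn reserve(&mut self, estimate: u64) -> Result<(), &'static str> {
        let committed = self
            .spent
            .saturating_add(self.reserved)
            .saturating_add(estimate);
        if let Some(cap) = self.hard_cap {
            if committed > cap {
                return Err("hard cap exceeded");
            }
        }
        self.reserved += estimate;
        Ok(())
    }

    /// Release the hold and record what the backend actually reported.
    pub fn settle(&mut self, estimate: u64, actual: u64) {
        self.reserved = self.reserved.saturating_sub(estimate);
        self.spent = self.spent.saturating_add(actual);
    }
}
```

With the mock neuron's fixed usage of 10 prompt and 5 completion tokens, three settled requests leave `spent` at 45 and `reserved` at 0, which is the shape `cumulative_spend_is_metered_per_key` asserts.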

@@ -1,20 +1,26 @@
mod common;
use serde_json::json;
use std::sync::OnceLock;
/// The metrics recorder is a process-wide global; both tests in this
/// binary run against one shared install. Assertions must therefore be
/// order-independent (presence of names / monotonic counters, not
/// "empty before").
fn recorder() -> &'static metrics_exporter_prometheus::PrometheusHandle {
static HANDLE: OnceLock<metrics_exporter_prometheus::PrometheusHandle> = OnceLock::new();
HANDLE.get_or_init(|| {
cortex_gateway::metrics::install_test_recorder().expect("recorder should install")
})
}
#[tokio::test]
async fn test_metrics_emitted_after_proxy() {
let handle = cortex_gateway::metrics::install_test_recorder().expect("recorder should install");
let handle = recorder();
let mock_url = common::spawn_mock_neuron().await;
let gw_url = common::spawn_gateway(&mock_url).await;
let before = handle.render();
assert!(
!before.contains("cortex_requests_total"),
"no request metrics before any requests"
);
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/chat/completions"))
@@ -44,3 +50,72 @@ async fn test_metrics_emitted_after_proxy() {
"no errors expected for a successful request"
);
}
#[tokio::test]
async fn test_token_metrics_emitted_for_streamed_request() {
// #21: a streamed chat completion with a final usage chunk must
// produce TTFT + tok/s histograms and prompt/completion token
// counters, labelled with model and node. The recorder is global
// per-process and shared with test_metrics_emitted_after_proxy in
// this binary, so both tests render from the same recorder
// regardless of which one installs it first. Assert on metric
// names and lower bounds rather than exact totals.
let handle = recorder();
let mock_url = common::spawn_streaming_mock_neuron_with_usage(
5,
std::time::Duration::from_millis(40),
225,
42,
)
.await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/chat/completions"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"messages": [{"role": "user", "content": "Hi"}],
"stream": true
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 200);
let body = resp.text().await.expect("stream should complete");
assert!(body.contains("[DONE]"));
let rendered = handle.render();
for needle in [
"cortex_time_to_first_token_seconds",
"cortex_tokens_per_second",
] {
assert!(
rendered.contains(needle),
"{needle} should be present.\nMetrics:\n{rendered}"
);
}
// The recorder is shared with the sibling test (same model/node
// labels), so counters are lower bounds, not exact values: this
// request contributed prompt=225 / completion=42.
let counter_value = |name: &str| -> u64 {
rendered
.lines()
.find(|l| l.starts_with(name) && l.contains(r#"model="test-model""#))
.and_then(|l| l.rsplit(' ').next())
.and_then(|v| v.parse().ok())
.unwrap_or_else(|| panic!("{name} should be present.\nMetrics:\n{rendered}"))
};
assert!(
counter_value("cortex_prompt_tokens_total") >= 225,
"prompt token counter should include this request's 225.\nMetrics:\n{rendered}"
);
assert!(
counter_value("cortex_completion_tokens_total") >= 42,
"completion token counter should include this request's 42.\nMetrics:\n{rendered}"
);
}
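
The `counter_value` closure above reads a counter straight out of the rendered Prometheus text. A standalone version of the same idea is sketched below; the function signature and the extra skip of `#` comment lines are additions for illustration, not part of the test file.

```rust
/// Return the value of the first sample line for `name` that carries the
/// given label pair, parsed as an integer counter. Returns `None` when no
/// such line exists or the value is not an integer.
pub fn counter_value(rendered: &str, name: &str, label: &str) -> Option<u64> {
    rendered
        .lines()
        // Skip `# HELP` / `# TYPE` comment lines.
        .filter(|line| !line.starts_with('#'))
        .find(|line| line.starts_with(name) && line.contains(label))
        // The sample value is the last space-separated field.
        .and_then(|line| line.rsplit(' ').next())
        .and_then(|value| value.parse().ok())
}
```

Note that a prefix match on `name` would also match a longer metric name that starts with the same characters, so this only suits fixtures where the names are known not to collide.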

@@ -0,0 +1,132 @@
//! Issue #62 / #67: `GET /v1/models` advertises a per-model serving budget so
//! an OpenAI-compatible client (opencode's helexa provider) can size and
//! compact its context without hand-configuration.
//!
//! Asserts the composition sources land on the response:
//! - `limit` from the neuron's self-derived value (#67) — NOT the catalogue;
//! an operator-declared catalogue `limit` is deliberately ignored.
//! - `cost` from the catalogue profile (operator-set pricing).
//! - `tool_call` / `reasoning` from the neuron's runtime detection (OR-ed in).
//!
//! Also a regression guard for the removal of `max_model_len` — the misnamed,
//! unconsumed vLLM-ism that this contract replaces.
use cortex_core::config::{
EvictionSettings, EvictionStrategy, GatewayConfig, GatewaySettings, NeuronEndpoint,
};
use cortex_core::harness::ModelLimit;
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use std::sync::Arc;
use tokio::net::TcpListener;
#[tokio::test]
async fn v1_models_surfaces_limit_cost_and_capability_flags() {
// Catalogue declares pricing + an operator `limit` that must be IGNORED
// (#67): the neuron's self-derived limit is authoritative.
let models_toml = r#"
[[models]]
id = "test-model"
harness = "candle"
limit.context = 999999
limit.input = 999999
limit.output = 999999
cost.input = 0.0
cost.output = 0.0
capabilities = ["text"]
"#;
let cat_path = std::env::temp_dir().join("cortex_test_issue62_models.toml");
std::fs::write(&cat_path, models_toml).unwrap();
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
// Never contacted: build_app does not spawn the poller, so the
// seeded state below is authoritative for /v1/models.
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: cat_path.to_string_lossy().into_owned(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
// Seed the model as loaded on the node with runtime-detected flags set —
// these must OR into the catalogue entry, not be lost.
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").expect("node exists");
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: vec!["text".into()],
tool_call: true,
reasoning: true,
// Neuron's self-derived limit (#67) — the authoritative
// source. Distinct from the catalogue's (ignored) values.
limit: Some(ModelLimit {
context: 49152,
input: Some(40960),
output: 8192,
}),
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let body: serde_json::Value = reqwest::Client::new()
.get(format!("http://{addr}/v1/models"))
.send()
.await
.unwrap()
.json()
.await
.unwrap();
let entry = body["data"]
.as_array()
.expect("data is an array")
.iter()
.find(|m| m["id"] == "test-model")
.expect("test-model present in /v1/models");
// `limit` is the neuron's self-derived value (#67), NOT the catalogue's
// (which declared 999999 and must be ignored). `cost` still flows from
// the catalogue.
assert_eq!(entry["limit"]["context"], 49152);
assert_eq!(entry["limit"]["input"], 40960);
assert_eq!(entry["limit"]["output"], 8192);
assert_eq!(entry["cost"]["input"], 0.0);
assert_eq!(entry["cost"]["output"], 0.0);
// Runtime-detected capability flags OR-ed in from the neuron's ModelEntry.
assert_eq!(entry["tool_call"], true);
assert_eq!(entry["reasoning"], true);
// Regression guard: the removed, unconsumed vLLM-ism must not reappear.
assert!(
entry.get("max_model_len").is_none(),
"max_model_len was removed; /v1/models must not advertise it"
);
let _ = std::fs::remove_file(&cat_path);
}
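
The composition rule this test pins down (limit from the neuron, cost from the catalogue, capability flags OR-ed) can be sketched as a single merge function. All type and function names below are hypothetical and exist only for illustration; the gateway's real `/v1/models` handler is not part of this diff.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Limit {
    pub context: usize,
    pub input: Option<usize>,
    pub output: usize,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Cost {
    pub input: f64,
    pub output: f64,
}

/// Hypothetical operator-declared catalogue profile.
pub struct CatalogueProfile {
    pub limit: Option<Limit>,
    pub cost: Option<Cost>,
    pub tool_call: bool,
    pub reasoning: bool,
}

/// Hypothetical runtime report from a neuron.
pub struct NeuronReport {
    pub limit: Option<Limit>,
    pub tool_call: bool,
    pub reasoning: bool,
}

#[derive(Debug, PartialEq)]
pub struct Advertised {
    pub limit: Option<Limit>,
    pub cost: Option<Cost>,
    pub tool_call: bool,
    pub reasoning: bool,
}

/// Merge the two sources: the catalogue `limit` is deliberately ignored,
/// `cost` comes only from the catalogue, and flags are OR-ed.
pub fn compose(catalogue: &CatalogueProfile, neuron: &NeuronReport) -> Advertised {
    Advertised {
        limit: neuron.limit,
        cost: catalogue.cost,
        tool_call: catalogue.tool_call || neuron.tool_call,
        reasoning: catalogue.reasoning || neuron.reasoning,
    }
}
```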

@@ -12,8 +12,8 @@ use std::sync::Arc;
async fn test_poller_discovers_models() {
// Mock neuron reports 2 models via /models endpoint (neuron format).
let mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "model-a", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": 8000},
{"id": "model-b", "harness": "mistralrs", "status": "unloaded", "devices": [], "vram_used_mb": null}
{"id": "model-a", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000},
{"id": "model-b", "harness": "candle", "status": "unloaded", "devices": [], "vram_used_mb": null}
]))
.await;
@@ -31,6 +31,7 @@ async fn test_poller_discovers_models() {
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -63,8 +64,8 @@ async fn test_poller_discovers_models() {
#[tokio::test]
async fn test_poller_updates_gateway_models_endpoint() {
let mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "model-x", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null},
{"id": "model-y", "harness": "mistralrs", "status": "loaded", "devices": [1], "vram_used_mb": null}
{"id": "model-x", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null},
{"id": "model-y", "harness": "candle", "status": "loaded", "devices": [1], "vram_used_mb": null}
]))
.await;
@@ -82,6 +83,7 @@ async fn test_poller_updates_gateway_models_endpoint() {
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -118,6 +120,88 @@ async fn test_poller_updates_gateway_models_endpoint() {
}
}
#[tokio::test]
async fn test_models_endpoint_unions_capabilities_across_nodes() {
// C3: two neurons each have the same model loaded but advertise
// different capability sets. The gateway's /v1/models must report
// the union — a model loaded text-only on one node and
// text+vision on another is vision-capable to the fleet.
let node_a = common::spawn_mock_neuron_with_models(json!([
{"id": "shared-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null, "capabilities": ["text"]}
]))
.await;
let node_b = common::spawn_mock_neuron_with_models(json!([
{"id": "shared-model", "harness": "candle", "status": "loaded", "devices": [1], "vram_used_mb": null, "capabilities": ["text", "vision"]}
]))
.await;
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![
NeuronEndpoint {
name: "node-a".into(),
endpoint: node_a,
},
NeuronEndpoint {
name: "node-b".into(),
endpoint: node_b,
},
],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
cortex_gateway::poller::poll_once(&fleet).await;
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let client = reqwest::Client::new();
let body: serde_json::Value = client
.get(format!("http://{addr}/v1/models"))
.send()
.await
.expect("request should succeed")
.json()
.await
.unwrap();
let model = body["data"]
.as_array()
.expect("data array")
.iter()
.find(|m| m["id"] == "shared-model")
.expect("shared-model should be present");
let caps: Vec<&str> = model["capabilities"]
.as_array()
.expect("capabilities array")
.iter()
.filter_map(|c| c.as_str())
.collect();
assert!(caps.contains(&"text"), "union must include text: {caps:?}");
assert!(
caps.contains(&"vision"),
"union must include vision: {caps:?}"
);
assert_eq!(caps.len(), 2, "union must not duplicate text: {caps:?}");
// Both nodes hold the model, so two locations regardless of caps.
assert_eq!(model["locations"].as_array().unwrap().len(), 2);
}
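
The union asserted in `test_models_endpoint_unions_capabilities_across_nodes` (no duplicates, every capability seen on any node) maps naturally onto a set. A minimal sketch, assuming a hypothetical `union_capabilities` helper that takes one capability list per node; the gateway's own aggregation code is not shown in this diff.

```rust
use std::collections::BTreeSet;

/// Union the capability lists reported by each node that holds a model.
/// A `BTreeSet` removes duplicates and yields a stable, sorted order.
pub fn union_capabilities(per_node: &[Vec<String>]) -> Vec<String> {
    let set: BTreeSet<&str> = per_node
        .iter()
        .flatten()
        .map(String::as_str)
        .collect();
    set.into_iter().map(str::to_owned).collect()
}
```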
#[tokio::test]
async fn test_poller_marks_unreachable_node_unhealthy() {
let config = GatewayConfig {
@@ -134,6 +218,7 @@ async fn test_poller_marks_unreachable_node_unhealthy() {
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -152,8 +237,8 @@ async fn test_poller_marks_unreachable_node_unhealthy() {
#[tokio::test]
async fn test_poller_removes_stale_models() {
let mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "keep-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null},
{"id": "drop-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null}
{"id": "keep-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null},
{"id": "drop-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null}
]))
.await;
@@ -171,6 +256,7 @@ async fn test_poller_removes_stale_models() {
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -183,7 +269,7 @@ async fn test_poller_removes_stale_models() {
// New mock with only one model.
let new_mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "keep-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null}
{"id": "keep-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null}
]))
.await;
@@ -201,6 +287,7 @@ async fn test_poller_removes_stale_models() {
endpoint: new_mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet2 = Arc::new(CortexState::from_config(&config2));
@@ -216,6 +303,10 @@ async fn test_poller_removes_stale_models() {
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
node.models.insert(
@@ -225,6 +316,10 @@ async fn test_poller_removes_stale_models() {
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -237,3 +332,96 @@ async fn test_poller_removes_stale_models() {
assert!(node.models.contains_key("keep-me"));
assert!(!node.models.contains_key("drop-me"));
}
#[tokio::test]
async fn test_poller_captures_activation_from_health() {
// Mock neuron is mid-prewarm: /models reports nothing (the loading
// model hasn't been inserted into the harness map yet), but
// /health's activation says model-x is in_progress and model-y is
// queued behind it.
let mock_url = common::spawn_mock_neuron_with_models_and_health(
json!([]),
json!({
"uptime_secs": 30,
"devices": [],
"activation": {
"state": "pre_warming",
"pending": ["Qwen/model-y"],
"in_progress": "Qwen/model-x",
"completed": [],
"failed": []
}
}),
)
.await;
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "prewarm-node".into(),
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
cortex_gateway::poller::poll_once(&fleet).await;
let nodes = fleet.nodes.read().await;
let node = nodes.get("prewarm-node").unwrap();
assert!(node.healthy);
// /models was empty — no entries in the per-node model map.
assert!(node.models.is_empty());
// But /health's activation should be captured.
let activation = node
.activation
.as_ref()
.expect("activation should be populated after /health poll");
assert_eq!(activation.in_progress.as_deref(), Some("Qwen/model-x"));
assert_eq!(activation.pending, vec!["Qwen/model-y".to_string()]);
}
#[tokio::test]
async fn test_poller_parses_recovering_status() {
// #20: a model auto-recovering on a neuron (poisoned → unload →
// reload, #17) is reported with status "recovering" and must land
// in gateway state as the dedicated Recovering status — not fall
// through the parser's catch-all to Loaded.
let mock_url = common::spawn_mock_neuron_with_models(json!([
{"id": "model-r", "harness": "candle", "status": "recovering", "devices": [0, 1], "vram_used_mb": null}
]))
.await;
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "test-node".into(),
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
cortex_gateway::poller::poll_once(&fleet).await;
let nodes = fleet.nodes.read().await;
let node = nodes.get("test-node").unwrap();
let model_r = node.models.get("model-r").expect("model-r should exist");
assert_eq!(model_r.status, ModelStatus::Recovering);
}

@@ -0,0 +1,174 @@
//! Fail-fast prompt pre-validation + advisory client hints (#56).
//!
//! cortex refuses a prompt that already exceeds the model's advertised
//! context window before dispatching to neuron — the same #60
//! `context_length_exceeded` envelope neuron would emit, just earlier — and
//! attaches an advisory `X-Helexa-Advice` header for fingerprinted clients.
use axum::Json;
use axum::extract::Path;
use axum::routing::{get, post};
use cortex_core::config::{
EvictionSettings, EvictionStrategy, GatewayConfig, GatewaySettings, NeuronEndpoint,
};
use cortex_core::harness::ModelLimit;
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use serde_json::{Value, json};
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::net::TcpListener;
/// Mock neuron with a hit counter, so a test can prove a request was (or
/// wasn't) dispatched past the gateway's pre-validation.
async fn spawn_counting_neuron() -> (String, Arc<AtomicU64>) {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let hits = Arc::new(AtomicU64::new(0));
let sink = Arc::clone(&hits);
let app = axum::Router::new()
.route(
"/models/{model_id}/endpoint",
get(move |Path(_): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({ "url": url })) }
}),
)
.route(
"/v1/chat/completions",
post(move || {
let sink = Arc::clone(&sink);
async move {
sink.fetch_add(1, Ordering::SeqCst);
Json(json!({
"id": "c", "object": "chat.completion", "created": 1_700_000_000_u64,
"model": "test-model",
"choices": [{"index": 0, "message": {"role": "assistant", "content": "ok"}, "finish_reason": "stop"}],
"usage": {"prompt_tokens": 3, "completion_tokens": 1, "total_tokens": 4}
}))
}
}),
);
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(base_url, hits)
}
/// Gateway over one neuron with `test-model` loaded and a tiny advertised
/// context window (so a modest prompt overflows it).
async fn spawn_gateway(neuron: &str, context: usize) -> String {
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: neuron.to_string(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let n = nodes.get_mut("mock-node").unwrap();
n.healthy = true;
n.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: Some(ModelLimit {
context,
input: None,
output: 16,
}),
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
format!("http://{addr}")
}
#[tokio::test]
async fn over_long_prompt_is_rejected_before_dispatch() {
let (neuron, hits) = spawn_counting_neuron().await;
let gateway = spawn_gateway(&neuron, 50).await; // tiny 50-token window
// ~1200 chars → ~300 est tokens, well over 50.
let big = "word ".repeat(240);
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.header("user-agent", "litellm/1.0")
.json(&json!({"model": "test-model", "messages": [{"role": "user", "content": big}]}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::BAD_REQUEST);
// Advisory hint for the fingerprinted client (header only, never body).
assert!(
resp.headers().get("x-helexa-advice").is_some(),
"litellm should get advice"
);
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "context_length_exceeded");
assert_eq!(body["error"]["max"], 50);
// Refused at the edge — neuron never saw it.
assert_eq!(hits.load(Ordering::SeqCst), 0);
}
#[tokio::test]
async fn within_context_passes_through() {
let (neuron, hits) = spawn_counting_neuron().await;
let gateway = spawn_gateway(&neuron, 4096).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.json(&json!({"model": "test-model", "messages": [{"role": "user", "content": "hi"}]}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
let _ = resp.bytes().await.unwrap();
assert_eq!(hits.load(Ordering::SeqCst), 1, "served by neuron");
}
#[tokio::test]
async fn unknown_client_gets_no_advice_header() {
let (neuron, _hits) = spawn_counting_neuron().await;
let gateway = spawn_gateway(&neuron, 50).await;
let big = "word ".repeat(240);
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
// no/unknown User-Agent → no advice, but still a clean 400
.json(&json!({"model": "test-model", "messages": [{"role": "user", "content": big}]}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::BAD_REQUEST);
assert!(resp.headers().get("x-helexa-advice").is_none());
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "context_length_exceeded");
}
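
The fail-fast check exercised by these tests can be sketched as an estimate compared against the advertised context window. The four-characters-per-token heuristic below is an assumption inferred from the test comment (about 1200 chars giving about 300 estimated tokens); `estimate_tokens`, `prevalidate`, and `ContextLengthExceeded` are hypothetical names, and the gateway's real estimator is not shown in this diff.

```rust
/// Rough token estimate, assuming about four characters per token.
pub fn estimate_tokens(text: &str) -> usize {
    text.chars().count().div_ceil(4)
}

/// Hypothetical error carrying the fields a `context_length_exceeded`
/// envelope would need.
#[derive(Debug, PartialEq)]
pub struct ContextLengthExceeded {
    pub estimated: usize,
    pub max: usize,
}

/// Refuse a prompt whose estimate already exceeds the context window,
/// before any request is dispatched to a backend.
pub fn prevalidate(prompt: &str, context: usize) -> Result<(), ContextLengthExceeded> {
    let estimated = estimate_tokens(prompt);
    if estimated > context {
        Err(ContextLengthExceeded { estimated, max: context })
    } else {
        Ok(())
    }
}
```

Because the estimate is only a heuristic, a check like this is best treated as an early rejection of clearly oversized prompts; the backend's exact tokenizer remains the final authority.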

@@ -117,6 +117,7 @@ async fn test_no_healthy_nodes() {
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = std::sync::Arc::new(cortex_gateway::state::CortexState::from_config(&config));
@@ -139,7 +140,7 @@ async fn test_no_healthy_nodes() {
.await
.expect("request should succeed");
assert_eq!(resp.status(), 404);
assert_eq!(resp.status(), 503);
let body: serde_json::Value = resp.json().await.unwrap();
assert!(
@@ -171,3 +172,67 @@ async fn test_missing_model_field() {
let body: serde_json::Value = resp.json().await.unwrap();
assert!(body["error"]["message"].as_str().unwrap().contains("model"));
}
#[tokio::test]
async fn test_recovering_model_returns_503_and_stays_listed() {
// #20: while a model auto-recovers on a neuron, the gateway must
// hold the route — transient 503 ("retry shortly"), not the 404
// "not found on any node" that makes a recovering model look
// evicted — and keep listing it on /v1/models.
let mock_url = common::spawn_mock_neuron().await;
let (fleet, gw_url) = common::spawn_gateway_with_state(&mock_url).await;
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").expect("node must exist");
node.models.insert(
"recovering-model".into(),
cortex_core::node::ModelEntry {
id: "recovering-model".into(),
status: cortex_core::node::ModelStatus::Recovering,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/chat/completions"))
.header("content-type", "application/json")
.json(&json!({
"model": "recovering-model",
"messages": [{"role": "user", "content": "Hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 503);
let body: serde_json::Value = resp.json().await.unwrap();
let message = body["error"]["message"].as_str().unwrap();
assert!(
message.contains("recovering") && message.contains("retry"),
"503 body must say recovering/retry, got: {message}"
);
// The model must still be visible on the unified models endpoint.
let models: serde_json::Value = client
.get(format!("{gw_url}/v1/models"))
.send()
.await
.expect("models request should succeed")
.json()
.await
.unwrap();
let listed = models["data"]
.as_array()
.unwrap()
.iter()
.any(|m| m["id"] == "recovering-model");
assert!(listed, "recovering model must stay listed on /v1/models");
}

@@ -0,0 +1,91 @@
//! Integration tests for the `/v1/responses` proxy route.
//!
//! The gateway forwards the request body to whichever neuron has the
//! model loaded. These tests exercise the routing decision (200 on a
//! known model, 404 on an unknown model, 400 on a missing model
//! field) and confirm the response body round-trips verbatim.
mod common;
use serde_json::json;
/// Happy path: gateway routes a `/v1/responses` request to the neuron
/// that has the model loaded, and the neuron's response body
/// arrives at the client unchanged.
#[tokio::test]
async fn test_responses_proxy() {
let mock_url = common::spawn_mock_neuron().await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/responses"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"input": "Hi"
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 200);
let body: serde_json::Value = resp.json().await.expect("valid JSON response");
assert_eq!(body["id"], "resp-test-001");
assert_eq!(body["object"], "response");
assert_eq!(body["model"], "test-model");
assert_eq!(body["status"], "completed");
assert_eq!(
body["output"][0]["content"][0]["text"],
"Hello from mock backend"
);
// The usage shape is the Responses-specific one (input/output_tokens),
// not the chat-completions one (prompt/completion_tokens). This asserts
// that the proxy didn't accidentally route through the wrong handler.
assert_eq!(body["usage"]["total_tokens"], 10);
assert!(body["usage"].get("input_tokens").is_some());
}
/// A request that targets a model not present in the catalogue gets
/// 404 from the router. This matches the chat-completions handler's
/// behaviour — same error path, same status code, so a client can
/// share retry logic across the two routes.
#[tokio::test]
async fn test_responses_model_not_found() {
let mock_url = common::spawn_mock_neuron().await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/responses"))
.json(&json!({
"model": "not-in-catalogue",
"input": "Hi"
}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), 404);
}
/// A request body without a `model` field can't be routed; the
/// gateway returns 400 before reaching a backend. Same as the
/// chat-completions handler — extracted via the same `extract_model`
/// helper.
#[tokio::test]
async fn test_responses_missing_model_field() {
let mock_url = common::spawn_mock_neuron().await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/responses"))
.json(&json!({
"input": "Hi"
}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), 400);
}


@@ -51,18 +51,18 @@ async fn test_streaming_sse_passthrough() {
 }
 assert!(
-chunks.len() >= chunk_count + 1,
-"expected at least {} chunks (got {}): {:?}",
-chunk_count + 1,
+chunks.len() > chunk_count,
+"expected more than {} chunks (got {}): {:?}",
+chunk_count,
 chunks.len(),
 chunks,
 );
 assert_eq!(chunks.last().unwrap(), "[DONE]");
-for i in 0..chunk_count {
+for (i, chunk) in chunks.iter().enumerate().take(chunk_count) {
 let chunk_json: serde_json::Value =
-serde_json::from_str(&chunks[i]).expect("chunk should be valid JSON");
+serde_json::from_str(chunk).expect("chunk should be valid JSON");
 assert_eq!(
 chunk_json["choices"][0]["delta"]["content"],
 format!("token{i}")


@@ -0,0 +1,48 @@
[package]
name = "helexa-acp"
version = "0.1.16"
edition = "2024"
license = "Apache-2.0"
repository = "https://git.lair.cafe/helexa/helexa"
description = """
Agent Client Protocol bridge for the helexa self-hosted LLM stack.
Speaks ACP to ACP-compatible editor clients (Zed, etc.) and forwards
the conversation to any OpenAI-compatible HTTP endpoint — defaulting
to cortex (helexa's reverse-proxy / fleet gateway).
"""
# This crate is intentionally self-contained — no dependencies on other
# workspace crates (cortex-core, cortex-gateway, neuron). The goal is
# a painless migration to a dedicated GitHub repo in the future if the
# project grows beyond helexa's needs. All deps are crates.io.
[dependencies]
# `unstable_session_model` flips on the SessionModelState type and the
# session/set_model RPC the model-picker dropdown in Zed needs. The
# feature is upstream-marked unstable; we accept that risk because the
# model picker is core UX and the alternative (rolling our own
# extension method) drifts further from spec each time it moves.
agent-client-protocol = { version = "0.12", features = ["unstable_session_model"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros", "sync", "io-util", "process", "signal"] }
reqwest = { version = "0.12", features = ["json", "stream", "rustls-tls"], default-features = false }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
toml = "0.8"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
anyhow = "1"
thiserror = "2"
async-trait = "0.1"
futures = "0.3"
tokio-stream = "0.1"
tokio-util = { version = "0.7", features = ["rt"] }
eventsource-stream = "0.2"
async-stream = "0.3"
url = { version = "2", features = ["serde"] }
# Already transitively pulled via the ACP SDK; declared directly so we
# can format ISO 8601 timestamps for `SessionInfo.updated_at` in the
# session/list response.
chrono = { version = "0.4", default-features = false, features = ["std"] }
[[bin]]
name = "helexa-acp"
path = "src/main.rs"

crates/helexa-acp/README.md Normal file

@@ -0,0 +1,546 @@
# helexa-acp
ACP (Agent Client Protocol) bridge for editors like
[Zed](https://zed.dev). Lets you point your editor's agent panel at
**any combination** of OpenAI-compatible, OpenAI Responses, and
Anthropic Messages endpoints — public APIs, private LAN deployments,
local Ollama / LM Studio — and switch between them per session via a
model dropdown.
The "missing ACP binary" for users who don't want to be locked into
one vendor's agent client.
```
┌───────────────────────────────────┐
│  Zed (or any ACP editor client)   │
└────────────┬──────────────────────┘
             │ stdio JSON-RPC (ACP)
       ┌─────────────────┐
       │   helexa-acp    │ ← one binary, multi-endpoint
       └─────┬───────────┘
             │ HTTP / SSE
    ┌────────┼─────────────┬──────────────┬──────────────┐
    ▼        ▼             ▼              ▼              ▼
 cortex/  OpenAI       Anthropic     OpenRouter      LM Studio
 neuron   Responses    Messages
 (self-   (gpt-5,…)    (Claude)
  hosted)
```
## What it does
- **Speaks ACP** over stdio to editor clients (Zed today; any future
ACP client tomorrow).
- **Multi-endpoint** — one config file lists every LLM endpoint
you want available; pick one per session via the model dropdown
(`endpoint:model` selector).
- **Three wire formats**: `openai-chat` (the broadly compatible
default), `openai-responses` (newer OpenAI surface), and
`anthropic-messages` (Claude). Each is a separate provider impl
in `src/provider/`; adding a fourth (Gemini, Ollama native, …) is
one file plus a `WireApi` enum variant.
- **Built-in tools**: `read_file`, `write_file`, `edit_file`,
`list_dir`, `bash`. Permission-gated by default; the editor user
approves writes/shell per-call.
- **Three session modes**: Default (gated), Bypass Permissions
(auto-allow), and Plan (write-only-to-plan-dir, no shell).
- **Vision** — drag-drop images into the agent panel against any
vision-capable model.
- **Session resume** — multi-day conversations survive editor
restarts via on-disk transcript persistence.
- **Context compaction** — rolling history stays inside the model's
context window automatically so long sessions on small-context
local models don't fall over.
## Install
### From source
```sh
git clone https://git.lair.cafe/helexa/helexa.git
cd helexa
cargo install --path crates/helexa-acp
# Binary lands at ~/.cargo/bin/helexa-acp
```
### Pre-built RPM (Fedora 43)
```sh
dnf copr enable helexa/helexa
dnf install helexa-acp
```
The COPR project bundles helexa-acp alongside the cortex gateway
and helexa-neuron flavours; install only the package(s) you need.
## Quick start
The fastest path: env-var single-endpoint config.
```sh
export HELEXA_ACP_BASE_URL=http://hanzalova.internal:31313/v1
export HELEXA_ACP_MODEL=Qwen/Qwen3.6-27B
helexa-acp # speaks ACP over stdin/stdout; not interactive
```
Then in Zed (`~/.config/zed/settings.json`):
```jsonc
{
"agent_servers": {
"helexa": {
"command": "helexa-acp",
"args": []
}
}
}
```
Restart Zed → open the agent panel → pick "helexa" → start
chatting. Tool calls (file reads, writes, bash) prompt for
permission per-call in Default mode.
That's the minimum. The full config story below is what unlocks
the multi-endpoint dropdown.
## Multi-endpoint config
Copy `helexa-acp.example.toml` from this repo to
`$XDG_CONFIG_HOME/helexa-acp/config.toml` (typically
`~/.config/helexa-acp/config.toml`) and edit:
```toml
default_endpoint = "helexa"
[[endpoints]]
name = "helexa"
base_url = "http://hanzalova.internal:31313/v1"
wire_api = "openai-chat"
default_model = "Qwen/Qwen3.6-27B"
max_tokens = 8192
context_window = 32768
[[endpoints]]
name = "openrouter"
base_url = "https://openrouter.ai/api/v1"
wire_api = "openai-chat"
api_key_env = "OPENROUTER_API_KEY"
default_model = "anthropic/claude-opus-4"
[[endpoints]]
name = "anthropic"
base_url = "https://api.anthropic.com/v1"
wire_api = "anthropic-messages"
api_key_env = "ANTHROPIC_API_KEY"
default_model = "claude-opus-4"
```
Restart Zed. The model dropdown lists every model from every
configured endpoint with the `endpoint:model` selector
(`helexa:Qwen/Qwen3.6-27B`, `openrouter:anthropic/claude-opus-4`,
…). Switch mid-session; the next prompt routes to the new endpoint.
When only one endpoint is configured the prefix is dropped (model
ids appear bare).
### Selector syntax
The `model` field on every internal request is parsed as
`<endpoint>:<model>`:
- `openrouter:gpt-4o` → routes to the `openrouter` endpoint,
model `gpt-4o`.
- `helexa/large` → no colon → falls through to whichever endpoint
is named in `default_endpoint`, model `helexa/large`.
- `:gpt-5` → leading colon → also falls through to default.
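
The three cases above can be sketched as a small parser. This is an illustration only, not the crate's actual code: `parse_selector` is a hypothetical name, and it assumes the split happens at the first colon and that `:gpt-5` resolves to model `gpt-5` on the default endpoint.

```rust
/// Split a selector into `(endpoint, model)`.
/// `None` for the endpoint means "use `default_endpoint`".
/// Hypothetical helper; the real resolver may differ.
fn parse_selector(selector: &str) -> (Option<&str>, &str) {
    match selector.split_once(':') {
        // `:gpt-5` has an empty endpoint, so it falls through to the default.
        Some(("", model)) => (None, model),
        // `openrouter:gpt-4o` names the endpoint explicitly.
        Some((endpoint, model)) => (Some(endpoint), model),
        // No colon at all: the whole string is the model id.
        None => (None, selector),
    }
}
```

One consequence of splitting at the first colon: a model id that itself contains a colon (for example an Ollama-style `name:tag`) would be read as `endpoint:model` unless the resolver also checks the prefix against the configured endpoint names. Whether helexa-acp does that check is not stated here.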
## Endpoint cookbook
Copy-pasteable blocks. Mix and match.
### cortex / neuron (self-hosted)
```toml
[[endpoints]]
name = "helexa"
base_url = "http://hanzalova.internal:31313/v1"
wire_api = "openai-chat"
default_model = "Qwen/Qwen3.6-27B"
max_tokens = 8192
context_window = 32768
```
Use `openai-responses` instead of `openai-chat` once cortex 0.1.16+
is deployed and you want the Responses API surface (vision item
shape, structured reasoning items, etc.).
### OpenAI directly
```toml
[[endpoints]]
name = "openai"
base_url = "https://api.openai.com/v1"
wire_api = "openai-responses"
api_key_env = "OPENAI_API_KEY"
default_model = "gpt-5"
```
`openai-responses` is the right choice for current OpenAI models;
`openai-chat` works against legacy GPT-3.5/4 deployments and
anything labelled "chat completions".
### Anthropic directly
```toml
[[endpoints]]
name = "anthropic"
base_url = "https://api.anthropic.com/v1"
wire_api = "anthropic-messages"
api_key_env = "ANTHROPIC_API_KEY"
default_model = "claude-opus-4"
```
helexa-acp sends `x-api-key` + `anthropic-version: 2023-06-01`
automatically. The `api_key_env` indirection keeps your key out of
the config file.
### OpenRouter (multi-vendor proxy)
```toml
[[endpoints]]
name = "openrouter"
base_url = "https://openrouter.ai/api/v1"
wire_api = "openai-chat"
api_key_env = "OPENROUTER_API_KEY"
default_model = "anthropic/claude-opus-4"
```
OpenRouter speaks OpenAI-compat for every model it fronts, so
`openai-chat` is the right wire format regardless of the
underlying vendor.
### LM Studio (local)
```toml
[[endpoints]]
name = "lmstudio"
base_url = "http://localhost:1234/v1"
wire_api = "openai-chat"
default_model = "auto"
```
LM Studio's "auto" model id picks whatever's loaded. Same shape
works for Ollama in compat mode (`http://localhost:11434/v1`) and
vLLM.
### Multiple cortex deployments
```toml
[[endpoints]]
name = "lan"
base_url = "http://hanzalova.internal:31313/v1"
wire_api = "openai-chat"
default_model = "Qwen/Qwen3.6-27B"
[[endpoints]]
name = "cloud"
base_url = "https://cortex.example.com/v1"
wire_api = "openai-chat"
api_key_env = "CLOUD_CORTEX_KEY"
default_model = "Qwen/Qwen3-VL-8B"
```
Use the `endpoint:model` selector to switch between them mid-session.
## Zed setup
`~/.config/zed/settings.json`:
```jsonc
{
"agent_servers": {
"helexa": {
"command": "helexa-acp"
}
}
}
```
Optional environment overrides for the binary:
```jsonc
{
"agent_servers": {
"helexa": {
"command": "helexa-acp",
"env": {
"HELEXA_ACP_LOG_FILE": "/tmp/helexa-acp.log",
"RUST_LOG": "helexa_acp=debug"
}
}
}
}
```
`HELEXA_ACP_LOG_FILE` is the one you actually want — Zed doesn't
surface the agent's stderr, so without that env var debug output is
invisible. Point it at a file you can `tail -f`.
After restarting Zed: ⌘+? (or wherever your "Open Agent Panel"
binding is) → select "helexa" → the model dropdown populates from
your config → start prompting.
## Modes
Three session modes ship; the user picks via Zed's mode dropdown
on the agent panel.
| Mode | Reads | Writes | Bash | Permission prompts |
|------|-------|--------|------|--------------------|
| **Default** | ✓ | with prompt | with prompt | per call |
| **Bypass Permissions** | ✓ | ✓ | ✓ | never |
| **Plan** | ✓ | only into plan dir | disabled | never (plan-dir writes auto-allow) |
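
The table can be read as a single decision function. The sketch below is one way to encode it; `Mode`, `ToolKind`, `Decision`, and `gate` are hypothetical names chosen for illustration, and the real permission gate in `tool_runner.rs` may be structured differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Default,
    BypassPermissions,
    Plan,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToolKind {
    Read,  // read_file, list_dir
    Write, // write_file, edit_file
    Bash,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Decision {
    Allow,
    Prompt,
    Deny,
}

/// Map one row/column of the modes table to a decision.
/// `in_plan_dir` only matters for writes in Plan mode.
fn gate(mode: Mode, tool: ToolKind, in_plan_dir: bool) -> Decision {
    match (mode, tool) {
        // Reads are allowed in every mode.
        (_, ToolKind::Read) => Decision::Allow,
        (Mode::BypassPermissions, _) => Decision::Allow,
        // Default mode asks before every write or shell command.
        (Mode::Default, _) => Decision::Prompt,
        // Plan mode auto-allows writes into the plan dir and nothing else.
        (Mode::Plan, ToolKind::Write) if in_plan_dir => Decision::Allow,
        (Mode::Plan, _) => Decision::Deny,
    }
}
```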
### Default
Reads are always allowed (`read_file`, `list_dir` are
unrestricted). Writes and shell commands prompt the user before
running. The intended baseline for any session where the agent
might do something you'd rather review first.
### Bypass Permissions
Auto-allow every tool call. Use for agentic loops you trust — bulk
edits across many files, scripted workflows, prepared session
templates. Never for code the agent hasn't seen before.
### Plan
The "draft an implementation plan before you write code" mode.
Available tools:
- `read_file`, `list_dir`: unrestricted (read the codebase).
- `write_file`, `edit_file`: allowed *only* under
`$XDG_DATA_HOME/helexa-acp/plans/<project-id>/`. Any path
outside that returns "plan mode: writes are restricted to …"
back to the model so it self-corrects.
- `bash`: disabled outright. Returns "plan mode: shell execution
is disabled" if attempted.
When the plan is complete, the model presents a 3-option menu:
1. **Bypass Permissions** — implement the plan now, no prompts.
2. **Default** — implement now with per-tool prompts.
3. **Plan** (stay here) — refine the plan with more guidance.
Switch the mode dropdown to your preference and reply to proceed.
## Tools
Five tools, defined in `src/tools.rs`:
| Tool | Args | Gated in Default? |
|------|------|-------------------|
| `read_file` | `path`, `line?`, `limit?` | no |
| `list_dir` | `path` | no |
| `write_file` | `path`, `content` | yes |
| `edit_file` | `path`, `old_text`, `new_text` | yes |
| `bash` | `command`, `cwd?` | yes |
### Path handling
`~`, `~/`, `$HOME`, and `$HOME/` are expanded server-side before
the path reaches ACP or local fs. Lets the model emit
`~/git/repo/file.rs` and have it Just Work.
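
A minimal sketch of that expansion, assuming the home directory is passed in explicitly and that only a leading `~` or `$HOME` is rewritten. `expand_home` is a hypothetical name; the shared helper in `path_util.rs` may handle more cases.

```rust
/// Expand a leading `~` or `$HOME` using the supplied home directory.
/// Hypothetical helper for illustration.
fn expand_home(path: &str, home: &str) -> String {
    for prefix in ["~", "$HOME"] {
        if let Some(rest) = path.strip_prefix(prefix) {
            // Only expand when the prefix is the whole path or is
            // followed by `/`, so `~user/x` and `$HOMEWORK` stay as-is.
            if rest.is_empty() || rest.starts_with('/') {
                return format!("{home}{rest}");
            }
        }
    }
    path.to_string()
}
```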
`read_file` first tries the editor's filesystem (ACP's
`fs/read_text_file` — respects open buffers, workspace overlays,
etc.). If that fails — typically because the path is outside Zed's
workspace boundary — it falls back to `std::fs::read_to_string`.
This lets the agent pull in shared material like
`~/git/architecture/generic.md` from a different project's
session.
The fallback is logged at warn level so you can see when it kicks
in.
### Tool dispatch
Tool descriptions reach the model through a Qwen3 Hermes-format
`# Tools` block injected into the system prompt. Because
cortex/neuron pass the OpenAI `tools` request field through to the
encoder unread, helexa-acp instead prompts the model to emit
`<tool_call>{json}</tool_call>` markers and parses them out of the
content stream. This applies to the helexa wire format; OpenAI /
Anthropic endpoints with native tool support would use their own
paths once they're wired in.
The parser is tolerant: malformed JSON (trailing braces, missing
`name`, name nested in `arguments`) gets a repair pass; if that
fails the call surfaces as a "Malformed tool call" card in Zed and
the model gets a synthetic error result so it can self-correct.
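
To make the marker format concrete, here is a rough sketch of the first parsing step: pulling the raw payloads out of a finished content string. It is not the parser in `qwen3.rs`; `extract_tool_calls` is a hypothetical name, and the sketch skips JSON decoding, the repair pass, and incremental (streamed) input.

```rust
const OPEN: &str = "<tool_call>";
const CLOSE: &str = "</tool_call>";

/// Return the trimmed payload of every complete
/// `<tool_call>…</tool_call>` span, in order of appearance.
fn extract_tool_calls(content: &str) -> Vec<&str> {
    let mut calls = Vec::new();
    let mut rest = content;
    while let Some(start) = rest.find(OPEN) {
        let after_open = &rest[start + OPEN.len()..];
        // An unterminated marker (e.g. the stream is still in flight)
        // ends the scan without yielding a partial call.
        let Some(end) = after_open.find(CLOSE) else {
            break;
        };
        calls.push(after_open[..end].trim());
        rest = &after_open[end + CLOSE.len()..];
    }
    calls
}
```

Each returned slice would then go through JSON decoding and, on failure, the repair pass described above.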
## Session resume
helexa-acp persists every session to
`$XDG_DATA_HOME/helexa-acp/sessions/<id>.json`. Zed's `session/list`
RPC asks helexa-acp to enumerate them on workspace open;
`session/load` rehydrates and replays the transcript as
`session/update` notifications so the agent panel renders the
prior conversation.
Behaviour:
- Persisted per-round, so a mid-turn agent stall (long bash, wedged
ACP roundtrip) doesn't lose earlier rounds.
- Survives editor restart and the helexa-acp binary upgrading
between versions.
- Project-scoped: only sessions whose `cwd` matches the workspace
are listed.
To wipe history: `rm -rf $XDG_DATA_HOME/helexa-acp/sessions/`.
## Context compaction
When an endpoint sets `context_window`, helexa-acp projects the
rolling history into a token budget before each request — old
`ToolResult` content (read_file payloads are the worst offenders)
gets elided to one-line markers, preserving `tool_call_id` pairing
so the wire schema stays valid.
System prompts, user turns, and the most recent ~4 messages are
never elided. The full history stays on disk; compaction is a
per-request projection, not a destructive edit.
Set `context_window = 32768` for a 32 K Qwen3, `131072` for a
modern Claude, etc. With `max_tokens` also set, the budget is
`context_window - max_tokens - 512_safety`.
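
As a worked instance of that formula, the sketch below computes the history budget for the `helexa` endpoint from the cookbook (`context_window = 32768`, `max_tokens = 8192`). `history_budget` and `SAFETY_TOKENS` are illustrative names, and the saturating subtraction is an assumption about how a misconfigured endpoint would be handled.

```rust
/// Safety margin from the formula above.
const SAFETY_TOKENS: usize = 512;

/// Tokens available for the rolling history once the reply
/// allowance and the safety margin are reserved.
fn history_budget(context_window: usize, max_tokens: usize) -> usize {
    context_window
        // Assumption: clamp at zero rather than underflow when
        // max_tokens exceeds the context window.
        .saturating_sub(max_tokens)
        .saturating_sub(SAFETY_TOKENS)
}
```

With the cookbook values, `history_budget(32768, 8192)` is 32768 - 8192 - 512 = 24064 tokens.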
## Troubleshooting
### "default endpoint 'helexa' has no usable provider — check config"
The named default endpoint failed to construct. Usually:
- `api_key_env` references a variable that isn't set in the env
Zed launched helexa-acp with.
- The TOML's `wire_api` is misspelled (only `openai-chat`,
`openai-responses`, `anthropic-messages` are accepted).
Test by running `helexa-acp` directly from a shell — startup
errors land on stderr.
### Model dropdown is empty
Each provider's `list_models` failed at startup. Look at
`HELEXA_ACP_LOG_FILE` for "list_models failed; this endpoint's
models won't appear in the picker". Likely the endpoint URL is
wrong, the API key is invalid, or the upstream `/v1/models`
endpoint isn't responding.
The agent still works against `default_model` even when the
dropdown is empty — list-models is for picking, not routing.
### "prompt_too_long" / agent stalls mid-conversation
You hit the model's context window. Set `context_window` on the
endpoint and helexa-acp will compact before sending. The log line
`context compaction applied` confirms it's running; if it fires
but the upstream still rejects, the compaction heuristic
under-counted and the budget needs tuning down.
### Reading files outside the workspace returns "not found"
Zed's `fs/read_text_file` is workspace-scoped. helexa-acp falls
back to local `std::fs` automatically when that fails — look for
`fs/read_text_file failed; falling back to local std::fs` in the
log. If even local read fails, the file genuinely doesn't exist
or the user process lacks permissions.
### Tool calls render as text instead of structured cards
The model is emitting `<tool_call>` markers that the parser can't
decode. Two common causes:
1. The system prompt isn't reaching the model (cortex/neuron's
tool-block injection didn't fire). Confirm with
`RUST_LOG=helexa_acp=debug` and look at the outgoing
`POST /chat/completions` body.
2. The model itself is too small / undertrained to follow the
Hermes format reliably. helexa-acp has shape-based name
inference and JSON repair, but there's a floor below which
nothing helps.
### Plan-mode writes refused even inside the plan dir
The path comparison is byte-for-byte. `~` expansion runs *before*
the comparison, so a `~`-prefixed path from the model still matches
an expanded plan_dir. Symlinks are the remaining trap: a symlinked
path and its resolved form do not compare equal. The error message
names the attempted path and the expected prefix so you can compare
them directly.
## Architecture
Source layout under `crates/helexa-acp/src/`:
| File | Responsibility |
|------|----------------|
| `main.rs` | tokio + Stdio transport. Builds providers, hands off to `agent::Agent` |
| `config.rs` | TOML + env-fallback config, endpoint resolver |
| `agent.rs` | ACP handlers (initialize, session/new, session/prompt, session/cancel, session/set_mode, session/set_model, session/load, session/list), prompt loop with tool-call recursion |
| `session.rs` | Per-session state map (Arc<RwLock<HashMap<…>>>) |
| `store.rs` | On-disk session persistence, plan-dir resolution |
| `prompt.rs` | System-prompt assembly, plan-mode addendum |
| `tools.rs` | Tool schemas + shape-based name inference |
| `tool_runner.rs` | Dispatch a single tool call through ACP client RPCs; permission gate |
| `qwen3.rs` | Qwen3 Hermes tool-format parser (`<tool_call>` / `<think>` markers) |
| `compaction.rs` | Token-budget compaction for the rolling history |
| `path_util.rs` | `~` / `$HOME` expansion shared across every path-taking tool |
| `provider/openai_chat.rs` | OpenAI chat completions provider |
| `provider/openai_responses.rs` | OpenAI Responses API provider |
| `provider/anthropic_messages.rs` | Anthropic Messages API provider |
### Adding a new wire format
1. New file under `src/provider/` implementing the `Provider`
trait (encoder + SSE decoder).
2. Add a `WireApi` variant in `config.rs`.
3. Wire it into `build_provider` in `main.rs`.
4. Done — every other module is wire-format-agnostic.
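
Step 2 might look roughly like the following. This is a sketch under assumptions: the variant names are guesses, and the real `config.rs` may derive the string mapping (for example through serde attributes) rather than hand-writing `FromStr`. Only the three accepted `wire_api` strings come from this README.

```rust
use std::str::FromStr;

/// Wire formats accepted in the `wire_api` config field.
/// Variant names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WireApi {
    OpenAiChat,
    OpenAiResponses,
    AnthropicMessages,
}

impl FromStr for WireApi {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "openai-chat" => Ok(Self::OpenAiChat),
            "openai-responses" => Ok(Self::OpenAiResponses),
            "anthropic-messages" => Ok(Self::AnthropicMessages),
            // A fourth format would add one variant and one arm here.
            other => Err(format!("unknown wire_api {other:?}")),
        }
    }
}
```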
### Concurrency
- `Arc<RwLock<HashMap<SessionId, Arc<Mutex<SessionState>>>>>`:
each session has its own mutex, so concurrent requests across
sessions don't contend; the map's RwLock is read-mostly.
- Every tool call is dispatched serially within a session (parallel
dispatch would require Zed to handle interleaved permission
prompts).
- Provider streams are back-pressured by the consumer (bounded
mpsc channels).
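
The session-map shape from the first bullet can be sketched with standard-library locks. This is an illustration, not the code in `session.rs`: the crate is tokio-based and may use async locks instead, `SessionId` is simplified to `String`, and `SessionState`'s field and `get_or_create` are invented for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

type SessionId = String;

/// Placeholder state; the real struct holds the transcript etc.
#[derive(Default)]
struct SessionState {
    rounds: usize,
}

#[derive(Default, Clone)]
struct Sessions {
    inner: Arc<RwLock<HashMap<SessionId, Arc<Mutex<SessionState>>>>>,
}

impl Sessions {
    /// Look up a session, creating it on first use.
    fn get_or_create(&self, id: &str) -> Arc<Mutex<SessionState>> {
        {
            // Fast path: shared read lock, released at the end of this block.
            let map = self.inner.read().unwrap();
            if let Some(state) = map.get(id) {
                return Arc::clone(state);
            }
        }
        // Slow path: exclusive lock only when inserting. `entry`
        // re-checks, so a racing insert is not overwritten.
        let mut map = self.inner.write().unwrap();
        map.entry(id.to_string()).or_default().clone()
    }
}
```

Callers hold the map lock only long enough to clone the inner `Arc`, then lock the per-session mutex on its own, which is what keeps work in one session from blocking another.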
### Self-contained
The crate has no workspace-internal dependencies (no
`cortex-core`, no `cortex-gateway`). Migration to a dedicated
GitHub repo for cross-platform CI / cargo-dist binaries is
Cargo.toml-only.
## Status
- Stages 1-6 shipped: scaffold, agent loop, tools, modes, session
resume, image input, model picker, three wire formats.
- Stage 8 (RPM + multi-platform CI) tracked in the canonical plan;
Linux x86_64 RPM ships today via the cortex monorepo's Gitea
Actions.
## Contributing
Repository: https://git.lair.cafe/helexa/helexa (`crates/helexa-acp/`).
Issues / PRs welcome. The canonical staged plan is in
`~/.claude/plans/plan-the-per-device-worker-abstract-micali.md` on
the maintainer's machine; the substages 3a-3e and 6a/6b that the
canonical plan didn't anticipate are documented in commit messages.
CI: `cargo fmt --check --all`, `cargo clippy --workspace -- -D
warnings`, `cargo test --workspace` must all pass before merge.



@@ -0,0 +1,425 @@
//! Rolling-conversation compaction for small-context local models.
//!
//! The tool-call loop in [`crate::agent`] grows the message vec it
//! sends upstream every round. On a frontier model that's fine; on a
//! 32 K Qwen3 the first few `read_file` results can push the prompt
//! past the model's context window, at which point cortex/neuron
//! refuses with `prompt_too_long` and the whole turn dies. Long-form
//! local agents are unusable without something here.
//!
//! Strategy (intentionally simple — no LLM-summarization round-trip,
//! no tokenizer dependency):
//!
//! 1. **Protect** the things the model cannot reason without:
//! - The system prompt (idx 0).
//! - Every `Role::User` turn (the user's intent — irreplaceable).
//! - The last [`KEEP_TAIL`] messages (most recent rounds stay
//! verbatim so the model can keep working on what it just
//! observed).
//! 2. **Elide** older `Role::Assistant` prose and older `Role::Tool`
//! result content. The structure stays — `tool_call_id`s, tool
//! names, and argument JSON survive intact — so OpenAI's strict
//! `tool_calls` ↔ `tool` pairing schema remains satisfied. Only
//! the *payload* shrinks to a one-line marker.
//! 3. Walk oldest→newest, recomputing the budget after each elision.
//! Stop as soon as we fit; we don't compact more than necessary.
//! 4. If we still exceed budget after eliding everything we're
//! allowed to, return what we have. The upstream will surface a
//! `prompt_too_long` error and the user can intervene; that's
//! better than silently dropping content the model needs.
//!
//! Token estimation uses a `chars / 3.5` heuristic — conservative
//! (over-estimates tokens slightly) so we compact a touch early
//! rather than a touch late.
use crate::provider::{Message, MessageContent, MessagePart, Role};
/// Most-recent N messages that are never elided. Roughly "the
/// current tool round in flight" — assistant turn that called the
/// tools + each tool result + a bit of slack.
const KEEP_TAIL: usize = 4;
/// Below this content size we don't bother eliding — the savings
/// don't outweigh the loss of detail. Roughly 60-80 tokens.
const ELIDE_MIN_CHARS: usize = 256;
/// Roughly tokens-per-character for English + code mixed in. The
/// actual per-tokenizer ratio varies (GPT-4o ≈ 4 chars/token on
/// English prose, ≈ 3 chars/token on code-heavy text). We pick a
/// value on the conservative end so the budget check fires *before*
/// the upstream tokenizer says no.
const CHARS_PER_TOKEN: f32 = 3.5;
/// Per-message envelope overhead (role + JSON framing). Comes out
/// to a few tokens; tiny but it adds up across long histories.
const ENVELOPE_TOKENS: usize = 8;
/// Rough per-image token cost used by the budget estimator. Real
/// vision tokenizers vary widely (256-1024 tokens for typical
/// resolutions on Qwen3-VL, OpenAI's `low`/`high` detail toggles
/// pick between ~85 and ~1000+). 512 is a defensible middle that
/// keeps compaction from treating images as free.
const IMAGE_TOKENS_APPROX: usize = 512;
/// Stats reported back from [`compact_to_budget`] for the caller to
/// log. The numbers are estimates (see [`estimate_tokens`]), so
/// don't compare them to upstream-reported token counts as if they
/// were exact.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct CompactionStats {
/// Estimated tokens in the input messages.
pub original_tokens: usize,
/// Estimated tokens after compaction. Equal to `original_tokens`
/// when no compaction was needed.
pub final_tokens: usize,
/// Number of messages whose content was elided. Zero is the
/// hot path (nothing to do).
pub elided_messages: usize,
}
impl CompactionStats {
fn unchanged(tokens: usize) -> Self {
Self {
original_tokens: tokens,
final_tokens: tokens,
elided_messages: 0,
}
}
}
/// Approximate token count for one message. Sums the textual
/// payload's chars, divides by [`CHARS_PER_TOKEN`], and adds an
/// envelope constant. Cheap (no allocation) so safe to call once per
/// message per round.
pub fn estimate_tokens(msg: &Message) -> usize {
let chars = match &msg.content {
MessageContent::Text { text } => text.len(),
MessageContent::MultiPart { parts } => parts
.iter()
.map(|p| match p {
MessagePart::Text { text } => text.len(),
// Each image is one block in the context window; the
// upstream tokenizer handles the real cost (and it
// varies wildly by model — Qwen3-VL uses ~256-1024
// tokens per image depending on size). Take a
// middle estimate so the budget tracker doesn't
// pretend images are free.
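// Convert to a char-equivalent in f32: `CHARS_PER_TOKEN as
// usize` would truncate 3.5 to 3 and under-count each image.
MessagePart::Image(_) => (IMAGE_TOKENS_APPROX as f32 * CHARS_PER_TOKEN) as usize,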
})
.sum(),
MessageContent::ToolCalls { text, calls } => {
let txt = text.as_deref().map(|s| s.len()).unwrap_or(0);
let calls_size: usize = calls
.iter()
.map(|c| c.name.len() + c.arguments.len() + c.id.len())
.sum();
txt + calls_size
}
MessageContent::ToolResult {
tool_call_id,
content,
} => tool_call_id.len() + content.len(),
};
((chars as f32 / CHARS_PER_TOKEN) as usize) + ENVELOPE_TOKENS
}
/// Sum of [`estimate_tokens`] across all messages.
pub fn total_tokens(messages: &[Message]) -> usize {
messages.iter().map(estimate_tokens).sum()
}
/// Project `messages` into a vec whose estimated token count fits in
/// `budget` tokens. Returns the projection plus stats about what
/// was done. When the input already fits, the projection is a clone
/// of the input and stats report zero elisions.
///
/// See module docs for the strategy and protected set.
pub fn compact_to_budget(messages: &[Message], budget: usize) -> (Vec<Message>, CompactionStats) {
let original = total_tokens(messages);
if original <= budget {
return (messages.to_vec(), CompactionStats::unchanged(original));
}
let mut out = messages.to_vec();
let len = out.len();
let tail_start = len.saturating_sub(KEEP_TAIL);
let mut elided = 0usize;
// Two passes. First pass: ToolResult contents (largest savings
// per elision — read_file payloads land here). Second pass: long
// Assistant prose. We don't interleave because eliding a long
// assistant turn before a really old read_file would do less
// good per elision; oldest-first ordering is enforced *within*
// each pass instead.
for pass in 0..2 {
for i in 1..tail_start {
if matches!(out[i].role, Role::User) {
continue;
}
let target_pass_2 = matches!(
&out[i].content,
MessageContent::Text { .. } | MessageContent::ToolCalls { .. }
);
let target_pass_1 = matches!(&out[i].content, MessageContent::ToolResult { .. });
let in_pass = (pass == 0 && target_pass_1) || (pass == 1 && target_pass_2);
if !in_pass {
continue;
}
if elide_in_place(&mut out[i]) {
elided += 1;
if total_tokens(&out) <= budget {
let final_tokens = total_tokens(&out);
return (
out,
CompactionStats {
original_tokens: original,
final_tokens,
elided_messages: elided,
},
);
}
}
}
}
let final_tokens = total_tokens(&out);
(
out,
CompactionStats {
original_tokens: original,
final_tokens,
elided_messages: elided,
},
)
}
/// Shrink one message's payload while keeping its structural role
/// (so tool_call_id pairing survives). Returns `true` when the
/// message changed.
///
/// - `ToolResult.content` → `(elided: N bytes of tool result)`
/// - `ToolCalls.text` → `(elided: N bytes of assistant prose)`
/// - `Text` (assistant) → `(elided: N bytes of assistant prose)`
///
/// Already-tiny payloads are skipped — eliding a 50-byte string
/// would *grow* it once the marker is in place.
fn elide_in_place(msg: &mut Message) -> bool {
match &mut msg.content {
MessageContent::ToolResult { content, .. } => {
if content.len() < ELIDE_MIN_CHARS {
return false;
}
*content = format!("(elided: {} bytes of tool result)", content.len());
true
}
MessageContent::ToolCalls { text, .. } => match text {
Some(t) if t.len() >= ELIDE_MIN_CHARS => {
*text = Some(format!("(elided: {} bytes of assistant prose)", t.len()));
true
}
_ => false,
},
MessageContent::Text { text } => {
if text.len() < ELIDE_MIN_CHARS {
return false;
}
*text = format!("(elided: {} bytes of assistant prose)", text.len());
true
}
MessageContent::MultiPart { .. } => {
// MultiPart messages today only exist as User turns,
// and User turns are protected by the role check in
// `compact_to_budget` — so this branch is unreachable
// for current call sites. Returning false keeps the
// unreachable path benign if a future stage starts
// emitting MultiPart on other roles.
false
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::provider::ToolCall;
fn sys(text: &str) -> Message {
Message {
role: Role::System,
content: MessageContent::Text { text: text.into() },
}
}
fn user(text: &str) -> Message {
Message {
role: Role::User,
content: MessageContent::Text { text: text.into() },
}
}
fn assistant_text(text: &str) -> Message {
Message {
role: Role::Assistant,
content: MessageContent::Text { text: text.into() },
}
}
fn assistant_calls(text: Option<&str>, name: &str, args: &str, id: &str) -> Message {
Message {
role: Role::Assistant,
content: MessageContent::ToolCalls {
text: text.map(|s| s.to_string()),
calls: vec![ToolCall {
id: id.into(),
name: name.into(),
arguments: args.into(),
}],
},
}
}
fn tool_result(id: &str, body: &str) -> Message {
Message {
role: Role::Tool,
content: MessageContent::ToolResult {
tool_call_id: id.into(),
content: body.into(),
},
}
}
#[test]
fn under_budget_is_a_no_op_clone() {
let msgs = vec![sys("you are an agent"), user("hi"), assistant_text("hello")];
let (out, stats) = compact_to_budget(&msgs, 10_000);
assert_eq!(stats.elided_messages, 0);
assert_eq!(stats.original_tokens, stats.final_tokens);
assert_eq!(out.len(), msgs.len());
// Strings unchanged.
match &out[2].content {
MessageContent::Text { text } => assert_eq!(text, "hello"),
other => panic!("expected Text, got {other:?}"),
}
}
#[test]
fn elides_old_tool_result_before_old_assistant_prose() {
// History: sys, user, assistant_calls, big_tool_result,
// assistant_with_big_text, user, assistant_calls,
// small_tool_result.
// KEEP_TAIL=4 protects the last four; the big tool result
// sits in the prunable range and should go first because
// pass 0 (tool results) runs before pass 1 (prose).
let big_result = "X".repeat(4096);
let big_prose = "Y".repeat(2048);
let msgs = vec![
sys("preamble"),
user("first ask"),
assistant_calls(None, "read_file", r#"{"path":"/a"}"#, "c0"),
tool_result("c0", &big_result),
assistant_text(&big_prose),
user("follow up"),
assistant_calls(None, "read_file", r#"{"path":"/b"}"#, "c1"),
tool_result("c1", "short result body"),
];
let before = total_tokens(&msgs);
// Force compaction by setting budget well below current.
let budget = before / 2;
let (out, stats) = compact_to_budget(&msgs, budget);
assert!(
stats.elided_messages >= 1,
"expected at least one elision, got {stats:?}"
);
// The big tool result must be elided (oldest fat target).
match &out[3].content {
MessageContent::ToolResult { content, .. } => {
assert!(
content.starts_with("(elided:"),
"tool result not elided: {content:?}"
);
}
other => panic!("expected ToolResult, got {other:?}"),
}
// Last four messages must be untouched.
assert!(matches!(
&out[out.len() - 1].content,
MessageContent::ToolResult { content, .. } if content == "short result body"
));
}
#[test]
fn never_elides_system_or_user_turns() {
let big_user = "U".repeat(8192);
let msgs = vec![sys("preamble"), user(&big_user), assistant_text("ok")];
let budget = 10; // way below — forces all possible elision
let (out, _stats) = compact_to_budget(&msgs, budget);
// System unchanged.
match &out[0].content {
MessageContent::Text { text } => assert_eq!(text, "preamble"),
other => panic!("expected Text, got {other:?}"),
}
// User unchanged even though it's huge.
match &out[1].content {
MessageContent::Text { text } => assert_eq!(text.len(), big_user.len()),
other => panic!("expected Text, got {other:?}"),
}
}
#[test]
fn preserves_tool_call_id_pairing_after_elision() {
// OpenAI strict mode rejects a tool-result whose tool_call_id
// doesn't match a preceding assistant tool_call. Elision
// must not break that linkage.
let big = "Z".repeat(4096);
let msgs = vec![
sys("preamble"),
user("first"),
assistant_calls(None, "read_file", r#"{"path":"/a"}"#, "call_42"),
tool_result("call_42", &big),
// Tail messages.
user("next"),
assistant_calls(None, "read_file", r#"{"path":"/b"}"#, "call_43"),
tool_result("call_43", "ok"),
assistant_text("done"),
];
let budget = total_tokens(&msgs) / 3;
let (out, _stats) = compact_to_budget(&msgs, budget);
// The assistant call and its result both carry call_42.
let call_id = match &out[2].content {
MessageContent::ToolCalls { calls, .. } => calls[0].id.clone(),
other => panic!("expected ToolCalls, got {other:?}"),
};
match &out[3].content {
MessageContent::ToolResult { tool_call_id, .. } => {
assert_eq!(tool_call_id, &call_id, "pairing broken");
}
other => panic!("expected ToolResult, got {other:?}"),
}
}
#[test]
fn estimate_tokens_grows_with_content() {
let small = sys("hi");
let large = sys(&"x".repeat(10_000));
assert!(estimate_tokens(&large) > estimate_tokens(&small) * 100);
}
#[test]
fn elide_in_place_skips_short_content() {
let mut m = tool_result("c0", "tiny");
assert!(!elide_in_place(&mut m));
match m.content {
MessageContent::ToolResult { content, .. } => assert_eq!(content, "tiny"),
other => panic!("expected ToolResult, got {other:?}"),
}
}
#[test]
fn returns_best_effort_when_budget_unmeetable() {
// Single huge user message that cannot be elided. Budget 10.
// We don't error — we return what we have and let upstream
// refuse the prompt with its own error.
let big_user = "U".repeat(100_000);
let msgs = vec![sys("preamble"), user(&big_user)];
let (out, stats) = compact_to_budget(&msgs, 10);
assert_eq!(out.len(), msgs.len());
assert!(stats.final_tokens > 10, "still over budget by design");
}
}
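
The tests above pin down the compaction contract: tool results are elided before assistant prose, system and user turns are never touched, the tail is protected, and an unmeetable budget yields a best-effort result rather than an error. A minimal sketch of that ordering follows. The `Msg` type, the `len / 4 + 4` token estimate, and the `MIN_ELIDE_LEN` threshold are assumptions made for illustration; the real `compact_to_budget` operates on the crate's `Message` type and returns a stats struct rather than a bare count.

```rust
// Hypothetical, simplified message model (the crate's `Message` is richer).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    System,
    User,
    Assistant,
    Tool,
}

#[derive(Debug, Clone)]
struct Msg {
    role: Role,
    text: String,
}

/// Number of trailing messages that are never elided (value taken from the test comment).
const KEEP_TAIL: usize = 4;
/// Assumed threshold: bodies shorter than this are not worth eliding.
const MIN_ELIDE_LEN: usize = 64;

/// Rough estimate: about four bytes per token plus a fixed per-message overhead.
fn estimate_tokens(m: &Msg) -> usize {
    m.text.len() / 4 + 4
}

fn total_tokens(msgs: &[Msg]) -> usize {
    msgs.iter().map(estimate_tokens).sum()
}

/// Replace a long body with a short marker. Returns false if nothing changed.
fn elide_in_place(m: &mut Msg) -> bool {
    if m.text.len() < MIN_ELIDE_LEN || m.text.starts_with("(elided:") {
        return false;
    }
    m.text = format!("(elided: {} bytes)", m.text.len());
    true
}

/// Pass 0 elides old tool results, pass 1 elides old assistant prose.
/// System and user turns are never candidates. Stops as soon as the
/// estimate fits the budget; otherwise returns the best effort.
fn compact_to_budget(msgs: &[Msg], budget: usize) -> (Vec<Msg>, usize) {
    let mut out = msgs.to_vec();
    let mut elided = 0;
    let prunable_end = out.len().saturating_sub(KEEP_TAIL);
    for target in [Role::Tool, Role::Assistant] {
        for i in 0..prunable_end {
            if total_tokens(&out) <= budget {
                return (out, elided);
            }
            if out[i].role == target && elide_in_place(&mut out[i]) {
                elided += 1;
            }
        }
    }
    (out, elided)
}
```

Because messages are rewritten in place rather than removed, indices and any tool-call pairing carried alongside the body stay intact, which is the property `preserves_tool_call_id_pairing_after_elision` checks.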


@@ -0,0 +1,424 @@
//! Configuration for the helexa-acp bridge.
//!
//! Loaded from `$XDG_CONFIG_HOME/helexa-acp/config.toml` (or
//! `~/.config/helexa-acp/config.toml` as a fallback). If no config file
//! exists, falls back to building a single anonymous endpoint from env
//! vars — that keeps "just point at one cortex" frictionless without
//! requiring a config file on disk.
//!
//! The design goal is "the missing ACP binary for users with multiple
//! API endpoints (possibly on a private LAN, possibly mixing wire
//! types)". Hence: every endpoint is named, has its own wire API, and
//! has its own default model. The agent's selected model id can be
//! prefixed `endpoint:model` to route across endpoints; a bare
//! `model` falls through to the configured `default_endpoint`.
//!
//! ### Example TOML
//!
//! ```toml
//! default_endpoint = "helexa"
//!
//! [[endpoints]]
//! name = "helexa"
//! base_url = "http://hanzalova.internal:31313/v1"
//! wire_api = "openai-chat"
//! default_model = "helexa/large"
//!
//! [[endpoints]]
//! name = "openrouter"
//! base_url = "https://openrouter.ai/api/v1"
//! wire_api = "openai-chat"
//! api_key_env = "OPENROUTER_API_KEY"
//! default_model = "anthropic/claude-opus-4"
//!
//! [[endpoints]]
//! name = "lmstudio"
//! base_url = "http://localhost:1234/v1"
//! wire_api = "openai-chat"
//! default_model = "auto"
//! ```
use anyhow::{Context, anyhow};
use serde::{Deserialize, Serialize};
use std::path::{Path, PathBuf};
use url::Url;
const DEFAULT_BASE_URL: &str = "http://hanzalova.internal:31313/v1";
const DEFAULT_MODEL: &str = "helexa/large";
const DEFAULT_ENDPOINT_NAME: &str = "default";
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Config {
/// Name of the endpoint used when a request doesn't pick one
/// explicitly. Must reference an entry in `endpoints`. Defaults to
/// the first endpoint declared if unset.
#[serde(default)]
pub default_endpoint: Option<String>,
/// Per-endpoint configuration. At least one entry is required.
#[serde(default)]
pub endpoints: Vec<EndpointConfig>,
/// Optional path to a system-prompt file. When unset, the built-in
/// default prompt from `prompt.rs` is used.
#[serde(default)]
pub system_prompt_path: Option<PathBuf>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EndpointConfig {
/// Short identifier used in `endpoint:model` routing and in logs.
pub name: String,
/// Base URL of the OpenAI-compatible API. Must include the `/v1`
/// (or equivalent) suffix — paths like `chat/completions` and
/// `models` are joined onto this.
pub base_url: Url,
/// Wire protocol the endpoint speaks. Phase 1 supports
/// [`WireApi::OpenAiChat`] only; `openai-responses` and
/// `anthropic-messages` land later behind their own providers.
#[serde(default)]
pub wire_api: WireApi,
/// Model to use when the client hasn't picked one via
/// `session/set_model`.
#[serde(default)]
pub default_model: Option<String>,
/// Static API key to send as `Authorization: Bearer …`. Prefer
/// `api_key_env` for anything sensitive — keys in plain TOML are a
/// liability.
#[serde(default)]
pub api_key: Option<String>,
/// Env var name to read for the API key. Resolved at startup so a
/// missing env var yields a clear error rather than silent
/// unauthenticated calls.
#[serde(default)]
pub api_key_env: Option<String>,
/// Cap on the model's output tokens per turn. `None` lets the
/// upstream pick its own default (cortex/neuron's default is
/// often small enough to trip Zed's "Output Limit Reached" on
/// long responses). Set to e.g. `32768` to let the model
/// produce longer turns. Goes into the OpenAI `max_tokens`
/// request field.
#[serde(default)]
pub max_tokens: Option<u64>,
/// Model context window in tokens (prompt + response). When set,
/// the agent compacts conversation history before each completion
/// so the prompt fits within `context_window - max_tokens - safety`
/// tokens — long sessions on small-context local models (Qwen3 at
/// 32 K) survive past the first few tool-call rounds rather than
/// dying with `prompt_too_long`. `None` disables compaction.
#[serde(default)]
pub context_window: Option<usize>,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default, Serialize, Deserialize)]
pub enum WireApi {
/// `POST {base}/chat/completions` returning OpenAI-format SSE.
/// Compatible with cortex, LM Studio, Ollama (compat mode),
/// OpenRouter, OpenAI itself.
#[default]
#[serde(rename = "openai-chat")]
OpenAiChat,
/// `POST {base}/responses` — OpenAI's newer Responses API. Not
/// implemented yet; the variant is reserved so endpoint configs
/// can be authored ahead of provider support.
#[serde(rename = "openai-responses")]
OpenAiResponses,
/// `POST {base}/messages` — Anthropic format. Reserved.
#[serde(rename = "anthropic-messages")]
AnthropicMessages,
}
impl EndpointConfig {
/// Resolve the API key from `api_key` (literal) or `api_key_env`
/// (env-var lookup). Returns `Ok(None)` when neither is set;
/// `Err` when `api_key_env` references a missing variable.
pub fn resolve_api_key(&self) -> anyhow::Result<Option<String>> {
if let Some(literal) = &self.api_key {
return Ok(Some(literal.clone()));
}
if let Some(var) = &self.api_key_env {
return Ok(Some(std::env::var(var).with_context(|| {
format!(
"endpoint '{}' references missing env var {}",
self.name, var
)
})?));
}
Ok(None)
}
/// `{base_url}/chat/completions`.
pub fn chat_completions_url(&self) -> Url {
join_segments(&self.base_url, &["chat", "completions"])
}
/// `{base_url}/responses` — OpenAI Responses API endpoint.
pub fn responses_url(&self) -> Url {
join_segments(&self.base_url, &["responses"])
}
/// `{base_url}/models`. Called from `Provider::list_models`, which
/// Stage 4 wires into the model-picker dropdown; until then it's
/// reachable code with no in-tree callers.
#[allow(dead_code)]
pub fn models_url(&self) -> Url {
join_segments(&self.base_url, &["models"])
}
}
impl Config {
/// Load from TOML at the standard config path, or build from env
/// vars if no file exists. Env-fallback yields a single endpoint
/// named `"default"`.
pub fn load() -> anyhow::Result<Self> {
let path = config_path();
if let Some(path) = &path
&& path.exists()
{
return Self::from_file(path);
}
Self::from_env()
}
/// Single-endpoint config constructed from `HELEXA_ACP_BASE_URL`,
/// `HELEXA_ACP_MODEL`, `HELEXA_ACP_API_KEY`,
/// `HELEXA_ACP_SYSTEM_PROMPT_PATH`, `HELEXA_ACP_MAX_TOKENS`, and
/// `HELEXA_ACP_CONTEXT_WINDOW`.
pub fn from_env() -> anyhow::Result<Self> {
let base_url = std::env::var("HELEXA_ACP_BASE_URL")
.ok()
.unwrap_or_else(|| DEFAULT_BASE_URL.into());
let base_url = Url::parse(&base_url)
.with_context(|| format!("HELEXA_ACP_BASE_URL is not a valid URL ({base_url})"))?;
let default_model = std::env::var("HELEXA_ACP_MODEL")
.ok()
.unwrap_or_else(|| DEFAULT_MODEL.into());
let api_key = std::env::var("HELEXA_ACP_API_KEY")
.ok()
.filter(|s| !s.is_empty());
let system_prompt_path = std::env::var("HELEXA_ACP_SYSTEM_PROMPT_PATH")
.ok()
.filter(|s| !s.is_empty())
.map(PathBuf::from);
let max_tokens = std::env::var("HELEXA_ACP_MAX_TOKENS")
.ok()
.filter(|s| !s.is_empty())
.map(|s| {
s.parse::<u64>().with_context(|| {
format!("HELEXA_ACP_MAX_TOKENS is not a non-negative integer ({s})")
})
})
.transpose()?;
let context_window = std::env::var("HELEXA_ACP_CONTEXT_WINDOW")
.ok()
.filter(|s| !s.is_empty())
.map(|s| {
s.parse::<usize>().with_context(|| {
format!("HELEXA_ACP_CONTEXT_WINDOW is not a non-negative integer ({s})")
})
})
.transpose()?;
Ok(Self {
default_endpoint: Some(DEFAULT_ENDPOINT_NAME.into()),
endpoints: vec![EndpointConfig {
name: DEFAULT_ENDPOINT_NAME.into(),
base_url,
wire_api: WireApi::OpenAiChat,
default_model: Some(default_model),
api_key,
api_key_env: None,
max_tokens,
context_window,
}],
system_prompt_path,
})
}
pub fn from_file(path: &Path) -> anyhow::Result<Self> {
let text = std::fs::read_to_string(path)
.with_context(|| format!("read config {}", path.display()))?;
let mut cfg: Self =
toml::from_str(&text).with_context(|| format!("parse config {}", path.display()))?;
cfg.validate()?;
Ok(cfg)
}
fn validate(&mut self) -> anyhow::Result<()> {
if self.endpoints.is_empty() {
return Err(anyhow!("config has no [[endpoints]] entries"));
}
for (i, ep) in self.endpoints.iter().enumerate() {
if ep.name.is_empty() {
return Err(anyhow!("endpoints[{i}] has empty name"));
}
if ep.name.contains(':') {
return Err(anyhow!(
"endpoints[{i}].name '{}' contains ':' which would clash \
with the endpoint:model selector syntax",
ep.name
));
}
}
// Pick a default endpoint if none was named.
if self.default_endpoint.is_none() {
self.default_endpoint = Some(self.endpoints[0].name.clone());
}
let default_name = self.default_endpoint.as_deref().unwrap();
if !self.endpoints.iter().any(|e| e.name == default_name) {
return Err(anyhow!(
"default_endpoint '{default_name}' is not declared in [[endpoints]]"
));
}
Ok(())
}
/// Look up an endpoint by name. Returns `None` if not configured.
pub fn endpoint(&self, name: &str) -> Option<&EndpointConfig> {
self.endpoints.iter().find(|e| e.name == name)
}
/// The default endpoint (guaranteed to exist after `validate`).
pub fn default_endpoint(&self) -> &EndpointConfig {
let name = self
.default_endpoint
.as_deref()
.expect("default_endpoint set by validate");
self.endpoint(name)
.expect("default_endpoint resolves after validate")
}
}
/// Parse an ACP-side `model` field into (endpoint name, raw model id).
///
/// `helexa:helexa/large` → (`Some("helexa")`, `"helexa/large"`).
/// `helexa/large` → (`None`, `"helexa/large"`).
///
/// The split happens at the FIRST colon. Model ids commonly contain
/// `/` (HuggingFace style) but rarely `:`; if a model id does contain
/// one, the user can prefix it with the default endpoint name
/// (`endpoint:model:tag`) so that only the first colon is consumed.
pub fn parse_model_selector(input: &str) -> (Option<&str>, &str) {
match input.split_once(':') {
Some((endpoint, model)) if !endpoint.is_empty() && !model.is_empty() => {
(Some(endpoint), model)
}
_ => (None, input),
}
}
fn config_path() -> Option<PathBuf> {
if let Ok(override_path) = std::env::var("HELEXA_ACP_CONFIG_PATH") {
return Some(PathBuf::from(override_path));
}
let xdg = std::env::var("XDG_CONFIG_HOME")
.ok()
.filter(|s| !s.is_empty());
let base = xdg.map(PathBuf::from).or_else(|| {
std::env::var("HOME")
.ok()
.map(|h| PathBuf::from(h).join(".config"))
})?;
Some(base.join("helexa-acp").join("config.toml"))
}
fn join_segments(base: &Url, segments: &[&str]) -> Url {
let mut out = base.clone();
if let Ok(mut path) = out.path_segments_mut() {
path.pop_if_empty().extend(segments.iter().copied());
}
out
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn url_join_handles_trailing_slash() {
let ep = EndpointConfig {
name: "x".into(),
base_url: Url::parse("http://h.internal:31313/v1").unwrap(),
wire_api: WireApi::OpenAiChat,
default_model: None,
api_key: None,
api_key_env: None,
max_tokens: None,
context_window: None,
};
assert_eq!(
ep.chat_completions_url().as_str(),
"http://h.internal:31313/v1/chat/completions"
);
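assert_eq!(
ep.models_url().as_str(),
"http://h.internal:31313/v1/models"
);
// A trailing slash on the base URL must not yield `/v1//chat/completions`.
let with_slash = EndpointConfig {
base_url: Url::parse("http://h.internal:31313/v1/").unwrap(),
..ep.clone()
};
assert_eq!(
with_slash.chat_completions_url().as_str(),
"http://h.internal:31313/v1/chat/completions"
);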
}
#[test]
fn parses_model_selector() {
assert_eq!(
parse_model_selector("helexa:helexa/large"),
(Some("helexa"), "helexa/large")
);
assert_eq!(parse_model_selector("helexa/large"), (None, "helexa/large"));
assert_eq!(parse_model_selector("gpt-5"), (None, "gpt-5"));
// Edge case: a leading colon → no endpoint.
assert_eq!(parse_model_selector(":gpt-5"), (None, ":gpt-5"));
}
#[test]
fn env_fallback_builds_single_endpoint() {
// Don't set env vars to custom values (would race with other tests);
// only clear the overrides so the built-in defaults apply, then
// confirm the default path constructs cleanly.
unsafe {
std::env::remove_var("HELEXA_ACP_BASE_URL");
std::env::remove_var("HELEXA_ACP_MODEL");
std::env::remove_var("HELEXA_ACP_API_KEY");
}
let cfg = Config::from_env().unwrap();
assert_eq!(cfg.endpoints.len(), 1);
assert_eq!(cfg.endpoints[0].name, "default");
assert_eq!(cfg.endpoints[0].base_url.as_str(), DEFAULT_BASE_URL);
assert_eq!(
cfg.endpoints[0].default_model.as_deref(),
Some(DEFAULT_MODEL)
);
}
#[test]
fn toml_parses_multi_endpoint() {
let toml_text = r#"
default_endpoint = "helexa"
[[endpoints]]
name = "helexa"
base_url = "http://hanzalova.internal:31313/v1"
default_model = "helexa/large"
[[endpoints]]
name = "openrouter"
base_url = "https://openrouter.ai/api/v1"
wire_api = "openai-chat"
api_key_env = "OPENROUTER_API_KEY"
default_model = "anthropic/claude-opus-4"
"#;
let mut cfg: Config = toml::from_str(toml_text).unwrap();
cfg.validate().unwrap();
assert_eq!(cfg.endpoints.len(), 2);
assert_eq!(cfg.default_endpoint().name, "helexa");
assert_eq!(cfg.endpoints[0].wire_api, WireApi::OpenAiChat);
assert_eq!(
cfg.endpoints[1].api_key_env.as_deref(),
Some("OPENROUTER_API_KEY")
);
}
#[test]
fn validate_rejects_colon_in_endpoint_name() {
let toml_text = r#"
[[endpoints]]
name = "bad:name"
base_url = "http://x/v1"
"#;
let mut cfg: Config = toml::from_str(toml_text).unwrap();
let err = cfg.validate().unwrap_err();
assert!(format!("{err}").contains("clash"));
}
}
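
Two pieces of arithmetic in this file are easy to get subtly wrong: the first-colon split in `parse_model_selector`, and the `context_window - max_tokens - safety` prompt budget described on `EndpointConfig::context_window`. The sketch below restates the selector function from the source and adds two hypothetical helpers, `route` and `prompt_budget`, that are not part of the diff. How the real agent treats an unknown endpoint prefix, and what safety margin it uses, are not shown here, so both are assumptions.

```rust
/// Split `endpoint:model` at the first colon (same logic as the source).
fn parse_model_selector(input: &str) -> (Option<&str>, &str) {
    match input.split_once(':') {
        Some((endpoint, model)) if !endpoint.is_empty() && !model.is_empty() => {
            (Some(endpoint), model)
        }
        _ => (None, input),
    }
}

/// Hypothetical resolver: returns (endpoint name, raw model id).
/// Assumption: an unknown endpoint prefix is reported as an error
/// rather than silently falling back to the default endpoint.
fn route<'a>(
    input: &'a str,
    endpoints: &[&str],
    default_endpoint: &'a str,
) -> Result<(&'a str, &'a str), String> {
    match parse_model_selector(input) {
        (Some(ep), model) => {
            if endpoints.iter().any(|known| *known == ep) {
                Ok((ep, model))
            } else {
                Err(format!("unknown endpoint '{ep}' in selector '{input}'"))
            }
        }
        (None, model) => Ok((default_endpoint, model)),
    }
}

/// Hypothetical budget helper: tokens left for the prompt after reserving
/// room for the response and a safety margin. Saturates at zero instead
/// of underflowing when the reservation exceeds the window.
fn prompt_budget(context_window: usize, max_tokens: Option<u64>, safety: usize) -> usize {
    let reserved = max_tokens.unwrap_or(0) as usize;
    context_window.saturating_sub(reserved).saturating_sub(safety)
}
```

Under these assumptions a tagged model id such as `llama3:8b` is read as endpoint `llama3`, so it has to be written with an explicit endpoint prefix (for example `lmstudio:llama3:8b`) to route as intended.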


@@ -0,0 +1,145 @@
//! helexa-acp — Agent Client Protocol bridge for multi-endpoint LLM
//! setups (helexa, LM Studio, Ollama, OpenRouter, OpenAI, Anthropic,
//! …) with a clean per-endpoint wire-format selector.
//!
//! Speaks ACP over stdio to an editor client (Zed today). Every
//! configured endpoint produces a wire-format-specific
//! [`provider::Provider`] implementation; the agent loop in
//! [`agent::Agent`] is provider-agnostic, so adding e.g. an Anthropic
//! /v1/messages provider doesn't touch `agent.rs`.
//!
//! Config: `$XDG_CONFIG_HOME/helexa-acp/config.toml` for the multi-
//! endpoint case; env vars (`HELEXA_ACP_BASE_URL`, etc.) for the
//! single-endpoint case when no config file exists.
use agent_client_protocol::{Result, Stdio};
use std::sync::Arc;
mod agent;
mod compaction;
mod config;
mod path_util;
mod prompt;
mod provider;
mod qwen3;
mod session;
mod store;
mod tool_runner;
mod tools;
use agent::Agent;
use config::{Config, EndpointConfig, WireApi};
use provider::{
Provider, anthropic_messages::AnthropicMessagesProvider, openai_chat::OpenAIChatProvider,
openai_responses::OpenAIResponsesProvider,
};
/// Set up tracing. Logs go to stderr by default — stdout is
/// reserved for the JSON-RPC stream. Setting `HELEXA_ACP_LOG_FILE`
/// to an absolute path appends logs to that file instead, which is
/// the practical way to capture debug output when the agent runs
/// under an editor (Zed, etc.) that doesn't surface stderr.
///
/// `RUST_LOG` still controls levels (e.g. `helexa_acp=debug`).
/// ANSI colours are disabled when writing to a file so the log
/// is plain text.
fn init_tracing() {
let env_filter = tracing_subscriber::EnvFilter::try_from_default_env()
.unwrap_or_else(|_| tracing_subscriber::EnvFilter::new("info"));
let log_file = std::env::var("HELEXA_ACP_LOG_FILE")
.ok()
.filter(|s| !s.is_empty());
match log_file {
Some(path) => match std::fs::OpenOptions::new()
.create(true)
.append(true)
.open(&path)
{
Ok(file) => {
tracing_subscriber::fmt()
.with_writer(std::sync::Mutex::new(file))
.with_env_filter(env_filter)
.with_ansi(false)
.init();
}
Err(e) => {
// Fall back to stderr and shout. We don't want a
// typo'd log path to silence the agent entirely.
tracing_subscriber::fmt()
.with_writer(std::io::stderr)
.with_env_filter(env_filter)
.init();
tracing::warn!(
path = %path,
error = %e,
"HELEXA_ACP_LOG_FILE could not be opened; using stderr"
);
}
},
None => {
tracing_subscriber::fmt()
.with_writer(std::io::stderr)
.with_env_filter(env_filter)
.init();
}
}
}
/// Build a provider for `endpoint` according to its declared
/// `wire_api`. Additional wire types (e.g. Ollama native) slot in
/// here as new match arms without changing the caller.
fn build_provider(endpoint: EndpointConfig) -> anyhow::Result<Arc<dyn Provider>> {
match endpoint.wire_api {
WireApi::OpenAiChat => Ok(Arc::new(OpenAIChatProvider::new(endpoint)?)),
WireApi::OpenAiResponses => Ok(Arc::new(OpenAIResponsesProvider::new(endpoint)?)),
WireApi::AnthropicMessages => Ok(Arc::new(AnthropicMessagesProvider::new(endpoint)?)),
}
}
#[tokio::main]
async fn main() -> Result<()> {
init_tracing();
let cfg = Config::load()
.map_err(|e| agent_client_protocol::util::internal_error(format!("config: {e:#}")))?;
tracing::info!(
endpoints = cfg.endpoints.len(),
default_endpoint = %cfg.default_endpoint().name,
default_model = ?cfg.default_endpoint().default_model,
"helexa-acp starting"
);
// Build a provider for each configured endpoint up-front. Cheap —
// just sets up a reqwest::Client and resolves the API key — and
// surfaces config mistakes (missing API key env var, unsupported
// wire_api) before the editor even sends an initialize request.
let mut providers: Vec<Arc<dyn Provider>> = Vec::with_capacity(cfg.endpoints.len());
for endpoint in &cfg.endpoints {
match build_provider(endpoint.clone()) {
Ok(p) => {
tracing::info!(
endpoint = %endpoint.name,
base_url = %endpoint.base_url,
wire_api = ?endpoint.wire_api,
"registered provider"
);
providers.push(p);
}
Err(e) => {
tracing::warn!(
endpoint = %endpoint.name,
error = %format!("{e:#}"),
"skipping endpoint with invalid config"
);
}
}
}
let agent = Agent::new(&cfg, providers)
.await
.map_err(|e| agent_client_protocol::util::internal_error(format!("agent: {e:#}")))?;
agent.serve(Stdio::new()).await
}
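
The startup loop in `main` builds every provider eagerly and skips, rather than aborts on, endpoints whose configuration is invalid. A self-contained sketch of that pattern is below. The `Endpoint`, `ChatProvider` and `ResponsesProvider` types and the `describe` method are hypothetical stand-ins, and the environment is passed in as a map so the example stays deterministic; the real providers take an `EndpointConfig` and read the process environment.

```rust
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Debug, Clone, Copy)]
enum WireApi {
    OpenAiChat,
    OpenAiResponses,
}

/// Hypothetical, trimmed-down endpoint description.
struct Endpoint {
    name: String,
    wire_api: WireApi,
    api_key_env: Option<String>,
}

/// Stand-in for the crate's `Provider` trait.
trait Provider {
    fn describe(&self) -> String;
}

struct ChatProvider {
    endpoint: String,
    has_key: bool,
}

struct ResponsesProvider {
    endpoint: String,
    has_key: bool,
}

impl Provider for ChatProvider {
    fn describe(&self) -> String {
        format!("{} (chat, key: {})", self.endpoint, self.has_key)
    }
}

impl Provider for ResponsesProvider {
    fn describe(&self) -> String {
        format!("{} (responses, key: {})", self.endpoint, self.has_key)
    }
}

/// Resolve the key and pick the provider type from `wire_api`.
fn build_provider(
    ep: &Endpoint,
    env: &HashMap<String, String>,
) -> Result<Arc<dyn Provider>, String> {
    let key = match &ep.api_key_env {
        Some(var) => Some(env.get(var).cloned().ok_or_else(|| {
            format!("endpoint '{}' references missing env var {}", ep.name, var)
        })?),
        None => None,
    };
    let has_key = key.is_some();
    let endpoint = ep.name.clone();
    let provider: Arc<dyn Provider> = match ep.wire_api {
        WireApi::OpenAiChat => Arc::new(ChatProvider { endpoint, has_key }),
        WireApi::OpenAiResponses => Arc::new(ResponsesProvider { endpoint, has_key }),
    };
    Ok(provider)
}

/// Build all providers, collecting errors instead of failing fast.
fn build_all(
    endpoints: &[Endpoint],
    env: &HashMap<String, String>,
) -> (Vec<Arc<dyn Provider>>, Vec<String>) {
    let mut providers = Vec::with_capacity(endpoints.len());
    let mut skipped = Vec::new();
    for ep in endpoints {
        match build_provider(ep, env) {
            Ok(p) => providers.push(p),
            Err(e) => skipped.push(e),
        }
    }
    (providers, skipped)
}
```

One consequence of skipping is that the provider list can come back empty, or without the default endpoint, even though the config validated; the caller has to decide whether that is fatal.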


@@ -0,0 +1,192 @@
//! Path expansion shared across every tool that takes a path.
//!
//! Models often emit shell-style paths like `~/git/repo/file.rs` or
//! `$HOME/notes.md`. ACP's `fs/read_text_file` and friends — and our
//! own local `std::fs` reads — both want a real absolute path; the
//! `~` / `$HOME` forms reach them as literal strings and the open
//! fails. The tool schemas already document "absolute path" but in
//! practice the model slips up often enough that handling it
//! server-side is the difference between "works" and "the agent is
//! brittle".
//!
//! Scope is deliberately small:
//!
//! - `~` and `~/` (current user only — `~user` lookups would require
//! pulling in passwd parsing).
//! - `$HOME` and `$HOME/`.
//!
//! Any other shell variable (`$PWD`, `${HOME}`, …) passes through
//! unchanged. The shell already expands them inside `bash` tool
//! commands; for the file-tool argument fields, we deliberately
//! limit the set so the behaviour is predictable.
//!
//! Falls back to the input path verbatim when `HOME` is unset
//! (stripped-down container env). That preserves the "no surprise
//! mutations" rule — never invent a path the caller didn't ask for.
use std::path::{Path, PathBuf};
/// Process-global lock for tests that mutate `HOME`. Anyone in the
/// crate touching `HOME` must hold this for the duration of the
/// read-modify-restore window — otherwise concurrent `cargo test`
/// workers race and flake.
///
/// Only built into the test binaries. Production code never mutates
/// env vars.
#[cfg(test)]
pub(crate) static ENV_LOCK: std::sync::Mutex<()> = std::sync::Mutex::new(());
/// Expand `~`, `~/`, `$HOME`, and `$HOME/` prefixes against the
/// current user's home directory. All other inputs pass through
/// unchanged.
///
/// Returns the input verbatim if `HOME` isn't set in the env.
pub fn expand_path(input: &Path) -> PathBuf {
let Some(s) = input.to_str() else {
return input.to_path_buf();
};
let Ok(home) = std::env::var("HOME") else {
return input.to_path_buf();
};
let home = PathBuf::from(home);
if s == "~" || s == "$HOME" {
return home;
}
if let Some(rest) = s.strip_prefix("~/") {
return home.join(rest);
}
if let Some(rest) = s.strip_prefix("$HOME/") {
return home.join(rest);
}
input.to_path_buf()
}
#[cfg(test)]
mod tests {
use super::*;
/// Set HOME for the duration of the test. Tests using this run
/// serially under the crate-wide [`ENV_LOCK`] because env
/// mutation isn't thread-safe — `cargo test` parallel workers
/// would race without it.
fn with_home<F: FnOnce()>(home: &str, body: F) {
let _g = ENV_LOCK.lock().unwrap();
let prior = std::env::var("HOME").ok();
// SAFETY: tests touch process-global env. The mutex
// serialises access; sub-threads in other test modules
// touching HOME aren't expected (none in this crate).
unsafe {
std::env::set_var("HOME", home);
}
body();
unsafe {
match prior {
Some(p) => std::env::set_var("HOME", p),
None => std::env::remove_var("HOME"),
}
}
}
#[test]
fn expands_tilde_slash() {
with_home("/home/me", || {
assert_eq!(
expand_path(Path::new("~/git/repo/file.rs")),
PathBuf::from("/home/me/git/repo/file.rs")
);
});
}
#[test]
fn expands_bare_tilde() {
with_home("/home/me", || {
assert_eq!(expand_path(Path::new("~")), PathBuf::from("/home/me"));
});
}
#[test]
fn expands_dollar_home_slash() {
with_home("/home/me", || {
assert_eq!(
expand_path(Path::new("$HOME/notes.md")),
PathBuf::from("/home/me/notes.md")
);
});
}
#[test]
fn expands_bare_dollar_home() {
with_home("/home/me", || {
assert_eq!(expand_path(Path::new("$HOME")), PathBuf::from("/home/me"));
});
}
#[test]
fn absolute_path_passes_through() {
with_home("/home/me", || {
assert_eq!(
expand_path(Path::new("/etc/hostname")),
PathBuf::from("/etc/hostname")
);
});
}
#[test]
fn relative_path_passes_through() {
with_home("/home/me", || {
assert_eq!(
expand_path(Path::new("src/main.rs")),
PathBuf::from("src/main.rs")
);
});
}
#[test]
fn tilde_user_form_not_expanded() {
// ~other is shell sugar for /home/other and would require
// passwd parsing to resolve. Out of scope — pass it
// through and let the open fail with a clear error.
with_home("/home/me", || {
assert_eq!(
expand_path(Path::new("~other/x")),
PathBuf::from("~other/x")
);
});
}
#[test]
fn no_home_env_passes_through() {
// Share the same crate-wide lock as `with_home` — otherwise
// a parallel test setting HOME races this clear-and-assert
// window.
let _g = ENV_LOCK.lock().unwrap();
let prior = std::env::var("HOME").ok();
// SAFETY: serialised by the `ENV_LOCK` guard held above.
unsafe {
std::env::remove_var("HOME");
}
assert_eq!(
expand_path(Path::new("~/git/repo")),
PathBuf::from("~/git/repo")
);
unsafe {
if let Some(p) = prior {
std::env::set_var("HOME", p);
}
}
}
#[test]
fn dollar_other_var_not_expanded() {
with_home("/home/me", || {
assert_eq!(
expand_path(Path::new("$PWD/file")),
PathBuf::from("$PWD/file")
);
assert_eq!(
expand_path(Path::new("${HOME}/file")),
PathBuf::from("${HOME}/file")
);
});
}
}
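
Most of the test scaffolding above exists because `expand_path` reads `HOME` from the process environment. One possible alternative, sketched here as an assumption rather than something the diff does, is to split out a pure helper that takes the home directory as a parameter. The name `expand_with_home` is hypothetical; the expansion rules are the ones documented for `expand_path`.

```rust
use std::path::{Path, PathBuf};

/// Pure variant of the expansion rules: `~`, `~/`, `$HOME`, `$HOME/`.
/// `home` is `None` when no home directory is known, in which case the
/// input is returned verbatim. Non-UTF-8 paths also pass through.
fn expand_with_home(input: &Path, home: Option<&Path>) -> PathBuf {
    let (Some(s), Some(home)) = (input.to_str(), home) else {
        return input.to_path_buf();
    };
    if s == "~" || s == "$HOME" {
        return home.to_path_buf();
    }
    for prefix in ["~/", "$HOME/"] {
        if let Some(rest) = s.strip_prefix(prefix) {
            return home.join(rest);
        }
    }
    // `~user`, `${HOME}`, `$PWD`, absolute and relative paths are left alone.
    input.to_path_buf()
}
```

With a helper like this, the environment lookup could shrink to a one-line wrapper, and the expansion rules could be tested without `unsafe` env mutation or a process-wide lock.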


@@ -0,0 +1,274 @@
//! System prompt assembly.
//!
//! The system message has two parts:
//!
//! 1. A short human-readable preamble (working directory, style
//! instructions). Either the built-in [`DEFAULT_PROMPT`] or a
//! user-supplied file at `HELEXA_ACP_SYSTEM_PROMPT_PATH` /
//! `system_prompt_path`. `{cwd}` is substituted in both.
//! 2. A `# Tools` block in Qwen3 Hermes format (see [`crate::qwen3`])
//! describing the available functions. This is what makes the
//! model actually call them — neuron/cortex don't honour the
//! OpenAI `tools` API field, so the tool list has to live in the
//! prompt itself.
use agent_client_protocol::schema::SessionModeId;
use anyhow::Context;
use std::path::Path;
use crate::provider::ToolSpec;
use crate::qwen3;
use crate::session::MODE_PLAN;
const DEFAULT_PROMPT: &str = "\
You are helexa-acp, a coding assistant working inside an editor.
Working directory: {cwd}
Use the tools described below whenever the user's request involves
looking at or modifying files, or running commands. Do not ask the
user to paste file contents you could read yourself. All file paths
must be absolute. Writes and shell commands may prompt the user for
permission depending on the session mode.
Be concise; the user is reading your output in an editor pane.";
/// Build the system prompt for a session.
///
/// - `cwd`: session working directory (substituted for `{cwd}` in
/// the preamble — both the default and any user-supplied template).
/// - `override_path`: path to a user-supplied template, already
/// resolved by [`crate::config::Config`]. The `# Tools` block is
/// appended *after* the user's template so a custom preamble
/// still gets the tool descriptions the model needs.
/// - `tools`: the tools to advertise. Empty list → no `# Tools`
/// block is appended at all.
/// - `mode`: current session mode. When the mode is [`MODE_PLAN`]
/// a plan-mode addendum describing the restrictions and the
/// completion menu is appended *after* the `# Tools` block so it
/// is the last thing the model reads before user input.
/// - `plan_dir`: resolved plan directory for the cwd. Only consulted
/// when `mode == MODE_PLAN`. `None` means the plan directory could
/// not be resolved (no `HOME` / `XDG_DATA_HOME`) — the addendum
/// still renders but with a placeholder so the model knows to
/// surface the error to the user rather than guess a path.
pub fn build_system_prompt(
cwd: &Path,
override_path: Option<&Path>,
tools: &[ToolSpec],
mode: &SessionModeId,
plan_dir: Option<&Path>,
) -> anyhow::Result<String> {
let template = match override_path {
Some(path) => std::fs::read_to_string(path)
.with_context(|| format!("read system prompt from {}", path.display()))?,
None => DEFAULT_PROMPT.to_string(),
};
let mut prompt = template.replace("{cwd}", &cwd.display().to_string());
prompt.push_str(&qwen3::render_tool_block(tools));
if mode.0.as_ref() == MODE_PLAN {
prompt.push_str(&render_plan_mode_block(plan_dir));
}
Ok(prompt)
}
/// Plan-mode instruction block. Tells the model:
///
/// 1. Where it may write — only inside `plan_dir`.
/// 2. What it may *not* do — bash is disabled; writes outside
/// `plan_dir` are refused by the runtime.
/// 3. How to finish — emit the 3-option menu so the user can
/// switch modes and either kick off implementation (with or
/// without permission prompts) or keep iterating on the plan.
fn render_plan_mode_block(plan_dir: Option<&Path>) -> String {
let plan_path = plan_dir
.map(|p| p.display().to_string())
.unwrap_or_else(|| "<plan directory could not be resolved — tell the user>".to_string());
format!(
"\n\n# Plan mode\n\
\n\
You are in **plan mode**. Your task is to draft a written\n\
implementation plan for the user; you must NOT modify any\n\
project files or run shell commands.\n\
\n\
Rules in plan mode:\n\
\n\
- `read_file` and `list_dir` are unrestricted — use them to\n\
explore the codebase as needed.\n\
- `write_file` and `edit_file` are allowed ONLY under the\n\
plan directory: `{plan_path}`. The runtime will refuse any\n\
write outside it.\n\
- `bash` is disabled. Do not call it.\n\
\n\
Write the plan as one or more Markdown files under\n\
`{plan_path}`. Use descriptive filenames\n\
(`01-overview.md`, `02-data-model.md`, etc.). It is fine to\n\
iterate — overwrite the file when you refine a section.\n\
\n\
When the plan is complete, do NOT begin implementation.\n\
Instead, end your turn with this menu, verbatim, so the\n\
user can choose how to proceed:\n\
\n\
---\n\
**Plan complete.** To proceed, switch the session mode in\n\
the agent dropdown and send a follow-up message:\n\
\n\
1. **Bypass Permissions** — implement the plan now, skipping\n\
per-tool permission prompts.\n\
2. **Default** — implement the plan now, prompting before\n\
each write or shell command.\n\
3. **Plan** (stay here) — refine the plan; reply with the\n\
change you want and I will revise it.\n\
---\n"
)
}
#[cfg(test)]
mod tests {
use super::*;
use crate::session::{MODE_DEFAULT, MODE_PLAN};
use std::io::Write;
fn default_mode() -> SessionModeId {
SessionModeId::new(MODE_DEFAULT)
}
fn plan_mode() -> SessionModeId {
SessionModeId::new(MODE_PLAN)
}
#[test]
fn default_prompt_substitutes_cwd() {
let prompt =
build_system_prompt(Path::new("/home/me/proj"), None, &[], &default_mode(), None)
.unwrap();
assert!(
prompt.contains("/home/me/proj"),
"cwd not interpolated: {prompt}"
);
assert!(prompt.contains("helexa-acp"));
assert!(
!prompt.contains("{cwd}"),
"left-over placeholder in default prompt"
);
// With no tools, the # Tools block is absent.
assert!(!prompt.contains("# Tools"));
// Default mode does not get the plan-mode addendum.
assert!(!prompt.contains("# Plan mode"));
}
#[test]
fn tools_are_appended_in_hermes_format() {
let spec = ToolSpec {
name: "read_file".into(),
description: "Read a file.".into(),
parameters: serde_json::json!({"type":"object","properties":{}, "required":[]}),
};
let prompt =
build_system_prompt(Path::new("/x"), None, &[spec], &default_mode(), None).unwrap();
assert!(prompt.contains("# Tools"));
assert!(prompt.contains("<tools>"));
assert!(prompt.contains("\"name\":\"read_file\""));
assert!(prompt.contains("<tool_call>"));
}
#[test]
fn override_path_is_read_and_templated() {
let mut tmp = tempfile_in_target("prompt.txt");
tmp.write_all(b"custom prompt for {cwd} only").unwrap();
tmp.flush().unwrap();
let path = tmp.path().to_path_buf();
drop(tmp);
let prompt = build_system_prompt(
Path::new("/etc"),
Some(path.as_path()),
&[],
&default_mode(),
None,
)
.expect("read override");
assert_eq!(prompt, "custom prompt for /etc only");
let _ = std::fs::remove_file(&path);
}
#[test]
fn missing_override_path_errors() {
let err = build_system_prompt(
Path::new("/tmp"),
Some(Path::new("/definitely/not/a/real/path")),
&[],
&default_mode(),
None,
)
.unwrap_err();
assert!(format!("{err:#}").contains("read system prompt"));
}
#[test]
fn plan_mode_addendum_includes_plan_dir_and_menu() {
let plan_dir = Path::new("/home/me/.local/share/helexa-acp/plans/proj-deadbeef");
let prompt = build_system_prompt(
Path::new("/home/me/proj"),
None,
&[],
&plan_mode(),
Some(plan_dir),
)
.unwrap();
assert!(prompt.contains("# Plan mode"));
assert!(
prompt.contains(plan_dir.to_str().unwrap()),
"plan dir not interpolated: {prompt}"
);
// The 3-option menu must be present so the model emits it verbatim.
assert!(prompt.contains("Bypass Permissions"));
assert!(prompt.contains("**Default**"));
assert!(prompt.contains("3. **Plan**"));
// Bash disabled instruction must be present.
assert!(prompt.contains("`bash` is disabled"));
}
#[test]
fn plan_mode_addendum_handles_unresolved_plan_dir() {
let prompt =
build_system_prompt(Path::new("/home/me/proj"), None, &[], &plan_mode(), None).unwrap();
assert!(prompt.contains("# Plan mode"));
assert!(prompt.contains("could not be resolved"));
}
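/// Tiny temp-file helper that doesn't pull in the `tempfile` crate.
/// Writes under `CARGO_TARGET_TMPDIR` (inside `target/`, so
/// `cargo clean` removes it) when that variable is set, and falls
/// back to the system temp dir otherwise.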
fn tempfile_in_target(name: &str) -> TempHandle {
let base = std::env::var("CARGO_TARGET_TMPDIR")
.ok()
.map(std::path::PathBuf::from)
.unwrap_or_else(std::env::temp_dir);
let _ = std::fs::create_dir_all(&base);
let pid = std::process::id();
let path = base.join(format!("helexa-acp-{pid}-{name}"));
let file = std::fs::File::create(&path).expect("create temp file");
TempHandle { file, path }
}
struct TempHandle {
file: std::fs::File,
path: std::path::PathBuf,
}
impl TempHandle {
fn path(&self) -> &Path {
&self.path
}
}
impl Write for TempHandle {
fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
self.file.write(buf)
}
fn flush(&mut self) -> std::io::Result<()> {
self.file.flush()
}
}
}
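
The override test above expects `custom prompt for {cwd} only` to render as `custom prompt for /etc only`. A minimal sketch of that substitution step, assuming plain string replacement of a single `{cwd}` placeholder; `render_prompt_template` is a hypothetical name, and the real `build_system_prompt` may do more than this:

```rust
use std::path::Path;

/// Hypothetical helper (not from the diff): replace every `{cwd}`
/// placeholder in `template` with the displayed form of `cwd`.
fn render_prompt_template(template: &str, cwd: &Path) -> String {
    template.replace("{cwd}", &cwd.display().to_string())
}

fn main() {
    let rendered = render_prompt_template("custom prompt for {cwd} only", Path::new("/etc"));
    println!("{rendered}");
}
```

Plain replacement leaves no placeholder behind, which is the property `default_prompt_substitutes_cwd` asserts with `!prompt.contains("{cwd}")`.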

File diff suppressed because it is too large

@@ -0,0 +1,230 @@
//! Provider trait — the seam between the ACP-side agent loop and
//! whatever wire protocol an endpoint actually speaks.
//!
//! Every concrete provider (OpenAI chat completions, OpenAI Responses,
//! Anthropic /v1/messages, Ollama native, …) implements
//! [`Provider`]. The agent constructs a [`CompletionRequest`] using
//! provider-agnostic types and consumes a stream of
//! [`CompletionEvent`]s — neither end knows which wire format is on
//! the other side of the trait.
//!
//! Day-1 provider: [`openai_chat::OpenAIChatProvider`]. Day-N
//! providers slot in without touching `agent.rs`.
use async_trait::async_trait;
use futures::stream::BoxStream;
use serde::{Deserialize, Serialize};
use serde_json::Value;
use tokio_util::sync::CancellationToken;
pub mod anthropic_messages;
pub mod openai_chat;
pub mod openai_responses;
/// Provider-agnostic LLM endpoint. Implementations translate between
/// [`CompletionRequest`] / [`CompletionEvent`] and whatever wire
/// format their endpoint speaks.
#[async_trait]
pub trait Provider: Send + Sync {
/// Endpoint name as configured by the user (e.g. `"helexa"`,
/// `"openrouter"`). Used in logs and in the `endpoint:model`
/// selector.
fn name(&self) -> &str;
/// List models available at this endpoint. Used to build the
/// model-picker dropdown in editor clients (Stage 4). Should
/// return quickly (cache if necessary).
#[allow(dead_code)]
async fn list_models(&self) -> anyhow::Result<Vec<ModelInfo>>;
/// Run a chat completion. Returns a stream of provider-agnostic
/// events. The stream stops when the upstream finishes, when
/// `cancel` is fired, or when the stream is dropped.
async fn complete(
&self,
request: CompletionRequest,
cancel: CancellationToken,
) -> anyhow::Result<BoxStream<'static, anyhow::Result<CompletionEvent>>>;
}
/// One model exposed by a provider. Constructed by `list_models` —
/// Stage 4 is when the agent loop starts consuming it for the
/// model-picker dropdown.
#[allow(dead_code)]
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelInfo {
pub id: String,
/// Human-friendly name, if the endpoint exposes one. Otherwise
/// `id` is used as the display name.
#[serde(default)]
pub display_name: Option<String>,
}
/// Inputs to a completion. Provider-agnostic — concrete providers
/// translate this into their wire format.
#[derive(Debug, Clone)]
pub struct CompletionRequest {
/// Endpoint-local model id (without the `endpoint:` prefix).
pub model: String,
pub messages: Vec<Message>,
/// Tools the model is allowed to call. Empty list means no tool
/// support advertised.
pub tools: Vec<ToolSpec>,
pub temperature: Option<f64>,
pub top_p: Option<f64>,
pub max_tokens: Option<u64>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Message {
pub role: Role,
pub content: MessageContent,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum Role {
System,
User,
Assistant,
/// Tool result message. Provider impls turn this into whatever
/// shape the upstream wire format wants (OpenAI uses
/// `role: "tool"` + `tool_call_id`; Anthropic uses content blocks).
/// Stage 3 (tools) constructs this; Stage 2 never does.
Tool,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum MessageContent {
/// Plain text turn (system / user / assistant). Struct variant
/// rather than newtype so the persisted JSON has an explicit
/// `text` field — that lets us use internal tagging on the
/// enum, which is incompatible with newtype-of-primitive
/// variants.
Text { text: String },
/// Mixed text + image user turn. Stage 5 introduces this when
/// Zed sends an `ImageContent` block alongside the user's prompt.
/// Providers that don't support vision should down-convert by
/// dropping image parts and concatenating text parts.
MultiPart { parts: Vec<MessagePart> },
/// Assistant turn that called one or more tools. Stage 3 starts
/// constructing this when the provider stream yields a
/// `ToolCallStart` / `ToolCallArgsDelta` sequence.
ToolCalls {
/// Optional text the assistant said alongside the tool calls.
text: Option<String>,
calls: Vec<ToolCall>,
},
/// Tool result. `tool_call_id` matches the assistant's call id.
/// Stage 3 constructs this after the tool runner finishes.
ToolResult {
tool_call_id: String,
content: String,
},
}
/// One part of a [`MessageContent::MultiPart`] message.
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum MessagePart {
Text { text: String },
Image(ImageData),
}
/// Inline image attachment. `data` is base64-encoded raw image
/// bytes; the encoder constructs an `image_url` data URI from it
/// at request time. `uri` carries any pointer the client supplied
/// (e.g. `file:///tmp/x.png`) — we keep it on the message for
/// debugging / future providers but the OpenAI encoder ignores it
/// when `data` is present (data wins, since it round-trips through
/// every wire format).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ImageData {
pub mime_type: String,
/// Base64-encoded image bytes, exactly as `ImageContent.data`
/// carried them: no `data:` prefix, and padding is left intact.
pub data: String,
#[serde(default, skip_serializing_if = "Option::is_none")]
pub uri: Option<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ToolCall {
/// Provider-assigned id that ties the call to its result. The
/// Qwen3 wire format we use today doesn't carry this on the
/// model side (calls and results are matched positionally inside
/// a turn), so the field looks unused in the prod build — but it
/// flows through to `MessageContent::ToolResult.tool_call_id` for
/// history bookkeeping and a future strict-OpenAI backend will
/// consume it directly.
#[allow(dead_code)]
pub id: String,
pub name: String,
/// JSON-encoded arguments. Kept as a string because providers
/// stream argument bytes incrementally and only validate at the
/// end; the agent decodes once the call is complete.
pub arguments: String,
}
#[derive(Debug, Clone)]
pub struct ToolSpec {
pub name: String,
pub description: String,
/// JSON Schema of the arguments object.
pub parameters: Value,
}
/// Events emitted by a provider during a streaming completion.
#[derive(Debug, Clone)]
pub enum CompletionEvent {
/// Incremental visible text from the assistant.
TextDelta(String),
/// Incremental "reasoning" / thought text, if the model emits one
/// (e.g. Qwen3 with `<think>` tags surfaced as a separate stream,
/// or OpenAI reasoning models).
ReasoningDelta(String),
/// A new tool call has started. Stage 2 ignores the payload; the
/// agent loop in Stage 3 reads `index` to correlate with
/// [`Self::ToolCallArgsDelta`], `id` for the eventual tool-result
/// turn, and `name` to dispatch the runner.
#[allow(dead_code)]
ToolCallStart {
index: usize,
id: String,
name: String,
},
/// More argument bytes for a tool call already announced via
/// [`Self::ToolCallStart`]. Stage 2 ignores; Stage 3 accumulates
/// the bytes by `index` until the call's arguments are complete.
#[allow(dead_code)]
ToolCallArgsDelta { index: usize, args_delta: String },
/// A `<tool_call>` block whose JSON couldn't be parsed even with
/// the qwen3 module's repair attempts. The agent surfaces this
/// as a Failed `SessionUpdate::ToolCall` card with the raw body
/// visible (so the editor renders structured failure UI rather
/// than dumping the body inline in the message pane), and feeds
/// a synthetic tool-error message back into history so the
/// model can self-correct on the next round.
MalformedToolCall { raw: String },
/// Stream finished. Carries the upstream `finish_reason` if it
/// gave one (`"stop"`, `"length"`, `"tool_calls"`, …).
Finish { reason: Option<String> },
/// Final usage stats, if the provider supplied them. Stage 2
/// matches the variant to drop it; Stage 6b (token metrics) is
/// when the payload starts being read.
#[allow(dead_code)]
Usage(UsageStats),
}
/// Token accounting reported by the provider at the end of a stream.
/// Stage 2 doesn't surface usage anywhere — the stable `PromptResponse`
/// has no usage field, and the unstable variant is gated. Stage 6b
/// turns these on with Prometheus metrics.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, Default)]
pub struct UsageStats {
pub prompt_tokens: u64,
pub completion_tokens: u64,
pub total_tokens: u64,
}
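
The doc comments on `ToolCallStart` and `ToolCallArgsDelta` say the agent loop accumulates argument bytes by `index` until a call is complete. A minimal sketch of that accumulation, assuming simplified stand-in types: `Event`, `AssembledCall`, and `assemble_calls` are hypothetical names for illustration, not the agent's actual code.

```rust
use std::collections::BTreeMap;

/// Simplified stand-ins for the two tool-call variants of
/// `CompletionEvent` (assumption: only these two matter here).
#[derive(Debug, Clone)]
enum Event {
    ToolCallStart { index: usize, id: String, name: String },
    ToolCallArgsDelta { index: usize, args_delta: String },
}

/// A finished call: same fields as `ToolCall`, arguments still a
/// JSON-encoded string that the caller decodes afterwards.
#[derive(Debug, Clone, PartialEq)]
struct AssembledCall {
    id: String,
    name: String,
    arguments: String,
}

/// Hypothetical helper: fold events into calls, ordered by `index`.
fn assemble_calls(events: &[Event]) -> Vec<AssembledCall> {
    let mut calls: BTreeMap<usize, AssembledCall> = BTreeMap::new();
    for ev in events {
        match ev {
            Event::ToolCallStart { index, id, name } => {
                calls.insert(
                    *index,
                    AssembledCall {
                        id: id.clone(),
                        name: name.clone(),
                        arguments: String::new(),
                    },
                );
            }
            Event::ToolCallArgsDelta { index, args_delta } => {
                // A delta for an index that was never announced is skipped.
                if let Some(call) = calls.get_mut(index) {
                    call.arguments.push_str(args_delta);
                }
            }
        }
    }
    calls.into_values().collect()
}

fn main() {
    let events = vec![
        Event::ToolCallStart { index: 0, id: "call_xyz".into(), name: "read_file".into() },
        Event::ToolCallArgsDelta { index: 0, args_delta: "{\"path".into() },
        Event::ToolCallArgsDelta { index: 0, args_delta: "\":\"/etc/hostname\"}".into() },
    ];
    println!("{:?}", assemble_calls(&events));
}
```

Keeping `arguments` as a string until the stream ends matches the note on `ToolCall::arguments`: the fragments are not valid JSON on their own, so decoding only makes sense once the call is complete.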

File diff suppressed because it is too large

@@ -0,0 +1,987 @@
//! OpenAI Responses API (`POST /v1/responses`) provider.
//!
//! Mirror image of [`super::openai_chat`]: same `Provider` trait
//! impl, same back-pressured SSE decoder, but speaking OpenAI's
//! newer Responses surface instead of chat completions.
//!
//! Differences from the chat provider, all contained in this file:
//!
//! - **Request encoding**: history flattens into an `input` array
//! of typed items (`message`, `function_call`, `function_call_output`)
//! plus a top-level `instructions` field for the system prompt.
//! Multi-part user content stays in the same `[{type:"input_text"},
//! {type:"input_image"}]` shape neuron's `request_to_chat` already
//! accepts.
//! - **Streaming decoder**: events are named (`response.created`,
//! `response.output_text.delta`, `response.completed`, …) carried
//! on the SSE `event:` line. The chat path's `[DONE]` terminator
//! doesn't apply; the stream ends after `response.completed`.
//! - **Tool calls** plumb through the `response.output_item.added`
//! (item type `function_call`) → `response.function_call_arguments.delta`
//! → `response.function_call_arguments.done` event sequence. The
//! neuron candle harness doesn't synthesize these yet (tracked as
//! issue #6), but the decoder is wired so the day the upstream
//! does, downstream `CompletionEvent::ToolCall*` plumbing just
//! works.
//!
//! Tool-name handling: the model knows its tool descriptions via
//! the [`crate::qwen3`] system-prompt block exactly the way the chat
//! provider does. We don't echo them in the request body because
//! neuron currently ignores `tools` on /v1/responses (same as on
//! /v1/chat/completions). Once neuron honours request-side tool
//! definitions, both providers add them in the same place.
use async_trait::async_trait;
use eventsource_stream::Eventsource;
use futures::{Stream, StreamExt, stream::BoxStream};
use serde::{Deserialize, Serialize};
use serde_json::{Value, json};
use std::collections::HashMap;
use tokio_util::sync::CancellationToken;
use super::{
CompletionEvent, CompletionRequest, Message, MessageContent, MessagePart, ModelInfo, Provider,
Role, UsageStats,
};
use crate::config::EndpointConfig;
pub struct OpenAIResponsesProvider {
endpoint: EndpointConfig,
#[allow(dead_code)] // Read in `complete()`'s HTTP path; tests don't stand up a server.
api_key: Option<String>,
#[allow(dead_code)]
http: reqwest::Client,
}
impl OpenAIResponsesProvider {
pub fn new(endpoint: EndpointConfig) -> anyhow::Result<Self> {
let api_key = endpoint.resolve_api_key()?;
let http = reqwest::Client::builder()
// Same generous timeout as the chat provider: cortex may
// need to cold-load a model before serving the first
// chunk, which can be tens of seconds. Cancellation
// handles early termination, not timeout.
.timeout(std::time::Duration::from_secs(600))
.build()?;
Ok(Self {
endpoint,
api_key,
http,
})
}
}
#[async_trait]
impl Provider for OpenAIResponsesProvider {
fn name(&self) -> &str {
&self.endpoint.name
}
async fn list_models(&self) -> anyhow::Result<Vec<ModelInfo>> {
let mut req = self.http.get(self.endpoint.models_url());
if let Some(key) = &self.api_key {
req = req.bearer_auth(key);
}
let resp = req
.send()
.await
.map_err(|e| anyhow::anyhow!("{} list_models: {e}", self.endpoint.name))?;
let status = resp.status();
if !status.is_success() {
let body = resp.text().await.unwrap_or_default();
anyhow::bail!(
"{} list_models returned {}: {}",
self.endpoint.name,
status,
body
);
}
let body: WireModelsResponse = resp.json().await?;
Ok(body
.data
.into_iter()
.map(|m| ModelInfo {
id: m.id,
display_name: None,
})
.collect())
}
async fn complete(
&self,
request: CompletionRequest,
cancel: CancellationToken,
) -> anyhow::Result<BoxStream<'static, anyhow::Result<CompletionEvent>>> {
let body = encode_request(&request);
tracing::debug!(
endpoint = %self.endpoint.name,
url = %self.endpoint.responses_url(),
body = %serde_json::to_string(&body).unwrap_or_else(|_| "<unserializable>".into()),
"POST /responses"
);
let mut req = self.http.post(self.endpoint.responses_url()).json(&body);
if let Some(key) = &self.api_key {
req = req.bearer_auth(key);
}
let resp = req
.send()
.await
.map_err(|e| anyhow::anyhow!("{} responses send: {e}", self.endpoint.name))?;
let status = resp.status();
if !status.is_success() {
let body = resp.text().await.unwrap_or_default();
anyhow::bail!(
"{} responses returned {}: {}",
self.endpoint.name,
status,
body
);
}
let sse = resp.bytes_stream().eventsource();
let stream = decode_stream(sse, cancel);
Ok(Box::pin(stream))
}
}
// ── Request encoding ─────────────────────────────────────────────────
fn encode_request(req: &CompletionRequest) -> Value {
// Pull the system messages out of history into a single
// `instructions` string — the Responses API expects them there,
// not inline as an `input` item. Multiple system messages
// concatenate with blank lines so we don't lose ordering.
let mut instructions: Vec<String> = Vec::new();
let mut input_items: Vec<Value> = Vec::new();
for msg in &req.messages {
if msg.role == Role::System
&& let MessageContent::Text { text } = &msg.content
{
instructions.push(text.clone());
continue;
}
if let Some(item) = encode_message_as_input_item(msg) {
input_items.push(item);
}
}
let mut body = json!({
"model": req.model,
"input": input_items,
"stream": true,
});
if let Value::Object(map) = &mut body {
if !instructions.is_empty() {
map.insert(
"instructions".into(),
Value::String(instructions.join("\n\n")),
);
}
if let Some(t) = req.temperature {
map.insert("temperature".into(), json!(t));
}
if let Some(p) = req.top_p {
map.insert("top_p".into(), json!(p));
}
if let Some(m) = req.max_tokens {
// Responses calls it `max_output_tokens`; preserve the
// semantic (response cap) when we translate.
map.insert("max_output_tokens".into(), json!(m));
}
}
body
}
fn encode_message_as_input_item(msg: &Message) -> Option<Value> {
match (msg.role, &msg.content) {
(Role::System, _) => None, // handled out-of-band as `instructions`
(Role::User, MessageContent::Text { text }) => Some(json!({
"type": "message",
"role": "user",
"content": text,
})),
(Role::User, MessageContent::MultiPart { parts }) => Some(json!({
"type": "message",
"role": "user",
"content": encode_user_parts(parts),
})),
(Role::Assistant, MessageContent::Text { text }) => Some(json!({
"type": "message",
"role": "assistant",
"content": [{
"type": "output_text",
"text": text,
"annotations": [],
}],
})),
(Role::Assistant, MessageContent::ToolCalls { text, calls }) => {
// In the Responses shape, an assistant turn that called tools
// is a sequence of items: an optional `message` (any prose
// alongside the calls) followed by one `function_call` per
// call.
//
// This function returns at most one item per message, so
// emitting the full sequence would need the caller to
// flatten a multi-item result. Until that exists, this arm
// is deliberately lossy in two ways:
// - any `text` alongside the calls is dropped (qwen3 rarely
// emits prose with tool calls);
// - only the first call is encoded; further calls in the
// same turn are dropped.
// Revisit if either bites.
let _ = text;
let call = calls.first()?;
Some(json!({
"type": "function_call",
"call_id": call.id,
"name": call.name,
"arguments": call.arguments,
}))
}
(
Role::Tool,
MessageContent::ToolResult {
tool_call_id,
content,
},
) => Some(json!({
"type": "function_call_output",
"call_id": tool_call_id,
"output": content,
})),
(role, content) => {
tracing::warn!(
?role,
?content,
"openai_responses: unexpected (role, content) shape"
);
None
}
}
}
fn encode_user_parts(parts: &[MessagePart]) -> Value {
let items: Vec<Value> = parts
.iter()
.map(|p| match p {
MessagePart::Text { text } => json!({"type": "input_text", "text": text}),
MessagePart::Image(img) => json!({
"type": "input_image",
"image_url": format!("data:{};base64,{}", img.mime_type, img.data),
}),
})
.collect();
Value::Array(items)
}
// ── Wire types ──────────────────────────────────────────────────────
#[allow(dead_code)] // fields read only when list_models runs against a real endpoint
#[derive(Debug, Deserialize)]
struct WireModelsResponse {
data: Vec<WireModelObject>,
}
#[allow(dead_code)]
#[derive(Debug, Deserialize)]
struct WireModelObject {
id: String,
}
// SSE event payload shapes. We only model the fields we care about;
// `#[serde(default)]` + `Option` everywhere else lets the upstream
// add optional fields without breaking deserialisation.
#[derive(Debug, Deserialize, Serialize)]
struct OutputItemAddedEvent {
#[serde(default)]
output_index: u32,
item: OutputItem,
}
#[derive(Debug, Deserialize, Serialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum OutputItem {
Message {
#[serde(default)]
id: Option<String>,
},
FunctionCall {
#[serde(default)]
id: Option<String>,
#[serde(default)]
call_id: Option<String>,
#[serde(default)]
name: Option<String>,
/// Some upstreams populate `arguments` already on the
/// `output_item.added` event for a fully-buffered tool call
/// (i.e. when the model finalised the call before the SSE
/// flush). Capture it so we can emit a single args delta.
#[serde(default)]
arguments: Option<String>,
},
/// `reasoning`, `web_search_call`, etc. We capture-and-ignore
/// any item we don't model; the decoder still emits the
/// outer events correctly.
#[serde(other)]
Unknown,
}
#[derive(Debug, Deserialize, Serialize)]
struct OutputTextDeltaEvent {
#[serde(default)]
item_id: Option<String>,
#[serde(default)]
output_index: u32,
#[serde(default)]
delta: String,
}
#[derive(Debug, Deserialize, Serialize)]
struct FunctionCallArgumentsDeltaEvent {
#[serde(default)]
item_id: Option<String>,
#[serde(default)]
output_index: u32,
#[serde(default)]
delta: String,
}
#[derive(Debug, Deserialize, Serialize)]
struct ResponseCompletedEvent {
response: ResponseShell,
}
#[derive(Debug, Deserialize, Serialize)]
struct ResponseShell {
#[serde(default)]
status: Option<String>,
#[serde(default)]
usage: Option<WireUsage>,
}
#[derive(Debug, Deserialize, Serialize)]
struct WireUsage {
#[serde(default)]
input_tokens: u64,
#[serde(default)]
output_tokens: u64,
#[serde(default)]
total_tokens: u64,
}
// ── Streaming decoder ───────────────────────────────────────────────
/// Translate the named-event Responses SSE into the provider-agnostic
/// [`CompletionEvent`] stream the agent loop expects. The decoder
/// holds per-stream state — output_index → tool-call-index plus
/// the next available tool-call slot — so it can fire
/// `ToolCallStart` exactly once per item.
fn decode_stream<S>(
sse: S,
cancel: CancellationToken,
) -> impl Stream<Item = anyhow::Result<CompletionEvent>>
where
S: Stream<
Item = Result<
eventsource_stream::Event,
eventsource_stream::EventStreamError<reqwest::Error>,
>,
> + Send
+ 'static,
{
async_stream::stream! {
let mut sse = Box::pin(sse);
// Maps an output_index that's a function_call to the tool-call
// slot we hand downstream. Lets us correlate later
// `function_call_arguments.delta` events back to the index
// we already announced on `output_item.added`.
let mut tool_index_by_output: HashMap<u32, usize> = HashMap::new();
let mut next_tool_index: usize = 0;
loop {
tokio::select! {
biased;
_ = cancel.cancelled() => {
tracing::debug!("openai_responses: cancellation requested, ending stream");
break;
}
next = sse.next() => {
let Some(event) = next else { break };
let event = match event {
Ok(e) => e,
Err(e) => {
yield Err(anyhow::anyhow!("SSE transport: {e}"));
break;
}
};
// Event name lives on `event.event`; data is JSON.
let event_name = event.event.as_str();
let data = event.data.as_str();
match event_name {
"response.output_text.delta" => {
match serde_json::from_str::<OutputTextDeltaEvent>(data) {
Ok(d) if !d.delta.is_empty() => {
yield Ok(CompletionEvent::TextDelta(d.delta));
}
Ok(_) => {}
Err(e) => {
tracing::warn!(
error = %e,
raw = %data,
"openai_responses: failed to parse output_text.delta; skipping"
);
}
}
}
"response.output_item.added" => {
match serde_json::from_str::<OutputItemAddedEvent>(data) {
Ok(ev) => {
if let OutputItem::FunctionCall {
id,
call_id,
name,
arguments,
} = ev.item
{
let idx = next_tool_index;
next_tool_index += 1;
tool_index_by_output.insert(ev.output_index, idx);
// Prefer the user-facing
// `call_id` (what gets paired
// with tool results) over the
// internal item `id` when
// both are present. Falls
// back to a synthetic id so
// history bookkeeping never
// breaks.
let final_id = call_id
.or(id)
.unwrap_or_else(|| format!("call_{idx}"));
let final_name = name.unwrap_or_default();
yield Ok(CompletionEvent::ToolCallStart {
index: idx,
id: final_id,
name: final_name,
});
// Some upstreams attach the
// fully-buffered arguments on
// the `output_item.added`
// event itself (rare; happens
// when the model finalised
// before the SSE flush).
// Emit as a single args
// delta if present.
if let Some(args) = arguments
&& !args.is_empty()
{
yield Ok(CompletionEvent::ToolCallArgsDelta {
index: idx,
args_delta: args,
});
}
}
}
Err(e) => {
tracing::warn!(
error = %e,
raw = %data,
"openai_responses: failed to parse output_item.added; skipping"
);
}
}
}
"response.function_call_arguments.delta" => {
match serde_json::from_str::<FunctionCallArgumentsDeltaEvent>(data) {
Ok(ev) => {
let Some(&idx) = tool_index_by_output.get(&ev.output_index)
else {
// Args delta for an item we
// never saw an `output_item.added`
// for. Could happen if the
// upstream reordered events;
// log + skip.
tracing::warn!(
output_index = ev.output_index,
"openai_responses: function_call_arguments.delta for unknown output_index"
);
continue;
};
if !ev.delta.is_empty() {
yield Ok(CompletionEvent::ToolCallArgsDelta {
index: idx,
args_delta: ev.delta,
});
}
}
Err(e) => {
tracing::warn!(
error = %e,
raw = %data,
"openai_responses: failed to parse function_call_arguments.delta; skipping"
);
}
}
}
"response.completed" => {
// Final event. Pull usage + status off
// the response shell. Status maps:
// "incomplete" → "length";
// anything else (including "completed"
// or a missing status) → "stop", which
// the caller treats as EndTurn.
let (reason, usage) =
match serde_json::from_str::<ResponseCompletedEvent>(data) {
Ok(ev) => {
let reason = match ev.response.status.as_deref() {
Some("incomplete") => Some("length".to_string()),
_ => Some("stop".to_string()),
};
let usage = ev.response.usage.map(|u| UsageStats {
prompt_tokens: u.input_tokens,
completion_tokens: u.output_tokens,
total_tokens: u.total_tokens,
});
(reason, usage)
}
Err(e) => {
tracing::warn!(
error = %e,
raw = %data,
"openai_responses: failed to parse response.completed; ending stream with EndTurn"
);
(Some("stop".to_string()), None)
}
};
if let Some(u) = usage {
yield Ok(CompletionEvent::Usage(u));
}
yield Ok(CompletionEvent::Finish { reason });
break;
}
// Bookkeeping events we don't need to surface:
// response.created, response.in_progress,
// response.content_part.added/.done,
// response.output_text.done,
// response.output_item.done,
// response.function_call_arguments.done,
// response.reasoning_*. Logged at trace for
// wire-tracing.
other => {
tracing::trace!(
event = other,
"openai_responses: bookkeeping event"
);
}
}
}
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::provider::ToolCall;
use crate::provider::{ImageData, MessagePart};
use futures::stream;
use url::Url;
fn ep() -> EndpointConfig {
EndpointConfig {
name: "test".into(),
base_url: Url::parse("http://localhost:9999/v1").unwrap(),
wire_api: crate::config::WireApi::OpenAiResponses,
default_model: None,
api_key: None,
api_key_env: None,
max_tokens: None,
context_window: None,
}
}
// ── encode_request ──────────────────────────────────────────────
#[test]
fn system_messages_collapse_to_instructions() {
let req = CompletionRequest {
model: "m".into(),
messages: vec![
Message {
role: Role::System,
content: MessageContent::Text {
text: "you are helpful".into(),
},
},
Message {
role: Role::User,
content: MessageContent::Text { text: "hi".into() },
},
],
tools: vec![],
temperature: Some(0.7),
top_p: None,
max_tokens: Some(256),
};
let body = encode_request(&req);
assert_eq!(body["model"], "m");
assert_eq!(body["instructions"], "you are helpful");
assert_eq!(body["stream"], true);
assert_eq!(body["max_output_tokens"], 256);
assert_eq!(body["temperature"], 0.7);
let input = body["input"].as_array().unwrap();
// System message NOT echoed in input — it's only in
// instructions.
assert_eq!(input.len(), 1);
assert_eq!(input[0]["type"], "message");
assert_eq!(input[0]["role"], "user");
assert_eq!(input[0]["content"], "hi");
}
#[test]
fn multiple_system_messages_concatenate() {
let req = CompletionRequest {
model: "m".into(),
messages: vec![
Message {
role: Role::System,
content: MessageContent::Text {
text: "first".into(),
},
},
Message {
role: Role::System,
content: MessageContent::Text {
text: "second".into(),
},
},
Message {
role: Role::User,
content: MessageContent::Text { text: "hi".into() },
},
],
tools: vec![],
temperature: None,
top_p: None,
max_tokens: None,
};
let body = encode_request(&req);
assert_eq!(body["instructions"], "first\n\nsecond");
}
#[test]
fn user_multipart_becomes_input_parts_array() {
let req = CompletionRequest {
model: "vl".into(),
messages: vec![Message {
role: Role::User,
content: MessageContent::MultiPart {
parts: vec![
MessagePart::Text {
text: "what's in this?".into(),
},
MessagePart::Image(ImageData {
mime_type: "image/png".into(),
data: "AAA=".into(),
uri: None,
}),
],
},
}],
tools: vec![],
temperature: None,
top_p: None,
max_tokens: None,
};
let body = encode_request(&req);
let content = body["input"][0]["content"].as_array().unwrap();
assert_eq!(content.len(), 2);
assert_eq!(content[0]["type"], "input_text");
assert_eq!(content[0]["text"], "what's in this?");
assert_eq!(content[1]["type"], "input_image");
assert_eq!(content[1]["image_url"], "data:image/png;base64,AAA=");
}
#[test]
fn assistant_text_becomes_output_text_content_part() {
let req = CompletionRequest {
model: "m".into(),
messages: vec![
Message {
role: Role::User,
content: MessageContent::Text { text: "hi".into() },
},
Message {
role: Role::Assistant,
content: MessageContent::Text {
text: "hello there".into(),
},
},
Message {
role: Role::User,
content: MessageContent::Text {
text: "more".into(),
},
},
],
tools: vec![],
temperature: None,
top_p: None,
max_tokens: None,
};
let body = encode_request(&req);
let input = body["input"].as_array().unwrap();
assert_eq!(input.len(), 3);
assert_eq!(input[1]["type"], "message");
assert_eq!(input[1]["role"], "assistant");
assert_eq!(input[1]["content"][0]["type"], "output_text");
assert_eq!(input[1]["content"][0]["text"], "hello there");
}
#[test]
fn tool_calls_and_results_round_trip_via_function_call_items() {
let req = CompletionRequest {
model: "m".into(),
messages: vec![
Message {
role: Role::Assistant,
content: MessageContent::ToolCalls {
text: None,
calls: vec![ToolCall {
id: "call_42".into(),
name: "read_file".into(),
arguments: r#"{"path":"/etc/hostname"}"#.into(),
}],
},
},
Message {
role: Role::Tool,
content: MessageContent::ToolResult {
tool_call_id: "call_42".into(),
content: "host".into(),
},
},
],
tools: vec![],
temperature: None,
top_p: None,
max_tokens: None,
};
let body = encode_request(&req);
let input = body["input"].as_array().unwrap();
assert_eq!(input.len(), 2);
assert_eq!(input[0]["type"], "function_call");
assert_eq!(input[0]["call_id"], "call_42");
assert_eq!(input[0]["name"], "read_file");
assert_eq!(input[0]["arguments"], r#"{"path":"/etc/hostname"}"#);
assert_eq!(input[1]["type"], "function_call_output");
assert_eq!(input[1]["call_id"], "call_42");
assert_eq!(input[1]["output"], "host");
}
// ── decode_stream ───────────────────────────────────────────────
fn sse_event(name: &str, data: &str) -> eventsource_stream::Event {
eventsource_stream::Event {
id: String::new(),
retry: None,
event: name.into(),
data: data.into(),
}
}
async fn collect_events(
items: Vec<eventsource_stream::Event>,
) -> Vec<anyhow::Result<CompletionEvent>> {
let sse = stream::iter(
items
.into_iter()
.map(Ok::<_, eventsource_stream::EventStreamError<reqwest::Error>>),
);
let decoded = decode_stream(sse, CancellationToken::new());
decoded.collect().await
}
#[tokio::test]
async fn decodes_text_then_finish() {
let events = collect_events(vec![
sse_event("response.created", "{}"),
sse_event(
"response.output_text.delta",
r#"{"item_id":"msg_1","output_index":0,"delta":"hel"}"#,
),
sse_event(
"response.output_text.delta",
r#"{"item_id":"msg_1","output_index":0,"delta":"lo"}"#,
),
sse_event(
"response.completed",
r#"{"response":{"status":"completed","usage":{"input_tokens":3,"output_tokens":2,"total_tokens":5}}}"#,
),
])
.await;
let events: Vec<CompletionEvent> = events.into_iter().map(|r| r.unwrap()).collect();
let mut iter = events.into_iter();
assert!(matches!(iter.next(), Some(CompletionEvent::TextDelta(t)) if t == "hel"));
assert!(matches!(iter.next(), Some(CompletionEvent::TextDelta(t)) if t == "lo"));
assert!(matches!(iter.next(), Some(CompletionEvent::Usage(u)) if u.total_tokens == 5));
assert!(matches!(
iter.next(),
Some(CompletionEvent::Finish { reason: Some(r) }) if r == "stop"
));
assert!(iter.next().is_none());
}
#[tokio::test]
async fn empty_delta_is_dropped() {
let events = collect_events(vec![
sse_event(
"response.output_text.delta",
r#"{"item_id":"m","output_index":0,"delta":""}"#,
),
sse_event(
"response.completed",
r#"{"response":{"status":"completed"}}"#,
),
])
.await;
let mut completion_events = events.into_iter().map(|r| r.unwrap());
// First event MUST be the Finish: the empty delta was dropped.
assert!(matches!(
completion_events.next(),
Some(CompletionEvent::Finish { .. })
));
}
#[tokio::test]
async fn incomplete_status_maps_to_length_finish_reason() {
let events = collect_events(vec![sse_event(
"response.completed",
r#"{"response":{"status":"incomplete"}}"#,
)])
.await;
let events: Vec<CompletionEvent> = events.into_iter().map(|r| r.unwrap()).collect();
assert!(matches!(
events.last(),
Some(CompletionEvent::Finish { reason: Some(r) }) if r == "length"
));
}
#[tokio::test]
async fn function_call_items_emit_toolcall_events() {
let events = collect_events(vec![
sse_event(
"response.output_item.added",
r#"{"output_index":0,"item":{"type":"function_call","id":"item_1","call_id":"call_xyz","name":"read_file"}}"#,
),
sse_event(
"response.function_call_arguments.delta",
r#"{"item_id":"item_1","output_index":0,"delta":"{\"path"}"#,
),
sse_event(
"response.function_call_arguments.delta",
r#"{"item_id":"item_1","output_index":0,"delta":"\":\"/etc/hostname\"}"}"#,
),
sse_event("response.completed", r#"{"response":{"status":"completed"}}"#),
])
.await;
let events: Vec<CompletionEvent> = events.into_iter().map(|r| r.unwrap()).collect();
let mut iter = events.into_iter();
assert!(matches!(
iter.next(),
Some(CompletionEvent::ToolCallStart { index: 0, ref id, ref name })
if id == "call_xyz" && name == "read_file"
));
assert!(matches!(
iter.next(),
Some(CompletionEvent::ToolCallArgsDelta { index: 0, ref args_delta })
if args_delta == r#"{"path"#
));
assert!(matches!(
iter.next(),
Some(CompletionEvent::ToolCallArgsDelta { index: 0, ref args_delta })
if args_delta == r#"":"/etc/hostname"}"#
));
assert!(matches!(iter.next(), Some(CompletionEvent::Finish { .. })));
}
#[tokio::test]
async fn function_call_added_with_inline_arguments_emits_single_args_delta() {
// Some upstreams (rare) include the fully-buffered arguments
// on the `output_item.added` event when the model finalised
// the call before SSE flush. Verify both ToolCallStart and a
// single args delta fire.
let events = collect_events(vec![
sse_event(
"response.output_item.added",
r#"{"output_index":0,"item":{"type":"function_call","call_id":"call_a","name":"f","arguments":"{\"x\":1}"}}"#,
),
sse_event("response.completed", r#"{"response":{"status":"completed"}}"#),
])
.await;
let events: Vec<CompletionEvent> = events.into_iter().map(|r| r.unwrap()).collect();
let mut iter = events.into_iter();
assert!(matches!(
iter.next(),
Some(CompletionEvent::ToolCallStart { .. })
));
assert!(matches!(
iter.next(),
Some(CompletionEvent::ToolCallArgsDelta { index: 0, ref args_delta })
if args_delta == r#"{"x":1}"#
));
assert!(matches!(iter.next(), Some(CompletionEvent::Finish { .. })));
}
#[tokio::test]
async fn cancellation_ends_stream_promptly() {
// Hand the decoder an empty stream + a triggered cancellation
// token; it should terminate without yielding anything.
let sse = stream::iter(Vec::<
Result<eventsource_stream::Event, eventsource_stream::EventStreamError<reqwest::Error>>,
>::new());
let cancel = CancellationToken::new();
cancel.cancel();
let decoded = decode_stream(sse, cancel);
let events: Vec<_> = decoded.collect().await;
assert!(events.is_empty());
}
#[tokio::test]
async fn malformed_event_payload_is_skipped() {
let events = collect_events(vec![
sse_event("response.output_text.delta", "{not valid json"),
sse_event(
"response.output_text.delta",
r#"{"item_id":"m","output_index":0,"delta":"ok"}"#,
),
sse_event(
"response.completed",
r#"{"response":{"status":"completed"}}"#,
),
])
.await;
let events: Vec<CompletionEvent> = events.into_iter().map(|r| r.unwrap()).collect();
// First text delta dropped; second one fires.
assert!(
events
.iter()
.any(|e| matches!(e, CompletionEvent::TextDelta(t) if t == "ok"))
);
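// Errors would already have panicked in the `unwrap` above (parse
// failures are warn-and-skip). Also check that the stream finished
// with a concrete reason rather than `None`.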
assert!(
events
.iter()
.all(|e| !matches!(e, CompletionEvent::Finish { reason: None }))
);
}
#[test]
fn provider_construction_is_cheap() {
let _ = OpenAIResponsesProvider::new(ep()).unwrap();
}
}

File diff suppressed because it is too large

@@ -0,0 +1,188 @@
//! Per-session state for the ACP agent loop.
//!
//! Concurrency:
//!
//! - [`SessionStore`] is an `Arc<RwLock<HashMap<SessionId, …>>>`. The map
//! itself is read-mostly: it changes only on `session/new` and never
//! shrinks during Stage 2, so an `RwLock` keeps concurrent reads
//! contention-free.
//! - Each session is wrapped in its own `Arc<Mutex<SessionState>>`. Holding
//! one session's lock doesn't block requests against any other session,
//! which matters once a client opens multiple sessions in parallel.
//!
//! All operations hold a lock only long enough to copy out (or mutate) the
//! state they need — never across an `await` that drives the upstream
//! provider stream.
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::Arc;
use agent_client_protocol::schema::{SessionId, SessionModeId};
use tokio::sync::{Mutex, RwLock};
use tokio_util::sync::CancellationToken;
use crate::provider::Message;
/// Mode id advertised as the gated default. Writes / bash prompt for
/// permission via `session/request_permission`.
pub const MODE_DEFAULT: &str = "default";
/// Mode id advertised as "auto-allow everything". Matches the
/// name (`bypassPermissions`) that Zed clients tend to reference.
pub const MODE_BYPASS: &str = "bypassPermissions";
/// Mode id for read-and-plan-only operation. The model may read files
/// and list directories freely, may write *only* into the per-project
/// plan directory under `$XDG_DATA_HOME/helexa-acp/plans/<project-id>/`,
/// and cannot run shell commands. Designed for "draft the
/// implementation plan, then I'll review and let you execute" flows.
pub const MODE_PLAN: &str = "plan";
/// State carried for a single ACP session.
///
/// Mutated under `Mutex<SessionState>`; never share a clone across
/// tasks expecting to see the same `cancel` token — clone the token
/// explicitly when handing it to the streaming task.
#[derive(Debug)]
pub struct SessionState {
/// Conversation history in chronological order (user / assistant
/// turns). The system prompt is *not* stored here — it's built
/// fresh per request so any cwd / config changes take effect.
pub history: Vec<Message>,
/// Working directory the client opened the session against. Used
/// by [`crate::prompt::build_system_prompt`] and (Stage 3) by
/// filesystem tools.
pub cwd: PathBuf,
/// Currently-selected model id. Format is either a bare model id
/// (resolved against the default endpoint) or `endpoint:model`.
/// Mutated by `session/set_model` in Stage 4; Stage 2 sets it
/// once at session creation and never changes it.
pub model_id: String,
/// Cancellation handle for the in-flight prompt, if any. A fresh
/// token is installed at the start of every `session/prompt`
/// request; `session/cancel` fires this one. Between prompts the
/// token is "spent" — firing it does nothing — which is fine,
/// `session/cancel` is a no-op when there's nothing to cancel.
pub cancel: CancellationToken,
/// Permission gating mode. Stage 3 advertises two ids in
/// `NewSessionResponse.modes`: [`MODE_DEFAULT`] (writes / bash
/// prompt the user) and [`MODE_BYPASS`] (auto-allow); [`MODE_PLAN`]
/// (read-and-plan-only) is the third id defined above. Mutated by
/// `session/set_mode`.
pub mode_id: SessionModeId,
}
impl SessionState {
pub fn new(cwd: PathBuf, model_id: String) -> Self {
Self {
history: Vec::new(),
cwd,
model_id,
cancel: CancellationToken::new(),
mode_id: SessionModeId::new(MODE_DEFAULT),
}
}
}
/// Concurrent map of live sessions.
///
/// Cloning is cheap (`Arc` bump). Pass clones into every handler that
/// needs session access; never hold a clone across an `.await` that
/// could outlive the request.
pub type SessionStore = Arc<RwLock<HashMap<SessionId, Arc<Mutex<SessionState>>>>>;
/// Fresh, empty session store.
pub fn new_store() -> SessionStore {
Arc::new(RwLock::new(HashMap::new()))
}
/// Look up a session by id. Returns `None` if no such session is registered.
pub async fn get(store: &SessionStore, id: &SessionId) -> Option<Arc<Mutex<SessionState>>> {
store.read().await.get(id).cloned()
}
/// Register a fresh session. Overwrites any prior entry with the same id
/// (which should never happen — ids are uniquely generated by the agent).
pub async fn insert(store: &SessionStore, id: SessionId, state: SessionState) {
store.write().await.insert(id, Arc::new(Mutex::new(state)));
}
#[cfg(test)]
mod tests {
use super::*;
use crate::provider::{MessageContent, Role};
fn id(s: &str) -> SessionId {
SessionId::new(s)
}
#[tokio::test]
async fn insert_then_get_round_trip() {
let store = new_store();
let state = SessionState::new(PathBuf::from("/tmp"), "m".into());
insert(&store, id("s1"), state).await;
let got = get(&store, &id("s1")).await.expect("session present");
let locked = got.lock().await;
assert_eq!(locked.cwd, PathBuf::from("/tmp"));
assert_eq!(locked.model_id, "m");
assert!(locked.history.is_empty());
}
#[tokio::test]
async fn missing_session_is_none() {
let store = new_store();
assert!(get(&store, &id("nope")).await.is_none());
}
#[tokio::test]
async fn history_is_per_session() {
let store = new_store();
insert(
&store,
id("a"),
SessionState::new(PathBuf::from("/a"), "m".into()),
)
.await;
insert(
&store,
id("b"),
SessionState::new(PathBuf::from("/b"), "m".into()),
)
.await;
// Appending to a's history must not affect b's.
get(&store, &id("a"))
.await
.unwrap()
.lock()
.await
.history
.push(Message {
role: Role::User,
content: MessageContent::Text {
text: "hello".into(),
},
});
assert_eq!(
get(&store, &id("a"))
.await
.unwrap()
.lock()
.await
.history
.len(),
1
);
assert_eq!(
get(&store, &id("b"))
.await
.unwrap()
.lock()
.await
.history
.len(),
0
);
}
}
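
The two-level locking described in the module docs above can be tried in isolation. This is a minimal sketch, not the crate's code: it assumes `std::sync` locks in place of the `tokio::sync` ones so it runs without an async runtime, and `State`, `push_turn` and `history_len` are hypothetical names introduced for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

/// Hypothetical stand-in for `SessionState`: only the history matters here.
#[derive(Debug, Default)]
struct State {
    history: Vec<String>,
}

/// Outer `RwLock` guards the map; each session sits behind its own `Mutex`.
/// (Assumption: std locks instead of tokio's, to keep the sketch sync.)
type Store = Arc<RwLock<HashMap<String, Arc<Mutex<State>>>>>;

fn new_store() -> Store {
    Arc::new(RwLock::new(HashMap::new()))
}

fn insert(store: &Store, id: &str, state: State) {
    store
        .write()
        .unwrap()
        .insert(id.to_string(), Arc::new(Mutex::new(state)));
}

/// Clone the per-session handle out of the map. The read guard is released
/// when this function returns, before the caller locks the session itself.
fn get(store: &Store, id: &str) -> Option<Arc<Mutex<State>>> {
    store.read().unwrap().get(id).cloned()
}

/// Lock one session only long enough to append a turn.
fn push_turn(store: &Store, id: &str, text: &str) -> bool {
    match get(store, id) {
        Some(session) => {
            session.lock().unwrap().history.push(text.to_string());
            true
        }
        None => false,
    }
}

/// Copy the length out under the lock, then release it.
fn history_len(store: &Store, id: &str) -> Option<usize> {
    let session = get(store, id)?;
    let len = session.lock().unwrap().history.len();
    Some(len)
}

fn main() {
    let store = new_store();
    insert(&store, "a", State::default());
    insert(&store, "b", State::default());
    push_turn(&store, "a", "hello");
    println!(
        "a={:?} b={:?}",
        history_len(&store, "a"),
        history_len(&store, "b")
    );
}
```

Because `get` hands back an `Arc` clone, the map lock is never held while a session lock is taken, so work on session `a` cannot stall a lookup of session `b`.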

@@ -0,0 +1,462 @@
//! On-disk session persistence for `session/load` support.
//!
//! Storage layout:
//!
//! ```text
//! $XDG_DATA_HOME/helexa-acp/sessions/{session_id}.json
//! ```
//!
//! (Falls back to `~/.local/share/helexa-acp/sessions/` when
//! `$XDG_DATA_HOME` is unset.) One JSON file per session. Writes
//! happen at the end of every `session/prompt` round through
//! [`save`], using tempfile-plus-rename so a crash mid-write can't
//! corrupt the store. Reads happen on `session/load` via [`load`].
//!
//! No compaction, no rotation: files accumulate until the user
//! cleans them up. That's deliberate — disk is cheap, and the
//! resume-on-restart workflow matters more than tidiness. The
//! [`SESSIONS_DIRNAME`] subdirectory is created lazily on first
//! save so an unprivileged install path never errors at startup.
use std::path::PathBuf;
use std::time::SystemTime;
use agent_client_protocol::schema::SessionId;
use serde::{Deserialize, Serialize};
use crate::provider::Message;
const APP_DIRNAME: &str = "helexa-acp";
const SESSIONS_DIRNAME: &str = "sessions";
const PLANS_DIRNAME: &str = "plans";
/// The shape persisted to disk for one session. Only what we can't
/// rebuild from the running config goes in here: the conversation
/// history, the mode toggle, the model id, and the cwd-at-creation.
///
/// `created_at` / `updated_at` are seconds-since-epoch — cheap to
/// compare, no third-party time crate, and stable across runs.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PersistedSession {
pub session_id: String,
pub cwd: PathBuf,
pub model_id: String,
pub mode_id: String,
pub history: Vec<Message>,
pub created_at: u64,
pub updated_at: u64,
}
/// Resolve the directory that holds session JSON files. Honors
/// `$XDG_DATA_HOME`; falls back to `~/.local/share/helexa-acp/sessions/`.
/// Returns `None` if neither is resolvable (no `HOME` set — possible
/// in stripped-down container environments).
pub fn sessions_dir() -> Option<PathBuf> {
let base = std::env::var("XDG_DATA_HOME")
.ok()
.filter(|s| !s.is_empty())
.map(PathBuf::from)
.or_else(|| {
std::env::var("HOME")
.ok()
.map(|h| PathBuf::from(h).join(".local").join("share"))
})?;
Some(base.join(APP_DIRNAME).join(SESSIONS_DIRNAME))
}
/// Atomic save into the default sessions directory.
pub fn save(session: &PersistedSession) -> anyhow::Result<()> {
let dir = sessions_dir()
.ok_or_else(|| anyhow::anyhow!("can't resolve XDG_DATA_HOME or HOME for session store"))?;
save_to_dir(&dir, session)
}
/// Load from the default sessions directory.
pub fn load(session_id: &SessionId) -> anyhow::Result<PersistedSession> {
let dir = sessions_dir()
.ok_or_else(|| anyhow::anyhow!("can't resolve XDG_DATA_HOME or HOME for session store"))?;
load_from_dir(&dir, session_id)
}
/// Atomic save into an explicit directory. Writes to
/// `{id}.json.tmp` then renames over `{id}.json`. Creates the
/// target directory if it doesn't exist. Split from [`save`] so
/// unit tests can target a per-test scratch dir without mutating
/// process-global env vars.
pub fn save_to_dir(dir: &std::path::Path, session: &PersistedSession) -> anyhow::Result<()> {
std::fs::create_dir_all(dir).map_err(|e| anyhow::anyhow!("create {}: {e}", dir.display()))?;
let safe = sanitize_id(&session.session_id);
let final_path = dir.join(format!("{safe}.json"));
let tmp_path = dir.join(format!("{safe}.json.tmp"));
let json = serde_json::to_string_pretty(session)?;
std::fs::write(&tmp_path, json)
.map_err(|e| anyhow::anyhow!("write {}: {e}", tmp_path.display()))?;
std::fs::rename(&tmp_path, &final_path)
.map_err(|e| anyhow::anyhow!("rename → {}: {e}", final_path.display()))?;
Ok(())
}
/// Load from an explicit directory. Returns a friendly error
/// message when the session id has no file on disk so the caller
/// can map it to a clean ACP error response.
pub fn load_from_dir(
dir: &std::path::Path,
session_id: &SessionId,
) -> anyhow::Result<PersistedSession> {
let safe = sanitize_id(session_id.0.as_ref());
let path = dir.join(format!("{safe}.json"));
let bytes = std::fs::read(&path).map_err(|e| {
if e.kind() == std::io::ErrorKind::NotFound {
anyhow::anyhow!("no persisted session at {}", path.display())
} else {
anyhow::anyhow!("read {}: {e}", path.display())
}
})?;
let session: PersistedSession = serde_json::from_slice(&bytes)
.map_err(|e| anyhow::anyhow!("parse {}: {e}", path.display()))?;
Ok(session)
}
/// List all persisted sessions, optionally filtered by `cwd`. Used
/// by the `session/list` handler so a client (Zed) can find the
/// session that belongs to the workspace it's reopening.
///
/// `filter_cwd = None` returns every session on disk. `Some(path)`
/// returns only sessions whose persisted `cwd` is exactly equal.
///
/// Files that fail to parse are skipped with a warning rather than
/// aborting the whole list — one corrupt session shouldn't make
/// the resume picker unusable.
pub fn list(filter_cwd: Option<&std::path::Path>) -> anyhow::Result<Vec<PersistedSession>> {
let dir = sessions_dir()
.ok_or_else(|| anyhow::anyhow!("can't resolve XDG_DATA_HOME or HOME for session store"))?;
list_in_dir(&dir, filter_cwd)
}
/// Explicit-dir variant for tests, mirroring [`save_to_dir`] /
/// [`load_from_dir`].
pub fn list_in_dir(
dir: &std::path::Path,
filter_cwd: Option<&std::path::Path>,
) -> anyhow::Result<Vec<PersistedSession>> {
let read = match std::fs::read_dir(dir) {
Ok(r) => r,
Err(e) if e.kind() == std::io::ErrorKind::NotFound => return Ok(Vec::new()),
Err(e) => return Err(anyhow::anyhow!("read_dir {}: {e}", dir.display())),
};
let mut out = Vec::new();
for entry in read.flatten() {
let path = entry.path();
if path.extension().and_then(|s| s.to_str()) != Some("json") {
continue;
}
match std::fs::read(&path).and_then(|bytes| {
serde_json::from_slice::<PersistedSession>(&bytes).map_err(std::io::Error::other)
}) {
Ok(session) => {
if let Some(want) = filter_cwd
&& session.cwd != want
{
continue;
}
out.push(session);
}
Err(e) => {
tracing::warn!(
path = %path.display(),
error = %e,
"store: skipping unparseable session file"
);
}
}
}
// Most-recent first by updated_at.
out.sort_by_key(|s| std::cmp::Reverse(s.updated_at));
Ok(out)
}
/// Seconds-since-epoch, saturating to 0 if the system clock is
/// behind epoch (which shouldn't happen but the type system
/// requires a fallible read).
pub fn now_secs() -> u64 {
SystemTime::now()
.duration_since(SystemTime::UNIX_EPOCH)
.map(|d| d.as_secs())
.unwrap_or(0)
}
/// Root directory for plan-mode artefacts. Mirrors [`sessions_dir`]
/// but under `…/helexa-acp/plans/` so plans and conversation
/// transcripts are siblings, not nested.
pub fn plans_root() -> Option<PathBuf> {
sessions_dir().and_then(|s| s.parent().map(|p| p.join(PLANS_DIRNAME)))
}
/// Per-project plan directory:
/// `$XDG_DATA_HOME/helexa-acp/plans/<project-id>/`. The id derives
/// from the session's cwd, so every session opened against the same
/// path shares one plan directory. Two paths to the same checkout
/// (a `/home/foo/git/bar` ↔ symlinked `/srv/checkout/bar`) would
/// technically diverge; that is accepted as a won't-fix corner case.
pub fn plan_dir_for(cwd: &std::path::Path) -> Option<PathBuf> {
plans_root().map(|root| root.join(project_id_for(cwd)))
}
/// Deterministic, human-readable project identifier. Format:
/// `<basename>-<8-hex>` where the 8-hex suffix is FNV-1a of the
/// full path. Basename keeps the path skim-readable when poking
/// around `$XDG_DATA_HOME` by hand; the hash suffix disambiguates
/// repos that share a final path component (e.g. multiple
/// `/.../checkout/beat` checkouts).
///
/// FNV-1a rather than `std::collections::hash_map::DefaultHasher`
/// because the latter's algorithm is unspecified and may change
/// between Rust releases (and `RandomState` seeds it per process),
/// so it can't guarantee the same project_id on every run.
pub fn project_id_for(cwd: &std::path::Path) -> String {
let basename = cwd
.file_name()
.and_then(|s| s.to_str())
.unwrap_or("unknown");
let sanitised: String = basename
.chars()
.map(|c| {
if c.is_ascii_alphanumeric() || c == '-' || c == '_' {
c
} else {
'_'
}
})
.collect();
let hash = fnv1a_32(cwd.to_string_lossy().as_bytes());
format!("{sanitised}-{hash:08x}")
}
/// FNV-1a (32-bit). Deterministic, no third-party crate. Used for
/// project ids only — not cryptographic.
fn fnv1a_32(bytes: &[u8]) -> u32 {
let mut h: u32 = 0x811c_9dc5;
for b in bytes {
h ^= u32::from(*b);
h = h.wrapping_mul(0x0100_0193);
}
h
}
/// Format seconds-since-epoch as an ISO 8601 / RFC 3339 string
/// (`YYYY-MM-DDTHH:MM:SSZ`) for `SessionInfo.updated_at`. Returns
/// `None` for values outside the representable range, in which
/// case the caller should omit the field.
pub fn unix_to_iso8601(secs: u64) -> Option<String> {
use chrono::TimeZone;
let dt = chrono::Utc.timestamp_opt(secs as i64, 0).single()?;
Some(dt.to_rfc3339_opts(chrono::SecondsFormat::Secs, true))
}
/// Strip anything that isn't a safe filename character so a
/// mischievous (or just unconventional) session id can't escape
/// the sessions directory.
fn sanitize_id(id: &str) -> String {
id.chars()
.map(|c| {
if c.is_ascii_alphanumeric() || c == '-' || c == '_' {
c
} else {
'_'
}
})
.collect()
}
#[cfg(test)]
mod tests {
use super::*;
use crate::provider::{MessageContent, Role};
/// Unique scratch dir per test invocation. We use this dir
/// directly with the `*_to_dir` / `*_from_dir` functions so
/// the tests never mutate `$XDG_DATA_HOME` — that env var
/// would race across the parallel test harness.
fn unique_dir() -> PathBuf {
let base = std::env::var("CARGO_TARGET_TMPDIR")
.ok()
.map(PathBuf::from)
.unwrap_or_else(std::env::temp_dir);
let pid = std::process::id();
let nanos = SystemTime::now()
.duration_since(SystemTime::UNIX_EPOCH)
.map(|d| d.subsec_nanos())
.unwrap_or(0);
let dir = base.join(format!("helexa-acp-store-test-{pid}-{nanos}"));
std::fs::create_dir_all(&dir).expect("create test dir");
dir
}
fn sample(id: &str) -> PersistedSession {
PersistedSession {
session_id: id.into(),
cwd: PathBuf::from("/home/me/proj"),
model_id: "Qwen/Qwen3.6-27B".into(),
mode_id: "default".into(),
history: vec![
Message {
role: Role::User,
content: MessageContent::Text {
text: "hello".into(),
},
},
Message {
role: Role::Assistant,
content: MessageContent::Text { text: "hi".into() },
},
],
created_at: 1_700_000_000,
updated_at: 1_700_000_001,
}
}
#[test]
fn round_trip_save_then_load() {
let dir = unique_dir();
save_to_dir(&dir, &sample("hxa-1")).expect("save");
let loaded = load_from_dir(&dir, &SessionId::new("hxa-1")).expect("load");
assert_eq!(loaded.session_id, "hxa-1");
assert_eq!(loaded.cwd, PathBuf::from("/home/me/proj"));
assert_eq!(loaded.history.len(), 2);
let _ = std::fs::remove_dir_all(&dir);
}
#[test]
fn load_missing_session_errors_with_not_found_message() {
let dir = unique_dir();
let err = load_from_dir(&dir, &SessionId::new("nope")).unwrap_err();
let msg = format!("{err}");
assert!(
msg.contains("no persisted session"),
"want NotFound, got: {msg}"
);
let _ = std::fs::remove_dir_all(&dir);
}
#[test]
fn save_overwrites_existing_atomically() {
let dir = unique_dir();
save_to_dir(&dir, &sample("hxa-1")).expect("save");
let mut updated = sample("hxa-1");
updated.history.push(Message {
role: Role::User,
content: MessageContent::Text {
text: "third turn".into(),
},
});
updated.updated_at = 1_700_000_500;
save_to_dir(&dir, &updated).expect("re-save");
let loaded = load_from_dir(&dir, &SessionId::new("hxa-1")).expect("load");
assert_eq!(loaded.history.len(), 3);
assert_eq!(loaded.updated_at, 1_700_000_500);
let _ = std::fs::remove_dir_all(&dir);
}
#[test]
fn save_then_load_preserves_tool_calls_and_results() {
use crate::provider::ToolCall;
let dir = unique_dir();
let mut session = sample("hxa-2");
session.history.push(Message {
role: Role::Assistant,
content: MessageContent::ToolCalls {
text: Some("calling".into()),
calls: vec![ToolCall {
id: "call_0".into(),
name: "read_file".into(),
arguments: r#"{"path":"/etc/hostname"}"#.into(),
}],
},
});
session.history.push(Message {
role: Role::Tool,
content: MessageContent::ToolResult {
tool_call_id: "call_0".into(),
content: "host".into(),
},
});
save_to_dir(&dir, &session).expect("save");
let loaded = load_from_dir(&dir, &SessionId::new("hxa-2")).expect("load");
assert_eq!(loaded.history.len(), 4);
match &loaded.history[2].content {
MessageContent::ToolCalls { calls, .. } => {
assert_eq!(calls[0].name, "read_file");
}
other => panic!("expected ToolCalls, got {other:?}"),
}
let _ = std::fs::remove_dir_all(&dir);
}
#[test]
fn list_filters_by_cwd_and_sorts_recent_first() {
let dir = unique_dir();
let mut a = sample("a");
a.cwd = PathBuf::from("/home/me/proj-x");
a.updated_at = 1_700_000_010;
let mut b = sample("b");
b.cwd = PathBuf::from("/home/me/proj-x");
b.updated_at = 1_700_000_020;
let mut c = sample("c");
c.cwd = PathBuf::from("/home/me/elsewhere");
c.updated_at = 1_700_000_030;
save_to_dir(&dir, &a).unwrap();
save_to_dir(&dir, &b).unwrap();
save_to_dir(&dir, &c).unwrap();
let proj_x = PathBuf::from("/home/me/proj-x");
let list = list_in_dir(&dir, Some(&proj_x)).unwrap();
let ids: Vec<&str> = list.iter().map(|s| s.session_id.as_str()).collect();
// Filtered to proj-x; b before a because b is more recent.
assert_eq!(ids, vec!["b", "a"]);
let all = list_in_dir(&dir, None).unwrap();
assert_eq!(all.len(), 3);
// Global list still sorted recent-first across all cwds.
assert_eq!(all[0].session_id, "c");
let _ = std::fs::remove_dir_all(&dir);
}
#[test]
fn list_returns_empty_for_missing_dir() {
let dir = unique_dir().join("does-not-exist");
let list = list_in_dir(&dir, None).unwrap();
assert!(list.is_empty());
}
#[test]
fn list_skips_unparseable_files() {
let dir = unique_dir();
save_to_dir(&dir, &sample("good")).unwrap();
std::fs::write(dir.join("garbage.json"), b"{not valid json").unwrap();
let list = list_in_dir(&dir, None).unwrap();
// Garbage skipped; good survives.
assert_eq!(list.len(), 1);
assert_eq!(list[0].session_id, "good");
let _ = std::fs::remove_dir_all(&dir);
}
#[test]
fn iso8601_formats_unix_seconds() {
// 2024-01-01T00:00:00Z is 1704067200 unix seconds.
assert_eq!(
unix_to_iso8601(1_704_067_200),
Some("2024-01-01T00:00:00Z".into())
);
assert_eq!(unix_to_iso8601(0), Some("1970-01-01T00:00:00Z".into()));
}
#[test]
fn sanitize_id_rejects_path_traversal() {
// `../../etc/passwd`: six disallowed chars before "etc"
// (`.`, `.`, `/`, `.`, `.`, `/`) and one `/` between "etc" and
// "passwd". Each disallowed char maps one-to-one to `_`.
assert_eq!(sanitize_id("../../etc/passwd"), "______etc_passwd");
assert_eq!(sanitize_id("ok-name_42"), "ok-name_42");
}
}
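
Two techniques from the file above, the deterministic `<basename>-<8-hex>` project id and the tempfile-plus-rename write, can be exercised together with the standard library alone. This is a sketch under stated assumptions rather than the crate's implementation: `write_atomic`, the `plan.md` file name and the scratch directory are hypothetical, and no `fsync` is issued, so it guards against torn files on a process crash but makes no durability claim across power loss.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// FNV-1a, 32-bit: XOR each byte in, then multiply by the FNV prime.
fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5; // offset basis
    for b in bytes {
        h ^= u32::from(*b);
        h = h.wrapping_mul(0x0100_0193); // FNV prime
    }
    h
}

/// `<basename>-<8-hex>`; the same path yields the same id on every run.
fn project_id_for(cwd: &Path) -> String {
    let basename = cwd
        .file_name()
        .and_then(|s| s.to_str())
        .unwrap_or("unknown");
    let safe: String = basename
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '-' || c == '_' {
                c
            } else {
                '_'
            }
        })
        .collect();
    let hash = fnv1a_32(cwd.to_string_lossy().as_bytes());
    format!("{safe}-{hash:08x}")
}

/// Hypothetical helper: write to `<name>.tmp`, then rename over `<name>`.
/// Readers see either the old file or the new one, never a partial write.
fn write_atomic(dir: &Path, name: &str, contents: &str) -> io::Result<PathBuf> {
    fs::create_dir_all(dir)?;
    let final_path = dir.join(name);
    let tmp_path = dir.join(format!("{name}.tmp"));
    fs::write(&tmp_path, contents)?;
    fs::rename(&tmp_path, &final_path)?;
    Ok(final_path)
}

fn main() -> io::Result<()> {
    let id = project_id_for(Path::new("/home/me/proj"));
    // Assumption: a scratch location under the OS temp dir, not $XDG_DATA_HOME.
    let dir = std::env::temp_dir().join("project-id-sketch").join(&id);
    let path = write_atomic(&dir, "plan.md", "# draft\n")?;
    println!("{id} -> {}", path.display());
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

The rename is only atomic when the temporary file and the target live on the same filesystem, which is why the sketch (like `save_to_dir`) keeps both in the same directory.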

File diff suppressed because it is too large

Some files were not shown because too many files have changed in this diff