Commit Graph

206 Commits

Author SHA1 Message Date
a60c9f1075 feat(#47 #53 phase 2b): expose per-model admission load in GET /health
All checks were successful
Completes #53: the bounded scheduler's lock-free counters are now visible
to the fleet, which is what cortex's load-aware router (#55) consumes to
spread traffic across replicas and propagate honest backpressure.

- cortex-core::discovery: HealthResponse gains `models: Vec<ModelLoad>`
  (#[serde(default)] — back-compatible; older gateways/neurons interop).
  ModelLoad { id, in_flight, queue_depth }.
- LoadedHandle::load() → (in_flight, queue_depth), lock-free for both
  single-GPU and TP; CandleHarness::load_snapshot() enumerates resident
  models; the /health handler overlays it from the candle harness.

Tests: /health always exposes a models array (api integration test); a
pre-#53 payload without `models` still deserializes, and ModelLoad
round-trips (cortex-core serde tests). Local fmt/clippy/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:13:07 +03:00
b2bd86bfa5 feat(#47 #53 phase 2a): neuron admission control — bounded queue + backpressure
All checks were successful
Replaces the per-model unbounded, untimed FIFO of inference-lock waiters
(a busy model made new requests hang ~300s until the client gave up with
an opaque error) with an explicit bounded scheduler.

- harness::admission::AdmissionController: batch-1 scheduler — max_in_flight
  running (1) + a bounded queue (max_queue_depth) with a max_wait. enter()
  fast-rejects when the queue is full (QueueFull) or the wait elapses
  (Timeout); the returned AdmissionPermit is held for the request and frees
  both slots on drop. Pure async (no CUDA), lock-free in_flight/queue_depth
  counters for future /health reporting. Configurable via
  [harness.candle.admission] (max_in_flight=1, max_queue_depth=8,
  max_wait_secs=30).
- Gated at all four inference entry points before the inference_lock/pool
  lock: single-GPU non-streaming + streaming, TP non-streaming + streaming.
  The streaming paths acquire the permit before opening the SSE (so a
  rejection is a clean error, not a half-open stream) and move it into the
  inference task.
- InferenceError::Overloaded { retry_after_secs } → 503 rate_limit_exceeded
  + Retry-After via the #60/#63 envelope: a fast, retryable "busy" signal
  opencode/AI SDK back off on, not a stall.
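
A minimal, std-only sketch of the bounded-admission idea described above. It is not the code in harness::admission: the real controller is described as pure async, whereas this version blocks on a Condvar, waiters are not served in strict FIFO order, and the field layout is an assumption. Only the names AdmissionController, AdmissionPermit, QueueFull, Timeout, in_flight and queue_depth come from the message.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum AdmissionError {
    QueueFull,
    Timeout,
}

struct Inner {
    max_in_flight: usize,
    max_queue_depth: usize,
    max_wait: Duration,
    in_flight: AtomicUsize,   // read lock-free by load()
    queue_depth: AtomicUsize, // read lock-free by load()
    gate: Mutex<()>,          // serializes admission decisions
    freed: Condvar,           // signalled when a permit is dropped
}

#[derive(Clone)]
pub struct AdmissionController(Arc<Inner>);

/// Held for the lifetime of a request; frees the running slot on drop.
pub struct AdmissionPermit(Arc<Inner>);

impl AdmissionController {
    pub fn new(max_in_flight: usize, max_queue_depth: usize, max_wait: Duration) -> Self {
        Self(Arc::new(Inner {
            max_in_flight,
            max_queue_depth,
            max_wait,
            in_flight: AtomicUsize::new(0),
            queue_depth: AtomicUsize::new(0),
            gate: Mutex::new(()),
            freed: Condvar::new(),
        }))
    }

    /// (in_flight, queue_depth) without taking the gate lock.
    pub fn load(&self) -> (usize, usize) {
        let i = &self.0;
        (i.in_flight.load(Ordering::Relaxed), i.queue_depth.load(Ordering::Relaxed))
    }

    pub fn enter(&self) -> Result<AdmissionPermit, AdmissionError> {
        let inner = &self.0;
        let mut guard = inner.gate.lock().unwrap();
        if inner.in_flight.load(Ordering::Relaxed) < inner.max_in_flight {
            inner.in_flight.fetch_add(1, Ordering::Relaxed);
            return Ok(AdmissionPermit(Arc::clone(inner)));
        }
        // Fast reject: no room left to wait.
        if inner.queue_depth.load(Ordering::Relaxed) >= inner.max_queue_depth {
            return Err(AdmissionError::QueueFull);
        }
        inner.queue_depth.fetch_add(1, Ordering::Relaxed);
        let deadline = Instant::now() + inner.max_wait;
        loop {
            let now = Instant::now();
            if now >= deadline {
                inner.queue_depth.fetch_sub(1, Ordering::Relaxed);
                return Err(AdmissionError::Timeout);
            }
            guard = inner.freed.wait_timeout(guard, deadline - now).unwrap().0;
            if inner.in_flight.load(Ordering::Relaxed) < inner.max_in_flight {
                inner.queue_depth.fetch_sub(1, Ordering::Relaxed);
                inner.in_flight.fetch_add(1, Ordering::Relaxed);
                return Ok(AdmissionPermit(Arc::clone(inner)));
            }
        }
    }
}

impl Drop for AdmissionPermit {
    fn drop(&mut self) {
        let _guard = self.0.gate.lock().unwrap();
        self.0.in_flight.fetch_sub(1, Ordering::Relaxed);
        self.0.freed.notify_one();
    }
}
```

With max_in_flight=1 and max_queue_depth=0, a second enter() while a permit is held returns QueueFull immediately; with a non-zero queue it waits up to max_wait and then returns Timeout.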

Scope: this branch is the admission *core* (the hang→backpressure fix).
Exposing in_flight/queue_depth in GET /health (consumed by cortex
load-aware routing #55) is the next focused branch under #53.

4 unit tests (admit/report load, queue-full reject, wait-timeout reject)
+ Overloaded envelope mapping test. Non-CUDA build green locally; the
CUDA + TP sites are validated by branch CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:03:07 +03:00
cdf87284af feat(#47 phase 1d): budget enforcement — hard caps, reserve→settle, 429
All checks were successful
Stage 1 complete: the A0 seatbelt (#52). Flips the metering-only reserve(0)
from #51 to the request's real upper-bound cost and refuses over-cap
requests *before* neuron is hit.

- metering::reservation_estimate: prompt estimate (~4 chars/token over the
  body — cortex has no tokenizer, so a conservative over-estimate; neuron
  stays the exact context wall) + max output. Max output comes from
  max_completion_tokens / legacy max_tokens, else the model's advertised
  limit.output (#62), else FALLBACK_MAX_OUTPUT. Over-reserving is safe —
  settle reconciles to actual.
- metering::reserve_or_reject: reserve the estimate; on BudgetError map to
  the #63 envelope and the caller refuses before dispatch — rolling window →
  429 rate_limit_exceeded + Retry-After (until reset); hard balance → 429
  insufficient_quota (no Retry-After). Never 402.
- Wired into both the OpenAI proxy path (proxy_with_metrics) and the
  Anthropic path (estimate from the translated body). advertised_output_limit
  reads the loaded model's limit.output from fleet state.
- Reservation prevents overshoot under concurrency: a successful reserve
  gates on spent+reserved+estimate ≤ cap, and settle records actual ≤
  reserved, so spend can never exceed the hard cap.
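
The estimate described above can be sketched roughly as follows. This is an illustration under assumptions, not the metering module: the OutputHints struct, the ceiling division, counting chars rather than bytes, and the value 4096 for FALLBACK_MAX_OUTPUT are all hypothetical. Only the 4 chars/token heuristic and the fallback order come from the message.

```rust
/// Assumed value for illustration; the real constant is not shown in the log.
pub const FALLBACK_MAX_OUTPUT: u64 = 4096;

/// Hypothetical carrier for the three places a max-output figure can come from.
#[derive(Default)]
pub struct OutputHints {
    pub max_completion_tokens: Option<u64>,
    pub max_tokens: Option<u64>, // legacy field
    pub advertised_output_limit: Option<u64>, // model's limit.output
}

/// Roughly 4 characters per token, rounded up so short bodies never estimate to 0.
pub fn estimate_prompt_tokens(body: &str) -> u64 {
    (body.chars().count() as u64 + 3) / 4
}

/// Upper-bound cost to reserve: prompt estimate plus the largest output allowed.
pub fn reservation_estimate(body: &str, hints: &OutputHints) -> u64 {
    let max_output = hints
        .max_completion_tokens
        .or(hints.max_tokens)
        .or(hints.advertised_output_limit)
        .unwrap_or(FALLBACK_MAX_OUTPUT);
    estimate_prompt_tokens(body) + max_output
}
```

Because settle reconciles to actual usage, erring high here only holds budget temporarily; erring low would let concurrent requests overshoot the cap.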

4 integration tests with a hit-counting mock neuron: balance over-cap →
429 insufficient_quota (no Retry-After, not dispatched); rolling over-cap →
429 rate_limit_exceeded + Retry-After (not dispatched); within-cap served;
**A0 repro** — a capped key's 20-request fan-out drains the cap, then is
refused, neuron only saw the served ones, and spend never exceeds the cap.
Plus 5 metering unit tests. Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:35:04 +03:00
4f16b8c541 feat(#47 phase 1c): per-request token metering + spend ledger
All checks were successful
Stage 1 accounting (#51): capture real per-request usage and feed it to
the spend ledger + per-principal metrics. Establishes the reserve→settle
lifecycle that budget enforcement (#52) will tighten.

- cortex-gateway::metering: ReservationGuard makes reservation leaks
  impossible — settle() records actual spend + releases the remainder;
  dropping an un-settled guard releases the whole reservation, so any
  early return / error / dropped stream resolves it. UsageSink is the
  completion hook; principal_from_headers reconstructs the principal from
  the middleware-stamped headers (uniform across all proxy paths, no
  handler-signature churn); record_spend emits per-principal counters.
- proxy::TokenMetrics gains an optional usage_sink, invoked exactly once
  in finish() with the observed (prompt, completion) — restructured so it
  always runs (even when no body/usage arrived → settle 0 → release),
  while preserving the existing per-model metric emissions unchanged.
- All proxy paths metered: chat/completions/responses via
  proxy_with_metrics (reserve 0 → forward_request → settle in finish);
  Anthropic non-streaming settles from the buffered body; Anthropic
  streaming (anthropic_sse) now scans the upstream frames for the usage
  object (#48) — it captured none before — and settles at pump end.
- This phase reserves 0 tokens (metering only, no enforcement); #52 flips
  the reserved amount to prompt+max_output and surfaces BudgetError. The
  settle/release plumbing is identical, so that change is localized.
- New Prometheus counters: cortex_spend_tokens_total (+ prompt/completion
  splits), labelled by account/key.
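
One way to picture the reserve→settle lifecycle is a guard whose Drop impl releases whatever was not settled. The sketch below is a simplification under assumptions: the Ledger and Budget types are hypothetical stand-ins for the spend ledger, and it is synchronous, whereas the real provider trait is async.

```rust
use std::sync::{Arc, Mutex};

#[derive(Default, Debug, Clone, Copy, PartialEq)]
pub struct Budget {
    pub spent: u64,
    pub reserved: u64,
}

/// Hypothetical in-memory ledger for a single key.
#[derive(Default)]
pub struct Ledger {
    state: Mutex<Budget>,
}

impl Ledger {
    pub fn reserve(self: &Arc<Self>, tokens: u64) -> ReservationGuard {
        self.state.lock().unwrap().reserved += tokens;
        ReservationGuard { ledger: Arc::clone(self), tokens, settled: false }
    }

    pub fn snapshot(&self) -> Budget {
        *self.state.lock().unwrap()
    }
}

pub struct ReservationGuard {
    ledger: Arc<Ledger>,
    tokens: u64,
    settled: bool,
}

impl ReservationGuard {
    /// Record actual spend and release the whole reservation.
    pub fn settle(mut self, actual: u64) {
        {
            let mut b = self.ledger.state.lock().unwrap();
            b.reserved -= self.tokens;
            b.spent += actual;
        }
        self.settled = true; // Drop (which runs right after) becomes a no-op
    }
}

impl Drop for ReservationGuard {
    fn drop(&mut self) {
        // Early return, error, or dropped stream: give the reservation back.
        if !self.settled {
            self.ledger.state.lock().unwrap().reserved -= self.tokens;
        }
    }
}
```

Since settle() consumes the guard, it cannot be called twice, and any code path that never reaches it still ends with zero outstanding reservation.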

2 integration tests: cumulative per-key spend after N requests with
reservations settled to zero outstanding; anonymous requests record no
spend. Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:29:51 +03:00
486d7e9a8f feat(#47 phase 1b): API-key auth + principal resolution
All checks were successful
Stage 1 identity (#49): cortex now knows who a request is for. Identity
rides standard bearer auth only (Authorization: Bearer <key>) — no custom
required headers or body fields — which is what keeps every tier
OpenAI-compatible by construction.

- cortex-gateway::auth: `require_principal` axum middleware
  (from_fn_with_state), wired in build_app outer-to-inner as
  trace → CORS → auth → handlers (CORS outer so preflight short-circuits).
  It resolves the bearer key via the EntitlementProvider, inserts the
  typed Principal into request extensions (for metering #51 / enforcement
  #52), and stamps internal x-helexa-account-id / x-helexa-key-id headers
  so the principal reaches neuron, which trusts cortex over WireGuard (#54).
- Anti-spoofing: client-supplied principal headers are stripped before the
  authoritative value is stamped — a client can never assert a principal
  it didn't authenticate as.
- Rejection contract (#63): missing key under require_auth, or any present
  but unresolvable key, → 401 invalid_api_key in the #60 envelope. /health
  and / stay public. require_auth=false (default) allows anonymous through
  but still 401s a present-but-invalid key.
- Header-name constants (HEADER_ACCOUNT_ID/KEY_ID) live in cortex-core so
  neuron (#54) shares them. The chat/completions/responses paths forward
  the stamped headers automatically via proxy::forward_request; the
  Anthropic streaming + non-streaming paths forward them explicitly via
  auth::forward_principal_headers (they build their own upstream requests).
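
The strip-then-stamp order is what makes the anti-spoofing guarantee hold. A framework-free sketch, assuming headers are modelled as a plain Vec of name/value pairs (the real middleware works on axum requests) and that both Principal fields are Strings:

```rust
pub const HEADER_ACCOUNT_ID: &str = "x-helexa-account-id";
pub const HEADER_KEY_ID: &str = "x-helexa-key-id";

pub struct Principal {
    pub account_id: String,
    pub key_id: String,
}

/// Hypothetical helper: remove any client-supplied principal headers, then
/// stamp the authoritative ones if the request resolved to a principal.
pub fn stamp_principal(headers: &mut Vec<(String, String)>, principal: Option<&Principal>) {
    // Header names are case-insensitive, so compare that way.
    headers.retain(|(name, _)| {
        !name.eq_ignore_ascii_case(HEADER_ACCOUNT_ID) && !name.eq_ignore_ascii_case(HEADER_KEY_ID)
    });
    if let Some(p) = principal {
        headers.push((HEADER_ACCOUNT_ID.to_string(), p.account_id.clone()));
        headers.push((HEADER_KEY_ID.to_string(), p.key_id.clone()));
    }
}
```

Stripping runs even for anonymous requests, so an unauthenticated client cannot smuggle a principal header through to neuron either.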

5 integration tests: missing-key 401, invalid-key 401 (even when auth not
required, not dispatched), valid key reaches neuron with principal headers
+ spoofed header stripped, anonymous allowed when not required, /health
public. Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:07:10 +03:00
bc74e0e95f feat(#47 phase 1a): EntitlementProvider trait + local/static provider
Some checks failed
Stage 1's build seam (#50): the interface auth, metering, and budget
enforcement all hang off, with a local/static provider so the A0
amplification fix can land before any upstream clearing house exists.
The future helexa-upstream client (#57) is just another impl.

- cortex-core::entitlements: Principal {account_id, key_id}, CapWindow
  (Balance | Rolling{seconds}), Reservation handle, BudgetSnapshot,
  AuthError/BudgetError, and the async EntitlementProvider trait
  (resolve / reserve / settle / release / snapshot). BudgetError carries
  the window semantics so callers pick the #63 code (rate_limit_exceeded
  + Retry-After vs insufficient_quota) without the provider touching HTTP.
- cortex-core::config: [entitlements] section on GatewayConfig
  (require_auth + [[entitlements.keys]] with account_id, optional key_id,
  hard_cap, window). Additive + serde(default) — anonymous/uncapped when
  omitted, so existing setups are unaffected.
- cortex-gateway::entitlements_local: LocalEntitlementProvider. Budget
  math serialized under one Mutex so spent+reserved can never exceed a
  hard cap under concurrency (the #52 guarantee); rolling windows reset
  lazily; uncapped keys (no hard_cap) always reserve but still meter.
- CortexState gains Arc<dyn EntitlementProvider> + require_auth, built in
  from_config. Not yet consumed by the request path — auth middleware is
  1b (#49), enforcement is 1d (#52).
- cortex.example.toml documents the section; test GatewayConfig literals
  updated for the new field.
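
A rough sketch of budget math serialized under one Mutex, with the window deciding which error the caller sees. It is illustrative only: KeyBudget, the BudgetError variant names, and passing the clock in as seconds are assumptions made to keep the example deterministic. CapWindow (Balance | Rolling{seconds}) is the only shape taken from the message.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CapWindow {
    Balance,
    Rolling { seconds: u64 },
}

/// Hypothetical variants; they carry enough for the caller to pick the HTTP code.
#[derive(Debug, PartialEq)]
pub enum BudgetError {
    InsufficientQuota,
    RateLimited { retry_after_secs: u64 },
}

struct State {
    spent: u64,
    reserved: u64,
    window_start: u64,
}

pub struct KeyBudget {
    hard_cap: Option<u64>,
    window: CapWindow,
    state: Mutex<State>,
}

impl KeyBudget {
    pub fn new(hard_cap: Option<u64>, window: CapWindow, now: u64) -> Self {
        let state = Mutex::new(State { spent: 0, reserved: 0, window_start: now });
        Self { hard_cap, window, state }
    }

    pub fn reserve(&self, estimate: u64, now: u64) -> Result<u64, BudgetError> {
        let mut s = self.state.lock().unwrap();
        // Lazy reset: the rolling window only rolls over when someone asks.
        if let CapWindow::Rolling { seconds } = self.window {
            if now >= s.window_start + seconds {
                s.spent = 0;
                s.window_start = now;
            }
        }
        if let Some(cap) = self.hard_cap {
            if s.spent + s.reserved + estimate > cap {
                return Err(match self.window {
                    CapWindow::Balance => BudgetError::InsufficientQuota,
                    CapWindow::Rolling { seconds } => BudgetError::RateLimited {
                        retry_after_secs: (s.window_start + seconds).saturating_sub(now),
                    },
                });
            }
        }
        s.reserved += estimate; // uncapped keys always reserve but still meter
        Ok(estimate)
    }

    pub fn settle(&self, reserved: u64, actual: u64) {
        let mut s = self.state.lock().unwrap();
        s.reserved -= reserved;
        s.spent += actual.min(reserved); // actual never exceeds what was held
    }
}
```

Because the check and the increment happen under the same lock, two concurrent reserves cannot both pass on the last remaining headroom.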

6 provider unit tests (resolve, unknown-key, round-trip, balance/rolling
over-cap codes, uncapped infra key). Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:00:05 +03:00
f22d83df14 feat(#47 phase 0): centralize OpenAI error envelope + add Retry-After
Some checks failed
The rejection contract (#63) requires every "no" path to speak the
OpenAI envelope with standard codes and, for retryable conditions, a
Retry-After header. Two gaps remained despite #63 being closed:
Retry-After was implemented nowhere, and the envelope was hand-built
inline in four places (gateway handlers/proxy/router, neuron api) with
no shared source of truth — exactly the inconsistency #63 set out to
prevent, and a foundation every Stage 1-2 rejection (401/429/503) needs.

- cortex-core: new `error_envelope::OpenAiError` — an axum-agnostic
  builder carrying status, type, code, message, param, optional
  retry_after, and diagnostic extras. Named constructors encode the #63
  codes (invalid_api_key, rate_limit_exceeded, insufficient_quota,
  context_length_exceeded, service_unavailable) and which carry
  Retry-After. cortex-core stays a pure types crate; each HTTP crate
  owns a thin `envelope_response` adapter that sets the header.
- cortex-gateway: route error_response, ProxyError, and RouteError
  through the shared builder; RouteError::retry_after_secs wires
  Retry-After on the transient NoHealthyNodes (5s) / ModelRecovering
  (2s) variants.
- neuron: route inference_error_response through the shared builder;
  InsufficientVram (transient 503) now advertises Retry-After: 5.
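
As a sketch of the shared-builder idea, the named constructors below fix status, code, and whether Retry-After is present in one place, and a thin adapter turns that into headers. This is not the error_envelope module: the field set is trimmed (no param or extras), the error_type strings are assumptions, and headers() is a hypothetical stand-in for the per-crate envelope_response adapter.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiError {
    pub status: u16,
    pub error_type: &'static str, // assumed values, for illustration only
    pub code: &'static str,
    pub message: String,
    pub retry_after_secs: Option<u64>,
}

impl OpenAiError {
    pub fn invalid_api_key(message: impl Into<String>) -> Self {
        Self {
            status: 401,
            error_type: "invalid_request_error",
            code: "invalid_api_key",
            message: message.into(),
            retry_after_secs: None,
        }
    }

    /// Retryable: always carries Retry-After.
    pub fn rate_limit_exceeded(message: impl Into<String>, retry_after_secs: u64) -> Self {
        Self {
            status: 429,
            error_type: "rate_limit_error",
            code: "rate_limit_exceeded",
            message: message.into(),
            retry_after_secs: Some(retry_after_secs),
        }
    }

    /// Not retryable until the balance changes: no Retry-After.
    pub fn insufficient_quota(message: impl Into<String>) -> Self {
        Self {
            status: 429,
            error_type: "insufficient_quota",
            code: "insufficient_quota",
            message: message.into(),
            retry_after_secs: None,
        }
    }

    /// Framework-agnostic view of the headers an HTTP adapter would set.
    pub fn headers(&self) -> Vec<(&'static str, String)> {
        self.retry_after_secs
            .map(|secs| vec![("Retry-After", secs.to_string())])
            .unwrap_or_default()
    }
}
```

Keeping the type free of any HTTP framework is what lets both the gateway and neuron share it while each owns its own response adapter.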

Behaviour for existing paths is unchanged (same status/type/code/extras);
only the new Retry-After headers are added. Tests cover the builder wire
shape and Retry-After presence/absence on both sides.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 18:46:56 +03:00
4b28a64b34 feat(#67 phase 5b): enforce the derived input as the prompt cap
All checks were successful
The request path now rejects prompts above the model's self-derived input
budget, not the static NEURON_MAX_PROMPT_TOKENS — so a VRAM-tight host
(where the VRAM ceiling binds below the static cap) rejects an
over-budget prompt up front instead of accepting it and OOMing
mid-prefill.

- derived_input_cap: AtomicUsize on LoadedModel + TpLoadedModel; refreshed
  by LoadedHandle::derived_limit (runs on every /models poll). 0 = not
  derived yet.
- effective_prompt_cap(): cached derived input when >0, else the static
  max_prompt_tokens() (cold-start / no-profile fallback).
- validate_request takes the cap as a param; all 4 call sites
  (chat_completion, inference_stream, inference_tp_stream, TP
  chat_completion) pass the in-scope model's effective_prompt_cap().
- doc/context-limits.md: enforcement note updated from "remaining" to
  "landed".
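
The cached-cap fallback can be sketched as below. The struct is a stand-in, not the real LoadedModel: store_derived_input, static_max_prompt_tokens, and validate_prompt are hypothetical names, and the error is a plain String for brevity.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for the loaded-model handle; only the cap-related state is shown.
pub struct LoadedModel {
    derived_input_cap: AtomicUsize, // 0 = not derived yet
    static_max_prompt_tokens: usize,
}

impl LoadedModel {
    pub fn new(static_max_prompt_tokens: usize) -> Self {
        Self { derived_input_cap: AtomicUsize::new(0), static_max_prompt_tokens }
    }

    /// Called from the poll-driven derivation to refresh the cached value.
    pub fn store_derived_input(&self, input: usize) {
        self.derived_input_cap.store(input, Ordering::Relaxed);
    }

    /// Lock-free read: derived cap when known, else the static fallback.
    pub fn effective_prompt_cap(&self) -> usize {
        match self.derived_input_cap.load(Ordering::Relaxed) {
            0 => self.static_max_prompt_tokens,
            derived => derived,
        }
    }
}

/// The validate path takes the cap as a parameter instead of reading a global.
pub fn validate_prompt(prompt_tokens: usize, cap: usize) -> Result<(), String> {
    if prompt_tokens > cap {
        return Err(format!("prompt of {prompt_tokens} tokens exceeds cap of {cap}"));
    }
    Ok(())
}
```

Using 0 as the "not derived" sentinel keeps the field a single atomic word, at the cost of not being able to express a genuinely zero cap.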

Reads the cap lock-free from the sync validate path (no per-request VRAM
query); the cap tracks live state via the poll-driven derivation. With
this, advertise and enforce agree and both track the resident model.

fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:26:37 +03:00
dd65eedb24 feat(#67 phase 5a): NEURON_MAX_PROMPT_TOKENS becomes a clamp-only backstop; docs
All checks were successful
Demotes the static per-host prompt cap from authority to an optional
upper-bound clamp on the self-derived limit, and rewrites the
context-limits doc around the computed model.

- max_prompt_tokens_clamp(): reads NEURON_MAX_PROMPT_TOKENS directly so
  "explicitly set" is distinct from the 16384 default; returns None when
  unset (no clamp). Applied as derive_limit's hard_ceiling in
  LoadedHandle::derived_limit, so the advertised context is clamped only
  when an operator set a backstop — the derivation is otherwise
  authoritative and binds below it in practice.
- doc/context-limits.md: intro + "After #62" rewritten as "After #67 —
  the neuron computes its own limit" (formula, live signals, config
  block, opencode note, NEURON_MAX_PROMPT_TOKENS demotion).
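
A small sketch of the "explicitly set versus default" distinction. The parse_clamp and apply_clamp helpers are hypothetical, split out so the logic can be exercised without touching the process environment, and treating an unparsable value as "no clamp" is an assumption.

```rust
/// Hypothetical pure helper: None when the variable is unset or not a number.
pub fn parse_clamp(raw: Option<&str>) -> Option<usize> {
    raw?.trim().parse::<usize>().ok()
}

/// Reads the env var directly so an unset variable is distinguishable from
/// a default value.
pub fn max_prompt_tokens_clamp() -> Option<usize> {
    parse_clamp(std::env::var("NEURON_MAX_PROMPT_TOKENS").ok().as_deref())
}

/// The clamp only ever lowers the derived value; it never raises it.
pub fn apply_clamp(derived: usize, clamp: Option<usize>) -> usize {
    match clamp {
        Some(ceiling) => derived.min(ceiling),
        None => derived,
    }
}
```

For example, a derived context of 65536 stays 65536 with no clamp, drops to 16384 with NEURON_MAX_PROMPT_TOKENS=16384, and is unaffected by a clamp of 131072.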

Remaining (phase 5b, follow-up): enforce the *derived* input as the
prompt cap (reject above computed input, not the static
NEURON_MAX_PROMPT_TOKENS) so VRAM-tight hosts can't accept an
OOM-inducing prompt. Needs a per-model cached cap read from the sync
validate path; scoped separately. Until then the static cap remains the
enforced backstop (advertised <= enforced holds when the env is set).

fmt/clippy/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:14:34 +03:00
8b2e01a072 feat(#67 phase 4): advertise neuron-computed limit on /models; drop catalogue override
Some checks failed
The neuron now self-derives and advertises limit{context,input,output}
per loaded model; cortex forwards it and stops consulting the
operator-declared catalogue limit (which can't track hot-swapped models
or live capacity). Operator-set `cost` still flows from the catalogue.

neuron:
- CandleHarness gains context_limit_cfg (from [harness.candle.context_limit]).
- LoadedHandle::derived_limit(): profile + live tightest-card free VRAM
  (single: query_vram; TP: query_vram_tightest_free_mb) + prefill-rate
  EMA (bootstrap until first sample) → derive_limit. None for arches
  without a context profile. No operator clamp here (advertise the honest
  derived value; the clamp is an enforcement-side backstop).
- list_models() fills ModelInfo.limit from derived_limit (was None).
- derive_limit treats free_tightest_mb == 0 (unknown/CPU sentinel) as
  "no VRAM ceiling" instead of collapsing to zero.
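
The derivation can be pictured as a minimum over three ceilings. The sketch below is an assumption-heavy illustration, not the context_limit module: the VRAM ceiling as free bytes divided by per-token KV cost, the throughput ceiling as prefill rate times a tolerated prefill time, rounding the context down to a multiple of 1024, and the parameter list are all guesses. Only the min-of-ceilings shape, the optional hard ceiling, and the free_tightest_mb == 0 sentinel come from the #67 messages.

```rust
pub struct ContextProfile {
    pub max_position_embeddings: u64,
    pub kv_bytes_per_token_per_card: u64,
}

#[derive(Debug, PartialEq)]
pub struct Limit {
    pub context: u64,
    pub input: u64,
    pub output: u64,
}

pub fn derive_limit(
    profile: &ContextProfile,
    free_tightest_mb: u64,     // 0 = unknown / CPU sentinel
    prefill_tok_per_s: f64,    // EMA or bootstrap estimate
    max_prefill_secs: f64,     // hypothetical policy knob
    output: u64,               // tokens reserved for the completion
    hard_ceiling: Option<u64>, // optional operator backstop
) -> Limit {
    // Unknown free VRAM means "no VRAM ceiling", not "zero tokens".
    let vram_ceiling = if free_tightest_mb == 0 {
        u64::MAX
    } else {
        free_tightest_mb * 1024 * 1024 / profile.kv_bytes_per_token_per_card.max(1)
    };
    let throughput_ceiling = (prefill_tok_per_s * max_prefill_secs) as u64;

    let mut context = profile
        .max_position_embeddings
        .min(vram_ceiling)
        .min(throughput_ceiling);
    if let Some(ceiling) = hard_ceiling {
        context = context.min(ceiling);
    }
    context = context / 1024 * 1024; // round down to a 1024 multiple
    Limit { context, input: context.saturating_sub(output), output }
}
```

With a 262144-token architecture limit, 32 KiB of KV per token, 4096 MB free, and 2000 tok/s over a 60 s budget, the throughput ceiling (120000) binds and the context rounds down to 119808.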

cortex:
- ModelEntry gains `limit`, copied from ModelInfo.limit by the poller.
- /v1/models: catalogue `limit` no longer flows (Pass 1 sets None);
  Pass 2 adopts the neuron's limit, taking the tightest across neurons
  via tightest_limit(). cost unchanged.
- model_limits.rs rewritten: catalogue limit (999999) is ignored; the
  neuron's ModelEntry.limit is advertised; cost still from catalogue.
- All ModelEntry literals updated with the new field.

fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:10:20 +03:00
464b6b0db9 feat(neuron): self-measured prefill tok/s EMA on streaming paths (#67 phase 3)
All checks were successful
Refs #67. Feeds the throughput ceiling a live, per-model prefill rate
instead of only the configured bootstrap estimate, so the advertised
limit tracks real prefill speed and rises automatically as prefix
caching (#11) reduces effective prefill cost.

- context_limit::PrefillRateEma: lock-free f64-bits EMA (alpha 0.3),
  ignores degenerate samples, None before the first sample. Unit-tested.
- prefill_rate field on LoadedModel + TpLoadedModel.
- Recorded as total-prompt-tokens / prefill-elapsed in the two streaming
  serving paths (TP: inference_tp_stream via tp_for_task; single-GPU:
  stream_inference_via_worker via a new &prefill_rate param threaded from
  loaded_for_task). Measuring total prompt (not just the divergent
  suffix) means a prefix-cache hit shrinks elapsed while the prompt stays
  large, so the effective rate — and the ceiling — rises toward the VRAM
  ceiling, exactly the #11 payoff.
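
A lock-free EMA over f64 bits can be sketched with a single AtomicU64 and a compare-and-swap loop. This is an illustration rather than the context_limit code: the sentinel choice (u64::MAX, a NaN bit pattern that a real sample never produces) and the exact definition of a "degenerate" sample are assumptions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const ALPHA: f64 = 0.3;
/// Sentinel for "no sample yet".
const EMPTY: u64 = u64::MAX;

pub struct PrefillRateEma(AtomicU64);

impl PrefillRateEma {
    pub fn new() -> Self {
        Self(AtomicU64::new(EMPTY))
    }

    /// Fold one observation (prompt tokens over prefill seconds) into the EMA.
    pub fn record(&self, prompt_tokens: usize, elapsed_secs: f64) {
        // Ignore samples that would yield a zero, infinite, or NaN rate.
        if prompt_tokens == 0 || !elapsed_secs.is_finite() || elapsed_secs <= 0.0 {
            return;
        }
        let sample = prompt_tokens as f64 / elapsed_secs;
        // fetch_update retries the closure if another thread raced the store.
        let _ = self.0.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |bits| {
            let next = if bits == EMPTY {
                sample // the first sample seeds the average
            } else {
                ALPHA * sample + (1.0 - ALPHA) * f64::from_bits(bits)
            };
            Some(next.to_bits())
        });
    }

    /// None before the first valid sample.
    pub fn get(&self) -> Option<f64> {
        match self.0.load(Ordering::Relaxed) {
            EMPTY => None,
            bits => Some(f64::from_bits(bits)),
        }
    }
}
```

Seeding with 1000 tok/s and then recording 2000 tok/s moves the average to roughly 1300, which shows how a prefix-cache hit (same prompt size, shorter elapsed time) pulls the rate upward gradually rather than all at once.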

Per the agreed scope, non-streaming + CPU paths fall back to the
bootstrap estimate (opencode streams; those paths rarely carry the
fleet). fmt/clippy/test green; CUDA paths type-checked in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:02:02 +03:00
f2e05d96ec feat(neuron): capture ContextProfile at load + per-rank VRAM fan-out (#67 phase 2)
All checks were successful
Refs #67. Captures the per-model context physics at load and adds the
live free-VRAM signal the derivation needs — the tightest card across TP
ranks, not just the leader.

- ContextProfile captured at load:
  - single-GPU dense CUDA path (world_size 1) via
    context_limit::profile_from_qwen3_5_config(config_path, ..);
  - TP path (world_size = tp_size) at TpLoadedModel construction.
  GGUF/CPU/non-qwen3_5 → None (fall back to the static prompt cap).
  New `context_profile` field on LoadedModel + TpLoadedModel.
- profile_from_qwen3_5_config(): reads config.json (mirrors
  VisionMeta::from_config_path), counts full_attention layers
  (layer_types authoritative, full_attention_interval fallback), builds
  the per-card KV cost via the shared helper.
- Folded the inline per-rank KV-bytes math in tp_qwen3.rs (both
  cuda/non-cuda log_construction_complete) and tp_qwen3_5.rs onto
  context_limit::kv_bytes_per_token + KV_CACHE_DTYPE_BYTES.
- Per-rank VRAM fan-out (tightest card):
  - WorkerRequest::QueryVram + WorkerResponse::VramInfo { free_mb, total_mb };
  - worker.rs handle_query_vram (cuda: mem_get_info; non-cuda: error);
  - WorkerPool::query_vram_tightest_free_mb fans out to every rank
    (leader via its device worker, subprocess ranks via RPC) → min free;
  - TpLoadedModel::query_vram_tightest_free_mb convenience wrapper.

No advertise/enforce yet (phases 4/5). fmt/clippy/test green; CUDA paths
type-checked in CI.
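
The "tightest card" reduction described above can be sketched as follows. This is a minimal illustration, not the code in worker.rs or WorkerPool: the `VramInfo` field names come from the commit message, while `tightest_free_mb`, the `String` error type, and the choice to fail on the first rank error are assumptions.

```rust
/// Per-rank reply; field names follow WorkerResponse::VramInfo above.
#[derive(Debug, Clone, Copy)]
pub struct VramInfo {
    pub free_mb: u64,
    pub total_mb: u64,
}

/// Hypothetical reduction over per-rank replies: the minimum free VRAM wins,
/// because the tightest card bounds what every rank can allocate.
/// Assumption: any rank error (e.g. a non-CUDA worker) fails the whole query.
pub fn tightest_free_mb<I>(replies: I) -> Result<u64, String>
where
    I: IntoIterator<Item = Result<VramInfo, String>>,
{
    let mut tightest: Option<u64> = None;
    for reply in replies {
        let info = reply?;
        tightest = Some(tightest.map_or(info.free_mb, |t| t.min(info.free_mb)));
    }
    tightest.ok_or_else(|| "no ranks replied".to_string())
}
```

Taking the minimum rather than the leader's value keeps the derived limit safe when one rank shares its card with another workload.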

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 13:18:27 +03:00
4f05a87449 feat(neuron): self-derived context-limit core — physics + policy (#67 phase 1)
All checks were successful
CI / Format (push) Successful in 38s
CI / CUDA type-check (push) Successful in 1m49s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m28s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Refs #67. The correct limit{context,input,output} for a deployment is a
computed function of model architecture + live free VRAM + a
coherence/throughput trade-off, not an operator-declared static fact that
goes stale on model swap. This lands the arch-agnostic derivation core;
later phases capture per-model physics at load, measure throughput, and
advertise/enforce the computed limit.

- crates/neuron/src/harness/context_limit.rs (new):
  - kv_bytes_per_token(): shared per-card KV cost (counts only
    full-attention layers; sharded by TP world size). The TP load paths'
    inline math folds onto this in phase 2.
  - ContextProfile: per-model physics snapshot (max_position_embeddings,
    kv_bytes_per_token_per_card, world_size).
  - derive_limit(): context = min(max_pos, vram_ceiling,
    throughput_ceiling) clamped by an optional backstop; input = context −
    output; rounded to a multiple of 1024. 6 unit tests.
- config.rs: [harness.candle.context_limit] block (mirrors prefix_cache):
  target_prefill_latency_secs, bootstrap_prefill_tok_per_sec,
  activation_headroom_mb, min_free_floor_mb, output_reserve_tokens.
- neuron.example.toml: documented the new block.

No runtime behaviour change yet. fmt/clippy/test green.
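
A rough sketch of the derivation described above, under stated assumptions. `ContextProfile` and the config key names come from the commit message; `LimitPolicy`, `Limit`, the function signature, the formulas for the two ceilings, and rounding down (rather than to nearest) are assumptions made for illustration and may differ from context_limit.rs.

```rust
/// Per-model physics snapshot; field names follow the commit message.
#[derive(Debug, Clone, Copy)]
pub struct ContextProfile {
    pub max_position_embeddings: u64,
    pub kv_bytes_per_token_per_card: u64,
    pub world_size: u64,
}

/// Hypothetical policy struct, with fields named after the config keys.
#[derive(Debug, Clone, Copy)]
pub struct LimitPolicy {
    pub target_prefill_latency_secs: f64,
    pub prefill_tok_per_sec: f64,
    pub activation_headroom_mb: u64,
    pub min_free_floor_mb: u64,
    pub output_reserve_tokens: u64,
    pub backstop_tokens: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Limit {
    pub context: u64,
    pub input: u64,
    pub output: u64,
}

const ROUND_TO: u64 = 1024;

pub fn derive_limit(profile: &ContextProfile, free_mb: u64, policy: &LimitPolicy) -> Limit {
    // Assumed VRAM ceiling: free memory minus headroom and floor, divided by
    // the per-card KV cost of one token.
    let usable_mb = free_mb
        .saturating_sub(policy.activation_headroom_mb)
        .saturating_sub(policy.min_free_floor_mb);
    let vram_ceiling = usable_mb * 1024 * 1024 / profile.kv_bytes_per_token_per_card.max(1);
    // Assumed throughput ceiling: tokens that prefill within the latency target.
    let throughput_ceiling =
        (policy.target_prefill_latency_secs * policy.prefill_tok_per_sec) as u64;
    let mut context = profile
        .max_position_embeddings
        .min(vram_ceiling)
        .min(throughput_ceiling);
    if let Some(backstop) = policy.backstop_tokens {
        context = context.min(backstop);
    }
    // Assumption: round down to a multiple of 1024.
    let context = context / ROUND_TO * ROUND_TO;
    let output = policy.output_reserve_tokens.min(context);
    Limit { context, input: context - output, output }
}
```

With the bootstrap rate as `prefill_tok_per_sec`, the throughput ceiling is a fixed number until a measured rate replaces it.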

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 13:00:52 +03:00
2f67d17ec7 feat(neuron): emit reasoning_tokens usage details on streaming
All checks were successful
CI / CUDA type-check (push) Successful in 1m45s
CI / Format (push) Successful in 43s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m28s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Resolve version stamps + change detection (push) Successful in 34s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m3s
build-prerelease / Build cortex binary (push) Successful in 2m16s
build-prerelease / Build helexa-bench binary (push) Successful in 2m23s
build-prerelease / Build neuron-ampere (push) Successful in 2m50s
build-prerelease / Lint (fmt + clippy) (push) Successful in 3m3s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Test (push) Successful in 5m11s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m24s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 56s
Closes #64.

opencode meters reasoning tokens separately via the OpenAI-standard
detail objects, which neuron's usage structs didn't expose. Add them
additively so older clients ignore them.

- cortex-core: Usage gains completion_tokens_details/prompt_tokens_details;
  ResponsesUsage gains output_tokens_details/input_tokens_details. Optional
  + skip_serializing_if, so the wire shape is unchanged for non-reasoning
  models. cached_tokens fields are defined but always None until prompt
  caching lands (#11).
- candle.rs: count tokens generated while in_reasoning across all three
  streaming paths (TP, worker, CPU); carry the count on InferenceEvent::Finish.
- chat projector: populate completion_tokens_details.reasoning_tokens.
- responses projector: wire up base usage emission on the streaming path
  (it emitted none before) and add output_tokens_details.reasoning_tokens.
- non-streaming paths leave details None (they don't track in_reasoning).

reasoning_tokens is a sub-count of completion/output tokens (OpenAI
semantics) — not added into total_tokens.
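
The sub-count rule above, as a small sketch. The `completion_tokens_details` and `reasoning_tokens` names come from the commit message; `build_usage`, the integer widths, and the clamp are assumptions, and the serde attributes of the real structs are omitted.

```rust
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct CompletionTokensDetails {
    pub reasoning_tokens: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Usage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
    /// None on paths that do not track in_reasoning.
    pub completion_tokens_details: Option<CompletionTokensDetails>,
}

/// Hypothetical helper: reasoning tokens are already counted inside
/// completion_tokens, so total_tokens is prompt + completion only.
pub fn build_usage(
    prompt_tokens: u64,
    completion_tokens: u64,
    reasoning_tokens: Option<u64>,
) -> Usage {
    Usage {
        prompt_tokens,
        completion_tokens,
        total_tokens: prompt_tokens + completion_tokens,
        // Assumption: clamp so the sub-count never exceeds its parent.
        completion_tokens_details: reasoning_tokens.map(|r| CompletionTokensDetails {
            reasoning_tokens: r.min(completion_tokens),
        }),
    }
}
```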

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 12:04:05 +03:00
11b2e6f78c fix(cortex): default models_config to the packaged absolute path
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m24s
build-prerelease / Build neuron-blackwell (push) Successful in 1m42s
build-prerelease / Build neuron-ada (push) Successful in 2m7s
build-prerelease / Build helexa-bench binary (push) Successful in 2m7s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Build neuron-ampere (push) Successful in 2m49s
build-prerelease / Test (push) Successful in 4m26s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m23s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
cortex resolved the catalogue path "models.toml" relative to the service's
working directory, so the systemd-launched binary never found
/etc/cortex/models.toml and ran with an EMPTY catalogue in production —
limits, cost, pinning, aliases and feasibility were all silent no-ops,
with models surfacing only via the neuron poller. Tests never caught it
because they pass models_config explicitly; only the defaulted,
packaged path was broken.

Default to the absolute /etc/cortex/models.toml (where cortex.spec installs
it) and document the override in cortex.example.toml. Restores the #62
limit/cost advertisement (the catalogue is now actually read) along with
pinning/aliases/feasibility.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 10:04:29 +03:00
8a636c687f feat(cortex): per-model limit + cost on /v1/models; remove max_model_len
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 37s
build-prerelease / Build neuron-blackwell (push) Successful in 1m36s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m33s
build-prerelease / Build neuron-ada (push) Successful in 2m2s
build-prerelease / Build neuron-ampere (push) Successful in 2m47s
build-prerelease / Build helexa-bench binary (push) Successful in 2m8s
build-prerelease / Build cortex binary (push) Successful in 2m35s
build-prerelease / Test (push) Successful in 5m13s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s
Resolves #62. opencode's helexa provider discovers a model's serving
budget from /v1/models and uses it to size context, trigger compaction,
and show spend with no hand-configuration. Each model entry now carries:

  - limit { context, input?, output }  — operator-declared in models.toml
  - cost  { input, output, cache_read?, cache_write? }  — USD per 1M tokens
  - tool_call / reasoning  — runtime-detected by the candle harness and
    OR-ed in from each serving neuron

Composition: the catalogue profile supplies limit/cost (Pass 1); the
poller carries the neuron's detected tool_call/reasoning into ModelEntry,
which the gateway unions onto the entry (Pass 2); aliases propagate every
field (Pass 4). Wire types extend ModelInfo / ModelProfile /
CortexModelEntry additively (serde default + skip_serializing_if), so
older neurons and clients are unaffected. helexa-bench's ModelInfo
constructor and the gateway test fixtures are updated for the new fields.
Adds tests/model_limits.rs asserting /v1/models surfaces limit + cost
(catalogue) and tool_call + reasoning (runtime), and that max_model_len
is gone.

Removes max_model_len. It was write-only with no consumer — opencode's
source references it nowhere and it is not an OpenAI /v1/models field —
and doubly misleading: vLLM's max_model_len means total sequence length,
but cortex populated it from NEURON_MAX_PROMPT_TOKENS, a prompt-only cap.
The limit{} contract replaces it. The neuron's max_prompt_tokens remains
the enforced prompt cap (neuron-side); cortex just stops re-advertising a
derived, mis-named copy. Closes #66 — its stale-max_model_len premise is
moot once the field is gone.

limit/cost are operator-declared (catalogue) per #62's design; auto-
deriving the advertised budget from each neuron's reported cap is a
tracked follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 09:26:55 +03:00
04f798ec23 feat(cortex-gateway): enhance error responses with structured data
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m22s
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 2m26s
build-prerelease / Test (push) Successful in 4m23s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 47s
fixes #63
Standardize error messages by adding type, code, and param fields to
align with OpenAI API format. Updates include:
- Structured error envelopes with broad type categorization
  (invalid_request_error/api_error)
- Specific machine-readable codes (model_not_found/service_unavailable)
- Null param field as required by OpenAI specification
- Consistent error response formatting across handlers, proxy, and
  routing layers

New tests verify correct error envelope structure for various failure
scenarios.

Co-Authored-By: Helexa (Qwen3.6-27B, 48k context) <noreply@helexa.ai>
2026-06-16 17:51:04 +03:00
8f9e956d17 fix(neuron): emit OpenAI-standard nested error envelopes (#60)
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m44s
build-prerelease / Build neuron-ada (push) Successful in 2m14s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build neuron-ampere (push) Successful in 2m55s
build-prerelease / Test (push) Successful in 4m24s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 53s
InferenceError responses were a flat `{"error": "..."}` string. OpenAI
clients (opencode, the openai SDK) reach into `error.type`/`error.code`
to drive behaviour — most importantly `code == "context_length_exceeded"`
triggers auto-compaction + retry instead of a hard failure. A flat string
is invisible to that logic.

Rewrite `inference_error_response` to emit the nested envelope
`{"error": {"message","type","code","param", ...diagnostics}}` and map:

- ModelNotLoaded   → 404 invalid_request_error / model_not_found
- PromptTooLong    → 400 invalid_request_error / context_length_exceeded
  (message: "maximum context length is N tokens", + prompt_len/max)
- InsufficientVram → 503 api_error / insufficient_vram
- VisionUnsupported→ 400 invalid_request_error / vision_unsupported
- TemplateRenderFailed → 422 invalid_request_error / template_render_failed
- Other            → 500 api_error / null code

Diagnostic extras ride inside the error object so the envelope shape is
stable. Both inline match blocks in the chat-completions handler
(streaming + non-streaming) now defer to the shared helper, which the
responses handler already used — one source of truth.
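
The mapping table above can be read as one pure function. The sketch below mirrors the listed statuses, types, and codes; the enum payloads, the `ErrorClass` struct, and the `classify` name are hypothetical, and the real `inference_error_response` builds an HTTP response rather than returning a tuple-like value.

```rust
#[derive(Debug)]
pub enum InferenceError {
    ModelNotLoaded,
    PromptTooLong { prompt_len: usize, max: usize },
    InsufficientVram,
    VisionUnsupported,
    TemplateRenderFailed,
    Other,
}

#[derive(Debug, PartialEq, Eq)]
pub struct ErrorClass {
    pub status: u16,
    pub error_type: &'static str,
    /// None serialises as a null code.
    pub code: Option<&'static str>,
}

pub fn classify(err: &InferenceError) -> ErrorClass {
    let (status, error_type, code) = match err {
        InferenceError::ModelNotLoaded => (404, "invalid_request_error", Some("model_not_found")),
        InferenceError::PromptTooLong { .. } => {
            (400, "invalid_request_error", Some("context_length_exceeded"))
        }
        InferenceError::InsufficientVram => (503, "api_error", Some("insufficient_vram")),
        InferenceError::VisionUnsupported => {
            (400, "invalid_request_error", Some("vision_unsupported"))
        }
        InferenceError::TemplateRenderFailed => {
            (422, "invalid_request_error", Some("template_render_failed"))
        }
        InferenceError::Other => (500, "api_error", None),
    };
    ErrorClass { status, error_type, code }
}
```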

Adds 4 unit tests covering the envelope shape and codes. Also fixes a
pre-existing clippy lint (cloned_ref_to_slice_refs) in qwen3_5 snapshot
test surfaced by a newer clippy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 20:42:14 +03:00
cb758d4706 feat(neuron): emit usage on the streaming path so clients can track context
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m20s
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m46s
build-prerelease / Build neuron-ada (push) Successful in 2m9s
build-prerelease / Build cortex binary (push) Successful in 2m24s
build-prerelease / Build neuron-ampere (push) Successful in 2m52s
build-prerelease / Test (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m43s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 55s
The deeper reason opencode showed "Context: 0 tokens / 0% used" and flew
into a 400: streaming responses carried NO `usage`. Clients track context
(and trigger compaction) from the `usage` field; the legacy candle
streaming path set `usage: None` on every chunk, so a streaming client
had no token count at all — `max_model_len` alone is a denominator with
no numerator.

InferenceEvent::Finish now carries prompt_tokens + completion_tokens
(the streaming loops already have both: prompt_tokens.len() and the
generated all_tokens.len()). The openai_chat projector emits an
OpenAI-style trailing usage chunk (empty `choices`, populated `usage`)
after the finish chunk. cortex's Anthropic stream translator already
reads chunk.usage, so this fixes context tracking on BOTH the OpenAI
(opencode) and Anthropic (Claude Code) paths.

Also harden the max_model_len plumbing's sibling: cortex re-polls
/discovery while a neuron's max_prompt_tokens is still 0 (unknown), so a
rolling-deploy race where cortex caches discovery before the neuron has
the field self-heals instead of pinning max_model_len to None until a
manual cortex restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 19:43:59 +03:00
a2d2dbd006 feat: advertise max_model_len on /v1/models so clients can compact
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m15s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m2s
build-prerelease / Build helexa-bench binary (push) Successful in 2m0s
build-prerelease / Build cortex binary (push) Successful in 2m26s
build-prerelease / Build neuron-ampere (push) Successful in 2m55s
build-prerelease / Test (push) Successful in 4m28s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m22s
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m37s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s
opencode (and any OpenAI/Anthropic client) couldn't size or compact its
context against helexa because /v1/models never advertised a context
window — opencode showed "0 tokens / 0% used" and flew straight into a
400 PromptTooLong once a conversation + a fetched 64KB log overflowed the
49152-token cap. Compaction is the client's job, but the client needs to
know the limit to do it.

neuron now reports its effective prompt cap (NEURON_MAX_PROMPT_TOKENS)
in GET /discovery (`max_prompt_tokens`). cortex surfaces it on
/v1/models as `max_model_len` (vLLM / OpenAI-compatible convention) per
model — the smallest cap among the neurons that can serve it
(feasible_on ∪ locations), so the advertised limit holds wherever the
request routes. A neuron reporting 0 predates the field and is treated
as unknown (skipped); models with no reporting neuron omit the field.
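
The "smallest cap among reporting neurons" rule can be sketched in a few lines. The function name and the `u32` width are assumptions; the caller is assumed to have already collected `max_prompt_tokens` from the neurons in feasible_on ∪ locations.

```rust
/// Hypothetical helper: the advertised value is the smallest non-zero cap.
/// A cap of 0 means the neuron predates the field and is skipped; if no
/// neuron reports a cap, the field is omitted (None).
pub fn advertised_max_model_len(caps: &[u32]) -> Option<u32> {
    caps.iter().copied().filter(|&cap| cap > 0).min()
}
```

Taking the minimum means a prompt sized against the advertised value fits on whichever neuron the request is routed to.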

helexa still rejects over-limit prompts with a clean 400 — this just
gives clients the number to compact *before* hitting it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 19:11:13 +03:00
544214d0f8 fix(neuron): normalize OpenAI string tool-call arguments before rendering
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m39s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 2m48s
build-prerelease / Test (push) Successful in 4m35s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
opencode (OpenAI path, /v1/chat/completions passthrough) hit the same
chat_template:120 failure Claude Code did — "cannot convert value into
pairs" — because the OpenAI wire format carries
tool_calls[].function.arguments as a JSON *string*, while Qwen3.6's
template iterates it as a dict (`arguments | items`). The Anthropic-side
fix (8880b2f) only covered cortex's translation; the OpenAI path reaches
neuron unchanged.

render_chat_template now normalizes string-form tool-call arguments to
objects across all messages before building the Jinja context, so OpenAI
and Anthropic clients both render. Object args (Anthropic path) pass
through untouched; a string that doesn't parse is left as-is and the
render fails loudly (422 TemplateRenderFailed, a94dd55) rather than
silently dropping tools.

The loud-fail change earned out immediately here: opencode got a clean
422 with the exact `chat_template:120` cause instead of a degraded
session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:13:36 +03:00
a94dd55ab8 feat(neuron): fail loud (422) when a tools-bearing request can't render
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m18s
build-prerelease / Test (push) Successful in 4m12s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m10s
build-prerelease / Build neuron-ampere (push) Successful in 2m49s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s
Three of this session's bugs (system-message position, tool_call argument
shape, and the original tool rendering) all hid behind the same silent
behaviour: chat_template render fails → neuron falls back to
format_qwen3_prompt, which drops every tool → the request still returns
200 with degraded, tool-less output. Each cost real debugging time
because the failure was invisible on the wire.

build_prompt_for_request now returns Result. On a render failure it
checks whether the request carried tools: if so it returns the new
InferenceError::TemplateRenderFailed (mapped to 422 with a
template_render_failed code and the underlying Jinja error), instead of
silently degrading. A render failure with no tools still falls back
quietly — there's nothing to lose, and `format_qwen3_prompt` is a
reasonable text-only prompt. The four prompt-build call sites propagate
with `?`.
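
The decision described above, reduced to its control flow. This is a sketch only: the real build_prompt_for_request takes the request and model state, whereas here the renderer and the fallback are passed in as closures, and `PromptError` and `build_prompt` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum PromptError {
    /// Carries the underlying template error text.
    TemplateRenderFailed(String),
}

/// `render` stands in for the chat-template render; `fallback` stands in for
/// the text-only prompt builder (format_qwen3_prompt in the commit message).
pub fn build_prompt<R, F>(has_tools: bool, render: R, fallback: F) -> Result<String, PromptError>
where
    R: FnOnce() -> Result<String, String>,
    F: FnOnce() -> String,
{
    match render() {
        Ok(prompt) => Ok(prompt),
        // Tools would be silently dropped by the fallback, so fail loudly.
        Err(cause) if has_tools => Err(PromptError::TemplateRenderFailed(cause)),
        // Nothing to lose without tools: fall back quietly.
        Err(_) => Ok(fallback()),
    }
}
```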

Now the next client/template incompatibility surfaces as a loud 422 the
operator sees immediately, not a mysteriously-degraded session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:48:31 +03:00
8880b2f8a6 fix(cortex): emit tool_call arguments as an object so Qwen3.6 can chain tools
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Build helexa-bench binary (push) Successful in 2m14s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m27s
build-prerelease / Build cortex binary (push) Successful in 2m37s
build-prerelease / Test (push) Successful in 4m32s
build-prerelease / Build neuron-blackwell (push) Successful in 1m41s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 2m50s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m23s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
Verified live via the rendered-prompt trace: once a tool call is in the
conversation history, the Qwen3.6 chat template fails to render —

  render chat_template: invalid operation: cannot convert value into
  pairs (in chat_template:120)

because line 120 iterates `tool_call.arguments | items` (treats arguments
as a dict), while cortex emitted the OpenAI-standard JSON *string*. On
that render error neuron silently falls back to a tool-less prompt, so
the model loses every tool the moment it makes one call — it can make the
first tool call, read the result, then can only narrate ("now let me
check the runs") and stop, because the next turn has no tools. That's the
"drops the ball a little later" symptom: the CC trace shows the get_me
turn rendering 42653 tokens (tools present) and every subsequent
tool-history turn falling back to ~6k tokens (tools gone).

anthropic_to_openai now passes `function.arguments` as the parsed object
rather than stringifying it. Tests updated to expect the object form.

This is the same silent-fallback failure class as the system-message
merge (295b10c) — which is why making neuron's template-render fallback
LOUD (4xx on a tools-bearing request instead of a degraded 200) is now
clearly worth doing: it would have surfaced both in seconds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 16:43:17 +03:00
4e8f4e0d04 fix(neuron): don't generate <think> reasoning when the client drops it
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 1m50s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 2m36s
build-prerelease / Build neuron-ada (push) Successful in 2m37s
build-prerelease / Test (push) Successful in 4m15s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m37s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 50s
Verified live: Qwen/Qwen3.6-27B with a simple prompt and max_tokens=400
generated 400 tokens, finish_reason=length, and 0 visible characters —
the model spent the ENTIRE budget on <think> reasoning, which we then
drop for OpenAI/Anthropic clients (include_thinking=false), starving the
visible answer. This is why Claude Code "dropped the ball": empty or
truncated responses. A/B confirms the cause — same prompt with
chat_template_kwargs.enable_thinking=false yields a full 545-char answer.

The earlier prompt_opens_reasoning fix stopped the reasoning *leaking* as
text but left it consuming the token budget. Couple the two: when the
caller isn't going to see the reasoning (include_thinking=false, the
default), default chat_template_kwargs.enable_thinking to false so the
model doesn't generate it. An explicit client enable_thinking wins;
thinking-aware clients (helexa-acp, x-include-thinking: true) keep
reasoning on. Tests cover the default (false), surfacing (true), explicit
override, and preservation of other kwargs.
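
The precedence rule above as a small sketch. The function name is hypothetical, and returning `None` for "leave chat_template_kwargs untouched" when the caller will see the reasoning is an assumption about how the thinking-aware case is handled.

```rust
/// `explicit` is the client's own chat_template_kwargs.enable_thinking, if any.
/// `include_thinking` is whether the caller will be shown the reasoning.
/// Returns the value to write into the kwargs, or None to leave them alone.
pub fn resolve_enable_thinking(explicit: Option<bool>, include_thinking: bool) -> Option<bool> {
    match (explicit, include_thinking) {
        // An explicit client setting always wins.
        (Some(value), _) => Some(value),
        // The reasoning would be dropped, so do not spend the budget on it.
        (None, false) => Some(false),
        // Thinking-aware client: keep the template's own default.
        (None, true) => None,
    }
}
```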

Note: this change covers only the /v1/chat/completions path (what Claude
Code uses via cortex /v1/messages); /v1/responses could get the same
defaulting as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 15:00:50 +03:00
295b10c103 fix(cortex): merge all system content into one leading system message
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m49s
build-prerelease / Build helexa-bench binary (push) Successful in 2m14s
build-prerelease / Build cortex binary (push) Successful in 2m54s
build-prerelease / Test (push) Successful in 5m21s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Successful in 2m3s
build-prerelease / Build neuron-ampere (push) Successful in 2m52s
build-prerelease / Package cortex RPM (push) Successful in 1m34s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s
Verified live via neuron trace: Claude Code's real requests carry a
top-level `system` AND a `role:"system"` turn inside `messages`. cortex
passed the latter through at a non-first position, and Qwen3.6's chat
template hard-rejects it:

  WARN chat_template render failed; falling back to format_qwen3_prompt
  error=... invalid operation: System message must be at the beginning.

On that render error neuron silently falls back to a template that
renders NO tools, so the model got zero tool-format guidance and
improvised an unparseable `<tool><name>…` syntax — tool calling broke
entirely for real CC traffic, even though synthetic single-system
probes (and the earlier translation/parse fixes) worked.

anthropic_to_openai now accumulates the top-level `system` plus every
`role:"system"` conversation turn and emits a single system message at
index 0, with the non-system turns following in order. Reproduced the
trigger (system-role message at index>0 → fallback) and the fix
(merged → template renders tools). Test covers the merge + ordering.
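
The merge can be sketched as below. The `Message` struct, the `merge_system` name, and the blank-line separator between merged parts are assumptions; the real anthropic_to_openai works on the wire types and handles structured content.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Collects the top-level system text plus every role:"system" turn into one
/// message at index 0, then appends the remaining turns in original order.
pub fn merge_system(top_level: Option<&str>, messages: &[Message]) -> Vec<Message> {
    let mut system_parts: Vec<&str> = Vec::new();
    if let Some(text) = top_level {
        if !text.is_empty() {
            system_parts.push(text);
        }
    }
    let mut rest: Vec<Message> = Vec::new();
    for message in messages {
        if message.role == "system" {
            system_parts.push(message.content.as_str());
        } else {
            rest.push(message.clone());
        }
    }
    let mut out = Vec::with_capacity(rest.len() + 1);
    if !system_parts.is_empty() {
        out.push(Message {
            role: "system".to_string(),
            // Assumption: parts are joined with a blank line.
            content: system_parts.join("\n\n"),
        });
    }
    out.extend(rest);
    out
}
```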

Secondary hardening worth a follow-up: neuron's silent template
fallback drops tools without surfacing it to the client — a render
failure on a tools-bearing request should arguably 4xx rather than
degrade invisibly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 14:09:08 +03:00
1c485aedce feat(neuron): trace the fully rendered chat-template prompt
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 27s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m31s
build-prerelease / Build neuron-ampere (push) Successful in 2m13s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m45s
build-prerelease / Build neuron-ada (push) Successful in 3m31s
build-prerelease / Test (push) Successful in 4m39s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m49s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
Debugging tool-call format drift (Qwen3.6-27B emitting wrapper-less
<tool><name>… under Claude Code's real system prompt + 120-tool list,
which neuron's <tool_call> detector can't parse) needs ground truth on
what the model actually sees. neuron logged nothing about the rendered
prompt. Add a trace! in build_prompt_for_request emitting the full
rendered prompt + char count + tool count, so we can see whether the
chat template's <tool_call> format instruction survives a large system
prompt and how the tools render. Gated at trace (the prompt can be tens
of KB): RUST_LOG=neuron::harness::candle=trace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:38:51 +03:00
746d84c0fb fix(neuron): seed in_reasoning from the prompt so Qwen3.6 thinking isn't leaked
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 2m3s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m18s
build-prerelease / Build neuron-ada (push) Successful in 2m15s
build-prerelease / Test (push) Successful in 4m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Qwen3.6's chat template injects the opening <think> into the generation
prompt, so generation begins mid-thought and the open marker is never
sampled. The streaming loops flipped in_reasoning to true only on a
*generated* open token, so they stayed in text mode and streamed the
model's reasoning out as visible text — verified live: a tool request
returned a 255-char text block of chain-of-thought ("The user wants to
know the weather… I will construct the function call now.") ahead of the
tool_use block, with the trailing </think> stripped (close token
recognised) but no opening <think>.

Each streaming loop now seeds in_reasoning by replaying the prompt's
reasoning markers (new `prompt_opens_reasoning`): if the prompt ends
inside an open <think>, the loop starts in reasoning mode, the thinking
routes to ReasoningDelta (dropped by the chat projector's default
include_thinking=false, which is what cortex uses), and the model's
</think> flips back to visible text for the answer/tool call. Template-
agnostic and self-correcting: a prompt that doesn't open reasoning (no
think injection, enable_thinking off, non-reasoning model) starts false,
preserving current behaviour. Thinking is hidden, not disabled, so answer
quality is unaffected.

Applied to all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming). Test covers
open/close replay, multi-turn closed state, reopen-at-tail, and the
no-pair pass-through.
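
A minimal sketch of the marker replay described above, assuming the markers are the literal strings `<think>` and `</think>` and that the prompt is available as text. The signature is hypothetical; the real `prompt_opens_reasoning` may work on token ids instead.

```rust
/// Hypothetical sketch: replay the reasoning markers found in `prompt` and
/// report whether the prompt ends inside an open reasoning block.
fn prompt_opens_reasoning(prompt: &str, open: &str, close: &str) -> bool {
    let mut in_reasoning = false;
    let mut rest = prompt;
    loop {
        // Look for the next marker of either kind and apply whichever
        // comes first in the remaining text.
        match (rest.find(open), rest.find(close)) {
            (None, None) => return in_reasoning,
            (Some(o), Some(c)) if o < c => {
                in_reasoning = true;
                rest = &rest[o + open.len()..];
            }
            (Some(o), None) => {
                in_reasoning = true;
                rest = &rest[o + open.len()..];
            }
            (_, Some(c)) => {
                in_reasoning = false;
                rest = &rest[c + close.len()..];
            }
        }
    }
}
```

A streaming loop would call this once before sampling and use the result as the initial value of `in_reasoning`, so a prompt without any marker pair keeps the previous behaviour.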

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 11:03:26 +03:00
f15b9e2848 fix(neuron): parse Qwen-XML tool calls + emit tool_use stop_reason
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 2m2s
build-prerelease / Build neuron-ada (push) Successful in 2m7s
build-prerelease / Build neuron-ampere (push) Successful in 2m16s
build-prerelease / Test (push) Successful in 4m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m34s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
Verified live (commit d662fa2 logs): cortex now delivers OpenAI-shaped
tools to neuron correctly, but Qwen3.6-27B emits tool calls in the
Qwen-XML form inside the <tool_call> markers —

    <tool_call>
    <function=get_weather>
    <parameter=city>
    Brno
    </parameter>
    </function>
    </tool_call>

— while parse_tool_call_body only did serde_json::from_str expecting
{"name":…,"arguments":…}. It returned None, the dispatch re-emitted the
raw block as a text delta, and clients saw the markup as prose. cortex
logged upstream_tool_calls=false finish_reason="stop".

parse_tool_call_body is now format-tolerant: JSON first (Qwen3-Instruct
/ Hermes), then a Qwen-XML parser (Qwen3-Coder / Qwen3.6). Each
<parameter> value is coerced to its declared JSON type using a new
ToolSchemas map built from the request's tools (string stays string,
integer/number/boolean/object/array coerced, mistyped values fall back
to string so an argument is never dropped). build_tool_schemas is
threaded into all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming).
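
The Qwen-XML fallback can be sketched roughly as follows. This is an illustration only: `ArgValue`, `ToolCall`, `coerce` and `parse_qwen_xml` are hypothetical names, the schema is simplified to a parameter-name to type-name map, only string/integer/boolean are handled, and the JSON-first attempt is omitted.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum ArgValue {
    Str(String),
    Int(i64),
    Bool(bool),
}

#[derive(Debug, PartialEq)]
struct ToolCall {
    name: String,
    args: Vec<(String, ArgValue)>,
}

/// Coerce a raw parameter string to its declared type. A value that does
/// not parse as the declared type falls back to a string, so the argument
/// is never dropped.
fn coerce(raw: &str, declared: Option<&str>) -> ArgValue {
    let fallback = || ArgValue::Str(raw.to_string());
    match declared {
        Some("integer") => raw.parse::<i64>().map(ArgValue::Int).unwrap_or_else(|_| fallback()),
        Some("boolean") => raw.parse::<bool>().map(ArgValue::Bool).unwrap_or_else(|_| fallback()),
        _ => fallback(),
    }
}

/// Parse `<function=NAME><parameter=KEY>VALUE</parameter>...</function>`.
/// Returns None when the body does not have that shape.
fn parse_qwen_xml(body: &str, schema: &HashMap<String, String>) -> Option<ToolCall> {
    let name_start = body.find("<function=")? + "<function=".len();
    let name_end = name_start + body[name_start..].find('>')?;
    let name = body[name_start..name_end].trim().to_string();
    let inner_end = body.find("</function>")?;
    let mut rest = body.get(name_end + 1..inner_end)?;

    let mut args = Vec::new();
    while let Some(p) = rest.find("<parameter=") {
        let key_start = p + "<parameter=".len();
        let key_end = key_start + rest[key_start..].find('>')?;
        let val_end = key_end + rest[key_end..].find("</parameter>")?;
        let key = rest[key_start..key_end].trim().to_string();
        let raw = rest[key_end + 1..val_end].trim();
        let value = coerce(raw, schema.get(&key).map(String::as_str));
        args.push((key, value));
        rest = &rest[val_end + "</parameter>".len()..];
    }
    Some(ToolCall { name, args })
}
```

Trying JSON first and this parser second keeps the JSON-emitting models on their existing path; the fallback only runs when the JSON parse has already failed.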

Each loop also tracks emitted_tool_call and promotes the terminal
finish_reason from Stop to ToolCalls when a call parsed, so the OpenAI
chunk carries finish_reason:"tool_calls" and cortex maps it to Anthropic
stop_reason:"tool_use" — without which an Anthropic agent (Claude Code)
sees a tool_use block but stop_reason:end_turn and may not run the tool.
FinishReason::ToolCalls drops its dead_code allow.
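
The promotion and the downstream mapping can be sketched like this. The enum mirrors the variants named above plus an assumed `Length` variant; the function names and the `Length` to `max_tokens` mapping are assumptions for illustration, not taken from the commit.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum FinishReason {
    Stop,
    Length, // assumed variant, not mentioned in the commit
    ToolCalls,
}

/// Promote Stop to ToolCalls when the loop emitted at least one parsed
/// tool call; every other reason passes through unchanged.
fn terminal_finish_reason(raw: FinishReason, emitted_tool_call: bool) -> FinishReason {
    match (raw, emitted_tool_call) {
        (FinishReason::Stop, true) => FinishReason::ToolCalls,
        (other, _) => other,
    }
}

/// Gateway-side mapping from the OpenAI finish_reason to the Anthropic
/// stop_reason string.
fn anthropic_stop_reason(reason: FinishReason) -> &'static str {
    match reason {
        FinishReason::ToolCalls => "tool_use",
        FinishReason::Stop => "end_turn",
        FinishReason::Length => "max_tokens",
    }
}
```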

Tests: JSON form still parses; Qwen-XML multi-param parse with
schema-driven string/integer/boolean coercion; no-schema type sniffing;
type-mismatch string fallback; unparseable body returns None.

Known gap (separate): the non-streaming run_inference paths have no
tool-call handling at all; Claude Code streams, so the streaming loops
are the ones that matter here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 10:39:38 +03:00
d662fa20ef fix(cortex): translate Anthropic tools to OpenAI shape + wire-debug logging
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m20s
build-prerelease / Build helexa-bench binary (push) Successful in 2m6s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Test (push) Successful in 4m12s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 4m44s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 53s
Claude Code (ANTHROPIC_BASE_URL -> cortex) hits POST /v1/messages, but
anthropic_to_openai forwarded the request's `tools` array verbatim via
the flattened `extra`. neuron feeds that straight into the HF chat
template, which iterates the OpenAI shape (tool.function.name/.parameters).
Anthropic-shaped tools ({name, description, input_schema}) rendered as
broken/empty definitions, the model improvised an unparseable
<tool_use_name>...</tool_use_name> tool-call format, neuron's
<tool_call>{json}</tool_call> detector missed it, and the markup fell
through as plain assistant text — so CC never received a structured
tool_use and the agent loop died.

Request-side translation now reshapes:
- tool definitions: {name, description, input_schema}
  -> {type:"function", function:{name, description, parameters}}
- tool_choice: auto->"auto", any->"required", none->"none",
  tool->{type:"function",function:{name}}
- assistant tool_use blocks -> OpenAI assistant.tool_calls
  (arguments JSON-stringified) — fixes multi-turn
- user tool_result blocks -> standalone role:"tool" messages keyed by
  tool_call_id
- system content blocks flatten to text instead of being JSON-serialised
  into the prompt; best-effort image-block -> image_url part
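
The first two reshaping rules above can be sketched with plain types. All type and function names here are hypothetical, and the schemas are carried as raw JSON strings to keep the example dependency-free; the real translation presumably operates on JSON values.

```rust
#[derive(Debug, PartialEq)]
struct AnthropicTool {
    name: String,
    description: String,
    input_schema: String, // raw JSON schema text (assumption)
}

#[derive(Debug, PartialEq)]
struct OpenAiFunction {
    name: String,
    description: String,
    parameters: String,
}

#[derive(Debug, PartialEq)]
struct OpenAiTool {
    kind: &'static str, // serialised as "type" on the wire
    function: OpenAiFunction,
}

#[derive(Debug, PartialEq)]
enum AnthropicToolChoice {
    Auto,
    Any,
    None,
    Tool { name: String },
}

#[derive(Debug, PartialEq)]
enum OpenAiToolChoice {
    Mode(&'static str),
    Function { name: String },
}

/// {name, description, input_schema}
///   -> {type:"function", function:{name, description, parameters}}
fn translate_tool(tool: AnthropicTool) -> OpenAiTool {
    OpenAiTool {
        kind: "function",
        function: OpenAiFunction {
            name: tool.name,
            description: tool.description,
            parameters: tool.input_schema,
        },
    }
}

/// auto -> "auto", any -> "required", none -> "none", tool -> named function.
fn translate_tool_choice(choice: AnthropicToolChoice) -> OpenAiToolChoice {
    match choice {
        AnthropicToolChoice::Auto => OpenAiToolChoice::Mode("auto"),
        AnthropicToolChoice::Any => OpenAiToolChoice::Mode("required"),
        AnthropicToolChoice::None => OpenAiToolChoice::Mode("none"),
        AnthropicToolChoice::Tool { name } => OpenAiToolChoice::Function { name },
    }
}
```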

Wire-debug instrumentation (tracing levels only; cortex/neuron ship at
info, operator infra runs at debug):
- every handler emits a debug! "inbound request" line tagging the wire
  surface (anthropic | openai-chat | openai-responses | openai-completions)
  plus model/stream/tools and, for Anthropic, tool_history/system
- response side reports upstream_tool_calls + finish_reason, streaming
  and non-streaming
- full inbound + translated-upstream bodies at trace! (UTF-8-safe, capped)

Tests: 8 request-side unit tests + an end-to-end gateway test asserting
the upstream neuron receives OpenAI-shaped tools and a
user->assistant(+tool_calls)->tool->user history.

Also tighten script/infra-log-verbosity.sh: independent cortex/neuron
RUST_LOG args, cortex-only by default (neuron restart behind
--with-neuron so we don't needlessly cold-reload models), mkdir -p the
drop-in dir, symmetric RUST_LOG cleanup, and set -euo pipefail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 09:58:25 +03:00
d04f4ad704 feat(bench): show GPUs as the resource name instead of hostnames
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Successful in 2m34s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m54s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m15s
build-prerelease / Test (push) Successful in 5m11s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 56s
Public visitors don't know the hostnames, so surface each host's GPU(s)
as the resource name across the UI.

- store: gpu_label() turns the stored gpus_json into a compact label
  ("2× RTX 5090", "RTX 4090"); add `gpu` to ReportRow + RunRow and
  `host_gpus`/`model_gpus` maps to /api/dimensions (from each one's
  latest run). render_json gains gpu too.
- UI: Overview + Runs show a "GPU" column (gpu, fallback host); Runs'
  filter is now GPU-labelled (still filters by host underneath); Trends
  shows a "Measured on <gpu>" line for the selected model.
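
One way the label could be built, as a sketch: it assumes the stored `gpus_json` has already been decoded into a list of GPU names, and the " + " separator for mixed hosts is an assumption, not something the commit specifies.

```rust
/// Hypothetical sketch of gpu_label(): collapse repeated GPU names into a
/// compact "N× NAME" label, keeping first-seen order.
fn gpu_label(names: &[&str]) -> String {
    let mut counts: Vec<(&str, usize)> = Vec::new();
    for &name in names {
        if let Some(i) = counts.iter().position(|(n, _)| *n == name) {
            counts[i].1 += 1;
        } else {
            counts.push((name, 1));
        }
    }
    counts
        .iter()
        .map(|(name, count)| {
            if *count > 1 {
                format!("{count}× {name}")
            } else {
                name.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" + ")
}
```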

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:29:13 +03:00
e3879f093a feat(bench-ui): drop host selector from Trends; resolve host server-side
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m38s
build-prerelease / Test (push) Successful in 4m47s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Successful in 2m2s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m22s
Public visitors don't know the hostnames or per-host hardware, so the
host picker on Trends was confusing. Select by model + scenario only;
/api/series now takes host as optional and resolves it to the host
serving that (model, scenario) — coherent since each model maps to one
host today. Runs (drill-down) keeps its host filter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:19:09 +03:00
f50f5531cf feat(bench): read-only JSON API on bob + bench/ React visualisation app
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m21s
build-prerelease / Build cortex binary (push) Successful in 2m27s
build-prerelease / Build helexa-bench binary (push) Successful in 2m44s
build-prerelease / Test (push) Successful in 4m32s
build-prerelease / Build neuron-ampere (push) Successful in 2m7s
build-prerelease / Build neuron-ada (push) Successful in 2m28s
build-prerelease / Build neuron-blackwell (push) Successful in 2m59s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m39s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m42s
Part A — helexa-bench read API:
- [api] config (enabled, listen :13132); WAL on the store so API reads
  never block the sweep writer.
- store read methods: summary, series (chronological per-build medians),
  runs (filtered), dimensions, run_count.
- api.rs: axum /api/health|dimensions|summary|series|runs, permissive
  CORS (UI is a separate origin). The `run` daemon binds the API
  alongside the sweep; new `serve` subcommand serves API-only.
- listener plumbing (bench gains a port): data/helexa-bench-firewalld.xml,
  spec install, deploy-bench /api/health probe + firewalld step, sudoers
  firewall-cmd grants, [api] in example + bob.toml.
- 5 API tests + serve smoke.

Part B — bench/ Vite + React-SWC-TS app (router, react-bootstrap,
recharts): Overview (summary table), Trends (decode tok/s & TTFT across
build SHAs), Runs (filterable explorer). Typed API client with
VITE_API_BASE + dev proxy to bob. npm build/typecheck clean. Hosted
separately from the API (per design); .gitignore excludes node_modules/dist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:26:55 +03:00
42da25a37c feat(bench): version-aware benchmark harness + neuron build metadata
All checks were successful
CI / CUDA type-check (push) Successful in 1m36s
CI / Format (push) Successful in 31s
CI / Clippy (push) Successful in 2m47s
CI / Test (push) Successful in 4m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Adds automated, longitudinal performance tracking across neuron builds,
replacing manual script/bench.py runs and hand edits to benchmarks.md.

neuron build metadata + GET /version:
- cortex-core: shared BuildInfo type (build_info.rs).
- neuron build.rs captures git SHA (preferring injected HELEXA_BUILD_SHA,
  else git, else "unknown"), dirty flag, build timestamp, rustc version,
  profile, target, enabled cargo features, and best-effort candle-core
  version from Cargo.lock.
- New GET /version endpoint (version.rs) + clap --version long form.
- SHA injected in CI (build-neuron step) and helexa-neuron.spec
  (%{?helexa_commit}) so tarball RPMs report the real SHA. /version is
  now the canonical "which build is live" probe.

helexa-bench crate:
- Continuous daemon: hits each neuron directly on :13131, exercises each
  warm (status==loaded) model, records every run into a SQLite
  system-of-record stamped with the neuron's full BuildInfo.
- Version-aware: skips any (target, build SHA, model, scenario) cell
  already at samples_per_version, so a steady fleet costs only cheap
  /version + /models polls until a new SHA ships.
- Extensible Scenario trait; phase-1 chat-latency family ported verbatim
  from bench.py (synthetic 128/4096-tok prompts, /no_think, streamed
  TTFT + decode-window tok/s). `report` regenerates the benchmarks table.
- kind="openai" comparison targets scaffolded, not yet wired.
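
The version-aware skip can be pictured as a counter per (target, build SHA, model, scenario) cell. The sketch below keeps the counts in memory with hypothetical `Cell` and `SampleLedger` types; the actual crate reads them from its SQLite store.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Cell {
    target: String,
    build_sha: String,
    model: String,
    scenario: String,
}

struct SampleLedger {
    samples_per_version: usize,
    counts: HashMap<Cell, usize>,
}

impl SampleLedger {
    fn new(samples_per_version: usize) -> Self {
        Self { samples_per_version, counts: HashMap::new() }
    }

    /// True while the cell still has fewer samples than the configured quota.
    fn needs_run(&self, cell: &Cell) -> bool {
        self.counts.get(cell).copied().unwrap_or(0) < self.samples_per_version
    }

    /// Record one completed run for the cell.
    fn record(&mut self, cell: Cell) {
        *self.counts.entry(cell).or_insert(0) += 1;
    }
}
```

Because the build SHA is part of the key, a new neuron build yields fresh cells and the sweep resumes on its own.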

Packaging: data/helexa-bench.service (+ sysusers), prebuilt-binary RPM
spec (outbound-only, no firewalld), and build/package/publish wiring in
build-prerelease.yml with change detection.

Tests: cortex-core BuildInfo round-trip, neuron GET /version integration,
helexa-bench unit (prompt/SSE/config/store) + end-to-end sweep
(record -> skip -> resume on new SHA). Docs updated (benchmarks.md,
CLAUDE.md addendum).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 15:26:02 +03:00
ec764a2cac feat(neuron): speculative decoding — acceptance core + config (#25, phase 1)
All checks were successful
CI / Format (push) Successful in 36s
CI / Format (pull_request) Successful in 31s
CI / CUDA type-check (push) Successful in 1m50s
CI / CUDA type-check (pull_request) Successful in 1m44s
CI / Clippy (push) Successful in 2m38s
CI / Test (push) Successful in 4m21s
CI / Clippy (pull_request) Successful in 2m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m37s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
First phase of speculative decoding: the pure, state-free acceptance
logic and per-target config, unit-tested in isolation before the
draft/verify loop and GDN-state rollback wire it into the generation
path.

greedy_accept walks the drafter's K proposed tokens against the
target's greedy token at each of the K+1 positions, accepting the
longest matching prefix and always committing one bonus token on top
(the target's correction at the first mismatch, or a free extra token
when the whole draft matched). So a round commits 1..=K+1 tokens —
never zero, guaranteeing forward progress even with a useless drafter.
Greedy is exact for temperature-0 (the fleet probe + #22 bench
regime); stochastic acceptance is a later phase.
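
The acceptance rule described above fits in a few lines. This is a sketch with an assumed signature (token ids as `u32`, slices in, committed tokens out); the real `greedy_accept` may return counts or a different type.

```rust
/// Sketch of greedy acceptance. `draft` holds the drafter's K proposed
/// tokens; `target` holds the target's greedy token at each of the K+1
/// positions. Returns the tokens committed this round (1..=K+1 of them).
fn greedy_accept(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(target.len(), draft.len() + 1, "target must cover K+1 positions");
    // Length of the longest prefix on which drafter and target agree.
    let matched = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();
    // Accepted prefix, then the bonus token: the target's correction at the
    // first mismatch, or the free extra token when the whole draft matched.
    let mut committed = draft[..matched].to_vec();
    committed.push(target[matched]);
    committed
}
```

Since `target[matched]` always exists, the round commits at least one token even when the first drafted token is wrong.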

SpeculativeConfig carries the drafter id (must share the target's
tokenizer — Qwen3.5-0.8B for the Qwen3.6-27B target, both qwen3_5,
byte-identical tokenizer, confirmed on beast) and the draft length K.

6 unit tests: full accept, partial accept, zero accept (progress
guarantee), last-position mismatch, single-token draft, config
gating. Not yet wired into the decode path — phase 2 (single-GPU
draft/verify) follows. Design + phasing on the issue.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 08:30:21 +03:00
988ef5afc2 feat(neuron): chunk the single-GPU vision prefill (parity with TP) (#18)
All checks were successful
CI / Format (push) Successful in 31s
CI / Format (pull_request) Successful in 41s
CI / CUDA type-check (pull_request) Successful in 1m27s
CI / CUDA type-check (push) Successful in 2m11s
CI / Clippy (push) Successful in 2m38s
CI / Clippy (pull_request) Successful in 2m31s
CI / Test (pull_request) Successful in 4m13s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Test (push) Successful in 4m39s
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
The single-GPU vision path was still single-shot: a long vision-bearing
prompt to a single-GPU-loaded qwen3_5 had the OOM exposure the TP path
shed in fa01350 (it was only guard-rejected, never served).

Mirror TpQwen3_5ForCausalLM::prefill_with_images_chunked onto the
single-GPU Qwen3_5ForCausalLM: encode the image(s) once, walk the
pre-expanded prompt in prefill_chunk_tokens() windows splicing the
per-chunk <|image_pad|> rows, accumulate KV + GDN state across chunks
via the growing offset, keep the last chunk's logits. Interleaved
M-RoPE positions are computed once over the whole prompt and sliced
per chunk (an image compresses the position space, so per-chunk offset
arithmetic would be wrong) — so Qwen3_5Model::forward_inner gains an
explicit position_ids path alongside the internal-from-grids
(single-shot) and plain (text/decode) paths, plus a forward_with_positions
entry point. The device-worker ForwardLogitsWithImages handler now
calls the chunked method; chunk size comes from prefill_chunk_tokens()
on the worker thread, so the Job/handle surface and the callers are
unchanged.

The shared validate_vision_prefill VRAM/KV backstop stays (TP keeps it
too) — chunking bounds activation memory, not the accumulating KV
cache, so the guard still does useful work.

Verified on real weights (Qwen3.5-0.8B): extended the #15 vision
reference test to also run the chunked path with chunk_size=64 over the
217-token prompt (4 chunks; the ~196-token image-pad run spans them).
Chunked vs single-shot logits: cosine 1.000000, max_abs 0.0001;
argmax matches the HF reference. The test covers all three
forward_inner branches (text plain / single-shot vision / chunked
vision) on a real single-GPU qwen3_5 load.
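
The chunk walk itself is simple window arithmetic. The helper below is a hypothetical illustration (not the production code) of how a prompt is split and how a position vector computed once over the whole prompt would be sliced per chunk rather than recomputed from a per-chunk offset.

```rust
use std::ops::Range;

/// Split `len` prompt tokens into consecutive windows of at most `chunk`.
fn chunk_ranges(len: usize, chunk: usize) -> Vec<Range<usize>> {
    assert!(chunk > 0, "chunk size must be positive");
    (0..len)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(len))
        .collect()
}

/// Slice whole-prompt positions for one chunk. The positions are computed
/// once up front because an image compresses the position space.
fn chunk_positions(all_positions: &[u32], range: &Range<usize>) -> Vec<u32> {
    all_positions[range.clone()].to_vec()
}
```

For the 217-token test prompt with a chunk size of 64 this yields four windows, the last one 25 tokens long.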

Closes #18

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 08:17:11 +03:00
1c4b53cbf1 feat(neuron): numerical validation against the transformers reference (#15)
All checks were successful
CI / Format (push) Successful in 44s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 1m39s
CI / CUDA type-check (pull_request) Successful in 2m6s
CI / Clippy (push) Successful in 2m23s
CI / Clippy (pull_request) Successful in 2m24s
CI / Test (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m1s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
script/dump_reference.py captures fixtures from the HF qwen3_5
implementation (token ids + reference tensors, f32 by default so the
comparison pins math rather than dtype noise);
tests/numerical_reference.rs replays them through our arch and
asserts argmax equality, cosine similarity, and max-abs ceilings. The
tests self-skip without NEURON_REF_MODEL_PATH so CI stays green
without weights.
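
The three assertions boil down to small metrics over two logit vectors. A sketch with hypothetical helper names, accumulating in f64 so the metric itself adds little rounding noise:

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        dot += x as f64 * y as f64;
        na += x as f64 * x as f64;
        nb += y as f64 * y as f64;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Largest element-wise absolute difference.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).fold(0.0f32, |m, (x, y)| m.max((x - y).abs()))
}

/// Index of the largest element, or None for an empty slice.
fn argmax(v: &[f32]) -> Option<usize> {
    v.iter()
        .enumerate()
        .max_by(|x, y| x.1.total_cmp(y.1))
        .map(|(i, _)| i)
}
```

A comparison test would then assert `argmax(ours) == argmax(reference)`, a cosine floor, and a `max_abs_diff` ceiling chosen per fixture.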

Measured on beast (f32-vs-f32): text logits max_abs 0.000 / cosine
1.000000 (the >64-token prompt routes through the chunked GDN
prefill, so the production prefill math is what's validated); vision
tower cosine 0.999998, end-to-end vision logits cosine 1.000000 with
identical argmax. Mutation sensitivity: NEURON_VISION_LEGACY_POS=1
collapses tower cosine to 0.75 and fails loudly.

One production fidelity fix the harness surfaced: the pos-embed
bilinear blend now accumulates in f32 and casts once at the end,
matching the reference (we previously rounded the weights to bf16
before blending).

Fixtures: 0.8B text + vision (f32), 27B text (bf16 — an f32 27B
forward needs ~108 GB; the automated comparison runs against the
0.8B, which executes the same arch modules). Regeneration documented
in tests/fixtures/numerical/README.md.

Closes #15

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:35:57 +03:00
90e971dcf5 perf(neuron): parallel in-situ quantization + cold-load phase timing (#1)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 35s
CI / CUDA type-check (push) Successful in 1m50s
CI / CUDA type-check (pull_request) Successful in 2m7s
CI / Clippy (pull_request) Successful in 2m18s
CI / Clippy (push) Successful in 2m46s
CI / Test (push) Successful in 5m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 5m33s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
QTensor::quantize runs its per-block math strictly sequentially on
one core (CUDA storage round-trips through the same CPU path), which
made Q6K ISQ the dominant phase of the 27B TP cold load. Blocks are
independent, so quantize_parallel re-implements the same encoding
through candle's public per-block API (k_quants::GgmlType::from_float)
with rayon fanning blocks across the CPU pool — byte-identical output,
pinned by parity tests against QTensor::quantize for Q6K/Q5K/Q4K/Q8_0.
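
The idea can be illustrated without candle or rayon. The sketch below is a stand-in, not the commit's code: `quantize_block` is a toy encoder (one f32 scale plus 32 signed bytes per block), and `std::thread::scope` replaces the rayon pool. What it shows is that independent blocks can be encoded concurrently and still produce output byte-identical to the sequential loop.

```rust
use std::thread;

const BLOCK: usize = 32; // values per block (toy format)
const BLOCK_BYTES: usize = 4 + BLOCK; // f32 scale + one byte per value

/// Toy per-block encoder; depends only on its own block.
fn quantize_block(block: &[f32], out: &mut [u8]) {
    let amax = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    out[..4].copy_from_slice(&scale.to_le_bytes());
    for (o, v) in out[4..].iter_mut().zip(block) {
        *o = (v / scale).round() as i8 as u8;
    }
}

/// Reference path: blocks encoded one after another on the calling thread.
fn quantize_sequential(data: &[f32]) -> Vec<u8> {
    let mut out = vec![0u8; data.len() / BLOCK * BLOCK_BYTES];
    for (b, o) in data.chunks_exact(BLOCK).zip(out.chunks_exact_mut(BLOCK_BYTES)) {
        quantize_block(b, o);
    }
    out
}

/// Parallel path: each worker gets a contiguous run of blocks and a
/// disjoint slice of the output buffer, so no synchronisation is needed.
fn quantize_parallel(data: &[f32], workers: usize) -> Vec<u8> {
    let n_blocks = data.len() / BLOCK;
    let mut out = vec![0u8; n_blocks * BLOCK_BYTES];
    let w = workers.max(1);
    let per_worker = ((n_blocks + w - 1) / w).max(1);
    thread::scope(|s| {
        let inputs = data[..n_blocks * BLOCK].chunks(per_worker * BLOCK);
        let outputs = out.chunks_mut(per_worker * BLOCK_BYTES);
        for (d, o) in inputs.zip(outputs) {
            s.spawn(move || {
                for (b, ob) in d.chunks_exact(BLOCK).zip(o.chunks_exact_mut(BLOCK_BYTES)) {
                    quantize_block(b, ob);
                }
            });
        }
    });
    out
}
```

The worker threads only touch host memory here, which matches the threading discipline the commit describes for the real implementation.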

Threading discipline holds: the device-to-host read and the
QStorage::from_data upload stay on the calling thread (device worker /
subprocess main); rayon workers touch host memory only.

Also adds the per-phase timing the issue asked for first: per-layer
debug + layer-loop total + lm_head info lines, so the next cold load
shows where the time actually goes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:47:57 +03:00
812d191e50 fix(neuron): UT transform by forward substitution, not nilpotent squaring
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 53s
CI / CUDA type-check (push) Successful in 1m52s
CI / CUDA type-check (pull_request) Successful in 2m12s
CI / Clippy (push) Successful in 2m18s
CI / Clippy (pull_request) Successful in 2m36s
CI / Test (push) Successful in 4m18s
CI / Test (pull_request) Successful in 4m22s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Live A/B on beast produced NaN logits ("!!!" replies) on real prompts:
the nilpotent-squaring form of (I - T)^-1 computes raw powers of T,
whose entries grow combinatorially (path counts ~ C(62,31)) before
nilpotency collapses them — fine on uncorrelated test data, f32
precision death on real prompts whose repetitive text makes keys
highly correlated. The reference's forward-substitution loop never
forms raw powers; its intermediates are the convergent M entries.

Port the reference loop faithfully (rows accumulate into a fresh
tensor). New adversarial parity test with near-identical keys and
beta ~= 1 diverges to 8e30 under the squaring form and passes under
forward substitution.
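
For reference, both forms on a small dense matrix. This is a sketch on `Vec<Vec<f64>>` rather than candle tensors, with hypothetical function names; it computes M = (I - T)^-1 - I for strictly lower-triangular T. The forward-substitution loop only ever multiplies entries of T by already-final entries of M, while the squaring form materialises raw powers of T.

```rust
type Mat = Vec<Vec<f64>>;

fn identity(n: usize) -> Mat {
    (0..n)
        .map(|i| (0..n).map(|j| if i == j { 1.0 } else { 0.0 }).collect())
        .collect()
}

fn matmul(a: &Mat, b: &Mat) -> Mat {
    let n = a.len();
    let mut out = vec![vec![0.0; n]; n];
    for i in 0..n {
        for k in 0..n {
            for j in 0..n {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

/// Row-by-row solve of M = T + T*M, writing into a fresh matrix.
fn ut_forward_substitution(t: &Mat) -> Mat {
    let n = t.len();
    let mut m = vec![vec![0.0; n]; n];
    for i in 0..n {
        for c in 0..i {
            // M[i][c] = T[i][c] + sum over c < j < i of T[i][j] * M[j][c]
            let mut acc = t[i][c];
            for j in (c + 1)..i {
                acc += t[i][j] * m[j][c];
            }
            m[i][c] = acc;
        }
    }
    m
}

/// Same result in exact arithmetic: prod_j (I + T^(2^j)) - I.
fn ut_by_squaring(t: &Mat) -> Mat {
    let n = t.len();
    let (mut acc, mut power, mut span) = (identity(n), t.clone(), 1);
    while span < n {
        let mut factor = power.clone();
        for i in 0..n {
            factor[i][i] += 1.0;
        }
        acc = matmul(&acc, &factor);
        power = matmul(&power, &power); // raw power of T: the unstable part
        span *= 2;
    }
    for i in 0..n {
        acc[i][i] -= 1.0;
    }
    acc
}
```

On benign inputs the two agree to rounding error; the difference only shows up when the entries of the raw powers grow far beyond the entries of M.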

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:18:32 +03:00
2a9def6d2d perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 24s
CI / CUDA type-check (push) Successful in 1m38s
CI / CUDA type-check (pull_request) Successful in 2m10s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m29s
CI / Test (pull_request) Successful in 4m21s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Prefill (seq_len >= 64) now runs the chunk-parallel gated delta rule
ported from the HF reference torch_chunk_gated_delta_rule
(chunk_size=64): identical math reorganised into per-chunk batched
matmuls (cuBLAS/tensor cores on CUDA, gemm on CPU) instead of the
O(L)-sequential per-token recurrence. Decode steps and short prompts
keep the recurrent paths (CUDA kernel / Rust loop) unchanged.

One deliberate deviation from the reference: its in-place row-by-row
UT-transform computes (I - T)^-1 - I by forward substitution; T is
strictly lower triangular and therefore nilpotent at chunk size 64
(T^64 = 0), so the same inverse is the product of six factors built by
repeated squaring,
prod_{j=0..5}(I + T^(2^j)) — batched matmuls instead of 63 sequential
row updates, which suits candle's immutable tensors. Chunk-local math
runs rank-3 over a flattened B*H*N batch dim (candle matmul supports
at most two batch dims).
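
The identity behind the product form, written out for a strictly lower-triangular 64x64 T. It holds in exact arithmetic only and says nothing about floating-point behaviour:

```latex
\prod_{j=0}^{5}\bigl(I + T^{2^{j}}\bigr)
  \;=\; \sum_{k=0}^{63} T^{k}
  \;=\; (I - T)^{-1},
\qquad \text{since } T^{64} = 0 .
```

The first equality is the binary expansion of the exponents 0..63: each k is a unique sum of distinct powers of two below 64, so each T^k appears exactly once when the product is expanded. The second follows from (I - T)(I + T + ... + T^63) = I - T^64.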

Initial-state continuation is supported, so chunked prefill composes
with #11's restored prefix snapshots. Both single-GPU and TP paths
pick this up through the shared run_delta_rule dispatch.
NEURON_GDN_CHUNKED=0 forces the recurrent paths for A/B measurement.

Parity tests pin chunked against recurrent (2e-4 abs) across padding
(L=130), exact multiples with non-zero initial state (L=128 after a
50-token prefix), and a single exact chunk.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:51:51 +03:00
4f266dbd82 fix(neuron): snapshot at the last special-token boundary (#11)
All checks were successful
CI / Format (push) Successful in 42s
CI / Format (pull_request) Successful in 34s
CI / CUDA type-check (push) Successful in 1m31s
CI / Clippy (push) Successful in 2m19s
CI / CUDA type-check (pull_request) Successful in 2m10s
CI / Test (push) Successful in 4m13s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m9s
CI / Test (pull_request) Successful in 4m5s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Second finding from live 27B validation: prompt-covering snapshots
still never matched. The rendered prompt ends with
`<|im_start|>assistant\n`, and when the next turn re-tokenizes that
text followed by the assistant's reply, BPE merges the trailing
newline with the reply's first characters — the final token(s) of the
cached sequence differ from the next prompt's, so the exact-prefix
match never fires. (A reply starting with an atomic special token
like <think> masks this, which is why the 0.8B check passed.)

Snapshot one past the last <|im_start|> instead: special tokens are
hard segmentation points, so ids up to and including it are provably
identical across renders. Prefill pauses at that boundary to capture
the snapshot, then finishes the ~2-token `assistant\n` tail. Applied
to all six request paths; unit tests for the cut helper.
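
A minimal sketch of what a cut helper along these lines might look like. The function name, the `u32` token-id type and the ids in the usage note are assumptions for illustration, not the code from this commit.

```rust
/// Number of leading prompt tokens to snapshot: one past the last
/// occurrence of `boundary_id` (e.g. the id of the chat-template start
/// token), or `None` when the prompt contains no such token.
fn snapshot_cut(prompt_ids: &[u32], boundary_id: u32) -> Option<usize> {
    prompt_ids
        .iter()
        .rposition(|&id| id == boundary_id)
        .map(|i| i + 1)
}
```

With a made-up boundary id of 9, `snapshot_cut(&[1, 9, 5, 6, 9, 7, 8], 9)` returns `Some(5)`: the first five ids are snapshotted and the two-token tail is prefilled afterwards.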

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:16:45 +03:00
3fd1989b2b fix(neuron): snapshot prefix cache at the prefill boundary (#11)
All checks were successful
Live validation on beast's Qwen3.6-27B showed reused=0 on every turn:
the post-generation snapshot includes reasoning tokens (<think>...)
that get stripped when the client echoes the assistant message back,
so the cached sequence is never a token-prefix of the next prompt.
quadbrat's 0.8B only matched because its think block round-tripped as
literal text.

Snapshot after prefill instead (covering exactly the prompt tokens) —
that is the state the next turn provably extends under a stable chat
template, regardless of how reasoning or tool-call content is
transformed on echo. Taken after the first healthy sample so
NaN-poisoned prefills never cache their state; this also retires the
forwarded-token bookkeeping and the consumer-hangup store sites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 18:29:00 +03:00
7e66f77851 fix(neuron): CUDA type-check fixes for TP prefix cache
All checks were successful
Two errors only the cuda config surfaces: the TpSnapshotKv dispatch
arms mixed candle and anyhow error types, and restore_or_clear_tp held
the registry MutexGuard across the cleanup await inside a let-chain
(making the TP request futures non-Send). Bind the removed ref before
awaiting, same discipline as the other lock sites.
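
A minimal sketch of the lock discipline described above, using only the standard library. `Entry`, `drop_snapshot`, `assert_send` and the tiny `block_on` are hypothetical stand-ins (the last one exists only so the sketch runs without an async runtime); the real registry and cleanup call differ.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical registry entry.
struct Entry {
    snapshot_id: u64,
}

/// Hypothetical async cleanup; stands in for the per-rank drop call.
async fn drop_snapshot(snapshot_id: u64) -> u64 {
    snapshot_id
}

/// Compile-time check that a future can cross threads.
fn assert_send<F: Future + Send>(fut: F) -> F {
    fut
}

async fn restore_or_clear(
    registry: &Mutex<HashMap<String, Entry>>,
    model: &str,
) -> Option<u64> {
    // The guard is a temporary that is dropped at the end of this `let`,
    // so no std::sync::MutexGuard (which is !Send) is alive at the await.
    let removed = registry.lock().unwrap().remove(model);
    match removed {
        Some(entry) => Some(drop_snapshot(entry.snapshot_id).await),
        None => None,
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for futures that never actually suspend.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Writing the lock call inside the condition of an `if let` instead keeps the guard temporary alive through the body, which is the shape that makes the future non-Send.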

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:39:32 +03:00
e629e1872c feat(neuron): prefix KV caching for the TP path (#11)
Some checks failed
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 31s
CI / CUDA type-check (push) Failing after 1m55s
CI / CUDA type-check (pull_request) Failing after 1m47s
CI / Clippy (push) Successful in 2m11s
CI / Test (push) Successful in 4m15s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m23s
CI / Test (pull_request) Successful in 4m0s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Extends the prefix cache to tensor-parallel models — Qwen3.6-27B on
beast, where the TTFT win is largest. Closes #11.

Every rank holds its shard's snapshot under one pool-minted id: the
leader's lives in the device worker beside the TP slab
(Job::TpSnapshotKv / TpRestoreKv / TpDropKvSnapshot), each subprocess
rank stores its own in-process via new WorkerRequest variants
(SnapshotKvCache / RestoreKvCache / DropKvSnapshot). Shard state has
the same shape as single-GPU (attention ConcatKvCache + GDN
conv/recurrent state + rope_delta), so the snapshot types are reused;
all ranks sit at the same token boundary because step fan-out is
synchronous.

Consistency on partial failure: a failed restore falls back to
clear-all-ranks + full prefill (and drops the entry); a failed
snapshot drops the id on every rank so nothing half-stored leaks.
DropTp / UnloadModel invalidate a model's snapshots with it, covering
auto-recovery. Vision requests bypass as on single-GPU. Budget
accounting uses leader bytes × world_size (shards are symmetric).
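
A minimal synchronous sketch of the fallback and budget rules in the paragraph above. The `Rank` trait, `FakeRank` and both function names are hypothetical; the real ranks are a device worker plus subprocesses driven asynchronously.

```rust
/// Hypothetical per-rank interface.
trait Rank {
    fn restore(&mut self, snapshot_id: u64) -> Result<(), String>;
    fn clear(&mut self);
}

/// Returns the prefill start offset: `restored_len` when every rank
/// restored its shard, 0 (full prefill) after clearing all ranks otherwise.
fn restore_or_clear_all<R: Rank>(ranks: &mut [R], snapshot_id: u64, restored_len: usize) -> usize {
    // `all` stops at the first failure; that is fine because every rank
    // is cleared below, whether or not it attempted the restore.
    let ok = ranks.iter_mut().all(|r| r.restore(snapshot_id).is_ok());
    if ok {
        restored_len
    } else {
        for r in ranks.iter_mut() {
            r.clear();
        }
        0
    }
}

/// Budget charge for one snapshot, assuming symmetric shards.
fn budget_bytes(leader_bytes: u64, world_size: u64) -> u64 {
    leader_bytes * world_size
}

/// In-memory stand-in used to exercise the sketch.
struct FakeRank {
    restore_ok: bool,
    cleared: bool,
}

impl FakeRank {
    fn new(restore_ok: bool) -> Self {
        FakeRank { restore_ok, cleared: false }
    }
}

impl Rank for FakeRank {
    fn restore(&mut self, _snapshot_id: u64) -> Result<(), String> {
        if self.restore_ok { Ok(()) } else { Err("restore failed".into()) }
    }
    fn clear(&mut self) {
        self.cleared = true;
    }
}
```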

Wired into both TP request paths (non-streaming inner + streaming
orchestration task); chunked_prefill_tp gains the restored-offset
start.

Closes #11

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:34:49 +03:00
c5378d532d feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)
All checks were successful
Stop discarding cache state between requests. When an incoming
prompt's token sequence starts with the exact tokens of a stored
snapshot, restore it and prefill only the divergent suffix.

For the hybrid qwen3_5 arch a snapshot is attention ConcatKvCache k/v
+ GatedDeltaNet conv/recurrent state + the rope_delta counter, all at
one token boundary; the recurrent state cannot rewind, so matching is
exact-prefix only. GDN states are deep-copied both directions (the
CUDA delta-rule kernels mutate the state buffer in place); attention
k/v snapshots share storage safely (append-by-cat never mutates).

Snapshots live in the device worker's state next to the model slab
(Job::SnapshotKv / RestoreKv / DropKvSnapshot); the async side holds
only an opaque id + token sequence + byte size. DropArch drops a
model's snapshots with it, so unload and auto-recovery invalidate for
free. CPU loads hold snapshots inline on the legacy path.

Per-model LRU registry (harness/prefix_cache.rs) bounded by
[harness.candle.prefix_cache] budget_mb / max_entries, enabled by
default; inserting a snapshot drops entries it strictly extends.
Vision requests and candle-transformers archs bypass the cache
entirely (clear-every-request, unchanged).
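
A minimal sketch of an exact-prefix registry with the two rules stated above: the longest stored sequence that prefixes the prompt wins, and inserting a snapshot drops the entries it strictly extends. All names are hypothetical, the byte budget (budget_mb) is left out, and the real registry also has to drop the device-side snapshot when an entry goes away.

```rust
/// Hypothetical entry: opaque snapshot id plus the token ids it covers.
struct Entry {
    id: u64,
    tokens: Vec<u32>,
    last_used: u64,
}

struct PrefixRegistry {
    entries: Vec<Entry>,
    max_entries: usize,
    clock: u64,
}

impl PrefixRegistry {
    fn new(max_entries: usize) -> Self {
        PrefixRegistry { entries: Vec::new(), max_entries, clock: 0 }
    }

    /// Longest stored sequence that is an exact prefix of `prompt`,
    /// returned as (snapshot id, number of tokens already covered).
    fn lookup(&mut self, prompt: &[u32]) -> Option<(u64, usize)> {
        self.clock += 1;
        let clock = self.clock;
        let best = self
            .entries
            .iter_mut()
            .filter(|e| prompt.starts_with(&e.tokens))
            .max_by_key(|e| e.tokens.len())?;
        best.last_used = clock;
        Some((best.id, best.tokens.len()))
    }

    fn insert(&mut self, id: u64, tokens: Vec<u32>) {
        self.clock += 1;
        // Drop entries the new snapshot strictly extends.
        self.entries
            .retain(|e| !(tokens.len() > e.tokens.len() && tokens.starts_with(&e.tokens)));
        self.entries.push(Entry { id, tokens, last_used: self.clock });
        // Evict least-recently-used entries beyond the cap.
        while self.entries.len() > self.max_entries {
            let (idx, _) = self
                .entries
                .iter()
                .enumerate()
                .min_by_key(|(_, e)| e.last_used)
                .unwrap();
            self.entries.remove(idx);
        }
    }
}
```

The caller prefills only `prompt[covered..]` after a hit and the whole prompt after a miss.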

Covers the single-GPU worker path (streaming + non-streaming) and the
CPU-local path. The TP path (Qwen3.6-27B on beast) is a follow-up PR
that closes #11 with before/after bench numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:14:07 +03:00
569c528c4b feat(gateway): Anthropic streaming SSE translation (#24)
All checks were successful
The /v1/messages handler translated request envelopes but proxied raw
OpenAI SSE frames back to streaming Anthropic clients — the gap
between the README's "point your tooling at it once" contract and
what Claude Code actually received.

cortex-core gains AnthropicStreamTranslator, a pure per-stream state
machine: OpenAI chunks in, ordered (event, payload) pairs out —
message_start → content_block_start/delta/stop (text and tool_use
blocks, indexed; tool_calls map to input_json_delta) → message_delta
(stop_reason mapped via the now-shared map_stop_reason, which also
teaches the non-streaming path tool_calls→tool_use) → message_stop.
Without an upstream usage frame the output count falls back to the
delta count (engine-exact for neuron's one-chunk-per-token streams,
#31); with one, input/output tokens ride message_delta.
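
A minimal sketch of the stop-reason mapping and of the event order for a text-only stream. The function bodies are assumptions: the message only states that tool_calls maps to tool_use and that defaults exist, so the `end_turn` default here is illustrative, and `text_stream_events` is a hypothetical helper.

```rust
/// Maps an OpenAI `finish_reason` to an Anthropic `stop_reason`.
fn map_stop_reason(finish_reason: Option<&str>) -> &'static str {
    match finish_reason {
        Some("length") => "max_tokens",
        Some("tool_calls") => "tool_use",
        // "stop", a missing value, or anything unrecognised is treated
        // as a normal end of turn (assumed default).
        _ => "end_turn",
    }
}

/// Event names emitted for a stream carrying `n_deltas` text chunks
/// in a single content block.
fn text_stream_events(n_deltas: usize) -> Vec<&'static str> {
    let mut events = vec!["message_start", "content_block_start"];
    events.extend(std::iter::repeat("content_block_delta").take(n_deltas));
    events.extend(["content_block_stop", "message_delta", "message_stop"]);
    events
}
```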

cortex-gateway gains anthropic_sse: the wire pump that splits the
upstream byte stream into SSE events, parses data: payloads
(leniently — engines omit fields on special frames), feeds the
translator, and frames results as `event:`/`data:` pairs through a
bounded channel (slow client back-pressures the upstream read).
Upstream truncation without [DONE] still closes the Anthropic event
sequence. Nothing is buffered beyond the current event's bytes.
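
A minimal sketch of the split-and-frame steps, assuming LF line endings and ignoring the bounded channel. `SseSplitter` and `frame` are hypothetical names, not the ones in anthropic_sse.

```rust
/// Accumulates upstream bytes and yields the `data:` payload of every
/// complete SSE event; a partial event stays buffered until it ends.
#[derive(Default)]
struct SseSplitter {
    buf: Vec<u8>,
}

impl SseSplitter {
    fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(chunk);
        let mut payloads = Vec::new();
        // An event ends at a blank line, i.e. two consecutive newlines.
        while let Some(end) = self.buf.windows(2).position(|w| w == b"\n\n") {
            let event: Vec<u8> = self.buf.drain(..end + 2).collect();
            let text = String::from_utf8_lossy(&event);
            let data: Vec<&str> = text
                .lines()
                .filter_map(|line| line.strip_prefix("data:"))
                .map(|d| d.strip_prefix(' ').unwrap_or(d))
                .collect();
            if !data.is_empty() {
                payloads.push(data.join("\n"));
            }
        }
        payloads
    }
}

/// Frames one translated event as an `event:`/`data:` pair.
fn frame(event: &str, payload: &str) -> String {
    format!("event: {event}\ndata: {payload}\n\n")
}
```

Because `push` drains each event as soon as its terminating blank line arrives, the buffer never holds more than the event currently in flight.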

Tests: 5 state-machine unit tests (text flow, stop-reason mapping +
defaults, tool_use blocks, usage propagation, idempotent finish) and
2 gateway integration tests (full event sequence + text reassembly,
usage propagation into message_delta). Validated end-to-end by
running this branch's gateway against a production neuron and
streaming a live Anthropic request.

Closes #24

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:47:30 +03:00
b0d0b939af Merge pull request 'feat(gateway): per-request token metrics — TTFT and tok/s (#21)' (#30) from feat/gateway-21-token-metrics into main
Some checks failed
build-prerelease / Lint (fmt + clippy) (push) Blocked by required conditions
build-prerelease / Test (push) Blocked by required conditions
build-prerelease / Build cortex binary (push) Blocked by required conditions
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:25:32 +00:00
6a36d15ef1 feat(gateway): per-request token metrics — TTFT and tok/s (#21)
All checks were successful
The deferred Phase 6b, and the unblock for the 7→8 milestone's
benchmark work (#22): until cortex measures itself per request,
nothing downstream can be benchmarked or graphed.

The proxy wraps the upstream byte stream in a pass-through inspector
(TokenMetricsStream): chunks are forwarded verbatim — never buffered
or re-serialised — while the inspector records arrival times and
keeps a bounded (64 KiB) tail of the body text. At stream end (or
client disconnect, via Drop) it extracts the final OpenAI usage
object — present on the last SSE chunk and non-streaming JSON bodies
alike — for engine-truth token counts.

Per request, labelled {model, node}:
- cortex_time_to_first_token_seconds (histogram) — first body chunk
- cortex_tokens_per_second (histogram) — completion tokens over the
  decode window (first→last chunk); falls back to total request
  duration for single-chunk non-streaming bodies
- cortex_prompt_tokens_total / cortex_completion_tokens_total
  (counters)
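
A minimal sketch of the tokens-per-second arithmetic described in the list above. The function name and signature are hypothetical; the timestamps are offsets from request start.

```rust
use std::time::Duration;

/// Completion tokens over the decode window (first to last chunk).
/// Falls back to the total request duration when the window is empty,
/// as happens for a single-chunk non-streaming body.
fn tokens_per_second(
    completion_tokens: u64,
    first_chunk: Duration,
    last_chunk: Duration,
    total: Duration,
) -> Option<f64> {
    let decode = last_chunk.saturating_sub(first_chunk);
    let window = if decode.is_zero() { total } else { decode };
    if window.is_zero() {
        return None; // nothing to divide by; record no sample
    }
    Some(completion_tokens as f64 / window.as_secs_f64())
}
```

For example, 100 completion tokens with the first chunk at 0.5 s and the last at 2.5 s give a 2 s decode window and 50 tok/s.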

The extractor is pure and chunk-boundary-safe; quoted-needle matching
keeps completion_tokens_details from shadowing completion_tokens,
and the last usage object wins. Covers chat completions, completions,
the Responses API, and the Anthropic streaming path (which currently
proxies OpenAI SSE).
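
A minimal sketch of quoted-needle matching over the retained body tail. The function name is hypothetical and the real extractor may parse differently; the point is that searching for the key with its closing quote and colon cannot match a longer key that merely starts with the same text.

```rust
/// Last integer value that follows the quoted key `"<key>":` in `tail`.
fn last_usage_field(tail: &str, key: &str) -> Option<u64> {
    // `"completion_tokens":` does not occur inside
    // `"completion_tokens_details":` because of the closing quote.
    let needle = format!("\"{key}\":");
    // rfind: the last usage object in the tail wins.
    let start = tail.rfind(&needle)? + needle.len();
    let digits: String = tail[start..]
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}
```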

Tests: 4 extractor unit tests; integration test with a streaming
mock emitting a stream_options-style final usage chunk, asserting
both histograms and exact-or-greater counter values (the test
recorder is process-global and shared across the binary's tests).

Closes #21

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:11:52 +03:00
716558c8ff feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)
All checks were successful
The un-rebooted driver update (userspace libs bumped, kernel module
still old) kills every CUDA call on the host including nvidia-smi,
and neuron surfaced it only as `Comm::from_rank ... NcclError` deep
inside the first model load — 30 minutes of forensics on beast
(2026-06-08) to diagnose. Make it instantly legible instead:

- discovery distinguishes nvidia-smi absent (CPU-only, fine) from
  present-but-failing, classifies the "Driver/library version
  mismatch" signature, and pairs the userspace NVML version with the
  loaded kernel-module version from /proc/driver/nvidia/version.
- DiscoveryResponse gains `cuda_unavailable_reason` (omitted when
  None — wire-compatible) so cortex can see why the node has no
  devices and route around it.
- startup logs one loud ERROR line with the actionable reason
  ("reboot the host to reload the kernel module") and skips default
  model loads entirely, marking each failed with that reason so
  /health activation shows the real cause.
- POST /models/load fast-rejects with 503 + code=cuda_unavailable on
  a mismatch host instead of dying minutes later in cuInit/NCCL.

No false positives: other nvidia-smi failures (no devices, perms)
keep their existing behaviour, CPU-only hosts stay silent.
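
A minimal sketch of the classification step, taking the probe result as plain values so it runs without a GPU. The enum and function are hypothetical; only the "Driver/library version mismatch" signature comes from the message above.

```rust
#[derive(Debug, PartialEq)]
enum GpuProbe {
    /// nvidia-smi is not installed: CPU-only host, nothing to report.
    Absent,
    Ok,
    /// Userspace libraries and the loaded kernel module disagree.
    DriverLibraryMismatch,
    /// Any other failure keeps its existing behaviour.
    OtherFailure,
}

/// `output` is `None` when nvidia-smi could not be found; otherwise it
/// carries (exited successfully, combined stdout/stderr text).
fn classify(output: Option<(bool, &str)>) -> GpuProbe {
    match output {
        None => GpuProbe::Absent,
        Some((true, _)) => GpuProbe::Ok,
        Some((false, text)) if text.contains("Driver/library version mismatch") => {
            GpuProbe::DriverLibraryMismatch
        }
        Some((false, _)) => GpuProbe::OtherFailure,
    }
}
```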

Closes #19

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:00:00 +03:00
df9c490614 feat(neuron+gateway): keep auto-recovering models visible as recovering (#20)
Some checks failed
CI / Format (push) Successful in 37s
CI / CUDA type-check (pull_request) Failing after 28s
CI / Format (pull_request) Successful in 37s
CI / Clippy (push) Successful in 2m54s
CI / Clippy (pull_request) Successful in 3m36s
CI / Test (push) Successful in 4m37s
CI / Test (pull_request) Successful in 5m20s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / CUDA type-check (push) Failing after 31s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
During the #17 auto-recovery window (unload → reload, minutes for a
large TP model) the model's registry slot is absent, so it vanished
from neuron's /models — and cortex, routing by /models presence,
answered "model not found on any node" while a direct request to
neuron would have correctly said "recovering, retry shortly".

neuron: the recovery set becomes a map carrying a devices/capabilities
snapshot taken at trigger time (while the registry slot still exists).
list_models reports `recovering` for models in the set — both while
the poisoned slot is still present and during the reload gap, where
the snapshot keeps the model listed.

gateway: ModelStatus grows a Recovering variant (parsed from the
wire); the router holds the route — new RouteError::ModelRecovering
mapped to 503 instead of 404 — and deliberately does not fall through
to the catalogue cold-load, which would race a second placement
against the in-flight recovery. The evictor already ignores
non-Loaded entries.
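
A minimal sketch of the routing decision and status mapping described above. The variant names `Recovering` and `ModelRecovering` come from the message; `ModelNotFound`, `Other`, and the three functions are hypothetical simplifications of the real router.

```rust
#[derive(Debug, PartialEq)]
enum ModelStatus {
    Loaded,
    Recovering,
    Other,
}

#[derive(Debug, PartialEq)]
enum RouteError {
    ModelNotFound,
    ModelRecovering,
}

/// Parses the status string a node reports for a model.
fn parse_status(s: &str) -> ModelStatus {
    match s {
        "loaded" => ModelStatus::Loaded,
        "recovering" => ModelStatus::Recovering,
        _ => ModelStatus::Other,
    }
}

/// Picks the first node with the model loaded. A recovering replica
/// holds the route instead of falling through to "not found".
fn route(statuses: &[ModelStatus]) -> Result<usize, RouteError> {
    if let Some(i) = statuses.iter().position(|s| *s == ModelStatus::Loaded) {
        return Ok(i);
    }
    if statuses.contains(&ModelStatus::Recovering) {
        return Err(RouteError::ModelRecovering);
    }
    Err(RouteError::ModelNotFound)
}

/// HTTP status for each routing error.
fn status_code(err: &RouteError) -> u16 {
    match err {
        RouteError::ModelNotFound => 404,
        RouteError::ModelRecovering => 503,
    }
}
```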

Tests: neuron unit test (recovering model stays listed with snapshot),
gateway integration tests (poller parses `recovering`; request gets
503 retry-shortly and the model stays on /v1/models).

Closes #20

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:42:03 +03:00
1a74cb0c56 chore: rename repo cortex -> helexa
Some checks failed
CI / CUDA type-check (push) Failing after 30s
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Format (push) Successful in 32s
build-prerelease / Build neuron-blackwell (push) Failing after 31s
build-prerelease / Build neuron-ada (push) Failing after 34s
build-prerelease / Build neuron-ampere (push) Failing after 38s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
CI / Clippy (push) Failing after 1m11s
build-prerelease / Build cortex binary (push) Successful in 3m47s
CI / Test (push) Successful in 5m32s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
helexa is the project; cortex (per-operator control plane / LLM proxy)
and neuron (per-host LLM harness) are its components. The Gitea repo
is now helexa/helexa. Update repository URLs in Cargo metadata, RPM
specs, and docs; make the CI changelog push URL rename-proof via the
github.repository context; reframe README.md and CLAUDE.md around the
project name. Binary, package, service, and config-path names are
unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 10:54:01 +03:00