Compare commits

45 Commits

Author SHA1 Message Date
c83f1eb98c feat(#47 #54 phase 2c): neuron per-principal in-flight cap (fair-share)
Some checks failed
Budget enforcement (#52) caps total spend over time; this change caps
per-principal concurrency at any instant, so one principal's burst can't
monopolize a model while others wait.

- AdmissionController gains per-principal accounting (moved from a lone
  atomic to a Mutex<AdmissionState> holding the overall pending count + a
  per-principal map). enter(principal) now also fast-rejects when a
  principal already has max_per_principal requests in flight/queued →
  AdmissionRejection::PrincipalCap. Anonymous (None) requests are exempt.
- Config [harness.candle.admission].max_per_principal (default 2 = one
  running + one queued; 0 disables). A bursting principal's overflow is
  refused while a different principal still gets a queue slot.
- The principal (account/key) is reconstructed on the neuron side from the
  x-helexa-account-id/key-id headers cortex stamps (#49) — trusted over
  WireGuard, never from the request body — and threaded explicitly through
  all inference entry points (chat_completion, *_stream(_with),
  responses_stream, and the TP variants) to the admission gate.
- InferenceError::PerPrincipalLimit → 429 rate_limit_exceeded + Retry-After
  (distinct from load-shedding's 503 Overloaded); opencode/AI SDK self-pace.

Tests: fair-share unit test (A floods → A's 2nd is PrincipalCap, B still
queues + is served) + the existing admission tests adapted to enter(None).
Non-CUDA build green locally; TP entry points (cuda-gated) validated by CI.
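
A minimal sketch of the per-principal accounting described above, assuming a synchronous gate and a single combined pending limit. The `max_pending` field, the `String` principal key, and the exact check order are illustrative assumptions, not the real `AdmissionController`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Why a request was refused at the gate.
#[derive(Debug, PartialEq, Eq)]
pub enum AdmissionRejection {
    QueueFull,
    PrincipalCap,
}

#[derive(Default)]
struct AdmissionState {
    pending: usize,                        // running + queued, all principals
    per_principal: HashMap<String, usize>, // running + queued, per principal
}

#[derive(Clone)]
pub struct AdmissionController {
    state: Arc<Mutex<AdmissionState>>,
    max_pending: usize,       // hypothetical: in-flight slots + queue slots
    max_per_principal: usize, // 0 disables the per-principal cap
}

/// Held for the lifetime of a request; frees its slots on drop.
pub struct AdmissionPermit {
    state: Arc<Mutex<AdmissionState>>,
    principal: Option<String>,
}

impl AdmissionController {
    pub fn new(max_pending: usize, max_per_principal: usize) -> Self {
        Self {
            state: Arc::new(Mutex::new(AdmissionState::default())),
            max_pending,
            max_per_principal,
        }
    }

    pub fn enter(&self, principal: Option<&str>) -> Result<AdmissionPermit, AdmissionRejection> {
        let mut st = self.state.lock().unwrap();
        // Anonymous (None) requests skip the per-principal check entirely.
        if let Some(p) = principal {
            let held = st.per_principal.get(p).copied().unwrap_or(0);
            if self.max_per_principal != 0 && held >= self.max_per_principal {
                return Err(AdmissionRejection::PrincipalCap);
            }
        }
        if st.pending >= self.max_pending {
            return Err(AdmissionRejection::QueueFull);
        }
        st.pending += 1;
        if let Some(p) = principal {
            *st.per_principal.entry(p.to_string()).or_insert(0) += 1;
        }
        Ok(AdmissionPermit {
            state: Arc::clone(&self.state),
            principal: principal.map(str::to_string),
        })
    }
}

impl Drop for AdmissionPermit {
    fn drop(&mut self) {
        let mut st = self.state.lock().unwrap();
        st.pending -= 1;
        if let Some(p) = &self.principal {
            let emptied = match st.per_principal.get_mut(p) {
                Some(n) => {
                    *n -= 1;
                    *n == 0
                }
                None => false,
            };
            if emptied {
                st.per_principal.remove(p);
            }
        }
    }
}
```

With `AdmissionController::new(3, 2)`, a third concurrent `enter(Some("A"))` is refused with `PrincipalCap` while `enter(Some("B"))` still gets a slot, which is the fair-share behaviour the unit test exercises.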

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:40:25 +03:00
a60c9f1075 feat(#47 #53 phase 2b): expose per-model admission load in GET /health
All checks were successful
Completes #53: the bounded scheduler's lock-free counters are now visible
to the fleet, which is what cortex's load-aware router (#55) consumes to
spread traffic across replicas and propagate honest backpressure.

- cortex-core::discovery: HealthResponse gains `models: Vec<ModelLoad>`
  (#[serde(default)] — back-compatible; older gateways/neurons interop).
  ModelLoad { id, in_flight, queue_depth }.
- LoadedHandle::load() → (in_flight, queue_depth), lock-free for both
  single-GPU and TP; CandleHarness::load_snapshot() enumerates resident
  models; the /health handler overlays it from the candle harness.

Tests: /health always exposes a models array (api integration test); a
pre-#53 payload without `models` still deserializes, and ModelLoad
round-trips (cortex-core serde tests). Local fmt/clippy/test green.
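
The lock-free load read can be sketched with plain atomics. `LoadCounters`, its `enqueue`/`start`/`finish` methods, and the free function `load_snapshot` are hypothetical stand-ins for the scheduler counters and `CandleHarness::load_snapshot()`; only the `ModelLoad { id, in_flight, queue_depth }` shape comes from the message above, and the `usize` field types are an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Shape named in the commit message; field types are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ModelLoad {
    pub id: String,
    pub in_flight: usize,
    pub queue_depth: usize,
}

/// Hypothetical stand-in for the counters a model's scheduler maintains.
#[derive(Default)]
pub struct LoadCounters {
    in_flight: AtomicUsize,
    queue_depth: AtomicUsize,
}

impl LoadCounters {
    /// A request joined the queue.
    pub fn enqueue(&self) {
        self.queue_depth.fetch_add(1, Ordering::Relaxed);
    }

    /// A queued request started running.
    pub fn start(&self) {
        self.queue_depth.fetch_sub(1, Ordering::Relaxed);
        self.in_flight.fetch_add(1, Ordering::Relaxed);
    }

    /// A running request finished.
    pub fn finish(&self) {
        self.in_flight.fetch_sub(1, Ordering::Relaxed);
    }

    /// Reads (in_flight, queue_depth) without taking any lock.
    pub fn load(&self) -> (usize, usize) {
        (
            self.in_flight.load(Ordering::Relaxed),
            self.queue_depth.load(Ordering::Relaxed),
        )
    }
}

/// Enumerates resident models into the list a health handler could return.
pub fn load_snapshot(resident: &[(String, Arc<LoadCounters>)]) -> Vec<ModelLoad> {
    resident
        .iter()
        .map(|(id, counters)| {
            let (in_flight, queue_depth) = counters.load();
            ModelLoad { id: id.clone(), in_flight, queue_depth }
        })
        .collect()
}
```

The two counters are read independently, so a snapshot may be momentarily inconsistent under load; for a routing hint that is usually an acceptable trade for never blocking the inference path.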

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:13:07 +03:00
b2bd86bfa5 feat(#47 #53 phase 2a): neuron admission control — bounded queue + backpressure
All checks were successful
Replaces the per-model unbounded, untimed FIFO of inference-lock waiters
(a busy model made new requests hang ~300s until the client gave up with
an opaque error) with an explicit bounded scheduler.

- harness::admission::AdmissionController: batch-1 scheduler — max_in_flight
  running (1) + a bounded queue (max_queue_depth) with a max_wait. enter()
  fast-rejects when the queue is full (QueueFull) or the wait elapses
  (Timeout); the returned AdmissionPermit is held for the request and frees
  both slots on drop. Pure async (no CUDA), lock-free in_flight/queue_depth
  counters for future /health reporting. Configurable via
  [harness.candle.admission] (max_in_flight=1, max_queue_depth=8,
  max_wait_secs=30).
- Gated at all four inference entry points before the inference_lock/pool
  lock: single-GPU non-streaming + streaming, TP non-streaming + streaming.
  The streaming paths acquire the permit before opening the SSE (so a
  rejection is a clean error, not a half-open stream) and move it into the
  inference task.
- InferenceError::Overloaded { retry_after_secs } → 503 rate_limit_exceeded
  + Retry-After via the #60/#63 envelope: a fast, retryable "busy" signal
  opencode/AI SDK back off on, not a stall.

Scope: this branch is the admission *core* (the hang→backpressure fix).
Exposing in_flight/queue_depth in GET /health (consumed by cortex
load-aware routing #55) is the next focused branch under #53.

4 unit tests (admit/report load, queue-full reject, wait-timeout reject)
+ Overloaded envelope mapping test. Non-CUDA build green locally; the
CUDA + TP sites are validated by branch CI.
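
The bounded scheduler can be illustrated with a blocking `Mutex` + `Condvar` version. This is a sketch under stated assumptions: the real controller is described as pure async with lock-free counters, whereas this one blocks a thread, does not guarantee FIFO order among waiters, and uses an `Inner`/`Slots` layout invented for the example.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum AdmissionRejection {
    QueueFull,
    Timeout,
}

#[derive(Default)]
struct Slots {
    in_flight: usize,
    queued: usize,
}

struct Inner {
    slots: Mutex<Slots>,
    freed: Condvar,
    max_in_flight: usize,
    max_queue_depth: usize,
    max_wait: Duration,
}

#[derive(Clone)]
pub struct AdmissionController {
    inner: Arc<Inner>,
}

/// Held for the request; releases the running slot on drop.
pub struct AdmissionPermit {
    inner: Arc<Inner>,
}

impl AdmissionController {
    pub fn new(max_in_flight: usize, max_queue_depth: usize, max_wait: Duration) -> Self {
        Self {
            inner: Arc::new(Inner {
                slots: Mutex::new(Slots::default()),
                freed: Condvar::new(),
                max_in_flight,
                max_queue_depth,
                max_wait,
            }),
        }
    }

    pub fn enter(&self) -> Result<AdmissionPermit, AdmissionRejection> {
        let inner = &self.inner;
        let mut slots = inner.slots.lock().unwrap();
        if slots.in_flight >= inner.max_in_flight {
            // Fast reject: no room left to wait.
            if slots.queued >= inner.max_queue_depth {
                return Err(AdmissionRejection::QueueFull);
            }
            slots.queued += 1;
            let deadline = Instant::now() + inner.max_wait;
            while slots.in_flight >= inner.max_in_flight {
                let left = deadline.saturating_duration_since(Instant::now());
                if left.is_zero() {
                    slots.queued -= 1;
                    return Err(AdmissionRejection::Timeout);
                }
                slots = inner.freed.wait_timeout(slots, left).unwrap().0;
            }
            slots.queued -= 1;
        }
        slots.in_flight += 1;
        Ok(AdmissionPermit { inner: Arc::clone(inner) })
    }

    /// Returns (in_flight, queue_depth).
    pub fn load(&self) -> (usize, usize) {
        let slots = self.inner.slots.lock().unwrap();
        (slots.in_flight, slots.queued)
    }
}

impl Drop for AdmissionPermit {
    fn drop(&mut self) {
        self.inner.slots.lock().unwrap().in_flight -= 1;
        self.inner.freed.notify_one();
    }
}
```

Acquiring the permit before any response bytes are written is what lets a rejection surface as a clean error instead of a half-open stream; the RAII drop covers early returns and abandoned requests.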

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:03:07 +03:00
cdf87284af feat(#47 phase 1d): budget enforcement — hard caps, reserve→settle, 429
All checks were successful
Stage 1 complete: the A0 seatbelt (#52). Flips the metering-only reserve(0)
from #51 to the request's real upper-bound cost and refuses over-cap
requests *before* neuron is hit.

- metering::reservation_estimate: prompt estimate (~4 chars/token over the
  body — cortex has no tokenizer, so a conservative over-estimate; neuron
  stays the exact context wall) + max output. Max output comes from
  max_completion_tokens / legacy max_tokens, else the model's advertised
  limit.output (#62), else FALLBACK_MAX_OUTPUT. Over-reserving is safe —
  settle reconciles to actual.
- metering::reserve_or_reject: reserve the estimate; on BudgetError map to
  the #63 envelope and the caller refuses before dispatch — rolling window →
  429 rate_limit_exceeded + Retry-After (until reset); hard balance → 429
  insufficient_quota (no Retry-After). Never 402.
- Wired into both the OpenAI proxy path (proxy_with_metrics) and the
  Anthropic path (estimate from the translated body). advertised_output_limit
  reads the loaded model's limit.output from fleet state.
- Reservation prevents overshoot under concurrency: a successful reserve
  gates on spent+reserved+estimate ≤ cap, and settle records actual ≤
  reserved, so spend can never exceed the hard cap.

4 integration tests with a hit-counting mock neuron: balance over-cap →
429 insufficient_quota (no Retry-After, not dispatched); rolling over-cap →
429 rate_limit_exceeded + Retry-After (not dispatched); within-cap served;
**A0 repro** — a capped key's 20-request fan-out drains the cap, then is
refused, neuron only saw the served ones, and spend never exceeds the cap.
Plus 5 metering unit tests. Local fmt/clippy/test all green.
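
A rough sketch of the reservation estimate, assuming the body is measured in bytes (which can only over-count characters) and that the output bound falls through the sources in the order listed above. The `OutputHints` struct and the value of `FALLBACK_MAX_OUTPUT` are placeholders; the message names the constant but not its value.

```rust
/// Placeholder value; the commit names FALLBACK_MAX_OUTPUT but not its size.
pub const FALLBACK_MAX_OUTPUT: u64 = 4096;

/// Conservative prompt heuristic: roughly 4 characters per token.
const CHARS_PER_TOKEN: u64 = 4;

/// Hypothetical bundle of the inputs that bound the completion size.
pub struct OutputHints {
    pub max_completion_tokens: Option<u64>,
    pub max_tokens: Option<u64>, // legacy request field
    pub advertised_output_limit: Option<u64>, // the loaded model's limit.output
}

/// Picks the first available bound, in the order the commit message lists.
pub fn max_output(hints: &OutputHints) -> u64 {
    hints
        .max_completion_tokens
        .or(hints.max_tokens)
        .or(hints.advertised_output_limit)
        .unwrap_or(FALLBACK_MAX_OUTPUT)
}

/// Upper-bound token cost of a request: estimated prompt plus maximum output.
pub fn reservation_estimate(body: &[u8], hints: &OutputHints) -> u64 {
    // Round up so short bodies never estimate to zero prompt tokens.
    let prompt = (body.len() as u64 + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN;
    prompt.saturating_add(max_output(hints))
}
```

Because settle later reconciles the reservation to actual usage, erring high here only delays spend that was never going to be charged.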

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:35:04 +03:00
4f16b8c541 feat(#47 phase 1c): per-request token metering + spend ledger
All checks were successful
Stage 1 accounting (#51): capture real per-request usage and feed it to
the spend ledger + per-principal metrics. Establishes the reserve→settle
lifecycle that budget enforcement (#52) will tighten.

- cortex-gateway::metering: ReservationGuard makes reservation leaks
  impossible — settle() records actual spend + releases the remainder;
  dropping an un-settled guard releases the whole reservation, so any
  early return / error / dropped stream resolves it. UsageSink is the
  completion hook; principal_from_headers reconstructs the principal from
  the middleware-stamped headers (uniform across all proxy paths, no
  handler-signature churn); record_spend emits per-principal counters.
- proxy::TokenMetrics gains an optional usage_sink, invoked exactly once
  in finish() with the observed (prompt, completion) — restructured so it
  always runs (even when no body/usage arrived → settle 0 → release),
  while preserving the existing per-model metric emissions unchanged.
- All proxy paths metered: chat/completions/responses via
  proxy_with_metrics (reserve 0 → forward_request → settle in finish);
  Anthropic non-streaming settles from the buffered body; Anthropic
  streaming (anthropic_sse) now scans the upstream frames for the usage
  object (#48) — it captured none before — and settles at pump end.
- This phase reserves 0 tokens (metering only, no enforcement); #52 flips
  the reserved amount to prompt+max_output and surfaces BudgetError. The
  settle/release plumbing is identical, so that change is localized.
- New Prometheus counters: cortex_spend_tokens_total (+ prompt/completion
  splits), labelled by account/key.

2 integration tests: cumulative per-key spend after N requests with
reservations settled to zero outstanding; anonymous requests record no
spend. Local fmt/clippy/test all green.
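
The leak-proof reservation lifecycle can be sketched as an RAII guard. The `Ledger` struct and the `reserve` constructor are hypothetical; the real guard talks to the `EntitlementProvider` rather than to a shared struct.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical in-memory ledger for one key.
#[derive(Default, Debug, Clone, Copy, PartialEq, Eq)]
pub struct Ledger {
    pub spent: u64,
    pub reserved: u64,
}

/// Owns an outstanding reservation until it is settled or dropped.
pub struct ReservationGuard {
    ledger: Arc<Mutex<Ledger>>,
    amount: u64,
    settled: bool,
}

impl ReservationGuard {
    /// Places `amount` on hold in the ledger.
    pub fn reserve(ledger: Arc<Mutex<Ledger>>, amount: u64) -> Self {
        ledger.lock().unwrap().reserved += amount;
        Self { ledger, amount, settled: false }
    }

    /// Records actual spend and releases the whole hold.
    pub fn settle(mut self, actual: u64) {
        {
            let mut ledger = self.ledger.lock().unwrap();
            ledger.reserved -= self.amount;
            ledger.spent += actual;
        }
        // Tell Drop (which runs when `self` goes out of scope) to do nothing.
        self.settled = true;
    }
}

impl Drop for ReservationGuard {
    fn drop(&mut self) {
        // Early return, error, or dropped stream: release the full hold.
        if !self.settled {
            self.ledger.lock().unwrap().reserved -= self.amount;
        }
    }
}
```

Taking `self` by value in `settle` makes a double settle a compile error, and the `Drop` impl handles every path that never reaches it.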

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:29:51 +03:00
486d7e9a8f feat(#47 phase 1b): API-key auth + principal resolution
All checks were successful
Stage 1 identity (#49): cortex now knows who a request is for. Identity
rides standard bearer auth only (Authorization: Bearer <key>) — no custom
required headers or body fields — which is what keeps every tier
OpenAI-compatible by construction.

- cortex-gateway::auth: `require_principal` axum middleware
  (from_fn_with_state), wired in build_app outer-to-inner as
  trace → CORS → auth → handlers (CORS outer so preflight short-circuits).
  It resolves the bearer key via the EntitlementProvider, inserts the
  typed Principal into request extensions (for metering #51 / enforcement
  #52), and stamps internal x-helexa-account-id / x-helexa-key-id headers
  so the principal reaches neuron, which trusts cortex over WireGuard (#54).
- Anti-spoofing: client-supplied principal headers are stripped before the
  authoritative value is stamped — a client can never assert a principal
  it didn't authenticate as.
- Rejection contract (#63): missing key under require_auth, or any present
  but unresolvable key, → 401 invalid_api_key in the #60 envelope. /health
  and / stay public. require_auth=false (default) allows anonymous through
  but still 401s a present-but-invalid key.
- Header-name constants (HEADER_ACCOUNT_ID/KEY_ID) live in cortex-core so
  neuron (#54) shares them. The chat/completions/responses paths forward
  the stamped headers automatically via proxy::forward_request; the
  Anthropic streaming + non-streaming paths forward them explicitly via
  auth::forward_principal_headers (they build their own upstream requests).

5 integration tests: missing-key 401, invalid-key 401 (even when auth not
required, not dispatched), valid key reaches neuron with principal headers
+ spoofed header stripped, anonymous allowed when not required, /health
public. Local fmt/clippy/test all green.
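
The strip-then-stamp rule can be sketched on a plain header list. The `Headers` alias and both function bodies are illustrative assumptions (the real middleware works on axum request types); the header names and the `HEADER_ACCOUNT_ID` / `HEADER_KEY_ID` constant names are taken from the message above, and a `String` `key_id` is assumed.

```rust
pub const HEADER_ACCOUNT_ID: &str = "x-helexa-account-id";
pub const HEADER_KEY_ID: &str = "x-helexa-key-id";

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Principal {
    pub account_id: String,
    pub key_id: String,
}

/// Hypothetical header list standing in for an HTTP header map.
pub type Headers = Vec<(String, String)>;

fn header<'a>(headers: &'a [(String, String)], name: &str) -> Option<&'a str> {
    headers
        .iter()
        .find(|(n, _)| n.eq_ignore_ascii_case(name))
        .map(|(_, v)| v.as_str())
}

/// Gateway side: drop anything the client sent, then stamp the resolved value.
pub fn stamp_principal(headers: &mut Headers, principal: Option<&Principal>) {
    headers.retain(|(name, _)| {
        !name.eq_ignore_ascii_case(HEADER_ACCOUNT_ID) && !name.eq_ignore_ascii_case(HEADER_KEY_ID)
    });
    if let Some(p) = principal {
        headers.push((HEADER_ACCOUNT_ID.to_string(), p.account_id.clone()));
        headers.push((HEADER_KEY_ID.to_string(), p.key_id.clone()));
    }
}

/// Downstream side: rebuild the principal only when both headers are present.
pub fn principal_from_headers(headers: &[(String, String)]) -> Option<Principal> {
    Some(Principal {
        account_id: header(headers, HEADER_ACCOUNT_ID)?.to_string(),
        key_id: header(headers, HEADER_KEY_ID)?.to_string(),
    })
}
```

Stripping runs even for anonymous requests, so a client-supplied `x-helexa-account-id` never survives the gateway whether or not a key was presented.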

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:07:10 +03:00
bc74e0e95f feat(#47 phase 1a): EntitlementProvider trait + local/static provider
Some checks failed
Stage 1's build seam (#50): the interface auth, metering, and budget
enforcement all hang off, with a local/static provider so the A0
amplification fix can land before any upstream clearing house exists.
The future helexa-upstream client (#57) is just another impl.

- cortex-core::entitlements: Principal {account_id, key_id}, CapWindow
  (Balance | Rolling{seconds}), Reservation handle, BudgetSnapshot,
  AuthError/BudgetError, and the async EntitlementProvider trait
  (resolve / reserve / settle / release / snapshot). BudgetError carries
  the window semantics so callers pick the #63 code (rate_limit_exceeded
  + Retry-After vs insufficient_quota) without the provider touching HTTP.
- cortex-core::config: [entitlements] section on GatewayConfig
  (require_auth + [[entitlements.keys]] with account_id, optional key_id,
  hard_cap, window). Additive + serde(default) — anonymous/uncapped when
  omitted, so existing setups are unaffected.
- cortex-gateway::entitlements_local: LocalEntitlementProvider. Budget
  math serialized under one Mutex so spent+reserved can never exceed a
  hard cap under concurrency (the #52 guarantee); rolling windows reset
  lazily; uncapped keys (no hard_cap) always reserve but still meter.
- CortexState gains Arc<dyn EntitlementProvider> + require_auth, built in
  from_config. Not yet consumed by the request path — auth middleware is
  1b (#49), enforcement is 1d (#52).
- cortex.example.toml documents the section; test GatewayConfig literals
  updated for the new field.

6 provider unit tests (resolve, unknown-key, round-trip, balance/rolling
over-cap codes, uncapped infra key). Local fmt/clippy/test all green.
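
A sketch of the serialized budget math for a single key, assuming the clock is injected as a parameter so the window reset is testable. `LocalBudget`, the `BudgetError` variant names, and the `retry_after_secs` field are hypothetical; `CapWindow` follows the shape given above.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CapWindow {
    Balance,
    Rolling { seconds: u64 },
}

/// Variant names are illustrative; they carry the window semantics.
#[derive(Debug, PartialEq, Eq)]
pub enum BudgetError {
    /// Rolling window exhausted: retryable once the window resets.
    WindowExhausted { retry_after_secs: u64 },
    /// Hard balance exhausted: not retryable.
    BalanceExhausted,
}

struct Budget {
    hard_cap: Option<u64>, // None = uncapped, still metered
    window: CapWindow,
    spent: u64,
    reserved: u64,
    window_start: Instant,
}

/// Hypothetical single-key budget; all math runs under one Mutex.
pub struct LocalBudget {
    inner: Mutex<Budget>,
}

impl LocalBudget {
    pub fn new(hard_cap: Option<u64>, window: CapWindow, now: Instant) -> Self {
        Self {
            inner: Mutex::new(Budget { hard_cap, window, spent: 0, reserved: 0, window_start: now }),
        }
    }

    pub fn reserve(&self, estimate: u64, now: Instant) -> Result<u64, BudgetError> {
        let mut b = self.inner.lock().unwrap();
        let window = b.window;
        // Lazy reset: the window rolls over on the first touch after it expires.
        if let CapWindow::Rolling { seconds } = window {
            if now.saturating_duration_since(b.window_start) >= Duration::from_secs(seconds) {
                b.spent = 0;
                b.window_start = now;
            }
        }
        if let Some(cap) = b.hard_cap {
            let projected = b.spent.saturating_add(b.reserved).saturating_add(estimate);
            if projected > cap {
                return Err(match window {
                    CapWindow::Balance => BudgetError::BalanceExhausted,
                    CapWindow::Rolling { seconds } => {
                        let elapsed = now.saturating_duration_since(b.window_start).as_secs();
                        BudgetError::WindowExhausted {
                            retry_after_secs: seconds.saturating_sub(elapsed),
                        }
                    }
                });
            }
        }
        b.reserved += estimate;
        Ok(estimate)
    }

    /// Releases the hold and records actual spend. Callers are expected to
    /// have reserved an upper bound, so `actual` should not exceed `reserved`.
    pub fn settle(&self, reserved: u64, actual: u64) {
        let mut b = self.inner.lock().unwrap();
        b.reserved -= reserved;
        b.spent += actual;
    }
}
```

Because the check and the increment of `reserved` happen under the same lock, two concurrent reserves cannot both pass against the same remaining headroom.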

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:00:05 +03:00
f22d83df14 feat(#47 phase 0): centralize OpenAI error envelope + add Retry-After
Some checks failed
The rejection contract (#63) requires every "no" path to speak the
OpenAI envelope with standard codes and, for retryable conditions, a
Retry-After header. Two gaps remained despite #63 being closed:
Retry-After was implemented nowhere, and the envelope was hand-built
inline in four places (gateway handlers/proxy/router, neuron api) with
no shared source of truth — exactly the inconsistency #63 set out to
prevent, and a foundation every Stage 1-2 rejection (401/429/503) needs.

- cortex-core: new `error_envelope::OpenAiError` — an axum-agnostic
  builder carrying status, type, code, message, param, optional
  retry_after, and diagnostic extras. Named constructors encode the #63
  codes (invalid_api_key, rate_limit_exceeded, insufficient_quota,
  context_length_exceeded, service_unavailable) and which carry
  Retry-After. cortex-core stays a pure types crate; each HTTP crate
  owns a thin `envelope_response` adapter that sets the header.
- cortex-gateway: route error_response, ProxyError, and RouteError
  through the shared builder; RouteError::retry_after_secs wires
  Retry-After on the transient NoHealthyNodes (5s) / ModelRecovering
  (2s) variants.
- neuron: route inference_error_response through the shared builder;
  InsufficientVram (transient 503) now advertises Retry-After: 5.

Behaviour for existing paths is unchanged (same status/type/code/extras);
only the new Retry-After headers are added. Tests cover the builder wire
shape and Retry-After presence/absence on both sides.
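
An axum-agnostic builder along these lines might look like the sketch below. The field names, the `error_type` strings, and the `retry_after_header` helper are assumptions for illustration; only the codes, the statuses mentioned in this series (401/429/503), and the rule about which codes carry Retry-After come from the commit messages.

```rust
/// Transport-neutral error description; an HTTP crate would turn this into
/// a response and set the header itself.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OpenAiError {
    pub status: u16,
    pub error_type: &'static str, // placeholder strings, not confirmed values
    pub code: &'static str,
    pub message: String,
    pub param: Option<String>,
    pub retry_after: Option<u64>, // seconds; None means "do not advertise"
}

impl OpenAiError {
    fn new(status: u16, error_type: &'static str, code: &'static str, message: String) -> Self {
        Self { status, error_type, code, message, param: None, retry_after: None }
    }

    pub fn invalid_api_key(message: impl Into<String>) -> Self {
        Self::new(401, "authentication_error", "invalid_api_key", message.into())
    }

    /// Retryable: always carries Retry-After.
    pub fn rate_limit_exceeded(message: impl Into<String>, retry_after_secs: u64) -> Self {
        let mut e = Self::new(429, "rate_limit_error", "rate_limit_exceeded", message.into());
        e.retry_after = Some(retry_after_secs);
        e
    }

    /// Not retryable: waiting does not restore a hard balance.
    pub fn insufficient_quota(message: impl Into<String>) -> Self {
        Self::new(429, "insufficient_quota", "insufficient_quota", message.into())
    }

    /// Transient: carries Retry-After.
    pub fn service_unavailable(message: impl Into<String>, retry_after_secs: u64) -> Self {
        let mut e = Self::new(503, "server_error", "service_unavailable", message.into());
        e.retry_after = Some(retry_after_secs);
        e
    }

    /// What a thin per-crate adapter would copy onto the HTTP response.
    pub fn retry_after_header(&self) -> Option<(&'static str, String)> {
        self.retry_after.map(|secs| ("Retry-After", secs.to_string()))
    }
}
```

Putting the Retry-After decision inside the named constructors keeps it out of call sites, so a handler cannot emit a retryable code without the header or attach one to a non-retryable code.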

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 18:46:56 +03:00
4b28a64b34 feat(#67 phase 5b): enforce the derived input as the prompt cap
All checks were successful
The request path now rejects prompts above the model's self-derived input
budget, not the static NEURON_MAX_PROMPT_TOKENS — so a VRAM-tight host
(where the VRAM ceiling binds below the static cap) rejects an
over-budget prompt up front instead of accepting it and OOMing
mid-prefill.

- derived_input_cap: AtomicUsize on LoadedModel + TpLoadedModel; refreshed
  by LoadedHandle::derived_limit (runs on every /models poll). 0 = not
  derived yet.
- effective_prompt_cap(): cached derived input when >0, else the static
  max_prompt_tokens() (cold-start / no-profile fallback).
- validate_request takes the cap as a param; all 4 call sites
  (chat_completion, inference_stream, inference_tp_stream, TP
  chat_completion) pass the in-scope model's effective_prompt_cap().
- doc/context-limits.md: enforcement note updated from "remaining" to
  landed.

Reads the cap lock-free from the sync validate path (no per-request VRAM
query); the cap tracks live state via the poll-driven derivation. With
this, advertise and enforce agree and both track the resident model.

fmt/clippy/test green; CUDA paths type-checked in CI.
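
The cached-cap read can be sketched as follows. This `LoadedModel` is a stripped-down stand-in: `refresh_derived_cap`, the `static_max_prompt_tokens` field, and the `String` error type are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for the per-model state; only the cap-related fields are shown.
pub struct LoadedModel {
    derived_input_cap: AtomicUsize, // 0 = not derived yet
    static_max_prompt_tokens: usize,
}

impl LoadedModel {
    pub fn new(static_max_prompt_tokens: usize) -> Self {
        Self { derived_input_cap: AtomicUsize::new(0), static_max_prompt_tokens }
    }

    /// Called from the poll-driven derivation with the latest derived input.
    pub fn refresh_derived_cap(&self, derived_input: usize) {
        self.derived_input_cap.store(derived_input, Ordering::Relaxed);
    }

    /// Lock-free read used on the request path.
    pub fn effective_prompt_cap(&self) -> usize {
        match self.derived_input_cap.load(Ordering::Relaxed) {
            0 => self.static_max_prompt_tokens, // cold start or no profile
            derived => derived,
        }
    }
}

/// Rejects a prompt above the cap the caller passes in.
pub fn validate_request(prompt_tokens: usize, cap: usize) -> Result<(), String> {
    if prompt_tokens > cap {
        Err(format!("prompt of {prompt_tokens} tokens exceeds cap of {cap}"))
    } else {
        Ok(())
    }
}
```

Using 0 as the "not derived yet" sentinel keeps the field a single atomic word, at the cost of being unable to express a genuine derived cap of zero.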

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:26:37 +03:00
dd65eedb24 feat(#67 phase 5a): NEURON_MAX_PROMPT_TOKENS becomes a clamp-only backstop; docs
All checks were successful
Demotes the static per-host prompt cap from authority to an optional
upper-bound clamp on the self-derived limit, and rewrites the
context-limits doc around the computed model.

- max_prompt_tokens_clamp(): reads NEURON_MAX_PROMPT_TOKENS directly so
  "explicitly set" is distinct from the 16384 default; returns None when
  unset (no clamp). Applied as derive_limit's hard_ceiling in
  LoadedHandle::derived_limit, so the advertised context is clamped only
  when an operator set a backstop — the derivation is otherwise
  authoritative and binds below it in practice.
- doc/context-limits.md: intro + "After #62" rewritten as "After #67 —
  the neuron computes its own limit" (formula, live signals, config
  block, opencode note, NEURON_MAX_PROMPT_TOKENS demotion).

Remaining (phase 5b, follow-up): enforce the *derived* input as the
prompt cap (reject above computed input, not the static
NEURON_MAX_PROMPT_TOKENS) so VRAM-tight hosts can't accept an
OOM-inducing prompt. Needs a per-model cached cap read from the sync
validate path; scoped separately. Until then the static cap remains the
enforced backstop (advertised <= enforced holds when the env is set).

fmt/clippy/test green.
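
The "explicitly set" versus "default" distinction can be sketched with an `Option`. `parse_clamp` and `apply_hard_ceiling` are hypothetical helpers, and treating an unparsable value as unset is an assumption; only the environment variable name and the None-when-unset behaviour come from the message above.

```rust
/// Parses the raw variable; None means "no clamp configured".
/// Assumption: an unparsable value is treated the same as unset.
pub fn parse_clamp(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
}

/// Reads the operator backstop straight from the environment, so an unset
/// variable is distinguishable from any built-in default.
pub fn max_prompt_tokens_clamp() -> Option<usize> {
    parse_clamp(std::env::var("NEURON_MAX_PROMPT_TOKENS").ok().as_deref())
}

/// Applies the clamp as an upper bound only; it never raises the derived value.
pub fn apply_hard_ceiling(derived: usize, hard_ceiling: Option<usize>) -> usize {
    match hard_ceiling {
        Some(ceiling) => derived.min(ceiling),
        None => derived,
    }
}
```

Splitting the pure parse from the environment read keeps the clamp logic testable without mutating process state.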

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:14:34 +03:00
8b2e01a072 feat(#67 phase 4): advertise neuron-computed limit on /models; drop catalogue override
Some checks failed
The neuron now self-derives and advertises limit{context,input,output}
per loaded model; cortex forwards it and stops consulting the
operator-declared catalogue limit (which can't track hot-swapped models
or live capacity). Operator-set `cost` still flows from the catalogue.

neuron:
- CandleHarness gains context_limit_cfg (from [harness.candle.context_limit]).
- LoadedHandle::derived_limit(): profile + live tightest-card free VRAM
  (single: query_vram; TP: query_vram_tightest_free_mb) + prefill-rate
  EMA (bootstrap until first sample) → derive_limit. None for arches
  without a context profile. No operator clamp here (advertise the honest
  derived value; the clamp is an enforcement-side backstop).
- list_models() fills ModelInfo.limit from derived_limit (was None).
- derive_limit treats free_tightest_mb == 0 (unknown/CPU sentinel) as
  "no VRAM ceiling" instead of collapsing to zero.

cortex:
- ModelEntry gains `limit`, copied from ModelInfo.limit by the poller.
- /v1/models: catalogue `limit` no longer flows (Pass 1 sets None);
  Pass 2 adopts the neuron's limit, taking the tightest across neurons
  via tightest_limit(). cost unchanged.
- model_limits.rs rewritten: catalogue limit (999999) is ignored; the
  neuron's ModelEntry.limit is advertised; cost still from catalogue.
- All ModelEntry literals updated with the new field.

fmt/clippy/test green; CUDA paths type-checked in CI.
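
One plausible reading of "tightest across neurons" is sketched below. It assumes the neuron with the smallest `context` wins as a whole and that neurons reporting no limit are skipped; the real `tightest_limit()` may instead compare fields independently. The `Limit` struct mirrors `limit{context,input,output}` with assumed `u64` fields.

```rust
/// Mirrors the advertised limit{context,input,output}; field types assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Limit {
    pub context: u64,
    pub input: u64,
    pub output: u64,
}

/// Picks the most restrictive limit among the neurons serving a model.
/// Neurons that report no limit (None) do not participate.
pub fn tightest_limit(limits: impl IntoIterator<Item = Option<Limit>>) -> Option<Limit> {
    limits.into_iter().flatten().min_by_key(|limit| limit.context)
}
```

Advertising the tightest value means a request sized to the advertised limit fits on whichever replica the router picks.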

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:10:20 +03:00
464b6b0db9 feat(neuron): self-measured prefill tok/s EMA on streaming paths (#67 phase 3)
All checks were successful
Refs #67. Feeds the throughput ceiling a live, per-model prefill rate
instead of only the configured bootstrap estimate, so the advertised
limit tracks real prefill speed and rises automatically as prefix
caching (#11) reduces effective prefill cost.

- context_limit::PrefillRateEma: lock-free f64-bits EMA (alpha 0.3),
  ignores degenerate samples, None before the first sample. Unit-tested.
- prefill_rate field on LoadedModel + TpLoadedModel.
- Recorded as total-prompt-tokens / prefill-elapsed in the two streaming
  serving paths (TP: inference_tp_stream via tp_for_task; single-GPU:
  stream_inference_via_worker via a new &prefill_rate param threaded from
  loaded_for_task). Measuring total prompt (not just the divergent
  suffix) means a prefix-cache hit shrinks elapsed while the prompt stays
  large, so the effective rate — and the ceiling — rises toward the VRAM
  ceiling, exactly the #11 payoff.

Per the agreed scope, non-streaming + CPU paths fall back to the
bootstrap estimate (opencode streams; those paths rarely carry fleet
traffic). fmt/clippy/test green; CUDA paths type-checked in CI.
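
A minimal sketch of a lock-free f64-bits EMA of the kind described above,
assuming an `AtomicU64` that stores the float's bit pattern and a NaN
sentinel for "no sample yet". The method names (`record`, `get`) and the
sentinel choice are illustrative, not taken from the real
`context_limit::PrefillRateEma`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Bit pattern meaning "no sample yet". All-ones is a NaN, which the
/// update below never stores because it only writes finite values.
const EMPTY: u64 = u64::MAX;
/// Smoothing factor from the commit message.
const ALPHA: f64 = 0.3;

pub struct PrefillRateEma {
    bits: AtomicU64,
}

impl PrefillRateEma {
    pub fn new() -> Self {
        Self { bits: AtomicU64::new(EMPTY) }
    }

    /// Record one prefill as total prompt tokens over elapsed seconds.
    /// Degenerate samples (zero tokens, non-positive or non-finite time)
    /// are ignored.
    pub fn record(&self, prompt_tokens: usize, elapsed_secs: f64) {
        if prompt_tokens == 0 || !elapsed_secs.is_finite() || elapsed_secs <= 0.0 {
            return;
        }
        let sample = prompt_tokens as f64 / elapsed_secs;
        if !sample.is_finite() {
            return;
        }
        // fetch_update retries the closure in a compare-and-swap loop,
        // so concurrent recorders never take a lock.
        let _ = self.bits.fetch_update(Ordering::AcqRel, Ordering::Acquire, |old| {
            let next = if old == EMPTY {
                sample // the first sample seeds the average
            } else {
                ALPHA * sample + (1.0 - ALPHA) * f64::from_bits(old)
            };
            Some(next.to_bits())
        });
    }

    /// Current rate in tokens per second; None before the first sample.
    pub fn get(&self) -> Option<f64> {
        match self.bits.load(Ordering::Acquire) {
            EMPTY => None,
            bits => Some(f64::from_bits(bits)),
        }
    }
}
```

A caller would fall back to the configured bootstrap estimate whenever
`get()` returns `None`.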

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:02:02 +03:00
f2e05d96ec feat(neuron): capture ContextProfile at load + per-rank VRAM fan-out (#67 phase 2)
All checks were successful
Refs #67. Captures the per-model context physics at load and adds the
live free-VRAM signal the derivation needs — the tightest card across TP
ranks, not just the leader.

- ContextProfile captured at load:
  - single-GPU dense CUDA path (world_size 1) via
    context_limit::profile_from_qwen3_5_config(config_path, ..);
  - TP path (world_size = tp_size) at TpLoadedModel construction.
  GGUF/CPU/non-qwen3_5 → None (fall back to the static prompt cap).
  New `context_profile` field on LoadedModel + TpLoadedModel.
- profile_from_qwen3_5_config(): reads config.json (mirrors
  VisionMeta::from_config_path), counts full_attention layers
  (layer_types authoritative, full_attention_interval fallback), builds
  the per-card KV cost via the shared helper.
- Folded the inline per-rank KV-bytes math in tp_qwen3.rs (both
  cuda/non-cuda log_construction_complete) and tp_qwen3_5.rs onto
  context_limit::kv_bytes_per_token + KV_CACHE_DTYPE_BYTES.
- Per-rank VRAM fan-out (tightest card):
  - WorkerRequest::QueryVram + WorkerResponse::VramInfo { free_mb, total_mb };
  - worker.rs handle_query_vram (cuda: mem_get_info; non-cuda: error);
  - WorkerPool::query_vram_tightest_free_mb fans out to every rank
    (leader via its device worker, subprocess ranks via RPC) → min free;
  - TpLoadedModel::query_vram_tightest_free_mb convenience wrapper.

No advertise/enforce yet (phases 4/5). fmt/clippy/test green; CUDA paths
type-checked in CI.
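
The per-card KV cost and the layer count can be sketched as below. The
formula (K and V planes, KV heads times head dim, divided evenly across
TP ranks), the 2-byte dtype width, the parameter names, and the dense
fallback are all assumptions made for illustration; only the function and
constant names come from the log.

```rust
/// Assumed KV-cache element width (2 bytes, e.g. bf16/f16). The real
/// KV_CACHE_DTYPE_BYTES value is not stated in this log.
pub const KV_CACHE_DTYPE_BYTES: u64 = 2;

/// Count full-attention layers: `layer_types` is authoritative when
/// present, otherwise every Nth layer is assumed to be full attention.
pub fn full_attention_layers(
    layer_types: Option<&[&str]>,
    num_hidden_layers: u64,
    full_attention_interval: Option<u64>,
) -> u64 {
    match (layer_types, full_attention_interval) {
        (Some(types), _) => types.iter().filter(|t| **t == "full_attention").count() as u64,
        (None, Some(n)) if n > 0 => num_hidden_layers / n,
        // No hybrid metadata at all: assume a dense model (assumption).
        _ => num_hidden_layers,
    }
}

/// Bytes of KV cache one token costs on a single card. Only
/// full-attention layers hold a growing KV cache; the factor 2 covers
/// the K and V planes; sharding divides the cost by the TP world size.
pub fn kv_bytes_per_token(
    full_attn_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    world_size: u64,
) -> u64 {
    let total = 2 * full_attn_layers * num_kv_heads * head_dim * KV_CACHE_DTYPE_BYTES;
    total / world_size.max(1)
}
```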

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 13:18:27 +03:00
4f05a87449 feat(neuron): self-derived context-limit core — physics + policy (#67 phase 1)
All checks were successful
Refs #67. The correct limit{context,input,output} for a deployment is a
computed function of model architecture + live free VRAM + a
coherence/throughput trade-off, not an operator-declared static fact that
goes stale on model swap. This lands the arch-agnostic derivation core;
later phases capture per-model physics at load, measure throughput, and
advertise/enforce the computed limit.

- crates/neuron/src/harness/context_limit.rs (new):
  - kv_bytes_per_token(): shared per-card KV cost (counts only
    full-attention layers; sharded by TP world size). The TP load paths'
    inline math folds onto this in phase 2.
  - ContextProfile: per-model physics snapshot (max_position_embeddings,
    kv_bytes_per_token_per_card, world_size).
  - derive_limit(): context = min(max_pos, vram_ceiling,
    throughput_ceiling) clamped by an optional backstop; input = context −
    output; rounded to 1024. 6 unit tests.
- config.rs: [harness.candle.context_limit] block (mirrors prefix_cache):
  target_prefill_latency_secs, bootstrap_prefill_tok_per_sec,
  activation_headroom_mb, min_free_floor_mb, output_reserve_tokens.
- neuron.example.toml: documented the new block.

No runtime behaviour change yet. fmt/clippy/test green.
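
The derivation rule above can be sketched as follows. The struct layouts,
the way the VRAM and throughput ceilings are computed, and rounding
*down* to a multiple of 1024 are assumptions; the log only states the
min/clamp/round shape and the config key names. Treating a free-VRAM
reading of 0 as "no VRAM ceiling" follows the later #67 commit in this
log.

```rust
pub struct ContextProfile {
    pub max_position_embeddings: u64,
    pub kv_bytes_per_token_per_card: u64,
}

/// Subset of [harness.candle.context_limit]; field names follow the log.
pub struct ContextLimitCfg {
    pub target_prefill_latency_secs: f64,
    pub activation_headroom_mb: u64,
    pub min_free_floor_mb: u64,
    pub output_reserve_tokens: u64,
}

pub struct Limit {
    pub context: u64,
    pub input: u64,
    pub output: u64,
}

pub fn derive_limit(
    profile: &ContextProfile,
    cfg: &ContextLimitCfg,
    free_tightest_mb: u64,
    prefill_tok_per_sec: f64,
    backstop: Option<u64>,
) -> Limit {
    // VRAM ceiling: tokens whose KV cache fits in the free memory left
    // after headroom and floor. 0 free MB is the unknown/CPU sentinel.
    let vram_ceiling = if free_tightest_mb == 0 || profile.kv_bytes_per_token_per_card == 0 {
        u64::MAX
    } else {
        let usable_mb =
            free_tightest_mb.saturating_sub(cfg.activation_headroom_mb + cfg.min_free_floor_mb);
        usable_mb * 1024 * 1024 / profile.kv_bytes_per_token_per_card
    };
    // Throughput ceiling: tokens that prefill within the target latency.
    let throughput_ceiling =
        (prefill_tok_per_sec * cfg.target_prefill_latency_secs).max(0.0) as u64;

    let mut context = profile
        .max_position_embeddings
        .min(vram_ceiling)
        .min(throughput_ceiling);
    if let Some(cap) = backstop {
        context = context.min(cap); // optional operator backstop
    }
    context -= context % 1024; // round down to a multiple of 1024

    let output = cfg.output_reserve_tokens.min(context);
    Limit { context, input: context - output, output }
}
```

With these assumptions, a card with 10000 MB free, 32768 KV bytes per
token and a 2000 tok/s prefill rate at a 30 s target is throughput-bound
rather than VRAM-bound.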

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 13:00:52 +03:00
2f67d17ec7 feat(neuron): emit reasoning_tokens usage details on streaming
All checks were successful
Closes #64.

opencode meters reasoning tokens separately via the OpenAI-standard
detail objects, which neuron's usage structs didn't expose. Add them
additively so older clients ignore them.

- cortex-core: Usage gains completion_tokens_details/prompt_tokens_details;
  ResponsesUsage gains output_tokens_details/input_tokens_details. Optional
  + skip_serializing_if, so the wire shape is unchanged for non-reasoning
  models. cached_tokens fields are defined but always None until prompt
  caching lands (#11).
- candle.rs: count tokens generated while in_reasoning across all three
  streaming paths (TP, worker, CPU); carry the count on InferenceEvent::Finish.
- chat projector: populate completion_tokens_details.reasoning_tokens.
- responses projector: wire up base usage emission on the streaming path
  (it emitted none before) and add output_tokens_details.reasoning_tokens.
- non-streaming paths leave details None (they don't track in_reasoning).

reasoning_tokens is a sub-count of completion/output tokens (OpenAI
semantics) — not added into total_tokens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 12:04:05 +03:00
11b2e6f78c fix(cortex): default models_config to the packaged absolute path
All checks were successful
cortex resolved the catalogue path "models.toml" relative to the service's
working directory, so the systemd-launched binary never found
/etc/cortex/models.toml and ran with an EMPTY catalogue in production —
limits, cost, pinning, aliases and feasibility were all silent no-ops,
with models surfacing only via the neuron poller. Tests never caught it
because they pass models_config explicitly; only the defaulted,
packaged path was broken.

Default to the absolute /etc/cortex/models.toml (where cortex.spec installs
it) and document the override in cortex.example.toml. Restores the #62
limit/cost advertisement (the catalogue is now actually read) along with
pinning/aliases/feasibility.
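
The defaulting fix amounts to something like the sketch below. The
function and constant names are hypothetical; only the
/etc/cortex/models.toml path comes from the commit.

```rust
use std::path::PathBuf;

/// Packaged location of the catalogue, as installed by cortex.spec.
pub const DEFAULT_MODELS_CONFIG: &str = "/etc/cortex/models.toml";

/// Use the operator's explicit models_config when set; otherwise fall
/// back to the absolute packaged path instead of a relative
/// "models.toml", which depends on the service's working directory.
pub fn resolve_models_config(configured: Option<&str>) -> PathBuf {
    PathBuf::from(configured.unwrap_or(DEFAULT_MODELS_CONFIG))
}
```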

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 10:04:29 +03:00
8a636c687f feat(cortex): per-model limit + cost on /v1/models; remove max_model_len
All checks were successful
Resolves #62. opencode's helexa provider discovers a model's serving
budget from /v1/models and uses it to size context, trigger compaction,
and show spend with no hand-configuration. Each model entry now carries:

  - limit { context, input?, output }  — operator-declared in models.toml
  - cost  { input, output, cache_read?, cache_write? }  — USD per 1M tokens
  - tool_call / reasoning  — runtime-detected by the candle harness and
    OR-ed in from each serving neuron

Composition: the catalogue profile supplies limit/cost (Pass 1); the
poller carries the neuron's detected tool_call/reasoning into ModelEntry,
which the gateway unions onto the entry (Pass 2); aliases propagate every
field (Pass 4). Wire types extend ModelInfo / ModelProfile /
CortexModelEntry additively (serde default + skip_serializing_if), so
older neurons and clients are unaffected. helexa-bench's ModelInfo
constructor and the gateway test fixtures are updated for the new fields.
Adds tests/model_limits.rs asserting /v1/models surfaces limit + cost
(catalogue) and tool_call + reasoning (runtime), and that max_model_len
is gone.

Removes max_model_len. It was write-only with no consumer — opencode's
source references it nowhere and it is not an OpenAI /v1/models field —
and doubly misleading: vLLM's max_model_len means total sequence length,
but cortex populated it from NEURON_MAX_PROMPT_TOKENS, a prompt-only cap.
The limit{} contract replaces it. The neuron's max_prompt_tokens remains
the enforced prompt cap (neuron-side); cortex just stops re-advertising a
derived, mis-named copy. Closes #66 — its stale-max_model_len premise is
moot once the field is gone.

limit/cost are operator-declared (catalogue) per #62's design; auto-
deriving the advertised budget from each neuron's reported cap is a
tracked follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 09:26:55 +03:00
6088830e7d feat(deploy): manage NEURON_MAX_PROMPT_TOKENS per host via model.conf drop-in
All checks were successful
Roll the per-model context cap into deploy.yml so it is deterministic per
host and rolled out (with a restart) alongside the rest of the service
config, rather than hand-edited in local.conf. The deploy now writes
/etc/systemd/system/neuron.service.d/model.conf from a new per-host
`max_prompt_tokens` matrix field, and restarts a neuron when the package
OR the drop-in changes — so a cap change applies even with no new RPM.

beast (Qwen3.6-27B, hybrid linear, 2x 32GB) -> 131072 (~128k); benjy and
quadbrat (dense, VRAM-bound) stay at 16384 but become deploy-managed.

Adds the scoped sudoers grant for the root-owned drop-in install, and
doc/context-limits.md documenting the knob relationships and KV/VRAM math
(refs #62 for the eventual /models-advertised source of truth, #65 for
the length-aware text VRAM guard that gates pushing beyond 128k).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 18:48:19 +03:00
04f798ec23 feat(cortex-gateway): enhance error responses with structured data
All checks were successful
fixes #63
Standardize error messages by adding type, code, and param fields to
align with OpenAI API format. Updates include:
- Structured error envelopes with broad type categorization
  (invalid_request_error/api_error)
- Specific machine-readable codes (model_not_found/service_unavailable)
- Null param field as required by OpenAI specification
- Consistent error response formatting across handlers, proxy, and
  routing layers

New tests verify correct error envelope structure for various failure
scenarios.

Co-Authored-By: Helexa (Qwen3.6-27B, 48k context) <noreply@helexa.ai>
2026-06-16 17:51:04 +03:00
6f3e9276cd docs: add AGENTS.md with project architecture, build commands, and conventions
All checks were successful
2026-06-16 14:15:32 +03:00
8f9e956d17 fix(neuron): emit OpenAI-standard nested error envelopes (#60)
All checks were successful
InferenceError responses were a flat `{"error": "..."}` string. OpenAI
clients (opencode, the openai SDK) reach into `error.type`/`error.code`
to drive behaviour — most importantly `code == "context_length_exceeded"`
triggers auto-compaction + retry instead of a hard failure. A flat string
is invisible to that logic.

Rewrite `inference_error_response` to emit the nested envelope
`{"error": {"message","type","code","param", ...diagnostics}}` and map:

- ModelNotLoaded   → 404 invalid_request_error / model_not_found
- PromptTooLong    → 400 invalid_request_error / context_length_exceeded
  (message: "maximum context length is N tokens", + prompt_len/max)
- InsufficientVram → 503 api_error / insufficient_vram
- VisionUnsupported→ 400 invalid_request_error / vision_unsupported
- TemplateRenderFailed → 422 invalid_request_error / template_render_failed
- Other            → 500 api_error / null code

Diagnostic extras ride inside the error object so the envelope shape is
stable. Both inline match blocks in the chat-completions handler
(streaming + non-streaming) now defer to the shared helper, which the
responses handler already used — one source of truth.

Adds 4 unit tests covering the envelope shape and codes. Also fixes a
pre-existing clippy lint (cloned_ref_to_slice_refs) in qwen3_5 snapshot
test surfaced by a newer clippy.
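
The mapping table above, written out as a match. This is a sketch: the
variant payloads and the `classify` name are assumptions, and the real
`inference_error_response` also builds the JSON envelope, which is
omitted here.

```rust
/// Hypothetical shape of the error enum; payloads are illustrative.
#[derive(Debug)]
pub enum InferenceError {
    ModelNotLoaded,
    PromptTooLong { prompt_len: usize, max: usize },
    InsufficientVram,
    VisionUnsupported,
    TemplateRenderFailed,
    Other,
}

/// Returns (HTTP status, error.type, error.code) for the nested envelope.
/// A None code serializes as JSON null.
pub fn classify(err: &InferenceError) -> (u16, &'static str, Option<&'static str>) {
    use InferenceError::*;
    match err {
        ModelNotLoaded => (404, "invalid_request_error", Some("model_not_found")),
        PromptTooLong { .. } => (400, "invalid_request_error", Some("context_length_exceeded")),
        InsufficientVram => (503, "api_error", Some("insufficient_vram")),
        VisionUnsupported => (400, "invalid_request_error", Some("vision_unsupported")),
        TemplateRenderFailed => (422, "invalid_request_error", Some("template_render_failed")),
        Other => (500, "api_error", None),
    }
}

/// Human-readable message; only PromptTooLong is spelled out in the log.
pub fn message(err: &InferenceError) -> String {
    match err {
        InferenceError::PromptTooLong { max, .. } => {
            format!("maximum context length is {max} tokens")
        }
        other => format!("{other:?}"),
    }
}
```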

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 20:42:14 +03:00
cb758d4706 feat(neuron): emit usage on the streaming path so clients can track context
All checks were successful
The deeper reason opencode showed "Context: 0 tokens / 0% used" and flew
into a 400: streaming responses carried NO `usage`. Clients track context
(and trigger compaction) from the `usage` field; the legacy candle
streaming path set `usage: None` on every chunk, so a streaming client
had no token count at all — `max_model_len` alone is a denominator with
no numerator.

InferenceEvent::Finish now carries prompt_tokens + completion_tokens
(the streaming loops already have both: prompt_tokens.len() and the
generated all_tokens.len()). The openai_chat projector emits an
OpenAI-style trailing usage chunk (empty `choices`, populated `usage`)
after the finish chunk. cortex's Anthropic stream translator already
reads chunk.usage, so this fixes context tracking on BOTH the OpenAI
(opencode) and Anthropic (Claude Code) paths.

Also harden the discovery side of the max_model_len plumbing: cortex now
re-polls /discovery while a neuron's max_prompt_tokens is still 0
(unknown). A rolling-deploy race, where cortex caches discovery before
the neuron has the field, now self-heals instead of pinning
max_model_len to None until a manual cortex restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 19:43:59 +03:00
a2d2dbd006 feat: advertise max_model_len on /v1/models so clients can compact
All checks were successful
opencode (and any OpenAI/Anthropic client) couldn't size or compact its
context against helexa because /v1/models never advertised a context
window — opencode showed "0 tokens / 0% used" and flew straight into a
400 PromptTooLong once a conversation + a fetched 64KB log overflowed the
49152-token cap. Compaction is the client's job, but the client needs to
know the limit to do it.

neuron now reports its effective prompt cap (NEURON_MAX_PROMPT_TOKENS)
in GET /discovery (`max_prompt_tokens`). cortex surfaces it on
/v1/models as `max_model_len` (vLLM / OpenAI-compatible convention) per
model — the smallest cap among the neurons that can serve it
(feasible_on ∪ locations), so the advertised limit holds wherever the
request routes. A neuron reporting 0 predates the field and is treated
as unknown (skipped); models with no reporting neuron omit the field.

helexa still rejects over-limit prompts with a clean 400 — this just
gives clients the number to compact *before* hitting it.
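
The selection rule described above (smallest cap wins, 0 means unknown,
no reporters means no field) fits in a few lines. The function name and
the `u32` type are assumptions for illustration.

```rust
/// Smallest non-zero prompt cap among the neurons that can serve a
/// model. A cap of 0 marks a neuron that predates the field and is
/// skipped; None means no neuron reported, so the field is omitted.
pub fn advertised_max_model_len(caps: &[u32]) -> Option<u32> {
    caps.iter().copied().filter(|&cap| cap != 0).min()
}
```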

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 19:11:13 +03:00
544214d0f8 fix(neuron): normalize OpenAI string tool-call arguments before rendering
All checks were successful
opencode (OpenAI path, /v1/chat/completions passthrough) hit the same
chat_template:120 failure Claude Code did — "cannot convert value into
pairs" — because the OpenAI wire format carries
tool_calls[].function.arguments as a JSON *string*, while Qwen3.6's
template iterates it as a dict (`arguments | items`). The Anthropic-side
fix (8880b2f) only covered cortex's translation; the OpenAI path reaches
neuron unchanged.

render_chat_template now normalizes string-form tool-call arguments to
objects across all messages before building the Jinja context, so OpenAI
and Anthropic clients both render. Object args (Anthropic path) pass
through untouched; a string that doesn't parse is left as-is and the
render fails loudly (422 TemplateRenderFailed, a94dd55) rather than
silently dropping tools.

The loud-fail change paid off immediately here: opencode got a clean
422 with the exact `chat_template:120` cause instead of a degraded
session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:13:36 +03:00
a94dd55ab8 feat(neuron): fail loud (422) when a tools-bearing request can't render
All checks were successful
Three of this session's bugs (system-message position, tool_call argument
shape, and the original tool rendering) all hid behind the same silent
behaviour: chat_template render fails → neuron falls back to
format_qwen3_prompt, which drops every tool → the request still returns
200 with degraded, tool-less output. Each cost real debugging time
because the failure was invisible on the wire.

build_prompt_for_request now returns Result. On a render failure it
checks whether the request carried tools: if so it returns the new
InferenceError::TemplateRenderFailed (mapped to 422 with a
template_render_failed code and the underlying Jinja error), instead of
silently degrading. A render failure with no tools still falls back
quietly — there's nothing to lose, and `format_qwen3_prompt` is a
reasonable text-only prompt. The four prompt-build call sites propagate
with `?`.

Now the next client/template incompatibility surfaces as a loud 422 the
operator sees immediately, not a mysteriously-degraded session.
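
The decision above reduces to a three-way match. In this sketch the
renderer and the text-only fallback are passed in as closures so it stays
self-contained; the real `build_prompt_for_request` signature is not
shown in this log, and `build_prompt` and the error payload are
illustrative.

```rust
#[derive(Debug, PartialEq)]
pub enum InferenceError {
    /// Carries the underlying template error text.
    TemplateRenderFailed(String),
}

/// `render` stands in for the chat-template renderer and `fallback` for
/// a text-only formatter such as format_qwen3_prompt.
pub fn build_prompt<R, F>(has_tools: bool, render: R, fallback: F) -> Result<String, InferenceError>
where
    R: FnOnce() -> Result<String, String>,
    F: FnOnce() -> String,
{
    match render() {
        Ok(prompt) => Ok(prompt),
        // Tools would be silently dropped by the fallback: fail loudly.
        Err(cause) if has_tools => Err(InferenceError::TemplateRenderFailed(cause)),
        // No tools to lose: the text-only prompt is an acceptable fallback.
        Err(_) => Ok(fallback()),
    }
}
```

Call sites can then propagate the error with `?` and let the shared error
helper turn it into a 422.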

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:48:31 +03:00
8880b2f8a6 fix(cortex): emit tool_call arguments as an object so Qwen3.6 can chain tools
All checks were successful
Verified live via the rendered-prompt trace: once a tool call is in the
conversation history, the Qwen3.6 chat template fails to render —

  render chat_template: invalid operation: cannot convert value into
  pairs (in chat_template:120)

because line 120 iterates `tool_call.arguments | items` (treats arguments
as a dict), while cortex emitted the OpenAI-standard JSON *string*. On
that render error neuron silently falls back to a tool-less prompt, so
the model loses every tool the moment it makes one call — it can make the
first tool call, read the result, then can only narrate ("now let me
check the runs") and stop, because the next turn has no tools. That's the
"drops the ball a little later" symptom: the CC trace shows the get_me
turn rendering 42653 tokens (tools present) and every subsequent
tool-history turn falling back to ~6k tokens (tools gone).

anthropic_to_openai now passes `function.arguments` as the parsed object
rather than stringifying it. Tests updated to expect the object form.

This is the same silent-fallback failure class as the system-message
merge (295b10c) — which is why making neuron's template-render fallback
LOUD (4xx on a tools-bearing request instead of a degraded 200) is now
clearly worth doing: it would have surfaced both in seconds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 16:43:17 +03:00
4e8f4e0d04 fix(neuron): don't generate <think> reasoning when the client drops it
All checks were successful
Verified live: Qwen/Qwen3.6-27B with a simple prompt and max_tokens=400
generated 400 tokens, finish_reason=length, and 0 visible characters —
the model spent the ENTIRE budget on <think> reasoning, which we then
drop for OpenAI/Anthropic clients (include_thinking=false), starving the
visible answer. This is why Claude Code "dropped the ball": empty or
truncated responses. A/B confirms the cause — same prompt with
chat_template_kwargs.enable_thinking=false yields a full 545-char answer.

The earlier prompt_opens_reasoning fix stopped the reasoning *leaking* as
text but left it consuming the token budget. Couple the two: when the
caller isn't going to see the reasoning (include_thinking=false, the
default), default chat_template_kwargs.enable_thinking to false so the
model doesn't generate it. An explicit client enable_thinking wins;
thinking-aware clients (helexa-acp, x-include-thinking: true) keep
reasoning on. Tests cover the default (false), surfacing (true), explicit
override, and preservation of other kwargs.
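
The defaulting rule above can be sketched roughly as follows. This is a
simplified illustration, not the actual neuron code: the helper name
`default_enable_thinking` is hypothetical, and the kwargs are modelled
as a plain string-to-bool map rather than arbitrary JSON.

```rust
use std::collections::BTreeMap;

/// Fill in `enable_thinking` only when the client did not set it.
/// `include_thinking` says whether the caller will be shown the reasoning.
fn default_enable_thinking(kwargs: &mut BTreeMap<String, bool>, include_thinking: bool) {
    // An explicit client value is left alone; every other kwarg is untouched.
    kwargs
        .entry("enable_thinking".to_string())
        .or_insert(include_thinking);
}
```

With `include_thinking=false` and no explicit value this yields
`enable_thinking=false`; an explicit `true` from the client survives.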

Note: this applies only to the /v1/chat/completions path (what Claude
Code uses via cortex /v1/messages); /v1/responses could get the same
defaulting as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 15:00:50 +03:00
295b10c103 fix(cortex): merge all system content into one leading system message
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m49s
build-prerelease / Build helexa-bench binary (push) Successful in 2m14s
build-prerelease / Build cortex binary (push) Successful in 2m54s
build-prerelease / Test (push) Successful in 5m21s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Successful in 2m3s
build-prerelease / Build neuron-ampere (push) Successful in 2m52s
build-prerelease / Package cortex RPM (push) Successful in 1m34s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 54s
Verified live via neuron trace: Claude Code's real requests carry a
top-level `system` AND a `role:"system"` turn inside `messages`. cortex
passed the latter through at a non-first position, and Qwen3.6's chat
template hard-rejects it:

  WARN chat_template render failed; falling back to format_qwen3_prompt
  error=... invalid operation: System message must be at the beginning.

On that render error neuron silently falls back to a template that
renders NO tools, so the model got zero tool-format guidance and
improvised an unparseable `<tool><name>…` syntax — tool calling broke
entirely for real CC traffic, even though synthetic single-system
probes (and the earlier translation/parse fixes) worked.

anthropic_to_openai now accumulates the top-level `system` plus every
`role:"system"` conversation turn and emits a single system message at
index 0, with the non-system turns following in order. Reproduced the
trigger (system-role message at index>0 → fallback) and the fix
(merged → template renders tools). Test covers the merge + ordering.
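
A minimal sketch of the merge, assuming a simplified `Msg` type with
plain-text content and a blank-line separator between merged parts (both
are illustrative assumptions; the real translation works on content
blocks):

```rust
#[derive(Debug, Clone, PartialEq)]
struct Msg {
    role: String,
    content: String,
}

/// Collect the top-level system text plus every system-role turn into one
/// leading system message; non-system turns keep their relative order.
fn merge_system(top_level: Option<&str>, messages: &[Msg]) -> Vec<Msg> {
    let mut system_parts: Vec<String> = Vec::new();
    if let Some(text) = top_level {
        system_parts.push(text.to_string());
    }
    let mut rest: Vec<Msg> = Vec::new();
    for m in messages {
        if m.role == "system" {
            system_parts.push(m.content.clone());
        } else {
            rest.push(m.clone());
        }
    }
    let mut out = Vec::with_capacity(rest.len() + 1);
    if !system_parts.is_empty() {
        out.push(Msg {
            role: "system".to_string(),
            // Assumption: parts are joined with a blank line.
            content: system_parts.join("\n\n"),
        });
    }
    out.extend(rest);
    out
}
```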

Secondary hardening worth a follow-up: neuron's silent template
fallback drops tools without surfacing it to the client — a render
failure on a tools-bearing request should arguably 4xx rather than
degrade invisibly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 14:09:08 +03:00
1c485aedce feat(neuron): trace the fully rendered chat-template prompt
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 27s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 1m31s
build-prerelease / Build neuron-ampere (push) Successful in 2m13s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m45s
build-prerelease / Build neuron-ada (push) Successful in 3m31s
build-prerelease / Test (push) Successful in 4m39s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m49s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m40s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
Debugging tool-call format drift (Qwen3.6-27B emitting wrapper-less
<tool><name>… under Claude Code's real system prompt + 120-tool list,
which neuron's <tool_call> detector can't parse) needs ground truth on
what the model actually sees. neuron logged nothing about the rendered
prompt. Add a trace! in build_prompt_for_request emitting the full
rendered prompt + char count + tool count, so we can see whether the
chat template's <tool_call> format instruction survives a large system
prompt and how the tools render. Gated at trace (the prompt can be tens
of KB): RUST_LOG=neuron::harness::candle=trace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:38:51 +03:00
b3dc835375 ci: bound job runtime + stop dropping sccache on rustc signal-death
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m23s
build-prerelease / Build cortex binary (push) Successful in 2m29s
build-prerelease / Build helexa-bench binary (push) Successful in 2m34s
build-prerelease / Test (push) Successful in 4m33s
build-prerelease / Build neuron-blackwell (push) Successful in 1m31s
build-prerelease / Build neuron-ada (push) Successful in 2m13s
build-prerelease / Build neuron-ampere (push) Successful in 2m50s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 55s
A neuron-blackwell build hung ~90 min (siblings finished in 2) and there
was no job timeout to kill it, so it sat burning a runner. Root cause of
the hang: the inline retry loop treated every failure identically and, on
its final attempt, rebuilt with sccache disabled. When the real failure
is a rustc SIGSEGV or an OOM-kill, an uncached rebuild does *more* work
under the same memory pressure — turning one transient compiler crash
into a wedged job.

Two fixes:

1. timeout-minutes on every job in build-prerelease.yml and ci.yml
   (builds 25, neuron CUDA build/cuda-check 35, packaging 20, COPR 60,
   fast jobs 10-15). A hang now dies in minutes, not hours.

2. New script/ci-cargo-escalate.sh replaces the five (prerelease) + three
   (ci) inline escalation loops. It classifies the failure:
     - signal death (exit >=128, or cargo reporting `signal: N`/SIGSEGV/
       SIGKILL) → compiler crash, NOT an sccache fault: keep the cache,
       one warm retry, then fail fast. Never escalate to uncached.
     - sccache fault (recognisable sccache error) → restart the server,
       retry, then one final uncached attempt.
     - deterministic compile/test error → fail fast (no wasteful retry).
   It also folds in the CUDA-image sccache probe the neuron/cuda-check
   jobs did inline. Classification verified locally against success,
   plain failure, exit-139, and the cargo-wrapped `signal: 11` form.
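
The classification can be sketched like this. The real logic lives in a
shell script; this Rust rendering is only an illustration, and the
`sccache: error` marker string and the step names are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum Failure {
    Success,
    SignalDeath,
    SccacheFault,
    Deterministic,
}

/// Classify a cargo invocation from its exit code and captured output.
fn classify(exit_code: i32, output: &str) -> Failure {
    if exit_code == 0 {
        return Failure::Success;
    }
    // Exit >= 128 is the shell convention for death by signal; cargo may
    // instead report the signal of a crashed rustc in its own output.
    let signal = exit_code >= 128
        || output.contains("signal: ")
        || output.contains("SIGSEGV")
        || output.contains("SIGKILL");
    if signal {
        return Failure::SignalDeath;
    }
    // Assumption: this marker stands in for "recognisable sccache error".
    if output.contains("sccache: error") {
        return Failure::SccacheFault;
    }
    Failure::Deterministic
}

/// Remaining steps to attempt after the first failure, in order.
fn retry_plan(failure: &Failure) -> Vec<&'static str> {
    match failure {
        Failure::SignalDeath => vec!["warm retry"],
        Failure::SccacheFault => vec!["restart sccache", "retry", "uncached retry"],
        Failure::Success | Failure::Deterministic => vec![],
    }
}
```

Checking signal death before the sccache marker keeps a compiler crash
from ever escalating to an uncached rebuild.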

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:02:50 +03:00
746d84c0fb fix(neuron): seed in_reasoning from the prompt so Qwen3.6 thinking isn't leaked
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 2m3s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m18s
build-prerelease / Build neuron-ada (push) Successful in 2m15s
build-prerelease / Test (push) Successful in 4m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Qwen3.6's chat template injects the opening <think> into the generation
prompt, so generation begins mid-thought and the open marker is never
sampled. The streaming loops flipped in_reasoning to true only on a
*generated* open token, so they stayed in text mode and streamed the
model's reasoning out as visible text — verified live: a tool request
returned a 255-char text block of chain-of-thought ("The user wants to
know the weather… I will construct the function call now.") ahead of the
tool_use block, with the trailing </think> stripped (close token
recognised) but no opening <think>.

Each streaming loop now seeds in_reasoning by replaying the prompt's
reasoning markers (new `prompt_opens_reasoning`): if the prompt ends
inside an open <think>, the loop starts in reasoning mode, the thinking
routes to ReasoningDelta (dropped by the chat projector's default
include_thinking=false, which is what cortex uses), and the model's
</think> flips back to visible text for the answer/tool call. Template-
agnostic and self-correcting: a prompt that doesn't open reasoning (no
think injection, enable_thinking off, non-reasoning model) starts false,
preserving current behaviour. Thinking is hidden, not disabled, so answer
quality is unaffected.
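
A minimal sketch of the seeding check, assuming the literal `<think>` /
`</think>` markers. The function name comes from this commit, but the
body is an illustration: it compares the last open and close markers,
which gives the same end state as replaying them in order.

```rust
/// True when the rendered prompt ends inside an unclosed <think> block.
fn prompt_opens_reasoning(prompt: &str) -> bool {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";
    match (prompt.rfind(OPEN), prompt.rfind(CLOSE)) {
        // Both present: reasoning is open only if the last marker is an open.
        (Some(open), Some(close)) => open > close,
        // Opened and never closed.
        (Some(_), None) => true,
        // No open marker at all (or only a stray close).
        _ => false,
    }
}
```

A streaming loop would then start with
`let mut in_reasoning = prompt_opens_reasoning(&prompt);` instead of
`false`.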

Applied to all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming). Test covers
open/close replay, multi-turn closed state, reopen-at-tail, and the
no-pair pass-through.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 11:03:26 +03:00
f15b9e2848 fix(neuron): parse Qwen-XML tool calls + emit tool_use stop_reason
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 2m2s
build-prerelease / Build neuron-ada (push) Successful in 2m7s
build-prerelease / Build neuron-ampere (push) Successful in 2m16s
build-prerelease / Test (push) Successful in 4m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m34s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 52s
Verified live (commit d662fa2 logs): cortex now delivers OpenAI-shaped
tools to neuron correctly, but Qwen3.6-27B emits tool calls in the
Qwen-XML form inside the <tool_call> markers —

    <tool_call>
    <function=get_weather>
    <parameter=city>
    Brno
    </parameter>
    </function>
    </tool_call>

— while parse_tool_call_body only did serde_json::from_str expecting
{"name":…,"arguments":…}. It returned None, the dispatch re-emitted the
raw block as a text delta, and clients saw the markup as prose. cortex
logged upstream_tool_calls=false finish_reason="stop".

parse_tool_call_body is now format-tolerant: JSON first (Qwen3-Instruct
/ Hermes), then a Qwen-XML parser (Qwen3-Coder / Qwen3.6). Each
<parameter> value is coerced to its declared JSON type using a new
ToolSchemas map built from the request's tools (string stays string,
integer/number/boolean/object/array coerced, mistyped values fall back
to string so an argument is never dropped). build_tool_schemas is
threaded into all three streaming loops (inference_tp_stream,
stream_inference_via_worker, run_inference_streaming).
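
A simplified sketch of the Qwen-XML branch with schema-driven coercion.
It is stdlib-only and illustrative: the `Arg` and `ToolCall` types, the
flat parameter-name-to-type schema map, and the function names are
assumptions, the JSON-first branch is omitted, and only
string/integer/boolean are handled.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Arg {
    Str(String),
    Int(i64),
    Bool(bool),
}

#[derive(Debug, PartialEq)]
struct ToolCall {
    name: String,
    args: Vec<(String, Arg)>,
}

/// Coerce a raw value to its declared type; a mistyped value stays a
/// string so the argument is never dropped.
fn coerce(raw: &str, declared: Option<&str>) -> Arg {
    let as_str = || Arg::Str(raw.to_string());
    match declared {
        Some("integer") => raw.parse::<i64>().map(Arg::Int).unwrap_or_else(|_| as_str()),
        Some("boolean") => raw.parse::<bool>().map(Arg::Bool).unwrap_or_else(|_| as_str()),
        _ => as_str(),
    }
}

/// Parse `<function=NAME><parameter=KEY>VALUE</parameter>...</function>`.
/// Returns None when the body has no recognisable function tag.
fn parse_qwen_xml(body: &str, schema: &HashMap<String, String>) -> Option<ToolCall> {
    let after = body.split_once("<function=")?.1;
    let (name, mut rest) = after.split_once('>')?;
    let mut args = Vec::new();
    while let Some((_, tail)) = rest.split_once("<parameter=") {
        let (key, tail) = tail.split_once('>')?;
        let (raw, tail) = tail.split_once("</parameter>")?;
        let key = key.trim();
        let declared = schema.get(key).map(String::as_str);
        args.push((key.to_string(), coerce(raw.trim(), declared)));
        rest = tail;
    }
    Some(ToolCall { name: name.trim().to_string(), args })
}
```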

Each loop also tracks emitted_tool_call and promotes the terminal
finish_reason from Stop to ToolCalls when a call parsed, so the OpenAI
chunk carries finish_reason:"tool_calls" and cortex maps it to Anthropic
stop_reason:"tool_use" — without which an Anthropic agent (Claude Code)
sees a tool_use block but stop_reason:end_turn and may not run the tool.
FinishReason::ToolCalls drops its dead_code allow.
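
The promotion and its downstream mapping can be sketched as below. The
enum is a cut-down stand-in for neuron's `FinishReason`, the function
names are hypothetical, and the `Length` to `max_tokens` arm is an
assumption about cortex's mapping rather than something this commit
states.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum FinishReason {
    Stop,
    Length,
    ToolCalls,
}

/// Promote a plain Stop to ToolCalls when the loop emitted a parsed call.
fn terminal_finish_reason(raw: FinishReason, emitted_tool_call: bool) -> FinishReason {
    match (raw, emitted_tool_call) {
        (FinishReason::Stop, true) => FinishReason::ToolCalls,
        (other, _) => other,
    }
}

/// Gateway-side mapping to the Anthropic stop_reason string.
fn anthropic_stop_reason(reason: FinishReason) -> &'static str {
    match reason {
        FinishReason::Stop => "end_turn",
        FinishReason::Length => "max_tokens", // assumed mapping
        FinishReason::ToolCalls => "tool_use",
    }
}
```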

Tests: JSON form still parses; Qwen-XML multi-param parse with
schema-driven string/integer/boolean coercion; no-schema type sniffing;
type-mismatch string fallback; unparseable body returns None.

Known gap (separate): the non-streaming run_inference paths have no
tool-call handling at all; Claude Code streams, so the streaming loops
are the ones that matter here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 10:39:38 +03:00
d662fa20ef fix(cortex): translate Anthropic tools to OpenAI shape + wire-debug logging
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m20s
build-prerelease / Build helexa-bench binary (push) Successful in 2m6s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Test (push) Successful in 4m12s
build-prerelease / Build neuron-blackwell (push) Successful in 1m38s
build-prerelease / Build neuron-ada (push) Successful in 2m5s
build-prerelease / Build neuron-ampere (push) Successful in 4m44s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m41s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 53s
Claude Code (ANTHROPIC_BASE_URL -> cortex) hits POST /v1/messages, but
anthropic_to_openai forwarded the request's `tools` array verbatim via
the flattened `extra`. neuron feeds that straight into the HF chat
template, which iterates the OpenAI shape (tool.function.name/.parameters).
Anthropic-shaped tools ({name, description, input_schema}) rendered as
broken/empty definitions, the model improvised an unparseable
<tool_use_name>...</tool_use_name> tool-call format, neuron's
<tool_call>{json}</tool_call> detector missed it, and the markup fell
through as plain assistant text — so CC never received a structured
tool_use and the agent loop died.

Request-side translation now reshapes:
- tool definitions: {name, description, input_schema}
  -> {type:"function", function:{name, description, parameters}}
- tool_choice: auto->"auto", any->"required", none->"none",
  tool->{type:"function",function:{name}}
- assistant tool_use blocks -> OpenAI assistant.tool_calls
  (arguments JSON-stringified) — fixes multi-turn
- user tool_result blocks -> standalone role:"tool" messages keyed by
  tool_call_id
- system content blocks flatten to text instead of being JSON-serialised
  into the prompt; best-effort image-block -> image_url part
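
A rough sketch of the first two reshaping rules. The struct and function
names are hypothetical, schemas are carried as raw JSON text to keep the
example stdlib-only, and the real code operates on JSON values.

```rust
/// Anthropic-shaped tool definition; `input_schema` is raw JSON text here.
struct AnthropicTool {
    name: String,
    description: String,
    input_schema: String,
}

#[derive(Debug, PartialEq)]
struct OpenAiFunction {
    name: String,
    description: String,
    parameters: String,
}

#[derive(Debug, PartialEq)]
struct OpenAiTool {
    kind: &'static str, // serialised as "type" on the wire
    function: OpenAiFunction,
}

/// {name, description, input_schema} -> {type:"function", function:{...}}
fn translate_tool(tool: AnthropicTool) -> OpenAiTool {
    OpenAiTool {
        kind: "function",
        function: OpenAiFunction {
            name: tool.name,
            description: tool.description,
            parameters: tool.input_schema,
        },
    }
}

#[derive(Debug, PartialEq)]
enum OpenAiToolChoice {
    Mode(&'static str),
    Function { name: String },
}

/// auto->"auto", any->"required", none->"none", tool->named function.
/// Unknown kinds (or `tool` without a name) yield None.
fn translate_tool_choice(kind: &str, name: Option<&str>) -> Option<OpenAiToolChoice> {
    match (kind, name) {
        ("auto", _) => Some(OpenAiToolChoice::Mode("auto")),
        ("any", _) => Some(OpenAiToolChoice::Mode("required")),
        ("none", _) => Some(OpenAiToolChoice::Mode("none")),
        ("tool", Some(n)) => Some(OpenAiToolChoice::Function { name: n.to_string() }),
        _ => None,
    }
}
```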

Wire-debug instrumentation (tracing levels only; cortex/neuron ship at
info, operator infra runs at debug):
- every handler emits a debug! "inbound request" line tagging the wire
  surface (anthropic | openai-chat | openai-responses | openai-completions)
  plus model/stream/tools and, for Anthropic, tool_history/system
- response side reports upstream_tool_calls + finish_reason, streaming
  and non-streaming
- full inbound + translated-upstream bodies at trace! (UTF-8-safe, capped)

Tests: 8 request-side unit tests + an end-to-end gateway test asserting
the upstream neuron receives OpenAI-shaped tools and a
user->assistant(+tool_calls)->tool->user history.

Also tighten script/infra-log-verbosity.sh: independent cortex/neuron
RUST_LOG args, cortex-only by default (neuron restart behind
--with-neuron so we don't needlessly cold-reload models), mkdir -p the
drop-in dir, symmetric RUST_LOG cleanup, and set -euo pipefail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 09:58:25 +03:00
d04f4ad704 feat(bench): show GPUs as the resource name instead of hostnames
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Successful in 2m34s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m54s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m15s
build-prerelease / Test (push) Successful in 5m11s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 56s
Public visitors don't know the hostnames, so surface each host's GPU(s)
as the resource name across the UI.

- store: gpu_label() turns the stored gpus_json into a compact label
  ("2× RTX 5090", "RTX 4090"); add `gpu` to ReportRow + RunRow and
  `host_gpus`/`model_gpus` maps to /api/dimensions (from each one's
  latest run). render_json gains gpu too.
- UI: Overview + Runs show a "GPU" column (gpu, fallback host); Runs'
  filter is now GPU-labelled (still filters by host underneath); Trends
  shows a "Measured on <gpu>" line for the selected model.
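
For illustration, the labelling could look roughly like this. It takes
an already-decoded list of GPU names rather than the stored gpus_json,
and the " + " separator for mixed hosts is an assumption.

```rust
/// Build a compact label such as "2× RTX 5090" from a host's GPU names.
fn gpu_label(gpus: &[&str]) -> String {
    // Count identical names while preserving first-seen order.
    let mut groups: Vec<(&str, usize)> = Vec::new();
    for &gpu in gpus {
        if let Some(i) = groups.iter().position(|(name, _)| *name == gpu) {
            groups[i].1 += 1;
        } else {
            groups.push((gpu, 1));
        }
    }
    groups
        .iter()
        .map(|(name, count)| {
            if *count > 1 {
                format!("{count}× {name}")
            } else {
                name.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" + ") // assumed separator for hosts with mixed GPUs
}
```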

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:29:13 +03:00
e3879f093a feat(bench-ui): drop host selector from Trends; resolve host server-side
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m38s
build-prerelease / Test (push) Successful in 4m47s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Successful in 2m2s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m22s
Public visitors don't know the hostnames or per-host hardware, so the
host picker on Trends was confusing. Select by model + scenario only;
/api/series now takes host as optional and resolves it to the host
serving that (model, scenario) — coherent since each model maps to one
host today. Runs (drill-down) keeps its host filter.
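
A small sketch of the server-side resolution, assuming an in-memory list
of runs in place of the store query (the `Run` type and `resolve_host`
name are hypothetical):

```rust
struct Run<'a> {
    host: &'a str,
    model: &'a str,
    scenario: &'a str,
}

/// Use the requested host if given; otherwise pick the host serving the
/// (model, scenario) pair. Relies on each model mapping to one host.
fn resolve_host<'a>(
    requested: Option<&'a str>,
    model: &str,
    scenario: &str,
    runs: &[Run<'a>],
) -> Option<&'a str> {
    requested.or_else(|| {
        runs.iter()
            .find(|r| r.model == model && r.scenario == scenario)
            .map(|r| r.host)
    })
}
```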

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:19:09 +03:00
e4b9b88de0 feat(bench-ui): mark the baseline↔live regime boundary on Trends
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Add a dashed vertical ReferenceLine at the first live build (labelled
"bench.py → helexa-bench") so the intentional gap between the gateway
baseline and the direct-to-neuron series reads as a deliberate
measurement-regime change, not missing data. The two series stay
unconnected by design (different regimes, not directly comparable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:13:34 +03:00
21db334e37 feat(bench-ui): overlay pre-helexa-bench baseline on Trends
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Option C: a curated static baseline (bench/src/baseline.ts), transcribed
from doc/benchmarks.md (8f6f1d3 + a1952a4 post-#11), overlaid on the
Trends charts as a dashed, clearly-labelled historical series ahead of
the bench era. Host inferred from model via the doc's fleet table;
ordered by snapshot time so it anchors the timeline.

Kept deliberately separate from the live series (no DB/API change) — the
baseline is a different regime (bench.py through the cortex gateway,
medians only) so it's never merged into the direct-to-neuron line; a
caption spells out the distinction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:02:43 +03:00
7dd1ddcfba fix(infra-setup): stat LE live dir via sudo; rsync provisioner secret for bench.internal issuance
Some checks failed
build-prerelease / Resolve version stamps + change detection (push) Failing after 11m1s
build-prerelease / Lint (fmt + clippy) (push) Has been cancelled
build-prerelease / Test (push) Has been cancelled
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build helexa-bench binary (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-bench RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
- cert_present() must `sudo test -d /etc/letsencrypt/live/...` (root-only
  0700); without sudo it falsely reported "no cert" and downgraded the
  bench.helexa.ai vhost to the http-only bootstrap (dropping its 443
  server). Now correctly keeps the full TLS vhost.
- bench.internal initial cert: rsync the operator's JWK 'lair' provisioner
  password to the host transiently (root, 0600), issue via
  step ca certificate, then remove it (trap + belt-and-suspenders rm).

Verified: bench.helexa.ai (LE) and bench.internal (lair CA) both serve the
SPA + /api→bob; step@bench.timer renews; secret removed from host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:40:38 +03:00
4ee7da4f97 feat(bench-ui): internal vhost bench.internal + step@ cert renewal
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Inside the WireGuard mesh, bench.helexa.ai dead-ends at the OPNsense LAN
interface (only WAN :443 is port-forwarded), so add an internal path:

- asset/nginx/bench.internal.conf — server_name bench.internal, internal
  "lair" CA cert, same SPA + /api→bob proxy. Mirrors the *.internal vhost
  convention on oolon.kosherinata.internal.
- asset/systemd/step@.{service,timer} — replicate oolon's smallstep cert
  renewal (step ca renew via mTLS, every 15 min, reload nginx).
- infra-setup.sh: install the step@ units + /etc/nginx/tls/{cert,key},
  install the vhost + enable step@bench.timer once the cert exists; prints
  the one-time issuance command otherwise.

Initial cert issuance (JWK provisioner) and bench.internal DNS are
operator steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:34:38 +03:00
db3cb95cbf fix(infra-setup): provision bench.helexa.ai cert via Cloudflare DNS-01 (ecdsa)
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Test (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build helexa-bench binary (push) Has been skipped
build-prerelease / Package helexa-bench RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
The webroot/http-01 approach needed nginx serving :80, but the gateway's
nginx was dormant. Switch to the host's established convention —
certbot --dns-cloudflare --key-type ecdsa with /root/.certbot-internal —
which needs neither nginx nor :80, so the cert provisions independently
of the vhost being served. Also restorecon the webroot (SELinux
enforcing → nginx 403 without httpd_sys_content_t), and only ever
install the full TLS vhost once the cert exists (http-only bootstrap
otherwise) so `nginx -t` always passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:54:24 +03:00
37c19aa985 feat(bench-ui): public hosting at https://bench.helexa.ai via gateway nginx
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build neuron-blackwell (push) Successful in 1m32s
build-prerelease / Build neuron-ada (push) Successful in 2m15s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m29s
build-prerelease / Build helexa-bench binary (push) Successful in 2m25s
build-prerelease / Build cortex binary (push) Successful in 2m39s
build-prerelease / Build neuron-ampere (push) Successful in 2m48s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m30s
build-prerelease / Test (push) Successful in 4m38s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m36s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m37s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 51s
nginx on the gateway serves the bench SPA and reverse-proxies /api to the
bob bench API over WireGuard — public, auth-less, same-origin (no CORS),
internal API stays private.

- asset/nginx/bench.helexa.ai.conf (full TLS vhost: SPA + /api proxy) and
  a bootstrap http-only vhost for the initial ACME challenge.
- infra-setup.sh: one-time gateway setup — webroot, Let's Encrypt cert
  (certbot webroot, idempotent), install + enable the vhost.
- deploy.yml: deploy-bench-ui builds the SPA (setup-node) and rsyncs
  dist/ to /var/www/bench.helexa.ai every deploy; built same-origin so
  no VITE_API_BASE.
- cortex-host.conf: scoped gitea_ci rsync grant for the webroot.
- bench/README: production hosting notes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:40:29 +03:00
f50f5531cf feat(bench): read-only JSON API on bob + bench/ React visualisation app
Part A — helexa-bench read API:
- [api] config (enabled, listen :13132); WAL on the store so API reads
  never block the sweep writer.
- store read methods: summary, series (chronological per-build medians),
  runs (filtered), dimensions, run_count.
- api.rs: axum /api/health|dimensions|summary|series|runs, permissive
  CORS (UI is a separate origin). The `run` daemon binds the API
  alongside the sweep; new `serve` subcommand serves API-only.
- listener plumbing (bench gains a port): data/helexa-bench-firewalld.xml,
  spec install, deploy-bench /api/health probe + firewalld step, sudoers
  firewall-cmd grants, [api] in example + bob.toml.
- 5 API tests + serve smoke.
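
The `series` read method above is described as chronological per-build medians. A minimal sketch of that aggregation follows, assuming each run is a `(build_sha, recorded_at, value)` tuple. The tuple layout and the function name are hypothetical; the real store is SQLite-backed Rust and may compute this in SQL instead.

```python
from collections import defaultdict
from statistics import median


def per_build_medians(runs):
    """Return [(build_sha, median_value)] ordered by each build's first run.

    `runs` is an iterable of (build_sha, recorded_at, value) tuples; this
    layout is an assumption made for the sketch, not the store's schema.
    """
    values = defaultdict(list)
    first_seen = {}
    for sha, recorded_at, value in runs:
        values[sha].append(value)
        # Track the earliest timestamp per build so the series can be ordered.
        if sha not in first_seen or recorded_at < first_seen[sha]:
            first_seen[sha] = recorded_at
    ordered = sorted(values, key=lambda sha: first_seen[sha])
    return [(sha, median(values[sha])) for sha in ordered]
```

Ordering by first-seen timestamp rather than by SHA keeps the series chronological even though SHAs themselves sort arbitrarily.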

Part B — bench/ Vite + React-SWC-TS app (router, react-bootstrap,
recharts): Overview (summary table), Trends (decode tok/s & TTFT across
build SHAs), Runs (filterable explorer). Typed API client with
VITE_API_BASE + dev proxy to bob. npm build/typecheck clean. Hosted
separately from the API (per design); .gitignore excludes node_modules/dist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:26:55 +03:00
5999c8a5a3 Merge branch 'feat/deploy-bench-on-bob' into main
ci(deploy): deploy helexa-bench to bob + enable all fleet services on boot

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:17:11 +03:00
66833890c0 ci(deploy): deploy helexa-bench to bob + enable all fleet services on boot
Adds a deploy-bench job to deploy.yml that rolls helexa-bench onto bob
(the bench host, also running Agent Zero), following the deploy-cortex
pattern: manifest-gated skip-when-current, light "service stays active"
validation (outbound-only, no listener/model to probe), journal capture.
Runs alongside the cortex→neurons chain (no deploy-ordering dependency —
the sweep loop is version-aware).
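
The manifest-gated skip-when-current check can be sketched in isolation. This mirrors the inline Python in deploy.yml, which reads `packages.json` and picks the newest entry by `buildTime`. The function names and the sample version strings are hypothetical.

```python
import json


def latest_published(manifest_text, name):
    """Return "version-release" of the newest build of `name`, or None."""
    packages = json.loads(manifest_text)["packages"]
    candidates = [p for p in packages if p.get("name") == name]
    if not candidates:
        return None
    # Newest build wins; entries without buildTime sort first.
    newest = max(candidates, key=lambda p: p.get("buildTime", 0))
    return f'{newest["version"]}-{newest["release"]}'


def should_deploy(installed, latest):
    """Skip only when the manifest was readable and matches the installed build."""
    return not (latest and latest == installed)
```

An unreadable or empty manifest yields `latest = None`, which deliberately falls through to a deploy rather than a silent skip.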

Boot persistence: all systemd deployments now `systemctl enable --now`
instead of bare `start`, so cortex / neuron / helexa-bench come back
after a host reboot. Covers deploy.yml (all three services) and
deploy-dev.yml (neuron fast path); sudoers gain the matching
`enable --now <svc>` grant.

infra-setup.sh handles bob: provisions gitea_ci, installs the
bench-host sudoers, enables the lair-cafe-unstable repo (bob is a client
host without it), pre-creates /etc/helexa-bench, and syncs
asset/helexa-bench/bob.toml. New assets: bench-host.conf sudoers and
bob.toml (three neuron targets).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:10:07 +03:00
7bb20241a6 Merge branch 'feat/version-metadata-and-bench' into main
feat(bench): version-aware benchmark harness + neuron build metadata

Adds GET /version build metadata to neuron and the helexa-bench crate — a continuous, version-aware harness that records fleet benchmarks into SQLite keyed by neuron build SHA, replacing manual bench.py runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 15:33:33 +03:00
106 changed files with 11390 additions and 674 deletions


@@ -56,6 +56,7 @@ env:
jobs:
prepare:
name: Resolve version stamps + change detection
timeout-minutes: 10
runs-on: rust
outputs:
version: ${{ steps.info.outputs.version }}
@@ -180,6 +181,7 @@ jobs:
# fleet, but it also doesn't serialize the pipeline.
lint:
name: Lint (fmt + clippy)
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.check_rust == 'true'
runs-on: rust
@@ -196,38 +198,16 @@ jobs:
with:
ref: ${{ inputs.ref }}
- run: cargo fmt --check --all
# sccache failures come in two modes: transient races (a plain
# retry clears them) and a wedged/dead server, where every
# same-VM retry fails identically (sccache fatal error, ENOENT
# on its own tmp files). Escalate accordingly: retry → restart
# the server → final attempt uncached. A sick cache costs build
# time, never the run.
- name: Clippy (with sccache escalation)
run: |
for attempt in 1 2 3; do
echo "::group::clippy attempt ${attempt}"
if [ "${attempt}" -eq 3 ]; then
echo "final attempt: building without sccache"
export RUSTC_WRAPPER=""
fi
if cargo clippy --workspace -- -D warnings; then
echo "::endgroup::"
exit 0
fi
echo "::endgroup::"
echo "clippy failed on attempt ${attempt}"
if [ "${attempt}" -eq 1 ]; then
sccache --stop-server || true
sccache --start-server || true
fi
sleep 5
done
echo "clippy failed after 3 attempts"
exit 1
- run: sccache --show-stats || true
# Failure-aware sccache escalation lives in the shared script: a
# signal death (rustc SIGSEGV / OOM-kill) keeps the cache and fails
# fast instead of triggering a slower uncached rebuild; only a real
# sccache fault drops the cache. See script/ci-cargo-escalate.sh.
- name: Clippy (sccache escalation)
run: script/ci-cargo-escalate.sh cargo clippy --workspace -- -D warnings
test:
name: Test
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.check_rust == 'true'
runs-on: rust
@@ -243,33 +223,13 @@ jobs:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
# See the lint job for the escalation rationale.
- name: Test (with sccache escalation)
run: |
for attempt in 1 2 3; do
echo "::group::test attempt ${attempt}"
if [ "${attempt}" -eq 3 ]; then
echo "final attempt: building without sccache"
export RUSTC_WRAPPER=""
fi
if cargo test --workspace; then
echo "::endgroup::"
exit 0
fi
echo "::endgroup::"
echo "test failed on attempt ${attempt}"
if [ "${attempt}" -eq 1 ]; then
sccache --stop-server || true
sccache --start-server || true
fi
sleep 5
done
echo "test failed after 3 attempts"
exit 1
- run: sccache --show-stats || true
# See script/ci-cargo-escalate.sh for the escalation rationale.
- name: Test (sccache escalation)
run: script/ci-cargo-escalate.sh cargo test --workspace
build-cortex:
name: Build cortex binary
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.build_cortex == 'true'
# runner-rust image already provides rust/cargo/clippy/rustfmt via
@@ -288,32 +248,9 @@ jobs:
with:
ref: ${{ inputs.ref }}
# Escalation mirrors the lint/test jobs: retry → restart the
# sccache server → final attempt uncached. A sick cache costs
# build time, never the run.
- name: Build cortex (release, with sccache escalation)
run: |
for attempt in 1 2 3; do
echo "::group::build attempt ${attempt}"
if [ "${attempt}" -eq 3 ]; then
echo "final attempt: building without sccache"
export RUSTC_WRAPPER=""
fi
if cargo build --release -p cortex-cli; then
echo "::endgroup::"
sccache --show-stats || true
exit 0
fi
echo "::endgroup::"
echo "build failed on attempt ${attempt}"
if [ "${attempt}" -eq 1 ]; then
sccache --stop-server || true
sccache --start-server || true
fi
sleep 5
done
echo "build failed after 3 attempts"
exit 1
# See script/ci-cargo-escalate.sh for the escalation rationale.
- name: Build cortex (release, sccache escalation)
run: script/ci-cargo-escalate.sh cargo build --release -p cortex-cli
- name: Stage binary
run: |
@@ -329,6 +266,7 @@ jobs:
build-bench:
name: Build helexa-bench binary
timeout-minutes: 25
needs: prepare
if: needs.prepare.outputs.build_bench == 'true'
# Pure-Rust, non-CUDA binary — same runner as cortex.
@@ -346,32 +284,12 @@ jobs:
with:
ref: ${{ inputs.ref }}
- name: Build helexa-bench (release, with sccache escalation)
- name: Build helexa-bench (release, sccache escalation)
run: |
# Stamp the SHA helexa-bench records as bench_sha against every
# run (option_env! in sweep.rs reads it at compile time).
export HELEXA_BUILD_SHA="$(git rev-parse HEAD)"
for attempt in 1 2 3; do
echo "::group::build attempt ${attempt}"
if [ "${attempt}" -eq 3 ]; then
echo "final attempt: building without sccache"
export RUSTC_WRAPPER=""
fi
if cargo build --release -p helexa-bench; then
echo "::endgroup::"
sccache --show-stats || true
exit 0
fi
echo "::endgroup::"
echo "build failed on attempt ${attempt}"
if [ "${attempt}" -eq 1 ]; then
sccache --stop-server || true
sccache --start-server || true
fi
sleep 5
done
echo "build failed after 3 attempts"
exit 1
script/ci-cargo-escalate.sh cargo build --release -p helexa-bench
- name: Stage binary
run: |
@@ -387,6 +305,7 @@ jobs:
build-neuron:
name: Build neuron-${{ matrix.flavour }}
timeout-minutes: 35
needs: prepare
if: needs.prepare.outputs.build_neuron == 'true'
strategy:
@@ -427,28 +346,16 @@ jobs:
with:
ref: ${{ inputs.ref }}
# Escalation mirrors the lint/test jobs: retry → restart the
# sccache server → final attempt uncached.
#
# The CUDA image may or may not ship sccache — probe inside this
# step (NOT via GITHUB_ENV from a prior step, which this runner
# does not propagate; observed: probe step said "enabled", build
# ran unwrapped, server stats showed 4 compile requests). A
# missing binary degrades to an uncached build rather than
# failing cargo at `sccache rustc -vV`. The cache covers the
# ~600-crate host-side dep tree (the bulk of the 10-14 min
# build); rustc compilations are shared across all three
# flavours, so even one run seeds the next.
# sccache handling + failure classification lives in
# script/ci-cargo-escalate.sh: it probes for sccache (the CUDA
# image may not ship it — a missing binary degrades to an uncached
# build rather than failing at `sccache rustc -vV`), and a rustc
# SIGSEGV / OOM-kill keeps the cache and fails fast instead of
# escalating to a slower uncached rebuild. The cache covers the
# ~600-crate host-side dep tree (the bulk of the 10-14 min build),
# shared across all three flavours, so even one run seeds the next.
- name: Build neuron with CUDA (${{ matrix.flavour }})
run: |
set -ux
if command -v sccache >/dev/null 2>&1; then
export RUSTC_WRAPPER=sccache
sccache --start-server 2>/dev/null || true
echo "sccache enabled"
else
echo "sccache not on PATH — building uncached"
fi
export PATH="${{ matrix.cuda_home }}/bin:${PATH}"
export LD_LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="${{ matrix.cuda_home }}/targets/x86_64-linux/lib:${{ matrix.cuda_home }}/lib64:${LIBRARY_PATH:-}"
@@ -457,27 +364,7 @@ jobs:
# injecting the exact checked-out commit is unambiguous under
# shallow/detached states and makes the artifact self-describing.
export HELEXA_BUILD_SHA="$(git rev-parse HEAD)"
for attempt in 1 2 3; do
echo "::group::build attempt ${attempt}"
if [ "${attempt}" -eq 3 ]; then
echo "final attempt: building without sccache"
export RUSTC_WRAPPER=""
fi
if cargo build --release -p neuron --features "${{ matrix.cargo_features }}"; then
echo "::endgroup::"
command -v sccache >/dev/null 2>&1 && sccache --show-stats || true
exit 0
fi
echo "::endgroup::"
echo "build failed on attempt ${attempt}"
if [ "${attempt}" -eq 1 ] && command -v sccache >/dev/null 2>&1; then
sccache --stop-server || true
sccache --start-server || true
fi
sleep 5
done
echo "build failed after 3 attempts"
exit 1
script/ci-cargo-escalate.sh cargo build --release -p neuron --features "${{ matrix.cargo_features }}"
env:
CUDA_COMPUTE_CAP: ${{ matrix.compute_cap }}
CARGO_BUILD_JOBS: ${{ matrix.build_jobs }}
@@ -497,6 +384,7 @@ jobs:
package-cortex:
name: Package cortex RPM
timeout-minutes: 20
needs: [prepare, build-cortex]
runs-on: rpm
steps:
@@ -535,6 +423,7 @@ jobs:
package-bench:
name: Package helexa-bench RPM
timeout-minutes: 20
needs: [prepare, build-bench]
runs-on: rpm
steps:
@@ -555,6 +444,7 @@ jobs:
cp artifacts/helexa-bench ~/rpmbuild/SOURCES/
cp data/helexa-bench.service ~/rpmbuild/SOURCES/
cp data/helexa-bench-sysusers.conf ~/rpmbuild/SOURCES/
cp data/helexa-bench-firewalld.xml ~/rpmbuild/SOURCES/
cp helexa-bench.example.toml ~/rpmbuild/SOURCES/
cp LICENSE ~/rpmbuild/SOURCES/
rpmbuild -bb rpm/helexa-bench-prerelease.spec \
@@ -571,6 +461,7 @@ jobs:
package-neuron:
name: Package helexa-neuron-${{ matrix.flavour }} RPM
timeout-minutes: 20
needs: [prepare, build-neuron]
runs-on: rpm
strategy:
@@ -616,6 +507,7 @@ jobs:
publish:
name: Publish to rpm.lair.cafe (unstable)
timeout-minutes: 25
needs: [lint, test, package-cortex, package-neuron, package-bench]
# Runs when at least one package was built and nothing failed.
# lint/test may be skipped (docs-only refs never get here because

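The workflow comments above describe the policy of script/ci-cargo-escalate.sh: a signal death keeps the cache and fails fast, and only a real sccache fault drops the cache. The script itself is not shown here, so the following is only a sketch of how such a classification could look. The marker substrings, the function names, and the handling of ordinary failures are all assumptions.

```python
def classify_failure(returncode, output):
    """Bucket a finished cargo invocation into ok / signal / sccache / other.

    The marker substrings are illustrative guesses, not taken from
    script/ci-cargo-escalate.sh.
    """
    if returncode == 0:
        return "ok"
    text = output.lower()
    # A shell reports death-by-signal as 128 + signal number; cargo typically
    # also names the signal when a child rustc is killed.
    if returncode > 128 or "signal:" in text or "sigsegv" in text or "sigkill" in text:
        return "signal"
    if "sccache" in text:
        return "sccache"
    return "other"


def next_step(kind):
    """Map a failure class to a policy; the "other" branch is an assumption."""
    return {
        "ok": "done",
        "signal": "fail-fast-keep-cache",
        "sccache": "retry-without-cache",
        "other": "fail",
    }[kind]
```

Checking for a signal before checking for sccache matters: an OOM-killed rustc would fail the same way on an uncached rebuild, so dropping the cache would only add build time.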

@@ -41,6 +41,7 @@ env:
jobs:
fmt:
name: Format
timeout-minutes: 15
runs-on: rust
steps:
- uses: actions/checkout@v4
@@ -48,67 +49,26 @@ jobs:
clippy:
name: Clippy
timeout-minutes: 25
runs-on: rust
steps:
- uses: actions/checkout@v4
# sccache failures come in two modes: transient races (a plain
# retry clears them) and a wedged/dead server, where every
# same-VM retry fails identically. Escalate: retry → restart the
# server → final attempt uncached. A sick cache costs build
# time, never the run. Keep in sync with build-prerelease.yml.
- name: Clippy (with sccache escalation)
run: |
for attempt in 1 2 3; do
echo "::group::clippy attempt ${attempt}"
if [ "${attempt}" -eq 3 ]; then
echo "final attempt: building without sccache"
export RUSTC_WRAPPER=""
fi
if cargo clippy --workspace -- -D warnings; then
echo "::endgroup::"
exit 0
fi
echo "::endgroup::"
echo "clippy failed on attempt ${attempt}"
if [ "${attempt}" -eq 1 ]; then
sccache --stop-server || true
sccache --start-server || true
fi
sleep 5
done
echo "clippy failed after 3 attempts"
exit 1
- run: sccache --show-stats || true
# Failure-aware sccache escalation lives in the shared script (kept
# in sync with build-prerelease.yml): a signal death (rustc SIGSEGV
# / OOM-kill) keeps the cache and fails fast instead of an uncached
# rebuild; only a real sccache fault drops the cache.
- name: Clippy (sccache escalation)
run: script/ci-cargo-escalate.sh cargo clippy --workspace -- -D warnings
test:
name: Test
timeout-minutes: 25
runs-on: rust
steps:
- uses: actions/checkout@v4
# See the clippy job for the escalation rationale.
- name: Test (with sccache escalation)
run: |
for attempt in 1 2 3; do
echo "::group::test attempt ${attempt}"
if [ "${attempt}" -eq 3 ]; then
echo "final attempt: building without sccache"
export RUSTC_WRAPPER=""
fi
if cargo test --workspace; then
echo "::endgroup::"
exit 0
fi
echo "::endgroup::"
echo "test failed on attempt ${attempt}"
if [ "${attempt}" -eq 1 ]; then
sccache --stop-server || true
sccache --start-server || true
fi
sleep 5
done
echo "test failed after 3 attempts"
exit 1
- run: sccache --show-stats || true
# See script/ci-cargo-escalate.sh for the escalation rationale.
- name: Test (sccache escalation)
run: script/ci-cargo-escalate.sh cargo test --workspace
# Type-check the CUDA-only code path. Borrow-check-only — we
# never run the tests here (the runner has no GPU). This catches
@@ -122,6 +82,7 @@ jobs:
# see commit history).
cuda-check:
name: CUDA type-check
timeout-minutes: 35
runs-on: cuda-13.0
# The workflow-level env sets `RUSTC_WRAPPER: sccache`
# unconditionally, which hard-fails cargo if the CUDA image
@@ -139,52 +100,26 @@ jobs:
CUDA_COMPUTE_CAP: "86"
steps:
- uses: actions/checkout@v4
# sccache is probed inside this step (NOT via GITHUB_ENV from a
# prior step — this runner doesn't propagate it; see
# build-prerelease.yml for the observed failure).
- name: cargo check --features cuda (with sccache escalation)
# sccache probing + failure classification lives in the shared
# script (see build-prerelease.yml's neuron build for the same
# pattern). It probes for sccache and, on a rustc SIGSEGV / OOM,
# keeps the cache and fails fast rather than rebuilding uncached.
- name: cargo check --features cuda (sccache escalation)
run: |
if command -v sccache >/dev/null 2>&1; then
export RUSTC_WRAPPER=sccache
sccache --start-server 2>/dev/null || true
echo "sccache enabled"
else
echo "sccache not on PATH — building uncached"
fi
# act launches the step shell without /etc/profile, so the
# gitea_runner user's inherited PATH lacks /usr/local/cuda-13.0/bin.
# cudarc's build.rs:157 shells out to `nvcc --version` (because
# the neuron crate enables cuda-version-from-build-system) and
# panics with ENOENT if nvcc isn't resolvable. build-prerelease.yml
# does the same export — keep them in sync.
# cudarc's build.rs shells out to `nvcc --version` (the neuron
# crate enables cuda-version-from-build-system) and panics with
# ENOENT if nvcc isn't resolvable — keep this export in sync
# with build-prerelease.yml.
export PATH="/usr/local/cuda-13.0/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="/usr/local/cuda-13.0/targets/x86_64-linux/lib:/usr/local/cuda-13.0/lib64:${LIBRARY_PATH:-}"
# Escalation mirrors the lint/test jobs: plain retry →
# sccache server restart → final attempt uncached.
for attempt in 1 2 3; do
echo "::group::cuda-check attempt ${attempt}"
if [ "${attempt}" -eq 3 ]; then
echo "final attempt: building without sccache"
export RUSTC_WRAPPER=""
fi
if cargo check -p neuron --features cuda --all-targets; then
echo "::endgroup::"
exit 0
fi
echo "::endgroup::"
echo "cuda-check failed on attempt ${attempt}"
if [ "${attempt}" -eq 1 ] && command -v sccache >/dev/null 2>&1; then
sccache --stop-server || true
sccache --start-server || true
fi
sleep 5
done
echo "cuda-check failed after 3 attempts"
exit 1
script/ci-cargo-escalate.sh cargo check -p neuron --features cuda --all-targets
srpm-cortex:
name: Build cortex SRPM
timeout-minutes: 25
runs-on: rpm
needs: [fmt, clippy, test, cuda-check]
if: startsWith(github.ref, 'refs/tags/v')
@@ -245,6 +180,7 @@ jobs:
srpm-neuron:
name: Build neuron SRPM
timeout-minutes: 25
runs-on: rpm
needs: [fmt, clippy, test, cuda-check]
if: startsWith(github.ref, 'refs/tags/v')
@@ -305,6 +241,7 @@ jobs:
copr-cortex:
name: Publish cortex to COPR
timeout-minutes: 60
runs-on: fedora-43
needs: srpm-cortex
steps:
@@ -322,6 +259,7 @@ jobs:
copr-neuron:
name: Publish neuron to COPR
timeout-minutes: 60
runs-on: fedora-43
needs: srpm-neuron
steps:
@@ -339,6 +277,7 @@ jobs:
bump-version:
name: Bump version in source
timeout-minutes: 15
runs-on: rust
needs: [copr-cortex, copr-neuron]
steps:


@@ -123,7 +123,9 @@ jobs:
# Exact command form required by the sudoers rule in
# asset/sudoers.d/neuron-host.conf — change both together.
sudo /usr/bin/install -o root -g root -m 0755 /var/lib/gitea_ci/neuron-dev /usr/bin/neuron
sudo /usr/bin/systemctl start neuron.service
# enable --now so a dev deploy also leaves the unit enabled
# for boot, consistent with deploy.yml.
sudo /usr/bin/systemctl enable --now neuron.service
rm -f /var/lib/gitea_ci/neuron-dev'
- name: Capture neuron.service startup journal


@@ -1,7 +1,8 @@
name: deploy
# Roll the freshly-published unstable RPMs onto the helexa fleet:
# cortex on the gateway, helexa-neuron-<flavour> on each neuron host.
# cortex on the gateway, helexa-neuron-<flavour> on each neuron host,
# and helexa-bench on bob (the bench host).
#
# Triggered automatically after `build-prerelease` succeeds (by which
# point the new RPMs are live on rpm.lair.cafe/unstable), and also
@@ -88,7 +89,9 @@ jobs:
sudo /usr/bin/dnf install --refresh --allowerasing -y cortex
fi
sudo /usr/bin/systemctl daemon-reload
sudo /usr/bin/systemctl start cortex.service
# enable --now: start the service AND enable it for boot so the
# fleet self-heals after a host reboot.
sudo /usr/bin/systemctl enable --now cortex.service
DEPLOY
# Wait for the service to either come up or wedge, then capture
@@ -115,15 +118,27 @@ jobs:
# loading after a restart. beast cold-loads Qwen3.6-27B Q6K
# TP=2 (~5-6 min typical, see #1); benjy/quadbrat load small
# single-GPU models in well under a minute.
#
# max_prompt_tokens: per-model context cap, written to the
# neuron.service.d/model.conf drop-in (NEURON_MAX_PROMPT_TOKENS).
# A change here restarts the neuron even with no new RPM. Values
# are VRAM-safe ceilings derived per model — see
# doc/context-limits.md. beast (Qwen3.6-27B, hybrid linear, 2x
# 32GB) has ample KV headroom; benjy (Qwen3-8B dense, ~6GB free)
# is VRAM-bound and stays at the default; quadbrat (Qwen3-1.7B)
# likewise conservative.
- host: beast.hanzalova.internal
flavour: blackwell
load_timeout: 900
max_prompt_tokens: 131072
- host: benjy.hanzalova.internal
flavour: ada
load_timeout: 300
max_prompt_tokens: 16384
- host: quadbrat.hanzalova.internal
flavour: ampere
load_timeout: 300
max_prompt_tokens: 16384
steps:
- name: SSH init
run: |
@@ -140,6 +155,26 @@ jobs:
ssh gitea_ci@${{ matrix.host }} 'bash -s' <<'DEPLOY'
set -eu
pkg=helexa-neuron-${{ matrix.flavour }}
max_prompt_tokens="${{ matrix.max_prompt_tokens }}"
# ── Desired per-model systemd drop-in ─────────────────────────
# model.conf carries NEURON_MAX_PROMPT_TOKENS so the context cap
# is deterministic per host and rolled out (with a restart) by
# this workflow, not hand-edited. It sorts after local.conf, so a
# deploy-managed value wins over any manual local override of the
# same variable. See doc/context-limits.md.
conf=/etc/systemd/system/neuron.service.d/model.conf
config_changed=0
if [ -n "${max_prompt_tokens}" ]; then
desired=$(printf '%s\n%s\n%s\n%s' \
"# Managed by .gitea/workflows/deploy.yml - do not edit by hand." \
"# Per-model context cap; see doc/context-limits.md." \
"[Service]" \
"Environment=NEURON_MAX_PROMPT_TOKENS=${max_prompt_tokens}")
[ "${desired}" = "$(cat "${conf}" 2>/dev/null || true)" ] || config_changed=1
fi
# ── Package version gate (manifest rationale: see deploy-cortex) ──
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' "${pkg}" 2>/dev/null || echo "not-installed")
latest=$(curl -fsS --max-time 15 "https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json" 2>/dev/null \
| python3 -c '
@@ -150,21 +185,42 @@ jobs:
p = max(cands, key=lambda p: p.get("buildTime", 0))
print(p["version"] + "-" + p["release"])
' "${pkg}" 2>/dev/null || true)
pkg_changed=1
if [ -n "${latest}" ] && [ "${latest}" = "${installed}" ]; then
echo "${pkg}-${installed} already current — leaving service untouched"
pkg_changed=0
fi
# Skip only when BOTH the package and the drop-in are unchanged —
# a context-cap change must restart the neuron even with no new RPM.
if [ "${pkg_changed}" -eq 0 ] && [ "${config_changed}" -eq 0 ]; then
echo "${pkg}-${installed} current; NEURON_MAX_PROMPT_TOKENS=${max_prompt_tokens:-<unset>} unchanged — leaving service untouched"
exit 0
fi
echo "installed=${installed} published=${latest:-unknown} — deploying"
echo "installed=${installed} published=${latest:-unknown} pkg_changed=${pkg_changed} config_changed=${config_changed} — deploying"
# Write the drop-in (staged in gitea_ci's dir, installed root-owned).
if [ "${config_changed}" -eq 1 ]; then
printf '%s\n' "${desired}" > /var/lib/gitea_ci/model.conf
sudo /usr/bin/install -o root -g root -m 0644 -D /var/lib/gitea_ci/model.conf "${conf}"
rm -f /var/lib/gitea_ci/model.conf
echo "applied ${conf}: NEURON_MAX_PROMPT_TOKENS=${max_prompt_tokens}"
fi
if systemctl is-active --quiet neuron.service; then
sudo /usr/bin/systemctl stop neuron.service
fi
if rpm -q "${pkg}" >/dev/null 2>&1; then
sudo /usr/bin/dnf upgrade --refresh --allowerasing -y "${pkg}"
else
sudo /usr/bin/dnf install --refresh --allowerasing -y "${pkg}"
if [ "${pkg_changed}" -eq 1 ]; then
if rpm -q "${pkg}" >/dev/null 2>&1; then
sudo /usr/bin/dnf upgrade --refresh --allowerasing -y "${pkg}"
else
sudo /usr/bin/dnf install --refresh --allowerasing -y "${pkg}"
fi
fi
# daemon-reload picks up both a new unit (dnf) and the drop-in.
sudo /usr/bin/systemctl daemon-reload
sudo /usr/bin/systemctl start neuron.service
# enable --now: start the service AND enable it for boot so the
# fleet self-heals after a host reboot.
sudo /usr/bin/systemctl enable --now neuron.service
# ── Post-deploy validation ────────────────────────────────
# A deploy only goes green if the neuron (a) finishes loading
@@ -250,3 +306,143 @@ jobs:
sleep 10
ssh gitea_ci@${{ matrix.host }} \
'journalctl --unit neuron.service -I --no-pager'
# helexa-bench is a separate package on a separate host (bob), and it
# only consumes the fleet's HTTP APIs — it has no deploy-ordering
# dependency on cortex or the neurons (the sweep loop is version-aware
# and picks up whatever each neuron reports whenever). So it runs
# alongside the cortex→neurons chain rather than after it.
deploy-bench:
runs-on: fedora-43
if: >-
${{
github.event_name == 'workflow_dispatch'
|| github.event.workflow_run.conclusion == 'success'
}}
steps:
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@bob.hanzalova.internal 'hostname -f'
# See deploy-cortex for why gating uses the publish manifest and
# not unprivileged `dnf check-update`.
- name: Deploy helexa-bench (skips when already current)
run: |
ssh gitea_ci@bob.hanzalova.internal 'bash -s' <<'DEPLOY'
set -eu
pkg=helexa-bench
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' "${pkg}" 2>/dev/null || echo "not-installed")
latest=$(curl -fsS --max-time 15 "https://rpm.lair.cafe/fedora/43/x86_64/unstable/packages.json" 2>/dev/null \
| python3 -c '
import json, sys
name = sys.argv[1]
cands = [p for p in json.load(sys.stdin)["packages"] if p.get("name") == name]
if cands:
p = max(cands, key=lambda p: p.get("buildTime", 0))
print(p["version"] + "-" + p["release"])
' "${pkg}" 2>/dev/null || true)
if [ -n "${latest}" ] && [ "${latest}" = "${installed}" ]; then
echo "${pkg}-${installed} already current — leaving service untouched"
exit 0
fi
echo "installed=${installed} published=${latest:-unknown} — deploying"
if systemctl is-active --quiet helexa-bench.service; then
sudo /usr/bin/systemctl stop helexa-bench.service
fi
if rpm -q "${pkg}" >/dev/null 2>&1; then
sudo /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-bench
else
sudo /usr/bin/dnf install --refresh --allowerasing -y helexa-bench
fi
sudo /usr/bin/systemctl daemon-reload
# enable --now: start the service AND enable it for boot so the
# bench resumes collecting after a host reboot.
sudo /usr/bin/systemctl enable --now helexa-bench.service
# ── Post-deploy validation ────────────────────────────────
# The bench serves a read-only API on :13132 alongside the
# outbound sweep loop. Probe the API over localhost (bypasses
# firewalld) — catches a crash-on-start or a bad bind. Bail
# early if the unit drops out of active (Restart backoff).
echo "waiting for bench API on :13132"
deadline=$(( $(date +%s) + 30 ))
while :; do
if curl -fsS --max-time 5 http://localhost:13132/api/health >/dev/null 2>&1; then
echo "bench API healthy"
break
fi
if ! systemctl is-active --quiet helexa-bench.service; then
echo "FAIL: helexa-bench.service is not active"
systemctl --no-pager status helexa-bench.service | head -20 || true
exit 1
fi
if [ "$(date +%s)" -ge "${deadline}" ]; then
echo "FAIL: bench API not healthy within 30s"
exit 1
fi
sleep 3
done
DEPLOY
- name: Ensure firewalld allows helexa-bench
run: |
ssh gitea_ci@bob.hanzalova.internal '
if ! sudo /usr/bin/firewall-cmd --query-service=helexa-bench --quiet 2>/dev/null; then
sudo /usr/bin/firewall-cmd --add-service=helexa-bench --permanent
sudo /usr/bin/firewall-cmd --reload
fi'
# Wait for the service to either come up or wedge, then capture
# the latest-invocation journal. Runs even on prior failure so a
# failed start step still leaves a usable record in the deploy log.
- name: Capture helexa-bench.service startup journal
if: always()
run: |
sleep 10
ssh gitea_ci@bob.hanzalova.internal \
'journalctl --unit helexa-bench.service -I --no-pager'
# Build the bench UI and publish it to the public nginx vhost on the
# gateway (https://bench.helexa.ai). The vhost + Let's Encrypt cert are
# one-time host setup (script/infra-setup.sh); this job just refreshes
# the static assets. nginx reverse-proxies /api to the bob API, so the
# SPA is built same-origin (no VITE_API_BASE). Independent of the other
# deploy jobs.
deploy-bench-ui:
runs-on: fedora-43
if: >-
${{
github.event_name == 'workflow_dispatch'
|| github.event.workflow_run.conclusion == 'success'
}}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
- name: Build UI
run: |
cd bench
npm ci
npm run build
- name: SSH init
run: |
mkdir -p ~/.ssh
echo "${DEPLOY_KEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
gitea_ci@hanzalova.internal 'hostname -f'
- name: Rsync built UI to gateway webroot
run: |
rsync --archive --compress --delete \
--rsync-path 'sudo rsync' \
bench/dist/ \
gitea_ci@hanzalova.internal:/var/www/bench.helexa.ai/
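
The neuron deploy step in the workflow above redeploys when either the package or the `model.conf` drop-in changed, and skips only when both are unchanged. A minimal sketch of that decision follows, with the drop-in text copied from the `printf` in deploy.yml. The function names are hypothetical, and `current_conf` is assumed to be the existing file's contents, or an empty string when the file is missing.

```python
def render_dropin(max_prompt_tokens):
    """Render the desired model.conf body, mirroring the printf in deploy.yml."""
    return "\n".join([
        "# Managed by .gitea/workflows/deploy.yml - do not edit by hand.",
        "# Per-model context cap; see doc/context-limits.md.",
        "[Service]",
        f"Environment=NEURON_MAX_PROMPT_TOKENS={max_prompt_tokens}",
    ])


def plan(installed, latest, current_conf, max_prompt_tokens):
    """Return (pkg_changed, config_changed, action) for one neuron host."""
    # Shell command substitution strips trailing newlines, so do the same here.
    config_changed = bool(max_prompt_tokens) and (
        render_dropin(max_prompt_tokens) != current_conf.rstrip("\n")
    )
    # An unreadable manifest (empty `latest`) counts as a package change.
    pkg_changed = not (latest and latest == installed)
    action = "deploy" if (pkg_changed or config_changed) else "skip"
    return pkg_changed, config_changed, action
```

Keeping the two flags separate is what lets a context-cap change restart the neuron without invoking dnf when no new RPM exists.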

.gitignore

@@ -1,4 +1,6 @@
/target
/bench/node_modules
/bench/dist
*.swp
*.swo
.idea/

AGENTS.md (new file)

@@ -0,0 +1,268 @@
# AGENTS.md — helexa/cortex
## Project Overview
helexa is a self-hosted LLM serving stack for multi-node GPU inference clusters. It has two components:
- **cortex**: the per-operator control plane and LLM proxy. A Rust reverse proxy that sits in front of the fleet and presents a unified OpenAI- and Anthropic-compatible API surface. It handles model routing, lifecycle management (load/unload/evict), request translation, and metrics collection.
- **neuron**: the per-host LLM harness. One instance runs on every GPU host, serving candle-based in-process inference and managing local hardware discovery and model lifecycle.
## Repository Layout
```
cortex/
├── Cargo.toml # workspace root (Rust 2024 edition, GPL-3.0)
├── cortex.example.toml # example gateway config
├── models.example.toml # example model catalogue
├── neuron.example.toml # example neuron config
├── README.md # public-facing documentation
├── CLAUDE.md # detailed design rationale and implementation history
├── AGENTS.md # ← you are here
├── cortex.spec # RPM spec for cortex
├── helexa-neuron.spec # RPM spec for neuron (renamed to avoid Fedora collision)
├── rpm/ # prerelease RPM specs
│ ├── cortex-prerelease.spec
│ ├── helexa-neuron-prerelease.spec
│ └── helexa-bench-prerelease.spec
├── data/ # systemd units and example configs for packaging
│ ├── cortex.service
│ ├── neuron.service
│ ├── cortex.example.toml
│ ├── neuron.example.toml
│ └── models.example.toml
└── crates/
├── cortex-core/ # shared types, config, envelopes
│ └── src/
│ ├── lib.rs
│ ├── build_info.rs # BuildInfo type for /version endpoint
│ ├── config.rs # figment-based config structs
│ ├── catalogue.rs # ModelProfile, placement matching
│ ├── discovery.rs # DeviceInfo, DiscoveryResponse
│ ├── harness.rs # Harness trait, HarnessConfig, HarnessHealth
│ ├── node.rs # NodeState, ModelStatus
│ ├── openai.rs # OpenAI request/response types
│ ├── anthropic.rs # Anthropic request/response types
│ ├── translate.rs # OpenAI <-> Anthropic translation
│ └── metrics.rs # RequestMetrics, histogram helpers
├── cortex-gateway/ # the HTTP proxy server
│ └── src/
│ ├── lib.rs
│ ├── state.rs # CortexState: Arc<RwLock<...>>
│ ├── router.rs # model -> node routing logic
│ ├── proxy.rs # streaming HTTP proxy to backends
│ ├── evictor.rs # LRU/priority eviction logic
│ ├── poller.rs # background task polling neuron status
│ ├── handlers.rs # axum handlers (chat, completions, models, etc.)
│ └── metrics.rs # prometheus exporter endpoint
├── cortex-cli/ # CLI entrypoint
│ └── src/main.rs # binary: `cortex`
├── neuron/ # per-host LLM daemon (replaces cortex-agent)
│ ├── Cargo.toml # features: cuda, cudnn, flash-attn, cuda-integration
│ ├── build.rs # compiles CUDA kernels, emits build metadata
│ └── src/
│ ├── main.rs # binary: `neuron`
│ ├── discovery.rs # nvidia-smi parsing, device enumeration
│ ├── health.rs # runtime GPU polling
│ ├── api.rs # HTTP handlers for /discovery, /models, etc.
│ ├── version.rs # GET /version endpoint with BuildInfo
│ ├── models.rs # local model lifecycle orchestration
│ └── harness/ # in-process candle inference
│ ├── device_worker/ # per-device CUDA worker threads
│ │ ├── mod.rs # canonical narrative for worker architecture
│ │ ├── jobs.rs # Job enum, dispatch handlers
│ │ └── dispatch.rs # DeviceWorkerState struct
│ ├── candle.rs # candle model implementation
│ └── tp/ # tensor parallelism
│ └── worker.rs # TP worker subprocesses
├── helexa-acp/ # Agent Client Protocol bridge (Apache-2.0)
│ └── src/main.rs # binary: `helexa-acp`, self-contained (no workspace deps)
└── helexa-bench/ # benchmark harness
└── src/main.rs # binary: `helexa-bench`, SQLite-backed, version-aware
```
## Key Design Decisions
### Architecture
- **cortex** is the control plane. It exposes the unified API, routes requests, manages model lifecycle across the fleet, and collects metrics.
- **neuron** is the node plane. One instance runs on every GPU host. It discovers local hardware, manages in-process candle inference, handles NCCL tensor parallelism, and reports runtime state.
- cortex never shells out to `nvidia-smi`, never touches systemd units, and never talks directly to a harness. It talks only to neurons via HTTP API on port 13131.
### Per-device worker thread (neuron)
Every CUDA device gets one dedicated OS thread that owns its `CudaContext` for the daemon's lifetime. All CUDA operations route through this thread via a `std::sync::mpsc` job channel. Tensors never escape the worker thread alive. Inference replies carry `Vec<f32>` CPU-side logits; sampled tokens come back as `u32`. The opaque `ArchHandle(u64)` and `TpHandle(u64)` are indices into the worker's state slab, not pointers.
CPU loads (`Device::Cpu` fallback) keep the legacy `tokio::task::spawn_blocking + Arc<Mutex<ModelArch>>` path: there is no context to own, and the channel hop would only add latency. The four `spawn_blocking` references in `harness/candle.rs` are this deliberate CPU fallback.
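The ownership model described above can be sketched with the standard library alone. This is an illustration, not neuron's actual code: the `Job` variants, `spawn_worker`, and the `Vec<f32>` slab entries standing in for device-resident state are all hypothetical, and no CUDA is involved.

```rust
use std::sync::mpsc;
use std::thread;

/// Opaque handle: an index into the worker's slab, never a pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ArchHandle(pub u64);

/// Hypothetical job set; each job carries its own reply channel.
pub enum Job {
    Load { weights: Vec<f32>, reply: mpsc::Sender<ArchHandle> },
    Sum { handle: ArchHandle, reply: mpsc::Sender<Option<f32>> },
}

/// Spawn one OS thread that owns all per-device state for its lifetime.
pub fn spawn_worker() -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    thread::spawn(move || {
        // The slab lives on this thread only; callers only ever see indices.
        let mut slab: Vec<Vec<f32>> = Vec::new();
        // The loop ends once every Sender has been dropped.
        for job in rx {
            match job {
                Job::Load { weights, reply } => {
                    slab.push(weights);
                    let _ = reply.send(ArchHandle((slab.len() - 1) as u64));
                }
                Job::Sum { handle, reply } => {
                    // Only plain CPU-side values leave the worker thread.
                    let out = slab
                        .get(handle.0 as usize)
                        .map(|w| w.iter().sum::<f32>());
                    let _ = reply.send(out);
                }
            }
        }
    });
    tx
}
```

A caller sends a `Job` together with a fresh reply channel and blocks on (or polls) the receiver; an unknown handle yields `None` rather than touching memory, which is the practical benefit of index handles over pointers.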
### candle-native (not mistral.rs)
neuron builds directly on [candle](https://github.com/huggingface/candle). Every model architecture it serves is implemented in this repository, ported against the HuggingFace reference. No external inference server to babysit. The Harness trait remains as an internal seam for adding future engines (vision/audio/diffusion) but its only implementation is in-process candle.
### Streaming proxy
Chat completions are proxied as SSE streams. The gateway must:
1. Parse the inbound request to extract the model name
2. Route to the correct backend neuron
3. Stream the response back, capturing token timing for metrics
4. NOT buffer the full response — true streaming passthrough
### Anthropic translation
When a request arrives at `/v1/messages` (Anthropic format), the gateway translates it to OpenAI format before proxying to neuron, then translates the response back. This is stateless envelope transformation. The non-streaming round-trip is implemented; streaming SSE translation is deferred.
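A minimal sketch of what stateless envelope translation can look like, assuming heavily simplified request types. The structs and function names below are hypothetical stand-ins, not the real definitions in `cortex-core`; they only illustrate two well-known format differences: Anthropic carries the system prompt as a top-level field while OpenAI carries it as a `system` message, and the two APIs name their stop reasons differently.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Simplified Anthropic-style request (hypothetical subset of fields).
#[derive(Debug, Clone, PartialEq)]
pub struct AnthropicRequest {
    pub model: String,
    pub system: Option<String>,
    pub max_tokens: u32,
    pub messages: Vec<Message>,
}

/// Simplified OpenAI-style request (hypothetical subset of fields).
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiRequest {
    pub model: String,
    pub max_tokens: Option<u32>,
    pub messages: Vec<Message>,
}

/// Pure function of its input: no session state is consulted or kept.
pub fn anthropic_to_openai(req: AnthropicRequest) -> OpenAiRequest {
    let mut messages = Vec::with_capacity(req.messages.len() + 1);
    // The top-level system prompt becomes the first chat message.
    if let Some(system) = req.system {
        messages.push(Message { role: "system".to_string(), content: system });
    }
    messages.extend(req.messages);
    OpenAiRequest {
        model: req.model,
        max_tokens: Some(req.max_tokens),
        messages,
    }
}

/// Map an OpenAI `finish_reason` onto an Anthropic `stop_reason`.
pub fn finish_to_stop_reason(finish_reason: &str) -> &'static str {
    match finish_reason {
        "length" => "max_tokens",
        "tool_calls" => "tool_use",
        _ => "end_turn",
    }
}
```

Because both functions depend only on their arguments, the same request always translates the same way, which is what makes the transformation safe to run on any gateway instance.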
### Eviction
The evictor runs as a background task. Before loading a model on a node where VRAM is tight:
1. Check if the model is already loaded elsewhere → route there instead
2. Find the LRU model on the target node (excluding pinned models)
3. Call `POST {neuron}/models/unload` on that model
4. The incoming request's lazy-load triggers the new model load
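Steps 1 and 2 above are a pure decision that can be sketched without any I/O. The types and the `plan_load` function are hypothetical, and `last_used` is assumed to be a monotonically increasing tick where a smaller value means less recently used; the real evictor in `evictor.rs` may track recency differently.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct LoadedModel {
    pub id: String,
    pub last_used: u64, // assumed tick counter; smaller = older
    pub pinned: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub name: String,
    pub loaded: Vec<LoadedModel>,
}

#[derive(Debug, PartialEq)]
pub enum Plan<'a> {
    /// Step 1: the model is already loaded on this node; route there.
    RouteTo(&'a str),
    /// Step 2: unload this model id from the target node first.
    Evict(&'a str),
    /// Every model on the target node is pinned.
    NothingEvictable,
}

pub fn plan_load<'a>(model_id: &str, target: &'a Node, fleet: &'a [Node]) -> Plan<'a> {
    // Step 1: prefer a node that already has the model loaded.
    let holder = fleet
        .iter()
        .find(|n| n.loaded.iter().any(|m| m.id == model_id));
    if let Some(node) = holder {
        return Plan::RouteTo(node.name.as_str());
    }
    // Step 2: least recently used model on the target, skipping pinned ones.
    let victim = target
        .loaded
        .iter()
        .filter(|m| !m.pinned)
        .min_by_key(|m| m.last_used);
    match victim {
        Some(m) => Plan::Evict(m.id.as_str()),
        None => Plan::NothingEvictable,
    }
}
```

Steps 3 and 4 (the `POST {neuron}/models/unload` call and the lazy load) would act on the returned `Plan`; keeping the decision separate from the HTTP calls makes it easy to unit-test.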
### Metrics
Per-request: model, node, prompt_tokens, completion_tokens, total_tokens, tok_per_sec, time_to_first_token_ms, total_latency_ms. Exposed as Prometheus histograms/counters on a separate port (31314).
## Tech Stack
- **Rust 2024 edition** — workspace with 6 crates
- **Axum 0.8** — HTTP framework
- **reqwest** — HTTP client for proxying to backends
- **figment** — config loading (TOML + env vars)
- **tokio** — async runtime
- **metrics + metrics-exporter-prometheus** — observability
- **tracing** — structured logging
- **candle** — in-process inference engine (neuron only, with CUDA support)
- **cudarc** — patched for neuron's needs (see workspace `[patch]`)
- **clap** — CLI parsing
- **rusqlite** (bundled) — helexa-bench SQLite system-of-record
## Build Commands
```sh
cargo build --release # build all crates
cargo run -p cortex-cli -- serve # run the gateway
cargo test # run all tests
cargo clippy --workspace # lint
```
### neuron Features
- `cuda`: Enables CUDA acceleration in candle and cudarc/nccl bindings. Without it, falls back to CPU.
- `cudnn`: Use cuDNN for convolution/attention kernels (requires `cuda`).
- `flash-attn`: FlashAttention kernels (requires `cuda`).
- `cuda-integration`: Reserved for GPU-only integration tests (requires multiple CUDA devices + libnccl).
### Build Scripts
- `neuron/build.rs`: Compiles CUDA kernels (`src/cuda/*.cu`) using `cudaforge::KernelBuilder` when `cuda` feature is enabled. Handles compute capability checks (sm_<80 disables bf16 intrinsics). Also captures build metadata: git SHA, dirty flag, timestamp, rustc version, profile, features, candle-core version.
## CI
Gitea Actions runs on every push to any branch. All three checks must pass before merging:
```sh
cargo fmt --check --all # formatting
cargo clippy --workspace -- -D warnings # lint (warnings are errors)
cargo test --workspace # tests
```
Run these locally before pushing. `cargo fmt --all` fixes formatting automatically. Clippy warnings must be resolved, not suppressed with `#[allow(...)]` unless there is a clear rationale.
Tagged releases (`v*`) build SRPMs for `cortex`, `helexa-neuron`, and `helexa-bench` and publish them to COPR (`helexa/helexa`). CI injects the build SHA into the build metadata by setting `HELEXA_BUILD_SHA=$(git rev-parse HEAD)`.
## Environment
- Targets Fedora 43 (systemd, SELinux enforcing)
- Nodes communicate over a private network (e.g. WireGuard mesh)
- cortex listens on port 31313 (API) and 31314 (metrics)
- neuron listens on port 13131 on each GPU host
- TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
## Conventions
- Error handling: `anyhow` for binaries, `thiserror` for library crates
- No `unwrap()` in library code; `expect()` only with clear rationale
- All public types derive `Debug, Clone, Serialize, Deserialize` where sensible
- Config structs use `figment` with TOML as primary source, env vars as override
- Prefer `Arc<RwLock<...>>` for shared fleet state; minimize lock duration
- SSE streaming uses `tokio_stream` + `eventsource-stream` for parsing
- Log at `info` for request routing, `debug` for proxy details, `warn` for eviction and node health, `error` for proxy failures
## Testing
### Gateway tests
Use mock neurons spawned via axum in `crates/cortex-gateway/tests/common/mod.rs`. Helpers: `spawn_mock_backend()`, `spawn_gateway()`.
### neuron integration tests
- Numerical reference tests (`numerical_reference.rs`) require `NEURON_REF_MODEL_PATH` env var pointing to a HF snapshot directory. Fixtures are f32-based for precision validation against HuggingFace transformers.
- CUDA integration tests (`tp_worker_lifecycle_cuda.rs`) gated behind `cuda-integration` feature; requires 2+ CUDA devices (e.g., 2x RTX 5090).
### Metrics testing
Use `install_test_recorder()` in test code to capture metrics without the HTTP listener.
## helexa-bench
A continuous, version-aware benchmark harness. Hits each neuron directly on `:13131`, exercises each warm model with a Scenario suite (chat-latency family), and records results into SQLite stamped with the neuron's full `BuildInfo`. The loop is version-aware: skips any (target, build SHA, model, scenario) cell already at `samples_per_version`.
Packaged as `helexa-bench` RPM (prebuilt-binary spec). One systemd unit, typically on the metrics host.
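The version-aware skip can be sketched as a filter over (target, build SHA, model, scenario) cells. The `Cell` struct and `cells_to_run` function are hypothetical names; the real daemon keeps its counts in SQLite, whereas this sketch assumes they have already been read into a `HashMap`.

```rust
use std::collections::HashMap;

/// One benchmark cell; a new build SHA creates a new, empty cell.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Cell {
    pub target: String,
    pub build_sha: String,
    pub model: String,
    pub scenario: String,
}

/// Keep only the cells that still need samples for their build.
pub fn cells_to_run(
    candidates: &[Cell],
    recorded: &HashMap<Cell, u32>,
    samples_per_version: u32,
) -> Vec<Cell> {
    candidates
        .iter()
        // A cell with no recorded rows counts as zero samples.
        .filter(|c| recorded.get(*c).copied().unwrap_or(0) < samples_per_version)
        .cloned()
        .collect()
}
```

Because the build SHA is part of the key, deploying a new neuron build makes every cell for that host eligible again, while an unchanged build stops being sampled once it reaches `samples_per_version`.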
## helexa-acp
Agent Client Protocol bridge — connects ACP editors (Zed, etc.) to any OpenAI-compatible endpoint, cortex by default. Intentionally self-contained: no workspace crate dependencies. Uses `agent-client-protocol` with `unstable_session_model` feature for Zed model picker support. Licensed Apache-2.0 (workspace is GPL-3.0).
## RPM Packaging
- `cortex.spec` — installs the `cortex` binary
- `helexa-neuron.spec` — installs the `neuron` binary under package name `helexa-neuron` (renamed to avoid Fedora's NEURON neural-simulation package collision)
- Systemd units in `data/cortex.service`, `data/neuron.service`
- Example configs: `cortex.example.toml`, `neuron.example.toml`, `models.example.toml`
Install:
```sh
dnf copr enable helexa/helexa
dnf install cortex # gateway host
dnf install helexa-neuron # GPU nodes
```
## Configuration Files
### cortex.toml (gateway)
```toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
```
### models.toml (catalogue)
```toml
[[models]]
id = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 19000
min_devices = 2
min_device_vram_mb = 10000
pinned_on = ["beast"] # optional: never evict from these neurons
```
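One plausible reading of the placement fields, sketched as a predicate. This is an assumption about how `min_devices` and `min_device_vram_mb` interact, not the logic in `catalogue.rs`: a node qualifies when at least `min_devices` of its devices each have at least `min_device_vram_mb` of VRAM. The struct is a hypothetical subset of the real profile, and `vram_mb` is ignored here.

```rust
/// Hypothetical subset of a catalogue entry.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelProfile {
    pub id: String,
    pub min_devices: usize,
    pub min_device_vram_mb: u64,
}

/// `device_vram_mb` holds the total VRAM of each device on one node, in MB.
pub fn node_can_host(profile: &ModelProfile, device_vram_mb: &[u64]) -> bool {
    let big_enough = device_vram_mb
        .iter()
        .filter(|&&vram| vram >= profile.min_device_vram_mb)
        .count();
    big_enough >= profile.min_devices
}
```

With the example entry above (`min_devices = 2`, `min_device_vram_mb = 10000`), a node with two 32000 MB devices would qualify under this reading, while a node with a single 24000 MB device would not.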
### neuron.toml (per-host)
Configured via figment + env override. See `neuron.example.toml` for reference.
## neuron API Endpoints
```
GET /discovery → hardware discovery (hostname, OS, CUDA, devices, harnesses)
GET /health → runtime GPU stats (VRAM, utilization, temperature)
GET /models → loaded/unloaded models with VRAM usage
POST /models/load → load a model with spec (quant, TP, devices)
POST /models/unload → unload a model, freeing device memory
GET /models/{id}/endpoint → inference URL for a model
GET /version → build metadata (SHA, features, candle version, etc.)
```
## Sources of Truth
When prose documentation conflicts with code, trust:
1. Executable configuration (`*.toml`, `Cargo.toml` features)
2. Type definitions in `cortex-core/`
3. Test files in `crates/*/tests/` and `*/src/**/*_test.rs`
4. `CLAUDE.md` for historical design rationale

2
Cargo.lock generated

@@ -793,6 +793,7 @@ name = "cortex-gateway"
version = "0.1.16"
dependencies = [
"anyhow",
"async-trait",
"axum",
"bytes",
"chrono",
@@ -1916,6 +1917,7 @@ dependencies = [
"serde_json",
"tokio",
"tokio-stream",
"tower-http",
"tracing",
"tracing-subscriber",
]


@@ -0,0 +1,38 @@
# helexa-bench config for bob.hanzalova.internal.
#
# Synced to /etc/helexa-bench/helexa-bench.toml by script/infra-setup.sh
# (the helexa-bench RPM ships helexa-bench.example.toml as a
# %config(noreplace) default; this per-host file overrides it).
#
# bob is a client host (it also runs Agent Zero); helexa-bench here hits
# every neuron on the fleet directly and records build-stamped results
# into the local SQLite store.
[bench]
sweep_interval_secs = 1800
samples_per_version = 5
iteration_pause_secs = 2
request_timeout_secs = 600
db_path = "/var/lib/helexa-bench/bench.sqlite"
[scenarios]
prompt_sizes = [128, 4096]
max_tokens = 256
# Read-only JSON API consumed by the bench UI (hosted separately) and for
# programmatic access. Served alongside the sweep loop.
[api]
enabled = true
listen = "0.0.0.0:13132"
[[targets]]
name = "beast"
endpoint = "http://beast.hanzalova.internal:13131"
[[targets]]
name = "benjy"
endpoint = "http://benjy.hanzalova.internal:13131"
[[targets]]
name = "quadbrat"
endpoint = "http://quadbrat.hanzalova.internal:13131"


@@ -0,0 +1,15 @@
# Bootstrap vhost for bench.helexa.ai — http-only, used ONLY to obtain
# the initial Let's Encrypt cert via the webroot challenge (the full TLS
# vhost can't load before the cert file exists). script/infra-setup.sh
# installs this, runs certbot, then swaps in bench.helexa.ai.conf.
server {
listen 80;
server_name bench.helexa.ai;
location /.well-known/acme-challenge/ {
root /var/www/bench.helexa.ai;
}
location / {
try_files $uri $uri/ =404;
}
}


@@ -0,0 +1,56 @@
# Public, auth-less bench UI at https://bench.helexa.ai.
#
# Serves the static SPA from /var/www/bench.helexa.ai (rsynced by
# .gitea/workflows/deploy.yml's deploy-bench-ui job) and reverse-proxies
# /api to the helexa-bench read API on bob over the WireGuard mesh — so
# the browser stays same-origin (no CORS) and the internal API never
# needs to be exposed publicly.
#
# TLS via Let's Encrypt; the cert is obtained/renewed by certbot
# (bootstrapped one-time in script/infra-setup.sh). Mirrors the
# dev.swym.hanzalova.internal vhost convention on this host.
server {
listen 80;
server_name bench.helexa.ai;
# Keep serving the ACME webroot so certbot can renew.
location /.well-known/acme-challenge/ {
root /var/www/bench.helexa.ai;
}
location / {
return 301 https://$host$request_uri;
}
}
server {
listen 443 ssl;
http2 on;
server_name bench.helexa.ai;
ssl_certificate /etc/letsencrypt/live/bench.helexa.ai/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/bench.helexa.ai/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
root /var/www/bench.helexa.ai;
index index.html;
# Bench read API on bob (internal WireGuard); browser stays same-origin.
location /api/ {
proxy_pass http://bob.hanzalova.internal:13132;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
}
# SPA fallback — client-side routes (/trends, /runs) resolve to index.html.
location / {
try_files $uri $uri/ /index.html;
}
}


@@ -0,0 +1,34 @@
# Internal bench UI vhost — https://bench.internal, reachable from inside
# the WireGuard mesh (the public bench.helexa.ai dead-ends at the OPNsense
# LAN interface, which only port-forwards :443 from the WAN). Same SPA +
# /api→bob proxy as bench.helexa.ai, but with an internal-CA cert
# (smallstep "lair", renewed by step@bench.timer). Mirrors the
# *.internal vhost convention on oolon.kosherinata.internal.
server {
server_name bench.internal;
listen 443 ssl;
http2 on;
ssl_certificate /etc/nginx/tls/cert/bench.internal.pem;
ssl_certificate_key /etc/nginx/tls/key/bench.internal.pem;
ssl_trusted_certificate /etc/pki/ca-trust/source/anchors/root-internal.pem;
ssl_protocols TLSv1.3;
# Shared webroot with the public vhost — same built SPA.
root /var/www/bench.helexa.ai;
index index.html;
location /api/ {
proxy_pass http://bob.hanzalova.internal:13132;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
}
location / {
try_files $uri $uri/ /index.html;
}
}


@@ -0,0 +1,25 @@
# Install on the bench host (bob) as /etc/sudoers.d/helexa_gitea_ci
# (owner root:root, mode 0440). Required by .gitea/workflows/deploy.yml,
# which SSHes as gitea_ci@bob to roll out helexa-bench package upgrades
# and config changes.
#
# Filename convention `helexa_gitea_ci` (vs bare `gitea_ci`) so other
# helexa-org apps can drop their own sudoers files on the same host
# without overwriting this one.
#
# helexa-bench polls the neuron fleet (outbound) and serves a read-only
# JSON API on tcp/13132 for the bench UI — hence the firewall-cmd grants.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/helexa-bench/helexa-bench.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start helexa-bench.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop helexa-bench.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl enable --now helexa-bench.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-bench
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-bench
# sudoers reserves `:` and `=` and requires `\` escaping inside command
# arguments — without it visudo errors at the first `:` in `https://`.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager addrepo --from-repofile\=https\://rpm.lair.cafe/lair-cafe-unstable.repo
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf config-manager setopt lair-cafe-unstable.enabled\=1
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --add-service\=helexa-bench --permanent
gitea_ci ALL=(root) NOPASSWD: /usr/bin/firewall-cmd --reload


@@ -9,8 +9,11 @@
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/cortex/cortex.toml
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/cortex/models.toml
# deploy-bench-ui rsyncs the built bench SPA into the nginx webroot.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /var/www/bench.helexa.ai/
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl enable --now cortex.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y cortex
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y cortex


@@ -14,8 +14,13 @@
# flavour installed" — vandalism, not privilege escalation.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/rsync * /etc/neuron/neuron.toml
# deploy.yml writes the per-model systemd drop-in carrying
# NEURON_MAX_PROMPT_TOKENS: gitea_ci stages it in its own dir, then
# installs it root-owned. Exact source/dest paths; see doc/context-limits.md.
gitea_ci ALL=(root) NOPASSWD: /usr/bin/install -o root -g root -m 0644 -D /var/lib/gitea_ci/model.conf /etc/systemd/system/neuron.service.d/model.conf
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl start neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl stop neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl enable --now neuron.service
gitea_ci ALL=(root) NOPASSWD: /usr/bin/systemctl daemon-reload
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf install --refresh --allowerasing -y helexa-neuron-ampere
gitea_ci ALL=(root) NOPASSWD: /usr/bin/dnf upgrade --refresh --allowerasing -y helexa-neuron-ampere


@@ -0,0 +1,20 @@
# Internal-CA cert renewal for %i.internal, driven by step@%i.timer.
# Replicated from oolon.kosherinata.internal (the kosherinata DC proxy).
# Renews an EXISTING cert via mTLS (step ca renew) — the initial cert
# must be issued once with a provisioner (see script/infra-setup.sh).
# Installed to /etc/systemd/system/step@.service.
[Unit]
Description=step cert renew for %i.internal
Documentation=https://smallstep.com/docs/step-ca/renewal
[Service]
Type=oneshot
ExecCondition=/usr/bin/step certificate needs-renewal \
/etc/nginx/tls/cert/%i.internal.pem
ExecStart=/usr/bin/step ca renew \
--force \
--ca-url https://ca.internal \
--root /etc/pki/ca-trust/source/anchors/root-internal.pem \
/etc/nginx/tls/cert/%i.internal.pem \
/etc/nginx/tls/key/%i.internal.pem
ExecStartPost=/usr/bin/systemctl reload nginx.service

15
asset/systemd/step@.timer Normal file

@@ -0,0 +1,15 @@
# Periodic internal-cert renewal for %i.internal (every 15 min, jittered).
# Replicated from oolon.kosherinata.internal. Installed to
# /etc/systemd/system/step@.timer; enable per-cert with
# `systemctl enable --now step@bench.timer`.
[Unit]
Description=step cert renew timer for %i.internal
[Timer]
Persistent=true
OnCalendar=*:1/15
AccuracySec=1us
RandomizedDelaySec=5m
[Install]
WantedBy=timers.target

3
bench/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
node_modules
dist
*.local

45
bench/README.md Normal file

@@ -0,0 +1,45 @@
# helexa bench UI
A Vite + React (SWC, TypeScript) app that visualises the fleet benchmark
data collected by `helexa-bench`. It reads the read-only JSON API the
bench daemon serves (`crates/helexa-bench/src/api.rs`, default
`:13132` on bob).
Stack: React Router, react-bootstrap, Recharts.
## Pages
- **Overview** — latest median results per (host, model, scenario) cell.
- **Trends** — decode-tok/s and TTFT plotted across neuron build SHAs as
releases roll out (the headline view). Pick host / model / scenario.
- **Runs** — filterable raw-run explorer.
## Develop
```sh
cd bench
npm install
npm run dev # http://localhost:5173
```
`vite.config.ts` proxies `/api` → `http://bob.hanzalova.internal:13132`,
so the dev server talks to the live bench API with no CORS fuss. Point
the proxy elsewhere (or run a local `helexa-bench serve`) to develop
against other data.
## Production hosting
Public at **https://bench.helexa.ai** — nginx on the gateway
(`hanzalova.internal`) serves the static `dist/` and reverse-proxies
`/api` to the bench API on bob over WireGuard, so the SPA is same-origin
(no CORS) and the internal API stays off the public internet.
- `npm run build` is run with **no** `VITE_API_BASE` (the app calls
`/api/...` on its own origin; nginx proxies it to bob).
- `.gitea/workflows/deploy.yml` (`deploy-bench-ui`) builds and rsyncs
`dist/` to `/var/www/bench.helexa.ai` on every deploy.
- The nginx vhost (`asset/nginx/bench.helexa.ai.conf`) and the
Let's Encrypt cert are one-time host setup in `script/infra-setup.sh`.
To host elsewhere instead, build with
`VITE_API_BASE=<bob-api-origin>` and serve the static `dist/`.

12
bench/index.html Normal file

@@ -0,0 +1,12 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>helexa bench</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>

2191
bench/package-lock.json generated Normal file

File diff suppressed because it is too large

28
bench/package.json Normal file

@@ -0,0 +1,28 @@
{
"name": "helexa-bench-ui",
"private": true,
"version": "0.1.0",
"type": "module",
"description": "Visualisation app for helexa-bench fleet benchmark data.",
"scripts": {
"dev": "vite",
"build": "tsc && vite build",
"preview": "vite preview"
},
"dependencies": {
"bootstrap": "^5.3.3",
"react": "^18.3.1",
"react-bootstrap": "^2.10.5",
"react-dom": "^18.3.1",
"react-router-dom": "^6.26.2",
"recharts": "^2.12.7"
},
"devDependencies": {
"@types/node": "^20.14.0",
"@types/react": "^18.3.5",
"@types/react-dom": "^18.3.0",
"@vitejs/plugin-react-swc": "^3.7.0",
"typescript": "^5.5.4",
"vite": "^5.4.0"
}
}

30
bench/src/App.tsx Normal file

@@ -0,0 +1,30 @@
import { Container, Nav, Navbar } from "react-bootstrap";
import { NavLink, Outlet } from "react-router-dom";
export default function App() {
return (
<>
<Navbar bg="dark" variant="dark" expand="md">
<Container>
<Navbar.Brand as={NavLink} to="/">
helexa&nbsp;bench
</Navbar.Brand>
<Nav className="me-auto">
<Nav.Link as={NavLink} to="/" end>
Overview
</Nav.Link>
<Nav.Link as={NavLink} to="/trends">
Trends
</Nav.Link>
<Nav.Link as={NavLink} to="/runs">
Runs
</Nav.Link>
</Nav>
</Container>
</Navbar>
<Container className="py-4">
<Outlet />
</Container>
</>
);
}

45
bench/src/api.ts Normal file

@@ -0,0 +1,45 @@
import type { Dimensions, ReportRow, RunRow, SeriesPoint } from "./types";
// Empty default → `fetch('/api/...')` hits the dev proxy (vite.config.ts)
// or the same origin. For a separately-hosted build, set VITE_API_BASE to
// the bob API origin (e.g. http://bob.hanzalova.internal:13132).
const BASE = import.meta.env.VITE_API_BASE ?? "";
async function getJson<T>(path: string): Promise<T> {
const res = await fetch(`${BASE}${path}`);
if (!res.ok) {
throw new Error(`${res.status} ${res.statusText}: ${await res.text()}`);
}
return res.json() as Promise<T>;
}
export const getDimensions = () => getJson<Dimensions>("/api/dimensions");
export const getSummary = () => getJson<ReportRow[]>("/api/summary");
// host is resolved server-side (each model maps to one host today), so the
// public UI selects by model + scenario alone.
export const getSeries = (model: string, scenario: string) =>
getJson<SeriesPoint[]>(
`/api/series?model=${encodeURIComponent(model)}&scenario=${encodeURIComponent(scenario)}`,
);
export interface RunsParams {
host?: string;
model?: string;
scenario?: string;
sha?: string;
ok?: boolean;
limit?: number;
}
export const getRuns = (p: RunsParams = {}) => {
const q = new URLSearchParams();
if (p.host) q.set("host", p.host);
if (p.model) q.set("model", p.model);
if (p.scenario) q.set("scenario", p.scenario);
if (p.sha) q.set("sha", p.sha);
if (p.ok !== undefined) q.set("ok", String(p.ok));
if (p.limit) q.set("limit", String(p.limit));
const qs = q.toString();
return getJson<RunRow[]>(`/api/runs${qs ? `?${qs}` : ""}`);
};

52
bench/src/baseline.ts Normal file

@@ -0,0 +1,52 @@
// Pre-helexa-bench baseline, transcribed verbatim from doc/benchmarks.md.
//
// IMPORTANT — different measurement regime. These were measured by
// script/bench.py *through the cortex gateway* (so TTFT/total include a
// proxy hop), reported as medians only, before helexa-bench existed.
// helexa-bench measures each neuron *directly*. So these points are an
// honest historical anchor, NOT apples-to-apples with the live series —
// the Trends view renders them dashed + labelled, never merged into the
// live line.
//
// Host is inferred from the model via the doc's Fleet table
// (beast=27B, benjy=8B, quadbrat=1.7B). Timestamps are the two 2026-06-12
// snapshots in the doc, ordered (08:00 = pre-#11, 16:00 = post-#11) so
// they sort before the bench era on the shared time axis.
export interface BaselinePoint {
host: string;
model: string;
scenario: string;
git_sha: string;
build_timestamp: string;
ttft_s: number;
decode_tps: number;
total_s: number;
}
/** Source: bench.py via cortex gateway — see doc/benchmarks.md. */
export const BASELINE_SOURCE = "bench.py · via cortex gateway";
export const BASELINE: BaselinePoint[] = [
// ── 8f6f1d3 — baseline (2026-06-12) ────────────────────────────────
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:128", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 1.658, decode_tps: 35.0, total_s: 8.981 },
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:4096", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 7.067, decode_tps: 33.7, total_s: 14.63 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:128", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 0.884, decode_tps: 62.4, total_s: 4.938 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:4096", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 1.818, decode_tps: 46.5, total_s: 7.27 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:128", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 0.685, decode_tps: 81.3, total_s: 3.741 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:4096", git_sha: "8f6f1d3", build_timestamp: "2026-06-12T08:00:00Z", ttft_s: 2.743, decode_tps: 35.4, total_s: 9.884 },
// ── a1952a4 — post prefix-KV-cache (#11, 2026-06-12) ───────────────
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:128", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 1.355, decode_tps: 45.8, total_s: 4.147 },
{ host: "beast", model: "Qwen/Qwen3.6-27B", scenario: "chat:4096", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 1.431, decode_tps: 43.3, total_s: 4.387 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:128", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 0.886, decode_tps: 78.6, total_s: 2.478 },
{ host: "benjy", model: "Qwen/Qwen3-8B", scenario: "chat:4096", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 1.824, decode_tps: 58.3, total_s: 3.969 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:128", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 0.702, decode_tps: 104.8, total_s: 1.895 },
{ host: "quadbrat", model: "Qwen/Qwen3-1.7B", scenario: "chat:4096", git_sha: "a1952a4", build_timestamp: "2026-06-12T16:00:00Z", ttft_s: 2.749, decode_tps: 44.9, total_s: 5.534 },
];
/** Baseline points for one (model, scenario) cell, oldest first. */
export function baselineFor(model: string, scenario: string): BaselinePoint[] {
return BASELINE.filter(
(b) => b.model === model && b.scenario === scenario,
).sort((a, b) => a.build_timestamp.localeCompare(b.build_timestamp));
}

22
bench/src/main.tsx Normal file

@@ -0,0 +1,22 @@
import React from "react";
import ReactDOM from "react-dom/client";
import { BrowserRouter, Route, Routes } from "react-router-dom";
import "bootstrap/dist/css/bootstrap.min.css";
import App from "./App";
import Overview from "./pages/Overview";
import Trends from "./pages/Trends";
import Runs from "./pages/Runs";
ReactDOM.createRoot(document.getElementById("root")!).render(
<React.StrictMode>
<BrowserRouter>
<Routes>
<Route path="/" element={<App />}>
<Route index element={<Overview />} />
<Route path="trends" element={<Trends />} />
<Route path="runs" element={<Runs />} />
</Route>
</Routes>
</BrowserRouter>
</React.StrictMode>,
);


@@ -0,0 +1,64 @@
import { useEffect, useState } from "react";
import { Alert, Spinner, Table } from "react-bootstrap";
import { getSummary } from "../api";
import type { ReportRow } from "../types";
const f = (n: number | null, p = 2) => (n == null ? "—" : n.toFixed(p));
export default function Overview() {
const [rows, setRows] = useState<ReportRow[]>([]);
const [err, setErr] = useState<string | null>(null);
const [loading, setLoading] = useState(true);
useEffect(() => {
getSummary()
.then(setRows)
.catch((e) => setErr(String(e)))
.finally(() => setLoading(false));
}, []);
if (loading) return <Spinner animation="border" />;
if (err) return <Alert variant="danger">{err}</Alert>;
return (
<>
<h3 className="mb-3">Latest results per cell</h3>
<p className="text-muted">
Median of each cell's samples on the most recent build seen for that
(host, model, scenario).
</p>
<Table striped bordered hover responsive size="sm">
<thead>
<tr>
<th>GPU</th>
<th>model</th>
<th className="text-end">prompt tok</th>
<th className="text-end">TTFT (s)</th>
<th className="text-end">decode tok/s</th>
<th className="text-end">total (s)</th>
<th>build</th>
<th className="text-end">n</th>
</tr>
</thead>
<tbody>
{rows.map((r, i) => (
<tr key={i}>
<td>{r.gpu ?? r.target_name}</td>
<td>{r.model_id}</td>
<td className="text-end">
{r.prompt_tokens ?? `~${r.prompt_size_approx}`}
</td>
<td className="text-end">{f(r.ttft_s_median, 3)}</td>
<td className="text-end">{f(r.decode_tps_median, 1)}</td>
<td className="text-end">{f(r.total_s_median, 3)}</td>
<td>
<code>{r.git_sha}</code>
</td>
<td className="text-end">{r.samples}</td>
</tr>
))}
</tbody>
</Table>
</>
);
}
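
The table above shows a median per cell, served by the helexa-bench read API. How that median is computed is not part of this diff; the following is a minimal Rust sketch under the assumption that non-finite samples are dropped and that an even sample count averages the two middle values. The function name `median` is hypothetical.

```rust
/// Hypothetical helper: median of one cell's samples.
/// Assumption: NaN/infinite samples are ignored, and an even count
/// averages the two middle values.
fn median(samples: &[f64]) -> Option<f64> {
    let mut v: Vec<f64> = samples.iter().copied().filter(|x| x.is_finite()).collect();
    if v.is_empty() {
        // Mirrors the nullable `*_median` fields: no samples, no median.
        return None;
    }
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 1 {
        v[mid]
    } else {
        (v[mid - 1] + v[mid]) / 2.0
    })
}
```

A `None` result corresponds to the `null` medians that the UI formats as a placeholder.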

bench/src/pages/Runs.tsx

@@ -0,0 +1,141 @@
import { useEffect, useState } from "react";
import { Alert, Badge, Col, Form, Row, Spinner, Table } from "react-bootstrap";
import { getDimensions, getRuns } from "../api";
import type { Dimensions, RunRow } from "../types";
const f = (n: number | null, p = 2) => (n == null ? "—" : n.toFixed(p));
function Picker({
label,
value,
set,
options,
}: {
label: string;
value: string;
set: (v: string) => void;
options: string[];
}) {
return (
<Form.Group as={Col}>
<Form.Label>{label}</Form.Label>
<Form.Select value={value} onChange={(e) => set(e.target.value)}>
<option value="">(all)</option>
{options.map((o) => (
<option key={o} value={o}>
{o}
</option>
))}
</Form.Select>
</Form.Group>
);
}
export default function Runs() {
const [dims, setDims] = useState<Dimensions | null>(null);
const [host, setHost] = useState("");
const [model, setModel] = useState("");
const [scenario, setScenario] = useState("");
const [rows, setRows] = useState<RunRow[]>([]);
const [err, setErr] = useState<string | null>(null);
const [loading, setLoading] = useState(false);
useEffect(() => {
getDimensions()
.then(setDims)
.catch((e) => setErr(String(e)));
}, []);
useEffect(() => {
setLoading(true);
getRuns({
host: host || undefined,
model: model || undefined,
scenario: scenario || undefined,
limit: 200,
})
.then(setRows)
.catch((e) => setErr(String(e)))
.finally(() => setLoading(false));
}, [host, model, scenario]);
if (err) return <Alert variant="danger">{err}</Alert>;
return (
<>
<h3 className="mb-3">Runs</h3>
{dims && (
<Row className="g-3 mb-3">
{/* GPU filter — labelled by GPU, but filters by the underlying host. */}
<Form.Group as={Col}>
<Form.Label>GPU</Form.Label>
<Form.Select value={host} onChange={(e) => setHost(e.target.value)}>
<option value="">(all)</option>
{dims.hosts.map((h) => (
<option key={h} value={h}>
{dims.host_gpus[h] ?? h}
</option>
))}
</Form.Select>
</Form.Group>
<Picker
label="Model"
value={model}
set={setModel}
options={dims.models}
/>
<Picker
label="Scenario"
value={scenario}
set={setScenario}
options={dims.scenarios}
/>
</Row>
)}
{loading ? (
<Spinner animation="border" />
) : (
<Table striped bordered hover responsive size="sm">
<thead>
<tr>
<th>ts</th>
<th>GPU</th>
<th>model</th>
<th>scenario</th>
<th>build</th>
<th className="text-end">TTFT</th>
<th className="text-end">tok/s</th>
<th className="text-end">total</th>
<th>ok</th>
</tr>
</thead>
<tbody>
{rows.map((r) => (
<tr key={r.id}>
<td>{r.ts}</td>
<td>{r.gpu ?? r.host}</td>
<td>{r.model_id}</td>
<td>{r.scenario_id}</td>
<td>
<code>{r.git_sha}</code>
</td>
<td className="text-end">{f(r.ttft_s, 3)}</td>
<td className="text-end">{f(r.decode_tps, 1)}</td>
<td className="text-end">{f(r.total_s, 3)}</td>
<td>
{r.ok ? (
<Badge bg="success">ok</Badge>
) : (
<Badge bg="danger" title={r.error ?? ""}>
fail
</Badge>
)}
</td>
</tr>
))}
</tbody>
</Table>
)}
</>
);
}

bench/src/pages/Trends.tsx

@@ -0,0 +1,221 @@
import { useEffect, useMemo, useState } from "react";
import { Alert, Col, Form, Row, Spinner } from "react-bootstrap";
import {
CartesianGrid,
Legend,
Line,
LineChart,
ReferenceLine,
ResponsiveContainer,
Tooltip,
XAxis,
YAxis,
} from "recharts";
import { getDimensions, getSeries } from "../api";
import type { Dimensions, SeriesPoint } from "../types";
import { BASELINE_SOURCE, baselineFor } from "../baseline";
function Picker({
label,
value,
set,
options,
}: {
label: string;
value: string;
set: (v: string) => void;
options: string[];
}) {
return (
<Form.Group as={Col}>
<Form.Label>{label}</Form.Label>
<Form.Select value={value} onChange={(e) => set(e.target.value)}>
{options.map((o) => (
<option key={o} value={o}>
{o}
</option>
))}
</Form.Select>
</Form.Group>
);
}
export default function Trends() {
const [dims, setDims] = useState<Dimensions | null>(null);
const [model, setModel] = useState("");
const [scenario, setScenario] = useState("");
const [series, setSeries] = useState<SeriesPoint[]>([]);
const [err, setErr] = useState<string | null>(null);
useEffect(() => {
getDimensions()
.then((d) => {
setDims(d);
if (d.models[0]) setModel(d.models[0]);
if (d.scenarios[0]) setScenario(d.scenarios[0]);
})
.catch((e) => setErr(String(e)));
}, []);
useEffect(() => {
if (model && scenario) {
getSeries(model, scenario)
.then(setSeries)
.catch((e) => setErr(String(e)));
}
}, [model, scenario]);
// Prepend the pre-helexa-bench baseline (dashed, separate keys) so it
// anchors the timeline without being merged into the live line. Different
// measurement regime — see baseline.ts / doc/benchmarks.md.
const base = useMemo(
() => baselineFor(model, scenario),
[model, scenario],
);
const data = useMemo(
() => [
...base.map((p) => ({
label: p.git_sha,
baseTtft: p.ttft_s,
baseDecode: p.decode_tps,
baseTotal: p.total_s,
})),
...series.map((p) => ({
label: p.git_sha,
ttft: p.ttft_s_median,
decode: p.decode_tps_median,
total: p.total_s_median,
})),
],
[series, base],
);
// Divider marking the boundary between the two regimes (drawn at the
// first live build, with baseline points to its left).
const firstLive = series[0]?.git_sha;
const showDivider = base.length > 0 && series.length > 0;
if (err) return <Alert variant="danger">{err}</Alert>;
if (!dims) return <Spinner animation="border" />;
return (
<>
<h3 className="mb-3">Trends over builds</h3>
<Row className="g-3 mb-4">
<Picker
label="Model"
value={model}
set={setModel}
options={dims.models}
/>
<Picker
label="Scenario"
value={scenario}
set={setScenario}
options={dims.scenarios}
/>
</Row>
{dims.model_gpus[model] && (
<p className="text-muted mb-3">
Measured on <strong>{dims.model_gpus[model]}</strong>.
</p>
)}
{data.length === 0 ? (
<Alert variant="info">No data for this selection yet.</Alert>
) : (
<>
{base.length > 0 && (
<p className="text-muted small mb-3">
Dashed = pre-helexa-bench baseline ({BASELINE_SOURCE}); solid =
helexa-bench (direct to neuron). Different measurement regimes;
see <code>doc/benchmarks.md</code>.
</p>
)}
<h5 className="mt-3">decode tok/s (higher is better)</h5>
<ResponsiveContainer width="100%" height={280}>
<LineChart data={data} margin={{ top: 8, right: 24, bottom: 8, left: 0 }}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="label" />
<YAxis />
<Tooltip />
<Legend />
{showDivider && firstLive && (
<ReferenceLine
x={firstLive}
stroke="#bbb"
strokeDasharray="3 3"
label={{
value: "bench.py → helexa-bench",
position: "top",
fill: "#999",
fontSize: 11,
}}
/>
)}
<Line
type="monotone"
dataKey="decode"
name="decode tok/s"
stroke="#0d6efd"
connectNulls
/>
{base.length > 0 && (
<Line
type="monotone"
dataKey="baseDecode"
name="baseline (bench.py · gateway)"
stroke="#888"
strokeDasharray="5 5"
connectNulls
/>
)}
</LineChart>
</ResponsiveContainer>
<h5 className="mt-4">TTFT seconds (lower is better)</h5>
<ResponsiveContainer width="100%" height={280}>
<LineChart data={data} margin={{ top: 8, right: 24, bottom: 8, left: 0 }}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="label" />
<YAxis />
<Tooltip />
<Legend />
{showDivider && firstLive && (
<ReferenceLine
x={firstLive}
stroke="#bbb"
strokeDasharray="3 3"
label={{
value: "bench.py → helexa-bench",
position: "top",
fill: "#999",
fontSize: 11,
}}
/>
)}
<Line
type="monotone"
dataKey="ttft"
name="TTFT (s)"
stroke="#dc3545"
connectNulls
/>
{base.length > 0 && (
<Line
type="monotone"
dataKey="baseTtft"
name="baseline (bench.py · gateway)"
stroke="#888"
strokeDasharray="5 5"
connectNulls
/>
)}
</LineChart>
</ResponsiveContainer>
</>
)}
</>
);
}

bench/src/types.ts

@@ -0,0 +1,69 @@
// Mirrors the JSON served by helexa-bench's read API (crates/helexa-bench/src/api.rs).
export interface BuildRef {
git_sha: string;
build_timestamp: string | null;
package_version: string | null;
}
export interface Dimensions {
hosts: string[];
models: string[];
scenarios: string[];
builds: BuildRef[];
/** host → GPU label, e.g. "2× RTX 5090". */
host_gpus: Record<string, string>;
/** model → GPU label (model maps to one host today). */
model_gpus: Record<string, string>;
}
/** Latest-SHA-per-cell medians (the report table). */
export interface ReportRow {
target_name: string;
model_id: string;
scenario_id: string;
prompt_size_approx: number;
git_sha: string;
prompt_tokens: number | null;
ttft_s_median: number | null;
decode_tps_median: number | null;
total_s_median: number | null;
samples: number;
/** Public-facing resource name (the host's GPU(s)). */
gpu: string | null;
}
/** One point in a per-build time-series for a (host, model, scenario) cell. */
export interface SeriesPoint {
git_sha: string;
build_timestamp: string | null;
package_version: string | null;
ttft_s_median: number | null;
decode_tps_median: number | null;
total_s_median: number | null;
samples: number;
}
export interface RunRow {
id: number;
ts: string;
host: string;
/** Public-facing resource name (the host's GPU(s)). */
gpu: string | null;
hostname: string | null;
git_sha: string;
build_timestamp: string | null;
package_version: string;
model_id: string;
harness: string;
scenario_id: string;
prompt_size_approx: number;
prompt_tokens_actual: number | null;
max_tokens: number;
ttft_s: number | null;
decode_tps: number | null;
total_s: number | null;
completion_tokens: number | null;
ok: boolean;
error: string | null;
}

bench/src/vite-env.d.ts

@@ -0,0 +1,9 @@
/// <reference types="vite/client" />
interface ImportMetaEnv {
/** Base origin of the bench API. Empty → use the dev proxy / same origin. */
readonly VITE_API_BASE?: string;
}
interface ImportMeta {
readonly env: ImportMetaEnv;
}

bench/tsconfig.json

@@ -0,0 +1,22 @@
{
"compilerOptions": {
"target": "ES2022",
"useDefineForClassFields": true,
"lib": ["ES2022", "DOM", "DOM.Iterable"],
"module": "ESNext",
"skipLibCheck": true,
"moduleResolution": "bundler",
"allowImportingTsExtensions": true,
"resolveJsonModule": true,
"isolatedModules": true,
"moduleDetection": "force",
"noEmit": true,
"jsx": "react-jsx",
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"noFallthroughCasesInSwitch": true,
"types": ["node", "vite/client"]
},
"include": ["src", "vite.config.ts"]
}

bench/vite.config.ts

@@ -0,0 +1,18 @@
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react-swc";
// Dev server proxies /api to the bench API on bob so `fetch('/api/...')`
// works without CORS/mixed-origin fuss during local development.
// For a production build hosted elsewhere, set VITE_API_BASE to the bob
// API origin (e.g. http://bob.hanzalova.internal:13132) instead.
export default defineConfig({
plugins: [react()],
server: {
proxy: {
"/api": {
target: "http://bob.hanzalova.internal:13132",
changeOrigin: true,
},
},
},
});


@@ -5,6 +5,11 @@
# Environment variable overrides use CORTEX_ prefix with __ separators:
# CORTEX_GATEWAY__LISTEN=0.0.0.0:31313
# Path to the model catalogue (limits, cost, pinning, aliases, feasibility).
# Defaults to the packaged location below; uncomment to override for a
# non-packaged / local run.
# models_config = "/etc/cortex/models.toml"
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
@@ -43,3 +48,45 @@ vram_mb = 12288 # e.g. RTX 3060 (12 GB)
pinned = [
"your-org/embedding-model",
]
# -- Entitlements (multi-tenant governance, #47) -------------------------
# Identity + per-key token budgets. Omit this section entirely for the
# legacy single-operator behaviour: requests are anonymous and uncapped.
#
# The local/static provider below is the source of truth for accounts,
# keys, and hard caps until the upstream clearing house exists. Identity
# rides standard bearer auth only — clients send
# Authorization: Bearer <key>
# no custom headers or body fields.
[entitlements]
# Reject unauthenticated requests with 401 invalid_api_key. Leave false
# (allow-anonymous) during rollout; flip to true once keys are issued.
require_auth = false
# One entry per API key.
[[entitlements.keys]]
key = "sk-example-rolling" # the bearer token the client sends
account_id = "team-research" # billable account (keys may share one)
key_id = "research-ci" # stable label for ledger/metrics (optional)
hard_cap = 5_000_000 # hard token cap over the window
# Rolling window that resets — over-cap requests get 429 rate_limit_exceeded
# + Retry-After, so well-behaved clients (opencode/AI SDK) back off and retry.
window = { kind = "rolling", seconds = 3600 }
[[entitlements.keys]]
key = "sk-example-balance"
account_id = "team-research"
key_id = "research-prepaid"
hard_cap = 20_000_000
# Hard balance, no reset — exhaustion returns 429 insufficient_quota
# (the client surfaces and stops). This is the default when `window` is
# omitted. Never 402.
window = { kind = "balance" }
[[entitlements.keys]]
key = "sk-example-infra"
account_id = "operator"
key_id = "infra"
# No hard_cap → uncapped operator infra key (own fleet, own use). Still
# metered for visibility.
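
The config comments above say identity rides on standard bearer auth only. As a rough illustration of the first step, here is a small Rust sketch that pulls the key out of an `Authorization` header value. The function name `bearer_token` is hypothetical and not taken from the repository; the only assumptions are that the scheme is matched case-insensitively and that an empty token counts as missing.

```rust
/// Hypothetical helper: extract the bearer key from an `Authorization`
/// header value. Returns `None` for a missing scheme, a non-bearer scheme,
/// or an empty token.
fn bearer_token(header_value: &str) -> Option<&str> {
    // Split "<scheme> <credentials>" at the first space.
    let (scheme, rest) = header_value.trim().split_once(' ')?;
    // HTTP auth scheme names are case-insensitive.
    if !scheme.eq_ignore_ascii_case("bearer") {
        return None;
    }
    let token = rest.trim();
    if token.is_empty() { None } else { Some(token) }
}
```

With `require_auth = false`, a `None` here would presumably fall through to the anonymous path; with `require_auth = true` it would map to `401 invalid_api_key`.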


@@ -1,6 +1,7 @@
//! Model catalogue — profiles describing how to serve each model.
use crate::discovery::DeviceInfo;
use crate::harness::{ModelCost, ModelLimit};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::path::Path;
@@ -35,6 +36,21 @@ pub struct ModelProfile {
/// on this being explicit per model rather than implicit.
#[serde(default)]
pub source: Option<String>,
// ── Enrichment (issue #62) ────────────────────────────────
/// Per-model token budget. When present, advertised in `/v1/models`
/// so clients can size and compact their context automatically.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
/// Operator-set pricing (USD per 1M tokens). `0.0` for self-hosted.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cost: Option<ModelCost>,
/// Static capability flags the operator wants to advertise even
/// before the model is loaded on any neuron (e.g. `"reasoning"`,
/// `"tool_call"`). Runtime-detected capabilities from the harness
/// are unioned with this set in the gateway's `/v1/models` response.
#[serde(default)]
pub capabilities: Vec<String>,
}
fn default_min_devices() -> u32 {
@@ -152,6 +168,9 @@ mod tests {
min_device_vram_mb: Some(24_000),
pinned_on: vec![],
source: None,
limit: None,
cost: None,
capabilities: vec![],
}
}
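
The `capabilities` doc comment says the operator-configured flags are unioned with the capabilities detected at runtime. The gateway code that does this is not in this hunk; the sketch below shows one way such a union could look, using a hypothetical function `advertised_capabilities` and assuming a deduplicated, sorted result is acceptable.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: union operator-configured capability flags with
/// the ones a harness detected at runtime. Output is deduplicated and
/// sorted (an assumption; the real response order is not shown here).
fn advertised_capabilities(configured: &[String], detected: &[String]) -> Vec<String> {
    configured
        .iter()
        .chain(detected)
        .cloned()
        .collect::<BTreeSet<String>>()
        .into_iter()
        .collect()
}
```

Because the configured set is part of the union, a model that is not yet loaded on any neuron still advertises its static flags.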


@@ -1,3 +1,4 @@
use crate::entitlements::CapWindow;
use figment::{
Figment,
providers::{Env, Format, Toml},
@@ -11,13 +12,61 @@ pub struct GatewayConfig {
pub eviction: EvictionSettings,
/// Neuron endpoints (replaces old NodeConfig with static vram_mb/pinned).
pub neurons: Vec<NeuronEndpoint>,
-/// Path to the model catalogue file (default: "models.toml").
+/// Path to the model catalogue file. Defaults to the packaged
+/// location (`/etc/cortex/models.toml`); set explicitly for
+/// non-packaged / local runs.
#[serde(default = "default_models_path")]
pub models_config: String,
/// Multi-tenant governance: auth + per-key token budgets (#47). Empty
/// by default — anonymous, uncapped — so existing single-operator
/// setups keep working until keys are configured.
#[serde(default)]
pub entitlements: EntitlementsConfig,
}
/// `[entitlements]` — the local/static [`crate::entitlements::EntitlementProvider`]
/// source of truth (#50). Accounts, keys, and hard caps live here; the
/// future upstream client (#57) ignores this section.
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct EntitlementsConfig {
/// Reject unauthenticated requests with `401 invalid_api_key` when
/// true. Default `false` (allow-anonymous) for dev / single-operator
/// continuity.
#[serde(default)]
pub require_auth: bool,
/// Static API keys and their budgets, consumed by the local provider.
#[serde(default)]
pub keys: Vec<ApiKeyConfig>,
}
/// One configured API key: the bearer token, the account it bills to, and
/// its hard cap. `[[entitlements.keys]]` in TOML.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ApiKeyConfig {
/// The bearer token clients send in `Authorization: Bearer <key>`.
pub key: String,
/// Billable account. Multiple keys may share one account.
pub account_id: String,
/// Stable per-key identifier for ledger/metrics labels. Defaults to
/// `account_id` when omitted, so the secret is never used as a label.
#[serde(default)]
pub key_id: Option<String>,
/// Hard token cap. `None`/omitted = uncapped (e.g. operator infra key).
#[serde(default)]
pub hard_cap: Option<u64>,
/// Cap-window semantics. Default: a non-resetting [`CapWindow::Balance`].
#[serde(default)]
pub window: CapWindow,
}
fn default_models_path() -> String {
-"models.toml".into()
+// Absolute, so the systemd-launched binary finds the catalogue
+// regardless of its working directory. The RPM installs the catalogue
+// here (`cortex.spec`); a relative "models.toml" silently resolved to
+// the service cwd and left the catalogue empty in production
+// (pinning / aliases / limits all no-ops). Override via `models_config`
+// in cortex.toml for local runs.
+"/etc/cortex/models.toml".into()
}
#[derive(Debug, Clone, Serialize, Deserialize)]
@@ -79,6 +128,7 @@ impl Default for GatewayConfig {
},
neurons: vec![],
models_config: default_models_path(),
entitlements: EntitlementsConfig::default(),
}
}
}
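
`ApiKeyConfig` documents that `key_id` defaults to `account_id` so the secret is never used as a label. The local provider that consumes these entries is not shown in this hunk; the following std-only sketch illustrates that defaulting rule plus a lookup. `KeyEntry`, `build_index` and `resolve` are hypothetical names, and `Principal` is re-declared here only to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Stand-in for the `Principal` type from the entitlements module.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Principal {
    account_id: String,
    key_id: String,
}

/// Hypothetical, borrowed view of one `[[entitlements.keys]]` entry.
struct KeyEntry<'a> {
    key: &'a str,
    account_id: &'a str,
    key_id: Option<&'a str>,
}

/// Build a bearer-key to principal index. A missing `key_id` falls back to
/// `account_id`, so the secret itself never becomes a label.
fn build_index(entries: &[KeyEntry<'_>]) -> HashMap<String, Principal> {
    entries
        .iter()
        .map(|e| {
            let principal = Principal {
                account_id: e.account_id.to_string(),
                key_id: e.key_id.unwrap_or(e.account_id).to_string(),
            };
            (e.key.to_string(), principal)
        })
        .collect()
}

/// Look up a bearer key; empty or unknown keys resolve to `None`.
fn resolve<'a>(index: &'a HashMap<String, Principal>, api_key: &str) -> Option<&'a Principal> {
    if api_key.is_empty() {
        return None;
    }
    index.get(api_key)
}
```

A production provider would likely avoid holding raw secrets as map keys (for example by hashing them); that choice is outside what this diff shows.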


@@ -33,6 +33,12 @@ pub struct DiscoveryResponse {
/// failure.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cuda_unavailable_reason: Option<String>,
/// The neuron's effective maximum prompt size in tokens
/// (`NEURON_MAX_PROMPT_TOKENS`) — the enforced prompt cap on this
/// host. `#[serde(default)]` (→ 0) for backward compatibility with
/// neurons that predate this field; cortex treats 0 as "unknown".
#[serde(default)]
pub max_prompt_tokens: u64,
}
/// Runtime health metrics for a single GPU device.
@@ -62,6 +68,57 @@ pub struct HealthResponse {
pub devices: Vec<DeviceHealth>,
#[serde(default)]
pub activation: ActivationStatus,
/// Per-model admission load (#53): how many requests are running vs.
/// queued on each loaded model right now. Cortex's load-aware router
/// (#55) reads this to spread traffic across replicas and to propagate
/// honest backpressure. `#[serde(default)]` keeps older gateways/neurons
/// interoperable (absent → empty → treated as no load info).
#[serde(default)]
pub models: Vec<ModelLoad>,
}
/// Live admission load for one loaded model (#53).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelLoad {
pub id: String,
/// Requests currently running (batch-1 → 0 or 1).
pub in_flight: usize,
/// Requests waiting in the bounded admission queue.
pub queue_depth: usize,
}
#[cfg(test)]
mod health_load_tests {
use super::*;
#[test]
fn health_response_without_models_field_still_deserializes() {
// A pre-#53 neuron's /health payload omits `models`; the gateway
// must still parse it (serde default → empty).
let json = r#"{"uptime_secs":42,"devices":[]}"#;
let resp: HealthResponse = serde_json::from_str(json).expect("back-compat parse");
assert_eq!(resp.uptime_secs, 42);
assert!(resp.models.is_empty());
}
#[test]
fn health_response_round_trips_model_load() {
let resp = HealthResponse {
uptime_secs: 1,
devices: vec![],
activation: ActivationStatus::default(),
models: vec![ModelLoad {
id: "Qwen/Qwen3.6-27B".into(),
in_flight: 1,
queue_depth: 3,
}],
};
let s = serde_json::to_string(&resp).unwrap();
let back: HealthResponse = serde_json::from_str(&s).unwrap();
assert_eq!(back.models.len(), 1);
assert_eq!(back.models[0].in_flight, 1);
assert_eq!(back.models[0].queue_depth, 3);
}
}
/// High-level activation state of the neuron daemon. The HTTP listener
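
The `ModelLoad` entries above are described as input for a load-aware router (#55), which is not part of this hunk. A minimal sketch of how such a router might pick a replica, assuming load is simply `in_flight + queue_depth` and that hosts not reporting the model are skipped (both are assumptions; `least_loaded` is a hypothetical name, and `ModelLoad` is re-declared without serde to stay self-contained):

```rust
/// Stand-in for the `ModelLoad` struct from the health response.
#[derive(Debug, Clone)]
struct ModelLoad {
    id: String,
    in_flight: usize,
    queue_depth: usize,
}

/// Hypothetical router step: among (neuron name, reported loads) pairs,
/// return the neuron with the lowest running + queued count for `model_id`.
/// Neurons that do not report the model are skipped; ties go to the
/// earliest entry.
fn least_loaded<'a>(
    replicas: &'a [(&'a str, Vec<ModelLoad>)],
    model_id: &str,
) -> Option<&'a str> {
    replicas
        .iter()
        .filter_map(|(name, models)| {
            models
                .iter()
                .find(|m| m.id == model_id)
                .map(|m| (m.in_flight + m.queue_depth, *name))
        })
        .min_by_key(|(load, _)| *load)
        .map(|(_, name)| name)
}
```

A real router would also need a fallback for older neurons whose `/health` omits `models`, since an empty list there means "no load info" rather than "model not served".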


@@ -0,0 +1,145 @@
//! Identity and entitlement primitives for multi-tenant governance (#47).
//!
//! Identity is the shared substrate the whole epic hangs off:
//! `identity (principal) → accounting (spend) → policy → enforcement`. This
//! module defines the seam — the [`EntitlementProvider`] trait and its data
//! types — so the local/static provider (operator-config caps, in
//! cortex-gateway) can land the auth + per-key-cap + amplification fix
//! *before* any upstream clearing house exists. The future helexa-upstream
//! client (#57) is just another impl of this trait.
//!
//! The provider owns three jobs:
//! 1. **resolve** a bearer key to a [`Principal`] (drives auth, #49);
//! 2. **reserve → settle/release** token budget around a request so spend
//! can never overshoot a hard cap under concurrency (drives budget
//! enforcement, #52);
//! 3. expose a [`BudgetSnapshot`] for metering/metrics (#51).
//!
//! [`BudgetError`] carries the cap-window semantics so the caller can pick
//! the correct #63 rejection (`rate_limit_exceeded` + `Retry-After` for a
//! resetting window vs `insufficient_quota` for a hard balance) without the
//! provider knowing anything about HTTP.
use async_trait::async_trait;
use serde::{Deserialize, Serialize};
/// Internal header carrying the resolved account id from cortex to neuron.
/// neuron trusts this header and [`HEADER_KEY_ID`] over the WireGuard link
/// (#54); cortex **strips** any client-supplied copy before stamping the
/// authoritative value, so a client can never assert a principal directly.
pub const HEADER_ACCOUNT_ID: &str = "x-helexa-account-id";
/// Internal header carrying the resolved key id from cortex to neuron.
pub const HEADER_KEY_ID: &str = "x-helexa-key-id";
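
The strip-then-stamp rule described for these headers can be sketched as follows. This is only an illustration over a plain `Vec` of header pairs, not the gateway's actual middleware; `stamp_principal` is a hypothetical name, and the constants are repeated so the block stands alone.

```rust
const HEADER_ACCOUNT_ID: &str = "x-helexa-account-id";
const HEADER_KEY_ID: &str = "x-helexa-key-id";

/// Hypothetical edge step: drop any client-supplied identity headers, then
/// append the values resolved from the bearer key.
fn stamp_principal(headers: &mut Vec<(String, String)>, account_id: &str, key_id: &str) {
    // Header names are case-insensitive, so compare accordingly.
    headers.retain(|(name, _)| {
        !name.eq_ignore_ascii_case(HEADER_ACCOUNT_ID) && !name.eq_ignore_ascii_case(HEADER_KEY_ID)
    });
    headers.push((HEADER_ACCOUNT_ID.to_string(), account_id.to_string()));
    headers.push((HEADER_KEY_ID.to_string(), key_id.to_string()));
}
```

Stripping before stamping means the outgoing request carries exactly one copy of each header, and that copy always comes from the resolved principal.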
/// Who a request is for. Resolved once at the edge from the bearer key and
/// carried through the request context. `account_id` is the billable owner
/// (spendable at any operator, by decision); `key_id` identifies the
/// specific API key for per-key hard caps and ledger/metrics labels.
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub struct Principal {
pub account_id: String,
pub key_id: String,
}
/// Cap-window semantics for a key's hard cap. Determines which #63 code an
/// over-cap reservation maps to.
#[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum CapWindow {
/// Hard balance — the cap never resets. Exhaustion is permanent
/// (`429 insufficient_quota`, no `Retry-After`).
#[default]
Balance,
/// Rolling window of `seconds` that resets. Exhaustion is transient
/// (`429 rate_limit_exceeded` + `Retry-After` until reset).
Rolling { seconds: u64 },
}
/// An outstanding budget reservation. The caller holds this opaque handle
/// between [`EntitlementProvider::reserve`] and exactly one of
/// [`EntitlementProvider::settle`] / [`EntitlementProvider::release`]. Not
/// `Clone` — a reservation is consumed once.
#[derive(Debug)]
pub struct Reservation {
/// Provider-local handle; opaque to the caller.
pub id: u64,
/// The principal this reservation belongs to.
pub principal: Principal,
/// Tokens reserved against the cap.
pub reserved: u64,
}
/// A point-in-time view of a key's budget, for metering and metrics (#51).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BudgetSnapshot {
/// Hard cap in tokens. `None` means uncapped (e.g. an operator infra
/// key, #58).
pub hard_cap: Option<u64>,
/// Settled spend in the current window.
pub spent: u64,
/// Sum of outstanding (un-settled) reservations.
pub reserved: u64,
}
/// Authentication failure — the bearer key could not be resolved. Maps to
/// `401 invalid_api_key` (#49/#63).
#[derive(Debug, thiserror::Error)]
pub enum AuthError {
#[error("invalid or unknown API key")]
InvalidKey,
}
/// Why a reservation was refused. Carries enough for the caller to build the
/// correct #63 envelope without the provider touching HTTP.
#[derive(Debug, thiserror::Error)]
pub enum BudgetError {
/// A resetting window is exhausted → `429 rate_limit_exceeded` +
/// `Retry-After: retry_after_secs`.
#[error(
"rolling-window budget exhausted ({requested} requested, {available} available); \
resets in {retry_after_secs}s"
)]
RateLimited {
requested: u64,
available: u64,
retry_after_secs: u64,
},
/// A hard balance is exhausted → `429 insufficient_quota` (no
/// `Retry-After`; the client surfaces and stops). Never `402`.
#[error("hard balance exhausted ({requested} requested, {available} available)")]
InsufficientQuota { requested: u64, available: u64 },
}
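
The doc comments on `BudgetError` state which rejection each variant maps to. A compact sketch of that mapping on the caller side, kept free of any HTTP framework: the `Rejection` struct and `rejection_for` function are hypothetical, and the enum is mirrored here without `thiserror` so the example compiles on its own.

```rust
/// Mirror of `BudgetError`, without the `thiserror` derive.
#[derive(Debug)]
enum BudgetError {
    RateLimited { requested: u64, available: u64, retry_after_secs: u64 },
    InsufficientQuota { requested: u64, available: u64 },
}

/// Hypothetical caller-side view of the rejection to emit.
#[derive(Debug, PartialEq, Eq)]
struct Rejection {
    status: u16,
    code: &'static str,
    retry_after_secs: Option<u64>,
}

/// Map a budget refusal to its status, code and optional retry hint.
fn rejection_for(err: &BudgetError) -> Rejection {
    match err {
        // Resetting window: transient, so advertise when to retry.
        BudgetError::RateLimited { retry_after_secs, .. } => Rejection {
            status: 429,
            code: "rate_limit_exceeded",
            retry_after_secs: Some(*retry_after_secs),
        },
        // Hard balance: permanent until topped up, so no retry hint.
        BudgetError::InsufficientQuota { .. } => Rejection {
            status: 429,
            code: "insufficient_quota",
            retry_after_secs: None,
        },
    }
}
```

Both variants use status 429; only the code and the presence of a retry hint differ, which is what lets clients distinguish "back off" from "stop".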
/// The seam between cortex's enforcement and whatever decides entitlement —
/// a local/static config provider today (#50), the helexa-upstream client
/// later (#57). All methods are async so the upstream impl can do network
/// I/O; the local impl resolves in-process.
#[async_trait]
pub trait EntitlementProvider: Send + Sync {
/// Resolve a bearer API key to its principal. `Err(InvalidKey)` for an
/// unknown/empty key.
async fn resolve(&self, api_key: &str) -> Result<Principal, AuthError>;
/// Reserve up to `max_tokens` against the principal's cap. Returns a
/// handle on success, or a [`BudgetError`] (which the caller maps to a
/// #63 `429`) if the reservation would exceed the cap. Reserving the
/// *maximum* a request could consume before dispatch is what prevents
/// overshoot under concurrency.
async fn reserve(
&self,
principal: &Principal,
max_tokens: u64,
) -> Result<Reservation, BudgetError>;
/// Settle a reservation with the tokens actually consumed, releasing the
/// unused remainder back to the cap.
async fn settle(&self, reservation: Reservation, actual_tokens: u64);
/// Release a reservation in full — e.g. dispatch failed before any
/// tokens were consumed.
async fn release(&self, reservation: Reservation);
/// Current budget snapshot for a principal, for metering/metrics.
/// `None` if the provider doesn't track this principal.
async fn snapshot(&self, principal: &Principal) -> Option<BudgetSnapshot>;
}
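
The trait above only defines the seam; the local provider that implements it is not in this file. To make the reserve → settle/release accounting concrete, here is a synchronous, single-key sketch using only the standard library. `KeyBudget` and `BudgetState` are hypothetical names, the real trait is async and per-principal, and capping settled spend at the reserved amount is an assumption.

```rust
use std::sync::Mutex;

/// Hypothetical per-key state, loosely mirroring `BudgetSnapshot`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BudgetState {
    hard_cap: Option<u64>, // None = uncapped
    spent: u64,            // settled spend
    reserved: u64,         // outstanding reservations
}

struct KeyBudget {
    state: Mutex<BudgetState>,
}

impl KeyBudget {
    fn new(hard_cap: Option<u64>) -> Self {
        Self { state: Mutex::new(BudgetState { hard_cap, spent: 0, reserved: 0 }) }
    }

    /// Reserve the worst case before dispatch. `Err(available)` when the
    /// request could overshoot the cap.
    fn reserve(&self, max_tokens: u64) -> Result<u64, u64> {
        let mut s = self.state.lock().unwrap();
        if let Some(cap) = s.hard_cap {
            let available = cap.saturating_sub(s.spent.saturating_add(s.reserved));
            if max_tokens > available {
                return Err(available);
            }
        }
        s.reserved = s.reserved.saturating_add(max_tokens);
        Ok(max_tokens)
    }

    /// Settle with actual usage; the unused remainder returns to the cap.
    fn settle(&self, reserved: u64, actual_tokens: u64) {
        let mut s = self.state.lock().unwrap();
        s.reserved = s.reserved.saturating_sub(reserved);
        // Assumption: spend is capped at what was reserved.
        s.spent = s.spent.saturating_add(actual_tokens.min(reserved));
    }

    /// Release in full, e.g. dispatch failed before any tokens were used.
    fn release(&self, reserved: u64) {
        self.settle(reserved, 0);
    }

    fn snapshot(&self) -> BudgetState {
        *self.state.lock().unwrap()
    }
}
```

Because the check and the increment happen under one lock, two concurrent requests cannot both pass the cap check against the same remaining budget.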


@@ -0,0 +1,257 @@
//! The OpenAI-standard error envelope (#60) and the rejection contract
//! that rides on it (#63).
//!
//! Every non-2xx response cortex and neuron emit uses the shape
//!
//! ```json
//! { "error": { "message": "...", "type": "...", "code": "...", "param": null } }
//! ```
//!
//! because OpenAI-compatible clients (opencode, the AI SDK, litellm, the
//! OpenAI SDKs) read `error.type` / `error.code` to decide what to do —
//! most importantly `code == "context_length_exceeded"` triggers
//! auto-compaction, and a `429` with `Retry-After` makes them back off and
//! retry rather than surfacing an opaque failure. A flat `{"error":"..."}`
//! string is invisible to that logic.
//!
//! This module is the single source of truth for that envelope. It is
//! deliberately **axum-agnostic** — cortex-core is a pure types crate — so
//! it carries the response as data (`status`, `body()`, `retry_after_secs`)
//! and each HTTP crate (cortex-gateway, neuron) owns a tiny adapter that
//! turns an [`OpenAiError`] into its framework's response type, setting the
//! `Retry-After` header when present.
//!
//! Retryable conditions **must** carry `Retry-After` (per #63). The named
//! constructors below encode that: [`OpenAiError::rate_limit_exceeded`] and
//! [`OpenAiError::service_unavailable`] take a retry hint;
//! [`OpenAiError::insufficient_quota`] (hard balance, no reset) and
//! [`OpenAiError::context_length_exceeded`] / [`OpenAiError::invalid_api_key`]
//! (permanent) do not. `402 Payment Required` is banned by the contract — use
//! `429 insufficient_quota` for hard budget exhaustion.
use serde_json::{Map, Value, json};
/// A rejection rendered in the OpenAI error envelope.
///
/// Build with [`OpenAiError::new`] (or a named constructor), refine with the
/// `with_*` builders, then hand to the consuming crate's adapter to turn into
/// an HTTP response.
#[derive(Debug, Clone)]
pub struct OpenAiError {
/// HTTP status code (e.g. `401`, `429`, `503`).
pub status: u16,
/// Broad OpenAI category — `"invalid_request_error"`, `"api_error"`,
/// `"rate_limit_error"`, …
pub error_type: String,
/// Specific machine-readable code clients key on (`"invalid_api_key"`,
/// `"rate_limit_exceeded"`, `"context_length_exceeded"`, …). `None`
/// renders as JSON `null`.
pub code: Option<String>,
/// Human-readable, actionable message.
pub message: String,
/// OpenAI's `param` field — the offending request parameter, if any.
pub param: Option<String>,
/// Seconds to advertise in the `Retry-After` header. Set only on
/// retryable conditions; `None` means no header.
pub retry_after_secs: Option<u64>,
/// Diagnostic fields merged *inside* the `error` object (e.g.
/// `prompt_len`, `max`, `free_mb`) so they don't break the envelope
/// shape. Clients ignore unknown keys.
pub extra: Map<String, Value>,
}
impl OpenAiError {
/// Construct an envelope with an explicit code. For a `null` code use
/// [`OpenAiError::without_code`].
pub fn new(
status: u16,
error_type: impl Into<String>,
code: impl Into<String>,
message: impl Into<String>,
) -> Self {
Self {
status,
error_type: error_type.into(),
code: Some(code.into()),
message: message.into(),
param: None,
retry_after_secs: None,
extra: Map::new(),
}
}
/// Construct an envelope whose `code` is `null` (e.g. an unclassified
/// internal error).
pub fn without_code(
status: u16,
error_type: impl Into<String>,
message: impl Into<String>,
) -> Self {
Self {
status,
error_type: error_type.into(),
code: None,
message: message.into(),
param: None,
retry_after_secs: None,
extra: Map::new(),
}
}
/// Advertise a `Retry-After` (seconds). Use on retryable rejections.
pub fn with_retry_after(mut self, secs: u64) -> Self {
self.retry_after_secs = Some(secs);
self
}
/// Set the OpenAI `param` field.
pub fn with_param(mut self, param: impl Into<String>) -> Self {
self.param = Some(param.into());
self
}
/// Merge one diagnostic field into the error object.
pub fn with_extra(mut self, key: impl Into<String>, value: Value) -> Self {
self.extra.insert(key.into(), value);
self
}
/// Merge a bag of diagnostic fields into the error object.
pub fn with_extras(mut self, extras: Map<String, Value>) -> Self {
for (k, v) in extras {
self.extra.insert(k, v);
}
self
}
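/// Render the `{ "error": { … } }` body. Field order is irrelevant to
/// clients (they parse JSON); the standard keys are inserted first, then
/// any diagnostic extras. Serialized key order follows insertion order
/// only when serde_json's `preserve_order` feature is enabled, and an
/// extra that reuses a standard key name overwrites that key.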
pub fn body(&self) -> Value {
let mut error = Map::new();
error.insert("message".into(), Value::String(self.message.clone()));
error.insert("type".into(), Value::String(self.error_type.clone()));
error.insert(
"code".into(),
self.code.clone().map(Value::String).unwrap_or(Value::Null),
);
error.insert(
"param".into(),
self.param.clone().map(Value::String).unwrap_or(Value::Null),
);
for (k, v) in &self.extra {
error.insert(k.clone(), v.clone());
}
json!({ "error": Value::Object(error) })
}
// ── Named constructors for the #63 standard codes ──────────────────
/// `401 invalid_api_key` — missing/invalid bearer token (#49). Permanent.
pub fn invalid_api_key(message: impl Into<String>) -> Self {
Self::new(401, "invalid_request_error", "invalid_api_key", message)
}
/// `429 rate_limit_exceeded` + `Retry-After` — transient overload,
/// fair-share/in-flight cap, admission rejection, or a rolling budget
/// window that resets (#52/#53/#54/#55). Clients back off and retry.
pub fn rate_limit_exceeded(message: impl Into<String>, retry_after_secs: u64) -> Self {
Self::new(429, "rate_limit_error", "rate_limit_exceeded", message)
.with_retry_after(retry_after_secs)
}
/// `429 insufficient_quota` — hard balance exhausted, no reset (#52).
/// No `Retry-After`; the client surfaces and stops. (Never `402`.)
pub fn insufficient_quota(message: impl Into<String>) -> Self {
Self::new(429, "insufficient_quota", "insufficient_quota", message)
}
/// `400 context_length_exceeded` — prompt exceeds the model's context
/// window (#56/#60). Permanent for this request; opencode auto-compacts.
pub fn context_length_exceeded(message: impl Into<String>) -> Self {
Self::new(
400,
"invalid_request_error",
"context_length_exceeded",
message,
)
}
/// `503 service_unavailable` + optional `Retry-After` — transient
/// backend unavailability (no healthy nodes, recovery, fail-closed
/// upstream). Retryable when a hint is given.
pub fn service_unavailable(message: impl Into<String>, retry_after_secs: Option<u64>) -> Self {
let mut err = Self::new(503, "api_error", "service_unavailable", message);
err.retry_after_secs = retry_after_secs;
err
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn body_has_standard_envelope_shape() {
let env = OpenAiError::new(429, "rate_limit_error", "rate_limit_exceeded", "slow down");
let body = env.body();
let error = body.get("error").and_then(Value::as_object).unwrap();
assert_eq!(error["message"], "slow down");
assert_eq!(error["type"], "rate_limit_error");
assert_eq!(error["code"], "rate_limit_exceeded");
assert_eq!(error["param"], Value::Null);
}
#[test]
fn without_code_renders_null_code() {
let env = OpenAiError::without_code(500, "api_error", "kaboom");
assert_eq!(env.body()["error"]["code"], Value::Null);
}
#[test]
fn extras_ride_inside_the_error_object() {
let env = OpenAiError::context_length_exceeded("too long")
.with_extra("prompt_len", json!(60_000))
.with_extra("max", json!(49_152));
let error = &env.body()["error"];
assert_eq!(error["prompt_len"], 60_000);
assert_eq!(error["max"], 49_152);
assert_eq!(error["code"], "context_length_exceeded");
}
#[test]
fn rolling_window_rejection_carries_retry_after() {
let env = OpenAiError::rate_limit_exceeded("budget window", 30);
assert_eq!(env.status, 429);
assert_eq!(env.retry_after_secs, Some(30));
}
#[test]
fn hard_balance_rejection_has_no_retry_after() {
let env = OpenAiError::insufficient_quota("out of credit");
assert_eq!(env.status, 429);
assert_eq!(env.code.as_deref(), Some("insufficient_quota"));
assert_eq!(env.retry_after_secs, None);
}
#[test]
fn permanent_rejections_have_no_retry_after() {
assert_eq!(OpenAiError::invalid_api_key("nope").retry_after_secs, None);
assert_eq!(
OpenAiError::context_length_exceeded("too long").retry_after_secs,
None
);
}
#[test]
fn service_unavailable_retry_after_is_optional() {
assert_eq!(
OpenAiError::service_unavailable("recovering", Some(5)).retry_after_secs,
Some(5)
);
assert_eq!(
OpenAiError::service_unavailable("gone", None).retry_after_secs,
None
);
}
}
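
The retry semantics documented on the named constructors above can be read from the client side as a small decision table. The sketch below is a hypothetical client policy, not part of this changeset: `ClientAction` and `classify` are illustrative names, and the mapping assumes a client that treats a missing `Retry-After` hint as "do not retry".

```rust
/// What a client might do after receiving an error envelope.
/// Illustrative only; this type does not exist in the diff.
#[derive(Debug, PartialEq)]
pub enum ClientAction {
    /// Wait the given number of seconds, then retry the same request.
    RetryAfter(u64),
    /// Shrink the prompt (for example by compacting history) and resend.
    Compact,
    /// Surface the error and stop.
    Stop,
}

/// Hypothetical policy keyed on HTTP status, the envelope's `code`, and
/// the `Retry-After` hint in seconds.
pub fn classify(status: u16, code: Option<&str>, retry_after_secs: Option<u64>) -> ClientAction {
    match (status, code, retry_after_secs) {
        // Transient overload or a rolling budget window: back off, retry.
        (429, Some("rate_limit_exceeded"), Some(secs)) => ClientAction::RetryAfter(secs),
        // Backend unavailable, retryable only when a hint was given.
        (503, Some("service_unavailable"), Some(secs)) => ClientAction::RetryAfter(secs),
        // Permanent for this request, but a shorter prompt may succeed.
        (400, Some("context_length_exceeded"), _) => ClientAction::Compact,
        // invalid_api_key, insufficient_quota, unclassified errors, and
        // retryable codes without a hint all stop here.
        _ => ClientAction::Stop,
    }
}
```

Under this policy the two `429` cases diverge: `rate_limit_exceeded` with a hint retries, while `insufficient_quota` stops, which matches the "hard balance exhausted, no reset" wording on that constructor.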

@@ -36,6 +36,44 @@ pub struct ModelSpec {
pub devices: Option<Vec<u32>>,
}
/// Per-model token budget advertised by the catalogue or neuron.
///
/// `context` is the hard wall (the served max-seq-len). `input` is the
/// compaction trigger — when set, opencode treats it as "usable context =
/// input - reserved". When omitted, clients fall back to `context - output`.
/// `output` is the maximum number of generation tokens.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelLimit {
/// Hard wall — served max-seq-len in tokens.
pub context: usize,
/// Compaction trigger / usable input budget. When absent clients fall
/// back to `context - output`.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub input: Option<usize>,
/// Maximum number of generation tokens.
pub output: usize,
}
/// Operator-set pricing in USD per 1M tokens.
///
/// Self-hosted deployments typically leave both at `0.0`. Cache fields are
/// optional — set when the backend supports a prefix-cache discount tier.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelCost {
/// USD per 1M input (prompt) tokens.
#[serde(default)]
pub input: f64,
/// USD per 1M output (completion) tokens.
#[serde(default)]
pub output: f64,
/// USD per 1M cache-hit tokens (optional).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cache_read: Option<f64>,
/// USD per 1M cache-write tokens (optional).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cache_write: Option<f64>,
}
/// A model as reported by a harness.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelInfo {
@@ -46,14 +84,31 @@ pub struct ModelInfo {
pub vram_used_mb: Option<u64>,
/// Modalities this loaded model supports. Today: `["text"]` for
/// text-only checkpoints, `["text", "vision"]` for vision-capable
/// ones (Stage B7 of the vision plan). Clients like litellm /
/// agent0 can gate `image_url` submission on the advertised set.
/// ones (Stage B7). Clients like litellm / agent0 can gate
/// `image_url` submission on the advertised set.
///
/// Optional in the wire format so older clients that don't read
/// it stay compatible. Default-empty for absent/older data, which
/// callers can interpret as "text".
#[serde(default, skip_serializing_if = "Vec::is_empty")]
pub capabilities: Vec<String>,
// ── Enrichment (issue #62) ────────────────────────────────
/// Token budget advertised by the catalogue or discovered at load time.
/// `None` when neither the catalogue nor the loaded model can provide it.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
/// Operator-set pricing in USD per 1M tokens (0.0 = free/self-hosted).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cost: Option<ModelCost>,
/// `true` when the model's tokenizer contains recognised tool-call
/// marker tokens (`<tool_call>` / `</tool_call>` convention).
#[serde(default)]
pub tool_call: bool,
/// `true` when the model's tokenizer contains recognised reasoning
/// marker tokens (`<think>` / `</think>` or similar).
#[serde(default)]
pub reasoning: bool,
}
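
The fallback described on `ModelLimit` can be sketched as a single function. This assumes the doc comment means "context minus output" when `input` is absent. The struct is a local stand-in without the serde derives, `usable_input_budget` is a hypothetical helper, and any client-side reserve subtracted from `input` is left out.

```rust
/// Local stand-in for the `ModelLimit` struct in the diff (no serde).
pub struct ModelLimit {
    /// Hard wall: served max-seq-len in tokens.
    pub context: usize,
    /// Compaction trigger / usable input budget, when advertised.
    pub input: Option<usize>,
    /// Maximum number of generation tokens.
    pub output: usize,
}

/// Hypothetical helper: the input budget a client might work with.
/// Uses `input` when present, otherwise `context - output`, clamped at
/// zero so a misconfigured `output > context` cannot underflow.
pub fn usable_input_budget(limit: &ModelLimit) -> usize {
    limit
        .input
        .unwrap_or_else(|| limit.context.saturating_sub(limit.output))
}
```

`saturating_sub` is a defensive choice in this sketch; the diff does not say how clients handle an `output` larger than `context`.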
/// What an inference harness must do, from neuron's perspective.

@@ -3,6 +3,8 @@ pub mod build_info;
pub mod catalogue;
pub mod config;
pub mod discovery;
pub mod entitlements;
pub mod error_envelope;
pub mod harness;
pub mod metrics;
pub mod node;

@@ -1,4 +1,5 @@
use crate::discovery::{ActivationStatus, DiscoveryResponse};
use crate::harness::{ModelCost, ModelLimit};
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
@@ -43,6 +44,21 @@ pub struct ModelEntry {
/// older persisted/serialised entries deserialisable.
#[serde(default)]
pub capabilities: Vec<String>,
/// Runtime-detected capability flags from the neuron's `/models`
/// response (`ModelInfo`). `false` when the neuron predates these
/// fields or hasn't reported them yet.
#[serde(default)]
pub tool_call: bool,
#[serde(default)]
pub reasoning: bool,
/// Self-derived token budget the neuron computed for this loaded
/// model (#67), copied from `ModelInfo.limit` at poll time. `None`
/// when the neuron doesn't compute one (arch without a context
/// profile, or derivation disabled). This is the authoritative
/// source the gateway advertises — operator-declared catalogue
/// limits are no longer consulted.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
}
/// Model lifecycle status.
@@ -99,10 +115,25 @@ pub struct CortexModelEntry {
pub locations: Vec<ModelLocation>,
/// Union of the modalities advertised by every neuron that has this
/// model loaded (e.g. `["text", "vision"]`). Empty for catalogue-only
/// entries with no loaded location — the catalogue profile doesn't
/// declare capabilities yet (tracked separately from C3).
/// entries with no loaded location — filled from catalogue profile
/// capabilities when available, then unioned with runtime-detected
/// values from loaded neurons.
#[serde(default)]
pub capabilities: Vec<String>,
// ── Enrichment (issue #62) ────────────────────────────────
/// Per-model token budget from the catalogue profile or discovered
/// at load time. `None` when neither source provides it.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub limit: Option<ModelLimit>,
/// Operator-set pricing in USD per 1M tokens (0.0 = free/self-hosted).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub cost: Option<ModelCost>,
/// `true` when any neuron reports this model supports tool calls.
#[serde(default)]
pub tool_call: bool,
/// `true` when any neuron reports this model supports reasoning tokens.
#[serde(default)]
pub reasoning: bool,
}
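
The "union of modalities" and "`true` when any neuron reports" wording on `CortexModelEntry` suggests a simple fold over per-neuron reports. The sketch below is an assumption about how that aggregation could look, not the gateway's actual code: `NeuronReport` and `aggregate` are illustrative names, and returning capabilities in sorted order is a choice made here for determinism.

```rust
use std::collections::BTreeSet;

/// Illustrative per-neuron view of one loaded model.
pub struct NeuronReport {
    pub capabilities: Vec<String>,
    pub tool_call: bool,
    pub reasoning: bool,
}

/// Hypothetical aggregation: union the capabilities, OR the flags.
/// Returns `(capabilities, tool_call, reasoning)`.
pub fn aggregate(reports: &[NeuronReport]) -> (Vec<String>, bool, bool) {
    // A BTreeSet de-duplicates and yields a stable, sorted order.
    let capabilities: BTreeSet<&str> = reports
        .iter()
        .flat_map(|r| r.capabilities.iter().map(String::as_str))
        .collect();
    let tool_call = reports.iter().any(|r| r.tool_call);
    let reasoning = reports.iter().any(|r| r.reasoning);
    (
        capabilities.into_iter().map(str::to_owned).collect(),
        tool_call,
        reasoning,
    )
}
```

With no loaded locations the result is an empty capability list and both flags `false`, which lines up with the `#[serde(default)]` values on the struct.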
#[derive(Debug, Clone, Serialize, Deserialize)]

@@ -106,6 +106,31 @@ pub struct Usage {
pub prompt_tokens: u64,
pub completion_tokens: u64,
pub total_tokens: u64,
/// OpenAI-standard breakdown of `completion_tokens`. Optional and
/// additive — clients that don't read it are unaffected. Carries
/// `reasoning_tokens` for reasoning models (a sub-count of
/// `completion_tokens`, never added into `total_tokens`).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub completion_tokens_details: Option<CompletionTokensDetails>,
/// OpenAI-standard breakdown of `prompt_tokens`. Populated once
/// prompt caching lands (#11); `None` until then.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub prompt_tokens_details: Option<PromptTokensDetails>,
}
/// Sub-counts of `Usage::completion_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CompletionTokensDetails {
/// Tokens generated inside the model's reasoning span.
pub reasoning_tokens: u64,
}
/// Sub-counts of `Usage::prompt_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PromptTokensDetails {
/// Prompt tokens served from cache (cache-read rate). Populated
/// once prompt caching lands (#11).
pub cached_tokens: u64,
}
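
The relationship between `reasoning_tokens`, `completion_tokens`, and `total_tokens` can be made concrete with a little arithmetic. This is a sketch under two assumptions: the struct is a flattened stand-in for `Usage` with the reasoning sub-count inlined, and `total_tokens` is taken to be `prompt_tokens + completion_tokens`, as in the `225 + 42 = 267` fixture elsewhere in this diff. Both helper names are hypothetical.

```rust
/// Flattened stand-in for `Usage`; `reasoning_tokens` is inlined from
/// `completion_tokens_details` for brevity.
pub struct Usage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
    pub reasoning_tokens: Option<u64>,
}

/// Completion tokens outside the reasoning span. `reasoning_tokens` is a
/// sub-count, so it is subtracted from `completion_tokens`.
pub fn visible_completion_tokens(u: &Usage) -> u64 {
    u.completion_tokens
        .saturating_sub(u.reasoning_tokens.unwrap_or(0))
}

/// Checks the assumed invariant: the total counts prompt and completion
/// once each, and never adds reasoning tokens a second time.
pub fn totals_are_consistent(u: &Usage) -> bool {
    u.total_tokens == u.prompt_tokens + u.completion_tokens
}
```

A billing or metering consumer that added `reasoning_tokens` on top of `completion_tokens` would double-count, which is the mistake the "never added into `total_tokens`" note guards against.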
// ── Models list response ─────────────────────────────────────────────

@@ -202,6 +202,30 @@ pub struct ResponsesUsage {
pub input_tokens: u64,
pub output_tokens: u64,
pub total_tokens: u64,
/// OpenAI-standard breakdown of `output_tokens`. Optional and
/// additive. Carries `reasoning_tokens` for reasoning models (a
/// sub-count of `output_tokens`, never added into `total_tokens`).
#[serde(default, skip_serializing_if = "Option::is_none")]
pub output_tokens_details: Option<OutputTokensDetails>,
/// OpenAI-standard breakdown of `input_tokens`. Populated once
/// prompt caching lands (#11); `None` until then.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub input_tokens_details: Option<InputTokensDetails>,
}
/// Sub-counts of `ResponsesUsage::output_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OutputTokensDetails {
/// Tokens generated inside the model's reasoning span.
pub reasoning_tokens: u64,
}
/// Sub-counts of `ResponsesUsage::input_tokens`.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct InputTokensDetails {
/// Input tokens served from cache (cache-read rate). Populated
/// once prompt caching lands (#11).
pub cached_tokens: u64,
}
// ── Streaming event names ────────────────────────────────────────────
@@ -336,6 +360,8 @@ mod tests {
input_tokens: 5,
output_tokens: 3,
total_tokens: 8,
output_tokens_details: None,
input_tokens_details: None,
}),
};
let json = serde_json::to_string(&r).unwrap();

@@ -11,48 +11,90 @@ use crate::openai::{
use serde_json::{Value, json};
/// Convert an Anthropic Messages request into an OpenAI ChatCompletion request.
///
/// This is the request half of the round trip Claude Code (and any
/// Anthropic-native client pointed at cortex via `ANTHROPIC_BASE_URL`)
/// exercises. The non-obvious work here is **tool translation**: the
/// Anthropic and OpenAI tool shapes differ, and neuron feeds whatever
/// `tools` array it receives straight into the HF chat template, which
/// iterates the OpenAI shape (`tool.function.name`,
/// `tool.function.parameters`). If we forwarded Anthropic-shaped tools
/// (`{name, description, input_schema}`) verbatim the template would
/// render empty/garbage definitions and the model would improvise an
/// unparseable tool-call format — exactly the
/// `<tool_use_name>…</tool_use_name>` text that leaks through to the
/// client. So we reshape here:
///
/// - tool **definitions**: `{name, description, input_schema}` →
/// `{type:"function", function:{name, description, parameters}}`
/// - `tool_choice`: Anthropic `{type:"auto"|"any"|"tool", name}` →
/// OpenAI `"auto"|"required"|{type:"function",function:{name}}`
/// - assistant `tool_use` content blocks → an OpenAI assistant message
/// carrying `tool_calls` (with `arguments` kept as the parsed object)
/// - user `tool_result` content blocks → standalone `role:"tool"`
/// messages keyed by `tool_call_id`
pub fn anthropic_to_openai(req: MessagesRequest) -> ChatCompletionRequest {
let mut messages = Vec::new();
// Anthropic `system` field becomes a system message.
// Collect ALL system content into a single leading system message.
// The top-level `system` field PLUS any `role:"system"` turns inside
// `messages` (Claude Code injects extra system-role messages beyond
// the top-level one) are merged into one message at index 0.
//
// This is load-bearing: most chat templates — Qwen3.6's among them —
// hard-reject a system message anywhere but the start
// (`raise_exception('System message must be at the beginning.')`),
// and on that render error neuron silently falls back to a
// template that renders NO tools at all, so the model gets zero
// tool-format guidance and improvises an unparseable tool syntax —
// tool calling breaks entirely. Merging keeps every system
// instruction while satisfying the template.
let mut system_parts: Vec<String> = Vec::new();
if let Some(system) = req.system {
let content = match system {
system_parts.push(match system {
SystemPrompt::Text(t) => t,
SystemPrompt::Blocks(blocks) => serde_json::to_string(&blocks).unwrap_or_default(),
};
messages.push(ChatMessage {
role: "system".into(),
content: MessageContent::Text(content),
extra: Value::Null,
SystemPrompt::Blocks(blocks) => system_blocks_to_text(&blocks),
});
}
// Convert message roles and content.
// Translate the conversation. A single Anthropic message can fan out
// into several OpenAI messages (tool results split into their own
// `role:"tool"` turns); `role:"system"` turns are pulled into the
// accumulator above rather than emitted mid-stream.
let mut convo: Vec<ChatMessage> = Vec::new();
for msg in req.messages {
let content = match msg.content {
AnthropicContent::Text(t) => MessageContent::Text(t),
AnthropicContent::Blocks(blocks) => {
// For simple text-only blocks, extract the text.
// For mixed content (images, etc.), pass as parts.
if blocks.len() == 1 && blocks[0].block_type == "text" {
let text = blocks[0]
.data
.get("text")
.and_then(|v| v.as_str())
.unwrap_or("")
.to_string();
MessageContent::Text(text)
} else {
MessageContent::Parts(blocks.into_iter().map(|b| json!(b)).collect())
}
}
};
if msg.role == "system" {
system_parts.push(anthropic_content_to_text(msg.content));
continue;
}
push_translated_message(&mut convo, &msg.role, msg.content);
}
let mut messages = Vec::new();
if !system_parts.is_empty() {
messages.push(ChatMessage {
role: msg.role,
content,
role: "system".into(),
content: MessageContent::Text(system_parts.join("\n\n")),
extra: Value::Null,
});
}
messages.extend(convo);
// Reshape `tools` / `tool_choice` (carried over from the request's
// flattened `extra`) into the OpenAI shape neuron's chat template
// expects. Computed-then-inserted to avoid borrowing `obj` across
// the mutation.
let mut extra = req.extra;
if let Value::Object(obj) = &mut extra {
let tools = obj.get("tools").and_then(anthropic_tools_to_openai);
if let Some(tools) = tools {
obj.insert("tools".into(), tools);
}
let tool_choice = obj
.get("tool_choice")
.and_then(anthropic_tool_choice_to_openai);
if let Some(tc) = tool_choice {
obj.insert("tool_choice".into(), tc);
}
}
ChatCompletionRequest {
model: req.model,
@@ -61,7 +103,278 @@ pub fn anthropic_to_openai(req: MessagesRequest) -> ChatCompletionRequest {
top_p: req.top_p,
max_tokens: Some(req.max_tokens),
stream: req.stream,
extra: req.extra,
extra,
}
}
/// Translate one Anthropic message into one-or-more OpenAI messages,
/// appending them to `out`.
fn push_translated_message(out: &mut Vec<ChatMessage>, role: &str, content: AnthropicContent) {
let blocks = match content {
AnthropicContent::Text(t) => {
out.push(ChatMessage {
role: role.into(),
content: MessageContent::Text(t),
extra: Value::Null,
});
return;
}
AnthropicContent::Blocks(blocks) => blocks,
};
let mut text_segments: Vec<String> = Vec::new();
let mut parts: Vec<Value> = Vec::new();
let mut has_nontext_part = false;
let mut tool_calls: Vec<Value> = Vec::new();
let mut tool_msgs: Vec<ChatMessage> = Vec::new();
for block in blocks {
match block.block_type.as_str() {
"text" => {
let t = block
.data
.get("text")
.and_then(Value::as_str)
.unwrap_or_default()
.to_string();
parts.push(json!({ "type": "text", "text": t }));
text_segments.push(t);
}
"tool_use" => {
let id = block
.data
.get("id")
.and_then(Value::as_str)
.unwrap_or("toolu_unknown");
let name = block
.data
.get("name")
.and_then(Value::as_str)
.unwrap_or_default();
let input = block
.data
.get("input")
.cloned()
.unwrap_or_else(|| json!({}));
tool_calls.push(json!({
"id": id,
"type": "function",
"function": {
"name": name,
// Arguments as the parsed OBJECT, not the OpenAI
// JSON-string form. The Qwen3.6 chat template
// iterates `tool_call.arguments | items` (treats
// it as a dict); a string throws "cannot convert
// value into pairs", making neuron fall back to a
// tool-less prompt — which silently breaks
// tool-call chaining the moment one tool call is
// in the history.
"arguments": input,
}
}));
}
"tool_result" => {
let tool_use_id = block
.data
.get("tool_use_id")
.and_then(Value::as_str)
.unwrap_or("toolu_unknown");
tool_msgs.push(ChatMessage {
role: "tool".into(),
content: MessageContent::Text(tool_result_content_to_string(&block.data)),
extra: json!({ "tool_call_id": tool_use_id }),
});
}
"image" => {
if let Some(part) = anthropic_image_to_openai(&block.data) {
parts.push(part);
has_nontext_part = true;
}
}
_ => {
// Unknown block kind: preserve it as a JSON part rather
// than silently dropping it.
parts.push(serde_json::to_value(&block).unwrap_or(Value::Null));
has_nontext_part = true;
}
}
}
// Tool results become standalone `role:"tool"` turns and must
// precede any residual content from the same Anthropic message.
out.append(&mut tool_msgs);
if !tool_calls.is_empty() {
// An assistant turn that invoked tools. OpenAI carries the calls
// in `tool_calls`; the visible text (if any) stays in `content`.
// Non-text parts collected above are not forwarded on this branch.
out.push(ChatMessage {
role: role.into(),
content: MessageContent::Text(text_segments.join("")),
extra: json!({ "tool_calls": tool_calls }),
});
} else if has_nontext_part {
// Mixed content (images): forward as OpenAI content parts.
out.push(ChatMessage {
role: role.into(),
content: MessageContent::Parts(parts),
extra: Value::Null,
});
} else if !text_segments.is_empty() {
out.push(ChatMessage {
role: role.into(),
content: MessageContent::Text(text_segments.join("")),
extra: Value::Null,
});
}
// else: the message was only tool_result blocks — already emitted
// as `role:"tool"` turns above, nothing residual to add.
}
/// Extract plain text from an Anthropic `tool_result` block's `content`
/// (a string, or an array of `{type:"text", text}` blocks).
fn tool_result_content_to_string(data: &Value) -> String {
match data.get("content") {
Some(Value::String(s)) => s.clone(),
Some(Value::Array(arr)) => arr
.iter()
.map(|b| {
if b.get("type").and_then(Value::as_str) == Some("text") {
b.get("text")
.and_then(Value::as_str)
.unwrap_or_default()
.to_string()
} else {
b.to_string()
}
})
.collect::<Vec<_>>()
.join(""),
Some(other) => other.to_string(),
None => String::new(),
}
}
/// Convert an Anthropic image block's `data` (`{source:{...}}`) into an
/// OpenAI `image_url` content part.
fn anthropic_image_to_openai(data: &Value) -> Option<Value> {
let source = data.get("source")?;
match source
.get("type")
.and_then(Value::as_str)
.unwrap_or("base64")
{
"base64" => {
let media = source
.get("media_type")
.and_then(Value::as_str)
.unwrap_or("image/png");
let b64 = source
.get("data")
.and_then(Value::as_str)
.unwrap_or_default();
Some(json!({
"type": "image_url",
"image_url": { "url": format!("data:{media};base64,{b64}") }
}))
}
"url" => {
let url = source
.get("url")
.and_then(Value::as_str)
.unwrap_or_default();
Some(json!({ "type": "image_url", "image_url": { "url": url } }))
}
_ => None,
}
}
/// Reshape an Anthropic `tools` array into the OpenAI function-tool
/// shape. Returns `None` if the value isn't an array (left untouched).
fn anthropic_tools_to_openai(tools: &Value) -> Option<Value> {
let arr = tools.as_array()?;
let converted = arr
.iter()
.map(|t| {
// Already OpenAI-shaped (a client mixing conventions, or a
// re-translation): pass through unchanged.
if t.get("type").and_then(Value::as_str) == Some("function")
&& t.get("function").is_some()
{
return t.clone();
}
let mut function = serde_json::Map::new();
function.insert("name".into(), t.get("name").cloned().unwrap_or(Value::Null));
if let Some(desc) = t.get("description") {
function.insert("description".into(), desc.clone());
}
function.insert(
"parameters".into(),
t.get("input_schema")
.cloned()
.unwrap_or_else(|| json!({ "type": "object" })),
);
json!({ "type": "function", "function": Value::Object(function) })
})
.collect();
Some(Value::Array(converted))
}
/// Map an Anthropic `tool_choice` to the OpenAI form.
fn anthropic_tool_choice_to_openai(tc: &Value) -> Option<Value> {
match tc.get("type").and_then(Value::as_str)? {
"auto" => Some(json!("auto")),
"any" => Some(json!("required")),
"none" => Some(json!("none")),
"tool" => {
let name = tc.get("name").and_then(Value::as_str).unwrap_or_default();
Some(json!({ "type": "function", "function": { "name": name } }))
}
_ => None,
}
}
/// Flatten Anthropic system content blocks (`[{type:"text", text}]`)
/// into a single string.
fn system_blocks_to_text(blocks: &[Value]) -> String {
let joined = blocks
.iter()
.filter(|b| b.get("type").and_then(Value::as_str) == Some("text"))
.filter_map(|b| b.get("text").and_then(Value::as_str))
.collect::<Vec<_>>()
.join("\n");
if joined.is_empty() {
// Unusual shape — don't lose it.
serde_json::to_string(blocks).unwrap_or_default()
} else {
joined
}
}
/// Flatten an Anthropic message's content into plain text. Used to fold
/// `role:"system"` conversation turns into the leading system message;
/// non-text blocks (rare in a system turn) are JSON-stringified rather
/// than dropped.
fn anthropic_content_to_text(content: AnthropicContent) -> String {
match content {
AnthropicContent::Text(t) => t,
AnthropicContent::Blocks(blocks) => blocks
.iter()
.map(|b| {
if b.block_type == "text" {
b.data
.get("text")
.and_then(Value::as_str)
.unwrap_or_default()
.to_string()
} else {
serde_json::to_value(b)
.ok()
.map(|v| v.to_string())
.unwrap_or_default()
}
})
.collect::<Vec<_>>()
.join("\n"),
}
}
@@ -85,6 +398,8 @@ pub fn openai_to_anthropic(resp: ChatCompletionResponse) -> MessagesResponse {
prompt_tokens: 0,
completion_tokens: 0,
total_tokens: 0,
completion_tokens_details: None,
prompt_tokens_details: None,
});
MessagesResponse {
@@ -455,6 +770,8 @@ mod stream_tests {
prompt_tokens: 225,
completion_tokens: 42,
total_tokens: 267,
completion_tokens_details: None,
prompt_tokens_details: None,
});
t.on_chunk(&usage_chunk);
let fin = t.finish();
@@ -475,3 +792,238 @@ mod stream_tests {
assert!(t2.finish().is_empty(), "second finish must emit nothing");
}
}
#[cfg(test)]
mod request_tests {
use super::*;
use crate::openai::MessageContent;
fn req(value: Value) -> MessagesRequest {
serde_json::from_value(value).expect("valid MessagesRequest")
}
#[test]
fn tool_definitions_reshape_to_openai_function_shape() {
let r = req(json!({
"model": "Qwen/Qwen3.6-27B",
"max_tokens": 1024,
"messages": [{"role": "user", "content": "read the file"}],
"tools": [{
"name": "Read",
"description": "Read a file",
"input_schema": {
"type": "object",
"properties": {"path": {"type": "string"}},
"required": ["path"]
}
}]
}));
let openai = anthropic_to_openai(r);
let tools = openai
.extra
.get("tools")
.and_then(Value::as_array)
.expect("tools array");
assert_eq!(tools.len(), 1);
let t = &tools[0];
assert_eq!(t["type"], "function");
assert_eq!(t["function"]["name"], "Read");
assert_eq!(t["function"]["description"], "Read a file");
// input_schema is renamed to parameters, contents preserved.
assert_eq!(
t["function"]["parameters"]["properties"]["path"]["type"],
"string"
);
assert!(t["function"].get("input_schema").is_none());
}
#[test]
fn tool_choice_maps_each_variant() {
let mk = |tc: Value| {
let r = req(json!({
"model": "m", "max_tokens": 8,
"messages": [{"role": "user", "content": "hi"}],
"tool_choice": tc
}));
anthropic_to_openai(r)
.extra
.get("tool_choice")
.cloned()
.unwrap()
};
assert_eq!(mk(json!({"type": "auto"})), json!("auto"));
assert_eq!(mk(json!({"type": "any"})), json!("required"));
assert_eq!(mk(json!({"type": "none"})), json!("none"));
assert_eq!(
mk(json!({"type": "tool", "name": "Read"})),
json!({"type": "function", "function": {"name": "Read"}})
);
}
#[test]
fn assistant_tool_use_block_becomes_openai_tool_calls() {
let r = req(json!({
"model": "m", "max_tokens": 8,
"messages": [{
"role": "assistant",
"content": [
{"type": "text", "text": "Let me read it."},
{"type": "tool_use", "id": "toolu_1", "name": "Read",
"input": {"path": "/etc/hosts"}}
]
}]
}));
let openai = anthropic_to_openai(r);
// One assistant message carrying both the text and the call.
let m = openai.messages.last().expect("a message");
assert_eq!(m.role, "assistant");
match &m.content {
MessageContent::Text(t) => assert_eq!(t, "Let me read it."),
other => panic!("expected text content, got {other:?}"),
}
let calls = m
.extra
.get("tool_calls")
.and_then(Value::as_array)
.expect("tool_calls");
assert_eq!(calls[0]["id"], "toolu_1");
assert_eq!(calls[0]["type"], "function");
assert_eq!(calls[0]["function"]["name"], "Read");
// arguments is the parsed object (Qwen3.6 template iterates it).
assert_eq!(
calls[0]["function"]["arguments"],
json!({"path": "/etc/hosts"})
);
}
#[test]
fn user_tool_result_block_becomes_role_tool_message() {
let r = req(json!({
"model": "m", "max_tokens": 8,
"messages": [{
"role": "user",
"content": [
{"type": "tool_result", "tool_use_id": "toolu_1",
"content": "127.0.0.1 localhost"}
]
}]
}));
let openai = anthropic_to_openai(r);
assert_eq!(openai.messages.len(), 1);
let m = &openai.messages[0];
assert_eq!(m.role, "tool");
assert_eq!(m.extra["tool_call_id"], "toolu_1");
match &m.content {
MessageContent::Text(t) => assert_eq!(t, "127.0.0.1 localhost"),
other => panic!("expected text content, got {other:?}"),
}
}
#[test]
fn tool_result_with_block_array_content_is_flattened() {
let r = req(json!({
"model": "m", "max_tokens": 8,
"messages": [{
"role": "user",
"content": [
{"type": "tool_result", "tool_use_id": "t",
"content": [{"type": "text", "text": "line1"}, {"type": "text", "text": "line2"}]}
]
}]
}));
let openai = anthropic_to_openai(r);
match &openai.messages[0].content {
MessageContent::Text(t) => assert_eq!(t, "line1line2"),
other => panic!("expected text, got {other:?}"),
}
}
#[test]
fn tool_result_then_text_emits_tool_turn_first() {
// A user turn that carries a tool result *and* a follow-up
// question must yield the tool message before the user text.
let r = req(json!({
"model": "m", "max_tokens": 8,
"messages": [{
"role": "user",
"content": [
{"type": "tool_result", "tool_use_id": "t", "content": "ok"},
{"type": "text", "text": "now what?"}
]
}]
}));
let openai = anthropic_to_openai(r);
assert_eq!(openai.messages.len(), 2);
assert_eq!(openai.messages[0].role, "tool");
assert_eq!(openai.messages[1].role, "user");
match &openai.messages[1].content {
MessageContent::Text(t) => assert_eq!(t, "now what?"),
other => panic!("expected text, got {other:?}"),
}
}
#[test]
fn system_blocks_flatten_to_text_not_json() {
let r = req(json!({
"model": "m", "max_tokens": 8,
"system": [{"type": "text", "text": "You are helpful."}],
"messages": [{"role": "user", "content": "hi"}]
}));
let openai = anthropic_to_openai(r);
let sys = &openai.messages[0];
assert_eq!(sys.role, "system");
match &sys.content {
MessageContent::Text(t) => assert_eq!(t, "You are helpful."),
other => panic!("expected text, got {other:?}"),
}
}
#[test]
fn system_role_messages_merge_into_one_leading_system() {
// Claude Code's shape: a top-level `system` PLUS a `role:"system"`
// turn inside `messages` (which Qwen3.6's template rejects unless
// it's first). Both must merge into a single leading system msg.
let r = req(json!({
"model": "m", "max_tokens": 8,
"system": "TOP LEVEL SYSTEM",
"messages": [
{"role": "user", "content": "hello"},
{"role": "system", "content": "INJECTED SYSTEM"},
{"role": "user", "content": "do it"}
],
"tools": [{"name": "noop", "input_schema": {"type": "object"}}]
}));
let openai = anthropic_to_openai(r);
// Exactly one system message, at index 0, merging both parts.
let systems: Vec<usize> = openai
.messages
.iter()
.enumerate()
.filter(|(_, m)| m.role == "system")
.map(|(i, _)| i)
.collect();
assert_eq!(systems, vec![0], "one system message, at the front");
match &openai.messages[0].content {
MessageContent::Text(t) => {
assert!(t.contains("TOP LEVEL SYSTEM"));
assert!(t.contains("INJECTED SYSTEM"));
}
other => panic!("expected text, got {other:?}"),
}
// The two real user turns survive, in order, after the system.
let roles: Vec<&str> = openai.messages.iter().map(|m| m.role.as_str()).collect();
assert_eq!(roles, vec!["system", "user", "user"]);
}
#[test]
fn already_openai_shaped_tools_pass_through() {
let r = req(json!({
"model": "m", "max_tokens": 8,
"messages": [{"role": "user", "content": "hi"}],
"tools": [{"type": "function", "function": {"name": "x", "parameters": {}}}]
}));
let openai = anthropic_to_openai(r);
let tools = openai.extra.get("tools").and_then(Value::as_array).unwrap();
assert_eq!(tools[0]["function"]["name"], "x");
}
}
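
The system-message merge that `system_role_messages_merge_into_one_leading_system` exercises can be isolated from the serde types. The sketch below is a simplified model, not the function from the diff: messages are plain `(role, text)` pairs, `merge_system` is a hypothetical name, and tool and image handling is left out.

```rust
/// Simplified model of the merge: pull every `system` turn out of the
/// conversation, join it with the top-level system prompt using a blank
/// line, and emit one system message at index 0.
pub fn merge_system(top_level: Option<&str>, turns: &[(&str, &str)]) -> Vec<(String, String)> {
    // Start with the top-level prompt, if any.
    let mut system_parts: Vec<&str> = top_level.into_iter().collect();
    let mut convo: Vec<(String, String)> = Vec::new();

    for &(role, text) in turns {
        if role == "system" {
            // Folded into the leading message rather than emitted in place.
            system_parts.push(text);
        } else {
            convo.push((role.to_string(), text.to_string()));
        }
    }

    let mut out = Vec::new();
    if !system_parts.is_empty() {
        out.push(("system".to_string(), system_parts.join("\n\n")));
    }
    out.extend(convo);
    out
}
```

Calling it with a top-level prompt and a `system` turn between two user turns yields the roles `system, user, user`, the same ordering the test above asserts.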

@@ -6,6 +6,7 @@ license.workspace = true
[dependencies]
cortex-core.workspace = true
async-trait.workspace = true
tokio.workspace = true
axum.workspace = true
tower.workspace = true

@@ -32,6 +32,8 @@ pub async fn stream_translated(
openai_body: axum::body::Bytes,
model_id: &str,
node_name: &str,
inbound_headers: &axum::http::HeaderMap,
usage_sink: Option<crate::metering::UsageSink>,
) -> Response {
let url = format!("{endpoint}/v1/chat/completions");
tracing::info!(
@@ -42,13 +44,14 @@ pub async fn stream_translated(
"proxying streaming request (anthropic SSE translation)"
);
let upstream = match client
.post(&url)
.header("content-type", "application/json")
.body(openai_body)
.send()
.await
{
let request = crate::auth::forward_principal_headers(
client
.post(&url)
.header("content-type", "application/json")
.body(openai_body),
inbound_headers,
);
let upstream = match request.send().await {
Ok(r) => r,
Err(e) => {
tracing::warn!(
@@ -82,11 +85,22 @@ pub async fn stream_translated(
// discipline as neuron's own projectors.
let (tx, rx) = tokio::sync::mpsc::channel::<Result<Bytes, std::convert::Infallible>>(32);
let node = node_name.to_string();
let model = model_id.to_string();
tokio::spawn(async move {
let mut upstream = upstream.bytes_stream();
let mut translator = AnthropicStreamTranslator::new();
let mut buf: Vec<u8> = Vec::new();
let mut done = false;
// Wire-debug accounting for the stream summary emitted at the
// end: did the model emit a structured tool call, what was the
// final finish_reason, and how many upstream frames did we see.
let mut saw_tool_call = false;
let mut last_finish: Option<String> = None;
let mut frames = 0u64;
// Engine-truth usage for metering (#51), scanned from the upstream
// frames (neuron emits a final `usage` object on the stream, #48).
let mut usage_prompt = 0u64;
let mut usage_completion = 0u64;
'outer: while let Some(block) = upstream.next().await {
let block = match block {
@@ -113,10 +127,31 @@ pub async fn stream_translated(
}
continue;
}
tracing::trace!(node = %node, frame = %data, "anthropic stream: upstream frame");
// Capture usage for metering before translation — the
// usage object rides on a late frame (often after the
// last content delta).
if let Some(p) = crate::proxy::last_count_for(data, "prompt_tokens") {
usage_prompt = p;
}
if let Some(c) = crate::proxy::last_count_for(data, "completion_tokens") {
usage_completion = c;
}
let Ok(chunk) = serde_json::from_str::<ChatCompletionChunk>(data) else {
tracing::debug!(node = %node, "anthropic stream: unparsable upstream frame skipped");
continue;
};
frames += 1;
if chunk
.choices
.iter()
.any(|c| c.delta.get("tool_calls").is_some())
{
saw_tool_call = true;
}
if let Some(fr) = chunk.choices.iter().find_map(|c| c.finish_reason.clone()) {
last_finish = Some(fr);
}
if !send_frames(&tx, translator.on_chunk(&chunk)).await {
break 'outer;
}
@@ -129,6 +164,28 @@ pub async fn stream_translated(
if !done {
let _ = send_frames(&tx, translator.finish()).await;
}
// Stream summary: the streaming counterpart to the non-streaming
// handler's "upstream response" line. `upstream_tool_calls =
// false` on a tools-bearing request is the fingerprint of the
// model improvising an unparsed tool-call format.
tracing::debug!(
wire = "anthropic",
model = %model,
node = %node,
frames,
upstream_tool_calls = saw_tool_call,
finish_reason = ?last_finish,
terminated = done,
"anthropic stream complete"
);
// Settle metering with the observed usage (#51). Runs on every exit
// path of the pump — clean end, early break, or upstream error — so
// the reservation is always resolved. `(0, 0)` when no usage frame
// was seen, which releases without recording spend.
if let Some(sink) = usage_sink {
sink(usage_prompt, usage_completion);
}
});
Response::builder()
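Review note: the usage capture in this hunk is a "last value wins" scan. Every upstream frame is checked and a later count overwrites an earlier one, so `(0, 0)` survives only when no usage frame arrived. The body of `crate::proxy::last_count_for` is not part of this diff, so the sketch below uses a hypothetical stand-in with the same name and call shape; its string-scanning approach and the `scan_usage` helper are assumptions for illustration, not the project's implementation.

```rust
/// Hypothetical stand-in for `crate::proxy::last_count_for` (real body not
/// shown in this diff): the last unsigned integer following `"<key>":`.
fn last_count_for(text: &str, key: &str) -> Option<u64> {
    let needle = format!("\"{key}\":");
    // Search from the end: the final usage object is the authoritative one.
    let start = text.rfind(&needle)? + needle.len();
    let digits: String = text[start..]
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    // Non-numeric values (for example `null`) yield `None`.
    digits.parse().ok()
}

/// Fold SSE `data:` payloads into final (prompt, completion) counts.
/// Stays at (0, 0) when no frame carried usage.
fn scan_usage<'a>(frames: impl IntoIterator<Item = &'a str>) -> (u64, u64) {
    let (mut prompt, mut completion) = (0u64, 0u64);
    for data in frames {
        if let Some(p) = last_count_for(data, "prompt_tokens") {
            prompt = p;
        }
        if let Some(c) = last_count_for(data, "completion_tokens") {
            completion = c;
        }
    }
    (prompt, completion)
}
```

Scanning the raw text rather than the typed chunk means a frame that fails to deserialize as `ChatCompletionChunk` can still contribute its usage, which matches the order of operations in the hunk (capture first, parse second). This stand-in does not handle whitespace between the key and the colon.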


@@ -0,0 +1,119 @@
//! API-key authentication + principal resolution (#49).
//!
//! Identity rides standard bearer auth only — `Authorization: Bearer <key>`
//! — which is what keeps every tier OpenAI-compatible by construction (no
//! custom required headers or body fields, per #47). The middleware resolves
//! the key to a [`Principal`] via the [`EntitlementProvider`], carries it in
//! the request extensions for cortex-side metering/enforcement (#51/#52), and
//! stamps it as internal headers on the request so it reaches neuron, which
//! trusts cortex's assertion over WireGuard (#54).
//!
//! Anti-spoofing: any client-supplied principal header is **stripped** before
//! the authoritative value is stamped, so a client can never assert a
//! principal it didn't authenticate as.
//!
//! Rejection contract (#63): missing key under `require_auth`, or any present
//! but unresolvable key, yields `401 invalid_api_key` in the #60 envelope.
use crate::error::envelope_response;
use crate::state::CortexState;
use axum::extract::{Request, State};
use axum::http::header::AUTHORIZATION;
use axum::http::{HeaderMap, HeaderValue};
use axum::middleware::Next;
use axum::response::Response;
use cortex_core::entitlements::{HEADER_ACCOUNT_ID, HEADER_KEY_ID};
use cortex_core::error_envelope::OpenAiError;
use std::sync::Arc;
/// Endpoints that never require auth: liveness/readiness probes. Everything
/// else flows through resolution.
fn is_public(path: &str) -> bool {
path == "/health" || path == "/"
}
/// Extract the bearer token from an `Authorization` header value, if present
/// and well-formed. Scheme match is case-insensitive per RFC 7235.
fn parse_bearer(headers: &HeaderMap) -> Option<String> {
let raw = headers.get(AUTHORIZATION)?.to_str().ok()?;
let (scheme, token) = raw.split_once(' ')?;
if scheme.eq_ignore_ascii_case("bearer") {
let token = token.trim();
(!token.is_empty()).then(|| token.to_string())
} else {
None
}
}
/// Axum middleware: resolve the bearer key, attach the principal, stamp the
/// internal headers. Wired in `build_app` via `from_fn_with_state`.
pub async fn require_principal(
State(fleet): State<Arc<CortexState>>,
mut req: Request,
next: Next,
) -> Response {
if is_public(req.uri().path()) {
return next.run(req).await;
}
// Anti-spoof: drop any client-supplied principal headers up front.
{
let headers = req.headers_mut();
headers.remove(HEADER_ACCOUNT_ID);
headers.remove(HEADER_KEY_ID);
}
match parse_bearer(req.headers()) {
Some(key) => match fleet.entitlements.resolve(&key).await {
Ok(principal) => {
// Stamp the authoritative principal for neuron. Account/key
// ids come from operator config, so they should be valid header
// values; guard anyway and, if either is malformed, skip
// stamping both rather than panic.
if let (Ok(account), Ok(key_id)) = (
HeaderValue::from_str(&principal.account_id),
HeaderValue::from_str(&principal.key_id),
) {
let headers = req.headers_mut();
headers.insert(HEADER_ACCOUNT_ID, account);
headers.insert(HEADER_KEY_ID, key_id);
}
// Carry the typed principal for cortex-side metering (#51)
// and budget enforcement (#52).
req.extensions_mut().insert(principal);
next.run(req).await
}
// A bearer key that is present but does not resolve is always an
// error, even when anonymous access is otherwise allowed.
Err(_) => unauthorized("invalid API key"),
},
None => {
if fleet.require_auth {
unauthorized("missing API key; supply 'Authorization: Bearer <key>'")
} else {
next.run(req).await
}
}
}
}
/// `401 invalid_api_key` in the standard envelope (#63).
fn unauthorized(message: &str) -> Response {
envelope_response(OpenAiError::invalid_api_key(message))
}
/// Copy the cortex-stamped principal headers from an inbound [`HeaderMap`]
/// onto an outbound reqwest builder. Used by the Anthropic proxy paths,
/// which construct their own upstream requests instead of going through
/// [`crate::proxy::forward_request`] (which forwards all headers verbatim).
pub fn forward_principal_headers(
mut builder: reqwest::RequestBuilder,
headers: &HeaderMap,
) -> reqwest::RequestBuilder {
for name in [HEADER_ACCOUNT_ID, HEADER_KEY_ID] {
if let Some(value) = headers.get(name) {
builder = builder.header(name, value);
}
}
builder
}
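Review note: the middleware's decision table (strip, parse, resolve, stamp) can be exercised without axum. The sketch below is a simplified model under stated assumptions: headers are a `HashMap` with lowercase keys, the two header names are placeholders (the real constants live in `cortex_core::entitlements` and are not shown here), and `resolve` is a closure standing in for `EntitlementProvider::resolve`.

```rust
use std::collections::HashMap;

// Placeholder names; the real values are defined in cortex-core.
const HEADER_ACCOUNT_ID: &str = "x-example-account-id";
const HEADER_KEY_ID: &str = "x-example-key-id";

/// Extract the token from an `Authorization` value such as `Bearer sk-123`.
fn parse_bearer(raw: &str) -> Option<&str> {
    let (scheme, token) = raw.split_once(' ')?;
    let token = token.trim();
    (scheme.eq_ignore_ascii_case("bearer") && !token.is_empty()).then_some(token)
}

/// Strip client-supplied principal headers, then stamp the resolved ones.
/// Returns `false` when the request should be rejected with 401.
fn authenticate(
    headers: &mut HashMap<String, String>,
    require_auth: bool,
    resolve: impl Fn(&str) -> Option<(String, String)>,
) -> bool {
    // Anti-spoof: remove before anything else can read these headers.
    headers.remove(HEADER_ACCOUNT_ID);
    headers.remove(HEADER_KEY_ID);
    let token = headers
        .get("authorization")
        .and_then(|v| parse_bearer(v))
        .map(str::to_owned);
    match token {
        Some(key) => match resolve(&key) {
            Some((account, key_id)) => {
                headers.insert(HEADER_ACCOUNT_ID.to_string(), account);
                headers.insert(HEADER_KEY_ID.to_string(), key_id);
                true
            }
            // An unresolvable key is rejected even when anonymous is allowed.
            None => false,
        },
        // No usable bearer token: allowed only when auth is optional.
        None => !require_auth,
    }
}
```

One consequence visible in this model: an `Authorization` header with a non-bearer scheme parses to `None`, so it follows the anonymous path rather than the invalid-key path.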


@@ -0,0 +1,317 @@
//! The local/static [`EntitlementProvider`] (#50).
//!
//! Accounts, keys, and hard caps come from operator config
//! ([`cortex_core::config::EntitlementsConfig`]); reservations and settled
//! spend are tracked in-process. This lands auth + per-key caps + the
//! amplification fix before any upstream clearing house exists; the future
//! helexa-upstream client (#57) implements the same trait.
//!
//! Budget math is serialized under a single [`std::sync::Mutex`] so
//! reserve/settle/release are atomic — a key's `spent + reserved` can never
//! exceed its hard cap even under concurrent requests (the #52 guarantee).
//! The lock is held only for the in-memory arithmetic, never across an
//! await.
use cortex_core::config::{ApiKeyConfig, EntitlementsConfig};
use cortex_core::entitlements::{
AuthError, BudgetError, BudgetSnapshot, CapWindow, EntitlementProvider, Principal, Reservation,
};
use std::collections::HashMap;
use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;
/// Per-key budget configuration (resolved from [`ApiKeyConfig`]).
struct Budget {
hard_cap: Option<u64>,
window: CapWindow,
}
/// Live, mutable accounting for one key over its current window.
#[derive(Default)]
struct Ledger {
/// Settled spend in the current window.
spent: u64,
/// Sum of outstanding (un-settled) reservations.
reserved: u64,
/// Start of the current rolling window; `None` until the first reserve.
/// Unused for [`CapWindow::Balance`].
window_start: Option<Instant>,
}
pub struct LocalEntitlementProvider {
/// Bearer token → principal.
keys: HashMap<String, Principal>,
/// `key_id` → budget config.
budgets: HashMap<String, Budget>,
/// `key_id` → live ledger.
ledgers: Mutex<HashMap<String, Ledger>>,
/// Monotonic source of opaque reservation handles.
next_id: AtomicU64,
}
impl LocalEntitlementProvider {
/// Build from the `[entitlements]` config. A key without an explicit
/// `key_id` is tracked at `account_id` granularity (its secret is never
/// used as a label).
pub fn from_config(config: &EntitlementsConfig) -> Self {
let mut keys = HashMap::new();
let mut budgets = HashMap::new();
for ApiKeyConfig {
key,
account_id,
key_id,
hard_cap,
window,
} in &config.keys
{
let key_id = key_id.clone().unwrap_or_else(|| account_id.clone());
keys.insert(
key.clone(),
Principal {
account_id: account_id.clone(),
key_id: key_id.clone(),
},
);
budgets.insert(
key_id,
Budget {
hard_cap: *hard_cap,
window: window.clone(),
},
);
}
Self {
keys,
budgets,
ledgers: Mutex::new(HashMap::new()),
next_id: AtomicU64::new(1),
}
}
}
/// Tokens still available under `cap` given current `spent`/`reserved`.
/// `None` cap = unlimited.
fn available(cap: Option<u64>, spent: u64, reserved: u64) -> Option<u64> {
cap.map(|c| c.saturating_sub(spent).saturating_sub(reserved))
}
#[async_trait::async_trait]
impl EntitlementProvider for LocalEntitlementProvider {
async fn resolve(&self, api_key: &str) -> Result<Principal, AuthError> {
self.keys.get(api_key).cloned().ok_or(AuthError::InvalidKey)
}
async fn reserve(
&self,
principal: &Principal,
max_tokens: u64,
) -> Result<Reservation, BudgetError> {
// A principal with no configured budget (or an uncapped one) always
// reserves; we still track spend for metrics.
let budget = self.budgets.get(&principal.key_id);
let (cap, window) = match budget {
Some(b) => (b.hard_cap, b.window.clone()),
None => (None, CapWindow::Balance),
};
let mut ledgers = self.ledgers.lock().expect("ledger mutex poisoned");
let ledger = ledgers.entry(principal.key_id.clone()).or_default();
// Lazily reset a rolling window that has elapsed before checking.
let mut retry_after_secs = 0;
if let CapWindow::Rolling { seconds } = window {
let now = Instant::now();
match ledger.window_start {
Some(start) if now.duration_since(start).as_secs() < seconds => {
retry_after_secs = seconds - now.duration_since(start).as_secs();
}
_ => {
// First reserve, or the window has fully elapsed: reset.
ledger.spent = 0;
ledger.window_start = Some(now);
retry_after_secs = seconds;
}
}
}
if let Some(avail) = available(cap, ledger.spent, ledger.reserved)
&& max_tokens > avail
{
return Err(match window {
CapWindow::Rolling { .. } => BudgetError::RateLimited {
requested: max_tokens,
available: avail,
// At least 1s so clients don't hot-loop on a sub-second
// remainder.
retry_after_secs: retry_after_secs.max(1),
},
CapWindow::Balance => BudgetError::InsufficientQuota {
requested: max_tokens,
available: avail,
},
});
}
ledger.reserved += max_tokens;
Ok(Reservation {
id: self.next_id.fetch_add(1, Ordering::Relaxed),
principal: principal.clone(),
reserved: max_tokens,
})
}
async fn settle(&self, reservation: Reservation, actual_tokens: u64) {
let mut ledgers = self.ledgers.lock().expect("ledger mutex poisoned");
if let Some(ledger) = ledgers.get_mut(&reservation.principal.key_id) {
ledger.reserved = ledger.reserved.saturating_sub(reservation.reserved);
ledger.spent += actual_tokens;
}
}
async fn release(&self, reservation: Reservation) {
let mut ledgers = self.ledgers.lock().expect("ledger mutex poisoned");
if let Some(ledger) = ledgers.get_mut(&reservation.principal.key_id) {
ledger.reserved = ledger.reserved.saturating_sub(reservation.reserved);
}
}
async fn snapshot(&self, principal: &Principal) -> Option<BudgetSnapshot> {
let ledgers = self.ledgers.lock().expect("ledger mutex poisoned");
let (spent, reserved) = ledgers
.get(&principal.key_id)
.map(|l| (l.spent, l.reserved))
.unwrap_or((0, 0));
let hard_cap = self.budgets.get(&principal.key_id).and_then(|b| b.hard_cap);
Some(BudgetSnapshot {
hard_cap,
spent,
reserved,
})
}
}
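Review note: the `spent + reserved <= hard_cap` guarantee comes from doing the headroom check and the increment under one lock acquisition. A minimal synchronous sketch of that invariant follows; it assumes a single cap shared by every key and no rolling windows, and `Budgets` and its method shapes are illustrative, not the trait in this diff.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Default, Debug, Clone, Copy, PartialEq)]
struct Ledger {
    spent: u64,
    reserved: u64,
}

/// Simplified stand-in for the provider (illustrative only).
struct Budgets {
    cap: u64,
    ledgers: Mutex<HashMap<String, Ledger>>,
}

impl Budgets {
    fn new(cap: u64) -> Self {
        Self { cap, ledgers: Mutex::new(HashMap::new()) }
    }

    /// Check and increment under one lock, so two concurrent callers can
    /// never both pass the check against the same headroom.
    /// `Err` carries the tokens that were still available.
    fn reserve(&self, key_id: &str, max_tokens: u64) -> Result<u64, u64> {
        let mut ledgers = self.ledgers.lock().unwrap();
        let l = ledgers.entry(key_id.to_string()).or_default();
        let available = self.cap.saturating_sub(l.spent).saturating_sub(l.reserved);
        if max_tokens > available {
            return Err(available);
        }
        l.reserved += max_tokens;
        Ok(max_tokens)
    }

    /// Release the reservation and record what was actually used.
    fn settle(&self, key_id: &str, reserved: u64, actual: u64) {
        let mut ledgers = self.ledgers.lock().unwrap();
        if let Some(l) = ledgers.get_mut(key_id) {
            l.reserved = l.reserved.saturating_sub(reserved);
            l.spent += actual;
        }
    }

    fn snapshot(&self, key_id: &str) -> Ledger {
        self.ledgers.lock().unwrap().get(key_id).copied().unwrap_or_default()
    }
}
```

With a cap of 1,000 and twenty threads each reserving 100, exactly ten reservations succeed regardless of scheduling. As in the diff, the lock covers only in-memory arithmetic, so it is never held across an await point.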
#[cfg(test)]
mod tests {
use super::*;
fn provider() -> LocalEntitlementProvider {
let config = EntitlementsConfig {
require_auth: true,
keys: vec![
ApiKeyConfig {
key: "sk-balance".into(),
account_id: "acct-a".into(),
key_id: Some("key-balance".into()),
hard_cap: Some(1_000),
window: CapWindow::Balance,
},
ApiKeyConfig {
key: "sk-rolling".into(),
account_id: "acct-b".into(),
key_id: Some("key-rolling".into()),
hard_cap: Some(500),
window: CapWindow::Rolling { seconds: 3_600 },
},
ApiKeyConfig {
key: "sk-infra".into(),
account_id: "operator".into(),
key_id: Some("key-infra".into()),
hard_cap: None,
window: CapWindow::Balance,
},
],
};
LocalEntitlementProvider::from_config(&config)
}
#[tokio::test]
async fn resolves_configured_key_to_principal() {
let p = provider();
let principal = p.resolve("sk-balance").await.expect("known key resolves");
assert_eq!(principal.account_id, "acct-a");
assert_eq!(principal.key_id, "key-balance");
}
#[tokio::test]
async fn unknown_key_is_invalid() {
let p = provider();
assert!(matches!(
p.resolve("sk-nope").await,
Err(AuthError::InvalidKey)
));
}
#[tokio::test]
async fn reserve_settle_release_round_trip() {
let p = provider();
let principal = p.resolve("sk-balance").await.unwrap();
let r = p.reserve(&principal, 400).await.expect("within cap");
// Reserved, not yet spent.
let snap = p.snapshot(&principal).await.unwrap();
assert_eq!(snap.hard_cap, Some(1_000));
assert_eq!(snap.reserved, 400);
assert_eq!(snap.spent, 0);
// Used fewer tokens than reserved → remainder released, spend exact.
p.settle(r, 250).await;
let snap = p.snapshot(&principal).await.unwrap();
assert_eq!(snap.reserved, 0);
assert_eq!(snap.spent, 250);
// A reservation that is released contributes no spend.
let r2 = p.reserve(&principal, 100).await.unwrap();
p.release(r2).await;
let snap = p.snapshot(&principal).await.unwrap();
assert_eq!(snap.reserved, 0);
assert_eq!(snap.spent, 250);
}
#[tokio::test]
async fn balance_over_cap_is_insufficient_quota() {
let p = provider();
let principal = p.resolve("sk-balance").await.unwrap();
// Reserve most of the cap, then ask for more than remains.
let _r = p.reserve(&principal, 900).await.unwrap();
let err = p.reserve(&principal, 200).await.expect_err("over cap");
match err {
BudgetError::InsufficientQuota {
requested,
available,
} => {
assert_eq!(requested, 200);
assert_eq!(available, 100);
}
other => panic!("expected InsufficientQuota, got {other:?}"),
}
}
#[tokio::test]
async fn rolling_over_cap_is_rate_limited_with_retry_after() {
let p = provider();
let principal = p.resolve("sk-rolling").await.unwrap();
let _r = p.reserve(&principal, 500).await.unwrap();
let err = p.reserve(&principal, 1).await.expect_err("over cap");
match err {
BudgetError::RateLimited {
retry_after_secs, ..
} => {
assert!(retry_after_secs >= 1, "must advertise a retry hint");
assert!(retry_after_secs <= 3_600);
}
other => panic!("expected RateLimited, got {other:?}"),
}
}
#[tokio::test]
async fn uncapped_infra_key_never_refuses() {
let p = provider();
let principal = p.resolve("sk-infra").await.unwrap();
let r = p.reserve(&principal, 10_000_000).await.expect("uncapped");
p.settle(r, 10_000_000).await;
let snap = p.snapshot(&principal).await.unwrap();
assert_eq!(snap.hard_cap, None);
assert_eq!(snap.spent, 10_000_000);
}
}
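Review note: the rolling-window branch of `reserve` is not covered by a time-travel test above, since it reads `Instant::now()` directly. The sketch below restates that branch as a pure function so the arithmetic can be checked in isolation; `WindowCheck` and `check_window` are hypothetical names, and elapsed time is passed in as whole seconds (mirroring the `as_secs()` truncation in the diff).

```rust
/// Outcome of the lazy window check for one `reserve` call.
#[derive(Debug, PartialEq)]
struct WindowCheck {
    /// True when the window is (re)started and settled spend is zeroed.
    reset: bool,
    /// Seconds a refused caller should wait, floored at 1.
    retry_after_secs: u64,
}

/// `elapsed_secs` is `None` before the first reserve, otherwise whole
/// seconds since the current window started.
fn check_window(elapsed_secs: Option<u64>, window_secs: u64) -> WindowCheck {
    match elapsed_secs {
        // Still inside the window: advertise the remaining time.
        Some(e) if e < window_secs => WindowCheck {
            reset: false,
            retry_after_secs: (window_secs - e).max(1),
        },
        // First reserve, or the window has fully elapsed: start a new one.
        _ => WindowCheck {
            reset: true,
            retry_after_secs: window_secs.max(1),
        },
    }
}
```

Because elapsed time is truncated to whole seconds, the remainder inside a live window is always at least 1, so in this model the `max(1)` floor only changes the result for a zero-length window.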


@@ -0,0 +1,24 @@
//! Gateway adapter that turns the shared, axum-agnostic
//! [`cortex_core::error_envelope::OpenAiError`] into an axum [`Response`],
//! setting the `Retry-After` header when the envelope carries one.
//!
//! cortex-core owns the envelope shape and the rejection contract (#60/#63);
//! this is the only place the gateway crosses from that data into axum.
use axum::http::{HeaderValue, StatusCode, header};
use axum::response::{IntoResponse, Json, Response};
use cortex_core::error_envelope::OpenAiError;
/// Render an [`OpenAiError`] as an axum response (status + JSON envelope +
/// optional `Retry-After`).
pub fn envelope_response(err: OpenAiError) -> Response {
let status = StatusCode::from_u16(err.status).unwrap_or(StatusCode::INTERNAL_SERVER_ERROR);
let retry_after = err.retry_after_secs;
let mut response = (status, Json(err.body())).into_response();
if let Some(secs) = retry_after
&& let Ok(value) = HeaderValue::from_str(&secs.to_string())
{
response.headers_mut().insert(header::RETRY_AFTER, value);
}
response
}
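Review note: the adapter makes two small decisions, a status fallback and an optional `Retry-After`. The sketch below models both without axum. `Envelope` is a hypothetical two-field stand-in for `OpenAiError`, and the 100..=999 range is an assumption mirroring the range `http::StatusCode::from_u16` accepts.

```rust
/// Minimal stand-in for `OpenAiError`: only the fields the adapter reads.
struct Envelope {
    status: u16,
    retry_after_secs: Option<u64>,
}

/// Return the status to send plus any extra response headers.
fn render(err: &Envelope) -> (u16, Vec<(&'static str, String)>) {
    // Out-of-range codes fall back to 500 instead of failing the response.
    let status = if (100..=999).contains(&err.status) {
        err.status
    } else {
        500
    };
    let mut headers = Vec::new();
    if let Some(secs) = err.retry_after_secs {
        // Delay-seconds form of Retry-After: a plain decimal integer.
        headers.push(("retry-after", secs.to_string()));
    }
    (status, headers)
}
```

A rate-limited envelope with `retry_after_secs: Some(30)` renders as status 429 with `retry-after: 30`; a 401 carries no extra header.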


@@ -11,6 +11,8 @@ use axum::http::HeaderMap;
use axum::response::{IntoResponse, Json, Response};
use axum::routing::{get, post};
use chrono::Utc;
use cortex_core::error_envelope::OpenAiError;
use cortex_core::harness::ModelLimit;
use cortex_core::node::{CortexModelEntry, ModelLocation};
use serde_json::{Value, json};
use std::sync::Arc;
@@ -33,6 +35,7 @@ async fn chat_completions(
headers: HeaderMap,
body: Bytes,
) -> Response {
log_inbound("openai-chat", "/v1/chat/completions", &body);
let model_id = match extract_model(&body) {
Some(m) => m,
None => {
@@ -40,7 +43,12 @@ async fn chat_completions(
handler = "chat_completions",
"rejected: missing 'model' field in request body"
);
return error_response(400, "missing 'model' field in request body");
return error_response(
400,
"invalid_request_error",
"missing_model_field",
"missing 'model' field in request body",
);
}
};
@@ -53,11 +61,7 @@ async fn chat_completions(
error = %e,
"route resolve failed"
);
// RouteError's Display strings are short and informative
// ("model 'X' not found...", "no healthy nodes available")
// — fine to surface to the caller. The warn above carries
// any extra context for operators.
return error_response(e.http_status(), &e.to_string());
return route_error_response(&e);
}
};
@@ -89,6 +93,7 @@ async fn responses(
headers: HeaderMap,
body: Bytes,
) -> Response {
log_inbound("openai-responses", "/v1/responses", &body);
let model_id = match extract_model(&body) {
Some(m) => m,
None => {
@@ -96,7 +101,12 @@ async fn responses(
handler = "responses",
"rejected: missing 'model' field in request body"
);
return error_response(400, "missing 'model' field in request body");
return error_response(
400,
"invalid_request_error",
"missing_model_field",
"missing 'model' field in request body",
);
}
};
@@ -109,7 +119,7 @@ async fn responses(
error = %e,
"route resolve failed"
);
return error_response(e.http_status(), &e.to_string());
return route_error_response(&e);
}
};
@@ -133,6 +143,7 @@ async fn completions(
headers: HeaderMap,
body: Bytes,
) -> Response {
log_inbound("openai-completions", "/v1/completions", &body);
let model_id = match extract_model(&body) {
Some(m) => m,
None => {
@@ -140,7 +151,12 @@ async fn completions(
handler = "completions",
"rejected: missing 'model' field in request body"
);
return error_response(400, "missing 'model' field in request body");
return error_response(
400,
"invalid_request_error",
"missing_model_field",
"missing 'model' field in request body",
);
}
};
@@ -153,11 +169,7 @@ async fn completions(
error = %e,
"route resolve failed"
);
// RouteError's Display strings are short and informative
// ("model 'X' not found...", "no healthy nodes available")
// — fine to surface to the caller. The warn above carries
// any extra context for operators.
return error_response(e.http_status(), &e.to_string());
return route_error_response(&e);
}
};
@@ -178,7 +190,7 @@ async fn completions(
/// `POST /v1/messages` — accept Anthropic format, translate, proxy, translate back.
async fn anthropic_messages(
State(fleet): State<Arc<CortexState>>,
_headers: HeaderMap,
headers: HeaderMap,
body: Bytes,
) -> Response {
// Parse as Anthropic request.
@@ -190,13 +202,48 @@ async fn anthropic_messages(
error = %e,
"rejected: invalid Anthropic request body"
);
return error_response(400, "invalid Anthropic request body");
return error_response(
400,
"invalid_request_error",
"invalid_anthropic_body",
"invalid Anthropic request body",
);
}
};
let model_id = anth_req.model.clone();
let is_streaming = anth_req.stream.unwrap_or(false);
// Wire-debug: make the exercised path and request shape concrete
// rather than guesswork. `tool_history` flags whether the client is
// continuing a tool conversation (tool_use/tool_result blocks in the
// message history) vs. opening a fresh one. Full bodies ride at
// trace! (cortex/neuron ship at info; operator infra runs at debug).
if tracing::enabled!(tracing::Level::DEBUG) {
let n_tools = anth_req
.extra
.get("tools")
.and_then(Value::as_array)
.map(|a| a.len())
.unwrap_or(0);
let tool_history = anth_req
.messages
.iter()
.any(|m| anthropic_message_has_tool_blocks(&m.content));
tracing::debug!(
wire = "anthropic",
endpoint = "/v1/messages",
model = %model_id,
stream = is_streaming,
messages = anth_req.messages.len(),
tools = n_tools,
tool_history,
system = anth_req.system.is_some(),
"inbound request"
);
}
tracing::trace!(wire = "anthropic", body = %body_preview(&body), "inbound anthropic body");
// Translate to OpenAI format.
let openai_req = cortex_core::translate::anthropic_to_openai(anth_req);
let openai_body = match serde_json::to_vec(&openai_req) {
@@ -208,7 +255,12 @@ async fn anthropic_messages(
error = %e,
"internal: failed to serialise translated OpenAI request"
);
return error_response(500, "internal translation error");
return error_response(
500,
"api_error",
"internal_translation_error",
"internal translation error",
);
}
};
@@ -225,7 +277,7 @@ async fn anthropic_messages(
// ("model 'X' not found...", "no healthy nodes available")
// — fine to surface to the caller. The warn above carries
// any extra context for operators.
return error_response(e.http_status(), &e.to_string());
return route_error_response(&e);
}
};
@@ -235,6 +287,14 @@ async fn anthropic_messages(
// neuron's harness sees a model name that matches what it has
// loaded.
let openai_body = rewrite_model_in_body(openai_body, &route.resolved_model_id);
// The translated body is what neuron actually sees — the reshaped
// OpenAI-form tools live here. Tracing it makes "did the tool
// definitions survive translation?" a log line, not a guess.
tracing::trace!(
wire = "anthropic",
body = %body_preview(&openai_body),
"translated openai body (sent upstream)"
);
let labels = [
("model", route.resolved_model_id.clone()),
@@ -246,6 +306,29 @@ async fn anthropic_messages(
}
let start = Instant::now();
// Per-request metering + budget enforcement (#51/#52), same lifecycle as
// the OpenAI paths. Estimate from the translated OpenAI body (what neuron
// sees). Refuse over-cap before dispatch via the #63 envelope; otherwise
// build the sink consumed by whichever branch runs below.
let usage_sink = match crate::metering::principal_from_headers(&headers) {
Some(principal) => {
let advertised =
advertised_output_limit(&fleet, &route.node_name, &route.resolved_model_id).await;
let max_tokens = crate::metering::reservation_estimate(&openai_body, advertised);
match crate::metering::reserve_or_reject(
Arc::clone(&fleet.entitlements),
&principal,
max_tokens,
)
.await
{
Ok(guard) => Some(crate::metering::usage_sink(principal, guard)),
Err(env) => return crate::error::envelope_response(env),
}
}
None => None,
};
if is_streaming {
// Anthropic SSE translation (#24): upstream speaks OpenAI SSE;
// re-frame it event-by-event into Anthropic's message_start /
@@ -256,6 +339,8 @@ async fn anthropic_messages(
openai_body,
&model_id,
&route.node_name,
&headers,
usage_sink,
)
.await;
metrics::histogram!("cortex_request_duration_seconds", &labels)
@@ -275,13 +360,16 @@ async fn anthropic_messages(
cold_start = route.cold_start,
"proxying request"
);
let upstream_resp = fleet
.http_client
.post(&target_url)
.body(openai_body)
.header("content-type", "application/json")
.send()
.await;
let upstream_resp = crate::auth::forward_principal_headers(
fleet
.http_client
.post(&target_url)
.body(openai_body)
.header("content-type", "application/json"),
&headers,
)
.send()
.await;
let upstream_resp = match upstream_resp {
Ok(r) => r,
@@ -295,7 +383,12 @@ async fn anthropic_messages(
error = %e,
"upstream request failed (network)"
);
return error_response(502, "upstream request failed");
return error_response(
502,
"api_error",
"upstream_connection_error",
"upstream request failed",
);
}
};
@@ -314,7 +407,12 @@ async fn anthropic_messages(
body = %body_snippet,
"upstream returned non-2xx"
);
return error_response(status, &format!("upstream returned {status}"));
return error_response(
status,
"api_error",
"upstream_error",
&format!("upstream returned {status}"),
);
}
let body_bytes = match upstream_resp.bytes().await {
@@ -329,7 +427,12 @@ async fn anthropic_messages(
error = %e,
"failed to read upstream response body"
);
return error_response(502, "failed to read upstream response");
return error_response(
502,
"api_error",
"upstream_connection_error",
"failed to read upstream response",
);
}
};
@@ -351,17 +454,68 @@ async fn anthropic_messages(
body = %body_snippet,
"failed to parse upstream response as OpenAI ChatCompletionResponse"
);
return error_response(502, "malformed upstream response");
return error_response(
502,
"api_error",
"upstream_malformed_response",
"malformed upstream response",
);
}
};
metrics::histogram!("cortex_request_duration_seconds", &labels)
.record(start.elapsed().as_secs_f64());
// Settle metering with the upstream usage (#51). Scanned from the
// raw body — same engine-truth source as the streaming path — so we
// don't depend on the typed usage struct's optionality.
if let Some(sink) = usage_sink {
let tail = String::from_utf8_lossy(&body_bytes);
let prompt = proxy::last_count_for(&tail, "prompt_tokens").unwrap_or(0);
let completion = proxy::last_count_for(&tail, "completion_tokens").unwrap_or(0);
sink(prompt, completion);
}
// Did the model actually produce a structured tool call, or just
// text? This is the single most useful signal for "is tool
// calling working end-to-end" — a `false` here alongside a
// request that carried tools means the model improvised an
// unparsed format (the original failure mode).
let upstream_tool_calls = openai_resp.choices.iter().any(|c| {
c.message
.extra
.get("tool_calls")
.and_then(Value::as_array)
.map(|a| !a.is_empty())
.unwrap_or(false)
});
let finish_reason = openai_resp
.choices
.first()
.and_then(|c| c.finish_reason.clone());
tracing::debug!(
wire = "anthropic",
model = %model_id,
node = %route.node_name,
upstream_tool_calls,
finish_reason = ?finish_reason,
"upstream non-streaming response"
);
let anthropic_resp = cortex_core::translate::openai_to_anthropic(openai_resp);
Json(json!(anthropic_resp)).into_response()
}
}
/// Combine two self-derived limits for the same model loaded on
/// different neurons (#67): keep the tightest (smallest `context`) so a
/// client sized against the advertised limit never overflows the
/// most-constrained deployment that might serve the request. `None`
/// means "that neuron reported no limit"; the present one wins.
fn tightest_limit(a: Option<ModelLimit>, b: Option<ModelLimit>) -> Option<ModelLimit> {
match (a, b) {
(None, x) | (x, None) => x,
(Some(a), Some(b)) => Some(if b.context < a.context { b } else { a }),
}
}
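Review note: `tightest_limit` is a pairwise combiner, so advertising one limit for a model loaded on several neurons is a fold over their reports. The sketch below repeats the function with a local `ModelLimit` so it compiles on its own; the field types are assumptions (the real struct lives in `cortex_core::harness`), and `advertised` is a hypothetical helper.

```rust
/// Local stand-in for `cortex_core::harness::ModelLimit` (field types assumed).
#[derive(Clone, Debug, PartialEq)]
struct ModelLimit {
    context: u64,
    output: u64,
}

/// Same logic as the function in this diff.
fn tightest_limit(a: Option<ModelLimit>, b: Option<ModelLimit>) -> Option<ModelLimit> {
    match (a, b) {
        (None, x) | (x, None) => x,
        (Some(a), Some(b)) => Some(if b.context < a.context { b } else { a }),
    }
}

/// Reduce every neuron's report for one model to the limit to advertise.
fn advertised(reports: impl IntoIterator<Item = Option<ModelLimit>>) -> Option<ModelLimit> {
    reports.into_iter().fold(None, tightest_limit)
}
```

Two properties follow from the comparison: ties keep the earlier report, and the winning neuron's whole limit is adopted, so the advertised `output` is whatever accompanies the smallest `context` rather than a field-wise minimum.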
/// `GET /v1/models` — union of (catalogue × topology feasibility) and
/// (currently loaded somewhere). The result is what the fleet *could*
/// serve, not just what's already loaded — so OpenAI-compatible tools
@@ -409,9 +563,20 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
loaded: false,
feasible_on,
locations: Vec::new(),
// Catalogue profiles don't declare capabilities yet;
// the union is filled in Pass 2 from loaded locations.
capabilities: Vec::new(),
// Start with catalogue-declared capabilities; Pass 2 unions
// runtime-detected ones from loaded neurons.
capabilities: profile.capabilities.clone(),
// `limit` is no longer operator-declared (#67): the neuron
// self-derives it from live VRAM + throughput and reports it
// per loaded model — Pass 2 fills it from the neuron's
// ModelEntry. A catalogue `limit`, if present, is ignored
// (it can't track hot-swapped models or live capacity).
// `cost` stays operator-set and flows from the catalogue.
limit: None,
cost: profile.cost.clone(),
// Runtime-detected — will be OR-ed in Pass 2 from neuron data.
tool_call: false,
reasoning: false,
},
);
}
@@ -444,6 +609,15 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
e.capabilities.push(cap.clone());
}
}
// OR-in runtime-detected capability flags from the neuron.
e.tool_call = e.tool_call || entry.tool_call;
e.reasoning = e.reasoning || entry.reasoning;
// Adopt the neuron's self-derived limit (#67). When a
// model is loaded on several neurons with different
// headroom, advertise the tightest (smallest context)
// so a client never overflows the most-constrained
// deployment that might serve it.
e.limit = tightest_limit(e.limit.take(), entry.limit.clone());
})
.or_insert_with(|| CortexModelEntry {
id: model_id.clone(),
@@ -456,6 +630,10 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
feasible_on: Vec::new(),
locations: vec![location],
capabilities: entry.capabilities.clone(),
limit: entry.limit.clone(),
cost: None,
tool_call: entry.tool_call,
reasoning: entry.reasoning,
});
}
}
@@ -508,6 +686,10 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
// A model that's only mid-prewarm has no loaded
// location to read capabilities from yet.
capabilities: Vec::new(),
limit: None,
cost: None,
tool_call: false,
reasoning: false,
});
}
}
@@ -538,6 +720,10 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
feasible_on: target_entry.feasible_on,
locations: target_entry.locations,
capabilities: target_entry.capabilities,
limit: target_entry.limit.clone(),
cost: target_entry.cost.clone(),
tool_call: target_entry.tool_call,
reasoning: target_entry.reasoning,
},
);
}
@@ -585,9 +771,42 @@ async fn proxy_with_metrics(
metrics::counter!("cortex_cold_starts_total", &labels).increment(1);
}
// Per-request metering + budget enforcement (#51/#52): reconstruct the
// principal from the middleware-stamped headers, reserve the request's
// upper-bound cost (prompt estimate + max output), and build the
// completion sink that settles actual spend when the response finishes.
// A reservation over the hard cap is refused *before* dispatch with the
// #63 envelope. Anonymous requests skip all of this. Must happen before
// `headers`/`body` are moved into the proxy.
let usage_sink = match crate::metering::principal_from_headers(&headers) {
Some(principal) => {
let advertised = advertised_output_limit(fleet, &route.node_name, model_id).await;
let max_tokens = crate::metering::reservation_estimate(&body, advertised);
match crate::metering::reserve_or_reject(
Arc::clone(&fleet.entitlements),
&principal,
max_tokens,
)
.await
{
Ok(guard) => Some(crate::metering::usage_sink(principal, guard)),
Err(env) => return crate::error::envelope_response(env),
}
}
None => None,
};
let start = Instant::now();
let result =
proxy::forward_request(&fleet.http_client, route, path, headers, body, model_id).await;
let result = proxy::forward_request(
&fleet.http_client,
route,
path,
headers,
body,
model_id,
usage_sink,
)
.await;
let duration = start.elapsed();
match result {
@@ -606,6 +825,25 @@ async fn proxy_with_metrics(
}
}
/// The model's advertised `limit.output` (#62) on a given node, used as the
/// default output budget for budget reservations (#52) when the request
/// omits `max_(completion_)tokens`. `None` when the node/model/limit is
/// unknown — callers fall back to [`crate::metering::FALLBACK_MAX_OUTPUT`].
async fn advertised_output_limit(
fleet: &CortexState,
node_name: &str,
model_id: &str,
) -> Option<u64> {
let nodes = fleet.nodes.read().await;
nodes
.get(node_name)?
.models
.get(model_id)?
.limit
.as_ref()
.map(|l| l.output as u64)
}
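Review note: the reservation upper bound described here is "prompt estimate + max output", with the output budget taken from the request when present, else the advertised `limit.output`, else a fallback. `crate::metering::reservation_estimate` is not shown in this diff, so the sketch below only models that precedence. The constant's value, both function names, and the absence of clamping are assumptions.

```rust
/// Hypothetical value; the real `FALLBACK_MAX_OUTPUT` is not shown in this diff.
const FALLBACK_MAX_OUTPUT: u64 = 4_096;

/// Output budget: the request's own limit, else the model's advertised
/// `limit.output`, else the fallback.
fn output_budget(requested: Option<u64>, advertised: Option<u64>) -> u64 {
    requested.or(advertised).unwrap_or(FALLBACK_MAX_OUTPUT)
}

/// Upper-bound cost to reserve before dispatch.
fn reservation_upper_bound(
    prompt_estimate: u64,
    requested: Option<u64>,
    advertised: Option<u64>,
) -> u64 {
    // Saturate rather than overflow on absurd client-supplied limits.
    prompt_estimate.saturating_add(output_budget(requested, advertised))
}
```

Reserving the upper bound and settling the observed usage afterwards is what lets an over-cap request be refused before dispatch while still charging only the tokens actually used.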
/// Update `last_accessed` timestamp for a model on a node (drives LRU eviction).
async fn touch_model(fleet: &CortexState, node_name: &str, model_id: &str) {
let mut nodes = fleet.nodes.write().await;
@@ -621,6 +859,57 @@ fn extract_model(body: &[u8]) -> Option<String> {
v.get("model")?.as_str().map(|s| s.to_string())
}
/// Emit a uniform wire-debug summary for an OpenAI-family inbound
/// request (chat/completions, completions, responses). Makes which
/// surface a client exercised — and whether it sent tools / asked for
/// streaming — a concrete log line. The full body rides at trace!.
///
/// Parsing is gated on the debug level being enabled so info-level
/// deployments pay nothing.
fn log_inbound(wire: &str, endpoint: &str, body: &[u8]) {
if tracing::enabled!(tracing::Level::DEBUG) {
let v: Value = match serde_json::from_slice(body) {
Ok(v) => v,
Err(_) => return,
};
let model = v.get("model").and_then(Value::as_str).unwrap_or("?");
let stream = v.get("stream").and_then(Value::as_bool).unwrap_or(false);
let tools = v
.get("tools")
.and_then(Value::as_array)
.map(|a| a.len())
.unwrap_or(0);
tracing::debug!(wire, endpoint, model, stream, tools, "inbound request");
}
tracing::trace!(wire, endpoint, body = %body_preview(body), "inbound body");
}
/// True if an Anthropic message's content carries any `tool_use` or
/// `tool_result` block — i.e. the client is mid tool-conversation.
fn anthropic_message_has_tool_blocks(content: &cortex_core::anthropic::AnthropicContent) -> bool {
use cortex_core::anthropic::AnthropicContent;
match content {
AnthropicContent::Text(_) => false,
AnthropicContent::Blocks(blocks) => blocks
.iter()
.any(|b| matches!(b.block_type.as_str(), "tool_use" | "tool_result")),
}
}
/// Render a UTF-8-safe, length-capped preview of a request/response
/// body for trace logging. Caps by characters (not bytes) so the slice
/// can never split a multi-byte codepoint.
fn body_preview(body: &[u8]) -> String {
const MAX_CHARS: usize = 8192;
let text = String::from_utf8_lossy(body);
if text.chars().count() > MAX_CHARS {
let head: String = text.chars().take(MAX_CHARS).collect();
format!("{head}…<truncated, {} bytes total>", body.len())
} else {
text.into_owned()
}
}
/// Rewrite the `model` field of an OpenAI-style JSON request body to
/// the resolved concrete id. Returns the original bytes if `new_model`
/// matches what's already there or the body fails to parse — the
@@ -653,14 +942,16 @@ fn rewrite_model_in_body(body: Bytes, new_model: &str) -> Bytes {
}
}
fn error_response(status: u16, message: &str) -> Response {
let code = axum::http::StatusCode::from_u16(status)
.unwrap_or(axum::http::StatusCode::INTERNAL_SERVER_ERROR);
let body = json!({
"error": {
"message": message,
"type": "gateway_error",
}
});
(code, Json(body)).into_response()
fn error_response(status: u16, typ: &str, code: &str, message: &str) -> Response {
crate::error::envelope_response(OpenAiError::new(status, typ, code, message))
}
/// Render a [`RouteError`] in the standard envelope, attaching `Retry-After`
/// for its transient variants (#63).
fn route_error_response(e: &router::RouteError) -> Response {
let mut env = OpenAiError::new(e.http_status(), e.broad_type(), e.code(), e.to_string());
if let Some(secs) = e.retry_after_secs() {
env = env.with_retry_after(secs);
}
crate::error::envelope_response(env)
}

@@ -1,6 +1,10 @@
pub mod anthropic_sse;
pub mod auth;
pub mod entitlements_local;
pub mod error;
pub mod evictor;
pub mod handlers;
pub mod metering;
pub mod metrics;
pub mod poller;
pub mod proxy;
@@ -9,15 +13,26 @@ pub mod state;
use anyhow::Result;
use axum::Router;
use axum::middleware::from_fn_with_state;
use cortex_core::config::GatewayConfig;
use std::sync::Arc;
use tower_http::cors::CorsLayer;
use tower_http::trace::TraceLayer;
/// Build the Axum application router with all routes wired up.
///
/// Layer order (outermost first): trace → CORS → auth → handlers. CORS is
/// outer to auth so preflight `OPTIONS` short-circuits before resolution;
/// auth (`require_principal`) resolves the bearer key, attaches the
/// principal, and stamps the internal principal headers before any handler
/// runs.
pub fn build_app(fleet: Arc<state::CortexState>) -> Router {
Router::new()
.merge(handlers::api_routes())
.layer(from_fn_with_state(
Arc::clone(&fleet),
auth::require_principal,
))
.layer(CorsLayer::permissive())
.layer(TraceLayer::new_for_http())
.with_state(fleet)

@@ -0,0 +1,219 @@
//! Per-request token metering (#51).
//!
//! Captures the real `(prompt, completion)` usage of every request and feeds
//! it to two places: the [`EntitlementProvider`] spend ledger (via
//! reserve→settle) and per-principal Prometheus counters. The principal is
//! reconstructed from the internal headers the auth middleware stamped (#49),
//! so this works uniformly across every proxy path without threading the
//! typed principal through each handler.
//!
//! The reserve→settle lifecycle reserves the request's upper-bound cost
//! (`prompt estimate + max_output`, see [`reservation_estimate`]) before
//! dispatch (#52). A reservation the provider refuses surfaces as a
//! [`BudgetError`], which [`reserve_or_reject`] maps to the #63 envelope;
//! settling then reconciles the reservation to the observed usage.
//!
//! [`ReservationGuard`] makes leaks impossible: settling records actual
//! spend and releases the unused remainder; dropping a guard that was never
//! settled releases the whole reservation. So an early return, error path,
//! or dropped stream can't strand a reservation.
use axum::http::HeaderMap;
use cortex_core::entitlements::{
BudgetError, EntitlementProvider, HEADER_ACCOUNT_ID, HEADER_KEY_ID, Principal,
};
use cortex_core::error_envelope::OpenAiError;
use std::sync::Arc;
/// Fallback output-token budget when neither the request nor the model's
/// advertised limit gives one. Bounds the reservation so a capped key is
/// still gated even on under-specified requests (#52).
pub const FALLBACK_MAX_OUTPUT: u64 = 4096;
/// Invoked at most once, at request completion, with best-effort
/// `(prompt_tokens, completion_tokens)`. When the request never reaches
/// completion (e.g. a pre-dispatch failure or a dropped stream) the sink is
/// dropped unused, which releases the held reservation via
/// [`ReservationGuard`]'s `Drop`.
pub type UsageSink = Box<dyn FnOnce(u64, u64) + Send>;
/// Reconstruct the principal from the cortex-stamped internal headers. The
/// auth middleware strips any client copy and stamps the authoritative value,
/// so these headers are trustworthy within cortex. `None` for anonymous
/// (unauthenticated) requests.
pub fn principal_from_headers(headers: &HeaderMap) -> Option<Principal> {
let account_id = headers.get(HEADER_ACCOUNT_ID)?.to_str().ok()?.to_string();
let key_id = headers.get(HEADER_KEY_ID)?.to_str().ok()?.to_string();
Some(Principal { account_id, key_id })
}
/// Emit per-principal spend counters (#51). Labelled by account/key only —
/// both are operator-bounded, so cardinality is controlled.
pub fn record_spend(principal: &Principal, prompt: u64, completion: u64) {
let labels = [
("account", principal.account_id.clone()),
("key", principal.key_id.clone()),
];
metrics::counter!("cortex_spend_tokens_total", &labels).increment(prompt + completion);
metrics::counter!("cortex_spend_prompt_tokens_total", &labels).increment(prompt);
metrics::counter!("cortex_spend_completion_tokens_total", &labels).increment(completion);
}
/// Holds a budget reservation for the life of a request. [`settle`] records
/// actual spend and releases the remainder; an un-settled guard releases the
/// whole reservation when dropped. Anonymous requests carry an empty guard,
/// where every operation is a no-op.
///
/// [`settle`]: ReservationGuard::settle
pub struct ReservationGuard {
provider: Arc<dyn EntitlementProvider>,
reservation: Option<cortex_core::entitlements::Reservation>,
}
impl ReservationGuard {
/// An empty guard for an anonymous request — no reservation to resolve.
pub fn anonymous(provider: Arc<dyn EntitlementProvider>) -> Self {
Self {
provider,
reservation: None,
}
}
/// Wrap an already-acquired reservation.
fn held(
provider: Arc<dyn EntitlementProvider>,
reservation: cortex_core::entitlements::Reservation,
) -> Self {
Self {
provider,
reservation: Some(reservation),
}
}
/// Settle with the tokens actually consumed, disarming the drop-release.
/// Spawns the (fast, in-process for the local provider) settle so the
/// caller — which may be a sync stream-completion callback — needn't
/// await.
pub fn settle(mut self, actual_tokens: u64) {
if let Some(reservation) = self.reservation.take() {
let provider = Arc::clone(&self.provider);
tokio::spawn(async move {
provider.settle(reservation, actual_tokens).await;
});
}
}
}
impl Drop for ReservationGuard {
fn drop(&mut self) {
if let Some(reservation) = self.reservation.take() {
let provider = Arc::clone(&self.provider);
tokio::spawn(async move {
provider.release(reservation).await;
});
}
}
}
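The guard's two exits (settle, or drop without settling) can be illustrated in isolation. What follows is a minimal synchronous sketch, not the gateway's implementation: `Ledger`, `Guard`, and `snapshot` are hypothetical stand-ins for the entitlement provider and `ReservationGuard`, and the ledger is updated inline under a `Mutex` rather than through `tokio::spawn`.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical in-memory ledger standing in for the entitlement provider.
#[derive(Default)]
struct Ledger {
    reserved: u64,
    spent: u64,
}

/// Hypothetical guard mirroring the shape of `ReservationGuard`.
struct Guard {
    ledger: Arc<Mutex<Ledger>>,
    held: Option<u64>,
}

impl Guard {
    /// Place a hold of `tokens` on the ledger.
    fn reserve(ledger: Arc<Mutex<Ledger>>, tokens: u64) -> Self {
        ledger.lock().unwrap().reserved += tokens;
        Guard { ledger, held: Some(tokens) }
    }

    /// Record actual spend and release the hold. Taking `held` disarms the
    /// release in `Drop`, which runs when `self` goes out of scope here.
    fn settle(mut self, actual: u64) {
        if let Some(tokens) = self.held.take() {
            let mut ledger = self.ledger.lock().unwrap();
            ledger.reserved -= tokens;
            ledger.spent += actual;
        }
    }
}

impl Drop for Guard {
    /// An unsettled guard gives back its whole reservation.
    fn drop(&mut self) {
        if let Some(tokens) = self.held.take() {
            self.ledger.lock().unwrap().reserved -= tokens;
        }
    }
}

/// Read `(reserved, spent)` for inspection.
fn snapshot(ledger: &Arc<Mutex<Ledger>>) -> (u64, u64) {
    let l = ledger.lock().unwrap();
    (l.reserved, l.spent)
}
```

Under these assumptions, settling 30 tokens of a 100-token hold leaves `(reserved, spent)` at `(0, 30)`, and a guard that goes out of scope unsettled returns its whole hold without recording spend.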
/// Build the completion sink for an authenticated request: record spend and
/// settle the reservation with the observed total. Dropping it unused (no
/// usage observed) releases the reservation via the guard.
pub fn usage_sink(principal: Principal, guard: ReservationGuard) -> UsageSink {
Box::new(move |prompt, completion| {
record_spend(&principal, prompt, completion);
guard.settle(prompt + completion);
})
}
/// Reserve the request's upper-bound token cost for the principal, refusing
/// *before* dispatch if it would exceed the hard cap (#52). On success
/// returns a guard the caller settles with actual usage; on refusal returns
/// the #63 envelope (`rate_limit_exceeded` + `Retry-After` for a resetting
/// window, `insufficient_quota` for a hard balance — never `402`).
pub async fn reserve_or_reject(
provider: Arc<dyn EntitlementProvider>,
principal: &Principal,
max_tokens: u64,
) -> Result<ReservationGuard, OpenAiError> {
match provider.reserve(principal, max_tokens).await {
Ok(reservation) => Ok(ReservationGuard::held(provider, reservation)),
Err(err) => Err(budget_error_to_envelope(err)),
}
}
/// Map a [`BudgetError`] to the #63 envelope. The provider chose the window
/// semantics; this only translates them to HTTP.
fn budget_error_to_envelope(err: BudgetError) -> OpenAiError {
match err {
BudgetError::RateLimited {
retry_after_secs, ..
} => OpenAiError::rate_limit_exceeded(err.to_string(), retry_after_secs),
BudgetError::InsufficientQuota { .. } => OpenAiError::insufficient_quota(err.to_string()),
}
}
/// Upper-bound tokens to reserve for a request (#52): an over-estimate of
/// the prompt plus the maximum output. `advertised_output` is the model's
/// `limit.output` (#62), used when the request omits `max_(completion_)tokens`.
/// Over-reserving is safe — settle corrects spend to the actual usage.
pub fn reservation_estimate(body: &[u8], advertised_output: Option<u64>) -> u64 {
let max_output = requested_max_output(body)
.or(advertised_output)
.unwrap_or(FALLBACK_MAX_OUTPUT);
estimate_prompt_tokens(body).saturating_add(max_output)
}
/// The client's requested output cap, from `max_completion_tokens` (or the
/// legacy `max_tokens`). `None` when unspecified.
fn requested_max_output(body: &[u8]) -> Option<u64> {
let v: serde_json::Value = serde_json::from_slice(body).ok()?;
v.get("max_completion_tokens")
.or_else(|| v.get("max_tokens"))
.and_then(serde_json::Value::as_u64)
}
/// Rough prompt-token estimate at ~4 chars/token over the whole body. cortex
/// has no tokenizer; JSON overhead makes this a conservative over-estimate,
/// and neuron remains the exact context wall (#56/#60). Settle reconciles to
/// the real usage afterward.
fn estimate_prompt_tokens(body: &[u8]) -> u64 {
(body.len() as u64 / 4).max(1)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn requested_max_output_prefers_max_completion_tokens() {
let body = br#"{"model":"m","max_completion_tokens":256,"max_tokens":99}"#;
assert_eq!(requested_max_output(body), Some(256));
}
#[test]
fn requested_max_output_falls_back_to_legacy_max_tokens() {
let body = br#"{"model":"m","max_tokens":128}"#;
assert_eq!(requested_max_output(body), Some(128));
}
#[test]
fn estimate_uses_requested_output_when_present() {
// Requested output dominates; prompt estimate is small for a tiny body.
let body = br#"{"model":"m","max_tokens":1000}"#;
let est = reservation_estimate(body, Some(8192));
assert!(est >= 1000 && est < 1100, "est was {est}");
}
#[test]
fn estimate_uses_advertised_output_when_request_omits_it() {
let body = br#"{"model":"m","messages":[]}"#;
let est = reservation_estimate(body, Some(8192));
assert!(est >= 8192, "est was {est}");
}
#[test]
fn estimate_falls_back_when_nothing_advertised() {
let body = br#"{"model":"m"}"#;
let est = reservation_estimate(body, None);
assert!(est >= FALLBACK_MAX_OUTPUT, "est was {est}");
}
}
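The arithmetic behind the reservation estimate can be checked by hand. The sketch below restates it with only the standard library, assuming the body length and the already-parsed `max_(completion_)tokens` value are passed in directly (the real function parses the JSON body itself); the function signatures here are illustrative, not the module's.

```rust
/// Mirrors `FALLBACK_MAX_OUTPUT` from the diff.
const FALLBACK_MAX_OUTPUT: u64 = 4096;

/// Roughly 4 bytes per token over the whole body, never less than 1.
fn estimate_prompt_tokens(body_len: usize) -> u64 {
    (body_len as u64 / 4).max(1)
}

/// Upper bound to reserve: prompt estimate plus the output cap, where the
/// cap is the request's own value, else the advertised limit, else the
/// fallback. `requested` is assumed to be pre-parsed for this sketch.
fn reservation_estimate(
    body_len: usize,
    requested: Option<u64>,
    advertised: Option<u64>,
) -> u64 {
    let max_output = requested
        .or(advertised)
        .unwrap_or(FALLBACK_MAX_OUTPUT);
    estimate_prompt_tokens(body_len).saturating_add(max_output)
}
```

For the 31-byte body `{"model":"m","max_tokens":1000}` the prompt estimate is `31 / 4 = 7`, so the reservation is `7 + 1000 = 1007`, which is inside the `1000..1100` window the unit test asserts.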

@@ -63,4 +63,16 @@ fn describe_metrics() {
"cortex_cold_starts_total",
"Total number of cold-start model loads"
);
metrics::describe_counter!(
"cortex_spend_tokens_total",
"Total metered tokens (prompt + completion) per principal, labelled by account/key (#51)"
);
metrics::describe_counter!(
"cortex_spend_prompt_tokens_total",
"Metered prompt tokens per principal, labelled by account/key (#51)"
);
metrics::describe_counter!(
"cortex_spend_completion_tokens_total",
"Metered completion tokens per principal, labelled by account/key (#51)"
);
}

@@ -26,14 +26,23 @@ pub async fn poll_once(fleet: &CortexState) {
}
}
/// One-shot fetch of `GET /discovery`. Cached on the NodeState forever
/// after the first success — topology is invariant for a given neuron
/// process. Skipped when the cache is already populated.
/// Fetch `GET /discovery` and cache it on the NodeState — topology is
/// invariant for a given neuron process, so a successful fetch is kept.
/// Re-polled only while `max_prompt_tokens` is still unknown (0): on a
/// rolling deploy cortex can win the race and cache a neuron's discovery
/// before that neuron reports the field (it deserialises to 0). Re-polling
/// until a real cap arrives self-heals that without periodic polling.
async fn maybe_poll_discovery(fleet: &CortexState, name: &str, endpoint: &str) {
{
let nodes = fleet.nodes.read().await;
match nodes.get(name) {
Some(n) if n.discovery.is_some() => return,
Some(n)
if n.discovery
.as_ref()
.is_some_and(|d| d.max_prompt_tokens > 0) =>
{
return;
}
_ => {}
}
}
@@ -108,6 +117,11 @@ async fn poll_neuron(fleet: &CortexState, name: &str, endpoint: &str) {
e.status = status;
e.vram_estimate_mb = upstream.vram_used_mb;
e.capabilities = upstream.capabilities.clone();
e.tool_call = upstream.tool_call;
e.reasoning = upstream.reasoning;
// Neuron's self-derived limit (#67) — the
// authoritative source the gateway advertises.
e.limit = upstream.limit.clone();
})
.or_insert_with(|| ModelEntry {
id: upstream.id.clone(),
@@ -115,6 +129,9 @@ async fn poll_neuron(fleet: &CortexState, name: &str, endpoint: &str) {
last_accessed: None,
vram_estimate_mb: upstream.vram_used_mb,
capabilities: upstream.capabilities.clone(),
tool_call: upstream.tool_call,
reasoning: upstream.reasoning,
limit: upstream.limit.clone(),
});
}

@@ -31,6 +31,7 @@ pub async fn forward_request(
headers: HeaderMap,
body: bytes::Bytes,
model_id: &str,
usage_sink: Option<crate::metering::UsageSink>,
) -> Result<Response, ProxyError> {
let request_start = Instant::now();
let url = format!("{}{}", route.endpoint, path);
@@ -82,7 +83,7 @@ pub async fn forward_request(
let resp_headers = upstream_resp.headers().clone();
let stream = TokenMetricsStream::new(
Box::pin(upstream_resp.bytes_stream()),
TokenMetrics::new(model_id, &route.node_name, request_start),
TokenMetrics::new(model_id, &route.node_name, request_start, usage_sink),
);
let body = Body::from_stream(stream);
@@ -113,20 +114,24 @@ pub enum ProxyError {
impl IntoResponse for ProxyError {
fn into_response(self) -> Response {
let (status, message) = match &self {
ProxyError::Upstream(_) => (StatusCode::BAD_GATEWAY, "upstream request failed"),
let (status, code, message) = match &self {
ProxyError::Upstream(_) => (
StatusCode::BAD_GATEWAY,
"upstream_connection_error",
"upstream request failed",
),
ProxyError::ResponseBuild(_) => (
StatusCode::INTERNAL_SERVER_ERROR,
"internal_server_error",
"failed to build response",
),
};
let body = serde_json::json!({
"error": {
"message": message,
"type": "proxy_error",
}
});
(status, axum::Json(body)).into_response()
crate::error::envelope_response(cortex_core::error_envelope::OpenAiError::new(
status.as_u16(),
"api_error",
code,
message,
))
}
}
@@ -182,10 +187,19 @@ struct TokenMetrics {
last_chunk: Option<Instant>,
tail: String,
finished: bool,
/// Per-principal metering hook (#51). Invoked exactly once in `finish`
/// with the observed `(prompt, completion)` so the reservation can be
/// settled and spend recorded. `None` for anonymous requests.
usage_sink: Option<crate::metering::UsageSink>,
}
impl TokenMetrics {
fn new(model_id: &str, node_name: &str, request_start: Instant) -> Self {
fn new(
model_id: &str,
node_name: &str,
request_start: Instant,
usage_sink: Option<crate::metering::UsageSink>,
) -> Self {
Self {
labels: [
("model", model_id.to_string()),
@@ -196,6 +210,7 @@ impl TokenMetrics {
last_chunk: None,
tail: String::new(),
finished: false,
usage_sink,
}
}
@@ -223,36 +238,45 @@ impl TokenMetrics {
return;
}
self.finished = true;
let Some(first) = self.first_chunk else {
return; // no body ever arrived — nothing to record
};
let ttft = first.duration_since(self.request_start).as_secs_f64();
metrics::histogram!("cortex_time_to_first_token_seconds", &self.labels).record(ttft);
if let Some(prompt) = last_count_for(&self.tail, "prompt_tokens") {
metrics::counter!("cortex_prompt_tokens_total", &self.labels).increment(prompt);
}
let Some(completion) = last_count_for(&self.tail, "completion_tokens") else {
return;
};
if completion == 0 {
return;
}
metrics::counter!("cortex_completion_tokens_total", &self.labels).increment(completion);
let prompt = last_count_for(&self.tail, "prompt_tokens");
let completion = last_count_for(&self.tail, "completion_tokens");
let last = self.last_chunk.unwrap_or(first);
let decode_window = last.duration_since(first).as_secs_f64();
// Streaming: rate over the decode window (first→last chunk).
// Non-streaming bodies arrive as ~one chunk (window ≈ 0), where
// the only honest denominator is the full request duration.
let secs = if decode_window >= 0.1 {
decode_window
} else {
last.duration_since(self.request_start).as_secs_f64()
};
if secs > 0.0 {
metrics::histogram!("cortex_tokens_per_second", &self.labels)
.record(completion as f64 / secs);
// Per-model metrics — only when body chunks actually arrived.
if let Some(first) = self.first_chunk {
let ttft = first.duration_since(self.request_start).as_secs_f64();
metrics::histogram!("cortex_time_to_first_token_seconds", &self.labels).record(ttft);
if let Some(prompt) = prompt {
metrics::counter!("cortex_prompt_tokens_total", &self.labels).increment(prompt);
}
if let Some(completion) = completion.filter(|c| *c > 0) {
metrics::counter!("cortex_completion_tokens_total", &self.labels)
.increment(completion);
let last = self.last_chunk.unwrap_or(first);
let decode_window = last.duration_since(first).as_secs_f64();
// Streaming: rate over the decode window (first→last chunk).
// Non-streaming bodies arrive as ~one chunk (window ≈ 0),
// where the only honest denominator is the full request
// duration.
let secs = if decode_window >= 0.1 {
decode_window
} else {
last.duration_since(self.request_start).as_secs_f64()
};
if secs > 0.0 {
metrics::histogram!("cortex_tokens_per_second", &self.labels)
.record(completion as f64 / secs);
}
}
}
// Per-principal metering + reservation settle (#51). Always runs so
// the reservation is resolved even when no usage/body was observed
// (sink with (0, 0) → settle 0 → release).
if let Some(sink) = self.usage_sink.take() {
sink(prompt.unwrap_or(0), completion.unwrap_or(0));
}
}
}

@@ -63,15 +63,52 @@ pub enum RouteError {
}
impl RouteError {
/// HTTP status the gateway should answer with. `ModelRecovering`
/// is the one transient case (503, retry the same request);
/// everything else keeps the long-standing 404 behaviour.
/// HTTP status the gateway should answer with. `NoHealthyNodes` and
/// `ModelRecovering` are the transient cases (503 service_unavailable,
/// safe to retry the same request); everything else is 404.
pub fn http_status(&self) -> u16 {
match self {
RouteError::ModelRecovering { .. } => 503,
RouteError::NoHealthyNodes | RouteError::ModelRecovering { .. } => 503,
_ => 404,
}
}
/// Broad OpenAI error category for the JSON envelope.
pub fn broad_type(&self) -> &'static str {
match self {
RouteError::ModelNotFound(_) => "invalid_request_error",
RouteError::NoHealthyNodes
| RouteError::EndpointResolveFailed(_, _)
| RouteError::NoFeasibleNeuron { .. }
| RouteError::ColdLoadFailed { .. }
| RouteError::ModelRecovering { .. } => "api_error",
}
}
/// Specific machine-readable error code.
pub fn code(&self) -> &'static str {
match self {
RouteError::ModelNotFound(_) => "model_not_found",
RouteError::NoHealthyNodes => "service_unavailable",
RouteError::EndpointResolveFailed(_, _) => "service_unavailable",
RouteError::NoFeasibleNeuron { .. } => "service_unavailable",
RouteError::ColdLoadFailed { .. } => "service_unavailable",
RouteError::ModelRecovering { .. } => "service_unavailable",
}
}
/// Seconds to advertise in `Retry-After` for the transient variants
/// (#63). `NoHealthyNodes` may clear once the poller re-marks a node
/// healthy; `ModelRecovering` clears once the device context finishes
/// rebuilding — both are safe to retry. Everything else is permanent
/// for this request (404) and carries no hint.
pub fn retry_after_secs(&self) -> Option<u64> {
match self {
RouteError::ModelRecovering { .. } => Some(2),
RouteError::NoHealthyNodes => Some(5),
_ => None,
}
}
}
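How the status and `Retry-After` hint line up per variant can be sketched without the gateway's types. `RouteKind` below is a hypothetical, payload-free mirror of a few `RouteError` variants, and `retry_after_header` is an illustrative helper; the real code attaches the hint through `OpenAiError::with_retry_after`.

```rust
/// Hypothetical payload-free mirror of some `RouteError` variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RouteKind {
    ModelNotFound,
    NoHealthyNodes,
    ColdLoadFailed,
    ModelRecovering,
}

/// Transient variants answer 503; everything else stays 404.
fn http_status(kind: RouteKind) -> u16 {
    match kind {
        RouteKind::NoHealthyNodes | RouteKind::ModelRecovering => 503,
        _ => 404,
    }
}

/// Retry hint in seconds, present only for the transient variants.
fn retry_after_secs(kind: RouteKind) -> Option<u64> {
    match kind {
        RouteKind::ModelRecovering => Some(2),
        RouteKind::NoHealthyNodes => Some(5),
        _ => None,
    }
}

/// Header name/value pair to attach, if any (illustrative helper).
fn retry_after_header(kind: RouteKind) -> Option<(&'static str, String)> {
    retry_after_secs(kind).map(|secs| ("retry-after", secs.to_string()))
}
```

In this sketch every variant that answers 503 also carries a hint, and every 404 variant carries none, which matches the pairing the doc comments describe.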
/// Resolve which node should serve a request for the given model.
@@ -281,6 +318,9 @@ async fn cold_load(
last_accessed: Some(chrono::Utc::now()),
vram_estimate_mb: profile.vram_mb,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -440,6 +480,9 @@ mod tests {
min_device_vram_mb: None,
pinned_on: vec![],
source: source.map(String::from),
limit: None,
cost: None,
capabilities: vec![],
}
}

@@ -1,7 +1,10 @@
use crate::entitlements_local::LocalEntitlementProvider;
use cortex_core::catalogue::ModelCatalogue;
use cortex_core::config::{EvictionSettings, GatewayConfig, NeuronEndpoint};
use cortex_core::entitlements::EntitlementProvider;
use cortex_core::node::NodeState;
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;
/// Shared fleet state, protected by a RwLock for concurrent reader access.
@@ -11,6 +14,12 @@ pub struct CortexState {
pub eviction: EvictionSettings,
pub catalogue: ModelCatalogue,
pub http_client: reqwest::Client,
/// Resolves bearer keys to principals and enforces token budgets (#47).
/// A local/static provider today (#50); the upstream client later (#57).
pub entitlements: Arc<dyn EntitlementProvider>,
/// Whether to reject unauthenticated requests (#49). Read by the auth
/// middleware.
pub require_auth: bool,
}
impl CortexState {
@@ -34,6 +43,9 @@ impl CortexState {
let catalogue = ModelCatalogue::load(&config.models_config);
let entitlements: Arc<dyn EntitlementProvider> =
Arc::new(LocalEntitlementProvider::from_config(&config.entitlements));
Self {
nodes: RwLock::new(nodes),
neuron_configs: config.neurons.clone(),
@@ -43,6 +55,8 @@ impl CortexState {
.timeout(std::time::Duration::from_secs(300))
.build()
.expect("failed to build HTTP client"),
entitlements,
require_auth: config.entitlements.require_auth,
}
}
}

@@ -56,6 +56,7 @@ async fn test_alias_resolves_in_chat_completions() {
endpoint: mock_url,
}],
models_config: models_path.to_string_lossy().to_string(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -75,6 +76,9 @@ async fn test_alias_resolves_in_chat_completions() {
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -138,6 +142,7 @@ async fn test_aliases_surface_in_v1_models() {
endpoint: mock_url,
}],
models_config: models_path.to_string_lossy().to_string(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -156,6 +161,9 @@ async fn test_aliases_surface_in_v1_models() {
last_accessed: None,
vram_estimate_mb: Some(2000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -223,6 +231,7 @@ async fn test_alias_falls_through_for_unmapped_model() {
endpoint: mock_url,
}],
models_config: models_path.to_string_lossy().to_string(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -238,6 +247,9 @@ async fn test_alias_falls_through_for_unmapped_model() {
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}

@@ -124,6 +124,94 @@ async fn test_anthropic_invalid_request() {
assert_eq!(resp.status(), 400);
}
/// Tool round-trip: an Anthropic `/v1/messages` request carrying tools
/// (the Claude Code shape: `{name, description, input_schema}`) must
/// reach the upstream neuron reshaped into OpenAI function-tool form,
/// and tool history (`tool_use` / `tool_result` blocks) must become
/// `tool_calls` / `role:"tool"` messages. This is the fix for the
/// failure where the model received malformed tool defs and improvised
/// an unparseable `<tool_use_name>` format.
#[tokio::test]
async fn test_anthropic_tools_reshaped_for_upstream() {
let (mock_url, captured) = common::spawn_capturing_mock_neuron().await;
let gw_url = common::spawn_gateway(&mock_url).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("{gw_url}/v1/messages"))
.header("content-type", "application/json")
.json(&json!({
"model": "test-model",
"max_tokens": 100,
"tools": [{
"name": "Read",
"description": "Read a file from disk",
"input_schema": {
"type": "object",
"properties": {"path": {"type": "string"}},
"required": ["path"]
}
}],
"tool_choice": {"type": "auto"},
"messages": [
{"role": "user", "content": "read /etc/hosts"},
{"role": "assistant", "content": [
{"type": "text", "text": "Reading it."},
{"type": "tool_use", "id": "toolu_42", "name": "Read",
"input": {"path": "/etc/hosts"}}
]},
{"role": "user", "content": [
{"type": "tool_result", "tool_use_id": "toolu_42",
"content": "127.0.0.1 localhost"}
]}
]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), 200);
let forwarded = {
let guard = captured.lock().unwrap();
guard.last().cloned().expect("upstream received a request")
};
// Tool definitions reshaped to OpenAI function form.
let tools = forwarded["tools"].as_array().expect("tools array");
assert_eq!(tools[0]["type"], "function");
assert_eq!(tools[0]["function"]["name"], "Read");
assert_eq!(
tools[0]["function"]["parameters"]["properties"]["path"]["type"],
"string"
);
assert!(tools[0]["function"].get("input_schema").is_none());
// tool_choice mapped.
assert_eq!(forwarded["tool_choice"], "auto");
// Message history: user, assistant(+tool_calls), tool, user.
let msgs = forwarded["messages"].as_array().expect("messages array");
let assistant = msgs
.iter()
.find(|m| m["role"] == "assistant")
.expect("assistant turn");
assert_eq!(assistant["tool_calls"][0]["id"], "toolu_42");
assert_eq!(assistant["tool_calls"][0]["function"]["name"], "Read");
// arguments is the parsed object, not a JSON string — the Qwen3.6
// chat template iterates `tool_call.arguments | items`.
assert_eq!(
assistant["tool_calls"][0]["function"]["arguments"],
json!({"path": "/etc/hosts"})
);
let tool_msg = msgs
.iter()
.find(|m| m["role"] == "tool")
.expect("tool result turn");
assert_eq!(tool_msg["tool_call_id"], "toolu_42");
assert_eq!(tool_msg["content"], "127.0.0.1 localhost");
}
/// #24: a streaming Anthropic request gets a translated Anthropic SSE
/// stream — not raw OpenAI frames. Verifies the full event sequence,
/// text reassembly, and the content type.

@@ -0,0 +1,250 @@
//! Integration tests for API-key auth + principal resolution (#49).
//!
//! Verifies the #63 rejection contract (401 invalid_api_key via the #60
//! envelope) and that an authenticated request reaches neuron carrying the
//! internal principal headers — while a client-supplied principal header is
//! stripped (anti-spoofing).
use axum::Json;
use axum::extract::Path;
use axum::http::HeaderMap;
use axum::routing::{get, post};
use cortex_core::config::{
ApiKeyConfig, EntitlementsConfig, EvictionSettings, EvictionStrategy, GatewayConfig,
GatewaySettings, NeuronEndpoint,
};
use cortex_core::entitlements::{CapWindow, HEADER_ACCOUNT_ID, HEADER_KEY_ID};
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use serde_json::{Value, json};
use std::sync::{Arc, Mutex};
use tokio::net::TcpListener;
/// What the mock neuron observed on the inbound `/v1/chat/completions`
/// request: the principal headers cortex stamped (or didn't).
#[derive(Default)]
struct Seen {
account_id: Option<String>,
key_id: Option<String>,
}
/// Spawn a mock neuron that records the principal headers it receives and
/// returns a trivial chat completion. Returns (base_url, observed).
async fn spawn_capturing_neuron() -> (String, Arc<Mutex<Seen>>) {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let seen: Arc<Mutex<Seen>> = Arc::new(Mutex::new(Seen::default()));
let sink = Arc::clone(&seen);
let app = axum::Router::new()
.route(
"/models/{model_id}/endpoint",
get(move |Path(_): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({ "url": url })) }
}),
)
.route(
"/v1/chat/completions",
post(move |headers: HeaderMap, Json(body): Json<Value>| {
let sink = Arc::clone(&sink);
async move {
{
let mut s = sink.lock().unwrap();
s.account_id = headers
.get(HEADER_ACCOUNT_ID)
.and_then(|v| v.to_str().ok())
.map(str::to_string);
s.key_id = headers
.get(HEADER_KEY_ID)
.and_then(|v| v.to_str().ok())
.map(str::to_string);
}
let model = body.get("model").and_then(Value::as_str).unwrap_or("m");
Json(json!({
"id": "chatcmpl-auth-001",
"object": "chat.completion",
"created": 1700000000_u64,
"model": model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "ok"},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 3, "completion_tokens": 1, "total_tokens": 4}
}))
}
}),
)
.with_state(());
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(base_url, seen)
}
/// Spawn a gateway with the given entitlements config, a single neuron, and
/// `test-model` seeded as loaded (build_app spawns no poller).
async fn spawn_gateway(neuron_url: &str, entitlements: EntitlementsConfig) -> String {
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: neuron_url.to_string(),
}],
models_config: "/dev/null".into(),
entitlements,
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").unwrap();
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
format!("http://{addr}")
}
fn one_key_config(require_auth: bool) -> EntitlementsConfig {
EntitlementsConfig {
require_auth,
keys: vec![ApiKeyConfig {
key: "sk-good".into(),
account_id: "acct-1".into(),
key_id: Some("key-1".into()),
hard_cap: None,
window: CapWindow::Balance,
}],
}
}
fn chat_body() -> Value {
json!({
"model": "test-model",
"messages": [{"role": "user", "content": "hi"}]
})
}
#[tokio::test]
async fn missing_key_when_required_is_401_invalid_api_key() {
let (neuron, _seen) = spawn_capturing_neuron().await;
let gateway = spawn_gateway(&neuron, one_key_config(true)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.json(&chat_body())
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::UNAUTHORIZED);
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "invalid_api_key");
assert_eq!(body["error"]["type"], "invalid_request_error");
}
#[tokio::test]
async fn invalid_key_is_401_even_when_auth_not_required() {
let (neuron, seen) = spawn_capturing_neuron().await;
// A present-but-wrong credential is always an error.
let gateway = spawn_gateway(&neuron, one_key_config(false)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-wrong")
.json(&chat_body())
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::UNAUTHORIZED);
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "invalid_api_key");
// Rejected before dispatch — neuron never saw the request.
assert!(seen.lock().unwrap().account_id.is_none());
}
#[tokio::test]
async fn valid_key_reaches_neuron_with_principal_headers() {
let (neuron, seen) = spawn_capturing_neuron().await;
let gateway = spawn_gateway(&neuron, one_key_config(true)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-good")
// A spoofed principal header must be stripped, not forwarded.
.header(HEADER_ACCOUNT_ID, "attacker")
.json(&chat_body())
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
let s = seen.lock().unwrap();
assert_eq!(s.account_id.as_deref(), Some("acct-1"));
assert_eq!(s.key_id.as_deref(), Some("key-1"));
}
#[tokio::test]
async fn anonymous_allowed_when_auth_not_required() {
let (neuron, seen) = spawn_capturing_neuron().await;
let gateway = spawn_gateway(&neuron, EntitlementsConfig::default()).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.json(&chat_body())
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
// No principal resolved → no principal headers stamped.
let s = seen.lock().unwrap();
assert!(s.account_id.is_none());
assert!(s.key_id.is_none());
}
#[tokio::test]
async fn health_is_public_even_when_auth_required() {
let (neuron, _seen) = spawn_capturing_neuron().await;
let gateway = spawn_gateway(&neuron, one_key_config(true)).await;
let resp = reqwest::Client::new()
.get(format!("{gateway}/health"))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
}
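
The five auth tests above pin down a small decision table. A minimal sketch of that table as a pure function follows; `resolve`, `AuthError`, and the `HashMap` key store are hypothetical names chosen for illustration and are not taken from the gateway's source. The local `Principal` only mirrors the two fields the tests read.

```rust
use std::collections::HashMap;

/// Identity stamped onto the upstream request (illustrative shape only).
#[derive(Debug, Clone, PartialEq)]
struct Principal {
    account_id: String,
    key_id: String,
}

#[derive(Debug, PartialEq)]
enum AuthError {
    /// The tests above expect HTTP 401 with code `invalid_api_key` here.
    InvalidApiKey,
}

/// Hypothetical resolver: `Ok(Some(_))` for a known key, `Ok(None)` for an
/// anonymous request that is allowed through, `Err(_)` for a rejection.
fn resolve(
    require_auth: bool,
    bearer: Option<&str>,
    keys: &HashMap<String, Principal>,
) -> Result<Option<Principal>, AuthError> {
    match bearer {
        // A presented credential is always checked, even when auth is optional.
        Some(token) => keys
            .get(token)
            .cloned()
            .map(Some)
            .ok_or(AuthError::InvalidApiKey),
        // No credential: rejected only when auth is required.
        None if require_auth => Err(AuthError::InvalidApiKey),
        None => Ok(None),
    }
}

/// Build a one-key store matching `one_key_config` in the tests.
fn one_key_store() -> HashMap<String, Principal> {
    let mut keys = HashMap::new();
    keys.insert(
        "sk-good".to_string(),
        Principal {
            account_id: "acct-1".into(),
            key_id: "key-1".into(),
        },
    );
    keys
}
```

Under this sketch the principal headers would be derived only from the resolved `Principal`, never from client-supplied headers, which is the behaviour `valid_key_reaches_neuron_with_principal_headers` checks.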

@@ -0,0 +1,253 @@
//! Integration tests for budget enforcement (#52) — the A0 seatbelt.
//!
//! A reservation over the key's hard cap is refused *before* neuron is hit,
//! with the #63 code matching the cap-window semantics (rate_limit_exceeded
//! + Retry-After for a resetting window, insufficient_quota for a hard
//! balance). Spend never exceeds the cap. No 402, ever.
use axum::Json;
use axum::extract::Path;
use axum::routing::{get, post};
use cortex_core::config::{
ApiKeyConfig, EntitlementsConfig, EvictionSettings, EvictionStrategy, GatewayConfig,
GatewaySettings, NeuronEndpoint,
};
use cortex_core::entitlements::{CapWindow, Principal};
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use serde_json::{Value, json};
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::net::TcpListener;
/// Mock neuron with a hit counter on the inference path, so a test can prove
/// a request was (or wasn't) dispatched.
async fn spawn_counting_neuron() -> (String, Arc<AtomicU64>) {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let hits = Arc::new(AtomicU64::new(0));
let sink = Arc::clone(&hits);
let app = axum::Router::new()
.route(
"/models/{model_id}/endpoint",
get(move |Path(_): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({ "url": url })) }
}),
)
.route(
"/v1/chat/completions",
post(move |Json(body): Json<Value>| {
let sink = Arc::clone(&sink);
async move {
sink.fetch_add(1, Ordering::SeqCst);
let model = body.get("model").and_then(Value::as_str).unwrap_or("m");
Json(json!({
"id": "chatcmpl-budget",
"object": "chat.completion",
"created": 1700000000_u64,
"model": model,
"choices": [{"index": 0, "message": {"role": "assistant", "content": "ok"}, "finish_reason": "stop"}],
"usage": {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
}))
}
}),
);
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(base_url, hits)
}
async fn spawn_gateway(neuron_url: &str, key: ApiKeyConfig) -> (Arc<CortexState>, String) {
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: neuron_url.to_string(),
}],
models_config: "/dev/null".into(),
entitlements: EntitlementsConfig {
require_auth: true,
keys: vec![key],
},
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").unwrap();
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(fleet, format!("http://{addr}"))
}
fn key(window: CapWindow, hard_cap: u64) -> ApiKeyConfig {
ApiKeyConfig {
key: "sk-cap".into(),
account_id: "acct-cap".into(),
key_id: Some("key-cap".into()),
hard_cap: Some(hard_cap),
window,
}
}
fn chat(max_tokens: u64) -> Value {
json!({
"model": "test-model",
"max_tokens": max_tokens,
"messages": [{"role": "user", "content": "hi"}]
})
}
#[tokio::test]
async fn balance_over_cap_is_429_insufficient_quota_before_dispatch() {
let (neuron, hits) = spawn_counting_neuron().await;
// Cap far below a single request's reservation (max_tokens 1000).
let (_fleet, gateway) = spawn_gateway(&neuron, key(CapWindow::Balance, 10)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-cap")
.json(&chat(1000))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::TOO_MANY_REQUESTS);
// Hard balance → no Retry-After.
assert!(resp.headers().get(reqwest::header::RETRY_AFTER).is_none());
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "insufficient_quota");
// Refused before dispatch — neuron never saw it.
assert_eq!(hits.load(Ordering::SeqCst), 0);
}
#[tokio::test]
async fn rolling_over_cap_is_429_rate_limited_with_retry_after() {
let (neuron, hits) = spawn_counting_neuron().await;
let (_fleet, gateway) =
spawn_gateway(&neuron, key(CapWindow::Rolling { seconds: 3600 }, 10)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-cap")
.json(&chat(1000))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::TOO_MANY_REQUESTS);
let retry = resp
.headers()
.get(reqwest::header::RETRY_AFTER)
.expect("rolling-window rejection must carry Retry-After");
assert!(retry.to_str().unwrap().parse::<u64>().unwrap() >= 1);
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "rate_limit_exceeded");
assert_eq!(hits.load(Ordering::SeqCst), 0);
}
#[tokio::test]
async fn within_cap_is_served() {
let (neuron, hits) = spawn_counting_neuron().await;
let (_fleet, gateway) = spawn_gateway(&neuron, key(CapWindow::Balance, 1_000_000)).await;
let resp = reqwest::Client::new()
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-cap")
.json(&chat(50))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
let _ = resp.bytes().await.unwrap();
assert_eq!(hits.load(Ordering::SeqCst), 1);
}
#[tokio::test]
async fn a0_seatbelt_caps_a_runaway_fan_out() {
// An Agent-Zero-style key with a modest cap: a burst of requests drains
// it, then further requests are refused — the account stops draining and
// spend never exceeds the cap.
let (neuron, hits) = spawn_counting_neuron().await;
let (fleet, gateway) = spawn_gateway(&neuron, key(CapWindow::Balance, 100)).await;
let client = reqwest::Client::new();
let mut ok = 0;
let mut refused = 0;
for _ in 0..20 {
let resp = client
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth("sk-cap")
.json(&chat(20))
.send()
.await
.unwrap();
match resp.status() {
reqwest::StatusCode::OK => {
ok += 1;
let _ = resp.bytes().await.unwrap();
}
reqwest::StatusCode::TOO_MANY_REQUESTS => {
refused += 1;
let body: Value = resp.json().await.unwrap();
assert_eq!(body["error"]["code"], "insufficient_quota");
}
other => panic!("unexpected status {other}"),
}
}
assert!(ok >= 1, "some requests should be served");
assert!(refused >= 1, "the cap must eventually refuse the fan-out");
assert_eq!(
hits.load(Ordering::SeqCst),
ok,
"refused requests never dispatched"
);
// Spend never exceeded the hard cap (reservation prevents overshoot).
// Poll briefly for in-flight settles to land.
let principal = Principal {
account_id: "acct-cap".into(),
key_id: "key-cap".into(),
};
for _ in 0..50 {
let snap = fleet.entitlements.snapshot(&principal).await.unwrap();
if snap.reserved == 0 {
break;
}
tokio::time::sleep(std::time::Duration::from_millis(20)).await;
}
let snap = fleet.entitlements.snapshot(&principal).await.unwrap();
assert!(snap.spent <= 100, "spent {} exceeded cap", snap.spent);
}
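
The seatbelt test relies on reserve-then-settle accounting: a request holds an estimate against the cap before dispatch and the hold is replaced by actual usage afterwards. The sketch below is one way such a ledger could work; the `Ledger` type, its methods, and the choice of `max_tokens` as the whole estimate are assumptions for illustration. The real `EntitlementProvider` may estimate and settle differently.

```rust
/// Hypothetical per-key ledger. The `spent` and `reserved` names mirror the
/// snapshot fields the tests read; the logic itself is an assumption.
#[derive(Debug, Default)]
struct Ledger {
    hard_cap: Option<u64>,
    spent: u64,
    reserved: u64,
}

impl Ledger {
    /// Hold `estimate` tokens, refusing if the hold could overshoot the cap.
    fn reserve(&mut self, estimate: u64) -> bool {
        if let Some(cap) = self.hard_cap {
            if self.spent + self.reserved + estimate > cap {
                return false;
            }
        }
        self.reserved += estimate;
        true
    }

    /// Release the hold and record actual usage. This sketch assumes
    /// `actual <= estimate`, so settled spend cannot exceed what was held.
    fn settle(&mut self, estimate: u64, actual: u64) {
        self.reserved = self.reserved.saturating_sub(estimate);
        self.spent += actual;
    }
}

/// Replay a sequential fan-out: each request reserves `estimate` and, if
/// admitted, settles to `actual`. Returns (served, refused, final ledger).
fn run_fan_out(cap: u64, requests: u32, estimate: u64, actual: u64) -> (u32, u32, Ledger) {
    let mut ledger = Ledger {
        hard_cap: Some(cap),
        ..Ledger::default()
    };
    let (mut served, mut refused) = (0, 0);
    for _ in 0..requests {
        if ledger.reserve(estimate) {
            ledger.settle(estimate, actual);
            served += 1;
        } else {
            refused += 1;
        }
    }
    (served, refused, ledger)
}
```

With a cap of 100, an estimate of 20, and 15 tokens of actual usage per request, this model admits requests until `spent + 20` would exceed the cap and refuses the rest, so spend stays at or below the cap by construction.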

@@ -54,9 +54,64 @@ pub async fn spawn_mock_neuron() -> String {
base_url
}
/// Like [`spawn_mock_neuron`] but captures the JSON body of every
/// `POST /v1/chat/completions` it receives into the returned handle, so
/// a test can assert what the gateway *actually forwarded upstream*
/// (e.g. that Anthropic-shaped tools were reshaped to OpenAI form).
pub async fn spawn_capturing_mock_neuron() -> (String, Arc<std::sync::Mutex<Vec<Value>>>) {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
let base_url = format!("http://{addr}");
let inference_url = base_url.clone();
let captured: Arc<std::sync::Mutex<Vec<Value>>> = Arc::new(std::sync::Mutex::new(Vec::new()));
let sink = captured.clone();
let app = Router::new()
.route("/models", get(mock_neuron_list_models))
.route(
"/models/{model_id}/endpoint",
get(move |Path(_): Path<String>| {
let url = inference_url.clone();
async move { Json(json!({"url": url})) }
}),
)
.route(
"/v1/chat/completions",
post(move |Json(body): Json<Value>| {
let sink = sink.clone();
async move {
let model = body
.get("model")
.and_then(|v| v.as_str())
.unwrap_or("unknown");
let resp = json!({
"id": "chatcmpl-capture-001",
"object": "chat.completion",
"created": 1700000000_u64,
"model": model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Hello from mock backend"},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
});
sink.lock().unwrap().push(body);
Json(resp)
}
}),
);
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(base_url, captured)
}
async fn mock_neuron_list_models() -> Json<Value> {
Json(json!([
-{"id": "test-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000}
+{"id": "test-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000, "capabilities": ["text"], "tool_call": false, "reasoning": false}
]))
}
@@ -374,6 +429,7 @@ pub async fn spawn_gateway_with_state(mock_url: &str) -> (Arc<CortexState>, Stri
endpoint: mock_url.to_string(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -391,6 +447,9 @@ pub async fn spawn_gateway_with_state(mock_url: &str) -> (Arc<CortexState>, Stri
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}

@@ -0,0 +1,140 @@
mod common;
use serde_json::json;
#[tokio::test]
async fn error_response_model_not_found() {
let neuron_url = common::spawn_mock_neuron().await;
let gateway_url = common::spawn_gateway(&neuron_url).await;
let client = reqwest::Client::new();
// Request a model that isn't loaded on the mock neuron.
let resp = client
.post(format!("{gateway_url}/v1/chat/completions"))
.header("Content-Type", "application/json")
.json(&json!({
"model": "nonexistent-model",
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), axum::http::StatusCode::NOT_FOUND);
let body: serde_json::Value = resp.json().await.expect("valid json");
let err = body.get("error").expect("response has error object");
// Broad type categorization
assert_eq!(err.get("type").unwrap(), "invalid_request_error");
// Specific machine-readable code
assert_eq!(
err.get("code").unwrap().as_str().unwrap(),
"model_not_found"
);
// param is always null
assert!(err.get("param").unwrap().is_null());
}
#[tokio::test]
async fn error_response_missing_model_field() {
let neuron_url = common::spawn_mock_neuron().await;
let gateway_url = common::spawn_gateway(&neuron_url).await;
let client = reqwest::Client::new();
// Request without the required `model` field.
let resp = client
.post(format!("{gateway_url}/v1/chat/completions"))
.header("Content-Type", "application/json")
.json(&json!({
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), axum::http::StatusCode::BAD_REQUEST);
let body: serde_json::Value = resp.json().await.expect("valid json");
let err = body.get("error").expect("response has error object");
assert_eq!(err.get("type").unwrap(), "invalid_request_error");
assert_eq!(
err.get("code").unwrap().as_str().unwrap(),
"missing_model_field"
);
assert!(err.get("param").unwrap().is_null());
}
#[tokio::test]
async fn error_response_no_healthy_nodes() {
use cortex_core::config::{EvictionSettings, GatewayConfig, GatewaySettings, NeuronEndpoint};
use std::sync::Arc;
// Create a gateway config with a neuron pointing at an unreachable port so no node is ever healthy.
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: cortex_core::config::EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "dead-node".into(),
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(cortex_gateway::state::CortexState::from_config(&config));
let app = cortex_gateway::build_app(fleet);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
// Allow the poller a moment to mark the node unhealthy.
tokio::time::sleep(std::time::Duration::from_millis(200)).await;
let client = reqwest::Client::new();
let resp = client
.post(format!("http://{addr}/v1/chat/completions"))
.header("Content-Type", "application/json")
.json(&json!({
"model": "any-model",
"messages": [{"role": "user", "content": "hi"}]
}))
.send()
.await
.expect("request should succeed");
assert_eq!(resp.status(), axum::http::StatusCode::SERVICE_UNAVAILABLE);
// Transient 503 — the gateway advertises Retry-After so OpenAI-compatible
// clients back off and retry rather than surfacing an opaque error (#63).
let retry_after = resp
.headers()
.get(reqwest::header::RETRY_AFTER)
.expect("transient 503 must carry Retry-After")
.to_str()
.unwrap()
.to_string();
assert_eq!(retry_after, "5");
let body: serde_json::Value = resp.json().await.expect("valid json");
let err = body.get("error").expect("response has error object");
assert_eq!(err.get("type").unwrap(), "api_error");
assert_eq!(
err.get("code").unwrap().as_str().unwrap(),
"service_unavailable"
);
assert!(err.get("param").unwrap().is_null());
}
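
The three error-shape tests above assert a fixed mapping from failure to HTTP status, `type`, `code`, and `Retry-After`. A compact sketch of that mapping as a lookup follows; the `GatewayError` enum and `ErrorShape` struct are hypothetical names, and only the values the tests assert are encoded.

```rust
/// Hypothetical failure kinds, one per test in this file.
#[derive(Debug, Clone, Copy, PartialEq)]
enum GatewayError {
    ModelNotFound,
    MissingModelField,
    NoHealthyNodes,
}

/// The wire-visible parts of an error response that the tests check.
#[derive(Debug, PartialEq)]
struct ErrorShape {
    status: u16,
    kind: &'static str,
    code: &'static str,
    /// `Some(n)` means a `Retry-After: n` header accompanies the response.
    retry_after_secs: Option<u64>,
}

fn shape(err: GatewayError) -> ErrorShape {
    match err {
        GatewayError::ModelNotFound => ErrorShape {
            status: 404,
            kind: "invalid_request_error",
            code: "model_not_found",
            retry_after_secs: None,
        },
        GatewayError::MissingModelField => ErrorShape {
            status: 400,
            kind: "invalid_request_error",
            code: "missing_model_field",
            retry_after_secs: None,
        },
        // Transient: clients are told to back off and retry.
        GatewayError::NoHealthyNodes => ErrorShape {
            status: 503,
            kind: "api_error",
            code: "service_unavailable",
            retry_after_secs: Some(5),
        },
    }
}
```

Keeping the mapping in one exhaustive `match` means a new failure kind cannot be added without deciding its status and code, which is one plausible way to keep the error contract from drifting.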

@@ -71,6 +71,7 @@ fn make_fleet(endpoint: &str, defrag_after: u32) -> Arc<CortexState> {
endpoint: endpoint.to_string(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
Arc::new(CortexState::from_config(&config))
}
@@ -92,6 +93,9 @@ async fn test_evict_lru_model() {
last_accessed: Some(Utc::now() - chrono::Duration::hours(2)),
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
node.models.insert(
@@ -102,6 +106,9 @@ async fn test_evict_lru_model() {
last_accessed: Some(Utc::now()),
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -166,6 +173,9 @@ async fn test_eviction_increments_lifecycle_cycles() {
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}

@@ -0,0 +1,207 @@
//! Integration tests for per-request token metering (#51).
//!
//! Drives authenticated requests through the gateway to a mock neuron that
//! reports a fixed `usage` object, then asserts the EntitlementProvider's
//! spend ledger reflects cumulative per-key spend and that reservations
//! settle to actual (no outstanding reserved tokens once requests complete).
mod common;
use cortex_core::config::{
ApiKeyConfig, EntitlementsConfig, EvictionSettings, EvictionStrategy, GatewayConfig,
GatewaySettings, NeuronEndpoint,
};
use cortex_core::entitlements::{CapWindow, Principal};
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use serde_json::json;
use std::sync::Arc;
use std::time::Duration;
use tokio::net::TcpListener;
const ACCOUNT: &str = "acct-meter";
const KEY_ID: &str = "key-meter";
const BEARER: &str = "sk-meter";
/// The mock neuron (common::spawn_mock_neuron) reports this fixed usage on
/// every chat completion.
const PROMPT_PER_REQ: u64 = 10;
const COMPLETION_PER_REQ: u64 = 5;
async fn spawn_metered_gateway(neuron_url: &str) -> (Arc<CortexState>, String) {
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: neuron_url.to_string(),
}],
models_config: "/dev/null".into(),
entitlements: EntitlementsConfig {
require_auth: true,
keys: vec![ApiKeyConfig {
key: BEARER.into(),
account_id: ACCOUNT.into(),
key_id: Some(KEY_ID.into()),
hard_cap: Some(1_000_000),
window: CapWindow::Balance,
}],
},
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").unwrap();
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
(fleet, format!("http://{addr}"))
}
fn principal() -> Principal {
Principal {
account_id: ACCOUNT.into(),
key_id: KEY_ID.into(),
}
}
/// Poll the provider ledger until settled spend reaches `expected` (settle
/// runs in a spawned task after the response stream finishes) or time out.
async fn await_spent(fleet: &CortexState, expected: u64) -> u64 {
let principal = principal();
for _ in 0..100 {
let snap = fleet.entitlements.snapshot(&principal).await.unwrap();
if snap.spent >= expected {
return snap.spent;
}
tokio::time::sleep(Duration::from_millis(20)).await;
}
fleet.entitlements.snapshot(&principal).await.unwrap().spent
}
#[tokio::test]
async fn cumulative_spend_is_metered_per_key() {
let neuron = common::spawn_mock_neuron().await;
let (fleet, gateway) = spawn_metered_gateway(&neuron).await;
let client = reqwest::Client::new();
const N: u64 = 3;
for _ in 0..N {
let resp = client
.post(format!("{gateway}/v1/chat/completions"))
.bearer_auth(BEARER)
.json(&json!({"model": "test-model", "messages": [{"role": "user", "content": "hi"}]}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
// Drain the body so the response stream finishes and metering settles.
let _ = resp.bytes().await.unwrap();
}
let expected = N * (PROMPT_PER_REQ + COMPLETION_PER_REQ);
let spent = await_spent(&fleet, expected).await;
assert_eq!(
spent, expected,
"ledger must reflect cumulative per-key spend"
);
// Reservations settled to actual — nothing left outstanding.
let snap = fleet.entitlements.snapshot(&principal()).await.unwrap();
assert_eq!(snap.reserved, 0, "all reservations must settle/release");
assert_eq!(snap.hard_cap, Some(1_000_000));
}
#[tokio::test]
async fn anonymous_request_records_no_spend() {
// require_auth=false so the unauthenticated request is served, but with
// no principal it must not touch any ledger.
let neuron = common::spawn_mock_neuron().await;
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
endpoint: neuron.clone(),
}],
models_config: "/dev/null".into(),
entitlements: EntitlementsConfig::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").unwrap();
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let resp = reqwest::Client::new()
.post(format!("http://{addr}/v1/chat/completions"))
.json(&json!({"model": "test-model", "messages": [{"role": "user", "content": "hi"}]}))
.send()
.await
.unwrap();
assert_eq!(resp.status(), reqwest::StatusCode::OK);
let _ = resp.bytes().await.unwrap();
// An unconfigured principal has a zeroed snapshot — nothing was metered.
let snap = fleet
.entitlements
.snapshot(&Principal {
account_id: "nobody".into(),
key_id: "nobody".into(),
})
.await
.unwrap();
assert_eq!(snap.spent, 0);
}
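
`await_spent` above polls because settlement happens in a background task after the response body is drained. The same bounded-polling pattern can be sketched with only the standard library; `poll_until_at_least` and `settle_in_background` are hypothetical helpers, and an `AtomicU64` stands in for the provider's ledger.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Poll `read` until it reaches `expected` or the attempts run out.
/// Returns the last observed value either way, so the caller's assertion
/// reports the real number on timeout instead of hanging.
fn poll_until_at_least(
    read: impl Fn() -> u64,
    expected: u64,
    attempts: u32,
    interval: Duration,
) -> u64 {
    for _ in 0..attempts {
        let value = read();
        if value >= expected {
            return value;
        }
        thread::sleep(interval);
    }
    read()
}

/// Stand-in for the gateway's post-response settle: adds `tokens` to the
/// shared counter after `delay`.
fn settle_in_background(
    ledger: Arc<AtomicU64>,
    tokens: u64,
    delay: Duration,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        thread::sleep(delay);
        ledger.fetch_add(tokens, Ordering::SeqCst);
    })
}
```

The expected value in the metering test is plain arithmetic: three requests at 10 prompt plus 5 completion tokens each gives 45 tokens.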

@@ -0,0 +1,132 @@
//! Issue #62 / #67: `GET /v1/models` advertises a per-model serving budget so
//! an OpenAI-compatible client (opencode's helexa provider) can size and
//! compact its context without hand-configuration.
//!
//! Asserts the composition sources land on the response:
//! - `limit` from the neuron's self-derived value (#67) — NOT the catalogue;
//! an operator-declared catalogue `limit` is deliberately ignored.
//! - `cost` from the catalogue profile (operator-set pricing).
//! - `tool_call` / `reasoning` from the neuron's runtime detection (OR-ed in)
//!
//! Also a regression guard for the removal of `max_model_len` — the misnamed,
//! unconsumed vLLM-ism that this contract replaces.
use cortex_core::config::{
EvictionSettings, EvictionStrategy, GatewayConfig, GatewaySettings, NeuronEndpoint,
};
use cortex_core::harness::ModelLimit;
use cortex_core::node::{ModelEntry, ModelStatus};
use cortex_gateway::state::CortexState;
use std::sync::Arc;
use tokio::net::TcpListener;
#[tokio::test]
async fn v1_models_surfaces_limit_cost_and_capability_flags() {
// Catalogue declares pricing + an operator `limit` that must be IGNORED
// (#67): the neuron's self-derived limit is authoritative.
let models_toml = r#"
[[models]]
id = "test-model"
harness = "candle"
limit.context = 999999
limit.input = 999999
limit.output = 999999
cost.input = 0.0
cost.output = 0.0
capabilities = ["text"]
"#;
let cat_path = std::env::temp_dir().join("cortex_test_issue62_models.toml");
std::fs::write(&cat_path, models_toml).unwrap();
let config = GatewayConfig {
gateway: GatewaySettings {
listen: "127.0.0.1:0".into(),
metrics_listen: "127.0.0.1:0".into(),
},
eviction: EvictionSettings {
strategy: EvictionStrategy::Lru,
defrag_after_cycles: 0,
},
neurons: vec![NeuronEndpoint {
name: "mock-node".into(),
// Never contacted: build_app does not spawn the poller, so the
// seeded state below is authoritative for /v1/models.
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: cat_path.to_string_lossy().into_owned(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
// Seed the model as loaded on the node with runtime-detected flags set —
// these must OR into the catalogue entry, not be lost.
{
let mut nodes = fleet.nodes.write().await;
let node = nodes.get_mut("mock-node").expect("node exists");
node.healthy = true;
node.models.insert(
"test-model".into(),
ModelEntry {
id: "test-model".into(),
status: ModelStatus::Loaded,
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: vec!["text".into()],
tool_call: true,
reasoning: true,
// Neuron's self-derived limit (#67) — the authoritative
// source. Distinct from the catalogue's (ignored) values.
limit: Some(ModelLimit {
context: 49152,
input: Some(40960),
output: 8192,
}),
},
);
}
let app = cortex_gateway::build_app(Arc::clone(&fleet));
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
let body: serde_json::Value = reqwest::Client::new()
.get(format!("http://{addr}/v1/models"))
.send()
.await
.unwrap()
.json()
.await
.unwrap();
let entry = body["data"]
.as_array()
.expect("data is an array")
.iter()
.find(|m| m["id"] == "test-model")
.expect("test-model present in /v1/models");
// `limit` is the neuron's self-derived value (#67), NOT the catalogue's
// (which declared 999999 and must be ignored). `cost` still flows from
// the catalogue.
assert_eq!(entry["limit"]["context"], 49152);
assert_eq!(entry["limit"]["input"], 40960);
assert_eq!(entry["limit"]["output"], 8192);
assert_eq!(entry["cost"]["input"], 0.0);
assert_eq!(entry["cost"]["output"], 0.0);
// Runtime-detected capability flags OR-ed in from the neuron's ModelEntry.
assert_eq!(entry["tool_call"], true);
assert_eq!(entry["reasoning"], true);
// Regression guard: the removed, unconsumed vLLM-ism must not reappear.
assert!(
entry.get("max_model_len").is_none(),
"max_model_len was removed; /v1/models must not advertise it"
);
let _ = std::fs::remove_file(&cat_path);
}
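
The module comment above names three composition sources for a `/v1/models` entry. A minimal sketch of that merge follows, assuming hypothetical `CatalogueProfile`, `RuntimeEntry`, and `Advertised` types; in particular, the catalogue-side `tool_call` and `reasoning` flags are an assumption made so the OR has two operands.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Limit {
    context: u64,
    input: Option<u64>,
    output: u64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Cost {
    input: f64,
    output: f64,
}

/// Operator-declared profile (illustrative fields only).
struct CatalogueProfile {
    limit: Option<Limit>,
    cost: Option<Cost>,
    tool_call: bool,
    reasoning: bool,
}

/// What the neuron reported at runtime (illustrative fields only).
struct RuntimeEntry {
    limit: Option<Limit>,
    tool_call: bool,
    reasoning: bool,
}

/// What the endpoint would advertise for the model.
struct Advertised {
    limit: Option<Limit>,
    cost: Option<Cost>,
    tool_call: bool,
    reasoning: bool,
}

fn compose(catalogue: &CatalogueProfile, runtime: &RuntimeEntry) -> Advertised {
    Advertised {
        // The neuron's self-derived limit wins; `catalogue.limit` is ignored.
        limit: runtime.limit,
        // Pricing is operator-set, so it comes from the catalogue.
        cost: catalogue.cost,
        // Capability flags are OR-ed: either source can turn one on.
        tool_call: catalogue.tool_call || runtime.tool_call,
        reasoning: catalogue.reasoning || runtime.reasoning,
    }
}
```

Feeding this the values from the test (catalogue limit 999999, runtime limit 49152/40960/8192) yields the runtime limit and the catalogue cost, matching the assertions above.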

@@ -31,6 +31,7 @@ async fn test_poller_discovers_models() {
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -82,6 +83,7 @@ async fn test_poller_updates_gateway_models_endpoint() {
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -153,6 +155,7 @@ async fn test_models_endpoint_unions_capabilities_across_nodes() {
},
],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -215,6 +218,7 @@ async fn test_poller_marks_unreachable_node_unhealthy() {
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -252,6 +256,7 @@ async fn test_poller_removes_stale_models() {
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -282,6 +287,7 @@ async fn test_poller_removes_stale_models() {
endpoint: new_mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet2 = Arc::new(CortexState::from_config(&config2));
@@ -298,6 +304,9 @@ async fn test_poller_removes_stale_models() {
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
node.models.insert(
@@ -308,6 +317,9 @@ async fn test_poller_removes_stale_models() {
last_accessed: None,
vram_estimate_mb: None,
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}
@@ -357,6 +369,7 @@ async fn test_poller_captures_activation_from_health() {
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));
@@ -401,6 +414,7 @@ async fn test_poller_parses_recovering_status() {
endpoint: mock_url,
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = Arc::new(CortexState::from_config(&config));

@@ -117,6 +117,7 @@ async fn test_no_healthy_nodes() {
endpoint: "http://127.0.0.1:1".into(),
}],
models_config: "/dev/null".into(),
entitlements: Default::default(),
};
let fleet = std::sync::Arc::new(cortex_gateway::state::CortexState::from_config(&config));
@@ -139,7 +140,7 @@ async fn test_no_healthy_nodes() {
.await
.expect("request should succeed");
-assert_eq!(resp.status(), 404);
+assert_eq!(resp.status(), 503);
let body: serde_json::Value = resp.json().await.unwrap();
assert!(
@@ -192,6 +193,9 @@ async fn test_recovering_model_returns_503_and_stays_listed() {
last_accessed: None,
vram_estimate_mb: Some(8000),
capabilities: Vec::new(),
tool_call: false,
reasoning: false,
limit: None,
},
);
}

@@ -27,12 +27,15 @@ futures = { workspace = true }
tokio-stream = { workspace = true }
eventsource-stream = { workspace = true }
# read-only JSON API (api.rs)
axum = { workspace = true }
tower-http = { workspace = true }
# SQLite system-of-record. `bundled` compiles SQLite from source so the
# binary has no libsqlite3 runtime dependency — matches the project's
# single-static-binary packaging.
rusqlite = { version = "0.32", features = ["bundled"] }
[dev-dependencies]
axum = { workspace = true }
# Jail (isolated cwd + env) for config tests.
figment = { workspace = true, features = ["test"] }

@@ -0,0 +1,119 @@
//! Read-only JSON API over the bench SQLite store.
//!
//! Consumed by the `bench/` visualisation app and for programmatic
//! access. Served by the `run` daemon (alongside the sweep loop) and by
//! the standalone `serve` subcommand. CORS is permissive because the UI
//! is hosted separately (different origin); the API is internal-only
//! (WireGuard + firewalld) and read-only, so this predates the auth epic.
use crate::store::{RunFilter, Store};
use anyhow::Result;
use axum::Router;
use axum::extract::{Query, State};
use axum::http::StatusCode;
use axum::response::Json;
use axum::routing::get;
use serde::Deserialize;
use serde_json::json;
use std::sync::Arc;
use tokio::sync::Mutex;
use tower_http::cors::CorsLayer;
/// Shared API state: a dedicated read connection to the store, guarded
/// (rusqlite `Connection` isn't `Sync`). Separate from the sweep's
/// writer connection — WAL lets them run concurrently.
pub type ApiState = Arc<Mutex<Store>>;
/// Open an API state over the store at `db_path`.
pub fn open_state(db_path: &str) -> Result<ApiState> {
Ok(Arc::new(Mutex::new(Store::open(db_path)?)))
}
/// Build the API router.
pub fn api_routes(state: ApiState) -> Router {
Router::new()
.route("/api/health", get(health))
.route("/api/dimensions", get(dimensions))
.route("/api/summary", get(summary))
.route("/api/series", get(series))
.route("/api/runs", get(runs))
.layer(CorsLayer::permissive())
.with_state(state)
}
/// Bind `listen` and serve the API until the process exits.
pub async fn serve(listen: &str, state: ApiState) -> Result<()> {
let listener = tokio::net::TcpListener::bind(listen).await?;
tracing::info!(%listen, "bench API listening");
axum::serve(listener, api_routes(state)).await?;
Ok(())
}
type ApiError = (StatusCode, String);
fn err500(e: anyhow::Error) -> ApiError {
(StatusCode::INTERNAL_SERVER_ERROR, format!("{e:#}"))
}
async fn health(State(s): State<ApiState>) -> Result<Json<serde_json::Value>, ApiError> {
let store = s.lock().await;
let count = store.run_count().map_err(err500)?;
Ok(Json(json!({ "status": "ok", "run_count": count })))
}
async fn dimensions(State(s): State<ApiState>) -> Result<Json<crate::store::Dimensions>, ApiError> {
let store = s.lock().await;
store.dimensions().map(Json).map_err(err500)
}
async fn summary(
State(s): State<ApiState>,
) -> Result<Json<Vec<crate::store::ReportRow>>, ApiError> {
let store = s.lock().await;
store.summary().map(Json).map_err(err500)
}
#[derive(Debug, Deserialize)]
struct SeriesQuery {
/// Optional — when omitted the store resolves the host serving this model.
host: Option<String>,
model: String,
scenario: String,
}
async fn series(
State(s): State<ApiState>,
Query(q): Query<SeriesQuery>,
) -> Result<Json<Vec<crate::store::SeriesPoint>>, ApiError> {
let store = s.lock().await;
store
.series(q.host.as_deref(), &q.model, &q.scenario)
.map(Json)
.map_err(err500)
}
#[derive(Debug, Deserialize)]
struct RunsQuery {
host: Option<String>,
model: Option<String>,
scenario: Option<String>,
sha: Option<String>,
ok: Option<bool>,
limit: Option<u32>,
}
async fn runs(
State(s): State<ApiState>,
Query(q): Query<RunsQuery>,
) -> Result<Json<Vec<crate::store::RunRow>>, ApiError> {
let filter = RunFilter {
host: q.host,
model: q.model,
scenario: q.scenario,
sha: q.sha,
ok: q.ok,
limit: q.limit,
};
let store = s.lock().await;
store.runs(&filter).map(Json).map_err(err500)
}
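
The shared-state pattern in this file (one store connection that is `Send` but not `Sync`, wrapped in `Arc<Mutex<..>>`) can be sketched with the standard library alone. This is a minimal illustration, not the module's code: `FakeStore`, `SharedStore`, and `query_from_threads` are hypothetical names, and `std::sync::Mutex` with OS threads stands in for `tokio::sync::Mutex` with async handlers.

```rust
use std::cell::RefCell;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the store. The `RefCell` makes it `Send` but
/// not `Sync`, which is the same constraint the API comment describes for
/// a rusqlite `Connection`.
struct FakeStore {
    reads: RefCell<u64>,
}

impl FakeStore {
    /// Pretend query: records that a read happened and returns a fixed count.
    fn run_count(&self) -> u64 {
        *self.reads.borrow_mut() += 1;
        42
    }
}

/// `Mutex<T>` is `Sync` whenever `T` is `Send`, so the wrapper can be shared.
type SharedStore = Arc<Mutex<FakeStore>>;

/// Issue `n` reads from `n` threads and return how many reads the store saw.
fn query_from_threads(state: SharedStore, n: usize) -> u64 {
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let s = Arc::clone(&state);
            thread::spawn(move || {
                // Only one thread at a time holds the guard.
                let guard = s.lock().unwrap();
                guard.run_count()
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let guard = state.lock().unwrap();
    let reads = *guard.reads.borrow();
    reads
}
```

The lock serialises readers on this one connection; concurrency with the sweep writer comes from the writer holding its own connection, as the module comment notes.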


@@ -151,6 +151,10 @@ impl TargetClient {
devices: Vec::new(),
vram_used_mb: None,
capabilities: Vec::new(),
limit: None,
cost: None,
tool_call: false,
reasoning: false,
})
.collect())
}


@@ -16,11 +16,35 @@ pub struct BenchConfig {
pub bench: BenchSettings,
#[serde(default)]
pub scenarios: ScenarioConfig,
/// Read-only JSON API (consumed by the bench UI + programmatic access).
#[serde(default)]
pub api: ApiSettings,
/// Endpoints to benchmark. At least one is required for `run`/`once`.
#[serde(default)]
pub targets: Vec<TargetConfig>,
}
/// The read-only HTTP API the `run` daemon (and the `serve` subcommand)
/// exposes over the SQLite store.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ApiSettings {
/// Whether to bind the API at all.
#[serde(default = "default_api_enabled")]
pub enabled: bool,
/// Listen address for the API.
#[serde(default = "default_api_listen")]
pub listen: String,
}
impl Default for ApiSettings {
fn default() -> Self {
ApiSettings {
enabled: default_api_enabled(),
listen: default_api_listen(),
}
}
}
/// Loop/timing knobs.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BenchSettings {
@@ -151,6 +175,12 @@ fn default_timeout() -> u64 {
fn default_db_path() -> String {
"/var/lib/helexa-bench/bench.sqlite".to_string()
}
fn default_api_enabled() -> bool {
true
}
fn default_api_listen() -> String {
"0.0.0.0:13132".to_string()
}
fn default_prompt_sizes() -> Vec<u32> {
vec![128, 4096]
}


@@ -4,6 +4,7 @@
//! full build/version provenance into SQLite so improvements can be
//! tracked automatically across neuron implementation updates.
pub mod api;
pub mod client;
pub mod config;
pub mod report;


@@ -10,6 +10,7 @@
use anyhow::{Context, Result};
use clap::{Parser, Subcommand};
use helexa_bench::api;
use helexa_bench::config::BenchConfig;
use helexa_bench::report;
use helexa_bench::store::Store;
@@ -37,6 +38,11 @@ enum Command {
#[arg(short, long, default_value = "helexa-bench.toml")]
config: String,
},
/// Serve the read-only JSON API only (no sweeping).
Serve {
#[arg(short, long, default_value = "helexa-bench.toml")]
config: String,
},
/// Render recorded results. Uses `--db` if given, else the db_path
/// from `--config`.
Report {
@@ -77,10 +83,31 @@ async fn run(cli: Cli) -> Result<()> {
Command::Run { config } => {
let cfg = load_config(&config)?;
require_targets(&cfg)?;
// Bind the read API alongside the sweep loop (one bob service
// does both). Its own store connection; WAL keeps the sweep
// writer and the API readers from blocking each other.
if cfg.api.enabled {
let state = api::open_state(&cfg.bench.db_path)?;
let listen = cfg.api.listen.clone();
tokio::spawn(async move {
if let Err(e) = api::serve(&listen, state).await {
tracing::error!(error = %format!("{e:#}"), "bench API server exited");
}
});
}
let sweeper = Sweeper::new(cfg)?;
tracing::info!("helexa-bench started; entering continuous sweep loop");
sweeper.run_forever().await
}
Command::Serve { config } => {
let cfg = load_config(&config)?;
if !cfg.api.enabled {
anyhow::bail!("[api] enabled = false — nothing to serve");
}
let state = api::open_state(&cfg.bench.db_path)?;
tracing::info!("helexa-bench serving API only");
api::serve(&cfg.api.listen, state).await
}
Command::Once { config } => {
let cfg = load_config(&config)?;
require_targets(&cfg)?;


@@ -47,6 +47,7 @@ pub fn render_json(rows: &[ReportRow]) -> Result<String> {
"total_s_median": r.total_s_median,
"git_sha": r.git_sha,
"samples": r.samples,
"gpu": r.gpu,
})
})
.collect();
@@ -77,6 +78,7 @@ mod tests {
decode_tps_median: Some(45.6),
total_s_median: Some(1.234),
samples: 5,
gpu: Some("2× RTX 5090".into()),
}];
let md = render_markdown(&rows);
assert!(md.contains("| engine |"));
@@ -98,6 +100,7 @@ mod tests {
decode_tps_median: None,
total_s_median: Some(0.5),
samples: 1,
gpu: None,
}];
let md = render_markdown(&rows);
assert!(md.contains("~128"));


@@ -7,7 +7,7 @@
//! never held across one.
use anyhow::{Context, Result};
use rusqlite::{Connection, params};
use rusqlite::{Connection, OptionalExtension, params};
use std::path::Path;
/// A single measured (or failed) iteration, with full provenance.
@@ -87,6 +87,9 @@ impl Store {
fn init(conn: &Connection) -> Result<()> {
conn.execute_batch(
r#"
-- WAL so the read-only API connection never blocks the
-- sweep writer (and vice versa).
PRAGMA journal_mode=WAL;
CREATE TABLE IF NOT EXISTS runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts TEXT NOT NULL,
@@ -221,7 +224,7 @@ impl Store {
// successful run, then median that SHA's samples.
let mut stmt = self.conn.prepare(
"SELECT target_name, model_id, scenario_id, prompt_size_approx, git_sha,
ttft_s, decode_tps, total_s, prompt_tokens_actual
ttft_s, decode_tps, total_s, prompt_tokens_actual, gpus_json
FROM runs
WHERE ok=1
ORDER BY target_name, model_id, scenario_id, id",
@@ -237,11 +240,322 @@ impl Store {
decode_tps: row.get(6)?,
total_s: row.get(7)?,
prompt_tokens_actual: row.get(8)?,
gpus_json: row.get(9)?,
})
})?;
let raws: Vec<RawRow> = rows.collect::<rusqlite::Result<_>>()?;
Ok(aggregate(raws))
}
// ── Read API surface (consumed by api.rs) ─────────────────────────
/// Total recorded runs (for `/api/health`).
pub fn run_count(&self) -> Result<u64> {
let n: i64 = self
.conn
.query_row("SELECT COUNT(*) FROM runs", [], |row| row.get(0))?;
Ok(n as u64)
}
/// Distinct hosts / models / scenarios / builds, for populating UI
/// filters. Builds are ordered chronologically by build timestamp
/// (falling back to first-seen wall-clock).
pub fn dimensions(&self) -> Result<Dimensions> {
let col = |sql: &str| -> Result<Vec<String>> {
let mut stmt = self.conn.prepare(sql)?;
let rows = stmt.query_map([], |r| r.get::<_, String>(0))?;
Ok(rows.collect::<rusqlite::Result<_>>()?)
};
let hosts = col("SELECT DISTINCT target_name FROM runs ORDER BY target_name")?;
let models = col("SELECT DISTINCT model_id FROM runs ORDER BY model_id")?;
let scenarios = col("SELECT DISTINCT scenario_id FROM runs ORDER BY scenario_id")?;
let mut stmt = self.conn.prepare(
"SELECT git_sha, MAX(build_timestamp), MAX(package_version), MIN(COALESCE(build_timestamp, ts)) AS ord
FROM runs GROUP BY git_sha ORDER BY ord",
)?;
let builds = stmt
.query_map([], |r| {
Ok(BuildRef {
git_sha: r.get(0)?,
build_timestamp: r.get(1)?,
package_version: r.get(2)?,
})
})?
.collect::<rusqlite::Result<_>>()?;
// host/model → GPU label, taken from each one's most recent run.
let gpu_map = |group_col: &str| -> Result<std::collections::HashMap<String, String>> {
let sql = format!(
"SELECT {group_col}, gpus_json FROM runs \
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY {group_col})"
);
let mut stmt = self.conn.prepare(&sql)?;
let rows = stmt.query_map([], |r| {
Ok((r.get::<_, String>(0)?, r.get::<_, Option<String>>(1)?))
})?;
let mut out = std::collections::HashMap::new();
for row in rows {
let (key, gpus) = row?;
if let Some(label) = gpus.as_deref().and_then(gpu_label) {
out.insert(key, label);
}
}
Ok(out)
};
let host_gpus = gpu_map("target_name")?;
let model_gpus = gpu_map("model_id")?;
Ok(Dimensions {
hosts,
models,
scenarios,
builds,
host_gpus,
model_gpus,
})
}
/// Latest-SHA-per-cell medians (the report table as JSON).
pub fn summary(&self) -> Result<Vec<ReportRow>> {
self.report_rows()
}
/// Per-build median metrics for one (model, scenario) cell, ordered
/// chronologically by build — the "over time" series. `host` is
/// optional: when omitted it resolves to the host with the most recent
/// run for this (model, scenario). Each model is served by a single
/// host today, so this yields a coherent single-host series and lets
/// callers (the public UI) select by model alone.
pub fn series(
&self,
host: Option<&str>,
model: &str,
scenario: &str,
) -> Result<Vec<SeriesPoint>> {
let host = match host {
Some(h) => h.to_string(),
None => {
let resolved: Option<String> = self
.conn
.query_row(
"SELECT target_name FROM runs WHERE ok=1 AND model_id=?1 \
AND scenario_id=?2 ORDER BY id DESC LIMIT 1",
params![model, scenario],
|r| r.get(0),
)
.optional()?;
match resolved {
Some(h) => h,
None => return Ok(Vec::new()),
}
}
};
let mut stmt = self.conn.prepare(
"SELECT git_sha, build_timestamp, package_version, ttft_s, decode_tps, total_s, ts
FROM runs
WHERE ok=1 AND target_name=?1 AND model_id=?2 AND scenario_id=?3
ORDER BY id",
)?;
let raws: Vec<SeriesRaw> = stmt
.query_map(params![host, model, scenario], |r| {
Ok(SeriesRaw {
git_sha: r.get(0)?,
build_timestamp: r.get(1)?,
package_version: r.get(2)?,
ttft_s: r.get(3)?,
decode_tps: r.get(4)?,
total_s: r.get(5)?,
ts: r.get(6)?,
})
})?
.collect::<rusqlite::Result<_>>()?;
Ok(aggregate_series(raws))
}
/// Raw rows, optionally filtered. For drill-down + programmatic access.
pub fn runs(&self, f: &RunFilter) -> Result<Vec<RunRow>> {
let mut sql = String::from(
"SELECT id, ts, target_name, hostname, git_sha, build_timestamp, package_version,
model_id, harness, scenario_id, prompt_size_approx, prompt_tokens_actual,
max_tokens, ttft_s, decode_tps, total_s, completion_tokens, ok, error,
gpus_json
FROM runs",
);
let mut conds: Vec<String> = Vec::new();
let mut args: Vec<Box<dyn rusqlite::ToSql>> = Vec::new();
let bind = |col: &str,
val: Option<&str>,
conds: &mut Vec<String>,
args: &mut Vec<Box<dyn rusqlite::ToSql>>| {
if let Some(v) = val {
args.push(Box::new(v.to_string()));
conds.push(format!("{col}=?{}", args.len()));
}
};
bind("target_name", f.host.as_deref(), &mut conds, &mut args);
bind("model_id", f.model.as_deref(), &mut conds, &mut args);
bind("scenario_id", f.scenario.as_deref(), &mut conds, &mut args);
bind("git_sha", f.sha.as_deref(), &mut conds, &mut args);
if let Some(ok) = f.ok {
args.push(Box::new(ok as i64));
conds.push(format!("ok=?{}", args.len()));
}
if !conds.is_empty() {
sql.push_str(" WHERE ");
sql.push_str(&conds.join(" AND "));
}
sql.push_str(" ORDER BY id DESC");
let limit = f.limit.unwrap_or(500).min(5000);
args.push(Box::new(limit as i64));
sql.push_str(&format!(" LIMIT ?{}", args.len()));
let mut stmt = self.conn.prepare(&sql)?;
let rows = stmt
.query_map(rusqlite::params_from_iter(args.iter()), |r| {
let gpus_json: Option<String> = r.get(19)?;
Ok(RunRow {
id: r.get(0)?,
ts: r.get(1)?,
host: r.get(2)?,
gpu: gpus_json.as_deref().and_then(gpu_label),
hostname: r.get(3)?,
git_sha: r.get(4)?,
build_timestamp: r.get(5)?,
package_version: r.get(6)?,
model_id: r.get(7)?,
harness: r.get(8)?,
scenario_id: r.get(9)?,
prompt_size_approx: r.get(10)?,
prompt_tokens_actual: r.get(11)?,
max_tokens: r.get(12)?,
ttft_s: r.get(13)?,
decode_tps: r.get(14)?,
total_s: r.get(15)?,
completion_tokens: r.get(16)?,
ok: r.get::<_, i64>(17)? != 0,
error: r.get(18)?,
})
})?
.collect::<rusqlite::Result<_>>()?;
Ok(rows)
}
}
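
The filter-building step in `runs` (append a condition only for filters that are set, number each placeholder by argument position, clamp the limit) can be illustrated without a database. A minimal sketch under stated assumptions: `Filter` and `build_query` are hypothetical, only two string filters are modelled, and the limit is carried as a string here, whereas the method above binds it as an integer.

```rust
/// Hypothetical, reduced filter: two optional string columns plus a limit.
#[derive(Default)]
struct Filter {
    host: Option<String>,
    model: Option<String>,
    limit: Option<u32>,
}

/// Build the SQL text and its positional arguments. Column names are fixed
/// literals; only values travel as bound parameters.
fn build_query(f: &Filter) -> (String, Vec<String>) {
    let mut sql = String::from("SELECT id FROM runs");
    let mut conds: Vec<String> = Vec::new();
    let mut args: Vec<String> = Vec::new();

    for (col, val) in [("target_name", &f.host), ("model_id", &f.model)] {
        if let Some(v) = val {
            args.push(v.clone());
            // The placeholder index is the 1-based position of the value just pushed.
            conds.push(format!("{col}=?{}", args.len()));
        }
    }
    if !conds.is_empty() {
        sql.push_str(" WHERE ");
        sql.push_str(&conds.join(" AND "));
    }
    sql.push_str(" ORDER BY id DESC");

    // Default of 500 rows with a ceiling of 5000, mirroring the clamp in `runs`.
    let limit = f.limit.unwrap_or(500).min(5000);
    args.push(limit.to_string());
    sql.push_str(&format!(" LIMIT ?{}", args.len()));
    (sql, args)
}
```

Because each placeholder number is derived from `args.len()` at the moment of the push, the numbering stays correct whichever subset of filters is present.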
// ── Read-API serde types ──────────────────────────────────────────────
#[derive(Debug, Clone, serde::Serialize)]
pub struct Dimensions {
pub hosts: Vec<String>,
pub models: Vec<String>,
pub scenarios: Vec<String>,
pub builds: Vec<BuildRef>,
/// host → GPU label (latest run), so the UI can show the GPU as the
/// resource name instead of the internal hostname.
pub host_gpus: std::collections::HashMap<String, String>,
/// model → GPU label (latest run); model maps to one host today.
pub model_gpus: std::collections::HashMap<String, String>,
}
#[derive(Debug, Clone, serde::Serialize)]
pub struct BuildRef {
pub git_sha: String,
pub build_timestamp: Option<String>,
pub package_version: Option<String>,
}
#[derive(Debug, Clone, serde::Serialize)]
pub struct SeriesPoint {
pub git_sha: String,
pub build_timestamp: Option<String>,
pub package_version: Option<String>,
pub ttft_s_median: Option<f64>,
pub decode_tps_median: Option<f64>,
pub total_s_median: Option<f64>,
pub samples: usize,
}
struct SeriesRaw {
git_sha: String,
build_timestamp: Option<String>,
package_version: Option<String>,
ttft_s: Option<f64>,
decode_tps: Option<f64>,
total_s: Option<f64>,
ts: String,
}
/// Group id-ordered rows by build SHA, median each metric, and order the
/// resulting points chronologically by build (timestamp, else first ts).
fn aggregate_series(raws: Vec<SeriesRaw>) -> Vec<SeriesPoint> {
use std::collections::BTreeMap;
// Preserve first-seen order per sha for the chronological sort key.
let mut order: Vec<String> = Vec::new();
let mut groups: BTreeMap<String, Vec<SeriesRaw>> = BTreeMap::new();
for r in raws {
if !groups.contains_key(&r.git_sha) {
order.push(r.git_sha.clone());
}
groups.entry(r.git_sha.clone()).or_default().push(r);
}
let mut points: Vec<(String, SeriesPoint)> = order
.into_iter()
.map(|sha| {
let rows = &groups[&sha];
let sort_key = rows
.iter()
.map(|r| r.build_timestamp.clone().unwrap_or_else(|| r.ts.clone()))
.min()
.unwrap_or_default();
let point = SeriesPoint {
git_sha: sha,
build_timestamp: rows.iter().find_map(|r| r.build_timestamp.clone()),
package_version: rows.iter().find_map(|r| r.package_version.clone()),
ttft_s_median: median(rows.iter().filter_map(|r| r.ttft_s)),
decode_tps_median: median(rows.iter().filter_map(|r| r.decode_tps)),
total_s_median: median(rows.iter().filter_map(|r| r.total_s)),
samples: rows.len(),
};
(sort_key, point)
})
.collect();
points.sort_by(|a, b| a.0.cmp(&b.0));
points.into_iter().map(|(_, p)| p).collect()
}
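
The grouping done by `aggregate_series` can be sketched on plain tuples: bucket samples by build SHA, take a median per bucket, then order buckets by their earliest sort key. This is an illustration only. `Sample` and `per_build_medians` are hypothetical names, and the `median` helper is an assumption (it averages the two middle values for even counts, which matches the "median of 0.10 and 0.12" expectation in the API tests but is not copied from the store).

```rust
use std::collections::BTreeMap;

/// Hypothetical sample: (build SHA, sort key such as a build timestamp, metric).
type Sample = (&'static str, &'static str, f64);

/// Assumed median: mean of the two middle values when the count is even.
fn median(mut v: Vec<f64>) -> Option<f64> {
    if v.is_empty() {
        return None;
    }
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    })
}

/// Returns (sha, median, sample count), ordered by each build's earliest key.
fn per_build_medians(samples: &[Sample]) -> Vec<(String, f64, usize)> {
    // sha -> (earliest sort key seen, metric values)
    let mut groups: BTreeMap<&'static str, (&'static str, Vec<f64>)> = BTreeMap::new();
    for &(sha, key, value) in samples {
        let entry = groups.entry(sha).or_insert((key, Vec::new()));
        if key < entry.0 {
            entry.0 = key;
        }
        entry.1.push(value);
    }
    let mut points: Vec<(&'static str, String, f64, usize)> = groups
        .into_iter()
        .filter_map(|(sha, (key, values))| {
            let n = values.len();
            median(values).map(|m| (key, sha.to_string(), m, n))
        })
        .collect();
    // RFC 3339 timestamps in one timezone sort correctly as plain strings.
    points.sort_by(|a, b| a.0.cmp(b.0));
    points.into_iter().map(|(_, sha, m, n)| (sha, m, n)).collect()
}
```

Sorting on the string key relies on every timestamp using the same format and offset; mixed offsets would need real timestamp parsing.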
#[derive(Debug, Clone, Default)]
pub struct RunFilter {
pub host: Option<String>,
pub model: Option<String>,
pub scenario: Option<String>,
pub sha: Option<String>,
pub ok: Option<bool>,
pub limit: Option<u32>,
}
#[derive(Debug, Clone, serde::Serialize)]
pub struct RunRow {
pub id: i64,
pub ts: String,
pub host: String,
/// Public-facing resource name (the host's GPU(s)), e.g. "RTX 4090".
pub gpu: Option<String>,
pub hostname: Option<String>,
pub git_sha: String,
pub build_timestamp: Option<String>,
pub package_version: String,
pub model_id: String,
pub harness: String,
pub scenario_id: String,
pub prompt_size_approx: u32,
pub prompt_tokens_actual: Option<u64>,
pub max_tokens: u64,
pub ttft_s: Option<f64>,
pub decode_tps: Option<f64>,
pub total_s: Option<f64>,
pub completion_tokens: Option<u64>,
pub ok: bool,
pub error: Option<String>,
}
struct RawRow {
@@ -254,10 +568,11 @@ struct RawRow {
decode_tps: Option<f64>,
total_s: Option<f64>,
prompt_tokens_actual: Option<u64>,
gpus_json: Option<String>,
}
/// An aggregated cell ready for the report table.
#[derive(Debug, Clone, PartialEq)]
#[derive(Debug, Clone, PartialEq, serde::Serialize)]
pub struct ReportRow {
pub target_name: String,
pub model_id: String,
@@ -269,6 +584,8 @@ pub struct ReportRow {
pub decode_tps_median: Option<f64>,
pub total_s_median: Option<f64>,
pub samples: usize,
/// Public-facing resource name (the host's GPU(s)), e.g. "2× RTX 5090".
pub gpu: Option<String>,
}
/// Group by (target, model, scenario), keep only the latest SHA's rows
@@ -305,11 +622,51 @@ fn aggregate(raws: Vec<RawRow>) -> Vec<ReportRow> {
decode_tps_median: median(cell.iter().filter_map(|r| r.decode_tps)),
total_s_median: median(cell.iter().filter_map(|r| r.total_s)),
samples: cell.len(),
gpu: cell
.iter()
.find_map(|r| r.gpus_json.as_deref().and_then(gpu_label)),
});
}
out
}
/// Compact GPU label from a run's stored `gpus_json` (the discovery device
/// list) — e.g. "2× RTX 5090", "RTX 4090". `None` when empty/absent. Used
/// as the public-facing resource name in place of internal hostnames.
fn gpu_label(gpus_json: &str) -> Option<String> {
let devices: Vec<serde_json::Value> = serde_json::from_str(gpus_json).ok()?;
if devices.is_empty() {
return None;
}
let mut order: Vec<String> = Vec::new();
let mut counts: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
for d in &devices {
let name = d.get("name").and_then(|v| v.as_str()).unwrap_or("GPU");
let short = name
.trim_start_matches("NVIDIA GeForce ")
.trim_start_matches("NVIDIA ")
.to_string();
if !counts.contains_key(&short) {
order.push(short.clone());
}
*counts.entry(short).or_insert(0) += 1;
}
Some(
order
.iter()
.map(|n| {
let c = counts[n];
if c > 1 {
format!("{c}× {n}")
} else {
n.clone()
}
})
.collect::<Vec<_>>()
.join(" + "),
)
}
fn median(values: impl Iterator<Item = f64>) -> Option<f64> {
let mut v: Vec<f64> = values.collect();
if v.is_empty() {
@@ -397,4 +754,15 @@ mod tests {
assert_eq!(rows[0].samples, 2);
assert!((rows[0].ttft_s_median.unwrap() - 0.3).abs() < 1e-9);
}
#[test]
fn gpu_label_formats() {
let two = r#"[{"name":"NVIDIA GeForce RTX 5090"},{"name":"NVIDIA GeForce RTX 5090"}]"#;
assert_eq!(gpu_label(two).as_deref(), Some("2× RTX 5090"));
let one = r#"[{"name":"NVIDIA GeForce RTX 4090"}]"#;
assert_eq!(gpu_label(one).as_deref(), Some("RTX 4090"));
let dc = r#"[{"name":"NVIDIA H100"}]"#;
assert_eq!(gpu_label(dc).as_deref(), Some("H100"));
assert_eq!(gpu_label("[]"), None);
}
}
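
The labelling rule exercised by `gpu_label_formats` (strip the vendor prefix, count identical devices, keep first-seen order, join distinct kinds with " + ") can be sketched on plain device names. A simplified illustration: `label_from_names` is a hypothetical helper that skips the JSON parsing, and it uses `strip_prefix` (removes a prefix at most once) where the store uses `trim_start_matches` (removes it repeatedly).

```rust
use std::collections::HashMap;

/// Collapse device names into a compact label such as "2× RTX 5090".
/// Takes names directly rather than the stored `gpus_json` (a simplification).
fn label_from_names(names: &[&str]) -> Option<String> {
    if names.is_empty() {
        return None;
    }
    let mut order: Vec<&str> = Vec::new();
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &name in names {
        // Try the longer prefix first so "NVIDIA GeForce " is removed whole.
        let short = name
            .strip_prefix("NVIDIA GeForce ")
            .or_else(|| name.strip_prefix("NVIDIA "))
            .unwrap_or(name);
        let count = counts.entry(short).or_insert(0);
        if *count == 0 {
            // First sighting: remember the position for stable output order.
            order.push(short);
        }
        *count += 1;
    }
    let parts: Vec<String> = order
        .iter()
        .map(|n| match counts[n] {
            1 => n.to_string(),
            c => format!("{c}× {n}"),
        })
        .collect();
    Some(parts.join(" + "))
}
```

Keeping a separate `order` vector is what makes the label deterministic, since `HashMap` iteration order is unspecified.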


@@ -0,0 +1,219 @@
//! Read-API tests: seed a temp store, serve the router, assert JSON.
use helexa_bench::api;
use helexa_bench::store::{RunRecord, Store};
use serde_json::Value;
#[allow(clippy::too_many_arguments)]
fn rec(
host: &str,
sha: &str,
build_ts: Option<&str>,
model: &str,
scenario: &str,
ttft: f64,
ok: bool,
) -> RunRecord {
RunRecord {
ts: "2026-06-13T00:00:00Z".into(),
target_name: host.into(),
target_kind: "neuron".into(),
endpoint: format!("http://{host}:13131"),
hostname: Some(host.into()),
driver_version: Some("580.159".into()),
cuda_version: Some("13.0".into()),
gpus_json: Some("[]".into()),
git_sha: sha.into(),
git_sha_long: None,
package_version: "0.1.16".into(),
git_dirty: false,
build_timestamp: build_ts.map(|s| s.to_string()),
rustc_version: None,
profile: Some("release".into()),
features_json: "[\"cuda\"]".into(),
candle_version: Some("0.10.2".into()),
bench_version: "0.1.16".into(),
bench_sha: "deadbee".into(),
model_id: model.into(),
harness: "candle".into(),
capabilities_json: "[\"text\"]".into(),
devices_json: "[0]".into(),
scenario_id: scenario.into(),
prompt_size_approx: 128,
prompt_tokens_actual: Some(130),
max_tokens: 64,
ttft_s: if ok { Some(ttft) } else { None },
decode_tps: if ok { Some(30.0) } else { None },
total_s: if ok { Some(2.0) } else { None },
completion_tokens: if ok { Some(60) } else { None },
ok,
error: if ok { None } else { Some("boom".into()) },
}
}
/// Seed a temp db, return its path.
fn seed(tag: &str) -> String {
let path = std::env::temp_dir().join(format!("hb-api-{}-{tag}.sqlite", std::process::id()));
let _ = std::fs::remove_file(&path);
let p = path.to_string_lossy().to_string();
let store = Store::open(&p).unwrap();
// beast / m / chat:128 across two builds (old then new).
store
.insert_run(&rec(
"beast",
"old",
Some("2026-06-01T00:00:00Z"),
"m",
"chat:128",
0.20,
true,
))
.unwrap();
store
.insert_run(&rec(
"beast",
"new",
Some("2026-06-10T00:00:00Z"),
"m",
"chat:128",
0.10,
true,
))
.unwrap();
store
.insert_run(&rec(
"beast",
"new",
Some("2026-06-10T00:00:00Z"),
"m",
"chat:128",
0.12,
true,
))
.unwrap();
// a failed row (must not count in series/summary medians)
store
.insert_run(&rec(
"beast",
"new",
Some("2026-06-10T00:00:00Z"),
"m",
"chat:128",
0.0,
false,
))
.unwrap();
// a different host for the runs filter
store
.insert_run(&rec(
"benjy",
"new",
Some("2026-06-10T00:00:00Z"),
"n",
"chat:128",
0.15,
true,
))
.unwrap();
p
}
async fn spawn(db: &str) -> String {
let state = api::open_state(db).unwrap();
let app = api::api_routes(state);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();
tokio::spawn(async move {
axum::serve(listener, app).await.unwrap();
});
format!("http://{addr}")
}
async fn get(base: &str, path: &str) -> Value {
reqwest::get(format!("{base}{path}"))
.await
.unwrap()
.json()
.await
.unwrap()
}
#[tokio::test]
async fn health_reports_run_count() {
let base = spawn(&seed("health")).await;
let v = get(&base, "/api/health").await;
assert_eq!(v["status"], "ok");
assert_eq!(v["run_count"], 5);
}
#[tokio::test]
async fn dimensions_lists_distinct_values_and_builds_chronologically() {
let base = spawn(&seed("dims")).await;
let v = get(&base, "/api/dimensions").await;
let hosts: Vec<&str> = v["hosts"]
.as_array()
.unwrap()
.iter()
.map(|x| x.as_str().unwrap())
.collect();
assert_eq!(hosts, vec!["beast", "benjy"]);
assert_eq!(v["models"].as_array().unwrap().len(), 2);
// builds ordered by earliest build_timestamp: old before new
let builds = v["builds"].as_array().unwrap();
assert_eq!(builds[0]["git_sha"], "old");
assert_eq!(builds[1]["git_sha"], "new");
}
#[tokio::test]
async fn summary_uses_latest_sha_and_ignores_failures() {
let base = spawn(&seed("summary")).await;
let v = get(&base, "/api/summary").await;
let rows = v.as_array().unwrap();
let beast = rows
.iter()
.find(|r| r["target_name"] == "beast" && r["scenario_id"] == "chat:128")
.unwrap();
assert_eq!(beast["git_sha"], "new");
assert_eq!(beast["samples"], 2); // two ok rows on "new"; failure excluded
// median of 0.10 and 0.12
assert!((beast["ttft_s_median"].as_f64().unwrap() - 0.11).abs() < 1e-9);
}
#[tokio::test]
async fn series_is_chronological_per_build() {
let base = spawn(&seed("series")).await;
let v = get(&base, "/api/series?host=beast&model=m&scenario=chat:128").await;
let pts = v.as_array().unwrap();
assert_eq!(pts.len(), 2);
assert_eq!(pts[0]["git_sha"], "old");
assert_eq!(pts[1]["git_sha"], "new");
assert_eq!(pts[0]["samples"], 1);
assert_eq!(pts[1]["samples"], 2);
}
#[tokio::test]
async fn series_resolves_host_when_omitted() {
// The public UI selects by model alone; the store resolves the host.
let base = spawn(&seed("series-nohost")).await;
let v = get(&base, "/api/series?model=m&scenario=chat:128").await;
let pts = v.as_array().unwrap();
assert_eq!(pts.len(), 2);
assert_eq!(pts[0]["git_sha"], "old");
assert_eq!(pts[1]["git_sha"], "new");
}
#[tokio::test]
async fn runs_filters_by_host() {
let base = spawn(&seed("runs")).await;
let all = get(&base, "/api/runs").await;
assert_eq!(all.as_array().unwrap().len(), 5);
let beast = get(&base, "/api/runs?host=beast").await;
let rows = beast.as_array().unwrap();
assert_eq!(rows.len(), 4);
assert!(rows.iter().all(|r| r["host"] == "beast"));
// failed row carries its error + ok=false
assert!(
rows.iter()
.any(|r| r["ok"] == false && r["error"] == "boom")
);
}


@@ -90,6 +90,7 @@ fn config_for(endpoint: String, db_path: String) -> BenchConfig {
prompt_sizes: vec![128], // single scenario keeps assertions simple
max_tokens: 16,
},
api: Default::default(),
targets: vec![TargetConfig {
name: "mock".into(),
kind: TargetKind::Neuron,


@@ -13,6 +13,7 @@ use axum::response::sse::{Event, KeepAlive, Sse};
use axum::response::{IntoResponse, Json};
use axum::routing::{get, post};
use cortex_core::discovery::{DiscoveryResponse, HealthResponse};
use cortex_core::entitlements::{HEADER_ACCOUNT_ID, HEADER_KEY_ID};
use cortex_core::harness::ModelSpec;
use cortex_core::openai::{ChatCompletionRequest, MessageContent};
use cortex_core::responses::{ResponsesRequest, ResponsesUsage};
@@ -71,6 +72,12 @@ async fn health_handler(State(state): State<Arc<NeuronState>>) -> Json<HealthRes
// know about activation lifecycle.
let mut snapshot = state.health_cache.snapshot().await;
snapshot.activation = state.activation.snapshot().await;
// Per-model admission load (#53) — read live from the candle harness so
// cortex's load-aware router (#55) can spread traffic and propagate
// backpressure. Absent when no candle harness is present.
if let Some(candle) = &state.candle {
snapshot.models = candle.load_snapshot().await;
}
Json(snapshot)
}
@@ -198,13 +205,54 @@ async fn model_endpoint(
}
}
/// Default `chat_template_kwargs.enable_thinking` to `include_thinking`
/// when the client didn't set it explicitly, leaving any explicit client
/// choice untouched. See the call site in [`chat_completions`] for the
/// rationale (reasoning eating the token budget for clients that drop it).
fn default_enable_thinking(req: &mut ChatCompletionRequest, include_thinking: bool) {
if req
.extra
.get("chat_template_kwargs")
.and_then(|k| k.get("enable_thinking"))
.is_some()
{
return; // client chose explicitly — respect it
}
if !req.extra.is_object() {
req.extra = json!({});
}
let Some(obj) = req.extra.as_object_mut() else {
return;
};
let kwargs = obj
.entry("chat_template_kwargs")
.or_insert_with(|| json!({}));
if !kwargs.is_object() {
*kwargs = json!({});
}
if let Some(kw) = kwargs.as_object_mut() {
kw.insert("enable_thinking".into(), json!(include_thinking));
}
}
/// The request's principal for fair-share admission (#54), reconstructed
/// from the internal headers cortex stamps (#49). cortex strips any
/// client-supplied copy and asserts the authoritative value, so over the
/// trusted WireGuard link these are safe to key fair-share on. `None` for an
/// unauthenticated/direct request — exempt from the per-principal cap.
fn principal_key(headers: &axum::http::HeaderMap) -> Option<String> {
let account = headers.get(HEADER_ACCOUNT_ID)?.to_str().ok()?;
let key = headers.get(HEADER_KEY_ID)?.to_str().ok()?;
Some(format!("{account}/{key}"))
}
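
The all-or-nothing behaviour of `principal_key` (a principal exists only when both headers are present) can be sketched with a plain map in place of `axum::http::HeaderMap`. The header names below are placeholders chosen for the example; the real constants come from `cortex_core::entitlements` and their values are not shown in this diff.

```rust
use std::collections::HashMap;

// Placeholder header names, assumed for illustration only.
const HEADER_ACCOUNT_ID: &str = "x-example-account-id";
const HEADER_KEY_ID: &str = "x-example-key-id";

/// Returns "account/key" when both headers are present, `None` otherwise.
/// A `None` principal corresponds to an unauthenticated or direct request.
fn principal_key(headers: &HashMap<String, String>) -> Option<String> {
    let account = headers.get(HEADER_ACCOUNT_ID)?;
    let key = headers.get(HEADER_KEY_ID)?;
    Some(format!("{account}/{key}"))
}

/// Convenience builder for the example inputs.
fn headers_from(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

Joining both identifiers keeps two keys of the same account in separate buckets, so the resulting string is specific to one account and key pair.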
/// OpenAI-compatible chat completions. Dispatches to streaming SSE when
/// `stream: true` is set on the request; otherwise returns a single
/// `ChatCompletionResponse`.
async fn chat_completions(
State(state): State<Arc<NeuronState>>,
headers: axum::http::HeaderMap,
Json(req): Json<ChatCompletionRequest>,
Json(mut req): Json<ChatCompletionRequest>,
) -> impl IntoResponse {
let Some(candle) = state.candle.as_ref().map(Arc::clone) else {
return (
@@ -229,8 +277,26 @@ async fn chat_completions(
reasoning_markers: None, // filled in from the loaded model inside candle
};
// Couple reasoning *generation* to reasoning *surfacing*. Reasoning
// models (Qwen3.6) think by default, and that `<think>` block can
// consume the entire `max_tokens` budget — which, when we then drop
// it (`include_thinking == false`, the default for OpenAI/Anthropic
// clients like Claude Code), leaves the visible answer empty or
// truncated. So when the caller isn't going to see the reasoning,
// don't generate it: default `enable_thinking` to `include_thinking`.
// A client that explicitly set `chat_template_kwargs.enable_thinking`
// wins; thinking-aware clients (helexa-acp, `x-include-thinking:
// true`) keep reasoning on.
default_enable_thinking(&mut req, include_thinking);
// Fair-share admission principal (#54), from cortex's stamped headers.
let principal = principal_key(&headers);
if req.stream.unwrap_or(false) {
match candle.chat_completion_stream_with(req, chat_config).await {
match candle
.chat_completion_stream_with(req, chat_config, principal)
.await
{
Ok(rx) => {
// Each chunk → one SSE `data: {json}` line. After the
// channel closes, append the OpenAI [DONE] terminator.
@@ -244,104 +310,12 @@ async fn chat_completions(
.keep_alive(KeepAlive::default())
.into_response()
}
Err(InferenceError::ModelNotLoaded(id)) => (
StatusCode::NOT_FOUND,
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
)
.into_response(),
Err(InferenceError::PromptTooLong { prompt_len, max }) => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!("prompt has {prompt_len} tokens but max is {max}"),
"code": "prompt_too_long",
"prompt_len": prompt_len,
"max": max,
})),
)
.into_response(),
Err(InferenceError::InsufficientVram {
free_mb,
required_mb,
}) => (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({
"error": format!(
"insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
),
"code": "insufficient_vram",
"free_mb": free_mb,
"required_mb": required_mb,
})),
)
.into_response(),
Err(InferenceError::VisionUnsupported { model_id }) => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!(
"model '{model_id}' does not support image input"
),
"code": "vision_unsupported",
"model_id": model_id,
"suggestion": "load a vision-capable model or remove image_url content parts",
})),
)
.into_response(),
Err(InferenceError::Other(e)) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})),
)
.into_response(),
Err(e) => inference_error_response(e),
}
} else {
match candle.chat_completion(req).await {
match candle.chat_completion(req, principal).await {
Ok(resp) => Json(resp).into_response(),
Err(InferenceError::ModelNotLoaded(id)) => (
StatusCode::NOT_FOUND,
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
)
.into_response(),
Err(InferenceError::PromptTooLong { prompt_len, max }) => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!("prompt has {prompt_len} tokens but max is {max}"),
"code": "prompt_too_long",
"prompt_len": prompt_len,
"max": max,
})),
)
.into_response(),
Err(InferenceError::InsufficientVram {
free_mb,
required_mb,
}) => (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({
"error": format!(
"insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
),
"code": "insufficient_vram",
"free_mb": free_mb,
"required_mb": required_mb,
})),
)
.into_response(),
Err(InferenceError::VisionUnsupported { model_id }) => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!(
"model '{model_id}' does not support image input"
),
"code": "vision_unsupported",
"model_id": model_id,
"suggestion": "load a vision-capable model or remove image_url content parts",
})),
)
.into_response(),
Err(InferenceError::Other(e)) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})),
)
.into_response(),
Err(e) => inference_error_response(e),
}
}
}
@@ -352,6 +326,7 @@ async fn chat_completions(
/// event stream into the Responses event family.
async fn responses(
State(state): State<Arc<NeuronState>>,
headers: axum::http::HeaderMap,
Json(req): Json<ResponsesRequest>,
) -> impl IntoResponse {
let Some(candle) = state.candle.as_ref().map(Arc::clone) else {
@@ -386,9 +361,12 @@ async fn responses(
};
chat_req.stream = Some(stream_requested);
// Fair-share admission principal (#54), from cortex's stamped headers.
let principal = principal_key(&headers);
if stream_requested {
match candle
.responses_stream(chat_req, response_id, message_item_id)
.responses_stream(chat_req, response_id, message_item_id, principal)
.await
{
Ok(rx) => {
@@ -412,7 +390,7 @@ async fn responses(
// and translate the result. We don't currently re-tokenise
// to compute usage; the harness returns it via the chat
// response and we pass it through.
match candle.chat_completion(chat_req).await {
match candle.chat_completion(chat_req, principal).await {
Ok(chat_resp) => {
// Extract the assistant text (chat completions
// always emits one choice on the candle path).
@@ -440,6 +418,9 @@ async fn responses(
input_tokens: u.prompt_tokens,
output_tokens: u.completion_tokens,
total_tokens: u.prompt_tokens + u.completion_tokens,
// Non-streaming reasoning accounting deferred (#64).
output_tokens_details: None,
input_tokens_details: None,
});
let meta = openai_responses::ResponseMeta {
response_id: mint_response_id(),
@@ -466,58 +447,112 @@ fn finish_reason_from_str(s: &str) -> crate::wire::FinishReason {
}
/// Centralised mapping from [`InferenceError`] to an HTTP response.
/// Lifted out so the chat-completions and responses handlers stay
/// readable and changes to error-code semantics happen in one spot.
///
/// Emits the OpenAI-standard *nested* error envelope:
///
/// ```json
/// { "error": { "message": "...", "type": "...", "code": "...", "param": null } }
/// ```
///
/// OpenAI-compatible clients (opencode, the openai SDK) reach into
/// `error.type` / `error.code` to drive behaviour — most importantly,
/// `code == "context_length_exceeded"` triggers auto-compaction and
/// retry rather than a hard failure. A flat `{"error": "..."}` string
/// is invisible to that logic, so every variant nests here. Diagnostic
/// extras (prompt_len, free_mb, …) ride *inside* the error object so
/// they don't break the envelope shape.
fn inference_error_response(err: InferenceError) -> axum::response::Response {
match err {
InferenceError::ModelNotLoaded(id) => (
StatusCode::NOT_FOUND,
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
use cortex_core::error_envelope::OpenAiError;
let env = match err {
InferenceError::ModelNotLoaded(id) => OpenAiError::new(
404,
"invalid_request_error",
"model_not_found",
format!("model '{id}' not loaded on this neuron"),
)
.into_response(),
InferenceError::PromptTooLong { prompt_len, max } => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!("prompt has {prompt_len} tokens but max is {max}"),
"code": "prompt_too_long",
"prompt_len": prompt_len,
"max": max,
})),
)
.into_response(),
.with_extra("model_id", json!(id)),
// OpenAI's canonical context-overflow error. opencode keys on
// `code == "context_length_exceeded"` and the message phrasing
// ("maximum context length is N tokens") to auto-compact+retry.
InferenceError::PromptTooLong { prompt_len, max } => {
OpenAiError::context_length_exceeded(format!(
"This model's maximum context length is {max} tokens. \
However, your messages resulted in {prompt_len} tokens. \
Please reduce the length of the messages."
))
.with_extra("prompt_len", json!(prompt_len))
.with_extra("max", json!(max))
}
// VRAM frees as the in-flight request(s) complete, so this is a
// transient 503 — advertise a short Retry-After (#63).
InferenceError::InsufficientVram {
free_mb,
required_mb,
} => (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({
"error": format!(
"insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
),
"code": "insufficient_vram",
"free_mb": free_mb,
"required_mb": required_mb,
})),
} => OpenAiError::new(
503,
"api_error",
"insufficient_vram",
format!("insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"),
)
.into_response(),
InferenceError::VisionUnsupported { model_id } => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!(
"model '{model_id}' does not support image input"
),
"code": "vision_unsupported",
"model_id": model_id,
"suggestion": "load a vision-capable model or remove image_url content parts",
})),
.with_retry_after(5)
.with_extra("free_mb", json!(free_mb))
.with_extra("required_mb", json!(required_mb)),
InferenceError::VisionUnsupported { model_id } => OpenAiError::new(
400,
"invalid_request_error",
"vision_unsupported",
format!("model '{model_id}' does not support image input"),
)
.into_response(),
InferenceError::Other(e) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})),
.with_extra("model_id", json!(model_id))
.with_extra(
"suggestion",
json!("load a vision-capable model or remove image_url content parts"),
),
InferenceError::TemplateRenderFailed { detail } => OpenAiError::new(
422,
"invalid_request_error",
"template_render_failed",
format!("chat template could not render this request: {detail}"),
),
// Admission control refused on load (#53): a fast, retryable "busy"
// signal. 503 (service busy) + Retry-After; opencode/AI SDK back off.
InferenceError::Overloaded { retry_after_secs } => OpenAiError::new(
503,
"rate_limit_error",
"rate_limit_exceeded",
"model is busy (admission queue full); retry shortly",
)
.into_response(),
.with_retry_after(retry_after_secs),
// Per-principal fair-share cap (#54): 429 rate_limit_exceeded +
// Retry-After — the caller is sending too many concurrent requests.
InferenceError::PerPrincipalLimit { retry_after_secs } => OpenAiError::new(
429,
"rate_limit_error",
"rate_limit_exceeded",
"too many concurrent requests for this key; retry shortly",
)
.with_retry_after(retry_after_secs),
InferenceError::Other(e) => OpenAiError::without_code(500, "api_error", format!("{e:#}")),
};
envelope_response(env)
}
/// Neuron adapter: turn the shared [`cortex_core::error_envelope::OpenAiError`]
/// into an axum response, setting `Retry-After` when the envelope carries one.
/// cortex-core owns the envelope shape (#60/#63); this is the only crossing
/// from that data into axum on the neuron side.
fn envelope_response(err: cortex_core::error_envelope::OpenAiError) -> axum::response::Response {
let status = StatusCode::from_u16(err.status).unwrap_or(StatusCode::INTERNAL_SERVER_ERROR);
let retry_after = err.retry_after_secs;
let mut response = (status, Json(err.body())).into_response();
if let Some(secs) = retry_after
&& let Ok(value) = axum::http::HeaderValue::from_str(&secs.to_string())
{
response
.headers_mut()
.insert(axum::http::header::RETRY_AFTER, value);
}
response
}
fn mint_response_id() -> String {
@@ -541,3 +576,193 @@ fn unix_subsec_nanos() -> u64 {
.map(|d| d.as_nanos() as u64)
.unwrap_or(0)
}
#[cfg(test)]
mod thinking_tests {
use super::*;
fn req(value: serde_json::Value) -> ChatCompletionRequest {
serde_json::from_value(value).expect("valid ChatCompletionRequest")
}
fn enable_thinking(r: &ChatCompletionRequest) -> Option<bool> {
r.extra
.get("chat_template_kwargs")
.and_then(|k| k.get("enable_thinking"))
.and_then(|v| v.as_bool())
}
#[test]
fn defaults_enable_thinking_to_include_thinking_false() {
let mut r = req(json!({"model": "m", "messages": []}));
default_enable_thinking(&mut r, false);
assert_eq!(enable_thinking(&r), Some(false));
}
#[test]
fn defaults_enable_thinking_true_when_surfacing() {
let mut r = req(json!({"model": "m", "messages": []}));
default_enable_thinking(&mut r, true);
assert_eq!(enable_thinking(&r), Some(true));
}
#[test]
fn explicit_client_choice_is_respected() {
let mut r = req(json!({
"model": "m", "messages": [],
"chat_template_kwargs": {"enable_thinking": true}
}));
// include_thinking=false would normally force false; explicit wins.
default_enable_thinking(&mut r, false);
assert_eq!(enable_thinking(&r), Some(true));
}
#[test]
fn preserves_other_chat_template_kwargs() {
let mut r = req(json!({
"model": "m", "messages": [],
"chat_template_kwargs": {"some_other": 42}
}));
default_enable_thinking(&mut r, false);
assert_eq!(enable_thinking(&r), Some(false));
assert_eq!(
r.extra["chat_template_kwargs"]["some_other"],
json!(42),
"existing kwargs must survive"
);
}
}
#[cfg(test)]
mod error_envelope_tests {
use super::*;
use axum::http::StatusCode;
/// Drive an `InferenceError` through the mapper and decode the
/// `(status, json)` pair it produces.
async fn map(err: InferenceError) -> (StatusCode, Value) {
let resp = inference_error_response(err);
let status = resp.status();
let bytes = axum::body::to_bytes(resp.into_body(), usize::MAX)
.await
.expect("buffer error body");
let body: Value = serde_json::from_slice(&bytes).expect("error body is JSON");
(status, body)
}
#[tokio::test]
async fn prompt_too_long_is_context_length_exceeded() {
let (status, body) = map(InferenceError::PromptTooLong {
prompt_len: 60_000,
max: 49_152,
})
.await;
assert_eq!(status, StatusCode::BAD_REQUEST);
// The envelope must be nested under `error`, not a flat string.
let error = body
.get("error")
.and_then(Value::as_object)
.expect("error object");
assert_eq!(error["type"], "invalid_request_error");
assert_eq!(
error["code"], "context_length_exceeded",
"opencode keys on this code to auto-compact and retry"
);
assert_eq!(error["param"], Value::Null);
// Phrasing opencode/openai clients pattern-match on.
let msg = error["message"].as_str().unwrap();
assert!(
msg.contains("maximum context length is 49152 tokens"),
"message was: {msg}"
);
// Diagnostics ride inside the error object.
assert_eq!(error["prompt_len"], 60_000);
assert_eq!(error["max"], 49_152);
}
#[tokio::test]
async fn model_not_loaded_is_404_model_not_found() {
let (status, body) = map(InferenceError::ModelNotLoaded("Qwen/X".into())).await;
assert_eq!(status, StatusCode::NOT_FOUND);
let error = &body["error"];
assert_eq!(error["type"], "invalid_request_error");
assert_eq!(error["code"], "model_not_found");
assert_eq!(error["model_id"], "Qwen/X");
}
#[tokio::test]
async fn insufficient_vram_is_503_api_error() {
let (status, body) = map(InferenceError::InsufficientVram {
free_mb: 1_024,
required_mb: 8_192,
})
.await;
assert_eq!(status, StatusCode::SERVICE_UNAVAILABLE);
let error = &body["error"];
assert_eq!(error["type"], "api_error");
assert_eq!(error["code"], "insufficient_vram");
assert_eq!(error["free_mb"], 1_024);
assert_eq!(error["required_mb"], 8_192);
}
#[tokio::test]
async fn overloaded_is_503_rate_limited_with_retry_after() {
// Admission rejection (#53) → fast, retryable backpressure.
let resp = inference_error_response(InferenceError::Overloaded {
retry_after_secs: 7,
});
assert_eq!(resp.status(), StatusCode::SERVICE_UNAVAILABLE);
let retry = resp
.headers()
.get(axum::http::header::RETRY_AFTER)
.expect("admission rejection must advertise Retry-After");
assert_eq!(retry.to_str().unwrap(), "7");
let bytes = axum::body::to_bytes(resp.into_body(), usize::MAX)
.await
.unwrap();
let body: Value = serde_json::from_slice(&bytes).unwrap();
assert_eq!(body["error"]["code"], "rate_limit_exceeded");
}
#[tokio::test]
async fn insufficient_vram_carries_retry_after() {
// Transient 503 — VRAM frees as in-flight requests finish, so the
// client should back off and retry (#63).
let resp = inference_error_response(InferenceError::InsufficientVram {
free_mb: 1_024,
required_mb: 8_192,
});
let retry = resp
.headers()
.get(axum::http::header::RETRY_AFTER)
.expect("transient 503 must advertise Retry-After");
assert_eq!(retry.to_str().unwrap(), "5");
}
#[tokio::test]
async fn permanent_rejections_have_no_retry_after() {
// context_length_exceeded is permanent for this request — no hint.
let resp = inference_error_response(InferenceError::PromptTooLong {
prompt_len: 60_000,
max: 49_152,
});
assert!(
resp.headers()
.get(axum::http::header::RETRY_AFTER)
.is_none(),
"permanent rejection must not advertise Retry-After"
);
}
#[tokio::test]
async fn other_is_500_with_null_code() {
let (status, body) = map(InferenceError::Other(anyhow::anyhow!("kaboom"))).await;
assert_eq!(status, StatusCode::INTERNAL_SERVER_ERROR);
let error = &body["error"];
assert_eq!(error["type"], "api_error");
assert_eq!(error["code"], Value::Null);
assert!(error["message"].as_str().unwrap().contains("kaboom"));
}
}
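
The error mapping above can be summarised as a small table from error variant to status, `error.type`, `error.code`, and an optional `Retry-After`. The sketch below is a minimal, std-only illustration of that table. `SketchError`, `Mapped`, and `map_error` are hypothetical names for this example and are not the real `InferenceError` or `cortex_core::error_envelope::OpenAiError` types. The values are copied from the diff and its tests.

```rust
/// Hypothetical, trimmed-down stand-in for the diff's `InferenceError`.
#[derive(Debug)]
enum SketchError {
    ModelNotLoaded,
    PromptTooLong,
    InsufficientVram,
    Overloaded { retry_after_secs: u64 },
    PerPrincipalLimit { retry_after_secs: u64 },
}

/// The envelope fields a client inspects, plus the optional Retry-After.
#[derive(Debug, PartialEq)]
struct Mapped {
    status: u16,
    kind: &'static str, // becomes `error.type`
    code: &'static str, // becomes `error.code`
    retry_after_secs: Option<u64>,
}

/// Mirror of the variant-to-envelope mapping shown in the diff.
fn map_error(err: &SketchError) -> Mapped {
    let (status, kind, code, retry_after_secs) = match err {
        SketchError::ModelNotLoaded => (404, "invalid_request_error", "model_not_found", None),
        // Permanent for this request, so no retry hint.
        SketchError::PromptTooLong => {
            (400, "invalid_request_error", "context_length_exceeded", None)
        }
        // Transient: VRAM frees as in-flight requests complete.
        SketchError::InsufficientVram => (503, "api_error", "insufficient_vram", Some(5)),
        // Server-side load: 503 with the controller's hint.
        SketchError::Overloaded { retry_after_secs } => (
            503,
            "rate_limit_error",
            "rate_limit_exceeded",
            Some(*retry_after_secs),
        ),
        // Caller-side concurrency: 429 with the controller's hint.
        SketchError::PerPrincipalLimit { retry_after_secs } => (
            429,
            "rate_limit_error",
            "rate_limit_exceeded",
            Some(*retry_after_secs),
        ),
    };
    Mapped { status, kind, code, retry_after_secs }
}
```

Both backpressure variants share the `rate_limit_exceeded` code, so a client can tell server load from its own over-concurrency only by the status (`503` versus `429`).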


@@ -77,6 +77,76 @@ pub struct CandleHarnessConfig {
/// model, on architectures that support cache snapshots (qwen3_5).
#[serde(default)]
pub prefix_cache: PrefixCacheConfig,
/// Self-derived context/token limits (#67). The neuron computes the
/// most-efficient `limit{context,input,output}` that still allows
/// coherent agentic performance from model architecture + live free
/// VRAM + a self-measured throughput ceiling, advertises it on
/// `/models`, and enforces it. These knobs tune that derivation.
#[serde(default)]
pub context_limit: ContextLimitConfig,
/// Admission control (#53): bounds the per-model wait queue so a busy
/// model returns a fast, retryable `429`/`503` instead of stalling new
/// requests until their client times out.
#[serde(default)]
pub admission: AdmissionConfig,
}
/// `[harness.candle.admission]` settings (#53).
///
/// Inference is batch-1, so `max_in_flight` is 1 in practice; the queue
/// (`max_queue_depth`) absorbs short bursts, and `max_wait_secs` caps how
/// long a queued request waits before it's refused with backpressure.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AdmissionConfig {
/// Concurrent running requests per model. Batch-1 inference → 1.
#[serde(default = "default_admission_max_in_flight")]
pub max_in_flight: usize,
/// Queued (waiting) requests allowed beyond the in-flight one. The
/// `(max_in_flight + max_queue_depth + 1)`-th request is refused
/// immediately with `429`/`503` + `Retry-After`.
#[serde(default = "default_admission_max_queue_depth")]
pub max_queue_depth: usize,
/// Maximum seconds a queued request waits for the in-flight slot before
/// it is refused (turns the old ~300s client-side hang into a fast,
/// honest signal).
#[serde(default = "default_admission_max_wait_secs")]
pub max_wait_secs: u64,
/// Per-principal fair-share cap (#54): max in-flight + queued requests
/// for any single principal (resolved from the `x-helexa-*` headers
/// cortex stamps), so one client can't monopolize the queue while others
/// wait. Over-cap → `429 rate_limit_exceeded` + `Retry-After`. `0`
/// disables the cap; anonymous requests are always exempt.
#[serde(default = "default_admission_max_per_principal")]
pub max_per_principal: usize,
}
impl Default for AdmissionConfig {
fn default() -> Self {
Self {
max_in_flight: default_admission_max_in_flight(),
max_queue_depth: default_admission_max_queue_depth(),
max_wait_secs: default_admission_max_wait_secs(),
max_per_principal: default_admission_max_per_principal(),
}
}
}
fn default_admission_max_in_flight() -> usize {
1
}
fn default_admission_max_queue_depth() -> usize {
8
}
fn default_admission_max_wait_secs() -> u64 {
30
}
fn default_admission_max_per_principal() -> usize {
2
}
/// `[harness.candle.prefix_cache]` settings.
@@ -119,6 +189,94 @@ fn default_prefix_cache_max_entries() -> usize {
8
}
/// `[harness.candle.context_limit]` settings (#67).
///
/// The derived limit is `context = min(max_position_embeddings,
/// vram_ceiling, throughput_ceiling)`, then `input = context -
/// output_reserve`. `vram_ceiling` and `throughput_ceiling` read live
/// state, so the advertised/enforced limit tracks the resident model and
/// rises automatically as efficiency work (e.g. prefix caching, #11)
/// frees headroom or speeds prefill — no operator action.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ContextLimitConfig {
/// Master switch. On by default — set `false` to fall back to the
/// static `NEURON_MAX_PROMPT_TOKENS` cap with no advertised limit.
#[serde(default = "default_context_limit_enabled")]
pub enabled: bool,
/// Coherence target: the longest prefill-per-turn latency (seconds)
/// considered acceptable agentic performance. The throughput ceiling
/// is `target_prefill_latency_secs × measured_prefill_tok_per_sec`.
/// Raise it once cross-request prefix caching (#11) makes long
/// contexts cheap to re-prefill.
#[serde(default = "default_target_prefill_latency_secs")]
pub target_prefill_latency_secs: f64,
/// Cold-start prefill speed (tokens/sec) used for the throughput
/// ceiling until the model has served enough requests to measure its
/// own rate. A conservative estimate; the live EMA supersedes it.
#[serde(default = "default_bootstrap_prefill_tok_per_sec")]
pub bootstrap_prefill_tok_per_sec: f64,
/// VRAM (MiB) reserved per card for prefill activations on top of the
/// resident weights and the KV cache, before computing the VRAM
/// context ceiling.
#[serde(default = "default_activation_headroom_mb")]
pub activation_headroom_mb: u64,
/// Free-VRAM floor (MiB) kept available per card — the VRAM ceiling
/// leaves at least this much unused. Mirrors `NEURON_MIN_FREE_VRAM_MB`.
#[serde(default = "default_context_min_free_floor_mb")]
pub min_free_floor_mb: u64,
/// Generation reserve (tokens) left below the context wall:
/// `input = context - output_reserve_tokens`. Defaults to neuron's
/// default `max_tokens`.
#[serde(default = "default_output_reserve_tokens")]
pub output_reserve_tokens: usize,
}
impl Default for ContextLimitConfig {
fn default() -> Self {
Self {
enabled: default_context_limit_enabled(),
target_prefill_latency_secs: default_target_prefill_latency_secs(),
bootstrap_prefill_tok_per_sec: default_bootstrap_prefill_tok_per_sec(),
activation_headroom_mb: default_activation_headroom_mb(),
min_free_floor_mb: default_context_min_free_floor_mb(),
output_reserve_tokens: default_output_reserve_tokens(),
}
}
}
fn default_context_limit_enabled() -> bool {
true
}
fn default_target_prefill_latency_secs() -> f64 {
// ~2 min/turn is the coherence wall observed pre-#11 on beast
// (the issue's worked example). Raisable once prefix caching lands.
120.0
}
fn default_bootstrap_prefill_tok_per_sec() -> f64 {
// beast Qwen3.6-27B TP=2 measured ~850 tok/s prefill; a conservative
// floor so the cold-start ceiling isn't wildly optimistic.
800.0
}
fn default_activation_headroom_mb() -> u64 {
2048
}
fn default_context_min_free_floor_mb() -> u64 {
1500
}
fn default_output_reserve_tokens() -> usize {
8192
}
/// Per-scheme source configuration. Mirrors the shape `hf_hub::ApiBuilder`
/// needs: endpoint URL, optional auth token (read from an env var so
/// secrets stay out of the config file), and optional cache directory
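
The `ContextLimitConfig` doc comments describe the derivation as `context = min(max_position_embeddings, vram_ceiling, throughput_ceiling)` followed by `input = context - output_reserve_tokens`. A minimal worked sketch of that arithmetic follows. `derive_limits` and `Limits` are hypothetical names, and the VRAM ceiling is taken as a precomputed token count because this part of the diff does not show how it is calculated.

```rust
/// Derived limits, in tokens (illustrative shape only).
#[derive(Debug, PartialEq)]
struct Limits {
    context: usize,
    input: usize,
}

/// Sketch of the documented derivation. `vram_ceiling_tokens` is an
/// assumption: it stands in for whatever the neuron computes from free VRAM.
fn derive_limits(
    max_position_embeddings: usize,
    vram_ceiling_tokens: usize,
    prefill_tok_per_sec: f64,
    target_prefill_latency_secs: f64,
    output_reserve_tokens: usize,
) -> Limits {
    // Throughput ceiling: how many tokens can be prefilled within the
    // acceptable per-turn latency.
    let throughput_ceiling = (target_prefill_latency_secs * prefill_tok_per_sec).floor() as usize;
    // The tightest of the three ceilings wins.
    let context = max_position_embeddings
        .min(vram_ceiling_tokens)
        .min(throughput_ceiling);
    // Leave room for generation below the context wall.
    let input = context.saturating_sub(output_reserve_tokens);
    Limits { context, input }
}
```

With the defaults in the diff (120 s target latency, 800 tok/s bootstrap rate, 8192 reserved output tokens), the throughput ceiling is 120 × 800 = 96000 tokens. If the architecture and VRAM ceilings are both larger, `context` is 96000 and `input` is 96000 - 8192 = 87808.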


@@ -273,6 +273,7 @@ pub async fn discover_system() -> Result<DiscoveryResponse> {
devices,
harnesses: vec![], // populated by harness registry in Phase 8
cuda_unavailable_reason,
max_prompt_tokens: crate::harness::candle::max_prompt_tokens() as u64,
})
}


@@ -0,0 +1,298 @@
//! Per-model admission control (#53).
//!
//! Inference against a loaded model is batch-1: one request runs at a time,
//! serialized by the model's `inference_lock` (single-GPU) / `pool` mutex
//! (TP). Before this, the wait for that lock was an **unbounded FIFO of
//! mutex waiters with no timeout** — a busy model made every new request
//! hang until its client gave up (~300s) with an opaque error.
//!
//! [`AdmissionController`] replaces that implicit unbounded wait with an
//! explicit bounded scheduler: at most `max_in_flight` running (1, batch-1)
//! plus a bounded queue of `max_queue_depth` waiters, each waiting at most
//! `max_wait`. When the queue is full or the wait elapses, the request is
//! rejected *immediately* — an honest, fast, retryable "busy" signal
//! (`429`/`503` + `Retry-After` per #63) instead of a silent stall.
//!
//! The controller is pure async (no CUDA), so the inference paths just call
//! [`AdmissionController::enter`] before taking the inference lock and hold
//! the returned [`AdmissionPermit`] for the request's lifetime. Its counters
//! ([`in_flight`](AdmissionController::in_flight) /
//! [`queue_depth`](AdmissionController::queue_depth)) take at most the brief
//! accounting lock, which is never held across an await, so `/health` can
//! read live load without contending with inference.
use crate::config::AdmissionConfig;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Duration;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};
/// Why admission was refused. All map to the #63 backpressure envelope
/// (`rate_limit_exceeded` + `Retry-After`); they differ in cause (and HTTP
/// status — load → `503`, per-principal → `429`).
#[derive(Debug, Clone, Copy)]
pub enum AdmissionRejection {
/// The bounded wait queue was already full (server-side load).
QueueFull { retry_after_secs: u64 },
/// A queue slot was taken but the in-flight slot didn't free within
/// `max_wait` (server-side load).
Timeout { retry_after_secs: u64 },
/// This principal already has `max_per_principal` requests in flight or
/// queued (#54 fair-share) — one principal can't monopolize the model.
PrincipalCap { retry_after_secs: u64 },
}
impl AdmissionRejection {
pub fn retry_after_secs(&self) -> u64 {
match self {
AdmissionRejection::QueueFull { retry_after_secs }
| AdmissionRejection::Timeout { retry_after_secs }
| AdmissionRejection::PrincipalCap { retry_after_secs } => *retry_after_secs,
}
}
}
/// Admission accounting, mutated under a brief lock (never held across an
/// await). `pending` is queued + in-flight overall; `per_principal` is the
/// same count keyed by principal for fair-share (#54).
#[derive(Default, Debug)]
struct AdmissionState {
pending: usize,
per_principal: HashMap<String, usize>,
}
/// Bounded batch-1 scheduler for one loaded model, with per-principal
/// fair-share.
pub struct AdmissionController {
/// In-flight slots — `max_in_flight` permits (1 for batch-1).
slots: Arc<Semaphore>,
/// Queued + in-flight accounting (overall + per principal).
state: Arc<Mutex<AdmissionState>>,
/// `max_in_flight + max_queue_depth` — the overall rejection threshold.
max_pending: usize,
/// Max in-flight + queued for any single principal (#54). `0` disables.
max_per_principal: usize,
max_in_flight: usize,
max_wait: Duration,
}
impl AdmissionController {
pub fn new(cfg: &AdmissionConfig) -> Self {
// A controller with zero in-flight slots would deadlock; clamp.
let max_in_flight = cfg.max_in_flight.max(1);
Self {
slots: Arc::new(Semaphore::new(max_in_flight)),
state: Arc::new(Mutex::new(AdmissionState::default())),
max_pending: max_in_flight + cfg.max_queue_depth,
max_per_principal: cfg.max_per_principal,
max_in_flight,
max_wait: Duration::from_secs(cfg.max_wait_secs),
}
}
/// Admit a request for `principal` (`None` = anonymous, exempt from the
/// per-principal cap). Reserves a queue slot — fast-rejecting if the
/// overall queue is full or the principal is over its fair-share cap —
/// then waits up to `max_wait` for an in-flight slot. The returned permit
/// must be held for the request's lifetime; dropping it frees the slots.
pub async fn enter(
&self,
principal: Option<&str>,
) -> Result<AdmissionPermit, AdmissionRejection> {
// Decision + reservation under one brief lock so concurrent callers
// can't both slip past the thresholds. No await is held here.
{
let mut st = self.state.lock().expect("admission state poisoned");
if st.pending >= self.max_pending {
return Err(AdmissionRejection::QueueFull {
retry_after_secs: self.retry_hint(st.pending),
});
}
if let Some(p) = principal
&& self.max_per_principal > 0
&& st.per_principal.get(p).copied().unwrap_or(0) >= self.max_per_principal
{
return Err(AdmissionRejection::PrincipalCap {
retry_after_secs: self.retry_hint(st.pending),
});
}
st.pending += 1;
if let Some(p) = principal {
*st.per_principal.entry(p.to_string()).or_insert(0) += 1;
}
}
match tokio::time::timeout(self.max_wait, Arc::clone(&self.slots).acquire_owned()).await {
Ok(Ok(permit)) => Ok(AdmissionPermit {
_permit: permit,
state: Arc::clone(&self.state),
principal: principal.map(str::to_string),
}),
// Semaphore is never closed; treat a closed/elapsed wait the same.
Ok(Err(_)) | Err(_) => {
self.release(principal);
Err(AdmissionRejection::Timeout {
retry_after_secs: self.retry_hint(self.max_pending),
})
}
}
}
/// Roll back a reserved-but-not-admitted slot (wait timed out).
fn release(&self, principal: Option<&str>) {
let mut st = self.state.lock().expect("admission state poisoned");
st.pending = st.pending.saturating_sub(1);
decrement_principal(&mut st.per_principal, principal);
}
/// Requests currently running (holding an in-flight slot).
pub fn in_flight(&self) -> usize {
self.max_in_flight
.saturating_sub(self.slots.available_permits())
}
/// Requests waiting for an in-flight slot.
pub fn queue_depth(&self) -> usize {
let pending = self.state.lock().expect("admission state poisoned").pending;
pending.saturating_sub(self.in_flight())
}
/// Rough `Retry-After`: scale with how backed-up the model is, clamped to
/// a sane band. Without per-request timing this is a heuristic, but it
/// gives well-behaved clients (opencode/AI SDK) a sensible backoff.
fn retry_hint(&self, pending: usize) -> u64 {
let queued = pending.saturating_sub(self.max_in_flight) as u64;
((queued + 1) * 2).clamp(1, 120)
}
}
/// Decrement (and prune at zero) a principal's outstanding count.
fn decrement_principal(map: &mut HashMap<String, usize>, principal: Option<&str>) {
if let Some(p) = principal
&& let Some(count) = map.get_mut(p)
{
*count -= 1;
if *count == 0 {
map.remove(p);
}
}
}
/// Held for a request's lifetime; frees the in-flight + queue slot (and the
/// principal's fair-share slot) on drop.
#[derive(Debug)]
pub struct AdmissionPermit {
_permit: OwnedSemaphorePermit,
state: Arc<Mutex<AdmissionState>>,
principal: Option<String>,
}
impl Drop for AdmissionPermit {
fn drop(&mut self) {
let mut st = self.state.lock().expect("admission state poisoned");
st.pending = st.pending.saturating_sub(1);
decrement_principal(&mut st.per_principal, self.principal.as_deref());
}
}
#[cfg(test)]
mod tests {
use super::*;
/// Config with the per-principal cap disabled (0) — most tests exercise
/// the overall queue with anonymous (`None`) callers.
fn cfg(max_in_flight: usize, max_queue_depth: usize, max_wait_secs: u64) -> AdmissionConfig {
AdmissionConfig {
max_in_flight,
max_queue_depth,
max_wait_secs,
max_per_principal: 0,
}
}
#[tokio::test]
async fn admits_up_to_in_flight_and_reports_load() {
let ctrl = AdmissionController::new(&cfg(1, 4, 30));
assert_eq!(ctrl.in_flight(), 0);
let p = ctrl.enter(None).await.expect("first admits");
assert_eq!(ctrl.in_flight(), 1);
assert_eq!(ctrl.queue_depth(), 0);
drop(p);
assert_eq!(ctrl.in_flight(), 0);
}
#[tokio::test]
async fn rejects_when_queue_full() {
// 1 in-flight + 1 queue slot = capacity 2; the 3rd is refused fast.
let ctrl = Arc::new(AdmissionController::new(&cfg(1, 1, 30)));
let _running = ctrl.enter(None).await.expect("admit running");
// Fill the single queue slot with a waiter that parks on the semaphore.
let ctrl2 = Arc::clone(&ctrl);
let waiter = tokio::spawn(async move { ctrl2.enter(None).await.map(|p| drop(p)) });
// Give the waiter a moment to occupy the queue slot.
tokio::time::sleep(Duration::from_millis(50)).await;
assert_eq!(ctrl.queue_depth(), 1);
// Queue full → immediate QueueFull with a Retry-After hint.
match ctrl.enter(None).await {
Err(AdmissionRejection::QueueFull { retry_after_secs }) => {
assert!(retry_after_secs >= 1)
}
other => panic!("expected QueueFull, got {other:?}"),
}
// Release the runner so the parked waiter can proceed and finish.
drop(_running);
waiter.await.unwrap().unwrap();
}
#[tokio::test]
async fn rejects_on_wait_timeout() {
// Zero queue depth + a runner holding the only slot → a second
// request can't even queue, so it's QueueFull, not Timeout. Use a
// queue of 1 and a tiny max_wait to exercise the timeout path.
let ctrl = Arc::new(AdmissionController::new(&cfg(1, 1, 0)));
let _running = ctrl.enter(None).await.expect("admit running");
// max_wait 0 → the queued request times out almost immediately.
match ctrl.enter(None).await {
Err(AdmissionRejection::Timeout { .. }) => {}
other => panic!("expected Timeout, got {other:?}"),
}
// The timed-out request released its queue slot.
assert_eq!(ctrl.queue_depth(), 0);
}
#[tokio::test]
async fn per_principal_cap_protects_other_principals() {
// Generous overall queue, but each principal capped at 1 in-flight+
// queued. Principal A holds the running slot; A's second request is
// refused (PrincipalCap) rather than occupying the queue, so B's
// single request still gets a queue slot and proceeds.
let cfg = AdmissionConfig {
max_in_flight: 1,
max_queue_depth: 8,
max_wait_secs: 30,
max_per_principal: 1,
};
let ctrl = Arc::new(AdmissionController::new(&cfg));
let _a1 = ctrl.enter(Some("acct-a/key-a")).await.expect("A admits");
// A is over its fair-share cap → fast PrincipalCap, no queue slot taken.
match ctrl.enter(Some("acct-a/key-a")).await {
Err(AdmissionRejection::PrincipalCap { retry_after_secs }) => {
assert!(retry_after_secs >= 1)
}
other => panic!("expected PrincipalCap, got {other:?}"),
}
// B (a different principal) is admitted to the queue and proceeds
// once A releases — it was never stuck behind A's backlog.
let ctrl2 = Arc::clone(&ctrl);
let b = tokio::spawn(async move { ctrl2.enter(Some("acct-b/key-b")).await.map(drop) });
tokio::time::sleep(Duration::from_millis(50)).await;
assert_eq!(ctrl.queue_depth(), 1, "B is queued, not rejected");
drop(_a1);
b.await.unwrap().expect("B is served after A releases");
}
}
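
The reservation step in `enter` and the `retry_hint` heuristic can be exercised without tokio. The sketch below is a synchronous, std-only reduction of that accounting: it keeps the two fast-reject checks, the per-principal map, and the hint formula, and it drops the semaphore wait entirely. `Accounting`, `Rejection`, and `reserve` are hypothetical names for this illustration and are not the real `AdmissionController` API.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Rejection {
    QueueFull { retry_after_secs: u64 },
    PrincipalCap { retry_after_secs: u64 },
}

/// Queued + in-flight accounting, overall and per principal.
struct Accounting {
    pending: usize,
    per_principal: HashMap<String, usize>,
    max_in_flight: usize,
    max_pending: usize,
    max_per_principal: usize, // 0 disables the cap
}

impl Accounting {
    fn new(max_in_flight: usize, max_queue_depth: usize, max_per_principal: usize) -> Self {
        let max_in_flight = max_in_flight.max(1);
        Self {
            pending: 0,
            per_principal: HashMap::new(),
            max_in_flight,
            max_pending: max_in_flight + max_queue_depth,
            max_per_principal,
        }
    }

    /// Same formula as the diff: two seconds per queued request, clamped.
    fn retry_hint(&self, pending: usize) -> u64 {
        let queued = pending.saturating_sub(self.max_in_flight) as u64;
        ((queued + 1) * 2).clamp(1, 120)
    }

    /// Decide and reserve in one step; `None` is anonymous and cap-exempt.
    fn reserve(&mut self, principal: Option<&str>) -> Result<(), Rejection> {
        if self.pending >= self.max_pending {
            return Err(Rejection::QueueFull { retry_after_secs: self.retry_hint(self.pending) });
        }
        if let Some(p) = principal {
            let held = self.per_principal.get(p).copied().unwrap_or(0);
            if self.max_per_principal > 0 && held >= self.max_per_principal {
                return Err(Rejection::PrincipalCap {
                    retry_after_secs: self.retry_hint(self.pending),
                });
            }
            *self.per_principal.entry(p.to_string()).or_insert(0) += 1;
        }
        self.pending += 1;
        Ok(())
    }

    /// Undo one reservation, pruning the principal entry at zero.
    fn release(&mut self, principal: Option<&str>) {
        self.pending = self.pending.saturating_sub(1);
        if let Some(p) = principal {
            if let Some(count) = self.per_principal.get_mut(p) {
                *count -= 1;
                if *count == 0 {
                    self.per_principal.remove(p);
                }
            }
        }
    }
}
```

With the default config (`max_in_flight = 1`, `max_queue_depth = 8`, `max_per_principal = 2`), a principal's third concurrent request is refused while two are outstanding, and a different principal can still reserve a slot. The hint for that refusal is `((2 - 1) + 1) * 2 = 4` seconds.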


@@ -195,7 +195,7 @@ mod tests {
// mutates — same justification as the real loader.
let vb = unsafe {
candle_nn::var_builder::ShardedSafeTensors::var_builder(
&[path.clone()],
std::slice::from_ref(&path),
DType::F32,
&dev,
)

File diff suppressed because it is too large


@@ -221,7 +221,7 @@ pub fn render_chat_template(
// becomes a string; Parts becomes an array of content blocks.
// The HF templates handle both shapes via `content is string`
// checks or content-array iteration.
let messages_json: Vec<Value> = messages
let mut messages_json: Vec<Value> = messages
.iter()
.map(|m| {
let content_value = match &m.content {
@@ -243,6 +243,12 @@ pub fn render_chat_template(
})
.collect();
// OpenAI clients (opencode, the OpenAI SDK) carry tool-call
// `arguments` as a JSON *string*; Qwen3.6's template iterates it as a
// dict, so normalise string args to objects before rendering. Without
// this, `chat_template:120` errors "cannot convert value into pairs".
normalize_tool_call_arguments(&mut messages_json);
// Build the kwargs context. Add base bindings the template
// expects (`messages`, `add_generation_prompt`, `tools`) plus
// anything the caller passed in `chat_template_kwargs`. Caller
@@ -267,6 +273,37 @@ pub fn render_chat_template(
.context("render chat_template")
}
/// Normalize OpenAI-style tool-call `arguments` from JSON strings to
/// objects, in place, across all messages.
///
/// The OpenAI wire format carries `tool_calls[].function.arguments` as a
/// JSON *string*; HF chat templates (Qwen3.6 at `chat_template:120`)
/// iterate it as a dict (`arguments | items`), which throws "cannot
/// convert value into pairs" on a string. Parsing string args into the
/// object the template expects lets OpenAI and Anthropic clients both
/// render. A string that doesn't parse is left untouched — the render
/// then fails loudly rather than silently (see
/// `InferenceError::TemplateRenderFailed`).
fn normalize_tool_call_arguments(messages: &mut [Value]) {
for msg in messages {
let Some(tool_calls) = msg.get_mut("tool_calls").and_then(Value::as_array_mut) else {
continue;
};
for tc in tool_calls {
let Some(func) = tc.get_mut("function").and_then(Value::as_object_mut) else {
continue;
};
let parsed = match func.get("arguments") {
Some(Value::String(s)) => serde_json::from_str::<Value>(s).ok(),
_ => None,
};
if let Some(p) = parsed {
func.insert("arguments".into(), p);
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
@@ -559,4 +596,40 @@ THINK_OK\
let rendered = render_chat_template(template, &[msg], &Value::Null, &Value::Null).unwrap();
assert_eq!(rendered, "t1");
}
#[test]
fn normalizes_openai_string_tool_call_arguments_to_object() {
// The opencode / OpenAI-SDK shape: arguments as a JSON string.
let mut messages = vec![json!({
"role": "assistant",
"tool_calls": [{
"id": "c1", "type": "function",
"function": {"name": "Read", "arguments": "{\"path\":\"/x\"}"}
}]
})];
normalize_tool_call_arguments(&mut messages);
assert_eq!(
messages[0]["tool_calls"][0]["function"]["arguments"],
json!({"path": "/x"}),
"string args must become the object the template iterates"
);
}
#[test]
fn leaves_object_args_and_non_tool_messages_untouched() {
let mut messages = vec![
json!({"role": "user", "content": "hi"}),
json!({"role": "assistant", "tool_calls": [
{"function": {"name": "f", "arguments": {"a": 1}}}
]}),
];
normalize_tool_call_arguments(&mut messages);
// Already-object args pass through unchanged (Anthropic path).
assert_eq!(
messages[1]["tool_calls"][0]["function"]["arguments"],
json!({"a": 1})
);
// Ordinary messages are not disturbed.
assert_eq!(messages[0]["content"], "hi");
}
}

View File

@@ -0,0 +1,366 @@
//! Self-derived context/token limits (#67).
//!
//! The correct `limit{context,input,output}` for a deployment is not a
//! static fact an operator should memorise — it's a computed function of
//! things the neuron already knows better than any operator:
//!
//! - **model architecture** — `max_position_embeddings` and the
//! KV-cost-per-token implied by the attention layout;
//! - **live free VRAM** on the tightest card the model occupies, after
//! weights and an activation reserve;
//! - the **coherence/throughput trade-off** — "biggest that fits VRAM"
//! is not "biggest that's usable": with no cross-request KV reuse every
//! turn re-prefills the whole context, so there's a usable ceiling
//! below the VRAM ceiling (it rises as prefix caching / #11 lands).
//!
//! This module is the arch-agnostic physics + policy. Each arch's load
//! path builds a [`ContextProfile`] (the physics) via
//! [`kv_bytes_per_token`]; [`derive_limit`] applies the policy against
//! live VRAM + a self-measured prefill rate + [`ContextLimitConfig`].
//! qwen3_5 is the only arch wired today; a future standard
//! full-attention model is the simpler case (`n_full_attn_layers =
//! n_layers`) and drops in by constructing a `ContextProfile`.
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;
use cortex_core::harness::ModelLimit;
use crate::config::ContextLimitConfig;
/// EMA smoothing factor for the prefill-rate sample. Low enough that one
/// anomalous turn (a contended GPU, a cold cache) doesn't swing the
/// advertised limit, high enough to track a real shift (e.g. prefix
/// caching, #11, dropping effective prefill cost) within a few turns.
const PREFILL_EMA_ALPHA: f64 = 0.3;
/// Self-measured prefill throughput for one loaded model, as an
/// exponential moving average of tokens/sec (#67). Updated at the end of
/// each streaming request's prefill phase, read when deriving the
/// throughput ceiling. Lock-free: prefill is serialised per model (the
/// `inference_lock`), and the limit reader only needs a recent value.
/// Stores the f64 rate as raw bits; `0` means "no sample yet" → callers
/// fall back to the configured bootstrap estimate.
#[derive(Debug)]
pub struct PrefillRateEma {
bits: AtomicU64,
}
impl PrefillRateEma {
pub const fn new() -> Self {
Self {
bits: AtomicU64::new(0),
}
}
/// Fold one prefill measurement (`prompt_tokens` processed in
/// `elapsed`) into the EMA. No-op for degenerate inputs so a probe
/// request or a clock blip can't poison the average.
pub fn record(&self, prompt_tokens: usize, elapsed: Duration) {
let secs = elapsed.as_secs_f64();
if prompt_tokens == 0 || secs <= 0.0 {
return;
}
let sample = prompt_tokens as f64 / secs;
if !sample.is_finite() || sample <= 0.0 {
return;
}
let prev = f64::from_bits(self.bits.load(Ordering::Acquire));
let next = if prev > 0.0 {
PREFILL_EMA_ALPHA * sample + (1.0 - PREFILL_EMA_ALPHA) * prev
} else {
sample
};
self.bits.store(next.to_bits(), Ordering::Release);
}
/// The current measured rate (tokens/sec), or `None` before the
/// first sample lands.
pub fn get(&self) -> Option<f64> {
let v = f64::from_bits(self.bits.load(Ordering::Acquire));
(v.is_finite() && v > 0.0).then_some(v)
}
}
impl Default for PrefillRateEma {
fn default() -> Self {
Self::new()
}
}
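To make the "within a few turns" claim for `PREFILL_EMA_ALPHA` concrete, here is a minimal, std-only sketch of how quickly the average follows a step change in prefill rate. `ema_step` and `turns_to_track` are hypothetical helpers written for illustration, not part of the module; only the update formula and the 0.3 factor come from the code above.

```rust
/// One EMA update, the same formula `PrefillRateEma::record` applies.
fn ema_step(prev: f64, sample: f64, alpha: f64) -> f64 {
    alpha * sample + (1.0 - alpha) * prev
}

/// Hypothetical helper: number of identical `target` samples needed before
/// the EMA sits within `tolerance` (a fraction, e.g. 0.10) of `target`.
/// Capped at 1000 iterations so a bad `alpha` cannot loop forever.
fn turns_to_track(start: f64, target: f64, alpha: f64, tolerance: f64) -> u32 {
    let mut value = start;
    let mut turns = 0;
    while turns < 1000 && ((value - target) / target).abs() > tolerance {
        value = ema_step(value, target, alpha);
        turns += 1;
    }
    turns
}
```

With alpha = 0.3, a rate that doubles from 1000 to 2000 tok/s is tracked to within 10% after five samples, while a single outlier moves the average by only 30% of its deviation.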
/// Bytes per element of the KV cache. qwen3_5 keeps K/V in the model's
/// f16/bf16 compute dtype regardless of weight quantisation (ISQ
/// quantises weights, not the cache), so this is 2 for every supported
/// load. Matches the per-rank logging math in the TP load paths.
pub const KV_CACHE_DTYPE_BYTES: usize = 2;
/// Bytes of KV cache one token adds **per card**, counting only the
/// full-attention layers (linear/recurrent layers carry fixed-size
/// state, not a growing cache). Sharded across the TP world: per-rank
/// KV-head count is `n_kv_heads / world_size`.
///
/// `2 ×` accounts for K and V. Shared by the limit derivation here and
/// the per-rank load-time logging in the TP paths (and, in future, by
/// #65's length-aware pre-flight guard).
pub fn kv_bytes_per_token(
n_full_attn_layers: usize,
n_kv_heads: usize,
head_dim: usize,
dtype_bytes: usize,
world_size: u32,
) -> u64 {
let per_rank_kv_heads = (n_kv_heads / world_size.max(1) as usize).max(1);
(2 * n_full_attn_layers * per_rank_kv_heads * head_dim * dtype_bytes) as u64
}
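As a worked example of the per-card formula, the sketch below restates it in self-contained form and converts a token count into a cache footprint. `kv_cache_mib` is a hypothetical helper, and the inputs are the beast Qwen3.6-27B figures quoted in this module's tests (16 full-attention layers, 4 KV heads, head_dim 256, 2-byte elements).

```rust
/// Same arithmetic as `kv_bytes_per_token`: K and V (the leading 2), per
/// full-attention layer, per KV head held by this rank.
fn kv_bytes_per_token(
    n_full_attn_layers: usize,
    n_kv_heads: usize,
    head_dim: usize,
    dtype_bytes: usize,
    world_size: u32,
) -> u64 {
    let per_rank_kv_heads = (n_kv_heads / world_size.max(1) as usize).max(1);
    (2 * n_full_attn_layers * per_rank_kv_heads * head_dim * dtype_bytes) as u64
}

/// Hypothetical helper: MiB of KV cache one card holds for `tokens` of context.
fn kv_cache_mib(tokens: u64, bytes_per_token_per_card: u64) -> u64 {
    tokens.saturating_mul(bytes_per_token_per_card) / (1024 * 1024)
}
```

At TP=2 a 101,376-token context costs 3168 MiB of KV cache per card, and the full 262,144-token window would need 8192 MiB.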
/// Per-model physics needed to derive a context limit, captured at load
/// time (the arch config is consumed during model construction, so the
/// relevant numbers are snapshotted into this struct). Arch-agnostic:
/// the hybrid qwen3_5 case counts only its full-attention layers; a
/// standard transformer would pass `n_full_attn_layers = n_layers`.
#[derive(Debug, Clone, Copy)]
pub struct ContextProfile {
/// The model's native context ceiling (quality wall).
pub max_position_embeddings: usize,
/// KV bytes added per token, per card — from [`kv_bytes_per_token`].
pub kv_bytes_per_token_per_card: u64,
/// Tensor-parallel world size the model is loaded with (1 = single GPU).
pub world_size: u32,
}
/// Build a [`ContextProfile`] from a qwen3_5 `config.json` on disk
/// (mirrors `VisionMeta::from_config_path`). Returns `None` for any other
/// `model_type` or an unparseable config — those arches fall back to the
/// static prompt cap with no advertised limit. `world_size` is the TP
/// degree the model is loaded with (1 = single GPU).
///
/// KV grows only on full-attention layers; `layer_types` is authoritative
/// (every entry is `"full_attention"` or `"linear_attention"`), with the
/// `full_attention_interval` hint as a fallback when the array is absent.
pub fn profile_from_qwen3_5_config(config_path: &Path, world_size: u32) -> Option<ContextProfile> {
let text = std::fs::read_to_string(config_path).ok()?;
let model_type = serde_json::from_str::<serde_json::Value>(&text)
.ok()?
.get("model_type")?
.as_str()?
.to_owned();
if model_type != super::arch::qwen3_5::MODEL_TYPE {
return None;
}
let cfg: super::arch::qwen3_5::Config = serde_json::from_str(&text).ok()?;
let tc = &cfg.text_config;
let n_full_attn_layers = {
let counted = tc
.layer_types
.iter()
.filter(|t| t.as_str() == "full_attention")
.count();
if counted > 0 {
counted
} else {
// layer_types absent — derive from the interval hint.
let interval = tc.full_attention_interval.unwrap_or(4).max(1);
tc.num_hidden_layers / interval
}
};
let kv_bytes_per_token_per_card = kv_bytes_per_token(
n_full_attn_layers,
tc.num_key_value_heads,
tc.head_dim,
KV_CACHE_DTYPE_BYTES,
world_size,
);
Some(ContextProfile {
max_position_embeddings: tc.max_position_embeddings,
kv_bytes_per_token_per_card,
world_size,
})
}
/// Round a token count down to a clean boundary so the advertised limit
/// doesn't jitter by a handful of tokens as live VRAM / the throughput
/// EMA wobble between polls.
fn round_down(tokens: usize, granularity: usize) -> usize {
if granularity == 0 {
return tokens;
}
(tokens / granularity) * granularity
}
const CONTEXT_GRANULARITY: usize = 1024;
/// Derive `limit{context,input,output}` for a loaded model.
///
/// ```text
/// output = output_reserve_tokens
/// vram_ceiling = (free_tightest - activation_headroom - min_free_floor) / kv_bytes_per_token_per_card
/// throughput_ceiling = target_prefill_latency_secs × prefill_tok_per_sec
/// context = min(max_position_embeddings, vram_ceiling, throughput_ceiling) [clamped by `hard_ceiling` if set]
/// input = context - output
/// ```
///
/// `free_tightest_mb` is the minimum free VRAM (MiB) across the model's
/// devices — the tightest card, which on a TP model is often a
/// non-leader rank. `prefill_tok_per_sec` is the model's self-measured
/// prefill rate (or a bootstrap estimate before the first sample).
/// `hard_ceiling` is an optional clamp-only backstop
/// (`NEURON_MAX_PROMPT_TOKENS` or a catalogue override); `None` = no clamp.
///
/// `reasoning`: `input = context - output` keeps a generation reserve
/// below the wall; `output` (the reserve) is a *sub-budget* of context,
/// matching opencode's compaction model.
pub fn derive_limit(
profile: &ContextProfile,
free_tightest_mb: u64,
prefill_tok_per_sec: f64,
hard_ceiling: Option<usize>,
cfg: &ContextLimitConfig,
) -> ModelLimit {
let output = cfg.output_reserve_tokens;
// VRAM ceiling — what actually fits, from live free VRAM. A zero
// `free_tightest_mb` is the "unknown / no-context sentinel" (CPU
// build, or a failed per-rank query) → VRAM imposes no ceiling, the
// other terms bind, rather than collapsing the limit to zero.
let vram_ceiling = if free_tightest_mb == 0 {
usize::MAX
} else {
let reserved_mb = cfg
.activation_headroom_mb
.saturating_add(cfg.min_free_floor_mb);
let avail_bytes = free_tightest_mb
.saturating_sub(reserved_mb)
.saturating_mul(1024 * 1024);
// `checked_div` yields `None` for a degenerate zero-KV profile
// (e.g. no full-attention layers) → VRAM imposes no ceiling.
avail_bytes
.checked_div(profile.kv_bytes_per_token_per_card)
.map_or(usize::MAX, |t| t as usize)
};
// Throughput ceiling — usable, not just fittable. Fall back to the
// bootstrap estimate until the model has measured its own rate.
let tok_per_sec = if prefill_tok_per_sec.is_finite() && prefill_tok_per_sec > 0.0 {
prefill_tok_per_sec
} else {
cfg.bootstrap_prefill_tok_per_sec
};
let throughput_ceiling = (cfg.target_prefill_latency_secs * tok_per_sec).max(0.0) as usize;
let mut context = profile
.max_position_embeddings
.min(vram_ceiling)
.min(throughput_ceiling);
if let Some(clamp) = hard_ceiling {
context = context.min(clamp);
}
context = round_down(context, CONTEXT_GRANULARITY);
let input = context.saturating_sub(output);
ModelLimit {
context,
input: Some(input),
output,
}
}
#[cfg(test)]
mod tests {
use super::*;
/// beast Qwen3.6-27B: 16 full-attn layers, 4 kv heads, head_dim 256,
/// f16 (2 B), TP=2 → 64 KiB/token total, 32 KiB/token/card.
fn beast_profile() -> ContextProfile {
let kv = kv_bytes_per_token(16, 4, 256, 2, 2);
ContextProfile {
max_position_embeddings: 262144,
kv_bytes_per_token_per_card: kv,
world_size: 2,
}
}
#[test]
fn kv_bytes_matches_hand_derivation() {
// 2 × 16 × (4/2) × 256 × 2 = 32 KiB per card.
assert_eq!(kv_bytes_per_token(16, 4, 256, 2, 2), 32 * 1024);
// Single-GPU (world=1) doubles the per-card cost: 64 KiB.
assert_eq!(kv_bytes_per_token(16, 4, 256, 2, 1), 64 * 1024);
}
#[test]
fn throughput_ceiling_binds_pre_prefix_cache() {
// ~850 tok/s × 120 s ≈ 102k → the coherence wall binds below the
// VRAM ceiling on beast pre-#11. VRAM (9254 MiB free) allows far
// more, max_position_embeddings is 262144, so throughput wins.
let cfg = ContextLimitConfig::default();
let limit = derive_limit(&beast_profile(), 9254, 850.0, None, &cfg);
// 120 × 850 = 102000 → rounded down to a multiple of 1024 → 101376 (99 × 1024).
assert_eq!(limit.context, 101376);
assert_eq!(limit.output, 8192);
assert_eq!(limit.input, Some(101376 - 8192));
assert!(limit.input.unwrap() < limit.context);
}
#[test]
fn faster_prefill_raises_the_limit() {
// Prefix caching (#11) speeds effective prefill → ceiling rises,
// eventually pinned by VRAM / max_position_embeddings.
let cfg = ContextLimitConfig::default();
let slow = derive_limit(&beast_profile(), 9254, 850.0, None, &cfg);
let fast = derive_limit(&beast_profile(), 9254, 8500.0, None, &cfg);
assert!(fast.context > slow.context);
}
#[test]
fn tighter_vram_lowers_the_limit() {
// Same model, less free VRAM → VRAM ceiling binds below throughput.
let cfg = ContextLimitConfig::default();
let roomy = derive_limit(&beast_profile(), 9254, 8500.0, None, &cfg);
let tight = derive_limit(&beast_profile(), 2600, 8500.0, None, &cfg);
assert!(tight.context < roomy.context);
}
#[test]
fn hard_ceiling_clamps_only_downward() {
let cfg = ContextLimitConfig::default();
// A backstop below the derived value clamps it.
let clamped = derive_limit(&beast_profile(), 9254, 8500.0, Some(49152), &cfg);
assert_eq!(clamped.context, 49152);
// A backstop above the derived value is a no-op.
let unclamped = derive_limit(&beast_profile(), 9254, 850.0, Some(200000), &cfg);
assert_eq!(unclamped.context, 101376);
}
#[test]
fn prefill_ema_tracks_and_ignores_degenerate_samples() {
let ema = PrefillRateEma::new();
assert_eq!(ema.get(), None);
// First real sample seeds the average exactly.
ema.record(1000, Duration::from_secs(1));
assert_eq!(ema.get(), Some(1000.0));
// Degenerate inputs are ignored (no poisoning).
ema.record(0, Duration::from_secs(1));
ema.record(1000, Duration::from_secs(0));
assert_eq!(ema.get(), Some(1000.0));
// A faster sample pulls the EMA up but is smoothed (alpha 0.3):
// 0.3*2000 + 0.7*1000 = 1300.
ema.record(2000, Duration::from_secs(1));
assert!((ema.get().unwrap() - 1300.0).abs() < 1e-6);
}
#[test]
fn zero_kv_cost_falls_back_to_other_ceilings() {
// A degenerate profile (no full-attn layers) must not divide by
// zero — VRAM ceiling becomes unbounded, others still apply.
let profile = ContextProfile {
max_position_embeddings: 32768,
kv_bytes_per_token_per_card: 0,
world_size: 1,
};
let cfg = ContextLimitConfig::default();
let limit = derive_limit(&profile, 8000, 8500.0, None, &cfg);
// max_position_embeddings (32768) binds below throughput (~1.02M).
assert_eq!(limit.context, 32768);
}
}
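The derivation in this file can be sketched end to end without the crate's types. This is a simplified, std-only restatement, not the module's implementation: `Ceiling` and `derive_context` are hypothetical names, the hard-ceiling clamp is left out, and the 1536 MiB reserve used in the usage note is an assumed figure, since the defaults for `activation_headroom_mb` and `min_free_floor_mb` are not shown here.

```rust
/// Which term ended up limiting the context (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
enum Ceiling {
    Model,
    Vram,
    Throughput,
}

/// Simplified form of the policy: take the smallest of the three ceilings,
/// then round down to a multiple of 1024 tokens.
fn derive_context(
    max_position_embeddings: u64,
    free_tightest_mb: u64, // 0 = unknown, VRAM imposes no ceiling
    reserved_mb: u64,      // activation headroom + free floor (assumed value)
    kv_bytes_per_token_per_card: u64,
    target_prefill_latency_secs: f64,
    prefill_tok_per_sec: f64,
) -> (u64, Ceiling) {
    let vram = if free_tightest_mb == 0 {
        u64::MAX
    } else {
        free_tightest_mb
            .saturating_sub(reserved_mb)
            .saturating_mul(1024 * 1024)
            .checked_div(kv_bytes_per_token_per_card)
            .unwrap_or(u64::MAX)
    };
    let throughput = (target_prefill_latency_secs * prefill_tok_per_sec).max(0.0) as u64;

    // Start from the model's native window and let tighter ceilings replace it.
    let (mut context, mut binding) = (max_position_embeddings, Ceiling::Model);
    if vram < context {
        (context, binding) = (vram, Ceiling::Vram);
    }
    if throughput < context {
        (context, binding) = (throughput, Ceiling::Throughput);
    }
    ((context / 1024) * 1024, binding)
}
```

With the assumed 1536 MiB reserve, 9254 MiB free at 850 tok/s gives 101,376 tokens bound by throughput; dropping free VRAM to 2600 MiB at 8500 tok/s gives 33,792 tokens bound by VRAM.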

View File

@@ -1,8 +1,10 @@
//! Harness registry — maps harness names to trait implementations.
pub mod admission;
pub mod arch;
pub mod candle;
pub mod chat_template;
pub mod context_limit;
pub mod device_worker;
pub mod prefix_cache;
pub mod preflight;

View File

@@ -1025,6 +1025,44 @@ impl WorkerPool {
Ok(())
}
/// Minimum free VRAM (MiB) across every rank's device — the tightest
/// card, which on a TP model is often a non-leader rank (e.g. beast
/// GPU 1). Used to derive the context limit (#67) against what
/// actually fits, not just the leader's headroom. Returns 0 if any
/// rank reports the CPU/no-context sentinel, so the caller can treat
/// it as "unknown" and skip the VRAM ceiling.
#[cfg(feature = "cuda")]
pub async fn query_vram_tightest_free_mb(
&mut self,
leader_handle: super::device_worker::TpHandle,
) -> Result<u64> {
for w in &mut self.workers {
w.send_only(&WorkerRequest::QueryVram).await?;
}
// Leader (rank 0) via its in-process device worker — same
// `mem_get_info` the subprocess ranks run, on the leader's
// context-owning thread.
let (leader_free_mb, _leader_total) = self
.leader_worker
.query_vram()
.await
.map_err(|e| anyhow::anyhow!("leader query_vram: {e}"))?;
let mut frees = vec![leader_free_mb];
let worker_errors = drain_workers(&mut self.workers, |r| match r {
WorkerResponse::VramInfo { free_mb, .. } => {
frees.push(free_mb);
Ok(())
}
WorkerResponse::Error { kind, message } => Err(format!("[{kind}]: {message}")),
other => Err(format!("expected VramInfo, got {other:?}")),
})
.await;
if !worker_errors.is_empty() {
anyhow::bail!("QueryVram: {}", worker_errors.join("; "));
}
Ok(frees.into_iter().min().unwrap_or(0))
}
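The reduction at the end of `query_vram_tightest_free_mb` is small enough to show in isolation. The sketch below is std-only; `bytes_to_mib` and `tightest_free_mb` are hypothetical helper names, and the per-rank values are made up for illustration.

```rust
/// Floor a byte count to whole MiB, as the `VramInfo` reply does.
fn bytes_to_mib(bytes: usize) -> u64 {
    (bytes / (1024 * 1024)) as u64
}

/// Minimum free MiB across ranks. An empty slice, or any rank reporting 0,
/// yields 0, which the limit derivation treats as "unknown".
fn tightest_free_mb(per_rank_free_mb: &[u64]) -> u64 {
    per_rank_free_mb.iter().copied().min().unwrap_or(0)
}
```

Because the minimum is taken over every rank, a single tight non-leader card lowers the derived limit even when the leader has room to spare.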
/// Capture every rank's cache state as one prefix snapshot (#11)
/// stored under `snapshot_id` (minted by the caller). All ranks
/// are at the same token boundary — step fan-out is synchronous —

View File

@@ -138,6 +138,12 @@ pub enum WorkerRequest {
/// was present.
DropKvSnapshot { model_id: String, snapshot_id: u64 },
/// Query this rank's live device VRAM as `(free_mb, total_mb)`.
/// Non-mutating; replies `VramInfo`. Used to derive the context
/// limit (#67) against the tightest card across ranks — a non-leader
/// card is often tighter than the leader's.
QueryVram,
/// Drop this rank's shard for the given model. Releases the VRAM
/// the shard's weights occupied; subsequent `GenerateStep` calls
/// against the same `model_id` return an `Error`.
@@ -186,6 +192,9 @@ pub enum WorkerResponse {
/// Reply to `ClearKvCache`. Empty payload.
KvCacheCleared,
/// Reply to `QueryVram`. This rank's device VRAM in MiB.
VramInfo { free_mb: u64, total_mb: u64 },
/// Reply to `SnapshotKvCache`. Carries this rank's snapshot size
/// in bytes so the leader can budget-account the whole fleet's
/// footprint (shards are symmetric, so leader bytes × world_size

View File

@@ -634,8 +634,15 @@ fn log_construction_complete(cfg: &Config, rank: u32, world_size: u32, device: &
// contributes. Knowing per-token bytes lets the operator estimate
// headroom for a given prompt length before hitting an edge.
let per_rank_num_kv_heads = (cfg.num_key_value_heads / world_size as usize).max(1);
let kv_bytes_per_token_per_layer = per_rank_num_kv_heads * cfg.head_dim * 2 * 2;
let kv_bytes_per_token = kv_bytes_per_token_per_layer * cfg.num_hidden_layers;
// Vanilla Qwen3 is dense attention end-to-end, so every layer
// contributes KV. Shared helper (#67) — also drives the derived limit.
let kv_bytes_per_token = crate::harness::context_limit::kv_bytes_per_token(
cfg.num_hidden_layers,
cfg.num_key_value_heads,
cfg.head_dim,
crate::harness::context_limit::KV_CACHE_DTYPE_BYTES,
world_size,
);
tracing::info!(
target: "neuron::tp::load",
rank,
@@ -658,8 +665,15 @@ fn log_construction_complete(cfg: &Config, rank: u32, world_size: u32, device: &
#[cfg(not(feature = "cuda"))]
fn log_construction_complete(cfg: &Config, rank: u32, world_size: u32, _device: &Device) {
let per_rank_num_kv_heads = (cfg.num_key_value_heads / world_size as usize).max(1);
let kv_bytes_per_token_per_layer = per_rank_num_kv_heads * cfg.head_dim * 2 * 2;
let kv_bytes_per_token = kv_bytes_per_token_per_layer * cfg.num_hidden_layers;
// Vanilla Qwen3 is dense attention end-to-end, so every layer
// contributes KV. Shared helper (#67) — also drives the derived limit.
let kv_bytes_per_token = crate::harness::context_limit::kv_bytes_per_token(
cfg.num_hidden_layers,
cfg.num_key_value_heads,
cfg.head_dim,
crate::harness::context_limit::KV_CACHE_DTYPE_BYTES,
world_size,
);
tracing::info!(
target: "neuron::tp::load",
rank,

View File

@@ -1659,8 +1659,16 @@ fn log_construction_complete(
// sharded across world_size. Linear-attention layers carry a
// fixed-size state instead of a growing cache.
let per_rank_num_kv_heads = (cfg.num_key_value_heads / world_size as usize).max(1);
let kv_bytes_per_token_per_layer = per_rank_num_kv_heads * cfg.head_dim * 2 /* K+V */ * 2 /* bf16 */;
let kv_bytes_per_token = kv_bytes_per_token_per_layer * full_attn_layers;
// Only full-attention layers grow a KV cache (linear layers carry a
// fixed-size recurrent state). Shared helper (#67) — the same
// per-card math drives the derived context limit.
let kv_bytes_per_token = crate::harness::context_limit::kv_bytes_per_token(
full_attn_layers,
cfg.num_key_value_heads,
cfg.head_dim,
crate::harness::context_limit::KV_CACHE_DTYPE_BYTES,
world_size,
);
tracing::info!(
target: "neuron::tp::load",
rank,

View File

@@ -261,11 +261,38 @@ impl WorkerState {
model_id,
snapshot_id,
} => self.handle_drop_kv_snapshot(&model_id, snapshot_id),
WorkerRequest::QueryVram => self.handle_query_vram(),
WorkerRequest::UnloadModel { model_id } => self.handle_unload_model(&model_id),
WorkerRequest::Shutdown => WorkerResponse::Bye,
}
}
/// This rank's live device VRAM. `mem_get_info` reports the device
/// whose CUDA context is current on this worker thread — which is
/// exactly this rank's device. Used to derive the context limit
/// against the tightest card across ranks (#67).
#[cfg(feature = "cuda")]
fn handle_query_vram(&self) -> WorkerResponse {
match candle_core::cuda::cudarc::driver::result::mem_get_info() {
Ok((free, total)) => WorkerResponse::VramInfo {
free_mb: (free / (1024 * 1024)) as u64,
total_mb: (total / (1024 * 1024)) as u64,
},
Err(e) => WorkerResponse::Error {
kind: "vram_query_failed".into(),
message: format!("mem_get_info: {e:?}"),
},
}
}
#[cfg(not(feature = "cuda"))]
fn handle_query_vram(&self) -> WorkerResponse {
WorkerResponse::Error {
kind: "cuda_feature_not_enabled".into(),
message: "QueryVram requires --features cuda".into(),
}
}
#[cfg(feature = "cuda")]
fn handle_load_dense_shard(
&mut self,

View File

@@ -30,6 +30,9 @@ impl HealthCache {
// direct read from the cache stays a well-typed
// HealthResponse on the wire.
activation: Default::default(),
// Per-model admission load is overlaid by the api handler
// from the candle harness (#53); the cache doesn't own it.
models: Vec::new(),
}),
has_gpus: RwLock::new(false),
}

View File

@@ -69,8 +69,22 @@ pub enum InferenceEvent {
},
/// The stream is complete. Carries the reason so wire formats
/// that use it (OpenAI's `finish_reason`, Anthropic's
/// `stop_reason`) can render it without re-parsing.
Finish { reason: FinishReason },
/// `stop_reason`) can render it without re-parsing — plus the token
/// counts, so the streaming projectors can emit a `usage` chunk
/// (clients like opencode track context / trigger compaction off
/// it; without it they show "0 tokens" and overflow the cap).
Finish {
reason: FinishReason,
prompt_tokens: u32,
completion_tokens: u32,
/// Tokens generated inside the reasoning span — a sub-count of
/// `completion_tokens` (OpenAI semantics; not added into
/// `total_tokens`). Streaming projectors surface this as
/// `completion_tokens_details.reasoning_tokens` (chat) /
/// `output_tokens_details.reasoning_tokens` (responses).
/// Zero for non-reasoning models.
reasoning_tokens: u32,
},
}
/// Why a stream stopped. Stays small on purpose — anything that
@@ -92,9 +106,9 @@ pub enum FinishReason {
/// Hit `max_tokens` before EOS.
Length,
/// Stopped because the model called a tool and is waiting for
/// the result. Not yet emitted by the candle harness —
/// reserved for the day tool-call extraction lands.
#[allow(dead_code)]
/// the result. Emitted by the streaming candle loops once a
/// `<tool_call>` block parses into a structured tool call, so
/// Anthropic clients receive `stop_reason: tool_use`.
ToolCalls,
}

View File

@@ -26,7 +26,7 @@
//! producer blocks on its own send. The bounded channels
//! propagate without us writing any logic.
use cortex_core::openai::{ChatCompletionChunk, ChunkChoice};
use cortex_core::openai::{ChatCompletionChunk, ChunkChoice, CompletionTokensDetails, Usage};
use serde_json::json;
use tokio::sync::mpsc;
@@ -188,8 +188,28 @@ pub fn project_chat_stream_with(
&id, created, &model_id, index, &call_id, &name, &arguments,
)]
}
InferenceEvent::Finish { reason } => {
vec![final_chunk(&id, created, &model_id, reason)]
InferenceEvent::Finish {
reason,
prompt_tokens,
completion_tokens,
reasoning_tokens,
} => {
// The finish_reason chunk, then an OpenAI-style
// usage-only chunk (`choices: []`, `usage` populated).
// Clients (opencode) read this to track context size;
// cortex's Anthropic translator also picks `usage` up
// for its `message_delta`.
vec![
final_chunk(&id, created, &model_id, reason),
usage_chunk(
&id,
created,
&model_id,
prompt_tokens,
completion_tokens,
reasoning_tokens,
),
]
}
};
for chunk in chunks {
@@ -301,6 +321,41 @@ fn final_chunk(
}
}
/// OpenAI-style trailing usage chunk: empty `choices`, populated
/// `usage`. Mirrors what `stream_options: {include_usage: true}`
/// produces. Emitted unconditionally — clients that don't read usage
/// ignore the empty-choices chunk; clients that do (opencode, and
/// cortex's Anthropic translator) get the token counts they need to
/// track context.
fn usage_chunk(
id: &str,
created: u64,
model_id: &str,
prompt_tokens: u32,
completion_tokens: u32,
reasoning_tokens: u32,
) -> ChatCompletionChunk {
ChatCompletionChunk {
id: id.into(),
object: "chat.completion.chunk".into(),
created,
model: model_id.into(),
choices: Vec::new(),
usage: Some(Usage {
prompt_tokens: prompt_tokens as u64,
completion_tokens: completion_tokens as u64,
// Widen before adding so the u32 sum cannot overflow.
total_tokens: prompt_tokens as u64 + completion_tokens as u64,
// Additive reasoning sub-count — omitted for non-reasoning
// models so older clients see unchanged JSON.
completion_tokens_details: (reasoning_tokens > 0).then_some(CompletionTokensDetails {
reasoning_tokens: reasoning_tokens as u64,
}),
prompt_tokens_details: None,
}),
extra: serde_json::Value::Object(Default::default()),
}
}
#[cfg(test)]
mod tests {
use super::*;
@@ -323,7 +378,7 @@ mod tests {
}
#[tokio::test]
async fn start_text_finish_produces_three_chunks() {
async fn start_text_finish_produces_role_content_finish_and_usage() {
let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
let out_rx = project_chat_stream(rx, "id-1".into(), 1700, "m".into());
@@ -333,16 +388,22 @@ mod tests {
.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 0,
completion_tokens: 0,
reasoning_tokens: 0,
})
.await
.unwrap();
drop(tx);
let out = collect(out_rx).await;
assert_eq!(out.len(), 3);
assert_eq!(out.len(), 4); // role, content, finish, usage
assert_eq!(out[0].choices[0].delta["role"], "assistant");
assert_eq!(out[1].choices[0].delta["content"], "hello");
assert_eq!(out[2].choices[0].finish_reason.as_deref(), Some("stop"));
// Trailing usage-only chunk: empty choices, usage populated.
assert!(out[3].choices.is_empty());
assert!(out[3].usage.is_some());
// Every chunk carries the stamped metadata.
for chunk in &out {
assert_eq!(chunk.id, "id-1");
@@ -370,13 +431,17 @@ mod tests {
let out_rx = project_chat_stream(rx, "id".into(), 1, "m".into());
tx.send(InferenceEvent::Finish {
reason: FinishReason::Length,
prompt_tokens: 0,
completion_tokens: 0,
reasoning_tokens: 0,
})
.await
.unwrap();
drop(tx);
let out = collect(out_rx).await;
assert_eq!(out.len(), 1);
assert_eq!(out.len(), 2); // finish, usage
assert_eq!(out[0].choices[0].finish_reason.as_deref(), Some("length"));
assert!(out[1].usage.is_some(), "usage chunk emitted after finish");
}
#[tokio::test]
@@ -428,6 +493,9 @@ mod tests {
.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 0,
completion_tokens: 0,
reasoning_tokens: 0,
})
.await
.unwrap();
@@ -437,14 +505,19 @@ mod tests {
// → close marker → visible answer → final chunk.
let contents: Vec<&str> = out
.iter()
.filter_map(|c| c.choices[0].delta["content"].as_str())
.filter_map(|c| {
c.choices
.first()
.and_then(|ch| ch.delta["content"].as_str())
})
.collect();
assert_eq!(
contents,
vec!["<think>", "first ", "second", "</think>", "answer"]
);
assert_eq!(
out.last().unwrap().choices[0].finish_reason.as_deref(),
out.iter()
.find_map(|c| c.choices.first().and_then(|ch| ch.finish_reason.as_deref())),
Some("stop")
);
}
@@ -471,6 +544,9 @@ mod tests {
.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Length,
prompt_tokens: 0,
completion_tokens: 0,
reasoning_tokens: 0,
})
.await
.unwrap();
@@ -478,11 +554,16 @@ mod tests {
let out = collect(out_rx).await;
let contents: Vec<&str> = out
.iter()
.filter_map(|c| c.choices[0].delta["content"].as_str())
.filter_map(|c| {
c.choices
.first()
.and_then(|ch| ch.delta["content"].as_str())
})
.collect();
assert_eq!(contents, vec!["<think>", "thinking...", "</think>"]);
assert_eq!(
out.last().unwrap().choices[0].finish_reason.as_deref(),
out.iter()
.find_map(|c| c.choices.first().and_then(|ch| ch.finish_reason.as_deref())),
Some("length")
);
}
@@ -508,6 +589,9 @@ mod tests {
.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 0,
completion_tokens: 0,
reasoning_tokens: 0,
})
.await
.unwrap();
@@ -515,7 +599,11 @@ mod tests {
let out = collect(out_rx).await;
let contents: Vec<&str> = out
.iter()
.filter_map(|c| c.choices[0].delta["content"].as_str())
.filter_map(|c| {
c.choices
.first()
.and_then(|ch| ch.delta["content"].as_str())
})
.collect();
assert_eq!(contents, vec!["raw"]);
}
@@ -544,6 +632,9 @@ mod tests {
.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 0,
completion_tokens: 0,
reasoning_tokens: 0,
})
.await
.unwrap();
@@ -551,8 +642,74 @@ mod tests {
let out = collect(out_rx).await;
let contents: Vec<&str> = out
.iter()
.filter_map(|c| c.choices[0].delta["content"].as_str())
.filter_map(|c| {
c.choices
.first()
.and_then(|ch| ch.delta["content"].as_str())
})
.collect();
assert_eq!(contents, vec!["visible"]);
}
#[tokio::test]
async fn finish_emits_a_usage_chunk() {
let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
let out_rx = project_chat_stream(rx, "id".into(), 1, "m".into());
tx.send(InferenceEvent::TextDelta("hello".into()))
.await
.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 42,
completion_tokens: 5,
reasoning_tokens: 2,
})
.await
.unwrap();
drop(tx);
let out = collect(out_rx).await;
// Last chunk is usage-only: empty choices, populated usage.
let last = out.last().unwrap();
assert!(last.choices.is_empty(), "usage chunk has no choices");
let u = last.usage.as_ref().expect("usage present on final chunk");
assert_eq!(u.prompt_tokens, 42);
assert_eq!(u.completion_tokens, 5);
assert_eq!(u.total_tokens, 47);
// reasoning_tokens is a sub-count of completion_tokens: reported
// in the detail object, never added into total_tokens.
let d = u
.completion_tokens_details
.as_ref()
.expect("reasoning detail present");
assert_eq!(d.reasoning_tokens, 2);
assert!(d.reasoning_tokens <= u.completion_tokens);
}
#[tokio::test]
async fn non_reasoning_finish_omits_usage_details() {
// Back-compat: with reasoning_tokens == 0 the additive detail
// object is omitted, so older clients see unchanged JSON.
let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
let out_rx = project_chat_stream(rx, "id".into(), 1, "m".into());
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 10,
completion_tokens: 7,
reasoning_tokens: 0,
})
.await
.unwrap();
drop(tx);
let out = collect(out_rx).await;
let u = out
.last()
.unwrap()
.usage
.as_ref()
.expect("usage present on final chunk");
assert!(u.completion_tokens_details.is_none());
// And it serialises without the detail key at all.
let json = serde_json::to_value(u).unwrap();
assert!(json.get("completion_tokens_details").is_none());
assert!(json.get("prompt_tokens_details").is_none());
}
}
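The usage accounting that the projector tests check can be sketched with plain structs. `UsageSketch` and `usage_from_counts` are hypothetical stand-ins for the `cortex_core` types, which are not shown in this diff. The sketch widens each count to `u64` before adding, which is one way to keep the total from wrapping on very large `u32` inputs.

```rust
/// Hypothetical stand-in for the wire-level usage object.
#[derive(Debug, PartialEq)]
struct UsageSketch {
    prompt_tokens: u64,
    completion_tokens: u64,
    total_tokens: u64,
    /// `None` for non-reasoning models, so the key can be omitted on the wire.
    reasoning_tokens: Option<u64>,
}

fn usage_from_counts(prompt: u32, completion: u32, reasoning: u32) -> UsageSketch {
    UsageSketch {
        prompt_tokens: u64::from(prompt),
        completion_tokens: u64::from(completion),
        // Reasoning tokens are already counted inside `completion`, so they
        // are not added to the total a second time.
        total_tokens: u64::from(prompt) + u64::from(completion),
        reasoning_tokens: (reasoning > 0).then_some(u64::from(reasoning)),
    }
}
```

For 42 prompt tokens and 5 completion tokens of which 2 are reasoning, the total is 47 and the reasoning detail carries 2; with zero reasoning tokens the detail is absent.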

View File

@@ -29,9 +29,9 @@
use cortex_core::openai::{ChatCompletionRequest, ChatMessage, MessageContent};
use cortex_core::responses::{
ResponsesContentPart, ResponsesInput, ResponsesInputItem, ResponsesMessageContent,
ResponsesOutputContent, ResponsesOutputItem, ResponsesRequest, ResponsesResponse,
ResponsesUsage, events,
OutputTokensDetails, ResponsesContentPart, ResponsesInput, ResponsesInputItem,
ResponsesMessageContent, ResponsesOutputContent, ResponsesOutputItem, ResponsesRequest,
ResponsesResponse, ResponsesUsage, events,
};
use serde_json::{Value, json};
use tokio::sync::mpsc;
@@ -263,6 +263,7 @@ async fn run_projection(
) {
let mut accumulated = String::new();
let mut finish: Option<FinishReason> = None;
let mut usage: Option<ResponsesUsage> = None;
let mut emitted_start = false;
while let Some(event) = rx.recv().await {
@@ -303,8 +304,26 @@ async fn run_projection(
// projector handles tool calls. Future work
// tracked in #7 alongside the in_progress event.
}
InferenceEvent::Finish { reason } => {
InferenceEvent::Finish {
reason,
prompt_tokens,
completion_tokens,
reasoning_tokens,
} => {
finish = Some(reason);
// Surface usage on the streaming `response.completed`
// frame — clients (opencode) track context/spend off it.
// reasoning_tokens is an additive sub-count of
// output_tokens (omitted for non-reasoning models).
usage = Some(ResponsesUsage {
input_tokens: prompt_tokens as u64,
output_tokens: completion_tokens as u64,
total_tokens: (prompt_tokens + completion_tokens) as u64,
output_tokens_details: (reasoning_tokens > 0).then_some(OutputTokensDetails {
reasoning_tokens: reasoning_tokens as u64,
}),
input_tokens_details: None,
});
}
}
}
@@ -317,7 +336,7 @@ async fn run_projection(
}
let reason = finish.unwrap_or(FinishReason::Stop);
let _ = emit_finish_frames(&tx, &meta, &accumulated, reason).await;
let _ = emit_finish_frames(&tx, &meta, &accumulated, reason, usage.as_ref()).await;
}
async fn emit_start_frames(tx: &mpsc::Sender<ResponseStreamFrame>, meta: &ResponseMeta) -> bool {
@@ -370,6 +389,7 @@ async fn emit_finish_frames(
meta: &ResponseMeta,
full_text: &str,
reason: FinishReason,
usage: Option<&ResponsesUsage>,
) -> bool {
let status = finish_to_status(reason);
let full_part = json!({
@@ -413,7 +433,7 @@ async fn emit_finish_frames(
ResponseStreamFrame {
event_name: events::COMPLETED,
data: json!({
"response": response_shell(meta, status, &[full_item], None)
"response": response_shell(meta, status, &[full_item], usage)
}),
},
];
@@ -439,14 +459,25 @@ fn response_shell(
obj.insert("model".into(), Value::String(meta.model_id.clone()));
obj.insert("output".into(), Value::Array(output.to_vec()));
if let Some(u) = usage {
obj.insert(
"usage".into(),
json!({
"input_tokens": u.input_tokens,
"output_tokens": u.output_tokens,
"total_tokens": u.total_tokens,
}),
);
let mut usage_obj = serde_json::Map::new();
usage_obj.insert("input_tokens".into(), json!(u.input_tokens));
usage_obj.insert("output_tokens".into(), json!(u.output_tokens));
usage_obj.insert("total_tokens".into(), json!(u.total_tokens));
// Additive detail objects — only emitted when populated, so
// older clients see the unchanged three-field usage shape.
if let Some(d) = &u.output_tokens_details {
usage_obj.insert(
"output_tokens_details".into(),
json!({ "reasoning_tokens": d.reasoning_tokens }),
);
}
if let Some(d) = &u.input_tokens_details {
usage_obj.insert(
"input_tokens_details".into(),
json!({ "cached_tokens": d.cached_tokens }),
);
}
obj.insert("usage".into(), Value::Object(usage_obj));
}
Value::Object(obj)
}
@@ -772,6 +803,9 @@ mod tests {
.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 0,
completion_tokens: 0,
reasoning_tokens: 0,
})
.await
.unwrap();
@@ -812,6 +846,60 @@ mod tests {
assert_eq!(output[0]["content"][0]["text"], "hello");
}
#[tokio::test]
async fn completed_frame_carries_usage_with_reasoning_detail() {
let (tx, rx) = mpsc::channel::<InferenceEvent>(8);
let out = project_responses_stream(rx, meta());
tx.send(InferenceEvent::Start).await.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 30,
completion_tokens: 12,
reasoning_tokens: 4,
})
.await
.unwrap();
drop(tx);
let frames = collect(out).await;
let completed = frames
.iter()
.find(|f| f.event_name == events::COMPLETED)
.unwrap();
let usage = &completed.data["response"]["usage"];
assert_eq!(usage["input_tokens"], 30);
assert_eq!(usage["output_tokens"], 12);
// reasoning_tokens is a sub-count of output_tokens, not summed
// into total_tokens.
assert_eq!(usage["total_tokens"], 42);
assert_eq!(usage["output_tokens_details"]["reasoning_tokens"], 4);
// Deferred cache detail is absent until #11.
assert!(usage.get("input_tokens_details").is_none());
}
#[tokio::test]
async fn completed_frame_omits_reasoning_detail_for_non_reasoning() {
let (tx, rx) = mpsc::channel::<InferenceEvent>(8);
let out = project_responses_stream(rx, meta());
tx.send(InferenceEvent::Start).await.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 8,
completion_tokens: 3,
reasoning_tokens: 0,
})
.await
.unwrap();
drop(tx);
let frames = collect(out).await;
let completed = frames
.iter()
.find(|f| f.event_name == events::COMPLETED)
.unwrap();
let usage = &completed.data["response"]["usage"];
assert_eq!(usage["output_tokens"], 3);
assert!(usage.get("output_tokens_details").is_none());
}
#[tokio::test]
async fn length_finish_maps_to_incomplete_status() {
let (tx, rx) = mpsc::channel::<InferenceEvent>(8);
@@ -819,6 +907,9 @@ mod tests {
tx.send(InferenceEvent::Start).await.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Length,
prompt_tokens: 0,
completion_tokens: 0,
reasoning_tokens: 0,
})
.await
.unwrap();
@@ -862,6 +953,9 @@ mod tests {
.unwrap();
tx.send(InferenceEvent::Finish {
reason: FinishReason::Stop,
prompt_tokens: 0,
completion_tokens: 0,
reasoning_tokens: 0,
})
.await
.unwrap();
@@ -886,6 +980,8 @@ mod tests {
input_tokens: 5,
output_tokens: 1,
total_tokens: 6,
output_tokens_details: None,
input_tokens_details: None,
}),
);
assert_eq!(r.status, "completed");


@@ -51,6 +51,7 @@ fn fake_discovery() -> DiscoveryResponse {
],
harnesses: vec![],
cuda_unavailable_reason: None,
max_prompt_tokens: 16384,
}
}
@@ -113,6 +114,12 @@ async fn test_health_endpoint() {
let body: serde_json::Value = resp.json().await.unwrap();
assert_eq!(body["uptime_secs"], 0);
// Per-model admission load (#53) is always present, even with no models
// loaded (empty array) — cortex's load-aware router (#55) relies on it.
assert!(
body["models"].is_array(),
"/health must expose a models load array"
);
}
#[tokio::test]
@@ -126,6 +133,7 @@ async fn test_discovery_no_gpus() {
devices: vec![],
harnesses: vec![],
cuda_unavailable_reason: None,
max_prompt_tokens: 16384,
};
let url = spawn_neuron(disc).await;
@@ -529,6 +537,7 @@ async fn test_driver_mismatch_rejects_load_and_rides_discovery() {
devices: vec![],
harnesses: vec!["candle".into()],
cuda_unavailable_reason: Some(reason.into()),
max_prompt_tokens: 16384,
};
let url = spawn_neuron(disc).await;
let client = reqwest::Client::new();


@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="utf-8"?>
<service>
<short>helexa-bench</short>
<description>helexa-bench — read-only benchmark data API</description>
<port protocol="tcp" port="13132"/>
</service>

doc/context-limits.md Normal file

@@ -0,0 +1,198 @@
# Context-window & token-limit settings
How the numeric knobs that govern usable context fit together, what the
valid ranges are, and where they live. Getting these out of sync is the
difference between "the agent has room to think" and "it compacts every
few turns and reasons from a corrupted summary."
The tables below document the **manual** reasoning that
[#62](https://git.lair.cafe/helexa/helexa/issues/62) and then
[#67](https://git.lair.cafe/helexa/helexa/issues/67) automate. As of #67
the neuron **computes** this limit itself from model architecture + live
VRAM + a self-measured throughput ceiling and advertises it on
`GET /models`; operators no longer hand-derive it. Read the rules below
as the *why* behind the derivation — see
[After #67](#after-67-the-neuron-computes-its-own-limit) for what the
daemon now does automatically.
## The knobs
| Knob | Where | What it bounds |
|---|---|---|
| `max_position_embeddings` | model `config.json` (fixed per model) | the model's native context ceiling — quality wall |
| `NEURON_MAX_PROMPT_TOKENS` | neuron systemd drop-in (env) | hard **prompt** cap; neuron rejects larger prompts with `400 context_length_exceeded` before any device work |
| `NEURON_MIN_FREE_VRAM_MB` | neuron systemd drop-in (env, default 1500) | static free-VRAM floor below which prefill is refused (`503 service_unavailable` / `InsufficientVram`) |
| request `max_tokens` | per request; neuron default 8192 | generation length; KV grows by prompt **+** generation |
| `limit.context` | `opencode.json` `provider.models.<id>.limit` | the wall opencode tracks for compaction % |
| `limit.input` | same | compaction trigger — opencode compacts to keep the prompt at/under this |
| `limit.output` | same | generation reserve opencode leaves below the wall |
## How they must relate
For a single model on a single neuron, all of these must hold:
```
1. limit.input + limit.output ≤ limit.context (opencode internal; convention: input = context - output)
2. limit.context ≤ max_position_embeddings (model quality wall)
3. limit.input ≤ NEURON_MAX_PROMPT_TOKENS (else neuron 400s a prompt opencode thought was fine)
4. NEURON_MAX_PROMPT_TOKENS + max_tokens ≤ max_position_embeddings
5. KV(limit.context)/card + activation + NEURON_MIN_FREE_VRAM_MB ≤ free VRAM on the tightest card
```
Notes:
- **Keep a margin on rule 3.** Set `NEURON_MAX_PROMPT_TOKENS` a bit above
`limit.input` (e.g. one `output`-worth) so opencode↔neuron tokenizer
counting differences don't trip a spurious 400 mid-session.
- **Convention:** mirror `limit.context` to `NEURON_MAX_PROMPT_TOKENS`
and set `limit.input = context - output`. opencode then compacts to
keep the prompt one `output` below the neuron wall — there is always
generation headroom under the cap.
- **Rule 5 is the one with teeth at scale.** Today only the *static*
floor (`NEURON_MIN_FREE_VRAM_MB`) guards the text path; it does **not**
scale with prompt length. A long-but-under-cap prompt can clear the
floor and then OOM mid-prefill (poisoning the device context). Tracked
in [#65](https://git.lair.cafe/helexa/helexa/issues/65) — until that
lands, treat the VRAM-safe ceiling in rule 5 as a hard limit you set
`NEURON_MAX_PROMPT_TOKENS` below, not something the daemon enforces.
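The five rules are easy to check mechanically before touching a config. The following is only a sketch: the `Knobs` struct, `violated_rules`, and `rule5_holds` are hypothetical names that do not exist in the repo, and the activation figure for rule 5 is left to the caller because this document does not give one.

```rust
/// Hypothetical bundle of the knobs from the table above.
/// Field names are illustrative, not taken from the neuron source.
#[derive(Debug, Clone, Copy)]
pub struct Knobs {
    pub max_position_embeddings: u64,
    pub neuron_max_prompt_tokens: u64,
    pub max_tokens: u64,
    pub limit_context: u64,
    pub limit_input: u64,
    pub limit_output: u64,
}

/// Returns the numbers (1 to 4) of the token-count rules that are violated.
pub fn violated_rules(k: &Knobs) -> Vec<u8> {
    let mut bad = Vec::new();
    // Rule 1: opencode's input + output must fit inside its context.
    if k.limit_input + k.limit_output > k.limit_context {
        bad.push(1);
    }
    // Rule 2: context may not exceed the model's native ceiling.
    if k.limit_context > k.max_position_embeddings {
        bad.push(2);
    }
    // Rule 3: opencode must compact before the neuron would reject.
    if k.limit_input > k.neuron_max_prompt_tokens {
        bad.push(3);
    }
    // Rule 4: the largest prompt plus generation must fit the model.
    if k.neuron_max_prompt_tokens + k.max_tokens > k.max_position_embeddings {
        bad.push(4);
    }
    bad
}

/// Rule 5, in bytes: KV at full context + activation + floor must fit in
/// the free VRAM of the tightest card. `activation_bytes` is an assumed,
/// caller-supplied estimate.
pub fn rule5_holds(
    kv_bytes_per_token_per_card: u64,
    limit_context: u64,
    activation_bytes: u64,
    min_free_floor_bytes: u64,
    free_bytes_tightest_card: u64,
) -> bool {
    let kv = kv_bytes_per_token_per_card * limit_context;
    kv + activation_bytes + min_free_floor_bytes <= free_bytes_tightest_card
}
```

Fed the 128k profile from this document (context 131072, input 122880, output 8192, cap 131072, `max_tokens` 8192, model max 262144), `violated_rules` returns an empty list. Raising `limit_input` to 140000 trips rules 1 and 3.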
## VRAM cost of context (Qwen3.6-27B on beast)
Qwen3.6-27B is a hybrid linear-attention model: of its 64 layers only
every 4th is full-attention (`full_attention_interval = 4`, so **16**
full-attn layers); the rest are `linear_attention` with constant-size
recurrent state. KV cache grows **only** on the 16 full-attn layers
(GQA, `num_key_value_heads = 4`, `head_dim = 256`, F16):
```
kv_per_token (total) = 2 (K+V) × 16 layers × 4 kv_heads × 256 head_dim × 2 B = 65536 B = 64 KiB/token
kv_per_token (per card, TP=2) = 32 KiB/token
```
beast = 2× RTX 5090 (32607 MiB each). KV per card and headroom against
the **measured idle free of the tighter card (GPU 1: 9254 MiB)**:
| `limit.context` | KV / card | Free left on GPU 1 (after KV) | Verdict |
|---|---|---|---|
| 49152 (≈49k, prior default) | ~1.5 GiB | ~7.5 GiB (7718 MiB) | very safe |
| **131072 (128k, recommended)** | **~4.0 GiB** | **~5.0 GiB (5158 MiB)** | **safe** |
| 196608 (192k, stretch) | ~6.0 GiB | ~3.0 GiB (3110 MiB) | plausible; wants #65 guard |
| 262144 (256k, model max) | ~8.0 GiB | ~1.0 GiB (1062 MiB) | unsafe at current free (under the 1500 MiB floor) |
`ConcatKvCache` is lazy — it allocates nothing at idle and resets between
requests — so raising the cap costs zero until a session actually uses
the longer window. The numbers above are upper bounds at *measured idle
free*; real usable headroom is lower under fragmentation and whatever
else is resident. Leave margin.
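The arithmetic behind the table can be reproduced in a few lines. This is a sketch under the figures stated above (16 full-attention layers, 4 KV heads, `head_dim` 256, 2-byte F16, TP=2); the `AttnShape` type and function names are hypothetical.

```rust
/// Attention shape that determines KV growth. Only the full-attention
/// layers count; linear-attention layers hold constant-size state.
/// (Hypothetical type, not from the neuron source.)
#[derive(Debug, Clone, Copy)]
pub struct AttnShape {
    pub n_full_attn_layers: u64,
    pub n_kv_heads: u64,
    pub head_dim: u64,
    pub dtype_bytes: u64,
}

/// Bytes of KV cache added per token across all cards: 2 covers K and V.
pub fn kv_bytes_per_token(s: &AttnShape) -> u64 {
    2 * s.n_full_attn_layers * s.n_kv_heads * s.head_dim * s.dtype_bytes
}

/// Bytes of KV cache added per token on one card under tensor parallelism.
pub fn kv_bytes_per_token_per_card(s: &AttnShape, tp: u64) -> u64 {
    kv_bytes_per_token(s) / tp
}

/// KV footprint per card, in MiB, for a session that fills `context_tokens`.
pub fn kv_mib_per_card(s: &AttnShape, tp: u64, context_tokens: u64) -> u64 {
    kv_bytes_per_token_per_card(s, tp) * context_tokens / (1024 * 1024)
}
```

For the Qwen3.6-27B shape this gives 65536 B per token in total, 32768 B per card, and 4096 MiB of KV per card at 131072 tokens. Subtracting that from the measured 9254 MiB idle free leaves 5158 MiB.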
Reaching 256k (or running concurrent long sessions) needs more free VRAM
than this load leaves, which in practice means KV quantization or a
fixed/paged KV allocator. Neither is required for 128k.
## Recommended profile: 128k
**neuron**: `NEURON_MAX_PROMPT_TOKENS` is **deploy-managed**, not
hand-edited. It lives in the `deploy-neurons` matrix in
`.gitea/workflows/deploy.yml` (`max_prompt_tokens` per host) and is
written to `/etc/systemd/system/neuron.service.d/model.conf` on each run.
A change to that value restarts the neuron **even when no new RPM ships**
(the deploy gates on package version *or* drop-in change), so the cap
rolls out alongside the rest of the service config. To change it, edit
the matrix value and let the deploy apply it:
```yaml
# .gitea/workflows/deploy.yml → jobs.deploy-neurons.strategy.matrix.include
- host: beast.hanzalova.internal
flavour: blackwell
load_timeout: 900
max_prompt_tokens: 131072
```
The drop-in it writes:
```ini
# /etc/systemd/system/neuron.service.d/model.conf (managed by deploy.yml)
[Service]
Environment=NEURON_MAX_PROMPT_TOKENS=131072
```
Verify after a deploy:
```sh
curl -s http://beast:13131/discovery | jq .max_prompt_tokens # expect 131072
```
`model.conf` sorts after any manual `local.conf`, so the deploy-managed
value wins over a hand override of the same variable. Use `local.conf`
only for genuinely host-local, transient experiments — and remember a
later deploy will re-assert `model.conf`.
**opencode**: `opencode.json`, `provider.models."Qwen/Qwen3.6-27B".limit`:
```json
{ "context": 131072, "input": 122880, "output": 8192 }
```
(`input = context - output = 131072 - 8192`; `NEURON_MAX_PROMPT_TOKENS`
131072 sits one `output` above `input`, the tokenizer-drift margin.)
## After #62: single source of truth (superseded by #67)
[#62](https://git.lair.cafe/helexa/helexa/issues/62) moved `limit
{ context, input, output }` (and `cost`) onto `GET /models`, sourced from
the operator-declared catalogue (`models.toml`). That was the right
plumbing but the wrong *source*: a per-model catalogue limit goes stale
the moment cortex hot-swaps a neuron's resident model, and forces the
hand-tuning fight (the tables above) to be re-run on every change.
## After #67: the neuron computes its own limit
[#67](https://git.lair.cafe/helexa/helexa/issues/67) makes the limit a
**computed function of live state**, not an operator-declared fact. Per
loaded model, the neuron derives:
```
output = output_reserve_tokens (config; default 8192)
kv/token/card = 2(K+V) · n_full_attn_layers · (n_kv_heads / tp) · head_dim · dtype_bytes
vram_ceiling = (free_tightest - activation_headroom - min_free_floor) / kv_per_token_per_card
throughput_ceiling = target_prefill_latency_secs · measured_prefill_tok_per_sec
context = min(max_position_embeddings, vram_ceiling, throughput_ceiling)
clamped by NEURON_MAX_PROMPT_TOKENS only if explicitly set (backstop)
input = context - output
```
- `free_tightest` is the **minimum free VRAM across the model's
devices** — the tightest card, often a non-leader TP rank.
- `measured_prefill_tok_per_sec` is **self-measured** (an EMA over real
requests; a configured bootstrap until the first sample). Because it
reads live state, the advertised `limit` **rises automatically** as
prefix caching (#11) or other efficiency work frees VRAM / speeds
prefill — no operator action.
- Knobs live in `[harness.candle.context_limit]` (see
`neuron.example.toml`). The catalogue `limit` is **no longer
consulted** (the field is inert/deprecated); `cost` stays
operator-set in the catalogue.
- **opencode**: remove any hand-entered `limit` block from
`opencode.json` — discovery is authoritative.
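The derivation above can be sketched as a pure function. Everything here is an assumption made for illustration: the `LimitInputs` and `Limit` types are hypothetical, the clamp is applied to `context` (one reading of the formula), and subtraction saturates at zero so a VRAM-starved host yields a zero budget rather than an underflow. The neuron's actual code may differ on all three points.

```rust
/// Inputs to the derivation. Hypothetical type; field names follow the
/// formula in this document, not the neuron source.
#[derive(Debug, Clone, Copy)]
pub struct LimitInputs {
    pub max_position_embeddings: u64,
    pub free_tightest_bytes: u64,
    pub activation_headroom_bytes: u64,
    pub min_free_floor_bytes: u64,
    pub kv_bytes_per_token_per_card: u64,
    pub target_prefill_latency_secs: f64,
    pub measured_prefill_tok_per_sec: f64,
    pub output_reserve_tokens: u64,
    /// `Some` only when NEURON_MAX_PROMPT_TOKENS is explicitly set.
    pub max_prompt_tokens_clamp: Option<u64>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Limit {
    pub context: u64,
    pub input: u64,
    pub output: u64,
}

pub fn derive_limit(i: &LimitInputs) -> Limit {
    // VRAM ceiling: tokens of KV that fit in what is left on the tightest card.
    let usable = i
        .free_tightest_bytes
        .saturating_sub(i.activation_headroom_bytes)
        .saturating_sub(i.min_free_floor_bytes);
    let vram_ceiling = usable / i.kv_bytes_per_token_per_card.max(1);

    // Throughput ceiling: tokens that prefill within the latency target.
    // A float-to-int `as` cast saturates (and maps NaN to 0) in Rust.
    let throughput_ceiling =
        (i.target_prefill_latency_secs * i.measured_prefill_tok_per_sec).max(0.0) as u64;

    let mut context = i
        .max_position_embeddings
        .min(vram_ceiling)
        .min(throughput_ceiling);
    // Optional backstop: clamp only when the operator set the env var.
    if let Some(cap) = i.max_prompt_tokens_clamp {
        context = context.min(cap);
    }

    let output = i.output_reserve_tokens;
    Limit { context, input: context.saturating_sub(output), output }
}
```

As a worked case, take 9254 MiB free on the tightest card, the 1500 MiB floor, 32 KiB/token/card, and an invented 1024 MiB activation headroom: the VRAM ceiling is 6730 MiB × 32 tokens/MiB = 215360 tokens. With a throughput ceiling above that and no clamp, `context` is 215360 and `input` is 207168. Setting the clamp to 131072 reproduces the 128k profile (`input` 122880).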
`NEURON_MAX_PROMPT_TOKENS` is demoted from authority to an **optional
clamp-only backstop** (applied only when explicitly set). The
deploy-managed drop-in still pins a per-host ceiling, but the derivation
binds below it in practice.
The request path **enforces the derived cap**: a prompt is rejected with
`PromptTooLong` when it exceeds the model's computed `input` budget
(refreshed on every `/models` poll), not the static
`NEURON_MAX_PROMPT_TOKENS` — so a VRAM-tight host rejects an over-budget
prompt up front instead of OOMing mid-prefill. Before the first
derivation (or for an arch without a context profile) it falls back to
the static cap.
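A minimal sketch of that admission check, assuming the derived budget is held as an `Option` that stays `None` until the first derivation (or for an arch without a context profile). The `AdmitError` enum and `check_prompt` function are hypothetical; only the `PromptTooLong` name comes from this document.

```rust
/// Hypothetical error type; only the `PromptTooLong` name is from the doc.
#[derive(Debug, PartialEq, Eq)]
pub enum AdmitError {
    PromptTooLong { prompt_tokens: u64, limit: u64 },
}

/// Rejects a prompt that exceeds the derived `input` budget, falling back
/// to the static cap when no derived budget is available yet.
pub fn check_prompt(
    prompt_tokens: u64,
    derived_input: Option<u64>,
    static_cap: u64,
) -> Result<(), AdmitError> {
    let limit = derived_input.unwrap_or(static_cap);
    if prompt_tokens > limit {
        Err(AdmitError::PromptTooLong { prompt_tokens, limit })
    } else {
        Ok(())
    }
}
```

With a derived budget of 122880 and a static cap of 131072, a 125000-token prompt is rejected even though it is under the static cap. The same prompt is admitted when the derived budget is still `None`.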
## Operational note
`GET /discovery` reports the live `max_prompt_tokens` the running neuron
process actually uses — check it rather than assuming the drop-in took
effect. A drop-in change only applies after `daemon-reload` + a neuron
restart, which the deploy performs; if `/discovery` doesn't match the
`max_prompt_tokens` in the deploy matrix, the host hasn't been
re-deployed since the value changed (or a higher-sorting drop-in is
overriding it). Re-run the `deploy` workflow to reconcile.


@@ -26,6 +26,12 @@ db_path = "/var/lib/helexa-bench/bench.sqlite"
prompt_sizes = [128, 4096]
max_tokens = 256
# Read-only JSON API (consumed by the bench UI + programmatic access),
# served alongside the sweep loop by `run` (or standalone via `serve`).
[api]
enabled = true
listen = "0.0.0.0:13132"
# One [[targets]] block per neuron on the fleet. `kind = "neuron"` (the
# default) gets build metadata via GET /version and warm-model discovery
# via GET /models.

Some files were not shown because too many files have changed in this diff.